TheorIA Dataset

Our Aspiration: Create a high-quality, structured dataset of all physics results that is complete, exact, and open source for the global physics community.

🚀 Looking for Contributors!

If you want to contribute to an open-source project at the intersection of physics and AI, this is your place!

Browse All Entries
GitHub Repository
FAQ

Why TheorIA?

There is no structured physics dataset that systematically captures all physics results. TheorIA aims to serve both as a useful reference for researchers and as training data for next-generation AI models and LLMs in physics.

Current Status: Many entries are AI-generated and still require expert curation and validation to meet our quality standards. We are also working on improving the structure of the dataset.

We Need You: We're seeking physicists to review and curate entries. Your expertise ensures accuracy, and your name will be permanently associated with every entry you review, building your scientific legacy.

What's Inside?

📋 Structured Entries

Each entry in the dataset is a physics result that has either been validated by a physicist or was generated by AI and is waiting for a specialist to improve it.

🧮 Formal Derivations

Step-by-step derivations written in AsciiMath, annotated for easier understanding and structured for programmatic formalization to guarantee correctness.

📁 One JSON per Entry

Individual JSON files in the entries/ folder: one file per physics result, enabling parallel contributions, clean version control, and conflict-free collaboration.

🧠 Centralized Assumptions

Standardized database of physics assumptions ensuring consistent terminology across all entries.
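
As an illustration of what a centralized assumption store could mean in practice, here is a minimal sketch in which entries refer to assumptions by identifier and the text is looked up in one shared table. The field names (`assumptions`, `result_name`), the identifiers, and the wording are all hypothetical and are not taken from the TheorIA schema.

```python
# Hypothetical central store: assumption id -> canonical text.
# Ids and wording are illustrative only, not copied from the TheorIA repository.
ASSUMPTIONS = {
    "point_mass": "The body is treated as a point mass.",
    "no_friction": "Dissipative forces such as friction are neglected.",
}


def resolve_assumptions(entry, store):
    """Return a copy of `entry` with assumption ids replaced by their text."""
    ids = entry.get("assumptions", [])
    missing = [a for a in ids if a not in store]
    if missing:
        # Fail loudly so a typo in an id cannot silently drop an assumption.
        raise KeyError(f"unknown assumption ids: {missing}")
    resolved = dict(entry)  # shallow copy; the original entry is left untouched
    resolved["assumptions"] = [store[a] for a in ids]
    return resolved


# Hypothetical entry that references assumptions by id.
entry = {"result_name": "Free fall", "assumptions": ["point_mass", "no_friction"]}
resolved = resolve_assumptions(entry, ASSUMPTIONS)
```

Because every entry points at the same table, rewording an assumption in one place changes it everywhere, which is the consistency benefit described above.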

📚 Rich Context

Entries include unified assumptions, historical context, cross-entry dependencies, and domain classifications (arXiv-style tags) for comprehensive understanding.

📓 Interactive Notebooks

Every entry includes an automatically generated Jupyter notebook that opens directly in Google Colab for interactive exploration and verification of the physics derivations.

🌍 Open License

CC-BY 4.0: use it, remix it, teach with it.

How to Contribute

We are tiny right now, so every new entry counts! There are multiple ways to contribute:

🐙 GitHub Issues

Submit entries through structured issue forms

  • ✓ Direct integration
  • ✓ Community discussion
  • ✓ Automatic processing
Submit via GitHub
For Coders: Direct Development

1. Fork and Clone

Fork the repository on GitHub and clone it to your local machine.

2. Create a JSON Entry

Create a JSON file in the entries/ folder following the instructions in the CONTRIBUTING.md file and the schema in schemas/entry.schema.json.

3. Submit a Pull Request

CI will automatically validate your JSON against the schema. If it passes, we will review and merge it.
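
Before opening a pull request, a quick local pre-flight check can catch obvious problems. The sketch below uses only the Python standard library and checks one thing: that each entry file parses as a JSON object and contains the keys listed under the schema's top-level `required` keyword (a standard JSON Schema keyword). It is not the project's CI and does not perform full schema validation. It assumes, without confirmation from the source, that the schema lists its required keys at the top level. The function names are hypothetical.

```python
import json
from pathlib import Path


def missing_required_keys(entry_path, schema_path):
    """Return the top-level keys the schema requires but the entry lacks."""
    schema = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    entry = json.loads(Path(entry_path).read_text(encoding="utf-8"))
    if not isinstance(entry, dict):
        raise ValueError(f"{entry_path}: expected a JSON object at the top level")
    # Assumption: required keys are listed under the schema's top-level "required".
    return sorted(set(schema.get("required", [])) - set(entry))


def check_entries(entries_dir="entries", schema_path="schemas/entry.schema.json"):
    """Map each problematic file name to its list of missing required keys."""
    problems = {}
    for path in sorted(Path(entries_dir).glob("*.json")):
        missing = missing_required_keys(path, schema_path)
        if missing:
            problems[path.name] = missing
    return problems
```

Running `check_entries()` from the repository root returns an empty dictionary when no file is missing a required key. Type and format constraints are still left to the CI validation.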

Using TheorIA for Machine Learning

Generate a complete ML-ready dataset with resolved assumptions using our built-in script:

# Generate dataset with only reviewed entries (recommended)
python scripts/build_ml_dataset.py

# Include draft entries (for larger dataset)
python scripts/build_ml_dataset.py --include-drafts

This creates dataset.json with all entries, resolved assumption text, and metadata ready for your ML pipeline.
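
Once dataset.json exists, a first look at its contents might go as follows. The source does not document the file's internal layout, so the sketch assumes the top level is either a list of entries or an object with an `entries` list; both are guesses to adapt after inspecting the real file. The helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_entries(path="dataset.json"):
    """Load the generated dataset and return its entries as a list."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        return data
    # Assumption: if the top level is an object, entries live under "entries".
    if isinstance(data, dict) and isinstance(data.get("entries"), list):
        return data["entries"]
    raise ValueError("unrecognized dataset layout; inspect the file and adapt")


def key_coverage(entries):
    """Count how many entries carry each top-level key."""
    return Counter(key for e in entries if isinstance(e, dict) for key in e)
```

Calling `key_coverage(load_entries())` shows which fields are present in every entry and which are only sometimes filled in, which is useful to know before feeding the data to a training pipeline.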

License & Citation

Licensed under CC-BY 4.0. If you use the dataset, please cite it.

Contact

Issues, questions, or exciting physics ideas? Drop an issue on GitHub, and let's build this thing together.