Our Aspiration: Create a high-quality structured dataset containing all physics results. Complete, exact, and open-source for the global physics community.
Looking for Contributors! If you want to contribute to an open-source project at the intersection of physics and AI, this is your place!
There is a lack of structured physics datasets that systematically capture all physics results. TheorIA aims to serve both as a useful reference for researchers and as training data for next-generation AI models and LLMs in physics.
Current Status: We have many AI-generated entries that require expert curation and validation to meet our quality standards. We are also working on improving the structure of the dataset.
We Need You: We're seeking physicists to review and curate entries. Your expertise ensures accuracy, and your name will be permanently associated with every entry you review, building your scientific legacy.
Each entry in the dataset is a physics result that has either been validated by a physicist or was generated by AI and is waiting for a specialist to improve it.
Derivations written step by step in AsciiMath, annotated for easier understanding and programmatic formalization to guarantee correctness.
Individual JSON files under the entries/ folder: one file per physics result, enabling parallel contributions, clean version control, and conflict-free collaboration.
Standardized database of physics assumptions ensuring consistent terminology across all entries.
Entries include unified assumptions, historical context, cross-entry dependencies, and domain classifications (arXiv-style tags) for comprehensive understanding.
Every entry includes an automatically generated Jupyter notebook that opens directly in Google Colab for interactive exploration and verification of the physics derivations.
CC-BY 4.0: use it, remix it, teach with it.
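Because each physics result lives in its own JSON file under entries/, the whole collection can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of the project's tooling: the function name load_entries is hypothetical, and it assumes only that every entry file ends in .json and is run from the repository root.

```python
import json
from pathlib import Path


def load_entries(entries_dir="entries"):
    """Return {file stem: parsed JSON} for every *.json file in entries_dir."""
    entries = {}
    # Sort the paths so the result is deterministic across platforms.
    for path in sorted(Path(entries_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            entries[path.stem] = json.load(handle)
    return entries
```

Calling load_entries() from a local clone would return a dictionary keyed by file name without the .json extension; no assumption is made about the fields inside each entry.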
We are tiny right now, so every new entry counts! There are multiple ways to contribute:
User-friendly contribution system designed for scientists and researchers
Submit entries through structured issue forms
Fork the repository on GitHub and clone it to your local machine.
Create a JSON file in the entries/ folder following the instructions in the CONTRIBUTING.md file and the schema in schemas/entry.schema.json.
CI will automatically validate your JSON against the schema. If it passes, we will review and merge it.
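Before pushing, it can help to run a quick local sanity check. The sketch below is not the project's CI and does not perform full JSON Schema validation. It only confirms that a file parses as JSON and contains the keys listed in the schema's top-level "required" array. Whether schemas/entry.schema.json declares such an array is an assumption, as is the requirement that an entry be a JSON object; the helper names missing_required_keys and precheck_entry are hypothetical.

```python
import json
from pathlib import Path


def missing_required_keys(entry, schema):
    """List keys from the schema's top-level "required" array that the entry lacks."""
    # Assumption: the schema declares a top-level "required" list.
    # If it does not, nothing is checked and the result is empty.
    return [key for key in schema.get("required", []) if key not in entry]


def precheck_entry(entry_path, schema_path="schemas/entry.schema.json"):
    """Parse one entry file and report missing required keys (hypothetical helper)."""
    # json.loads raises json.JSONDecodeError if either file is not valid JSON.
    entry = json.loads(Path(entry_path).read_text(encoding="utf-8"))
    schema = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    if not isinstance(entry, dict):
        # Assumption: an entry is a JSON object at the top level.
        raise ValueError(f"{entry_path} must contain a JSON object at the top level")
    return missing_required_keys(entry, schema)
```

An empty list from precheck_entry means the file passed this lightweight check only; the CI run against the full schema remains the authority.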
Generate a complete ML-ready dataset with resolved assumptions using our built-in script:
# Generate dataset with only reviewed entries (recommended)
python scripts/build_ml_dataset.py
# Include draft entries (for larger dataset)
python scripts/build_ml_dataset.py --include-drafts
This creates dataset.json with all entries, resolved assumption text, and metadata ready for your ML pipeline.
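Once the script has run, the generated file can be inspected before feeding it to a pipeline. The sketch below assumes that dataset.json is written to the current working directory and makes no claim about its internal layout: it only reports the top-level JSON type and how many items it holds. The function name describe_dataset is hypothetical.

```python
import json
from pathlib import Path


def describe_dataset(path="dataset.json"):
    """Return (top-level type name, number of top-level items) for the generated file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumption: the top level is a JSON array or object; anything else counts as one item.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size
```

For example, kind, size = describe_dataset() gives a quick check that the build produced a non-empty file before any training code depends on it.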
Licensed under CC-BY 4.0. If you use the dataset, please cite it.
Issues, questions, or exciting physics ideas? Drop an issue on GitHub, and let's build this thing together.