TheorIA Dataset

A structured dataset of theoretical physics derivations that aims to be complete, exact, and open source for the global physics community.

Open Source Physics Knowledge
🚀 Looking for Contributors!

If you want to contribute to an open source project at the intersection of physics and AI, this is your place!

Why TheorIA?

There is a lack of structured physics datasets that systematically capture all physics results. TheorIA aims to serve both as a useful reference for researchers and as training data for next-generation AI models and LLMs in physics.

Current Status: Many entries were generated by AI and still require expert curation and validation to meet our quality standards. We are also working on improving the structure of the dataset.

We Need You: We're seeking physicists to review and curate entries. Your expertise ensures accuracy, and your name will be permanently associated with every entry you review, building your scientific legacy.

What's Inside?

📋 Structured Entries

Each entry in the dataset is a physics result, either validated by a physicist or generated by AI and awaiting a specialist to improve it.

🧮 Formal Derivations

Step-by-step derivations in AsciiMath, annotated for easier understanding and programmatically formalized to verify correctness.

📁 One JSON per Entry

Individual JSON files under the entries/ folder: one file per physics result, enabling parallel contributions, clean version control, and conflict-free collaboration.
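
As a rough sketch of what the one-file-per-entry layout makes possible, the snippet below loads every JSON file in a folder into a dictionary keyed by file name. The function name `load_entries` is hypothetical and not part of the repository; only the entries/ folder name comes from the description above.

```python
import json
from pathlib import Path


def load_entries(folder="entries"):
    """Load every *.json file in `folder` into a dict keyed by file stem.

    `load_entries` is an illustrative helper, not a script shipped with TheorIA.
    """
    entries = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            entries[path.stem] = json.load(handle)
    return entries


# Example (run from the repository root):
#   entries = load_entries()
#   print(f"Loaded {len(entries)} entries")
```

Because each result lives in its own file, a malformed entry only affects its own key rather than the whole dataset.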

🧠 Centralized Assumptions

Standardized database of physics assumptions ensuring consistent terminology across all entries.
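
One way to picture the centralized approach: entries carry short assumption identifiers, and a single table maps each identifier to its canonical wording. The sketch below uses made-up identifiers and a made-up `assumptions` key purely for illustration; the real layout is defined by the repository's schemas and may differ.

```python
def resolve_assumptions(entry, assumption_table):
    """Return a copy of `entry` with assumption ids replaced by their text.

    Both the "assumptions" key and the id -> text table are hypothetical.
    An unknown id raises KeyError, so inconsistent references fail loudly.
    """
    resolved = dict(entry)
    resolved["assumptions"] = [
        assumption_table[assumption_id]
        for assumption_id in entry.get("assumptions", [])
    ]
    return resolved


# Hypothetical central table and entry, for illustration only.
ASSUMPTIONS = {
    "point_mass": "Bodies are treated as point masses.",
    "inertial_frame": "The reference frame is inertial.",
}
entry = {"title": "Example result", "assumptions": ["inertial_frame", "point_mass"]}

print(resolve_assumptions(entry, ASSUMPTIONS)["assumptions"])
```

Keeping the wording in one place means a correction to an assumption propagates to every entry that references it.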

📚 Rich Context

Entries include unified assumptions, historical context, cross-entry dependencies, and domain classifications (arXiv-style tags) for comprehensive understanding.

📓 Interactive Notebooks

Every entry includes an automatically generated Jupyter notebook that opens directly in Google Colab for interactive exploration and verification of the physics derivations.

🌐 Open License

CC-BY 4.0: use it, remix it, teach with it.

How to Contribute

We are tiny right now, so every new entry counts! There are multiple ways to contribute:

🐙 GitHub Issues

Submit entries through structured issue forms

  • ✓ Direct integration
  • ✓ Community discussion
  • ✓ Automatic processing
For Coders: Direct Development

1. Fork and Clone

Fork the repository on GitHub and clone it to your local machine.

2. Create a JSON Entry

Create a JSON file in the entries/ folder following the instructions in the CONTRIBUTING.md file and the schema in schemas/entry.schema.json.

3. Submit a Pull Request

CI will automatically validate your JSON against the schema. If it passes, we will review and merge it.
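
Before opening a pull request, it can help to run a quick local check. The sketch below is not the project's CI and not full JSON Schema validation: it only confirms that a file parses as JSON and contains the top-level keys listed under the schema's standard `required` keyword. The function name `missing_required_keys` is hypothetical.

```python
import json
from pathlib import Path


def missing_required_keys(entry_path, schema_path="schemas/entry.schema.json"):
    """Return the schema's top-level required keys that the entry lacks.

    Illustrative pre-flight check only; it ignores types, nested objects,
    and every other JSON Schema keyword that the real CI validation covers.
    """
    schema = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    entry = json.loads(Path(entry_path).read_text(encoding="utf-8"))
    required = schema.get("required", [])
    return [key for key in required if key not in entry]
```

Called from the repository root with the path to a new entry file, it returns an empty list when no required top-level key is missing. A file that is not valid JSON raises `json.JSONDecodeError`, which is itself a useful signal before pushing.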

Using TheorIA for Machine Learning

Generate a complete ML-ready dataset with resolved assumptions using our built-in script:

# Generate dataset with only reviewed entries (recommended)
python scripts/build_ml_dataset.py

# Include draft entries (for larger dataset)
python scripts/build_ml_dataset.py --include-drafts

This creates dataset.json with all entries, resolved assumption text, and metadata ready for your ML pipeline.
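
Once the script has run, the output can be inspected with the standard library. The internal layout of dataset.json is not described here, so this sketch assumes nothing beyond the file being valid JSON and simply reports the top-level type and size; `summarize_dataset` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_dataset(path="dataset.json"):
    """Return (top-level type name, number of top-level items).

    Makes no assumption about field names inside the file.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


# Example (after running scripts/build_ml_dataset.py):
#   kind, size = summarize_dataset()
#   print(f"dataset.json is a {kind} with {size} top-level items")
```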

License & Citation

Licensed under CC-BY 4.0. If you use TheorIA, please cite it.

Contact

Issues, questions, or exciting physics ideas? Drop an issue on GitHub, and let's build this thing together!