TheorIA Dataset addresses the need for a high-quality, structured collection of theoretical physics derivations
that are self-contained and mathematically rigorous. Unlike scattered resources across textbooks and papers,
this dataset provides a unified format where each entry includes complete derivations, programmatic
verification, and clear explanations suitable for machine learning applications and educational purposes.
For example, on Wikipedia each physics entry has a different structure, the derivations are hard to locate,
and there is a lot of surrounding context. We reduce each entry to the minimum needed to understand the
concept and where it comes from.
How can TheorIA Dataset be used?
Educational Tools: A reference for students who want a minimal source with
correct information
Research Tool: Allows researchers to find new and cleaner relationships between physics
concepts and results
Knowledge Base Development: Building AI systems and training models that understand and
manipulate physics equations
Academic Reference: Quick access to properly formatted, peer-reviewed physics derivations
Will this remain a free initiative?
Yes, absolutely. TheorIA Dataset is released under the CC-BY-4.0 license and will always remain
freely available to the academic community, researchers, educators, and anyone interested in theoretical
physics. This is a commitment to open science and knowledge sharing.
How can I participate?
There are several ways to contribute:
Submit new entries: Follow the guidelines in CONTRIBUTING.md to add new physics derivations
Peer review: Review existing entries for accuracy and completeness
Improve documentation: Help enhance explanations and fix errors
Testing: Run validation scripts and report issues
Community building: Share the project and engage in discussions
The draft entries are AI-generated and serve as a starting point for the dataset. However, they do not yet meet
the quality standards we want for the final dataset. Current AI models are not yet good at producing rigorous
physics derivations, and that is, in part, the point of this collective effort: to create high-quality
training data that can help improve future AI systems in physics. This is precisely where people with physics
knowledge are needed: to review, refine, and validate these entries so that they meet our academic
standards. Your expertise is crucial for transforming these drafts into high-quality, peer-reviewed
entries.
What makes a good entry?
A good entry should be:
Self-contained: All symbols defined, assumptions clearly stated
Mathematically rigorous: Complete derivations with all steps shown
Programmatically verified: Include Python code that validates the mathematics
Well-documented: Clear explanations and proper references
Following standards: Comply with the JSON schema and use AsciiMath format
Most importantly, one should follow the CONTRIBUTING guidelines strictly to ensure consistency and quality across all entries.
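As an illustration of what "programmatically verified" can look like, here is a minimal sketch that checks one derivation result numerically. It uses only the Python standard library and a finite-difference estimate; the function names, the chosen example (the simple harmonic oscillator), and the tolerance are assumptions made for this illustration, not the project's required style. Actual entries may use symbolic tools instead, and CONTRIBUTING.md remains the authoritative guide.

```python
import math

def second_derivative(f, t, h=1e-4):
    """Central finite-difference estimate of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

def check_oscillator_solution(amplitude=1.5, omega=2.0, tol=1e-5):
    """Check numerically that x(t) = A cos(omega t) satisfies x'' + omega^2 x = 0."""
    def x(t):
        return amplitude * math.cos(omega * t)

    # Sample a few times; the residual should vanish up to numerical error.
    for t in (0.0, 0.3, 1.0, 2.5):
        residual = second_derivative(x, t) + omega**2 * x(t)
        assert abs(residual) < tol, f"residual {residual} at t={t}"
    return True

check_oscillator_solution()
```

A numerical check like this only samples a few points, so it catches sign and factor errors but does not prove the identity; a symbolic check is stronger where it is available.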
How do I know if my entry is correct?
The repository includes automatic checks that an entry meets a minimum set of format and type requirements,
which are defined by the schema. We also check that the programmatic verification runs without error.
If you are adding or modifying an entry by cloning the repository, you can use:
make validate FILE=your_entry - Check schema compliance
make test-entry FILE=your_entry - Run full validation including programmatic verification
make test - Test all entries
If you submit through the forms on the webpage, we will run these checks for you and stay in touch by
email.
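To give a rough idea of what a format-and-type check does, the sketch below tests a draft entry against a small set of required fields. The field names (`name`, `assumptions`, `derivation`) and the helper `find_problems` are hypothetical and chosen only for illustration; the real schema in the repository is authoritative, and the `make` targets above are the supported way to validate an entry.

```python
import json

# Hypothetical required fields and their types, for illustration only.
# The actual schema is the one shipped in the repository.
REQUIRED_FIELDS = {
    "name": str,
    "assumptions": list,
    "derivation": list,
}

def find_problems(entry):
    """Return a list of readable problems; an empty list means the checks passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{field} should be of type {expected_type.__name__}")
    return problems

draft = json.loads('{"name": "Simple harmonic oscillator", "assumptions": []}')
print(find_problems(draft))  # prints ['missing field: derivation']
```

Returning a list of problems rather than stopping at the first one lets a contributor fix everything in a single pass.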
Can I suggest improvements to existing entries?
Yes! You can:
Open an issue to discuss potential improvements
Submit a pull request with your suggested changes
Use the "modify-entry" issue template for structured feedback
What physics domains are included?
We follow the arXiv taxonomy for domain classification, including but not limited to:
General Relativity and Quantum Cosmology (gr-qc)
High Energy Physics Theory (hep-th)
Condensed Matter Physics (cond-mat)
Quantum Physics (quant-ph)
Mathematical Physics (math-ph)
How is quality ensured?
Schema validation: All entries must conform to a strict JSON schema
Programmatic verification: Each entry includes code that follows the derivations and
verifies them
Peer review: Community review process for all submissions
Automated testing: Continuous integration checks all entries