Frequently Asked Questions

Why TheorIA Dataset?

TheorIA Dataset addresses the need for a high-quality, structured collection of theoretical physics derivations that are self-contained and mathematically rigorous. Unlike scattered resources across textbooks and papers, this dataset provides a unified format where each entry includes complete derivations, programmatic verification, and clear explanations suitable for machine learning applications and educational purposes.

For example, Wikipedia's physics articles each follow a different structure, the derivations are hard to locate, and they carry a lot of surrounding context. TheorIA Dataset reduces each entry to the minimum needed to understand the concept and where it comes from.

How can TheorIA Dataset be used?

  • Educational Tools: A reference for students who want minimal, correct information
  • Research Tool: Helps researchers find new and cleaner relationships between physics concepts and results
  • Knowledge Base Development: Building AI systems and training models that understand and manipulate physics equations
  • Academic Reference: Quick access to properly formatted, peer-reviewed physics derivations

Will this remain a free initiative?

Yes, absolutely. TheorIA Dataset is released under the CC-BY-4.0 license and will always remain freely available to the academic community, researchers, educators, and anyone interested in theoretical physics. This is a commitment to open science and knowledge sharing.

How can I participate?

There are several ways to contribute:

  • Submit new entries: Follow the guidelines in CONTRIBUTING.md to add new physics derivations
  • Peer review: Review existing entries for accuracy and completeness
  • Improve documentation: Help enhance explanations and fix errors
  • Testing: Run validation scripts and report issues
  • Community building: Share the project and engage in discussions

See our Contributing Guidelines for detailed instructions on how to get started.

Why are there so many draft entries?

The draft entries are AI-generated and serve as a starting point for the dataset. However, they do not yet meet the quality standards we want for the final dataset. Current AI models are not yet good at producing rigorous physics derivations, and that is, in part, the point of this collective effort: to create high-quality training data that can help improve future AI systems in physics.

This is precisely where people with physics knowledge are needed: to review, refine, and validate these entries so that they meet our rigorous academic standards. Your expertise in physics is crucial for transforming these drafts into high-quality, peer-reviewed entries.

What makes a good entry?

A good entry should be:

  • Self-contained: All symbols defined, assumptions clearly stated
  • Mathematically rigorous: Complete derivations with all steps shown
  • Programmatically verified: Include Python code that validates the mathematics
  • Well-documented: Clear explanations and proper references
  • Following standards: Comply with the JSON schema and use AsciiMath format

Most importantly, follow the CONTRIBUTING guidelines strictly to ensure consistency and quality across all entries.
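
To give a sense of what "programmatically verified" can mean, here is a minimal sketch, not the project's actual verification format (CONTRIBUTING.md defines that). It uses only the Python standard library and a textbook example chosen for illustration: it checks numerically that x(t) = A cos(ωt) satisfies x'' + ω²x = 0. The function names, step size, and tolerance are all assumptions made for this sketch.

```python
import math


def residual(x, t, omega, h=1e-4):
    """Central-difference estimate of x''(t) + omega**2 * x(t)."""
    second_derivative = (x(t + h) - 2.0 * x(t) + x(t - h)) / h**2
    return second_derivative + omega**2 * x(t)


def verify_harmonic_solution(amplitude=1.5, omega=2.0, tol=1e-5):
    """Return True if x(t) = A cos(omega t) satisfies x'' + omega^2 x = 0.

    The check is numerical: the residual must stay below `tol`
    at every sampled time. The parameter values are arbitrary.
    """
    def x(t):
        return amplitude * math.cos(omega * t)

    sample_times = [0.1 * k for k in range(50)]
    return all(abs(residual(x, t, omega)) < tol for t in sample_times)


if __name__ == "__main__":
    # The script fails loudly if the claimed solution does not hold.
    assert verify_harmonic_solution()
    print("derivation check passed")
```

A check like this either runs cleanly or raises an error, which is what lets it be automated. A symbolic check would be stricter than this numerical one; the numerical version is used here only to keep the sketch dependency-free.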

How do I know if my entry is correct?

The repository includes automatic checks that an entry meets the minimum required format and types; this specification is called the schema. We also check that the programmatic verification runs without error.

If you are adding or modifying an entry by cloning the repository, you can use:

  • make validate FILE=your_entry - Check schema compliance
  • make test-entry FILE=your_entry - Run full validation including programmatic verification
  • make test - Test all entries

If you submit through the forms on the webpage, we will run these checks for you and follow up by email.
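
As a rough illustration of what a format-and-types check does, the sketch below tests a JSON entry against a small table of required fields. It is not the repository's validator: the field names (title, assumptions, derivation) and the function find_problems are hypothetical, and the real rules live in the project's JSON schema and run through the make targets above.

```python
import json

# Hypothetical required fields and their types, for illustration only.
# The real requirements are defined by the repository's JSON schema.
REQUIRED_FIELDS = {"title": str, "assumptions": list, "derivation": list}


def find_problems(entry):
    """Return a list of problems; an empty list means the entry passed."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected_type):
            problems.append(
                f"field {name} should be of type {expected_type.__name__}"
            )
    return problems


if __name__ == "__main__":
    raw = '{"title": "Simple harmonic oscillator", "assumptions": []}'
    print(find_problems(json.loads(raw)))
```

Run on the sample entry, this reports the missing derivation field. A schema check of this kind only catches structural mistakes; whether the physics is correct is left to the programmatic verification and peer review.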

Can I suggest improvements to existing entries?

Yes! You can:

  • Open an issue to discuss potential improvements
  • Submit a pull request with your suggested changes
  • Use the "modify-entry" issue template for structured feedback

What physics domains are included?

We follow the arXiv taxonomy for domain classification, including but not limited to:

  • General Relativity and Quantum Cosmology (gr-qc)
  • High Energy Physics Theory (hep-th)
  • Condensed Matter Physics (cond-mat)
  • Quantum Physics (quant-ph)
  • Mathematical Physics (math-ph)

How is quality ensured?

  • Schema validation: All entries must conform to a strict JSON schema
  • Programmatic verification: Each entry includes code that follows the derivations and verifies them
  • Peer review: Community review process for all submissions
  • Automated testing: Continuous integration checks all entries