OpenAI o1 Reasoning Model Surpasses Human Experts in Science

TL;DR

Reasoning models such as OpenAI o1 are trained with reinforcement learning to produce a Chain-of-Thought (CoT), which lets them catch their own errors and refine their logic before answering.
The o1 model scored 78.0% on the GPQA Diamond benchmark, which measures PhD-level science knowledge, surpassing the human expert average of 69.7% for the first time.
Users should establish new work procedures to allocate AI thinking time according to task complexity and directly review the validity of the derived logical steps.

Example: A researcher stays up all night working on a complex scientific problem. Beside them, a machine pauses briefly and then suggests the optimal experimental path. Artificial intelligence has moved beyond simply producing sentences into an era of thinking: breaking down problems, revising strategies, and finding answers much like a human. The OpenAI o1 model signals this shift, turning AI from a simple answer generator into a research partner.

Current Status

The emergence of reasoning AI is shifting the paradigm from simple one-shot responses, often cited as a limitation of existing Large Language Models, to a deliberate thinking process. The OpenAI o1 model is trained to generate an internal Chain-of-Thought, so it deliberates before delivering an answer. It decomposes problems into sub-steps and tries a different strategy if its initial approach fails.

Quantitative metrics support this shift. The o1 model achieved an accuracy of 78.0% on the GPQA Diamond benchmark, which tests PhD-level scientific knowledge. This figure exceeds the average score of human experts (69.7%). It suggests that, on this benchmark at least, AI can surpass human performance not just in recalling specialized knowledge but in high-level reasoning. The o1-preview and o1-mini models recorded accuracies of 73.3% and 60.0% respectively, providing options based on task difficulty and resource efficiency.
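The gaps behind these figures can be checked with simple arithmetic. The sketch below uses only the percentages quoted in this article; the names `SCORES` and `margin_over_experts` are illustrative and not part of any OpenAI tooling.

```python
# Accuracy figures quoted in this article (percent, GPQA Diamond).
HUMAN_EXPERT_BASELINE = 69.7
SCORES = {"o1": 78.0, "o1-preview": 73.3, "o1-mini": 60.0}


def margin_over_experts(score, baseline=HUMAN_EXPERT_BASELINE):
    """Return the gap to the expert average in percentage points (one decimal)."""
    return round(score - baseline, 1)


margins = {model: margin_over_experts(score) for model, score in SCORES.items()}

for model, gap in margins.items():
    # The "+" format flag prints an explicit sign, so models below the baseline stand out.
    print(f"{model}: {gap:+.1f} points vs. human experts")
```

On these numbers, o1 and o1-preview sit above the expert average while o1-mini sits below it, which is worth keeping in mind when choosing a model for expert-level science questions.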

In human preference evaluations, o1-preview was preferred over GPT-4o by 94% in data analysis and 95% in mathematics, demonstrating a significant leap in reasoning capabilities compared to previous models.

Analysis

The core of this change lies in the shift of artificial intelligence from intuitive, fast responses to logical, deliberate reasoning. While past models simply predicted the next most probable word based on vast datasets, o1 is trained with reinforcement learning to explore and refine lines of thought before answering. In that sense, language models have begun to evaluate logical paths in a manner loosely similar to chess AI.

However, this advancement brings clear management challenges. First is the opacity resulting from the Hidden Chain-of-Thought (Hidden CoT). Since the model's internal thinking chain is not disclosed to the user, researchers may find it difficult to verify the basis of a conclusion and may have to rely solely on the output.

Second is the issue of computational cost and time. Reasoning models consume more time and resources to produce an answer than general models. This makes them difficult to apply to every service and leaves users to decide, task by task, whether a fast model or a deep-thinking model fits the difficulty at hand. The observation that accuracy improves as thinking time lengthens has a direct corollary: economic costs rise with it.
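One way to make the fast-versus-deep decision explicit is a small routing rule. The following is a minimal sketch under stated assumptions: the `Task` fields, the tier labels, and the policy itself (use the reasoning tier only when multi-step logic matters and latency is not critical) are all hypothetical choices for illustration, not a recommendation from OpenAI.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    needs_multistep_logic: bool  # e.g. proofs, algorithm design
    latency_sensitive: bool      # e.g. live chat


# Hypothetical tier labels, not real model identifiers.
FAST_TIER = "fast-general-model"
REASONING_TIER = "deep-reasoning-model"


def choose_tier(task: Task) -> str:
    """Assumed policy: pay for reasoning only when logic matters and latency allows."""
    if task.needs_multistep_logic and not task.latency_sensitive:
        return REASONING_TIER
    return FAST_TIER


if __name__ == "__main__":
    proof = Task("Verify a proof sketch", needs_multistep_logic=True, latency_sensitive=False)
    chat = Task("Answer a support chat", needs_multistep_logic=False, latency_sensitive=True)
    print(choose_tier(proof))
    print(choose_tier(chat))
```

A real policy would likely weigh more signals (budget, error cost, user tier), but even a two-flag rule forces the trade-off to be written down rather than decided ad hoc.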

Practical Application

Users should fundamentally change the way they query AI. Instead of simply entering a question, they should request depth of thought from the model. The potential of reasoning models is maximized in fields where logical steps are critical, such as algorithm design, chemical property analysis, and mathematical proofs.

Example: When a task requires complex logic design, ask the model to analyze potential exceptions in the system architecture and to lay out a step-by-step design for resolving them.

Checklist for Today:

To test performance differences, first assign o1-series models to tasks where catching logical errors is critical.
Draft an AI resource allocation guide by separating real-time response tasks from in-depth research tasks.
Create a checklist to manually review each step of the answer provided by the AI to prevent logical leaps.
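The manual review item above can be supported by a very small data structure. This is a sketch only: it assumes the AI's answer arrives as plain text with one reasoning step per line, and the names `StepReview`, `build_checklist`, and `unverified_steps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepReview:
    text: str
    verified: bool = False  # set to True once a human has checked the step
    note: str = ""          # reviewer comment, e.g. why a step was rejected


def build_checklist(answer: str) -> list[StepReview]:
    """Treat each non-empty line of the answer as one step to review."""
    return [StepReview(line.strip()) for line in answer.splitlines() if line.strip()]


def unverified_steps(checklist: list[StepReview]) -> list[int]:
    """Return the 1-based positions of steps that still need human review."""
    return [i + 1 for i, step in enumerate(checklist) if not step.verified]


if __name__ == "__main__":
    answer = "Step 1: state the assumptions\n\nStep 2: derive the bound\nStep 3: conclude"
    checklist = build_checklist(answer)
    checklist[0].verified = True
    print(unverified_steps(checklist))
```

Because the model's internal reasoning is hidden, the checklist covers only the steps visible in the output; it helps a reviewer avoid skipping a step, but it does not verify anything on its own.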

FAQ

Q: Are reasoning models superior to existing models in all tasks? A: No. In fields requiring creative writing, simple fact-checking, or fast conversation, general models are more efficient in terms of cost and speed. o1 is specialized for fields where there is a correct answer and logical steps are important, such as math, science, and coding.

Q: Can users directly see the AI's internal thinking process? A: Not currently. The internal Chain-of-Thought is not disclosed; only a summary or the final result is provided. The policy is attributed to technical security and to preventing circumvention of the model, so users should verify the logic of the output after the fact.

Q: Does longer reasoning time lead to higher costs? A: Yes. Under inference-time scaling, the model generates more tokens and computes for longer, so infrastructure costs rise. A strategic approach is therefore needed to adjust the model's level of thinking according to the importance of the task.
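The cost relationship can be illustrated with a back-of-the-envelope estimate. This sketch assumes that hidden reasoning tokens are billed at the same rate as visible output tokens and uses a made-up price; check the provider's current pricing before relying on any number. The function name `estimate_output_cost` is hypothetical.

```python
def estimate_output_cost(visible_tokens, reasoning_tokens, price_per_1k_tokens):
    """Estimated output cost, assuming reasoning tokens are billed like visible ones."""
    total_tokens = visible_tokens + reasoning_tokens
    return total_tokens / 1000 * price_per_1k_tokens


if __name__ == "__main__":
    PRICE = 0.06  # hypothetical price per 1,000 output tokens, for illustration only
    short = estimate_output_cost(visible_tokens=500, reasoning_tokens=0, price_per_1k_tokens=PRICE)
    deep = estimate_output_cost(visible_tokens=500, reasoning_tokens=4500, price_per_1k_tokens=PRICE)
    print(f"no extra reasoning: {short:.2f}")
    print(f"with long reasoning: {deep:.2f}")
```

Under these assumptions the visible answer is the same length in both cases, yet the longer reasoning run costs ten times as much, which is why thinking depth is worth matching to task importance.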

Conclusion

The emergence of reasoning AI, represented by OpenAI o1, demonstrates that technology has evolved beyond a knowledge repository into a logic engine. The GPQA benchmark results are evidence that AI can become a practical partner in academic research and complex problem-solving.

The key going forward lies in how efficiently and transparently these reasoning capabilities can be controlled. We have now reached a point where we should move beyond asking AI what it knows, to designing and supervising how AI thinks.

References

🛡️ Learning to reason with LLMs | OpenAI
🛡️ OpenAI o1 System Card

Aionda