
The best yet, but the most confused?
OpenAI’s recently released o3 and o4-mini artificial intelligence models achieve impressive results in math and coding. However, there is one big problem: they “hallucinate”, that is, fabricate information, more than their predecessors.
Has progress stalled?
Usually, each new model hallucinates less than the one before it. But according to OpenAI’s internal tests, the o3 and o4-mini models hallucinate more than the company’s previous reasoning models (o1 and o3-mini) and even more than “traditional” models like GPT-4o.
How? Why?
For example, in OpenAI’s PersonQA test:
- o1 – hallucinated 16% of the time
- o3-mini – 14.8%
- o3 – 33%
- o4-mini – 48%!
This may be because these new models try to reason more and therefore make more claims overall: more claims means more correct conclusions, but also more incorrect ones.
Independent research confirms this
A nonprofit AI research lab called Transluce found that o3 even invents actions it never performed, such as claiming “I ran the code on a MacBook Pro.” In reality, the model has no such ability.
What can be done?
There are some solutions:
- Giving models access to web search helps a lot: with search enabled, GPT-4o reaches about 90% accuracy on OpenAI’s SimpleQA benchmark.
- But this approach does not always work, especially in areas where privacy is important.
Intelligence has increased, accuracy has decreased
OpenAI has acknowledged the problem and says that “more research is needed.” But hallucination threatens the use of artificial intelligence in law, medicine, and other fields that require high accuracy.
So as AI’s capabilities expand, the question of its reliability has become more pressing.