While artificial intelligence continues to deliver groundbreaking tools that simplify various aspects of human life, the issue of hallucination remains a persistent and growing concern.
According to IBM, hallucination in AI is “a phenomenon where, in a large language model (LLM)—often a generative AI chatbot or computer vision tool—the system perceives patterns or objects that are non-existent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate.”
OpenAI’s technical report on its latest models—o3 and o4-mini—reveals that these systems are more prone to hallucinations than earlier versions such as o1, o1-mini, and o3-mini, or even the “non-reasoning” model GPT-4o.
To evaluate hallucination tendencies, OpenAI used PersonQA, a benchmark designed to assess how accurately models respond to factual, person-related queries.
“PersonQA is a dataset of questions and publicly available facts that measures the model’s accuracy on attempted answers,” the report notes.
The findings are significant: the o3 model hallucinated on 33% of PersonQA queries—roughly double the rates recorded by o1 (16%) and o3-mini (14.8%). The o4-mini model performed even worse, hallucinating 48% of the time.
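To make that metric concrete, here is a minimal Python sketch of how a hallucination rate like the ones above could be computed over a question-answering dataset. PersonQA and OpenAI's grading pipeline are not public, so the record structure, the `judge_answer` helper, and the field names below are illustrative assumptions rather than OpenAI's actual tooling.

```python
# Illustrative sketch only: the data format and the toy grader below are
# assumptions for demonstration, not OpenAI's PersonQA evaluation code.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    question: str                 # person-related factual question
    reference: str                # publicly available ground-truth fact
    model_answer: Optional[str]   # None means the model declined to answer


def judge_answer(answer: str, reference: str) -> bool:
    """Toy grader: a case-insensitive exact match stands in for a real
    human or model-based judgment of factual correctness."""
    return answer.strip().lower() == reference.strip().lower()


def hallucination_rate(records: list[Record]) -> float:
    """Fraction of *attempted* answers judged incorrect, mirroring the
    'accuracy on attempted answers' framing quoted from the report."""
    attempted = [r for r in records if r.model_answer is not None]
    if not attempted:
        return 0.0
    wrong = sum(
        1 for r in attempted if not judge_answer(r.model_answer, r.reference)
    )
    return wrong / len(attempted)


if __name__ == "__main__":
    sample = [
        Record("Who founded Acme Corp?", "Jane Doe", "Jane Doe"),
        Record("Where was Jane Doe born?", "Springfield", "Shelbyville"),  # hallucinated
        Record("What year did Jane Doe retire?", "2010", None),            # declined
    ]
    print(f"Hallucination rate: {hallucination_rate(sample):.1%}")  # 50.0% of attempted
```

Under this framing, a model that declines to answer is not penalized as a hallucination, which is why reported rates depend on how often a model attempts an answer as well as how often its attempts are wrong.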
Despite the results, OpenAI did not offer a definitive explanation for the increase in hallucinations. Instead, it stated that “more research” is needed to understand the anomaly. If larger and more capable reasoning models continue to exhibit increased hallucination rates, the challenge of mitigating such errors may only intensify.
“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” OpenAI spokesperson Niko Felix told TechCrunch.