 
Failing Grade
A team of researchers at Facebook’s parent company Meta has come up with a new benchmark to gauge the abilities of AI assistants like OpenAI’s large language model GPT-4.
And judging by that benchmark, OpenAI’s current crop of AI models is… still pretty stupid.
The team, which includes “AI godfather” and Meta chief scientist Yann LeCun, came up with an exam called GAIA, made up of 466 questions that “are conceptually simple for humans yet challenging for most advanced AIs,” per a yet-to-be-peer-reviewed paper.
The results speak for themselves: human respondents correctly answered 92 percent of the questions, while GPT-4, even equipped with some manually selected plugins, scored a measly 15 percent. OpenAI’s recently released GPT-4 Turbo scored less than ten percent, according to the team’s published GAIA leaderboard.
It’s unclear, however, how competing LLMs like Meta’s own Llama 2 or Google’s Bard fared.
Nonetheless, the research suggests that we’re likely still a long way from reaching artificial general intelligence (AGI), the point at which AI algorithms can outperform humans in intellectual tasks.
Stupid Lawyer
That conclusion also flies in the face of some lofty claims made by notable figures in the AI industry.
“This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry,” the researchers write in their paper.
Case in point: in January, OpenAI competitor Anthropic claimed its AI, dubbed Claude, got a “marginal pass” on a blindly graded law and economics exam at George Mason University.
In its GPT-4 documentation, OpenAI also claimed that its model “exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top ten percent of test takers.”
But how to actually gauge the intelligence of these systems remains a thorny debate. Tools like GPT-4 have plenty of inherent flaws and still can’t reliably tell truth from fiction.
In other words, how could an algorithm really pass the bar if it can’t even tell whether Australia exists?
Limited Understanding
LeCun has long been an outspoken critic of AI doomsaying and has repeatedly downplayed comments alleging that we’re facing an existential threat in the form of a rogue AGI.
“LLMs obviously have *some* understanding of what they read and generate,” he tweeted over the weekend. “But this understanding is very limited and superficial. Otherwise, they wouldn’t confabulate so much and wouldn’t make mistakes that are contrary to common sense.”
That, however, may not always be the case. If recent rumors are to be believed, OpenAI is working on a next-generation model dubbed Q*, pronounced Q star, that could introduce a level of deductive reasoning and “planning.”
But whether it’ll manage to score a higher mark on Meta’s brutal GAIA test remains to be seen.
