A real stinker.
Sums It Up
Generative AI is absolutely terrible at summarizing information compared to humans, according to the findings of a trial conducted for the Australian Securities and Investments Commission (ASIC), spotted by Australian outlet Crikey.
The trial, conducted by Amazon Web Services, was commissioned by the government regulator as a proof of concept for generative AI's capabilities, and in particular its potential to be used in business settings.
That potential, the trial found, is not looking promising.
In a series of blind assessments, the generative AI summaries of real government documents scored a dire 47 percent on aggregate based on the trial's rubric, and were decisively outdone by the human-made summaries, which scored 81 percent.
The findings echo a common theme in reckonings with the current wave of generative AI technology: not only are AI models a poor replacement for human workers, but their unreliability makes it unclear whether they'll have any practical use in the workplace for most organizations.
Signature Shoddiness
The assessment used Meta's open-source Llama 2 70B, which isn't the newest model out there, but with 70 billion parameters, it's certainly a capable one.
The AI model was instructed to summarize documents submitted to a parliamentary inquiry, focusing specifically on content related to ASIC, such as where the organization was mentioned, and to include references and page numbers. Alongside the AI, human employees at ASIC were asked to write summaries of their own.
Five evaluators were then asked to assess the human and the AI-generated summaries after reading the original documents. The assessments were blind: the summaries were labeled simply A and B, and the scorers had no idea that AI was involved at all.
Or at least, they weren't supposed to know. When the assessors had finished up and were told the true nature of the experiment, three said they had suspected they were looking at AI output, which is pretty damning on its own.
Sucks On All Counts
All in all, the AI scored lower than the human summaries on every criterion, the report said.
Strike one: the AI model was flat-out incapable of providing the page numbers where it found its information.
That's something the report notes can be fixed with some tinkering with the AI model. But a more fundamental issue was that it regularly failed to pick up on nuance or context, and often made baffling choices about what to emphasize or highlight.
Beyond that, the AI summaries tended to include irrelevant and redundant information and were generally "waffly" and "wordy."
The upshot: these AI summaries were so bad that the assessors agreed using them could create more work down the line, because of the amount of fact-checking they would require. If that's the case, then the purported upsides of using the technology, namely cost-cutting and time-saving, are seriously called into question.
More on AI: NaNoWriMo Slammed for Saying That Opposition to AI-Generated Books Is Ableist