In Leaked Audio, Microsoft Cherry-Picked Examples to Make Its AI Seem Functional

A Microsoft researcher giving an internal presentation on an early demo of an AI tool admits that the responses were "cherry-picked." — *Image: Getty / Futurism*

Pick and Choose

Microsoft “cherry-picked” examples of its generative AI’s output because the tool would frequently “hallucinate” incorrect responses, Business Insider reports.

The scoop comes from leaked audio of an internal presentation on an early version of Microsoft’s Security Copilot, a ChatGPT-like AI tool designed to help cybersecurity professionals.

According to BI, a Microsoft researcher in the audio discusses the results of “threat hunter” tests in which the AI analyzed a Windows security log for possible malicious activity.

“We had to cherry-pick a little bit to get an example that looked good because it would stray and because it’s a stochastic model, it would give us different answers when we asked it the same questions,” said Lloyd Greenwald, a Microsoft Security Partner giving the presentation, as quoted by BI.

“It wasn’t that easy to get good answers,” he added.

Halluci Nation

Functioning like a chatbot — you type a query into a chat window, you get an answer in the style of a customer service rep — Security Copilot is largely built on OpenAI’s GPT-4 large language model, which also underpins Microsoft’s other generative AI outings like the Bing Search assistant. According to Greenwald, Microsoft got early access to GPT-4, and these demos were “initial explorations” of the tech’s capabilities.

Not unlike the Bing AI, which during the early stages of its release would return responses so insane that it had to be “lobotomized,” the researchers said that Security Copilot frequently “hallucinated” incorrect responses during its early iterations — a problem seemingly endemic to the technology.

“Hallucination is a big problem with LLMs and there’s a lot we do at Microsoft to try to eliminate hallucinations and part of that is grounding it with real data,” Greenwald said in the audio, “but this is just taking the model without grounding it with any data.”

In other words, the LLM Microsoft used to build Security Copilot, GPT-4, wasn’t trained on cybersecurity-specific data at the time. Instead, it was used straight out of the box, relying only on its standard, but still immense, general dataset.

Cherry on Top

Sharing another set of security questions, Greenwald revealed that “this is just what we demoed to the government.”

It’s unclear if Microsoft used these “cherry-picked” examples in its presentations to the government and other potential customers — or if its researchers were this candid about how the examples were chosen.

A Microsoft spokesperson told BI that “the technology discussed at the meeting was exploratory work that predated Security Copilot and was tested on simulations created from public data sets for the model evaluations,” adding that “no customer data was used.”
