Tax season, that dreaded time of year, is upon us. But if you were hoping that newfangled AI tech could help you file the laborious paperwork — and perhaps find a way of saving you a few bucks — think again.
After testing four leading AI chatbots, the New York Times found that all of them struggled to pick and fill out the correct forms, fumbling key calculations. In all, the bots miscalculated the tax owed to the IRS by an average of more than $2,000.
“The problem with taxes is all those very small little details matter, and it’s not going to get every single little detail right,” Benedict Evans, an analyst who writes a technology newsletter, told the NYT.
“These models get dramatically better over the course of every six months,” he continued. “But they still give you what is roughly the right answer, and that’s not what you want.”
AI can be useful for processing and summarizing large amounts of information, but it struggles with precision in virtually every domain. Chatbots will often fabricate factual claims, even when asked to summarize a single document. AI programming assistants will slip errors into their code. Image generators produce strange visual artifacts and inconsistencies.
The conundrum is the same with arithmetic. Pair that with byzantine tax laws and their corresponding, highly specific forms, and you have a recipe for, if not disaster, then a taxing and expensive back-and-forth with the IRS.
To test the AI models — OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok — the NYT had them attempt to solve a series of tax scenarios described in training materials by the tax service TaxSlayer. Only after supplying the models with highly specific instructions, like where each piece of information should go in each IRS document, did the AIs begin to fare better.
That, you might argue, defeats the point of using an automated tool in the first place. Your average Joe uses overpriced tax software precisely because they don't know the nitty-gritty of the process. Software like TurboTax or TaxAct "is procedural, following 'if-then' logic built for mathematical precision," Erik Brynjolfsson, a senior fellow at the Stanford Institute for Human-Centered AI, explained to the NYT, whereas large language models are prediction engines that "can be superhuman at many tasks yet fail at some that seem simpler to humans."
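To see what Brynjolfsson means by "if-then" logic, consider a minimal sketch of how deterministic tax software computes a progressive tax bill. The bracket thresholds and rates below are invented for illustration, not actual IRS figures, and this is not TurboTax's real code; the point is that the same input always yields the same answer, with no token-by-token prediction involved.

```python
# Illustrative sketch of the procedural "if-then" approach: walk each
# bracket in order, taxing only the slice of income that falls inside it.
# All numbers here are hypothetical, chosen only for demonstration.

def tax_owed(taxable_income: float) -> float:
    """Deterministically compute tax across hypothetical brackets."""
    brackets = [            # (upper bound of bracket, marginal rate)
        (10_000, 0.10),
        (40_000, 0.12),
        (90_000, 0.22),
        (float("inf"), 0.24),
    ]
    tax, lower = 0.0, 0.0
    for upper, rate in brackets:
        if taxable_income <= lower:   # no income reaches this bracket
            break
        # tax only the portion of income inside this bracket
        tax += (min(taxable_income, upper) - lower) * rate
        lower = upper
    return round(tax, 2)
```

Run twice, on any machine, the function returns the identical figure; an LLM asked the same question may not, which is exactly the precision gap the NYT's test exposed.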
A prime example of how hallucinating LLMs can cock up your tax homework? TurboTax's own experiments with the tech. When the tax software company deployed its "Intuit Assist" chatbot to answer tax questions, it spat out irrelevant answers. When the answers were on topic, they were often wrong.
More on AI: Grammarly Offering Manuscript Reviews by AI Versions of Recently Deceased Professors