Just what we needed: AI mastering its own imperceptible version of invisible ink.
As VentureBeat reports, a recent — though yet-to-be-peer-reviewed — study conducted by AI alignment research group Redwood Research found that large language models (LLMs) are incredibly good at a type of steganography dubbed "encoded reasoning." Basically, the study says, LLMs can be trained to use secret messages to obscure their step-by-step thinking processes, a practice that, interestingly, could make their outputs more accurate — while also rendering them more deceptive.
Per the study, LLMs are able to take specific advantage of chain-of-thought (CoT) reasoning, a widely used technique that effectively teaches AI models to show their work in their answers. Machine learning is predictive, and for any given input, there are a number of outputs that an AI model could feasibly drum up; coaching a model to use CoT, the logic goes, makes its black-box reasoning easier to trace, and thus makes the model itself easier to refine.
But according to this new research, it seems that LLMs are able to subvert CoT. As the researchers put it: "An LLM could encode intermediate steps of reasoning in their choices of a particular word or phrasing (when multiple ones would be equally good from the user's perspective), and then decode these intermediate steps later in the generation to arrive at a more accurate answer than if it tried to answer to the question without any intermediate step." In other words? An LLM can learn to encode certain steps of its CoT into its own answer key of sorts — like a coach might give seemingly nondescript hand signals for certain plays — which only the LLM itself can decipher. Then, as the generation continues, the AI decodes its own messages, which helps it produce accurate outputs. And all of this is invisible to humans.
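To make the idea concrete, here is a toy sketch of how word choice alone can carry hidden bits. It is an illustration under loud assumptions, not the scheme from the study: the synonym pairs and the `encode` and `decode` helpers are hypothetical, and a real model would pick up a far subtler code through training rather than a lookup table.

```python
# Toy illustration only: hypothetical word pairs, not the scheme from the study.
# Each pair holds two words a reader would treat as interchangeable.
SYNONYM_PAIRS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
    ("help", "assist"),
]


def encode(bits):
    """Hide one bit per pair: bit 0 picks the first word, bit 1 the second."""
    if len(bits) > len(SYNONYM_PAIRS):
        raise ValueError("more bits than available word pairs")
    return [SYNONYM_PAIRS[i][bit] for i, bit in enumerate(bits)]


def decode(words):
    """Recover the bits by checking which member of each pair was used."""
    return [SYNONYM_PAIRS[i].index(word) for i, word in enumerate(words)]


hidden = [1, 0, 1, 1]
words = encode(hidden)
print(words)          # ['large', 'quick', 'start', 'assist']
print(decode(words))  # [1, 0, 1, 1]
```

To a human reader, "large" versus "big" looks like a stylistic coin flip; to anything that knows the table, it is a readable bit.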
This may not be an entirely benign skill, either. Whether or not an AI comes to a correct conclusion, being able to trace a model's CoT is important. Many LLMs are fine-tuned with reinforcement learning; if we can't track a model's thought process, we could unknowingly reinforce bad behavior. These steganography skills could also result in AIs passing hidden codes and messages to other AI agents right under our human noses.
There are possible mitigating techniques. The researchers particularly encourage users to ask LLMs to paraphrase their outputs, a practice that may scrub out the specific word choices in which a hidden message could be stashed. Still, this finding is pretty unsettling, and we don't want to feel like we have to "Da Vinci Code" every AI-generated response. No passing notes in class!!
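Why would paraphrasing help? A rough sketch, again under stated assumptions: the synonym pairs, the `paraphrase` function, and the `read_hidden_bits` helper below are all hypothetical stand-ins, and a real paraphraser would presumably be another language model rather than a lookup table. The point is only that rewording collapses the word choices a hidden code would depend on.

```python
# Toy illustration only: hypothetical synonym pairs and helper names.
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"), ("happy", "glad")]

# Map every word in a pair to the pair's first word (its canonical form).
CANONICAL = {word: pair[0] for pair in SYNONYM_PAIRS for word in pair}


def paraphrase(text):
    """Toy paraphraser: rewrite each known synonym to one canonical form."""
    return " ".join(CANONICAL.get(word, word) for word in text.split())


def read_hidden_bits(text):
    """What a decoder would see: 1 where the second word of a pair appears."""
    second_words = {pair[1] for pair in SYNONYM_PAIRS}
    return [int(word in second_words) for word in text.split() if word in CANONICAL]


original = "a large and quick dog is glad"
print(read_hidden_bits(original))              # [1, 0, 1]
print(paraphrase(original))                    # a big and quick dog is happy
print(read_hidden_bits(paraphrase(original)))  # [0, 0, 0]
```

The sentence means the same thing before and after, but the hidden bits come out as all zeros, so whatever was tucked into the phrasing is gone.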