Today, OpenAI showed off its latest large language model (LLM), GPT-4o (that's a lowercase "o," for "omni"), which the company promises can "reason across audio, vision, and text in real time."
During its brief announcement, the company demonstrated the AI's uncanny ability to assess what it "sees" through the user's smartphone camera, allowing it to help solve math problems and even assist with coding.
OpenAI is making the new model "available to all ChatGPT users, including on the free plan," according to CEO Sam Altman, who added: "So far, GPT-4 class models have only been available to people who pay a monthly subscription."
It's arguably a natural evolution of the popular AI chatbot; by harnessing a live video stream, the assistant could likely be more helpful by benefiting from far more context.
It's also unsurprising, considering we've seen very similar demos from AI hardware companies Humane and Rabbit, both of which attempted to bring an AI chatbot-based gadget with a built-in camera to market this year, albeit with catastrophic results.
Positioned at the forefront of the tech, however, OpenAI is leveraging the computing power of the modern smartphone instead, and from what we've seen, that approach makes for a far more seamless experience, with barely any delay between a user's question and GPT-4o's answer.
"and with video mode!!" Sam Altman (@sama) tweeted on May 13, 2024, linking a demo clip (pic.twitter.com/cpjKokEGVd).
OpenAI claims GPT-4o can respond to audio inputs in as little as 232 milliseconds, which is "similar to human response time in a conversation." That's in part because it doesn't have to transcribe speech into text first: "all inputs and outputs" are "processed by the same neural network."
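To see why that matters for latency, here's a rough, purely hypothetical Python sketch; the function names and timings are our own placeholders, not OpenAI's code or numbers. It contrasts the cascaded transcribe-generate-synthesize pipeline that older voice assistants use, where delays stack up at every hand-off, with a single end-to-end network that takes one pass over the audio.

```python
import time

# Hypothetical stand-ins for the three stages of a cascaded voice pipeline.
# The sleep() durations are illustrative placeholders, not measured figures.

def transcribe(audio: str) -> str:
    time.sleep(0.3)   # separate speech-to-text model
    return f"text({audio})"

def generate(text: str) -> str:
    time.sleep(0.5)   # text-only LLM
    return f"reply({text})"

def synthesize(text: str) -> str:
    time.sleep(0.3)   # separate text-to-speech model
    return f"audio({text})"

def cascaded_assistant(audio: str) -> str:
    # Three hand-offs: latency accumulates, and the user's tone and
    # emotion are lost when the audio is flattened into plain text.
    return synthesize(generate(transcribe(audio)))

def end_to_end_assistant(audio: str) -> str:
    time.sleep(0.25)  # one network consumes and emits audio directly
    return f"audio(reply({audio}))"

start = time.time()
cascaded_assistant("question.wav")
print(f"cascaded:   {time.time() - start:.2f}s")

start = time.time()
end_to_end_assistant("question.wav")
print(f"end-to-end: {time.time() - start:.2f}s")
```

Under these made-up numbers the cascaded path takes over a second while the single-network path answers in a quarter of one, which is the gap OpenAI's 232-millisecond claim is gesturing at.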
In other words, OpenAI may have just given both Humane and Rabbit, whose products can take an eternity to respond to the user's inputs, a run for their money.
The new model also sounds considerably more natural and "emotional," with a lifelike female voice that seemingly picks up on the user's tone and emotions in real time. Put differently, it's a lot closer to Scarlett Johansson's voice in the 2013 sci-fi blockbuster "Her."
"I'm all ears," ChatGPT told OpenAI research lead Barret Zoph with a notably cheerful voice during a demo today. "What math problem can I help you tackle today?"
The demo, which relied on ChatGPT's new ability to see the world around it, didn't go off without a hitch.
"OK, I see it," ChatGPT said after Zoph asked her to help with a calculus problem without revealing the answer right away.
"No, I didn't show you yet!" Zoph answered, perplexed.
"Whoops, I got too excited," ChatGPT replied sheepishly while Zoph hurriedly wrote out the math problem with a sharpie on a paper in front of him. "I'm ready when you are."
Of course, we should take what OpenAI showed off today with a healthy grain of salt. Tech demos are tech demos — and it would be far from the first time we've seen big tech companies fudge the presentation with carefully rehearsed and conveniently curated demonstrations.
Late last month, for instance, news broke that the creators of a two-minute video titled "Air Head" — allegedly generated with OpenAI's new text-to-video AI Sora — had augmented the footage with more traditional filmmaking techniques.
In short, it remains to be seen how well the new ChatGPT will be able to respond to questions that involve a live smartphone camera feed in the real world, which tends to be far messier than a simple math problem written out in a perfectly lit studio environment.
Besides, OpenAI likely hasn't been able to solve far stickier problems, like its AIs' tendency to "hallucinate" facts or perpetuate harmful biases.
Nonetheless, what we saw today still marks a considerable step forward, one that could make the chatbot even more useful than it is now.
More on ChatGPT: Stack Overflow Bans Users for Protesting Against It Selling Their Answers to OpenAI as Training Data