OpenAI made a huge splash this week with its text-to-photorealistic video AI called Sora.
The company showed off some seriously impressive sample clips, from a couple walking through a snowy landscape to an airborne camera smoothly following a white vintage SUV as it makes its way up a dirt road.
It certainly appears to be a considerable leap for generative AI technology, and perhaps for domains far beyond video. In fact, OpenAI is already referring to Sora as a "world simulator," capable of understanding important aspects of the three-dimensional world around us, whether it's outputting a CGI-like scene of a digital landscape or a video of a woman walking down a neon-lit street at night.
"Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world," the company wrote.
"It learns about 3D geometry and consistency," Sora research scientist Tim Brooks told Wired. "We didn’t bake that in — it just entirely emerged from seeing a lot of data."
Broadly speaking, Sora is the natural evolution of the diffusion transformer, a type of model that so far has mostly been used to generate high-resolution images. In simple terms, diffusion models are trained by gradually adding noise to existing images and learning how to remove that noise step by step. To create a new image, the trained model starts from pure noise and progressively denoises it.
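To make the noising and denoising idea concrete, here is a minimal toy sketch in plain Python. It is not OpenAI's code and says nothing about how Sora itself is implemented. It assumes a common textbook formulation of diffusion with a linear noise schedule, and every name in it (linear_beta_schedule, cumulative_signal, add_noise, remove_noise) is hypothetical. The "image" is just five numbers, and the true noise stands in for what a trained network would have to predict.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (an assumed, illustrative schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """Running product of (1 - beta): how much of the original signal survives each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Mix the pixels with Gaussian noise at the noise level given by alpha_bar."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * n
        for p, n in zip(pixels, noise)
    ]
    return noisy, noise


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise given a noise estimate; a trained network would supply it."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(noisy, predicted_noise)
    ]


rng = random.Random(0)
image = [0.0, 0.25, 0.5, 0.75, 1.0]  # a toy five-pixel "image"
alpha_bars = cumulative_signal(linear_beta_schedule(1000))

# Noise the image to the halfway point of the schedule.
noisy, true_noise = add_noise(image, alpha_bars[499], rng)

# Here the true noise stands in for a perfect model prediction.
recovered = remove_noise(noisy, true_noise, alpha_bars[499])
```

With a perfect noise estimate, the original pixels come back exactly (up to floating-point error). A real model only ever has an approximate estimate, which is why generation proceeds through many small denoising steps rather than one jump.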
To train Sora, OpenAI fed it huge amounts of captioned videos to establish a connection between video footage and text input.
Apart from generating entirely new footage from prompts, Sora can also extend existing clips or turn AI-generated images into video.
While developing Sora, OpenAI researchers observed a "number of interesting emergent capabilities when trained at scale." For instance, it can "simulate some aspects of people, animals and environments from the physical world," according to the company's documentation.
Sample clips show that Sora can generate footage with dynamic and astonishingly smooth camera movement as it pans, tracks, or zooms, suggesting a considerable degree of apparent understanding of 3D space.
Tantalizingly, the company even seems to be suggesting that the tech could grow into a platform for gaming.
"These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world," the company writes, "and the objects, animals and people that live within them."
At the same time, Sora is far from perfect. For one, the model still doesn't fully understand cause and effect.
"For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark," the company writes.
Another clip shows a glass cup leaking its contents without actually shattering first.
Despite its limitations, Sora may be an early glimpse of a future in which AI-generated video could quickly become impossible to distinguish from the real thing.
And OpenAI is acutely aware of the potential for the tech to be misused. As a result, the company has chosen to slowly roll out the tool to "red teamers to assess critical areas for harms or risks."
"We’re going to be very careful about all the safety implications for this," project researcher Bill Peebles told Wired.