In Brief
MIT researchers have created an algorithm that interprets human visual social cues and predicts what will happen next. Giving AI the ability to understand and anticipate human social interaction could one day pave the way to efficient home assistant systems, as well as intelligent security cameras that can call an ambulance or the police ahead of time.

Get Smarter with TV?

MIT’s Computer Science and Artificial Intelligence Laboratory created an algorithm that uses deep learning, enabling artificial intelligence (AI) to learn patterns of human interaction and predict what will happen next. Researchers fed the program videos of human social interactions, then tested whether it had “learned” well enough to predict how those interactions would unfold.

The researchers’ weapons of choice? 600 hours of YouTube videos and sitcoms, including The Office, Desperate Housewives, and Scrubs. While this lineup may seem questionable, MIT doctoral candidate and project researcher Carl Vondrick explains that accessibility and realism were part of the criteria.

“We just wanted to use random videos from YouTube,” Vondrick said. “The reason for television is that it’s easy for us to get access to that data, and it’s somewhat realistic in terms of describing everyday situations.”

They showed the computer videos of people who were one second away from performing one of four actions: hugging, kissing, high-fiving, or handshaking. The AI guessed correctly 43% of the time, compared with humans, who were right 71% of the time.

Potential Future

Giving AI the ability to understand visuals the way humans do could be a precursor to efficient home assistants, as well as intelligent security cameras that could call an ambulance or the police ahead of time.

While this isn’t the first attempt at video prediction, it is the most accurate thus far. The key difference is that the new algorithm deviates from previous approaches, which prioritized pixel-by-pixel predictions of future frames. Instead, it predicts using abstract representations: it learns on its own which visual cues matter in social interactions and which do not, and focuses on those important signs. That kind of filtering comes naturally to humans, but it is far more complicated for AI.
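To make the idea concrete, here is a minimal toy sketch of predicting in an abstract representation space rather than at the pixel level. Everything here is illustrative, not MIT's actual system: the embeddings and action "prototypes" are random stand-ins for features a trained deep network would produce, and the learned transition is reduced to a single matrix.

```python
import numpy as np

# The four actions from the study.
ACTIONS = ["hug", "kiss", "high-five", "handshake"]

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in a real system each prototype would be the
# learned representation of an action; here they are random 128-d vectors.
action_prototypes = {a: rng.normal(size=128) for a in ACTIONS}


def predict_action(current_embedding, transition, prototypes):
    """Project the current frame's abstract representation one second into
    the future, then pick the action whose prototype is most similar."""
    future = transition @ current_embedding  # learned mapping, fixed here

    def cosine(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    return max(prototypes, key=lambda a: cosine(future, prototypes[a]))
```

The point of the sketch is the shape of the computation: prediction happens on compact feature vectors, and classification is a similarity lookup, so the model never has to render (or compare) future pixels at all.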

“It’s not hugely different from some other things that people have done, but they’ve gotten substantially better results out of it than people have in this area before,” says Pedro Domingos, a machine learning expert and professor at the University of Washington.