In Brief
  • A new AI lip reader has been built to process whole sentences at a time, allowing the AI to teach itself what letter corresponds to each slight mouth movement.
  • LipNet was 1.78 times more accurate than human lip readers in translating the same sentences.

“The First Sentence-level Lipreading Model”

Lip reading is a way of understanding speech by interpreting a person’s lip movements. However, human speech is complex and nuanced: a single lip movement can correspond to several different phonemes, the basic units of sound. The practice is therefore prone to errors, which can sometimes lead to humorous results.

Scientists from Oxford University have described an artificial intelligence system, called LipNet, which can accurately read lips. The system employs deep learning to train itself using 29,000 three-second-long videos labeled with captions.

A previous system read lips on a word-by-word basis, having been taught to associate each phoneme with a certain lip movement. It achieved an accuracy of 79.6 percent. LipNet, on the other hand, works on whole sentences at a time and achieves an accuracy of 93.4 percent. Human lip readers translating the same sentences scored 52.3 percent, making LipNet 1.78 times more accurate.
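The comparison above is simple arithmetic on the quoted accuracies. As a minimal sketch (the function name `accuracy_ratio` is our own illustration, not something from the LipNet researchers), the ratios can be checked in Python:

```python
def accuracy_ratio(system_pct: float, baseline_pct: float) -> float:
    """Return how many times more accurate one system is than a baseline."""
    if baseline_pct <= 0:
        raise ValueError("baseline accuracy must be positive")
    return system_pct / baseline_pct


# Accuracies quoted in the article, in percent.
LIPNET = 93.4
HUMAN = 52.3
PREVIOUS_SYSTEM = 79.6

# LipNet compared with human lip readers.
print(f"{accuracy_ratio(LIPNET, HUMAN):.2f}")
# LipNet compared with the earlier word-level system.
print(f"{accuracy_ratio(LIPNET, PREVIOUS_SYSTEM):.2f}")
```

Dividing the quoted figures gives roughly 1.79 for LipNet versus human lip readers, close to the 1.78 reported; the small gap presumably comes from rounding in the published percentages.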

More Tools For The Hearing-Impaired

While the accuracy of the system is impressive, it is still not perfect. The videos fed to it had ideal lighting, and the speakers faced the camera directly. Results may differ on less ideal footage, which has prompted some skepticism about the research.

However, the technology does show promise, and scientists are looking for ways to apply it. For example, it could be used as a tool for the hearing-impaired. Other technologies tackle the same problem in different ways, such as by taking advantage of sensory substitution.

According to OpenAI’s Jack Clark, getting this to work in the real world will take three major improvements: training on a large amount of video of people speaking in real-world situations, making the AI capable of reading lips from multiple angles, and varying the kinds of phrases the AI can predict.