- The new system merges computer vision and language models into a single jointly trained system to produce a human readable sequence of words
- It then replaces the recurrent neural network and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images
- This could eventually help visually impaired people understand pictures and make it easier for everyone to search on Google for images
Share This Article