• The problem with existing systems has been the large variation in exactly where (in which frames) the action appears in the videos. Without 3D information, it’s hard for the robot to infer what part of the image to model and how to associate a specific action, such as a hand movement, with a specific object.
  • The researchers took a crack, if you will, at this problem by developing a deep-learning system that uses object recognition (it’s an egg) and grasping-type recognition (it’s using a “precision large diameter” grasping type) and can also learn what tool to use and how to grasp it.
  • Their system handles all of that with recognition modules based on a “convolutional neural network” (CNN). It also predicts the most likely action to take, using language derived by mining a large database.
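To make the idea in these bullets concrete, here is a minimal Python sketch of how recognition outputs might be combined with language statistics to pick the most likely action. It is not the researchers' implementation: the probabilities, the co-occurrence counts, the label names, and the helper functions `action_given_object` and `predict_action` are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical recognition outputs for one video segment (not from the article).
object_probs = {"egg": 0.8, "bowl": 0.2}
grasp_probs = {"precision": 0.7, "power": 0.3}

# Hypothetical (action, object) co-occurrence counts, standing in for
# statistics mined from a large text database.
cooccurrence = Counter({
    ("crack", "egg"): 30,
    ("stir", "egg"): 10,
    ("crack", "bowl"): 1,
    ("stir", "bowl"): 19,
})


def action_given_object(action, obj):
    """Estimate P(action | object) from the co-occurrence counts."""
    total = sum(count for (_, o), count in cooccurrence.items() if o == obj)
    return cooccurrence[(action, obj)] / total if total else 0.0


def predict_action(object_probs):
    """Score each action by its language-based likelihood, weighted by
    how confident the recognizer is in each object."""
    actions = {action for action, _ in cooccurrence}
    scores = {
        action: sum(p * action_given_object(action, obj)
                    for obj, p in object_probs.items())
        for action in actions
    }
    return max(scores, key=scores.get), scores


best_action, scores = predict_action(object_probs)
best_grasp = max(grasp_probs, key=grasp_probs.get)
print(best_action, best_grasp)
```

In this toy setup the language statistics act as a prior: even if the recognizer is unsure which object is in view, actions that rarely co-occur with the likely objects score low.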

