Experience and intuition tell us whether two people walking towards each other will shake hands or go in for a kiss, or whether a drumstick hitting a table will make a sharp “clack” or a dull “clunk” – but teaching a machine to predict such visual and audio scenarios in real time is tricky.
Two teams at Massachusetts Institute of Technology are getting closer, though. Carl Vondrick and colleagues developed an algorithm that can anticipate human interactions more accurately than ever before, while Andrew Owens and his team wrote another algorithm that can search for and apply sounds to a silent video.
To get their machines up to speed, both teams trained them on hours of video or audio footage.
The action prediction algorithm was treated to 600 hours of YouTube videos, while the sound prediction algorithm was taught by “watching” a drumstick hit or scratch a range of different surfaces.
When shown a video of a new situation that stopped a second before a hug, kiss, handshake or high five, the action prediction algorithm correctly predicted what would happen next more than 43% of the time – a marked improvement on the 36% managed by previous algorithms.
After training, the sound predictor turned out to be good enough to fool people into thinking its sounds were real.
So why bother teaching machines to anticipate movements and sounds?
Sound prediction, while handy for film special effects, can also help robots better understand their surroundings by determining whether an object is hard or soft: striking a desk makes a very different sound to striking a teddy bear.
Action prediction, while not yet accurate enough for practical applications, could one day allow robots to better navigate human environments. Rescue robots, for instance, might be able to anticipate when someone is about to fall from a burning building.