One thing that musicians and producers alike have long wanted is software that can separate a mix into its individual instrumental and vocal parts. We’ve had this available for a few years now, but the next step is extracting different sounds from a video by pointing at the part of the screen that we want to hear. Researchers at MIT now have this figured out thanks to artificial intelligence, and have developed an app they’re calling PixelPlayer.
Since it’s artificial intelligence-based, PixelPlayer needs to be trained (and the researchers have done that). After that, it identifies the sources of sound in the video and estimates the volume of sound coming from each pixel in the image. It then “spatially localizes” those sounds, meaning that it identifies the regions of the clip that generate similar sound waves. The approach is outlined in a new paper called “The Sound of Pixels.”
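To make the idea of a per-pixel volume more concrete, here is a toy sketch in Python. It is not the researchers' method: it simply assumes that a model has already produced a short waveform for each pixel (the `pixel_audio` dictionary, the function names, and the 0.9 threshold are all made up for illustration), then computes a loudness value per pixel and groups pixels whose waveforms look alike.

```python
import math

def rms(samples):
    """Root-mean-square level of a waveform, used here as a stand-in for 'volume'."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def similarity(a, b):
    """Cosine similarity between two equal-length waveforms (0.0 if either is silent)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def group_similar_pixels(pixel_audio, threshold=0.9):
    """Greedily group pixels whose waveforms resemble a group's first member."""
    groups = []
    for pixel, wave in pixel_audio.items():
        for group in groups:
            representative = pixel_audio[group[0]]
            if similarity(wave, representative) >= threshold:
                group.append(pixel)
                break
        else:
            # No existing group was similar enough, so start a new region.
            groups.append([pixel])
    return groups

# Hypothetical per-pixel waveforms: two neighboring pixels share one sound,
# and a distant pixel carries a different one.
pixel_audio = {
    (0, 0): [0.0, 0.5, 0.0, -0.5],
    (0, 1): [0.0, 0.4, 0.0, -0.4],
    (5, 5): [0.3, 0.0, -0.3, 0.0],
}

volume_map = {pixel: rms(wave) for pixel, wave in pixel_audio.items()}
regions = group_similar_pixels(pixel_audio)
```

In this toy data the two neighboring pixels end up in one region and the distant pixel in another, which is the rough intuition behind pointing at a spot on the screen and hearing only what that spot produces.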
The current app is based on a neural network trained on MUSIC (Multimodal Sources of Instrument Combinations), a dataset of 714 unlabeled YouTube videos featuring a range of acoustic instruments, including guitars, cellos, clarinets, and flutes.
With PixelPlayer, you can not only isolate an instrument just by pointing to a spot on the video, but also rebalance the audio mix. For now, though, you’d have to classify PixelPlayer as purely experimental, since the audio quality is pretty marginal by professional standards (see the video below). The researchers are well aware of its limitations, and plan to improve the audio quality and add more instruments in the future.
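Once the instruments have been pulled apart, rebalancing is conceptually just a weighted sum. The Python sketch below illustrates that general idea only; the stem names, sample values, and the `rebalance` function are hypothetical and are not taken from PixelPlayer itself.

```python
def rebalance(stems, gains):
    """Mix equal-length stems into one track; stems missing from `gains` keep gain 1.0."""
    if not stems:
        return []
    length = len(next(iter(stems.values())))
    mix = [0.0] * length
    for name, samples in stems.items():
        gain = gains.get(name, 1.0)
        for i, sample in enumerate(samples):
            mix[i] += gain * sample
    return mix

# Hypothetical separated stems (a few samples each, for illustration only).
stems = {
    "guitar": [0.2, -0.2, 0.1],
    "cello": [0.1, 0.1, -0.1],
}

louder_cello = rebalance(stems, {"guitar": 0.5, "cello": 2.0})  # pull the guitar back, push the cello
solo_guitar = rebalance(stems, {"cello": 0.0})                  # mute the cello entirely
```

Setting a gain to zero isolates the remaining instruments, so soloing and rebalancing are really the same operation with different weights.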
Although the researchers see PixelPlayer as an aid to sound editing, it’s also a first step toward robots that understand environmental sounds, which could be either a threat or a great help to the humans who do that work now. And AI marches on.