Meta's open-source ImageBind AI aims to mimic human perception

ImageBind could eventually lead to leaps forward in accessibility and creating mixed reality environments.

Peter DaSilva / Reuters

Meta is open-sourcing an AI tool called ImageBind that predicts connections between data similar to how humans perceive or imagine an environment. While image generators like Midjourney, Stable Diffusion and DALL-E 2 pair words with images, allowing you to generate visual scenes based only on a text description, ImageBind casts a broader net. It can link six kinds of data: text, images and video, audio, 3D measurements (depth), temperature data (thermal) and motion data (from inertial measurement units). And it does this without first having to train on every possible combination of them. It’s an early stage of a framework that could eventually generate complex environments from an input as simple as a text prompt, image or audio recording (or some combination of the three).

You could view ImageBind as moving machine learning closer to human learning. For example, if you’re standing in a stimulating environment like a busy city street, your brain (largely unconsciously) absorbs the sights, sounds and other sensory experiences to infer information about passing cars and pedestrians, tall buildings, weather and much more. Humans and other animals evolved to process this data for our genetic advantage: survival and passing on our DNA. (The more aware you are of your surroundings, the more you can avoid danger and adapt to your environment for better survival and prosperity.) As computers get closer to mimicking animals’ multi-sensory connections, they can use those links to generate fully realized scenes based only on limited chunks of data.

So, while you can use Midjourney to prompt “a basset hound wearing a Gandalf outfit while balancing on a beach ball” and get a relatively realistic photo of this bizarre scene, a multimodal AI tool like ImageBind may eventually create a video of the dog with corresponding sounds, including a detailed suburban living room, the room’s temperature and the precise locations of the dog and anyone else in the scene. “This creates distinctive opportunities to create animations out of static images by combining them with audio prompts,” Meta researchers said today in a developer-focused blog post. “For example, a creator could couple an image with an alarm clock and a rooster crowing, and use a crowing audio prompt to segment the rooster or the sound of an alarm to segment the clock and animate both into a video sequence.”

Figure: Meta’s graph showing ImageBind’s accuracy outperforming single-mode models.

As for what else one could do with this new toy, it points clearly to one of Meta’s core ambitions: VR, mixed reality and the metaverse. For example, imagine a future headset that can construct fully realized 3D scenes (with sound, movement and so on) on the fly. Game developers could eventually use it to take much of the legwork out of their design process. Similarly, content creators could make immersive videos with realistic soundscapes and movement based on only text, image or audio input. It’s also easy to imagine a tool like ImageBind opening new doors in the accessibility space, generating real-time multimedia descriptions to help people with vision or hearing disabilities better perceive their immediate environments.

“In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each respective modality,” said Meta. “ImageBind shows that it’s possible to create a joint embedding space across multiple modalities without needing to train on data with every different combination of modalities. This is important because it’s not feasible for researchers to create datasets with samples that contain, for example, audio data and thermal data from a busy city street, or depth data and a text description of a seaside cliff.”
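To make the idea of a joint embedding space a little more concrete, here is a minimal sketch in plain Python. It is not Meta's code and does not call the ImageBind library: the three-number vectors, the item names and the `nearest` helper are all hypothetical, written by hand purely for illustration. The point it shows is that once text, images and audio all live in the same vector space, a clip of audio can be matched to a caption or a photo simply by checking which vectors point in a similar direction.

```python
import math

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written embeddings (NOT produced by ImageBind).
# A real model would output much longer vectors; the idea is only that
# every modality is mapped into the same shared space.
EMBEDDINGS = {
    ("text", "a dog barking"): [0.9, 0.1, 0.0],
    ("text", "waves on a beach"): [0.0, 0.2, 0.9],
    ("image", "dog_photo"): [0.8, 0.2, 0.1],
    ("image", "beach_photo"): [0.1, 0.1, 0.9],
    ("audio", "bark_clip"): [0.85, 0.15, 0.05],
    ("audio", "surf_clip"): [0.05, 0.25, 0.8],
}

def nearest(query_key, target_modality):
    """Return the name of the item in target_modality closest to the query."""
    query = EMBEDDINGS[query_key]
    candidates = [
        (key, vec) for key, vec in EMBEDDINGS.items()
        if key[0] == target_modality
    ]
    best_key, _ = max(candidates, key=lambda kv: cosine(query, kv[1]))
    return best_key[1]

if __name__ == "__main__":
    # Audio is compared directly against text and images, with no
    # modality-specific lookup table in between.
    print(nearest(("audio", "bark_clip"), "text"))
    print(nearest(("audio", "surf_clip"), "image"))
```

In this toy setup the barking clip lands next to the dog caption and the surf clip next to the beach photo only because the made-up vectors were chosen that way. What ImageBind contributes, per Meta's description, is learning such a shared space without needing training data for every combination of modalities.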

Meta views the tech as eventually expanding beyond its current six “senses,” so to speak. “While we explored six modalities in our current research, we believe that introducing new modalities that link as many senses as possible — like touch, speech, smell, and brain fMRI signals — will enable richer human-centric AI models.” Developers interested in exploring this new sandbox can start by diving into Meta’s open-source code.