AI Converts Sounds into Street-View Images, Bridging Audio and Visual Perception



Researchers at The University of Texas at Austin have developed a groundbreaking method that uses generative artificial intelligence to convert sounds from audio recordings into street-view images. Their study, published in Computers, Environment and Urban Systems, demonstrates that AI can replicate the human ability to connect audio and visual perceptions of environments, providing vivid visual representations from sounds alone.

According to TechXplore, the team trained an AI model by pairing audio clips with corresponding images of urban and rural streetscapes across North America, Asia, and Europe. These paired datasets, consisting of 10-second audio samples and still images of various locations, allowed the AI to learn the visual cues embedded in acoustic environments. When fed new audio inputs, the model generated high-resolution images that closely matched real-world scenes.
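To make the pairing idea concrete, the minimal PyTorch sketch below shows one common way a 10-second street recording could be encoded into a fixed-size vector that conditions an image generator. It is an illustration only, not the authors’ implementation: every module, dimension, and name in it is an assumption.

```python
# Hedged sketch (not the study's code): encode a 10-second audio clip into a
# fixed-size embedding that an image generator could be conditioned on.
# Requires torch and torchaudio.
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 16_000   # assumed sampling rate
CLIP_SECONDS = 10      # clip length reported in the article

class AudioEncoder(nn.Module):
    """Collapses a log-mel spectrogram into one conditioning vector."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Mel spectrogram front end: a standard 2-D representation of sound.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=SAMPLE_RATE, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                   # -> (batch, 64, 1, 1)
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        spec = self.to_db(self.melspec(waveform))      # (batch, mels, frames)
        feats = self.conv(spec.unsqueeze(1)).flatten(1)  # (batch, 64)
        return self.proj(feats)                        # (batch, embed_dim)

# Synthetic noise stands in for a real street recording.
clip = torch.randn(1, SAMPLE_RATE * CLIP_SECONDS)
embedding = AudioEncoder()(clip)
print(embedding.shape)  # torch.Size([1, 256])
```

In a full pipeline, an embedding like this would be paired with the matching street-view photo during training and passed to a generative image model as its conditioning signal.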

Yuhao Kang, an assistant professor of geography and co-author of the study, explained: “Our study found that acoustic environments contain enough visual cues to generate highly recognizable streetscape images that accurately depict different places.” The results were impressive: the AI-generated images correlated strongly with real-world photos, and human participants correctly matched 80% of the generated images to their corresponding audio samples, further validating the model’s accuracy.

Not only did the AI replicate the proportions of buildings, sky, and greenery, but it also captured subtle details such as architectural styles, distances between objects, and lighting conditions. The study also highlighted how certain sounds, such as traffic or nocturnal insect chirps, can reveal time-of-day information, adding depth to the AI’s ability to simulate environmental conditions.
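The first of those claims can be checked quantitatively by comparing per-class pixel proportions between segmentation masks of real and generated images. The snippet below is a hedged sketch of such a comparison; the label ids and the toy masks are assumptions standing in for an actual segmenter’s output, not the study’s pipeline.

```python
# Hedged sketch: compare how much of each scene buildings, sky, and greenery
# occupy in a real vs. a generated image, given segmentation masks.
import numpy as np

CLASSES = {"building": 1, "sky": 2, "greenery": 3}  # assumed label ids

def class_proportions(mask: np.ndarray) -> dict[str, float]:
    """Fraction of pixels assigned to each class of interest."""
    total = mask.size
    return {name: float((mask == label).sum()) / total
            for name, label in CLASSES.items()}

# Toy masks stand in for segmenter output on a real and a generated image.
rng = np.random.default_rng(0)
real_mask = rng.integers(0, 4, size=(256, 256))
generated_mask = rng.integers(0, 4, size=(256, 256))

real, gen = class_proportions(real_mask), class_proportions(generated_mask)
for name in CLASSES:
    print(f"{name}: real {real[name]:.2%} vs generated {gen[name]:.2%}")
```

Close agreement between the two columns, across many audio-image pairs, would indicate that the generator preserves the scene composition the soundscape implies.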

Kang, whose research focuses on the intersection of geospatial AI and human-environment interaction, emphasized the potential for AI to go beyond recognizing physical surroundings and enrich our understanding of how we subjectively experience places. This work suggests that machines may one day offer a multisensory approach to interpreting environments, bridging the gap between what we hear and what we see.