Deepfakes and tampered audio can now be created at the press of a button, which makes them far more dangerous. New research by School of Information students at the University of California, Berkeley makes it easier to determine the authenticity of an audio clip.
Romit Barua, Gautham Koorma, and Sarah Barrington first presented their research on voice cloning as their final master's project. The team worked with Professor Hany Farid and looked into different techniques for distinguishing a real voice from a cloned one designed to impersonate a specific person.
According to Techxplore, the team first analyzed audio samples of real and fake voices by looking at perceptual features, meaning patterns that can be identified visually. Through this lens, they examined audio waveforms and noticed that real human voices often had more pauses and varied more in volume throughout the clip, because real people tend to use filler words and to move around and away from the microphone while recording.
The team pinpointed pauses and amplitude (the consistency and variation in a voice’s volume) as key factors to look for when trying to determine a voice’s authenticity. However, they also found that this method on its own can yield less accurate results than the approaches they tried next.
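To make the two perceptual cues concrete, here is a minimal sketch in Python of how a pause ratio and an amplitude spread might be measured from a waveform. It is not the team's code: the function names, the 16 kHz sampling rate, the frame length, and the silence threshold are all assumptions chosen for illustration, and the two "clips" are synthetic sine tones rather than speech.

```python
import math
import statistics

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz (not from the article)


def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def perceptual_features(samples, frame_len=400, silence_threshold=0.02):
    """Share of quiet frames, and loudness spread across the remaining frames.

    frame_len and silence_threshold are illustrative guesses, not tuned values.
    """
    rms = frame_rms(samples, frame_len)
    if not rms:
        raise ValueError("clip is shorter than one frame")
    voiced = [r for r in rms if r >= silence_threshold]
    pause_ratio = 1 - len(voiced) / len(rms)
    amplitude_spread = statistics.pstdev(voiced) if voiced else 0.0
    return {"pause_ratio": pause_ratio, "amplitude_spread": amplitude_spread}


def tone(amplitude, n_samples, freq=200.0):
    """Synthetic stand-in for speech: a sine wave of fixed amplitude."""
    return [amplitude * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
            for i in range(n_samples)]


# One clip at constant loudness with no gaps, one with gaps and two loudness levels.
steady_clip = tone(0.5, 16_000)
varied_clip = tone(0.6, 4_000) + [0.0] * 4_000 + tone(0.2, 4_000) + [0.0] * 4_000

steady = perceptual_features(steady_clip)
varied = perceptual_features(varied_clip)
print("steady:", steady)
print("varied:", varied)
```

In this toy setup the clip with silent gaps and two loudness levels reports a higher pause ratio and a larger amplitude spread than the steady one, which mirrors the pattern the team describes for real recordings. On actual speech the threshold and frame length would need tuning.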
The team then tried a different, more detailed approach, using a ready-made audio waveform analysis package that extracts over 6,000 features, from which the 20 most important were selected. The team analyzed these extracted features and compared them with those of other audio clips, which produced a more accurate method.
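The article does not say how the 20 most important features were chosen, so the following Python sketch shows just one plausible way to narrow a large feature table down to a handful. It ranks each feature by how far apart the real and fake class means are, scaled by the pooled standard deviation. The scoring rule, the function names, and the synthetic data (200 features instead of 6,000, with three deliberately informative columns) are all assumptions for illustration.

```python
import math
import random
import statistics


def separation_score(real_vals, fake_vals):
    """Absolute gap between class means, scaled by the pooled standard deviation."""
    pooled = math.sqrt(
        (statistics.pvariance(real_vals) + statistics.pvariance(fake_vals)) / 2
    )
    gap = abs(statistics.mean(real_vals) - statistics.mean(fake_vals))
    return gap / (pooled + 1e-12)  # small constant avoids division by zero


def top_k_features(real_rows, fake_rows, k):
    """Return the indices of the k features that best separate the two classes."""
    scored = []
    for j in range(len(real_rows[0])):
        real_col = [row[j] for row in real_rows]
        fake_col = [row[j] for row in fake_rows]
        scored.append((separation_score(real_col, fake_col), j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


# Synthetic feature table: only the columns in INFORMATIVE differ between classes.
rng = random.Random(0)
N_FEATURES, N_CLIPS, INFORMATIVE = 200, 50, {3, 50, 120}


def synthetic_clip(is_fake):
    return [rng.gauss(3.0 if (is_fake and j in INFORMATIVE) else 0.0, 1.0)
            for j in range(N_FEATURES)]


real_rows = [synthetic_clip(False) for _ in range(N_CLIPS)]
fake_rows = [synthetic_clip(True) for _ in range(N_CLIPS)]

selected = top_k_features(real_rows, fake_rows, k=3)
print(sorted(selected))
```

With real data, the rows would come from the feature-extraction package and k would be 20. A univariate score like this one ignores interactions between features, so it should be read as a starting point rather than the selection method the team used.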
The most accurate results, however, came from learned features, which involve training a deep-learning model. The team fed raw audio to the model, which processes it and extracts multidimensional representations called embeddings. The model then uses these embeddings to distinguish real from synthetic audio.
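The second half of that pipeline, deciding real versus synthetic once embeddings exist, can be sketched in a few lines of Python. The deep-learning model that turns raw audio into embeddings is not shown here. Instead, the sketch uses random 8-dimensional vectors as stand-ins and trains a plain logistic-regression classifier on them, which is a deliberately simple substitute and not the team's architecture. All names, dimensions, and hyperparameters are assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(embeddings, labels, epochs=200, lr=0.1):
    """Fit weights and bias with full-batch gradient descent on the log loss."""
    dim, n = len(embeddings[0]), len(embeddings)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_is_synthetic(weights, bias, embedding):
    """True when the classifier scores the embedding as synthetic (label 1)."""
    z = sum(w * xi for w, xi in zip(weights, embedding)) + bias
    return sigmoid(z) >= 0.5


# Stand-in embeddings: real clips cluster around -1, synthetic clips around +1.
rng = random.Random(1)
DIM = 8


def stand_in_embedding(center):
    return [rng.gauss(center, 0.5) for _ in range(DIM)]


def make_set(n_per_class):
    data = [(stand_in_embedding(-1.0), 0) for _ in range(n_per_class)]
    data += [(stand_in_embedding(+1.0), 1) for _ in range(n_per_class)]
    return data


train, held_out = make_set(40), make_set(20)
weights, bias = train_logistic([x for x, _ in train], [y for _, y in train])

correct = sum(predict_is_synthetic(weights, bias, x) == bool(y) for x, y in held_out)
accuracy = correct / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the stand-in clusters are well separated by construction, the accuracy printed here says nothing about real-world performance. The point is only the shape of the workflow: embeddings in, a binary real-or-synthetic decision out.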
This final method was consistently the most accurate and recorded 0% error in lab settings. However, the method can be difficult to understand without proper context.
The team claims this research may address growing concerns about using voice cloning and deepfakes for nefarious purposes. Barrington explained: “Voice cloning is one of the first instances where we’re witnessing deepfakes with real-world utility, whether that’s to bypass a bank’s biometric verification or to call a family member asking for money.”
“No longer are only world leaders and celebrities at risk, but everyday people as well. This work represents a significant step in developing and evaluating detection systems in a manner that is robust and scalable for the general public.”