Netizens are more likely to be duped by misinformation presented as text than by video clips created with the help of algorithms, according to a study.
Fake content generated by machine-learning models is becoming increasingly realistic. Images of people across different ages, genders, and races look like real photographs. Voices can be cloned and manipulated to follow a script. Videos appear lifelike with face-swapping or lip-syncing techniques. These so-called deepfakes can make it look as though people have said or done things they haven’t, tricking us into believing falsehoods.
Experts and pundits feared people would be more easily duped by deepfake videos because seeing the material would make it more believable, while fake text would be easy to spot because it would obviously be written by a machine or otherwise made up.
But an experiment run by researchers at MIT demonstrated the opposite. Seeing is not believing. People find it harder to identify made-up text than computer-generated video. Even if this seems obvious to you, at least someone’s done the study. That’s science.
“We find that communication modalities mediate discernment accuracy: participants are more accurate on video with audio than silent video, and more accurate on silent video than text transcripts,” the team wrote in a paper released this month on arXiv that has been submitted for peer review.
The academics recruited 5,727 participants for their experiment, and asked them to read, listen to, and watch a variety of political speeches given by President Joe Biden and Donald Trump. They were told 50 per cent of the content they viewed was fake, and were asked to judge if something seemed real or false. Text transcripts of fake soundbites for the two men were output by software. Fake video clips were generated by using wav2lip to lip-sync video footage of the two men giving speeches to recordings by professional voice actors mimicking the pair from fake scripts.
To make sure the results weren’t skewed by political orientation, about half of the group were Democrats, while the other half were Republicans. All in all, they were able to identify whether something was fake or not about 57 per cent of the time for text, compared to 76 per cent for just audio, and 82 per cent for videos with audio. This might be more of a test of the voice actors’ abilities, but what do we know?
People are less likely to be tricked into believing falsehoods if they have more information available to them, the researchers concluded.
“These findings suggest that ordinary people are generally attentive to each communication modality when tasked to discern real from fake and have a relatively keen sense for what the two most recent US presidents sound like,” they wrote. “As participants have access to more information via audio and video, they are able to make more accurate assessments as to whether a political speech has been fabricated.”
Participants can judge whether audio and videos seem fake by listening and watching out for telltale signs, which is trickier with text. The context of the text becomes important because there are no visual or audio clues that people can easily pick up on. The question for text becomes: is this something that Joe Biden or Donald Trump would say, when given to them by a scriptwriter? The gap in how accurately people sniff out misinformation across text, speech, and video is likely to narrow as deepfakes become more convincing.
The researchers said they were planning to investigate deepfakes generated using more sophisticated methods, such as face-swapping in videos. The Register has asked the team for comment.
“The danger of fabricated videos may not be the average algorithmically produced deepfake but rather a single, highly polished, and extremely convincing video,” they warned. ®