Lip-reading artificial intelligence could help the deaf—or spies

By Matthew Hutson | Jul. 31, 2018, 3:15 PM

For millions who can’t hear, lip reading offers a window into conversations that would be lost without it. But the practice is hard—and the results are often inaccurate (as you can see in these Bad Lip Reading videos). Now, researchers are reporting a new artificial intelligence (AI) program that outperformed professional lip readers and the best AI to date, with just half the error rate of the previous best algorithm. If perfected and integrated into smart devices, the approach could put lip reading in the palm of everyone’s hands.

“It’s a fantastic piece of work,” says Helen Bear, a computer scientist at Queen Mary University of London who was not involved with the project.

Writing computer code that can read lips is maddeningly difficult. So in the new study, scientists turned to a form of AI called machine learning, in which computers learn from data. They fed their system thousands of hours of video along with transcripts, and had the computer solve the task for itself.

The researchers started with 140,000 hours of YouTube videos of people talking in diverse situations. Then, they designed a program that created clips a few seconds long, with the mouth movement for each phoneme, or word sound, annotated. The program filtered out non-English speech, nonspeaking faces, low-quality video, and video that wasn’t shot straight ahead. Then, they cropped the videos around the mouth. That yielded nearly 4000 hours of footage containing more than 127,000 English words.

The process and the resulting data set—seven times larger than anything of its kind—are “important and valuable” for anyone else who wants to train similar systems to read lips, says Hassan Akbari, a computer scientist at Columbia University who was not involved in the research.

The process relies in part on neural networks, AI algorithms containing many simple computing elements connected together that learn and process information in a way similar to the human brain. When the team fed the program unlabeled video, these networks produced cropped clips of mouth movements. The next program in the system, which also used neural networks, took those clips and came up with a list of possible phonemes and their probabilities for each video frame. A final set of algorithms took those sequences of possible phonemes and produced sequences of English words.
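To make that last step concrete, here is a minimal sketch (emphatically not DeepMind's code) of how per-frame phoneme probabilities can be collapsed into a phoneme sequence. It assumes a CTC-style output layer with a "blank" symbol and uses a toy five-symbol phoneme inventory; in the actual system, a separate stage then maps phoneme sequences to words.

```python
# Minimal sketch: greedy CTC-style decoding of per-frame phoneme
# probabilities into a phoneme sequence. Toy inventory, not the
# paper's actual model or phoneme set.
import numpy as np

PHONEMES = ["-", "b", "iy", "t", "uw"]  # "-" is the CTC blank symbol

def greedy_phoneme_decode(frame_probs: np.ndarray) -> list[str]:
    """frame_probs: (num_frames, num_phonemes) array of probabilities.

    Take the most likely symbol in each frame, merge consecutive
    repeats, then drop blanks.
    """
    best = frame_probs.argmax(axis=1)   # most likely symbol per frame
    prev = np.r_[-1, best[:-1]]         # symbol chosen in the previous frame
    collapsed = [PHONEMES[i] for i, p in zip(best, prev) if i != p]
    return [ph for ph in collapsed if ph != "-"]

# Five frames whose per-frame argmax reads: b, b, blank-free "iy", blank, t
probs = np.array([
    [0.10, 0.70, 0.10, 0.05, 0.05],   # b
    [0.10, 0.60, 0.20, 0.05, 0.05],   # b (repeat, merged away)
    [0.10, 0.10, 0.70, 0.05, 0.05],   # iy
    [0.80, 0.05, 0.05, 0.05, 0.05],   # blank
    [0.05, 0.05, 0.10, 0.75, 0.05],   # t
])
print(greedy_phoneme_decode(probs))   # ['b', 'iy', 't']
```

A production decoder would instead run a beam search over many candidate phoneme sequences, weighted by a language model; the sketch only shows the shape of the frame-to-phoneme step.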
After training, the researchers tested their system on 37 minutes of video it had not seen before. The AI misidentified 41% of the words, they report in a paper posted this month to the preprint server arXiv. That error rate may sound high, but the best previous computer method, which focuses on individual letters rather than phonemes, had a word error rate of 77%. In the same study, professional lip readers erred at a rate of 93% (though in real life they have context and body language to go on, which helps). The work was done by DeepMind, an AI company based in London, which declined to comment on the record.
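Those percentages are word error rates (WER), the standard metric in speech recognition: the minimum number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the length of the reference. A minimal implementation (a generic illustration, not code from the paper) looks like this:

```python
# Word error rate: Levenshtein edit distance computed over words,
# normalized by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the bat sat on mat"))  # 2 edits / 6 words = 0.33
```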
Bear likes that the program understands that a phoneme can look different depending on what is said before and after it. (For example, the mouth makes a different shape to say the “t” in “boot” than the one in “beet.”) She also likes that the system has separate stages for predicting phonemes from lips and predicting words from phonemes. That means if you want to teach the system to recognize new vocabulary words, you need to retrain only the last stage. But the AI has its weaknesses, she says. It requires clear, straight-ahead video, and a 41% error rate is far from perfect.

Integrating the program into a phone would allow the hard of hearing to take a “translator” with them wherever they go, Akbari says. Such a translator could also help people who cannot speak, for example because of damaged vocal cords. For others, it could simply help parse cocktail chatter.

Bear sees other applications, such as analyzing security video, interpreting historical footage, or hearing a Skype partner when the audio drops. The new AI approach might even answer one of the world’s greatest mysteries: In the 2006 World Cup final, French soccer player Zinedine Zidane was ejected for dramatically headbutting an opponent in the chest. He was apparently provoked by trash talk. What was said? We may finally know, but we might regret we asked.
