Computer Vision News - December 2016

LIPNET - How easy is lip-reading?

If you are not covering your mouth yet, you probably should. At least, if you don't want anybody else to know what you are saying. This is one of the consequences of an impressive project called LipNet, a model that maps variable-length sequences of video frames to text, making use of spatiotemporal convolutions, a Long Short-Term Memory (LSTM) recurrent neural network, and the connectionist temporal classification (CTC) loss, trained entirely end-to-end.

LipNet attains a remarkable 93.4% sentence-level word accuracy, outperforming previous state-of-the-art models, which performed the same task with accuracy just below 80%. An even more striking comparison is with hearing-impaired people, generally considered among the most accurate human expert lipreaders: the authors found that on average this population achieved an accuracy of 52.3%, far below LipNet's.

Besides LipNet's notable performance, the main novelty of this work is that it maps sequences of image frames of a speaker's mouth to entire sentences, eliminating the need to segment videos into words before predicting a sentence. Here is the PDF of the paper.

Real-world applications for accurate machine lipreaders would include improved hearing aids, silent dictation in public spaces, and speech recognition in noisy environments. The sentences tested in the demo video follow a pattern, which makes the demo slightly less impressive. Yannis Assael, one of the authors, explains that sentences without context are important for evaluating the actual per-word performance; moreover, the fixed structure comes from the GRID dataset (one of the few available), which offers 64,000 possible sentence combinations.
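For readers curious what such a pipeline looks like in code, below is a minimal, hypothetical PyTorch sketch of the three ingredients named above: 3D (spatiotemporal) convolutions over the frame sequence, a bidirectional LSTM over the time axis, and per-frame character scores trained with the CTC loss. The layer sizes, vocabulary, and input resolution are illustrative assumptions, not the authors' actual configuration.

import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    # Hypothetical LipNet-style pipeline; all sizes are illustrative.
    def __init__(self, vocab_size=27):  # e.g. 26 letters + space (assumption)
        super().__init__()
        # Spatiotemporal (3D) convolutions mix spatial and temporal cues;
        # pooling only the spatial axes preserves the time axis for CTC.
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Bidirectional LSTM reads the per-frame features in both directions.
        self.rnn = nn.LSTM(input_size=64, hidden_size=256,
                           bidirectional=True, batch_first=True)
        # Per-frame character scores; the extra class is the CTC blank symbol.
        self.fc = nn.Linear(2 * 256, vocab_size + 1)

    def forward(self, x):
        # x: (batch, channels, time, height, width) video of the mouth region
        feats = self.conv(x)
        feats = feats.mean(dim=(3, 4))   # average over space -> (B, C, T)
        feats = feats.permute(0, 2, 1)   # -> (B, T, C) for the LSTM
        out, _ = self.rnn(feats)
        return self.fc(out)              # (B, T, vocab+1)

# End-to-end training step on dummy data: 2 clips of 75 frames, 50x100 crops.
model = LipNetSketch()
logits = model(torch.randn(2, 3, 75, 50, 100))
log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTC expects (T, B, C)
targets = torch.randint(0, 27, (2, 20))             # fake character labels
loss = nn.CTCLoss(blank=27)(
    log_probs, targets,
    torch.full((2,), 75, dtype=torch.long),
    torch.full((2,), 20, dtype=torch.long))
loss.backward()  # gradients flow through CTC, LSTM, and convolutions alike

The property this sketch illustrates is exactly what the article emphasizes: nothing in the pipeline segments the video into words. The CTC loss aligns the per-frame character predictions to the whole target sentence during training, which is what makes the end-to-end, sentence-level mapping possible.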
