Wednesday, May 13, 2020

13 May 2020, Moscow – Speaker diarisation and speech recognition technology created by Speech Technology Center (part of the Sberbank ecosystem) was named the best at the international CHiME Speech Separation and Recognition Challenge (CHiME-6). The technology was highly acclaimed for its ability to recognise English speech from multiple microphones in a natural environment. Speech Technology Center showed the best test results in the most difficult track of the challenge, significantly outperforming the competitors.

At CHiME, the organisers challenge the best teams from all over the world with tasks that become more complicated with each new competition. At CHiME-5, competitors tackled what is known as the cocktail party problem: recognising spontaneous speech from several speakers when speech and noise partially overlap, as in a typical social gathering. That challenge involved segmented speech, in which the boundaries of each utterance were marked in advance. At CHiME-6, for the first time in history, competitors had to solve the same problem on non-segmented speech, in which up to 20% of the speech overlaps. It was this most difficult task that the Speech Technology Center team focused on.

The recordings for the contest were made at 20 dinners in real houses, at parties where people cooked, ate, washed up, communicated freely and emotionally, joked and laughed. The simultaneous speech of 2–4 people, reverberation and intense noise, such as clinking tableware, water pouring from the tap, whirring A/C, footsteps or laughter, are the biggest difficulties here.

The goal of the participants was to create a recognition system that would ‘listen’ to the recordings and return a full transcript with the fewest errors. The Speech Technology Center team took first place.

This success was achieved by developing a unique algorithm that assigns separate speech segments to each speaker. The team also built a system combining several neural networks of different architectures to distinguish between speakers, perform beamforming (in effect, pointing the microphones at a particular speaker) and recognise the speech itself.

Speech Technology Center was joined at the competition by scientific teams from all over the world, representing well-known IT companies, such as Toshiba, as well as major universities prominent in the speech technology field – namely, Johns Hopkins University (USA), University of Science and Technology of China, Brno University of Technology (Czech Republic) and others.

“Speech Technology Center has been creating, developing and improving speech technology for 30 years. This year, at CHiME-6, for the first time in history, we faced the most challenging task: working with non-segmented speech. The ability to precisely recognise the speech by different speakers, even when interrupted by noise, allows these services to outgrow the innovation status and enter everyday use, assisting businesses and making our lives easier. High-quality processing of non-segmented speech will allow, for example, conducting competent logging of meetings where several speakers speak at once, while the intelligent speech analytics will help automate the work of call centres by recognising spontaneous speech, classifying voice calls, checking compliance with the script, assessing customer satisfaction and quality of dialogue, thus significantly optimising the workflow of modern call centres in retail, e-commerce and telecom. The recognition Speech Technology Center has enjoyed at this major international competition is not just a personal victory for us but a landmark event for the entire industry, and we are happy to bring the voice recognition challenges that the strongest teams from around the world are facing to a new level while also showcasing our core competencies in the global market,” commented Dmitriy Dyrmovskiy, CEO of Speech Technology Center.

“The goal of CHiME is to ensure that the strongest teams from around the world share their experiences and advance global speech recognition projects. We commend Speech Technology Center’s achievements in this area,” stated Jon Barker of Sheffield University (UK), a member of the organising committee for the CHiME Challenge.

Speech Technology Center (part of the Sberbank ecosystem) is a global developer of intelligent speech and face recognition technologies, an expert resource in AI and machine learning, and a member of ESSMA, the European Stadium and Safety Management Association, which unites the stadium industry. Speech Technology Center is one of the few companies in the world that creates and develops both biometric modalities: face and voice. Voice falsification detection and speech recognition solutions by Speech Technology Center occupy leading positions in the world rankings by NIST, ASVspoof Challenge, VOiCES and CHiME Challenge. Speech Technology Center solutions are in demand in 70 countries around the world.

