Developing a natural-language speech recognition system involves many technical challenges. These include an accurate speech recognition engine that translates what the machine hears into text, and an integrated natural language processor that determines the meaning or intent of that content and then returns a meaningful response or action. These topics have been studied extensively for decades and are not discussed further here. This article focuses instead on a technical challenge in far-field voice interface systems that is often overlooked but just as important: speech preprocessing, which takes place before the speech reaches the speech recognition engine.
Even the most modern speech recognition engine has one basic requirement for working well: the input to the engine must be speech. This seems obvious, but for far-field voice interface systems it is one of the hardest requirements to meet. "Far field" here refers to a system in which the user's mouth is more than half a meter from the product's microphone. For example, a smartphone held near the user's face is a "near-field" use case, whereas speaking to a PC or tablet at arm's length, or to a TV, stereo system, light switch, thermostat, or smart home controller, is a "far-field" use case.
There are several important differences between the near-field and far-field use cases. These differences create technical challenges that are absent or minor in near-field systems but very difficult in far-field systems.
1. Large dynamic range: In a far-field system, the user's voice may be very quiet at the product microphone because the user is a few meters away, while the interference may be very loud, for example when music is playing in a voice-controlled speaker system.
2. Low signal-to-noise ratio (SNR), low direct-to-reverberant ratio (DRR), and voice and noise arriving from unknown directions: The speech-to-noise ratio in far-field systems is much lower than in near-field systems. As the user moves farther from the product's microphone, the voice level at the microphone keeps falling while the background noise level stays the same.
Similarly, the indirect paths from the user's mouth to the microphone (reflections from surfaces such as walls and windows) may carry significant power compared to the direct path from the user to the microphone (i.e., low DRR). This reverberation can cause serious problems for traditional speech processing techniques and speech recognition engines.
Finally, in far-field systems, the directions of the user's voice and of the noise relative to the microphone are unknown. In typical applications, the noise may even come from the same direction as the user's voice.
3. Full-duplex voice interaction: In many far-field systems, audio content such as music, movies, or voice prompts may be playing through the product's speakers while the user is speaking to it. In this case, a full-duplex echo canceller is needed to cancel the product's playback output while it listens to the user's voice. The situation is even more complicated in systems where the echo canceller does not have full knowledge of the playback content.
Under these conditions, implementing a system that still captures speech reliably is very challenging. This article explains why traditional methods cannot provide acceptable performance under these far-field conditions, and then proposes a solution that can provide excellent far-field performance in a very cost-effective manner.
Large dynamic range
The voice capture system of a smart home device needs to support a large signal dynamic range, from soft whispers to loud audio playback. For a device that is 0.5 meters to 3 meters from the user, the voice level at the device microphone ranges from approximately 75 dB SPL down to 44 dB SPL. For a small audio playback device, the level of the playback content at the device microphone may be close to 95 dB SPL. This typical and extremely challenging use case strongly influences the choice of microphone and analog-to-digital converter (ADC) in the device.
For far-field applications, choosing a microphone with a high signal-to-noise ratio is very important. As mentioned above, the SPL of the target speech signal may be as low as 44 dB. Microphone SNR is specified relative to a 1 kHz tone at 94 dB SPL, so a microphone with an SNR of 66 dB has an equivalent noise floor of 28 dB SPL, and the worst-case ratio of speech to microphone self-noise is 16 dB. With a microphone whose SNR is 55 dB, the equivalent noise floor rises to 39 dB SPL and the ratio of speech to microphone self-noise may be as low as 5 dB.
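The arithmetic above can be reproduced with a short sketch. The function names are illustrative and not from any library; the only inputs are the figures quoted in the text (94 dB SPL reference, 44 dB SPL worst-case speech, microphone SNRs of 66 dB and 55 dB).

```python
# Microphone SNR is conventionally quoted against a 1 kHz tone at 94 dB SPL.
REFERENCE_SPL_DB = 94.0


def mic_noise_floor_spl(mic_snr_db):
    """Equivalent acoustic noise floor of the microphone, in dB SPL."""
    return REFERENCE_SPL_DB - mic_snr_db


def speech_to_mic_noise_db(speech_spl_db, mic_snr_db):
    """Ratio (in dB) of the speech level to the microphone's self-noise."""
    return speech_spl_db - mic_noise_floor_spl(mic_snr_db)


if __name__ == "__main__":
    worst_case_speech_spl = 44.0  # softest speech level quoted in the text
    for snr in (66.0, 55.0):
        floor = mic_noise_floor_spl(snr)
        ratio = speech_to_mic_noise_db(worst_case_speech_spl, snr)
        print(f"mic SNR {snr:.0f} dB: noise floor {floor:.0f} dB SPL, "
              f"speech-to-self-noise {ratio:.0f} dB")
```

This accounts only for the microphone's self-noise; room noise and ADC noise reduce the usable ratio further.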
The ADC's own noise floor is also important, and so is its dynamic range: if the dynamic range of the ADC is not sufficient for the application, loud signals will saturate the converter.
Figure 1 shows the input-referred noise of two ADCs as a function of the microphone boost (gain) setting. The red line shows the performance of an 18-bit ADC with a dynamic range of about 96 dB, and the blue line shows the performance of a 24-bit ADC with a dynamic range of about 106 dB. For reference, the gray line shows the self-noise level of a microphone with a signal-to-noise ratio of 66 dB and a sensitivity of -43 dBV/Pa.
Figure 1: The microphone's self-noise and the ADC noise add together to form the total noise floor of the system.
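As a rough illustration of how two noise sources combine into one noise floor, the sketch below sums them as powers, which assumes the microphone noise and ADC noise are uncorrelated. The 28 dB SPL microphone floor follows from the 66 dB SNR microphone in the text; the ADC input-referred noise values are hypothetical placeholders, since the data behind Figure 1 is not reproduced here.

```python
import math


def combine_noise_db(*levels_db):
    """Combine uncorrelated noise sources (given in dB) by adding their powers."""
    total_power = sum(10.0 ** (level / 10.0) for level in levels_db)
    return 10.0 * math.log10(total_power)


if __name__ == "__main__":
    mic_noise_spl = 28.0  # 94 dB SPL reference minus 66 dB microphone SNR
    # Hypothetical ADC input-referred noise, expressed as equivalent dB SPL.
    for adc_noise_spl in (18.0, 28.0, 34.0):
        total = combine_noise_db(mic_noise_spl, adc_noise_spl)
        print(f"ADC noise {adc_noise_spl:.0f} dB SPL -> "
              f"total noise floor {total:.1f} dB SPL")
```

When the ADC noise sits well below the microphone noise, the total is set almost entirely by the microphone; when the two are equal, the floor rises by about 3 dB; when the ADC noise dominates, the benefit of a high-SNR microphone is lost.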
Figures 2 and 3 show the system properties when using ADCs with 96 dB and 106 dB of dynamic range, respectively. The 106 dB ADC provides both a lower noise floor and a higher saturation point. A reasonable configuration is a microphone boost (gain) of 24 dB for the 96 dB ADC and 12 dB for the 106 dB ADC. In this example, the system with the 106 dB ADC has a noise floor that is 2 dB lower and a saturation point that is 12 dB higher. The 2 dB lower noise floor is especially important for picking up speech in far-field conditions.
Figure 2: This table shows the system properties when using a 96dB ADC.
Figure 3: This table shows the system properties when using a 106dB ADC.
Once factors such as peak content and resonance are taken into account, the SPL generated at the microphone by echo may reach 96 dB or higher. Saturation problems are therefore very common in small devices with loud playback when the ADC has a dynamic range of 96 dB or less. When these problems occur in a real system, the only remedy is usually to reduce the microphone boost (gain) further, but doing so raises the noise floor. In this example, the microphone boost needs to be reduced to 12 dB, which makes the noise floor 4.3 dB higher than with the 106 dB ADC. The preferred solution for far-field products is therefore a microphone with a high signal-to-noise ratio combined with an ADC that has a dynamic range of 106 dB or higher.
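The trade-off between boost and saturation point can be sketched as follows. The microphone sensitivity of -43 dBV/Pa comes from the text; the ADC full-scale input level is a hypothetical value chosen only for illustration, so the absolute SPL figures printed here are not those of Figures 2 and 3. What does not depend on that assumption is the relationship: each dB of boost removed raises the saturation point by 1 dB.

```python
REFERENCE_SPL_DB = 94.0  # 1 Pa, the level at which sensitivity is specified


def saturation_spl_db(mic_sensitivity_dbv_per_pa, boost_db, adc_full_scale_dbv):
    """SPL at which the boosted microphone signal reaches ADC full scale."""
    # Signal level at the ADC input for a 94 dB SPL tone, in dBV.
    level_at_reference_dbv = mic_sensitivity_dbv_per_pa + boost_db
    # Headroom between that level and ADC full scale.
    headroom_db = adc_full_scale_dbv - level_at_reference_dbv
    return REFERENCE_SPL_DB + headroom_db


if __name__ == "__main__":
    sensitivity = -43.0          # dBV/Pa, from the text
    assumed_full_scale = 0.0     # dBV, HYPOTHETICAL ADC full-scale input level
    for boost in (24.0, 12.0):
        spl = saturation_spl_db(sensitivity, boost, assumed_full_scale)
        print(f"boost {boost:.0f} dB -> saturates at {spl:.0f} dB SPL "
              f"(with the assumed full scale)")
```

This sketch treats levels as RMS sine equivalents and ignores the microphone's own acoustic overload point; peaky playback content saturates earlier than the RMS figure suggests.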