Sound Information
Sound Pressure Level
Sound Pressure Level (SPL) decreases with distance "x" from the sound source. Figures 1 and 2 show the SPL drop-off, expressed in dB, as a function of distance "x" from the origin of the sound. For speech, the reference point is generally accepted as 96dB SPL at approximately 1cm from the lips of a person talking. The equation plotted is:
dB = 96 - 20log(x/0.01), or equivalently, dB = 96 + 20log(0.01/x), where 0.01m (1cm) is the reference distance and "x" is the distance in meters.
Both curves show a loss of 6dB for every doubling of distance. Figure 1 is for distances out to 200cm. Figure 2 is a magnified portion for distances out to 50cm, and shows how rapidly sound pressure drops as distance increases from the sound source, even for short distances. For example, at 10cm distance, the SPL drops ~20dB, from 96dB SPL to ~76dB SPL.
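For quick estimates, this equation is easy to evaluate directly. A minimal Python sketch, assuming the 96dB SPL at 1cm reference used above (the function name is ours):

    import math

    def spl_db(x_m, ref_db=96.0, ref_m=0.01):
        """SPL in dB at x_m meters from a source of ref_db SPL at ref_m meters."""
        return ref_db - 20.0 * math.log10(x_m / ref_m)

    print(spl_db(0.10))                 # ~76 dB SPL at 10 cm, as in the text
    print(spl_db(0.10) - spl_db(0.20))  # 6 dB loss per doubling of distance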
Near-Field versus Far-Field Sound
The near field of a sound source is generally considered to be within 1 wavelength of the lowest frequency signal of interest.
If the lowest frequency of interest for speech is 300Hz, then the wavelength λ is equal to c/f, or 331.1/300 = 1.104 meters, where c is the speed of sound (331.1 meters/second at sea level and 0 degrees Celsius). For a frequency of 3500Hz, λ is equal to c/f, or 331.1/3500 = 0.0946 meters (9.46cm). Therefore, the general near-field limit for speech signals extends from ∼9.5cm to ∼1.1 meters from the source, depending on frequency. Beyond ∼1 meter, the speech signal is generally considered to be in the far-field of the speech source.
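The same wavelength arithmetic in Python, using the 331.1 m/s value from above:

    C = 331.1  # speed of sound in m/s (value used in this article)

    def wavelength_m(freq_hz):
        """Wavelength lambda = c / f, in meters."""
        return C / freq_hz

    print(wavelength_m(300))   # ~1.104 m: near-field extent at 300 Hz
    print(wavelength_m(3500))  # ~0.0946 m (9.46 cm): near-field extent at 3500 Hz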
For closely spaced microphones in an array, near-field sound sources present a spherical wavefront with strong signal amplitude, pressure gradient, and frequency-dependent differences based upon each microphone's distance from the source. For example, assume two microphones are placed 3cm apart, with the nearer microphone 5cm from the sound source.
Figure 2 shows that the first microphone would experience a signal of 82dB SPL while the second microphone (at 8cm distance) would experience a signal of 78dB SPL. Although the difference is 4dB, the overall signal levels are still relatively high.
The near-field speech signals in the microphones will be highly correlated, with essentially the same spectral content. Compared to the nearest microphone, the signal in the most distant microphone will be reduced in amplitude, as well as delayed by the time it takes sound to travel from the nearest microphone to the most distant one. Recovering the speech signal in this case is relatively easy.
Sound signals originating beyond the speech near-field of the microphone array are considered to be in the far-field, presenting essentially planar wavefronts to closely spaced microphones in an array. Each microphone will sense almost identical acoustic energy with essentially random phase, but those signals are only weakly correlated unless the microphones are very close to each other. The further these sources are from the microphone array, the lower the absolute SPL at the microphones will be.
As a further example, if the same microphone array were placed 150cm (1.5 meters) from the source, the SPL would be ∼52.5dB at the nearest microphone (150cm) and ∼52.3dB at the furthest microphone (153cm). Although the difference is only 0.2dB, the overall signal levels are down ~30dB from those in the near-field example above.
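Both examples can be reproduced with the SPL equation from earlier; a short Python check:

    import math

    def spl_db(x_m):
        # Same model as before: 96 dB SPL referenced to 1 cm
        return 96.0 - 20.0 * math.log10(x_m / 0.01)

    for near_m, far_m in [(0.05, 0.08), (1.50, 1.53)]:
        near, far = spl_db(near_m), spl_db(far_m)
        print(f"{near_m*100:.0f}/{far_m*100:.0f} cm: {near:.1f}/{far:.1f} dB SPL, "
              f"difference {near - far:.1f} dB")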
The difference signal between the microphone outputs, when appropriately processed and filtered, tends to cancel the far-field noise while leaving a relatively high-level speech signal at the output of the two-microphone amplifier and processing circuitry.
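To see why this works, here is a deliberately simplified amplitude-only calculation in Python. It assumes the far-field noise is fully coherent at the two closely spaced microphones and ignores the delay and filtering a real implementation applies:

    import math

    def amp(spl):
        """Convert dB SPL to a linear relative amplitude."""
        return 10 ** (spl / 20)

    speech1, speech2 = 82.0, 77.9  # near-field talker at 5 cm / 8 cm (dB SPL)
    noise1,  noise2  = 52.5, 52.3  # far-field noise at 150 cm / 153 cm (dB SPL)

    snr_single = speech1 - noise1
    snr_diff = 20 * math.log10((amp(speech1) - amp(speech2)) /
                               (amp(noise1) - amp(noise2)))
    print(f"single-mic SNR: ~{snr_single:.1f} dB")       # ~29.5 dB
    print(f"difference-signal SNR: ~{snr_diff:.1f} dB")  # ~53.8 dB

Under these idealized assumptions, the subtraction improves the speech-to-noise ratio by roughly 24dB.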
Acoustic Noise Characteristics
There are three classifications of acoustic noise fields: coherent, incoherent, and diffuse.
Coherent noise is characterized as propagating to the microphones without any form of reflections, spreading, or attenuation due to environmental obstacles.
Incoherent noise is noise at one location which is uncorrelated with noise at all other locations, and is considered spatially white.
Diffuse noise is characterized as noises of relatively equal energy that radiate in all directions simultaneously. Good examples would be office noises, airport lounges, traffic noise, i.e. practically all commonly encountered noisy environments.
There are two acoustic noise types: stationary and non-stationary.
Stationary noise is characterized as being relatively constant in energy, with known and slowly changing spectral content, and tends to be predictable. Good examples would be engine noise, air conditioning fans, random or "white" noise, and so on. Noise suppression algorithms work well with these types of sounds.
Non-stationary noise is characterized by short-duration changes in volume and content, such as people speaking loudly, passing traffic, or clapping hands, and is essentially unpredictable. Such sounds may be gone before any noise recognition and suppression technique can be fully applied. Non-stationary noise is often embedded in a stationary noise field.
The most difficult problem occurs when the noise sources have the same temporal (time), spectral (frequency), and coherency characteristics as the desired speech signal. This occurs when the background noise is non-stationary and consists primarily of other people talking, as in restaurants and bars, transportation terminals, and parties.
Microphone Array Solutions
Microphone array solutions can be an effective technology to suppress stationary and non-stationary noise, depending upon the methods used.
Using appropriate algorithms, the individual microphone signals in an array are filtered and combined, resulting in beamforming or spatial filtering, which creates complex microphone array polar response patterns that can point toward, or away from, particular sound positions. Thus, sounds at certain positions can be isolated and enhanced, or suppressed and rejected. Likewise, correlation of the signals in the microphone channels allows the direction and location of major sounds to be determined.
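As an illustration of the spatial-filtering idea, here is a minimal delay-and-sum beamformer sketch in Python with NumPy. This is a generic textbook technique, not the method of any particular product, and the array geometry and sign conventions are assumptions:

    import numpy as np

    def delay_and_sum(mics, fs, spacing_m, steer_deg, c=331.1):
        """Steer a uniform linear array toward steer_deg (0 = broadside).

        mics is an (n_mics, n_samples) array. Each channel is time-advanced
        by its expected arrival delay (as a frequency-domain phase shift),
        so signals from the steering direction add in phase when averaged.
        """
        n_mics, n = mics.shape
        delays = np.arange(n_mics) * spacing_m * np.sin(np.radians(steer_deg)) / c
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        out = np.zeros(n)
        for ch, tau in zip(mics, delays):
            out += np.fft.irfft(np.fft.rfft(ch) * np.exp(2j * np.pi * freqs * tau), n)
        return out / n_mics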
Depending upon the complexity of the array and its purpose, the array processing can be done with analog circuitry, with a Digital Signal Processor (DSP), a computer software program, or a combination of methods.
Beamforming
There are two beamforming techniques: adaptive and fixed.
For adaptive beamformers, the beam can be steered in various directions using data-dependent filtering and time responses that vary with the data. Many methods have been developed for building adaptive beamformers. The signal processing is more complex, but permits considerable freedom in the array design in terms of the number and types of microphones and their spacing. Adaptive beamformers are typically implemented with digital signal processors or computer software.
For fixed beamformers, the beam is optimized for the direction of the desired sound while suppressing sounds from other directions as much as possible. Typically, closely spaced differential microphone end-fire arrays, which are inherently directive, are used with or without fixed time delays to steer the beam. Any filtering and signal processing is also optimized and fixed for the particular mechanical design. Fixed beamformers can be made with analog circuitry, digital signal processors, or computer software.
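A minimal sketch of such a differential end-fire pair in Python with NumPy (the output equalization a real design would add is omitted):

    import numpy as np

    def diff_endfire(front, rear, fs, spacing_m, c=331.1):
        """First-order differential end-fire beamformer (illustrative only).

        The rear microphone is delayed by the acoustic travel time across
        the spacing and subtracted. Sound arriving from the rear then
        cancels, while sound from the front passes with a first-order
        high-pass response that a practical design would equalize.
        """
        tau = spacing_m / c
        n = len(front)
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        rear_delayed = np.fft.irfft(
            np.fft.rfft(rear) * np.exp(-2j * np.pi * freqs * tau), n)
        return front - rear_delayed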
Fixed beamformer solutions are often the preferred solution for speech applications, especially those involving speech recognition. If implemented in analog circuitry, they:
- respond instantaneously to the noise input
- are easy to implement and do not require any algorithm software development
- provide an acceptable level of signal-to-background noise ratio improvement (SNRI) for stationary and non-stationary noise
- typically exhibit very low to no speech distortion, which also improves overall mean opinion scores in speech quality tests (ITU-T P.835)
- have inherently low 'computational' complexity and signal latency
- consume less power than other solutions
By comparison, adaptive beamformer solutions implemented in DSP or software:
- require time to repetitively recognize and converge on the noise signal while applying and adjusting the suppression algorithm
- provide an overall higher SNRI, but often at the expense of speech output artifacts such as delays due to noise-convergence time, pops and clicks, unintended muting, frequency distortion, echoes, or aperiodic changes in signal level, generally associated with the sub-band frequency signal processing methods used
- are more difficult to implement, requiring algorithm software development
- require higher power consumption
All beamformer solutions using very small arrays are highly sensitive to errors from microphone gain and phase mismatch over frequency, as well as to differences in the acoustic paths that can arise when the microphones are embedded in a product instead of being in open air. Therefore, beamformer solutions must have some means of compensating for these errors, whether within the beamformer implementation itself or by requiring matched microphones and acoustic paths external to it.
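The sensitivity to gain mismatch is easy to quantify with a toy model, assuming a pure gain error and perfect phase match: subtracting two ideally identical far-field signals when one channel has gain g leaves a residual (1 - g)x, which caps the achievable cancellation depth.

    import math

    for mismatch_db in (0.1, 0.5, 1.0, 3.0):
        g = 10 ** (mismatch_db / 20)
        depth = -20 * math.log10(abs(1 - g))
        print(f"{mismatch_db:4.1f} dB gain mismatch -> ~{depth:.0f} dB max cancellation")

Even a 1dB channel mismatch limits far-field cancellation to roughly 18dB, which is why the compensation described above matters.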
Microphone Spacing
The Nyquist rate of spatial sampling is 1/2 the wavelength (d = λ/2) of the highest frequency of interest. To spatially sample one wavelength of the frequency of interest, two sensors are required, spaced 1/2 wavelength apart.
An analogy to oversampling would be where d < 1/2 wavelength (d <λ/2), which allows the wavelength to be sampled more than two times.
Spatial undersampling would be where d > 1/2 wavelength (d > λ/2), which would allow one wavelength of the frequency of interest to complete and restart before the second sensor can sample the signal. Spatial undersampling can result in aliasing higher frequency signals down into the frequency band of interest, giving confusing results. To avoid aliasing, the input is band-limited at the maximum frequency of interest.
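The corresponding maximum spacing is simple to compute; in Python, using the 331.1 m/s speed of sound from earlier:

    C = 331.1  # speed of sound, m/s

    def max_spacing_cm(f_max_hz):
        """Largest alias-free spacing: d = lambda/2 = c / (2 * f_max)."""
        return C / (2 * f_max_hz) * 100

    print(max_spacing_cm(3500))  # ~4.73 cm for a 3500 Hz upper band edge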
Many studies have shown that very effective microphone arrays can be built with sensor spacings much smaller than the minimum needed for the Nyquist rate. Consider the case where sensors are spaced at 1/8 the wavelength of interest.
In a speech-only system, the frequency range is 300Hz to 3500Hz, with the greatest amount of vocal energy between 500Hz and 2500Hz. The λ/8 spacing would be 1.18cm for 3500Hz and 1.66cm for 2500Hz.
Frequencies below 3500Hz and 2500Hz would still be oversampled because the wavelengths get longer, so the 1.18cm or 1.66cm spacings effectively sample more of each wavelength.
An alternative calculation is to fix the spacing at 2cm and determine what fraction of a wavelength it represents at 2500Hz: λ/d = (c/f)/d = 331.1/(2500 × 0.02) ≈ 6.62, so a 2cm spacing corresponds to λ/6.62.
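Both spacing calculations in Python:

    C = 331.1  # speed of sound, m/s

    for f in (3500, 2500):
        print(f"{f} Hz: lambda/8 = {C / f / 8 * 100:.2f} cm")  # 1.18 cm, 1.66 cm

    d = 0.02            # chosen 2 cm spacing
    n = (C / 2500) / d  # lambda / d at 2500 Hz
    print(f"2 cm spacing = lambda/{n:.2f} at 2500 Hz")  # lambda/6.62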
As long as the spacing remains less than λ/2 for the highest frequency of interest, the microphone spacing can be adjusted to fit the product application. But as the spacing d gets smaller (the spatial sampling rate gets faster), the far-field signals in the microphones become more highly correlated and the array has better overall background noise suppression over a wider range of frequencies. As the spacing gets larger, the array has less overall suppression and becomes restricted to lower frequencies.
Once the sensor spacing is fixed, the array is now optimized for the frequency of interest. In the case of a fixed beamformer, the array response pattern has also been fixed.
For any particular product, a design tradeoff is made between frequency range of operation vs. desired noise suppression levels at those frequencies, theoretical vs. practical microphone spacings, and overall array system cost and complexity.
Microphone Array Solution Example
An example of a microphone array solution providing up to 20dB background noise suppression in speech applications is the LMV1088 Far Field Noise Suppression Microphone Array Amplifier from National Semiconductor. The LMV1088 is an analog fixed beamformer designed for use in a differential 2 microphone end-fire array with omni-directional microphones.
The basic application diagram is shown in Figure 3. The datasheet can be found at: Datasheet.
The microphones are typically spaced in a line 1.5cm to 2.5cm apart physically, or at the equivalent acoustic path distance. The optimal distance from the end-fire array to the person speaking is 2cm to 10cm, with reduced performance further away. The speech signal loss with distance can be estimated with the aid of Figures 1 and 2.
The LMV1088 provides for initial compensation for differences in the acoustic--microphone--amplifier signal paths of the 2 channels, correctional filtering to achieve natural voice quality output, and band limit filtering.
Internal amplifier gain adjustments are made by I2C command, allowing the use of microphones with a wide range of sensitivities, and permitting the matching of the output signal level to analog input channel signal requirements of a wide variety of communications processors and equipment.
The device supports four operating modes, selectable by I2C command.
- Default mode--noise suppression using both microphones.
- Independent modes--use microphone 1 or microphone 2 by itself (two modes, no noise suppression)
- Summed mode--the outputs of the 2 microphones are added together, giving a 6dB gain in the microphone signal (no noise suppression)
The analog nature of the LMV1088 provides certain characteristics not available from traditional DSP-type solutions:
- No noise convergence computation time to adapt to background noise levels and types, thereby providing essentially instantaneous response to the desired signal and background noise, and eliminating annoying temporary speech dropouts
- No frequency distortions, pops/clicks, or other artifacts in the output, since no sub-band frequency processing algorithms are used
- Very low power consumption, on the order of 1/10th the power consumption of a DSP solution
- No voice processing programming code is needed, allowing simpler and faster integration into a system
- Enhances any single-channel echo-cancelling processing that already exists in the system
An interactive demonstration recording of the LMV1088 with and without far-field noise suppression active is located at: Demo.
Microphone Array Solutions Comparison and Testing
Any comparison or measurement of a particular background noise suppression solution against any other solution must be done under identical and exactly repeatable environmental conditions in order to have any validity.
Standard methods are set up for this purpose, the most common of which are the International Telecommunication Union recommendations ITU-T P.56, P.58, P.64, P.830, and P.835. ITU-T P.835 deals specifically with subjective testing for evaluation of speech in a system that includes noise suppression. This specification describes a methodology for evaluating the subjective quality of speech in noise and is particularly appropriate for the evaluation of noise suppression algorithms. The methodology uses separate rating scales to independently estimate the subjective quality of the speech signal alone, the background noise alone, and the overall quality (mean opinion scores) of the speech in the background noise.
In addition, the IEEE lists specifications IEEE 1209-1994: Standard Methods for Measuring Transmission Performance of Telephone Handsets and Headsets, and IEEE 269-1992: Standard Method for Measuring Transmission Performance of Analog and Digital Telephone Sets. Both documents have been superseded by IEEE 269-2002.
The standards taken together cover objective numerical measurements as well as subjective voice quality and electronic voice recognition performance of background noise suppression solutions. Where manufacturers quote noise suppression numbers, these may be the best numbers achieved, and may not be acceptable in a particular application once voice quality is taken into account.
Therefore, it is often difficult, and potentially misleading, to quote noise suppression numbers on solution datasheets unless the test conditions have been clearly documented. Datasheets typically will not provide that level of detail. Even if such detail was provided, it would be highly unlikely that the datasheet conditions would be the conditions encountered in a customer's particular application.