ISSN ONLINE(2278-8875) PRINT (2320-3765)


Analysis and Synthesis of Sinusoidal Noise in Monaural Speech Using CASA

FathimaC.M1, Khadeeja mol .K.U2
  1. M.Tech, Applied Electronics, Ilahia College of Engineering and Technology, Kochi, India1
  2. Asst. Professor Electronics & Communication Engineering, Ilahia College of Engineering and Technology, Kochi, India2

Visit for more related articles at International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering

Abstract

Computational auditory scene analysis (CASA) is a technique for segregating target speech from a monaural mixture. This article proposes a technique to separate sinusoidal noise from monaural mixtures. Many sounds that are important to humans have a pseudo-periodic structure over some stretch of time, with a repetition rate typically in the range of 100 Hz to 5 kHz, which gives rise to the corresponding pitch percept. A systematic evaluation of the algorithm shows a noticeable improvement in noise segregation.

Keywords

monaural speech segregation, CASA, sinusoidal noise analysis

INTRODUCTION

Human beings can distinguish and track individual sounds in noisy environments, whereas this remains a major challenge for computers. Bregman's 'Auditory Scene Analysis', published in 1990, was the first work to explain the perception and analysis of complex acoustic mixtures, which in turn led to the computational model known as CASA (computational auditory scene analysis).
A CASA system simulates the human auditory system and processes sound in a way similar to human auditory perception. It has two stages: segmentation followed by grouping. In the segmentation stage the input signal is decomposed into sensory segments, and in the grouping stage the segments that are likely to have come from the same source are grouped together as the 'target stream'. CASA deals efficiently with monaural speech segregation and has been improving steadily over time. Brown and Cooke proposed a CASA system that employs maps of several auditory features derived from a cochlear model for speech segregation. This system does not require a priori knowledge of the input signal, but it has some limitations: it cannot handle the sequential grouping problem effectively and often leaves missing parts in the segregated speech [4].
A CASA model for voiced speech segregation based on oscillatory correlation was proposed by Wang and Brown [3, 5]. It uses harmonicity and temporal continuity as the major grouping cues. This implementation recovers most of the target speech but is unable to recover the high-frequency components.
Hu and Wang [6, 7] proposed a system for voiced speech segregation that is a typical monaural system; it groups resolved and unresolved harmonics separately. An improved tandem algorithm for pitch estimation is provided in [8].
Multiscale onset and offset analysis is employed for unvoiced speech segregation in the Hu-Wang system. After voiced speech segregation, acoustic-phonetic features are used in the classification stage to distinguish unvoiced segments from interference [10, 11]. This article proposes an improved system for sinusoidal noise analysis and synthesis using computational auditory scene analysis.
A periodic signal can be approximated by a sum of sinusoids whose frequencies are integer multiples of the fundamental frequency and whose magnitudes and phases are uniquely determined to match the signal; this is Fourier analysis. The spectrogram is one manifestation of it, showing the short-time Fourier transform magnitude as a function of time.
A narrowband spectrogram reveals a series of nearly horizontal, uniformly spaced energy ridges, which correspond to the sinusoidal Fourier components (harmonics) and form an equivalent representation of the sound waveform.
The key idea and aim of sine wave modelling is to represent each of these ridges explicitly and separately as a set of frequency and magnitude values.
In monaural speech segregation, the response energy feature plays an important role in initial segmentation. In earlier CASA systems, the threshold on the response energy of the T-F (time-frequency) units was taken as a constant value, which is less effective because the intrusions are unknown.
The binary mask map is constructed after further grouping and unit labelling. Scattered and broken auditory elements in the binary mask produce unwanted fluctuations in the utterance, which in turn degrade the quality of the resynthesized speech. For this reason the Hu-Wang system in [6] includes a smoothing stage that reduces these fluctuations by, among other steps, removing segments shorter than 30 ms.

SINE WAVE ANALYSIS

Sine wave analysis is conceptually simple. From the short-time Fourier transform shown in the spectrogram, the frequency and magnitude of the spectral peaks are found at each time step and threaded together across time to obtain the representation. In practice it is complicated for a couple of reasons. First, picking the peaks is difficult. Second, the frequency resolution of the STFT is typically limited, so the maximum must be interpolated in both frequency and magnitude.
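The peak-picking and interpolation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Hann window, the -40 dB floor and the three-point parabolic fit on the dB spectrum are all assumptions chosen for illustration, and a plain DFT stands in for an FFT so that only the standard library is needed.

```python
import cmath
import math


def dft_magnitude_db(frame):
    """Magnitude spectrum in dB of a Hann-windowed frame, bins 0..N/2."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[i] * window[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        mags.append(20 * math.log10(abs(acc) + 1e-12))
    return mags


def pick_peaks(mags_db, sample_rate, n_fft, floor_db=-40.0):
    """Find local maxima and refine each one with a three-point parabolic fit.

    Returns a list of (frequency_hz, magnitude_db) pairs.
    floor_db is an assumed threshold relative to the strongest bin.
    """
    top = max(mags_db)
    peaks = []
    for k in range(1, len(mags_db) - 1):
        a, b, c = mags_db[k - 1], mags_db[k], mags_db[k + 1]
        if b > a and b >= c and b > top + floor_db:
            # Vertex of the parabola through (-1, a), (0, b), (1, c).
            offset = 0.5 * (a - c) / (a - 2 * b + c)
            peak_db = b - 0.25 * (a - c) * offset
            peaks.append(((k + offset) * sample_rate / n_fft, peak_db))
    return peaks
```

Running `pick_peaks` on successive frames and linking peaks whose frequencies are close from one frame to the next would give the threaded tracks described above; the linking step is not shown here.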

SYSTEM DESCRIPTION

Figure 2 shows the system for monaural speech segregation based on CASA. Compared with other segregation systems, it adds a smoothing stage based on morphological image processing to improve on the initial segmentation stage.

Basic periphery processing

In the initial stage, a bank of 128 gammatone filters and a simulation of the neuromechanical transduction of the inner hair cells are used to model the auditory periphery. Passing the input signal through this model decomposes it into the T-F domain. The gammatone filter is derived from psychophysical observations of the auditory periphery and is the standard model of cochlear filtering. The impulse response of the gammatone filter is given by
[Equation omitted in source]
A low-pass filter is used to extract the response energy feature of every channel [6]. The output is denoted h(c,n).
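As a hedged illustration of the gammatone filter, the sketch below samples the commonly used textbook form g(t) = t^(l-1) exp(-2*pi*b*t) cos(2*pi*f*t) for t >= 0. The filter order l = 4, the bandwidth b = 1.019 ERB(f) and the Glasberg and Moore ERB approximation are widely used choices assumed here; they are not confirmed by this article, and the function names are hypothetical.

```python
import math


def erb_hz(center_hz):
    """Equivalent rectangular bandwidth (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def gammatone_ir(center_hz, sample_rate, duration_s=0.025, order=4):
    """Sampled gammatone impulse response, normalised to unit peak magnitude.

    ASSUMPTION: bandwidth b = 1.019 * ERB(center_hz) and order 4.
    """
    b = 1.019 * erb_hz(center_hz)
    n_samples = int(duration_s * sample_rate)
    ir = []
    for n in range(n_samples):
        t = n / sample_rate
        ir.append(t ** (order - 1) * math.exp(-2 * math.pi * b * t)
                  * math.cos(2 * math.pi * center_hz * t))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]
```

A 128-channel bank would call `gammatone_ir` once per centre frequency and convolve the input with each response; the spacing of the centre frequencies is not specified in the text and is left out of the sketch.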

Feature Extraction

1. Correlogram: The autocorrelation of the inner hair cell response h(c,n) in the T-F domain is used to construct the correlogram.
[Equation omitted in source]
where c is the channel index, m is the time frame, and the frame length is the number of samples in 20 ms.
2. Cross-channel correlation: This indicates whether adjacent filter channels respond to the same target. It is calculated as
[Equation omitted in source]
where L is the number of samples and the correlogram response is normalized to zero mean and unit variance.
3. Response energy: The response energy is the correlogram value at zero lag, A(c,m,0).
4. Onset/offset detection: Sudden intensity changes are expressed in terms of onsets and offsets.
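The first three features can be sketched for a single frame as follows. This is an illustrative sketch under assumptions, not the article's exact formulation: the autocorrelation is a plain time-domain average, the cross-channel correlation is the mean product of two normalised autocorrelation functions, and all function names are hypothetical.

```python
import math


def autocorrelation(frame, max_lag):
    """A(lag) = mean of x[i] * x[i + lag] for lag = 0..max_lag.

    The frame must hold the analysis window plus max_lag extra samples,
    so every lag averages over the same number of products.
    """
    n = len(frame) - max_lag
    return [sum(frame[i] * frame[i + lag] for i in range(n)) / n
            for lag in range(max_lag + 1)]


def normalise(values):
    """Shift and scale a sequence to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero for a flat input
    return [(v - mean) / std for v in values]


def cross_channel_correlation(acf_a, acf_b):
    """Mean product of two normalised autocorrelation functions (-1..1)."""
    na, nb = normalise(acf_a), normalise(acf_b)
    return sum(x * y for x, y in zip(na, nb)) / len(na)
```

With this layout the response energy of a unit is simply `autocorrelation(frame, max_lag)[0]`. Two channels driven by the same harmonic give a cross-channel correlation close to 1 even when their amplitudes and phases differ.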

Initial Segmentation

It comprises two parts: voiced and unvoiced speech segregation. The onset and offset method is used for unvoiced segmentation, whereas voiced segmentation is based on the extracted features.
Compared with the background noise, the target speech has stronger response energy in its T-F units. The energy feature A(c,m,0) and the cross-channel correlation feature C(c,m) are used to estimate the target as follows [6]:
[Equation omitted in source]
where the first constant is 0.985 [5] and the second is the threshold for effective target energy, given by
[Equation omitted in source]
where M is the total number of frames in a single channel and the constant that sets the threshold is approximately 1.2.
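Because the two equations did not survive in this copy, the following is only a guess at the shape of the rule. It assumes that a unit is kept when its cross-channel correlation exceeds 0.985 and its energy exceeds a per-channel threshold equal to 1.2 times the mean energy over the M frames of that channel. The mean-based threshold and the names `label_units`, `corr_threshold` and `scale` are assumptions made for illustration, not the article's formula.

```python
def label_units(energy, cross_corr, corr_threshold=0.985, scale=1.2):
    """Return a binary map: 1 where a T-F unit passes both tests, else 0.

    energy[c][m]     -- response energy A(c, m, 0)
    cross_corr[c][m] -- cross-channel correlation C(c, m)
    """
    mask = []
    for chan_energy, chan_corr in zip(energy, cross_corr):
        # ASSUMPTION: per-channel threshold = scale * mean energy over M frames.
        theta = scale * sum(chan_energy) / len(chan_energy)
        mask.append([1 if e > theta and c > corr_threshold else 0
                     for e, c in zip(chan_energy, chan_corr)])
    return mask
```

The point of a threshold that depends on the data, however it is actually defined in the article, is that it adapts to each mixture instead of relying on one constant for every intrusion.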

Pitch Tracking

Detecting and tracking pitch in a complex acoustic environment is a difficult and challenging task for a CASA system.
The tandem algorithm makes this easier: it can track multiple pitch contours and handles the multi-talker problem efficiently. For this, pitch estimation must be completed first.
Among the segmented units, those with strong energy and high cross-channel correlation are likely to come from the target speech and are called active units. The estimated target pitch is calculated as
[Equations omitted in source]
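One common way to estimate a pitch period from active units is to sum their autocorrelation functions and pick the lag with the largest total inside a plausible pitch range. The sketch below shows that idea only; it is an assumption about the general approach, not a reconstruction of the omitted equations or of the tandem algorithm, and the function name is hypothetical.

```python
def estimate_pitch_lag(acfs, active, min_lag, max_lag):
    """Pick the lag that maximises the autocorrelation summed over active units.

    acfs[c][lag] -- autocorrelation of channel c in the current frame
    active[c]    -- True if the unit in channel c is an active unit
    Returns None when no unit is active.
    """
    if not any(active):
        return None
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        total = sum(acf[lag] for acf, on in zip(acfs, active) if on)
        if total > best_value:
            best_lag, best_value = lag, total
    return best_lag
```

The pitch in hertz follows as the sampling frequency divided by the returned lag. Restricting the sum to active units keeps channels dominated by the intrusion from pulling the estimate towards the wrong period.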

Grouping and Unit Labelling

In this stage, streams are formed by grouping T-F units, and these groups are labelled as target speech or background noise. Segregation is needed for both voiced and unvoiced speech. For non-speech interference, the tandem algorithm is used for voiced speech segregation. If the intrusion is another speech signal, grouping is performed by analyzing the pitch contours.

Morphological Image Processing

Intrusions can be suppressed by appropriate morphological image processing, which removes unwanted particles and complements broken auditory elements, thereby enhancing the segregated speech.
Dilation and erosion are the fundamental operations of morphological image processing.
(i) Dilation: This is the process that "thickens" or "grows" objects in a binary image. The thickening is controlled by a structuring element. Let B be the structuring element and A the mask; the dilation is defined in terms of the reflection set of B as
[Equations omitted in source]
(ii) Erosion: This is the process that "thins" or "shrinks" objects in a binary image.
[Equation omitted in source]
Mask smoothing is carried out using morphological image processing. In this stage, active elements are considered to have similar periodicity patterns, and the extent of the smoothing is defined by the structuring element B.
[Equation omitted in source]
Pruning is the process used to remove isolated particles and smooth spurious salience in the segments of the obtained mask. It is represented as
[Equation omitted in source]
whereas complementing is
[Equation omitted in source]
Complementing is applied after pruning to the broken auditory elements in the low-frequency range. In the high-frequency range, residual interference energy is distributed across the mask, so applying complementing there would bring unwanted components into the segregated speech.
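The operations can be sketched on a small binary mask using the textbook definitions: the dilation of A by B is the set of shifts at which the reflected, shifted B overlaps A, and the erosion is the set of shifts at which the shifted B fits entirely inside A. Treating pruning as a morphological opening (erosion then dilation) and complementing as a closing (dilation then erosion) is an assumption here, since the article's own equations are missing; the function names and the symmetric structuring element are likewise illustrative.

```python
def dilate(mask, offsets):
    """Set a pixel if any offset neighbour is set (offsets assumed symmetric)."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc]:
                    out[r][c] = 1
                    break
    return out


def erode(mask, offsets):
    """Keep a pixel only if every offset neighbour is set; outside counts as 0."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = int(all(
                0 <= r + dr < rows and 0 <= c + dc < cols and mask[r + dr][c + dc]
                for dr, dc in offsets))
    return out


def prune(mask, offsets):
    """ASSUMPTION: pruning as morphological opening (erode, then dilate)."""
    return dilate(erode(mask, offsets), offsets)


def complement_gaps(mask, offsets):
    """ASSUMPTION: complementing as morphological closing (dilate, then erode)."""
    return erode(dilate(mask, offsets), offsets)
```

With a three-frame structuring element along the time axis, `[(0, -1), (0, 0), (0, 1)]`, `prune` removes an isolated single unit from a channel while keeping a run of three, and `complement_gaps` fills a one-frame hole between two runs.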

Re synthesis

The segregated speech is resynthesized after the smoothing stage. For the sine wave components, resynthesis is based on the analysis and uses a simple bank of sine wave oscillators.
Tracking and resynthesizing the harmonic peaks with sinusoids works quite well, although some energy, such as breath noise, is not reproduced.

Residual Extraction

Tracking the harmonic peaks and resynthesizing them with sinusoids works quite well. Even so, some energy, such as breath noise, is not reproduced, because it does not produce any harmonic peaks. Subtracting the resynthesized signal from the original signal therefore gives the final (residual) signal. In practice this works only if the frequencies, magnitudes and phases of the reconstructed sinusoids are made to match the original exactly.
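A minimal sketch of oscillator-bank resynthesis and residual extraction follows. It is deliberately simplified: each sinusoid keeps a constant frequency, amplitude and phase over the whole segment, whereas a practical system would interpolate these parameters from frame to frame. The function names and the track format are assumptions.

```python
import math


def synthesise(tracks, n_samples, sample_rate):
    """Sum constant-parameter sinusoids given as (freq_hz, amplitude, phase_rad)."""
    out = [0.0] * n_samples
    for freq, amp, phase in tracks:
        step = 2 * math.pi * freq / sample_rate
        for n in range(n_samples):
            out[n] += amp * math.cos(step * n + phase)
    return out


def residual(original, resynth):
    """Sample-by-sample difference between the original and the resynthesis."""
    return [o - r for o, r in zip(original, resynth)]
```

The subtraction only isolates the non-sinusoidal part when the parameters match. With the correct phase the residual contains just the unmodelled component; with a phase error of half a radian on a unit-amplitude sinusoid, the residual instead contains a sinusoid of amplitude 2 sin(0.25), roughly 0.49, which swamps any breath-like noise.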

EVALUATION AND COMPARISON

To validate the effectiveness of the proposed method, it must be compared with an existing system. The database consists of 170 mixtures obtained by mixing target utterances with 17 intrusions at various SNR levels. The original utterances are selected randomly from the TIMIT database. The sampling frequency Fs is 16 kHz. The intrusions are: N1, white noise; N2, rock music; N3, siren; N4, telephone; N5, electric fan; N6, alarm clock; N7, traffic noise; N8, bird chirp with water flow; N9, wind noise; N10, rain; N11, cocktail party; N12, crowd noise at a playground; N13, crowd noise with music; N14, crowd noise with clapping; N15, babble noise; N16, male speech; N17, female speech. In the last two cases the interference is much weaker than the target utterance.
The average SNR is computed using the equation
[Equation omitted in source]
where So(n) is the original speech and S'(n) is the segregated speech.
The better the performance, the lower the energy loss and the residual noise, and vice versa.
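A common definition of the output SNR in speech segregation work is ten times the base-10 logarithm of the energy of the original speech divided by the energy of the difference between the original and the segregated speech. The sketch below uses that definition as an assumption, since the article's equation is not available; the function name is hypothetical.

```python
import math


def snr_db(original, segregated):
    """10*log10 of original-signal energy over the energy of the difference."""
    signal = sum(o * o for o in original)
    error = sum((o - s) ** 2 for o, s in zip(original, segregated))
    return float("inf") if error == 0 else 10 * math.log10(signal / error)
```

For example, a segregated signal that reproduces every sample of the original at 90 percent of its amplitude has an error energy one hundredth of the signal energy, which corresponds to 20 dB.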
The results of the comparison are given in the following tables, which include the final SNR result and a comparison of

CONCLUSION

This article concentrates on the synthesis and removal of sinusoidal noise from monaural speech. The segregation is carried out with the aid of CASA, and improved threshold selection results in better performance. The SNR and the other evaluation measures show that the proposed system is an improvement in terms of reduced noise and lower energy loss.

Abbreviations:

CASA: computational auditory scene analysis; IBM: ideal binary mask; SNR: signal-to-noise ratio; PESQ: perceptual evaluation of speech quality.

Tables at a glance

Table 1 Table 2 Table 3
 

Figures at a glance

Figure 1 Figure 2
 

References