Predicting binaural speech intelligibility from signals estimated by a blind source separation algorithm

State-of-the-art binaural objective intelligibility measures (OIMs) require individual source signals for making intelligibility predictions, limiting their usability in real-time online operations. This limitation may be addressed by a blind source separation (BSS) process, which is able to extract the underlying sources from a mixture. In this study, a speech source is presented with either a stationary noise masker or a fluctuating noise masker whose azimuth varies in a horizontal plane, at two speech-to-noise ratios (SNRs). Three binaural OIMs are used to predict speech intelligibility from the signals separated by a BSS algorithm. The model predictions are compared with listeners' word identification rate in a perceptual listening experiment. The results suggest that with SNR compensation to the BSS-separated speech signal, the OIMs can maintain their predictive power for individual maskers compared to their performance measured from the direct signals. They also reveal that errors in SNR between the estimated signals are not the only factor that decreases the predictive accuracy of the OIMs with the separated signals. Artefacts or distortions on the estimated signals caused by the BSS algorithm may also be concerns.


Introduction
Objective intelligibility measures (OIMs, e.g. [1,2]) are useful in providing reasonable and fast predictions of speech intelligibility in various adverse listening conditions. Therefore, they are widely used in place of resource-consuming listening experiments with human listeners in many fields such as acoustic design [3], hearing impairment [4] and algorithm optimisation for improving speech intelligibility [5]. With further extensions to binaural listening, OIMs are capable of dealing with more realistic listening situations [6,7,8].
However, the majority of state-of-the-art OIMs are double-ended. To make an intelligibility prediction, they require prior information about the original clean speech signal and the noise signal(s) or the speech+noise mixture, as well as exact knowledge of the mixing process, such as the signal-to-noise ratio (SNR). Their usability is therefore limited in many practical scenarios in which the original signals are not readily available, for example, when estimating intelligibility from speech signals recorded by a pair of microphones in noisy public places. While there are some established single-ended methods (e.g. [9]) for predicting speech quality directly from a processed/degraded signal, very few relevant studies seek to predict intelligibility without accessing individual speech and masker sources. In [10], a single-ended method based on the speech-to-reverberation modulation energy ratio (SRME) was proposed. With further improvements, it demonstrated high correlations with subjective data from a hearing-impaired cohort in noisy reverberant conditions [4]. However, the SRME metric may not be suitable for predicting intelligibility in fluctuating noise maskers, whose effects not only reduce the modulation depth of the speech signal, but also introduce stochastic disturbance to speech modulation.
Predicting intelligibility directly from the speech+noise mixture may be difficult; an intermediate approach could be to estimate the source signals from the mixture, so that any double-ended OIM can then make a prediction using the estimated signals. For binaural recordings, the state-of-the-art blind source separation (BSS) methods [11,12,13] using interaural level difference (ILD) and interaural phase difference (IPD) have demonstrated good performance for two-channel source separation. These BSS methods can largely preserve binaural cues, as well as maintain the energy of each sound source, which is vital for speech intelligibility in noise. How well, then, can speech intelligibility be predicted single-endedly from the BSS-separated signals using existing OIMs, compared to the OIMs' performance when using ground truth signals?
The aim of this study is therefore to examine the performance of three binaural OIMs in predicting intelligibility from the outputs of a BSS algorithm, in both stationary and fluctuating noise maskers. The model predictions are compared with listeners' sentence-level word identification rate in a perceptual listening experiment. As the BSS may not thoroughly preserve the original SNR, two different SNR compensation schemes are tested in order to improve the OIM performance with the BSS-separated signals.
for example, so that the model will hold statistics of binaural features. Since the separation model depends on the source positions and the SNR level, these model parameters need to be estimated from the binaural mixture from which the sources are estimated. As the output of the BSS stage, the separated signals are then fed into a binaural OIM for intelligibility estimation.

The binaural BSS algorithm
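As in [11], a source signal from a certain direction arrives at the two ears with different time delays and levels:

$$\frac{L(t,f)}{R(t,f)} = 10^{\alpha(t,f)/20}\, e^{j\beta(t,f)}, \tag{1}$$

where $L(t,f)$ and $R(t,f)$ are the time-frequency (TF) representations of the left-ear and right-ear signals indexed by time frame $t$ and frequency bin $f$. $\alpha(t,f)$ and $\beta(t,f)$ denote interaural level difference (ILD) and interaural phase difference (IPD), respectively. Note that $\beta$ is the frequency representation of the interaural time delay. If there is only one source signal coming from a specific direction $\theta$, a Gaussian mixture model (GMM) can be employed to characterise the above bimodal features with the independence assumption between IPD and ILD:

$$p\big(\alpha(t,f), \beta(t,f) \mid \Psi_\theta\big) = \sum_{\tau} \phi_{\tau|\theta}\, \mathcal{N}\big(\beta(t,f) \mid \xi_{\tau,f|\theta}, \sigma^2_{\tau,f|\theta}\big)\, \mathcal{N}\big(\alpha(t,f) \mid \mu_{f|\theta}, \eta^2_{f|\theta}\big), \tag{2}$$

where $\phi_{\tau|\theta}$ is the prior for a signal coming from azimuth $\theta$ to yield the delay $\tau$, and $\Psi^{\mathrm{IPD}}_\theta = \{\xi_{\tau,f|\theta}, \sigma^2_{\tau,f|\theta}\}$ contains the frequency-, azimuth- and delay-dependent mean $\xi_{\tau,f|\theta}$ and variance $\sigma^2_{\tau,f|\theta}$, while $\Psi^{\mathrm{ILD}}_\theta = \{\mu_{f|\theta}, \eta^2_{f|\theta}\}$ consists of the frequency- and azimuth-dependent mean $\mu_{f|\theta}$ and variance $\eta^2_{f|\theta}$. The parameter set, i.e. $\Psi_\theta = \{\phi_{\tau|\theta}, \Psi^{\mathrm{IPD}}_\theta, \Psi^{\mathrm{ILD}}_\theta\}$, can be learned from binaural recordings containing only one signal from azimuth $\theta$. Fig. 2 shows an example of parameter set $\Psi_\theta$.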
For multiple sources coming from directions $\theta_i$, $i = 1, 2, \ldots, I$, based on the sparsity assumption that there is only one dominant signal at each TF point, we can extend Eqn. 2 to

$$p\big(\alpha(t,f), \beta(t,f) \mid \{w_i, \Psi_{\theta_i}\}_{i=1}^{I}\big) = \sum_{i=1}^{I} w_i \sum_{\tau} \phi_{\tau|\theta_i}\, \mathcal{N}\big(\beta(t,f) \mid \xi_{\tau,f|\theta_i}, \sigma^2_{\tau,f|\theta_i}\big)\, \mathcal{N}\big(\alpha(t,f) \mid \mu_{f|\theta_i}, \eta^2_{f|\theta_i}\big), \tag{3}$$

where $w_i$ is the weight of the $i$-th source coming from $\theta_i$. The weight varies with the relative energy of each source in the mixture, e.g. the SNR level for one-target-one-masker cases. Given mixtures with known source directions $\theta_i$, we can obtain the weight $w_i$ for each source at each TF point using an iterative expectation maximisation (EM) process. Note that, unlike the EM process usually applied to GMMs, which estimates all the parameters, $\Psi_{\theta_i}$ is fixed to the parameter set directly extracted from the binaural mixture that contains only one source from direction $\theta_i$. This avoids the overfitting problem that arises when one signal is much weaker than the others and fails to yield enough dominant features for convergence.
When applying the trained BSS model $\{w_i, \Psi_{\theta_i}\}_{i=1}^{I}$ to new binaural recordings, the TF separation mask for the source coming from $\theta_i$ is generated as $M_i(t,f) = \sum_{\tau} p(t,f,i,\tau)$, which is applied to both $L(t,f)$ and $R(t,f)$ to obtain the final binaural source estimate.
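The mask computation can be illustrated for a single TF point. The following is a minimal sketch, not the authors' implementation: it reads p(t, f, i, τ) as the posterior probability of source i and delay τ given the observed ILD and IPD, and it wraps the IPD residual to [-π, π) before scoring it. Both are assumptions. The function names and the dictionary layout of the parameters are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def wrap_phase(x):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


def source_masks(ild, ipd, sources):
    """Mask value of every source at one TF point.

    Each entry of `sources` is a dict (hypothetical layout) with
      'w'     : mixture weight w_i of the source,
      'prior' : delay priors phi_{tau|theta_i}, one per candidate delay,
      'ipd'   : (mean, variance) of the IPD for each candidate delay,
      'ild'   : (mean, variance) of the ILD,
    all taken at the frequency bin of the TF point.
    """
    joint = []
    for src in sources:
        ild_lik = gaussian_pdf(ild, *src["ild"])
        joint.append([
            # Assumption: the IPD residual is wrapped before it is scored.
            src["w"] * phi * gaussian_pdf(wrap_phase(ipd - mean), 0.0, var) * ild_lik
            for phi, (mean, var) in zip(src["prior"], src["ipd"])
        ])
    total = sum(sum(row) for row in joint)
    # Normalising gives the posterior; summing over delays gives the mask.
    return [sum(row) / total for row in joint]
```

Because the posteriors are normalised over all sources and delays, the mask values at each TF point sum to one, so the energy of the mixture is shared between the estimated sources rather than duplicated.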

Binaural objective intelligibility metrics
One recent and two standard OIMs with their binaural extensions were adopted as the backend intelligibility predictors.
The binaural distortion-weighted glimpse proportion (BiDWGP). BiDWGP consists of two main components. The first one accounts for the local audibility of speech in noise by quantifying the number of speech regions with local SNR above a certain threshold, known as 'glimpses', on the spectrotemporal excitation pattern (STEP, [14]). The second component measures the effect of masker-induced distortions on the speech envelope. To model binaural listening, glimpses and the frequency-dependent distortion factors are computed for both ears. The binaural interaction is accounted for by applying the gain computed as the binaural masking level difference (BMLD, [15,16]) to the speech STEP when glimpses are defined. The better-ear effect is then simulated by combining glimpses from the two ears. The final intelligibility index is the sum of the numbers of glimpses in each frequency band, weighted by the distortion factor and band importance function. See [8] for more details. Note that in this study it is assumed that the binaural signals of the sources are directly accessible; the stage of estimating binaural signals from the single-channel signals used in [8] is omitted in the current implementation. Further comparisons of the outputs of the two implementations confirmed almost identical results.
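The glimpsing component with better-ear combination can be sketched as follows. This is a simplified illustration only: it omits the BMLD gain, the distortion weighting and the band importance function, it assumes a 3 dB local SNR threshold, and it combines the ears by counting a region as glimpsed when either ear glimpses it. None of these choices is confirmed by the description above, and the function name is hypothetical.

```python
def better_ear_glimpse_proportion(speech_l, noise_l, speech_r, noise_r,
                                  threshold_db=3.0):
    """Fraction of TF regions glimpsed by at least one ear.

    Every input is a list of frequency bands, each band being a list of
    per-frame excitation levels in dB.
    """
    glimpsed = 0
    total = 0
    for band in range(len(speech_l)):
        for frame in range(len(speech_l[band])):
            left = speech_l[band][frame] - noise_l[band][frame] > threshold_db
            right = speech_r[band][frame] - noise_r[band][frame] > threshold_db
            # Assumed better-ear rule: a glimpse in either ear counts once.
            glimpsed += 1 if (left or right) else 0
            total += 1
    return glimpsed / total
```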
The binaural Speech Intelligibility Index (BiSII). BiSII extends its monaural standard measure, the Speech Intelligibility Index [1], to account for the better-ear effect and binaural interaction in binaural listening [6]. The apparent SNR in each frequency band is computed for the two ears, taking the larger SNR between the two ears as the binaural SNR for that frequency. The frequency-specific BMLD gains are then added to the SNRs to obtain the effective SNRs, which are used for the final intelligibility index calculation. Otherwise, the implementation follows the standard procedure as described in [1].
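The better-ear and BMLD steps can be sketched as below. This is a reduced illustration under stated assumptions: it keeps only the band audibility mapping of the SII, in which an SNR between -15 and +15 dB is mapped linearly to the range 0 to 1, and leaves out the other parts of the standard procedure such as hearing thresholds, spread of masking and level distortion. The function name and argument layout are hypothetical.

```python
def better_ear_sii(snr_left_db, snr_right_db, bmld_db, band_importance):
    """Simplified binaural SII from per-band SNRs (dB) at each ear."""
    index = 0.0
    for s_l, s_r, gain, weight in zip(snr_left_db, snr_right_db,
                                      bmld_db, band_importance):
        # Better-ear SNR plus the band-specific BMLD gain.
        effective_snr = max(s_l, s_r) + gain
        # Band audibility: linear in SNR between -15 and +15 dB.
        audibility = min(max((effective_snr + 15.0) / 30.0, 0.0), 1.0)
        index += weight * audibility
    return index
```

With band importance weights that sum to one, the returned index lies between 0 and 1.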
The binaural Speech Transmission Index (BiSTI). An extension was introduced in [7] to enable the STI [2] to predict binaural intelligibility. Similarly to BiSII, for the better-ear effect the modulation transfer functions (MTFs) for each frequency band are calculated separately for both ears, and the larger value is then considered as the binaural MTF for that channel. The gain due to the binaural interaction is computed for frequencies of 0.5, 1 and 2 kHz using a method based on interaural cross-correlation. More details are described in [7]. In the implementation used in this study, the standard framework of the STI calculation [2] is followed, except that the MTF is calculated using a phase-locked method [17], with a revised normalisation term [18].
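The better-ear selection on the MTF can be sketched for a single band and modulation frequency. This is an illustration under stated assumptions, not the implementation of [7]: it applies the conventional STI conversion from a modulation transfer value m to an apparent SNR, 10 log10(m / (1 - m)), clipped to ±15 dB and mapped to a transmission index, and it omits the cross-correlation-based interaction gain, the averaging over modulation frequencies and the band weighting. The function name is hypothetical.

```python
import math


def better_ear_transmission_index(mtf_left, mtf_right):
    """Transmission index from the larger of the two ears' MTF values."""
    m = max(mtf_left, mtf_right)
    # Keep m strictly inside (0, 1) so that the logarithm is defined.
    m = min(max(m, 1e-6), 1.0 - 1e-6)
    apparent_snr_db = 10.0 * math.log10(m / (1.0 - m))
    apparent_snr_db = min(max(apparent_snr_db, -15.0), 15.0)
    return (apparent_snr_db + 15.0) / 30.0
```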

Experiments
The binaural samples used for testing were drawn from [8]. Harvard sentences were mixed with a stationary noise masker (SSN: speech-shaped noise) or a fluctuating noise masker (CS: female competing speech) at two SNR levels: -9 and -6 dB for SSN; -18 and -15 dB for CS. Both speech and masker sources were placed on a circle of 2-metre radius. While the speech source was fixed ahead of the listener (θt = 0°), the location of the masker varied in azimuth over θm ∈ [0, −10, 20, −30, 60, −90, 90, 120, −150, 180]° on a horizontal plane. The virtual anechoic sound field was simulated by convolving the single-channel signals with the corresponding binaural room impulse responses. In total, 32 conditions (2 maskers × 2 SNRs × 8 masker locations) were tested.
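Mixing at a preset SNR amounts to rescaling one signal relative to the other. The sketch below scales the masker so that a speech and masker pair reaches a target SNR. It assumes the SNR is defined on mean signal power of the single-channel signals, which the description above does not state, and the function names are hypothetical.

```python
import math


def mean_power(x):
    """Mean squared amplitude of a signal given as a list of samples."""
    return sum(v * v for v in x) / len(x)


def snr_db(speech, masker):
    """Speech-to-masker power ratio in dB."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(masker))


def scale_masker_to_snr(speech, masker, target_snr_db):
    """Return the masker rescaled so that snr_db(speech, result) hits the target."""
    wanted_masker_power = mean_power(speech) / (10.0 ** (target_snr_db / 10.0))
    gain = math.sqrt(wanted_masker_power / mean_power(masker))
    return [gain * v for v in masker]
```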
For each of the 32 conditions, a BSS model was trained offline. The required parameter set in Eqn. 2 was first calculated from the binaural mixtures for the target speech at θs and for the masker at each θm, respectively. Fig. 2 illustrates the learned parameter set for the competing speech masker at θm = π/3.
After being processed by the BSS algorithm, the separated signals were then fed into the three binaural OIMs separately for calculating objective intelligibility scores.As the reference, objective scores were also computed from the signals of ground truth, which are referred to as the direct signals.

Evaluation
Subjective listening tests [8] were first carried out in a word identification task for each of the simulated mixing conditions introduced earlier, which involved a group of 14 native British English speakers with normal hearing. The word-recognition score is used as the measure of subjective binaural speech intelligibility.
The 220 sentences used in the subjective listening tests in [8] were processed by the BSS for each of the 32 conditions. The upper row of Fig. 3 displays the difference between the direct signal and the separated signal in terms of SNR and interaural SNR difference (ISD), defined as ∆X = Xdirect − Xseparated, where X denotes the measurement used. The results suggest that while the spatial cues are well preserved (|∆ISD| < 0.5 dB), the BSS algorithm tends to underestimate the SNR level of the separated signals in all conditions. The average SNRs of the separated signals across masker locations and preset SNR levels are 1.7 dB lower in SSN and 2.7 dB lower in CS. Such underestimations are especially prominent when the masker is further off the central axis of the listener, i.e. 60°, ±90° and 120°, as well as 10° in CS.
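The two difference measures can be sketched as follows. The exact definitions of the binaural SNR and of the ISD are not given above, so this sketch makes two assumptions: the overall SNR is taken as the mean of the per-ear SNRs in dB, and the ISD is the left-ear SNR minus the right-ear SNR. All names are hypothetical.

```python
import math


def snr_db(speech, noise):
    """Speech-to-noise power ratio in dB for one ear."""
    p_s = sum(v * v for v in speech) / len(speech)
    p_n = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(p_s / p_n)


def snr_and_isd(speech_l, noise_l, speech_r, noise_r):
    """Return (SNR, ISD) in dB for one binaural speech and noise pair."""
    snr_l = snr_db(speech_l, noise_l)
    snr_r = snr_db(speech_r, noise_r)
    # Assumed definitions: mean across ears, and left minus right.
    return 0.5 * (snr_l + snr_r), snr_l - snr_r


def delta_measures(direct, separated):
    """Delta X = X_direct - X_separated for X in (SNR, ISD).

    `direct` and `separated` are tuples (speech_l, noise_l, speech_r, noise_r).
    """
    snr_d, isd_d = snr_and_isd(*direct)
    snr_s, isd_s = snr_and_isd(*separated)
    return snr_d - snr_s, isd_d - isd_s
```

A positive ∆SNR under this convention means that the separated signals have a lower SNR than the direct signals.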
The lower row of Fig. 3 presents the ∆OIM for the three OIMs. Overall, the predictive patterns when using the separated signals reflect the impact of the SNR underestimation. For individual OIMs, the Euclidean distance between the objective scores calculated from the direct signals and the separated signals (row 'Raw' in Table 1) was computed for individual maskers (SSN and CS) separately and for all 32 conditions (overall). Predictions of BiDWGP and BiSII, which quantify masked audibility directly using signal energy, deviate considerably from those obtained with the direct signals. The BiSTI metric, which measures modulation reduction, appears to be less sensitive to the SNR underestimation. Nevertheless, the effect of the masker location is clear for all OIMs, especially in SSN.
The objective predictions of the OIMs were further compared to the mean word identification rate (ranging between 20% and 93%, [8]) of 14 native British English speakers in the 32 conditions. The linear relationship between the objective and subjective intelligibility is measured as the Pearson correlation coefficient ρ and the standard deviation of the prediction error, defined as σe = σd √(1 − ρ²), where σd is the standard deviation of listener scores per condition. Table 2 exhibits the performance for all OIMs in sub-conditions, with the first shaded row showing the performance when the direct signals are used, as the 'benchmark'. If the benchmark correlations are assumed to be the true performance of each OIM, any higher or lower correlation relative to the benchmark should be caused by the errors of the BSS algorithm when predicting intelligibility from the separated signals.
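The two evaluation statistics can be computed as in the following sketch, which uses only the definitions given above. The function names are hypothetical, and the clamp inside `sigma_e` is an added safeguard against rounding rather than part of the definition.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sigma_e(rho, sigma_d):
    """Standard deviation of the prediction error, sigma_d * sqrt(1 - rho^2)."""
    # Clamp at zero in case rounding pushes rho marginally above 1.
    return sigma_d * math.sqrt(max(0.0, 1.0 - rho ** 2))
```

For a given σd, a higher |ρ| always yields a smaller σe, so the two statistics rank the OIMs identically within one set of listener scores.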
With the separated signals (row 'Raw' in Table 2), BiSII can still maintain a linear relationship with listener performance reasonably well compared to its benchmarks, although it has produced smaller intelligibility indices than those with direct signals.However, while BiSTI only preserves its predictive

Selective SNR compensation
With the separated signals, the objective predictions seem more sensitive to the SNR underestimation when the masker is at 60°, ±90° and 120° than at other locations, as illustrated in Fig. 3. A solution would be to apply a compensating gain to the separated speech signal only in these conditions. A set of 50 different sentences from the same corpus was used to explore the optimal gain. The optimisation was to maximise the Pearson correlation between the predictions obtained with the direct signals and with the separated signals. The higher the correlation, the closer the performance for the separated signals was to the true performance. The optimisation was conducted on the same two noise maskers but with an extended SNR range from -12 to 0 dB for SSN and -18 to -6 dB for CS, in 3-dB steps. The examined values for the gain were from 0.5 to 3 dB in 0.5-dB steps. The final optimal value of 1.5 dB was chosen as the point at which the mean correlation across the sub-conditions (SSN, CS and overall) was the best, based on the mean performance across all three OIMs. However, it is worth noting that this procedure can only optimise the linear relationship between the predictions using the two approaches; it may not necessarily reduce the distance between the two types of predictions (see row 'Sel.Comp.' of Table 1).
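The gain search can be sketched as a small grid search. This is a simplified illustration under stated assumptions: a gain of g dB applied to the separated speech is modelled as a g dB increase of the condition's SNR, a single hypothetical `predict` function stands in for an OIM, and only one correlation is maximised instead of the mean over sub-conditions and over the three OIMs.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_selective_gain(direct_scores, separated_snrs_db, compensate,
                        predict, gains_db):
    """Pick the gain that maximises correlation with the direct predictions.

    compensate[k] is True for the conditions that receive the gain.
    predict is a stand-in OIM mapping an SNR in dB to a score.
    """
    best_gain, best_rho = None, -math.inf
    for gain in gains_db:
        scores = [predict(snr + gain) if flag else predict(snr)
                  for snr, flag in zip(separated_snrs_db, compensate)]
        rho = pearson(direct_scores, scores)
        if rho > best_rho:
            best_gain, best_rho = gain, rho
    return best_gain, best_rho
```

As noted in the text, a search of this kind optimises only the linear relationship between the two sets of predictions and does not by itself reduce their distance.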
As demonstrated in row 'Sel.Comp.' of Table 2, by applying a constant 1.5-dB gain to the separated speech signals in the conditions where the masker is at 60°, ±90° and 120°, a remarkable improvement in the performance of BiDWGP was achieved, making it almost as accurate (ρ = 0.89) as when predicting from the direct signals (ρ ≥ 0.90). While BiSII lost some accuracy in CS (from 0.84 to 0.78), BiSTI maintained its benchmark performance for individual maskers. Nevertheless, all the OIMs still lack some robustness for cross-masker prediction

Conclusions
Three OIMs were employed to predict binaural speech intelligibility from the BSS-separated signals. Overall, except for across-masker prediction, the OIMs may provide similar predictive accuracy to their benchmark performance measured from the direct signals. At the output of the BSS algorithm, the SNR between the separated signals tends to be underestimated, especially when the masker is at 60°, ±90° and 120° in SSN as well as 10° in CS. The ideal SNR rectification does not recover the true performance for the OIMs, revealing that errors in SNR preservation are not the only issue for OIMs in making reliable intelligibility predictions from the separated signals; other aspects, such as the artefacts resulting from the BSS algorithm, may also play a role. The fact that selective SNR compensation largely benefited BiDWGP but not BiSII implies that, due to their different mechanisms, compensation to the estimated speech signal may need to be optimised individually for each specific OIM for best performance.
Further work will focus on identifying the kinds of distortions introduced by the BSS that OIMs cannot account for and that hence reduce their predictive power. In particular, we should investigate the relationship between these distortions and the different mechanisms of OIMs, and exploit this relationship in practical use. For real-time processing there may be insufficient information on which to train the BSS model. Thus, a localisation model could also be employed at the early stage of the pipeline in order to estimate the source location. In addition, for an appropriate BSS model, statistics of the masker also need to be learnt online.

Figure 1:
Fig. 1 illustrates the framework of the proposed system. A BSS algorithm [11,12,13] is applied to extract both the target and masker signals. To implement real-time source separation and intelligibility prediction, the separation model is trained offline. The training data can be obtained at the stage of sound check

Figure 3: Comparisons of SNR and ISD levels (upper) and OIM predictions (lower) between the direct signal and that separated from SSN (left) and CS (right), calculated as the mean across the 220 sentences for each condition.

Table 1: Euclidean distance between the objective intelligibility scores computed from the direct and the separated signals.
Having observed that the BSS algorithm leads to lower SNR for the separated signals, we also investigated the model performance when the SNR of the separated signals was rectified to match that of the direct signals. This process effectively reset ∆SNR to 0 dB for all conditions, resulting in the shortened Euclidean distance shown in row 'SNR Rec.' of Table 1. The performance of each OIM is displayed in row 'SNR Rec.' of Table 2. BiSII and BiSTI achieved similar performance to their own benchmarks for individual maskers; there are some improvements in the performance of BiDWGP. However, perfectly restoring the SNR of the separated signals is almost impossible in practice without prior knowledge of the true SNR.
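The SNR rectification can be sketched as a single rescaling step. The text does not say which of the separated signals is rescaled, so this sketch assumes that the correction is applied to the separated speech while the separated masker is left unchanged. The function names are hypothetical.

```python
import math


def snr_db(speech, noise):
    """Speech-to-noise power ratio in dB."""
    p_s = sum(v * v for v in speech) / len(speech)
    p_n = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(p_s / p_n)


def rectify_snr(separated_speech, separated_noise, direct_snr_db):
    """Rescale the separated speech so the pair matches the direct SNR."""
    correction_db = direct_snr_db - snr_db(separated_speech, separated_noise)
    # An amplitude gain of 10^(dB/20) changes the power ratio by `correction_db`.
    gain = 10.0 ** (correction_db / 20.0)
    return [gain * v for v in separated_speech]
```

The step requires the SNR of the direct signals as an input, which is why it serves as an idealised reference rather than a practical compensation scheme.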

Table 2: Objective-subjective correlation coefficients ρ (σe) obtained using the direct signals (in grey) and the separated signals.
(i.e. overall) using separated signals. For BiDWGP, given its high benchmark overall correlation (ρ = 0.90), this is presumably due to the inconsistent distance shift from the objective scores computed from the direct signals in different maskers, as read from row 'Sel.Comp.' of Table 1.