Philippa Demonte
Downward Dynamic Range Expansion versus Ducking: Test of Speech Intelligibility Data
Authors
Demonte, Philippa
Abstract
These data were collected for the PhD thesis by P. Demonte (2022), which evaluated the effectiveness of several object-based audio approaches for improving speech intelligibility, primarily for application to broadcast audio.
This particular investigation (one of four) determined, using both a subjective, quantitative speech-in-noise test** and objective intelligibility metrics, that foreground speech intelligibility can be significantly improved by applying downward dynamic range expansion (DDRE) to the background sounds, including background music. In contrast to ducking (linear attenuation of the background relative to the foreground dialogue), DDRE allows the narrative intent of the background sounds to be retained while providing more space for the foreground dialogue.
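The contrast between ducking and DDRE can be illustrated with their static gain curves. The following is a minimal sketch only: the function names, the threshold, ratio, and attenuation values are hypothetical and are not the settings used in the experiment, and the attack and release times of a real expander are ignored.

```python
def ducking_gain_db(level_db: float, attenuation_db: float = 6.0) -> float:
    """Ducking: the same attenuation is applied whatever the background level.

    attenuation_db is a hypothetical value chosen for illustration.
    """
    return -attenuation_db


def ddre_gain_db(level_db: float, threshold_db: float = -30.0, ratio: float = 2.0) -> float:
    """Downward expansion: levels at or above the threshold pass unchanged;
    levels below it are lowered by a further (ratio - 1) dB per dB of shortfall.

    threshold_db and ratio are hypothetical values chosen for illustration.
    """
    if level_db >= threshold_db:
        return 0.0
    return (level_db - threshold_db) * (ratio - 1.0)


# Compare the gain applied to loud and quiet passages of a background sound.
for level in (-20.0, -30.0, -40.0, -50.0):
    print(level, ducking_gain_db(level), ddre_gain_db(level))
```

Under these assumed settings, ducking lowers every passage by the same amount, whereas DDRE leaves the louder passages untouched and attenuates only the quieter ones.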
** Speech-in-noise test (SINT): a listening experiment in which participants hear spoken sentences played simultaneously with background sound and are asked to write down or repeat aloud each sentence as heard and understood. Correct word scores, converted to percentages, then act as a quantitative proxy for speech intelligibility.
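As a rough illustration of how a correct word score might be turned into a percentage, the sketch below counts the target words found in a response. The target words, the response, and the matching rule (exact match after lower-casing and stripping punctuation) are assumptions made for this example; the scoring rules actually applied in the experiment are not stated here.

```python
import string


def correct_word_score(heard: str, targets: list[str]) -> int:
    """Count how many target words appear in the participant's response."""
    cleaned = heard.lower().translate(str.maketrans("", "", string.punctuation))
    heard_words = set(cleaned.split())
    return sum(1 for word in targets if word.lower() in heard_words)


def to_percentage(score: int, n_targets: int = 5) -> float:
    """Convert a correct word score into a percentage of the target words."""
    return 100.0 * score / n_targets


# Hypothetical target words and response, for illustration only.
targets = ["birch", "canoe", "slid", "smooth", "planks"]
heard = "The birch canoe slid on the smooth boards."

score = correct_word_score(heard, targets)
print(score, to_percentage(score))  # 4 80.0
```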
This speech-in-noise test used spoken sentences from a re-recorded version of the HARVARD speech corpus, which can be found in another collection by P. Demonte on the Salford Figshare.
The tabs of the Excel spreadsheet are as follows:
* Summary
- overview of the independent variables
- playback settings
- averaged amplitude statistics of the listening stimuli audio files
- research questions
- key results, statistical analyses, and interpretations, including the effects on foreground speech intelligibility of applying: 1) DDRE to the background sounds, 2) ducking to the background sounds, and 3) DDRE versus ducking
* Combined Data
- the raw data collected from the speech-in-noise test and the associated objective intelligibility metrics
- PID is the anonymised participant ID number
- PO is the trial number
- Sentence shows the HARVARD speech corpus list number and sentence number within that list
- Masker: the background sound, either background music or speech-modulated noise (SMN), with DDRE, ducking, or no audio engineering processing applied; the speech-to-noise ratio (SNR) at which the masking noise was initially set; and the DDRE settings used (DDRE threshold, DDRE ratio, attack time, release time, etc.), or the equivalent settings for ducking
- Heard is the sentence as heard and cognitively understood by the participant
- CWS is the correct word score (out of 5)
- CWF is the correct word score as a ratio
- TARGET1 to TARGET5 are the target words in each sentence
* TotalWordScores
- the speech-in-noise test data collated by participant (rows) versus the masking noise and audio engineering application (columns)
* WordRecognition%
- the speech-in-noise test data converted from total word scores to percentages
* StudentizedResiduals
- a test for outliers in the collated experiment data
* ShapiroWilk
- a test of whether the collated experiment data meet the assumption of normal distribution
* MaskerDDRE
- Mauchly's test of sphericity
- two-way repeated-measures ANOVA (RMANOVA): statistical analysis of the collated experiment data to determine whether the background sound type, the downward dynamic range expansion settings applied to the background sound, or both have a significant effect on foreground speech intelligibility
* MaskerDucking
- same as for MaskerDDRE, but looking at the effect on speech intelligibility of different ducking settings applied to the two different background sounds
* PairedTtest
- comparing the effect on speech intelligibility of applying DDRE to the background sound versus applying ducking
* AMP_Stats
Amplitude statistics of all the masking noise .wav audio files relative to the speech .wav audio files, as determined using a MATLAB script (incorporating Cooke and Tang's [2016] HEGP objective intelligibility metrics) and Adobe Audition:
- speech-to-noise ratio (SNR) (dB SNR)
- glimpse proportion (GP) of the corresponding spoken sentence relative to the background noise
- high energy glimpse proportion (HEGP) of the corresponding spoken sentence relative to the background noise
- Peak Gain
- Total root mean square
- Maximum root mean square
- Minimum root mean square
- Dynamic Range
- Loudness
* HEGP
- averaged high energy glimpse proportion of the speech versus background sounds played to each participant
* DR_dB, LoudnessLUFS, PeakGain_dB, TotalRMS_dB
- as for the HEGP tab, but for Dynamic Range, Loudness, Peak Gain, and Total root mean square
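For readers unfamiliar with the amplitude statistics listed under AMP_Stats, the sketch below shows one generic way to derive a speech-to-noise ratio in dB from the total root mean square of two signals. It is not the MATLAB script or the Adobe Audition procedure used to produce the dataset; the function names, the toy sample values, and the full-scale reference of 1.0 are assumptions for illustration.

```python
import math


def rms(samples):
    """Total root mean square of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def to_db(value, reference=1.0):
    """Express an amplitude in dB relative to an assumed reference of 1.0."""
    return 20.0 * math.log10(value / reference)


def snr_db(speech, noise):
    """Speech-to-noise ratio: speech RMS level minus noise RMS level, in dB."""
    return to_db(rms(speech)) - to_db(rms(noise))


# Toy signals: the speech has twice the amplitude of the masking noise.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.25, -0.25, 0.25, -0.25]

print(round(snr_db(speech, noise), 2))  # 6.02
```

Doubling the amplitude of the speech relative to the masker corresponds to an increase of about 6 dB in SNR.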
Accompanying this particular investigation was a two-alternative choice (2-AC) listening experiment. In each trial, participants listened to a pair of audio files with DDRE, ducking, or no audio engineering manipulation applied to the background sound, and selected which one they: 1) perceived to have the better sound quality, and 2) preferred. The quantitative, subjective data collected are available under the P. Demonte profile on Salford Figshare.
For further information, contact:
p.demonte@edu.salford.ac.uk
philippademonte@gmail.com
Online Publication Date | Sep 13, 2022 |
---|---|
Publication Date | Sep 13, 2022 |
Deposit Date | Jan 23, 2025 |
DOI | https://doi.org/10.17866/rd.salford.21078049.v1 |
Publisher URL | https://salford.figshare.com/articles/dataset/Downward_Dynamic_Range_Expansion_versus_Ducking_Test_of_Speech_Intelligibility_Data/21078049 |
Collection Date | Sep 13, 2022 |