Hidden auxiliary media channels in audio signals by perceptually insignificant component replacement

This paper proposes a method for the formation of an auxiliary media channel within a host signal. Using a psychoacoustic frequency masking model, perceptually insignificant subband components of the host audio signal are identified and removed. The auxiliary channel data are placed in the empty subbands in the host signal and scaled to a level below the audible threshold. An implementation is given along with results suggesting that the proposed method can effectively hide an auxiliary media channel in a normal audio signal without degrading the perceived sound quality.


INTRODUCTION
Data hiding and steganography techniques have attracted much attention in recent years [1]. For audio signals, frequency masking curves provide a means of embedding data below the perceptual threshold of the host signal [2,3]. The feasibility of a "sub channel" making use of "an audio channel's capacity below the perceptual threshold" was postulated by Ding [4]; however, no tangible proofs or results were reported. In the proposed method, perceptually insignificant components of the host signal are first removed and then replaced with new data. Figure 1 shows how components of a signal lie below the perceptual threshold. This allows for a clean separation of the host and embedded signals.

An auxiliary media channel (AMC) has many applications: it may be used for metadata, graphics or additional audio features such as foreign languages. Such channels usually rely on additional bandwidth, additional storage space and/or special codecs. The proposed method differs from other audio steganography techniques that focus on secure covert communications and embed short utterances in longer signals [5]. The proposal is also substantially different from schemes such as Dolby Pro Logic, in that the additional content is masked before decoding and can be used to carry various forms of data. A standard audio signal from which an auxiliary channel can be extracted when needed has significant advantages. This paper extends the idea presented in [4] for use with 'CD quality' audio signals, detailing a proposed implementation and example usage.
Thus, a negative value of $SMR_{sb}(n)$ indicates a perceptually insignificant subband. The set of perceptually insignificant subbands, $R$, is defined as

$$R = \{\, n : SMR_{sb}(n) < 0 \,\}.$$

Stage 2 : Subband Decomposition
The DCTs of both $x(t)$ and $y(t)$ are obtained using Eq. 3. The coefficients $X_c(k)$ are arranged into subbands $X_{sb}(n, m)$ such that $k = 12(n-1)+m$, i.e. each subband, $n$, comprises 12 consecutive coefficients $m$ from $X_c(k)$, where $m = \{1, 2, \ldots, 12\}$.
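The index mapping $k = 12(n-1)+m$ and the selection of the insignificant subbands can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the signal to mask ratio is assumed to be the difference in dB between the maximum subband SPL and the minimum masking level.

```python
SUBBANDS = 32             # subbands per frame from the psychoacoustic model
COEFFS_PER_SUBBAND = 12   # consecutive DCT coefficients in each subband


def to_subbands(coeffs):
    """Rearrange X_c(k) into X_sb(n, m) with k = 12(n-1) + m (1-based in the text)."""
    if len(coeffs) != SUBBANDS * COEFFS_PER_SUBBAND:
        raise ValueError("expected 384 coefficients per frame")
    return [coeffs[n * COEFFS_PER_SUBBAND:(n + 1) * COEFFS_PER_SUBBAND]
            for n in range(SUBBANDS)]


def insignificant_subbands(l_sb, lt_min):
    """Return the 1-based subband numbers n for which SMR_sb(n) < 0.

    Assumption: SMR_sb(n) = L_sb(n) - LT_min(n), with both levels in dB.
    """
    return [n + 1 for n, (level, mask) in enumerate(zip(l_sb, lt_min))
            if level - mask < 0]
```

For example, with subband levels `[60, 40, 50]` dB and masking levels `[50, 45, 50]` dB, only the second subband has a negative ratio, so `insignificant_subbands` returns `[2]`.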

Stage 3 : Removal
The perceptually insignificant subbands are removed by zeroing their coefficients as in Eq. 4.

Stage 4 : Auxiliary Signal Formation
The subband data of the AMC need to be rearranged such that their energy content is located in the empty spaces of the host signal, as in Eq. 5. Figure 3 shows an example of the DCT coefficients of the auxiliary signal compared with those of the host signal with subbands removed.
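Stages 3 and 4 can be sketched in the subband domain as below. The function names are hypothetical, and the sketch assumes that auxiliary subbands are mapped, in order, onto the sorted entries of the set of insignificant subbands; the exact mapping of Eq. 5 may differ.

```python
def remove_subbands(host_sb, r):
    """Stage 3: zero the coefficients of every host subband listed in r (1-based)."""
    drop = set(r)
    return [[0.0] * len(row) if n + 1 in drop else list(row)
            for n, row in enumerate(host_sb)]


def form_auxiliary(aux_sb, r, n_subbands=32):
    """Stage 4: place auxiliary subband i into the i-th empty host subband.

    Assumption: auxiliary subbands 1..len(r) map onto sorted(r), in order.
    All other subbands of the rearranged auxiliary signal are zero.
    """
    width = len(aux_sb[0])
    out = [[0.0] * width for _ in range(n_subbands)]
    for i, n in enumerate(sorted(r)):
        out[n - 1] = list(aux_sb[i])
    return out
```

Because every subband is non-zero in at most one of the two outputs, the host and the rearranged auxiliary signal occupy disjoint sets of DCT coefficients.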

Stage 5 : Correction for Masking
To ensure the embedded AMC remains below the audible threshold, a correction factor, $\alpha$, has to be applied. This value is obtained by comparing the maximum SPL in each subband of $y^*(t)$, $L^*_{sb}(n)$, with the minimum masking level of $x(t)$, $LT_{min}(n)$. $L^*_{sb}(n)$ is obtained by calculating the PSD of $y^*(t)$ according to Eq. 7.
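One plausible form of this correction is sketched below. Both formulas in it are assumptions made for illustration and are not taken from the paper: the per-subband ratio is taken as the auxiliary subband level minus the host masking level in dB, and the factor is taken as the amplitude gain that brings the worst-case subband down to the masking level. The function name is hypothetical.

```python
def correction_factor(l_star_sb, lt_min, r):
    """Return (alpha, beta) for the occupied subbands in r (1-based).

    Assumptions (not from the paper):
      ratio(n) = L*_sb(n) - LT_min(n), in dB
      beta     = max of ratio(n) over the subbands in r
      alpha    = 10 ** (-beta / 20), an amplitude gain
    """
    ratios = [l_star_sb[n - 1] - lt_min[n - 1] for n in r]
    beta = max(ratios)
    return 10 ** (-beta / 20), beta
```

Under these assumptions, an auxiliary subband that exceeds the mask by 20 dB in the worst case gives a factor of 0.1, so the scaled auxiliary signal just reaches the masking level in that subband and lies below it elsewhere.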
An auxiliary signal to mask ratio, $ASMR_{sb}(n)$, is then determined for each subband. This allows the determination of the multiplication factor, $\alpha$, which is given in terms of $\beta = \max(ASMR_{sb}(n))$. Finally, the encoded signal, $s(t)$, is obtained.

Auxiliary Channel Extraction
The subband locations and the values of $\alpha$ are required at the decoding stage to enable successful extraction of the auxiliary media channel with no degradation in quality. It was suggested in [4] that component replacement would cause little difference to the masking curve, such that it could be recalculated by the decoder. This is contested in [8], although to the best of the authors' knowledge practical results have yet to be published for either hypothesis. A study into the feasibility of this method is too lengthy for inclusion in this paper. The extraction method presented here is based on the assumption that the decoding data discussed above are transmitted to the receiver by any valid method.
To extract the AMC, the DCT of $s(t)$, $S_c(k)$, is obtained using Eq. 3 and rearranged into $S_{sb}(n, m)$. Assuming that $R$ and $\alpha$ are known, the auxiliary audio channel $s^*$ can be determined as

$$S^*_{sb}(n, m) = S_{sb}(n, m), \quad n \in R,$$

with all other values of $S^*_{sb}(n, m)$ equal to zero. $S^*_c(k)$ is constructed by rearranging $S^*_{sb}(n, m)$ such that $k = 12(n-1)+m$. Taking the IDCT of $S^*_c(k)$ as in Eq. 6 and dividing by $\alpha$ yields the extracted auxiliary channel, $y^*(t)$.
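The extraction step can be sketched in the subband domain as follows. The function names are hypothetical. The division by the scaling factor is applied to the DCT coefficients here, which is equivalent to dividing after the inverse transform because the IDCT is linear; the IDCT itself is omitted.

```python
def extract_auxiliary(s_sb, r, alpha):
    """Keep only the subbands listed in r (1-based), undo the scaling by alpha,
    and set every other subband to zero."""
    keep = set(r)
    return [[c / alpha for c in row] if n + 1 in keep else [0.0] * len(row)
            for n, row in enumerate(s_sb)]


def from_subbands(sb):
    """Rearrange S*_sb(n, m) back into S*_c(k) with k = 12(n-1) + m."""
    return [c for row in sb for c in row]
```

As a usage example, if subband 3 of the encoded frame holds coefficients of 0.5 and the factor is 0.25, the extracted frame holds 2.0 in coefficients 25 to 36 and zero everywhere else.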

RESULTS AND TEST PROCEDURE
The case chosen for this paper was to embed a telephone quality speech signal into a CD quality music host signal. An initial test was carried out with 6 music clips (each 20 seconds in length) to investigate the number of perceptually insignificant subbands. The results are presented in Table 1. The bandwidth required to reproduce a telephone quality speech signal is 300-3400 Hz, which can be achieved by using the first 5 subbands (i.e. 0-3445 Hz). Using the same music clips, with $R$ restricted to the five smallest values, a 20 second speech clip was embedded as an auxiliary channel. During the process, the value of $\alpha$ was recorded, and the results are presented in Table 2. A value of $\alpha$ that is as large as possible is desirable, as it represents the strength of the auxiliary signal relative to the host, which will ultimately affect the robustness of the auxiliary channel. Comparison of the number of insignificant subbands against the $\alpha$ values for each type of music shows that a larger number of the former does not correspond to a larger value of the latter, i.e. more subbands does not mean greater AMC strength.
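The five-subband figure quoted above can be checked with a short calculation. The sketch assumes a sampling rate of 44.1 kHz for 'CD quality' audio, so that the 32 equal-width subbands span 0 to 22050 Hz; the names used are illustrative.

```python
import math

SAMPLE_RATE_HZ = 44100  # assumption: 'CD quality' sampling rate
N_SUBBANDS = 32         # equal-width subbands from the psychoacoustic model

# Each subband spans an equal share of the band up to the Nyquist frequency.
SUBBAND_WIDTH_HZ = (SAMPLE_RATE_HZ / 2) / N_SUBBANDS  # 689.0625 Hz


def subbands_needed(upper_hz):
    """Number of subbands, counted from 0 Hz, needed to cover upper_hz."""
    return math.ceil(upper_hz / SUBBAND_WIDTH_HZ)


n = subbands_needed(3400)
print(n, n * SUBBAND_WIDTH_HZ)  # prints: 5 3445.3125
```

Four subbands would reach only about 2756 Hz, so five is the smallest number that covers the 3400 Hz upper limit of telephone quality speech.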

Subjective Testing
A small test group of 6 listeners was used to determine whether the embedding of the auxiliary channel had any perceptual effect. The listeners were presented, in a randomized order, with clips of the sequences both with the perceptually insignificant subbands removed and with them replaced, and were asked whether they could distinguish the two, giving their answer in terms of how the second sequence compared to the first. The third example, Rock, was used as a control. As the auxiliary signal data were made to remain below the masking curve, it is assumed that the difference between the signal with subbands removed and that with subbands replaced would be inaudible. Results indicating that some listeners identified the encoded version as sounding superior suggest that listeners may in fact have had difficulty in distinguishing the two. Hence it may be the case that listeners are able to detect minor differences but cannot reliably conclude whether one is preferable to the other.

DISCUSSIONS AND CONCLUSIONS
This paper has identified a method for the formation of an auxiliary media channel in a standard audio signal, based on a method described as perceptually insignificant component replacement. An algorithm for the implementation of such a method has also been described in detail. It has been shown how the channel can be used to carry a reduced bandwidth speech signal. The DCT was used for subband removal and replacement due to its 'near FFT' performance and simplicity of manipulation. The well-known Psychoacoustic Model 1 was used to identify the perceptually insignificant subbands. Analysis of several pieces of music showed that on average 14-18 subbands could be regarded as perceptually insignificant, with the minimum always exceeding the five subbands required to carry the telephone quality speech signal. However, the number of available subbands was not the only criterion, as the signals with fewer available subbands allowed a greater strength of the embedded signal relative to the host. Subjective testing of the embedding stage confirmed that, whilst the auxiliary channel data remained below the masking curve, any differences would not be audible. Backwards compatibility is maintained, as the auxiliary media channel does not affect performance of the encoded signal on equipment without the decoding capabilities.

Fig. 3. Comparison of DCT for Host Signal with subbands removed and Auxiliary Signal. The x-axis represents the DCT coefficients.

Fig. 4. Masking Curve

The data output from the psychoacoustic model are in terms of 32 equal-width subbands $n$, where $n = \{1, 2, \ldots, 32\}$. The output gives the minimum masking level, $LT_{min}(n)$ dB, and the maximum Sound Pressure Level (SPL), $L_{sb}(n)$, normalized to a maximum per frame of 96 dB by addition of a value, $\Delta$, such that that allow removal and replacement. The host signal is represented by $x(t)$ and the auxiliary signal by $y(t)$, where $t$ represents a discrete time series. Stage 1 : Psychoacoustic Model.

Table 1. Example numbers of insignificant subbands

Table 3. Subjective Testing Results