Poster Presentations

Monday, Sept 5th, 15:30 – 17:00

Distributed recursive estimation of norm values for non-triviality constraint in adaptive distributed blind system identification

Matthias Blochberger, KU Leuven

Abstract: The problem of multichannel blind system identification (BSI) using the cross-relation (CR) formulation requires a non-triviality constraint on the solution. In other work, considering a wireless sensor network (WSN), we introduced an adaptive distributed algorithm using general-form consensus ADMM where the constraint manifests itself as a division operation by the norm of the network-wide multichannel-stacked solution vector. This operation necessitates repeated transmissions of norm values throughout the network until every network node has the norm information of all other nodes. Therefore, to reduce overall network transmission effort, we extend this algorithm with a recursive estimation scheme based on the iterative method of distributed averaging. In the extended algorithm, each node only requires norm information of its neighboring nodes to recursively estimate the norm value of the network-wide stacked solution vector. Preliminary results show that the extended algorithm provides estimation performance close to the non-extended one with fewer inter-node transmissions. We further show how network topology influences the performance of the algorithm.
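
The distributed averaging idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the recursive scheme of the poster: it is plain static consensus averaging, and it assumes (hypothetically) that each node holds the squared norm of its local part of the stacked solution vector, knows the number of nodes, and uses Metropolis weights. The function names and the ring topology are illustrative only.

```python
import math

def metropolis_weights(neighbors):
    """Build symmetric averaging weights from an undirected neighbor map."""
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i, nbrs in neighbors.items():
        for j in nbrs:
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        # Self-weight makes each row sum to one (diagonal is still 0 here).
        W[i][i] = 1.0 - sum(W[i])
    return W

def estimate_network_norm(local_sq_norms, neighbors, iterations):
    """Each node repeatedly averages with its neighbors only.

    The values converge to the mean of the local squared norms, so
    sqrt(n * value) approximates the norm of the network-wide vector.
    """
    n = len(local_sq_norms)
    W = metropolis_weights(neighbors)
    x = list(local_sq_norms)
    for _ in range(iterations):
        x = [sum(W[i][j] * x[j] for j in [i] + neighbors[i]) for i in range(n)]
    return [math.sqrt(n * xi) for xi in x]

# Hypothetical 4-node ring; local squared norms sum to 16, so the true norm is 4.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
estimates = estimate_network_norm([1.0, 4.0, 9.0, 2.0], ring, iterations=50)
```

In this sketch every node exchanges values with its direct neighbors only, which is the property the abstract relies on to reduce transmissions; how quickly the estimates agree depends on the topology.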

Low latency deep joint source channel audio coding

Mohammad Bokaei, Aalborg University

Abstract: In this work, we studied task-oriented real-time audio and speech compression and communication over wireless channels for hearing aids. For this, data received acoustically through the microphone array of the hearing device and by an external audio device are used. This external device is supposed to receive, record, and process the acoustic data and send it through a wireless channel to be received by the hearing aid device. The final task is to enhance the original speech using the acoustically received data at the hearing aid device and the audio data transmitted by the external device. More precisely, this work is focused on the wireless communication stage of this system. Real-time wireless communication is required to provide additional information about the target speech or noise environment by using wirelessly connected external audio devices. This, however, is not guaranteed by conventional communication systems. We introduced a DNN-based speech joint source and channel coding scheme to design a robust and low-latency wireless communication system. The numerical results illustrate the proposed method’s performance compared to state-of-the-art communication systems in terms of mean square error (MSE).

Sound zone control with variable span trade-off filter and kernel based sound field interpolation

Jesper Brunnström, KU Leuven

Abstract: An extension to the sound zone control method variable span trade-off filter (VAST) is proposed, using kernel-based sound field interpolation to control sound over continuous regions. The traditional VAST controls sound at discrete control points, for each of which a room impulse response estimate from all loudspeakers is required. To apply the traditional VAST to larger regions, the control points must be densely distributed over the space, leading to a large number of required room impulse responses. Using sound field interpolation instead, it is possible to maintain good control performance while requiring fewer control points. The kernel-based sound field interpolation allows for arbitrarily spaced arrays, and can be computed ahead of time, leading to a low computational cost during operation. Assuming prior knowledge of the loudspeaker positions, a directional weighting can be added, improving interpolation performance and therefore also the sound zone control. The methods are evaluated by simulation in a reverberant space, where the proposed method with directional weighting outperforms the traditional VAST over the full bandwidth of the signal.

Sound zones creation using prior knowledge of room impulse responses

José Miguel Cadavid Tobón, Aalborg University

Abstract: Sound zone methods aim to control the sound field produced by an array of loudspeakers in a room, to render a given audio content in a specific area, while making it almost inaudible in other defined regions. Such control is exerted with a set of filters whose design relies on the Room Impulse Responses (RIRs), which contain information about the electro-acoustical path between each loudspeaker and a group of microphones inside the regions to control. Currently, these methods are successfully implemented only under closed-form solutions. This means that any change in the system requires repeating the process from the beginning, including calculating a new set of filters and acquiring additional knowledge about the new conditions. In real-time applications, short acquisition and processing times are critical, limiting the amount of information that should be retrieved. Our current work explores strategies to decrease the amount of time and information required when updating control filters for sound zone rendering without degrading their performance. In this poster, the use of prior knowledge to obtain the new RIRs is studied, and its influence on the overall performance is evaluated under two objective metrics: the Acoustic Contrast Ratio (ACR) and the Normalized MSE (NMSE).
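
As a rough illustration of the two metrics named above, the following sketch uses common textbook-style definitions, which are an assumption here and may differ from the exact definitions used in the poster: the contrast is the ratio of mean squared pressure in the bright zone to that in the dark zone, and the NMSE is the squared error of an estimated RIR normalized by the energy of the reference RIR, both in dB. Function names are hypothetical.

```python
import math

def acoustic_contrast_db(bright_pressures, dark_pressures):
    """Mean energy in the bright zone over mean energy in the dark zone, in dB."""
    bright = sum(abs(p) ** 2 for p in bright_pressures) / len(bright_pressures)
    dark = sum(abs(p) ** 2 for p in dark_pressures) / len(dark_pressures)
    return 10.0 * math.log10(bright / dark)

def nmse_db(estimate, reference):
    """Squared error between two RIRs, normalized by the reference energy, in dB."""
    err = sum(abs(e - r) ** 2 for e, r in zip(estimate, reference))
    ref = sum(abs(r) ** 2 for r in reference)
    return 10.0 * math.log10(err / ref)

# Toy values: pressures ten times smaller in the dark zone.
contrast = acoustic_contrast_db([1.0, 1.0], [0.1, 0.1])
# Toy RIRs: a small error on the first tap only.
error = nmse_db([1.1, 2.0], [1.0, 2.0])
```

A higher contrast and a lower (more negative) NMSE are the desirable directions under these definitions.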

A reduced-communication distributed steered response power approach to source localization in wireless acoustic sensor networks

Bilgesu Çakmak, KU Leuven

Abstract: In wireless acoustic sensor networks (WASNs), the conventional steered response power (SRP) approach to source localization requires each node to transmit its microphone signal to a fusion center. In our previous work, we showed two different fusion strategies for local, single-node SRP maps computed using only the microphone pairs within a node. In the first fusion strategy, we sum all single-node SRP maps in a fusion center, requiring less communication than the conventional SRP approach because the single-node SRP maps typically have fewer parameters than the raw microphone signals. In the second fusion strategy, the single-node SRP maps are distributively averaged without using a fusion center, requiring communication amongst connected nodes only. This poster proposes an improvement of the method by thresholding the SRP map values before summing all single-node SRP maps or before performing the distributed averaging. This way, the communication load could be reduced even further.
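
The thresholding idea for the first (fusion-center) strategy can be sketched as follows. The abstract does not say how the threshold is chosen, so the fixed threshold, the sparse (index, value) encoding, and all names below are assumptions made purely for illustration.

```python
def threshold_map(srp_map, threshold):
    """Keep only (grid index, value) pairs whose value reaches the threshold."""
    return {i: v for i, v in enumerate(srp_map) if v >= threshold}

def fuse_maps(sparse_maps, grid_size):
    """Sum the thresholded single-node maps on a common candidate grid."""
    fused = [0.0] * grid_size
    for sparse in sparse_maps:
        for i, v in sparse.items():
            fused[i] += v
    return fused

def localize(fused):
    """Return the grid index with the largest fused SRP value."""
    return max(range(len(fused)), key=fused.__getitem__)

# Three hypothetical nodes, each with an SRP map over four candidate positions.
node_maps = [
    [0.10, 0.20, 0.90, 0.30],
    [0.20, 0.10, 0.80, 0.60],
    [0.05, 0.70, 0.75, 0.10],
]
sparse_maps = [threshold_map(m, threshold=0.5) for m in node_maps]
fused = fuse_maps(sparse_maps, grid_size=4)
source_index = localize(fused)
values_sent = sum(len(s) for s in sparse_maps)
```

In this toy case only 5 of the 12 map values would need to be transmitted, while the peak of the fused map stays at the same grid index as without thresholding.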

Deep complex-valued convolutional-recurrent networks for single source DOA estimation

Eric Grinstein, Imperial College London

Abstract: Despite having conceptual and practical advantages, Complex-Valued Neural Networks (CVNNs) have been much less explored for audio signal processing tasks than their real-valued counterparts. We investigate the use of a complex-valued Convolutional Recurrent Neural Network (CRNN) for Direction-of-Arrival (DOA) estimation of a single sound source in an enclosed room. By training and testing our model with recordings from the DCASE 2019 dataset, we show our architecture compares favourably to a real-valued CRNN counterpart in terms of both estimation error and speed of convergence. We also show visualizations of the complex-valued feature representations learned by our method and provide interpretations for them.

Dereverberation in acoustic sensor networks using the weighted prediction error algorithm with microphone-dependent prediction delays

Anselm Lohmann, Carl-von-Ossietzky University of Oldenburg

Abstract: In many hands-free speech communication applications, the quality and intelligibility of speech can be severely degraded by room reverberation. State-of-the-art multi-microphone dereverberation methods include the weighted prediction error (WPE) algorithm. Here, dereverberation is achieved by subtracting a filtered delayed version of the reverberant microphone signals from a chosen reference microphone signal. The delay, called prediction delay, is chosen to reduce the correlation between the prediction signals and the direct speech component in the reference microphone signal, hence reducing the distortion in the dereverberated signal. In compact arrays with closely spaced microphones, the prediction delay is typically chosen based on the autocorrelation of speech, i.e. independent of the microphone array geometry. However, when considering acoustic sensor networks, large inter-microphone distances may exist, leading to larger delays with respect to the reference microphone signal, which, if left uncompensated, may lead to distortion in the dereverberated signal. In order to reduce the correlation between the prediction and reference microphone signals, in this contribution we propose to align the microphone signals based on the time-difference-of-arrival (TDOA) with respect to the reference microphone, leading to microphone-dependent prediction delays in the WPE algorithm. We consider two versions, either using a coarse frame-wise compensation or a more precise time-domain compensation. Experimental results using simulated acoustic impulse responses for a variety of microphone-source positions with different levels of reverberation show that applying TDOA compensation improves the dereverberation performance of WPE. Comparing the two compensation schemes, we find that the performance improvement increases with the precision of the compensation applied.
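
The difference between the two compensation variants can be illustrated by how a single TDOA would be quantized. This sketch is not the authors' implementation: the sampling rate, the STFT hop size, the rounding rule, and the function name are assumptions, and the sketch deliberately leaves out how the shift enters the WPE prediction filter.

```python
def tdoa_compensation(tdoa_s, fs, hop):
    """Quantize a TDOA for time-domain and frame-wise alignment.

    Returns the shift in samples (time-domain variant), the shift in
    STFT frames (frame-wise variant), and the residual misalignment in
    samples that remains when only whole frames are compensated.
    """
    samples = round(tdoa_s * fs)
    frames = round(samples / hop)
    residual = samples - frames * hop
    return samples, frames, residual

# Hypothetical values: 10 ms TDOA, 16 kHz sampling rate, hop of 256 samples.
samples, frames, residual = tdoa_compensation(0.01, fs=16000, hop=256)
```

With these example values the time-domain variant shifts by 160 samples, while the frame-wise variant shifts by one frame and leaves a residual of 96 samples (in magnitude), which is one way to picture why the finer compensation could perform better.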

Two-stage voice modification for privacy enhancement

Francesco Nespoli, Nuance Communications Inc.

Abstract: In recent years, the need for privacy preservation when manipulating or storing personal data has become a major issue. Here, we present two systems addressing the speaker-level anonymization problem, each with a different anonymization-to-utility score ratio. Both systems have been inspired by the baseline model provided by the Voice Privacy Challenge 2022 organizers but, instead of applying the baseline algorithms directly to the original speech signals, utterances were first modified using either an ASR-TTS or a voice style transfer (VST) step. Voice modification was obtained with a VST system whereas heavy voice modification was achieved by first transcribing the original speech and then re-synthesizing a speech signal using TTS. Utility and anonymization scores measured with the provided baseline systems display high anonymization results for the transcription plus text-to-speech system at the cost of a lower utility score. Conversely, the VST-based application leads to higher utility scores at the expense of a lower level of anonymization.

Binaural speech enhancement using STOI-optimal masks

Vikas Tokala, Imperial College London

Abstract: STOI-optimal masking has been previously proposed and developed for single-channel speech enhancement. We consider the extension to the task of binaural speech enhancement in which the spatial information is known to be important to speech understanding and therefore should be preserved by the enhancement processing. Masks are estimated for each of the binaural channels individually and a ‘better-ear listening’ mask is computed by choosing the maximum of the two masks in each time-frequency bin. The estimated mask is used to supply probability information about the speech presence in each time-frequency bin to an Optimally-modified Log Spectral Amplitude (OM-LSA) enhancer. We show that using the proposed method for binaural signals with a directional noise not only improves the SNR of the noisy signal but also preserves the binaural cues and intelligibility.
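
The 'better-ear listening' mask described above amounts to an element-wise maximum over the two per-ear masks. A minimal sketch, assuming each mask is stored as a list of frames with one value per frequency bin (the layout and the function name are illustrative, not taken from the poster):

```python
def better_ear_mask(mask_left, mask_right):
    """Take the larger of the two mask values in each time-frequency bin."""
    return [
        [max(l, r) for l, r in zip(frame_l, frame_r)]
        for frame_l, frame_r in zip(mask_left, mask_right)
    ]

# Two frames, two frequency bins per frame; the values are made up.
left = [[0.2, 0.9], [0.5, 0.1]]
right = [[0.6, 0.4], [0.5, 0.3]]
combined = better_ear_mask(left, right)
```

Using one combined mask for both channels, rather than a separate mask per ear, applies the same weighting to both ears in every bin; in the abstract this mask is then passed on as speech presence information to the OM-LSA enhancer.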