The Journal of the Audio Engineering Society — the official publication of the AES — is the only peer-reviewed journal devoted exclusively to audio technology. Published 10 times each year, it is available to all AES members and subscribers.
The Journal contains state-of-the-art review papers, technical papers, and engineering reports, as well as standards committee work, convention and conference announcements, membership news, and book reviews.
Authors: Moliner, Eloi; Välimäki, Vesa
Affiliation: Acoustics Lab, Department of Information and Communications Engineering, Aalto University, Espoo, Finland
Page: 100
Audio inpainting aims to reconstruct missing segments in corrupted recordings. Most existing methods produce plausible reconstructions when the gap lengths are short but struggle to reconstruct gaps longer than about 100 ms. This paper explores diffusion models, a recent class of deep learning models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model, which can be conditioned in a zero-shot fashion for audio inpainting, and is able to regenerate gaps of any size. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated through objective and subjective metrics for the task of reconstructing short to mid-sized gaps, up to 300 ms. The results of a formal listening test indicate that, for short gaps of around 50 ms, the proposed method delivers performance comparable to the baselines. For wider gaps up to 300 ms long, the method outperforms the baselines and retains good or fair audio quality. The method presented in this paper can be applied to restoring sound recordings that suffer from severe local disturbances or dropouts.
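The zero-shot conditioning idea described in the abstract can be sketched as follows: run the unconditional reverse diffusion loop, and after each denoising step overwrite the known samples with an appropriately noised copy of the observation, so that only the gap is actually generated. The NumPy sketch below is illustrative only; the `denoise_step(x, t)` interface and the linear noise schedule are assumptions standing in for the paper's pretrained model and sampler.

```python
import numpy as np

def inpaint_zero_shot(x_known, mask, denoise_step, num_steps=50, seed=0):
    """Zero-shot diffusion inpainting via data consistency.

    x_known : observed signal (values inside the gap are ignored)
    mask    : 1.0 where samples are known, 0.0 inside the gap
    denoise_step(x, t) : one reverse-diffusion step of an unconditionally
        trained model (hypothetical interface, not the paper's exact API)
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(x_known.shape)        # start from pure noise
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)                    # unconditional reverse step
        sigma = t / num_steps                     # toy linear noise schedule
        noised = x_known + sigma * rng.standard_normal(x_known.shape)
        x = mask * noised + (1.0 - mask) * x      # re-impose the known samples
    return mask * x_known + (1.0 - mask) * x      # exact consistency at the end
```

Because the conditioning happens only at sampling time, the same unconditional model handles gaps of any size without retraining, which is the property the abstract highlights.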
Authors: Vanhatalo, Tara; Legrand, Pierrick; Desainte-Catherine, Myriam; Hanna, Pierre; Pille, Guillaume
Affiliation: Inria Bordeaux Sud-Ouest, Institute of Mathematics of Bordeaux, UMR 5251 CNRS, University of Bordeaux, F-33405 Talence, France; University of Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France; Orosys, F-34980 Saint-Gély-du-Fesc, France
Page: 114
Neural networks have seen increased popularity in recent years for nonlinear audio effects modelling. Operating on sampled audio, such highly nonlinear models create high-frequency harmonics that can quickly exceed the Nyquist rate, folding aliasing components back into the baseband. In this work, we study the impact of processing audio with neural networks and the aliasing these algorithms can introduce or aggravate. Specifically, we evaluate the performance of a number of anti-aliasing methods for real-time use. Notably, one anti-aliasing method capable of real-time performance was identified: forced sparsity through network pruning.
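The aliasing mechanism described above is easy to reproduce: drive a memoryless nonlinearity (here `tanh`, a stand-in for a neural effects model) with a sine and measure the spectral energy that lands on non-harmonic bins; oversampling before the nonlinearity, one of the classic anti-aliasing baselines, pushes the offending harmonics out of the baseband. This sketch is illustrative and is not the paper's evaluation code.

```python
import numpy as np

def harmonic_distortion_spectrum(nonlinearity, f0_bin, n, oversample=1):
    """Drive a memoryless nonlinearity with a unit sine and return the
    baseband magnitude spectrum (bins 0 .. n//2).

    With oversample > 1 the nonlinearity runs at a higher rate and the
    result is band-limited back down (ideal FFT resampling), so harmonics
    above the original Nyquist fall outside the baseband instead of
    aliasing into it.
    """
    m = n * oversample
    t = np.arange(m) / m
    y = nonlinearity(np.sin(2 * np.pi * f0_bin * t))
    Y = np.fft.rfft(y)
    return np.abs(Y[: n // 2 + 1]) / m      # keep baseband bins only

def aliased_energy(spec, f0_bin):
    """Energy at bins that are NOT multiples of the fundamental bin,
    i.e., components that can only have arrived there by aliasing."""
    harm = np.zeros(len(spec), dtype=bool)
    harm[::f0_bin] = True                   # DC and harmonic bins
    return float(np.sum(spec[~harm] ** 2))
```

Choosing a fundamental bin that does not divide the FFT length (e.g., 101 into 1024) ensures that folded harmonics land on clearly non-harmonic bins, making the aliased energy directly measurable.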
Author: Parolo, Giuseppe
Affiliation: Ångstrom Audiolab
Page: 123
The classical characterization of nonlinear distortion in electronic devices such as audio amplifiers involves the calculation of indicators such as Total Harmonic Distortion (THD), Total Harmonic Distortion + Noise (THD+N), and Intermodulation Distortion (IMD), obtained by measuring the additional spectral components generated by the device in response to conventional input signals. This paper explores the relationships that link these components and therefore how they affect the calculation of the indicators. In particular, it is shown how the current measures, by leaving out some components, fail to represent the overall extent of the nonlinear distortion suffered by the signal. The topic is developed using black-box-type models, untethered from the particular circuit topology of the physical device. Thorough knowledge of the spectral relationships can be a guide in tuning amplifiers; measurements, recalculated by integrating the missing components, can be used both to more accurately characterize the distorting effects of amplifiers and to enable more appropriate classification.
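As a reference point for the indicators discussed, THD and THD+N can be computed directly from an FFT of the device's response to a sine: THD relates the energy at the harmonic bins to the fundamental, while THD+N counts everything that is not the fundamental, noise included. A minimal single-frame sketch (rectangular window, fundamental assumed to sit exactly on an FFT bin):

```python
import numpy as np

def thd(signal, f0_bin, n_harm=10):
    """Total Harmonic Distortion: harmonic energy relative to the fundamental."""
    spec = np.abs(np.fft.rfft(signal))
    fund = spec[f0_bin]
    bins = [k * f0_bin for k in range(2, n_harm + 1) if k * f0_bin < len(spec)]
    return np.sqrt(np.sum(spec[bins] ** 2)) / fund

def thd_plus_n(signal, f0_bin):
    """THD+N: all non-fundamental energy (harmonics AND noise) vs. fundamental."""
    p = np.abs(np.fft.rfft(signal)) ** 2
    return np.sqrt((np.sum(p[1:]) - p[f0_bin]) / p[f0_bin])  # skip the DC bin
```

The difference between the two functions illustrates the paper's point: which spectral components an indicator includes or omits changes what the resulting number actually represents.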
Authors: Denk, Florian; Jürgensen, Lukas; Husstedt, Hendrik
Affiliation: German Institute of Hearing Aids, Lübeck, Germany
Page: 145
Users of earphones, hearing aids, or other ear-worn devices frequently experience an unnatural or "boomy" sound of their own voice. This is caused by the occlusion effect, i.e., an amplification of body-conducted components at low frequencies and an attenuation of air-conducted high-frequency components of the voice. Although the classic method to reduce the occlusion effect is to partly open the ear canal, active control of the ear canal sound pressure to improve own-voice perception, referred to as Occlusion Effect Cancellation (OEC), is now provided in the transparency mode of many commercial active noise control earphones. In this work, the OEC functionality of four earphones was evaluated through subjective ratings, probe tube measurements, and measurements in a prototype coupler that simulates the body- and air-conducted components of the wearer's own voice. Results show substantial benefits of OEC that differ between devices, and that the various effects of ear canal occlusion across the whole frequency range must be compensated to achieve satisfactory own-voice quality. Measurements in the prototype coupler approximate the occlusion effects in real ears and may be a useful complement to tedious and potentially unreliable real-ear measurements in human subjects.
Authors: He, Weijun; Zhao, Weijun; Lin, Weijun; He, Yuxin; Feng, Qi
Affiliation: School of Electronics and Information, Guangdong Polytechnic Normal University, Guangzhou, China
Page: 161
Prosody conversion is an important part of voice conversion, in which the fundamental frequency (F0), carrying important speaker-individuality information (e.g., tone and intonation), is regarded as one of the key prosodic features in the excitation model for speech synthesis. In the conventional approach based on the continuous wavelet transform for modeling F0, analysis is carried out at the frame level and is prone to losing high-frequency information in the process of decomposition and reconstruction. To address this problem, this paper presents a representation of the long-term fundamental frequency based on the Wavelet Packet Transform (WPT). Specifically, the long-term F0 is decomposed using WPT, and a joint vector is formed by combining the resulting average power spectra. Furthermore, the method is applied in a voice conversion system. Voice conversion experiments are conducted on Chinese and English speech data to evaluate the performance of the proposed method. The results show that the proposed method clearly outperforms the method based on the wavelet transform in all conversion scenarios but performs slightly worse than the method based on mean and variance in the same-gender conversion scenario.
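The key difference from the plain wavelet transform is that a wavelet packet transform splits the detail (high-frequency) bands further as well, so less high-frequency structure of the F0 contour is discarded. A toy sketch using the Haar wavelet, with per-subband average power as the feature vector; the paper's actual wavelet and vector construction may differ.

```python
import numpy as np

def haar_step(x):
    """One level of the orthonormal Haar transform: averages and differences."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation (low-pass) band
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail (high-pass) band
    return a, d

def wpt(x, levels):
    """Full wavelet packet decomposition: unlike the plain wavelet transform,
    BOTH the approximation and the detail bands are split at every level."""
    x = np.asarray(x, dtype=float)
    assert len(x) % 2 ** levels == 0, "length must be divisible by 2**levels"
    nodes = [x]
    for _ in range(levels):
        nxt = []
        for node in nodes:
            a, d = haar_step(node)
            nxt.extend([a, d])
        nodes = nxt
    return nodes                            # 2**levels subbands, coarse to fine

def band_powers(nodes):
    """Average power per subband -- a compact joint feature vector."""
    return np.array([np.mean(n ** 2) for n in nodes])
```

Because the Haar steps are orthonormal, the decomposition preserves signal energy, so the subband powers partition the contour's energy across frequency bands rather than losing the high-frequency part.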
Authors: Ackermann, David; Brinkmann, Fabian; Weinzierl, Stefan
Affiliation: Audio Communication Group, Technische Universität Berlin, Germany
Page: 170
This article presents a database of recordings and radiation patterns of individual notes for 41 modern and historical musical instruments, measured with a 32-channel spherical microphone array in anechoic conditions. In addition, directivities averaged in 1/3-octave bands have been calculated for each instrument, which are suitable for use in acoustic simulation and auralization. The data are provided in the Spatially Oriented Format for Acoustics (SOFA). Spatial upsampling of the directivities was performed based on spherical spline interpolation, and the results were converted to the OpenDAFF and Generic Loudspeaker Library formats for use in room acoustic and electro-acoustic simulation software. For this purpose, a method is presented for referencing these directivities to a specific microphone position in order to achieve a physically correct auralization without coloration. The data are available under the CC BY-NC 4.0 license.
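Averaging directivities in 1/3-octave bands, as done for the simulation-ready data, amounts to pooling squared magnitudes between the band edges fc·2^(-1/6) and fc·2^(1/6) for each microphone channel. A NumPy sketch assuming base-2 band spacing (an assumption; the database may follow a different standard band definition such as IEC 61260):

```python
import numpy as np

def third_octave_centers(fmin=100.0, fmax=16000.0):
    """Band center frequencies spaced by a factor of 2**(1/3) (base-2 bands)."""
    centers = [fmin]
    while centers[-1] * 2 ** (1 / 3) <= fmax:
        centers.append(centers[-1] * 2 ** (1 / 3))
    return np.array(centers)

def band_average(mag, freqs, centers):
    """Average squared magnitude per channel within each 1/3-octave band.

    mag    : (channels, bins) magnitude spectra, one row per microphone
    freqs  : (bins,) bin frequencies in Hz
    centers: (bands,) band center frequencies in Hz
    """
    lo = centers * 2 ** (-1 / 6)            # lower band edges
    hi = centers * 2 ** (1 / 6)             # upper band edges
    out = np.empty((mag.shape[0], len(centers)))
    for i, (l, h) in enumerate(zip(lo, hi)):
        sel = (freqs >= l) & (freqs < h)
        out[:, i] = np.mean(mag[:, sel] ** 2, axis=1) if sel.any() else 0.0
    return out
```

Band averaging of this kind trades fine spectral detail for robustness, which is why the averaged directivities are the form recommended for room acoustic simulation and auralization.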