Separate What You Describe: Language-Queried Audio Source Separation

Xubo Liu1
Haohe Liu1
Qiuqiang Kong2
Xinhao Mei1
Jinzheng Zhao1
Qiushi Huang1
Mark D. Plumbley1
Wenwu Wang1

1University of Surrey, UK
2ByteDance, China


Abstract

In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query describing that source (e.g., ''a man tells a joke followed by people laughing''). A unique challenge in LASS is the complexity of natural language descriptions and their relation to the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network that is trained to jointly process acoustic and linguistic information and to separate the target source that is consistent with the language query from an audio mixture. We evaluate the performance of our proposed system on a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising generalization results when using diverse human-annotated descriptions as queries, indicating its potential use in real-world scenarios.



Oral Presentation at INTERSPEECH



Language-Queried Audio Source Separation

In this work, we present LASS-Net, which is trained to jointly process acoustic and linguistic information and to separate the target source described by a natural language expression. In LASS-Net, a Transformer-based query network (QueryNet) encodes the language expression into a query embedding, and a ResUNet-based separation network (SeparationNet) then separates the target source from the mixture, conditioned on the query embedding. The architecture of our proposed model is illustrated below.

[Figure: architecture of LASS-Net]
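
Separately from the architecture figure, the two-stage data flow described above (a query embedding that conditions a separation mask) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the LASS-Net implementation: `toy_query_embedding` replaces the Transformer-based QueryNet with a simple word hash, `toy_separation` replaces the ResUNet-based SeparationNet with a per-bin sigmoid mask, and the scale-and-shift conditioning, the weight matrices, and all names and sizes are hypothetical choices made only so the example runs without trained weights.

```python
import hashlib
import math
import random
from typing import List

EMBED_DIM = 8  # hypothetical embedding size, chosen only for this toy example
N_BINS = 4     # hypothetical number of frequency bins


def toy_query_embedding(query: str, dim: int = EMBED_DIM) -> List[float]:
    """Stand-in for QueryNet: hash each word into an L2-normalised vector."""
    vec = [0.0] * dim
    for word in query.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for i in range(dim):
            vec[i] += digest[i] / 255.0 - 0.5
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_weights(n_bins: int, dim: int, seed: int) -> List[List[float]]:
    """Fixed random projection (assumption: stands in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_bins)]


def toy_separation(mixture_mag, query_emb, w_scale, w_shift):
    """Stand-in for SeparationNet: a query-conditioned mask on the mixture.

    The query embedding is projected to one (scale, shift) pair per frequency
    bin; a sigmoid turns these into a mask in (0, 1) that multiplies the
    mixture magnitude. This conditioning scheme is an assumption.
    """
    scale = [sum(w * q for w, q in zip(row, query_emb)) for row in w_scale]
    shift = [sum(w * q for w, q in zip(row, query_emb)) for row in w_shift]
    separated = []
    for frame in mixture_mag:
        out = []
        for f, mag in enumerate(frame):
            mask = 1.0 / (1.0 + math.exp(-(scale[f] * mag + shift[f])))
            out.append(mask * mag)
        separated.append(out)
    return separated


W_SCALE = make_weights(N_BINS, EMBED_DIM, seed=0)
W_SHIFT = make_weights(N_BINS, EMBED_DIM, seed=1)

# Toy magnitude "spectrogram": 2 time frames x 4 frequency bins.
mixture = [[0.9, 0.1, 0.4, 0.0], [0.2, 0.8, 0.5, 0.3]]
emb = toy_query_embedding("Church bells ringing")
separated = toy_separation(mixture, emb, W_SCALE, W_SHIFT)
```

Because the mask depends on the query embedding, a different query produces a different mask over the same mixture, which is the behaviour the conditioning is meant to capture. In the actual system both networks are trained end to end, whereas the weights here are random and the output carries no acoustic meaning.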



Audio Separation Results

Below are audio samples separated by our proposed LASS-Net. Each example has four audio clips:

(a) audio mixture

(b) separated audio source queried by AudioCaps description

(c) separated audio source queried by our collected human description

(d) ground truth target source.



Language Query (AudioCaps): ''A person shouts nearby and then emergency vehicle sirens sounds''

Language Query (human): ''A man is speaking with ambulance and police siren sound in the background''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''A motor vibrates and then revs up and down''

Language Query (human): ''The engine sound of a vehicle''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''People laugh followed by people singing while music plays''

Language Query (human): ''A music show is presenting to the public''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''Someone is typing on a keyboard''

Language Query (human): ''The sound of hitting the keyboard''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''Distant claps of thunder with rain falling and a man speaking''

Language Query (human): ''Very rainy and a man is talking dirty words in the background''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''Heavy wind and birds chirping''

Language Query (human): ''A bird is chirping under the thunder storm''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''Applauding followed by people singing and a tambourine''

Language Query (human): ''A show start with audience applauding and then singing''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''A woman is giving a speech''

Language Query (human): ''A female is speaking with clearing throat sounds''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''Church bells ringing''

Language Query (human): ''Someone is striking a large church bell''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source

Language Query (AudioCaps): ''An adult male is laughing''

Language Query (human): ''A man is laughing really hard''
(a) audio mixture
(b) LASS (AudioCaps)
(c) LASS (human)
(d) target source



Paper

X. Liu, H. Liu, Q. Kong, X. Mei, J. Zhao, Q. Huang, M.D. Plumbley, W. Wang

Separate What You Describe: Language-Queried Audio Source Separation

Accepted at INTERSPEECH 2022


[BibTeX]   [arXiv]   [Code]




Acknowledgements

This research was supported by a Newton Institutional Links Award from the British Council (Grant number 623805725), a Research Scholarship from the China Scholarship Council (CSC), and a PhD Studentship from the University of Surrey.