|
|
|
|
|
|
|
|
|
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query describing the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS arises from the complexity of natural language descriptions and their relation to the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network that learns to jointly process acoustic and linguistic information and to separate from an audio mixture the target source that is consistent with the language query. We evaluate the performance of our proposed system on a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net generalizes well when diverse human-annotated descriptions are used as queries, indicating its potential use in real-world scenarios.
|
In this work, we present LASS-Net, which learns to jointly process acoustic and linguistic information and to separate the target source described by a natural language expression. In LASS-Net, a Transformer-based query network (QueryNet) encodes the language expression into a query embedding, and a ResUNet-based separation network (SeparationNet) then separates the target source from the mixture, conditioned on the query embedding. The architecture of our proposed model is illustrated below.
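The two-stage design above (encode the query, then condition the separator on it) can be sketched in a few lines of numpy. This is a minimal illustrative sketch, not the actual LASS-Net implementation: the mean-pooling `query_net` stands in for the Transformer QueryNet, and the FiLM-style scale-and-shift conditioning inside `separation_net` is an assumption made for illustration, replacing the real ResUNet.

```python
import numpy as np

rng = np.random.default_rng(0)

def query_net(token_embeddings):
    """Stand-in for the Transformer-based QueryNet: pool per-token
    embeddings into one fixed-size query embedding (the real model
    uses a Transformer encoder, not mean pooling)."""
    return token_embeddings.mean(axis=0)

def separation_net(mixture_spec, query_emb):
    """Stand-in for the ResUNet-based SeparationNet: predict a
    magnitude mask for the target source, conditioned on the query
    embedding via a FiLM-style scale and shift (hypothetical)."""
    scale = 1.0 + np.tanh(query_emb.mean())   # query-dependent scaling
    shift = np.tanh(query_emb.std())          # query-dependent bias
    features = scale * mixture_spec + shift   # conditioned features
    return 1.0 / (1.0 + np.exp(-features))    # sigmoid -> mask in (0, 1)

# Toy shapes: 8 query tokens of dim 16; a 64-frame x 32-bin spectrogram.
tokens = rng.standard_normal((8, 16))
mixture = np.abs(rng.standard_normal((64, 32)))

q = query_net(tokens)
mask = separation_net(mixture, q)
separated = mask * mixture  # masked magnitude spectrogram of the target

assert mask.shape == mixture.shape
assert np.all((mask > 0.0) & (mask < 1.0))
```

The key design point the sketch preserves is that the separator never sees raw text: all linguistic information reaches it through the single query embedding, which is what lets arbitrary free-form descriptions drive the separation.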
We demonstrate audio samples separated by our proposed LASS-Net. Each example contains four audio clips: (a) the audio mixture; (b) the source separated using the AudioCaps description as the query; (c) the source separated using our collected human description as the query; (d) the ground-truth target source.
Language Query (AudioCaps): "A person shouts nearby and then emergency vehicle sirens sounds"
Language Query (human): "A man is speaking with ambulance and police siren sound in the background"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "A motor vibrates and then revs up and down"
Language Query (human): "The engine sound of a vehicle"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "People laugh followed by people singing while music plays"
Language Query (human): "A music show is presenting to the public"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "Someone is typing on a keyboard"
Language Query (human): "The sound of hitting the keyboard"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "Distant claps of thunder with rain falling and a man speaking"
Language Query (human): "Very rainy and a man is talking dirty words in the background"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "Heavy wind and birds chirping"
Language Query (human): "A bird is chirping under the thunder storm"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "Applauding followed by people singing and a tambourine"
Language Query (human): "A show start with audience applauding and then singing"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "A woman is giving a speech"
Language Query (human): "A female is speaking with clearing throat sounds"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "Church bells ringing"
Language Query (human): "Someone is striking a large church bell"
|
|
|
|
|
|
|
|
|
|
|
|
Language Query (AudioCaps): "An adult male is laughing"
Language Query (human): "A man is laughing really hard"
|
|
|
|
|
|
|
|
|
|
|
|
X. Liu, H. Liu, Q. Kong, X. Mei, J. Zhao, Q. Huang, M. D. Plumbley, and W. Wang, "Separate What You Describe: Language-Queried Audio Source Separation," accepted at INTERSPEECH 2022. [BibTeX] [ArXiv] [Code]
Acknowledgements
This research was supported by a Newton Institutional Links Award from the British Council (Grant number 623805725), a Research Scholarship from the China Scholarship Council (CSC), and a PhD Studentship from the University of Surrey.