Sounds carry a wide range of information about environments, from individual physical events to entire sound scenes. The use of machine learning and deep learning methods for the analysis, recognition, and synthesis of sounds has achieved remarkable results in recent years. Human beings perceive the world through multimodal information: we hear sounds and communicate in language. Recently, the intersection of audio and language has emerged as an important research area in audio signal processing and natural language processing.

Multimodal audio-language tasks hold immense potential in a variety of application scenarios. For instance, automatic audio captioning aims to provide meaningful language descriptions of audio content, helping the hearing-impaired to understand environmental sounds. Language-based audio retrieval facilitates efficient multimedia content retrieval and sound analysis for security surveillance. Text-queried audio source separation aims to separate an arbitrary sound from an audio mixture, offering a flexible user interface for future audio editing and creation applications. Text-to-audio generation aims to synthesize audio content from language descriptions, serving as a sound synthesis tool for film making, game design, virtual reality, and digital media, and aiding text understanding for the visually impaired. However, audio-language multimodal learning presents challenges in understanding the audio events and scenes within an audio clip, as well as in interpreting the textual information expressed in natural language. Furthermore, the limited size of existing audio-language datasets hampers generalization to real-world scenarios.

This special session aims to present a collection of recent advances in audio-language multimodal learning from leading researchers, to exchange ideas among researchers, and to identify potential new research directions for further advancement of the field.
Xubo Liu, University of Surrey, UK
Mark D. Plumbley, University of Surrey, UK
Wenwu Wang, University of Surrey, UK