Research internship: Audio source separation for music, gaming and speech

Nahimic is hiring!

About

Nahimic is convinced that everyone should benefit from the best possible audio experience, which is why it designs smart, tailor-made solutions that deliver 3D sound in real time on all standard stereo equipment. Its 3D sound solution "Nahimic" has been embedded in top-of-the-line gaming computers and motherboards since 2015.
In line with these convictions, Nahimic has risen to the challenge of bringing this technology mainstream, offering the experience to everyone who wants more immersion and more emotion in their media (Netflix, Spotify, Amazon Prime Video, Apple Music, and much more). Nahimic relies on talented engineers, developers, product owners and managers, and is constantly looking for new talent. It currently has offices in France, Singapore and Taiwan and is looking to expand to the US.
(N.B.: A-Volute made the strategic choice to adopt the name of its leading product, "Nahimic", as the current name of the company.)

Job Description

Nahimic (a.k.a. A-Volute) is a company based in Villeneuve d’Ascq (France) that publishes audio enhancement software for the gaming industry, in particular the Nahimic software shipped on MSI laptops. Nahimic has developed a solution for digital, real-time 3D sound. The suite of audio effects proposed by Nahimic includes effects to improve multimedia content (music or movies) and the gaming experience, as well as microphone effects for communication, such as noise reduction.
You will join the R&D team that is in charge of proposing and prototyping innovative audio algorithms.

Advisors
— Nathan Souviraà-Labastie, R&D engineer (Nahimic)
— Damien Granger, R&D engineer (Nahimic)
— Raphaël Greff, CTO and R&D Director (Nahimic)

Approaches and topics for the internship

Audio source separation consists in extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, but A-Volute is specifically interested in the following types of audio signals:
— music, for instance for automatic karaoke generation
— video game audio, for an enhanced gaming experience
— speech, for comfort and quality during communication
Our current algorithm already matches the state of the art [15, 16] on a music separation task, and many avenues of improvement are possible, both in terms of implementation and applications (detailed hereafter).
The successful candidate will work on one or several of the following topics according to their aspirations and skills. In addition to these topics, visiting PhD students or researchers can also propose their own topic, for instance to try their approaches on our substantial internal dataset (description upon request).

New core algorithm
Machine learning is a fast-changing research domain, and an algorithm can move from state of the art to obsolete in less than a year. The work would be to try recent, powerful neural network approaches on audio source separation tasks (on music signals as a first step). Research domains outside audio (such as computer vision) might also be considered as sources of inspiration. For instance, the approaches in [17, 9] have shown promising results on other tasks.
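To give a flavour of the prototyping involved, below is a minimal PyTorch sketch (not our internal algorithm) in which a Transformer encoder, in the spirit of [17], predicts a time-frequency mask from magnitude-spectrogram frames; all shapes and hyper-parameters are illustrative placeholders.

    import torch
    import torch.nn as nn

    class TransformerMasker(nn.Module):
        def __init__(self, n_freq=513, d_model=256, n_heads=4, n_layers=3):
            super().__init__()
            self.proj_in = nn.Linear(n_freq, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.proj_out = nn.Linear(d_model, n_freq)

        def forward(self, mag):                              # mag: (batch, time, freq)
            h = self.encoder(self.proj_in(mag))
            mask = torch.sigmoid(self.proj_out(h))           # mask values in [0, 1]
            return mask * mag                                 # masked magnitude estimate

    model = TransformerMasker()
    mixture = torch.rand(8, 200, 513)                         # dummy batch of magnitude spectrograms
    estimate = model(mixture)                                 # same shape as the input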

Multi-task approach
Metadata such as music style/genre could be used during training. One possible way is to consider these classes as additional tasks to be solved jointly with the separation task. This is a very challenging machine learning problem, especially because the different tasks are heterogeneous (classification, regression, signal estimation), and only a few studies targeting multi-task audio have been carried out so far (an exhaustive list, to the advisors' knowledge, being [7, 11, 13]). Potential advantages are a performance improvement on the main task and a reduction of computational cost in products, since several tasks are achieved at the same time. The work would be to investigate this approach, building on previous internal work and the existing network architecture.
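As an illustration only (this is not Nahimic's internal architecture), the sketch below shows one common way to phrase such a multi-task setup in PyTorch: a shared encoder feeds both a separation head and a genre-classification head, and the two losses are combined with an assumed weighting factor.

    import torch
    import torch.nn as nn

    class MultiTaskSeparator(nn.Module):
        def __init__(self, n_freq=513, hidden=256, n_genres=10):
            super().__init__()
            self.encoder = nn.LSTM(n_freq, hidden, batch_first=True)   # shared representation
            self.mask_head = nn.Linear(hidden, n_freq)                 # separation (signal estimation)
            self.genre_head = nn.Linear(hidden, n_genres)              # classification

        def forward(self, mag):                                        # mag: (batch, time, freq)
            h, _ = self.encoder(mag)
            mask = torch.sigmoid(self.mask_head(h))
            genre_logits = self.genre_head(h.mean(dim=1))              # pool over time for the class
            return mask * mag, genre_logits

    model = MultiTaskSeparator()
    sep_loss, cls_loss = nn.L1Loss(), nn.CrossEntropyLoss()
    mix, target = torch.rand(4, 100, 513), torch.rand(4, 100, 513)
    genre = torch.randint(0, 10, (4,))
    estimate, logits = model(mix)
    loss = sep_loss(estimate, target) + 0.1 * cls_loss(logits, genre)  # assumed task weighting
    loss.backward()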

Data augmentation
One core lesson from the last SiSEC challenge [15] is that the use of additional data was key to music separation performance. Most existing approaches [16, 12, 3] use data augmentation (remixing, additional audio effects, reverberation), and this could be an avenue of improvement for our current algorithm as well. The work would be to investigate this approach. Interesting recent studies can be found in [2, 12].
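A minimal remixing-style augmentation sketch, in the spirit of [2, 12], is given below; the stem names, gain range and channel-swap probability are assumptions for illustration, not our actual pipeline.

    import torch

    def remix_augment(stems):
        # stems: dict name -> (channels, samples) tensor; returns augmented stems and a new mixture
        augmented = {}
        for name, audio in stems.items():
            gain = torch.empty(1).uniform_(0.25, 1.25)                # random per-stem gain
            if torch.rand(1) < 0.5:
                audio = audio.flip(0)                                 # swap left/right channels
            augmented[name] = gain * audio
        mixture = torch.stack(list(augmented.values())).sum(dim=0)    # re-sum stems into a new mixture
        return augmented, mixture

    stems = {"vocals": torch.randn(2, 44100), "drums": torch.randn(2, 44100),
             "bass": torch.randn(2, 44100), "other": torch.randn(2, 44100)}
    new_stems, new_mix = remix_augment(stems)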

Automatic backing tracks generation
While a karaoke version of a song is the song without the singing voice, a backing track is a version of a song without a given instrument (e.g., drums, guitar, piano...). The work would mainly consist in trying new parameters (window size, overlap between windows, instrument-specific spectral bases and cost functions...) so that our current algorithm can also address automatic backing track generation. Interesting recent studies can be found in [10, 4].
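As a hedged illustration of the window-size/overlap aspect, the sketch below sweeps STFT parameters with torch.stft; the values are placeholders, and the parameters actually worth exploring would depend on the targeted instrument.

    import torch

    def magnitude_spectrogram(audio, n_fft=4096, overlap=0.75):
        hop = int(n_fft * (1.0 - overlap))
        window = torch.hann_window(n_fft)
        spec = torch.stft(audio, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
        return spec.abs()

    audio = torch.randn(44100)                     # 1 s of dummy audio at 44.1 kHz
    for n_fft in (1024, 2048, 4096):               # longer windows favour tonal content,
        mag = magnitude_spectrogram(audio, n_fft)  # shorter ones favour percussive content
        print(n_fft, tuple(mag.shape))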

Extension to multi-source
So far, most state-of-the-art approaches [15] have addressed the backing track problem as a one-instrument-versus-the-rest problem, hence using a dedicated network for each instrument when multiple instruments are present in the mix. A more challenging problem would be to estimate all the different instruments with a single network.
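One possible (purely illustrative) way to phrase this multi-source variant is a single network whose output contains one mask per instrument, as sketched below; the shapes and the choice of a recurrent encoder are assumptions.

    import torch
    import torch.nn as nn

    class MultiSourceMasker(nn.Module):
        def __init__(self, n_freq=513, hidden=256, n_sources=4):
            super().__init__()
            self.n_sources, self.n_freq = n_sources, n_freq
            self.encoder = nn.GRU(n_freq, hidden, batch_first=True)
            self.mask_head = nn.Linear(hidden, n_sources * n_freq)    # all masks at once

        def forward(self, mag):                                       # mag: (batch, time, freq)
            h, _ = self.encoder(mag)
            masks = torch.sigmoid(self.mask_head(h))
            masks = masks.view(mag.size(0), mag.size(1), self.n_sources, self.n_freq)
            return masks * mag.unsqueeze(2)                           # (batch, time, n_sources, freq)

    estimates = MultiSourceMasker()(torch.rand(2, 100, 513))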

Extension to multi-channel
Music separation algorithms can take stereo versions of songs into account [15]. Making our current algorithm able to also exploit the spatial information of multi-channel recordings would potentially lead to significant improvements. The work would mainly consist in designing an evolution of the currently used network architecture (more references upon request).
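A small hedged illustration of the input side: stereo magnitude spectrograms can be fed to a convolutional layer as two input channels, so that inter-channel (spatial) cues become available to the network; the layer sizes below are placeholders.

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=2, out_channels=16, kernel_size=3, padding=1)
    stereo_mag = torch.rand(8, 2, 200, 513)        # (batch, channels, time, freq) for a stereo mixture
    features = conv(stereo_mag)                    # inter-channel cues are mixed into the feature maps
    print(features.shape)                          # torch.Size([8, 16, 200, 513])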

Lyrics generation
For a full karaoke experience, subtitles (and sometimes a video clip) are usually required and must somehow be synchronized with the music piece. A first line of work would be the adaptation of a state-of-the-art speech-to-text method to singing voice, in order to obtain a singing-to-text algorithm. A further improvement would be to use the lyrics that are available online in plain text. It would then be a matter of synchronizing and displaying that version, based on a comparison with the version produced by the singing-to-text algorithm.
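As a hedged sketch of the synchronization idea, the snippet below aligns the word sequence produced by a hypothetical singing-to-text model (with timestamps) against the plain-text lyrics, so that timestamps can be transferred to the reference words; the data are made up for illustration.

    import difflib

    recognized = [("hello", 12.3), ("darknes", 13.1), ("my", 13.6), ("old", 13.9), ("friend", 14.2)]
    reference = ["hello", "darkness", "my", "old", "friend"]

    matcher = difflib.SequenceMatcher(a=[w for w, _ in recognized], b=reference)
    aligned = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("equal", "replace"):             # transfer timestamps onto matched reference words
            for (word, t), ref in zip(recognized[i1:i2], reference[j1:j2]):
                aligned.append((ref, t))

    print(aligned)                                 # [('hello', 12.3), ('darkness', 13.1), ...]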

Subjective cost functions
Most audio separation techniques seek to optimize an objective criterion, for example the Itakura-Saito divergence, the mean squared error between spectrograms, or energy metrics such as the signal-to-distortion or signal-to-artifact ratios [18]. However, these metrics do not reflect much of the quality perceived by the listener, whereas this should be the objective of such a separation algorithm. For example, most audio source separation techniques use a time-frequency masking step, and in many cases this step induces artifacts (chirping) that are perceptible to the ear but not fully captured by objective criteria.
Some automatic approaches for subjective evaluation exist but are not often used [1], [6]. In particular, the algorithm for predicting subjective ratings developed in [6], although it is the state of the art in the domain, is particularly slow and cannot be used as a cost function during the training of a source separation algorithm: it would increase training time by 7000%, which is prohibitive given that a complete training run on this task takes several weeks. In previous work, A-Volute successfully developed an end-to-end subjective rating prediction algorithm with the same performance as [6] but with minimal impact on training time. The work would be to demonstrate that this approach improves separation results in terms of perception, i.e., subjectively. A first step will be to port the algorithm from Keras to PyTorch and verify that the same prediction performance is achieved, in order to then use it as a cost function when training an open state-of-the-art algorithm such as [14].
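As a rough PyTorch sketch of the intended use (assuming the ported predictor is available; a toy placeholder stands in for it here), the perceptual predictor is frozen and its score is added to a standard reconstruction loss while training the separator.

    import torch
    import torch.nn as nn

    quality_net = nn.Sequential(nn.Linear(513, 128), nn.ReLU(), nn.Linear(128, 1))  # placeholder predictor
    for p in quality_net.parameters():
        p.requires_grad_(False)                    # the perceptual predictor stays frozen

    def perceptual_loss(estimate, reference):
        score = quality_net(estimate).mean()       # higher predicted quality -> lower loss
        return nn.functional.l1_loss(estimate, reference) - 0.1 * score

    separator = nn.Linear(513, 513)                # stand-in for the separation model being trained
    mix, target = torch.rand(4, 100, 513), torch.rand(4, 100, 513)
    loss = perceptual_loss(separator(mix), target)
    loss.backward()                                # gradients reach the separator, not the predictor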

Audio gaming - FPS
Contrary to real-world recordings, where the mixture of different sounds results from a physical additive process, the audio mixture in a video game is produced by the game engine (level, position, spread, etc. of each audio object, plus the impact of the environment, i.e., reverb). Each game is therefore potentially different in terms of audio mixing, and a game-by-game approach might be necessary. The work would be to adapt the current internal FPS audio separation algorithm to the most played games, such as Fortnite, CSGO and Overwatch. A first task would be to decide on the list of video games of interest and to characterize their audio mixing.

Speech denoising
A substantial amount of data has already been collected from the web, including clean speech, noise, ambience and sound events. The work would be to mix these data appropriately to form a dataset that is representative of Nahimic software use cases. The hyper-parameters of our current algorithm should then be tuned to learn from this dataset. Transfer learning (e.g., using ULMFiT [8, 5]) is one avenue to explore.
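A minimal sketch of the mixing step such a dataset would rely on is given below: a clean speech excerpt and a noise excerpt are summed at a target signal-to-noise ratio (the sample rate and SNR value are arbitrary here).

    import torch

    def mix_at_snr(speech, noise, snr_db):
        # scale `noise` so that the speech-to-noise ratio equals `snr_db`, then sum
        noise = noise[: speech.numel()]                       # crop noise to the speech length
        p_speech = speech.pow(2).mean()
        p_noise = noise.pow(2).mean().clamp_min(1e-12)
        scale = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + scale * noise

    clean = torch.randn(16000)                     # 1 s of dummy speech at 16 kHz
    noise = torch.randn(16000)
    noisy = mix_at_snr(clean, noise, snr_db=5.0)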

Preferred Experience

Skills

Who are we looking for? You are preparing an engineering degree or a master's degree, or even a PhD (e.g., a 3-month visit), and you preferably have knowledge of the development and implementation of advanced digital audio signal processing algorithms, or experience in Natural Language Processing (NLP) or symbolic data processing.

While not mandatory, notions in the following fields would be appreciated:

  • Audio, acoustics and psychoacoustics.
  • Audio effects in general: compression, equalization, etc.
  • Machine learning and artificial neural networks.
  • Statistics, probabilistic approaches, optimization.
  • Programming languages and frameworks: Matlab, Python, PyTorch, Keras, TensorFlow.
  • Sound spatialization effects: binaural synthesis, ambisonics, artificial reverberation.
  • Voice recognition, voice commands.
  • Voice processing effects: noise reduction, echo cancellation, array processing.
  • Virtual, augmented and mixed reality.
  • Computer programming and development: Max/MSP, C/C++/C#.
  • Video game engines and audio middleware: Unity, Unreal Engine, Wwise, FMod, etc.
  • Audio editing software: Audacity, Adobe Audition, etc.
  • Scientific publications and patent applications.
  • Fluency in English and French.
  • Intellectual curiosity.

References
[1] M. Cartwright, B. Pardo and G. J. Mysore. "Crowdsourced pairwise-comparison for source separation evaluation". In: p. 5.
[2] A. Cohen-Hadria, A. Roebel and G. Peeters. "Improving singing voice separation using Deep U-Net and Wave-U-Net with data augmentation". In: arXiv preprint arXiv:1903.01415 (2019).
[3] A. Défossez et al. "Music Source Separation in the Waveform Domain". In: arXiv preprint arXiv:1911.13254 (2019).
[4] D. Ditter and T. Gerkmann. "A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet". In: arXiv preprint arXiv:1910.11615 (2019).
[5] J. Eisenschlos et al. "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning". In: arXiv preprint arXiv:1909.04761 (2019).
[6] V. Emiya et al. "Subjective and Objective Quality Assessment of Audio Source Separation". In: IEEE Transactions on Audio, Speech, and Language Processing 19.7 (Sept. 2011), pp. 2046-2057.
[7] P. Georgiev et al. "Low-resource multi-task audio sensing for mobile and embedded devices via shared deep neural network representations". In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1.3 (2017), p. 50.
[8] J. Howard and S. Ruder. "Universal language model fine-tuning for text classification". In: arXiv preprint arXiv:1801.06146 (2018).
[9] Language Modelling on Penn Treebank (Word Level). 2019 (accessed 5 December 2019).
[10] M. Pariente et al. "Filterbank design for end-to-end speech separation". In: arXiv preprint arXiv:1910.10400 (2019).
[11] G. Pironkov, S. Dupont and T. Dutoit. "Multi-task learning for speech recognition: an overview". In: ESANN. 2016.
[12] L. Prétet et al. "Singing voice separation: a study on training data". In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 506-510.
[13] D. Stoller, S. Ewert and S. Dixon. "Jointly detecting and separating singing voice: A multi-task approach". In: International Conference on Latent Variable Analysis and Signal Separation. Springer, 2018, pp. 329-339.
[14] F.-R. Stöter et al. "Open-Unmix - A Reference Implementation for Music Source Separation". In: Journal of Open Source Software (2019).
[15] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer, 2018, pp. 293-305.
[16] N. Takahashi and Y. Mitsufuji. "Multi-scale Multi-band DenseNets for Audio Source Separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). 29 June 2017. arXiv: 1706.09588.
[17] A. Vaswani et al. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998-6008.
[18] E. Vincent, R. Gribonval and C. Févotte. "Performance measurement in blind audio source separation". In: IEEE Transactions on Audio, Speech, and Language Processing 14.4 (July 2006), pp. 1462-1469.

Additional Information

  • Contract Type: Internship (Between 5 and 6 months)
  • Location: Villeneuve d'Ascq, France (59491)
  • Education Level: Master's Degree
  • Experience: < 6 months