Nahimic (a.k.a. A-Volute) is a company based in Villeneuve d'Ascq (France) that publishes audio enhancement software for the gaming industry, in particular the Nahimic software shipped on MSI laptops. Nahimic
has developed a solution for digital, real-time 3D sound. The suite of audio effects proposed by Nahimic includes effects to improve multimedia content (music or movies) and the gaming experience, as well as microphone effects for communication, such as noise reduction.
You will join the R&D team that is in charge of proposing and prototyping innovative audio algorithms.
— Nathan Souviraà-Labastie, R&D engineer (Nahimic)
— Damien Granger, R&D engineer (Nahimic)
— Raphaël Greff, CTO and R&D Director (Nahimic)
Approaches and topics for the internship
Audio source separation consists in extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, but A-Volute is more specifically interested in the following types of audio signals:
— music, for instance for automatic karaoke generation
— video game audio, for an enhanced gaming experience
— speech, for comfort and quality during communication
Our current algorithm already matches the state of the art [15, 16] on a music separation task, and many avenues of improvement are possible, both in terms of implementation and applications (details hereafter).
The successful candidate will work on one or several of the following topics, according to her/his aspirations and skills. In addition to these topics, visiting PhD students or researchers can also propose their own topic, for instance to try their approaches on our substantial internal dataset (description upon request).
New core algorithm
Machine learning is a fast-changing research domain, and an algorithm can go from state of the art to obsolete in less than a year. The work would be to try recent, powerful neural network approaches on audio source separation tasks (on music signals as a first step). Research domains outside audio (such as computer vision) might also be considered as sources of inspiration. For instance, the approaches in [17, 9] have shown promising results on other tasks.
Metadata such as music style/genre could be used during training. One possible way is to consider these classes as additional tasks to be solved jointly with the separation task. This is a very challenging machine learning problem, especially because the tasks are heterogeneous (classification, regression, signal estimation) and only a few studies targeting audio multi-task learning have been carried out so far (an exhaustive list, to the advisors' knowledge: [7, 11, 13]). Potential advantages are a performance improvement on the main task and a reduction of computational cost in products, since several tasks are achieved at the same time. The work would be to investigate this approach, building on the previous internal work and network architecture.
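To make the heterogeneity concrete, here is a minimal PyTorch sketch of a shared encoder with two heads, one estimating a separation mask (signal estimation) and one classifying the genre; the layer sizes, the number of genre classes and the loss weight alpha are hypothetical choices for illustration, not the internal architecture.

import torch
import torch.nn as nn

class MultiTaskSeparator(nn.Module):
    """Shared encoder with two heads: a separation mask (regression)
    and a music-genre classifier, trained jointly."""
    def __init__(self, n_freq=1025, n_genres=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_freq, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.mask_head = nn.Sequential(nn.Linear(512, n_freq), nn.Sigmoid())
        self.genre_head = nn.Linear(512, n_genres)

    def forward(self, mag):                     # mag: (batch, frames, n_freq)
        h = self.encoder(mag)
        mask = self.mask_head(h)                # time-frequency mask in [0, 1]
        genre = self.genre_head(h.mean(dim=1))  # one genre logit vector per clip
        return mask, genre

def multitask_loss(mask, genre_logits, mix_mag, target_mag, genre_labels, alpha=0.1):
    """Joint loss: L1 on the masked spectrogram plus weighted cross-entropy;
    alpha is an illustrative trade-off between the two heterogeneous tasks."""
    sep_loss = nn.functional.l1_loss(mask * mix_mag, target_mag)
    cls_loss = nn.functional.cross_entropy(genre_logits, genre_labels)
    return sep_loss + alpha * cls_loss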
One core aspect raised by the last SiSEC evaluation campaign is that the use of additional data was key for music separation. Most existing approaches [16, 12, 3] use data augmentation (remixing, additional audio effects, reverb), and this could be an avenue of improvement for our current algorithm as well. The work would be to investigate this approach. Interesting recent studies can be found in [2, 12].
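As an illustration of the kind of augmentation involved, the following sketch remixes training stems with random gains and channel swaps before re-summing them into a new mixture; the function and the stem layout are hypothetical, and a real pipeline would add effects such as reverb as mentioned above.

import numpy as np

def augment_stems(stems, rng):
    """Remix augmentation for source separation training. `stems` maps a
    source name to a (channels, samples) array; the layout is illustrative."""
    out = {}
    for name, audio in stems.items():
        audio = rng.uniform(0.25, 1.25) * audio   # random per-stem gain
        if rng.random() < 0.5:
            audio = audio[::-1]                   # random left/right channel swap
        out[name] = audio
    mixture = sum(out.values())                   # the mix is re-created after augmentation
    return mixture, out

Calling this with a fresh np.random.default_rng(seed) on each epoch yields a different mixture for the same song, which is the point of the augmentation.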
Automatic backing tracks generation
While a karaoke version of a song is the song without the singing voice, a backing track is a version of a song without a given instrument (e.g., drums, guitar, piano...). The work would mainly consist in trying new parameters (window size, overlap between windows, instrument-specific spectral bases and cost functions...) so that our current algorithm can also address automatic backing track generation. Interesting recent studies can be found in [10, 4].
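A sketch of what a window-size exploration could look like is given below, using torch.stft; the grid of FFT sizes and the overlap factor are illustrative starting points, not tuned settings.

import torch

def spectrograms_for_grid(waveform, n_ffts=(1024, 2048, 4096), overlap=0.75):
    """Compute magnitude spectrograms for several window sizes so the
    separation front-end can be evaluated per target instrument (e.g. longer
    windows for harmonic instruments, shorter ones for drums)."""
    specs = {}
    for n_fft in n_ffts:
        hop = int(n_fft * (1.0 - overlap))
        window = torch.hann_window(n_fft)
        stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                          window=window, return_complex=True)
        specs[n_fft] = stft.abs()
    return specs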
Extension to multi-source
So far, most state of the art approaches have addressed the backing track problem as a one-instrument-versus-the-rest problem, hence using a specific network for each instrument when multiple instruments are present in the mix. A more challenging problem would be to estimate all the different instruments with a single network.
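A minimal sketch of the single-network idea, assuming a mask-based formulation: the network emits one time-frequency mask per instrument, and a softmax across the source dimension forces the masks to sum to one in every bin. The architecture and sizes are placeholders, not the internal network.

import torch
import torch.nn as nn

class MultiSourceMasker(nn.Module):
    """Single network estimating all instruments at once: one mask per
    source, normalized across sources so the estimates add up to the mix."""
    def __init__(self, n_freq=1025, n_sources=4, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, n_sources * n_freq),
        )
        self.n_sources, self.n_freq = n_sources, n_freq

    def forward(self, mix_mag):                   # (batch, frames, n_freq)
        b, t, f = mix_mag.shape
        logits = self.net(mix_mag).view(b, t, self.n_sources, f)
        masks = logits.softmax(dim=2)             # sources compete for each bin
        return masks * mix_mag.unsqueeze(2)       # (batch, frames, n_sources, n_freq)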
Extension to multi-channel
Music separation algorithms can take into account stereo versions of songs. Making our current algorithm able to also exploit the spatial information of multi-channel recordings would potentially lead to significant improvements. The work would mainly consist in designing an evolution of the currently used network architecture (more references upon request).
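One simple way to feed spatial information to a network, sketched below under the assumption of a spectrogram-based front-end, is to stack the left and right magnitude spectrograms as convolution channels; this is only an illustration, not the current architecture.

import torch
import torch.nn as nn

class StereoEncoder(nn.Module):
    """Multi-channel front-end sketch: left/right spectrograms become
    convolution channels, letting the network learn inter-channel
    (spatial) cues in addition to spectral ones."""
    def __init__(self, n_channels=2, hidden=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, spec):   # spec: (batch, 2, frames, n_freq)
        return self.conv(spec)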
Synchronized lyrics
For a full karaoke experience, subtitles (and sometimes a video clip) are usually required and must somehow be synchronized with the music piece. A first axis of work would be the adaptation of a state of the art speech-to-text method to singing voice, in order to obtain a singing-to-text algorithm. A further improvement would be to use the lyrics that are available online in plain-text form. It would then be a matter of synchronizing and displaying this version, based on a comparison with the version produced by the singing-to-text algorithm.
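As a sketch of that synchronization step, the clean online lyrics can be aligned to the timed output of a singing-to-text system with a standard sequence matcher; the transcript format assumed here (a list of (word, onset) pairs) is hypothetical.

from difflib import SequenceMatcher

def align_lyrics(transcript, lyrics_words):
    """Align a (word, onset_seconds) transcript from a singing-to-text
    system with the clean lyrics found online, so the clean text can be
    displayed at the estimated time of each matched word."""
    hyp_words = [w for w, _ in transcript]
    matcher = SequenceMatcher(a=lyrics_words, b=hyp_words, autojunk=False)
    timed_lyrics = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            word = lyrics_words[block.a + k]
            onset = transcript[block.b + k][1]
            timed_lyrics.append((word, onset))
    return timed_lyrics   # reference words with estimated display times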
Subjective cost functions
Most audio separation techniques seek to optimize an objective criterion, for example the Itakura-Saito divergence, the mean squared error between spectrograms, or energy metrics such as the signal-to-distortion or signal-to-artifact ratios. However, these metrics say little about the quality perceived by the listener, even though this should be the objective of such a separation algorithm. For example, most audio source separation techniques use a time-frequency masking step, and in many cases this step induces artifacts (chirping) that are perceptible to the ear but not fully captured by objective criteria.
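For reference, the two spectrogram-domain criteria mentioned above can be written in a few lines; note that the argument order of the Itakura-Saito divergence varies across the literature.

import numpy as np

def itakura_saito(p_est, p_ref, eps=1e-8):
    """Itakura-Saito divergence between power spectrograms:
    d_IS = sum(x / y - log(x / y) - 1) with x the reference, y the estimate."""
    ratio = (p_ref + eps) / (p_est + eps)
    return np.sum(ratio - np.log(ratio) - 1.0)

def spectrogram_mse(m_est, m_ref):
    """Mean squared error between magnitude spectrograms."""
    return np.mean((m_est - m_ref) ** 2)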
Some automatic approaches for subjective evaluation exist but are not often used. In particular, the algorithm for predicting subjective ratings that constitutes the state of the art of the domain is particularly slow and cannot be used as a cost function during the training of a source separation algorithm: its use would increase the training time by 7000%, which is prohibitive given that a complete training on this task takes several weeks. In previous work, A-Volute has successfully developed an end-to-end subjective rating prediction algorithm with the same performance but with a minimal impact on training time. The work would be to demonstrate that this approach improves separation results in terms of perception, i.e., subjectively. It will first be necessary to port the algorithm from Keras to PyTorch and verify that the same prediction performance is achieved, in order to then use it as a cost function within the training of an open state of the art algorithm.
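Once ported, using such a predictor as a cost function could look like the sketch below: the rating model is frozen and the separator is trained to maximize the predicted rating. The rating_model call signature is a stand-in for the internal predictor, which is not public.

import torch

def perceptual_loss(estimate, reference, rating_model):
    """Use a pretrained subjective-rating predictor as a differentiable
    cost: the predictor's weights are frozen, and the separator is trained
    so that its output obtains a higher predicted rating."""
    for p in rating_model.parameters():
        p.requires_grad_(False)               # gradients flow to the separator only
    predicted_rating = rating_model(estimate, reference)
    return -predicted_rating.mean()           # higher predicted quality -> lower loss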
Audio gaming - FPS
Unlike real-world recordings, where the mixture of different sounds results from a physical additive process, the audio mixture in a video game is produced by the game engine (level, position, spread, etc. of each audio object, plus the impact of the environment, i.e., reverb). Each game is therefore potentially different in terms of audio mixing, and a game-by-game approach might be necessary. The work would be to adapt the current internal FPS audio separation algorithm to the most played games, such as Fortnite, CS:GO and Overwatch. A first task would be to decide on the list of video games of interest and to characterize their audio mixing.
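To give an idea of what characterizing the mixing means, the following sketch emulates the simplest part of a game-engine mix, a constant-power stereo pan per audio object; distance attenuation, spread and reverb are deliberately omitted, and the pan law is an illustrative choice.

import numpy as np

def pan_and_mix(objects, azimuths):
    """Emulate a simple game-engine mix: each mono audio object is given a
    constant-power stereo pan from its azimuth in [-1, 1], then all objects
    are summed into the final (2, samples) mixture."""
    mix = None
    for audio, azimuth in zip(objects, azimuths):
        theta = (azimuth + 1.0) * np.pi / 4.0                 # map [-1, 1] to [0, pi/2]
        stereo = np.stack([np.cos(theta) * audio,             # left gain
                           np.sin(theta) * audio])            # right gain
        mix = stereo if mix is None else mix + stereo
    return mix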
A substantial amount of data has already been collected from the web, including clean speech, noise, ambiences and sound events. The work would be to appropriately mix these data to form a dataset that is representative of Nahimic software use cases. The hyper-parameters of our current algorithm should then be tuned to learn this dataset. Transfer learning (e.g., using ULMFiT [8, 5]) is one avenue to explore.
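A representative mixing step could look like the following sketch, which scales a noise excerpt to reach a target signal-to-noise ratio before summing it with clean speech; the SNR parameterization and the two-source setup are simplifying assumptions.

import numpy as np

def mix_at_snr(speech, noise, snr_db, rng):
    """Create one training pair: scale a random noise excerpt so the mixture
    reaches the target SNR, mimicking a voice-over-game-sound use case.
    Assumes the noise recording is at least as long as the speech."""
    start = rng.integers(0, max(1, len(noise) - len(speech)))
    noise = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise, speech      # (mixture, clean target)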