Neural Audio Codec

TL;DR

Many neural audio codecs have been released recently by different groups. Here I list some of them.

SoundStream
EnCodec
DAC (descript-audio-codec)

DAC (descript-audio-codec)

Paper link: High-Fidelity Audio Compression with Improved RVQGAN
Ablation study: https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5
Repo: https://github.com/descriptinc/descript-audio-codec (MIT License)

Major contributions mentioned in the paper:

We make the following contributions:

We introduce Improved RVQGAN a high fidelity universal audio compression model, that can compress 44.1 KHz audio into discrete codes at 8 kbps bitrate (~90x compression) with minimal loss in quality and fewer artifacts. Our model outperforms state-of-the-art methods by a large margin even at lower bitrates (higher compression), when evaluated with both quantitative metrics and qualitative listening tests.

We identify a critical issue in existing models which don’t utilize the full bandwidth due to codebook collapse (where a fraction of the codes are unused) and fix it using improved codebook learning techniques.

We identify a side-effect of quantizer dropout - a technique designed to allow a single model to support variable bitrates, actually hurts the full-bandwidth audio quality and propose a solution to mitigate it.

We make impactful design changes to existing neural audio codecs by adding periodic inductive biases, multi-scale STFT discriminator, multi-scale mel loss and provide thorough ablations and intuitions to motivate them.

Our proposed method is a universal audio compression model, capable of handling speech, music, environmental sounds, different sampling rates and audio encoding formats.
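
The ~90x figure in the first contribution can be checked with simple arithmetic. Below is a minimal sketch, assuming the uncompressed reference is 16-bit mono PCM. The bit depth and channel count are my assumptions, not something stated above, and the function names are made up for illustration.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth=16, channels=1):
    """Bitrate of uncompressed PCM audio in kbps (1 kbps = 1000 bits/s)."""
    return sample_rate_hz * bit_depth * channels / 1000.0


def compression_ratio(sample_rate_hz, codec_kbps, bit_depth=16, channels=1):
    """Ratio between the raw PCM bitrate and the codec bitrate."""
    return pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels) / codec_kbps


# Assumption: 44.1 kHz, 16-bit, mono input compressed to 8 kbps.
raw_kbps = pcm_bitrate_kbps(44_100)
ratio = compression_ratio(44_100, codec_kbps=8.0)
print(f"raw PCM: {raw_kbps:.1f} kbps, compression: {ratio:.1f}x")
```

Under these assumptions the raw stream is 705.6 kbps, so 8 kbps works out to about 88x, which is consistent with the "~90x" in the paper. A stereo or 24-bit reference would give a larger ratio.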

Training data:

Speech: DAPS dataset, DNS Challenge 4, Common Voice dataset, VCTK dataset.
Music: MUSDB dataset, Jamendo dataset.
Environmental sound: balanced and unbalanced train segments from AudioSet.

Test data:

Evaluation segments from AudioSet for environmental sound, two speakers held out from DAPS (F10, M10) for speech, and the test split of MUSDB for music. The authors extract 3000 10-second segments (1000 from each domain) as the test set.

An interesting follow-up

The Multi-Scale Neural Audio Codec (SNAC) is an interesting follow-up to DAC. It is also released under the MIT license.

It would be interesting to try its repo.

Audio examples can be found here: https://hubertsiuzdak.github.io/snac/

EnCodec

Paper link: High Fidelity Neural Audio Compression
Repo: https://github.com/facebookresearch/encodec (MIT License)

It has been used as the audio tokenizer in many papers, for example VALL-E and MusicGen.

The model consists of an encoder, a quantizer and a decoder. This is very similar to the VQGAN used in image synthesis. A discriminator is used for adversarial training, and the quantizer uses Residual Vector Quantization (RVQ).
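
To make the RVQ idea concrete, here is a minimal NumPy sketch. It is not EnCodec's implementation: the codebooks are fitted with plain k-means on random vectors as a stand-in for codebooks learned during training, and all function names and sizes are hypothetical. Each stage quantizes the residual left by the previous stage, so a frame is represented by one code index per codebook.

```python
import numpy as np


def nearest(data, codebook):
    """Index of the closest codebook entry (squared L2) for each row of data."""
    dist = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)


def fit_codebook(data, codebook_size, rng, iters=10):
    """Fit one codebook with plain k-means (stand-in for a learned codebook)."""
    picks = rng.choice(len(data), codebook_size, replace=False)
    codebook = data[picks].copy()
    for _ in range(iters):
        idx = nearest(data, codebook)
        for k in range(codebook_size):
            members = data[idx == k]
            if len(members) > 0:  # unused codes keep their previous value
                codebook[k] = members.mean(axis=0)
    return codebook


def rvq_fit(data, n_codebooks, codebook_size, seed=0):
    """Fit the codebooks one after another, each on the previous residual."""
    rng = np.random.default_rng(seed)
    residual = data.copy()
    codebooks = []
    for _ in range(n_codebooks):
        cb = fit_codebook(residual, codebook_size, rng)
        codebooks.append(cb)
        residual = residual - cb[nearest(residual, cb)]
    return codebooks


def rvq_encode(x, codebooks):
    """Return integer codes of shape (n_frames, n_codebooks)."""
    residual = x.copy()
    codes = []
    for cb in codebooks:
        idx = nearest(residual, cb)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes, axis=1)


def rvq_decode(codes, codebooks, n_used=None):
    """Sum the selected entries of the first n_used codebooks."""
    n_used = codes.shape[1] if n_used is None else n_used
    return sum(codebooks[i][codes[:, i]] for i in range(n_used))


# Toy data standing in for encoder output frames (assumption: 8-dim latents).
rng = np.random.default_rng(0)
frames = rng.normal(size=(512, 8))
codebooks = rvq_fit(frames, n_codebooks=4, codebook_size=16)
codes = rvq_encode(frames, codebooks)
for n in range(1, 5):
    err = np.mean((frames - rvq_decode(codes, codebooks, n)) ** 2)
    print(f"{n} codebook(s): {n * 4} bits/frame, MSE {err:.4f}")
```

With 16 entries per codebook, each code costs 4 bits, so decoding with fewer codebooks lowers the bitrate at the cost of a larger reconstruction error. In a real codec the encoder, codebooks and decoder are trained together rather than fitted stage by stage as in this toy example.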