DeepVQE-AEC (GGUF)

GGML/GGUF inference model for DeepVQE (Indenbom et al., Interspeech 2023): joint acoustic echo cancellation (AEC), noise suppression, and dereverberation.

Quick Start

Build

Requires CMake 3.20+ and a C++17 compiler. The ggml library is included as a git submodule.

git clone --recursive https://github.com/richiejp/deepvqe-ggml
cd deepvqe-ggml/ggml

# CLI only
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# With shared library (C API for FFI from Python, Go, etc.)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DDEEPVQE_BUILD_SHARED=ON
cmake --build build

CLI

Process STFT-domain audio (NumPy .npy files):

./build/deepvqe deepvqe.gguf --input-npy mic_stft.npy ref_stft.npy
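
The .npy inputs can be produced with NumPy using the STFT settings listed under Model Details (512-point FFT, 256-sample hop, sqrt-Hann window, 16 kHz). This is a sketch: the exact array layout the CLI expects (frames × bins, complex64, no padding) is an assumption, not documented here.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    win = np.sqrt(np.hanning(n_fft))          # sqrt-Hann analysis window
    n_frames = 1 + (len(x) - n_fft) // hop    # full frames only, no padding
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1).astype(np.complex64)  # (frames, 257)

mic = np.zeros(16000, dtype=np.float32)  # replace with real 16 kHz mic capture
ref = np.zeros(16000, dtype=np.float32)  # far-end reference signal
np.save("mic_stft.npy", stft(mic))
np.save("ref_stft.npy", stft(ref))
```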

C API

The shared library (libdeepvqe.so) exposes a simple C API that can be called from any language with C FFI support:

#include "deepvqe_api.h"

// Load model
uintptr_t ctx = deepvqe_new("deepvqe.gguf");

// Process 16kHz mono float32 audio
//   mic: microphone input (with echo + noise)
//   ref: far-end reference (what the speaker is hearing)
//   out: cleaned output (pre-allocated, same length)
int ret = deepvqe_process_f32(ctx, mic, ref, n_samples, out);

// int16 PCM variant also available
ret = deepvqe_process_s16(ctx, mic_s16, ref_s16, n_samples, out_s16);

deepvqe_free(ctx);

Python (ctypes)

import ctypes, numpy as np

lib = ctypes.CDLL("./build/libdeepvqe.so")
lib.deepvqe_new.restype = ctypes.c_void_p
lib.deepvqe_new.argtypes = [ctypes.c_char_p]
lib.deepvqe_process_f32.restype = ctypes.c_int
lib.deepvqe_process_f32.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,
    ctypes.c_int, ctypes.c_void_p,
]
lib.deepvqe_free.argtypes = [ctypes.c_void_p]

ctx = lib.deepvqe_new(b"deepvqe.gguf")

mic = np.zeros(16000, dtype=np.float32)  # 1 second of 16kHz audio
ref = np.zeros(16000, dtype=np.float32)
out = np.empty_like(mic)

ret = lib.deepvqe_process_f32(
    ctx,
    mic.ctypes.data,
    ref.ctypes.data,
    len(mic),
    out.ctypes.data,
)

lib.deepvqe_free(ctx)

Used in production by VoxInput for real-time voice input with echo cancellation.

Model Details

| Property | Value |
|---|---|
| Architecture | DeepVQE with AlignBlock (soft delay estimation) |
| Parameters | ~8.0M |
| Sample rate | 16 kHz |
| STFT | 512 FFT, 256 hop (16 ms), sqrt-Hann window |
| Delay range | dmax = 32 frames (320 ms) |
| Format | GGUF |
| Variants | F32 (31 MB), Q8_0 (8.5 MB) |
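
The hop size fixes the model's frame timing; a quick arithmetic check of the 16 ms figure quoted above:

```python
# Frame timing implied by the STFT settings: 256-sample hop at 16 kHz.
sr = 16000       # sample rate (Hz)
hop = 256        # STFT hop (samples)
hop_ms = hop / sr * 1000
fps = sr / hop   # model frames per second of audio
print(hop_ms, fps)
```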

Quantization

The Q8_0 variant (deepvqe_q8.gguf) reduces model size by 73% (31 MB to 8.5 MB) using GGML Q8_0 quantization with selective layer preservation.

| Layer group | Quantization | Reason |
|---|---|---|
| Encoder/decoder (2-5) weights | Q8_0 | Residual connections mitigate error |
| Bottleneck GRU + FC weights | Q8_0 | Largest tensors (~3.6M params) |
| AlignBlock (attention) | F32 | Softmax precision for delay estimation |
| dec1 (mask output) | F32 | Directly controls complex convolving mask |
| All biases, ChannelAffine | F32 | Small tensors, negligible size savings |

Divergence from F32: output max error 5e-2, mean error 7e-4.
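
The size reduction follows directly from the Q8_0 block format, which stores blocks of 32 int8 weights plus one f16 scale (34 bytes per 32 parameters). A back-of-envelope sketch, ignoring GGUF metadata and the layers kept in F32:

```python
# Rough size estimate for the two variants (GGUF overhead ignored).
n_params = 8.0e6                        # ~8.0M parameters (see Model Details)
f32_mb = n_params * 4 / 1e6             # 4 bytes per F32 weight
q8_mb = n_params * (34 / 32) / 1e6      # if every tensor were Q8_0
print(f"F32 ~{f32_mb:.1f} MB, Q8_0 ~{q8_mb:.1f} MB")
```

Both estimates land close to the quoted 31 MB / 8.5 MB file sizes; the small F32-preserved tensors add little on top.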

Training

Trained on the full DNS5 16 kHz dataset (~300K clean speech files after DNSMOS quality filtering, 64K noise, 60K impulse responses) on a single NVIDIA RTX 5070 (16 GB).

Safety note: Training data was filtered by DNSMOS perceived quality scores, which can misclassify distressed speech (e.g. screaming, crying) as noise. This model may therefore attenuate or distort such signals and should not be relied upon for emergency call or safety-critical applications.

Data

See deepvqe-ggml for training code and full documentation.
