DeepVQE-AEC (GGUF)

GGML/GGUF inference model for DeepVQE (Indenbom et al., Interspeech 2023): joint acoustic echo cancellation (AEC), noise suppression, and dereverberation.

Quick Start

Build

Requires CMake 3.20+ and a C++17 compiler. The ggml library is included as a git submodule.

git clone --recursive https://github.com/richiejp/deepvqe-ggml
cd deepvqe-ggml/ggml

# CLI only
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# With shared library (C API for FFI from Python, Go, etc.)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DDEEPVQE_BUILD_SHARED=ON
cmake --build build

CLI

Process STFT-domain audio (NumPy .npy files):

./build/deepvqe deepvqe.gguf --input-npy mic_stft.npy ref_stft.npy
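
The .npy inputs can be produced with NumPy using the STFT settings listed under Model Details (512-point FFT, 256-sample hop, sqrt-Hann window, 16 kHz). This is a sketch: the exact array layout the CLI expects (frames × bins, complex64, no padding) is an assumption, not documented here.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    win = np.sqrt(np.hanning(n_fft))          # sqrt-Hann analysis window
    n_frames = 1 + (len(x) - n_fft) // hop    # full frames only, no padding
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1).astype(np.complex64)  # (frames, 257)

mic = np.zeros(16000, dtype=np.float32)  # replace with real 16 kHz mic capture
ref = np.zeros(16000, dtype=np.float32)  # far-end reference signal
np.save("mic_stft.npy", stft(mic))
np.save("ref_stft.npy", stft(ref))
```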

C API

The shared library (libdeepvqe.so) exposes a simple C API that can be called from any language with C FFI support:

#include "deepvqe_api.h"

// Load model
uintptr_t ctx = deepvqe_new("deepvqe.gguf");

// Process 16kHz mono float32 audio
//   mic: microphone input (with echo + noise)
//   ref: far-end reference (what the speaker is hearing)
//   out: cleaned output (pre-allocated, same length)
int ret = deepvqe_process_f32(ctx, mic, ref, n_samples, out);

// int16 PCM variant also available
ret = deepvqe_process_s16(ctx, mic_s16, ref_s16, n_samples, out_s16);

deepvqe_free(ctx);

Python (ctypes)

import ctypes, numpy as np

lib = ctypes.CDLL("./build/libdeepvqe.so")
lib.deepvqe_new.restype = ctypes.c_void_p
lib.deepvqe_new.argtypes = [ctypes.c_char_p]
lib.deepvqe_process_f32.restype = ctypes.c_int
lib.deepvqe_process_f32.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,
    ctypes.c_int, ctypes.c_void_p,
]
lib.deepvqe_free.argtypes = [ctypes.c_void_p]

ctx = lib.deepvqe_new(b"deepvqe.gguf")

mic = np.zeros(16000, dtype=np.float32)  # 1 second of 16kHz audio
ref = np.zeros(16000, dtype=np.float32)
out = np.empty_like(mic)

ret = lib.deepvqe_process_f32(
    ctx,
    mic.ctypes.data,
    ref.ctypes.data,
    len(mic),
    out.ctypes.data,
)

lib.deepvqe_free(ctx)

Used in production by VoxInput for real-time voice input with echo cancellation.

Model Details

| Property | Value |
|---|---|
| Architecture | DeepVQE with AlignBlock (soft delay estimation) |
| Parameters | ~8.0M |
| Sample rate | 16 kHz |
| STFT | 512 FFT, 256 hop (16 ms), sqrt-Hann window |
| Delay range | dmax = 32 frames (320 ms) |
| Format | GGUF |
| Variants | F32 (31 MB), Q8_0 (8.5 MB) |
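
The hop size fixes the model's frame timing; a quick arithmetic check of the 16 ms figure quoted above:

```python
# Frame timing implied by the STFT settings: 256-sample hop at 16 kHz.
sr = 16000       # sample rate (Hz)
hop = 256        # STFT hop (samples)
hop_ms = hop / sr * 1000
fps = sr / hop   # model frames per second of audio
print(hop_ms, fps)
```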

Quantization

The Q8_0 variant (deepvqe_q8.gguf) reduces model size by 73% (31 MB to 8.5 MB) using GGML Q8_0 quantization with selective layer preservation.

| Layer group | Quantization | Reason |
|---|---|---|
| Encoder/decoder (2-5) weights | Q8_0 | Residual connections mitigate error |
| Bottleneck GRU + FC weights | Q8_0 | Largest tensors (~3.6M params) |
| AlignBlock (attention) | F32 | Softmax precision for delay estimation |
| dec1 (mask output) | F32 | Directly controls complex convolving mask |
| All biases, ChannelAffine | F32 | Small tensors, negligible size savings |

Divergence from F32: output max error 5e-2, mean error 7e-4.
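
The size reduction follows directly from the Q8_0 block format, which stores blocks of 32 int8 weights plus one f16 scale (34 bytes per 32 parameters). A back-of-envelope sketch, ignoring GGUF metadata and the layers kept in F32:

```python
# Rough size estimate for the two variants (GGUF overhead ignored).
n_params = 8.0e6                        # ~8.0M parameters (see Model Details)
f32_mb = n_params * 4 / 1e6             # 4 bytes per F32 weight
q8_mb = n_params * (34 / 32) / 1e6      # if every tensor were Q8_0
print(f"F32 ~{f32_mb:.1f} MB, Q8_0 ~{q8_mb:.1f} MB")
```

Both estimates land close to the quoted 31 MB / 8.5 MB file sizes; the small F32-preserved tensors add little on top.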

Training

Trained on the full DNS5 16 kHz dataset (~300K clean speech files after DNSMOS quality filtering, 64K noise, 60K impulse responses) on a single NVIDIA RTX 5070 (16 GB).

Safety note: Training data was filtered by DNSMOS perceived quality scores, which can misclassify distressed speech (e.g. screaming, crying) as noise. This model may therefore attenuate or distort such signals and should not be relied upon for emergency call or safety-critical applications.

Data

See deepvqe-ggml for training code and full documentation.
