🏆 BrightoSV Speaker Verification V1.2

COMMERCIAL SOTA • GLOBAL RELEASE

Bank-Grade Voice Identity Verification

BrightoSV Speaker Verification V1.2 is a Commercial SOTA voiceprint verification system, engineered for fully offline, on-premise deployment in banking, government, healthcare, and enterprise sectors. It answers one fundamental question: "Is this person who they claim to be?"

Trained on a multilingual corpus of 22,000+ speakers spanning 9+ languages, with a 600,000+ speaker cohort for score normalization, and evaluated on 20 million scored pairs, so that results at extreme operating points rest on large sample counts rather than extrapolation.

🏆 Key Performance Indicators

🏦 Bank-Grade Benchmarks (4-Second QA Gate, 4s/2s Windows)

Evaluated on 10,000,000 positive + 10,000,000 negative pairs with strict bank-grade QA (Audio ≥ 4s, SNR ≥ 10dB, Speech Ratio ≥ 15%):

Metric	3-Enrollment	5-Enrollment	Significance
EER	1.478%	1.184% 👑	🏆 Commercial SOTA. Sub-1.2% on a diverse multilingual test set.
FRR @ FAR=0.1%	4.31%	3.18% 👑	✅ Bank-Grade Achieved. Under 5% at 1-in-1,000 impostor protection.
FRR @ FAR=0.01%	8.60%	6.65%	🔒 Maximum Security. 1-in-10,000 impostor protection for high-value transactions.
Tail Gap 5%	+4.4628	+5.3463 👑	🛡️ Full Separation. The lowest-scoring 5% of genuine trials still sit above the strongest impostor region.
Latency ⚡	n/a	< 60ms	Real-time processing on consumer GPU.
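
For readers who want to compute metrics of this kind on their own score lists, the following is a minimal sketch of how EER and FRR at a fixed FAR are commonly derived from genuine and impostor scores. It is not BrightoSV's evaluation code: the function names, the "accept when score > threshold" convention, and the brute-force threshold sweep are assumptions made for illustration, and the sweep only suits small lists.

```python
from typing import List, Tuple


def frr_at_far(genuine: List[float], impostor: List[float],
               target_far: float) -> Tuple[float, float]:
    """Return (threshold, FRR) at the lowest threshold whose FAR <= target_far.

    A trial is accepted when its score is strictly greater than the threshold.
    """
    imp_sorted = sorted(impostor, reverse=True)
    # Number of impostor trials that may be accepted without exceeding the target.
    allowed = int(target_far * len(imp_sorted))
    # Placing the threshold at the (allowed + 1)-th highest impostor score
    # accepts at most `allowed` impostors.
    threshold = imp_sorted[allowed] if allowed < len(imp_sorted) else float("-inf")
    rejected = sum(1 for s in genuine if s <= threshold)
    return threshold, rejected / len(genuine)


def equal_error_rate(genuine: List[float], impostor: List[float]) -> float:
    """Approximate the EER by sweeping every observed score as a threshold."""
    best_gap, best_rate = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(1 for s in impostor if s > t) / len(impostor)
        frr = sum(1 for s in genuine if s <= t) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


# Toy scores, invented for the example.
genuine_scores = [0.9, 0.8, 0.7, 0.6, 0.35]
impostor_scores = [0.1, 0.2, 0.3, 0.4, 0.65]
print(equal_error_rate(genuine_scores, impostor_scores))
print(frr_at_far(genuine_scores, impostor_scores, target_far=0.2))
```

At 20 million pairs a production evaluation would sort the scores once and walk both lists together instead of rescanning them for every threshold.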

📱 Extended Coverage Benchmarks (2-Second QA Gate, 2s/1s Windows)

For scenarios requiring shorter audio — mobile apps, call centers, IoT:

Metric	3-Enrollment	5-Enrollment	Significance
EER	2.362%	1.751% 👑	🏆 Best-in-Class. Strong accuracy even with 2-second audio.
FRR @ FAR=0.1%	8.02%	6.16%	✅ Practical Consumer UX. Manageable retry rate for mobile.
FRR @ FAR=0.01%	15.91%	12.58%	🔒 High Security on Short Audio. Viable for multi-factor deployments.
Tail Gap 5%	+2.3151	+3.1082	🛡️ Positive Separation. Genuine and impostor tails clearly separated.

📊 Evaluation Methodology — Statistical Rigor

Unlike many speaker verification benchmarks that report on small test sets, BrightoSV V1.2 is evaluated at industrial scale:

Aspect	Detail
Positive pairs	10,000,000 (same-speaker, cross-utterance)
Negative pairs	10,000,000 (different-speaker, balanced 1:1)
Total scored pairs	20,000,000
Unique speakers	3,900+ in evaluation set
Multi-enrollment	3 and 5 enrollment utterances, mean-aggregated
Score normalization	AS-NORM with 600,000+ speaker cohort
Quality fusion	QMF (Quality Metric Fusion) — compensates for speaker-specific and duration offsets
QA gates	5-gate bank-grade: Duration, Clipping, RMS Energy, Speech Ratio, SNR

This scale means that the operating point at FAR=0.01% (1 in 10,000) is backed by an actual count of 1,000 impostor threshold crossings (0.01% of the 10,000,000 negative pairs), not by statistical extrapolation.
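
The arithmetic behind that claim can be checked in a few lines. This sketch also shows why the count matters, under the assumption (not stated in the source) that rare threshold crossings behave roughly like a Poisson count, so the relative standard error is about 1/sqrt(count). The 250,000-pair comparison set is hypothetical.

```python
import math


def expected_crossings(n_impostor_pairs: int, one_in_n: int) -> float:
    """Expected number of impostor trials above threshold at FAR = 1 / one_in_n."""
    return n_impostor_pairs / one_in_n


def relative_std_error(count: float) -> float:
    """Rough relative standard error of a rare-event count (Poisson assumption)."""
    return 1.0 / math.sqrt(count)


for label, pairs in [("10M negative pairs", 10_000_000),
                     ("hypothetical 250K negative pairs", 250_000)]:
    k = expected_crossings(pairs, 10_000)  # FAR = 0.01%
    print(f"{label}: {k:.0f} crossings, ~{100 * relative_std_error(k):.0f}% relative error")
# 10M negative pairs: 1000 crossings, ~3% relative error
# hypothetical 250K negative pairs: 25 crossings, ~20% relative error
```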

🌍 Multilingual Training — Global Voice Coverage

BrightoSV V1.2 is trained on a large-scale multilingual corpus ensuring language-agnostic voiceprint extraction:

Language	Coverage	Notes
🇻🇳 Vietnamese	★★★★★	Primary language. Extensive dialect coverage (Northern, Central, Southern).
🇬🇧 English	★★★★★	Primary language. Multiple accents (US, UK, AU, Indian, Singapore).
🇨🇳 Chinese	★★★★★	Primary language. Mandarin and regional variants.
🇪🇸 Spanish	★★★★☆	Native speaker corpus
🇩🇪 German	★★★★☆	European language coverage
🇫🇷 French	★★★★☆	Including African French variants
🇳🇱 Dutch	★★★★☆	European language coverage
🇯🇵 Japanese	★★★★☆	Native speaker corpus
🇰🇷 Korean	★★★★☆	Native speaker corpus
🇸🇦 Arabic	★★★★☆	Multiple dialect coverage

Key principle: Speaker identity is carried by vocal tract shape, pitch dynamics, and articulatory patterns, which are language-independent. A speaker can enroll in Vietnamese and verify in English. The model extracts who is speaking, not what is being said.

🛡️ Robustness — Augmentation & Real-World Resilience

The model is hardened against real-world audio degradation through comprehensive augmentation during training:

🔊 Noise Resilience

Category	Examples	Goal
🏙️ Urban	Street noise, sirens, traffic, construction	On-the-go verification
🏠 Domestic	TV/radio background, appliances, children	Work-from-home reliability
🗣️ Babble	Crowd noise, overlapping speakers, cafeteria	The hardest scenario — solved
⛈️ Natural	Wind, rain, thunder	Outdoor stability
🐾 Biological	Coughing, sneezing, baby crying	Disentangle speaker from artifacts

📡 Channel & Codec Resilience

Codec / Channel	Simulation
GSM / AMR	Mobile telephony compression
VoIP (Zalo, WhatsApp)	Internet calling artifacts
MP3 / AAC / OGG	Lossy compression at various bitrates
Microphone variance	Laptop, phone, headset, far-field
Room acoustics	Reverb, echo, room impulse response

🎛️ SpecAugment

Time and frequency masking applied during training to prevent overfitting to specific spectral patterns, forcing the model to learn robust speaker representations from partial information.
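
As an illustration of the masking idea, here is a minimal SpecAugment-style sketch on a plain list-of-lists spectrogram. The mask widths, the single mask per axis, and the zero fill value are assumptions chosen for the example. The source does not state BrightoSV's actual augmentation settings, and real training pipelines apply this to tensors.

```python
import random
from typing import List

Spectrogram = List[List[float]]  # indexed as [frequency_bin][time_frame]


def spec_augment(spec: Spectrogram, max_freq_mask: int, max_time_mask: int,
                 rng: random.Random, fill: float = 0.0) -> Spectrogram:
    """Return a copy of `spec` with one frequency band and one time span masked."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Frequency mask: blank out `f` consecutive frequency bins.
    f = rng.randint(0, min(max_freq_mask, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [fill] * n_time

    # Time mask: blank out `t` consecutive frames in every bin.
    t = rng.randint(0, min(max_time_mask, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = fill
    return out


spectrogram = [[1.0] * 10 for _ in range(8)]  # toy 8-bin x 10-frame input
augmented = spec_augment(spectrogram, max_freq_mask=2, max_time_mask=3,
                         rng=random.Random(0))
```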

Result: The model maintains accuracy down to 10 dB SNR — equivalent to speaking in a moderately noisy café. Below 10 dB, the QA gate rejects the audio before inference, protecting against unreliable decisions.

🎯 Production Deployment

Three Security Levels

Level	Min Audio	Enrollment	FAR Options	Use Case
🏦 `bank_strict`	≥ 4.0s	5 samples	0.1% / 0.01%	High-value banking, government
🏛️ `bank_flex`	≥ 4.0s	3 samples	0.1% / 0.01%	Standard banking, telecom
📱 `consumer`	≥ 2.0s	3 samples	0.1% / 0.01%	Mobile apps, call centers, IoT
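
One way an integrating service might encode the table above is a small lookup of per-level requirements. The level names and values come from the table. The class, the function, and the reading of the enrollment column as a minimum count are hypothetical and are not part of any published BrightoSV API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SecurityLevel:
    min_audio_s: float                   # minimum audio duration in seconds
    enrollment_samples: int              # enrollment utterances (assumed minimum)
    far_options_pct: Tuple[float, ...]   # selectable FAR operating points, in percent


SECURITY_LEVELS: Dict[str, SecurityLevel] = {
    "bank_strict": SecurityLevel(4.0, 5, (0.1, 0.01)),
    "bank_flex": SecurityLevel(4.0, 3, (0.1, 0.01)),
    "consumer": SecurityLevel(2.0, 3, (0.1, 0.01)),
}


def validate_request(level: str, audio_s: float, n_enrolled: int,
                     far_pct: float) -> List[str]:
    """Return the list of requirement violations (empty means the request is valid)."""
    cfg = SECURITY_LEVELS[level]
    problems = []
    if audio_s < cfg.min_audio_s:
        problems.append(f"audio shorter than {cfg.min_audio_s}s")
    if n_enrolled < cfg.enrollment_samples:
        problems.append(f"fewer than {cfg.enrollment_samples} enrollment samples")
    if far_pct not in cfg.far_options_pct:
        problems.append(f"FAR {far_pct}% is not an available operating point")
    return problems


print(validate_request("bank_strict", audio_s=4.5, n_enrolled=5, far_pct=0.01))
```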

Scoring Pipeline

Audio → QA Gate (5 checks) → Windowed Embedding Extraction → AS-NORM (600K cohort) → QMF → Decision

Component	Detail
Embedding	512-dimensional voiceprint
Score normalization	AS-NORM with top-300 cohort matching
Quality fusion	Cohort Mean Fusion + Duration compensation
Multi-enrollment	Mean-aggregated across sessions
Storage	~2 KB per enrolled speaker
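
For readers unfamiliar with adaptive score normalization, the sketch below shows the commonly published form of AS-Norm: the raw cosine score is standardized against the top-N cohort scores of the enrollment side and of the test side, and the two results are averaged. It assumes cosine scoring and population standard deviation, and it uses a toy 2-D cohort. The source only states "AS-NORM with top-300 cohort matching", so the details here are illustrative rather than BrightoSV's implementation.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_cohort_stats(emb: Vector, cohort: Sequence[Vector],
                     top_n: int) -> Tuple[float, float]:
    """Mean and standard deviation of the top_n highest cohort scores for `emb`."""
    scores = sorted((cosine(emb, c) for c in cohort), reverse=True)[:top_n]
    return mean(scores), max(pstdev(scores), 1e-8)  # guard against zero spread


def as_norm(enroll: Vector, test: Vector, cohort: Sequence[Vector],
            top_n: int = 300) -> float:
    """Adaptive symmetric normalization of the enroll/test cosine score."""
    raw = cosine(enroll, test)
    mu_e, sd_e = top_cohort_stats(enroll, cohort, top_n)
    mu_t, sd_t = top_cohort_stats(test, cohort, top_n)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


# Toy 2-D example; real embeddings are 512-dimensional.
cohort = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
print(as_norm((1.0, 1.0), (1.0, 0.9), cohort, top_n=3))
```

Scanning a 600K-embedding cohort per trial is expensive, so implementations typically precompute the cohort score matrix or run the top-N selection on a GPU.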

QA Gate — Mandatory Pre-Check

All audio MUST pass bank-grade QA before inference:

Check	Threshold	Purpose
Duration	≥ 2s or ≥ 4s	Sufficient speech content
Clipping	< 0.1%	No distorted audio
RMS Energy	−45 to −5 dBFS	Proper recording level
Speech Ratio	≥ 15%	Actual speech, not silence
SNR	≥ 10 dB	Acceptable noise level

Without the QA gate, tail performance degrades significantly. QA is mandatory for production.
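
The five checks can be expressed as a simple gate over precomputed audio statistics. The thresholds below are taken from the table. The `AudioStats` container, the field names, and the assumption that these measurements are produced upstream (for example by a VAD and an SNR estimator) are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioStats:
    duration_s: float      # audio duration in seconds
    clipping_ratio: float  # fraction of clipped samples, 0..1
    rms_dbfs: float        # RMS level in dBFS
    speech_ratio: float    # fraction of the audio detected as speech, 0..1
    snr_db: float          # estimated signal-to-noise ratio in dB


def qa_gate(stats: AudioStats, min_duration_s: float = 4.0) -> List[str]:
    """Return the names of failed checks; an empty list means the audio passes."""
    failures = []
    if stats.duration_s < min_duration_s:       # 2.0 for consumer, 4.0 for bank levels
        failures.append("duration")
    if stats.clipping_ratio >= 0.001:           # must be below 0.1%
        failures.append("clipping")
    if not (-45.0 <= stats.rms_dbfs <= -5.0):
        failures.append("rms_energy")
    if stats.speech_ratio < 0.15:
        failures.append("speech_ratio")
    if stats.snr_db < 10.0:
        failures.append("snr")
    return failures


clean = AudioStats(duration_s=5.0, clipping_ratio=0.0, rms_dbfs=-20.0,
                   speech_ratio=0.6, snr_db=25.0)
print(qa_gate(clean))  # []
```

Returning the failed check names rather than a bare boolean lets a client tell the user what to fix, such as moving to a quieter place or speaking longer.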

🆚 Performance Context

Why These Numbers Matter

Metric	BrightoSV V1.2 (4s/5e)	What It Means
EER	1.184%	At the threshold where false accepts and false rejects are equally likely, about 12 of every 1,000 verification attempts are errors
FRR @ FAR=0.1%	3.18%	At 1-in-1,000 impostor protection, only 3.2% of genuine users need to retry
FRR @ FAR=0.01%	6.65%	At 1-in-10,000 impostor protection, only 6.7% of genuine users need to retry
Tail Gap 5%	+5.35	The hardest 5% of genuine speakers are still 5.35 score units above the strongest impostor region

Scale Comparison

Aspect	Typical Academic Eval	BrightoSV V1.2
Test pairs	~100K–500K	20,000,000
Speakers	~100–500	3,900+
Score normalization	Often none	AS-NORM (600K cohort)
Multi-enrollment	Rarely tested	3-enroll and 5-enroll
QA gates	Rarely applied	5-gate bank-grade
Languages	Usually 1–2	9+ languages

⚙️ Technical Specifications

Specification	Value
Model Version	V1.2 (Commercial SOTA)
Parameters	316M (High-Capacity Self-Supervised Backbone)
Embedding Dimension	512
Input Sample Rate	16kHz (Auto-resampling supported)
Input Formats	WAV, FLAC, MP3, OGG, M4A
Output	512D L2-normalized embedding
Backends	PyTorch, ONNX, HuggingFace
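
To make the embedding figures concrete, the sketch below mean-aggregates several enrollment embeddings, scores a test embedding by cosine similarity, and serializes a 512-dimensional vector. Re-normalizing after averaging and storing values as 32-bit floats are assumptions. The float32 choice is merely consistent with the roughly 2 KB per speaker quoted in this card (512 × 4 bytes = 2,048 bytes).

```python
import math
import struct
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def aggregate_enrollments(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the enrollment embeddings, then re-normalize to unit length."""
    dim = len(embeddings[0])
    avg = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(avg)


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def to_float32_bytes(v: Sequence[float]) -> bytes:
    """Serialize a vector as little-endian 32-bit floats."""
    return struct.pack(f"<{len(v)}f", *v)


voiceprint = aggregate_enrollments([[1.0, 0.0], [0.0, 1.0]])  # toy 2-D vectors
print(cosine_score(voiceprint, [1.0, 1.0]))
print(len(to_float32_bytes([0.0] * 512)))  # 2048
```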

🚀 Hardware & Performance

Specification	Value
GPU Support	NVIDIA T4, A10, A100, H100, L4
CPU Support	Intel Xeon, AMD EPYC (via ONNX)
Inference Latency	< 60ms (GPU) / < 400ms (CPU ONNX)
Model Size	~1.2 GB
Batch Processing	Supported
Deployment	Fully offline after initial download

🌍 Application Scenarios

Sector	Use Case	Recommended Level
🏦 Banking & Finance	Wire transfers, Voice Banking, Phone Banking	`bank_strict`
🆔 eKYC	Customer onboarding, Remote identity verification	`bank_flex`
🪙 Crypto & FinTech	Wallet protection, Transaction authorization	`bank_strict`
✈️ National Security	Border control, Immigration screening	`bank_strict`
🎧 Call Centers	Caller identity verification, Fraud prevention	`bank_flex`
🏥 Healthcare	Patient identity, Telemedicine authentication	`bank_flex`
📱 Consumer Apps	Voice login, Smart home, Voice assistants	`consumer`

🔒 Privacy & Security

Aspect	Implementation
Audio Retention	Zero. Audio is processed in RAM and immediately discarded.
Voiceprint	512 numbers. Non-reversible: the voice cannot be reconstructed from it.
Deployment	On-premise or private cloud. No external calls.
Compliance	GDPR, PDPA, PCI-DSS ready
Data Sovereignty	100% local processing. Your data never leaves your infrastructure.

🤝 Combined with Anti-Spoofing

For maximum security, deploy BrightoSV Speaker Verification alongside BrightoSV Anti-Spoofing V1.5:

Audio → Anti-Spoof Check (Is this a real voice?) → Speaker Verify (Is this the right person?) → Decision

Layer	Model	Purpose
Layer 1	Anti-Spoof V1.5	Reject deepfakes, replay attacks, TTS
Layer 2	Speaker Verify V1.2	Confirm speaker identity

This dual-layer architecture provides defense-in-depth: even if a sophisticated deepfake passes liveness detection, it must still match the enrolled voiceprint, and a voice that matches the voiceprint must still pass liveness detection.
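
The two-layer flow can be sketched as a short decision function. Both model calls are passed in as plain callables because the actual BrightoSV interfaces are not documented here. The names `is_bona_fide` and `speaker_score`, the threshold value, and the result labels are all hypothetical.

```python
from typing import Callable


def verify_with_antispoof(audio: bytes,
                          is_bona_fide: Callable[[bytes], bool],
                          speaker_score: Callable[[bytes], float],
                          threshold: float) -> str:
    """Run the anti-spoofing layer first, then speaker verification."""
    # Layer 1: reject synthetic or replayed audio before any identity decision.
    if not is_bona_fide(audio):
        return "reject_spoof"
    # Layer 2: the voice must also match the enrolled voiceprint.
    if speaker_score(audio) < threshold:
        return "reject_speaker"
    return "accept"


# Stand-in callables for demonstration only.
decision = verify_with_antispoof(
    b"fake-audio-bytes",
    is_bona_fide=lambda a: True,
    speaker_score=lambda a: 3.2,
    threshold=2.5,
)
print(decision)  # accept
```

Running the anti-spoofing check first means a detected spoof never reaches the identity comparison, and the distinct labels keep the two failure modes separable in audit logs.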

📈 Roadmap

Version	Status	Highlight
V1.2	🟢 Current	Commercial SOTA — EER 1.184%, Bank-grade verified
V1.5 (LMF)	🔵 In progress	Large Margin Fine-tuning — targeting FRR@0.01% < 5%
V2.0	🟡 Planned	Next-generation architecture

📞 Access & Licensing

This model is private and available exclusively to enterprise partners under NDA.

Commercial & Deployment

Full-package license or API access
Integration support on request (deployment, performance optimization, quality monitoring)
SphinX Joint Stock Company (sphinxjsc.com) is authorized to package the model, provide the API, and handle distribution

Copyright & License

Commercial / Proprietary. Use, redistribution, or creation of derivative works requires written approval from BrighTO Technology.

Contact

Purpose	Contact
Commercial Licensing	`nguyen@brighto.ai`, `nghia@brighto.ai`
API & Distribution	`duc@sphinxjsc.com` (SphinX JSC)
Technical Inquiries	`nguyen@hatto.com`

🏆 BrightoSV Speaker Verification V1.2

Commercial SOTA • Multilingual • Bank-Grade • Offline-Ready

EER 1.184% · 20M Eval Pairs · 600K Cohort · 9+ Languages

Built in Vietnam 🇻🇳 • Engineered for the World 🌏

This model card refers to BrightoSV Speaker Verification V1.2 (Commercial SOTA Release). All benchmark results are verified on internal evaluation sets comprising 20,000,000 scored pairs across 3,900+ speakers with strict bank-grade QA methodology and AS-NORM score normalization using a 600,000+ speaker cohort.
