# Argus Sentinel – WAF ML Classifier (V3)

Production-grade Web Application Firewall classifier. Detects 6 attack types in HTTP requests with sub-millisecond latency on CPU.

Key metrics (`test_realistic`, production-like distribution, 94% clean):

- Macro F1: 0.866 | FPR: 0.83% | Mean attack recall: 0.889 | Latency: 0.24 ms
## Model Overview
| Property | Value |
|---|---|
| Architecture | CNN text encoder + numeric features fusion |
| Parameters | 1.17M |
| Vocab size | 8,192 (BPE ByteLevel) |
| Max sequence length | 128 tokens |
| ONNX model size | 4.5 MB (FP32) / 1.2 MB (INT8) |
| Inference latency | 0.24 ms avg (CPU, single thread) |
| Training loss | Focal BCEWithLogitsLoss (gamma=2.0) |
| Best epoch | 3 / 8 (early stopping, selected on Macro F1) |
## Architecture

```
HTTP Request Text
        |
        v
[BPE Tokenizer (vocab=8192, max_len=128)]
        |
        +---> [Embedding (128-dim)]
        |               |
        |       [Conv1D (128 ch, k=3) + BatchNorm + ReLU] x2
        |               |
        |       [AdaptiveMaxPool1d -> 128-dim]
        |
        +---> [6 Numeric Features]
                        |
                [Linear 6->32 + ReLU]
                        |
          [Concatenate (128 + 32 = 160)]
                        |
     [Linear 160->128->64 + ReLU + Dropout(0.1)]
                        |
              +---------+---------+
              |                   |
      [Label Head -> 7]    [Risk Head -> 1]
              |                   |
          [Sigmoid]           [Sigmoid]
              |                   |
      label_probs [7]      risk_score [1]
```
## Tokenizer Specification

| Property | Value |
|---|---|
| Type | BPE (Byte-Pair Encoding) via HuggingFace tokenizers library |
| Algorithm | ByteLevel BPE: operates on UTF-8 bytes, not characters |
| Pre-tokenizer | ByteLevel (add_prefix_space=false, trim_offsets=true, use_regex=true) |
| Normalizer | None (raw bytes, no lowercasing or unicode normalization) |
| Post-processor | TemplateProcessing: prepends [CLS] token automatically |
| Vocab size | 8,192 tokens (7,933 merges + 3 special tokens + 256 byte tokens) |
| Special tokens | [PAD] (id=0), [UNK] (id=1), [CLS] (id=2) |
| Max length | 128 tokens (truncation=Right, padding=Right to fixed 128) |
| Byte fallback | false: unknown bytes map to [UNK] |
| File | tokenizer.json (HuggingFace tokenizers JSON format) |
Input text construction: `"{method} {path}?{query} {body[:200]}"`, capped at 500 chars before tokenization.

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
result = tok.encode("GET /search?q=test HTTP/1.1")
# result.ids            -> [2, 546, 287, ...] (starts with [CLS]=2)
# result.attention_mask -> [1, 1, 1, ...]
```
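The construction rule above can be expressed as a small helper. This is a minimal sketch; `build_input_text` is a hypothetical name, not one of the shipped artifacts:

```python
def build_input_text(method, path, query, body):
    # Builds "{method} {path}?{query} {body[:200]}", capped at 500 chars,
    # matching the documented input-text construction rule.
    text = f"{method} {path}"
    if query:
        text += f"?{query}"
    if body:
        text += f" {body[:200]}"
    return text[:500]
```

Feed the returned string directly to `tok.encode(...)`.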
## Label ID Mapping (CRITICAL)

Output `label_probs` tensor shape: `[batch, 7]`. Each index maps to:

| Index | Label | Description |
|---|---|---|
| 0 | clean | Benign / legitimate request |
| 1 | xss | Cross-Site Scripting |
| 2 | sqli | SQL Injection |
| 3 | path_traversal | Directory / Path Traversal |
| 4 | command_injection | OS Command Injection |
| 5 | scanner | Vulnerability scanner / probe |
| 6 | spam_bot | Spam bot / automated abuse |

Multi-label: labels are NOT mutually exclusive. Multiple labels can be active simultaneously (e.g., index 2 + index 5 = a scanner performing SQLi). Exception: clean (index 0) is exclusive: if clean=1, all others must be 0.
## ONNX Inputs

| Name | Shape | Dtype | Description |
|---|---|---|---|
| input_ids | [batch, 128] | int32 | BPE token IDs from tokenizer.json |
| attention_mask | [batch, 128] | int32 | 1 for real tokens, 0 for [PAD] |
| numeric_features | [batch, 6] | float32 | Request-level features (RAW values, see below) |
## Numeric Features – Normalization Parameters (CRITICAL)

Features are passed as RAW values: the model was trained on unnormalized features. Pass the same raw scale at inference.

| Index | Feature | Computation | Training Range | Mean | Std |
|---|---|---|---|---|---|
| 0 | content_length | `len(body) if body else 0` | 0 – 662 | 35.2 | 63.5 |
| 1 | num_headers | `len(headers_dict)` | 3 – 13 | 7.7 | 1.3 |
| 2 | has_body | `1.0` if body present, else `0.0` | 0 – 1 | 0.42 | 0.49 |
| 3 | session_request_count | Total requests in session, or 0 | 0 – 20 | 3.0 | 6.0 |
| 4 | session_duration | Session time span in seconds, or 0 | 0 – 4,965,381 | 619,322 | 1,198,151 |
| 5 | session_pattern_score | Behavioral pattern score, or 0 | 0 – 0.5 | 0.09 | 0.18 |
Python:

```python
def extract_numeric_features(request: dict) -> list[float]:
    body = request.get("body") or ""
    headers = request.get("headers") or {}
    return [
        float(len(body)),                                  # content_length
        float(len(headers)),                               # num_headers
        1.0 if body else 0.0,                              # has_body
        float(request.get("session_request_count") or 0),  # session_request_count
        float(request.get("session_duration") or 0),       # session_duration
        float(request.get("session_pattern_score") or 0),  # session_pattern_score
    ]
```
Rust:

```rust
fn extract_numeric_features(request: &HttpRequest) -> [f32; 6] {
    let body_len = request.body.as_ref().map_or(0, |b| b.len());
    [
        body_len as f32,
        request.headers.len() as f32,
        if body_len > 0 { 1.0 } else { 0.0 },
        request.session_request_count.unwrap_or(0) as f32,
        request.session_duration.unwrap_or(0.0),
        request.session_pattern_score.unwrap_or(0.0),
    ]
}
```

If you don't have session data, pass `[content_length, num_headers, has_body, 0.0, 0.0, 0.0]`; ~79% of training examples had null session features.
## ONNX Outputs

| Name | Shape | Dtype | Description |
|---|---|---|---|
| label_probs | [batch, 7] | float32 | Per-label probabilities after sigmoid |
| risk_score | [batch, 1] | float32 | Aggregate risk score in [0, 1] |
## Per-Label Thresholds (CRITICAL for deployment)

Do NOT use a default 0.5 threshold for all labels. Use these optimized thresholds from `thresholds.json`:
| Label | Threshold | Recall | Precision | F1 |
|---|---|---|---|---|
| clean | 0.20 | 0.998 | 0.992 | 0.995 |
| xss | 0.50 | 0.951 | 0.585 | 0.724 |
| sqli | 0.74 | 0.732 | 0.940 | 0.823 |
| path_traversal | 0.68 | 0.896 | 0.794 | 0.842 |
| command_injection | 0.66 | 0.826 | 0.626 | 0.712 |
| scanner | 0.70 | 0.980 | 0.945 | 0.962 |
| spam_bot | 0.72 | 1.000 | 1.000 | 1.000 |
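Applying the table is a per-label comparison plus the clean-exclusivity rule. A minimal decode sketch (thresholds hardcoded from the table above for illustration; in production load them from `thresholds.json`):

```python
LABELS = ["clean", "xss", "sqli", "path_traversal",
          "command_injection", "scanner", "spam_bot"]
THRESHOLDS = [0.20, 0.50, 0.74, 0.68, 0.66, 0.70, 0.72]  # per-label, same order

def decode(probs):
    # Multi-label decode: every attack label at or above its threshold fires.
    # "clean" is exclusive, so it only holds when no attack label fires.
    attacks = [name for name, p, t in zip(LABELS[1:], probs[1:], THRESHOLDS[1:])
               if p >= t]
    return attacks if attacks else ["clean"]
```

For example, a probability vector with sqli=0.9 and scanner=0.8 decodes to both labels at once.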
## Performance

### Production-Like (`test_realistic`, 25,000 examples, 94% clean)
| Metric | Value |
|---|---|
| Macro F1 | 0.866 |
| FPR on clean | 0.83% |
| Mean attack recall | 0.889 |
Per-label results:

| Label | Recall | Precision | F1 |
|---|---|---|---|
| clean | 0.998 | 0.992 | 0.995 |
| xss | 0.951 | 0.585 | 0.724 |
| sqli | 0.732 | 0.940 | 0.823 |
| path_traversal | 0.896 | 0.794 | 0.842 |
| command_injection | 0.826 | 0.626 | 0.712 |
| scanner | 0.980 | 0.945 | 0.962 |
| spam_bot | 1.000 | 1.000 | 1.000 |
### Stratified Stress Test (`test`, 49,830 examples)
| Metric | Value |
|---|---|
| Macro F1 | 0.787 |
| FPR on clean | 7.9% |
### Adversarial Robustness (`test_mixed_adversarial`, 22,250 examples)
| Metric | Value |
|---|---|
| Macro F1 | 0.499 |
| FPR on clean | 11.8% |
| XSS recall | 0.833 |
### Latency (ONNX Runtime, CPU, 1 thread, batch=1)
| Metric | FP32 | INT8 |
|---|---|---|
| Average | 0.24 ms | 1.20 ms |
| Throughput | ~4,100 req/s | ~830 req/s |
On CPU without VNNI, FP32 is faster than dynamic INT8. Use `model.onnx` on standard CPUs.
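Numbers like these can be reproduced with a simple timing harness. A generic sketch (`measure_latency_ms` is a hypothetical helper; pass it a closure that runs `session.run(...)` on a prepared batch):

```python
import time

def measure_latency_ms(run_once, warmup=50, iters=500):
    # Warm up first to amortize lazy initialization and caches,
    # then report average wall-clock time per call in milliseconds.
    for _ in range(warmup):
        run_once()
    t0 = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - t0) * 1000.0 / iters

# Usage (assumes session/feeds built as in the Usage section):
# avg_ms = measure_latency_ms(lambda: session.run(None, feeds))
```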
## Training
| Hyperparameter | V1 | V3 (current) |
|---|---|---|
| Loss | BCEWithLogitsLoss | Focal BCE (gamma=2.0) |
| Learning rate | 1e-3 | 1e-4 |
| Batch size | 256 | 128 |
| Epochs | 5 | 8 (early stop at 6, best=3) |
| Patience | 2 | 3 |
| Checkpoint selection | Best val_loss | Best Macro F1 |
| Calibration | None | Per-label threshold tuning |
| Data augmentation | None | +20k augmented (encoding, headers, noise, context swap) |
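The focal variant of BCE down-weights easy examples by a factor (1 - p_t)^gamma, which is why V3 tolerates the heavy class imbalance. A NumPy sketch of the per-element loss (illustrative only; the model card states training used a focal BCEWithLogitsLoss in the actual pipeline):

```python
import numpy as np

def focal_bce(logits, targets, gamma=2.0):
    # p_t is the predicted probability assigned to the true class per element.
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    t = np.asarray(targets, dtype=float)
    p_t = np.where(t == 1.0, p, 1.0 - p)
    bce = -np.log(np.clip(p_t, 1e-7, 1.0))
    # gamma=0 recovers plain BCE; gamma=2 suppresses loss from easy examples.
    return float(np.mean((1.0 - p_t) ** gamma * bce))
```

With gamma=0 this reduces exactly to mean binary cross-entropy; raising gamma shrinks the contribution of confidently-correct predictions.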
## Dataset
| Property | Value |
|---|---|
| Total examples | 498,345 (+20k augmented) |
| Training split | 418,685 |
| Real traffic | 62.6% |
| Synthetic | 37.4% |
| Multi-label | 17.0% |
| Hard negatives | 15.0% |
| Unique sources | 12 |
| Sources | CIC-IDS-2017, CSE-CIC-IDS-2018, HIKARI-2021, WebAttackPayloads, PayloadsAllTheThings, + synthetic |
## Usage

### Python (ONNX Runtime)
```python
import onnxruntime as ort
import numpy as np
import json
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
tokenizer = Tokenizer.from_file("tokenizer.json")
thresholds = json.load(open("thresholds.json"))["thresholds"]
label_names = ["clean", "xss", "sqli", "path_traversal",
               "command_injection", "scanner", "spam_bot"]

def classify_request(method, path, query, headers, body):
    # 1. Build text
    text = f"{method} {path}"
    if query: text += f"?{query}"
    if body: text += f" {body[:200]}"
    text = text[:500]
    # 2. Tokenize
    enc = tokenizer.encode(text)
    input_ids = np.array([enc.ids], dtype=np.int32)
    attention_mask = np.array([enc.attention_mask], dtype=np.int32)
    # 3. Numeric features (RAW values)
    numeric = np.array([[
        float(len(body or "")),
        float(len(headers)),
        1.0 if body else 0.0,
        0.0, 0.0, 0.0,  # session features (0 if unavailable)
    ]], dtype=np.float32)
    # 4. Inference
    probs, risk = session.run(None, {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "numeric_features": numeric,
    })
    # 5. Apply per-label thresholds
    detections = {
        name: float(probs[0][i])
        for i, name in enumerate(label_names)
        if name != "clean" and probs[0][i] >= thresholds[name]
    }
    return {
        "risk_score": float(risk[0][0]),
        "detections": detections,
        "is_clean": len(detections) == 0,
    }

# Example
result = classify_request("GET", "/search", "q=' OR 1=1--", {"Host": "example.com"}, None)
print(result)
# {'risk_score': 0.87, 'detections': {'sqli': 0.94}, 'is_clean': False}
```
### Rust (ort crate)

```rust
use ort::Session;
use ndarray::Array2;

fn main() -> anyhow::Result<()> {
    let session = Session::builder()?
        .with_model_from_file("model.onnx")?;

    let input_ids = Array2::<i32>::zeros((1, 128));       // from tokenizer
    let attention_mask = Array2::<i32>::zeros((1, 128));  // from tokenizer
    let numeric_features = Array2::<f32>::zeros((1, 6));  // extract_numeric_features()

    let outputs = session.run(ort::inputs![
        "input_ids" => &input_ids,
        "attention_mask" => &attention_mask,
        "numeric_features" => &numeric_features,
    ]?)?;

    let label_probs: Vec<f32> = outputs[0].extract_tensor::<f32>()?.view().iter().copied().collect();
    let risk_score: f32 = *outputs[1].extract_tensor::<f32>()?.view().first().unwrap();

    // Apply thresholds from thresholds.json
    let thresholds = [0.20, 0.50, 0.74, 0.68, 0.66, 0.70, 0.72];
    let labels = ["clean", "xss", "sqli", "path_traversal",
                  "command_injection", "scanner", "spam_bot"];
    for (i, (prob, thr)) in label_probs.iter().zip(thresholds.iter()).enumerate() {
        if i > 0 && prob >= thr {
            println!("DETECTED: {} ({:.3})", labels[i], prob);
        }
    }
    println!("Risk score: {:.4}", risk_score);
    Ok(())
}
```
## Decision Logic

```python
thresholds = json.load(open("thresholds.json"))["thresholds"]

# Per-label detection
triggered = [name for i, name in enumerate(label_names)
             if name != "clean" and probs[0][i] >= thresholds[name]]

# Risk-score action
score = float(risk[0][0])
if score >= 0.8:   action = "BLOCK"
elif score >= 0.5: action = "CHALLENGE"
elif score >= 0.2: action = "LOG"
else:              action = "ALLOW"
```
## Version History

### V3 (current) – Production-Hardened

Fixed V2's recall collapse. Multi-checkpoint selection on Macro F1. Per-label threshold optimization replaces Platt scaling.

### V2 – Focal Loss + Calibration (superseded)

Introduced Focal Loss and Platt calibration. FPR dropped to 0.18%, but XSS recall collapsed to 0.016 and CMDi to 0.222 due to aggressive calibration.

### V1 – Baseline

BCE loss, fixed 0.5 thresholds. High recall (~0.98) but lower Macro F1 (0.828) at the same 0.83% FPR as V3.

| Metric | V1 | V2 | V3 |
|---|---|---|---|
| Macro F1 | 0.828 | 0.669 | 0.866 |
| FPR | 0.83% | 0.18% | 0.83% |
| XSS recall | 0.980 | 0.016 | 0.951 |
| CMDi recall | 0.985 | 0.222 | 0.826 |
| Latency | 0.77 ms | 0.99 ms | 0.24 ms |
## Deployment Strategy

- Phase 1 – Shadow Mode: deploy alongside existing WAF rules, log predictions, compare decisions, tune thresholds.
- Phase 2 – Safe Blocking: enable blocking for high-confidence classes (scanner 0.98 recall, spam_bot 1.00, xss 0.95). Monitor FPR.
- Phase 3 – Full Deployment: activate all labels with `thresholds.json`. Use risk-score actions (BLOCK/CHALLENGE/LOG/ALLOW).
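Phase 1 amounts to running both systems on the same traffic and logging disagreements. A minimal sketch (`shadow_compare` and its field names are hypothetical; adapt them to your logging pipeline):

```python
def shadow_compare(waf_blocked, model_result, risk_block=0.8):
    # Compare the legacy WAF verdict with the model's would-be verdict
    # (model_result is the dict returned by classify_request in Usage).
    model_blocked = model_result["risk_score"] >= risk_block
    return {
        "disagreement": waf_blocked != model_blocked,
        "waf_blocked": waf_blocked,
        "model_blocked": model_blocked,
        "detections": model_result.get("detections", {}),
    }
```

Aggregating the `disagreement` records over a few days of traffic gives the data needed to tune thresholds before Phase 2.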
## Artifacts

| File | Size | Description |
|---|---|---|
| model.onnx | 4.5 MB | Production model (FP32, fastest on CPU) |
| model_int8.onnx | 1.2 MB | INT8 quantized (for VNNI hardware) |
| model_optimized.onnx | 4.5 MB | Graph-optimized FP32 |
| tokenizer.json | 510 KB | BPE tokenizer |
| config.json | 1.5 KB | Architecture + training config |
| thresholds.json | 1.3 KB | Per-label thresholds (must use at inference) |
| metrics.json | 12 KB | Full 3-set evaluation results |
| training_history.json | 7.3 KB | Per-epoch training history |
## Known Limitations
- SQLi recall at 0.73: High threshold (0.74) trades recall for precision. Lower to 0.60 if SQLi detection is critical.
- Adversarial robustness: Fuzzed/encoded payloads have lower recall (test_adversarial macro F1 = 0.50).
- No session-level model: Classifies individual requests. Session features help but don't replace session analysis.
- Sequence truncation: Requests truncated to 128 tokens. Place attack-relevant fields early in the text.
- FP32 > INT8 on CPU: without VNNI, FP32 is faster. Use `model.onnx` on standard CPUs.
## Citation

```bibtex
@misc{argus_sentinel_2026,
  title        = {Argus Sentinel: A Low-Latency CNN-Based WAF Classifier},
  author       = {Fizcko},
  year         = {2026},
  howpublished = {Hugging Face Model Hub},
  note         = {V3, 1.17M params, 0.24ms latency, Macro F1 0.866, FPR 0.83\%}
}
```