Argus Sentinel — WAF ML Classifier (V3)

Production-grade Web Application Firewall classifier. Detects 6 attack types in HTTP requests with sub-millisecond latency on CPU.

Key metrics (test_realistic — production-like distribution, 94% clean):

  • Macro F1: 0.866 | FPR: 0.83% | Mean attack recall: 0.889 | Latency: 0.24ms

Model Overview

Property Value
Architecture CNN text encoder + numeric features fusion
Parameters 1.17M
Vocab size 8,192 (BPE ByteLevel)
Max sequence length 128 tokens
ONNX model size 4.5 MB (FP32) / 1.2 MB (INT8)
Inference latency 0.24 ms avg (CPU, single thread)
Training loss Focal BCEWithLogitsLoss (gamma=2.0)
Best epoch 3 / 8 (early stopping, selected on Macro F1)

Architecture

HTTP Request Text
     |
     v
[BPE Tokenizer (vocab=8192, max_len=128)]
     |
     +---> [Embedding (128-dim)]
     |           |
     |     [Conv1D (128 ch, k=3) + BatchNorm + ReLU] x2
     |           |
     |     [AdaptiveMaxPool1d → 128-dim]
     |
     +---> [6 Numeric Features]
               |
          [Linear 6→32 + ReLU]
               |
         [Concatenate (128 + 32 = 160)]
               |
          [Linear 160→128→64 + ReLU + Dropout(0.1)]
               |
     +---------+---------+
     |                   |
[Label Head → 7]   [Risk Head → 1]
     |                   |
[Sigmoid]            [Sigmoid]
     |                   |
label_probs [7]    risk_score [1]
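The data flow above can be traced with a shape-only sketch. The NumPy weights below are randomly initialized stand-ins used purely to verify tensor shapes against the diagram; this is NOT the trained model, and BatchNorm/Dropout (inference-time no-ops here) are omitted:

```python
import numpy as np

# Shape-only trace of the architecture diagram with random stand-in weights.
rng = np.random.default_rng(0)
VOCAB, SEQ, EMB, CH, NUM = 8192, 128, 128, 128, 6

emb_table = rng.normal(0, 0.02, (VOCAB, EMB)).astype(np.float32)
conv_w = [rng.normal(0, 0.02, (3, EMB, CH)).astype(np.float32),   # Conv1D block 1
          rng.normal(0, 0.02, (3, CH, CH)).astype(np.float32)]    # Conv1D block 2
w_num   = rng.normal(0, 0.02, (NUM, 32)).astype(np.float32)       # Linear 6->32
w_fuse1 = rng.normal(0, 0.02, (160, 128)).astype(np.float32)      # Linear 160->128
w_fuse2 = rng.normal(0, 0.02, (128, 64)).astype(np.float32)       # Linear 128->64
w_label = rng.normal(0, 0.02, (64, 7)).astype(np.float32)         # Label Head -> 7
w_risk  = rng.normal(0, 0.02, (64, 1)).astype(np.float32)         # Risk Head -> 1

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, w):
    """x: [seq, c_in], w: [k, c_in, c_out] -> [seq, c_out], 'same' padding."""
    k = w.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([np.einsum("kc,kco->o", xp[i:i + k], w)
                     for i in range(x.shape[0])])

def forward(input_ids, numeric_features):
    x = emb_table[input_ids]                    # [128, 128] embedding lookup
    for w in conv_w:                            # two Conv1D + ReLU blocks
        x = relu(conv1d_same(x, w))
    text_vec = x.max(axis=0)                    # AdaptiveMaxPool1d -> [128]
    num_vec = relu(numeric_features @ w_num)    # [32]
    h = np.concatenate([text_vec, num_vec])     # [160] fusion
    h = relu(relu(h @ w_fuse1) @ w_fuse2)       # [64]
    return sigmoid(h @ w_label), sigmoid(h @ w_risk)

label_probs, risk = forward(np.zeros(SEQ, dtype=np.int64),
                            np.zeros(NUM, dtype=np.float32))
# label_probs has shape [7], risk has shape [1], matching the ONNX outputs.
```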

Tokenizer Specification

Property Value
Type BPE (Byte-Pair Encoding) via HuggingFace tokenizers library
Algorithm ByteLevel BPE — operates on UTF-8 bytes, not characters
Pre-tokenizer ByteLevel (add_prefix_space=false, trim_offsets=true, use_regex=true)
Normalizer None (raw bytes, no lowercasing or unicode normalization)
Post-processor TemplateProcessing — prepends [CLS] token automatically
Vocab size 8,192 tokens (7,933 merges + 3 special tokens + 256 byte tokens)
Special tokens [PAD] (id=0), [UNK] (id=1), [CLS] (id=2)
Max length 128 tokens (truncation=Right, padding=Right to fixed 128)
Byte fallback false — unknown bytes map to [UNK]
File tokenizer.json (HuggingFace tokenizers JSON format)

Input text construction: "{method} {path}?{query} {body[:200]}" — capped at 500 chars before tokenization.
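That construction rule can be expressed as a small helper (a minimal sketch derived from the format string above; the argument names are illustrative):

```python
def build_input_text(method: str, path: str, query: str = "", body: str = "") -> str:
    """Build the model's input text: '{method} {path}?{query} {body[:200]}', capped at 500 chars."""
    text = f"{method} {path}"
    if query:
        text += f"?{query}"
    if body:
        text += f" {body[:200]}"  # body is truncated to 200 chars before joining
    return text[:500]             # overall 500-char cap before tokenization
```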

from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
result = tok.encode("GET /search?q=test HTTP/1.1")
# result.ids → [2, 546, 287, ...]  (starts with [CLS]=2)
# result.attention_mask → [1, 1, 1, ...]
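The tokenizer file is specified to right-truncate and right-pad to a fixed 128. If a given tokenizer build does not already apply that, the equivalent can be done manually before building the [batch, 128] tensors (a minimal sketch):

```python
PAD_ID = 0    # [PAD] token id per the tokenizer spec
MAX_LEN = 128

def pad_to_fixed(ids: list[int], mask: list[int],
                 max_len: int = MAX_LEN) -> tuple[list[int], list[int]]:
    """Right-truncate, then right-pad token ids and attention mask to a fixed length."""
    ids, mask = ids[:max_len], mask[:max_len]
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad
```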

Label ID Mapping (CRITICAL)

Output label_probs tensor shape: [batch, 7]. Each index maps to:

Index Label Description
0 clean Benign / legitimate request
1 xss Cross-Site Scripting
2 sqli SQL Injection
3 path_traversal Directory / Path Traversal
4 command_injection OS Command Injection
5 scanner Vulnerability scanner / probe
6 spam_bot Spam bot / automated abuse

Multi-label: Labels are NOT mutually exclusive. Multiple labels can be active simultaneously (e.g., index 2 + 5 = scanner performing SQLi). Exception: clean (index 0) is exclusive β€” if clean=1, all others must be 0.
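Applied at post-processing time, this rule can be enforced as follows (a minimal sketch; thresholds is the per-label threshold mapping loaded from thresholds.json):

```python
LABELS = ["clean", "xss", "sqli", "path_traversal",
          "command_injection", "scanner", "spam_bot"]

def decode_labels(probs: list[float], thresholds: dict[str, float]) -> list[str]:
    """Multi-label decode: attack labels (indices 1-6) may co-occur;
    'clean' fires only when no attack label crosses its threshold."""
    fired = [name for i, name in enumerate(LABELS)
             if i > 0 and probs[i] >= thresholds[name]]
    return fired if fired else ["clean"]
```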


ONNX Inputs

Name Shape Dtype Description
input_ids [batch, 128] int32 BPE token IDs from tokenizer.json
attention_mask [batch, 128] int32 1 for real tokens, 0 for [PAD]
numeric_features [batch, 6] float32 Request-level features (RAW values, see below)

Numeric Features — Normalization Parameters (CRITICAL)

Features are passed as RAW values — the model was trained on unnormalized features. Pass the same raw scale at inference.

Index Feature Computation Training Range Mean Std
0 content_length len(body) if body else 0 0 – 662 35.2 63.5
1 num_headers len(headers_dict) 3 – 13 7.7 1.3
2 has_body 1.0 if body present, else 0.0 0 – 1 0.42 0.49
3 session_request_count Total requests in session, or 0 0 – 20 3.0 6.0
4 session_duration Session time span in seconds, or 0 0 – 4,965,381 619,322 1,198,151
5 session_pattern_score Behavioral pattern score, or 0 0 – 0.5 0.09 0.18

Python:

def extract_numeric_features(request: dict) -> list[float]:
    body = request.get("body") or ""
    headers = request.get("headers") or {}
    return [
        float(len(body)),                                    # content_length
        float(len(headers)),                                 # num_headers
        1.0 if body else 0.0,                                # has_body
        float(request.get("session_request_count") or 0),    # session_request_count
        float(request.get("session_duration") or 0),         # session_duration
        float(request.get("session_pattern_score") or 0),    # session_pattern_score
    ]

Rust:

fn extract_numeric_features(request: &HttpRequest) -> [f32; 6] {
    let body_len = request.body.as_ref().map_or(0, |b| b.len());
    [
        body_len as f32,
        request.headers.len() as f32,
        if body_len > 0 { 1.0 } else { 0.0 },
        request.session_request_count.unwrap_or(0) as f32,
        request.session_duration.unwrap_or(0.0),
        request.session_pattern_score.unwrap_or(0.0),
    ]
}

If you don't have session data, pass [content_length, num_headers, has_body, 0.0, 0.0, 0.0] — ~79% of training examples had null session features.
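Because the model consumes raw values, it can also be useful to flag requests whose features drift outside the training ranges listed in the table above. A hedged monitoring sketch (the ranges are copied from that table; the flagging policy itself is an assumption, not part of the model):

```python
# (min, max) training ranges per feature index, from the table above.
TRAIN_RANGES = [(0, 662), (3, 13), (0, 1), (0, 20), (0, 4_965_381), (0, 0.5)]

def out_of_range(features: list[float]) -> list[int]:
    """Return indices of features that fall outside their observed training range."""
    return [i for i, (x, (lo, hi)) in enumerate(zip(features, TRAIN_RANGES))
            if not (lo <= x <= hi)]
```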

ONNX Outputs

Name Shape Dtype Description
label_probs [batch, 7] float32 Per-label probabilities after sigmoid
risk_score [batch, 1] float32 Aggregate risk score [0, 1]

Per-Label Thresholds (CRITICAL for deployment)

Do NOT use a default 0.5 threshold for all labels. Use these optimized thresholds from thresholds.json:

Label Threshold Recall Precision F1
clean 0.20 0.998 0.992 0.995
xss 0.50 0.951 0.585 0.724
sqli 0.74 0.732 0.940 0.823
path_traversal 0.68 0.896 0.794 0.842
command_injection 0.66 0.826 0.626 0.712
scanner 0.70 0.980 0.945 0.962
spam_bot 0.72 1.000 1.000 1.000

Performance

Production-Like (test_realistic — 25,000 examples, 94% clean)

Metric Value
Macro F1 0.866
FPR on clean 0.83%
Mean attack recall 0.889

Per-label results:

Label Recall Precision F1
clean 0.998 0.992 0.995
xss 0.951 0.585 0.724
sqli 0.732 0.940 0.823
path_traversal 0.896 0.794 0.842
command_injection 0.826 0.626 0.712
scanner 0.980 0.945 0.962
spam_bot 1.000 1.000 1.000

Stratified Stress Test (test — 49,830 examples)

Metric Value
Macro F1 0.787
FPR on clean 7.9%

Adversarial Robustness (test_mixed_adversarial — 22,250 examples)

Metric Value
Macro F1 0.499
FPR on clean 11.8%
XSS recall 0.833

Latency (ONNX Runtime, CPU, 1 thread, batch=1)

Metric FP32 INT8
Average 0.24 ms 1.20 ms
Throughput ~4,100 req/s ~830 req/s

On CPU without VNNI, FP32 is faster than dynamic INT8. Use model.onnx on standard CPUs.
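The latency figures above can be reproduced with a small timing harness. Here run_once is a stand-in for a bound session.run(...) call; the helper itself is framework-agnostic:

```python
import time

def mean_latency_ms(run_once, warmup: int = 50, iters: int = 1000) -> float:
    """Average single-request latency in milliseconds for any inference callable."""
    for _ in range(warmup):        # warm up caches / lazy initialization
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - start) / iters * 1000.0
```

For batch=1 FP32 inference, 1000 iterations after warmup is enough to smooth out scheduler noise on a pinned single thread.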


Training

Hyperparameter V1 V3 (current)
Loss BCEWithLogitsLoss Focal BCE (gamma=2.0)
Learning rate 1e-3 1e-4
Batch size 256 128
Epochs 5 8 (early stop at 6, best=3)
Patience 2 3
Checkpoint selection Best val_loss Best Macro F1
Calibration None Per-label threshold tuning
Data augmentation None +20k augmented (encoding, headers, noise, context swap)

Dataset

Property Value
Total examples 498,345 (+20k augmented)
Training split 418,685
Real traffic 62.6%
Synthetic 37.4%
Multi-label 17.0%
Hard negatives 15.0%
Unique sources 12
Sources CIC-IDS-2017, CSE-CIC-IDS-2018, HIKARI-2021, WebAttackPayloads, PayloadsAllTheThings, + synthetic

Usage

Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np
import json
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
tokenizer = Tokenizer.from_file("tokenizer.json")
thresholds = json.load(open("thresholds.json"))["thresholds"]
label_names = ["clean", "xss", "sqli", "path_traversal",
               "command_injection", "scanner", "spam_bot"]

def classify_request(method, path, query, headers, body):
    # 1. Build text
    text = f"{method} {path}"
    if query: text += f"?{query}"
    if body: text += f" {body[:200]}"
    text = text[:500]

    # 2. Tokenize
    enc = tokenizer.encode(text)
    input_ids = np.array([enc.ids], dtype=np.int32)
    attention_mask = np.array([enc.attention_mask], dtype=np.int32)

    # 3. Numeric features (RAW values)
    numeric = np.array([[
        float(len(body or "")),
        float(len(headers)),
        1.0 if body else 0.0,
        0.0, 0.0, 0.0,  # session features (0 if unavailable)
    ]], dtype=np.float32)

    # 4. Inference
    probs, risk = session.run(None, {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "numeric_features": numeric,
    })

    # 5. Apply per-label thresholds
    detections = {
        name: float(probs[0][i])
        for i, name in enumerate(label_names)
        if name != "clean" and probs[0][i] >= thresholds[name]
    }

    return {
        "risk_score": float(risk[0][0]),
        "detections": detections,
        "is_clean": len(detections) == 0,
    }

# Example
result = classify_request("GET", "/search", "q=' OR 1=1--", {"Host": "example.com"}, None)
print(result)
# {'risk_score': 0.87, 'detections': {'sqli': 0.94}, 'is_clean': False}

Rust (ort crate)

use ort::{Session, Value};
use ndarray::Array2;

fn main() -> anyhow::Result<()> {
    let session = Session::builder()?
        .with_model_from_file("model.onnx")?;

    let input_ids = Array2::<i32>::zeros((1, 128));       // from tokenizer
    let attention_mask = Array2::<i32>::zeros((1, 128));   // from tokenizer
    let numeric_features = Array2::<f32>::zeros((1, 6));   // extract_numeric_features()

    let outputs = session.run(ort::inputs![
        "input_ids" => &input_ids,
        "attention_mask" => &attention_mask,
        "numeric_features" => &numeric_features,
    ]?)?;

    let label_probs: Vec<f32> = outputs[0].extract_tensor::<f32>()?.view().iter().copied().collect();
    let risk_score: f32 = *outputs[1].extract_tensor::<f32>()?.view().first().unwrap();

    // Apply thresholds from thresholds.json
    let thresholds: [f32; 7] = [0.20, 0.50, 0.74, 0.68, 0.66, 0.70, 0.72];
    let labels = ["clean", "xss", "sqli", "path_traversal",
                  "command_injection", "scanner", "spam_bot"];

    for (i, (prob, thr)) in label_probs.iter().zip(thresholds.iter()).enumerate() {
        if i > 0 && prob >= thr {
            println!("DETECTED: {} ({:.3})", labels[i], prob);
        }
    }
    println!("Risk score: {:.4}", risk_score);

    Ok(())
}

Decision Logic

thresholds = json.load(open("thresholds.json"))["thresholds"]

# Per-label detection
triggered = [name for i, name in enumerate(label_names)
             if name != "clean" and probs[0][i] >= thresholds[name]]

# Risk-score action
score = float(risk[0][0])
if score >= 0.8:    action = "BLOCK"
elif score >= 0.5:  action = "CHALLENGE"
elif score >= 0.2:  action = "LOG"
else:               action = "ALLOW"

Version History

V3 (current) — Production-Hardened

Fixed V2 recall collapse. Multi-checkpoint selection on Macro F1. Per-label threshold optimization replaces Platt scaling.

V2 — Focal Loss + Calibration (superseded)

Introduced Focal Loss and Platt calibration. FPR dropped to 0.18% but XSS recall collapsed to 0.016 and CMDi to 0.222 due to aggressive calibration.

V1 — Baseline

BCE loss, fixed 0.5 thresholds. High recall (~0.98) but lower Macro F1 (0.828) and a higher FPR than V2 (0.83% vs. 0.18%).

Metric V1 V2 V3
Macro F1 0.828 0.669 0.866
FPR 0.83% 0.18% 0.83%
XSS recall 0.980 0.016 0.951
CMDi recall 0.985 0.222 0.826
Latency 0.77ms 0.99ms 0.24ms

Deployment Strategy

Phase 1 — Shadow Mode: Deploy alongside existing WAF rules, log predictions, compare decisions, tune thresholds.

Phase 2 — Safe Blocking: Enable blocking for high-confidence classes (scanner 0.98 recall, spam_bot 1.00, xss 0.95). Monitor FPR.

Phase 3 — Full Deployment: Activate all labels with thresholds.json. Use risk-score actions (BLOCK/CHALLENGE/LOG/ALLOW).
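Phase 1 amounts to recording both decisions side by side for later comparison. A hedged sketch of one shadow-mode log record, assuming the classify_request() output shape shown earlier (the record fields themselves are illustrative, not a defined schema):

```python
import json
import time

def shadow_log(request_id: str, waf_action: str, model_result: dict,
               path: str = "shadow.log") -> None:
    """Append one shadow-mode record: legacy WAF decision vs. model prediction."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "waf_action": waf_action,                        # decision from the rule-based WAF
        "model_detections": model_result["detections"],  # labels above their thresholds
        "model_risk": model_result["risk_score"],
        # Disagreement: exactly one side considers the request benign.
        "disagreement": (waf_action == "ALLOW") != model_result["is_clean"],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Aggregating the disagreement field over a shadow period gives a direct estimate of how threshold changes would shift FPR before any blocking is enabled.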


Artifacts

File Size Description
model.onnx 4.5 MB Production model (FP32, fastest on CPU)
model_int8.onnx 1.2 MB INT8 quantized (for VNNI hardware)
model_optimized.onnx 4.5 MB Graph-optimized FP32
tokenizer.json 510 KB BPE tokenizer
config.json 1.5 KB Architecture + training config
thresholds.json 1.3 KB Per-label thresholds (must use at inference)
metrics.json 12 KB Full 3-set evaluation results
training_history.json 7.3 KB Per-epoch training history

Known Limitations

  • SQLi recall at 0.73: High threshold (0.74) trades recall for precision. Lower to 0.60 if SQLi detection is critical.
  • Adversarial robustness: Fuzzed/encoded payloads have lower recall (test_mixed_adversarial Macro F1 = 0.499).
  • No session-level model: Classifies individual requests. Session features help but don't replace session analysis.
  • Sequence truncation: Requests truncated to 128 tokens. Place attack-relevant fields early in the text.
  • FP32 > INT8 on CPU: Without VNNI, FP32 is faster. Use model.onnx on standard CPUs.

Citation

@misc{argus_sentinel_2026,
  title        = {Argus Sentinel: A Low-Latency CNN-Based WAF Classifier},
  author       = {Fizcko},
  year         = {2026},
  howpublished = {Hugging Face Model Hub},
  note         = {V3, 1.17M params, 0.24ms latency, Macro F1 0.866, FPR 0.83\%}
}