JaneGPT v2 Janus: Intent Classification Model
Hierarchical command understanding with state-aware runtime behavior for practical assistant workflows.
- Parameters: 7.95M
- Runtime Turns: 82
- Errors: 0
- Mean Latency: 25.3 ms
- OOD Precision: 100%
- Checkpoint: 30.6 MB
The Temple of Janus (Web Experience)
We deployed a dedicated interactive environment that showcases JaneGPT-v2 Janus.
Note: This is a visual and technical walkthrough; it does not feature a live chat interface.
- Enter the Experience
- Best Viewed On: Desktop (Chrome/Edge) for full hardware-accelerated 3D effects.
Quickstart (2 minutes)
Install + first prediction
```shell
pip install -r requirements.txt
```

```python
from janegpt_v2_janus.inference import JaneGPTv3NLU

nlu = JaneGPTv3NLU(
    model_path="weights/janegpt_v2_janus.pt",
    tokenizer_path="weights/tokenizer.json",
)

state = {}
result = nlu.predict("set volume", state=state)
print(result)

if result.get("type") == "command":
    state = nlu.update_state(result, state)
```
Runtime wrapper (recommended for assistant flows)
```python
from runtime.jane_nlu_runtime import JaneNLURuntime

rt = JaneNLURuntime(base_dir=".")
state = {}

out, state = rt.handle_turn("set volume", state)
print(out)  # expected: clarify prompt for missing VALUE

out, state = rt.handle_turn("55", state)
print(out)  # expected: resolved local command
```
Run bundled demos
```shell
python examples/demo_inference.py
python examples/demo_runtime.py
python examples/demo_runtime_suite.py
```
What You Get
- Single-pass multitask prediction: domain + action + BIO slots.
- Runtime-safe clarification loops for missing required slots.
- Stateful follow-ups (for example, "that is not enough" after a volume change).
- Local command routing with controlled chat fallback.
- Compact deployment footprint: ~30.62 MB checkpoint.
Model Architecture
Interactive Architecture Visualization
1. Tokenization & Embedding Layer
Input text is converted to token IDs and projected into a 256-dimensional embedding space.
Tokenizer
BPE, vocab=8,192
Max Length
96 tokens
Output Shape
(batch, 96, 256)
Embedding
8192 → 256 dim
↓
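As a quick sanity check on this stage, the shapes and the embedding-table size follow directly from the spec (vocab=8,192, max length 96, dim 256); a minimal pure-Python sketch, with no framework dependency:

```python
# Shape/parameter sanity check for the tokenization + embedding stage.
# Constants come from the spec above: vocab=8192, max_len=96, embed_dim=256.
VOCAB_SIZE, MAX_LEN, EMBED_DIM = 8192, 96, 256

def embedding_output_shape(batch_size):
    """Token IDs of shape (batch, 96) index an 8192x256 table -> (batch, 96, 256)."""
    return (batch_size, MAX_LEN, EMBED_DIM)

# Learned weights in the embedding table alone:
embedding_params = VOCAB_SIZE * EMBED_DIM

print(embedding_output_shape(4))  # (4, 96, 256)
print(embedding_params)           # 2097152
```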
2. Transformer Backbone (8 Blocks)
Bidirectional attention layers with residual connections. Each block processes hidden states through grouped query attention and feed-forward networks.
Attention Type
Grouped Query (GQA)
Query Heads
8 heads
KV Heads
4 heads (2:1)
Head Dimension
32 (256 ÷ 8)
Position Embedding
RoPE
FFN Expansion
256 → 672 → 256
FFN Activation
SwiGLU
Normalization
RMSNorm
Causal Masking
OFF (bidirectional)
Dropout Rate
0.1
Grouped Query Attention halves the KV cache while maintaining quality
↓
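The 50% figure follows directly from the 2:1 query-to-KV head ratio, since the cache stores only K and V. A back-of-envelope sketch (names are illustrative, not repo APIs):

```python
# KV-cache size scales with kv_heads * head_dim per layer (times 2 for K and V).
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 8, 8, 4, 32

def kv_cache_floats(seq_len, kv_heads):
    """Number of cached floats for K and V across all layers."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * seq_len

mha = kv_cache_floats(96, Q_HEADS)   # full multi-head attention baseline
gqa = kv_cache_floats(96, KV_HEADS)  # grouped query attention, 2:1 sharing
print(gqa / mha)  # 0.5
```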
3. Multi-Task Prediction Heads (Parallel)
Three independent classification heads process the backbone output simultaneously for domain, action, and slot predictions.
Domain Head
Input: Last token (pooled)
Arch: Linear(256) → GELU → Dropout → Linear(10)
Output: 10 classes
Action Head
Input: Last token (pooled)
Arch: Linear(256) → GELU → Dropout → Linear(33)
Output: 33 classes
Slot Head
Input: All tokens
Arch: Linear(256) → Linear(15 BIO)
Output: 15 labels/token
↓
4. Output & Post-Processing
Raw logits are converted to predictions. For slots, BIO tags are decoded into semantic spans.
Domain Output
10 classes
Action Output
33 classes
Slots Decoder
BIO → Spans
Confidence
Softmax scores
Training Objective
Architecture Specifications
| Component | Configuration | Details |
|---|---|---|
| Backbone Type | Transformer (GPT-style) | Bidirectional, non-causal attention |
| Vocabulary Size | 8,192 | BPE tokenization |
| Embedding Dim | 256 | Token + Rotary Position embeddings |
| Attention Heads | 8 Query, 4 KV | Grouped Query Attention (GQA) for efficiency |
| Head Dimension | 32 | head_dim = embed_dim ÷ num_heads |
| Transformer Blocks | 8 Layers | Each with Attn + FFN + Residuals |
| Feed-Forward Hidden | 672 | SwiGLU gate activation |
| Position Encoding | RoPE | Rotary Position Embeddings (theta=10000) |
| Normalization | RMSNorm | Pre-layer normalization |
| Max Sequence Length | 96 tokens | Approximately 60-80 words |
| Dropout Rate | 0.1 | Applied during training |
| Total Parameters | 7,949,626 | All trainable |
| Parameter Breakdown | Backbone: 7.80M, Task Heads: 146K | Efficient multitask design |
Task Configuration
| Task | Type | Classes | Architecture |
|---|---|---|---|
| Domain Classification | Sequence-level | 10 domains | Pooled → Linear(256) → GELU → Linear(10) |
| Action Classification | Sequence-level | 33 actions | Pooled → Linear(256) → GELU → Linear(33) |
| Slot Tagging | Token-level | 15 BIO labels | Per-token → Linear(256) → Linear(15) |
Benchmark Results
- Runtime reliability: 82-turn suite
- Predict latency: CUDA, batch=1 (lower is better)
- OOD rejection quality: schema-agnostic
Comprehensive Benchmark Summary
Full Benchmark Evidence
All values from real holdout evaluations; no synthetic or inflated numbers.
| Metric | Detail | Jane v2 | Janus |
|---|---|---|---|
| Speed (mean latency) | CUDA, batch=1 | 31.60 ms | 25.31 ms |
| Throughput | CUDA, single GPU | 32 pred/sec | Stable across 82 turns, 0 errors |
| OOD F1 | BANKING77 | 94.31% | 87.80% |
| OOD F1 | CLINC OOS | 89.16% | 79.23% |
| OOD Precision | BANKING77 | 99.35% | 100.00% |
| OOD Precision | CLINC OOS | 99.14% | 100.00% |
| OOD Recall | BANKING77 | 89.75% | 78.25% |
| OOD Recall | CLINC OOS | 81.00% | 65.60% |
| Validation Accuracy | Domain (best epoch) | – | 99.83% |
| Validation Accuracy | Action (best epoch) | – | 99.87% |
| Validation Accuracy | Domain+Action pair (best epoch) | – | 99.83% |
| Slot Extraction F1 | All 15 slot types | – | 1.000 (100%) |
| Training Loss | Epoch 1 → 4 | – | 0.060 → 0.020 → 0.002 → 0.001 |
| Validation Loss | Epoch 1 → 3 | – | 0.0153 → 0.0116 → 0.0115 (stable) |
| Runtime Reliability | 82-turn conversation test | – | 0 errors, 0 crashes |
| Domain Confusion | 10 domains | – | 99%+ per-domain, minimal cross-confusion |
| Action Confusion | 33 actions | – | Perfect diagonal, no action commonly confused |
Live Output Shapes
Command output
```json
{
  "type": "command",
  "domain": "apps",
  "action": "launch",
  "slots": {
    "APP_NAME": {
      "text": "chrome",
      "start": 5,
      "end": 11,
      "confidence": 0.999
    }
  },
  "confidence": 0.97,
  "route": "local"
}
```
Clarification output
```json
{
  "type": "clarify",
  "question": "What value should I set it to?",
  "debug": {
    "domain": "volume",
    "action": "set",
    "reason": "missing_VALUE"
  }
}
```
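A caller can dispatch on the `type` field of these two shapes. In the sketch below, `execute_command` and `ask_user` are hypothetical callbacks, not part of the repo API:

```python
# Illustrative dispatch over the two output shapes shown above.
def handle_nlu_output(result, execute_command, ask_user):
    """Route a prediction dict: run local commands, surface clarify questions."""
    if result.get("type") == "command" and result.get("route") == "local":
        return execute_command(result["domain"], result["action"], result["slots"])
    if result.get("type") == "clarify":
        return ask_user(result["question"])
    return None  # e.g. chat fallback handled elsewhere

reply = handle_nlu_output(
    {"type": "clarify", "question": "What value should I set it to?", "debug": {}},
    execute_command=lambda domain, action, slots: f"ran {domain}.{action}",
    ask_user=lambda question: question,
)
print(reply)  # What value should I set it to?
```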
Label schema
- Domains (10): volume, brightness, media, apps, browser, productivity, screen, window, system, conversation
- Actions (33): up, down, set, mute, unmute, play, pause, next, previous, launch, close, switch, search, set_reminder, screenshot, read, explain, undo, quit, chat, minimize, maximize, restore, focus, copy, paste, cut, lock, sleep, wifi_on, wifi_off, bluetooth_on, bluetooth_off
- Slot labels (BIO, 15): 7 slot types (VALUE, APP_NAME, QUERY, DURATION, TIME, WINDOW_NAME, TEXT), each with B- and I- tags, plus O
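The 15-label count is the standard BIO expansion of the 7 slot types:

```python
# Each of the 7 slot types gets a B- (begin) and I- (inside) tag,
# plus a single shared O (outside) tag for non-slot tokens.
SLOT_TYPES = ["VALUE", "APP_NAME", "QUERY", "DURATION", "TIME", "WINDOW_NAME", "TEXT"]
bio_labels = ["O"] + [f"{prefix}-{slot}" for slot in SLOT_TYPES for prefix in ("B", "I")]
print(len(bio_labels))  # 15
```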
Visual Benchmark Evidence
Confusion Matrix: Interactive Breakdown
Per-Class True vs Predicted
One stacked bar per head; segment width = sample ratio. Covers all 10 domains and 33 actions listed in the label schema above.
View original confusion matrix images
Additional diagnostics
Upload-Ready Layout
```
.
|- README.md
|- .gitattributes
|- LICENSE
|- requirements.txt
|- assets/
|  |- jane-janus-glitch.webp
|- janegpt_v2_janus/
|  |- __init__.py
|  |- architecture.py
|  |- dataset.py
|  |- inference.py
|  |- labels.py
|  |- multitask.py
|- runtime/
|  |- jane_nlu_runtime.py
|- examples/
|  |- demo_inference.py
|  |- demo_runtime.py
|  |- demo_runtime_suite.py
|- weights/
|  |- janegpt_v2_janus.pt
|  |- tokenizer.json
|- reports/
|  |- fair_benchmarks.json
|  |- fair_benchmarks.md
|  |- janus_model_report.json
|  |- janus_model_report.md
|  |- public_benchmarks.json
|  |- *.png benchmark visuals
```
Limitations
- English-focused command language.
- Command NLU model, not an open-domain generative chatbot.
- MASSIVE and SNIPS mapped-intent accuracy is excluded from headline claims because mapping coverage is partial.
Use Cases
- Virtual assistant command routing
- Smart home intent classification
- Voice command understanding
- Chatbot intent detection
- Edge device deployment (small enough for embedded systems)
Part of the JANE Project
JANE: a fully offline, privacy-first AI voice assistant.
- JANE AI Assistant on GitHub
- JaneGPT-v2 on GitHub
Created By
Ravindu Senanayake
Built from scratch: architecture, tokenizer, and training pipeline designed and implemented by the author.
License
Apache-2.0 (see LICENSE).