JaneGPT v2 Janus โ€” Intent Classification Model


Experience Janus

Jane Janus animated hero banner

Hierarchical command understanding with state-aware runtime behavior for practical assistant workflows.

7.95M
Parameters
82
Runtime Turns
0
Errors
25.3 ms
Mean Latency
100%
OOD Precision
30.6 MB
Checkpoint

๐Ÿ›๏ธ The Temple of Janus (Web Experience)

We have deployed a dedicated interactive environment to showcase the essence of JaneGPT-v2 Janus.

Note: This is a visual and technical walkthrough; it does not feature a live chat interface.


Quickstart (2 minutes)

Install + first prediction
pip install -r requirements.txt
from janegpt_v2_janus.inference import JaneGPTv3NLU

nlu = JaneGPTv3NLU(
    model_path="weights/janegpt_v2_janus.pt",
    tokenizer_path="weights/tokenizer.json",
)

state = {}
result = nlu.predict("set volume", state=state)
print(result)

if result.get("type") == "command":
    state = nlu.update_state(result, state)
Runtime wrapper (recommended for assistant flows)
from runtime.jane_nlu_runtime import JaneNLURuntime

rt = JaneNLURuntime(base_dir=".")
state = {}

out, state = rt.handle_turn("set volume", state)
print(out)  # expected: clarify prompt for missing VALUE

out, state = rt.handle_turn("55", state)
print(out)  # expected: resolved local command
Run bundled demos
python examples/demo_inference.py
python examples/demo_runtime.py
python examples/demo_runtime_suite.py

What You Get

  • Single-pass multitask prediction: domain + action + BIO slots.
  • Runtime-safe clarification loops for missing required slots.
  • Stateful follow-ups (for example, "that is not enough" after a volume change).
  • Local command routing with controlled chat fallback.
  • Compact deployment footprint: ~30.62 MB checkpoint.

Model Architecture

Interactive Architecture Visualization

๐Ÿ”ค 1. Tokenization & Embedding Layer
Input text is converted to token IDs and projected into a 256-dimensional embedding space.
Tokenizer BPE, vocab=8,192
Max Length 96 tokens
Output Shape (batch, 96, 256)
Embedding 8192 โ†’ 256 dim
โ†“
2. Transformer Backbone (8 Blocks)
Bidirectional attention layers with residual connections. Each block processes hidden states through grouped query attention and feed-forward networks.
Attention Type Grouped Query (GQA)
Query Heads 8 heads
KV Heads 4 heads (2:1)
Head Dimension 32 (256รท8)
Position Embedding RoPE
FFN Expansion 256 โ†’ 672 โ†’ 256
FFN Activation SwiGLU
Normalization RMSNorm
Causal Masking OFF (bidirectional)
Dropout Rate 0.1
Grouped Query Attention reduces KV cache 50% while maintaining quality
โ†“
3. Multi-Task Prediction Heads (Parallel)
Three independent classification heads process the backbone output simultaneously for domain, action, and slot predictions.
Domain Head
Input: Last token (pooled)
Arch: Linear(256) โ†’ GELU โ†’ Dropout โ†’ Linear(10)
Output: 10 classes
Action Head
Input: Last token (pooled)
Arch: Linear(256) โ†’ GELU โ†’ Dropout โ†’ Linear(33)
Output: 33 classes
Slot Head
Input: All tokens
Arch: Linear(256) โ†’ Linear(15 BIO)
Output: 15 labels/token
โ†“
4. Output & Post-Processing
Raw logits are converted to predictions. For slots, BIO tags are decoded into semantic spans.
Domain Output 10 classes
Action Output 33 classes
Slots Decoder BIO โ†’ Spans
Confidence Softmax scores

Training Objective

Weighted Multi-Task Loss
loss = 1.0 ร— L_domain + 1.0 ร— L_action + 1.5 ร— L_slots

Where: L_domain = CrossEntropy(domain_logits, domain_labels) L_action = CrossEntropy(action_logits, action_labels) L_slots = CrossEntropy(slot_logits, slot_labels) with ignore_index=-100 (padding)

Slot weight (1.5ร—) reflects higher complexity of sequence tagging vs. classification

Architecture Specifications

Component Configuration Details
Backbone Type Transformer (GPT-style) Bidirectional, non-causal attention
Vocabulary Size 8,192 BPE tokenization
Embedding Dim 256 Token + Rotary Position embeddings
Attention Heads 8 Query, 4 KV Grouped Query Attention (GQA) for efficiency
Head Dimension 32 per head_dim = embed_dim / num_heads
Transformer Blocks 8 Layers Each with Attn + FFN + Residuals
Feed-Forward Hidden 672 SwiGLU gate activation
Position Encoding RoPE Rotary Position Embeddings (theta=10000)
Normalization RMSNorm Pre-layer normalization
Max Sequence Length 96 tokens Approximately 60-80 words
Dropout Rate 0.1 Applied during training
Total Parameters 7,949,626 All trainable
Parameter Breakdown Backbone: 7.80M, Task Heads: 146K Efficient multitask design

Task Configuration

Task Type Classes Architecture
Domain Classification Sequence-level 10 domains Pooled โ†’ Linear(256) โ†’ GELU โ†’ Linear(10)
Action Classification Sequence-level 33 actions Pooled โ†’ Linear(256) โ†’ GELU โ†’ Linear(33)
Slot Tagging Token-level 15 BIO labels Per-token โ†’ Linear(256) โ†’ Linear(15)

Benchmark Results

Runtime reliability
82-turn suite
82
Turns
67
Local
12
Clarify
0
Errors
fair_benchmarks.json
Predict latency
CUDA ยท batch=1 ยท lower is better
25.3ms
Pยทmean
34.6ms
Pยทp95
35.4ms
Fwdยทmean
36.7ms
Fwdยทp95
janus_model_report.json
OOD rejection quality
Schema-agnostic ยท hover values
87.8%
B77 F1
100%
B77 Prec
78.3%
B77 Rec
79.2%
CL F1
100%
CL Prec
65.6%
CL Rec
fair_benchmarks.json

Comprehensive Benchmark Summary

Full Benchmark Evidence
All values from real holdout evaluations โ€” no synthetic or inflated numbers
Metric Detail Jane v2 Janus
Speed (mean latency) CUDA, batch=1 31.60 ms 25.31 ms
Throughput CUDA, single GPU 32 pred/sec Stable across 82 turns, 0 errors
OOD F1 BANKING77 94.31% 87.80%
OOD F1 CLINC OOS 89.16% 79.23%
OOD Precision BANKING77 99.35% 100.00%
OOD Precision CLINC OOS 99.14% 100.00%
OOD Recall BANKING77 89.75% 78.25%
OOD Recall CLINC OOS 81.00% 65.60%
Validation Accuracy Domain (best epoch) โ€” 99.83%
Validation Accuracy Action (best epoch) โ€” 99.87%
Validation Accuracy Domain+Action pair (best epoch) โ€” 99.83%
Slot Extraction F1 All 15 slot types โ€” 1.000 (100%)
Training Loss Epoch 1 โ†’ 4 โ€” 0.060 โ†’ 0.020 โ†’ 0.002 โ†’ 0.001
Validation Loss Epoch 1 โ†’ 3 โ€” 0.0153 โ†’ 0.0116 โ†’ 0.0115 (stable)
Runtime Reliability 82-turn conversation test โ€” 0 errors, 0 crashes
Domain Confusion 10 domains โ€” 99%+ per-domain, minimal cross-confusion
Action Confusion 33 actions โ€” Perfect diagonal, no action commonly confused

Live Output Shapes (click to expand)

Command output
{
  "type": "command",
  "domain": "apps",
  "action": "launch",
  "slots": {
    "APP_NAME": {
      "text": "chrome",
      "start": 5,
      "end": 11,
      "confidence": 0.999
    }
  },
  "confidence": 0.97,
  "route": "local"
}
Clarification output
{
  "type": "clarify",
  "question": "What value should I set it to?",
  "debug": {
    "domain": "volume",
    "action": "set",
    "reason": "missing_VALUE"
  }
}
Label schema
  • Domains (10): volume, brightness, media, apps, browser, productivity, screen, window, system, conversation
  • Actions (33): up, down, set, mute, unmute, play, pause, next, previous, launch, close, switch, search, set_reminder, screenshot, read, explain, undo, quit, chat, minimize, maximize, restore, focus, copy, paste, cut, lock, sleep, wifi_on, wifi_off, bluetooth_on, bluetooth_off
  • Slot labels (BIO, 15): VALUE, APP_NAME, QUERY, DURATION, TIME, WINDOW_NAME, TEXT

Visual Benchmark Evidence

Train and validation loss

Smoothed train loss

Validation slot F1

Confusion Matrix โ€” Interactive Breakdown

Per-Class True vs Predicted
Single stacked bar per head โ€” segment width = sample ratio. Hover any segment for details.
Domain Sample Distribution โ€” 3,110 total samples โ€” hover each segment
volume
430 samples (13.8%)
Accuracy: 100%
Misclassified: 0
brightness
250 samples (8.0%)
Accuracy: 100%
Misclassified: 0
media
340 samples (10.9%)
Accuracy: 100%
Misclassified: 0
apps
250 samples (8.0%)
Accuracy: 100%
Misclassified: 0
browser
120 samples (3.9%)
Accuracy: 100%
Misclassified: 0
productivity
340 samples (10.9%)
Accuracy: 100%
Misclassified: 0
screen
120 samples (3.9%)
Accuracy: 100%
Misclassified: 0
window
340 samples (10.9%)
Accuracy: 100%
Misclassified: 0
system
580 samples (18.6%)
Accuracy: 100%
Misclassified: 0
conversation
340 samples (10.9%)
Accuracy: 100%
Misclassified: 0
volume
brightness
media
apps
browser
productivity
screen
window
system
conversation
Source: validation set confusion matrix โ€” segment widths proportional to sample count
Action Sample Distribution โ€” 3,205 total samples โ€” hover each segment
up
170 samples (5.3%)
Accuracy: 100%
Misclassified: 0
down
165 samples (5.1%)
Accuracy: 100%
Misclassified: 0
set
170 samples (5.3%)
Accuracy: 100%
Misclassified: 0
mute
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
unmute
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
play
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
pause
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
next
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
previous
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
launch
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
close
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
switch
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
search
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
set_reminder
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
screenshot
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
read
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
explain
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
undo
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
quit
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
chat
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
minimize
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
maximize
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
restore
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
focus
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
copy
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
paste
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
cut
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
lock
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
sleep
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
wifi_on
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
wifi_off
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
bluetooth_on
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
bluetooth_off
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0
up
down
set
mute
unmute
play
pause
next
previous
launch
close
switch
search
set_reminder
screenshot
read
explain
undo
quit
chat
minimize
maximize
restore
focus
copy
paste
cut
lock
sleep
wifi_on
wifi_off
bluetooth_on
bluetooth_off
Source: validation set confusion matrix โ€” segment widths proportional to sample count
View original confusion matrix images

Domain confusion matrix

Action confusion matrix

Additional diagnostics

Learning rate schedule

Epoch time profile

Raw training loss


Upload-Ready Layout

.
|- README.md
|- .gitattributes
|- LICENSE
|- requirements.txt
|- assets/
|  |- jane-janus-glitch.webp
|- janegpt_v2_janus/
|  |- __init__.py
|  |- architecture.py
|  |- dataset.py
|  |- inference.py
|  |- labels.py
|  |- multitask.py
|- runtime/
|  |- jane_nlu_runtime.py
|- examples/
|  |- demo_inference.py
|  |- demo_runtime.py
|  |- demo_runtime_suite.py
|- weights/
|  |- janegpt_v2_janus.pt
|  |- tokenizer.json
|- reports/
|  |- fair_benchmarks.json
|  |- fair_benchmarks.md
|  |- janus_model_report.json
|  |- janus_model_report.md
|  |- public_benchmarks.json
|  |- *.png benchmark visuals

Limitations

  • English-focused command language.
  • Command NLU model, not an open-domain generative chatbot.
  • MASSIVE and SNIPS mapped-intent accuracy is excluded from headline claims because mapping coverage is partial.

Use Cases

  • Virtual assistant command routing
  • Smart home intent classification
  • Voice command understanding
  • Chatbot intent detection
  • Edge device deployment (small enough for embedded systems)

Part of the JANE Project

JANE โ€” a fully offline, privacy-first AI voice assistant.

๐Ÿ”— JANE AI Assistant on GitHub ๐Ÿ”— JaneGPT-v2 on GitHub


Created By

Ravindu Senanayake

Built from scratch โ€” architecture, tokenizer, and training pipeline designed and implemented by the author.

GitHub


License

Apache-2.0 (see LICENSE).

Downloads last month
245
Safetensors
Model size
7.95M params
Tensor type
F32
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Evaluation results