JaneGPT v2 Janus — Intent Classification Model

Hierarchical command understanding with state-aware runtime behavior for practical assistant workflows.

7.95M

Parameters

Runtime Turns

Errors

25.3 ms

Mean Latency

100%

OOD Precision

30.6 MB

Checkpoint

🏛️ The Temple of Janus (Web Experience)

We have deployed a dedicated interactive environment to showcase the essence of JaneGPT-v2 Janus.

Note: This is a visual and technical walkthrough; it does not feature a live chat interface.

🔗 Enter the Experience
Best Viewed On: Desktop (Chrome/Edge) for full hardware-accelerated 3D effects.

Quickstart (2 minutes)

Install + first prediction

pip install -r requirements.txt

from janegpt_v2_janus.inference import JaneGPTv3NLU

nlu = JaneGPTv3NLU(
    model_path="weights/janegpt_v2_janus.pt",
    tokenizer_path="weights/tokenizer.json",
)

state = {}
result = nlu.predict("set volume", state=state)
print(result)

if result.get("type") == "command":
    state = nlu.update_state(result, state)

Runtime wrapper (recommended for assistant flows)

from runtime.jane_nlu_runtime import JaneNLURuntime

rt = JaneNLURuntime(base_dir=".")
state = {}

out, state = rt.handle_turn("set volume", state)
print(out)  # expected: clarify prompt for missing VALUE

out, state = rt.handle_turn("55", state)
print(out)  # expected: resolved local command

Run bundled demos

python examples/demo_inference.py
python examples/demo_runtime.py
python examples/demo_runtime_suite.py

What You Get

Single-pass multitask prediction: domain + action + BIO slots.
Runtime-safe clarification loops for missing required slots.
Stateful follow-ups (for example, "that is not enough" after a volume change).
Local command routing with controlled chat fallback.
Compact deployment footprint: ~30.62 MB checkpoint.

Model Architecture

Interactive Architecture Visualization

🔤 1. Tokenization & Embedding Layer

Input text is converted to token IDs and projected into a 256-dimensional embedding space.

Tokenizer BPE, vocab=8,192

Max Length 96 tokens

Output Shape (batch, 96, 256)

Embedding 8192 → 256 dim

↓

2. Transformer Backbone (8 Blocks)

Bidirectional attention layers with residual connections. Each block processes hidden states through grouped query attention and feed-forward networks.

Attention Type Grouped Query (GQA)

Query Heads 8 heads

KV Heads 4 heads (2:1)

Head Dimension 32 (256÷8)

Position Embedding RoPE

FFN Expansion 256 → 672 → 256

FFN Activation SwiGLU

Normalization RMSNorm

Causal Masking OFF (bidirectional)

Dropout Rate 0.1

Grouped Query Attention reduces KV cache 50% while maintaining quality

↓

3. Multi-Task Prediction Heads (Parallel)

Three independent classification heads process the backbone output simultaneously for domain, action, and slot predictions.

Domain Head

Input: Last token (pooled)
Arch: Linear(256) → GELU → Dropout → Linear(10)
Output: 10 classes

Action Head

Input: Last token (pooled)
Arch: Linear(256) → GELU → Dropout → Linear(33)
Output: 33 classes

Slot Head

Input: All tokens
Arch: Linear(256) → Linear(15 BIO)
Output: 15 labels/token

↓

4. Output & Post-Processing

Raw logits are converted to predictions. For slots, BIO tags are decoded into semantic spans.

Domain Output 10 classes

Action Output 33 classes

Slots Decoder BIO → Spans

Confidence Softmax scores

Training Objective

Weighted Multi-Task Loss

loss = 1.0 × L_domain + 1.0 × L_action + 1.5 × L_slots

Where: L_domain = CrossEntropy(domain_logits, domain_labels) L_action = CrossEntropy(action_logits, action_labels) L_slots = CrossEntropy(slot_logits, slot_labels) with ignore_index=-100 (padding)

Slot weight (1.5×) reflects higher complexity of sequence tagging vs. classification

Architecture Specifications

Component	Configuration	Details
Backbone Type	Transformer (GPT-style)	Bidirectional, non-causal attention
Vocabulary Size	8,192	BPE tokenization
Embedding Dim	256	Token + Rotary Position embeddings
Attention Heads	8 Query, 4 KV	Grouped Query Attention (GQA) for efficiency
Head Dimension	32	per head_dim = embed_dim / num_heads
Transformer Blocks	8 Layers	Each with Attn + FFN + Residuals
Feed-Forward Hidden	672	SwiGLU gate activation
Position Encoding	RoPE	Rotary Position Embeddings (theta=10000)
Normalization	RMSNorm	Pre-layer normalization
Max Sequence Length	96 tokens	Approximately 60-80 words
Dropout Rate	0.1	Applied during training
Total Parameters	7,949,626	All trainable
Parameter Breakdown	Backbone: 7.80M, Task Heads: 146K	Efficient multitask design

Task Configuration

Task	Type	Classes	Architecture
Domain Classification	Sequence-level	10 domains	Pooled → Linear(256) → GELU → Linear(10)
Action Classification	Sequence-level	33 actions	Pooled → Linear(256) → GELU → Linear(33)
Slot Tagging	Token-level	15 BIO labels	Per-token → Linear(256) → Linear(15)

Benchmark Results

Runtime reliability

82-turn suite

Turns

Local

Clarify

Errors

fair_benchmarks.json

Predict latency

CUDA · batch=1 · lower is better

25.3ms

P·mean

34.6ms

P·p95

35.4ms

Fwd·mean

36.7ms

Fwd·p95

janus_model_report.json

OOD rejection quality

Schema-agnostic · hover values

Both BANKING77 CLINC

87.8%

B77 F1

100%

B77 Prec

78.3%

B77 Rec

79.2%

CL F1

100%

CL Prec

65.6%

CL Rec

fair_benchmarks.json

Comprehensive Benchmark Summary

Full Benchmark Evidence

All values from real holdout evaluations — no synthetic or inflated numbers

Metric	Detail	Jane v2	Janus
Speed (mean latency)	CUDA, batch=1	31.60 ms	25.31 ms
Throughput	CUDA, single GPU	32 pred/sec	Stable across 82 turns, 0 errors
OOD F1	BANKING77	94.31%	87.80%
OOD F1	CLINC OOS	89.16%	79.23%
OOD Precision	BANKING77	99.35%	100.00%
OOD Precision	CLINC OOS	99.14%	100.00%
OOD Recall	BANKING77	89.75%	78.25%
OOD Recall	CLINC OOS	81.00%	65.60%
Validation Accuracy	Domain (best epoch)	—	99.83%
Validation Accuracy	Action (best epoch)	—	99.87%
Validation Accuracy	Domain+Action pair (best epoch)	—	99.83%
Slot Extraction F1	All 15 slot types	—	1.000 (100%)
Training Loss	Epoch 1 → 4	—	0.060 → 0.020 → 0.002 → 0.001
Validation Loss	Epoch 1 → 3	—	0.0153 → 0.0116 → 0.0115 (stable)
Runtime Reliability	82-turn conversation test	—	0 errors, 0 crashes
Domain Confusion	10 domains	—	99%+ per-domain, minimal cross-confusion
Action Confusion	33 actions	—	Perfect diagonal, no action commonly confused

Live Output Shapes (click to expand)

Command output

{
  "type": "command",
  "domain": "apps",
  "action": "launch",
  "slots": {
    "APP_NAME": {
      "text": "chrome",
      "start": 5,
      "end": 11,
      "confidence": 0.999
    }
  },
  "confidence": 0.97,
  "route": "local"
}

Clarification output

{
  "type": "clarify",
  "question": "What value should I set it to?",
  "debug": {
    "domain": "volume",
    "action": "set",
    "reason": "missing_VALUE"
  }
}

Label schema

Domains (10): volume, brightness, media, apps, browser, productivity, screen, window, system, conversation
Actions (33): up, down, set, mute, unmute, play, pause, next, previous, launch, close, switch, search, set_reminder, screenshot, read, explain, undo, quit, chat, minimize, maximize, restore, focus, copy, paste, cut, lock, sleep, wifi_on, wifi_off, bluetooth_on, bluetooth_off
Slot labels (BIO, 15): VALUE, APP_NAME, QUERY, DURATION, TIME, WINDOW_NAME, TEXT

Visual Benchmark Evidence

Train and validation loss

Smoothed train loss

Validation slot F1

Confusion Matrix — Interactive Breakdown

Per-Class True vs Predicted

Single stacked bar per head — segment width = sample ratio. Hover any segment for details.

Domains (10) Actions (33)

Domain Sample Distribution — 3,110 total samples — hover each segment

volume
430 samples (13.8%)
Accuracy: 100%
Misclassified: 0

brightness
250 samples (8.0%)
Accuracy: 100%
Misclassified: 0

media
340 samples (10.9%)
Accuracy: 100%
Misclassified: 0

apps
250 samples (8.0%)
Accuracy: 100%
Misclassified: 0

browser
120 samples (3.9%)
Accuracy: 100%
Misclassified: 0

productivity
340 samples (10.9%)
Accuracy: 100%
Misclassified: 0

screen
120 samples (3.9%)
Accuracy: 100%
Misclassified: 0

window
340 samples (10.9%)
Accuracy: 100%
Misclassified: 0

system
580 samples (18.6%)
Accuracy: 100%
Misclassified: 0

conversation
340 samples (10.9%)
Accuracy: 100%
Misclassified: 0

volume

brightness

media

apps

browser

productivity

screen

window

system

conversation

Source: validation set confusion matrix — segment widths proportional to sample count

Action Sample Distribution — 3,205 total samples — hover each segment

up
170 samples (5.3%)
Accuracy: 100%
Misclassified: 0

down
165 samples (5.1%)
Accuracy: 100%
Misclassified: 0

set
170 samples (5.3%)
Accuracy: 100%
Misclassified: 0

mute
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

unmute
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

play
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

pause
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

next
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

previous
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

launch
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

close
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

switch
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

search
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

set_reminder
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

screenshot
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

read
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

explain
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

undo
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

quit
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

chat
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

minimize
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

maximize
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

restore
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

focus
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

copy
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

paste
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

cut
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

lock
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

sleep
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

wifi_on
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

wifi_off
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

bluetooth_on
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

bluetooth_off
90 samples (2.8%)
Accuracy: 100%
Misclassified: 0

down

set

mute

unmute

play

pause

launch

switch

set_reminder

screenshot

read

explain

undo

quit

chat

minimize

maximize

restore

focus

copy

paste

cut

lock

sleep

wifi_on

wifi_off

bluetooth_on

bluetooth_off

Source: validation set confusion matrix — segment widths proportional to sample count

View original confusion matrix images

Domain confusion matrix

Action confusion matrix

Additional diagnostics

Learning rate schedule

Epoch time profile

Raw training loss

Upload-Ready Layout

.
|- README.md
|- .gitattributes
|- LICENSE
|- requirements.txt
|- assets/
|  |- jane-janus-glitch.webp
|- janegpt_v2_janus/
|  |- __init__.py
|  |- architecture.py
|  |- dataset.py
|  |- inference.py
|  |- labels.py
|  |- multitask.py
|- runtime/
|  |- jane_nlu_runtime.py
|- examples/
|  |- demo_inference.py
|  |- demo_runtime.py
|  |- demo_runtime_suite.py
|- weights/
|  |- janegpt_v2_janus.pt
|  |- tokenizer.json
|- reports/
|  |- fair_benchmarks.json
|  |- fair_benchmarks.md
|  |- janus_model_report.json
|  |- janus_model_report.md
|  |- public_benchmarks.json
|  |- *.png benchmark visuals

Limitations

English-focused command language.
Command NLU model, not an open-domain generative chatbot.
MASSIVE and SNIPS mapped-intent accuracy is excluded from headline claims because mapping coverage is partial.

Use Cases

Virtual assistant command routing
Smart home intent classification
Voice command understanding
Chatbot intent detection
Edge device deployment (small enough for embedded systems)

Part of the JANE Project

JANE — a fully offline, privacy-first AI voice assistant.

🔗 JANE AI Assistant on GitHub 🔗 JaneGPT-v2 on GitHub

Created By

Ravindu Senanayake

Built from scratch — architecture, tokenizer, and training pipeline designed and implemented by the author.

License

Apache-2.0 (see LICENSE).

Downloads last month: 245

Safetensors

Model size

7.95M params

Tensor type

F32

Evaluation results

OOD Precision on BANKING77
self-reported

1.000
OOD F1 on BANKING77
self-reported

0.878
OOD Recall on BANKING77
self-reported

0.782
OOD Precision on CLINC OOS
self-reported

1.000
OOD F1 on CLINC OOS
self-reported

0.792
OOD Recall on CLINC OOS
self-reported

0.656
Validation Domain Accuracy
self-reported

0.998
Validation Action Accuracy
self-reported

0.999
Slot Extraction F1
self-reported

1.000