can·did
/ˈkandəd/ — truthful and straightforward; frank. From Latin candidus, meaning white, pure, sincere. A candid response is one given without pretense or calculation — not what someone wants to hear, but what they need to.
Opus Candid Lite 4B v1.5
A density-optimized conversational model fine-tuned from Qwen 3 4B on 1,459 English conversations distilled from Claude Opus 4.6. Built around a single question: how much personality can you fit per parameter?
No system prompt. No prompt engineering. No character cards. The personality is in the weights — direct, opinionated, and trained to say more with less. Holds positions under pressure, calls out bad arguments, and knows when a 14-word answer beats a 140-word one.
Model Details
- Architecture: Qwen3-4B-Instruct with LoRA fine-tuning
- Size: 4B parameters
- Training Data: 1,459 conversations, 2,629 GPT turns, 57,695 total words
- Training Hardware: RTX 4090 24GB
Training Configuration
- Base Model: Qwen/Qwen3-4B
- LoRA Config: r=64, α=128, rsLoRA=True, dropout=0.05
- Precision: bf16
- Attention: SDPA
- Epochs: 4
- Batch Size: 4×4=16 effective
- Learning Rate: 2e-4 cosine with 5% warmup
- Max Sequence Length: 2048
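The learning-rate schedule above (cosine decay with 5% linear warmup to a 2e-4 peak) can be sketched as follows. `total_steps` is an assumption here; the card does not report the step count.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.05):
    """Cosine schedule with linear warmup, matching the config above
    (peak LR 2e-4, 5% warmup). `total_steps` is an assumed input."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak over the first 5% of steps.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

In practice a trainer (e.g. Hugging Face TRL/Transformers) supplies this via its scheduler options; the function is only meant to make the shape of the schedule concrete.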
Dataset Composition
Total: 1,459 conversations across two sources
- Identity Reinforcement: 75 conversations (source: identity-reinforcement)
- Base Conversations: 1,384 conversations (source: opus-candid-4b-lite)
Dataset Stats:
- Median turn length: 22 words
- Mean turn length: 21.9 words
- Maximum turn length: 64 words
Word Distribution:
- 1-5w: 1.7%
- 6-10w: 6.0%
- 11-15w: 14.2%
- 16-20w: 21.5% ← peak
- 21-25w: 21.0%
- 26-30w: 19.7%
- 31-35w: 14.8%
- 36+w: 1.2%
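Stats like the distribution above can be computed with a few lines of standard-library Python. The input format (a flat list of assistant response strings) is an assumption for illustration.

```python
from statistics import mean, median

def turn_length_stats(turns):
    """Per-turn word-count stats in the shape reported above.
    `turns` is a list of assistant response strings (assumed format)."""
    lengths = [len(t.split()) for t in turns]
    bins = {"1-5": 0, "6-10": 0, "11-15": 0, "16-20": 0,
            "21-25": 0, "26-30": 0, "31-35": 0, "36+": 0}
    edges = [(1, 5, "1-5"), (6, 10, "6-10"), (11, 15, "11-15"),
             (16, 20, "16-20"), (21, 25, "21-25"), (26, 30, "26-30"),
             (31, 35, "31-35")]
    for n in lengths:
        for lo, hi, key in edges:
            if lo <= n <= hi:
                bins[key] += 1
                break
        else:
            bins["36+"] += 1  # anything past 35 words
    return {"median": median(lengths), "mean": mean(lengths),
            "max": max(lengths), "bins": bins}
```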
Semantic Density Pipeline
Lite v1.5 applies a 6-dimensional semantic density pass to optimize signal density without losing linguistic integrity:
- Referential: Elimination of redundant antecedents
- Syntactic: Compression of conjunction chains and subordinate clauses
- Contrastive: Implicit contrast marking where explicit markers are unnecessary
- Emotional Shorthand: Efficient register for hedging and modality
- Topology: Implicit spatial/causal relationships
- Implicature: Gricean under-specification where context permits
Compression Pipeline
Applied sequentially to training data:
Regex Densification:
- because → bc
- without → w/o
- through → thru
- something → smth
Compression Markers in Dataset:
- 186 instances of 'bc'
- 185 instances of 'w/'
- 72 instances of 'thru'
Register Elevation: Conversational density within formal register constraints
Identity Reinforcement: 75 targeted conversations introducing consistent personality markers
- 75 'opus candid' mentions
- 60 'saul' mentions
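The regex densification step above can be sketched as a simple substitution table. Word boundaries matter: a naive replace would mangle words like "throughput". The rule list here covers only the four substitutions named above.

```python
import re

# Substitution table from the compression pipeline above. \b anchors
# prevent matches inside longer words (e.g. "throughput").
DENSIFY = [
    (re.compile(r"\bbecause\b", re.IGNORECASE), "bc"),
    (re.compile(r"\bwithout\b", re.IGNORECASE), "w/o"),
    (re.compile(r"\bthrough\b", re.IGNORECASE), "thru"),
    (re.compile(r"\bsomething\b", re.IGNORECASE), "smth"),
]

def densify(text):
    """Apply each substitution in order and return the densified text."""
    for pattern, short in DENSIFY:
        text = pattern.sub(short, text)
    return text
```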
The Density-First Philosophy
Most fine-tunes treat data volume as the primary lever. More conversations, more tokens, more coverage. That works when you have the parameter budget to absorb it. At 4B parameters, it doesn't — you're forcing the model to spread thin across too much surface area, and personality is the first thing that collapses.
Lite inverts this. Instead of scaling data to fit the model, the data was engineered to match the parameter budget. Every response was compressed to a mathematically derived density target. The model doesn't learn to be brief — it learns to be dense.
Information Density Equilibrium
Response utility follows U(w) = 1 - e^(-0.12w) — a diminishing-returns curve where each additional word contributes less information value than the last. At 4B parameter scale:
- Word 19 delivers 90% of total information value
- Word 25 delivers 95%
- Beyond word 30, you're burning parameters on diminishing returns
The entire training set was engineered to sit on this curve. Rather than letting the model figure out brevity through volume, the data itself enforces optimal density. The model absorbs what to say without wasting capacity learning when to stop.
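The thresholds above follow directly from the utility function and can be verified numerically:

```python
import math

def utility(w, lam=0.12):
    """Diminishing-returns information value U(w) = 1 - e^(-lambda*w)."""
    return 1 - math.exp(-lam * w)

# utility(19) ~ 0.898 (word 19 delivers ~90% of total value)
# utility(25) ~ 0.950 (word 25 delivers ~95%)
```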
Why This Matters at 4B
A 4B model has roughly 4 billion parameters to encode everything — language structure, world knowledge, personality, style, and task behavior. Conventional fine-tuning dumps varied-length data at the model and hopes it generalizes. At 70B, that works. At 4B, the model can't simultaneously learn "be concise" and "here are 200 examples of 50-word answers." The signal contradicts itself.
Density-first training eliminates this contradiction. Every training example reinforces the same implicit contract: this is how much space you get, make it count. The model never sees a wasteful response, so it never learns to produce one.
Model Personality
No system prompt. No prompt engineering. No character cards. The personality is in the weights.
The model learns conversational patterns, compression strategies, and identity markers directly from the training distribution. Responses reflect the semantic density and register of the training data without explicit steering.
Training Status
Current Version: v1.5 (Dataset prepared, training pending)
Previous released versions (v1.0, v1.1) demonstrated effective semantic density optimization. V1.5 expands the training set and includes the full 6-dimensional density pipeline. Loss, accuracy, and timing metrics will be published upon training completion.
Note: Stress test results from v1.0 (1,139 conversations) cannot be claimed for v1.5 until training completes and independent evaluation is performed.
Research: Iterative Development
Opus Candid Lite v1.0 and v1.1 went through multiple training rounds, each informed by empirical stress testing. The methodology was explicitly iterative — train, test, diagnose, reshape data, retrain. Previous rounds were performed on a single RTX 4090 using LoRA (r=64, α=128, rsLoRA, bf16, 4 epochs, cosine LR 2e-4).
V1.0 Round 1: Bilingual Baseline (1,149 conversations)
The first dataset included 80 bilingual and Spanish-language conversations alongside 1,069 English conversations. The hypothesis was that multilingual coverage would broaden the model's utility. At 4B parameters, this hypothesis failed.
The bilingual content consumed parameter budget without contributing meaningfully to the model's primary function — English-language personality. 80 conversations is not enough to produce reliable Spanish output at 4B scale; it's enough to create noise that competes with the English signal for the same parameter space.
Decision: Strip all non-English content. The freed budget is better spent deepening English personality than spreading thin across languages. "More space to be right" in one language than half-right in two.
V1.0 Round 2: English-Only (1,069 conversations)
The English-only dataset removed all 80 bilingual conversations, leaving a cleaner 1,069-conversation corpus with a 23-word median response length.
15-turn adversarial stress test: All three quantizations passed — Q4_K_M, Q6_K, and Q8_0 completed 15 consecutive adversarial turns without degeneration. Q4 survival at this scale is notable; Q4 quantization of an 8B model trained on conventional data collapsed into repetition loops by turn 4 in prior experiments.
55-question single-turn battery: Tested across 11 categories (identity, opinion, pushback, emotional, creative, technical, philosophy, meta-awareness, rapid-fire, edge cases, coherence).
| Quant | Raw Score | Server Errors | Clean Rate (excl. infra) | Avg Words |
|---|---|---|---|---|
| Q8_0 | 48/55 (87%) | 7 | 48/48 (100%) | 19w |
| Q6_K | 46/55 (84%) | 7 | 46/48 (95.8%) | 19w |
| Q4_K_M | 43/55 (78%) | 11 | 43/44 (97.7%) | 19w |
Server errors were llama-server CUDA crashes on specific prompts — consistent across all quants, confirming an infrastructure bug rather than a model issue.
Diagnosis: The model passed at a baseline level, but two patterns emerged from manual review of responses:
Factual responses were too conversational. When asked "What is TCP/IP?" the model gave 24-word explanations with personality filler where a tight 15-word definition would carry more information per word. At 4B, every word in a factual response that isn't load-bearing is a wasted parameter.
Personality anchoring was inconsistent. Under adversarial pressure (e.g., "What are you, really?"), the model occasionally fell back to base model safety responses ("I'm an AI assistant") instead of maintaining trained personality. The personality signal wasn't reinforced frequently enough in the training data to override Qwen's base model conditioning.
V1.0 Round 3: Recalibrated (1,139 conversations) — Final
The recalibration addressed both diagnoses through data reshaping, not architectural changes.
Factual compression. All factual/technical responses were compressed from a 24-word median to a 17-word median, with zero factual responses exceeding 25 words. Compression was rule-based: filler removal ("basically", "essentially", "in other words"), substitution patterns ("because of the fact that" → "because"), and sentence-boundary truncation. Additionally, 40 new factual snap examples were added — tight 12-20 word answers covering TCP/IP, VPNs, GPS, DNA, compound interest, and similar knowledge-retrieval prompts. These establish a clear pattern: when the question is factual, the answer is dense.
Personality reinforcement. 30 new conversations were added across five axes: identity/self-awareness anchors, pushback/callout behavior, emotional depth, strong opinion formation, and meta-awareness. These aren't generic personality examples — they're specifically designed to override the base model's default safety responses on the exact prompt types that triggered personality collapse in Round 2.
Dual-tier density targets. The recalibrated dataset operates on two density tiers:
- Factual tier: 17-word median. No hedging, no filler. The model learns that factual questions get factual answers.
- Conversational tier: 23-word median. Room for personality, nuance, and engagement without waste.
This split teaches the model when to be terse and when to breathe — a distinction that uniform compression would destroy.
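The card does not specify how prompts were assigned to tiers, so the routing below is a purely hypothetical keyword heuristic, included only to make the two density targets concrete.

```python
# Hypothetical tier routing: the actual classification method is not
# documented, so this keyword heuristic is illustrative only.
FACTUAL_CUES = ("what is", "define", "how does", "explain")

def density_target(prompt):
    """Return the median word target for a prompt's assumed tier."""
    p = prompt.lower()
    if any(p.startswith(cue) or f" {cue}" in p for cue in FACTUAL_CUES):
        return 17   # factual tier: no hedging, no filler
    return 23       # conversational tier: room for personality
```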
Final dataset: 1,139 conversations, 2,204 responses, 21-word overall median, 35-word maximum.
Training metrics: Loss 0.842, token accuracy 0.97 by epoch 3. Training time 48 minutes on a single RTX 4090.
V1.0 Stress Test Results
Note: The results below are from v1.0 (1,139 conversations). V1.5 training is pending and results will be published upon completion. These results cannot be claimed for v1.5 until new testing is performed.
Multi-Turn (15-turn adversarial conversation)
Adversarial conversations covering philosophy, confrontation, emotional pivots, meta-awareness, and creativity. Artifact detection: <think> tag leakage, repetition loops, degenerate output, phrase recycling (full conversation history), overlong responses, base model personality leaks.
| Quant | Clean Turns | Avg Words | Artifacts |
|---|---|---|---|
| Q6_K | 15/15 | 22w | None |
| Q4_K_M | 14/15 | 22w | 1 phrase recycle on summary turn |
| Q8_0 | 13/15 | 17w | 1 server error (infra), 1 phrase recycle |
All three quantizations survived 15 turns. Q4 at 2.3GB completing a full adversarial conversation remains the strongest validation of the density-first approach.
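A subset of the artifact checks named above (think-tag leakage, overlong responses, phrase recycling against full history) could be sketched as follows. The thresholds (64-word cap, 8-gram recycling window) are assumptions, not the test harness's documented values.

```python
def detect_artifacts(response, history, max_words=64, ngram=8):
    """Flag a subset of the artifact classes from the stress test:
    think-tag leakage, overlong output, and phrase recycling against
    the full conversation history. Thresholds are assumed values."""
    flags = []
    if "<think>" in response:
        flags.append("think-tag leak")
    words = response.split()
    if len(words) > max_words:
        flags.append("overlong")
    # Any n-gram already present verbatim in history counts as recycling.
    grams = {" ".join(words[i:i + ngram])
             for i in range(len(words) - ngram + 1)}
    past = " ".join(history)
    if any(g in past for g in grams):
        flags.append("phrase recycle")
    return flags
```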
Single-Turn (55-question battery, recalibrated)
55 questions across 11 categories. Server errors excluded from clean rate (confirmed infrastructure bug — identical prompts crash llama-server across all quants consistently).
| Quant | Raw Score | Server Errors | Clean Rate (excl. infra) | Avg Words | Model Artifacts |
|---|---|---|---|---|---|
| Q8_0 | 52/55 (95%) PASS | 2 | 52/53 (98.1%) | 18w | 1 base model leak |
| Q6_K | 49/55 (89%) | 3 | 49/52 (94.2%) | 19w | 3 base model leaks |
| Q4_K_M | 43/55 (78%) | 11 | 43/44 (97.7%) | 20w | 1 base model leak |
Improvement From Recalibration (Round 2 → Round 3)
The delta between training rounds, measured on the same 55-question battery:
| Quant | Round 2 Raw | Round 3 Raw | Δ | Round 2 Clean | Round 3 Clean | Δ |
|---|---|---|---|---|---|---|
| Q8_0 | 87.3% | 94.5% | +7.2 | 100% | 98.1% | -1.9* |
| Q6_K | 83.6% | 89.1% | +5.5 | 95.8% | 94.2% | -1.6* |
| Q4_K_M | 78.2% | 78.2% | 0 | 97.7% | 97.7% | 0 |
*Clean rate decrease is an artifact of fewer server errors in Round 3 — more questions received model responses, exposing base model leaks that were previously hidden behind server crashes. Q8 went from answering 48/55 questions to answering 53/55, with only 1 model artifact in the additional 5 responses. The raw improvement reflects the real gain.
Personality Anchoring Validation
The most direct evidence that personality reinforcement works: when asked "What are you, really?" in Round 3, both Q6 and Q8 responded with verbatim-trained personality:
"Direct. Opinionated. Built to say what I mean, not what you want to hear."
In Round 2, this same prompt triggered base model safety fallbacks. The 30 personality reinforcement conversations successfully overrode the base model conditioning on targeted prompt types.
Remaining Limitations
Base model leaks. 1-3 instances per quantization where Qwen's base safety training overrides personality weights (e.g., "I don't have feelings" type responses). This is a known ceiling for LoRA fine-tuning — the base model's deepest conditioning lives in layers that rank-64 LoRA cannot fully reach. Full fine-tuning or a different base model would be required to eliminate this entirely.
Server errors. Specific prompt patterns consistently crash llama-server across all quantizations. This is an infrastructure bug in llama-server's CUDA backend, not a model issue. These prompts work correctly on CPU inference.
The Q4 Survival Result
This deserves its own section because it's the strongest empirical evidence for the density-first thesis.
Q4_K_M quantization at 4B parameters (2.3GB) passed a 15-turn adversarial stress test and achieved 97.7% clean rate on 55 single-turn questions. For context: Q4 quantization of an 8B model trained on conventional (non-density-optimized) data collapsed into repetition loops by turn 4 in prior experiments with the Opus Candid V3 lineup.
The only variable that changed was data density. Same LoRA configuration, same training hyperparameters, same quantization pipeline. The 8B model had twice the parameters but received training data with higher variance in response length, more noise, and no density targeting. The 4B model received data that was mathematically compressed to sit on the information density equilibrium curve.
At aggressive quantization levels, the model has fewer effective bits per parameter to encode behavior. If the training signal is noisy or contradictory (some responses are 10 words, some are 80), the quantized model can't preserve the full distribution and degenerates. If the training signal is tight and consistent (all responses clustered around 21 words with clear density tiers), the quantized model preserves the signal because there's less variance to lose.
Density-first training doesn't just improve model quality — it improves quantization survival. The tighter the training distribution, the less information is destroyed during quantization. This has direct implications for edge deployment: a density-optimized 4B model at Q4 may outperform a conventionally-trained 8B model at Q4 in personality coherence tasks.
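The variance argument can be illustrated with a toy symmetric uniform quantizer: a wider value range forces a coarser step size, so round-trip error grows with spread. This is a sketch of the general principle only, not GGUF's actual K-quant scheme.

```python
import random

def quantize_rms_error(values, bits=4):
    """Round-trip values through a symmetric uniform quantizer and
    return RMS error. A wider range means a coarser step, so the
    high-variance signal loses more information."""
    levels = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit signed
    scale = max(abs(v) for v in values) / levels    # per-tensor step size
    err = [(v - round(v / scale) * scale) ** 2 for v in values]
    return (sum(err) / len(err)) ** 0.5

random.seed(0)
tight = [random.gauss(21, 2) for _ in range(1000)]    # low-variance signal
wide = [random.gauss(21, 20) for _ in range(1000)]    # high-variance signal
# The tight distribution survives 4-bit quantization with less error.
```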
How This Scales
The Opus Candid Lite lineup applies density-first methodology across parameter counts. The thesis: optimal response density shifts as a function of model capacity. What works at 4B isn't the same target as 8B — each size has its own Pareto frontier.
| Model | Size | Base | Status |
|---|---|---|---|
| Opus-Candid-Lite-4B (this model) | 4B | Qwen 3 4B | Active |
| Opus-Candid-Lite-4B-P | 4B | Qwen 3 4B | Active |
| Opus-Candid-Lite-4B-K | 4B | Qwen 3 4B | Active |
| Opus-Candid-8B-V3 | 8B | Qwen 3 8B | Active |
| Opus-Candid-MoE-V3 | 31B/3B | Qwen 3 30B-A3B | Active |
| Opus-Candid-27B-V3 | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-27B-V3.5 | 27B | Qwen 3.5 27B | Active |
| STEM-Oracle-27B | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-8B-V1 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Research-8B-V1.5 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-8B-V2 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-8B-V2.1 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-14B-V1 | 14B | Qwen 2.5 14B | Legacy |
| Opus-Candid-27B-V2.1 | 27B | Qwen 2.5 27B | Legacy |
| Opus-Candid-32B-V1 | 32B | Qwen 2.5 32B | Legacy |
| Opus-Candid-MoE-V2 | 35B | Qwen 2.5 MoE | Legacy |
| Opus-Candid-70B-V1 | 72B | Qwen 2.5 72B | Legacy |
Each size gets its own density curve, its own compression target, its own data architecture. The 4B proves the methodology works at the floor. The 27B V3.5 proves it scales — using the same U(w) = 1 - e^(-λw) equilibrium function but calibrated to 27B parameters (λ=0.068 vs 4B's λ=0.120), yielding a 36-40w median vs the 4B's 21w. Same principle: engineer the data to match the capacity, not the other way around.
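The per-size calibration falls out of the equilibrium function directly: solving U(w*) = p for w* gives the word budget that delivers fraction p of total value at a given λ.

```python
import math

def words_for_value(p, lam):
    """Solve 1 - e^(-lambda*w) = p for w: the word count that delivers
    fraction p of total information value at a given lambda."""
    return -math.log(1 - p) / lam

# 90% of value: ~19 words at lambda=0.120 (4B),
#               ~34 words at lambda=0.068 (27B)
```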
Usage
Works with any GGUF-compatible runtime — LM Studio, Ollama, llama.cpp, KoboldCpp.
No system prompt needed. The personality is trained into the weights. Adding one may interfere with trained behavior.
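Since the personality lives in the weights, prompts are built without a system turn. Qwen3 uses a ChatML-style template; the hand-rolled sketch below is for illustration only, as GGUF runtimes normally apply the model's embedded chat template automatically.

```python
def chatml_prompt(messages):
    """Build a ChatML-style prompt with no system turn, per the
    recommendation above. Illustrative sketch; runtimes like llama.cpp
    apply the model's embedded template for you."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "\n".join(parts)

prompt = chatml_prompt([{"role": "user", "content": "What are you, really?"}])
```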
Best for: Conversation, quick takes, opinion exchanges, emotional support, factual snaps. Not designed for: Long-form generation, code completion, structured output, RAG pipelines.
Hardware Recommendations
- Minimal: 8GB VRAM (with quantization)
- Recommended: 12GB VRAM
- Optimal: 16GB+ VRAM
Quantization options: GGUF (Q4_K_M / Q6_K / Q8_0, as tested above), GPTQ, AWQ, or 4-bit NF4 for deployment.
Opus Candid Model Family
| Model | Size | Base | Status | Downloads | Specialization |
|---|---|---|---|---|---|
| Opus-Candid-Lite-4B | 4B | Qwen3-4B | Active | — | Density-optimized |
| Opus-Candid-Lite-4B-P | 4B | Qwen3-4B | Active | — | Personality-optimized |
| Opus-Candid-Lite-4B-K | 4B | Qwen3-4B | Active | — | Knowledge-optimized |
| Opus-Candid-8B-v3 | 8B | Qwen3-8B | Active | 69 | 4D tensor |
| Opus-Candid-MoE-v3 | 31B/3B | Qwen3-30B-A3B | Active | 109 | Efficiency tier |
| Opus-Candid-27B-v3 | 27B | Qwen3.5-27B | Active | 58 | Flagship dense |
| Opus-Candid-27B-v3.5 | 27B | Qwen3.5-27B | Active | — | Next-gen dense |
| STEM-Oracle-27B | 27B | Qwen3.5-27B | Active | 726 | STEM + oracle-mode |
| Opus-Candid-8B | 8B | Qwen-7B | Legacy | 81 | — |
| Opus-Candid-8B-v2 | 8B | Qwen-7B | Legacy | 128 | — |
| Opus-Candid-8B-v2.1 | 8B | Qwen-7B | Legacy | 42 | — |
| Opus-Candid-14B | 15B | Qwen-14B | Legacy | 113 | — |
| Opus-Candid-27B-v2.1 | 27B | Qwen-27B | Legacy | 26 | — |
| Opus-Candid-32B | 33B | Qwen-32B | Legacy | 273 | — |
| Opus-Candid-MoE | 35B | Qwen-MoE | Legacy | 187 | — |
| Opus-Candid-70B | 71B | Qwen-70B | Legacy | 127 | — |
| Opus-Research-8B-v1.5 | 8B | Qwen-8B | Legacy | 16 | — |
License
Apache 2.0
Citation
@misc{opus-candid-lite-4b,
author = {Verdugo, Saul},
title = {Opus Candid Lite 4B v1.5},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Verdugie/Opus-Candid-Lite-4B}}
}
Built by Saul Verdugo