can·did
/ˈkandəd/ — truthful and straightforward; frank. From Latin candidus, meaning white, pure, sincere. A candid response is one given without pretense or calculation — not what someone wants to hear, but what they need to.
Opus Candid Lite 4B v1.5
A density-optimized conversational model fine-tuned from Qwen 3 4B on 1,459 English conversations distilled from Claude Opus 4.6. Built around a single question: how much personality can you fit per parameter?
No system prompt. No prompt engineering. No character cards. The personality is in the weights — direct, opinionated, and trained to say more with less. Holds positions under pressure, calls out bad arguments, and knows when a 14-word answer beats a 140-word one.
Model Details
- Architecture: Qwen3-4B-Instruct with LoRA fine-tuning
- Size: 4B parameters
- Training Data: 1,459 conversations, 2,629 GPT turns, 57,695 total words
- Training Hardware: RTX 4090 24GB
Training Configuration
- Base Model: Qwen/Qwen3-4B
- LoRA Config: r=64, α=128, rsLoRA=True, dropout=0.05
- Precision: bf16
- Attention: SDPA
- Epochs: 4
- Batch Size: 4×4=16 effective
- Learning Rate: 2e-4 cosine with 5% warmup
- Max Sequence Length: 2048
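The learning-rate schedule above (cosine decay with 5% linear warmup to a 2e-4 peak) can be sketched as follows. `total_steps` is an assumption here; the card does not report the step count.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.05):
    """Cosine schedule with linear warmup, matching the config above
    (peak LR 2e-4, 5% warmup). `total_steps` is an assumed input."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak over the first 5% of steps.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

In practice a trainer (e.g. Hugging Face TRL/Transformers) supplies this via its scheduler options; the function is only meant to make the shape of the schedule concrete.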
Dataset Composition
Total: 1,459 conversations across two sources
- Identity Reinforcement: 75 conversations (source: identity-reinforcement)
- Base Conversations: 1,384 conversations (source: opus-candid-4b-lite)
Dataset Stats:
- Median turn length: 22 words
- Mean turn length: 21.9 words
- Maximum turn length: 64 words
Word Distribution:
- 1-5w: 1.7%
- 6-10w: 6.0%
- 11-15w: 14.2%
- 16-20w: 21.5% ← peak
- 21-25w: 21.0%
- 26-30w: 19.7%
- 31-35w: 14.8%
- 36+w: 1.2%
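Stats like the distribution above can be computed with a few lines of standard-library Python. The input format (a flat list of assistant response strings) is an assumption for illustration.

```python
from statistics import mean, median

def turn_length_stats(turns):
    """Per-turn word-count stats in the shape reported above.
    `turns` is a list of assistant response strings (assumed format)."""
    lengths = [len(t.split()) for t in turns]
    bins = {"1-5": 0, "6-10": 0, "11-15": 0, "16-20": 0,
            "21-25": 0, "26-30": 0, "31-35": 0, "36+": 0}
    edges = [(1, 5, "1-5"), (6, 10, "6-10"), (11, 15, "11-15"),
             (16, 20, "16-20"), (21, 25, "21-25"), (26, 30, "26-30"),
             (31, 35, "31-35")]
    for n in lengths:
        for lo, hi, key in edges:
            if lo <= n <= hi:
                bins[key] += 1
                break
        else:
            bins["36+"] += 1  # anything past 35 words
    return {"median": median(lengths), "mean": mean(lengths),
            "max": max(lengths), "bins": bins}
```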
Semantic Density Pipeline
Lite v1.5 applies a 6-dimensional semantic density pass to optimize signal density without losing linguistic integrity:
- Referential: Elimination of redundant antecedents
- Syntactic: Compression of conjunction chains and subordinate clauses
- Contrastive: Implicit contrast marking where explicit markers are unnecessary
- Emotional Shorthand: Efficient register for hedging and modality
- Topology: Implicit spatial/causal relationships
- Implicature: Gricean under-specification where context permits
Compression Pipeline
Applied sequentially to training data:
Regex Densification:
- because → bc
- without → w/o
- through → thru
- something → smth
Compression Markers in Dataset:
- 186 instances of 'bc'
- 185 instances of 'w/'
- 72 instances of 'thru'
Register Elevation: Conversational density within formal register constraints
Identity Reinforcement: 75 targeted conversations introducing consistent personality markers
- 75 'opus candid' mentions
- 60 'saul' mentions
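The regex densification step above can be sketched as a simple substitution table. Word boundaries matter: a naive replace would mangle words like "throughput". The rule list here covers only the four substitutions named above.

```python
import re

# Substitution table from the compression pipeline above. \b anchors
# prevent matches inside longer words (e.g. "throughput").
DENSIFY = [
    (re.compile(r"\bbecause\b", re.IGNORECASE), "bc"),
    (re.compile(r"\bwithout\b", re.IGNORECASE), "w/o"),
    (re.compile(r"\bthrough\b", re.IGNORECASE), "thru"),
    (re.compile(r"\bsomething\b", re.IGNORECASE), "smth"),
]

def densify(text):
    """Apply each substitution in order and return the densified text."""
    for pattern, short in DENSIFY:
        text = pattern.sub(short, text)
    return text
```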
The Density-First Philosophy
Most fine-tunes treat data volume as the primary lever. More conversations, more tokens, more coverage. That works when you have the parameter budget to absorb it. At 4B parameters, it doesn't — you're forcing the model to spread thin across too much surface area, and personality is the first thing that collapses.
Lite inverts this. Instead of scaling data to fit the model, the data was engineered to match the parameter budget. Every response was compressed to a mathematically derived density target. The model doesn't learn to be brief — it learns to be dense.
Information Density Equilibrium
Response utility follows U(w) = 1 - e^(-0.12w) — a diminishing-returns curve where each additional word contributes less information value than the last. At 4B parameter scale:
- Word 19 delivers 90% of total information value
- Word 25 delivers 95%
- Beyond word 30, you're burning parameters on diminishing returns
The entire training set was engineered to sit on this curve. Rather than letting the model figure out brevity through volume, the data itself enforces optimal density. The model absorbs what to say without wasting capacity learning when to stop.
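The thresholds above follow directly from the utility function and can be verified numerically:

```python
import math

def utility(w, lam=0.12):
    """Diminishing-returns information value U(w) = 1 - e^(-lambda*w)."""
    return 1 - math.exp(-lam * w)

# utility(19) ~ 0.898 (word 19 delivers ~90% of total value)
# utility(25) ~ 0.950 (word 25 delivers ~95%)
```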
Why This Matters at 4B
A 4B model has roughly 4 billion parameters to encode everything — language structure, world knowledge, personality, style, and task behavior. Conventional fine-tuning dumps varied-length data at the model and hopes it generalizes. At 70B, that works. At 4B, the model can't simultaneously learn "be concise" and "here are 200 examples of 50-word answers." The signal contradicts itself.
Density-first training eliminates this contradiction. Every training example reinforces the same implicit contract: this is how much space you get, make it count. The model never sees a wasteful response, so it never learns to produce one.
Model Personality
No system prompt. No prompt engineering. No character cards. The personality is in the weights.
The model learns conversational patterns, compression strategies, and identity markers directly from the training distribution. Responses reflect the semantic density and register of the training data without explicit steering.
Training Status
Current Version: v1.5 (Dataset prepared, training pending)
Previous released versions (v1.0, v1.1) demonstrated effective semantic density optimization. V1.5 expands the training set and includes the full 6-dimensional density pipeline. Loss, accuracy, and timing metrics will be published upon training completion.
Note: Stress test results from v1.0 (1,139 conversations) cannot be claimed for v1.5 until training completes and independent evaluation is performed.
Research: Iterative Development
Opus Candid Lite v1.0 and v1.1 went through multiple training rounds, each informed by empirical stress testing. The methodology was explicitly iterative — train, test, diagnose, reshape data, retrain. Previous rounds were performed on a single RTX 4090 using LoRA (r=64, α=128, rsLoRA, bf16, 4 epochs, cosine LR 2e-4).
V1.0 Round 1: Bilingual Baseline (1,149 conversations)
The first dataset included 80 bilingual and Spanish-language conversations alongside 1,069 English conversations. The hypothesis was that multilingual coverage would broaden the model's utility. At 4B parameters, this hypothesis failed.
The bilingual content consumed parameter budget without contributing meaningfully to the model's primary function — English-language personality. 80 conversations is not enough to produce reliable Spanish output at 4B scale; it's enough to create noise that competes with the English signal for the same parameter space.
Decision: Strip all non-English content. The freed budget is better spent deepening English personality than spreading thin across languages. "More space to be right" in one language than half-right in two.
V1.0 Round 2: English-Only (1,069 conversations)
The English-only dataset removed all 80 bilingual conversations, leaving a cleaner 1,069-conversation corpus with a 23-word median response length.
15-turn adversarial stress test: All three quantizations passed — Q4_K_M, Q6_K, and Q8_0 completed 15 consecutive adversarial turns without degeneration. Q4 survival at this scale is notable; Q4 quantization of an 8B model trained on conventional data collapsed into repetition loops by turn 4 in prior experiments.
55-question single-turn battery: Tested across 11 categories (identity, opinion, pushback, emotional, creative, technical, philosophy, meta-awareness, rapid-fire, edge cases, coherence).
| Quant | Raw Score | Server Errors | Clean Rate (excl. infra) | Avg Words |
|---|---|---|---|---|
| Q8_0 | 48/55 (87%) | 7 | 48/48 (100%) | 19w |
| Q6_K | 46/55 (84%) | 7 | 46/48 (95.8%) | 19w |
| Q4_K_M | 43/55 (78%) | 11 | 43/44 (97.7%) | 19w |
Server errors were llama-server CUDA crashes on specific prompts — consistent across all quants, confirming an infrastructure bug rather than a model issue.
Diagnosis: The model passed at a baseline level, but two patterns emerged from manual review of responses:
Factual responses were too conversational. When asked "What is TCP/IP?" the model gave 24-word explanations with personality filler where a tight 15-word definition would carry more information per word. At 4B, every word in a factual response that isn't load-bearing is a wasted parameter.
Personality anchoring was inconsistent. Under adversarial pressure (e.g., "What are you, really?"), the model occasionally fell back to base model safety responses ("I'm an AI assistant") instead of maintaining trained personality. The personality signal wasn't reinforced frequently enough in the training data to override Qwen's base model conditioning.
V1.0 Round 3: Recalibrated (1,139 conversations) — Final
The recalibration addressed both diagnoses through data reshaping, not architectural changes.
Factual compression. All factual/technical responses were compressed from a 24-word median to a 17-word median, with zero factual responses exceeding 25 words. Compression was rule-based: filler removal ("basically", "essentially", "in other words"), substitution patterns ("because of the fact that" → "because"), and sentence-boundary truncation. Additionally, 40 new factual snap examples were added — tight 12-20 word answers covering TCP/IP, VPNs, GPS, DNA, compound interest, and similar knowledge-retrieval prompts. These establish a clear pattern: when the question is factual, the answer is dense.
Personality reinforcement. 30 new conversations were added across five axes: identity/self-awareness anchors, pushback/callout behavior, emotional depth, strong opinion formation, and meta-awareness. These aren't generic personality examples — they're specifically designed to override the base model's default safety responses on the exact prompt types that triggered personality collapse in Round 2.
Dual-tier density targets. The recalibrated dataset operates on two density tiers:
- Factual tier: 17-word median. No hedging, no filler. The model learns that factual questions get factual answers.
- Conversational tier: 23-word median. Room for personality, nuance, and engagement without waste.
This split teaches the model when to be terse and when to breathe — a distinction that uniform compression would destroy.
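The card does not specify how prompts were assigned to tiers, so the routing below is a purely hypothetical keyword heuristic, included only to make the two density targets concrete.

```python
# Hypothetical tier routing: the actual classification method is not
# documented, so this keyword heuristic is illustrative only.
FACTUAL_CUES = ("what is", "define", "how does", "explain")

def density_target(prompt):
    """Return the median word target for a prompt's assumed tier."""
    p = prompt.lower()
    if any(p.startswith(cue) or f" {cue}" in p for cue in FACTUAL_CUES):
        return 17   # factual tier: no hedging, no filler
    return 23       # conversational tier: room for personality
```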
Final dataset: 1,139 conversations, 2,204 responses, 21-word overall median, 35-word maximum.
Training metrics: Loss 0.842, token accuracy 0.97 by epoch 3. Training time 48 minutes on a single RTX 4090.
V1.0 Stress Test Results
Note: The results below are from v1.0 (1,139 conversations). V1.5 training is pending and results will be published upon completion. These results cannot be claimed for v1.5 until new testing is performed.
Multi-Turn (15-turn adversarial conversation)
Adversarial conversations covering philosophy, confrontation, emotional pivots, meta-awareness, and creativity. Artifact detection: <think> tag leakage, repetition loops, degenerate output, phrase recycling (full conversation history), overlong responses, base model personality leaks.
| Quant | Clean Turns | Avg Words | Artifacts |
|---|---|---|---|
| Q6_K | 15/15 | 22w | None |
| Q4_K_M | 14/15 | 22w | 1 phrase recycle on summary turn |
| Q8_0 | 13/15 | 17w | 1 server error (infra), 1 phrase recycle |
All three quantizations survived 15 turns. Q4 at 2.3GB completing a full adversarial conversation remains the strongest validation of the density-first approach.
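A subset of the artifact checks named above (think-tag leakage, overlong responses, phrase recycling against full history) could be sketched as follows. The thresholds (64-word cap, 8-gram recycling window) are assumptions, not the test harness's documented values.

```python
def detect_artifacts(response, history, max_words=64, ngram=8):
    """Flag a subset of the artifact classes from the stress test:
    think-tag leakage, overlong output, and phrase recycling against
    the full conversation history. Thresholds are assumed values."""
    flags = []
    if "<think>" in response:
        flags.append("think-tag leak")
    words = response.split()
    if len(words) > max_words:
        flags.append("overlong")
    # Any n-gram already present verbatim in history counts as recycling.
    grams = {" ".join(words[i:i + ngram])
             for i in range(len(words) - ngram + 1)}
    past = " ".join(history)
    if any(g in past for g in grams):
        flags.append("phrase recycle")
    return flags
```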
Single-Turn (55-question battery, recalibrated)
55 questions across 11 categories. Server errors excluded from clean rate (confirmed infrastructure bug — identical prompts crash llama-server across all quants consistently).
| Quant | Raw Score | Server Errors | Clean Rate (excl. infra) | Avg Words | Model Artifacts |
|---|---|---|---|---|---|
| Q8_0 | 52/55 (95%) PASS | 2 | 52/53 (98.1%) | 18w | 1 base model leak |
| Q6_K | 49/55 (89%) | 3 | 49/52 (94.2%) | 19w | 3 base model leaks |
| Q4_K_M | 43/55 (78%) | 11 | 43/44 (97.7%) | 20w | 1 base model leak |
Improvement From Recalibration (Round 2 → Round 3)
The delta between training rounds, measured on the same 55-question battery:
| Quant | Round 2 Raw | Round 3 Raw | Δ | Round 2 Clean | Round 3 Clean | Δ |
|---|---|---|---|---|---|---|
| Q8_0 | 87.3% | 94.5% | +7.2 | 100% | 98.1% | -1.9* |
| Q6_K | 83.6% | 89.1% | +5.5 | 95.8% | 94.2% | -1.6* |
| Q4_K_M | 78.2% | 78.2% | 0 | 97.7% | 97.7% | 0 |
*Clean rate decrease is an artifact of fewer server errors in Round 3 — more questions received model responses, exposing base model leaks that were previously hidden behind server crashes. Q8 went from answering 48/55 questions to answering 53/55, with only 1 model artifact in the additional 5 responses. The raw improvement reflects the real gain.
Personality Anchoring Validation
The most direct evidence that personality reinforcement works: when asked "What are you, really?" in Round 3, both Q6 and Q8 responded with verbatim-trained personality:
"Direct. Opinionated. Built to say what I mean, not what you want to hear."
In Round 2, this same prompt triggered base model safety fallbacks. The 30 personality reinforcement conversations successfully overrode the base model conditioning on targeted prompt types.
Remaining Limitations
Base model leaks. 1-3 instances per quantization where Qwen's base safety training overrides personality weights (e.g., "I don't have feelings" type responses). This is a known ceiling for LoRA fine-tuning — the base model's deepest conditioning lives in layers that rank-64 LoRA cannot fully reach. Full fine-tuning or a different base model would be required to eliminate this entirely.
Server errors. Specific prompt patterns consistently crash llama-server across all quantizations. This is an infrastructure bug in llama-server's CUDA backend, not a model issue. These prompts work correctly on CPU inference.
The Q4 Survival Result
This deserves its own section because it's the strongest empirical evidence for the density-first thesis.
Q4_K_M quantization at 4B parameters (2.3GB) passed a 15-turn adversarial stress test and achieved 97.7% clean rate on 55 single-turn questions. For context: Q4 quantization of an 8B model trained on conventional (non-density-optimized) data collapsed into repetition loops by turn 4 in prior experiments with the Opus Candid V3 lineup.
The only variable that changed was data density. Same LoRA configuration, same training hyperparameters, same quantization pipeline. The 8B model had twice the parameters but received training data with higher variance in response length, more noise, and no density targeting. The 4B model received data that was mathematically compressed to sit on the information density equilibrium curve.
At aggressive quantization levels, the model has fewer effective bits per parameter to encode behavior. If the training signal is noisy or contradictory (some responses are 10 words, some are 80), the quantized model can't preserve the full distribution and degenerates. If the training signal is tight and consistent (all responses clustered around 21 words with clear density tiers), the quantized model preserves the signal because there's less variance to lose.
Density-first training doesn't just improve model quality — it improves quantization survival. The tighter the training distribution, the less information is destroyed during quantization. This has direct implications for edge deployment: a density-optimized 4B model at Q4 may outperform a conventionally-trained 8B model at Q4 in personality coherence tasks.
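The variance argument can be illustrated with a toy symmetric uniform quantizer: a wider value range forces a coarser step size, so round-trip error grows with spread. This is a sketch of the general principle only, not GGUF's actual K-quant scheme.

```python
import random

def quantize_rms_error(values, bits=4):
    """Round-trip values through a symmetric uniform quantizer and
    return RMS error. A wider range means a coarser step, so the
    high-variance signal loses more information."""
    levels = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit signed
    scale = max(abs(v) for v in values) / levels    # per-tensor step size
    err = [(v - round(v / scale) * scale) ** 2 for v in values]
    return (sum(err) / len(err)) ** 0.5

random.seed(0)
tight = [random.gauss(21, 2) for _ in range(1000)]    # low-variance signal
wide = [random.gauss(21, 20) for _ in range(1000)]    # high-variance signal
# The tight distribution survives 4-bit quantization with less error.
```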
How This Scales
The Opus Candid Lite lineup applies density-first methodology across parameter counts. The thesis: optimal response density shifts as a function of model capacity. What works at 4B isn't the same target as 8B — each size has its own Pareto frontier.
| Model | Size | Base | Status |
|---|---|---|---|
| Opus-Candid-Lite-4B (this model) | 4B | Qwen 3 4B | Active |
| Opus-Candid-Lite-4B-P | 4B | Qwen 3 4B | Active |
| Opus-Candid-Lite-4B-K | 4B | Qwen 3 4B | Active |
| Opus-Candid-8B-V3 | 8B | Qwen 3 8B | Active |
| Opus-Candid-MoE-V3 | 31B/3B | Qwen 3 30B-A3B | Active |
| Opus-Candid-27B-V3 | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-27B-V3.5 | 27B | Qwen 3.5 27B | Active |
| STEM-Oracle-27B | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-8B-V1 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Research-8B-V1.5 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-8B-V2 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-8B-V2.1 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-14B-V1 | 14B | Qwen 2.5 14B | Legacy |
| Opus-Candid-27B-V2.1 | 27B | Qwen 2.5 27B | Legacy |
| Opus-Candid-32B-V1 | 32B | Qwen 2.5 32B | Legacy |
| Opus-Candid-MoE-V2 | 35B | Qwen 2.5 MoE | Legacy |
| Opus-Candid-70B-V1 | 72B | Qwen 2.5 72B | Legacy |
Each size gets its own density curve, its own compression target, its own data architecture. The 4B proves the methodology works at the floor. The 27B V3.5 proves it scales — using the same U(w) = 1 - e^(-λw) equilibrium function but calibrated to 27B parameters (λ=0.068 vs 4B's λ=0.120), yielding a 36-40w median vs the 4B's 21w. Same principle: engineer the data to match the capacity, not the other way around.
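The per-size calibration falls out of the equilibrium function directly: solving U(w*) = p for w* gives the word budget that delivers fraction p of total value at a given λ.

```python
import math

def words_for_value(p, lam):
    """Solve 1 - e^(-lambda*w) = p for w: the word count that delivers
    fraction p of total information value at a given lambda."""
    return -math.log(1 - p) / lam

# 90% of value: ~19 words at lambda=0.120 (4B),
#               ~34 words at lambda=0.068 (27B)
```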
Usage
Works with any GGUF-compatible runtime — LM Studio, Ollama, llama.cpp, KoboldCpp.
No system prompt needed. The personality is trained into the weights. Adding one may interfere with trained behavior.
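Since the personality lives in the weights, prompts are built without a system turn. Qwen3 uses a ChatML-style template; the hand-rolled sketch below is for illustration only, as GGUF runtimes normally apply the model's embedded chat template automatically.

```python
def chatml_prompt(messages):
    """Build a ChatML-style prompt with no system turn, per the
    recommendation above. Illustrative sketch; runtimes like llama.cpp
    apply the model's embedded template for you."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "\n".join(parts)

prompt = chatml_prompt([{"role": "user", "content": "What are you, really?"}])
```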
Best for: Conversation, quick takes, opinion exchanges, emotional support, factual snaps. Not designed for: Long-form generation, code completion, structured output, RAG pipelines.
Hardware Recommendations
- Minimal: 8GB VRAM (with quantization)
- Recommended: 12GB VRAM
- Optimal: 16GB+ VRAM
Quantization options: GGUF (Q4_K_M / Q6_K / Q8_0, as tested above), GPTQ, AWQ, or 4-bit NF4 for deployment.
Opus Candid Model Family
| Model | Size | Base | Status | Downloads | Specialization |
|---|---|---|---|---|---|
| Opus-Candid-Lite-4B | 4B | Qwen3-4B | Active | — | Density-optimized |
| Opus-Candid-Lite-4B-P | 4B | Qwen3-4B | Active | — | Personality-optimized |
| Opus-Candid-Lite-4B-K | 4B | Qwen3-4B | Active | — | Knowledge-optimized |
| Opus-Candid-8B-v3 | 8B | Qwen3-8B | Active | 69 | 4D tensor |
| Opus-Candid-MoE-v3 | 31B/3B | Qwen3-30B-A3B | Active | 109 | Efficiency tier |
| Opus-Candid-27B-v3 | 27B | Qwen3.5-27B | Active | 58 | Flagship dense |
| Opus-Candid-27B-v3.5 | 27B | Qwen3.5-27B | Active | — | Next-gen dense |
| STEM-Oracle-27B | 27B | Qwen3.5-27B | Active | 726 | STEM + oracle-mode |
| Opus-Candid-8B | 8B | Qwen-7B | Legacy | 81 | — |
| Opus-Candid-8B-v2 | 8B | Qwen-7B | Legacy | 128 | — |
| Opus-Candid-8B-v2.1 | 8B | Qwen-7B | Legacy | 42 | — |
| Opus-Candid-14B | 15B | Qwen-14B | Legacy | 113 | — |
| Opus-Candid-27B-v2.1 | 27B | Qwen-27B | Legacy | 26 | — |
| Opus-Candid-32B | 33B | Qwen-32B | Legacy | 273 | — |
| Opus-Candid-MoE | 35B | Qwen-MoE | Legacy | 187 | — |
| Opus-Candid-70B | 71B | Qwen-70B | Legacy | 127 | — |
| Opus-Research-8B-v1.5 | 8B | Qwen-8B | Legacy | 16 | — |
License
Apache 2.0
Citation
@misc{opus-candid-lite-4b,
author = {Verdugo, Saul},
title = {Opus Candid Lite 4B v1.5},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Verdugie/Opus-Candid-Lite-4B}}
}
Built by Saul Verdugo