Leo Innocenzi

leryud

AI & ML interests

None yet

Organizations

None yet

liked a model 12 days ago

z-lab/Qwen3.6-27B-DFlash

Text Generation • 2B • Updated 20 days ago • 52.8k • 300

liked a model 21 days ago

Qwen/Qwen3.6-27B

Image-Text-to-Text • 28B • Updated 23 days ago • 3.26M • 1.3k

liked a model about 1 month ago

ByteDance-Seed/Stable-DiffCoder-8B-Instruct

Text Generation • 8B • Updated Mar 25 • 1.87k • 137

reacted to SeaWolf-AI's post with 👍 about 1 month ago

Post

5560

🧬 Darwin V6: Diagnostic-Guided Evolutionary Model Merging

We are releasing Darwin-31B-Opus — a reasoning-enhanced model merging Google's Gemma-4-31B-it and TeichAI's Claude Opus Distill using the Darwin V6 engine.

Model: FINAL-Bench/Darwin-31B-Opus
Demo: FINAL-Bench/Darwin-31B-Opus

🔬 What Darwin V6 Does

Conventional merging tools (mergekit, etc.) apply a single ratio to all tensors. Set ratio=0.5 and all 1,188 tensors blend identically, with no distinction between which tensors matter for reasoning versus coding.

Darwin V6 diagnoses both parents at the tensor level before merging. It measures Shannon entropy, standard deviation, and L2 norm for every tensor, then passes 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) through the model to determine layer-wise functional importance. Each of the 1,188 tensors receives an independent optimal ratio.

combined = static(entropy/std/norm) x 0.4 + probe(cosine_distance) x 0.6
final_ratio = mri_ratio x mri_trust + genome_ratio x (1 - mri_trust)

When one parent is overwhelmingly superior for a tensor (ratio < 0.15 or > 0.85), Darwin transplants it directly without interpolation. The mri_trust parameter itself is optimized by CMA-ES evolutionary search, so optimal transplant intensity is determined automatically. After merging, a Health Check compares the child against both parents layer-by-layer to detect interference or function loss.
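The two formulas and the transplant rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Darwin V6 implementation: the function names, the use of plain lists as tensors, and the convention that `ratio` is the Mother's share are all hypothetical, and the post does not say how `combined` is turned into `mri_ratio`, so the two are kept as separate helpers here.

```python
from typing import List


def combined_score(static_score: float, probe_score: float) -> float:
    """Blend static tensor statistics with the probe signal (0.4 / 0.6 weights from the post)."""
    return 0.4 * static_score + 0.6 * probe_score


def final_ratio(mri_ratio: float, genome_ratio: float, mri_trust: float) -> float:
    """Mix the diagnostic (MRI) ratio with the evolved genome ratio, weighted by mri_trust."""
    return mri_ratio * mri_trust + genome_ratio * (1.0 - mri_trust)


def merge_tensor(
    father: List[float],
    mother: List[float],
    ratio: float,
    low: float = 0.15,
    high: float = 0.85,
) -> List[float]:
    """Merge one tensor, here a flat list of floats.

    Assumption: `ratio` is the Mother's share of the blend. Outside the
    [low, high] band one parent is copied directly (the "transplant" case);
    inside it the two parents are linearly interpolated.
    """
    if ratio < low:
        return list(father)  # transplant Father unchanged
    if ratio > high:
        return list(mother)  # transplant Mother unchanged
    return [(1.0 - ratio) * f + ratio * m for f, m in zip(father, mother)]
```

For example, `merge_tensor(father, mother, final_ratio(0.9, 0.5, 0.5))` interpolates at a ratio of about 0.7, while any ratio above 0.85 returns a copy of the Mother tensor. A conventional uniform merge corresponds to calling `merge_tensor` with the same ratio for every tensor.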

🧬 Parent Models
Father: google/gemma-4-31B-it
Mother: TeichAI/gemma-4-31B-it-Claude-Opus-Distill

🧬 Results
Compared under identical conditions (same 50 questions, same seed, greedy, thinking mode):
Father: 60.0% (30/50)
Darwin-31B-Opus: 66.0% (33/50) — +10% relative improvement
ARC-Challenge: 82.89% (loglikelihood, zero-shot, 200 questions)
Optimal genome found by evolution:
ffn_ratio=0.93 — FFN layers strongly favor Mother (Claude Opus Distill)
block_5 (L50-L59)=0.86 and more...

11 replies

liked a dataset 4 months ago

google/reveal

Viewer • Updated Apr 9, 2024 • 6.1k • 43 • 35

liked 3 models 4 months ago

liked a dataset 5 months ago

choosealicense/licenses

Updated Apr 17, 2024 • 3.28k • 47

reacted to hesamation's post with ❤️ 5 months ago

Post

4859

this is big... 50 AI researchers from Bytedance, Alibaba, Tencent, and other labs/universities just published a 300-page paper with surprising lessons about coding models and agents (data, pre- and post-training, etc).

key highlights:

> small LLMs can beat proprietary giants
RL (RLVR specifically) gives small open-source models an edge over big models in reasoning. a 14B model trained with RLVR on high-quality verified problems can match the performance of OpenAI's o3.

> models have a hard time learning Python.
mixing programming languages during pre-training is good, but Python behaves differently from statically typed languages. languages with similar syntax (Java and C#, or JavaScript and TypeScript) create high positive synergy. mixing Python heavily into the training of statically typed languages can actually hurt because of Python's dynamic typing.

> not all languages are equal (coding scaling laws)
the amount of data required to specialize a model on a language depends drastically on the language. the paper argues that languages like C# and Java are easier to learn (less training data required). languages like Python and JavaScript are actually trickier to learn, ironically (you see AI used most for these languages :)

> MoE vs Dense (ability vs stability)
MoE models offer higher capacity, but are much more fragile during SFT than dense models. hyperparams in training have a more drastic effect in MoE models, while dense models are more stable. MoE models also require constant learning rate schedules to avoid routing instability.

> code models are "insecure" by default (duh)
training on public repos makes models learn years of accumulated insecure coding patterns. safety fine-tuning often does little for code. a model might refuse to write a hate speech email but will happily generate a SQL-injection vulnerable function because it "works."
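to make the SQL-injection point concrete, here is a small sketch (not taken from the paper) using Python's standard `sqlite3` module. the `users` table and the function names are made up for illustration. both functions "work" on normal input, which is why the unsafe pattern is so common in public code.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # Vulnerable: user input is spliced directly into the SQL text,
    # so a crafted name can change the meaning of the query.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str):
    # Parameterized: the driver passes the value separately from the SQL,
    # so the input is always treated as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

with the input `x' OR '1'='1`, `find_user_unsafe` returns every row in the table, while `find_user_safe` returns nothing because no user has that literal name.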

read the full paper:
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence (2511.18538)