[Agent Evaluation] Testing CoPaw-Flash-9B as a local coding agent

#1
by FutureMa - opened

Hi AgentScope team and community,

I recently tested CoPaw-Flash-9B as a local coding agent using a custom frontend (claude-code-clean). I evaluated it on a task involving creating, debugging, and fixing a terminal text adventure game.

I've written a detailed evaluation report on my GitHub. Here is a quick summary of the findings:

Strengths:

  • Instruction Following: Excellent at interpreting intent, handling Chinese/English naturally.
  • Reasoning & Tool Use: The built-in <think> CoT is very effective, and it correctly invokes Bash, Write, and Edit tools in sequence.

Areas for Improvement (Weaknesses):

  • Agent Loop Incompleteness: The model frequently stops mid-task and waits for human nudges ("continue"), lacking the autonomy to drive the task to completion.
  • Error Diagnosis & Self-Verification: When tests fail, it struggles with incremental debugging (tends to rewrite whole files) and lacks robust runtime self-verification.
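On the self-verification point: one pattern a harness can enforce, independent of the model, is to run the program or its tests after every edit and only accept the step if the check passes. A minimal sketch of such a check helper (the `verify` name and shape are illustrative, not part of claude-code-clean):

```python
import subprocess
import sys

def verify(cmd):
    """Run a verification command; return (passed, combined output).

    The output can be fed back to the model so it debugs incrementally
    instead of rewriting whole files.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

# Example: a passing and a failing check.
ok, _ = verify([sys.executable, "-c", "assert 1 + 1 == 2"])
bad, msg = verify([sys.executable, "-c", "assert 1 + 1 == 3"])
```

Feeding `msg` back into the context on failure gives the model a concrete traceback to react to, rather than leaving it to guess whether its last edit worked.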

You can read the full evaluation details and rating breakdown in my repo here:
👉 CoPaw-Flash-9B Local Agent Evaluation

Thanks for training and open-sourcing this model! It shows great promise as a 9B model. Hopefully, these edge-case observations are helpful for your future fine-tuning iterations.

Isn't the "Agent Loop Incompleteness" problem a harness problem rather than a model problem? Have you tried it with Agent Zero?


Hi @sebastienbo , good question! I tested this directly using claude-code-clean, which is essentially Anthropic's Claude Code: a robust, fully autonomous coding agent framework. The harness itself handles the loop perfectly well with larger models (such as Claude or GPT). The "Incompleteness" happens because CoPaw-Flash-9B stops predicting the next tool call mid-task and simply waits for human confirmation or a "continue" nudge.
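To make the distinction concrete: a harness-side workaround for this stall would be to detect a reply with no tool call and no completion signal, and inject the "continue" nudge automatically. A minimal sketch of such a driver loop (all names are illustrative, not claude-code-clean's actual API):

```python
def run_agent(model_step, max_turns=20):
    """Drive the model until it signals completion, nudging it when it stalls."""
    history = []
    nudges = 0
    for _ in range(max_turns):
        reply = model_step(history)
        history.append(reply)
        if reply.get("done"):
            return history, nudges
        if reply.get("tool_call") is None:
            # Model stalled mid-task: inject a "continue" turn automatically.
            history.append({"role": "user", "content": "continue"})
            nudges += 1
    return history, nudges

# Toy model that stalls once before finishing, mimicking the observed behavior.
def stalling_model(history):
    model_turns = sum(1 for m in history if m.get("role") != "user")
    if model_turns == 0:
        return {"tool_call": {"name": "Write", "args": {"path": "game.py"}}}
    if model_turns == 1:
        return {"tool_call": None}  # stalls, waits for a nudge
    return {"tool_call": None, "done": True}

history, nudges = run_agent(stalling_model)
```

This papers over the symptom, but the nudge count still measures a model-side alignment gap: a fully agentic model would keep emitting tool calls without the injected turns.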

Hi @FutureMa ,
Thank you very much for the detailed evaluation and valuable contribution. Your findings are very helpful, not only for us but also for the broader community.
We are continuing to actively update and improve the CoPaw-Flash models. Please stay tuned, and thanks again for your support!

Hi @hiyuchang and @sebastienbo ,

Quick update: we actually managed to fix the "Agent Loop Incompleteness" issue I mentioned!

We trained a LoRA based on CoPaw-Flash-9B specifically for agentic data analysis. You can find it here: CoPaw-Flash-9B-DataAnalyst-LoRA

A few quick takeaways from our benchmark on 29 real-world Kaggle datasets:

  • Model vs. Harness: @sebastienbo To answer your question: it was indeed a model-side alignment issue, not a harness problem.
  • The Fix: Before LoRA, the base model averaged 1.2 iterations before waiting for human nudges. After LoRA, it drives the task autonomously for 26.0 continuous iterations (writing code, executing, debugging, and plotting).
  • Result: It achieved a 90% completion rate entirely on its own.
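For context on the numbers above: the "iterations before a nudge" figure can be read as the count of consecutive model-driven turns before the first human "continue", averaged across runs. A hypothetical reconstruction of that metric (not the actual benchmark code):

```python
def autonomous_iterations(transcript):
    """Count assistant turns before the first human 'continue' nudge."""
    count = 0
    for turn in transcript:
        if turn["role"] == "user" and turn["content"].strip().lower() == "continue":
            break
        if turn["role"] == "assistant":
            count += 1
    return count

# Two toy runs: one that needs a nudge after a single turn, one fully autonomous.
runs = [
    [{"role": "assistant", "content": "write code"},
     {"role": "user", "content": "continue"},
     {"role": "assistant", "content": "run tests"}],
    [{"role": "assistant", "content": "write code"},
     {"role": "assistant", "content": "debug"},
     {"role": "assistant", "content": "plot"}],
]
avg = sum(autonomous_iterations(r) for r in runs) / len(runs)
```

Under this reading, the jump from 1.2 to 26.0 means the LoRA'd model chains roughly twenty-six write/execute/debug/plot turns before pausing, versus stalling almost immediately before.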

CoPaw-Flash-9B proved to be a fantastic foundation with huge potential. Hope our fine-tuning experiment and benchmark data provide some useful reference for your future iterations!
