| # Tool Calling Training Data Analysis |
|
|
| **Generated:** 2026-04-06 |
| **Files Analyzed:** |
| - `training-data/tool_examples.jsonl` (original) |
| - `training-data_v2/tool_examples.jsonl` (regenerated) |
|
|
| --- |
|
|
| ## Executive Summary |
|
|
| The original tool calling training data had **significant quality issues** that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors. |
|
|
| **Key Findings on Original Data:** |
| - ❌ 10.5% of tool calls use incorrect parameters (mismatched search queries, wrong files) |
| - ❌ Heavy prompt duplication (7.5x average) |
| - ❌ No multi-step tool chains (only 1 tool per example) |
| - ❌ All examples use identical tool definitions |
|
|
| **Action Taken:** Generated 500 new examples using the project's generator script. |
|
|
| **Recommendation:** The original data needs substantial improvements before use in training. |
|
|
| --- |
|
|
| ## 1. Statistics Overview |
|
|
| ### Original Data (tool_examples.jsonl) |
| |
| | Metric | Value | |
| |--------|-------| |
| | Total Examples | 1,000 | |
| | Unique Prompts | 133 | |
| | Average Duplication | 7.52x | |
| | Unique Tool Sequences | 5 | |
| | Examples with Issues | ~107 (10.7%) | |
| |
| ### New Data (tool_examples_v2.jsonl) |
| |
| | Metric | Value | |
| |--------|-------| |
| | Total Examples | 500 | |
| | File Size | 1.9 MB | |
| | Tools per Example | 5 (static definition) | |
| |
| ### Tool Call Distribution (Original) |
| |
| | Tool | Call Count | |
| |------|------------| |
| | Bash | 200 | |
| | FileRead | 200 | |
| | FileWrite | 200 | |
| | WebSearch | 200 | |
| | Grep | 200 | |
| |
| All examples have exactly **one tool call** - no multi-step chains exist. |
| |
| --- |
| |
| ## 2. Prompt Diversity Analysis (Original Data) |
| |
| ### Prompt Categories |
| |
| | Category | Count | Percentage | |
| |----------|-------|------------| |
| | Python | 207 | 20.7% | |
| | React | 149 | 14.9% | |
| | File Read | 134 | 13.4% | |
| | File Write | 119 | 11.9% | |
| | Other | 114 | 11.4% | |
| | Run Command | 80 | 8.0% | |
| | Docker/K8s | 67 | 6.7% | |
| | Search | 50 | 5.0% | |
| | Git | 40 | 4.0% | |
| | Testing | 31 | 3.1% | |
| | Package Management | 9 | 0.9% | |
| |
| ### Most Duplicated Prompts |
| |
| | Prompt | Occurrences | |
| |--------|-------------| |
| | "Run the tests with pytest" | 40 | |
| | "Run npm install to install dependencies" | 40 | |
| | "Write a simple React component to src/components/Button.jsx" | 67 | |
| |
| --- |
| |
| ## 3. Tool Usage Breakdown |
| |
| ### Tool Definitions |
| |
| All 1,000 original examples use **identical tool definitions** with 5 tools: |
| - `Bash` - Execute bash commands |
| - `FileRead` - Read file contents |
| - `FileWrite` - Create/overwrite files |
| - `WebSearch` - Search the web |
| - `Grep` - Search for patterns in files |
| |
| ### Tool Call Issues Found (Original Data) |
| |
| #### Wrong Search Patterns (105 instances / 10.5%) |
| |
| The `WebSearch` tool frequently uses queries that don't match the user's question: |
| |
| | User Question | Actual Search Query | |
| |--------------|---------------------| |
| | "How do I use async/await in Python?" | "AWS Lambda cold start optimization" | |
| | "How do I use React hooks properly?" | "SQL join types explained" | |
| | "What's the difference between Docker and Kubernetes?" | "Git rebase vs merge" | |
| | "How do I use React hooks properly?" | "TypeScript generics tutorial" | |
| | "What's the difference between Docker and Kubernetes?" | "TypeScript generics tutorial" | |
| |
| #### Wrong File Paths (2 instances) |
| |
| The `FileWrite` tool sometimes writes to incorrect file types: |
| |
| | User Request | Written Path | |
| |-------------|--------------| |
| | "Create a src/components/Header.jsx file" | Written to `config.json` | |
| | "Create a src/middleware.py file with settings" | Written to `config.yaml` | |
| |
| #### Pattern/File Type Mismatches (Grep) |
| |
| The `Grep` tool sometimes searches with mismatched patterns: |
| |
| | Pattern | File Pattern | Issue | |
| |---------|-------------|-------| |
| | `class ` | `*.ts` | Python pattern in TypeScript files | |
| | `SELECT ` | `*.js` | SQL pattern in JavaScript files | |
| | `TODO` | `*.md` | Searching TODO in markdown files | |
| |
| --- |
| |
| ## 4. Data Quality Issues |
| |
| ### Critical Issues |
| |
| 1. **No Multi-Step Tool Chains** |
| - All 1,000 examples use exactly one tool call |
| - Real coding tasks typically require 2-5+ tool calls |
| - Example: "Read file → Find pattern → Search docs → Write fix" |
| |
| 2. **Search Query Mismatches** |
| - 10.5% of WebSearch calls have irrelevant queries |
| - Indicates the generator script has logic errors |
| |
| 3. **Heavy Prompt Duplication** |
| - 133 unique prompts duplicated to 1,000 examples |
| - "Write a simple React component" appears 67 times |
| - This creates overfitting to specific prompts |
| |
| 4. **Identical Tool Definitions** |
| - All examples use the same 5 tools with identical descriptions |
| - No variation in tool schemas or parameter structures |
| |
| ### Moderate Issues |
| |
| 5. **File Path Hallucination** |
| - Tool calls reference files that don't exist in actual codebase |
| - Example: asking for `tests/test_main.py` but reading `src/app.js` |
|
|
| 6. **Response Fabrication** |
| - Assistant responses sometimes claim to show content that wasn't actually read |
| - Example: "Here's the README.md" when README.md wasn't the file requested |
|
|
| --- |
|
|
| ## 5. Recommendations for Improvement |
|
|
| ### Immediate Actions (Completed) |
|
|
| 1. ✅ **Regenerated Data** |
| ``` |
| Generated 500 new examples in training-data_v2/tool_examples.jsonl |
| ``` |
|
|
| ### Script Fixes Needed |
|
|
| The generator script (`scripts/generate_tool_data.py`) needs: |
|
|
| 1. Fix `TOOL_CALL_PAIRS` mapping - queries don't match questions |
| 2. Fix `FILE_PATTERNS` - wrong file types for requested content |
| 3. Add multi-step chain generation |
| 4. Add prompt variation templates |
| 5. Add validation to check query/content relevance |
|
|
| ### Future Improvements |
|
|
| 1. **Add Multi-Step Examples** |
| - Real tasks require reading files, searching, editing |
| - Generate chains of 2-4 tool calls per example |
|
|
| 2. **Increase Prompt Diversity** |
| - Target 500+ unique prompts instead of duplicating |
| - Use template variations and paraphrasing |
|
|
| 3. **Vary Tool Definitions** |
| - Different tools per example |
| - Add tool variations (e.g., different Bash commands) |
|
|
| --- |
|
|
| ## 6. Conclusion |
|
|
| The original `tool_examples.jsonl` data is **NOT suitable for production training** without significant improvements: |
|
|
| - ~10% of examples have incorrect tool parameters |
| - Heavy duplication leads to overfitting |
| - No multi-step chains fail to represent real coding workflows |
| - Synthetic generation errors are systematic |
|
|
| **Action Completed:** Generated 500 new examples via the project's generator script. |
|
|
| **Remaining Work:** Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration. |
|
|
| --- |
|
|
| ## Appendix: Quick Stats |
|
|
| ### Original Data |
| ``` |
| Total examples: 1,000 |
| Unique prompts: 133 |
| Tool call issues: 107 (10.7%) |
| Multi-tool chains: 0 (0%) |
| Identical tool defs: 100% |
| Average duplication: 7.52x |
| ``` |
|
|
| ### New Data (Generated) |
| ``` |
| Total examples: 500 |
| File size: 1.9 MB |
| Location: training-data_v2/tool_examples.jsonl |
| ``` |