# Tool Calling Training Data Analysis
**Generated:** 2026-04-06
**Files Analyzed:**
- `training-data/tool_examples.jsonl` (original)
- `training-data_v2/tool_examples.jsonl` (regenerated)
---
## Executive Summary
The original tool calling training data had **significant quality issues** that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors.
**Key Findings on Original Data:**
- ❌ 10.7% of examples (107 of 1,000) use incorrect tool parameters (105 mismatched search queries, 2 wrong file paths)
- ❌ Heavy prompt duplication (7.5x average)
- ❌ No multi-step tool chains (only 1 tool per example)
- ❌ All examples use identical tool definitions
**Action Taken:** Generated 500 new examples using the project's generator script.
**Recommendation:** The original data needs substantial improvements before use in training.
---
## 1. Statistics Overview
### Original Data (tool_examples.jsonl)
| Metric | Value |
|--------|-------|
| Total Examples | 1,000 |
| Unique Prompts | 133 |
| Average Duplication | 7.52x |
| Unique Tool Sequences | 5 |
| Examples with Issues | ~107 (10.7%) |
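The duplication figure in the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the prompts have already been extracted from the JSONL records into a list of strings; the function name `duplication_stats` is illustrative and is not part of the project.

```python
from collections import Counter


def duplication_stats(prompts):
    """Return (total, unique, average duplication) for a list of prompt strings."""
    counts = Counter(p.strip() for p in prompts)
    total = sum(counts.values())
    unique = len(counts)
    average = total / unique if unique else 0.0
    return total, unique, average


# Toy data standing in for prompts pulled from the real file.
sample = ["Run the tests with pytest"] * 3 + ["Read README.md"]
print(duplication_stats(sample))  # (4, 2, 2.0)

# Sanity check on the reported numbers: 1,000 examples over 133 unique prompts.
print(round(1000 / 133, 2))  # 7.52
```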
### New Data (training-data_v2/tool_examples.jsonl)
| Metric | Value |
|--------|-------|
| Total Examples | 500 |
| File Size | 1.9 MB |
| Tools per Example | 5 (static definition) |
### Tool Call Distribution (Original)
| Tool | Call Count |
|------|------------|
| Bash | 200 |
| FileRead | 200 |
| FileWrite | 200 |
| WebSearch | 200 |
| Grep | 200 |
All examples have exactly **one tool call** - no multi-step chains exist.
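A distribution like the one above, along with the count of multi-step examples, could be computed as follows. This is only a sketch: the record layout (a `tool_calls` list whose items carry a `name` key) is an assumption about the JSONL schema, not something confirmed by the data files.

```python
from collections import Counter


def tool_call_summary(examples):
    """Count calls per tool and how many examples chain more than one call."""
    per_tool = Counter()
    multi_step = 0
    for example in examples:
        calls = example.get("tool_calls", [])  # assumed field name
        per_tool.update(call["name"] for call in calls)  # assumed key
        if len(calls) > 1:
            multi_step += 1
    return per_tool, multi_step


# Toy records; the third one is a two-step chain.
examples = [
    {"tool_calls": [{"name": "Bash"}]},
    {"tool_calls": [{"name": "Grep"}]},
    {"tool_calls": [{"name": "FileRead"}, {"name": "FileWrite"}]},
]
per_tool, multi_step = tool_call_summary(examples)
print(dict(per_tool))
print(multi_step)  # 1
```

On the original data, `multi_step` would be expected to come out as 0, matching the observation that no chains exist.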
---
## 2. Prompt Diversity Analysis (Original Data)
### Prompt Categories
| Category | Count | Percentage |
|----------|-------|------------|
| Python | 207 | 20.7% |
| React | 149 | 14.9% |
| File Read | 134 | 13.4% |
| File Write | 119 | 11.9% |
| Other | 114 | 11.4% |
| Run Command | 80 | 8.0% |
| Docker/K8s | 67 | 6.7% |
| Search | 50 | 5.0% |
| Git | 40 | 4.0% |
| Testing | 31 | 3.1% |
| Package Management | 9 | 0.9% |
### Most Duplicated Prompts
| Prompt | Occurrences |
|--------|-------------|
| "Write a simple React component to src/components/Button.jsx" | 67 |
| "Run the tests with pytest" | 40 |
| "Run npm install to install dependencies" | 40 |
---
## 3. Tool Usage Breakdown
### Tool Definitions
All 1,000 original examples use **identical tool definitions** with 5 tools:
- `Bash` - Execute bash commands
- `FileRead` - Read file contents
- `FileWrite` - Create/overwrite files
- `WebSearch` - Search the web
- `Grep` - Search for patterns in files
### Tool Call Issues Found (Original Data)
#### Wrong Search Queries (105 instances / 10.5% of examples)
In 105 examples, the `WebSearch` tool is called with a query unrelated to the user's question:
| User Question | Actual Search Query |
|--------------|---------------------|
| "How do I use async/await in Python?" | "AWS Lambda cold start optimization" |
| "How do I use React hooks properly?" | "SQL join types explained" |
| "What's the difference between Docker and Kubernetes?" | "Git rebase vs merge" |
| "How do I use React hooks properly?" | "TypeScript generics tutorial" |
| "What's the difference between Docker and Kubernetes?" | "TypeScript generics tutorial" |
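Mismatches like those in the table could be flagged automatically with a simple keyword-overlap check. The sketch below is a crude heuristic under stated assumptions: the stop-word list and the function names are hypothetical, and this is not the project's validation logic. A real check would likely need synonyms or embeddings to avoid false positives.

```python
import re

# Hypothetical stop-word list; tune it for the actual prompt vocabulary.
STOPWORDS = {
    "how", "do", "i", "use", "in", "the", "a", "an", "to", "and", "what",
    "s", "is", "between", "of", "vs", "properly", "difference",
}


def keywords(text):
    """Lowercase alphanumeric tokens with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def query_matches_question(question, query):
    """True if the question and the search query share at least one keyword."""
    return bool(keywords(question) & keywords(query))


print(query_matches_question(
    "How do I use async/await in Python?",
    "AWS Lambda cold start optimization",
))  # False
print(query_matches_question(
    "How do I use React hooks properly?",
    "React hooks tutorial",
))  # True
```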
#### Wrong File Paths (2 instances)
The `FileWrite` tool sometimes writes to a different path and file type than the one requested:
| User Request | Written Path |
|-------------|--------------|
| "Create a src/components/Header.jsx file" | `config.json` |
| "Create a src/middleware.py file with settings" | `config.yaml` |
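A path check for the cases in the table above can be sketched by extracting anything that looks like a file path from the prompt and comparing it with the path passed to `FileWrite`. The regular expression and the function names are illustrative assumptions; prompts that mention no path are treated as unverifiable and pass.

```python
import re

# Matches tokens such as "src/components/Header.jsx" or "README.md".
PATH_RE = re.compile(r"[\w./-]+\.\w+")


def requested_paths(prompt):
    """Return every path-like token mentioned in the prompt."""
    return PATH_RE.findall(prompt)


def write_path_matches(prompt, written_path):
    """True if the written path was requested, or the prompt names no path."""
    paths = requested_paths(prompt)
    return not paths or written_path in paths


print(requested_paths("Create a src/components/Header.jsx file"))
# ['src/components/Header.jsx']
print(write_path_matches("Create a src/components/Header.jsx file", "config.json"))
# False
```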
#### Pattern/File Type Mismatches (Grep)
The `Grep` tool sometimes searches with mismatched patterns:
| Pattern | File Pattern | Issue |
|---------|-------------|-------|
| `class ` | `*.ts` | Flagged as a Python pattern in TypeScript files (note: `class` is also valid TypeScript) |
| `SELECT ` | `*.js` | SQL pattern in JavaScript files |
| `TODO` | `*.md` | Searching TODO in markdown files |
---
## 4. Data Quality Issues
### Critical Issues
1. **No Multi-Step Tool Chains**
   - All 1,000 examples use exactly one tool call
   - Real coding tasks typically require 2-5+ tool calls
   - Example: "Read file → Find pattern → Search docs → Write fix"
2. **Search Query Mismatches**
   - 105 of the 200 `WebSearch` calls have irrelevant queries (10.5% of all examples)
   - Indicates the generator script has logic errors
3. **Heavy Prompt Duplication**
   - 133 unique prompts duplicated to 1,000 examples
   - "Write a simple React component to src/components/Button.jsx" appears 67 times
   - This risks overfitting to specific prompts
4. **Identical Tool Definitions**
   - All examples use the same 5 tools with identical descriptions
   - No variation in tool schemas or parameter structures
### Moderate Issues
5. **File Path Hallucination**
   - Tool calls reference files that don't exist in the actual codebase
   - Example: the user asks for `tests/test_main.py` but the tool call reads `src/app.js`
6. **Response Fabrication**
   - Assistant responses sometimes claim to show content that wasn't actually read
   - Example: "Here's the README.md" when README.md wasn't the file requested
---
## 5. Recommendations for Improvement
### Immediate Actions (Completed)
1. **Regenerated Data**
```
Generated 500 new examples in training-data_v2/tool_examples.jsonl
```
### Script Fixes Needed
The generator script (`scripts/generate_tool_data.py`) needs:
1. Fix `TOOL_CALL_PAIRS` mapping - queries don't match questions
2. Fix `FILE_PATTERNS` - wrong file types for requested content
3. Add multi-step chain generation
4. Add prompt variation templates
5. Add validation to check query/content relevance
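For item 4, prompt variation could be approached by expanding a small set of templates against a list of slot values. The sketch below is an assumption about how this might look; the template strings, the `RUNNERS` list, and `expand_prompts` are all hypothetical and do not come from `scripts/generate_tool_data.py`.

```python
import itertools

# Hypothetical templates and slot values for a "run tests" task.
TEMPLATES = [
    "Run the tests with {runner}",
    "Please execute the test suite using {runner}",
    "Can you run {runner} and show me the failures?",
]
RUNNERS = ["pytest", "unittest", "tox"]


def expand_prompts(templates, values, key):
    """Fill every template with every value, giving len(templates) * len(values) prompts."""
    return [
        template.format(**{key: value})
        for template, value in itertools.product(templates, values)
    ]


prompts = expand_prompts(TEMPLATES, RUNNERS, "runner")
print(len(prompts), len(set(prompts)))  # 9 9
```

Three templates and three values already yield nine distinct prompts, so a modest template library could replace the current practice of repeating the same prompt dozens of times.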
### Future Improvements
1. **Add Multi-Step Examples**
   - Real tasks require reading files, searching, editing
   - Generate chains of 2-4 tool calls per example
2. **Increase Prompt Diversity**
   - Target 500+ unique prompts instead of duplicating
   - Use template variations and paraphrasing
3. **Vary Tool Definitions**
   - Different tools per example
   - Add tool variations (e.g., different Bash commands)
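To illustrate what a multi-step example might look like, the sketch below assembles one record with an ordered chain of tool calls. The record layout (`prompt`, `tool_calls`, `name`, `arguments`) and the argument keys such as `path` and `file_pattern` are assumptions made for illustration; the actual schema used by the generator may differ.

```python
import json


def build_chain_example(prompt, steps):
    """Assemble one training record whose tool calls run in the given order."""
    if not 2 <= len(steps) <= 4:
        raise ValueError("expected a chain of 2-4 tool calls")
    return {
        "prompt": prompt,
        "tool_calls": [
            {"name": name, "arguments": arguments} for name, arguments in steps
        ],
    }


# Hypothetical task: read a file, locate the imports, then write the fix.
example = build_chain_example(
    "Fix the failing import in src/app.py",
    [
        ("FileRead", {"path": "src/app.py"}),
        ("Grep", {"pattern": "import ", "file_pattern": "*.py"}),
        ("FileWrite", {"path": "src/app.py", "content": "# fixed file body"}),
    ],
)
print(json.dumps(example, indent=2))
```

Keeping the file path consistent across the steps of a chain also gives a validator something concrete to check.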
---
## 6. Conclusion
The original `tool_examples.jsonl` data is **NOT suitable for production training** without significant improvements:
- ~10% of examples have incorrect tool parameters
- Heavy duplication leads to overfitting
- The absence of multi-step chains means the data does not represent real coding workflows
- Synthetic generation errors are systematic
**Action Completed:** Generated 500 new examples via the project's generator script.
**Remaining Work:** Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration.
---
## Appendix: Quick Stats
### Original Data
```
Total examples: 1,000
Unique prompts: 133
Tool call issues: 107 (10.7%)
Multi-tool chains: 0 (0%)
Identical tool defs: 100%
Average duplication: 7.52x
```
### New Data (Generated)
```
Total examples: 500
File size: 1.9 MB
Location: training-data_v2/tool_examples.jsonl
```