Stack-2-9-finetuned / docs /TOOL_DATA_ANALYSIS.md

walidsobhie-code

feat: add inference API, quickstart guide, roadmap, and combined tool data

b03a8a0 25 days ago

6.91 kB

	# Tool Calling Training Data Analysis

	Generated: 2026-04-06
	Files Analyzed:
	- `training-data/tool_examples.jsonl` (original)
	- `training-data_v2/tool_examples.jsonl` (regenerated)

	---

	## Executive Summary

	The original tool calling training data had significant quality issues that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors.

	Key Findings on Original Data:
	- ❌ 10.5% of tool calls use incorrect parameters (mismatched search queries, wrong files)
	- ❌ Heavy prompt duplication (7.5x average)
	- ❌ No multi-step tool chains (only 1 tool per example)
	- ❌ All examples use identical tool definitions

	Action Taken: Generated 500 new examples using the project's generator script.

	Recommendation: The original data needs substantial improvements before use in training.

	---

	## 1. Statistics Overview

	### Original Data (tool_examples.jsonl)

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Total Examples \| 1,000 \|
	\| Unique Prompts \| 133 \|
	\| Average Duplication \| 7.52x \|
	\| Unique Tool Sequences \| 5 \|
	\| Examples with Issues \| ~107 (10.7%) \|

	### New Data (tool_examples_v2.jsonl)

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Total Examples \| 500 \|
	\| File Size \| 1.9 MB \|
	\| Tools per Example \| 5 (static definition) \|

	### Tool Call Distribution (Original)

	\| Tool \| Call Count \|
	\|------\|------------\|
	\| Bash \| 200 \|
	\| FileRead \| 200 \|
	\| FileWrite \| 200 \|
	\| WebSearch \| 200 \|
	\| Grep \| 200 \|

	All examples have exactly one tool call - no multi-step chains exist.

	---

	## 2. Prompt Diversity Analysis (Original Data)

	### Prompt Categories

	\| Category \| Count \| Percentage \|
	\|----------\|-------\|------------\|
	\| Python \| 207 \| 20.7% \|
	\| React \| 149 \| 14.9% \|
	\| File Read \| 134 \| 13.4% \|
	\| File Write \| 119 \| 11.9% \|
	\| Other \| 114 \| 11.4% \|
	\| Run Command \| 80 \| 8.0% \|
	\| Docker/K8s \| 67 \| 6.7% \|
	\| Search \| 50 \| 5.0% \|
	\| Git \| 40 \| 4.0% \|
	\| Testing \| 31 \| 3.1% \|
	\| Package Management \| 9 \| 0.9% \|

	### Most Duplicated Prompts

	\| Prompt \| Occurrences \|
	\|--------\|-------------\|
	\| "Run the tests with pytest" \| 40 \|
	\| "Run npm install to install dependencies" \| 40 \|
	\| "Write a simple React component to src/components/Button.jsx" \| 67 \|

	---

	## 3. Tool Usage Breakdown

	### Tool Definitions

	All 1,000 original examples use identical tool definitions with 5 tools:
	- `Bash` - Execute bash commands
	- `FileRead` - Read file contents
	- `FileWrite` - Create/overwrite files
	- `WebSearch` - Search the web
	- `Grep` - Search for patterns in files

	### Tool Call Issues Found (Original Data)

	#### Wrong Search Patterns (105 instances / 10.5%)

	The `WebSearch` tool frequently uses queries that don't match the user's question:

	\| User Question \| Actual Search Query \|
	\|--------------\|---------------------\|
	\| "How do I use async/await in Python?" \| "AWS Lambda cold start optimization" \|
	\| "How do I use React hooks properly?" \| "SQL join types explained" \|
	\| "What's the difference between Docker and Kubernetes?" \| "Git rebase vs merge" \|
	\| "How do I use React hooks properly?" \| "TypeScript generics tutorial" \|
	\| "What's the difference between Docker and Kubernetes?" \| "TypeScript generics tutorial" \|

	#### Wrong File Paths (2 instances)

	The `FileWrite` tool sometimes writes to incorrect file types:

	\| User Request \| Written Path \|
	\|-------------\|--------------\|
	\| "Create a src/components/Header.jsx file" \| Written to `config.json` \|
	\| "Create a src/middleware.py file with settings" \| Written to `config.yaml` \|

	#### Pattern/File Type Mismatches (Grep)

	The `Grep` tool sometimes searches with mismatched patterns:

	\| Pattern \| File Pattern \| Issue \|
	\|---------\|-------------\|-------\|
	\| `class ` \| `*.ts` \| Python pattern in TypeScript files \|
	\| `SELECT ` \| `*.js` \| SQL pattern in JavaScript files \|
	\| `TODO` \| `*.md` \| Searching TODO in markdown files \|

	---

	## 4. Data Quality Issues

	### Critical Issues

	1. No Multi-Step Tool Chains
	- All 1,000 examples use exactly one tool call
	- Real coding tasks typically require 2-5+ tool calls
	- Example: "Read file → Find pattern → Search docs → Write fix"

	2. Search Query Mismatches
	- 10.5% of WebSearch calls have irrelevant queries
	- Indicates the generator script has logic errors

	3. Heavy Prompt Duplication
	- 133 unique prompts duplicated to 1,000 examples
	- "Write a simple React component" appears 67 times
	- This creates overfitting to specific prompts

	4. Identical Tool Definitions
	- All examples use the same 5 tools with identical descriptions
	- No variation in tool schemas or parameter structures

	### Moderate Issues

	5. File Path Hallucination
	- Tool calls reference files that don't exist in actual codebase
	- Example: asking for `tests/test_main.py` but reading `src/app.js`

	6. Response Fabrication
	- Assistant responses sometimes claim to show content that wasn't actually read
	- Example: "Here's the README.md" when README.md wasn't the file requested

	---

	## 5. Recommendations for Improvement

	### Immediate Actions (Completed)

	1. ✅ Regenerated Data
	```
	Generated 500 new examples in training-data_v2/tool_examples.jsonl
	```

	### Script Fixes Needed

	The generator script (`scripts/generate_tool_data.py`) needs:

	1. Fix `TOOL_CALL_PAIRS` mapping - queries don't match questions
	2. Fix `FILE_PATTERNS` - wrong file types for requested content
	3. Add multi-step chain generation
	4. Add prompt variation templates
	5. Add validation to check query/content relevance

	### Future Improvements

	1. Add Multi-Step Examples
	- Real tasks require reading files, searching, editing
	- Generate chains of 2-4 tool calls per example

	2. Increase Prompt Diversity
	- Target 500+ unique prompts instead of duplicating
	- Use template variations and paraphrasing

	3. Vary Tool Definitions
	- Different tools per example
	- Add tool variations (e.g., different Bash commands)

	---

	## 6. Conclusion

	The original `tool_examples.jsonl` data is NOT suitable for production training without significant improvements:

	- ~10% of examples have incorrect tool parameters
	- Heavy duplication leads to overfitting
	- No multi-step chains fail to represent real coding workflows
	- Synthetic generation errors are systematic

	Action Completed: Generated 500 new examples via the project's generator script.

	Remaining Work: Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration.

	---

	## Appendix: Quick Stats

	### Original Data
	```
	Total examples: 1,000
	Unique prompts: 133
	Tool call issues: 107 (10.7%)
	Multi-tool chains: 0 (0%)
	Identical tool defs: 100%
	Average duplication: 7.52x
	```

	### New Data (Generated)
	```
	Total examples: 500
	File size: 1.9 MB
	Location: training-data_v2/tool_examples.jsonl
	```