Stack-2-9-finetuned / docs /TOOL_DATA_ANALYSIS.md

walidsobhie-code

feat: add inference API, quickstart guide, roadmap, and combined tool data

b03a8a0 25 days ago

preview code

raw

history blame contribute delete

6.91 kB

Tool Calling Training Data Analysis

Generated: 2026-04-06
Files Analyzed:

training-data/tool_examples.jsonl (original)
training-data_v2/tool_examples.jsonl (regenerated)

Executive Summary

The original tool calling training data had significant quality issues that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors.

Key Findings on Original Data:

❌ 10.5% of tool calls use incorrect parameters (mismatched search queries, wrong files)
❌ Heavy prompt duplication (7.5x average)
❌ No multi-step tool chains (only 1 tool per example)
❌ All examples use identical tool definitions

Action Taken: Generated 500 new examples using the project's generator script.

Recommendation: The original data needs substantial improvements before use in training.

1. Statistics Overview

Original Data (tool_examples.jsonl)

Metric	Value
Total Examples	1,000
Unique Prompts	133
Average Duplication	7.52x
Unique Tool Sequences	5
Examples with Issues	~107 (10.7%)

New Data (tool_examples_v2.jsonl)

Metric	Value
Total Examples	500
File Size	1.9 MB
Tools per Example	5 (static definition)

Tool Call Distribution (Original)

Tool	Call Count
Bash	200
FileRead	200
FileWrite	200
WebSearch	200
Grep	200

All examples have exactly one tool call - no multi-step chains exist.

2. Prompt Diversity Analysis (Original Data)

Prompt Categories

Category	Count	Percentage
Python	207	20.7%
React	149	14.9%
File Read	134	13.4%
File Write	119	11.9%
Other	114	11.4%
Run Command	80	8.0%
Docker/K8s	67	6.7%
Search	50	5.0%
Git	40	4.0%
Testing	31	3.1%
Package Management	9	0.9%

Most Duplicated Prompts

Prompt	Occurrences
"Run the tests with pytest"	40
"Run npm install to install dependencies"	40
"Write a simple React component to src/components/Button.jsx"	67

3. Tool Usage Breakdown

Tool Definitions

All 1,000 original examples use identical tool definitions with 5 tools:

Bash - Execute bash commands
FileRead - Read file contents
FileWrite - Create/overwrite files
WebSearch - Search the web
Grep - Search for patterns in files

Tool Call Issues Found (Original Data)

Wrong Search Patterns (105 instances / 10.5%)

The WebSearch tool frequently uses queries that don't match the user's question:

User Question	Actual Search Query
"How do I use async/await in Python?"	"AWS Lambda cold start optimization"
"How do I use React hooks properly?"	"SQL join types explained"
"What's the difference between Docker and Kubernetes?"	"Git rebase vs merge"
"How do I use React hooks properly?"	"TypeScript generics tutorial"
"What's the difference between Docker and Kubernetes?"	"TypeScript generics tutorial"

Wrong File Paths (2 instances)

The FileWrite tool sometimes writes to incorrect file types:

User Request	Written Path
"Create a src/components/Header.jsx file"	Written to `config.json`
"Create a src/middleware.py file with settings"	Written to `config.yaml`

Pattern/File Type Mismatches (Grep)

The Grep tool sometimes searches with mismatched patterns:

Pattern	File Pattern	Issue
`class`	`*.ts`	Python pattern in TypeScript files
`SELECT`	`*.js`	SQL pattern in JavaScript files
`TODO`	`*.md`	Searching TODO in markdown files

4. Data Quality Issues

Critical Issues

No Multi-Step Tool Chains
- All 1,000 examples use exactly one tool call
- Real coding tasks typically require 2-5+ tool calls
- Example: "Read file → Find pattern → Search docs → Write fix"
Search Query Mismatches
- 10.5% of WebSearch calls have irrelevant queries
- Indicates the generator script has logic errors
Heavy Prompt Duplication
- 133 unique prompts duplicated to 1,000 examples
- "Write a simple React component" appears 67 times
- This creates overfitting to specific prompts
Identical Tool Definitions
- All examples use the same 5 tools with identical descriptions
- No variation in tool schemas or parameter structures

Moderate Issues

File Path Hallucination
- Tool calls reference files that don't exist in actual codebase
- Example: asking for tests/test_main.py but reading src/app.js
Response Fabrication
- Assistant responses sometimes claim to show content that wasn't actually read
- Example: "Here's the README.md" when README.md wasn't the file requested

5. Recommendations for Improvement

Immediate Actions (Completed)

✅ Regenerated Data

Generated 500 new examples in training-data_v2/tool_examples.jsonl

Script Fixes Needed

The generator script (scripts/generate_tool_data.py) needs:

Fix TOOL_CALL_PAIRS mapping - queries don't match questions
Fix FILE_PATTERNS - wrong file types for requested content
Add multi-step chain generation
Add prompt variation templates
Add validation to check query/content relevance

Future Improvements

Add Multi-Step Examples
- Real tasks require reading files, searching, editing
- Generate chains of 2-4 tool calls per example
Increase Prompt Diversity
- Target 500+ unique prompts instead of duplicating
- Use template variations and paraphrasing
Vary Tool Definitions
- Different tools per example
- Add tool variations (e.g., different Bash commands)

6. Conclusion

The original tool_examples.jsonl data is NOT suitable for production training without significant improvements:

~10% of examples have incorrect tool parameters
Heavy duplication leads to overfitting
No multi-step chains fail to represent real coding workflows
Synthetic generation errors are systematic

Action Completed: Generated 500 new examples via the project's generator script.

Remaining Work: Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration.

Appendix: Quick Stats

Original Data

Total examples:        1,000
Unique prompts:        133
Tool call issues:      107 (10.7%)
Multi-tool chains:     0 (0%)
Identical tool defs:   100%
Average duplication:   7.52x

New Data (Generated)

Total examples:        500
File size:             1.9 MB
Location:              training-data_v2/tool_examples.jsonl