# Stack 2.9 Training Data Format
This document describes the format and structure of training data for Stack 2.9.
## Overview
Training data is stored in JSONL format (JSON Lines), where each line is a valid JSON object representing a single training example.
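As a minimal sketch of reading this format, the helper below parses JSONL text one line at a time. The name `parse_jsonl` is hypothetical and not part of the Stack 2.9 scripts, and skipping blank lines is an assumption about how lenient a reader should be.

```python
import json
from typing import Iterator


def parse_jsonl(text: str) -> Iterator[dict]:
    """Yield one dict per non-blank line of JSONL text."""
    # Split on "\n" only: str.splitlines() would also split on characters
    # such as U+2028, which may legally appear inside JSON strings.
    for line_number, line in enumerate(text.split("\n"), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored rather than rejected
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield record


sample = (
    '{"messages": [], "tools": []}\n'
    "\n"
    '{"messages": [], "tools": [], "source": "original"}\n'
)
records = list(parse_jsonl(sample))
print(len(records))  # 2
```

To read a file, pass its contents, for example `parse_jsonl(Path("training-data/tool_examples.jsonl").read_text(encoding="utf-8"))`.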
## File Structure
```
training-data/
├── tool_examples.jsonl            # Original examples (1000)
├── augmented_tool_examples.jsonl  # Augmented examples (2-5x)
└── scaled/                        # Processed datasets
    ├── train.jsonl
    └── val.jsonl
```
## Example Format
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant that can use tools to help users solve problems."
    },
    {
      "role": "user",
      "content": "Can you show me the README.md file?"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_$1180",
          "type": "function",
          "function": {
            "name": "FileRead",
            "arguments": "{\"path\": \"README.md\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "Successfully read file: README.md\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n```",
      "tool_call_id": "call_$1180",
      "name": "FileRead"
    },
    {
      "role": "assistant",
      "content": "Here's the README.md:\n\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n```"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "Bash",
        "description": "Execute bash commands in the terminal.",
        "parameters": {
          "type": "object",
          "properties": {
            "command": {"type": "string", "description": "The bash command to execute"},
            "timeout": {"type": "integer", "description": "Timeout in seconds"}
          },
          "required": ["command"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "FileRead",
        "description": "Read the contents of a file.",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {"type": "string", "description": "Path to the file to read"},
            "offset": {"type": "integer", "description": "Line number to start from"},
            "limit": {"type": "integer", "description": "Max lines to read"}
          },
          "required": ["path"]
        }
      }
    }
  ]
}
```
## Field Definitions
### Top-Level Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `messages` | array | Yes | Array of message objects |
| `tools` | array | Yes | Available tools/functions |
| `source` | string | No | Data source identifier |
### Message Object
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `role` | string | Yes | One of: system, user, assistant, tool |
| `content` | string or null | Yes* | Message content (`null` when `tool_calls` is present) |
| `tool_calls` | array | No* | Tool calls requested by the assistant |
| `tool_call_id` | string | No* | ID of the tool call that this tool message answers |
| `name` | string | No* | Tool name (for tool messages) |
*`content` is required unless `tool_calls` is present. `tool_call_id` and `name` are required when `role` is `"tool"`.
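The field rules above can be sketched as a small per-message check. The function `message_errors` is a hypothetical helper, not the project's validator, and it covers only the rules stated in the table and footnote.

```python
from typing import List

VALID_ROLES = {"system", "user", "assistant", "tool"}


def message_errors(message: dict) -> List[str]:
    """Return the rule violations for one message object (empty if none)."""
    errors = []
    role = message.get("role")
    if role not in VALID_ROLES:
        errors.append(f"invalid role: {role!r}")
    # content may be null or absent only when tool_calls is present
    if message.get("content") is None and not message.get("tool_calls"):
        errors.append("content is required unless tool_calls is present")
    # tool messages must identify the call they answer and the tool's name
    if role == "tool":
        for field in ("tool_call_id", "name"):
            if not message.get(field):
                errors.append(f"{field} is required for tool messages")
    return errors


print(message_errors({"role": "assistant", "content": None,
                      "tool_calls": [{"id": "call_1"}]}))  # []
print(message_errors({"role": "tool", "content": "ok"}))
```

The second call reports both missing fields, `tool_call_id` and `name`.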
### Tool Call Object
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | Unique call identifier |
| `type` | string | Yes | Always "function" |
| `function` | object | Yes | Function name and arguments |
| `function.name` | string | Yes | Tool/function name |
| `function.arguments` | object or string | Yes | Arguments as a JSON object or a JSON-encoded string (the example above uses a string) |
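Because `function.arguments` may be either an object or a JSON-encoded string, consumers usually normalize it first. The sketch below shows one way to do that; `load_arguments` is a hypothetical name, and rejecting non-object payloads is an assumption.

```python
import json
from typing import Any, Dict, Union


def load_arguments(arguments: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return tool-call arguments as a dict, decoding a JSON string if needed."""
    if isinstance(arguments, dict):
        return arguments
    parsed = json.loads(arguments)
    if not isinstance(parsed, dict):
        # assumption: arguments must always decode to a JSON object
        raise ValueError("function.arguments must decode to a JSON object")
    return parsed


tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "FileRead", "arguments": "{\"path\": \"README.md\"}"},
}
print(load_arguments(tool_call["function"]["arguments"]))  # {'path': 'README.md'}
```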
## Data Sources
- **random_synthetic**: Auto-generated with random parameters
- **synthetic_template**: Template-based synthetic examples
- **augmented_\***: Augmented from other sources
- **original**: Human-curated examples
## Augmentation
The augmentation script applies these transformations:
1. **Paraphrasing**: Reword user prompts (70% chance)
2. **Difficulty scaling**: Add complexity modifiers
3. **Parameter variation**: Change file paths, commands
4. **Filler words**: Add "please", "thanks" (30% chance)
5. **Edge cases**: Empty input, multi-step, error handling
Run augmentation:
```bash
python scripts/augment_training_data.py \
  --input training-data/tool_examples.jsonl \
  --output training-data/augmented.jsonl \
  --multiplier 3
```
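To illustrate how a probabilistic transformation such as the filler-word step might work, here is a minimal sketch. It is not the code in `scripts/augment_training_data.py`: the function `add_filler`, the appended phrase, and the `augmented_` source prefix are all assumptions made for illustration.

```python
import copy
import random


def add_filler(example: dict, rng: random.Random, probability: float = 0.3) -> dict:
    """Return a copy of the example whose user prompts may gain a polite suffix."""
    augmented = copy.deepcopy(example)  # never mutate the original example
    for message in augmented["messages"]:
        is_user_prompt = message.get("role") == "user" and message.get("content")
        if is_user_prompt and rng.random() < probability:
            message["content"] = message["content"].rstrip() + " Thanks!"
    # assumption: augmented examples record where they came from
    augmented["source"] = "augmented_" + example.get("source", "original")
    return augmented


example = {"messages": [{"role": "user", "content": "List the files."}], "tools": []}
print(add_filler(example, random.Random(0), probability=1.0)["messages"][0]["content"])
# List the files. Thanks!
```

Passing a seeded `random.Random` keeps augmentation runs reproducible.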
## Validation
Run validation to check data quality:
```bash
python scripts/validate_training_data.py --input training-data/tool_examples.jsonl
```
Checks include:
- Required fields present
- Valid JSON syntax
- Message role ordering
- Tool call structure
- No empty entries
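As a rough sketch of the role-ordering and tool-call checks, the function below verifies that every tool message answers a tool call issued earlier. The exact ordering rules enforced by `scripts/validate_training_data.py` are not specified here, so the rules in this sketch (system message first, every call answered) are assumptions, and `ordering_errors` is a hypothetical name.

```python
from typing import List


def ordering_errors(messages: List[dict]) -> List[str]:
    """Return ordering problems found in one conversation (empty if none)."""
    errors = []
    pending = set()  # tool call ids issued by the assistant but not yet answered
    for index, message in enumerate(messages):
        role = message.get("role")
        if role == "system" and index != 0:
            errors.append(f"message {index}: system message must come first")
        elif role == "assistant":
            for call in message.get("tool_calls") or []:
                pending.add(call.get("id"))
        elif role == "tool":
            call_id = message.get("tool_call_id")
            if call_id in pending:
                pending.discard(call_id)
            else:
                errors.append(
                    f"message {index}: tool_call_id {call_id!r} matches no pending tool call"
                )
    for call_id in sorted(pending, key=str):
        errors.append(f"tool call {call_id!r} never received a tool response")
    return errors


conversation = [
    {"role": "user", "content": "Show me README.md"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "content": "...", "tool_call_id": "call_1", "name": "FileRead"},
    {"role": "assistant", "content": "Here it is."},
]
print(ordering_errors(conversation))  # []
```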
## Converting to Training Format
For training, convert the data to the target chat format (ChatML in this example):
```bash
# Example conversion
python scripts/combine_datasets.py \
  --input training-data/augmented.jsonl \
  --output data/final/train.jsonl \
  --format chatml
```
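For orientation, ChatML wraps each message in `<|im_start|>` and `<|im_end|>` markers. The sketch below renders a message list that way. It is not the logic of `scripts/combine_datasets.py`: `to_chatml` is a hypothetical name, and serializing `tool_calls` as JSON when `content` is null is an assumption, since ChatML itself does not define how tool calls are encoded.

```python
import json
from typing import List


def to_chatml(messages: List[dict]) -> str:
    """Render messages as ChatML text, one delimited block per message."""
    parts = []
    for message in messages:
        content = message.get("content")
        if content is None:
            # assumption: tool calls are embedded as a JSON array in the content slot
            content = json.dumps(message.get("tool_calls", []))
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>")
    return "\n".join(parts)


print(to_chatml([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "hi"},
]))
```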