# Stack 2.9 Training Data Format

This document describes the format and structure of training data for Stack 2.9.

## Overview

Training data is stored in JSONL format (JSON Lines), where each line is a valid JSON object representing a single training example.
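
As an illustration of the one-object-per-line layout, here is a minimal loading sketch. The helper name `iter_examples` is hypothetical and is not part of the repository's scripts; skipping blank lines is an assumption of this sketch, not documented behavior.

```python
import json
from typing import Iterable, Iterator


def iter_examples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed training example per non-blank line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped rather than rejected
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the offending line so a broken record is easy to find.
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc


# Any iterable of strings works, including an open file handle.
sample = [
    '{"messages": [{"role": "user", "content": "hi"}], "tools": []}',
    "",
    '{"messages": [{"role": "user", "content": "bye"}], "tools": []}',
]
examples = list(iter_examples(sample))
print(len(examples))
```

To read a real file, pass an open handle instead, for example `with open("training-data/tool_examples.jsonl", encoding="utf-8") as f: examples = list(iter_examples(f))`.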

## File Structure

```
training-data/
├── tool_examples.jsonl           # Original examples (1000)
├── augmented_tool_examples.jsonl # Augmented examples (2-5x)
└── scaled/                       # Processed datasets
    ├── train.jsonl
    └── val.jsonl
```

## Example Format

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant that can use tools to help users solve problems."
    },
    {
      "role": "user",
      "content": "Can you show me the README.md file?"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_$1180",
          "type": "function",
          "function": {
            "name": "FileRead",
            "arguments": "{\"path\": \"README.md\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "Successfully read file: README.md\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n```",
      "tool_call_id": "call_$1180",
      "name": "FileRead"
    },
    {
      "role": "assistant",
      "content": "Here's the README.md:\n\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n```"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "Bash",
        "description": "Execute bash commands in the terminal.",
        "parameters": {
          "type": "object",
          "properties": {
            "command": {"type": "string", "description": "The bash command to execute"},
            "timeout": {"type": "integer", "description": "Timeout in seconds"}
          },
          "required": ["command"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "FileRead",
        "description": "Read the contents of a file.",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {"type": "string", "description": "Path to the file to read"},
            "offset": {"type": "integer", "description": "Line number to start from"},
            "limit": {"type": "integer", "description": "Max lines to read"}
          },
          "required": ["path"]
        }
      }
    }
  ]
}
```

## Field Definitions

### Top-Level Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `messages` | array | Yes | Array of message objects |
| `tools` | array | Yes | Available tools/functions |
| `source` | string | No | Data source identifier |

### Message Object

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `role` | string | Yes | One of: system, user, assistant, tool |
| `content` | string or null | Yes* | Message content (null when `tool_calls` is present) |
| `tool_calls` | array | No* | Tool call requests |
| `tool_call_id` | string | No* | ID of the tool call this message responds to |
| `name` | string | No* | Tool name (for tool messages) |

\*`content` is required unless `tool_calls` is present. `tool_call_id` and `name` are required when `role` is `"tool"`.

### Tool Call Object

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | Unique call identifier |
| `type` | string | Yes | Always "function" |
| `function` | object | Yes | Function name and arguments |
| `function.name` | string | Yes | Tool/function name |
| `function.arguments` | object/string | Yes | JSON arguments |
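
Because `function.arguments` may be either a JSON object or a JSON-encoded string, consumers typically normalize it before use. The following is a minimal sketch of that step; the helper name `parse_arguments` is hypothetical and is not taken from the repository.

```python
import json


def parse_arguments(arguments):
    """Return tool-call arguments as a dict, whether stored as an object or a JSON string."""
    if isinstance(arguments, dict):
        return arguments
    if isinstance(arguments, str):
        parsed = json.loads(arguments)
        if not isinstance(parsed, dict):
            raise ValueError("arguments must decode to a JSON object")
        return parsed
    raise TypeError(f"unsupported arguments type: {type(arguments).__name__}")


# Illustrative tool call; the id is made up for this sketch.
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "FileRead", "arguments": "{\"path\": \"README.md\"}"},
}
args = parse_arguments(tool_call["function"]["arguments"])
print(args["path"])
```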

## Data Sources

- **random_synthetic**: Auto-generated with random parameters
- **synthetic_template**: Template-based synthetic examples
- **augmented_\***: Augmented from other sources
- **original**: Human-curated examples

## Augmentation

The augmentation script applies these transformations:

1. **Paraphrasing**: Reword user prompts (70% chance)
2. **Difficulty scaling**: Add complexity modifiers
3. **Parameter variation**: Change file paths, commands
4. **Filler words**: Add "please", "thanks" (30% chance)
5. **Edge cases**: Empty input, multi-step, error handling
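
The probabilistic transformations above (paraphrasing and filler words) could be sketched roughly as follows. This is not the code in `scripts/augment_training_data.py`: the phrase table, the function name `augment_example`, and the `augmented_` + source naming are assumptions made for illustration.

```python
import copy
import random

# Hypothetical phrase table; the real script's paraphrasing rules are not documented here.
PARAPHRASES = {"Can you show me": "Could you display"}


def augment_example(example, rng):
    """Return an augmented copy of `example`; the input is left untouched."""
    augmented = copy.deepcopy(example)
    for message in augmented["messages"]:
        if message["role"] != "user" or not message.get("content"):
            continue
        text = message["content"]
        if rng.random() < 0.7:  # paraphrasing, 70% chance
            for old, new in PARAPHRASES.items():
                text = text.replace(old, new)
        if rng.random() < 0.3:  # filler words, 30% chance
            text = text + " Thanks!"
        message["content"] = text
    # Assumed naming: prefix the original source identifier with "augmented_".
    augmented["source"] = "augmented_" + example.get("source", "original")
    return augmented


example = {
    "messages": [{"role": "user", "content": "Can you show me the README.md file?"}],
    "tools": [],
}
result = augment_example(example, random.Random(0))
print(result["messages"][0]["content"])
```

Passing a seeded `random.Random` instance keeps a run reproducible, which helps when comparing augmented datasets.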

Run augmentation:
```bash
python scripts/augment_training_data.py \
  --input training-data/tool_examples.jsonl \
  --output training-data/augmented.jsonl \
  --multiplier 3
```

## Validation

Run validation to check data quality:
```bash
python scripts/validate_training_data.py --input training-data/tool_examples.jsonl
```

Checks include:
- Required fields present
- Valid JSON syntax
- Message role ordering
- Tool call structure
- No empty entries
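
A rough sketch of how checks like these might be expressed for a single parsed example is shown below. It is not the logic of `scripts/validate_training_data.py`; the function name `validate_example` and the rule that a tool message must reference an earlier tool call ID are assumptions based on the field definitions in this document.

```python
VALID_ROLES = {"system", "user", "assistant", "tool"}


def validate_example(example):
    """Return a list of problems found in one parsed example (empty if none)."""
    errors = []
    for field in ("messages", "tools"):
        if not isinstance(example.get(field), list):
            errors.append(f"{field}: missing or not an array")
    messages = example.get("messages")
    if not isinstance(messages, list):
        return errors
    if not messages:
        errors.append("messages: empty")

    seen_call_ids = set()
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            errors.append(f"messages[{i}]: not an object")
            continue
        role = message.get("role")
        if role not in VALID_ROLES:
            errors.append(f"messages[{i}]: invalid role {role!r}")
        calls = message.get("tool_calls") or []
        if message.get("content") is None and not calls:
            errors.append(f"messages[{i}]: content required without tool_calls")
        for call in calls:
            function = call.get("function") if isinstance(call, dict) else None
            if (
                not isinstance(function, dict)
                or not call.get("id")
                or call.get("type") != "function"
                or "name" not in function
                or "arguments" not in function
            ):
                errors.append(f"messages[{i}]: malformed tool call")
            else:
                seen_call_ids.add(call["id"])
        if role == "tool":
            if not message.get("name"):
                errors.append(f"messages[{i}]: tool message needs name")
            # Assumed ordering rule: a tool result must follow the call it answers.
            if message.get("tool_call_id") not in seen_call_ids:
                errors.append(f"messages[{i}]: tool_call_id matches no earlier tool call")
    return errors


bad = {"messages": [{"role": "assistant", "content": None}], "tools": []}
print(validate_example(bad))
```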

## Converting to Training Format

Before training, combine the datasets and convert them to ChatML format:
```bash
# Example conversion
python scripts/combine_datasets.py \
  --input training-data/augmented.jsonl \
  --output data/final/train.jsonl \
  --format chatml
```
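
For orientation, ChatML wraps each message in `<|im_start|>` and `<|im_end|>` markers. The sketch below shows one way a `messages` array might be rendered; it does not reproduce what `scripts/combine_datasets.py` actually emits. The function name `to_chatml` is hypothetical, and serializing `tool_calls` as JSON when `content` is null is an assumption of this sketch.

```python
import json


def to_chatml(messages):
    """Render a list of message objects as a single ChatML string."""
    parts = []
    for message in messages:
        content = message.get("content")
        if content is None:
            # Assumption: tool calls are embedded as JSON when there is no text content.
            content = json.dumps(message.get("tool_calls", []))
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>")
    return "\n".join(parts)


rendered = to_chatml(
    [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "hi"},
    ]
)
print(rendered)
```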