walidsobhie-code committed on
Commit
b03a8a0
1 Parent(s): 183b3b6

feat: add inference API, quickstart guide, roadmap, and combined tool data


Created:
- inference_api.py: FastAPI server with /generate, /chat, /health endpoints
- requirements_api.txt: FastAPI dependencies
- docs/API.md: API documentation
- docs/QUICKSTART.md: 5-minute quickstart guide
- README_QUICKSTART.md: Simplified landing page
- docs/ROADMAP.md: Project roadmap (current → long-term)
- CONTRIBUTING.md: Contributor guidelines
- docs/TOOL_DATA_ANALYSIS.md: Tool data analysis report
- training-data/tool_examples_combined.jsonl: 1500 tool_calling examples (balanced across 5 tools)

Total tool calling training data: 1500 examples
- Grep: 300
- FileRead: 300
- WebSearch: 300
- Bash: 300
- FileWrite: 300

CONTRIBUTING.md CHANGED
@@ -1,26 +1,171 @@
1
- # Contributing
2
 
3
- We welcome contributions! Here's how:
4
 
5
- ## Getting Started
6
- 1. Fork the repository
7
- 2. Clone your fork: `git clone https://github.com/YOUR_USER/$repo.git`
8
- 3. Create a virtual environment
9
- 4. Install dependencies: `pip install -r requirements.txt`
10
 
11
- ## Making Changes
12
- 1. Create a branch: `git checkout -b feature/your-feature-name`
13
- 2. Make your changes
14
- 3. Add tests
15
- 4. Run tests: `pytest tests/`
16
- 5. Commit: `git commit -m "Add your feature"`
17
- 6. Push: `git push origin feature/your-feature-name`
18
- 7. Open a Pull Request
19
 
20
  ## Code Style
21
- - Follow PEP 8
22
- - Add docstrings
23
- - Include type hints where possible
24
 
25
  ## Reporting Issues
26
- Open an issue with a clear description and example code.
1
+ # Contributing to Stack 2.9
2
 
3
+ > Last updated: April 2026
4
 
5
+ Thank you for your interest in contributing to Stack 2.9! This document outlines how you can help.
6
 
7
+ ## Project State
8
+
9
+ **Before contributing, understand where the project stands:**
10
+
11
+ | Area | Status | Notes |
12
+ |------|--------|-------|
13
+ | Basic code generation | ✅ Working | Main strength of the model |
14
+ | Tool calling | ⚠️ Not trained | Needs fine-tuning on tool patterns |
15
+ | Benchmark scores | ⚠️ Pending | Full evaluation not yet run |
16
+ | Self-evolution | 🔧 Incomplete | Components exist but not connected |
17
+ | Documentation | 🔧 In progress | Some areas need work |
18
+
19
+ ## Quick Start
20
+
21
+ ```bash
22
+ # 1. Fork the repository
23
+ # (git has no "fork" command — use the GitHub web UI, or: gh repo fork my-ai-stack/stack-2.9 --clone=false)
24
+
25
+ # 2. Clone your fork
26
+ git clone https://github.com/YOUR_USER/stack-2.9.git
27
+ cd stack-2.9
28
+
29
+ # 3. Create a virtual environment
30
+ python -m venv .venv
31
+ source .venv/bin/activate # Linux/Mac
32
+ # or .venv\Scripts\activate on Windows
33
+
34
+ # 4. Install dependencies
35
+ pip install -r requirements.txt
36
+ ```
37
+
38
+ ## What to Work On
39
+
40
+ ### High Priority
41
+
42
+ 1. **Evaluation** - Run full HumanEval/MBPP benchmarks
43
+ - See `stack/eval/run_proper_evaluation.py`
44
+ - Requires: Python, Ollama or API key
45
+
46
+ 2. **Tool calling tests** - Test and document tool usage
47
+ - Run `python stack.py -c "Your command here"`
48
+ - Report what works/doesn't in issues
49
+
50
+ 3. **Documentation** - Improve tool definitions, API docs
51
+ - Check `docs/TOOLS.md` for accuracy
52
+ - Update `stack/internal/ARCHITECTURE.md`
53
+
54
+ ### Medium Priority
55
+
56
+ 4. **Training scripts** - Improve fine-tuning pipeline
57
+ - See `stack/training/`
58
+ - ⚠️ Do NOT modify Kaggle notebook or training data generation
59
+
60
+ 5. **Deployment** - Fix deployment scripts
61
+ - See `stack/deploy/`, `runpod_deploy.sh`
62
+
63
+ ### Lower Priority
64
+
65
+ 6. **Pattern Memory** - Connect Observer → Learner → Memory → Trainer
66
+ 7. **Voice integration** - Test end-to-end voice pipeline
67
+ 8. **MCP support** - Improve Model Context Protocol integration
68
+
69
+ ## What NOT to Touch
70
+
71
+ ⚠️ **Do NOT modify without explicit approval:**
72
+
73
+ - `kaggle_train_stack29_v5.ipynb` - Kaggle training notebook
74
+ - `colab_train_stack29.ipynb` - Colab training notebook
75
+ - Training data generation scripts in `data/`
76
+ - Model weights in `base_model_qwen7b/`
77
+
78
+ These are core training components. Changes here affect the model itself.
79
 
80
  ## Code Style
81
+
82
+ - **Python:** Follow PEP 8, use type hints where possible
83
+ - **TypeScript:** Use strict mode, add JSDoc comments
84
+ - **Shell:** Use `shellcheck` on bash scripts
85
+ - **General:** Add docstrings to new functions, include examples
86
+
87
+ ### Pre-commit Checks
88
+
89
+ ```bash
90
+ # Run tests before submitting
91
+ pytest samples/ -v
92
+
93
+ # Check code formatting
94
+ ruff check src/ samples/ --fix
95
+ black src/ samples/
96
+ ```
97
+
98
+ ## Submitting a PR
99
+
100
+ ```bash
101
+ # Create a feature branch
102
+ git checkout -b feature/your-feature-name
103
+
104
+ # Make your changes
105
+ # ... edit files ...
106
+
107
+ # Run tests
108
+ pytest samples/ -v
109
+
110
+ # Commit with clear message
111
+ git commit -m "Add: description of what you changed"
112
+
113
+ # Push to your fork
114
+ git push origin feature/your-feature-name
115
+
116
+ # Open a Pull Request
117
+ # Fill in the PR template with:
118
+ # - What you changed
119
+ # - Why it's needed
120
+ # - Testing you did
121
+ # - Screenshots if applicable
122
+ ```
123
+
124
+ ## Pull Request Guidelines
125
+
126
+ 1. **Describe the change clearly** - What does this fix or add?
127
+ 2. **Link related issues** - Use "Fixes #123" if applicable
128
+ 3. **Include tests** - Add unit tests for new features
129
+ 4. **Update docs** - If you add a feature, document it
130
+ 5. **Be patient** - Reviewers may take a few days to respond
131
 
132
  ## Reporting Issues
133
+
134
+ When reporting bugs:
135
+
136
+ ```markdown
137
+ ## Description
138
+ Brief description of the issue
139
+
140
+ ## Steps to Reproduce
141
+ 1. Run `...`
142
+ 2. See error
143
+
144
+ ## Expected Behavior
145
+ What should happen
146
+
147
+ ## Actual Behavior
148
+ What actually happened
149
+
150
+ ## Environment
151
+ - OS:
152
+ - Python version:
153
+ - Provider: (ollama/openai/etc)
154
+ - Model:
155
+ ```
156
+
157
+ ## Communication
158
+
159
+ - **Issues:** GitHub Issues for bugs/features
160
+ - **Discussions:** GitHub Discussions for questions
161
+ - **Discord:** Link in README
162
+
163
+ ## Recognition
164
+
165
+ Contributors will be listed in:
166
+ - README.md "Acknowledgments" section
167
+ - CONTRIBUTORS file (if created)
168
+
169
+ ---
170
+
171
+ **Questions?** Open a GitHub Discussion or ask in Discord.
README_QUICKSTART.md ADDED
@@ -0,0 +1,151 @@
1
+ # Stack 2.9 — Quick Start
2
+
3
+ > **AI coding assistant powered by Qwen2.5-Coder-32B with Pattern Memory.**
4
+
5
+ ```bash
6
+ git clone https://github.com/my-ai-stack/stack-2.9.git
7
+ cd stack-2.9
8
+ pip install -r requirements.txt
9
+ cp .env.example .env
10
+ python stack.py
11
+ ```
12
+
13
+ That's it. Keep reading for details.
14
+
15
+ ---
16
+
17
+ ## Prerequisites
18
+
19
+ - **Python 3.10+**
20
+ - **GPU** (optional — runs on CPU via cloud providers too)
21
+ - **Git**
22
+
23
+ ---
24
+
25
+ ## Install & Run
26
+
27
+ ```bash
28
+ # Clone
29
+ git clone https://github.com/my-ai-stack/stack-2.9.git
30
+ cd stack-2.9
31
+
32
+ # Install
33
+ python3 -m venv venv && source venv/bin/activate
34
+ pip install -r requirements.txt
35
+
36
+ # Configure (pick a provider below, then edit .env)
37
+ cp .env.example .env
38
+
39
+ # Run!
40
+ python stack.py
41
+ ```
42
+
43
+ ---
44
+
45
+ ## Configure Your Model Provider
46
+
47
+ Edit `.env` with one of these:
48
+
49
+ ### Ollama (Local, Private) — Recommended
50
+ ```env
51
+ MODEL_PROVIDER=ollama
52
+ OLLAMA_MODEL=qwen2.5-coder:32b
53
+ ```
54
+ ```bash
55
+ # First: curl -fsSL https://ollama.ai/install.sh | sh && ollama pull qwen2.5-coder:32b
56
+ ```
57
+
58
+ ### Together AI (Cloud, Fast)
59
+ ```env
60
+ MODEL_PROVIDER=together
61
+ TOGETHER_API_KEY=tog-your-key-here
62
+ TOGETHER_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct
63
+ ```
64
+
65
+ ### OpenAI (GPT-4o)
66
+ ```env
67
+ MODEL_PROVIDER=openai
68
+ OPENAI_API_KEY=sk-your-key-here
69
+ OPENAI_MODEL=gpt-4o
70
+ ```
71
+
72
+ ### Anthropic (Claude)
73
+ ```env
74
+ MODEL_PROVIDER=anthropic
75
+ ANTHROPIC_API_KEY=sk-ant-your-key-here
76
+ ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
77
+ ```
78
+
79
+ ---
80
+
81
+ ## Usage
82
+
83
+ ### Interactive Chat
84
+ ```bash
85
+ python stack.py
86
+ ```
87
+
88
+ ### Single Query
89
+ ```bash
90
+ python stack.py -c "Write a Python function to reverse a string"
91
+ ```
92
+
93
+ ### Evaluate Model (GPU required)
94
+ ```bash
95
+ python evaluate_model.py --model-path ./output/merged --benchmark humaneval
96
+ ```
97
+
98
+ ### Deploy with Docker
99
+ ```bash
100
+ docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9
101
+ ```
102
+
103
+ ---
104
+
105
+ ## 5-Minute Overview
106
+
107
+ | Feature | Command |
108
+ |---------|---------|
109
+ | Start chatting | `python stack.py` |
110
+ | Ask one question | `python stack.py -c "your question"` |
111
+ | Run benchmarks | `python evaluate_model.py --model-path ./merged --benchmark both` |
112
+ | List patterns | `python stack.py --patterns list` |
113
+ | Deploy locally | `docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9` |
114
+
115
+ ---
116
+
117
+ ## Hardware Requirements
118
+
119
+ | Model | Minimum (4-bit) | Recommended |
120
+ |-------|---------|------------|
121
+ | 7B | RTX 3060 (6GB) | A100 40GB |
122
+ | 32B | RTX 3090 (24GB) | A100 80GB |
123
+
124
+ No GPU? Use Ollama on your machine or any cloud provider in `.env`.
125
+
126
+ ---
127
+
128
+ ## Key Links
129
+
130
+ - 📖 **Full docs:** [docs/QUICKSTART.md](docs/QUICKSTART.md)
131
+ - 🔧 **46 tools:** [TOOLS.md](TOOLS.md)
132
+ - 🧠 **Pattern memory:** [docs/pattern-moat.md](docs/pattern-moat.md)
133
+ - 🚀 **Training guide:** [docs/TRAINING_7B.md](docs/TRAINING_7B.md)
134
+ - 🐳 **Kubernetes:** [k8s/](k8s/)
135
+
136
+ ---
137
+
138
+ ## What's Inside
139
+
140
+ - **Qwen2.5-Coder-32B** — 32B parameter code-specialized model
141
+ - **Pattern Memory** — learns from successful interactions
142
+ - **46 built-in tools** — file ops, git, shell, search, memory, tasks
143
+ - **Multi-provider** — Ollama, OpenAI, Anthropic, Together AI, OpenRouter
144
+ - **128K context** — handles large codebases
145
+ - **Self-hosted** — full control, private
146
+ - **MCP support** — integrates with any Model Context Protocol server
147
+ - **Voice-ready** — Coqui XTTS for voice cloning
148
+
149
+ ---
150
+
151
+ *Built with ❤️ for developers who want an AI that grows with them.*
docs/API.md ADDED
@@ -0,0 +1,345 @@
1
+ # Stack 2.9 Inference API Documentation
2
+
3
+ REST API for code generation using the Stack 2.9 fine-tuned Qwen model.
4
+
5
+ ## Quick Start
6
+
7
+ ### 1. Install Dependencies
8
+
9
+ ```bash
10
+ pip install -r requirements_api.txt
11
+ pip install -r requirements.txt # Core dependencies (transformers, torch, etc.)
12
+ ```
13
+
14
+ ### 2. Set Model Path
15
+
16
+ ```bash
17
+ # Option A: Environment variable
18
+ export MODEL_PATH=/path/to/your/merged/model
19
+
20
+ # Option B: Direct parameter
21
+ MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000
22
+ ```
23
+
24
+ ### 3. Start the Server
25
+
26
+ ```bash
27
+ # Basic usage
28
+ uvicorn inference_api:app --host 0.0.0.0 --port 8000
29
+
30
+ # With auto-reload (development)
31
+ uvicorn inference_api:app --reload --port 8000
32
+
33
+ # Using Python directly
34
+ python inference_api.py
35
+ ```
36
+
37
+ ### 4. Verify It's Running
38
+
39
+ ```bash
40
+ curl http://localhost:8000/health
41
+ ```
42
+
43
+ Expected response:
44
+ ```json
45
+ {
46
+ "status": "healthy",
47
+ "model_loaded": true,
48
+ "model_path": "base_model_qwen7b",
49
+ "device": "cuda",
50
+ "cuda_available": true
51
+ }
52
+ ```
53
+
54
+ ---
55
+
56
+ ## API Endpoints
57
+
58
+ ### `GET /health`
59
+
60
+ Health check endpoint to verify API and model status.
61
+
62
+ **Response:**
63
+ ```json
64
+ {
65
+ "status": "healthy",
66
+ "model_loaded": true,
67
+ "model_path": "/path/to/model",
68
+ "device": "cuda",
69
+ "cuda_available": true
70
+ }
71
+ ```
72
+
73
+ ---
74
+
75
+ ### `GET /model-info`
76
+
77
+ Get information about the currently loaded model.
78
+
79
+ **Response:**
80
+ ```json
81
+ {
82
+ "model_path": "/path/to/model",
83
+ "device": "cuda:0",
84
+ "dtype": "torch.float16"
85
+ }
86
+ ```
87
+
88
+ ---
89
+
90
+ ### `POST /generate`
91
+
92
+ Generate code completion for a prompt.
93
+
94
+ **Request Body:**
95
+ ```json
96
+ {
97
+ "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
98
+ "max_tokens": 128,
99
+ "temperature": 0.2,
100
+ "top_p": 0.95,
101
+ "do_sample": true,
102
+ "repetition_penalty": 1.1,
103
+ "num_return_sequences": 1
104
+ }
105
+ ```
106
+
107
+ **Parameters:**
108
+ | Parameter | Type | Default | Range | Description |
109
+ |-----------|------|---------|-------|-------------|
110
+ | `prompt` | string | required | - | Input prompt to complete |
111
+ | `max_tokens` | int | 512 | 1-4096 | Maximum tokens to generate |
112
+ | `temperature` | float | 0.2 | 0.0-2.0 | Sampling temperature (higher = more creative) |
113
+ | `top_p` | float | 0.95 | 0.0-1.0 | Nucleus sampling threshold |
114
+ | `do_sample` | bool | true | - | Whether to use sampling vs greedy |
115
+ | `repetition_penalty` | float | 1.1 | 1.0-2.0 | Penalize repeated tokens |
116
+ | `num_return_sequences` | int | 1 | 1-10 | Number of sequences to generate |
117
+
118
+ **Response:**
119
+ ```json
120
+ {
121
+ "generated_text": " seen = {}\n for i, num in enumerate(nums):\n complement = target - num\n if complement in seen:\n return [seen[complement], i]\n seen[num] = i\n return []",
122
+ "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
123
+ "model": "base_model_qwen7b",
124
+ "num_tokens": 45,
125
+ "finish_reason": "stop"
126
+ }
127
+ ```
128
+
129
+ **Example with curl:**
130
+ ```bash
131
+ curl -X POST http://localhost:8000/generate \
132
+ -H "Content-Type: application/json" \
133
+ -d '{
134
+ "prompt": "def fibonacci(n):\n \"\"\"Return first n Fibonacci numbers.\"\"\"\n",
135
+ "max_tokens": 100,
136
+ "temperature": 0.2
137
+ }'
138
+ ```
139
+
140
+ ---
141
+
142
+ ### `POST /chat`
143
+
144
+ Conversational interface for multi-turn interactions.
145
+
146
+ **Request Body:**
147
+ ```json
148
+ {
149
+ "messages": [
150
+ {"role": "user", "content": "Write a function to reverse a string in Python"},
151
+ {"role": "assistant", "content": "def reverse_string(s):\n return s[::-1]"},
152
+ {"role": "user", "content": "Make it recursive instead"}
153
+ ],
154
+ "max_tokens": 128,
155
+ "temperature": 0.2,
156
+ "top_p": 0.95
157
+ }
158
+ ```
159
+
160
+ **Message Roles:**
161
+ - `user` - User's message
162
+ - `assistant` - Model's previous response (for conversation history)
163
+
164
+ **Response:**
165
+ ```json
166
+ {
167
+ "message": {
168
+ "role": "assistant",
169
+ "content": "def reverse_string(s):\n if len(s) <= 1:\n return s\n return s[-1] + reverse_string(s[:-1])"
170
+ },
171
+ "model": "base_model_qwen7b",
172
+ "num_tokens": 67,
173
+ "finish_reason": "stop"
174
+ }
175
+ ```
176
+
177
+ **Example with curl:**
178
+ ```bash
179
+ curl -X POST http://localhost:8000/chat \
180
+ -H "Content-Type: application/json" \
181
+ -d '{
182
+ "messages": [
183
+ {"role": "user", "content": "Write a binary search function"}
184
+ ],
185
+ "max_tokens": 150
186
+ }'
187
+ ```
188
+
189
+ ---
190
+
191
+ ### `POST /generate/raw`
192
+
193
+ Same as `/generate` but returns raw output without extracting code from markdown blocks.
194
+
195
+ **Example with curl:**
196
+ ```bash
197
+ curl -X POST http://localhost:8000/generate/raw \
198
+ -H "Content-Type: application/json" \
199
+ -d '{
200
+ "prompt": "def quick_sort(arr):",
201
+ "max_tokens": 200
202
+ }'
203
+ ```
204
+
205
+ ---
206
+
207
+ ### `POST /extract-code`
208
+
209
+ Extract code from a text response that may contain markdown code blocks.
210
+
211
+ **Request Body:**
212
+ ```json
213
+ {
214
+ "prompt": "```python\ndef hello():\n print(\"world\")\n```"
215
+ }
216
+ ```
217
+
218
+ **Response:**
219
+ ```json
220
+ {
221
+ "code": "def hello():\n print(\"world\")"
222
+ }
223
+ ```
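The extraction step can be sketched in a few lines. This is an illustrative guess at the logic, not the actual implementation in `inference_api.py` — in particular the regex and the fallback-to-plain-text behavior are assumptions:

```python
import re

def extract_code(text: str) -> str:
    """Return the contents of the first fenced code block, or the text as-is."""
    # Match ```lang\n ... ``` (language tag optional), non-greedy across lines.
    match = re.search(r"```[a-zA-Z0-9_+-]*\n(.*?)```", text, re.DOTALL)
    if match:
        return match.group(1).rstrip("\n")
    return text.strip()

code = extract_code('```python\ndef hello():\n    print("world")\n```')
# code == 'def hello():\n    print("world")'
```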
224
+
225
+ ---
226
+
227
+ ## Environment Variables
228
+
229
+ | Variable | Default | Description |
230
+ |----------|---------|-------------|
231
+ | `MODEL_PATH` | `base_model_qwen7b` | Path to model directory |
232
+ | `DEVICE` | `cuda` (if available) | Device to use: `cuda` or `cpu` |
233
+ | `PORT` | `8000` | Server port |
234
+ | `HOST` | `0.0.0.0` | Server host |
235
+ | `RELOAD` | `false` | Enable auto-reload for development |
236
+ | `DEFAULT_MAX_TOKENS` | `512` | Default max tokens |
237
+ | `DEFAULT_TEMPERATURE` | `0.2` | Default temperature |
238
+ | `DEFAULT_TOP_P` | `0.95` | Default top_p |
239
+
240
+ ---
241
+
242
+ ## Usage Examples
243
+
244
+ ### Python Client
245
+
246
+ ```python
247
+ import requests
248
+
249
+ API_URL = "http://localhost:8000"
250
+
251
+ # Health check
252
+ health = requests.get(f"{API_URL}/health").json()
253
+ print(f"Model loaded: {health['model_loaded']}")
254
+
255
+ # Code completion
256
+ response = requests.post(
257
+ f"{API_URL}/generate",
258
+ json={
259
+ "prompt": "def merge_sort(arr):\n \"\"\"Return sorted array.\"\"\"\n",
260
+ "max_tokens": 200,
261
+ "temperature": 0.3,
262
+ }
263
+ ).json()
264
+
265
+ print(response["generated_text"])
266
+ ```
267
+
268
+ ### JavaScript/Node.js Client
269
+
270
+ ```javascript
271
+ const API_URL = "http://localhost:8000";
272
+
273
+ // Code completion
274
+ async function generate(prompt) {
275
+ const response = await fetch(`${API_URL}/generate`, {
276
+ method: "POST",
277
+ headers: { "Content-Type": "application/json" },
278
+ body: JSON.stringify({
279
+ prompt,
280
+ max_tokens: 128,
281
+ temperature: 0.2,
282
+ }),
283
+ });
284
+ return response.json();
285
+ }
286
+
287
+ const result = await generate("def binary_search(arr, target):");
288
+ console.log(result.generated_text);
289
+ ```
290
+
291
+ ### Using with OpenAI SDK (with base_url replacement)
292
+
293
+ ```python
294
+ from openai import OpenAI
295
+
296
+ client = OpenAI(
297
+ api_key="not-needed",
298
+ base_url="http://localhost:8000"
299
+ )
300
+
301
+ # Note: This works for basic completions but may need adapter code
302
+ # for full OpenAI compatibility
303
+ response = client.completions.create(
304
+ model="stack-2.9",
305
+ prompt="def factorial(n):",
306
+ max_tokens=100,
307
+ )
308
+ ```
309
+
310
+ ---
311
+
312
+ ## Performance Tips
313
+
314
+ 1. **GPU Recommended**: For fastest inference, run on GPU with CUDA
315
+ 2. **Batch Processing**: For multiple prompts, process sequentially (model is loaded once)
316
+ 3. **Memory**: Ensure adequate GPU memory; reduce `max_tokens` if needed
317
+ 4. **Temperature**: Use lower temperature (0.1-0.3) for deterministic code, higher for creative tasks
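Why the temperature tip works: logits are divided by the temperature before the softmax, so T < 1 sharpens the distribution toward the top token while T > 1 flattens it. A small self-contained illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.2)   # near-greedy: top token dominates
flat = softmax_with_temperature(logits, 1.5)    # more exploratory sampling
```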
318
+
319
+ ---
320
+
321
+ ## Error Handling
322
+
323
+ **503 Service Unavailable**: Model not loaded or loading failed
324
+ ```json
325
+ {"detail": "Model not loaded. Check /health for status."}
326
+ ```
327
+
328
+ **500 Internal Server Error**: Generation failed
329
+ ```json
330
+ {"detail": "Generation failed: <error message>"}
331
+ ```
332
+
333
+ **400 Bad Request**: Invalid input
334
+ ```json
335
+ {"detail": "Last message must be from user"}
336
+ ```
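A client can treat 503 as transient (the model may still be loading) and everything else as fatal. A stdlib-only sketch — the retry policy here is a client-side assumption, not part of the API contract:

```python
import json
import time
import urllib.error
import urllib.request

def is_transient(status: int) -> bool:
    """503 means the model is not loaded yet, so a retry may succeed."""
    return status == 503

def post_with_retry(url, payload, retries=3, delay=2.0):
    """POST JSON, retrying transient errors with a fixed delay."""
    body = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if not is_transient(err.code) or attempt == retries - 1:
                raise
            time.sleep(delay)
```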
337
+
338
+ ---
339
+
340
+ ## Architecture Notes
341
+
342
+ - **Single Model Instance**: Model is loaded once at startup and reused
343
+ - **Synchronous Generation**: Uses `torch.no_grad()` for inference
344
+ - **CORS Enabled**: Accepts requests from any origin (configure for production)
345
+ - **No Authentication**: Add middleware (e.g., API key) for production deployments
docs/QUICKSTART.md ADDED
@@ -0,0 +1,415 @@
1
+ # Stack 2.9 — 5-Minute Quick Start
2
+
3
+ > **Goal:** Get Stack 2.9 running and solving coding tasks in under 5 minutes.
4
+
5
+ Stack 2.9 is an AI coding assistant powered by **Qwen2.5-Coder-32B** with Pattern Memory — it learns from your interactions and improves over time.
6
+
7
+ ---
8
+
9
+ ## 📋 Prerequisites
10
+
11
+ ### Required
12
+ | Requirement | Version | Check |
13
+ |-------------|---------|-------|
14
+ | Python | 3.10+ | `python3 --version` |
15
+ | Git | Any recent | `git --version` |
16
+ | pip | Latest | `pip --version` |
17
+
18
+ ### Optional (Recommended)
19
+ | Resource | Why You Need It | Minimum |
20
+ |----------|----------------|---------|
21
+ | **GPU** | Fast code generation | RTX 3070 / M1 Pro |
22
+ | **VRAM** | Run larger models locally | 8GB for 7B quantized; 24GB for 32B (4-bit) |
23
+
24
+ > **No GPU?** Stack 2.9 works on CPU via Ollama or cloud providers (OpenAI, Together AI, etc.).
25
+
26
+ ---
27
+
28
+ ## ⚡ Step 1 — Install in 60 Seconds
29
+
30
+ ```bash
31
+ # 1. Clone the repository
32
+ git clone https://github.com/my-ai-stack/stack-2.9.git
33
+ cd stack-2.9
34
+
35
+ # 2. Create a virtual environment (recommended)
36
+ python3 -m venv venv
37
+ source venv/bin/activate # On Windows: venv\Scripts\activate
38
+
39
+ # 3. Install dependencies
40
+ pip install --upgrade pip
41
+ pip install -r requirements.txt
42
+
43
+ # 4. Copy environment template
44
+ cp .env.example .env
45
+ ```
46
+
47
+ **That's it.** If you hit errors, see [Troubleshooting](#-troubleshooting) below.
48
+
49
+ ---
50
+
51
+ ## 🔑 Step 2 — Configure Your Model Provider
52
+
53
+ Stack 2.9 supports multiple LLM providers. **Pick one that matches your setup:**
54
+
55
+ ### Option A: Ollama (Recommended — Local, Private)
56
+
57
+ ```bash
58
+ # Install Ollama (macOS/Linux)
59
+ curl -fsSL https://ollama.ai/install.sh | sh
60
+
61
+ # Pull the Qwen model
62
+ ollama pull qwen2.5-coder:32b
63
+
64
+ # Set environment
65
+ export MODEL_PROVIDER=ollama
66
+ export OLLAMA_MODEL=qwen2.5-coder:32b
67
+ ```
68
+
69
+ Edit your `.env` file:
70
+ ```env
71
+ MODEL_PROVIDER=ollama
72
+ OLLAMA_MODEL=qwen2.5-coder:32b
73
+ ```
74
+
75
+ ### Option B: Together AI (Best for Qwen, Cloud)
76
+
77
+ ```bash
78
+ # Get your API key at https://together.ai
79
+ export TOGETHER_API_KEY=tog-your-key-here
80
+ ```
81
+
82
+ Edit your `.env`:
83
+ ```env
84
+ MODEL_PROVIDER=together
85
+ TOGETHER_API_KEY=tog-your-key-here
86
+ TOGETHER_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct
87
+ ```
88
+
89
+ ### Option C: OpenAI (GPT-4o)
90
+
91
+ ```env
92
+ MODEL_PROVIDER=openai
93
+ OPENAI_API_KEY=sk-your-key-here
94
+ OPENAI_MODEL=gpt-4o
95
+ ```
96
+
97
+ ### Option D: Anthropic (Claude)
98
+
99
+ ```env
100
+ MODEL_PROVIDER=anthropic
101
+ ANTHROPIC_API_KEY=sk-ant-your-key-here
102
+ ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
103
+ ```
104
+
105
+ ### Option E: OpenRouter (Unified Access)
106
+
107
+ ```env
108
+ MODEL_PROVIDER=openrouter
109
+ OPENROUTER_API_KEY=sk-or-your-key-here
110
+ OPENROUTER_MODEL=openai/gpt-4o
111
+ ```
112
+
113
+ ---
114
+
115
+ ## 🚀 Step 3 — Run Your First Task
116
+
117
+ ### Interactive Chat Mode
118
+
119
+ ```bash
120
+ python stack.py
121
+ ```
122
+
123
+ You'll see:
124
+ ```
125
+ ╔══════════════════════════════════════════════╗
126
+ ║ Stack 2.9 — AI Coding Assistant ║
127
+ ║ Pattern Memory: Active | Tools: 46 ║
128
+ ╚══════════════════════════════════════════════╝
129
+
130
+ You: Write a Python function to reverse a string
131
+ ```
132
+
133
+ ### Single Query Mode
134
+
135
+ ```bash
136
+ python stack.py -c "Write a Python function to reverse a string"
137
+ ```
138
+
139
+ **Expected output:**
140
+ ```python
141
+ def reverse_string(s):
142
+ """Reverse a string and return it."""
143
+ return s[::-1]
144
+
145
+ # Or for a more robust version:
146
+ def reverse_string(s):
147
+ return ''.join(reversed(s))
148
+ ```
149
+
150
+ ### Ask About Your Codebase
151
+
152
+ ```bash
153
+ python stack.py -c "Find all Python files modified in the last week and list them"
154
+ ```
155
+
156
+ ### Generate and Run Code
157
+
158
+ ```bash
159
+ python stack.py -c "Create a hello world Flask app with one route"
160
+ ```
161
+
162
+ ---
163
+
164
+ ## 📊 Step 4 — Run Evaluation (Optional)
165
+
166
+ > **Note:** Evaluation requires a GPU with ~16GB VRAM or more.
167
+
168
+ ### Prepare Your Fine-Tuned Model
169
+
170
+ After training Stack 2.9 on your data, your merged model will be in:
171
+ ```
172
+ ./output/merged/
173
+ ```
174
+
175
+ ### Run HumanEval Benchmark
176
+
177
+ ```bash
178
+ python evaluate_model.py \
179
+ --model-path ./output/merged \
180
+ --benchmark humaneval \
181
+ --num-samples 10 \
182
+ --output results.json
183
+ ```
184
+
185
+ ### Run MBPP Benchmark
186
+
187
+ ```bash
188
+ python evaluate_model.py \
189
+ --model-path ./output/merged \
190
+ --benchmark mbpp \
191
+ --num-samples 10 \
192
+ --output results.json
193
+ ```
194
+
195
+ ### Run Both Benchmarks
196
+
197
+ ```bash
198
+ python evaluate_model.py \
199
+ --model-path ./output/merged \
200
+ --benchmark both \
201
+ --num-samples 10 \
202
+ --k-values 1,10 \
203
+ --output results.json
204
+ ```
205
+
206
+ **Expected output format:**
207
+ ```
208
+ ============================================================
209
+ HumanEval Results
210
+ ============================================================
211
+ pass@1: 65.00%
212
+ pass@10: 82.00%
213
+ Total problems evaluated: 12
214
+ ============================================================
215
+
216
+ ============================================================
217
+ MBPP Results
218
+ ============================================================
219
+ pass@1: 70.00%
220
+ pass@10: 85.00%
221
+ Total problems evaluated: 12
222
+ ============================================================
223
+ ```
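The pass@k figures above are conventionally computed with the unbiased estimator from the HumanEval paper: with n samples per problem and c of them correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A sketch of that formula — whether `evaluate_model.py` uses exactly this estimator is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any draw of k includes a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """results: list of (n_samples, n_correct) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

benchmark_pass_at_k([(10, 5), (10, 10)], k=1)  # → 0.75
```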
224
+
225
+ ### Quick Evaluation (5 Problems Only)
226
+
227
+ ```bash
228
+ python evaluate_model.py \
229
+ --model-path ./output/merged \
230
+ --benchmark humaneval \
231
+ --num-problems 5 \
232
+ --num-samples 5
233
+ ```
234
+
235
+ ---
236
+
237
+ ## 🐳 Step 5 — Deploy Stack 2.9
238
+
239
+ ### Deploy Locally with Docker
240
+
241
+ ```bash
242
+ # Start the container
243
+ docker build -t stack-2.9 .
244
+ docker run -p 7860:7860 \
245
+ -e MODEL_PROVIDER=ollama \
246
+ -e OLLAMA_MODEL=qwen2.5-coder:32b \
247
+ stack-2.9
248
+ ```
249
+
250
+ Access at: **http://localhost:7860**
251
+
252
+ ### Deploy to RunPod (Cloud GPU)
253
+
254
+ ```bash
255
+ # Edit runpod_deploy.sh with your config first
256
+ bash runpod_deploy.sh --gpu a100 --instance hourly
257
+ ```
258
+
259
+ ### Deploy to Kubernetes
260
+
261
+ ```bash
262
+ # 1. Edit k8s/secret.yaml with your HuggingFace token
263
+ # 2. Apply the manifests
264
+ kubectl apply -f k8s/namespace.yaml
265
+ kubectl apply -f k8s/secret.yaml
266
+ kubectl apply -f k8s/configmap.yaml
267
+ kubectl apply -f k8s/pvc.yaml
268
+ kubectl apply -f k8s/deployment.yaml
269
+ kubectl apply -f k8s/service.yaml
270
+
271
+ # Check status
272
+ kubectl get pods -n stack-29
273
+ kubectl logs -n stack-29 deployment/stack-29
274
+ ```
275
+
276
+ ### Hardware Requirements for Deployment
277
+
278
+ | Model Size | Minimum GPU | Recommended | Quantized (4-bit) |
279
+ |------------|-------------|-------------|-------------------|
280
+ | 7B | RTX 3070 (8GB) | A100 40GB | RTX 3060 (6GB) |
281
+ | 32B | A100 40GB | A100 80GB | RTX 3090 (24GB) |
282
+
283
+ ---
284
+
285
+ ## 🧠 Pattern Memory Quick Guide
286
+
287
+ Stack 2.9 stores successful patterns to help with future tasks.
288
+
289
+ ### List Your Patterns
290
+
291
+ ```bash
292
+ python stack.py --patterns list
293
+ python stack.py --patterns stats
294
+ ```
295
+
296
+ ### Extract Patterns from Your Git History
297
+
298
+ ```bash
299
+ python scripts/extract_patterns_from_git.py \
300
+ --repo-path . \
301
+ --output patterns.jsonl \
302
+ --since-date "2024-01-01"
303
+ ```
304
+
305
+ ### Merge LoRA Adapters (Team Sharing)
306
+
307
+ ```bash
308
+ python scripts/merge_lora_adapters.py \
309
+ --adapters adapter_a.safetensors adapter_b.safetensors \
310
+ --weights 0.7 0.3 \
311
+ --output merged.safetensors
312
+ ```
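Conceptually, the weighted merge is an element-wise weighted average of matching adapter tensors. A toy sketch with plain lists standing in for tensors — the real script operates on safetensors state dicts, so the structure here is purely illustrative:

```python
def merge_adapters(adapters, weights):
    """adapters: list of {tensor_name: [floats]} dicts; weights: matching floats summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = {}
    for name in adapters[0]:
        vectors = [a[name] for a in adapters]
        # weighted average, element by element
        merged[name] = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(vectors[0]))
        ]
    return merged

a = {"lora_A": [1.0, 0.0]}
b = {"lora_A": [0.0, 1.0]}
merged = merge_adapters([a, b], [0.7, 0.3])  # → {"lora_A": [0.7, 0.3]}
```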
313
+
314
+ ---
315
+
316
+ ## 🛠️ Troubleshooting
317
+
318
+ ### "Module not found" errors
319
+
320
+ ```bash
321
+ pip install -r requirements.txt
322
+ ```
323
+
324
+ ### "CUDA out of memory" during evaluation
325
+
326
+ ```bash
327
+ # Reduce batch size
328
+ python evaluate_model.py --model-path ./merged --num-samples 5
329
+
330
+ # Or use 4-bit quantization
331
+ # (See docs/TRAINING_7B.md for quantized training)
332
+ ```
333
+
334
+ ### "Model not found" with Ollama
335
+
336
+ ```bash
337
+ ollama pull qwen2.5-coder:32b
338
+ ollama list # Verify it's installed
339
+ ```
340
+
341
+ ### "API key not set" errors
342
+
343
+ ```bash
344
+ # Double-check your .env file
345
+ cat .env
346
+
347
+ # For testing, you can also set inline
348
+ export TOGETHER_API_KEY=tog-your-key
349
+ ```
350
+
351
+ ### Slow inference on CPU
352
+
353
+ ```bash
354
+ # Use a smaller model
355
+ export OLLAMA_MODEL=qwen2.5-coder:7b
356
+
357
+ # Or switch to cloud
358
+ export MODEL_PROVIDER=together
359
+ ```
360
+
361
+ ### Docker build fails
362
+
363
+ ```bash
364
+ # Use Python 3.10 explicitly
365
+ docker build --build-arg PYTHON_VERSION=3.10 -t stack-2.9 .
366
+ ```
367
+
368
+ ### Kubernetes GPU not found
369
+
370
+ ```bash
371
+ # Verify nvidia.com/gpu label on your node
372
+ kubectl get nodes -L nvidia.com/gpu
373
+
374
+ # Install NVIDIA GPU Operator if missing
375
+ # https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/
376
+ ```
377
+
378
+ ---
379
+
380
+ ## 📚 What's Next?
381
+
382
+ | Goal | Go To |
383
+ |------|-------|
384
+ | Train on my own data | `docs/TRAINING_7B.md` |
385
+ | Learn all 46 tools | `TOOLS.md` |
386
+ | Set up team pattern sharing | `docs/pattern-moat.md` |
387
+ | Understand the architecture | `docs/reference/ARCHITECTURE.md` |
388
+ | Report a bug | `SECURITY.md` / GitHub Issues |
389
+
390
+ ---
391
+
392
+ ## ⚡ Quick Reference Card
393
+
394
+ ```bash
395
+ # Install
396
+ git clone https://github.com/my-ai-stack/stack-2.9.git
397
+ cd stack-2.9 && pip install -r requirements.txt
398
+
399
+ # Configure
400
+ cp .env.example .env # Edit with your API keys
401
+
402
+ # Run
403
+ python stack.py # Interactive
404
+ python stack.py -c "your code request" # Single query
405
+
406
+ # Evaluate
407
+ python evaluate_model.py --model-path ./merged --benchmark humaneval
408
+
409
+ # Deploy
410
+ docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9
411
+ ```
412
+
413
+ ---
414
+
415
+ *Stack 2.9 — AI that learns your patterns and grows with you.*
docs/ROADMAP.md ADDED
@@ -0,0 +1,143 @@
1
+ # Stack 2.9 Roadmap
2
+
3
+ > Last updated: April 2026
4
+
5
+ ## Current Status
6
+
7
+ ### What's Working ✅
8
+
9
+ - **Basic code generation** - The model can generate Python, JavaScript, and other code based on prompts
10
+ - **CLI interface** - Working command-line interface (`stack.py`, `src/cli/`)
11
+ - **Multi-provider support** - Ollama, OpenAI, Anthropic, OpenRouter, Together AI integrations
12
+ - **46 built-in tools** - File operations, git, shell, web search, memory, task planning
13
+ - **Pattern Memory infrastructure** - Observer, Learner, Memory components implemented
14
+ - **Training pipeline** - LoRA fine-tuning scripts, data preparation, model merging
15
+ - **Deployment options** - Docker, RunPod, Vast.ai, Kubernetes, HuggingFace Spaces
16
+ - **128K context window** - Extended from base model's 32K
17
+
18
+ ### What's Broken or Missing ⚠️
19
+
20
+ - **Tool calling not trained** - Model doesn't reliably use tools; needs fine-tuning on tool patterns
21
+ - **Benchmark scores unverifiable** - Previous claims removed after audit found only 20/164 HumanEval problems tested
22
+ - **Self-evolution not functional** - Observer/Learner components exist but not connected to training pipeline
23
+ - **Voice integration incomplete** - Coqui XTTS integration present but not tested
24
+ - **Evaluation infrastructure in progress** - New proper evaluation framework built but not run on full benchmarks
25
+
26
+ ### What Needs Testing 🔧
27
+
28
+ - Full HumanEval (164 problems) evaluation
29
+ - Full MBPP (500 problems) evaluation
30
+ - Tool-calling accuracy with real tasks
31
+ - Pattern Memory retrieval and effectiveness
32
+ - Voice input/output pipeline
33
+ - Multi-provider compatibility
34
+
35
+ ### What Needs Documentation 📚
36
+
37
+ - Tool definitions and schemas
38
+ - API reference (internal/ARCHITECTURE.md exists but needs updating)
39
+ - Pattern Memory usage guide
40
+ - Deployment troubleshooting
41
+ - Evaluation methodology
42
+
43
+ ---
44
+
45
+ ## Timeline with Milestones
46
+
47
+ ### Short-Term (1-2 Weeks)
48
+
49
+ | Milestone | Description | Status |
50
+ |-----------|-------------|--------|
51
+ | **S1.1** | Run full HumanEval (164 problems) with proper inference | Not started |
52
+ | **S1.2** | Run full MBPP (500 problems) with proper inference | Not started |
53
+ | **S1.3** | Document all 46 tool definitions in `docs/TOOLS.md` | In progress |
54
+ | **S1.4** | Fix evaluation scripts to use real model inference | Needed |
55
+ | **S1.5** | Create minimal reproducible test for tool calling | Not started |
56
+
57
+ **Owner:** Community contribution welcome
58
+
59
+ ### Medium-Term (1-3 Months)
60
+
61
+ | Milestone | Description | Status |
62
+ |-----------|-------------|--------|
63
+ | **M2.1** | Fine-tune model on tool-calling patterns (RTMP data) | Not started |
64
+ | **M2.2** | Implement and test self-evolution loop (Observer → Learner → Memory → Trainer) | Not started |
65
+ | **M2.3** | Run full benchmark evaluation and publish verified scores | Not started |
66
+ | **M2.4** | Add MCP server support for external tool integration | Partial |
67
+ | **M2.5** | Voice integration end-to-end testing | Not started |
68
+ | **M2.6** | Implement pattern extraction from production usage | Not started |
69
+
70
+ **Owner:** Requires training compute budget or community contribution
71
+
72
+ ### Long-Term (6+ Months)
73
+
74
+ | Milestone | Description | Status |
75
+ |-----------|-------------|--------|
76
+ | **L3.1** | RLHF training for improved tool selection | Future |
77
+ | **L3.2** | Team sync infrastructure (PostgreSQL + FastAPI) | Designed, not implemented |
78
+ | **L3.3** | Federated learning for privacy-preserving updates | Future |
79
+ | **L3.4** | Multi-modal support (images → code) | Future |
80
+ | **L3.5** | Real-time voice-to-voice conversation | Future |
81
+
82
+ **Owner:** Long-term vision, needs significant resources
83
+
84
+ ---
85
+
86
+ ## How to Contribute
87
+
88
+ ### By Priority
89
+
90
+ 1. **Run evaluations** - Help us verify benchmark scores by running `python stack_2_9_eval/run_proper_evaluation.py`
91
+ 2. **Test tool calling** - Try the model with various tools and report what works/doesn't
92
+ 3. **Documentation** - Improve docs, especially tool definitions and API reference
93
+ 4. **Bug reports** - Open issues with reproduction steps
94
+ 5. **Code contributions** - See CONTRIBUTING.md for guidelines
95
+
96
+ ### Contribution Areas
97
+
98
+ | Area | Skill Needed | Priority |
99
+ |------|--------------|----------|
100
+ | Evaluation | Python, ML benchmarking | High |
101
+ | Tool calling tests | Python, CLI usage | High |
102
+ | Documentation | Technical writing | Medium |
103
+ | Training scripts | PyTorch, PEFT | Medium |
104
+ | Deployment | Docker, K8s, Cloud | Low |
105
+ | Pattern Memory | Vector databases, ML | Low |
106
+
107
+ ### Quick Wins for Contributors
108
+
109
+ - Run `python stack.py -c "List files in current directory"` and report if tools work
110
+ - Review `stack/eval/results/` and verify evaluation logs
111
+ - Check `docs/TOOLS.md` accuracy against actual tool implementations
112
+ - Test with different providers (`--provider ollama|openai|anthropic`)
113
+
114
+ ---
115
+
116
+ ## Technical Notes
117
+
118
+ ### Known Limitations
119
+
120
+ 1. **Tool calling is not trained** - The base model has tool capabilities but Stack 2.9 hasn't been fine-tuned to use them reliably
121
+ 2. **Pattern Memory is read-only** - The system stores patterns but doesn't automatically retrain on them yet
122
+ 3. **Evaluation uses stub data** - Some eval scripts return pre-canned answers instead of running model
123
+ 4. **Voice integration untested** - Code exists but hasn't been validated end-to-end
124
+
125
+ ### Next Training Run Requirements
126
+
127
+ To fix tool calling, the next training run needs:
128
+
129
+ - Dataset: `data/rtmp-tools/combined_tools.jsonl` (already generated)
130
+ - Compute: ~1 hour on A100 for LoRA fine-tuning
131
+ - Configuration: Target tool_call logits, use `tool_use_examples.jsonl`
132
+
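As a sanity check on the "~1 hour on A100" estimate, the trainable-parameter count for a LoRA run of this kind is small. A back-of-envelope sketch (hidden size, layer count, rank, and the choice of four attention projections are assumptions for a 7B Qwen-style model, not measured values):

```python
def lora_trainable_params(hidden=3584, layers=28, r=16, projections=4):
    """Each adapted projection adds two low-rank matrices:
    A (hidden x r) and B (r x hidden)."""
    per_projection = 2 * hidden * r
    return layers * projections * per_projection

print(f"{lora_trainable_params():,}")  # ~12.8M trainable parameters
```

That is well under 1% of a 7B model, which is why a single-GPU LoRA pass over 1,500 examples is plausible in roughly an hour.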
133
+ ---
134
+
135
+ ## Contact
136
+
137
+ - **Issues:** https://github.com/my-ai-stack/stack-2.9/issues
138
+ - **Discussions:** https://github.com/my-ai-stack/stack-2.9/discussions
139
+ - **Discord:** (link in README)
140
+
141
+ ---
142
+
143
+ *This roadmap is a living document. Updates based on community feedback and project progress.*
docs/TOOL_DATA_ANALYSIS.md ADDED
@@ -0,0 +1,235 @@
1
+ # Tool Calling Training Data Analysis
2
+
3
+ **Generated:** 2026-04-06
4
+ **Files Analyzed:**
5
+ - `training-data/tool_examples.jsonl` (original)
6
+ - `training-data_v2/tool_examples.jsonl` (regenerated)
7
+
8
+ ---
9
+
10
+ ## Executive Summary
11
+
12
+ The original tool calling training data had **significant quality issues** that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors.
13
+
14
+ **Key Findings on Original Data:**
15
+ - ❌ 10.7% of tool calls use incorrect parameters (105 mismatched search queries, 2 wrong file paths)
16
+ - ❌ Heavy prompt duplication (7.5x average)
17
+ - ❌ No multi-step tool chains (only 1 tool per example)
18
+ - ❌ All examples use identical tool definitions
19
+
20
+ **Action Taken:** Generated 500 new examples using the project's generator script.
21
+
22
+ **Recommendation:** The original data needs substantial improvements before use in training.
23
+
24
+ ---
25
+
26
+ ## 1. Statistics Overview
27
+
28
+ ### Original Data (tool_examples.jsonl)
29
+
30
+ | Metric | Value |
31
+ |--------|-------|
32
+ | Total Examples | 1,000 |
33
+ | Unique Prompts | 133 |
34
+ | Average Duplication | 7.52x |
35
+ | Unique Tool Sequences | 5 |
36
+ | Examples with Issues | ~107 (10.7%) |
37
+
38
+ ### New Data (training-data_v2/tool_examples.jsonl)
39
+
40
+ | Metric | Value |
41
+ |--------|-------|
42
+ | Total Examples | 500 |
43
+ | File Size | 1.9 MB |
44
+ | Tools per Example | 5 (static definition) |
45
+
46
+ ### Tool Call Distribution (Original)
47
+
48
+ | Tool | Call Count |
49
+ |------|------------|
50
+ | Bash | 200 |
51
+ | FileRead | 200 |
52
+ | FileWrite | 200 |
53
+ | WebSearch | 200 |
54
+ | Grep | 200 |
55
+
56
+ All examples have exactly **one tool call** - no multi-step chains exist.
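Statistics like those above can be recomputed from any JSONL dataset with a short script. A sketch, assuming each record carries a `prompt` string and a `tool_calls` list (field names are assumptions and may differ from the actual schema):

```python
import json
from collections import Counter

def jsonl_stats(path):
    """Prompt duplication and tool-call distribution for a JSONL dataset."""
    prompts, tools, chains = Counter(), Counter(), Counter()
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            prompts[rec["prompt"]] += 1
            calls = rec.get("tool_calls", [])
            chains[len(calls)] += 1  # chain length = tool calls per example
            for call in calls:
                tools[call["name"]] += 1
    total, unique = sum(prompts.values()), len(prompts)
    return {
        "total": total,
        "unique_prompts": unique,
        "avg_duplication": round(total / unique, 2),
        "tool_distribution": dict(tools),
        "chain_lengths": dict(chains),
    }
```

Run against `training-data/tool_examples.jsonl`, this should reproduce the 7.52x duplication figure reported above.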
57
+
58
+ ---
59
+
60
+ ## 2. Prompt Diversity Analysis (Original Data)
61
+
62
+ ### Prompt Categories
63
+
64
+ | Category | Count | Percentage |
65
+ |----------|-------|------------|
66
+ | Python | 207 | 20.7% |
67
+ | React | 149 | 14.9% |
68
+ | File Read | 134 | 13.4% |
69
+ | File Write | 119 | 11.9% |
70
+ | Other | 114 | 11.4% |
71
+ | Run Command | 80 | 8.0% |
72
+ | Docker/K8s | 67 | 6.7% |
73
+ | Search | 50 | 5.0% |
74
+ | Git | 40 | 4.0% |
75
+ | Testing | 31 | 3.1% |
76
+ | Package Management | 9 | 0.9% |
77
+
78
+ ### Most Duplicated Prompts
79
+
80
+ | Prompt | Occurrences |
81
+ |--------|-------------|
82
+ | "Write a simple React component to src/components/Button.jsx" | 67 |
+ | "Run the tests with pytest" | 40 |
+ | "Run npm install to install dependencies" | 40 |
85
+
86
+ ---
87
+
88
+ ## 3. Tool Usage Breakdown
89
+
90
+ ### Tool Definitions
91
+
92
+ All 1,000 original examples use **identical tool definitions** with 5 tools:
93
+ - `Bash` - Execute bash commands
94
+ - `FileRead` - Read file contents
95
+ - `FileWrite` - Create/overwrite files
96
+ - `WebSearch` - Search the web
97
+ - `Grep` - Search for patterns in files
98
+
99
+ ### Tool Call Issues Found (Original Data)
100
+
101
+ #### Wrong Search Patterns (105 instances / 10.5%)
102
+
103
+ The `WebSearch` tool frequently uses queries that don't match the user's question:
104
+
105
+ | User Question | Actual Search Query |
106
+ |--------------|---------------------|
107
+ | "How do I use async/await in Python?" | "AWS Lambda cold start optimization" |
108
+ | "How do I use React hooks properly?" | "SQL join types explained" |
109
+ | "What's the difference between Docker and Kubernetes?" | "Git rebase vs merge" |
110
+ | "How do I use React hooks properly?" | "TypeScript generics tutorial" |
111
+ | "What's the difference between Docker and Kubernetes?" | "TypeScript generics tutorial" |
112
+
113
+ #### Wrong File Paths (2 instances)
114
+
115
+ The `FileWrite` tool sometimes writes to incorrect file types:
116
+
117
+ | User Request | Written Path |
118
+ |-------------|--------------|
119
+ | "Create a src/components/Header.jsx file" | Written to `config.json` |
120
+ | "Create a src/middleware.py file with settings" | Written to `config.yaml` |
121
+
122
+ #### Pattern/File Type Mismatches (Grep)
123
+
124
+ The `Grep` tool sometimes searches with mismatched patterns:
125
+
126
+ | Pattern | File Pattern | Issue |
127
+ |---------|-------------|-------|
128
+ | `class ` | `*.ts` | Python pattern in TypeScript files |
129
+ | `SELECT ` | `*.js` | SQL pattern in JavaScript files |
130
+ | `TODO` | `*.md` | Searching TODO in markdown files |
131
+
132
+ ---
133
+
134
+ ## 4. Data Quality Issues
135
+
136
+ ### Critical Issues
137
+
138
+ 1. **No Multi-Step Tool Chains**
139
+ - All 1,000 examples use exactly one tool call
140
+ - Real coding tasks typically require 2-5+ tool calls
141
+ - Example: "Read file → Find pattern → Search docs → Write fix"
142
+
143
+ 2. **Search Query Mismatches**
144
+ - 10.5% of WebSearch calls have irrelevant queries
145
+ - Indicates the generator script has logic errors
146
+
147
+ 3. **Heavy Prompt Duplication**
148
+ - 133 unique prompts duplicated to 1,000 examples
149
+ - "Write a simple React component" appears 67 times
150
+ - This creates overfitting to specific prompts
151
+
152
+ 4. **Identical Tool Definitions**
153
+ - All examples use the same 5 tools with identical descriptions
154
+ - No variation in tool schemas or parameter structures
155
+
156
+ ### Moderate Issues
157
+
158
+ 5. **File Path Hallucination**
159
+ - Tool calls reference files that don't exist in actual codebase
160
+ - Example: asking for `tests/test_main.py` but reading `src/app.js`
161
+
162
+ 6. **Response Fabrication**
163
+ - Assistant responses sometimes claim to show content that wasn't actually read
164
+ - Example: "Here's the README.md" when README.md wasn't the file requested
165
+
166
+ ---
167
+
168
+ ## 5. Recommendations for Improvement
169
+
170
+ ### Immediate Actions (Completed)
171
+
172
+ 1. ✅ **Regenerated Data**
173
+ ```
174
+ Generated 500 new examples in training-data_v2/tool_examples.jsonl
175
+ ```
176
+
177
+ ### Script Fixes Needed
178
+
179
+ The generator script (`scripts/generate_tool_data.py`) needs:
180
+
181
+ 1. Fix `TOOL_CALL_PAIRS` mapping - queries don't match questions
182
+ 2. Fix `FILE_PATTERNS` - wrong file types for requested content
183
+ 3. Add multi-step chain generation
184
+ 4. Add prompt variation templates
185
+ 5. Add validation to check query/content relevance
186
+
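For item 5, even a crude lexical check would have caught the mismatched WebSearch queries shown in section 3. A sketch of such a validator (the stop-word list and threshold are arbitrary illustrative choices, not tuned values):

```python
STOP_WORDS = {"the", "a", "an", "in", "to", "of", "and", "how", "do", "i",
              "what", "is", "are", "between", "use", "my", "with"}

def query_matches_question(question, query, threshold=0.2):
    """Flag WebSearch calls whose query shares almost no content words
    with the user's question. A cheap lexical check, not a semantic one."""
    def content_words(text):
        words = {w.strip("?.,!'\"").lower() for w in text.split()}
        return {w for w in words if w and w not in STOP_WORDS}
    question_words, query_words = content_words(question), content_words(query)
    if not query_words:
        return False
    return len(question_words & query_words) / len(query_words) >= threshold

# The mismatch from section 3 is caught:
query_matches_question("How do I use async/await in Python?",
                       "AWS Lambda cold start optimization")  # False
```

A generator-side check like this would reject an example before it is written, rather than requiring a post-hoc audit.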
187
+ ### Future Improvements
188
+
189
+ 1. **Add Multi-Step Examples**
190
+ - Real tasks require reading files, searching, editing
191
+ - Generate chains of 2-4 tool calls per example
192
+
193
+ 2. **Increase Prompt Diversity**
194
+ - Target 500+ unique prompts instead of duplicating
195
+ - Use template variations and paraphrasing
196
+
197
+ 3. **Vary Tool Definitions**
198
+ - Different tools per example
199
+ - Add tool variations (e.g., different Bash commands)
200
+
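To make improvement 1 concrete, a multi-step example could take roughly the following shape (the message/tool-call schema here is an assumed format for illustration; the generator should emit whatever structure the training pipeline actually expects):

```python
import json

def make_chain_example(prompt, steps):
    """Build one multi-step training example.

    steps: ordered list of (tool_name, arguments) the assistant should call.
    """
    messages = [{"role": "user", "content": prompt}]
    for tool_name, arguments in steps:
        messages.append({
            "role": "assistant",
            "tool_calls": [{"name": tool_name, "arguments": arguments}],
        })
    return json.dumps({"messages": messages})

# A read -> search -> write chain, mirroring a realistic workflow
example = make_chain_example(
    "Find and fix the failing import in src/app.py",
    [("FileRead", {"path": "src/app.py"}),
     ("Grep", {"pattern": "^import ", "file_pattern": "*.py"}),
     ("FileWrite", {"path": "src/app.py", "content": "# fixed module"})],
)
```

Chains of two to four calls like this would cover the "read file, find pattern, write fix" workflows that single-call examples cannot teach.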
201
+ ---
202
+
203
+ ## 6. Conclusion
204
+
205
+ The original `tool_examples.jsonl` data is **NOT suitable for production training** without significant improvements:
206
+
207
+ - ~10% of examples have incorrect tool parameters
208
+ - Heavy duplication leads to overfitting
209
+ - The absence of multi-step chains fails to represent real coding workflows
210
+ - Synthetic generation errors are systematic
211
+
212
+ **Action Completed:** Generated 500 new examples via the project's generator script.
213
+
214
+ **Remaining Work:** Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration.
215
+
216
+ ---
217
+
218
+ ## Appendix: Quick Stats
219
+
220
+ ### Original Data
221
+ ```
222
+ Total examples: 1,000
223
+ Unique prompts: 133
224
+ Tool call issues: 107 (10.7%)
225
+ Multi-tool chains: 0 (0%)
226
+ Identical tool defs: 100%
227
+ Average duplication: 7.52x
228
+ ```
229
+
230
+ ### New Data (Generated)
231
+ ```
232
+ Total examples: 500
233
+ File size: 1.9 MB
234
+ Location: training-data_v2/tool_examples.jsonl
235
+ ```
inference_api.py ADDED
@@ -0,0 +1,495 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ FastAPI Inference Server for Stack 2.9 Model
4
+ Provides REST API endpoints for code generation using fine-tuned Qwen models.
5
+
6
+ Usage:
7
+ # With default settings (model loaded from environment or config)
8
+ uvicorn inference_api:app --host 0.0.0.0 --port 8000
9
+
10
+ # With custom model path
11
+ MODEL_PATH=/path/to/model uvicorn inference_api:app --host 0.0.0.0 --port 8000
12
+
13
+ # With reload for development
14
+ uvicorn inference_api:app --reload --port 8000
15
+ """
16
+
17
+ import os
18
+ import logging
19
+ from contextlib import asynccontextmanager
20
+ from typing import List
21
+
22
+ import torch
23
+ from fastapi import FastAPI, HTTPException
24
+ from fastapi.middleware.cors import CORSMiddleware
25
+ from pydantic import BaseModel, Field
26
+
27
+ from transformers import AutoModelForCausalLM, AutoTokenizer
28
+
29
+ # Configure logging
30
+ logging.basicConfig(level=logging.INFO)
31
+ logger = logging.getLogger(__name__)
32
+
33
+ # Model configuration
34
+ MODEL_PATH = os.getenv("MODEL_PATH", "base_model_qwen7b")
35
+ DEVICE = os.getenv("DEVICE", "cuda" if torch.cuda.is_available() else "cpu")
36
+ DEFAULT_MAX_TOKENS = int(os.getenv("DEFAULT_MAX_TOKENS", "512"))
37
+ DEFAULT_TEMPERATURE = float(os.getenv("DEFAULT_TEMPERATURE", "0.2"))
38
+ DEFAULT_TOP_P = float(os.getenv("DEFAULT_TOP_P", "0.95"))
39
+
40
+ # Global model and tokenizer (loaded on startup)
41
+ model = None
42
+ tokenizer = None
43
+
44
+
45
+ @asynccontextmanager
46
+ async def lifespan(app: FastAPI):
47
+ """Load model on startup, cleanup on shutdown."""
48
+ global model, tokenizer
49
+ logger.info(f"Loading model from: {MODEL_PATH}")
50
+ logger.info(f"Using device: {DEVICE}")
51
+
52
+ try:
53
+ tokenizer = AutoTokenizer.from_pretrained(
54
+ MODEL_PATH,
55
+ trust_remote_code=True,
56
+ padding_side="left",
57
+ )
58
+
59
+ if tokenizer.pad_token is None:
60
+ tokenizer.pad_token = tokenizer.eos_token
61
+
62
+ model = AutoModelForCausalLM.from_pretrained(
63
+ MODEL_PATH,
64
+ torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
65
+ device_map="auto" if DEVICE == "cuda" else None,
66
+ low_cpu_mem_usage=True,
67
+ trust_remote_code=True,
68
+ )
69
+
70
+ if DEVICE == "cpu":
71
+ model = model.to(DEVICE)
72
+
73
+ model.eval()
74
+ logger.info("Model loaded successfully")
75
+ except Exception as e:
76
+ logger.error(f"Failed to load model: {e}")
77
+ raise
78
+
79
+ yield
80
+
81
+ # Cleanup
82
+ logger.info("Shutting down, cleaning up model...")
83
+ del model
84
+ del tokenizer
85
+ if torch.cuda.is_available():
86
+ torch.cuda.empty_cache()
87
+
88
+
89
+ app = FastAPI(
90
+ title="Stack 2.9 Inference API",
91
+ description="REST API for code generation using Stack 2.9 fine-tuned Qwen model",
92
+ version="1.0.0",
93
+ lifespan=lifespan,
94
+ )
95
+
96
+ # CORS middleware
97
+ app.add_middleware(
98
+ CORSMiddleware,
99
+ allow_origins=["*"],
100
+ allow_credentials=True,
101
+ allow_methods=["*"],
102
+ allow_headers=["*"],
103
+ )
104
+
105
+
106
+ # ============================================================================
107
+ # Request/Response Models
108
+ # ============================================================================
109
+
110
+ class GenerateRequest(BaseModel):
111
+ """Request body for /generate endpoint."""
112
+ prompt: str = Field(..., description="Input prompt/code to complete", min_length=1)
113
+ max_tokens: int = Field(DEFAULT_MAX_TOKENS, ge=1, le=4096, description="Max tokens to generate")
114
+ temperature: float = Field(DEFAULT_TEMPERATURE, ge=0.0, le=2.0, description="Sampling temperature")
115
+ top_p: float = Field(DEFAULT_TOP_P, ge=0.0, le=1.0, description="Nucleus sampling threshold")
116
+ do_sample: bool = Field(True, description="Whether to use sampling")
117
+ repetition_penalty: float = Field(1.1, ge=1.0, le=2.0, description="Repetition penalty")
118
+ num_return_sequences: int = Field(1, ge=1, le=10, description="Number of sequences to generate")
119
+
120
+ model_config = {
121
+ "json_schema_extra": {
122
+ "example": {
123
+ "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
124
+ "max_tokens": 128,
125
+ "temperature": 0.2,
126
+ "top_p": 0.95,
127
+ }
128
+ }
129
+ }
130
+
131
+
132
+ class GenerateResponse(BaseModel):
133
+ """Response body for /generate endpoint."""
134
+ generated_text: str
135
+ prompt: str
136
+ model: str
137
+ num_tokens: int
138
+ finish_reason: str = "length"
139
+
140
+
141
+ class ChatMessage(BaseModel):
142
+ """A single message in a conversation."""
143
+ role: str = Field(..., description="Role: 'user' or 'assistant'")
144
+ content: str = Field(..., description="Message content")
145
+
146
+
147
+ class ChatRequest(BaseModel):
148
+ """Request body for /chat endpoint."""
149
+ messages: List[ChatMessage] = Field(..., description="Conversation history")
150
+ max_tokens: int = Field(DEFAULT_MAX_TOKENS, ge=1, le=4096, description="Max tokens to generate")
151
+ temperature: float = Field(DEFAULT_TEMPERATURE, ge=0.0, le=2.0, description="Sampling temperature")
152
+ top_p: float = Field(DEFAULT_TOP_P, ge=0.0, le=1.0, description="Nucleus sampling threshold")
153
+ do_sample: bool = Field(True, description="Whether to use sampling")
154
+ repetition_penalty: float = Field(1.1, ge=1.0, le=2.0, description="Repetition penalty")
155
+
156
+ model_config = {
157
+ "json_schema_extra": {
158
+ "example": {
159
+ "messages": [
160
+ {"role": "user", "content": "Write a function to reverse a string in Python"},
161
+ {"role": "assistant", "content": "def reverse_string(s):\n return s[::-1]"},
162
+ {"role": "user", "content": "Make it recursive"},
163
+ ],
164
+ "max_tokens": 128,
165
+ "temperature": 0.2,
166
+ }
167
+ }
168
+ }
169
+
170
+
171
+ class ChatResponse(BaseModel):
172
+ """Response body for /chat endpoint."""
173
+ message: ChatMessage
174
+ model: str
175
+ num_tokens: int
176
+ finish_reason: str = "length"
177
+
178
+
179
+ class HealthResponse(BaseModel):
180
+ """Response body for /health endpoint."""
181
+ status: str
182
+ model_loaded: bool
183
+ model_path: str
184
+ device: str
185
+ cuda_available: bool
186
+
187
+
188
+ class ModelInfoResponse(BaseModel):
189
+ """Response body for /model-info endpoint."""
190
+ model_path: str
191
+ device: str
192
+ dtype: str
193
+
194
+
195
+ # ============================================================================
196
+ # Helper Functions
197
+ # ============================================================================
198
+
199
+ def format_chat_to_prompt(messages: List[ChatMessage]) -> str:
200
+ """
201
+ Format chat messages into a prompt for code generation.
202
+ Uses a simple instruction format suitable for Qwen.
203
+ """
204
+ formatted = []
205
+ for msg in messages:
206
+ if msg.role == "user":
207
+ formatted.append(f"<|im_start|>user\n{msg.content}<|im_end|>")
208
+ elif msg.role == "assistant":
209
+ formatted.append(f"<|im_start|>assistant\n{msg.content}<|im_end|>")
210
+
211
+ formatted.append("<|im_start|>assistant\n")
212
+ return "\n".join(formatted)
213
+
214
+
215
+ def generate_response(
216
+ prompt: str,
217
+ max_new_tokens: int,
218
+ temperature: float,
219
+ top_p: float,
220
+ do_sample: bool,
221
+ repetition_penalty: float,
222
+ num_return_sequences: int,
223
+ ) -> tuple[str, int, str]:
224
+ """
225
+ Generate response from model.
226
+
227
+ Returns:
228
+ tuple: (generated_text, num_tokens, finish_reason)
229
+ """
230
+ inputs = tokenizer(
231
+ prompt,
232
+ return_tensors="pt",
233
+ padding=True,
234
+ truncation=True,
235
+ )
236
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
237
+
238
+ with torch.no_grad():
239
+ outputs = model.generate(
240
+ **inputs,
241
+ max_new_tokens=max_new_tokens,
242
+ temperature=temperature,
243
+ top_p=top_p,
244
+ do_sample=do_sample,
245
+ repetition_penalty=repetition_penalty,
246
+ num_return_sequences=num_return_sequences,
247
+ pad_token_id=tokenizer.pad_token_id,
248
+ eos_token_id=tokenizer.eos_token_id,
249
+ )
250
+
251
+     # Decode only the newly generated tokens; decoding the full sequence and
+     # stripping the prompt as a string breaks when skip_special_tokens removes
+     # chat markers that were part of the prompt
+     prompt_len = inputs["input_ids"].shape[1]
+     generated_text = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
+
+     # Calculate number of generated tokens
+     num_tokens = outputs.shape[1] - prompt_len
260
+
261
+ generated_text = generated_text.strip()
262
+
263
+ # Determine finish reason
264
+ finish_reason = "stop"
265
+ if num_tokens >= max_new_tokens:
266
+ finish_reason = "length"
267
+
268
+ return generated_text, num_tokens, finish_reason
269
+
270
+
271
+ def extract_code_from_response(text: str) -> str:
272
+ """Extract code block from response if present."""
273
+ if "```python" in text:
274
+ start = text.find("```python") + len("```python")
275
+ end = text.find("```", start)
276
+ if end != -1:
277
+ return text[start:end].strip()
278
+ elif "```" in text:
279
+ start = text.find("```") + len("```")
280
+ # Skip potential language identifier
281
+ if "\n" in text[start:]:
282
+ start = text.find("\n", start) + 1
283
+ end = text.find("```", start)
284
+ if end != -1:
285
+ return text[start:end].strip()
286
+ return text
287
+
288
+
289
+ # ============================================================================
290
+ # API Endpoints
291
+ # ============================================================================
292
+
293
+ @app.get("/health", response_model=HealthResponse)
294
+ async def health_check():
295
+ """
296
+ Health check endpoint.
297
+
298
+ Returns the current status of the API and model.
299
+ """
300
+ return HealthResponse(
301
+ status="healthy" if model is not None else "model_not_loaded",
302
+ model_loaded=model is not None,
303
+ model_path=MODEL_PATH,
304
+ device=DEVICE,
305
+ cuda_available=torch.cuda.is_available(),
306
+ )
307
+
308
+
309
+ @app.get("/model-info", response_model=ModelInfoResponse)
310
+ async def get_model_info():
311
+ """
312
+ Get information about the loaded model.
313
+ """
314
+ if model is None:
315
+ raise HTTPException(status_code=503, detail="Model not loaded")
316
+
317
+ dtype = str(next(model.parameters()).dtype)
318
+
319
+ return ModelInfoResponse(
320
+ model_path=MODEL_PATH,
321
+ device=str(next(model.parameters()).device),
322
+ dtype=dtype,
323
+ )
324
+
325
+
326
+ @app.post("/generate", response_model=GenerateResponse)
327
+ async def generate(request: GenerateRequest):
328
+ """
329
+ Generate code completion for a prompt.
330
+
331
+ Takes a prompt and generates code completion based on the model.
332
+ Supports various generation parameters for controlling output.
333
+ """
334
+ if model is None or tokenizer is None:
335
+ raise HTTPException(
336
+ status_code=503,
337
+ detail="Model not loaded. Check /health for status."
338
+ )
339
+
340
+ try:
341
+ generated_text, num_tokens, finish_reason = generate_response(
342
+ prompt=request.prompt,
343
+ max_new_tokens=request.max_tokens,
344
+ temperature=request.temperature,
345
+ top_p=request.top_p,
346
+ do_sample=request.do_sample,
347
+ repetition_penalty=request.repetition_penalty,
348
+ num_return_sequences=request.num_return_sequences,
349
+ )
350
+
351
+ return GenerateResponse(
352
+ generated_text=generated_text,
353
+ prompt=request.prompt,
354
+ model=MODEL_PATH,
355
+ num_tokens=num_tokens,
356
+ finish_reason=finish_reason,
357
+ )
358
+ except Exception as e:
359
+ logger.error(f"Generation error: {e}")
360
+ raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")
361
+
362
+
363
+ @app.post("/generate/raw", response_model=GenerateResponse)
364
+ async def generate_raw(request: GenerateRequest):
365
+ """
366
+ Generate without extracting code from markdown blocks.
367
+
368
+ Returns the raw model output without any post-processing.
369
+ """
370
+ if model is None or tokenizer is None:
371
+ raise HTTPException(
372
+ status_code=503,
373
+ detail="Model not loaded. Check /health for status."
374
+ )
375
+
376
+ try:
377
+ # Get raw response
378
+ inputs = tokenizer(
379
+ request.prompt,
380
+ return_tensors="pt",
381
+ padding=True,
382
+ truncation=True,
383
+ )
384
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
385
+
386
+ with torch.no_grad():
387
+ outputs = model.generate(
388
+ **inputs,
389
+ max_new_tokens=request.max_tokens,
390
+ temperature=request.temperature,
391
+ top_p=request.top_p,
392
+ do_sample=request.do_sample,
393
+ repetition_penalty=request.repetition_penalty,
394
+ num_return_sequences=request.num_return_sequences,
395
+ pad_token_id=tokenizer.pad_token_id,
396
+ eos_token_id=tokenizer.eos_token_id,
397
+ )
398
+
399
+         # Slice off the prompt tokens before decoding; stripping the prompt
+         # as a string is unreliable once special tokens are skipped
+         prompt_len = inputs["input_ids"].shape[1]
+         num_tokens = outputs.shape[1] - prompt_len
+         generated_text = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
404
+
405
+ finish_reason = "stop" if num_tokens < request.max_tokens else "length"
406
+
407
+ return GenerateResponse(
408
+ generated_text=generated_text.strip(),
409
+ prompt=request.prompt,
410
+ model=MODEL_PATH,
411
+ num_tokens=num_tokens,
412
+ finish_reason=finish_reason,
413
+ )
414
+ except Exception as e:
415
+ logger.error(f"Generation error: {e}")
416
+ raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")
417
+
418
+
419
+ @app.post("/chat", response_model=ChatResponse)
420
+ async def chat(request: ChatRequest):
421
+ """
422
+ Chat endpoint for conversation-style interactions.
423
+
424
+ Takes a conversation history and generates the next assistant response.
425
+ """
426
+ if model is None or tokenizer is None:
427
+ raise HTTPException(
428
+ status_code=503,
429
+ detail="Model not loaded. Check /health for status."
430
+ )
431
+
432
+ if not request.messages:
433
+ raise HTTPException(status_code=400, detail="Messages list cannot be empty")
434
+
435
+ # Check that last message is from user
436
+ if request.messages[-1].role != "user":
437
+ raise HTTPException(
438
+ status_code=400,
439
+ detail="Last message must be from user"
440
+ )
441
+
442
+ try:
443
+ # Format conversation as prompt
444
+ prompt = format_chat_to_prompt(request.messages)
445
+
446
+ generated_text, num_tokens, finish_reason = generate_response(
447
+ prompt=prompt,
448
+ max_new_tokens=request.max_tokens,
449
+ temperature=request.temperature,
450
+ top_p=request.top_p,
451
+ do_sample=request.do_sample,
452
+ repetition_penalty=request.repetition_penalty,
453
+ num_return_sequences=1,
454
+ )
455
+
456
+ return ChatResponse(
457
+ message=ChatMessage(role="assistant", content=generated_text),
458
+ model=MODEL_PATH,
459
+ num_tokens=num_tokens,
460
+ finish_reason=finish_reason,
461
+ )
462
+ except Exception as e:
463
+ logger.error(f"Chat error: {e}")
464
+ raise HTTPException(status_code=500, detail=f"Chat generation failed: {str(e)}")
465
+
466
+
467
+ @app.post("/extract-code")
468
+ async def extract_code(request: GenerateRequest):
469
+ """
470
+ Extract code from a generated response.
471
+
472
+ Useful when you have raw output with markdown code blocks and want to
473
+ extract just the code portion.
474
+ """
475
+ code = extract_code_from_response(request.prompt)
476
+ return {"code": code}
477
+
478
+
479
+ # ============================================================================
480
+ # Main Entry Point
481
+ # ============================================================================
482
+
483
+ if __name__ == "__main__":
484
+ import uvicorn
485
+
486
+ port = int(os.getenv("PORT", "8000"))
487
+ host = os.getenv("HOST", "0.0.0.0")
488
+
489
+ uvicorn.run(
490
+ "inference_api:app",
491
+ host=host,
492
+ port=port,
493
+ reload=os.getenv("RELOAD", "false").lower() == "true",
494
+ workers=1, # Multi-worker can cause GPU memory issues
495
+ )
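The endpoints above can be exercised with a small client. The sketch below is illustrative, not part of the commit: `API_URL` assumes the server's default `HOST`/`PORT` environment settings, and the `build_*_payload` helpers are hypothetical names that simply mirror the request fields the `/generate` and `/chat` handlers read (`prompt`, `messages`, `max_tokens`, `temperature`). It uses only the standard library so it runs without installing `requirements_api.txt`.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed: the server's default HOST/PORT


def build_generate_payload(prompt: str, max_tokens: int = 256,
                           temperature: float = 0.7) -> dict:
    # Field names mirror those read by the /generate handler above.
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}


def build_chat_payload(messages: list, max_tokens: int = 256) -> dict:
    # /chat rejects requests whose last message is not from the user (HTTP 400),
    # so fail fast on the client side.
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("last message must be from the user")
    return {"messages": messages, "max_tokens": max_tokens}


def post(path: str, payload: dict) -> dict:
    # Stdlib-only POST helper; swap in requests/httpx if you prefer.
    req = urllib.request.Request(
        f"{API_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # Requires the server to be running (python inference_api.py).
    out = post("/generate", build_generate_payload("def fibonacci(n):"))
    print(out["generated_text"], out["finish_reason"])
```

Checking `finish_reason` on the response tells you whether generation stopped naturally (`"stop"`) or was truncated at the token budget (`"length"`), in which case you may want to retry with a larger `max_tokens`.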
requirements_api.txt ADDED
@@ -0,0 +1,18 @@
+# FastAPI Inference API Dependencies
+# Install with: pip install -r requirements_api.txt
+
+# Web Framework
+fastapi>=0.109.0
+uvicorn[standard]>=0.27.0
+
+# Request/Response Models
+pydantic>=2.5.0
+
+# CORS Support: no extra package needed; FastAPI ships CORSMiddleware
+
+# Server utilities
+python-multipart>=0.0.6
+
+# Optional: async HTTP client (for testing or calling the API)
+httpx>=0.26.0
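All pins in this file are `>=` minimums. When checking an installed version against such a pin, compare the dotted components numerically, not as strings. The helper below is a hypothetical sketch of that comparison, assuming purely numeric version strings (no pre-release suffixes like `rc1`):

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Numeric comparison of dotted version strings, matching a '>=' pin."""
    def parts(version: str) -> list:
        return [int(p) for p in version.split(".")]
    return parts(installed) >= parts(minimum)

# Why not compare strings? Lexicographically "0.27.0" > "0.109.0" (since
# '2' > '1'), yet uvicorn 0.27.0 is actually older than 0.109.0 would be.
```

Real tooling (pip, `packaging.version`) handles pre-release and local version segments as well; this sketch only illustrates why the plain string comparison is a trap.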
training-data/tool_examples_combined.jsonl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5
+size 5669209
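Note that the JSONL file is stored via Git LFS, so the diff above shows the three-line pointer file, not the 1500 training examples themselves; cloning with `git lfs pull` fetches the real ~5.4 MB payload. A pointer file is just key-value lines, which a short sketch (hypothetical helper, not part of the commit) can parse:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version/oid/size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is the byte count of the real file
    return fields

# The exact pointer committed for tool_examples_combined.jsonl:
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5
size 5669209"""

info = parse_lfs_pointer(pointer)
```

The `oid` is the SHA-256 of the actual file contents, so after `git lfs pull` you can verify the download by hashing the JSONL and comparing.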