walidsobhie-code committed on
Commit
b03a8a0
1 Parent(s): 183b3b6

feat: add inference API, quickstart guide, roadmap, and combined tool data


Created:
- inference_api.py: FastAPI server with /generate, /chat, /health endpoints
- requirements_api.txt: FastAPI dependencies
- docs/API.md: API documentation
- docs/QUICKSTART.md: 5-minute quickstart guide
- README_QUICKSTART.md: Simplified landing page
- docs/ROADMAP.md: Project roadmap (current → long-term)
- CONTRIBUTING.md: Contributor guidelines
- docs/TOOL_DATA_ANALYSIS.md: Tool data analysis report
- training-data/tool_examples_combined.jsonl: 1500 tool_calling examples (balanced across 5 tools)

Total tool calling training data: 1500 examples
- Grep: 300
- FileRead: 300
- WebSearch: 300
- Bash: 300
- FileWrite: 300

CONTRIBUTING.md CHANGED
@@ -1,26 +1,171 @@
1
- # Contributing
2
 
3
- We welcome contributions! Here's how:
4
 
5
- ## Getting Started
6
- 1. Fork the repository
7
- 2. Clone your fork: `git clone https://github.com/YOUR_USER/$repo.git`
8
- 3. Create a virtual environment
9
- 4. Install dependencies: `pip install -r requirements.txt`
10
 
11
- ## Making Changes
12
- 1. Create a branch: `git checkout -b feature/your-feature-name`
13
- 2. Make your changes
14
- 3. Add tests
15
- 4. Run tests: `pytest tests/`
16
- 5. Commit: `git commit -m "Add your feature"`
17
- 6. Push: `git push origin feature/your-feature-name`
18
- 7. Open a Pull Request
19
 
20
  ## Code Style
21
- - Follow PEP 8
22
- - Add docstrings
23
- - Include type hints where possible
24
 
25
  ## Reporting Issues
26
- Open an issue with a clear description and example code.
1
+ # Contributing to Stack 2.9
2
 
3
+ > Last updated: April 2026
4
 
5
+ Thank you for your interest in contributing to Stack 2.9! This document outlines how you can help.
6
 
7
+ ## Project State
8
+
9
+ **Before contributing, understand where the project stands:**
10
+
11
+ | Area | Status | Notes |
12
+ |------|--------|-------|
13
+ | Basic code generation | ✅ Working | Main strength of the model |
14
+ | Tool calling | ⚠️ Not trained | Needs fine-tuning on tool patterns |
15
+ | Benchmark scores | ⚠️ Pending | Full evaluation not yet run |
16
+ | Self-evolution | 🔧 Incomplete | Components exist but not connected |
17
+ | Documentation | 🔧 In progress | Some areas need work |
18
+
19
+ ## Quick Start
20
+
21
+ ```bash
22
+ # 1. Fork the repository
23
+ # (git has no "fork" command — use the GitHub web UI, or: gh repo fork my-ai-stack/stack-2.9 --clone=false)
24
+
25
+ # 2. Clone your fork
26
+ git clone https://github.com/YOUR_USER/stack-2.9.git
27
+ cd stack-2.9
28
+
29
+ # 3. Create a virtual environment
30
+ python -m venv .venv
31
+ source .venv/bin/activate # Linux/Mac
32
+ # or .venv\Scripts\activate on Windows
33
+
34
+ # 4. Install dependencies
35
+ pip install -r requirements.txt
36
+ ```
37
+
38
+ ## What to Work On
39
+
40
+ ### High Priority
41
+
42
+ 1. **Evaluation** - Run full HumanEval/MBPP benchmarks
43
+ - See `stack/eval/run_proper_evaluation.py`
44
+ - Requires: Python, Ollama or API key
45
+
46
+ 2. **Tool calling tests** - Test and document tool usage
47
+ - Run `python stack.py -c "Your command here"`
48
+ - Report what works/doesn't in issues
49
+
50
+ 3. **Documentation** - Improve tool definitions, API docs
51
+ - Check `docs/TOOLS.md` for accuracy
52
+ - Update `stack/internal/ARCHITECTURE.md`
53
+
54
+ ### Medium Priority
55
+
56
+ 4. **Training scripts** - Improve fine-tuning pipeline
57
+ - See `stack/training/`
58
+ - ⚠️ Do NOT modify Kaggle notebook or training data generation
59
+
60
+ 5. **Deployment** - Fix deployment scripts
61
+ - See `stack/deploy/`, `runpod_deploy.sh`
62
+
63
+ ### Lower Priority
64
+
65
+ 6. **Pattern Memory** - Connect Observer → Learner → Memory → Trainer
66
+ 7. **Voice integration** - Test end-to-end voice pipeline
67
+ 8. **MCP support** - Improve Model Context Protocol integration
68
+
69
+ ## What NOT to Touch
70
+
71
+ ⚠️ **Do NOT modify without explicit approval:**
72
+
73
+ - `kaggle_train_stack29_v5.ipynb` - Kaggle training notebook
74
+ - `colab_train_stack29.ipynb` - Colab training notebook
75
+ - Training data generation scripts in `data/`
76
+ - Model weights in `base_model_qwen7b/`
77
+
78
+ These are core training components. Changes here affect the model itself.
79
 
80
  ## Code Style
81
+
82
+ - **Python:** Follow PEP 8, use type hints where possible
83
+ - **TypeScript:** Use strict mode, add JSDoc comments
84
+ - **Shell:** Use `shellcheck` on bash scripts
85
+ - **General:** Add docstrings to new functions, include examples
86
+
87
+ ### Pre-commit Checks
88
+
89
+ ```bash
90
+ # Run tests before submitting
91
+ pytest samples/ -v
92
+
93
+ # Check code formatting
94
+ ruff check src/ samples/ --fix
95
+ black src/ samples/
96
+ ```
97
+
98
+ ## Submitting a PR
99
+
100
+ ```bash
101
+ # Create a feature branch
102
+ git checkout -b feature/your-feature-name
103
+
104
+ # Make your changes
105
+ # ... edit files ...
106
+
107
+ # Run tests
108
+ pytest samples/ -v
109
+
110
+ # Commit with clear message
111
+ git commit -m "Add: description of what you changed"
112
+
113
+ # Push to your fork
114
+ git push origin feature/your-feature-name
115
+
116
+ # Open a Pull Request
117
+ # Fill in the PR template with:
118
+ # - What you changed
119
+ # - Why it's needed
120
+ # - Testing you did
121
+ # - Screenshots if applicable
122
+ ```
123
+
124
+ ## Pull Request Guidelines
125
+
126
+ 1. **Describe the change clearly** - What does this fix or add?
127
+ 2. **Link related issues** - Use "Fixes #123" if applicable
128
+ 3. **Include tests** - Add unit tests for new features
129
+ 4. **Update docs** - If you add a feature, document it
130
+ 5. **Be patient** - Reviewers may take a few days to respond
131
 
132
  ## Reporting Issues
133
+
134
+ When reporting bugs:
135
+
136
+ ```markdown
137
+ ## Description
138
+ Brief description of the issue
139
+
140
+ ## Steps to Reproduce
141
+ 1. Run `...`
142
+ 2. See error
143
+
144
+ ## Expected Behavior
145
+ What should happen
146
+
147
+ ## Actual Behavior
148
+ What actually happened
149
+
150
+ ## Environment
151
+ - OS:
152
+ - Python version:
153
+ - Provider: (ollama/openai/etc)
154
+ - Model:
155
+ ```
156
+
157
+ ## Communication
158
+
159
+ - **Issues:** GitHub Issues for bugs/features
160
+ - **Discussions:** GitHub Discussions for questions
161
+ - **Discord:** Link in README
162
+
163
+ ## Recognition
164
+
165
+ Contributors will be listed in:
166
+ - README.md "Acknowledgments" section
167
+ - CONTRIBUTORS file (if created)
168
+
169
+ ---
170
+
171
+ **Questions?** Open a GitHub Discussion or ask in Discord.
README_QUICKSTART.md ADDED
@@ -0,0 +1,151 @@
1
+ # Stack 2.9 — Quick Start
2
+
3
+ > **AI coding assistant powered by Qwen2.5-Coder-32B with Pattern Memory.**
4
+
5
+ ```bash
6
+ git clone https://github.com/my-ai-stack/stack-2.9.git
7
+ cd stack-2.9
8
+ pip install -r requirements.txt
9
+ cp .env.example .env
10
+ python stack.py
11
+ ```
12
+
13
+ That's it. Keep reading for details.
14
+
15
+ ---
16
+
17
+ ## Prerequisites
18
+
19
+ - **Python 3.10+**
20
+ - **GPU** (optional — runs on CPU via cloud providers too)
21
+ - **Git**
22
+
23
+ ---
24
+
25
+ ## Install & Run
26
+
27
+ ```bash
28
+ # Clone
29
+ git clone https://github.com/my-ai-stack/stack-2.9.git
30
+ cd stack-2.9
31
+
32
+ # Install
33
+ python3 -m venv venv && source venv/bin/activate
34
+ pip install -r requirements.txt
35
+
36
+ # Configure (pick a provider below, then edit .env)
37
+ cp .env.example .env
38
+
39
+ # Run!
40
+ python stack.py
41
+ ```
42
+
43
+ ---
44
+
45
+ ## Configure Your Model Provider
46
+
47
+ Edit `.env` with one of these:
48
+
49
+ ### Ollama (Local, Private) — Recommended
50
+ ```env
51
+ MODEL_PROVIDER=ollama
52
+ OLLAMA_MODEL=qwen2.5-coder:32b
53
+ ```
54
+ ```bash
55
+ # First: curl -fsSL https://ollama.ai/install.sh | sh && ollama pull qwen2.5-coder:32b
56
+ ```
57
+
58
+ ### Together AI (Cloud, Fast)
59
+ ```env
60
+ MODEL_PROVIDER=together
61
+ TOGETHER_API_KEY=tog-your-key-here
62
+ TOGETHER_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct
63
+ ```
64
+
65
+ ### OpenAI (GPT-4o)
66
+ ```env
67
+ MODEL_PROVIDER=openai
68
+ OPENAI_API_KEY=sk-your-key-here
69
+ OPENAI_MODEL=gpt-4o
70
+ ```
71
+
72
+ ### Anthropic (Claude)
73
+ ```env
74
+ MODEL_PROVIDER=anthropic
75
+ ANTHROPIC_API_KEY=sk-ant-your-key-here
76
+ ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
77
+ ```
78
+
79
+ ---
80
+
81
+ ## Usage
82
+
83
+ ### Interactive Chat
84
+ ```bash
85
+ python stack.py
86
+ ```
87
+
88
+ ### Single Query
89
+ ```bash
90
+ python stack.py -c "Write a Python function to reverse a string"
91
+ ```
92
+
93
+ ### Evaluate Model (GPU required)
94
+ ```bash
95
+ python evaluate_model.py --model-path ./output/merged --benchmark humaneval
96
+ ```
97
+
98
+ ### Deploy with Docker
99
+ ```bash
100
+ docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9
101
+ ```
102
+
103
+ ---
104
+
105
+ ## 5-Minute Overview
106
+
107
+ | Feature | Command |
108
+ |---------|---------|
109
+ | Start chatting | `python stack.py` |
110
+ | Ask one question | `python stack.py -c "your question"` |
111
+ | Run benchmarks | `python evaluate_model.py --model-path ./merged --benchmark both` |
112
+ | List patterns | `python stack.py --patterns list` |
113
+ | Deploy locally | `docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9` |
114
+
115
+ ---
116
+
117
+ ## Hardware Requirements
118
+
119
+ | Model | Minimum (4-bit) | Recommended |
120
+ |-------|---------|------------|
121
+ | 7B | RTX 3060 (6GB) | A100 40GB |
122
+ | 32B | RTX 3090 (24GB) | A100 80GB |
123
+
124
+ No GPU? Use Ollama on your machine or any cloud provider in `.env`.
125
+
126
+ ---
127
+
128
+ ## Key Links
129
+
130
+ - 📖 **Full docs:** [docs/QUICKSTART.md](docs/QUICKSTART.md)
131
+ - 🔧 **46 tools:** [TOOLS.md](TOOLS.md)
132
+ - 🧠 **Pattern memory:** [docs/pattern-moat.md](docs/pattern-moat.md)
133
+ - 🚀 **Training guide:** [docs/TRAINING_7B.md](docs/TRAINING_7B.md)
134
+ - 🐳 **Kubernetes:** [k8s/](k8s/)
135
+
136
+ ---
137
+
138
+ ## What's Inside
139
+
140
+ - **Qwen2.5-Coder-32B** — 32B parameter code-specialized model
141
+ - **Pattern Memory** — learns from successful interactions
142
+ - **46 built-in tools** — file ops, git, shell, search, memory, tasks
143
+ - **Multi-provider** — Ollama, OpenAI, Anthropic, Together AI, OpenRouter
144
+ - **128K context** — handles large codebases
145
+ - **Self-hosted** — full control, private
146
+ - **MCP support** — integrates with any Model Context Protocol server
147
+ - **Voice-ready** — Coqui XTTS for voice cloning
148
+
149
+ ---
150
+
151
+ *Built with ❤️ for developers who want an AI that grows with them.*
docs/API.md ADDED
@@ -0,0 +1,345 @@
1
+ # Stack 2.9 Inference API Documentation
2
+
3
+ REST API for code generation using the Stack 2.9 fine-tuned Qwen model.
4
+
5
+ ## Quick Start
6
+
7
+ ### 1. Install Dependencies
8
+
9
+ ```bash
10
+ pip install -r requirements_api.txt
11
+ pip install -r requirements.txt # Core dependencies (transformers, torch, etc.)
12
+ ```
13
+
14
+ ### 2. Set Model Path
15
+
16
+ ```bash
17
+ # Option A: Environment variable
18
+ export MODEL_PATH=/path/to/your/merged/model
19
+
20
+ # Option B: Direct parameter
21
+ MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000
22
+ ```
23
+
24
+ ### 3. Start the Server
25
+
26
+ ```bash
27
+ # Basic usage
28
+ uvicorn inference_api:app --host 0.0.0.0 --port 8000
29
+
30
+ # With auto-reload (development)
31
+ uvicorn inference_api:app --reload --port 8000
32
+
33
+ # Using Python directly
34
+ python inference_api.py
35
+ ```
36
+
37
+ ### 4. Verify It's Running
38
+
39
+ ```bash
40
+ curl http://localhost:8000/health
41
+ ```
42
+
43
+ Expected response:
44
+ ```json
45
+ {
46
+ "status": "healthy",
47
+ "model_loaded": true,
48
+ "model_path": "base_model_qwen7b",
49
+ "device": "cuda",
50
+ "cuda_available": true
51
+ }
52
+ ```
53
+
54
+ ---
55
+
56
+ ## API Endpoints
57
+
58
+ ### `GET /health`
59
+
60
+ Health check endpoint to verify API and model status.
61
+
62
+ **Response:**
63
+ ```json
64
+ {
65
+ "status": "healthy",
66
+ "model_loaded": true,
67
+ "model_path": "/path/to/model",
68
+ "device": "cuda",
69
+ "cuda_available": true
70
+ }
71
+ ```
72
+
73
+ ---
74
+
75
+ ### `GET /model-info`
76
+
77
+ Get information about the currently loaded model.
78
+
79
+ **Response:**
80
+ ```json
81
+ {
82
+ "model_path": "/path/to/model",
83
+ "device": "cuda:0",
84
+ "dtype": "torch.float16"
85
+ }
86
+ ```
87
+
88
+ ---
89
+
90
+ ### `POST /generate`
91
+
92
+ Generate code completion for a prompt.
93
+
94
+ **Request Body:**
95
+ ```json
96
+ {
97
+ "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
98
+ "max_tokens": 128,
99
+ "temperature": 0.2,
100
+ "top_p": 0.95,
101
+ "do_sample": true,
102
+ "repetition_penalty": 1.1,
103
+ "num_return_sequences": 1
104
+ }
105
+ ```
106
+
107
+ **Parameters:**
108
+ | Parameter | Type | Default | Range | Description |
109
+ |-----------|------|---------|-------|-------------|
110
+ | `prompt` | string | required | - | Input prompt to complete |
111
+ | `max_tokens` | int | 512 | 1-4096 | Maximum tokens to generate |
112
+ | `temperature` | float | 0.2 | 0.0-2.0 | Sampling temperature (higher = more creative) |
113
+ | `top_p` | float | 0.95 | 0.0-1.0 | Nucleus sampling threshold |
114
+ | `do_sample` | bool | true | - | Whether to use sampling vs greedy |
115
+ | `repetition_penalty` | float | 1.1 | 1.0-2.0 | Penalize repeated tokens |
116
+ | `num_return_sequences` | int | 1 | 1-10 | Number of sequences to generate |
117
+
118
+ **Response:**
119
+ ```json
120
+ {
121
+ "generated_text": " seen = {}\n for i, num in enumerate(nums):\n complement = target - num\n if complement in seen:\n return [seen[complement], i]\n seen[num] = i\n return []",
122
+ "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
123
+ "model": "base_model_qwen7b",
124
+ "num_tokens": 45,
125
+ "finish_reason": "stop"
126
+ }
127
+ ```
128
+
129
+ **Example with curl:**
130
+ ```bash
131
+ curl -X POST http://localhost:8000/generate \
132
+ -H "Content-Type: application/json" \
133
+ -d '{
134
+ "prompt": "def fibonacci(n):\n \"\"\"Return first n Fibonacci numbers.\"\"\"\n",
135
+ "max_tokens": 100,
136
+ "temperature": 0.2
137
+ }'
138
+ ```
139
+
140
+ ---
141
+
142
+ ### `POST /chat`
143
+
144
+ Conversational interface for multi-turn interactions.
145
+
146
+ **Request Body:**
147
+ ```json
148
+ {
149
+ "messages": [
150
+ {"role": "user", "content": "Write a function to reverse a string in Python"},
151
+ {"role": "assistant", "content": "def reverse_string(s):\n return s[::-1]"},
152
+ {"role": "user", "content": "Make it recursive instead"}
153
+ ],
154
+ "max_tokens": 128,
155
+ "temperature": 0.2,
156
+ "top_p": 0.95
157
+ }
158
+ ```
159
+
160
+ **Message Roles:**
161
+ - `user` - User's message
162
+ - `assistant` - Model's previous response (for conversation history)
163
+
164
+ **Response:**
165
+ ```json
166
+ {
167
+ "message": {
168
+ "role": "assistant",
169
+ "content": "def reverse_string(s):\n if len(s) <= 1:\n return s\n return s[-1] + reverse_string(s[:-1])"
170
+ },
171
+ "model": "base_model_qwen7b",
172
+ "num_tokens": 67,
173
+ "finish_reason": "stop"
174
+ }
175
+ ```
176
+
177
+ **Example with curl:**
178
+ ```bash
179
+ curl -X POST http://localhost:8000/chat \
180
+ -H "Content-Type: application/json" \
181
+ -d '{
182
+ "messages": [
183
+ {"role": "user", "content": "Write a binary search function"}
184
+ ],
185
+ "max_tokens": 150
186
+ }'
187
+ ```
188
+
189
+ ---
190
+
191
+ ### `POST /generate/raw`
192
+
193
+ Same as `/generate` but returns raw output without extracting code from markdown blocks.
194
+
195
+ **Example with curl:**
196
+ ```bash
197
+ curl -X POST http://localhost:8000/generate/raw \
198
+ -H "Content-Type: application/json" \
199
+ -d '{
200
+ "prompt": "def quick_sort(arr):",
201
+ "max_tokens": 200
202
+ }'
203
+ ```
204
+
205
+ ---
206
+
207
+ ### `POST /extract-code`
208
+
209
+ Extract code from a text response that may contain markdown code blocks.
210
+
211
+ **Request Body:**
212
+ ```json
213
+ {
214
+ "prompt": "```python\ndef hello():\n print(\"world\")\n```"
215
+ }
216
+ ```
217
+
218
+ **Response:**
219
+ ```json
220
+ {
221
+ "code": "def hello():\n print(\"world\")"
222
+ }
223
+ ```
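The extraction step can be sketched in a few lines. This is an illustrative guess at the logic, not the actual implementation in `inference_api.py` — in particular the regex and the fallback-to-plain-text behavior are assumptions:

```python
import re

def extract_code(text: str) -> str:
    """Return the contents of the first fenced code block, or the text as-is."""
    # Match ```lang\n ... ``` (language tag optional), non-greedy across lines.
    match = re.search(r"```[a-zA-Z0-9_+-]*\n(.*?)```", text, re.DOTALL)
    if match:
        return match.group(1).rstrip("\n")
    return text.strip()

code = extract_code('```python\ndef hello():\n    print("world")\n```')
# code == 'def hello():\n    print("world")'
```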
224
+
225
+ ---
226
+
227
+ ## Environment Variables
228
+
229
+ | Variable | Default | Description |
230
+ |----------|---------|-------------|
231
+ | `MODEL_PATH` | `base_model_qwen7b` | Path to model directory |
232
+ | `DEVICE` | `cuda` (if available) | Device to use: `cuda` or `cpu` |
233
+ | `PORT` | `8000` | Server port |
234
+ | `HOST` | `0.0.0.0` | Server host |
235
+ | `RELOAD` | `false` | Enable auto-reload for development |
236
+ | `DEFAULT_MAX_TOKENS` | `512` | Default max tokens |
237
+ | `DEFAULT_TEMPERATURE` | `0.2` | Default temperature |
238
+ | `DEFAULT_TOP_P` | `0.95` | Default top_p |
239
+
240
+ ---
241
+
242
+ ## Usage Examples
243
+
244
+ ### Python Client
245
+
246
+ ```python
247
+ import requests
248
+
249
+ API_URL = "http://localhost:8000"
250
+
251
+ # Health check
252
+ health = requests.get(f"{API_URL}/health").json()
253
+ print(f"Model loaded: {health['model_loaded']}")
254
+
255
+ # Code completion
256
+ response = requests.post(
257
+ f"{API_URL}/generate",
258
+ json={
259
+ "prompt": "def merge_sort(arr):\n \"\"\"Return sorted array.\"\"\"\n",
260
+ "max_tokens": 200,
261
+ "temperature": 0.3,
262
+ }
263
+ ).json()
264
+
265
+ print(response["generated_text"])
266
+ ```
267
+
268
+ ### JavaScript/Node.js Client
269
+
270
+ ```javascript
271
+ const API_URL = "http://localhost:8000";
272
+
273
+ // Code completion
274
+ async function generate(prompt) {
275
+ const response = await fetch(`${API_URL}/generate`, {
276
+ method: "POST",
277
+ headers: { "Content-Type": "application/json" },
278
+ body: JSON.stringify({
279
+ prompt,
280
+ max_tokens: 128,
281
+ temperature: 0.2,
282
+ }),
283
+ });
284
+ return response.json();
285
+ }
286
+
287
+ const result = await generate("def binary_search(arr, target):");
288
+ console.log(result.generated_text);
289
+ ```
290
+
291
+ ### Using with OpenAI SDK (with base_url replacement)
292
+
293
+ ```python
294
+ from openai import OpenAI
295
+
296
+ client = OpenAI(
297
+ api_key="not-needed",
298
+ base_url="http://localhost:8000"
299
+ )
300
+
301
+ # Note: This works for basic completions but may need adapter code
302
+ # for full OpenAI compatibility
303
+ response = client.completions.create(
304
+ model="stack-2.9",
305
+ prompt="def factorial(n):",
306
+ max_tokens=100,
307
+ )
308
+ ```
309
+
310
+ ---
311
+
312
+ ## Performance Tips
313
+
314
+ 1. **GPU Recommended**: For fastest inference, run on GPU with CUDA
315
+ 2. **Batch Processing**: For multiple prompts, process sequentially (model is loaded once)
316
+ 3. **Memory**: Ensure adequate GPU memory; reduce `max_tokens` if needed
317
+ 4. **Temperature**: Use lower temperature (0.1-0.3) for deterministic code, higher for creative tasks
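Why the temperature tip works: logits are divided by the temperature before the softmax, so T < 1 sharpens the distribution toward the top token while T > 1 flattens it. A small self-contained illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.2)   # near-greedy: top token dominates
flat = softmax_with_temperature(logits, 1.5)    # more exploratory sampling
```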
318
+
319
+ ---
320
+
321
+ ## Error Handling
322
+
323
+ **503 Service Unavailable**: Model not loaded or loading failed
324
+ ```json
325
+ {"detail": "Model not loaded. Check /health for status."}
326
+ ```
327
+
328
+ **500 Internal Server Error**: Generation failed
329
+ ```json
330
+ {"detail": "Generation failed: <error message>"}
331
+ ```
332
+
333
+ **400 Bad Request**: Invalid input
334
+ ```json
335
+ {"detail": "Last message must be from user"}
336
+ ```
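A client can treat 503 as transient (the model may still be loading) and everything else as fatal. A stdlib-only sketch — the retry policy here is a client-side assumption, not part of the API contract:

```python
import json
import time
import urllib.error
import urllib.request

def is_transient(status: int) -> bool:
    """503 means the model is not loaded yet, so a retry may succeed."""
    return status == 503

def post_with_retry(url, payload, retries=3, delay=2.0):
    """POST JSON, retrying transient errors with a fixed delay."""
    body = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if not is_transient(err.code) or attempt == retries - 1:
                raise
            time.sleep(delay)
```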
337
+
338
+ ---
339
+
340
+ ## Architecture Notes
341
+
342
+ - **Single Model Instance**: Model is loaded once at startup and reused
343
+ - **Synchronous Generation**: Uses `torch.no_grad()` for inference
344
+ - **CORS Enabled**: Accepts requests from any origin (configure for production)
345
+ - **No Authentication**: Add middleware (e.g., API key) for production deployments
docs/QUICKSTART.md ADDED
@@ -0,0 +1,415 @@
1
+ # Stack 2.9 — 5-Minute Quick Start
2
+
3
+ > **Goal:** Get Stack 2.9 running and solving coding tasks in under 5 minutes.
4
+
5
+ Stack 2.9 is an AI coding assistant powered by **Qwen2.5-Coder-32B** with Pattern Memory — it learns from your interactions and improves over time.
6
+
7
+ ---
8
+
9
+ ## 📋 Prerequisites
10
+
11
+ ### Required
12
+ | Requirement | Version | Check |
13
+ |-------------|---------|-------|
14
+ | Python | 3.10+ | `python3 --version` |
15
+ | Git | Any recent | `git --version` |
16
+ | pip | Latest | `pip --version` |
17
+
18
+ ### Optional (Recommended)
19
+ | Resource | Why You Need It | Minimum |
20
+ |----------|----------------|---------|
21
+ | **GPU** | Fast code generation | RTX 3070 / M1 Pro |
22
+ | **VRAM** | Run larger models locally | 8GB for 7B quantized; 24GB for 32B (4-bit) |
23
+
24
+ > **No GPU?** Stack 2.9 works on CPU via Ollama or cloud providers (OpenAI, Together AI, etc.).
25
+
26
+ ---
27
+
28
+ ## ⚡ Step 1 — Install in 60 Seconds
29
+
30
+ ```bash
31
+ # 1. Clone the repository
32
+ git clone https://github.com/my-ai-stack/stack-2.9.git
33
+ cd stack-2.9
34
+
35
+ # 2. Create a virtual environment (recommended)
36
+ python3 -m venv venv
37
+ source venv/bin/activate # On Windows: venv\Scripts\activate
38
+
39
+ # 3. Install dependencies
40
+ pip install --upgrade pip
41
+ pip install -r requirements.txt
42
+
43
+ # 4. Copy environment template
44
+ cp .env.example .env
45
+ ```
46
+
47
+ **That's it.** If you hit errors, see [Troubleshooting](#-troubleshooting) below.
48
+
49
+ ---
50
+
51
+ ## 🔑 Step 2 — Configure Your Model Provider
52
+
53
+ Stack 2.9 supports multiple LLM providers. **Pick one that matches your setup:**
54
+
55
+ ### Option A: Ollama (Recommended — Local, Private)
56
+
57
+ ```bash
58
+ # Install Ollama (macOS/Linux)
59
+ curl -fsSL https://ollama.ai/install.sh | sh
60
+
61
+ # Pull the Qwen model
62
+ ollama pull qwen2.5-coder:32b
63
+
64
+ # Set environment
65
+ export MODEL_PROVIDER=ollama
66
+ export OLLAMA_MODEL=qwen2.5-coder:32b
67
+ ```
68
+
69
+ Edit your `.env` file:
70
+ ```env
71
+ MODEL_PROVIDER=ollama
72
+ OLLAMA_MODEL=qwen2.5-coder:32b
73
+ ```
74
+
75
+ ### Option B: Together AI (Best for Qwen, Cloud)
76
+
77
+ ```bash
78
+ # Get your API key at https://together.ai
79
+ export TOGETHER_API_KEY=tog-your-key-here
80
+ ```
81
+
82
+ Edit your `.env`:
83
+ ```env
84
+ MODEL_PROVIDER=together
85
+ TOGETHER_API_KEY=tog-your-key-here
86
+ TOGETHER_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct
87
+ ```
88
+
89
+ ### Option C: OpenAI (GPT-4o)
90
+
91
+ ```env
92
+ MODEL_PROVIDER=openai
93
+ OPENAI_API_KEY=sk-your-key-here
94
+ OPENAI_MODEL=gpt-4o
95
+ ```
96
+
97
+ ### Option D: Anthropic (Claude)
98
+
99
+ ```env
100
+ MODEL_PROVIDER=anthropic
101
+ ANTHROPIC_API_KEY=sk-ant-your-key-here
102
+ ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
103
+ ```
104
+
105
+ ### Option E: OpenRouter (Unified Access)
106
+
107
+ ```env
108
+ MODEL_PROVIDER=openrouter
109
+ OPENROUTER_API_KEY=sk-or-your-key-here
110
+ OPENROUTER_MODEL=openai/gpt-4o
111
+ ```
112
+
113
+ ---
114
+
115
+ ## 🚀 Step 3 — Run Your First Task
116
+
117
+ ### Interactive Chat Mode
118
+
119
+ ```bash
120
+ python stack.py
121
+ ```
122
+
123
+ You'll see:
124
+ ```
125
+ ╔══════════════════════════════════════════════╗
126
+ ║ Stack 2.9 — AI Coding Assistant ║
127
+ ║ Pattern Memory: Active | Tools: 46 ║
128
+ ╚══════════════════════════════════════════════╝
129
+
130
+ You: Write a Python function to reverse a string
131
+ ```
132
+
133
+ ### Single Query Mode
134
+
135
+ ```bash
136
+ python stack.py -c "Write a Python function to reverse a string"
137
+ ```
138
+
139
+ **Expected output:**
140
+ ```python
141
+ def reverse_string(s):
142
+ """Reverse a string and return it."""
143
+ return s[::-1]
144
+
145
+ # Or for a more robust version:
146
+ def reverse_string(s):
147
+ return ''.join(reversed(s))
148
+ ```
149
+
150
+ ### Ask About Your Codebase
151
+
152
+ ```bash
153
+ python stack.py -c "Find all Python files modified in the last week and list them"
154
+ ```
155
+
156
+ ### Generate and Run Code
157
+
158
+ ```bash
159
+ python stack.py -c "Create a hello world Flask app with one route"
160
+ ```
161
+
162
+ ---
163
+
164
+ ## 📊 Step 4 — Run Evaluation (Optional)
165
+
166
+ > **Note:** Evaluation requires a GPU with ~16GB VRAM or more.
167
+
168
+ ### Prepare Your Fine-Tuned Model
169
+
170
+ After training Stack 2.9 on your data, your merged model will be in:
171
+ ```
172
+ ./output/merged/
173
+ ```
174
+
175
+ ### Run HumanEval Benchmark
176
+
177
+ ```bash
178
+ python evaluate_model.py \
179
+ --model-path ./output/merged \
180
+ --benchmark humaneval \
181
+ --num-samples 10 \
182
+ --output results.json
183
+ ```
184
+
185
+ ### Run MBPP Benchmark
186
+
187
+ ```bash
188
+ python evaluate_model.py \
189
+ --model-path ./output/merged \
190
+ --benchmark mbpp \
191
+ --num-samples 10 \
192
+ --output results.json
193
+ ```
194
+
195
+ ### Run Both Benchmarks
196
+
197
+ ```bash
198
+ python evaluate_model.py \
199
+ --model-path ./output/merged \
200
+ --benchmark both \
201
+ --num-samples 10 \
202
+ --k-values 1,10 \
203
+ --output results.json
204
+ ```
205
+
206
+ **Expected output format:**
207
+ ```
208
+ ============================================================
209
+ HumanEval Results
210
+ ============================================================
211
+ pass@1: 65.00%
212
+ pass@10: 82.00%
213
+ Total problems evaluated: 12
214
+ ============================================================
215
+
216
+ ============================================================
217
+ MBPP Results
218
+ ============================================================
219
+ pass@1: 70.00%
220
+ pass@10: 85.00%
221
+ Total problems evaluated: 12
222
+ ============================================================
223
+ ```
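The pass@k figures above are conventionally computed with the unbiased estimator from the HumanEval paper: with n samples per problem and c of them correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A sketch of that formula — whether `evaluate_model.py` uses exactly this estimator is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any draw of k includes a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """results: list of (n_samples, n_correct) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

benchmark_pass_at_k([(10, 5), (10, 10)], k=1)  # → 0.75
```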
224
+
225
+ ### Quick Evaluation (5 Problems Only)
226
+
227
+ ```bash
228
+ python evaluate_model.py \
229
+ --model-path ./output/merged \
230
+ --benchmark humaneval \
231
+ --num-problems 5 \
232
+ --num-samples 5
233
+ ```
234
+
235
+ ---
236
+
237
+ ## 🐳 Step 5 — Deploy Stack 2.9
238
+
239
+ ### Deploy Locally with Docker
240
+
241
+ ```bash
242
+ # Start the container
243
+ docker build -t stack-2.9 .
244
+ docker run -p 7860:7860 \
245
+ -e MODEL_PROVIDER=ollama \
246
+ -e OLLAMA_MODEL=qwen2.5-coder:32b \
247
+ stack-2.9
248
+ ```
249
+
250
+ Access at: **http://localhost:7860**
251
+
252
+ ### Deploy to RunPod (Cloud GPU)
253
+
254
+ ```bash
255
+ # Edit runpod_deploy.sh with your config first
256
+ bash runpod_deploy.sh --gpu a100 --instance hourly
257
+ ```
258
+
259
+ ### Deploy to Kubernetes
260
+
261
+ ```bash
262
+ # 1. Edit k8s/secret.yaml with your HuggingFace token
263
+ # 2. Apply the manifests
264
+ kubectl apply -f k8s/namespace.yaml
265
+ kubectl apply -f k8s/secret.yaml
266
+ kubectl apply -f k8s/configmap.yaml
267
+ kubectl apply -f k8s/pvc.yaml
268
+ kubectl apply -f k8s/deployment.yaml
269
+ kubectl apply -f k8s/service.yaml
270
+
271
+ # Check status
272
+ kubectl get pods -n stack-29
273
+ kubectl logs -n stack-29 deployment/stack-29
274
+ ```
275
+
276
+ ### Hardware Requirements for Deployment
277
+
278
+ | Model Size | Minimum GPU | Recommended | Quantized (4-bit) |
279
+ |------------|-------------|-------------|-------------------|
280
+ | 7B | RTX 3070 (8GB) | A100 40GB | RTX 3060 (6GB) |
281
+ | 32B | A100 40GB | A100 80GB | RTX 3090 (24GB) |
282
+
283
+ ---
284
+
285
+ ## 🧠 Pattern Memory Quick Guide
286
+
287
+ Stack 2.9 stores successful patterns to help with future tasks.
288
+
289
+ ### List Your Patterns
290
+
291
+ ```bash
292
+ python stack.py --patterns list
293
+ python stack.py --patterns stats
294
+ ```
295
+
296
+ ### Extract Patterns from Your Git History
297
+
298
+ ```bash
299
+ python scripts/extract_patterns_from_git.py \
300
+ --repo-path . \
301
+ --output patterns.jsonl \
302
+ --since-date "2024-01-01"
303
+ ```
304
+
305
+ ### Merge LoRA Adapters (Team Sharing)
306
+
307
+ ```bash
308
+ python scripts/merge_lora_adapters.py \
309
+ --adapters adapter_a.safetensors adapter_b.safetensors \
310
+ --weights 0.7 0.3 \
311
+ --output merged.safetensors
312
+ ```
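Conceptually, the weighted merge is an element-wise weighted average of matching adapter tensors. A toy sketch with plain lists standing in for tensors — the real script operates on safetensors state dicts, so the structure here is purely illustrative:

```python
def merge_adapters(adapters, weights):
    """adapters: list of {tensor_name: [floats]} dicts; weights: matching floats summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = {}
    for name in adapters[0]:
        vectors = [a[name] for a in adapters]
        # weighted average, element by element
        merged[name] = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(vectors[0]))
        ]
    return merged

a = {"lora_A": [1.0, 0.0]}
b = {"lora_A": [0.0, 1.0]}
merged = merge_adapters([a, b], [0.7, 0.3])  # → {"lora_A": [0.7, 0.3]}
```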
313
+
314
+ ---
315
+
316
+ ## 🛠️ Troubleshooting
317
+
318
+ ### "Module not found" errors
319
+
320
+ ```bash
321
+ pip install -r requirements.txt
322
+ ```
323
+
324
+ ### "CUDA out of memory" during evaluation
325
+
326
+ ```bash
327
+ # Reduce batch size
328
+ python evaluate_model.py --model-path ./merged --num-samples 5
329
+
330
+ # Or use 4-bit quantization
331
+ # (See docs/TRAINING_7B.md for quantized training)
332
+ ```
333
+
334
+ ### "Model not found" with Ollama
335
+
336
+ ```bash
337
+ ollama pull qwen2.5-coder:32b
338
+ ollama list # Verify it's installed
339
+ ```
340
+
341
+ ### "API key not set" errors
342
+
343
+ ```bash
344
+ # Double-check your .env file
345
+ cat .env
346
+
347
+ # For testing, you can also set inline
348
+ export TOGETHER_API_KEY=tog-your-key
349
+ ```
350
+
351
+ ### Slow inference on CPU
352
+
353
+ ```bash
354
+ # Use a smaller model
355
+ export OLLAMA_MODEL=qwen2.5-coder:7b
356
+
357
+ # Or switch to cloud
358
+ export MODEL_PROVIDER=together
359
+ ```
360
+
361
+ ### Docker build fails
362
+
363
+ ```bash
364
+ # Use Python 3.10 explicitly
365
+ docker build --build-arg PYTHON_VERSION=3.10 -t stack-2.9 .
366
+ ```
367
+
368
+ ### Kubernetes GPU not found
369
+
370
+ ```bash
371
+ # Verify nvidia.com/gpu label on your node
372
+ kubectl get nodes -L nvidia.com/gpu
373
+
374
+ # Install NVIDIA GPU Operator if missing
375
+ # https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/
376
+ ```
377
+
378
+ ---
379
+
380
+ ## 📚 What's Next?
381
+
382
+ | Goal | Go To |
383
+ |------|-------|
384
+ | Train on my own data | `docs/TRAINING_7B.md` |
385
+ | Learn all 46 tools | `TOOLS.md` |
386
+ | Set up team pattern sharing | `docs/pattern-moat.md` |
387
+ | Understand the architecture | `docs/reference/ARCHITECTURE.md` |
388
+ | Report a bug | `SECURITY.md` / GitHub Issues |
389
+
390
+ ---
391
+
392
+ ## ⚡ Quick Reference Card
393
+
394
+ ```bash
395
+ # Install
396
+ git clone https://github.com/my-ai-stack/stack-2.9.git
397
+ cd stack-2.9 && pip install -r requirements.txt
398
+
399
+ # Configure
400
+ cp .env.example .env # Edit with your API keys
401
+
402
+ # Run
403
+ python stack.py # Interactive
404
+ python stack.py -c "your code request" # Single query
405
+
406
+ # Evaluate
407
+ python evaluate_model.py --model-path ./merged --benchmark humaneval
408
+
409
+ # Deploy
410
+ docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9
411
+ ```
412
+
413
+ ---
414
+
415
+ *Stack 2.9 — AI that learns your patterns and grows with you.*
docs/ROADMAP.md ADDED
@@ -0,0 +1,143 @@
1
+ # Stack 2.9 Roadmap
2
+
3
+ > Last updated: April 2026
4
+
5
+ ## Current Status
6
+
7
+ ### What's Working ✅
8
+
9
+ - **Basic code generation** - The model can generate Python, JavaScript, and other code based on prompts
10
+ - **CLI interface** - Working command-line interface (`stack.py`, `src/cli/`)
11
+ - **Multi-provider support** - Ollama, OpenAI, Anthropic, OpenRouter, Together AI integrations
12
+ - **46 built-in tools** - File operations, git, shell, web search, memory, task planning
13
+ - **Pattern Memory infrastructure** - Observer, Learner, Memory components implemented
14
+ - **Training pipeline** - LoRA fine-tuning scripts, data preparation, model merging
15
+ - **Deployment options** - Docker, RunPod, Vast.ai, Kubernetes, HuggingFace Spaces
16
+ - **128K context window** - Extended from base model's 32K
17
+
18
+ ### What's Broken or Missing ⚠️
19
+
20
+ - **Tool calling not trained** - Model doesn't reliably use tools; needs fine-tuning on tool patterns
21
+ - **Benchmark scores unverifiable** - Previous claims removed after audit found only 20/164 HumanEval problems tested
22
+ - **Self-evolution not functional** - Observer/Learner components exist but not connected to training pipeline
23
+ - **Voice integration incomplete** - Coqui XTTS integration present but not tested
24
+ - **Evaluation infrastructure in progress** - New proper evaluation framework built but not run on full benchmarks
25
+
26
+ ### What Needs Testing 🔧
27
+
28
+ - Full HumanEval (164 problems) evaluation
29
+ - Full MBPP (500 problems) evaluation
30
+ - Tool-calling accuracy with real tasks
31
+ - Pattern Memory retrieval and effectiveness
32
+ - Voice input/output pipeline
33
+ - Multi-provider compatibility
34
+
35
+ ### What Needs Documentation 📚
36
+
37
+ - Tool definitions and schemas
38
+ - API reference (internal/ARCHITECTURE.md exists but needs updating)
39
+ - Pattern Memory usage guide
40
+ - Deployment troubleshooting
41
+ - Evaluation methodology
42
+
43
+ ---
44
+
45
+ ## Timeline with Milestones
46
+
47
+ ### Short-Term (1-2 Weeks)
48
+
49
+ | Milestone | Description | Status |
50
+ |-----------|-------------|--------|
51
+ | **S1.1** | Run full HumanEval (164 problems) with proper inference | Not started |
52
+ | **S1.2** | Run full MBPP (500 problems) with proper inference | Not started |
53
+ | **S1.3** | Document all 46 tool definitions in `docs/TOOLS.md` | In progress |
54
+ | **S1.4** | Fix evaluation scripts to use real model inference | Needed |
55
+ | **S1.5** | Create minimal reproducible test for tool calling | Not started |
56
+
57
+ **Owner:** Community contribution welcome
58
+
59
+ ### Medium-Term (1-3 Months)
60
+
61
+ | Milestone | Description | Status |
62
+ |-----------|-------------|--------|
63
+ | **M2.1** | Fine-tune model on tool-calling patterns (RTMP data) | Not started |
64
+ | **M2.2** | Implement and test self-evolution loop (Observer → Learner → Memory → Trainer) | Not started |
65
+ | **M2.3** | Run full benchmark evaluation and publish verified scores | Not started |
66
+ | **M2.4** | Add MCP server support for external tool integration | Partial |
67
+ | **M2.5** | Voice integration end-to-end testing | Not started |
68
+ | **M2.6** | Implement pattern extraction from production usage | Not started |
69
+
70
+ **Owner:** Requires training compute budget or community contribution
71
+
72
+ ### Long-Term (6+ Months)
73
+
74
+ | Milestone | Description | Status |
75
+ |-----------|-------------|--------|
76
+ | **L3.1** | RLHF training for improved tool selection | Future |
77
+ | **L3.2** | Team sync infrastructure (PostgreSQL + FastAPI) | Designed, not implemented |
78
+ | **L3.3** | Federated learning for privacy-preserving updates | Future |
79
+ | **L3.4** | Multi-modal support (images → code) | Future |
80
+ | **L3.5** | Real-time voice-to-voice conversation | Future |
81
+
82
+ **Owner:** Long-term vision, needs significant resources
83
+
84
+ ---
85
+
86
+ ## How to Contribute
87
+
88
+ ### By Priority
89
+
90
+ 1. **Run evaluations** - Help us verify benchmark scores by running `python stack_2_9_eval/run_proper_evaluation.py`
91
+ 2. **Test tool calling** - Try the model with various tools and report what works/doesn't
92
+ 3. **Documentation** - Improve docs, especially tool definitions and API reference
93
+ 4. **Bug reports** - Open issues with reproduction steps
94
+ 5. **Code contributions** - See CONTRIBUTING.md for guidelines
95
+
96
+ ### Contribution Areas
97
+
98
+ | Area | Skill Needed | Priority |
99
+ |------|--------------|----------|
100
+ | Evaluation | Python, ML benchmarking | High |
101
+ | Tool calling tests | Python, CLI usage | High |
102
+ | Documentation | Technical writing | Medium |
103
+ | Training scripts | PyTorch, PEFT | Medium |
104
+ | Deployment | Docker, K8s, Cloud | Low |
105
+ | Pattern Memory | Vector databases, ML | Low |
106
+
107
+ ### Quick Wins for Contributors
108
+
109
+ - Run `python stack.py -c "List files in current directory"` and report if tools work
110
+ - Review `stack/eval/results/` and verify evaluation logs
111
+ - Check `docs/TOOLS.md` accuracy against actual tool implementations
112
+ - Test with different providers (`--provider ollama|openai|anthropic`)
113
+
114
+ ---
115
+
116
+ ## Technical Notes
117
+
118
+ ### Known Limitations
119
+
120
+ 1. **Tool calling is not trained** - The base model has tool capabilities but Stack 2.9 hasn't been fine-tuned to use them reliably
121
+ 2. **Pattern Memory is read-only** - The system stores patterns but doesn't automatically retrain on them yet
122
+ 3. **Evaluation uses stub data** - Some eval scripts return pre-canned answers instead of running model
123
+ 4. **Voice integration untested** - Code exists but hasn't been validated end-to-end
124
+
125
+ ### Next Training Run Requirements
126
+
127
+ To fix tool calling, the next training run needs:
128
+
129
+ - Dataset: `data/rtmp-tools/combined_tools.jsonl` (already generated)
130
+ - Compute: ~1 hour on A100 for LoRA fine-tuning
131
+ - Configuration: Target tool_call logits, use `tool_use_examples.jsonl`
132
+
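As a sanity check on the "~1 hour on A100" estimate, the trainable-parameter count for a LoRA run of this kind is small. A back-of-envelope sketch (hidden size, layer count, rank, and the choice of four attention projections are assumptions for a 7B Qwen-style model, not measured values):

```python
def lora_trainable_params(hidden=3584, layers=28, r=16, projections=4):
    """Each adapted projection adds two low-rank matrices:
    A (hidden x r) and B (r x hidden)."""
    per_projection = 2 * hidden * r
    return layers * projections * per_projection

print(f"{lora_trainable_params():,}")  # ~12.8M trainable parameters
```

That is well under 1% of a 7B model, which is why a single-GPU LoRA pass over 1,500 examples is plausible in roughly an hour.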
133
+ ---
134
+
135
+ ## Contact
136
+
137
+ - **Issues:** https://github.com/my-ai-stack/stack-2.9/issues
138
+ - **Discussions:** https://github.com/my-ai-stack/stack-2.9/discussions
139
+ - **Discord:** (link in README)
140
+
141
+ ---
142
+
143
+ *This roadmap is a living document. Updates based on community feedback and project progress.*
docs/TOOL_DATA_ANALYSIS.md ADDED
@@ -0,0 +1,235 @@
1
+ # Tool Calling Training Data Analysis
2
+
3
+ **Generated:** 2026-04-06
4
+ **Files Analyzed:**
5
+ - `training-data/tool_examples.jsonl` (original)
6
+ - `training-data_v2/tool_examples.jsonl` (regenerated)
7
+
8
+ ---
9
+
10
+ ## Executive Summary
11
+
12
+ The original tool calling training data had **significant quality issues** that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors.
13
+
14
+ **Key Findings on Original Data:**
15
+ - ❌ 10.7% of tool calls use incorrect parameters (105 mismatched search queries, 2 wrong file paths)
16
+ - ❌ Heavy prompt duplication (7.5x average)
17
+ - ❌ No multi-step tool chains (only 1 tool per example)
18
+ - ❌ All examples use identical tool definitions
19
+
20
+ **Action Taken:** Generated 500 new examples using the project's generator script.
21
+
22
+ **Recommendation:** The original data needs substantial improvements before use in training.
23
+
24
+ ---
25
+
26
+ ## 1. Statistics Overview
27
+
28
+ ### Original Data (tool_examples.jsonl)
29
+
30
+ | Metric | Value |
31
+ |--------|-------|
32
+ | Total Examples | 1,000 |
33
+ | Unique Prompts | 133 |
34
+ | Average Duplication | 7.52x |
35
+ | Unique Tool Sequences | 5 |
36
+ | Examples with Issues | ~107 (10.7%) |
37
+
38
+ ### New Data (training-data_v2/tool_examples.jsonl)
39
+
40
+ | Metric | Value |
41
+ |--------|-------|
42
+ | Total Examples | 500 |
43
+ | File Size | 1.9 MB |
44
+ | Tools per Example | 5 (static definition) |
45
+
46
+ ### Tool Call Distribution (Original)
47
+
48
+ | Tool | Call Count |
49
+ |------|------------|
50
+ | Bash | 200 |
51
+ | FileRead | 200 |
52
+ | FileWrite | 200 |
53
+ | WebSearch | 200 |
54
+ | Grep | 200 |
55
+
56
+ All examples have exactly **one tool call** - no multi-step chains exist.
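Statistics like those above can be recomputed from any JSONL dataset with a short script. A sketch, assuming each record carries a `prompt` string and a `tool_calls` list (field names are assumptions and may differ from the actual schema):

```python
import json
from collections import Counter

def jsonl_stats(path):
    """Prompt duplication and tool-call distribution for a JSONL dataset."""
    prompts, tools, chains = Counter(), Counter(), Counter()
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            prompts[rec["prompt"]] += 1
            calls = rec.get("tool_calls", [])
            chains[len(calls)] += 1  # chain length = tool calls per example
            for call in calls:
                tools[call["name"]] += 1
    total, unique = sum(prompts.values()), len(prompts)
    return {
        "total": total,
        "unique_prompts": unique,
        "avg_duplication": round(total / unique, 2),
        "tool_distribution": dict(tools),
        "chain_lengths": dict(chains),
    }
```

Run against `training-data/tool_examples.jsonl`, this should reproduce the 7.52x duplication figure reported above.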
57
+
58
+ ---
59
+
60
+ ## 2. Prompt Diversity Analysis (Original Data)
61
+
62
+ ### Prompt Categories
63
+
64
+ | Category | Count | Percentage |
65
+ |----------|-------|------------|
66
+ | Python | 207 | 20.7% |
67
+ | React | 149 | 14.9% |
68
+ | File Read | 134 | 13.4% |
69
+ | File Write | 119 | 11.9% |
70
+ | Other | 114 | 11.4% |
71
+ | Run Command | 80 | 8.0% |
72
+ | Docker/K8s | 67 | 6.7% |
73
+ | Search | 50 | 5.0% |
74
+ | Git | 40 | 4.0% |
75
+ | Testing | 31 | 3.1% |
76
+ | Package Management | 9 | 0.9% |
77
+
78
+ ### Most Duplicated Prompts
79
+
80
+ | Prompt | Occurrences |
81
+ |--------|-------------|
82
+ | "Write a simple React component to src/components/Button.jsx" | 67 |
+ | "Run the tests with pytest" | 40 |
+ | "Run npm install to install dependencies" | 40 |
85
+
86
+ ---
87
+
88
+ ## 3. Tool Usage Breakdown
89
+
90
+ ### Tool Definitions
91
+
92
+ All 1,000 original examples use **identical tool definitions** with 5 tools:
93
+ - `Bash` - Execute bash commands
94
+ - `FileRead` - Read file contents
95
+ - `FileWrite` - Create/overwrite files
96
+ - `WebSearch` - Search the web
97
+ - `Grep` - Search for patterns in files
98
+
99
+ ### Tool Call Issues Found (Original Data)
100
+
101
+ #### Wrong Search Patterns (105 instances / 10.5%)
102
+
103
+ The `WebSearch` tool frequently uses queries that don't match the user's question:
104
+
105
+ | User Question | Actual Search Query |
106
+ |--------------|---------------------|
107
+ | "How do I use async/await in Python?" | "AWS Lambda cold start optimization" |
108
+ | "How do I use React hooks properly?" | "SQL join types explained" |
109
+ | "What's the difference between Docker and Kubernetes?" | "Git rebase vs merge" |
110
+ | "How do I use React hooks properly?" | "TypeScript generics tutorial" |
111
+ | "What's the difference between Docker and Kubernetes?" | "TypeScript generics tutorial" |
112
+
113
+ #### Wrong File Paths (2 instances)
114
+
115
+ The `FileWrite` tool sometimes writes to incorrect file types:
116
+
117
+ | User Request | Written Path |
118
+ |-------------|--------------|
119
+ | "Create a src/components/Header.jsx file" | Written to `config.json` |
120
+ | "Create a src/middleware.py file with settings" | Written to `config.yaml` |
121
+
122
+ #### Pattern/File Type Mismatches (Grep)
123
+
124
+ The `Grep` tool sometimes searches with mismatched patterns:
125
+
126
+ | Pattern | File Pattern | Issue |
127
+ |---------|-------------|-------|
128
+ | `class ` | `*.ts` | Python pattern in TypeScript files |
129
+ | `SELECT ` | `*.js` | SQL pattern in JavaScript files |
130
+ | `TODO` | `*.md` | Searching TODO in markdown files |
131
+
132
+ ---
133
+
134
+ ## 4. Data Quality Issues
135
+
136
+ ### Critical Issues
137
+
138
+ 1. **No Multi-Step Tool Chains**
139
+ - All 1,000 examples use exactly one tool call
140
+ - Real coding tasks typically require 2-5+ tool calls
141
+ - Example: "Read file → Find pattern → Search docs → Write fix"
142
+
143
+ 2. **Search Query Mismatches**
144
+ - 10.5% of WebSearch calls have irrelevant queries
145
+ - Indicates the generator script has logic errors
146
+
147
+ 3. **Heavy Prompt Duplication**
148
+ - 133 unique prompts duplicated to 1,000 examples
149
+ - "Write a simple React component" appears 67 times
150
+ - This creates overfitting to specific prompts
151
+
152
+ 4. **Identical Tool Definitions**
153
+ - All examples use the same 5 tools with identical descriptions
154
+ - No variation in tool schemas or parameter structures
155
+
156
+ ### Moderate Issues
157
+
158
+ 5. **File Path Hallucination**
159
+ - Tool calls reference files that don't exist in actual codebase
160
+ - Example: asking for `tests/test_main.py` but reading `src/app.js`
161
+
162
+ 6. **Response Fabrication**
163
+ - Assistant responses sometimes claim to show content that wasn't actually read
164
+ - Example: "Here's the README.md" when README.md wasn't the file requested
165
+
166
+ ---
167
+
168
+ ## 5. Recommendations for Improvement
169
+
170
+ ### Immediate Actions (Completed)
171
+
172
+ 1. ✅ **Regenerated Data**
173
+ ```
174
+ Generated 500 new examples in training-data_v2/tool_examples.jsonl
175
+ ```
176
+
177
+ ### Script Fixes Needed
178
+
179
+ The generator script (`scripts/generate_tool_data.py`) needs:
180
+
181
+ 1. Fix `TOOL_CALL_PAIRS` mapping - queries don't match questions
182
+ 2. Fix `FILE_PATTERNS` - wrong file types for requested content
183
+ 3. Add multi-step chain generation
184
+ 4. Add prompt variation templates
185
+ 5. Add validation to check query/content relevance
186
+
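For item 5, even a crude lexical check would have caught the mismatched WebSearch queries shown in section 3. A sketch of such a validator (the stop-word list and threshold are arbitrary illustrative choices, not tuned values):

```python
STOP_WORDS = {"the", "a", "an", "in", "to", "of", "and", "how", "do", "i",
              "what", "is", "are", "between", "use", "my", "with"}

def query_matches_question(question, query, threshold=0.2):
    """Flag WebSearch calls whose query shares almost no content words
    with the user's question. A cheap lexical check, not a semantic one."""
    def content_words(text):
        words = {w.strip("?.,!'\"").lower() for w in text.split()}
        return {w for w in words if w and w not in STOP_WORDS}
    question_words, query_words = content_words(question), content_words(query)
    if not query_words:
        return False
    return len(question_words & query_words) / len(query_words) >= threshold

# The mismatch from section 3 is caught:
query_matches_question("How do I use async/await in Python?",
                       "AWS Lambda cold start optimization")  # False
```

A generator-side check like this would reject an example before it is written, rather than requiring a post-hoc audit.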
187
+ ### Future Improvements
188
+
189
+ 1. **Add Multi-Step Examples**
190
+ - Real tasks require reading files, searching, editing
191
+ - Generate chains of 2-4 tool calls per example
192
+
193
+ 2. **Increase Prompt Diversity**
194
+ - Target 500+ unique prompts instead of duplicating
195
+ - Use template variations and paraphrasing
196
+
197
+ 3. **Vary Tool Definitions**
198
+ - Different tools per example
199
+ - Add tool variations (e.g., different Bash commands)
200
+
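To make improvement 1 concrete, a multi-step example could take roughly the following shape (the message/tool-call schema here is an assumed format for illustration; the generator should emit whatever structure the training pipeline actually expects):

```python
import json

def make_chain_example(prompt, steps):
    """Build one multi-step training example.

    steps: ordered list of (tool_name, arguments) the assistant should call.
    """
    messages = [{"role": "user", "content": prompt}]
    for tool_name, arguments in steps:
        messages.append({
            "role": "assistant",
            "tool_calls": [{"name": tool_name, "arguments": arguments}],
        })
    return json.dumps({"messages": messages})

# A read -> search -> write chain, mirroring a realistic workflow
example = make_chain_example(
    "Find and fix the failing import in src/app.py",
    [("FileRead", {"path": "src/app.py"}),
     ("Grep", {"pattern": "^import ", "file_pattern": "*.py"}),
     ("FileWrite", {"path": "src/app.py", "content": "# fixed module"})],
)
```

Chains of two to four calls like this would cover the "read file, find pattern, write fix" workflows that single-call examples cannot teach.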
201
+ ---
202
+
203
+ ## 6. Conclusion
204
+
205
+ The original `tool_examples.jsonl` data is **NOT suitable for production training** without significant improvements:
206
+
207
+ - ~10% of examples have incorrect tool parameters
208
+ - Heavy duplication leads to overfitting
209
+ - The absence of multi-step chains fails to represent real coding workflows
210
+ - Synthetic generation errors are systematic
211
+
212
+ **Action Completed:** Generated 500 new examples via the project's generator script.
213
+
214
+ **Remaining Work:** Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration.
215
+
216
+ ---
217
+
218
+ ## Appendix: Quick Stats
219
+
220
+ ### Original Data
221
+ ```
222
+ Total examples: 1,000
223
+ Unique prompts: 133
224
+ Tool call issues: 107 (10.7%)
225
+ Multi-tool chains: 0 (0%)
226
+ Identical tool defs: 100%
227
+ Average duplication: 7.52x
228
+ ```
229
+
230
+ ### New Data (Generated)
231
+ ```
232
+ Total examples: 500
233
+ File size: 1.9 MB
234
+ Location: training-data_v2/tool_examples.jsonl
235
+ ```
inference_api.py ADDED
@@ -0,0 +1,495 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ FastAPI Inference Server for Stack 2.9 Model
4
+ Provides REST API endpoints for code generation using fine-tuned Qwen models.
5
+
6
+ Usage:
7
+ # With default settings (model loaded from environment or config)
8
+ uvicorn inference_api:app --host 0.0.0.0 --port 8000
9
+
10
+ # With custom model path
11
+ MODEL_PATH=/path/to/model uvicorn inference_api:app --host 0.0.0.0 --port 8000
12
+
13
+ # With reload for development
14
+ uvicorn inference_api:app --reload --port 8000
15
+ """
16
+
17
+ import os
18
+ import logging
19
+ from contextlib import asynccontextmanager
20
+ from typing import List
21
+
22
+ import torch
23
+ from fastapi import FastAPI, HTTPException
24
+ from fastapi.middleware.cors import CORSMiddleware
25
+ from pydantic import BaseModel, Field
26
+
27
+ from transformers import AutoModelForCausalLM, AutoTokenizer
28
+
29
+ # Configure logging
30
+ logging.basicConfig(level=logging.INFO)
31
+ logger = logging.getLogger(__name__)
32
+
33
+ # Model configuration
34
+ MODEL_PATH = os.getenv("MODEL_PATH", "base_model_qwen7b")
35
+ DEVICE = os.getenv("DEVICE", "cuda" if torch.cuda.is_available() else "cpu")
36
+ DEFAULT_MAX_TOKENS = int(os.getenv("DEFAULT_MAX_TOKENS", "512"))
37
+ DEFAULT_TEMPERATURE = float(os.getenv("DEFAULT_TEMPERATURE", "0.2"))
38
+ DEFAULT_TOP_P = float(os.getenv("DEFAULT_TOP_P", "0.95"))
39
+
40
+ # Global model and tokenizer (loaded on startup)
41
+ model = None
42
+ tokenizer = None
43
+
44
+
45
+ @asynccontextmanager
46
+ async def lifespan(app: FastAPI):
47
+ """Load model on startup, cleanup on shutdown."""
48
+ global model, tokenizer
49
+ logger.info(f"Loading model from: {MODEL_PATH}")
50
+ logger.info(f"Using device: {DEVICE}")
51
+
52
+ try:
53
+ tokenizer = AutoTokenizer.from_pretrained(
54
+ MODEL_PATH,
55
+ trust_remote_code=True,
56
+ padding_side="left",
57
+ )
58
+
59
+ if tokenizer.pad_token is None:
60
+ tokenizer.pad_token = tokenizer.eos_token
61
+
62
+ model = AutoModelForCausalLM.from_pretrained(
63
+ MODEL_PATH,
64
+ torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
65
+ device_map="auto" if DEVICE == "cuda" else None,
66
+ low_cpu_mem_usage=True,
67
+ trust_remote_code=True,
68
+ )
69
+
70
+ if DEVICE == "cpu":
71
+ model = model.to(DEVICE)
72
+
73
+ model.eval()
74
+ logger.info("Model loaded successfully")
75
+ except Exception as e:
76
+ logger.error(f"Failed to load model: {e}")
77
+ raise
78
+
79
+ yield
80
+
81
+ # Cleanup
82
+ logger.info("Shutting down, cleaning up model...")
83
+ del model
84
+ del tokenizer
85
+ if torch.cuda.is_available():
86
+ torch.cuda.empty_cache()
87
+
88
+
89
+ app = FastAPI(
90
+ title="Stack 2.9 Inference API",
91
+ description="REST API for code generation using Stack 2.9 fine-tuned Qwen model",
92
+ version="1.0.0",
93
+ lifespan=lifespan,
94
+ )
95
+
96
+ # CORS middleware
97
+ app.add_middleware(
98
+ CORSMiddleware,
99
+ allow_origins=["*"],
100
+ allow_credentials=True,
101
+ allow_methods=["*"],
102
+ allow_headers=["*"],
103
+ )
104
+
105
+
106
+ # ============================================================================
107
+ # Request/Response Models
108
+ # ============================================================================
109
+
110
+ class GenerateRequest(BaseModel):
111
+ """Request body for /generate endpoint."""
112
+ prompt: str = Field(..., description="Input prompt/code to complete", min_length=1)
113
+ max_tokens: int = Field(DEFAULT_MAX_TOKENS, ge=1, le=4096, description="Max tokens to generate")
114
+ temperature: float = Field(DEFAULT_TEMPERATURE, ge=0.0, le=2.0, description="Sampling temperature")
115
+ top_p: float = Field(DEFAULT_TOP_P, ge=0.0, le=1.0, description="Nucleus sampling threshold")
116
+ do_sample: bool = Field(True, description="Whether to use sampling")
117
+ repetition_penalty: float = Field(1.1, ge=1.0, le=2.0, description="Repetition penalty")
118
+ num_return_sequences: int = Field(1, ge=1, le=10, description="Number of sequences to generate")
119
+
120
+ model_config = {
121
+ "json_schema_extra": {
122
+ "example": {
123
+ "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
124
+ "max_tokens": 128,
125
+ "temperature": 0.2,
126
+ "top_p": 0.95,
127
+ }
128
+ }
129
+ }
130
+
131
+
132
+ class GenerateResponse(BaseModel):
133
+ """Response body for /generate endpoint."""
134
+ generated_text: str
135
+ prompt: str
136
+ model: str
137
+ num_tokens: int
138
+ finish_reason: str = "length"
139
+
140
+
141
+ class ChatMessage(BaseModel):
142
+ """A single message in a conversation."""
143
+ role: str = Field(..., description="Role: 'user' or 'assistant'")
144
+ content: str = Field(..., description="Message content")
145
+
146
+
147
+ class ChatRequest(BaseModel):
148
+ """Request body for /chat endpoint."""
149
+ messages: List[ChatMessage] = Field(..., description="Conversation history")
150
+ max_tokens: int = Field(DEFAULT_MAX_TOKENS, ge=1, le=4096, description="Max tokens to generate")
151
+ temperature: float = Field(DEFAULT_TEMPERATURE, ge=0.0, le=2.0, description="Sampling temperature")
152
+ top_p: float = Field(DEFAULT_TOP_P, ge=0.0, le=1.0, description="Nucleus sampling threshold")
153
+ do_sample: bool = Field(True, description="Whether to use sampling")
154
+ repetition_penalty: float = Field(1.1, ge=1.0, le=2.0, description="Repetition penalty")
155
+
156
+ model_config = {
157
+ "json_schema_extra": {
158
+ "example": {
159
+ "messages": [
160
+ {"role": "user", "content": "Write a function to reverse a string in Python"},
161
+ {"role": "assistant", "content": "def reverse_string(s):\n return s[::-1]"},
162
+ {"role": "user", "content": "Make it recursive"},
163
+ ],
164
+ "max_tokens": 128,
165
+ "temperature": 0.2,
166
+ }
167
+ }
168
+ }
169
+
170
+
171
+ class ChatResponse(BaseModel):
172
+ """Response body for /chat endpoint."""
173
+ message: ChatMessage
174
+ model: str
175
+ num_tokens: int
176
+ finish_reason: str = "length"
177
+
178
+
179
+ class HealthResponse(BaseModel):
180
+ """Response body for /health endpoint."""
181
+ status: str
182
+ model_loaded: bool
183
+ model_path: str
184
+ device: str
185
+ cuda_available: bool
186
+
187
+
188
+ class ModelInfoResponse(BaseModel):
189
+ """Response body for /model-info endpoint."""
190
+ model_path: str
191
+ device: str
192
+ dtype: str
193
+
194
+
195
+ # ============================================================================
196
+ # Helper Functions
197
+ # ============================================================================
198
+
199
+ def format_chat_to_prompt(messages: List[ChatMessage]) -> str:
200
+ """
201
+ Format chat messages into a prompt for code generation.
202
+ Uses a simple instruction format suitable for Qwen.
203
+ """
204
+ formatted = []
205
+ for msg in messages:
206
+ if msg.role == "user":
207
+ formatted.append(f"<|im_start|>user\n{msg.content}<|im_end|>")
208
+ elif msg.role == "assistant":
209
+ formatted.append(f"<|im_start|>assistant\n{msg.content}<|im_end|>")
210
+
211
+ formatted.append("<|im_start|>assistant\n")
212
+ return "\n".join(formatted)
213
+
214
+
215
+ def generate_response(
216
+ prompt: str,
217
+ max_new_tokens: int,
218
+ temperature: float,
219
+ top_p: float,
220
+ do_sample: bool,
221
+ repetition_penalty: float,
222
+ num_return_sequences: int,
223
+ ) -> tuple[str, int, str]:
224
+ """
225
+ Generate response from model.
226
+
227
+ Returns:
228
+ tuple: (generated_text, num_tokens, finish_reason)
229
+ """
230
+ inputs = tokenizer(
231
+ prompt,
232
+ return_tensors="pt",
233
+ padding=True,
234
+ truncation=True,
235
+ )
236
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
237
+
238
+ with torch.no_grad():
239
+ outputs = model.generate(
240
+ **inputs,
241
+ max_new_tokens=max_new_tokens,
242
+ temperature=temperature,
243
+ top_p=top_p,
244
+ do_sample=do_sample,
245
+ repetition_penalty=repetition_penalty,
246
+ num_return_sequences=num_return_sequences,
247
+ pad_token_id=tokenizer.pad_token_id,
248
+ eos_token_id=tokenizer.eos_token_id,
249
+ )
250
+
251
+     # Decode only the newly generated tokens; decoding the full sequence and
+     # stripping the prompt as a string breaks when skip_special_tokens removes
+     # chat markers that were part of the prompt
+     prompt_len = inputs["input_ids"].shape[1]
+     generated_text = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
+
+     # Calculate number of generated tokens
+     num_tokens = outputs.shape[1] - prompt_len
260
+
261
+ generated_text = generated_text.strip()
262
+
263
+ # Determine finish reason
264
+ finish_reason = "stop"
265
+ if num_tokens >= max_new_tokens:
266
+ finish_reason = "length"
267
+
268
+ return generated_text, num_tokens, finish_reason
269
+
270
+
271
+ def extract_code_from_response(text: str) -> str:
272
+ """Extract code block from response if present."""
273
+ if "```python" in text:
274
+ start = text.find("```python") + len("```python")
275
+ end = text.find("```", start)
276
+ if end != -1:
277
+ return text[start:end].strip()
278
+ elif "```" in text:
279
+ start = text.find("```") + len("```")
280
+ # Skip potential language identifier
281
+ if "\n" in text[start:]:
282
+ start = text.find("\n", start) + 1
283
+ end = text.find("```", start)
284
+ if end != -1:
285
+ return text[start:end].strip()
286
+ return text
287
+
288
+
289
+ # ============================================================================
290
+ # API Endpoints
291
+ # ============================================================================
292
+
293
+ @app.get("/health", response_model=HealthResponse)
294
+ async def health_check():
295
+ """
296
+ Health check endpoint.
297
+
298
+ Returns the current status of the API and model.
299
+ """
300
+ return HealthResponse(
301
+ status="healthy" if model is not None else "model_not_loaded",
302
+ model_loaded=model is not None,
303
+ model_path=MODEL_PATH,
304
+ device=DEVICE,
305
+ cuda_available=torch.cuda.is_available(),
306
+ )
307
+
308
+
309
+ @app.get("/model-info", response_model=ModelInfoResponse)
310
+ async def get_model_info():
311
+ """
312
+ Get information about the loaded model.
313
+ """
314
+ if model is None:
315
+ raise HTTPException(status_code=503, detail="Model not loaded")
316
+
317
+ dtype = str(next(model.parameters()).dtype)
318
+
319
+ return ModelInfoResponse(
320
+ model_path=MODEL_PATH,
321
+ device=str(next(model.parameters()).device),
322
+ dtype=dtype,
323
+ )
324
+
325
+
326
+ @app.post("/generate", response_model=GenerateResponse)
327
+ async def generate(request: GenerateRequest):
328
+ """
329
+ Generate code completion for a prompt.
330
+
331
+ Takes a prompt and generates code completion based on the model.
332
+ Supports various generation parameters for controlling output.
333
+ """
334
+ if model is None or tokenizer is None:
335
+ raise HTTPException(
336
+ status_code=503,
337
+ detail="Model not loaded. Check /health for status."
338
+ )
339
+
340
+ try:
341
+ generated_text, num_tokens, finish_reason = generate_response(
342
+ prompt=request.prompt,
343
+ max_new_tokens=request.max_tokens,
344
+ temperature=request.temperature,
345
+ top_p=request.top_p,
346
+ do_sample=request.do_sample,
347
+ repetition_penalty=request.repetition_penalty,
348
+ num_return_sequences=request.num_return_sequences,
349
+ )
350
+
351
+ return GenerateResponse(
352
+ generated_text=generated_text,
353
+ prompt=request.prompt,
354
+ model=MODEL_PATH,
355
+ num_tokens=num_tokens,
356
+ finish_reason=finish_reason,
357
+ )
358
+ except Exception as e:
359
+ logger.error(f"Generation error: {e}")
360
+ raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")
361
+
362
+
363
+ @app.post("/generate/raw", response_model=GenerateResponse)
364
+ async def generate_raw(request: GenerateRequest):
365
+ """
366
+ Generate without extracting code from markdown blocks.
367
+
368
+ Returns the raw model output without any post-processing.
369
+ """
370
+ if model is None or tokenizer is None:
371
+ raise HTTPException(
372
+ status_code=503,
373
+ detail="Model not loaded. Check /health for status."
374
+ )
375
+
376
+ try:
377
+ # Get raw response
378
+ inputs = tokenizer(
379
+ request.prompt,
380
+ return_tensors="pt",
381
+ padding=True,
382
+ truncation=True,
383
+ )
384
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
385
+
386
+ with torch.no_grad():
387
+ outputs = model.generate(
388
+ **inputs,
389
+ max_new_tokens=request.max_tokens,
390
+ temperature=request.temperature,
391
+ top_p=request.top_p,
392
+ do_sample=request.do_sample,
393
+ repetition_penalty=request.repetition_penalty,
394
+ num_return_sequences=request.num_return_sequences,
395
+ pad_token_id=tokenizer.pad_token_id,
396
+ eos_token_id=tokenizer.eos_token_id,
397
+ )
398
+
399
+         # Slice off the prompt tokens before decoding; stripping the prompt
+         # as a string is unreliable once special tokens are skipped
+         prompt_len = inputs["input_ids"].shape[1]
+         num_tokens = outputs.shape[1] - prompt_len
+         generated_text = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
404
+
405
+ finish_reason = "stop" if num_tokens < request.max_tokens else "length"
406
+
407
+ return GenerateResponse(
408
+ generated_text=generated_text.strip(),
409
+ prompt=request.prompt,
410
+ model=MODEL_PATH,
411
+ num_tokens=num_tokens,
412
+ finish_reason=finish_reason,
413
+ )
414
+ except Exception as e:
415
+ logger.error(f"Generation error: {e}")
416
+ raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")
417
+
418
+
419
+ @app.post("/chat", response_model=ChatResponse)
420
+ async def chat(request: ChatRequest):
421
+ """
422
+ Chat endpoint for conversation-style interactions.
423
+
424
+ Takes a conversation history and generates the next assistant response.
425
+ """
426
+ if model is None or tokenizer is None:
427
+ raise HTTPException(
428
+ status_code=503,
429
+ detail="Model not loaded. Check /health for status."
430
+ )
431
+
432
+ if not request.messages:
433
+ raise HTTPException(status_code=400, detail="Messages list cannot be empty")
434
+
435
+ # Check that last message is from user
436
+ if request.messages[-1].role != "user":
437
+ raise HTTPException(
438
+ status_code=400,
439
+ detail="Last message must be from user"
440
+ )
441
+
442
+ try:
443
+ # Format conversation as prompt
444
+ prompt = format_chat_to_prompt(request.messages)
445
+
446
+ generated_text, num_tokens, finish_reason = generate_response(
447
+ prompt=prompt,
448
+ max_new_tokens=request.max_tokens,
449
+ temperature=request.temperature,
450
+ top_p=request.top_p,
451
+ do_sample=request.do_sample,
452
+ repetition_penalty=request.repetition_penalty,
453
+ num_return_sequences=1,
454
+ )
455
+
456
+ return ChatResponse(
457
+ message=ChatMessage(role="assistant", content=generated_text),
458
+ model=MODEL_PATH,
459
+ num_tokens=num_tokens,
460
+ finish_reason=finish_reason,
461
+ )
462
+ except Exception as e:
463
+ logger.error(f"Chat error: {e}")
464
+ raise HTTPException(status_code=500, detail=f"Chat generation failed: {str(e)}")
465
+
466
+
467
+ @app.post("/extract-code")
468
+ async def extract_code(request: GenerateRequest):
469
+ """
470
+ Extract code from a generated response.
471
+
472
+ Useful when you have raw output with markdown code blocks and want to
473
+ extract just the code portion.
474
+ """
475
+ code = extract_code_from_response(request.prompt)
476
+ return {"code": code}
477
+
478
+
479
+ # ============================================================================
480
+ # Main Entry Point
481
+ # ============================================================================
482
+
483
+ if __name__ == "__main__":
484
+ import uvicorn
485
+
486
+ port = int(os.getenv("PORT", "8000"))
487
+ host = os.getenv("HOST", "0.0.0.0")
488
+
489
+ uvicorn.run(
490
+ "inference_api:app",
491
+ host=host,
492
+ port=port,
493
+ reload=os.getenv("RELOAD", "false").lower() == "true",
494
+ workers=1, # Multi-worker can cause GPU memory issues
495
+ )
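The endpoints above can be exercised with a small client. The sketch below is illustrative, not part of the commit: `API_URL` assumes the server's default `HOST`/`PORT` environment settings, and the `build_*_payload` helpers are hypothetical names that simply mirror the request fields the `/generate` and `/chat` handlers read (`prompt`, `messages`, `max_tokens`, `temperature`). It uses only the standard library so it runs without installing `requirements_api.txt`.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed: the server's default HOST/PORT


def build_generate_payload(prompt: str, max_tokens: int = 256,
                           temperature: float = 0.7) -> dict:
    # Field names mirror those read by the /generate handler above.
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}


def build_chat_payload(messages: list, max_tokens: int = 256) -> dict:
    # /chat rejects requests whose last message is not from the user (HTTP 400),
    # so fail fast on the client side.
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("last message must be from the user")
    return {"messages": messages, "max_tokens": max_tokens}


def post(path: str, payload: dict) -> dict:
    # Stdlib-only POST helper; swap in requests/httpx if you prefer.
    req = urllib.request.Request(
        f"{API_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # Requires the server to be running (python inference_api.py).
    out = post("/generate", build_generate_payload("def fibonacci(n):"))
    print(out["generated_text"], out["finish_reason"])
```

Checking `finish_reason` on the response tells you whether generation stopped naturally (`"stop"`) or was truncated at the token budget (`"length"`), in which case you may want to retry with a larger `max_tokens`.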
requirements_api.txt ADDED
@@ -0,0 +1,18 @@
+# FastAPI Inference API Dependencies
+# Install with: pip install -r requirements_api.txt
+
+# Web Framework
+fastapi>=0.109.0
+uvicorn[standard]>=0.27.0
+
+# Request/Response Models
+pydantic>=2.5.0
+
+# CORS Support: no extra package needed; FastAPI ships CORSMiddleware
+
+# Server utilities
+python-multipart>=0.0.6
+
+# Optional: async HTTP client (for testing or calling the API)
+httpx>=0.26.0
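All pins in this file are `>=` minimums. When checking an installed version against such a pin, compare the dotted components numerically, not as strings. The helper below is a hypothetical sketch of that comparison, assuming purely numeric version strings (no pre-release suffixes like `rc1`):

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Numeric comparison of dotted version strings, matching a '>=' pin."""
    def parts(version: str) -> list:
        return [int(p) for p in version.split(".")]
    return parts(installed) >= parts(minimum)

# Why not compare strings? Lexicographically "0.27.0" > "0.109.0" (since
# '2' > '1'), yet uvicorn 0.27.0 is actually older than 0.109.0 would be.
```

Real tooling (pip, `packaging.version`) handles pre-release and local version segments as well; this sketch only illustrates why the plain string comparison is a trap.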
training-data/tool_examples_combined.jsonl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5
+size 5669209
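Note that the JSONL file is stored via Git LFS, so the diff above shows the three-line pointer file, not the 1500 training examples themselves; cloning with `git lfs pull` fetches the real ~5.4 MB payload. A pointer file is just key-value lines, which a short sketch (hypothetical helper, not part of the commit) can parse:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version/oid/size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is the byte count of the real file
    return fields

# The exact pointer committed for tool_examples_combined.jsonl:
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5
size 5669209"""

info = parse_lfs_pointer(pointer)
```

The `oid` is the SHA-256 of the actual file contents, so after `git lfs pull` you can verify the download by hashing the JSONL and comparing.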