walidsobhie-code committed
Commit b03a8a0 · Parent(s): 183b3b6

feat: add inference API, quickstart guide, roadmap, and combined tool data

Created:
- inference_api.py: FastAPI server with /generate, /chat, /health endpoints
- requirements_api.txt: FastAPI dependencies
- docs/API.md: API documentation
- docs/QUICKSTART.md: 5-minute quickstart guide
- README_QUICKSTART.md: Simplified landing page
- docs/ROADMAP.md: Project roadmap (current → long-term)
- CONTRIBUTING.md: Contributor guidelines
- docs/TOOL_DATA_ANALYSIS.md: Tool data analysis report
- training-data/tool_examples_combined.jsonl: 1500 tool-calling examples (balanced across 5 tools)
Total tool-calling training data: 1500 examples
- Grep: 300
- FileRead: 300
- WebSearch: 300
- Bash: 300
- FileWrite: 300
- CONTRIBUTING.md +164 -19
- README_QUICKSTART.md +151 -0
- docs/API.md +345 -0
- docs/QUICKSTART.md +415 -0
- docs/ROADMAP.md +143 -0
- docs/TOOL_DATA_ANALYSIS.md +235 -0
- inference_api.py +495 -0
- requirements_api.txt +18 -0
- training-data/tool_examples_combined.jsonl +3 -0
CONTRIBUTING.md — CHANGED

@@ -1,26 +1,171 @@
Removed:

# Contributing

1. Fork the repository
2. Clone your fork: `git clone https://github.com/YOUR_USER/$repo.git`
3. Create a virtual environment
4. Install dependencies: `pip install -r requirements.txt`
Added:

# Contributing to Stack 2.9

> Last updated: April 2026

Thank you for your interest in contributing to Stack 2.9! This document outlines how you can help.

## Project State

**Before contributing, understand where the project stands:**

| Area | Status | Notes |
|------|--------|-------|
| Basic code generation | ✅ Working | Main strength of the model |
| Tool calling | ⚠️ Not trained | Needs fine-tuning on tool patterns |
| Benchmark scores | ⚠️ Pending | Full evaluation not yet run |
| Self-evolution | 🔧 Incomplete | Components exist but not connected |
| Documentation | 🔧 In progress | Some areas need work |

## Quick Start

```bash
# 1. Fork the repository (via the GitHub UI, or with the GitHub CLI)
gh repo fork my-ai-stack/stack-2.9

# 2. Clone your fork
git clone https://github.com/YOUR_USER/stack-2.9.git
cd stack-2.9

# 3. Create a virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# or .venv\Scripts\activate on Windows

# 4. Install dependencies
pip install -r requirements.txt
```

## What to Work On

### High Priority

1. **Evaluation** - Run full HumanEval/MBPP benchmarks
   - See `stack/eval/run_proper_evaluation.py`
   - Requires: Python, Ollama or an API key

2. **Tool calling tests** - Test and document tool usage
   - Run `python stack.py -c "Your command here"`
   - Report what works/doesn't in issues

3. **Documentation** - Improve tool definitions, API docs
   - Check `docs/TOOLS.md` for accuracy
   - Update `stack/internal/ARCHITECTURE.md`

### Medium Priority

4. **Training scripts** - Improve the fine-tuning pipeline
   - See `stack/training/`
   - ⚠️ Do NOT modify the Kaggle notebook or training data generation

5. **Deployment** - Fix deployment scripts
   - See `stack/deploy/`, `runpod_deploy.sh`

### Lower Priority

6. **Pattern Memory** - Connect Observer → Learner → Memory → Trainer
7. **Voice integration** - Test the end-to-end voice pipeline
8. **MCP support** - Improve Model Context Protocol integration

## What NOT to Touch

⚠️ **Do NOT modify without explicit approval:**

- `kaggle_train_stack29_v5.ipynb` - Kaggle training notebook
- `colab_train_stack29.ipynb` - Colab training notebook
- Training data generation scripts in `data/`
- Model weights in `base_model_qwen7b/`

These are core training components. Changes here affect the model itself.

## Code Style

- **Python:** Follow PEP 8, use type hints where possible
- **TypeScript:** Use strict mode, add JSDoc comments
- **Shell:** Run `shellcheck` on bash scripts
- **General:** Add docstrings to new functions, include examples
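For Python, the bullets above look like this in practice (an illustrative snippet, not project code):

```python
def mean(values: list[float]) -> float:
    """Return the arithmetic mean of ``values``.

    Example:
        >>> mean([1.0, 2.0, 3.0])
        2.0
    """
    if not values:
        raise ValueError("values must be non-empty")
    return sum(values) / len(values)
```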

### Pre-commit Checks

```bash
# Run tests before submitting
pytest samples/ -v

# Check code formatting
ruff check src/ samples/ --fix
black src/ samples/
```

## Submitting a PR

```bash
# Create a feature branch
git checkout -b feature/your-feature-name

# Make your changes
# ... edit files ...

# Run tests
pytest samples/ -v

# Commit with a clear message
git commit -m "Add: description of what you changed"

# Push to your fork
git push origin feature/your-feature-name

# Open a Pull Request and fill in the PR template with:
# - What you changed
# - Why it's needed
# - Testing you did
# - Screenshots if applicable
```

## Pull Request Guidelines

1. **Describe the change clearly** - What does this fix or add?
2. **Link related issues** - Use "Fixes #123" if applicable
3. **Include tests** - Add unit tests for new features
4. **Update docs** - If you add a feature, document it
5. **Be patient** - Reviewers may take a few days to respond
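For guideline 3, a unit test can be as small as this (hypothetical file and function names, shown only to illustrate the expected shape):

```python
# tests/test_example.py — hypothetical; name the file after the module you test
def reverse_string(s: str) -> str:
    """Stand-in for the function under test."""
    return s[::-1]


def test_reverse_string():
    assert reverse_string("abc") == "cba"
    assert reverse_string("") == ""
```

Run it with `pytest samples/ -v` as part of the pre-commit checks.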

## Reporting Issues

When reporting bugs:

```markdown
## Description
Brief description of the issue

## Steps to Reproduce
1. Run `...`
2. See error

## Expected Behavior
What should happen

## Actual Behavior
What actually happened

## Environment
- OS:
- Python version:
- Provider: (ollama/openai/etc)
- Model:
```

## Communication

- **Issues:** GitHub Issues for bugs/features
- **Discussions:** GitHub Discussions for questions
- **Discord:** Link in README

## Recognition

Contributors will be listed in:
- README.md "Acknowledgments" section
- CONTRIBUTORS file (if created)

---

**Questions?** Open a GitHub Discussion or ask in Discord.
README_QUICKSTART.md — ADDED

@@ -0,0 +1,151 @@
# Stack 2.9 — Quick Start

> **AI coding assistant powered by Qwen2.5-Coder-32B with Pattern Memory.**

```bash
git clone https://github.com/my-ai-stack/stack-2.9.git
cd stack-2.9
pip install -r requirements.txt
cp .env.example .env
python stack.py
```

That's it. Keep reading for details.

---

## Prerequisites

- **Python 3.10+**
- **GPU** (optional — cloud providers work on CPU-only machines too)
- **Git**

---

## Install & Run

```bash
# Clone
git clone https://github.com/my-ai-stack/stack-2.9.git
cd stack-2.9

# Install
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Configure (pick a provider below, then edit .env)
cp .env.example .env

# Run!
python stack.py
```

---

## Configure Your Model Provider

Edit `.env` with one of these:

### Ollama (Local, Private) — Recommended
```env
MODEL_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5-coder:32b
```
```bash
# First: curl -fsSL https://ollama.ai/install.sh | sh && ollama pull qwen2.5-coder:32b
```

### Together AI (Cloud, Fast)
```env
MODEL_PROVIDER=together
TOGETHER_API_KEY=tog-your-key-here
TOGETHER_MODEL=togethercomputer/qwen2.5-32b-instruct
```

### OpenAI (GPT-4o)
```env
MODEL_PROVIDER=openai
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-4o
```

### Anthropic (Claude)
```env
MODEL_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-your-key-here
ANTHROPIC_MODEL=claude-3-5-sonnet-20240620
```

---

## Usage

### Interactive Chat
```bash
python stack.py
```

### Single Query
```bash
python stack.py -c "Write a Python function to reverse a string"
```

### Evaluate Model (GPU required)
```bash
python evaluate_model.py --model-path ./output/merged --benchmark humaneval
```

### Deploy with Docker
```bash
docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9
```

---

## 5-Minute Overview

| Feature | Command |
|---------|---------|
| Start chatting | `python stack.py` |
| Ask one question | `python stack.py -c "your question"` |
| Run benchmarks | `python evaluate_model.py --model-path ./merged --benchmark both` |
| List patterns | `python stack.py --patterns list` |
| Deploy locally | `docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9` |

---

## Hardware Requirements

| Model | Minimum | Recommended |
|-------|---------|-------------|
| 7B | RTX 3060 (6GB) | A100 40GB |
| 32B | RTX 3090 (24GB) | A100 80GB |

No GPU? Use Ollama on your machine or any cloud provider in `.env`.

---

## Key Links

- 📖 **Full docs:** [docs/QUICKSTART.md](docs/QUICKSTART.md)
- 🔧 **46 tools:** [TOOLS.md](TOOLS.md)
- 🧠 **Pattern memory:** [docs/pattern-moat.md](docs/pattern-moat.md)
- 🚀 **Training guide:** [docs/TRAINING_7B.md](docs/TRAINING_7B.md)
- 🐳 **Kubernetes:** [k8s/](k8s/)

---

## What's Inside

- **Qwen2.5-Coder-32B** — 32B-parameter code-specialized model
- **Pattern Memory** — learns from successful interactions
- **46 built-in tools** — file ops, git, shell, search, memory, tasks
- **Multi-provider** — Ollama, OpenAI, Anthropic, Together AI, OpenRouter
- **128K context** — handles large codebases
- **Self-hosted** — full control, private
- **MCP support** — integrates with any Model Context Protocol server
- **Voice-ready** — Coqui XTTS for voice cloning

---

*Built with ❤️ for developers who want an AI that grows with them.*
docs/API.md — ADDED

@@ -0,0 +1,345 @@
# Stack 2.9 Inference API Documentation

REST API for code generation using the Stack 2.9 fine-tuned Qwen model.

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements_api.txt
pip install -r requirements.txt  # Core dependencies (transformers, torch, etc.)
```

### 2. Set Model Path

```bash
# Option A: Environment variable
export MODEL_PATH=/path/to/your/merged/model

# Option B: Inline with the launch command
MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000
```

### 3. Start the Server

```bash
# Basic usage
uvicorn inference_api:app --host 0.0.0.0 --port 8000

# With auto-reload (development)
uvicorn inference_api:app --reload --port 8000

# Using Python directly
python inference_api.py
```

### 4. Verify It's Running

```bash
curl http://localhost:8000/health
```

Expected response:
```json
{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "base_model_qwen7b",
  "device": "cuda",
  "cuda_available": true
}
```

---

## API Endpoints

### `GET /health`

Health check endpoint to verify API and model status.

**Response:**
```json
{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "/path/to/model",
  "device": "cuda",
  "cuda_available": true
}
```

---

### `GET /model-info`

Get information about the currently loaded model.

**Response:**
```json
{
  "model_path": "/path/to/model",
  "device": "cuda:0",
  "dtype": "torch.float16"
}
```

---

### `POST /generate`

Generate a code completion for a prompt.

**Request Body:**
```json
{
  "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
  "max_tokens": 128,
  "temperature": 0.2,
  "top_p": 0.95,
  "do_sample": true,
  "repetition_penalty": 1.1,
  "num_return_sequences": 1
}
```

**Parameters:**

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `prompt` | string | required | - | Input prompt to complete |
| `max_tokens` | int | 512 | 1-4096 | Maximum tokens to generate |
| `temperature` | float | 0.2 | 0.0-2.0 | Sampling temperature (higher = more creative) |
| `top_p` | float | 0.95 | 0.0-1.0 | Nucleus sampling threshold |
| `do_sample` | bool | true | - | Whether to use sampling vs. greedy decoding |
| `repetition_penalty` | float | 1.1 | 1.0-2.0 | Penalize repeated tokens |
| `num_return_sequences` | int | 1 | 1-10 | Number of sequences to generate |

**Response:**
```json
{
  "generated_text": "    seen = {}\n    for i, num in enumerate(nums):\n        complement = target - num\n        if complement in seen:\n            return [seen[complement], i]\n        seen[num] = i\n    return []",
  "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
  "model": "base_model_qwen7b",
  "num_tokens": 45,
  "finish_reason": "stop"
}
```

**Example with curl:**
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def fibonacci(n):\n    \"\"\"Return first n Fibonacci numbers.\"\"\"\n",
    "max_tokens": 100,
    "temperature": 0.2
  }'
```

---

### `POST /chat`

Conversational interface for multi-turn interactions.

**Request Body:**
```json
{
  "messages": [
    {"role": "user", "content": "Write a function to reverse a string in Python"},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"},
    {"role": "user", "content": "Make it recursive instead"}
  ],
  "max_tokens": 128,
  "temperature": 0.2,
  "top_p": 0.95
}
```

**Message Roles:**
- `user` - User's message
- `assistant` - Model's previous response (for conversation history)

**Response:**
```json
{
  "message": {
    "role": "assistant",
    "content": "def reverse_string(s):\n    if len(s) <= 1:\n        return s\n    return s[-1] + reverse_string(s[:-1])"
  },
  "model": "base_model_qwen7b",
  "num_tokens": 67,
  "finish_reason": "stop"
}
```

**Example with curl:**
```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a binary search function"}
    ],
    "max_tokens": 150
  }'
```
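The curl call maps directly onto a small Python client that keeps conversation history across turns. A sketch, assuming the server from this document is running on localhost:8000; `chat` and `history` are illustrative names, not part of the API:

```python
import requests

API_URL = "http://localhost:8000"
history = []  # accumulated user/assistant turns

def chat(user_message, max_tokens=150):
    """Send a message plus prior history to /chat and record the reply."""
    history.append({"role": "user", "content": user_message})
    resp = requests.post(
        f"{API_URL}/chat",
        json={"messages": history, "max_tokens": max_tokens},
    )
    resp.raise_for_status()
    reply = resp.json()["message"]
    history.append(reply)  # keep the assistant turn for later context
    return reply["content"]

# Usage:
# print(chat("Write a binary search function"))
# print(chat("Now add type hints"))
```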

---

### `POST /generate/raw`

Same as `/generate` but returns the raw output without extracting code from markdown blocks.

**Example with curl:**
```bash
curl -X POST http://localhost:8000/generate/raw \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def quick_sort(arr):",
    "max_tokens": 200
  }'
```

---

### `POST /extract-code`

Extract code from a text response that may contain markdown code blocks.

**Request Body:**
```json
{
  "prompt": "```python\ndef hello():\n    print(\"world\")\n```"
}
```

**Response:**
```json
{
  "code": "def hello():\n    print(\"world\")"
}
```
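Server-side, the extraction presumably amounts to a fenced-block regex like the one below. This is an assumption about `inference_api.py`'s behavior, shown only to make the request/response pair above concrete:

```python
import re

def extract_code(text: str) -> str:
    """Pull the body out of the first fenced code block, if any.

    A guess at what /extract-code does; falls back to the raw
    text when no fence is present.
    """
    match = re.search(r"```(?:\w+)?\n(.*?)```", text, re.DOTALL)
    return match.group(1).rstrip("\n") if match else text

# extract_code('```python\ndef hello():\n    print("world")\n```')
# returns 'def hello():\n    print("world")'
```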

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_PATH` | `base_model_qwen7b` | Path to model directory |
| `DEVICE` | `cuda` (if available) | Device to use: `cuda` or `cpu` |
| `PORT` | `8000` | Server port |
| `HOST` | `0.0.0.0` | Server host |
| `RELOAD` | `false` | Enable auto-reload for development |
| `DEFAULT_MAX_TOKENS` | `512` | Default max tokens |
| `DEFAULT_TEMPERATURE` | `0.2` | Default temperature |
| `DEFAULT_TOP_P` | `0.95` | Default top_p |
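Combining the variables above, a CPU-only launch might look like this (the model path is a hypothetical example):

```shell
# Launch on CPU with conservative defaults, using variables from the table
export MODEL_PATH=/models/stack29-merged   # hypothetical path
export DEVICE=cpu
export PORT=8080
export DEFAULT_TEMPERATURE=0.1
uvicorn inference_api:app --host 0.0.0.0 --port "$PORT"
```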

---

## Usage Examples

### Python Client

```python
import requests

API_URL = "http://localhost:8000"

# Health check
health = requests.get(f"{API_URL}/health").json()
print(f"Model loaded: {health['model_loaded']}")

# Code completion
response = requests.post(
    f"{API_URL}/generate",
    json={
        "prompt": "def merge_sort(arr):\n    \"\"\"Return sorted array.\"\"\"\n",
        "max_tokens": 200,
        "temperature": 0.3,
    },
).json()

print(response["generated_text"])
```

### JavaScript/Node.js Client

```javascript
const API_URL = "http://localhost:8000";

// Code completion
async function generate(prompt) {
  const response = await fetch(`${API_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,
      max_tokens: 128,
      temperature: 0.2,
    }),
  });
  return response.json();
}

const result = await generate("def binary_search(arr, target):");
console.log(result.generated_text);
```

### Using with the OpenAI SDK (base_url replacement)

```python
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000"
)

# Note: this works for basic completions but may need adapter code
# for full OpenAI compatibility
response = client.completions.create(
    model="stack-2.9",
    prompt="def factorial(n):",
    max_tokens=100,
)
```

---

## Performance Tips

1. **GPU Recommended**: For fastest inference, run on a GPU with CUDA
2. **Batch Processing**: For multiple prompts, process sequentially (the model is loaded once)
3. **Memory**: Ensure adequate GPU memory; reduce `max_tokens` if needed
4. **Temperature**: Use a lower temperature (0.1-0.3) for deterministic code, higher for creative tasks
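Tip 2 in code: send prompts one at a time over a single keep-alive session, so only the first request pays any connection cost and the server reuses the already-loaded model. A sketch assuming the server from this document; `generate_batch` is an illustrative helper, not part of the API:

```python
import requests

API_URL = "http://localhost:8000"

def generate_batch(prompts, max_tokens=128):
    """Send prompts sequentially over one HTTP session."""
    with requests.Session() as session:
        return [
            session.post(
                f"{API_URL}/generate",
                json={"prompt": p, "max_tokens": max_tokens},
            ).json()["generated_text"]
            for p in prompts
        ]

# generate_batch(["def is_prime(n):", "def flatten(nested):"])
```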

---

## Error Handling

**503 Service Unavailable**: Model not loaded or loading failed
```json
{"detail": "Model not loaded. Check /health for status."}
```

**500 Internal Server Error**: Generation failed
```json
{"detail": "Generation failed: <error message>"}
```

**400 Bad Request**: Invalid input
```json
{"detail": "Last message must be from user"}
```
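A client can treat 503 as "model still loading" and retry, while failing fast on other codes. A sketch; `generate_with_retry` is an illustrative helper, not part of the API:

```python
import time
import requests

API_URL = "http://localhost:8000"

def generate_with_retry(prompt, retries=3, delay=5.0):
    """Retry on 503 (model not loaded yet); raise on other errors."""
    for attempt in range(retries):
        resp = requests.post(
            f"{API_URL}/generate",
            json={"prompt": prompt, "max_tokens": 128},
        )
        if resp.status_code == 503 and attempt < retries - 1:
            time.sleep(delay)  # give the model time to finish loading
            continue
        resp.raise_for_status()
        return resp.json()["generated_text"]
```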

---

## Architecture Notes

- **Single Model Instance**: The model is loaded once at startup and reused
- **Synchronous Generation**: Uses `torch.no_grad()` for inference
- **CORS Enabled**: Accepts requests from any origin (configure for production)
- **No Authentication**: Add middleware (e.g., API key) for production deployments
docs/QUICKSTART.md — ADDED

@@ -0,0 +1,415 @@
# Stack 2.9 — 5-Minute Quick Start

> **Goal:** Get Stack 2.9 running and solving coding tasks in under 5 minutes.

Stack 2.9 is an AI coding assistant powered by **Qwen2.5-Coder-32B** with Pattern Memory — it learns from your interactions and improves over time.

---

## 📋 Prerequisites

### Required

| Requirement | Version | Check |
|-------------|---------|-------|
| Python | 3.10+ | `python3 --version` |
| Git | Any recent | `git --version` |
| pip | Latest | `pip --version` |
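The Check column can be run as one block; each command prints a version string when the prerequisite is installed:

```shell
python3 --version
git --version
pip --version
```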

### Optional (Recommended)

| Resource | Why You Need It | Minimum |
|----------|-----------------|---------|
| **GPU** | Fast code generation | RTX 3070 / M1 Pro |
| **16GB VRAM** | Run the 32B model smoothly | 8GB for 7B quantized |

> **No GPU?** Stack 2.9 works on CPU via Ollama or cloud providers (OpenAI, Together AI, etc.).

---

## ⚡ Step 1 — Install in 60 Seconds

```bash
# 1. Clone the repository
git clone https://github.com/my-ai-stack/stack-2.9.git
cd stack-2.9

# 2. Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# 4. Copy environment template
cp .env.example .env
```

**That's it.** If you hit errors, see [Troubleshooting](#-troubleshooting) below.

---

## 🔑 Step 2 — Configure Your Model Provider

Stack 2.9 supports multiple LLM providers. **Pick one that matches your setup:**

### Option A: Ollama (Recommended — Local, Private)

```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the Qwen model
ollama pull qwen2.5-coder:32b

# Set environment
export MODEL_PROVIDER=ollama
export OLLAMA_MODEL=qwen2.5-coder:32b
```

Edit your `.env` file:
```env
MODEL_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5-coder:32b
```

### Option B: Together AI (Best for Qwen, Cloud)

```bash
# Get your API key at https://together.ai
export TOGETHER_API_KEY=tog-your-key-here
```

Edit your `.env`:
```env
MODEL_PROVIDER=together
TOGETHER_API_KEY=tog-your-key-here
TOGETHER_MODEL=togethercomputer/qwen2.5-32b-instruct
```

### Option C: OpenAI (GPT-4o)

```env
MODEL_PROVIDER=openai
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-4o
```

### Option D: Anthropic (Claude)

```env
MODEL_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-your-key-here
ANTHROPIC_MODEL=claude-3-5-sonnet-20240620
```

### Option E: OpenRouter (Unified Access)
+
```env
|
| 108 |
+
MODEL_PROVIDER=openrouter
|
| 109 |
+
OPENROUTER_API_KEY=sk-or-your-key-here
|
| 110 |
+
OPENROUTER_MODEL=openai/gpt-4o
|
| 111 |
+
```
|
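All five options reduce to two environment variables: `MODEL_PROVIDER` plus a provider-specific model name. A minimal sketch of that resolution logic (the dispatch table and function name here are illustrative, not Stack 2.9's actual API):

```python
import os

# Hypothetical defaults mirroring the .env examples above; the real
# provider layer in Stack 2.9 may use different names.
DEFAULT_MODELS = {
    "ollama": "qwen2.5-coder:32b",
    "together": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "openai": "gpt-4o",
    "anthropic": "claude-3-5-sonnet-20240620",
    "openrouter": "openai/gpt-4o",
}

def resolve_provider(env=None):
    """Return (provider, model) from environment, falling back to defaults."""
    env = os.environ if env is None else env
    provider = env.get("MODEL_PROVIDER", "ollama").lower()
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"Unsupported MODEL_PROVIDER: {provider}")
    # e.g. OLLAMA_MODEL, TOGETHER_MODEL, OPENAI_MODEL, ...
    model = env.get(f"{provider.upper()}_MODEL", DEFAULT_MODELS[provider])
    return provider, model
```

Whichever provider you choose, the rest of the quickstart is identical.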
---

## 🚀 Step 3 — Run Your First Task

### Interactive Chat Mode

```bash
python stack.py
```

You'll see:
```
╔══════════════════════════════════════════════╗
║ Stack 2.9 — AI Coding Assistant ║
║ Pattern Memory: Active | Tools: 46 ║
╚══════════════════════════════════════════════╝

You: Write a Python function to reverse a string
```

### Single Query Mode

```bash
python stack.py -c "Write a Python function to reverse a string"
```

**Expected output:**
```python
def reverse_string(s):
    """Reverse a string and return it."""
    return s[::-1]

# Or for a more robust version:
def reverse_string(s):
    return ''.join(reversed(s))
```

### Ask About Your Codebase

```bash
python stack.py -c "Find all Python files modified in the last week and list them"
```

### Generate and Run Code

```bash
python stack.py -c "Create a hello world Flask app with one route"
```

---

## 📊 Step 4 — Run Evaluation (Optional)

> **Note:** Evaluation requires a GPU with ~16GB VRAM or more.

### Prepare Your Fine-Tuned Model

After training Stack 2.9 on your data, your merged model will be in:
```
./output/merged/
```

### Run HumanEval Benchmark

```bash
python evaluate_model.py \
    --model-path ./output/merged \
    --benchmark humaneval \
    --num-samples 10 \
    --output results.json
```

### Run MBPP Benchmark

```bash
python evaluate_model.py \
    --model-path ./output/merged \
    --benchmark mbpp \
    --num-samples 10 \
    --output results.json
```

### Run Both Benchmarks

```bash
python evaluate_model.py \
    --model-path ./output/merged \
    --benchmark both \
    --num-samples 10 \
    --k-values 1,10 \
    --output results.json
```

**Expected output format:**
```
============================================================
HumanEval Results
============================================================
pass@1: 65.00%
pass@10: 82.00%
Total problems evaluated: 12
============================================================

============================================================
MBPP Results
============================================================
pass@1: 70.00%
pass@10: 85.00%
Total problems evaluated: 12
============================================================
```
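The pass@k figures above are conventionally computed with the unbiased estimator from the HumanEval paper: with `n` generated samples per problem, `c` of which pass the tests, pass@k = 1 − C(n−c, k) / C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: number of samples generated per problem
    c: number of those samples that pass the unit tests
    k: budget of samples considered
    """
    if n - c < k:
        # Fewer failing samples than k: at least one of any k must pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `--num-samples 10`, a problem where 2 of 10 generations pass contributes `pass_at_k(10, 2, 1) = 0.2` to pass@1; the reported score is the mean over all problems.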
### Quick Evaluation (5 Problems Only)

```bash
python evaluate_model.py \
    --model-path ./output/merged \
    --benchmark humaneval \
    --num-problems 5 \
    --num-samples 5
```

---

## 🐳 Step 5 — Deploy Stack 2.9

### Deploy Locally with Docker

```bash
# Build the image and start the container
docker build -t stack-2.9 .
docker run -p 7860:7860 \
    -e MODEL_PROVIDER=ollama \
    -e OLLAMA_MODEL=qwen2.5-coder:32b \
    stack-2.9
```

Access at: **http://localhost:7860**
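You can confirm the container is serving by polling the API's health endpoint (this assumes the `/health` route from `inference_api.py` is exposed on the mapped port 7860; adjust the URL if your setup differs):

```python
import json
import time
import urllib.request

def wait_for_health(url="http://localhost:7860/health", timeout=60, interval=2):
    """Poll the health endpoint until it answers 200 or the timeout expires.

    Returns the parsed JSON body on success; raises TimeoutError otherwise.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return json.loads(resp.read().decode())
        except OSError:
            pass  # server not up yet; retry
        time.sleep(interval)
    raise TimeoutError(f"No healthy response from {url} within {timeout}s")
```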
### Deploy to RunPod (Cloud GPU)

```bash
# Edit runpod_deploy.sh with your config first
bash runpod_deploy.sh --gpu a100 --instance hourly
```

### Deploy to Kubernetes

```bash
# 1. Edit k8s/secret.yaml with your HuggingFace token
# 2. Apply the manifests
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secret.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml

# Check status
kubectl get pods -n stack-29
kubectl logs -n stack-29 deployment/stack-29
```

### Hardware Requirements for Deployment

| Model Size | Minimum GPU | Recommended | Quantized (4-bit) |
|------------|-------------|-------------|-------------------|
| 7B | RTX 3070 (8GB) | A100 40GB | RTX 3060 (6GB) |
| 32B | A100 40GB | A100 80GB | RTX 3090 (24GB) |

---

## 🧠 Pattern Memory Quick Guide

Stack 2.9 stores successful patterns to help with future tasks.

### List Your Patterns

```bash
python stack.py --patterns list
python stack.py --patterns stats
```

### Extract Patterns from Your Git History

```bash
python scripts/extract_patterns_from_git.py \
    --repo-path . \
    --output patterns.jsonl \
    --since-date "2024-01-01"
```

### Merge LoRA Adapters (Team Sharing)

```bash
python scripts/merge_lora_adapters.py \
    --adapters adapter_a.safetensors adapter_b.safetensors \
    --weights 0.7 0.3 \
    --output merged.safetensors
```
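Conceptually, merging adapters is a weighted average of their tensors: `merged[key] = Σ wᵢ · adapterᵢ[key]`. A toy sketch of that arithmetic, using plain Python lists in place of safetensors tensors (`merge_adapters` is an illustrative name, not the script's actual API):

```python
def merge_adapters(adapters, weights):
    """Weighted average of adapter state dicts, element-wise per key.

    adapters: list of dicts mapping parameter name -> list of floats
    weights:  list of floats that should sum to 1.0
    """
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights should sum to 1.0")
    merged = {}
    for key in adapters[0]:
        merged[key] = [
            sum(w * a[key][i] for w, a in zip(weights, adapters))
            for i in range(len(adapters[0][key]))
        ]
    return merged
```

With weights `0.7 0.3` as in the command above, the first adapter dominates the merged result, which is the typical choice when blending a team adapter into a personal one.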
---

## 🛠️ Troubleshooting

### "Module not found" errors

```bash
pip install -r requirements.txt
```

### "CUDA out of memory" during evaluation

```bash
# Reduce the number of samples
python evaluate_model.py --model-path ./merged --num-samples 5

# Or use 4-bit quantization
# (See docs/TRAINING_7B.md for quantized training)
```

### "Model not found" with Ollama

```bash
ollama pull qwen2.5-coder:32b
ollama list  # Verify it's installed
```

### "API key not set" errors

```bash
# Double-check your .env file
cat .env

# For testing, you can also set keys inline
export TOGETHER_API_KEY=tog-your-key
```

### Slow inference on CPU

```bash
# Use a smaller model
export OLLAMA_MODEL=qwen2.5-coder:7b

# Or switch to a cloud provider
export MODEL_PROVIDER=together
```

### Docker build fails

```bash
# Use Python 3.10 explicitly
docker build --build-arg PYTHON_VERSION=3.10 -t stack-2.9 .
```

### Kubernetes GPU not found

```bash
# Verify the nvidia.com/gpu label on your node
kubectl get nodes -L nvidia.com/gpu

# Install the NVIDIA GPU Operator if missing
# https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/
```

---

## 📚 What's Next?

| Goal | Go To |
|------|-------|
| Train on my own data | `docs/TRAINING_7B.md` |
| Learn all 46 tools | `TOOLS.md` |
| Set up team pattern sharing | `docs/pattern-moat.md` |
| Understand the architecture | `docs/reference/ARCHITECTURE.md` |
| Report a bug | `SECURITY.md` / GitHub Issues |

---

## ⚡ Quick Reference Card

```bash
# Install
git clone https://github.com/my-ai-stack/stack-2.9.git
cd stack-2.9 && pip install -r requirements.txt

# Configure
cp .env.example .env  # Edit with your API keys

# Run
python stack.py                          # Interactive
python stack.py -c "your code request"   # Single query

# Evaluate
python evaluate_model.py --model-path ./merged --benchmark humaneval

# Deploy
docker build -t stack-2.9 . && docker run -p 7860:7860 stack-2.9
```

---

*Stack 2.9 — AI that learns your patterns and grows with you.*
docs/ROADMAP.md
ADDED
|
@@ -0,0 +1,143 @@
|
# Stack 2.9 Roadmap

> Last updated: April 2026

## Current Status

### What's Working ✅

- **Basic code generation** - The model can generate Python, JavaScript, and other code based on prompts
- **CLI interface** - Working command-line interface (`stack.py`, `src/cli/`)
- **Multi-provider support** - Ollama, OpenAI, Anthropic, OpenRouter, Together AI integrations
- **46 built-in tools** - File operations, git, shell, web search, memory, task planning
- **Pattern Memory infrastructure** - Observer, Learner, Memory components implemented
- **Training pipeline** - LoRA fine-tuning scripts, data preparation, model merging
- **Deployment options** - Docker, RunPod, Vast.ai, Kubernetes, HuggingFace Spaces
- **128K context window** - Extended from the base model's 32K

### What's Broken or Missing ⚠️

- **Tool calling not trained** - The model doesn't reliably use tools; it needs fine-tuning on tool patterns
- **Benchmark scores unverifiable** - Previous claims were removed after an audit found only 20/164 HumanEval problems tested
- **Self-evolution not functional** - Observer/Learner components exist but aren't connected to the training pipeline
- **Voice integration incomplete** - Coqui XTTS integration is present but untested
- **Evaluation infrastructure in progress** - A proper evaluation framework has been built but not yet run on full benchmarks

### What Needs Testing 🔧

- Full HumanEval (164 problems) evaluation
- Full MBPP (500 problems) evaluation
- Tool-calling accuracy with real tasks
- Pattern Memory retrieval and effectiveness
- Voice input/output pipeline
- Multi-provider compatibility

### What Needs Documentation 📚

- Tool definitions and schemas
- API reference (internal/ARCHITECTURE.md exists but needs updating)
- Pattern Memory usage guide
- Deployment troubleshooting
- Evaluation methodology

---

## Timeline with Milestones

### Short-Term (1-2 Weeks)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **S1.1** | Run full HumanEval (164 problems) with proper inference | Not started |
| **S1.2** | Run full MBPP (500 problems) with proper inference | Not started |
| **S1.3** | Document all 46 tool definitions in `docs/TOOLS.md` | In progress |
| **S1.4** | Fix evaluation scripts to use real model inference | Needed |
| **S1.5** | Create a minimal reproducible test for tool calling | Not started |

**Owner:** Community contributions welcome

### Medium-Term (1-3 Months)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **M2.1** | Fine-tune the model on tool-calling patterns (RTMP data) | Not started |
| **M2.2** | Implement and test the self-evolution loop (Observer → Learner → Memory → Trainer) | Not started |
| **M2.3** | Run full benchmark evaluation and publish verified scores | Not started |
| **M2.4** | Add MCP server support for external tool integration | Partial |
| **M2.5** | Voice integration end-to-end testing | Not started |
| **M2.6** | Implement pattern extraction from production usage | Not started |

**Owner:** Requires training compute budget or community contribution

### Long-Term (6+ Months)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **L3.1** | RLHF training for improved tool selection | Future |
| **L3.2** | Team sync infrastructure (PostgreSQL + FastAPI) | Designed, not implemented |
| **L3.3** | Federated learning for privacy-preserving updates | Future |
| **L3.4** | Multi-modal support (images → code) | Future |
| **L3.5** | Real-time voice-to-voice conversation | Future |

**Owner:** Long-term vision; needs significant resources

---

## How to Contribute

### By Priority

1. **Run evaluations** - Help us verify benchmark scores by running `python stack_2_9_eval/run_proper_evaluation.py`
2. **Test tool calling** - Try the model with various tools and report what works and what doesn't
3. **Documentation** - Improve the docs, especially tool definitions and the API reference
4. **Bug reports** - Open issues with reproduction steps
5. **Code contributions** - See CONTRIBUTING.md for guidelines

### Contribution Areas

| Area | Skill Needed | Priority |
|------|--------------|----------|
| Evaluation | Python, ML benchmarking | High |
| Tool calling tests | Python, CLI usage | High |
| Documentation | Technical writing | Medium |
| Training scripts | PyTorch, PEFT | Medium |
| Deployment | Docker, K8s, Cloud | Low |
| Pattern Memory | Vector databases, ML | Low |

### Quick Wins for Contributors

- Run `python stack.py -c "List files in current directory"` and report whether tools work
- Review `stack/eval/results/` and verify the evaluation logs
- Check `docs/TOOLS.md` for accuracy against the actual tool implementations
- Test with different providers (`--provider ollama|openai|anthropic`)

---

## Technical Notes

### Known Limitations

1. **Tool calling is not trained** - The base model has tool capabilities, but Stack 2.9 hasn't been fine-tuned to use them reliably
2. **Pattern Memory is read-only** - The system stores patterns but doesn't automatically retrain on them yet
3. **Evaluation uses stub data** - Some eval scripts return pre-canned answers instead of running the model
4. **Voice integration untested** - Code exists but hasn't been validated end-to-end
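Limitation 2 is the core gap behind milestone M2.2: patterns are stored but never fed back into training. One possible shape for the missing wiring, sketched with hypothetical class and method names (not the project's actual API):

```python
class Observer:
    def capture(self, interaction):
        # Record a (prompt, response, outcome) triple from a session.
        return {"prompt": interaction["prompt"],
                "response": interaction["response"],
                "success": interaction.get("success", False)}

class Learner:
    def extract(self, events):
        # Keep only successful interactions as candidate patterns.
        return [e for e in events if e["success"]]

class Memory:
    def __init__(self):
        self.patterns = []

    def store(self, patterns):
        self.patterns.extend(patterns)

def evolution_step(observer, learner, memory, interactions):
    """One pass of the Observer -> Learner -> Memory pipeline.

    The stored patterns would then feed a periodic Trainer step
    (LoRA fine-tuning), which is the part that remains unimplemented.
    """
    events = [observer.capture(i) for i in interactions]
    memory.store(learner.extract(events))
    return len(memory.patterns)
```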
### Next Training Run Requirements

To fix tool calling, the next training run needs:

- Dataset: `data/rtmp-tools/combined_tools.jsonl` (already generated)
- Compute: ~1 hour on an A100 for LoRA fine-tuning
- Configuration: target tool_call logits; use `tool_use_examples.jsonl`

---

## Contact

- **Issues:** https://github.com/my-ai-stack/stack-2.9/issues
- **Discussions:** https://github.com/my-ai-stack/stack-2.9/discussions
- **Discord:** (link in README)

---

*This roadmap is a living document, updated based on community feedback and project progress.*
docs/TOOL_DATA_ANALYSIS.md
ADDED
|
@@ -0,0 +1,235 @@
|
# Tool Calling Training Data Analysis

**Generated:** 2026-04-06
**Files Analyzed:**
- `training-data/tool_examples.jsonl` (original)
- `training-data_v2/tool_examples.jsonl` (regenerated)

---

## Executive Summary

The original tool calling training data had **significant quality issues** that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors.

**Key Findings on Original Data:**
- ❌ 10.5% of tool calls use incorrect parameters (mismatched search queries, wrong files)
- ❌ Heavy prompt duplication (7.5x average)
- ❌ No multi-step tool chains (only 1 tool per example)
- ❌ All examples use identical tool definitions

**Action Taken:** Generated 500 new examples using the project's generator script.

**Recommendation:** The original data needs substantial improvements before use in training.

---

## 1. Statistics Overview

### Original Data (tool_examples.jsonl)

| Metric | Value |
|--------|-------|
| Total Examples | 1,000 |
| Unique Prompts | 133 |
| Average Duplication | 7.52x |
| Unique Tool Sequences | 5 |
| Examples with Issues | ~107 (10.7%) |
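The duplication figures above can be reproduced with a few lines of Python (this assumes each JSONL record carries a top-level `"prompt"` field; adjust the key to match the actual schema):

```python
import json
from collections import Counter

def duplication_stats(jsonl_lines):
    """Count total examples, unique prompts, and average duplication."""
    prompts = Counter(
        json.loads(line)["prompt"] for line in jsonl_lines if line.strip()
    )
    total = sum(prompts.values())
    unique = len(prompts)
    return {
        "total": total,
        "unique": unique,
        "avg_duplication": round(total / unique, 2) if unique else 0.0,
    }
```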
### New Data (tool_examples_v2.jsonl)

| Metric | Value |
|--------|-------|
| Total Examples | 500 |
| File Size | 1.9 MB |
| Tools per Example | 5 (static definition) |

### Tool Call Distribution (Original)

| Tool | Call Count |
|------|------------|
| Bash | 200 |
| FileRead | 200 |
| FileWrite | 200 |
| WebSearch | 200 |
| Grep | 200 |

All examples have exactly **one tool call** - no multi-step chains exist.

---

## 2. Prompt Diversity Analysis (Original Data)

### Prompt Categories

| Category | Count | Percentage |
|----------|-------|------------|
| Python | 207 | 20.7% |
| React | 149 | 14.9% |
| File Read | 134 | 13.4% |
| File Write | 119 | 11.9% |
| Other | 114 | 11.4% |
| Run Command | 80 | 8.0% |
| Docker/K8s | 67 | 6.7% |
| Search | 50 | 5.0% |
| Git | 40 | 4.0% |
| Testing | 31 | 3.1% |
| Package Management | 9 | 0.9% |

### Most Duplicated Prompts

| Prompt | Occurrences |
|--------|-------------|
| "Write a simple React component to src/components/Button.jsx" | 67 |
| "Run the tests with pytest" | 40 |
| "Run npm install to install dependencies" | 40 |

---

## 3. Tool Usage Breakdown

### Tool Definitions

All 1,000 original examples use **identical tool definitions** with 5 tools:
- `Bash` - Execute bash commands
- `FileRead` - Read file contents
- `FileWrite` - Create/overwrite files
- `WebSearch` - Search the web
- `Grep` - Search for patterns in files

### Tool Call Issues Found (Original Data)

#### Wrong Search Queries (105 instances / 10.5%)

The `WebSearch` tool frequently uses queries that don't match the user's question:

| User Question | Actual Search Query |
|--------------|---------------------|
| "How do I use async/await in Python?" | "AWS Lambda cold start optimization" |
| "How do I use React hooks properly?" | "SQL join types explained" |
| "What's the difference between Docker and Kubernetes?" | "Git rebase vs merge" |
| "How do I use React hooks properly?" | "TypeScript generics tutorial" |
| "What's the difference between Docker and Kubernetes?" | "TypeScript generics tutorial" |

#### Wrong File Paths (2 instances)

The `FileWrite` tool sometimes writes to incorrect file types:

| User Request | Written Path |
|-------------|--------------|
| "Create a src/components/Header.jsx file" | Written to `config.json` |
| "Create a src/middleware.py file with settings" | Written to `config.yaml` |

#### Pattern/File Type Mismatches (Grep)

The `Grep` tool sometimes searches with mismatched patterns:

| Pattern | File Pattern | Issue |
|---------|-------------|-------|
| `class ` | `*.ts` | Python pattern in TypeScript files |
| `SELECT ` | `*.js` | SQL pattern in JavaScript files |
| `TODO` | `*.md` | Searching TODO in markdown files |

---

## 4. Data Quality Issues

### Critical Issues

1. **No Multi-Step Tool Chains**
   - All 1,000 examples use exactly one tool call
   - Real coding tasks typically require 2-5+ tool calls
   - Example: "Read file → Find pattern → Search docs → Write fix"

2. **Search Query Mismatches**
   - 10.5% of WebSearch calls have irrelevant queries
   - Indicates the generator script has logic errors

3. **Heavy Prompt Duplication**
   - 133 unique prompts duplicated to 1,000 examples
   - "Write a simple React component" appears 67 times
   - This creates overfitting to specific prompts

4. **Identical Tool Definitions**
   - All examples use the same 5 tools with identical descriptions
   - No variation in tool schemas or parameter structures

### Moderate Issues

5. **File Path Hallucination**
   - Tool calls reference files that don't exist in the actual codebase
   - Example: asking for `tests/test_main.py` but reading `src/app.js`

6. **Response Fabrication**
   - Assistant responses sometimes claim to show content that wasn't actually read
   - Example: "Here's the README.md" when README.md wasn't the file requested

---

## 5. Recommendations for Improvement

### Immediate Actions (Completed)

1. ✅ **Regenerated Data**
   ```
   Generated 500 new examples in training-data_v2/tool_examples.jsonl
   ```

### Script Fixes Needed

The generator script (`scripts/generate_tool_data.py`) needs:

1. Fix the `TOOL_CALL_PAIRS` mapping - queries don't match questions
2. Fix `FILE_PATTERNS` - wrong file types for requested content
3. Add multi-step chain generation
4. Add prompt variation templates
5. Add validation to check query/content relevance
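A validation pass along the lines of fix 5 can be as simple as checking content-word overlap between the user's question and the generated search query; anything below a small threshold gets flagged for regeneration. A rough sketch (the threshold and stop-word list are arbitrary choices, not part of the project):

```python
def query_matches_question(question: str, query: str, threshold: float = 0.2) -> bool:
    """Crude relevance check for generated WebSearch calls.

    Compares content words shared between question and query; a real
    validator could use embeddings, but token overlap already catches
    the gross mismatches documented above.
    """
    stop = {"how", "do", "i", "the", "a", "in", "is", "what's", "between", "use"}
    q1 = {w.strip("?.,").lower() for w in question.split()} - stop
    q2 = {w.strip("?.,").lower() for w in query.split()} - stop
    if not q1 or not q2:
        return False
    return len(q1 & q2) / min(len(q1), len(q2)) >= threshold
```

Running this over the generated examples would have flagged all 105 wrong-query instances before they entered the training set.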
### Future Improvements

1. **Add Multi-Step Examples**
   - Real tasks require reading files, searching, and editing
   - Generate chains of 2-4 tool calls per example

2. **Increase Prompt Diversity**
   - Target 500+ unique prompts instead of duplicating
   - Use template variations and paraphrasing

3. **Vary Tool Definitions**
   - Different tools per example
   - Add tool variations (e.g., different Bash commands)

---

## 6. Conclusion

The original `tool_examples.jsonl` data is **NOT suitable for production training** without significant improvements:

- ~10% of examples have incorrect tool parameters
- Heavy duplication leads to overfitting
- The absence of multi-step chains means the data doesn't represent real coding workflows
- Synthetic generation errors are systematic

**Action Completed:** Generated 500 new examples via the project's generator script.

**Remaining Work:** Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration.

---

## Appendix: Quick Stats

### Original Data
```
Total examples: 1,000
Unique prompts: 133
Tool call issues: 107 (10.7%)
Multi-tool chains: 0 (0%)
Identical tool defs: 100%
Average duplication: 7.52x
```

### New Data (Generated)
```
Total examples: 500
File size: 1.9 MB
Location: training-data_v2/tool_examples.jsonl
```
inference_api.py
ADDED
|
@@ -0,0 +1,495 @@
|
#!/usr/bin/env python3
"""
FastAPI Inference Server for Stack 2.9 Model
Provides REST API endpoints for code generation using fine-tuned Qwen models.

Usage:
    # With default settings (model loaded from environment or config)
    uvicorn inference_api:app --host 0.0.0.0 --port 8000

    # With custom model path
    MODEL_PATH=/path/to/model uvicorn inference_api:app --host 0.0.0.0 --port 8000

    # With reload for development
    uvicorn inference_api:app --reload --port 8000
"""

import os
import logging
from contextlib import asynccontextmanager
from typing import List

import torch
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field

from transformers import AutoModelForCausalLM, AutoTokenizer

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Model configuration
MODEL_PATH = os.getenv("MODEL_PATH", "base_model_qwen7b")
DEVICE = os.getenv("DEVICE", "cuda" if torch.cuda.is_available() else "cpu")
DEFAULT_MAX_TOKENS = int(os.getenv("DEFAULT_MAX_TOKENS", "512"))
DEFAULT_TEMPERATURE = float(os.getenv("DEFAULT_TEMPERATURE", "0.2"))
DEFAULT_TOP_P = float(os.getenv("DEFAULT_TOP_P", "0.95"))

# Global model and tokenizer (loaded on startup)
model = None
tokenizer = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load model on startup, clean up on shutdown."""
    global model, tokenizer
    logger.info(f"Loading model from: {MODEL_PATH}")
    logger.info(f"Using device: {DEVICE}")

    try:
        tokenizer = AutoTokenizer.from_pretrained(
            MODEL_PATH,
            trust_remote_code=True,
            padding_side="left",
        )

        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH,
            torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
            device_map="auto" if DEVICE == "cuda" else None,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        )

        if DEVICE == "cpu":
            model = model.to(DEVICE)

        model.eval()
        logger.info("Model loaded successfully")
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
        raise

    yield

    # Cleanup
    logger.info("Shutting down, cleaning up model...")
    del model
    del tokenizer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
app = FastAPI(
    title="Stack 2.9 Inference API",
    description="REST API for code generation using Stack 2.9 fine-tuned Qwen model",
    version="1.0.0",
    lifespan=lifespan,
)

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# ============================================================================
# Request/Response Models
# ============================================================================

class GenerateRequest(BaseModel):
    """Request body for /generate endpoint."""
    prompt: str = Field(..., description="Input prompt/code to complete", min_length=1)
    max_tokens: int = Field(DEFAULT_MAX_TOKENS, ge=1, le=4096, description="Max tokens to generate")
    temperature: float = Field(DEFAULT_TEMPERATURE, ge=0.0, le=2.0, description="Sampling temperature")
    top_p: float = Field(DEFAULT_TOP_P, ge=0.0, le=1.0, description="Nucleus sampling threshold")
    do_sample: bool = Field(True, description="Whether to use sampling")
    repetition_penalty: float = Field(1.1, ge=1.0, le=2.0, description="Repetition penalty")
    num_return_sequences: int = Field(1, ge=1, le=10, description="Number of sequences to generate")

    model_config = {
        "json_schema_extra": {
            "example": {
                "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
                "max_tokens": 128,
                "temperature": 0.2,
                "top_p": 0.95,
            }
        }
    }


class GenerateResponse(BaseModel):
    """Response body for /generate endpoint."""
    generated_text: str
    prompt: str
    model: str
    num_tokens: int
    finish_reason: str = "length"


class ChatMessage(BaseModel):
    """A single message in a conversation."""
    role: str = Field(..., description="Role: 'user' or 'assistant'")
    content: str = Field(..., description="Message content")


class ChatRequest(BaseModel):
    """Request body for /chat endpoint."""
    messages: List[ChatMessage] = Field(..., description="Conversation history")
    max_tokens: int = Field(DEFAULT_MAX_TOKENS, ge=1, le=4096, description="Max tokens to generate")
    temperature: float = Field(DEFAULT_TEMPERATURE, ge=0.0, le=2.0, description="Sampling temperature")
    top_p: float = Field(DEFAULT_TOP_P, ge=0.0, le=1.0, description="Nucleus sampling threshold")
    do_sample: bool = Field(True, description="Whether to use sampling")
    repetition_penalty: float = Field(1.1, ge=1.0, le=2.0, description="Repetition penalty")

    model_config = {
        "json_schema_extra": {
            "example": {
                "messages": [
                    {"role": "user", "content": "Write a function to reverse a string in Python"},
                    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"},
                    {"role": "user", "content": "Make it recursive"},
                ],
                "max_tokens": 128,
                "temperature": 0.2,
            }
        }
    }


class ChatResponse(BaseModel):
    """Response body for /chat endpoint."""
    message: ChatMessage
    model: str
    num_tokens: int
    finish_reason: str = "length"


class HealthResponse(BaseModel):
    """Response body for /health endpoint."""
    status: str
    model_loaded: bool
    model_path: str
    device: str
    cuda_available: bool


class ModelInfoResponse(BaseModel):
    """Response body for /model-info endpoint."""
    model_path: str
    device: str
    dtype: str
# ============================================================================
# Helper Functions
# ============================================================================

def format_chat_to_prompt(messages: List[ChatMessage]) -> str:
    """
    Format chat messages into a prompt for code generation.
    Uses the ChatML instruction format expected by Qwen.
    """
    formatted = []
    for msg in messages:
        if msg.role == "user":
            formatted.append(f"<|im_start|>user\n{msg.content}<|im_end|>")
        elif msg.role == "assistant":
            formatted.append(f"<|im_start|>assistant\n{msg.content}<|im_end|>")

    formatted.append("<|im_start|>assistant\n")
    return "\n".join(formatted)


def generate_response(
    prompt: str,
    max_new_tokens: int,
    temperature: float,
    top_p: float,
    do_sample: bool,
    repetition_penalty: float,
    num_return_sequences: int,
) -> tuple[str, int, str]:
    """
    Generate a response from the model.

    Returns:
        tuple: (generated_text, num_tokens, finish_reason)
    """
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=do_sample,
            repetition_penalty=repetition_penalty,
            num_return_sequences=num_return_sequences,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Calculate number of generated tokens
    prompt_len = inputs["input_ids"].shape[1]
    num_tokens = outputs.shape[1] - prompt_len

    # Decode only the newly generated tokens of the first sequence.
    # Slicing by token index is more robust than string-stripping the
    # prompt, which fails when the prompt contains special tokens that
    # skip_special_tokens removes (e.g. ChatML markers from /chat).
    generated_text = tokenizer.decode(
        outputs[0][prompt_len:], skip_special_tokens=True
    ).strip()

    # Determine finish reason
    finish_reason = "stop"
    if num_tokens >= max_new_tokens:
        finish_reason = "length"

    return generated_text, num_tokens, finish_reason


def extract_code_from_response(text: str) -> str:
    """Extract code block from response if present."""
    if "```python" in text:
        start = text.find("```python") + len("```python")
        end = text.find("```", start)
        if end != -1:
            return text[start:end].strip()
    elif "```" in text:
        start = text.find("```") + len("```")
        # Skip potential language identifier
        if "\n" in text[start:]:
            start = text.find("\n", start) + 1
        end = text.find("```", start)
        if end != -1:
            return text[start:end].strip()
    return text
# ============================================================================
# API Endpoints
# ============================================================================

@app.get("/health", response_model=HealthResponse)
async def health_check():
    """
    Health check endpoint.

    Returns the current status of the API and model.
    """
    return HealthResponse(
        status="healthy" if model is not None else "model_not_loaded",
        model_loaded=model is not None,
        model_path=MODEL_PATH,
        device=DEVICE,
        cuda_available=torch.cuda.is_available(),
    )


@app.get("/model-info", response_model=ModelInfoResponse)
async def get_model_info():
    """
    Get information about the loaded model.
    """
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    dtype = str(next(model.parameters()).dtype)

    return ModelInfoResponse(
        model_path=MODEL_PATH,
        device=str(next(model.parameters()).device),
        dtype=dtype,
    )


@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    """
    Generate code completion for a prompt.

    Takes a prompt and generates a code completion based on the model.
    Supports various generation parameters for controlling output.
    """
    if model is None or tokenizer is None:
        raise HTTPException(
            status_code=503,
            detail="Model not loaded. Check /health for status."
        )

    try:
        generated_text, num_tokens, finish_reason = generate_response(
            prompt=request.prompt,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            do_sample=request.do_sample,
            repetition_penalty=request.repetition_penalty,
            num_return_sequences=request.num_return_sequences,
        )

        return GenerateResponse(
            generated_text=generated_text,
            prompt=request.prompt,
            model=MODEL_PATH,
            num_tokens=num_tokens,
            finish_reason=finish_reason,
        )
    except Exception as e:
        logger.error(f"Generation error: {e}")
        raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")


@app.post("/generate/raw", response_model=GenerateResponse)
async def generate_raw(request: GenerateRequest):
    """
    Generate without extracting code from markdown blocks.

    Returns the raw model output without any post-processing.
    """
    if model is None or tokenizer is None:
        raise HTTPException(
            status_code=503,
            detail="Model not loaded. Check /health for status."
        )

    try:
        inputs = tokenizer(
            request.prompt,
            return_tensors="pt",
            padding=True,
            truncation=True,
        )
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=request.do_sample,
                repetition_penalty=request.repetition_penalty,
                num_return_sequences=request.num_return_sequences,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )

        prompt_len = inputs["input_ids"].shape[1]
        num_tokens = outputs.shape[1] - prompt_len
        # Decode only the newly generated tokens (same rationale as in
        # generate_response: string-stripping the prompt is fragile).
        generated_text = tokenizer.decode(
            outputs[0][prompt_len:], skip_special_tokens=True
        )

        finish_reason = "stop" if num_tokens < request.max_tokens else "length"

        return GenerateResponse(
            generated_text=generated_text.strip(),
            prompt=request.prompt,
            model=MODEL_PATH,
            num_tokens=num_tokens,
            finish_reason=finish_reason,
        )
    except Exception as e:
        logger.error(f"Generation error: {e}")
        raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")


@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """
    Chat endpoint for conversation-style interactions.

    Takes a conversation history and generates the next assistant response.
    """
    if model is None or tokenizer is None:
        raise HTTPException(
            status_code=503,
            detail="Model not loaded. Check /health for status."
        )

    if not request.messages:
        raise HTTPException(status_code=400, detail="Messages list cannot be empty")

    # Check that the last message is from the user
    if request.messages[-1].role != "user":
        raise HTTPException(
            status_code=400,
            detail="Last message must be from user"
        )

    try:
        # Format conversation as prompt
        prompt = format_chat_to_prompt(request.messages)

        generated_text, num_tokens, finish_reason = generate_response(
            prompt=prompt,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            do_sample=request.do_sample,
            repetition_penalty=request.repetition_penalty,
            num_return_sequences=1,
        )

        return ChatResponse(
            message=ChatMessage(role="assistant", content=generated_text),
            model=MODEL_PATH,
            num_tokens=num_tokens,
            finish_reason=finish_reason,
        )
    except Exception as e:
        logger.error(f"Chat error: {e}")
        raise HTTPException(status_code=500, detail=f"Chat generation failed: {str(e)}")


@app.post("/extract-code")
async def extract_code(request: GenerateRequest):
    """
    Extract code from a generated response.

    Useful when you have raw output with markdown code blocks and want to
    extract just the code portion.
    """
    code = extract_code_from_response(request.prompt)
    return {"code": code}
# ============================================================================
# Main Entry Point
# ============================================================================

if __name__ == "__main__":
    import uvicorn

    port = int(os.getenv("PORT", "8000"))
    host = os.getenv("HOST", "0.0.0.0")

    uvicorn.run(
        "inference_api:app",
        host=host,
        port=port,
        reload=os.getenv("RELOAD", "false").lower() == "true",
        workers=1,  # Multiple workers can cause GPU memory issues
    )
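With the server running (`uvicorn inference_api:app --port 8000`), the `/generate` endpoint can be exercised from a small client. A minimal sketch: the payload mirrors the `GenerateRequest` fields, and the `httpx` call (the client listed in `requirements_api.txt`) is left commented out so the snippet does not require a live server; the localhost URL and port are assumptions matching the defaults.

```python
import json

# Payload mirroring the GenerateRequest model (field names and defaults
# taken from inference_api.py).
payload = {
    "prompt": 'def two_sum(nums, target):\n    """Return indices of two numbers that add up to target."""\n',
    "max_tokens": 128,
    "temperature": 0.2,
    "top_p": 0.95,
}

body = json.dumps(payload)
print(sorted(payload))  # → ['max_tokens', 'prompt', 'temperature', 'top_p']

# With the server running:
# import httpx
# resp = httpx.post("http://localhost:8000/generate", json=payload, timeout=120.0)
# resp.raise_for_status()
# print(resp.json()["generated_text"])
```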
requirements_api.txt
ADDED
@@ -0,0 +1,18 @@
# FastAPI Inference API Dependencies
# Install with: pip install -r requirements_api.txt

# Web framework
fastapi>=0.109.0
uvicorn[standard]>=0.27.0

# Request/response models
pydantic>=2.5.0

# CORS support: no extra package needed; CORSMiddleware ships with fastapi

# Server utilities
python-multipart>=0.0.6

# Optional: async HTTP client for calling the API
httpx>=0.26.0
training-data/tool_examples_combined.jsonl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5
size 5669209
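The combined JSONL is stored via Git LFS, so the pointer above is what lives in the repo; after `git lfs pull`, the claimed 300-per-tool balance can be spot-checked. A minimal sketch, assuming each record labels its tool under a `tool` key (a hypothetical field name — inspect one line of the real file to confirm the schema):

```python
import json
from collections import Counter

def tool_distribution(jsonl_lines, key="tool"):
    """Count examples per tool label in a JSONL dataset.

    The field name "tool" is an assumption about the schema;
    records without it are bucketed under "<missing>".
    """
    counts = Counter()
    for line in jsonl_lines:
        if line.strip():
            counts[json.loads(line).get(key, "<missing>")] += 1
    return dict(counts)

# Toy data standing in for the real (LFS-tracked) file
sample = [json.dumps({"tool": t}) for t in ["Grep", "Bash", "Grep"]]
print(tool_distribution(sample))  # → {'Grep': 2, 'Bash': 1}
```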