walidsobhie-code

feat: add inference API, quickstart guide, roadmap, and combined tool data

b03a8a0 20 days ago

Stack 2.9 Inference API Documentation

REST API for code generation using the Stack 2.9 fine-tuned Qwen model.

Quick Start

1. Install Dependencies

pip install -r requirements_api.txt
pip install -r requirements.txt  # Core dependencies (transformers, torch, etc.)

2. Set Model Path

# Option A: Environment variable
export MODEL_PATH=/path/to/your/merged/model

# Option B: Set the variable inline for a single command
MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000

3. Start the Server

# Basic usage
uvicorn inference_api:app --host 0.0.0.0 --port 8000

# With auto-reload (development)
uvicorn inference_api:app --reload --port 8000

# Using Python directly
python inference_api.py

4. Verify It's Running

curl http://localhost:8000/health

Expected response:

{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "base_model_qwen7b",
  "device": "cuda",
  "cuda_available": true
}
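
Because the model can take a while to load, a script that starts the server may want to wait for `/health` before sending requests. The sketch below polls the endpoint until `model_loaded` is true. The names `fetch_health` and `wait_until_ready`, the timeout values, and the choice to treat any connection error as "not ready yet" are assumptions made for illustration, not part of the API.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(base_url):
    """GET /health and return the parsed JSON body, or None if the call fails."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Server not up yet, non-2xx status, or a body that is not JSON.
        return None


def wait_until_ready(base_url, timeout_s=120.0, interval_s=2.0,
                     fetch=fetch_health, sleep=time.sleep):
    """Poll /health until model_loaded is true; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch(base_url)
        if body and body.get("model_loaded"):
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Model not ready after {timeout_s} seconds")
        sleep(interval_s)
```

Calling `wait_until_ready("http://localhost:8000")` would then return the health body once the model reports as loaded. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.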

API Endpoints

`GET /health`

Health check endpoint to verify API and model status.

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "/path/to/model",
  "device": "cuda",
  "cuda_available": true
}

`GET /model-info`

Get information about the currently loaded model.

Response:

{
  "model_path": "/path/to/model",
  "device": "cuda:0",
  "dtype": "torch.float16"
}

`POST /generate`

Generate code completion for a prompt.

Request Body:

{
  "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
  "max_tokens": 128,
  "temperature": 0.2,
  "top_p": 0.95,
  "do_sample": true,
  "repetition_penalty": 1.1,
  "num_return_sequences": 1
}

Parameters:

Parameter	Type	Default	Range	Description
`prompt`	string	required	-	Input prompt to complete
`max_tokens`	int	512	1-4096	Maximum tokens to generate
`temperature`	float	0.2	0.0-2.0	Sampling temperature (higher = more creative)
`top_p`	float	0.95	0.0-1.0	Nucleus sampling threshold
`do_sample`	bool	true	-	Sample from the token distribution (true) or decode greedily (false)
`repetition_penalty`	float	1.1	1.0-2.0	Penalize repeated tokens
`num_return_sequences`	int	1	1-10	Number of sequences to generate

Response:

{
  "generated_text": "    seen = {}\n    for i, num in enumerate(nums):\n        complement = target - num\n        if complement in seen:\n            return [seen[complement], i]\n        seen[num] = i\n    return []",
  "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
  "model": "base_model_qwen7b",
  "num_tokens": 45,
  "finish_reason": "stop"
}

Example with curl:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def fibonacci(n):\n    \"\"\"Return first n Fibonacci numbers.\"\"\"\n",
    "max_tokens": 100,
    "temperature": 0.2
  }'
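
A client can check the documented parameter ranges before sending a request, so that out-of-range values are caught locally. The following is a minimal sketch that mirrors the defaults and ranges from the parameter table above. The function name `build_generate_payload` is hypothetical, and the server's own validation may differ from what is shown here.

```python
def build_generate_payload(prompt, max_tokens=512, temperature=0.2, top_p=0.95,
                           do_sample=True, repetition_penalty=1.1,
                           num_return_sequences=1):
    """Return a /generate request body after checking the documented ranges."""
    if not isinstance(prompt, str):
        raise ValueError("prompt is required and must be a string")

    # (value, minimum, maximum) taken from the parameter table.
    ranges = {
        "max_tokens": (max_tokens, 1, 4096),
        "temperature": (temperature, 0.0, 2.0),
        "top_p": (top_p, 0.0, 1.0),
        "repetition_penalty": (repetition_penalty, 1.0, 2.0),
        "num_return_sequences": (num_return_sequences, 1, 10),
    }
    for name, (value, low, high) in ranges.items():
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the range {low}-{high}")

    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "do_sample": do_sample,
        "repetition_penalty": repetition_penalty,
        "num_return_sequences": num_return_sequences,
    }
```

The returned dictionary can be passed as the JSON body of a `POST /generate` request.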

`POST /chat`

Conversational interface for multi-turn interactions.

Request Body:

{
  "messages": [
    {"role": "user", "content": "Write a function to reverse a string in Python"},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"},
    {"role": "user", "content": "Make it recursive instead"}
  ],
  "max_tokens": 128,
  "temperature": 0.2,
  "top_p": 0.95
}

Message Roles:

user - User's message
assistant - Model's previous response (for conversation history)

Response:

{
  "message": {
    "role": "assistant",
    "content": "def reverse_string(s):\n    if len(s) <= 1:\n        return s\n    return s[-1] + reverse_string(s[:-1])"
  },
  "model": "base_model_qwen7b",
  "num_tokens": 67,
  "finish_reason": "stop"
}

Example with curl:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a binary search function"}
    ],
    "max_tokens": 150
  }'
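
For multi-turn use, the client has to carry the conversation history itself and append each assistant reply before the next user turn. The sketch below shows one way to do that, assuming only the two documented roles are accepted and that the last message must come from the user (as the 400 error in the Error Handling section suggests). The helper names are hypothetical.

```python
ALLOWED_ROLES = {"user", "assistant"}


def validate_messages(messages):
    """Check the message list locally before sending it to /chat."""
    if not messages:
        raise ValueError("messages must not be empty")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"Unsupported role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("Each message needs a string 'content'")
    if messages[-1]["role"] != "user":
        raise ValueError("Last message must be from user")
    return messages


def extend_history(messages, chat_response, next_user_message):
    """Return a new history: old messages, the assistant reply, the next user turn."""
    return validate_messages(
        messages
        + [chat_response["message"], {"role": "user", "content": next_user_message}]
    )
```

The list returned by `extend_history` can be sent as the `messages` field of the next `/chat` request.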

`POST /generate/raw`

Same as `/generate`, but returns the model's raw output without extracting code from markdown blocks.

Example with curl:

curl -X POST http://localhost:8000/generate/raw \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def quick_sort(arr):",
    "max_tokens": 200
  }'

`POST /extract-code`

Extract code from a text response that may contain markdown code blocks.

Request Body:

{
  "prompt": "```python\ndef hello():\n    print(\"world\")\n```"
}

Response:

{
  "code": "def hello():\n    print(\"world\")"
}
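
The same kind of extraction can be done on the client, for example on output from `/generate/raw`. The sketch below is one plausible approach using a regular expression. It is an assumption about how such extraction might work, not the server's actual implementation, and it returns only the first fenced block.

```python
import re

# Build the fence marker programmatically so this snippet can itself be
# placed inside a fenced block without confusing markdown parsers.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"[^\n`]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(text):
    """Return the body of the first fenced block, or the stripped text if none."""
    match = FENCE_RE.search(text)
    if match is None:
        return text.strip()
    return match.group(1).strip("\n")
```

Applied to the request body shown above, `extract_code` yields the same string as the documented `code` field.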

Environment Variables

Variable	Default	Description
`MODEL_PATH`	`base_model_qwen7b`	Path to model directory
`DEVICE`	`cuda` (if available)	Device to use: `cuda` or `cpu`
`PORT`	`8000`	Server port
`HOST`	`0.0.0.0`	Server host
`RELOAD`	`false`	Enable auto-reload for development
`DEFAULT_MAX_TOKENS`	`512`	Default max tokens
`DEFAULT_TEMPERATURE`	`0.2`	Default temperature
`DEFAULT_TOP_P`	`0.95`	Default top_p
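
To illustrate how these variables and their defaults fit together, here is a minimal sketch of reading them into a settings object. The `Settings` class, the `load_settings` function, and the set of strings treated as true for `RELOAD` are assumptions; the server may parse its configuration differently. `DEVICE` is left as `None` when unset because its default depends on whether CUDA is available.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    device: Optional[str]  # None means "let the server pick cuda or cpu"
    host: str
    port: int
    reload: bool
    default_max_tokens: int
    default_temperature: float
    default_top_p: float


def load_settings(env=None):
    """Read the documented variables from env, falling back to the table defaults."""
    env = os.environ if env is None else env
    return Settings(
        model_path=env.get("MODEL_PATH", "base_model_qwen7b"),
        device=env.get("DEVICE"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        # Assumed truthy spellings; the real server may accept others.
        reload=env.get("RELOAD", "false").strip().lower() in {"1", "true", "yes"},
        default_max_tokens=int(env.get("DEFAULT_MAX_TOKENS", "512")),
        default_temperature=float(env.get("DEFAULT_TEMPERATURE", "0.2")),
        default_top_p=float(env.get("DEFAULT_TOP_P", "0.95")),
    )
```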

Usage Examples

Python Client

import requests

API_URL = "http://localhost:8000"

# Health check
health = requests.get(f"{API_URL}/health").json()
print(f"Model loaded: {health['model_loaded']}")

# Code completion
response = requests.post(
    f"{API_URL}/generate",
    json={
        "prompt": "def merge_sort(arr):\n    \"\"\"Return sorted array.\"\"\"\n",
        "max_tokens": 200,
        "temperature": 0.3,
    }
).json()

print(response["generated_text"])

JavaScript/Node.js Client

const API_URL = "http://localhost:8000";

// Code completion
async function generate(prompt) {
  const response = await fetch(`${API_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,
      max_tokens: 128,
      temperature: 0.2,
    }),
  });
  return response.json();
}

// Top-level await requires an ES module (.mjs, or "type": "module" in package.json)
const result = await generate("def binary_search(arr, target):");
console.log(result.generated_text);

Using with OpenAI SDK (with base_url replacement)

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000"
)

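# Note: the SDK sends this request to {base_url}/completions, which is not
# among the endpoints documented above. Unless the server also exposes that
# route, an adapter or proxy that maps it to /generate is needed.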
response = client.completions.create(
    model="stack-2.9",
    prompt="def factorial(n):",
    max_tokens=100,
)
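
The OpenAI SDK posts completion requests to `/completions` relative to `base_url` and expects an OpenAI-shaped response, so some translation layer is needed between it and `/generate`. The sketch below shows only the field mapping such an adapter might perform. Both function names are hypothetical, the prompt is assumed to be a single string, and only fields that appear in this document are mapped.

```python
import time
import uuid


def to_generate_payload(openai_request):
    """Map an OpenAI-style completion request onto a /generate request body."""
    payload = {"prompt": openai_request["prompt"]}
    # These three parameters share names with the /generate parameters.
    for key in ("max_tokens", "temperature", "top_p"):
        if openai_request.get(key) is not None:
            payload[key] = openai_request[key]
    return payload


def to_openai_completion(generate_response, requested_model):
    """Wrap a /generate response in the OpenAI text-completion shape."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": requested_model,
        "choices": [
            {
                "index": 0,
                "text": generate_response["generated_text"],
                "logprobs": None,
                "finish_reason": generate_response.get("finish_reason", "stop"),
            }
        ],
    }
```

An adapter route would call `to_generate_payload`, forward the result to `/generate`, and return `to_openai_completion` of the reply.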

Performance Tips

GPU Recommended: Run on a CUDA GPU for the fastest inference
Batch Processing: Send multiple prompts one request at a time; the model is loaded once and reused across requests
Memory: Ensure adequate GPU memory; reduce max_tokens if generation runs out of memory
Temperature: Use a lower temperature (0.1-0.3) for more deterministic code, and a higher one for creative tasks
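
The sequential processing suggested above can be sketched as a simple loop that sends one `/generate` request per prompt. The helper names `post_json` and `generate_many` and the 120-second timeout are assumptions made for illustration.

```python
import json
import urllib.request


def post_json(url, payload, timeout=120):
    """POST a JSON body and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def generate_many(prompts, api_url="http://localhost:8000", post=post_json, **params):
    """Send prompts one at a time and collect the generated text in order."""
    results = []
    for prompt in prompts:
        body = post(f"{api_url}/generate", {"prompt": prompt, **params})
        results.append(body["generated_text"])
    return results
```

Extra keyword arguments such as `max_tokens=64` are passed through unchanged in every request body.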

Error Handling

503 Service Unavailable: Model not loaded or loading failed

{"detail": "Model not loaded. Check /health for status."}

500 Internal Server Error: Generation failed

{"detail": "Generation failed: <error message>"}

400 Bad Request: Invalid input

{"detail": "Last message must be from user"}
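
On the client side, the three documented status codes can be turned into distinct exceptions so that callers can, for instance, retry on 503 but not on 400. This is a minimal sketch; the exception classes and the function name `raise_for_api_error` are hypothetical, and any other error status is reported generically.

```python
class ModelNotReady(Exception):
    """Raised for 503: the model is not loaded (yet)."""


class GenerationFailed(Exception):
    """Raised for 500: the server failed while generating."""


def raise_for_api_error(status_code, body):
    """Raise an exception matching the documented error responses, if any."""
    if status_code < 400:
        return
    detail = body.get("detail", "") if isinstance(body, dict) else str(body)
    if status_code == 400:
        raise ValueError(detail)
    if status_code == 503:
        raise ModelNotReady(detail)
    if status_code == 500:
        raise GenerationFailed(detail)
    raise RuntimeError(f"Unexpected HTTP {status_code}: {detail}")
```

With the `requests` client shown earlier, this could be called as `raise_for_api_error(response.status_code, response.json())` before reading the result.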

Architecture Notes

Single Model Instance: Model is loaded once at startup and reused
Synchronous Generation: Requests are generated synchronously; inference runs under torch.no_grad() to skip gradient tracking
CORS Enabled: Accepts requests from any origin (configure for production)
No Authentication: Add middleware (e.g., API key) for production deployments
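
As an illustration of the API-key middleware mentioned above, the sketch below wraps any ASGI application (the kind uvicorn serves) and rejects HTTP requests that lack the expected key. The class name `ApiKeyMiddleware`, the `X-API-Key` header, and the 401 response body are assumptions; none of them are part of the existing server.

```python
import hmac
import json


class ApiKeyMiddleware:
    """Pure ASGI middleware that requires a shared secret in a request header."""

    def __init__(self, app, api_key, header_name=b"x-api-key"):
        self.app = app
        self.api_key = api_key.encode("utf-8")
        self.header_name = header_name  # ASGI header names are lowercase bytes

    async def __call__(self, scope, receive, send):
        # Pass through non-HTTP traffic such as lifespan events.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        headers = dict(scope.get("headers", []))
        supplied = headers.get(self.header_name, b"")
        # compare_digest avoids leaking key contents through timing.
        if hmac.compare_digest(supplied, self.api_key):
            await self.app(scope, receive, send)
            return

        body = json.dumps({"detail": "Invalid or missing API key"}).encode("utf-8")
        await send({
            "type": "http.response.start",
            "status": 401,
            "headers": [
                (b"content-type", b"application/json"),
                (b"content-length", str(len(body)).encode("ascii")),
            ],
        })
        await send({"type": "http.response.body", "body": body})
```

One way to use it would be to wrap the application object before serving it, for example `app = ApiKeyMiddleware(app, api_key="...")`, and have clients send the key in the `X-API-Key` header. Restricting CORS origins would still need to be configured separately.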