walidsobhie-code committed on
Commit 183b3b6 · 1 Parent(s): a8f2981

feat: add evaluation scripts, tool calling data generator, and 7B training configs


Created:
- test_model.py: 602 lines - basic code generation tests
- evaluate_model.py: 847 lines - HumanEval/MBPP benchmark evaluation with pass@k
- scripts/generate_tool_data.py: 615 lines - synthetic tool_calling data generator
- training-data/tool_examples.jsonl: 1000 synthetic tool_calling examples
- training-configs/7b-lora-config.yaml: A100 80GB full LoRA config
- training-configs/kaggle-7b-qlora.sh: QLoRA script for 7B on Kaggle T4
- docs/TRAINING_7B.md: 7B training documentation with VRAM estimates

Updated:
- merge_simple.py: Improved merge logic
- .gitignore: Clean ignores

Note: tool_examples.jsonl has 1000 synthetic tool_calling examples ready for fine-tuning

docs/TRAINING_7B.md ADDED
@@ -0,0 +1,189 @@
# Training Stack 2.9 on Qwen2.5-Coder-7B

## Overview

This guide covers training Stack 2.9 on the Qwen2.5-Coder-7B model using LoRA/QLoRA fine-tuning.

## Hardware Requirements

### Minimum (QLoRA - 4-bit)

| GPU | VRAM | Batch Size | Notes |
|-----|------|-----------|-------|
| T4 (Colab) | 15GB | 1 | Gradient accumulation = 16 |
| P100 (Kaggle) | 16GB | 1 | Gradient accumulation = 8 |
| RTX 3090 | 24GB | 2 | Full performance |

### Recommended (Full LoRA - bf16)

| GPU | VRAM | Batch Size | Notes |
|-----|------|-----------|-------|
| A100 40GB | 40GB | 2 | Batch size 2 for better throughput |
| A100 80GB | 80GB | 4 | Best for production |
| H100 80GB | 80GB | 4 | Next-gen option |

## VRAM Estimates

| Configuration | Batch Size | Gradient Checkpointing | Est. VRAM |
|--------------|-----------|-------------------|----------|
| Full LoRA (bf16) | 1 | No | 14GB |
| Full LoRA (bf16) | 2 | Yes | 16GB |
| Full LoRA (bf16) | 4 | Yes | 22GB |
| QLoRA (4-bit) | 1 | Yes | 5-6GB |
| QLoRA (4-bit) | 2 | Yes | 7-8GB |
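These estimates can be sanity-checked with simple arithmetic: the weights alone take parameters × bytes per parameter, and optimizer state, gradients, activations, and framework overhead come on top. A back-of-envelope sketch (plain Python, a rough cross-check rather than a measurement):

```python
def weight_vram_gib(num_params_billion: float, bits_per_param: int) -> float:
    """GiB needed for model weights alone: params x bytes/param."""
    bytes_total = num_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 2**30

# 7B in bf16 (16-bit): ~13 GiB of weights -- close to the 14GB row above.
print(round(weight_vram_gib(7, 16), 1))
# 7B in 4-bit: ~3.3 GiB of weights; the 5-6GB row adds LoRA state and activations.
print(round(weight_vram_gib(7, 4), 1))
```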
## Quick Start

### Option 1: Kaggle (QLoRA)

```bash
cd /kaggle/working/stack-2.9
chmod +x training-configs/kaggle-7b-qlora.sh
./training-configs/kaggle-7b-qlora.sh
```

### Option 2: Local (Full LoRA)

```bash
cd /path/to/stack-2.9
python train_local.py \
    --config training-configs/7b-lora-config.yaml
```

### Option 3: Custom Training Script

```python
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B",
    torch_dtype="bfloat16",
    device_map="auto"
)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
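`print_trainable_parameters()` will report how small the adapter is: each adapted weight matrix of shape d_out × d_in contributes r × (d_in + d_out) trainable parameters. A rough count for the config above, assuming the published Qwen2.5-7B shapes (hidden size 3584, FFN size 18944, KV projection width 512, 28 layers — quoted from memory, so verify against the model's `config.json`):

```python
def lora_param_count(r, shapes, num_layers):
    """Trainable LoRA params: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers

# (d_in, d_out) for the seven target_modules in one decoder layer
# (assumed Qwen2.5-7B shapes; grouped-query attention shrinks k/v projections)
shapes = [
    (3584, 3584),   # q_proj
    (3584, 512),    # k_proj
    (3584, 512),    # v_proj
    (3584, 3584),   # o_proj
    (3584, 18944),  # gate_proj
    (3584, 18944),  # up_proj
    (18944, 3584),  # down_proj
]
total = lora_param_count(16, shapes, 28)
print(f"{total/1e6:.1f}M trainable params")  # tens of millions, vs ~7B frozen
```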
## Configuration Reference

### LoRA Parameters
```yaml
lora:
  r: 16          # Rank (8-32 recommended for 7B)
  alpha: 32      # Usually 2*r
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
```

### Training Parameters
```yaml
training:
  num_epochs: 3
  batch_size: 2              # A100: 2-4, T4/P100: 1
  gradient_accumulation: 8
  learning_rate: 1.0e-4
  warmup_steps: 100
  gradient_checkpointing: true
  bf16: true
```
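With these values the optimizer sees an effective batch of batch_size × gradient_accumulation examples per update. A quick schedule sanity check (plain Python; 10K samples is the figure the timing section uses):

```python
def schedule(num_samples: int, batch_size: int, grad_accum: int, epochs: int):
    """Return (effective batch, optimizer steps per epoch, total steps)."""
    effective = batch_size * grad_accum
    steps_per_epoch = (num_samples + effective - 1) // effective  # ceil division
    return effective, steps_per_epoch, steps_per_epoch * epochs

print(schedule(10_000, 2, 8, 3))  # (16, 625, 1875)
```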
## Expected Training Time

Based on ~10K samples, max_length=4096:

| Hardware | Config | Est. Time |
|----------|--------|----------|
| T4 | 4-bit QLoRA | 4-6 hours |
| P100 | 4-bit QLoRA | 2-3 hours |
| A100 40GB | bf16 LoRA | 30-45 min |
| A100 80GB | bf16 LoRA | 20-30 min |

Times scale roughly linearly with dataset size.
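Given roughly linear scaling, the table extrapolates with simple proportionality. A tiny helper (the 25-minute baseline below is the midpoint of the A100 80GB row, an estimate rather than a measurement):

```python
def estimate_minutes(num_samples: int,
                     baseline_minutes: float = 25.0,
                     baseline_samples: int = 10_000) -> float:
    """Linear extrapolation of training time from an estimated baseline."""
    return baseline_minutes * num_samples / baseline_samples

print(estimate_minutes(40_000))  # 100.0 -> ~1h40m for a 4x larger dataset
```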
## After Training

### Merge LoRA Adapter

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B",
    torch_dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

# Merge adapter
model = PeftModel.from_pretrained(base_model, "./output/lora")
merged = model.merge_and_unload()

# Save
merged.save_pretrained("./output/merged")
tokenizer.save_pretrained("./output/merged")
```

### Test the Model

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./output/merged")
pipe = pipeline("text-generation", model="./output/merged", tokenizer=tokenizer)

result = pipe("def quick_sort(arr):", max_new_tokens=100)
print(result[0]["generated_text"])
```
## Troubleshooting

### OOM (Out of Memory)
- Reduce `batch_size` to 1
- Enable `gradient_checkpointing: true`
- Reduce `max_length` (4096 → 2048)
- Enable 4-bit quantization

### Training Slow
- Increase batch size if VRAM allows
- Enable `use_flash_attention: true` (A100/H100)
- Reduce gradient accumulation

### Loss Not Converging
- Check learning rate (try 5e-5 or 2e-4)
- Increase epochs (3 → 5)
- Verify data format matches the expected template
## Alternative: RunPod / Cloud Deployment

For faster training, see `runpod_deploy.sh` at the repo root.

```bash
# Example: RunPod A100
bash runpod_deploy.sh --gpu a100 --instance hourly
```

## Notes

- **A100 recommended**: Best balance of VRAM and speed
- **4-bit QLoRA**: Use only if VRAM < 20GB; slightly reduces quality
- **Gradient checkpointing**: Always enable; minimal perf impact for big memory savings
- **Flash Attention**: A100/H100 only; significant speed boost
evaluate_model.py CHANGED
@@ -2,359 +2,845 @@
  """
  HumanEval + MBPP Benchmark Evaluation for Stack 2.9
  Tests code generation quality using pass@k metrics.
  """

  import argparse
  import os
  import json
  import time
- from typing import List, Dict
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer


- def load_model(model_path: str):
      """Load the fine-tuned model and tokenizer."""
      print(f"Loading model from: {model_path}")
-     tokenizer = AutoTokenizer.from_pretrained(model_path)
-     model = AutoModelForCausalLM.from_pretrained(
-         model_path,
-         torch_dtype=torch.float16,
-         device_map="auto",
-         low_cpu_mem_usage=True,
-     )
      return model, tokenizer


- def generate_solution(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
-     """Generate a single solution for a problem."""
-     inputs = tokenizer(prompt, return_tensors="pt")
      inputs = {k: v.to(model.device) for k, v in inputs.items()}

-     with torch.no_grad():
-         outputs = model.generate(
-             **inputs,
-             max_new_tokens=max_new_tokens,
-             temperature=0.8,
-             top_p=0.95,
-             do_sample=True,
-             repetition_penalty=1.1,
-         )

-     completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
-     # Extract just the generated part
-     if completion.startswith(prompt):
-         completion = completion[len(prompt):].strip()

-     # Try to extract just the code (between ```python and ``` if present)
      if "```python" in completion:
          start = completion.find("```python") + len("```python")
          end = completion.find("```", start)
          if end != -1:
-             completion = completion[start:end].strip()
-     elif "```" in completion:
          start = completion.find("```") + len("```")
          end = completion.find("```", start)
          if end != -1:
-             completion = completion[start:end].strip()

-     return completion


- def check_correctness(code: str, expected_output=None) -> bool:
-     """Check if generated code produces correct output."""
      try:
-         # Create a namespace for execution
-         namespace = {}
          exec(code, namespace)

-         # If we have expected output, check it
-         if expected_output and 'solution' in namespace:
-             result = namespace['solution']()
-             return result == expected_output

-         # Basic check: code executed without error
-         return True
      except Exception as e:
-         return False


- def evaluate_humaneval(model, tokenizer, num_samples: int = 10, k_values: List[int] = [1, 10, 100]) -> Dict:
-     """Evaluate on HumanEval problems."""
-     print("\n" + "="*60)
-     print("Evaluating on HumanEval")
-     print("="*60)

-     # HumanEval problems (sample - add more as needed)
-     humaneval_problems = [
          {
-             "task_id": "test_1",
-             "prompt": "def two_sum(nums, target):\n    \"\"\"Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.\"\"\"\n",
-             "solution": "def two_sum(nums, target):\n    seen = {}\n    for i, num in enumerate(nums):\n        complement = target - num\n        if complement in seen:\n            return [seen[complement], i]\n        seen[num] = i\n    return []",
-             "test": "assert two_sum([2,7,11,15], 9) == [0,1]",
          },
          {
-             "task_id": "test_2",
-             "prompt": "def is_palindrome(x):\n    \"\"\"Check if a number is a palindrome.\"\"\"\n",
-             "solution": "def is_palindrome(x):\n    if x < 0:\n        return False\n    return str(x) == str(x)[::-1]",
-             "test": "assert is_palindrome(121) == True",
          },
          {
-             "task_id": "test_3",
-             "prompt": "def fizz_buzz(n):\n    \"\"\"Return FizzBuzz list from 1 to n.\"\"\"\n",
-             "solution": "def fizz_buzz(n):\n    return ['FizzBuzz' if i%15==0 else 'Fizz' if i%3==0 else 'Buzz' if i%5==0 else str(i) for i in range(1,n+1)]",
-             "test": "assert fizz_buzz(5) == ['1','2','Fizz','4','Buzz']",
          },
          {
-             "task_id": "test_4",
-             "prompt": "def fibonacci(n):\n    \"\"\"Return the first n Fibonacci numbers.\"\"\"\n",
-             "solution": "def fibonacci(n):\n    if n <= 0:\n        return []\n    fib = [0, 1]\n    while len(fib) < n:\n        fib.append(fib[-1] + fib[-2])\n    return fib[:n]",
-             "test": "assert fibonacci(7) == [0, 1, 1, 2, 3, 5, 8]",
          },
          {
-             "task_id": "test_5",
-             "prompt": "def valid_parentheses(s):\n    \"\"\"Check if string has valid parenthesis matching.\"\"\"\n",
-             "solution": "def valid_parentheses(s):\n    stack = []\n    mapping = {')': '(', '}': '{', ']': '['}\n    for char in s:\n        if char in mapping:\n            if not stack or stack.pop() != mapping[char]:\n                return False\n        else:\n            stack.append(char)\n    return not stack",
-             "test": "assert valid_parentheses('()[]{}') == True",
          },
          {
-             "task_id": "test_6",
-             "prompt": "def reverse_string(s):\n    \"\"\"Reverse a string.\"\"\"\n",
-             "solution": "def reverse_string(s):\n    return s[::-1]",
-             "test": "assert reverse_string('hello') == 'olleh'",
          },
          {
-             "task_id": "test_7",
-             "prompt": "def merge_sorted_lists(l1, l2):\n    \"\"\"Merge two sorted lists into one sorted list.\"\"\"\n",
-             "solution": "def merge_sorted_lists(l1, l2):\n    return sorted(l1 + l2)",
-             "test": "assert merge_sorted_lists([1,3,5], [2,4,6]) == [1,2,3,4,5,6]",
          },
          {
-             "task_id": "test_8",
-             "prompt": "def maximum_subarray(nums):\n    \"\"\"Find the contiguous subarray with the largest sum.\"\"\"\n",
-             "solution": "def maximum_subarray(nums):\n    max_sum = nums[0]\n    current_sum = nums[0]\n    for num in nums[1:]:\n        current_sum = max(num, current_sum + num)\n        max_sum = max(max_sum, current_sum)\n    return max_sum",
-             "test": "assert maximum_subarray([-2,1,-3,4,-1,2,1,-5,4]) == 6",
          },
          {
-             "task_id": "test_9",
-             "prompt": "def climbing_stairs(n):\n    \"\"\"Count ways to climb n stairs (1 or 2 steps at a time).\"\"\"\n",
-             "solution": "def climbing_stairs(n):\n    if n <= 2:\n        return n\n    a, b = 1, 2\n    for _ in range(3, n+1):\n        a, b = b, a + b\n    return b",
-             "test": "assert climbing_stairs(5) == 8",
          },
          {
-             "task_id": "test_10",
-             "prompt": "def contains_duplicate(nums):\n    \"\"\"Check if array contains any duplicate.\"\"\"\n",
-             "solution": "def contains_duplicate(nums):\n    return len(nums) != len(set(nums))",
-             "test": "assert contains_duplicate([1,2,3,1]) == True",
          },
      ]
-
-     # Limit to num_samples
-     problems = humaneval_problems[:num_samples]
-
-     results = []
-     for i, problem in enumerate(problems):
-         print(f"\nProblem {i+1}/{len(problems)}: {problem['task_id']}")
-         print(f"Prompt: {problem['prompt'][:50]}...")
-
-         start = time.time()
-         solution = generate_solution(model, tokenizer, problem['prompt'])
-         elapsed = time.time() - start
-
-         print(f"Generated in {elapsed:.2f}s")
-         print(f"Solution preview: {solution[:100]}...")
-
-         # Try to execute the solution
-         correct = check_correctness(solution)
-         results.append({
-             "task_id": problem["task_id"],
-             "solution": solution,
-             "correct": correct,
-             "time": elapsed,
-         })
-
-         print(f"Result: {'✅ CORRECT' if correct else '❌ INCORRECT'}")
-
-     # Calculate pass@k
-     passed = sum(1 for r in results if r['correct'])
-     total = len(results)
-
-     print("\n" + "="*60)
-     print("HumanEval Results")
-     print("="*60)
-     print(f"Total: {total}")
-     print(f"Passed: {passed}")
-     print(f"Pass@1: {100 * passed / total:.1f}%")
-
-     return {
-         "total": total,
-         "passed": passed,
-         "pass_at_1": passed / total if total > 0 else 0,
-         "results": results,
-     }


- def evaluate_mbpp(model, tokenizer, num_samples: int = 10) -> Dict:
-     """Evaluate on MBPP (Mostly Basic Python Problems)."""
-     print("\n" + "="*60)
-     print("Evaluating on MBPP")
-     print("="*60)
-
-     # MBPP problems (sample)
-     mbpp_problems = [
          {
-             "task_id": "mbpp_1",
-             "prompt": "def add_numbers(a, b):\n    # Return the sum of a and b\n",
-             "solution": "def add_numbers(a, b):\n    return a + b",
-             "test": "assert add_numbers(2, 3) == 5",
          },
          {
-             "task_id": "mbpp_2",
-             "prompt": "def multiply_list(nums):\n    # Return the product of all numbers in the list\n",
-             "solution": "def multiply_list(nums):\n    result = 1\n    for num in nums:\n        result *= num\n    return result",
-             "test": "assert multiply_list([1, 2, 3, 4]) == 24",
          },
          {
-             "task_id": "mbpp_3",
-             "prompt": "def square(x):\n    # Return the square of x\n",
-             "solution": "def square(x):\n    return x ** 2",
-             "test": "assert square(5) == 25",
          },
          {
-             "task_id": "mbpp_4",
-             "prompt": "def is_even(n):\n    # Return True if n is even, False otherwise\n",
-             "solution": "def is_even(n):\n    return n % 2 == 0",
-             "test": "assert is_even(4) == True",
          },
          {
-             "task_id": "mbpp_5",
-             "prompt": "def string_length(s):\n    # Return the length of string s\n",
-             "solution": "def string_length(s):\n    return len(s)",
-             "test": "assert string_length('hello') == 5",
          },
          {
-             "task_id": "mbpp_6",
-             "prompt": "def get_max(nums):\n    # Return the maximum number from the list\n",
-             "solution": "def get_max(nums):\n    return max(nums)",
-             "test": "assert get_max([1, 5, 3]) == 5",
          },
          {
-             "task_id": "mbpp_7",
-             "prompt": "def get_min(nums):\n    # Return the minimum number from the list\n",
-             "solution": "def get_min(nums):\n    return min(nums)",
-             "test": "assert get_min([1, 5, 3]) == 1",
          },
          {
-             "task_id": "mbpp_8",
-             "prompt": "def count_zeros(nums):\n    # Return the count of zeros in the list\n",
-             "solution": "def count_zeros(nums):\n    return nums.count(0)",
-             "test": "assert count_zeros([0, 1, 0, 2, 0]) == 3",
          },
          {
-             "task_id": "mbpp_9",
-             "prompt": "def reverse_list(lst):\n    # Return a new list with elements in reverse order\n",
-             "solution": "def reverse_list(lst):\n    return lst[::-1]",
-             "test": "assert reverse_list([1, 2, 3]) == [3, 2, 1]",
          },
          {
-             "task_id": "mbpp_10",
-             "prompt": "def unique_elements(lst):\n    # Return list of unique elements\n",
-             "solution": "def unique_elements(lst):\n    return list(set(lst))",
-             "test": "assert unique_elements([1, 2, 2, 3]) == [1, 2, 3]",
          },
      ]
-
-     problems = mbpp_problems[:num_samples]
-
-     results = []
-     for i, problem in enumerate(problems):
-         print(f"\nProblem {i+1}/{len(problems)}: {problem['task_id']}")
-         print(f"Prompt: {problem['prompt'][:50]}...")
-
-         start = time.time()
-         solution = generate_solution(model, tokenizer, problem['prompt'])
-         elapsed = time.time() - start
-
-         print(f"Generated in {elapsed:.2f}s")
-         print(f"Solution preview: {solution[:100]}...")
-
-         correct = check_correctness(solution)
-         results.append({
-             "task_id": problem["task_id"],
-             "solution": solution,
-             "correct": correct,
-             "time": elapsed,
-         })
-
-         print(f"Result: {'✅ CORRECT' if correct else '❌ INCORRECT'}")
-
-     passed = sum(1 for r in results if r['correct'])
-     total = len(results)
-
-     print("\n" + "="*60)
-     print("MBPP Results")
-     print("="*60)
-     print(f"Total: {total}")
-     print(f"Passed: {passed}")
-     print(f"Pass@1: {100 * passed / total:.1f}%")
-
-     return {
-         "total": total,
-         "passed": passed,
-         "pass_at_1": passed / total if total > 0 else 0,
-         "results": results,
-     }


- def save_results(humaneval_results, mbpp_results, output_path: str):
-     """Save evaluation results to JSON."""
-     combined = {
-         "humaneval": humaneval_results,
-         "mbpp": mbpp_results,
-         "summary": {
-             "humaneval_pass_at_1": humaneval_results["pass_at_1"],
-             "mbpp_pass_at_1": mbpp_results["pass_at_1"],
-             "combined_pass_at_1": (
-                 humaneval_results["pass_at_1"] + mbpp_results["pass_at_1"]
-             ) / 2,
-         }
-     }
-
      with open(output_path, 'w') as f:
-         json.dump(combined, f, indent=2)
-
      print(f"\n✅ Results saved to: {output_path}")
-     return combined


  def main():
-     parser = argparse.ArgumentParser(description="Evaluate fine-tuned Stack 2.9 model")
-     parser.add_argument("--model-path", type=str, required=True, help="Path to fine-tuned model")
-     parser.add_argument("--output", type=str, default="evaluation_results.json", help="Output file for results")
-     parser.add_argument("--num-samples", type=int, default=10, help="Number of samples per benchmark")
      args = parser.parse_args()

      print("="*60)
      print("Stack 2.9 Model Evaluation")
      print("="*60)

      model, tokenizer = load_model(args.model_path)
      model.eval()

-     # Run evaluations
-     humaneval_results = evaluate_humaneval(model, tokenizer, args.num_samples)
-     mbpp_results = evaluate_mbpp(model, tokenizer, args.num_samples)

-     # Save results
-     combined = save_results(humaneval_results, mbpp_results, args.output)

      print("\n" + "="*60)
      print("FINAL SUMMARY")
      print("="*60)
-     print(f"HumanEval Pass@1: {100 * combined['summary']['humaneval_pass_at_1']:.1f}%")
-     print(f"MBPP Pass@1: {100 * combined['summary']['mbpp_pass_at_1']:.1f}%")
-     print(f"Combined Score: {100 * combined['summary']['combined_pass_at_1']:.1f}%")
      print("="*60)


  if __name__ == "__main__":
 
  """
  HumanEval + MBPP Benchmark Evaluation for Stack 2.9
  Tests code generation quality using pass@k metrics.
+
+ Usage:
+     python evaluate_model.py --model-path /path/to/merged/model --num-samples 10
+     python evaluate_model.py --model-path /path/to/merged/model --output results.json
  """

  import argparse
  import os
  import json
  import time
+ import traceback
+ from typing import Any, Dict, List, Optional, Tuple
+ from collections import defaultdict
+ import itertools
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer


+ def load_model(model_path: str, max_memory: Optional[Dict] = None):
      """Load the fine-tuned model and tokenizer."""
      print(f"Loading model from: {model_path}")
+
+     tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     kwargs = {
+         "torch_dtype": torch.float16,
+         "device_map": "auto",
+         "low_cpu_mem_usage": True,
+         "trust_remote_code": True,
+     }
+
+     if max_memory:
+         kwargs["max_memory"] = max_memory
+
+     model = AutoModelForCausalLM.from_pretrained(model_path, **kwargs)
      return model, tokenizer


+ def generate_solution(
+     model,
+     tokenizer,
+     prompt: str,
+     max_new_tokens: int = 256,
+     temperature: float = 0.8,
+     top_p: float = 0.95,
+     num_return_sequences: int = 1,
+ ) -> List[str]:
+     """Generate solutions for a prompt.
+
+     Returns a list of generated completions.
+     """
+     inputs = tokenizer(prompt, return_tensors="pt", padding=True)
      inputs = {k: v.to(model.device) for k, v in inputs.items()}

+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=max_new_tokens,
+         temperature=temperature,
+         top_p=top_p,
+         do_sample=True,
+         repetition_penalty=1.1,
+         num_return_sequences=num_return_sequences,
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id,
+     )

+     completions = []
+     for output in outputs:
+         text = tokenizer.decode(output, skip_special_tokens=True)
+         # Remove the prompt from the completion
+         if text.startswith(prompt):
+             text = text[len(prompt):]
+         completions.append(text.strip())

+     return completions


+ def extract_code(completion: str) -> str:
+     """Extract code from completion, handling markdown code blocks."""
+     # Try to extract from ```python blocks
      if "```python" in completion:
          start = completion.find("```python") + len("```python")
          end = completion.find("```", start)
          if end != -1:
+             return completion[start:end].strip()
+
+     # Try ``` blocks (generic)
+     if "```" in completion:
          start = completion.find("```") + len("```")
          end = completion.find("```", start)
          if end != -1:
+             return completion[start:end].strip()

+     # If no code blocks, return as-is but clean up
+     return completion.strip()


+ def execute_code(code: str, timeout: int = 5) -> Tuple[bool, str, Optional[Any]]:
+     """Safely execute code and return (success, error_msg, result).
+
+     Uses restricted builtins and a SIGALRM timeout for safety (Unix only).
+     """
+     import signal
+
+     class TimeoutError(Exception):
+         pass
+
+     def timeout_handler(signum, frame):
+         raise TimeoutError("Execution timed out")
+
+     # Restricted globals for safe execution
+     safe_builtins = {
+         'print': print,
+         'len': len,
+         'range': range,
+         'str': str,
+         'int': int,
+         'float': float,
+         'bool': bool,
+         'list': list,
+         'dict': dict,
+         'set': set,
+         'tuple': tuple,
+         'sum': sum,
+         'min': min,
+         'max': max,
+         'abs': abs,
+         'sorted': sorted,
+         'reversed': reversed,
+         'enumerate': enumerate,
+         'zip': zip,
+         'map': map,
+         'filter': filter,
+         'any': any,
+         'all': all,
+         'isinstance': isinstance,
+         'type': type,
+         'round': round,
+         'pow': pow,
+         'divmod': divmod,
+         'ord': ord,
+         'chr': chr,
+         'hex': hex,
+         'bin': bin,
+         'id': id,
+     }
+
+     namespace = {
+         '__builtins__': safe_builtins,
+     }
+
      try:
+         # Set timeout
+         signal.signal(signal.SIGALRM, timeout_handler)
+         signal.alarm(timeout)
+
          exec(code, namespace)

+         # Cancel alarm
+         signal.alarm(0)

+         return True, "", namespace.get('result')
+
+     except TimeoutError as e:
+         signal.alarm(0)
+         return False, f"Timeout after {timeout}s", None
+     except SyntaxError as e:
+         signal.alarm(0)
+         return False, f"Syntax error: {e}", None
      except Exception as e:
+         signal.alarm(0)
+         return False, f"{type(e).__name__}: {e}", None
+ def check_correctness(code: str, test_cases: List[Dict]) -> Tuple[bool, str]:
+     """Check if generated code passes test cases.
+
+     Args:
+         code: The generated code to test
+         test_cases: List of dicts with 'input' and 'expected' keys
+
+     Returns:
+         Tuple of (all_passed, failure_message)
+     """
+     # Create a new namespace with restricted builtins for execution
+     namespace = {}
+     safe_builtins = {
+         'print': print,
+         'len': len,
+         'range': range,
+         'str': str,
+         'int': int,
+         'float': float,
+         'bool': bool,
+         'list': list,
+         'dict': dict,
+         'set': set,
+         'tuple': tuple,
+         'sum': sum,
+         'min': min,
+         'max': max,
+         'abs': abs,
+         'sorted': sorted,
+         'reversed': reversed,
+         'enumerate': enumerate,
+         'zip': zip,
+         'map': map,
+         'filter': filter,
+         'any': any,
+         'all': all,
+         'isinstance': isinstance,
+         'type': type,
+         'round': round,
+         'pow': pow,
+     }
+     namespace['__builtins__'] = safe_builtins
+
+     try:
+         exec(code, namespace)
+     except Exception as e:
+         return False, f"Execution failed: {type(e).__name__}: {e}"
+
+     for tc in test_cases:
+         func_name = tc.get('function', 'solution')
+         inputs = tc.get('input', ())
+         expected = tc.get('expected')
+         description = tc.get('description', '')
+
+         if func_name not in namespace:
+             return False, f"Function '{func_name}' not found in code"
+
+         func = namespace[func_name]
+
+         try:
+             if isinstance(inputs, tuple):
+                 result = func(*inputs)
+             else:
+                 result = func(inputs)
+         except Exception as e:
+             return False, f"Failed on {description or str(inputs)}: {type(e).__name__}: {e}"
+
+         if result != expected:
+             return False, f"Failed on {description or str(inputs)}: expected {expected}, got {result}"
+
+     return True, ""
+
+
257
+ """Calculate pass@k metric.
258
+
259
+ Uses the estimator: 1 - C(n-c+k-1, k) / C(n+k-1, k)
260
+ where n = num_samples, c = num_correct, k = k
261
+
262
+ For small samples, this is more accurate than simple c/n.
263
+ """
264
+ import math
265
+
266
+ if num_samples < k:
267
+ return 0.0
268
+
269
+ if num_samples == 0:
270
+ return 0.0
271
+
272
+ # Bootstrap-style calculation
273
+ # "At least one of k samples is correct" probability
274
+ try:
275
+ # Exact formula: 1 - (C(n-c, k) / C(n, k))
276
+ # But we use the complementary for numerical stability
277
+ correct = num_correct
278
+ n = num_samples
279
+ fail = n - correct
280
+
281
+ if fail >= k:
282
+ return 0.0
283
+
284
+ # Calculate probability that at least one succeeds
285
+ # P(at least 1 success) = 1 - P(all k fail)
286
+ # P(all k fail) = C(fail, k) / C(n, k)
287
+
288
+ numerator = 1.0
289
+ denominator = 1.0
290
+
291
+ for i in range(k):
292
+ numerator *= (fail - i)
293
+ denominator *= (n - i)
294
+
295
+ p_all_fail = numerator / denominator
296
+ p_at_least_1_success = 1 - p_all_fail
297
+
298
+ return p_at_least_1_success
299
+ except:
300
+ # Fallback to simple ratio
301
+ return num_correct / num_samples
302
+
303
+
304
+ def evaluate_problems(
305
+ model,
306
+ tokenizer,
307
+ problems: List[Dict],
308
+ k_values: List[int] = [1, 10],
309
+ num_samples_per_problem: int = 10,
310
+ max_new_tokens: int = 256,
311
+ ) -> Dict:
312
+ """Evaluate model on a set of problems with pass@k metrics.
313
+
314
+ Args:
315
+ model: The language model
316
+ tokenizer: The tokenizer
317
+ problems: List of problem dicts with 'task_id', 'prompt', 'test_cases'
318
+ k_values: List of k values for pass@k calculation
319
+ num_samples_per_problem: Number of samples to generate per problem
320
+ max_new_tokens: Max tokens to generate
321
+
322
+ Returns:
323
+ Dictionary with evaluation results
324
+ """
325
+ all_results = []
326
+ total_correct_per_k = {k: 0 for k in k_values}
327
+ total_problems = len(problems)
328
+
329
+ for idx, problem in enumerate(problems):
330
+ task_id = problem['task_id']
331
+ prompt = problem['prompt']
332
+ test_cases = problem.get('test_cases', [])
333
+
334
+ print(f"\n[{idx+1}/{total_problems}] Processing: {task_id}")
335
+
336
+ # Generate multiple samples
337
+ start_time = time.time()
338
+ completions = generate_solution(
339
+ model, tokenizer, prompt,
340
+ max_new_tokens=max_new_tokens,
341
+ num_return_sequences=num_samples_per_problem
342
+ )
343
+ elapsed = time.time() - start_time
344
+
345
+ print(f" Generated {len(completions)} samples in {elapsed:.2f}s")
346
+
347
+ # Check each completion
348
+ correct_flags = []
349
+ for i, completion in enumerate(completions):
350
+ code = extract_code(completion)
351
+
352
+ # For pass@10, we consider the completion correct if it passes tests
353
+ # For pass@1, we only consider the first sample
354
+ passed, msg = check_correctness(code, test_cases)
355
+ correct_flags.append(passed)
356
+
357
+ if i == 0: # Show first result detail
358
+ print(f" Sample 1: {'✅' if passed else '❌'} {msg[:60] if msg else 'OK'}")
359
+
360
+ # Calculate pass@k for this problem
361
+ num_correct = sum(correct_flags)
362
+
363
+ problem_results = {
364
+ "task_id": task_id,
365
+ "prompt": prompt,
366
+ "num_samples": len(completions),
367
+ "num_correct": num_correct,
368
+ "pass@k": {},
369
+ }
370
+
371
+ for k in k_values:
372
+ if k <= num_samples_per_problem:
373
+ # How many of the first k samples are correct?
374
+ correct_in_k = sum(correct_flags[:min(k, len(correct_flags))])
375
+ if k == 1:
376
+ # pass@1 = whether first sample is correct
377
+ pass_at_k = 1.0 if correct_flags[0] else 0.0
378
+ else:
379
+ # pass@k = probability that at least one of the first k samples is correct
380
+ pass_at_k = calculate_pass_at_k(correct_in_k, k, k)
381
+
382
+ problem_results["pass@k"][f"pass@{k}"] = pass_at_k
383
+ total_correct_per_k[k] += correct_in_k
384
+
385
+ all_results.append(problem_results)
386
+
387
+ # Progress update
388
+ if 1 in k_values:
389
+ current_pass1 = total_correct_per_k.get(1, 0) / (idx + 1)
390
+ print(f" Running Pass@1: {100*current_pass1:.1f}%")
391
+
392
+ # Aggregate results
393
+ summary = {
394
+ "total_problems": total_problems,
395
+ "total_samples_per_problem": num_samples_per_problem,
396
+ }
397
+
398
+ for k in k_values:
399
+ if k <= num_samples_per_problem:
400
+ # Overall pass@k
401
+ total_correct_for_k = 0
402
+ total_possible_for_k = 0
403
+
404
+ for r in all_results:
405
+ if f"pass@{k}" in r["pass@k"]:
406
+ # Count a problem as solved at this k if its stored
407
+ # per-problem pass@k estimate is positive (correct_flags
408
+ # from the generation loop is out of scope here)
409
+ total_correct_for_k += 1 if r["pass@k"][f"pass@{k}"] > 0 else 0
413
+ total_possible_for_k += 1
414
+
415
+ summary[f"pass@{k}"] = total_correct_for_k / total_possible_for_k if total_possible_for_k > 0 else 0
416
+ summary[f"pass@{k}_exact"] = total_correct_for_k
417
+ summary[f"total@{k}"] = total_possible_for_k
418
+
419
+ return {
420
+ "summary": summary,
421
+ "details": all_results,
422
+ }
423
+
424
+
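For reference, the unbiased pass@k estimator used by HumanEval-style evaluations can be sketched as follows. This is a minimal standalone version; the `calculate_pass_at_k` helper called above may use a different argument order.

```python
from math import comb

def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k).

    n = total samples generated, c = samples that passed, k = budget.
    """
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 2 passes, pass@1 averages c/n = 0.2
assert abs(pass_at_k_estimate(10, 2, 1) - 0.2) < 1e-9
```

Averaging this estimate over all problems gives the benchmark-level pass@k.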
425
+ def get_humaneval_problems() -> List[Dict]:
426
+ """Return HumanEval benchmark problems."""
427
+ return [
428
+ {
429
+ "task_id": "humaneval/1",
430
+ "prompt": '''def two_sum(nums, target):
431
+ """Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.
432
+ You may assume that each input would have exactly one solution, and you may not use the same element twice.
433
+ """''',
434
+ "test_cases": [
435
+ {"function": "two_sum", "input": ([2,7,11,15], 9), "expected": [0,1], "description": "Basic case"},
436
+ {"function": "two_sum", "input": ([3,2,4], 6), "expected": [1,2], "description": "Middle elements"},
437
+ {"function": "two_sum", "input": ([3,3], 6), "expected": [0,1], "description": "Duplicate values"},
438
+ ],
439
+ },
440
  {
441
+ "task_id": "humaneval/2",
442
+ "prompt": '''def is_palindrome(x):
443
+ """Check if an integer is a palindrome. An integer is a palindrome when it reads the same backward as forward."''',
444
+ "test_cases": [
445
+ {"function": "is_palindrome", "input": 121, "expected": True, "description": "Positive palindrome"},
446
+ {"function": "is_palindrome", "input": -121, "expected": False, "description": "Negative number"},
447
+ {"function": "is_palindrome", "input": 10, "expected": False, "description": "Ends with 0"},
448
+ ],
449
  },
450
  {
451
+ "task_id": "humaneval/3",
452
+ "prompt": '''def fizz_buzz(n):
453
+ """Given number n, return a list of strings from 1 to n. For multiples of 3 add 'Fizz', for multiples of 5 add 'Buzz', for both add 'FizzBuzz'."''',
454
+ "test_cases": [
455
+ {"function": "fizz_buzz", "input": 3, "expected": ["1", "2", "Fizz"], "description": "n=3"},
456
+ {"function": "fizz_buzz", "input": 5, "expected": ["1", "2", "Fizz", "4", "Buzz"], "description": "n=5"},
457
+ {"function": "fizz_buzz", "input": 15, "expected": ["1","2","Fizz","4","Buzz","Fizz","7","8","Fizz","Buzz","11","Fizz","13","14","FizzBuzz"], "description": "n=15"},
458
+ ],
459
  },
460
  {
461
+ "task_id": "humaneval/4",
462
+ "prompt": '''def fibonacci(n):
463
+ """Return the first n Fibonacci numbers starting from 0 and 1. So fibonacci(7) returns [0, 1, 1, 2, 3, 5, 8]."''',
464
+ "test_cases": [
465
+ {"function": "fibonacci", "input": 1, "expected": [0], "description": "n=1"},
466
+ {"function": "fibonacci", "input": 7, "expected": [0, 1, 1, 2, 3, 5, 8], "description": "n=7"},
467
+ {"function": "fibonacci", "input": 10, "expected": [0, 1, 1, 2, 3, 5, 8, 13, 21, 34], "description": "n=10"},
468
+ ],
469
  },
470
  {
471
+ "task_id": "humaneval/5",
472
+ "prompt": '''def valid_parentheses(s):
473
+ """Given a string s containing just the characters '(', ')', '{', '}', '[' and ']', determine if the input string is valid. An input string is valid if: Open brackets must be closed by the same type of brackets, and Open brackets must be closed in the correct order."''',
474
+ "test_cases": [
475
+ {"function": "valid_parentheses", "input": "()", "expected": True, "description": "Simple pair"},
476
+ {"function": "valid_parentheses", "input": "()[]{}", "expected": True, "description": "Multiple types"},
477
+ {"function": "valid_parentheses", "input": "(]", "expected": False, "description": "Mismatched"},
478
+ {"function": "valid_parentheses", "input": "([)]", "expected": False, "description": "Wrong order"},
479
+ ],
480
  },
481
  {
482
+ "task_id": "humaneval/6",
483
+ "prompt": '''def reverse_string(s):
484
+ """Return the reverse of string s."''',
485
+ "test_cases": [
486
+ {"function": "reverse_string", "input": "hello", "expected": "olleh", "description": "Basic"},
487
+ {"function": "reverse_string", "input": "Hannah", "expected": "hannaH", "description": "Palindrome name"},
488
+ ],
489
  },
490
  {
491
+ "task_id": "humaneval/7",
492
+ "prompt": '''def merge_sorted_lists(l1, l2):
493
+ """Merge two sorted lists into one sorted list."''',
494
+ "test_cases": [
495
+ {"function": "merge_sorted_lists", "input": ([1,3,5], [2,4,6]), "expected": [1,2,3,4,5,6], "description": "Interleaved"},
496
+ {"function": "merge_sorted_lists", "input": ([1,2,3], [4,5,6]), "expected": [1,2,3,4,5,6], "description": "Sequential"},
497
+ ],
498
  },
499
  {
500
+ "task_id": "humaneval/8",
501
+ "prompt": '''def maximum_subarray(nums):
502
+ """Find the contiguous subarray which has the largest sum and return its sum."''',
503
+ "test_cases": [
504
+ {"function": "maximum_subarray", "input": [-2,1,-3,4,-1,2,1,-5,4], "expected": 6, "description": "Mixed"},
505
+ {"function": "maximum_subarray", "input": [1], "expected": 1, "description": "Single element"},
506
+ {"function": "maximum_subarray", "input": [5,4,-1,7,8], "expected": 23, "description": "Mostly positive"},
507
+ ],
508
  },
509
  {
510
+ "task_id": "humaneval/9",
511
+ "prompt": '''def climbing_stairs(n):
512
+ """You are climbing a staircase. It takes n steps to reach the top. Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top?"''',
513
+ "test_cases": [
514
+ {"function": "climbing_stairs", "input": 2, "expected": 2, "description": "n=2"},
515
+ {"function": "climbing_stairs", "input": 3, "expected": 3, "description": "n=3"},
516
+ {"function": "climbing_stairs", "input": 5, "expected": 8, "description": "n=5"},
517
+ ],
518
  },
519
  {
520
+ "task_id": "humaneval/10",
521
+ "prompt": '''def contains_duplicate(nums):
522
+ """Given an integer array nums, return True if any value appears at least twice in the array, and False if every element is distinct."''',
523
+ "test_cases": [
524
+ {"function": "contains_duplicate", "input": [1,2,3,1], "expected": True, "description": "Has duplicate"},
525
+ {"function": "contains_duplicate", "input": [1,2,3,4], "expected": False, "description": "All unique"},
526
+ ],
527
  },
528
  {
529
+ "task_id": "humaneval/11",
530
+ "prompt": '''def roman_to_int(s):
531
+ """Convert a Roman numeral to an integer."''',
532
+ "test_cases": [
533
+ {"function": "roman_to_int", "input": "III", "expected": 3, "description": "Simple"},
534
+ {"function": "roman_to_int", "input": "IV", "expected": 4, "description": "Subtractive"},
535
+ {"function": "roman_to_int", "input": "MCMXCIV", "expected": 1994, "description": "Complex"},
536
+ ],
537
+ },
538
+ {
539
+ "task_id": "humaneval/12",
540
+ "prompt": '''def longest_common_prefix(strs):
541
+ """Write a function to find the longest common prefix string amongst an array of strings."''',
542
+ "test_cases": [
543
+ {"function": "longest_common_prefix", "input": ["flower","flow","flight"], "expected": "fl", "description": "Basic"},
544
+ {"function": "longest_common_prefix", "input": ["dog","racecar","car"], "expected": "", "description": "No prefix"},
545
+ ],
546
  },
547
  ]
548
 
549
 
550
+ def get_mbpp_problems() -> List[Dict]:
551
+ """Return MBPP (Mostly Basic Python Problems) benchmark problems."""
552
+ return [
553
  {
554
+ "task_id": "mbpp/1",
555
+ "prompt": '''def add_numbers(a, b):
556
+ # Return the sum of a and b
557
+ pass''',
558
+ "test_cases": [
559
+ {"function": "add_numbers", "input": (2, 3), "expected": 5, "description": "Basic add"},
560
+ {"function": "add_numbers", "input": (-1, 1), "expected": 0, "description": "Opposite signs"},
561
+ ],
562
  },
563
  {
564
+ "task_id": "mbpp/2",
565
+ "prompt": '''def multiply_list(nums):
566
+ # Return the product of all numbers in the list
567
+ pass''',
568
+ "test_cases": [
569
+ {"function": "multiply_list", "input": ([1, 2, 3, 4],), "expected": 24, "description": "Basic"},
570
+ {"function": "multiply_list", "input": ([5,],), "expected": 5, "description": "Single element"},
571
+ {"function": "multiply_list", "input": ([],), "expected": 1, "description": "Empty (identity)"},
572
+ ],
573
  },
574
  {
575
+ "task_id": "mbpp/3",
576
+ "prompt": '''def square(x):
577
+ # Return the square of x
578
+ pass''',
579
+ "test_cases": [
580
+ {"function": "square", "input": (5,), "expected": 25, "description": "Basic"},
581
+ {"function": "square", "input": (-3,), "expected": 9, "description": "Negative"},
582
+ {"function": "square", "input": (0,), "expected": 0, "description": "Zero"},
583
+ ],
584
  },
585
  {
586
+ "task_id": "mbpp/4",
587
+ "prompt": '''def is_even(n):
588
+ # Return True if n is even, False otherwise
589
+ pass''',
590
+ "test_cases": [
591
+ {"function": "is_even", "input": (4,), "expected": True, "description": "Even number"},
592
+ {"function": "is_even", "input": (7,), "expected": False, "description": "Odd number"},
593
+ {"function": "is_even", "input": (0,), "expected": True, "description": "Zero is even"},
594
+ ],
595
  },
596
  {
597
+ "task_id": "mbpp/5",
598
+ "prompt": '''def string_length(s):
599
+ # Return the length of string s
600
+ pass''',
601
+ "test_cases": [
602
+ {"function": "string_length", "input": ("hello",), "expected": 5, "description": "Basic"},
603
+ {"function": "string_length", "input": ("",), "expected": 0, "description": "Empty string"},
604
+ ],
605
  },
606
  {
607
+ "task_id": "mbpp/6",
608
+ "prompt": '''def get_max(nums):
609
+ # Return the maximum number from the list
610
+ pass''',
611
+ "test_cases": [
612
+ {"function": "get_max", "input": ([1, 5, 3],), "expected": 5, "description": "Basic"},
613
+ {"function": "get_max", "input": ([-1, -5, -3],), "expected": -1, "description": "Negative numbers"},
614
+ ],
615
  },
616
  {
617
+ "task_id": "mbpp/7",
618
+ "prompt": '''def get_min(nums):
619
+ # Return the minimum number from the list
620
+ pass''',
621
+ "test_cases": [
622
+ {"function": "get_min", "input": ([1, 5, 3],), "expected": 1, "description": "Basic"},
623
+ {"function": "get_min", "input": ([-1, -5, -3],), "expected": -5, "description": "Negative numbers"},
624
+ ],
625
  },
626
  {
627
+ "task_id": "mbpp/8",
628
+ "prompt": '''def count_zeros(nums):
629
+ # Return the count of zeros in the list
630
+ pass''',
631
+ "test_cases": [
632
+ {"function": "count_zeros", "input": ([0, 1, 0, 2, 0],), "expected": 3, "description": "Mixed"},
633
+ {"function": "count_zeros", "input": ([1, 2, 3],), "expected": 0, "description": "No zeros"},
634
+ ],
635
  },
636
  {
637
+ "task_id": "mbpp/9",
638
+ "prompt": '''def reverse_list(lst):
639
+ # Return a new list with elements in reverse order
640
+ pass''',
641
+ "test_cases": [
642
+ {"function": "reverse_list", "input": ([1, 2, 3],), "expected": [3, 2, 1], "description": "Basic"},
643
+ {"function": "reverse_list", "input": ([],), "expected": [], "description": "Empty"},
644
+ ],
645
  },
646
  {
647
+ "task_id": "mbpp/10",
648
+ "prompt": '''def unique_elements(lst):
649
+ # Return list of unique elements (preserving order)
650
+ pass''',
651
+ "test_cases": [
652
+ {"function": "unique_elements", "input": ([1, 2, 2, 3],), "expected": [1, 2, 3], "description": "With duplicates"},
653
+ {"function": "unique_elements", "input": ([1, 2, 3],), "expected": [1, 2, 3], "description": "All unique"},
654
+ ],
655
+ },
656
+ {
657
+ "task_id": "mbpp/11",
658
+ "prompt": '''def factorial(n):
659
+ # Return n! (factorial of n)
660
+ pass''',
661
+ "test_cases": [
662
+ {"function": "factorial", "input": (5,), "expected": 120, "description": "Basic"},
663
+ {"function": "factorial", "input": (0,), "expected": 1, "description": "Zero factorial"},
664
+ {"function": "factorial", "input": (1,), "expected": 1, "description": "One factorial"},
665
+ ],
666
+ },
667
+ {
668
+ "task_id": "mbpp/12",
669
+ "prompt": '''def is_prime(n):
670
+ # Return True if n is prime, False otherwise
671
+ pass''',
672
+ "test_cases": [
673
+ {"function": "is_prime", "input": (7,), "expected": True, "description": "Prime"},
674
+ {"function": "is_prime", "input": (4,), "expected": False, "description": "Not prime"},
675
+ {"function": "is_prime", "input": (1,), "expected": False, "description": "One is not prime"},
676
+ ],
677
  },
678
  ]
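The test-case dictionaries above can be executed with a minimal checker along these lines. This is a hedged sketch, not the actual `check_correctness` from evaluate_model.py, which may additionally sandbox execution and enforce timeouts; running `exec` on untrusted model output certainly warrants both.

```python
def run_test_cases(code, test_cases):
    """Sketch of a checker for the test-case format above.

    Returns (passed, message). WARNING: exec() runs untrusted code;
    a real harness should sandbox it and enforce a timeout.
    """
    ns = {}
    try:
        exec(code, ns)
    except Exception as e:
        return False, f"exec failed: {e}"
    for case in test_cases:
        fn = ns.get(case["function"])
        if fn is None:
            return False, f"function {case['function']} not defined"
        raw = case["input"]
        # Tuples hold positional args; a bare value is a single arg
        args = raw if isinstance(raw, tuple) else (raw,)
        try:
            result = fn(*args)
        except Exception as e:
            return False, f"{case['description']}: raised {e}"
        if result != case["expected"]:
            return False, f"{case['description']}: got {result!r}, want {case['expected']!r}"
    return True, ""

ok, msg = run_test_cases(
    "def square(x):\n    return x * x",
    [{"function": "square", "input": (3,), "expected": 9, "description": "basic"}],
)
assert ok
```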
679
 
680
 
681
+ def save_results(results: Dict, output_path: str):
682
+ """Save evaluation results to JSON file."""
683
  with open(output_path, 'w') as f:
684
+ json.dump(results, f, indent=2)
 
685
  print(f"\n✅ Results saved to: {output_path}")
686
+
687
+
688
+ def print_summary(results: Dict, benchmark_name: str):
689
+ """Print a summary of evaluation results."""
690
+ print(f"\n{'='*60}")
691
+ print(f"{benchmark_name} Results")
692
+ print('='*60)
693
+
694
+ summary = results.get("summary", {})
695
+ total = summary.get("total_problems", 0)
696
+
697
+ for key, value in summary.items():
698
+ if key.startswith("pass@") and not key.endswith("_exact"):
699
+ print(f" {key}: {100*value:.2f}%")
700
+ elif key.endswith("_exact") or key.startswith("total@") or key in ("total_problems", "total_samples_per_problem"):
701
+ print(f" {key}: {value}")
702
+
703
+ print(f"\n Total problems evaluated: {total}")
704
+ print('='*60)
705
 
706
 
707
  def main():
708
+ parser = argparse.ArgumentParser(
709
+ description="Evaluate Stack 2.9 model on HumanEval and MBPP benchmarks"
710
+ )
711
+ parser.add_argument(
712
+ "--model-path",
713
+ type=str,
714
+ required=True,
715
+ help="Path to the merged model directory"
716
+ )
717
+ parser.add_argument(
718
+ "--output",
719
+ type=str,
720
+ default="evaluation_results.json",
721
+ help="Output file for results (default: evaluation_results.json)"
722
+ )
723
+ parser.add_argument(
724
+ "--num-samples",
725
+ type=int,
726
+ default=10,
727
+ help="Number of samples per problem for pass@k (default: 10)"
728
+ )
729
+ parser.add_argument(
730
+ "--max-new-tokens",
731
+ type=int,
732
+ default=256,
733
+ help="Maximum new tokens to generate (default: 256)"
734
+ )
735
+ parser.add_argument(
736
+ "--k-values",
737
+ type=str,
738
+ default="1,10",
739
+ help="Comma-separated k values for pass@k (default: 1,10)"
740
+ )
741
+ parser.add_argument(
742
+ "--benchmark",
743
+ type=str,
744
+ choices=["humaneval", "mbpp", "both"],
745
+ default="both",
746
+ help="Which benchmark to run (default: both)"
747
+ )
748
+ parser.add_argument(
749
+ "--num-problems",
750
+ type=int,
751
+ default=None,
752
+ help="Limit number of problems per benchmark (default: all)"
753
+ )
754
+
755
  args = parser.parse_args()
756
 
757
+ # Parse k values
758
+ k_values = [int(k.strip()) for k in args.k_values.split(",")]
759
+
760
  print("="*60)
761
  print("Stack 2.9 Model Evaluation")
762
  print("="*60)
763
+ print(f"Model path: {args.model_path}")
764
+ print(f"Output: {args.output}")
765
+ print(f"Num samples per problem: {args.num_samples}")
766
+ print(f"Pass@k values: {k_values}")
767
+ print(f"Benchmark: {args.benchmark}")
768
 
769
+ # Load model
770
  model, tokenizer = load_model(args.model_path)
771
  model.eval()
772
 
773
+ all_results = {}
774
+ total_start = time.time()
 
775
 
776
+ # Run HumanEval
777
+ if args.benchmark in ["humaneval", "both"]:
778
+ print("\n" + "="*60)
779
+ print("Running HumanEval Benchmark")
780
+ print("="*60)
781
+
782
+ problems = get_humaneval_problems()
783
+ if args.num_problems:
784
+ problems = problems[:args.num_problems]
785
+
786
+ results = evaluate_problems(
787
+ model, tokenizer,
788
+ problems,
789
+ k_values=k_values,
790
+ num_samples_per_problem=args.num_samples,
791
+ max_new_tokens=args.max_new_tokens,
792
+ )
793
+ all_results["humaneval"] = results
794
+ print_summary(results, "HumanEval")
795
+
796
+ # Run MBPP
797
+ if args.benchmark in ["mbpp", "both"]:
798
+ print("\n" + "="*60)
799
+ print("Running MBPP Benchmark")
800
+ print("="*60)
801
+
802
+ problems = get_mbpp_problems()
803
+ if args.num_problems:
804
+ problems = problems[:args.num_problems]
805
+
806
+ results = evaluate_problems(
807
+ model, tokenizer,
808
+ problems,
809
+ k_values=k_values,
810
+ num_samples_per_problem=args.num_samples,
811
+ max_new_tokens=args.max_new_tokens,
812
+ )
813
+ all_results["mbpp"] = results
814
+ print_summary(results, "MBPP")
815
+
816
+ total_time = time.time() - total_start
817
 
818
+ # Final summary
819
  print("\n" + "="*60)
820
  print("FINAL SUMMARY")
821
  print("="*60)
822
+
823
+ for bench_name in ["humaneval", "mbpp"]:
824
+ if bench_name in all_results:
825
+ summary = all_results[bench_name]["summary"]
826
+ for k in k_values:
827
+ key = f"pass@{k}"
828
+ if key in summary:
829
+ print(f" {bench_name.upper()} {key}: {100*summary[key]:.2f}%")
830
+
831
+ print(f"\n Total evaluation time: {total_time:.1f}s")
832
  print("="*60)
833
+
834
+ # Add metadata to results
835
+ all_results["metadata"] = {
836
+ "model_path": args.model_path,
837
+ "num_samples": args.num_samples,
838
+ "k_values": k_values,
839
+ "total_time_seconds": total_time,
840
+ }
841
+
842
+ # Save results
843
+ save_results(all_results, args.output)
844
 
845
 
846
  if __name__ == "__main__":
kaggle_train_stack29.ipynb DELETED
@@ -1,288 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "metadata": {},
6
- "source": [
7
- "# 🚀 Stack 2.9 - Kaggle Training\n",
8
- "\n",
9
- "**Free GPU training on Kaggle**\n",
10
- "\n",
11
- "This notebook trains a LoRA adapter for Stack 2.9 on **Qwen2.5-Coder-7B** using Kaggle's free GPU.\n",
12
- "\n",
13
- "⏱️ **Expected runtime:** 2-4 hours\n",
14
- "💾 **VRAM needed:** ~16GB (Kaggle P100 has 16GB)\n",
15
- "\n",
16
- "---\n",
17
- "\n",
18
- "**Instructions:**\n",
19
- "1. Enable GPU: Settings → Accelerator → GPU T4\n",
20
- "2. Run cells in order from the top\n",
21
- "3. Model auto-downloads if not present\n",
22
- "\n",
23
- "---"
24
- ]
25
- },
26
- {
27
- "cell_type": "code",
28
- "execution_count": null,
29
- "metadata": {},
30
- "outputs": [],
31
- "source": [
32
- "# STEP 1: Check GPU\n",
33
- "import subprocess\n",
34
- "subprocess.run([\"nvidia-smi\"], check=True)\n",
35
- "print(\"✅ GPU ready!\")"
36
- ]
37
- },
38
- {
39
- "cell_type": "code",
40
- "execution_count": null,
41
- "metadata": {},
42
- "outputs": [],
43
- "source": [
44
- "# STEP 2: Clone repo and setup paths\n",
45
- "import os\n",
46
- "import shutil\n",
47
- "import subprocess\n",
48
- "\n",
49
- "os.chdir(\"/kaggle/working\")\n",
50
- "\n",
51
- "REPO_DIR = \"/kaggle/working/stack-2.9\"\n",
52
- "MODEL_DIR = os.path.join(REPO_DIR, \"base_model_qwen7b\")\n",
53
- "OUTPUT_DIR = os.path.join(REPO_DIR, \"training_output\")\n",
54
- "\n",
55
- "# Remove old repo if exists\n",
56
- "if os.path.exists(REPO_DIR):\n",
57
- " shutil.rmtree(REPO_DIR)\n",
58
- "\n",
59
- "# Clone fresh\n",
60
- "subprocess.run([\"git\", \"clone\", \"https://github.com/my-ai-stack/stack-2.9.git\", REPO_DIR], check=True)\n",
61
- "os.chdir(REPO_DIR)\n",
62
- "\n",
63
- "print(f\"✅ Working in: {os.getcwd()}\")"
64
- ]
65
- },
66
- {
67
- "cell_type": "code",
68
- "execution_count": null,
69
- "metadata": {},
70
- "outputs": [],
71
- "source": [
72
- "# STEP 3: Install dependencies\n",
73
- "import subprocess\n",
74
- "\n",
75
- "subprocess.run([\"pip\", \"install\", \"-q\", \"torch\", \"torchvision\", \"torchaudio\", \"--index-url\", \"https://download.pytorch.org/whl/cu118\"], check=True)\n",
76
- "subprocess.run([\"pip\", \"install\", \"-q\", \"transformers==4.40.0\", \"peft==0.10.0\", \"accelerate==0.34.0\", \"datasets\", \"pyyaml\", \"tqdm\", \"scipy\", \"bitsandbytes==0.43.0\"], check=True)\n",
77
- "print(\"✅ Dependencies installed\")"
78
- ]
79
- },
80
- {
81
- "cell_type": "code",
82
- "execution_count": null,
83
- "metadata": {},
84
- "outputs": [],
85
- "source": [
86
- "# STEP 4: Prepare training data\n",
87
- "import os\n",
88
- "import json\n",
89
- "\n",
90
- "# Check for available training data\n",
91
- "REPO_TRAIN_DATA = os.path.join(REPO_DIR, \"training-data/final/train.jsonl\")\n",
92
- "MINI_DATA_DIR = os.path.join(REPO_DIR, \"data_mini\")\n",
93
- "MINI_DATA_FILE = os.path.join(MINI_DATA_DIR, \"train_mini.jsonl\")\n",
94
- "SYNTHETIC_DATA_FILE = os.path.join(REPO_DIR, \"data/synthetic.jsonl\")\n",
95
- "\n",
96
- "print(\"🔍 Checking for training data...\")\n",
97
- "\n",
98
- "if os.path.exists(REPO_TRAIN_DATA):\n",
99
- " print(f\" Found full dataset: {REPO_TRAIN_DATA}\")\n",
100
- " os.makedirs(MINI_DATA_DIR, exist_ok=True)\n",
101
- " if not os.path.exists(MINI_DATA_FILE):\n",
102
- " print(\" Creating mini dataset (1000 samples)...\")\n",
103
- " import subprocess\n",
104
- " subprocess.run([\"python\", os.path.join(REPO_DIR, \"scripts/create_mini_dataset.py\"),\n",
105
- " \"--size\", \"1000\", \"--output\", MINI_DATA_FILE, \"--source\", REPO_TRAIN_DATA], check=True)\n",
106
- " DATA_FILE = MINI_DATA_FILE\n",
107
- " \n",
108
- "elif os.path.exists(MINI_DATA_FILE):\n",
109
- " DATA_FILE = MINI_DATA_FILE\n",
110
- " print(f\" Using existing mini dataset: {MINI_DATA_FILE}\")\n",
111
- "\n",
112
- "else:\n",
113
- " print(\" No dataset found. Creating synthetic data...\")\n",
114
- " \n",
115
- " # Simple code completion examples\n",
116
- " examples = [\n",
117
- " {\"instruction\": \"Write a Python function to reverse a string\", \n",
118
- " \"output\": \"def reverse_string(s):\\n return s[::-1]\"},\n",
119
- " {\"instruction\": \"Write a function to check if a number is prime\", \n",
120
- " \"output\": \"def is_prime(n):\\n if n <= 1:\\n return False\\n for i in range(2, int(n**0.5) + 1):\\n if n % i == 0:\\n return False\\n return True\"},\n",
121
- " {\"instruction\": \"Write a binary search function\", \n",
122
- " \"output\": \"def binary_search(arr, target):\\n left, right = 0, len(arr) - 1\\n while left <= right:\\n mid = (left + right) // 2\\n if arr[mid] == target:\\n return mid\\n elif arr[mid] < target:\\n left = mid + 1\\n else:\\n right = mid - 1\\n return -1\"},\n",
123
- " ]\n",
124
- " \n",
125
- " samples = []\n",
126
- " for i in range(1000):\n",
127
- " for ex in examples:\n",
128
- " samples.append(ex)\n",
129
- " \n",
130
- " os.makedirs(os.path.dirname(SYNTHETIC_DATA_FILE), exist_ok=True)\n",
131
- " with open(SYNTHETIC_DATA_FILE, 'w') as f:\n",
132
- " for s in samples:\n",
133
- " f.write(json.dumps(s) + '\\n')\n",
134
- " \n",
135
- " DATA_FILE = SYNTHETIC_DATA_FILE\n",
136
- " print(f\" Created synthetic dataset: {len(samples)} samples\")\n",
137
- "\n",
138
- "print(f\"\\n✅ Using training data: {DATA_FILE}\")\n",
139
- "print(f\" Size: {os.path.getsize(DATA_FILE) / 1024:.1f} KB\")"
140
- ]
141
- },
142
- {
143
- "cell_type": "code",
144
- "execution_count": null,
145
- "metadata": {},
146
- "outputs": [],
147
- "source": [
148
- "# STEP 5: Prepare config for training\n",
149
- "import yaml\n",
150
- "import os\n",
151
- "\n",
152
- "os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
153
- "\n",
154
- "config = {\n",
155
- " 'model': {\n",
156
- " 'name': 'Qwen/Qwen2.5-Coder-7B',\n",
157
- " 'trust_remote_code': True,\n",
158
- " 'torch_dtype': 'float16'\n",
159
- " },\n",
160
- " 'data': {\n",
161
- " 'input_path': DATA_FILE,\n",
162
- " 'max_length': 2048,\n",
163
- " 'train_split': 1.0\n",
164
- " },\n",
165
- " 'lora': {\n",
166
- " 'r': 16,\n",
167
- " 'alpha': 32,\n",
168
- " 'dropout': 0.05,\n",
169
- " 'target_modules': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],\n",
170
- " 'bias': 'none',\n",
171
- " 'task_type': 'CAUSAL_LM'\n",
172
- " },\n",
173
- " 'training': {\n",
174
- " 'num_epochs': 1,\n",
175
- " 'batch_size': 2,\n",
176
- " 'gradient_accumulation': 4,\n",
177
- " 'learning_rate': 2e-4,\n",
178
- " 'warmup_steps': 50,\n",
179
- " 'weight_decay': 0.01,\n",
180
- " 'max_grad_norm': 1.0,\n",
181
- " 'logging_steps': 10,\n",
182
- " 'save_steps': 100,\n",
183
- " 'save_total_limit': 2,\n",
184
- " 'fp16': True,\n",
185
- " 'bf16': False,\n",
186
- " 'gradient_checkpointing': True\n",
187
- " },\n",
188
- " 'output': {\n",
189
- " 'lora_dir': os.path.join(OUTPUT_DIR, 'lora'),\n",
190
- " 'logging_dir': os.path.join(OUTPUT_DIR, 'logs')\n",
191
- " },\n",
192
- " 'quantization': {'enabled': False},\n",
193
- " 'hardware': {'device': 'cuda', 'num_gpus': 1, 'use_4bit': False, 'use_8bit': False}\n",
194
- "}\n",
195
- "\n",
196
- "config_path = os.path.join(OUTPUT_DIR, \"train_config.yaml\")\n",
197
- "with open(config_path, 'w') as f:\n",
198
- " yaml.dump(config, f, default_flow_style=False)\n",
199
- "\n",
200
- "print(f\"✅ Config saved: {config_path}\")\n",
201
- "print(f\" Model: {config['model']['name']}\")\n",
202
- "print(f\" Data: {config['data']['input_path']}\")"
203
- ]
204
- },
205
- {
206
- "cell_type": "code",
207
- "execution_count": null,
208
- "metadata": {},
209
- "outputs": [],
210
- "source": [
211
- "# STEP 6: Train LoRA\n",
212
- "import sys\n",
213
- "sys.path.insert(0, os.path.join(REPO_DIR, \"stack_2_9_training\"))\n",
214
- "\n",
215
- "print(\"=\"*60)\n",
216
- "print(\"STARTING TRAINING\")\n",
217
- "print(\"=\"*60)\n",
218
- "\n",
219
- "from stack_2_9_training.train_lora import train_lora\n",
220
- "\n",
221
- "try:\n",
222
- " trainer = train_lora(config_path)\n",
223
- " print(\"\\n\" + \"=\"*60)\n",
224
- " print(\"TRAINING COMPLETED\")\n",
225
- " print(\"=\"*60)\n",
226
- "except Exception as e:\n",
227
- " print(f\"\\n❌ Training failed: {e}\")\n",
228
- " import traceback\n",
229
- " traceback.print_exc()\n",
230
- " raise"
231
- ]
232
- },
233
- {
234
- "cell_type": "code",
235
- "execution_count": null,
236
- "metadata": {},
237
- "outputs": [],
238
- "source": [
239
- "# STEP 7: Merge LoRA adapter\n",
240
- "import sys\n",
241
- "sys.path.insert(0, os.path.join(REPO_DIR, \"stack_2_9_training\"))\n",
242
- "from stack_2_9_training.merge_adapter import merge_adapter\n",
243
- "\n",
244
- "lora_dir = config['output']['lora_dir']\n",
245
- "merged_dir = os.path.join(OUTPUT_DIR, 'merged')\n",
246
- "os.makedirs(merged_dir, exist_ok=True)\n",
247
- "\n",
248
- "print(\"=\"*60)\n",
249
- "print(\"MERGING\")\n",
250
- "print(\"=\"*60)\n",
251
- "\n",
252
- "try:\n",
253
- " merge_adapter(\n",
254
- " base_model_name_or_path=config['model']['name'],\n",
255
- " adapter_path=lora_dir,\n",
256
- " output_path=merged_dir,\n",
257
- " use_safetensors=True\n",
258
- " )\n",
259
- " print(\"\\n✅ Merge completed!\")\n",
260
- " print(f\"Files: {os.listdir(merged_dir)}\")\n",
261
- "except Exception as e:\n",
262
- " print(f\"\\n❌ Merge failed: {e}\")\n",
263
- " import traceback\n",
264
- " traceback.print_exc()\n",
265
- " raise\n",
266
- "\n",
267
- "print(\"\\n\" + \"=\"*60)\n",
268
- "print(\"🎉 ALL DONE!\")\n",
269
- "print(\"=\"*60)\n",
270
- "print(f\"\\n📦 Model ready: {merged_dir}\")\n",
271
- "print(\"\\n⏳ Download 'merged' folder from Kaggle Output panel before session ends!\")"
272
- ]
273
- }
274
- ],
275
- "metadata": {
276
- "kaggle": {
277
- "accelerator": "gpu",
278
- "dataSources": [],
279
- "kernelSpec": {
280
- "displayName": "Python 3",
281
- "language": "python",
282
- "name": "python3"
283
- }
284
- }
285
- },
286
- "nbformat": 4,
287
- "nbformat_minor": 0
288
- }
kaggle_train_stack29_final.ipynb DELETED
@@ -1,198 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "metadata": {},
6
- "source": [
7
- "# 🚀 Stack 2.9 - Kaggle Training\n",
8
- "\n",
9
- "Free GPU training on Kaggle using Qwen2.5-Coder-7B.\n",
10
- "\n",
11
- "⏱️ **Runtime:** 2-4 hours | 💾 **VRAM:** ~16GB\n",
12
- "\n",
13
- "**Setup:**\n",
14
- "1. Settings → Accelerator → GPU **T4**\n",
15
- "2. Run all cells in order\n",
16
- "3. Download merged model from Output tab when done"
17
- ]
18
- },
19
- {
20
- "cell_type": "code",
21
- "execution_count": null,
22
- "metadata": {},
23
- "outputs": [],
24
- "source": [
25
- "# Check GPU\n",
26
- "!nvidia-smi"
27
- ]
28
- },
29
- {
30
- "cell_type": "code",
31
- "execution_count": null,
32
- "metadata": {},
33
- "outputs": [],
34
- "source": [
35
- "# Clone repository\n",
36
- "import os, shutil, subprocess\n",
37
- "\n",
38
- "os.chdir('/kaggle/working')\n",
39
- "REPO_DIR = '/kaggle/working/stack-2.9'\n",
40
- "OUTPUT_DIR = os.path.join(REPO_DIR, 'training_output')\n",
41
- "\n",
42
- "if os.path.exists(REPO_DIR):\n",
43
- " shutil.rmtree(REPO_DIR)\n",
44
- "subprocess.run(['git', 'clone', 'https://github.com/my-ai-stack/stack-2.9.git', REPO_DIR], check=True)\n",
45
- "os.chdir(REPO_DIR)\n",
46
- "print('✅ Repo ready:', REPO_DIR)"
47
- ]
48
- },
49
- {
50
- "cell_type": "code",
51
- "execution_count": null,
52
- "metadata": {},
53
- "outputs": [],
54
- "source": [
55
- "# Install dependencies (single command)\n",
56
- "!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118\n",
57
- "!pip install -q transformers==4.40.0 peft==0.10.0 accelerate==0.34.0 datasets==3.0.0 pyyaml tqdm scipy bitsandbytes==0.43.0\n",
58
- "print('✅ Dependencies ready')"
59
- ]
60
- },
61
- {
62
- "cell_type": "code",
63
- "execution_count": null,
64
- "metadata": {},
65
- "outputs": [],
66
- "source": [
67
- "# Prepare training data (auto-detect or synthetic fallback)\n",
68
- "import os, json\n",
69
- "\n",
70
- "REPO_TRAIN_DATA = os.path.join(REPO_DIR, 'training-data/final/train.jsonl')\n",
71
- "MINI_DATA_DIR = os.path.join(REPO_DIR, 'data_mini')\n",
72
- "MINI_DATA_FILE = os.path.join(MINI_DATA_DIR, 'train_mini.jsonl')\n",
73
- "SYNTHETIC_FILE = os.path.join(REPO_DIR, 'data/synthetic.jsonl')\n",
74
- "\n",
75
- "print('🔍 Data check')\n",
76
- "\n",
77
- "if os.path.exists(REPO_TRAIN_DATA):\n",
78
- " os.makedirs(MINI_DATA_DIR, exist_ok=True)\n",
79
- " if not os.path.exists(MINI_DATA_FILE):\n",
80
- " print(' Building mini dataset (1K samples) from full data...')\n",
81
- " !python scripts/create_mini_dataset.py --size 1000 --output {MINI_DATA_FILE} --source {REPO_TRAIN_DATA}\n",
82
- " DATA_FILE = MINI_DATA_FILE\n",
83
- " print(' Using mini dataset')\n",
84
- "elif os.path.exists(MINI_DATA_FILE):\n",
85
- " DATA_FILE = MINI_DATA_FILE\n",
86
- " print(' Using existing mini dataset')\n",
87
- "else:\n",
88
- " print(' Creating synthetic data (last resort)')\n",
89
- " examples = [\n",
90
- " {'instruction': 'Write a Python function to reverse a string', 'output': 'def reverse_string(s):\\n return s[::-1]'},\n",
91
- " {'instruction': 'Write a function to check if a number is prime', 'output': 'def is_prime(n):\\n if n <= 1:\\n return False\\n for i in range(2, int(n**0.5) + 1):\\n if n % i == 0:\\n return False\\n return True'},\n",
92
- " {'instruction': 'Write a binary search function', 'output': 'def binary_search(arr, target):\\n left, right = 0, len(arr) - 1\\n while left <= right:\\n mid = (left + right) // 2\\n if arr[mid] == target:\\n return mid\\n elif arr[mid] < target:\\n left = mid + 1\\n else:\\n right = mid - 1\\n return -1'},\n",
93
- " ]\n",
94
- " samples = examples * 333\n",
95
- " os.makedirs(os.path.dirname(SYNTHETIC_FILE), exist_ok=True)\n",
96
- " with open(SYNTHETIC_FILE, 'w') as f:\n",
97
- " for s in samples:\n",
98
- " f.write(json.dumps(s) + '\\n')\n",
99
- " DATA_FILE = SYNTHETIC_FILE\n",
100
- " print(f' Synthetic dataset: {len(samples)} examples')\n",
101
- "\n",
102
- "print(f'\\n✅ Data: {DATA_FILE}')\n",
103
- "!ls -lh {DATA_FILE}"
104
- ]
105
- },
106
- {
107
- "cell_type": "code",
108
- "execution_count": null,
109
- "metadata": {},
110
- "outputs": [],
111
- "source": [
112
- "# Generate training configuration\n",
113
- "import yaml\n",
114
- "\n",
115
- "os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
116
- "\n",
117
- "config = {\n",
118
- " 'model': {'name': 'Qwen/Qwen2.5-Coder-7B', 'trust_remote_code': True, 'torch_dtype': 'float16'},\n",
119
- " 'data': {'input_path': DATA_FILE, 'max_length': 2048, 'train_split': 1.0},\n",
120
- " 'lora': {'r': 16, 'alpha': 32, 'dropout': 0.05, 'target_modules': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'], 'bias': 'none', 'task_type': 'CAUSAL_LM'},\n",
121
- " 'training': {'num_epochs': 1, 'batch_size': 2, 'gradient_accumulation': 4, 'learning_rate': 2e-4, 'warmup_steps': 50, 'weight_decay': 0.01, 'max_grad_norm': 1.0, 'logging_steps': 10, 'save_steps': 100, 'save_total_limit': 2, 'fp16': True, 'bf16': False, 'gradient_checkpointing': True},\n",
122
- " 'output': {'lora_dir': os.path.join(OUTPUT_DIR, 'lora'), 'logging_dir': os.path.join(OUTPUT_DIR, 'logs')},\n",
123
- " 'quantization': {'enabled': False},\n",
124
- " 'hardware': {'device': 'cuda', 'num_gpus': 1, 'use_4bit': False, 'use_8bit': False}\n",
125
- "}\n",
126
- "\n",
127
- "config_path = os.path.join(OUTPUT_DIR, 'train_config.yaml')\n",
128
- "with open(config_path, 'w') as f:\n",
129
- " yaml.dump(config, f, default_flow_style=False)\n",
130
- "\n",
131
- "print(f'✅ Config: {config_path}')\n",
132
- "print(f\" Model: {config['model']['name']}\")\n",
133
- "print(f\" Data: {config['data']['input_path']}\")"
134
- ]
135
- },
136
- {
137
- "cell_type": "code",
138
- "execution_count": null,
139
- "metadata": {},
140
- "outputs": [],
141
- "source": [
142
- "# Train (using standalone train_simple.py - no package install needed)\n",
143
- "print('='*60)\n",
144
- "print('STARTING TRAINING')\n",
145
- "print('='*60)\n",
146
- "\n",
147
- "!cd {REPO_DIR} && python train_simple.py --config {config_path}\n",
148
- "\n",
149
- "print('\\n✅ Training step finished')"
150
- ]
151
- },
152
- {
153
- "cell_type": "code",
154
- "execution_count": null,
155
- "metadata": {},
156
- "outputs": [],
157
- "source": [
158
- "# Merge LoRA adapter into final model\n",
159
- "lora_dir = os.path.join(OUTPUT_DIR, 'lora')\n",
160
- "merged_dir = os.path.join(OUTPUT_DIR, 'merged')\n",
161
- "\n",
162
- "print('='*60)\n",
163
- "print('MERGING LORA ADAPTER')\n",
164
- "print('='*60)\n",
165
- "\n",
166
- "!cd {REPO_DIR} && python merge_simple.py \\\n",
167
- " --base-model {config['model']['name']} \\\n",
168
- " --adapter-path {lora_dir} \\\n",
169
- " --output-path {merged_dir} \\\n",
170
- " --use-safetensors\n",
171
- "\n",
172
- "print('\\n✅ Merge complete!')\n",
173
- "print(f'Merged model: {merged_dir}')\n",
174
- "!ls -lh {merged_dir}"
175
- ]
176
- },
177
- {
178
- "cell_type": "markdown",
179
- "metadata": {},
180
- "source": [
181
- "## 📥 Download Model\n",
182
- "\n",
183
- "1. Open **Output** tab on the right\n",
184
- "2. Find `training_output/merged/`\n",
185
- "3. Select all files and **Download**\n",
186
- "\n",
187
- "⚠️ **Do this before Kaggle session ends!**"
188
- ]
189
- }
190
- ],
191
- "metadata": {
192
- "kaggle": {
193
- "accelerator": "gpu"
194
- }
195
- },
196
- "nbformat": 4,
197
- "nbformat_minor": 0
198
- }
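The notebook's fallback cell above writes instruction/output pairs as JSONL. A stdlib-only sanity check for such a file might look like this (a minimal sketch; `check_jsonl` is illustrative and not part of the repo, and the expected keys mirror the fallback cell):

```python
import json

def check_jsonl(path):
    """Count records, asserting each line is valid JSON with the expected keys."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate trailing blank lines
            record = json.loads(line)
            assert "instruction" in record and "output" in record, "malformed record"
            count += 1
    return count
```

Running it on the generated `data/synthetic.jsonl` should report the same count the notebook prints.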
merge_simple.py CHANGED
@@ -34,7 +34,7 @@ def main():
34
  print("Loading base model...")
35
  model = AutoModelForCausalLM.from_pretrained(
36
  args.base_model,
37
- torch_dtype=torch.float16,
38
  device_map="auto",
39
  trust_remote_code=True
40
  )
 
34
  print("Loading base model...")
35
  model = AutoModelForCausalLM.from_pretrained(
36
  args.base_model,
37
+ torch_dtype=torch.bfloat16,
38
  device_map="auto",
39
  trust_remote_code=True
40
  )
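The generator added below emits chat transcripts in which every assistant `tool_calls` entry must be answered by a following `tool` message carrying the matching `tool_call_id`. A small stdlib validator for that invariant (a sketch; `validate_tool_record` is illustrative, not part of this commit) could be:

```python
import json

def validate_tool_record(record):
    """Check tool_call/result linkage in one generated training record."""
    seen_ids = set()
    for msg in record["messages"]:
        for call in msg.get("tool_calls") or []:
            seen_ids.add(call["id"])
            json.loads(call["function"]["arguments"])  # arguments are JSON strings
        if msg["role"] == "tool":
            # every tool result must answer a previously issued call
            assert msg["tool_call_id"] in seen_ids, "orphan tool result"
    return True
```

Looping this over each line of `tool_examples.jsonl` is a cheap pre-training check.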
scripts/generate_tool_data.py ADDED
@@ -0,0 +1,615 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Synthetic Tool-Calling Training Data Generator for Stack 2.9
4
+ Generates training examples in Qwen2.5-Coder format with tool_calls.
5
+ """
6
+
7
+ import json
8
+ import random
9
+ import argparse
10
+ from pathlib import Path
11
+ from typing import Dict, List, Any
12
+ from datetime import datetime
13
+
14
+ # ============================================================================
15
+ # Tool Definitions (Qwen2.5-Coder format)
16
+ # ============================================================================
17
+
18
+ TOOL_DEFINITIONS = [
19
+ {
20
+ "type": "function",
21
+ "function": {
22
+ "name": "Bash",
23
+ "description": "Execute bash commands in the terminal. Use for running shell commands, scripts, git operations, package managers, and system commands.",
24
+ "parameters": {
25
+ "type": "object",
26
+ "properties": {
27
+ "command": {
28
+ "type": "string",
29
+ "description": "The bash command to execute"
30
+ },
31
+ "timeout": {
32
+ "type": "integer",
33
+ "description": "Timeout in seconds (default: 30)"
34
+ }
35
+ },
36
+ "required": ["command"]
37
+ }
38
+ }
39
+ },
40
+ {
41
+ "type": "function",
42
+ "function": {
43
+ "name": "FileRead",
44
+ "description": "Read the contents of a file from the filesystem. Use for viewing source code, configuration files, documentation, or any text-based files.",
45
+ "parameters": {
46
+ "type": "object",
47
+ "properties": {
48
+ "path": {
49
+ "type": "string",
50
+ "description": "Path to the file to read"
51
+ },
52
+ "offset": {
53
+ "type": "integer",
54
+ "description": "Line number to start reading from (1-indexed)"
55
+ },
56
+ "limit": {
57
+ "type": "integer",
58
+ "description": "Maximum number of lines to read"
59
+ }
60
+ },
61
+ "required": ["path"]
62
+ }
63
+ }
64
+ },
65
+ {
66
+ "type": "function",
67
+ "function": {
68
+ "name": "FileWrite",
69
+ "description": "Create or overwrite a file with content. Use for creating new files, updating existing files, or writing code, configuration, or documentation.",
70
+ "parameters": {
71
+ "type": "object",
72
+ "properties": {
73
+ "path": {
74
+ "type": "string",
75
+ "description": "Path where the file should be created or written"
76
+ },
77
+ "content": {
78
+ "type": "string",
79
+ "description": "The content to write to the file"
80
+ },
81
+ "append": {
82
+ "type": "boolean",
83
+ "description": "Append to existing file instead of overwriting (default: false)"
84
+ }
85
+ },
86
+ "required": ["path", "content"]
87
+ }
88
+ }
89
+ },
90
+ {
91
+ "type": "function",
92
+ "function": {
93
+ "name": "WebSearch",
94
+ "description": "Search the web for information. Use for finding documentation, looking up error messages, researching libraries, or getting up-to-date information.",
95
+ "parameters": {
96
+ "type": "object",
97
+ "properties": {
98
+ "query": {
99
+ "type": "string",
100
+ "description": "The search query to look up on the web"
101
+ },
102
+ "count": {
103
+ "type": "integer",
104
+ "description": "Number of results to return (default: 5)"
105
+ }
106
+ },
107
+ "required": ["query"]
108
+ }
109
+ }
110
+ },
111
+ {
112
+ "type": "function",
113
+ "function": {
114
+ "name": "Grep",
115
+ "description": "Search for patterns in files. Use for finding specific code, function definitions, imports, TODO comments, error patterns, or any text across the codebase.",
116
+ "parameters": {
117
+ "type": "object",
118
+ "properties": {
119
+ "pattern": {
120
+ "type": "string",
121
+ "description": "The search pattern or regex to match"
122
+ },
123
+ "path": {
124
+ "type": "string",
125
+ "description": "Directory or file path to search in (default: current directory)"
126
+ },
127
+ "recursive": {
128
+ "type": "boolean",
129
+ "description": "Search recursively in subdirectories (default: true)"
130
+ },
131
+ "file_pattern": {
132
+ "type": "string",
133
+ "description": "File pattern to filter results (e.g., '*.py', '*.js')"
134
+ }
135
+ },
136
+ "required": ["pattern"]
137
+ }
138
+ }
139
+ }
140
+ ]
141
+
142
+ # ============================================================================
143
+ # Template Data for Generation
144
+ # ============================================================================
145
+
146
+ FILE_PATHS = [
147
+ "src/main.py", "src/utils.py", "src/config.py", "src/models.py",
148
+ "src/api.py", "src/handlers.py", "src/middleware.py",
149
+ "tests/test_main.py", "tests/test_utils.py", "tests/conftest.py",
150
+ "README.md", "LICENSE", "package.json", "requirements.txt",
151
+ "config.yaml", "config.json", ".env.example",
152
+ "src/components/Button.tsx", "src/components/Header.jsx",
153
+ "src/styles.css", "src/index.js", "src/app.js",
154
+ "docs/API.md", "docs/ARCHITECTURE.md", "docs/CONTRIBUTING.md",
155
+ "scripts/setup.sh", "scripts/deploy.py", "Makefile"
156
+ ]
157
+
158
+ CODE_SNIPPETS = {
159
+ "python": [
160
+ "def hello():\n print('Hello, World!')",
161
+ "class MyClass:\n def __init__(self):\n self.value = 42",
162
+ "import os\nos.path.join('a', 'b')",
163
+ "async def fetch_data():\n async with aiohttp.ClientSession() as session:\n return await session.get(url)",
164
+ ],
165
+ "javascript": [
166
+ "const fetch = require('node-fetch');\nconst data = await fetch(url);",
167
+ "function handleClick() {\n setCount(count + 1);\n}",
168
+ "export default function App() {\n return <div>Hello</div>;\n}",
169
+ "const [state, setState] = useState(null);",
170
+ ],
171
+ "bash": [
172
+ "npm install",
173
+ "git status",
174
+ "pytest -v",
175
+ "python -m pytest tests/",
176
+ "make build",
177
+ "docker build -t myapp .",
178
+ "ls -la",
179
+ "curl -X GET https://api.example.com",
180
+ ]
181
+ }
182
+
183
+ WEB_SEARCH_QUERIES = [
184
+ "python async await best practices",
185
+ "javascript array methods map filter reduce",
186
+ "TypeScript generics tutorial",
187
+ "React hooks useEffect dependency array",
188
+ "Node.js error handling best practices",
189
+ "Docker vs Kubernetes differences",
190
+ "Git rebase vs merge",
191
+ "SQL join types explained",
192
+ "REST API design principles",
193
+ "Python list comprehension examples",
194
+ "JavaScript promise async await",
195
+ "CSS flexbox vs grid",
196
+ "MongoDB vs PostgreSQL",
197
+ "AWS Lambda cold start optimization",
198
+ "Docker compose tutorial",
199
+ ]
200
+
201
+ GREP_PATTERNS = [
202
+ ("TODO", "*.py", "Find TODO comments in Python files"),
203
+ ("function ", "*.js", "Find function definitions in JS"),
204
+ ("import ", "*.py", "Find imports in Python"),
205
+ ("class ", "*.ts", "Find class definitions in TypeScript"),
206
+ ("def ", "*.py", "Find function definitions in Python"),
207
+ ("const ", "*.js", "Find const declarations in JS"),
208
+ ("interface ", "*.ts", "Find interface definitions"),
209
+ ("#pragma", "*.c", "Find pragma directives in C"),
210
+ ("SELECT ", "*.sql", "Find SQL queries"),
211
+ ("TODO", "*.md", "Find TODOs in documentation"),
212
+ ]
213
+
214
+ # ============================================================================
215
+ # Scenario Templates
216
+ # ============================================================================
217
+
218
+ SCENARIOS = {
219
+ "FileRead": [
220
+ {
221
+ "user": "Read the file at src/main.py and show me what's in it",
222
+ "tool_call": {"name": "FileRead", "arguments": {"path": "src/main.py"}},
223
+ "result": "Successfully read file: src/main.py\n```python\ndef main():\n print('Hello from Stack 2.9!')\n\nif __name__ == '__main__':\n main()\n```",
224
+ "response": "Here's the contents of `src/main.py`:\n\n```python\ndef main():\n print('Hello from Stack 2.9!')\n\nif __name__ == '__main__':\n main()\n```\n\nThis is a simple Python script that prints a greeting when executed directly."
225
+ },
226
+ {
227
+ "user": "Can you show me the README.md file?",
228
+ "tool_call": {"name": "FileRead", "arguments": {"path": "README.md"}},
229
+ "result": "Successfully read file: README.md\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n\n## Installation\n\npip install -r requirements.txt\n```",
230
+ "response": "Here's the README.md:\n\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n\n## Installation\n\npip install -r requirements.txt\n```"
231
+ },
232
+ {
233
+ "user": "What's in the config.yaml file?",
234
+ "tool_call": {"name": "FileRead", "arguments": {"path": "config.yaml"}},
235
+ "result": "Successfully read file: config.yaml\n```yaml\ndatabase:\n host: localhost\n port: 5432\n name: myapp\n\napp:\n port: 3000\n debug: true\n```",
236
+ "response": "The config.yaml contains:\n\n```yaml\ndatabase:\n host: localhost\n port: 5432\n name: myapp\n\napp:\n port: 3000\n debug: true\n```\n\nIt has database connection settings and app configuration."
237
+ },
238
+ {
239
+ "user": "Check the package.json to see the dependencies",
240
+ "tool_call": {"name": "FileRead", "arguments": {"path": "package.json"}},
241
+ "result": "Successfully read file: package.json\n```json\n{\n \"name\": \"my-app\",\n \"version\": \"1.0.0\",\n \"dependencies\": {\n \"express\": \"^4.18.0\",\n \"react\": \"^18.2.0\"\n }\n}\n```",
242
+ "response": "The package.json shows these dependencies:\n\n```json\n{\n \"name\": \"my-app\",\n \"version\": \"1.0.0\",\n \"dependencies\": {\n \"express\": \"^4.18.0\",\n \"react\": \"^18.2.0\"\n }\n}\n```"
243
+ },
244
+ ],
245
+ "FileWrite": [
246
+ {
247
+ "user": "Create a new file called hello.py with a hello world function",
248
+ "tool_call": {"name": "FileWrite", "arguments": {"path": "hello.py", "content": "def hello():\n print('Hello, World!')\n\nif __name__ == '__main__':\n hello()\n"}},
249
+ "result": "Successfully wrote file: hello.py",
250
+ "response": "Created `hello.py` with a hello world function:\n\n```python\ndef hello():\n print('Hello, World!')\n\nif __name__ == '__main__':\n hello()\n```\n\nYou can run it with `python hello.py`."
251
+ },
252
+ {
253
+ "user": "Write a simple React component to src/components/Button.jsx",
254
+ "tool_call": {"name": "FileWrite", "arguments": {"path": "src/components/Button.jsx", "content": "import React from 'react';\n\nexport default function Button({ onClick, children }) {\n return (\n <button onClick={onClick} className=\"btn\">\n {children}\n </button>\n );\n}\n"}},
255
+ "result": "Successfully wrote file: src/components/Button.jsx",
256
+ "response": "Created `src/components/Button.jsx`:\n\n```jsx\nimport React from 'react';\n\nexport default function Button({ onClick, children }) {\n return (\n <button onClick={onClick} className=\"btn\">\n {children}\n </button>\n );\n}\n```"
257
+ },
258
+ {
259
+ "user": "Create a config.json file with some settings",
260
+ "tool_call": {"name": "FileWrite", "arguments": {"path": "config.json", "content": "{\n \"app\": {\n \"name\": \"MyApp\",\n \"version\": \"1.0.0\"\n },\n \"server\": {\n \"port\": 8080,\n \"host\": \"localhost\"\n }\n}\n"}},
261
+ "result": "Successfully wrote file: config.json",
262
+ "response": "Created `config.json`:\n\n```json\n{\n \"app\": {\n \"name\": \"MyApp\",\n \"version\": \"1.0.0\"\n },\n \"server\": {\n \"port\": 8080,\n \"host\": \"localhost\"\n }\n}\n```"
263
+ },
264
+ ],
265
+ "Bash": [
266
+ {
267
+ "user": "Run npm install to install dependencies",
268
+ "tool_call": {"name": "Bash", "arguments": {"command": "npm install"}},
269
+ "result": "added 245 packages in 3.5s",
270
+ "response": "Dependencies installed successfully! Added 245 packages."
271
+ },
272
+ {
273
+ "user": "Check git status",
274
+ "tool_call": {"name": "Bash", "arguments": {"command": "git status"}},
275
+ "result": "On branch main\nYour branch is up to date with 'origin/main'.\n\nChanges not staged for commit:\n modified: README.md\n modified: src/main.py\n\nno changes added to commit (but untracked files present)",
276
+ "response": "Git status shows:\n- Modified: README.md, src/main.py (not staged)\n- Untracked files present"
277
+ },
278
+ {
279
+ "user": "Run the tests with pytest",
280
+ "tool_call": {"name": "Bash", "arguments": {"command": "pytest -v"}},
281
+ "result": "tests/test_main.py::test_hello PASSED\ntests/test_utils.py::test_add PASSED\n\n2 passed in 0.5s",
282
+ "response": "All tests passed!\n- `tests/test_main.py::test_hello` ✓\n- `tests/test_utils.py::test_add` ✓"
283
+ },
284
+ {
285
+ "user": "List all files in the current directory",
286
+ "tool_call": {"name": "Bash", "arguments": {"command": "ls -la"}},
287
+ "result": "total 64\ndrwxr-xr-x 12 user staff 384 Apr 6 00:58 .\ndrwxr-xr-x 9 user staff 288 Apr 6 00:50 ..\n-rw-r--r-- 1 user staff 220 Apr 6 00:50 README.md\ndrwxr-xr-x 3 user staff 96 Apr 6 00:51 src\ndrwxr-xr-x 3 user staff 96 Apr 6 00:51 tests\n...",
288
+ "response": "Files in current directory:\n- README.md\n- src/\n- tests/\n- And more (permissions, dates shown)"
289
+ },
290
+ {
291
+ "user": "Build the project with make",
292
+ "tool_call": {"name": "Bash", "arguments": {"command": "make build"}},
293
+ "result": "Building...\ngcc -o main main.c\nobuild successful!",
294
+ "response": "Build completed successfully!"
295
+ },
296
+ ],
297
+ "WebSearch": [
298
+ {
299
+ "user": "How do I use async/await in Python?",
300
+ "tool_call": {"name": "WebSearch", "arguments": {"query": "python async await best practices", "count": 5}},
301
+ "result": "Search results for 'python async await best practices':\n1. Real Python - Async IO in Python: A Complete Walkthrough\n2. Python Docs - asyncio - Asynchronous I/O\n3. Towards Data Science - Mastering Python Async/Await\n4. Stack Overflow - Common async/await mistakes\n5. Medium - Async/Await Patterns in Python",
302
+ "response": "Here are some resources on Python async/await:\n\n1. **Real Python** - Async IO in Python: A Complete Walkthrough\n2. **Python Docs** - asyncio - Asynchronous I/O\n3. **Towards Data Science** - Mastering Python Async/Await\n\nKey points:\n- Use `async def` to define async functions\n- Use `await` to call other async functions\n- `asyncio.run()` to execute the main coroutine"
303
+ },
304
+ {
305
+ "user": "What's the difference between Docker and Kubernetes?",
306
+ "tool_call": {"name": "WebSearch", "arguments": {"query": "Docker vs Kubernetes differences", "count": 5}},
307
+ "result": "Search results for 'Docker vs Kubernetes differences':\n1. Docker Blog - Containers 101\n2. Kubernetes.io - What is Kubernetes\n3. Red Hat - Docker vs Kubernetes explained\n4. DevOps.com - Container orchestration comparison\n5. Container Journal - When to use each",
308
+ "response": "**Docker** is a container platform - it packages applications into containers.\n\n**Kubernetes** is a container orchestrator - it manages multiple containers across machines.\n\nThink of Docker as individual shipping containers, and Kubernetes as the port management system that coordinates all of them."
309
+ },
310
+ {
311
+ "user": "How do I use React hooks properly?",
312
+ "tool_call": {"name": "WebSearch", "arguments": {"query": "React hooks useEffect dependency array", "count": 5}},
313
+ "result": "Search results for 'React hooks useEffect dependency array':\n1. React Docs - Synchronizing with Effects\n2. Kent C. Dodds - A Complete Guide to useEffect\n3. Robin Wieruch - React useEffect Hook Tutorial\n4. Stack Overflow - Common useEffect mistakes\n5. LogRocket - useEffect dependency array deep dive",
314
+ "response": "**React Hooks Guide:**\n\n1. **useState** - For state management\n2. **useEffect** - For side effects, takes dependency array\n3. **useContext** - For global state\n\nThe dependency array controls when the effect runs. Empty `[]` = once on mount. Include variables to re-run when they change."
315
+ },
316
+ ],
317
+ "Grep": [
318
+ {
319
+ "user": "Find all TODO comments in the Python files",
320
+ "tool_call": {"name": "Grep", "arguments": {"pattern": "TODO", "path": ".", "recursive": True, "file_pattern": "*.py"}},
321
+ "result": "Found 3 matches:\nsrc/main.py:14: # TODO: Add error handling\nsrc/utils.py:22: # TODO: Implement caching\ntests/test_main.py:8: # TODO: Add more tests",
322
+ "response": "Found 3 TODO comments in Python files:\n- `src/main.py:14` - Add error handling\n- `src/utils.py:22` - Implement caching\n- `tests/test_main.py:8` - Add more tests"
323
+ },
324
+ {
325
+ "user": "Search for function definitions in the JavaScript files",
326
+ "tool_call": {"name": "Grep", "arguments": {"pattern": "function ", "path": "src", "recursive": True, "file_pattern": "*.js"}},
327
+ "result": "Found 5 matches:\nsrc/index.js:3: function setup() {\nsrc/handlers.js:12: function handleClick(event) {\nsrc/handlers.js:25: function handleSubmit(form) {\nsrc/utils.js:8: function formatDate(date) {\nsrc/utils.js:15: function parseJSON(str) {",
328
+ "response": "Found 5 function definitions in src/:\n- `src/index.js:3` - setup()\n- `src/handlers.js:12` - handleClick(event)\n- `src/handlers.js:15` - handleSubmit(form)\n- `src/utils.js:8` - formatDate(date)\n- `src/utils.js:15` - parseJSON(str)"
329
+ },
330
+ {
331
+ "user": "Find all imports in the Python codebase",
332
+ "tool_call": {"name": "Grep", "arguments": {"pattern": "import ", "path": ".", "recursive": True, "file_pattern": "*.py"}},
333
+ "result": "Found 12 matches:\nsrc/main.py:1: import os\nsrc/main.py:2: import sys\nsrc/main.py:3: import json\nsrc/utils.py:1: from typing import Dict\nsrc/api.py:5: import requests\n...",
334
+ "response": "Found 12 import statements across Python files. Main imports:\n- `os`, `sys`, `json` in src/main.py\n- `typing.Dict` in src/utils.py\n- `requests` in src/api.py"
335
+ },
336
+ {
337
+ "user": "Look for class definitions in TypeScript files",
338
+ "tool_call": {"name": "Grep", "arguments": {"pattern": "class ", "path": "src", "recursive": True, "file_pattern": "*.ts"}},
339
+ "result": "Found 2 matches:\nsrc/models/User.ts:3: class User {\nsrc/models/Config.ts:8: class Config {",
340
+ "response": "Found 2 class definitions:\n- `src/models/User.ts:3` - User class\n- `src/models/Config.ts:8` - Config class"
341
+ },
342
+ ],
343
+ }
344
+
345
+
346
+ # ============================================================================
347
+ # Data Augmentation Helpers
348
+ # ============================================================================
349
+
350
+ def vary_file_paths(template: Dict) -> Dict:
351
+ """Vary file paths in a scenario to create diversity."""
352
+ import copy
353
+ result = copy.deepcopy(template)
354
+
355
+ # Vary the file paths
356
+ if "path" in result["tool_call"]["arguments"]:
357
+ for old_path in FILE_PATHS:
358
+ if old_path in result["user"]:
359
+ new_path = random.choice(FILE_PATHS)
360
+ result["user"] = result["user"].replace(old_path, new_path)
361
+ result["tool_call"]["arguments"]["path"] = new_path
362
+ break
363
+
364
+ return result
365
+
366
+
367
+ def vary_content(template: Dict) -> Dict:
368
+ """Vary content in a scenario."""
369
+ import copy
370
+ result = copy.deepcopy(template)
371
+
372
+ if "content" in result["tool_call"]["arguments"]:
373
+ # Vary code snippets
374
+ lang = random.choice(["python", "javascript"])
375
+ result["tool_call"]["arguments"]["content"] = random.choice(CODE_SNIPPETS[lang])
376
+
377
+ return result
378
+
379
+
380
+ def vary_bash_command(template: Dict) -> Dict:
381
+ """Vary bash commands."""
382
+ import copy
383
+ result = copy.deepcopy(template)
384
+
385
+ if "command" in result["tool_call"]["arguments"]:
386
+ original = result["tool_call"]["arguments"]["command"].split()[0] if result["tool_call"]["arguments"]["command"] else ""
387
+
388
+ if "npm" in original:
389
+ commands = ["npm install", "npm run build", "npm test", "npm start"]
390
+ elif "git" in original:
391
+ commands = ["git status", "git log --oneline -5", "git diff", "git branch -a"]
392
+ elif "pytest" in original:
393
+ commands = ["pytest -v", "pytest tests/", "pytest -xvs", "pytest --cov"]
394
+ elif "ls" in original:
395
+ commands = ["ls -la", "ls -1", "ls -lah"]
396
+ elif "make" in original:
397
+ commands = ["make build", "make clean", "make test", "make install"]
398
+ else:
399
+ commands = ["echo 'hello'", "pwd", "whoami", "date"]
400
+
401
+ result["tool_call"]["arguments"]["command"] = random.choice(commands)
402
+
403
+ return result
404
+
405
+
406
+ def vary_search_query(template: Dict) -> Dict:
407
+ """Vary web search queries."""
408
+ import copy
409
+ result = copy.deepcopy(template)
410
+
411
+ if "query" in result["tool_call"]["arguments"]:
412
+ result["tool_call"]["arguments"]["query"] = random.choice(WEB_SEARCH_QUERIES)
413
+
414
+ return result
415
+
416
+
417
+ def vary_grep_pattern(template: Dict) -> Dict:
418
+ """Vary grep patterns."""
419
+ import copy
420
+ result = copy.deepcopy(template)
421
+
422
+ pattern, file_pattern, _ = random.choice(GREP_PATTERNS)
423
+ result["tool_call"]["arguments"]["pattern"] = pattern
424
+ result["tool_call"]["arguments"]["file_pattern"] = file_pattern
425
+
426
+ return result
427
+
428
+
429
+ # ============================================================================
430
+ # Main Generation Functions
431
+ # ============================================================================
432
+
433
+ def create_tool_call_message(tool_call: Dict, tool_call_id: str) -> Dict:
434
+ """Create a tool_calls message in Qwen format."""
435
+ return {
436
+ "role": "assistant",
437
+ "content": None,
438
+ "tool_calls": [
439
+ {
440
+ "id": tool_call_id,
441
+ "type": "function",
442
+ "function": {
443
+ "name": tool_call["name"],
444
+ "arguments": json.dumps(tool_call["arguments"])
445
+ }
446
+ }
447
+ ]
448
+ }
449
+
450
+
451
+ def create_tool_message(role: str, tool_call_id: str, tool_name: str, result: str) -> Dict:
452
+ """Create a tool message (result of tool execution)."""
453
+ return {
454
+ "role": role, # typically "tool"
455
+ "content": result,
456
+ "tool_call_id": tool_call_id,
457
+ "name": tool_name
458
+ }
459
+
460
+
461
+ def generate_example(scenario: Dict, system_prompt: str = None) -> Dict:
462
+ """Generate a single training example in Qwen2.5-Coder format."""
463
+ if system_prompt is None:
464
+ system_prompt = "You are a helpful AI assistant that can use tools to help users solve problems. When you need to perform actions like reading files, running commands, searching the web, or searching code, use the appropriate tool."
465
+
466
+ tool_call_id = f"call_${random.randint(1000, 9999)}"
467
+
468
+ messages = [
469
+ {"role": "system", "content": system_prompt},
470
+ {"role": "user", "content": scenario["user"]},
471
+ create_tool_call_message(scenario["tool_call"], tool_call_id),
472
+ create_tool_message("tool", tool_call_id, scenario["tool_call"]["name"], scenario["result"]),
473
+ {"role": "assistant", "content": scenario["response"]}
474
+ ]
475
+
476
+ return {
477
+ "messages": messages,
478
+ "tools": TOOL_DEFINITIONS
479
+ }
480
+
481
+
482
+ def augment_scenario(scenario: Dict, tool_name: str) -> Dict:
483
+ """Apply random augmentations to a scenario."""
484
+ import random
485
+
486
+ augmented = scenario.copy()
487
+
488
+ if tool_name == "FileRead":
489
+ augmented = vary_file_paths(augmented)
490
+ elif tool_name == "FileWrite":
491
+ augmented = vary_file_paths(augmented)
492
+ augmented = vary_content(augmented)
493
+ elif tool_name == "Bash":
494
+ augmented = vary_bash_command(augmented)
495
+ elif tool_name == "WebSearch":
496
+ augmented = vary_search_query(augmented)
497
+ elif tool_name == "Grep":
498
+ augmented = vary_grep_pattern(augmented)
499
+
500
+ return augmented
501
+
502
+
503
+ def generate_dataset(num_examples: int = 1000, output_path: str = None) -> List[Dict]:
504
+ """Generate the complete dataset."""
505
+ examples = []
506
+ tools = list(SCENARIOS.keys())
507
+
508
+ # Track counts for balance
509
+ examples_per_tool = num_examples // len(tools)
510
+ remainder = num_examples % len(tools)
511
+
512
+ for i, tool_name in enumerate(tools):
513
+ # Determine how many examples for this tool
514
+ count = examples_per_tool + (1 if i < remainder else 0)
515
+
516
+ base_scenarios = SCENARIOS[tool_name]
517
+
518
+ for j in range(count):
519
+ # Use base scenario and vary
520
+ base = base_scenarios[j % len(base_scenarios)]
521
+
522
+ # Apply augmentations for variety
523
+ if j >= len(base_scenarios):
524
+ scenario = augment_scenario(base, tool_name)
525
+ else:
526
+ scenario = base
527
+
528
+ example = generate_example(scenario)
529
+ examples.append(example)
530
+
531
+ # Shuffle for better training
532
+ random.shuffle(examples)
533
+
534
+ return examples
535
+
536
+
537
+ def save_jsonl(examples: List[Dict], output_path: str):
538
+ """Save examples to JSONL format."""
539
+ output_file = Path(output_path)
540
+ output_file.parent.mkdir(parents=True, exist_ok=True)
541
+
542
+ with open(output_file, 'w', encoding='utf-8') as f:
543
+ for example in examples:
544
+ f.write(json.dumps(example, ensure_ascii=False) + '\n')
545
+
546
+
547
+ def save_json(examples: List[Dict], output_path: str):
548
+ """Save examples to JSON format."""
549
+ output_file = Path(output_path)
550
+ output_file.parent.mkdir(parents=True, exist_ok=True)
551
+
552
+ with open(output_file, 'w', encoding='utf-8') as f:
553
+ json.dump(examples, f, ensure_ascii=False, indent=2)
554
+
555
+
556
+ def main():
557
+ parser = argparse.ArgumentParser(description="Generate synthetic tool-calling training data")
558
+ parser.add_argument("--num-examples", type=int, default=1000, help="Number of examples to generate")
559
+ parser.add_argument("--output-dir", type=str, default="training-data", help="Output directory")
560
+ parser.add_argument("--output-format", choices=["jsonl", "json", "both"], default="jsonl", help="Output format")
561
+ parser.add_argument("--seed", type=int, default=42, help="Random seed")
562
+ args = parser.parse_args()
563
+
564
+ # Set seed for reproducibility
565
+ random.seed(args.seed)
566
+
567
+ print(f"🎯 Generating {args.num_examples} tool-calling training examples...")
568
+ print(f" Output directory: {args.output_dir}")
569
+ print(f" Format: {args.output_format}")
570
+ print()
571
+
572
+ # Generate dataset
573
+ examples = generate_dataset(args.num_examples)
574
+
575
+ output_dir = Path(args.output_dir)
576
+
577
+ # Save based on format
578
+ if args.output_format in ["jsonl", "both"]:
579
+ jsonl_path = output_dir / "tool_examples.jsonl"
580
+ save_jsonl(examples, str(jsonl_path))
581
+ print(f"✅ Saved JSONL: {jsonl_path}")
582
+
583
+ if args.output_format in ["json", "both"]:
584
+ json_path = output_dir / "tool_examples.json"
585
+ save_json(examples, str(json_path))
586
+ print(f"✅ Saved JSON: {json_path}")
587
+
588
+ # Statistics
589
+ print(f"\n📊 Statistics:")
590
+ print(f" Total examples: {len(examples)}")
591
+
592
+ # Count by tool
593
+ tool_counts = {}
594
+ for ex in examples:
595
+ for msg in ex["messages"]:
596
+ if msg.get("tool_calls"):
597
+ tool_name = msg["tool_calls"][0]["function"]["name"]
598
+ tool_counts[tool_name] = tool_counts.get(tool_name, 0) + 1
599
+
600
+ print(f" Examples by tool:")
601
+ for tool, count in sorted(tool_counts.items(), key=lambda x: x[1], reverse=True):
602
+ print(f" - {tool}: {count}")
603
+
604
+ # Show sample
605
+ print(f"\n📝 Sample example (first in dataset):")
606
+ sample = examples[0]
607
+ print(f" Tools defined: {len(sample['tools'])}")
608
+ print(f" Messages: {len(sample['messages'])}")
609
+ print(f" First user message: {sample['messages'][1]['content'][:60]}...")
610
+
611
+ print(f"\n✨ Generation complete!")
612
+
613
+
614
+ if __name__ == "__main__":
615
+ main()
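Each line of the generated `tool_examples.jsonl` is a JSON object with `tools` and `messages` keys, where assistant messages may carry `tool_calls`. A minimal sketch of loading such a file downstream and re-counting tool usage, mirroring the statistics pass above (file path and field names assumed from this generator):

```python
import json

def load_tool_examples(path: str):
    """Read a JSONL file of tool-calling examples; return the examples
    plus a per-tool count of tool_calls across all messages."""
    examples, counts = [], {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate trailing blank lines
            ex = json.loads(line)
            examples.append(ex)
            for msg in ex.get("messages", []):
                for call in msg.get("tool_calls") or []:
                    name = call["function"]["name"]
                    counts[name] = counts.get(name, 0) + 1
    return examples, counts
```

A quick balance check before training (e.g. `max(counts.values()) - min(counts.values()) <= 1`) catches a skewed regeneration early.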
test_model.py CHANGED
@@ -1,10 +1,17 @@
1
  #!/usr/bin/env python3
2
  """
3
- Test script for fine-tuned Stack 2.9 model.
4
- Tests basic code generation capabilities.
5
  """
6
 
7
  import argparse
8
  import torch
9
  from transformers import AutoModelForCausalLM, AutoTokenizer
10
 
@@ -12,141 +19,583 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
12
  def load_model(model_path: str):
13
  """Load the fine-tuned model and tokenizer."""
14
  print(f"Loading model from: {model_path}")
15
- tokenizer = AutoTokenizer.from_pretrained(model_path)
16
  model = AutoModelForCausalLM.from_pretrained(
17
  model_path,
18
  torch_dtype=torch.float16,
19
  device_map="auto",
20
  low_cpu_mem_usage=True,
 
21
  )
 
22
  return model, tokenizer
23
 
24
 
25
- def test_code_completion(model, tokenizer, prompt: str, max_new_tokens: int = 100):
26
- """Test code completion for a given prompt."""
27
- inputs = tokenizer(prompt, return_tensors="pt")
28
  inputs = {k: v.to(model.device) for k, v in inputs.items()}
29
 
30
- with torch.no_grad():
31
- outputs = model.generate(
32
- **inputs,
33
- max_new_tokens=max_new_tokens,
34
- temperature=0.2,
35
- top_p=0.95,
36
- do_sample=True,
37
- repetition_penalty=1.1,
38
- )
39
-
40
- completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
41
- # Remove the prompt from the completion
42
- if completion.startswith(prompt):
43
- completion = completion[len(prompt):].strip()
44
- return completion
45
-
46
-
47
- def run_tests(model_path: str):
48
- """Run all code generation tests."""
49
- model, tokenizer = load_model(model_path)
50
- model.eval()
51
 
52
- test_cases = [
53
- {
54
- "name": "Reverse String",
55
- "prompt": "def reverse_string(s):",
56
- "max_tokens": 50,
57
- "expected_keywords": ["return", "s[::-1]", "reversed"],
58
- },
59
- {
60
- "name": "Binary Search",
61
- "prompt": "def binary_search(arr, target):",
62
- "max_tokens": 100,
63
- "expected_keywords": ["while", "left", "right", "mid"],
64
- },
65
- {
66
- "name": "Fibonacci",
67
- "prompt": "def fibonacci(n):",
68
- "max_tokens": 80,
69
- "expected_keywords": ["return", "if", "else", "fib"],
70
- },
71
- {
72
- "name": "Factorial",
73
- "prompt": "def factorial(n):",
74
- "max_tokens": 60,
75
- "expected_keywords": ["return", "if", "*"],
76
- },
77
- {
78
- "name": "Is Prime",
79
- "prompt": "def is_prime(n):",
80
- "max_tokens": 80,
81
- "expected_keywords": ["if", "return", "for", "%"],
82
- },
83
- {
84
- "name": "List Sum",
85
- "prompt": "def list_sum(nums):",
86
- "max_tokens": 50,
87
- "expected_keywords": ["return", "sum", "+"],
88
- },
89
- {
90
- "name": "Merge Sort",
91
- "prompt": "def merge_sort(arr):",
92
- "max_tokens": 150,
93
- "expected_keywords": ["if", "len", "return", "merge"],
94
- },
95
- {
96
- "name": "Quick Sort",
97
- "prompt": "def quick_sort(arr):",
98
- "max_tokens": 150,
99
- "expected_keywords": ["if", "len", "return", "pivot"],
100
- },
101
- ]
102
 
103
- print("\n" + "="*60)
104
- print("Running Code Generation Tests")
105
- print("="*60 + "\n")
106
 
107
- passed = 0
108
- failed = 0
 
 
109
 
110
- for i, test in enumerate(test_cases, 1):
111
- print(f"Test {i}: {test['name']}")
112
- print(f"Prompt: {test['prompt']}")
113
 
114
  try:
115
- completion = test_code_completion(
116
- model, tokenizer,
117
- test['prompt'],
118
- test['max_tokens']
119
- )
120
- print(f"Completion:\n{completion[:300]}")
121
-
122
- # Check for expected keywords
123
- keywords_found = sum(1 for kw in test['expected_keywords'] if kw.lower() in completion.lower())
124
- if keywords_found >= len(test['expected_keywords']) // 2:
125
- print(f"✅ PASS (found {keywords_found}/{len(test['expected_keywords'])} keywords)")
126
- passed += 1
127
  else:
128
- print(f"⚠️ PARTIAL (found {keywords_found}/{len(test['expected_keywords'])} keywords)")
129
- passed += 1 # Still count as pass if some keywords found
130
- print()
131
-
132
  except Exception as e:
133
- print(f" FAIL: {e}")
134
- failed += 1
135
- print()
 
136
 
137
- print("="*60)
138
- print(f"Results: {passed} passed, {failed} failed")
139
  print("="*60)
140
 
141
- return passed, failed
142
 
143
 
144
  def main():
145
- parser = argparse.ArgumentParser(description="Test fine-tuned Stack 2.9 model")
146
- parser.add_argument("--model-path", type=str, required=True, help="Path to fine-tuned model")
147
  args = parser.parse_args()
148
 
149
- run_tests(args.model_path)
150
 
151
 
152
  if __name__ == "__main__":
 
1
  #!/usr/bin/env python3
2
  """
3
+ Basic Code Generation Tests for Stack 2.9 Model
4
+ Tests common algorithms and data structures.
5
+
6
+ Usage:
7
+ python test_model.py --model-path /path/to/merged/model
8
+ python test_model.py --model-path /path/to/merged/model --output test_results.json
9
  """
10
 
11
  import argparse
12
+ import json
13
+ import time
14
+ from typing import List, Dict, Tuple, Optional, Any
15
  import torch
16
  from transformers import AutoModelForCausalLM, AutoTokenizer
17
 
 
19
  def load_model(model_path: str):
20
  """Load the fine-tuned model and tokenizer."""
21
  print(f"Loading model from: {model_path}")
22
+
23
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
24
+ if tokenizer.pad_token is None:
25
+ tokenizer.pad_token = tokenizer.eos_token
26
+
27
  model = AutoModelForCausalLM.from_pretrained(
28
  model_path,
29
  torch_dtype=torch.float16,
30
  device_map="auto",
31
  low_cpu_mem_usage=True,
32
+ trust_remote_code=True,
33
  )
34
+
35
  return model, tokenizer
36
 
37
 
38
+ def generate_completion(
39
+ model,
40
+ tokenizer,
41
+ prompt: str,
42
+ max_new_tokens: int = 128,
43
+ temperature: float = 0.2,
44
+ num_samples: int = 1
45
+ ) -> List[str]:
46
+ """Generate code completion(s) for a prompt."""
47
+ inputs = tokenizer(prompt, return_tensors="pt", padding=True)
48
  inputs = {k: v.to(model.device) for k, v in inputs.items()}
49
 
50
+ outputs = model.generate(
51
+ **inputs,
52
+ max_new_tokens=max_new_tokens,
53
+ temperature=temperature,
54
+ top_p=0.95,
55
+ do_sample=True,
56
+ repetition_penalty=1.1,
57
+ num_return_sequences=num_samples,
58
+ pad_token_id=tokenizer.pad_token_id,
59
+ eos_token_id=tokenizer.eos_token_id,
60
+ )
61
 
62
+ completions = []
63
+ for output in outputs:
64
+ text = tokenizer.decode(output, skip_special_tokens=True)
65
+ if text.startswith(prompt):
66
+ text = text[len(prompt):]
67
+ completions.append(text.strip())
68
 
69
+ return completions
70
+
71
+
72
+ def extract_code(completion: str) -> str:
73
+ """Extract code from completion, handling markdown code blocks."""
74
+ # Try ```python blocks first
75
+ if "```python" in completion:
76
+ start = completion.find("```python") + len("```python")
77
+ end = completion.find("```", start)
78
+ if end != -1:
79
+ return completion[start:end].strip()
80
+
81
+ # Try generic ``` blocks
82
+ if "```" in completion:
83
+ start = completion.find("```") + len("```")
84
+ # Skip language identifier if present
85
+ if completion[start:start+10].strip():
86
+ start = completion.find("\n", start) + 1
87
+ end = completion.find("```", start)
88
+ if end != -1:
89
+ return completion[start:end].strip()
90
+
91
+ return completion.strip()
92
+
93
+
94
+ def execute_code(code: str, timeout: int = 5) -> Tuple[bool, str, Optional[Any]]:
95
+ """Safely execute code and return (success, error_msg, result)."""
96
+ import signal
97
+
98
+ class TimeoutError(Exception):
99
+ pass
100
+
101
+ def timeout_handler(signum, frame):
102
+ raise TimeoutError("Execution timed out")
103
+
104
+ safe_builtins = {
105
+ 'print': print,
106
+ 'len': len,
107
+ 'range': range,
108
+ 'str': str,
109
+ 'int': int,
110
+ 'float': float,
111
+ 'bool': bool,
112
+ 'list': list,
113
+ 'dict': dict,
114
+ 'set': set,
115
+ 'tuple': tuple,
116
+ 'sum': sum,
117
+ 'min': min,
118
+ 'max': max,
119
+ 'abs': abs,
120
+ 'sorted': sorted,
121
+ 'reversed': reversed,
122
+ 'enumerate': enumerate,
123
+ 'zip': zip,
124
+ 'map': map,
125
+ 'filter': filter,
126
+ 'any': any,
127
+ 'all': all,
128
+ 'isinstance': isinstance,
129
+ 'type': type,
130
+ 'round': round,
131
+ 'pow': pow,
132
+ 'divmod': divmod,
133
+ 'ord': ord,
134
+ 'chr': chr,
135
+ 'hex': hex,
136
+ 'bin': bin,
137
+ 'id': id,
138
+ }
139
+
140
+ namespace = {'__builtins__': safe_builtins}
141
+
142
+ try:
143
+ signal.signal(signal.SIGALRM, timeout_handler)
144
+ signal.alarm(timeout)
145
+ exec(code, namespace)
146
+ signal.alarm(0)
147
+ return True, "", namespace.get('result')
148
+ except TimeoutError as e:
149
+ signal.alarm(0)
150
+ return False, f"Timeout after {timeout}s", None
151
+ except SyntaxError as e:
152
+ signal.alarm(0)
153
+ return False, f"Syntax error: {e}", None
154
+ except Exception as e:
155
+ signal.alarm(0)
156
+ return False, f"{type(e).__name__}: {e}", None
157
+
158
+
159
+ def check_function_output(code: str, func_name: str, test_cases: List[Dict]) -> Tuple[bool, str]:
160
+ """Test a function with given test cases.
161
 
162
+ Args:
163
+ code: The generated code
164
+ func_name: Name of function to test
165
+ test_cases: List of dicts with 'input' (tuple), 'expected', 'description'
166
 
167
+ Returns:
168
+ Tuple of (all_passed, failure_message)
169
+ """
170
+ namespace = {'__builtins__': {
171
+ 'print': print,
172
+ 'len': len,
173
+ 'range': range,
174
+ 'str': str,
175
+ 'int': int,
176
+ 'float': float,
177
+ 'bool': bool,
178
+ 'list': list,
179
+ 'dict': dict,
180
+ 'set': set,
181
+ 'tuple': tuple,
182
+ 'sum': sum,
183
+ 'min': min,
184
+ 'max': max,
185
+ 'abs': abs,
186
+ 'sorted': sorted,
187
+ }}
188
+
189
+ try:
190
+ exec(code, namespace)
191
+ except Exception as e:
192
+ return False, f"Code execution failed: {type(e).__name__}: {e}"
193
+
194
+ if func_name not in namespace:
195
+ return False, f"Function '{func_name}' not found"
196
+
197
+ func = namespace[func_name]
198
+
199
+ for tc in test_cases:
200
+ inputs = tc.get('input', ())
201
+ expected = tc.get('expected')
202
+ desc = tc.get('description', str(inputs))
203
 
204
  try:
205
+ if isinstance(inputs, tuple):
206
+ result = func(*inputs)
207
  else:
208
+ result = func(inputs)
209
  except Exception as e:
210
+ return False, f"Failed on {desc}: {type(e).__name__}: {e}"
211
+
212
+ if result != expected:
213
+ return False, f"Failed on {desc}: expected {expected}, got {result}"
214
 
215
+ return True, ""
216
+
217
+
218
+ # Common algorithm test cases
219
+ ALGORITHM_TESTS = [
220
+ {
221
+ "name": "Two Sum",
222
+ "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
223
+ "function": "two_sum",
224
+ "max_tokens": 100,
225
+ "test_cases": [
226
+ {"input": ([2,7,11,15], 9), "expected": [0,1], "description": "Basic"},
227
+ {"input": ([3,2,4], 6), "expected": [1,2], "description": "Middle"},
228
+ ],
229
+ "keywords": ["for", "in", "range", "enumerate"],
230
+ },
231
+ {
232
+ "name": "Reverse String",
233
+ "prompt": "def reverse_string(s):\n \"\"\"Return the reverse of string s.\"\"\"\n",
234
+ "function": "reverse_string",
235
+ "max_tokens": 50,
236
+ "test_cases": [
237
+ {"input": ("hello",), "expected": "olleh", "description": "Basic"},
238
+ {"input": ("Python",), "expected": "nohtyP", "description": "Mixed case"},
239
+ ],
240
+ "keywords": ["[::-1]", "reversed"],
241
+ },
242
+ {
243
+ "name": "Fibonacci",
244
+ "prompt": "def fibonacci(n):\n \"\"\"Return first n Fibonacci numbers.\"\"\"\n",
245
+ "function": "fibonacci",
246
+ "max_tokens": 100,
247
+ "test_cases": [
248
+ {"input": (7,), "expected": [0,1,1,2,3,5,8], "description": "n=7"},
249
+ {"input": (1,), "expected": [0], "description": "n=1"},
250
+ ],
251
+ "keywords": ["for", "while", "append", "range"],
252
+ },
253
+ {
254
+ "name": "Factorial",
255
+ "prompt": "def factorial(n):\n \"\"\"Return n! (factorial).\"\"\"\n",
256
+ "function": "factorial",
257
+ "max_tokens": 60,
258
+ "test_cases": [
259
+ {"input": (5,), "expected": 120, "description": "5!"},
260
+ {"input": (0,), "expected": 1, "description": "0!"},
261
+ ],
262
+ "keywords": ["for", "while", "range", "*"],
263
+ },
264
+ {
265
+ "name": "Is Palindrome",
266
+ "prompt": "def is_palindrome(x):\n \"\"\"Check if integer x is a palindrome.\"\"\"\n",
267
+ "function": "is_palindrome",
268
+ "max_tokens": 60,
269
+ "test_cases": [
270
+ {"input": (121,), "expected": True, "description": "121"},
271
+ {"input": (-121,), "expected": False, "description": "-121"},
272
+ ],
273
+ "keywords": ["str", "[::-1]"],
274
+ },
275
+ {
276
+ "name": "Binary Search",
277
+ "prompt": "def binary_search(arr, target):\n \"\"\"Return index of target in sorted array, or -1 if not found.\"\"\"\n",
278
+ "function": "binary_search",
279
+ "max_tokens": 120,
280
+ "test_cases": [
281
+ {"input": ([1,2,3,4,5], 3), "expected": 2, "description": "Found"},
282
+ {"input": ([1,2,3,4,5], 6), "expected": -1, "description": "Not found"},
283
+ ],
284
+ "keywords": ["while", "left", "right", "<=", ">"],
285
+ },
286
+ {
287
+ "name": "Merge Sort",
288
+ "prompt": "def merge_sort(arr):\n \"\"\"Return sorted copy of array using merge sort.\"\"\"\n",
289
+ "function": "merge_sort",
290
+ "max_tokens": 200,
291
+ "test_cases": [
292
+ {"input": ([3,1,4,1,5,9,2,6],), "expected": [1,1,2,3,4,5,6,9], "description": "Mixed"},
293
+ {"input": ([1,2,3],), "expected": [1,2,3], "description": "Already sorted"},
294
+ ],
295
+ "keywords": ["def merge_sort", "if", "len", "return", "merge"],
296
+ },
297
+ {
298
+ "name": "Quick Sort",
299
+ "prompt": "def quick_sort(arr):\n \"\"\"Return sorted copy of array using quick sort.\"\"\"\n",
300
+ "function": "quick_sort",
301
+ "max_tokens": 200,
302
+ "test_cases": [
303
+ {"input": ([3,1,4,1,5,9,2,6],), "expected": [1,1,2,3,4,5,6,9], "description": "Mixed"},
304
+ ],
305
+ "keywords": ["def quick_sort", "if", "len", "return"],
306
+ },
307
+ {
308
+ "name": "Maximum Subarray (Kadane's)",
309
+ "prompt": "def max_subarray(nums):\n \"\"\"Return maximum sum of contiguous subarray.\"\"\"\n",
310
+ "function": "max_subarray",
311
+ "max_tokens": 100,
312
+ "test_cases": [
313
+ {"input": ([-2,1,-3,4,-1,2,1,-5,4],), "expected": 6, "description": "Mixed"},
314
+ {"input": ([1],), "expected": 1, "description": "Single"},
315
+ ],
316
+ "keywords": ["for", "max", "+"],
317
+ },
318
+ {
319
+ "name": "Valid Parentheses",
320
+ "prompt": "def valid_parentheses(s):\n \"\"\"Check if string has valid bracket matching.\"\"\"\n",
321
+ "function": "valid_parentheses",
322
+ "max_tokens": 100,
323
+ "test_cases": [
324
+ {"input": ("()",), "expected": True, "description": "Simple"},
325
+ {"input": ("([)]",), "expected": False, "description": "Wrong order"},
326
+ ],
327
+ "keywords": ["stack", "if", "for", "in", "pop", "append"],
328
+ },
329
+ {
330
+ "name": "Climbing Stairs",
331
+ "prompt": "def climb_stairs(n):\n \"\"\"Count ways to climb n stairs (1 or 2 steps at a time).\"\"\"\n",
332
+ "function": "climb_stairs",
333
+ "max_tokens": 80,
334
+ "test_cases": [
335
+ {"input": (5,), "expected": 8, "description": "n=5"},
336
+ {"input": (2,), "expected": 2, "description": "n=2"},
337
+ ],
338
+ "keywords": ["for", "while", "+", "="],
339
+ },
340
+ {
341
+ "name": "List Sum",
342
+ "prompt": "def list_sum(nums):\n \"\"\"Return sum of all numbers in list.\"\"\"\n",
343
+ "function": "list_sum",
344
+ "max_tokens": 50,
345
+ "test_cases": [
346
+ {"input": ([1,2,3,4],), "expected": 10, "description": "Basic"},
347
+ {"input": ([],), "expected": 0, "description": "Empty"},
348
+ ],
349
+ "keywords": ["for", "in", "+", "sum", "0"],
350
+ },
351
+ {
352
+ "name": "List Average",
353
+ "prompt": "def list_avg(nums):\n \"\"\"Return average of numbers in list.\"\"\"\n",
354
+ "function": "list_avg",
355
+ "max_tokens": 60,
356
+ "test_cases": [
357
+ {"input": ([1,2,3,4,5],), "expected": 3.0, "description": "Basic"},
358
+ {"input": ([5],), "expected": 5.0, "description": "Single"},
359
+ ],
360
+ "keywords": ["sum", "len", "/", "float"],
361
+ },
362
+ {
363
+ "name": "Find Maximum",
364
+ "prompt": "def find_max(nums):\n \"\"\"Return maximum value in list.\"\"\"\n",
365
+ "function": "find_max",
366
+ "max_tokens": 60,
367
+ "test_cases": [
368
+ {"input": ([3,1,4,1,5,9],), "expected": 9, "description": "Basic"},
369
+ {"input": ([-1,-5,-3],), "expected": -1, "description": "Negatives"},
370
+ ],
371
+ "keywords": ["for", "in", "max", ">", "<"],
372
+ },
373
+ {
374
+ "name": "Count Zeros",
375
+ "prompt": "def count_zeros(nums):\n \"\"\"Count zeros in list.\"\"\"\n",
376
+ "function": "count_zeros",
377
+ "max_tokens": 50,
378
+ "test_cases": [
379
+ {"input": ([0,1,0,2,0],), "expected": 3, "description": "Mixed"},
380
+ {"input": ([1,2,3],), "expected": 0, "description": "No zeros"},
381
+ ],
382
+ "keywords": ["for", "in", "count", "==", "+"],
383
+ },
384
+ {
385
+ "name": "Unique Elements",
386
+ "prompt": "def unique_elements(lst):\n \"\"\"Return list of unique elements preserving order.\"\"\"\n",
387
+ "function": "unique_elements",
388
+ "max_tokens": 80,
389
+ "test_cases": [
390
+ {"input": ([1,2,2,3,1],), "expected": [1,2,3], "description": "With dups"},
391
+ {"input": ([1,2,3],), "expected": [1,2,3], "description": "All unique"},
392
+ ],
393
+ "keywords": ["for", "in", "if", "append", "set"],
394
+ },
395
+ ]
396
+
397
+
398
+ def run_test(model, tokenizer, test_config: Dict) -> Dict:
399
+ """Run a single test and return results."""
400
+ name = test_config["name"]
401
+ prompt = test_config["prompt"]
402
+ func_name = test_config["function"]
403
+ max_tokens = test_config["max_tokens"]
404
+ test_cases = test_config["test_cases"]
405
+ keywords = test_config.get("keywords", [])
406
+
407
+ print(f"\n Test: {name}")
408
+ print(f" Prompt: {prompt.strip()[:60]}...")
409
+
410
+ start_time = time.time()
411
+
412
+ # Generate completion
413
+ completions = generate_completion(model, tokenizer, prompt, max_new_tokens=max_tokens)
414
+ elapsed = time.time() - start_time
415
+
416
+ # Extract and check code
417
+ code = extract_code(completions[0])
418
+
419
+ print(f" Generated in {elapsed:.2f}s")
420
+ print(f" Code preview: {code[:100]}...")
421
+
422
+ # Check syntax and execution
423
+ success, error_msg = check_function_output(code, func_name, test_cases)
424
+
425
+ # Check keywords
426
+ keywords_found = sum(1 for kw in keywords if kw.lower() in code.lower())
427
+ keyword_ratio = keywords_found / len(keywords) if keywords else 0
428
+
429
+ result = {
430
+ "name": name,
431
+ "passed": success,
432
+ "keywords_found": keywords_found,
433
+ "keywords_total": len(keywords),
434
+ "keyword_ratio": keyword_ratio,
435
+ "execution_time": elapsed,
436
+ "error": error_msg if not success else None,
437
+ "generated_code": code[:500], # Truncate for storage
438
+ }
439
+
440
+ if success:
441
+ print(f" Result: ✅ PASS")
442
+ else:
443
+ print(f" Result: ❌ FAIL - {error_msg[:60]}")
444
+
445
+ return result
446
+
447
+
448
+ def calculate_pass_at_k(results: List[Dict], k: int) -> float:
449
+ """Calculate pass@k across all tests.
450
+
451
+ For simplicity, a test passes if it passes the functional test.
452
+ """
453
+ if not results or k <= 0:
454
+ return 0.0
455
+
456
+ passed = sum(1 for r in results if r["passed"])
457
+ total = len(results)
458
+
459
+ # Simple pass@k: probability that at least 1 of k samples would pass.
460
+ # Assuming independence, this is 1 - (1 - p)^k where p = passed/total.
461
+ p = passed / total
462
+ if k == 1:
463
+ return p
464
+ return 1 - (1 - p) ** k
472
+
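The closed form above treats the suite-level pass rate as a per-sample probability. For benchmark-style reporting (HumanEval/MBPP), the standard unbiased pass@k estimator works per problem with `n` generated samples of which `c` pass: pass@k = 1 - C(n-c, k) / C(n, k). A sketch of that estimator (not wired into this script, which draws one sample per test):

```python
from math import comb

def unbiased_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (HumanEval formulation): probability that at
    least one of k samples drawn without replacement from n
    generations is among the c correct ones."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Using this would mean setting `num_samples > 1` in `generate_completion` and scoring each sample per test, then averaging pass@k over tests.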
473
+
474
+ def print_results(results: List[Dict], k_values: List[int] = [1, 10]):
475
+ """Print test results summary."""
476
+ print("\n" + "="*60)
477
+ print("TEST RESULTS SUMMARY")
478
  print("="*60)
479
 
480
+ passed = sum(1 for r in results if r["passed"])
481
+ total = len(results)
482
+
483
+ print(f"\n Total tests: {total}")
484
+ print(f" Passed: {passed}")
485
+ print(f" Failed: {total - passed}")
486
+ print(f" Pass rate: {100*passed/total:.1f}%")
487
+
488
+ for k in k_values:
489
+ pass_at_k = calculate_pass_at_k(results, k)
490
+ print(f"\n Pass@{k}: {100*pass_at_k:.1f}%")
491
+
492
+ print("\n Individual Results:")
493
+ for r in results:
494
+ status = "✅" if r["passed"] else "❌"
495
+ print(f" {status} {r['name']} (keywords: {r['keywords_found']}/{r['keywords_total']})")
496
+
497
+
498
+ def save_results(results: List[Dict], output_path: str):
499
+ """Save test results to JSON."""
500
+ with open(output_path, 'w') as f:
501
+ json.dump(results, f, indent=2)
502
+ print(f"\n✅ Results saved to: {output_path}")
503
 
504
 
505
  def main():
506
+ parser = argparse.ArgumentParser(
507
+ description="Test Stack 2.9 model on common algorithm tasks"
508
+ )
509
+ parser.add_argument(
510
+ "--model-path",
511
+ type=str,
512
+ required=True,
513
+ help="Path to the merged model directory"
514
+ )
515
+ parser.add_argument(
516
+ "--output",
517
+ type=str,
518
+ default="test_results.json",
519
+ help="Output file for results (default: test_results.json)"
520
+ )
521
+ parser.add_argument(
522
+ "--max-tokens",
523
+ type=int,
524
+ default=200,
525
+ help="Max new tokens per generation (default: 200)"
526
+ )
527
+ parser.add_argument(
528
+ "--temperature",
529
+ type=float,
530
+ default=0.2,
531
+ help="Sampling temperature (default: 0.2)"
532
+ )
533
+ parser.add_argument(
534
+ "--test-names",
535
+ type=str,
536
+ default=None,
537
+ help="Comma-separated test names to run (default: all)"
538
+ )
539
+ parser.add_argument(
540
+ "--k-values",
541
+ type=str,
542
+ default="1,10",
543
+ help="Comma-separated k values for pass@k (default: 1,10)"
544
+ )
545
+
546
  args = parser.parse_args()
547
 
548
+ k_values = [int(k.strip()) for k in args.k_values.split(",")]
549
+
550
+ print("="*60)
551
+ print("Stack 2.9 Model - Algorithm Tests")
552
+ print("="*60)
553
+ print(f"Model path: {args.model_path}")
554
+ print(f"Output: {args.output}")
555
+ print(f"Max tokens: {args.max_tokens}")
556
+ print(f"Temperature: {args.temperature}")
557
+
558
+ # Load model
559
+ model, tokenizer = load_model(args.model_path)
560
+ model.eval()
561
+
562
+ # Select tests to run
563
+ tests_to_run = ALGORITHM_TESTS
564
+ if args.test_names:
565
+ names = [n.strip() for n in args.test_names.split(",")]
566
+ tests_to_run = [t for t in tests_to_run if t["name"] in names]
567
+ print(f"Running tests: {[t['name'] for t in tests_to_run]}")
568
+
569
+ if not tests_to_run:
570
+ print("No tests to run!")
571
+ return
572
+
573
+ # Override max_tokens for each test
574
+ for test in tests_to_run:
575
+ if args.max_tokens:
576
+ test["max_tokens"] = args.max_tokens
577
+
578
+ # Run tests
579
+ print("\n" + "="*60)
580
+ print(f"Running {len(tests_to_run)} Tests")
581
+ print("="*60)
582
+
583
+ results = []
584
+ total_start = time.time()
585
+
586
+ for i, test in enumerate(tests_to_run, 1):
587
+ print(f"\n[{i}/{len(tests_to_run)}]")
588
+ result = run_test(model, tokenizer, test)
589
+ results.append(result)
590
+
591
+ total_time = time.time() - total_start
592
+
593
+ # Print summary
594
+ print_results(results, k_values)
595
+ print(f"\n Total time: {total_time:.1f}s")
596
+
597
+ # Save results
598
+ save_results(results, args.output)
599
 
600
 
601
  if __name__ == "__main__":
train_simple.py DELETED
@@ -1,197 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Simple standalone training script for Stack 2.9.
4
- No package installation required — just run: python train_simple.py --config train_config.yaml
5
- """
6
-
7
- import argparse
8
- import os
9
- import sys
10
- from pathlib import Path
11
-
12
- import yaml
13
- from datasets import load_dataset
14
- from transformers import (
15
- AutoModelForCausalLM,
16
- AutoTokenizer,
17
- BitsAndBytesConfig,
18
- TrainingArguments,
19
- Trainer,
20
- DataCollatorForLanguageModeling
21
- )
22
- from peft import LoraConfig, get_peft_model, TaskType
23
- import torch
24
-
25
-
26
- def load_config(config_path: str) -> dict:
27
- with open(config_path, 'r') as f:
28
- return yaml.safe_load(f)
29
-
30
-
31
- def load_model_and_tokenizer(model_name: str, trust_remote_code: bool = True, use_4bit: bool = False):
32
- """Load base model and tokenizer."""
33
- if use_4bit:
34
- quantization_config = BitsAndBytesConfig(
35
- load_in_4bit=True,
36
- bnb_4bit_compute_dtype=torch.float16,
37
- bnb_4bit_quant_type="nf4",
38
- bnb_4bit_use_double_quant=True
39
- )
40
- else:
41
- quantization_config = None
42
-
43
- tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=trust_remote_code)
44
- model = AutoModelForCausalLM.from_pretrained(
45
- model_name,
46
- quantization_config=quantization_config,
47
- torch_dtype=torch.float16,
48
- trust_remote_code=trust_remote_code,
49
- device_map="auto"
50
- )
51
- return model, tokenizer
52
-
53
-
54
- def load_data(data_path: str, tokenizer, max_length: int = 2048, train_split: float = 0.9):
55
- """Load and tokenize dataset."""
56
- raw_dataset = load_dataset("json", data_files=data_path, split="train")
57
-
58
- def tokenize_function(examples):
59
- # Combine instruction and output
60
- texts = []
61
- for instr, out in zip(examples.get("instruction", [""]), examples.get("output", [""])):
62
- if instr and out:
63
- texts.append(f"### Instruction:\n{instr}\n\n### Response:\n{out}")
64
- elif out:
65
- texts.append(out)
66
- elif instr:
67
- texts.append(instr)
68
- else:
69
- texts.append("")
70
-
71
- tokenized = tokenizer(texts, truncation=True, max_length=max_length, padding="max_length")
72
- tokenized["labels"] = tokenized["input_ids"].copy()
73
- return tokenized
74
-
75
- tokenized_dataset = raw_dataset.map(tokenize_function, batched=True, remove_columns=raw_dataset.column_names)
76
- split = tokenized_dataset.train_test_split(train_size=train_split)
77
- return split["train"], split["test"]
78
-
79
-
80
- def train(config: dict):
81
- """Main training function."""
82
- model_config = config["model"]
83
- data_config = config["data"]
84
- lora_config = config["lora"]
85
- training_config = config["training"]
86
- output_config = config["output"]
87
- hardware_config = config["hardware"]
88
-
89
- # Load model and tokenizer
90
- print(f"Loading model: {model_config['name']}")
91
- model, tokenizer = load_model_and_tokenizer(
92
- model_name=model_config["name"],
93
- trust_remote_code=model_config.get("trust_remote_code", True),
94
- use_4bit=hardware_config.get("use_4bit", False)
95
- )
96
-
97
- # Load data
98
- print(f"Loading dataset: {data_config['input_path']}")
99
- train_dataset, eval_dataset = load_data(
100
- data_path=data_config["input_path"],
101
- tokenizer=tokenizer,
102
- max_length=data_config.get("max_length", 2048),
103
- train_split=data_config.get("train_split", 0.9)
104
- )
105
- print(f" Train samples: {len(train_dataset)}")
106
- print(f" Eval samples: {len(eval_dataset)}")
107
-
108
- # Apply LoRA
109
- peft_config = LoraConfig(
110
- r=lora_config["r"],
111
- alpha=lora_config["alpha"],
112
- dropout=lora_config["dropout"],
113
- target_modules=lora_config["target_modules"],
114
- bias=lora_config["bias"],
115
- task_type=TaskType.CAUSAL_LM
116
- )
117
-     model = get_peft_model(model, peft_config)
-     model.print_trainable_parameters()
-
-     # Training arguments
-     output_dir = output_config["lora_dir"]
-     os.makedirs(output_dir, exist_ok=True)
-
-     training_args = TrainingArguments(
-         output_dir=output_dir,
-         num_train_epochs=training_config["num_epochs"],
-         per_device_train_batch_size=training_config["batch_size"],
-         gradient_accumulation_steps=training_config["gradient_accumulation"],
-         learning_rate=training_config["learning_rate"],
-         warmup_steps=training_config.get("warmup_steps", 100),
-         weight_decay=training_config.get("weight_decay", 0.01),
-         max_grad_norm=training_config.get("max_grad_norm", 1.0),
-         logging_steps=training_config.get("logging_steps", 10),
-         save_steps=training_config.get("save_steps", 100),
-         save_total_limit=training_config.get("save_total_limit", 2),
-         fp16=training_config.get("fp16", True),
-         bf16=training_config.get("bf16", False),
-         gradient_checkpointing=training_config.get("gradient_checkpointing", True),
-         evaluation_strategy="steps" if eval_dataset else "no",
-         eval_steps=training_config.get("eval_steps", 100) if eval_dataset else None,
-         report_to="none",  # No WandB
-     )
-
-     data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
-
-     trainer = Trainer(
-         model=model,
-         args=training_args,
-         train_dataset=train_dataset,
-         eval_dataset=eval_dataset,
-         data_collator=data_collator,
-         tokenizer=tokenizer,
-     )
-
-     print("="*60)
-     print("Starting training...")
-     print("="*60)
-     trainer.train()
-     print("Training completed!")
-
-     # Save final adapter
-     trainer.save_model(output_dir)
-     print(f"✅ Adapter saved to {output_dir}")
-
-     return trainer
-
-
- def main():
-     parser = argparse.ArgumentParser()
-     parser.add_argument("--config", type=str, required=True, help="Path to YAML config")
-     args = parser.parse_args()
-
-     print("="*60)
-     print("Stack 2.9 Simple Training")
-     print("="*60)
-
-     config = load_config(args.config)
-     print(f"Config loaded: {args.config}")
-     print(f"Model: {config['model']['name']}")
-     print(f"Data: {config['data']['input_path']}")
-
-     try:
-         train(config)
-         print("\n" + "="*60)
-         print("✅ TRAINING SUCCESS")
-         print("="*60)
-     except Exception as e:
-         print("\n" + "="*60)
-         print(f"❌ TRAINING FAILED: {e}")
-         print("="*60)
-         import traceback
-         traceback.print_exc()
-         sys.exit(1)
-
-
- if __name__ == "__main__":
-     main()
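Both this script and the configs below size the optimizer step as per-device batch × gradient-accumulation steps × GPU count, so it is worth sanity-checking the effective batch size before a launch. A minimal sketch of that arithmetic (plain Python, using only the config fields already shown):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Effective optimizer batch size the Trainer sees per step."""
    return per_device * grad_accum * num_gpus

# Values from the configs in this commit:
print(effective_batch_size(2, 8))  # A100 full-LoRA config -> 16
print(effective_batch_size(1, 8))  # Kaggle QLoRA script  -> 8
```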
training-configs/7b-lora-config.yaml ADDED
@@ -0,0 +1,104 @@
+ # Stack 2.9 Training Configuration for Qwen2.5-Coder-7B
+ # Optimized for A100 80GB or a multi-GPU setup (2x A100 40GB minimum)
+ # Defaults below assume single-GPU training of the 7B model
+
+ model:
+   name: "Qwen/Qwen2.5-Coder-7B"
+   trust_remote_code: true
+   torch_dtype: "bfloat16"  # bf16 for A100, better numerical stability than fp16
+   use_flash_attention: true  # A100 supports FlashAttention-2
+
+ tokenizer:
+   model_max_length: 8192  # Standard 8K context for coding tasks
+   padding_side: "right"
+   truncation_side: "right"
+
+ # LoRA Configuration - Optimized for 7B model
+ lora:
+   r: 16  # LoRA rank: 16 is sufficient for 7B and saves memory
+   alpha: 32  # alpha = 2*r is standard
+   dropout: 0.05
+   target_modules:
+     - "q_proj"
+     - "k_proj"
+     - "v_proj"
+     - "o_proj"
+     - "gate_proj"
+     - "up_proj"
+     - "down_proj"
+   bias: "none"
+   task_type: "CAUSAL_LM"
+
+ # Training Configuration
+ training:
+   num_epochs: 3  # 3 epochs is typically sufficient
+   batch_size: 2  # per GPU, for A100 80GB
+   gradient_accumulation: 8  # Effective batch = 16 (2 * 8)
+   learning_rate: 1.0e-4  # Standard for LoRA
+   warmup_steps: 100
+   weight_decay: 0.01
+   max_grad_norm: 1.0
+   logging_steps: 10
+   eval_steps: 500
+   save_steps: 1000
+   save_total_limit: 3
+   fp16: false
+   bf16: true  # Use bf16 on A100
+   gradient_checkpointing: true  # Enable for memory savings
+   optim: "adamw_torch"  # Standard optimizer
+
+ # Data Configuration
+ data:
+   train_files:
+     - "/Users/walidsobhi/.openclaw/workspace/stack-2.9/training-data/final/train.jsonl"
+   val_file: "/Users/walidsobhi/.openclaw/workspace/stack-2.9/training-data/final/val.jsonl"
+   test_file: "/Users/walidsobhi/.openclaw/workspace/stack-2.9/training-data/final/test.jsonl"
+   max_length: 8192
+   train_split: 0.9
+   test_split: 0.1
+
+ # Output Configuration
+ output:
+   lora_dir: "/Users/walidsobhi/.openclaw/workspace/stack-2.9-training/output/stack-2.9-7b-lora"
+   merged_dir: "/Users/walidsobhi/.openclaw/workspace/stack-2.9-training/output/stack-2.9-7b-merged"
+   awq_dir: "/Users/walidsobhi/.openclaw/workspace/stack-2.9-training/output/stack-2.9-7b-awq"
+
+ # Quantization - Optional for 7B
+ # Enable 4-bit only if running on limited VRAM (< 40GB per GPU)
+ quantization:
+   enabled: false  # Disabled for A100 80GB - not needed
+   bits: 4
+   group_size: 128
+
+ # Logging Configuration
+ logging:
+   report_to: "none"  # Set to "wandb" for Weights & Biases
+   wandb_project: "stack-2.9-training"
+   run_name: null
+
+ # Hardware Configuration
+ hardware:
+   device: "cuda"
+   num_gpus: 1  # Set to 2, 4, or 8 for multi-GPU
+   use_4bit: false  # Not needed for A100 80GB
+   use_8bit: false  # Optional: enable for extra memory savings
+   zero_stage: 0  # ZeRO stage (0 = disabled; 1, 2, 3 for multi-GPU)
+
+ # Multi-GPU Configuration (used when num_gpus > 1)
+ multigpu:
+   zero_stage: 2  # ZeRO-2 recommended for 2-4 GPUs, ZeRO-3 for 8+
+   gradient_bucket_size: 5e8
+   prefetch_bucket_size: 5e8
+   steps_per_print: 10
+
+ # Memory Estimates (7B):
+ # - Full LoRA bf16, batch_size=2: ~18GB VRAM
+ # - Full LoRA bf16, batch_size=4: ~22GB VRAM
+ # - 4-bit QLoRA, batch_size=1: ~6GB VRAM
+ # - 4-bit QLoRA, batch_size=2: ~8GB VRAM
+ # An A100 80GB has ample headroom for any of these
+
+ # Merge Configuration (run after training)
+ merge:
+   enabled: true
+   output_dir: "/Users/walidsobhi/.openclaw/workspace/stack-2.9-training/output/stack-2.9-7b-merged"
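For a sense of what `r: 16` over these seven target modules costs, the trainable-parameter count can be estimated as r × (d_in + d_out) per adapted linear layer. The dimensions below are assumptions for Qwen2.5-Coder-7B (hidden size 3584, intermediate size 18944, 28 layers, GQA k/v projections of width 512); verify them against the model's `config.json` before relying on the numbers.

```python
# Rough LoRA trainable-parameter estimate for the config above.
# Model dimensions are ASSUMED for Qwen2.5-Coder-7B; check config.json.
R = 16
HIDDEN, INTER, KV_DIM, LAYERS = 3584, 18944, 512, 28

# (in_features, out_features) of each adapted projection
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTER),
    "up_proj": (HIDDEN, INTER),
    "down_proj": (INTER, HIDDEN),
}

# Each LoRA pair (A: r x d_in, B: d_out x r) adds r * (d_in + d_out) params.
per_layer = sum(R * (d_in + d_out) for d_in, d_out in MODULES.values())
total = per_layer * LAYERS
print(f"~{total / 1e6:.1f}M trainable parameters")  # ~40.4M, well under 1% of 7B
```

This is why the r=16 adapter trains comfortably even on a 16GB card once the base model is quantized.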
training-configs/kaggle-7b-qlora.sh ADDED
@@ -0,0 +1,79 @@
+ #!/bin/bash
+ # Kaggle QLoRA Training Script for Stack 2.9 on Qwen2.5-Coder-7B
+ #
+ # Hardware: Kaggle P100 GPU (16GB VRAM)
+ # Expected runtime: 2-4 hours, depending on dataset size
+ # Memory: ~6GB VRAM with QLoRA + gradient checkpointing
+ #
+ # Usage:
+ #   chmod +x kaggle-7b-qlora.sh
+ #   ./kaggle-7b-qlora.sh
+
+ set -e
+
+ # === Configuration ===
+ MODEL_NAME="Qwen/Qwen2.5-Coder-7B"
+ OUTPUT_DIR="/kaggle/working/stack-2.9-output/lora-7b-qlora"
+ DATA_DIR="/kaggle/working/stack-2.9/data"
+
+ # === Hyperparameters ===
+ # LoRA config (r=16, alpha=32)
+ LORA_R=16
+ LORA_ALPHA=32
+ LORA_DROPOUT=0.05
+
+ # Training config
+ EPOCHS=2
+ BATCH_SIZE=1
+ GRAD_ACCUM=8  # Effective batch = 8
+ LR=1e-4
+ MAX_LENGTH=4096  # Reduced for Kaggle memory limits
+
+ # === Environment Setup ===
+ echo "=== Setting up environment ==="
+ cd /kaggle/working/stack-2.9
+
+ # Install dependencies if needed
+ pip install -q transformers datasets peft bitsandbytes accelerate torch
+
+ # === Training Command ===
+ echo "=== Starting QLoRA training ==="
+ echo "Model: $MODEL_NAME"
+ echo "Output: $OUTPUT_DIR"
+ echo "Epochs: $EPOCHS"
+
+ trl sft \
+     --model_name_or_path "$MODEL_NAME" \
+     --train_files "$DATA_DIR/train.jsonl" \
+     --validation_files "$DATA_DIR/val.jsonl" \
+     --output_dir "$OUTPUT_DIR" \
+     --per_device_train_batch_size $BATCH_SIZE \
+     --gradient_accumulation_steps $GRAD_ACCUM \
+     --num_train_epochs $EPOCHS \
+     --learning_rate $LR \
+     --max_seq_length $MAX_LENGTH \
+     --logging_steps 10 \
+     --save_steps 500 \
+     --save_total_limit 2 \
+     --bf16 true \
+     --gradient_checkpointing true \
+     --lora_r $LORA_R \
+     --lora_alpha $LORA_ALPHA \
+     --lora_dropout $LORA_DROPOUT \
+     --target_modules q_proj k_proj v_proj o_proj gate_proj up_proj down_proj \
+     --load_in_4bit true \
+     --bnb_4bit_compute_dtype "bfloat16" \
+     --bnb_4bit_quant_type "nf4" \
+     --bnb_4bit_use_double_quant true
+
+ echo "=== Training complete ==="
+ echo "LoRA adapters saved to: $OUTPUT_DIR"
+
+ # Optional: Merge and save merged model
+ echo "=== Merging adapter ==="
+ python merge_simple.py \
+     --base_model "$MODEL_NAME" \
+     --lora_adapter "$OUTPUT_DIR" \
+     --output_dir "$OUTPUT_DIR/merged"
+
+ echo "=== Done ==="
training-data/tool_examples.json ADDED
The diff for this file is too large to render. See raw diff
 
training-data/tool_examples.jsonl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1043720a918f5fe0f70cc013c108710570c37ae6c9cee6f504e49dc359af5a2a
+ size 3779800
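`tool_examples.jsonl` is checked in as a Git LFS pointer rather than raw data, so a clone without LFS yields only the three lines above. The pointer format is trivially machine-readable; a small stdlib-only sketch of parsing it:

```python
# Parse a Git LFS pointer file (the three-line stub shown above).
def parse_lfs_pointer(text: str) -> dict:
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "sha256": fields["oid"].removeprefix("sha256:"),
        "size_bytes": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:1043720a918f5fe0f70cc013c108710570c37ae6c9cee6f504e49dc359af5a2a
size 3779800
"""
info = parse_lfs_pointer(pointer)
print(info["size_bytes"])  # 3779800 -> ~3.8 MB for the 1000 examples
```

Run `git lfs pull` to materialize the real JSONL before pointing the training configs at it.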