# SozKZ Misc: TTS, Sentiment & Other

A collection of miscellaneous Kazakh AI models and datasets: TTS, sentiment analysis, speech, benchmarks.
Binary sentiment classifier for Kazakh text, fine-tuned from `sozkz-core-llama-600m-kk-base-v1`.

The model wraps the input text in a special `<sentiment>` tag and generates the label as a continuation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-600m-kk-sentiment-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

text = "Тамақтары өте дәмді, қызмет көрсету керемет!"  # "The food is very tasty, the service is wonderful!"
prompt = f"<sentiment>{text}</sentiment>\n"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)

# Decode only the newly generated tokens, not the prompt
generated = output[0][inputs["input_ids"].shape[1]:]
label = tokenizer.decode(generated, skip_special_tokens=True).strip()
print(label)  # "positive"
```
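The prompt construction and label parsing above are plain string handling and can be factored into small helpers. This is a sketch; the function names are illustrative and not part of the model's API, and the normalization of raw continuations is an assumption:

```python
def build_prompt(text: str) -> str:
    """Wrap raw Kazakh text in the model's <sentiment> tag, ending with a newline."""
    return f"<sentiment>{text}</sentiment>\n"

def parse_label(generated_text: str) -> str:
    """Normalize the model's raw continuation to 'positive' or 'negative'."""
    label = generated_text.strip().lower()
    if label.startswith("pos"):
        return "positive"
    if label.startswith("neg"):
        return "negative"
    raise ValueError(f"unexpected label: {generated_text!r}")
```

Normalizing through a prefix check guards against stray whitespace or truncated continuations when `max_new_tokens` is small.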
| Parameter | Value |
|---|---|
| Base model | sozkz-core-llama-600m-kk-base-v1 (587M params) |
| Dataset | issai/kazsandra (KazSAnDRA) → binary (positive/negative) |
| Train samples | 57,312 (balanced) |
| Val samples | 3,016 |
| Epochs | 3 |
| Batch size | 64 effective (8 per GPU × 4 GPUs × 2 gradient accumulation) |
| Learning rate | 2e-5 (cosine) |
| Final loss | ~0.10 |
| Hardware | 4× RTX 4090 |
| Training time | ~1.9h |
Based on issai/kazsandra (KazSAnDRA, LREC 2024). Ratings 1-2 were mapped to negative and 4-5 to positive; rating 3 (neutral) was excluded. Classes were balanced by undersampling the majority class.
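That preprocessing can be sketched as follows, assuming the raw data arrives as `(text, score)` pairs; the function names and fixed seed are illustrative, not taken from the training code:

```python
import random

def to_binary(samples):
    """Map 5-point ratings to binary labels: 1-2 -> negative, 4-5 -> positive, 3 dropped."""
    out = []
    for text, score in samples:
        if score <= 2:
            out.append((text, "negative"))
        elif score >= 4:
            out.append((text, "positive"))
        # score == 3 (neutral) is excluded entirely
    return out

def undersample(samples, seed=0):
    """Balance classes by randomly undersampling the majority class."""
    pos = [s for s in samples if s[1] == "positive"]
    neg = [s for s in samples if s[1] == "negative"]
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)
```

Undersampling the majority class trades training data for a balanced label distribution, which matches the 57,312 balanced train samples reported above.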
Scores 10/10 on manual test examples covering positive, negative, and ambiguous inputs.
License: MIT

Base model: stukenov/sozkz-core-llama-600m-kk-base-v1