A scaled-down version of the Gemma architecture trained on the Tiny Shakespeare dataset.
## Model
- Architecture: Gemma (Transformer Decoder)
- Attention: Multi-Query Attention (MQA)
- Hidden Size: 768
- Number of Layers: 12
- Number of Query Heads: 2
- Number of KV Heads: 1
- Sequence Length: 128 (Block Size)
- Vocabulary Size: 65 (Character-level encoding)
- Total Training Steps: 3,500
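The configuration above (2 query heads sharing a single K/V head) can be sketched as follows. This is a minimal NumPy illustration of Multi-Query Attention with the listed shapes, not the repository's actual code:

```python
import numpy as np

# Hypothetical shapes matching the config above: 2 query heads, 1 shared KV head.
hidden, n_q_heads, seq = 768, 2, 128
head_dim = hidden // n_q_heads  # 384 per query head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((seq, head_dim))  # single K head shared by all queries
v = rng.standard_normal((seq, head_dim))  # single V head shared by all queries

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Every query head attends over the same K/V head (the defining trait of MQA).
scores = q @ k.T / np.sqrt(head_dim)                    # (2, 128, 128)
mask = np.triu(np.ones((seq, seq)), k=1).astype(bool)   # causal mask
scores = np.where(mask, -np.inf, scores)
out = softmax(scores) @ v                               # (2, 128, 384)
```

Sharing one K/V head shrinks the KV cache by the number of query heads, which is why MQA is popular for inference-heavy deployments.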
## Architecture
- RMSNorm
- GeGLU
- RoPE
- Embedding Scaling
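Two of the features above are easy to show concretely. Below is a hedged NumPy sketch of RMSNorm (normalize by the root-mean-square, with no mean subtraction, unlike LayerNorm) and Gemma-style embedding scaling (multiply token embeddings by the square root of the hidden size); it illustrates the math, not this repo's implementation:

```python
import numpy as np

hidden = 768

def rmsnorm(x, weight, eps=1e-6):
    # Divide by the RMS of each vector, then apply a learned per-channel scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.random.default_rng(1).standard_normal((4, hidden))
y = rmsnorm(x, np.ones(hidden))

# Gemma scales token embeddings by sqrt(hidden_size) before the first layer.
scaled_emb = x * np.sqrt(hidden)
```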
## Usage
You can load this model directly with the Hugging Face `transformers` library:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("parkneurals/Gemma")

# Note: This model uses a custom character-level tokenizer.
# Use the provided char_map.json for encoding/decoding.
```
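Encoding and decoding with a character map might look like the sketch below. Note this assumes `char_map.json` maps each character to an integer id; the actual file format in this repo may differ, so treat this as illustrative only:

```python
import json

# Stand-in for json.load(open("char_map.json")); the real file's
# format is an assumption here.
char_map = json.loads('{"a": 0, "b": 1, " ": 2}')
id_to_char = {i: c for c, i in char_map.items()}

def encode(text):
    # Map each character to its integer id.
    return [char_map[c] for c in text]

def decode(ids):
    # Map each id back to its character and join.
    return "".join(id_to_char[i] for i in ids)

ids = encode("ab a")
text = decode(ids)  # round-trips back to "ab a"
```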
Inference is slow because the RoPE rotation is recomputed in every layer for each token. The model was built purely for learning purposes, so please bear with it.
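The usual fix for that slowdown is to precompute the RoPE cos/sin tables once and reuse them across layers and tokens. A hedged NumPy sketch (hypothetical helpers, not this repo's code):

```python
import numpy as np

def rope_cache(seq_len, head_dim, base=10000.0):
    # Precompute cos/sin tables once; every layer can then reuse them.
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, head_dim // 2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    # x: (seq_len, head_dim); rotate each even/odd channel pair by its angle.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

cos, sin = rope_cache(128, 384)   # matches block size 128, head dim 384 above
x = np.ones((128, 384))
rotated = apply_rope(x, cos, sin)
```

Because the rotation only depends on position and channel index, the two tables can be built once at model load time instead of per layer per token.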