SymbolicLight V1 Open Package
This repository is the public release root for SymbolicLight V1. It combines two parts of the project narrative in one place:
- The 194M Dual-Path pre-training story, which provides the main controlled language-modeling evidence.
- The 0.8B scale-up release, which provides a public checkpoint, tokenizer, executable code paths, and artifact-based verification material.
The repository should therefore be read as a unified public package, not as two separate model lines. The architecture name used throughout the current paper, code, and public materials is SymbolicLight V1.
Package Layout
- LICENSE: Apache License, Version 2.0, covering the released code and public assets unless otherwise stated
- WEIGHTS_LICENSE.md: license scope for the cleaned model weights and tokenizer assets
- MODEL_CARD.md: model card for the released SymbolicLight V1 checkpoint
- NOTICE: copyright and release-boundary notice
- THIRD_PARTY_NOTICES.md: third-party dependency notices
- src/: public Python implementation, training loop, tokenizer tooling, and inference scripts
- tokenizer/: released tokenizer model, vocabulary, and configuration
- weights/pytorch/latest.pt: released weights-only 0.8B checkpoint
- paper/: unified English manuscript, Chinese companion manuscript, bibliography, and compiled PDFs
- artifacts/: public smoke-test logs and checkpoint metadata summaries
- docs/: project lineage, 194M training narrative, and release-facing documentation
- REPRODUCIBILITY.md: artifact-based reproducibility scope and verified commands
- train_runs_194m.json: registry for the four main 194M runs and selected historical comparison checkpoints
What This Package Conveys
The intended public narrative is:
- 194M SymbolicLight V1 is the main controlled study. It establishes that a spike-gated dual-path language model can train stably at high activation sparsity and remain competitive with dense baselines.
- 0.8B SymbolicLight V1 is scale-up evidence. It shows that the same overall architectural direction can be extended to a larger native pre-training run.
- The public release supports artifact inspection, checkpoint verification, inference verification, and smoke-test training. It does not provide full public reconstruction of the original private pre-training corpus.
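In practice, the checkpoint verification mentioned above usually amounts to hashing the released weights file and comparing the digest against a published value. A minimal sketch of that pattern follows; the helper name `sha256_of` and the demo file are illustrative, and the repository's actual verification commands are documented in REPRODUCIBILITY.md:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints use constant memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Demo on a throwaway file; against the real release you would point this
    # at weights/pytorch/latest.pt and compare with the published digest.
    demo = Path("demo.bin")
    demo.write_bytes(b"example checkpoint bytes")
    print(sha256_of(demo))
    demo.unlink()
```

The streaming read matters for a 0.8B-parameter checkpoint, which is far too large to hash comfortably as a single in-memory buffer.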
Recommended Reading Order
Public Boundary
The release includes code, tokenizer assets, cleaned model weights, and artifact-level documentation. Training and validation corpora are not distributed with this repository; only aggregate data categories, mixture proportions, and preprocessing rules are documented.
License
The Apache-2.0 license applies to the released code, tokenizer assets, cleaned model weights, and public documentation. It does not apply to training or validation corpora, which are not distributed with this repository. The public data disclosure is limited to aggregate domain categories, mixture proportions, and preprocessing rules.