# synth-4.0-pplx → LiteRT
LiteRT (.tflite) exports of a finetune of
perplexity-ai/pplx-embed-v1-0.6b,
trained on the synth-4.0 wafer-domain dataset. Bundled for on-device
inference on Android (XNNPACK CPU delegate) and other LiteRT-compatible
runtimes.
pplx-embed-v1-0.6b is structurally Qwen3-0.6B with bidirectional
self-attention + mean pooling. Each artifact here is a self-contained
frozen graph: encoder body, mean-pool over attention_mask, and L2
normalization are all baked in.
License inherits from upstream perplexity-ai/pplx-embed-v1-0.6b.
Verify against the upstream model card if relicensing matters.
## Files
Seven .tflite files at a single quantization (dynamic_int8), one per sequence length:
| seq_len | quant | size | file | intended backend |
|---|---|---|---|---|
| 128 | dynamic_int8 | 576 MB | synth-4.0-pplx_seq128_int8.tflite | LiteRT CPU (XNNPACK) |
| 256 | dynamic_int8 | 576 MB | synth-4.0-pplx_seq256_int8.tflite | LiteRT CPU (XNNPACK) |
| 512 | dynamic_int8 | 576 MB | synth-4.0-pplx_seq512_int8.tflite | LiteRT CPU (XNNPACK) |
| 1024 | dynamic_int8 | 576 MB | synth-4.0-pplx_seq1024_int8.tflite | LiteRT CPU (XNNPACK) |
| 2048 | dynamic_int8 | 576 MB | synth-4.0-pplx_seq2048_int8.tflite | LiteRT CPU (XNNPACK) |
| 4096 | dynamic_int8 | 576 MB | synth-4.0-pplx_seq4096_int8.tflite | LiteRT CPU (XNNPACK) |
| 8192 | dynamic_int8 | 576 MB | synth-4.0-pplx_seq8192_int8.tflite | LiteRT CPU (XNNPACK) |
seq_len is baked into each graph — pick the smallest variant ≥ your
tokenized input length and right-pad up to that. Sizes are constant
across seqlens because the bidirectional mask is built at runtime
inside the graph (no baked seqlen-dependent buffers).
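For concreteness, a minimal variant-selection sketch. File names follow the table above and the pad id is the Qwen `<|endoftext|>` token noted under the inference notes below; the helper name is ours, not part of any shipped API.

```python
from transformers import AutoTokenizer

# Exported sequence lengths and pad id, per this card (helper is illustrative only).
SEQ_LENS = [128, 256, 512, 1024, 2048, 4096, 8192]
PAD_ID = 151643  # Qwen <|endoftext|>

tok = AutoTokenizer.from_pretrained(".")

def pick_variant(text):
    ids = tok(text)["input_ids"]
    # Smallest exported variant that fits; fall back to 8192 (and truncate) if longer.
    seq_len = next((s for s in SEQ_LENS if s >= len(ids)), SEQ_LENS[-1])
    ids = ids[:seq_len]
    pad = seq_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * pad
    input_ids = ids + [PAD_ID] * pad          # right-pad with <|endoftext|>
    return f"synth-4.0-pplx_seq{seq_len}_int8.tflite", input_ids, attention_mask
```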
dynamic_int8 = weight-quantized int8 matmuls via XNNPACK. int8 is the
right deployment quant for phone CPU inference; fp32 exports were used at
conversion time for numerics validation against the upstream sentence-
transformers reference and were not shipped (they would add ~16 GB of
duplicated weights with no inference advantage on phone). The conversion
script (linked in Provenance) regenerates fp32 if needed.
## Numerics validation
Validated against the upstream sentence-transformers reference loaded in fp32 (sdpa attention, trust_remote_code=True), over a 6-string suite:
| artifact | mean cosine | min cosine | threshold |
|---|---|---|---|
| ..._seqN_int8.tflite | ≈ 0.99 | ≥ 0.98 | 0.98 |
The fp32 export was bit-equivalent to the reference (cosine = 1.0) at conversion time, then discarded. The int8 artifacts show ~1-2% cosine drift but remain decision-equivalent on retrieval; a full post-quantization eval is still pending.
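For reference, a minimal sketch of that cosine check, assuming the upstream model id and that the .tflite embeddings are produced as in the usage snippet further down (the actual validation harness is not shipped here):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# fp32 reference; trust_remote_code pulls in the upstream bidirectional/mean-pool code.
ref = SentenceTransformer("perplexity-ai/pplx-embed-v1-0.6b", trust_remote_code=True)

def cosine_check(texts, tflite_embs, threshold=0.98):
    """texts: list of strings; tflite_embs: [N, 1024] L2-normalized .tflite outputs."""
    ref_embs = ref.encode(list(texts), normalize_embeddings=True)
    cos = np.sum(ref_embs * np.asarray(tflite_embs), axis=-1)  # dot = cosine (both normalized)
    return float(cos.mean()), float(cos.min()), bool(cos.min() >= threshold)
```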
## Architecture details
pplx-embed-v1-0.6b is Qwen3-0.6B with the causal mask flipped to bidirectional + mean pooling head. 28 layers, hidden=1024, heads=16/kv-heads=8, head_dim=128, RoPE θ=1e6. Vocab=151936 (Qwen tokenizer, BPE).
Consequences baked into the exported graph:
- Bidirectional self-attention: the exported mask is the pad-key suppression only (no causal triangle term).
- Mean pool over `attention_mask`; `include_prompt=true` per the upstream config (no prompt prefix, so everything counts). A reference sketch of this head follows the list.
- L2 normalize.
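A numpy sketch of that pooling head, for readers porting the bridge to another runtime (in the shipped artifacts this is already baked into the graph):

```python
import numpy as np

def pool_and_normalize(hidden_states, attention_mask):
    """hidden_states: [1, seq_len, 1024]; attention_mask: [1, seq_len] (1 = real token, 0 = pad)."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)     # [1, seq_len, 1]
    summed = (hidden_states * mask).sum(axis=1)                      # pad positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # number of real tokens
    pooled = summed / counts                                         # mean pool over real tokens
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)   # L2 normalize -> [1, 1024]
```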
Tensor shapes (all variants):
- Input `input_ids`: `[1, seq_len]`, int64
- Input `attention_mask`: `[1, seq_len]`, int64
- Output: `[1, 1024]`, float32 (L2-normalized)
## Inference notes for the bridge
- No prompt prefix: pplx is vendor-trained without instruction prompts, and adding a prefix hurts. Feed the raw text.
- Right-pad with the Qwen `<|endoftext|>` token (id 151643).
## Reference Python usage
```python
import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512  # must match the _seq{N}_ variant you load

tok = AutoTokenizer.from_pretrained(".")  # tokenizer files shipped alongside the .tflite
interp = Interpreter(model_path=f"synth-4.0-pplx_seq{SEQ_LEN}_int8.tflite")
interp.allocate_tensors()
in_details = interp.get_input_details()   # exported order: [input_ids, attention_mask]
out_details = interp.get_output_details()

text = "was my coffee spending more or less in november?"
enc = tok(text, padding="max_length", truncation=True,
          max_length=SEQ_LEN, return_tensors="np")

interp.set_tensor(in_details[0]["index"], enc["input_ids"].astype(np.int64))
interp.set_tensor(in_details[1]["index"], enc["attention_mask"].astype(np.int64))
interp.invoke()

emb = interp.get_tensor(out_details[0]["index"])  # [1, 1024], L2-normalized
```
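Since the output is already L2-normalized, cosine similarity between two embeddings is just a dot product. A small follow-up reusing the objects above (the `embed` helper name is ours, not part of any shipped API):

```python
def embed(text):
    enc = tok(text, padding="max_length", truncation=True,
              max_length=SEQ_LEN, return_tensors="np")
    interp.set_tensor(in_details[0]["index"], enc["input_ids"].astype(np.int64))
    interp.set_tensor(in_details[1]["index"], enc["attention_mask"].astype(np.int64))
    interp.invoke()
    return interp.get_tensor(out_details[0]["index"])[0]  # [1024]

# Dot product of two L2-normalized vectors = cosine similarity.
score = float(np.dot(embed("coffee spend in november"),
                     embed("how much did I spend on coffee last month?")))
```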
## Provenance
- Upstream base: perplexity-ai/pplx-embed-v1-0.6b (Qwen3-0.6B with bidirectional attention + mean pool)
- Training pipeline: synth-4.0 (cached MNR, lr=3e-5, batch=64, 500 steps, 6× 4090 DDP, gather_across_devices=True)
- Quality leaderboard: NDCG@10 = 0.4957, Recall@10 = 0.5491 (#1 of 21 in retrieval-20260427, beating cohere-embed-v4 and gemini-embedding)
- Conversion script: `on-device/conversion/convert_pplx_embed.py` on the project-switchboard repo
- Conversion env: litert-torch 0.8.0, transformers 5.5.4, torch 2.9.1+cu128, Python 3.11