synth-4.0-pplx → LiteRT

LiteRT (.tflite) exports of a finetune of perplexity-ai/pplx-embed-v1-0.6b, trained on the synth-4.0 wafer-domain dataset. Bundled for on-device inference on Android (XNNPACK CPU delegate) and other LiteRT-compatible runtimes.

pplx-embed-v1-0.6b is structurally Qwen3-0.6B with bidirectional self-attention + mean pooling. Each artifact here is a self-contained frozen graph: encoder body, mean-pool over attention_mask, and L2 normalization are all baked in.

License inherits from upstream perplexity-ai/pplx-embed-v1-0.6b. Verify against the upstream model card if relicensing matters.

Files

Seven .tflite files, one quantization (dynamic_int8), one per sequence length:

seq_len  quant         size    file                                intended backend
128      dynamic_int8  576 MB  synth-4.0-pplx_seq128_int8.tflite   LiteRT CPU (XNNPACK)
256      dynamic_int8  576 MB  synth-4.0-pplx_seq256_int8.tflite   LiteRT CPU (XNNPACK)
512      dynamic_int8  576 MB  synth-4.0-pplx_seq512_int8.tflite   LiteRT CPU (XNNPACK)
1024     dynamic_int8  576 MB  synth-4.0-pplx_seq1024_int8.tflite  LiteRT CPU (XNNPACK)
2048     dynamic_int8  576 MB  synth-4.0-pplx_seq2048_int8.tflite  LiteRT CPU (XNNPACK)
4096     dynamic_int8  576 MB  synth-4.0-pplx_seq4096_int8.tflite  LiteRT CPU (XNNPACK)
8192     dynamic_int8  576 MB  synth-4.0-pplx_seq8192_int8.tflite  LiteRT CPU (XNNPACK)

seq_len is baked into each graph: pick the smallest variant whose length is ≥ your tokenized input length and right-pad to it (a selection sketch follows). Sizes are constant across seq lens because the bidirectional mask is built at runtime inside the graph (no baked seqlen-dependent buffers).
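A minimal selection sketch, assuming the file names from the table above; the helper name pick_variant is hypothetical:

SEQ_LENS = (128, 256, 512, 1024, 2048, 4096, 8192)

def pick_variant(n_tokens: int) -> str:
    # Smallest baked seq_len that fits the tokenized input; right-pad up to it.
    for s in SEQ_LENS:
        if n_tokens <= s:
            return f"synth-4.0-pplx_seq{s}_int8.tflite"
    raise ValueError(f"{n_tokens} tokens exceeds the longest variant (8192)")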

dynamic_int8 = weight-quantized int8 matmuls via XNNPACK. int8 is the right deployment quant for phone CPU inference. fp32 exports were used at conversion time for numerics validation against the upstream sentence-transformers reference and were not shipped: they would add ~16 GB of duplicated weights with no inference advantage on phone. The conversion script (linked in Provenance) regenerates the fp32 exports if needed.

Numerics validation

Validated against the upstream sentence-transformers reference loaded in fp32 (sdpa attention, trust_remote_code=True), over a 6-string suite:

artifact                mean cosine   min cosine   threshold
..._seqN_int8.tflite    ≈ 0.99        ≥ 0.98       0.98

(The fp32 export was bit-equivalent, cos = 1.0, at conversion time, then discarded.) The int8 exports show ~1-2% cosine drift but remain decision-equivalent on retrieval; a dedicated post-quantization retrieval eval is still pending.
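A minimal sketch of this check, assuming the upstream reference loads through sentence-transformers; the suite string here is illustrative, not the actual 6-string suite:

import numpy as np
from sentence_transformers import SentenceTransformer

suite = ["was my coffee spending more or less in november?"]  # illustrative

# Upstream fp32 reference (assumes it loads via SentenceTransformer).
ref = SentenceTransformer("perplexity-ai/pplx-embed-v1-0.6b", trust_remote_code=True)
ref_emb = ref.encode(suite, normalize_embeddings=True)  # [N, 1024]

def cosine_report(lite_emb, threshold=0.98):
    # lite_emb: [N, 1024] embeddings from the .tflite artifact,
    # produced as in "Reference Python usage" below.
    cos = np.sum(lite_emb * ref_emb, axis=-1)  # both sides are unit-norm
    print(f"mean={cos.mean():.4f} min={cos.min():.4f} pass={cos.min() >= threshold}")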

Architecture details

pplx-embed-v1-0.6b is Qwen3-0.6B with the causal mask flipped to bidirectional, plus a mean-pooling head. 28 layers, hidden=1024, 16 attention heads / 8 KV heads, head_dim=128, RoPE θ=1e6. Vocab=151936 (Qwen BPE tokenizer).

Consequences baked into the exported graph:

  • Bidirectional self-attention — exported mask is the pad-key suppression only (no causal triangle term).
  • Mean pool over attention_mask, with include_prompt=true per the upstream config (no prompt prefix, so every non-pad token counts).
  • L2 normalize. (A numpy sketch of this pooling + normalization head follows the list.)
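A minimal numpy sketch of what the baked-in head computes from the encoder's final hidden states; this mirrors the in-graph ops, it is not a separate step you need to run:

import numpy as np

def pooled_embedding(hidden, attention_mask):
    # hidden: [1, seq_len, 1024] float32; attention_mask: [1, seq_len] int64
    mask = attention_mask[..., None].astype(np.float32)        # [1, seq_len, 1]
    summed = (hidden * mask).sum(axis=1)                       # sum over non-pad tokens
    mean = summed / np.clip(mask.sum(axis=1), 1e-9, None)      # masked mean pool
    return mean / np.linalg.norm(mean, axis=-1, keepdims=True) # L2 normalize -> [1, 1024]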

Tensor shapes (all variants):

  • Input input_ids: [1, seq_len] int64
  • Input attention_mask: [1, seq_len] int64
  • Output: [1, 1024] float32 (L2-normalized)

Inference notes for the bridge

No prompt prefix. pplx is vendor-trained without instruction prompts; adding a prefix hurts. Just feed the raw text.

Right-pad with the Qwen <|endoftext|> token (id=151643); a manual padding sketch follows.
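If you tokenize without the tokenizer's max_length padding, a minimal manual right-pad, assuming the pad id above and a variant length picked as in Files:

import numpy as np

PAD_ID = 151643  # Qwen <|endoftext|>

def right_pad(ids, seq_len):
    ids = ids[:seq_len]  # truncate if over the variant length
    pad = seq_len - len(ids)
    input_ids = np.array([ids + [PAD_ID] * pad], dtype=np.int64)
    attention_mask = np.array([[1] * len(ids) + [0] * pad], dtype=np.int64)
    return input_ids, attention_mask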

Reference Python usage

import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512

tok = AutoTokenizer.from_pretrained(".")  # tokenizer files ship in this repo
interp = Interpreter(model_path=f"synth-4.0-pplx_seq{SEQ_LEN}_int8.tflite")
interp.allocate_tensors()
# Assumes input 0 is input_ids and input 1 is attention_mask; verify the
# order with get_input_details() on your export.
in_details = interp.get_input_details()
out_details = interp.get_output_details()

text = "was my coffee spending more or less in november?"
enc = tok(text, padding="max_length", truncation=True,
          max_length=SEQ_LEN, return_tensors="np")

interp.set_tensor(in_details[0]["index"], enc["input_ids"].astype(np.int64))
interp.set_tensor(in_details[1]["index"], enc["attention_mask"].astype(np.int64))
interp.invoke()
emb = interp.get_tensor(out_details[0]["index"])  # [1, 1024]
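Since outputs are L2-normalized, cosine similarity is a plain dot product; a quick follow-on using emb from above (the second string is illustrative):

enc2 = tok("november coffee expenses", padding="max_length",
           truncation=True, max_length=SEQ_LEN, return_tensors="np")
interp.set_tensor(in_details[0]["index"], enc2["input_ids"].astype(np.int64))
interp.set_tensor(in_details[1]["index"], enc2["attention_mask"].astype(np.int64))
interp.invoke()
emb2 = interp.get_tensor(out_details[0]["index"])

score = float(emb @ emb2.T)  # cosine similarity, since both are unit-norm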

Provenance

  • Upstream base: perplexity-ai/pplx-embed-v1-0.6b (Qwen3-0.6B with bidirectional attention + mean pool)
  • Training pipeline: synth-4.0 (cached MNR, lr=3e-5, batch=64, 500 steps, 6× 4090 DDP, gather_across_devices=True)
  • Quality leaderboard: NDCG@10 = 0.4957, Recall@10 = 0.5491 (#1 of 21 in retrieval-20260427, beating cohere-embed-v4 and gemini-embedding)
  • Conversion script: on-device/conversion/convert_pplx_embed.py on the project-switchboard repo
  • Conversion env: litert-torch 0.8.0, transformers 5.5.4, torch 2.9.1+cu128, Python 3.11