synth-4.0-pplx → LiteRT

LiteRT (.tflite) exports of a finetune of perplexity-ai/pplx-embed-v1-0.6b, trained on the synth-4.0 wafer-domain dataset. Bundled for on-device inference on Android (XNNPACK CPU delegate) and other LiteRT-compatible runtimes.

pplx-embed-v1-0.6b is structurally Qwen3-0.6B with bidirectional self-attention + mean pooling. Each artifact here is a self-contained frozen graph: encoder body, mean-pool over attention_mask, and L2 normalization are all baked in.

License inherits from upstream perplexity-ai/pplx-embed-v1-0.6b. Verify against the upstream model card if relicensing matters.

Files

Seven .tflite files, one quantization (dynamic_int8), one per sequence length:

seq_len  quant         size    file                                intended backend
128      dynamic_int8  576 MB  synth-4.0-pplx_seq128_int8.tflite   LiteRT CPU (XNNPACK)
256      dynamic_int8  576 MB  synth-4.0-pplx_seq256_int8.tflite   LiteRT CPU (XNNPACK)
512      dynamic_int8  576 MB  synth-4.0-pplx_seq512_int8.tflite   LiteRT CPU (XNNPACK)
1024     dynamic_int8  576 MB  synth-4.0-pplx_seq1024_int8.tflite  LiteRT CPU (XNNPACK)
2048     dynamic_int8  576 MB  synth-4.0-pplx_seq2048_int8.tflite  LiteRT CPU (XNNPACK)
4096     dynamic_int8  576 MB  synth-4.0-pplx_seq4096_int8.tflite  LiteRT CPU (XNNPACK)
8192     dynamic_int8  576 MB  synth-4.0-pplx_seq8192_int8.tflite  LiteRT CPU (XNNPACK)

seq_len is baked into each graph: pick the smallest variant whose length is ≥ your tokenized input length and right-pad to it (a selection sketch follows). Sizes are constant across seq lens because the bidirectional mask is built at runtime inside the graph (no baked seqlen-dependent buffers).
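A minimal selection sketch, assuming the file names from the table above; the helper name pick_variant is hypothetical:

SEQ_LENS = (128, 256, 512, 1024, 2048, 4096, 8192)

def pick_variant(n_tokens: int) -> str:
    # Smallest baked seq_len that fits the tokenized input; right-pad up to it.
    for s in SEQ_LENS:
        if n_tokens <= s:
            return f"synth-4.0-pplx_seq{s}_int8.tflite"
    raise ValueError(f"{n_tokens} tokens exceeds the longest variant (8192)")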

dynamic_int8 = weight-quantized int8 matmuls via XNNPACK. int8 is the right deployment quant for phone CPU inference. fp32 exports were used at conversion time for numerics validation against the upstream sentence-transformers reference and were not shipped: they would add ~16 GB of duplicated weights with no inference advantage on phone. The conversion script (linked in Provenance) regenerates the fp32 exports if needed.

Numerics validation

Validated against the upstream sentence-transformers reference loaded in fp32 (sdpa attention, trust_remote_code=True), over a 6-string suite:

artifact                mean cosine   min cosine   threshold
..._seqN_int8.tflite    ≈ 0.99        ≥ 0.98       0.98

(The fp32 export was bit-equivalent, cos = 1.0, at conversion time, then discarded.) The int8 exports show ~1-2% cosine drift but remain decision-equivalent on retrieval; a dedicated post-quantization retrieval eval is still pending.
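A minimal sketch of this check, assuming the upstream reference loads through sentence-transformers; the suite string here is illustrative, not the actual 6-string suite:

import numpy as np
from sentence_transformers import SentenceTransformer

suite = ["was my coffee spending more or less in november?"]  # illustrative

# Upstream fp32 reference (assumes it loads via SentenceTransformer).
ref = SentenceTransformer("perplexity-ai/pplx-embed-v1-0.6b", trust_remote_code=True)
ref_emb = ref.encode(suite, normalize_embeddings=True)  # [N, 1024]

def cosine_report(lite_emb, threshold=0.98):
    # lite_emb: [N, 1024] embeddings from the .tflite artifact,
    # produced as in "Reference Python usage" below.
    cos = np.sum(lite_emb * ref_emb, axis=-1)  # both sides are unit-norm
    print(f"mean={cos.mean():.4f} min={cos.min():.4f} pass={cos.min() >= threshold}")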

Architecture details

pplx-embed-v1-0.6b is Qwen3-0.6B with the causal mask flipped to bidirectional, plus a mean-pooling head. 28 layers, hidden=1024, 16 attention heads / 8 KV heads, head_dim=128, RoPE θ=1e6. Vocab=151936 (Qwen BPE tokenizer).

Consequences baked into the exported graph:

  • Bidirectional self-attention — exported mask is the pad-key suppression only (no causal triangle term).
  • Mean pool over attention_mask, with include_prompt=true per the upstream config (no prompt prefix, so every non-pad token counts).
  • L2 normalize. (A numpy sketch of this pooling + normalization head follows the list.)
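A minimal numpy sketch of what the baked-in head computes from the encoder's final hidden states; this mirrors the in-graph ops, it is not a separate step you need to run:

import numpy as np

def pooled_embedding(hidden, attention_mask):
    # hidden: [1, seq_len, 1024] float32; attention_mask: [1, seq_len] int64
    mask = attention_mask[..., None].astype(np.float32)        # [1, seq_len, 1]
    summed = (hidden * mask).sum(axis=1)                       # sum over non-pad tokens
    mean = summed / np.clip(mask.sum(axis=1), 1e-9, None)      # masked mean pool
    return mean / np.linalg.norm(mean, axis=-1, keepdims=True) # L2 normalize -> [1, 1024]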

Tensor shapes (all variants):

  • Input input_ids: [1, seq_len] int64
  • Input attention_mask: [1, seq_len] int64
  • Output: [1, 1024] float32 (L2-normalized)

Inference notes for the bridge

No prompt prefix. pplx is vendor-trained without instruction prompts; adding a prefix hurts. Just feed the raw text.

Right-pad with the Qwen <|endoftext|> token (id=151643); a manual padding sketch follows.
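If you tokenize without the tokenizer's max_length padding, a minimal manual right-pad, assuming the pad id above and a variant length picked as in Files:

import numpy as np

PAD_ID = 151643  # Qwen <|endoftext|>

def right_pad(ids, seq_len):
    ids = ids[:seq_len]  # truncate if over the variant length
    pad = seq_len - len(ids)
    input_ids = np.array([ids + [PAD_ID] * pad], dtype=np.int64)
    attention_mask = np.array([[1] * len(ids) + [0] * pad], dtype=np.int64)
    return input_ids, attention_mask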

Reference Python usage

import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512

tok = AutoTokenizer.from_pretrained(".")  # tokenizer files ship in this repo
interp = Interpreter(model_path=f"synth-4.0-pplx_seq{SEQ_LEN}_int8.tflite")
interp.allocate_tensors()
# Assumes input 0 is input_ids and input 1 is attention_mask; verify the
# order with get_input_details() on your export.
in_details = interp.get_input_details()
out_details = interp.get_output_details()

text = "was my coffee spending more or less in november?"
enc = tok(text, padding="max_length", truncation=True,
          max_length=SEQ_LEN, return_tensors="np")

interp.set_tensor(in_details[0]["index"], enc["input_ids"].astype(np.int64))
interp.set_tensor(in_details[1]["index"], enc["attention_mask"].astype(np.int64))
interp.invoke()
emb = interp.get_tensor(out_details[0]["index"])  # [1, 1024]
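Since outputs are L2-normalized, cosine similarity is a plain dot product; a quick follow-on using emb from above (the second string is illustrative):

enc2 = tok("november coffee expenses", padding="max_length",
           truncation=True, max_length=SEQ_LEN, return_tensors="np")
interp.set_tensor(in_details[0]["index"], enc2["input_ids"].astype(np.int64))
interp.set_tensor(in_details[1]["index"], enc2["attention_mask"].astype(np.int64))
interp.invoke()
emb2 = interp.get_tensor(out_details[0]["index"])

score = float(emb @ emb2.T)  # cosine similarity, since both are unit-norm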

Provenance

  • Upstream base: perplexity-ai/pplx-embed-v1-0.6b (Qwen3-0.6B with bidirectional attention + mean pool)
  • Training pipeline: synth-4.0 (cached MNR, lr=3e-5, batch=64, 500 steps, 6× 4090 DDP, gather_across_devices=True)
  • Quality leaderboard: NDCG@10 = 0.4957, Recall@10 = 0.5491 (#1 of 21 in retrieval-20260427, beating cohere-embed-v4 and gemini-embedding)
  • Conversion script: on-device/conversion/convert_pplx_embed.py on the project-switchboard repo
  • Conversion env: litert-torch 0.8.0, transformers 5.5.4, torch 2.9.1+cu128, Python 3.11