Bolt Embedding Models
Bolt Embedding is a family of high-performance embedding models optimized for
enterprise Retrieval-Augmented Generation (RAG).
These models are fine-tuned from IBM Granite embedding models and
are designed to produce strong semantic embeddings for knowledge
retrieval, search, and document understanding.
Bolt models map text (queries, sentences, or documents) into a dense vector space suitable for similarity search, clustering, and retrieval pipelines.
Model Overview
Bolt embeddings are purpose-built for enterprise RAG workloads, where retrieval quality and robustness across heterogeneous documents are critical.
Key design goals:
- Strong query → document retrieval quality
- Robust performance on long enterprise documents
- Optimized for large-scale vector search
- Trained using large-batch contrastive learning to replicate real RAG retrieval conditions
These models are fine-tuned from IBM Granite embedding models using contrastive training on RAG-style data.
Model Details
Model Type
Sentence Transformer embedding model
Base Model
Fine-tuned from:
- ibm-granite/granite-embedding-small-english-r2 (small variant)
- ibm-granite/granite-embedding-english-r2 (large variant)
Output
- Embedding dimension: 384 (small), 768 (large)
- Similarity metric: Cosine similarity
- Max sequence length: 4096 tokens
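Once a Bolt checkpoint is loaded with Sentence Transformers (see Usage below), these properties can be checked directly. The snippet is a small sketch and assumes the aisquared/bolt-embedding-small model ID from the Usage section.

```python
from sentence_transformers import SentenceTransformer

# Load the small variant; the large variant reports a 768-dimensional output.
model = SentenceTransformer("aisquared/bolt-embedding-small")

print(model.get_sentence_embedding_dimension())  # 384 for the small variant
print(model.max_seq_length)                      # 4096
print(model.similarity_fn_name)                  # "cosine"
```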
Architecture
```
SentenceTransformer(
  (0): Transformer(ModernBertModel)
  (1): Pooling(CLS)
)
```
Bolt uses CLS pooling to produce a single embedding vector per input.
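For illustration, the same two-module layout can be assembled by hand from Sentence Transformers building blocks. This is a sketch of the module configuration, not the actual Bolt initialization code, and it assumes the small Granite base model.

```python
from sentence_transformers import SentenceTransformer, models

# Transformer encoder followed by CLS pooling: only the [CLS] token embedding
# is kept as the sentence representation.
word_embedding_model = models.Transformer(
    "ibm-granite/granite-embedding-small-english-r2",
    max_seq_length=4096,
)
cls_pooling = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="cls",
)
model = SentenceTransformer(modules=[word_embedding_model, cls_pooling])
```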
Training Objective
Bolt embeddings are trained specifically for retrieval scenarios using contrastive learning.
Loss Function
CachedMultipleNegativesRankingLoss
This loss is widely used for training embedding models for retrieval tasks.
Key properties:
- Efficient training with very large effective batch sizes
- Uses in-batch negatives
- Encourages queries to be close to their relevant passages while far from irrelevant ones
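In Sentence Transformers this loss is exposed as `CachedMultipleNegativesRankingLoss`. The snippet below is a minimal sketch of how it is instantiated; the `mini_batch_size` value is illustrative, not the Bolt training setting.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("ibm-granite/granite-embedding-small-english-r2")

# The cached (GradCache-style) variant runs the forward pass in mini-batches
# while still using every other example in the full batch as an in-batch negative.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
```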
Large Batch Training
Bolt models were trained using batch sizes of 1024.
Large batches simulate realistic retrieval scenarios, where each training example effectively consists of:
- a query
- its positive document
- ~2000 unrelated in-batch documents, including hard negatives
This closely approximates production RAG retrieval environments, where each query must rank the correct document among many candidates.
The result is improved:
- retrieval accuracy
- semantic separation
- ranking robustness
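A training run with a 1024-example batch could be configured roughly as follows. This is a hedged sketch: the dataset ID, output path, and hyperparameters other than the batch size are placeholders, not the actual Bolt training configuration.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("ibm-granite/granite-embedding-small-english-r2")
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

# Placeholder dataset ID; expects anchor/positive/negative columns (see Training Data).
train_dataset = load_dataset("your-org/your-rag-triplets", split="train")

args = SentenceTransformerTrainingArguments(
    output_dir="bolt-embedding-small",  # placeholder output path
    per_device_train_batch_size=1024,   # each query is contrasted against ~2000 in-batch candidates
    num_train_epochs=1,                 # illustrative value
    learning_rate=2e-5,                 # illustrative value
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```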
Training Data
Training was performed on a custom dataset we collected. This dataset includes hand-curated examples as well as examples drawn from datasets with commercially acceptable licenses. To curate hard negatives for some examples, LLMs with commercially permissible licenses were used to generate negatives.
Dataset format:
| Column | Description |
|---|---|
| anchor | Query or input text |
| positive | Relevant document/passage |
| negative | Unrelated document/passage; some negatives were generated with LLMs to provide hard negatives, others were sampled at random from existing negatives |
Training size:
- 500,000 training samples
- 20,000 evaluation samples
The dataset contains a mixture of:
- question → answer pairs
- query → document matches
- semantic similarity examples
These samples are designed to mimic real RAG retrieval workloads.
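As an illustration of the expected column layout (not the actual training data), a tiny triplet dataset could be constructed like this, reusing the example sentences from the Usage section:

```python
from datasets import Dataset

# Toy anchor / positive / negative triplet in the training schema described above.
triplets = Dataset.from_dict({
    "anchor": ["What are the tax implications of employee stock options?"],
    "positive": ["Employee stock options may have tax consequences depending on exercise timing."],
    "negative": ["The Eiffel Tower is located in Paris."],
})
print(triplets)
```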
Intended Use
Bolt embeddings are designed for:
- Retrieval-Augmented Generation (RAG)
- Enterprise document search
- Semantic search
- Knowledge base retrieval
- Question answering
- Duplicate detection
- Similarity scoring
Typical pipeline:
```
User query
   ↓
Bolt embedding
   ↓
Vector search
   ↓
Top-k documents
   ↓
LLM generation
```
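A stripped-down version of the retrieval half of this pipeline might look like the following sketch; the in-memory cosine-similarity ranking stands in for a real vector database, and the document list is illustrative.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aisquared/bolt-embedding-small")

documents = [
    "Employee stock options may have tax consequences depending on exercise timing.",
    "Our vacation policy grants 20 days of paid leave per year.",
    "The Eiffel Tower is located in Paris.",
]
query = "What are the tax implications of employee stock options?"

# Embed the corpus and the query, then rank documents by cosine similarity.
doc_embeddings = model.encode(documents)
query_embedding = model.encode([query])
scores = model.similarity(query_embedding, doc_embeddings)[0]

top_k = 2
for idx in scores.argsort(descending=True)[:top_k].tolist():
    print(f"{float(scores[idx]):.3f}  {documents[idx]}")
```

The top-k documents retrieved this way would then be passed to the LLM as grounding context.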
Usage
Install Sentence Transformers:
```bash
pip install -U sentence-transformers
```
Load the Model
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aisquared/bolt-embedding-small")
```

or

```python
model = SentenceTransformer("aisquared/bolt-embedding-large")
```
Generate Embeddings
```python
sentences = [
    "What are the tax implications of employee stock options?",
    "Employee stock options may have tax consequences depending on exercise timing.",
    "The Eiffel Tower is located in Paris.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
```
Compute Similarity
```python
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```
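For larger corpora, the generic `semantic_search` utility from Sentence Transformers can handle the top-k lookup; this is standard library functionality rather than anything Bolt-specific, and it reuses the `embeddings` array from the previous step.

```python
from sentence_transformers import util

# Treat the first sentence as the query and rank all embeddings against it.
hits = util.semantic_search(embeddings[:1], embeddings, top_k=3)
print(hits[0])  # list of {"corpus_id": ..., "score": ...} dicts, best match first
```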
Why Bolt?
Many embedding models are trained on general semantic similarity tasks.
Bolt is optimized for enterprise retrieval, where queries must locate the correct information among thousands of unrelated documents.
Key differentiators:
- Large-batch contrastive training
- RAG-specific dataset
- Long-context support (trained with sequences up to 4096 tokens)
- Optimized for vector database retrieval
Framework Versions
Training was performed using:
- Python 3.12
- Sentence Transformers
- Transformers
- PyTorch
- Hugging Face Datasets
- Hugging Face Jobs (1× A100 GPU)
Citation
If you use Bolt embeddings in research or production systems, please cite the underlying Sentence-BERT and cached contrastive loss work below.
Sentence-BERT
```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title     = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author    = {Reimers, Nils and Gurevych, Iryna},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  year      = {2019}
}
```
Cached Multiple Negatives Ranking Loss
```bibtex
@misc{gao2021scaling,
  title  = {Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
  author = {Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
  year   = {2021}
}
```
License
Bolt embeddings are released under the AI Squared Community License.