Embeddings are the core of OpenGround’s semantic search. They transform text chunks into numerical vectors that capture meaning, enabling the system to find relevant documentation even when query words don’t exactly match.

What Are Embeddings?

An embedding is a dense vector representation of text. Similar concepts have vectors that are close together in high-dimensional space.
# Example (simplified)
"how to install package"  → [0.23, -0.15, 0.87, ...] (384 dimensions)
"package installation"    → [0.25, -0.14, 0.89, ...] (similar vector)
"weather forecast"        → [-0.42, 0.76, -0.12, ...] (different vector)
OpenGround uses cosine similarity to measure how close vectors are:
  • Score near 1.0 = very similar meaning
  • Score near 0.0 = unrelated
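Cosine similarity is simple to compute by hand. A minimal stdlib-only sketch (the function name and toy 3-dimensional vectors are illustrative, not OpenGround's actual code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the
    vector magnitudes. Ranges from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real 384-dim embeddings
install = [0.23, -0.15, 0.87]
installation = [0.25, -0.14, 0.89]
weather = [-0.42, 0.76, -0.12]

print(cosine_similarity(install, installation))  # close to 1.0
print(cosine_similarity(install, weather))       # negative: unrelated
```

In practice both backends L2-normalize their output vectors, in which case cosine similarity reduces to a plain dot product.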

Embedding Backends

OpenGround supports two embedding backends with different trade-offs:

FastEmbed

Default - Lightweight, ONNX-based
  • Smaller install size
  • CPU-optimized by default
  • Optional GPU support (experimental)
  • Fastest for CPU inference

Sentence-Transformers

Full-featured - PyTorch-based
  • Larger install size
  • Automatic GPU/MPS detection
  • Better GPU performance
  • More model options

Backend Selection

From config.py:62, the default backend:
DEFAULT_EMBEDDING_BACKEND = "fastembed"
Change it with:
openground config set embeddings.embedding_backend sentence-transformers

FastEmbed Backend

FastEmbed uses ONNX Runtime for inference (from embeddings.py:93-116):
@lru_cache(maxsize=1)
def get_fastembed_model(model_name: str, use_cuda: bool = True):
    """Get a cached instance of TextEmbedding (fastembed)."""
    from fastembed import TextEmbedding
    
    if use_cuda:
        try:
            return TextEmbedding(
                model_name=model_name,
                providers=["CUDAExecutionProvider"],
            )
        except ValueError:
            check_gpu_compatibility()
    
    return TextEmbedding(
        model_name=model_name,
        providers=["CPUExecutionProvider"],
    )
ONNX (Open Neural Network Exchange) is a portable model format; ONNX Runtime executes ONNX models with optimized kernels. FastEmbed converts PyTorch models to ONNX for faster CPU inference.

Installation Options

# Smallest install, CPU-only
uv tool install 'openground[fastembed]'
pip install 'openground[fastembed]'
GPU support requires matching CUDA drivers and cuDNN versions. See ONNX Runtime CUDA docs for requirements.

GPU Compatibility Check

OpenGround automatically detects GPU availability (from embeddings.py:44-90):
def check_gpu_compatibility() -> None:
    """Check for GPU compatibility and provide optimization tips."""
    gpu_hardware = is_gpu_hardware_available()  # nvidia-smi check
    
    # Check if fastembed-gpu is installed
    has_gpu_pkg = False
    try:
        version("fastembed-gpu")
        has_gpu_pkg = True
    except PackageNotFoundError:
        pass
    
    # Check for functional GPU via onnxruntime
    functional_gpu = False
    try:
        import onnxruntime as ort
        functional_gpu = "CUDAExecutionProvider" in ort.get_available_providers()
    except ImportError:
        pass
    
    # Provide helpful hints
    if gpu_hardware and not has_gpu_pkg:
        hint("GPU detected! Install the GPU version for faster performance:")
        hint("   uv tool install 'openground[fastembed-gpu]'\n")
    
    elif gpu_hardware and has_gpu_pkg and not functional_gpu:
        error("GPU package is installed but CUDA is not functional.")
        # Suggest fixes...

Sentence-Transformers Backend

Sentence-Transformers uses PyTorch with automatic hardware acceleration (from embeddings.py:14-25):
@lru_cache(maxsize=1)
def get_st_model(model_name: str):
    """Get a cached instance of SentenceTransformer."""
    from sentence_transformers import SentenceTransformer
    
    # Automatically uses:
    # - CUDA on NVIDIA GPUs
    # - MPS on Apple Silicon
    # - CPU otherwise
    return SentenceTransformer(model_name)

Installation

# Automatically uses GPU if available
uv tool install openground
pip install openground
The default openground package includes sentence-transformers with automatic GPU/MPS/CPU support. This is the easiest option if you have a GPU.

Embedding Models

From config.py:58-60, the default model:
DEFAULT_EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
DEFAULT_EMBEDDING_DIMENSIONS = 384

Why BGE-Small-EN-v1.5?

  • English-focused: Strong performance on English text (it is an English-only model)
  • Compact: 384 dimensions (vs 768 for larger models)
  • Fast: Smaller vectors = faster search
  • Quality: Strong performance on MTEB benchmarks

Changing Models

Important: Changing the embedding model requires re-embedding all documentation. Models are not compatible with each other.

Step 1: Choose a Model

Browse models on Hugging Face. Look for:
  • Dimensions: 384-768 (smaller = faster)
  • Language: Match your docs (multilingual, en, etc.)
  • Size: Smaller models = faster inference

Step 2: Update Configuration

# Set model and dimensions
openground config set embeddings.embedding_model "sentence-transformers/all-MiniLM-L6-v2"
openground config set embeddings.embedding_dimensions 384

Step 3: Delete Existing Embeddings

# Remove all embedded data
openground nuke embeddings
This deletes the LanceDB table but preserves raw documentation.

Step 4: Re-embed Documentation

# Re-embed all libraries
openground embed
This processes all raw data with the new model.

Model Compatibility Validation

OpenGround stores embedding metadata in the LanceDB schema (from ingest.py:159-177):
schema = pa.schema(
    [
        # ... fields ...
        pa.field("vector", pa.list_(pa.float32(), embedding_dimensions)),
    ],
    metadata={
        "embedding_backend": embedding_backend,
        "embedding_model": embedding_model,
    }
)
When adding new documentation, OpenGround validates the model matches (from ingest.py:111-142):
def _validate_table_metadata(table: Table, backend: str, model: str) -> None:
    stored_metadata = _get_table_metadata(table)
    stored_backend = stored_metadata["embedding_backend"]
    stored_model = stored_metadata["embedding_model"]
    
    if stored_backend != backend or stored_model != model:
        raise ValueError(
            f"Embedding configuration mismatch detected!\n\n"
            f"This table was created with:\n"
            f"  Backend: {stored_backend}\n"
            f"  Model: {stored_model}\n\n"
            f"Current configuration is:\n"
            f"  Backend: {backend}\n"
            f"  Model: {model}\n\n"
            f"To resolve this, you can:\n"
            f"  1. Change your config to match the table's original settings\n"
            f"  2. Run `openground nuke embeddings` and then `openground embed`\n"
        )
This prevents mixing embeddings from different models, which would break search quality.

Embedding Generation

From embeddings.py:207-234, the main generation function:
def generate_embeddings(
    texts: Iterable[str],
    show_progress: bool = True,
) -> list[list[float]]:
    """Generate embeddings for documents using the specified backend."""
    
    config = get_effective_config()
    backend = config["embeddings"]["embedding_backend"]
    
    if backend == "fastembed":
        return _generate_embeddings_fastembed(texts, show_progress)
    elif backend == "sentence-transformers":
        return _generate_embeddings_sentence_transformers(texts, show_progress)
    else:
        raise ValueError(f"Invalid embedding backend: {backend}")

Batch Processing

Both backends process embeddings in batches for efficiency (from config.py:65):
DEFAULT_BATCH_SIZE = 32
From embeddings.py:119-160 (sentence-transformers example):
def _generate_embeddings_sentence_transformers(
    texts: Iterable[str],
    show_progress: bool = True,
) -> list[list[float]]:
    config = get_effective_config()
    batch_size = config["embeddings"]["batch_size"]  # 32
    model_name = config["embeddings"]["embedding_model"]
    model = get_st_model(model_name)
    
    texts_list = list(texts)
    all_embeddings = []
    
    with tqdm(total=len(texts_list), desc="Generating embeddings") as pbar:
        for i in range(0, len(texts_list), batch_size):
            batch = texts_list[i : i + batch_size]
            batch_embeddings = model.encode(
                sentences=batch,
                batch_size=len(batch),
                normalize_embeddings=True,  # L2 normalization
                convert_to_numpy=True,
                show_progress_bar=False,
            )
            all_embeddings.extend(list(batch_embeddings))
            pbar.update(len(batch))
    
    return all_embeddings
Increase batch_size if you have a GPU with lots of VRAM:
openground config set embeddings.batch_size 64

FastEmbed Passage Embedding

FastEmbed distinguishes between passage (document) and query embeddings (from embeddings.py:163-204):
def _generate_embeddings_fastembed(
    texts: Iterable[str],
    show_progress: bool = True,
) -> list[list[float]]:
    # ...
    model = get_fastembed_model(model_name)
    
    for i in range(0, len(texts_list), batch_size):
        batch = texts_list[i : i + batch_size]
        # Use passage_embed for document chunks
        batch_embeddings = list(model.passage_embed(batch))
        all_embeddings.extend([emb.tolist() for emb in batch_embeddings])
Some models are trained differently for documents vs. queries. FastEmbed uses passage_embed() for document chunks and would use query_embed() for search queries (though OpenGround currently uses passage_embed for both).

Embedding Dimensions

From config.py:60:
DEFAULT_EMBEDDING_DIMENSIONS = 384
Dimension count affects storage size and search speed: each vector occupies dimensions × 4 bytes (float32).
# 384 dimensions
384 × 4 bytes = 1,536 bytes per vector

# 768 dimensions (larger model)
768 × 4 bytes = 3,072 bytes per vector (2x storage)

# For 10,000 chunks:
384-dim: ~15 MB
768-dim: ~30 MB
Rule of thumb: Stick with the model’s native dimensions. Don’t try to change dimensions independently from the model.
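The arithmetic above can be wrapped in a quick estimator (the helper name is illustrative):

```python
def embedding_storage_bytes(num_chunks: int, dimensions: int) -> int:
    """Raw vector storage: each float32 component takes 4 bytes."""
    return num_chunks * dimensions * 4

for dims in (384, 768):
    mb = embedding_storage_bytes(10_000, dims) / 1024 / 1024
    print(f"{dims}-dim, 10k chunks: {mb:.1f} MB")
# 384-dim: ~14.6 MB (the "~15 MB" above); 768-dim: exactly double
```

This counts only the raw vectors; LanceDB adds some index and metadata overhead on top.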

Configuration Examples

Optimal CPU Performance

# Lightweight FastEmbed with small model
openground config set embeddings.embedding_backend fastembed
openground config set embeddings.embedding_model "BAAI/bge-small-en-v1.5"
openground config set embeddings.embedding_dimensions 384
openground config set embeddings.batch_size 32

GPU Performance (NVIDIA)

# Sentence-Transformers with larger model
openground config set embeddings.embedding_backend sentence-transformers
openground config set embeddings.embedding_model "BAAI/bge-base-en-v1.5"
openground config set embeddings.embedding_dimensions 768
openground config set embeddings.batch_size 64  # Larger batches for GPU

Apple Silicon (M1/M2/M3)

# Sentence-Transformers with MPS acceleration
openground config set embeddings.embedding_backend sentence-transformers
openground config set embeddings.embedding_model "BAAI/bge-small-en-v1.5"
openground config set embeddings.embedding_dimensions 384
openground config set embeddings.batch_size 32
Apple Silicon automatically uses MPS (Metal Performance Shaders) via sentence-transformers. No special configuration needed.

Chunking Strategy

Before embedding, documents are split into chunks (from config.py:66-67):
DEFAULT_CHUNK_SIZE = 800
DEFAULT_CHUNK_OVERLAP = 200
From ingest.py:52-76, using LangChain’s text splitter:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(page: ParsedPage) -> list[dict]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=200
    )
    chunks = splitter.split_text(page["content"])
    
    for idx, chunk in enumerate(chunks):
        records.append({
            "content": chunk,
            "chunk_index": idx,
            # ... metadata ...
        })

Why 800 Characters?

  • Context window: Most embedding models handle 512 tokens well
  • 800 chars ≈ 200 tokens: Safe margin for tokenization
  • Not too small: Preserves context
  • Not too large: Enables precise retrieval

Why 200 Character Overlap?

Chunk 1: [------------------------]  (chars 0-800)
Chunk 2:              [------------------------]  (chars 600-1400)
                      ^200 overlap^
Overlap ensures:
  • Information spanning boundaries isn’t lost
  • Better retrieval for queries matching boundary content
  • 25% overlap provides good coverage without excessive duplication
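The overlap behavior can be sketched as a character-based sliding window. This is a simplification: OpenGround actually uses LangChain's RecursiveCharacterTextSplitter, which also tries to break on paragraph and sentence boundaries; `chunk_text` here is a hypothetical stand-in.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Sliding-window chunker: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so adjacent chunks share
    chunk_overlap characters."""
    step = chunk_size - chunk_overlap  # 600 with the defaults
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(1400))
chunks = chunk_text(doc)
print([len(c) for c in chunks])  # [800, 800] — chars 0-800 and 600-1400
```

The last 200 characters of chunk 1 equal the first 200 characters of chunk 2, so content near the boundary is embedded twice and retrievable from either chunk.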

Adjusting Chunking

# Larger chunks (more context, less precise)
openground config set embeddings.chunk_size 1200
openground config set embeddings.chunk_overlap 300

# Smaller chunks (more precise, less context)
openground config set embeddings.chunk_size 512
openground config set embeddings.chunk_overlap 128
Changing chunk settings requires re-embedding:
openground nuke embeddings
openground embed

Model Caching

Both backends use @lru_cache to load models once (from embeddings.py:14 and 93):
@lru_cache(maxsize=1)
def get_st_model(model_name: str):
    return SentenceTransformer(model_name)

@lru_cache(maxsize=1)
def get_fastembed_model(model_name: str, use_cuda: bool = True):
    return TextEmbedding(model_name=model_name, ...)
Models are:
  1. Downloaded from Hugging Face (first run)
  2. Cached locally in ~/.cache/huggingface/
  3. Loaded into memory once per process
  4. Reused for all embedding operations
The first run downloads the model (~100-500MB depending on the model). Subsequent runs load it from the local cache.
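The caching behavior is easy to demonstrate with a stdlib-only sketch, using a counter in place of the expensive model load (`get_model` is a stand-in for `get_st_model` / `get_fastembed_model`):

```python
from functools import lru_cache

load_count = 0

@lru_cache(maxsize=1)
def get_model(model_name: str) -> str:
    """Stand-in for a model loader: the expensive load runs only once
    per process for a given model_name."""
    global load_count
    load_count += 1
    return f"model({model_name})"

get_model("BAAI/bge-small-en-v1.5")
get_model("BAAI/bge-small-en-v1.5")  # cache hit, no reload
print(load_count)  # 1
```

Because maxsize=1, switching to a different model_name evicts the previously loaded model, so at most one model stays in memory per process.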

Performance Comparison

FastEmbed (default backend) — best for most users and CPU-only machines:
  • ~500 chunks/sec (CPU)
  • Lightweight install
  • Low memory usage
  • No GPU setup hassle
Performance varies by hardware. These are approximate estimates for the default model.
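At that rate, a rough wall-clock estimate for an embedding run (hypothetical helper, using the ~500 chunks/sec figure above):

```python
def embed_time_seconds(num_chunks: int, chunks_per_sec: float = 500.0) -> float:
    """Back-of-the-envelope embedding time at a given throughput."""
    return num_chunks / chunks_per_sec

print(f"{embed_time_seconds(10_000):.0f} s")  # ~20 s for 10k chunks on CPU
```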
