OpenGround uses hybrid search combining two complementary techniques: vector similarity (semantic search) and BM25 (keyword search). This approach finds relevant documentation even when queries use different terminology than the source material.

Hybrid Search Architecture

Why Hybrid?

Vector search strengths:
  • Finds conceptually similar content
  • Works with synonyms and paraphrasing
  • Understands context and intent

Example:
Query: "how to install packages"
Matches:
- "Adding dependencies to your project"
- "Package management guide"
- "Setting up requirements.txt"

Vector search weaknesses:
  • May miss exact technical terms
  • Can retrieve overly broad matches

BM25 keyword search covers exactly those gaps: it rewards exact term matches, so combining the two techniques leaves neither blind spot uncovered.
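The trade-off above can be made concrete with a toy comparison. Everything here is illustrative: the synonym table is a crude hand-written stand-in for what real embeddings learn automatically.

```python
# Hypothetical synonym table: a stand-in for learned embedding similarity.
SYNONYMS = {
    "install": {"add", "adding", "setup"},
    "packages": {"dependencies", "requirements"},
}

def keyword_match(query: str, doc: str) -> bool:
    """Pure term overlap, the signal BM25 builds on (no scoring here)."""
    return bool(set(query.lower().split()) & set(doc.lower().split()))

def semantic_match(query: str, doc: str) -> bool:
    """Expand query terms with synonyms before matching."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return bool(terms & set(doc.lower().split()))

query = "install packages"
doc = "Adding dependencies to your project"
print(keyword_match(query, doc))   # → False: no shared term
print(semantic_match(query, doc))  # → True: synonyms bridge the vocabulary gap
```

Keyword matching fails because the query and title share no literal term; the "semantic" side succeeds for the same reason vector search does.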

Query Flow

Let’s trace a search query through OpenGround’s system.

1. Query Input

From the CLI or MCP server:
openground query "how to configure embeddings" -l openground -v latest
Or from an AI agent via MCP:
{
  "tool": "search_documentation",
  "arguments": {
    "query": "how to configure embeddings",
    "library_name": "openground",
    "version": "latest"
  }
}

2. Query Embedding

From query.py:94, the query is converted to a vector:
from openground.embeddings import generate_embeddings

query_vec = generate_embeddings([query], show_progress=show_progress)[0]
# Returns: [0.23, -0.15, 0.87, ...] (384 dimensions)
Query embedding uses the same model as document embedding, ensuring they’re in the same vector space.

3. Hybrid Search Execution

From query.py:96-104, LanceDB performs the hybrid search:
def search(
    query: str,
    version: str,
    db_path: Path = DEFAULT_DB_PATH,
    table_name: str = DEFAULT_TABLE_NAME,
    library_name: Optional[str] = None,
    top_k: int = 10,
    show_progress: bool = True,
) -> str:
    table = _get_table(db_path, table_name)
    query_vec = generate_embeddings([query], show_progress=show_progress)[0]
    
    # Build hybrid search
    search_builder = (
        table.search(query_type="hybrid")
        .text(query)        # BM25 component
        .vector(query_vec)  # Vector component
    )
    
    # Apply filters
    safe_version = _escape_sql_string(version)
    search_builder = search_builder.where(f"version = '{safe_version}'")
    
    if library_name:
        safe_name = _escape_sql_string(library_name)
        search_builder = search_builder.where(f"library_name = '{safe_name}'")
    
    # Execute and return top K
    results = search_builder.limit(top_k).to_list()
1. Query Type: Hybrid

search_builder = table.search(query_type="hybrid")
Tells LanceDB to combine vector and BM25 search.
2. BM25 Component

.text(query)
Performs keyword search using the full-text index on the content field.
3. Vector Component

.vector(query_vec)
Performs cosine similarity search in the vector space.
4. Metadata Filtering

.where(f"version = '{safe_version}'")
.where(f"library_name = '{safe_name}'")
Filters results to specific library/version before ranking.
5. Limit Results

.limit(top_k)
Returns only the top K highest-scoring chunks.

4. Result Ranking

LanceDB internally combines scores from both search types:
# Simplified conceptual model (actual implementation is in LanceDB)
for chunk in chunks:
    # Vector similarity (cosine)
    vector_score = cosine_similarity(query_vec, chunk.vector)
    
    # BM25 score
    bm25_score = bm25(query_text, chunk.content)
    
    # Combine (LanceDB performs score fusion internally)
    combined_score = merge_scores(vector_score, bm25_score)
    
    chunk.score = combined_score

# Sort by combined score and return top K
results = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
LanceDB’s hybrid search uses sophisticated score fusion techniques. The exact algorithm is internal to LanceDB.
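One widely used fusion technique is reciprocal rank fusion (RRF); LanceDB exposes an RRF-based reranker for hybrid search, though its internal default may differ. A minimal sketch of the idea, not OpenGround's actual code:

```python
def rrf_merge(vector_ranking, bm25_ranking, k=60):
    """Merge two ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in;
    k dampens the influence of top ranks (60 is a common default).
    """
    scores = {}
    for ranking in (vector_ranking, bm25_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" tops BM25, "a" tops vector search, "c" ranks high in both lists
print(rrf_merge(["a", "c", "b"], ["b", "c", "d"]))  # → ['b', 'c', 'a', 'd']
```

Note how "c", which appears near the top of both lists, outranks "a", which appears in only one: agreement between the two searches is rewarded.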

5. Result Formatting

From query.py:110-134, results are formatted for the user:
if not results:
    return "Found 0 matches."

lines = [f"Found {len(results)} match{'es' if len(results) != 1 else ''}."]
for idx, item in enumerate(results, start=1):
    title = item.get("title") or "(no title)"
    snippet = (item.get("content") or "").strip()
    source = item.get("url") or "unknown"
    item_version = item.get("version") or version
    score = item.get("_distance") or item.get("_score")
    
    score_str = ""
    if isinstance(score, (int, float)):
        score_str = f", score={score:.4f}"
    
    # Embed tool call hint for fetching full content
    tool_hint = json.dumps(
        {"tool": "get_full_content", "url": source, "version": item_version}
    )
    
    lines.append(
        f'{idx}. **{title}**: "{snippet}" (Source: {source}, Version: {item_version}{score_str})\n'
        f"   To get full page content: {tool_hint}"
    )

return "\n".join(lines)
Example output:
Found 3 matches.
1. **Configuration**: "OpenGround's behavior is controlled through a hierarchical..." (Source: https://github.com/user/repo/docs/config.md, Version: latest, score=0.8234)
   To get full page content: {"tool": "get_full_content", "url": "https://...", "version": "latest"}
2. **Embedding Settings**: "You can configure the embedding model and backend..." (Source: https://github.com/user/repo/docs/embeddings.md, Version: latest, score=0.7891)
   To get full page content: {"tool": "get_full_content", "url": "https://...", "version": "latest"}
...
BM25 Keyword Search

BM25 (Best Matching 25) is a probabilistic ranking function for keyword search.

BM25 Index Creation

From ingest.py:223-226, the full-text index is created after ingestion:
table.add(all_records)  # Add chunks with embeddings

try:
    table.create_fts_index("content", replace=True)
except Exception as exc:
    print(f"FTS index creation skipped: {exc}")
The content field is indexed for full-text search. This enables BM25 scoring on chunk text.

How BM25 Works

BM25 ranks documents based on:

Term frequency (TF): how often a query term appears in the document.
# Simplified
tf = count(term, document) / len(document)

# Example
Query: "embeddings"
Doc A: "embeddings" appears 5 times in 100 words → high TF
Doc B: "embeddings" appears 1 time in 100 words → low TF

Inverse document frequency (IDF): how rare the term is across the collection; rare terms contribute more to the score than common ones.

Saturation: BM25 applies diminishing returns to term frequency, so 5 mentions isn't 5x better than 1.
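The saturation effect falls out of the standard BM25 term formula. A self-contained sketch (k1 and b are the usual BM25 free parameters; this is illustrative, not LanceDB's implementation):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one query term against one document (standard BM25 formula)."""
    # Rarer terms get a larger IDF weight
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: long documents are penalized relative to average
    norm = 1 - b + b * doc_len / avg_doc_len
    # TF saturates: the score approaches an asymptote as tf grows
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Diminishing returns: 5 occurrences scores nowhere near 5x one occurrence
one = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                      n_docs=1000, docs_with_term=50)
five = bm25_term_score(tf=5, doc_len=100, avg_doc_len=100,
                       n_docs=1000, docs_with_term=50)
print(f"tf=1 → {one:.2f}, tf=5 → {five:.2f}, ratio {five / one:.2f}x")
```

With the default k1 = 1.2, quintupling the term count less than doubles the score, which is the saturation behavior described above.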

BM25 Example

# Query: "configure embedding model"

# Document A: "To configure the embedding model, use config.json..."
# - "configure": 1 occurrence, medium IDF
# - "embedding": 1 occurrence, low IDF (common in docs)
# - "model": 1 occurrence, low IDF (common)
# BM25 Score: 3.2

# Document B: "The embedding model configuration allows you to..."
# - "embedding": 1 occurrence
# - "model": 1 occurrence  
# - "configuration": 1 occurrence (synonym of "configure")
# BM25 Score: 2.1 (lower - missed "configure")

# Vector search might rank B higher due to semantic similarity,
# but BM25 ensures A gets boosted for exact term match.
Vector Similarity Search

Vector search finds chunks with embeddings close to the query embedding.

Cosine Similarity

From a mathematical perspective:
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Compute cosine similarity between two vectors."""
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    return dot_product / (norm_a * norm_b)

# Example
query_vec = [0.5, 0.3, 0.8, ...]  # 384 dimensions
chunk_vec = [0.6, 0.2, 0.7, ...]  # 384 dimensions

similarity = cosine_similarity(query_vec, chunk_vec)
# Returns: 0.92 (very similar)
Cosine similarity measures the angle between vectors, not their magnitude. Values range from -1 (opposite) to 1 (identical).

Normalized Embeddings

From embeddings.py:154, embeddings are normalized:
batch_embeddings = model.encode(
    sentences=batch,
    normalize_embeddings=True,  # L2 normalization
    ...
)
Normalization benefits:
  • Embeddings have unit length (magnitude = 1)
  • Cosine similarity simplifies to dot product
  • Faster computation: similarity = dot(a, b) instead of dot(a, b) / (norm(a) * norm(b))
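That simplification is easy to verify: once both vectors have unit length, the denominator of the cosine formula is 1, so the dot product alone is the similarity. A small check with toy 3-dimensional vectors standing in for 384-dimensional embeddings:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors; real embeddings have 384 dimensions
a = l2_normalize([0.5, 0.3, 0.8])
b = l2_normalize([0.6, 0.2, 0.7])

# Full cosine similarity vs. plain dot product on normalized vectors
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
print(abs(cosine - dot(a, b)) < 1e-12)  # → True: identical on unit vectors
```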

Approximate Nearest Neighbor

LanceDB uses ANN (Approximate Nearest Neighbor) indexes for fast vector search:
# Exact search (slow for large datasets)
for chunk in all_chunks:
    scores.append(cosine_similarity(query_vec, chunk.vector))
results = top_k(scores)
# O(n) where n = number of chunks

# ANN search (fast)
results = ann_index.search(query_vec, k=top_k)
# O(log n) with high accuracy
LanceDB automatically builds ANN indexes for the vector field.
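The intuition behind an IVF-style ANN index can be sketched in a few lines: partition vectors into cells around centroids, then at query time scan only the most promising cells. This toy version is illustrative only; real indexes use trained centroids, quantization, and other refinements.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_ivf(vectors, n_cells=4, seed=0):
    """Partition vectors into cells around randomly chosen centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_cells)
    cells = [[] for _ in range(n_cells)]
    for vec in vectors:
        nearest = max(range(n_cells), key=lambda i: dot(vec, centroids[i]))
        cells[nearest].append(vec)
    return centroids, cells

def ann_search(query, centroids, cells, k=3, n_probe=1):
    """Scan only the n_probe most promising cells instead of every vector."""
    order = sorted(range(len(centroids)),
                   key=lambda i: dot(query, centroids[i]), reverse=True)
    candidates = [v for i in order[:n_probe] for v in cells[i]]
    return sorted(candidates, key=lambda v: dot(query, v), reverse=True)[:k]

rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
centroids, cells = build_ivf(vectors)
top = ann_search(vectors[0], centroids, cells, k=3)
# Accuracy vs. speed is tuned by n_probe: probing more cells costs more
# comparisons but is less likely to miss a true nearest neighbor.
```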

Metadata Filtering

From query.py:98-103, filters are applied before ranking:
# SQL-style WHERE clauses
search_builder = search_builder.where(f"version = '{safe_version}'")

if library_name:
    search_builder = search_builder.where(f"library_name = '{safe_name}'")

Why Filter First?

# Bad: Search all, then filter
results = search_all_chunks(query)  # 1M chunks
filtered = [r for r in results if r.version == "v1.0.0"]  # 1K chunks
# Wastes time searching irrelevant chunks

# Good: Filter, then search
filtered_chunks = chunks.where("version = 'v1.0.0'")  # 1K chunks
results = search(filtered_chunks, query)
# Only searches relevant chunks

SQL Injection Prevention

From query.py:46-65, user input is escaped:
def _escape_sql_string(value: str) -> str:
    """Escape a string value for safe use in LanceDB SQL WHERE clauses."""
    # Remove null bytes
    value = value.replace("\x00", "")
    # Escape backslashes first
    value = value.replace("\\", "\\\\")
    # Escape single quotes (SQL standard: ' becomes '')
    value = value.replace("'", "''")
    return value

# Usage
safe_version = _escape_sql_string(version)  # "v'1.0.0" → "v''1.0.0"
search_builder.where(f"version = '{safe_version}'")
Always escape user input in SQL WHERE clauses to prevent injection attacks.

Retrieving Full Content

Search results contain chunk content (800 chars). To get the full page, use get_full_content (from query.py:211-251):
def get_full_content(
    url: str,
    version: str,
    db_path: Path = DEFAULT_DB_PATH,
    table_name: str = DEFAULT_TABLE_NAME,
) -> str:
    """Retrieve the full content of a document by its URL and version."""
    table = _get_table(db_path, table_name)
    
    # Query all chunks for this URL and version
    safe_url = _escape_sql_string(url)
    safe_version = _escape_sql_string(version)
    df = (
        table.search()
        .where(f"url = '{safe_url}' AND version = '{safe_version}'")
        .select(["title", "content", "chunk_index"])
        .to_pandas()
    )
    
    if df.empty:
        return f"No content found for URL: {url} (version: {version})"
    
    # Sort by chunk_index and concatenate content
    df = df.sort_values("chunk_index")
    full_content = "\n\n".join(df["content"].tolist())
    
    title = df.iloc[0].get("title", "(no title)")
    return f"# {title}\n\nSource: {url}\nVersion: {version}\n\n{full_content}"
1. Query All Chunks

Find all chunks belonging to the same URL and version.
2. Sort by Chunk Index

Ensure chunks are in original order (chunk_index: 0, 1, 2, …).
3. Concatenate Content

Join chunk content with double newlines to preserve formatting.
4. Format as Markdown

Return complete page with title, source, and full content.

Query Caching

From query.py:12-15, database connections are cached:
_db_cache: dict[str, Any] = {}
_table_cache: dict[tuple[str, str], Any] = {}
_metadata_cache: dict[tuple[str, str], dict[str, Any]] = {}

def _get_db(db_path: Path) -> "lancedb.DBConnection":
    """Get a cached database connection."""
    path_str = str(db_path)
    if path_str not in _db_cache:
        _db_cache[path_str] = lancedb.connect(path_str)
    return _db_cache[path_str]

def _get_table(db_path: Path, table_name: str) -> Optional["lancedb.table.Table"]:
    """Get a cached table handle."""
    cache_key = (str(db_path), table_name)
    if cache_key not in _table_cache:
        db = _get_db(db_path)
        if table_name not in db.table_names():
            return None
        _table_cache[cache_key] = db.open_table(table_name)
    return _table_cache[cache_key]
Caching avoids reconnecting to the database for every query. This is especially important for the MCP server, which handles many sequential requests.

Search Configuration

From config.py:69, default top K:
DEFAULT_TOP_K = 5
Configure with:
# Return more results by default
openground config set query.top_k 10

# Or specify per query
openground query "my query" --top-k 20
Choosing top K:
  • Small (3-5): Precise, focused results for AI agents
  • Medium (10-15): Good for exploratory queries
  • Large (20+): Comprehensive coverage, but may include noise

Performance Characteristics

# Typical latency breakdown
Query embedding:     50-200ms  (depends on model/hardware)
Vector search:       10-50ms   (ANN index)
BM25 search:         5-20ms    (full-text index)
Score fusion:        1-5ms     (LanceDB internal)
Metadata filtering:  <1ms      (indexed columns)
Total:               ~100-300ms
Factors:
  • Embedding model speed (GPU vs CPU)
  • Number of chunks in database
  • Complexity of filters

Search Quality Tips

Good queries:
  • “how to configure embeddings”
  • “FastAPI dependency injection”
  • “error handling best practices”
Poor queries:
  • “stuff” (too vague)
  • “asdfasdf” (gibberish)
  • Single words without context
BM25 rewards exact matches:
# Good: Uses specific API name
"openground config set embeddings.embedding_model"

# Less good: Vague
"change settings for vectors"
Always specify version for accurate results:
# Good: Specific version
openground query "new features" -l fastapi -v v0.100.0

# Risk: Might mix results from old versions
openground query "new features" -l fastapi -v latest
Search results are individual chunks (~800 characters each). For complete context, pair search with get_full_content (conceptual sketch; the real search function returns formatted text rather than result objects):
# 1. Search to find relevant page
results = search("installation guide")

# 2. Get full content of the page
full_page = get_full_content(results[0].url, results[0].version)

Next Steps