Understanding hybrid search, vector similarity, BM25 full-text search, and query flow in OpenGround
OpenGround uses hybrid search combining two complementary techniques: vector similarity (semantic search) and BM25 (keyword search). This approach finds relevant documentation even when queries use different terminology than the source material.
```
Query: "how to install packages"
Matches:
- "Adding dependencies to your project"
- "Package management guide"
- "Setting up requirements.txt"
```
Vector search weaknesses:
May miss exact technical terms
Can retrieve overly broad matches
BM25 strengths:
Exact term matching
Great for technical jargon, function names, APIs
Predictable and explainable
Example:
```
Query: "FastAPI dependencies"
Matches:
- Pages with "FastAPI" and "dependencies"
- Boosts pages where terms appear frequently
- Considers term rarity (IDF)
```
BM25 weaknesses:
Misses synonyms and paraphrasing
No understanding of context
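To make the BM25 intuition concrete, here is a minimal single-term scoring sketch. This is illustrative only, not OpenGround's implementation (LanceDB computes BM25 internally); the parameter defaults `k1=1.2` and `b=0.75` are the conventional Okapi values.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """Score one term in one document: TF saturation weighted by IDF.

    tf: term count in the document; df: number of docs containing the term.
    """
    # Rare terms (small df) get a large IDF; ubiquitous terms get ~0
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # TF saturates: the 5th occurrence adds less than the 1st
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# A rare term ("FastAPI" in 3 of 1000 docs) outweighs a common one (in 950 docs)
rare = bm25_term_score(tf=2, doc_len=100, avg_doc_len=120, n_docs=1000, df=3)
common = bm25_term_score(tf=2, doc_len=100, avg_doc_len=120, n_docs=1000, df=950)
print(rare > common)  # True
```

The IDF factor is what makes BM25 "consider term rarity" as described above.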
Hybrid search combines both:
Vector search finds semantic matches
BM25 ensures exact terms aren’t missed
LanceDB merges and reranks results
Result:
Best of both worlds - semantic understanding with keyword precision.
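One common way to merge two ranked result lists is reciprocal rank fusion (RRF). The sketch below is illustrative only, using hypothetical document IDs; it is not LanceDB's internal reranker, though the idea is the same: documents ranked well by both searches float to the top.

```python
def rrf_merge(vector_hits, bm25_hits, k=60):
    """Merge two ranked lists of doc IDs by reciprocal rank fusion."""
    scores = {}
    for hits in (vector_hits, bm25_hits):
        for rank, doc_id in enumerate(hits, start=1):
            # Each list contributes 1/(k + rank); higher rank = bigger share
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Doc B ranks well in BOTH lists, so it wins overall
merged = rrf_merge(["A", "B", "C"], ["B", "D", "A"])
print(merged[0])  # B
```

The constant `k` damps the influence of top ranks so a single #1 placement cannot dominate two solid #2 placements.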
From query.py:110-134, results are formatted for the user:
```python
if not results:
    return "Found 0 matches."

lines = [f"Found {len(results)} match{'es' if len(results) != 1 else ''}."]
for idx, item in enumerate(results, start=1):
    title = item.get("title") or "(no title)"
    snippet = (item.get("content") or "").strip()
    source = item.get("url") or "unknown"
    item_version = item.get("version") or version
    score = item.get("_distance") or item.get("_score")
    score_str = ""
    if isinstance(score, (int, float)):
        score_str = f", score={score:.4f}"
    # Embed tool call hint for fetching full content
    tool_hint = json.dumps(
        {"tool": "get_full_content", "url": source, "version": item_version}
    )
    lines.append(
        f'{idx}. **{title}**: "{snippet}" (Source: {source}, Version: {item_version}{score_str})\n'
        f"   To get full page content: {tool_hint}"
    )
return "\n".join(lines)
```
Example output:
```
Found 3 matches.
1. **Configuration**: "OpenGround's behavior is controlled through a hierarchical..." (Source: https://github.com/user/repo/docs/config.md, Version: latest, score=0.8234)
   To get full page content: {"tool": "get_full_content", "url": "https://...", "version": "latest"}
2. **Embedding Settings**: "You can configure the embedding model and backend..." (Source: https://github.com/user/repo/docs/embeddings.md, Version: latest, score=0.7891)
   To get full page content: {"tool": "get_full_content", "url": "https://...", "version": "latest"}
...
```
```python
# Simplified term frequency
tf = count(term, document) / len(document)

# Example
# Query: "embeddings"
# Doc A: "embeddings" appears 5 times in 100 words → high TF
# Doc B: "embeddings" appears 1 time in 100 words → low TF
```
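The TF half is balanced by IDF, which down-weights terms that appear everywhere. A minimal IDF sketch with a toy corpus (illustrative, not the exact BM25 formula):

```python
import math

def idf(term, documents):
    """Rare terms score high; ubiquitous terms score near zero."""
    n_containing = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / (1 + n_containing))

docs = [
    "embeddings store semantic meaning",
    "the quick brown fox",
    "the embeddings index is fast",
    "the cat sat on the mat",
]
print(idf("embeddings", docs) > idf("the", docs))  # True
```

This is why a match on "embeddings" counts for more than a match on "the", even at the same term frequency.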
LanceDB uses ANN (Approximate Nearest Neighbor) indexes for fast vector search:
```python
# Exact search (slow for large datasets)
for chunk in all_chunks:
    scores.append(cosine_similarity(query_vec, chunk.vector))
results = top_k(scores)
# O(n) where n = number of chunks

# ANN search (fast)
results = ann_index.search(query_vec, k=top_k)
# O(log n) with high accuracy
```
LanceDB automatically builds ANN indexes for the vector field.
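The intuition behind an IVF-style ANN index can be sketched without LanceDB: pre-assign vectors to a small set of partitions, then at query time probe only the partition closest to the query instead of scanning everything. This is a toy illustration with hypothetical 2-D vectors, not LanceDB's actual IVF-PQ implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (the 'inverted file')."""
    partitions = {i: [] for i in range(len(centroids))}
    for vec in vectors:
        best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        partitions[best].append(vec)
    return partitions

def ann_search(query, centroids, partitions):
    """Probe only the closest partition instead of scanning every vector."""
    best = max(range(len(centroids)), key=lambda i: cosine(query, centroids[i]))
    return max(partitions[best], key=lambda v: cosine(query, v))

centroids = [(1.0, 0.0), (0.0, 1.0)]
vectors = [(0.9, 0.1), (0.8, 0.3), (0.1, 0.9), (0.2, 0.7)]
partitions = build_ivf(vectors, centroids)
print(ann_search((1.0, 0.1), centroids, partitions))  # (0.9, 0.1)
```

With many partitions, each query only compares against a small fraction of the dataset, which is what gives ANN its sub-linear scaling.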
Search results contain chunk content (800 chars). To get the full page, use get_full_content (from query.py:211-251):
```python
def get_full_content(
    url: str,
    version: str,
    db_path: Path = DEFAULT_DB_PATH,
    table_name: str = DEFAULT_TABLE_NAME,
) -> str:
    """Retrieve the full content of a document by its URL and version."""
    table = _get_table(db_path, table_name)

    # Query all chunks for this URL and version
    safe_url = _escape_sql_string(url)
    safe_version = _escape_sql_string(version)
    df = (
        table.search()
        .where(f"url = '{safe_url}' AND version = '{safe_version}'")
        .select(["title", "content", "chunk_index"])
        .to_pandas()
    )

    if df.empty:
        return f"No content found for URL: {url} (version: {version})"

    # Sort by chunk_index and concatenate content
    df = df.sort_values("chunk_index")
    full_content = "\n\n".join(df["content"].tolist())
    title = df.iloc[0].get("title", "(no title)")

    return f"# {title}\n\nSource: {url}\nVersion: {version}\n\n{full_content}"
```
1. **Query All Chunks**: Find all chunks belonging to the same URL and version.
2. **Sort by Chunk Index**: Ensure chunks are in original order (chunk_index: 0, 1, 2, …).
3. **Concatenate Content**: Join chunk content with double newlines to preserve formatting.
4. **Format as Markdown**: Return the complete page with title, source, and full content.
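The four steps can be sketched independently of LanceDB and pandas, assuming chunks are plain dicts (a simplification of the function above, not its actual code path):

```python
def reassemble_page(chunks, url, version):
    """Rebuild a full page from its chunks (simplified, in-memory version)."""
    # 1. Query: keep only chunks for this URL and version
    matching = [c for c in chunks if c["url"] == url and c["version"] == version]
    if not matching:
        return f"No content found for URL: {url} (version: {version})"
    # 2. Sort by chunk_index to restore original order
    matching.sort(key=lambda c: c["chunk_index"])
    # 3. Concatenate with double newlines
    full_content = "\n\n".join(c["content"] for c in matching)
    # 4. Format as markdown with title and source
    title = matching[0].get("title", "(no title)")
    return f"# {title}\n\nSource: {url}\nVersion: {version}\n\n{full_content}"

chunks = [
    {"url": "docs/a.md", "version": "latest", "chunk_index": 1, "title": "A", "content": "second"},
    {"url": "docs/a.md", "version": "latest", "chunk_index": 0, "title": "A", "content": "first"},
]
print(reassemble_page(chunks, "docs/a.md", "latest"))
```

Note that sorting by `chunk_index` is essential: chunks are not guaranteed to come back from the database in insertion order.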
```
# Performance with different dataset sizes
10K chunks:  ~100ms per query
100K chunks: ~150ms per query
1M chunks:   ~250ms per query
10M chunks:  ~500ms per query

# ANN index ensures sub-linear scaling
```
LanceDB’s ANN indexes scale well to millions of chunks.
```
# Storage per chunk (approximate)
Metadata:      ~200 bytes
Content text:  ~800 bytes (avg)
Vector (384d): 1,536 bytes
Indexes:       ~500 bytes
Total:         ~3 KB per chunk

# For 100K chunks:
Total storage: ~300 MB
```
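The arithmetic checks out: a 384-dimensional float32 vector is 384 × 4 = 1,536 bytes, and a quick back-of-envelope script reproduces the ~300 MB total from the approximate sizes above:

```python
# Approximate per-chunk sizes from the breakdown above
metadata = 200
content = 800
vector = 384 * 4   # 384 float32 dims * 4 bytes each = 1,536 bytes
indexes = 500

per_chunk = metadata + content + vector + indexes
print(per_chunk)                  # 3036 bytes, i.e. ~3 KB
print(100_000 * per_chunk / 1e6)  # ~303.6 MB for 100K chunks
```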
Use Specific Terms

Queries that name a concrete API or setting match better than vague descriptions:

```
# Good: Uses specific API name
"openground config set embeddings.embedding_model"

# Less good: Vague
"change settings for vectors"
```
Filter by Version
Always specify version for accurate results:
```
# Good: Specific version
openground query "new features" -l fastapi -v v0.100.0

# Risk: Might mix results from old versions
openground query "new features" -l fastapi -v latest
```
Use get_full_content for Context
Search results are chunks (800 chars). For complete context:
```python
# 1. Search to find the relevant page
results = search("installation guide")

# 2. Get full content of the page
full_page = get_full_content(results[0].url, results[0].version)
```