OpenGround is an on-device RAG (Retrieval-Augmented Generation) system designed to give AI agents controlled access to documentation. Everything runs locally: no external APIs, and no data leaves your machine.
System Overview
OpenGround follows a pipeline architecture with three main stages:
┌─────────────────────────────────────────────────────────────────────┐
│                             OPENGROUND                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   SOURCE              PROCESS                     STORAGE/CLIENT    │
│                                                                     │
│ ┌──────────┐     ┌───────────┐    ┌──────────┐    ┌──────────┐      │
│ │ git repo ├────>│  Extract  ├───>│  Chunk   ├───>│ LanceDB  │      │
│ │   -or-   │     │ (raw_data)│    │   Text   │    │ (vector  │      │
│ │ sitemap  │     └───────────┘    └────┬─────┘    │  +BM25)  │      │
│ │   -or-   │                           │          └────┬─────┘      │
│ │ local dir│                           │               │            │
│ └──────────┘                           ▼               │            │
│                                  ┌───────────┐         │            │
│                                  │   Local   │         │            │
│                                  │ Embedding │         ▼            │
│                                  │   Model   │   ┌─────────────┐    │
│                                  └───────────┘   │  CLI / MCP  │    │
│                                                  │   (hybrid   │    │
│                                                  │   search)   │    │
│                                                  └─────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Architecture Stages
1. Source Layer
The source layer handles documentation ingestion from multiple source types. See the Sources page for detailed information.
Supported Sources:
Git Repositories: Clone and extract documentation from specific branches/tags
Sitemaps: Crawl and extract web documentation following sitemap.xml
Local Paths: Process documentation from local directories
Key Components:
extract/git.py: Handles git repository cloning with sparse checkout
extract/sitemap.py: Fetches and parses sitemaps, respects robots.txt
extract/local_path.py: Processes local file system paths
extract/common.py: Shared file processing logic
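As an illustration of what the source layer does, a minimal local-directory walker might look like the sketch below. This is not the actual `extract/local_path.py`; the helper name `collect_doc_files` is hypothetical, and the extension set is taken from the supported file types listed in the processing layer.

```python
from pathlib import Path

# Extensions OpenGround processes (see the supported file types list)
SUPPORTED_EXTENSIONS = {".md", ".mdx", ".rst", ".txt", ".ipynb", ".html", ".htm"}

def collect_doc_files(root: str) -> list[Path]:
    """Recursively gather documentation files with supported extensions."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The real extractors additionally attach metadata (url, title, version) to each file they emit.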
2. Processing Layer
The processing layer transforms raw documentation into searchable chunks.
OpenGround supports multiple documentation formats:
Markdown/MDX/RST
Jupyter Notebooks
HTML
# Handled in extract/common.py
def remove_front_matter(content: str) -> tuple[str, dict[str, str]]:
    """Parse YAML front matter and extract metadata"""
    if not content.startswith("---"):
        return content, {}
    # Parse front matter for title, description, etc.
Supported file types: .md, .mdx, .rst, .txt, .ipynb, .html, .htm
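To make the elided parsing step concrete, here is a self-contained sketch of front-matter stripping. The helper `strip_front_matter` is hypothetical and handles only flat `key: value` pairs, unlike a full YAML parser; the real `extract/common.py` may differ.

```python
def strip_front_matter(content: str) -> tuple[str, dict[str, str]]:
    """Remove a leading '---' front-matter block and return (body, metadata).
    Only flat 'key: value' pairs are parsed in this sketch."""
    if not content.startswith("---"):
        return content, {}
    end = content.find("\n---", 3)
    if end == -1:
        # Unterminated front matter: leave the document untouched
        return content, {}
    header = content[3:end]
    body = content[end + len("\n---"):].lstrip("\n")
    meta: dict[str, str] = {}
    for line in header.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return body, meta
```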
Document Chunking
Documents are split into overlapping chunks for better retrieval (from ingest.py:52-76):
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(page: ParsedPage) -> list[dict]:
    config = get_effective_config()
    chunk_size = config["embeddings"]["chunk_size"]        # Default: 800
    chunk_overlap = config["embeddings"]["chunk_overlap"]  # Default: 200
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    chunks = splitter.split_text(page["content"])
    # Each chunk preserves metadata: url, title, version, library_name
    records = []
    for idx, chunk in enumerate(chunks):
        records.append({
            "url": page["url"],
            "library_name": page["library_name"],
            "version": page["version"],
            "title": page["title"],
            "content": chunk,
            "chunk_index": idx,
        })
    return records
Chunk overlap ensures that context isn’t lost at chunk boundaries, improving retrieval quality.
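The effect of overlap can be seen with a toy fixed-window chunker. This is a simplification for illustration: RecursiveCharacterTextSplitter first splits on separators such as paragraphs and sentences before falling back to fixed windows.

```python
def overlap_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Naive fixed-window chunking: consecutive chunks share `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = overlap_chunks("abcdefghij", size=4, overlap=2)
# chunks == ["abcd", "cdef", "efgh", "ghij"]; each neighbour pair shares 2 chars
```

A phrase that straddles a window boundary still appears intact in at least one chunk, which is why overlap improves retrieval.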
Embedding Generation
Each chunk is converted to a vector embedding using a local model. See Embeddings for details.
3. Storage Layer
OpenGround uses LanceDB for storing both vector embeddings and full-text search indices.
Why LanceDB?
Columnar storage: Efficient for vector operations
Built-in BM25: Full-text search without external dependencies
Local-first: No server setup required
PyArrow integration: Fast data serialization
Schema Structure
From ingest.py:163-177, the LanceDB table schema:
schema = pa.schema(
    [
        pa.field("url", pa.string()),
        pa.field("library_name", pa.string()),
        pa.field("version", pa.string()),
        pa.field("title", pa.string()),
        pa.field("description", pa.string()),
        pa.field("last_modified", pa.string()),
        pa.field("content", pa.string()),                 # Text for BM25
        pa.field("chunk_index", pa.int64()),
        pa.field("vector", pa.list_(pa.float32(), 384)),  # Embedding vector
    ],
    metadata={
        "embedding_backend": "fastembed",
        "embedding_model": "BAAI/bge-small-en-v1.5",
    },
)
The schema metadata tracks which embedding model was used, preventing incompatible searches.
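A guard along these lines could enforce that check before running a query. The helper `check_embedding_compat` is hypothetical; OpenGround's actual validation may differ.

```python
def check_embedding_compat(table_metadata: dict, backend: str, model: str) -> None:
    """Raise if the table was built with a different embedding setup
    than the one configured for query-time embedding."""
    stored = (
        table_metadata.get("embedding_backend"),
        table_metadata.get("embedding_model"),
    )
    if stored != (backend, model):
        raise ValueError(
            f"table embedded with {stored}, but config specifies {(backend, model)}"
        )
```

Without such a guard, a query embedded by a different model would be compared against vectors in an incompatible space and silently return poor results.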
Full-Text Index
After ingesting chunks, OpenGround creates a BM25 full-text search index (from ingest.py:223-226):
table.add(all_records)
table.create_fts_index("content", replace=True)
This enables hybrid search combining semantic similarity and keyword matching.
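How the two ranked lists are merged is up to LanceDB's reranker. A common scheme for combining ranked lists is reciprocal rank fusion, shown here as a generic illustration rather than OpenGround's exact method:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each document by the sum of 1/(k + rank) over all rankings,
    then return documents in descending fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Documents ranked highly by both the vector search and BM25 accumulate the largest fused scores, so agreement between the two signals is rewarded.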
4. Query/Client Layer
The client layer exposes documentation through two interfaces:
CLI Commands
# Search documentation
openground query "how to configure embeddings" -l fastapi -v latest
# List available libraries
openground list
# Get library statistics
openground stats show
MCP Server
The Model Context Protocol (MCP) server exposes OpenGround to AI agents:
# From server.py
tools = [
    {"name": "search_documentation", ...},
    {"name": "list_libraries", ...},
    {"name": "get_full_content", ...},
]
AI agents can search documentation without polluting the main conversation context.
Data Flow Example
Let’s trace a complete flow from adding documentation to searching it:
Add Documentation
openground add fastapi \
--source https://github.com/tiangolo/fastapi.git \
--docs-path docs/ \
--version v0.100.0 -y
Git extractor clones repo with sparse checkout
Filters for .md, .mdx files in docs/
Extracts content and metadata
Saves to ~/.local/share/openground/raw_data/fastapi/v0.100.0/
Chunk & Embed
Load parsed pages from raw_data directory
Split each page into 800-character chunks with 200-char overlap
Generate embeddings for all chunks (batch size: 32)
Store in LanceDB with metadata
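The embedding step above runs in batches (default size 32); a trivial batching helper shows the shape of that loop (illustrative only):

```python
def batched(items: list, batch_size: int = 32) -> list[list]:
    """Split a list of chunks into embedding-sized batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Batching keeps memory bounded and lets the embedding model amortize per-call overhead across many chunks.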
Search
# User query
query = "how to add dependencies"

# Generate query embedding
query_vec = generate_embeddings([query])[0]

# Hybrid search (vector + BM25)
results = (
    table.search(query_type="hybrid")
    .text(query)
    .vector(query_vec)
    .where("version = 'v0.100.0'")
    .limit(5)
    .to_list()
)
Returns ranked results combining semantic similarity and keyword relevance.
Configuration
OpenGround’s behavior is controlled through a hierarchical configuration system (from config.py):
# ~/.config/openground/config.json
{
  "db_path": "~/.local/share/openground/lancedb",
  "table_name": "documents",
  "raw_data_dir": "~/.local/share/openground/raw_data",
  "extraction": {
    "concurrency_limit": 50
  },
  "embeddings": {
    "batch_size": 32,
    "chunk_size": 800,
    "chunk_overlap": 200,
    "embedding_model": "BAAI/bge-small-en-v1.5",
    "embedding_dimensions": 384,
    "embedding_backend": "fastembed"
  },
  "query": {
    "top_k": 5
  },
  "sources": {
    "auto_add_local": true
  }
}
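"Hierarchical" here presumably means that user settings overlay built-in defaults. A recursive merge along these lines captures the idea (a sketch, not config.py's actual code):

```python
def merge_config(defaults: dict, overrides: dict) -> dict:
    """Overlay user overrides onto defaults, recursing into nested sections."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```

The recursion matters: overriding one key inside "embeddings" should not discard the section's other defaults.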
XDG Compliance
OpenGround follows the XDG Base Directory Specification (from config.py:10-24):
Config: $XDG_CONFIG_HOME/openground or ~/.config/openground
Data: $XDG_DATA_HOME/openground or ~/.local/share/openground
Windows: Uses AppData/Local/openground
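Resolving those directories takes only a few lines on Unix-like systems (a sketch; config.py may instead use a library such as platformdirs, which also covers the Windows case):

```python
import os
from pathlib import Path

def xdg_config_dir(app: str = "openground") -> Path:
    """$XDG_CONFIG_HOME/<app>, falling back to ~/.config/<app>."""
    base = os.environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / app

def xdg_data_dir(app: str = "openground") -> Path:
    """$XDG_DATA_HOME/<app>, falling back to ~/.local/share/<app>."""
    base = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(base) / app
```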
Component Isolation
Each component is designed for independence:
Extractors output standardized ParsedPage objects
Ingestion works with any ParsedPage source
Query operates on LanceDB tables regardless of source
Embedding backends are swappable (sentence-transformers ↔ fastembed)
This modularity enables:
Adding new source types without changing ingestion
Swapping embedding models without changing extraction
Independent testing of each component
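The ParsedPage contract that ties these components together could be sketched as a TypedDict. The field names are inferred from the LanceDB schema above; the actual definition in the codebase may differ.

```python
from typing import TypedDict

class ParsedPage(TypedDict):
    """One extracted documentation page, as handed from extractors to ingestion."""
    url: str
    library_name: str
    version: str
    title: str
    description: str
    last_modified: str
    content: str
```

Because every extractor emits this one shape, ingestion and chunking never need to know whether a page came from git, a sitemap, or a local directory.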
Next Steps