Semantic Search Roadmap

Technical roadmap for implementing production-grade semantic search and RAG capabilities in doc-agent.


Architecture

Vector Store: Provider Pattern

Support multiple vector store backends via a common interface. The vector store is decoupled from chunk storage: it only knows about IDs, embeddings, and optional metadata used for filtering.

interface VectorStoreItem {
  id: string;                           // maps to Chunk.id
  embedding: number[];
  metadata?: Record<string, unknown>;   // for filtering
}

interface VectorStoreResult {
  id: string;
  score: number;
}

interface VectorStore {
  name: string;
  insert(items: VectorStoreItem[]): Promise<void>;
  search(
    queryEmbedding: number[],
    topK: number,
    filters?: Record<string, unknown>
  ): Promise<VectorStoreResult[]>;
  delete(ids: string[]): Promise<void>;
}

Implementations:

  • CustomVectorStore — brute-force → HNSW
  • LanceDBVectorStore — baseline comparison

The search orchestrator hydrates results by joining VectorStoreResult.id against the chunks table.
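The brute-force stage and the hydration join described above could look roughly like the following sketch. It is an illustration under assumptions, not the project's implementation: cosineSimilarity, bruteForceSearch, InMemoryVectorStore, and hydrate are hypothetical names, metadata filtering is left out, and a Map stands in for the chunks table.

```typescript
interface VectorStoreItem {
  id: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

interface VectorStoreResult {
  id: string;
  score: number;
}

// Cosine similarity of two equal-length vectors; returns 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Scores every stored item against the query and keeps the topK highest.
function bruteForceSearch(
  items: VectorStoreItem[],
  queryEmbedding: number[],
  topK: number
): VectorStoreResult[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(queryEmbedding, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Hypothetical in-memory store; filters are omitted in this sketch.
class InMemoryVectorStore {
  readonly name = "in-memory-brute-force";
  private items = new Map<string, VectorStoreItem>();

  async insert(items: VectorStoreItem[]): Promise<void> {
    for (const item of items) this.items.set(item.id, item);
  }

  async search(queryEmbedding: number[], topK: number): Promise<VectorStoreResult[]> {
    return bruteForceSearch(Array.from(this.items.values()), queryEmbedding, topK);
  }

  async delete(ids: string[]): Promise<void> {
    for (const id of ids) this.items.delete(id);
  }
}

// Joins results against stored chunks by ID; IDs with no chunk are skipped.
function hydrate<T extends { id: string }>(
  results: VectorStoreResult[],
  chunksById: Map<string, T>
): { chunk: T; score: number }[] {
  const hydrated: { chunk: T; score: number }[] = [];
  for (const result of results) {
    const chunk = chunksById.get(result.id);
    if (chunk !== undefined) hydrated.push({ chunk, score: result.score });
  }
  return hydrated;
}
```

Keeping the scoring loop in a plain function makes it easy to reuse as the ground truth when measuring HNSW accuracy later.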

Chunking Strategies

| Strategy | Flag | Implementation | Best For |
|----------|------|----------------|----------|
| Line | --chunk line | Split on \n, group empty lines | Receipts, invoices |
| Sentence | --chunk sentence | NLP tokenizer | Natural text |
| Semantic | --chunk semantic | LLM-assisted boundary detection | Contracts, reports |

Auto-routing by document type:

  • Receipts/invoices → line
  • Bank statements → line or sentence
  • Contracts/reports → semantic
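One possible reading of the line strategy and the routing list above is sketched below. It assumes that "group empty lines" means blank lines separate groups of consecutive non-empty lines. The ID scheme, the document type labels, and the sentence fallback are hypothetical choices made for illustration.

```typescript
interface Chunk {
  id: string;
  documentId: string;
  content: string;
  index: number;
  metadata: { source: string; [key: string]: unknown };
}

type ChunkingStrategy = "line" | "sentence" | "semantic";

// Splits on "\n" and groups consecutive non-empty lines; blank lines end a group.
function chunkByLine(text: string, documentId: string, source: string): Chunk[] {
  const chunks: Chunk[] = [];
  let current: string[] = [];

  const flush = (): void => {
    if (current.length === 0) return;
    const index = chunks.length;
    chunks.push({
      id: `${documentId}:${index}`, // hypothetical ID scheme
      documentId,
      content: current.join("\n"),
      index,
      metadata: { source },
    });
    current = [];
  };

  for (const rawLine of text.split("\n")) {
    const line = rawLine.replace(/\r$/, ""); // tolerate Windows line endings
    if (line.trim() === "") flush();
    else current.push(line);
  }
  flush();
  return chunks;
}

// Document type labels are hypothetical; the mapping follows the routing list.
function routeStrategy(documentType: string): ChunkingStrategy {
  switch (documentType) {
    case "receipt":
    case "invoice":
    case "bank_statement": // the list allows line or sentence; line is picked here
      return "line";
    case "contract":
    case "report":
      return "semantic";
    default:
      return "sentence"; // assumed fallback for natural text
  }
}
```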

Embedding Providers

interface EmbeddingProvider {
  name: string;
  dims: number;
  embed(texts: string[]): Promise<number[][]>;
}

| Provider | Models | Notes |
|----------|--------|-------|
| Ollama (default) | nomic-embed-text, mxbai-embed-large | Local, no API key |
| OpenAI | text-embedding-3-small | High quality |
| Gemini | text-embedding-005, text-multilingual-embedding-002 | Multilingual support |
| Transformers.js | Local ONNX | Zero external deps |

LLM Providers

interface LLMProvider {
  name: string;
  generate(prompt: string, options?: { system?: string }): Promise<string>;
}

| Provider | Models | Notes |
|----------|--------|-------|
| Ollama (default) | llama3.2, mistral | Local |
| OpenAI | gpt-4o-mini | High quality |
| Gemini | gemini-1.5-flash | Fast |

Storage Model

CREATE TABLE chunks (
  id TEXT PRIMARY KEY,
  document_id INTEGER REFERENCES documents(id),
  content TEXT NOT NULL,
  metadata JSON,
  chunk_index INTEGER NOT NULL
);

Embedding storage:

  • Phase 1: SQLite BLOB (brute-force search)
  • Phase 2+: Vector store's native format (HNSW memory-mapped files)
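For the Phase 1 BLOB storage, one simple encoding is to pack each embedding as little-endian float32 values, four bytes per dimension. The sketch below assumes that layout; embeddingToBlob and blobToEmbedding are hypothetical helper names, and the roadmap does not prescribe a format.

```typescript
// Packs an embedding into little-endian float32 bytes suitable for a BLOB column.
function embeddingToBlob(embedding: number[]): Uint8Array {
  const view = new DataView(new ArrayBuffer(embedding.length * 4));
  embedding.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return new Uint8Array(view.buffer);
}

// Reverses embeddingToBlob; rejects byte lengths that are not a multiple of 4.
function blobToEmbedding(blob: Uint8Array): number[] {
  if (blob.byteLength % 4 !== 0) {
    throw new Error(`blob length ${blob.byteLength} is not a multiple of 4`);
  }
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const embedding: number[] = [];
  for (let offset = 0; offset < blob.byteLength; offset += 4) {
    embedding.push(view.getFloat32(offset, true));
  }
  return embedding;
}
```

Float32 halves the storage of JavaScript's native doubles at the cost of a small rounding error, which is usually negligible for cosine similarity.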

Hybrid Search

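FTS5 external-content index alongside vector search (the index stores only the search terms and reads the text back from chunks):

-- chunks.id is TEXT, so the index is keyed on the implicit integer rowid.
-- Implicit rowids can change on VACUUM; an INTEGER PRIMARY KEY column avoids that.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  content,
  content='chunks',
  content_rowid='rowid'
);

FTS5 matches are joined back to chunks on rowid, so both FTS5 and vector search yield chunk.id, enabling fusion: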

interface HybridSearchResult {
  chunk: Chunk;
  vectorScore?: number;
  keywordScore?: number;
  combinedScore: number;
  ranks: {
    vectorRank?: number;
    keywordRank?: number;
  };
}

Search modes:

  • --mode vector — Cosine similarity only
  • --mode keyword — BM25 only
  • --mode hybrid — RRF fusion
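Reciprocal Rank Fusion combines ranked lists without comparing raw scores: each ID receives the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1. A minimal sketch follows. The signature is an assumption rather than the planned rrfFusion() API, and k = 60 is a commonly used default, not a value taken from this roadmap.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// rankings: one array of chunk IDs per retriever, best match first.
function rrfFusion(rankings: string[][], k: number = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: FusedResult[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  return fused.sort((a, b) => b.score - a.score);
}
```

Because only ranks are used, cosine similarities and BM25 scores never need to be put on a common scale. An ID found by both retrievers accumulates two terms and tends to rise above IDs found by only one.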

Reranking

interface Reranker {
  rerank(query: string, candidates: ScoredChunk[]): Promise<ScoredChunk[]>;
}

The reranker takes scored candidates rather than bare chunks, so the retrieval scores stay available for debugging and score blending.
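As an illustration of score blending, the sketch below mixes a reranker's scores with the retrieval combinedScore after min-max normalizing both. The function names, the weight alpha = 0.7, and the normalization choice are assumptions made for this example, and the reranker scores are taken as given, since the roadmap does not say how they are produced.

```typescript
interface Chunk {
  id: string;
  content: string;
}

interface ScoredChunk {
  chunk: Chunk;
  vectorScore?: number;
  keywordScore?: number;
  combinedScore: number;
}

// Min-max normalizes values to [0, 1]; a constant list maps to all 1s.
function minMaxNormalize(values: number[]): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (max === min) return values.map(() => 1);
  return values.map((v) => (v - min) / (max - min));
}

// rerankScores[i] is the reranker's score for candidates[i] (source assumed).
function blendRerank(
  candidates: ScoredChunk[],
  rerankScores: number[],
  alpha: number = 0.7
): ScoredChunk[] {
  if (candidates.length !== rerankScores.length) {
    throw new Error("one rerank score is required per candidate");
  }
  const retrieval = minMaxNormalize(candidates.map((c) => c.combinedScore));
  const rerank = minMaxNormalize(rerankScores);
  return candidates
    .map((c, i) => ({
      ...c, // vectorScore and keywordScore are kept for debugging
      combinedScore: alpha * rerank[i] + (1 - alpha) * retrieval[i],
    }))
    .sort((a, b) => b.combinedScore - a.combinedScore);
}
```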

RAG Pipeline

interface RAGResponse {
  answer: string;
  chunks: RAGChunk[];
  debug?: {
    vectorResults: ScoredChunk[];
    keywordResults: ScoredChunk[];
    rerankedResults: ScoredChunk[];
    stats: {
      vectorLatencyMs: number;
      keywordLatencyMs?: number;
      rerankLatencyMs?: number;
      totalLatencyMs: number;
    };
  };
}

Exposed via:

  1. CLI: doc search "query" --rag
  2. MCP: search_documents tool
  3. HTTP: POST /rag (optional)
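To make answers citable, the prompt can number each retrieved chunk and ask the model to cite by number. The sketch below shows one way to assemble such a prompt from RAGChunk values. buildRagPrompt and the instruction wording are hypothetical, not the project's prompt templates.

```typescript
interface RAGChunk {
  id: string;
  content: string;
  score: number;
  source: {
    documentId: string;
    filename: string;
    page?: number;
  };
}

// Builds a prompt whose context entries are numbered [1], [2], ... for citation.
function buildRagPrompt(query: string, chunks: RAGChunk[]): string {
  const context = chunks
    .map((chunk, i) => {
      const page = chunk.source.page !== undefined ? `, p. ${chunk.source.page}` : "";
      return `[${i + 1}] (${chunk.source.filename}${page})\n${chunk.content}`;
    })
    .join("\n\n");

  return [
    "Answer the question using only the numbered context below.",
    "Cite sources as [n]. If the context is insufficient, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${query}`,
  ].join("\n");
}
```

The resulting string would be passed to LLMProvider.generate(), and the [n] markers in the answer map back to RAGResponse.chunks by position.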

Evaluation

interface EvalQuery {
  id: string;
  query: string;
  relevantChunkIds: string[];
  category?: string;
}

interface EvalDataset {
  name: string;
  chunks: Chunk[];
  queries: EvalQuery[];
}

interface EvalResult {
  recallAtK: Record<number, number>;
  precisionAtK: Record<number, number>;
  mrr: number;
  byCategory?: Record<string, EvalResult>;
}
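The metrics in EvalResult can be computed from the ranked IDs each query returns and its relevantChunkIds. The sketch below uses the standard definitions: recall@k is hits in the top k divided by the number of relevant chunks, precision@k is hits divided by k, and MRR is the mean of 1 / rank of the first relevant hit. QueryRun and evaluateRuns are hypothetical names, and retrieved IDs are assumed to be unique.

```typescript
interface QueryRun {
  retrieved: string[]; // ranked chunk IDs returned by search, best first
  relevant: string[]; // EvalQuery.relevantChunkIds
}

function countHits(run: QueryRun, k: number): number {
  const relevant = new Set(run.relevant);
  return run.retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
}

function recallAtK(run: QueryRun, k: number): number {
  const total = new Set(run.relevant).size;
  return total === 0 ? 0 : countHits(run, k) / total;
}

function precisionAtK(run: QueryRun, k: number): number {
  return k <= 0 ? 0 : countHits(run, k) / k;
}

// 1 / rank of the first relevant result, or 0 if none was retrieved.
function reciprocalRank(run: QueryRun): number {
  const relevant = new Set(run.relevant);
  const i = run.retrieved.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Averages each metric over all queries.
function evaluateRuns(runs: QueryRun[], ks: number[]) {
  const recall: Record<number, number> = {};
  const precision: Record<number, number> = {};
  for (const k of ks) {
    recall[k] = mean(runs.map((run) => recallAtK(run, k)));
    precision[k] = mean(runs.map((run) => precisionAtK(run, k)));
  }
  return {
    recallAtK: recall,
    precisionAtK: precision,
    mrr: mean(runs.map((run) => reciprocalRank(run))),
  };
}
```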

Phase 1: Vector Search Core

Scope

  • Chunking module (line, sentence)
  • Embedding provider abstraction + Ollama implementation
  • Custom vector store with brute-force cosine similarity
  • chunks table in SQLite
  • CLI: doc ingest and doc search
  • Evaluation harness

File Structure

packages/vector-store/src/
├── chunking/
│   ├── types.ts
│   ├── line.ts
│   └── sentence.ts
├── embeddings/
│   ├── types.ts
│   └── ollama.ts
├── stores/
│   ├── types.ts
│   └── custom.ts
├── eval/
│   ├── types.ts
│   ├── dataset.ts
│   └── metrics.ts
├── search.ts
└── index.ts

Deliverables

  • Chunk and ChunkingStrategy types
  • Line chunker
  • Sentence chunker
  • EmbeddingProvider interface
  • Ollama embedding provider
  • VectorStore interface
  • Brute-force cosine similarity
  • chunks schema migration
  • Search orchestrator
  • doc ingest <file> command
  • doc search <query> command
  • Evaluation dataset
  • doc eval command

Benchmarks

  • Chunk size vs recall@k
  • Embedding latency by provider

Phase 2: Hybrid Search

Scope

  • FTS5 integration for keyword search
  • BM25 scoring
  • Reciprocal Rank Fusion (RRF)
  • HNSW index
  • Metadata filtering

File Structure Additions

packages/vector-store/src/
├── ranking/
│   ├── bm25.ts
│   ├── rrf.ts
│   └── hybrid.ts
├── stores/
│   └── hnsw.ts

Deliverables

  • FTS5 virtual table + sync triggers
  • bm25Search() function
  • rrfFusion() function
  • HybridSearchResult type
  • hybridSearch() orchestrator
  • --mode vector | keyword | hybrid flag
  • HNSW vector store
  • --filter metadata filtering
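Metadata filtering could be as simple as exact matching on the key:value pairs passed with --filter. The sketch below assumes that semantics and compares values as strings, since CLI input arrives as text. parseFilter, parseFilters, and matchesFilters are hypothetical helpers, and richer operators such as ranges or negation are out of scope here.

```typescript
// Parses one "key:value" flag; everything after the first colon is the value.
function parseFilter(flag: string): [string, string] {
  const separator = flag.indexOf(":");
  if (separator <= 0) {
    throw new Error(`invalid filter "${flag}", expected key:value`);
  }
  return [flag.slice(0, separator), flag.slice(separator + 1)];
}

function parseFilters(flags: string[]): Record<string, unknown> {
  const filters: Record<string, unknown> = {};
  for (const flag of flags) {
    const [key, value] = parseFilter(flag);
    filters[key] = value;
  }
  return filters;
}

// True when every filter key exists in metadata with an equal string form.
function matchesFilters(
  metadata: Record<string, unknown> | undefined,
  filters?: Record<string, unknown>
): boolean {
  if (filters === undefined) return true;
  return Object.keys(filters).every(
    (key) =>
      metadata !== undefined &&
      key in metadata &&
      String(metadata[key]) === String(filters[key])
  );
}
```

A brute-force store can apply matchesFilters before scoring. An HNSW index typically has to filter after the neighbor search or over-fetch candidates, which is worth measuring in the benchmarks.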

Benchmarks

  • Vector vs keyword vs hybrid recall
  • HNSW accuracy vs brute-force
  • HNSW latency vs ef parameter
  • Custom vs LanceDB comparison

Phase 3: RAG & Evaluation

Scope

  • LLM provider abstraction
  • Reranking
  • RAG engine with citations
  • MCP tool integration
  • Provider comparison

File Structure Additions

packages/vector-store/src/
├── llm/
│   ├── types.ts
│   └── ollama.ts
├── rerank/
│   ├── types.ts
│   └── ollama.ts
├── rag/
│   ├── types.ts
│   ├── engine.ts
│   └── prompts.ts

Deliverables

  • LLMProvider interface
  • Ollama LLM provider
  • Reranker interface
  • Ollama reranker
  • runRAG() engine
  • RAG prompt templates
  • doc search --rag command
  • MCP search_documents tool
  • Provider comparison report

Benchmarks

  • Reranking impact on precision
  • Context window size vs answer quality
  • Embedding provider comparison (recall, latency)

Future

  • HTTP server (POST /rag)
  • Search debugging UI
  • OpenAI / Gemini providers
  • Transformers.js embeddings
  • Semantic chunking
  • Index persistence
  • Embeddings versioning
  • Query caching
  • Multi-modal search

Types Reference

// ─────────────────────────────────────────────────────────────
// Chunking
// ─────────────────────────────────────────────────────────────

interface Chunk {
  id: string;
  documentId: string;
  content: string;
  index: number;
  metadata: {
    page?: number;
    section?: string;
    source: string;
    [key: string]: unknown;
  };
}

type ChunkingStrategy = 'line' | 'sentence' | 'semantic';

interface Chunker {
  strategy: ChunkingStrategy;
  chunk(text: string, documentId: string, metadata?: Record<string, unknown>): Chunk[];
}

// ─────────────────────────────────────────────────────────────
// Embeddings
// ─────────────────────────────────────────────────────────────

interface EmbeddingProvider {
  name: string;
  dims: number;
  embed(texts: string[]): Promise<number[][]>;
}

// ─────────────────────────────────────────────────────────────
// Vector Store
// ─────────────────────────────────────────────────────────────

interface VectorStoreItem {
  id: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

interface VectorStoreResult {
  id: string;
  score: number;
}

interface VectorStore {
  name: string;
  insert(items: VectorStoreItem[]): Promise<void>;
  search(
    queryEmbedding: number[],
    topK: number,
    filters?: Record<string, unknown>
  ): Promise<VectorStoreResult[]>;
  delete(ids: string[]): Promise<void>;
}

// ─────────────────────────────────────────────────────────────
// LLM
// ─────────────────────────────────────────────────────────────

interface LLMProvider {
  name: string;
  generate(prompt: string, options?: { system?: string }): Promise<string>;
}

// ─────────────────────────────────────────────────────────────
// Ranking
// ─────────────────────────────────────────────────────────────

interface ScoredChunk {
  chunk: Chunk;
  vectorScore?: number;
  keywordScore?: number;
  combinedScore: number;
}

interface HybridSearchResult extends ScoredChunk {
  ranks: {
    vectorRank?: number;
    keywordRank?: number;
  };
}

interface Reranker {
  rerank(query: string, candidates: ScoredChunk[]): Promise<ScoredChunk[]>;
}

// ─────────────────────────────────────────────────────────────
// RAG
// ─────────────────────────────────────────────────────────────

interface RAGRequest {
  query: string;
  topK?: number;
  mode?: 'vector' | 'keyword' | 'hybrid';
  filters?: Record<string, unknown>;
  rerank?: boolean;
}

interface RAGChunk {
  id: string;
  content: string;
  score: number;
  source: {
    documentId: string;
    filename: string;
    page?: number;
  };
}

interface RAGResponse {
  answer: string;
  chunks: RAGChunk[];
  debug?: {
    vectorResults: ScoredChunk[];
    keywordResults: ScoredChunk[];
    rerankedResults: ScoredChunk[];
    stats: {
      vectorLatencyMs: number;
      keywordLatencyMs?: number;
      rerankLatencyMs?: number;
      totalLatencyMs: number;
    };
  };
}

// ─────────────────────────────────────────────────────────────
// Evaluation
// ─────────────────────────────────────────────────────────────

interface EvalQuery {
  id: string;
  query: string;
  relevantChunkIds: string[];
  category?: string;
}

interface EvalDataset {
  name: string;
  description?: string;
  chunks: Chunk[];
  queries: EvalQuery[];
}

interface EvalResult {
  recallAtK: Record<number, number>;
  precisionAtK: Record<number, number>;
  mrr: number;
  byCategory?: Record<string, EvalResult>;
}

CLI Reference

# Ingestion
doc ingest <file>
doc ingest <file> --chunk line|sentence|semantic
doc ingest <file> --embed-provider ollama|openai|gemini|transformers
doc ingest <file> --embed-model <model-name>

# Search
doc search <query>
doc search <query> --mode vector|keyword|hybrid
doc search <query> --vector-store custom|lancedb
doc search <query> --filter "key:value"
doc search <query> --rag
doc search <query> --rerank
doc search <query> --top-k 10
doc search <query> --json

# Evaluation
doc eval --dataset <path>
doc eval --compare ollama,openai,gemini

# Servers
doc mcp
doc serve --port 3000

References