RAG 101: Build a Private Q&A App (Starter Stack)
In a Retrieval-Augmented Generation (RAG) app, the model answers questions from your own content, safely and privately. This guide walks through a compact stack you can run on a single machine, with options for cloud-hosted or fully local components.
Architecture at a glance
1) Ingest
Load PDFs, docs, web pages, wikis. Clean text, detect language, strip boilerplate, and attach metadata (title, section, URL).
2) Chunk
Split into 300–1,000 token chunks with overlap (e.g., 10–20%). Keep headings + hierarchy for better citations.
3) Embed
Use an embedding model (small/fast or large/accurate). Store vectors + metadata in a DB.
4) Retrieve
Hybrid search (BM25 + vectors) → re-rank. Return top-k excerpts with scores.
5) Generate
LLM answers citing the retrieved snippets. Enforce formatting and refusal rules.
6) Observe
Log queries, latency, selected chunks, and user ratings for continuous improvement.
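The six stages above map onto a thin pipeline. A minimal skeleton, assuming hypothetical helper names (ingest, chunk, embed, retrieve, generate) that the sections below make concrete:

# Sketch: pipeline skeleton (hypothetical helper names; later sections fill them in)
def build_index(paths):
    docs = ingest(paths)                    # 1) load, clean, attach metadata
    chunks = chunk(docs)                    # 2) split with overlap
    store(embed(chunks), chunks)            # 3) embed and persist vectors + metadata

def answer(question):
    hits = retrieve(question)               # 4) hybrid search + re-rank
    reply = generate(question, hits)        # 5) grounded, cited answer
    log_interaction(question, hits, reply)  # 6) observe
    return reply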
Models: local vs hosted
| Option | Pros | Cons | Use when… |
|---|---|---|---|
| Hosted LLM + hosted embeddings | Best quality, less ops | Data residency, cost, rate limits | You need accuracy fast, non-sensitive docs |
| Hosted LLM + local embeddings | Lower PII exposure; cheaper retrieval | Still sends prompts/contexts out | Privacy-aware, moderate infra |
| Fully local (LLM + embeddings) | Max privacy; offline | Setup + weaker models on CPU | Highly sensitive data, air-gapped envs |
Starter pick: local vector DB + hosted LLM. You keep your document store private, and you can swap models later.
Ingest & chunking
File handling
- Convert PDFs to text with layout awareness.
- Strip headers/footers and tables of contents.
- Preserve headings (H2/H3) in metadata.
PII hygiene
- Optionally redact emails/IDs at ingest.
- Keep raw copies encrypted; log access.
Chunk sizes
- 300–600 tokens: crisp answers, more calls.
- 800–1,000 tokens: fewer calls, higher risk of topic drift (see the chunker sketch after this list).
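Chunking itself is only a few lines. A minimal sketch of a fixed-size chunker with overlap; the whitespace split stands in for a real tokenizer (in practice, use the tokenizer of your embedding model):

# Sketch: fixed-size chunks with overlap (whitespace "tokens" for brevity)
def chunk_text(text, size=500, overlap=75):
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += size - overlap  # step forward, keeping `overlap` tokens of context
    return chunks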
Embeddings & vector store
Pick an embedding model that balances speed and domain accuracy. Dimensions typically 384–1,536. Normalize vectors; store title/section/source in metadata.
Vector DB options
- SQLite + FAISS (simple, local)
- Chroma/Weaviate (dev-friendly)
- pgvector/Postgres (SQL + vectors)
Indexing tips
- Rebuild after big ingest; compact indexes.
- Store BM25 side-by-side for hybrid search.
Schema
{ id, text, vector, title, section, url, tags, ts }
# Pseudocode: ingest → embed → store (Python-like; helper names are placeholders)
docs = load_files("docs/")
chunks = chunk(docs, size=800, overlap=120)
vectors = embed([c.text for c in chunks])
db.upsert([
    {"id": c.id, "vector": v, "text": c.text, "meta": c.meta}
    for c, v in zip(chunks, vectors)
])
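A more concrete version of the sketch above, assuming sentence-transformers for embeddings and FAISS as the local index (the model name and dimensions are illustrative):

# Sketch: embed + index with sentence-transformers and FAISS (illustrative choices)
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")           # 384-dim embeddings
texts = [c.text for c in chunks]                          # chunks from the sketch above
vectors = model.encode(texts, normalize_embeddings=True)  # unit vectors: dot product = cosine

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product on normalized vectors
index.add(vectors)                           # row i corresponds to chunks[i]
# Keep metadata (title, section, url, tags, ts) in a side table keyed by row id, e.g. SQLite.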
Retrieval strategies that work
Top-k + MMR
Maximal Marginal Relevance returns diverse passages (reduces “samey” chunks). Try k=8–12; feed 4–6 to the LLM.
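MMR itself is short. A sketch over already-retrieved candidates, assuming normalized embedding vectors so the dot product approximates cosine similarity:

# Sketch: Maximal Marginal Relevance over retrieved candidate vectors
import numpy as np

def mmr(query_vec, cand_vecs, k=6, lam=0.7):
    """Select k diverse-but-relevant rows of cand_vecs; lam trades relevance vs. diversity."""
    relevance = cand_vecs @ query_vec            # similarity of each candidate to the query
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        if not selected:
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            redundancy = cand_vecs[remaining] @ cand_vecs[selected].T  # sim to already-picked
            score = lam * relevance[remaining] - (1 - lam) * redundancy.max(axis=1)
            best = remaining[int(np.argmax(score))]
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into the candidate list, in pick order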
Hybrid search
Combine vector scores with BM25 keyword scores; a 0.6/0.4 vector-to-keyword weighting is a reasonable baseline. Great for exact terms and numbers.
Re-ranking
Use a cross-encoder to re-rank top 50→top 6. Improves precision for long queries.
# Pseudocode: query flow
hits_vec = vdb.similarity_search(q, top=50)     # dense retrieval
hits_bm25 = bm25.search(q, top=50)              # keyword retrieval
hybrid = blend(hits_vec, hits_bm25, alpha=0.6)  # weighted score fusion
reranked = cross_encoder.rerank(q, hybrid)[:6]  # precision pass; keep top 6
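The blend and re-rank steps above can be sketched directly. Score fusion here is min-max normalization plus a weighted sum; the cross-encoder model name is illustrative, loaded via sentence-transformers, and hits are assumed to be dicts with id, score, and text fields:

# Sketch: hybrid score fusion + cross-encoder re-ranking (model name illustrative)
from sentence_transformers import CrossEncoder

def blend(hits_vec, hits_bm25, alpha=0.6):
    def norm(hits):
        scores = [h["score"] for h in hits]
        lo, hi = min(scores), max(scores)
        return {h["id"]: (h["score"] - lo) / ((hi - lo) or 1) for h in hits}
    v, b = norm(hits_vec), norm(hits_bm25)
    docs = {h["id"]: h for h in hits_vec + hits_bm25}           # dedupe by id
    fused = {i: alpha * v.get(i, 0.0) + (1 - alpha) * b.get(i, 0.0) for i in docs}
    return sorted(docs.values(), key=lambda h: fused[h["id"]], reverse=True)

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(q, hits, keep=6):
    scores = reranker.predict([(q, h["text"]) for h in hits])   # query-passage relevance
    ranked = sorted(zip(hits, scores), key=lambda p: p[1], reverse=True)
    return [h for h, _ in ranked[:keep]]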
Prompting + citations
System: Answer using only the provided context. If unsure, say so.
User: {question}
Context:
[1] {title} §{section} — {url}
{excerpt}
Assistant: Provide a concise answer with bullet points and cite sources as [1], [2]...
Teach the model to refuse when context is insufficient. Add a “grounded answer” check: if no chunk has a score above a threshold, return “no answer found.”
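A minimal grounding gate plus prompt assembly, as a sketch; the 0.3 threshold is a placeholder to tune against your own retrieval-score distribution:

# Sketch: refuse when retrieval is weak, otherwise build a cited prompt
MIN_SCORE = 0.3  # placeholder threshold; tune on your own score distribution

def build_prompt(question, hits):
    if not hits or max(h["score"] for h in hits) < MIN_SCORE:
        return None  # caller answers "no answer found" without calling the LLM
    context = "\n".join(
        f"[{i}] {h['title']} §{h['section']} — {h['url']}\n{h['text']}"
        for i, h in enumerate(hits, 1)
    )
    return (
        "Answer using only the provided context. If unsure, say so.\n"
        f"Question: {question}\nContext:\n{context}\n"
        "Cite sources as [1], [2]..."
    )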
Evaluation & quality
Golden set
Create 30–100 Q/A pairs with reference passages. Re-run after each change.
Metrics
- Answer faithfulness (manual spot-checks)
- Retrieval precision@k (see the sketch after this list)
- Citation coverage
- Latency (P50/P95)
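Retrieval precision@k takes only a few lines over the golden set, as a sketch (assumes each golden item records the ids of its reference passages, and retrieve is your query function):

# Sketch: retrieval precision@k over a golden set of {question, relevant_ids} items
def precision_at_k(golden_set, retrieve, k=6):
    per_query = []
    for item in golden_set:
        hit_ids = [h["id"] for h in retrieve(item["question"])[:k]]
        relevant = set(item["relevant_ids"])
        per_query.append(sum(1 for i in hit_ids if i in relevant) / k)
    return sum(per_query) / len(per_query)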
Feedback loop
Collect user thumbs-up/down, let users flag wrong citations, and auto-promote good answers to “suggested.”
Deploy & secure
Single-box dev
Docker Compose: app + vector DB + (optional) local text model. Use a .env file for keys.
Ops basics
Secrets manager, structured logs, per-tenant indexes, and nightly backups. Encrypt at rest.
Privacy
Don’t store raw prompts by default. Hash user IDs. Provide a “forget this document” button.
FAQs
Do I need a GPU?
No. You can run embeddings and a small reranker on CPU. The generator can be hosted or a lightweight local model.
How big can my corpus be?
Start with thousands of chunks on a single node. For millions, move to a scalable vector DB and sharding.
How do I prevent hallucinations?
Strict prompts, thresholded retrieval, shorter answers, and required citations. Consider an answer-verifier pass.