RAG 101: Build a Private Q&A App (Starter Stack)
In a Retrieval-Augmented Generation (RAG) app, the model answers questions from your own content, safely and privately. This guide walks through a compact stack you can run on a single machine, with options for cloud-hosted or fully local components.
Architecture at a glance
1) Ingest
Load PDFs, docs, web pages, wikis. Clean text, detect language, strip boilerplate, and attach metadata (title, section, URL).
2) Chunk
Split into 300–1,000 token chunks with overlap (e.g., 10–20%). Keep headings + hierarchy for better citations.
3) Embed
Use an embedding model (small/fast or large/accurate). Store vectors + metadata in a DB.
4) Retrieve
Hybrid search (BM25 + vectors) → re-rank. Return top-k excerpts with scores.
5) Generate
LLM answers citing the retrieved snippets. Enforce formatting and refusal rules.
6) Observe
Log queries, latency, selected chunks, and user ratings for continuous improvement.
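The six stages above map onto a thin pipeline. A minimal skeleton, assuming hypothetical helper names (ingest, chunk, embed, retrieve, generate) that the sections below make concrete:

# Sketch: pipeline skeleton (hypothetical helper names; later sections fill them in)
def build_index(paths):
    docs = ingest(paths)                    # 1) load, clean, attach metadata
    chunks = chunk(docs)                    # 2) split with overlap
    store(embed(chunks), chunks)            # 3) embed and persist vectors + metadata

def answer(question):
    hits = retrieve(question)               # 4) hybrid search + re-rank
    reply = generate(question, hits)        # 5) grounded, cited answer
    log_interaction(question, hits, reply)  # 6) observe
    return reply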
Models: local vs hosted
| Option | Pros | Cons | Use when… |
|---|---|---|---|
| Hosted LLM + hosted embeddings | Best quality, less ops | Data residency, cost, rate limits | You need accuracy fast, non-sensitive docs |
| Hosted LLM + local embeddings | Lower PII exposure; cheaper retrieval | Still sends prompts/contexts out | Privacy-aware, moderate infra |
| Fully local (LLM + embeddings) | Max privacy; offline | Setup + weaker models on CPU | Highly sensitive data, air-gapped envs |
Starter pick: local vector DB + hosted LLM. You keep your document store private, and you can swap models later.
Ingest & chunking
File handling
- Convert PDFs to text with layout awareness.
- Strip headers/footers and tables of contents.
- Preserve headings (H2/H3) in metadata.
PII hygiene
- Optionally redact emails/IDs at ingest.
- Keep raw copies encrypted; log access.
Chunk sizes
- 300–600 tokens: crisp answers, more calls.
- 800–1,000 tokens: fewer calls, higher risk of topic drift (see the chunker sketch after this list).
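Chunking itself is only a few lines. A minimal sketch of a fixed-size chunker with overlap; the whitespace split stands in for a real tokenizer (in practice, use the tokenizer of your embedding model):

# Sketch: fixed-size chunks with overlap (whitespace "tokens" for brevity)
def chunk_text(text, size=500, overlap=75):
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += size - overlap  # step forward, keeping `overlap` tokens of context
    return chunks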
Embeddings & vector store
Pick an embedding model that balances speed and domain accuracy. Dimensions typically 384–1,536. Normalize vectors; store title/section/source in metadata.
Vector DB options
- SQLite + FAISS (simple, local)
- Chroma/Weaviate (dev-friendly)
- pgvector/Postgres (SQL + vectors)
Indexing tips
- Rebuild after big ingest; compact indexes.
- Store BM25 side-by-side for hybrid search.
Schema
{ id, text, vector, title, section, url, tags, ts }
# Pseudocode: ingest → embed → store (Python-like; helper names are placeholders)
docs = load_files("docs/")
chunks = chunk(docs, size=800, overlap=120)
vectors = embed([c.text for c in chunks])
db.upsert([
    {"id": c.id, "vector": v, "text": c.text, "meta": c.meta}
    for c, v in zip(chunks, vectors)
])
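A more concrete version of the sketch above, assuming sentence-transformers for embeddings and FAISS as the local index (the model name and dimensions are illustrative):

# Sketch: embed + index with sentence-transformers and FAISS (illustrative choices)
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")           # 384-dim embeddings
texts = [c.text for c in chunks]                          # chunks from the sketch above
vectors = model.encode(texts, normalize_embeddings=True)  # unit vectors: dot product = cosine

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product on normalized vectors
index.add(vectors)                           # row i corresponds to chunks[i]
# Keep metadata (title, section, url, tags, ts) in a side table keyed by row id, e.g. SQLite.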
Retrieval strategies that work
Top-k + MMR
Maximal Marginal Relevance returns diverse passages (reduces “samey” chunks). Try k=8–12; feed 4–6 to the LLM.
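MMR itself is short. A sketch over already-retrieved candidates, assuming normalized embedding vectors so the dot product approximates cosine similarity:

# Sketch: Maximal Marginal Relevance over retrieved candidate vectors
import numpy as np

def mmr(query_vec, cand_vecs, k=6, lam=0.7):
    """Select k diverse-but-relevant rows of cand_vecs; lam trades relevance vs. diversity."""
    relevance = cand_vecs @ query_vec            # similarity of each candidate to the query
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        if not selected:
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            redundancy = cand_vecs[remaining] @ cand_vecs[selected].T  # sim to already-picked
            score = lam * relevance[remaining] - (1 - lam) * redundancy.max(axis=1)
            best = remaining[int(np.argmax(score))]
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into the candidate list, in pick order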
Hybrid search
Combine vector scores with BM25 keyword scores; a 0.6/0.4 vector-to-keyword weighting is a reasonable baseline. Great for exact terms and numbers.
Re-ranking
Use a cross-encoder to re-rank top 50→top 6. Improves precision for long queries.
# Pseudocode: query flow
hits_vec = vdb.similarity_search(q, top=50)     # dense retrieval
hits_bm25 = bm25.search(q, top=50)              # keyword retrieval
hybrid = blend(hits_vec, hits_bm25, alpha=0.6)  # weighted score fusion
reranked = cross_encoder.rerank(q, hybrid)[:6]  # precision pass; keep top 6
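The blend and re-rank steps above can be sketched directly. Score fusion here is min-max normalization plus a weighted sum; the cross-encoder model name is illustrative, loaded via sentence-transformers, and hits are assumed to be dicts with id, score, and text fields:

# Sketch: hybrid score fusion + cross-encoder re-ranking (model name illustrative)
from sentence_transformers import CrossEncoder

def blend(hits_vec, hits_bm25, alpha=0.6):
    def norm(hits):
        scores = [h["score"] for h in hits]
        lo, hi = min(scores), max(scores)
        return {h["id"]: (h["score"] - lo) / ((hi - lo) or 1) for h in hits}
    v, b = norm(hits_vec), norm(hits_bm25)
    docs = {h["id"]: h for h in hits_vec + hits_bm25}           # dedupe by id
    fused = {i: alpha * v.get(i, 0.0) + (1 - alpha) * b.get(i, 0.0) for i in docs}
    return sorted(docs.values(), key=lambda h: fused[h["id"]], reverse=True)

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(q, hits, keep=6):
    scores = reranker.predict([(q, h["text"]) for h in hits])   # query-passage relevance
    ranked = sorted(zip(hits, scores), key=lambda p: p[1], reverse=True)
    return [h for h, _ in ranked[:keep]]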
Prompting + citations
System: Answer using only the provided context. If unsure, say so.
User: {question}
Context:
[1] {title} §{section} — {url}
{excerpt}
Assistant: Provide a concise answer with bullet points and cite sources as [1], [2]...
Teach the model to refuse when context is insufficient. Add a “grounded answer” check: if no chunk has a score above a threshold, return “no answer found.”
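A minimal grounding gate plus prompt assembly, as a sketch; the 0.3 threshold is a placeholder to tune against your own retrieval-score distribution:

# Sketch: refuse when retrieval is weak, otherwise build a cited prompt
MIN_SCORE = 0.3  # placeholder threshold; tune on your own score distribution

def build_prompt(question, hits):
    if not hits or max(h["score"] for h in hits) < MIN_SCORE:
        return None  # caller answers "no answer found" without calling the LLM
    context = "\n".join(
        f"[{i}] {h['title']} §{h['section']} — {h['url']}\n{h['text']}"
        for i, h in enumerate(hits, 1)
    )
    return (
        "Answer using only the provided context. If unsure, say so.\n"
        f"Question: {question}\nContext:\n{context}\n"
        "Cite sources as [1], [2]..."
    )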
Evaluation & quality
Golden set
Create 30–100 Q/A pairs with reference passages. Re-run after each change.
Metrics
- Answer faithfulness (manual spot-checks)
- Retrieval precision@k (see the sketch after this list)
- Citation coverage
- Latency (P50/P95)
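Retrieval precision@k takes only a few lines over the golden set, as a sketch (assumes each golden item records the ids of its reference passages, and retrieve is your query function):

# Sketch: retrieval precision@k over a golden set of {question, relevant_ids} items
def precision_at_k(golden_set, retrieve, k=6):
    per_query = []
    for item in golden_set:
        hit_ids = [h["id"] for h in retrieve(item["question"])[:k]]
        relevant = set(item["relevant_ids"])
        per_query.append(sum(1 for i in hit_ids if i in relevant) / k)
    return sum(per_query) / len(per_query)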
Feedback loop
Collect user thumbs-up/down, let users flag wrong citations, and auto-promote good answers to “suggested.”
Deploy & secure
Single-box dev
Docker Compose: app + vector DB + (optional) local text model. Use a .env file for keys.
Ops basics
Secrets manager, structured logs, per-tenant indexes, and nightly backups. Encrypt at rest.
Privacy
Don’t store raw prompts by default. Hash user IDs. Provide a “forget this document” button.
FAQs
Do I need a GPU?
No. You can run embeddings and a small reranker on CPU. The generator can be hosted or a lightweight local model.
How big can my corpus be?
Start with thousands of chunks on a single node. For millions, move to a scalable vector DB and sharding.
How do I prevent hallucinations?
Strict prompts, thresholded retrieval, shorter answers, and required citations. Consider an answer-verifier pass.