Modular RAG
A production-oriented, modular Retrieval-Augmented Generation (RAG) framework evolved from an earlier monolithic RAG system with guardrails.
- Problem: Monolithic RAG pipeline was hard to extend and debug.
- Approach: Modular components + explicit flow orchestration.
- Outcome: Hybrid retrieval + agentic-ready workflows without refactoring core logic.
This project represents a deliberate migration from a tightly coupled, single-pipeline RAG implementation to a modular, composable, and extensible architecture. The goal was to support hybrid retrieval, conditional execution, and future agentic workflows without refactoring core logic.
Key architectural ideas:
- Separation of computation and control — retrieval, embeddings, generation, and reranking are implemented as reusable modules, while execution logic lives in explicit flows.
- Flow-based orchestration — RAG pipelines are defined as executable graphs (sequence, routing, parallelism), enabling simple, hybrid, and agentic-ready workflows.
- Explicit wiring, no magic — predictable startup registration and debuggable flows.
- Hybrid retrieval by design — supports BM25, FAISS dense retrieval, Reciprocal Rank Fusion (RRF), and Maximal Marginal Relevance (MMR).
- Model- and vendor-agnostic — OpenAI-compatible LLM layer supporting local (Ollama) and hosted providers without leaking vendor logic.
- Observability-first — structured logging, tracing, and metrics are built in from the API boundary through flow execution. Each flow step logs: step name, duration, retriever used, top-k doc IDs, and any error category.
Compared to the original monolithic RAG system, this modular approach enables incremental evolution: from simple RAG, to hybrid retrieval, to conditional and agentic workflows — without rewriting existing components.
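The fusion step behind the hybrid retrieval mentioned above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the project's actual code; the `bm25_ranking` / `dense_ranking` lists are hypothetical retriever outputs, and `k=60` is the conventional RRF constant.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7"]   # lexical (BM25) results, best first
dense_ranking = ["d1", "d4", "d3"]  # dense (FAISS) results, best first
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Documents that appear high in both lists (here `d1`) rise to the top without any score normalization across the two retrievers, which is why RRF is a common default for hybrid fusion.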
Engineering Notes
API & Interfaces
- FastAPI boundary with typed request/response models and input validation.
- Deterministic wiring: modules + flows registered at startup; explicit dependency injection (no hidden globals).
- Provider layer: OpenAI-compatible client abstraction to swap hosted vs local (e.g., Ollama) without leaking vendor logic.
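The provider-layer idea can be sketched with a structural interface; the `ChatClient` protocol and the class/parameter names below are my own illustrative naming, not the project's API, and the clients are stubbed rather than making real HTTP calls.

```python
from typing import Protocol

class ChatClient(Protocol):
    """Vendor-neutral interface; flow code depends only on this shape."""
    def complete(self, prompt: str) -> str: ...

class HostedClient:
    # In a real system this would wrap an OpenAI-compatible HTTP API;
    # stubbed here to keep the sketch self-contained.
    def __init__(self, base_url: str, model: str) -> None:
        self.base_url, self.model = base_url, model

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"

class LocalClient:
    # Ollama exposes an OpenAI-compatible endpoint, so the same shape works.
    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        return f"[local:{self.model}] {prompt}"

def make_client(provider: str) -> ChatClient:
    # Deterministic wiring: the provider is chosen once at startup,
    # not per request, and no vendor logic leaks past this function.
    if provider == "local":
        return LocalClient(model="llama3")
    return HostedClient(base_url="https://api.example.com/v1", model="gpt-4o-mini")
```

Because both clients satisfy the same protocol, swapping hosted for local is a one-line change at the wiring layer.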
Retrieval & Flow Execution
- Hybrid retrieval: BM25 + dense (FAISS) with configurable fusion (RRF) and diversification (MMR).
- Flow graph orchestration: sequence / routing / parallel branches; supports conditional execution for “agentic-ready” patterns.
- Extensibility: add a retriever/reranker/generator by implementing a small interface + registering once.
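The sequence/routing combinators above can be sketched as a tiny flow executor. The step names and the context-dict convention here are illustrative assumptions, not the project's actual orchestration API.

```python
from typing import Any, Callable

Step = Callable[[dict[str, Any]], dict[str, Any]]

def sequence(*steps: Step) -> Step:
    """Run steps in order, threading a shared context dict through them."""
    def run(ctx: dict[str, Any]) -> dict[str, Any]:
        for step in steps:
            ctx = step(ctx)
        return ctx
    return run

def route(predicate: Callable[[dict[str, Any]], bool],
          if_true: Step, if_false: Step) -> Step:
    """Conditional branch: the 'agentic-ready' building block."""
    def run(ctx: dict[str, Any]) -> dict[str, Any]:
        return if_true(ctx) if predicate(ctx) else if_false(ctx)
    return run

# Illustrative steps (real ones would call retrievers/rerankers/LLMs)
def retrieve(ctx): return {**ctx, "docs": ["d1", "d2"]}
def rerank(ctx): return {**ctx, "docs": list(reversed(ctx["docs"]))}
def generate(ctx): return {**ctx, "answer": f"answer from {ctx['docs'][0]}"}

flow = sequence(
    retrieve,
    route(lambda ctx: ctx.get("rerank", False), rerank, lambda ctx: ctx),
    generate,
)
result = flow({"query": "q", "rerank": True})
```

New behavior (a different reranker, a new branch) composes at the flow level, which is the point of keeping computation and control separate.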
Reliability & Observability
- Structured logging at the API boundary and per-flow step (inputs/outputs summarized, errors classified).
- Tracing hooks for end-to-end request → flow → module timing.
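One way to capture the per-step fields described earlier (step name, duration, error category, top-k doc IDs) is a logging decorator around each step; this is a hedged sketch under that assumption, not the project's actual instrumentation.

```python
import json
import logging
import time
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("flow")

def logged_step(name: str, fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Wrap a flow step with structured (JSON) logging of name, timing, errors."""
    def run(ctx: dict) -> dict:
        start = time.perf_counter()
        record: dict[str, Any] = {"step": name}
        try:
            out = fn(ctx)
            record["top_k_doc_ids"] = out.get("docs", [])[:5]
            return out
        except Exception as exc:
            # Classify errors by exception type for later aggregation
            record["error_category"] = type(exc).__name__
            raise
        finally:
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
            log.info(json.dumps(record))
    return run

step = logged_step("retrieve", lambda ctx: {**ctx, "docs": ["d1", "d2"]})
out = step({"query": "q"})
```

Emitting one JSON record per step keeps logs machine-parseable, so tracing and metrics can be derived from the same stream.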
What I'd Build Next
- Evaluation harness: offline retrieval metrics + answer quality checks with a versioned dataset.
- Config-driven flows: define flows in YAML/JSON for faster experimentation.
- Safety/guardrails: stricter citation enforcement, refusal policies, and prompt injection resistance tests.
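A config-driven flow might look like the fragment below. The schema is purely illustrative, an assumption about what a YAML flow definition could contain, not an implemented format.

```yaml
flow: hybrid_rag
steps:
  - retrieve:
      retrievers: [bm25, faiss]
      fusion: rrf          # reciprocal rank fusion
      diversify: mmr
      top_k: 8
  - rerank:
      model: cross-encoder
      when: "query.length > 20"   # conditional execution
  - generate:
      provider: local             # e.g. Ollama
      model: llama3
```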