Modular RAG

A production-oriented, modular Retrieval-Augmented Generation (RAG) framework evolved from an earlier monolithic RAG system with guardrails.

  • Problem: Monolithic RAG pipeline was hard to extend and debug.
  • Approach: Modular components + explicit flow orchestration.
  • Outcome: Hybrid retrieval + agentic-ready workflows without refactoring core logic.

This project represents a deliberate migration from a tightly coupled, single-pipeline RAG implementation to a modular, composable, and extensible architecture. The goal was to support hybrid retrieval, conditional execution, and future agentic workflows without refactoring core logic.

Key architectural ideas:

  • Separation of computation and control — retrieval, embeddings, generation, and reranking are implemented as reusable modules, while execution logic lives in explicit flows.
  • Flow-based orchestration — RAG pipelines are defined as executable graphs (sequence, routing, parallelism), enabling simple, hybrid, and agentic-ready workflows; a sketch of the pattern follows this list.
  • Explicit wiring, no magic — predictable startup registration and debuggable flows.
  • Hybrid retrieval by design — supports BM25, FAISS dense retrieval, Reciprocal Rank Fusion (RRF), and Maximal Marginal Relevance (MMR).
  • Model- and vendor-agnostic — OpenAI-compatible LLM layer supporting local (Ollama) and hosted providers without leaking vendor logic.
  • Observability-first — structured logging, tracing, and metrics are built in from the API boundary through flow execution. Each flow step logs: step name, duration, retriever used, top-k doc IDs, and any error category.
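
To make the split between computation and control concrete, here is a minimal sketch of the module/flow idea. It is a toy illustration rather than the project's actual API: the `Module` protocol, `Flow` class, and `EchoRetriever` are invented names for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


class Module(Protocol):
    """A reusable unit of computation: retriever, reranker, generator, ..."""
    def run(self, state: dict[str, Any]) -> dict[str, Any]: ...


@dataclass
class EchoRetriever:
    """Toy retriever standing in for the real BM25/FAISS modules."""
    corpus: list[str]

    def run(self, state: dict[str, Any]) -> dict[str, Any]:
        query = state["query"].lower()
        state["docs"] = [doc for doc in self.corpus if query in doc.lower()]
        return state


@dataclass
class Flow:
    """Explicit control: an ordered list of named steps over shared state."""
    steps: list[tuple[str, Module]] = field(default_factory=list)

    def add(self, name: str, module: Module) -> "Flow":
        self.steps.append((name, module))
        return self

    def execute(self, state: dict[str, Any]) -> dict[str, Any]:
        for _name, module in self.steps:
            state = module.run(state)  # computation stays inside modules
        return state                   # orchestration stays inside the flow


# Wiring is explicit and happens once, at startup.
flow = Flow().add("retrieve", EchoRetriever(corpus=["RAG basics", "FAISS index notes"]))
print(flow.execute({"query": "FAISS"})["docs"])  # ['FAISS index notes']
```

Swapping `EchoRetriever` for a BM25 or FAISS module changes only the wiring; the flow definition and the rest of the pipeline stay untouched.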

Compared to the original monolithic RAG system, this modular approach enables incremental evolution: from simple RAG, to hybrid retrieval, to conditional and agentic workflows — without rewriting existing components.



Engineering Notes


API & Interfaces
  • FastAPI boundary with typed request/response models and input validation.
  • Deterministic wiring: modules + flows registered at startup; explicit dependency injection (no hidden globals).
  • Provider layer: OpenAI-compatible client abstraction to swap hosted vs local (e.g., Ollama) without leaking vendor logic.
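
A minimal sketch of the provider idea, assuming the `openai` Python client and Ollama's OpenAI-compatible endpoint; the `make_llm_client` helper, base URL, and model tag are illustrative, not the project's actual configuration.

```python
from openai import OpenAI  # pip install openai


def make_llm_client(provider: str) -> OpenAI:
    """Return an OpenAI-compatible client; callers never see vendor details."""
    if provider == "ollama":
        # Ollama exposes an OpenAI-compatible API; the api_key is a placeholder.
        return OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    return OpenAI()  # hosted provider, configured via OPENAI_API_KEY


client = make_llm_client("ollama")
reply = client.chat.completions.create(
    model="llama3",  # any locally pulled model tag
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(reply.choices[0].message.content)
```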
Retrieval & Flow Execution
  • Hybrid retrieval: BM25 + dense (FAISS) with configurable fusion (RRF) and diversification (MMR); the RRF step is sketched after this list.
  • Flow graph orchestration: sequence / routing / parallel branches; supports conditional execution for “agentic-ready” patterns.
  • Extensibility: add a retriever/reranker/generator by implementing a small interface + registering once.
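
The RRF fusion mentioned above is small enough to show in full. A minimal sketch follows; the function name and example doc IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked doc-ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that return it;
    k = 60 is the constant commonly used in the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["doc3", "doc1", "doc7"]   # lexical ranking
dense_hits = ["doc1", "doc9", "doc3"]  # FAISS dense ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# doc1 and doc3 rank highest because both retrievers agree on them
```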
Reliability & Observability
  • Structured logging at the API boundary and per-flow step (inputs/outputs summarized, errors classified); a sketch follows this list.
  • Tracing hooks for end-to-end request → flow → module timing.
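
The per-step fields described earlier (step name, duration, retriever used, top-k doc IDs, error category) map naturally onto structured log records. A minimal sketch using the standard-library logger; field names and error categories are illustrative, not the project's actual schema.

```python
import json
import logging
import time
from typing import Any, Callable

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("rag.flow")


def log_step(name: str, retriever: str, run: Callable[[], list[dict]]) -> list[dict]:
    """Execute one flow step and emit a structured record of what it did."""
    record: dict[str, Any] = {"step": name, "retriever": retriever, "error_category": None}
    start = time.perf_counter()
    try:
        result = run()
        record["top_k_doc_ids"] = [doc["id"] for doc in result][:5]
        return result
    except TimeoutError:
        record["error_category"] = "timeout"
        raise
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps(record))


docs = log_step("retrieve", "bm25", lambda: [{"id": "doc1"}, {"id": "doc3"}])
```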
What I'd Build Next
  • Evaluation harness: offline retrieval metrics + answer quality checks with a versioned dataset.
  • Config-driven flows: define flows in YAML/JSON for faster experimentation.
  • Safety/guardrails: stricter citation enforcement, refusal policies, and prompt injection resistance tests.