Modular RAG
A production-oriented, modular Retrieval-Augmented Generation (RAG) framework evolved from an earlier monolithic RAG system with guardrails.
- Problem: Monolithic RAG pipeline was hard to extend and debug.
- Approach: Modular components + explicit flow orchestration.
- Outcome: Hybrid retrieval + agentic-ready workflows without refactoring core logic.
This project represents a deliberate migration from a tightly coupled, single-pipeline RAG implementation to a modular, composable, and extensible architecture. The goal was to support hybrid retrieval, conditional execution, and future agentic workflows without refactoring core logic.
Key architectural ideas:
- Separation of computation and control — retrieval, embeddings, generation, and reranking are implemented as reusable modules, while execution logic lives in explicit flows.
- Flow-based orchestration — RAG pipelines are defined as executable graphs (sequence, routing, parallelism), enabling simple, hybrid, and agentic-ready workflows.
- Explicit wiring, no magic — predictable startup registration and debuggable flows.
- Hybrid retrieval by design — supports BM25, FAISS dense retrieval, Reciprocal Rank Fusion (RRF), and Maximal Marginal Relevance (MMR).
- Model- and vendor-agnostic — OpenAI-compatible LLM layer supporting local (Ollama) and hosted providers without leaking vendor logic.
- Observability-first — structured logging, tracing, and metrics are built in from the API boundary through flow execution. Each flow step logs: step name, duration, retriever used, top-k doc IDs, and any error category.
Compared to the original monolithic RAG system, this modular approach enables incremental evolution: from simple RAG, to hybrid retrieval, to conditional and agentic workflows — without rewriting existing components.
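The fusion step behind the hybrid retrieval mentioned above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the project's actual code; the `bm25_ranking` / `dense_ranking` lists are hypothetical retriever outputs, and `k=60` is the conventional RRF constant.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7"]   # lexical (BM25) results, best first
dense_ranking = ["d1", "d4", "d3"]  # dense (FAISS) results, best first
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Documents that appear high in both lists (here `d1`) rise to the top without any score normalization across the two retrievers, which is why RRF is a common default for hybrid fusion.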
Engineering Notes
API & Interfaces
- FastAPI boundary with typed request/response models and input validation.
- Deterministic wiring: modules + flows registered at startup; explicit dependency injection (no hidden globals).
- Provider layer: OpenAI-compatible client abstraction to swap hosted vs local (e.g., Ollama) without leaking vendor logic.
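The provider-layer idea can be sketched with a structural interface; the `ChatClient` protocol and the class/parameter names below are my own illustrative naming, not the project's API, and the clients are stubbed rather than making real HTTP calls.

```python
from typing import Protocol

class ChatClient(Protocol):
    """Vendor-neutral interface; flow code depends only on this shape."""
    def complete(self, prompt: str) -> str: ...

class HostedClient:
    # In a real system this would wrap an OpenAI-compatible HTTP API;
    # stubbed here to keep the sketch self-contained.
    def __init__(self, base_url: str, model: str) -> None:
        self.base_url, self.model = base_url, model

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"

class LocalClient:
    # Ollama exposes an OpenAI-compatible endpoint, so the same shape works.
    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        return f"[local:{self.model}] {prompt}"

def make_client(provider: str) -> ChatClient:
    # Deterministic wiring: the provider is chosen once at startup,
    # not per request, and no vendor logic leaks past this function.
    if provider == "local":
        return LocalClient(model="llama3")
    return HostedClient(base_url="https://api.example.com/v1", model="gpt-4o-mini")
```

Because both clients satisfy the same protocol, swapping hosted for local is a one-line change at the wiring layer.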
Retrieval & Flow Execution
- Hybrid retrieval: BM25 + dense (FAISS) with configurable fusion (RRF) and diversification (MMR).
- Flow graph orchestration: sequence / routing / parallel branches; supports conditional execution for “agentic-ready” patterns.
- Extensibility: add a retriever/reranker/generator by implementing a small interface + registering once.
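The sequence/routing combinators above can be sketched as a tiny flow executor. The step names and the context-dict convention here are illustrative assumptions, not the project's actual orchestration API.

```python
from typing import Any, Callable

Step = Callable[[dict[str, Any]], dict[str, Any]]

def sequence(*steps: Step) -> Step:
    """Run steps in order, threading a shared context dict through them."""
    def run(ctx: dict[str, Any]) -> dict[str, Any]:
        for step in steps:
            ctx = step(ctx)
        return ctx
    return run

def route(predicate: Callable[[dict[str, Any]], bool],
          if_true: Step, if_false: Step) -> Step:
    """Conditional branch: the 'agentic-ready' building block."""
    def run(ctx: dict[str, Any]) -> dict[str, Any]:
        return if_true(ctx) if predicate(ctx) else if_false(ctx)
    return run

# Illustrative steps (real ones would call retrievers/rerankers/LLMs)
def retrieve(ctx): return {**ctx, "docs": ["d1", "d2"]}
def rerank(ctx): return {**ctx, "docs": list(reversed(ctx["docs"]))}
def generate(ctx): return {**ctx, "answer": f"answer from {ctx['docs'][0]}"}

flow = sequence(
    retrieve,
    route(lambda ctx: ctx.get("rerank", False), rerank, lambda ctx: ctx),
    generate,
)
result = flow({"query": "q", "rerank": True})
```

New behavior (a different reranker, a new branch) composes at the flow level, which is the point of keeping computation and control separate.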
Reliability & Observability
- Structured logging at the API boundary and per-flow step (inputs/outputs summarized, errors classified).
- Tracing hooks for end-to-end request → flow → module timing.
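One way to capture the per-step fields described earlier (step name, duration, error category, top-k doc IDs) is a logging decorator around each step; this is a hedged sketch under that assumption, not the project's actual instrumentation.

```python
import json
import logging
import time
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("flow")

def logged_step(name: str, fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Wrap a flow step with structured (JSON) logging of name, timing, errors."""
    def run(ctx: dict) -> dict:
        start = time.perf_counter()
        record: dict[str, Any] = {"step": name}
        try:
            out = fn(ctx)
            record["top_k_doc_ids"] = out.get("docs", [])[:5]
            return out
        except Exception as exc:
            # Classify errors by exception type for later aggregation
            record["error_category"] = type(exc).__name__
            raise
        finally:
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
            log.info(json.dumps(record))
    return run

step = logged_step("retrieve", lambda ctx: {**ctx, "docs": ["d1", "d2"]})
out = step({"query": "q"})
```

Emitting one JSON record per step keeps logs machine-parseable, so tracing and metrics can be derived from the same stream.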
What I'd Build Next
- Evaluation harness: offline retrieval metrics + answer quality checks with a versioned dataset.
- Config-driven flows: define flows in YAML/JSON for faster experimentation.
- Safety/guardrails: stricter citation enforcement, refusal policies, and prompt injection resistance tests.
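A config-driven flow might look like the fragment below. The schema is purely illustrative, an assumption about what a YAML flow definition could contain, not an implemented format.

```yaml
flow: hybrid_rag
steps:
  - retrieve:
      retrievers: [bm25, faiss]
      fusion: rrf          # reciprocal rank fusion
      diversify: mmr
      top_k: 8
  - rerank:
      model: cross-encoder
      when: "query.length > 20"   # conditional execution
  - generate:
      provider: local             # e.g. Ollama
      model: llama3
```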