Hybrid Search RAG
Project Snapshot
- Category: Generative AI / Retrieval-Augmented Generation
- Focus: Hybrid retrieval, search relevance, low-latency RAG
- Architecture: Modular, API-first, production-ready
- GitHub: Hybrid Search RAG
Executive Summary
Hybrid Search RAG is a production-ready Retrieval-Augmented Generation system designed to improve answer relevance and robustness by combining sparse keyword-based retrieval with dense semantic search. The system uses Pinecone for scalable vector storage, FastAPI for serving, and Llama 3 hosted on Groq for low-latency, real-time generation.
Problem Statement
Purely embedding-based RAG systems often fail in real-world scenarios due to:
- Poor handling of exact keywords, numbers, and named entities
- Reduced recall for short or ambiguous queries
- Over-reliance on dense similarity, which can surface loosely related passages while missing the context a query actually needs
- Lack of production-grade APIs and persistent storage
Solution Overview
I designed Hybrid Search RAG as a sparse–dense retrieval system that explicitly balances precision and recall before generation. Instead of assuming a single retrieval strategy, the system fuses results from multiple retrievers and serves answers through a production API.
- BM25-based sparse retrieval for exact keyword matching
- Dense semantic retrieval backed by Pinecone vector storage
- Weighted score fusion to combine sparse and dense signals (see the sketch after this list)
- FastAPI-based service for real-time inference
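To make the fusion step concrete, the sketch below shows one common way the weighted combination can work: both retrievers' scores are normalized to a shared scale and blended with a configurable weight. The `alpha` parameter and min-max normalization are illustrative assumptions, not the project's exact implementation.

```python
# Illustrative sketch of weighted sparse-dense score fusion.
# `alpha` and min-max normalization are assumptions, not the exact implementation.

def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1] so BM25 and dense signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def fuse_scores(bm25_scores: dict[str, float],
                dense_scores: dict[str, float],
                alpha: float = 0.5) -> list[tuple[str, float]]:
    """Weighted fusion: alpha * dense + (1 - alpha) * BM25, per document id."""
    bm25_n = min_max_normalize(bm25_scores)
    dense_n = min_max_normalize(dense_scores)
    doc_ids = set(bm25_n) | set(dense_n)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * bm25_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

A higher `alpha` favors semantic matches, while a lower one favors exact keyword hits, which is what makes the precision/recall balance tunable per workload.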
Architecture & Approach
- Modular Python architecture separating data, retrieval, and generation
- BM25 retrieval performed in-memory for fast keyword search
- Pinecone used for persistent, scalable dense vector retrieval
- Hybrid retriever fuses scores using configurable weighting
- Llama 3 on Groq used for low-latency, context-grounded generation
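As a rough illustration of the generation step, the sketch below shows a context-grounded call to Llama 3 via the Groq Python client. The model id (`llama3-8b-8192`) and the prompt template are assumptions rather than the project's exact configuration.

```python
# Sketch of context-grounded generation via Groq.
# Model id and prompt template are assumptions, not the project's exact configuration.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def generate_answer(question: str, contexts: list[str]) -> str:
    """Answer the question using only the retrieved context passages."""
    context_block = "\n\n".join(contexts)
    response = client.chat.completions.create(
        model="llama3-8b-8192",  # assumed Llama 3 model id on Groq
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say so."},
            {"role": "user",
             "content": f"Context:\n{context_block}\n\nQuestion: {question}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
```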
Key Capabilities
• Hybrid Sparse–Dense Retrieval
Combines BM25 keyword search with dense semantic retrieval to improve recall and robustness across diverse query types.
• Pinecone-Backed Vector Storage
Uses Pinecone for scalable, persistent vector indexing and low-latency semantic search in production environments.
• API-First Design with FastAPI
Exposes the RAG pipeline through a clean REST API with interactive Swagger documentation; a minimal endpoint sketch follows this list.
• Low-Latency LLM Inference
Integrates Llama 3 via Groq to enable fast, real-time response generation suitable for interactive applications.
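A minimal version of the serving layer might look like the sketch below. The `/query` route, the request/response schemas, and the `hybrid_retrieve` / `generate_answer` stubs are hypothetical stand-ins for the project's actual modules.

```python
# Minimal FastAPI sketch of the query endpoint. Route name, schemas, and the
# retrieval/generation stubs are hypothetical placeholders for the real modules.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Hybrid Search RAG")

class QueryRequest(BaseModel):
    question: str
    top_k: int = 5

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]

def hybrid_retrieve(question: str, top_k: int) -> list[str]:
    """Placeholder for the hybrid sparse-dense retriever described above."""
    return []

def generate_answer(question: str, contexts: list[str]) -> str:
    """Placeholder for Llama 3 generation via Groq."""
    return "..."

@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest) -> QueryResponse:
    # Retrieve with the hybrid retriever, then generate a grounded answer.
    contexts = hybrid_retrieve(request.question, top_k=request.top_k)
    answer = generate_answer(request.question, contexts)
    return QueryResponse(answer=answer, sources=contexts)
```

Because the pipeline is served through FastAPI, interactive Swagger documentation is generated automatically at `/docs`.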
Impact & Outcomes
- Improved retrieval recall compared to dense-only approaches
- More reliable answers for keyword-heavy and factual queries
- Enabled scalable, persistent retrieval using Pinecone
- Delivered a deployable RAG system suitable for real-world use
Tech Stack
Languages: Python
Frameworks: FastAPI
LLM: Llama 3 (served via Groq)
Embeddings: Sentence Transformers (MiniLM)
Vector Store: Pinecone
Retrieval: BM25, Hybrid Fusion
Deployment: API-first, production-ready design
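To show how the embedding and vector-store pieces fit together, here is a rough indexing and query sketch assuming a Sentence Transformers MiniLM model (`all-MiniLM-L6-v2`) and the Pinecone Python client; the index name, model id, and metadata layout are illustrative assumptions.

```python
# Sketch of dense indexing and querying with Sentence Transformers + Pinecone.
# Index name, model id, and metadata fields are assumptions, not the exact setup.
import os
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("hybrid-search-rag")            # assumed index name
model = SentenceTransformer("all-MiniLM-L6-v2")  # MiniLM, 384-dim embeddings

def index_documents(docs: dict[str, str]) -> None:
    """Embed each document and upsert it into Pinecone with its text as metadata."""
    vectors = [
        {"id": doc_id, "values": model.encode(text).tolist(), "metadata": {"text": text}}
        for doc_id, text in docs.items()
    ]
    index.upsert(vectors=vectors)

def dense_search(query: str, top_k: int = 5) -> list[tuple[str, float]]:
    """Return (doc_id, similarity score) pairs for the top-k semantic matches."""
    result = index.query(vector=model.encode(query).tolist(),
                         top_k=top_k, include_metadata=True)
    return [(match.id, match.score) for match in result.matches]
```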