Hybrid Search RAG

Project Snapshot

  • Category: Generative AI / Retrieval-Augmented Generation
  • Focus: Hybrid retrieval, search relevance, low-latency RAG
  • Architecture: Modular, API-first, production-ready
  • GitHub: Hybrid Search RAG

Executive Summary

Hybrid Search RAG is a production-ready Retrieval-Augmented Generation system designed to improve answer relevance and robustness by combining sparse keyword-based retrieval with dense semantic search. The system uses Pinecone for scalable vector storage, FastAPI for serving, and Llama 3 hosted on Groq for low-latency, real-time generation.

Problem Statement

Purely embedding-based RAG systems often fail in real-world scenarios due to:

  • Poor handling of exact keywords, numbers, and named entities
  • Reduced recall for short or ambiguous queries
  • Over-reliance on dense similarity, which causes relevant context to be missed
  • Lack of production-grade APIs and persistent storage

Solution Overview

I designed Hybrid Search RAG as a sparse–dense retrieval system that explicitly balances precision and recall before generation. Instead of assuming a single retrieval strategy, the system fuses results from multiple retrievers and serves answers through a production API.

  • BM25-based sparse retrieval for exact keyword matching
  • Dense semantic retrieval backed by Pinecone vector storage
  • Weighted score fusion to combine sparse and dense signals (a minimal sketch follows this list)
  • FastAPI-based service for real-time inference
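
A minimal sketch of the weighted fusion step, assuming min-max normalization and a configurable alpha weight; these are illustrative choices, not the exact production logic:

    def min_max_normalize(scores):
        """Scale scores to [0, 1] so sparse and dense values are comparable."""
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {doc_id: 1.0 for doc_id in scores}
        return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

    def fuse_scores(sparse, dense, alpha=0.5):
        """Blend BM25 (sparse) and vector (dense) scores; alpha weights the dense signal."""
        sparse_n, dense_n = min_max_normalize(sparse), min_max_normalize(dense)
        doc_ids = set(sparse_n) | set(dense_n)
        fused = {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
                 for d in doc_ids}
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

Setting alpha toward 1.0 favors semantic matches, while lower values favor exact keyword hits, which is how precision and recall are traded off before generation.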

Architecture & Approach

  • Modular Python architecture separating data, retrieval, and generation
  • BM25 retrieval performed in-memory for fast keyword search (illustrated in the sketch after this list)
  • Pinecone used for persistent, scalable dense vector retrieval
  • Hybrid retriever fuses scores using configurable weighting
  • Llama 3 on Groq used for low-latency, context-grounded generation
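
For illustration, a minimal in-memory BM25 retriever built on the rank_bm25 package; the corpus and whitespace tokenizer below are placeholders rather than the project's actual data pipeline:

    from rank_bm25 import BM25Okapi

    corpus = [
        "Pinecone provides managed vector storage for semantic search.",
        "BM25 ranks documents by keyword overlap with the query.",
        "FastAPI exposes the RAG pipeline as a REST service.",
    ]
    tokenized_corpus = [doc.lower().split() for doc in corpus]   # naive whitespace tokenizer
    bm25 = BM25Okapi(tokenized_corpus)

    # One relevance score per corpus document for the tokenized query.
    sparse_scores = bm25.get_scores("keyword search with bm25".split())
    best = max(zip(corpus, sparse_scores), key=lambda pair: pair[1])
    print(best)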

Key Capabilities

• Hybrid Sparse–Dense Retrieval

Combines BM25 keyword search with dense semantic retrieval to improve recall and robustness across diverse query types.
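
A sketch of how a hybrid query might flow end to end, assuming a MiniLM encoder, a Pinecone index named hybrid-rag (hypothetical), and the fuse_scores helper sketched above:

    from pinecone import Pinecone
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")    # MiniLM embeddings
    pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
    index = pc.Index("hybrid-rag")                        # hypothetical index name

    def hybrid_retrieve(query, bm25, doc_ids, top_k=5, alpha=0.5):
        # Dense side: embed the query and ask Pinecone for nearest neighbours.
        query_vec = encoder.encode(query).tolist()
        matches = index.query(vector=query_vec, top_k=top_k).matches
        dense = {m.id: m.score for m in matches}

        # Sparse side: BM25 scores over the same document ids.
        sparse = dict(zip(doc_ids, bm25.get_scores(query.lower().split())))

        # Blend the two signals with the weighted fusion sketched earlier.
        return fuse_scores(sparse, dense, alpha=alpha)[:top_k]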

• Pinecone-Backed Vector Storage

Uses Pinecone for scalable, persistent vector indexing and low-latency semantic search in production environments.
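
A sketch of the indexing side, assuming the current Pinecone Python SDK, a serverless index, and 384-dimensional MiniLM embeddings; the index name and region are illustrative:

    from pinecone import Pinecone, ServerlessSpec
    from sentence_transformers import SentenceTransformer

    pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
    index_name = "hybrid-rag"                              # hypothetical index name

    # Create the index once; all-MiniLM-L6-v2 embeddings are 384-dimensional.
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=384,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # illustrative region
        )
    index = pc.Index(index_name)

    # Embed document chunks and upsert them with their text as metadata.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["First document chunk ...", "Second document chunk ..."]
    index.upsert(vectors=[
        {"id": f"chunk-{i}", "values": encoder.encode(text).tolist(), "metadata": {"text": text}}
        for i, text in enumerate(chunks)
    ])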

• API-First Design with FastAPI

Exposes the RAG pipeline through a clean REST API with interactive Swagger documentation.
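
A minimal FastAPI sketch of what the query endpoint could look like; the route, request schema, and answer_query stub are illustrative, not the project's exact interface:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="Hybrid Search RAG")

    class QueryRequest(BaseModel):
        question: str
        top_k: int = 5

    class QueryResponse(BaseModel):
        answer: str
        sources: list[str]

    def answer_query(question: str, top_k: int):
        # Stand-in for the real retrieve -> fuse -> generate pipeline.
        return f"Stub answer for: {question}", []

    @app.post("/query", response_model=QueryResponse)
    def query(req: QueryRequest) -> QueryResponse:
        answer, sources = answer_query(req.question, top_k=req.top_k)
        return QueryResponse(answer=answer, sources=sources)

    # Run with: uvicorn main:app --reload  (interactive Swagger UI served at /docs)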

• Low-Latency LLM Inference

Integrates Llama 3 via Groq to enable fast, real-time response generation suitable for interactive applications.
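
A sketch of context-grounded generation through the groq Python client; the model ID, prompt wording, and temperature are assumptions:

    from groq import Groq

    client = Groq(api_key="YOUR_GROQ_API_KEY")

    def generate_answer(question, context_chunks):
        # Ground the model in the retrieved chunks only.
        context = "\n\n".join(context_chunks)
        completion = client.chat.completions.create(
            model="llama3-8b-8192",   # assumed Groq-hosted Llama 3 model ID
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
            temperature=0.2,
        )
        return completion.choices[0].message.content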

Impact & Outcomes

  • Improved retrieval recall compared to dense-only approaches
  • More reliable answers for keyword-heavy and factual queries
  • Enabled scalable, persistent retrieval using Pinecone
  • Delivered a deployable RAG system suitable for real-world use

Tech Stack

Languages: Python
Frameworks: FastAPI
LLM: Llama 3 (served via Groq)
Embeddings: Sentence Transformers (MiniLM)
Vector Store: Pinecone
Retrieval: BM25, Hybrid Fusion
Deployment: API-first, production-ready design