Hybrid Search RAG

Project Snapshot

  • Category: Generative AI / Retrieval-Augmented Generation
  • Focus: Hybrid retrieval, search relevance, low-latency RAG
  • Architecture: Modular, API-first, production-ready
  • GitHub: Hybrid Search RAG

Executive Summary

Hybrid Search RAG is a production-ready Retrieval-Augmented Generation system designed to improve answer relevance and robustness by combining sparse keyword-based retrieval with dense semantic search. The system uses Pinecone for scalable vector storage, FastAPI for serving, and Llama 3 hosted on Groq for low-latency, real-time generation.

Problem Statement

Purely embedding-based RAG systems often fail in real-world scenarios due to:

  • Poor handling of exact keywords, numbers, and named entities
  • Reduced recall for short or ambiguous queries
  • Over-reliance on dense similarity, which causes relevant context to be missed
  • Lack of production-grade APIs and persistent storage

Solution Overview

I designed Hybrid Search RAG as a sparse–dense retrieval system that explicitly balances precision and recall before generation. Instead of assuming a single retrieval strategy, the system fuses results from multiple retrievers and serves answers through a production API.

  • BM25-based sparse retrieval for exact keyword matching
  • Dense semantic retrieval backed by Pinecone vector storage
  • Weighted score fusion to combine sparse and dense signals (a minimal sketch follows this list)
  • FastAPI-based service for real-time inference
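
A minimal sketch of the weighted fusion step, assuming min-max normalization and a configurable alpha weight; these are illustrative choices, not the exact production logic:

    def min_max_normalize(scores):
        """Scale scores to [0, 1] so sparse and dense values are comparable."""
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {doc_id: 1.0 for doc_id in scores}
        return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

    def fuse_scores(sparse, dense, alpha=0.5):
        """Blend BM25 (sparse) and vector (dense) scores; alpha weights the dense signal."""
        sparse_n, dense_n = min_max_normalize(sparse), min_max_normalize(dense)
        doc_ids = set(sparse_n) | set(dense_n)
        fused = {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
                 for d in doc_ids}
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

Setting alpha toward 1.0 favors semantic matches, while lower values favor exact keyword hits, which is how precision and recall are traded off before generation.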

Architecture & Approach

  • Modular Python architecture separating data, retrieval, and generation
  • BM25 retrieval performed in-memory for fast keyword search (illustrated in the sketch after this list)
  • Pinecone used for persistent, scalable dense vector retrieval
  • Hybrid retriever fuses scores using configurable weighting
  • Llama 3 on Groq used for low-latency, context-grounded generation
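
For illustration, a minimal in-memory BM25 retriever built on the rank_bm25 package; the corpus and whitespace tokenizer below are placeholders rather than the project's actual data pipeline:

    from rank_bm25 import BM25Okapi

    corpus = [
        "Pinecone provides managed vector storage for semantic search.",
        "BM25 ranks documents by keyword overlap with the query.",
        "FastAPI exposes the RAG pipeline as a REST service.",
    ]
    tokenized_corpus = [doc.lower().split() for doc in corpus]   # naive whitespace tokenizer
    bm25 = BM25Okapi(tokenized_corpus)

    # One relevance score per corpus document for the tokenized query.
    sparse_scores = bm25.get_scores("keyword search with bm25".split())
    best = max(zip(corpus, sparse_scores), key=lambda pair: pair[1])
    print(best)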

Key Capabilities

• Hybrid Sparse–Dense Retrieval

Combines BM25 keyword search with dense semantic retrieval to improve recall and robustness across diverse query types.
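
A sketch of how a hybrid query might flow end to end, assuming a MiniLM encoder, a Pinecone index named hybrid-rag (hypothetical), and the fuse_scores helper sketched above:

    from pinecone import Pinecone
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")    # MiniLM embeddings
    pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
    index = pc.Index("hybrid-rag")                        # hypothetical index name

    def hybrid_retrieve(query, bm25, doc_ids, top_k=5, alpha=0.5):
        # Dense side: embed the query and ask Pinecone for nearest neighbours.
        query_vec = encoder.encode(query).tolist()
        matches = index.query(vector=query_vec, top_k=top_k).matches
        dense = {m.id: m.score for m in matches}

        # Sparse side: BM25 scores over the same document ids.
        sparse = dict(zip(doc_ids, bm25.get_scores(query.lower().split())))

        # Blend the two signals with the weighted fusion sketched earlier.
        return fuse_scores(sparse, dense, alpha=alpha)[:top_k]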

• Pinecone-Backed Vector Storage

Uses Pinecone for scalable, persistent vector indexing and low-latency semantic search in production environments.
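
A sketch of the indexing side, assuming the current Pinecone Python SDK, a serverless index, and 384-dimensional MiniLM embeddings; the index name and region are illustrative:

    from pinecone import Pinecone, ServerlessSpec
    from sentence_transformers import SentenceTransformer

    pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
    index_name = "hybrid-rag"                              # hypothetical index name

    # Create the index once; all-MiniLM-L6-v2 embeddings are 384-dimensional.
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=384,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # illustrative region
        )
    index = pc.Index(index_name)

    # Embed document chunks and upsert them with their text as metadata.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["First document chunk ...", "Second document chunk ..."]
    index.upsert(vectors=[
        {"id": f"chunk-{i}", "values": encoder.encode(text).tolist(), "metadata": {"text": text}}
        for i, text in enumerate(chunks)
    ])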

• API-First Design with FastAPI

Exposes the RAG pipeline through a clean REST API with interactive Swagger documentation.
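
A minimal FastAPI sketch of what the query endpoint could look like; the route, request schema, and answer_query stub are illustrative, not the project's exact interface:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="Hybrid Search RAG")

    class QueryRequest(BaseModel):
        question: str
        top_k: int = 5

    class QueryResponse(BaseModel):
        answer: str
        sources: list[str]

    def answer_query(question: str, top_k: int):
        # Stand-in for the real retrieve -> fuse -> generate pipeline.
        return f"Stub answer for: {question}", []

    @app.post("/query", response_model=QueryResponse)
    def query(req: QueryRequest) -> QueryResponse:
        answer, sources = answer_query(req.question, top_k=req.top_k)
        return QueryResponse(answer=answer, sources=sources)

    # Run with: uvicorn main:app --reload  (interactive Swagger UI served at /docs)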

• Low-Latency LLM Inference

Integrates Llama 3 via Groq to enable fast, real-time response generation suitable for interactive applications.
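
A sketch of context-grounded generation through the groq Python client; the model ID, prompt wording, and temperature are assumptions:

    from groq import Groq

    client = Groq(api_key="YOUR_GROQ_API_KEY")

    def generate_answer(question, context_chunks):
        # Ground the model in the retrieved chunks only.
        context = "\n\n".join(context_chunks)
        completion = client.chat.completions.create(
            model="llama3-8b-8192",   # assumed Groq-hosted Llama 3 model ID
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
            temperature=0.2,
        )
        return completion.choices[0].message.content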

Impact & Outcomes

  • Improved retrieval recall compared to dense-only approaches
  • More reliable answers for keyword-heavy and factual queries
  • Enabled scalable, persistent retrieval using Pinecone
  • Delivered a deployable RAG system suitable for real-world use

Tech Stack

Languages: Python
Frameworks: FastAPI
LLM: Llama 3 (served via Groq)
Embeddings: Sentence Transformers (MiniLM)
Vector Store: Pinecone
Retrieval: BM25, Hybrid Fusion
Deployment: API-first, production-ready design