End-to-End Text Summarization

Project Snapshot

  • Category: NLP / Generative AI / MLOps
  • Problem: Long-form text understanding & summarization
  • Model: PEGASUS (Abstractive Summarization)
  • Deployment: Docker, AWS EC2, ECR, CI/CD
  • GitHub: End-to-End Text Summarization

Executive Summary

This project is a production-grade, end-to-end abstractive text summarization system designed to convert long-form documents into concise, coherent summaries. The system covers the complete machine learning lifecycle—from data ingestion and validation to model training, evaluation, deployment, and monitoring—following industry-standard ML engineering practices.

Problem Statement

Manual summarization of large documents is time-consuming and error-prone. Many existing ML solutions focus only on model training and suffer from:

  • Lack of data validation and reproducibility
  • Poor separation between experimentation and production
  • Missing deployment and monitoring workflows
  • Inability to reliably scale and maintain ML systems

Solution Overview

I designed a modular, configuration-driven ML pipeline that treats summarization as a full system rather than a standalone model.

  • Automated data ingestion and validation gates
  • Tokenizer-driven preprocessing using Hugging Face Datasets
  • Fine-tuning PEGASUS with reproducible training configurations
  • ROUGE-based evaluation with persisted metrics
  • FastAPI-powered inference service

Architecture & Approach

  • Clean separation of concerns: config, entities, components, pipelines
  • Strongly-typed configuration using dataclasses
  • Pipeline orchestration with explicit stage boundaries
  • Artifact-driven workflow enabling reproducibility
  • Designed for scalability, testing, and production deployment
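A minimal sketch of the stage-boundary pattern described above, using only the standard library. The stage names and logging format are illustrative, not the project's exact code.

```python
# Pipeline orchestration with explicit stage boundaries: each stage is a
# self-contained class, and the runner logs entry/exit for every stage.
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("summarizer")


class DataIngestionPipeline:
    def main(self):
        logger.info("downloading and extracting raw data")


class ModelTrainerPipeline:
    def main(self):
        logger.info("fine-tuning PEGASUS from config")


STAGES = [
    ("Data Ingestion", DataIngestionPipeline),
    ("Model Trainer", ModelTrainerPipeline),
]


def run_all():
    """Run every stage in order, with explicit boundary logging."""
    for name, pipeline_cls in STAGES:
        logger.info(">>> stage %s started <<<", name)
        pipeline_cls().main()
        logger.info(">>> stage %s completed <<<", name)


if __name__ == "__main__":
    run_all()
```

Because each stage only reads and writes artifacts on disk, any stage can be re-run in isolation, which is what makes the workflow reproducible.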

Key Capabilities

• End-to-End ML Pipeline

Covers the complete ML lifecycle from raw data ingestion to deployed inference, not just model training.

• Production-Grade NLP

Uses PEGASUS, a summarization-optimized transformer, fine-tuned using best practices.
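A sketch of the fine-tuning setup, assuming the tokenized features produced by the preprocessing stage. The checkpoint and hyperparameters are illustrative; the real values live in the project's training configuration.

```python
# Illustrative PEGASUS fine-tuning setup with the Transformers Trainer.
# Checkpoint name and hyperparameters are assumptions, not project values.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)


def build_trainer(train_ds, eval_ds,
                  checkpoint="google/pegasus-cnn_dailymail"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    # Pads inputs and labels dynamically per batch for seq2seq training
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    args = TrainingArguments(
        output_dir="artifacts/model_trainer",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,   # effective batch size of 16
        warmup_steps=500,
        weight_decay=0.01,
    )
    return Trainer(model=model, args=args, data_collator=collator,
                   train_dataset=train_ds, eval_dataset=eval_ds)
```

Gradient accumulation keeps the memory footprint of PEGASUS manageable on a single GPU while preserving an effective batch size large enough for stable training.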

• Configuration-Driven Design

YAML-based configuration and typed entities ensure reproducibility and maintainability.
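The pattern can be sketched as a YAML block parsed into a frozen, typed entity. The field names and URL below are placeholders; the real config covers every pipeline stage.

```python
# Sketch of configuration-driven design: YAML parsed into a frozen
# dataclass, so each stage receives an immutable, typed config entity.
from dataclasses import dataclass
from pathlib import Path

import yaml


@dataclass(frozen=True)
class DataIngestionConfig:
    root_dir: Path
    source_url: str


CONFIG_YAML = """
data_ingestion:
  root_dir: artifacts/data_ingestion
  source_url: https://example.com/data.zip
"""


def load_config(text=CONFIG_YAML):
    raw = yaml.safe_load(text)["data_ingestion"]
    return DataIngestionConfig(root_dir=Path(raw["root_dir"]),
                               source_url=raw["source_url"])
```

Freezing the dataclass means a stage cannot mutate its configuration mid-run, which keeps artifacts traceable back to the exact settings that produced them.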

• MLOps & Deployment

Fully containerized and deployed on AWS using GitHub Actions for CI/CD automation.

Impact & Outcomes

  • Demonstrated real-world ML system design beyond notebooks
  • Enabled scalable and reproducible summarization workflows
  • Bridged the gap between ML research and production engineering
  • Created a reusable blueprint for NLP systems in enterprise settings

Tech Stack

Languages: Python
NLP: Hugging Face Transformers, PEGASUS
Training: PyTorch
Evaluation: ROUGE
API: FastAPI
MLOps: Docker, GitHub Actions
Cloud: AWS EC2, Amazon ECR
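The project computes ROUGE via standard libraries; as an aside, a pure-Python ROUGE-1 F1 sketch shows what the metric actually measures, namely unigram overlap between a generated summary and a reference.

```python
# Minimal ROUGE-1 F1 in pure Python: unigram-overlap precision and recall,
# combined as an F1 score. Illustrative only; real evaluation uses a library.
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

In the pipeline, per-model ROUGE scores are persisted alongside the model artifact so evaluation results remain tied to the training run that produced them.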