End-to-End Text Summarization

Project Snapshot

  • Category: NLP / Generative AI / MLOps
  • Problem: Long-form text understanding & summarization
  • Model: PEGASUS (Abstractive Summarization)
  • Deployment: Docker, AWS EC2, ECR, CI/CD
  • GitHub: End-to-End Text Summarization

Executive Summary

This project is a production-grade, end-to-end abstractive text summarization system designed to convert long-form documents into concise, coherent summaries. The system covers the complete machine learning lifecycle—from data ingestion and validation to model training, evaluation, deployment, and monitoring—following industry-standard ML engineering practices.

Problem Statement

Manual summarization of large documents is time-consuming and error-prone. Many existing ML solutions focus only on model training and suffer from:

  • Lack of data validation and reproducibility
  • Poor separation between experimentation and production
  • Missing deployment and monitoring workflows
  • Inability to reliably scale and maintain ML systems

Solution Overview

I designed a modular, configuration-driven ML pipeline that treats summarization as a full system rather than a standalone model.

  • Automated data ingestion and validation gates
  • Tokenizer-driven preprocessing using Hugging Face Datasets
  • Fine-tuning PEGASUS with reproducible training configurations
  • ROUGE-based evaluation with persisted metrics
  • FastAPI-powered inference service

Architecture & Approach

  • Clean separation of concerns: config, entities, components, pipelines
  • Strongly-typed configuration using dataclasses
  • Pipeline orchestration with explicit stage boundaries
  • Artifact-driven workflow enabling reproducibility
  • Designed for scalability, testing, and production deployment
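A minimal sketch of the stage-boundary pattern described above, using only the standard library. The stage names and logging format are illustrative, not the project's exact code.

```python
# Pipeline orchestration with explicit stage boundaries: each stage is a
# self-contained class, and the runner logs entry/exit for every stage.
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("summarizer")


class DataIngestionPipeline:
    def main(self):
        logger.info("downloading and extracting raw data")


class ModelTrainerPipeline:
    def main(self):
        logger.info("fine-tuning PEGASUS from config")


STAGES = [
    ("Data Ingestion", DataIngestionPipeline),
    ("Model Trainer", ModelTrainerPipeline),
]


def run_all():
    """Run every stage in order, with explicit boundary logging."""
    for name, pipeline_cls in STAGES:
        logger.info(">>> stage %s started <<<", name)
        pipeline_cls().main()
        logger.info(">>> stage %s completed <<<", name)


if __name__ == "__main__":
    run_all()
```

Because each stage only reads and writes artifacts on disk, any stage can be re-run in isolation, which is what makes the workflow reproducible.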

Key Capabilities

• End-to-End ML Pipeline

Covers the complete ML lifecycle from raw data ingestion to deployed inference, not just model training.

• Production-Grade NLP

Uses PEGASUS, a summarization-optimized transformer, fine-tuned using best practices.
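A sketch of the fine-tuning setup, assuming the tokenized features produced by the preprocessing stage. The checkpoint and hyperparameters are illustrative; the real values live in the project's training configuration.

```python
# Illustrative PEGASUS fine-tuning setup with the Transformers Trainer.
# Checkpoint name and hyperparameters are assumptions, not project values.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)


def build_trainer(train_ds, eval_ds,
                  checkpoint="google/pegasus-cnn_dailymail"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    # Pads inputs and labels dynamically per batch for seq2seq training
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    args = TrainingArguments(
        output_dir="artifacts/model_trainer",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,   # effective batch size of 16
        warmup_steps=500,
        weight_decay=0.01,
    )
    return Trainer(model=model, args=args, data_collator=collator,
                   train_dataset=train_ds, eval_dataset=eval_ds)
```

Gradient accumulation keeps the memory footprint of PEGASUS manageable on a single GPU while preserving an effective batch size large enough for stable training.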

• Configuration-Driven Design

YAML-based configuration and typed entities ensure reproducibility and maintainability.
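The pattern can be sketched as a YAML block parsed into a frozen, typed entity. The field names and URL below are placeholders; the real config covers every pipeline stage.

```python
# Sketch of configuration-driven design: YAML parsed into a frozen
# dataclass, so each stage receives an immutable, typed config entity.
from dataclasses import dataclass
from pathlib import Path

import yaml


@dataclass(frozen=True)
class DataIngestionConfig:
    root_dir: Path
    source_url: str


CONFIG_YAML = """
data_ingestion:
  root_dir: artifacts/data_ingestion
  source_url: https://example.com/data.zip
"""


def load_config(text=CONFIG_YAML):
    raw = yaml.safe_load(text)["data_ingestion"]
    return DataIngestionConfig(root_dir=Path(raw["root_dir"]),
                               source_url=raw["source_url"])
```

Freezing the dataclass means a stage cannot mutate its configuration mid-run, which keeps artifacts traceable back to the exact settings that produced them.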

• MLOps & Deployment

Fully containerized and deployed on AWS using GitHub Actions for CI/CD automation.

Impact & Outcomes

  • Demonstrated real-world ML system design beyond notebooks
  • Enabled scalable and reproducible summarization workflows
  • Bridged the gap between ML research and production engineering
  • Created a reusable blueprint for NLP systems in enterprise settings

Tech Stack

Languages: Python
NLP: Hugging Face Transformers, PEGASUS
Training: PyTorch
Evaluation: ROUGE
API: FastAPI
MLOps: Docker, GitHub Actions
Cloud: AWS EC2, Amazon ECR
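The project computes ROUGE via standard libraries; as an aside, a pure-Python ROUGE-1 F1 sketch shows what the metric actually measures, namely unigram overlap between a generated summary and a reference.

```python
# Minimal ROUGE-1 F1 in pure Python: unigram-overlap precision and recall,
# combined as an F1 score. Illustrative only; real evaluation uses a library.
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

In the pipeline, per-model ROUGE scores are persisted alongside the model artifact so evaluation results remain tied to the training run that produced them.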