Research Assistant - LLM Research Pipeline
An intelligent, end-to-end pipeline for processing research PDFs using LLMs (Ollama or Gemini) with dynamic category generation, accurate PDF parsing with OCR fallback, LLM-based metadata extraction, multi-category scoring, deduplication, and topic-focused summarization.

Gallery





View on GitHub
View on PyPi
Problem & Solution
The Problem
Organizing research PDFs is labor-intensive: extracting metadata, categorizing across multiple themes, filtering by topic relevance, detecting duplicates, and creating summaries. Manual workflows are slow, inconsistent, and hard to reproduce.
The Solution
Research Assistant automates this pipeline with LLMs. It generates a dynamic taxonomy from your topic, parses PDFs with OCR fallback, extracts rich metadata, scores papers across all categories, moves each to the best-fit folder, removes duplicates, and produces topic-focused summaries with CSV/JSONL indices for downstream analysis.
Technologies Used
Core & Parsing
- Python 3.12+
- PyMuPDF
- ocrmypdf
- Tesseract
LLM & Embeddings
- Ollama
- Google Gemini
- nomic-embed-text
Indexing & Quality
SQLite Cache
- MinHash LSH
- pytest
Key Features
Dynamic LLM-Driven Taxonomy
Generates categories from your topic (no hardcoding) and scores each paper across all categories simultaneously to choose the best placement.
Accurate Parsing with OCR Fallback
Uses PyMuPDF for born-digital PDFs and seamlessly falls back to OCR (ocrmypdf + Tesseract) for scanned documents, ensuring high-quality text extraction.
Smart Deduplication & Resume
Combines hash-based exact matching with MinHash for near-duplicate detection. A SQLite cache and manifests support resumable processing at scale.
Topic-Focused Summaries & Indices
Produces per-paper summaries that emphasize your topic, plus JSONL and CSV indices for analysis, and Markdown summaries per category.
