April 5, 2025

LLM based medical report parser

Project thumbnail

The goal of this project is to extract structured data from free-text medical reports using Large Language Models (LLMs). The business scenario is a common one:

  • Many healthcare institutions and companies hold years of valuable information in unstructured formats β€” PDFs, DOCs, scanned reports.
  • Manually extracting insights from this archive would require hundreds of hours of highly skilled, expensive labor.
  • LLMs β€” especially domain-specific ones like Google MedGemma β€” offer a scalable, automated solution. However, achieving accurate results requires careful prompt engineering and robust post-processing logic.

In this project, we work specifically with pathology reports and extract a structured CSV containing the following fields:

  • Tissue type
  • Tumor presence
  • Tumor type
  • Malignancy

For the core model, we use MedGemma 3 β€” a 27B-parameter LLM fine-tuned for medical tasks. Due to its size, inference is run on a GPU to ensure acceptable performance and throughput.

To respect privacy regulations, real patient data cannot be shared. However, the GitHub repository includes a small open-access pathology dataset for testing purposes.

🧩 Features

  • πŸ“„ Converts unstructured reports into a structured, database-ready CSV
  • 🧠 Uses Google MedGemma for medical-grade natural language understanding
  • 🎯 Task-specific prompt engineering to ensure consistent, parseable output
  • πŸ§ͺ Roleplay-style instructions to boost consistency and reduce hallucination

πŸ’‘ Technologies Used

  • Google MedGemma 3 (via Hugging Face Transformers)
  • Python – Core backend, parsing, and orchestration
  • Streamlit – UI layer for demo, testing, and result export
  • Pandas – CSV formatting and output handling

🌐 Resources

πŸ‘‰ Github repository