LLM based medical report parser
The goal of this project is to extract structured data from free-text medical reports using Large Language Models (LLMs). The business scenario is a common one:
- Many healthcare institutions and companies hold years of valuable information in unstructured formats β PDFs, DOCs, scanned reports.
- Manually extracting insights from this archive would require hundreds of hours of highly skilled, expensive labor.
- LLMs β especially domain-specific ones like Google MedGemma β offer a scalable, automated solution. However, achieving accurate results requires careful prompt engineering and robust post-processing logic.
In this project, we work specifically with pathology reports and extract a structured CSV containing the following fields:
- Tissue type
- Tumor presence
- Tumor type
- Malignancy
For the core model, we use MedGemma 3 β a 27B-parameter LLM fine-tuned for medical tasks. Due to its size, inference is run on a GPU to ensure acceptable performance and throughput.
To respect privacy regulations, real patient data cannot be shared. However, the GitHub repository includes a small open-access pathology dataset for testing purposes.
π§© Features
- π Converts unstructured reports into a structured, database-ready CSV
- π§ Uses Google MedGemma for medical-grade natural language understanding
- π― Task-specific prompt engineering to ensure consistent, parseable output
- π§ͺ Roleplay-style instructions to boost consistency and reduce hallucination
π‘ Technologies Used
- Google MedGemma 3 (via Hugging Face Transformers)
- Python β Core backend, parsing, and orchestration
- Streamlit β UI layer for demo, testing, and result export
- Pandas β CSV formatting and output handling
π Resources
π Github repository