Medical Text Simplification: From Jargon Detection to Jargon-Aware Prompting
Project Overview
This research investigates automatic jargon detection and LLM-based medical text simplification to improve health literacy. We address the challenge of making biomedical texts accessible to lay readers by developing models that identify complex terminology and by evaluating prompting strategies for effective simplification.
Key Features
- Jargon detection across biomedical datasets (MedReadMe and PLABA)
- Cross-dataset transfer learning evaluation and annotation schema alignment
- Jargon-aware prompting strategies for LLM-based text simplification
- Comparative analysis of general-purpose vs. domain-specialized language models
- Human evaluation studies validating automatic metrics
- Comprehensive benchmarking with multiple transformer architectures
Technical Implementation
Implemented in Python with transformer models including BERT, RoBERTa, BioBERT, and PubMedBERT. Fine-tuned these models on biomedical datasets using BIO tagging for span detection, with custom evaluation pipelines measuring token-level and entity-level F1 scores. Developed jargon-aware prompting strategies for Llama-3.1-8B-Instruct and Medicine-Llama3-8B, evaluating simplification quality through automatic metrics (SARI, BERTScore, FKGL, BLEU) and human assessments. The sketches below illustrate each of these steps.
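The span detector follows standard token-classification fine-tuning. Below is a minimal sketch of the BIO label alignment step, assuming pre-tokenized (words, tags) training pairs; the checkpoint name, the simplified single-type label set, and the `encode_example` helper are illustrative choices, not necessarily the thesis's exact configuration.

```python
# Minimal sketch: align word-level BIO tags to subword tokens for fine-tuning.
# Assumed: training data arrives as (words, word_tags) pairs; the label set is
# a simplified single-type scheme (MedReadMe distinguishes more categories).
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-JARGON", "I-JARGON"]
label2id = {l: i for i, l in enumerate(LABELS)}
id2label = {i: l for l, i in label2id.items()}

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # swap for BERT/RoBERTa/PubMedBERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(LABELS), id2label=id2label, label2id=label2id
)

def encode_example(words, word_tags):
    """Tokenize pre-split words and align BIO tags to subword pieces.

    Continuation pieces and special tokens get the ignore index (-100),
    so the loss is computed once per word.
    """
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev:
            labels.append(-100)
        else:
            labels.append(label2id[word_tags[word_id]])
        prev = word_id
    enc["labels"] = labels
    return enc
```

From here, a standard `transformers` `Trainer` loop over the encoded dataset completes fine-tuning; the same alignment is reused across architectures by swapping the checkpoint name.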
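Entity-level F1 scores spans rather than tokens: a prediction counts only if its boundaries and type exactly match a gold span. Below is a sketch under one common span-extraction convention (an `I-` tag that does not continue the current span opens a new one); the thesis's exact matching rules may differ.

```python
# Sketch: strict entity-level F1 over aligned gold/predicted BIO sequences.
def bio_spans(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.add((start, i, etype))
            start, etype = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.add((start, i, etype))
            start, etype = None, None
    return spans

def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over exact (boundary + type) span matches."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Token-level F1 is the analogous computation over individual tag labels and is more forgiving of boundary errors; reporting both separates detection quality from localization quality.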
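Jargon-aware prompting conditions the simplification request on the detected terms. The sketch below illustrates the idea with the `transformers` chat pipeline; the system/user wording and the generation settings are assumptions for illustration, not the exact templates evaluated in the study.

```python
# Sketch: feed detected jargon terms into the simplification prompt.
# Assumed: jargon_terms come from the span detector above; the prompt wording
# is illustrative. Swap the model id for Medicine-Llama3-8B to compare.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

def simplify(sentence, jargon_terms):
    messages = [
        {"role": "system",
         "content": "You rewrite medical text in plain language for lay readers."},
        {"role": "user",
         "content": (
             f"Simplify the sentence below. It contains medical jargon "
             f"({', '.join(jargon_terms)}); explain or replace each term "
             f"in everyday words.\n\nSentence: {sentence}"
         )},
    ]
    out = generator(messages, max_new_tokens=128, do_sample=False)
    return out[0]["generated_text"][-1]["content"]  # assistant reply

print(simplify(
    "The patient presented with dyspnea and peripheral edema.",
    ["dyspnea", "peripheral edema"],
))
```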
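Automatic evaluation can be assembled from off-the-shelf metric implementations. The sketch below uses the Hugging Face `evaluate` library for SARI and BERTScore and `textstat` for FKGL; the `score` helper and its averaging choices are illustrative, and BLEU (also reported) is omitted for brevity.

```python
# Sketch: score system outputs against references with SARI, BERTScore, FKGL.
# Assumed: `references` holds one list of reference simplifications per source.
import evaluate
import textstat

sari = evaluate.load("sari")
bertscore = evaluate.load("bertscore")

def score(sources, outputs, references):
    bs = bertscore.compute(predictions=outputs, references=references, lang="en")
    return {
        "sari": sari.compute(
            sources=sources, predictions=outputs, references=references
        )["sari"],
        "bertscore_f1": sum(bs["f1"]) / len(bs["f1"]),
        # Lower FKGL = easier to read; averaged over system outputs.
        "fkgl": sum(textstat.flesch_kincaid_grade(o) for o in outputs)
                / len(outputs),
    }
```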
Key Findings & Impact
Successfully reproduced MedReadMe's jargon detection results and extended the evaluation to the PLABA dataset, revealing that cross-dataset transfer learning yields modest gains, limited primarily by divergent annotation objectives. Manual re-annotation of PLABA sentences using MedReadMe's taxonomy improved transfer performance from 33.71% to 42.00% entity-level F1, demonstrating that annotation schema alignment significantly boosts generalization. Jargon-aware prompting showed model-dependent effectiveness, with its benefits often trading off against readability. The work contributes to improving medical text accessibility and has implications for health communication tools.
Publication
Co-authored with Jan Bakker and Jaap Kamps from the Institute for Logic, Language and Computation (ILLC), University of Amsterdam. Code and data are publicly available at https://github.com/taikilazos/thesis_codebase