Medical Text Simplification: From Jargon Detection to Jargon-Aware Prompting
Project Overview
This research investigates automatic jargon detection and LLM-based medical text simplification to improve health literacy. We address the challenge of making biomedical texts accessible to lay readers by developing models that identify complex terminology and by evaluating prompting strategies for effective simplification.
Key Features
- Jargon detection across biomedical datasets (MedReadMe and PLABA)
- Cross-dataset transfer learning evaluation and annotation schema alignment
- Jargon-aware prompting strategies for LLM-based text simplification
- Comparative analysis of general-purpose vs. domain-specialized language models
- Human evaluation studies validating automatic metrics
- Comprehensive benchmarking with multiple transformer architectures
Technical Implementation
Implemented in Python with transformer models including BERT, RoBERTa, BioBERT, and PubMedBERT. Fine-tuned these models on biomedical datasets using BIO tagging for span detection, with custom evaluation pipelines measuring token-level and entity-level F1 scores. Developed jargon-aware prompting strategies for Llama-3.1-8B-Instruct and Medicine-Llama3-8B, evaluating simplification quality through automatic metrics (SARI, BERTScore, FKGL, BLEU) and human assessments. The sketches below illustrate each of these steps.
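The span detector follows standard token-classification fine-tuning. Below is a minimal sketch of the BIO label alignment step, assuming pre-tokenized (words, tags) training pairs; the checkpoint name, the simplified single-type label set, and the `encode_example` helper are illustrative choices, not necessarily the thesis's exact configuration.

```python
# Minimal sketch: align word-level BIO tags to subword tokens for fine-tuning.
# Assumed: training data arrives as (words, word_tags) pairs; the label set is
# a simplified single-type scheme (MedReadMe distinguishes more categories).
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-JARGON", "I-JARGON"]
label2id = {l: i for i, l in enumerate(LABELS)}
id2label = {i: l for l, i in label2id.items()}

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # swap for BERT/RoBERTa/PubMedBERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(LABELS), id2label=id2label, label2id=label2id
)

def encode_example(words, word_tags):
    """Tokenize pre-split words and align BIO tags to subword pieces.

    Continuation pieces and special tokens get the ignore index (-100),
    so the loss is computed once per word.
    """
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev:
            labels.append(-100)
        else:
            labels.append(label2id[word_tags[word_id]])
        prev = word_id
    enc["labels"] = labels
    return enc
```

From here, a standard `transformers` `Trainer` loop over the encoded dataset completes fine-tuning; the same alignment is reused across architectures by swapping the checkpoint name.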
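Entity-level F1 scores spans rather than tokens: a prediction counts only if its boundaries and type exactly match a gold span. Below is a sketch under one common span-extraction convention (an `I-` tag that does not continue the current span opens a new one); the thesis's exact matching rules may differ.

```python
# Sketch: strict entity-level F1 over aligned gold/predicted BIO sequences.
def bio_spans(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.add((start, i, etype))
            start, etype = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.add((start, i, etype))
            start, etype = None, None
    return spans

def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over exact (boundary + type) span matches."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Token-level F1 is the analogous computation over individual tag labels and is more forgiving of boundary errors; reporting both separates detection quality from localization quality.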
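Jargon-aware prompting conditions the simplification request on the detected terms. The sketch below illustrates the idea with the `transformers` chat pipeline; the system/user wording and the generation settings are assumptions for illustration, not the exact templates evaluated in the study.

```python
# Sketch: feed detected jargon terms into the simplification prompt.
# Assumed: jargon_terms come from the span detector above; the prompt wording
# is illustrative. Swap the model id for Medicine-Llama3-8B to compare.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

def simplify(sentence, jargon_terms):
    messages = [
        {"role": "system",
         "content": "You rewrite medical text in plain language for lay readers."},
        {"role": "user",
         "content": (
             f"Simplify the sentence below. It contains medical jargon "
             f"({', '.join(jargon_terms)}); explain or replace each term "
             f"in everyday words.\n\nSentence: {sentence}"
         )},
    ]
    out = generator(messages, max_new_tokens=128, do_sample=False)
    return out[0]["generated_text"][-1]["content"]  # assistant reply

print(simplify(
    "The patient presented with dyspnea and peripheral edema.",
    ["dyspnea", "peripheral edema"],
))
```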
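Automatic evaluation can be assembled from off-the-shelf metric implementations. The sketch below uses the Hugging Face `evaluate` library for SARI and BERTScore and `textstat` for FKGL; the `score` helper and its averaging choices are illustrative, and BLEU (also reported) is omitted for brevity.

```python
# Sketch: score system outputs against references with SARI, BERTScore, FKGL.
# Assumed: `references` holds one list of reference simplifications per source.
import evaluate
import textstat

sari = evaluate.load("sari")
bertscore = evaluate.load("bertscore")

def score(sources, outputs, references):
    bs = bertscore.compute(predictions=outputs, references=references, lang="en")
    return {
        "sari": sari.compute(
            sources=sources, predictions=outputs, references=references
        )["sari"],
        "bertscore_f1": sum(bs["f1"]) / len(bs["f1"]),
        # Lower FKGL = easier to read; averaged over system outputs.
        "fkgl": sum(textstat.flesch_kincaid_grade(o) for o in outputs)
                / len(outputs),
    }
```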
Key Findings & Impact
Successfully reproduced MedReadMe's jargon detection results and extended the evaluation to the PLABA dataset, revealing that cross-dataset transfer learning yields modest gains, limited primarily by divergent annotation objectives. Manual re-annotation of PLABA sentences using MedReadMe's taxonomy improved transfer performance from 33.71% to 42.00% entity-level F1, demonstrating that annotation schema alignment significantly boosts generalization. Jargon-aware prompting showed model-dependent effectiveness, with its benefits often trading off against readability. The work contributes to improving medical text accessibility and has implications for health communication tools.
Publication
Co-authored with Jan Bakker and Jaap Kamps from the Institute for Logic, Language and Computation (ILLC), University of Amsterdam. Code and data are publicly available at https://github.com/taikilazos/thesis_codebase