Identifying Workplace Accidents in News Articles
Project Overview
During my bachelor's research internship at TNO (Netherlands Organisation for Applied Scientific Research), I developed an NLP system to automatically identify and classify workplace accident reports in news media. The project aimed to understand how workplace accidents are represented in media coverage and uncover reporting trends by analyzing large-scale news corpora.
Technical Approach
The solution was designed as a three-phase pipeline with increasing specificity. The first two phases were implemented; the third was planned:
- Phase 1 - General Accident Detection: Built a classifier to distinguish accident reports from non-accident news in a dataset of 300,000+ articles, which required creating a labeled dataset from scratch.
- Phase 2 - Workplace Accident Filtering: Fine-tuned a second classifier to distinguish workplace-specific incidents from general accidents using contextual understanding.
- Phase 3 - Categorization: Planned clustering and multi-class classification to categorize accidents by type (e.g., falls, chemical exposure, machine-related incidents).
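The cascade described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function `run_cascade` and the keyword-based stand-ins `toy_is_accident` and `toy_is_workplace` are hypothetical names that replace the trained classifiers so the example runs without any model.

```python
from typing import Callable, Iterable, List, Optional

# A classifier here is any function mapping article text to True/False.
Classifier = Callable[[str], bool]


def run_cascade(
    articles: Iterable[str],
    is_accident: Classifier,
    is_workplace: Classifier,
    categorize: Optional[Callable[[str], str]] = None,
) -> List[dict]:
    """Apply the phases in order; later phases only see what earlier ones kept."""
    results = []
    for text in articles:
        if not is_accident(text):   # Phase 1: drop non-accident news
            continue
        if not is_workplace(text):  # Phase 2: keep workplace incidents only
            continue
        # Phase 3 is optional here because it was still planned in the project.
        category = categorize(text) if categorize else None
        results.append({"text": text, "category": category})
    return results


# Hypothetical keyword stand-ins (assumptions, not the trained models).
def toy_is_accident(text: str) -> bool:
    return any(w in text.lower() for w in ("accident", "injured", "fell"))


def toy_is_workplace(text: str) -> bool:
    return any(w in text.lower() for w in ("factory", "construction site", "worker"))


articles = [
    "Worker injured after fall at construction site",
    "Two cars collide on motorway, driver injured",
    "City council approves new budget",
]
kept = run_cascade(articles, toy_is_accident, toy_is_workplace)
print(kept)
```

Running the cheaper, broader filter first means the more specific classifier only has to score the articles that survive Phase 1, which matters at the scale of several hundred thousand articles.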
I evaluated several approaches, including Logistic Regression and semi-supervised Label Propagation. Fine-tuning BERT delivered the best performance, owing to its contextual language understanding and to explicit handling of class imbalance.
Results & Impact
The pipeline achieved an F1-score of 0.89 on workplace accident identification. The BERT-based approach clearly outperformed the traditional methods, which suggests that semantic context matters more than keyword matching for this task. The resulting methodology scales to large news datasets, enabling TNO to use them for safety research and trend analysis.
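For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below shows the calculation; the helper name `f1_from_counts` and the counts passed to it are made-up illustrations and are not the project's evaluation data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # no correct positive predictions at all
    # Harmonic mean: penalizes a large gap between precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy counts chosen only for illustration.
score = f1_from_counts(tp=80, fp=10, fn=20)
print(round(score, 3))
```

Because F1 ignores true negatives, it is a more informative measure than accuracy when the positive class (here, workplace accidents) is rare.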
Key Learnings
- Data Quality: Manual labeling and clear annotation protocols were essential for model performance.
- Context Matters: Thanks to their semantic understanding, transformer models excel where traditional approaches fail.
- Class Imbalance: Actively balancing the classes during training was crucial for minority-class performance.
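One common way to handle class imbalance is to weight each class inversely to its frequency in the loss function. The write-up does not state which balancing technique was used, so the sketch below is an assumption for illustration only; `balanced_class_weights` is a hypothetical helper that follows the same heuristic as scikit-learn's `class_weight="balanced"` option, and the label counts are invented.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Invented 90/10 split, for illustration only.
labels = ["other"] * 90 + ["workplace"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on a rare workplace-accident example costs more than an error on a common example, which discourages the model from simply predicting the majority class.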
Future Work
Next steps include implementing Phase 3 categorization, domain-specific BERT fine-tuning for Dutch content, and expanding to broader news sources for comparative analysis with official injury statistics.