Date of Award
Summer 7-31-2025
Embargo Period
7-31-2027
Document Type
Dissertation - MUSC Only
Degree Name
Doctor of Philosophy (PhD)
Department
Public Health Sciences
College
College of Graduate Studies
First Advisor
Paul Heider
Second Advisor
Jihad Obeid
Third Advisor
Feng Luo
Fourth Advisor
Ramsey Wehbe
Fifth Advisor
Zijun Wang
Abstract
Background: Electronic health records (EHR) systems have generated vast volumes of clinical narratives, driving clinical natural language processing (NLP) development. Clinical section identification is crucial for NLP tasks like information retrieval, but faces challenges from inconsistent documentation practices and highly skewed section distributions across health systems. Current rule-based, traditional machine learning, and large language model approaches suffer from performance degradation on out-of-distribution data and require extensive preprocessing or post-processing.
Methods: This study addresses these limitations through model development and downstream application validation. Aim 1 developed a contextual BERT-based approach using a novel input strategy that groups each target sentence with its immediate preceding and succeeding sentences to capture narrative flow and semantic cues. We tested four BERT-family models (BERT, BioBERT, Bio_ClinicalBERT, ClinicalLongformer) on the MedSecId dataset and evaluated generalizability on the i2b2 2014 Shared Task data. Aim 2 developed a novel sentence-level classification system using large language models that classifies sentences directly, without requiring known section boundaries or complicated post-processing. We conducted prompt engineering and ablation studies on proprietary models (ChatGPT-4o, ChatGPT-4o-mini) and open models (Llama, BioMedicalLlama). Aim 3 validated the practical utility of these models by applying section identification to Alzheimer's dementia phenotyping using EHRs from the Medical University of South Carolina (MUSC) Research Data Warehouse. We used the developed systems to create filtered clinical narratives that exclude irrelevant sections and improve the signal-to-noise ratio, comparing CNN models trained on the filtered text against baseline models with identical architectures trained on unfiltered clinical notes.
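As a minimal sketch of the contextual input strategy described in Aim 1 (not the dissertation's actual code; the function name, the [SEP] separator, and the toy note are assumptions made purely for illustration), the snippet below shows how each target sentence could be grouped with its immediate preceding and succeeding sentences before being passed to a BERT-family encoder:

# Illustrative sketch only: pairs each target sentence with its neighbors,
# mirroring the contextual input strategy described in Aim 1. The separator
# token and function name are assumptions, not the dissertation's implementation.
from typing import List

def build_contextual_inputs(sentences: List[str], sep: str = " [SEP] ") -> List[str]:
    """For each target sentence, prepend the preceding sentence and append the
    succeeding sentence (when they exist) so the encoder sees the local
    narrative flow around the sentence being classified."""
    contextual_inputs = []
    for i, target in enumerate(sentences):
        prev_sent = sentences[i - 1] if i > 0 else ""
        next_sent = sentences[i + 1] if i < len(sentences) - 1 else ""
        contextual_inputs.append(sep.join(filter(None, [prev_sent, target, next_sent])))
    return contextual_inputs

# Toy example: each printed line is one classification input for the encoder.
note = [
    "Chief complaint: shortness of breath.",
    "Patient reports symptoms began two days ago.",
    "Past medical history includes hypertension.",
]
for text in build_contextual_inputs(note):
    print(text)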
Results: The contextual input strategy with ClinicalLongformer achieved the highest F1-scores of 0.92 (in-domain) and 0.64 (out-of-domain), surpassing previous BERT performance (F1 = 0.71 in-domain, F1 = 0.60 out-of-domain). LLM-based sentence-level classification achieved state-of-the-art performance, with micro F1-scores of 0.85 and 0.74 for ChatGPT-4o and ChatGPT-4o-mini, respectively, without requiring extensive post-processing. CNN models trained on filtered text significantly outperformed the baselines with medium-to-large effect sizes, achieving 0.90 accuracy on text filtered by ChatGPT-4o-mini.
Conclusions: This research establishes new benchmarks for clinical section identification and demonstrates practical value for downstream clinical NLP tasks, providing a foundation for more robust and generalizable clinical text processing systems.
Recommended Citation
Chen, Kexin, "Addressing Generalization Challenges in Clinical Section Identification: Contextual Learning and Large Language Model Approaches" (2025). MUSC Theses and Dissertations. 1080.
https://medica-musc.researchcommons.org/theses/1080
Rights
Copyright is held by the author. All rights reserved.