Date of Award
Summer 7-31-2025
Embargo Period
7-31-2027
Document Type
Dissertation - MUSC Only
Degree Name
Doctor of Philosophy (PhD)
Department
Public Health Sciences
College
College of Graduate Studies
First Advisor
Paul Heider
Second Advisor
Jihad Obeid
Third Advisor
Feng Luo
Fourth Advisor
Ramsey Wehbe
Fifth Advisor
Zijun Wang
Abstract
Background: Electronic health records (EHR) systems have generated vast volumes of clinical narratives, driving clinical natural language processing (NLP) development. Clinical section identification is crucial for NLP tasks like information retrieval, but faces challenges from inconsistent documentation practices and highly skewed section distributions across health systems. Current rule-based, traditional machine learning, and large language model approaches suffer from performance degradation on out-of-distribution data and require extensive preprocessing or post-processing.
Methods: This study addresses these limitations through model development and downstream application validation. Aim 1 developed a contextual BERT-based approach using a novel input strategy that groups each target sentence with its immediate preceding and succeeding sentences to capture narrative flow and semantic cues. We tested four BERT-family models (BERT, BioBERT, Bio_ClinicalBERT, ClinicalLongformer) on the MedSecId dataset and evaluated generalizability on the i2b2 2014 Shared Task data. Aim 2 developed a novel sentence-level classification system using large language models that classifies sentences directly, without requiring known section boundaries or complicated post-processing. We conducted prompt engineering and ablation studies on proprietary models (ChatGPT-4o, ChatGPT-4o-mini) and open models (Llama, BioMedicalLlama). Aim 3 validated the practical utility of these models by applying section identification to Alzheimer's dementia phenotyping using EHRs from the Medical University of South Carolina (MUSC) Research Data Warehouse. We used the developed systems to create filtered clinical narratives that exclude irrelevant sections and improve the signal-to-noise ratio, comparing CNN models trained on the filtered text against baseline models with identical architectures trained on unfiltered clinical notes.
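As a minimal sketch of the contextual input strategy described in Aim 1 (not the dissertation's actual code; the function name, the [SEP] separator, and the toy note are assumptions made purely for illustration), the snippet below shows how each target sentence could be grouped with its immediate preceding and succeeding sentences before being passed to a BERT-family encoder:

# Illustrative sketch only: pairs each target sentence with its neighbors,
# mirroring the contextual input strategy described in Aim 1. The separator
# token and function name are assumptions, not the dissertation's implementation.
from typing import List

def build_contextual_inputs(sentences: List[str], sep: str = " [SEP] ") -> List[str]:
    """For each target sentence, prepend the preceding sentence and append the
    succeeding sentence (when they exist) so the encoder sees the local
    narrative flow around the sentence being classified."""
    contextual_inputs = []
    for i, target in enumerate(sentences):
        prev_sent = sentences[i - 1] if i > 0 else ""
        next_sent = sentences[i + 1] if i < len(sentences) - 1 else ""
        contextual_inputs.append(sep.join(filter(None, [prev_sent, target, next_sent])))
    return contextual_inputs

# Toy example: each printed line is one classification input for the encoder.
note = [
    "Chief complaint: shortness of breath.",
    "Patient reports symptoms began two days ago.",
    "Past medical history includes hypertension.",
]
for text in build_contextual_inputs(note):
    print(text)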
Results: The contextual input strategy with ClinicalLongformer achieved the highest F1-scores of 0.92 (in-domain) and 0.64 (out-of-domain), surpassing previous BERT performance (F1 = 0.71 in-domain, F1 = 0.60 out-of-domain). LLM-based sentence-level classification achieved state-of-the-art performance, with micro F1-scores of 0.85 and 0.74 for ChatGPT-4o and ChatGPT-4o-mini, respectively, without requiring extensive post-processing. CNN models trained on filtered text significantly outperformed the baselines with medium-to-large effect sizes, achieving 0.90 accuracy on text filtered by ChatGPT-4o-mini.
Conclusions: This research establishes new benchmarks for clinical section identification and demonstrates practical value for downstream clinical NLP tasks, providing a foundation for more robust and generalizable clinical text processing systems.
Recommended Citation
Chen, Kexin, "Addressing Generalization Challenges in Clinical Section Identification: Contextual Learning and Large Language Model Approaches" (2025). MUSC Theses and Dissertations. 1080.
https://medica-musc.researchcommons.org/theses/1080
Rights
Copyright is held by the author. All rights reserved.