  • Open Access
  • Research Article

Let the History Speak: Zero-Shot LLMs for Diagnosing Vestibular Disorders

  • Chongkai Lu 1
  • Ruiqi Zhang 1
  • Fangzhou Yu 1
  • Huawei Li 1, 2, 3, 4, 5
  • Peixia Wu 1, 6, *

Received: 15 Aug 2025 | Accepted: 26 Aug 2025 | Published: 16 Sep 2025

Abstract

Background: Vestibular disorders are common yet diagnostically challenging in both first-line and specialist settings, and delays or misclassification can alter management and outcomes. Structured symptom questionnaires and supervised machine learning (ML) have shown promise for triage, while recent large language models (LLMs) may be able to reason over clinical descriptions without task-specific training.

Objective: To evaluate zero-shot LLMs for five-class vestibular diagnosis from an electronic questionnaire, characterize error patterns across disorders, and compare the best-performing LLM with a trained gradient-boosted tree model (LightGBM, LGBM).

Methods: We used a seven-center prospective cohort with an electronic 23-item questionnaire and guideline-based reference diagnoses made by experienced ENT specialists. The prediction task was a five-class classification among benign paroxysmal positional vertigo (BPPV), vestibular migraine (VM), Meniere disease (MD), sudden sensorineural hearing loss with vestibular dysfunction (SSNHL-V), and an aggregated “Others” category of individually rare vestibular conditions. After prespecified exclusions, 1,025 single-definite cases were analyzed; 912 and 113 patients formed the training and test sets, respectively, for a LightGBM baseline. Three LLMs (DeepSeek-R1, DeepSeek-V3, Doubao-1.6-thinking) were evaluated zero-shot on all 1,025 cases. We report Top-k accuracy, mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG@5) overall; one-vs-rest sensitivity, specificity, and accuracy per disorder (macro-averaged where applicable); 95% CIs via 1,000-patient bootstrap; paired bootstrap for model differences; and McNemar’s test for accuracy on the shared test set.
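The ranking metrics named above can be illustrated with a minimal sketch. It assumes each case has exactly one reference diagnosis and that each model returns a ranked list of the five classes; under that assumption NDCG@5 reduces to 1/log2(1 + rank). The function names, the toy cases, and the percentile-bootstrap settings (1,000 patient-level resamples, fixed seed) are illustrative assumptions, not the authors' code.

```python
import math
import random

def rank_of_truth(ranked, truth):
    """1-based rank of the reference diagnosis in a ranked list, or None if absent."""
    return ranked.index(truth) + 1 if truth in ranked else None

def top_k(ranks, k):
    """Fraction of cases whose reference diagnosis appears in the first k positions."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank; a missing label contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def ndcg_at_k(ranks, k=5):
    """NDCG@k with one relevant label per case (ideal DCG is 1)."""
    return sum(1.0 / math.log2(1 + r)
               for r in ranks if r is not None and r <= k) / len(ranks)

def bootstrap_ci(ranks, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling cases (patients) with replacement."""
    rng = random.Random(seed)
    n = len(ranks)
    stats = sorted(
        metric([ranks[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo_idx = int(n_boot * alpha / 2)
    return stats[lo_idx], stats[n_boot - 1 - lo_idx]

# Toy cases (hypothetical): (ranked prediction list, reference diagnosis)
cases = [
    (["BPPV", "VM", "MD", "SSNHL-V", "Others"], "BPPV"),
    (["VM", "BPPV", "MD", "Others", "SSNHL-V"], "BPPV"),
    (["MD", "VM", "Others", "BPPV", "SSNHL-V"], "VM"),
    (["Others", "MD", "VM", "SSNHL-V", "BPPV"], "VM"),
]
ranks = [rank_of_truth(pred, truth) for pred, truth in cases]

top1 = top_k(ranks, 1)
top3 = top_k(ranks, 3)
mrr_value = mrr(ranks)
ndcg5 = ndcg_at_k(ranks, 5)
mrr_lo, mrr_hi = bootstrap_ci(ranks, mrr)
```

Resampling whole cases keeps each patient's ranked list intact, so the same resampled indices could be reused across two models to obtain a paired bootstrap for their difference.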

Results: All LLMs outperformed a prevalence-prior baseline (Top-1 38.6%). DeepSeek-V3 (V3) and Doubao-1.6-thinking (Doubao) achieved Top-1 ≈ 65% and Top-3 ≥ 91%, with MRR ≈ 0.79–0.80. Disorder-wise, BPPV was reliably detected; vestibular migraine remained the hardest to identify; sensitivity–specificity trade-offs were model- and disorder-dependent. On the 113-case test set, LGBM slightly exceeded V3 on sensitivity (0.722 vs. 0.632), specificity (0.941 vs. 0.926), and accuracy (0.770 vs. 0.742), with no significant difference in accuracy (McNemar p = 0.690). These findings support LLMs as a zero-shot front end that narrows the diagnostic search space while approaching a specialized model’s performance.
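The per-disorder metrics and the paired accuracy comparison can be sketched as follows. This is a hedged illustration on toy labels: the abstract does not state which variant of McNemar's test was used, so the exact two-sided binomial form is an assumption here, and all function names and data are hypothetical.

```python
from math import comb

def one_vs_rest(y_true, y_pred, label):
    """Sensitivity and specificity for one disorder against all others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar p-value computed from the discordant pairs."""
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(a == t and m != t for t, a, m in triples)  # A correct, B wrong
    c = sum(a != t and m == t for t, a, m in triples)  # A wrong, B correct
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy labels (hypothetical), Top-1 predictions from two models
y_true = ["BPPV", "BPPV", "VM", "MD", "Others", "VM"]
pred_a = ["BPPV", "BPPV", "BPPV", "MD", "Others", "VM"]
pred_b = ["BPPV", "VM", "VM", "MD", "MD", "MD"]

bppv_sens, bppv_spec = one_vs_rest(y_true, pred_a, "BPPV")
vm_sens, vm_spec = one_vs_rest(y_true, pred_a, "VM")
p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

Because McNemar's test uses only the cases on which the two models disagree, two models with close accuracies on a 113-case test set can easily yield a large p-value.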

How to Cite
Lu, C.; Zhang, R.; Yu, F.; Li, H.; Wu, P. Let the History Speak: Zero-Shot LLMs for Diagnosing Vestibular Disorders. ENT Discovery 2025, 1 (1), 23–32. https://doi.org/10.15302/ENTD.2025.090005.
Copyright & License
Copyright (c) 2025 by the authors.