Background: Vestibular disorders are common yet diagnostically challenging in both first-line and specialist settings, and delayed or misclassified diagnosis can alter management and outcomes. Structured symptom questionnaires and supervised machine learning (ML) have shown promise for triage, while recent large language models (LLMs) may reason over clinical descriptions without task-specific training.
Objective: To evaluate zero-shot LLMs for five-class vestibular diagnosis from an electronic questionnaire, characterize error patterns across disorders, and compare the best-performing LLM with a trained gradient-boosted tree model (LightGBM, LGBM).
Methods: We used a seven-center prospective cohort with an electronic 23-item questionnaire and guideline-based reference diagnoses assigned by experienced ENT specialists. The prediction task was five-class classification among benign paroxysmal positional vertigo (BPPV), vestibular migraine (VM), Meniere disease (MD), sudden sensorineural hearing loss with vestibular dysfunction (SSNHL-V), and an aggregated "Others" category of individually rare vestibular conditions. After prespecified exclusions, 1,025 cases with a single definite diagnosis were analyzed; patients were split 912/113 into training and test sets for the LightGBM baseline. Three LLMs (DeepSeek-R1, DeepSeek-V3, Doubao-1.6-thinking) were evaluated zero-shot on all 1,025 cases. We report Top-k accuracy, mean reciprocal rank (MRR), and NDCG@5 overall; one-vs-rest sensitivity, specificity, and accuracy per disorder (macro-averaged where applicable); 95% CIs via patient-level bootstrap (1,000 resamples); paired bootstrap for between-model differences; and McNemar's test for accuracy on the shared test set.
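The ranking metrics named above (Top-k, MRR, NDCG@5) can be computed directly from each model's ranked five-item differential. The sketch below is illustrative only, not the study's code; the data layout (one ranked label list per case, one true label per case) and the convention that NDCG@5 has a single relevant label with an ideal DCG of 1 are assumptions.

```python
import math

def topk_accuracy(ranked, truth, k):
    # Fraction of cases whose true label appears among the top-k ranked predictions.
    return sum(t in r[:k] for r, t in zip(ranked, truth)) / len(truth)

def mrr(ranked, truth):
    # Mean reciprocal rank of the true label; a label absent from the list scores 0.
    total = 0.0
    for r, t in zip(ranked, truth):
        if t in r:
            total += 1.0 / (r.index(t) + 1)
    return total / len(truth)

def ndcg_at_5(ranked, truth):
    # NDCG@5 with one relevant label per case: DCG = 1 / log2(rank + 1),
    # and the ideal DCG (true label ranked first) equals 1.
    total = 0.0
    for r, t in zip(ranked, truth):
        if t in r[:5]:
            total += 1.0 / math.log2(r.index(t) + 2)
    return total / len(truth)
```

With a single relevant label per case, Top-1 accuracy, MRR, and NDCG@5 all reduce to simple functions of the rank at which the true diagnosis appears, which is why all three move together when a model merely reorders its differential.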
Results: All LLMs outperformed a prevalence-prior baseline (Top-1 38.6%). V3 and Doubao achieved Top-1 ≈ 65% and Top-3 ≥ 91%, with MRR ≈ 0.79–0.80. Disorder-wise, BPPV was reliably detected, vestibular migraine remained the hardest, and sensitivity–specificity trade-offs were model- and disorder-dependent. On the 113-case test set, LGBM slightly exceeded V3 on sensitivity (0.722 vs. 0.632), specificity (0.941 vs. 0.926), and accuracy (0.770 vs. 0.742), with no significant accuracy difference (McNemar p = 0.690). These findings support LLMs as a zero-shot front end that narrows the diagnostic search space while approaching a specialized model's performance.
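The McNemar comparison reported above uses only the discordant pairs on the shared test set, i.e., cases one model classified correctly and the other did not. As a minimal sketch (not the study's code, and with hypothetical discordant counts), an exact two-sided McNemar p-value can be computed from the binomial distribution over those pairs:

```python
from math import comb

def mcnemar_exact(b, c):
    # Exact two-sided McNemar p-value from discordant counts:
    # b = cases model A got right and model B got wrong; c = the reverse.
    # Under H0 the discordant pairs split 50/50, so the test is a
    # two-sided binomial test on min(b, c) out of b + c trials at p = 0.5.
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A non-significant result here (as in the reported p = 0.690) means the two models' accuracies on the 113 shared cases are statistically indistinguishable, not that the models make the same errors; the per-disorder sensitivity gaps can still differ.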