Over the last 30 years, research methods have shifted from manual, library-based approaches to methods supported by digital tools. Current Large Language Models (LLMs), such as ChatGPT, are no longer confined to literature search and have expanded into research analysis and knowledge discovery. However, their dependability for conducting meaningful research, especially in specialized, interdisciplinary areas, has not been thoroughly assessed. This study evaluated whether contemporary LLMs can reproduce the research outcomes of a fully documented human study: a 1991 article that identified dermatophytosis (ringworm) in historical fine art. This niche, interdisciplinary topic was selected as a deliberate stress test of LLM capabilities at the research frontier, where cross-domain synthesis and deep domain knowledge are required. Ten commercially available LLMs were systematically tested under two prompt conditions: a basic factual query and a complex motivational prompt designed to elicit human-level research performance. LLM responses were graded on a scale from 0 to 3 based on eight characteristics of the artwork and three factors from the benchmark article. Hallucinations were identified, counted, and classified into four categories: complete fabrication, misattribution, embellishment, and overclaiming. Each classification was verified against the original paper, digital collections, and museum databases. Statistical comparisons used the Wilcoxon signed-rank test and Fisher’s exact test. No LLM rediscovered any of the seven artworks identified in the original article. Three of 10 LLMs (30%) produced incorrect information in response to the simple prompt (M = 0.40, SD = 0.70). Nine of 10 LLMs (90%) produced incorrect information in response to the complex prompt, for a total of 48 instances (M = 4.80, SD = 4.52).
This increase was statistically significant (Wilcoxon W = 0.0, p = 0.004, effect size r = 0.91; Fisher’s exact test p = 0.020). Perplexity Pro Deep Research, despite providing the most detailed etiological information (scoring 3 on all three article-level characteristics), also produced the most hallucinations (n = 17). LLMs consistently fabricated plausible-sounding content rather than acknowledging uncertainty. In the specialized interdisciplinary domain tested here, current LLMs proved unreliable as autonomous research agents. The mean number of hallucinations per model rose twelvefold (from 0.40 to 4.80) when the prompt moved from a basic factual query to a complex motivational prompt that explicitly demanded a detailed report. This pattern is consistent with models prioritizing output completion over factual accuracy; Reinforcement Learning from Human Feedback (RLHF) is discussed as a plausible explanation for the pattern, but not as a demonstrated mechanism. LLMs can be helpful research tools when they are used by domain experts who verify the results. However, they should not be viewed as independent research agents. Researchers are also advised to keep prompts simple and limited in scope to obtain more reliable results and to treat confident responses with skepticism. Independent verification is necessary.
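As an illustrative check of the arithmetic reported above, the following minimal Python sketch recomputes the two-sided Fisher’s exact test from the stated counts (3 of 10 versus 9 of 10 LLMs producing incorrect information) and the fold increase in mean hallucinations. The function name and the 2x2 table layout are assumptions made for illustration; this is not the authors' analysis code, and the simple-prompt total of 4 instances is inferred from M = 0.40 across 10 models.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper written for this sketch (standard library only).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, row1)

    def prob(x):
        # Hypergeometric probability of x first-column counts in row 1,
        # with all row and column totals held fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    # Sum every possible table that is no more probable than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )


# Rows: simple prompt, complex prompt.
# Columns: LLMs with at least one hallucination, LLMs with none.
p_value = fisher_exact_two_sided(3, 7, 9, 1)

# Totals: 48 instances (complex); 4 instances (simple, inferred from M = 0.40 x 10).
total_complex = 48
total_simple = round(0.40 * 10)
fold_increase = total_complex / total_simple

print(f"Fisher's exact p = {p_value:.3f}")
print(f"Fold increase in mean hallucinations = {fold_increase:.0f}")
```

Under these assumptions, the result agrees with the reported Fisher’s exact p = 0.020 and the twelvefold increase. The Wilcoxon statistic and effect size are not recomputed here because they require the per-model hallucination counts, which the abstract does not list.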




