Artificial intelligence tools like ChatGPT are transforming how healthcare organizations manage communication, documentation, and patient engagement. From reducing administrative workloads to helping staff draft educational materials, large language models (LLMs) offer undeniable efficiency. Yet beneath that convenience lies a complex web of risks, from misinformation to compliance violations, that make using non-medical LLMs in healthcare a serious concern.
“ChatGPT definitely has the potential to make a big difference in healthcare by speeding up administrative work, helping staff, and making patient education more engaging. There are some important limitations to keep in mind,” says David Holt, Owner, Holt Law LLC. “First, it doesn’t actually ‘understand’ medicine—it can sound confident even when it gives incorrect or misleading information, which could be risky in a clinical setting. It’s also not up to date with the latest medical guidelines or treatments if you're using versions trained on older data. Another issue is bias—since ChatGPT was trained on large sets of data from the internet, it can reflect gaps and inequalities that already exist in healthcare, especially for underrepresented communities. Plus, as of today, it can only work with text, so it’s not helpful for anything that involves images, like X-rays or visual diagnoses. Sometimes the answers it gives can be too general or surface-level, missing the detail you’d need in complex medical situations. And maybe most importantly, the public versions aren’t HIPAA-compliant, which means using them with any patient data could lead to privacy risks or security breaches.”
Holt’s perspective points to the paradox of adopting AI in healthcare: while non-medical LLMs can streamline workflows and save time, they can also introduce hidden dangers that compromise accuracy, equity, and patient privacy.
One of the clearest dangers is the tendency of LLMs to “hallucinate”: to confidently produce information that is incorrect, fabricated, or misleading. In clinical contexts, even a low hallucination rate is dangerous because outputs are presented in polished, authoritative language that can mislead busy clinicians or administrators.
The study Developing and evaluating large language model–generated emergency medicine handoff notes compared LLM-generated clinical notes with physician-written notes and found higher rates of incorrectness in the model outputs: roughly 9.6% incorrectness for LLM notes versus 2.0% for physician notes. Though many errors in that study were not catastrophic, the phenomenon is real and measurable. When hallucinations affect patient-facing or decision-influencing content, patient safety can quickly be jeopardized.
Real-world examples highlight the stakes: a high-profile error in a Google research write-up, where the invented term “basilar ganglia” conflated the basal ganglia with the basilar artery, showed how model-style mistakes can slip into clinical materials and be missed by reviewers, raising alarms about automation bias.
Many public LLMs are trained on static datasets that stop at a certain date. That means guidance about treatments, drug approvals, or clinical protocols can be out of date. As the study Dated Data: Tracing Knowledge Cutoffs in Large Language Models notes, “Large Language Models (LLMs) are often paired with a reported cutoff date, the time at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information.” In medicine, where guidelines change and new evidence appears frequently, relying on a model that doesn’t automatically reference the latest literature risks recommending obsolete or unsafe actions.
Even when models are updated more frequently, they are not substitutes for curated, peer-reviewed clinical guidance or local formularies. For use cases such as patient education or administrative drafting, LLMs can help with tone and structure, but they must be paired with verified, up-to-date clinical checks before the content is used with patients.
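To illustrate that “draft with the model, verify with a clinician” pattern, here is a minimal Python sketch. The class, field names, and reviewer are hypothetical; the only point is that publication is gated on a documented human check against current guidance, not that any particular system works this way.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PatientEducationDraft:
    """An LLM-generated draft that stays unpublished until a clinician signs off."""
    text: str
    generated_on: date
    reviewed_by: Optional[str] = None
    guideline_version: Optional[str] = None

    def approve(self, clinician: str, guideline_version: str) -> None:
        # Record who verified the draft and which guideline edition it was checked against.
        self.reviewed_by = clinician
        self.guideline_version = guideline_version

    @property
    def publishable(self) -> bool:
        # Only drafts with a documented clinical review may reach patients.
        return self.reviewed_by is not None and self.guideline_version is not None

# Usage: the LLM supplies wording and structure; the clinician supplies the clinical check.
draft = PatientEducationDraft(text="Plain-language summary of post-op wound care...",
                              generated_on=date.today())
draft.approve(clinician="J. Rivera, RN", guideline_version="Surgical care guideline, 2024 ed.")
assert draft.publishable
```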
LLMs reflect the data they were trained on: enormous amounts of internet text. That data often contains gaps, stereotypes, and systemic biases. In healthcare, algorithmic bias can mean models under-recognize symptoms in certain populations, suggest options that aren’t culturally appropriate, or amplify disparities by privileging the majority group’s language and norms.
Research documents how bias can enter AI systems at different stages—data collection, labeling, model design, and deployment—and how these biases reproduce or worsen health inequities if not actively mitigated. That’s why diverse datasets, fairness testing, and stakeholder engagement must be part of any AI adoption plan in healthcare.
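As a toy illustration of the fairness-testing idea (the subgroups, field names, and data below are invented for this example), one basic check is to stratify a quality signal, such as human-flagged problems in LLM drafts, by patient subgroup and compare the rates. Real fairness audits use richer metrics, larger samples, and statistical testing.

```python
from collections import defaultdict

def flagged_rate_by_group(records):
    """Rate of human-flagged problems in LLM outputs, stratified by patient subgroup.

    Each record is a dict like {"group": "...", "flagged": True/False} produced by
    a human review process; large gaps between groups warrant investigation.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for rec in records:
        totals[rec["group"]] += 1
        flagged[rec["group"]] += int(rec["flagged"])
    return {group: flagged[group] / totals[group] for group in totals}

# Toy data: a gap like this between subgroups should trigger review before wider rollout.
reviews = [
    {"group": "English-speaking", "flagged": False},
    {"group": "English-speaking", "flagged": False},
    {"group": "English-speaking", "flagged": True},
    {"group": "non-English-speaking", "flagged": True},
    {"group": "non-English-speaking", "flagged": True},
    {"group": "non-English-speaking", "flagged": False},
]
print(flagged_rate_by_group(reviews))
# {'English-speaking': 0.33..., 'non-English-speaking': 0.66...}
```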
David Holt noted that “as of today, it can only work with text”—an important operational limitation for many consumer LLMs. Clinical work often depends on multimodal data: imaging (X-rays, CTs), waveforms (ECGs), scans, and photos. While specialized multimodal models are emerging, generic public LLMs are not designed to parse or interpret clinical images, nor to integrate them meaningfully into diagnostic reasoning.
Even in pure-text tasks, LLMs tend to produce generalist answers. They may miss the nuance required for complex cases: differential diagnosis subtleties, drug interactions in polypharmacy, dose adjustments for renal impairment, or contraindications tied to comorbidities. Those gaps make them unsuitable as a replacement for clinical judgment.
Perhaps the most immediate operational risk for providers is privacy and regulatory compliance. Public versions of consumer LLM platforms do not enter into business associate agreements (BAAs) with covered entities, and data sent to those services can be retained and used to improve the models. That means putting protected health information (PHI) into a public LLM may constitute a HIPAA violation or lead to a data breach.
Guidance and analyses from privacy experts have been clear: without a HIPAA-compliant contractual and technical arrangement (a signed BAA, zero-data-retention endpoints, and an enterprise offering with proper controls), clinicians and staff should not paste PHI into non-medical LLMs.
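To make that guidance concrete, below is a deliberately naive, hypothetical pre-submission guard in Python: it blocks a prompt from reaching a public model when simple patterns for common identifiers match. The patterns and the call_llm stub are illustrative assumptions only; an actual control would pair a vetted de-identification or DLP service with the contractual safeguards described above.

```python
import re

# Naive patterns for a few obvious identifier formats. A real control would rely on a
# vetted de-identification or data-loss-prevention (DLP) tool, not ad-hoc regexes.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "date_of_birth": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def screen_for_phi(prompt: str) -> list:
    """Return the identifier types detected in a prompt before it leaves the network."""
    return [name for name, pattern in PHI_PATTERNS.items() if pattern.search(prompt)]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the actual model call, shown only so the guard is runnable."""
    raise NotImplementedError("wire this to an approved, BAA-covered endpoint")

def submit_to_public_llm(prompt: str) -> str:
    hits = screen_for_phi(prompt)
    if hits:
        # Block the request and route the task to a compliant, BAA-covered workflow instead.
        raise ValueError(f"Possible PHI detected ({', '.join(hits)}); request blocked.")
    return call_llm(prompt)
```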
According to the Association of Health Care Journalists, ECRI has flagged the use of inadequately governed AI in healthcare as a top health-technology hazard. In emergency and acute settings, where time pressures are high and decisions have immediate consequences, misleading AI outputs can do disproportionate harm. The combination of trusted language, time pressure, and clinician automation bias forms a dangerous vector if unchecked.
Beyond accidental hallucinations, adversarial inputs and transcription errors also pose practical dangers. The study Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support demonstrates that LLMs can be manipulated into producing fabricated clinical content, and related tools such as speech-to-text systems have been found to insert fabricated sentences or misattribute statements when transcribing medical conversations. Even a small percentage of errors in clinical transcripts can have outsized consequences in legal or clinical documentation.
Read also: Hospitals use a transcription tool powered by an error-prone OpenAI model
Drawing on findings from Implementing large language models in healthcare while balancing control, collaboration, costs and security, a systematic review of AI adoption in healthcare whose authors stress stakeholder engagement, continuous monitoring, workflow alignment, and ethical governance, we can derive a core set of safe-use practices for large language models (LLMs) in clinical settings: involve clinicians and compliance stakeholders before deployment, monitor model outputs continuously, align tools with existing clinical workflows, and put ethical governance structures in place.
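One way to make “continuous monitoring” concrete is an audit trail of every LLM interaction that governance staff can sample and review. The bare-bones Python sketch below is an assumption-laden illustration, not a reference implementation: the file path, field names, and roles are invented, and it logs summaries rather than raw prompts specifically to keep PHI out of the trail. Production systems would write to access-controlled, governed storage.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = "llm_audit_log.jsonl"  # illustrative path; production logs belong in governed storage

def log_llm_interaction(user_role: str, purpose: str, model: str,
                        prompt_summary: str, output_summary: str,
                        human_reviewed: bool) -> None:
    """Append one auditable record per LLM use so a governance team can sample and review."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_role": user_role,            # e.g., "front-desk staff"; never patient identifiers
        "purpose": purpose,                # e.g., "draft appointment-reminder template"
        "model": model,
        "prompt_summary": prompt_summary,  # summaries, not raw prompts, to keep PHI out of the log
        "output_summary": output_summary,
        "human_reviewed": human_reviewed,
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example entry recorded after a staff member uses an approved model for drafting.
log_llm_interaction(user_role="front-desk staff",
                    purpose="draft appointment-reminder template",
                    model="approved-enterprise-model",
                    prompt_summary="generic reminder wording, no patient data",
                    output_summary="two-paragraph reminder template",
                    human_reviewed=True)
```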
See also: HIPAA Compliant Email: The Definitive Guide (2025 Update)
Non-medical LLMs are large language models like ChatGPT or Gemini that were trained on general internet data rather than healthcare-specific, peer-reviewed medical datasets. They can write or summarize text effectively, but were not designed for clinical accuracy, safety, or compliance with healthcare regulations such as HIPAA.
Some enterprise-grade platforms, such as Microsoft Azure OpenAI Service, can offer HIPAA compliance if a business associate agreement (BAA) is in place and appropriate data-handling safeguards are configured. Always confirm this directly with the vendor before using PHI.
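As a rough sketch only, a request routed through an organization's own Azure OpenAI deployment might look like the following with the openai Python SDK. The endpoint, deployment name, and API version are placeholders, and none of this code confers compliance by itself: the BAA and data-handling safeguards are contractual and configuration-level, set up in the Azure tenant rather than in application code.

```python
import os
from openai import AzureOpenAI  # the openai Python SDK's Azure client

# The compliance-critical pieces (a signed BAA, data-retention and network controls)
# live in the contract and the Azure resource configuration, not in this snippet.
client = AzureOpenAI(
    azure_endpoint="https://your-resource-name.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # placeholder; confirm the currently supported version
)

response = client.chat.completions.create(
    model="your-deployment-name",  # the deployment configured in your Azure resource
    messages=[{"role": "user", "content": "Draft a plain-language flu-shot reminder."}],
)
print(response.choices[0].message.content)
```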
If PHI has been entered into a public LLM, treat it as a potential HIPAA breach. Notify your compliance officer immediately, document the exposure, and follow your organization's breach response plan. Evaluate whether the data can be contained and whether patient notification or HHS reporting is required.