The dangers of using non-medical LLMs in healthcare communication
Tshedimoso Makhene
November 15, 2025
Artificial intelligence tools like ChatGPT are transforming how healthcare organizations manage communication, documentation, and patient engagement. From reducing administrative workloads to helping staff draft educational materials, large language models (LLMs) offer undeniable efficiency. Yet beneath that convenience lies a complex web of risks, from misinformation to compliance violations, that make using non-medical LLMs in healthcare a serious concern.
“ChatGPT definitely has the potential to make a big difference in healthcare by speeding up administrative work, helping staff, and making patient education more engaging. There are some important limitations to keep in mind,” says David Holt, Owner, Holt Law LLC. “First, it doesn’t actually ‘understand’ medicine—it can sound confident even when it gives incorrect or misleading information, which could be risky in a clinical setting. It’s also not up to date with the latest medical guidelines or treatments if you're using versions trained on older data. Another issue is bias—since ChatGPT was trained on large sets of data from the internet, it can reflect gaps and inequalities that already exist in healthcare, especially for underrepresented communities. Plus, as of today, it can only work with text, so it’s not helpful for anything that involves images, like X-rays or visual diagnoses. Sometimes the answers it gives can be too general or surface-level, missing the detail you’d need in complex medical situations. And maybe most importantly, the public versions aren’t HIPAA-compliant, which means using them with any patient data could lead to privacy risks or security breaches.”
Holt’s perspective points to the paradox of adopting AI in healthcare: while non-medical LLMs can streamline workflows and save time, they can also introduce hidden dangers that compromise accuracy, equity, and patient privacy.
Hallucinations
One of the clearest dangers is the tendency of LLMs to “hallucinate”: to confidently produce information that is incorrect, fabricated, or misleading. In clinical contexts, even a low rate of hallucination is dangerous because outputs are presented in polished, authoritative language that can mislead busy clinicians or administrators.
The study Developing and evaluating large language model–generated emergency medicine handoff notes compared LLM-generated clinical notes to physician-written notes and found higher rates of incorrectness in the model outputs: roughly 9.6% incorrectness for LLM notes versus 2.0% for physician notes. Although many errors in that study were not catastrophic, the phenomenon is real and measurable. When hallucinations affect patient-facing or decision-influencing content, patient safety can quickly be jeopardized.
Real-world examples highlight the stakes: a high-profile error in a Google research write-up, the invented term “basilar ganglia”, showed how model-generated mistakes can slip into clinical materials and be missed by reviewers, raising alarms about automation bias.
Read more:
- What are AI hallucinations?
- Google’s healthcare AI made up a body part — what happens when doctors don’t notice?
Outdated knowledge and the need for authoritative, timely sources
Many public LLMs are trained on static datasets that stop at a certain date. That means guidance about treatments, drug approvals, or clinical protocols can be out of date. As the study Dated Data: Tracing Knowledge Cutoffs in Large Language Models notes, “Large Language Models (LLMs) are often paired with a reported cutoff date, the time at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information.” In medicine, where guidelines change and new evidence appears frequently, relying on a model that doesn’t automatically reference the latest literature risks recommending obsolete or unsafe actions.
Even when models are updated more frequently, they are not substitutes for curated, peer-reviewed clinical guidance or local formularies. For use cases such as patient education or administrative drafting, LLMs can help with tone and structure, but they must be paired with verified, up-to-date clinical checks before the content is used with patients.
Algorithmic bias
LLMs reflect the data they were trained on: enormous amounts of internet text. That data often contains gaps, stereotypes, and systemic biases. In healthcare, algorithmic bias can mean models under-recognize symptoms in certain populations, suggest options that aren’t culturally appropriate, or amplify disparities by privileging the majority group’s language and norms.
Research documents how bias can enter AI systems at different stages—data collection, labeling, model design, and deployment—and how these biases reproduce or worsen health inequities if not actively mitigated. That’s why diverse datasets, fairness testing, and stakeholder engagement must be part of any AI adoption plan in healthcare.
Read more:
- Bias in medical AI: Implications for clinical decision-making
- Bias recognition and mitigation strategies in artificial intelligence healthcare applications
Modality and clinical detail limitations
David Holt noted that “as of today, it can only work with text”—an important operational limitation for many consumer LLMs. Clinical work often depends on multimodal data: imaging (X-rays, CTs), waveforms (ECGs), scans, and photos. While specialized multimodal models are emerging, generic public LLMs are not designed to parse or interpret clinical images, nor to integrate them meaningfully into diagnostic reasoning.
Even in pure-text tasks, LLMs tend to produce generalist answers. They may miss the nuance required for complex cases: differential diagnosis subtleties, drug interactions in polypharmacy, dose adjustments for renal impairment, or contraindications tied to comorbidities. Those gaps make them unsuitable to replace clinical judgment.
HIPAA and data governance
Perhaps the most immediate operational risk for providers is privacy and regulatory compliance. Public versions of consumer LLM platforms do not enter business associate agreements (BAAs) with covered entities, and data sent to those services can be retained and used to improve models. That means putting protected health information (PHI) into a public LLM may create a HIPAA violation or a data breach.
Guidance and analyses from privacy experts have been clear: without HIPAA-compliant contractual and technical arrangements, such as a signed BAA, zero-data-retention endpoints, or an enterprise offering with proper controls, clinicians and staff should not paste PHI into non-medical LLMs.
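To illustrate that guidance in practice, the minimal sketch below shows one way a team might screen drafts for obvious identifiers before anything reaches an external, non-BAA endpoint. The patterns and the redact_phi and safe_prompt helpers are hypothetical examples, not a complete de-identification solution; HIPAA's Safe Harbor method covers far more identifier categories than a few regexes can catch.

```python
import re

# Illustrative patterns only; real de-identification needs far more than regexes
# (HIPAA's Safe Harbor method lists 18 identifier categories).
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "mrn": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
}

def redact_phi(text: str) -> tuple[str, bool]:
    """Replace obvious identifiers with placeholders; report whether any were found."""
    found = False
    for label, pattern in PHI_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED-{label.upper()}]", text)
        found = found or count > 0
    return text, found

def safe_prompt(text: str) -> str:
    """Scrub a draft before it is sent to any external, non-BAA LLM endpoint."""
    scrubbed, had_phi = redact_phi(text)
    if had_phi:
        # In practice, route to a compliance review queue instead of silently scrubbing.
        print("Warning: possible PHI detected and redacted before sending.")
    return scrubbed

if __name__ == "__main__":
    draft = "Follow-up for John, MRN: 00123456, call (555) 123-4567 on 03/14/2025."
    print(safe_prompt(draft))
```

A pre-send check like this is a backstop, not a substitute for the contractual and technical safeguards described above.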
Safety hazards in high-stakes, time-sensitive environments
According to the Association of Health Care Journalists, ECRI has flagged the use of inadequately governed AI in healthcare as a top health-technology hazard. In emergency and acute settings, where time pressures are high and decisions have immediate consequences, misleading AI outputs can do disproportionate harm. The combination of authoritative-sounding language, time pressure, and clinician automation bias is dangerous if left unchecked.
Adversarial and transcription vulnerabilities
Beyond accidental hallucinations, adversarial inputs and transcription errors also pose practical dangers. The study Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support found that LLMs and associated tools (including speech-to-text systems) can be manipulated or can mistranscribe content, at times inserting fabricated sentences or misattributing statements in medical conversations. Even a small percentage of errors in clinical transcripts can have outsized consequences in legal or clinical documentation.
Read also: Hospitals use a transcription tool powered by an error-prone OpenAI model
How to use LLMs safely in healthcare
Drawing on findings from the systematic review Implementing large language models in healthcare while balancing control, collaboration, costs and security, whose authors stress stakeholder engagement, continuous monitoring, workflow alignment, and ethical governance, we can derive a robust set of safe-use practices for large language models (LLMs) in clinical settings:
- Engage multidisciplinary stakeholders early: The review showed that successful AI systems adopt a human-centered, problem-driven approach with engagement from clinicians, biomedical scientists, operational leads, IT staff, and patients. For LLM deployment: build a team that spans compliance, clinicians, legal, data science, and patient-education specialists. Define clearly what tasks the LLM will assist with (e.g., draft education materials, summarize admin forms) and map out human review steps.
- Define appropriate use cases for LLMs (low-risk first): The review found that many AI pilots falter when they are misaligned with clinical workflows or attempt to solve the wrong problem. For LLMs, begin with nonclinical, administrative, or patient-communication tasks rather than diagnostic or treatment decision roles. For example: generating patient onboarding emails, drafting FAQs, and summarizing policy updates, always with human oversight.
- Pair LLM outputs with curated, up-to-date knowledge sources: One key takeaway from the review is that model performance alone (e.g., retrospective accuracy) does not guarantee clinical utility or safety. For LLM use, ensure outputs reference verified clinical guidelines, formulary documents, or institution-specific protocols. Institute a retrieval-augmented generation (RAG) workflow: the LLM drafts from retrieved, approved source material, and a human then reviews against current evidence (see the RAG sketch after this list).
- Implement human-in-the-loop (HITL) workflows and preserve final clinical judgment: The review emphasized that AI should augment, not replace, human intelligence, particularly in healthcare, where nuance, judgment, and context matter. For LLMs: every piece of content used with patients should be reviewed by a qualified staff member; if the LLM influences clinical decision-making, that decision must be documented alongside the human reviewer’s sign-off.
- Continuous monitoring, maintenance, and feedback loops: Post-deployment monitoring was a major theme in the systematic review, which treats “monitor and maintain” as a core component of a reliable AI system. For LLMs: log usage, track error reports or feedback from staff and patients, and audit for hallucinations, bias, or misalignment (see the logging sketch after this list). Update workflows if model drift or new risk patterns appear.
- Bias testing, fairness assessment, and ethical governance: The literature shows why it's important to address algorithmic bias, fairness, ethics, and governance in AI adoption. For LLMs: conduct fairness audits across patient demographics (age, gender, ethnicity, and language dialects). Ensure that templates generated don’t perpetuate stereotypes or exclude underserved populations. Implement governance structures: ethics oversight and documentation of risk mitigation strategies.
- Privacy, compliance, and data protection safeguards: While the review focused broadly on AI systems, its governance findings apply equally to LLMs used in healthcare. Data privacy, security, traceability, and accountability are essential. For LLM use: avoid feeding unredacted PHI into public LLMs; use enterprise versions with zero retention or sign a business associate agreement (BAA) where required; document data flows, encryption, and access controls; and ensure that any content interacting with patient data aligns with Health Insurance Portability and Accountability Act (HIPAA) requirements and your local jurisdiction’s laws.
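To make the retrieval-augmented drafting point concrete, here is a minimal sketch of how approved source material could be pulled into the prompt and the result held for human review. The Guideline corpus, retrieve helper, and call_llm stub are illustrative assumptions, not part of the review cited above; a real deployment would use an indexed document store and a BAA-covered model endpoint.

```python
from dataclasses import dataclass

@dataclass
class Guideline:
    title: str
    text: str

# Placeholder corpus: in practice, an indexed store of institution-approved,
# up-to-date clinical and policy documents.
APPROVED_GUIDELINES = [
    Guideline("Flu vaccination FAQ (2025)", "Annual influenza vaccination is recommended for ..."),
    Guideline("Patient portal onboarding", "Patients activate their portal account by ..."),
]

def retrieve(question: str, corpus: list[Guideline], k: int = 2) -> list[Guideline]:
    """Toy keyword-overlap retrieval; real deployments would use a vector store."""
    q_terms = set(question.lower().split())
    scored = sorted(corpus, key=lambda g: -len(q_terms & set(g.text.lower().split())))
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Stub for whichever BAA-covered, enterprise LLM endpoint the organization uses.
    Returning a placeholder keeps the sketch runnable."""
    return "[DRAFT PLACEHOLDER -- replace call_llm() with an approved endpoint]"

def draft_patient_answer(question: str) -> dict:
    sources = retrieve(question, APPROVED_GUIDELINES)
    context = "\n\n".join(f"[{g.title}]\n{g.text}" for g in sources)
    prompt = (
        "Answer using ONLY the approved sources below. "
        "If they do not cover the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return {
        "draft": call_llm(prompt),
        "sources": [g.title for g in sources],
        "status": "PENDING_HUMAN_REVIEW",  # nothing goes to a patient before sign-off
    }

if __name__ == "__main__":
    result = draft_patient_answer("When should I get my flu shot?")
    print(result["status"], result["sources"])
```

The design point is that the model only ever paraphrases material the institution has already vetted, and the output is flagged as a draft until a qualified reviewer signs off.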
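The monitoring recommendation is likewise easier to operationalize when every LLM interaction leaves an auditable record. The sketch below assumes a simple append-only JSONL log and hypothetical field names; it stores hashes rather than raw text so the log itself never accumulates PHI.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed location; in practice, feed a proper audit system with access controls.
LOG_PATH = Path("llm_usage_log.jsonl")

def log_llm_interaction(user: str, task: str, prompt: str, output: str,
                        reviewer: str | None = None, issue: str | None = None) -> None:
    """Append one auditable record per LLM call."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "task": task,                     # e.g. "patient FAQ draft"
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
        "human_reviewer": reviewer,       # None until sign-off
        "reported_issue": issue,          # e.g. "hallucinated dosage", "biased phrasing"
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Records like these make it possible to spot drift, recurring hallucinations, or bias patterns during periodic audits without retaining patient text in the log.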
See also: HIPAA Compliant Email: The Definitive Guide (2025 Update)
FAQs
What are non-medical LLMs?
Non-medical LLMs are large language models like ChatGPT or Gemini that were trained on general internet data rather than healthcare-specific, peer-reviewed medical datasets. They can write or summarize text effectively, but were not designed for clinical accuracy, safety, or compliance with healthcare regulations such as HIPAA.
Are there HIPAA-compliant versions of ChatGPT or other LLMs?
Yes, some enterprise-grade platforms, such as Microsoft Azure OpenAI Service, can offer HIPAA compliance if a business associate agreement (BAA) is in place and appropriate data-handling safeguards are configured. Always confirm this directly with the vendor before using PHI.
What should an organization do if it accidentally enters PHI into a public LLM?
Treat it as a potential HIPAA breach. Notify your compliance officer immediately, document the exposure, and follow your organization’s breach response plan. Evaluate whether the data can be contained and whether patient notification or HHS reporting is required.
