With healthcare workforce shortages and administrative burdens, AI systems could improve the clinical reasoning workflow.
A recent JMIR study assessed the utility of ChatGPT throughout the entire clinical workflow. The researchers tested whether the chatbot could handle a physician's full slate of decisions, including differential diagnosis, diagnostic testing, final diagnosis, and patient management, with the nuance and accuracy those tasks demand.
According to their results, "ChatGPT was 71.7% accurate (95% CI 69.3%-74.1%) on all 36 clinical vignettes combined." While that may not meet the standard for board certification, it is far better than guessing and could serve as a benchmark for the early adoption of clinical decision support systems.
Their research showed that ChatGPT performed better when given more clinical information, mirroring the iterative, give-and-take way clinicians reach conclusions. The model did best with "successive prompting," where each new piece of information narrows its conclusions.
More specifically, "ChatGPT did best at final diagnosis with 76.9% accuracy," while its weakest performance came at the start of the diagnostic process, "drawing up an initial differential diagnosis with 60.3% accuracy."
This is noteworthy because the earliest stage of clinical reasoning often demands expansive, creative thinking, something LLMs still struggle to mimic. The study adds, "ChatGPT performed worse on differential diagnosis [and clinical management] question types."
These results suggest that while ChatGPT is a capable assistant, it simply cannot replace clinicians. Its most natural strength is hypothesis filtering and pattern identification after a premise has been set.
For example, if a clinician is managing a patient with long-standing diabetes and they suspect early kidney involvement based on subtle lab changes, they can enter the data into ChatGPT. The clinician can then receive confirmation that the findings align with early diabetic nephropathy and get suggestions for next steps, like ordering a urine albumin test or adjusting medication. This makes ChatGPT particularly useful for refining ongoing care rather than generating diagnoses from scratch.
Ultimately, ChatGPT is best used in consultative or chronic care scenarios, where clinicians already have a suspicion about what they're looking for and need affirmation or direction.
There are many ways ChatGPT can help healthcare providers in their clinical workflows. Consider a family practitioner confronted with a baffling presentation of symptoms. With every successive test result or patient disclosure, they can prompt ChatGPT to refine the diagnostic hypothesis. Rather than replacing the doctor, ChatGPT can suggest possibilities, flag outliers, and even challenge assumptions.
For example, if an elderly patient presents with fever and confusion, a doctor might initially think of a urinary tract infection or pneumonia. Given the specific signs and lab results, however, ChatGPT might lean toward sepsis and suggest further investigation for possible meningitis, as in the sketch below.
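To make "successive prompting" concrete, here is a minimal sketch of how a clinician-facing tool might feed findings to the model one at a time so that each update narrows the differential. It assumes the OpenAI Python SDK; the model name, prompts, and clinical values are illustrative placeholders, not the study's protocol or a production decision support system.

```python
# Minimal sketch of successive prompting: each new clinical finding is
# appended to the conversation so the model can narrow its differential.
# Assumes the OpenAI Python SDK and an API key in OPENAI_API_KEY.
# Illustrative only -- not a clinical decision support product.
from openai import OpenAI

client = OpenAI()

messages = [
    {
        "role": "system",
        "content": "You are a clinical reasoning assistant. For each update, "
                   "list the leading differential diagnoses and suggested next steps.",
    },
    {
        "role": "user",
        "content": "An 82-year-old patient presents with fever and new-onset confusion.",
    },
]

# New information arrives over the course of the workup (hypothetical values).
new_findings = [
    "Urinalysis is unremarkable and the chest X-ray shows no infiltrate.",
    "Lactate is 4.1 mmol/L, WBC 18,000, and blood pressure is trending down.",
    "The patient now has neck stiffness and photophobia.",
]

for finding in new_findings:
    messages.append({"role": "user", "content": finding})
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=messages,
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"--- After: {finding}\n{answer}\n")
```

Because the full conversation history is resent with each call, every new finding can shift the model's ranking of diagnoses, mirroring the step-by-step narrowing described above.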
This kind of iterative decision-making distinguishes ChatGPT from traditional symptom checkers and makes it more than a simple search engine. However, the excitement surrounding chatbots like ChatGPT must be balanced against the known limitations of LLMs.
According to the study, there are two main dangers:
Model hallucinations occur when the AI produces reasonable-sounding but incorrect information. For example, during a consultation about chest pain, ChatGPT might suggest prescribing a medication that was withdrawn from the market years ago due to safety concerns. Although the recommendation sounds medically plausible, it's entirely inaccurate, which poses serious risks if a clinician doesn't catch it.
Just as concerning is the lack of transparency in ChatGPT's training data. We can't entirely determine what medical journals, texts, or patient records (if any) were used to inform the model's outputs. Without that transparency, one can't begin to assess biases, knowledge gaps, or the risk of recommending outdated interventions.
For example, if ChatGPT recommends a diagnostic approach for stroke that omits newer guidelines on thrombectomy time windows, it's impossible to know whether that’s due to outdated training data or missing sources entirely. Without knowing what materials the model was trained on, clinicians can’t assess whether its advice reflects current best practices or embedded biases.
As healthcare becomes more reliant on AI, we must ask who will benefit and who will be left behind. Will smaller practices that cannot afford LLMs fall even further behind? Will clinicians begin to rely too heavily on AI at the expense of their own critical thinking skills? Could language models unknowingly perpetuate diagnostic bias by age, gender, or race?
These are not merely academic concerns: the study evaluated ChatGPT's performance partly "based on patient age, gender, and case acuity." Although the model was stable across some variables, even small variations could accumulate in a clinical environment. Ongoing monitoring and auditing will be necessary.
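As a rough illustration of what such auditing might involve, the sketch below computes accuracy by patient subgroup from a hypothetical log of graded model answers; the column names and values are assumptions for demonstration, not the study's data.

```python
# Hypothetical subgroup audit: compare model accuracy across patient
# demographics and case acuity to spot performance gaps.
import pandas as pd

# Each row is one graded model answer (1 = judged correct, 0 = incorrect).
log = pd.DataFrame({
    "age_group": ["<40", "<40", "40-65", "40-65", "65+", "65+"],
    "gender":    ["F", "M", "F", "M", "F", "M"],
    "acuity":    ["low", "high", "low", "high", "low", "high"],
    "correct":   [1, 1, 1, 0, 0, 1],
})

# Accuracy per subgroup; large gaps would warrant closer review.
for col in ["age_group", "gender", "acuity"]:
    print(log.groupby(col)["correct"].mean().rename(f"accuracy_by_{col}"), "\n")
```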
If ChatGPT is going to be integrated into real patient care, it will not be through a public-facing chatbot. Rather, clinical settings will need highly secure interfaces that comply with regulatory safeguards such as the Health Insurance Portability and Accountability Act (HIPAA). That is where HIPAA compliant email solutions, like Paubox, can help.
Physicians, nurses, and care coordinators rely on email to exchange lab results, coordinate treatment plans, and communicate details about progress. Regular email software is not secure enough to transmit patient health data. Healthcare providers can instead use Paubox email to encrypt patients' protected health information (PHI) in transit and at rest.
Through Paubox, senders can share encrypted patient data, whether it is an AI-generated diagnostic summary or a shared care note, with fellow clinicians or even the patient, without risking a HIPAA breach. The solution automatically encrypts outgoing email for hassle-free delivery, with no need for the recipient to log in to a portal.
For example, a pediatrician might use ChatGPT to generate a diagnostic report and send it via Paubox to a specialist at a neurodevelopmental clinic. The specialist can then reply with recommendations, and both emails are encrypted and retained for auditing purposes.
In the long run, pairing AI with secure email supports collaboration while honoring the legal mandate to protect patient information. It helps healthcare providers avoid costly HIPAA violations, including fines of up to $1.5 million per violation category per year, and the loss of patient trust that follows a breach.
Read also: How HIPAA compliance improves patient trust
The study shows that AI in medicine is still a work in progress; even so, ChatGPT and other LLMs have demonstrated that they can perform substantial portions of the clinical reasoning workflow with reasonable accuracy. The study states, "ChatGPT does [impressively] well on clinical decision-making with growing potency as it is presented with more clinical information at its disposal."
Still, while the results are impressive, the model is not infallible, and in medicine, accuracy is non-negotiable. What we need now is continuous benchmarking: more studies like this one, using standardized tools to assess LLM performance across different patient groups and specialties.
Additionally, we need greater transparency. Developers must disclose more about their training data and model limitations. And these systems must be integrated into secure, user-friendly workflows, supported by solutions like Paubox.
Ultimately, artificial intelligence will not render doctors obsolete, but doctors who use AI responsibly and safely may well outpace those who don't. As clinicians contend with growing administrative responsibilities, physician burnout, and diagnostic complexity, technologies like ChatGPT can offer much-needed cognitive support.
Read also: Factors driving AI adoption in healthcare
A model hallucination occurs when ChatGPT generates convincing but false or outdated medical information. For example, ChatGPT might recommend a drug that was removed from the market years ago, making the response sound credible but entirely wrong. Such errors can mislead clinicians and lead to harmful or inappropriate treatment if not detected.
There's a lack of transparency about which medical sources ChatGPT was trained on, which makes it hard to verify the reliability of its recommendations. Clinicians can't assess whether the model's advice is current, complete, or biased. And if the training data is biased, the outputs could unintentionally reflect or amplify those biases.
Go deeper: Confronting racial bias in Artificial Intelligence (AI)
Yes, healthcare providers must ensure that AI-powered features comply with HIPAA regulations and industry best practices for data security and privacy. Additionally, providers should evaluate the reliability of AI algorithms to avoid potential risks or compliance issues.
Read also: HIPAA compliant email API