HIPAA restricts how covered entities can use and disclose protected health information (PHI). PHI includes any individually identifiable health information, from medical records and billing data to conversations between patients and providers.
However, once health information is properly de-identified, it is no longer considered PHI and falls outside HIPAA's restrictions. According to 45 CFR § 164.514(a), "Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information." This creates a pathway for using healthcare data in AI development while preserving patient privacy.
Learn more: When do you have to use de-identified data sets?
HIPAA provides two methods for de-identifying data, each with its own requirements and use cases:
The first is the Expert Determination method, which relies on statistical and scientific principles. Under this approach, a qualified expert applies statistical or scientific methods to determine that the risk of re-identifying an individual is very small. As specified in 45 CFR § 164.514(b)(1), the expert must determine that "the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information." The expert must also document "the methods and results of the analysis that justify such determination." This method offers flexibility and is useful for datasets containing valuable information that would otherwise need to be removed.
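To make the statistical character of Expert Determination concrete, here is a minimal sketch of one proxy an analyst might compute: how distinguishable records are when grouped by quasi-identifiers (the smallest group size is the "k" in k-anonymity). The quasi-identifier columns, sample data, and pandas-based approach are assumptions for illustration; an actual expert determination involves far more than a single metric.

```python
# Illustrative sketch only: one statistical proxy an expert might examine is the
# size of the "equivalence classes" formed by quasi-identifiers. Column names
# and sample data are hypothetical.
import pandas as pd

# Assumed quasi-identifiers remaining after direct identifiers are removed.
QUASI_IDENTIFIERS = ["zip3", "birth_year", "sex"]

def reidentification_risk_summary(df: pd.DataFrame) -> dict:
    """Group records by quasi-identifiers and summarize how distinguishable they are."""
    class_sizes = df.groupby(QUASI_IDENTIFIERS).size()
    unique_records = int((class_sizes == 1).sum())  # singleton classes = uniquely distinguishable records
    return {
        "records": len(df),
        "min_class_size": int(class_sizes.min()),  # the "k" in k-anonymity
        "unique_records": unique_records,
        "pct_unique": unique_records / len(df),
    }

sample = pd.DataFrame({
    "zip3": ["028", "028", "104", "104", "104"],
    "birth_year": [1958, 1958, 1971, 1971, 1990],
    "sex": ["F", "F", "M", "M", "F"],
    "diagnosis": ["I10", "E11", "I10", "J45", "C50"],
})
print(reidentification_risk_summary(sample))
```

In this invented sample, one record is unique on its quasi-identifiers, which is exactly the kind of finding an expert would flag for further generalization or suppression.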
The second method is the Safe Harbor approach, which provides a checklist of 18 identifiers that must be removed from a dataset. These identifiers include names, geographic subdivisions smaller than a state, dates directly related to an individual (except year), telephone and fax numbers, email addresses, Social Security numbers, medical record numbers, account numbers, biometric identifiers, full-face photographs and comparable images, and any other unique identifying number, characteristic, or code.
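As a rough illustration of Safe Harbor-style redaction, the sketch below removes or generalizes a handful of the identifiers just listed from a structured record. The field names are hypothetical and the function covers only a small subset of the 18 identifiers; a production pipeline would need to address all of them.

```python
from datetime import date

# Hypothetical record layout; these fields cover only a few of the 18 Safe
# Harbor identifiers, purely for illustration.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "ssn", "medical_record_number", "phone", "email", "street_address",
}

def safe_harbor_redact(record: dict) -> dict:
    """Return a copy of the record with an illustrative subset of identifiers removed or generalized."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIER_FIELDS}

    # Dates directly related to the individual: keep the year only.
    if isinstance(out.get("admission_date"), date):
        out["admission_year"] = out.pop("admission_date").year

    # Geographic units smaller than a state: keep at most the first 3 ZIP digits
    # (Safe Harbor further requires 000 for sparsely populated ZIP3 areas).
    if "zip" in out:
        out["zip3"] = str(out.pop("zip"))[:3]

    # Ages over 89 must be aggregated into a single "90 or older" category.
    if out.get("age", 0) > 89:
        out["age"] = "90+"

    return out

record = {
    "name": "Jane Doe", "ssn": "123-45-6789", "zip": "02139",
    "admission_date": date(2021, 3, 14), "age": 93, "diagnosis": "E11",
}
print(safe_harbor_redact(record))
```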
Additionally, 45 CFR § 164.514(b)(2)(ii) requires that "the covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information." In other words, compliance is not only a matter of removing the listed identifiers; organizations must also consider whether the remaining data could be combined with other information to re-identify individuals.
Machine learning models require large and diverse datasets to learn patterns and make accurate predictions. Healthcare data is particularly valuable for developing AI tools that can diagnose diseases, predict patient outcomes, personalize treatments, and optimize healthcare operations.
By properly de-identifying data, healthcare organizations can contribute to AI innovation without risking patient privacy. A hospital could, for example, use de-identified electronic health records to train an AI model that predicts sepsis risk, or a health insurer could develop algorithms to identify potential fraud using de-identified claims data.
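As a hedged sketch of the sepsis-risk idea, the example below fits a simple classifier on synthetic, already de-identified vital-sign features. The feature names, data, and scikit-learn model are illustrative assumptions, not a clinically validated approach.

```python
# Hypothetical sketch: training a simple risk model on already de-identified
# features. All data here is synthetic; a real sepsis model would need richer
# inputs and clinical validation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "de-identified" features: heart rate, temperature, white cell count, lactate.
X = rng.normal(loc=[90, 37.2, 9.0, 1.5], scale=[15, 0.8, 3.0, 0.7], size=(500, 4))
# Synthetic label loosely tied to elevated lactate and heart rate, for illustration only.
y = ((X[:, 3] > 2.0) & (X[:, 0] > 100)).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted sepsis risk:", model.predict_proba([[115, 38.5, 14.0, 2.8]])[0, 1])
```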
However, organizations must be aware of the data privacy risks posed by generative AI systems. As noted in NIST's Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, generative AI systems may "leak, generate, or correctly infer sensitive information about individuals," and models may reveal "sensitive information (from the public domain) that was included in training data." These risks underscore why thorough de-identification practices matter when training AI models on healthcare data.
Modern datasets are large and sometimes include indirect identifiers that could potentially be linked to individuals through external data sources. For example, a dataset showing a rare disease diagnosis in a small geographic area might be re-identifiable even without obvious identifiers.
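The linkage concern can be illustrated with a small, entirely invented example: joining a "de-identified" table with a public dataset on shared quasi-identifiers can re-attach a name to a diagnosis.

```python
# Hedged illustration of a linkage risk. All data and column names are invented.
import pandas as pd

deidentified = pd.DataFrame({
    "zip3": ["028", "104"],
    "birth_year": [1958, 1971],
    "sex": ["F", "M"],
    "diagnosis": ["rare disease X", "E11"],
})

public_registry = pd.DataFrame({   # e.g., a voter roll or news report
    "name": ["Jane Doe", "John Roe"],
    "zip3": ["028", "104"],
    "birth_year": [1958, 1971],
    "sex": ["F", "M"],
})

# A unique match on the quasi-identifiers re-attaches a name to a diagnosis.
linked = deidentified.merge(public_registry, on=["zip3", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])
```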
The Safe Harbor method, while providing clear guidelines, can also remove so much information that the data loses its value for training AI. Organizations must therefore balance privacy protection with data usability.
Re-identification risks also change as technology advances and more data becomes publicly available. Data that seemed de-identified five years ago might be vulnerable to re-identification today through advanced matching techniques or newly available datasets. This requires ongoing assessment and potential re-application of de-identification methods.
De-identification is more difficult for unstructured data such as clinical notes. As research on De-identification of free text data containing personal health information has found, no single approach can reliably remove all personal health identifying information from population data records. This shows that de-identification is not a one-size-fits-all process but requires layered approaches tailored to the specific characteristics of the data.
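As a rough sketch of why free text is hard, the example below scrubs a fabricated note fragment with a few regular expressions. Pattern matching alone is precisely the kind of single approach the research above found insufficient, so this would be one layer among several, not a solution.

```python
# Deliberately simple, regex-only sketch of free-text scrubbing. The patterns
# and note are invented; real pipelines combine rules, NLP models, and review.
import re

PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

def scrub_note(text: str) -> str:
    """Replace matched identifiers with labeled placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt seen 3/14/2021, MRN 448291, call back at 617-555-0187 re: A1c results."
print(scrub_note(note))
```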
Organizations seeking to use de-identified data for AI training should adopt a best-practice approach to privacy protection.
Yes, if de-identified data is later combined with other information in a way that enables re-identification, it may fall back under HIPAA scrutiny.
No, HIPAA does not require patient authorization once data is properly de-identified.
Yes, business associates may de-identify data if their agreements permit it and HIPAA requirements are followed.
State and international privacy laws may impose stricter standards than HIPAA, even for de-identified data.
HIPAA allows sharing of properly de-identified data, but contractual, ethical, and reputational risks still apply.