How many of us have heard one of our stakeholders assert that their data set is de-identified? As privacy compliance professionals, we are often tasked with critically evaluating data sets and guiding stakeholders to fully understand what is and what is not considered de-identified. Another challenge is that stakeholders may not conduct de-identification often and, therefore, may experience definition drift.
This article will explore the two permissible methods of de-identification under HIPAA. We will focus primarily on the safe harbor method as this is the most frequently used method, including exploring special considerations for zip codes, dates, and the “18th identifier”; however, we will touch on the expert determination method as well. Remember, once a data set meets the HIPAA parameters for de-identification, it is no longer subject to HIPAA. We will also differentiate a limited data set from a de-identified data set, as stakeholders often confuse these terms.
Definition of protected health information
Before discussing de-identification methods, let’s review the definition of protected health information (PHI) as described by HIPAA. PHI is individually identifiable health information, including demographic data, that is held by a covered entity or a business associate and that relates to the individual’s past, present, or future physical or mental health condition; the provision of healthcare to the individual; or the past, present, or future payment for the provision of healthcare to the individual. The information must either identify the individual or provide a reasonable basis to believe it could identify the individual.[1] This definition is very broad, and it is important that the team performing de-identification fully understands it.
Data classification
Data classification is a mechanism that aids in de-identification. Where you are subject to HIPAA and the patient data is in a structured data set, three simple classifications are recommended. The first is a fully identifiable data set, which contains one or more direct identifiers. Direct identifiers are those from the list of 18 identifiers below that cannot be included in a limited data set, for example, name and address. The second is a limited data set, which contains no direct identifiers; the only identifiers it includes are indirect identifiers. This classification should conform to the HIPAA definition of a limited data set. The third is a de-identified data set, which contains no direct or indirect identifiers and conforms to the requirements of safe harbor de-identification under HIPAA.
Limited data sets
A limited data set is a data set from which the direct identifiers have been removed. It may still contain dates (such as admission, discharge, and procedure dates, date of birth, and date of death); city, state, and zip code; and ages in years, months, days, or hours. Recipients of a limited data set must sign a data use agreement and comply with its terms and conditions.
A data use agreement limits the use of the limited data set to the following uses: healthcare operations, public health, and research. The data use agreement should limit uses to those negotiated between the organization holding the data set and the recipient.
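The three classifications described under "Data classification" can be sketched as a first-pass triage of a structured data set's column names. This is a minimal illustration, not a compliance determination: the column names and both identifier sets below are hypothetical, and a real data dictionary would need its own mapping plus a review of the actual values (free-text fields in particular).

```python
from enum import Enum


class DataSetClass(Enum):
    FULLY_IDENTIFIABLE = "fully identifiable"
    LIMITED = "limited data set"
    DE_IDENTIFIED = "de-identified"


# Hypothetical column names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "street_address", "phone", "email", "ssn", "mrn"}
# Elements a limited data set may keep but a safe harbor data set may not.
INDIRECT_IDENTIFIERS = {
    "admission_date", "discharge_date", "procedure_date",
    "birth_date", "death_date", "city", "zip_code",
}


def classify_columns(columns):
    """Return a provisional classification based on column names alone."""
    cols = {c.strip().lower() for c in columns}
    if cols & DIRECT_IDENTIFIERS:
        return DataSetClass.FULLY_IDENTIFIABLE
    if cols & INDIRECT_IDENTIFIERS:
        return DataSetClass.LIMITED
    # No listed identifier found; the 18th identifier still needs human review.
    return DataSetClass.DE_IDENTIFIED


print(classify_columns(["Name", "diagnosis"]).value)
print(classify_columns(["admission_date", "zip_code", "diagnosis"]).value)
print(classify_columns(["state", "admission_year", "diagnosis"]).value)
```

A result of `DE_IDENTIFIED` here only means that no listed column name was found; it does not replace the documented assessment the article describes.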
Know your audience
Individuals often don’t understand the definitions of limited data sets and de-identified data sets. To assist in understanding, consider distributing a list of data elements or a data collection form to potential recipients and asking them to highlight the fields they are requesting. Never assume a customer knows these definitions.
Safe harbor de-identification
The simpler method of de-identification is safe harbor de-identification.[2] In this process, 18 identifiers are removed from the data set. Every one of the identifiers must be removed for the data set to be considered safe harbor de-identified.
When completing safe harbor de-identification, all of the following identifiers of the individual (patient) and of the individual’s relatives, employers, and household members must be removed:
- Names, including first, last, and middle.
  - Partial names and/or initials are not permitted in a safe harbor de-identified data set. The full name must be removed.
- All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes.
- All elements of dates (except year) for dates that are directly related to an individual, including but not limited to dates of birth, admission, discharge, procedure, death, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements of date may be aggregated into a single category of age 90 or older.
- Telephone numbers.
- Vehicle identifiers and serial numbers, including license plate numbers.
- Fax numbers.
- Device identifiers and serial numbers, including but not limited to medical device serial numbers.
- Email addresses.
- Web Universal Resource Locators (URLs).
- Social Security numbers.
- Internet Protocol (IP) addresses.
- Medical record numbers.
- Biometric identifiers, including finger and voice prints.
- Health plan beneficiary numbers including but not limited to Health Insurance Claim (HIC) numbers.
- Full-face photographs and any comparable image.
- Account numbers.
- Certificate/license numbers.
- Any other unique identifying number, characteristic, or code.
The recipient of the safe harbor de-identified data set must not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information. The user who creates the de-identified data set must assess each data set to determine whether the recipient has such knowledge. If the recipient does, the data set cannot be disclosed to that recipient as safe harbor de-identified. The assessment must be documented in writing.
There are several considerations that should be taken into account when using the safe harbor de-identification method.
Zip codes
In safe harbor de-identification, the initial three digits of the ZIP code may be retained if, according to the current publicly available data from the U.S. Census Bureau, the geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people. For every such geographic unit containing 20,000 or fewer people, the initial three digits must be changed to 000.
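The ZIP code rule above can be sketched as a small function. This is a minimal illustration under stated assumptions: the function name and the `zip3_population` lookup are hypothetical, the population figures in the usage example are made up, and a real implementation would load current U.S. Census Bureau figures for each three-digit prefix. Prefixes missing from the lookup are treated conservatively as low population.

```python
def generalize_zip(zip_code, zip3_population, threshold=20_000):
    """Return the 3-digit ZIP prefix, or "000" when it must be suppressed."""
    # Drop any ZIP+4 suffix and restore leading zeros lost in numeric storage.
    base = str(zip_code).strip().split("-")[0].zfill(5)
    if len(base) != 5 or not base.isdigit():
        return "000"  # Unparseable values are suppressed rather than guessed.
    prefix = base[:3]
    population = zip3_population.get(prefix)
    # Keep the prefix only when the combined area has MORE than 20,000 people.
    if population is None or population <= threshold:
        return "000"
    return prefix


# Made-up populations for illustration only; not Census data.
example_population = {"100": 1_500_000, "999": 12_000}

print(generalize_zip("10001", example_population))       # 100
print(generalize_zip("99901-1234", example_population))  # 000
print(generalize_zip("55555", example_population))       # 000 (prefix not in lookup)
```

Defaulting unknown prefixes to 000 is a design choice in this sketch: it loses some geographic detail but avoids retaining a prefix whose population has not been checked.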
Age
As previously mentioned, in safe harbor de-identification, all ages over 89 and all elements of dates (including year) indicative of such age must be removed, except that such ages and elements of date may be aggregated into a single category of age 90 or older. One must be precise when calculating ages in days: remember leap years, because not every year has 365 days. The user needs to ensure that at no time will ages greater than 89 be exposed, even in calculated elements of the data set. For this reason, if the data set requires an assessment of care over time, such as a longitudinal study in which an individual’s age is an element, a safe harbor de-identified data set is not appropriate due to the risk of violating the requirement to categorize ages greater than 89.
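The age rule and the leap-year caution above can be sketched as follows. This is a minimal illustration, assuming the data set holds a birth date and a reference date; the function names and the "90+" label are hypothetical choices, not prescribed wording. Age is computed by comparing calendar dates rather than by dividing a day count by 365, which can overstate age near a birthday.

```python
from datetime import date


def age_in_years(birth, as_of):
    """Completed years of age on as_of, using calendar comparison."""
    if as_of < birth:
        raise ValueError("as_of must not be earlier than birth")
    years = as_of.year - birth.year
    # Subtract one if this year's birthday has not yet occurred.
    if (as_of.month, as_of.day) < (birth.month, birth.day):
        years -= 1
    return years


def safe_harbor_age(birth, as_of):
    """Return age as text, aggregating all ages over 89 into one category."""
    age = age_in_years(birth, as_of)
    return "90+" if age >= 90 else str(age)


birth = date(1934, 3, 1)
print(safe_harbor_age(birth, date(2024, 2, 29)))  # 89 (day before 90th birthday)
print(safe_harbor_age(birth, date(2024, 3, 1)))   # 90+
# Dividing elapsed days by 365 would report 90 a day early for this person:
print((date(2024, 2, 29) - birth).days // 365)    # 90
```

Any field from which an age over 89 could be recalculated, such as year of birth, would need the same treatment in the output data set.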
The 18th identifier
The 18th identifier, defined as any other unique identifying number, characteristic, or code, is a catch-all for any additional identifying information that may be included in a data set. The data set must be critically evaluated to assess if any element would fit into this category.
Note, however, that a key that is not derived from PHI and is kept secure is not considered a “unique identifying number, characteristic, or code” for purposes of this catch-all.
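One way to produce a key that is not derived from PHI is to generate it randomly, as sketched below. This is an illustration under stated assumptions: the function name, the record identifiers, and the 16-character code length are hypothetical. The codes come from Python's `secrets` module rather than from a hash or transformation of the medical record number, so nothing about the individual can be recovered from the code itself.

```python
import secrets


def assign_study_codes(record_ids):
    """Map each source identifier to a random code not derived from it."""
    crosswalk = {}
    used = set()
    for rid in record_ids:
        if rid in crosswalk:
            continue  # Same source record keeps the same code.
        code = secrets.token_hex(8)  # 16 hex characters of random data.
        while code in used:          # Collisions are unlikely, but re-draw if one occurs.
            code = secrets.token_hex(8)
        used.add(code)
        crosswalk[rid] = code
    return crosswalk


# Hypothetical identifiers for illustration.
crosswalk = assign_study_codes(["MRN-001", "MRN-002", "MRN-001"])
print(len(crosswalk))  # 2
```

The crosswalk itself links codes back to individuals, so in this sketch it would be stored securely by the data holder and never shipped with the de-identified data set.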
Elapsed times
One proposed way to avoid using dates is to use elapsed times. An elapsed time is the amount of time between two events. For example, rather than capturing the actual admission and procedure dates, the data set would record that six days elapsed between admission and procedure.
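The elapsed-time approach can be sketched as a transformation that replaces two dates with the number of days between them. This is a minimal illustration; the field names `admission_date`, `procedure_date`, and `days_admission_to_procedure` are hypothetical.

```python
from datetime import date


def to_elapsed(record):
    """Return a copy of record with the two dates replaced by elapsed days."""
    admitted = record["admission_date"]
    procedure = record["procedure_date"]
    if procedure < admitted:
        raise ValueError("procedure_date precedes admission_date")
    # Copy every field except the actual dates.
    result = {k: v for k, v in record.items()
              if k not in ("admission_date", "procedure_date")}
    # Date subtraction counts real calendar days, so leap days are included.
    result["days_admission_to_procedure"] = (procedure - admitted).days
    return result


row = {
    "study_code": "a1b2",
    "admission_date": date(2024, 2, 27),
    "procedure_date": date(2024, 3, 4),
}
print(to_elapsed(row))
# {'study_code': 'a1b2', 'days_admission_to_procedure': 6}
```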
Expert determination de-identification
The other method of de-identification is expert determination de-identification.[3] In this process, the data set is reviewed by a person with appropriate knowledge and experience who applies generally accepted statistical and scientific principles to determine whether the risk that the information is individually identifiable is very small. In addition to deeming a data set statistically de-identified, the expert generally places stipulations on the certification, for example, expectations for safeguarding the data set. Typically, an expert determination is valid for a limited period, generally from six months to a year or two. While not required, the expert is often an external third party. When preparing to conduct expert determination de-identification, one must evaluate both the qualifications of the statistician and the statistician’s availability to provide de-identification services.
Additional protections for de-identified data provided to third parties
Borrowing from state consumer privacy laws, HIPAA-governed entities that share de-identified data with third parties should consider contractually prohibiting those third parties from re-identifying the data, whether by linkage to other readily available data sets or otherwise. Of course, if a third party does re-identify the data, the data returns to being considered PHI and is again subject to HIPAA. HIPAA-governed entities should also continue to monitor how recipients use the de-identified data.
Conclusion
Stakeholders often don’t fully understand the definition of de-identification. As privacy compliance professionals, understanding the appropriate de-identification methods allows us to provide adequate guidance to our stakeholders and minimizes the potential for inappropriate use and impermissible disclosures. If de-identified data can be used in place of PHI, privacy and security risks are greatly reduced but not eliminated. Careful oversight is required to ensure the de-identified data remains de-identified, especially as technologies continue to evolve. With the proliferation of data, the risk that de-identified data will be linked to other sources and re-identified is a genuine possibility.
Takeaways
- As privacy compliance professionals, we are often tasked with critically evaluating data sets and guiding stakeholders to fully understand what is and what is not considered de-identified.
- Stakeholders often don’t understand the definition of de-identification and don’t conduct de-identification frequently.
- Data classification and creation of standard data sets may assist stakeholders and avoid misunderstandings. Stakeholders must differentiate between de-identified and limited data sets.
- There are two methods of HIPAA de-identification: safe harbor and expert determination.
- Special consideration must be given to certain elements of protected health information when conducting safe harbor de-identification, including zip codes and ages.