An Extensible De-Identification Framework for Privacy Protection of Unstructured Health Information: Creating Sustainable Privacy Infrastructures
Abstract
The volume of unstructured health records has increased exponentially across healthcare settings, as has the number of healthcare providers that wish to exchange records. As a result, de-identification and the preservation of privacy have become increasingly important. Governance guidelines now require sensitive information to be masked or removed, yet this remains a difficult and often ad hoc task, particularly when dealing with unstructured text. Annotators are typically used to identify such sensitive information, but each may be effective only on certain kinds of text fragments. At present there is no hybrid, sustainable framework that aggregates different annotators. This paper proposes a novel framework that combines state-of-the-art annotators in order to maximize the effectiveness of de-identification of health information.