Patient records that are to be shared within the research community must have any identifying information removed. Removing that information by hand is prohibitively expensive and time-consuming, so considerable research by many investigators has focussed on developing automated techniques for "de-identifying" medical records.
A team from the Massachusetts Institute of Technology (MIT) funded by the National Institutes of Health (NIH) aimed to solve this problem, pointing out that: "Text-based patient medical records are a vital resource in research. The expense of manual de-identification, coupled with the fact that it is time-consuming and prone to error, necessitates automatic methods for large-scale de-identification."
The MIT team tested their censoring software, introduced in the journal BMC Medical Informatics and Decision Making, on a meticulously hand-annotated database of 1,836 nursing notes (a total of 296,400 words). According to the authors, "The software successfully deleted more than 94% of the confidential information, while wrongly deleting only 0.2% of the useful content. This is significantly better than one expert working alone, at least as good as two trained medical professionals checking each other's work and many, many times faster than either."
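For readers curious how two such percentages can be derived from hand-annotated notes, the following is a minimal sketch, not the MIT team's evaluation code. It assumes scoring is done word by word, which the article does not state; the function name `deidentification_scores` and the toy labels are hypothetical and are not the study's data.

```python
def deidentification_scores(is_identifying, was_deleted):
    """Return (share of identifying words deleted, share of useful words deleted).

    Both arguments are equal-length lists of booleans, one entry per word.
    Word-level scoring is an assumption made for this sketch.
    """
    if len(is_identifying) != len(was_deleted):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(is_identifying, was_deleted))
    identifying_total = sum(1 for ident, _ in pairs if ident)
    useful_total = len(pairs) - identifying_total

    # Identifying words the software correctly removed.
    identifying_deleted = sum(1 for ident, deleted in pairs if ident and deleted)
    # Useful (non-identifying) words the software removed by mistake.
    useful_deleted = sum(1 for ident, deleted in pairs if not ident and deleted)

    deleted_share = identifying_deleted / identifying_total if identifying_total else 0.0
    wrongly_deleted_share = useful_deleted / useful_total if useful_total else 0.0
    return deleted_share, wrongly_deleted_share


# Made-up toy note of 10 words: 2 are identifying, 8 are useful content.
is_identifying = [False, True, True, False, False, False, False, False, False, False]
was_deleted = [False, True, False, True, False, False, False, False, False, False]

deleted_share, wrongly_deleted_share = deidentification_scores(is_identifying, was_deleted)
print(f"identifying words deleted: {deleted_share:.1%}")
print(f"useful words wrongly deleted: {wrongly_deleted_share:.1%}")
```

In this toy case one of the two identifying words is removed and one of the eight useful words is lost. The first figure corresponds to the kind of measure behind "more than 94%" in the authors' statement, and the second to the kind behind "0.2%".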
The MIT team is also providing access to the fully scrubbed, annotated data together with the software, so that others can improve their own systems and adapt the software to other types of data that may have different characteristics.
This article is adapted from the original press release.