Healthcare data is a mess. Academic medicine is perhaps starting to realize this uncomfortable fact as it contemplates the use of Big Data to improve the traditional medical research approach, but those of us who have been in the trenches of population health management (PHM) for more than a decade are well versed in the challenges of refining health care data into a high-quality and actionable data set.
In a recent editorial, “Large databases (Big Data) and Evidence-Based Medicine” (European Journal of Internal Medicine, 2018-07-01, Volume 53, Pages 1-2), the authors appropriately recognize the limitations of the traditional randomized controlled trial (RCT) and suggest that Big Data analytics utilizing data from electronic health records (EHRs) and other sources could improve the development of evidence-based medicine (EBM) guidelines and protocols.
As the article notes:
“Many evidence-generating RCTs do not consider the complexity of patients assessed in everyday practice. Even when a trial is targeted to the elderly, the population enrolled is usually highly selected. Older patients, and women especially, are strikingly under-represented in RCTs, and patients with multimorbidity, a common phenomenon in older ages, are generally excluded. In addition, RCTs rarely address heterogeneity of the older population determined by occurrence or absence of multimorbidity, disability, polypharmacotherapy, factors which can influence a patient’s initial level of risk for a given outcome, responsiveness to treatment, or vulnerability to adverse effects.”
The idea of aggregating data from multiple EHRs and applying Big Data analytics to the resulting data set is a good one. The current reality, however, is that this approach is a significant challenge that requires specialized technology and human capital. Let’s take a look at some of the reasons, as I see it, that applying Big Data analytics to aggregated EHR data is much easier said than done, and why experience and the right technology are critical for success:
Terminology Overload – There are a large number of terminologies (also called code sets, dictionaries, etc.) used by EHR vendors, and many of these terminologies are custom and unique to the EHR vendor. Making sense of EHR data requires translation, mapping, and grouping of data in order to normalize it and make it actionable. Clinicians work with many hundreds of different terminologies, and about two-thirds of them are custom.
Vendor Inconsistency – The terminologies used by an EHR vendor often vary between products and even between versions of the same product. Updates to existing EHR software often change the way data is stored, which requires changes to how data is normalized. EHR vendors often do not communicate such changes, which adds another layer of complexity and increases the risk of missing data and inaccurate translation.
Provider Variability – How EHRs are used has an impact on data quality. Providers and staff members often do not use their EHR in the same way, which means pertinent information can be inconsistently and incompletely aggregated and normalized. Important data can be easily missed due to this variability in use.
Market Issues – Providers and their health care systems rip and replace their EHRs with some regularity, often multiple times. This affects data quality, because pertinent information may not be available in the latest EHR. Data from the last EHR is often not “copied over” to the next EHR due to the cost of doing this accurately for all of the reasons mentioned here.
EHR Coverage – Many patients who may be considered for or enrolled in a study do not receive all of their care at the academic medical centers participating in the study. This means that the information in EHRs used by community-based primary care and specialty care providers is important. Interfaces with multiple EHR vendors, and with multiple instances of each vendor’s products, will be needed to keep the aggregated and normalized data set as complete and up-to-date as possible. This is often a logistical challenge, and the EHR vendors often charge for these interfaces.
Likely recognizing some of these complexities, the authors said, “The need for large volumes of data could be met by the collaboration of many medical institutions sharing a common terminological system.” However, if you create your own terminology, you are still in the business of data aggregation and normalization. You are simply normalizing to your own terminology, so this really doesn’t reduce the complexity of the work required.
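To make that last point concrete, here is a minimal sketch in Python of what normalization to a shared terminology still involves. Every name in it is hypothetical and chosen only for illustration: the vendors, product versions, local codes, and target concept labels are not taken from any real EHR product or standard code set. The point is that each source system, and each version of it, still needs its own crosswalk entries, and that anything without an entry has to be caught rather than silently dropped.

```python
from typing import List, NamedTuple, Optional, Tuple


class RawResult(NamedTuple):
    """One lab result as exported from a source EHR (hypothetical shape)."""
    vendor: str
    version: str
    local_code: str
    value: float


# Hypothetical crosswalk: (vendor, product version, local code) -> shared concept.
# The same clinical concept appears under a different local code in each
# vendor, and even in different versions of the same vendor's product.
CROSSWALK = {
    ("vendor_a", "v1", "LAB-0417"): "HBA1C",
    ("vendor_a", "v2", "HGBA1C"): "HBA1C",
    ("vendor_b", "v1", "A1C_PCT"): "HBA1C",
    ("vendor_b", "v1", "GLU_FAST"): "FASTING_GLUCOSE",
}


def normalize(result: RawResult) -> Optional[str]:
    """Return the shared concept for a raw result, or None if unmapped."""
    key = (result.vendor.lower(), result.version.lower(), result.local_code.upper())
    return CROSSWALK.get(key)


def normalize_all(
    results: List[RawResult],
) -> Tuple[List[Tuple[str, float]], List[RawResult]]:
    """Split results into mapped (concept, value) pairs and unmapped leftovers."""
    mapped: List[Tuple[str, float]] = []
    unmapped: List[RawResult] = []
    for result in results:
        concept = normalize(result)
        if concept is None:
            # Keep unmapped rows for human review instead of discarding them.
            unmapped.append(result)
        else:
            mapped.append((concept, result.value))
    return mapped, unmapped
```

In this sketch, a result exported from a product version that has no crosswalk entry yet (say, a hypothetical "v3" of vendor_a) lands in the unmapped list. That is the kind of gap an unannounced vendor update can introduce, and it is why the mapping has to be maintained over time rather than built once.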
The authors went on to conclude, “The possibility that huge amounts of data from electronic health records may be analyzed and combined to identify best therapeutical [sic] options by assessment of matched cases, opens a complete [sic] new scenario about patient management and medical evidence.”
I do believe that Big Data analytics utilizing data from EHRs and multiple other sources is promising and should be studied to determine how such analytics can improve the accuracy of medical research. However, the complexities of data aggregation and normalization from multiple systems are non-trivial, and these efforts will require the right technology and experience to succeed.