1. Executive Summary
Growing data volumes and differences in terminology, language, abbreviations, and even spelling in clinical documents make it challenging for medical professionals to extract relevant information for analytics. Beyond disparities in interpretation, there is the cost and time involved, and the effort usually falls on your top-rated staff, reducing their ability to focus on other aspects of clinical care.
Yet clinical operations continue to produce millions of documents (notes, lab result pdfs, and so forth) annually in varying formats, and this continues to increase. The sheer number of patients and the corresponding amount of unstructured data produced make it cumbersome to extract meaningful insights from clinical documentation.
Building an effective natural language processing (NLP) service that can handle these challenges (including the use of synonyms and acronyms) and incorporates the latest technology is beyond the ability of many healthcare companies. Fortunately, there has been considerable recent innovation in the healthcare NLP market, with improved models arriving every few months.
Our study examined three public cloud service offerings that use natural language processing to meet the challenge—Google Cloud Healthcare API, Amazon Comprehend Medical, and Microsoft Azure Text Analytics for Health. We manually annotated medical notes to identify terms within the documents from a common set of entities and relationships. Next, we built an annotation taxonomy by comparing the taxonomies of the three NLP solutions and created a standard mapping of the entities and relationships shared by all three platforms. We then compared our annotations to the annotations of each solution, using the annotation taxonomy, and noted false negatives (not desired), true positives (desired), and false positives (not desired).
Our findings: Google Cloud Healthcare API was the most accurate in identifying entities and relationships (true positives) and rarely produced false positives. Google had the highest precision (99%) and recall (93%). AWS had 97% precision and 90% recall, while Azure had 96% precision and 75% recall.
2. Product Landscape
Google Cloud Healthcare API
Launched in 2018, the Google Cloud Healthcare API is a fully managed solution for storing, accessing, and analyzing healthcare data within the Google Cloud Platform (GCP) umbrella. The API comprises three modality-specific interfaces that implement key industry-wide standards for healthcare data: HL7 FHIR, HL7 v2, and DICOM. Each of these interfaces is backed by a standards-compliant data store that provides read, write, search, and other operations on the data.
The platform bridges existing healthcare data systems and applications hosted on GCP and offers data analysis, machine learning, and application development capabilities. The data stores interface to GCP’s high-capacity publish-subscribe product (Cloud Pub/Sub) as an application integration point. From there, Cloud Pub/Sub can be used for:
- Invoking data transformations in Cloud Dataflow
- Executing serverless applications using Cloud Functions
- Streaming data into the BigQuery analytics engine
- Generating clinical outcome predictions by sending data to the Vertex artificial intelligence (AI) machine learning platform
- Sending documents to the Healthcare Natural Language API to find, assess, and link medical knowledge within textual data (which is what we are testing in this study)
The Healthcare Natural Language API extracts healthcare information from medical textual data. This healthcare information can include:
- Medical concepts – medications, procedures, and medical conditions
- Functional features – temporal relationships, subjects, and certainty assessments
- Relationships – side effects and medication dosages
The Healthcare Natural Language API supports a wide range of medical vocabularies and taxonomies, including but not limited to ICD-10, SNOMED-CT, gene nomenclatures, MedlinePlus, and RxNorm.
You do not need to create a dataset in the Cloud Healthcare API to use the Healthcare Natural Language API.
While the Healthcare Natural Language API offers pre-trained natural language models to extract medical concepts and relationships from medical documents, you can also elect to use Google Cloud’s AutoML Entity Extraction for Healthcare to create a custom entity extraction model trained using your own annotated medical text and your own categories. We did not test this feature.
Azure Cognitive Services for Language – Text Analytics for Health
Azure Cognitive Services for Language, under the Cognitive Services umbrella, is a set of machine learning and AI algorithms for developing intelligent applications that involve natural language processing. Text Analytics for Health extracts and labels relevant medical information from unstructured texts such as physician notes, discharge recommendations, clinical documents, and electronic health records.
The Text Analytics for Health API performs the following functions for English-language medical text documents:
- Named Entity Recognition (NER) – detects words and phrases mentioned in the unstructured text that can be associated with one or more semantic types, for example, diagnoses, medication names, symptoms, and so on.
- Relation extraction – identifies relational connections between entities mentioned in the text.
- Entity linking – disambiguates distinct entities by associating named entities to concepts found in a predefined database of concepts, such as the Unified Medical Language System (UMLS).
- Negation detection – identifies when an entity is negated; modifiers such as negation can greatly affect the meaning of a text, with critical implications if misinterpretation leads to a misdiagnosis.
Amazon Comprehend Medical
Part of the Amazon Web Services (AWS) cloud computing platform, Amazon Comprehend Medical is a natural language processing service that uses machine learning to extract health data from medical text.
Amazon Comprehend Medical is an API that extracts information such as medical conditions, medications, dosages, tests, treatments, procedures, and protected health information while retaining the context of the information. It can identify the relationships among the extracted information to help build applications for use cases like population health analytics, clinical trial management, pharmacovigilance, and summarization.
Amazon Comprehend Medical is also linked to medical ontologies, such as ICD-10-CM and RxNorm, to help build applications for use cases like revenue cycle management (medical coding), claim validation and processing, and electronic health records creation.
Like the other two services, Amazon Comprehend Medical is fully managed, so there are no resources to provision, and no machine learning models to build, train, or deploy.
Other Selection Criteria
Our study focuses on the efficacy of the named-entity recognition and relational extraction capabilities of these three platforms. However, these are not the only features you should consider in evaluating these solutions. Other criteria include:
- Integration – Is the solution future-ready and useful for bridging current technologies?
- Standards conformance – Do the APIs and data stores conform to standards, with updates added when necessary?
- Compliance with privacy regulations – Is the solution fully compliant with regulations like HIPAA in the US, PIPEDA in Canada, and other global privacy protocols?
- Certification – Is the solution backed by privacy and security features like ISO/IEC 27001, ISO/IEC 27017, ISO/IEC 27018, and HITRUST CSF certifications?
- Data location control – Does the solution allow you to select the storage location for data?
- Security – Is there integration with an Identity and Access Management (IAM) system?
- Bulk import and export – Can data be transferred in and out of cloud storage?
- De-identification – Does the solution have the ability to redact patient information from studies for research and other purposes?
- Auditability – Can both administrative and data access requests to the API be audited?
- High availability – Does the solution leverage highly redundant infrastructure?
3. Testing Setup and Methodology
The goal of our study was to assess the efficacy of the named-entity recognition and relational extraction capabilities of these three medical natural language processing solutions.
We constructed a testing scenario using the following methodology:
- Select a large, freely available dataset comprising de-identified health-related data.
- Randomly select 100 documents of medical notes made by care providers.
- Manually annotate the medical notes to identify terms within the documents from a common set of entities and relationships (we selected between seven and 95 tags per document depending on the size of the document).
- Submit all 100 documents to the three tested NLP solution APIs and collect their output.
- Compare the output of the APIs with the manual annotation and score their efficacy using industry-standard evaluation metrics.
To follow this test scenario, we used the components and methods detailed below.
Source Medical Data
For the source medical data, we used the MIMIC-III Clinical Dataset maintained by the PhysioNet platform, a research resource managed by members of the MIT Laboratory for Computational Physiology. We contacted PhysioNet and met the PhysioNet Credentialed Health Data License requirements to access the data.
The MIMIC-III database contains de-identified health-related data associated with over 40,000 patients who stayed in critical care units in the Beth Israel Deaconess Medical Center between 2001 and 2012. The data includes demographics, vital sign measurements, laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge). Much of this data is structured (CSV) or semi-structured; however, we were most interested in the de-identified unstructured free-form medical notes made by care providers. A total of 59,587 unique medical notes (in text format) were obtained after pre-processing the raw records.
The documents were placed on a cloud Linux virtual machine to be accessed by our annotation team. The documents were shuffled in random order, and the top 100 were selected for the tests.
For the manual annotation, we used BRAT, a free-to-use, point-and-click rapid annotation tool that runs as a server. The BRAT server is a Python (version 2.5+) application that runs by default as a CGI application on an Apache web server.
As Table 1 shows, we created an annotation taxonomy by comparing the taxonomies of the three NLP solutions and creating a common mapping of the entities and relationships shared by all three platforms.
Table 1. Compared Taxonomies
In the analysis, Google identifying something as a PROBLEM was equivalent to Azure identifying it as a DIAGNOSIS or ALLERGEN, for example.
The three APIs also shared a “PERSON” or patient entity, which was omitted because the source data was de-identified.
To submit the medical documents to the NLP APIs, we split some of the larger documents into parts because the APIs have message size limits. For example, Azure Text Analytics for Health accepts only 30,720 characters, while Google’s maximum medical text size is 20,000 Unicode characters. Therefore, we used the Linux utility dd to split the larger files into 5KB chunks. We overlapped the chunks by 128 bytes (so the last 128 bytes of chunk #1 would also be the first 128 bytes of chunk #2) to help ensure we did not orphan important entities from adjacent text the NLP engines might need for identification.
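The overlapped chunking we performed with dd can be sketched in Python; the default `chunk_size` and `overlap` values mirror the 5KB and 128-byte figures above, and the function name is ours:

```python
def chunk_with_overlap(data: bytes, chunk_size: int = 5 * 1024,
                       overlap: int = 128) -> list[bytes]:
    """Split data into chunk_size pieces, where each chunk repeats the
    last `overlap` bytes of its predecessor, so a short entity cut at a
    boundary still appears whole in at least one chunk."""
    if len(data) <= chunk_size:
        return [data]
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(data), step):
        chunks.append(data[start:start + chunk_size])
        if start + chunk_size >= len(data):
            break  # this chunk already reaches the end of the document
    return chunks
```

Concatenating the first chunk with each subsequent chunk minus its 128-byte prefix reconstructs the original document exactly, which is a quick way to validate the split.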
We submitted the document chunks to each of the three APIs as the payload. We captured the resulting JSON of all three APIs.
To test, we used the following versions of the three solutions:
- Google Cloud Healthcare Natural Language API (v1)
- Azure Text Analytics for Health (v3.1)
- Amazon Comprehend Medical (v2)
We then compared our labeled annotations to the output annotations of each solution, using the annotation taxonomy, and noted: False Negatives, True Positives, and False Positives.
So, for example, suppose “Insulin” is designated as “Medicine” in the annotated/tagged file:
- If the model doesn’t find “Insulin,” this is a False Negative (FN)
- If the model finds “Insulin” as a “Medicine,” this is a True Positive (TP)
- If the model finds “Insulin” as something else, say a “Condition”
- It counts as a False Positive (FP) for the “Condition” category
- It counts as a False Negative (FN) for the “Medicine” category
We followed this procedure for relationships as well and came up with a count of FNs, TPs, and FPs per file per solution.
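The tallying procedure above can be sketched as follows. The dict-of-term-to-tag representation is a simplification of our actual annotation format, which tracked character offsets:

```python
from collections import Counter

def score_document(gold: dict[str, str], predicted: dict[str, str]) -> Counter:
    """Tally TP/FP/FN for one document. Both mappings go from a term
    (e.g. "Insulin") to its assigned tag (e.g. "Medicine")."""
    counts = Counter()
    for term, tag in gold.items():
        pred_tag = predicted.get(term)
        if pred_tag is None:
            counts["FN"] += 1   # model missed the term entirely
        elif pred_tag == tag:
            counts["TP"] += 1   # correct term, correct tag
        else:
            counts["FP"] += 1   # spurious hit for the wrong category...
            counts["FN"] += 1   # ...and a miss for the gold category
    return counts
```

Summing these per-file Counters across all 100 documents for each solution yields the totals used in the evaluation.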
Overall, we found AWS made it easiest to identify RELATED_TO relationships because they were listed directly beneath the corresponding entity. For the Google model, we needed to look in the “relationship” section of the output. Azure frequently confused PROCEDURE and MEASUREMENT, tagging procedures as measurements in a high percentage of its false negatives.
Figure 1, Figure 2, and Figure 3 show the results of our tests. We issued 3,279 tags and checked the efficacy of each product for each tag.
Figure 1. Precision by Solution
Google’s solution provided the highest precision. Precision measures what proportion of the tags a model issues are correct. Amazon Comprehend Medical was second at 97%.
Figure 2. Recall by Solution
Google’s solution had the best recall. Recall estimates how many of the actual positives the model captures.
Figure 3. F1 by Solution
Google’s solution had the best F1 score, edging out Amazon Comprehend Medical. F1 is the harmonic mean of precision and recall and is used as a single statistical measure to rate performance.
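These three metrics follow directly from the TP/FP/FN counts. A minimal sketch; the example values in the comment are Google's reported precision and recall from this study:

```python
def precision(tp: int, fp: int) -> float:
    """Proportion of issued tags that were correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Proportion of actual (gold) tags the model captured."""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# With precision 0.99 and recall 0.93 (Google's reported scores),
# F1 works out to roughly 0.96.
print(round(f1(0.99, 0.93), 2))
```

Because F1 is a harmonic mean, it penalizes an imbalance between the two inputs, which is why Azure's 75% recall drags its F1 well below its 96% precision.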
Figure 4. Number of Correct Tags (True Positives) by Solution
Google had the most true positives, with v1 correctly tagging 92.2% of the set. Amazon Comprehend Medical came in second at 2,932 tags, representing 89.4%. Azure Text Analytics for Health came in at 74.7%.
Figure 5. Number of False Positives by Solution
Google’s solution was by far the best at avoiding false tagging, with a false positive rate of just .007% for v1. Azure Text Analytics for Health misidentified 3.4% of the set.
Figure 6. Number of False Negatives by Solution
Azure Text Analytics for Health had a relatively high number of false negatives, missing 25.2% of the tags. Amazon Comprehend Medical was next with 342 false negatives, while the Google solution performed best with 221 false negatives.
In conclusion, Google was the most accurate in tagging and rarely produced false positives. When the model tagged entities and relationships, it got them right in most cases, with very little mistagging: either it tagged successfully, or it did not tag at all.
GigaOm runs all of its performance tests to strict ethical standards. The results of the report are the objective results of the application of queries to the simulations described in the report. The report clearly defines the selected criteria and process used to establish the field test. The report also clearly states the data set sizes, the platforms, the queries, and other parameters used. Readers may determine for themselves how to qualify the information for their individual needs. The report does not make any claim regarding the third-party certification and presents the objective results received from the application of the process to the criteria as described in the report. The report strictly measures performance and does not purport to evaluate other factors that potential customers may find relevant when making a purchase decision.
This is a sponsored report. Google chose the competitors, the test, and the Google configuration. GigaOm chose the most compatible configurations for the other tested platforms and ran the testing workloads. Choosing compatible configurations is subject to judgment. We have attempted to describe our decisions in this paper.
7. About William McKnight
William McKnight is a former Fortune 50 technology executive and database engineer. An Ernst & Young Entrepreneur of the Year finalist and frequent best practices judge, he helps enterprise clients with action plans, architectures, strategies, and technology tools to manage information.
Currently, William is an analyst for GigaOm Research who takes corporate information and turns it into a bottom-line-enhancing asset. He has worked with Dong Energy, France Telecom, Pfizer, Samba Bank, ScotiaBank, Teva Pharmaceuticals, and Verizon, among many others. William focuses on delivering business value and solving business problems utilizing proven approaches in information management.
8. About Jake Dolezal
Jake Dolezal is a contributing analyst at GigaOm. He has two decades of experience in the information management field, with expertise in analytics, data warehousing, master data management, data governance, business intelligence, statistics, data modeling and integration, and visualization. Jake has solved technical problems across a broad range of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and restaurants. He has a doctorate in information management from Syracuse University.
9. About GigaOm
GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.
GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.
GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.