This idiom describes pretty accurately how we data scientists feel about the data for our machine learning tasks. Generally speaking, the more data we can feed our machine learning models during training, the better they get.
For our specific problem of identifying pathologies and other important issues on MRIs, this means we want as many MRIs as possible to train our networks. But to train a network efficiently for a specific pathology, it needs to know which pathologies are visible on the MRIs we provide during training. If we use an MRI as input when training for disk herniations, we need to tell the network, for every single MRI, whether a disk herniation is visible or not. Easy, right? Theoretically perhaps, but in practice, not so much.
Although we can collaborate with different hospitals to collect more MRI data, the information about the pathologies is written down as unstructured, continuous text called a “radiology report.” The problem is that the networks we want to train to identify pathologies do not understand radiology reports. Therefore, we need to extract the information about pathologies from these texts and provide it to the network in a structured form. We call this process “label extraction,” and we combine many different natural language processing (NLP) methods into a pipeline that takes radiology reports as input and produces structured labels as output.
The label extraction process can be simplified to the following steps:
- Identify all body parts mentioned in the report (e.g. which vertebra, or which ligament in the knee).
- Identify all pathologies (e.g. a ligament tear or a disk herniation).
- Identify whether each pathology is mentioned as present (e.g. “disk herniation”) or as absent (e.g. “no disk herniation”).
- Associate each pathology with the correct body part (e.g. which specific disk has a herniation).
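To make these four steps concrete, here is a minimal rule-based sketch. It is not our actual pipeline: the vocabularies, the negation cues, the `Label` class and all function names are hypothetical, and it works on English toy sentences. It assumes that a pathology belongs to every body part mentioned in the same sentence, and that a negation cue anywhere before the pathology negates it.

```python
import re
from dataclasses import dataclass

# Hypothetical toy vocabularies; a real system needs far larger, curated lists.
BODY_PARTS = ["L4/L5", "L5/S1", "anterior cruciate ligament"]
PATHOLOGIES = ["disk herniation", "tear"]
NEGATION_PATTERN = re.compile(r"\b(no|without)\b")


@dataclass(frozen=True)
class Label:
    body_part: str
    pathology: str
    present: bool


def find_terms(sentence, vocabulary):
    """Steps 1 and 2: return the vocabulary terms that occur in the sentence."""
    lowered = sentence.lower()
    return [term for term in vocabulary if term.lower() in lowered]


def is_negated(sentence, pathology):
    """Step 3: treat a negation cue before the pathology mention as negating it."""
    lowered = sentence.lower()
    prefix = lowered[: lowered.index(pathology.lower())]
    return NEGATION_PATTERN.search(prefix) is not None


def extract_labels(report):
    """Step 4: pair each pathology with the body parts in the same sentence."""
    labels = []
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        body_parts = find_terms(sentence, BODY_PARTS)
        for pathology in find_terms(sentence, PATHOLOGIES):
            present = not is_negated(sentence, pathology)
            for body_part in body_parts:
                labels.append(Label(body_part, pathology, present))
    return labels


report = "Disk herniation at L4/L5. No disk herniation at L5/S1."
for label in extract_labels(report):
    print(label)
```

Even this toy version hints at where things go wrong: plain substring matching would find “tear” inside unrelated words, and one sentence can mention several body parts with different findings.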
During this process, we face many different challenges. One of the biggest problems in natural language processing is that language is verbose and ambiguous. In addition, each language has its own peculiarities. Germans love to chain as many words as possible together to create elaborate compound words. For example, the word “Dampfschifffahrtsgesellschaftskapitänswitwe” means the widow of the captain of the steamship company. Even simple compound words present a challenge for an NLP system. For example, the compound “Staubecken” has two meanings, depending on the context. Either it is “Staub-Ecken,” dusty corners that need to be cleaned, or it is “Stau-becken,” a water reservoir. This example shows how difficult it is for an NLP tool to understand a compound and split it into the correct components.
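The “Staubecken” ambiguity can be reproduced with a few lines of code. The sketch below is a naive dictionary-based splitter, not the method we use in production; the function `split_compound` and the four-word `LEXICON` are made up for illustration. It simply enumerates every way to cut a word into known lexicon entries.

```python
def split_compound(word, lexicon):
    """Return every segmentation of `word` into entries of `lexicon`."""
    word = word.lower()
    if not word:
        return [[]]  # exactly one way to split the empty remainder
    splits = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            # Keep this head and split whatever is left in every possible way.
            for tail in split_compound(word[end:], lexicon):
                splits.append([head] + tail)
    return splits


# Hypothetical mini lexicon, just large enough to show the ambiguity.
LEXICON = {"stau", "staub", "becken", "ecken"}

for parts in split_compound("Staubecken", LEXICON):
    print("-".join(parts))
```

With this lexicon the splitter finds both readings, “stau-becken” and “staub-ecken,” and has no way to choose between them. Picking the right one requires the surrounding context, which is exactly the hard part.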
Besides the language-specific issues, there are domain-specific challenges. Although solutions are available for many of these problems, we usually cannot use them as they are, because most NLP tools are trained on news articles, Wikipedia or other openly available text documents. In contrast, we deal with highly specific text that almost has its own grammar and an abundance of medical terms that are unknown to most NLP tools.
The more data we have, the more challenges will manifest themselves. Check out future blog posts to learn more about the never-ending story of NLP challenges we have to solve to extract pathology labels from radiology reports.