Remember my blog post about how in German you can string words together to form new ones? As mentioned in that post, this increases the number of possible terms for pathologies and can be frustrating when you are trying to find all variations of a single pathology. Now, I want to show you how lists of compounds can be written in a shortened form in German, which is grammatically correct and the most common way of doing it.
Let me begin with a short recap of the old blog post: In German, you can concatenate words to create a new one. The number of words is not limited, which can lead to extremely long and complex words. The problem is compounded further when the resulting words are homographs – words that look the same but have different meanings depending on their context. For example, “Stau” and “Becken” combine to “Staubecken” (water reservoir), and so do “Staub” and “Ecken” (dusty corners): the meanings are totally different, but once combined, the two are written the same. A human can distinguish them based on the context, or by hearing the word in a spoken conversation.
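To make the ambiguity concrete, here is a minimal sketch in Python that lists every way a compound can be split into known words. The function name `split_compound` and the tiny `TOY_VOCABULARY` are assumptions made up for this illustration; they are not part of any real medical vocabulary.

```python
def split_compound(word, vocabulary):
    """Return every way to split `word` into parts that all occur in `vocabulary`."""
    word = word.lower()
    if not word:
        return [[]]  # one way to split the empty string: no parts at all
    splits = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in vocabulary:
            # Keep this head and split the remainder recursively.
            for rest in split_compound(word[i:], vocabulary):
                splits.append([head] + rest)
    return splits


# Hypothetical toy vocabulary, just large enough for the example.
TOY_VOCABULARY = {"stau", "staub", "becken", "ecken", "plansch"}

print(split_compound("Staubecken", TOY_VOCABULARY))
# [['stau', 'becken'], ['staub', 'ecken']]
```

Both splits are valid as far as the vocabulary is concerned, which is exactly the problem: the word list alone cannot tell the water reservoir from the dusty corners. The sketch also ignores linking letters between the parts, so it is a simplification.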
One solution is to use hyphens to connect the different words of the compound. However, while connecting the parts with hyphens is in theory always allowed, it is not always done. To make it easier to understand how the shortened list of compounds works, I will write the compounds with hyphens. Starting from the known example “Stau-Becken” (water reservoir), we need a second compound that ends in the same word, for example “Plansch-Becken” (pool for small children). Now, if for some reason you write about both in an enumeration, water reservoir and kiddy pool, you can write “Stau- und Plansch-Becken” in German. For a human it is easy to connect the right parts and understand both compounds – even if you leave out the second hyphen and just write “Stau- und Planschbecken”.
For a machine, it is not that easy, especially if the second compound consists of more than two words and is written without hyphens. If we look at the artificial shortened list “A- and B-C-D”, the first compound could be either “A-C-D” or “A-D”. Deciding between the two is, in theory, a simple look-up, even if the second compound itself could be split in different ways (again, for example, either “Staub-Ecken” or “Stau-Becken”). Teaching a machine to find all possible splits and decide which one is correct is already difficult; connecting the right parts back together complicates the task further.
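The candidate generation for a shortened list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `expand_candidates` and `expand_phrase` are hypothetical, the pattern only handles the conjunction “und”, and the second compound must already be written with hyphens.

```python
import re

# Assumption: a shortened first part ("Stau-"), the conjunction "und",
# then a fully hyphenated compound ("Plansch-Becken").
SHORTENED_LIST = re.compile(r"(\w+)-\s+und\s+(\w+(?:-\w+)+)")


def expand_candidates(stem, parts):
    """Pair the shortened stem with every proper tail of the full compound."""
    return ["-".join([stem] + parts[i:]) for i in range(1, len(parts))]


def expand_phrase(phrase):
    """Return the candidate expansions of the shortened compound in `phrase`."""
    match = SHORTENED_LIST.fullmatch(phrase.strip())
    if match is None:
        return []
    stem, full_compound = match.groups()
    return expand_candidates(stem, full_compound.split("-"))


print(expand_candidates("A", ["B", "C", "D"]))    # ['A-C-D', 'A-D']
print(expand_phrase("Stau- und Plansch-Becken"))  # ['Stau-Becken']
```

For an unhyphenated form such as “Stau- und Planschbecken” the pattern finds nothing, because the second compound would first have to be split into its parts. Choosing among several candidates still requires a look-up.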
Now, where do I get the data for the look-up? A complete glossary of German medical terms describing pathologies and body parts, including all possible compounds, is not something I have at hand – that would have solved my issue with medical compounds pretty fast. This means I have to build the look-up from all the radiology reports I already have, plus all the variants I can imagine being used. Sadly, I have noticed that I am not good at imagining compounds for pathologies, so I have had to add a lot more possibilities to our medical vocabulary database. Fortunately, we have a good glossary to check against and are able to automatically identify most of the compounds and correctly “glue them together”.
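A look-up built from existing text could be sketched as follows in Python. Everything here is an assumption for illustration: `build_vocabulary` and `pick_known` are hypothetical helper names, and `TOY_REPORTS` contains made-up everyday sentences, not real radiology reports.

```python
import re
from collections import Counter


def build_vocabulary(reports):
    """Count every lower-cased word that occurs in the given texts."""
    counts = Counter()
    for text in reports:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def pick_known(candidates, vocabulary):
    """Keep the candidates whose glued form was seen, most frequent first."""
    def frequency(candidate):
        # Compare against the unhyphenated, lower-cased spelling.
        return vocabulary[candidate.replace("-", "").lower()]

    known = [c for c in candidates if frequency(c) > 0]
    return sorted(known, key=frequency, reverse=True)


# Made-up example sentences standing in for a report collection.
TOY_REPORTS = [
    "Das Staubecken ist voll.",
    "Das Planschbecken ist leer.",
]

vocabulary = build_vocabulary(TOY_REPORTS)
print(pick_known(["Stau-Planschbecken", "Stau-Becken"], vocabulary))
# ['Stau-Becken']
```

A candidate that never appears in the collected texts is simply dropped, which is why such a look-up is only as good as the texts and the imagined variants behind it.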