One year ago I came across a profession I had not been aware of: “medical coder.” In Switzerland, but also in the US, medical reports, written by medical doctors in “natural medical language,” are translated by medical coders into a series of diagnosis codes using the ICD (International Classification of Diseases). There are other catalogues of codes to delineate operations performed (e.g. during surgery); CHOP in Switzerland is one example. Medical coders are highly skilled professionals; they need to understand medical terms and classify them into precise codes, like J13 for “Pneumonia due to Streptococcus pneumoniae” and J12 for “Viral pneumonia, not elsewhere classified”.
Medical coding is a highly skilled profession in part because the assignment of codes has a deep economic impact: every code has a cost associated with it (the cost of a treatment, of the actions taken during a surgery, etc.). It is therefore quite expensive for clinics to convert reports into billing codes to be passed to the “payers” (insurers or public healthcare), and for the payers themselves, who want to countercheck them before paying.
Balzano Informatik started a project with the goal of automatically converting reports into sets of codes. Many thousands of scanned reports in PDF format were collected, along with the codes that had been manually assigned by medical coders, to train, validate and test an artificial-intelligence system. We had to OCR the reports, anonymize them and then train neural networks to translate them into the language of ICD or CHOP codes.
A lot of work had to be done in preprocessing. Every word was reduced to its root (stemming and lemmatization), so that, for instance, “go”, “goes”, “went” and “gone” all become “go” in English, in order to simplify the text. All the so-called “stop words”, like “and”, “then” etc., which carry little meaning, were removed. And all of this had to be done for reports written in German, which poses the additional problem of compound words (words formed by concatenating other words).
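To make the two preprocessing steps concrete, here is a deliberately tiny sketch in Python. The stop-word list and the lemma lookup table are toy stand-ins (the real pipeline used full German stemming/lemmatization and much larger stop-word lists):

```python
# Toy stop-word list; real pipelines use large, language-specific lists.
STOP_WORDS = {"and", "then", "the", "a", "of"}

# Toy lemma lookup standing in for a real stemmer/lemmatizer.
LEMMAS = {"goes": "go", "went": "go", "gone": "go"}

def preprocess(text: str) -> list[str]:
    """Lowercase, drop stop words, and map each word to its root."""
    tokens = text.lower().split()
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The patient went home and then goes back"))
# → ['patient', 'go', 'home', 'go', 'back']
```

The output keeps only the content-bearing words, each in its root form, which is what the later stages of the pipeline consume.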
Then, with word2vec algorithms, we transformed the words into vectors in a vector space. This operation is called “embedding”. As well explained by Wikipedia, “word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.”
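The key property of such an embedding — contextually similar words end up close together — is usually measured with cosine similarity. The vectors below are made-up three-dimensional toys (real word2vec vectors have hundreds of dimensions and are learned from a large corpus), used only to illustrate the idea:

```python
import math

# Toy "embeddings"; the values are invented for illustration.
vectors = {
    "pneumonia": [0.9, 0.1, 0.3],
    "infection": [0.8, 0.2, 0.4],
    "surgery":   [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Words that share contexts in the corpus lie closer together:
assert cosine(vectors["pneumonia"], vectors["infection"]) > \
       cosine(vectors["pneumonia"], vectors["surgery"])
```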
An algorithm was developed to identify the most common pathologies and operations and convert them into the right ICD-10 and CHOP codes. It combined the results of Convolutional Neural Networks (CNNs, which excel at identifying patterns of words in a phrase) with recurrent networks (like LSTM or GRU, which excel at processing sequences). “Attention” techniques (HAN, Hierarchical Attention Networks) allowed for the identification of the important words and statements in the reports.
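How the branch outputs are combined can be sketched in a few lines: the feature vector from the convolutional branch and the one from the recurrent branch are concatenated and fed to a linear layer that produces one score per candidate code. All numbers and the two code labels below are toy values, not the project's actual weights:

```python
# Sketch of the ensemble idea: concatenate the two branch outputs and
# score every candidate code with a linear layer (toy weights, 2 codes).

def score_codes(cnn_features, rnn_features, weights, biases):
    """Linear classifier over the concatenated branch outputs."""
    combined = cnn_features + rnn_features          # concatenation
    return [sum(w * x for w, x in zip(row, combined)) + b
            for row, b in zip(weights, biases)]

cnn_out = [0.8, 0.1]        # e.g. pooled n-gram responses
rnn_out = [0.3, 0.9]        # e.g. final hidden state of an LSTM/GRU
W = [[1.0, 0.0, 0.0, 1.0],  # toy weights for code "J13"
     [0.0, 1.0, 1.0, 0.0]]  # toy weights for code "J12"
scores = score_codes(cnn_out, rnn_out, W, biases=[0.0, 0.0])
print(scores)   # one raw score per candidate code
```

In a real model the weights are learned, there are thousands of candidate codes, and the raw scores are passed through a sigmoid or softmax; the concatenate-then-classify structure, however, is the same.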
In general, this approach worked well for identifying both ICD and CHOP codes from the unstructured text of the reports. However, some challenges prevented us from identifying all ICD and CHOP codes reliably. Some codes were underrepresented in the reports and were therefore not recognized with satisfactory accuracy. And some reports were only available as scanned documents, which often caused problems in the OCR step: misrecognized words confused the neural network and decreased accuracy.
And human medical coders still have a big advantage over A.I.: they can call the doctor who wrote the report and ask for clarification, something A.I. can’t yet do…
The system shows the suggested codes (on the right) and highlights the important phrases in the medical report in blue. Selecting a code highlights the corresponding phrase.
Convolutional Neural Networks (CNN) look for 2-word, 3-word, 4-word patterns or longer ones, and then combine them to form a single feature vector, which is used for classification (every code is a class). The “embedded” words serve as input. In the picture, the embedding dimension is d=5, but usually d is 100, 200 or 300.
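The mechanics of one such n-gram filter can be shown in plain Python: the filter slides over every window of n consecutive word vectors, computes a match score at each position, and max-pooling keeps the strongest response. Here d=2 and the filter weights are toy values; a real model learns hundreds of filters over vectors of dimension 100–300:

```python
def ngram_feature(embedded_words, filt):
    """Apply one n-gram filter over a sentence and max-pool the responses."""
    n = len(filt)                      # the filter spans n consecutive words
    responses = []
    for i in range(len(embedded_words) - n + 1):
        window = embedded_words[i:i + n]
        # dot product of the flattened window with the flattened filter
        score = sum(w * f
                    for wv, fv in zip(window, filt)
                    for w, f in zip(wv, fv))
        responses.append(score)
    return max(responses)              # max-pooling over all positions

sentence = [[0.1, 0.2], [0.9, 0.8], [0.7, 0.6], [0.0, 0.1]]  # 4 words, d=2
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]                      # spans 2 words
print(ngram_feature(sentence, bigram_filter))   # strongest response (≈ 1.5)
```

One number per filter comes out of this step; stacking the outputs of many filters yields the single feature vector used for classification.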
With Hierarchical Attention Networks (HAN), not only single words are encoded (i.e. transformed into vectors), but also sentences and whole documents.
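The attention step at the word level can be sketched as follows: each word vector gets a relevance score, the scores are softmax-normalized into weights, and the sentence vector is the weighted average of the word vectors. In this toy the scores come from a fixed “context” vector; a real HAN learns that vector and repeats the same step one level up to combine sentence vectors into a document vector:

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(word_vectors, context):
    # relevance of each word = dot product with the context vector
    scores = [sum(w * c for w, c in zip(wv, context)) for wv in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    # sentence vector = attention-weighted average of the word vectors
    sentence = [sum(a * wv[k] for a, wv in zip(weights, word_vectors))
                for k in range(dim)]
    return sentence, weights

words = [[0.1, 0.0], [0.9, 0.2], [0.2, 0.1]]   # 3 embedded words, d=2
sentence_vec, weights = attend(words, context=[1.0, 0.0])
# the second word scores highest, so it dominates the sentence vector
assert max(weights) == weights[1]
```

The same weights are what makes the highlighting shown above possible: the words with the largest attention weights are the ones marked as important in the report.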