by Corey Fourier, Senior Data Engineer
At ThinkGenetic, Inc., our goal is to reduce the time to diagnosis for genetic and rare conditions. Because these conditions are rare, we must be cognizant of the time it takes a clinician to validate results. Our algorithm evaluates two major types of data: structured and unstructured. This proof of concept was created to explore how transfer learning could improve the extraction of valuable information from unstructured text in Electronic Health Records (EHRs).
What is Structured Data?
ICD-10, SNOMED, and procedure codes
- Contains no ambiguity. Tells you exactly what the outcome of the visit was.
- Focused on billing. Limited or no family history. For rare conditions, some disease-specific codes don’t exist. Ex: Von Hippel-Lindau disease (VHL)
- The Office of Inspector General (OIG) recommends a coding accuracy of 95%. Depending on the audit method, this can be measured across many documents or at a per-code level. Coding many easy charts correctly while misclassifying the one patient with a rare condition can leave that patient out of a result set and still maintain 95% accuracy.
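To make the arithmetic in the last point concrete, here is a small sketch with invented numbers. The chart counts are assumptions chosen for illustration, not audit data.

```python
# Hypothetical audit: 1,000 coded charts, exactly one of which belongs to a
# patient with a rare condition. Both numbers are illustrative assumptions.
total_charts = 1000
rare_patients = 1

# Assume every common chart is coded correctly and the single rare chart is not.
correct_charts = total_charts - rare_patients
rare_patients_found = 0

overall_accuracy = correct_charts / total_charts
rare_recall = rare_patients_found / rare_patients

print(f"Overall coding accuracy: {overall_accuracy:.1%}")
print(f"Rare-condition recall:   {rare_recall:.1%}")
```

Under these assumptions the audit passes comfortably (99.9% accuracy) even though none of the rare-condition patients would appear in a code-based result set.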
What is Unstructured Data?
Free form text from many different sources
- Allows the clinician to capture symptoms, history, and thoughts that otherwise don’t have a place.
- Can contain literally anything in any format: poor punctuation, spelling mistakes, ambiguous abbreviations, and context that spans more than one sentence.
Working in the rare genetic disease space has unique challenges. Rareness means the patient may be undiagnosed or misdiagnosed. At ThinkGenetic, our algorithms are created by clinical staff who look for signs and symptoms in the patient’s medical record, both per encounter and across multiple encounters. If patients aren’t diagnosed with the condition or with a key marker, we must look at the unstructured data.
Importance of text
Text is extremely valuable to our success in helping identify at-risk individuals. Take VHL as an example: the diagnosis is a key indicator, but it has been documented with the ICD-10 code Q85.8, whose description reads “Other phakomatoses, not elsewhere classified”. That isn’t sufficient to easily identify patients with VHL, because the code doesn’t explicitly say they have the condition, so we have to look for VHL in the text. New in 2023 is an ICD-10 code specific to VHL (Q85.83), but it will not retroactively update old encounters. Similarly, patients previously coded with legacy codes like ICD-9 may not have updated ICD-10 codes in their records.
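The code-versus-text reasoning above can be sketched as a small check per encounter. This is a minimal illustration, not our production logic: the function name, the return labels, and the regular expression are hypothetical, and only the two codes come from the text above.

```python
import re

# Codes taken from the paragraph above: Q85.83 is specific to VHL (2023 onward),
# Q85.8 is the older catch-all that needs confirmation from the note text.
VHL_SPECIFIC_CODE = "Q85.83"
VHL_CATCH_ALL_CODE = "Q85.8"

# Hypothetical pattern; a real matcher would need many more spelling variants.
VHL_TEXT = re.compile(r"\bvon hippel[- ]lindau\b|\bVHL\b", re.IGNORECASE)


def vhl_evidence(codes, note_text):
    """Return 'code', 'code+text', 'text', or None for a single encounter."""
    if VHL_SPECIFIC_CODE in codes:
        return "code"
    mentioned = bool(VHL_TEXT.search(note_text))
    if VHL_CATCH_ALL_CODE in codes and mentioned:
        return "code+text"
    # A bare mention is only a lead: it may refer to a relative or to
    # patient education rather than to the patient's own diagnosis.
    return "text" if mentioned else None


print(vhl_evidence(["Q85.8"], "Known von Hippel-Lindau disease, followed yearly."))
```

A keyword match like this says nothing about who the mention refers to, which is the gap the rest of this article is about.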
Genetic conditions can be driven by what your family is or isn’t diagnosed with. Little to no structured data exists for family history, but it appears in many forms in the unstructured data.
Difficulty of text
Processing text is challenging, to say the least. Abbreviations commonly used among providers in a specific specialty can create confusion, especially when one abbreviation can represent multiple conditions. For example, CP can stand for Chest Pain or Cerebral Palsy. Clinical notes also contain past, present, and future information. Family history is challenging because it isn’t always an explicit section of a document (“Fx of diabetes.” or “Mother died of cancer.”). This information can also span more than one sentence, and it may or may not use the medical term for the condition (muscle weakness vs. hypotonia). Not all text is valuable, but the symptoms and diseases we need are mixed in with it. Any text can contain assertions, such as “no fever” or “fever neg”. This gets more complicated when many items are asserted in a list: “No cough, shortness of breath, sore throat, or fever”. Every symptom in the list is absent, even though “No” appears only before the first item.
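To illustrate the list-negation problem, here is a deliberately naive rule-based sketch that carries a leading negation cue across every item in a list. It is not the approach we ended up with; the cue list and function name are assumptions made for the example.

```python
import re

# Hypothetical, incomplete list of leading negation cues.
NEGATION_CUES = ("no ", "denies ", "without ")

# Split on commas (optionally followed by "or"/"and") or on a bare "or"/"and".
ITEM_SPLIT = re.compile(r",\s*(?:(?:or|and)\s+)?|\s+(?:or|and)\s+")


def list_assertions(sentence):
    """Map each listed item to 'absent' or 'present' based on a leading cue."""
    text = sentence.strip().rstrip(".").lower()
    negated = False
    for cue in NEGATION_CUES:
        if text.startswith(cue):
            negated = True
            text = text[len(cue):]
            break
    items = [item.strip() for item in ITEM_SPLIT.split(text) if item.strip()]
    status = "absent" if negated else "present"
    return {item: status for item in items}


print(list_assertions("No cough, shortness of breath, sore throat, or fever"))
print(list_assertions("fever neg"))
```

The first call marks all four symptoms as absent, but the second returns "fever neg" as a present finding because the cue trails the symptom. Each new phrasing needs another rule, which is why this style of processing becomes brittle.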
We started out with an iterative approach to this problem as we encountered more and more diseases and document types. NER (Named Entity Recognition) has been the gold standard for some time and has worked reasonably well for us. As we progressed, we had to constantly revisit the process to understand whether the “problem” related to the patient, a family member, a side effect of a medication, or patient education (for example, “von Hippel-Lindau disease is an inheritable condition with an incidence of 1 in 36,000 live births”). NER lacked proper context and lost track when a new sentence started. It couldn’t capture long-term dependencies across the text.
In looking for a solution, we found papers using BERT in various labeling capacities. Labeling is the process of taking a body of text (a clinical note) and attaching labels that indicate what the text contains. The results were exciting but did not scale, because the examples used only the most popular codes in the data set. T5 builds on the ideas behind the popular BERT model but goes one step further by having both an encoder and a decoder. T5 stands for Text-to-Text Transfer Transformer and was published by Google Research in 2020. Its architecture gives it the ability to perform a wide array of tasks, such as language translation, Q&A, linguistic acceptability, sentence similarity, and document summarization.
T5’s behavior for summarization and language translation overlaps with extracting important information and possibly converting it to code descriptions. The feasibility of this idea stemmed from the work of Shi and Lin (2019) on semantic role labeling and relation extraction. In that paper, they used transformers in a non-typical way to identify information in text. We quickly created a proof of concept and put together a small fine-tuning data set. The results were impressive given that the model (T5-Large, 770 million parameters) was pretrained on C4 (Colossal Clean Crawled Corpus), which is essentially web pages, and then fine-tuned on a custom data set of approximately 300 examples (an 80%/20% split between training and validation).
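As a rough sketch of the data preparation described above, the snippet below turns (note, target) pairs into text-to-text rows and splits them 80/20. The task prefix, the field names, the seed, and the placeholder data are all assumptions; the article does not state which ones were actually used.

```python
import random

# Hypothetical task prefix. T5 conventionally names the task with a short
# text prefix, but the prefix used for this project is not stated.
TASK_PREFIX = "extract: "


def make_splits(pairs, train_fraction=0.8, seed=13):
    """Shuffle (note, target) pairs and split them into train and validation rows."""
    rows = [{"source": TASK_PREFIX + note, "target": target} for note, target in pairs]
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Placeholder strings standing in for roughly 300 hand-labelled examples.
pairs = [(f"note text {i}", f"target text {i}") for i in range(300)]
train_rows, val_rows = make_splits(pairs)
print(len(train_rows), len(val_rows))
```

With 300 examples this yields 240 training rows and 60 validation rows; the rows would then be tokenized and passed to whichever fine-tuning framework is in use.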
In our testing we found several behaviors that we were able to exploit.
- Normalized and streamlined assertions, including multiple assertions and mixed (negative and positive) assertions.
- Normalized family history mixed in with notes, whether explicitly stated as family history or denoted by relationship (mother, sister, cousin, etc.).
- Identified information across sentences.
- Properly identified abbreviations even when ambiguous (CP for Chest Pain or Cerebral Palsy).
- Picked up the age of the patient and appended “yo” at the end.
- Prepended “History of” to past issues to distinguish them from present ones.
- Output conditions and their abbreviations in the format “Condition (Abbreviation)”.
- Identified the gender of the patient from context, even when ambiguous (F, M, female, male, she, he, etc.), and output a predictable text of “Gender X”.
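Because the output follows these conventions, it can be parsed with simple rules. The sketch below is illustrative only: the sample string is invented, and the period delimiter and the “No ” prefix for negated findings are assumptions rather than the model’s documented format.

```python
import re


def parse_output(text):
    """Parse a period-delimited output string into a small record.

    Assumes no segment contains a period of its own (e.g. "Dr." would break it).
    """
    record = {"gender": None, "age": None, "history": [], "negated": [], "conditions": []}
    for part in (segment.strip() for segment in text.split(".")):
        if not part:
            continue
        if part.startswith("Gender "):
            record["gender"] = part[len("Gender "):]
        elif re.fullmatch(r"\d+\s*yo", part):
            record["age"] = int(re.match(r"\d+", part).group())
        elif part.startswith("History of "):
            record["history"].append(part[len("History of "):])
        elif part.startswith("No "):  # assumed marker for negated findings
            record["negated"].append(part[len("No "):])
        else:
            record["conditions"].append(part)
    return record


# Invented example that follows the conventions listed above.
sample = "Gender female. 41yo. Chest Pain (CP). History of Hypertension (HTN). No fever."
print(parse_output(sample))
```

Keeping the parser this simple is only possible because the fine-tuned output is consistent; the same rules applied to raw clinical notes would fail quickly.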
The NLP (Natural Language Processing) task behaves like summarization, language translation, and labeling. We call it SumTrans for Summary Translation.
Below are examples of actual text input and output that the model hasn’t seen yet:
|Sample Clinical note||T5 Output|
|No active pulmonary disease||
|~~ ~ Assessment/Plan ~ ASA allergy (hives)Unstable anginaHistory of Nonobstructive CAD -Aspirin desensitization protocol, can downgrade to floor once completed and patient stable-LHC tomorrow # Family Medicine Faculty Addendum: I have seen and examined the patient and I agree with the resident’s note as written above. Continue protocol, no issues thus far.||
|Admit/Discharge Dates Admit Date: 10/25/2021 Discharge Date: 10/26/2021 #, # , #, #, # Discharge Medications Continue anastrozole (anastrozole 1 mg oral tablet) aspirin (aspirin 81 mg oral capsule) citalopram (citalopram 40 mg oral tablet) irbesartan (Avapro 75 mg oral tablet) levothyroxine (Synthroid 25 mcg (0.025 mg) oral tablet) 25 mcg, Oral, Daily lovastatin (lovastatin 40 mg oral tablet) 40 mg, Oral, Daily Followup #, on 11/02/2021 Hospital Course 60yo female with hx HTN, HLD admitted from clinic yesterday due to unstable angina. She underwent LHC which revealed severe multivessel CAD detailed below. Pt will be transferred to # for CABG evaluation by #.||
|Cardiovascular: Negative. Neurologic: Seizure, hx of Cerebral palsy, non-GM seizure disorder. Health Status||
|Pt called wanting to know the status of her derm. referral that Dr. # placed on #. I am still showing it is pending acceptance in referral management. Pt did state that she is still suffering with hyperhidrosis really bad and it is starting to make her depressed. Please review referral and advise how to handle this. Thanks!||
T5 can be trained to “label” the data by outputting the text description of the ICD-10 code. This description can then be looked up to get the actual alphanumeric code. The example correctly matched the information about the pregnancy to #4, which is the exact ICD-10 code for the visit.
|Sample Clinical note||T5 Output|
|associated diagnoses: first trimester pregnancy; nausea and vomiting; uti in pregnancy||
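The lookup step mentioned above, from an output description back to the alphanumeric code, could look something like this. The three-row table is a tiny illustrative subset, and the normalization rule and function names are assumptions.

```python
# Tiny illustrative subset of ICD-10 descriptions; a real table has tens of
# thousands of rows and would be loaded from an official code file.
ICD10_BY_DESCRIPTION = {
    "other phakomatoses, not elsewhere classified": "Q85.8",
    "chest pain, unspecified": "R07.9",
    "cerebral palsy, unspecified": "G80.9",
}


def normalize(description):
    """Lower-case and collapse whitespace so small formatting differences still match."""
    return " ".join(description.lower().split())


def lookup_code(description):
    """Return the ICD-10 code for a description, or None when there is no exact match."""
    return ICD10_BY_DESCRIPTION.get(normalize(description))


print(lookup_code("Other phakomatoses, not elsewhere classified"))
print(lookup_code("Chest  pain, unspecified"))
```

An exact-match lookup only works when the model reproduces the description faithfully; near misses would need fuzzy matching or manual review.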
The model properly identifies the meaning of CP in both texts.
|Sample Clinical note||T5 Output|
|seizures in a 20 yo male with CP and static encephalopathy. On Aptiom 1400mg daily. Father refuses valproic acid loading dose and prefers to wait to add 2nd AED untiil he sees # outpatient. He understands the risks involved. Neurology will sign off. Call with questions. D/c video EEG.||
|pt a 41 year-old Female, presents with symptoms of cp and sob. elevated hr. no fever||
T5 is overly eager to summarize or translate the text. Ideally, any sections you are not interested in should be left out of the input, but when this can’t be avoided, negative examples must be added to the training set. So far, simply training the model on undesirable text paired with a consistent output has been sufficient.
|Sample Clinical note||T5 Output|
|Samples containing indocyanine green cause interferences on Total and/or Direct Bilirubin and must not be measured.||
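The negative-example idea can be sketched as pairing every piece of unwanted text with one fixed target string. The target value, the function name, and the sample positive pair below are hypothetical; the article does not state the actual string used.

```python
# Hypothetical fixed target for text the model should ignore.
NEGATIVE_TARGET = "none"


def add_negative_examples(training_pairs, irrelevant_texts, target=NEGATIVE_TARGET):
    """Return the training pairs extended with (irrelevant text, fixed target) pairs."""
    return list(training_pairs) + [(text, target) for text in irrelevant_texts]


# One invented positive pair plus the lab disclaimer shown in the table above.
positives = [("pt presents with cp and sob. no fever", "invented target text")]
boilerplate = [
    "Samples containing indocyanine green cause interferences on Total and/or "
    "Direct Bilirubin and must not be measured."
]

training_set = add_negative_examples(positives, boilerplate)
print(len(training_set), training_set[-1][1])
```

Using the same target for every negative example gives downstream code a single value to filter on.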
Our examples have shown that the model produces predictable output text when fine-tuned. This text can be coupled with downstream processes to provide additional value, using NER (Named Entity Recognition) or parsing the text based on keywords from the predictable output. The results obtained so far are promising, and we expect that pretraining the model on clinical text will improve them further.
Sources that aided us in this solution
- Denis Rothman, Transformers for Natural Language Processing.
- Zachariah Zhang, Jingshu Liu, and Narges Razavian, BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining. https://arxiv.org/pdf/2006.03685v1.pdf
- Peng Shi and Jimmy Lin (2019), Simple BERT Models for Relation Extraction and Semantic Role Labeling. https://arxiv.org/pdf/1904.05255.pdf
- Google AI Blog, Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer. https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
About Corey Fourier: Corey has over 16 years of software development and architectural experience in the health IT field. At ThinkGenetic, he provides the data architecture, design and development leadership required to turn genetic counselor knowledge into the intelligence needed to score patient health records for genetic condition risk level. He is also the lead developer for the FindEHR product at ThinkGenetic.