This project began with a vexing problem. Imaging tests that revealed unexpected findings – such as suspicious lung nodules – were being overlooked by busy caregivers, and patients who needed prompt follow-up weren’t getting it.
After months of discussion, health system leaders at Northwestern coalesced around a heady solution: artificial intelligence could be used to identify these cases and quickly alert providers.
If only it were that simple.
It took three years to integrate AI models for flagging lung and adrenal nodules into clinical practice, requiring thousands of hours of work by employees across the organization – from radiologists and human resources specialists to nurses, primary care physicians, and computer specialists. Developing accurate models was the least of their problems. The real challenge was building trust in the models’ findings and designing a system to ensure that the tool’s warnings didn’t just prompt providers to click through a pop-up, but instead translated into effective, real-world care.
“There were so many surprises. It was a daily learning experience,” said Jane Domingo, project manager in Northwestern’s clinical improvement office. “It’s amazing to think about how many different people and expertise we’ve brought together to make this work.”
Ultimately, the adrenal model failed to reach the level of accuracy needed in live testing. But the lung model, by far the most common source of suspicious lesions, has proven highly effective at informing caregivers, paving the way for thousands of follow-up tests for patients, according to an article published last week in NEJM Catalyst. Further study is needed to determine whether those tests reduce the number of missed cancers.
STAT interviewed Northwestern employees who were involved in building the algorithm, integrating it into computer systems, and wrapping it in protocols to ensure patients received the recommended follow-up promptly. The challenges they faced, and what it took to overcome them, underscore that the success of AI in medicine depends as much on human effort and understanding as on the statistical accuracy of the algorithm itself.
Here is an overview of the actors involved in the project and the obstacles they faced along the way.
For the AI to report the right information, it needed to be trained on labeled examples from the health system: radiology reports annotated to mark incidental findings and follow-up recommendations. But who had time to mark up tens of thousands of clinical documents to help the AI spot the telltale language?
The human resources department had an idea: nurses who had been placed on light duty due to workplace injuries could be trained to scan reports and extract key excerpts. This would eliminate the need to hire an expensive third party with unknown expertise.
However, highlighting the key passages in long radiology reports isn’t as easy as it sounds, said Stacey Caron, who supervised the team of nurses responsible for the annotation. “Radiologists write their reports differently, and some of them will be more specific in their recommendations, and some will be more vague,” she said. “We had to make sure that the education on how [to mark relevant excerpts] was clear.”
Caron met with the nurses individually to orient them to the project and created a training video and written instructions to guide their work. Each report had to be annotated by multiple nurses to ensure accurate labeling. In the end, the nurses logged around 8,000 hours of work annotating over 53,000 separate reports, creating a high-quality stream of data to help train the AI.
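Having multiple nurses annotate each report raises the question of how disagreements are resolved into a single training label. A minimal sketch of one common approach, majority voting – the function and data shapes here are hypothetical, not Northwestern’s actual pipeline:

```python
from collections import Counter

def resolve_label(annotations):
    """Collapse several annotators' judgments of one report into a
    single training label by majority vote. Each annotation is a
    bool: True if that nurse marked an incidental finding."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    # Require a strict majority; ties go back for adjudication.
    if votes * 2 > len(annotations):
        return label
    return None  # no consensus - flag for human review

# Three nurses reviewed the same report; two marked a finding.
print(resolve_label([True, True, False]))  # -> True
```

Reports with no consensus would be routed back to a senior reviewer rather than fed to the model with a noisy label.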
Developing the AI models may not have been the most difficult task of the project, but it was crucial to its success. There are several different approaches to analyzing text with AI – a task known as natural language processing. Choosing the wrong one means certain failure.
The team started with a technique known as regular expressions, or regex, which searches for manually defined sequences of words in text, such as “chest CT without contrast.” But because of the variability in the wording radiologists use in their reports, this approach proved too error-prone: it missed an unacceptable number of suspicious nodules requiring follow-up and flagged too many reports where none existed.
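A rule-based pass like the one described here can be sketched with Python’s `re` module. The patterns below are illustrative stand-ins, not the team’s actual rules, and they exhibit exactly the failure mode the article describes – unanticipated wording slips through:

```python
import re

# Hand-written patterns for follow-up language (illustrative only).
FOLLOW_UP_PATTERNS = [
    re.compile(r"chest CT without contrast", re.IGNORECASE),
    re.compile(r"follow[- ]?up (?:CT|imaging) (?:is )?recommended", re.IGNORECASE),
    re.compile(r"pulmonary nodule", re.IGNORECASE),
]

def flag_report(text):
    """Return True if any hand-crafted pattern matches the report."""
    return any(p.search(text) for p in FOLLOW_UP_PATTERNS)

print(flag_report("Incidental pulmonary nodule. Follow-up CT recommended."))  # -> True
# A report the pattern authors never anticipated slips through:
print(flag_report("Indeterminate 6 mm opacity; consider repeat imaging."))  # -> False
```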
Then the AI specialists, led by Mozziyar Etemadi, a professor of biomedical engineering at Northwestern, tried a machine learning approach called bag of words, which counts how many times each word from a pre-selected vocabulary appears, creating a numerical representation that can be fed into a model. This, too, fell short of the needed accuracy.
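The bag-of-words representation the team tried can be sketched in a few lines. The vocabulary here is a made-up miniature purely for illustration; a real list would hold thousands of terms drawn from radiology reports:

```python
# A tiny pre-selected vocabulary (illustrative only).
VOCAB = ["nodule", "mass", "follow-up", "recommended", "unremarkable"]

def bag_of_words(text):
    """Count how often each vocabulary word appears, producing the
    fixed-length numeric vector a classifier can consume."""
    tokens = text.lower().replace(".", " ").split()
    return [tokens.count(word) for word in VOCAB]

vec = bag_of_words("Pulmonary nodule noted. Follow-up recommended.")
print(vec)  # -> [1, 0, 1, 1, 0]
```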
The shortcomings of these relatively simple models highlighted the need for a more complex architecture known as deep learning, where data is passed through multiple layers of processing in which the model learns key features and relationships. This method allowed the AI to understand dependencies between words in the text.
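One way to see why the count-based representation falls short: two reports with opposite clinical meanings can produce the identical bag of words, because counting discards word order. Capturing the dependencies between words is precisely what a sequence-aware deep model adds. A toy illustration, with made-up report text:

```python
from collections import Counter

# Same words, opposite meanings - only word order differs.
a = "nodule present no follow-up recommended"
b = "no nodule present follow-up recommended"

# Their bags of words are indistinguishable...
print(Counter(a.split()) == Counter(b.split()))  # -> True
# ...yet only one sentence actually negates the finding. A model
# that reads word order can separate them; a pure count cannot.
```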
Early testing showed that the model almost never missed a report of a suspicious nodule.
“It’s really a testament to these deep learning tools,” Etemadi said. “As you send more and more data to it, it gets it. These tools really learn the underlying structure of the English language.”
But technical proficiency, while an important step, was not enough for the AI to make a difference in the clinic. Its findings would only matter if people knew what to do with them.
“AI can’t come forward and give clinicians more work,” said Northwestern Medicine chief medical officer James Adams, who has championed the project within the ranks of health system leadership. “It has to be an agent of the people on the front lines, and that’s different from how this last generation of healthcare technology has been implemented.”
A commonly used vehicle for providing timely information to clinicians is known as a Best Practice Alert, or BPA – a message that appears in health records software.
Clinicians are already bombarded with such alerts, and adding to the list is a delicate matter. “We kind of have to have our ducks in a row, because if it’s disruptive it’s going to face some resistance from doctors,” said Pat Creamer, program manager for information services.
The solution in this case was to embed the alert in clinicians’ inboxes, where two red exclamation marks signify a message requiring immediate attention. To build confidence in the validity of the AI alert, relevant text from the original report was embedded in the message, along with a hyperlink that lets physicians easily order the recommended follow-up test.
Creamer said the message also allows clinicians to reject the recommendation if other information indicates follow-up isn’t necessary, such as someone else’s ongoing management of the patient. The message can also be forwarded to that other caregiver.
The most important part of the alert, Creamer said, was integrating it into the record-keeping system so the team could keep tabs on every part of the process. “It’s not a normal BPA,” he said, “because there’s programming behind it that helps us track results and recommendations throughout the lifecycle.”
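The lifecycle tracking Creamer describes can be pictured as a small state machine, where an alert cannot silently skip a step. The states and transitions below are hypothetical, sketched only to show the idea, not Northwestern’s actual schema:

```python
# Hypothetical lifecycle for one AI-flagged report (illustrative).
TRANSITIONS = {
    "flagged": {"acknowledged", "dismissed"},
    "acknowledged": {"ordered", "dismissed"},
    "ordered": {"completed"},
}

def advance(state, event):
    """Move an alert to its next state, refusing invalid jumps so
    every report's status stays auditable end to end."""
    if event not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot go from {state!r} to {event!r}")
    return event

s = "flagged"
for step in ["acknowledged", "ordered", "completed"]:
    s = advance(s, step)
print(s)  # -> completed
```

Because every transition is recorded, the team can query which alerts stalled at which stage – the “plan B” trigger described next.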
And in cases where patients weren’t receiving follow-up, they were ready with a plan B.
Closing the loop
The alert system needed a safety net to ensure that patients did not fall through the cracks. This challenge fell into the lap of Domingo, the project manager who had to figure out how to make sure patients showed up for their next test.
The first line of defense was a dedicated team of nurses tasked with monitoring patients if the ordered test was not completed within a certain number of days. Given the difficulty of reaching patients by phone, however, they needed another option. The idea was floated of sending a letter to patients by post, but some doctors feared that a report of a suspicious lesion would cause panic, triggering a flurry of nervous phone calls.
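The safety-net check described above – flagging patients whose ordered test was not completed within a certain number of days – amounts to a simple date comparison. A sketch, with an assumed 30-day window (the article does not state the real threshold):

```python
from datetime import date, timedelta

FOLLOW_UP_WINDOW = timedelta(days=30)  # assumed threshold, not the real one

def is_overdue(order_date, completed_date, today):
    """True if the ordered test is still incomplete past the window."""
    if completed_date is not None:
        return False
    return today - order_date > FOLLOW_UP_WINDOW

print(is_overdue(date(2024, 1, 2), None, date(2024, 3, 1)))  # -> True
```

Patients returned by a check like this would land on the monitoring nurses’ worklist for a phone call, and then the letter.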
“The letter has become one of my passions,” said Domingo. “It’s something that I really pushed.”
The wording of the letter was particularly delicate. Domingo contacted Northwestern’s patient advisory councils for advice. “There were overwhelming comments that we should alert them that there was a finding that might need to be followed up,” she said. But it was suggested to add another clause noting that such findings are not always serious and may simply require further consultation. The letter is now sent to the patient within seven days of the initial AI alert to doctors.
“Based on the limited number of complaints we received,” Domingo said, “this was an important component to help improve patient safety.”
Since the project began, the AI has prompted more than 5,000 doctor-patient interactions, and more than 2,400 additional tests have been performed. It remains a work in progress, with ongoing tweaks to keep the model accurate and the alerts properly tuned. Some doctors remain skeptical, but others said they see value in the AI that wasn’t so clear at the start of the project.
“At the end of the day, I don’t have the burden of keeping track of everything anymore,” said internal medicine physician Cheryl Wilkes. “It helps me sleep better at night. That’s the best way I can explain it.”