31 July 2019

The computer will see you now

A typical doctor makes 10,000 decisions a year. But even the best will get things wrong at least 100 times. Can AI offer a solution to human error in medicine?

By Phil Whitaker

Gary was unshaven and restless. He kept pressing his hands to different places on his chest as he tried to describe the pain he’d been experiencing, and the difficulty breathing, and the sick feeling and light-headedness. At 22, these were unlikely to represent a serious heart or lung problem – unlikely, though by no means impossible. But I’d skimmed his notes before calling him through to my consulting room; these were the same symptoms he’d been experiencing for several years, off and on. He’d been extensively investigated. The last time I’d seen him, about a year before, I’d tried to broker the understanding that they must represent physical manifestations of anxiety. I had started him on some medication and arranged to see him a few weeks later, hoping that if he were responding it would increase his confidence in my diagnosis. He hadn’t returned.

Lola, his girlfriend, sitting in the other chair, was hostile. “He keeps getting fobbed off, being told there’s nothing wrong.” Gary’s notes contained records of multiple 111 contacts. As soon as he mentioned chest pain and breathing difficulties to the operator, high drama would ensue. A 999 ambulance would be dispatched. Which would rush him, blue lights flashing, to the nearest A&E. Where doctors in scrubs would do all manner of tests, after which he’d be told they were normal and he’d be sent home.

“I’m not fobbing you off,” I told them. “There’s definitely something wrong, and I really want to help. But I can’t if you won’t work with me.”

I started to explain my diagnosis again, which provoked a dismissive huff from Lola and the accusation that I thought it was all in Gary’s head. I took a deep breath, and ran through how some of us experience anxiety in our minds, but for others it manifests chiefly in the body with physical symptoms.

“The tablets don’t work, though,” Gary pointed out.

“They take a little while,” I said. “You’ve never given them long enough.”

In the end, they left with a prescription for another medication and agreed to return in three weeks. I wondered whether they would. There was such a contrast between the dramatic response to his 111 calls – sirens, monitors, blood tests and ECGs – and my low-key attribution of his symptoms to psychological ill-health. The confusion that Gary and Lola must feel. Who were they to believe? What if something really serious were going on that nobody had put their finger on yet, and I was just the next bungling idiot about to generate another set of scornful headlines: “Four doctors missed meningitis!” “GPs failing to spot cancer!” Little wonder it was so hard for them to know who to trust.

****

Even though we’d started Harriet on 20-minute appointments, with strategic gaps to allow her to make up time, her surgery had run very late. When she finally got to the end, I went through to debrief. There were bits of paper strewn across her desk, reminding her to do tasks that she hadn’t had time for during the consultations themselves. She had not one but two stethoscopes draped around her neck, where she’d hurriedly parked them after conducting examinations. Her eyes had a startled distraction to them, as though she was still trying to work out what had just hit her. I pulled up a chair and gave her a smile.

“Don’t worry. How you’re feeling – it’s entirely normal.”

We started to go through her cases. She’d got bogged down by a middle-aged woman with abdominal pain, fatigue and irregular vaginal bleeding. An older chap on multiple medications for heart disease, emphysema, diabetes, prostate problems and blood pressure who was feeling unsteady and muzzy-headed and forgetful. A child who for months now had been making bizarre facial grimaces at random times and who was falling behind at school. Someone else with chronic pain in their muscles and joints with a normal set of investigations.

I talked her through each case, encouraging her whenever I felt she’d come up with a good plan, and gently expanding on other things to think about or alternative approaches she could have taken when a patient’s presentation had left her perplexed.

“I just feel so pathetic,” she said when we finally finished.

I told her I’d felt exactly the same at her stage. GP registrars study and train for a minimum of nine years, virtually all of them spent in the hospital environment, yet nothing prepares them for their first weeks in general practice. The “medical model” – which essentially views a human being as a complex machine that can go wrong in numerous but always explicable ways – works well in hospitals; those are usually the kinds of patients one encounters there. But out in general practice, the medical model falls apart spectacularly. Yes, a minority of the time we do see patients with those sorts of problems, and then it’s usually straightforward to know what to do. It’s all the others, the people with bewildering combinations of complaints that don’t fit with what one finds in textbooks – those are the patients the medical model fails to describe. Appointment after appointment we are presented with undifferentiated symptomatology: tired all the time, chronic cough, mysterious headaches, bowels gone haywire. We have to learn new ways to make sense of it all.

I outlined for Harriet the bare bones of the biopsychosocial model, which all GPs become adept at using. How the “bio” bit – the medical model – is only one part of the picture. How people’s emotions, psychology, family, relationships, occupation, material circumstances, experiences, hopes and fears all interact with their biology in a glorious mishmash to create the kinds of stories they come to tell us.

“Give it six, eight weeks,” I told her. “You’ll soon be getting the hang of it.”

****

To begin with, I heard reports of Babylon, a start-up in London offering a new kind of general practice – patients being seen by the next available doctor via a video call. The NHS had imposed restrictions on Babylon’s service: pregnant women, for example, were not deemed suitable for its style of care. And Babylon itself was discouraging complex, chronically ill patients from signing up. It was fit and healthy working-age people it wanted. People who might not consult a GP from one year to the next, and if they did then it would usually be with simple minor complaints.

Then I worked an out-of-hours shift alongside Jürgen, a German doctor with an entrepreneurial spirit and an interest in medical technology. He showed me an app on his phone – Babylon’s “GP at Hand” – which he’d downloaded to play with. He was enthusiastic. “Look, this is what they use – they don’t even need people to see a doctor. You put in your symptom,” he said, swiping and tapping at his screen, “it goes through some questions and… there! Look, it tells you what’s wrong!”

I didn’t share Jürgen’s excitement. What I saw of the app, looking over his shoulder, struck me as rigid and unwieldy; question after question – yes/no, yes/no. The examples he was trialling – straightforward issues such as a sore throat or new-onset diarrhoea – seemed eventually to result in sensible diagnoses, but I couldn’t see it being of much use with the complex, multifaceted cases I spend much of my days on. I filed it in a mental box marked “interesting gimmick” and thought no more about it.

A few months later and Babylon was suddenly in the headlines. The company claimed it had pitted GP at Hand against practice questions for the MRCGP – the exam all doctors have to pass before being allowed to practise as a GP – and it had sailed through with 81 per cent, a higher mark than some of the seven human doctors enlisted by Babylon to sit the same paper. Ali Parsa, Babylon’s CEO, was being interviewed in every publication I came across. He described the results as “phenomenal”: “You study your whole life to become a GP… For a machine to be able to pass this with flying colours in its first go; that is incredible.” Matt Hancock, the Health Secretary, shared a platform with Parsa, extolling the merits of the artificial intelligence (AI) revolution and announcing that he wanted GP at Hand rolled out across the entire NHS.

Different in practice: faced with bewildering combinations of symptoms, GPs have to take account of patients’ psychology and material circumstances. George Marks/Retrofile/Getty Images

I thought back to Jürgen. Perhaps he’d been right to be so excited about the app; maybe I’d misjudged it. But the thought gave me a feeling of foreboding. What was this world we were accelerating towards? What future would there be for someone like Harriet, and the other young registrars I’d mentored over recent years, embarking on careers that looked suddenly vulnerable to the march of AI? And what of my own role, both as a GP and a trainer? Would there even be enough years left for me to see out the end of my working life?

I wanted to learn more, to find out how Babylon had achieved this incredible performance. Like many a person in midlife or beyond, confronted by wizard technology outside their experience, I felt intimidated. Parsa’s explanations simply baffled me: “The way we train the machine is very novel. No one else in the world is doing it. Our approach of mixing a knowledge base with natural language processing, a probabilistic graphical model and inference engine – put on top of a deep learning engine – allowed us to achieve the results at a speed that nobody else could.” I had no idea what he was talking about. I had an overpowering sense that the world was passing me by.

****

The letter was in the bundle of hospital post I was processing. It was from A&E concerning a 44-year-old patient of ours, David, who had presented with a thunderclap headache. An urgent CT scan had confirmed the suspected diagnosis: a large bleed in the membranes surrounding his brain, called a subarachnoid haemorrhage (SAH). He’d been in a bad way; they’d blue-lighted him across to the specialist neurosurgical unit at the neighbouring hospital, where he was currently in a semi-comatose state.

I called up his notes on the computer to code his records. Two things struck me. David had attended the same A&E five days previously with a sudden-onset painful, stiff neck. He’d been told it was muscular and sent away with painkillers. Then he’d consulted with Harriet three days later, complaining of the same thing. I read her notes. She’d come to the same conclusion as the A&E doctor: that David had muscle spasm. She’d prescribed him stronger analgesia.

Subarachnoid haemorrhage most often arises from the rupture of a swelling in a cerebral artery called an aneurysm. In around half of cases, the SAH is preceded days or even weeks earlier by a sentinel bleed – a small leak of blood that causes an abrupt and unusual headache. If a sentinel bleed is correctly diagnosed, it presents a window of opportunity in which to perform a procedure to seal the aneurysm and prevent a full-blown SAH from occurring.

With the benefit of hindsight, both Harriet and the A&E doctor had misdiagnosed David’s earlier symptoms, which had been a sentinel bleed. The chance to prevent disability or even death had been missed.

Harriet and I reviewed the initial A&E letter and her own notes. It was the absence of headache that had misled her – and presumably our hospital colleague, too. In retrospect, she remembered feeling slightly uneasy: something hadn’t seemed to fit. But she’d pushed the thought aside and had stuck with the common and seemingly obvious explanation. We sourced some review articles regarding sentinel bleeds and found a range of unusual symptoms that a warning leak might occasionally present with: nausea and vomiting, intolerance of bright light, a general feeling of unwellness and, least common of all, isolated neck pain. No matter that the literature also commented that these nebulous symptoms – so frequently associated with other less serious conditions – were commonly misinterpreted by physicians: Harriet was disconsolate.

It couldn’t have come at a worse time for her. Just a couple of months into her general practice attachment and her nascent confidence in sifting and sorting undifferentiated presentations had been dealt a huge blow. All doctors, after a missed diagnosis like this, will be affected. For most, it undermines their tolerance of uncertainty: for some time afterwards they will over-investigate and refer much more readily, just in case. This may feel safer, but it has negative consequences for patients and the health service alike.

For some doctors, a serious misdiagnosis can be devastating. Harriet kept thinking about David and his family, and couldn’t shake the guilt she felt at having – as she saw it – failed them. I tried giving her my “10,000 decisions” talk. That is the approximate number of meaningful judgements a typical doctor will have to make in a year. Even someone who is 99 per cent perfect will get something wrong 100 times annually. And even if only 1 per cent of those have a serious consequence, we still have to expect a significant incident every year of our working lives. This is inevitable, and the professional response is to reflect honestly, learn any lessons that arise and take them forward into future practice, a bit wiser and a bit more experienced. It didn’t help Harriet. Not long afterwards, with David still in a serious condition in hospital, she announced that she wasn’t sure she was going to be able to carry on in medicine.

****

“We started 20 years back. I didn’t even know we were doing AI then. It was only three or four years ago when someone said: ‘That’s what you’re doing.’”

I was speaking with Jason Maude, co-founder of Isabel Healthcare. The video conferencing software we were supposed to be using for the call hadn’t worked, so we made do with the phone, Maude demonstrating Isabel’s functionality on my computer through a screen-sharing mode.

Maude’s earlier career was as a financial analyst. But then his three-year-old daughter, Isabel, suffered a potentially fatal complication of chickenpox, which was initially misdiagnosed by the junior doctors looking after her. Isabel survived – just – but spent months in a critical condition in hospital. The disaster arose because her physicians hadn’t been aware of the existence of the extremely rare complication she’d developed and had initially misinterpreted her decline in terms of the original viral infection. When Maude came to understand this, he abandoned his former career and set about developing a computer tool to help doctors generate alternative diagnoses in puzzling or unusual clinical scenarios. Isabel Healthcare Ltd was born.

“Computers are pretty dumb,” Maude explained, “but they’re very good at doing a very narrow job.” The Isabel engine works a bit like a highly specialised Google search. Its database is fed huge quantities of medical literature – textbooks, journals, case reports – building up comprehensive pictures of the entire range of symptoms and signs associated with virtually any known medical condition. A user can enter a free-text description of their case and Isabel will churn through tens of thousands of possibilities to find matches for the pattern, generating a list of potential diagnoses which is then weighted for probability and potential seriousness.

For much of Isabel’s existence, her users have been doctors – she was originally designed as a tool for physicians, to help them consider alternative causes when they were stumped as to what might be wrong. Latterly, since it was pointed out to Maude that what he was doing was AI, Isabel has been adapted into a patient-facing symptom checker to compete with other products such as GP at Hand.

“We’re very different to the chatbots, though,” Maude explained. Apps such as Babylon’s – and its arguably more robust competitor Ada – are based on algorithms. Getting a diagnosis out of them is a little like climbing a tree. You start at the trunk – this is the principal symptom, such as chest pain or headache – and the app presents you with a series of refining questions. Each is like a fork in a branch; the answer you give at each stage influences where you’ll eventually end up.

“The problem is,” Maude said, “you can be asked 30 to 50 questions. Patients get app fatigue and give up. And you have to decide what your most important symptom is – is it the fever, the rash or the abdominal pain? Only 10 per cent of users have just the one symptom.”

This is one of the reasons apps such as GP at Hand have come in for mounting criticism. Pick different principal symptoms and the algorithms take you along completely different trees. There are now numerous examples circulating where case descriptions that any competent doctor would recognise as a heart attack, say, or a form of cancer, generate bizarre and completely erroneous diagnostic suggestions. The other difficulty is that the probabilities that different answers correlate with specific conditions are at best educated guesswork. The more questions such an app asks, the more any errors become magnified, leading to sometimes wildly improbable conclusions.
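A rough way to picture that magnification is the sketch below. It is purely illustrative and is not how GP at Hand, Ada or any other app actually works: the function name, the 95 per cent figure and the assumption that each question is independent of the others are all invented for the example. It simply asks what the chance is that every fork in the tree was taken correctly.

```python
def chance_all_answers_hold(per_question_reliability: float, questions: int) -> float:
    """Chance that every fork in the tree was taken correctly.

    Hypothetical model: each question independently points the right way
    with the same probability. Real symptom checkers are not this simple.
    """
    if not 0.0 <= per_question_reliability <= 1.0:
        raise ValueError("per_question_reliability must be between 0 and 1")
    if questions < 0:
        raise ValueError("questions must not be negative")
    # Independent steps multiply, so the overall chance shrinks geometrically.
    return per_question_reliability ** questions


if __name__ == "__main__":
    # 0.95 is an assumed figure chosen only to show the trend.
    for n in (5, 30, 50):
        print(n, "questions:", round(chance_all_answers_hold(0.95, n), 3))
```

On these invented figures, a tree that is right 95 per cent of the time at each step has only around a one-in-five chance of being right at every step after 30 questions, and it fares worse still at 50.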

Machine learning: Phil Whitaker believes AI should be a tool to support doctors, not replace them. Tom Pilston for New Statesman

Maude put Isabel through its paces for me, entering complex case descriptions one minute and vague symptom clusters the next. The differential diagnoses that appeared virtually instantaneously seemed spot on, and far more comprehensive than I could have come up with unless I’d had a lot more time and potentially done some research. I was impressed.

Curious as to how Isabel would work for someone like Gary, I got Maude to enter his case details. The list Isabel generated contained all the unlikely but serious heart and lung problems that kept eliciting the same 999 ambulance response Gary was habituated to. Nowhere did anxiety feature in the differential. Maude was a bit deflated when I explained. But Isabel, like all other AI apps, is based purely on the medical model. None is anywhere close to incorporating psychosocial dimensions in their assessments.

Undeterred, Maude wanted to demonstrate Isabel’s geographical sensitivities. Fever and gastrointestinal symptoms for a UK resident produced a range of familiar diagnostic possibilities. But when Maude changed the patient’s details to indicate recent travel to West Africa, a whole host of tropical diseases were suggested, some of which I had never even heard of. I was impressed all over again, but something niggled.

“Malaria’s not there,” I said.

We re-ran the case, altering the symptoms to make them more classic for the parasitic disease. Still the list of diagnoses didn’t include malaria. I felt suddenly uncertain; I’d been sure West Africa was malarial, but perhaps I’d been wrong. At my suggestion, Maude changed the location to sub-Saharan Africa and sure enough, up popped malaria high in the differential. I felt extremely embarrassed: I don’t do much travel medicine, and Isabel was clearly better informed than I was. Machine – 1, human doc – 0.

****

Gary did return with Lola and, three weeks into the treatment, he had started to notice a marked improvement in his symptoms and his general well-being. It was exactly what I’d been hoping for, and opened the door to him beginning to understand his chest pains in terms of uncontrolled anxiety. We explored some of the difficulties he was contending with in life, and the links between times of heightened stress and aggravated physical symptoms. I prescribed a further month’s worth of medication and suggested another follow-up appointment. They both nodded, smiling, all trace of hostility and mistrust gone.

I felt buoyed by the encounter. After several years of unsatisfactory contacts with health services, I had at last helped Gary make a breakthrough. And it was something no AI would have achieved. Whether Gary had accessed GP at Hand, or Ada, or Isabel, the diagnostic possibilities he would have been offered would have focused on the purely physical and generated yet more anxiety to fan his flames.

That said, were Harriet to have had Isabel installed on her desktop when David came to see her, and had she listened to her gut feeling and run David’s presentation through Isabel’s database, the possibility of a sentinel bleed would have been flagged up, along with several other rare, important differentials that only a superhuman familiarity with the literature would have suggested. Would it have made a difference? Quite possibly. I have close to 30 years’ experience yet I wasn’t specifically aware that a sentinel bleed could occasionally present with isolated neck pain. But having that possibility pop up on a computer screen would have given me pause for thought.

The shine is beginning to come off GP at Hand, even as Ali Parsa makes every effort to spread Babylon’s reach further into the NHS: Birmingham recently became the next city to welcome it in. The AI app’s supposedly stellar performance in the MRCGP exam has subsequently been shown to be distinctly questionable. The clinical cases it was tested on were presented in highly digested form, already translated into medical model scenarios – just the sort of material it is trained to perform on. The app didn’t have to do what doctors do all the time: converse with unique human beings, sift and understand their meaning from the words they choose, the silences they leave, the metaphors and similes they employ, and the language that their body is also speaking.

And as more and more disconcerting instances of app malfunction have come to light, serious questions are being asked about the lack of regulation of such devices. Babylon has said that it is undertaking properly robust studies to evaluate its chatbot’s real-world performance, but to date nothing has been published. Even Parsa himself, in a recent interview with the Daily Telegraph, admitted that sometimes “the AI is just stupid”. Meanwhile, Babylon’s London patient list reportedly had a turnover of 25 per cent in 2018, with large numbers of patients choosing to return to their traditional NHS practices after experiencing Babylon’s model first-hand.

There is a place for AI in medicine, undoubtedly, but it is as a tool to support doctors, not to replace them. AI is successfully augmenting interpretation of X-ray images in radiology and in monitoring skin lesions for cancerous change in dermatology. As Jason Maude said, computers do very narrow jobs very well. But to set them to work in one of the broadest jobs of all – making sense of the biopsychosocial world of primary care medicine – will, I believe, come to be understood as applying a potentially superb technology to precisely the wrong role.

David recovered well from his SAH, and returned to work and family life, albeit that he has to contend with some residual memory and concentration problems, and a degree of personality alteration. I continued to support Harriet through her crisis of confidence, but it was perhaps her discovery that David attached no blame to her or to the A&E doctor for not recognising his sentinel bleed that was the best medicine. She stayed with her training, passed her MRCGP with a distinction and is now working as a GP in a neighbouring city, gaining in wisdom and experience with every passing year. Like any of us, she feels her responsibility towards patients keenly and cares deeply if she gets something wrong. The emotional impact from David’s case will have made her a better doctor – something that can never be true for a machine.

Gary and Lola didn’t keep the next follow-up appointment and haven’t been back since. My guess is the stresses have subsided for the moment, but whether it will be me or 111 to whom Gary turns next time he experiences a bad bout of physical symptoms is an open question. Some journeys take a long time to complete, and some never finish at all.

****

Following my conversation with Jason Maude, I was left with a niggling doubt. I went to the literature and found I had been right after all: West Africa is of course malarial. What interested me most was how ready I had been to accept that Isabel knew best. Seeing something appear authoritatively on a computer screen has a powerful magic of its own. How would that play out for Gary, I wondered, if next time he gets chest pain and shortness of breath he turns to an AI symptom checker, and is presented with an alarming array of diseases that might be affecting his heart or lungs? Even more importantly, what about another patient whose serious pathology isn’t recognised by the chatbot of their choice and who believes the false reassurance?

I emailed Jason Maude to let him know that Isabel had a blind spot about malaria distribution in Africa. He said he would get his team on to it. I discovered there is no capacity for Isabel, or GP at Hand, or any other symptom-checker app, to reflect on and learn directly from their mistakes. For that, we rely on the humans who design and maintain them. Those rushing to deploy AI in health care should be required to slow down and have these products evaluated thoroughly, systematically and independently. Real patients should not be used as a proving ground for our brave new world.

Phil Whitaker is a GP and writes the New Statesman’s “Health Matters” column. His books include “Chicken Unga Fever: Stories from the Medical Frontline” (Salt)