Data dearth

February 12, 2021

Rachel Williamson

The data is in: children really are less susceptible to COVID-19, and more than 20% of US and UK residents are likely to have already been infected at least once.

Two studies published this week use computational models to start answering two of the biggest questions around COVID-19, as algorithms come into their own in a pandemic that from the beginning was dubbed a “big data problem”.

But while algorithmic prediction models are beginning to emerge – just last month an Israeli team released a machine-learning model that can predict COVID-19 diagnoses based on eight basic questions – researchers are running into the same problem the health sector has been dealing with for years: highly fragmented data.

“To do machine learning, to do deep learning, you need big datasets and you need labelled datasets,” says Louisa Jorm, the foundation director for the Centre for Big Data Research in Health at UNSW Sydney.

In Australia, and around the world, finding those datasets is difficult.

“For example, the electronic medical records,” says Jorm. “At the moment it’s extremely difficult to access those for research purposes and it’s also extremely difficult to combine them or aggregate them across different hospitals or different countries.

“That’s for technical reasons to do with the fact that they’re proprietary products, but it’s also due to issues with the need to preserve patient privacy.”

Undercounting, kids and COVID-19

The two studies this week – looking at children and COVID-19, and case undercounting – got around fragmented datasets by working with governments and using large, publicly available figures.

Jungsik Noh at the University of Texas Southwestern Medical Centre, US, fed his new machine-learning model with publicly available data on confirmed case numbers and deaths in the 50 worst-hit countries around the world. Noh and co-author Gaudenz Danuser say in a paper published in PLOS ONE that their model is the first to provide daily predictions of current case numbers.

The most up-to-date projection is that nearly 40% of the population of Belgium had been infected with COVID-19 by 8 February this year, and that more than 10% are currently infected. The model estimates that nearly 30% of UK residents had been infected by that date, as had more than 20% of the populations of the US, Italy and Mexico.

A new study estimates the percentage of each country’s population that has been infected with COVID-19. Credit: Noh, Danuser / PLOS ONE

“Although daily new cases of laboratory-confirmed COVID cases are collected and reported in, say, Australia, nobody knew the actual number of new cases, because there are people who were infected but not test-confirmed,” says Noh.

“The actual size of infections in a region today would be what policymakers should know, for example, to determine which regions should be prioritised in vaccine distribution, or to determine how many contact tracers should be allocated to a region.”
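To see why confirmed counts can understate the true picture, here is a minimal back-of-envelope sketch in Python. It is not Noh and Danuser's model: it simply back-calculates infections from deaths using an assumed infection fatality rate, and every number in it (population, cases, deaths, fatality rate) is hypothetical and chosen only for illustration.

```python
def estimate_infections(cumulative_deaths, infection_fatality_rate):
    """Back-calculate cumulative infections from deaths.

    Assumes a single, fixed infection fatality rate (IFR), which is a
    strong simplification: real IFRs vary by age, place and time.
    """
    if not 0 < infection_fatality_rate <= 1:
        raise ValueError("infection_fatality_rate must be in (0, 1]")
    return cumulative_deaths / infection_fatality_rate


def ascertainment_ratio(confirmed_cases, estimated_infections):
    """Share of estimated infections that were actually test-confirmed."""
    return confirmed_cases / estimated_infections


# Hypothetical region: all figures below are made up for illustration.
population = 1_000_000
confirmed_cases = 50_000
cumulative_deaths = 1_500
assumed_ifr = 0.0075  # assumption, not a figure from the study

infections = estimate_infections(cumulative_deaths, assumed_ifr)
share_infected = infections / population
share_confirmed = ascertainment_ratio(confirmed_cases, infections)

print(f"Estimated share of population ever infected: {share_infected:.0%}")
print(f"Share of infections that were test-confirmed: {share_confirmed:.0%}")
```

In this made-up region only a quarter of estimated infections were ever confirmed by a test, which is the kind of gap between reported and actual cases that Noh describes.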

Israeli research looking at childhood susceptibility to COVID-19 did not use a machine-learning algorithm. Instead it employed a dynamic stochastic mathematical model, which describes how random variables and processes interact over time – an approach that has become more practical as computers have grown more powerful.

The study used a small but comprehensive dataset and, thanks to government backing, was able to produce an answer quickly enough to inform decisions on whether to reopen schools, author Itai Dattner says.

Several studies have already found that children with mild symptoms were not being picked up, including a prominent large-scale study from Iceland, while a study in NSW suggests children are not the main drivers of infection in schools.

But these didn’t provide “unequivocal evidence” of lower susceptibility among children, the Israeli study said.

The Israeli team, led by Dattner at the University of Haifa, used serological antibody tests and polymerase chain reaction (PCR) tests for current infections from 637 households in Bnei Brak, east of Tel Aviv. An outbreak in the town in spring had prompted the government to test all members of any household with positive cases.

They found children under 20 years were 43% less likely to get COVID-19 than adults and were estimated to be 63% as infective – meaning they were less likely to transmit the virus to others.
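For a sense of how a stochastic household model turns figures like these into outcomes, here is a minimal Python sketch. It is a simple chain-binomial simulation written for illustration, not the Haifa team's model. It assumes "43% less likely" means a relative susceptibility of 0.57, and the baseline adult-to-adult transmission probability, the household make-up and the function names are all hypothetical.

```python
import random

# Hypothetical baseline: chance an infectious adult infects a susceptible
# adult household member in one generation of spread (assumption).
ADULT_TRANSMISSION_PROB = 0.2
# Relative values taken from the figures reported above, under the
# assumption that "43% less likely" means 0.57 relative susceptibility.
CHILD_SUSCEPTIBILITY = 0.57
CHILD_INFECTIVITY = 0.63


def transmission_prob(source_is_child, target_is_child,
                      base=ADULT_TRANSMISSION_PROB):
    """Per-generation chance that source infects target."""
    prob = base
    if source_is_child:
        prob *= CHILD_INFECTIVITY
    if target_is_child:
        prob *= CHILD_SUSCEPTIBILITY
    return prob


def simulate_household(members, index_case, rng):
    """Run one outbreak; members is a list of bools (True = child).

    Returns the set of member indices infected by the end.
    """
    infected = {index_case}
    infectious = [index_case]
    while infectious:
        newly_infected = []
        for target in range(len(members)):
            if target in infected:
                continue
            # Each currently infectious member gets one independent chance.
            for source in infectious:
                prob = transmission_prob(members[source], members[target])
                if rng.random() < prob:
                    infected.add(target)
                    newly_infected.append(target)
                    break
        # Only this generation's new cases are infectious in the next one.
        infectious = newly_infected
    return infected


def attack_rates(n_runs=20_000, seed=1):
    """Average infection rates among the other adult and the children."""
    rng = random.Random(seed)
    household = [False, False, True, True]  # two adults, two children
    adult_hits = 0
    child_hits = 0
    for _ in range(n_runs):
        infected = simulate_household(household, index_case=0, rng=rng)
        adult_hits += 1 in infected
        child_hits += (2 in infected) + (3 in infected)
    return adult_hits / n_runs, child_hits / (2 * n_runs)


adult_rate, child_rate = attack_rates()
print(f"Second adult infected in {adult_rate:.1%} of simulated households")
print(f"Each child infected in {child_rate:.1%} of simulated households")
```

Because each run involves chance, the model is run thousands of times and the results averaged. Under these assumptions the children in the simulated households are infected noticeably less often than the second adult.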

Still early days

The two papers answer some very important questions, Jorm says. But while we may have the data to resolve more of the big COVID-19 questions, it’s very difficult to access.

In Australia, where cases have been limited, some of the big questions now are around the nature and impact of “long COVID-19”, vaccine uptake and effectiveness, and the broader impact on the health sector of lockdowns.

“Elective surgery stopped, so we ended with massive waiting lists,” Jorm says. “People stopped going to the GP so they potentially stopped taking preventive medications. They definitely reduced their use of screening services, so things like mammograms and Pap tests have really fallen away.

“So [research on] quite a lot of other collateral damage of COVID-19 requires access to that whole-of-health-system data for individuals.”

It’s not just a lack of data that is hampering efforts to build and feed predictive algorithms. Machine-learning models themselves are still in the very early stages of being able to help with diagnostic prediction, infection screening, outbreak detection and assessing the risk of deterioration.

In November, Australian researchers Ian Scott and Enrico Coiera lamented that while machine learning and its more sophisticated cousin artificial intelligence could support the COVID-19 cause, most applications are not yet mature enough to work in a clinical setting.

“The speed of research means that many reports are preprints awaiting peer review, while still attracting media coverage and clinician adoption before proper evaluation,” they wrote.

“Most machine-learning models have relied on Chinese data, limiting generalisability to other populations. Those trained on limited and unrepresentative data are susceptible to overfitting and can perform poorly on real‐world datasets. Many diagnostic and prognostic machine-learning models published to date are poorly reported, lack external validation, and have high risk of bias.”

The “big data” pandemic is now beginning to produce enough information to answer a handful of the big COVID-19 questions – but it’s still too early for algorithms to predict whether that cough is COVID-19 or just a cold.
