Janice Scealy talks statistics and the Next Big Thing

# Finding the pictures in the data

Maths is everywhere. It’s a really important skill to have these days in business, academia and all of the sciences, and even the softer sciences like psychology and biology are becoming more quantitative. Anyone running experiments and getting data needs an applied mathematician on their side.

I was a bit of an all-rounder at high school. I got good marks in a lot of different subjects, but I didn’t really know what I wanted to do. Parents probably urge their kids to become lawyers or doctors or engineers, but that’s because no one really knows about maths and statistics. I certainly didn’t.

I started off doing telecommunications engineering, which involved a lot of building circuits in labs. And that made me realise I don’t like labs – I’m simply not a hands-on type person. But I went well in maths in that first year at university because you have to do it for engineering anyway. And so I swapped to maths because I was good at it and it didn’t involve labs or essay writing.

The great thing about maths is that once you know the fundamentals, you don’t have to memorise very much because it builds on those basic principles. I also realised I liked applied maths a lot more than pure maths, because it’s motivated by real-world problems.

Then someone said to me, “If you want to get a job, you should do statistics.” So I did a double major in applied maths and statistics, and I’ve never looked back.

I get to work with lots of different people: geophysicists, for example. One of my specialities involves looking at data on spheres and curved surfaces – like our planet. I analyse objects with complicated constraints. Paleomagnetism is a classic example.

Inside the Earth there’s something like a giant bar magnet with two poles, although it’s obviously a lot more complicated than that. The magnetic field you can measure on the surface of the Earth is constantly changing over time, and it differs depending on where you are on the globe. Geophysicists monitor this to see what the Earth’s magnetic field is doing over time. They can even find out what it was doing hundreds of thousands of years ago, because when some rocks first form, they lock in and record the direction of the Earth’s magnetic field. So geophysicists date the rocks and extract the direction of the field. But their problem is they can’t easily measure the magnitude – how strong the magnetic field was. They’ve got the direction, but they don’t know the size. That’s where my spherical statistics comes in handy.

We can estimate, say, what the magnetic field is doing on average. But you also have to take the uncertainty into account. How confident are you of your measurements? That’s also what statisticians try to do – we try to give error bounds on your estimates. You’ve got to take into account the geometry of the object that you’re measuring so that you’re getting a proper estimate of both the average and the error bounds.
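To see why the geometry matters when averaging directions, here is a minimal Python sketch. It is not the method used in the research described here, only the textbook way of averaging directions: convert each direction to a unit vector, add the vectors, and rescale. The function names, the (north, east, down) convention and the two example directions are assumptions made for illustration.

```python
import math


def to_unit_vector(declination_deg, inclination_deg):
    """Turn a direction given as two angles into a unit vector (north, east, down)."""
    dec = math.radians(declination_deg)
    inc = math.radians(inclination_deg)
    return (
        math.cos(inc) * math.cos(dec),
        math.cos(inc) * math.sin(dec),
        math.sin(inc),
    )


def mean_direction(vectors):
    """Return (unit mean direction, mean resultant length) for unit vectors."""
    n = len(vectors)
    if n == 0:
        raise ValueError("need at least one direction")
    sums = [sum(v[k] for v in vectors) for k in range(3)]
    length = math.sqrt(sum(s * s for s in sums))
    if length == 0.0:
        raise ValueError("directions cancel out; the mean direction is undefined")
    return tuple(s / length for s in sums), length / n


# Hypothetical example: two horizontal directions, 10 degrees either side of north.
directions = [to_unit_vector(350.0, 0.0), to_unit_vector(10.0, 0.0)]
mean, spread = mean_direction(directions)

# Declination of the mean direction, in degrees within (-180, 180].
mean_declination = math.degrees(math.atan2(mean[1], mean[0]))

# Averaging the raw angles ignores the geometry and points the wrong way.
naive_declination = (350.0 + 10.0) / 2  # 180.0, i.e. due south
```

The vector average points north, as it should, while the naive average of the angles points south. The mean resultant length (`spread` here) lies between 0 and 1, and values close to 1 indicate tightly clustered directions, which is the kind of information that error bounds on a sphere are built from.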

When we first get data, one of the most important things that any statistician will do is have a rough look at its structure through plots and images. Statistics is like putting together bits of a jigsaw puzzle. It’s not just maths and it’s not just science. It involves different skills.

I don’t only analyse geophysics data. My most recent paper was on analysing microbiome compositional data – yes, that’s measuring gut bacteria in stool samples.

So you’ve got counts of heaps of different types of bacteria that are present in some stool samples. The issue here is that the total count is actually a constant. It’s fixed. We call that a constraint. So you’ve got proportions of the different types of bacteria that sum to one – that’s called compositional data. You need to take the constraints into account when you do your analysis. It’s important to recognise the geometric structure of the compositional data object in order to do a proper analysis.
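The sum-to-one constraint can be made concrete with a short Python sketch. The `closure` helper and the counts below are made up for illustration and are not taken from the paper; the sketch only shows how raw counts become a composition and what the constraint implies.

```python
def closure(counts):
    """Rescale non-negative counts into proportions that sum to one."""
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return [c / total for c in counts]


# Made-up counts for five categories of bacteria in one sample.
counts = [120, 30, 25, 15, 10]
composition = closure(counts)

# Because of the constraint, the last proportion is fixed by the other four.
last_part = 1.0 - sum(composition[:-1])

# Doubling every count changes the total but not the composition.
doubled = closure([2 * c for c in counts])
```

Five proportions carry only four free values, so the data live on a restricted space (a simplex) rather than in ordinary five-dimensional space. That restriction is the geometric structure an analysis has to respect.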

In my most recent paper, I developed a new complex model. I started by collecting about 100 observations from different people. You then try to summarise that data with a small number of parameters. That’s what we’re trying to do as statisticians – we’re trying to extract a signal from lots of noise.

I started with five different categories of bacteria that sum to one from each of the 100 people. Next, you want to see how the different bacteria interact with each other, so you want to model the dependencies between them as well. The model itself is easy to write down, but estimating the parameters in that model on that restricted space is very difficult. So this is where traditional methods didn’t work. I was able to use an idea from machine learning and adapt it.

We were able to get tractable estimators – the algorithm runs really fast. It’s a pretty big, significant advance. The thing was, I had to use the geometry, and by merging differential geometry from maths with statistics, I was using all these different tools to estimate some of these parameters. That was really cool.

That’s what I love about my job. It’s applicable to so many different fields, and it’s never boring, because I’m always finding new applications and talking to different people.

The next big thing? Maybe high-dimensional data merged with non-Euclidean objects – object-orientated data analysis where you’ve got these really complicated objects.

A sphere and compositional spaces are quite simple compared to some of these other data objects that are out there. That’s where the field is going. High-dimensional data.

Big data has been a huge topic in statistics for a number of years, but this object data is getting quite big too, so they’re going to merge together. I think that’s where it’s going.

You’ve got to use special techniques to try and extract information. It’s fun devising them.