I only started thinking about a career in science at the end of high school. Prior to that it was all about sport for me. I’d practised karate from the age of 6 and always imagined I would become an instructor, until I suffered a knee injury and had to give up that dream.
At that point, I put all my loves on the table. I was always a fan of mathematics and physics, and all the natural sciences. What could I do that encompassed all of them? Computer science was the answer.
Computers amaze me every day. They make me realise how much we can change the world for the better. When I started my PhD, my supervisor was working on a project related to neuroscience, helping map the brain in terms of data, and then using this digital representation to understand the workings of the human brain. I was fascinated by the capabilities of computers, and that motivated me to explore research in data analysis further.
Data is extremely powerful. It tells stories, it leads to discoveries. Everything is hidden in the data. The opportunities to improve the world around us are endless when it comes to leveraging data.
But the sheer volume of data we collect nowadays is enormous. In fact, data collection is growing exponentially: every year we generate as much data as we did in all previous years combined. No one really uses CDs to store data anymore, but if we did, the data we have collected so far, burnt onto a stack of CDs, would stand more than 200 million kilometres high. To put that into perspective, the distance from the Earth to the Moon and back is less than 1 million km. Finding insights in that data that lead to new discoveries is therefore like finding a needle in a haystack.
That’s where my research comes into the picture. It’s about building database systems that help users find that “needle”. We do this by building a database system that learns from its interactions with the user and from the queries that users ask of the database.
The next generation of data systems will place the stakeholder right at the centre. In the past, we haven’t really built databases with users’ needs and experience in mind. As a consequence, useful information is at times very hard to extract for a typical user – say, an astronomer, or a physicist, or a doctor. We’re talking about very educated people, who need to interact with databases on a daily basis as part of their work. But they can’t be expected to know how to set up the database system to support their analysis.
A large part of my work is about creating a database system that is able to adjust to the type of analysis that a user requires. For instance, an astronomer should only have to focus on what they want to discover, and ask questions towards that discovery. Behind the scenes, the database system uses those questions as a driver for automation. We try to predict the intention of users – what is it that they’re after? Is it a new star? Is it a new quasar? Is it a new black hole? We then use those questions to guide astronomers towards the questions most likely to lead to a discovery.
In my group, we are building, from the ground up, databases of a new generation, dubbed “self-driving” databases, that are grounded in machine learning. Self-driving databases, akin to self-driving cars, use machine learning and artificial intelligence to automate both mundane and complex database preparation tasks that typically require (costly) database experts. On top of that, the self-driving database monitors the type of analysis that the user is performing, and tries to interpret the user’s intention from their sequence of queries. The database then helps users formulate questions that are likely to lead to a new discovery.
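To make the idea of a database that learns from its users' queries a little more concrete, here is a deliberately tiny sketch in Python. It is an illustration only, not the system my group is building: the `WorkloadMonitor` class, the `min_uses` threshold and the column names are all hypothetical, and simple counting stands in for the machine learning a real self-driving database would use. The sketch watches which columns each query filters on and suggests indexing the ones that keep coming up, the kind of preparation task that would otherwise fall to a database expert.

```python
from collections import Counter


class WorkloadMonitor:
    """Toy workload monitor (hypothetical): counts which columns users filter on."""

    def __init__(self, min_uses=3):
        self.min_uses = min_uses        # assumed threshold before suggesting an index
        self.filter_counts = Counter()  # column name -> number of queries filtering on it

    def observe(self, filter_columns):
        """Record the columns that one query filtered on (each counted once per query)."""
        self.filter_counts.update(set(filter_columns))

    def suggest_indexes(self):
        """Return columns used in at least min_uses queries, most used first."""
        # Sort by descending count, then by name, so the result is deterministic.
        ranked = sorted(self.filter_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [column for column, uses in ranked if uses >= self.min_uses]


# Hypothetical sequence of queries from an astronomer, reduced to the
# columns each query filtered on.
monitor = WorkloadMonitor(min_uses=3)
queries = [
    ["redshift", "brightness"],
    ["redshift"],
    ["brightness", "redshift"],
    ["sky_region"],
    ["brightness"],
]
for filter_columns in queries:
    monitor.observe(filter_columns)

suggested = monitor.suggest_indexes()
print(suggested)
```

Here the astronomer never asks for an index; the suggestion falls out of the questions they were already asking. A real system would replace the counter with a learned model and would also try to predict the next useful query, which this sketch does not attempt.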
We’ve seen really creative uses of machine learning in various domains: for instance, the chatbots offered on many websites are simply machine-learning models that learn to interact with the user. The difference here is that we are building not only a new generation of databases, but also a new generation of machine-learning algorithms that can efficiently sift through those large amounts of data, effectively summarise it, extract meaning from it, and then offer this meaning to the user.
If you have a large book of a thousand pages, and you want a short summary in three sentences, you can teach a machine-learning algorithm to do that for you. In a similar vein, we are leveraging machine learning to help summarise the data stored within a database, because the volume of data is so large that it often surpasses our cognitive capabilities.
I recently read an interesting article about the discovery of the fastest-growing black hole. It was actually found using Australia’s SkyMapper telescope, and interestingly the telescope started generating data back in the early 2000s. For more than 20 years that data was available to scientists all over the world, who sifted through it extensively, yet no one found that black hole. Scientists said they simply missed it. And this is just one of numerous examples that illustrate how hard it is to find the knowledge buried in the data.
I love working with stakeholders from various domains. I try to conduct research that is applied, that is going to help people. Working with particle physicists and astronomers was an eye-opening experience. Understanding their use cases, and then offering computational solutions that help address their needs, leads to innovation across both domains.
The example I gave about astronomy is just one of my interests. I find it fascinating. I would love us to understand our world. I would love us to understand where we came from, how we came to be. Data can help us answer some of those questions. That’s very exciting.
As told to Graem Sims.