Data scientists can improve the accuracy, speed and longevity of science by focusing on solving issues related to the “life cycle” of data, said Deb Agarwal, a Lawrence Berkeley National Laboratory senior scientist, at the Women in Data Science at Berkeley conference this month.
Often, data scientists support science through machine learning, Agarwal said. But there are opportunities to look past the algorithms to help scientists better generate, manage and preserve data that can speed up the cycle of discovery and unlock future revelations.
This is “a life cycle problem, from the beginning of collecting the data through to the understanding in the science,” said Agarwal, a research affiliate for the Berkeley Institute for Data Science, part of the UC Berkeley Division of Computing, Data Science, and Society (CDSS). “Because for us in the science areas, until it's impacted the science and enabled understanding, is it of use that I collected the data? Maybe for somebody in the future, but it's not of use to me.”
Data science is already critical to science, Agarwal said. But with increasingly affordable and compact tools like sensors being deployed more widely and with demand for real-time data growing due to crises like disasters and climate change, there are even more opportunities for data science to create impact in scientific fields.
Agarwal spoke at the March 8 Berkeley conference, an annual event dedicated to spotlighting women in the data science field on campus. CDSS, the Lawrence Berkeley National Laboratory and the School of Information were among the co-sponsors for this event. It is part of a broader international conference sponsored by Stanford University with the same mission.
Improving Scientists’ ‘Ability to Do Quality Work’
Data science can help solve science’s data life cycle problem at every phase, from data generation and management to preservation and analytics, Agarwal said.
Data is often harder to collect and share in certain parts of the world, which can bias related data and findings. So far, data scientists have built one-off solutions to these kinds of problems. But they could help create a toolkit to address these gaps, which could result in a less biased scientific understanding of the world.
More data also requires more processing, a task currently left to domain scientists. But data science could be used to build tools that check collected information for bad data or gaps, which could speed up this part of the process and make the resulting data more accurate, Agarwal said. This could “dramatically improve [domain scientists’] ability to do quality assessment,” she said.
Data scientists can help other scientists get the most value out of their data, too. They can help scientists think about how to preserve data collected now in long-term, machine-readable formats that will remain useful to scientists decades from now.
Data scientists can also work with scientists to standardize data so it can be shared and used across fields and disciplines. And they can help make scientists’ metadata more easily searchable and accessible, so scientists can better understand each other’s data.
There are endless ways data scientists can take a more active role in science outside of algorithms, Agarwal said. But when data scientists step in to work on these problems, she emphasized they should keep in mind who the end user is and what their needs are.
“Understanding what the purpose of this data, of these analytics, of these products [is crucial] so that you can understand what it is you're targeting. We too often forget that at our peril,” said Agarwal. “Until we talk to the user and work with them to understand their needs, we don't know those answers.”