Over the past 30 years, UC Berkeley statistics Professor Bin Yu has covered a lot of territory, both in her research field and in sharing her knowledge with others. Now she has drawn on that experience to create a framework that she believes will lead to a more rigorous and trustworthy data science process, including the use of methods such as machine learning.
Yu and her team at Berkeley have developed novel statistical machine learning approaches and are combining their work with the domain expertise of collaborators to solve important problems in the fields of neuroscience, genomics, and precision medicine.
"Artificial intelligence has huge potential to help us solve critical problems," said Yu, who is also a professor in UC Berkeley's Department of Electrical Engineering and Computer Science and the Division of Computing, Data Science, and Society (CDSS). "But there is a lot of misunderstanding as well. We need to have realistic optimism."
Yu points to applications ranging from self-driving vehicles to precision medicine to medical imaging where AI is seen by many as the answer to the problems. The challenge of creating safe self-driving cars can be seen in the number of startups in the field that have folded, she added.
Behind all of these efforts is the discipline of data science, which Yu has described in writing as a “field of evidence seeking that combines data with information from a research domain to generate new knowledge.” It’s this process that she wants to make more consistent and trustworthy.
Yu laid out her framework for integrating predictability, computability, and stability, which she calls PCS, in her paper "Veridical data science" co-authored with her former student Karl Kumbier (now a postdoc at UCSF) and published in the Proceedings of the National Academy of Sciences in February 2020.
Yu and Rebecca Barter, her former student and current postdoc, are now adapting the material into a textbook to be published by the MIT Press in 2021. They also plan to make the material available online at no cost. Yu adds that "veridical" refers to something that is truthful or coincides with reality.
Yu was inspired to develop the PCS framework as a result of her interdisciplinary research projects with the Gallant Lab in neuroscience on the Berkeley campus and the Celniker and Brown Labs in genomics at Lawrence Berkeley National Laboratory. Recently, the PCS framework successfully guided the development of novel statistical pipelines epiTree and staDISC by the Yu Group and collaborators to recommend possible genetic drivers of disease and subgroups of people for which a particular treatment is effective, respectively. The projects are part of a research program to advance precision medicine.
The key, she says, is to look at the use of data science as a cycle, not a set of linear steps. In this scenario, the cycle of steps begins with the posing of a science question in a particular domain and proceeds through collecting, managing, processing (or cleaning), exploring, modeling, and interpreting data results to guide new actions.
“We need to look at the whole data science life cycle and make sure it is trustworthy,” Yu said. “Every step needs to be vetted.”
According to Yu, since data science typically crosses over multiple research disciplines, it requires human involvement from experts who understand both the domain and the tools used to collect, process, and model data.
"These individuals make implicit and explicit judgment calls throughout the data science life cycle," she said. "Since there is limited transparency in reporting these judgment calls, the evidence behind many analyses is blurred and we are seeing more false discoveries than might otherwise occur."
She describes PCS as a conceptual framework for asking critical questions and documenting them at every step of the data science life cycle. In fact, she sees an important role in data science for critical thinking as it is taught in the liberal arts.
“The first step is to make an argument to yourself, then make an argument to the reader as to why your thinking is sound, making the process transparent,” Yu said. “Documenting these steps is integral to the process. You need to make a concise summary of why the work is responsible, reliable, reproducible, and transparent.”
The core principles of predictability, computability, and stability are the basis for such a unified data analysis framework, which builds and expands on principles of statistics, machine learning, and scientific inquiry. Yu points out that many of the ideas embedded in PCS have been widely used across various areas of data science and sees them as the minimum requirements for achieving her goal of veridical data science.
In other words, PCS synthesizes, streamlines, and expands on these ideas as an accessible protocol or pipeline to share best practices for a quality-controlled data science life cycle. At the same time, it emphasizes the importance of a data scientist's domain knowledge, critical thinking, and communication skills.
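As a rough illustration only, the three principles can be sketched in a few lines of Python. This is not code from Yu's paper or her group. It assumes that predictability is checked as prediction error on held-out data, that stability is checked by refitting on bootstrap-resampled data and watching how much the result moves, and that computability simply means the whole check stays cheap to run. The function names `fit_line` and `pcs_check`, the synthetic data, and the 80/20 split are all hypothetical choices made for this example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def pcs_check(xs, ys, n_perturb=200, seed=0):
    """Hypothetical PCS-style check (an assumption, not the authors' code)."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]

    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])

    # Predictability: mean squared error on data the model never saw.
    errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in test]
    heldout_mse = statistics.fmean(errors)

    # Stability: refit on bootstrap resamples of the training data and
    # measure how much the estimated slope varies across refits.
    slopes = []
    for _ in range(n_perturb):
        sample = [rng.choice(train) for _ in train]
        _, slope = fit_line([xs[i] for i in sample], [ys[i] for i in sample])
        slopes.append(slope)

    return {
        "slope": b,
        "heldout_mse": heldout_mse,
        "slope_sd": statistics.stdev(slopes),
    }


# Synthetic data (made up for this sketch): y = 1 + 2x plus Gaussian noise.
data_rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [1.0 + 2.0 * x + data_rng.gauss(0, 0.5) for x in xs]

report = pcs_check(xs, ys)
print(report)
```

A small held-out error together with a small spread in the slope would suggest that the fitted relationship is both predictive and stable under this one kind of perturbation. A fuller analysis in the spirit of PCS would also perturb the judgment calls themselves, such as data cleaning choices and model families, and would document why each choice was made.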
Even though many research fields have successfully embraced artificial intelligence (AI), Yu notes that there is still a lot about AI and machine learning that is not understood. She is a co-principal investigator on a Berkeley-led $10 million project funded by the National Science Foundation and Simons Foundation to gain a theoretical understanding of deep learning: how it works and why it works.
Why now?
Yu said she made the decision to push forward her PCS framework for both personal and professional reasons. Around the time she reached the milestone age of 50, her mother became seriously ill, and she wanted to make a statement about an issue she feels strongly about. Although she was eligible to submit a paper to the Proceedings of the National Academy of Sciences upon her election in 2014, she deliberately took her time to refine and polish her ideas before publishing them.
Her work draws on a wide range of perspectives she has gained. After earning her bachelor's degree in mathematics at Peking University, Yu went on to earn her M.S. and Ph.D. degrees in statistics from UC Berkeley. In addition to working for two years at Bell Labs in New Jersey, she has been a professor at the University of Wisconsin-Madison, a visiting professor at Yale University, and a visiting faculty member at MIT, ETH (the Swiss Federal Institute of Technology in Zurich), Poincare Institute in Paris, Peking University, Inria-Paris (the French Institute for Research in Computer Science and Automation), Fields Institute at University of Toronto, Newton Institute at Cambridge University, and Flatiron Institute in New York. Yu is an investigator with the Chan-Zuckerberg Biohub and the Weill Neurohub. She is also a member of the American Academy of Arts and Sciences.