This is a republication of the original post at: https://www.facebook.com/theNASciences/photos/a.395922850548784.1073741831.381954325278970/1141779482629780
Photo: Professor Bin Yu
Over the last few centuries, statistics has evolved from gathering demographic information about people into a powerful field that supports data-driven decisions in industry and data-driven knowledge generation in the sciences and social sciences. For example, clinical trials conducted by medical doctors and biostatisticians provide the evidence on which the U.S. Food and Drug Administration decides whether to approve a new, potentially life-saving drug; supervised learning in machine learning (a frontier field of both statistics and computer science) plays an important role in applications from cancer diagnosis to self-driving car design.
The burgeoning field of data science is a re-merging of computational and statistical/inferential thinking. It relies heavily on statistics as well as computer science, mathematics, and domain knowledge to solve data problems. In 1890, the volume of U.S. census data drove statistician Herman Hollerith to invent the Hollerith Tabulating Machine; his company later merged with three others to form what became IBM. It is timely to see an integration of ideas and concepts from statistics and computer science in two new undergraduate data science courses at Berkeley (http://data8.org/, http://www.ds100.org/).
The stability principle has emerged as a central principle of data science: it builds on the stability of knowledge on one hand and, on the other, connects to statistical inference or uncertainty assessment (Yu, “Stability”, Bernoulli, 2013). It is a minimum requirement for reproducibility and interpretability. In a nutshell, it holds as self-evident that data-driven decisions and knowledge should be stable relative to appropriate perturbations in the data, models, methods, algorithms, and ad-hoc human decisions in the data analysis cycle. It helps prevent p-hacking, model-hacking*, and the false discoveries that result from them. It can be employed as early as the exploratory data analysis and data visualization phase. A meaningful data pattern (e.g., a linear trend) should persist when, for instance, a random 80% subsample of the data is used, provided such subsampling is deemed an appropriate perturbation. An appropriate data perturbation yields a sample similar to the original data set, with similarity judged using domain knowledge and information about the data collection process. If the pattern does not persist, further investigation is warranted before the linear trend is claimed as a data result or discovery. This principle is conceptually simple to use and easily understood by data scientists and consumers of data results alike. Give it a try!
*Footnote: model-hacking is defined by the author as the phenomenon of trying a large number of models (or algorithms) until a desirable data result is found. It is a form of mistaking noise for real results.
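To make the idea concrete, here is a minimal sketch in Python of such a stability check. Only the 80% subsampling comes from the text above; the simulated data, the number of repetitions, and the stability threshold are illustrative assumptions, not part of the original post.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy linear trend (true slope 2).
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + rng.normal(scale=3.0, size=n)

# Refit the trend on many random 80% subsamples of the data.
n_reps, frac = 100, 0.8
slopes = []
for _ in range(n_reps):
    idx = rng.choice(n, size=int(frac * n), replace=False)
    slope, _intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes.append(slope)
slopes = np.array(slopes)

# A claimed linear trend should be stable: the fitted slope should
# keep its sign and vary little across the perturbed data sets.
# The 10%-of-the-mean tolerance below is an arbitrary illustration.
print(f"slope over subsamples: mean={slopes.mean():.2f}, sd={slopes.std():.2f}")
if np.all(slopes > 0) and slopes.std() < 0.1 * abs(slopes.mean()):
    print("Trend persists under 80% subsampling.")
else:
    print("Trend is unstable; investigate before claiming a discovery.")

Whether an 80% subsample is an appropriate perturbation, and how much variation is tolerable, are judgment calls that depend on domain knowledge and the data collection process, as the post emphasizes.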
Learn more about Dr. Yu’s work at:
https://projecteuclid.org/download/pdfview_1/euclid.bj/1377612862
http://www.nasonline.org/member-directory/members/20022958.html