An interview with Data 8 co-creator Ani Adhikari
UC Berkeley Statistics professor Ani Adhikari helped develop and is one of the principal instructors of the hugely popular course Data 8: Foundations of Data Science, which teaches data science applications to thousands of students a year from scores of different majors. Below are some highlights from an interview with Professor Adhikari exploring how the innovative course engages and serves a broad cross section of students.
Q: Data 8 is known for working with students from all kinds of academic backgrounds and interests, including those without advanced math or computer science. How do you go about teaching this range of students?
A: I think it’s really important to remember that Data 8 is a collaboration between computer science and statistics at Berkeley, two departments which, while renowned as research institutions, have also long been pedagogical leaders in their fields, so we are standing on the shoulders of giants. On the statistics side, our work is hugely influenced by the textbook that came out of our department in the 1970s: Statistics by Freedman, Pisani, and Purves. They really turned the teaching of statistics on its head; they just said, “What’s the heart of it, what does everybody need to know?” It’s not a bunch of formulas and rote memorization.
We are taking what is really deep in that and updating it so that we can use the new computational techniques and the new media, because what we now have at our disposal is extremely easy to use. It seems paradoxical but it’s true that students can get a better understanding of the fundamentals because they’re able to do larger analyses. They can take entire data sets, draw their own pictures, compute their own quantities of interest.
Before, this had to be done by hand or on a hand calculator, which is very tedious, so we had to summarize the data down for students. We would lead them to look at just a few things, or we drew the graph and said: “Interpret the graph.” But the liberation now is that we can simply hand them the entire data set. Now they can decide the right graph to draw, which is based on intellectual considerations; the actual drawing of it is very simple. This gives the students a lot of agency.
When the computation becomes easy, we are able to focus on: What were the underlying assumptions and how do you interpret the results? How do you check that your data meets the assumptions and then how do you justifiably interpret your answer in the language of the question, not with statistical jargon? In other words, if the question is medical—for example, “If I give this medicine to this child, will it cure their spots?”—then your answer should not be: “The chi square statistic with these many degrees of freedom comes out with a p-value of such and such and therefore the result is highly statistically significant.” Your answer should be: “Medical professionals, the data suggest that this treatment is actually doing something effective, and now it’s up to you to do the biology to figure out exactly what it’s doing.”
In a setting where there are myriad different factors and lots of interconnected things, data science can shine a light on a place where the investigation should look. Then these insights should be discussed with the medical professional or the political scientist or the sociologist.
Q: And so today it is easier for all kinds of students to understand and apply data science because the computation is more automated?
A: Not only more automated, I think it is easier to think about the computation because you can view large data structures easily. You can see your table of data and every row corresponds to a person and then every column corresponds to a feature of that person. So I can look at one person and all their features or I can look at everybody’s education level as one feature across all people. So visualization is easy. That makes the concepts very concrete. And yes, the actual calculation is rather easy to do. And we have taken care to choose and develop a computational system such that the command you type into the computer actually makes sense as English.
Python is a very natural language for doing mathematics, and the data science library that was developed here paid special attention to the fact that it was going to be used for data analysis by people who don’t have a lot of, or really any, computer science or mathematics background. So it had to make sense. When you read a line of code, you should be able to tell what it’s doing; not a lot of dollar signs and semicolons and underscore, underscore and so on, but actually words that make sense.
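To illustrate what “reads like English” means in practice, here is a minimal sketch using the datascience package built at Berkeley for courses like Data 8 (assuming that is the library being referred to); the table, column names, and values are invented for the example:

```python
# A minimal sketch with the Berkeley `datascience` library's Table class.
# The column names and values are invented for illustration.
from datascience import Table, are

people = Table().with_columns(
    'Name',      ['Ana', 'Ben', 'Cal'],
    'Age',       [34, 29, 41],
    'Education', ['College', 'High School', 'College'],
)

# Each row is one person; each column is one feature across all people.
people.column('Education')                          # everybody's education level
people.where('Education', are.equal_to('College'))  # just the college-educated rows
people.sort('Age', descending=True)                 # the same table, oldest first
```

Each line is close to a plain-English sentence about the table, which is the point being made above.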
Q: Data 8 draws on examples from a lot of different fields—could you talk a bit about some of the data sets you work with, and how you use them to explore different concepts?
A: We use a study about the ethnic composition of juries in Alameda County to look at sampling variability. The jury panels that were selected did not exactly resemble the distribution of the eligible population, but if you select at random you don’t get exactly what you’re selecting from; you get a little bit of give-or-take. How much is the give-or-take? Is it reasonable give-or-take if you’re selecting at random?
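One way to make the “give-or-take” concrete is to simulate it. The sketch below uses made-up population proportions and a made-up panel size, and measures each simulated panel’s distance from the eligible population with total variation distance, which is one natural choice of yardstick:

```python
# Simulating sampling variability for jury panels.
# The eligible-population proportions and panel size are made up for illustration.
import numpy as np

eligible = np.array([0.15, 0.18, 0.12, 0.55])   # hypothetical ethnic-group proportions
panel_size = 1453                                # hypothetical total number of panelists

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two distributions."""
    return np.abs(dist1 - dist2).sum() / 2

rng = np.random.default_rng(0)
tvds = []
for _ in range(10_000):
    # Draw one panel at random from the eligible population and record its composition.
    counts = rng.multinomial(panel_size, eligible)
    tvds.append(total_variation_distance(counts / panel_size, eligible))

# The spread of `tvds` is how much give-or-take random selection alone produces;
# an observed panel far outside that spread is hard to explain by chance.
print(np.percentile(tvds, 95))
```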
To look at concepts related to classification, we use a dataset that came from a student who developed a machine learning system to classify cells as cancerous or not. How do you use attributes of a cell to predict whether it is cancerous or not, and what fraction of the time are you going to get the prediction right?
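The predict-then-score idea can be sketched in a few lines. The two attributes, the data, and the choice of a simple nearest-neighbor rule below are assumptions made purely for illustration, not the actual student project:

```python
# A toy nearest-neighbor classifier: predict 'cancerous' vs 'not cancerous' from two
# invented cell attributes, then measure what fraction of predictions are right.
import numpy as np

rng = np.random.default_rng(1)

# Invented data: each row is (attribute 1, attribute 2); labels are the true classes.
train_x = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
train_y = np.array(['not cancerous'] * 50 + ['cancerous'] * 50)
test_x  = np.vstack([rng.normal(2.0, 1.0, (20, 2)), rng.normal(5.0, 1.0, (20, 2))])
test_y  = np.array(['not cancerous'] * 20 + ['cancerous'] * 20)

def classify(point):
    """Predict the label of the closest training example."""
    distances = np.sqrt(((train_x - point) ** 2).sum(axis=1))
    return train_y[distances.argmin()]

predictions = np.array([classify(p) for p in test_x])
accuracy = (predictions == test_y).mean()   # fraction of the time the prediction is right
print(accuracy)
```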
In another case, students look at films, and based on attributes in the script, they try to use the machine to classify the movie as an action movie or a romance movie. The data there are not numbers, but text. Based on textual features, they try to sort the films into one of two groups. We have a large data set that is a random sample of mothers and their newborns and various characteristics that we use to study relations between variables. We look at Deflategate, the controversy about whether a football team was unfairly deflating its footballs; how much a weight lifter can bench press; the length and the age of dugongs…we’re trying to keep it varied!
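For text data like film scripts, the first step is to turn each script into numbers, for instance the proportion of its words drawn from some chosen vocabulary. The two marker words and the snippets below are invented solely to show the shape of that step:

```python
# Turning text into numeric features: for each (made-up) script snippet, compute the
# proportion of its words equal to chosen marker words.
def word_proportions(script, vocabulary=('love', 'explosion')):
    words = script.lower().split()
    return {w: words.count(w) / len(words) for w in vocabulary}

action_like  = "the explosion ripped through the bridge before the second explosion"
romance_like = "she knew that love would find them and love would keep them"

print(word_proportions(action_like))   # higher 'explosion' proportion
print(word_proportions(romance_like))  # higher 'love' proportion
# A classifier can then sort a new script toward whichever group it more closely resembles.
```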
We still have a lot of work to do. Data tend to grow old rather quickly. The pilot class was in 2015, and now in 2019 some of the data already feels old. For example, in 2015 it was great that we were using data from the NBA from 2013, but now in 2019, an entire generation of players has retired.
Q: How do you teach complicated concepts like inference and causality?
A: Inference is the very first thing we do after the introductory lecture. We take one example: John Snow in London and how he worked his way into really identifying—to the extent that a data scientist really could do—dirty water as the culprit that caused cholera. This raises issues—if you see that the people who drink this water have this outcome and the people who drink that water have this other outcome, can you immediately ascribe it to the water? What are the other things that you have to consider? I like to do that example early not only because it asks the big questions in a way that everybody can understand, but also because it points out to students that data science is not new. This man was using visualization and data analysis, careful data analysis, in the 1850s, and others had done it before him.
How do we talk about causality? As the students have a little more technique, a little more ability to analyze data, we take data from a randomized controlled experiment and we try to say, very precisely and in English, “What is the question we are looking at?” We can see this person was in the control group and they had this outcome; what would their outcome have been had they been in the treatment group? The calculation is easy; the students have done it many times before. What’s deep is the thinking; it’s: How do you formulate the question and how do you interpret the answer? And that kind of thinking every Berkeley student can do.
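A sketch of the kind of easy calculation behind comparing a treatment group with a control group is below. The outcome values are invented, and the label-shuffling (permutation) approach is one common way such comparisons are taught, offered here as an assumption rather than as the course’s exact procedure:

```python
# Comparing an invented outcome between treatment and control by shuffling labels:
# if the observed difference in means looks typical of the shuffled differences,
# chance alone is a plausible explanation; if it is far outside, it is not.
import numpy as np

rng = np.random.default_rng(2)
treatment = np.array([7.1, 6.8, 8.0, 7.5, 6.9, 7.7])   # made-up outcomes
control   = np.array([6.2, 6.5, 6.0, 6.8, 6.1, 6.4])

observed_diff = treatment.mean() - control.mean()

pooled = np.concatenate([treatment, control])
shuffled_diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)                  # pretend the group labels carry no information
    shuffled_diffs.append(pooled[:len(treatment)].mean() - pooled[len(treatment):].mean())

# Approximate p-value: how often does shuffling alone produce a difference this large?
p_value = (np.array(shuffled_diffs) >= observed_diff).mean()
print(observed_diff, p_value)
```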
Q: How do you encourage students from a variety of backgrounds to even consider data science?
A: We’re actively encouraging students from backgrounds that are not typically associated with data science to take this class, and we’re trying to support them with the Data Scholars program—they have their own lab and their own seminar. What we’re trying to do is not lose them, to make sure that they understand that they belong and to give them the support needed to move forward.
Q: Do you think it’s helpful for students to see you as a woman in the field of statistics?
A: I’m quite confident that it is. Numerous students have come and told me, “You are the first woman I’ve had teach a STEM class.” And that is important not just for the women but also for the men. Because their attitudes are shaped by what they know. If everybody, men and women, whatever ethnic background, see diversity on the faculty, that will affect their view of the profession.
Q: What do students do after they take Data 8?
A: You don’t just teach people in Data 8 or Data 8x and they all sort of melt away. They’re all still there clamoring for more. In last fall’s Data 8 exit survey, two-thirds of the 1,300 students said they wanted to take another data science class.
If you want to seriously explore data science, become a professional data scientist, you can take several paths. You can apply the techniques of data science in a particular domain, which is what a lot of our students do—you know, I’m a neuroscientist, I’m a linguist, and so on. And then there are the people who want to develop data science itself or want to study more deeply to see what it’s doing. These students then have to acquire enough of a theoretical background in mathematics and computer science to go and take a look at the underpinnings of the subject. One of the crucial ones, because data science relies so much on randomness, is probability theory. They’ve got to understand chances; they have to be able to quantify randomness, and so that is what Prob 140 does.
Q: We’ve been talking about how to make this accessible and engaging for a broad segment of students. Why is that important?
A: I think that the broadest possible set of people in the world need to be owners of the data: what data is collected, who gets to keep it, who gets to delete it, what questions are asked of the data. This should not be restricted to some homogeneously trained section of society. I think it should be very broad. Otherwise, we simply lose insights.
It’s important because everyone has to make decisions based on data. In their personal lives and their professional lives, whatever their profession happens to be, students should not have to rely on people of certain majors, or a certain kind of person who works in a certain kind of industry, to do the data analysis. I think that it is fundamentally important for human knowledge. It is just a natural way of thinking, along with all the other ways of thinking, and we have students who are empowered to use that way of thinking or not, as they choose. But they are choosing from a position of strength.