[MUSIC] Welcome to Introduction to Data Science. My name is Bill Howe. I'm the Director of Research for scalable data analytics at the University of Washington eScience Institute, and an affiliate assistant professor in Computer Science & Engineering, also at the University of Washington. So, in this first segment, what I want to do is go through some examples of data science activities and projects from the recent past that I found interesting, and use them to sort of whet your appetite for the concepts that we're gonna learn in this course.
Okay, so the first one I wanna mention here is the presidential election from 2012. And I know you're probably sick of hearing about this if you live in the United States, and even if you don't, you might be sick of hearing about it, but bear with me. So, this is a map of the electoral college. Each state is colored for the candidate that took its electoral votes, and the numbers represent how many electoral votes each state has. And so, if you recall, what was interesting about this map at the time was that it led to a pretty significant discussion in the media about data science, because Nate Silver of the 538 blog was able to predict this map perfectly before the election. All right.
And that discussion in the media talked a lot about what a genius Nate Silver was, and mentioned the sophisticated mathematics he was using and how he's sort of a whiz with these things. But what I thought was interesting about this was that Nate Silver would be the first one to tell you that the methods he was employing to make this prediction were actually pretty simple, right? And so he says here, in a series of quotes from blog posts around that time, this first one from October 26th: the intuition behind this ought to be very simple. Mr. Obama is maintaining leads in the polls in Ohio and other states that are sufficient for him to win 270 electoral votes.
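As a rough illustration of the intuition in that quote, here is a minimal Python sketch: average the poll margins in each state, give each state's electoral votes to whoever leads on average, and compare the total to 270. Everything in it is an assumption made for illustration. The state names, poll margins, electoral vote counts, the `already_safe` baseline, and the function names are all hypothetical. It is not Nate Silver's actual model, which also quantified uncertainty.

```python
from statistics import mean

# Hypothetical poll margins (candidate A minus candidate B, in percentage
# points). These numbers are invented; they are not real 2012 polls.
polls = {
    "State X": [2.0, 3.5, 1.0],
    "State Y": [-1.0, 0.5, -2.5],
    "State Z": [4.0, 5.0, 3.0],
}

# Hypothetical electoral vote counts for the contested states.
electoral_votes = {"State X": 18, "State Y": 29, "State Z": 13}


def aggregate_polls(polls):
    """Average the poll margins for each state."""
    return {state: mean(margins) for state, margins in polls.items()}


def projected_votes(avg_margins, electoral_votes, already_safe=0):
    """Add up electoral votes for states where candidate A leads on average.

    already_safe is a hypothetical count of votes from uncontested states.
    """
    return already_safe + sum(
        ev for state, ev in electoral_votes.items() if avg_margins[state] > 0
    )


avg = aggregate_polls(polls)
total = projected_votes(avg, electoral_votes, already_safe=240)
print(avg)
print(total, "reaches 270" if total >= 270 else "falls short of 270")
```

With these made-up numbers, candidate A leads in State X and State Z but not State Y, so the projection is the assumed 240 safe votes plus 18 plus 13.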
And what was funny was, a few days later, he'd become sort of more blunt: the argument we're making is exceedingly simple. Here it is: Obama's ahead in Ohio. It's not a magic trick. So then, after the election, on November 10th, when he was shown to be right and got this sort of flawless prediction, the blog post that the last quote was taken from was describing why he started the 538 blog in the first place. And he says, look, you know, the bar set by the competition was invitingly low. Someone could look like a genius simply by doing some fairly basic research into what really has predictive power in a political campaign.
And so what really had predictive power in this case was the state polls themselves, aggregated. Right? So historically, the state polls, aggregated, did a pretty good job of predicting the outcome of the general election. And so that's what he did. Now, there was some sophisticated work in quantifying the uncertainty, and certainly in presenting these results. There are a lot of beautiful interactive visualizations that he created in order to convey these ideas to the public. And that's one of the points that I want to make about this: getting the answer, in some cases, is the easy part. It's interpreting the results, and convincing others of the results by presenting them, usually through visualization, that can be the hard part. And this is one of the themes that we'll come back to throughout this course.
Okay. So just to summarize, though, because I'm not sure I said this: simple methods plus enough good data win. That trumps more sophisticated methods in many cases. And that's another theme that we'll come back to.
All right. So something else related to the campaign, before we move on to other topics, was the system that the Obama campaign used for their data-driven ground game, so to speak: the ability to target messages directly to specific categories of voters.
And so what they did was they built and maintained a massive voter database, and used it to design these highly tailored messages to very specific groups. So, take the mother of two in a small town in Ohio who tweeted about the environment, mentioned organic vegetables on her Facebook page, had voted in 2008, and had registered on Obama's website but had never donated. Okay, you know, she'd get a message from Michelle Obama that highlighted Barack Obama's environmental policies. Okay.
And so in order to design these messages, what you had to do was kind of do ad hoc hypothesis testing about what might work and what wouldn't. You had to kind of slice and dice this data at interactive speeds. And this is another theme that we'll return to: the need for this kind of ad hoc, interactive analysis.
And the systems they used for this are pretty interesting too. You know, this was a SQL database, a very fast one called Vertica, and we'll talk a little bit about what makes Vertica special, I hope, toward the end of the course. But it is a SQL database. And SQL sometimes gets a bad name in data science contexts as being for the old guard, something that can't possibly be used for analytics and doesn't really make sense in today's era. But, you know, don't believe it. Right, it has a role to play in many cases. And so here, you know, they did use Hadoop, right, to do the aggregate generations of anything not real time, he says here, but for the speed-of-thought queries about the data, they used this Vertica database. Okay? And so we'll come back to systems in several segments from now. [MUSIC]
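To give a feel for the kind of ad hoc slice-and-dice query described above, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a SQL database. This is an assumption made for illustration only: the campaign used Vertica, and the `voters` table, its columns, and every row below are hypothetical, not the campaign's actual schema or data.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real analytics database.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per voter, with a few yes/no flags.
conn.execute(
    """
    CREATE TABLE voters (
        id INTEGER PRIMARY KEY,
        state TEXT,
        voted_2008 INTEGER,
        registered_on_site INTEGER,
        has_donated INTEGER,
        interest TEXT
    )
    """
)

# Invented rows, purely for illustration.
rows = [
    (1, "OH", 1, 1, 0, "environment"),
    (2, "OH", 1, 1, 1, "environment"),
    (3, "OH", 0, 1, 0, "economy"),
    (4, "FL", 1, 1, 0, "environment"),
    (5, "OH", 1, 1, 0, "environment"),
]
conn.executemany("INSERT INTO voters VALUES (?, ?, ?, ?, ?, ?)", rows)

# Ad hoc question: among people who voted in 2008 and registered on the
# website but never donated, how many are there per state and interest?
segment = conn.execute(
    """
    SELECT state, interest, COUNT(*) AS n
    FROM voters
    WHERE voted_2008 = 1 AND registered_on_site = 1 AND has_donated = 0
    GROUP BY state, interest
    ORDER BY n DESC, state
    """
).fetchall()
print(segment)
```

The point of the sketch is that each new hypothesis is just another `WHERE` and `GROUP BY` combination, which is why fast SQL querying suits this kind of interactive exploration.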