Okay. One of my favorite topics. So, this is where we break away from what I call intermediate SQL, which is like JOINs, and UNIONs, and getting super familiar with working with nulls, and a lot of the topics that we covered as part of that. Creating datasets, and visualizing insights of course. And we really move into some of what I would call the advanced SQL, or the beginnings of the advanced SQL concepts. Which is using things like ranking and numbering functions. So if you've ever heard of the rank function, or people using row number, or dense rank, or leading and lagging, its to actually iterate between rows in the output. This is where we can start to get really dangerous with SQL. So let's take through a few examples here, and then work out a quick demo. Okay. So, this is where we mentioned the concept of a window function. So what ultimately we want to do here, is we start with all the way in the left, a table of employees, at an imaginary company. And they're part of two different departments. Say department one is sales, and the department two is analytics. And each of the different employees have a tenure at the organization. So this is when they first started. And the question that we want to answer at the end of the day is, rank everybody by each department, and how long that they've actually been there. Now, think about that. We actually have to break apart our original dataset. Kind of like a GROUP BY, into two different departments. But then we all sort of rank those records. So, what you could do technically, is you could do a select from the employee list there, and then where department equals one, and then do your ranking there via an ORDER BY, on the start date. And then march that together with a UNION again, for that other department, for the department number two there. But that's not a scale. We'll say instead of two departments, there are 10 departments. You don't want to keep breaking things apart manually where clauses, and then unioning it back together. And that's why we're going to introduce the analytical window function. And one of them in particular is called RANK. So, the SQL we're going to review on the next slide. Let's review conceptually what we're doing. So on the left, we have the list of employees and their tenure. The first thing we have to do is you have to split it apart by department. So that's where we're invoking this thing called partitions. We're partitioning by department. And this is taking one data table. And now we're breaking it apart so we can operate on separate chunks of that. So that's that second column there. And then next, we need to give it a logical ordering, so why or what's the whole point of actually us breaking this apart? Well, we ultimately have to answer the question of, who's been there the longest? In order to do that, we need to get our records in a sensible order before we apply a rank of the rows of one, two, three, four, five, that sort of thing. So that third column of information there, has the sorted results there. So you can see like Jacob there for department number one, has been the longest since 1990, and Isabella in department two, has been in that department since 1997. And last but not least, is just actually applying a rank function, which is going to be a new column all the way at the end. And you can see that we're just simply applying the ranks of the rows one, two, three, four, five. And there in the case of ties, you could actually use, for rank, it's going to be one, one and then you could say two, three, four. Or if you just wanted the dense rank, one of the functions, I believe it's dense rank, will actually, ties are still counted as one, one, but it'll actually skip over and say one, one and then three. So I'm pretty sure that's the rank, whereas the dense rank will actually go one, one and then two. Anyways, so recap of what we are actually just did starting from left to right. We have a employees table. We're splitting it apart into different subgroups, or windows, and then we're ordering on it. And then lastly is we're applying that ranking function. Just one of the available analytical functions, SQL analytical functions, that we're using on this data. So what is the actual SQL query look like? Here we go. So we've got the first name of the employee, the department. And again, this is actually how we're going to break apart that singular table into those different chunks. Those different chunks we're going to call partitions. And then we've got the start date, which is how we're actually going to get value out of why we're breaking apart. In the first place because we have to answer the question of who has the most tenure at the firm? And the last field is that calculated field there, where it's the analytical function that we're invoking. So it's saying perform a rank operation over in parentheses. Here's how we're going to set up those windows of data. Here's how we're going to partition it. So we're going to say, partition by the department, meaning break apart that single table into those two different departments, we'll have many different departments. And while you're at it, store each of those by a sorted order, where it's start date. And again, order by default to ascending in this case and for dates that means oldest to newest. All right. That's a part of the rank function. What I'd like to do now is, let's make it a little bit more real by jumping into BigQuery, and then writing an example of our own. Okay. So here we go. So, our goal here, since we're already familiar with the IRS dataset, we are going to find, let's do something fun. Let's do the largest employer of these nonprofits per US state for the 2015 filing year. Immediately, lights should be going off right now, when I said per US state. So taking one table, and breaking it apart, while we're processing some kind of function over those broken apart partitions. So, what are you going to do? So the first thing we need to do is we need to find the data table where that data is stored. So, we need to pull up the, let's see in the left hand side 2015, and I'm going to copy this, because this is actually going to go away when I query this table. Let's pull up the query table. And again, the reason why I really like doing that is I hate actually writing out the manual project dataset and the table analysis. For sure if I write, I'll make spelling mistakes because that's the first thing that I normally do in demos, is do a bunch of typos. Okay. So we have all of our data. So what do we actually want to pull from here? Let's see. Let's get some identifying information. Let's pull. Let's see maybe the EIN. Let's see if we can get like a count of employees. Let's see employees with a simple control find, number of employees, perfect. So I'm going to add a comma, number of employees, and I'm going to get a little bit more of a clear alias, as number of employees. Ah, so now you're asking, okay, so you've got your identifying number for each charity, and you've got your number of employees, where did you get that state information from? And of course, if you remember from the joining and emerging lesson that we went over in the previous course, we are going to do that through a JOIN. So in order to do that, we are going to JOIN on the organizational details which is just simply ein. And you can join in those keys by using an alias. We used this before as tax and as org, but I'm going to show you a little shortcut since we're in the advanced topic. If it's the exact same name of the field across each different datasets, you can use this pretty cool clause called USING and it'll save you just a little bit of typing. So we've joined those two together, and we get the advantage of clicking into that dataset, again, holding down the command or control key to quickly get to that dataset schema. The reason why I'm going to do that is I want to pillage it for some of those columns that might be interesting. So I need to get state and I need to get the name. So I actually want, I'm going to reorder this a little bit. And the reason why we don't need aliases here, like prefixing these, is because there's no ambiguity. There's no shared column names between the two datasets. Okay. So we've got state. We've got all of this good stuff. Let's make sure our query runs and then we'll start writing some advanced analytical functions. Okay. Cool. So it's pulling. There's no crazy nulls or anything like that that we will see happening here. It's got some employees and states, and then we can see some of their charity names. But we are just getting on our journey. So, what we want to do is we want to answer the question per US state. So here's the state. Add a comma here because we're going to be adding this analytical function. So you know you want to do something with state. So when you say, break it apart in SQL speed, that means PARTITION BY state. I need to [inaudible]. What else am I going to do within there? Well, I want to get the top, the highest number of employees first. So we need to throw in an ORDER BY. So ORDER BY the number of employees, and we want highest to lowest. So highest to lowest is going to be DESCENDING. And what function do we alternately want to do on this partition that we've just created here in parentheses? Well, we want to do a rank function. So we want to rank the actual one, two, three, four, five, for those top charities. So we need to rank the records over this particular partition that we created in parentheses. And that's generally how I'll write this. I don't really write it with rank over first. I'll generally figure out like, okay, how am I partitioning my data? And then I'll invoke some analytical function over it afterwards. And since this is a new function, you'll remember that we'll actually have to give that an alias. So that is step one. Now we have not, if you're already ahead of the game, we have not yet filtered for the top ranked one. And as you can see, we have the ranking for these records and this is, for the state we have already got some of like the top number of employees for these. But we need to filter this. Now, what we could do is we could do WHERE and then WHAT. If you remember from a previous lecture, why wouldn't this work? So we run it and this is, bam, query failed. Now if you recall, you cannot use the alias of a fuel that you just created directly after FROM. Right? So where you can use the aliases in things like, the ORDER BY or HAVING, but you can't use it in the WHERE because it doesn't exist yet at the time this dataset is being filtered. So, I don't want to copy over that analytical function into the WHERE because that would create really messy code. So what's one of the things that we can do? Well, I'm going to introduce a concept that may be new to a few of you. And it's called the creation of these temporary tables. So, this data is fantastic. We just need to filter it simply and maybe perform some other operations on it. But we've essentially created a dataset that either we could store it like in a permanent table and then query. But if I just wanted to do this all in one query, what we could do is, we could store into a temporary table. And we'd do that through what's called invoking a common table expression. I want to cover these a little bit more in the last part of this lecture. So WITH, let's say, just give it a creative name, where we're going to name the temporary table. So this is employees_count_ per_state. With employees_count_ per_state and then wrap the entire thing in parentheses, that allows us to then call this as a table directly. Which is pretty cool. So we can then do a SELECT to give me all of the fields, which is going to be these five that are here, FROM the temporary table that we just created with a common table expression. Give me all of those fields. And then here, now that that already exist because that query is ran before this, then I can actually use that alias in this. And I neglected, we actually need to alias this entire thing. So it's WITH, the entire name, AS. And then this entire query is ran, stored into here. You can actually change multiple of these with statements together, which is pretty cool. And it allows you to break apart a complex query into multiple steps. Okay. So let's see. Let's go ahead and give this a go. And then the other benefit of breaking this out into a subquery is now what we can actually do is order the results. Now the interesting part is, since we didn't do any kind of de-duplication, where you saw some of the dirty data issues with EINs that have filed more than once in one tax period. We can do a quick de-duplication by saying, "All right." Well, before we're doing any more work here, what I want to do is de-duplicate this first query and we can do that simply by doing this. So GROUP BY 1, 2, 3, 4 and then we have this aggregate function for RANK. 1, 2, 3, 4 correspond to the first four columns in that SELECT statement. That will actually de-duplicate those results for us before the parse. As you can see, we have duplicative results in here and we want to avoid that in our final query. It is a symptom of bad data. So, run that query again, and you'll see different results. And at least in the file that we'll get here there's no duplicates. And it went down from 71 to 61. And we'll cover why there are 11 more than what is traditionally thought of 50 United States in just a second. So the last thing that I'd like to do is I want to ORDER BY the total number_of_employees. So I can get the states with the most at the top. So what we've done is two things. We found the charity that has the most employees per state. And then, we're sorting all of those results by the states that have the charities that have the most. So you're comparing two things. The charities within the states, that's the contest first, and then there's the states amongst themselves with their largest charitable employer. Okay. Final query. Let's go ahead and run this and then you can get to the answer you're looking for. So it looks like A FAMILY AFFAIR BRIDGES INCORPORATED and 787,000 employees in California. And then we have CROSSBRIDGES, KAISER FOUNDATIONAL HOSPITALS in Oregon. And again, these are the rank of one. And you can really get a sense for some of your data if you go all the way to the last, and you can see potentially strange state names, where this could be like the armed forces or this could be not really actually sure what AS is. And then here we go, a few others like Virgin Islands, like things that are part of the US and US territories like Guam. And you can also see, if we keep going long enough, you actually do have some that are partitioned by state and one of the states is a valid value of a null. So if you additionally wanted to filter out nulls, you could add in that filter where state is not null as well. Okay. That was a pretty in-depth demo that covers a couple of different things. On a recap, the top three that I think you should know. Number one, if you have a very complex query, nesting all of this inside of this temporary table and then potentially chaining multiple of these together, is a very extremely powerful thing that once you just realize that you can do this, you could be doing this all the time, trust me. And it's instead of writing just very, very complex queries, just break it apart into different steps. So step by step. So that's a really good thing to know. And the second is partition functions or advanced window functions in SQL are not that scary. What I generally like to do is I reverse in here the question. I basically say, per US state. That means PARTITION BY state. And then what I want to do on the largest? Okay. Largest means this. And then, I apply that function, ranking function 1, 2, 3, 4, 5, and then, filter it out from maybe the top one or if you just wanted to do the top two, we could do that as well. So, for the top two it would be- well, let's do with the rank. Is less than three or less than or equal to two. So this will be the top two. Right? And hopefully, that will give us about 122 rows, that's 61 times 2, 120. Maybe there's something that they are a tie. And this is the top two, keeping in mind that the states themselves might not be ranked like California because the second largest in California does not beat out the first largest in Tennessee as well. So again, you can play around with ranked functions to your heart's content. And yeah. Again, Big Learnings are using the common table expression to create those temporary tables and using partitions with some pretty fine analytical functions. RANK just being one of them. Also new was the USING instead of doing the key equals key. And GROUP BY indexing your columns here for quickly removing duplicates. Okay. Let's take a deep breath and continue on with advanced functions.