Welcome back. Recall that Spark is a compute engine; it's not designed for storing data. Data on a Spark cluster disappears when the cluster is terminated, unless it's saved elsewhere. In this lesson, we'll look at that process, focusing on tables and views, global and temporary scoping, and managed and unmanaged tables. By the end of this lesson, you will have written to both tables and views, as well as to managed and unmanaged tables, and you will have explored the effect of dropping tables on the underlying data.

You might know similar concepts from other SQL environments, namely tables and views. Creating a table writes data into an existing database; views are different. Think of a view as a stored SQL query. When we create a table, we're writing to a database. In our case, it's not a traditional database server like Postgres. Rather, we're writing files to the Databricks File System (DBFS), so those files end up sitting on S3, and we store the metadata in the Hive Metastore that comes pre-installed in this environment. This metadata is really just some high-level information about the table, such as the schema and the path to the files in the file system. We'll take a look at this in a second.

A view, on the other hand, is just the query itself. Each time we call the view, the data has to be recomputed. Is that bad? Well, it depends. If you want the fastest-running query, then yes, writing to a table is better than using a view. However, if you want the query to update automatically, then a view could be preferred, because the view can call a query that pulls the latest version of the data. Actually, just a couple of weeks ago, I was consulting with a team that was writing all of their data to views and was concerned that their queries were taking too long. We switched them to tables, and that sped up their jobs substantially.

It's worth mentioning that views and tables are scoped in different ways. Global views are stored in the Hive Metastore and are available across all clusters, since all clusters share a single metastore. Temporary views, by contrast, are only available in the notebook you're working in, and they disappear after the cluster is taken down.

First, let's create a temporary view of our dataset. I'll create a view called groupView and associate a query with it. Notice that it ran right away; that's because we're not writing any data, we're just saving the query. Next, I'll do a SELECT * and get a sense for how long it takes to run. Here we're actually executing that query, so it's going to take a little longer. Give it a second, and you can see that it completed in a little over nine seconds.

Next, let's do the same thing with a table instead. Here, I'm calling CREATE TABLE rather than CREATE VIEW. You can see that this CREATE TABLE statement took a little under seven seconds to run, because this time we had to actually save out the data. Now, if we do a SELECT *, you can see that it took about 1.3 seconds, compared to the 9.17 seconds we saw before. It is significantly faster to read from the saved table than to recompute the view.
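To make this concrete, here is a minimal sketch of both approaches in Spark SQL. The source table fireCalls, the column names, and the aggregation are placeholders, since the lesson doesn't show the exact query; the view and table names follow the ones mentioned above:

```sql
-- A temporary view only saves the query text, so this cell returns instantly.
-- (Dropping the TEMPORARY keyword would instead register a persistent view
-- in the Hive Metastore, making it visible across clusters.)
CREATE OR REPLACE TEMPORARY VIEW groupView AS
  SELECT city, COUNT(*) AS call_count
  FROM fireCalls              -- placeholder source table
  GROUP BY city;

-- Reading the view re-executes the underlying query every time (~9 s above).
SELECT * FROM groupView;

-- A table materializes the result: the write pays the computation cost once...
CREATE TABLE groupTable AS
  SELECT city, COUNT(*) AS call_count
  FROM fireCalls
  GROUP BY city;

-- ...and subsequent reads just scan the saved files (~1.3 s above).
SELECT * FROM groupTable;
```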
Now, let's take a look at managed and unmanaged tables. In this walkthrough, we're going to write to both managed and unmanaged tables, and we're going to explore the effect of dropping those tables on both the metadata and the underlying data. (Consolidated sketches of these commands appear after the summary at the end of this lesson.)

A managed table is a table for which Spark manages both the data itself and the metadata. That metadata is usually some schema information, along with the location of the data. With a managed table, if we drop the table, we remove both the metadata for the table and the data itself. Unmanaged tables behave a little differently: Spark manages the metadata, but the data itself sits in a separate location, perhaps S3 or Azure Blob Storage. In that case, Spark is not going to delete the data when we perform a DROP TABLE operation.

Let's take a look at how this works. First, I'm going to use the default database. I'm going to create a table called table_managed, and I'm going to insert a couple of values into it. Now, if I call the DESCRIBE EXTENDED command, we can take a look at what kind of table we've made. Here, you can see the name of the table, the database associated with it, the version of Spark we're using, and, further down, that its type is MANAGED.

If I want to create an unmanaged table instead, I run this cleanup command first; that's just to make sure that, if I've already run this notebook, there aren't any artifacts left behind. Then I issue a similar CREATE TABLE command, but this time I add a LOCATION clause. When we run DESCRIBE EXTENDED against this table and scroll down, we see that its type is now EXTERNAL. I can insert a few more values into this table, query table_unmanaged, and see those values here. If we want to view the underlying files, we can call %fs ls to see what they look like.

Now, let's switch back to the managed table, look at its files, and see what happens when we drop the table. You can see the files here; let's go ahead and drop the table. Next, I'll uncomment this line and try to list those files again. You can see that we get a FileNotFoundException, because this is a managed table, so Spark dropped both the metadata and the underlying files. If we instead run DROP TABLE on our unmanaged table and then call %fs ls, you can see that in this case the files are still there.

To sum up, be mindful of whether you want to use a table or a view. Tables persist data, while views just store the query itself. Use an external (unmanaged) table when you want the data to persist after the cluster has been shut down or the table has been dropped; managed tables are fine if you only need the data for as long as the table itself exists.
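Here is a minimal sketch of the managed-table steps above. The schema, the inserted values, and the parquet format are placeholders, since the lesson doesn't show them; the table name comes from the walkthrough:

```sql
USE default;

-- Managed table: Spark owns both the metadata and the data files,
-- which are written to the metastore's default location on DBFS.
CREATE TABLE IF NOT EXISTS table_managed (id INT, name STRING)
USING parquet;

INSERT INTO table_managed VALUES (1, 'alpha'), (2, 'bravo');

-- The output includes the table name, its database, the Spark version,
-- the data location, and a Type row that reads MANAGED.
DESCRIBE EXTENDED table_managed;

-- Dropping a managed table deletes the metadata AND the data files;
-- listing the old path afterward (%fs ls in a Databricks cell)
-- raises a FileNotFoundException.
DROP TABLE table_managed;
```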
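And the unmanaged-table steps, with the storage path also a placeholder:

```sql
-- Clean up artifacts from any previous run of this notebook
-- (in a separate Databricks cell): %fs rm -r /tmp/table_unmanaged

-- Unmanaged (external) table: the LOCATION clause points Spark at
-- data files it does not own.
CREATE TABLE IF NOT EXISTS table_unmanaged (id INT, name STRING)
USING parquet
LOCATION '/tmp/table_unmanaged';

INSERT INTO table_unmanaged VALUES (3, 'charlie'), (4, 'delta');

-- The Type row now reads EXTERNAL.
DESCRIBE EXTENDED table_unmanaged;

-- Dropping an external table removes only the metastore entry;
-- %fs ls /tmp/table_unmanaged still shows the data files afterward.
DROP TABLE table_unmanaged;
```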