You will see a number of different file formats and compression options. You're likely most used to seeing CSV files, which are row-based files of comma-delimited data. The ideal file format in distributed systems is, more often than not, Delta. Instead of being row-based, Delta is column-based: it's built on the back of the powerful Apache Parquet file format. In this lesson, you'll examine various file formats as well as different compression options. By the end of this lesson, you'll be able to determine the best file format to use under a variety of different conditions.

The first thing I'll do is run my classroom setup script, and I'll attach to my cluster as well. Next, let's take a look at a comma-delimited file sitting in our blob storage account, mounted to the Databricks file system. We can use %fs ls to confirm that this file is here. You can see it's called dbfs:/mnt/davis/fire-calls/fire-calls-colon.txt. If we want to get a sense for the overall size, you can see that here; this is the size in bytes, so it looks like this is a 1.8 gigabyte file. To process this file in many other systems, we would need at least 1.8 gigabytes of RAM available, plus additional resources to do any data manipulation, run the operating system, and so on. Spark will, by default, process this data in memory, but it has the ability to spill out onto disk where needed. It can even drop data entirely if it runs out of disk and go back to the original source to pull that data back in when needed.

Next, let's take a look at the first 1,000 bytes of this data. It looks like I lied: it's not actually comma delimited. You can see here that the file is actually delimited with colons. There are many different reasons why you might want to use alternative delimiters. Using a pipe, the vertical line character, is also common, since it appears only rarely in most datasets.

Next, let's go ahead and create a temporary view; we'll call this fire calls CSV. We're going to pass in a path, set the header option to true, and set the separator, or delimiter, to a colon. Next, let's take a look at the data types using DESCRIBE. It looks like Spark parsed all of the columns as strings. This can be problematic for a number of reasons. First, storing an integer as a string is a much less efficient representation of that data, meaning it takes up more space in memory. It also doesn't allow us to do integer operations like addition. The same is true of timestamp and date types as well, so we always want to use the correct type wherever possible.

Let's take a look at how to resolve this. We're going to rerun the same command, but this time we're going to pass in the inferSchema option. While that's running, let's take a look at the differences between these commands. The first command ran in just a few seconds, 0.76 seconds to be exact. You can see that this second command is still running. It's taking longer because Spark has to actually infer the schema: it has to scan the data in order to figure out which columns are integers, which are strings, and so on. This is still running, so it's going to take a little while longer, but once it completes, we can get a sense for how Spark actually interpreted the different data types we have here.
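While that second read runs, here is roughly what the two reads just described look like in PySpark. This is a sketch rather than the notebook's exact code: the notebook itself creates temporary views with SQL, but the same header, separator, and inferSchema options apply, and `spark` is available globally in a Databricks notebook.

```python
# Sketch of the two CSV reads described above, using the path shown earlier.
path = "dbfs:/mnt/davis/fire-calls/fire-calls-colon.txt"

# First read: no schema inference -- fast, but every column comes back as a string.
df_strings = (spark.read
              .option("header", True)
              .option("sep", ":")
              .csv(path))
df_strings.printSchema()

# Second read: inferSchema triggers a full scan of the data to guess the types,
# which is why it takes far longer on a ~1.8 GB file.
df_typed = (spark.read
            .option("header", True)
            .option("sep", ":")
            .option("inferSchema", True)
            .csv(path))
df_typed.printSchema()
```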
Now that that's complete, you can see that it took 51 seconds to run rather than under one second. If we call DESCRIBE on that table, you can see that integers were interpreted as integers and strings as strings; however, the date column is still a string, and if we look at the timestamps, those are still strings as well. I'm a little embarrassed to admit how much time in my career I've spent trying to parse out different timestamps. There are all sorts of random formats that people use when they write timestamps. Yes, there are industry conventions for the precise format of dates and times. The honest truth, however, is that there's little adherence to these conventions, and so oftentimes we have to write our own timestamp parsers to make sure we actually parse out those specific fields.

I'm going to go ahead and click "Run All" on the rest of this notebook, because some of these commands take a little longer to run than others. First, we're going to compare different compression formats, Gzip and Bzip2 in particular. This is our original dataset; recall that it's about 1.8 gigabytes. If we take a look at the Gzipped dataset, all of our data is within this one part, and it's a lot smaller, about 260 megabytes. Finally, if we take a look at the Bzipped dataset, it's about 193 megabytes.

Let's take a look at how they perform when we actually read these different files. When we read from the Gzip file, you can see that it took about 36.16 seconds. It's also worth mentioning that if we take a look at the underlying partitions of this Gzipped dataset, we got just one single partition. It took up less storage space, but there's still a lot of computation needed to read this file. Now let's take a look at Bzip2. Creating this view took almost two minutes, so quite a bit longer. However, if we take a look at the number of underlying partitions here, you can see that we have eight total partitions. Gzip and Bzip2 are two alternative compression formats, but Gzip is not splittable, whereas Bzip2 is a splittable format, which allows us to read our data back in using eight different partitions.

Next, we're going to compare this with Parquet. Here, I'm going to read from a Parquet table with a single partition. You can see that it didn't take long to run, about 1.22 seconds. When I call DESCRIBE on this dataset, you can see that it was able to infer all of our data types. This is because Parquet stores metadata associated with the data. Parquet can also be a lot faster; let's confirm all of this with a timing comparison. Here I'm going to compare reads from a Parquet file to reads from a CSV file and a Gzipped file. When we compare these options, you can see that our Parquet read took about 8 seconds, compared to 36.67 seconds for our CSV file and 23.5 seconds for our Gzipped file. Our Bzipped file is still running; now that it's complete, you can see that it took almost a minute and a half to run.
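Here is a rough sketch of the kind of partition-count and timing comparison just described, in case you want to reproduce it yourself. The gzip, bzip2, and Parquet paths below are placeholders rather than the course's actual files; substitute whatever compressed copies you have on hand.

```python
import time

# Spark reads .gz and .bz2 CSV files directly; the compression codec is
# inferred from the file extension.
gzip_df = (spark.read
           .option("header", True)
           .option("sep", ":")
           .csv("/path/to/fire-calls.txt.gz"))
bzip_df = (spark.read
           .option("header", True)
           .option("sep", ":")
           .csv("/path/to/fire-calls.txt.bz2"))
parquet_df = spark.read.parquet("/path/to/fire-calls.parquet")

# Gzip is not splittable, so the whole file lands in a single partition;
# bzip2 is splittable, so Spark can fan the read out across several partitions.
print(gzip_df.rdd.getNumPartitions())   # typically 1
print(bzip_df.rdd.getNumPartitions())   # several, e.g. 8 in this lesson

def timed_count(df, label):
    """Force a full read with count() and report the wall-clock time."""
    start = time.time()
    rows = df.count()
    print(f"{label}: {rows} rows in {time.time() - start:.1f}s")

timed_count(parquet_df, "parquet")
timed_count(gzip_df, "gzip csv")
timed_count(bzip_df, "bzip2 csv")
```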
So what are Parquet and Delta? Parquet is a columnar storage format, which means that it's column-based rather than row-based. Here you see a standard CSV file: this is the row format you would expect. Below it, you can see a column-based format instead. There are some distinct benefits to using a column-based format rather than a row-based format when we want to optimize reads. One of those benefits is compression. Let's say, for instance, that I have an ID column with the same IDs repeated over and over again. Instead of storing the actual ID every time, we can store the ID once, along with a pointer denoting that the ID occurs multiple times across our data. This is one of the ways that Parquet compresses data. Another reason we want to use Parquet is that it's highly splittable: Spark can read and write Parquet in parallel. When we save to Parquet, we can see that we're saving to S3 using a number of different small files.

What is Delta? Delta is built on top of Parquet. Delta layers ACID transactions on top of open-source Parquet, which means we can do database-like operations, like adding and deleting data, on top of that file format. There's a lot more to Delta than just that, and we'll explore those differences in the fourth module of this course. For now, let's write our Parquet data to a Delta table (there's a sketch of this write at the end of this lesson). We're also going to repartition this data so that when we write it out, we'll have eight separate files. This should take just a moment to run.

Now that that's complete, we can go ahead and call DESCRIBE EXTENDED against that Delta table. Here you can see that it preserved the data types associated with this dataset. When we call dbutils.fs.ls to take a look at the underlying files, you can see that it saved out a number of different partitions of this dataset: part 0, part 1, all the way up to part 7. When we called repartition(8), it repartitioned our data into these eight subsets and wrote each of them out concurrently. You can also see that these files end in .snappy.parquet. You know this is a Delta table because of the Delta log directory associated with it; this is how Delta preserves the metadata associated with the table. Snappy is a compression format that generally works really well in distributed ecosystems, and the .parquet extension indicates that Delta is built on the back of Parquet: it's leveraging Parquet files under the hood.

In the rest of this notebook, you'll find a comparison of a number of different ways you can read your data. I'll leave this for you to explore, and you can also use it as a reference: if you ever run into a file type you haven't seen before, you can come back to it to make sure you can read that data.
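To close out, here is a minimal sketch of the Delta write walked through above. It assumes a Delta Lake-enabled environment such as Databricks (where `display` and `dbutils` are available), uses a placeholder output path, and reuses the `parquet_df` DataFrame from the earlier sketch; it is not the notebook's exact code.

```python
# Placeholder location for the Delta table.
delta_path = "/tmp/fire-calls-delta"

(parquet_df
 .repartition(8)              # eight output files, written out in parallel
 .write
 .format("delta")
 .mode("overwrite")
 .save(delta_path))

# The _delta_log directory holds the table's transaction log and metadata,
# while the data itself is stored as snappy-compressed Parquet part files.
display(dbutils.fs.ls(delta_path))
```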