In this lesson, we'll explore how to write data. Data writes can be parallelized by changing the number of partitions of our data. This means that our writes will be faster, and when we read from that data, we can read in parallel as well. By the end of this lesson, you will have written to a target database both in serial and in parallel. Let's start by talking a little bit about partitions. Partitions control the reads and writes that we perform. We're also going to control these partitions using the repartition and coalesce hints. This syntax is going to look a little bit different from what you've seen before.

Due to Spark's distributed nature, database writes work a little bit differently from what you've seen in the past. A partition is a portion of our total data set. Recall that when we interact with DataFrames in Spark, our data is actually distributed across the cluster; it's not all sitting together in memory on one machine. A partition is one of those portions of our data set. When we read from or write to a target database, we'll have one connection for each partition. This is the main way we control parallelism with our writes from Spark.

Let's take a look at how that works. First, we can go ahead and import our data set. We're going to dip down into a little bit of Python in order to do these writes. You can use this as boilerplate code and just fill in your own data sets and paths as needed. I can always use the sql command within Python in order to run SQL queries from within a Python environment, which should allow you to easily interoperate between the two frameworks. Here I just create the variable df, and if I call display on it, it prints it out in the same format that you're familiar with.

Now, in order to do a write, I can use the .write method on our DataFrame. I usually want to use the overwrite mode; that way, if I already wrote data, it's going to overwrite whatever's there. There's also an append mode, which is helpful if you want to append to a preexisting data set. Then I can pass in the path. With this path, I'm going to use the username that's been defined for us. This just makes sure that if you're running this command in the same workspace as somebody else, it's going to write to a different location than what they're writing to.

Now, let's take a look at the file that we just wrote. As you can see, my path here includes my username, and then it includes fire calls CSV, which is actually a directory, not just one single file. Here you can see that I have some metadata files as well, these different underscore files. These just keep some high-level information, such as whether my write was successful. You can see that my data is divided into different parts: part zero, part one, all the way up to part seven. This means this data set had eight different partitions that I was writing from. Spark was creating a connection to my target database, in this case S3, when it wrote each of those eight files, so it was creating eight different connections in parallel. Now, if I want to get a sense of how many partitions I have, I can always call .rdd.getNumPartitions() on that same DataFrame. If you see a different value here, it's because of some optimization that Spark does under the hood based on the underlying compute resources that you have.
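Here is a minimal sketch of that write pattern, assuming a Databricks-style notebook where spark, display, and a username variable are already defined; the source path is illustrative rather than the lesson's exact data set.

```python
# Read the source data into a DataFrame (path is a placeholder for your own data set).
df = spark.read.csv("/databricks-datasets/sf-fire/fire-calls.csv", header=True)
display(df)

# "overwrite" replaces any existing output; "append" would add to an existing data set instead.
df.write.mode("overwrite").csv(f"/tmp/{username}/fire-calls.csv")

# One part file is written per partition, each over its own connection to the target.
print(df.rdd.getNumPartitions())
```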
If your cluster looks a little bit different than mine, you could wind up seeing a different number of underlying partitions.

When we control concurrency, we have two different options: one is called coalesce and the other is called repartition. The syntax is right here. The difference between coalesce and repartition has to do with the type of transformation each one is, and also whether or not it evenly distributes data. Coalesce is a narrow transformation: I'm not shuffling my data across the cluster when I use it; rather, I'm just collapsing the partitions I currently have into a smaller number. Repartition is a wide transformation: it will transfer my data across the cluster, and it will also make sure that I have roughly the same amount of data in each of my partitions. That is not the case for coalesce. Coalesce only works one way, to make fewer partitions, and it does so in a way that avoids transferring data across the network. Repartition can be used to make either fewer or more partitions. Regardless of what data source I'm connecting to, whether it's S3 or maybe a database using JDBC, these partitions determine how many of those connections I'm making at a given time.

Let's take a look at how this works. You can see here that I pass in the coalesce hint. This looks just like a SQL comment, but in reality Spark is going to use it to change the number of partitions I'm working with. Now, if we query that table and call getNumPartitions, we can see that I have one single partition. If I want to repartition instead and have eight different partitions, I can do that here, and then call .rdd.getNumPartitions() to confirm that I do have those eight partitions. If I wanted to change this number, maybe to 12 partitions instead, I can easily change it and then verify that I do in fact have that many partitions.

Now, let's go ahead and save the results. Here I'm just going to call .write.mode("overwrite") on that query and pass in this directory. If I click these drop-downs here, you should see one task for each partition that you have. Here I had 12 different tasks because I have 12 different partitions that I'm writing from. If you want to confirm that it worked, you can go ahead and call dbutils.fs.ls and take a look at the total number of parts. If I scroll down here, you can see that this number goes all the way up to 11, and bear in mind that most programming languages are zero-indexed rather than one-indexed, so this gives me a total of 12 different parts, one per partition.

Just to sum up, you can alter the number of concurrent writes that you're making using the coalesce and repartition hints. It's worth noting that coalesce means something a little bit different in ANSI SQL: there, coalesce returns the first non-null value, so the term might be a little bit confusing in this case. But when we use it like this, what we're actually determining is the number of distinct partitions of our data. When Spark does these writes, the output might look like a normal CSV or Parquet file, but in reality it's a directory that contains a number of different parts of our data.
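Putting the walkthrough together, here is a sketch of the hint-based workflow, assuming a table named fire_calls is registered in the metastore and the same username variable from earlier; the table name and output paths are illustrative.

```python
# The COALESCE hint collapses the result down to a single partition (no shuffle).
one_part = spark.sql("SELECT /*+ COALESCE(1) */ * FROM fire_calls")
print(one_part.rdd.getNumPartitions())      # 1

# The REPARTITION hint shuffles the data into 12 evenly sized partitions.
twelve_parts = spark.sql("SELECT /*+ REPARTITION(12) */ * FROM fire_calls")
print(twelve_parts.rdd.getNumPartitions())  # 12

# Writing launches one task, and produces one part file, per partition.
twelve_parts.write.mode("overwrite").csv(f"/tmp/{username}/fire-calls-12.csv")

# Confirm the output: part-00000 through part-00011, plus metadata files like _SUCCESS.
display(dbutils.fs.ls(f"/tmp/{username}/fire-calls-12.csv"))
```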