There are a number of ways to connect to remote data. In this lesson, we'll start with connecting to two common data sources: object or blob stores, and JDBC connections, which is the common protocol for connecting to databases. Object stores are a common way of storing large amounts of data. JDBC is an application programming interface, or API, for Java environments, and it allows us to query most databases. Spark works well with JDBC, including a functionality called predicate pushdown, where aspects of SQL queries such as filter operations can be done by the database before the data is transferred into the Spark cluster. This saves a lot of network transfer. By the end of this lesson, you will have read data from an object store and read data in both serial and parallel from a JDBC connection.

The first thing I'm going to do is make sure I'm attached to my cluster, and then I'm going to run my classroom setup script.

First, let's talk about blob stores. DBFS, the Databricks File System, is backed by S3, which is Amazon's Simple Storage Service. It offers us a scalable storage solution that's going to be the backbone of different data lakes or lakehouses, but we'll cover that more in the final module of this course. In this case, we want to mount an S3 bucket. This allows us to read and write from this highly scalable data source, so I'm going to go ahead and run this cell here. This code can really only be executed in Python, but what it allows us to do is define our access keys to AWS and then mount the bucket. This is the actual name of the AWS bucket as it currently sits in S3, and this mount name is how it's going to appear within the Databricks environment. To mount this bucket, I have a little bit of Python code. It's just boilerplate, so you can copy and paste it and fill in your own credentials as needed if you want to connect to your own AWS bucket. I'm going to be using dbutils, which is the Databricks utilities library; it gives me a number of functions that allow me to access my data. Okay, now that this has run, I should have the data in this bucket accessible to me. You can see that I passed in my AWS access key, the secret key, the bucket name, and then the mount name for how it's going to appear within the Databricks environment.

Next, I can go ahead and explore this mount. Here I'm going to use the %fs magic command and add ls to it. This lets me see the names of the different directories within the bucket. You can see that I have three directories: one called data frames, one called fire calls, and one called fire incidents. Fire calls is the main data source that we're using throughout this course. In practice, you want to make sure to keep your AWS credentials secret. The credentials that you saw on the screen are really just generic read-only credentials, so there's not that much harm in sharing them. But if you've got keys that have more permissions attached to them, such as write access, be sure to keep those secret so that you don't leak any of your data and you don't have to worry about data going missing either.
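Here's a rough sketch of what that mount-and-list pattern looks like in a Databricks Python notebook. The access keys, bucket name, and mount name below are placeholders rather than the actual values from the video, so substitute your own:

```python
# A minimal sketch of mounting an S3 bucket in a Databricks Python notebook.
# The keys, bucket name, and mount name are placeholders; fill in your own,
# and keep any keys with write access secret.
from urllib.parse import quote

access_key = "YOUR_AWS_ACCESS_KEY"                   # placeholder
secret_key = quote("YOUR_AWS_SECRET_KEY", safe="")   # URL-encode in case it contains "/"
bucket_name = "my-example-bucket"                    # the bucket as it exists in S3 (placeholder)
mount_name = "my-temp-mount"                         # how it will appear under /mnt in Databricks

# dbutils and display are provided automatically in Databricks notebooks
dbutils.fs.mount(
    source=f"s3a://{access_key}:{secret_key}@{bucket_name}",
    mount_point=f"/mnt/{mount_name}")

# Equivalent to the %fs ls and %fs mounts magic commands used in the video
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
display(dbutils.fs.mounts())

# Unmount when you're done so the mount point isn't left hanging around
dbutils.fs.unmount(f"/mnt/{mount_name}")
```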
Next, we can do %fs mounts. This shows us all of the different mounts we have available within this workspace. You can see that we have /mnt/davis, which is how we normally mount this data, and you can see the source behind it as well. You can also see /mnt/training, which is just one of the mount points that we use for some of our different training programs. And then finally you can see /mnt/davis-temp, which is the mount that we just created. Okay. So now we have access to a theoretically unlimited store of data. I'm going to go ahead and unmount this directory just so it's not hanging around; I can just call dbutils.fs.unmount in order to unmount it.

Okay, the next thing we're going to do is take a look at JDBC reads. JDBC is an application programming interface, or API, that defines database connections in Java environments. Spark is written in a programming language called Scala, and Scala runs on the Java Virtual Machine, or JVM. Since these are Java environments, JDBC is a preferred method for connecting to data whenever possible. A lot of different data technologies, including databases, also run on the Java Virtual Machine, so this makes moving data between these two ecosystems quite easy.

Now, databases have a number of different functionalities. One that's particularly helpful here is something called predicate pushdown. Predicate pushdown allows us to push the predicate part of a SQL query down to the database itself. An example of a predicate is a filter operation, or WHERE clause, in a SQL query. In this case, we might want to filter out a bunch of data before we take it into Spark. With predicate pushdown, only the data that passes the predicate has to be transferred across the wire. This allows for significantly more performant operations, because recall that I/O-bound or network-bound problems are one of the most significant bottlenecks within distributed ecosystems; since we're working with so much data, we want to throw out as much of it as possible as early as possible.

Now, the first thing I'm going to do is run this Scala code here. This makes sure that we're using the best Postgres driver, since we're connecting to a Postgres database. The next thing we need to do is create the JDBC string. We're going to create a table called twitterJDBC, and we're going to use this particular way of accessing the data. We're also going to pass in a number of different options. These options will depend on how your database is set up, but here we're specifying the driver; the URL, which is that server1.databricks.training host on port 5432; the credentials, meaning a username and a password; and the specific table that we're accessing. So we can go ahead and run this cell and then take a look at the first 10 lines of that table. As you can see, this is a bunch of Twitter data: we have user IDs, screen names, locations, and a number of other features in this dataset.

One other aspect of Spark is that we can also add specific subqueries to our JDBC string. Here, under the table option, you can see that we added this particular subquery, which filters out any data that's not within the Philippines. This is an example of predicate pushdown: the subquery is passed on to the database, the database runs the subquery, and only the results come back. Now, if I take a look at the first 10 lines of that dataset, you can see that the locations for these tweets are all within the Philippines.
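To make the serial read and the pushed-down subquery concrete, here's a rough Python sketch of the same idea. The notebook in the video defines these as tables with SQL, but the same options map onto Spark's DataFrame reader; the host, database name, credentials, table, and filter value below are placeholders standing in for what's on screen:

```python
# A minimal sketch of a serial JDBC read with a pushed-down subquery.
# Host, database, credentials, and table names are placeholders.
from pyspark.sql import SparkSession

# In a Databricks notebook, spark already exists; getOrCreate just returns it.
spark = SparkSession.builder.getOrCreate()

# Wrapping a subquery in parentheses and aliasing it lets the database run the
# WHERE clause before any rows are sent to Spark (predicate pushdown).
subquery = "(SELECT * FROM training.twitter WHERE location = 'Philippines') AS t"

twitter_df = (spark.read
    .format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "jdbc:postgresql://server1.databricks.training:5432/training")
    .option("dbtable", subquery)     # or a bare table name for a full read
    .option("user", "readonly")      # placeholder credentials
    .option("password", "readonly")
    .load())

twitter_df.show(10)  # first 10 rows; with the subquery, all located in the Philippines
```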
So far, what we've seen are all serial reads. This means we're creating a single database connection and reading from that one connection. Now let's take a look at how we can parallelize that operation so that we're reading from the database using multiple concurrent threads. This is a lot more efficient. The first thing I want to do is figure out the general distribution of my data. Here I have some IDs that I can work with, so I'll take the minimum and maximum ID. The reason for this is that I want Spark to have some way of determining how to do the parallel read; that is, if you're reading from multiple concurrent threads, which specific subset of the data each of those threads is responsible for.

Okay, now let's take a look at what this parallel query looks like. You'll see that many of the same options, such as the driver, URL, and user credentials, look identical to what you've seen before, but we've also passed in three additional parameters: a lower bound, an upper bound, and the number of partitions. In this case, this indicates to Spark that we want two different partitions, so two different parallel reads between that lower bound and upper bound. If I do a SELECT * FROM twitterParallelJDBC, it should look very similar to what we saw before.

Now I'm going to go ahead and run these two cells, because they take a few seconds to run. The point here is that we're comparing the serial read to the parallel read. If we look back at the number of partitions, we're reading using two partitions. This is relatively low, but bear in mind that the clusters we're using are also quite small. In practice, when you're using a much larger cluster, you can play around with the number of partitions here in order to get a sense of how fast the reads can actually get.

Okay, let's scroll down a little bit. This first cell here is looking at the serial read; you can see that it took about 33.3 seconds. If we wanted to run a number of those same queries, we could change these numbers here in order to make sure that we have a statistically meaningful sense of how long each of these queries takes to run. The first cell, like I said, is our serial read, and the second cell is our parallel read. You can see that they took just about the same amount of time. If we were to play around with these numbers, we could see whether we can further improve the speed of the parallel read. I'll leave that as an exercise for you, so you can get a sense of how parallel reads affect how fast we're able to actually read this data.

Okay, I hope this video has helped introduce you to alternative ways of accessing data, including object storage and connecting to databases via JDBC through serial and parallel reads.
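For reference, since tuning the parallel read is left as an exercise, here is a rough Python sketch of what that read might look like. The host, database, credentials, table name, partition column, and bounds are all placeholders standing in for the values in the notebook, and the video uses SQL table definitions rather than the DataFrame reader shown here:

```python
# A minimal sketch of a parallel JDBC read. Placeholders throughout; in the video,
# the bounds come from the actual min and max of the ID column, and numPartitions is 2.
# Assumes the same spark session as the previous sketch (predefined in Databricks).
twitter_parallel_df = (spark.read
    .format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "jdbc:postgresql://server1.databricks.training:5432/training")
    .option("dbtable", "training.twitter")   # placeholder table name
    .option("user", "readonly")              # placeholder credentials
    .option("password", "readonly")
    .option("partitionColumn", "id")         # column Spark splits the read on (assumed here)
    .option("lowerBound", "1")               # min ID found earlier (placeholder)
    .option("upperBound", "1000000")         # max ID found earlier (placeholder)
    .option("numPartitions", "2")            # two concurrent reads
    .load())

twitter_parallel_df.show(10)
```

Raising numPartitions issues more concurrent queries against the database, which can speed the read up until the database itself or the network becomes the bottleneck.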