So far this week, you've been looking at text: how to tokenize it and then turn sentences into sequences using the tools available in TensorFlow. You did that with some very simple hard-coded sentences. But of course, when it comes to real-world problems, you'll be working with a lot more data than just those. So in this lesson, we'll take a look at some public datasets and how you can process them to get them ready to train a neural network.

We'll start with one published by Rishabh Misra, with details on Kaggle at this link. It's a really fun CC0 public-domain dataset, all about sarcasm detection. Really? Yeah, really. This dataset is very straightforward and simple, not to mention very easy to work with. It has three elements in it. The first, is_sarcastic, is our label: it's a one if the record is considered sarcastic, and zero otherwise. The second is the headline, which is just plain text, and the third is the link to the article that the headline describes. Parsing the contents of HTML, stripping out scripts and styles, and so on, is a little bit beyond the scope of this course, so we're just going to focus on the headlines.

If you download the data from that Kaggle site, you'll see something like this: a set of list entries with name-value pairs, where the names are article_link, headline, and is_sarcastic, and the values are as shown. To make it much easier to load this data into Python, I made a little tweak so the data looks like this instead, which you can feel free to do yourself, or you can download my amended dataset from the link in the Colab for this part of the course.

Once you have the data in that form, it's really easy to load into Python. Let's take a look at the code. First you need to import json. This lets you load data in JSON format and automatically create a Python data structure from it. To do that, you simply open the file and pass it to json.load, and you'll get back a list of records, each containing the three fields: the headline, the URL, and the is_sarcastic label. Because I want the sentences as a list of their own to pass to the tokenizer, I create a list of sentences, and later, since I'll want the labels for training a neural network, I create a list of them too. While I'm at it, I may as well do the URLs, even though I'm not going to use them here; you might want to. Now I can iterate through the list with a for item in datastore loop. For each item, I copy the headline into my sentences, the is_sarcastic value into my labels, and the article_link into my URLs. You can see both the data layouts and the loading code sketched in the two snippets below.

Now I have something I can work with in the tokenizer, so let's look at that next.
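First, a minimal sketch of the format tweak described above. The field names (article_link, headline, is_sarcastic) come from the dataset, but the record values here are invented for illustration, and the exact layout of the raw download is my reading of the lesson: each record appears as its own standalone JSON object, whereas the amended file is one JSON array.

```python
import json

# Hypothetical one-record samples to show the two layouts. The field names
# match the dataset; the values are made up for illustration.

# Raw download: each record is a standalone JSON object, so the file as a
# whole isn't a single parseable JSON value and would need line-by-line parsing:
raw_line = '{"article_link": "https://example.com/story", "headline": "an example headline", "is_sarcastic": 0}'
record = json.loads(raw_line)      # parse one record at a time
print(record["headline"])

# Amended version: the same objects separated by commas and wrapped in
# [ ... ], so one parse returns the complete list of records:
amended = '[{"article_link": "https://example.com/story", "headline": "an example headline", "is_sarcastic": 0}]'
records = json.loads(amended)
print(records[0]["is_sarcastic"])  # -> 0
```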
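And here's the loading code walked through above, gathered in one place. The filename sarcasm.json is an assumption; point it at wherever you saved the amended dataset.

```python
import json

# Open the amended file and let json.load build a Python structure from it:
# a list with one dictionary per headline record. The filename is an
# assumption -- use your own path.
with open("sarcasm.json", "r") as f:
    datastore = json.load(f)

# Pull the fields apart into parallel lists: headlines for the tokenizer,
# labels for training, and URLs kept around in case you want them later.
sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])
    urls.append(item["article_link"])

print(len(sentences))            # number of headlines loaded
print(sentences[0], labels[0])   # first headline and its label
```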