[MUSIC] Astronomers and other scientists often build catalogs. We might have lists of thousands to millions of objects, and each object can have hundreds of properties. As datasets get large or complex, it's important to organize and access your data in an efficient way. In this module, we're going to look at one common way of organizing your data: using relational databases. We'll start with a catalog of exoplanetary systems, which are stars beyond our own solar system that are known to have planets orbiting them. By the end of the module, you'll be able to use SQL to answer fundamental scientific questions, like: how many Earth-sized exoplanets lie in the habitable zone of their host star? In other words, how many exoplanets have the potential to harbor life? The discovery of planets around stars beyond our solar system is one of the scientific breakthroughs of our lifetime. Research in this area is driven by the stream of discoveries pouring in. Many of these are being made by the Kepler space telescope using the transit method, both of which we'll hear more about in the next lecture. 2016 set a record for the biggest haul of exoplanets, when the Kepler team applied statistical validation to verify over a thousand new planets. This was only possible because they had so much exoplanetary data to work with. For each one of the thousands of exoplanets that are being discovered, there are many fundamental properties we want to keep track of. For example, the orbital period of the planet, what type of star it orbits, or what we think its surface temperature is. The surface temperature of a planet is important for assessing its habitability. We think of a planet as being habitable if it can possess liquid water on its surface. If a planet is extremely close to its star, then the planet's surface will be too hot and any water it has will boil or evaporate. A good example is Kepler-10b. 
Its orbit takes it so close to its star that its average temperature is nearly 2000 Kelvin. On the other hand, a planet orbiting very far from a star will be too cold, and will be covered in ice instead, much like the dwarf planets in our outer solar system. Each star has a so-called Goldilocks zone, more commonly called the habitable zone, where the temperature is just right for liquid water. One of the fundamental aspects of data-driven research is being able to ask questions about our data. For example, we might want to ask: how many exoplanets are smaller than Earth? Traditionally, scientists have answered these questions by writing short programs, maybe in Shell, Python, or the language of their choice, that read in the data from a text file and use a series of if statements to filter the data. This works when you know what questions you want to ask, and when you can write a custom script for each question. But what if we want other researchers to be able to investigate our data? We don't know in advance what calculations they might want to make, or how they might want to filter those data. We might also care about security, or data integrity, or being able to update our data regularly. The common solution to these challenges is to use a database management system, often just called a database. Databases are used nearly everywhere for handling large datasets: in banking systems, university enrollment records, or health information systems. Increasingly, databases are also being used in science. The most famous early example was the Sloan Digital Sky Survey, which was pioneered by Microsoft database guru Jim Gray. The SDSS was one of the first large projects to release all of its data online, with a full schema describing the datasets, and customizable query boxes to allow users to run their own queries on the server. It has enabled millions of user queries, and had a huge impact on the way future projects manage their data. 
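The traditional script-based approach mentioned above, reading a text file and filtering it with if statements, might look something like the following. This is only a minimal sketch: the column names `name` and `radius_earth`, the function `count_smaller_than_earth`, and the rows themselves are hypothetical placeholders, not values from a real catalog.

```python
import csv
import io

# Hypothetical flat-file catalog. The names and radii below are
# made-up placeholders for illustration, not real measurements.
CATALOGUE = """name,radius_earth
planet-a,0.8
planet-b,1.5
planet-c,0.6
planet-d,11.2
"""


def count_smaller_than_earth(text_file):
    """Count the rows whose radius is below one Earth radius."""
    count = 0
    for row in csv.DictReader(text_file):
        # One hand-written if statement per question we want to ask.
        if float(row["radius_earth"]) < 1.0:
            count += 1
    return count


# io.StringIO stands in for an open text file on disk.
n_small = count_smaller_than_earth(io.StringIO(CATALOGUE))
print(n_small)
```

Note that the question is baked into the script: asking something different, such as filtering on orbital period, means writing or editing another function.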
Perhaps the most powerful feature of databases from a scientist's perspective is that you only need to declare your query. You don't need to worry about the implementation; the database system takes care of all of that for you. It means you can do your science without issues like scalability getting in your way. One of the key skills for working with large datasets is knowing how to choose the right solution for your particular problem. In this module, we will discuss the advantages and disadvantages of using flat files versus using databases. We'll also explain how databases organize data, and we'll show you how to query databases using SQL. You'll learn how to use SQL to explore a database of exoplanets using standard queries. This will allow you to determine how many of the exoplanets we know about so far have the right parameters to support life. [MUSIC]
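To illustrate what declaring a query means in practice, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `Planet`, the columns `name` and `radius_earth`, and the rows are all hypothetical placeholders chosen for illustration; they are not the schema or data used in this module.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and made-up rows, for illustration only.
conn.execute("CREATE TABLE Planet (name TEXT, radius_earth REAL)")
conn.executemany(
    "INSERT INTO Planet VALUES (?, ?)",
    [("planet-a", 0.8), ("planet-b", 1.5), ("planet-c", 0.6), ("planet-d", 11.2)],
)

# The query states what we want (planets smaller than Earth),
# not how to loop over the rows to find them.
(n_small,) = conn.execute(
    "SELECT COUNT(*) FROM Planet WHERE radius_earth < 1.0"
).fetchone()
print(n_small)

conn.close()
```

Asking a different question only requires changing the text of the SELECT statement; how the rows are scanned is left to the database system.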