Before visualizing data, we need to make sure that the data we acquired is ready to tell the story. In this video, we will talk about what data is and the general workflow for preparing data for visualization.

Often, we use the terms data and information interchangeably, but they actually mean different things. Data is the raw material for any journalistic work. It usually takes the form of numbers and is rarely useful on its own. Information, by contrast, is processed data that carries a logical meaning. Information is value added to data: it is usually data that has been summarized, organized, and analyzed. Data is the input language for a computer, and information is the output language for a human. Just as making whiskey involves steps such as preparation, fermentation, and distillation, turning data into information involves multiple steps: cleaning data, formatting data, deriving data, graphing data, and adding context to the data.

Data are actually attributes. For example, when we are talking about weather, there are data points that indicate temperature, humidity, and pressure. Similarly, when we are talking about real estate, we're usually referring to data points about housing price, location, square footage, or mortgage rate.

There are three types of data: categorical, ordinal, and continuous. If values are simply labels and cannot be ordered or ranked with any meaning, they are categorical data. When we simply classify temperature into two categories, above freezing and below freezing, the data is categorical. Ordinal data can be ordered or ranked, but it does not measure or count how far apart the ranks are. When we describe temperature as hot, warm, or cold, this is ordinal data. Interval data and ratio data are continuous measures. With interval data, the differences between values have a meaning, but the ratios between values have no intrinsic meaning.
This happens any time the zero point on the measurement scale does not describe the absence of some quantity. A good example of this type of data is temperature in degrees Fahrenheit or Celsius. On one day, the temperature in a particular city may be five degrees below zero, and the next day it rises to five degrees above zero. The rise in temperature of ten degrees is well understood, but the ratio between the two readings does not describe a physical change. Zero degrees in Fahrenheit or Celsius is only a location on a scale and does not describe the absence of heat energy. Absolute zero on the Kelvin scale, however, does describe the absence of heat energy, so temperatures in Kelvin are ratio data.

Not so many years ago, data was hard to obtain. Often, journalists would have to compile their own datasets from paper records. The Internet has changed the game, although those methods may still be needed on occasion. We can now search government databases and download the results. The main problem today is usually not finding relevant data, but finding a database that can be trusted and getting it into the right format for analysis and visualization.

As a journalist, it is worth familiarizing yourself with the main federal government agencies that have responsibility for the beats you are interested in, and with the datasets they maintain. For example, Data.gov is the federal government's main open data portal, where journalists can find datasets from many agencies. The US Census Bureau provides journalists with data about population, demographics, trade, and manufacturing. The Centers for Disease Control and Prevention and the National Center for Health Statistics provide health datasets, including data on causes of death. At the local level, many counties and cities maintain their own data portals. You can usually use American FactFinder to find data for a city or county that you care about.
For international data, if you need to make comparisons between nations, the World Bank probably has what you need. Its World Development Indicators catalog contains data for more than 7,000 different measures. Other useful sources of data for international comparisons are Gapminder and the United Nations Statistics Division.

For datasets that are not available online, you can write a request to the information officer at a local, state, or federal agency. Many countries now have laws in place allowing people to request access to information. In the United States, the Freedom of Information Act, known as FOIA for short, generally gives any person the right to request access to federal agency records or information.

Or we can collect the data by scraping. Scraping is the process of writing a computer program to extract data from a website or a PDF and put it into a usable database. Scraping can be an incredibly powerful technique for collecting data that you want but that is not originally in a usable format. To scrape data, you need to understand some computer coding.

In an ideal world, every dataset we find would have been lovingly curated, allowing us to start analyzing without worrying about its accuracy. In reality, however, most data doesn't arrive organized and error free. Most data is messy. Before beginning any kind of analysis, the data needs to be cleaned. Data cleaning is a process data journalists use to detect, correct, or delete inaccurate or incomplete data with the aim of improving data quality. Examples of errors commonly found in data are misspellings, such as trying to type "coverage" but mistakenly typing "covfefe," and stray non-text characters or symbols, such as exclamation marks, pound signs, or HTML tags. Also, look for inconsistencies: sometimes there are multiple ways of referring to the same item. For example, "Burma" and "Myanmar" refer to the same country. Some data offers obvious checks.
If you see negative values in age data or ZIP codes with fewer than five digits, then you know something must be wrong. Also, scan for possible missing data and duplicates. Missing data is a huge topic, and you always need to think carefully about how and when you choose to remove observations from a data analysis. Another thing to pay attention to is people's names. Look for variations in spelling or formatting, because the same person may appear multiple times.

After cleaning up the data, we need to get the data structure into the right format. This ensures consistency, and doing so will help our later analysis run more smoothly. For example, make sure the data in the spreadsheet is in the proper layout: only one row for the header and one observation in each row. Sometimes it may be necessary to split data into multiple columns. For example, if we want to extract the ZIP code separately from the full address, then we have to create an extra column for the ZIP code. Moreover, format data points in the same letter case and stick with one unit. For example, when you are talking about height, you should stick with either feet and inches or centimeters. For money values, consider the effect of inflation and convert the values to a common time basis. Similarly, turn dates and times into consistent formats.

Data analysis involves statistics and math to find insights in the data. The good news is that in most cases the newsroom math is easy: add, subtract, multiply, divide, and then you can conquer the world with your stories! Basic descriptive statistics include finding the maximum and minimum values, finding averages such as the mean, median, and mode, finding percentages and percent change, calculating rates such as per capita or per case, and interpreting the distribution of the data. For more advanced statistical inference, there are concepts such as correlation and regression. We can also make predictions based on models or a time series analysis.
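
As a rough illustration of the obvious checks and the newsroom math described above, here is a minimal Python sketch. The records, the field names, and the helper names (`is_valid`, `clean`, `percent_change`, `rate_per`) are all made up for this example, and it drops bad rows without review only to keep things short. In real work you would look at each rejected row before deciding what to do with it.

```python
from statistics import mean, median, mode

# Hypothetical records; names, ages, and ZIP codes are invented for illustration.
records = [
    {"name": "Ann Lee", "age": 34, "zip": "10001"},
    {"name": "ann lee", "age": 34, "zip": "10001"},  # duplicate in a different letter case
    {"name": "Bo Chan", "age": -5, "zip": "94105"},  # negative age: an obvious error
    {"name": "Cy Diaz", "age": 51, "zip": "2139"},   # ZIP code with fewer than five digits
    {"name": "Di Park", "age": 34, "zip": "60601"},
    {"name": "Ed Moss", "age": 45, "zip": "73301"},
]

def is_valid(record):
    """Apply two obvious checks: a non-negative age and a five-digit ZIP code."""
    zip_code = record["zip"]
    return record["age"] >= 0 and len(zip_code) == 5 and zip_code.isdigit()

def clean(rows):
    """Drop rows that fail the checks, then drop duplicates after lowercasing names."""
    seen = set()
    kept = []
    for row in rows:
        if not is_valid(row):
            continue
        key = (row["name"].lower(), row["age"], row["zip"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

def percent_change(old, new):
    """Percent change from an old value to a new value."""
    return (new - old) / old * 100

def rate_per(count, population, per=100_000):
    """Cases per `per` people, so places of different sizes can be compared."""
    return count * per / population

cleaned = clean(records)
ages = [row["age"] for row in cleaned]

print(len(cleaned), "rows kept out of", len(records))
print("min:", min(ages), "max:", max(ages))
print("mean:", round(mean(ages), 1), "median:", median(ages), "mode:", mode(ages))
print("percent change from 200 to 250:", percent_change(200, 250))
print("50 cases among 1,000,000 people, per 100,000:", rate_per(50, 1_000_000))
```

The rate helper is the same calculation you would use for a per capita comparison between two countries: divide the count by the population, then scale to a common base.
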
Also, in the emerging field of computational journalism, more investigative stories are employing algorithmic methods such as machine learning to find insights in data at scale.

Analyzing a dataset can prompt more questions that require other datasets to answer. For example, when you compare the number of suicides in two countries, you need their population numbers to calculate the number of cases per capita for a fair comparison. To investigate further the reasons behind a rise or fall in the suicide rate, you might want to look at statistics about the economy or mental illness in those countries. Moreover, examining different datasets helps you investigate the relationships between phenomena, but be extra careful in drawing any cause-and-effect conclusions. Correlation between two variables does not mean causation. For example, the homeless population and the crime rate in a city may be correlated, but this does not necessarily mean homeless people commit crimes. There could be other factors, like unemployment or drug abuse. Make this clear to your readers when you describe the relationship between variables.

The whole purpose of deriving data is to look for trends, contrasts, and outliers. For trends, one way to tell a data story is to examine how the variables have changed over time or across groups. For example, if data shows that Americans are dissatisfied with the state of the union, you can write a story to find out why. This type of story is typically easy to visualize and easy to understand conceptually. For contrasts, comparing one result or dataset to another can be an effective way to tell a data story that surprises your readers. It often leaves readers plenty of room to apply their own interpretations to the data, and it can be effective when offering a counterintuitive result. For example, conventional wisdom in Hollywood is that male stars are a bigger box office draw, which is often the reason they are given higher salaries.
But that may be a miscalculation, according to a new analysis showing that films with female leads earn more. For outliers, look for data points that are far from the average, but double check to make sure there are no errors. Sometimes outliers in your data represent errors and need to be rejected, but some interesting outliers can become the basis for a new data story. For example, in general, people in richer nations are less likely than those in poorer nations to say religion plays a very important role in their lives. But Americans are more likely than their counterparts in other economically advanced nations to deem religion very important. More than half of Americans say that religion is very important in their lives, a much higher share than in Canada, Australia, or Germany. Therefore, we can write a story to find out why.

Data and journalism have become deeply intertwined, and data has grown in prominence. Finding contrasts, trends, and outliers in data is the foundation for a breaking story. We have now gone through a general workflow for acquiring, cleaning, formatting, and deriving data. We'll go on to learn about graphing data, as well as adding context to a visualization, later in the class.
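
To make the advice about outliers a little more concrete, here is a minimal Python sketch of one common rule of thumb: flag any value that lies more than a chosen number of standard deviations from the mean. This is not a method taken from the lecture. The function name `flag_outliers`, the threshold of 2, and the numbers are assumptions made up for illustration. A flagged value is only a prompt to go back and check for errors before deciding whether it is a mistake or a story.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=2.0):
    """Return labels whose value is more than `threshold` standard deviations from the mean."""
    numbers = list(values.values())
    center = mean(numbers)
    spread = pstdev(numbers)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    return [
        label
        for label, value in values.items()
        if abs(value - center) / spread > threshold
    ]

# Hypothetical shares (in percent) for seven made-up places.
shares = {"A": 10, "B": 12, "C": 11, "D": 13, "E": 12, "F": 11, "G": 40}

flagged = flag_outliers(shares)
print(flagged)  # ['G']
```

With very small datasets this rule can miss real outliers, because a single extreme value also inflates the standard deviation, so treat the threshold as something to adjust rather than a fixed rule.
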