In this video, we will talk about the challenges of ingesting and processing big data, and remind ourselves why we need new paradigms and programming models for big data. After this video, you will be able to summarize the requirements of programming models for big data and why you should care about them. You will also be able to explain how the challenges of big data, related to its variety, volume, and velocity, affect its processing.

Before we start, let's imagine an online gaming use case, just like the one we have for Catch the Pink Flamingo. You just introduced the game, and users started signing up. You start with a traditional relational database, keeping track of user sessions and other events. Your game server receives an event notification every time a user opens a session or scores a point in the game. Initially, everything is great: your game is working, and the database is able to handle the event streams coming into the server. However, suddenly your game becomes highly popular, a good problem to have. The database management system in your game server can no longer handle the load. You start getting errors that the events can't be inserted into the database at the speed they are coming in. You decide to add a buffer, or a queue, to process the incoming events in chunks, perhaps also organizing them into windows of time or game sessions as they are processed. However, in time, as demand goes up, you will need more processing nodes and even more database servers that can handle the load. This is a typical scenario that most websites face when confronted with big data issues related to the volume and velocity of information.

As this scenario demonstrates, solving the problem one step at a time might be possible initially. But the more reactive fixes the game developers add, the less robust and the more complicated to evolve the system becomes. While the developers initially started with just an application and a database to manage, now they have to deal with a number of infrastructure-management issues simply to keep up with the load on the system. Similarly, the database servers can be affected and corrupted, and their replication and fault tolerance need to be handled separately. Let's go through these issues. Let's say one of the processing nodes goes down. The system needs to manage and restart the processing, and there will potentially be some data loss in the meantime. The system would need to check every processing node before it can discard data. Each node and each database has to be replicated separately. Batch computations that need data from multiple data servers have to access and keep track of that data separately, which might end up being quite slow and costly.

The big data processing techniques we will address in this course will help you reduce the management of these complexities, including failing servers and broken compute nodes, while helping with the scalability of the management and processing infrastructure. We will talk about using big data systems like Spark to achieve data-parallel processing scalability for data applications on commodity clusters. We will use the Spark runtime libraries and programming models to demonstrate how big data systems can be used for application management.
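As a small taste of what that looks like, here is a minimal, illustrative PySpark sketch of data-parallel processing of game events. It is only a sketch: the input path and the comma-separated "user_id,event_type,points" layout are assumptions made up for this example, not part of the actual game.

    # A minimal, illustrative PySpark sketch of data-parallel processing of game events.
    # The input path and the "user_id,event_type,points" layout are assumptions made up
    # for this example; they are not part of the actual game.
    from pyspark import SparkContext

    sc = SparkContext(appName="GameEventCounts")

    events = sc.textFile("hdfs:///game/events/*.log")      # hypothetical event log files

    counts = (events
              .map(lambda line: line.split(","))            # parse each event line
              .filter(lambda fields: fields[1] == "score")  # keep only scoring events
              .map(lambda fields: (fields[0], 1))           # key each event by user_id
              .reduceByKey(lambda a, b: a + b))             # count events per user, in parallel

    for user, n in counts.take(10):                         # look at a small sample of results
        print(user, n)

    sc.stop()

The same few lines run unchanged whether Spark is running on a single laptop or on a commodity cluster with many nodes, which is exactly the kind of scalability we keep asking for in this video.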
To summarize what our imaginary game application needs from a big data system: first of all, there needs to be a way to use common big data operations to manage and split large volumes of event data streaming in. This means the partitioning and placement of data in and out of computer memory, along with a model to synchronize the datasets later on. Access to the data should be fast. The game developers need to be able to deploy many event-processing jobs to distributed processing nodes at once, and these are potentially the data nodes we move the computations to. The system should also make the computation reliable and provide fault tolerance from failures. This means enabling programmable replication and recovery of event data when needed. It should be easily scalable to a distributed set of nodes where the data gets produced. It should also enable scaling out. Scaling out is simply adding new resources, like distributed computers, to process more data, or to process it faster, at scale without losing performance.

There are many data types in an online game. Although we have talked about click events and scores over time, it is easy to imagine that there are also graphs of players, text-based chats, and images that need to be processed. Our big data system should enable processing of such a mixed variety of data, and potentially optimize the handling of each type separately, as well as together when needed.

In addition, our system should be able to handle both streaming and batch processing, enabling all the processing to be debuggable and extensible with minimal effort. That means being able to handle operations on small chunks of data streams with minimal delay, which is what we call low latency, while at the same time handling the processing of potentially all available data in batch form, all through the same system architecture. Latency is a word we use and hear a lot in big data processing. Here it refers to how fast the data is being processed, or simply the difference between the production, or event, time and the processing time of a data entry. In other words, latency is a quantification of the delay in the processing of streaming data in the system. While some big data systems are good at this, Hadoop, for instance, is not a great choice for operations that require low latency.

Let's finish by remembering the real reasons for all these requirements of big data processing, the ones that make it different from processing in a traditional data architecture. Big data has varying volume and velocity, requiring dynamic and scalable batch and stream processing. Big data has variety, requiring the management of data in many different data systems and the integration of it all at scale.
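To make the latency definition above concrete, here is a tiny, purely illustrative Python sketch. The events and their timestamps are invented for illustration; the only point is that each entry's latency is its processing time minus its production, or event, time.

    # A tiny sketch of the latency definition above: for each data entry,
    # latency = processing time - production (event) time.
    # The events and their epoch-second timestamps are invented for illustration.
    events = [
        {"user": "alice", "event_time": 1000.0, "processing_time": 1000.4},
        {"user": "bob",   "event_time": 1001.0, "processing_time": 1003.9},
        {"user": "carol", "event_time": 1002.0, "processing_time": 1002.2},
    ]

    latencies = [e["processing_time"] - e["event_time"] for e in events]

    print("per-event latency in seconds:", latencies)    # roughly 0.4, 2.9, and 0.2
    print("average latency in seconds:", sum(latencies) / len(latencies))

A system that keeps this difference small even as the event rate grows is what we mean by a low-latency system.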