In this part of the module, you will learn what it takes to implement a pipeline that will scale as your dataset size grows. Let's take a closer look. Some of you may already be familiar with MapReduce. It is a distributed, fault-tolerant data processing framework that was described by Google in an influential research paper published in 2004. It is still widely used today, for example by the Apache Hadoop project. You need to know the basic concepts of the MapReduce framework because Dataflow and Apache Beam build on successful ideas from that framework and also include innovations developed by Google's researchers and engineers after 2004.

The diagram on the screen gives you a quick introduction to MapReduce. To process data in MapReduce, you start by sharding, in other words, splitting up the data. The individual shards of data are distributed on storage devices across multiple compute nodes in a distributed computing cluster. On the diagram, this is shown as data getting split up across nodes one, two, and three in the compute cluster. To run a data processing job in this framework, you write code for Map and Reduce functions. Let's look at Maps first. A Map should be a stateless function, so that it can be scheduled to run in parallel across the nodes in the cluster. Each Map reads the data from storage on the node where it is running, processes the data, and generates an output. The outputs of the Map operations are shuffled from the different nodes in the cluster to the next stage of processing, called Reduce. You can think of a Reduce as an aggregation operation over data. The aggregation can be an operation like counting the number of data elements or computing sums. Once the Reduce operations are finished, the result becomes the output of the MapReduce step in a pipeline.

If you want to take a transformation in your data processing pipeline and let Dataflow run it at scale with automatic distribution across many nodes in a cluster, then you should use Apache Beam's ParDo class. ParDo is short for "parallel do." The transformation steps created using ParDo are similar to the Maps in MapReduce. The transformations used with ParDo have to be stateless so they can be run in parallel. This is somewhat restrictive but useful for many tasks. For example, if you're building a data processing pipeline to analyze web server log files, you may need to filter out the log entries that include the IP address of a visitor to your website. You can do that with a stateless transformation, or if you want to extract the value of the IP address from the string of the log entry, you can do that statelessly. Other stateless processing operations, like converting strings to integers, or any calculations that work with just a part of the input, like a row of data, are all good candidates for a ParDo.

If you're using Python to implement your data processing pipeline, there are helper methods to let you start using ParDo. beam.Map, shown on the slide, is designed only for one-to-one relationships. For example, if you're processing words in a document and for each word you want to return a pair with the word itself and its length, then there is a one-to-one relationship, because every word can be mapped to only one length in terms of the number of the word's characters. So if you use beam.Map for a transformation in your pipeline, Dataflow will automatically handle running the transformation, such as the word-length calculation, over multiple nodes in a Dataflow cluster.
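To make the beam.Map example concrete, here is a minimal sketch in the Beam Python SDK. It assumes a small in-memory list created with beam.Create stands in for a real input source, and the step labels are illustrative rather than taken from the course materials:

```python
import apache_beam as beam

# Minimal sketch: beam.Map applies a stateless, one-to-one transformation,
# so a Beam runner such as Dataflow can distribute it across worker nodes.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'CreateWords' >> beam.Create(['map', 'reduce', 'shuffle'])   # stand-in for a real source
        | 'WordLengths' >> beam.Map(lambda word: (word, len(word)))    # exactly one output per input
        | 'PrintPairs' >> beam.Map(print)
    )
```

Because the lambda touches only the single element it receives, the runner is free to schedule copies of it on as many workers as it needs.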
Unlike Map, beam.FlatMap supports transformations that can generate any number of outputs for an input, including zero outputs. Continuing with the example where you're processing words from a document, maybe for every word you would like to output the list of vowels for that word; obviously you can have zero, one, two, or even more vowels per word. The transformations in beam.FlatMap can also be run in parallel by Dataflow. If you're using Java to implement your pipeline, you simply call the ParDo.of static method on your transformation and pass the result to the next apply call on your pipeline.

If you'd like to use the GroupByKey operation, it's straightforward to add it to your pipeline. For example, if you have a pipeline that processes postal addresses and tries to find all the zip codes for every city, once your pipeline has a PCollection of key-value pairs, like what's shown, with each pair containing the city as the key and a zip code, the output created by beam.GroupByKey will be a PCollection of pairs where every pair has the city as the key and the list of that city's zip codes as the value.

While GroupByKey is similar to the shuffle step in MapReduce, the Combine.PerKey operation is more general and includes both shuffle and reduce steps to help you implement aggregations like sums and counts. You can use the Combine.globally method to compute over your entire dataset. For example, if you're processing financial transaction data, so that every row in your PCollection is a transaction with a sales amount, then to compute the total sales over all transactions, you can use Combine.globally with the sum operation as the argument. Combine also supports more fine-grained aggregations. For example, if your financial transaction records include the name of the salesperson in addition to the sales amount, you can pass the sum operation to Combine.PerKey and use it to compute the total sales per salesperson.
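A short sketch of the FlatMap and GroupByKey examples in the Python SDK may help. The in-memory words, cities, and zip codes are made-up placeholders, and the step labels are assumptions, not names from the course:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # FlatMap: zero or more outputs per input (here, the vowels found in each word).
    vowels = (
        pipeline
        | 'Words' >> beam.Create(['sky', 'queue', 'beam'])
        | 'Vowels' >> beam.FlatMap(lambda word: [ch for ch in word if ch in 'aeiou'])
    )

    # GroupByKey: collect every zip code observed for each city into a single list.
    zips_by_city = (
        pipeline
        | 'CityZips' >> beam.Create([('Seattle', '98101'),
                                     ('Seattle', '98109'),
                                     ('Austin', '78701')])
        | 'GroupZips' >> beam.GroupByKey()   # ('Seattle', ['98101', '98109']), ...
    )
```

Note that 'sky' contributes nothing to the vowels output, which is exactly the zero-output case that beam.Map cannot express.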
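For the aggregation examples, the Python SDK spells the two operations beam.CombineGlobally and beam.CombinePerKey. The sketch below assumes a small hard-coded list of (salesperson, amount) pairs purely for illustration:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    sales = pipeline | 'Sales' >> beam.Create([('alice', 120.0),
                                               ('bob', 75.5),
                                               ('alice', 30.0)])

    # CombineGlobally: total sales over the entire dataset.
    total_sales = (
        sales
        | 'AmountsOnly' >> beam.Map(lambda kv: kv[1])
        | 'TotalSales' >> beam.CombineGlobally(sum)
    )

    # CombinePerKey: the same sum operation, but aggregated per salesperson.
    sales_per_person = sales | 'SalesPerPerson' >> beam.CombinePerKey(sum)
```

Passing the same sum callable to both transforms mirrors the point in the lesson: Combine handles the shuffle and the reduce for you, globally or per key.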