Welcome back to the course on software defined networking. In this lesson, we're continuing our discussion of the separation of the control and data planes, and in particular the challenges associated with that separation. We'll overview several of these challenges, including scalability, reliability, and consistency, and we'll talk about approaches to solving them in two different systems: the Routing Control Platform (RCP) and ONIX.

Let's first take a look at some of the scalability challenges faced by the Routing Control Platform and the approaches that platform takes to solving them. One problem the RCP faces is that it must store routes and compute routing decisions for every router across the autonomous system, and a single autonomous system may have hundreds to thousands of routers. That's potentially many, many routing tables and a lot of routing computations, all performed at a single node, whereas before those computations were distributed across the routers themselves.

The RCP design applies a few scalability principles. The first is to eliminate redundancy: rather than storing a routing table for every single router in the autonomous system, the RCP stores a single copy of each route. If routes are duplicated across routers in the autonomous system, as is commonly the case, that redundancy can be represented by storing pointers into a common data structure. The second principle is to accelerate lookups by maintaining indexes that identify the particular routers affected by a change in network conditions, such as the advertisement of a new route or a node or link failure. When an event happens, the RCP then only needs to compute new routing state for the routers affected by that change, rather than recomputing state for the entire network. Finally, the RCP sidesteps some scalability problems by simply not performing routing for every routing protocol in the network, focusing instead on inter-domain routing alone.

The ONIX network controller applies a couple of related principles to handle scalability. The first is partitioning, whereby an ONIX controller might keep track of only a subset of the overall network information base and network state, and then apply consistency protocols to maintain consistency across the different partitions. ONIX takes advantage of two consistency models: a strong consistency model, which keeps different replicas strongly consistent at the expense of some performance, and a weaker consistency model, which is more efficient and passes information around more quickly. The second scalability principle is aggregation: the ONIX design describes a hierarchy of controllers, such as one ONIX controller per department or building across a larger enterprise network, with a single "super" ONIX controller that controls those sub-controllers for the overall domain.
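To make the RCP's first two scalability principles concrete, here is a minimal sketch in Python. It is not the actual RCP implementation, and every name in it is hypothetical: it stores a single copy of each route that routers share by reference, and keeps an index from each egress router to the routers whose assignments depend on it, so an event triggers recomputation only for the affected routers.

```python
class RouteStore:
    """Hypothetical sketch: one shared copy per route, plus an index
    from each egress router to the routers that depend on it."""

    def __init__(self):
        self.routes = {}        # (prefix, egress) -> one shared route record
        self.assignments = {}   # router_id -> {prefix: shared route record}
        self.egress_index = {}  # egress -> set of router_ids to recompute

    def assign(self, router_id, prefix, egress):
        key = (prefix, egress)
        # setdefault stores a single copy; later routers share the reference.
        route = self.routes.setdefault(key, {"prefix": prefix, "egress": egress})
        self.assignments.setdefault(router_id, {})[prefix] = route
        self.egress_index.setdefault(egress, set()).add(router_id)

    def affected_by(self, egress):
        # When this egress fails (or a better route appears), only these
        # routers need their routing state recomputed.
        return self.egress_index.get(egress, set())


store = RouteStore()
store.assign("A", "10.0.0.0/8", egress="D")
store.assign("B", "10.0.0.0/8", egress="D")
# Both routers hold a pointer to the same route record, not a duplicate.
assert store.assignments["A"]["10.0.0.0/8"] is store.assignments["B"]["10.0.0.0/8"]
print(store.affected_by("D"))   # {'A', 'B'}: recompute just these on a failure
```

The point of the index is that a failure of egress D touches only the routers returned by affected_by, not the whole network.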
Let's now take a look at how these systems tackle a second challenge: reliability. One particular approach to reliability is simply to replicate. The RCP design advocates having a hot spare, whereby multiple identical RCP servers run in parallel and a backup, or standby, RCP can take over in the event that the primary fails. The idea is that the network runs independent replicas of the RCP, where each replica has its own feed of the routes from all the routers in the autonomous system. If each replica receives exactly the same inputs and runs exactly the same routing algorithm, then the output, or resulting state, that each of these RCPs pushes back into the routers should be exactly the same, because the inputs and the algorithm are the same. So in the hot-spare approach, there's actually no need for a consistency protocol as long as both replicas always see the same information.

There are potential consistency problems, however, if different replicas see different information. Let's see how that might happen. Here are two RCPs. Suppose they see different information and, as a result, compute different outcomes, or desired routing table state, for routers A and B in this autonomous system. The RCP on the left might compute an egress route for router A that says: use egress router D to reach a particular destination, and hence use router B as the next hop to reach egress router D. Similarly, the RCP on the right might install conflicting state into router B that says: use egress router C to reach that destination, and use router A as the next hop to reach egress router C. You can see that if these two replicas install this respective state into routers A and B, we have a forwarding loop between routers A and B: in trying to reach the destination, router A uses the gold route to try to egress via router D, and router B uses the grey route to try to egress via router C. When each of these routers receives packets for that destination from its respective neighbor, it just bounces the packets back and forth.

What we want is for route assignments to be consistent even in the presence of failures and partitions. We just said that if every RCP receives the same input and runs the same algorithm, then the output should be consistent, and we want some way to guarantee that. Fortunately, a flooding-based interior gateway protocol such as OSPF or IS-IS, as we learned in our networking course, essentially means that each replica already knows which partitions it's connected to. If the RCP is participating in the intra-domain routing protocol, or IGP, then it sees the full link state of each partition it's connected to, and that information is enough to make sure the RCP only computes routing state for the routers in the partitions it's connected to. That alone is enough to guarantee correctness. Let's see why.

Suppose we have a network partition where the routers in partition one can't see, or can't forward traffic to, the routers in partition two, and vice versa. In this case, the solution is to have the single RCP only use state from the routers in each partition when assigning routes to that partition. For example, to assign routes to routers in partition one, the RCP would only use the set of candidate routes it learned from the routers in partition one; it would not use any candidate routes learned from routers in partition two.
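Here is a minimal sketch, again with hypothetical names, of the rule just described: when the RCP assigns routes to a router, it considers only candidate routes learned from routers in the same IGP partition.

```python
def partition_of(router, partitions):
    # partitions: list of sets of routers, as revealed by the flooded IGP state
    for members in partitions:
        if router in members:
            return members
    return set()

def assign_routes(router, candidates, partitions):
    """candidates: list of (egress_router, route) pairs learned from BGP feeds."""
    members = partition_of(router, partitions)
    # Drop candidates whose egress lies in a different partition, so the two
    # partitions behave as separate networks under one control platform.
    return [route for egress, route in candidates if egress in members]


partitions = [{"A", "B", "D"}, {"C", "E"}]           # IGP view after a partition
candidates = [("D", "egress via D"), ("C", "egress via C")]
print(assign_routes("A", candidates, partitions))     # ['egress via D'] only
```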
That restriction alone is actually sufficient to guarantee consistent forwarding. You can intuitively see why: if the RCP never assigns a route learned from partition two to a router in partition one, then partition one and partition two are effectively acting as separate networks with a common routing control platform.

Suppose now that we've replicated the RCP, but the network itself has multiple partitions. Here you might think we have a more serious problem, because there may be partitions that are reachable by, or visible to, both RCPs, while others are reachable only by subsets of the replicas, and those subsets may be non-overlapping. The approach here is to ensure that the RCPs receive the same state from each partition they can reach. The IGP provides complete visibility into each of these partitions, and if an RCP only acts on a partition when it has the complete state for that partition, then the routes it assigns for that partition are guaranteed to be consistent. In other words, there will be no forwarding loops. (A small sketch of this completeness check appears at the end of this lesson.)

Let's look at how ONIX tackles the challenge of reliability. ONIX considers different types of failures that may occur in the network. The first is network failures; in this case, ONIX simply assumes that it is the application's responsibility to detect and recover from them. If a network failure affects reachability to ONIX itself, the design suggests that using a reliable protocol, multi-path routing, and so forth can help ensure that the ONIX controller remains reachable. If ONIX itself fails, the solution is a similar approach: apply replication, and then use a distributed coordination protocol among the replicas. Because ONIX is designed for a far more general set of applications than the Routing Control Platform, a more complicated distributed coordination protocol is necessary. Some details of those coordination protocols are discussed in the paper referenced on the slide where we introduced ONIX earlier in this lecture.

In summary, separating the control and data planes poses three significant challenges. The first is scalability: a single controller must now make routing decisions, or various other control-plane decisions, on behalf of many, many network elements that previously performed those computations independently. The second is reliability: guaranteeing correct operation under failure of the network, or of the controller itself. The third is consistency: ensuring consistency across multiple controller replicas, particularly in cases of network partitions or failures. We explored each of these challenges in some detail and talked about various techniques, including hierarchy, aggregation, and clever state management and distribution, that systems such as the RCP and ONIX have used to tackle them. Each controller tackles these challenges in a different way, but many of these principles apply across different controller designs and implementations.
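Finally, here is the sketch promised above of the completeness check for replicated RCPs. It is a hypothetical illustration, not the RCP's actual code: a replica assigns routes for a partition only when its IGP feed gives it the complete link state for that partition, so any replica that does act on a partition sees identical input and, with a deterministic algorithm, computes identical routes.

```python
def can_act_on(replica_lsdb, partition_members):
    # The flooded IGP guarantees that a replica connected to a partition sees
    # every router in it; a partial view means the replica must stand down.
    return partition_members <= set(replica_lsdb)


partition = {"A", "B", "D"}
rcp1_lsdb = {"A": "links...", "B": "links...", "D": "links..."}   # full view
rcp2_lsdb = {"A": "links..."}                                      # partial view

print(can_act_on(rcp1_lsdb, partition))   # True: RCP 1 assigns routes here
print(can_act_on(rcp2_lsdb, partition))   # False: RCP 2 leaves this partition alone
```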