In previous videos, we talked about using generalized policy iteration to find an optimal policy. We've also talked about using TD to estimate value functions. What would it look like if we used TD to do the policy evaluation step in generalized policy iteration? By the end of this video, you'll be able to explain how generalized policy iteration can be used with TD to find improved policies, as well as describe the Sarsa control algorithm.

Generalized policy iteration, or GPI, combines two parts: policy evaluation and policy improvement. The first algorithm we saw with this form was policy iteration. Policy iteration runs policy evaluation to convergence before greedifying the policy. Then we saw GPI with Monte Carlo, which performs a cycle of policy evaluation and improvement every episode.

To better remember GPI with Monte Carlo, imagine a mouse in a four-state corridor with cheese at the end. The mouse starts out knowing nothing and follows a random policy. Eventually, the mouse will stumble into the cheese just by moving randomly. At that point, the mouse updates its action values. Then it improves its policy by greedifying with respect to its action values. As the process repeats, it will eventually learn the optimal policy.

Notice that GPI with Monte Carlo does not perform a full policy evaluation step before improvement. Rather, it evaluates and improves after each episode. Going even further, we could improve the policy after just one policy evaluation step. We will do this with TD. To use TD within GPI, we need to learn an action value function, so we'll need to look at a slightly different version of TD than you've seen in the past. Instead of looking at transitions from state to state and learning the value of each state, let's look at transitions from state-action pair to state-action pair and learn the value of each pair. This algorithm is called Sarsa prediction. Let's look at it in a bit more detail.
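The state-action version of TD evaluation can be sketched in code. This is a minimal sketch, not from the lecture: the corridor environment, the function names (`step`, `random_policy`, `sarsa_prediction`), and the constants are all assumptions chosen to mirror the mouse-in-a-corridor example, with states 0 through 3 and the cheese at state 3.

```python
import random

def step(state, action):
    """Hypothetical corridor dynamics: action 0 moves left, action 1 moves right.
    Returns (reward, next_state, done); reaching state 3 yields the cheese (+1)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    if next_state == 3:
        return 1.0, next_state, True
    return 0.0, next_state, False

def random_policy(state, rng):
    """The mouse's fixed policy to evaluate: pick left or right uniformly."""
    return rng.choice((0, 1))

def sarsa_prediction(policy, episodes=5000, alpha=0.1, gamma=0.9, seed=0):
    """TD policy evaluation over state-action pairs (Sarsa prediction)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        a = policy(s, rng)            # choose the first state-action pair
        done = False
        while not done:
            r, s2, done = step(s, a)
            if done:
                # terminal state has value 0, so the target is just the reward
                q[(s, a)] += alpha * (r - q[(s, a)])
            else:
                a2 = policy(s2, rng)  # commit to the next action before updating
                q[(s, a)] += alpha * (r + gamma * q[(s2, a2)] - q[(s, a)])
                s, a = s2, a2
    return q

q = sarsa_prediction(random_policy)
# q[(2, 1)] approaches 1.0: moving right from state 2 reaches the cheese immediately
```

Note that the update for pair (S, A) uses the value of the next pair (S', A'), so the agent samples A' from its policy before it can update.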
The Sarsa acronym describes the data used in the updates: state, action, reward, next state, and next action. Sarsa makes predictions about the values of state-action pairs. The agent chooses an action in the initial state to create the first state-action pair. Next, it takes that action in the current state and observes the reward R_{t+1} and next state S_{t+1}. In Sarsa, the agent needs to know its next state-action pair before updating its value estimates. That means it has to commit to its next action before the update. Since our agent is learning action values for a specific policy, it uses that policy to sample the next action.

Here's the full update equation:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]

It actually looks quite similar to the TD update for state values, with the state values V(S) replaced by action values Q(S, A). The algorithm we just described is for policy evaluation. It learns action values for a specific fixed policy. However, thanks to the GPI framework, we can turn it into a control algorithm. This time, we'll improve the policy every time step, rather than after an episode or after convergence. This completes the description of Sarsa, the GPI algorithm that uses TD for policy evaluation.

In this video, we talked about how we can combine generalized policy iteration with TD learning to find improved policies. Sarsa control is an example of GPI with TD learning.
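The control version can be sketched by making the policy greedy with respect to the current action values at every time step. This is a minimal sketch, not from the lecture: the corridor environment, the epsilon-greedy exploration scheme, and all names and constants (`sarsa_control`, `epsilon=0.1`, and so on) are assumptions for illustration.

```python
import random

def step(state, action):
    """Hypothetical corridor dynamics: action 0 moves left, action 1 moves right.
    Returns (reward, next_state, done); reaching state 3 yields the cheese (+1)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    if next_state == 3:
        return 1.0, next_state, True
    return 0.0, next_state, False

def epsilon_greedy(q, state, epsilon, rng):
    """Policy derived from the current Q: mostly greedy, occasionally random."""
    if rng.random() < epsilon:
        return rng.choice((0, 1))
    return max((0, 1), key=lambda a: q[(state, a)])

def sarsa_control(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Sarsa control: the Sarsa update plus policy improvement every time step."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(q, s, epsilon, rng)
        done = False
        while not done:
            r, s2, done = step(s, a)
            if done:
                q[(s, a)] += alpha * (r - q[(s, a)])  # terminal: target is just R
            else:
                # S, A, R, S', A': sample A' from the already-improved policy
                a2 = epsilon_greedy(q, s2, epsilon, rng)
                q[(s, a)] += alpha * (r + gamma * q[(s2, a2)] - q[(s, a)])
                s, a = s2, a2
    return q

q = sarsa_control()
greedy = [max((0, 1), key=lambda a: q[(s, a)]) for s in range(3)]
# greedy should recover "always move right toward the cheese": [1, 1, 1]
```

The only change from the prediction sketch is where the actions come from: instead of a fixed policy, each action is chosen epsilon-greedily from the latest Q, so every update is immediately followed by an implicit policy improvement.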