We just learned how the value function computed for

a given policy can be used to find a better policy.

In this video, we will show how we can use this to find

the optimal policy by iteratively

evaluating and improving a sequence of policies.

By the end of this video,

you will be able to outline

the policy iteration algorithm

for finding the optimal policy,

understand the dance of policy and value,

how policy iteration reaches the optimal policy

by alternating between evaluating

the policy and improving it,

and apply policy iteration to compute

optimal policies and optimal value functions.

Recall the policy improvement theorem.

It tells us that we can

construct a strictly better policy by

acting greedily with respect to

the value function of a given policy,

unless the given policy was already optimal.

Let's say we begin with the policy Pi 1.

We can evaluate Pi 1 using

iterative policy evaluation to

obtain the state value, V Pi 1.

We call this the evaluation step.

Using the results of the policy improvement theorem,

we can then greedify with respect to v Pi

1 to obtain a better policy, Pi 2.

We call this the improvement step.
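As a rough sketch, assuming the dynamics are known and stored as tables (the array layout here is my own illustration, not notation from the video), the improvement step might look like:

```python
import numpy as np

def greedify(V, P, R, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V.

    P[a] is an (S, S) matrix of transition probabilities for action a,
    and R[a] is a length-S vector of expected rewards (illustrative layout).
    """
    n_states = V.shape[0]
    q = np.empty((n_states, len(P)))
    for a in range(len(P)):
        # one-step lookahead: q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
        q[:, a] = R[a] + gamma * P[a] @ V
    return q.argmax(axis=1)  # pick the maximizing action in every state
```

Running greedify on the value function of one policy yields the improved policy to evaluate in the next iteration.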

We can then compute V Pi 2

and use it to obtain an even better policy, Pi 3.

This gives us a sequence of better policies.

Each policy is guaranteed to be an improvement on the

last unless the last policy was already optimal.

So when we complete an iteration,

and the policy remains unchanged,

we know we have found the optimal policy.

At that point, we can terminate the algorithm.

Each policy generated in this way is deterministic.

There are a finite number of deterministic policies,

so this iterative improvement must

eventually reach an optimal policy.

This method of finding an optimal policy

is called policy iteration.

Policy iteration consists of two distinct steps repeated

over and over, evaluation and improvement.

We first evaluate our current policy, Pi 1,

which gives us a new value function that

accurately reflects the value of Pi 1.

The improvement step then uses V Pi 1

to produce a greedy policy Pi 2.

At this point, Pi 2 is greedy

with respect to the value function of Pi 1,

but V Pi 1 no longer

accurately reflects the value of Pi 2.

The next step, evaluation, makes

our value function accurate with

respect to the policy Pi 2.

Once we do this, our policy is once again not greedy.

This dance of policy and value proceeds back and forth,

until we reach the only policy,

which is greedy with respect to its

own value function, the optimal policy.

At this point, and only at this point,

the policy is greedy and the value function is accurate.

We can visualize this dance as

bouncing back and forth between one line,

where the value function is accurate,

and another where the policy is greedy.

These two lines intersect only at

the optimal policy and value function.

Policy iteration always makes progress towards

the intersection by projecting

first onto the line v equals v Pi,

and then onto the line where Pi is greedy with

respect to v. Of course,

the real geometry of the space of policies

and value functions is more complicated,

but the same intuition holds.

Here's what this procedure looks like in pseudocode.

We initialize v and Pi in any way we like

for each state s. Next,

we call iterative policy evaluation

to make V reflect the value of Pi.

This is the algorithm we learned earlier in this module.

Then, in each state, we set Pi to select

the maximizing action under the value function.

If this procedure changes

the selected action in any state,

we note that the policy is still changing,

and set policy stable to false.

After completing step 3,

we check if the policy is stable.

If not, we carry on and evaluate the new policy.
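The pseudocode above could be sketched in Python roughly as follows. This is a tabular implementation assuming known dynamics; the table layout (P[a] as a transition matrix, R[a] as an expected-reward vector) is my own choice for illustration:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate evaluation and improvement until the policy is stable.

    P[a] is an (S, S) matrix of transition probabilities for action a,
    and R[a] is a length-S vector of expected rewards (illustrative layout).
    """
    n_states = P[0].shape[0]
    V = np.zeros(n_states)                # 1. initialize V and pi arbitrarily
    pi = np.zeros(n_states, dtype=int)
    while True:
        # 2. iterative policy evaluation: sweep until V stops changing
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = R[pi[s]][s] + gamma * P[pi[s]][s] @ V
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 3. policy improvement: act greedily with respect to V
        policy_stable = True
        for s in range(n_states):
            q = [R[a][s] + gamma * P[a][s] @ V for a in range(len(P))]
            best = int(np.argmax(q))
            if best != pi[s]:
                policy_stable = False     # the policy is still changing
                pi[s] = best
        if policy_stable:  # greedy with respect to its own value function
            return pi, V
```

On a tiny two-state MDP this terminates after a few iterations, returning a policy that is greedy with respect to its own value function.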

Let's look at how this works on

a simple problem to build some intuition.

Remember the four-by-four gridworld example we

used to demonstrate iterative policy evaluation.

Previously, we showed that by

evaluating the random policy,

and greedifying just once,

we could find the optimal policy.

This is not a very interesting case for policy iteration.

Let's modify this problem a little bit

to make the control task a bit harder.

First, let's remove one of

the terminal states so that there's

only one way to end the episode.

Previously, each transition gave a reward of minus 1.

Instead, let's add some especially bad states.

These bad states are marked in blue.

Transitioning into them gives a reward of negative 10.

The optimal policy should follow

the winding low cost path in white to the terminal state.

This additional complexity means that

policy iteration takes several iterations

to discover the path.
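A quick return calculation shows why the winding path wins. The path lengths below are hypothetical, since the exact grid layout isn't reproduced here, but the arithmetic is the point: a short path through a blue state collects a much worse return than a longer path that avoids it.

```python
def path_return(rewards):
    """Undiscounted return of a trajectory, given its per-step rewards."""
    return sum(rewards)

# hypothetical trajectories from the same start cell to the terminal state
shortcut = [-10, -1]          # one step into a blue state, then one normal step
detour = [-1, -1, -1, -1]     # four normal steps that avoid the blue states

# the longer detour has the higher return: -4 versus -11
better = path_return(detour) > path_return(shortcut)
```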

Let's see how this plays out.

First, we initialize a policy and value function.

As before, we choose the uniform random policy,

and set the value estimate to zero for all states.

The first step is to use

iterative policy evaluation to

evaluate the uniform random policy.

Since you've seen how this works before,

let's skip straight to the result.

The values are quite negative everywhere,

though slightly less so in states near the goal.

Next we perform the improvement step.

You've seen how greedification works before.

So again, let's skip to the result.

Notice that near the terminal state,

the policy correctly follows

the low cost path toward the terminal state.

In the states in the bottom row, however,

the policy instead takes the more direct,

but lower value path through the bad states.

Let's evaluate this new policy.

Notice how after just one improvement,

the values are starting to look much more reasonable,

but we aren't finished yet.

Let's greedify again.

Remember, the policy improvement theorem tells us

that this new policy is an improvement on the last one,

unless we have already reached the optimum.

Specifically, the policy in the bottom-right state

now goes straight up along the low cost path.

One more step of policy evaluation reflects this change.

The value of the bottom right state goes from

minus 15 to just minus 6.

Let's keep going until we reach the optimal policy.

One more step of improvement improves

the action selection in yet another state.

The next step of policy evaluation reflects this change.

We improve again, and evaluate, and improve.

Now, we can see that the policy has reached the optimum,

and follows a low cost path avoiding the blue states.

Evaluating one more time

gives us the optimal value function.

If we try to greedify again,

the policy remains unchanged.

This tells us that policy iteration is complete,

and the optimal policy has been found.

This example shows the power of policy iteration,

in that it guarantees we can follow a sequence of

increasingly better policies until

we reach an optimal policy.

Policy iteration cuts through the search space,

which is key when the optimal policy

is not straightforward,

in this case literally.

The same complexity will come

up in problems we really care about.

In this week's assessment,

you will have a chance to implement

policy iteration on a slightly more realistic example.

That's it for this video.

You should now understand how policy iteration works

by alternating between

policy evaluation and policy improvement,

and how policy iteration follows a sequence

of better policies and value functions

until it reaches the optimal policy and

optimal value function. See you next time.