Reinforcement learning is a problem formulation for

sequential decision making under uncertainty.

Earlier, we learned that the agent's role in

this interaction is to choose

an action on each time step.

The choice of action has an immediate impact on

both the immediate reward, and the next state.

In this video, we will describe policies:

how an agent selects its actions.

By the end of this video,

you'll be able to:

recognize that a policy is

a distribution over actions for each state,

describe the similarities and differences

between stochastic and deterministic policies,

and generate valid policies for

a given MDP, or Markov Decision Process.

In the simplest case, a policy

maps each state to a single action.

This kind of policy is called a deterministic policy.

We will use the fancy Greek letter Pi to denote a policy.

Pi of S represents the action

selected in state S by the policy Pi.

In this example, Pi selects the action A1 in

state S0 and action A0 in states S1 and S2.

We can visualize a deterministic policy with a table.

Each row shows the action chosen by Pi in the corresponding state.

Notice that the agent can select

the same action in multiple states,

and some actions might not be selected in any state.
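The deterministic policy table from the example can be sketched as a simple lookup, assuming the state and action names S0, S1, S2, A0, A1 from the slide:

```python
# A minimal sketch of the deterministic policy from the example:
# Pi selects A1 in state S0, and A0 in states S1 and S2.
deterministic_pi = {
    "S0": "A1",
    "S1": "A0",
    "S2": "A0",
}

def select_action(state):
    """Return the single action this deterministic policy chooses."""
    return deterministic_pi[state]
```

Note that A0 is chosen in two different states, and nothing requires every action to appear in the table at all.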

Consider the example shown here where

an agent moves towards its house on a grid.

The states correspond to the locations on the grid.

The actions move the agent up,

down, left, and right.

The arrows describe one possible policy,

which moves the agent towards its house.

Each arrow tells the agent

which direction to move in each state.

In general, a policy assigns

probabilities to each action in each state.

We use the notation Pi of A given S,

to represent the probability of selecting

action A in state S.

A stochastic policy is one where

multiple actions may be

selected with non-zero probability.

Here we show the distribution over

actions for state S0 according to Pi.

Remember that Pi specifies

a separate distribution over actions for each state.

So we have to follow some basic rules.

The sum over all action probabilities

must be one for each state,

and each action probability must be non-negative.
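Those two rules can be checked directly if we represent Pi as one distribution per state. This is just an illustrative sketch with made-up state and action names:

```python
# A sketch of a stochastic policy: pi[s][a] is the probability
# pi(a | s) of selecting action a in state s.
stochastic_pi = {
    "S0": {"A0": 0.5, "A1": 0.5},
    "S1": {"A0": 0.9, "A1": 0.1},
}

def is_valid_policy(pi):
    """Check the two basic rules: in every state, action probabilities
    are non-negative and sum to one."""
    for state, dist in pi.items():
        if any(p < 0 for p in dist.values()):
            return False
        if abs(sum(dist.values()) - 1.0) > 1e-9:
            return False
    return True
```

Each state gets its own separate distribution, so the probabilities in S0 need not bear any relation to those in S1.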

Let's look at another state, S1.

Pi in S1 corresponds to

a completely different distribution over actions.

In this example, the set of

available actions is the same in each state.

But in general, this set can be different in each state.

Most of the time we won't need this extra generality,

but it's important nonetheless.

Let's go back to our house example.

A stochastic policy might choose up

or right with equal probability in the bottom row.

Notice the stochastic policy will

take the same number of steps to

reach the house as

the deterministic policy we discussed before.
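We can see why the step count is the same with a small simulation. The grid coordinates here are assumptions (agent at (0, 0), house at (2, 2)): whenever the policy picks between up and right with equal probability, either move reduces the distance to the house by one, so every path takes the same number of steps.

```python
import random

def steps_to_house(start=(0, 0), house=(2, 2), seed=0):
    """Follow a stochastic up/right policy toward the house and
    count the steps taken. Grid layout is a made-up illustration."""
    rng = random.Random(seed)
    x, y = start
    steps = 0
    while (x, y) != house:
        moves = []
        if x < house[0]:
            moves.append("right")
        if y < house[1]:
            moves.append("up")
        move = rng.choice(moves)  # equal probability when both help
        if move == "right":
            x += 1
        else:
            y += 1
        steps += 1
    return steps
```

Every random seed gives the same answer: 4 steps, the Manhattan distance, matching the deterministic policy.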

Previously we discussed how a stochastic policy,

like Epsilon greedy, can be useful for exploration.

The same kind of exploration-exploitation

trade-off exists in MDPs.

We'll talk more about that later.

It's important that policies depend

only on the current state,

not on other things like time or previous states.

The state defines all the information

used to select the current action.

In this MDP, we can define a policy that chooses to

go either left or right with equal probability.

We might also want to define a policy

that chooses the opposite of what it did last,

alternating between left and right actions.

However, that would not be

a valid policy, because it is

conditioned on the last action.

That means the action depends

on something other than the state.

It is better to think of this

as a requirement on the state,

not a limitation on the agent.

In MDPs, we assume that the state

includes all the information

required for decision-making.

If alternating between left and

right would yield a higher return,

then the last action should be included in the state.
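Here is a hypothetical sketch of that fix: fold the last action into the state itself. The alternating behavior then becomes a valid policy, because the action depends only on the current (augmented) state.

```python
def alternating_policy(augmented_state):
    """A valid deterministic policy over an augmented state.

    augmented_state = (position, last_action), where last_action
    is "left", "right", or None at the start of an episode.
    (The state representation here is an assumption for illustration.)
    """
    position, last_action = augmented_state
    if last_action == "left":
        return "right"
    return "left"
```

The policy itself is still a plain mapping from state to action; it is the state that was enriched to carry the needed history.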

That's it for this video.

The most important things to remember are: one,

an agent's behavior is specified by a policy that maps

the state to a probability distribution over actions,

and two, the policy can depend only on the current state,

and not other things like time or

previous states. See you next time.