So, welcome to week two of our reinforcement learning course. During this week, we will talk about core concepts lying at the heart reinforcement learning. How to explain to agent what do we want it to do? The answer to this question is reward hypothesis, which states that we could formulate any goal and any purpose in terms of cumulative sum of scalar signal. This signal is called reward, and the sum of the signal over time is called a return. So, the return G sub t is a sum of immediate rewards from arbitrary time t, until the very end of the episode. The end of the episode is denoted by a capital T. This return consists of all rewards to the end of the episode and thus is a measure of global optimality of agent policy. Return is a random variable because each immediate reward depends on the agent action and also depends on the environment reaction to this action. For now, consider as example, the game of chess. Let's assume that we have designed the immediate rewards to be a value of all opponent's piece taken at a particular timestep, T. So, the return is equal to the total value of all opponent's pieces an agent have managed to take till the end of the game. Also, mathematically convenient, such formulation of our desire could have side effects. So, to better understand the limitations of cumulative sum of immediate rewards, consider another example of data center non-stop cooling system. An agent controls temperature in data center room and could adjust speed of different fans to enforce required temperature regime. We reward this agent with plus one for each second the system's temperature was sufficiently low and give it reward equal to zero otherwise. So, can you think of any possible problem in such a design? Well, the problem here is the length of the episode. Unlike the chess games this task doesn't have natural ending. That is, cooling system is meant to operate, well, in every day of the week, every single moment and operate in that way forever. The essence solves the problem with infinite horizon lies in optimization problem, that is, our objective, our return, is infinite for a myriad of non optimal behaviors. For instance, it is indeed infinite for an agent, behaviors that violates temperature regime at every search timestep. Tasks who has infinite horizons are often called continuing tasks. So of course, we could split such infinite horizon in some fixed length chunks. For example, hours, or days and assess agent's performance on the basis of performance during these chunks. However, as this approach requires manual decisions of what chunk length is appropriate, and this approach does not generalize well to arbitrary problem. Infinite horizon is not the only problem which may occur with the formulation of return. To better understand another problem with cumulative sum of rewards, consider as an example of an agent responsible for floor cleaning and air conditioning in a building. So, we encourage our agent to clean the floor by rewarding cleaning action with large value of 100 but because we also want air in the building to be fresh, we also encourage an agent for turning air conditioning system on and subsequently off, with reward equal to one. For this example, there is no infinite horizon because the episode ends when the day is over. So, what potential problem do you see in such a design? Well, this example illustrates a problem of positive feedback loop. This positive feedback loop is a cycle which allows an agent to gain large or possibly infinite reward. In this example, an agent will completely ignored the task of floor cleaning because this activity takes the whole day to finish. While turning air conditioning on off takes one second. And by doing this sequence of actions many, many times throughout the day, and agent is able to obtain cumulative reward much larger than 100. Well, the feedback loop in this task is an example of a bad reward design. We will discuss the reward design a little bit later, but for now, think about the environments where this infinite loop is in fact desire. For example, giving a positive reward for each second an agent is riding a bicycle is an example of a desire for positive feedback loop. We want an agent to never fall off the bike and in this case, with agent getting better and better in riding a bicycle, we could also face the problem of unboundedly increasing sum of rewards. This unboundedness can greatly harm our optimization procedure and thus, could break that learning process. But, how could we deal with infinite return? What could help us against the return being very large due to positive infinite loop? One of the most common approaches is discounting. Discounting means that we introduce some multiplier gamma, which is less than one and greater or equal to zero. This multiplier present rate with which things lose their value over time. That is, each reward receive later, as in current moment t, is reduced by multiplying the reward with the gamma, to the power of number of timesteps to this return. So discounting focusses agents attention more on close rewards and reduce the value of very distant once. More informally, the same cake compared to today's one was gamma times less tomorrow, gamma square times less the day after tomorrow, and so on. So that rewards received in n timesteps in the future is worth gamma to the power of N minus one times less, compared with the same reward acting right now. Such discounting allows to make infinite sum finite, providing that each term, is assumed, is bounded was from above and below. Well, why does this discounting solves the problem of return being infinite? This is so because of mathematical probity of geometric progression. So for example, if each immediate reward is equal to one, as an example of this data center cooling system, that infant sum is equal to the value of one over one minus gamma. Maximal return have nothing to do with the number of steps after which agent does not care, and this fact is useful to keep in mind. On the floors, you can see why rewards of plus one would be equal two after being discounting N times was different gammas. Know that slope of any curve decreases while increases N, this decrease suggests that even in the near future, are discounted at a higher discounted rate then events in the distant future. Also know that in case gamma is not exactly equal to zero, every reward, even infinitely distant one contributes to the return computation. It may contribute very, very little for very distant rewards, but it certainly does so. That is our agent still cares about distant rewards, but not as much as in case of undiscounted return. However, you should understand that agent will be almost indifferent to the very distant rewards, and this change of optimization objective definitely change what optimal policy looks like in each of these cases. So, why is discounting meaningful thing to do? Well, the reason of discounting is partially mathematical convenience and partially inspiration from human behavior. Humans, just like animals, given two similar rewards show of preference for the one that arrives sooner rather than later. Humans also discount the value of the later their work by a factor that increases with the length of the delay. However, a scientific literature suggests that human and animal discounting function f(t) is different and is more like hyperbolic discounting. The discounting function f(t) used in the reinforcement learning is very similar to the so-called quasi-hyperbolic discounting. Mainly, if we assume beta in quasi-hyperbolic discounting is equal to one, then we get precisely the same discounting scheme you have been shown on previous slide. In some sense, a quasi-hyperbolic discounting is a discrete approximation to a hyperbolic discounting function. And thus, it's rather close to how human discount. However, unlike hyperbolic discounting, it has some nice mathematical properties. The second reason of this particular kind of discounting is mathematical convenience. It not only makes infinite sums finite, it preserve some amount of contribution for each reward, but it also allows us to express the return in a recurrent function. We will heavily rely upon this recourse of definition of Gt, through Rt plus gamma multiplied by Gt plus one in future lectures. So, make sure you understand and remember it. Why do we also think about discounting from a different more theoretical perspective? Under mark of assumption, any action affects only the immediate reward and the next state. Well, any action could affect all or some of the future rewards by affecting the next state. That is, when you find a pressure in corner of the room, you definitely should assign them credit related to this rewards to the action of opening the door to this room. So, let's assume that the action effect lasts some subsequent number of steps after the action was committed, and then end. Let us also treat gamma S probability of effect continuation, and one minus gamma S probability is that effect ends. Then, the expected amount of return, which is due to current action is exactly equal to the discounted return. That is, with probability one minus gamma effect as immediately after the action and only the immediate reward error zero is attributed to the action committed now. With probability gamma effect less two timesteps and then, ends with probability one minus gamma. And in this case, R0 and R1 are attributed to the current action, and so on. So, you get the idea. Let us now speak a little bit about reward design. Two examples given in this lecture are, in fact, examples of a bad reward design. Consider for example, a game of chess and reward equal to the value of taken opponent's piece. In this case, an agent will not have the desire to win because it knows for sure that it will not be rewarded. That is so because we reward an agent only for taking pieces. And the cleaning robot example, also you know the reward given for cleaning the floor. To these examples share the same mistakes. And both of them, reward is given for how an agent should perform it is and not for what to do we wanted to do. In the game of chess, we should reward an agent for winning the game and not for taking pieces because the later could simply result in losing every single game with a lot of opponent's pieces taken. That is surely not what we want. The same is valid for the second example. We should reward an agent for fresh air and clean floor but not for using the means to achieve this. However, as such sparse rewards given only in the very end of an event, say plus one in the chess game for winning, may introduce additional difficulties in agent's learning. To some extent, this may be mitigated by your work shaping, which we'll discuss a bit later. The next concern is about reward shifting. In the machine learning courses, you may get used to standardizing training and testing the data before doing anything with this data. Well, for example, means of instruction and division by standard deviation is often a good idea in supervised and unsupervised learning. In the reinforcement learning, this is valid only for state representations but not for reward. You should not standardize rewards. So, let us see what will happen with optimal policy if you shift all the rewards by some mean value. So consider a world with four states and episode starts in S1 and ends in S4. Rewards for transitions are reason both the arrows. In the following example, mean other words will be negative, say, minus three. And suppression of this value will turns a negative feedback loop between states S2 and S3 into a positive feedback loop. So, optimal action for this world with modified rewards will definitely make use of this positive feedback loop. However, in the world with untransparent rewards, this policy will get possibly unboundedly bad the return. So the second takeaway is do not subtract anything from rewards unless you're pretty sure what you're doing. So, what transformations do not change optimal policy? One of such transformations is reward scaling. That means you can divide all the rewards in market decision process by any non-zero constant. And be sure that the optimal policy wasn't changed. This scaling by another constant may be especially helpful if you know beforehand the greatest immediate reward possible in a market decision process. And it is also useful for approximate methods, which we are going to cover in future videos. In other transformation which doesn't change the optimal policy in a market decision process is related to the so-called reward shaping. Reward shaping is an umbrella term for many methods that get an agent in a learning process by adding an extra values to the immediate rewards coming from an MDP. In general, under such reward shaping, the optimal policy changes. But there exists an important theory specifying how extra rewards should look like in order to preserve the optimal policy. In fact, these extra values should be equal to the difference of the potentials of next and current states. Such potential function is denoted by pi on this slide. We can also multiply the potential of the next state by a discount factor, gamma, if we want. The resultant potential-based function, F of SA and its prime, is presented on the slide. In particular, if I know nothing about condition probabilities and rewards in an MDP, the potential-based function F or the form depicted on this slide is the only one that preserves the optimal policy under reward shaping. The intuition of why does such addition of F later and change optimal policies may be as follows. Pretend for simplicity that discount factor gamma is equal to one, that is there are no discounting in an MDP. In such a case, the potential-based function sound or any circular sequence of states, that is starting and ending in the same state, is precisely equal to zero. That is why an agent cannot fall into a trap of a positive feedback loop caused by the introduction of potential-based rewards. Also, note that the added values do not depend on action A, but only on current and next state potentials. Thus, the potentials pi cannot influence that action selection directly, only by the means or the next state. But the influence of the next state potential is also illuminated because it is obstructed from the total return once we subsequently exit the next state as pi. This potential-based shaping function trick either other advanced technique. So, if you are get interested, you are highly encouraged to read the relevant paper specified at the bottom of the slide.