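(Zhihu articles cannot center formulas, which is frustrating. If the formatting here bothers you, please read the original post instead.)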

## Reinforcement Learning Components Review

A reinforcement learning agent may include one or more of the following components:

• Policy: A function representing the agent's behavior,
• Value Function: A function indicating how good each state and/or action is,
• Model: The agent's representation of the environment.

Reinforcement learning methods can be separated into two types: model-based approaches and model-free approaches. Model-based approaches include a model that represents the state transitions of the external environment, usually learnt from the agent's past experience, and exploit this model to compute a value function from which an optimal policy can be derived. Model-free approaches, on the other hand, do not include any model of the external environment; their two primary methods are direct policy search and value function iteration. In direct policy search, the agent observes the environment transition at each step and uses that information to update its policy directly, without storing any information about the dynamics of the environment. In value function iteration, the agent uses the transition information at each step to update its value function, and derives its policy from that value function.

## RL Review: Policy

A policy represents an agent’s behavior: it maps each state to an action.

A deterministic policy maps each state to one specific action:

$$a = \pi(s)$$

where $s$ is the state the agent is in, $\pi$ is the policy that the agent executes, and $a$ is the action to take under policy $\pi$.

A stochastic policy defines a probability distribution over the actions available in a given state. In a given state, the agent does not take one fixed action; instead, each possible action is taken with its own probability. Formally, a stochastic policy can be expressed as

$$\pi(a \mid s) = \mathbb{P}[a \mid s]$$

where $s$ is the state the agent is in, $a$ is a possible action to take in state $s$, $\pi$ is the policy that the agent executes, and $\pi(a \mid s)$ is the probability that the agent takes action $a$ in state $s$ under policy $\pi$.
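The difference between the two kinds of policy can be sketched in a few lines of Python. This is only an illustration: the state names, action names, and probabilities below are hypothetical and do not come from any particular environment.

```python
import random

# Hypothetical toy problem: two states and two actions, chosen only for illustration.
ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a = pi(s): each state maps to exactly one action."""
    table = {"s0": "right", "s1": "left"}
    return table[state]

def stochastic_policy(state):
    """pi(a | s): each state maps to a probability distribution over actions."""
    table = {
        "s0": {"left": 0.2, "right": 0.8},
        "s1": {"left": 0.5, "right": 0.5},
    }
    return table[state]

def sample_action(state, rng):
    """Draw one action from pi(. | s)."""
    probs = stochastic_policy(state)
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=1)[0]

rng = random.Random(0)
print(deterministic_policy("s0"))
print(stochastic_policy("s0"))
print(sample_action("s0", rng))
```

The deterministic policy always returns the same action for a state, while repeated calls to `sample_action` may return different actions for the same state.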

## RL Review: Value Function

A value function is a prediction of the long-term future reward of a state, or of an action taken in a specific state. A state value function predicts how much reward the agent will get after reaching a state $s$, while a state-action value function, also called a Q-value function, predicts how much reward the agent will get by taking an action $a$ in state $s$.
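As a rough illustration of "long-term future reward", the sketch below computes a discounted sum of rewards and averages it over several episodes that start from the same state. The reward sequences and the discount factor value are made up for the example, and averaging sampled returns is only one simple way to estimate a value.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_{t+k+1} over a finite reward sequence."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def estimate_value(episodes, gamma):
    """Average the discounted returns of episodes that start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)

# Hypothetical reward sequences observed after visiting the same state three times.
episodes = [[1.0, 0.0, 2.0], [0.0, 0.0, 1.0], [1.0, 1.0]]
print(estimate_value(episodes, gamma=0.5))
```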

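### Q-value Function

The Q-value function gives the expected total reward from state $s$ and action $a$ under policy $\pi$ with discount factor $\gamma$:

$$Q^{\pi}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots \mid s, a \right]$$

This function can also be decomposed into a Bellman equation:

$$Q^{\pi}(s, a) = \mathbb{E}_{s', a'}\left[ r + \gamma Q^{\pi}(s', a') \mid s, a \right]$$

### Optimal Q-value Function

An optimal Q-value function is the maximum achievable value after taking an action $a$ in a state $s$:

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) = Q^{\pi^{*}}(s, a)$$

where $\pi^{*}$ is the optimal policy the agent could execute. Conversely, once we have the optimal Q-value function, the optimal policy follows easily:

$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$$

Moreover, the optimal Q-value function also decomposes into a Bellman equation:

$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right]$$

Intuitively, an optimal Q-value function gives the total expected discounted sum of future rewards obtained by following the action sequence that yields the most reward in total.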
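On a very small problem, the Bellman form of the optimal Q-value function can be applied repeatedly as an update rule until the values stop changing. The sketch below assumes a hypothetical three-state chain with deterministic transitions, so the expectation over the next state reduces to a single term; the state and action names are illustrative only.

```python
# Hypothetical 3-state chain: "move" advances toward the goal, "stay" does nothing.
# TRANSITIONS[state][action] = (next_state, reward); all names are illustrative.
TRANSITIONS = {
    "A": {"stay": ("A", 0.0), "move": ("B", 0.0)},
    "B": {"stay": ("B", 0.0), "move": ("goal", 1.0)},
    "goal": {},  # terminal state: no actions, so its value is 0
}

def q_value_iteration(transitions, gamma, sweeps=50):
    """Repeatedly apply Q(s,a) <- r + gamma * max_a' Q(s',a') (deterministic case)."""
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(sweeps):
        for s, acts in transitions.items():
            for a, (s_next, r) in acts.items():
                best_next = max(q[s_next].values(), default=0.0)
                q[s][a] = r + gamma * best_next
    return q

def greedy_policy(q):
    """pi*(s) = argmax_a Q*(s,a) for every non-terminal state."""
    return {s: max(acts, key=acts.get) for s, acts in q.items() if acts}

q_star = q_value_iteration(TRANSITIONS, gamma=0.9)
print(q_star)
print(greedy_policy(q_star))
```

The greedy policy read off from the converged table chooses `move` in both non-terminal states, since that is the only way to reach the rewarded transition.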

## Deep Reinforcement Learning

As mentioned above, there are roughly three primary approaches to solving reinforcement learning problems:

• Value-based approach, which estimates the optimal Q-value function, i.e. the maximum sum of rewards achievable in the future
• Policy-based approach, which searches directly for the optimal policy, i.e. the policy achieving maximum future reward
• Model-based approach, which builds a model of the environment and plans a policy using that model (e.g. by look-ahead)

Deep reinforcement learning uses deep neural networks to represent the components of classic reinforcement learning problems. The components that can be represented by deep neural networks are

• Value function
• Policy
• Model

Because computing the gradient over the full data set is computationally expensive, stochastic gradient descent (SGD) is usually employed to optimize the loss functions of these neural networks.

## Deep Q-Networks

An advantage of using neural networks to represent a Q-value function is that they can handle continuous state and action spaces. When both the state space and the action space are continuous, the usual Q-Network structure takes both the state and the action as input and outputs a single number: the value of taking that action in that state. When the action space is discrete, however, there are tricks that save computational resources.

The following are two Deep Q-Network structures.

The structure on the left is generally applicable to both discrete and continuous action spaces, while the structure on the right, proposed by Google DeepMind (reference requested), is applicable to discrete action spaces. A significant advantage of the structure on the right is that it generates the values of all actions for a given state at once, which saves a lot of computation.
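The computational difference between the two structures can be illustrated with a rough sketch. Here plain linear functions stand in for the deep networks, and the dimensions, weights, and function names are hypothetical; the point is only the shape of the inputs and outputs.

```python
import random

STATE_DIM = 3      # hypothetical sizes, chosen only for illustration
NUM_ACTIONS = 4

rng = random.Random(0)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Structure 1: input is (state, action), output is a single value Q(s, a, w).
w_single = [rng.uniform(-1, 1) for _ in range(STATE_DIM + NUM_ACTIONS)]

def q_single(state, action):
    return dot(w_single, state + one_hot(action, NUM_ACTIONS))

# Structure 2: input is the state only, output is one value per discrete action.
w_multi = [[rng.uniform(-1, 1) for _ in range(STATE_DIM)] for _ in range(NUM_ACTIONS)]

def q_all_actions(state):
    return [dot(row, state) for row in w_multi]

state = [0.5, -1.0, 2.0]

# Finding the greedy action with structure 1 needs one forward pass per action.
values_1 = [q_single(state, a) for a in range(NUM_ACTIONS)]
# With structure 2, a single forward pass yields all action values.
values_2 = q_all_actions(state)

print(max(range(NUM_ACTIONS), key=lambda a: values_1[a]))
print(max(range(NUM_ACTIONS), key=lambda a: values_2[a]))
```

With a real deep network each forward pass is expensive, so evaluating the network once per state instead of once per action matters when the greedy action has to be found at every step.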

Our goal is to learn the parameter vector $w$, which contains all the weights of a Q-Network that approximates the true optimal Q-value function:

$$Q(s, a, w) \approx Q^{*}(s, a)$$

for each state $s$ in the state space and each action $a$ in the action space. However, the actual optimal Q-value function $Q^{*}(s, a)$ is not available; if it were, there would be no need to train a Q-Network to approximate it, and we would simply use it directly. Formally, the true optimal Q-value function is

$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right]$$

where the right-hand side is the learning target given state $s$ and action $a$. We cannot obtain the real value of this expectation. What we can do instead is assume that the Q-Network trained so far is a good approximation of the true Q-value function, and approximate the right-hand side with the current Q-Network:

$$r + \gamma \max_{a'} Q(s', a', w)$$

The learning goal then becomes an optimization problem: minimize the mean-squared error (MSE) loss

$$L(w) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right)^{2} \right]$$

We then apply stochastic gradient descent (SGD) to this MSE loss to bring the Q-Network closer to the true optimal Q-value function. SGD is used instead of full gradient descent because of the computational cost, and given enough training epochs it can bring the weights of the Q-Network close to those that full gradient descent would find.
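The learning target and the MSE loss can be computed as in the following minimal sketch. It assumes the Q-Network outputs for each transition have already been evaluated and are given as plain numbers; the `done` flag for terminal states is an addition for the example and is not discussed in the text above.

```python
def q_learning_target(reward, next_q_values, gamma, done=False):
    """r + gamma * max_a' Q(s', a', w); no bootstrap term if s' is terminal."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def mse_loss(targets, predictions):
    """Mean of squared differences between the targets and Q(s, a, w)."""
    errors = [(y - q) ** 2 for y, q in zip(targets, predictions)]
    return sum(errors) / len(errors)

# Hypothetical mini-batch of two transitions with pre-computed network outputs.
batch = [
    {"q_sa": 1.0, "reward": 0.0, "next_q": [0.5, 2.0], "done": False},
    {"q_sa": 0.5, "reward": 1.0, "next_q": [0.0, 0.0], "done": True},
]
gamma = 0.5
targets = [q_learning_target(t["reward"], t["next_q"], gamma, t["done"]) for t in batch]
predictions = [t["q_sa"] for t in batch]
print(targets, mse_loss(targets, predictions))
```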

With a table-lookup representation of the Q-value function, this procedure converges to the true optimal Q-value function $Q^{*}$. In continuous cases, however, the Q-Network may diverge due to:

• Correlations between samples (i.e. the state-action pairs show up in a specific order in the training set)
• Non-stationary targets (i.e. the Q-Network changes after each training epoch which leads to non-stationary targets for the learning process)

In the following section, we are going to discuss how to overcome these undesirable effects.

## Experience Replay

A trick called experience replay is designed to handle the correlations between samples. If samples are strongly correlated, the Q-Network may be biased in one direction when trained on one set of samples and in another direction when trained on another set. To handle this problem, we need to break the correlations between samples so that the Q-Network is trained towards an average direction. Experience replay was proposed to meet this need.

In experience replay, past experiences (state-action pairs together with the reward and next state that followed them) are stored in a memory of limited or unlimited length; that is, the memory either keeps a fixed number of the most recent experiences or keeps all of them. Then, in each learning epoch, a small number of experiences are chosen at random from the memory to form a training set for the Q-Network; such a set of training samples is called a mini-batch. This technique breaks the correlations because the experiences are drawn from the memory at random rather than in order. The order in which the experiences were observed therefore does not matter in training, and the correlations are removed.
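A replay memory of this kind can be sketched with the Python standard library. The class and method names below are hypothetical, and each stored experience is assumed to be a transition (state, action, reward, next state), since the reward and next state are needed later to form the learning target.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity memory; capacity=None keeps every transition."""

    def __init__(self, capacity=None, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest items are dropped when full
        self.rng = random.Random(seed)

    def remember(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=5, seed=0)
for t in range(8):                      # hypothetical transitions, indexed by time step
    memory.remember(state=t, action=0, reward=0.0, next_state=t + 1)

print(len(memory))                      # only the most recent transitions remain
print(memory.sample(batch_size=3))
```

Sampling uniformly at random means consecutive time steps rarely end up next to each other in a mini-batch, which is what breaks the ordering of the original experience stream.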

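Formally, during the experience remembering process, each new experience is stored in the memory $\mathcal{D}$ as a transition:

$$\mathcal{D} \leftarrow \mathcal{D} \cup \left\{ (s_t, a_t, r_{t+1}, s_{t+1}) \right\}$$

During the training process, multiple transitions are randomly chosen from the memory to form a mini-batch for training the Q-Network:

$$(s, a, r, s') \sim \mathcal{D}$$

We then apply stochastic gradient descent to update the weight vector $w$ of the Q-Network by adding to it a difference

$$\Delta w = -\alpha \frac{\partial L(w)}{\partial w}$$

where $\alpha$ is the learning rate, $L(w)$ is the mean-squared loss as a function of the weight vector $w$, and $\Delta w$ is the difference added to the current weight vector to update the Q-Network. The partial derivative of the loss function with respect to the weight vector is

$$\frac{\partial L(w)}{\partial w} = -2\, \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a', w^{-}) - Q(s, a, w) \right) \frac{\partial Q(s, a, w)}{\partial w} \right]$$

where the coefficient $2$ can be considered part of the learning rate $\alpha$, so the weight vector difference can be expressed as

$$\Delta w = \alpha\, \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a', w^{-}) - Q(s, a, w) \right) \frac{\partial Q(s, a, w)}{\partial w} \right]$$

In the above two equations $w^{-} = w$; that is, $w^{-}$ is equal to the value of the weight vector but is treated as a constant when taking the derivative.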
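The weight update can be written out for a small example. The sketch below assumes a linear stand-in for the Q-Network with one weight row per discrete action, so the derivative of the Q-value with respect to the weights of the chosen action is simply the state vector. Terminal states are ignored for brevity, and all names and numbers are illustrative.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def q_values(weights, state):
    """Linear stand-in for a Q-Network: one weight row per discrete action."""
    return [dot(row, state) for row in weights]

def sgd_step(weights, batch, alpha, gamma):
    """Apply delta_w = alpha * mean[(target - Q(s,a,w)) * dQ/dw] over a mini-batch."""
    w_minus = [row[:] for row in weights]          # same values as w, held constant
    delta = [[0.0] * len(row) for row in weights]
    for state, action, reward, next_state in batch:
        target = reward + gamma * max(q_values(w_minus, next_state))
        td_error = target - q_values(weights, state)[action]
        # For a linear model, dQ(s,a,w)/dw[action] is the state vector itself.
        for i, feature in enumerate(state):
            delta[action][i] += td_error * feature / len(batch)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weights, delta)]

# Hypothetical mini-batch of transitions (s, a, r, s') with 2-dimensional states.
batch = [
    ([1.0, 0.0], 0, 1.0, [0.0, 1.0]),
    ([0.0, 1.0], 1, 0.0, [1.0, 0.0]),
]
weights = [[0.0, 0.0], [0.0, 0.0]]
weights = sgd_step(weights, batch, alpha=0.5, gamma=0.9)
print(weights)
```

Only the weight tied to the rewarded transition moves in this first step, because the other transition has zero error when all weights start at zero.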

## Fixed Parameters

To deal with non-stationarity, the weight vector $w^{-}$ in the learning target term is held fixed. If this parameter is fixed instead of changing after each training epoch, the non-stationary target issue is solved, because the training target is now fixed, in other words stationary.

In practice, however, the target weight vector $w^{-}$ cannot be held fixed forever, otherwise the Q-Network could never get close to the true Q-value function. We therefore need to update the target weight vector after a certain number of training epochs.

Intuitively, there are two Q-Networks: the training network with parameter $w$ and the target network with parameter $w^{-}$. The training network is updated at each training epoch, and after a certain number of training epochs its parameter is assigned to the target network:

$$w^{-} \leftarrow w$$

The training target is thus stationary within each such period; once the period ends, the target is updated with the new weight vector, and this process repeats throughout training.
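The interplay between the two sets of parameters can be sketched as a short training loop. As before, a linear function stands in for the Q-Network, and the function name, the `sync_every` parameter, and the toy transitions are assumptions made for this example only.

```python
import copy
import random

def train(transitions, num_actions, state_dim, steps, sync_every,
          alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    w = [[0.0] * state_dim for _ in range(num_actions)]   # training network
    w_minus = copy.deepcopy(w)                            # target network
    syncs = 0
    for step in range(1, steps + 1):
        state, action, reward, next_state = rng.choice(transitions)
        # The target is computed with the frozen parameters w_minus.
        next_q = [sum(x * y for x, y in zip(row, next_state)) for row in w_minus]
        target = reward + gamma * max(next_q)
        q_sa = sum(x * y for x, y in zip(w[action], state))
        # Only the training network is updated at every step.
        for i, feature in enumerate(state):
            w[action][i] += alpha * (target - q_sa) * feature
        if step % sync_every == 0:
            w_minus = copy.deepcopy(w)                    # w_minus <- w
            syncs += 1
    return w, w_minus, syncs

# Hypothetical transitions (s, a, r, s') over two one-hot encoded states.
transitions = [
    ([1.0, 0.0], 1, 0.0, [0.0, 1.0]),
    ([0.0, 1.0], 1, 1.0, [0.0, 1.0]),
]
w, w_minus, syncs = train(transitions, num_actions=2, state_dim=2,
                          steps=100, sync_every=25)
print(syncs, w == w_minus)
```

Between two synchronizations the targets depend only on `w_minus`, so they do not move while `w` is being updated.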

(That is all for now; editing this here is quite a hassle. The version of this article on my blog is complete.)