Deep Q-Learning Recap

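This article is reposted from my blog post: Deep Q-Learning Recap (some of the formatting may render better in the original post).

This article is based on the lecture of 02/20/2017 in CMU 10703 Deep Reinforcement Learning. Since I took that course, and since the slides alone may be hard to follow for those who did not attend the lecture, I have summarized the main content and expanded on each key point, so this can be read as a set of lecture notes. The lecture focuses on Deep Q-Learning, DQN, and related extensions. I wrote it in English because that came more naturally, so please bear with me. If anyone would like to translate it, please contact me for permission. If you spot any mistakes, please point them out. Thank you.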



Reinforcement Learning Components Review

A reinforcement learning agent may include one or more of the following components:

  • Policy: A function representing the agent's behavior,
  • Value Function: A function indicating how good each state and/or action is,
  • Model: The agent's representation of the environment.

Reinforcement learning methods can be separated into two types: model-based approaches and model-free approaches. Model-based approaches include a model that represents the state transitions of the external environment, usually learnt from the agent's past experience, and exploit that model to compute a value function from which a good policy can be derived. Model-free approaches, on the other hand, do not include any model of the external environment; their two primary methods are direct policy search and value function iteration. In direct policy search, the agent observes environment transitions at each step and uses that information to update its policy directly, without storing any information about the dynamics of the environment. In value function iteration, the agent uses the transition information at each step to update its value function, and derives its policy from that value function.

RL Review: Policy

A policy represents an agent’s behavior at each state, which maps a state to an action.

A deterministic policy maps each state to one specific action:

\boldsymbol{a} \leftarrow \pi \left( \boldsymbol{s} \right)

where \boldsymbol{s} is the state the agent is at, \pi is the policy that the agent executes, and \boldsymbol{a} is the action to take based on policy \pi.

A stochastic policy instead defines a probability distribution over the actions available at each state. In a given state, the agent does not commit to one specific action; each action is taken with its own probability. Formally, a stochastic policy can be expressed as

P \left( \boldsymbol{a} \middle| \boldsymbol{s} \right) \leftarrow \pi \left( \boldsymbol{s}, \boldsymbol{a} \right)

where \boldsymbol{s} is the state the agent is at, \boldsymbol{a} is a possible action to take at state \boldsymbol{s}, \pi is the policy that the agent executes, and P \left( \boldsymbol{a} \middle| \boldsymbol{s} \right) is the probability that the agent takes action \boldsymbol{a} at state \boldsymbol{s} under policy \pi.
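
To make the two kinds of policy concrete, here is a minimal sketch in Python. The states `s0` and `s1`, the actions `left` and `right`, and all the probabilities are made up for illustration; they do not come from the lecture.

```python
import random

# Hypothetical toy problem: two states, two actions.
ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a <- pi(s): each state maps to exactly one action."""
    table = {"s0": "right", "s1": "left"}
    return table[state]

def stochastic_policy(state):
    """Return P(a | s) for every action as a dict (values are assumptions)."""
    table = {
        "s0": {"left": 0.2, "right": 0.8},
        "s1": {"left": 0.5, "right": 0.5},
    }
    return table[state]

def sample_action(state, rng):
    """Draw one action according to P(a | s)."""
    probs = stochastic_policy(state)
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]

rng = random.Random(0)
print(deterministic_policy("s0"))  # always the same action for s0
print(sample_action("s0", rng))    # may differ from call to call
```

The deterministic policy is a plain lookup, while the stochastic policy returns a distribution that must be sampled before the agent can act.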

RL Review: Value Function

A value function is a prediction of the long-term future rewards of a state, or of an action taken at a specific state. A state value function V predicts how much reward the agent will get after reaching a state \boldsymbol{s}, while a state-action value function Q, also called a Q-value function, predicts how much reward the agent will get by taking an action \boldsymbol{a} at state \boldsymbol{s}.

Q-value Function

The Q-value function gives the expected total reward from state \boldsymbol{s} and action \boldsymbol{a} under policy \pi with discount factor \gamma:

Q^{\pi} \left( \boldsymbol{s}, \boldsymbol{a} \right) = \mathbb{E} \left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \middle| \boldsymbol{s}, \boldsymbol{a} \right]

Such a function can also be decomposed into the Bellman equation form

Q^{\pi} \left( \boldsymbol{s}, \boldsymbol{a} \right) = \mathbb{E}_{ \boldsymbol{s}', \boldsymbol{a}' } \left[ r + \gamma Q^{\pi} \left( \boldsymbol{s'}, \boldsymbol{a'} \right) \middle| \boldsymbol{s}, \boldsymbol{a} \right]
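
As a small illustration of the quantity inside the expectation, the sketch below computes the discounted sum of rewards for a single trajectory. The function name `discounted_return` and the example rewards are assumptions made for this sketch; the true Q-value is the expectation of this sum over many trajectories.

```python
def discounted_return(rewards, gamma):
    """Compute r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ... for one trajectory."""
    total = 0.0
    # Walk backwards so each step applies G = r + gamma * G_next,
    # which is the same recursion the Bellman equation expresses.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```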

Optimal Q-value Function

The optimal Q-value function gives the maximum achievable value after taking action \boldsymbol{a} at state \boldsymbol{s}, formally

Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right) \leftarrow \max_{\pi} Q^{\pi} \left( \boldsymbol{s}, \boldsymbol{a} \right) \equiv Q^{ \pi^{*} } \left( \boldsymbol{s}, \boldsymbol{a} \right)

where \pi^{*} is the optimal policy the agent could execute. Conversely, once we have the optimal Q-value function, the optimal policy follows directly by acting greedily:

\pi^{*} \left( \boldsymbol{s} \right) = \arg\max_{ \boldsymbol{a} } Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right)

Moreover, an optimal Q-value function could formally decompose into the Bellman Equation form as

Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right) = \mathbb{E}_{ \boldsymbol{s'} } \left[ r + \gamma \max_{ \boldsymbol{a'} } Q^{*} \left( \boldsymbol{s'}, \boldsymbol{a'} \right) \middle| \boldsymbol{s}, \boldsymbol{a} \right]

Intuitively, the optimal Q-value function gives the expected total discounted reward obtained in the future when action \boldsymbol{a} is taken at state \boldsymbol{s} and the reward-maximizing action sequence is followed afterwards.
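
The Bellman form of the optimal Q-value function can be turned into an iterative computation when the environment is small and fully known. The sketch below does this on a made-up deterministic three-state problem, so the expectation over \boldsymbol{s'} collapses to a single next state. The environment, the names `MODEL`, `q_value_iteration`, and `greedy_policy`, and the value \gamma = 0.9 are all assumptions for illustration.

```python
GAMMA = 0.9
TERMINAL = 2
ACTIONS = ["stay", "go"]

# Hypothetical deterministic toy MDP: (state, action) -> (reward, next_state).
MODEL = {
    (0, "stay"): (0.0, 0),
    (0, "go"): (0.0, 1),
    (1, "stay"): (0.0, 1),
    (1, "go"): (1.0, 2),
}

def q_value_iteration(sweeps=200):
    """Repeatedly apply Q(s, a) <- r + gamma * max_a' Q(s', a')."""
    q = {key: 0.0 for key in MODEL}
    for _ in range(sweeps):
        new_q = {}
        for (s, a), (r, s_next) in MODEL.items():
            if s_next == TERMINAL:
                best_next = 0.0  # no future reward after the terminal state
            else:
                best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
            new_q[(s, a)] = r + GAMMA * best_next
        q = new_q
    return q

def greedy_policy(q, state):
    """pi*(s) = argmax_a Q*(s, a)."""
    return max(ACTIONS, key=lambda a: q[(state, a)])

q_star = q_value_iteration()
print(q_star)
print(greedy_policy(q_star, 0), greedy_policy(q_star, 1))
```

In this toy problem the greedy policy chooses `go` in both non-terminal states, since delaying the reward only discounts it further.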

Deep Reinforcement Learning

As mentioned above, there are roughly three primary approaches to solving reinforcement learning problems:

  • Value-based approach, which estimates the optimal Q-value function Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right), that is, the maximum sum of rewards achievable in the future
  • Policy-based approach, which searches directly for the optimal policy \pi^{*}, that is, the policy achieving maximum future rewards
  • Model-based approach, which builds a model of the environment and plans a policy using that model (e.g. by look-ahead)

Deep reinforcement learning uses deep neural networks to represent the components of classic reinforcement learning problems. The components that can be represented by deep neural networks are

  • Value function
  • Policy
  • Model

Because computing exact gradients over all the data is computationally expensive, stochastic gradient descent (SGD) is usually employed to optimize the loss functions of these neural networks.

Deep Q-Networks

An advantage of using neural networks to represent a Q-value function is that they can handle continuous state and action spaces. When both the state space and the action space are continuous, the usual Q-Network structure takes both the state and the action as input and outputs a single number indicating the value of taking that action in that state. When the action space is discrete, however, there are some tricks to save computational resources.

The following are two structures of Deep Q-Networks design.

The left DQN structure is generally applicable to both discrete and continuous action spaces, while the structure on the right, proposed by Google DeepMind (reference requested), is applicable to discrete action spaces. A significant advantage of the structure on the right is that it generates the values of all actions for a given state in one pass, which saves a lot of computation.
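
The difference between the two structures can be sketched with their input and output shapes. In the sketch below, plain linear functions stand in for the deep networks, and the weights, the state vector, and the one-hot action encoding are all assumptions made for illustration.

```python
def q_single_output(weights, state, action_one_hot):
    """Left structure: (state, action) goes in, a single Q-value comes out."""
    inputs = state + action_one_hot  # concatenate the two input vectors
    return sum(w * x for w, x in zip(weights, inputs))

def q_all_actions(weight_rows, state):
    """Right structure: only the state goes in, one Q-value per action comes out."""
    return [sum(w * x for w, x in zip(row, state)) for row in weight_rows]

state = [1.0, 2.0]
one_hots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Left structure: one forward pass per action is needed to find the best action.
w_left = [0.1, 0.2, 0.0, 0.5, 1.0]
values_left = [q_single_output(w_left, state, a) for a in one_hots]

# Right structure: a single forward pass yields the values of all actions.
w_right = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
values_right = q_all_actions(w_right, state)

print(values_left, values_right)
```

With three discrete actions, taking the maximum over actions costs three evaluations of the left structure but only one evaluation of the right structure.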

Our learning goal here is to learn the parameter vector \boldsymbol{w}, which contains all the weights of a Q-Network that approximates the true optimal value function Q^{*}:

Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right) \approx Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right)

for each state \boldsymbol{s} in the state space and each action \boldsymbol{a} in the action space. However, the actual optimal Q-value function Q^{*} is not available; if it were, there would be no need to train a Q-Network to approximate it, and we would simply use it directly. Formally, the true optimal Q-value function satisfies

Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right) = \mathbb{E}_{ \boldsymbol{s'} } \left[ r + \gamma \max_{ \boldsymbol{a'} } Q^{*} \left( \boldsymbol{s'}, \boldsymbol{a'} \right) \middle| \boldsymbol{s}, \boldsymbol{a} \right]

where the right-hand side of this equation is the learning target given state \boldsymbol{s} and action \boldsymbol{a}. We cannot obtain the real value of the expectation term, however. What we can do instead is assume that the Q-Network trained so far is a good approximation to the true Q-value function, and approximate the right-hand side with our current Q-Network, which is formally

r + \gamma \max_{a'} Q \left( \boldsymbol{s'}, \boldsymbol{a'}, \boldsymbol{w} \right) \approx \mathbb{E}_{ \boldsymbol{s'} } \left[ r + \gamma \max_{ \boldsymbol{a'} } Q^{*} \left( \boldsymbol{s'}, \boldsymbol{a'} \right) \middle| \boldsymbol{s}, \boldsymbol{a} \right] \equiv Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right)

Our learning goal then becomes an optimization problem: minimizing the mean-squared error (MSE) loss

\begin{aligned}
l & = \left( Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right) - Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right) \right)^{2} \\
  & \approx \left( r + \gamma \max_{ \boldsymbol{a'} } Q \left( \boldsymbol{s'}, \boldsymbol{a'}, \boldsymbol{w} \right) - Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right) \right)^{2}
\end{aligned}
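
For a single observed transition, the approximate loss reduces to a few arithmetic steps. The sketch below assumes the Q-values of the next state are already available as a list; the function names, the numbers, and the `done` flag for terminal transitions are assumptions added for illustration and are not part of the lecture notes.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """Learning target r + gamma * max_a' Q(s', a', w)."""
    if done:
        # Assumption: no future reward is added after a terminal transition.
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_loss(q_sa, target):
    """Squared error between the target and the current estimate Q(s, a, w)."""
    return (target - q_sa) ** 2

# Hypothetical numbers for one transition.
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9)
loss = squared_td_loss(q_sa=2.0, target=target)
print(target, loss)  # target is about 2.8, loss is about 0.64
```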

We then apply stochastic gradient descent (SGD) to minimize the above MSE loss so as to bring the Q-Network closer to the true optimal Q-value function. We use stochastic gradient descent rather than full gradient descent because of the computational cost. Given enough training epochs, SGD can bring the weights of the Q-Network close to those that full gradient descent would produce.

If the Q-value function is represented by a lookup table, the table will converge to the true Q-value function Q^{*}. But in continuous cases, where a Q-Network is used, training may diverge due to:

  • Correlations between samples (i.e. the state-action pairs show up in a specific order in the training set)
  • Non-stationary targets (i.e. the Q-Network changes after each training epoch which leads to non-stationary targets for the learning process)

In the following section, we are going to discuss how to overcome these undesirable effects.

Experience Replay

A trick called experience replay is designed to handle the correlations between samples. If there are strong correlations between samples, the Q-Network may be biased in one direction when trained on one set of samples and in another direction when trained on another set. To handle this problem, we need to break the correlations between samples, so that the Q-Network is trained towards an average direction. Experience replay was proposed to meet this need.

In experience replay, past state-action pairs are stored in a memory of limited or unlimited length: the memory either keeps a fixed number of the most recent state-action pairs or keeps all of them. Then, in each learning epoch, a small number of state-action pairs are chosen at random from the memory to form a training set for the Q-Network; such a set of training samples is called a mini-batch. This breaks the correlations because the pairs are drawn from the memory at random rather than in the order they were observed. The observation order therefore no longer matters in training, and the correlations are largely removed.

Formally, when remembering experience, each new experience, a state-action pair together with its reward and next state, is stored in the memory:

D_{t+1} \leftarrow D_{t} \cup \left\{ \left\langle \boldsymbol{s}_{t}, \boldsymbol{a}_{t}, r_{t+1}, \boldsymbol{s}_{t+1} \right\rangle \right\}

During training, multiple stored experiences are drawn uniformly at random from the memory to form a mini-batch for training the Q-Network:

\left\langle \boldsymbol{s}, \boldsymbol{a}, r, \boldsymbol{s'} \right\rangle \sim \text{Uniform} \left( D \right)
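
The store and sample steps above can be sketched with a small replay memory. The class name `ReplayMemory`, its `capacity` and `seed` parameters, and the choice to sample without replacement inside one mini-batch are assumptions made for this sketch.

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal replay memory holding (s, a, r, s') tuples."""

    def __init__(self, capacity=None, seed=None):
        # capacity=None keeps every transition (unlimited memory);
        # otherwise the oldest transitions are dropped once the memory is full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        """D_{t+1} <- D_t united with {(s_t, a_t, r_{t+1}, s_{t+1})}."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a mini-batch uniformly at random, ignoring observation order."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3, seed=0)
for t in range(5):
    memory.store(state=t, action=0, reward=1.0, next_state=t + 1)

print(len(memory))       # only the 3 most recent transitions remain
print(memory.sample(2))  # 2 transitions picked at random
```

A bounded `deque` gives the limited-length variant for free, and passing `capacity=None` gives the unlimited one.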

Stochastic gradient descent is then applied to update the weight vector of the Q-Network by adding to it the difference

\triangle \boldsymbol{w} = - \alpha \frac{ \partial l }{ \partial \boldsymbol{w} }

where \alpha is the learning rate, l is the mean-squared loss as a function of the weight \boldsymbol{w}, and \triangle \boldsymbol{w} is the weight vector difference added to the current weight vector to update the Q-Network. The partial derivative of the loss function with respect to the weight vector is

\begin{aligned}
\frac{ \partial l }{ \partial \boldsymbol{w} }
    & = \frac{ \partial }{ \partial \boldsymbol{w} } \left( Q^{*} \left( \boldsymbol{s}, \boldsymbol{a} \right) - Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right) \right)^{2} \\
    & \approx \frac{ \partial }{ \partial \boldsymbol{w} } \left( r + \gamma \max_{ \boldsymbol{a'} } Q \left( \boldsymbol{s'}, \boldsymbol{a'}, \boldsymbol{w}^{-} \right) - Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right) \right)^{2} \\
    & = - 2 \left( r + \gamma \max_{ \boldsymbol{a'} } Q \left( \boldsymbol{s'}, \boldsymbol{a'}, \boldsymbol{w}^{-} \right) - Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right) \right) \triangledown_{ \boldsymbol{w} } Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right)
\end{aligned}

where the coefficient 2 can be considered as part of the learning rate \alpha, so the weight vector difference can be formally expressed as

\triangle \boldsymbol{w} = \alpha \left( r + \gamma \max_{ \boldsymbol{a'} } Q \left( \boldsymbol{s'}, \boldsymbol{a'}, \boldsymbol{w}^{-} \right) - Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right) \right) \triangledown_{ \boldsymbol{w} } Q \left( \boldsymbol{s}, \boldsymbol{a}, \boldsymbol{w} \right)

where \boldsymbol{w}^{-} \equiv \boldsymbol{w} in the above two equations: it has the same value as the weight vector but is treated as a constant when taking the derivative.
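
The update rule can be sketched for the simplest differentiable case, a linear Q function, where the gradient of Q with respect to \boldsymbol{w} is just the feature vector. The linear form, the feature vectors, and the names `q_value` and `sgd_step` are assumptions for illustration; a deep Q-Network would obtain the gradient by backpropagation instead.

```python
def q_value(weights, features):
    """Linear stand-in for Q(s, a, w): dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))

def sgd_step(weights, features_sa, reward, next_features_per_action, gamma, alpha):
    """Return w + alpha * (target - Q(s, a, w)) * grad_w Q(s, a, w)."""
    # w^-: same values as w, but frozen, so no gradient flows through the target.
    frozen = list(weights)
    target = reward + gamma * max(
        q_value(frozen, f) for f in next_features_per_action
    )
    td_error = target - q_value(weights, features_sa)
    # For a linear Q, grad_w Q(s, a, w) equals the feature vector of (s, a).
    return [w + alpha * td_error * x for w, x in zip(weights, features_sa)]

# Hypothetical features for (s, a) and for each (s', a').
new_w = sgd_step(
    weights=[0.0, 0.0],
    features_sa=[1.0, 0.0],
    reward=1.0,
    next_features_per_action=[[0.0, 1.0], [1.0, 1.0]],
    gamma=0.9,
    alpha=0.5,
)
print(new_w)  # [0.5, 0.0]
```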

Fixed Parameters

To deal with non-stationarity, the weight vector \boldsymbol{w}^{-} in the learning target term is held fixed. If this parameter is held fixed instead of changing after each training epoch, the non-stationary target issue is resolved, because the training target is now fixed, that is, stationary.

In practice, however, the target weight vector cannot be held fixed forever; otherwise the Q-Network could never get close to the true Q-value function. To train the Q-Network in practice, we need to update the target weight vector after a certain number of training epochs.

So intuitively, there are actually two Q-Networks: the training network with parameter \boldsymbol{w} and the target network with parameter \boldsymbol{w}^{-}. The training network is updated at each training epoch, and after a certain number of training epochs, its parameters are copied to the target network:

\boldsymbol{w}^{-} \leftarrow \boldsymbol{w}

The training target is thus stationary for a certain number of training epochs, after which it is updated with the new weight vector; this process repeats throughout training.
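
The interplay between the two sets of parameters can be sketched as a training loop with a periodic copy. To stay self-contained, the sketch below uses lookup tables in place of the two networks and a made-up deterministic environment; the names `train`, `online`, `target`, and `sync_every`, and all constants, are assumptions for illustration.

```python
import copy

GAMMA = 0.9
ALPHA = 0.5
TERMINAL = 2
ACTIONS = ["stay", "go"]

# Hypothetical deterministic transitions: (state, action, reward, next_state).
TRANSITIONS = [
    (0, "stay", 0.0, 0),
    (0, "go", 0.0, 1),
    (1, "stay", 0.0, 1),
    (1, "go", 1.0, 2),
]

def train(num_updates, sync_every=4):
    online = {(s, a): 0.0 for s, a, _, _ in TRANSITIONS}  # plays the role of w
    target = copy.deepcopy(online)                         # plays the role of w^-
    sync_count = 0
    for step in range(1, num_updates + 1):
        s, a, r, s_next = TRANSITIONS[(step - 1) % len(TRANSITIONS)]
        if s_next == TERMINAL:
            best_next = 0.0
        else:
            best_next = max(target[(s_next, a2)] for a2 in ACTIONS)
        y = r + GAMMA * best_next  # the target reads only the frozen table
        online[(s, a)] += ALPHA * (y - online[(s, a)])
        if step % sync_every == 0:
            target = copy.deepcopy(online)  # w^- <- w
            sync_count += 1
    return online, target, sync_count

online, target, sync_count = train(400)
print(online)
print(sync_count)
```

Between two copies the targets do not move, so each block of updates solves a fixed regression problem; the copy then moves the targets forward in one step.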

(That is all for now; editing this here is quite a hassle... The article on my blog is the complete version.)

Edited on 2017-03-16