[Academia] Chapter 5: Python Code for the Model and Q-Learning

Author: @小杨
This article is published with @小杨's permission. Original link: 【学界】第五章:Python代码之模型与Q-Learning


About the author: a graduate student (class of 2016) in Industrial Engineering at Guangdong University of Technology, supervised by Prof. 陈庆新 (professor, doctoral supervisor, and editor-in-chief of the journal 《工业工程》) and by 张婷 (lecturer at Guangdong University of Technology, PhD from the University of Hong Kong). During the first year and the beginning of the second year of graduate study, the author published one paper in 《系统工程理论与实践》 (an EI-indexed, top domestic engineering journal) and two international conference papers (International Conference on Networking, Sensing and Control 2018; Healthcare Operations Management SIG 2018 Conference), gave a poster presentation at an international conference (Manufacturing and Service Operations Management Conference 2018), and has one paper under first-round review at the Journal of Cleaner Production. Current research interests focus on applications at the intersection of machine learning and operations research.

This article records some personal reflections and summarizes the previous stage of work. As this is my first attempt at research and my experience is limited, I hope the summary helps me reflect, and perhaps I will be lucky enough to receive comments from more experienced peers.
Research topic:

[ 社区居家养老调度与路径规划问题 ]
[ Community Home Health Care Scheduling and Routing Problem ]

------Contents------
Chapters 1-3: Topic Selection, Literature Review Methods, Problem Analysis, and Model Assumptions

Chapter 4: Markov Decision Process & Chance-Constrained Model & Multi-Person Stochastic Route Planning

Chapter 5: Python Code for the Model and Q-Learning

Chapter 6: Python Code for the Ant Colony Optimization Part

Chapter 7: Python Code for Module Integration and the Main Program

Chapter 8: Python Code for Data Export, Parameter Tuning, and Algorithm Summary

Five months have passed since the previous article on the model. In the meantime I was busy with repeated revisions of an SCI paper and with preparing for the IELTS exam (an overall 7 made me quite happy, even though the writing sub-score was only 5.5, orz...), so only now do I get to write this summary and tidy up the earlier code. Enough preamble; let's get to the point.

------5 Python Code for the Model and Q-Learning------

This article summarizes the work from the angle of how to implement the algorithm in Python. As the title suggests, the heuristic combines Q-learning with Ant Colony Optimization. Its core idea is reinforcement learning: search for an acceptable near-optimal solution through repeated trial and error, and update the Q-matrix according to a reward mechanism. The ultimate goal is to train a Q-matrix such that, when the algorithm makes decisions according to it, a satisfactory global solution (optimality is hardly attainable) is obtained.

During training, the algorithm evaluates the quality of each solution with an evaluation function and stores the good solutions it finds, to be used later as benchmarks when testing the training result. The agent training procedure is outlined below.

Outline of the QL-ACO training algorithm

The whole program can therefore be divided into four parts:

Data Representation(DR)
Best-Worst Ant Colony Optimization(BWACO)
Q-Learning(QL)
Results Exporting(RE)

Packages required:

import numpy as np
import copy
import pandas as pd
import math
import random
import time
import matplotlib.pyplot as plt
import xlwt

------5.1 Reading and Initializing the Instance Data------

Nurse: label, skill level (skill), total working time (total_time), total waiting time (total_waiting_time), and average waiting time per job (average waiting time of jobs). To make data recording convenient, a nurse also keeps a list for the route (self.r), a list for the number of demands served at each level (self.sd), and a list for the arrival time at each task (self.aT).

The idea here is to store all the information of each route as attributes of the nurse object, so that later the data can be processed and exported simply by accessing nurse.attribute, and attributes can never get mismatched.

Job: label, the elder it belongs to (elder), demand level (level), coordinates (coordinate_x, coordinate_y, coordinate_z), and time window (time_window_begin, time_window_end).

class Nurses:
    # Use '__slots__' to save memory
    __slots__ = ['l', 's', 'tt', 'twt', 'avg_w', 'r', 'sd', 'aT']

    def __init__(self, label, skill):
        self.l = label
        self.s = skill
        # Total workload
        self.tt = 0
        # Total waiting time
        self.twt = 0
        # Average waiting time by job
        self.avg_w = 0
        # Visiting targets in line
        self.r = []
        # Fulfilled demand numbers by level
        self.sd = [0, 0, 0]
        # Arrival time at every target
        self.aT = []


class Jobs:
    # Use '__slots__' to save memory
    __slots__ = ['l', 'e', 'lv', 'c', 'twb', 'twe']

    def __init__(self, label, elder, level, coordinate_x, coordinate_y, coordinate_z,
                 time_window_begin, time_window_end):
        self.l = label
        self.e = elder
        self.lv = level
        self.c = [coordinate_x, coordinate_y, coordinate_z]
        self.twb = time_window_begin
        self.twe = time_window_end
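
For instance, a hypothetical snippet like the one below (labels, coordinates and times are made up; in the real program the route fields are filled in by aco()) shows how a route's data hang off a nurse object:

nurse = Nurses(0, 2)                                  # nurse no.0 with skill level 2
job = Jobs(3, 2, 1, 140.0, 110.0, 2.0, 60.0, 120.0)   # a made-up level-1 job
nurse.r.append(job)                                   # record the visit in the route
nurse.aT.append(75.0)                                 # arrival time at this job
nurse.sd[job.lv - 1] += 1                             # one more level-1 demand fulfilled
print(nurse.s, nurse.r[0].l, nurse.aT[0], nurse.sd)   # -> 2 3 75.0 [1, 0, 0]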

Since the experimental data are stored in Excel spreadsheets, I use pandas to read them. A function is built whose inputs are the instance file name and scale and whose outputs are a list of Job objects and the distance matrix between the elders; the data are read and stored according to the layout of the sheet.

You will notice that a block of the code is dedicated to building the first Job object, i.e. the task ranked first in the list, which should be understood as "returning to the depot". The reason for this setup was mentioned in the previous article; it is mainly for convenience and efficiency in coding and computation. For example, to compute a nurse's total working time, it suffices to record the moment the nurse "returns to the depot".

The get_data function here does three things: 1) reads the data, 2) builds the list of Job objects from it, and 3) computes the distance matrix. Because a function can both modify its arguments and return something, this function modifies the jobs list passed in and, at the same time, returns a distance matrix.

Example data sheet (excel file)
def get_data(f_index, f_scale, jobs):
    # Setup statement
    file = './Elders_' + f_index + '.xlsx'

    # Temporal lists
    elders_index = []
    job_num = []
    job_level = []
    elder_location = []
    job_coordinate = np.zeros((f_scale, 3))
    time_window = np.zeros((f_scale, 2))

    # Read out column 'JobNum', 'Indexes', and 'JobLevel' from excel file
    excel = pd.read_excel(file, sheet_name='Sheet1')
    job_num.append(0)
    job_num += (list(copy.deepcopy(excel['JobNum'].values)))
    elders_index.append(0)
    elders_index += (list(copy.deepcopy(excel['Indexes'].values)))
    job_level.append(0)
    job_level += (list(copy.deepcopy(excel['JobLevel'].values)))

    # The first job is defined as the depot with coordinate (125, 125, 0)
    job_coordinate[0][0] = 125
    job_coordinate[0][1] = 125
    job_coordinate[0][2] = 0
    time_window[0][0] = 0.00
    time_window[0][1] = 480.00

    # Read out coordinates and time windows
    xyz = np.vstack((copy.deepcopy(excel['X'].values), copy.deepcopy(excel['Y'].values), copy.deepcopy(excel['Z'].values)))
    for i in range(len(xyz[0])):
        job_coordinate[i+1][0] = xyz[0][i]
        job_coordinate[i+1][1] = xyz[1][i]
        job_coordinate[i+1][2] = xyz[2][i]
    tw = np.vstack((copy.deepcopy(excel['TWB'].values), copy.deepcopy(excel['TWE'].values)))
    for i in range(len(tw[0])):
        time_window[i+1][0] = tw[0][i]
        time_window[i+1][1] = tw[1][i]

    # Read out locations labelled by elders for computing distance matrix
    lo = []
    for i in range(f_scale):
        lo.append([elders_index[i], job_coordinate[i][0], job_coordinate[i][1], job_coordinate[i][2]])
    for i in range(f_scale):
        if lo[i] in elder_location:
            continue
        else:
            elder_location.append(lo[i])

    # Build job classes and stack them into a list
    for fs in range(f_scale):
        jobs.append(
            Jobs(fs, elders_index[fs], job_level[fs], job_coordinate[fs][0], job_coordinate[fs][1],
                 job_coordinate[fs][2], time_window[fs][0], time_window[fs][1]))

    # Build distance matrix and return it
    num = len(elder_location)
    distance = np.zeros((num, num))
    for i in range(num):
        for j in range(num):
            hD = math.sqrt(pow((elder_location[i][1] - elder_location[j][1]), 2) + pow(
                (elder_location[i][2] - elder_location[j][2]), 2))
            if hD == 0:
                distance[i][j] = distance[j][i] = 9.6 * abs(elder_location[i][3] - elder_location[j][3])
            else:
                distance[i][j] = distance[j][i] = hD + elder_location[i][3] + elder_location[j][3]
    return distance
Nurse objects are deliberately not created in this part, for three reasons.
First, nurses are a resource to be allocated; it is impossible to know how many will be used before a plan is made, so it is better to create a nurse object on the spot, and store it, whenever a nurse of a certain skill level is needed.
Second, both the order in which nurses are used and their number affect the result.
Third, this reduces resource usage.
Although the objects are created during the process, we still need to specify beforehand how many nurses, and of which kinds, are available.

Now the training instance can be initialized. The following need to be set up:

1) Demand data

2) Nurse resources

3) The preference matrix

4) The skill-matching parameter matrix

Since the service time a nurse generates after returning to the depot is 0, an all-zero row (one entry per nurse) is built,
np.zeros((1, len(e_nurses_skill)))
and stacked on top of the random preference matrix prepared beforehand:
e_random_pre = np.loadtxt("./initial_preference_ABC.csv", delimiter=",", skiprows=0)
e_preference_matrix = np.row_stack((np.zeros((1, len(e_nurses_skill))), e_random_pre))
    """------Instance representations------"""
    # Get elder data and distance matrix
    file_index = 'A'
    file_scale = 61
    e_jobs = []
    e_init_distance_matrix = QL_BWACO.get_data(file_index, file_scale, e_jobs)
    e_depot = e_jobs[0]
    # Set up nurse resource
    e_nurses_skill = [1, 2, 2, 3, 3, 3, 3]
    # Build initial preference matrix
    e_random_pre = np.loadtxt("./initial_preference_ABC.csv",delimiter=",",skiprows=0)
    e_preference_matrix = np.row_stack((np.zeros((1,len(e_nurses_skill))), e_random_pre))
    # Skill demand matching parameter matrix
    e_service_time_mean = np.array([[0, 0, 0, 0],
                                    [0, 25, -1, -1],
                                    [0, 20, 30, -1],
                                    [0, 18, 20, 20]])
    e_walk_speed = 60

------5.2 Q-Learning------

Q-Learning is a reinforcement learning algorithm. Its distinguishing feature is that it is model-free: complete environment data are not required. As long as the following three conditions hold, a Q-matrix can in theory be obtained by training:

1. the state space is finite;
2. the reward function is reasonable;
3. the state representation is reasonable.
*In practice, "reasonable" turns out to be hard to achieve, especially for a combinatorial optimization problem as complex as the one studied here. But let us set the results aside for now and look at the implementation first.

Literature and online articles on Q-Learning:

  • Watkins, C.J.C.H., 1989. Learning from delayed rewards. King's College, Cambridge.
  • Watkins, C.J., Dayan, P., 1992. Q-learning. Machine learning 8(3-4), 279-292.
  • Rodrigues Gomes, E., Kowalczyk, R., 2009. Dynamic analysis of multiagent Q-learning with ε-greedy exploration, Proceedings of the 26th Annual International Conference on Machine Learning. ACM, pp. 369-376.
  • How to explain the process of Q-learning with a simple example? - answer by @牛阿 - Zhihu: zhihu.com/question/2640

The definitions of the state and the reward function were given in the previous chapter; the formulas are repeated here for easy recall:

State definition
$d, d' \in D$ represent the levels of demands, and $s_{nd}, s_{nn}$ are two absorbing states. The size of the state space is thus reduced to the number of permutations of demand levels plus the absorbing states, $2 + C_{D}^{2}$, and the size of the Q-matrix is $m \cdot (2 + C_{D}^{2})$. Note that we only consider two size relationships between any two demand levels, namely $\geq$ and $<$.
Reward function
$\Delta d_{ss'}^{i}$ is the amount of demand at level $i$ that is fulfilled by the chosen nurse. Different values of $\lambda_{wh}$ give different relative weights to fulfilled demands and waiting time: $R_{hd}(\cdot)$ provides a reward related only to the number of demands a nurse serves when $\lambda_{wh}=0$, and trades off waiting time against fulfilled demands when $\lambda_{wh}\ne 0$.
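
Written out to match the reward computation in q_learning() below, one way to state it is (here $W_{n}$ is a symbol introduced only for this note, standing for the chosen nurse's total waiting time, i.e. q_nurses_list[n].twt in the code, and $\lambda_{wh}$ corresponds to waiting_para):

$$R_{hd}(s,s') = \sum_{i\in D}\Delta d_{ss'}^{i} + \lambda_{wh}\, W_{n}$$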

Next comes the Q-value update function.

Note that each Q value corresponds to one state-action combination: every pair (s, a) has a value q(s, a). All Q values therefore form a two-dimensional matrix whose rows are states and whose columns are actions, with the Q values as its elements.

Q value update function
Although the policy is learned from the historical Q values, the agent selects actions with the ε-greedy heuristic rule: it takes the action with the best known Q value with probability 1 − ε and takes another action with probability ε (Rodrigues Gomes and Kowalczyk, 2009).
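
Written out to match the update code in q_learning() below, with $\alpha$ being learning_rate_para, $\gamma$ being discount_para, and $r$ the reward defined above, the update is the standard Q-learning rule:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$$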

With that, we can start writing the code. First, the inputs and outputs:

Inputs:
Variables:
Demand set (the set of candidate visiting targets)
target_set
Decision space (the set of available nurse skills)
skill_set
Q-matrix
q_matrix
List of nurse objects to be filled (initially empty)
q_nurses_list
Constants:
Algorithm parameters
learning_rate_para
discount_para
greedy_para
waiting_para

Outputs:
Variables:
An ordered list of nurse objects, each carrying the information of its sub-solution
*q_nurses_list
Constants:
Total workload and total waiting time over all nurses
total workload
total waiting time

Now the flow chart of the training algorithm and the code:

Q Learning training flow chart
def q_learning(q_nurses_list, target_set, skill_set,
               q_matrix, learning_rate_para, discount_para, greedy_para, waiting_para, absorbing_state,
               # ACO variables
               ccp_acquaintance_increment, ccp_alpha, ccp_beta, ccp_waiting_limit, ccp_workload_limit,
               aco_alpha, aco_beta, aco_rho, aco_ant_number, changeable_pheromone_matrix,
               e_walk_speed, changeable_preference_matrix, e_init_distance_matrix, e_service_time_mean,
               depot):
    # Start a new QL episode
    q_sub_total_workload = 0
    q_sub_total_waiting = 0
    for n in range(len(skill_set)):
        # Identify current state
        current_state = state_identification(target_set, skill_set)
        if current_state in absorbing_state:
            # reach absorbing state
            break
        # Take action according to e-greedy
        chosen_skill = action_taking(current_state, q_matrix, skill_set, greedy_para)

        # Create nurse object in a list sequentially
        q_nurses_list.append(Nurses(n, chosen_skill))
        current_demand_num = count_demand_num(target_set)

        # Collect targets according to skill demand match rule
        sd_matched_targets = []
        for aj in range(len(target_set)):
            if target_set[aj].lv <= q_nurses_list[n].s:
                sd_matched_targets.append(target_set[aj])

        # Build route by ACO algorithm
        aco(ccp_acquaintance_increment, ccp_alpha, ccp_beta, ccp_waiting_limit, ccp_workload_limit,
            aco_alpha, aco_beta, aco_rho, aco_ant_number, changeable_pheromone_matrix,
            e_walk_speed, changeable_preference_matrix, e_init_distance_matrix, e_service_time_mean,
            q_nurses_list[n], sd_matched_targets,
            depot)

        # Remove fulfilled demands
        for o in range(len(q_nurses_list[n].r)):
            if q_nurses_list[n].r[o].l != 0:
                for b in range(len(target_set)):
                    if target_set[b].l == q_nurses_list[n].r[o].l:
                        target_set.remove(target_set[b])
                        break

        # Calculate fulfilled demands
        remaining_demands = count_demand_num(target_set)
        fulfilled_demands = sum([x-y for x,y in zip(current_demand_num, remaining_demands)])
        # Calculate reward
        reward = fulfilled_demands + waiting_para * q_nurses_list[n].twt
        next_state = state_identification(target_set, skill_set)
        next_max_q = np.max(q_matrix[next_state])
        # Calculate q value
        q_value = (1 - learning_rate_para) * q_matrix[current_state][chosen_skill - 1] \
                  + learning_rate_para * (reward + discount_para * next_max_q)
        # Update Q-matrix
        q_matrix[current_state][chosen_skill - 1] = float('%.2f' % (copy.deepcopy(q_value)))

        # Accumulate total workload
        q_sub_total_workload += q_nurses_list[n].tt
        q_sub_total_waiting += q_nurses_list[n].twt

    # Return total workload and waiting for this episode
    return [q_sub_total_workload, q_sub_total_waiting]

This function calls several sub-functions:

1. state_identification()

This is a simple classification routine: it works out the order relationship among the current demand quantities of each level and returns the corresponding label (the state value). Since Python has no switch-case statement, it is written with a chain of comparison operators. orz...

*A small bug appeared during testing: the function returned None. On closer inspection, the comparison chain did not cover the cases where two or even all three non-zero values are equal, so the > signs were changed to >=, which fixed the bug. Thinking it over, this reflects a lack of rigor in the mathematical definition: merging the strict inequality with equality effectively trims the state space once more, so the model's precision inevitably drops again. Fortunately, whether or not equality is merged in, the quality of the solutions produced by the algorithm differs little; after all, the amount of environment information either state space can express is limited.
def state_identification(target_set, skill_set):
    a = len(skill_set)
    dn = count_demand_num(target_set)
    if a != 0:
        d1 = dn[0]
        d2 = dn[1]
        d3 = dn[2]
        d = sum(dn)
        if d != 0:
            if d1 >= d2 >= d3:
                return 0    # 0-d1>=d2>=d3
            elif d1 >= d3 >= d2:
                return 1   # 1-d1>=d3>=d2
            elif d2 >= d1 >= d3:
                return 2   # 2-d2>=d1>=d3
            elif d2 >= d3 >= d1:
                return 3   # 3-d2>=d3>=d1
            elif d3 >= d1 >= d2:
                return 4   # 4-d3>=d1>=d2
            elif d3 >= d2 >= d1:
                return 5   # 5-d3>=d2>=d1
        else:
            return 6   # 6-no more demands
    else:
        return 7   # 7-no more nurses
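
A quick sanity check of the classification, using a few made-up Jobs objects and the count_demand_num() helper shown next:

# Hypothetical data: three remaining jobs with demand levels 1, 1 and 2
demo_jobs = [Jobs(1, 1, 1, 10, 10, 0, 0, 480),
             Jobs(2, 2, 1, 20, 20, 0, 0, 480),
             Jobs(3, 3, 2, 30, 30, 0, 0, 480)]
demo_skills = [1, 2, 3]
print(count_demand_num(demo_jobs))                   # -> [2, 1, 0]
print(state_identification(demo_jobs, demo_skills))  # -> 0, since d1 >= d2 >= d3
# With no remaining demands it would return 6, and with no remaining nurses, 7.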
2. count_demand_num()

This method counts the number of demands at each level in a set of Job objects. It is used to compute the reward, identify the state, record the remaining demands, and so on, and is called in four places.
Situations that call count_demand_num()
def count_demand_num(target_set):
    d = [0, 0, 0]
    for j in range(len(target_set)):
        if target_set[j].lv == 1:
            d[0] += 1
        if target_set[j].lv == 2:
            d[1] += 1
        if target_set[j].lv == 3:
            d[2] += 1
    return d
3. action_taking()

This method picks one skill level from the available nurse resources according to ε-greedy and returns it as the current action.
Its inputs are the current state, the Q-matrix, the set of available nurse skills, and the greedy parameter.
3.1 Generate a random number in the interval [0, 1] and compare it with the greedy parameter to choose between the two action policies (in this code the greedy parameter is the probability of exploiting, i.e. it plays the role of 1 − ε).
3.2 When acting according to the Q-matrix, take the Q values of the current state for the three skill levels and choose the skill level with the largest value as the action. If that skill level's resource is exhausted, set its Q value to -1 and take the skill level with the second-largest value, and so on. deepcopy() is used so the original matrix is not modified.
3.3 When exploring, shuffle the elements of the available skill set and return the first one.
def action_taking(state, q_matrix, skill_set, greedy):
    # Generate a constant randomly
    g = random.uniform(0, 1)
    if g < greedy:
        # Act according to maximum q value
        q_values = copy.deepcopy(q_matrix[state])
        s1 = np.argmax(q_values)
        if s1 + 1 in skill_set:
            skill_set.remove(s1 + 1)
            return s1 + 1
        else:
            q_values[s1] = -1
            s2 = np.argmax(q_values)
            if s2 + 1 in skill_set:
                skill_set.remove(s2 + 1)
                return s2 + 1
            else:
                q_values[s2] = -1
                s3 = np.argmax(q_values)
                skill_set.remove(s3 + 1)
                return s3 + 1
    else:
        # Act randomly as exploration
        skill = copy.deepcopy(skill_set)
        random.shuffle(skill)
        action = skill[0]
        skill_set.remove(action)
        return action
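
A small made-up example of how the method behaves (demo_q and demo_skills are names invented for this illustration):

demo_q = np.zeros((8, 3))          # one row per state returned by state_identification, one column per skill level
demo_q[0] = [0.5, 1.2, 0.3]        # fake Q values for state 0
demo_skills = [1, 2, 2, 3]
chosen = action_taking(0, demo_q, demo_skills, 0.9)
# With probability 0.9 the call exploits: the argmax of row 0 is index 1, i.e. skill 2,
# which is available, so 2 is returned and one copy of it is removed from demo_skills;
# with probability 0.1 it explores and returns a random available skill instead.
print(chosen, demo_skills)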
4. aco()

This is the ant colony optimization function; it returns the solution found by the ACO. The details are covered in Chapter 6.

At the end of q_learning(), the targets contained in the solution returned by aco() are removed from the set of unfulfilled demands, and the Q-matrix is then updated.

def q_learning(...):
    ...
        ...
        # Build route by ACO algorithm
        aco(...)

        # Remove fulfilled demands
        for o in range(len(q_nurses_list[n].r)):
            if q_nurses_list[n].r[o].l != 0:
                for b in range(len(target_set)):
                    if target_set[b].l == q_nurses_list[n].r[o].l:
                        target_set.remove(target_set[b])
                        break

        # Calculate fulfilled demands
        remaining_demands = count_demand_num(target_set)
        fulfilled_demands = sum([x-y for x,y in zip(current_demand_num, remaining_demands)])
        # Calculate reward
        reward = fulfilled_demands + waiting_para * q_nurses_list[n].twt
        next_state = state_identification(target_set, skill_set)
        next_max_q = np.max(q_matrix[next_state])
        # Calculate q value
        q_value = (1 - learning_rate_para) * q_matrix[current_state][chosen_skill - 1] \
                  + learning_rate_para * (reward + discount_para * next_max_q)
        # Update Q-matrix
        q_matrix[current_state][chosen_skill - 1] = float('%.2f' % (copy.deepcopy(q_value)))

        # Accumulate total workload
        q_sub_total_workload += q_nurses_list[n].tt
        q_sub_total_waiting += q_nurses_list[n].twt

    # Return total workload and waiting for this episode
    return [q_sub_total_workload, q_sub_total_waiting]

*The zip() function is used here to compute the number of demands of each level contained in the solution: count_demand_num() returns a list [d1, d2, d3], and Python lists cannot be subtracted element-wise directly, hence zip(). For more on this function see: Python zip() 函数 | 菜鸟教程
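
A standalone illustration of that subtraction, with made-up demand counts:

current_demand_num = [5, 3, 2]    # demands per level before this nurse's route
remaining_demands = [3, 3, 1]     # demands per level after the served jobs are removed
fulfilled_demands = sum([x - y for x, y in zip(current_demand_num, remaining_demands)])
print(fulfilled_demands)          # -> 3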

*deepcopy is also used when updating the Q value; the reasoning is that the '=' operator in Python binds a name to an object rather than copying it, and we do not want the Q-matrix entry to keep changing along with the variable q_value (strictly speaking, q_value is an immutable float, so the deepcopy here is only defensive). The float() call rounds the Q value to two decimal places before it is written back into the matrix.

*The last two statements add the waiting time and workload of the sub-solution just obtained to the global-solution variables; n is the index of the whole for loop and also points to the (n+1)-th object in the nurse list (indexing starts from 0).

With this, the Q-Learning function is complete. It returns the total waiting time and total workload of the global solution and fills q_nurses_list[] with nurse objects in the order given by action_taking(); the internal attributes of every nurse object created are updated inside aco() and q_learning().

In the main program, each call to q_learning() runs one episode, and every episode updates the Q-matrix. After enough episodes we obtain a trained Q-matrix and can then test whether the agent gives a satisfactory solution when it acts 100% according to that Q-matrix.
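
One possible sketch of that outer loop is given below. It assumes the instance data, the Q-matrix and all QL/ACO parameters from Section 5.1 and Chapter 6 have already been initialized; n_episodes and best_solution are names made up for this sketch, and total waiting time is used as a simple stand-in for the evaluation function mentioned earlier. The actual main program is covered in Chapter 7.

best_solution = None                                  # best (workload, waiting, nurses) found so far
for episode in range(n_episodes):
    episode_targets = copy.deepcopy(e_jobs[1:])       # unfulfilled demands (job 0, the depot, is passed separately)
    episode_skills = copy.deepcopy(e_nurses_skill)    # available nurse skill levels
    nurses = []                                       # filled with Nurses objects by q_learning()
    workload, waiting = QL_BWACO.q_learning(
        nurses, episode_targets, episode_skills,
        q_matrix, learning_rate_para, discount_para, greedy_para, waiting_para, absorbing_state,
        ccp_acquaintance_increment, ccp_alpha, ccp_beta, ccp_waiting_limit, ccp_workload_limit,
        aco_alpha, aco_beta, aco_rho, aco_ant_number, changeable_pheromone_matrix,
        e_walk_speed, changeable_preference_matrix, e_init_distance_matrix, e_service_time_mean,
        e_depot)
    if best_solution is None or waiting < best_solution[1]:
        best_solution = (workload, waiting, nurses)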

In the next chapter we implement the ant colony part in Python: Chapter 6: Python Code for the Ant Colony Optimization Part

