R Machine Learning Notes (4): Model Training with mlr

Author: Huang Tianyuan, PhD candidate at Fudan University, passionate about data science and open-source tools (R), and dedicated to using data science to rapidly build up domain expertise and support scientific knowledge discovery. His interests include, but are not limited to, informetrics, machine learning, data visualization, applied statistical modeling, and knowledge graphs. He is the author of 《R语言高效数据处理指南》. Zhihu column: R语言数据挖掘. Email: huang.tian-yuan@qq.com. Collaboration and exchange are welcome.

Previously in this series:

HopeR: R Machine Learning Notes (1): An Overview of mlr

HopeR: R Machine Learning Notes (2): Defining Tasks in mlr

HopeR: R Machine Learning Notes (3): Defining Learners in mlr


In mlr, once the task (classification or regression) and the model (learner) have been defined, training takes nothing more than a single call to train(). It is as simple as this:

# Generate the task
task = makeClassifTask(data = iris, target = "Species")

# Generate the learner
lrn = makeLearner("classif.lda")

# Train the learner
mod = train(lrn, task)
mod
## Model for learner.id=classif.lda; learner.class=classif.lda
## Trained on: task.id = iris; obs = 150; features = 4
## Hyperparameters:

Above, a classification task is first defined on R's built-in iris dataset, the LDA (linear discriminant analysis) learner is chosen, and a single train() call then performs the training, storing the fitted model in mod. If you only need a learner with its default settings, you can skip defining it and pass its name straight to train(), like so:

mod = train("classif.lda", task)
mod
## Model for learner.id=classif.lda; learner.class=classif.lda
## Trained on: task.id = iris; obs = 150; features = 4
## Hyperparameters:

The trained model is itself an object. The names() function shows what it contains, and individual components can then be accessed directly with $. As an example, let's train an unsupervised clustering model:

# Generate the task (the ruspini dataset ships with the cluster package)
data(ruspini, package = "cluster")
ruspini.task = makeClusterTask(data = ruspini)

# Generate the learner
lrn = makeLearner("cluster.kmeans", centers = 4)

# Train the learner
mod = train(lrn, ruspini.task)
mod
## Model for learner.id=cluster.kmeans; learner.class=cluster.kmeans
## Trained on: task.id = ruspini; obs = 75; features = 2
## Hyperparameters: centers=4

# Peek into mod
names(mod)
## [1] "learner"       "learner.model" "task.desc"     "subset"       
## [5] "features"      "factor.levels" "time"          "dump"

mod$learner
## Learner cluster.kmeans from package stats,clue
## Type: cluster
## Name: K-Means; Short name: kmeans
## Class: cluster.kmeans
## Properties: numerics,prob
## Predict-Type: response
## Hyperparameters: centers=4

mod$features
## [1] "x" "y"

# Extract the fitted model
getLearnerModel(mod)
## K-means clustering with 4 clusters of sizes 23, 17, 20, 15
## 
## Cluster means:
##          x        y
## 1 43.91304 146.0435
## 2 98.17647 114.8824
## 3 20.15000  64.9500
## 4 68.93333  19.4000
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  1  1  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 
##  2  2  2  2  2  2  2  2  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4 
## 
## Within cluster sum of squares by cluster:
## [1] 3176.783 4558.235 3689.500 1456.533
##  (between_SS / total_SS =  94.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

During training, you can also specify the subset argument to train on only part of the data, leaving the remaining observations for validation. subset accepts either an integer vector (the row indices to train on) or a logical vector (convenient for condition-based filtering). Likewise, the weights argument assigns per-observation weights, which can be used to correct for problems such as class imbalance. These details will be explored further in later parts of this series.
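As a minimal sketch of the two arguments just described (the 80/20 split, the seed, and the weight values are illustrative assumptions, not from the original post; note that classif.rpart is used for the weights example because classif.lda does not support observation weights):

```r
library(mlr)

# Classification task on iris, as in the examples above
task = makeClassifTask(data = iris, target = "Species")

# Integer subset: train on a random 80% of the rows,
# keeping the remaining 20% for later validation
n = getTaskSize(task)
set.seed(1)
train.set = sample(n, size = round(0.8 * n))
mod = train("classif.lda", task, subset = train.set)

# Logical subset: train only on rows meeting a condition (odd rows here)
mod.odd = train("classif.lda", task, subset = seq_len(n) %% 2 == 1)

# Observation weights: up-weight one class; rpart supports case weights.
# These weight values are purely illustrative.
w = ifelse(iris$Species == "setosa", 2, 1)
mod.w = train("classif.rpart", task, weights = w)
```

The held-out rows (those not in train.set) can later be passed to predict() for validation.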

Reference:

Training a Learner

Published on 07-12
