Cross-Validation for Time Series Problems

Algorithm engineer

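Because the observations are ordered in time, classification or regression problems with a temporal component cannot simply be cross-validated with StratifiedKFold or KFold, and shuffling is even worse. Doing so mixes information across time, for example by using future data to predict the past. Validation scores obtained this way mean little in a business setting, and in competitions they easily lead to overfitting. What follows is a brief summary of cross-validation methods for time series.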
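To make the leakage concrete, here is a minimal sketch that checks, fold by fold, whether the training indices of a plain KFold reach past the start of the test set. The helper name `leaks_future` is hypothetical and only serves this illustration; indices 0 to 5 stand for six consecutive time steps.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 6  # indices 0..5 stand for six consecutive time steps
indices = np.arange(n_samples)


def leaks_future(train, test):
    """Return True if any training index is later than the earliest test index.

    Hypothetical helper, written only for this illustration.
    """
    return bool(train.max() > test.min())


# Even without shuffling, plain KFold trains on later samples in most folds
kfold_leaks = [
    leaks_future(train, test)
    for train, test in KFold(n_splits=3).split(indices)
]
print(kfold_leaks)  # [True, True, False]
```

Only the last fold is free of future data, which is why the splitters below restrict or separate the training indices.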

1. TimeSeriesSplit

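import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)  # prints the splitter's parameters; the exact text depends on the scikit-learn version

# Each split trains on an expanding prefix and tests on the sample that follows it
for train, test in tscv.split(X):
    print("%s %s" % (train, test))
# Output:
# [0 1 2] [3]
# [0 1 2 3] [4]
# [0 1 2 3 4] [5]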

This is one of the more common cross-validation methods for datasets with temporal ordering.
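The index arithmetic behind this expanding-window scheme can be sketched in a few lines of plain Python. This is an illustration under the default sizing rule test_size = n_samples // (n_splits + 1), not scikit-learn's source code; the function name `expanding_window_splits` is made up for the example.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train, test) index lists in the expanding-window style.

    Hypothetical helper for illustration; it assumes the default sizing
    rule test_size = n_samples // (n_splits + 1).
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        # Train on everything before the test block, test on the block itself
        yield list(range(start)), list(range(start, start + test_size))


splits = list(expanding_window_splits(n_samples=6, n_splits=3))
for train, test in splits:
    print(train, test)
# [0, 1, 2] [3]
# [0, 1, 2, 3] [4]
# [0, 1, 2, 3, 4] [5]
```

The test blocks always sit at the end of the data, so the training set never contains indices later than the test set.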

2. tscv

GapLeavePOut

The author of the tscv package describes and implements several other interesting validation schemes for time-ordered data:

>>> from tscv import GapLeavePOut
>>> cv = GapLeavePOut(p=3, gap_before=1, gap_after=2)
>>> for train, test in cv.split(range(7)):
...    print("train:", train, "test:", test)

train: [5 6]   test: [0 1 2]
train: [6]     test: [1 2 3]
train: [0]     test: [2 3 4]
train: [0 1]   test: [3 4 5]
train: [0 1 2] test: [4 5 6]

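p is the size of the test set, which acts as a sliding window over the data. gap_before and gap_after are the numbers of samples (periods) excluded immediately before and after the test window, as the output above shows. At present the author has implemented these parameters only as counts, not as fractions, but that is a minor issue: compute the count yourself and pass it in.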
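The semantics of p, gap_before and gap_after can be sketched without the library. The code below is an illustrative re-implementation, not the tscv source; both `gap_leave_p_out` and `fraction_to_count` are hypothetical names. The second helper shows one way to turn a fraction into the count that the parameters expect.

```python
def gap_leave_p_out(n_samples, p, gap_before=0, gap_after=0):
    """Yield (train, test) index lists for a sliding test window of size p.

    Illustrative re-implementation, not the tscv source code.
    """
    for start in range(n_samples - p + 1):
        test = list(range(start, start + p))
        # Exclude the test window plus gap_before samples ahead of it
        # and gap_after samples behind it
        excluded = range(start - gap_before, start + p + gap_after)
        train = [i for i in range(n_samples) if i not in excluded]
        yield train, test


def fraction_to_count(fraction, n_samples):
    """Convert a fraction of the data into a whole number of samples."""
    return int(round(fraction * n_samples))


splits = list(gap_leave_p_out(7, p=3, gap_before=1, gap_after=2))
for train, test in splits:
    print("train:", train, "test:", test)

# A gap of 20% of 10 samples corresponds to a count of 2
gap = fraction_to_count(0.2, 10)
```

With these arguments the sketch yields the same five splits as the GapLeavePOut output shown above.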

GapKFold

>>> from tscv import GapKFold
>>> cv = GapKFold(n_splits=5, gap_before=2, gap_after=1)
>>> for train, test in cv.split(range(10)):
...    print("train:", train, "test:", test)

train: [3 4 5 6 7 8 9]  test: [0 1]
train: [5 6 7 8 9]      test: [2 3]
train: [0 1 7 8 9]      test: [4 5]
train: [0 1 2 3 9]      test: [6 7]
train: [0 1 2 3 4 5]    test: [8 9]

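This is the gap version of KFold. The output above makes the mechanism clear, and the parameter names say plainly what they control. This method inevitably trains on future data, so I am not sure whether it is useful; I will try it in a competition.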
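How much future data each fold uses can be counted directly from the GapKFold output above. The sketch below copies those five folds by hand and counts the training indices that come after the last test index; `count_future_train` is a hypothetical helper name.

```python
# (train, test) folds copied from the GapKFold output above
folds = [
    ([3, 4, 5, 6, 7, 8, 9], [0, 1]),
    ([5, 6, 7, 8, 9], [2, 3]),
    ([0, 1, 7, 8, 9], [4, 5]),
    ([0, 1, 2, 3, 9], [6, 7]),
    ([0, 1, 2, 3, 4, 5], [8, 9]),
]


def count_future_train(train, test):
    """Number of training indices that come after the last test index."""
    return sum(1 for i in train if i > max(test))


future_counts = [count_future_train(train, test) for train, test in folds]
print(future_counts)  # [7, 5, 3, 1, 0]
```

Only the last fold is trained purely on the past; the gaps reduce the dependence between neighbouring samples but do not remove the look-ahead.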

GapWalkForward

>>> from tscv import GapWalkForward
>>> cv = GapWalkForward(n_splits=3, gap_size=1, test_size=2)
>>> for train, test in cv.split(range(10)):
...    print("train:", train, "test:", test)

train: [0 1 2]          test: [4 5]
train: [0 1 2 3 4]      test: [6 7]
train: [0 1 2 3 4 5 6]  test: [8 9]

This is essentially scikit-learn's TimeSeriesSplit with a gap added.
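For comparison, scikit-learn's own TimeSeriesSplit accepts `gap` and `test_size` arguments from version 0.24 onward, so the same splits can be reproduced without tscv. The sketch below assumes such a version is installed.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Requires scikit-learn >= 0.24, where `gap` and `test_size` were added
cv = TimeSeriesSplit(n_splits=3, test_size=2, gap=1)
splits = [
    (train.tolist(), test.tolist())
    for train, test in cv.split(np.arange(10))
]
for train, test in splits:
    print("train:", train, "test:", test)
# train: [0, 1, 2] test: [4, 5]
# train: [0, 1, 2, 3, 4] test: [6, 7]
# train: [0, 1, 2, 3, 4, 5, 6] test: [8, 9]
```

The index sets match the GapWalkForward output above: one sample is dropped between the end of each training set and the start of its test set.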

gap_train_test_split

import numpy as np
from tscv import gap_train_test_split
X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = gap_train_test_split(X, y, test_size=2, gap_size=2)

This is essentially the gap version of train_test_split.
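What such a split amounts to can be sketched with plain slicing. The function below is an assumption about the behaviour, not the tscv implementation: it takes integer sizes only, keeps the earliest rows for training, drops `gap_size` rows, and keeps the last `test_size` rows for testing. The name `gap_split_sketch` is hypothetical.

```python
import numpy as np


def gap_split_sketch(X, y, test_size, gap_size):
    """Split in time order: train rows, then a dropped gap, then test rows.

    Assumed behaviour for illustration only; integer sizes are required.
    """
    n_samples = len(X)
    n_train = n_samples - test_size - gap_size
    test_start = n_samples - test_size
    return X[:n_train], X[test_start:], y[:n_train], y[test_start:]


X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = gap_split_sketch(X, y, test_size=2, gap_size=2)
print(y_train, y_test)  # [0 1 2 3 4 5] [8 9]
```

Rows 6 and 7 fall into the gap and are used by neither set.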

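Under the hood, tscv is built on scikit-learn's base classes, so its splitters are compatible with APIs written against the scikit-learn conventions. I have not tried these validation schemes much; I will try them next time.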
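In practice this compatibility means a splitter object can be handed to the `cv` argument of scikit-learn utilities. The sketch below uses scikit-learn's own TimeSeriesSplit so that it runs without extra packages; substituting a tscv splitter such as GapKFold is an assumption based on the compatibility described above. The synthetic data and the LinearRegression model are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data: a linear trend plus a little noise
rng = np.random.default_rng(0)
X = np.arange(40, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.1, size=40)

# Any object that provides split() and get_n_splits() can be passed as cv
cv = TimeSeriesSplit(n_splits=4)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(len(scores))  # 4, one score per split
```

The same `cv` object can also be passed to GridSearchCV or cross_validate, which take the parameter in the same form.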

Edited on 2022-02-04 10:36
