#Deep Dive# How tf.data.Dataset.shuffle works and what the buffer_size parameter does

Master's in Software Engineering, Liaoning Technical University


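Today, while learning the shuffle method of TensorFlow's Dataset, I could not make sense of the buffer_size parameter.

I searched all over the web, but every answer only said that the larger buffer_size is, the more thoroughly the data gets shuffled. None of them explained, in terms of the underlying mechanism, what the parameter actually means.

So I looked up the official documentation for the shuffle method. The original English text reads: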

Randomly shuffles the elements of this dataset.
This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer.
reshuffle_each_iteration controls whether the shuffle order should be different for each epoch. In TF 1.X, the idiomatic way to create epochs was through the repeat transformation:

The second and third paragraphs explain what buffer_size does and illustrate it with an example. In plain terms:

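buffer_size sets the size of the shuffle buffer. The buffer does not have to be as large as the dataset; only a perfect (fully uniform) shuffle requires a buffer size greater than or equal to the full size of the dataset. With a smaller buffer, the shuffle is only approximate.
The example in the documentation goes as follows:
Suppose the dataset has 10,000 elements and buffer_size is 1,000. The algorithm first loads the first 1,000 elements into the buffer. It then picks one of these 1,000 elements at random as the first output, and the slot it occupied is filled with the 1,001st element of the dataset. Next it picks the second output at random from the 1,000 elements now in the buffer, and that slot is filled with the 1,002nd element, and so on.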
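
The procedure described above can be sketched in plain Python. This is only a minimal simulation of the documented behavior, not TensorFlow's actual implementation; the function name buffered_shuffle and its seed argument are made up for illustration.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield items in the order a fixed-size shuffle buffer would emit them.

    Hypothetical helper that mimics the behavior described in the docs.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    buffer: List[T] = []

    # Step 1: fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break

    # Step 2: emit a random slot, then refill that slot with the next input.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item

    # Step 3: the input is exhausted, so drain what is left in random order.
    while buffer:
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        yield buffer.pop()


if __name__ == "__main__":
    # 10,000 elements with a 1,000-element buffer, as in the documentation.
    shuffled = list(buffered_shuffle(range(10_000), buffer_size=1_000, seed=0))
    print(shuffled[:10])
```

In this sketch, the output at position i (counting from 0) can only be one of the first i + buffer_size input elements, because nothing later has entered the buffer yet. That is why a small buffer only mixes elements locally, and why buffer_size=1 leaves the order unchanged.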

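Also, buffer_size should not be set too large: the buffer holds buffer_size elements in memory at once, so an oversized buffer can exhaust memory. Only after reading the official documentation did I really understand what the buffer_size parameter means and how the shuffle method works underneath.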
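
To get a feel for the memory cost, here is a rough back-of-the-envelope estimate. It assumes dense elements of a fixed shape stored as 4-byte values (for example float32) and ignores any framework overhead; the helper name shuffle_buffer_bytes and the 224x224x3 image shape are made up for illustration.

```python
import math
from typing import Sequence


def shuffle_buffer_bytes(
    buffer_size: int, element_shape: Sequence[int], bytes_per_value: int = 4
) -> int:
    """Rough lower bound on the bytes held by a shuffle buffer.

    Hypothetical helper: assumes every element is a dense array of
    element_shape with bytes_per_value bytes per entry.
    """
    return buffer_size * math.prod(element_shape) * bytes_per_value


if __name__ == "__main__":
    image_shape = (224, 224, 3)  # assumed element shape, not from the docs
    for size in (1_000, 10_000, 100_000):
        gib = shuffle_buffer_bytes(size, image_shape) / 2**30
        print(f"buffer_size={size:>7,}: about {gib:.1f} GiB")
```

Under these assumptions the cost grows linearly with buffer_size, so a buffer that covers the whole dataset is only practical when the dataset itself fits in memory.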

Edited on 2021-03-14 23:46
