Getting a feel for TensorFlow on an M1 MacBook Pro

No road ahead, no way back.

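I had long heard that the M1 chip punches above its weight at machine learning, and as a perpetually late-to-the-party homebody I was not going to pass this up. So today, digging through my bookmarks, I found this post:

I was pretty thrilled, to put it mildly.

My machine learning and TensorFlow fundamentals are both shaky, so I did not run any large-scale benchmark; I only got a feel for things with tutorial-level code.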

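Installation

Even for someone who has gone hundreds of rounds with Python, the installation is full of traps. I went back and forth with conda many times, and every time I started python and typed import tensorflow I hit the following: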

>>> import tensorflow as tf
zsh: illegal hardware instruction  python

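A quick look suggested that the regular tensorflow package does not support macOS here, only Ubuntu and Windows (more precisely, as far as I can tell, the stock macOS wheels at the time were x86-64 builds rather than native Apple Silicon ones). Fine, so I tried installing tensorflow-macos instead, only to hit this: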

 (base) ~ pip install tensorflow-macos
ERROR: Could not find a version that satisfies the requirement tensorflow-macos (from versions: none)
ERROR: No matching distribution found for tensorflow-macos
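
Both errors above are commonly reported when the Python interpreter itself is an x86-64 build running under Rosetta rather than a native arm64 one. That is an assumption about the cause here, not something this post confirms. A minimal check using only the standard library (the helper name describe_interpreter is made up for illustration):

```python
import platform
import sys


def describe_interpreter() -> dict:
    """Collect the facts that decide which wheels pip will accept."""
    return {
        # 'arm64' when running natively on Apple Silicon, 'x86_64' under Rosetta
        "machine": platform.machine(),
        # 'Darwin' on macOS
        "system": platform.system(),
        "python": ".".join(str(n) for n in sys.version_info[:3]),
        # Path of the python binary that is actually running
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If machine comes back as x86_64 on an M1, the interpreter (often the conda base environment) is probably the thing to replace before touching TensorFlow.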

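OK, after a morning of diligent searching (read: slacking off), I finally found a fix.

The general solution is to install the arm64 build of miniforge and install TensorFlow inside that conda (see this blog post: Installing TensorFlow on an M1 Mac), but that still failed for me.

So here is the complete solution (TensorFlow, could you please learn from PyTorch's hand-holding tutorials? Google, this time I am going to roast you x):

Reference hardware: Apple M1 chip, macOS 12.1 Monterey

First, download the Python 3.9.4 universal2 (64-bit) installer from python.org and install it (that is, update the local python3). I can no longer find the original thread for this step.

Then run the following commands (original thread: Developer Forums):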

$ python3 -m venv tensorflow-metal-test
$ source tensorflow-metal-test/bin/activate
$ cd tensorflow-metal-test/
$ python -m pip install -U pip
$ pip install tensorflow-macos
$ pip install tensorflow-metal

During the installation, however, you will run into this:

Building wheels for collected packages: h5py
  Building wheel for h5py (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for h5py (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [70 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-10.9-universal2-3.9
      creating build/lib.macosx-10.9-universal2-3.9/h5py
      copying h5py/h5py_warnings.py -> build/lib.macosx-10.9-universal2-3.9/h5py
      copying h5py/version.py -> build/lib.macosx-10.9-universal2-3.9/h5py
      copying h5py/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py
      copying h5py/ipy_completer.py -> build/lib.macosx-10.9-universal2-3.9/h5py
      creating build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/files.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/compat.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/selections.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/dataset.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/selections2.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/group.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/datatype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/attrs.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/dims.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/base.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/filters.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      creating build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dimension_scales.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_attribute_create.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_file_image.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/conftest.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5d_direct_chunk.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5f.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dataset_getitem.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_group.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_errors.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dataset_swmr.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_slicing.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5pl.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_attrs.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_attrs_data.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5t.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_big_endian_file.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5p.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dims_dimensionproxy.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5o.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_datatype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/common.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dataset.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_file.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_selections.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dtype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_file2.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_completions.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_filters.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_base.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_objects.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      creating build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      copying h5py/tests/data_files/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      creating build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/test_vds/test_highlevel_vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/test_vds/test_virtual_source.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/test_vds/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/test_vds/test_lowlevel_vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/data_files/vlen_string_s390x.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      copying h5py/tests/data_files/vlen_string_dset_utc.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      copying h5py/tests/data_files/vlen_string_dset.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      running build_ext
      Building h5py requires pkg-config unless the HDF5 path is explicitly specified
      error: pkg-config probably not installed: FileNotFoundError(2, 'No such file or directory')
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for h5py
Failed to build h5py
ERROR: Could not build wheels for h5py, which is required to install pyproject.toml-based projects
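
The last few lines of the log are the real cause: the h5py build needs either pkg-config or an explicitly specified HDF5 path. A rough sketch for checking which of the two is available, assuming the explicit path is supplied through the HDF5_DIR environment variable (the function name h5py_build_prereqs is hypothetical):

```python
import os
import shutil


def h5py_build_prereqs(environ=None) -> dict:
    """Report the two things the h5py build can use to locate HDF5."""
    env = os.environ if environ is None else environ
    return {
        # Full path to pkg-config, or None if it is not on PATH
        "pkg_config": shutil.which("pkg-config"),
        # Explicit HDF5 install prefix, or None if unset
        "HDF5_DIR": env.get("HDF5_DIR"),
    }


if __name__ == "__main__":
    found = h5py_build_prereqs()
    print(found)
    if not any(found.values()):
        print("Neither pkg-config nor HDF5_DIR is available; expect the error above.")
```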

A bit of searching showed that hdf5 has to be installed with brew first, that is, run:

arch -arm64 brew install hdf5

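That took the whole morning... brew is unbelievably slow... (grandpa-squinting-at-phone.jpg)

Even with hdf5 installed, though, the install still failed. I consulted this blog post:

and found that you first need to run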

find /opt -iname "*hdf5.h*"

which, barring surprises, will find:

/opt/homebrew/include/hdf5.h
/opt/homebrew/Cellar/hdf5/1.13.0/include/hdf5.h

Then run

export CPATH="/opt/homebrew/include/"
export HDF5_DIR=/opt/homebrew/

to set the environment variables, and finally re-run

$ pip install tensorflow-macos
$ pip install tensorflow-metal

and you are done!
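
The link between the find output and the two exported variables can be sketched as follows: take a header path of the form <prefix>/include/hdf5.h and derive the prefix from it. The helper names hdf5_prefix and export_lines are made up for illustration, and the sketch only prints the commands rather than changing the environment:

```python
from pathlib import PurePosixPath


def hdf5_prefix(header_path: str) -> str:
    """Turn '<prefix>/include/hdf5.h' into '<prefix>'."""
    header = PurePosixPath(header_path)
    if header.name != "hdf5.h" or header.parent.name != "include":
        raise ValueError(f"unexpected header location: {header_path}")
    return str(header.parent.parent)


def export_lines(header_path: str) -> list[str]:
    """Build the two export commands for a given header location."""
    prefix = hdf5_prefix(header_path)
    return [
        f'export CPATH="{prefix}/include/"',
        f"export HDF5_DIR={prefix}/",
    ]


if __name__ == "__main__":
    # First path reported by the find command above
    for line in export_lines("/opt/homebrew/include/hdf5.h"):
        print(line)
```

For the first path found above this prints the same two export lines used in this post. If find reports a different location on your machine, the prefix changes accordingly.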

Gingerly start python and type import tensorflow. After what felt like ten thousand years, finally

Python 3.9.4 (v3.9.4:1f2e3088f3, Apr  4 2021, 12:19:19) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
 
>>>

Success! Now on to the exciting part: the code.
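
A silent import does not say much about what actually got installed, so as a sanity check before moving on, here is a small sketch that reads the installed versions from package metadata. It uses only the standard library; the helper name installed_version is hypothetical:

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names as they were passed to pip in this post
    for name in ("tensorflow-macos", "tensorflow-metal", "h5py"):
        print(name, installed_version(name))
```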

Experiments

First, check whether our M1 GPU shows up among TF's devices:

>>> import tensorflow as tf
>>> tf.config.experimental.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Wow!
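
The same check can be put in a script so that it fails loudly when the Metal plugin is missing. This is only a sketch: has_gpu is a made-up helper, and it relies on nothing more than the device_type attribute visible in the output above.

```python
def has_gpu(devices) -> bool:
    """True if any device in the list reports device_type == 'GPU'."""
    return any(getattr(d, "device_type", None) == "GPU" for d in devices)


if __name__ == "__main__":
    try:
        import tensorflow as tf  # needs the tensorflow-macos install from above
    except ImportError:
        print("tensorflow is not importable in this environment")
    else:
        # Non-experimental spelling of the call used in the REPL session
        devices = tf.config.list_physical_devices()
        print("GPU visible:", has_gpu(devices))
```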

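The last time I trained a neural network was three years ago... and back then I only ever trained on MNIST, so I had to start from scratch (that is, copy other people's code). The reference code is as follows: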

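# env: tf-test

import ssl

import tensorflow as tf
from tensorflow import keras

# Run in graph mode (TF1-style training loop), which is why the logs below
# start with "Train on 50000 samples". Public alias of the private ops import.
tf.compat.v1.disable_eager_execution()

# Workaround for certificate errors when downloading CIFAR-10.
# Note: this turns off HTTPS certificate verification for the whole process.
ssl._create_default_https_context = ssl._create_unverified_context

# Load CIFAR-10 and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# One-hot encode the labels for categorical_crossentropy
y_train = keras.utils.to_categorical(y_train, num_classes=10, dtype='float32')
y_test = keras.utils.to_categorical(y_test, num_classes=10, dtype='float32')

# Left commented out in the original:
# from tensorflow.python.compiler.mlcompute import mlcompute
# mlcompute.set_mlc_device(device_name="gpu")

# Change '/GPU:0' to '/CPU:0' for the CPU runs
with tf.device('/GPU:0'):
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(32, 32, 3)),
        keras.layers.Dense(3000, activation='relu'),
        keras.layers.Dense(1000, activation='relu'),
        # sigmoid is what produced the logs below; softmax is the more usual
        # pairing with categorical_crossentropy
        keras.layers.Dense(10, activation='sigmoid'),
    ])

    model.compile(optimizer="SGD", loss="categorical_crossentropy", metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test, verbose=2)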

This draws on the blog post mentioned at the start of the article and on the TensorFlow tutorial:

Results (M1 GPU):

Train on 50000 samples
Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

2022-03-14 11:32:41.119904: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-03-14 11:32:41.120025: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-03-14 11:32:41.125025: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2022-03-14 11:32:41.125164: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-03-14 11:32:41.133741: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-03-14 11:32:41.449048: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
Epoch 1/5
2022-03-14 11:32:41.453886: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
50000/50000 [==============================] - 17s 331us/sample - loss: 1.8126 - accuracy: 0.3534
Epoch 2/5
50000/50000 [==============================] - 15s 291us/sample - loss: 1.6244 - accuracy: 0.4282
Epoch 3/5
50000/50000 [==============================] - 15s 292us/sample - loss: 1.5423 - accuracy: 0.4572
Epoch 4/5
50000/50000 [==============================] - 15s 292us/sample - loss: 1.4819 - accuracy: 0.4761
Epoch 5/5
50000/50000 [==============================] - 15s 293us/sample - loss: 1.4323 - accuracy: 0.4953

Changing GPU to CPU in the code:

Train on 50000 samples
Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

2022-03-14 11:38:11.071008: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-03-14 11:38:11.071162: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-03-14 11:38:11.075712: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/5
50000/50000 [==============================] - 24s 477us/sample - loss: 1.8087 - accuracy: 0.3563
Epoch 2/5
50000/50000 [==============================] - 24s 489us/sample - loss: 1.6201 - accuracy: 0.4260
Epoch 3/5
50000/50000 [==============================] - 24s 486us/sample - loss: 1.5392 - accuracy: 0.4527
Epoch 4/5
50000/50000 [==============================] - 24s 485us/sample - loss: 1.4784 - accuracy: 0.4787
Epoch 5/5
50000/50000 [==============================] - 24s 485us/sample - loss: 1.4296 - accuracy: 0.4960

I also ran it on a server with Nvidia A100 GPUs:

Train on 50000 samples
2022-03-14 12:02:28.306455: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-14 12:02:29.499951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 861 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:3b:00.0, compute capability: 8.0
2022-03-14 12:02:29.500734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 1082 MB memory:  -> device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:af:00.0, compute capability: 8.0
Epoch 1/5
2022-03-14 12:02:30.896021: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
50000/50000 [==============================] - 6s 125us/sample - loss: 1.8107 - accuracy: 0.3550
Epoch 2/5
50000/50000 [==============================] - 5s 97us/sample - loss: 1.6262 - accuracy: 0.4248
Epoch 3/5
50000/50000 [==============================] - 5s 98us/sample - loss: 1.5425 - accuracy: 0.4559
Epoch 4/5
50000/50000 [==============================] - 5s 96us/sample - loss: 1.4822 - accuracy: 0.4783
Epoch 5/5
50000/50000 [==============================] - 5s 98us/sample - loss: 1.4330 - accuracy: 0.4953

Results on the server's CPU:

Train on 50000 samples
2022-03-14 12:03:27.529990: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-14 12:03:28.747153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 861 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:3b:00.0, compute capability: 8.0
2022-03-14 12:03:28.747977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 1082 MB memory:  -> device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:af:00.0, compute capability: 8.0
Epoch 1/5
50000/50000 [==============================] - 34s 679us/sample - loss: 1.8070 - accuracy: 0.3575
Epoch 2/5
50000/50000 [==============================] - 34s 679us/sample - loss: 1.6191 - accuracy: 0.4296
Epoch 3/5
50000/50000 [==============================] - 34s 678us/sample - loss: 1.5401 - accuracy: 0.4551
Epoch 4/5
50000/50000 [==============================] - 34s 677us/sample - loss: 1.4805 - accuracy: 0.4778
Epoch 5/5
50000/50000 [==============================] - 34s 676us/sample - loss: 1.4335 - accuracy: 0.4943

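Roughly speaking, the M1 GPU (about 15 s per epoch) is a bit more than twice as fast as the server-grade Intel CPU (about 34 s per epoch) and about 1.6 times as fast as the M1's own CPU (about 24 s per epoch). It still gets absolutely crushed by Nvidia, though (about 5 s per epoch on the A100) (strike that).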
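
The ratios can be checked directly from the logs above. This sketch uses the steady-state per-epoch times printed for epochs 2 to 5 of each run (15 s, 24 s, 5 s and 34 s); the dictionary keys and the speedup helper are just labels chosen for illustration:

```python
# Steady-state seconds per epoch, read off epochs 2-5 of each log above
epoch_seconds = {
    "M1 GPU": 15,
    "M1 CPU": 24,
    "A100": 5,
    "server CPU": 34,
}


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` is than `baseline`."""
    return epoch_seconds[baseline] / epoch_seconds[candidate]


if __name__ == "__main__":
    print(f"M1 GPU vs server CPU: {speedup('server CPU', 'M1 GPU'):.2f}x")
    print(f"M1 GPU vs M1 CPU:     {speedup('M1 CPU', 'M1 GPU'):.2f}x")
    print(f"A100 vs M1 GPU:       {speedup('M1 GPU', 'A100'):.2f}x")
```

This gives 2.27x, 1.60x and 3.00x respectively. These are wall-clock ratios for one small dense network over five epochs, so they are a rough impression rather than a benchmark.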

Published 2022-03-14 12:11

This article is included in the following column:

废物的CS碎碎念