Trying out TensorFlow on an M1 MacBook Pro

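I had long heard that the M1 chip is impressively strong at machine learning, and as a habitually late-to-the-party homebody I could not pass up the chance. So today, digging through my bookmarks, I found this post:

Needless to say, I was thrilled.

My grounding in both machine learning and TensorFlow is pretty weak, so I did not run a large-scale benchmark; I only got a feel for things with example-level code.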

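Installation

The installation is easy to trip over even for a veteran of several hundred rounds of fighting with Python. I went back and forth with conda many times, and every time I started python and typed import tensorflow I hit this: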

>>> import tensorflow as tf
zsh: illegal hardware instruction  python

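A quick look told me the stock tensorflow package does not support this setup (what I read listed only Ubuntu and Windows; more precisely, the regular tensorflow wheels at the time had no native arm64 macOS build). Fine, so I tried installing tensorflow-macos instead, only to hit this: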

 (base) ~ pip install tensorflow-macos
ERROR: Could not find a version that satisfies the requirement tensorflow-macos (from versions: none)
ERROR: No matching distribution found for tensorflow-macos
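pip only offers wheels whose platform tag and Python version match the interpreter it runs under, so one thing worth checking at this point is what the active interpreter actually is. The sketch below is my own addition, not something from the original troubleshooting; the function names are hypothetical, and it uses only the standard library. It does not prove the cause of either error above, it just surfaces the facts that usually matter.

```python
import platform
import sys


def interpreter_arch_report():
    """Collect the interpreter facts that influence which wheels pip considers."""
    return {
        "machine": platform.machine(),   # 'arm64' for a native Apple Silicon process
        "system": platform.system(),     # 'Darwin' on macOS
        "python": platform.python_version(),
        "executable": sys.executable,    # which python is actually running
    }


def looks_native_apple_silicon(report):
    """True if the report describes a native arm64 macOS interpreter."""
    return report["system"] == "Darwin" and report["machine"] == "arm64"


if __name__ == "__main__":
    report = interpreter_arch_report()
    for key, value in report.items():
        print(f"{key}: {value}")
    if not looks_native_apple_silicon(report):
        print("Not a native arm64 macOS interpreter; arm64-only wheels will not match.")
```

If machine prints x86_64 on an M1 Mac, the interpreter is an Intel build running under Rosetta, which is a plausible (but here unverified) explanation for both errors.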

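OK, after a morning of diligent searching (slacking off), I finally found a solution.

The commonly suggested fix is to install the arm64 build of miniforge and install TensorFlow inside that conda (see this blog post: Installing TensorFlow on an M1 Mac), but that still failed for me.

So here is the complete solution (TensorFlow, could you please learn from PyTorch's hand-holding tutorials? Google, this time I'm calling you out x):

Reference setup: Apple M1 chip, macOS 12.1 Monterey

First, download the universal2 64-bit installer of Python 3.9.4 from the official Python site and install it (i.e. update the local python3). I can no longer find the original question that suggested this.

Then run the following commands (original question: Developer Forums):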

$ python3 -m venv tensorflow-metal-test
$ source tensorflow-metal-test/bin/activate
$ cd tensorflow-metal-test/
$ python -m pip install -U pip
$ pip install tensorflow-macos
$ pip install tensorflow-metal

During installation, however, you will run into this problem:

Building wheels for collected packages: h5py
  Building wheel for h5py (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for h5py (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [70 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-10.9-universal2-3.9
      creating build/lib.macosx-10.9-universal2-3.9/h5py
      copying h5py/h5py_warnings.py -> build/lib.macosx-10.9-universal2-3.9/h5py
      copying h5py/version.py -> build/lib.macosx-10.9-universal2-3.9/h5py
      copying h5py/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py
      copying h5py/ipy_completer.py -> build/lib.macosx-10.9-universal2-3.9/h5py
      creating build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/files.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/compat.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/selections.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/dataset.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/selections2.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/group.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/datatype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/attrs.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/dims.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/base.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      copying h5py/_hl/filters.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
      creating build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dimension_scales.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_attribute_create.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_file_image.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/conftest.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5d_direct_chunk.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5f.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dataset_getitem.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_group.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_errors.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dataset_swmr.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_slicing.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5pl.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_attrs.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_attrs_data.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5t.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_big_endian_file.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5p.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dims_dimensionproxy.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5o.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_datatype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/common.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dataset.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_file.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_selections.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_dtype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_h5.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_file2.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_completions.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_filters.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_base.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      copying h5py/tests/test_objects.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
      creating build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      copying h5py/tests/data_files/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      creating build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/test_vds/test_highlevel_vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/test_vds/test_virtual_source.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/test_vds/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/test_vds/test_lowlevel_vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
      copying h5py/tests/data_files/vlen_string_s390x.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      copying h5py/tests/data_files/vlen_string_dset_utc.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      copying h5py/tests/data_files/vlen_string_dset.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
      running build_ext
      Building h5py requires pkg-config unless the HDF5 path is explicitly specified
      error: pkg-config probably not installed: FileNotFoundError(2, 'No such file or directory')
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for h5py
Failed to build h5py
ERROR: Could not build wheels for h5py, which is required to install pyproject.toml-based projects

After some searching, it turns out hdf5 has to be installed with brew first, i.e. run:

arch -arm64 brew install hdf5

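That took all morning... brew is painfully slow... (grandpa-squinting-at-phone.jpg)

But even with hdf5 installed, the install still failed. Following this blog post:

it turns out you first need to run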

find /opt -iname "*hdf5.h*"

which, barring surprises, will find:

/opt/homebrew/include/hdf5.h
/opt/homebrew/Cellar/hdf5/1.13.0/include/hdf5.h

Then run

export CPATH="/opt/homebrew/include/"
export HDF5_DIR=/opt/homebrew/

to set the environment variables, and finally rerun

$ pip install tensorflow-macos
$ pip install tensorflow-metal

and you're done!

Gingerly start python and type import tensorflow. After what felt like ten thousand years, finally
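The two exports follow directly from where find located hdf5.h: CPATH is the directory containing the header, and HDF5_DIR is the install prefix one level above it. As a small sketch of that derivation (the helper names are hypothetical, and I am assuming the usual prefix/include/hdf5.h layout that Homebrew uses):

```python
from pathlib import Path


def hdf5_env_from_header(header_path):
    """Derive CPATH and HDF5_DIR from the location of hdf5.h.

    Assumes the conventional <prefix>/include/hdf5.h layout.
    """
    header = Path(header_path)
    include_dir = header.parent    # e.g. /opt/homebrew/include
    prefix = include_dir.parent    # e.g. /opt/homebrew
    return {"CPATH": f"{include_dir}/", "HDF5_DIR": f"{prefix}/"}


def as_export_lines(env):
    """Format the variables as shell export statements."""
    return [f'export {name}="{value}"' for name, value in env.items()]


if __name__ == "__main__":
    env = hdf5_env_from_header("/opt/homebrew/include/hdf5.h")
    for line in as_export_lines(env):
        print(line)
```

Passing the first path from the find output reproduces the two export lines used above, so the same helper should adapt if Homebrew lives under a different prefix.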

Python 3.9.4 (v3.9.4:1f2e3088f3, Apr  4 2021, 12:19:19) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
 
>>> 

Success! Now on to the exciting part: the code.

Experiment

First, let's see whether our M1 GPU shows up among tf's devices:

>>> import tensorflow as tf
>>> tf.config.experimental.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

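Wow!

The last time I trained a neural network was three years ago... and back then I had only ever trained on MNIST, so I had to start from scratch (read: copy other people's code). The reference code is as follows: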
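The same check can be scripted so that a training run fails early when no GPU is registered. This is only a sketch: count_device_types and FakeDevice are hypothetical names of mine, and FakeDevice is a stand-in used when TensorFlow is not importable. The only TensorFlow call is tf.config.list_physical_devices, the non-experimental spelling of the call used above.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in with the same two fields as TensorFlow's PhysicalDevice.
FakeDevice = namedtuple("FakeDevice", ["name", "device_type"])


def count_device_types(devices):
    """Count devices by their device_type attribute, e.g. {'CPU': 1, 'GPU': 1}."""
    return dict(Counter(device.device_type for device in devices))


if __name__ == "__main__":
    try:
        import tensorflow as tf
        devices = tf.config.list_physical_devices()
    except ImportError:
        # No TensorFlow here: fall back to the device list shown in the post.
        devices = [
            FakeDevice("/physical_device:CPU:0", "CPU"),
            FakeDevice("/physical_device:GPU:0", "GPU"),
        ]
    counts = count_device_types(devices)
    print(counts)
    if counts.get("GPU", 0) == 0:
        print("No GPU registered; check that tensorflow-metal is installed.")
```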

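# env: tf-test

import ssl

import tensorflow as tf
from tensorflow import keras

# Run in TF1-style graph mode (public API for the private import used originally).
tf.compat.v1.disable_eager_execution()

# Workaround for certificate errors when downloading CIFAR-10.
# This disables HTTPS certificate verification, so use it with care.
ssl._create_default_https_context = ssl._create_unverified_context

# Load CIFAR-10 and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# One-hot encode the 10 class labels.
y_train = keras.utils.to_categorical(y_train, num_classes=10, dtype='float32')
y_test = keras.utils.to_categorical(y_test, num_classes=10, dtype='float32')

# Left commented out: mlcompute belongs to the earlier tensorflow_macos fork,
# not to tensorflow-metal.
# from tensorflow.python.compiler.mlcompute import mlcompute
# mlcompute.set_mlc_device(device_name="gpu")

# Change '/GPU:0' to '/CPU:0' for the CPU comparison.
with tf.device('/GPU:0'):
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(32, 32, 3)),
        keras.layers.Dense(3000, activation='relu'),
        keras.layers.Dense(1000, activation='relu'),
        # softmax is the usual pairing with categorical_crossentropy;
        # sigmoid is kept because the timings below were measured with it.
        keras.layers.Dense(10, activation='sigmoid'),
    ])

    model.compile(optimizer="SGD", loss="categorical_crossentropy", metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test, verbose=2)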

This draws on the blog post mentioned at the start and on the TensorFlow tutorial:

Results:

Train on 50000 samples
Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

2022-03-14 11:32:41.119904: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-03-14 11:32:41.120025: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-03-14 11:32:41.125025: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2022-03-14 11:32:41.125164: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-03-14 11:32:41.133741: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-03-14 11:32:41.449048: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
Epoch 1/5
2022-03-14 11:32:41.453886: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
50000/50000 [==============================] - 17s 331us/sample - loss: 1.8126 - accuracy: 0.3534
Epoch 2/5
50000/50000 [==============================] - 15s 291us/sample - loss: 1.6244 - accuracy: 0.4282
Epoch 3/5
50000/50000 [==============================] - 15s 292us/sample - loss: 1.5423 - accuracy: 0.4572
Epoch 4/5
50000/50000 [==============================] - 15s 292us/sample - loss: 1.4819 - accuracy: 0.4761
Epoch 5/5
50000/50000 [==============================] - 15s 293us/sample - loss: 1.4323 - accuracy: 0.4953

Switching the device from GPU to CPU:

Train on 50000 samples
Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

2022-03-14 11:38:11.071008: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-03-14 11:38:11.071162: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-03-14 11:38:11.075712: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/5
50000/50000 [==============================] - 24s 477us/sample - loss: 1.8087 - accuracy: 0.3563
Epoch 2/5
50000/50000 [==============================] - 24s 489us/sample - loss: 1.6201 - accuracy: 0.4260
Epoch 3/5
50000/50000 [==============================] - 24s 486us/sample - loss: 1.5392 - accuracy: 0.4527
Epoch 4/5
50000/50000 [==============================] - 24s 485us/sample - loss: 1.4784 - accuracy: 0.4787
Epoch 5/5
50000/50000 [==============================] - 24s 485us/sample - loss: 1.4296 - accuracy: 0.4960

I also ran it on a server with Nvidia A100 GPUs:

Train on 50000 samples
2022-03-14 12:02:28.306455: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-14 12:02:29.499951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 861 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:3b:00.0, compute capability: 8.0
2022-03-14 12:02:29.500734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 1082 MB memory:  -> device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:af:00.0, compute capability: 8.0
Epoch 1/5
2022-03-14 12:02:30.896021: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
50000/50000 [==============================] - 6s 125us/sample - loss: 1.8107 - accuracy: 0.3550
Epoch 2/5
50000/50000 [==============================] - 5s 97us/sample - loss: 1.6262 - accuracy: 0.4248
Epoch 3/5
50000/50000 [==============================] - 5s 98us/sample - loss: 1.5425 - accuracy: 0.4559
Epoch 4/5
50000/50000 [==============================] - 5s 96us/sample - loss: 1.4822 - accuracy: 0.4783
Epoch 5/5
50000/50000 [==============================] - 5s 98us/sample - loss: 1.4330 - accuracy: 0.4953

The server's CPU results:

Train on 50000 samples
2022-03-14 12:03:27.529990: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-14 12:03:28.747153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 861 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:3b:00.0, compute capability: 8.0
2022-03-14 12:03:28.747977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 1082 MB memory:  -> device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:af:00.0, compute capability: 8.0
Epoch 1/5
50000/50000 [==============================] - 34s 679us/sample - loss: 1.8070 - accuracy: 0.3575
Epoch 2/5
50000/50000 [==============================] - 34s 679us/sample - loss: 1.6191 - accuracy: 0.4296
Epoch 3/5
50000/50000 [==============================] - 34s 678us/sample - loss: 1.5401 - accuracy: 0.4551
Epoch 4/5
50000/50000 [==============================] - 34s 677us/sample - loss: 1.4805 - accuracy: 0.4778
Epoch 5/5
50000/50000 [==============================] - 34s 676us/sample - loss: 1.4335 - accuracy: 0.4943

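Roughly speaking, the M1 GPU is a bit more than twice as fast as the server-grade Intel CPU (about 15 s vs. 34 s per epoch) and about 1.6x as fast as the M1's own CPU (15 s vs. 24 s per epoch). It still gets crushed by Nvidia, though: the A100 needs only about 5 s per epoch (scratch that)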
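The ratios can be recomputed from the steady-state per-epoch times in the logs above (epochs 2 to 5: 15 s, 24 s, 5 s and 34 s). A minimal sketch, with names of my own choosing; the seconds are rounded as Keras printed them, so treat the ratios as approximate:

```python
# Seconds per epoch, read from epochs 2-5 of each log above.
EPOCH_SECONDS = {
    "M1 GPU": 15,
    "M1 CPU": 24,
    "A100 GPU": 5,
    "Server CPU": 34,
}


def speedup(baseline, candidate, timings=EPOCH_SECONDS):
    """How many times faster `candidate` is than `baseline` (ratio of epoch times)."""
    return timings[baseline] / timings[candidate]


if __name__ == "__main__":
    pairs = [
        ("M1 CPU", "M1 GPU"),
        ("Server CPU", "M1 GPU"),
        ("M1 GPU", "A100 GPU"),
    ]
    for baseline, candidate in pairs:
        print(f"{candidate} vs {baseline}: {speedup(baseline, candidate):.2f}x")
```

With these numbers the M1 GPU comes out at 1.60x the M1 CPU and 2.27x the server CPU, while the A100 is 3.00x the M1 GPU.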

Published 2022-03-14 12:11