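Trying out TensorFlow on an M1 MacBook Pro
I had long heard that the M1 chip is strong at machine learning, and as a perpetually late-to-the-party homebody I naturally couldn't pass up the chance. So today, digging through my bookmarks, I found this post:
Needless to say, I was very excited.
Since my grounding in both machine learning and TensorFlow is weak, I didn't run any large-scale benchmark; I just got a feel for things with example-level code.
Installation
The installation is full of pitfalls, even for a veteran who has fought Python for hundreds of rounds. I went back and forth with conda many times, and every time I started python and typed import tensorflow, I hit the following problem: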
>>> import tensorflow as tf
zsh: illegal hardware instruction python
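I took a look and, as far as I could tell, the stock tensorflow package doesn't support this Mac, only Ubuntu and Windows. Fine, so I tried installing tensorflow-macos instead, but then ran into this: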
(base) ~ pip install tensorflow-macos
ERROR: Could not find a version that satisfies the requirement tensorflow-macos (from versions: none)
ERROR: No matching distribution found for tensorflow-macos
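A quick sanity check at this point is to look at which interpreter is actually running, since pip picks wheels based on the interpreter's architecture and Python version. The snippet below is only a minimal sketch: the function name interpreter_summary is made up for illustration, and it is an assumption (not something verified above) that a mismatched interpreter is what causes these two errors.

```python
import platform
import sys


def interpreter_summary():
    """Collect the facts pip uses when choosing a wheel for this interpreter."""
    return {
        # 'arm64' for a native Apple Silicon process, 'x86_64' when running under Rosetta
        "machine": platform.machine(),
        "python": platform.python_version(),
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in interpreter_summary().items():
        print(f"{key}: {value}")
```

If machine comes back as x86_64 on an M1, the Python in that environment is an Intel build, which is worth ruling out before trying anything else.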
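OK, after a morning of diligent searching (slacking off), I finally found a solution.
The commonly recommended fix is to install the arm64 build of miniforge and install TensorFlow inside that conda (see this blog post: Installing TensorFlow on an M1 Mac), but that still failed for me.
So here is the complete solution (TensorFlow, could you please learn from PyTorch and its hand-holding tutorials? Google, this time I'm calling you out x):
Reference setup: Apple M1 chip, macOS 12.1 Monterey
First, download the universal2 64-bit installer of Python 3.9.4 from the official Python website and install it (that is, update the local python3). I can no longer find the original question this came from.
Then run the following commands (source: Developer Forums):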
$ python3 -m venv tensorflow-metal-test
$ source tensorflow-metal-test/bin/activate
$ cd tensorflow-metal-test/
$ python -m pip install -U pip
$ pip install tensorflow-macos
$ pip install tensorflow-metal
During installation, however, you will hit the following problem:
Building wheels for collected packages: h5py
Building wheel for h5py (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for h5py (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [70 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-10.9-universal2-3.9
creating build/lib.macosx-10.9-universal2-3.9/h5py
copying h5py/h5py_warnings.py -> build/lib.macosx-10.9-universal2-3.9/h5py
copying h5py/version.py -> build/lib.macosx-10.9-universal2-3.9/h5py
copying h5py/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py
copying h5py/ipy_completer.py -> build/lib.macosx-10.9-universal2-3.9/h5py
creating build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/files.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/compat.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/selections.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/dataset.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/selections2.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/group.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/datatype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/attrs.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/dims.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/base.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
copying h5py/_hl/filters.py -> build/lib.macosx-10.9-universal2-3.9/h5py/_hl
creating build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_dimension_scales.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_attribute_create.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_file_image.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/conftest.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_h5d_direct_chunk.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_h5f.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_dataset_getitem.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_group.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_errors.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_dataset_swmr.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_slicing.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_h5pl.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_attrs.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_attrs_data.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_h5t.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_big_endian_file.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_h5p.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_dims_dimensionproxy.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_h5o.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_datatype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/common.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_dataset.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_file.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_selections.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_dtype.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_h5.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_file2.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_completions.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_filters.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_base.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
copying h5py/tests/test_objects.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests
creating build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
copying h5py/tests/data_files/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
creating build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
copying h5py/tests/test_vds/test_highlevel_vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
copying h5py/tests/test_vds/test_virtual_source.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
copying h5py/tests/test_vds/__init__.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
copying h5py/tests/test_vds/test_lowlevel_vds.py -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/test_vds
copying h5py/tests/data_files/vlen_string_s390x.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
copying h5py/tests/data_files/vlen_string_dset_utc.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
copying h5py/tests/data_files/vlen_string_dset.h5 -> build/lib.macosx-10.9-universal2-3.9/h5py/tests/data_files
running build_ext
Building h5py requires pkg-config unless the HDF5 path is explicitly specified
error: pkg-config probably not installed: FileNotFoundError(2, 'No such file or directory')
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for h5py
Failed to build h5py
ERROR: Could not build wheels for h5py, which is required to install pyproject.toml-based projects
After some digging, it turns out that hdf5 has to be installed with brew first, that is, run:
arch -arm64 brew install hdf5
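That took all morning... brew is painfully slow... (old-man-squinting-at-phone.jpg)
But even with hdf5 installed, the install still failed. I consulted this blog post:
and found that you first need to run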
find /opt -iname "*hdf5.h*"
which, barring surprises, will find:
/opt/homebrew/include/hdf5.h
/opt/homebrew/Cellar/hdf5/1.13.0/include/hdf5.h
Then run
export CPATH="/opt/homebrew/include/"
export HDF5_DIR=/opt/homebrew/
to set the environment variables, and finally re-run
$ pip install tensorflow-macos
$ pip install tensorflow-metal
and you're done!
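The find-and-export step above can also be sketched in Python. This is a rough illustration only: the function names find_hdf5_headers and hdf5_env are made up, the search matches the exact file name hdf5.h rather than the looser pattern used with find, and it assumes the layout shown above, where the header sits at <prefix>/include/hdf5.h.

```python
import os
from pathlib import Path


def find_hdf5_headers(root="/opt"):
    """Return every file named hdf5.h found under root."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "hdf5.h" in filenames:
            matches.append(Path(dirpath) / "hdf5.h")
    return sorted(matches)


def hdf5_env(header):
    """Derive CPATH and HDF5_DIR from a header located at <prefix>/include/hdf5.h."""
    include_dir = Path(header).parent
    return {"CPATH": str(include_dir), "HDF5_DIR": str(include_dir.parent)}


if __name__ == "__main__":
    headers = find_hdf5_headers()
    if headers:
        # Prefer the shortest path, e.g. /opt/homebrew/include/hdf5.h over the Cellar copy.
        env = hdf5_env(min(headers, key=lambda p: len(p.parts)))
        for name, value in env.items():
            print(f'export {name}="{value}"')
    else:
        print("hdf5.h not found under /opt; is hdf5 installed?")
```

The printed export lines can be pasted into the same shell before re-running pip, which avoids hard-coding the Homebrew prefix.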
I gingerly started python and typed import tensorflow. After what felt like ten thousand years, finally:
Python 3.9.4 (v3.9.4:1f2e3088f3, Apr 4 2021, 12:19:19)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>>
Success! Now we can move on to the exciting part: the code.
Experiments
First, let's check whether the M1 GPU shows up among TensorFlow's devices:
>>> import tensorflow as tf
>>> tf.config.experimental.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
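Wow!
The last time I trained a neural network was three years ago... and back then I had only ever trained on MNIST, so I had to start from scratch (read: copy other people's code). The reference code is as follows: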
# env: tf-test
import ssl

import tensorflow as tf
from tensorflow import keras
from tensorflow.python.framework.ops import disable_eager_execution

# Run in TF1-style graph mode instead of eager mode.
disable_eager_execution()

# Workaround for certificate errors when downloading CIFAR-10.
# Note: this turns off HTTPS certificate verification for the whole process.
ssl._create_default_https_context = ssl._create_unverified_context

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# Scale pixel values to [0, 1] and one-hot encode the 10 class labels.
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = keras.utils.to_categorical(y_train, num_classes=10, dtype='float32')
y_test = keras.utils.to_categorical(y_test, num_classes=10, dtype='float32')

# from tensorflow.python.compiler.mlcompute import mlcompute
# mlcompute.set_mlc_device(device_name="gpu")

# Change '/GPU:0' to '/CPU:0' to run on the CPU instead.
with tf.device('/GPU:0'):
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(32, 32, 3)),
        keras.layers.Dense(3000, activation='relu'),
        keras.layers.Dense(1000, activation='relu'),
        # sigmoid is kept to match the logged runs; softmax is the usual choice with categorical_crossentropy.
        keras.layers.Dense(10, activation='sigmoid'),
    ])
    model.compile(optimizer="SGD", loss="categorical_crossentropy", metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test, verbose=2)
This draws on the blog post mentioned at the start of this article and the TensorFlow tutorial:
Output:
Train on 50000 samples
Metal device set to: Apple M1
systemMemory: 16.00 GB
maxCacheSize: 5.33 GB
2022-03-14 11:32:41.119904: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-03-14 11:32:41.120025: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-03-14 11:32:41.125025: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2022-03-14 11:32:41.125164: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-03-14 11:32:41.133741: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2022-03-14 11:32:41.449048: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
Epoch 1/5
2022-03-14 11:32:41.453886: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
50000/50000 [==============================] - 17s 331us/sample - loss: 1.8126 - accuracy: 0.3534
Epoch 2/5
50000/50000 [==============================] - 15s 291us/sample - loss: 1.6244 - accuracy: 0.4282
Epoch 3/5
50000/50000 [==============================] - 15s 292us/sample - loss: 1.5423 - accuracy: 0.4572
Epoch 4/5
50000/50000 [==============================] - 15s 292us/sample - loss: 1.4819 - accuracy: 0.4761
Epoch 5/5
50000/50000 [==============================] - 15s 293us/sample - loss: 1.4323 - accuracy: 0.4953
Switching from GPU to CPU:
Train on 50000 samples
Metal device set to: Apple M1
systemMemory: 16.00 GB
maxCacheSize: 5.33 GB
2022-03-14 11:38:11.071008: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-03-14 11:38:11.071162: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-03-14 11:38:11.075712: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/5
50000/50000 [==============================] - 24s 477us/sample - loss: 1.8087 - accuracy: 0.3563
Epoch 2/5
50000/50000 [==============================] - 24s 489us/sample - loss: 1.6201 - accuracy: 0.4260
Epoch 3/5
50000/50000 [==============================] - 24s 486us/sample - loss: 1.5392 - accuracy: 0.4527
Epoch 4/5
50000/50000 [==============================] - 24s 485us/sample - loss: 1.4784 - accuracy: 0.4787
Epoch 5/5
50000/50000 [==============================] - 24s 485us/sample - loss: 1.4296 - accuracy: 0.4960
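Switching a run between GPU and CPU presumably comes down to the device string passed to tf.device. Here is a minimal sketch of that idea on a toy matrix multiplication rather than the full training script. The helper names device_string and run_matmul are made up for illustration, and unlike the script above this sketch runs in TensorFlow's default eager mode.

```python
def device_string(use_gpu: bool) -> str:
    """Device string for tf.device(): '/GPU:0' or '/CPU:0'."""
    return "/GPU:0" if use_gpu else "/CPU:0"


def run_matmul(use_gpu: bool) -> str:
    """Multiply two random matrices on the chosen device and report where the result lives."""
    import tensorflow as tf  # imported lazily so device_string() works without TensorFlow

    with tf.device(device_string(use_gpu)):
        a = tf.random.normal((512, 512))
        product = tf.matmul(a, a)
    return product.device


if __name__ == "__main__":
    try:
        print(run_matmul(use_gpu=False))
    except ImportError:
        print("TensorFlow is not installed; device string would be", device_string(False))
```

Printing the device of the result is a cheap way to confirm that an operation really ran where it was supposed to.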
I also ran the same test on a server with an Nvidia A100 GPU:
Train on 50000 samples
2022-03-14 12:02:28.306455: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-14 12:02:29.499951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 861 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:3b:00.0, compute capability: 8.0
2022-03-14 12:02:29.500734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 1082 MB memory: -> device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:af:00.0, compute capability: 8.0
Epoch 1/5
2022-03-14 12:02:30.896021: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
50000/50000 [==============================] - 6s 125us/sample - loss: 1.8107 - accuracy: 0.3550
Epoch 2/5
50000/50000 [==============================] - 5s 97us/sample - loss: 1.6262 - accuracy: 0.4248
Epoch 3/5
50000/50000 [==============================] - 5s 98us/sample - loss: 1.5425 - accuracy: 0.4559
Epoch 4/5
50000/50000 [==============================] - 5s 96us/sample - loss: 1.4822 - accuracy: 0.4783
Epoch 5/5
50000/50000 [==============================] - 5s 98us/sample - loss: 1.4330 - accuracy: 0.4953
Results on the server's CPU:
Train on 50000 samples
2022-03-14 12:03:27.529990: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-14 12:03:28.747153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 861 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:3b:00.0, compute capability: 8.0
2022-03-14 12:03:28.747977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 1082 MB memory: -> device: 1, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:af:00.0, compute capability: 8.0
Epoch 1/5
50000/50000 [==============================] - 34s 679us/sample - loss: 1.8070 - accuracy: 0.3575
Epoch 2/5
50000/50000 [==============================] - 34s 679us/sample - loss: 1.6191 - accuracy: 0.4296
Epoch 3/5
50000/50000 [==============================] - 34s 678us/sample - loss: 1.5401 - accuracy: 0.4551
Epoch 4/5
50000/50000 [==============================] - 34s 677us/sample - loss: 1.4805 - accuracy: 0.4778
Epoch 5/5
50000/50000 [==============================] - 34s 676us/sample - loss: 1.4335 - accuracy: 0.4943
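Roughly speaking, at about 15 s per epoch the M1 GPU is a bit more than twice as fast as the server-grade Intel CPU (about 34 s) and about 1.6 times as fast as the M1's own CPU (about 24 s). It still gets crushed by Nvidia (strike that), whose A100 needs only about 5 s per epoch.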
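The comparison can be made concrete with the per-epoch times from the logs above. This is just back-of-the-envelope arithmetic: the numbers are the rounded steady-state epoch times printed by Keras (epochs 2 to 5), and the names EPOCH_SECONDS and speedup are made up for illustration.

```python
# Approximate seconds per epoch, read off the logs above (epochs 2-5).
EPOCH_SECONDS = {
    "M1 GPU": 15,
    "M1 CPU": 24,
    "A100 GPU": 5,
    "Server CPU": 34,
}


def speedup(slower, faster, times=EPOCH_SECONDS):
    """How many times faster `faster` is than `slower` (ratio of epoch times)."""
    return times[slower] / times[faster]


if __name__ == "__main__":
    print(f"M1 GPU vs server CPU: {speedup('Server CPU', 'M1 GPU'):.2f}x")  # 2.27x
    print(f"M1 GPU vs M1 CPU:     {speedup('M1 CPU', 'M1 GPU'):.2f}x")      # 1.60x
    print(f"A100 vs M1 GPU:       {speedup('M1 GPU', 'A100 GPU'):.2f}x")    # 3.00x
```

These are single runs of a small dense network, so the ratios are a rough feel rather than a benchmark.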