OpenCV 4.3 with Tengine (Speed Edition)

Links

  • OpenCV 4.3 with Tengine (Stable Edition): zhuanlan.zhihu.com
  • https://github.com/BUG1989/opencv/tree/4.3.0-tengine-faster
  • OAID/Tengine: github.com

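Background

OpenCV 4.3.0 has quietly landed on GitHub. Tengine, newly added to the DNN Module as an inference backend for Arm platforms, already delivers a visible speedup in the official OpenCV release (the Stable Edition).

Reality, however, is harsh. The whole field of open-source deep learning inference frameworks for mobile started in 2017 with ncnn from the Goose Factory's (Tencent's) SNG (I just refuse to say CSIG; shout-out to its author). After three years of "building an ecosystem together" (read: chasing each other), the frameworks that have followed "one after another" (read: through mysterious rewrites) include, but are not limited to: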

ncnn、mdl、Tengine、TFLite、FeatherCNN、MACE、paddle-mobile、zqcnn、Anakin、MNN、paddle-lite、bolt

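I really must thank the whole crowd above for squeezing Arm CPU compute almost to its limit. So, as the Arm backend officially recognized by OpenCV, standing in front of the Bear (毛熊) and the Eagle (鹰酱), we cannot let our Rabbit (我兔) lose face!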

[Sticker: 野萌君]

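All right, 圈圈虫 will lay his cards on the table: while working on this PR there was an intermediate, temporary version whose speed was decent. It did not make the cut because OpenCV's CI is very strict (read: I am not good enough). Get it working first, then worry about making it fast; I hope it can be merged into the mainline branch during later development.

First, let's see what OpenCV DNN actually is.

A Quick Tour of the DNN Module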

[Sticker: 野萌君]

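Whether for historical reasons, or because there were no other inference frameworks to refer to when DNN was first architected, its computation graph is designed around a Layer structure rather than the now-popular combination Graph = Node + Tensor + Operator. Judged purely on practicality for today's inference workloads, neither Layer nor Graph is inherently better: a wheel that is simple, easy to use, stable, and solves the problem is a good wheel. Below is the mind map I put together while studying the DNN Module source code (my ability is limited, it is updated irregularly, and it is for reference only): OpenCV DNN Framework (Baidu mind map)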

OpenCV DNN Framework

OpenCV Extra

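opencv_extra is a companion project of opencv. It holds all the dependency files needed by the Sample Test, Perf Test, Unit Test and other parts of the opencv library, including but not limited to: input data, reference output data, images, and network models for each supported framework. Since we are only developing for the DNN Module, what follows is just a glimpse of it.

The larger network models are not stored directly in the repository. They have to be fetched with the opencv_extra/testdata/dnn/download_models.py script over a "scientific" connection (thanks to 闲来 for donating 17 GB of traffic, you-know-what-I-mean.jpg).

The DNN Module tests fall into two major parts: Layer (OP) and Network. If you only git clone opencv_extra (about 1.04 GB......) and do not download the network model files, only the Layer-level tests will run to completion. Once the git clone finishes, add the absolute path of opencv_extra/testdata/ to the environment variable OPENCV_TEST_DATA_PATH: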

export OPENCV_TEST_DATA_PATH=~/github/opencv_extra/testdata/
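
Before running anything, it can help to check which model files are actually present under that directory. The snippet below is only a sketch: the helper name missing_test_files is made up for illustration, and the two file names are simply the ones that show up in the "Can't find data file" messages in the logs of this post.

```python
import os

def missing_test_files(testdata_root, relative_paths):
    """Return the relative paths that do not exist under testdata_root."""
    return [p for p in relative_paths
            if not os.path.isfile(os.path.join(testdata_root, p))]

# File names copied from the skip messages in this post's logs.
WANTED = [
    "dnn/bvlc_googlenet.caffemodel",
    "dnn/fast_neural_style_eccv16_starry_night.t7",
]

if __name__ == "__main__":
    root = os.environ.get("OPENCV_TEST_DATA_PATH")
    if root is None:
        print("OPENCV_TEST_DATA_PATH is not set")
    else:
        # Expand "~" in case the variable was set with a quoted path.
        root = os.path.expanduser(root)
        for path in missing_test_files(root, WANTED):
            print("missing:", path)
```

Any file it reports as missing corresponds to a Network-level test that will be skipped rather than run.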

Build Command

For the sake of the walkthrough, we build the Linux x86 version here:

cmake -DOPENCV_ENABLE_NONFREE=ON -DBUILD_EXAMPLES=ON -DBUILD_PERF_TESTS=ON -DBUILD_TESTS=ON -DWITH_OPENCL=OFF -DBUILD_DOCS=OFF -DWITH_CUDA=OFF -DCMAKE_BUILD_TYPE=release -DENABLE_PROFILING=OFF ..

After a successful build, the DNN Module test binaries are generated in the following location:

qtang@tengine-train:~/github/opencv/build-linux-x86$ tree bin/ | grep dnn
├── example_dnn_classification
├── example_dnn_colorization
├── example_dnn_object_detection
├── example_dnn_openpose
├── example_dnn_segmentation
├── example_dnn_text_detection
├── opencv_perf_dnn
├── opencv_test_dnn

Unit (Functional) Tests

Building them is easy: pass -DBUILD_TESTS=ON when configuring the project with CMake.

Basic Usage

Run opencv_test_dnn directly:

qtang@tengine-train:~/github/opencv/build-linux-x86$ ./bin/opencv_test_dnn
CTEST_FULL_OUTPUT
OpenCV version: 4.3.0
OpenCV VCS version: 4.3.0-3-g88e5964761
Build type: release
WARNING: build value differs from runtime: Release
Compiler: /usr/bin/c++  (ver 7.5.0)
Parallel framework: pthreads (nthreads=16)
CPU features: SSE SSE2 SSE3 *SSE4.1 *SSE4.2 *FP16 *AVX *AVX2 *AVX512-SKX?
Intel(R) IPP version: ippIP AVX2 (l9) 2020.0.0 Gold (-) Oct 19 2019
TEST: Skip tests with tags: 'mem_6gb', 'verylong', 'dnn_skip_halide', 'dnn_skip_ocl', 'dnn_skip_ocl_fp16', 'dnn_skip_ie_ocl', 'dnn_skip_ie_ocl_fp16'
[==========] Running 934 tests from 65 test cases.
[----------] Global test environment set-up.
[----------] 5 tests from Test_Caffe
[ RUN      ] Test_Caffe.memory_read
[     SKIP ] OpenCV tests: Can't find data file: dnn/bvlc_googlenet.caffemodel
[       OK ] Test_Caffe.memory_read (0 ms)
[ RUN      ] Test_Caffe.read_gtsrb
[       OK ] Test_Caffe.read_gtsrb (2 ms)
[ RUN      ] Test_Caffe.read_googlenet
[       OK ] Test_Caffe.read_googlenet (2 ms)
[ RUN      ] Test_Caffe.multiple_inputs
[       OK ] Test_Caffe.multiple_inputs (1 ms)
[ RUN      ] Test_Caffe.shared_weights
[       OK ] Test_Caffe.shared_weights (1 ms)
[----------] 5 tests from Test_Caffe (6 ms total)
......(a huge number of messages in between omitted here)
[ RUN      ] Test_Torch_nets.FastNeuralStyle_accuracy/0, where GetParam() = OCV/CPU
[     SKIP ] OpenCV tests: Can't find data file: dnn/fast_neural_style_eccv16_starry_night.t7
[       OK ] Test_Torch_nets.FastNeuralStyle_accuracy/0 (0 ms)
[----------] 3 tests from Test_Torch_nets (0 ms total)

[----------] Global test environment tear-down
[ SKIPSTAT ] 95 tests skipped
[ SKIPSTAT ] TAG='mem_6gb' skip 1 tests
[ SKIPSTAT ] TAG='verylong' skip 1 tests
[ SKIPSTAT ] TAG='skip_other' skip 93 tests
[==========] 934 tests from 65 test cases ran. (440 ms total)
[  PASSED  ] 934 tests.
qtang@tengine-train:~/github/opencv_tq/build-linux-x86$

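The output contains the line "TAG='skip_other' skip 93 tests". These 93 skipped tests are the cases that depend on network model files: when the required model file cannot be found, the test is skipped by default.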
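
When the log gets long, the skip summary can be pulled out mechanically. A minimal sketch, assuming the SKIPSTAT lines keep exactly the format shown in the log above (the function name skipped_by_tag is made up for illustration):

```python
import re

# Matches lines such as: [ SKIPSTAT ] TAG='skip_other' skip 93 tests
SKIPSTAT_TAG = re.compile(r"\[ SKIPSTAT \] TAG='([^']+)' skip (\d+) tests")

def skipped_by_tag(log_text):
    """Map each skip tag found in the log to its number of skipped tests."""
    return {tag: int(count) for tag, count in SKIPSTAT_TAG.findall(log_text)}

# Summary lines copied from the opencv_test_dnn run in this post.
LOG = """\
[ SKIPSTAT ] 95 tests skipped
[ SKIPSTAT ] TAG='mem_6gb' skip 1 tests
[ SKIPSTAT ] TAG='verylong' skip 1 tests
[ SKIPSTAT ] TAG='skip_other' skip 93 tests
"""

if __name__ == "__main__":
    counts = skipped_by_tag(LOG)
    for tag, count in counts.items():
        print(tag, count)
    print("total", sum(counts.values()))
```

The per-tag counts add up to the 95 skipped tests reported on the first SKIPSTAT line.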

Filtering

Suppose we only want to run the test cases for the Convolution Layer:

qtang@tengine-train:~/github/opencv/build-linux-x86$ ./bin/opencv_test_dnn --gtest_filter=*Conv*:*conv*
CTEST_FULL_OUTPUT
OpenCV version: 4.3.0
OpenCV VCS version: 4.3.0-3-g88e5964761
Build type: release
WARNING: build value differs from runtime: Release
Compiler: /usr/bin/c++  (ver 7.5.0)
Parallel framework: pthreads (nthreads=16)
CPU features: SSE SSE2 SSE3 *SSE4.1 *SSE4.2 *FP16 *AVX *AVX2 *AVX512-SKX?
Intel(R) IPP version: ippIP AVX2 (l9) 2020.0.0 Gold (-) Oct 19 2019
TEST: Skip tests with tags: 'mem_6gb', 'verylong', 'dnn_skip_halide', 'dnn_skip_ocl', 'dnn_skip_ocl_fp16', 'dnn_skip_ie_ocl', 'dnn_skip_ie_ocl_fp16'
Note: Google Test filter = *Conv*:*conv*
[==========] Running 197 tests from 9 test cases.
[----------] Global test environment set-up.
[----------] 1 test from Layer_Test_Convolution
[ RUN      ] Layer_Test_Convolution.relu_fusion
[       OK ] Layer_Test_Convolution.relu_fusion (0 ms)
[----------] 1 test from Layer_Test_Convolution (0 ms total)

[----------] 1 test from Test_Darknet_layers
[ RUN      ] Test_Darknet_layers.convolutional/0, where GetParam() = OCV/CPU
[       OK ] Test_Darknet_layers.convolutional/0 (1 ms)
[----------] 1 test from Test_Darknet_layers (1 ms total)

[----------] 128 tests from Layer_Test_Halide/Convolution
[ RUN      ] Layer_Test_Halide/Convolution.Accuracy/0, where GetParam() = ([6, 4, 1], 5x6, 3x1, 1x1, 1x0, 1x1, false, OCV/CPU)
[       OK ] Layer_Test_Halide/Convolution.Accuracy/0 (0 ms)
[ RUN      ] Layer_Test_Halide/Convolution.Accuracy/1, where GetParam() = ([6, 4, 1], 5x6, 3x1, 1x1, 1x0, 1x1, true, OCV/CPU)
[       OK ] Layer_Test_Halide/Convolution.Accuracy/1 (0 ms)
[ RUN      ] Layer_Test_Halide/Convolution.Accuracy/2, where GetParam() = ([6, 4, 1], 5x6, 3x1, 1x1, 1x0, 2x2, false, OCV/CPU)
[       OK ] Layer_Test_Halide/Convolution.Accuracy/2 (0 ms)
......(a huge number of messages in between omitted here)
[----------] 3 tests from Test_Torch_layers
[ RUN      ] Test_Torch_layers.run_convolution/0, where GetParam() = OCV/CPU
[       OK ] Test_Torch_layers.run_convolution/0 (0 ms)
[ RUN      ] Test_Torch_layers.run_deconv/0, where GetParam() = OCV/CPU
[       OK ] Test_Torch_layers.run_deconv/0 (1 ms)
[ RUN      ] Test_Torch_layers.net_conv_gemm_lrn/0, where GetParam() = OCV/CPU
[       OK ] Test_Torch_layers.net_conv_gemm_lrn/0 (0 ms)
[----------] 3 tests from Test_Torch_layers (1 ms total)

[----------] Global test environment tear-down
[ SKIPSTAT ] 1 tests skipped
[ SKIPSTAT ] TAG='skip_other' skip 1 tests
[==========] 197 tests from 9 test cases ran. (37 ms total)
[  PASSED  ] 197 tests.

It is all built on gtest, so if you already know it well, please go easy on me.
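
The filter above lists both *Conv* and *conv* because gtest patterns are case-sensitive: `*` matches any string, `?` matches any single character, and `:` separates alternatives. Here is a minimal Python sketch of that matching rule, applied to test names taken from the logs in this post. It handles positive patterns only (the negative `-` part of a filter is deliberately left out), and the function names are made up for illustration.

```python
import re

def _pattern_to_regex(pattern):
    """Translate one gtest-style wildcard pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any string, including the empty one
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def matches_filter(test_name, gtest_filter):
    """True if the full test name matches any ':'-separated pattern."""
    patterns = [p for p in gtest_filter.split(":") if p]
    return any(_pattern_to_regex(p).fullmatch(test_name) for p in patterns)

if __name__ == "__main__":
    names = [
        "Layer_Test_Convolution.relu_fusion",
        "Test_Darknet_layers.convolutional/0",
        "Test_Torch_layers.run_deconv/0",
        "Test_Caffe.read_gtsrb",
    ]
    for name in names:
        print(name, matches_filter(name, "*Conv*:*conv*"))
```

When typing the filter in a shell, it is safer to quote it, e.g. --gtest_filter='*Conv*:*conv*', so that the shell does not try to expand the asterisks itself.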

Performance (Speed) Tests

Building them is just as easy: pass -DBUILD_PERF_TESTS=ON when configuring the project with CMake.

Basic Usage

qtang@tengine-train:~/github/opencv/build-linux-x86$ ./bin/opencv_perf_dnn
Time compensation is 0
TEST: Skip tests with tags: 'mem_6gb', 'verylong'
CTEST_FULL_OUTPUT
OpenCV version: 4.3.0
OpenCV VCS version: 4.3.0-3-g88e5964761
Build type: release
WARNING: build value differs from runtime: Release
Compiler: /usr/bin/c++  (ver 7.5.0)
Parallel framework: pthreads (nthreads=16)
CPU features: SSE SSE2 SSE3 *SSE4.1 *SSE4.2 *FP16 *AVX *AVX2 *AVX512-SKX?
Intel(R) IPP version: ippIP AVX2 (l9) 2020.0.0 Gold (-) Oct 19 2019
[==========] Running 135 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 100 tests from Conv
[ RUN      ] Conv.conv/0, where GetParam() = (GFLOPS=10.087, K=[3 x 3], IN={1, 576, 38, 50}, OCN=512, PM=SAME, BIAS, OCV/CPU)
IN=4275 Kb [ 1 576 38 50 ]    OUT=3800 Kb [ 1 512 38 50 ]    Weights(parameters): 10370 Kb    MFLOPS=10087
[ PERFSTAT ]    (samples=10   mean=12.70   median=12.70   min=12.69   stddev=0.01 (0.1%))
[       OK ] Conv.conv/0 (149 ms)
[ RUN      ] Conv.conv/1, where GetParam() = (GFLOPS=1.704, K=[3 x 3], IN={1, 512, 19, 19}, OCN=512, G=512, P=[1 x 1], BIAS, OCV/CPU)
IN=722 Kb [ 1 512 19 19 ]    OUT=722 Kb [ 1 512 19 19 ]    Weights(parameters): 20 Kb    MFLOPS=1703.6
[ PERFSTAT ]    (samples=13   mean=0.20   median=0.20   min=0.20   stddev=0.00 (0.7%))
......(a huge number of messages in between omitted here)
[ RUN      ] DNNTestNetwork.Inception_v2_Faster_RCNN/0, where GetParam() = OCV/CPU
[     SKIP ] OpenCV tests: Can't find data file: dnn/faster_rcnn_inception_v2_coco_2018_01_28.pb
[       OK ] DNNTestNetwork.Inception_v2_Faster_RCNN/0 (2 ms)
[----------] 19 tests from DNNTestNetwork (10 ms total)

[----------] Global test environment tear-down
[ SKIPSTAT ] 19 tests skipped
[ SKIPSTAT ] TAG='skip_other' skip 19 tests
[==========] 135 tests from 3 test cases ran. (2748 ms total)
[  PASSED  ] 135 tests.

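Filtering

This works the same way as in the previous section, "Unit (Functional) Tests", so it is not repeated here.

How do you evaluate stability? People with some experience deploying on mobile are actually not that interested in the minimum latency of a network model; what they care about more is how stable that latency is. In the opencv_perf_dnn output:

  • stddev: shows stability
  • MFLOPS: the workload of the test case in millions of floating-point operations; divided by the measured time, it shows the compute throughput actually achieved

These two figures give performance optimization engineers some quantitative data to work with, and the approach is worth learning from and borrowing.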
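
To make the two figures concrete, here is a small sketch that parses a PERFSTAT line copied from the log above and derives a relative stddev and an achieved throughput. It rests on two assumptions that the log itself does not state: that the PERFSTAT times are in milliseconds, and that MFLOPS is an operation count rather than a rate. The function names are made up for illustration.

```python
import re

# Matches the numeric fields of a "[ PERFSTAT ]" line.
PERFSTAT = re.compile(
    r"samples=(\d+)\s+mean=([\d.]+)\s+median=([\d.]+)"
    r"\s+min=([\d.]+)\s+stddev=([\d.]+)"
)

def parse_perfstat(line):
    """Return the PERFSTAT fields as a dict, or None if the line has none."""
    m = PERFSTAT.search(line)
    if m is None:
        return None
    samples, mean, median, minimum, stddev = m.groups()
    return {
        "samples": int(samples),
        "mean": float(mean),
        "median": float(median),
        "min": float(minimum),
        "stddev": float(stddev),
    }

def relative_stddev_percent(stats):
    """Stability indicator: stddev as a percentage of the mean."""
    return 100.0 * stats["stddev"] / stats["mean"]

def throughput_gflops(mflop, mean_ms):
    """Assuming mean_ms is in milliseconds: MFLOP per ms equals GFLOP/s."""
    return mflop / mean_ms

# Line copied from the Conv.conv/0 case in this post (MFLOPS=10087).
LINE = ("[ PERFSTAT ]    (samples=10   mean=12.70   median=12.70   "
        "min=12.69   stddev=0.01 (0.1%))")

if __name__ == "__main__":
    stats = parse_perfstat(LINE)
    print("relative stddev (%):", relative_stddev_percent(stats))
    print("throughput (GFLOP/s):", throughput_gflops(10087, stats["mean"]))
```

A small relative stddev across many cases is the kind of signal the stability-minded reader is after; the throughput number is only as trustworthy as the unit assumption above.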

A Brief Look at the Samples

Building them is just as easy again: pass -DBUILD_EXAMPLES=ON when configuring the project with CMake.

[Sticker: 野萌君]

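There is a small pitfall here: the samples cannot be run from a console terminal as they are, so the example source code has to be modified.

Speed Comparison

It is probably somewhat faster than the current version, and it feels decent. At the very least it gives people doing on-device deployment a new (well, sort of), reliable, and simple option. Until the Speed Edition features are formally submitted to the official OpenCV code, anyone worried about stability can use Tengine directly for an early taste (it is a bit faster still).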

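Closing

I feel a little sentimental. Around this time three years ago, as a not-so-young programmer, I came to the city at the forefront of reform and opening-up. I was lucky enough to catch the AI rocket, and after a short period of exploration I found my technical niche: digging deep into Arm on-device deployment and getting low-bit quantization into shipped products. Along the way I got to know all kinds of big names in the field that I would never have dared to imagine meeting. On ordinary days those big names may lean toward "technical exchange" (read: mutual business flattery) aimed at the wider world (WeChat Moments), but everyone's original goal is still to push the whole industry toward efficient and orderly development.

This year is year one of open-source AI. I hope that by the end of the year 圈圈虫 can hand in an answer sheet the open-source community is happy with.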

[Sticker: 两米兔]

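Preview of OpenCV 4.3 with Tengine (Finale)

  1. An explanation of how libtengine is automatically detected, downloaded, built, and integrated into the DNN module;
  2. A brief discussion of the roadmap ahead;
  3. A few tricks for fixing NPU accuracy issues and avoiding pitfalls.

Edited on 2020-04-06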