taowen
First published in taowen

Four Styles of Elegant Concurrency

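The human brain is not good at reasoning about concurrent logic. Source code written in an explicitly concurrent shape is very unfriendly to the reader. That is why we need various kinds of syntactic sugar that take "elegant" serial code and translate it into a concurrent form for execution. There are four main ways to do this.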

1. Speculative execution

Speculative execution is not something only CPUs can do. We can do it ourselves in software.

The classic representative of this style is Haxl: facebook/Haxl

let fetchSingleElement = function (arg) {
    console.log('fetch single element', arg);
    return arg;
};

let fetchBatchElements = function (args) {
    console.log('fetch batch elements', args);
    return args;
};

let fakeResult = {};

function speculativeExecute(f) {
    let cache = {};
    let collectedArgs = [];
    fetchSingleElement = function (arg) {
        if (cache[arg] !== undefined) {
            return cache[arg];
        }
        collectedArgs.push(arg);
        return fakeResult
    };
    while(true) {
        let result = f();
        if (collectedArgs.length === 0) {
            return result;
        }
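        // Array.prototype.entries() yields [index, value], so map each position back to the requested arg.
        const fetched = fetchBatchElements(collectedArgs);
        collectedArgs.forEach((arg, i) => {
            cache[arg] = fetched[i];
        });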
        collectedArgs = [];
    }
}

function businessLogic() {
    let result = 0;
    for (let i = 0; i < 10; i++) {
        result += fetchSingleElement(i);
    }
    return result;
}

// console.log('result', businessLogic());
console.log('result', speculativeExecute(businessLogic));

The output is:

fetch batch elements [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
result 45

Running businessLogic directly instead prints:

fetch single element 0
fetch single element 1
fetch single element 2
fetch single element 3
fetch single element 4
fetch single element 5
fetch single element 6
fetch single element 7
fetch single element 8
fetch single element 9
result 45

In essence, this kind of data-flow dependency analysis could also be done by a compiler.
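The replay loop in speculativeExecute also copes with fetches that depend on earlier results: the function is simply re-run, and each round batches whatever is newly reachable. The following is a minimal sketch of that idea, not Haxl's actual mechanism. The names fetchBatch, speculate, PENDING and dependentLogic are hypothetical, and the business logic here checks for the placeholder explicitly, which the original example does not need to do.

```javascript
// Hypothetical batch backend: doubles each key and records every batch it receives.
const batches = [];
function fetchBatch(keys) {
    batches.push([...keys]);
    return keys.map((k) => k * 2);
}

// Placeholder returned while a value has not been fetched yet.
const PENDING = Symbol('pending');

// Re-run f until every fetch it makes can be served from the cache.
function speculate(f) {
    const cache = new Map();
    while (true) {
        const wanted = [];
        const fetch = (key) => {
            if (cache.has(key)) return cache.get(key);
            if (!wanted.includes(key)) wanted.push(key); // deduplicate within a round
            return PENDING;
        };
        const result = f(fetch);
        if (wanted.length === 0) return result; // nothing missing: the result is real
        const values = fetchBatch(wanted);       // one batch per round
        wanted.forEach((key, i) => cache.set(key, values[i]));
    }
}

// The third fetch depends on the first two, so it can only be issued in a later round.
function dependentLogic(fetch) {
    const a = fetch(1);
    const b = fetch(2);
    if (a === PENDING || b === PENDING) return PENDING; // cannot go further yet
    return fetch(a + b);
}

const answer = speculate(dependentLogic);
console.log(answer, batches);
```

Here the two independent fetches share the first batch, [1, 2], and the dependent fetch forms a second batch, [6], so answer ends up as 12 after three runs of dependentLogic.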

2. Coroutines and synchronous-style calls

const verifyUser = function(username, password, callback){
   dataBase.verifyUser(username, password, (error, userInfo) => {
       if (error) {
           callback(error)
       }else{
           dataBase.getRoles(username, (error, roles) => {
               if (error){
                   callback(error)
               }else {
                   dataBase.logAccess(username, (error) => {
                       if (error){
                           callback(error);
                       }else{
                           callback(null, userInfo, roles);
                       }
                   })
               }
           })
       }
   })
};

Compare that with:

const verifyUser = async function(username, password){
   try {
       const userInfo = await dataBase.verifyUser(username, password);
       // Same arguments and same results as the callback version above.
       const roles = await dataBase.getRoles(username);
       await dataBase.logAccess(username);
       return { userInfo, roles };
   }catch (e){
       //handle errors as needed
   }
};

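Coroutines mainly solve the problem that threads are scarce and we are afraid of blocking them: many coroutines share a limited number of threads.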
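To make "many coroutines sharing one thread" concrete, here is a small sketch that uses generator functions as coroutines and a round-robin loop as the scheduler. The names worker and runAll are hypothetical, and a real runtime would resume a coroutine when its I/O completes rather than in strict rotation.

```javascript
// Each coroutine is a generator; `yield` marks a point where it would otherwise block.
function* worker(name, steps, trace) {
    for (let i = 1; i <= steps; i++) {
        trace.push(`${name}:${i}`);
        yield; // hand the thread back to the scheduler
    }
    return name;
}

// Hypothetical round-robin scheduler: one thread, many coroutines.
function runAll(coroutines) {
    const finishedOrder = [];
    const queue = [...coroutines];
    while (queue.length > 0) {
        const co = queue.shift();
        const { done, value } = co.next(); // run until the next yield or return
        if (done) finishedOrder.push(value);
        else queue.push(co);               // not finished: go to the back of the queue
    }
    return finishedOrder;
}

const trace = [];
const finishedOrder = runAll([
    worker('a', 2, trace),
    worker('b', 1, trace),
    worker('c', 3, trace),
]);
console.log(trace.join(' '), '|', finishedOrder.join(' '));
```

Each worker body reads as plain serial code, yet the trace shows the three of them interleaving on a single thread, with the shortest one finishing first.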

3. Letting the compiler translate the for loop

The classic representative is Intel's ispc, the Intel® SPMD Program Compiler.

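When writing SIMD code, we have to think about several "data lanes" at the same time. The %ymm0 below is a 256-bit register, which at 32 bits per element gives 8 data lanes.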

LBB0_3:
    vpaddd    %ymm5, %ymm1, %ymm8
    vblendvps %ymm7, %ymm8, %ymm1, %ymm1
    vmulps    %ymm0, %ymm3, %ymm7
    vblendvps %ymm6, %ymm7, %ymm3, %ymm3
    vpcmpeqd  %ymm4, %ymm1, %ymm8
    vmovaps   %ymm6, %ymm7
    vpandn    %ymm6, %ymm8, %ymm6
    vpand     %ymm2, %ymm6, %ymm8
    vmovmskps %ymm8, %eax
    testl     %eax, %eax
    jne       LBB0_3

If, instead of driving several data lanes with a single instruction, we write logic that handles one lane on its own, the code is much simpler:

float powi(float a, int b) {
    float r = 1;
    while (b--)
        r *= a;
    return r;
}

After compiling with ispc (ispc.github.io/ispc.html), the code above is converted from SPMD style into SIMD style. The two are in fact equivalent code.
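The equivalence can be illustrated by emulating the lanes by hand. The sketch below is not what ispc emits; it only mimics the shape of the assembly above, where every lane advances in lockstep under a mask and the loop ends once no lane is active. The function powiLanes is hypothetical, and the scalar loop uses `b-- > 0` so that a negative exponent terminates.

```javascript
// Scalar (SPMD-style) reference: one program instance handles one lane.
function powi(a, b) {
    let r = 1;
    while (b-- > 0) r *= a;
    return r;
}

// SIMD-style emulation: all lanes step together under an execution mask.
function powiLanes(as, bs) {
    const lanes = as.length;
    const r = new Array(lanes).fill(1);
    const b = [...bs];
    let active = b.map((n) => n > 0);   // per-lane mask: is this lane still looping?
    while (active.some(Boolean)) {      // roughly: extract the mask, test it, jump back
        for (let i = 0; i < lanes; i++) {
            if (active[i]) {            // roughly: a masked blend keeps finished lanes unchanged
                r[i] *= as[i];
                b[i] -= 1;
            }
        }
        active = b.map((n) => n > 0);
    }
    return r;
}

const bases = [2, 3, 1.5, 10];
const exponents = [3, 0, 2, 1];
const laneResults = powiLanes(bases, exponents);
const scalarResults = bases.map((a, i) => powi(a, exponents[i]));
console.log(laneResults, scalarResults);
```

Both versions produce the same values per lane. The lane version keeps looping until the lane with the largest exponent is done, while finished lanes are masked off.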

4. Hardware that directly supports massive multithreading

The classic representative is NVIDIA's CUDA.

__global__ void parallel_shared_reduce_kernel(float *d_out, float* d_in){
    int myID = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;
    extern __shared__ float sdata[];
    sdata[tid] = d_in[myID];
    __syncthreads();
    // Tree reduction: in each iteration the first s threads add the upper half of the
    // active range onto the lower half, halving the element count until one remains.
    for(unsigned int s = blockDim.x / 2 ; s>0; s>>=1){
        if(tid<s){
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads(); //ensure all adds at one iteration are done
    }
    if (tid == 0){
        d_out[blockIdx.x] = sdata[0]; // the block's sum ends up in sdata[0]
    }
}

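The code is written as if it handles one element (for example, one pixel) at a time, while the hardware directly supports running a very large number of such threads.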
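To see what the kernel's reduction loop computes, one thread block can be emulated on the CPU. This is only a sketch: blockReduce is a hypothetical helper, the block size is assumed to be a power of two, and the inner loop plays the threads one after another, which is safe here because within a step no thread reads a slot that another thread writes.

```javascript
// Emulates one thread block of the reduction kernel; each inner iteration plays one thread.
function blockReduce(input) {
    const sdata = [...input];                // stands in for the block's shared memory
    const blockDim = sdata.length;           // assumed to be a power of two
    for (let s = blockDim >> 1; s > 0; s >>= 1) {
        for (let tid = 0; tid < s; tid++) {  // only threads with tid < s do work
            sdata[tid] += sdata[tid + s];
        }
        // finishing the inner loop plays the role of __syncthreads()
    }
    return sdata[0];                         // thread 0 would write this to d_out
}

const blockSum = blockReduce([1, 2, 3, 4, 5, 6, 7, 8]);
console.log(blockSum);
```

With 8 elements the loop runs three steps (s = 4, 2, 1), so the sum of n elements needs about log2(n) synchronized steps instead of n sequential additions.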

Published on 2018-11-05
