Object Detection - CornerNet Notes

Paper: CornerNet: Detecting Objects as Paired Keypoints

Authors: Hei Law, Jia Deng

Link: arxiv.org/pdf/1808.0124

Code: github.com/umich-vl/Cor

Most current object detection models are anchor-based, especially one-stage detectors. Anchors have the following drawbacks:

  1. A very large number of anchors is usually required: an anchor counts as a positive only when it overlaps the ground truth sufficiently, so many anchors are needed to guarantee that every GT box is covered. Only a small fraction of them actually overlap a GT box, and the resulting imbalance between positive and negative samples hurts training.
  2. Anchors introduce many hyperparameters and design choices (priors), including their number, sizes, and aspect ratios (even though this does produce multi-scale and multi-ratio region proposals). This gets especially complicated with a multi-scale architecture, because anchors have to be designed for every scale.

The authors propose a brand-new detector that throws anchors away, reducing object detection to the detection of two keypoints: the top-left and bottom-right corners of the bounding box.

CornerNet has two key components:

  • Associative embedding, which pulls corners belonging to the same object together and pushes corners of different objects apart
  • Corner pooling, which encodes prior knowledge about corners to localize them better

Why should detecting corners be easier than detecting bounding box centers or proposals? The authors offer two possible reasons:

  1. The center of a box depends on all four sides, whereas a corner depends on only two, and corner pooling makes the problem easier still.
  2. Corners discretize the space of boxes more efficiently: O(wh) corner locations suffice to represent O(w^2h^2) possible anchor boxes, since every box is determined by one pair of corners.

Multi-task training

After corner pooling, a residual-block-like structure is applied, and the resulting feature is used for three tasks: heatmaps to predict corners, embeddings to group them, and offsets to correct the misalignment caused by downsampling. Each part is described below.
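A minimal sketch of such a three-headed prediction module (the module layout, channel widths, and plain conv block here are illustrative assumptions, not the paper's exact modified residual block):

import torch
import torch.nn as nn

class CornerHead(nn.Module):
    """Illustrative head: corner pooling would run before this module,
    then a residual-style block feeds three parallel output convs."""
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        # stand-in for the residual-style block after corner pooling
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )
        self.heatmap   = nn.Conv2d(in_ch, num_classes, 1)  # per-class corner heatmaps
        self.embedding = nn.Conv2d(in_ch, 1, 1)            # 1-D associative embedding
        self.offset    = nn.Conv2d(in_ch, 2, 1)            # (x, y) sub-pixel offsets

    def forward(self, x):
        x = self.conv(x)
        return self.heatmap(x), self.embedding(x), self.offset(x)

One such module is attached for the top-left corners and another for the bottom-right corners.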


Backbone

The hourglass network is chosen as the backbone.


Corner Pooling

Consider a simple problem: in an H\times W grid where every point is either 0 or 1 and the 1s form a single connected region, how do we find the region's corners? For the top-left corner: project the maximum of each row onto the leftmost column and the maximum of each column onto the topmost row, then combine the two projections to obtain the top-left point.

Similarly, to decide whether a point on the feature map is a corner, we need to check along both its row and its column.

To obtain the top-left corner, the computation is split into three parts:

  1. max pooling from bottom to top
  2. max pooling from right to left
  3. merge the two results (element-wise addition)

Take the bottom-to-top pass as an example: within each column the result is monotonically non-decreasing, which effectively encodes the corner prior. To locate the topmost boundary of an object we scan the column from bottom to top for its maximum; the position of that maximum is a possible corner location.

The bottom-right corner is computed analogously (top-to-bottom and left-to-right pooling).
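As a quick illustration (my own sketch, not code from the repo), the two directional max-poolings for the top-left corner can be written with torch.cummax on flipped feature maps:

import torch

def top_left_pool(f):
    # f: (N, C, H, W) feature map
    # vertical pass: at (i, j), take the max over all rows k >= i (bottom-to-top scan)
    t = f.flip([2]).cummax(dim=2).values.flip([2])
    # horizontal pass: at (i, j), take the max over all cols k >= j (right-to-left scan)
    l = f.flip([3]).cummax(dim=3).values.flip([3])
    return t + l  # merge by element-wise addition

The official repo instead implements each directional pool as a C++/CUDA extension. Below is its bottom pool (a top-to-bottom scan, the vertical pass for bottom-right corners):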


#include <torch/torch.h>

#include <vector>

std::vector<at::Tensor> pool_forward(
    at::Tensor input
) {
    // Initialize output
    at::Tensor output = at::zeros_like(input);

    // Input is n x c x h x w; the scan runs over the height dimension
    int64_t height = input.size(2);

    // Initialize the first row of the output with the first row of the input
    at::Tensor input_temp  = input.select(2, 0);
    at::Tensor output_temp = output.select(2, 0);
    output_temp.copy_(input_temp);

    // Scan top to bottom: output[ind + 1] = max(input[ind + 1], output[ind]),
    // so output[ind] holds the column-wise max over rows 0..ind
    at::Tensor max_temp;
    for (int64_t ind = 0; ind < height - 1; ++ind) {
        input_temp  = input.select(2, ind + 1);
        output_temp = output.select(2, ind);
        max_temp    = output.select(2, ind + 1);
        at::max_out(max_temp, input_temp, output_temp);
    }

    return { 
        output
    };
}

std::vector<at::Tensor> pool_backward(
    at::Tensor input,
    at::Tensor grad_output
) {
    auto output = at::zeros_like(input);

    int32_t batch   = input.size(0);
    int32_t channel = input.size(1);
    int32_t height  = input.size(2);
    int32_t width   = input.size(3);

    // Running column-wise max and the row index where it last occurred
    auto max_val = at::zeros(torch::CUDA(at::kFloat), {batch, channel, width});
    auto max_ind = at::zeros(torch::CUDA(at::kLong),  {batch, channel, width});

    // Row 0 initializes the running max
    auto input_temp = input.select(2, 0);
    max_val.copy_(input_temp);
    max_ind.fill_(0);

    // The gradient for row 0 flows straight through
    auto output_temp      = output.select(2, 0);
    auto grad_output_temp = grad_output.select(2, 0);
    output_temp.copy_(grad_output_temp);

    // un_max_ind is a view of max_ind, so it tracks the updates made below
    auto un_max_ind = max_ind.unsqueeze(2);
    auto gt_mask    = at::zeros(torch::CUDA(at::kByte),  {batch, channel, width});
    auto max_temp   = at::zeros(torch::CUDA(at::kFloat), {batch, channel, width});
    for (int32_t ind = 0; ind < height - 1; ++ind) {
        // recompute the forward scan's running argmax row by row
        input_temp = input.select(2, ind + 1);
        at::gt_out(gt_mask, input_temp, max_val);

        at::masked_select_out(max_temp, input_temp, gt_mask);
        max_val.masked_scatter_(gt_mask, max_temp);
        max_ind.masked_fill_(gt_mask, ind + 1);

        // route this row's gradient to the row that produced its maximum
        grad_output_temp = grad_output.select(2, ind + 1).unsqueeze(2);
        output.scatter_add_(2, un_max_ind, grad_output_temp);
    }

    return {
        output
    };
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def(
        "forward", &pool_forward, "Bottom Pool Forward",
        py::call_guard<py::gil_scoped_release>()
    );
    m.def(
        "backward", &pool_backward, "Bottom Pool Backward",
        py::call_guard<py::gil_scoped_release>()
    );
}

Associative Embedding Method

This part decides whether a pair of corners comes from the same object. Concretely, a 1x1 conv embeds the convolutional features to produce an embedding for each corner (a 1-D embedding in the paper); we want the embeddings of a pair belonging to the same object to be as close as possible, and those of different objects as far apart as possible. This gives two losses: a pull loss that groups corners of the same object and a push loss that separates corners of different objects.
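For reference, the two losses in the paper take the following form, where e_{t_k} and e_{b_k} are the embeddings of the top-left and bottom-right corners of object k, e_k is their average, and the margin \Delta is 1:

L_{pull} = \frac{1}{N}\sum_{k=1}^{N}\left[(e_{t_k}-e_k)^2+(e_{b_k}-e_k)^2\right], \qquad e_k=\frac{e_{t_k}+e_{b_k}}{2}

L_{push} = \frac{1}{N(N-1)}\sum_{k=1}^{N}\sum_{j\neq k}\max(0,\ \Delta-|e_k-e_j|)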

Heatmap Prediction

The network predicts two sets of heatmaps, one for top-left and one for bottom-right corners. Each set has size C\times H\times W, where C is the number of categories. Each channel is an H\times W mask indicating whether each location is a corner of that class.

The authors adjust the loss: negative locations within a radius r of a ground-truth corner have their penalty reduced by an unnormalized 2D Gaussian, so the closer a negative location is to the GT corner, the smaller its loss.

y_{cij} denotes the value of this 2D Gaussian and is used to reweight the penalty.
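Written out, the modified focal loss is ( p_{cij} is the predicted score at location (i, j) for class c, N is the number of objects, and \alpha=2, \beta=4 in the paper):

L_{det} = -\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}\begin{cases}(1-p_{cij})^{\alpha}\log(p_{cij}) & \text{if } y_{cij}=1\\(1-y_{cij})^{\beta}(p_{cij})^{\alpha}\log(1-p_{cij}) & \text{otherwise}\end{cases}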


Location Offset Prediction

Direct downsampling causes a misalignment when heatmap locations are mapped back to the input image, so offsets are predicted to adjust the corner positions.
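Concretely, for a corner at (x_k, y_k) in the input image and a downsampling factor n, the ground-truth offset is the fractional part discarded by the floor operation; it is trained with a smooth L1 loss in the paper:

o_k = \left(\frac{x_k}{n}-\left\lfloor\frac{x_k}{n}\right\rfloor,\ \frac{y_k}{n}-\left\lfloor\frac{y_k}{n}\right\rfloor\right)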


Decoding

  1. Apply NMS on the heatmaps, i.e. a 3x3 max pooling, and pick the top 100 top-left and top 100 bottom-right corners
  2. Adjust the corner locations with the predicted offsets
  3. Compute the embedding distance for every top-left/bottom-right pair, and reject pairs whose distance exceeds 0.5 or whose corners belong to different classes
  4. Average the two corner scores to obtain the final detection score

The average inference time is 244ms per image on a Titan X.
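The decoding routine from the official repo is reproduced below; the helpers _nms, _topk, _tranpose_and_gather_feat, and _gather_feat come from the same codebase. Note the defaults (kernel=1, ae_threshold=1) differ from the 3x3 kernel and 0.5 threshold described above, which are presumably passed in at call time.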

import torch

def _decode(
    tl_heat, br_heat, tl_tag, br_tag, tl_regr, br_regr, 
    K=100, kernel=1, ae_threshold=1, num_dets=1000
):
    # tl_regr, br_regr: predicted corner offsets
    batch, cat, height, width = tl_heat.size()

    # heatmaps
    tl_heat = torch.sigmoid(tl_heat)
    br_heat = torch.sigmoid(br_heat)

    # perform nms on heatmaps: overlapping max pooling with stride 1 keeps only local maxima
    tl_heat = _nms(tl_heat, kernel=kernel)
    br_heat = _nms(br_heat, kernel=kernel)

    tl_scores, tl_inds, tl_clses, tl_ys, tl_xs = _topk(tl_heat, K=K)
    br_scores, br_inds, br_clses, br_ys, br_xs = _topk(br_heat, K=K)

    # pair every top-left corner with every bottom-right corner (K x K grid)
    tl_ys = tl_ys.view(batch, K, 1).expand(batch, K, K)
    tl_xs = tl_xs.view(batch, K, 1).expand(batch, K, K)
    br_ys = br_ys.view(batch, 1, K).expand(batch, K, K)
    br_xs = br_xs.view(batch, 1, K).expand(batch, K, K)

    if tl_regr is not None and br_regr is not None:
        tl_regr = _tranpose_and_gather_feat(tl_regr, tl_inds)
        tl_regr = tl_regr.view(batch, K, 1, 2)
        br_regr = _tranpose_and_gather_feat(br_regr, br_inds)
        br_regr = br_regr.view(batch, 1, K, 2)

        tl_xs = tl_xs + tl_regr[..., 0]
        tl_ys = tl_ys + tl_regr[..., 1]
        br_xs = br_xs + br_regr[..., 0]
        br_ys = br_ys + br_regr[..., 1]

    # all possible boxes based on top k corners (ignoring class)
    bboxes = torch.stack((tl_xs, tl_ys, br_xs, br_ys), dim=3)
    
    # embedding distance between every pair of corners
    tl_tag = _tranpose_and_gather_feat(tl_tag, tl_inds)
    tl_tag = tl_tag.view(batch, K, 1)
    br_tag = _tranpose_and_gather_feat(br_tag, br_inds)
    br_tag = br_tag.view(batch, 1, K)
    dists  = torch.abs(tl_tag - br_tag)

    # merge top-left and bottom-right corners
    tl_scores = tl_scores.view(batch, K, 1).expand(batch, K, K)
    br_scores = br_scores.view(batch, 1, K).expand(batch, K, K)
    scores    = (tl_scores + br_scores) / 2

    # reject boxes based on classes
    tl_clses = tl_clses.view(batch, K, 1).expand(batch, K, K)
    br_clses = br_clses.view(batch, 1, K).expand(batch, K, K)
    cls_inds = (tl_clses != br_clses)

    # reject boxes based on distances
    dist_inds = (dists > ae_threshold)

    # reject boxes whose bottom-right corner is not below and to the right of the top-left corner
    width_inds  = (br_xs < tl_xs)
    height_inds = (br_ys < tl_ys)

    scores[cls_inds]    = -1
    scores[dist_inds]   = -1
    scores[width_inds]  = -1
    scores[height_inds] = -1

    scores = scores.view(batch, -1)
    scores, inds = torch.topk(scores, num_dets)
    scores = scores.unsqueeze(2)

    bboxes = bboxes.view(batch, -1, 4)
    bboxes = _gather_feat(bboxes, inds)

    clses  = tl_clses.contiguous().view(batch, -1, 1)
    clses  = _gather_feat(clses, inds).float()

    tl_scores = tl_scores.contiguous().view(batch, -1, 1)
    tl_scores = _gather_feat(tl_scores, inds).float()
    br_scores = br_scores.contiguous().view(batch, -1, 1)
    br_scores = _gather_feat(br_scores, inds).float()

    detections = torch.cat([bboxes, scores, tl_scores, br_scores, clses], dim=2)
    return detections


Experiments


First, the ablation study on corner pooling.

The improvement on small objects is limited (this is probably related to the mechanism of corner pooling itself: after downsampling, the location information of small objects is lost more severely?)


Next, the error analysis.

Replacing the model's predicted heatmaps with ground-truth heatmaps improves AP dramatically, which suggests that any improvement in the heatmaps would be accompanied by a large AP gain. Replacing the predicted offsets with ground-truth offsets also yields a considerable AP improvement, especially on small objects.


Comparison with the state of the art

Among one-stage detectors it outperforms all other methods, and it is also quite competitive with two-stage detectors.
