Object detection methods published recently have pushed the state of the art (SOTA) on a popular benchmark, the MS COCO dataset. In Part I we took a closer look at CornerNet. This time, let’s see what makes CornerNet-Lite superior to the original CornerNet method.
CornerNet-Lite: Efficient Keypoint-Based Object Detection
As mentioned before, the strength of CornerNet is its competitive accuracy on the MS COCO dataset. Nevertheless, it has one huge drawback.
It is slow.
To overcome this issue, the authors proposed CornerNet-Lite, a combination of two efficient variants of CornerNet:
- CornerNet-Saccade: It uses an attention mechanism to avoid exhaustively processing all pixels of the image.
- CornerNet-Squeeze: It introduces a new compact backbone architecture.
These two variants address the two critical goals of efficient object detection: high efficiency without sacrificing accuracy, and high accuracy at real-time speed (Figure 1).
Let’s dive into both of these new methods and see what is so great about them.
What does the word “saccade” mean?
Saccade refers to rapid eye movement that shifts the center of gaze from one part of the visual field to another. Saccades are mainly used for orienting gaze towards an object of interest.
The method is inspired by and derives its name from this natural phenomenon.
Figure 2 below shows an overview of CornerNet-Saccade. Let’s examine it in detail.
Estimating Object Locations
- The network operates on two scales of the input image. At the higher scale, the longer side of the image is resized to 255 pixels, and at the lower scale to 192 pixels. The 192-pixel image is zero-padded to 255 so that both scales can be processed in parallel.
- For a downsized image, CornerNet-Saccade predicts 3 attention maps: one for small objects, one for medium objects and one for large objects.
- The attention maps are predicted by using feature maps at different scales, obtained from the backbone network, which is an hourglass network.
- The bounding boxes obtained from the downsized image may not be accurate and therefore are also examined at higher resolutions to get better bounding boxes.
- At each possible object location (x, y) found in the attention maps, the original image is zoomed in by a scale factor that depends on the predicted object size (small, medium or large), with smaller objects receiving larger zoom.
- Then CornerNet-Saccade is applied to a 255×255 window centered at the location for detecting possible bounding boxes.
- Soft-NMS is applied to merge and remove the redundant bounding boxes.
- Bounding boxes that are not fully covered by a region and touch the crop boundaries are also removed as they may have low overlaps with boxes of the full objects (Figure 3).
- Detected bounding boxes are then ranked by their scores, and only the top k_max of them are kept.
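The Soft-NMS step mentioned above can be sketched as follows — a minimal NumPy implementation of the linear-decay variant. Note this is an illustrative sketch, not the authors' code; the IoU and score thresholds are assumptions:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, iou_thresh=0.5, score_thresh=0.001):
    """Linear Soft-NMS: decay the scores of overlapping boxes
    instead of discarding them outright, as plain NMS would."""
    boxes, scores = boxes.copy(), scores.copy()
    kept = []
    while len(scores) > 0:
        i = np.argmax(scores)            # pick the highest-scoring box
        kept.append((boxes[i], scores[i]))
        best = boxes[i]
        boxes = np.delete(boxes, i, axis=0)
        scores = np.delete(scores, i)
        if len(scores) == 0:
            break
        overlaps = iou(best, boxes)
        # decay scores of boxes that overlap the kept box too much
        decay = np.where(overlaps > iou_thresh, 1.0 - overlaps, 1.0)
        scores = scores * decay
        mask = scores > score_thresh     # drop boxes whose score decayed away
        boxes, scores = boxes[mask], scores[mask]
    return kept
```

Unlike hard NMS, heavily overlapping boxes survive with a reduced score, which helps in crowded scenes.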
A new 54-layer hourglass network is proposed (hence the name Hourglass-54). Each of the 3 hourglass modules in the new architecture has fewer parameters and is shallower than its counterpart in Hourglass-104.
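For intuition, an hourglass module repeatedly downsamples the feature map, processes it, then upsamples it back, adding a skip connection at every resolution. A shape-level NumPy sketch (the average pooling, nearest-neighbor upsampling, and plain addition here are illustrative stand-ins for the real convolutional layers):

```python
import numpy as np

def downsample(x):
    """2x2 average pooling (stand-in for a strided conv)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbor 2x upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def hourglass(x, depth):
    """Recursive hourglass: downsample `depth` times, then upsample back,
    merging a skip connection at every resolution."""
    if depth == 0:
        return x                          # bottleneck (real modules apply residual blocks here)
    skip = x                              # skip branch at the current resolution
    low = hourglass(downsample(x), depth - 1)
    return upsample(low) + skip           # merge upsampled features with the skip
```

The output keeps the input resolution while mixing in features computed at every coarser scale — which is why hourglass backbones suit keypoint prediction.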
Training details for CornerNet-Saccade mentioned in the paper:
- Input size: 255×255;
- Adam optimization strategy;
- The training hyperparameters are the same as in CornerNet;
- Batch size is 48 on four 1080Ti GPUs.
CornerNet-Squeeze speeds up inference by redesigning the backbone:
- Using new fire modules (Figure 4);
- Hourglass module modifications:
- reducing the maximum feature map resolution of the hourglass modules;
- downsizing the image three times before the hourglass module, whereas CornerNet downsizes the image twice;
- replacing the 3×3 filters with 1×1 filters in the prediction modules of CornerNet;
- replacing the nearest neighbor upsampling with 4×4 transpose convolution.
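To see why fire modules help, compare parameter counts: a fire module squeezes channels with a 1×1 convolution before expanding with a 1×1 convolution and a 3×3 depthwise-separable convolution (as in CornerNet-Squeeze, which borrows the module from SqueezeNet). The channel sizes below are illustrative assumptions, not the paper's exact configuration:

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def fire_module_params(c_in, c_squeeze, c_out):
    """Fire module: 1x1 squeeze, then expand with a 1x1 conv and a
    3x3 depthwise-separable conv, each producing c_out // 2 channels."""
    squeeze = conv_params(c_in, c_squeeze, 1)
    expand_1x1 = conv_params(c_squeeze, c_out // 2, 1)
    # depthwise 3x3 over c_squeeze channels, then a 1x1 pointwise conv
    expand_dw = c_squeeze * 3 * 3 + conv_params(c_squeeze, c_out // 2, 1)
    return squeeze + expand_1x1 + expand_dw

plain = conv_params(256, 256, 3)           # standard 3x3 conv: 589,824 params
fire = fire_module_params(256, 32, 256)    # fire module, same in/out channels: 16,672 params
print(plain, fire, plain / fire)           # roughly a 35x reduction at this setting
```

The saving comes from both the channel squeeze and the depthwise separation of the 3×3 path.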
Some training details mentioned in the paper:
- The training hyperparameters and losses are the same as in CornerNet;
- Batch size is 55 on four 1080Ti GPUs.
Figure 5 shows the accuracy and efficiency trade-off curves of CornerNet-Saccade and CornerNet-Squeeze on the MS COCO validation set compared to other object detectors, including YOLOv3, RetinaNet and CornerNet:
CornerNet-Saccade achieves a better accuracy-efficiency trade-off (42.6% at 190 ms) than both RetinaNet (39.8% at 190 ms) and CornerNet (40.6% at 213 ms). CornerNet-Squeeze achieves a better accuracy-efficiency trade-off (34.4% at 30 ms) than YOLOv3 (32.4% at 39 ms). Running CornerNet-Squeeze on both the original and flipped images (test-time augmentation, TTA) improves its AP to 36.5% at 50 ms, which is still a good trade-off.
Performance Analysis of Hourglass-54
Some experiments were done to investigate the performance contribution of the new Hourglass-54 architecture. Predicting the attention maps can be viewed as a binary classification problem, where object locations are positives and everything else is negative. Accordingly, the authors measure attention-map quality by average precision, denoted APatt. Hourglass-54 achieves an APatt of 42.7%, while Hourglass-104 achieves 40.1%, suggesting that Hourglass-54 is better at predicting attention maps (Figure 6):
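Viewed as binary classification, APatt can be computed like any precision-recall average precision: rank locations by predicted score and average the precision at each positive. A minimal NumPy sketch (the toy scores and labels are hypothetical, not the paper's evaluation data):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for binary classification: sum of (precision at each cutoff) x
    (recall increment at that cutoff), over score-ranked predictions."""
    order = np.argsort(-scores)                    # rank by predicted score, descending
    labels = labels[order]
    tp = np.cumsum(labels)                         # true positives accumulated at each cutoff
    precision = tp / np.arange(1, len(labels) + 1)
    recall_steps = labels / max(labels.sum(), 1)   # recall increases only at positives
    return float((precision * recall_steps).sum())
```

A perfect ranking (all positives scored above all negatives) yields an AP of 1.0; every misranked positive pulls the AP down.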
After reviewing both methods, you might wonder: why not merge them? The truth is that experiments show that combining CornerNet-Squeeze with saccades does not outperform CornerNet-Squeeze alone.
On the validation set, CornerNet-Squeeze achieves an AP of 34.4%, while CornerNet-Squeeze-Saccade achieves 32.7% (Figure 7). To see how saccade impacts the accuracy, the authors replace the predicted attention maps with the ground truth. That improves the AP of CornerNet-Squeeze-Saccade to 38.0%, outperforming CornerNet-Squeeze. The results suggest that saccade can only help if the attention maps are sufficiently accurate. Due to its compact architecture, CornerNet-Squeeze-Saccade does not have enough capacity to detect objects and predict accurate attention maps simultaneously.
CornerNet-Lite versus others on MS COCO
Last but not least, the CornerNet-Lite results on the MS COCO test set (Figure 8). CornerNet-Squeeze is faster and more accurate than YOLOv3. CornerNet-Saccade is more accurate than multi-scale CornerNet and 6 times faster. What an achievement!
Below is the link to the repository with the publicly available code from the authors: