YOLOD - Improving neural nets for detecting packed cars in satellite imagery

By: Alan J. Schoen
Data Scientist

Neural network research moves very quickly and it can be hard to keep up. In our work at Radiant Solutions, we use a variety of machine learning modalities. My team works with neural networks for segmentation, detection, and other imagery processing tasks, but our most common application is object detection. Detection is basically the task of drawing boxes around objects of interest. For instance, we might want to find all the cars in an image, and draw boxes indicating their size. We don’t actually need to know how big an object is to count it, but it’s useful for quality control. So we are always looking out for the best neural network for this task, and we’re fortunate to be able to borrow a lot of techniques from academic research. But sometimes academic research doesn’t line up perfectly with our needs.

Most research on object detectors centers around a few public datasets like PASCAL-VOC and MS-COCO. Both of these datasets are made up of what you might call “normal” pictures, taken with a normal camera a few feet off the ground. At Radiant Solutions, we work with a particular type of picture taken with a specialized camera at a vantage point about 600 kilometers above the earth’s surface (read more about DigitalGlobe satellites here). As a result, there are systematic differences between the images in PASCAL-VOC and our images. For instance, we have to deal with parking lots where rows of cars are packed in right next to each other, and there’s really nothing like that in PASCAL-VOC. So most detection neural networks are terrible at counting cars in satellite imagery.

YOLO is one of the best networks for object detection on PASCAL-VOC, and we have done a lot of research using it. YOLO has gone through two iterations: the original YOLO network, and YOLO9000 (or YOLOv2), which refined the design in several ways and improved overall performance on PASCAL-VOC. Our colleagues at CosmiQ Works improved on the YOLO design to get better performance on boats in satellite images. Continuing in that tradition, we decided to improve YOLO9000, with inspiration from another network called DSSD.

Convolutional Neural Networks process images in a series of layers stacked on top of each other. Most alternate between convolutional layers, which search for patterns, and pooling layers, which throw away irrelevant information. As the image passes through each layer it gets transformed from a regular image into an abstract representation of the image contents. The pooling layers have the effect of “shrinking” the image. In the case of YOLO9000, we start with a 416 pixel by 416 pixel image, and after five stride-2 stages we end up with a 13 x 13 grid for object detection. This works great when you only have a few objects in each image, but it doesn’t work as well when the image is packed with objects, like a parking lot. YOLO’s performance decreases when it has to detect multiple objects in each grid cell.
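To make that shrinking concrete, here’s a minimal sketch in Python. It assumes YOLO9000’s usual configuration of five stride-2 downsampling stages; the function name is mine, not part of any library:

```python
# Each stride-2 stage (a pooling layer or strided convolution) halves
# the width and height of the feature map.
def grid_size(input_px, num_stride2_stages=5):
    size = input_px
    for _ in range(num_stride2_stages):
        size //= 2  # one stage halves the resolution
    return size

print(grid_size(416))  # 13 -> the 13 x 13 detection grid
print(grid_size(832))  # 26 -> doubling the input doubles the grid
```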

The simplest solution to this is to crop smaller tiles out of the image and enlarge them before feeding them into the neural network. In other words, magnify the image before the network shrinks it. This reduces the object density, but it also stops the neural net from seeing context. It’s easier to tell that a car is a car if you can see that it’s in a parking lot, so we want the model to be able to see both the car and the parking lot at the same time. We want a model that can see both context and detail.
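As a sketch of that preprocessing idea, using Pillow (the tile size and file name here are hypothetical, just to illustrate the magnify-first step):

```python
from PIL import Image

# Crop a small tile out of a large satellite scene, then enlarge it to
# the network's input size so each car covers more of the detection grid.
scene = Image.open("scene.tif")                      # hypothetical file
tile = scene.crop((0, 0, 208, 208))                  # 208 x 208 pixel tile
upscaled = tile.resize((416, 416), Image.BILINEAR)   # 2x magnification
# Object density per grid cell drops to a quarter of what it was,
# but the network loses the wider context around the tile.
```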

There’s more than one way to address this issue. YOLT and YOLO9000 both use a passthrough layer so that the network can combine information at two scales, which improves detection of small objects. Another network called DSSD took the idea a step further, using another kind of neural network layer called a “deconvolution” to blow up the detection grid (if you’re interested in what that is, here’s a clarification).
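For the curious, here is a minimal PyTorch sketch of that upsampling step; the channel counts are illustrative assumptions, not the ones DSSD actually uses:

```python
import torch
import torch.nn as nn

# A "deconvolution" (transposed convolution) with stride 2 doubles the
# spatial size of a feature map: 13 x 13 becomes 26 x 26.
deconv = nn.ConvTranspose2d(in_channels=1024, out_channels=256,
                            kernel_size=2, stride=2)

coarse = torch.randn(1, 1024, 13, 13)   # coarse detection-grid features
fine = deconv(coarse)
print(fine.shape)                       # torch.Size([1, 256, 26, 26])
```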

Increasing the size of the detection grid lets the network see detail and context, which is what we want. SSD and YOLO9000 have similar performance, but I have much more experience with the YOLO models, so I took the idea behind DSSD, applied it to YOLO9000, and created YOLOD.

YOLO9000 Model Structure

YOLOD Model Structure

The dotted line represents the passthrough, where some information skips ahead in the neural network. The passthrough goes from a larger feature map to a smaller one using a “space to depth” transformation.
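Here is one way to write that transformation, as a minimal PyTorch sketch (real implementations may order the channels differently):

```python
import torch

def space_to_depth(x, block=2):
    # Rearrange each (block x block) spatial patch into channels, so a
    # 26 x 26 map with 64 channels becomes 13 x 13 with 256 channels,
    # ready to concatenate with the coarser features downstream.
    n, c, h, w = x.shape
    x = x.reshape(n, c, h // block, block, w // block, block)
    x = x.permute(0, 1, 3, 5, 2, 4)  # move block offsets next to channels
    return x.reshape(n, c * block * block, h // block, w // block)

fine = torch.randn(1, 64, 26, 26)
print(space_to_depth(fine).shape)  # torch.Size([1, 256, 13, 13])
```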

In truth, I’m skipping over a lot of detail. There was a lot of tweaking and tuning involved in getting the structure of the network right: details like the depth of the deconvolution layers, the quantity and shapes of the anchor boxes, and the non-maximum suppression parameters. These are called “hyperparameters,” and there’s really no way to know ahead of time what will work best, so I had to run experiments to optimize them. On top of that, detection neural networks have more hyperparameters than most, so there are more combinations to test out. Fortunately, we have four NVIDIA GP100 GPUs for training neural networks at Radiant Solutions. These are absolute top-of-the-line graphics cards, and they let me test out stacks of hypotheses quickly. I used all four GP100s for weeks to find a good combination of hyperparameters.
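As one example of the kind of knob being tuned, here is a minimal numpy sketch of greedy non-maximum suppression; the 0.45 IoU threshold is an illustrative value, not the one we settled on:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression. boxes is (N, 4) as x1, y1, x2, y2;
    iou_thresh is the tunable hyperparameter."""
    order = scores.argsort()[::-1]       # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the winning box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0]) *
                 (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the winner more than the threshold.
        order = rest[iou < iou_thresh]
    return keep
```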

The end result of adding the deconvolution layers is that we do detection on a finer grid with more grid cells. You can see how that affects things if you look at these examples of an image with a 13 x 13 grid (YOLO9000) and a 26 x 26 grid (YOLOD).

Image with 13x13 Grid / Image with 26x26 Grid

In the 13x13 grid, we’ll sometimes get multiple cars in the same grid cell, but this happens much less frequently in the 26x26 grid.
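The effect is easy to reproduce with a little arithmetic. In this sketch (the car positions are made up), two cars about 20 pixels apart collide in a 13 x 13 grid over a 416-pixel image, but land in separate cells of a 26 x 26 grid:

```python
def grid_cell(center_xy, image_px, grid):
    # Each cell covers image_px / grid pixels; a box is assigned to the
    # cell containing its center.
    cell = image_px / grid
    x, y = center_xy
    return (int(y // cell), int(x // cell))  # (row, col)

car_a, car_b = (200, 100), (220, 100)  # centers of two parked cars
print(grid_cell(car_a, 416, 13), grid_cell(car_b, 416, 13))  # (3, 6) (3, 6): collision
print(grid_cell(car_a, 416, 26), grid_cell(car_b, 416, 26))  # (6, 12) (6, 13): separated
```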

Results

Now I’ll show some results so you can see how much this improved the performance with cars. We trained these models using the xView dataset, produced by DIUx. xView has over 1,000,000 labeled objects in high-resolution satellite images from around the world. It is being released as part of the xView Detection Challenge, an exciting effort to find the best models for working with satellite imagery.

We used the object markings from xView, but our data formatting may differ from the challenge’s. We also created our own train/val split, so the scores reported in this post will not correspond to scores in the xView Detection Challenge.

To show an example of model performance, I picked a location in our validation areas that has a bunch of cars packed together. I’ll show you the ground truth markings, car detections with YOLO9000, and car detections with YOLOD.

Ground Truth

YOLO9000

YOLOD

You can see that both models caught most of the cars, even really dark ones in the shadows, but that YOLOD performs much better in the area on the left where cars are packed together.

Now I’ll quantify the performance improvement. We picked out a set of 16 different 1km x 1km validation areas in the xView training set. We chose these validation areas using an algorithm that was designed to give us a good mixture of different object types. Cars are everywhere in xView, so there are a lot of cars in the validation areas. These areas were excluded from the training data so we could use them to test the model on images it had never seen before. There are 11,083 cars in the validation areas, so a perfect model would have 11,083 true positives, 0 false positives, and 0 false negatives. Even YOLOD isn’t perfect though:

| Model | True Positive Count | False Positive Count | False Negative Count | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| YOLO9000 | 6,328 | 4,187 | 4,755 | 0.60 | 0.57 | 0.59 |
| YOLOD | 7,270 | 3,151 | 3,813 | 0.70 | 0.66 | 0.68 |
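For reference, these are the standard definitions of precision, recall, and F1; this little sketch reproduces the YOLOD row above:

```python
def detection_metrics(tp, fp, fn):
    precision = tp / (tp + fp)  # fraction of detections that are real cars
    recall = tp / (tp + fn)     # fraction of real cars that were detected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

print(detection_metrics(7270, 3151, 3813))  # YOLOD: ~(0.70, 0.66, 0.68)
```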

You can see an improvement in performance with the YOLOD model, which reflects its ability to detect closely packed cars.

I should point out another thing about the validation areas. The accuracy estimates here will be slightly inflated because we used the validation areas for model selection. That is, the models were not trained on the validation areas, but we did pick the best model based on its performance in those areas, so future performance is likely to be a little lower than what we reported. In any case, we have found that model performance is linked to geographic area: if you’re interested in counting cars in New York, you should build a test set in New York to measure the model’s performance.

Before I sign off, there’s one more thing I’d like to cover. I have been showing car results up until now, but the model I used can actually detect many different types of objects, not just cars. I trained the models on 54 categories in the xView dataset that correspond to movable objects. Up until now, whenever I talked about “cars” I was actually talking about a composite class consisting of cars, pickup trucks, cargo trucks, and truck tractors: if the model says an object is any of those four categories, I call it a car. Now I’m going to delve into class-by-class results from the models, without any composite classes.
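In code, the composite class is just a lookup from the model’s predicted class to a reporting class. This is a minimal sketch; only the “car” metacategory comes from this post, and the identity fallback for everything else is my assumption:

```python
# xView classes that roll up into the composite "car" class.
COMPOSITE = {
    "car": "car",
    "pickup": "car",
    "cargo_truck": "car",
    "truck_tractor": "car",
}

def report_class(predicted):
    # Classes outside the metacategory report as themselves (assumed).
    return COMPOSITE.get(predicted, predicted)

print(report_class("pickup"))    # car
print(report_class("sailboat"))  # sailboat
```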

The model was trained on 54 classes, but not all of those classes had enough training samples to learn anything useful. So I’m only going to report results for the 30 classes that had at least one prediction from either model in the validation areas. For each class, I’ll show the number of instances of that class in the training set, the number of instances in the validation areas, and the performance of the two models.

| Class | Training Count | Validation Count | YOLO9000 F1 Score | YOLOD F1 Score |
|---|---|---|---|---|
| car | 336597 | 10098 | 0.58 | 0.67 |
| truck | 20507 | 393 | 0.21 | 0.21 |
| bus | 11326 | 519 | 0.49 | 0.48 |
| cargo_truck | 9645 | 427 | 0.26 | 0.25 |
| pickup | 8473 | 327 | 0.16 | 0.19 |
| trailer | 6468 | 145 | 0.25 | 0.13 |
| tractor_trailer | 5388 | 335 | 0.39 | 0.45 |
| railcar_cargo | 5137 | 99 | 0.49 | 0.45 |
| shipping_container | 2181 | 360 | 0.13 | 0.18 |
| dumptruck | 2068 | 112 | 0.39 | 0.34 |
| railcar_passenger | 1964 | 211 | 0.62 | 0.71 |
| motorboat | 1775 | 75 | 0.11 | 0.15 |
| excavator | 1456 | 49 | 0.28 | 0.19 |
| tractor_w_trailer | 1439 | 87 | 0.1 | 0.13 |
| fishing_vessel | 1279 | 192 | 0.58 | 0.36 |
| large_plane | 1245 | 33 | 0.72 | 0.81 |
| bulldozer | 1148 | 28 | 0.33 | 0.29 |
| truck_tractor | 1030 | 231 | 0.17 | 0.08 |
| maritime_vessel | 806 | 34 | 0.22 | 0.05 |
| sailboat | 690 | 39 | 0.0 | 0.04 |
| small_aircraft | 570 | 64 | 0.71 | 0.61 |
| haul_truck | 455 | 19 | 0.69 | 0.07 |
| cement_mixer | 434 | 50 | 0.29 | 0.34 |
| container_ship | 404 | 30 | 0.63 | 0.05 |
| yacht | 375 | 89 | 0.65 | 0.43 |
| barge | 277 | 10 | 0.3 | 0.0 |
| ferry | 240 | 22 | 0.43 | 0.0 |
| railcar_flat | 223 | 30 | 0.2 | 0.0 |
| helicopter | 162 | 12 | 0.35 | 0.0 |
| railcar_locomotive | 156 | 17 | 0.32 | 0.27 |

I’d like to comment on a few trends in there. First of all, it’s no surprise to see that classes with more training examples generally outperform classes with fewer training examples. But there is one big exception. Several classes related to trucks have many training samples, but still achieve relatively poor performance (truck, shipping_container, tractor_trailer, truck_tractor). There is a simple explanation for this: the model couldn’t tell the difference between these classes. Even humans have trouble distinguishing these classes in 30cm satellite imagery. It’s not always easy to tell if you’re looking at a shipping container or a semi trailer. Or if you see a semi tractor, you have to decide whether or not that tractor is attached to a trailer. The easiest way to deal with this problem is to merge similar classes into a metacategory. A more rigorous solution would be to use a hierarchical structure that allows the model to express conditional probabilities for class membership. The YOLO9000 paper used a structure like this, and we are working on implementing it in a future iteration of these models.
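As a sketch of the hierarchical idea (a toy two-level tree with made-up probabilities, not the actual WordTree from the YOLO9000 paper): the model predicts a conditional probability at each node, and the absolute probability of a class is the product along its path to the root. A model that can’t tell truck subtypes apart can still commit confidently at the “truck” level:

```python
# Each entry maps a class to (parent, P(class | parent)); the root's
# probability comes from the detector's objectness score. Values invented.
tree = {
    "vehicle": (None, 0.9),
    "truck": ("vehicle", 0.8),
    "truck_tractor": ("truck", 0.4),
}

def class_probability(cls):
    parent, p = tree[cls]
    return p if parent is None else p * class_probability(parent)

print(class_probability("truck"))          # 0.72 -- confident at this level
print(class_probability("truck_tractor"))  # 0.288 -- uncertain at the leaf
```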

Another observation: YOLO9000 requires fewer training examples than YOLOD. YOLO9000 can learn to detect objects with as few as 800 training samples, while YOLOD needs at least 1000. YOLOD has more parameters, owing to the additional deconv layers, so it’s no surprise that it needs more training data.

Finally, to end this post, here are more detailed results tables for both models:

YOLO9000 Results Table

| Class | Training Count | Validation Count | True Positive Count | False Positive Count | False Negative Count | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|---|---|
| car | 336597 | 10098 | 5754 | 3845 | 4344 | 0.6 | 0.57 | 0.58 |
| truck | 20507 | 393 | 76 | 254 | 317 | 0.23 | 0.19 | 0.21 |
| bus | 11326 | 519 | 245 | 231 | 274 | 0.51 | 0.47 | 0.49 |
| cargo_truck | 9645 | 427 | 124 | 394 | 303 | 0.24 | 0.29 | 0.26 |
| pickup | 8473 | 327 | 52 | 260 | 275 | 0.17 | 0.16 | 0.16 |
| trailer | 6468 | 145 | 33 | 89 | 112 | 0.27 | 0.23 | 0.25 |
| tractor_trailer | 5388 | 335 | 133 | 208 | 202 | 0.39 | 0.4 | 0.39 |
| railcar_cargo | 5137 | 99 | 45 | 41 | 54 | 0.52 | 0.45 | 0.49 |
| shipping_container | 2181 | 360 | 47 | 310 | 313 | 0.13 | 0.13 | 0.13 |
| dumptruck | 2068 | 112 | 39 | 50 | 73 | 0.44 | 0.35 | 0.39 |
| railcar_passenger | 1964 | 211 | 104 | 18 | 107 | 0.85 | 0.49 | 0.62 |
| motorboat | 1775 | 75 | 8 | 64 | 67 | 0.11 | 0.11 | 0.11 |
| excavator | 1456 | 49 | 15 | 42 | 34 | 0.26 | 0.31 | 0.28 |
| tractor_w_trailer | 1439 | 87 | 7 | 53 | 80 | 0.12 | 0.08 | 0.1 |
| fishing_vessel | 1279 | 192 | 104 | 64 | 88 | 0.62 | 0.54 | 0.58 |
| large_plane | 1245 | 33 | 21 | 4 | 12 | 0.84 | 0.64 | 0.72 |
| bulldozer | 1148 | 28 | 9 | 17 | 19 | 0.35 | 0.32 | 0.33 |
| truck_tractor | 1030 | 231 | 24 | 21 | 207 | 0.53 | 0.1 | 0.17 |
| maritime_vessel | 806 | 34 | 6 | 14 | 28 | 0.3 | 0.18 | 0.22 |
| sailboat | 690 | 39 | 0 | 0 | 39 | 0.0 | 0.0 | 0.0 |
| small_aircraft | 570 | 64 | 48 | 24 | 16 | 0.67 | 0.75 | 0.71 |
| haul_truck | 455 | 19 | 10 | 0 | 9 | 1.0 | 0.53 | 0.69 |
| cement_mixer | 434 | 50 | 12 | 22 | 38 | 0.35 | 0.24 | 0.29 |
| container_ship | 404 | 30 | 16 | 5 | 14 | 0.76 | 0.53 | 0.63 |
| yacht | 375 | 89 | 52 | 18 | 37 | 0.74 | 0.58 | 0.65 |
| barge | 277 | 10 | 5 | 18 | 5 | 0.22 | 0.5 | 0.3 |
| ferry | 240 | 22 | 10 | 15 | 12 | 0.4 | 0.45 | 0.43 |
| railcar_flat | 223 | 30 | 5 | 15 | 25 | 0.25 | 0.17 | 0.2 |
| helicopter | 162 | 12 | 4 | 7 | 8 | 0.36 | 0.33 | 0.35 |
| railcar_locomotive | 156 | 17 | 5 | 9 | 12 | 0.36 | 0.29 | 0.32 |
YOLOD Results Table

| Class | Training Count | Validation Count | True Positive Count | False Positive Count | False Negative Count | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|---|---|
| car | 336597 | 10098 | 6683 | 3090 | 3415 | 0.68 | 0.66 | 0.67 |
| truck | 20507 | 393 | 139 | 805 | 254 | 0.15 | 0.35 | 0.21 |
| bus | 11326 | 519 | 253 | 278 | 266 | 0.48 | 0.49 | 0.48 |
| cargo_truck | 9645 | 427 | 124 | 442 | 303 | 0.22 | 0.29 | 0.25 |
| pickup | 8473 | 327 | 81 | 461 | 246 | 0.15 | 0.25 | 0.19 |
| trailer | 6468 | 145 | 31 | 296 | 114 | 0.09 | 0.21 | 0.13 |
| tractor_trailer | 5388 | 335 | 146 | 174 | 189 | 0.46 | 0.44 | 0.45 |
| railcar_cargo | 5137 | 99 | 61 | 109 | 38 | 0.36 | 0.62 | 0.45 |
| shipping_container | 2181 | 360 | 68 | 318 | 292 | 0.18 | 0.19 | 0.18 |
| dumptruck | 2068 | 112 | 40 | 80 | 72 | 0.33 | 0.36 | 0.34 |
| railcar_passenger | 1964 | 211 | 147 | 59 | 64 | 0.71 | 0.7 | 0.71 |
| motorboat | 1775 | 75 | 14 | 93 | 61 | 0.13 | 0.19 | 0.15 |
| excavator | 1456 | 49 | 14 | 87 | 35 | 0.14 | 0.29 | 0.19 |
| tractor_w_trailer | 1439 | 87 | 9 | 42 | 78 | 0.18 | 0.1 | 0.13 |
| fishing_vessel | 1279 | 192 | 58 | 68 | 134 | 0.46 | 0.3 | 0.36 |
| large_plane | 1245 | 33 | 24 | 2 | 9 | 0.92 | 0.73 | 0.81 |
| bulldozer | 1148 | 28 | 11 | 38 | 17 | 0.22 | 0.39 | 0.29 |
| truck_tractor | 1030 | 231 | 10 | 12 | 221 | 0.45 | 0.04 | 0.08 |
| maritime_vessel | 806 | 34 | 1 | 5 | 33 | 0.17 | 0.03 | 0.05 |
| sailboat | 690 | 39 | 1 | 11 | 38 | 0.08 | 0.03 | 0.04 |
| small_aircraft | 570 | 64 | 44 | 36 | 20 | 0.55 | 0.69 | 0.61 |
| haul_truck | 455 | 19 | 1 | 7 | 18 | 0.12 | 0.05 | 0.07 |
| cement_mixer | 434 | 50 | 12 | 8 | 38 | 0.6 | 0.24 | 0.34 |
| container_ship | 404 | 30 | 2 | 54 | 28 | 0.04 | 0.07 | 0.05 |
| yacht | 375 | 89 | 33 | 33 | 56 | 0.5 | 0.37 | 0.43 |
| barge | 277 | 10 | 0 | 0 | 10 | 0.0 | 0.0 | 0.0 |
| ferry | 240 | 22 | 0 | 0 | 22 | 0.0 | 0.0 | 0.0 |
| railcar_flat | 223 | 30 | 0 | 0 | 30 | 0.0 | 0.0 | 0.0 |
| helicopter | 162 | 12 | 0 | 0 | 12 | 0.0 | 0.0 | 0.0 |
| railcar_locomotive | 156 | 17 | 3 | 2 | 14 | 0.6 | 0.18 | 0.27 |
