Deep learning has significantly advanced building extraction from remote sensing images, providing robust solutions for identifying and delineating building footprints. However, a major challenge persists in the form of domain adaptation, particularly when addressing cross-city variations. The core difficulty lies in the significant differences in building appearance across cities, driven by variations in building shape and environmental characteristics. Consequently, models trained on data from one city often struggle to accurately identify buildings in another. In this paper, we address this challenge from a data-centric perspective, focusing on diversifying the training set. Our empirical results show that improving data diversity via open-source datasets and diffusion augmentation significantly improves the performance of the segmentation model. Our baseline model, trained with no extra dataset, achieved a private F1 score of only 0.663. In contrast, our best model, trained with the additional Las Vegas building footprints extracted from the Microsoft Building Footprint dataset, achieved a private F1 score of 0.703. Surprisingly, we found that diffusion augmentation lifts the score to 0.681 without requiring any extra dataset, exceeding the baseline. Finally, we also experimented with the non-maximum suppression (NMS) IoU threshold to improve the model's performance on small, densely packed objects, which yielded a private F1 score of 0.897. Our source code and pretrained models are publicly available at https://github.com/DoubleY-BEGC2024/OurSolution.
1. Objective: This competition takes on this challenge by using a building footprint dataset from the Tokyo area as the primary training set, with testing extended to other Japanese regions. The aim is to inspire models with robust generalization capabilities, capable of overcoming the hurdles of automatic building footprint detection and extraction across varied landscapes. Overcoming this challenge would enable efficient, cost-effective, and precise building footprint extraction at a national level from minimal regional data, with potential applicability worldwide.
2. Mandatory Training Data: The training set uses 0.3-meter Google Earth satellite imagery complemented by meticulously hand-annotated building outlines. A total of 4,717 images are provided, all extracted from the Tokyo vicinity. The training data was divided into a training set and a validation set at an 8:2 ratio (a minimal split sketch is given after this list).
3. Mandatory Test Data: The imagery and building annotations for both test sets are derived from the open-source Japanese 3D city model of the Plateau project (https://www.mlit.go.jp/plateau/), enhanced with manual adjustments after visual inspection. All test images were randomly selected from 42 cities in Japan while maintaining a balance of different area types. A total of 250 images were taken from each region, for a total of 1,000 images.
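For reference, the 8:2 split can be reproduced with a few lines of Python. This is only a minimal sketch with a placeholder directory and a fixed seed; the exact split procedure used in our pipeline may differ.

```python
import random
from pathlib import Path

# Minimal 80/20 split sketch; the directory name is a placeholder.
random.seed(42)
images = sorted(Path("begc2024/images").glob("*.png"))
random.shuffle(images)

cut = int(0.8 * len(images))
train_imgs, val_imgs = images[:cut], images[cut:]
print(f"train: {len(train_imgs)}, val: {len(val_imgs)}")
```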
The YOLOv8 series comes with several instance segmentation models, ranging from the smallest nano (n) variant to the largest extra-large (x) variant. We performed several experiments to select the best YOLOv8 variant for our task, considering both the F1 score and model complexity, as shown in Table I. Additionally, we compared the performance of the YOLOv8-based instance segmentation models with other state-of-the-art models, including YOLOv9, Mask R-CNN, and EfficientNet. All models were trained for 50 epochs at an image size of 640. At test time and for submission, the confidence and NMS IoU thresholds are set to 0.20 and 0.70, respectively, unless stated otherwise (a minimal training/inference sketch is given after Table I). We also report the F1 score at a confidence threshold of 0.50, primarily to gauge how confident the models are rather than for actual submission.
Table I: Comparison of instance segmentation models (public F1 score at two confidence thresholds).

| Model | Pretrained Weights | Batch Size | Params (M) | FLOPs (G) | Public F1 (Conf = 0.50) | Public F1 (Conf = 0.20) |
|---|---|---|---|---|---|---|
| YOLOv8n-seg | DOTAv1 Aerial Detection | 16 | 3.4 | 12.6 | 0.510 | 0.645 |
| YOLOv8s-seg | DOTAv1 Aerial Detection | 16 | 11.8 | 42.6 | 0.535 | 0.654 |
| YOLOv8m-seg | DOTAv1 Aerial Detection | 16 | 27.3 | 110.2 | 0.592 | 0.649 |
| YOLOv8x-seg | DOTAv1 Aerial Detection | 8 | 71.8 | 344.1 | 0.579 | 0.627 |
| YOLOv9c-seg | COCO Segmentation | 4 | 27.9 | 159.4 | 0.476 | 0.577 |
| Mask R-CNN (MPViT-Tiny) | COCO Segmentation | 4 | 17.0 | 196.0 | - | 0.596 |
| EfficientNet-b0-YOLO-seg | ImageNet | 4 | 6.4 | 12.5 | - | 0.560 |
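The sketch below shows how a YOLOv8 segmentation variant can be trained and run with the Ultralytics API under the settings described above (50 epochs, image size 640, conf = 0.20, NMS IoU = 0.70). The dataset config name and source paths are placeholders; our actual runs may include additional options.

```python
from ultralytics import YOLO

# Start from a segmentation checkpoint (or a DOTAv1-pretrained one, as in Table I).
model = YOLO("yolov8m-seg.pt")

# Train on the BEGC2024 split; "begc2024.yaml" is a placeholder dataset config.
model.train(data="begc2024.yaml", epochs=50, imgsz=640, batch=16)

# Inference with the submission-time thresholds.
results = model.predict(source="test_images/", conf=0.20, iou=0.70)
```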
Our observations:
We experimented with the performance of YOLOv8m-seg by varying the training dataset, as shown in the table below:
| Setup | Dataset | Public F1 Score |
|---|---|---|
| A | BEGC 2024 | 0.649 |
| B | BEGC 2024 + Redmond Dataset | 0.660 |
| C | BEGC 2024 + Las Vegas Dataset | 0.686 |
| D | BEGC 2024 + Diffusion Augmentation | 0.672 |
| E | BEGC 2024 + CutMix Dataset | 0.650 |
Our observations:
We compare our solutions with the 2nd- and 3rd-place entries on the leaderboard:
| Solution | FLOPs (G) | Public F1-Score | Private F1-Score |
|---|---|---|---|
| YOLOv8m-seg + BEGC 2024 | 110.2 | 0.64926 | 0.66531 |
| YOLOv8m-seg + BEGC 2024 + Redmond Dataset | 110.2 | 0.65951 | 0.67133 |
| YOLOv8m-seg + BEGC 2024 + Las Vegas Dataset | 110.2 | 0.68627 | 0.70326 |
| YOLOv8m-seg + BEGC 2024 + Diffusion Augmentation | 110.2 | 0.67189 | 0.68096 |
| 2nd place (RTMDet-x + Alabama Buildings Segmentation Dataset) | 141.7 | 0.68130 | 0.68453 |
| 3rd place (Custom Mask R-CNN + no extra dataset) | 124.1 | 0.59314 | 0.60649 |
Our observations:
Non-maximum suppression (NMS) can be less effective at detecting small, densely packed objects, as it relies on IoU to suppress overlapping bounding boxes. In scenarios involving small, dense objects, the bounding boxes often overlap significantly, which can lead to the suppression of true positives. We can mitigate this issue by raising the IoU threshold in the NMS layer to prevent unnecessary suppression of valid boxes. We experimented by increasing the IoU threshold in the NMS layer of YOLOv8m-seg from the default 0.70 to 0.95, in increments of 0.05.
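In practice, this sweep amounts to changing the `iou` argument at prediction time. Below is a minimal sketch with a placeholder checkpoint path; converting the resulting masks into the submission format is omitted.

```python
from ultralytics import YOLO

model = YOLO("runs/segment/train/weights/best.pt")  # placeholder checkpoint path

# Sweep the NMS IoU threshold from the default 0.70 up to 0.95 in steps of 0.05.
for iou_thr in (0.70, 0.75, 0.80, 0.85, 0.90, 0.95):
    results = model.predict(source="test_images/", conf=0.20, iou=iou_thr)
    # ... convert `results` to the submission format and score each threshold ...
```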
Private F1 score at each NMS IoU threshold:

| Dataset | 0.70 | 0.75 | 0.80 | 0.85 | 0.90 | 0.95 |
|---|---|---|---|---|---|---|
| BEGC2024 + Redmond Dataset | 0.672 | 0.677 | - | - | 0.748 | 0.866 |
| BEGC2024 + Las Vegas Dataset | 0.703 | 0.693 | 0.686 | 0.721 | 0.766 | 0.897 |
| BEGC2024 + Diffusion Augmentation | 0.681 | - | 0.694 | 0.711 | 0.751 | 0.887 |
Our observations:
1. Dataset quality is what you need: Data diversity is important to mitigate the generalization challenge. For instance, the Las Vegas dataset offers higher diversity (e.g., desert backgrounds, different building shapes) than the Redmond dataset, which is semantically much closer to the provided BEGC2024 training set. Hence, our model trained on BEGC2024 + Las Vegas Dataset outperforms the one trained on BEGC2024 + Redmond Dataset.
2. Diffusion augmentation is label-efficient: Diffusion augmentation is what you need if you do not have an extra dataset that is sufficiently diverse relative to the original training set. For instance, the Redmond dataset is not as useful as the Las Vegas dataset, and finding a suitable extra dataset can be difficult and/or costly. In contrast, no extra dataset is needed to prepare our diffusion augmentation pipeline (a rough sketch of one possible pipeline is given after this list). Even better, BEGC2024 + Diffusion Augmentation outperforms BEGC2024 + Redmond Dataset, and also beats the 2nd- and 3rd-place entries!
3. Start with a small model: We recommend starting with a smaller model. It is unwise to use a larger model on a limited dataset, as it may lead to overfitting. Our empirical study agrees with this hypothesis, as we failed to achieve a high mAP score using the largest YOLOv8 variant (YOLOv8x-seg). Given more time, we would explore training YOLOv8x-seg with all the extra datasets we gathered, together with our diffusion augmentation pipeline.
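As referenced in observation 2, the sketch below illustrates one way a diffusion-based augmentation could be set up using an off-the-shelf Stable Diffusion img2img pipeline from the `diffusers` library. The model ID, prompt, strength, and paths are illustrative assumptions only, not necessarily the configuration used in our actual pipeline (see the repository for the real implementation).

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Illustrative img2img restyling of a training tile: a low strength keeps the
# building layout (and hence the existing labels) roughly intact while changing
# the surrounding appearance. Model ID, prompt, and paths are assumptions.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

tile = Image.open("begc2024/images/sample_tile.png").convert("RGB")
augmented = pipe(
    prompt="aerial view of buildings with arid desert surroundings",
    image=tile,
    strength=0.3,        # small edits so the existing annotations remain valid
    guidance_scale=7.5,
).images[0]
augmented.save("begc2024_augmented/sample_tile.png")
```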