【Note】FloodNet_plus
[!NOTE] Bibliography
Zhao, D., Lu, J., & Yuan, B. (2024). See, Perceive, and Answer: A Unified Benchmark for High-Resolution Postdisaster Evaluation in Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–14. https://doi.org/10.1109/TGRS.2024.3386934

Abstract: Visual-language generation for remote sensing image (RSI) is an emerging and challenging research area that requires multitask learning to achieve a comprehensive understanding. However, most existing models are limited to single-level tasks and do not leverage the advantages of the visual-language pretraining (VLP) model. In this article, we present a unified benchmark that learns multiple tasks, including interpretation, perception, and question answering. Specifically, a model is designed to perform semantic segmentation, image captioning, and visual question answering (VQA) for high-resolution RSIs simultaneously. Our model not only attains pixel-level segmentation accuracy and global semantic comprehension but also responds to user-defined queries of interest. Moreover, to address the challenges of multitask perception, we construct a novel multitask dataset called FloodNet+, which provides a new solution for the comprehensive postdisaster assessment. The experimental results demonstrate that our approach surpasses existing methods or baseline in all three tasks. This is the first attempt to simultaneously consider multiple remote sensing perception tasks in an integrated framework, which lays a solid foundation for future research in this area. Our datasets are publicly available at: https://github.com/LDS614705356/FloodNet-plus.
[!IMPORTANT] Visual-Language Generation
Visual-Language generation, such as image captioning and visual question answering (VQA), is a significant cross-modal challenge in the field of visual-language comprehension. This task aims to generate descriptive language outputs based on the input image and the possible text constraint.
[!NOTE] Related Work (General)
Object Detection: [1], [2], [3]
Segmentation: [4], [5], [6]
Change Detection: [7], [8], [9]
[!IMPORTANT] Intelligent Semantic Perception and Cognition
Beyond the current semantic processing route, networks are trained to not only extract key features from images but also identify their relationships and summarize high-level semantic information based on text requirements.
[!IMPORTANT] Visual-Language Generation Methods: Single-task & Multitask
Single-task Methods: encoder-decoder framework
Multitask Methods: visual-language pretraining (VLP) models
[!WARNING] Limitation of VLP for RSIs
However, for RSIs, the scarcity of large-scale datasets makes conventional training methods infeasible, and handling high-resolution images in a multitask framework remains formidable.
[!IMPORTANT] Multitask Visual-Language Generation in RSI

[!IMPORTANT] Framework

[!IMPORTANT] Training Process of Proposed Model

[!IMPORTANT] Main Contribution
- We introduce a comprehensive multitask framework tailored for postdisaster analysis in remote sensing images (RSIs), establishing the first benchmark that integrates semantic segmentation, image captioning, and VQA.
- We design an innovative multilevel, multitask learning model to address a spectrum of tasks, from intuitive segmentation to complex cognition-driven question answering. This model enhances postdisaster assessments by bridging the gap between conventional perception tasks and advanced cognitive evaluations in RSIs.
- We present FloodNet+, a novel high-resolution multitask dataset specifically developed for postflood scenarios. It is meticulously designed to benchmark the effectiveness of our approach across the semantic segmentation, image captioning, and VQA tasks, where our method achieves SOTA performance in multitask joint perception, showcasing its effectiveness and robustness.
For the remote sensing semantic segmentation task, Long et al. [38] introduced the first fully convolutional network for semantic segmentation in RSIs. From a methodological point of view, current remote sensing semantic segmentation methods can be divided into two categories: CNN-based methods and transformer-based methods. CNN-based methods are the dominant approach for remote sensing semantic segmentation tasks. Zheng et al. [4] proposed a foreground-aware relation network (FarSeg), which enhanced the segmentation performance by redistributing feature maps to increase the contrast between foreground and background regions. Diakogiannis et al. [39] introduced ResUNet-a, a UNet encoder–decoder backbone with multiple parallel branches in ResNet, and a pyramid scene parsing pooling block to capture and fuse multilevel features. Yang et al. [40] explored AFNet to fuse multichannel and multilevel features by attention blocks.
Transformer-based methods have become a new trend for semantic segmentation of general images, but in the domain of remote sensing semantic segmentation they are still in their infancy. Zhang et al. [41] proposed a Transformer and CNN hybrid network, using the Swin Transformer as the encoder and a CNN as the decoder. Cui et al. [42] explored an improved network based on the Swin Transformer and UPerNet that focuses on significant features of the postearthquake area via a convolutional block attention module.
For the RSI captioning task, current research has made some progress in both dataset development and network design. Qu et al. [22] first introduced a neural network model for the RSI captioning task, publishing two datasets named UCM-Captions and Sydney-Captions. Lu et al. [43] proposed an RSI captioning dataset, providing a valid benchmark for this task. Network design primarily focuses on two components: the encoder and the decoder. Enhancements to the encoder aim to better extract image features for improved performance. Ma et al. [23] focused on the multiscale problem, designing a multiscale method to cope with multiscale information in RSIs. Zhao et al. [24] proposed a network based on structured attention to address the issue of irregular shapes in remote sensing objects. Liu et al. [44] designed a network based on a dual-branch transformer to handle multitemporal RSIs. For the decoder, it is common to redefine the mapping relationship between the encoded and decoded results so that the network can fit more effectively. Wang et al. [45] attempted to make the encoding results more interpretable by redefining them as words before they are constructed into sentences by the decoder. Li et al. [46] introduced RASG, a framework with a novel recurrent attention mechanism and semantic gate that extracts high-level attentive maps and helps the decoder recognize and understand the effective information.
[!NOTE] Omitted
[!IMPORTANT] Detailed Network Structure

We adopt CCNet, proposed in [49], as our backbone, which consists of a ResNet-101 [50] and recurrent criss-cross attention as the segmentation feature module, and a segmentation head for generating segmentation proposals.
CCNet: arXiv | TPAMI 2020 & ICCV 2019
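For readers who want the backbone in code form, the PyTorch sketch below implements criss-cross attention following [49] and wraps it in a minimal recurrent module with a 1×1 segmentation head. The recurrence count of 2 matches the original CCNet; the module names, the choice of feature stage, and the plain classifier head are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def _diag_neg_inf(B, H, W, device):
    # -inf on the diagonal so a position is not counted twice
    # (it already appears once in the row branch and once in the column branch).
    return -torch.diag(torch.full((H,), float("inf"), device=device)).unsqueeze(0).repeat(B * W, 1, 1)


class CrissCrossAttention(nn.Module):
    """Criss-cross attention as in [49]: each pixel attends only to the pixels on its row and column."""

    def __init__(self, in_dim):
        super().__init__()
        self.query_conv = nn.Conv2d(in_dim, in_dim // 8, kernel_size=1)
        self.key_conv = nn.Conv2d(in_dim, in_dim // 8, kernel_size=1)
        self.value_conv = nn.Conv2d(in_dim, in_dim, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual scale

    def forward(self, x):
        B, _, H, W = x.shape
        q, k, v = self.query_conv(x), self.key_conv(x), self.value_conv(x)
        # column (height) direction: treat every image column as a batch item
        q_H = q.permute(0, 3, 1, 2).contiguous().view(B * W, -1, H).permute(0, 2, 1)
        k_H = k.permute(0, 3, 1, 2).contiguous().view(B * W, -1, H)
        v_H = v.permute(0, 3, 1, 2).contiguous().view(B * W, -1, H)
        # row (width) direction: treat every image row as a batch item
        q_W = q.permute(0, 2, 1, 3).contiguous().view(B * H, -1, W).permute(0, 2, 1)
        k_W = k.permute(0, 2, 1, 3).contiguous().view(B * H, -1, W)
        v_W = v.permute(0, 2, 1, 3).contiguous().view(B * H, -1, W)
        e_H = (torch.bmm(q_H, k_H) + _diag_neg_inf(B, H, W, x.device)).view(B, W, H, H).permute(0, 2, 1, 3)
        e_W = torch.bmm(q_W, k_W).view(B, H, W, W)
        attn = F.softmax(torch.cat([e_H, e_W], dim=3), dim=3)  # joint softmax over row + column
        a_H = attn[:, :, :, :H].permute(0, 2, 1, 3).contiguous().view(B * W, H, H)
        a_W = attn[:, :, :, H:].contiguous().view(B * H, W, W)
        out_H = torch.bmm(v_H, a_H.permute(0, 2, 1)).view(B, W, -1, H).permute(0, 2, 3, 1)
        out_W = torch.bmm(v_W, a_W.permute(0, 2, 1)).view(B, H, -1, W).permute(0, 2, 1, 3)
        return self.gamma * (out_H + out_W) + x


class RCCASegHead(nn.Module):
    """Recurrent criss-cross attention (RCCA) plus a plain 1x1 segmentation head.
    recurrence=2 follows the original CCNet; feeding it ResNet-101 stage-4 features
    is an assumption about this paper's configuration."""

    def __init__(self, in_dim, num_classes, recurrence=2):
        super().__init__()
        self.cca = CrissCrossAttention(in_dim)
        self.recurrence = recurrence
        self.classifier = nn.Conv2d(in_dim, num_classes, kernel_size=1)

    def forward(self, feats):
        for _ in range(self.recurrence):
            feats = self.cca(feats)   # after two passes, every pixel has seen the full map
        return self.classifier(feats)  # per-pixel segmentation proposal (logits)
```

Two recurrent passes are enough for every pixel to receive information from the whole feature map, which is what makes criss-cross attention much cheaper than full non-local attention on high-resolution features.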
[!TIP] Ablation Studies

To deal with high-resolution RSIs, we adopt a patch-based strategy, which divides the input image into patches and then aggregates the local features of the patches into a global image feature. Our segmentation module is trained to generate a segmentation proposal for each image patch, which implies that the local features in each patch follow their own internal distribution. To preserve this distribution, we do not directly combine the local features of different patches; instead, we assign a weight to each patch and multiply it with all pixels in that patch. The weight is derived from the patch itself and its adjacent patches.
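A minimal PyTorch sketch of this weighted patch aggregation follows. The specific weighting rule (an MLP over a patch's pooled feature concatenated with the mean of its four neighbours, normalized with a softmax over the patch grid, then summing the weighted patches) is a hypothetical stand-in; the paper only states that the weight is derived from the patch itself and adjacent patches, so both the scorer and the final summation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchWeightedAggregation(nn.Module):
    """Hypothetical sketch: every pixel of a patch is scaled by that patch's weight,
    and the weight depends on the patch itself and its adjacent patches."""

    def __init__(self, feat_dim):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(inplace=True), nn.Linear(feat_dim, 1)
        )

    def forward(self, patch_feats):
        # patch_feats: (B, Gh, Gw, C, h, w) -- local features of a Gh x Gw grid of patches
        B, Gh, Gw, C, h, w = patch_feats.shape
        pooled = patch_feats.mean(dim=(-2, -1))                        # (B, Gh, Gw, C)
        # mean of the 4 adjacent patches via a fixed depthwise conv
        # (zero padding means border patches average over fewer neighbours)
        kernel = torch.tensor([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
                              device=pooled.device, dtype=pooled.dtype) / 4.0
        kernel = kernel.view(1, 1, 3, 3).repeat(C, 1, 1, 1)            # (C, 1, 3, 3)
        neigh = F.conv2d(pooled.permute(0, 3, 1, 2), kernel, padding=1, groups=C)
        neigh = neigh.permute(0, 2, 3, 1)                              # (B, Gh, Gw, C)
        # one scalar weight per patch, from the patch itself and its neighbours
        scores = self.scorer(torch.cat([pooled, neigh], dim=-1))       # (B, Gh, Gw, 1)
        weights = torch.softmax(scores.view(B, Gh * Gw), dim=-1).view(B, Gh, Gw, 1, 1, 1)
        # every pixel of a patch shares that patch's weight; aggregate the grid
        return (weights * patch_feats).sum(dim=(1, 2))                 # (B, C, h, w)
```

Here `patch_feats` would come from running the segmentation feature module on each crop independently; whether the weighted patches are summed as above or re-assembled spatially is likewise an assumption of this sketch.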

Text generation tasks such as image captioning pose several challenges for conventional learning methods. One challenge is the mismatch between the loss functions and the evaluation metrics of these tasks. For example, many loss functions, such as negative log-likelihood (NLL) loss and cross-entropy (XE) loss, compute losses based on individual elements, while evaluation metrics for text generation (such as CIDEr-D [51]) measure the quality of whole sequences. Another challenge is exposure bias, which occurs when the model is trained with the ground-truth words as inputs but tested with its own predictions as inputs. This discrepancy can cause the model to accumulate errors progressively, potentially degrading its performance over sequential predictions. To address these challenges, the self-critical sequence training (SCST) reinforcement learning method [52] is applied to the image captioning task. It incorporates inference into the training process and introduces sequence-level evaluation metrics as reward functions in the loss. This approach helps bridge the gap between the training and testing phases, while the model learns to generate words conditioned on its own previous predictions. The reward and loss function are constructed as follows:
where Y is the predicted sequence, Z is the target sequence, ⟨eos⟩ is the sequence-ending mask, k is the total number of sequences in each batch, and b is the baseline value, typically the reward obtained by greedy decoding or the average reward over all sequences in the batch.
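The equation itself did not survive the copy; as a reference, below is a hedged reconstruction of the standard self-critical objective from [52], written with the symbols defined above (the paper's exact notation, e.g. how the ⟨eos⟩ mask enters, may differ):

```latex
% Hedged reconstruction of the standard SCST objective [52], not the paper's verbatim equation.
% r: sequence-level reward (CIDEr-D between prediction Y_i and target Z_i)
% m_{i,t}: the <eos> mask, zero for positions after the end of sequence i
r(Y_i, Z_i) = \text{CIDEr-D}(Y_i, Z_i), \qquad
L(\theta) = -\frac{1}{k}\sum_{i=1}^{k}\bigl(r(Y_i, Z_i) - b\bigr)\,
            \sum_{t} m_{i,t}\,\log p_\theta\!\bigl(y_{i,t}\mid y_{i,<t}\bigr)
```

With b taken as the greedy-decoding reward, sampled sequences that beat the greedy baseline are reinforced and worse ones are suppressed.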
...The training was conducted on 4 GeForce GTX 1080 Ti GPUs...
Memory: 11 GB × 4 = 44 GB
[1] Z. Xiao, Y. Gong, Y. Long, D. Li, X. Wang, and H. Liu, “Airport detection based on a multiscale fusion feature for optical remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 9, pp. 1469–1473, Sep. 2017.
[2] Q. Yao, X. Hu, and H. Lei, “Geospatial object detection in remote sensing images based on multi-scale convolutional neural networks,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2019, pp. 1450–1453.
[3] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS J. Photogramm. Remote Sens., vol. 159, pp. 296–307, Jan. 2020.
[4] Z. Zheng, Y. Zhong, J. Wang, and A. Ma, “Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4096–4105.
[5] Z. Zou, T. Shi, W. Li, Z. Zhang, and Z. Shi, “Do game data generalize well for remote sensing image segmentation?” Remote Sens., vol. 12, no. 2, p. 275, Jan. 2020.
[6] S. Pan, Y. Tao, C. Nie, and Y. Chong, “PEGNet: Progressive edge guidance network for semantic segmentation of remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 18, no. 4, pp. 637–641, Apr. 2021.
[7] Q. Wang, X. Zhang, G. Chen, F. Dai, Y. Gong, and K. Zhu, “Change detection based on faster R-CNN for high-resolution remote sensing images,” Remote Sens. Lett., vol. 9, no. 10, pp. 923–932, Oct. 2018.
[8] Y. Yang, H. Gu, Y. Han, and H. Li, “An end-to-end deep learning change detection framework for remote sensing images,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Sep. 2020, pp. 652–655.
[9] C. Zhang et al., “A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images,” ISPRS J. Photogramm. Remote Sens., vol. 166, pp. 183–200, Aug. 2020.
[22] B. Qu, X. Li, D. Tao, and X. Lu, “Deep semantic understanding of high resolution remote sensing image,” in Proc. Int. Conf. Comput., Inf. Telecommun. Syst. (CITS), Jul. 2016, pp. 1–5.
[23] X. Ma, R. Zhao, and Z. Shi, “Multiscale methods for optical remote sensing image captioning,” IEEE Geosci. Remote Sens. Lett., vol. 18, no. 11, pp. 2001–2005, Nov. 2021.
[24] R. Zhao, Z. Shi, and Z. Zou, “High-resolution remote sensing image captioning based on structured attention,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5603814.
[38] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[39] F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, “ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data,” ISPRS J. Photogramm. Remote Sens., vol. 162, pp. 94–114, Apr. 2020.
[40] X. Yang et al., “An attention-fused network for semantic segmentation of very-high-resolution remote sensing imagery,” ISPRS J. Photogramm. Remote Sens., vol. 177, pp. 238–262, Jul. 2021.
[41] C. Zhang, W. Jiang, Y. Zhang, W. Wang, Q. Zhao, and C. Wang, “Transformer and CNN hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 4408820.
[42] L. Cui, X. Jing, Y. Wang, Y. Huan, Y. Xu, and Q. Zhang, “Improved Swin transformer-based semantic segmentation of postearthquake dense buildings in urban areas using remote sensing images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 16, pp. 369–385, 2023.
[43] X. Lu, B. Wang, X. Zheng, and X. Li, “Exploring models and data for remote sensing image caption generation,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2183–2195, Apr. 2018.
[44] C. Liu, R. Zhao, H. Chen, Z. Zou, and Z. Shi, “Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5633520.
[45] Q. Wang, W. Huang, X. Zhang, and X. Li, “Word–sentence framework for remote sensing image captioning,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 12, pp. 10532–10543, Dec. 2021.
[46] Y. Li et al., “Recurrent attention and semantic gate for remote sensing image captioning,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5608816.
[49] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “CCNet: Criss-cross attention for semantic segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 603–612.
[50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[51] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4566–4575.
[52] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7008–7024.