Glacier extraction based on high spatial resolution remote sensing images using a deep learning approach with attention mechanism

Abstract. Accurate and quick extraction of glacier boundaries plays an important role in studies of glacier inventory, glacier change and glacier movement, and it faces great opportunities and challenges due to the increasing availability of high-resolution remote sensing images with larger data volumes and richer texture information. In this study, we improved DeepLab V3+ into Attention DeepLab V3+ and designed a complete solution based on the improved network to automatically extract glacier outlines from Gaofen-6 PMS images with a spatial resolution of 2 m. In the solution, Test-Time Augmentation (TTA) was adopted to increase model robustness, and the Convolutional Block Attention Module (CBAM) was added into the Atrous Spatial Pyramid Pooling (ASPP) structure in DeepLab V3+ to enhance the weight of the target pixels and reduce the impact of useless features. The results show that the improvements effectively enhance the robustness of the model, increase the weight of target pixels and reduce the influence of non-target pixels. Compared with deep learning models such as FCN, U-Net and DeepLab V3+, the improved model performs better, with OA and Kappa coefficients of 99.58% and 0.9915 for the test dataset, respectively. Moreover, our method achieves the highest OA and Kappa of 99.40% and 0.9846 for glacier boundary extraction in parts of the Tanggula Mountains and Kunlun Mountains based on Gaofen-6 PMS images, showing its excellent performance and great potential.

We evaluated the accuracy and robustness of the proposed method by comparing with reference glacier outlines based on manual interpretation of orthorectified images, taking parts of the Tanggula Mountains and Kunlun Mountains as the test region. Meanwhile, we assessed our result by comparing with the GAMDAM glacier inventory (GGI) (Nuimura et al., 2015) and the glacier coverage data on the Tibetan Plateau in 2017 (TPG2017) (Ye et al., 2017; Ye, 2019).

Model structure and data process scheme
In this study, we used DeepLab V3+ in combination with the attention mechanism (section 2.1) to explore a glacier extraction method based on high spatial resolution images. The Gaofen-6 images were preprocessed and divided into two groups: the first was used to produce samples by data augmentation to train the improved DeepLab V3+, and the second was used to test the trained model by performing the classification (section 2.3). Meanwhile, test-time augmentation (TTA) was added to this classification to improve model accuracy, as described in section 2.4. Figure 1 shows the overall flow of the method.

Network structure
DeepLab V3+, an encoding-decoding architecture proposed by Chen et al. (2018), is one of the most advanced semantic segmentation algorithms. In this paper, we improved the DeepLab V3+ model by adding an attention mechanism to the encoding-decoding structure (Attention DeepLab V3+). In the encoder, ResNet-34 (He et al., 2016) was used as the backbone network to extract semantic information and obtain the low-level feature, and then the Atrous Spatial Pyramid Pooling (ASPP) module with attention mechanism was connected to obtain the encoder feature. In the decoder, the low-level feature and the encoder feature maps were input. The encoder feature was upsampled by a factor of four and then fused with the low-level feature. After that, a depthwise separable convolution and a bilinear interpolation upsampling with a factor of four were executed to extract the expected features and output them at the same size as the input image (Fig. 2).

Figure 2. Architecture of Attention DeepLab V3+.
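As an illustration of the decoder just described, the following PyTorch sketch fuses the encoder feature and the low-level feature with the stated 4× upsampling steps. The channel counts, the 48-channel compression of the low-level feature, and the class number are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the Attention DeepLab V3+ decoder (channel counts assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, low_ch=64, enc_ch=256, n_classes=2):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, kernel_size=1)  # compress low-level feature
        # depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1
        self.depthwise = nn.Conv2d(enc_ch + 48, enc_ch + 48, 3, padding=1,
                                   groups=enc_ch + 48)
        self.pointwise = nn.Conv2d(enc_ch + 48, n_classes, 1)

    def forward(self, low_level, encoder_feat):
        # 4x bilinear upsampling of the encoder feature before fusion
        x = F.interpolate(encoder_feat, scale_factor=4, mode="bilinear",
                          align_corners=False)
        x = torch.cat([x, self.reduce(low_level)], dim=1)
        x = self.pointwise(self.depthwise(x))
        # final 4x bilinear upsampling back to the input image size
        return F.interpolate(x, scale_factor=4, mode="bilinear",
                             align_corners=False)
```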

Attention mechanism
The attention mechanism can learn contextual information and capture internal correlations; its basic idea is to ignore irrelevant information and focus on key information (Woo et al., 2018). In this paper, we used the Convolutional Block Attention Module (CBAM) (Fig. 3) (Woo et al., 2018), which includes a channel attention module and a spatial attention module, to obtain attention weights along the channel and spatial dimensions in turn; these weights were then multiplied with the original feature map to adaptively adjust the features and increase the weights of target features.
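A compact sketch of CBAM following Woo et al. (2018) is given below: channel attention from a shared MLP over global average- and max-pooled descriptors, followed by spatial attention from a 7×7 convolution, each multiplied onto the feature map. The reduction ratio of 16 is the commonly used default and is an assumption here.

```python
# Sketch of CBAM: channel attention then spatial attention, each applied
# multiplicatively to the feature map.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # channel attention: shared MLP over global avg- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: 7x7 conv over channel-wise mean and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```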

Attention ASPP
The ASPP structure replaces some of the standard convolutions of the deep neural network with dilated convolutions (Yu and Koltun, 2015), which expand the receptive field without increasing the number of parameters, thus obtaining more feature information. However, dilated convolution may cause discontinuity of spatial information, which we addressed by incorporating CBAM into the juxtaposed structure of the extracted features. The role of the attention mechanism is to focus on the target pixels and enhance their weight, while the dilated convolutions in the ASPP obtain contextual information at different scales. Adding the attention mechanism makes the features of related categories more aggregated, thus effectively reducing the hole-like artifacts of the dilated convolutions. The five branches of the juxtaposed structure obtain more contextual information by collecting features from different receptive fields, and by combining this information with the refined feature maps, dependencies between pixels and differences between categories are captured (Fig. 4).
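The sketch below illustrates one way to build such an attention ASPP: five parallel branches (a 1×1 convolution, three dilated 3×3 convolutions, and image-level pooling) whose concatenated output is refined by CBAM before projection. The dilation rates (6, 12, 18) follow the common DeepLab setting and are an assumption; the CBAM from the previous sketch would be plugged in as the `attention` argument.

```python
# Sketch of an attention ASPP; nn.Identity keeps the snippet runnable alone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18), attention=None):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),     # image-level pooling
                                  nn.Conv2d(in_ch, out_ch, 1))
        # the CBAM from the previous sketch goes here in the actual design
        self.cbam = attention if attention is not None else nn.Identity()
        self.project = nn.Conv2d(out_ch * 5, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.pool(x), size=(h, w), mode="bilinear",
                                   align_corners=False))
        return self.project(self.cbam(torch.cat(feats, dim=1)))
```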

Backbone
The ability of a CNN to retrieve relevant information from images is enhanced as the network depth increases (Telgarsky, 2016), but too deep a network can lead to gradient explosion and network degradation. Residual connections (He et al., 2016) solve this problem by adding the input of a building block of residual learning directly to its output through a shortcut connection (Fig. 5), by which the depth of the network and its learning capacity can be dramatically increased. ResNet has several variants with different numbers of blocks; in the present study we adopt ResNet-34, with 16 blocks and 33 convolutional layers in total (He et al., 2016).
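For concreteness, a basic residual block of the kind ResNet-34 is built from can be sketched as follows; the identity shortcut adds the block's input to its output, which is what eases gradient flow in deep networks. The layer sizes are illustrative.

```python
# A basic ResNet residual block (He et al., 2016) with an identity shortcut.
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # residual learning: F(x) + x
```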

Depthwise separable convolution
In this paper, we added a depthwise separable convolution at the end of the semantic segmentation, after combining the high-level and low-level features (Chollet, 2017; Howard et al., 2017). This convolution decomposes a traditional convolution into a depthwise convolution and a 1×1 pointwise convolution, improving computational efficiency without losing much accuracy compared with a normal convolution. In addition, for the same number of parameters, a neural network built with depthwise separable convolutions can be deeper.
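The decomposition is simple to express in PyTorch; the following minimal sketch shows it for a 3×3 kernel. For C input and K output channels it uses 9C + CK weights instead of the 9CK of a standard 3×3 convolution.

```python
# Depthwise separable convolution: per-channel 3x3 depthwise convolution
# followed by a 1x1 pointwise convolution.
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1))                          # pointwise
```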

Loss function
Due to the high resolution of Gaofen-6 PMS images and the large size of some glaciers, only a small part of a glacier or of other features may be present in a sample, which has the potential to cause sample imbalance. In this paper, the dice loss (Milletari et al., 2016) was used as the loss function to mitigate this phenomenon. The dice loss is expressed as:

$$L_{\mathrm{dice}} = 1 - \frac{2I + \varepsilon}{U + \varepsilon}$$

where $I$ is the number of intersecting pixels between the sample labels and the predicted pixels, $U$ is the sum of the sample labels and the predicted pixels, and $\varepsilon$ is a constant mainly used to prevent the denominator from being zero and to smooth the loss operation.
The use of dice loss alone generally produces severe oscillations during training when the positive sample is a small target, because once some pixels of the small target are predicted incorrectly, a large change in the loss and a consequently drastic change in the gradient occur in the case of only foreground and background. Therefore, we introduced a combination of cross-entropy loss and dice loss to make the network training more stable. The cross-entropy loss is as follows:

$$L_{\mathrm{ce}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $N$ is the number of pixels, $p_i$ is the probability that pixel $i$ is predicted as glacier, and $y_i$ is the sample label, which takes the value of 1 if the sample pixel is a glacier and 0 otherwise. These two losses were fused as the final loss function.
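A minimal sketch of the fused loss, matching the two formulas above, is given below. The smoothing constant `eps` and the equal (unit) weighting of the two terms are assumptions; the paper only states that the losses are fused.

```python
# Sketch of the combined dice + binary cross-entropy loss (equal weights assumed).
import torch
import torch.nn.functional as F

def combined_loss(pred, target, eps=1.0):
    """pred: raw logits (N, 1, H, W); target: float binary labels, same shape."""
    p = torch.sigmoid(pred)
    inter = (p * target).sum()                        # I: intersection
    union = p.sum() + target.sum()                    # U: sum of label and prediction
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)  # dice loss term
    ce = F.binary_cross_entropy_with_logits(pred, target)
    return dice + ce
```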

Pre-processing of Gaofen-6 PMS images and dataset production
Deep learning requires a large amount of labeled data related to the classification target for training the model. However, the currently available open-source datasets cannot meet the requirements of the classification in this paper. Therefore, we collected Gaofen-6 PMS images with little cloud and snow cover from the China High-resolution Earth Observation System (https://www.cheosgrid.org.cn/) (Tab. 1) as the training and validation dataset of the model, some of which were selected to test the accuracy of the glacier extraction method. The Gaofen-6 satellite, officially operational since March 21, 2019, is a low-orbiting optical remote sensing satellite with high spatial resolution, featuring wide coverage, high quality and efficient imaging. A two-meter panchromatic/eight-meter multispectral high-resolution camera and a 16-meter multispectral medium-resolution wide-field camera are carried on the Gaofen-6 satellite, the former with an observation width of 90 km and the latter with one of 800 km.

Pre-processing, including fusion, orthorectification, geometric alignment and some other operations, was performed prior to using the original defective images. Glaciers in the images were extracted as true values by manual visual interpretation.
Considering the high spatial resolution of Gaofen-6 PMS images, the large scale of glaciers and other features, and the inclusion of both kinds of features (glaciers and other features) in the samples whenever possible, the images were cropped to roughly 1024×1024 pixels and used as input for deep learning training. It is important to prepare a sufficiently diverse dataset to ensure that the model can adapt to different scenarios of glacier extraction. In this paper, data augmentation, including random clipping of the Gaofen-6 PMS images, vertical, horizontal and diagonal flipping of the samples, as well as clockwise and counterclockwise 90° rotations, was conducted to expand the sample library, improve the model accuracy and enhance its generalization performance. Finally, a training set and a validation set containing 3600 well-annotated images of 1024×1024 pixels with blue, green, red and near-infrared bands were obtained. Meanwhile, we kept a test set containing 400 images without data augmentation. An example of the samples is shown in Fig. 6.

Figure 6. The RGB Gaofen-6 samples and ground truth are displayed in the first and second rows.
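The following sketch shows one way the listed augmentations could be applied jointly to an image and its label with basic tensor operations. The crop size matches the 1024×1024 samples; the 0.5 flip probabilities are assumptions, and diagonal flipping (a transpose) is omitted for brevity.

```python
# Sketch of paired image/label augmentation: random crop, flips, 90-degree rotations.
import random
import torch

def augment(image, label, size=1024):
    # random crop to size x size (image: C,H,W; label: H,W; assumes H, W >= size)
    _, h, w = image.shape
    top, left = random.randint(0, h - size), random.randint(0, w - size)
    image = image[:, top:top + size, left:left + size]
    label = label[top:top + size, left:left + size]
    if random.random() < 0.5:                      # horizontal flip
        image, label = image.flip(-1), label.flip(-1)
    if random.random() < 0.5:                      # vertical flip
        image, label = image.flip(-2), label.flip(-2)
    k = random.choice([0, 1, 3])                   # none, +90 or -90 degree rotation
    image = torch.rot90(image, k, dims=(-2, -1))
    label = torch.rot90(label, k, dims=(-2, -1))
    return image, label
```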
A sample usually displays only a portion of a glacier, so an image larger than the sample size needs to be processed to extract the complete glacier. Usually, the images to be classified are split into a series of patches of the same size as the samples and fed into the network for prediction, and the predicted results are then merged into one final result image in the cropping order. However, larger classification errors and unsmooth merging can occur because of insufficient pixel features in the edge areas of each patch. Hence, we adopted the strategy of making neighbouring clipped patches overlap each other to preserve the features of edge pixels and make the merging of edges smoother. Figure 7 illustrates the prediction process, in which the pre-processed original image over the area in the red box (Fig. 7a and Fig. 7b) must be split into overlapping 1024×1024 patches before being input into the network for prediction, after which the classification result of each patch is obtained (Fig. 7d). Subsequently, the central 90% of each patch (red box in Fig. 7d) was merged to obtain the classification result (Fig. 7c).
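A simplified sketch of this overlapped tiling is given below: patches are predicted with a stride smaller than the patch size, and only the central region of each prediction is written into the mosaic. The margin of 51 pixels approximates the 90% central merge described above; handling of the image borders and of sizes not divisible by the stride is omitted for brevity.

```python
# Sketch of overlapped tile prediction with central-region merging.
import numpy as np

def predict_tiled(image, predict, patch=1024, margin=51):
    """image: (H, W, C) array; predict: maps a patch to a (patch, patch) probability map."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.float32)
    stride = patch - 2 * margin
    for top in range(0, max(h - patch, 0) + 1, stride):
        for left in range(0, max(w - patch, 0) + 1, stride):
            prob = predict(image[top:top + patch, left:left + patch])
            # keep only the central part of each patch (drop the margin)
            out[top + margin:top + patch - margin,
                left + margin:left + patch - margin] = \
                prob[margin:patch - margin, margin:patch - margin]
    return out
```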

Test-time augmentation
In this paper, the TTA strategy was used in the result extraction; it can be considered a post-processing technique because it is executed during the testing phase (Wang et al., 2021b). Therefore, it does not affect the learned network parameters, but obtains multiple augmented copies by performing data augmentation operations, such as horizontal and vertical flips, on each image during testing, and then combines the predictions of the copies (Fig. 8). The voting formula is as follows:

$$p = \frac{1}{n}\sum_{i=1}^{n} S_i$$

where $p$ is the probability that a pixel belongs to glacier, $n$ is the total number of each test image and its copies, and $S_i$ is the probability of the pixel belonging to glacier in the image or one of its copies. Figure 8 shows the result of TTA.
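A minimal sketch of this voting is shown below: each flipped copy is predicted, un-flipped back to the original orientation, and the per-pixel glacier probabilities are averaged as in the formula above. The particular set of four transforms (identity, horizontal, vertical, both) is an assumption consistent with the flips named in the text.

```python
# Sketch of TTA voting: average probabilities over flipped copies.
import torch

def tta_predict(model, x):
    """x: (N, C, H, W) input batch; returns averaged glacier probabilities."""
    probs = []
    for flip_dims in [None, (-1,), (-2,), (-1, -2)]:   # identity, H-flip, V-flip, both
        xin = x if flip_dims is None else x.flip(flip_dims)
        p = torch.sigmoid(model(xin))
        # undo the flip before averaging so all copies align
        probs.append(p if flip_dims is None else p.flip(flip_dims))
    return torch.stack(probs).mean(dim=0)              # p = (1/n) * sum(S_i)
```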
Accuracy evaluation
To quantitatively describe the ability to extract glaciers from high-resolution images using the method proposed in this study, the results obtained from manual visual interpretation were chosen as ground truth against which the classification results were compared. The overall accuracy (OA) and Kappa coefficient (Kappa) of the confusion matrix were used for accuracy evaluation (Olofsson et al., 2014). OA is the ratio of correctly classified pixels to all pixels in the entire image and is calculated as follows:

$$\mathrm{OA} = \frac{p_1}{n}$$

where $p_1$ and $n$ are the numbers of correctly classified pixels and total pixels, respectively. Kappa is a statistic that measures the agreement between prediction and ground truth and is calculated as follows:

$$\mathrm{Kappa} = \frac{\mathrm{OA} - p_e}{1 - p_e}$$

where $p_e$ is the hypothetical probability of chance agreement derived from the confusion matrix.
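For reference, the two metrics can be computed from a binary confusion matrix as in the following NumPy sketch, which matches the definitions above (including the standard chance-agreement term for Kappa).

```python
# Sketch of OA and Kappa from a binary confusion matrix.
import numpy as np

def oa_kappa(pred, truth):
    """pred, truth: binary (0/1) arrays of the same shape."""
    pred, truth = pred.ravel().astype(int), truth.ravel().astype(int)
    n = pred.size
    cm = np.zeros((2, 2), dtype=np.int64)
    np.add.at(cm, (truth, pred), 1)                 # confusion matrix counts
    oa = np.trace(cm) / n                           # OA = p1 / n
    pe = (cm.sum(0) * cm.sum(1)).sum() / n**2       # chance agreement p_e
    kappa = (oa - pe) / (1 - pe)
    return oa, kappa
```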

Experimental results
This section tests the algorithms of this paper, analyzes the experimental results and addresses the following two questions: (1) how does the Attention DeepLab V3+ with TTA perform on the test set; (2) how well does our method extract glaciers at large scale.

Experimental setup
For the experimental platform, we used a central processing unit with an Intel Core i9-10920 processor (3.50 GHz) configured with 64 GB of memory, an Nvidia GeForce RTX 2080 Ti graphics card with 11 GB of video memory, the Windows 10 64-bit operating system, and an implementation in Python. In terms of the software environment, we chose PyTorch as the deep learning framework, CUDA 11.1 as the graphics processing unit (GPU) computing platform and cuDNN 8.0 as the deep learning GPU acceleration library.
In this paper, the ratio of training set, validation set and test set was 8:1:1. The training set was used to optimize the network parameters (weights and biases), the validation set to prevent overfitting and optimize the hyperparameters, and the test set to evaluate model performance. The visual procedure to extract glaciers is shown in Fig. 9.

Figure 9. The procedure of glacier extraction. Firstly, the pre-processed image obtained on December 20, 2020 (a) was clipped into 1024×1024 overlapping patches (b), which were then input into the model to obtain the predicted results (c). Secondly, we merged the results and classified pixel values greater than 0.5 as glacier to obtain the final binary map (d and e). Thirdly, the binary map was converted to a vector (f) and smoothed (g). Finally, the glaciers were segmented using the ASTER GDEM to obtain individual glacier vector boundaries (h).

Experiments on Gaofen-6 test set
In this section, we explored the effectiveness of our network on the test set by comparing it with FCN, U-Net and DeepLab V3+; specifically, the FCN-32s network was used for FCN, and ResNet-18 was added as the backbone network of U-Net. Figure 10 shows the visualized prediction results of some test samples for the different methods. The results derived from FCN contained few scattered errors but performed poorly on the test set overall, which may be due to its direct large-factor upsampling, resulting in the loss of detailed information along glacier boundaries. The difference between the performances of U-Net and DeepLab V3+ on the test set was small: U-Net worked better in Fig. 10-4/5/6 and DeepLab V3+ in Fig. 10-1/2/3. It is obvious that the Attention DeepLab V3+ with TTA model has the best glacier extraction capability, extracting glacier boundaries with excellent continuity and fewer fine patches.

Figure 11. Performance of extraction models by confusion matrix, OA and Kappa.

Experiments on Gaofen-6 image
In this section, we compared the ability to extract complete glaciers based on the four scenes of Gaofen-6 images (section 3.1), using the SBTM, RF and the deep learning models.

The SBTM performed well when the spectral features of the images were simple (Fig. 12), but had the worst effectiveness when the reflectance of some glaciers was similar to that of other features (Figs. 12 to 15). The RF had a better glacier extraction capability than the SBTM due to its use of four bands. The deep learning methods yielded better results than the above methods, being trace-free and smooth despite being merged from many sample-sized patches. Among the deep learning methods, ours had the best performance, with extraction results similar to the ground truth.

In terms of accuracy, the SBTM has the lowest accuracy, with OA and Kappa of 97.3% and 0.9425, respectively, followed by the FCN and RF. It should be noted that the OA and Kappa of RF were 98.97% and 0.9740, respectively, higher than those of FCN (98.80%, 0.9702) on the whole but lower for image (1). These were followed by U-Net and DeepLab V3+, which also scored well. Our method had the best accuracy, with the highest scores in all test images and the highest OA and Kappa of 99.40% and 0.9846, respectively.

Table 2. Comparison of different glacier extraction methods on Gaofen-6 images.

Comparative Experiment with or without TTA
We improved the accuracy of the Attention DeepLab V3+ model by employing TTA during testing, and verified its effectiveness by comparing the results with those obtained without TTA.

Advantages and limitations
Compared with previous studies on glacier extraction, the main achievements of this paper are: (1) building a glacier dataset with high spatial resolution for deep learning, (2) improving the DeepLab V3+ model by adding an attention mechanism, and (3) proposing an effective method to extract glaciers from high-resolution remote sensing images.
Our method differs from previous glacier extraction methods that focused only on small scales, such as monitoring changes in the Antarctic ice-shelf front and the melting of Greenland outlet glaciers (Baumhoer et al., 2019; Zhang et al., 2019), in that it can be applied to extract complete glaciers over a large extent. Unlike traditional glacier extraction methods that depend only on spectral features, high-spatial-resolution images possess rich information on the texture, shape and spatial distribution of ground objects, which contributes significantly to distinguishing categories with similar spectral characteristics (Tong et al., 2020). Building on this, our method can automatically learn the features that distinguish glaciers from terminal moraine lakes, thin clouds and snow (Figs. 13 and 15), showing an obvious advantage for glacier extraction in the case of complex spectral characteristics. Our method yields continuous glacier boundaries with fewer fine patches, which reduces the workload of further post-processing. Furthermore, the adoption of the attention mechanism and TTA both improved the effectiveness of our model in extracting glaciers in the test set (Figs. 11 and 16).
The method in this paper was not designed to extract debris- or cloud-covered glaciers, in which cases errors may occur. In addition, features such as frozen rivers were sometimes mistaken for glaciers. Another drawback is that, despite the short prediction time of the well-trained model, producing the model took considerable time due to the lack of readily available glacier datasets based on high-resolution remote sensing images.

The difference between inventories
The amount of glacier resources is critical to regional water resources and future sea-level rise (Bolch et al., 2012), and the accurate extraction of glaciers contributes to the exact assessment of ice volume and mass balance as well as the measurement of glacier length (Immerzeel et al., 2010). Existing glacier inventories are generally based on images with low/medium spatial resolution, which may misestimate glacier size to some extent. Therefore, we discuss the differences between GGI and TPG2017 over the test images, where the former was produced by manual delineation using Landsat TM/ETM+ images from 1999-2003 (Nuimura et al., 2015) and the latter was generated based on Landsat OLI images from 2013-2018. To better explore the differences between the inventories, glaciers were first divided into an accumulation zone and an ablation zone based on the median-area elevation, which is deemed to be the material equilibrium line altitude (ELA); this is higher than the actual ELA for some glaciers (e.g., disintegrating glaciers) (Braithwaite and Raper, 2009), but has little effect on this study. The median altitude of glaciers differs between regions: 5645 m in the Tanggula Mountains and 5441 m in the Kunlun Mountains (Tab. 4).

Our extracted glacier area differs from that of GGI by only -2.00 km², accounting for -0.92% of the extracted area, while the area difference with TPG2017 is -16.45 km², accounting for -7.61% of the extracted area (Tab. 4). In detail, our model extracts a larger glacier area in the accumulation zone than GGI, which is mainly due to the omission of the shaded areas in the upper glacier by GGI (Nuimura et al., 2015). However, the glacier area in the ablation zone obtained by our model is smaller than in GGI, which is rational because of the retreat of glaciers in the ablation zone under the dramatic warming (Fig. 18a-d). Compared with the results of TPG2017, the glacier areas extracted by our method are smaller in both the ablation and accumulation zones, while the difference is significantly larger in the ablation zone than in the accumulation zone, indicating that glacier change is mainly concentrated in the ablation zone, which is consistent with the general pattern of glacier changes (Fig. 18e-h).

From the perspective of data sources, GGI and TPG2017 were produced from Landsat TM/ETM+/OLI images with resolutions of 30 m/15 m. Our data were obtained from Gaofen-6 PMS images with a resolution of 2 m, whose fewer mixed pixels allow a more detailed classification of glaciers and other features as well as a more accurate extraction of glacier boundaries, which leads to the smaller glacier area in this paper (Fig. 17). In general, our method allows a more accurate extraction of bare, intact glaciers; although in a few cases other features are misclassified as glaciers, a small amount of manual modification can provide more exact data for the glacier inventory.

Conclusion and prospect
In this paper, a glacier extraction method using the DeepLab V3+ network with attention mechanism and TTA was proposed to accurately extract glaciers from high-resolution remote sensing images. This method can help to improve the accuracy of automatically extracted glacier outlines and overcome the problem that glacier outlines cannot be extracted from most high-resolution images using traditional methods such as NDSI, due to the lack of a short-wave infrared band. By comparison with FCN, U-Net and DeepLab V3+, the glacier extraction ability of our model was demonstrated, with the highest OA and Kappa of 99.58% and 0.9915, respectively.
Four scenes of Gaofen-6 PMS images with glaciers were selected to test the ability to extract complete glaciers over a large extent. Comparison with other glacier extraction methods shows that our model has better performance, with OA and Kappa of 99.40% and 0.9846, respectively, and could distinguish glaciers from terminal moraine lakes, thin snow and clouds. Moreover, the glacier boundaries obtained from our method were continuous, with fewer fine patches, reducing the workload of post-processing. When comparing the glaciers extracted by our method with the GGI and TPG2017, we found that our data provide a more detailed representation of the bare-ice boundary, which can provide more accurate data for glacier inventories after manual revision.

In the future, we will carry out further research and adjustment in the following aspects: (1) improving the algorithm to increase the network's ability to learn glacier features; (2) adding more samples to diversify the training data and allow the network to learn more features, so as to solve the existing difficulty of extracting debris-covered glaciers; (3) using other high-resolution remote sensing images, SAR images, etc., to compensate for the loss of extraction accuracy of optical images under cloud occlusion; (4) applying transfer learning to reduce the time cost of sample annotation, allowing deep learning to be applied to glacier extraction more quickly and improving model generalization.
Code and data availability. Gaofen-6 PMS images are available at the China High-resolution Earth Observation System (https://www.cheosgrid.org.cn/). The datasets including the GGI and TPG2017 used in this study are freely available. The code for deep learning is available at https://github.com/yiyou101/glacier-extraction.git, and sample datasets of glaciers can be provided upon request from the corresponding author.
Author contributions. XC designed the algorithm for extracting glacier outlines and wrote the original draft; XY contributed in terms of supervision and reviewing the manuscript; HD contributed in terms of editing the manuscript; CC, JL and WP contributed in terms of obtaining and processing satellite data.

Competing interests. The authors declare that they have no conflict of interest.