Improved Multimodel Superensemble Forecast for Sea Ice Thickness using Global Climate Models

This paper aims to find an ensemble method that combines the global climate models, providing an accurate forecast of sea ice thickness (SIT). The conventional multimodel superensemble method is widely used in the atmospheric, oceanic, and other fields, but it does not perform well in SIT simulations. Hence, an adaptive forecasting through exponential re10 weighting algorithm is adopted to improve the conventional multimodel superensemble method. The results demonstrate, through a multi-criteria evaluation, that our proposed method performs better than any other mainstream ensemble method. The proposed method is used to predict future SIT in 2020–2049, and its potential biases are discussed.

The bias-removed ensemble mean method (e.g. Zhi et al., 2011;Zhu, 2011), assumes that the poorest model can be made equivalent to the best model with a bias correction. Then, the bias-removed models can be assigned the same weights.
30 Krishnamurti et al. (1999) adopted a new method called the multimodel superensemble, which utilises linear regression to minimise the errors between GCMs and observations at the grid level during the training period, and different GCM weights are obtained and transferred to the forecast phase. This method can effectively reduce the local biases in space and time and those of vast parameters on different models, as it is far more particular in its weight assignments compared with that of the other two methods (Krishnamurti et al., 2016). Existing studies illustrate that the multimodel superensemble has been widely 35 applied in weather and climate (e.g. Derber and Wu, 1998;Kazumori et al., 2008;Leutbecher, 2003;Mahfouf and Rabier, 2000), ocean (e.g. Kantha et al., 2008;Lenartz et al., 2010aLenartz et al., , 2010b, hurricane (e.g. Munsell et al., 2015;Rios-Berrios et al., 2014;Sato and Xue, 2013;Xue et al., 2013), and other forecasting, significantly reducing the prediction error. Additionally, ensemble forecasting based on artificial intelligence technology has been gradually developed (e.g. Zhi ying et al., 2004;Gui and Zhao, 2013;Shi, 2013;Zhang et al., 2018).

40
However, there is a lack of research on the ensemble forecasting of SIT. Compared with the sea ice concentration (SIC), SIT is a complex variable that is difficult to observe (Haas, 2010) without sufficient and reliable large-scale satellite observations (e.g. Laxon et al., 2013;Tilling et al., 2016). Both statistical methods (e.g. Lindsay et al., 2008) and numerical models (e.g. Holland et al., 2011) show that there is a strong correlation between SIT and sea ice extent (SIE). The change in SIT is more significant and can provide more information than SIC, especially in central areas (Melia et al., 2015c). The reduction in SIT 45 extends the navigation season, making high-latitude sailing possible (Smith and Stephenson, 2013) and facilitating the exploration of abundant natural resources, impacting the Arctic ecosystem and mid-latitude climate (Francis and Vavrus, 2012).
However, experiments have illustrated that the conventional multimodel superensemble based on linear regression cannot simulate SIT well due to the sparsity of temporal and spatial SIT data. Moreover, Yang (2001) pointed out that complicated ensemble methods can lead to unstable weights and inferior performances than those of the best candidates. Therefore, an 50 ensemble method with an improved forecasting capacity is required for further investigations.
To fill this research gap, this study incorporates 12 high-performing GCMs for different scenarios and initial conditions that were evaluated by Wang and Overland (2015), amounting to 101 ensemble candidates in total, and a new method called adaptive forecasting through exponential re-weighting (AFTER) is adopted to improve the conventional multimodel superensemble method. Monthly SIT data from 2006-2017 were used in the training phase, while monthly data from 2018 55 were used in the test phase. A multi-criteria evaluation, including root mean square error (RMSE), correlation coefficient (CC), structural similarity index measure (SSIM), empirical orthogonal function (EOF) analysis, and sea ice volume (SIV), is incorporated in this study to examine the validation of the proposed method and other mainstream ensemble methods, e.g. the ensemble mean, bias-removed ensemble mean, multimodel superensemble, and artificial neural network. The results show that https://doi.org/10.5194/tc-2020-86 Preprint. Discussion started: 16 June 2020 c Author(s) 2020. CC BY 4.0 License. the improved multimodel superensemble algorithm has a superior performance to that of the other algorithms. Finally, a new 60 method is adopted for SIT projection from 2020 to 2049.
The reminder of this paper is organised as follows. A data description is presented in Section 2. Section 3 introduces the methodology, followed by the model validation test in Section 4. Finally, Section 5 provides the future SIT predictions and summary.

Pan-Arctic Ice-Ocean Modelling and Assimilation System data
Spatial consistency, temporality, and completeness are key factors in data evaluation (Melia et al., 2015a). The Pan-Arctic Ice-Ocean Modelling and Assimilation System (PIOMAS) sea ice reanalysis data, which assimilated the atmospheric reanalysis from the National Centres for Environment Prediction, consists of SIC satellite (Lindsay and Zhang, 2006) and sea surface temperature observations (Schweiger et al., 2011); this dataset was selected for use in this study .

70
The quality of PIOMAS was evaluated by Schweiger et al. (2011), demonstrating biases in PIOMAS of 0.26 m in autumn and 0.1 m in spring, compared with the ICESat data (Zwally et al., 2002). Although uncertainty exists in the PIOMAS data, current satellite observations (i.e. ICESat or CryoSat-2) have limited spatial and temporal coverage, restricting their ability to evaluate models. Moreover, the largest discrepancy between PIOMAS and ICESat data is found in the north of Greenland and the Canadian Archipelago, the thickest sea ice areas; the PIOMAS data have fewer discrepancies with the situ data than that of 75 ICESat due to the difference in the satellite inversion methods (Schweiger et al., 2011). Labe et al. (2018) pointed out that the spatial patterns, seasonal cycles, and SIT trends are sufficiently reproduced by the PIOMAS data. Therefore, PIOMAS has been widely used to represent observations in various studies (e.g. Shu et al., 2015;Labe et al., 2018).

Global climate models
This study incorporates 12 GCMs from CMIP5 that were evaluated by Wang & Overland (2015) for a combined forecast, with 80 a total of 101 ensemble candidates for four emission scenarios called representative concentration pathways (RCPs) 2.6, 4.5, 6.0, and 8.5 (van Vuuren et al., 2011).
The basic characteristics of the selected ensemble candidates are displayed in Table 1. Regarding the discrepancies in spatial resolution, all the model candidates and PIOMAS were interpolated into the same 1°× 1° resolution.

Ensemble forecast methodology
The ensemble forecast aims to improve the model projection accuracy by making full use of multiple information sources from 90 the GCMs and constraining they GCMs using observations. We have tested the performances of different ensemble forecast methods that are seldomly used in SIT projection (Section 4). SIT simulations based on the conventional multimodel superensemble, an advanced method that can significantly improve predictions in other areas, exhibit large observation biases due to the linear regression overfitting. Therefore, a new weight determination method was adopted to improve the conventional multimodel superensemble. The ideas behind this improved method and other ensemble forecast methods used 95 in this study are introduced in this section. The mathematical notation for the following equations is in Table 2. Improved multimodel superensemble method with L1-norm AFTER AFTER.L2 Improved multimodel superensemble method with L2-norm AFTER

Ensemble mean
The ensemble mean method was widely used in the fifth report from Intergovernmental Panel on Climate Change to predict atmospheric, oceanic, and cryospheric variables. This approach averages all the ensemble candidates, 〈 , 〉, regardless of 100 model discrepancy.

Bias-removed ensemble mean
Bias-removed ensemble mean methods attempt to correct the models with poor accuracies before averaging them. The conventional approach corrects the time mean by subtracting the biases between each ensemble candidate, ̅ ,ℎ , and observation, ̅ ℎ , during the training period at the grid level.

105
The mean and variance correction ensemble mean (MAVRIC), first proposed by Nathanael Melia et al. (2015), attempts to consider both mean and variance in the bias correction. This study incorporates MAVRIC using the ratio of the temporal standard deviation of the detrended observations, ̂ℎ , to the standard deviation of each detrended candidate, ̂, ℎ , over the training period (Eq. 3). Each model is detrended using the linear time series trend during the training period. The multiplicative correction is first detrended, the variance is then corrected, and the trend is re-applied.

Ensemble forecast via artificial neural network
The ensemble forecast with an artificial neural network (ANN) is biologically motivated, imitating the abilities of the human brain including information storing, learning, and training to minimise the difference between multi-models and observations.
The algorithm structure can be seen in Eq. (4)

120
The conventional multimodel superensemble method is a type of regression-improved forecast that provides weights for each grid (Eq. 5). The weights can be obtained by minimising the errors between all the candidates and observations during the training phase, and a linear regression is most commonly used (Eq. 6).
In our experiment, this superensemble method contributes to large systematic errors in SIT projection (Section 4). As a solution for this, we adopted the AFTER algorithm in the superensemble structure to improve the weight calculations, maintaining all 125 positive weights to avoid overfitting and instability. The improved algorithm was first proposed by Yang (2001b) and can be presented as follows: Two weighting forms are proposed, and the L1-norm AFTER algorithm is as follows.
Step 3. For the ℎ model, the loss function, , from Eq. 9 can be written as Step 4. Compute the convex weight for the ℎ model following wed by Eqs. (8) and (9).
Step 5. Randomly permute the order of the data N-1 times. Repeat Steps 2-5.

135
Note that a tuning parameter is used to control the effect of weighting on the forecast performance (normally, = 1).
The workflow of the L2-norm AFTER algorithm is similar to that of L1-norm AFTER, expect for the function in Step 3, which should be rewritten as

Method validation
In this study, we analysed the performance of the ensemble forecast methods mentioned above using a statistical multi-criteria 140 evaluation approach. Both univariate and multivariate techniques including RMSE, CC, SSIM, and EOF analysis are adopted to capture the reliability and nature of the ensemble models. The analysis is performed both spatially and temporally on the ensemble forecast datasets during the testing phase. The temporal scale analysis was used to understand the prediction ability of different ensemble forecast methods for the SIT variation trends, testing whether the methods can perform well after being fully trained. Gridded data were used to analyse the discrepancies between the diverse ensemble forecast models in different 145 regions. Finally, these ensemble datasets were adopted to reproduce the monthly variations of the SIV in 2018. The results have provided various statistical properties for these methods.

RMSE & CC test
The RMSE of the datasets measures the deviation between the simulation and observation. The CC of the datasets refers to the degree of linear correlation between them, combining the concepts of mean, standard deviation, and regression line. In this 150 study, both the RMSE and CC of different SIT ensemble forecasts were calculated using the spatial (Figures 2 and 4) and temporal means (Figures 3 and 5). better than any single ensemble candidate, improving the simulation accuracy of the conventional models. Combining the four algorithms, the largest RMSE occurs in August during the testing period, which is consistent with that of the greatest SIV anomalies driven by a positive feedback loop between the SIT and ice-albedo (Bushuk et al., 2017). That kind of feedback is affected by melt pond information, snowfall, and sea ice concentration, which cannot be sufficiently simulated by the current GCMs from the CMIP5 (Stroeve et al., 2014), restricting the effectiveness of the ensemble forecast in that month. 165 Figure 4 illustrates that the best three ensemble forecast models based on averaged spatial CC tests are the two improved multimodel superensemble methods and the bias-removed ensemble mean method, where the values are approximately 0.9 for all the months. The latter method performs better from July to November, while our proposed methods have an advantage in the remaining months of 2018, and all of the models together have a higher correlation with the observations than that of any single ensemble candidate. Moreover, these methods and the ANN method conclude that the least relevant occurs in August, 170 while that appears in September for other methods, providing more evidence for model evaluation. From the spatial distribution perspective, we can see that the datasets from all the listed ensemble methods in Figure   Results of all 101 candidates are depicted using grey dashed lines, which are marked as raw data. All model abbreviations are the same as those provided in Table 2.

Structural similarity index measure (SSIM)
The SSIM is used to evaluate the performance of different ensemble models by adopting a single index that can measure overall spatial patterns, which is a type of objective test method (Wang et al., 2004). Compared to traditional error methods

195
(e.g. RMSE), this method can better depict the spatial distribution differences between model outputs and observations. The formula of this SSIM can be presented as follows: where the variations in mean value, ( , ), deviation, ( , ), and structure, s( , ) are combined together; and σ are the mean value and standard deviation of , respectively; and 1 , 2 , and 3 are the constants to prevent system instability. Generally, if = = = 1, 3 = 2 /2 , then Eq. (12) can be rewritten as follows:
A sample model with a random matrix A of 10 × 10 that ranges from 0 to 1 is provided to verify the advantage of SSIM.
Matrix B can be obtained if 10 is added to the last element, while Matrix C can be obtained if we add 1 or -1 randomly to each https://doi.org/10.5194/tc-2020-86 Preprint.  (Figure 6).
205 Figure 7 illustrates that the improved multimodel superensemble methods have higher scores than those of any other single candidate every month in the SSIM, matching the spatial distributions with the observations. The improved methods, together with the bias-removed and ANN methods, demonstrate that the largest SIT spatial distribution biases arise in August, which the other ensemble methods cannot capture.   Table 2.

EOF analysis
The EOF analysis is introduced to further evaluate the model performances to find a reliable ensemble model to reproduce a realistic sea ice climatology, consistently capturing the major spatial modes and their related 1D principal components (PC).
In this study, raw datasets from seven ensemble models and observations are processed by the EOF analysis, where the cumulative total variance of the first four leading EOF patterns is over 99% (Figures 8 and 9). The first spatial pattern reflects EOF4 patterns of most of the datasets exhibit similar structure, leading to disorders in their corresponding PC patterns, while only the ANN method can capture a similar trend and amplitude as that of the observations in the fourth PC pattern, meaning that this method displays more detailed results.

Sea ice volume
Sea ice volume (SIV) is an important index in the assessment of sea ice simulations, combining SIT, SIC, and grid area into 240 consideration. Here, SIC datasets from PIOMAS and the GCMs listed in Table 2 Table 2.

Ranking
Ranking is provided for the ensemble methods listed in Table 2 based on a multi-criteria evaluation. The best model is given 250 the top ranking, while the worst model is ranked seventh. Table 3 shows that our proposed methods perform better than the other selected ensemble methods for most evaluation techniques. Additionally, the ANN ensemble method can capture more details of the observation dataset in the EOF analysis owing to the advantage of artificial intelligence in data mining. Even the simplest method, the ensemble mean, is highly correlated with the observation dataset in the spatial distribution. To summarise, our proposed methods can greatly improve the accuracy of the ensemble forecast in SIT, which is followed by the conventional 255 bias-removed ensemble mean algorithm and ANN ensemble methods, while the remaining three methods score the lowest.

Prediction and summary 260
Based on the study results, the improved superensemble methods are the best ensemble models for simulating SIT, and the improved model with L1-norm AFTER performs better than the other one. Hence, the L1-norm AFTER superensemble method is adopted to predict future variations of the September mean SIT during three periods (2020-2029, 2030-2039, and 2040-2049), where the bias-removed ensemble mean is used as a control group. Figure 11 illustrates that the sea ice will continue to melt in the next three decades and the two selected ensemble models exhibit a similar overall spatial distribution, i.e. thicker 265 sea ice in the west and nearly ice-free in the east.
However, even these two high performing methods still exhibit discernible differences that increase with the prolongation of time. Compared to the results of the bias-removed ensemble mean, the SIT distribution exhibits a "decrease-increase-decrease" belt from the north of Canadian Arctic Archipelago and Greenland Sea to the Barents and Kara Sea in the first decade of the simulation. Then the SIT in the centre of the Arctic Basin increases, while the SIT decrease in the Barents and Kara Sea 270 disappears during 2030-2039. By the middle of the 21st century, the discrepancy arises in the Arctic Basin, while the "two opposite poles" still exist along the west of the Arctic region, showing large uncertainties in SIT prediction in these areas.
The issues related to how many ensemble candidates should be combined to improve the model performance and whether the results change with space are investigated in Figure 12, where the Arctic area is separated into 10 regions. The results show  Figure 13 reflects the weight distributions of each region, showing that the roles of different ensemble candidates vary in different regions. Therefore, the multimodel superensemble structure is far more selective in its assignment of weights.
In summary, this study has incorporated an improved weight-determined algorithm in the multimodel superensemble structure 280 to predict the SIT. A multi-criteria evaluation was used to validate the model. The study insights are summarised below.
• The AFTER algorithm can effectively avoid overfitting and instability in the conventional superensemble forecasting method, demonstrating better SIT simulation performance than that of the other mainstream ensemble forecast methods through a multi-criteria evaluation.
• Large biases in the SIT simulations between the dataset from the improved ensemble method and the observations were 285 found along the coastline in the west and in August, which was in accordance with the largest SIT anomaly in time and space. This result is restricted by the limited simulations of internal variability and external forcing of SIT from all CMIP5 selected ensemble candidates and can be improved by further developing the GCMs.
• This method was used to forecast the September mean SIT in the next three decades, where the bias-removed ensemble mean method was used as a control group. The results from these two methods exhibited a consistent spatial pattern of a 290 continuous thinning trend in the west and an expanded ice-free area in the east. However, differences between these two high-performing ensemble methods still exist in the Canada Arctic Archipelago, Greenland Sea, and central Arctic Basin, which are enhanced over time.
https://doi.org/10.5194/tc-2020-86 Preprint. Discussion started: 16 June 2020 c Author(s) 2020. CC BY 4.0 License. Figure 11: Ensemble forecast of the September mean SIT derived from all the ensemble candidates listed in Table 1 during