An increase in Antarctic Ice Sheet (AIS) surface mass balance (SMB) has the potential to mitigate future sea level rise driven by enhanced solid ice discharge from the ice sheet. AIS SMB poses a difficult challenge for climate models, as it exhibits high spatial, seasonal, and interannual variability.

Here we use a reconstructed data set of AIS snow accumulation as "true" observational data to evaluate the ability of the CMIP5 and CMIP6 model suites to capture the mean, trend, temporal variability, and spatial variability of SMB over the historical period (1850–2000). This gives insight into which models are most reliable for projecting SMB into the future. We found that the best scoring models included the National Aeronautics and Space Administration (NASA) GISS model and the Max Planck Institute (MPI) for Meteorology's model for CMIP5, as well as one of the Community Earth System Model v2 (CESM2) models and one MPI model for CMIP6.

Using a scoring system based on SMB mean value, trend, and temporal variability across the AIS, as well as spatial SMB variability, we selected a subset of the top 10th percentile of models to refine 21st century (2000–2100) AIS-integrated SMB projections to 2274

Notably, we find no improvement from CMIP5 to CMIP6 in overall score. In fact, CMIP6 performed slightly worse on average than CMIP5 at capturing the aforementioned SMB criteria. Our results also indicate that model performance scoring is affected by internal climate variability (particularly in the spatial variability), illustrated by the fact that the range in overall score between ensemble members within the CESM1 Large Ensemble is comparable to the range in overall score between CESM1 simulations within the CMIP5 model suite. We also find that higher horizontal resolution does not yield a conclusive improvement in score.

Surface mass balance (SMB) is the rate of accumulation of mass on the surface of the ice sheet and is characterized predominantly by precipitation and sublimation and also includes runoff and blowing snow terms

Over longer (

Despite its importance for AIS mass balance and global mean sea level, robust observations of SMB across the continent are few. The sparse spatial and temporal coverage of observations has motivated many efforts to model SMB using both regional and global climate models (RCMs and GCMs, respectively). Because the AIS is so large, predicting SMB on timescales of decades to centuries requires the use of GCMs

While past research by

To improve upon model estimates, several groups have combined ice core data with models to create spatio-temporally robust SMB data sets (

The reconstructed uncertainty used throughout this paper is a combination of the reconstruction uncertainty (i.e., uncertainty from the ice core records) and internal variability (Eq. 1). The inclusion of the internal variability uses the spread generated from climate models to estimate uncertainty in observations due to internal variability of the climate system.
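As an illustration of combining these two uncertainty sources, the sketch below assumes the common convention of addition in quadrature; since Eq. (1) itself is not reproduced here, both the functional form and the numbers are assumptions for illustration only.

```python
import numpy as np

# Illustrative only: one standard way to combine two independent uncertainty
# sources is in quadrature. We ASSUME Eq. (1) has this general form, with
# sigma_recon from the ice core records and sigma_internal estimated from
# the climate-model spread. Both values below are synthetic.
sigma_recon = 40.0      # reconstruction uncertainty (Gt yr-1, synthetic)
sigma_internal = 25.0   # internal-variability uncertainty (Gt yr-1, synthetic)

sigma_total = float(np.hypot(sigma_recon, sigma_internal))  # sqrt(a^2 + b^2)
```

If the two sources are not independent, a covariance term would also be required, so this should be read as a sketch of the simplest case.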

For this work, we investigate AIS SMB in GCMs. Compared to RCMs, GCMs have relatively low horizontal resolution, which makes it difficult for them to reproduce AIS SMB in detail. RCMs have been shown to be more accurate in capturing AIS SMB

We used all applicable CMIP5 and CMIP6 model outputs, of which there were 81 models and 42 independent models (i.e., different model physics and/or resolutions), respectively, for the historical simulations (1850–2005). For the future simulations, we only had available output for 30 CMIP5 models, 19 of which are independent, and 24 CMIP6 models, of which 16 are independent. See Tables S1–S3 in the Supplement for a list of models and their resolutions. The future simulations include three different forcing scenarios for CMIP5: Representative Concentration Pathway (RCP) 2.6, RCP4.5, and RCP8.5. RCP2.6 represents a low-emission scenario, RCP4.5 a mid-range-emission scenario, and RCP8.5 a high-emission scenario through the 21st century

We downloaded CMIP5 and CMIP6 precipitation and evaporation–sublimation output at monthly time resolution and, after calculating SMB as precipitation
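As a sketch of this preprocessing step, the snippet below computes SMB from monthly precipitation (`pr`) and evaporation–sublimation (`evspsbl`) fluxes, which CMIP models report in kg m⁻² s⁻¹, and accumulates them into annual totals. The data are synthetic and the use of an average month length is a simplification.

```python
import numpy as np

# Hypothetical illustration: SMB is approximated as pr - evspsbl; runoff and
# blowing-snow terms are neglected in this sketch. All values are synthetic.
rng = np.random.default_rng(0)
n_months = 12 * 5                              # five years of monthly output
pr = rng.uniform(1e-6, 5e-6, n_months)         # precipitation flux (kg m-2 s-1)
evspsbl = rng.uniform(0.0, 1e-6, n_months)     # sublimation flux (kg m-2 s-1)

smb_flux = pr - evspsbl                        # net surface flux (kg m-2 s-1)

# Convert each month's flux to accumulated mass (kg m-2), then sum to annual
# totals; an average month length is used for simplicity.
seconds_per_month = 365.25 / 12 * 86400.0
smb_monthly = smb_flux * seconds_per_month
smb_annual = smb_monthly.reshape(-1, 12).sum(axis=1)   # kg m-2 yr-1 per year
```

In practice the monthly fields would also be regridded to a common grid and area-integrated before any model intercomparison.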

We formulated five criteria on which to score the historical runs of the models. Three of the criteria are based on the AIS-integrated SMB – mean, trend, and variability – and two are based on AIS SMB spatial patterns: the modes of SMB variability and the variance explained by these modes. As the models' abilities to capture SMB are presented in the format of a "score card", judging the models against each criterion will hereinafter be referred to as "scoring". These criteria were chosen with the following questions in mind: (1) Do the models adequately simulate several observed SMB characteristics in the recent past? (2) Are the models that perform well simulating SMB for the right reasons? All five criteria are weighted equally in the final score to prevent it from being skewed by any single criterion.

To score the models based on AIS-integrated SMB, we took the mean SMB across the AIS for every year in which the reconstruction overlaps the models (1850–2000), generating a single 151-year AIS-integrated time series. We then characterized the time series by three aspects: its mean value (the value obtained by integrating SMB over the entire AIS), its linear trend, and its interannual variability.
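The three AIS-integrated diagnostics can be sketched as follows; the 151-year series below is synthetic, standing in for the real AIS-integrated SMB.

```python
import numpy as np

# Sketch of the three AIS-integrated diagnostics: mean value, linear trend,
# and interannual variability of a 151-year (1850-2000) SMB time series.
# In practice the series comes from integrating SMB over the full ice sheet
# for each year; here it is synthetic.
rng = np.random.default_rng(1)
years = np.arange(1850, 2001)                  # 151 years
smb = 2000.0 + 0.5 * (years - 1850) + rng.normal(0.0, 50.0, years.size)

mean_value = smb.mean()                        # AIS-integrated mean (Gt yr-1)
fit = np.polyfit(years, smb, 1)
trend = fit[0]                                 # linear trend (Gt yr-1 per yr)
detrended = smb - np.polyval(fit, years)
variability = detrended.std(ddof=1)            # interannual std dev (Gt yr-1)
```

Each of these three quantities is then compared against the reconstruction to produce one score per criterion.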

To score the time series mean value, we assigned a score,

Time series of the reconstructed AIS-integrated SMB time series (dark indigo) with 1

Similarly, for the time series trend, we assigned a score of

For temporal variability, note that if a model greatly underestimates the mean value, for example, the variability about that mean value will also likely be underestimated. To ensure that we do not double-count the impact of the SMB mean value (already covered by the first scoring criterion), we calculated the variability of the normalized time series. To separate the SMB variability from its mean value, we detrended and normalized each time series as follows:
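One plausible implementation of this detrend-and-normalize step is sketched below; the exact formula used in the paper is not reproduced here, so dividing the detrended series by its mean is an assumption, and the data are synthetic.

```python
import numpy as np

# ASSUMED form of the normalization: remove the linear trend, then divide by
# the time series mean so variability is compared independently of the mean
# value. The series below is synthetic.
def detrend_and_normalize(smb, years):
    """Remove the linear trend, then normalize by the series mean."""
    fit = np.polyfit(years, smb, 1)
    detrended = smb - np.polyval(fit, years)
    return detrended / smb.mean()

years = np.arange(1850, 2001)
rng = np.random.default_rng(2)
smb = 1800.0 + 0.4 * (years - 1850) + rng.normal(0.0, 40.0, years.size)

normalized = detrend_and_normalize(smb, years)
sigma = normalized.std(ddof=1)   # unitless interannual variability
```

The resulting series has near-zero mean, so its standard deviation measures variability without any contribution from the mean SMB level.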

We then calculated the standard deviation of each time series and assigned a score,

To ensure model performance was not judged solely on AIS-integrated SMB values, we also analyzed the spatial SMB variability. To do so, we performed an empirical orthogonal function (EOF) analysis on annual AIS SMB data from 1850–2000. EOF analysis decomposes a variable's spatial pattern into orthogonal modes: the first mode represents the largest explained variance; the second mode, orthogonal to the first, the next largest; the third mode, orthogonal to the first two, the third largest; and so on until all of the variance is explained. By breaking this criterion into two factors, (1) spatial variability and (2) variance explained, each treated as a separate scoring criterion, we aim to determine the models' abilities to accurately capture the modes of variability as well as how much variance each EOF mode explains.
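A minimal EOF sketch via singular value decomposition of the (time × space) anomaly matrix is shown below; the grid size is illustrative, not the real SMB grid.

```python
import numpy as np

# Minimal EOF sketch: SVD of the (time x space) anomaly matrix yields
# orthogonal spatial modes (EOFs) and the variance fraction each explains.
# Shapes are illustrative; real applications usually also apply area
# (e.g. sqrt(cos(latitude))) weighting before the decomposition.
rng = np.random.default_rng(3)
n_time, n_space = 151, 200
field = rng.normal(size=(n_time, n_space))
anomalies = field - field.mean(axis=0)        # remove time mean at each point

# Economy-size SVD: rows of vt are the EOF spatial patterns.
u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
variance_explained = s**2 / np.sum(s**2)      # fraction per mode, sums to 1

eof1 = vt[0]                                  # leading spatial mode
pc1 = anomalies @ eof1                        # its principal-component series
```

The orthogonality of the modes is guaranteed by the SVD, and the squared singular values give the variance explained by each mode.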

In the reconstruction, the top three modes of variability collectively explain roughly 76 % of the total variance. The fourth mode explains only about 6 % of the total variance, and all other modes explain

A spatial map of

We did this for all nine combinations of model and reconstruction maps for the top three modes of variability (model

Because the variance explained is also important for gauging how well models recreate the observed spatial patterns, we also summed the differences in variance explained for the top three sorted modes of variability for each model. Because the modes were sorted by map difference, each mode retained its own variance explained, preserving an accurate picture of how dominant each spatial pattern is in the models.
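The mode-matching step described above can be sketched as follows. The use of RMSE as the map-difference measure and of a greedy smallest-difference assignment are assumptions for illustration; all maps and variance fractions are synthetic.

```python
import numpy as np

# Sketch of the mode-matching step: all nine pairings of the top three model
# and reconstruction EOF maps are compared (here via RMSE, an ASSUMPTION),
# model modes are matched to reconstruction modes by smallest difference,
# and the variance-explained differences of the matched modes are summed.
rng = np.random.default_rng(4)
recon_modes = rng.normal(size=(3, 200))    # top 3 reconstruction EOF maps
model_modes = rng.normal(size=(3, 200))    # top 3 model EOF maps
recon_var = np.array([0.39, 0.26, 0.12])   # variance fractions (reconstruction)
model_var = np.array([0.30, 0.25, 0.20])   # variance fractions (model)

# 3 x 3 matrix of map differences (model mode i vs reconstruction mode j).
diff = np.array([[np.sqrt(np.mean((m - r) ** 2))
                  for r in recon_modes] for m in model_modes])

# Greedy matching: repeatedly take the smallest remaining difference,
# then exclude that model mode and reconstruction mode from further matches.
matched, map_score = {}, 0.0
d = diff.copy()
for _ in range(3):
    i, j = np.unravel_index(np.argmin(d), d.shape)
    matched[int(i)] = int(j)
    map_score += float(d[i, j])
    d[i, :] = np.inf
    d[:, j] = np.inf

# Each mode keeps its own variance explained after sorting.
var_score = sum(abs(model_var[i] - recon_var[j]) for i, j in matched.items())
```

The two resulting quantities, `map_score` and `var_score`, correspond to the two spatial scoring criteria.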

After compiling scores for all five of the aforementioned criteria, we removed outliers by computing the interquartile range (IQR) of the scores for each criterion and neglecting models that fell more than 1.5 IQR beyond it. We then normalized each set of scores to a scale from 1 to 10 to ensure that each criterion was equally weighted. After this normalization, the outliers for any given criterion were retroactively assigned a score of 10 for that criterion. The total score, then, is the average of all five sets of normalized scores. Because the scores are based on the difference between the reconstruction and the models, higher scores indicate poorer model performance.
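The aggregation step can be sketched as follows, assuming the conventional 1.5 × IQR outlier fences; the raw scores are synthetic.

```python
import numpy as np

# Sketch of the score aggregation: per criterion, models outside the
# 1.5 x IQR fences are flagged as outliers, the remaining raw scores are
# rescaled to 1-10, outliers are then assigned 10, and the overall score
# is the unweighted mean across the five criteria. Scores are synthetic.
rng = np.random.default_rng(5)
raw = rng.uniform(0.0, 5.0, size=(40, 5))     # 40 models x 5 criteria
raw[0, 2] = 50.0                              # plant one extreme outlier

normalized = np.empty_like(raw)
for c in range(raw.shape[1]):
    col = raw[:, c]
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    outlier = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
    kept = col[~outlier]
    scaled = 1.0 + 9.0 * (col - kept.min()) / (kept.max() - kept.min())
    scaled[outlier] = 10.0                    # outliers get the worst score
    normalized[:, c] = scaled

overall = normalized.mean(axis=1)             # higher = poorer performance
```

Rescaling each criterion to the same 1–10 range is what makes the unweighted average meaningful across criteria with very different native units.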

To look at the impact of resolution and internal variability on the final scoring, we correlated the horizontal resolution to final score and applied the same scoring analysis to the CESM Large Ensemble (CESM-LENS) experiment.

To reduce the uncertainty in future AIS SMB, we created a subset of models with a final score at or above the 90th percentile (i.e., the top 10 %) of CMIP5 and CMIP6. For our future projections, we investigated SMB under forcing scenarios RCP2.6, RCP4.5, and RCP8.5 for CMIP5 and SSPs 1–2.6, 2–4.5, and 5–8.5 for CMIP6. We compared the top scoring models that could be projected out under the selected forcings (of which there are five: four for CMIP5 and one for CMIP6) to the full scope of CMIP5 and CMIP6. We ran a Monte Carlo simulation in which five random models were selected 100 000 times. Those 100 000 sets of five random scores were compared to the five best scoring model scores using a two-sided
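The Monte Carlo comparison can be sketched as below. The scores are synthetic and the use of the mean score of each five-model draw as the test statistic is an assumption for illustration.

```python
import numpy as np

# Sketch of the Monte Carlo significance test: five model scores are drawn
# at random 100 000 times, and the distribution of their means is compared
# against the mean score of the five best models (an empirical two-sided
# test). Scores and the mean-score statistic are ASSUMED for illustration.
rng = np.random.default_rng(6)
all_scores = rng.uniform(1.0, 10.0, size=50)   # synthetic overall scores
best_five = np.sort(all_scores)[:5]            # lower score = better model
obs = best_five.mean()

n_draws = 100_000
# Each row of idx holds 5 distinct model indices (a draw without replacement).
idx = np.argsort(rng.random((n_draws, all_scores.size)), axis=1)[:, :5]
sample_means = all_scores[idx].mean(axis=1)

# Two-sided empirical p-value on the mean score.
p = 2.0 * min((sample_means <= obs).mean(), (sample_means >= obs).mean())
p = min(p, 1.0)
```

A small p-value indicates that the best scoring subset performs better than would be expected from a random selection of five models.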

Using this subset of best scoring models, we calculated the projected AIS-integrated mean value and trend in warming scenarios, RCPs 2.6, 4.5, and 8.5 and SSPs 1–2.6, 2–4.5, and 5–8.5, out to 2100. To see if and how the models respond differently to different warming scenarios, we also calculated the AIS-integrated SMB sensitivity to temperature change as
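One common way to estimate such a sensitivity, sketched below under the assumption that it is defined by linearly regressing AIS-integrated SMB onto near-surface temperature, is a simple least-squares slope; the paper's exact formulation is not reproduced here, and all numbers are synthetic.

```python
import numpy as np

# ASSUMED definition for illustration: SMB sensitivity to warming as the
# least-squares slope of projected AIS-integrated SMB against near-surface
# temperature, in Gt yr-1 per degree C. All series below are synthetic.
rng = np.random.default_rng(7)
years = np.arange(2000, 2101)
temp = 0.04 * (years - 2000) + rng.normal(0.0, 0.2, years.size)   # deg C
smb = 2200.0 + 60.0 * temp + rng.normal(0.0, 40.0, years.size)    # Gt yr-1

sensitivity = np.polyfit(temp, smb, 1)[0]     # Gt yr-1 per deg C
```

Comparing this slope across forcing scenarios tests whether SMB responds more than linearly to stronger warming.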

The final overall scores are an unweighted average of the five individual scores. After performing the analysis outlined in the Methods section, the models scoring at or above the 90th percentile overall were determined to be GISS E2 H CC, GISS E2 R CC, GISS E2 R, MPI ESM LR, MPI ESM MR, and MPI ESM P from CMIP5 and CESM FV2 and MPI ESM2 LR from CMIP6. For comparison, these eight models have been added to Figs. 3, 4, and 5, to show their performance in each scoring criterion relative to the rest of the CMIP model suites.

Along with higher SMB values, the coastal regions of East Antarctica and the Antarctic Peninsula also show the highest absolute SMB trends in the reconstruction (Fig.

Panel (a) in Fig.

The reconstructed AIS SMB ranges from 1800

While the reconstructed SMB time series and eight best scoring models show a generally increasing trend, the same is not true for all CMIP5 or CMIP6 models (Fig.

Box plots of the linear trends in spatially integrated AIS SMB in CMIP5 (blue) and CMIP6 (red) for the periods

Looking at all of the CMIP5 and CMIP6 models, the median linear trend is positive for all three time slices and the trend interquartile ranges are from

Gaussian distributions of SMB where the standard deviation is that of the SMB time series for the reconstruction (black).

Apart from its trend magnitude and sign, SMB variability is also important for accurately representing SMB and can be indicative of the relevant SMB-driving mechanisms. Figure

Just as temporal SMB variability is important for accurately capturing AIS SMB, spatial variations in SMB are also important in AIS SMB representation in models, as precipitation is not distributed uniformly. To look at the spatial variability in SMB, we performed EOF analysis and examined the top three modes of variability, which collectively account for 76.3 % of the total spatial variability.

EOF analysis plots of the top three modes of variability for

List of ranges of values for the three temporal criteria for the top 90th percentile models, all CMIP5 models, and all CMIP6 models as well as the values and uncertainties for the reconstruction.

Separated out, the top three modes of variability in the reconstruction from EOF analysis explain 39 %, 26 %, and 12 % of the total variability, respectively (Fig.

As an example of the comparison, one of the better scoring models for the EOF map criterion, CMCC CM, also shows a dipole between the Antarctic Peninsula and the Ross Sea region for the top mode as well as a strong variance signal around the Antarctic Peninsula for mode 2 and a quadrupolar pattern for mode 3. However, even the better scoring models tend to overestimate the magnitude of the variance, particularly around the coast, even when they capture the general spatial patterns. CESM1 WACCM, one of the more poorly performing models with regard to this metric, generally overestimates the variance everywhere in all three of the top modes. The top mode for this model reflects an East–West Antarctic SMB dipole, and mode 2 shows a strong, unidirectional signal across the entire AIS, though mode 3 seems to reflect the same quadrupolar pattern as seen in the reconstruction.

The scores for all CMIP5 and CMIP6 models. The large dots show the average score for all model groupings. Models are grouped by similar model physics and have in parentheses the number of models in the grouping after the name. Each model grouping has all model scores plotted as small blue (red) dots for CMIP5 (CMIP6) with the model average plotted in the larger dots. Models that have no like models are followed by a 1 in parentheses and only have a larger dot. The eight best scoring models (above the 90th percentile) are denoted with red outlines if they are among the CMIP5 suite of models – GISS E2 H CC, GISS E2 R CC, GISS E2 R, MPI ESM LR, MPI ESM MR, and MPI ESM P – or with blue outlines if they are among the CMIP6 suite of models – CESM FV2 and MPI ESM2 LR. Note that the overall scores for two of the GISS models and three of the MPI models in CMIP5 are almost exactly equal so outlines overlap almost completely.

Models that score above the 90th percentile make up the subset of best scoring models. Eight models – GISS E2 H CC, GISS E2 R CC, GISS E2 R, MPI ESM LR, MPI ESM MR, and MPI ESM P from CMIP5 and CESM FV2 and MPI ESM2 LR from CMIP6 – comprise this top subset. MPI ESM P and GISS E2 R from CMIP5 and CESM2 FV2 from CMIP6 do not have the requisite future projection data for this analysis. The most poorly performing models include BNU ESM, CESM FASTCHEM, and FIO ESM. The mean model score is 4.36 for CMIP5 and 5.77 for CMIP6. CMIP5 and CMIP6 scores were normalized together such that all scores are on the same scale and are directly comparable.

With this subset of the eight best performing models, we then refined future projections of AIS SMB in terms of mean value, trend, and variability. Comparing the difference in SMB projections between RCPs and SSPs offers insight into the potential sea level changes associated with different amounts of warming.

As stated earlier, both mean value and trend of AIS SMB have significant implications for future projections of sea level change. The spatially integrated AIS SMB (i.e., SMB mean value) has been increasing from 1850–2000 (Fig.

From 2070–2100, spatially integrated AIS SMB is projected to be 2294

For the entirety of the 21st century, 2000–2100, most CMIP5 and CMIP6 climate models project positive SMB trends in all forcing scenarios (Fig.

The best scoring CMIP5 models have trends of 1.2

Box plots of modeled SMB sensitivity to changes in temperature (i.e., how much SMB will change per degree Celsius of near-surface atmospheric warming) are shown in Fig.

These sensitivity results are not statistically significantly different across forcing scenarios, however, indicating no significantly greater-than-linear SMB increase under enhanced warming scenarios. Table

The differences in modes of variability in the EOF maps likely point to differences in atmospheric conditions that force AIS SMB. Mode 1 of the reconstruction EOF shows a dipolar pattern across the Antarctic Peninsula and Ross Ice Shelf region of West Antarctica. This dipole corresponds to variability in precipitation generated by variations in the track and strength of the Amundsen Sea Low. The Amundsen Sea Low, a dominant synoptic phenomenon that drives a significant amount of the circulation variability in West Antarctica and on the Antarctic Peninsula

Looking at mode 2, previous work by

Time series for all CMIP5 (lighter colors) and CMIP6 models (darker colors) and best scoring models (thin lines) and the best scoring models' average (thick lines) for

Box plots of the linear trend in spatially integrated AIS SMB from 2050–2100 for

Box plots of all CMIP5 models' projected SMB sensitivity to temperature changes (

Projected values for SMB, SMB trend, SMB temperature sensitivity, and change in 21st century temperature for all CMIP5 and CMIP6 models compared to the best scoring CMIP5 models for RCP2.6, RCP4.5, and RCP8.5 and best scoring CMIP6 models for SSP1-2.6, SSP2-4.5, and SSP5-8.5.

Our study uses the full ensemble of available CMIP5 and CMIP6 models. However, we only select a single member of each model (since some models have only one ensemble member available), which potentially leads to under-sampling of internal variability in the scoring. To analyze the effect of natural variability on final scoring, we use the Large Ensemble of the Community Earth System Model (CESM-LENS;

A major caveat of this finding, however, is that the CESM-LENS runs and the reconstruction only overlap from 1920–2000. This will likely most significantly impact the assessment of the trend and EOF analyses.

That said, this analysis highlights that internal variability plays a significant role in our AIS SMB assessment. Some models within the CMIP5 and CMIP6 frameworks, such as CESM1-CAM5, have many ensemble members. However, not all models – and not even all model versions – have multiple ensemble members. As such, a direct comparison of the models using ensemble means would not necessarily yield an accurate result, as models with more ensemble members would have their final scores shifted significantly while models with a single ensemble member would not. When using GCMs for AIS SMB analysis, then, we strongly suggest bearing in mind that internal variability may contribute substantially to some models' final scores and considering the number of available ensemble members alongside the final score.

As the CMIP5 and CMIP6 models vary widely in horizontal resolution, from about 0.75

The major limitations of this work stem largely from the subjective selection of scoring criteria. While each model is scored based on the same criteria, each criterion is chosen specifically to gauge model performance in capturing AIS SMB. As such, these criteria may be ill suited for looking at other variables, and, thus, other metrics could yield very different results. Another caveat of this work is that we are only capable of analyzing the CMIP6 models that have been released. As this analysis and the release of CMIP6 are concurrent, this limits the number of models we can reasonably analyze due to time constraints. Additional CMIP6 models may have different results and may skew the comparison between CMIP5 and CMIP6 significantly. Similarly, due to the small number of CMIP6 models released at this point, statistical analysis of the best scoring subset becomes moot, as the models above the 90th percentile amount to a single best scoring model. One final major caveat of this work is the relatively narrow scope of looking only at AIS SMB. Because we refined our criteria at the outset of our experiment to solely reflect model performance with regard to capturing SMB and did not include outside factors like synoptic weather patterns, sea ice, or sea surface conditions (

Another significant caveat of this work is the use of single ensemble members. For this work, we use the first ensemble member for each model. This choice was made because the CMIP5 and CMIP6 models vary widely in the number of ensemble members available – ranging from 1 to 50 – so using only a single ensemble member helps account for this large disparity between the models. However, in looking at the CESM-LENS experiment – which has 35 ensemble members – it is clear that there can be a large spread caused solely by internal variability. The spread in final score among the CESM-LENS ensemble members is 4.65, which is largely generated by the difference in EOF maps, meaning that the precise realization of atmospheric internal variability strongly affects how a model, in turn, represents AIS SMB.

In this paper, we tested the ability of the CMIP5 and CMIP6 model suites to capture SMB reconstructed from ice cores and reanalysis products by scoring them against a series of criteria: AIS-integrated mean value, trend, and variability, as well as spatial variability patterns. This scoring system is designed as a guide for choosing which GCMs to focus on for future SMB projections. Using this scoring system, we found that the models above the 90th percentile were GISS E2 H CC, GISS E2 R CC, GISS E2 R, MPI ESM LR, MPI ESM MR, and MPI ESM P of CMIP5 and CESM FV2 and MPI ESM2 LR of CMIP6. A similar study in

Our SMB mean value estimates are comparable to those of

All scores are equally weighted to avoid issues with coincidental good or bad performance. Having a spread of criteria against which we score the models limits the possibility that models recreate one aspect well for the wrong reasons. The strength of this scoring method lies in its simple and consistent criteria for judging the accuracy of modeled SMB. Its weakness is that it cannot recognize differences in the importance of individual criteria, as all are weighted equally, and it reflects only a few simple metrics. The criteria were chosen such that they all carry equal weight, which we justify by arguing that failing to meet any one of the criteria to within a reasonable degree would significantly impact future SMB estimates.

Of the top eight scoring models, six were from CMIP5 and two from CMIP6. Using the top six best scoring models from CMIP5, four of which we were able to project out to 2100 under three different RCPs, we refined future SMB predictions to 2274

Some of the major caveats of this work are the subjective selection of scoring criteria, which dictates the assessment of best scoring models, as well as the use of single ensemble members for model analysis, which may lead to an undersampling of internal variability.

No data sets were used in this article.

The supplement related to this article is available online at:

TG and JTML conceptualized and initiated this work. TG performed the analysis, discussed the results with JTML, and wrote the paper. BM provided the reconstructions and guidance on using and interpreting them. All authors reviewed the paper before submission.

The authors declare that they have no conflict of interest.

Tessa Gorte and Jan T. M. Lenaerts acknowledge support from the National Aeronautics and Space Administration (NASA), grant 80NSSC17K0565 (NASA Sea Level Team 2017–2020).

We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP5 and CMIP6. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the CMIP data and providing access, and the multiple funding agencies who support CMIP6 and ESGF.

We also acknowledge the Global Modeling Assimilation Office and Modeling, Analysis and Prediction Office at NASA for their effort and support in the development of the MERRA-2 reanalysis product. We thank the three reviewers and the editor for their constructive comments that greatly improved our paper.

This research has been supported by the NASA Sea Level Change Team (2017–2020) (grant no. 80NSSC17K0565).

This paper was edited by Michiel van den Broeke and reviewed by three anonymous referees.