CMIP5 model selection for ISMIP6 ice sheet model forcing: Greenland and Antarctica

Abstract. The ice sheet model intercomparison project for CMIP6 (ISMIP6) effort brings together the ice sheet and climate modeling communities to gain understanding of the ice sheet contribution to sea level rise. ISMIP6 conducts stand-alone ice sheet experiments that use space- and time-varying forcing derived from atmosphere–ocean coupled global climate models (AOGCMs) to reflect plausible trajectories for climate projections. The goal of this study is to recommend a subset of CMIP5 AOGCMs (three core and three targeted) to produce forcing for ISMIP6 stand-alone ice sheet simulations, based on (i) their representation of current climate near Antarctica and Greenland relative to observations and (ii) their ability to sample a diversity of projected atmosphere and ocean changes over the 21st century. The selection is performed separately for Greenland and Antarctica. Model evaluation over the historical period focuses on variables used to generate ice sheet forcing. For stage (i), we combine metrics of atmosphere and surface ocean state (annual- and seasonal-mean variables over large spatial domains) with metrics of time-mean subsurface ocean temperature biases averaged over sectors of the continental shelf. For stage (ii), we maximize the diversity of climate projections among the best-performing models. Model selection is also constrained by technical limitations, such as availability of required data from RCP2.6 and RCP8.5 projections. The selected top three CMIP5 climate models are CCSM4, MIROC-ESM-CHEM, and NorESM1-M for Antarctica and HadGEM2-ES, MIROC5, and NorESM1-M for Greenland. This model selection was designed specifically for ISMIP6 but can be adapted for other applications.


Abstract. The ice sheet model intercomparison project for CMIP6 (ISMIP6) effort brings together the ice sheet and climate modeling communities to gain understanding of the ice sheet contribution to sea level rise. ISMIP6 conducts standalone ice sheet experiments that use space-and time-varying forcing derived from atmosphere-ocean coupled global climate models (AOGCMs) to reflect plausible trajectories for climate projections. The goal of this study is to recommend a subset of CMIP5 AOGCMs (three core and three targeted) to produce forcing for ISMIP6 stand-alone ice sheet simulations, based on (i) their representation of current climate near Antarctica and Greenland relative to observations and (ii) their ability to sample a diversity of projected atmosphere and ocean changes over the 21st century. The selection is performed separately for Greenland and Antarctica. Model evaluation over the historical period focuses on variables used to generate ice sheet forcing. For stage (i), we combine metrics of atmosphere and surface ocean state (annual-and seasonal-mean variables over large spatial do-mains) with metrics of time-mean subsurface ocean temperature biases averaged over sectors of the continental shelf. For stage (ii), we maximize the diversity of climate projections among the best-performing models. Model selection is also constrained by technical limitations, such as availability of required data from RCP2.6 and RCP8.5 projections. The selected top three CMIP5 climate models are CCSM4, MIROC-ESM-CHEM, and NorESM1-M for Antarctica and HadGEM2-ES, MIROC5, and NorESM1-M for Greenland. This model selection was designed specifically for ISMIP6 but can be adapted for other applications. from both ice sheets (Velicogna, 2009;Zwally et al., 2011;Khan et al., 2014;Mouginot et al., 2014;Mouginot et al., 2019;Shepherd et al., 2018). Both atmospheric and oceanic changes have been identified as drivers of observed mass loss, although regional mechanisms vary. For example, rising air temperatures over Greenland lead to increased surface melt, causing direct mass loss Fettweis et al., 2017). Enhanced surface meltwater production also destabilizes the margins of the ice sheet (van den Broeke, 2005;Banwell et al., 2013) and lubricates the ice flow at the bed (Andrews et al., 2014;Kendrick et al., 2018). Ocean interactions with the ice sheet occur in Greenland fjords, where a combination of onshore ocean heat transport, estuarinetype circulation, subglacial meltwater runoff, and calving processes influence glacier terminus position and ice discharge (Straneo and Cenedese, 2015). In Antarctica, most of the ice sheet's mass loss is mediated through floating ice shelves. Melting at the ice shelf underside, which affects ice flow dynamics, is mainly controlled by the extent to which ocean dynamics along the continental margin allow intrusion of offshore ocean heat into the ice shelf cavities, leading to distinct regimes operating in "warm" vs. "cold" continental shelf regions (e.g., Dinniman et al., 2016;Thompson et al., 2018). Rising air temperatures and associated surface melting are thought to be responsible for the collapse of ice shelves around the Antarctic Peninsula (Domack et al., 2005) and subsequent speedup of grounded ice flow (Rignot et al., 2004), while surface melting is currently limited in most other parts of the continent (e.g., Trusel et al., 2013). In the future, increased water vapor transport in a warmer atmosphere may lead to increased surface accumulation in Antarctica (Frieler et al., 2015;Palerme et al., 2017) together with increased melting over Greenland  and the Antarctic ice shelves (Trusel et al., 2015). Besides this general pattern, the spatial distribution and magnitudes of atmospheric and oceanic contributions to the mass balance of both ice sheets vary greatly and depend on synopticscale climate variability and physical processes at regional and smaller scales. The ice sheet model intercomparison project for CMIP6 (ISMIP6) brings together the ice sheet and climate modeling communities to gain understanding of the ice sheet contribution to sea level rise (Nowicki et al., 2016). Due to the delay in the CMIP6 dataset release, ISMIP6 revised the protocol described in Nowicki et al. (2016) to utilize climate forcing from the CMIP5 dataset (Nowicki, 2019). ISMIP6 conducts stand-alone ice sheet experiments that use space-and time-varying forcing derived from atmosphere-ocean coupled global climate models (AOGCMs) to reflect plausible trajectories for climate projections, building on earlier coordinated experiments which applied ad hoc boundary conditions either constant in time or imposed as an abrupt perturbation (Pattyn et al., 2013;Bindschadler et al., 2013;Levermann et al., 2014). However, this effort requires converting AOGCM output to forcing for ice sheet models, pos-ing several challenges. First, climate models from the Coupled Model Intercomparison Project (CMIP) have a horizontal resolution that is too coarse to accurately represent sharp ice sheet topographic gradients impacting the surface climate of the ice sheet (e.g., melt, wind, precipitation). Ocean components cannot represent narrow fjords connecting the deep ocean and tidewater glaciers around Greenland (Straneo et al., 2012), the ocean eddies involved in poleward heat transport across continental shelves , or ocean circulation beneath ice shelves (Asay-Davis et al., 2017). Second, AOGCMs poorly represent polar-specific processes that have a major impact on the ice sheet surface climate (e.g., snowpack evolution, cloud and boundary-layer processes) (Favier et al., 2017).
These limitations can be addressed by using regional climate models adapted for the polar regions. On the atmosphere side, polar-oriented regional climate models (RCMs) have proved to provide more realistic surface climate than direct AOGCM outputs for both the Greenland ice sheet (e.g., Noël et al., 2018;Fettweis et al., 2013) and the Antarctic ice sheet (e.g., Agosta et al., 2019). On the ocean side, a number of models have recently added the capability to represent ice shelf cavities and ice-ocean interactions (e.g., Dinniman et al., 2016). However, ocean simulations are still unable to provide non-biased solutions from a pan-ice-sheet perspective, and they remain computationally expensive, which probably explains the small number of existing projections of ice shelf basal melting (Timmermann and Goeller, 2017;Naughten et al., 2018). Thus, the ISMIP6 steering committee has proposed the following strategy to convert AOGCM outputs into ice sheet forcing: surface forcing is provided by AOGCMs dynamically downscaled with a polar-oriented atmospheric RCM , while ocean forcing is computed by interpolating AOGCMs' ocean temperature onto the continental shelf and by parameterizing ice shelf melt or retreat rates, as detailed in Slater et al. (2019) and Nowicki (2019).
The goal of this study is to recommend a subset of CMIP AOGCMs to produce forcing for ISMIP6 stand-alone ice sheet simulations. This ensemble of AOGCMs aims to capture (i) plausible climate near Antarctica and Greenland over the historical period and (ii) a diversity of atmosphere and ocean warming rates over the 21st century. For evaluating AOGCMs we focus on variables that are inputs of the downscaling methods defined to generate ice sheet forcing. Although it is technically possible to select different AOGCMs for atmosphere and ocean forcing, we choose to use the same climate models across both realms due to their interdependence in projections (e.g., Krinner et al., 2014;Bracegirdle et al., 2018). We thus perform a combined assessment of both the atmosphere and ocean components of AOGCMs.
This paper describes the process utilized to select six AOGCMs to provide forcing for each ice sheet. This evaluation combines observational/reanalysis data, metrics from existing studies, and data produced specifically for this study. 1. the model must provide 6-hourly wind, temperature, and humidity to be able to run an atmospheric regional climate model (18 models); 2. the model output must include required data fields under both the RCP2.6 and RCP8.5 scenario projections, following the revised ISMIP6 protocol (Nowicki,  For the atmosphere and surface ocean, we consider variables that have an impact on RCM-modeled surface mass balance and for which reanalyses are reliable, following Agosta et al. (2015). All model outputs are bilinearly interpolated onto a common regular longitude-latitude grid (1.5 • × 1.5 • ). For each variable that retains spatial information (described in the following paragraph), we calculate the spatial root-meansquare error (RMSE) for annual-or seasonal-mean values over 1980-2004 (25 years). We take the European Centre for Medium-Range Weather Forecasts "Interim" re-analysis (ERA-Interim, 1979-present;Dee et al., 2011) as a reference, since differences between reanalyses are much smaller than climate model biases (Agosta et al., 2015), and ERA-Interim was assessed to be the most reliable contemporary global reanalysis over Antarctica (Bromwich et al., 2011;Bracegirdle and Marshall, 2012;Huai et al., 2019).
For Antarctica, we evaluate air temperature at 850 hPa (ta850; average of summer and winter RMSE), annual precipitable water (prw), and annual sea level pressure (psl), together with summer sea surface temperature (sst[s]) and winter sea ice extent (sie[w]), for the domain extending south of 40 • S over the ocean (Fig. 1a). In addition to spatially resolved variables, we include a metric of the historical CMIP5 vs. ERA-Interim bias in westerly jet strength (Jstr), calculated as the maximum in annual mean zonal mean 850 hPa zonal wind between 10 and 75 • S (m s −1 ), compared to time-slice means of the overlapping 1979-2005 period, as in Bracegirdle et al. (2018).
For Greenland, we evaluate air temperature at 700 hPa (ta700; average of summer and winter RMSE), annual precipitable water (prw), and annual geopotential height at 500 hPa (zg500), inside the domain of the "Modèle Atmosphèrique Regional" (MAR; Fettweis et al., 2017) and where the Greenland ice sheet is below 2000 m a.s.l. (bright shaded color in Fig. 1c). In this small domain, sea surface conditions do not significantly impact MAR results (Noël et al., 2014).

Subsurface ocean metrics
The ISMIP6 stand-alone ice sheet oceanic forcing is derived from "far-field" salinity and potential temperature (Slater et al., 2019;. Consistent with this approach, our evaluation of subsurface ocean properties is performed on regionally averaged CMIP5 temperatures. Since the oceans around Greenland and Antarctica are characterized by different geographic and dynamic regimes in observations (e.g., Straneo et al., 2012;Schmidtko et al., 2014;Thompson et al., 2018) and models (Yin et al., 2011 and Urban, 2016; Levermann et al., 2014), individual metrics are obtained for several subregions surrounding both ice sheets ( Fig. 1b, d).
For this purpose, 1989-2009 time-mean ocean temperatures from each CMIP5 model are interpolated onto a common tripolar ORCA025 grid (Ferry et al., 2012), which has a quasi-isotropic resolution corresponding to 0.25 • in latitude, and 75 vertical layers with a thickness ranging from 1 m at the surface to 200 m at the bottom. We use a conservative 3-D interpolation; if some parts of the ORCA025 grid are not covered by the CMIP grid, we extrapolate from the closest neighbor (horizontally above sills, then vertically to fill troughs behind sills). The regridding tools are made available on https://github.com/nicojourdain/SCRIPTS_CMIP5_ ANOM_NOW (last access: 29 July 2019, Dutheil et al., 2019). Regionally averaged coastal ocean temperatures are then computed in six sectors around the Antarctic continent ( Fig. 1b), which capture different continental shelf and melting regimes. A maximum bottom depth criterion of 1500 m is used, together with an explicit limit for the northern boundaries in the large embayments in the Ross and Weddell seas, to select ORCA025 ocean cells that are located on the continental shelf near the coast. For Greenland, the ocean has been separated into four connected regions based on the major hydrographic regimes surrounding the ice sheet ( Fig. 1d), with a similar cutoff beyond 1500 m bottom depth and geographical distance from the ice sheet to select coastal ocean cells near the ice sheet. For each subregion, volume-averaged temperatures below 200 m depth are computed, providing a scalar nearshore subsurface temperature metric. For Antarctica, the full depth range down to 1500 m is included, while for Greenland, the profiles are truncated below 500 m depth to account for shallow continental shelf depths and bottom sills that typically prevent inflows from greater depths to- The Cryosphere, 14, 855-879, 2020 www.the-cryosphere.net/14/855/2020/ ward the marine-terminating glaciers in Greenland fjords (Morlighem et al., 2017). Regional volume-averaged temperatures are also computed from available observed ocean climatologies, using the same algorithm as for the model output. For Greenland, observational data are taken directly from the annually averaged statistical fields of the 2013 World Ocean Atlas (WOA; Locarnini and Seidov, 2013). For Antarctica, a refined climatology of coastal water masses was constructed by combining the 2018 WOA data (Locarnini et al., 2019) with statistical fields from the EN4 ocean climatology (Good et al., 2013) and publicly available temperature profiles from seals equipped with satellite relay data loggers (Roquet et al., 2018), with further details provided in Nowicki (2019). In both cases, ocean measurements close to the ice sheets are so sparse that all observations are included in the computation of the regional averages, regardless of their acquisition date.

Aggregating historical metrics
In order to aggregate different metrics of varying nature and magnitude, each of the historical metrics described above (denoted as χ below) is normalized with regards to the 33model multi-model median and interquartile range (IQR). For each model i, We average the normalized metrics into three realms: atmosphere, surface ocean (for Antarctica), and subsurface ocean. This decision was made to weaken the dependence of the fiwww.the-cryosphere.net/14/855/2020/ The Cryosphere, 14, 855-879, 2020 nal ranking on the number of variables used for each realm. Normalization of metrics prevents highly variable or largeamplitude metrics from being overly influential in the average (see Fig. A1) while still penalizing extremes. The final aggregated score for each model is obtained by averaging atmosphere and ocean for Greenland and atmosphere, surface ocean, and subsurface ocean for Antarctica. An alternative aggregating method, where all normalized metrics are weighted equally (12 for Antarctica, 7 for Greenland), is presented in Fig. A2 and does not change our conclusions.

Climate change metrics
For atmospheric and surface ocean variables, climate change metrics are calculated as the difference between the 2070-2100 mean (RCP8.5) and the 1980-2010 mean (historical) value of each variable, spatially averaged over the entire Greenland and Antarctic atmospheric domains ( Fig. 1), denoted with the symbol. The only exception is for change in precipitable water, computed as the difference between the 2070-2100 mean (RCP8.5) and the 1980-2010 mean (historical) divided by the 1980-2010 mean value of each variable, then spatially averaged over the atmospheric domain, denoted with the δ symbol, because it follows a lognormal distribution. For the subsurface ocean, we define metrics as the change in volume-averaged regional temperature between the 1989-2009 and 2080-2100 periods. For Antarctica, we consider four metrics for the atmosphere (change in annual air temperature at 850 hPa, ta850[a]; in annual precipitable water, δprw[a]; and in position and strength of the tropospheric westerly jet, Jpos and Jstr), two metrics for the surface ocean (change in winter sea ice extent, sie[w]; and in summer sea surface temperature, tos[s]), and six metrics for change in subsurface ocean temperature ( T ), one for each of the sectors defined in Sect. 2.2. For Greenland, we define two metrics for the atmosphere (change in annual air temperature at 700 hPa, ta700[a]; and in annual precipitable water, δprw[a]) and four metrics for change in subsurface ocean temperature ( T ), one for each ocean sector defined in Sect. 2.2.

Maximizing diversity of projected 21C changes
To maximize the diversity of future projections covered in a sub-selection of models of size n, we define the ensemble inter-model spread E by combining the pairwise model differences across the climate change metrics defined in Sect. 2.3.1 (12 for Antarctica, six for Greenland). The spread of a three-model ensemble is computed as the following: with χ the climate change metrics defined in Sect. 2.3.1. The ensemble that maximizes E for a given ensemble size n (n = 3 for top three, n = 6 for top six) is the one qualified as "most diverse" in its future projections.

Results for Antarctica
In this section, we focus on the model selection for the Antarctic ice sheet, which is based on historical ranking (Sect. 3.1) and projection diversity (Sect. 3.2). The selected models are presented in Sect. 3.3.

Historical bias ranking
Over the Antarctic domain, the total normalized historical metric ranges between −0.32 (model of the highest fidelity, CanESM2) and 1.50 (model of the lowest fidelity, BMU-ESM), with a median value of 0.13 (Fig. 2a). Figure 2a shows the 33 climate models ranked by their historical metric, together with contributions of the subsurface ocean (blue), atmosphere (orange), and surface ocean (yellow) to the total historical metric. Models do not perform equally across the three realms. For example, GFDL-CM3 and EC-EARTH perform well in the atmosphere, with atmospheric metrics of −0.22 and −0.21 respectively, amongst the best models, but are ranked as low fidelity (with total bias scores of 0.46 and 0.54) due to their poor performance in ocean subsurface and surface conditions. Conversely, IPSL-CM5B-LR performs well in the subsurface ocean (metric of −0.20) but is penalized by its poor performance in the atmosphere (metric of 2.07) and surface ocean conditions (metric of 1.77).
Models also do not perform equally within each realm, indicating that biases originate due to regional processes for subsurface ocean or variable-specific biases for surface ocean and atmosphere. We provide the per-variable breakdown of the ocean subsurface metric (Fig. 2b) and ocean surface and atmospheric metrics (Fig. 2c). Although this paper cannot address these differences in detail, we highlight a few notable sources of discrepancies between metrics. For example, the subsurface heat in the Weddell Sea region is the largest single contributor to the ocean bias metric in several models (Fig. 2b), including EC-EARTH, MRI-CGM3, and BNU-ESM. The large ocean heat bias would warrant specific studies investigating the model representation of the ocean climatology in that region. Similarly, in the atmosphere, precipitable water is the largest single bias for models such as IPSL-CM5B-LR, INM-CM4, and MRI-CGCM3 and would warrant further investigation to improve model representation of the historical period.
Models that perform better than the median (historical metric < 0.13) have reasonable values for all three realms: the worst metric for each realm is lower than 50 % of the IQR away from the ensemble median for that realm (Fig. 2a). This The Cryosphere, 14, 855-879, 2020 www.the-cryosphere.net/14/855/2020/ result gives confidence that these models have a good overall performance, rather than compensating biases across realms.
Our averaging method was effective in penalizing models that have a low fidelity over an entire realm. For this reason, selecting the top three models in the top half of the 33 models ensures overall good performance of these models in both the ocean and atmosphere.

Projected changes
All 33 models considered in this study show an increase in air temperature over the Southern Ocean and Antarctic continent between the end of the 21st century and the end of 20th century climatologies (Fig. 3a), with a multi-model mean increase of 2.54 • C. Nevertheless, the ensemble shows a spread of transient climate sensitivity, with an atmospheric warming ranging from 1.3 • C (GFDL-ESM2G) to 3.6 • C (BNU-ESM), with a median of +2.5 • C. We highlight the three core (red) and three targeted (yellow) AOGCMs selected in Sect. 3.3, to illustrate the spread that they cover compared to the 33-model ensemble. Although the projected change in air temperature is only one of the variables we use to diagnose projected atmospheric changes, it provides a good representation of projected changes in the atmosphere. Indeed, the changes in annual air temperature are strongly correlated (R 2 > 0.82) to the projected changes in seasonal air temperature and in annual and seasonal precipitable water and strongly anti-correlated to changes in winter sea ice extent (R 2 = 0.70). Projected changes in wind jet strength, as quantified in Bracegirdle et al. (2018), show a weaker negative correlation with air temperature changes, although a decrease in jet strength is generally associated with a decrease in annual sea ice extent (R 2 = 0.46), as noted in Bracegirdle et al. (2018).
Climate models also overwhelmingly project a 21st century increase in ocean temperatures around Antarctica. For example, the 33 models project a warming of the Amundsen shelf ( Fig. 3b), ranging from no significant warming (lowest warming, MRI-CGCM3) to +1.10 • C (highest warming, IPSL-CM5B-LR), with a median value of +0.45 • C. Other regions show a qualitatively similar range of projected changes, with the highest warming (as quantified by the median value of the ensemble) occurring in the Dronning Maud Land (DML), Amery, and Totten regions (DML median: +0.76 • C; Amery median: +0.70 • C; Totten median: +0.59 • C). The lowest projected warming occurs in the Weddell and Ross regions (Weddell median: +0.21 • C; Ross median: +0.30 • C). The Amundsen region, presented in Fig. 3b, is currently under scrutiny due to ice shelf thinning and accelerating ice discharge in the last decade (Kimura et al., 2017;Mouginot et al., 2014;, but this region is projected to warm moderately in the future according to the 33-model ensemble (Amundsen median: +0.45 • C).
Unlike the atmospheric warming, which is a good proxy for other atmospheric changes, the projected ocean warm-ing in the Amundsen region is only weakly correlated (R 2 ≤ 0.016) to other ocean regions. Some significant correlation can be found for neighboring regions in East Antarctica, such as between the Dronning Maud Land and Amery regions (R 2 = 0.71) and between the Amery and Totten regions (R 2 = 0.48), but it is low across other regions (R 2 ≤ 0.25). Projected changes in the ocean are relatively independent across regions (detailed in Fig. B1), which confirms the added value of quantifying regional ocean metrics rather than metrics integrated over all Antarctic shelves.

Top three (core experiments)
In the case of the Antarctic domain, the selection criteria described in Sect. 2 led to six suitable coupled models (CanESM2, NorESM1-M, CSIRO-Mk3-6-0, CCSM4, MIROC-ESM-CHEM, MIROC-ESM), where availability of required data from RCP2.6 projections is the strongest constraint. We then select the three models that maximize the ensemble diversity E n=3 , as defined in Sect. 2.3.2. The selection is robust to removing one of the metrics at a time and to changing the weight of the metrics in the calculation (Appendix C1).
Overall, NorESM1-M shows a stronger end-of-21st-century ocean warming than the ensemble median (dashed) but a low atmospheric warming compared to the model ensemble. Conversely, MIROC-ESM-CHEM features an ocean warming similar to that of the ensemble median, associated with strong atmospheric changes, about one IQR higher than the median. Finally, CCSM4 shows very distinct regional patterns of ocean warming, with strong warming in the Weddell and Totten regions and lower warming in the Ross and Dronning Maud Laud regions, relative to the ensemble median. The projected atmospheric changes in CCSM4 are on the high end of the ensemble, qualitatively similar to that of MIROC-ESM-CHEM. The qualitative warming projected by the three models selected for the Antarctic core experiments is summarized in Table 2.

Top six (targeted experiments)
For the additional three models (targeted), CSIRO-Mk3-6-0 (yellow) is chosen because of its good ranking (Fig. 2) and median projected changes (Figs. 3, 4b), and it is preferred to ACCESS1.0 (which also shows median projections under RCP8.5) because of the availability of the RCP2.6 scenario. Each of the metrics of future change lies close to the multi-model ensemble median (see Fig. 4b), meaning that approximately half of the 33 climate models predict www.the-cryosphere.net/14/855/2020/ The Cryosphere, 14, 855-879, 2020 higher changes than those of CSIRO-Mk3-6-0, and half predict lower changes. The other two models selected are, in alphabetical order, HadGEM2-ES (brown) and IPSL-CM5A-MR (dark blue). HadGEM2-ES brings diversity to the six-model ensemble because of its extreme end-of-21st-century warming in the ocean, particularly in the Ross Sea. This extreme regional warming, more than 2 times larger than the IQR from the median value, is ruled out of the top three because it is considered to be a less likely response than those produced by a The Cryosphere, 14, 855-879, 2020 www.the-cryosphere.net/14/855/2020/ large number of distinct climate models. Nevertheless, in an intercomparison effort such as ISMIP6, sampling high-end scenarios is essential to (1) examine the response of ice sheet models which may have runaway effects and (2) include high-risk (low probability, high cost) scenarios in terms of future sea level rise. The atmospheric changes produced by HadGEM2-ES are higher than the median, but not outliers. Finally, IPSL-CM5A-MR features an ocean warming lower than the ensemble median in most ocean regions and atmospheric changes higher than the median. It is the only model selected with systematically low warming in the ocean and can be thought of as the converse to NorESM1-M. Robustness of the model selection is demonstrated in Appendix C2. The qualitative warming projected by the additional three models selected for the Antarctic "targeted" experiments is summarized in Table 2.

Results for Greenland
In this section, we describe the model selection for the forcing of the Greenland ice sheet. The methods include the model evaluation (Sect. 4.1 and 4.2) and ensemble selection (Sect. 4.3), mirroring the selection performed for the Antarctic ice sheet (Sect. 3).

Historical bias ranking
Coupled climate models do not perform equally over the subsurface ocean and the atmosphere (Fig. 5a) around Greenland, consistent with findings for Antarctica, shown in Sect. 3. Some models perform well in the atmosphere but are penalized by their poor ocean performance. For example, CMCC-CMS is the median of the ensemble and features one of the lowest biases in the atmosphere (−0.69) and one of the highest biases in the ocean (0.73). Conversely, others perform well in the ocean but show high biases in the atmosphere (e.g., MRI-CGCM3). This unequal performance across the ocean and atmospheric variables supports the need to assess several components of coupled climate models together, rather than separately.
Investigating the source of biases in any given model is beyond the scope of this paper, which focuses on selecting six models suitable for the ISMIP6 simulations. Nevertheless, the ranking of the models can highlight significant biases. For example, the ocean bias in several models, most notably CMCC-CS, CMCC-CESM, and IPSL-CM5B-LR, is dominated by a bias in ocean heat in the Arctic region. This large bias in temperature would warrant a specific study to improve model representation of that region. However, the observations in this region are scarce and we have a lower degree of confidence in the resulting ocean climatology in that region than in more frequently and densely observed regions, as discussed in Sect. 5.
The model ranking around Greenland highlights that the fidelity of coupled models is regionally dependent. The models of the highest fidelity around Greenland do not necessarily perform well around Antarctica and vice versa. For example, CanESM2 is the best-ranked model for Antarctica (see Sect. 3) but is ranked in the lower half of the ensemble around Greenland due in part to its ocean biases. Likewise, MIROC5 performs well on all metrics around Greenland, and has been extensively used in the relevant literature www.the-cryosphere.net/14/855/2020/ The Cryosphere, 14, 855-879, 2020 (e.g., Fettweis et al., 2013;Tedesco and Fettweis, 2012), but has strong atmospheric biases over Antarctica. Climate models are not expected to perform equally in all regions; nevertheless, it is important for the scientific community to keep those regional variations in mind, especially if using existing studies performed over a different region. This unequal performance across the Greenland and Antarctic regions also supports our decision to perform model ranking and selection independently for the two ice sheets. Finally, the models that perform better than the median have ocean and atmosphere biases that lie lower than 0.5 IQR away from the median. Although biases in individual (regional) variables may be higher than that, this result confirms that the best-ranked models have a good performance in both the subsurface ocean and the atmosphere and gives us confidence that the top half of the ensemble models are suitable candidates for the Greenland model selection.

Future projection diversity
All 33 AOGCMs project atmospheric warming over Greenland by the end of the 21st century. Projections range from +1.95 • C (lowest warming, FIO-ESM) to +5.95 • C (highest warming, MIROC-ESM-CHEM) with a median warming of +4.09 • C (Fig. 6a). Models that made our final selection, highlighted in red (top three) and yellow (top six), sample a range of future warming. Similar to results presented for Antarctica (Sect. 3), the changes in annual air temperature over Greenland are a good proxy for most other atmospheric changes. Increase in 700 hPa air temperature is associated with an increase in precipitable water (R 2 = 0.96), an increase in ocean surface temperature (R 2 = 0.60), and a decrease in summer sea ice cover (R 2 = 0.29). Most models also project an increase in ocean temperature on the shelf surrounding Greenland. Baffin Bay, for example, is projected to warm by +0.48 • C by the end of the 21st century, with models projecting between +0.07 • C (lowest warming, BCC-CSM1-1) and +1.70 • C (highest warming, CanESM2). The models selected in Sect. 4.3, highlighted in Fig. 6b, cover a range of projected warming in Baffin Bay. Two other regions show similar projected changes (Arctic median: +0.48 • C; Subpolar Gyre (SPG) median: +0.49 • C). The highest projected warming occurs in the Greenland-Iceland-Norwegian region (GIN), with a median warming of +0.76 • C.
Projected changes across the ocean regions are correlated between the Arctic Ocean and GIN regions (R 2 = 0.58) and mildly correlated between the SPG and GIN regions (R 2 = 0.31). Other regions are only weakly correlated with each other (detailed in Fig. B2), and ocean changes show no significant correlation with the projected atmospheric changes (R 2 < 0.06).

Recommended ensemble
In the case of Greenland, the availability of sub-daily outputs is a strong constraint for the model selection. This was a determining factor because existing studies over Greenland show that RCMs outperform global climate models in representing realistic surface mass balance (e.g., Noël et al., 2018;Fettweis et al., 2013).

Top three (core experiments)
When applying the selection criteria described in Sect. 2 and removing CNRM-CM5 and EC-EARTH due to unavailable data, six models remain for the top three selection (MIROC5, IPSL-CM5A-MR, NorESM1-M, ACCESS1-0, ACCESS1-3, HadGEM2-ES). In this case, MIROC5 was preselected, as it features changes similar to those of the ensemble median (dotted; Fig. 7a), meaning half of the models project stronger changes than those of MIROC5, and half project weaker changes. Two additional models are selected, maximizing ensemble diversity of three models (MIROC5, model 1, model 2).
The Cryosphere, 14, 855-879, 2020 www.the-cryosphere.net/14/855/2020/ The top three models selected are, in alphabetical order, HadGEM2-ES, MIROC5, and NorESM1-M. These three models show different patterns of projected changes by the end of the 21st century (Fig. 7a). As described above, MIROC5 is chosen as a good representation of the overall ensemble. HadGEM2-ES features high atmospheric changes, including increases in air temperature and precipitable water, of a magnitude stronger than the ensemble median. The projected changes in ocean heat are more regionally dependent, with warming higher in the Arctic and GIN (northeast) and lower in Baffin Bay and SPG (southwest) relative to the ensemble median. Conversely, NorESM1-M features a warming in the atmosphere on the low end of the 33-model ensemble projections. The ocean warming is also regionally dependent, with NorESM1-M featuring low warming in GIN, the Arctic, and the SPG regions and a strong warming in the Baffin Bay region. The qualitative warming projected by the three models selected for the Greenland core experiments is summarized in Table 3.
The selected models are, in alphabetical order, ACCESS1-3, CSIRO-Mk3-6-0, and IPSL-CM5A-MR. CSIRO-Mk3-6-0 projects a low atmospheric warming, far below the median value, alongside an extreme warming in the southwest ocean regions ( T BB > 2; T SPG = 0.94). ACCESS1-3 adds diversity to the ensemble as it shows strong warming in Baffin Bay and the Arctic Ocean but low warming in the Sub-polar Gyre (SPG) region. Its atmospheric warming is close to the median. Finally, IPSL-CM5A-MR project strong warming in the Greenland-Iceland-Norwegian seas (GIN), while other ocean regions and atmospheric variables are closer to the median. The qualitative warming projected by the additional three models selected as forcing for the Greenland targeted experiments is summarized in Table 3.

Discussion
In this study, we evaluated the performance of 33 CMIP5 AOGCMs relative to reanalyses and gridded observational datasets covering the atmosphere, sea surface, and subsurface ocean around the Greenland and Antarctic ice sheets. We also assessed 21st century changes in key oceanic and atmospheric variables. Time constraints for ISMIP6 simulations drove several decisions relating to the scope of this analysis, including the use of the CMIP5 (rather than the now partially available CMIP6) ensemble, the use of AOGCMs that had already been processed and regridded for both the ocean and atmosphere, and the use of available observational products with limitations and biases, particularly in the ocean subsurface. However, this assessment of near-ice-sheet presentday and future climate remains the most comprehensive performed to date.
Many subjective choices were made in the model selection process. We have attempted to document these choices, and note that the relative insensitivity of results to alternate choices (e.g., Fig. A.2, Appendix C) provides some confidence that our rankings are robust for the CMIP5 ensemble. However, because the evaluation and selection ex- The Cryosphere, 14, 855-879, 2020 www.the-cryosphere.net/14/855/2020/ ercise will have to be repeated for future model ensembles (e.g., CMIP6), our discussion focuses on key elements of our methodology that could be further developed. Implications are discussed with respect to results from the full 33-member ensemble to extend the relevance to other exercises where the small ensemble required for ISMIP6 may not apply. Model selection was made largely based on their representation of the present-day local climate, with the implicit assumption that biases relative to observations reflect a poor representation of processes of relevance to future warming. It is difficult to determine whether performance relative to this set of present-day regional metrics is (1) a sufficient means to evaluate AOGCMs and (2) relevant to the rate of 21st century near-ice-sheet warming. Krinner and Flanner (2018) show that model biases are stationary under future climate change within the CMIP5 dataset, providing justification for using less biased models for climate change studies. However, over the long timescales that ISMIP6 seeks to assess, different processes and/or biases (global and/or nonlocal ocean warming rates, e.g., stratospheric ozone recovery) may be equally important; i.e., even if a model closely matches historical conditions, key processes for projections may still be missing. Support for the relevance of these metrics might be derived from a clear relationship between the modern state and projections of change across models (so-called "emergent constraints"). Bracegirdle et al. (2015) and Agosta et al. (2015) found that 21st century changes in Antarctic air temperature and precipitation rate (and, perhaps surprisingly, jet strength; Bracegirdle et al., 2018) were correlated to sea ice area bias across models. In this analysis, we found no significant correlation between historical biases and climate changes over Antarctica (or Greenland). A plausible explanation is our use of an 850 hPa (rather than surface) temperature metric and our circum-Antarctic study region. However, this result may also indicate a sensitivity to the specific models included in the ensemble: we find that the magnitude and significance of inter-model correlations are sensitive to whether all or a set of the best-performing models are assessed. Shared code and parameterizations across models may also underlie some of the modest correlations evidenced in our analysis.
It is difficult to determine whether the historical metrics chosen in this analysis are comprehensive (e.g., account for all relevant processes) and/or independent. Concerning independence, we eliminated metrics which represent the same physical processes and are strongly correlated (e.g., the precipitation and air temperature variables in Bracegirdle et al. (2015) are strongly correlated to those in Agosta et al. (2015) and were not included in this study). Assessing comprehensiveness is more difficult. For example, the choice of metrics is constrained by the availability of observations. In particular, oceanographic measurements in the vicinity of ice sheets are very sparse and feature sharp horizontal gradients in water masses (e.g., Thompson et al., 2018). As a result, we chose to calculate volume-and time-mean quantities over subjectively defined regions in order to maximize the number of observations included. It is unclear which ocean region is most "important" in terms of future mass balance. The optimal number of regions, based on their relevance to future ice sheet change and their independence, remains to be determined. These choices should be expected to influence evaluations of both performance and warming. In contrast, observations for the atmosphere and surface ocean have better spatiotemporal coverage. Correspondingly, the metrics chosen were continental scale and seasonally resolved. However, our continental-scale evaluation may obscure regional variability. Atmospheric dynamical modes, such as variability in the Amundsen Sea Low and the Southern Annular Mode (SAM), strongly impact the regional climate in Antarctica Fyke et al., 2017). Although our gridwww.the-cryosphere.net/14/855/2020/ The Cryosphere, 14, 855-879, 2020 point error metric reflects biases in atmospheric pressure, it is not able to attribute the bias to a model's lack of fidelity to, say, the asymmetric nature of the SAM. Future work should more formally assess the number and relative weighting of regional metrics in the atmosphere and ocean and include dynamically relevant measures of asymmetry. Similar concerns apply to the metrics of future warming and their relevance to ice sheet mass balance. We note that our analysis does not address the rate of warming, which differs widely across models. In the ocean, the rate and timing of warming may have dramatic effects on 21st century ice sheet evolution (Hellmer et al., 2012;Timmermann and Goeller, 2017). We have noted the unequal performance of coupled climate models over different realms, which we suggest highlights the importance of assessing model fidelity over a range of metrics combining the subsurface ocean, surface ocean, and atmosphere conditions. It also explains why the present ranking of models differs from existing intercomparison studies specifically focused on the atmosphere (e.g., Agosta et al., 2015) or the ocean (e.g., Meijers et al., 2012;Sallée et al., 2013;Russell et al., 2018). For example, the metrics used in Agosta et al. (2015) led to EC-EARTH and CanESM2 being ranked closely (8 and 9 out of 41 models), implying similar performance. However, by including the subsurface ocean metrics, our results point to CanESM2 as the model with the best fidelity overall, while EC-EARTH is in the lower half of the 33-model ensemble due to its poor performance in the ocean (other examples of differences in rankings across realms can be found by examining Figs. 2 or 5). As Agosta et al. (2015) focuses purely on the model performance for ice sheet surface mass balance, their results differ from the current study evaluating both the ocean and atmospheric metrics for the sake of providing the atmospheredriven surface mass balance and the ocean-driven melt from the same coupled model as boundary conditions to ice sheet models. This underscores the importance of considering the original aim of an intercomparison, including the variables and the regions considered, before interpreting or applying a ranking derived from the analysis.
Antarctica and Greenland were treated independently, supported by the different model performance across the ensemble. A different set of models was selected for Greenland and Antarctica, suggesting model performance varies in polar regions of different hemispheres. However, with respect to future warming, it is reasonable to expect some degree of interhemispheric correlation in warming (e.g., due to a high AOGCM climate sensitivity). It is unclear how this inter-ice-sheet independence assumption could influence sea level projections, as it depends upon the response of surface mass balance (SMB) and changes in ice flux of the different ice sheets.
Using aggregated measures of present-day performance and future climate changes, we selected six AOGCMs as adequate and representative of future near-ice-sheet warming pathways. This ensemble size was judged to be reasonable for ISMIP6, given computational limitations and the goal to sample different sources of uncertainty (e.g., model, RCP scenarios, parameterizations, parameters values). However, given the many degrees of freedom across the evaluation metrics, it is difficult to select a fully representative sample. Some limitations of the sample size are apparent, notably the nonuniform distribution across parameters (e.g., no low ocean warming sampled). Furthermore, the models selected are not structurally independent. For example, HadGEM2-ES and ACCESS1-3 share a common Hadley Centre atmospheric model, while NorESM1 and CCSM4 share the NCAR Community Atmospheric Model. Such interdependence may limit the diversity of forcing applied to ISMIP6 models. We do note that even if ISMIP6 had the ability to evaluate all available CMIP5 AOGCMs, issues with statistical sampling and diversity of CMIP models, code similarities/independence, and quality would persist Sanderson et al., 2015a, b). Future model evaluation studies may invert the process used here, i.e., objectively assess the appropriate number of models to achieve sufficient diversity in forcing. Finally, we emphasize that evaluation is only a first step to a better process-based understanding of the differences between models. It is critical to assess the processes that make models (or model families) perform better or project climate warming at different rates. We invite modeling groups or researchers interested in examining these to trace back the source of the bias in individual models or across the larger ensemble.

Conclusions
As part of the Ice Sheet Model Intercomparison Project for CMIP6 (ISMIP6), ice sheet models will be forced with climate model-derived time series of basal melt (for Antarctica), front retreat (for Greenland), and surface mass balance. To generate such forcing, a subset of CMIP5 models has been selected according to (i) their realistic representation of the historical period (compared to reanalysis data) and (ii) the diversity of the projected 21st century changes under RCP8.5 within the selected subset. As a result of the evaluation and selection process performed in this study, six AOGCMs have The Cryosphere, 14, 855-879, 2020 www.the-cryosphere.net/14/855/2020/  Table 2). Independently, six AOGCMs have been selected for ISMIP6 Greenland future projection runs (core experiments: HadGEM2-ES, MIROC5, NorESM1-M; targeted experiments: ACCESS1-3, CSIRO-Mk3-6-0, IPSL-CM5A-M; see Table 3). Ocean and atmospheric data from these AOGCMs are used to generate ice sheet surface mass balance, the Greenland retreat parameterization (e.g., Slater et al., 2019), and the Antarctic basal melt parameterization (Nowicki, 2019), which will be presented in detail in upcoming papers. It is expected that the range of near-ice-sheet climate changes simulated by these AOGCMs will result in diverse projections of ice sheet mass balance change when used to force ISMIP6 simulations. The evaluation and selection of models was a necessary first step to develop the current ISMIP6 experiment protocol and can be improved upon for the next phase of ISMIP in multiple ways. Firstly, future studies will evaluate and select models from the CMIP6 ensemble. Repeating this study with CMIP6 data will provide insight into whether new developments in climate models reduce ocean and atmospheric biases near ice sheets. Secondly, results from the ice sheet simulations will provide insight into ice sheet model sensitivity. For example, future model selection may weight atmospheric changes more heavily than ocean changes if ice sheet models show a higher sensitivity to surface mass balance. In addition, future selection should look to include more dynamical metrics (e.g., Amundsen Sea Low representation, ocean slope front position) and consider the rate of projected changes in the ensemble diversity. These improvements will ensure that, in an intercomparison project that remains computationally limited, we prioritize the forcing that is most fruitful.
www.the-cryosphere.net/14/855/2020/ The Cryosphere, 14, 855-879, 2020 Appendix A: Robustness of historical ranking Appendix A provides additional information illustrating the robustness of the historical ranking methodology. Firstly, we demonstrate the need to normalize the historical bias metrics (Fig. A1). As the atmospheric bias metrics are based on variables as distinct as temperature, pressure, sea ice extent, and precipitable water, the raw bias metrics have different mean values and different inter-model variability (Fig. A1a). By normalizing each metric by its ensemble median and interquartile range, the (normalized) metrics are scaled to cover a similar inter-model variability (Fig. A1b) and can be combined into a single metric independently of the original magnitude of the raw variable. Secondly, we illustrate the robustness of the historical ranking to the averaging method by providing an alternate ranking. In Fig. A2, the AOGCMs are ranked by averaging all (normalized) bias metrics with equal weight (dashed), instead of the bias metric used in this study (where each realm is given the same weight; black). For both the Antarctic (Fig. A2a) and Greenland (Fig. A2b) domains, the difference in ranking is minor, as only two models would switch between the top and bottom 50 % of the 33-model ensemble, and neither of these models is present in the top three or top six ensembles.  Fig. B1, while the warming over the four Greenland shelf regions is presented in Fig. B2. In each region, labels and markers (*, †) identify models selected in the top three and top six ensembles, respectively. In each region, the majority of AOGCMs predict a warming by the end of the 21st century, although the magnitude and inter-model spread of warming is regionally dependent. www.the-cryosphere.net/14/855/2020/ The Cryosphere, 14, 855-879, 2020   Table C2 lists the absolute and relative occurrence of each individual model in the possible combinations presented in Table C1.    Review statement. This paper was edited by Ginny Catania and reviewed by two anonymous referees.