Reply on RC2

This submission describes an extensive set of hindcasts from the CESM2 model that enable the performance of initialized predictions in the relatively unexamined multi-year range (out to 24 months in this case) to be extensively explored. Notably, performance over a broad range of Earth system components (atmosphere, sea ice, ocean and land including biogeochemistry) is addressed. The paper is very well organized and written, and criticisms are limited mainly to relatively minor details of description and presentation. Exceptions are items 7 and 17 below, which will require modest additional computation if the authors concur that acting on these recommendations will improve the paper. Overall however the authors are to be congratulated for this interesting and compelling documentation of SMYLE.


Done. Thanks for these suggestions.
3) At line 72 should also reference Ilyina et al. https://doi.org/10.1029
Done.

4) Line 128 states that forcing is applied cyclically from 1901-1920 to equilibrate the land state. Please go into a bit more detail about the total length of time this cyclic forcing was applied, in relation to expected equilibration times of land variables such as vegetation and soil carbon.
In response to this comment as well as a similar comment from RC1, we have added Figure A4 to the new Appendix A. This figure demonstrates the near equilibration of land carbon stocks over ~4,000 years of spin-up simulation. Despite some slight imbalance in total ecosystem carbon (primarily associated with soil organic matter carbon), the spin-up was deemed close enough to equilibrium to proceed with the land-only historical simulation. The associated discussion in the text has been expanded (Section 2, starting at line 134).

5) It's stated that the hindcasts cover 1970-2019. Presumably this is the period covered by the initialization times, and not the simulations themselves which would extend into 2021? If so please be explicit that 1970-2019 spans the initialization times.
We've modified the text throughout to clarify that mention of specific historical windows (e.g., 1970-2019) refers to forecast initialization years included in the analysis. We also added a discussion in section 2 clarifying that the verification window is a function of: 1) hindcast initialization time, and 2) observational temporal coverage. Our general approach is to maximize temporal sampling to the extent possible, which means that the actual verification window can vary with lead time. We think this is clear while also avoiding burdening the reader with excessive detail.

6) Regarding "Potentially useful prediction skill (ACC>0.5) is seen for land precipitation over the southwestern US in DJF and MAM (lead month 1)" at lines 208-209, this really should say "southwestern North America" considering that the only DJF grid boxes >0.5 are in Mexico.
In response to a comment from RC1, we have changed the precipitation verification dataset to GPCPv2.3 (spanning 1979-2021). This has resulted in slight changes in skill scores and regions/leads where ACC>0.5. The discussion of Figure 2 has been revised accordingly, and in particular, we now highlight "southwestern North America in DJF (lead month 1)".

7) Regarding "A more rigorous analysis is needed to definitively demonstrate that SMYLE skill differences from DPLE are statistically significant and not likely explainable by chance" (lines 226-227), this could be done relatively easily by applying the random walk methodology of DelSole and Tippett, https://doi.org/10.1175/MWR-D-15-0218.1, where differences either in the anomaly pattern correlation or the RMSE between the 50 pairs of November-initialized hindcasts could be used as the basis for comparison.
Thanks for this suggestion. In response to RC1 and this comment, we chose to include additional figures in Appendix B (see Figs. B3, B4, B6) that test the significance of local ACC differences between SMYLE-NOV and DPLE-NOV by accounting for uncertainty due to finite ensemble size. The 20-member SMYLE-NOV skill is compared to a distribution of 20-member DPLE-NOV skill scores at each grid point, and overall performance is assessed by comparing the percentage of global surface area (within 80°S-80°N) associated with skill improvement/degradation. Other methods (such as that proposed by DelSole and Tippett) could be explored in future work, but we think simple skill difference plots will be of particular interest to readers and will suffice for this manuscript.
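The subsampling test described above can be sketched as follows. This is an illustrative outline using synthetic data and assumed member counts (40 for DPLE, 20 for SMYLE) at a single grid point; it is not the code behind Figs. B3-B6, and the function and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def acc(forecast, obs):
    """Anomaly correlation coefficient between two 1-D anomaly series."""
    return np.corrcoef(forecast, obs)[0, 1]

# Synthetic stand-ins for one grid point: 50 verification years,
# a 40-member "DPLE" ensemble and a 20-member "SMYLE" ensemble.
n_years = 50
obs = rng.standard_normal(n_years)
dple = obs + rng.standard_normal((40, n_years))
smyle = obs + rng.standard_normal((20, n_years))

# Skill of the 20-member SMYLE ensemble mean.
smyle_acc = acc(smyle.mean(axis=0), obs)

# Null distribution of 20-member DPLE skill: repeatedly draw 20 of the
# 40 members without replacement and score the subsampled ensemble mean.
n_draws = 1000
dple_accs = np.empty(n_draws)
for i in range(n_draws):
    members = rng.choice(40, size=20, replace=False)
    dple_accs[i] = acc(dple[members].mean(axis=0), obs)

# Flag the grid point if SMYLE skill falls outside the central 90%
# of the subsampled DPLE skill distribution.
lo, hi = np.quantile(dple_accs, [0.05, 0.95])
significant = (smyle_acc < lo) or (smyle_acc > hi)
print(f"SMYLE ACC={smyle_acc:.2f}, DPLE 5-95%=[{lo:.2f}, {hi:.2f}], significant={significant}")
```

Applied pointwise over a latitude-longitude grid, the fraction of surface area where the flag is set gives the kind of improvement/degradation percentages described above.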
8) At line 178, please provide a rationale for regridding to a 5x5 or 3x3 degree grid. (Also a small point, but I'm not sure that regridding to a coarser grid qualifies as "interpolation".)
We have replaced "interpolated" with "mapped" and have added the following rationale (line 198): "This remapping is done to highlight aggregate regional skill, increase the efficiency of skill score computation, and improve the quality of global map plots that include pointwise significance markers (Appendix B)."

9) Below line 360 it would be appropriate to reference Butler et al. https://doi.org/10.1002/qj.2743 in relation to the influence of lid height on skill in forecasting the NAO. (For example could append as "…relative to these baseline SMYLE results, although a robust connection between atmospheric vertical resolution and NAO skill has not been demonstrated (Butler et al. 2015).")
Done.

Fig. 9 which are better aligned with the experiment.

We think that the seasonal dependence of potential predictability skill in these regions concerns the persistence of subsurface nutrient anomalies (first described in Park et al., 2019). During wintertime mixing, these nutrient anomalies reemerge in the upper ocean and contribute to skill in predicting total summer/fall productivity in these LMEs. We agree that this is interesting and therefore, in response to this comment, we have added a few additional sentences at the end of this paragraph to elaborate on the seasonal dependence of marine ecosystem predictability (starting at line 588).

13) The OceanSODA-ETHZ aragonite saturation dataset covers 1985 to 2018 according to Gregor and Gruber (2021), so presumably the skill results in Fig. 9 are specific to this period? Or does the verification period leave out the years before 1990 which are much more uncertain according to those authors? Please be explicit about this and any other deviations of verification periods from the 1970-2019 period covered by SMYLE.
The verification against OceanSODA-ETHZ is performed over the window 1985-2018. We have clarified this in the text (line 608). We've also made efforts to be clear throughout about verification windows, without getting into excessive detail (see response to comment above).

We have corrected Figure 13 to accurately reflect the lead times mentioned in the caption (2, 8, 14, 20), and we've removed the sea ice skill scores from plots (Figs. 12, 13) and put them in new tables (Tables 1, 2). We think the new tables help improve readability, and they facilitate inclusion of scores between OBS and FOSI.

18) Tables 1 and S1 along with
We have added this clarification in the captions of Figs. 14, 15 as well as the TC ACE correlation table (now Table 3).

19) Regarding the RMSE scores, "RMSE" isn't defined anywhere, and when introducing RMSE in the text the authors should briefly comment on the use of normalized RMSE and introduce the nRMSE notation. Also, is nRMSE defined such that predictions of climatology (zero anomaly) will yield values of 1? If so then briefly mentioning this will help the reader appreciate that nRMSE values <1 indicate that the prediction is more skillful than a climatological prediction.
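The convention asked about here can be made concrete with a short sketch. The function and data below are purely illustrative (hypothetical names, not the paper's verification code): nRMSE is the RMSE of the anomaly forecast divided by the RMSE of a zero-anomaly (climatology) forecast.

```python
import numpy as np

def nrmse(forecast_anom, obs_anom):
    """Normalized RMSE: RMSE of the anomaly forecast divided by the
    RMSE of a zero-anomaly (climatology) forecast of the same series."""
    rmse = np.sqrt(np.mean((forecast_anom - obs_anom) ** 2))
    rmse_clim = np.sqrt(np.mean(obs_anom ** 2))  # zero-anomaly baseline
    return rmse / rmse_clim

obs = np.array([0.5, -1.0, 0.3, 0.8, -0.6])  # toy observed anomalies

# A climatology forecast (all-zero anomalies) yields nRMSE == 1 by construction.
print(nrmse(np.zeros_like(obs), obs))  # 1.0

# nRMSE < 1 means the forecast beats a climatological prediction.
print(nrmse(0.5 * obs, obs))  # 0.5
```

Because the climatology forecast scores exactly 1 under this definition, nRMSE < 1 directly signals skill beyond climatology, which is the reading the comment suggests highlighting.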
A new paragraph in section 2 (starting line 202) introduces the skill metrics examined in the paper. We note that nRMSE is defined such that a climatology forecast yields a value of 1.

line 520 vs 185 vs 591: is it "best track", "Best Track" or "BestTrack"?
We now consistently use "Best Track".
line 528: should the 2 in kt2 be a superscript?
Yes, fixed.
line 556: suggest "multi-year skill" -> "multi-year skill or potential skill"
We have modified the sentence accordingly.

line 923: ACC map gross -> ACC map for gross
Fixed.