## Overview

In this vignette we will see how to retrieve and prepare Reynolds optimally interpolated sea surface temperature (OISST) data for calculating marine heatwaves (MHWs). The OISST product is a global 1/4 degree gridded dataset of Advanced Very High Resolution Radiometer (AVHRR) derived SSTs at a daily resolution, starting on 1 September 1981. The source of the data is currently the NOAA NCDC.

Each global, daily file is around 8.3 MB, so they add up to a large amount of data when a time series of at least 30 years duration is downloaded. Remember that a time series of at least 30 years is recommended for heatwave detection. If one were to download all of the data currently available it would be much larger than 100 GB of total disk space. So it is usually preferable to download only a a subset of data. Thanks to the rerddap package this is now very easy to do in R.

### The packages

The data generated by the environmental observation organisations in the United States are now rather easy to download. For the purposes of this vignette we will be accessing the NOAA OISST data hosted on this ERDDAP web interface. One may download the data there manually with the user friendly web interface, or use the rerddap package to do the same through R. Due to some recent developments, the rerddap package on CRAN may provide an unstable interface to the NOAA ERDDAP servers. This issue has already been addressed in the development version of the package and so we will begin by installing the package from there. Please note that this package has quite a few dependencies and so may take some time to install. If any errors are encountered during this process please consult the GitHub page here.

# Install the development version of the rerddap package from GitHub
# NB: Only need to run this once
# devtools::install_github("ropensci/rerddap")

# The two packages we will need
library(tidyverse)
library(rerddap)

# The information for the NOAA OISST data
info(datasetid = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon", url = "https://www.ncei.noaa.gov/erddap/")

With our packages loaded and our target dataset identified, we may now simply download it with griddap(). While putting this vignette together however I noticed one little hiccup in the work flow. It seems that the ERDDAP server does not like it when one tries to access more than nine consecutive years of data in one request, regardless of the spatial extent being requested. So before we download our data we are going to make a small wrapper function that we can tell the range of times we want to download. This will reduce the amount of redundant coding we would otherwise need to do.

### The function

# This function expects the user to provide it with a start and end date
OISST_sub <- function(time_df){
oisst_res <- griddap(x = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon",
url = "https://www.ncei.noaa.gov/erddap/",
time = c(time_df$start, time_df$end),
depth = c(0, 0),
latitude = c(-40, -35),
longitude = c(15, 21),
fields = "sst")\$data %>%
mutate(time = as.Date(str_remove(time, "T00:00:00Z"))) %>%
dplyr::rename(t = time, temp = sst) %>%
select(lon, lat, t, temp) %>%
na.omit()
}

In the wrapper function above we see that we have chosen to download only the ‘sst’ data out of the several variables available to us. We also see that we have chosen the spatial extent of latitude -40 to -35 and longitude 15 to 21. This a small window over some of the Agulhas Retroflection to the south west of the coastline of South Africa. A larger area is not being chosen here simply due to the speed constraints of downloading the data and detecting the events therein. One may change the lon/lat values above as is necessary to match a desired study area.

The wrapper function above will also be re-labelling the ‘time’ column as ‘t’, and the ‘sst’ column as ‘temp’. We do this so that they match the default column names that are expected for calculating MHWs and we won’t have to do any extra work later on.

One must note here that depending on the RAM available on ones machine, it may not be possible to handle all of the data downloaded at once if they are very large (e.g. > 5 GB). The discussion on the limitations of the R language due to its dependence on virtual memory is beyond the scope of this vignette, but if one has chosen to only download the spatial extent laid out above one should not encounter any RAM issues.

### The date range

With our wrapper function written we would now need to run it several times in order to grab all of the available OISST data for the range of the full available years. As of the writing of this vignette this is from 1982 – 2018. Even though each year of data for the extent used in this vignette is only ~1.0 MB, the server does not like it when more than 9 years of consecutive data are requested. The server will also end a users connection after ~17 individual files have been requested. Because we can’t download all of the dat in one request, and we can’t download the data one year at a time, we will need to make requests for multiple batches of data. To accomplish this we will create a dataframe of start and end dates that will allow us to automate the entire download while meeting the areforementioned criteria.

# Date download range by start and end dates per year
dl_years <- data.frame(date_index = 1:5,
start = as.Date(c("1982-01-01", "1990-01-01",
"1998-01-01", "2006-01-01", "2014-01-01")),
end = as.Date(c("1989-12-31", "1997-12-31",
"2005-12-31", "2013-12-31", "2018-12-31")))

### The data

With the download index complete we now use purrr to map our wrapper function to the date range dataframe. One could also use the plyr suite of functions to automate this process, but I’ve chosen here to stick with the tidyverse native approach. If the below chunk of code fails or times out, simply re-run it until all of the data have been downloaded.

It is worth pointing out here that these data are downloaded as cached files on the users computer by using the hoardr package. This means that if one runs the same command again, it will not re-download the data again because it first looks in the folder where it has automatically cached the data for you and sees that it may simply draw the data from there. This is convenient as it means that re-running this same code will load the data that have already been downloaded. No need to change anything or write a second script for loading data.

# Download all of the data with one nested request
# Speed results may vary based on internet strength
system.time(
OISST_data <- dl_years %>%
group_by(date_index) %>%
nest() %>%
mutate(dl_data = map(data, OISST_sub)) %>%
ungroup() %>%
select(-date_index, -data) %>%
unnest()
) # 560 seconds, ~112 seconds per batch

### The result

With the data downloaded and prepared for furhter use, all that’s left to do is save them.

# Save the data as an .Rda file
saveRDS(OISST_data, file = "~/Desktop/OISST_vignette.Rda")

Note above that I have chosen to save the file to my desktop. This is not normally where one (hopefully!) would save such a file. Rather one would be saving these data into the project folder out of which one is working.

And that is all there is to it. Download, prep, save. In the next vignette we will see how to detect MHWs in gridded data.