4. Data Summary and Processing

4.1. Common Tools

4.1.1. NCOs

The NetCDF Operators (NCOs) are a suite of command-line tools that can efficiently manipulate well-formed NetCDF datasets (e.g. compute statistics, concatenate files, edit metadata, compress data) and write results to output files or to your console. To our knowledge, NCOs are among the most efficient tools for manipulating large NetCDF files. You'll find a full description of the dozen or so operators in the NCO user guide (https://nco.sourceforge.net/#RTFM). If you have a question about using an operator that is not addressed in the user guide, you can post it on the NCO help forum: https://sourceforge.net/p/nco/discussion/9830.

Additional examples of file manipulation using nco can be found here (http://research.jisao.washington.edu/data_sets/nco/).

4.1.2. Unidata

Unidata also provides several very useful command-line utilities for manipulating NetCDF files, notably ncdump and nccopy; more information is available here:

https://docs.unidata.ucar.edu/nug/current/netcdf_utilities_guide.html

Unidata also maintains a list of NetCDF compatible software:

https://www.unidata.ucar.edu/software/netcdf/software.html

4.1.3. Python netCDF4 package

This is the de facto standard Python interface to NetCDF files. There are some higher-level wrappers, but at the end of the day everything you want to do can be done with this package.

https://unidata.github.io/netcdf4-python/

4.1.4. Python xarray package

xarray aims to bring a pandas-like interface to NetCDF files (and earth science data in general). Pandas has become a standard in the Python data community, but it was originally developed for users in the financial space, and so has some awkward aspects for earth science data. xarray aims to address these.

https://xarray.dev

4.1.5. R

There are several client libraries for handling NetCDF in R, including:

ncdf4 (https://cran.r-project.org/package=ncdf4)

RNetCDF (https://cran.r-project.org/package=RNetCDF)

4.2. Common Operations

Collected here are a few examples of common data manipulation operations. The examples here are far from exhaustive. If you encounter or develop a useful solution that you'd like to see here, please send us a Pull Request!

4.2.1. Explore structure of NetCDF file

ncdump

This command will list the dimensions, variables, and metadata of a file. The -h flag specifies that only the header should be displayed.

$ ncdump -h input.nc
ncdump

This command will list the dimensions, variables, and metadata of a file AND the values of the coordinate variables.

$ ncdump -c input.nc

4.2.2. Compress NetCDF file

nccopy

Compress NetCDF files using nccopy. This copies and compresses a NetCDF file without loss of data.

$ nccopy -u -d1 input.nc output.nc

The -u flag converts any unlimited dimension to a fixed-size dimension, and the -d1 flag sets the deflate (compression) level to 1.

4.2.3. Subset netcdf files by dimensions

Example: we want to subset the last 10 years of an annual historical simulation that is 115 years long.

ncks

The -O flag will overwrite any existing output file. The -h flag prevents this command from being appended to the global history attribute of the output file, which otherwise documents the file's creation.

$ ncks -O -h -d time,104,114,1 input.nc output.nc

Note

Caution: As in Python, indexing in NCO starts at zero, so the index of the 115th time step is actually 114.

python netCDF4
>>> import netCDF4 as nc
>>> ds = nc.Dataset('GPP_yearly_tr.nc')
>>> last_10yrs = ds.variables['GPP'][-10:,:,:]
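The negative-index slice above follows standard numpy semantics. As a minimal self-contained sketch (using a hypothetical numpy array with made-up dimensions in place of the netCDF variable):

```python
import numpy as np

# Hypothetical stand-in for a 115-year (time, y, x) GPP variable.
gpp = np.arange(115 * 2 * 2).reshape(115, 2, 2)

# The last 10 time steps, i.e. zero-based indices 105..114 --
# the same records selected by: ncks -d time,104,114,1
last_10yrs = gpp[-10:, :, :]

print(last_10yrs.shape)     # (10, 2, 2)
print(last_10yrs[0, 0, 0])  # 420, the first value of year index 105
```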

4.2.4. Display variable values to terminal

Example: we want to display the annual active layer depth (ALD) values for upper left corner pixel of a regional run.

ncks
$ ncks -d x,0 -d y,0 -v ALD input.nc

If you are not sure about the names of the dimensions and variables, you can always display the file's structure using ncdump as described above.

python netCDF4
>>> import netCDF4 as nc
>>> ds = nc.Dataset('ALD_yearly_tr.nc')
>>> print(ds.variables['ALD'][:,0,0])

4.2.5. Compute sum, average and standard deviation across dimensions

Example: Model simulation produces monthly GPP time series partitioned by plant functional types and compartments. We now want to compute GPP at the community level by summing across plant functional types (dimension named pft) and compartments (dimensions named pftpart).

ncwa
$ ncwa -O -h -v GPP -a pftpart,pft -y ttl input.nc output.nc

This command will produce sums of GPP across the two dimensions listed after the -a flag. The variable to be summed is specified after the -v flag. Finally, the -y flag indicates the type of operation to perform.

python netCDF4
>>> import netCDF4 as nc
>>> ds = nc.Dataset('GPP_yearly_tr.nc')
>>> a = ds.variables['GPP'][:].sum(axis=1).sum(axis=1)
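The two chained single-axis sums above collapse the pftpart and pft dimensions one after the other; they are equivalent to a single sum over both axes at once. A quick numpy check, assuming a hypothetical GPP array dimensioned (time, pftpart, pft, y, x):

```python
import numpy as np

# Hypothetical GPP array: (time, pftpart, pft, y, x).
rng = np.random.default_rng(0)
gpp = rng.random((5, 2, 4, 3, 3))

# Chained single-axis sums, as in the netCDF4 example above. After the
# first sum removes axis 1 (pftpart), the old axis 2 (pft) becomes axis 1.
a = gpp.sum(axis=1).sum(axis=1)

# Equivalent single sum over both axes.
b = gpp.sum(axis=(1, 2))

print(np.allclose(a, b))  # True
print(a.shape)            # (5, 3, 3)
```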

Computations can also be done across a subset of the data. For instance, the following commands compute the mean temperature for the months of June, July, and August of every year. To do so, you will need to make the time dimension unlimited (a record dimension), as is done here with the ncks operator.

ncks, ncra
$ ncks -O -h --mk_rec_dmn time input.nc input1.nc
$ ncra --mro -O -d time,5,,12,3 -y avg -v tair input1.nc output.nc

The -d flag indicates which dimension the computation should be done across. The indices following the dimension name indicate how to subset and group the dataset: the first index indicates where to start the operation (i.e. the month of June of the first year, index 5); the second indicates where to end it (leaving it empty means the operation runs across the entire time series); the third index indicates the stride between groups (i.e. 12-month chunks for yearly computations); and the fourth index indicates the number of time steps to operate on within each group (i.e. 3 months, June through August). The -v flag indicates which variable to use for the operation, and the -y flag indicates the type of operation to conduct. The --mro option instructs ncra to output its results for each sub-group (in this case, each year).
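The start/stride/duration logic of the ncra command can be sketched in numpy: reshape the monthly series into (year, month) chunks of 12 and average over the zero-based month indices 5 through 7. The temperature series below is a hypothetical stand-in with made-up values.

```python
import numpy as np

# Hypothetical 10-year monthly air temperature series (time dimension only).
n_years = 10
tair = np.arange(n_years * 12, dtype=float)  # 120 monthly values

# Group into 12-month chunks, then average June-August (zero-based month
# indices 5, 6, 7), mirroring: ncra --mro -d time,5,,12,3 -y avg
jja_mean = tair.reshape(n_years, 12)[:, 5:8].mean(axis=1)

print(jja_mean.shape)  # (10,) -- one value per year, as with --mro
print(jja_mean[0])     # 6.0, the mean of months 5, 6, and 7
```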

4.2.6. Append files of same dimensions

dvmdostem output variables are stored in single files. To append multiple variables from the same simulation in a single file, you can use the following command.

ncks
$ ncks -A -h file1.nc file2.nc

The -A flag indicates that the output file (file2.nc in this case) should be appended to rather than overwritten. Caution: the files need to have the exact same structure (the dimensions shared between files should have the same length, name, and attributes). The data in file1.nc will be appended to file2.nc. This command processes two files at a time.

4.2.7. Operations with multiple variables

Example: model simulations produced the annual thicknesses of the fibric and humic horizons of the organic layer (named SHLWDZ and DEEPDZ, respectively), and you want to compute the total organic layer thickness (OLDZ).

ncks, ncap2
$ ncks -A -h SHLWDZ.nc DEEPDZ.nc
$ ncap2 -O -h -s 'OLDZ = DEEPDZ + SHLWDZ' DEEPDZ.nc OLDZ.nc

The first command appends the two variables into a single file. The second command invokes the arithmetic processor, which accepts short scripts to create new variables. In this case, we create the variable OLDZ as the sum of the two existing variables DEEPDZ and SHLWDZ.

4.2.8. Concatenate files along the record dimension

Whole model simulations consist of a succession of runs, i.e. pre-run, equilibrium, spin-up, transient (i.e. historical), and scenario. For analysis purposes, you may want to concatenate the historical and scenario runs into a single file. To do so, you will need to make the time dimension unlimited, so additional records can be appended to it, before you can do the concatenation.

ncks, ncrcat
$ ncks -O -h --mk_rec_dmn time input1.nc output1.nc
$ ncks -O -h --mk_rec_dmn time input2.nc output2.nc
$ ncrcat -O -h output1.nc output2.nc output.nc
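Conceptually, ncrcat stitches the two series end to end along the record (time) dimension. A minimal numpy sketch of that operation, using hypothetical historical and scenario arrays whose non-record dimensions match:

```python
import numpy as np

# Hypothetical historical (115 yr) and scenario (85 yr) series sharing
# all non-record dimensions; concatenation runs along axis 0 (time).
historical = np.zeros((115, 2, 2))
scenario = np.ones((85, 2, 2))

combined = np.concatenate([historical, scenario], axis=0)

print(combined.shape)  # (200, 2, 2)
```

As with ncrcat, this only works because the non-record dimensions (here the two trailing axes) are identical in both inputs.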