The `pyrealm_build_data` package

The pyrealm repository includes both the pyrealm package and the pyrealm_build_data package. The pyrealm_build_data package contains datasets that are used in the pyrealm build and testing process. This includes:

Example datasets that are used in the package documentation, such as simple spatial datasets for showing the use of the P Model.
“Golden” datasets for regression testing pyrealm implementations against the outputs of other implementations. These datasets will include a set of input data and then output predictions from other implementations.
Datasets for profiling pyrealm code and for benchmarking new versions of the package code against earlier implementations to check for performance issues.
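The idea behind a "golden" dataset regression test can be sketched as follows. This is a minimal illustration and not the pyrealm test suite: the function name compare_to_golden, the variable names gpp and chi, the values and the tolerance are all hypothetical.

```python
import json
import math


def compare_to_golden(predicted: dict, golden: dict, rel_tol: float = 1e-6) -> list:
    """Return the names of golden variables that the predictions fail to match."""
    failures = []
    for name, expected in golden.items():
        actual = predicted.get(name)
        # A missing variable or a length mismatch counts as a failure.
        if actual is None or len(actual) != len(expected):
            failures.append(name)
            continue
        # Compare element by element within a relative tolerance.
        if not all(
            math.isclose(a, e, rel_tol=rel_tol) for a, e in zip(actual, expected)
        ):
            failures.append(name)
    return failures


# Hypothetical golden outputs, as they might be loaded from a JSON file.
golden = json.loads('{"gpp": [1.0, 2.5, 3.25], "chi": [0.7, 0.72, 0.75]}')

# Hypothetical predictions from the implementation under test.
predicted = {"gpp": [1.0, 2.5, 3.25], "chi": [0.7, 0.72, 0.7500000001]}

print(compare_to_golden(predicted, golden))
```

An empty list means every stored variable was reproduced within the tolerance.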

The package is organised into submodules that reflect the data use or previous implementation.

Note that pyrealm_build_data is a source distribution (sdist) only component of pyrealm, so it is not included in the binary distributions (wheels) that end users typically install. Files in pyrealm_build_data are therefore not available if a user has simply run pip install pyrealm: please do not use pyrealm_build_data within the main pyrealm code. See the pyrealm.core.datasets module for methods to give users access to these datasets in notebooks.
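Code outside the main package, such as a test helper, may need to check whether the build data is importable before using it. A minimal sketch using only the standard library is shown below. The helper name build_data_available is hypothetical and is not part of pyrealm.

```python
from importlib.util import find_spec


def build_data_available(package: str = "pyrealm_build_data") -> bool:
    """Return True if the named top-level package can be found for import."""
    return find_spec(package) is not None


if build_data_available():
    print("pyrealm_build_data found: probably a source checkout or sdist install")
else:
    print("pyrealm_build_data not found: probably installed from a wheel")
```

A test module could use a check like this to skip tests that depend on the build data when it is not present.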

The `bigleaf` submodule

This submodule contains benchmark outputs from the bigleaf package in R, which has been used as the basis for core hygrometry functions. The bigleaf_conversions.R script runs a set of test values through bigleaf. The first part of the file prints out some simple test values that have been used in package doctests. The second part generates more complex benchmarking inputs that are saved, along with the bigleaf outputs, as bigleaf_test_values.json.

Running bigleaf_conversions.R requires an installation of R along with the jsonlite and bigleaf packages, and the script can then be run from within the submodule folder as:

Rscript bigleaf_conversions.R

The `community` submodule

The pyrealm_build_data.community submodule provides a set of input files for the pyrealm.demography module that are used both in unit testing for the module and as inputs for generating documentation of the module. The files provide definitions of plant functional types and plant communities in a range of formats.

The `rpmodel` submodule

This submodule contains benchmark outputs from the rpmodel package in R, which has been used as the basis for initial development of the standard P Model.

Test inputs

The generate_test_inputs.py file defines a set of constants for running P Model calculations and then defines a set of scalar and array inputs for the forcing variables required to run the P Model. The array inputs are a set of 100 values sampled randomly across the ranges of plausible forcing values in order to benchmark the calculations of the P Model implementation. All of these values are stored in the test_inputs.json file.

The script requires Python and the numpy package and can be run as:

python generate_test_inputs.py
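The sampling approach described above can be sketched as follows. This is not the content of generate_test_inputs.py: the variable names, the value ranges, the key suffixes and the seed are hypothetical. The sketch also uses the standard library random module instead of numpy so that it runs without extra dependencies.

```python
import json
import random

# Hypothetical forcing variables and plausible ranges, for illustration only.
RANGES = {
    "tc": (0.0, 40.0),
    "patm": (60000.0, 105000.0),
    "co2": (280.0, 600.0),
    "vpd": (0.0, 5000.0),
}


def make_test_inputs(n: int = 100, seed: int = 42) -> dict:
    """Build one scalar and one n-value array input for each forcing variable."""
    rng = random.Random(seed)  # A fixed seed keeps the inputs reproducible.
    inputs = {}
    for name, (low, high) in RANGES.items():
        inputs[f"{name}_sc"] = rng.uniform(low, high)
        inputs[f"{name}_ar"] = [rng.uniform(low, high) for _ in range(n)]
    return inputs


inputs = make_test_inputs()

# Serialise to JSON text; a real script would write this text to a file.
text = json.dumps(inputs, indent=2)
print(len(inputs["tc_ar"]))
```

Fixing the seed means that the same inputs are regenerated on every run, so stored outputs from another implementation stay comparable.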

Simple rpmodel benchmarking

The test_outputs_rpmodel.R script contains R code to run the test input dataset and store the expected predictions from the rpmodel package as test_outputs_rpmodel.json. It requires an installation of R and the rpmodel package and can be run as:

Rscript test_outputs_rpmodel.R

Global array test

The remaining files in the submodule are intended to provide a global test dataset for benchmarking the use of rpmodel on a global time series, using three-dimensional arrays with latitude, longitude and time coordinates. The dataset is currently not used in testing because of issues with the rpmodel package in version 1.2.0. It may also be replaced in testing with the uk_data submodule, which is used as an example dataset in the documentation.

The files are:

pmodel_global.nc: An input global NetCDF file containing forcing variables at 0.5° spatial resolution and for two time steps.
test_global_array.R: An R script to run rpmodel using the dataset.
rpmodel_global_gpp_do_ftkphio.nc: A NetCDF file containing rpmodel predictions using corrections for temperature effects on the kphio parameter.
rpmodel_global_gpp_no_ftkphio.nc: A NetCDF file containing rpmodel predictions with fixed kphio.

Regenerating the predicted outputs requires an R installation with the rpmodel package:

Rscript test_global_array.R

The `sandoval_kphio` submodule

This submodule contains benchmark outputs from the calc_phi0.R script, which is an experimental approach to calculating the \(\phi_0\) parameter for the P Model with modulation from climatic aridity, growing degree days and the current temperature. The calculation is implemented in pyrealm as QuantumYieldSandoval.

The files are:

calc_phi0.R: The original implementation and parameterisation.
create_test_inputs.R: A script to run the original implementation with a range of inputs and save a file of test values.
sandoval_kphio.csv: The resulting test values.

The `splash` submodule

This module contains the code and inputs used to generate the SPLASH benchmark datasets used in unit testing splash and in regression tests against the original SPLASH v1 implementation.

Benchmark test data

The splash_make_flux_benchmark_inputs.py script is used to generate 100 random locations around the globe with random dates, initial soil moisture, precipitation, cloud fraction and temperature (within reasonable bounds). This provides a robust test of the calculations of various fluxes across a wide range of plausible value combinations. The input data is created using:

python splash_make_flux_benchmark_inputs.py -o data/daily_flux_benchmark_inputs.csv

The splash_run_calc_daily_fluxes.py script can then be used to run the inputs through the original SPLASH implementation provided in the splash_py_version directory.

python splash_run_calc_daily_fluxes.py \
    -i data/daily_flux_benchmark_inputs.csv \
    -o data/daily_flux_benchmark_outputs.csv

Original time series

The SPLASH v1.0 implementation provided a time series of inputs for a single location around San Francisco in 2000, with precipitation and temperature taken from WFDEI and sunshine fraction interpolated from CRU TS. The original source data is included as data/splash_sf_example_data.csv.

The original SPLASH main.py provides a simple example to run this code and output water balance, which can be used as a direct benchmark without any wrapper scripts. With the alterations to make the SPLASH code importable, the command below can be used to run the code and capture the output:

python -m splash_py_version.main > data/splash_sf_example_data_main_output.csv

Note that this command also generates main.log, which contains over 54K lines of logging and takes up over 6 MB. This file is not included in the pyrealm_build_data package.

Because the splash_sf_example_data_main_output.csv file only contains predicted water balance, the same input data is also run through a wrapper script to allow daily calculations to be benchmarked in more detail. The first step is to use the splash_sf_example_to_netcdf.py script to convert the CSV data into a properly dimensioned NetCDF file:

python splash_sf_example_to_netcdf.py

This creates the file data/splash_sf_example_data.nc, which can then be run through the original SPLASH components using the splash_run_time_series_parallel.py script:

python splash_run_time_series_parallel.py \
    -i "data/splash_sf_example_data.nc" \
    -o "data/splash_sf_example_data_details.nc"

Gridded time series

This is a 20 x 20 cell spatial grid covering 2 years of daily data that is used to validate the spin up of the initial moisture and the calculation of SPLASH water balance over a time series across a larger spatial extent. The dataset is generated using the splash_make_spatial_grid_data.py script, which requires paths to local copies of the WFDE5_v2 dataset and a version of the CRU TS dataset. Note that the file paths below are examples and these data are not included in the pyrealm_build_data package.

python splash_make_spatial_grid_data.py \
    -w "/rds/general/project/lemontree/live/source/wfde5/wfde5_v2/" \
    -c "/rds/general/project/lemontree/live/source/cru_ts/cru_ts_4.0.4/" \
    -o "data/splash_nw_us_grid_data.nc"

The resulting splash_nw_us_grid_data.nc dataset can then be run through the original SPLASH implementation using the splash_run_time_series_parallel.py script. This uses parallel processing to run multiple cells simultaneously and will output the progress of the calculations.

python splash_run_time_series_parallel.py \
    -i "data/splash_nw_us_grid_data.nc" \
    -o "data/splash_nw_us_grid_data_outputs.nc"

The `subdaily` submodule

At present, this submodule contains only a single file, which holds the predictions for the BE_Vie fluxnet site from the original implementation of the subdaily module, published in Mengoli et al. (2022). Generating these predictions requires an installation of R and code from the following repository:

https://github.com/GiuliaMengoli/P-model_subDaily

TODO - This submodule should be updated to include the required code along with the settings files and a runner script to reproduce this code. Or possibly to checkout the required code as part of a shell script.

The `t_model` submodule

The t_model submodule provides reference data for testing the implementation of the T model (Li et al., 2014). The file t_model.r contains the original implementation in R. The rtmodel_test_outputs.r script contains a slightly modified version of the function that makes it easier to output test values, and then runs the function for the following scenarios:

A 100 year sequence of plant growth for each of three plant functional type (PFT) definitions (default, alt_one and alt_two). The parameterisations for the three PFTs are in the file pft_definitions.csv and the resulting time series for each PFT is written to rtmodel_output_xxx.csv.
Single year predictions across a range of initial diameter at breast height values for each of the three PFTs. These are saved as rtmodel_unit_testing.csv and are used for simple validation of the main scaling functions.

Regenerating the predicted outputs requires an R installation:

Rscript rtmodel_test_outputs.r

The `two_leaf` submodule

This submodule contains benchmark outputs from an R implementation of the two leaf, two stream model.

TODO - this module is currently in development and the files here need to be cleaned up and documented once the implementation has been completed.

The `uk_data` submodule

This submodule provides P Model forcings for the United Kingdom at 0.5° spatial resolution and hourly temporal resolution over 2 months (1464 temporal observations). It is used for demonstrating the use of the subdaily P Model.

The Python script create_2D_uk_inputs.py is used to generate the NetCDF output file UK_WFDE5_FAPAR_2018_JuneJuly.nc. The script is currently written with a hard-coded set of paths to key source data - the WFDE5 v2 climate data and a separate source of interpolated hourly fAPAR. This should probably be rewritten to generate reproducible content from publicly available sources of these datasets.
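The count of 1464 temporal observations can be checked with a short calculation. The sketch assumes that the two months are June and July 2018, as the output file name suggests, and that there is one observation for every hour in that period.

```python
from datetime import datetime, timedelta

# Assumed period: 1 June 2018 up to, but not including, 1 August 2018.
start = datetime(2018, 6, 1)
end = datetime(2018, 8, 1)

# 30 days in June plus 31 days in July gives 61 days of 24 hourly steps.
n_hours = int((end - start) / timedelta(hours=1))
print(n_hours)  # 1464
```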

The `phenology` submodule

This submodule provides regression test data for LAI phenology calculations.

The …_methods directories contain the golden datasets of annual fapar max and daily LAI from running a particular method. These are calculated using the data provided in the inputs directory.

The inputs/source directory contains three original data files:

DE_GRI_hh_fluxnet_simple.csv: This file is a subset of the original FluxNET dataset for the site (FLX_DE-Gri_FLUXNET2015_FULLSET_HH_2004-2014_1-4.csv). This original file contained the complete FluxNET data set for the ‘DE-Gri’ site at half hourly resolution, which includes 242 fields and is around 350 MB. The fluxnet_reducer.py script was used to remove fields not used in the calculations to reduce file size, creating the file DE_GRI_hh_fluxnet_simple.csv.
DE_gri_splash_cru_ts4.07_2000_2019.nc: This contains soil moisture data for the site, extracted from a global run of the pyrealm SPLASH model on the CRU TS 4.07 data set (daily inputs, 0.5° resolution). The script splash_extractor.py was used to extract data from the global outputs for the single cell containing the site coordinates.
DE-GRI_site_data.json: This contains required site data that is constant across all observations.

The create_inputs.py file then populates two directories of inputs for use in testing:

fortnightly: this contains fortnightly summary data from the half hourly inputs for use with the standard PModel.
subdaily: this contains outputs at the original 30 minute time scale for use with the subdaily PModel.

Each of those two directories contains identically structured files:

pmodel_inputs.csv: processed and cleaned data in the correct units for fitting a P Model, along with precipitation data and an indication of which observations are in the growing season.
pmodel_outputs.csv: GPP, ca, chi and ci values from fitting P Models using the data. For the fortnightly data this is a standard P Model, for the subdaily inputs this is a Subdaily model incorporating a soil moisture penalty.
annual_inputs.csv: Annual summary data of the variables needed to calculate fapar max.
daily_assimilation.csv: GPP and then molar assimilation at the daily timescale, by interpolation for fortnightly data and by aggregation for subdaily data.
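The aggregation of subdaily values to the daily timescale can be sketched as follows. This is not the code in create_inputs.py: the function name daily_mean and the example values are hypothetical, and the use of a daily mean (rather than, for example, a daily total) is an assumption made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def daily_mean(times, values):
    """Group observations by calendar date and return the mean for each day."""
    by_day = defaultdict(list)
    for time, value in zip(times, values):
        by_day[time.date()].append(value)
    # Sort by date so that the result is in chronological order.
    return {day: mean(vals) for day, vals in sorted(by_day.items())}


# Two days of hypothetical half-hourly values: 48 observations per day.
times = [datetime(2014, 1, 1) + timedelta(minutes=30 * i) for i in range(96)]
values = [1.0] * 48 + [3.0] * 48

daily = daily_mean(times, values)
for day, value in daily.items():
    print(day, value)
```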
