The pyrealm_build_data package
The pyrealm repository includes both the pyrealm package and the
pyrealm_build_data package. The pyrealm_build_data package contains datasets
that are used in the pyrealm build and testing process. This includes:
Example datasets that are used in the package documentation, such as simple spatial datasets for showing the use of the P Model.
“Golden” datasets for regression testing pyrealm implementations against the outputs of other implementations. These datasets will include a set of input data and then output predictions from other implementations.
Datasets for providing profiling of pyrealm code and for benchmarking new versions of the package code against earlier implementations to check for performance issues.
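As an illustration of the "golden" dataset idea, the sketch below compares the outputs of a new implementation against stored reference outputs within a tolerance. The `golden` record, the `new_implementation` function and the tolerance are all hypothetical stand-ins, not part of pyrealm or pyrealm_build_data.

```python
import math


def matches_golden(predicted, expected, rel_tol=1e-6):
    """Return True if every predicted value is close to its golden value."""
    if len(predicted) != len(expected):
        return False
    return all(
        math.isclose(p, e, rel_tol=rel_tol) for p, e in zip(predicted, expected)
    )


# Hypothetical golden record: inputs plus outputs from a reference implementation.
golden = {"inputs": [1.0, 2.0, 3.0], "outputs": [2.0, 4.0, 6.0]}


def new_implementation(values):
    """Stand-in for the code under test (assumption: it doubles each input)."""
    return [2.0 * v for v in values]


regression_ok = matches_golden(
    new_implementation(golden["inputs"]), golden["outputs"]
)
```

In a real test suite the golden record would be loaded from one of the files described below rather than defined inline.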
The package is organised into submodules that reflect the data use or previous implementation.
Note that pyrealm_build_data is a source distribution only (sdist) component of
pyrealm, so is not included in binary distributions (wheel) that are typically
installed by end users. This means that files in pyrealm_build_data are not
available if a user has simply used pip install pyrealm: please do not use
pyrealm_build_data within the main pyrealm code. See the
pyrealm.core.datasets module for methods to give users access to these datasets
in notebooks.
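Code that might run in either kind of installation can check whether the build data package is importable before trying to use it. The snippet below is a minimal sketch using only the standard library; the helper name `build_data_available` is an assumption for illustration and is not part of pyrealm.

```python
from importlib.util import find_spec


def build_data_available(package_name="pyrealm_build_data"):
    """Return True if the named top-level package can be imported."""
    return find_spec(package_name) is not None


if build_data_available():
    print("pyrealm_build_data is available in this environment")
else:
    print("pyrealm_build_data is not available in this environment")
```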
The bigleaf submodule
This submodule contains benchmark outputs from the bigleaf package in R,
which has been used as the basis for core hygrometry functions. The
bigleaf_conversions.R R script runs a set of test values through bigleaf. The
first part of the file prints out some simple test values that have been used in package
doctests and then the second part of the file generates more complex benchmarking inputs
that are saved, along with the bigleaf outputs, as bigleaf_test_values.json.
Running bigleaf_conversions.R requires an installation of R along with the
jsonlite and bigleaf packages, and the script can then be run from within the
submodule folder as:
Rscript bigleaf_conversions.R
The community submodule
The pyrealm_build_data.community submodule provides a set of input files for
the pyrealm.demography module that are used both in unit testing for the module
and as inputs for generating documentation of the module. The files provide definitions
of plant functional types and plant communities in a range of formats.
The rpmodel submodule
This submodule contains benchmark outputs from the rpmodel package in R,
which has been used as the basis for initial development of the standard P Model.
Test inputs
The generate_test_inputs.py file defines a set of constants for running P Model
calculations and then defines a set of scalar and array inputs for the forcing variables
required to run the P Model. The array inputs are sets of 100 values sampled randomly
across the ranges of plausible forcing value inputs in order to benchmark the
calculations of the P Model implementation. All of these values are stored in the
test_inputs.json file.
It requires python and the numpy package and can be run as:
python generate_test_inputs.py
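The general approach of sampling forcing values at random and storing them as JSON can be sketched as follows. This is not the content of generate_test_inputs.py: the variable names and ranges are hypothetical, and the sketch uses the standard library random module rather than numpy so that it runs without extra dependencies.

```python
import json
import random


def make_test_inputs(n=100, seed=42):
    """Sample n values per variable uniformly within an assumed range."""
    rng = random.Random(seed)  # fixed seed so the inputs are reproducible
    # Hypothetical variable names and ranges, for illustration only.
    ranges = {
        "tc": (0.0, 40.0),
        "patm": (60000.0, 105000.0),
        "co2": (280.0, 600.0),
        "vpd": (0.0, 4000.0),
    }
    return {
        name: [rng.uniform(low, high) for _ in range(n)]
        for name, (low, high) in ranges.items()
    }


test_inputs = make_test_inputs()
json_text = json.dumps(test_inputs, indent=2)
```

Fixing the seed means that the same inputs are regenerated each time, which keeps stored benchmark outputs comparable.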
Simple rpmodel benchmarking
The test_outputs_rpmodel.R file contains R code to run the test input data set and store
the expected predictions from the rpmodel package as test_outputs_rpmodel.json.
It requires an installation of R and the rpmodel package and can be run as:
Rscript test_outputs_rpmodel.R
Global array test
The remaining files in the submodule are intended to provide a global test dataset for
benchmarking the use of rpmodel on a global time series, using three-dimensional
arrays with latitude, longitude and time coordinates. The dataset is currently not used in
testing because of issues with the rpmodel package in version 1.2.0. It may also be
replaced in testing with the uk_data submodule, which is used as an example dataset
in the documentation.
The files are:
pmodel_global.nc: An input global NetCDF file containing forcing variables at 0.5° spatial resolution and for two time steps.
test_global_array.R: An R script to run rpmodel using the dataset.
rpmodel_global_gpp_do_ftkphio.nc: A NetCDF file containing rpmodel predictions using corrections for temperature effects on the kphio parameter.
rpmodel_global_gpp_no_ftkphio.nc: A NetCDF file containing rpmodel predictions with fixed kphio.
Regenerating the predicted outputs requires an R installation with the rpmodel
package:
Rscript test_global_array.R
The sandoval_kphio submodule
This submodule contains benchmark outputs from the calc_phi0.R script,
which implements an experimental approach to calculating the \(\phi_0\) parameter for
the P Model, with modulation from climatic aridity, growing degree days and the
current temperature. The calculation is implemented in pyrealm as
QuantumYieldSandoval.
The files are:
calc_phi0.R: The original implementation and parameterisation.
create_test_inputs.R: A script to run the original implementation with a range of inputs and save a file of test values.
sandoval_kphio.csv: The resulting test values.
The splash submodule
This module contains the code and inputs used to generate the SPLASH benchmark
datasets used in unit testing splash and in regression tests against the
original SPLASH v1 implementation.
Benchmark test data
The splash_make_flux_benchmark_inputs.py script is used to generate 100 random
locations around the globe with random dates, initial soil moisture, precipitation,
cloud fraction and temperature (within reasonable bounds). This provides a robust test
of the calculations of various fluxes across a wide range of plausible value
combinations. The input data is created using:
python splash_make_flux_benchmark_inputs.py -o data/daily_flux_benchmark_inputs.csv
The splash_run_calc_daily_fluxes.py script can then be used to run the inputs
through the original SPLASH implementation provided in the splash_py_version
directory.
python splash_run_calc_daily_fluxes.py \
-i data/daily_flux_benchmark_inputs.csv \
-o data/daily_flux_benchmark_outputs.csv
Original time series
The SPLASH v1.0 implementation provided a time series of inputs for a single location
around San Francisco in 2000, with precipitation and temperature taken from WFDEI and
sunshine fraction interpolated from CRU TS. The original source data is included as
data/splash_sf_example_data.csv.
The original SPLASH main.py provides a simple example to run this code and output
water balance, which can be used as a direct benchmark without any wrapper scripts. With
the alterations to make the SPLASH code importable, the command below can be used to run
the code and capture the output:
python -m splash_py_version.main > data/splash_sf_example_data_main_output.csv
Note that this command also generates main.log, which contains over 54K lines of
logging and takes up over 6 MB. This is not included in the pyrealm_build_data
package.
Because the splash_sf_example_data_main_output.csv file only contains predicted
water balance, the same input data is also run through a wrapper script to allow daily
calculations to be benchmarked in more detail. The first step is to use the
splash_sf_example_to_netcdf.py script to convert the CSV data into a properly
dimensioned NetCDF file:
python splash_sf_example_to_netcdf.py
This creates the file data/splash_sf_example_data.nc, which can be run through the original SPLASH components using the splash_run_time_series_parallel.py script:
python splash_run_time_series_parallel.py \
-i "data/splash_sf_example_data.nc" \
-o "data/splash_sf_example_data_details.nc"
Gridded time series
This is a 20 x 20 cell spatial grid covering 2 years of daily data that is used to
validate the spin up of the initial moisture and the calculation of SPLASH water balance
over a time series across a larger spatial extent. The dataset is generated using the
splash_make_spatial_grid_data.py script, which requires paths to local copies of the
WFDE5_v2 dataset and a version of the CRU TS dataset. Note that the file paths
below are examples and these data are not included in the pyrealm_build_data
package.
python splash_make_spatial_grid_data.py \
-w "/rds/general/project/lemontree/live/source/wfde5/wfde5_v2/" \
-c "/rds/general/project/lemontree/live/source/cru_ts/cru_ts_4.0.4/" \
-o "data/splash_nw_us_grid_data.nc"
The resulting splash_nw_us_grid_data.nc dataset can then be run through the
original SPLASH implementation using the script splash_run_time_series_parallel.py.
This uses parallel processing to run multiple cells simultaneously and will output the
progress of the calculations.
python splash_run_time_series_parallel.py \
-i "data/splash_nw_us_grid_data.nc" \
-o "data/splash_nw_us_grid_data_outputs.nc"
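The pattern of running grid cells in parallel while reporting progress can be sketched as below. This is a generic illustration and not the code in splash_run_time_series_parallel.py: the `run_cell` function is a hypothetical stand-in for the per-cell SPLASH calculation, and a thread pool is used here only to keep the example simple (the original script may use a different mechanism).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_cell(cell):
    """Stand-in for a per-cell time series calculation (hypothetical)."""
    row, col = cell
    return cell, row * 20 + col


def run_grid(n_rows=20, n_cols=20, max_workers=4):
    """Run every cell of the grid through run_cell, reporting progress."""
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_cell, cell) for cell in cells]
        # as_completed yields each future as soon as its cell has finished.
        for done, future in enumerate(as_completed(futures), start=1):
            cell, value = future.result()
            results[cell] = value
            if done % 100 == 0:
                print(f"{done}/{len(cells)} cells complete")
    return results


grid_results = run_grid()
```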
The subdaily submodule
At present, this submodule only contains a single file containing the predictions for
the BE_Vie fluxnet site from the original implementation of the subdaily module,
published in (Mengoli et al., 2022). Generating these predictions requires an
installation of R and then code from the following repository:
https://github.com/GiuliaMengoli/P-model_subDaily
TODO - This submodule should be updated to include the required code, along with the settings files and a runner script, to reproduce these predictions. Alternatively, the required code could be checked out as part of a shell script.
The t_model submodule
The t_model submodule provides reference data for testing the implementation of the
T model (Li et al., 2014). The file t_model.r contains the original implementation
in R. The rtmodel_test_outputs.r file contains a slightly modified version of the
function that makes it easier to output test values and then runs the function for the
following scenarios:
A 100 year sequence of plant growth for each of three plant functional type (PFT) definitions (default, alt_one and alt_two). The parameterisations for the three PFTs are in the file pft_definitions.csv and the resulting time series for each PFT is written to rtmodel_output_xxx.csv.
Single year predictions across a range of initial diameter at breast height values for each of the three PFTs. These are saved as rtmodel_unit_testing.csv and are used for simple validation of the main scaling functions.
Regenerating the predicted outputs requires an R installation:
Rscript rtmodel_test_outputs.r
The two_leaf submodule
This submodule contains benchmark outputs from an R implementation of the two leaf, two stream model.
TODO - this module is currently in development and the files here need to be cleaned up and documented once the implementation has been completed.
The uk_data submodule
This submodule provides P Model forcings for the United Kingdom at 0.5° spatial resolution and hourly temporal resolution over 2 months (1464 temporal observations). It is used for demonstrating the use of the subdaily P Model.
The Python script create_2D_uk_inputs.py is used to generate the NetCDF output file
UK_WFDE5_FAPAR_2018_JuneJuly.nc. The script is currently written with a hard-coded
set of paths to key source data - the WFDE5 v2 climate data and a separate source of
interpolated hourly fAPAR. This should probably be rewritten to generate reproducible
content from publicly available sources of these datasets.
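The count of 1464 temporal observations is consistent with hourly data covering the whole of June and July. The check below assumes, based only on the output file name, that the period runs from 1 June 2018 up to (but not including) 1 August 2018.

```python
from datetime import datetime, timedelta

# Assumed period, inferred from the file name UK_WFDE5_FAPAR_2018_JuneJuly.nc.
start = datetime(2018, 6, 1)
end = datetime(2018, 8, 1)

# 30 days in June + 31 days in July = 61 days of 24 hourly steps.
n_hours = int((end - start) / timedelta(hours=1))
print(n_hours)  # 1464
```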
The phenology submodule
This submodule provides regression test data for LAI phenology calculations.
The …_methods directories contain the golden datasets of annual fapar max and daily LAI from running a particular method. These are calculated using the data provided in the inputs directory.
The inputs/source directory contains three original data files:
DE_GRI_hh_fluxnet_simple.csv: This file is a subset of the original FluxNET dataset for the site (FLX_DE-Gri_FLUXNET2015_FULLSET_HH_2004-2014_1-4.csv). This original file contained the complete FluxNET data set for the ‘DE-Gri’ site at half hourly resolution, which includes 242 fields and is around 350 MB. The fluxnet_reducer.py script was used to remove fields not used in the calculations to reduce file size, creating the file DE_GRI_hh_fluxnet_simple.csv.
DE_gri_splash_cru_ts4.07_2000_2019.nc: This contains soil moisture data for the site, extracted from a global run of the pyrealm SPLASH model on the CRU TS 4.07 data set (daily inputs, 0.5° resolution). The script splash_extractor.py was used to extract data from the global outputs for the single cell containing the site coordinates.
DE-GRI_site_data.json: This contains required site data that is constant across all observations.
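Reducing a wide CSV file to the fields needed for a calculation can be sketched with the standard library as below. This is not the content of fluxnet_reducer.py: the `reduce_fields` helper, the column names and the sample rows are illustrative assumptions only.

```python
import csv
import io


def reduce_fields(source, keep):
    """Return CSV text containing only the columns named in keep."""
    reader = csv.DictReader(source)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # extrasaction="ignore" silently drops columns not listed in keep.
        writer.writerow(row)
    return out.getvalue()


# Hypothetical three-column input standing in for the full 242-field file.
raw = io.StringIO(
    "TIMESTAMP_START,TA_F,UNUSED_FIELD\n"
    "200401010000,1.5,99\n"
    "200401010030,1.4,98\n"
)
reduced = reduce_fields(raw, keep=["TIMESTAMP_START", "TA_F"])
```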
The create_inputs.py file then populates two directories of inputs for use in testing:
fortnightly: this contains fortnightly summary data from the half hourly inputs for use with the standard PModel.
subdaily: this contains outputs at the original 30 minute time scale for use with the subdaily PModel.
Each of those two directories contains identically structured files:
pmodel_inputs.csv: processed and cleaned data in the correct units for fitting a P Model along with precipitation data and an indication of which observations are in the growing season.
pmodel_outputs.csv: GPP, ca, chi and ci values from fitting P Models using the data. For the fortnightly data this is a standard P Model; for the subdaily inputs this is a Subdaily model incorporating a soil moisture penalty.
annual_inputs.csv: Annual summary data of the variables needed to calculate fapar max.
daily_assimilation.csv: GPP and then molar assimilation at the daily timescale, by interpolation for fortnightly data and by aggregation for subdaily data.
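The fortnightly inputs described above are summaries of half hourly data. One simple way to produce such summaries is to average values within consecutive 14 day windows, as in the sketch below. How create_inputs.py actually defines and summarises fortnights is not stated here, so the windowing rule, the `fortnightly_means` helper and the sample data are all assumptions.

```python
from datetime import datetime, timedelta


def fortnightly_means(times, values, start):
    """Average values within consecutive 14-day windows counted from start."""
    sums = {}
    counts = {}
    for t, v in zip(times, values):
        # Integer index of the 14-day window containing this observation.
        window = (t - start) // timedelta(days=14)
        sums[window] = sums.get(window, 0.0) + v
        counts[window] = counts.get(window, 0) + 1
    return [sums[w] / counts[w] for w in sorted(sums)]


# Hypothetical data: 28 days of half-hourly values (48 per day).
start = datetime(2004, 1, 1)
times = [start + timedelta(minutes=30 * i) for i in range(48 * 28)]
values = [1.0] * (48 * 14) + [3.0] * (48 * 14)
means = fortnightly_means(times, values, start)
```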