The pyrealm_build_data module
The pyrealm repository includes both the pyrealm package and the
pyrealm_build_data package. The pyrealm_build_data package contains datasets
that are used in the pyrealm build and testing process. This includes:
Example datasets that are used in the package documentation, such as simple spatial datasets for showing the use of the P Model.
“Golden” datasets for regression testing pyrealm implementations against the outputs of other implementations. These datasets include a set of input data and the output predictions from other implementations.
Datasets for profiling pyrealm code and for benchmarking new versions of the package code against earlier implementations to check for performance issues.
The package is organised into submodules that reflect the data use or previous implementation.
Note that pyrealm_build_data is a source distribution only (sdist) component of
pyrealm, so is not included in binary distributions (wheel) that are typically
installed by end users. This means that files in pyrealm_build_data are not
available if a user has simply used pip install pyrealm: please do not use
pyrealm_build_data within the main pyrealm code.
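As a minimal sketch of what this means in practice, test or documentation code can check whether the package is importable before relying on it. The helper name package_available is hypothetical and is not part of pyrealm; only the standard library importlib.util.find_spec function is used.

```python
from importlib.util import find_spec


def package_available(name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return find_spec(name) is not None


# pyrealm_build_data is expected in source checkouts and sdist installs only,
# so anything that needs it can check first rather than failing on import.
if package_available("pyrealm_build_data"):
    print("pyrealm_build_data found: build and test datasets are available")
else:
    print("pyrealm_build_data not found: probably a wheel install of pyrealm")
```

A check like this belongs in test or documentation tooling, not in the main pyrealm code, which should never depend on pyrealm_build_data.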
The bigleaf submodule
This submodule contains benchmark outputs from the bigleaf package in R,
which has been used as the basis for core hygrometry functions. The
bigleaf_conversions.R R script runs a set of test values through bigleaf. The
first part of the file prints out some simple test values that have been used in package
doctests and then the second part of the file generates more complex benchmarking inputs
that are saved, along with bigleaf outputs as bigleaf_test_values.json.
Running bigleaf_conversions.R requires an installation of R along with the
jsonlite and bigleaf packages, and the script can then be run from within the
submodule folder as:
Rscript bigleaf_conversions.R
The community submodule
The pyrealm_build_data.community submodule provides a set of input files for
the pyrealm.demography module that are used both in unit testing for the module
and as inputs for generating documentation of the module. The files provide definitions
of plant functional types and plant communities in a range of formats.
The rpmodel submodule
This submodule contains benchmark outputs from the rpmodel package in R,
which has been used as the basis for initial development of the standard P Model.
Test inputs
The generate_test_inputs.py file defines a set of constants for running P Model
calculations and then defines a set of scalar and array inputs for the forcing variables
required to run the P Model. The array inputs are sets of 100 values sampled randomly
across the ranges of plausible forcing value inputs in order to benchmark the
calculations of the P Model implementation. All of these values are stored in the
test_inputs.json file.
The script requires Python and the numpy package and can be run as:
python generate_test_inputs.py
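The sampling approach can be sketched roughly as follows. This is an illustration, not the contents of generate_test_inputs.py: the variable names and ranges are assumptions, and the standard library random module stands in for numpy so that the example is self-contained.

```python
import json
import random

# Hypothetical forcing ranges, for illustration only. The real ranges and
# variable names are defined in generate_test_inputs.py.
RANGES = {
    "tc": (0.0, 30.0),  # temperature, °C
    "patm": (60000.0, 106000.0),  # atmospheric pressure, Pa
}


def sample_inputs(n: int = 100, seed: int = 0) -> dict[str, list[float]]:
    """Draw n uniform random values for each variable from its range."""
    rng = random.Random(seed)  # fixed seed keeps the inputs reproducible
    return {
        name: [rng.uniform(low, high) for _ in range(n)]
        for name, (low, high) in RANGES.items()
    }


inputs = sample_inputs()
# Serialise in the same spirit as test_inputs.json.
json_text = json.dumps(inputs, indent=2)
```

Fixing the seed matters here: benchmark outputs are only comparable if the inputs can be regenerated exactly.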
Simple rpmodel benchmarking
The test_outputs_rpmodel.R script contains R code to run the test input data set and store
the expected predictions from the rpmodel package as test_outputs_rpmodel.json.
It requires an installation of R and the rpmodel package and can be run as:
Rscript test_outputs_rpmodel.R
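A regression test against such a benchmark file might look something like the sketch below. The function compare_to_benchmark, the variable names gpp and chi, and the assumed JSON layout (one list of values per variable) are all hypothetical; the actual structure of test_outputs_rpmodel.json is not described here.

```python
import json
import math


def compare_to_benchmark(predicted, expected, rel_tol=1e-6):
    """Return the names of variables whose values differ from the benchmark."""
    failures = []
    for name, exp_values in expected.items():
        pred_values = predicted.get(name)
        # A missing variable or a length mismatch counts as a failure.
        if pred_values is None or len(pred_values) != len(exp_values):
            failures.append(name)
            continue
        if not all(
            math.isclose(p, e, rel_tol=rel_tol)
            for p, e in zip(pred_values, exp_values)
        ):
            failures.append(name)
    return failures


# Stand-in for the parsed contents of a benchmark JSON file.
expected = json.loads('{"gpp": [1.0, 2.0], "chi": [0.7, 0.8]}')
# Stand-in for predictions from the implementation under test.
predicted = {"gpp": [1.0, 2.0000001], "chi": [0.7, 0.9]}

mismatches = compare_to_benchmark(predicted, expected)
print(mismatches)
```

A relative tolerance is used because two implementations in different languages rarely agree to the last floating point digit.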
Global array test
The remaining files in the submodule are intended to provide a global test dataset for
benchmarking the use of rpmodel on a global time-series, so using 3 dimensional
arrays with latitude, longitude and time coordinates. It is currently not used in
testing because of issues with the rpmodel package in version 1.2.0. It may also be
replaced in testing with the uk_data submodule, which is used as an example dataset
in the documentation.
The files are:
pmodel_global.nc: An input global NetCDF file containing forcing variables at 0.5° spatial resolution and for two time steps.
test_global_array.R: An R script to run rpmodel using the dataset.
rpmodel_global_gpp_do_ftkphio.nc: A NetCDF file containing rpmodel predictions using corrections for temperature effects on the kphio parameter.
rpmodel_global_gpp_no_ftkphio.nc: A NetCDF file containing rpmodel predictions with fixed kphio.
Regenerating the predicted outputs requires an R installation with the rpmodel
package:
Rscript test_global_array.R
The sandoval_kphio submodule
This submodule contains benchmark outputs from the calc_phi0.R script, which is
an experimental approach to calculating the \(\phi_0\) parameter for the P Model
with modulation from climatic aridity and growing degree days and the current
temperature. The calculation is implemented in pyrealm as
QuantumYieldSandoval.
The files are:
calc_phi0.R: The original implementation and parameterisation.
create_test_inputs.R: A script to run the original implementation with a range of inputs and save a file of test values.
sandoval_kphio.csv: The resulting test values.
The splash submodule
This submodule contains the code and inputs used to generate the SPLASH benchmark
datasets used in unit testing splash and in regression tests against the
original SPLASH v1 implementation.
Benchmark test data
The splash_make_flux_benchmark_inputs.py script is used to generate 100 random
locations around the globe with random dates, initial soil moisture, precipitation,
cloud fraction and temperature (within reasonable bounds). This provides a robust test
of the calculations of various fluxes across a wide range of plausible value
combinations. The input data is created using:
python splash_make_flux_benchmark_inputs.py -o data/daily_flux_benchmark_inputs.csv
The splash_run_calc_daily_fluxes.py script can then be used to run the inputs
through the original SPLASH implementation provided in the splash_py_version
directory.
python splash_run_calc_daily_fluxes.py \
-i data/daily_flux_benchmark_inputs.csv \
-o data/daily_flux_benchmark_outputs.csv
Original time series
The SPLASH v1.0 implementation provided a time series of inputs for a single location
around San Francisco in 2000, with precipitation and temperature taken from WFDEI and
sunshine fraction interpolated from CRU TS. The original source data is included as
data/splash_sf_example_data.csv.
The original SPLASH main.py provides a simple example to run this code and output
water balance, which can be used as a direct benchmark without any wrapper scripts. With
the alterations to make the SPLASH code importable, the command below can be used to run
the code and capture the output:
python -m splash_py_version.main > data/splash_sf_example_data_main_output.csv
Note that this command also generates main.log, which contains over 54K lines of
logging and takes up over 6 MB. This is not included in the pyrealm_build_data
package.
Because the splash_sf_example_data_main_output.csv file only contains predicted
water balance, the same input data is also run through a wrapper script to allow daily
calculations to be benchmarked in more detail. The first step is to use the
splash_sf_example_to_netcdf.py script to convert the CSV data into a properly
dimensioned NetCDF file:
python splash_sf_example_to_netcdf.py
This creates the file data/splash_sf_example_data.nc, which can then be run through the original SPLASH components using the splash_run_time_series_parallel.py script:
python splash_run_time_series_parallel.py \
-i "data/splash_sf_example_data.nc" \
-o "data/splash_sf_example_data_details.nc"
Gridded time series
This is a 20 x 20 cell spatial grid covering 2 years of daily data that is used to
validate the spin up of the initial moisture and the calculation of SPLASH water balance
over a time series across a larger spatial extent. The dataset is generated using the
splash_make_spatial_grid_data.py script, which requires paths to local copies of the
WFDE5_v2 dataset and a version of the CRU TS dataset. Note that the file paths
below are examples and these data are not included in the pyrealm_build_data
package.
python splash_make_spatial_grid_data.py \
-w "/rds/general/project/lemontree/live/source/wfde5/wfde5_v2/" \
-c "/rds/general/project/lemontree/live/source/cru_ts/cru_ts_4.0.4/" \
-o "data/splash_nw_us_grid_data.nc"
The resulting splash_nw_us_grid_data.nc dataset can then be run through the
original SPLASH implementation using the splash_run_time_series_parallel.py script.
This uses parallel processing to run multiple cells simultaneously and will output the
progress of the calculations.
python splash_run_time_series_parallel.py \
-i "data/splash_nw_us_grid_data.nc" \
-o "data/splash_nw_us_grid_data_outputs.nc"
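The general pattern of running grid cells in parallel while reporting progress can be sketched as below. This is not the code in splash_run_time_series_parallel.py: run_cell is a hypothetical placeholder for the per-cell SPLASH calculation, and a thread pool is used only to keep the example self-contained, since the parallel mechanism used by the real script is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_cell(cell: tuple[int, int]) -> float:
    """Placeholder for the per-cell water balance calculation."""
    row, col = cell
    return float(row * 20 + col)


def run_grid(n_rows: int = 20, n_cols: int = 20, workers: int = 4) -> dict:
    """Run every cell of the grid in a worker pool and collect the results."""
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its cell so results can be stored by index.
        futures = {pool.submit(run_cell, cell): cell for cell in cells}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if done % 100 == 0:
                print(f"{done}/{len(cells)} cells complete")
    return results


results = run_grid()
```

Cells are independent of one another in this sketch, which is what makes the per-cell split straightforward; results arrive in completion order, so they are keyed by cell rather than appended to a list.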
The subdaily submodule
At present, this submodule only contains a single file containing the predictions for
the BE_Vie fluxnet site from the original implementation of the subdaily module,
published in Mengoli et al. (2022). Generating these predictions requires an
installation of R and then code from the following repository:
https://github.com/GiuliaMengoli/P-model_subDaily
TODO - This submodule should be updated to include the required code along with the settings files and a runner script to reproduce this code. Or possibly to checkout the required code as part of a shell script.
The t_model submodule
The t_model submodule provides reference data for testing the implementation of the
T model (Li et al., 2014). The file t_model.r contains the original implementation
in R. The rtmodel_test_outputs.r script contains a slightly modified version of the
function that makes it easier to output test values and then runs the function for the
following scenarios:
A 100 year sequence of plant growth for each of three plant functional type (PFT) definitions (default, alt_one and alt_two). The parameterisations for the three PFTs are in the file pft_definitions.csv and the resulting time series for each PFT is written to rtmodel_output_xxx.csv.
Single year predictions across a range of initial diameter at breast height values for each of the three PFTs. These are saved as rtmodel_unit_testing.csv and are used for simple validation of the main scaling functions.
Regenerating the predicted outputs requires an R installation:
Rscript rtmodel_test_outputs.r
The two_leaf submodule
This submodule contains benchmark outputs from an R implementation of the two leaf, two stream model.
TODO - this module is currently in development and the files here need to be cleaned up and documented once the implementation has been completed.
The uk_data submodule
This submodule provides P Model forcings for the United Kingdom at 0.5° spatial resolution and hourly temporal resolution over 2 months (1464 temporal observations). It is used for demonstrating the use of the subdaily P Model.
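The count of 1464 hourly observations can be checked with a short calculation, assuming the two months are June and July 2018, as the output file name UK_WFDE5_FAPAR_2018_JuneJuly.nc suggests:

```python
from datetime import datetime, timedelta

# Assumed period: 1 June 2018 up to, but not including, 1 August 2018.
start = datetime(2018, 6, 1)
end = datetime(2018, 8, 1)

# 30 days in June + 31 days in July = 61 days of 24 hourly steps each.
n_hours = int((end - start) / timedelta(hours=1))
print(n_hours)
```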
The Python script create_2D_uk_inputs.py is used to generate the NetCDF output file
UK_WFDE5_FAPAR_2018_JuneJuly.nc. The script is currently written with a hard-coded
set of paths to key source data - the WFDE5 v2 climate data and a separate source of
interpolated hourly fAPAR. This should probably be rewritten to generate reproducible
content from publicly available sources of these datasets.
The phenology submodule
This submodule provides regression test data from the initial implementation of LAI phenology calculations, provided by Boya Zhou. The input data consists of three files:
DE_GRI_hh_fluxnet_simple.csv: This file is a subset of the original FluxNET dataset for the site (FLX_DE-Gri_FLUXNET2015_FULLSET_HH_2004-2014_1-4.csv). This original file contained the complete FluxNET data set for the ‘DE-Gri’ site at half hourly resolution, which includes 242 fields and is around 350 MB. The fluxnet_reducer.py script was used to remove fields not used in the calculations to reduce file size, creating the file DE_GRI_hh_fluxnet_simple.csv.
DE_gri_splash_cru_ts4.07_2000_2019.nc: This contains soil moisture data for the site, extracted from a global run of the pyrealm SPLASH model on the CRU TS 4.07 data set (daily inputs, 0.5° resolution). The script splash_extractor.py was used to extract data from the global outputs for the single cell containing the site coordinates.
DE-GRI_site_data.json: This contains required site data that is constant across all observations.
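The kind of field reduction performed on the FluxNET file can be illustrated with the standard library csv module. This is only a sketch of the idea and not the contents of fluxnet_reducer.py: the function reduce_fields is hypothetical, and the column names and values in the tiny example are invented for illustration rather than taken from the fields the real script keeps.

```python
import csv
import io


def reduce_fields(source, keep, dest):
    """Copy CSV rows from source to dest, keeping only the named columns."""
    reader = csv.DictReader(source)
    # extrasaction="ignore" silently drops any column not listed in keep.
    writer = csv.DictWriter(dest, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Tiny in-memory stand-in for the full 242-field half hourly file.
raw = io.StringIO(
    "TIMESTAMP_START,TA_F,VPD_F,UNUSED\n"
    "200401010000,1.5,0.2,9\n"
)
out = io.StringIO()
reduce_fields(raw, ["TIMESTAMP_START", "TA_F", "VPD_F"], out)
print(out.getvalue())
```

Streaming row by row keeps memory use low, which matters for an input of around 350 MB.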
The script file python_implementation.py contains a pure Python reimplementation of
Boya Zhou’s original workflow, put together by David Orme and Boya Zhou to bring all of
the calculations into Python using agreed inputs to create a repeatable regression test
dataset.
The script creates outputs from fitting two P Models:
A subdaily P Model at 30 minute resolution including daily soil moisture penalties. The outputs of this model are stored in the subdaily_example directory and include three output files to allow regression testing at three time scales:
half_hourly_data.csv: The predictions from the P Model of GPP at the half hourly scale, along with optimal chi and ci values. The file also includes the forcing data used to fit the model and the precipitation data.
daily_outputs.csv: Daily total GPP along with soil moisture stress factors and resulting penalised daily GPP, growing season definition and resulting time series in LAI and lagged LAI.
annual_outputs.csv: Annual values used in calculations including total annual assimilation, precipitation, number of growing days, mean carbon chi and VPD within the growing season and then annual values for maximum FAPAR, LAI and the m parameter. These values are then used to calculate the daily LAI predictions.
A standard P Model calculated using fortnightly averages, excepting precipitation, which is calculated from the sum of half hourly FluxNET precipitation converted from mm to moles. The outputs from this model are stored in the fortnightly_example directory and include:
fortnightly_data.csv: The predictions from the P Model of GPP at the fortnightly scale, along with optimal chi and ci values. The file also includes the forcing data used to fit the model and the precipitation data.
annual_outputs.csv: Annual values used in calculations including total annual assimilation, precipitation, number of growing days, mean carbon chi and VPD within the growing season and then annual values for maximum FAPAR, LAI and the m parameter. These values are then used to calculate the daily LAI predictions.
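The conversion of precipitation from mm to moles mentioned above can be sketched as follows. The constants are assumptions, not values taken from python_implementation.py: 1 mm of water over 1 m² is taken as 1 litre, or roughly 1000 g, and the molar mass of water is taken as 18.015 g/mol. The function name is hypothetical.

```python
WATER_MOLAR_MASS_G_MOL = 18.015  # assumed molar mass of water, g/mol
WATER_G_PER_MM_M2 = 1000.0  # assumed: 1 mm over 1 m2 is about 1000 g of water


def precip_mm_to_mol_m2(precip_mm: float) -> float:
    """Convert a precipitation depth in mm to moles of water per square metre."""
    return precip_mm * WATER_G_PER_MM_M2 / WATER_MOLAR_MASS_G_MOL


# Hypothetical half hourly precipitation values in mm, summed before converting.
half_hourly_mm = [0.0, 0.2, 0.5, 0.1]
total_mol_m2 = precip_mm_to_mol_m2(sum(half_hourly_mm))
print(round(total_mol_m2, 2))
```

Under these assumptions, 1 mm of precipitation corresponds to roughly 55.5 mol of water per square metre.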