---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
language_info:
  codemirror_mode:
    name: ipython
    version: 3
  file_extension: .py
  mimetype: text/x-python
  name: python
  nbconvert_exporter: python
  pygments_lexer: ipython3
  version: 3.11.9
---

# Missing data in the subdaily model

The worked example from this page {download}`can be downloaded as a Jupyter notebook here `.

The acclimation process in the subdaily model uses a weighted mean of the daily
acclimation conditions with preceding daily conditions. When there are missing data,
this results in problems in the estimation of the realised daily values for $\xi$,
$V_{cmax25}$ and $J_{max25}$.

The key problem arises when there are missing values in the observations within the
acclimation window set for the subdaily model. The daily optimal behaviour for the
plant is calculated using the average values of forcing variables within the window,
and any missing value within the window leads to the average value also being missing.
This propagates through the weighted mean process: all days following the missing data
will also be missing.

The subdaily model provides two options to help deal with missing data.

1. The acclimation process can be set to **allow partial acclimation data** in the
   calculation of daily average values by ignoring missing data and taking the average
   of the available observations (`allow_partial_data`).

   However, allowing partial data cannot solve the issue where the forcing data is
   missing throughout the acclimation window for a day or where the P Model
   calculations are undefined throughout that window on a day. Both of these cases give
   a missing value in the time series of daily optimal values, which would then be
   propagated through the calculation of realised values using weighted averages.

1. Hence the second option is to allow the iterated calculation of daily realised
   values to **hold over the last valid value** until the next valid daily optimal
   value is found (`allow_holdover`). Note that this cannot fill missing values at the
   start of a time series.

The code below gives a concrete example: a five day time series of half-hourly
observations, with a five hour acclimation window centred on noon. Data are missing at
the start of the first day, for part of the acclimation window on the third day and
throughout the window on the fourth day.

```{code-cell} ipython3
from copy import copy

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, Patch
import matplotlib.dates as mdates

from pyrealm.pmodel.acclimation import AcclimationModel

# A five day time series of half-hourly observations
datetimes = np.arange(
    np.datetime64("2012-05-01 00:00"),
    np.datetime64("2012-05-05 23:30"),
    np.timedelta64(30, "m"),
)

# Example data with missing values
scale_trig = (60 * 24) / (2 * np.pi)
amplitude = 5
trend = 5
complete_data = np.round(
    (-amplitude * np.cos((datetimes - datetimes[0]).astype(np.float64) / scale_trig))
    + amplitude,
    2,
) + np.linspace(0, trend, len(datetimes))

data = complete_data.copy()
data[datetimes <= np.datetime64("2012-05-01 15:30")] = np.nan
data[
    np.logical_and(
        datetimes >= np.datetime64("2012-05-03 11:00"),
        datetimes <= np.datetime64("2012-05-03 13:00"),
    )
] = np.nan
data[
    np.logical_and(
        datetimes >= np.datetime64("2012-05-04 09:30"),
        datetimes <= np.datetime64("2012-05-04 15:30"),
    )
] = np.nan

# Create a first default acclimation model
acclim_model = AcclimationModel(datetimes)
window_center = np.timedelta64(12, "h")
half_width = np.timedelta64(150, "m")
acclim_model.set_window(window_center=window_center, half_width=half_width)
```

The {meth}`~pyrealm.pmodel.acclimation.AcclimationModel.get_window_values` method
extracts the values within the acclimation window for each day.
With the half-hourly data and the window set above, these are the observations between
09:30 and 14:30. This method is typically used internally and not directly by users,
but it shows the problem of the missing data clearly:

* The first day has no data within the acclimation window because the data are missing
  until 15:30.
* The third day has missing observations from 11:00 to 13:00, so only part of the
  acclimation window has data.
* The fourth day has no data within the acclimation window.
* The second and fifth days have complete data within the acclimation window.

```{code-cell} ipython3
fig, ax = plt.subplots()

# Add the acclimation windows for each day
acclim_windows = [
    Rectangle(
        (win_center - half_width, 0),
        2 * half_width,
        2 * amplitude + trend,
        facecolor="salmon",
        alpha=0.3,
    )
    for win_center in acclim_model.sample_datetimes_mean
]
[ax.add_patch(copy(p)) for p in acclim_windows]

ax.plot(datetimes, complete_data, color="grey", linewidth=0.5, linestyle="dashed")
ax.plot(datetimes, data, marker=".")

# Format date axis
myFmt = mdates.DateFormatter("%m/%d\n%H:%M")
_ = ax.xaxis.set_major_formatter(myFmt)
```

```{code-cell} ipython3
acclim_model.get_window_values(data).round(1)
```

The daily average conditions are calculated using the
{meth}`~pyrealm.pmodel.acclimation.AcclimationModel.get_daily_means` method. If partial
data are not allowed, which is the default, the daily average conditions for all days
with missing data are also missing (`np.nan`).

```{code-cell} ipython3
partial_not_allowed = acclim_model.get_daily_means(data)
partial_not_allowed.round(2)
```

Using an acclimation model that sets `allow_partial_data=True` allows the daily average
conditions to be calculated from the partial available information. This does not solve
the problem for days with no data in the acclimation window: these still give a missing
value and also generate a warning.

```{code-cell} ipython3
# Create an acclimation model that allows partial data
acclim_model_partial = AcclimationModel(datetimes, allow_partial_data=True)
acclim_model_partial.set_window(window_center=window_center, half_width=half_width)

partial_allowed = acclim_model_partial.get_daily_means(data)
partial_allowed.round(2)
```

The {meth}`~pyrealm.pmodel.acclimation.AcclimationModel.apply_acclimation` method is
used to calculate realised acclimated values of a variable from the optimal values. By
default, this method *will raise an error* when missing data are present:

```{code-cell} ipython3
:tags: [raises-exception]

try:
    acclim_model_partial.apply_acclimation(partial_not_allowed)
except ValueError as excep:
    print("Error message: ", excep)
```

Using an acclimation model set with `allow_holdover=True` allows the method to be run.
If the input to `apply_acclimation` was calculated *without* allowing partial data,
then the gaps on days 3 and 4 are filled by holding over the value from day 2.

```{code-cell} ipython3
acclim_model_partial_and_holdover = AcclimationModel(datetimes, allow_holdover=True)
acclim_model_partial_and_holdover.set_window(
    window_center=window_center, half_width=half_width
)
acclim_model_partial_and_holdover.apply_acclimation(partial_not_allowed).round(3)
```

When partial data are allowed, `allow_holdover` uses the value estimated from partial
data on day 3 to fill the completely missing data on day 4.

```{code-cell} ipython3
acclim_model_partial_and_holdover.apply_acclimation(partial_allowed).round(3)
```

These options do not fix all problems, such as the gap on day 1 that cannot be filled
by holding over earlier values. The best way forward depends partly on the source of
the missing data and how common it is, as discussed below.

## Sources of missing data

There are three ways that missing data can occur:

### Simple data gaps

Your data might simply be incomplete and have missing data through the time series.
Note that the main problems only arise if you have missing data **during the
acclimation window**, because this prevents the calculation of the realised values of
$\xi$, $V_{cmax25}$ and $J_{max25}$ at the subdaily time scale. Missing values at other
points in the daily cycle simply lead to missing predictions at individual
observations.

You can fix this problem in a few ways:

* With sparse missing data, you may be able to simply use the `allow_partial_data`
  option to ignore the missing data when calculating daily means.

* However, if any day has *no* valid data during the acclimation window for a variable,
  then the partial data calculation will still result in missing data in the daily
  average values. The optimal behaviour for the day cannot be calculated and hence the
  realised values cannot be calculated. In this case, the `allow_holdover` option may
  help resolve the problem.

* It may also be easier to interpolate your missing data, possibly using methods from
  the {mod}`scipy.interpolate` module, and avoid having to use these options.

### Incomplete start and end days

It is common for observation data to start or end part way through a day, often as a
result of converting UTC times to local time. If the observations on the first and last
day only *partly cover the acclimation window*, then there is effectively missing
acclimation data.

* The `allow_holdover` option will skip over the first day: you will not get
  predictions until the following day, but then the rest of the calculations will
  continue as normal.

* The `allow_partial_data` option will allow the estimation of optimal values on the
  first and last days and extend the predictions to the start of the data.

* You could also simply truncate the data to complete days or extrapolate the data to
  fill in the missing start and/or end observations.

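
The interpolation and truncation suggestions in the two sections above can be sketched
with plain NumPy. This is a minimal illustration on made-up data, not part of `pyrealm`:
the series, the gap position and all variable names are assumptions, and `np.interp` is
used here as a simple stand-in for the richer methods in {mod}`scipy.interpolate`.
Linear interpolation is only likely to be sensible for short gaps.

```python
import numpy as np

# Hypothetical half-hourly series that starts and ends at noon, with a short
# gap around noon on the second day (all values here are made up).
step = np.timedelta64(30, "m")
datetimes = np.arange(
    np.datetime64("2012-05-01 12:00"), np.datetime64("2012-05-04 12:00"), step
)
values = np.linspace(0.0, 10.0, len(datetimes))
values[47:50] = np.nan

# Fill interior gaps by linear interpolation against elapsed time in minutes.
elapsed = (datetimes - datetimes[0]).astype(np.float64)
missing = np.isnan(values)
filled = values.copy()
filled[missing] = np.interp(elapsed[missing], elapsed[~missing], values[~missing])

# Truncate to complete days: keep only dates with a full set of observations.
obs_per_day = int(np.timedelta64(1, "D") / step)
dates = datetimes.astype("datetime64[D]")
_, inverse, counts = np.unique(dates, return_inverse=True, return_counts=True)
keep = counts[inverse] == obs_per_day
datetimes_complete, filled_complete = datetimes[keep], filled[keep]
```

The truncated arrays cover only the two complete days in the middle of the series, so
every remaining day has a full acclimation window.
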
### Undefined behaviour in the P Model

Two critical variables in the P Model ($V_{cmax}$ and $J_{max}$) are not estimable
under some environmental conditions (see
{class}`~pyrealm.pmodel.optimal_chi.OptimalChiPrentice14` for the details). As a
result, even with complete forcing data, the P Model can generate undefined values as
part of the calculation of optimal $\chi$. Under these conditions, missing values arise
in the daily model of optimal behaviour and the `allow_holdover=True` option is required
to generate predictions.
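
To illustrate why a single undefined daily optimal value matters, the sketch below
applies a simple weighted mean with and without holding over the last valid value. It
is an illustration of the idea only and not the `pyrealm` implementation: the function
`realised_values`, its `alpha` weight and the example numbers are all assumptions. Note
also that this sketch lets missing values propagate when hold-over is off, whereas
`apply_acclimation` raises an error in that case.

```python
import numpy as np


def realised_values(optimal, alpha=0.5, allow_holdover=False):
    """Weighted mean of each daily optimal value with the previous realised value.

    Hypothetical helper: `alpha` is an assumed weight for the current day.
    """
    optimal = np.asarray(optimal, dtype=float)
    realised = np.empty_like(optimal)
    realised[0] = optimal[0]

    for day in range(1, len(optimal)):
        previous, current = realised[day - 1], optimal[day]
        if allow_holdover and np.isnan(current):
            # No valid optimal value today: hold over the last realised value.
            realised[day] = previous
        elif allow_holdover and np.isnan(previous):
            # Nothing valid to hold over yet: start from today's optimal value.
            realised[day] = current
        else:
            # NaN in either term propagates through the weighted mean.
            realised[day] = (1 - alpha) * previous + alpha * current

    return realised


# One undefined daily optimal value in the middle of the series
optimal = np.array([2.0, 4.0, np.nan, 6.0])
strict = realised_values(optimal)  # missing from day 3 onwards
held = realised_values(optimal, allow_holdover=True)  # day 3 repeats day 2
```

With hold-over, a missing value at the very start of the series still stays missing,
because there is no earlier value to carry forward.
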