4 stayCALM

The R package stayCALM (Standardized Tool for Applying Your Consolidated Assessment and Listing Methodology) provides a robust framework for automating assessments as outlined in the NYSDEC CALM. The intent is to provide a tool that enforces a strict, standardized process but is flexible enough to quickly and accurately accommodate changes to the CALM.

4.1 R Programming Language

The R programming language was selected because it provides an excellent environment for developing and executing standardized computations and processes.

4.2 Process Overview

There are two methods for applying the logic defined in the CALM, both of which could be supported by the functions provided in stayCALM:

  1. The CALM process could be strictly followed by dividing the data at each node of the decision tree depicted in Figure 1 until each data point has been appropriately assessed. The various subsets created during this process would be appended together to create the final assessment product.
    • Pros: No unnecessary calculations are made.
    • Cons: The process creates many disparate datasets, which…
      • makes it difficult to comprehend how the data is being handled at each step;
      • leaves room for the introduction of errors, such as accidentally dropping a subset of data or applying the wrong process to a subset of data;
      • and means future modifications to the CALM may require an extensive restructuring of the flow through the decision tree (Figure 1).
  2. All intermediate steps from the CALM process could be applied globally to the data set, and the assessment classifications can subsequently be assigned (see the sketch following this list).
    • Pros: The process does not divide the data into multiple subsets, which…
      • makes it simple to follow the process;
      • reduces the chance of introducing errors;
      • and will require minimal reconfiguration if the CALM process is updated in the future.
    • Cons: Potentially slower than other methodologies because unnecessary calculations are made.
      • The time added by these unnecessary calculations is likely to be negligible for the data set sizes assessed by NYSDEC. Additionally, the potential time lag would likely be on the order of seconds or minutes; this is far faster than a manual effort to perform assessments for a large amount of data, which would likely take days, weeks, or months.
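
As a rough illustration of the second approach, the sketch below computes an exceedance flag for every row and assigns classifications only in a final pass. The assessment_id and min_req_met columns anticipate the preprocessing outline in Section 4.4; result_value, wqs_threshold, the 10% cutoff, and the classification labels are placeholders, not stayCALM's actual interface.

```r
library(dplyr)

# A minimal sketch of the "apply globally, classify last" approach.
# result_value, wqs_threshold, the 10% cutoff, and the labels are
# placeholders; assessment_id and min_req_met mirror the columns
# described in the preprocessing outline.
assess_globally <- function(df) {
  df %>%
    # Intermediate steps are computed for every row, even rows that
    # will ultimately be screened out for insufficient data.
    mutate(exceeds_wqs = result_value > wqs_threshold) %>%
    group_by(assessment_id) %>%
    mutate(pct_exceed = mean(exceeds_wqs, na.rm = TRUE) * 100) %>%
    ungroup() %>%
    # Classifications are assigned in a single pass at the end.
    mutate(assessment = case_when(
      !min_req_met    ~ "insufficient_data",
      pct_exceed > 10 ~ "impaired",
      TRUE            ~ "supporting"
    ))
}
```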

4.3 Expected Data Structure

Data should be stored in a long format, where each row represents a single parameter value collected during a sampling event, as in the hypothetical example below.
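
SH_PWL_ID and CHEMICAL_NAME echo the columns referenced in Section 4.4; EVENT_DATETIME, RESULT_VALUE, UNITS, and the values themselves are assumed names chosen only to show the shape of the data.

```r
library(tibble)

# Hypothetical long-format data: one row per parameter value per
# sampling event. Only SH_PWL_ID and CHEMICAL_NAME are taken from the
# preprocessing outline; the other column names are illustrative.
chem_df <- tribble(
  ~SH_PWL_ID,  ~EVENT_DATETIME,       ~CHEMICAL_NAME,     ~RESULT_VALUE, ~UNITS,
  "0101-0001", "2019-06-12 10:30:00", "DISSOLVED OXYGEN", 8.4,           "mg/L",
  "0101-0001", "2019-06-12 10:30:00", "PH",               7.2,           "SU",
  "0101-0001", "2019-06-12 10:30:00", "HARDNESS",         110,           "mg/L",
  "0101-0001", "2019-07-08 09:45:00", "DISSOLVED OXYGEN", 7.9,           "mg/L"
)
```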

4.4 Data Preprocessing

  • Data Preprocessing
    • Assessment Database
      • Water Quality Thresholds
        • tlkParam = a table that includes water quality standards associated with waterbody class and parameter.
        • tlkParam_75p_long = a table that includes 75th percentile thresholds associated with waterbody class and parameter.
        • These tables need to be expanded to include every combination of trout (T) and trout spawning (TS) designations.
          • NEED: build a function to prepare T and TS designations.
          • NEED: to export these tables and add them to the Assessment Database.
      • The following tables from the Assessment database need to be joined:
        • tblWIPWL
        • tblkCOUNTY (by = "COUNTY")
        • tlkDRAIN_BASIN (by = "SUBBASIN_CODE")
        • tlk305B (by = "TYPE")
        • tlkTYPE (by = "X305B_CODE")
        • tlkParam (as prepared above)
        • tlkParam_75p_long (as prepared above)
        • Alter: change the order in which these steps occur in the script. The tlkParam tables should be prepared first; the other Assessment database tables should then be imported, joined together, and joined with the tlkParam tables (see the join sketch following this outline).
    • Chemistry Data
      • correction_param_prep() = parameters used as correction factors (e.g., hardness, pH, and temperature) are extracted from the parameter-name and parameter-value columns and represented as unique columns. This prepares the data frame for the correction factors to be applied to the appropriate parameter values. The correction parameters, when available, will be joined with all parameters collected during the same sampling event, NOT just the parameters that require correction. This global join simplifies the preprocessing steps and reduces the chance of introducing errors (a sketch of this idea follows the outline).
        • ISSUE: pH is creating duplicates
    • Join the prepared Assessment tables with the prepared Chemistry tables
      • by = c("SEG_ID" = "SH_PWL_ID", "parameter" = "CHEMICAL_NAME")
    • date_standard_cols() = adds various date related columns to the data frame that will simplify the assessment process.
      • datetime
      • date
      • year
      • month
    • preprocess_validation() = ensures that the data validation column only includes the expected character strings and filters out rows that represent “Not Reported”.
      • Review: how does this handle NAs?
    • assessment_id() = uses dplyr's group_indices() to generate a unique assessment ID based on unique instances of the columns supplied to the ellipsis argument (...).
      • assessment_id = the name of the column added to the data frame representing the unique assessment ID.
    • assessment_period() = allows for the specification of the assessment period as a number of years back from the current date. For example, the default value is 10 years (n_years_ago = 10), and therefore the sampling period would include all samples collected within the 10 years preceding the present date.
      • assess_period = the name of the column added to the data frame, which will specify “within_assessment_period” or “outside_assessment_period”. These designations will dictate how the data is assessed via the CALM.
    • assessment_min_counts() = allows for the specification of the minimum number of samples required per designated group (e.g., assessment_id) to perform an assessment.
      • n_samples = a column representing the total number of samples per designated group.
      • min_n_samples = a column containing logical values indicating if there are enough samples available to perform an assessment. These designations will dictate how the data is assessed via the CALM.
    • assessment_min_years() = allows for the specification of the minimum number of years of data required to perform an assessment. This is different from assessment_period(): assessment_period() designates the extent of the time period considered, whereas assessment_min_years() requires that enough distinct sampling years fall within that period.
      • within_assess_period = a logical column indicating if the samples were collected within the designated sampling period (TRUE) or not (FALSE).
    • assessment_min_req() = wraps around assessment_min_counts() and assessment_min_years() to perform these calculations with a single call and provides a flag indicating if the minimum requirements from both functions were met (a hypothetical chained example follows this outline).
      • min_req_met = a column indicating if the minimum number of samples and minimum number of sampling years are met (min_n_samples == TRUE & min_n_years == TRUE). The values in this column will represent logical values (TRUE or FALSE).
      • Questioning: should assessment period be included in this call and update the min_req_met column?
        • No, these should remain separate
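
The join order described above might be sketched as follows, assuming each table has already been imported as a data frame. The by keys are taken verbatim from the outline; the tlkParam join keys (shown here as WATER_CLASS and parameter) are assumptions, since the outline only notes that the thresholds are associated with waterbody class and parameter.

```r
library(dplyr)

# Illustrative join chain for the Assessment database tables. The keys
# for the lookup tables come from the outline above; the tlkParam keys
# (WATER_CLASS, parameter) are assumptions.
assessment_df <- tblWIPWL %>%
  left_join(tblkCOUNTY,     by = "COUNTY") %>%
  left_join(tlkDRAIN_BASIN, by = "SUBBASIN_CODE") %>%
  left_join(tlk305B,        by = "TYPE") %>%
  left_join(tlkTYPE,        by = "X305B_CODE") %>%
  # The tlkParam tables, expanded for T and TS designations, are
  # prepared before this chain runs and joined last.
  left_join(tlkParam,          by = c("WATER_CLASS", "parameter")) %>%
  left_join(tlkParam_75p_long, by = c("WATER_CLASS", "parameter"))
```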
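The extract-and-rejoin idea behind correction_param_prep() could be sketched like this. This is not the package's implementation: event_id, parameter, and value are assumed column names, and the distinct() call is only one possible guard against the pH duplication issue flagged above.

```r
library(dplyr)
library(tidyr)

# Sketch of the correction-factor idea: pull hardness, pH, and
# temperature out of the long parameter/value columns, widen them to
# one column each, and join them back to EVERY row of the same
# sampling event. Column names are assumptions.
correction_param_sketch <- function(df) {
  corrections <- df %>%
    filter(parameter %in% c("hardness", "ph", "temperature")) %>%
    # Drop exact duplicate rows, one possible source of the pH
    # duplicates noted above.
    distinct(event_id, parameter, value) %>%
    pivot_wider(names_from = parameter, values_from = value)

  # Global join: correction columns attach to all parameters collected
  # during the event, not only those that require correction.
  left_join(df, corrections, by = "event_id")
}
```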
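Chained together, the preprocessing steps might read as follows. The function names and the n_years_ago = 10 default come from the outline above; the columns passed to assessment_id() and the lack of arguments to the remaining calls are assumptions about the function signatures.

```r
library(dplyr)
library(stayCALM)

# Hypothetical end-to-end preprocessing chain built from the sketches
# above. Argument values other than n_years_ago = 10 are assumptions.
prepped_df <- assessment_df %>%
  left_join(chem_df,
            by = c("SEG_ID" = "SH_PWL_ID",
                   "parameter" = "CHEMICAL_NAME")) %>%
  date_standard_cols() %>%                 # adds datetime, date, year, month
  preprocess_validation() %>%              # screens the validation column
  assessment_id(SEG_ID, parameter) %>%     # unique ID per segment/parameter
  assessment_period(n_years_ago = 10) %>%  # flags the assessment window
  assessment_min_req()                     # wraps min counts and min years
```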

4.5 Current Status

  • To Do
    • Create a general function for assessing both WQS and > 75% IQR of WQS
    • Create a classification function that uses case_when() to look at all prepped columns (a rough sketch follows).
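
A rough sketch of what that classification function might look like is given below. The assess_period and min_req_met columns echo the flags prepared in Section 4.4; the exceedance columns and the output labels are placeholders, not the final CALM logic.

```r
library(dplyr)

# Hypothetical shape of the planned classification function: a single
# case_when() pass over the columns prepared upstream. The exceedance
# columns and output labels are placeholders.
assessment_class <- function(df) {
  df %>%
    mutate(classification = case_when(
      assess_period == "outside_assessment_period" ~ "not_assessed",
      !min_req_met                                 ~ "insufficient_data",
      exceeds_wqs                                  ~ "impaired",
      exceeds_75p_iqr                              ~ "screening_concern",
      TRUE                                         ~ "supporting"
    ))
}
```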