2  WQMA Oracle Data Warehouse

2.1 Connecting to the WQMA Data Warehouse

If you have not already done so, please set up an ODBC connection to the WQMA data warehouse by following the instructions in the ODBC section.

2.1.1 Install nexus

The nexus package is available on GitHub. To install packages from GitHub, first install the pak package.

Use pak to install the nexus package from GitHub. The code below specifies the organization (“BWAM”) and the repository (“nexus”). If this fails, try the next code chunk.

pak::pak("BWAM/nexus")

If pak fails to install the package, try using devtools and gitcreds. The difficulty with this option is that you must set up a GitHub personal access token (PAT). To set up a PAT, see Managing Git(Hub) Credentials.

devtools::install_github(
  repo = "BWAM/nexus",
  auth_token = gitcreds::gitcreds_get()$password)

2.1.2 Get Connected

To work with the data warehouse, you must first connect to it. This is very similar to specifying a file directory containing multiple CSV files you want to import into R. nexus provides the function get_connected() to simplify the process of connecting to the data warehouse. At a minimum, you must provide your data warehouse username. You will be prompted to enter your password in a pop-up window. You should never supply your password in plain text because it increases the chances you accidentally share your password with others. For example, it is very easy to accidentally share your password when committing and pushing other changes to GitHub.

con <- nexus::get_connected(username = "ZSMITH")
When you are finished querying the data warehouse, please use
`DBI::dbDisconnect(conn = insert-your-connection-name-here)`.
This message is displayed once per session.
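It is good practice to close the connection explicitly when you are done, as the start-up message suggests. A minimal sketch using DBI:

```r
# Confirm the connection is still valid
DBI::dbIsValid(con)

# ...run your queries...

# Close the connection when you are finished
DBI::dbDisconnect(conn = con)
```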

2.1.3 Understanding the Data Warehouse Structure

A primary goal of the WQMA Data Warehouse and nexus is to allow users to focus on the task at hand and NOT have to spend much energy cleaning data or learning how tables relate to one another. This section outlines a number of functions that can be used to rapidly learn about the data warehouse and its contents.

2.1.3.1 Tables and Columns and Definitions, Oh My!

The DBI function, dbListTables(), can be used to get a vector of tables in a database. For the WQMA data warehouse, we also need to specify the name of the schema we want to explore.

DBI::dbListTables(con, schema_name = "WQMA_OWNER")
 [1] "TAXONOMY_REFERENCE_JUNCTION" "PARAMETER"                  
 [3] "SITE"                        "TAXONOMIC_TRAIT"            
 [5] "SAMPLE_DELIVERY_GROUP"       "BASIN"                      
 [7] "RESULT"                      "ACCESSIONS_TABLES"          
 [9] "ACCESSIONS_DATA_MAP"         "PREVIOUS_TAXON_NAME"        
[11] "QUALITY_CONTROL"             "PROJECT"                    
[13] "PREVIOUS_SITE_ID"            "ACCESSIONS_FLD_XFM"         
[15] "EVENT"                       "SAMPLE"                     
[17] "WATERBODY"                   "ACCESSIONS_CDE_MAP"         
[19] "TAXONOMY_REFERENCE"          "ACCESSIONS"                 
[21] "RESULT_QUALIFIER"            "TAXONOMIC_ABUNDANCE"        
[23] "PARAMETER_NAME"              "TAXONOMY"                   
[25] "ACCESSIONS_KEY_MAP"         

However, it is recommended that you use the nexus get_data_dictionary() function because it provides much more information than dbListTables(). Setting the argument just_tables = TRUE will return a tibble (i.e., a fancy data frame) of all tables in the data warehouse and their definitions.

(dictionary <- nexus::get_data_dictionary(just_tables = TRUE))
# A tibble: 19 × 2
   table_name                  definition                                       
   <chr>                       <chr>                                            
 1 BASIN                       This table represents the 17 major drainage basi…
 2 EVENT                       A table representing all distinct sampling event…
 3 PARAMETER                   This table represents the distinct parameters pr…
 4 PARAMETER_NAME              This table provides descriptions of the paramete…
 5 PREVIOUS_SITE_ID            This table represents site IDs that were previou…
 6 PREVIOUS_TAXON_NAME         This table represents taxonomic names that were …
 7 PROJECT                     This table represents the monitoring projects th…
 8 QUALITY_CONTROL             This table contains quality control results asso…
 9 RESULT                      This table represents numeric and/or categorical…
10 RESULT_QUALIFIER            The table provides a detailed description of res…
11 SAMPLE                      This table represents distinct samples collected…
12 SAMPLE_DELIVERY_GROUP       A table containing all valid Sample Delivery Gro…
13 SITE                        This table represents the distinct monitoring lo…
14 TAXONOMIC_ABUNDANCE         This table contains taxonomic abundances— the co…
15 TAXONOMIC_TRAIT             This table represents taxonomic traits, such as …
16 TAXONOMY                    This table represents the tax's observed in New …
17 TAXONOMY_REFERENCE          This table identifies taxonomic reference materi…
18 TAXONOMY_REFERENCE_JUNCTION This is a junction table to facilitate the relat…
19 WATERBODY                   This table represents waterbodies in New York St…

Running get_data_dictionary() with all the default arguments will return all table and column definitions in the data warehouse.

(dictionary <- nexus::get_data_dictionary())
# A tibble: 218 × 11
   table_name column_name    examples   definition  db_data_type character_limit
   <chr>      <chr>          <list>     <chr>       <chr>                  <dbl>
 1 BASIN      BASIN          <chr [17]> A unique 2… NUMBER                    NA
 2 BASIN      BASIN_NAME     <chr [17]> The names … VARCHAR2(50)              50
 3 BASIN      <NA>           <NULL>     This table… <NA>                      NA
 4 EVENT      EVENT_DATETIME <chr [4]>  The date a… DATETIME                  NA
 5 EVENT      EVENT_ID       <chr [10]> A unique s… VARCHAR2(50)              50
 6 EVENT      PROJECT        <chr [10]> A monitori… VARCHAR2(10…             100
 7 EVENT      QAPP_ID        <chr [5]>  The Qualit… VARCHAR2(50)              50
 8 EVENT      SITE_ID        <dbl [10]> A unique s… NUMBER                    NA
 9 EVENT      <NA>           <NULL>     A table re… <NA>                      NA
10 PARAMETER  CASRN          <chr [10]> A CAS Regi… VARCHAR2(50)              50
# ℹ 208 more rows
# ℹ 5 more variables: not_null <lgl>, surrogate_key <lgl>, primary_key <chr>,
#   foreign_key <chr>, parent_table <chr>

The dictionary contains an examples field that provides examples of the type of data contained in each column. If there are fewer than 20 unique values in a column, all possible values are shown. For example, you can see all of the possible values for WATERBODY_TYPE using the following code:

dictionary |> 
  dplyr::filter(table_name %in% "WATERBODY",
                column_name %in% "WATERBODY_TYPE") |> 
  dplyr::pull(examples)
[[1]]
[1] "lake"         "river_stream" "other"        "outfall"     

You can supply a character vector of table names to the argument table_vec to subset the tibble to only the table or tables of interest. The example below only shows the definitions for the BASIN and WATERBODY tables.

(dictionary <- nexus::get_data_dictionary(table_vec = c("BASIN",
                                                        "WATERBODY")))
# A tibble: 12 × 11
   table_name column_name       examples definition db_data_type character_limit
   <chr>      <chr>             <list>   <chr>      <chr>                  <dbl>
 1 BASIN      BASIN             <chr>    A unique … NUMBER                    NA
 2 BASIN      BASIN_NAME        <chr>    The names… VARCHAR2(50)              50
 3 BASIN      <NA>              <NULL>   This tabl… <NA>                      NA
 4 WATERBODY  WATERBODY_CODE    <chr>    A unique … VARCHAR2(50)              50
 5 WATERBODY  BEACH_PRESENT     <chr>    Identifie… VARCHAR2(15)              15
 6 WATERBODY  FISHERIES_INDEX_… <chr>    This is a… VARCHAR2(75)              75
 7 WATERBODY  PUBLIC_WATER_SUP… <chr>    Identifie… VARCHAR2(15)              15
 8 WATERBODY  USGS_POND_NUMBER  <chr>    A unique … VARCHAR2(15)              15
 9 WATERBODY  WATERBODY_ALTERN… <chr>    Some wate… VARCHAR2(10…             100
10 WATERBODY  WATERBODY_NAME    <chr>    The name … VARCHAR2(10…             100
11 WATERBODY  WATERBODY_TYPE    <chr>    Inidicate… VARCHAR2(50)              50
12 WATERBODY  <NA>              <NULL>   This tabl… <NA>                      NA
# ℹ 5 more variables: not_null <lgl>, surrogate_key <lgl>, primary_key <chr>,
#   foreign_key <chr>, parent_table <chr>

Similar to table_vec, you can supply a character vector of column names to the argument column_vec to subset the tibble to only the column or columns of interest. This is useful for:

  1. Getting the definition of a column(s) of interest.

  2. Quickly identifying which table a column comes from.

The example below shows the definitions for two columns of interest from two different tables.

(dictionary <- nexus::get_data_dictionary(
  column_vec = c("BASIN_NAME",
                 "WATERBODY_NAME")
))
# A tibble: 2 × 11
  table_name column_name    examples   definition   db_data_type character_limit
  <chr>      <chr>          <list>     <chr>        <chr>                  <dbl>
1 BASIN      BASIN_NAME     <chr [17]> The names o… VARCHAR2(50)              50
2 WATERBODY  WATERBODY_NAME <chr [10]> The name of… VARCHAR2(10…             100
# ℹ 5 more variables: not_null <lgl>, surrogate_key <lgl>, primary_key <chr>,
#   foreign_key <chr>, parent_table <chr>

2.1.4 Get Data

2.1.4.1 Single Table Queries

To query data from a single table you can use functions from the dplyr and DBI packages. Below I have established a remote connection to the BASIN table in the data warehouse. I have not yet pulled the data into R. You may think of this as a preview of the data that you will pull into R. As you can see, you need to provide a connection to a database, the schema, and the table name.

(basin_tbl <- dplyr::tbl(con,
                        I("WQMA_OWNER.BASIN")))
# Source:   table<WQMA_OWNER.BASIN> [?? x 6]
# Database: Oracle 19.00.0000[ZSMITH@WQMAP/]
   BASIN BASIN_NAME  CREATE_DATE         END_DATE UPDATE_DATE
   <chr> <chr>       <dttm>              <dttm>   <dttm>     
 1 01    Lake Erie-… 2024-03-26 13:32:45 NA       NA         
 2 02    Allegheny … 2024-03-26 13:32:45 NA       NA         
 3 03    Lake Ontar… 2024-03-26 13:32:45 NA       NA         
 4 04    Genesee Ri… 2024-03-26 13:32:45 NA       NA         
 5 05    Chemung Ri… 2024-03-26 13:32:45 NA       NA         
 6 06    Susquehann… 2024-03-26 13:32:45 NA       NA         
 7 07    Seneca-One… 2024-03-26 13:32:45 NA       NA         
 8 08    Black Rive… 2024-03-26 13:32:45 NA       NA         
 9 09    St. Lawren… 2024-03-26 13:32:45 NA       NA         
10 10    Lake Champ… 2024-03-26 13:32:45 NA       NA         
# ℹ more rows
# ℹ 1 more variable: UPDATED_BY_GUID <chr>

To bring the data into R, you need to use the dplyr function, collect().

(basin_df <- dplyr::collect(basin_tbl))
# A tibble: 17 × 6
   BASIN BASIN_NAME  CREATE_DATE         END_DATE UPDATE_DATE
   <chr> <chr>       <dttm>              <dttm>   <dttm>     
 1 01    Lake Erie-… 2024-03-26 13:32:45 NA       NA         
 2 02    Allegheny … 2024-03-26 13:32:45 NA       NA         
 3 03    Lake Ontar… 2024-03-26 13:32:45 NA       NA         
 4 04    Genesee Ri… 2024-03-26 13:32:45 NA       NA         
 5 05    Chemung Ri… 2024-03-26 13:32:45 NA       NA         
 6 06    Susquehann… 2024-03-26 13:32:45 NA       NA         
 7 07    Seneca-One… 2024-03-26 13:32:45 NA       NA         
 8 08    Black Rive… 2024-03-26 13:32:45 NA       NA         
 9 09    St. Lawren… 2024-03-26 13:32:45 NA       NA         
10 10    Lake Champ… 2024-03-26 13:32:45 NA       NA         
11 11    Upper Huds… 2024-03-26 13:32:45 NA       NA         
12 12    Mohawk Riv… 2024-03-26 13:32:45 NA       NA         
13 13    Lower Huds… 2024-03-26 13:32:45 NA       NA         
14 14    Delaware R… 2024-03-26 13:32:45 NA       NA         
15 15    Passaic-Ne… 2024-03-26 13:32:45 NA       NA         
16 17    Atlantic O… 2024-03-26 13:32:45 NA       NA         
17 16    Housatonic… 2024-03-26 13:32:45 NA       NA         
# ℹ 1 more variable: UPDATED_BY_GUID <chr>

You can use dplyr functions to work with the data in the database. dplyr will translate these functions to SQL for you.

This example:

  1. Uses the basin_tbl connection established above.

  2. Filters the BASIN column to the rows that represent the major drainage basins “03” and “07”.

  3. Retains only the BASIN_NAME column.

  4. Collects the data from the database into R.

# 1
basin_tbl |> 
# 2
  dplyr::filter(BASIN %in% c("03", "07")) |> 
# 3
  dplyr::select(BASIN_NAME) |> 
# 4
  dplyr::collect()
# A tibble: 2 × 1
  BASIN_NAME                      
  <chr>                           
1 Lake Ontario & Minor Tribs.     
2 Seneca-Oneida-Oswego River Basin
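Because dplyr translates these verbs to SQL, you can inspect the query that will be sent before collecting anything. dplyr::show_query() prints the generated SQL without executing the collection step, which is a handy sanity check on a new pipeline:

```r
# Preview the SQL that dplyr generates; nothing is pulled into R
basin_tbl |>
  dplyr::filter(BASIN %in% c("03", "07")) |>
  dplyr::select(BASIN_NAME) |>
  dplyr::show_query()
```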

This example:

  1. Connects to the EVENT table.

  2. Filters to the rows where the EVENT_DATETIME column falls between July 1 and July 7, 2022.

  3. Collects the data from the database into R.

# 1
dplyr::tbl(con,
           I("WQMA_OWNER.EVENT")) |> 
# 2
  dplyr::filter(
    dplyr::between(
      x = EVENT_DATETIME,
      left = to_date("2022-07-01", "YYYY-MM-DD HH24:MI:SS"),
      right = to_date("2022-07-07", "YYYY-MM-DD HH24:MI:SS")
    )
  ) |> 
# 3
  dplyr::collect()
# A tibble: 215 × 9
   EVENT_ID      EVENT_DATETIME      SITE_ID PROJECT QAPP_ID CREATE_DATE        
   <chr>         <dttm>                <dbl> <chr>   <chr>   <dttm>             
 1 13-FDRY-0.4_… 2022-07-06 09:15:00   27578 Undefi… Unknown 2024-03-26 14:26:14
 2 13-STNY-0.8_… 2022-07-06 15:00:00   28025 Undefi… Unknown 2024-03-26 14:26:14
 3 13-STNY-0.8_… 2022-07-05 15:00:00   28025 Undefi… Unknown 2024-03-26 14:26:14
 4 13-GUNK-40.3… 2022-07-05 13:15:00   27611 PWS pr… Unknown 2024-03-26 14:26:11
 5 13-WKLEI_T4-… 2022-07-05 15:30:00   28182 PWS pr… Unknown 2024-03-26 14:26:11
 6 14-COUL-0.9_… 2022-07-05 10:00:00   28349 PWS pr… Unknown 2024-03-26 14:26:11
 7 07-SXML-2.2_… 2022-07-06 11:45:00   25790 PWS pr… Unknown 2024-03-26 14:26:11
 8 07-SXML-8.0_… 2022-07-06 10:30:00   25791 PWS pr… Unknown 2024-03-26 14:26:11
 9 07-SXML-12.4… 2022-07-06 09:15:00   25789 PWS pr… Unknown 2024-03-26 14:26:11
10 13-CRUM-1.5_… 2022-07-06 11:00:00   27545 PWS pr… Unknown 2024-03-26 14:26:11
# ℹ 205 more rows
# ℹ 3 more variables: END_DATE <dttm>, UPDATE_DATE <dttm>,
#   UPDATED_BY_GUID <chr>

2.1.4.2 Multiple Table Queries (Or Operating on the Data Model)

In most situations, it is recommended that you acquire data using the get_data_model() function from nexus. This function establishes a remote connection to each table in the data warehouse and defines the relationships between the tables.

(data_model <- nexus::get_data_model(con = con))
── Table source ────────────────────────────────────────────────────────────────
src:  Oracle 19.00.0000[ZSMITH@WQMAP/]
── Metadata ────────────────────────────────────────────────────────────────────
Tables: `BASIN`, `EVENT`, `PARAMETER`, `PARAMETER_NAME`, `PREVIOUS_SITE_ID`, … (19 total)
Columns: 200
Primary keys: 13
Foreign keys: 18

We can visualize the table relationships with the get_erd() function.

nexus::get_erd(data_model)

If you want to collect a single table, such as BASIN, you can use the following code. The get_data_model() function hides some of the complexity of connecting to the table through dplyr, such as establishing the schema, and makes it simple to import data into R.

(basin_df2 <- data_model$BASIN |> 
  dplyr::collect())
# A tibble: 17 × 2
   BASIN BASIN_NAME                      
   <chr> <chr>                           
 1 01    Lake Erie-Niagara River Basin   
 2 02    Allegheny River Basin           
 3 03    Lake Ontario & Minor Tribs.     
 4 04    Genesee River Basin             
 5 05    Chemung River Basin             
 6 06    Susquehanna River Basin         
 7 07    Seneca-Oneida-Oswego River Basin
 8 08    Black River Basin               
 9 09    St. Lawrence River Basin        
10 10    Lake Champlain River Basin      
11 11    Upper Hudson River Basin        
12 12    Mohawk River Basin              
13 13    Lower Hudson River Basin        
14 14    Delaware River Basin            
15 15    Passaic-Newark (Ramapo River)   
16 16    Housatonic River Basin          
17 17    Atlantic Ocean-Long Island Sound

Or…

(basin_df2 <- data_model |> 
   dm::pull_tbl(table = "BASIN") |> 
  dplyr::collect())
# A tibble: 17 × 2
   BASIN BASIN_NAME                      
   <chr> <chr>                           
 1 01    Lake Erie-Niagara River Basin   
 2 02    Allegheny River Basin           
 3 03    Lake Ontario & Minor Tribs.     
 4 04    Genesee River Basin             
 5 05    Chemung River Basin             
 6 06    Susquehanna River Basin         
 7 07    Seneca-Oneida-Oswego River Basin
 8 08    Black River Basin               
 9 09    St. Lawrence River Basin        
10 10    Lake Champlain River Basin      
11 11    Upper Hudson River Basin        
12 12    Mohawk River Basin              
13 13    Lower Hudson River Basin        
14 14    Delaware River Basin            
15 15    Passaic-Newark (Ramapo River)   
16 16    Housatonic River Basin          
17 17    Atlantic Ocean-Long Island Sound

2.1.5 Example Queries

You cannot import all of the data from the WQMA Data Warehouse into R; the connection will time out or you will crash your R session. However, there is no need to read all of the data into R. With DBI, dplyr, dm, and nexus you have the ability to do the following and more within the database:

  • preview data

  • filter data

  • select only the columns of interest

  • join tables together

You should try to narrow your focus to the smallest amount of data you need for your task. Queries will return results much faster if you limit the number of rows and columns that need to be transferred from the data warehouse into R. This might mean filtering the rows by basin, waterbody, site, sampling event, project, parameter, etc. or a combination of these factors to get only the data necessary. Similarly, selecting only the tables and columns of interest will help to speed queries up.
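As a sketch of this advice, the query below pushes both the row filter and the column selection into the database, so only the needed rows and columns are transferred. The project name is hypothetical; substitute a value from the PROJECT table:

```r
# Filter rows and select columns in the database before collecting,
# so only the needed data is transferred into R.
events_small <- dplyr::tbl(con, I("WQMA_OWNER.EVENT")) |>
  dplyr::filter(PROJECT %in% "my-project") |>  # hypothetical project name
  dplyr::select(EVENT_ID, EVENT_DATETIME, SITE_ID) |>
  dplyr::collect()
```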

If you are feeling nervous about not pulling all of the data in, the example queries below should demonstrate that targeted queries can retrieve everything you need.

nexus is designed to work with the R package dm. dm stands for “data model”; it provides a number of useful functions for working with relational data and has great documentation on how to use those functions (see Tutorials, Cheatsheet, and Technical Articles on the dm website). Many of the functions are designed both to execute locally in R and to be translated into a SQL query to be executed by a relational database or data warehouse. When possible, it is recommended that you favor dm queries over dbplyr, DBI, and custom SQL queries. It is not bad to write dbplyr, DBI, and custom SQL queries, but dm provides an elegant syntax for leveraging relational data that would be very difficult and time intensive to re-implement with other tools; in other words, dm is generally more efficient because it is easier to write and understand.

2.1.5.1 Lake Sites with Phosphorus Data Collected in the last 5-Years

  1. Define the time period of interest as start_date and end_date
  2. Start with the data_model object defined previously
  3. Filter the WATERBODY table to only the rows where the column WATERBODY_TYPE is “lake”.
  4. Filter the PARAMETER table to only the rows where the column PARAMETER_NAME is “phosphorus”.
  5. Filter the EVENT table to only the rows where the column EVENT_DATETIME represents a date in the last 5-years.
  6. In this case, we are only interested in the SITE table. It’s important to note that we can use the data model to perform queries on tables besides the SITE table that ultimately influence the rows of the SITE table returned.
  7. In this case, we are only interested in keeping the columns associated with the site identifier, SITE_CODE, and the site’s coordinates (LATITUDE and LONGITUDE).
  8. Collect the data from the database into R.
# 1
start_date <- Sys.Date() - lubridate::years(5)
end_date <- Sys.Date()

site_remote <- data_model |> # 2
  dm::dm_filter(
    WATERBODY = WATERBODY_TYPE == "lake", # 3
    PARAMETER = PARAMETER_NAME == "phosphorus", # 4
    EVENT = dplyr::between( # 5
      x = EVENT_DATETIME,
      left = to_date(start_date, "YYYY-MM-DD HH24:MI:SS"),
      right = to_date(end_date, "YYYY-MM-DD HH24:MI:SS")
    )
  ) |>
  dm::pull_tbl(SITE) |>  # 6
  dplyr::select( # 7
    SITE_CODE,
    LATITUDE,
    LONGITUDE
  )

site_dm <- dplyr::collect(site_remote) # 8

2.1.5.2 Site

site_remote <- dm::dm_filter(data_model,
                             SITE = SITE_CODE %in% c("07-ONON-1.0"))
site_dm <- dplyr::collect(site_remote)
(single_df <- site_dm |> 
  dm::dm_flatten_to_tbl(.start = RESULT, .recursive = TRUE) |> 
  dplyr::select(
    EVENT_ID,
    SITE_CODE,
    REPLICATE,
    SAMPLE_METHOD,
    SAMPLE_METHOD_DESCRIPTION,
    PARAMETER_NAME,
    RESULT_VALUE,
    UNIT,
    LATITUDE,
    LONGITUDE,
    BASIN,
    BASIN_NAME,
    WATERBODY_NAME,
    WATERBODY_TYPE
  ))
# A tibble: 1,693 × 14
   EVENT_ID             SITE_CODE REPLICATE SAMPLE_METHOD SAMPLE_METHOD_DESCRI…¹
   <chr>                <chr>     <chr>     <chr>         <chr>                 
 1 07-ONON-1.0_1986090… 07-ONON-… 1         survey        sample_event_info_sur…
 2 07-ONON-1.0_1986090… 07-ONON-… 1         multiprobe    unknown               
 3 07-ONON-1.0_1986090… 07-ONON-… 1         multiprobe    unknown               
 4 07-ONON-1.0_1986090… 07-ONON-… 1         multiprobe    unknown               
 5 07-ONON-1.0_1986090… 07-ONON-… 1         multiprobe    unknown               
 6 07-ONON-1.0_1986090… 07-ONON-… 1         survey        macroinvertebrate_fie…
 7 07-ONON-1.0_1986090… 07-ONON-… 1         survey        macroinvertebrate_fie…
 8 07-ONON-1.0_1986090… 07-ONON-… 1         survey        macroinvertebrate_fie…
 9 07-ONON-1.0_1986090… 07-ONON-… 1         survey        macroinvertebrate_fie…
10 07-ONON-1.0_1986090… 07-ONON-… 1         survey        macroinvertebrate_fie…
# ℹ 1,683 more rows
# ℹ abbreviated name: ¹​SAMPLE_METHOD_DESCRIPTION
# ℹ 9 more variables: PARAMETER_NAME <chr>, RESULT_VALUE <dbl>, UNIT <chr>,
#   LATITUDE <dbl>, LONGITUDE <dbl>, BASIN <chr>, BASIN_NAME <chr>,
#   WATERBODY_NAME <chr>, WATERBODY_TYPE <chr>

2.1.5.3 Project

project_remote <- data_model |> 
  dm::dm_filter(PROJECT = PROJECT_TYPE == "RIBS Routine") |> 
  dm::dm_select_tbl(
    WATERBODY,
    SITE,
    EVENT,
    SAMPLE,
    RESULT,
    PARAMETER
  )
project_dm <- dplyr::collect(project_remote)
(single_df <- project_dm |> 
  dm::dm_flatten_to_tbl(.start = RESULT, .recursive = TRUE) |> 
  dplyr::select(
    EVENT_ID,
    SITE_CODE,
    REPLICATE,
    SAMPLE_METHOD,
    SAMPLE_METHOD_DESCRIPTION,
    PARAMETER_NAME,
    RESULT_VALUE,
    UNIT,
    LATITUDE,
    LONGITUDE,
    BASIN,
    WATERBODY_NAME,
    WATERBODY_TYPE
  ))
# A tibble: 18,911 × 13
   EVENT_ID             SITE_CODE REPLICATE SAMPLE_METHOD SAMPLE_METHOD_DESCRI…¹
   <chr>                <chr>     <chr>     <chr>         <chr>                 
 1 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
 2 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
 3 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
 4 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
 5 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
 6 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
 7 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
 8 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
 9 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
10 13-ROND-9.9_2018061… 13-ROND-… 1         unknown       unknown               
# ℹ 18,901 more rows
# ℹ abbreviated name: ¹​SAMPLE_METHOD_DESCRIPTION
# ℹ 8 more variables: PARAMETER_NAME <chr>, RESULT_VALUE <dbl>, UNIT <chr>,
#   LATITUDE <dbl>, LONGITUDE <dbl>, BASIN <chr>, WATERBODY_NAME <chr>,
#   WATERBODY_TYPE <chr>

2.1.5.4 Parameter

chloride_remote <- dm::dm_filter(data_model,
                             PROJECT = PROJECT_TYPE == "RIBS Routine",
                             PARAMETER = PARAMETER_NAME == "chloride")
chloride_dm <- dplyr::collect(chloride_remote)
(single_df <- chloride_dm |> 
  dm::dm_flatten_to_tbl(.start = RESULT, .recursive = TRUE) |> 
  dplyr::select(
    EVENT_ID,
    SITE_CODE,
    REPLICATE,
    SAMPLE_METHOD,
    SAMPLE_METHOD_DESCRIPTION,
    PARAMETER_NAME,
    RESULT_VALUE,
    UNIT,
    LATITUDE,
    LONGITUDE,
    BASIN,
    BASIN_NAME,
    WATERBODY_NAME,
    WATERBODY_TYPE
  ))
# A tibble: 623 × 14
   EVENT_ID             SITE_CODE REPLICATE SAMPLE_METHOD SAMPLE_METHOD_DESCRI…¹
   <chr>                <chr>     <chr>     <chr>         <chr>                 
 1 01-BUFF-1.7_2018041… 01-BUFF-… 1         unknown       unknown               
 2 01-BUFF-1.7_2018061… 01-BUFF-… 1         unknown       unknown               
 3 01-BUFF-1.7_2018081… 01-BUFF-… 1         unknown       unknown               
 4 01-BUFF-1.7_2018101… 01-BUFF-… 1         unknown       unknown               
 5 01-BUFF-1.7_2019041… 01-BUFF-… 1         unknown       unknown               
 6 01-BUFF-1.7_2019061… 01-BUFF-… 1         unknown       unknown               
 7 01-BUFF-1.7_2019081… 01-BUFF-… 1         unknown       unknown               
 8 01-BUFF-1.7_2019102… 01-BUFF-… 1         unknown       unknown               
 9 01-BUFF-1.7_2020042… 01-BUFF-… 1         unknown       unknown               
10 01-BUFF-1.7_2020061… 01-BUFF-… 1         unknown       unknown               
# ℹ 613 more rows
# ℹ abbreviated name: ¹​SAMPLE_METHOD_DESCRIPTION
# ℹ 9 more variables: PARAMETER_NAME <chr>, RESULT_VALUE <dbl>, UNIT <chr>,
#   LATITUDE <dbl>, LONGITUDE <dbl>, BASIN <chr>, BASIN_NAME <chr>,
#   WATERBODY_NAME <chr>, WATERBODY_TYPE <chr>

2.2 Setup Oracle ODBC Connection

2.2.1 Objective

Set up an ODBC connection to the WQMA Data Warehouse.

2.2.2 What is an ODBC?

ODBC stands for Open Database Connectivity and, once configured accordingly, it allows the user to connect to a database from another program, such as R.

2.2.3 Oracle’s Instant Client ODBC Driver

Oracle’s Instant Client ODBC software is a standalone package that offers the full functionality of the Oracle ODBC driver (except the Oracle service for Microsoft Transaction Server) with a simple install. –From: https://www.oracle.com/database/technologies/releasenote-odbc-ic.html

In other words, Oracle’s Instant Client is a set of software that needs to be installed on your computer in order for you to connect to an Oracle database.

Caution

I don’t care what Oracle says, this is anything but simple for the average user.

Note

I was directed to the Oracle Instant Client by Posit’s webpage on connecting to an Oracle database.

2.2.4 Procedure

  1. Visit Oracle’s Instant Client ODBC Release Notes webpage

  2. Follow the instructions in Installing Oracle Instant Client ODBC/On Windows section. You must first install the Instant Client Basic package and then install the Instant Client ODBC package. I have copied and modified the text from the webpage to try and make the steps easier to follow.

    1. “Install the Instant Client Basic or Basic Light package, as described above.”

      1. Follow the instructions at the head of the page in the “Installing Oracle Instant Client Basic and Basic Lite” section, which boils down to installing Oracle’s Instant Client from here.

      2. Select the Instant Client for Microsoft Windows (x64).

      3. Select the most recent version of Instant Client for Microsoft Windows and a table will expand.

      4. Download the Basic Package to the root of your Downloads folder. You cannot download the file directly to your C-drive (C:\).

      5. Right-click on the downloaded zip file in your Downloads folder, select Extract All, browse to your C-drive (C:\), and select Extract.

    2. “Download the Instant Client ODBC package. Unzip it in the same directory as your Basic or Basic Light package.”

      1. The ODBC package is located on the same downloads page as the Basic Package installed in step 2.1 above. If you need to navigate back to this page, follow steps in 2.1.1 - 2.1.3.

      2. Scroll down until you see the ODBC Package; it is under the subheading: Development and Runtime - optional packages.

      3. Download the ODBC package to your Downloads folder. You cannot download the file directly to your C-drive (C:\).

      4. Right-click on the downloaded zip file in your Downloads folder, select Extract All, browse to your C-drive (C:\), select the folder you created in step 2.1.5, and select Extract.

    3. The C-drive directory, created in step 2.1, must be added to your computer’s PATH.

      1. In your Windows Search Bar, look up and open “Edit the system environment variables.”

      2. In the Advanced tab of the pop-up window, click the “Environment Variables…” button.

      3. This step requires admin-rights.

        1. In the “System variables” section:

          1. find the existing “Path” variable

          2. select the “Path” row

          3. click on the “Edit…” button

        2. In the pop-up:

          1. Click on the “New” button

          2. Click on the “Browse…” button

          3. Select the C-drive path to the instant client directory created in step 2.1

    4. This step requires admin-rights. “Execute odbc_install.exe from the Instant Client directory.”

      1. Navigate to the folder created in step 2.2.4.

      2. Double-click the odbc_install.exe.

  3. Setup a 64-bit ODBC connection.

    1. In your Windows search bar, type “ODBC Data Sources”.

    2. Select the Option that ends in “64-bit”.

    3. In the window that appears, select the “Add” button.

    4. Select the Oracle Instant Client option (should look something like this: “Oracle in instant_client_19_19”) and click the “Finish” button. If you do not see the Oracle Instant Client Option, then your ODBC installation was not successful. Please try to install the ODBC again.

    5. Configuration:

      1. Data Source Name: wqma_prod

      2. Description: WQMA Data Warehouse Production Version

      3. TNS Service Name: WQMAP

      4. User ID: [Enter your database User ID]

        • Example: JDOE
      5. Click on the “Test Connection” Button, enter your database password, and click on the “OK” button. If there are no errors, you should receive a message stating “Connection successful.”

        • If you are entering a temporary password, you will be prompted for a new password.
      6. Click the “OK” button in the Oracle ODBC Driver Configuration window.

  4. Congratulations! You have successfully downloaded, installed, and configured the necessary software to connect to an Oracle database.
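If the odbc R package is installed, you can verify from R that the new data source is visible. The DSN name shown in the comment is the one entered in the configuration step above:

```r
# List the ODBC data sources available to R;
# "wqma_prod" should appear in the result
odbc::odbcListDataSources()
```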

2.2.5 Keyring Setup (Optional)

If you do not want to enter your password each time you establish a connection, you can set up a keyring to securely store and access your password. Again, you should NEVER type out your password in plain text.

# Store your password securely; you will be prompted in a pop-up window
keyring::key_set("your-user-name",
                 keyring = "wqma-example")
# Retrieve the stored password programmatically
keyring::key_get("your-user-name",
                 keyring = "wqma-example")

Once the keyring is established, you can point to it in the get_connected() function using the keyring argument, and you will not need to provide your password each time you connect to the WQMA data warehouse.

con <- nexus::get_connected(username = "your-user-name",
                            keyring = "wqma-example")