This document was written as a manual on the use of the online RStudio environment of the ScheldeMonitor information and data portal. This environment has been accessible through the ScheldeMonitor website since 2021. It provides accredited researchers and the partners of ScheldeMonitor with a centralized RStudio hub to do analysis and build up scripts directly based on the data and information that is held within the portal and the underlying database.
Within the manual, guidelines are provided for new users or users who are inexperienced with RStudio, as well as some overall recommendations on how to keep work in RStudio structured and comprehensible. The last chapter of this manual also provides a step-by-step breakdown of how to load data from ScheldeMonitor into the RStudio workspace.
The RStudio environment can be accessed from the website, using the following link. To access, credentials are required. These credentials can be requested here, or by sending a mail to info@scheldemonitor.org with a statement on the reason why use of the RStudio environment is required.
It is also possible to link the RStudio environment with an existing project from the ScheldeMonitor GitHub organization. How to work with GitHub in combination with RStudio is discussed in an additional GitHub manual.
The ScheldeMonitor environment can be accessed at (https://rstudio.scheldemonitor.org/auth-sign-in) using your personal credentials:
These are the same credentials as those used for all other tools within the ScheldeMonitor platform, such as the data download toolbox and the E-room.
New users should first register to receive their personal credentials. This can be done using this link or by contacting the helpdesk of ScheldeMonitor stating a reason to use the RStudio environment.
The following information is needed to register on the RStudio environment of ScheldeMonitor:
After registration, your account needs to be approved by a moderator of the ScheldeMonitor RStudio environment. This process can take one or two working days.
After approval, you can use your personal credentials to sign in to the RStudio environment of ScheldeMonitor.
Once logged in, you will see your personal workspace in RStudio. This workspace is either empty (for instance if you are a new user) or shows the structure and content you worked on during your previous session, provided you saved the workspace image when you last closed RStudio.
By default, a personal workspace is composed of four panes, showing scripts or data frames, your environment, the console, and your project or personal file structure. It also indicates which user is logged in and which project is linked to your workspace:
The default behaviour of R for the handling of .RData files and workspaces encourages and facilitates a model of breaking work contexts into distinct working directories. This implies that users can select a folder in their local directory as the location where files handled through RStudio are saved. This local directory, or workspace, can be changed by the user at any moment.
In version v0.95 of RStudio, a new ‘Projects’ feature was introduced to make managing multiple working directories more straightforward. Using this feature is recommended; however, this chapter also explains how to handle your workspaces in the default manner.
As with a local RStudio installation, the online RStudio environment of ScheldeMonitor uses the local user’s home directory as workspace by default. This workspace is typically referenced using ~ in R. When RStudio starts up it does the following:
When RStudio exits and changes to the workspace have been made, a dialog box asks whether these changes should be saved to the .RData file in the current working directory. Clicking “Save” ensures that your changes are stored and will reappear the next time you log in to the RStudio environment.
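What the save dialog does can also be reproduced by hand. The sketch below is only an illustration: it assumes a temporary file in place of the .RData file in the working directory, and two made-up objects in place of a real workspace. It stores the objects with `save()` and restores them with `load()`:

```r
# Assumption: a temporary file stands in for the .RData file that RStudio
# writes to the current working directory.
rdata_file <- file.path(tempdir(), "example.RData")

# Two hypothetical objects that make up the "workspace"
station <- "Station A"
salinity <- c(30.1, 29.8, 31.2)

# Save the objects, then remove them to mimic closing the session
save(station, salinity, file = rdata_file)
rm(station, salinity)

# Restore the objects from the file; load() returns the restored names
restored <- load(rdata_file)
```

After `load()`, both objects are available again under their original names.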
RStudio displays the current working directory within the title region of the Console. To check your current working directory, you can run the command getwd() in the RStudio console:
## [1] "C:/Users/pietr/Documents/Code/ScheldeMonitor-Manuals"
To change the working directory, you can run the command setwd() in the RStudio console with the new directory inserted as a string:
You can also change the working directory by selecting the “Session” menu and “Set Working Directory”. Be careful to consider the side effects of changing your working directory:
Because these side effects can cause confusion and errors, it is usually best to start within the working directory associated with your project and remain there for the duration of your session.
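As a minimal sketch of the two commands, the example below checks the working directory, changes it, and then returns to the starting point. The folder `example-project` inside a temporary directory is a hypothetical stand-in for a real project folder:

```r
# Remember the current working directory so it can be restored later
original_dir <- getwd()

# Assumption: a temporary folder stands in for a project folder
project_dir <- file.path(tempdir(), "example-project")
dir.create(project_dir, showWarnings = FALSE)

# Change the working directory; the new location is passed as a string
setwd(project_dir)
current_dir <- getwd()

# Return to where the session started to avoid side effects later on
setwd(original_dir)
```

Storing the original directory before calling `setwd()` makes it easy to undo the change at the end of a script.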
The best practice, however, is to connect the RStudio environment to a specific ‘project’. This gives a better overview of the working directories and the different cases you work on within the same RStudio environment. The next segments describe how such projects are set up.
Any approved user can utilize the RStudio environment to commence or continue his or her personal project with the data of ScheldeMonitor. Doing so, users can either start a new local project, download an existing project from their own GitHub, or connect their work to the GitHub organization of ScheldeMonitor.
Starting a local project is the easiest way to commence your work. This project is saved on a local drive of the user’s hardware, and can only be restarted by accessing that drive. To initiate such a local project, users need to take the following steps:
The user can now choose which type of project needs to be started.
This manual was written for the users of the online RStudio environment of ScheldeMonitor. The procedure and affiliated scripts have been deployed on the server of this environment, so that users do not need to install anything to execute the procedure. This implies that this procedure will not work on other (local) RStudio environments or servers. Users that want to execute the procedure within another RStudio environment, not affiliated to the ScheldeMonitor server, can request a Git-bundle of the scripts at info@scheldemonitor.org.
In 2020, GitHub announced that it would no longer accept account passwords when authenticating with the REST API and would require token-based authentication instead. This implies that when the ScheldeMonitor RStudio workspace is used in combination with a GitHub repository, all actions (e.g. pull, push, commit) require a different kind of authentication than was previously standard.
VLIZ has identified that setting up an SSH key would be the best way for users moving forward. This manual explains in detail the one-time procedure that is needed for a correct setup. This procedure involves four scripts that need to be run by a user from within the terminal of the RStudio environment of ScheldeMonitor.
The ScheldeMonitor environment can be accessed by following the steps described in the chapter Connecting to the RStudio environment of ScheldeMonitor.
Once within the workspace, switch the console to the terminal window by clicking the ‘Terminal’ tab on top of the window. This window is where you will run all the necessary commands:
If users have never made an SSH key before within this RStudio environment, they need to run the ‘make-git-sshkey.sh’ script. This script will generate a key pair with a standard name to use for connecting to git services.
To run, execute the following command in the terminal:
This will provide a message that the key pair was generated:
Next, the user will need to register the generated key pair on GitHub online. To do so, the script ‘connect-sshkey.sh’ can be used to set up the local configuration correctly and to advise the user on how to register the public part of the key at the service of the user’s choice (defaults to GitHub).
To run, execute the following command in the terminal:
This will provide an extensive message, detailing how to register the key on GitHub:
As stated in the message, copy the text between the two ‘----’.
Surf to https://github.com/settings/keys, and select the ‘New SSH key’ button in the upper right corner.
This will open a new window with two text fields. In the ‘Key’ field, paste the text that you have copied from the RStudio terminal in step 3.
Optionally, you can give a name to this key in the ‘Title’ field, applying to the RStudio environment in which you have created the key (best practices state that each environment has its own key).
When finished, press the ‘Add SSH key’ button. This will return you to the previous window, where you will see the new key added.
To verify if the previous steps were executed correctly, the script ‘check-gitssh.sh’ can be used. This script will check if the connection is set up correctly. This step can be done as often as needed or wanted.
To run, execute the following command in the terminal of RStudio:
If done for the first time, the terminal might ask you to verify the action. To do this, type ‘yes’ behind the question mark and execute:
When everything was done correctly, the terminal will return the correct username:
The workspace is now connected via SSH. The user can now start working through SSH in new projects or convert already connected projects to SSH. Both methods are explained in the following chapters.
From now on new projects should immediately start off by using the correct remote origin-url. The steps below explain how to do that.
After choosing a name for your project locally, as well as a location, the project can be created.
Projects that were already connected to the RStudio workspace before converting to SSH will still work through HTTPS. To change this, open the existing project that you wish to convert in the RStudio environment. This can be done in the upper right corner of the screen:
Once the project is opened, the script ‘fix-gitssh.sh’ can be used to switch the connection of your project to git-ssh. In the process, the script figures out which service the project is connected to.
The terminal should return the original connection, and a message that it is fixing the connection to git-ssh:
A working directory or project in RStudio can hold a large number of scripts and files to work with. In order to keep the work organized, as well as reproducible over time, it’s important to structure these scripts both in the directory and internally. The segments below suggest guidelines that might aid researchers in keeping their work transparent for themselves and other users.
A working directory or project is similar to any other folder on the local drive of your hardware. This implies that such a directory can consist of folders and subfolders. It is, however, imperative that folders are created following a certain structure or idea, to make the scripts and underlying data findable for yourself and other users. There are multiple levels on which a directory can be structured.
Firstly, if your work in RStudio is linked to a certain publication or report, your directory structure should mirror the headings of the report. Here is an example from the T2015 report on the Scheldt, for which the project directory was structured according to the titles and subtitles within the published report:
Yet, it is even more important to have a uniform structure at the lowest level of the working directory, where all files are stored. This holds especially for projects that are not linked to a fixed report, and for which the above-mentioned structure is not applicable.
Typically, data files and scripts should be saved in separate folders. Although it might seem more convenient to keep those files together, the general overview benefits from the two-folder structure. Scripts and data files often do not have a 1:1 relationship, as a single script can use multiple data files while these data files are run through multiple different scripts. However, the structure of each folder should be the same, with a folder for every phase of the project:
Using this structure, a uniform workflow can be established within the project directory. This workflow follows four steps, which are explained in the following table:
Step | Using data from: | Using scripts or functions from: | Saving new data or results in:
---|---|---|---
Step 1 - Import data (if necessary) | n/a | |
Step 2 - Clean data | | |
Step 3 - Analyze data | b. Cleaned data | |
Step 4 - Create figures or results | | |
Users may prefer to run a single script that goes through all these steps, especially in smaller projects. In this case, a ‘Main.R’ script can be saved alongside the ‘Data’ and ‘Scripts’ folders. This main script can then run through all these steps on its own, while sourcing different data files and functions from the underlying folder structure. The latter is especially important in larger projects, to keep the main script short and readable. When doing so, it is very important that the main script is well structured and annotated, as will be further explained in Script structure and Script annotations below.
In any case, only one ‘Main.R’ file should be present, to avoid confusion.
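The folder layout and the role of a main script can be sketched as follows. This is an illustration only: the temporary project folder, the phase folder names other than ‘b. Cleaned data’ and ‘a. Import scripts’, and the helper `clean_values()` are all hypothetical.

```r
# Assumption: a temporary folder stands in for the project directory. Only
# "b. Cleaned data" and "a. Import scripts" are taken from this manual; the
# other folder names are hypothetical.
project_dir <- file.path(tempdir(), "example-project")
data_phases <- c("a. Imported data", "b. Cleaned data")
script_phases <- c("a. Import scripts", "b. Cleaning scripts")

# Create one folder per phase under both 'Data' and 'Scripts'
for (phase in data_phases) {
  dir.create(file.path(project_dir, "Data", phase),
             recursive = TRUE, showWarnings = FALSE)
}
for (phase in script_phases) {
  dir.create(file.path(project_dir, "Scripts", phase),
             recursive = TRUE, showWarnings = FALSE)
}

# A hypothetical helper script that belongs to the cleaning phase
helper_file <- file.path(project_dir, "Scripts", "b. Cleaning scripts",
                         "clean_values.R")
writeLines("clean_values <- function(x) x[!is.na(x)]", helper_file)

# A main script would source the helper and then call it
source(helper_file)
cleaned <- clean_values(c(1.2, NA, 3.4))
```

Keeping helpers in their phase folder and sourcing them from the main script keeps the main script itself short.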
Scripts should be named in such a way that users can easily derive their purpose, so that they do not have to open every script in an RStudio environment to know what it is used for. This is especially important when working with a main script that sources functions from other scripts throughout the different phases.
For example, when using different scripts for different kinds of graphs, the nomenclature should clearly indicate which plot is made using the script:
Additionally, if the work in the RStudio environment is linked to a certain report or publication, the figure number from the publication could be inserted in the file name:
It is also possible that multiple scripts are used for the same figure, for instance if users want to be able to show both the original and the new plot at a later date. Still, the nomenclature needs to clearly indicate the differences between the scripts:
Nevertheless, whatever nomenclature is chosen, it should consist of a fixed and uniform naming convention. There are several options to choose from, similar to the ones available for code nomenclature as explained in Naming conventions:
Similar to a directory, an individual script can greatly benefit from a fixed and uniform structure. This structure should clearly delineate the different sections in a script, which gives the reader a quick overview of the content and also assures the user that all actions and functions are run in a fixed order. Script structure can be accomplished almost immediately by using headings in the code. These are inserted in the same way as annotations. Ideally, all scripts should have the same headings to start with:
######################################################################
## This is an example for the manual
##
## written by Jelle Rondelez of VLIZ
## info@scheldemonitor.org - Oct 2020
######################################################################
##############################
# 0 - Load libraries
##############################
library(dplyr) # package to clean datatable
library(lubridate) # package to change date formats
##############################
# 1 - Static part
##############################
#Assign variable
newvar <- ""
#Source script from within directory
source("Scripts/a. Import scripts/ImportWFS.R")
#Open datafile
datafile <- read.csv(file = "Data/b. Cleaned data/dataRWS.csv")
##############################
# 2 - Scripts
##############################
# code...
Note that the sourced files in the example above are using the directory structure as described in Directory structure.
These headings not only give a fixed structure and order to all scripts in the project, they also have the added advantage that sections can be collapsed or expanded if needed. Especially for longer scripts, in which certain sections of the code are not of interest to the user, this can greatly increase the readability of the script:
Larger scripts can benefit more from an expanded structure with additional headings. This is especially true for ‘Main.R’ scripts that run through all phases of the project within a single script, as discussed in Directory structure. Those types of scripts typically source and use a multitude of different functions and files. An extended structure can make these scripts more readable and can make it easier to search for a specific function or action:
######################################################################
## This is an example for the manual
##
## written by Jelle Rondelez of VLIZ
## info@scheldemonitor.org - Oct 2020
######################################################################
##############################
# 0 - Load libraries
##############################
library(dplyr) # package to clean datatable
library(lubridate) # package to change date formats
##############################
# 1 - Static part
##############################
#Assign variable
newvar <- ""
#Source script from within directory
source("Scripts/a. Import scripts/ImportWFS.R")
#Open datafile
datafile <- read.csv(file = "Data/b. Cleaned data/dataRWS.csv")
##############################
# 2 - Scripts
##############################
# code...
##############################
# 3 - Analysis part
##############################
# code...
##############################
# 4 - Make plots & Figures
##############################
# code...
Annotating code is important for a number of reasons. The main benefit is for the authors themselves when they look back on what was coded: annotations explain in detail what a line, chunk or even section of code is trying to accomplish. This is also helpful for other people who read the code, for instance when they want to adapt the work to their own needs, or when someone is checking or evaluating a chunk of code. Annotating code is done with the symbol # (hash). Typically, an annotation is placed above a whole chunk of code, for example to explain the purpose of a certain function.
#Reactive values for the user's locations
#(this fragment belongs inside the server function of a Shiny app and
#assumes that the shiny and leaflet packages are loaded)
data_of_click <- reactiveValues(clicked = NULL)
longitude_click <- reactiveValues(lng = NULL)
latitude_click <- reactiveValues(lat = NULL)
#If the user clicks on the map, the new coordinates are saved and the map is adjusted
observeEvent(input$Map_click, {
  data_of_click$clicked <- input$Map_click
  longitude_click$lng <- input$Map_click$lng
  latitude_click$lat <- input$Map_click$lat
  #Replace the previous marker with one at the clicked location
  leafletProxy('Map') %>%
    clearMarkers() %>%
    addMarkers(lng = input$Map_click$lng,
               lat = input$Map_click$lat,
               popup = paste("Longitude=", round(input$Map_click$lng, 2),
                             "and",
                             "Latitude=", round(input$Map_click$lat, 2)))
})
Unfortunately, unlike other programming languages, R has no widely accepted coding best practices. Instead there have been various attempts to put together a few sets of rules. This chapter tries to fill the gap by summarizing what was found relevant in those various attempts.
Referring to a file or folder from within a script is mostly done through ‘hardcoding’, i.e. giving the location of the file as a string. However, users are strongly recommended to keep the amount of hardcoding minimal, because a script with less hardcoding requires less effort to change when a directory location changes. To do so, if your code reads in data from a file, define a variable early in the code that stores the path to that file. Following this approach, the following example:
input_file <- "data/data.csv"
output_file <- "data/result.csv"
#read input
input_data <- read.csv(input_file)
#get number of samples in data
sample_number <- nrow(input_data)
#generate results
results <- some_other_function(input_file, sample_number)
#write results
write.table(results, output_file)
is preferable to:
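For illustration, a hardcoded counterpart could look like the first half of the sketch below, in which the same path is spelled out again at every call. The second half shows the preferred pattern with the path defined once. The temporary directory, the file name and the tiny data frame are assumptions made only so that the example runs on its own:

```r
# Assumption: a temporary directory stands in for the project folder and a
# small made-up table stands in for real data.
project_dir <- tempdir()
dir.create(file.path(project_dir, "data"), showWarnings = FALSE)
write.csv(data.frame(sample = 1:3, value = c(2.5, 3.1, 4.8)),
          file.path(project_dir, "data", "data.csv"), row.names = FALSE)

# Hardcoded: the location is repeated at every call, so a change of
# location must be corrected in several places.
hardcoded_data <- read.csv(file.path(project_dir, "data", "data.csv"))
hardcoded_n <- nrow(read.csv(file.path(project_dir, "data", "data.csv")))

# Preferred: the location is defined once and reused everywhere.
input_file <- file.path(project_dir, "data", "data.csv")
input_data <- read.csv(input_file)
sample_number <- nrow(input_data)
```

Both variants read the same data; only the second one needs a single edit when the file moves.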
R has no naming conventions for variables and functions that are generally agreed upon. As a newcomer to R it is useful to decide which naming convention to adopt. Generally, there are five naming conventions to choose from. It is important to pick one convention and stick to it for the remainder of your project:
Above all, and regardless of the chosen naming convention, it is important to choose variable and function names that are concise and meaningful.
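As an illustration, the sketch below writes one hypothetical variable name in the five conventions that are commonly distinguished for R: all lowercase, period-separated, underscore-separated, lowerCamelCase and UpperCamelCase. That these are the five conventions meant above is an assumption.

```r
# The same (hypothetical) variable written in five naming conventions
waterlevelmean   <- 2.4  # alllowercase
water.level.mean <- 2.4  # period.separated
water_level_mean <- 2.4  # underscore_separated
waterLevelMean   <- 2.4  # lowerCamelCase
WaterLevelMean   <- 2.4  # UpperCamelCase

# All five are syntactically valid names in R: make.names() leaves valid
# names unchanged
candidate_names <- c("waterlevelmean", "water.level.mean", "water_level_mean",
                     "waterLevelMean", "WaterLevelMean")
all_valid <- all(make.names(candidate_names) == candidate_names)
```

R accepts all of them, which is exactly why a project needs an explicit choice.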
As with naming conventions, there are no syntax conventions when it comes to writing code in R. However, large scripts benefit greatly from the use of a clear and consistent syntax, as it makes the code more open and readable. Using correct spacing in your code makes an invaluable difference in the syntax. It can be implemented by following these rules:
# Good
height <- (feet * 12) + inches
mean(x, na.rm = TRUE)
# Bad
height<-feet*12+inches
mean(x, na.rm=TRUE)
However, it is important to not overdo spacing as well. Adding extra space can help, but only if it improves the alignment of = or <-. Do not add extra spaces to places where space is not helpful.
Just as when talking about the overall structure of a script, hierarchy is equally important within the code itself. To define the most important hierarchies, curly braces are used. However, to keep the hierarchy transparent for yourself and other users, a consistent syntax is needed when using curly braces. This syntax is based on three rules:
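A short sketch of such a brace syntax is given below. It assumes the rules of the tidyverse style guide: an opening brace is the last character on its line, the contents are indented by two spaces, and a closing brace is the first character on its line. The function `classify_salinity()` and its thresholds are hypothetical and serve only as an illustration.

```r
# Classify a hypothetical salinity value (thresholds are illustrative only)
classify_salinity <- function(salinity) {
  if (salinity < 0.5) {
    "fresh"
  } else if (salinity < 30) {
    "brackish"
  } else {
    "marine"
  }
}

water_type <- classify_salinity(12)
```

Because every level of nesting is indented consistently, the hierarchy of the function and its branches is visible at a glance.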
Users are recommended to always strive to limit the code to 80 characters per line. To do so, using a concise and efficient naming convention might already be an important step. If a function call is too long to fit on a single line, use one line each for the function name, each argument, and the closing bracket. This makes the code easier to read and to change later:
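As a small illustration of this layout, the call below would not fit within 80 characters on one line, so the function name, each argument and the closing bracket each get their own line. The station names and values are made up:

```r
# Too long for a single line: one line for the function name, one per
# argument, and one for the closing bracket
station_summary <- data.frame(
  station_name = c("Station A", "Station B"),
  mean_temperature = c(12.4, 13.1),
  number_of_samples = c(120L, 98L),
  stringsAsFactors = FALSE
)
```

Adding or removing an argument later then changes exactly one line.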
Even when using correct spacing and adequate structuring of code blocks, a script can remain quite difficult to understand. This is especially true for scripts where a lot of different operations and functions are being used. When code is formed by a lot of functional language, it comes with a large number of parentheses and arguments per function. This can make code extremely complex and hard to understand.
To overcome this problem, users are recommended to use ‘piping’ for multiple actions on the same argument. Piping uses the ‘%>%’ operator and can be used by installing the ‘magrittr’ or ‘dplyr’ library. It is best explained through three simple rules:
The R community has multiple guides on how to style and manage your code in order to make it readable and clean. All these style guides are, however, fundamentally opinionated. Some decisions genuinely do make code easier to use, but many decisions are arbitrary. The most important thing about a style guide is that it provides consistency, making code easier to write because you need to make fewer decisions.
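The difference between nested and piped code can be sketched as follows. The example assumes that the ‘magrittr’ package is installed and uses a made-up vector of measurements:

```r
library(magrittr)

# Hypothetical measurements with one missing value
measurements <- c(4.2, NA, 5.8, 6.1)

# Nested calls: have to be read from the inside out
nested_result <- round(mean(sqrt(measurements), na.rm = TRUE), 2)

# Piped calls: read from top to bottom, one action per line
piped_result <- measurements %>%
  sqrt() %>%
  mean(na.rm = TRUE) %>%
  round(2)
```

Both versions perform the same three operations in the same order; the piped version simply lists them in the order in which they are executed.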
Users of the RStudio environment of ScheldeMonitor are recommended to use the tidyverse style guide, as it is one of the most commonly used guides. The rules mentioned above in this manual are also part of the tidyverse style guide.
There are two tools that can be installed by users that make it easier to implement this style guide, the ‘styler’ and ‘lintr’ packages. The installation of the ‘tidyverse’ package is not needed for these applications. The ‘styler’ and ‘lintr’ packages can be installed with the following R code:
The following window will now be visible:
It is recommended to use the ‘styler’ package first, followed by the ‘lintr’ package, because the ‘styler’ package automatically corrects style errors such as the incorrect use of spaces and commas. As a result, the list of errors generated by the ‘lintr’ package, which need to be corrected manually, is shorter.
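The recommended order can be sketched in a few lines of R. The sketch assumes that both packages come from CRAN and that a recent version of ‘lintr’ (3.0 or later, which accepts the `text` argument) is used. The installation line is commented out and the calls are guarded, so the example also runs where the packages are not installed yet:

```r
# One-time installation (commented out so the sketch does not reinstall):
# install.packages(c("styler", "lintr"))

# A deliberately badly spaced line of code
messy_code <- "height<-feet*12+inches"

# Step 1: let styler fix the spacing automatically, if it is installed
if (requireNamespace("styler", quietly = TRUE)) {
  styled_code <- as.character(styler::style_text(messy_code))
} else {
  styled_code <- messy_code
}

# Step 2: let lintr report what still needs manual attention, if installed
if (requireNamespace("lintr", quietly = TRUE)) {
  remaining_issues <- lintr::lint(text = styled_code)
} else {
  remaining_issues <- list()
}
```

Running the linter on the already styled code keeps the list of remaining issues as short as possible.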
Most of the data in ScheldeMonitor can be used freely, and users are encouraged to use the RStudio environment of ScheldeMonitor to further analyse and validate our data collection. To do so, the data needs to be loaded into the RStudio environment first. This can be done either by loading downloaded data files such as CSV or TXT, or by using the generic webservices of ScheldeMonitor. Both methods involve accessing the Data Download Toolbox of ScheldeMonitor, which can be done using the following steps:
It is not mandatory to select a datasource, geographical area or time period in the explore tab. In the next tab (accessible via the green “next” button) a specific taxon (biotic data) or parameter (abiotic data) can be selected. Datasets, parameters, or taxa can be added to your selection with the plus sign on the right of the dataset, parameter, or taxon.
When criteria are selected, the counter on the right side of the screen shows the remaining number of records that match the chosen criteria.
The toolbox shows a summary of the chosen data set, along with several options to download or visualize the data. The following actions can be taken in the toolbox:
Users can now choose to download the data from the ScheldeMonitor toolbox as data files in a CSV file format. To do so, and to use them in the RStudio environment, the user can perform the following steps:
The user selects the “Download Data” button and submits all necessary information to commence his/her download:
For example:
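As a self-contained illustration of loading a downloaded file, the sketch below first writes a small CSV file and then reads it with `read.csv()`. The file name and the values are hypothetical stand-ins for a real download; the column names are borrowed from the properties that the ScheldeMonitor webservice returns:

```r
# Assumption: this small file stands in for a CSV downloaded from the
# toolbox and uploaded to the project; name and values are hypothetical.
csv_file <- file.path(tempdir(), "scheldemonitor_download.csv")
write.csv(
  data.frame(
    stationname = c("Station A", "Station B"),
    datetime = c("2020-01-15 10:00:00", "2020-02-15 10:00:00"),
    value = c(7.9, 8.1)
  ),
  csv_file,
  row.names = FALSE
)

# Load the file into a data frame and convert the timestamp column
downloaded <- read.csv(csv_file, stringsAsFactors = FALSE)
downloaded$datetime <- as.POSIXct(downloaded$datetime, tz = "UTC")
```

Converting the timestamp column directly after loading avoids treating dates as plain text later in the analysis.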
However, users of the RStudio environment of ScheldeMonitor are urged to make use of the generic webservices that are available in the data download toolbox of ScheldeMonitor. These webservices are a URL format that automatically queries the ScheldeMonitor database without human intervention. The composition of this URL is automatically generated, based on the selection made by the user in the criteria of the data download toolbox. Using webservices has the added advantage that no data files are needed to load in the data set in R, and that the most recent version of the database is queried. The latter implies that when new data is added in the database to an already downloaded data set, the same webservice URL will be able to automatically load in the newly added data. To use the webservices in the RStudio environment:
Depending on the size of the requested data set, loading the data in R can take up to a minute. Once loaded, the data set is available in the RStudio environment. The webservice is capped at around 1,000,000 records per request. Users who want to analyze more than a million records are therefore recommended to generate multiple separate URLs in the toolbox and to merge the resulting data sets in R itself.
VLIZ has made a script that reads data for a given time period (one year) and a given parameter. This script can be found below or on the ScheldeMonitor GitHub page. How to access the ScheldeMonitor GitHub organization is described in a dedicated manual on the use of GitHub, available on the website.
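Merging several requests can be sketched as follows. To keep the example independent of a network connection, two small local CSV files stand in for two webservice URLs; `read.csv()` accepts a URL in the same way as a file path. The helper `make_part()` and its contents are hypothetical:

```r
# Assumption: two local CSV files stand in for two webservice URLs that
# each return fewer than a million records.
make_part <- function(year) {
  part_file <- file.path(tempdir(), paste0("part_", year, ".csv"))
  write.csv(data.frame(year = year, value = c(1.5, 2.5)), part_file,
            row.names = FALSE)
  part_file
}
sources <- c(make_part(2019), make_part(2020))

# Read every source and stack the results into one data set
parts <- lapply(sources, read.csv)
merged <- do.call(rbind, parts)
```

Because every request returns the same columns, the parts can be stacked with `rbind()` without further preparation.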
#Created by Jelle Rondelez (VLIZ) on 8/3/21
#These packages are needed (installing them is only required once)
install.packages("sf")
install.packages("stringr")
install.packages("dplyr")
library(dplyr)
library(sf)
library(stringr)
dataset <- data.frame()
#Here, the years for which you want to download data should be listed.
#Using the for loop, data can be downloaded per year by default
years <- c("2016","2017","2018","2019","2020","2021")
#This is a test string. Replace it with your own string
#The timespan of the original wfs string should run from 1 Jan to 31 Dec
#no matter which years are selected
wfsstring <- "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal%3Aabiotic_observations&resultType=results&viewParams=where%3Aobs.context+%26%26+ARRAY%5B1%5D+AND+standardparameterid+IN+%281073%29+AND+%28%28datetime_search+BETWEEN+%272016-01-01%27+AND+%272021-12-31%27+%29%29%3Bcontext%3A0001&propertyName=stationname%2Clongitude%2Clatitude%2Cdatetime%2Cdepth%2Cparametername%2Cvaluesign%2Cvalue%2Cdataprovider%2Cdatasettitle&outputFormat=csv"
#This for loop results in the download of yearly datasets.
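#At the end of the loop, all datasets are both saved separately and appended together.
#If the user wants to download in larger timespans, change the second 'years[i]'
#example: 'years[i+1]' downloads data for two-year spans
for (i in seq_along(years)) {
  #Replace the date range in the string with 1 Jan to 31 Dec of the current year
  wfsstring <- str_replace(wfsstring, "(?<=%27).*(?=-12-31)",
                           paste(years[i], "-01-01%27+AND+%27", years[i], sep = ""))
  name <- paste(years[i])
  #Download the data for the current year
  data <- data.frame(st_read(wfsstring))
  #Append the yearly data to the combined dataset
  dataset <- rbind(data, dataset)
  #Also keep the yearly data as a separate object named after the year
  assign(name, data)
}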
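How the regular expression in the loop rewrites the date range can be tried out without downloading anything. The sketch below wraps the same pattern in a hypothetical helper, `set_year_range()`, using base R's `sub()` with `perl = TRUE` instead of `str_replace()`, and applies it to a shortened, made-up URL fragment:

```r
# Rewrite the BETWEEN clause of a webservice URL so it covers a single year.
# The lookbehind/lookahead pattern is the one used in the loop above.
set_year_range <- function(wfs_url, year) {
  sub(
    "(?<=%27).*(?=-12-31)",
    paste0(year, "-01-01%27+AND+%27", year),
    wfs_url,
    perl = TRUE
  )
}

# Assumption: a shortened, hypothetical URL fragment keeps the example short
template <- paste0("where=datetime_search+BETWEEN+",
                   "%272016-01-01%27+AND+%272021-12-31%27")

# One URL per year, named after the year
urls <- vapply(c("2016", "2017"), set_year_range, character(1),
               wfs_url = template)
```

The pattern matches everything between the first `%27` and the final `-12-31`, so both the start date and the end year are replaced in one step.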
VLIZ is responsible for keeping the RStudio environment of ScheldeMonitor up and running. Besides providing the necessary server and memory capacity, VLIZ will also make sure that all necessary R libraries and packages are installed on the RStudio server. If new libraries or packages need to be installed, users can contact VLIZ to request this.
To accommodate these and other needs of users and contributors, VLIZ maintains a permanent helpdesk. This helpdesk can be contacted through the general address of the ScheldeMonitor:
Helpdesk ScheldeMonitor
Data Centre - Local Services & Projects
Vlaams Instituut voor de Zee vzw
Flanders Marine Institute
InnovOcean site, Wandelaarkaai 7
8400 Oostende, Belgium
For urgent matters or questions, or if users and contributors want to discuss the use of the RStudio environment for certain projects, the project manager of ScheldeMonitor should be contacted:
Jelle Rondelez
Project Manager
Data Centre - Local Services & Projects
Vlaams Instituut voor de Zee vzw
Flanders Marine Institute
InnovOcean site, Wandelaarkaai 7
8400 Oostende, Belgium