Developers guide - NEONScience/eddy4R GitHub Wiki

This page explains how to use Git and GitHub to track your work, how to adjust functions (./pack/) and workflow templates (./flow/), and how to share code developments back to the code base. Additional Docker resources are also provided.

Content in this page includes:

Use GitHub in dockerized Rstudio Server

  • once logged into Rstudio Server with access to the host file system (see 2.2 Access eddy4R in User's guide), only a few additional steps are needed to utilize Git and GitHub. This assumes that you already have a GitHub account, that you are permitted to access the eddy4R repository, and that you have created your own copy in your fork. If not, proceed to section 3.2 Github users guide first before continuing with the steps below.
  • in this way, use of the most current and fully configured eddy4R Docker image can be extended from pure data processing to version-controlled editing
  • step 4 below needs to be executed only the first time a project is set up (an existing Git project can later be re-opened), and the remaining steps need to be executed only the first time a new Docker container is run
    1. in Rstudio -> Tools -> Global Options -> Git/SVN -> click "create RSA key" -> close the dialogue window -> click "View public key" -> copy the key to the clipboard

    2. go to GitHub SSH and GPG keys and add the new SSH key

    3. go to the landing page of the desired GitHub repository, select "SSH", and copy the SSH handle to the clipboard (example: "git@github.com:stefanmet/eddy4R-stefanmet.git")

    4. go to Rstudio -> File -> New Project -> Version Control -> Git -> paste the SSH handle in the field for "Repository URL" -> browse to the folder location in the Docker container that corresponds to the host file system -> click "Create Project"

    5. go to Rstudio -> Tools -> Shell -> execute the following commands while replacing GITHUB-EMAIL and GITHUB-USER with the email address and user name used when registering on GitHub:

       git config --global user.email "GITHUB-EMAIL"
       git config --global user.name "GITHUB-USER"
       git config --list
      
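The key setup in step 1 can also be done from the shell; a minimal sketch (the key path and file name here are assumptions, adjust as needed):

```shell
# generate an RSA key pair without a passphrase (replace GITHUB-EMAIL as above)
ssh-keygen -t rsa -b 4096 -C "GITHUB-EMAIL" -f ~/.ssh/id_rsa_github -N ""
# print the public key, then paste it into GitHub -> Settings -> SSH and GPG keys
cat ~/.ssh/id_rsa_github.pub
```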

[Go back to the top of the page]

Github developers guide

This section lays out the essentials of using Git and GitHub for:

  • adjusting existing workflows for your own purposes and to your own data,
  • keeping track of your changes, and
  • staying in sync with the code base.

The content in this section includes:

[Go back to the top of the page]

Creating your very own copy

  • Create a Github account and log in
  • Send an email to the eddy4R team ([email protected]) and request to be added to the eddy4R repository
  • Accept the invitation to the eddy4R repository and open its Github page
  • For most users, this repository is read-only. Here are the necessary steps to be able to modify the code for your own data/workflows:
  • Fork the repository to your own Github account: http://r-pkgs.had.co.nz/git.html#pr-make
  • Rename your fork by adding your Github username to it. This is done in your fork on github.com by clicking "settings" (in the menu on the right side of the screen), changing the repository name from originalname to originalname-username, and confirming by clicking "rename".

[Go back to the top of this section]

[Go back to the top of the page]

Connecting Rstudio to Github

[Go back to the top of this section]

[Go back to the top of the page]

Working in feature branches

Creating a branch

  • In your own fork, the default branch is "master", which is your reference and should not be touched for "exploring".

  • So create a new branch, e.g. "my-workflows": This branch is where you can modify and play with the code.

      git checkout -b my-workflows
    
  • Next, this local branch also needs to be pushed to GitHub:

      git push -u origin my-workflows
      OR
      git push --all -u
    
  • {ALTERNATIVELY: Checkout remote Git branch}

      git fetch
      git branch -v -a
      git checkout my-workflows
      OR (in case of multiple remotes)
      git checkout -b my-workflows name-of-remote/my-workflows
    

[Go back to the top of this section]

[Go back to the top of the page]

Confirming that everything runs as designed

  • Adjust the working directory in the workflow file to the local folder on your computer where you are working
    • Attention Windows users: R uses Unix-convention forward slashes "/" to indicate directories!
  • Test-run your working version with provided input "gold-files" and ensure that produced outputs are identical to the originally provided output "gold-files".
  • If necessary, resolve any conflicts, missing libraries etc.
  • Tip: When you select the "master" branch in Rstudio, the project directory on your hard drive will show you the "master" branch files. When you select another branch in Rstudio, e.g. "my-workflows", the project directory on your hard drive will show you the files belonging to that branch.
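The gold-file comparison above can be sketched with diff (the ./out and ./out_gold paths are assumptions; adjust to wherever your outputs and the provided output gold files live):

```shell
# exits 0 and prints the message only when both directory trees are identical
diff -r ./out ./out_gold && echo "outputs match the gold files"
```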

[Go back to the top of this section]

[Go back to the top of the page]

Adjusting to your needs

  • Make sure you always work on/modify your my-workflows branch, never your master branch
  • Open the workflow file you want to use as a template for your work and save your own copy with an informative addition to the file name. In that way your own workflow file will never be "overwritten" when you later synchronize with the work of others.
  • Adjust only the workflow in your own workflow file; never modify the original workflow templates or the underlying functions in the /routines folder (that is possible, but requires additional tasks; see section 3.3 Adjust functions and workflow templates, and share code)
  • Commit your changes frequently, accompanied by a concise but expressive commit message. Make sure to always commit before changing branches (checking out other branches), as uncommitted changes will be lost
  • append to the previous commit
    • in the commit dialogue: use the "Amend previous commit" checkbox
  • Organize your data
    • Github limits: 1GB per repository, 100MB per file!
    • Hence, it is necessary to keep data (large files) separate from the algorithms (small files, in the project directory)
    • The project directory only contains R definition and wrapper functions in the /routines folder, and R workflow files in the /workflows folder
    • Create new folders (not under the project directory!) for /in, /out... to hold your own data
    • Modify the respective paths in your own workflow file
  • Start plugging away with your own data!
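The data organization above might be sketched as follows (the paths are hypothetical; any location outside the project directory works):

```shell
# keep large data outside the Git project directory, per the GitHub size limits
mkdir -p ~/eddy4R-data/in ~/eddy4R-data/out
```

Then point the input/output paths in your own workflow file at these folders.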

[Go back to the top of this section]

[Go back to the top of the page]

Undoing changes

  • in Rstudio, close all files that were affected by any commits that are to be undone

  • open the Git shell and figure out which commit(s) need to be undone, e.g. by looking at the differences between two branches, base-branch and feature-branch, similar to a pull request:

      git log --left-right --graph --cherry-pick --oneline base-branch...feature-branch
    
  • undoing an individual commit at any point in time

      git revert <SHA>
    
  • Note that an additional commit message is required to revert the changes. On a Windows platform, a VIM window will open to allow you to do this. Type your commit message, press Esc to exit editing mode, then type :wq to save and close the window.

  • undoing a series of commits back to a specific point in time

  • reverting beyond past merges

      git revert -m 1 HEAD
      git cherry-pick -m 1 <SHA>
    
  • removing commits

    • git rebase: remove one or more consecutive commits

        git rebase --onto <branch name>~<first commit number to remove> <branch name>~<first commit to be kept> <branch name>
        git push -f <remote-name> <branch-name>
      
      • "A range of commits could also be removed with rebase…"

      • first commit number to remove: if a merge commit is specified, all merged commits will be removed (don't specify/include those). Example for branch "fixEcseSens": the 8th and 9th latest commits need to be removed, where commit number 8 merged commit number 9 into the "fixEcseSens" branch. Then specify only commit number 8 to be removed (which will automatically remove all merged commits, incl. commit number 9):

          git rebase --onto fixEcseSens~8 fixEcseSens~7 fixEcseSens
        
    • git cherry-pick: remove non-consecutive commits

[Go back to the top of this section]

[Go back to the top of the page]

Setting up processing environments

  • strategy "FORK -> BRANCH -> FOLDER -> CLONE"
    • create your personal fork of the upstream (central) repo
    • for each project you are working on
      • create a project branch in your personal fork
      • under the project branch, create a project folder
    • create any desired number of clones (local copies) for development (-DEVEL) and execution (-EXEC1, -EXEC2…)
  • executions: clone of -stefanmet fork: /stefanmet/LI7200-Niwot-Ridge/EXEC1…
  • wrap-up upon completion/publication
    • diff and copy working improvements to clone 0-stefanmet-DEFAULT/my-workflows, pull into upstream
    • leave final clone for archiving {vs. copy it to local project folder for archiving}?
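The FORK -> BRANCH -> FOLDER -> CLONE strategy above might look like this on the command line (fork, branch, and folder names are hypothetical, following the examples in this section):

```shell
# one development clone and one execution clone of the same project branch
git clone -b LI7200-Niwot-Ridge git@github.com:stefanmet/eddy4R-stefanmet.git LI7200-Niwot-Ridge-DEVEL
git clone -b LI7200-Niwot-Ridge git@github.com:stefanmet/eddy4R-stefanmet.git LI7200-Niwot-Ridge-EXEC1
```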

[Go back to the top of this section]

[Go back to the top of the page]

Staying in sync with the code base

  • users with read access need to "watch" the repository to get updates, and should check their notification settings
  • This is to make sure that you use the newest fixes, have access to the newest features, and that your work is up-to-date with everyone else.
  • It is good conduct to synchronize regularly (weekly to monthly), here the relevant steps:

Synchronizing your fork's master
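A minimal sketch of syncing the fork's master, assuming the central repository is registered as a remote named "upstream" (the same pattern as in "Pulling from someone else's remote into a local branch" below):

```shell
# update the local master from the central (upstream) repo,
# then mirror it to your own fork (origin) on GitHub
git checkout master
git fetch upstream
git merge upstream/master
git push origin master
```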

Synchronizing your fork's branches

  • Next, synchronize your fork's branch with your fork's master

  • To avoid the risk of losing any of your own work, first create an experimental branch "my-dataflows-merge01" based on your branch "my-dataflows" (see Sect. "Creating a branch" above).

  • That's where we try out the actual merge:

      git checkout master
      git pull
      git checkout my-dataflows-merge01
      git merge master
      git push origin my-dataflows-merge01
    

[Go back to the top of this section]

[Go back to the top of the page]

Confirming that everything runs as designed

  • Ensure with a small piece of your data that your processing produces the same results before and after the merge.

  • In case the results before and after the merge check out, you can clean up your branches and continue your work with your now up-to-date my-dataflows branch:

      git branch --delete my-dataflows
      git push origin --delete my-dataflows
      git branch -m my-dataflows-merge01 my-dataflows
      git push -u origin my-dataflows
    
  • In case the results before and after the merge don't match:

    • Trace the difference to the function that does something different than before, and open an issue in the code base repository with the following information:
      • Name of your fork and the branch that you used for merging
      • What is different in the results from before?
      • Which function causes the difference?
    • In the meantime, you can simply continue working in your branch my-dataflows. Once the issue is resolved, repeat the above procedure with a new branch my-dataflows-merge02 etc. until everything checks out. At that point you can also delete the now obsolete branches my-dataflows-merge01 etc.:
      • git branch --delete my-dataflows-merge01
      • git push origin --delete my-dataflows-merge01

[Go back to the top of this section]

[Go back to the top of the page]

Resolving merge conflicts

  • Merge conflicts only occur when (i) you have uncommitted changes in my-dataflows-merge01, or (ii) someone else has been editing the same portion of code as yourself (3-way merge).
  • In case (i) you simply need to commit and push any pending changes in my-dataflows-merge01 and retry the merge.
  • Case (ii) should not occur as long as you restrict your work to your own workflow file and don't touch any of the underlying functions in /routines. If, however, you get such a merge conflict, you have likely changed something either in a workflow template and/or in the underlying functions in /routines. Great! That means you are now a developer, and you should continue with the "Resolve merge conflicts" section in 3.3 Adjust functions and workflow templates, and share code.
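When case (ii) does occur, Git writes conflict markers into the affected file; a minimal sketch of resolving them by hand (the file path is hypothetical):

```shell
# A conflicted file contains markers like:
#   <<<<<<< HEAD
#   your version of the line(s)
#   =======
#   the other branch's version
#   >>>>>>> master
# Edit the file so only the desired lines remain, then stage and commit:
git add path/to/conflicted-file.R
git commit -m "resolve merge conflict"
```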

Github issue tracking

  • For bugs, extension, additions… that are needed
  • Possibility to assign people to issues and to refer to issues by number

Viewing Git commit history

See here for how to view the Git commit history

  • git log --since=2.days: this command lists the commits made in the last two days

Using git tags

  • you can list the tags with $ git tag -l and then checkout a specific tag:

    $ git checkout tags/<tag_name>

  • Even better, check out and create a branch (otherwise you will end up in a "detached HEAD" state at the tag's revision):

    $ git checkout tags/<tag_name> -b <branch_name>

[Go back to the top of this section]

[Go back to the top of the page]

Adjust functions and workflow templates, and share code

This section provides information on how to adjust functions (./pack/) and workflow templates (./flow/), and how to share code developments back to the code base.

  • Goals:
    • incorporate scientific contributions into the underlying functions /pack/eddy4R/R
    • make workflow templates in ./flow/ as generalizable as possible

Content in this section includes:

General info

  • One can only fork once, so keep your own master branch clean!

  • https://www.rstudio.com/wp-content/uploads/2015/03/devtools-cheatsheet.pdf

  • Work sequence

    • First standard user sync

      • In case there are changes that Github cannot handle automatically, it will give a "conflict" message and mark the areas in the files that require fixing.
      • It basically means that someone else has edited the same area in a file as you, and you need to decide which version to keep or how to combine them. For that you need to open the corresponding file, find the marked area, decide on a solution and commit the solution. More info on genomewiki.ucsc.edu, dont-be-afraid-to-commit.readthedocs.org, and stackoverflow.com
      • The solution will only appear in the my-dataflows-merge branch, but not in the master branch. {So theoretically, merge would have to happen both ways to continue with the solution also in the master branches. But instead, we leave the master as is, and ask to pull the contents of my-dataflows-merge into the source repo. Once accepted, the source repo will then also update the own (and all other) master and my-dataflows.}
    • test-run on input "gold-files" to ensure that modifications are compatible with the master repo and that outputs are identical to the original output "gold-files"; resolve any conflicts.

    • create new branch my-workflows-pull from up-to-date my-workflows

      • use .gitignore for R Project files (.Rhistory, .Rproj.user, .Rproj, .RData, user workflow files, in_gold, out_gold, cache)
      • pushing to remote
    • pull request from my-workflows-pull to the source repo: Issue a pull request on github.com, to ask the maintainers of the master repo to incorporate the changes in your fork

  • packaging

    • flow.pack.R function
    • basic Roxygen tags need to be present, otherwise functions are not included during package generation!
  • enable issues tab in fork

    • fork a repo
    • go to the Settings page of your fork
    • check the box next to Issues
  • How to make a permanent link to files and lines

    • click on the line number you want
    • hold down the shift key if you want multiple lines selected
    • get the url for that particular commit by pressing the “y” key
    • copy url from browser
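The .gitignore entries suggested in the work sequence above might look like the following (entries taken from the bullet list; the in_gold/, out_gold/, and cache/ paths are assumptions about the project layout):

```
# R project files
.Rhistory
.Rproj.user/
*.Rproj
.RData
# gold files and cache, kept out of version control
in_gold/
out_gold/
cache/
```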

[Go back to the top of this section]

[Go back to the top of the page]

Pulling from someone else's remote into a local branch

Prerequisite: ensure your master branch is up-to-date from upstream (NEONScience/eddy4R)

	git checkout master
	git fetch upstream
	git merge upstream/master

If you haven't done so already, add someone else's remote to your local list of remotes. Here an example for a new entry "ddurden" in the list of local remotes, with the URL "git@github.com:ddurden/eddy4R-ddurden.git". Then fetch the remote:

	git remote add ddurden git@github.com:ddurden/eddy4R-ddurden.git
	git fetch ddurden

Now create a new local branch (here: "wavelet") from your up-to-date "master" branch, pull the commits from the remote branch (here: "ddurden wavelet"), and push to your own origin:

	git checkout -b wavelet
	git pull ddurden wavelet
	git push -u origin wavelet

Resources: Stackoverflow, GitHubGist.

Stashing changes

In some cases it can be desirable not to commit local changes yet. In these cases they can be temporarily "stashed" out of the way, so they don't appear in the staging area. Stashing via:

	git stash

Returning stashed changes to environment via:

	git stash apply
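Two related commands that may be useful here (standard Git behavior): `git stash list` shows the current stash entries, and `git stash pop`, unlike `apply`, drops the entry after re-applying it:

```shell
git stash list   # inspect what is currently stashed
git stash pop    # re-apply the most recent stash and drop it from the list
```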

[Go back to the top of this section]

[Go back to the top of the page]

Updating the list of Git remotes

When branches are added or deleted online on GitHub, the index of these "remote" branches can become outdated in the local Git / Rstudio. Updating the list of branches on origin:

	git remote update origin --prune

Updating list of branches on upstream:

	git remote update upstream --prune

Check-out GitHub release as branch in a fork

Here an example for release tag "0.2.6-rc4" following this Stackoverflow post. First fetch all tags:

	git fetch --all --tags --prune
	git tag

Then check-out the tag corresponding to the release into a new branch:

	git checkout tags/0.2.6-rc4 -b 0.2.6-rc4

Lastly, push the new branch to origin; specifying "HEAD" according to this Stackoverflow post worked for me:

	git push -u origin HEAD:0.2.6-rc4

[Go back to the top of this section]

[Go back to the top of the page]

Including data or constants for package-wide use

If certain parameters, values, or datasets are commonly used within the functions of a package, it can be useful to predefine them so they are automatically available in the global or function environment when the package is loaded. In this way, they are available without having to set or load them each time they are used. For example, the standard set of packages loaded when RStudio starts includes the "datasets" package. Simply typing "mtcars" at the command prompt displays a set of statistics for common vehicle makes and models, even though there are no objects listed in the global environment. This is because "mtcars" is defined and saved as exported data within the "datasets" package, more on which can be found here.

For the eddy4R suite of packages, use the following procedure to define, document, and call unchanging parameters, variables, or datasets that are auto-loaded with the package.

Defining package data

  1. Save any constant, variable or dataset objects as .rda files within the /data directory of the package.

    • Once you have the objects you want to save, use the devtools::use_data() function with the parameter "internal=FALSE" (this is the default). This will save the .rda files, with the same name as each object, in the /data directory of the package. Also set the parameter "overwrite = TRUE" if you want to update existing .rda files.
    • Set "LazyData: true" in the DESCRIPTION file of the package (this is the default if you create the package using devtools::create).
  2. Include the code used to create the package data in the /data-raw directory of the package.

    • If the /data-raw directory does not already exist for the package, use the one-time-use function devtools::use_data_raw() to set up this directory and include data-raw/ in .Rbuildignore.
    • Alternatively, if /data-raw already exists but is not included in .Rbuildignore, use the one-time-use function "devtools::use_build_ignore("data-raw", escape = TRUE)" to add data-raw to .Rbuildignore
    • Putting the code to create the package data in data-raw will include it in the source version of the package, so it is available for other developers to modify. It will not be included in the production version of the package.
  3. Document the package data using Roxygen headers

    It is important to document the definitions or descriptions of package data, especially since users will not have access to the code used to create it unless they download the source version of the package.

    • Within the R/ directory of the package, create one .R file for each object defined in step 1. Name this script according to the Coding style convention, with base name "docu.data" (e.g. docu.data.conv.R and docu.data.natu.R)
    • Document constants, or variables using the roxygen2 block specific to datasets (example here). For any constants/parameters, make sure to include the value and units of the constant.

Using package data in functions and the global environment

Once data is saved in data/ and the package is compiled, the objects will be available when the package is loaded, and can simply be called by name. For example, eddy4R.base has lists of conversion factors and natural constants defined in the objects Conv and Natu, respectively. After calling "library(eddy4R.base)", typing "Conv$RadDeg" will return the conversion factor from radians to degrees.

However, the most robust way to use package data is to use the double-colon operator (e.g. eddy4R.base::Conv$RadDeg). This refers specifically to the objects as they were defined in the package, making them insensitive to whether or not the object names are overwritten in the global or function environment. As long as the package is installed, the double-colon operator makes the objects accessible throughout the R environment, whether or not the package is loaded.

[Go back to the top of this section]

[Go back to the top of the page]

Integrative Collaboration

  • When collaborating on code development, it is essential to ensure all developers are working with the same version of the Docker image, as different versions can lead to different behavior. We suggest that when a group merges a collaborative pull request, everyone jointly moves to the next image version, specifying a version tag. If a group encounters reproducibility issues, the first step in troubleshooting should be to verify that the image used by all members has the same SHA-256 digest value, which is a unique identifier for each image that is built.
  • When submitting a GitHub issue to the eddy4R repository, please include the SHA-256 digest value for the image used.
  • To determine the current image SHA-256 digest value, use the following commands
    • pull the latest Docker image from Dockerhub, which also displays the image digest (sha256:):

      docker pull REPO/IMAGE:TAG
      docker pull REPO/IMAGE@sha256:DIGEST

    • for reproducible results, the digest of a specific image can be specified when running a container:

      docker run REPO/IMAGE:TAG
      docker run REPO/IMAGE@sha256:DIGEST

    • to display image digests (sha256:) after download:

      docker images --digests


[Go back to the top of this section]

[Go back to the top of the page]

Subtree updates

From the combined repo (i.e. NEON-FIU-algorithm) in the terminal:

  • 1st time:

      git remote add eddy4R git@github.com:NEONScience/eddy4R.git

  • Regular update (all commits in the eddy4R repo are squashed into the combined repo [NEON-FIU-algorithm]):

      git fetch eddy4R
      git subtree pull --prefix=ext/eddy4R --squash eddy4R main
      git subtree push --prefix=ext/eddy4R eddy4R main

Additional Docker resources

[Go back to the top of the page]
