2d. Code Management and Collaboration - EY-Data-Science-Program/2021-Better-Working-World-Data-Challenge GitHub Wiki
Introduction
This page describes the use of Git for this challenge and how it can help individuals and, in particular, teams coordinate and share their solutions and work.
If you are participating in this challenge with other individuals, or you're making use of multiple environments (e.g. some work on your local computer, some in an Azure environment), you almost certainly want to use version control to synchronise between people and platforms.
What is Version Control?
When working with code, it is often a good idea to use some form of Source Code Management (SCM, also referred to as version control). SCM tools are used to track what was changed, when it was changed, why (as long as the author provides a reason!) and by whom.
These capabilities are most useful when multiple people are collaborating on the same code project. They allow teams to safely work on the same files, to share and merge their code, compare versions, rollback to old versions, manage backups, manage concurrent branches of work and lots more.
Learning Git
So you're likely reading this wiki on GitHub. GitHub is a hosting service for Git repositories. Git is one of various version control systems, generally the most popular. A repository is basically a folder of files that is being tracked.
There are plenty resources to learn the basics of using Git, some good ones are listed below.
- The Git Book: Official book on git. Chapter one provides a detailed history and rationale plus installation instructions, Chapter two covers the basics.
- The Git Handbook by Github. This covers basic Git commands and explains how GitHub fits into the picture.
- What is Git?: a Git tutorial by Atlassian. Atlassian also provide a cheat sheet which can prove invaluable for Git users at any level.
- Visualising Git again by GitHub. This is an excellent resource to help you get your HEAD around all concepts in git, especially branching and merging. Try all the basic commands in its interactive terminal and watch the Git tree grow!
EY Data Science Challenge Quickstart
Creating a Repository in GitHub
-
Login to GitHub with your username and password. If you don’t have one, please sign up. After logging on, create a new repository in GitHub by clicking on ‘New’ button as highlighted in the picture below.
-
A new page appears. Type in a repository name and description. Remember to make the repository private so that only your teammates have access to your code. Finally click on ‘Create Repository’ button at the bottom of the page.
What you name the repository is up to you, it doesn't have to be unique since it is specific to your user, so something like
bushfire-challenge
is fine. -
Inviting your teammates to collaborate: Navigate to the main page of the repository you have created. Under your repository name, click Settings. In the left sidebar, click Manage access. Click ‘Invite a collaborator’.
You'll need your teammates GitHub usernames to give them access.
Setup Git On Your Virtual Machine
-
Launch Jupyter notebooks (in your VM) by entering the public domain IP in your browser. Open a new Terminal windows by clicking on the ‘New’ button on the right and then selecting ‘Terminal’.
-
Once the terminal is open, setup your name & email in git by running following commands on terminal
git config --global user.name "Your Name" git config --global user.email "[email protected]"
-
Connect your git client with GitHub by caching your password. Running the following in the command line will store your credentials:
git config --global credential.helper wincred
Clone Your Repository
-
Grab the repository URL from its page in GitHub
-
Clone the repository to your VM with the below command, replacing the URL with the one your copied:
git clone https://github.com/USERNAME/REPO.git
When prompted, please enter your username and password for your GitHub account.
Please note that the https link of the repository to be cloned is taken from GitHub by clicking on the green button ‘code’:
-
You should see your repository folder in your Virtual machine tree. This is your working directory where you put your code.
Commit Code To Your Repository
-
Our repository is empty initially. Let’s assume you create some code in a new notebook:
myrepo/AnalysisCode1.ipynb
in the virtual Machine. You can push this notebook to Github by following commands:cd myrepo git add AnalysisCode1.ipynb
This tells the VM git client to start tracking the file.
Run
git status
for a summary of which files are tracked, new and modified.You can see that AnalysisCode1.ipynb is under
Changes to be committed
so it’s being tracked by our git client (in this case it is on our Virtual Machine). -
Next use
git commit
to commit the changes using the below command. Commit simply creates a checkpoint (or version) that you can revert back to at any time.Use the
-m
flag to include a comment describing what you did. This is required.git commit -m "Add some analysis code"
-
So far you have only committed changes to your local repository. To share with your teammates, you need to push the commits to GitHub.
First it's a good idea to check if any of your teammates have pushed any changes, make sure you have committed all your changes then run:
git pull
If there are change from your teammates, your local repository will be updated with them.
If you have changed the same files as your teammates, you may encounter a merge conflict. This is when git is unable to determine which change to keep and must be manually resolved. The GitHub documentation on merge conflicts provides a detailed overview of this scenario.
Once your local repository is up to date with theremote GitHub version, you can push your changes:
git push
Now your code is stored as a version in your team's Github repository. At this point it's a good idea to communicate to your teammates that you have made changes, so that they know to update their own repositories.
Git Best Practices for Jupyter Notebooks
Using Git Ignore
It's a good idea to instruct git to ignore certain files that are specific to your workspace and may cause problems when your teammates try to run the code.
Create a .gitignore
file in the root of your repository:
touch .gitignore
You can now edit this file in Jupyter (or from the terminal). Add the below content:
.ipynb_checkpoints
*/.ipynb_checkpoints/*
profile_default/
ipython_config.py
Using NBStripout
When working with notebooks, all the code and any output generated is stored in
JSON format. Typically, the cell outputs aren't something we want to track with
git. Jupyter let's us clear the output with menu option Cell
> All Output
>
Clear
, but it's easy to forget to do this before committing your changes.
To do it automatically we can use a tool called nbstripout.
Using a VM terminal (e.g. through Jupyter New > Terminal) runthe below from the repository root:
pip install nbstripout
nbstripout --install
Now when you commit changes, cell output will automatically be cleared.