Git Repositories
Git repositories hosted on GitHub store the data and logs alongside configurations and scripts. They are organized into different branches. Data branches must be purged once a year to avoid exceeding the maximum file size limit on GitHub repositories (currently 100 MB).
FLARE-related code (scripts, config files, LoRa, Nebula) is stored in the following repository, named “miscellaneous”:
FLARE Miscellaneous Repository
To test the gateway setup and for development purposes, we can use mock data.
The mock data configuration, including the name of the output file and the repository/branch for mock data, can be found and modified in each gateway's config file.
For instance, here is a snippet of the config file for a test gateway named Carina:
```yaml
general:
  log_file: general.log
  gateway_name: carina
  gateway_location: datalogger-mock
  gateway_power_mode: ac # "ac" or "battery"
  data_dir: /data
  apps_dir: /home/ubuntu/miscellaneous
  datalogger_data_dir: datalogger-data
  git_repo: git@github.com:FLARE-forecast/miscellaneous.git
  git_data_branch: datalogger-mock-data
  git_logs_branch: carina-logs
  module_toggler_log_file: module-toggler.log

datalogger_mock_data_generator:
  is_enabled: true
  log_file: datalogger-mock-data-generator.log
  data_file: datalogger-mock-data.csv
  interval: 10 # Frequency of generating data in minutes
```

The config file can be found here:
Carina Config File
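For context, here is a minimal sketch of what a mock-data generator driven by this config might look like. This is not the actual generator from the miscellaneous repo; the file path, column names, and value ranges are illustrative assumptions based only on the config keys above.

```bash
#!/bin/bash
# Illustrative sketch only -- not the actual mock-data generator from the miscellaneous repo.
# It appends one fake observation per interval to the data file named in the config above.

DATA_FILE="/data/datalogger-data/datalogger-mock-data.csv"  # data_dir/datalogger_data_dir/data_file
INTERVAL_MIN=10                                             # matches the `interval` setting

# Write a header row once if the file does not exist yet (column names are made up here).
[ -f "$DATA_FILE" ] || echo "timestamp,temperature_c,battery_v" > "$DATA_FILE"

while true; do
  ts=$(date -u '+%Y-%m-%d %H:%M:%S')
  temp=$(awk 'BEGIN{srand(); printf "%.2f", 10 + rand()*15}')  # fake 10-25 C reading
  batt=$(awk 'BEGIN{srand(); printf "%.2f", 12 + rand()}')     # fake 12-13 V reading
  echo "$ts,$temp,$batt" >> "$DATA_FILE"
  sleep $((INTERVAL_MIN * 60))
done
```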
The backup service downloads FLARE repositories to the Jetstream2 Front VM and periodically pushes them to Amazon Glacier.
The code uses the GitHub API and needs a token to access the repositories.
- The GitHub Head Backup service downloads just the latest commits on each branch of every GitHub repo on the 15th of each month at 06:00 AM UTC and pushes them to Amazon Glacier (a sketch of this flow follows this list).
- Backup size: a few hundred MB.
- The service runs as a cron job:

```
00 06 15 * * ~/applications/github-backup/head-backup/github-head-backup.sh
```

- Storage space: `/dev/sdb` mounted on `/home/ubuntu/applications/github-backup/head-backup`
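A minimal sketch of such a head-only backup flow, assuming a `GITHUB_TOKEN` environment variable and the paths above; this is not the actual github-head-backup.sh, and the real script would also handle API pagination and the Glacier upload.

```bash
#!/bin/bash
# Illustrative head-only backup flow -- not the actual github-head-backup.sh.
# Assumes GITHUB_TOKEN holds a token with read access to the FLARE-forecast organization.

ORG="FLARE-forecast"
BACKUP_DIR="$HOME/applications/github-backup/head-backup"

mkdir -p "$BACKUP_DIR" && cd "$BACKUP_DIR" || exit 1

# List repository clone URLs via the GitHub API (first page only; a real script would paginate).
repos=$(curl -s -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/orgs/$ORG/repos?per_page=100" | grep -oP '"clone_url":\s*"\K[^"]+')

for url in $repos; do
  name=$(basename "$url" .git)
  # --depth 1 --no-single-branch fetches only the latest commit of every branch.
  git clone --depth 1 --no-single-branch "https://${GITHUB_TOKEN}@${url#https://}" "$name"
done

# The resulting directories would then be archived (e.g., with tar) and uploaded to Amazon Glacier.
```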
- The GitHub Full Backup service downloads the entire history of each branch of every GitHub repo on the 1st of each month at 06:00 AM UTC and keeps just the last backup locally on the VM.
- Backup size: a few tens of GB.
- The service runs as a cron job:

```
0 06 01 * * ~/applications/github-backup/full-backup/github-full-backup.sh
```
Note 1: Due to lack of space, after each full backup, the backup is moved from the full-backup directory to the head-backup directory, which is on a separate disk, leaving enough space on the full-backup disk for the next backup.
Note 2: Backup files cannot be kept on the VM disk indefinitely; otherwise, the VM will eventually run out of disk space. Old backup files may be removed from the VM disk periodically (e.g., once a year). Backup files on Amazon Glacier, however, are kept indefinitely.
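As a hedged illustration of the move described in Note 1 (the archive file name is hypothetical):

```bash
# Hedged example of the post-backup move described in Note 1; the archive file name is hypothetical.
mv ~/applications/github-backup/full-backup/full-backup-2025-01-01.tar.gz \
   ~/applications/github-backup/head-backup/
```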
GitHub has a hard limit of 100 MB on file size. Sensor data files can exceed this limit, usually within a couple of years of operation. If that happens, pushes from the local repo to the remote repo fail. To prevent this problem, we need to split the data file into smaller files.
We move the older observations (e.g., the previous year's observations) to separate legacy files:
- Current data file, `FCRmet.csv`, in the `fcre-metstation-data` branch: FCRmet.csv
- Legacy files, `FCRmet_legacy_2018.csv`, `FCRmet_legacy_2019.csv`, etc., in the `fcre-metstation-data-qaqc` branch: Legacy Files
File splitting should happen on the gateway itself to make sure the modified data files are not overwritten by unchanged data files from the gateway. To ensure the process is not interrupted by new observations being added to the data files, the datalogger network interface can be taken down while maintenance is ongoing, and the current data in the local Git repositories needs to be synced with the remote GitHub repositories first:
```
sudo ip link set eno1 down
/home/ubuntu/miscellaneous/gateways/git-maintenance/git-push.sh
```

After the file splitting is finished, the interface needs to be brought back up:

```
sudo ip link set eno1 up
/home/ubuntu/miscellaneous/gateways/git-maintenance/git-push.sh
```

The split-file.sh script removes the observations up to a timestamp from the original file and adds them, plus the header rows, to a new file.
https://github.com/FLARE-forecast/miscellaneous/blob/main/gateways/git-maintenance/split-file.sh
```
./split-file.sh <original_file> <new_file> <timestamp_pattern> <num_header_lines>
```

For instance, to move all the observations before 2025 to a legacy file, given that we have 4 header rows:

```
./split-file.sh data.csv data_legacy.csv '2025-01-01 00:00:00' 4
```

- Keep in mind that header rows should be present at the top of all files, including current data files and legacy files.
- The process of splitting files and separating the legacy files should happen annually after the data has been published to EDI.
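For reference, here is a minimal sketch of the splitting logic described above. This is not the actual split-file.sh; it assumes the timestamp pattern matches a data row and that rows are time-ordered.

```bash
#!/bin/bash
# Minimal sketch of the splitting logic described above -- not the actual split-file.sh.
# Usage: ./split-file.sh <original_file> <new_file> <timestamp_pattern> <num_header_lines>
# Assumes the file is time-ordered and a row matching <timestamp_pattern> exists.

original_file="$1"
new_file="$2"
timestamp_pattern="$3"    # e.g. '2025-01-01 00:00:00'
num_header_lines="$4"     # e.g. 4

# Line number of the first observation at the split timestamp.
split_line=$(grep -n -m 1 "$timestamp_pattern" "$original_file" | cut -d: -f1)
[ -z "$split_line" ] && { echo "Pattern not found" >&2; exit 1; }

# Legacy file: header rows plus all observations before the split point.
head -n "$num_header_lines" "$original_file" > "$new_file"
sed -n "$((num_header_lines + 1)),$((split_line - 1))p" "$original_file" >> "$new_file"

# Original file: header rows plus the observations from the split point onward.
{ head -n "$num_header_lines" "$original_file"; tail -n +"$split_line" "$original_file"; } > "${original_file}.tmp"
mv "${original_file}.tmp" "$original_file"
```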
If a file has already exceeded the 100 MB limit and pushes fail, recovering from the error is necessary:
- Back up the large file and delete it from the local repo.
- Remove the large file from Git history.
- Split the large file into smaller parts.
- Add the smaller files back to the repo.
- Commit and push the changes.
If you haven't already installed git-filter-repo, follow the instructions below depending on your operating system:

```
sudo apt install git-filter-repo
```

Replace `path/to/your/large/file` with the actual path to the large file you want to remove from the history:

```
git filter-repo --path path/to/your/large/file --invert-paths
```

This will remove the file from the entire Git history.

After removing the large file, you'll need to force-push the changes to GitHub:

```
git push origin --force --all
```

⚠ Warning: This command will rewrite your history on the remote repository. Make sure to inform any collaborators that they will need to re-clone the repository to avoid issues with their local histories.

You can also run garbage collection to clean up any lingering references to the large file:

```
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```

After these steps, the large file should be completely removed from your history, and you should be able to push your repository to GitHub without any issues.
The purpose is to create a new copy of an existing large repo, purge the history, and keep just the latest commit (HEAD) on each branch.
This process needs to be done for each branch separately.
Note: To prevent Git conflicts or other fatal Git errors related to data logger updates:

- Turn the data logger interface down on the gateway before purging.
- Run `/home/ubuntu/miscellaneous/gateways/git-maintenance/git-push.sh` to push the latest updates.
- Turn the push cron job off (one way to do this is sketched after this list).
- Remember to restore the group access on the new repository.
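One possible (non-authoritative) way to pause the push cron job mentioned above, assuming it lives in the ubuntu user's crontab:

```bash
# Comment out any active crontab line that calls git-push.sh, then re-install the crontab.
crontab -l | sed 's|^\([^#].*git-push\.sh.*\)$|# \1|' | crontab -
# To re-enable it later, remove the leading "# " again (e.g., via `crontab -e`).
```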
In a nutshell, we need to:
- Duplicate a mirror clone from the original repo.
- Create orphan branches to get rid of history in the new repo.
- Delete historical branches in the new repo.
- Archive the old repo.
- Use the new repo instead of the old one.
- Remember to add the Slack bot's token to the repository secrets; see the example after this list. (Ask Zehua if you are not sure about it.)
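A hedged example of adding the token as a repository secret with the GitHub CLI; the secret name `SLACK_BOT_TOKEN` is an assumption, so use whatever name the workflow actually expects:

```bash
# Store the Slack bot token as a repository secret using the GitHub CLI.
# SLACK_BOT_TOKEN (both the secret name and the shell variable) is illustrative.
gh secret set SLACK_BOT_TOKEN --repo FLARE-forecast/CCRE-data --body "$SLACK_BOT_TOKEN"
```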
For instance, here is the process for CCRE-data:
```
sudo apt-get update
sudo apt-get install curl
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
sudo sh -c 'echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" > /etc/apt/sources.list.d/github-cli.list'
sudo apt-get update
sudo apt-get install gh
gh --version
gh auth login
```

- Change the name of `FLARE-forecast/CCRE-data` to `FLARE-forecast/CCRE-data-archive` via GitHub Settings > General > Repository name.
- Make it read-only and archive it from Settings > Danger Zone > Archive this repository. (A CLI equivalent of these two steps is sketched below.)
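If you prefer the CLI, a possible equivalent of those two Settings steps (assuming a recent gh version) is:

```bash
# Rename the original repo and archive it from the command line instead of the Settings page.
gh repo rename CCRE-data-archive --repo FLARE-forecast/CCRE-data
gh repo archive FLARE-forecast/CCRE-data-archive --yes
```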
```
git clone https://github.com/FLARE-forecast/CCRE-data-archive
```

```
for branch in $(git branch -r | grep -v '\->'); do
  local_branch="${branch#origin/}"
  # Skip if branch is main or already exists locally
  if [[ "$local_branch" == "main" ]] || git show-ref --verify --quiet "refs/heads/$local_branch"; then
    echo "Skipping existing or default branch: $local_branch"
    continue
  fi
  git checkout --track "$branch"
done
```

Remove the remote origin:

```
git remote remove origin
```

Switch to the branch to be purged and create an orphan branch from it:

```
git switch ccre-dam-data
git checkout --orphan ccre-dam-data-head
git add .
git commit -m "Initial purging commit for ccre-dam-data"
```

⚠ Warning: This command will delete the branch. Make sure to perform this operation on the new repository.

```
git branch -D ccre-dam-data
git branch -m "ccre-dam-data-head" "ccre-dam-data"
```

Example: Purging annie-logs without re-creating the empty repository:
```
git switch annie-logs
git checkout --orphan annie-logs-head
git add .
git commit -m "Initial purging commit for annie-logs-head"
git branch -D annie-logs
git branch -m "annie-logs-head" "annie-logs"
```

Create the new repository and add it as the remote:

```
gh repo create FLARE-forecast/CCRE-data-head --public --confirm
git remote add origin https://github.com/FLARE-forecast/CCRE-data-head.git
```
```
# Make sure the default branch is main; otherwise, replace main with the default branch's name
git switch main
git push origin main
```

```
for branch in $(git for-each-ref --format='%(refname:short)' refs/heads/); do
  echo "Pushing branch: $branch"
  git push origin "$branch"
done
```

At the end of the process, we should have two separate repositories:
- The original `FLARE-forecast/CCRE-data-archive`
- The new `FLARE-forecast/CCRE-data-head` with no-history branches

Change `FLARE-forecast/CCRE-data-head` to `FLARE-forecast/CCRE-data` and use it as the main repo from now on.
Wherever the repo has been cloned (including on gateways), rename the local repo and clone the new one:

```
git clone https://github.com/FLARE-forecast/CCRE-data
```

Also, don't forget to grant the new repository the same group access as the archived one had.
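A hedged example of granting a team access via the GitHub API; the team slug `flare-data` and the permission level are assumptions, so match whatever the archived repo had:

```bash
# Give the (hypothetical) flare-data team push access to the new repository.
gh api -X PUT "/orgs/FLARE-forecast/teams/flare-data/repos/FLARE-forecast/CCRE-data" \
  -f permission=push
```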
Gateways may experience a Git error:

```
Error: object file .git/objects/xxx is empty
```

This is usually caused by staging errors or syncing issues on local repositories, especially on battery-powered gateways with intermittent internet connections.
- Back up the problematic local repo.
- Delete the problematic repo entirely.
- Re-clone the repository from the remote Git repo:

```
git clone https://github.com/FLARE-forecast/CCRE-data
```

- Replace the data or log files from the backup without replacing Git objects (see the sketch after this list).
- Commit and push the changes.
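A minimal sketch of the restore step, assuming the backup lives at an illustrative path; rsync copies the data and log files back while `--exclude='.git'` leaves the freshly cloned Git objects untouched:

```bash
# Restore data/log files from the backup copy without touching the .git directory (paths illustrative).
rsync -av --exclude='.git' /home/ubuntu/backup/CCRE-data/ /home/ubuntu/CCRE-data/
```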
Due to slow LoRa connection speed, pushing data and logs from LoRa-connected gateways (e.g., at the FCR weir) can be tricky. If the internet is more unstable than usual, it may cause a commit backlog.
- Keep data and log pushes from LoRa-connected gateways going.
- If issues arise, resolve them ASAP before they create a major backlog.
A method implemented in git-push.sh helps push data and logs incrementally, instead of pushing everything at once:
```
for commit in $(git log --reverse --format="%H" --branches --not --remotes); do
  git push --force origin $commit:refs/heads/$(git rev-parse --abbrev-ref HEAD) || continue
done
```

If the gateway cannot recover from a backlog, connect it to faster internet:
- Disable LoRa on the gateway:
  - Run the toggler on the gateway.
  - Choose `lora_radio` and toggle the module status to `disabled`.
  - Reboot the gateway.
- Manually run `git-push.sh`:

```
/home/ubuntu/miscellaneous/gateways/git-maintenance/git-push.sh
```

- Re-enable LoRa:
  - Run the toggler again.
  - Choose `lora_radio` and toggle the module status to `enabled`.
  - Reboot the gateway.
This command is an example of reading the TNC logs from 2025-09-15:

```
for file in /data/norvel-logs/tnc0/2025-09-15_*.pcap; do echo "Reading $file"; tcpdump -r "$file"; done
```