Git Repositories

Description

Git repositories hosted on GitHub store the data and logs, alongside configurations and scripts, organized into different branches. Data branches must be purged once a year to avoid exceeding the maximum file size on GitHub repositories (currently 100 MB per file).


Setup and Configuration

FLARE-related Code

FLARE-related code (scripts, config files, LoRa, Nebula) is stored in the following repository, named “miscellaneous”:
FLARE Miscellaneous Repository

Mock Data

To test the gateway setup and for development purposes, we can use mock data.

Mock Data Generator Script

Mock Data Generator Script

Mock Datalogger Data Repository/Branch

The mock data configuration, including the name of the output file and the repository/branch for mock data, can be found and modified in the config file of each gateway.

For instance, here is a snippet of the config file for a test gateway named Carina:

general:
  log_file: general.log
  gateway_name: carina
  gateway_location: datalogger-mock
  gateway_power_mode: ac # "ac" or "battery"
  data_dir: /data
  apps_dir: /home/ubuntu/miscellaneous
  datalogger_data_dir: datalogger-data
  git_repo: [email protected]:FLARE-forecast/miscellaneous.git
  git_data_branch: datalogger-mock-data
  git_logs_branch: carina-logs
  module_toggler_log_file: module-toggler.log

datalogger_mock_data_generator:
  is_enabled: true
  log_file: datalogger-mock-data-generator.log
  data_file: datalogger-mock-data.csv
  interval: 10 # Frequency of generating data in minutes
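For reference, here is a minimal sketch of what such a generator might look like. The real script is linked above; the output path follows the config values shown, and the CSV row format is an assumption for illustration:

#!/bin/bash
# Hypothetical minimal mock-data generator; the real script is linked above.
# Output path derived from data_dir, datalogger_data_dir, and data_file above.
data_file="/data/datalogger-data/datalogger-mock-data.csv"
interval_minutes=10

while true; do
  # Append a timestamped row with a random "sensor" value.
  echo "\"$(date -u '+%Y-%m-%d %H:%M:%S')\",$((RANDOM % 100))" >> "$data_file"
  sleep $((interval_minutes * 60))
done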

Config File Location

The config file can be found here:
Carina Config File

Sample Mock Data Output for the Test Gateway (Carina)

Mock Data Output

Sample Mock Data Logs for the Test Gateway (Carina)

Mock Data Logs

Test Gateway (Carina) Startup Notifications

Startup Notifications

FLARE Repositories Backup Service

The backup service periodically downloads the FLARE repositories onto the Jetstream2 Front VM and pushes them to Amazon Glacier.

Backup Service Code

GitHub Backup Services

The code uses the GitHub API and needs a token to access the repositories.
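Since the service authenticates with a token, here is a hedged example of listing the organization's repositories via the standard GitHub REST API (the token variable name is illustrative):

# List clone URLs of all FLARE-forecast repositories using a personal access token.
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" \
  "https://api.github.com/orgs/FLARE-forecast/repos?per_page=100" | grep '"clone_url"'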

GitHub Head Backup

  • Downloads just the latest commit on each branch of every GitHub repo on the 15th of each month at 06:00 UTC and pushes it to Amazon Glacier (see the sketch after this list).

  • Backup size: A few hundred MB.

  • The service runs as a cron job:

    00 06 15 * * ~/applications/github-backup/head-backup/github-head-backup.sh
  • Storage space: /dev/sdb mounted on /home/ubuntu/applications/github-backup/head-backup
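A minimal sketch of the head-backup idea for a single repo is shown below. The actual logic lives in github-head-backup.sh; the bucket name and paths here are assumptions:

repo="FLARE-forecast/miscellaneous"
stamp=$(date -u +%Y-%m-%d)

# A shallow clone with --no-single-branch fetches only the tip commit of every branch.
git clone --depth 1 --no-single-branch "https://github.com/$repo" "$stamp/$(basename "$repo")"

# Archive and upload to a Glacier-class storage tier (bucket name is hypothetical).
tar -czf "$stamp-head-backup.tar.gz" "$stamp"
aws s3 cp "$stamp-head-backup.tar.gz" s3://flare-backups/head/ --storage-class DEEP_ARCHIVE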

GitHub Full Backup

  • The GitHub Full Backup service downloads the entire history of each branch of every GitHub repo on the 1st of each month at 06:00 UTC and keeps just the latest backup locally on the VM.
  • Backup size: a few tens of GB.
  • The service runs as a cron job:
    0 06 01 * * ~/applications/github-backup/full-backup/github-full-backup.sh

Note 1: Due to limited disk space, each full backup is moved afterward from the full-backup directory to the head-backup directory, which is on a separate disk, leaving enough space on the full-backup disk for the next backup.

Note 2: Backup files cannot be kept on the VM disk indefinitely; otherwise, the VM will eventually run out of disk space. Old backup files may be removed from the VM disk periodically (e.g., once a year). Backup files on Amazon Glacier, however, are kept indefinitely.
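A hypothetical command for that periodic cleanup (the archive naming and path are assumptions):

# Delete local backup archives older than roughly a year.
find ~/applications/github-backup/head-backup -name '*.tar.gz' -mtime +365 -delete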


Maintenance and Troubleshooting

File Exceeds 100 MB Limit on Local Git Repo

GitHub enforces a hard limit of 100 MB per file. Sensor data files can exceed this limit, usually within a couple of years of operation. When that happens, pushes from the local repo to the remote repo fail. To prevent this problem, we need to split the data file into smaller files.
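To spot files approaching the limit before pushes start failing, something like the following can be run on the gateway (the data path is an assumption):

# List files larger than 90 MB, i.e., nearing GitHub's 100 MB limit.
find /data -type f -size +90M -exec ls -lh {} \;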

Legacy Files

We move the older observations (e.g., the previous year's observations) to separate legacy files.

Example:

  • Current data file, FCRmet.csv in fcre-metstation-data branch:
    FCRmet.csv

  • Legacy files, FCRmet_legacy_2018.csv, FCRmet_legacy_2019.csv, etc., in fcre-metstation-data-qaqc branch:
    Legacy Files

File Splitting Process

File splitting should happen on the gateway itself, to make sure the modified data files are not overwritten by unchanged data files from the gateway. To ensure the process is not interrupted by new observations being appended to the data files, the datalogger network interface can be taken down while maintenance is ongoing, and the current data in the local Git repositories must first be synced with the remote GitHub repositories:

sudo ip link set eno1 down
/home/ubuntu/miscellaneous/gateways/git-maintenance/git-push.sh

After the file splitting is finished, the interface needs to be brought back up and the repos synced again:

sudo ip link set eno1 up
/home/ubuntu/miscellaneous/gateways/git-maintenance/git-push.sh

Split File Script

The split-file.sh script removes the observations up to a given timestamp from the original file and writes them, along with the header rows, to a new file.

Script URL

https://github.com/FLARE-forecast/miscellaneous/blob/main/gateways/git-maintenance/split-file.sh

Script Usage

./split-file.sh <original_file> <new_file> <timestamp_pattern> <num_header_lines>

For instance, to move all observations before 2025 to a legacy file, given that the files have 4 header rows:

./split-file.sh data.csv data_legacy.csv '2025-01-01 00:00:00' 4
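For illustration, the core of such a split could look like the awk sketch below. This is a hedged re-creation of the idea, not the actual script (see the URL above):

#!/bin/bash
# Hedged sketch of the split logic; see the linked split-file.sh for the real script.
original_file="$1"; new_file="$2"; timestamp_pattern="$3"; num_header_lines="$4"

awk -v n="$num_header_lines" -v ts="$timestamp_pattern" '
  NR <= n { print > new; print > keep; next }   # header rows go into both files
  !seen && index($0, ts) { seen = 1 }           # first row matching the cutoff timestamp
  seen { print > keep; next }                   # rows at/after the cutoff stay current
  { print > new }                               # earlier rows become legacy
' new="$new_file" keep="${original_file}.tmp" "$original_file"

# Replace the original with the trimmed version.
mv "${original_file}.tmp" "$original_file"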

Notes:

  1. Keep in mind that the header rows must be present at the top of all files, including current data files and legacy files.
  2. Splitting files and separating the legacy files should happen annually, after the data has been published to EDI.

Recovering from Git Large File Detected Error

If a file exceeds the 100 MB limit, recovering from the error is necessary. The steps are listed below, followed by a combined sketch:

Steps:

  1. Back up the large file and delete it from the local repo.
  2. Remove the large file from Git history.
  3. Split the large file into smaller parts.
  4. Add the smaller files back to the repo.
  5. Commit and push the changes.
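Concretely, the flow might look like the following hedged sketch. The file names, repo URL, and split cutoff are illustrative; each step is detailed in the subsections below:

cp FCRmet.csv ~/FCRmet.csv.bak                    # 1. back up the large file
git filter-repo --path FCRmet.csv --invert-paths  # 2. purge it from the entire history
cp ~/FCRmet.csv.bak FCRmet.csv                    # restore the working copy from the backup
./split-file.sh FCRmet.csv FCRmet_legacy_2024.csv '2025-01-01 00:00:00' 4   # 3. split it
git add FCRmet.csv FCRmet_legacy_2024.csv         # 4. add the smaller files back
git commit -m "Split oversized data file"         # 5. commit...
git remote add origin [email protected]:FLARE-forecast/FCRE-data.git  # filter-repo removes the remote
git push origin --force --all                     # ...and force-push the rewritten history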

Remove the Large File from Git History

Install git-filter-repo

If you haven't already installed git-filter-repo, install it; on Debian/Ubuntu:

sudo apt install git-filter-repo

Run git filter-repo to Remove the File

Replace path/to/your/large/file with the actual path to the large file you want to remove from the history:

git filter-repo --path path/to/your/large/file --invert-paths

This will remove the file from the entire Git history.

Force Push the Rewritten History to GitHub

After removing the large file, you’ll need to force-push the changes to GitHub:

git push origin --force --all

Warning: This command will rewrite your history on the remote repository.
Make sure to inform any collaborators that they will need to re-clone the repository to avoid issues with their local histories.

Clean Up Dangling References

You can also run garbage collection to clean up any lingering references to the large file:

git reflog expire --expire=now --all
git gc --prune=now --aggressive

After these steps, the large file should be completely removed from your history, and you should be able to push your repository to GitHub without any issues.

Purging Git History of a Specific Branch on GitHub

The purpose is to create a new shallow clone of an existing large repo, purge the history, and just keep the latest commit (HEAD).
This process needs to be done for each branch separately.

Note: To prevent Git conflicts or other fatal Git errors related to datalogger updates:

  • Turn the datalogger interface down on the gateway before purging.
  • Run /home/ubuntu/miscellaneous/gateways/git-maintenance/git-push.sh to push the latest updates.
  • Turn off the push cron job (see the sketch after this list).
  • Remember to restore the group access on the new repository.
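One hedged way to turn off the push cron job, assuming it runs from the gateway user's crontab and references git-push.sh:

# Comment out any crontab entries that call git-push.sh, keeping the rest intact.
crontab -l | sed 's|^\([^#].*git-push\.sh.*\)|# \1|' | crontab -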

In a nutshell, we need to:

  1. Duplicate a mirror clone from the original repo.
  2. Create orphan branches to get rid of history in the new repo.
  3. Delete historical branches in the new repo.
  4. Archive the old repo.
  5. Use the new repo instead of the old one.
  6. Remember to add the Slack bot's token to the repository secrets (see the example below; ask Zehua if you are not sure about it).
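For the secret, a hypothetical gh command (the secret name is illustrative; confirm the actual name with Zehua):

gh secret set SLACK_BOT_TOKEN --repo FLARE-forecast/CCRE-data --body "$SLACK_BOT_TOKEN"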

For instance, here is the process for CCRE-data:

0. Install GitHub CLI

sudo apt-get update
sudo apt-get install curl

curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg

sudo sh -c 'echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" > /etc/apt/sources.list.d/github-cli.list'

sudo apt-get update
sudo apt-get install gh

Check if the installation is successful:

gh --version

Authenticate to GitHub:

gh auth login

1. Make the Repository Read-only

  • Change the name of FLARE-forecast/CCRE-data to FLARE-forecast/CCRE-data-archive via GitHub Settings > General > Repository name.
  • Make it read-only by archiving it via Settings > Danger Zone > Archive this repository.

2. Clone the Original Repository

git clone https://github.com/FLARE-forecast/CCRE-data-archive
cd CCRE-data-archive

# Create a local tracking branch for every remote branch.
for branch in $(git branch -r | grep -v '\->'); do
  local_branch="${branch#origin/}"

  # Skip if branch is main or already exists locally
  if [[ "$local_branch" == "main" ]] || git show-ref --verify --quiet "refs/heads/$local_branch"; then
    echo "Skipping existing or default branch: $local_branch"
    continue
  fi

  git checkout --track "$branch"
done

Remove the remote origin:

git remote remove origin

3. Create an Orphan Copy from a Historical Branch

git switch ccre-dam-data
git checkout --orphan ccre-dam-data-head

4. Add All Files as a Single New Commit

git add .
git commit -m "Initial purging commit for ccre-dam-data"

5. Replace the Historical Branch

⚠ Warning: This command will delete the branch. Make sure to perform this operation on the new repository.

git branch -D ccre-dam-data
git branch -m "ccre-dam-data-head" "ccre-dam-data"

6. Repeat the Process for Another Branch

Example: Purging annie-logs without re-creating the empty repository:

git switch annie-logs
git checkout --orphan annie-logs-head

git add .
git commit -m "Initial purging commit for annie-logs-head"

git branch -D annie-logs
git branch -m "annie-logs-head" "annie-logs"

7. Create a New Empty Repository

gh repo create FLARE-forecast/CCRE-data-head --public --confirm

8. Push the Purged Copy to the Newly Created Repository

git remote add origin https://github.com/FLARE-forecast/CCRE-data-head.git

# make sure the default branch is main; replace main with the actual default branch's name if different
git switch main
git push origin main

for branch in $(git for-each-ref --format='%(refname:short)' refs/heads/); do
  echo "Pushing branch: $branch"
  git push origin "$branch"
done

Final Steps

At the end of the process, we should have two separate repositories:

  1. The original FLARE-forecast/CCRE-data-archive
  2. The new FLARE-forecast/CCRE-data-head with no-history branches

Rename the New Repo

  • Change FLARE-forecast/CCRE-data-head to FLARE-forecast/CCRE-data and use it as the main repo from now on.

Update Cloned Repositories

Wherever the repo has been cloned (including on gateways), rename the local repo and clone the new repo:

git clone https://github.com/FLARE-forecast/CCRE-data

Also, don't forget to grant the new repository the same group access as the archived one had.

Git Object File Empty Error

Gateways may experience a Git error:

Error: object file .git/objects/xxx is empty

This is usually caused by staging errors or syncing issues on local repositories, especially on battery-powered gateways with intermittent internet connections.

Workaround

  1. Back up the problematic local repo (the full flow is sketched after these steps).

  2. Delete the problematic repo entirely.

  3. Re-clone the repository from the remote Git repo:

    git clone https://github.com/FLARE-forecast/CCRE-data
  4. Replace the data or log files from the backup without replacing Git objects.

  5. Commit and push the changes.
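The workaround might look like the following hedged sketch (the repo name and paths are illustrative):

mv CCRE-data CCRE-data-broken                            # 1-2. keep the broken repo as the backup
git clone https://github.com/FLARE-forecast/CCRE-data    # 3. re-clone from the remote

# 4. Restore the data/log files from the backup without touching .git.
rsync -a --exclude '.git' CCRE-data-broken/ CCRE-data/

cd CCRE-data
git add -A
git commit -m "Restore data after repo recovery"         # 5. commit and push
git push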

Backlog on the Gateway at the Weir

What is a Backlog?

Due to slow LoRa connection speeds, pushing data and logs from LoRa-connected gateways (e.g., at the FCR weir) can be tricky. If the internet connection is more unstable than usual, a commit backlog may build up.
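To gauge the size of a backlog, count the commits that exist locally but not on the remote (the same selection the push loop below uses):

git log --oneline --branches --not --remotes | wc -l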

How to Avoid Backlogs

  • Keep data and log pushes from LoRa-connected gateways going.
  • If issues arise, resolve them ASAP before they create a major backlog.

Pushing Commits One by One

A method implemented in git-push.sh helps push data and logs incrementally, instead of pushing everything at once:

# List unpushed commits oldest-first and force-push them one at a time
# to the current branch; if a push fails, move on to the next commit.
for commit in $(git log --reverse --format="%H" --branches --not --remotes); do
    git push --force origin "$commit:refs/heads/$(git rev-parse --abbrev-ref HEAD)" || continue
done

Recovering from a Backlog

If the gateway cannot recover from a backlog, connect it to faster internet:

  1. Disable LoRa on the gateway.

    • Run the toggler on the gateway.
    • Choose lora_radio and toggle the module status to disabled.
    • Reboot the gateway.
  2. Manually Run git-push.sh:

    /home/ubuntu/miscellaneous/gateways/git-maintenance/git-push.sh
  3. Re-enable LoRa:

    • Run the toggler again.
    • Choose lora_radio and toggle the module status to enabled.
    • Reboot the gateway.

Check tncattach Status

The following example reads the TNC logs from 2025-09-15:

for file in /data/norvel-logs/tnc0/2025-09-15_*.pcap; do
  echo "Reading $file"
  tcpdump -r "$file"
done