09 link checker - harvardinformatics/informatics-website GitHub Wiki

The site uses lychee to check all embedded links whenever the site is built. This runs automatically via GitHub Actions; the workflow is defined in .github/workflows/gh-pages.yml. The lychee config file, .github/workflows/lychee.toml, controls some of the arguments passed to lychee, including the list of domains and paths to exclude from checking.

If the link checker fails, check the log by clicking on the red x that appears next to the failed check. If you are on a branch, this will appear at the top of the code browser with the other commit info. If you are in a PR, it will appear directly below the comments and commits. Otherwise, you can navigate to Actions and click on the commit or PR to view the logs.

Cached: Error (cached)

lychee caches the results of previous checks for 30 days to avoid sending too many requests to a domain. However, if a link failed, that failure may itself be cached and reappear on subsequent builds as Cached: Error (cached), even after the link has been fixed. To clear it, navigate to Actions > Caches and delete any entry whose name starts with "cache-lychee". Then return to the log, click "Re-run jobs" in the upper right corner, and select "Re-run failed jobs". Without the cache, lychee checks the link afresh and should pass if the link has been fixed.

Excluding a domain

Occasionally, the link checker will fail repeatedly for the same link with an error like 403: Network error: Forbidden. This is most likely because the link works, but the domain blocks automated requests like the ones lychee sends. In this case, the best course of action is to add the domain to the list of domains that are excluded from the link checker.

  1. Manually check that the link is still active. Clicking on the link directly from the lychee log may produce a malformed URL (e.g. "https://gwct.bio/" might try to load as "https://gwct.bio/](https://gwct.bio/)"), so be sure you are checking the correct link.

  2. If the link is confirmed to be active, add the domain to the exclude list. This is done in the lychee config file at .github/workflows/lychee.toml, which looks something like this:

# lychee.toml

# Optional: enable the cache and set how long cached results are kept
cache = true
max_cache_age = "30d"
max_concurrency = 1
require_https = true
timeout = 5

# Exclude URLs from checking (entries are treated as regular expressions)
exclude = [
    "https://scholar.google.com",
    "https://academic.oup.com/bioinformatics/",
    "https://useast.ensembl.org",
    "https://doi.org",
    "https://academic.oup.com/nar",
    "https://www.gnu.org",
    "https://anaconda.org",
    "https://fonts.gstatic.com",
    "https://www.microsoft.com/en-us/microsoft-365/onedrive/online-cloud-storage",
]

# Exclude files or paths from checking
exclude_path = [
    "assets/home.html",
    "404.html"
]

Add the domain to the exclude list as a quoted string inside the brackets, followed by a comma. Consider also leaving a note in the issue about excludes.
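As a rough sketch of what that edit might look like, the fragment below appends one entry to the end of the exclude list. The domain "https://example.org" is a hypothetical placeholder, not a domain the site actually excludes, and the existing entries are abbreviated here rather than reproduced in full.

```toml
exclude = [
    "https://scholar.google.com",
    # ... existing entries stay unchanged ...
    "https://www.microsoft.com/en-us/microsoft-365/onedrive/online-cloud-storage",
    # Hypothetical new entry: returns 403 to lychee but loads fine in a browser
    "https://example.org",
]
```

Because lychee treats each entry as a pattern rather than a single exact URL, an entry like this should cover every link under that domain, so keep entries as specific as the problem requires.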

You'll notice that paths internal to the project can also be excluded. This is useful if we have a whole page whose links we don't want to pass to the link checker.
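A path exclusion follows the same pattern. In the sketch below, "archive/old-page.html" is a hypothetical file used only for illustration. How the path is matched against the built site's files depends on where lychee is run from in the workflow, so confirm in the next build log that the page was actually skipped.

```toml
exclude_path = [
    "assets/home.html",
    "404.html",
    # Hypothetical new entry: skip every link found in this page
    "archive/old-page.html",
]
```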
