1. Servers and Docker - Shared-Reality-Lab/IMAGE-server GitHub Wiki

Practical notes on server use

Note that this wiki page is very McGill-centric. If you are not on the McGill IMAGE team, you are likely running your own IMAGE test/production servers!

We're using pegasus.cim.mcgill.ca for our production server. A second machine, unicorn.cim.mcgill.ca, is available for testing. Note that other projects are using these servers as well, so we have to be good citizens.

General use

  • Jeff or Juliette can create you an account on unicorn. You'll need to be in the image group to access the /var/docker/image directory, which contains the persistent content for our docker containers, including the website.
  • There are channels for #unicorn and #pegasus in the SRL slack. If you care about what happens on these machines (reboots, storage issues, etc.) make sure you are subscribed.
  • Jaydeep and Juliette (and Jeff as a fallback) are the unicorn/pegasus (and therefore docker) point people for the IMAGE team. Contact one of them if anything is unclear. Also make them aware if you have any issues that would impact server/docker status, since unless you've told them, they will happily let other groups do docker prunes or reboot the server without necessarily checking with the IMAGE team.
  • For machine learning inferencing, unicorn has one NVIDIA graphics card. We cannot put this under heavy load on an ongoing basis without checking with other users. Normally, you bring up your containers to test, do your test, then shut them down to free resources for others.
    • GPUs aren't available to docker containers by default. Be sure to read the docs if you need them.
  • There is currently no automatic backup of these servers, so make sure you have everything saved safely elsewhere in case the entire server fails. Normally, this should mean pulling from git to update the server.

ML Models for IMAGE

  • Do not commit ML models to the git repositories. Git suffers with large, binary files and with large repositories, and so ML models pose a problem.
  • If you are using a model that is publicly available elsewhere:
    • In Docker, download the model into your image using something like wget. Note that this means the download should happen in the Dockerfile, not at runtime!
    • Clearly document where you got the model from. Ideally this would be a repository/archive with the software plus a paper on the model if one exists, but if not, anything necessary so another person can understand what is happening.
  • If you are making your own model:
    • Clearly document the dataset used to train and the method you followed. It should be possible for anyone to recreate your model.
    • Store the model file in the Production Resources` folder in the IMAGE SharePoint.
    • Once it is in the SharePoint folder, ping @jeffbl or @juliette to pull this onto unicorn and pegasus. It will appear on pegasus under /var/docker/image/models which will allow you to download the file into your Docker image from https://pegasus.cim.mcgill.ca/image/models/[YOUR MODEL].

Docker

IMAGE uses many docker containers working together to take requests from the browser extension and return renderings. Please read this section carefully before working with Docker on any of the SRL servers.

Howto

  1. If you need to work with Docker, you'll need to be a member of the docker group. Log out and back in for this change to take effect.
  2. There are many docker tutorials online. If you haven't used docker before, please do some self-study to get the basics. The official docker documentation is good and up to date.
  3. Be careful: With great power comes great responsibility. There is no protection between docker containers or images. If you run the wrong command (be especially careful with commands like "docker prune") you can easily damage other projects.

Requirements you MUST follow

  1. Pegasus is for production: It is not a place to experiment with docker. You can run docker on your personal machine to learn how to use it, or use unicorn for testing.
  2. Naming images and containers: All images and containers must be named in a way that makes it clear what they are, and what project they are used by. Default randomly assigned names are not allowed.
  3. Containers must be disposable: Do not create any data or configuration inside a running container. Map a volume instead. If you are running 'docker exec -it' for anything except debugging or seeing status, you're likely breaking this rule. Put another way: Docker containers are not virtual machines. Note that some pruning is done nightly (see below) so if you're storing any state in your container(s), it will disappear by tomorrow!
  4. Volumes should be backed up: While volumes are persistent and not deleted with typical pruning, they're still fragile. Volume data can be deleted or overwritten by anyone in the docker group and is less safe than a file in your home directory. Anything that needs to be saved for a long period of time or can't be easily recreated must have a copy outside of docker altogether and possibly outside of pegasus.
  5. When launching containers manually, make sure to use the --rm flag to remove the container automatically when it exits. Otherwise we end up with many lingering containers cluttering the system.

Helpful notes

  1. The "glances" command will show you an overview of all the running containers and lets you watch as they are created and destroyed.
  2. If you're temporarily using a container, docker run --rm will automatically destroy the container when it is stopped, and prevent docker container clutter.
  3. Using compose is a good way to keep track of containers. It lets you set up multi-container environments in an easy, reproducible way, and gives sensible names by default. It is highly recommended for anything that won't be immediately deleted.
  4. You may have TLS issues when using git and a remote repository over https. If this happens, using git over SSH should sidestep the problem. Version control is important!
  5. If you're building images on unicorn, tag them with a version or date. When a new image is built with the same tag, the other becomes unlisted and is more difficult to remove. Also avoid tagging images you build locally as "latest" for the same reason.
  6. The structure of a Dockerfile is important to optimize build efficiency, maintainability, and runtime performance.

Unresolved issues and Future plans

  • Right now, pegasus and unicorn have a traefik reverse proxy running (in docker, of course). /var/docker/traefik contains this configuration, but you should not need to edit it. If you do, please contact Juliette.
  • Every night, a cron job runs:'docker image prune' and docker container prune on both unicorn and pegasus. Do not keep state inside your containers!
⚠️ **GitHub.com Fallback** ⚠️