Why Hyades?

Hyades is a fresh start for SIPB's existing Scripts and XVM services, intended to modernize and expand them. It's a container-based system for running services, scripts, VMs, compute jobs, et cetera.

Motivations:

  • Give individual students free resources to encourage their sense of innovation.
  • Encourage by-students-for-students web services, like firehose and mastodon.
  • Help faculty use technology to teach and administer courses.
  • Help extend the useful lifespan of Athena systems like AFS and moira.
  • Help SIPB run our own services more reliably and maintainably.
  • Help student groups to use technology to streamline their operations.
  • Enable students to run large compute jobs without needing to purchase expensive hardware.
  • Give researchers free resources to minimize the overhead of setting up their own infrastructure.
  • Help MIT avoid vendor lock-in with existing cloud services for our regular operations.

Why Hyades, and not just plain Kubernetes?

Almost all existing mechanisms for setting up Kubernetes are either:

  • Proprietary (and probably expensive)
  • Only capable of deploying to an existing cloud environment, like Google Cloud Platform
  • Not highly available, running only a single Kubernetes master node

These are problematic, because:

  • Deploying unnecessary proprietary software is against SIPB's mission, and our funding is tight enough that we can't pay any ongoing costs to outside providers.
  • Since the point of Hyades is to run on our own hardware and have our own fully controlled cluster, running on an existing cloud platform is not sufficient.
  • One of the goals of Hyades is to have no single point of failure, which means that Kubernetes must be configured with multiple master nodes.

Additionally, we have unusual requirements:

  • We want our management system to start from the bare metal, not from nodes that have operating systems already installed.
  • We assume that our network is insecure and untrusted, whereas many components like Ceph assume a trustworthy network.
  • We're trying to run containers on the same physical machines from users who don't trust each other and whom we don't trust. While container security is getting better, strong isolation between containers is often not a design goal, so we need to use a container manager (like rkt) that can be put into a high-isolation mode.
  • We want horizontally scalable software load balancing that runs at the IP layer instead of the Ethernet layer or the HTTP layer, whereas most existing load balancers are either implemented in hardware or not horizontally scalable; a sketch of the idea follows this list.
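
To make the last requirement concrete, here is a minimal sketch (in Go, not taken from Hyades) of why IP-layer balancing can scale horizontally: each balancer replica statelessly hashes a connection's 5-tuple to choose a backend, so any number of identical replicas agree on where a flow goes without sharing connection state. The backend addresses and flow values below are hypothetical placeholders.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flow identifies a connection by its 5-tuple.
type flow struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
	proto            string
}

// pickBackend deterministically maps a flow to one of the backends.
// Because the choice depends only on the flow itself, every balancer
// replica computes the same answer, which is what lets replicas be
// added or removed freely behind ECMP routing.
func pickBackend(f flow, backends []string) string {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%s", f.srcIP, f.dstIP, f.srcPort, f.dstPort, f.proto)
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	// Hypothetical backend pool and client connection.
	backends := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	f := flow{srcIP: "18.0.1.5", dstIP: "18.0.2.9", srcPort: 51234, dstPort: 443, proto: "tcp"}
	fmt.Println(pickBackend(f, backends)) // same output on every replica
}
```

A real deployment would use a consistent-hashing scheme (as in Google's Maglev) rather than the plain modulo hashing shown here, so that adding or removing a backend disturbs only a small fraction of flows.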

We were unable to find any existing system that satisfied these requirements, so we decided to build our own.