
Service disruption post mortems


2017-06-07 CI Infrastructure Partial Outages

Outage details

Main outage discussed in https://github.com/nodejs/build/issues/749, secondary outage in https://github.com/nodejs/build/issues/751. In addition, we are continuing to experience problems with the Raspberry Pi 3s that may be related to the outage; this is being tracked in https://github.com/nodejs/build/issues/758. The outage occurred just as @jasnell was preparing an 8.1.0 release as per https://github.com/nodejs/node/pull/13483.

The main outage occurred at @rvagg's hosted location where the bulk of the ARM resources for CI are held. This also happens to be the only current location of our macOS resources since we lost our Voxer machines and have not yet set up an alternative. The cause of the outage was a loss of power due to the local provider replacing some lines and a transformer nearby. The outage ran from approximately 11pm UTC to 4am UTC, at which point the resources were successfully reconnected. The key resources at this location required for releases are the macOS build machine, the Raspberry Pi 1 B+ machines that create our ARMv6 binaries and the AppliedMicro X-Gene ARM64 servers used to create our ARM64 / ARMv8 binaries.

The secondary outage occurred at the OSUOSL location that hosts all of our PPC hardware. There was a power outage there as well during approximately the same period. The key resources at this location required for releases are those used to build the AIX binaries.

Impact

The ARMv6 and ARM64 Linux binaries, and the AIX binaries, are considered of secondary importance, while the macOS binaries are critical to releases. Therefore, the 8.1.0 release was postponed. This was unfortunate since an expectation had been set on social media and GitHub by @jasnell that 8.1.0 would be out that day.

The test hardware that was unavailable during this time was:

  • PPC infrastructure
  • All Raspberry Pi infrastructure: 1 B+, 2 and 3
  • AppliedMicro X-Gene ARM64 / ARMv8
  • macOS

Scaleway ARMv7 and miniNodes ARM64 (Odroid C2) hardware were not impacted by this outage (i.e. it was not a complete ARM outage, and our ARMv7 binaries could still be produced).

Resolution

When power was restored to both locations, the CI nodes were reconnected to their Jenkins masters. Before reconnecting the Raspberry Pis, @rvagg cleaned out the workspaces (hosted on an SSD, mounted via NFS) and performed some updates on them, part of routine maintenance that is undertaken occasionally. @rvagg then proceeded to prepare the 8.1.0 release, taking over from @jasnell's mostly ready release proposal branch. The 8.1.0 release went ahead, but the AIX and ARMv6 binaries were deferred until the next day: the AIX hardware recovery was delayed (@mhdawson took care of it when he came online for the day) and the ARMv6 binaries took extra time to compile, mainly due to having to create new workspaces and clone the repo from scratch. Additionally, there appears to be a new problem on the network that primarily impacts the Raspberry Pi 3s but also seems to affect the other Raspberry Pis, although to a lesser extent. This is still being diagnosed and understood and is being tracked in https://github.com/nodejs/build/issues/758.
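For illustration, the kind of workspace cleanup mentioned above might look like the following. This is a minimal sketch only, assuming the Raspberry Pi workspaces live as per-node directories under a single NFS-exported mount; the path and directory layout are hypothetical, not the project's actual configuration.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: clear out per-node Jenkins workspaces before
reconnecting agents. The mount point and layout below are assumptions for
the example, not the project's actual configuration."""

import shutil
from pathlib import Path

# Hypothetical NFS-exported SSD holding one workspace directory per Pi.
WORKSPACE_ROOT = Path("/mnt/jenkins-workspaces")

def clean_workspaces(root: Path) -> None:
    for node_dir in sorted(root.iterdir()):
        if not node_dir.is_dir():
            continue
        print(f"removing workspace contents for {node_dir.name}")
        for entry in node_dir.iterdir():
            # Remove build trees and clones so the next job starts fresh.
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()

if __name__ == "__main__":
    clean_workspaces(WORKSPACE_ROOT)
```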

Weaknesses exposed

The primary weakness exposed by this downtime was the fragility of our macOS resources. This is a well understood problem, but the Build WG has been slow to come to a resolution. The Build WG has been keen to source donor hardware where possible and has preferred to exhaust its options before exploring paid options. Unfortunately, this has left us with only a single full-time Mac Mini (plus an occasionally available one that hasn't been handling any meaningful load) running two virtual machines, one for test and one for release. This creates a bottleneck, as concurrent test runs must be queued, and it slows down releases because the macOS release builds are divided into two separate jobs, one for .pkg and one for .tar (i.e. they must run serially, whereas in the past they have run in parallel on different machines). In addition, this single Mac Mini is located on @rvagg's network, which has been one of our least reliable networks; that doesn't pair well with the criticality of macOS in our priorities for both test and build.

AIX / PPC, ARMv6, ARMv7 and ARM64 were all also impacted (partially or completely); however, due to their low download numbers (see metrics), we do not consider these critical in the same way as the macOS resources. It has historically been considered reasonable to go ahead with a release without these builds completing, promoting the binaries after the fact (the release blog post script inserts a "Coming soon" where they are missing). It may also be considered reasonable to proceed with merging pull requests whose tests cannot be run on these platforms, provided they pass on the remaining platforms. ARM64 and ARMv7 have hardware located elsewhere, and tests could still be run on both during the downtime. Release builds for ARMv7 were available, but ARM64 and ARMv6 were not.
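As an aside, the "Coming soon" behaviour mentioned above can be sketched as follows. This is not the actual release blog post script, just a minimal illustration; the platform labels, artifact file names and URL pattern are assumptions for the example.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: substitute a "Coming soon" placeholder for
binaries that have not yet been promoted. Labels, file names and the URL
pattern are assumptions, not the real release tooling."""

BASE_URL = "https://nodejs.org/download/release"
VERSION = "v8.1.0"

# Hypothetical mapping of platform labels to expected release artifacts.
EXPECTED = {
    "macOS 64-bit Installer": f"node-{VERSION}.pkg",
    "Linux ARMv6": f"node-{VERSION}-linux-armv6l.tar.gz",
    "Linux ARM64": f"node-{VERSION}-linux-arm64.tar.gz",
    "AIX 64-bit": f"node-{VERSION}-aix-ppc64.tar.gz",
}

def link_or_placeholder(label: str, filename: str, promoted: set) -> str:
    if filename in promoted:
        return f"- {label}: {BASE_URL}/{VERSION}/{filename}"
    # Missing binaries get a placeholder and are promoted after the fact.
    return f"- {label}: Coming soon"

if __name__ == "__main__":
    promoted = {f"node-{VERSION}.pkg"}  # pretend only macOS is done so far
    for label, filename in EXPECTED.items():
        print(link_or_placeholder(label, filename, promoted))
```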

Action items post-outage

  • The Build WG has since secured a partnership with MacStadium to provide considerable cloud resources to the project; more details can be found in https://github.com/nodejs/build/issues/756. This is a work in progress, but it will allow us to either retire the current macOS hardware or use it for additional redundancy. The Build WG continues to reach out to other potential partners to increase our macOS resiliency.
  • packet.net recently entered a partnership with the project to provide access to modern server-class bare-metal ARM64 hardware. Since the outage, the Build WG has connected new instances for both test and release; see https://github.com/nodejs/build/issues/755 for some context. These new options already make our older ARM64 infrastructure redundant, and it is likely to be decommissioned. In addition, Scaleway has released some new ARM64 options and we should be able to expand our partnership there to provide additional redundancy.
  • NFS / network speed problems are still being explored; see https://github.com/nodejs/build/issues/758
  • The Build WG has had brief discussions about the possibility of either splitting the ARM cluster up into multiple locations (this involves some unfortunate maintenance overhead that may be best avoided), or connecting some backup Raspberry Pi hardware, used exclusively for release builds, as a fail-over.
  • The Build WG has not discussed redundancy for PPC / AIX hardware, but this is considered a low priority so has not been a focus.

In summary, the Build WG has already improved the resilience of some of the hardware impacted by this outage. Within a matter of weeks we should be at a point where similar outages will not be reasonable cause to hold up releases.

Please continue any discussion in https://github.com/nodejs/build/issues/749