Scaling For Events - PuzzleServer/mainpuzzleserver GitHub Wiki
The Azure services are normally kept at free/cheapest tier levels to keep the site available without incurring bills. For an event or major registration, we scale the services to handle increased load. To do this, you will need access to the Azure portal for the site. If you don't have it, you likely need somebody else to do this for you.
Before an Event
Database
Navigate to the "puzzleserver SQL database" resource on the Azure portal. Go to the "Compute + Storage" tab and ensure the service tier is set to Standard under the DTU-based purchasing model. This enables redundancy, as well as allowing access to enough compute to handle a busy event. Leave data size as-is (unless the database has actually grown large enough to require a larger size). Choosing a number of DTUs is a bit more art than science, but historically, we have used 50 DTUs for events with 1000+ players. The metrics tab shows current DTU usage and usage over the last 3 months. If you guess wrong, you can change this during the event with no downtime. Because DTUs are billed hourly, do this as close to the beginning of the event as you comfortably can.
SignalR
The Azure SignalR hub must be used when the frontend is scaled out, so that notifications and other push functionality work between instances. It should only be used while scaled out, since it costs at least $1.60/day and isn't necessary otherwise. Unfortunately, the Free tier is not large enough to work with more than a couple of users, so if it's in use, it must be on the Standard tier. To enable it:
- Navigate to the "puzzlehuntsignalr" resource on the Azure portal.
- Go to the "Scale Up" tab
- Change to the Standard tier and click Save
- Wait for the scale to complete (watch the bell in the corner)
- Go to the "Scale Out" tab
- Choose a number of Units. Each Unit is good for 1000 connections. Estimate the connections you need as roughly the maximum number of players times the number of pages each player is likely to have open (and not sleeping) at one time. Units cost about $1/day. 10 might be a good start for a big event.
- Click Save
- Configure the frontend to use the AzureSignalRHub. NOTE: This will restart the frontend, causing brief downtime and forcing Blazor reloads, so do this before there are active players:
  - Navigate to the "puzzlehunt Web App" resource on the Azure portal.
  - Click the Configuration tab
  - Edit the UseAzureSignalR setting to "true"
  - Click Save
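The Unit count in the Scale Out step is simple arithmetic, and a minimal sketch of it follows. The function name and its parameters are hypothetical; only the figure of 1000 connections per Unit comes from the text above. The portal may offer only certain Unit counts, so round up to the next available value if the result isn't listed.

```python
import math

# From the text above: each Unit is good for 1000 connections.
CONNECTIONS_PER_UNIT = 1000


def estimate_signalr_units(peak_players: int, pages_per_player: float) -> int:
    """Estimate the Units needed for a given peak load.

    peak_players and pages_per_player are guesses you supply; the result
    is their product divided by 1000 and rounded up, with a floor of 1.
    """
    if peak_players < 0 or pages_per_player < 0:
        raise ValueError("inputs must be non-negative")
    connections = peak_players * pages_per_player
    return max(1, math.ceil(connections / CONNECTIONS_PER_UNIT))


if __name__ == "__main__":
    # Example guess: 2000 players with about 3 open pages each.
    print(estimate_signalr_units(2000, 3))
```

Since extra Units are cheap relative to a failed event, it is probably safer to round the guess for pages per player up rather than down.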
Web Frontend
Navigate to the "puzzlehunt Web App" resource on the Azure portal. Frontend instances can be scaled up (onto more powerful hardware) or out (onto more instances). We typically only scale out, keeping the scale up value at Premium P2 or whatever else is cheapest in the Production tier for uptime. Navigate to the "Scale Out" tab. Historically, we have used "Rules-based" scaling. After selecting it, click "Manage rules-based scaling". On the rules page, you can set the scheduled times for your event. Ensure the automatic scaling starts before your event (check the timezone) and ends once your players are likely to be done (including checking scoreboards). Ensure the minimum instance count is at least 2 (otherwise, uptime isn't guaranteed) and choose a maximum. Usually only a few instances are needed, but a maximum of 7-10 is pretty safe.
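Timezone mistakes are the easiest way to get the schedule wrong, so it can help to compute the window once in UTC and compare it against what the rules page shows. The sketch below is only an illustration: the function name and the default one-hour lead and two-hour tail are assumptions, not values from this wiki.

```python
from datetime import datetime, timedelta, timezone


def scaling_window_utc(event_start, event_end,
                       lead=timedelta(hours=1), tail=timedelta(hours=2)):
    """Return (start, end) in UTC for a scheduled scale-out window.

    lead and tail are hypothetical buffers: scale out before players
    arrive and stay scaled while they are still checking scoreboards.
    Naive datetimes are rejected so a missing timezone is never guessed.
    """
    for value in (event_start, event_end):
        if value.tzinfo is None or value.utcoffset() is None:
            raise ValueError("event times must be timezone-aware")
    if event_end <= event_start:
        raise ValueError("event_end must be after event_start")
    # Convert to UTC first so the buffers are real elapsed time.
    start = event_start.astimezone(timezone.utc) - lead
    end = event_end.astimezone(timezone.utc) + tail
    return start, end


if __name__ == "__main__":
    from zoneinfo import ZoneInfo  # Python 3.9+; needs system tz data

    pacific = ZoneInfo("America/Los_Angeles")  # example zone only
    start, end = scaling_window_utc(
        datetime(2030, 6, 15, 10, 0, tzinfo=pacific),
        datetime(2030, 6, 15, 18, 0, tzinfo=pacific),
    )
    print(start.isoformat(), end.isoformat())
```

Compare the printed UTC times with the schedule on the rules page, converting from whatever timezone that page is set to.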
Azure recently added an automatic scaling preview that will hopefully work better than rules-based scaling. It shouldn't require scheduling: set the maximum burst to the same maximum as above and always-ready instances to 1.
Adding more instances during the event has no downtime; removing instances might cause some requests in flight to fail (but autoscale handles that by redirecting requests).
After an Event
Database
Scale the database down to the minimum DTUs in standard tier (10) in the same place as above. Since it's inexpensive, the increased backup length is better than going to basic tier for free.
Web Frontend
If rules-based scaling was scheduled, no work should be required. If automatic scaling was used and always-ready instances is set to 1, no work should be required. However, if the server was manually scaled to 2 or more instances, set the scale back to 1.
Azure SignalR
Once the frontend is down to 1 instance, set the puzzlehunt Web App UseAzureSignalR setting to "false" (as above, this will cause a restart, so wait until players are gone). Set the puzzlehuntsignalr tier to Free.
Monitoring
Application Insights includes CPU and database usage, which are usually the bottlenecks. Check these if the site is feeling sluggish to see what needs to be scaled.
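As a rough illustration of that check, the sketch below maps the two readings to the service that likely needs scaling. The function name and the 80% threshold are assumptions made for this example; the readings themselves are numbers you copy by hand from Application Insights.

```python
def what_to_scale(cpu_percent: float, dtu_percent: float,
                  threshold: float = 80.0) -> list:
    """Return which services look saturated, given two usage readings.

    threshold is a hypothetical cutoff, not a value from this wiki.
    """
    targets = []
    if cpu_percent >= threshold:
        targets.append("frontend")  # scale out the Web App
    if dtu_percent >= threshold:
        targets.append("database")  # raise the DTU count
    return targets


if __name__ == "__main__":
    # Example readings: busy CPU, comfortable database.
    print(what_to_scale(cpu_percent=92.0, dtu_percent=35.0))
```

An empty result with a sluggish site suggests the bottleneck is somewhere other than these two metrics.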