Chef Server Scaling - chef-boneyard/chef-summit-2014 GitHub Wiki

Location

Thursday, Leschi, 10:30

Convener

Irving Popovetsky

Topics of discussion

Scalability - hard to discuss unless results are repeatable
- People's efforts tend to be unique to their situation
- Irving built a load testing tool
The community does not see a Chef server release until it is actually released.
- Want to make the release process more transparent for the community.
- We're doing nightly builds
Address recent open source changes - opscode-omnibus
- There is no 'chef server repo'
- There are plans for a new repo for 'chef server' that people can open issues against. It won't have code in it initially and will mostly be a pointer to the various other repos.

Scalability

Black art
Undocumented
There are a bunch of knobs to change
- Depsolver (dependency solver) timeout
- DB connection settings
There could be a 'scaling toolkit' - here's what you need to change as you grow
- E.g. reduce the amount of node data you save (which data to drop was left as an open question)
Discussed increasing the number of connections
- Number of connections to postgres should be number of cores + N
- Someone mentioned a problem with TCP timeouts of 55 minutes: when a connection died, postgres took a long time to actually time out.
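The "number of cores + N" rule of thumb above can be written down as a small helper. This is only a sketch of the arithmetic: the session did not state a value for N, so `extra` is a placeholder parameter, and the function name is made up for illustration rather than taken from any Chef or postgres tooling.

```python
import os


def suggested_pg_connections(cores=None, extra=2):
    """Return cores + extra as a starting point for the connection count.

    `extra` stands in for the unspecified N from the discussion. The default
    of 2 is an arbitrary placeholder, not a recommendation.
    """
    if cores is None:
        # os.cpu_count() can return None on some platforms, so fall back to 1.
        cores = os.cpu_count() or 1
    if cores < 1 or extra < 0:
        raise ValueError("cores must be >= 1 and extra must be >= 0")
    return cores + extra


# Example: an 8-core database host with N = 2.
print(suggested_pg_connections(cores=8, extra=2))
```

Treat the result as a starting value to measure against, not as a fixed setting.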
Mark Mzyk suggested starting a place for people to post their 'tuning stories' (e.g. issues in the new chef server repo), and when people come to consensus, Chef's technical writer will make the documentation official.
- Documentation should be based on user stories and not people in a room coming up with hypothetical scenarios.
- This is where collecting metrics can help.
Mark Anderson mentioned that Chef 12 and Chef 11 tuning is very different.
Discussed bookshelf issues. It was pointed out that S3 can be used instead, although someone else said they had been advised against that.
We should document what is affected in specific scenarios
- If you're standing up 5000 servers, these things will be hit hardest
- If you're search heavy, you need to scale up X
Discussed cookbook practices that are known to be hard on the chef server, e.g. lots of searching.
Metrics/instrumentation
- Are there more analytics/metrics we can get out of the server to aid with scalability?
  - Not just number of clients
  - Node saves per minute
  - Searches per minute
- There are existing metrics
  - Mostly request/response times
  - Not 'number of searches' and so on.
- One thing that would be nice to have is a client-run-id in logs to be able to get a profile of a given chef client run. (Mark Anderson)
- We have report handlers, which give some of this
  - It's from the client point of view though
  - It doesn't explain the effort the server had to go through to serve the requests (e.g. I spent X ms in search, this many IOPS in Y)
- Discussed linking problems downloading objects to a specific item in a cookbook. E.g. identifying a corrupted object.
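As a rough illustration of the "node saves per minute" and "searches per minute" counters discussed above, the sketch below derives both from request records. It assumes the records have already been parsed into `(timestamp, method, path)` tuples; that record shape, the function names, and the sample data are all hypothetical, and the path matching (a `PUT` under `/nodes/`, a `GET` or `POST` under `/search/`) is an assumption about how such requests could be recognised, not a description of existing server instrumentation.

```python
from collections import Counter
from datetime import datetime


def classify(method, path):
    """Label a request as 'node_save', 'search', or None (not counted)."""
    path = path.split("?", 1)[0]  # ignore the query string
    if method == "PUT" and "/nodes/" in path:
        return "node_save"
    if method in ("GET", "POST") and "/search/" in path:
        return "search"
    return None


def per_minute(records):
    """Count classified requests per (minute, kind) bucket."""
    counts = Counter()
    for ts, method, path in records:
        kind = classify(method, path)
        if kind is not None:
            minute = ts.replace(second=0, microsecond=0)
            counts[(minute, kind)] += 1
    return counts


# Made-up sample records for illustration only.
records = [
    (datetime(2014, 1, 1, 10, 30, 5), "PUT", "/organizations/acme/nodes/web1"),
    (datetime(2014, 1, 1, 10, 30, 40), "GET", "/organizations/acme/search/node?q=role:web"),
    (datetime(2014, 1, 1, 10, 30, 59), "PUT", "/organizations/acme/nodes/web2"),
    (datetime(2014, 1, 1, 10, 31, 1), "GET", "/organizations/acme/cookbooks"),
]

for (minute, kind), n in sorted(per_minute(records).items()):
    print(minute.isoformat(), kind, n)
```

Counting on the server side like this complements report handlers, which only see the client's view of a run.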
Someone asked if we can scale the backend horizontally
- The answer was that it was hard to do with CouchDB
- The other components were mentioned: RabbitMQ, postgres, Solr
- Someone commented that you probably don't need to scale the backend

Community involvement

Discussed separate mailing list for chef-server, but people didn't seem very interested in the idea.
Discussed that only half of the room was subscribed to the list
We ran out of time before discussing community in detail, so we want to have a separate discussion on that.

What will we do now? What needs to happen next?

Improve documentation on tuning/scaling: use issues on the chef server repo, and make an RFC for this to gather people's scaling stories.
Better instrumentation of chef server that can be useful to those scaling
Hold a separate discussion on community involvement, since we ran out of time to cover it in detail here.
