Chef Server Scaling - chef-boneyard/chef-summit-2014 GitHub Wiki
Thursday, Leschi, 10:30
Irving Popovetsky
- Scalability - hard to talk about without it being repeatable
- People's efforts tend to be unique to their situation
- Irving built a load testing tool
- Chef server releases are not visible to the community until they are actually released.
- Want to make the release process more transparent for the community.
- We're doing nightlies
- Address recent open source changes - opscode-omnibus
- There is no 'chef server repo'
- There are plans for a new repo for 'chef server' that people can open issues against. It won't have code in it initially and will mostly be a pointer to the various other repos.
- Black art
- Undocumented
- There are a bunch of knobs to change
- Depsolver timeout
- DB connection settings
- There could be a 'scaling toolkit': here's what you need to change as you grow
- E.g. lower amount of node data you save - what data?
- Discussed increasing the number of connections
- Number of connections to postgres should be number of cores + N
- Someone mentioned a problem they had with TCP timeouts of 55 minutes: when a connection died, postgres took a long time to actually time out.
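
The "number of cores + N" rule of thumb above can be sketched as a small helper. This is only an illustration: the function name `postgres_pool_size` and the default headroom of 2 are assumptions, since the session did not settle on a value for N.

```python
import os


def postgres_pool_size(cores=None, headroom=2):
    """Return cores + N, the rule of thumb mentioned in the session.

    `headroom` plays the role of N. The default of 2 is an assumption
    for illustration only; no value for N was given in the discussion.
    """
    if cores is None:
        # Fall back to the local core count when none is supplied.
        cores = os.cpu_count() or 1
    if cores < 1 or headroom < 0:
        raise ValueError("cores must be >= 1 and headroom must be >= 0")
    return cores + headroom


if __name__ == "__main__":
    # An 8-core database host with N = 2.
    print(postgres_pool_size(cores=8, headroom=2))
```

The result would still need to be checked against real load, which is the point of collecting tuning stories.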
- Mark Mzyk suggested starting a place for people to post their 'tuning stories' (e.g. issues in the new chef server repo), and when people come to a consensus, Chef's technical writer will make the documentation official.
- Documentation should be based on user stories and not people in a room coming up with hypothetical scenarios.
- This is where collecting metrics can help.
- Mark Anderson mentioned that Chef 12 and Chef 11 tuning is very different.
- Discussion on bookshelf issues, it was pointed out that S3 can be used instead, although someone else said they'd been recommended not to do that.
- We should document what is affected in specific scenarios
- If you're standing up 5000 servers, these things will be hit hardest
- If you're search heavy, you need to scale up X
- Discussed things you shouldn't do in cookbooks that you know will be harmful to the chef server. E.g. lots of searching.
- Metrics/instrumentation
- Are there more analytics/metrics we can get out of the server to aid with scalability?
- Not just number of clients
- Node saves per minute
- Searches per minute
- There are existing metrics
- Mostly request/response times
- Not 'number of searches' and so on.
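
As a rough sketch of the count-style metrics asked for above (searches per minute, node saves per minute), the snippet below tallies them from request log lines. The log format (`timestamp method path status`) is hypothetical and not the actual Chef server log format; the classification assumes searches hit a `/search/` path and node saves are `PUT` requests to a `/nodes/` path.

```python
from collections import Counter


def per_minute_counts(lines):
    """Count searches and node saves per minute.

    Assumes a hypothetical log line of the form:
        2014-10-02T10:30:05Z GET /search/node?q=role:web 200
    """
    searches = Counter()
    node_saves = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed format
        timestamp, method, path = parts[0], parts[1], parts[2]
        minute = timestamp[:16]  # "YYYY-MM-DDTHH:MM"
        if "/search/" in path and method in ("GET", "POST"):
            searches[minute] += 1
        elif "/nodes/" in path and method == "PUT":
            node_saves[minute] += 1
    return searches, node_saves


sample = [
    "2014-10-02T10:30:05Z GET /search/node?q=role:web 200",
    "2014-10-02T10:30:40Z PUT /nodes/web01 200",
    "2014-10-02T10:30:59Z GET /search/node?q=role:db 200",
    "2014-10-02T10:31:10Z PUT /nodes/db01 200",
]

if __name__ == "__main__":
    searches, node_saves = per_minute_counts(sample)
    print(dict(searches), dict(node_saves))
```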
- One thing that would be nice to have is a client-run-id in logs to be able to get a profile of a given chef client run. (Mark Anderson)
- We have report handlers which give some of this
- It's from the client point of view though
- It doesn't explain the effort a server had to go through to serve the requests (e.g. I spent X ms in search, this many iops in Y)
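
To illustrate what a client-run-id in the server logs would make possible, here is a minimal sketch that groups server-side timings by run. The `run_id=... component=... ms=...` log fields are hypothetical; the session only proposed the idea and no such format exists in the notes.

```python
from collections import defaultdict


def profile_runs(lines):
    """Sum server-side milliseconds per component for each client run.

    Assumes hypothetical key=value log lines such as:
        run_id=run-1 component=search ms=30
    """
    profiles = defaultdict(lambda: defaultdict(float))
    for line in lines:
        fields = dict(pair.split("=", 1) for pair in line.split() if "=" in pair)
        if not {"run_id", "component", "ms"} <= fields.keys():
            continue  # ignore lines without the assumed fields
        profiles[fields["run_id"]][fields["component"]] += float(fields["ms"])
    return {run: dict(parts) for run, parts in profiles.items()}


sample_log = [
    "run_id=run-1 component=search ms=30",
    "run_id=run-1 component=db ms=5.5",
    "run_id=run-2 component=search ms=7",
    "run_id=run-1 component=search ms=12",
]

if __name__ == "__main__":
    for run, parts in profile_runs(sample_log).items():
        print(run, parts)
```

A profile like this would show the server's share of the work for one run, which report handlers cannot see from the client side.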
- Discussed linking problems downloading objects to a specific item in a cookbook. E.g. identifying a corrupted object.
- Someone asked if we can scale the backend horizontally
- The answer was that it is hard to do with CouchDB
- Mentioned the other components - rabbit, postgres, solr
- Someone commented that you probably don't need to scale the backend
- Discussed separate mailing list for chef-server, but people didn't seem very interested in the idea.
- Discussed that only half of the room was subscribed to the list
- We ran out of time before discussing community in detail, so we want to have a separate discussion on that.
- Improve documentation on tuning/scaling - issues on chef server repo - make an RFC for this to gather people's scaling stories.
- Better instrumentation of chef server that can be useful to those scaling