Chef Server Scaling - chef-boneyard/chef-summit-2014 GitHub Wiki

Location

Thursday, Leschi, 10:30

Convener

Irving Popovetsky

Topics of discussion

  • Scalability - hard to talk about without it being repeatable
    • People's efforts tend to be unique to their situation
    • Irving built a load testing tool
  • Chef server releases are not visible to the community until they are actually released.
    • Want to make the release process more transparent for the community.
    • We're doing nightlies
  • Address recent open source changes - opscode-omnibus
    • There is no 'chef server repo'
    • There are plans for a new repo for 'chef server' that people can open issues against. It won't have code in it initially and will mostly be a pointer to the various other repos.

Scalability

  • Black art
  • Undocumented
  • There are a bunch of knobs to change
    • Depsolver (dependency solver) timeout
    • DB connection settings
  • There could be a 'scaling toolkit' - here's what you need to change as you grow
    • E.g. reduce the amount of node data you save (open question: which data?)
  • Discussed increasing the number of connections
    • Number of connections to postgres should be number of cores + N
    • Someone mentioned a problem they had with TCP timeouts of 55 minutes: when a connection died, postgres took a long time to actually time out.
  • Mark Mzyk suggested starting a place for people to post their 'tuning stories' (e.g. issues in the new chef server repo), and when people come to consensus, Chef's technical writer will make the documentation official.
    • Documentation should be based on user stories and not people in a room coming up with hypothetical scenarios.
    • This is where collecting metrics can help.
  • Mark Anderson mentioned that Chef 12 and Chef 11 tuning are very different.
  • Discussed bookshelf issues. It was pointed out that S3 can be used instead, although someone else said they had been advised against doing that.
  • We should document what is affected in specific scenarios
    • If you're standing up 5000 servers, these things will be hit hardest
    • If you're search heavy, you need to scale up X
  • Discussed things you shouldn't do in cookbooks that you know will be harmful to the chef server. E.g. lots of searching.
  • Metrics/instrumentation
    • Are there more analytics/metrics we can get out of the server to aid with scalability?
      • Not just number of clients
      • Node saves per minute
      • Searches per minute
    • There are existing metrics
      • Mostly request/response times
      • Not 'number of searches' and so on.
    • One thing that would be nice to have is a client-run-id in logs to be able to get a profile of a given chef client run. (Mark Anderson)
    • We have report handlers, which give some of this
      • It's from the client point of view though
      • It doesn't explain the effort a server had to go through to serve the requests (e.g. I spent X ms in search, this many iops in Y)
    • Discussed tying problems downloading objects back to a specific item in a cookbook, e.g. to identify a corrupted object.
  • Someone asked if we can scale the backend horizontally
    • The answer was that it is hard to do with CouchDB
    • Mentioned the other components - RabbitMQ, postgres, Solr
    • Someone commented that you probably don't need to scale the backend
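
The "number of cores + N" rule of thumb for postgres connections mentioned above can be written down as a small helper. This is only a sketch: the discussion did not say what N should be, nor which machine's cores to count, so `headroom` (standing in for N) and its default value are placeholders, and the function name is hypothetical.

```python
import os


def suggested_pg_connections(cores=None, headroom=2):
    """Return cores + headroom, i.e. the 'number of cores + N' rule of thumb.

    headroom stands in for N, which the session notes leave unspecified;
    the default of 2 is an arbitrary placeholder, not a recommendation.
    """
    if cores is None:
        # Assumption: falls back to the machine running this script.
        cores = os.cpu_count() or 1
    if cores < 1 or headroom < 0:
        raise ValueError("cores must be >= 1 and headroom must be >= 0")
    return cores + headroom


# Example: an 8-core host with N = 2
print(suggested_pg_connections(cores=8, headroom=2))
```

Passing `cores` explicitly is probably the safer habit, since the script may not run on the host whose core count matters.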

Community involvement

  • Discussed separate mailing list for chef-server, but people didn't seem very interested in the idea.
  • Discussed that only half of the room was subscribed to the list
  • We ran out of time before discussing community in detail, so we want to have a separate discussion on that.

What will we do now? What needs to happen next?

  • Improve documentation on tuning/scaling: make an RFC for this and gather people's scaling stories as issues on the chef server repo.
  • Better instrumentation of chef server that can be useful to those scaling
  • Hold a separate discussion on community involvement, since we ran out of time in this session.
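
As an illustration of the server-side metrics asked for in the session (node saves per minute, searches per minute), here is a minimal sketch that derives them from request records. It assumes the records have already been parsed into `(timestamp, method, path)` tuples; parsing real server logs is left out, and the function names and sample data are made up. It also assumes that a node save is a `PUT` to `/nodes/NAME` and a search is a `GET` or `POST` to `/search/INDEX`, optionally under an `/organizations/ORG` prefix.

```python
from collections import Counter
from datetime import datetime


def classify(method, path):
    """Label a request as 'node_save', 'search', or 'other'."""
    parts = [p for p in path.split("?")[0].split("/") if p]
    # Drop an optional /organizations/ORG prefix.
    if len(parts) >= 2 and parts[0] == "organizations":
        parts = parts[2:]
    if method == "PUT" and len(parts) == 2 and parts[0] == "nodes":
        return "node_save"
    if method in ("GET", "POST") and len(parts) == 2 and parts[0] == "search":
        return "search"
    return "other"


def per_minute(requests):
    """Count requests per (minute, kind) from (datetime, method, path) tuples."""
    counts = Counter()
    for ts, method, path in requests:
        minute = ts.replace(second=0, microsecond=0)
        counts[(minute, classify(method, path))] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    (datetime(2014, 1, 1, 10, 30, 5), "PUT", "/nodes/web01"),
    (datetime(2014, 1, 1, 10, 30, 40), "GET", "/search/node?q=role:web"),
    (datetime(2014, 1, 1, 10, 30, 59), "PUT", "/organizations/acme/nodes/db01"),
    (datetime(2014, 1, 1, 10, 31, 2), "POST", "/organizations/acme/search/node"),
    (datetime(2014, 1, 1, 10, 31, 10), "GET", "/nodes/web01"),
]

for (minute, kind), n in sorted(per_minute(sample).items()):
    print(minute.isoformat(), kind, n)
```

Counting from request records only shows how often each operation happens; it does not capture the effort the server spent on each request, which would need instrumentation inside the server itself.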