Testing More Than Servers: InfraSpec - chef-boneyard/chef-summit-2014 GitHub Wiki

Thursday, Kirkland, 13:30

Convener

Michael Ducy

Participants

Charles Johnson, Joseph, Kristian, 3-4 others

Summary of Discussions

So we've got Test Kitchen for individual cookbooks in isolation (maybe with dependencies). We only look at one OS / host. But in reality we're testing something much bigger: the application, the storage, firewall ports, queues, etc. How do we test larger pieces of infrastructure?

Joseph: "2 main kinds of testing: the sort of thing you can stick in Nagios or Sensu, and then there's integration tests. Real-life verification stuff. Usually kept separate."

A: "Leibniz is multi-server testing with Test Kitchen as the workhorse. SNS project."

Q: Is there anything that simulates Amazon (besides Eucalyptus)? Do we do it all straight against Amazon? A: Not every part is going to be free. Gotta contain costs.

Anecdote: Tightly coupling services with credentials means that there are some pieces that can't be mocked, too.

"Chef doesn't really have a good domain model for something higher-level than a server."

"Metal's attempting to do that but it's not there yet."

"It's focused on servers right now, but we're rolling in additional hardware."

"Wish we had more orchestration features in Chef."

"We haven't built it, right? this is one of the things Push Jobs was built to solve, the primitive this sits on top of. The reason it took so long is that it was built entirely on assumptions of Enterprise Chef. Now that those features are open, you can run push jobs. Installed as an add-on, no limit applies."

(A digression into the provenance of push jobs ensues.)

"So now we've got a messaging layer. But the layer isn't a domain language. We have tools for machines, load balancers, representing them as resources. We don't have - who uses the word service on a daily basis? Who means the same thing when they type service in a cookbook? So how do we encapsulate a service inside of a recipe?"

"With Metal + Pushy you could build a rudimentary orchestration service."

"We've been trying to do this with Oracle RAC. But there's still services that need to wait to register until you add nodes to the grid. Testing that is really hard."

Adam jumps in! "If I derail just kick me out." "One of the things that's problematic is when you start breaking down the orchestration problem, using a RAC cluster, it's complex. So there's a way you think about that problem as a human - how would I automate this thing? Write down steps. I'd do X, then Y, then I'd wait, and it all comes out. But service discovery and a distributed lock are what people reach for first. If I could have everyone jam their shit in here I'll build a giant distributed lock and it'll all work out. Except it doesn't. There's a modeling thing here that's wrong. The thing is if you back up and say, let us make a list of all the actors in a system. What are the components and what do they do? We'll name them, talk about them, their promises to other parts of the system. When there are relationships, what is the series of promises? How do those actors validate those promises? In RAC, I promise not to add a new node to the cluster until the cluster controllers are alive and have achieved quorum. So what you need is a way for that state to be understood by those autonomous actors. Maybe it's a service discovery bus? But it probably ain't. We're attracted to those, but it doesn't make it better. What about we make it better - a tiny service that runs on the RAC cluster and understands if there's quorum. Could we write that? Then we query it and if there's no quorum, we lock."
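A minimal sketch of the "tiny service that understands if there's quorum" idea, in Python. Everything here is an assumption made for illustration: the `ClusterStatus` shape, the function names, and in particular the strict-majority rule, which is a stand-in and not how RAC actually decides quorum.

```python
from dataclasses import dataclass


@dataclass
class ClusterStatus:
    """Snapshot that a hypothetical quorum service might report."""
    controllers_total: int
    controllers_alive: int


def has_quorum(status: ClusterStatus) -> bool:
    # Assumption: quorum means a strict majority of controllers are alive.
    return status.controllers_alive > status.controllers_total // 2


def may_add_node(status: ClusterStatus) -> bool:
    # The promise from the discussion: do not add a node to the cluster
    # until the controllers are alive and have achieved quorum.
    return has_quorum(status)


if __name__ == "__main__":
    status = ClusterStatus(controllers_total=3, controllers_alive=2)
    print("safe to add node:", may_add_node(status))
```

The point of the sketch is that the actor adding a node asks one narrow question of the thing that actually knows the answer, instead of coordinating through a shared lock or bus.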

"Literally every single one of these convos I go into, it's the same level of complexity."

Q: Couldn't you use Chef service as a broker for the quorum?

A: "There's an active state problem where using Chef is not actually any better, but it's better than other ways. I mean, sticking it in Raft? It's not any better for me."

Q: "Stick it in a giant JSON database managed by Erlang?"

A: "May as well!"

Other speaker:

Joseph: "So I gave a talk in Austin" (shows diagrams of various topologies) "and the thing about all those systems is, we could have one giant bus for the entire infra across datacenters if you wanted, but most of the time you just need the LB knowing about pool members and clients knowing about the LB. So what you really need is consensus between LBs about nodes."

Q: Do you really need consensus?

A: "Probably not. So we need to know about the promises, and test all those things and the underlying services that are there."

Charles: "Check out https://github.com/ryotarai/infrataster and https://speakerdeck.com/ryotarai/infrataster-infra-behavior-testing-framework-number-oedo04"

Adam: "Simply saying I have a list of things and their relationships. If I could validate those relationships, that'd be really useful. The way this thing knows that other thing is behaving appropriately is because x to y."

Q: "Could we do a cookbook that does this?"

Q: "The thing that kills me about TK and stuff is that if I write a resource I need a 'load current resource' that needs to - shouldn't that be identical to what my test is verifying? Why is that code separate?"

A: "There's a thing - what the code is trying to assert vs the state that the system should be in. There's no implied assertion about what comes back out of the current resource. Implied assertion about definition of the new resource - you could argue that chef is just auto-remediating assertions anyway."
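The "auto-remediating assertions" framing can be sketched in a few lines of Python. This is an illustration only, not Chef's implementation: `converge` and `audit` are hypothetical names, and the check and fix callables stand in for whatever a resource and its test would really do.

```python
from typing import Callable


def converge(check: Callable[[], bool], fix: Callable[[], None]) -> str:
    """Test-and-repair: assert the desired state, remediate only on failure."""
    if check():
        return "up to date"
    fix()
    if not check():
        raise RuntimeError("remediation did not reach the desired state")
    return "updated"


def audit(check: Callable[[], bool]) -> str:
    """The same assertion with no remediation attached."""
    return "pass" if check() else "fail"


if __name__ == "__main__":
    # Toy system state; in practice this would be a file, a port, a service.
    state = {"port": 80}
    wants_443 = lambda: state["port"] == 443
    print(audit(wants_443))
    print(converge(wants_443, lambda: state.update(port=443)))
    print(audit(wants_443))
```

Keeping `audit` separate from `converge` is one way to get the assertion without the remediation, while still letting both share the same check.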

Q: Yep!

A: "Serverspec's interesting because there's a missing piece of the cookbook - specify a control that's different from my implementation of the control. Where the control is a test - is this condition true or false? If my expectation is not true, we've got something that catches. Can't do it in a recipe because of dry-run, it gets weird. Can't do it in a recipe because it breaks the point of writing the test. The value of the test is that it's a separate thing whose job is the assertion. Separating the assertion from the implementation is what makes it valuable. If you mash the Chef resources in -"

A: "It's double-entry bookkeeping."

A: "RIGHT!"

A: "Measure twice, cut once."

A: "People want the assertion without the code. Just the audit. Just the assertion without the remediation."

A: "So if you have a list that describes all the relationships, write all the controls as we go, spin up all the stuff in whatever order I have to do, and walk the controls for everything that built the infrastructure, then I could aggregate the results of those controls and tell you if the infra thing you built was in or out of spec."
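As a rough Python sketch of "walk the controls and aggregate the results": each control is a named true/false check, and the infrastructure is in spec only if every control passes. The `Control` type and the function names are hypothetical, and the example checks are placeholders for real probes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Control:
    name: str                  # e.g. "lb -> web: pool member reachable"
    check: Callable[[], bool]  # returns True when the promise holds


def walk(controls: List[Control]) -> Dict[str, bool]:
    """Run every control; a control that raises counts as a failure."""
    results: Dict[str, bool] = {}
    for control in controls:
        try:
            results[control.name] = bool(control.check())
        except Exception:
            results[control.name] = False
    return results


def in_spec(results: Dict[str, bool]) -> bool:
    # Note: an empty result set is vacuously in spec.
    return all(results.values())


if __name__ == "__main__":
    controls = [
        Control("web listens on 443", lambda: True),
        Control("queue accepts messages", lambda: False),
    ]
    results = walk(controls)
    for name, ok in results.items():
        print(("PASS" if ok else "FAIL"), name)
    print("in spec:", in_spec(results))
```

Because the controls are plain data, they can be written as the infrastructure is built and run afterwards in any order.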

Q: "So this is not a thing that exists, yet. I know that the serverspec side of it is the rspec extensions, but the heavy lifting is done with specinfra. Can we leverage that to put a different face on it than what we're used to using for smoke & integration tests? Embed specinfra with different stuff around it?"

A: "Yeah maybe. It feels kind of rightish."

Joseph: "You ever use Cucumber-Nagios? It was cool until it stopped working and became actively unmaintained. Pushing broken code on purpose."

A: "It also had no tests."

Joseph: "Cucumber + SSH'ing into boxes and asserting things. So sans the cucumber stuff."

A: "That's serverspec. You can write harnesses that target specific locations."

MD: "So that tests all the individual machines, but what about things that are outside of the machine itself?"

Q: "So how do you programmatically test something outside your machine? Consume it the same way you consume the production environment. Write more serverspec shit."
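One way to read "consume it the same way you consume the production environment" is to probe the service from outside, as a client would. A minimal Python sketch using only the standard library follows; `port_open` is a hypothetical helper name, and a real check would likely go further than a TCP handshake (an HTTP request, a queue publish, and so on).

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Connect the way a client would; True if the TCP handshake succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Hypothetical target; substitute the address your clients really use.
    print(port_open("127.0.0.1", 443, timeout=0.5))
```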

Adam: "The missing piece is I wanna run this against any environment. That's the part that's weird - where do those definitions come from? When they're dynamic that gets complicated. And there the service bus thing comes back up. So how do I know where to target them?"

A: "You want to decouple them. What Fletcher was talking about in the TK talk earlier was that it finds the bindings early and feeds that to components. It seems like you'd want to do that, feed what the thing should be targeted to the testing tool, so how do we specify? Discover from service bus or pass it in."

Joseph: "Thing about the tool I'm working on is that it's not opinionated. Supports flat file, DNS, etcd, chef searches, someone will add consul later. The reason I wrote it is to have a single API to find that thing."

Q: "So the tool to test infrastructure becomes capable of consuming that tool, but isn't tightly bound to that tool."

A: "Yes!"
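The decoupling described in this exchange might look like the Python sketch below: the test tool asks a single lookup API where a service lives, and the backends behind it are interchangeable. This is not Joseph's tool; `Locator`, the backend functions, and the flat-file format are all invented for illustration, and only flat-file, static, and DNS backends are shown.

```python
import socket
from typing import Callable, Dict, List

# A backend maps a service name to a list of addresses (empty if unknown).
Backend = Callable[[str], List[str]]


def static_backend(table: Dict[str, List[str]]) -> Backend:
    """Addresses passed in directly, e.g. by a test harness."""
    def lookup(service: str) -> List[str]:
        return list(table.get(service, []))
    return lookup


def flat_file_backend(path: str) -> Backend:
    """Hypothetical format: one 'name addr addr ...' entry per line."""
    def lookup(service: str) -> List[str]:
        with open(path) as fh:
            for line in fh:
                fields = line.split()
                if fields and fields[0] == service:
                    return fields[1:]
        return []
    return lookup


def dns_backend(service: str) -> List[str]:
    """Treat the service name as a hostname and return its IPv4 addresses."""
    try:
        return socket.gethostbyname_ex(service)[2]
    except OSError:
        return []


class Locator:
    """Single API over several backends; the first non-empty answer wins."""

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends

    def find(self, service: str) -> List[str]:
        for backend in self.backends:
            hosts = backend(service)
            if hosts:
                return hosts
        return []
```

A test harness would then take a `Locator` as an argument, so the same controls can be pointed at any environment by swapping the backends.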

MD: "So we at CHEF are beginning to work more with traditional hardware vendors. Building resources to manipulate storage or network devices. What we are not doing is building - our direction is TK and ChefDK - we're not building anything that tests that I am correctly creating things on the SAN."

Q: "It seems like you'd want 1:1 serverspec / resource assertions."

A: "Imagine if there was an expectation for every single resource. Some you can't necessarily load information about a service distributed. Can't tell the runlevel something is configured for from outside the box. Maybe it's driven distributed."

A: "Could be simplified if you think about each piece of the infrastructure as black boxes, and ask them to communicate state. We can mock those and then manipulate them to create failure states for testing inside each box."
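The black-box idea can be sketched in Python as a stand-in component that reports its own state and lets a test push it into a failure state. `FakeComponent` and `unhealthy` are hypothetical names, and the state format is an assumption rather than any agreed protocol.

```python
from typing import Dict, List, Optional


class FakeComponent:
    """Mock of a black-box piece of infrastructure that communicates state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.failure: Optional[str] = None

    def inject_failure(self, reason: str) -> None:
        # Manipulate the mock into a failure state for testing.
        self.failure = reason

    def clear_failure(self) -> None:
        self.failure = None

    def state(self) -> Dict[str, object]:
        return {"name": self.name, "ok": self.failure is None, "reason": self.failure}


def unhealthy(components: List[FakeComponent]) -> List[str]:
    """Names of components whose self-reported state is not ok."""
    return [c.name for c in components if not c.state()["ok"]]


if __name__ == "__main__":
    san, queue = FakeComponent("san"), FakeComponent("queue")
    san.inject_failure("volume offline")
    print(unhealthy([san, queue]))
```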

A: "So like, every app needs to speak syslog, right? We need a standardized system for logging."

Charles: "Check out http://www.slideshare.net/m_richardson/serverspec-and-sensu-testing-and-monitoring-collide for Sensu-Serverspec."

What will we do now? What needs to happen next?

  • Create documentation on this, post, discuss more.