Blueprint - michaelsevilla/mantle-popper GitHub Wiki

Blueprint

Summary

Multiple active MDSs can migrate directories to balance metadata load. The policies for when, where, and how much to migrate are hard-coded into the metadata balancing module. Mantle is a programmable metadata balancer built into the MDS. The idea is to protect the mechanisms for balancing load (migration, replication, fragmentation) but stub out the balancing policies using Lua.

This PR does not have the following features from the Supercomputing paper:

  1. Balancing API: all we require is that the balancer written in Lua returns a targets table, where each entry is the amount of load to send to the corresponding MDS

  2. "How much" hook: this lets the user define meta_load()

  3. Instantaneous CPU utilization as metric

Supercomputing '15 Paper: http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html

Owners

Interested Parties

  • Name (Affiliation)
  • Name (Affiliation)
  • Name (Affiliation)

Current Status

Uses the Lua fork from https://github.com/ceph/ceph/pull/7338. It sits alongside the current balancer implementation and is enabled with a string in ceph.conf.

Questions:

  • Do we want to be able to dynamically load C++ balancers (like cls)?

  • How do I test this?

  • What security features (similar to whitelisting cls classes) do we need?

  • This is implemented in automake -- is cmake required?

  • How do I do the documentation? Quickstart guide?

Detailed Description

Mantle Components:

  1. mantle: write balancer policies in Lua

  2. mantle: store balancer in RADOS, version in MDSMap

  3. mds: expose instantaneous cpu utilization as a metric

Write Balancer Policies in Lua

Exposing Metrics to Lua

Metrics are exposed directly to the Lua code as global variables instead of using a well-defined function signature. There is a global "mds" table, where each index is an MDS number (e.g., 0) and each value is a dictionary of metrics and values. The Lua code can grab metrics using something like this:

mds[0]["queue_len"]
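As a minimal sketch of how a policy might read these globals, the snippet below builds a stand-in mds table and picks the MDS with the shortest queue. Inside the MDS the table is already populated; the metric values here are invented, and the helper name least_loaded is hypothetical.

```lua
-- Stand-in for the global table that the MDS would populate.
-- The metric values below are invented for illustration.
mds = {
  [0] = { queue_len = 12, req_rate = 40 },
  [1] = { queue_len = 3,  req_rate = 15 },
  [2] = { queue_len = 7,  req_rate = 22 },
}

-- Hypothetical helper: return the rank with the smallest value of `metric`.
local function least_loaded(metric)
  local best_rank, best_value = nil, math.huge
  for rank = 0, #mds do            -- ranks start at 0, so #mds is the last rank
    local value = mds[rank][metric]
    if value < best_value then
      best_rank, best_value = rank, value
    end
  end
  return best_rank, best_value
end

local rank, qlen = least_loaded("queue_len")
print(rank, qlen)
```

Because the metrics are plain table fields, a new metric added on the C++ side would be readable the same way without changing the helper.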

This is in contrast to cls-lua in the OSDs, which has well-defined arguments (e.g., input/output bufferlists). Exposing the metrics directly makes it easier to add new metrics without having to change the API on the Lua side; we want the API to grow and shrink as we explore which metrics matter. The downside of this approach is that the person programming Lua balancer policies has to look at the Ceph source code to see which metrics are exposed. We figure that the Mantle developer will be in touch with MDS internals anyway.

Compile/Execute the Balancer

Here we use lua_pcall instead of lua_call because we want to handle errors in the MDBalancer. We do not want the error propagating up the call chain. The cls Lua class wants to handle the error itself because it must fail gracefully. For Mantle, we don't care if a Lua error crashes our balancer -- in that case, we'll fall back to the original balancer.

The small performance gain of lua_call over lua_pcall does not matter here because the balancer is invoked only once every 10 seconds by default.
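The fallback idea can be sketched from the Lua side, where pcall plays the role that lua_pcall plays in the C API. This only illustrates the pattern and is not the MDBalancer code; run_policy and default_targets are hypothetical names, and the all-zero fallback table is an assumption standing in for the original balancer.

```lua
-- load() accepts strings in Lua 5.2+; Lua 5.1 uses loadstring().
local compile = loadstring or load

-- Hypothetical stand-in for the original (non-Lua) balancer decision.
local function default_targets()
  return {0, 0, 0}
end

-- Compile and run a policy in protected mode; fall back on any error.
local function run_policy(source)
  local chunk, compile_err = compile(source)
  if not chunk then
    return default_targets(), "compile error: " .. tostring(compile_err)
  end
  local ok, result = pcall(chunk)
  if not ok then
    return default_targets(), "runtime error: " .. tostring(result)
  end
  if type(result) ~= "table" then
    return default_targets(), "policy did not return a table"
  end
  return result, nil
end

local targets, err = run_policy("return {3, 4, 5}")
print(#targets, err)

-- A policy that raises an error does not take the caller down with it.
local fallback, why = run_policy("error('bad policy')")
print(#fallback, why)
```

The caller always gets a usable table back, plus a message it can log when the policy was rejected.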

References: Stack Overflow 1 and Stack Overflow 2

Returning Policy Decision to C++

We force the Lua policy engine to return a table of values, corresponding to the amount of load to send to each MDS. We do not allow the policy to return a table of MDSs and metrics because we want the decision to be made entirely on the Lua side.
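To make the contract concrete, here is a small sketch of a policy body that produces such a table. The rule itself (send half of the local load to the least-loaded peer) is made up, and the function name half_to_least_loaded, the loads argument, and the convention that entry i+1 of the result corresponds to MDS i are all assumptions for illustration, not part of Mantle.

```lua
-- loads[i+1] is the load of MDS i; `me` is the rank running the policy.
local function half_to_least_loaded(loads, me)
  local targets = {}
  for i = 1, #loads do
    targets[i] = 0                 -- default: send nothing
  end

  -- Find the least-loaded MDS other than ourselves.
  local best
  for i = 1, #loads do
    if i ~= me + 1 and (best == nil or loads[i] < loads[best]) then
      best = i
    end
  end

  -- Only migrate if the peer is actually less loaded than we are.
  if best and loads[best] < loads[me + 1] then
    targets[best] = loads[me + 1] / 2
  end
  return targets
end

local targets = half_to_least_loaded({10, 2, 6}, 0)
print(table.concat(targets, ", "))
```

The result contains only amounts of load, so the C++ side has nothing left to decide beyond carrying out the migration.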

Iterating through tables returned by Lua is done through the stack. In Lua jargon: a dummy key (nil) is pushed onto the stack, and each call to lua_next pops that key and pushes the next (k, v) pair. After reading each value, pop that value but keep the key on the stack for the next call to lua_next.

Reference: blog
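Lua's own next function follows the same protocol as lua_next in the C API, so the traversal can be illustrated without any C++: start from a nil key, then feed the previous key back in to get the following pair. This is a sketch of the protocol only; collect_targets is a hypothetical name.

```lua
-- Walk a table with next(), mirroring the lua_next() loop in the C API:
-- begin with a nil key, then pass the previous key back in each time.
local function collect_targets(t)
  local out, count = {}, 0
  local key, value = next(t, nil)   -- nil key asks for the first pair
  while key ~= nil do
    out[key] = value                -- read the value
    count = count + 1
    key, value = next(t, key)       -- keep the key for the next step
  end
  return out, count
end

local copy, n = collect_targets({3, 4, 5})
print(n, copy[1], copy[2], copy[3])
```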

Debugging

Log messages emitted by a Lua policy appear in the MDS log (/var/log/mds.a.log). The syntax is the same as the cls logging interface:

BAL_LOG(0, "this is a log message")

It is implemented by passing a function that wraps the dout logging framework (dout_wrapper) to Lua with the lua_register() primitive. The Lua code is actually calling the dout function in C++.

References: blog, StackOverflow (call c++ function from lua)
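When trying out a policy outside the MDS, BAL_LOG will not be defined. One option, sketched below, is to install a stand-in with the same two-argument shape that prints to standard output and records the messages. The verbosity threshold and the captured table are assumptions made for this sketch; inside the MDS the real function is the C++ wrapper described above.

```lua
-- Messages recorded by the stand-in, so a test can inspect them.
local captured = {}
local max_level = 5                 -- hypothetical verbosity threshold

-- Only define the stub when the MDS has not registered the real function.
if BAL_LOG == nil then
  function BAL_LOG(level, message)
    if level <= max_level then
      captured[#captured + 1] = message
      print(string.format("[balancer %d] %s", level, message))
    end
  end
end

BAL_LOG(0, "this is a log message")
BAL_LOG(10, "too verbose, dropped by the stub")
```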

Testing

./vstart -n -l

./ceph mds set allow_multimds true --yes-i-really-mean-it
./ceph mds set max_mds 5
for i in a b c; do
  ./ceph --admin-daemon out/mds.$i.asok config set debug_ms 0
  ./ceph --admin-daemon out/mds.$i.asok config set debug_mds_balancer 5
done

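-- Log a fixed set of metrics for every MDS, then return the targets table.
local metrics = {"auth.meta_load", "all.meta_load", "req_rate", "queue_len", "cpu_load_avg"}
for i = 0, #mds do  -- mds is indexed from 0, so #mds is the highest rank
  local s = "MDS"..i..": < "
  for j = 1, #metrics do
    -- tostring() keeps a missing (nil) metric from raising a concatenation error
    s = s..metrics[j].."="..tostring(mds[i][metrics[j]]).." "
  end
  BAL_LOG(0, s..">")
end

-- Amount of load to send to each MDS.
return {3, 4, 5}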

TODO: write guide for adding metrics

TODO: rebase on top of new cls-lua

TODO: test multi-node container deployment

TODO: write unit tests