Blueprint - michaelsevilla/mantle-popper GitHub Wiki
Blueprint
Summary
Multiple active MDSs can migrate directories to balance metadata load. The policies for when, where, and how much to migrate are hard-coded into the metadata balancing module. Mantle is a programmable metadata balancer built into the MDS. The idea is to protect the mechanisms for balancing load (migration, replication, fragmentation) but stub out the balancing policies using Lua.
This PR does not have the following features from the Supercomputing paper:
- Balancing API: all we require is that a balancer written in Lua returns a targets table, where each index is the amount of load to send to each MDS
- "How much" hook: this lets the user define meta_load()
- Instantaneous CPU utilization as a metric
Supercomputing '15 Paper: http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html
Owners
- Michael Sevilla (UC Santa Cruz, [email protected])
- Name
- Name
Interested Parties
- Name (Affiliation)
- Name (Affiliation)
- Name (Affiliation)
Current Status
Uses the Lua fork from https://github.com/ceph/ceph/pull/7338. Mantle sits alongside the current balancer implementation and is enabled with a string in ceph.conf.
Questions:
- Do we want to be able to dynamically load C++ balancers (like cls)?
- How do I test this?
- What security features (similar to whitelisting cls classes) do we need?
- This is implemented in automake; is cmake required?
- How do I do the documentation? Quickstart guide?
Detailed Description
Mantle Components:
- mantle: write balancer policies in Lua
- mantle: store balancer in RADOS, version in MDSMap
- mds: expose instantaneous cpu utilization as a metric
Write Balancer Policies in Lua
Exposing Metrics to Lua
Metrics are exposed directly to the Lua code as global variables instead of using a well-defined function signature. There is a global "mds" table, where each index is an MDS number (e.g., 0) and each value is a dictionary of metrics and values. The Lua code can grab metrics using something like this:
mds[0]["queue_len"]
This is in contrast to cls-lua in the OSDs, which has well-defined arguments (e.g., input/output bufferlists). Exposing the metrics directly makes it easier to add new metrics without having to change the API on the Lua side; we want the API to grow and shrink as we explore which metrics matter. The downside of this approach is that the person programming Lua balancer policies has to look at the Ceph source code to see which metrics are exposed. We figure that the Mantle developer will be familiar with MDS internals anyway.
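As a rough illustration of the global-table access pattern described above, the sketch below picks the MDS with the largest value of a given metric. Outside the MDS there is no mds global, so the example defines a stand-in with invented numbers. The helper name busiest is hypothetical and not part of Mantle. The metric names are taken from the test policy in the Testing section.

```lua
-- Stand-in for the global table the MDS would inject; all values are invented.
mds = {
  [0] = { ["queue_len"] = 12, ["req_rate"] = 300 },
  [1] = { ["queue_len"] = 2,  ["req_rate"] = 40 },
}

-- Hypothetical helper: return the rank and value of the MDS whose
-- metric is largest, skipping any MDS that does not report it.
function busiest(metric)
  local best_rank, best_value = nil, -math.huge
  for rank, metrics in pairs(mds) do
    local value = metrics[metric]
    if value ~= nil and value > best_value then
      best_rank, best_value = rank, value
    end
  end
  return best_rank, best_value
end

local rank, value = busiest("queue_len")
print(rank, value)
```

Because the metrics are plain table entries, a policy that asks for a metric the MDS does not export simply gets nil back, which is why the helper checks for it explicitly.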
Compile/Execute the Balancer
Here we use lua_pcall instead of lua_call because we want to handle errors in the MDBalancer. We do not want the error propagating up the call chain. The cls Lua class wants to handle the error itself because it must fail gracefully. For Mantle, we don't care if a Lua error crashes our balancer; in that case, we'll fall back to the original balancer.
The performance advantage of lua_call over lua_pcall is negligible here because the balancer is invoked only every 10 seconds by default.
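The protected-call-then-fall-back flow can be sketched in pure Lua, where pcall is the script-level counterpart of the C API's lua_pcall. This is not the MDBalancer code, which lives in C++. The names run_policy and original_balancer are hypothetical, and the zero-filled fallback table is only a placeholder for whatever the built-in balancer would decide.

```lua
-- Lua 5.1 names the string compiler loadstring; 5.2 and later use load.
local compile = loadstring or load

-- Hypothetical stand-in for the built-in C++ balancer.
function original_balancer()
  return { 0, 0, 0 }
end

-- Compile and run a policy in protected mode. Any failure is caught
-- here and answered with the fallback instead of propagating upward.
function run_policy(source)
  local chunk, compile_err = compile(source)
  if not chunk then
    return original_balancer(), "compile error: " .. tostring(compile_err)
  end
  local ok, result = pcall(chunk)
  if not ok then
    return original_balancer(), "runtime error: " .. tostring(result)
  end
  if type(result) ~= "table" then
    return original_balancer(), "policy did not return a table"
  end
  return result, nil
end

local targets, err = run_policy("return {3, 4, 5}")
print(targets[1], err)

local fallback, why = run_policy("error('boom')")
print(fallback[1], why)
```

The second return value carries the reason for the fallback so that the caller can log it.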
References: Stack Overflow 1 and Stack Overflow 2
Returning Policy Decision to C++
We force the Lua policy engine to return a table of values, corresponding to the amount of load to send to each MDS. We do not allow the MDS to return a table of MDSs and metrics because we want the decision to be completely made on the Lua side.
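A minimal sketch of a policy that makes the whole decision in Lua and hands back only a table of amounts might look like the following. It rests on several assumptions: the mds table is mocked with invented loads, my_rank is a hypothetical parameter standing for the MDS running the policy, and slot rank+1 of the returned table is assumed to correspond to MDS rank, since Lua tables built with {3, 4, 5} start at index 1.

```lua
-- Mock of the metrics table the MDS would provide; numbers are invented.
mds = {
  [0] = { ["all.meta_load"] = 90 },
  [1] = { ["all.meta_load"] = 10 },
  [2] = { ["all.meta_load"] = 40 },
}

-- my_rank is a hypothetical parameter: the rank of the MDS running the policy.
function build_targets(my_rank)
  local targets = {}
  local coolest, coolest_load = nil, math.huge
  for rank = 0, #mds do
    targets[rank + 1] = 0          -- assumed mapping: slot rank+1 is MDS rank
    local rank_load = mds[rank]["all.meta_load"]
    if rank ~= my_rank and rank_load < coolest_load then
      coolest, coolest_load = rank, rank_load
    end
  end
  -- Send half of the local load to the least-loaded peer, if there is one.
  if coolest ~= nil then
    targets[coolest + 1] = mds[my_rank]["all.meta_load"] / 2
  end
  return targets
end

local targets = build_targets(0)
print(targets[1], targets[2], targets[3])
```

The C++ side never sees the metrics that led to the decision, only the amounts, which keeps the policy logic entirely in the script.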
Iterating through tables returned by Lua is done through the stack. In Lua jargon: a dummy value (nil) is pushed onto the stack as the starting key, and lua_next replaces the key on top of the stack with a (k, v) pair. After reading each value, pop that value but keep the key for the next call to lua_next.
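The same iteration protocol can be seen from the Lua side with next, which the C function lua_next mirrors: start from a nil key, read each value, and feed the previous key back in to get the following pair. The sketch below is an analogy only; the real traversal happens in C++ on the Lua stack, and collect_targets is a hypothetical name.

```lua
-- Walk a targets table with next(), the script-level twin of lua_next.
function collect_targets(targets)
  local seen = {}
  local count = 0
  local key, value = next(targets, nil)  -- nil plays the role of the dummy first key
  while key ~= nil do
    seen[key] = value                    -- read the value
    count = count + 1
    key, value = next(targets, key)      -- keep the key to fetch the next pair
  end
  return seen, count
end

local seen, count = collect_targets({3, 4, 5})
print(count, seen[1], seen[2], seen[3])
```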
Reference: blog
Debugging
Log messages written by a Lua policy appear in the MDS log (/var/log/mds.a.log). The syntax is the same as the cls logging interface:
BAL_LOG(0, "this is a log message")
It is implemented by passing a function that wraps the dout logging framework (dout_wrapper) to Lua with the lua_register() primitive. The Lua code is actually calling the dout function in C++.
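When a policy is run outside the MDS, for example in a standalone Lua interpreter, BAL_LOG is not registered and any call to it fails. One way around this is a stub with the same (level, message) shape, sketched below. The stub is not part of Mantle: log_lines and debug_level are hypothetical names, and dropping messages above the configured level is an assumption modeled on how dout debug levels usually behave.

```lua
-- Hypothetical stand-ins: collected output and the active debug level.
log_lines = {}
debug_level = 5

-- Stub with the same call shape as the BAL_LOG registered by the MDS.
function BAL_LOG(level, message)
  if level <= debug_level then
    log_lines[#log_lines + 1] = string.format("[%d] %s", level, tostring(message))
  end
end

BAL_LOG(0, "this is a log message")
BAL_LOG(10, "too verbose for the current level")
print(#log_lines, log_lines[1])
```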
References: blog, StackOverflow (call c++ function from lua)
Testing
./vstart -n -l
./ceph mds set allow_multimds true --yes-i-really-mean-it
./ceph mds set max_mds 5
for i in a b c; do
  ./ceph --admin-daemon out/mds.$i.asok config set debug_ms 0
  ./ceph --admin-daemon out/mds.$i.asok config set debug_mds_balancer 5
done
metrics = {"auth.meta_load", "all.meta_load", "req_rate", "queue_len", "cpu_load_avg"}
-- Log every exposed metric for each MDS; ranks start at 0.
for i=0, #mds do
  local s = "MDS"..i..": < "
  for j=1, #metrics do
    s = s..metrics[j].."="..mds[i][metrics[j]].." "
  end
  BAL_LOG(0, s..">")
end
-- Amount of load to send to each MDS.
return {3, 4, 5}