Producer Metadata 数据结构与读取、更新策略 - tenji/ks GitHub Wiki
Metadata
这个类被 client 线程和后台 sender 所共享,它只保存了所有 topic 的部分数据,当我们请求一个它上面没有的 topic meta 时,它会通过发送 metadata update 来更新 meta 信息,如果 topic meta 过期策略是允许的,那么任何 topic 过期的话都会被从集合中移除。
public final class Metadata {
private static final Logger log = LoggerFactory.getLogger(Metadata.class);
public static final long TOPIC_EXPIRY_MS = 5 * 60 * 1000;
private static final long TOPIC_EXPIRY_NEEDS_UPDATE = -1L;
private final long refreshBackoffMs; // metadata 更新失败时,为避免频繁更新 meta,最小的间隔时间,默认 100ms
private final long metadataExpireMs; // metadata 的过期时间, 默认 60,000ms
private int version; // 每更新成功1次,version自增1,主要是用于判断 metadata 是否更新
private long lastRefreshMs; // 最近一次更新时的时间(包含更新失败的情况)
private long lastSuccessfulRefreshMs; // 最近一次成功更新的时间(如果每次都成功的话,与前面的值相等, 否则,lastSuccessulRefreshMs < lastRefreshMs)
private Cluster cluster; // 集群中一些 topic 的信息
private boolean needUpdate; // 是都需要更新 metadata
/* Topics with expiry time */
private final Map<String, Long> topics; // topic 与其过期时间的对应关系
private final List<Listener> listeners; // 事件监控者
private final ClusterResourceListeners clusterResourceListeners; //当接收到 metadata 更新时, ClusterResourceListeners的列表
private boolean needMetadataForAllTopics; // 是否强制更新所有的 metadata
private final boolean topicExpiryEnabled; // 默认为 true, Producer 会定时移除过期的 topic,consumer 则不会移除
}
关于 topic 的详细信息(leader 所在节点、replica 所在节点、isr 列表)都是在Cluster
实例中保存的。
// 并不是一个全集,metadata的主要组成部分
public final class Cluster {
// 从命名直接就看出了各个变量的用途
private final boolean isBootstrapConfigured;
private final List<Node> nodes; // node 列表
private final Set<String> unauthorizedTopics; // 未认证的 topic 列表
private final Set<String> internalTopics; // 内置的 topic 列表
private final Map<TopicPartition, PartitionInfo> partitionsByTopicPartition; // partition 的详细信息
private final Map<String, List<PartitionInfo>> partitionsByTopic; // topic 与 partition 的对应关系
private final Map<String, List<PartitionInfo>> availablePartitionsByTopic; // 可用(leader 不为 null)的 topic 与 partition 的对应关系
private final Map<Integer, List<PartitionInfo>> partitionsByNode; // node 与 partition 的对应关系
private final Map<Integer, Node> nodesById; // node 与 id 的对应关系
private final ClusterResource clusterResource;
}
// org.apache.kafka.common.PartitionInfo
// topic-partition: 包含 topic、partition、leader、replicas、isr
public class PartitionInfo {
private final String topic;
private final int partition;
private final Node leader;
private final Node[] replicas;
private final Node[] inSyncReplicas;
}
Cluster
实例主要是保存:
- broker.id 与
node
的对应关系; - topic 与 partition (
PartitionInfo
)的对应关系; -
node
与 partition (PartitionInfo
)的对应关系。
Producer 在调用dosend()
方法时,第一步就是通过waitOnMetadata
方法获取该 topic 的 metadata 信息。
// 等待 metadata 的更新
private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long maxWaitMs) throws InterruptedException {
metadata.add(topic); // 在 metadata 中添加 topic 后,如果 metadata 中没有这个 topic 的 meta,那么 metadata 的更新标志设置为了 true
Cluster cluster = metadata.fetch();
Integer partitionsCount = cluster.partitionCountForTopic(topic); // 如果 topic 已经存在 meta 中,则返回该 topic 的 partition 数,否则返回 null
// 当前 metadata 中如果已经有这个 topic 的 meta 的话,就直接返回
if (partitionsCount != null && (partition == null || partition < partitionsCount))
return new ClusterAndWaitTime(cluster, 0);
long begin = time.milliseconds();
long remainingWaitMs = maxWaitMs;
long elapsed;
// 发送 metadata 请求,直到获取了这个 topic 的 metadata 或者请求超时
do {
log.trace("Requesting metadata update for topic {}.", topic);
int version = metadata.requestUpdate(); // 返回当前版本号,初始值为0,每次更新时会自增,并将 needUpdate 设置为 true
sender.wakeup();// 唤起 sender,发送 metadata 请求
try {
metadata.awaitUpdate(version, remainingWaitMs); // 等待 metadata 的更新
} catch (TimeoutException ex) {
// Rethrow with original maxWaitMs to prevent logging exception with remainingWaitMs
throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms.");
}
cluster = metadata.fetch();
elapsed = time.milliseconds() - begin;
if (elapsed >= maxWaitMs)
throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms."); // 超时
if (cluster.unauthorizedTopics().contains(topic)) // 认证失败,对当前 topic 没有 Write 权限
throw new TopicAuthorizationException(topic);
remainingWaitMs = maxWaitMs - elapsed;
partitionsCount = cluster.partitionCountForTopic(topic);
} while (partitionsCount == null); // 不停循环,直到 partitionsCount 不为 null(即直到 metadata 中已经包含了这个 topic 的相关信息)
if (partition != null && partition >= partitionsCount) {
throw new KafkaException(
String.format("Invalid partition given with record: %d is not in the range [0...%d).", partition, partitionsCount));
}
return new ClusterAndWaitTime(cluster, elapsed);
}
如果 metadata 中不存在这个 topic 的 metadata,那么就请求更新 metadata,如果 metadata 没有更新的话,方法就一直处在 do ... while
的循环之中,在循环之中,主要做以下操作:
- metadata.requestUpdate() 将 metadata 的 needUpdate 变量设置为 true(强制更新),并返回当前的版本号(version),通过版本号来判断 metadata 是否完成更新;
- sender.wakeup() 唤醒 sender 线程,sender 线程又会去唤醒 NetworkClient 线程,NetworkClient 线程进行一些实际的操作(后面详细介绍);
- metadata.awaitUpdate(version, remainingWaitMs) 等待 metadata 的更新。
// 更新 metadata 信息(根据当前 version 值来判断)
public synchronized void awaitUpdate(final int lastVersion, final long maxWaitMs) throws InterruptedException {
if (maxWaitMs < 0) {
throw new IllegalArgumentException("Max time to wait for metadata updates should not be < 0 milli seconds");
}
long begin = System.currentTimeMillis();
long remainingWaitMs = maxWaitMs;
while (this.version <= lastVersion) { // 不断循环,直到 metadata 更新成功,version 自增
if (remainingWaitMs != 0)
wait(remainingWaitMs); // 阻塞线程,等待 metadata 的更新
long elapsed = System.currentTimeMillis() - begin;
if (elapsed >= maxWaitMs) // timeout
throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms.");
remainingWaitMs = maxWaitMs - elapsed;
}
}
在 Metadata.awaitUpdate() 方法中,线程会阻塞在 while 循环中,直到 metadata 更新成功或者 timeout。
从前面可以看出,此时 Producer 线程会阻塞在两个 while 循环中,直到 metadata 信息更新,那么 metadata 是如何更新的呢?如果有印象的话,前面应该已经介绍过了,主要是通过 sender.wakeup() 来唤醒 sender 线程,间接唤醒 NetworkClient 线程,NetworkClient 线程来负责发送 Metadata 请求,并处理 Server 端的响应。
sender 线程会调用 NetworkClient.poll() 方法进行实际的操作,其源码如下:
public List<ClientResponse> poll(long timeout, long now) {
long metadataTimeout = metadataUpdater.maybeUpdate(now); // 判断是否需要更新 meta,如果需要就更新(请求更新 metadata 的地方)
try {
this.selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs));
} catch (IOException e) {
log.error("Unexpected error during I/O", e);
}
// process completed actions
long updatedNow = this.time.milliseconds();
List<ClientResponse> responses = new ArrayList<>();
handleAbortedSends(responses);
handleCompletedSends(responses, updatedNow); // 通过 selector 中获取 Server 端的 response
handleCompletedReceives(responses, updatedNow); // 在返回的 handler 中,会处理 metadata 的更新
handleDisconnections(responses, updatedNow);
handleConnections();
handleInitiateApiVersionRequests(updatedNow);
handleTimedOutRequests(responses, updatedNow);
// invoke callbacks
for (ClientResponse response : responses) {
try {
response.onComplete();
} catch (Exception e) {
log.error("Uncaught error in request completion:", e);
}
}
return responses;
}
在这个方法中,主要会以下操作:
- metadataUpdater.maybeUpdate(now):判断是否需要更新 Metadata,如果需要更新的话,先与 Broker 建立连接,然后发送更新 metadata 的请求;
- 处理 Server 端的一些响应,这里主要讨论的是 handleCompletedReceives(responses, updatedNow) 方法,它会处理 Server 端返回的 Metadata 结果。
先看一下 metadataUpdater.maybeUpdate() 的具体实现:
public long maybeUpdate(long now) {
// should we update our metadata?
// metadata 是否应该更新
long timeToNextMetadataUpdate = metadata.timeToNextUpdate(now);// metadata 下次更新的时间(需要判断是强制更新还是 metadata 过期更新,前者是立马更新,后者是计算 metadata 的过期时间)
// 如果一条 metadata 的 fetch 请求还未从 server 收到恢复,那么时间设置为 waitForMetadataFetch(默认30s)
long waitForMetadataFetch = this.metadataFetchInProgress ? requestTimeoutMs : 0;
long metadataTimeout = Math.max(timeToNextMetadataUpdate, waitForMetadataFetch);
if (metadataTimeout > 0) { // 时间未到时,直接返回下次应该更新的时间
return metadataTimeout;
}
Node node = leastLoadedNode(now); // 选择一个连接数最小的节点
if (node == null) {
log.debug("Give up sending metadata request since no node is available");
return reconnectBackoffMs;
}
return maybeUpdate(now, node); // 可以发送 metadata 请求的话,就发送 metadata 请求
}
/**
* Add a metadata request to the list of sends if we can make one
*/
// 判断是否可以发送请求,可以的话将 metadata 请求加入到发送列表中
private long maybeUpdate(long now, Node node) {
String nodeConnectionId = node.idString();
if (canSendRequest(nodeConnectionId)) { // 通道已经 ready 并且支持发送更多的请求
this.metadataFetchInProgress = true; // 准备开始发送数据,将 metadataFetchInProgress 置为 true
MetadataRequest.Builder metadataRequest; // 创建 metadata 请求
if (metadata.needMetadataForAllTopics()) // 强制更新所有 topic 的 metadata(虽然默认不会更新所有 topic 的 metadata 信息,但是每个 Broker 会保存所有 topic 的 meta 信息)
metadataRequest = MetadataRequest.Builder.allTopics();
else // 只更新 metadata 中的 topics 列表(列表中的 topics 由 metadata.add() 得到)
metadataRequest = new MetadataRequest.Builder(new ArrayList<>(metadata.topics()));
log.debug("Sending metadata request {} to node {}", metadataRequest, node.id());
sendInternalMetadataRequest(metadataRequest, nodeConnectionId, now); // 发送 metadata 请求
return requestTimeoutMs;
}
// If there's any connection establishment underway, wait until it completes. This prevents
// the client from unnecessarily connecting to additional nodes while a previous connection
// attempt has not been completed.
if (isAnyNodeConnecting()) { // 如果 client 正在与任何一个 node 的连接状态是 connecting,那么就进行等待
// Strictly the timeout we should return here is "connect timeout", but as we don't
// have such application level configuration, using reconnect backoff instead.
return reconnectBackoffMs;
}
if (connectionStates.canConnect(nodeConnectionId, now)) { // 如果没有连接这个 node,那就初始化连接
// we don't have a connection to this node right now, make one
log.debug("Initialize connection to node {} for sending metadata request", node.id());
initiateConnect(node, now); // 初始化连接
return reconnectBackoffMs;
}
return Long.MAX_VALUE;
}
// 发送 Metadata 请求
private void sendInternalMetadataRequest(MetadataRequest.Builder builder,
String nodeConnectionId, long now) {
ClientRequest clientRequest = newClientRequest(nodeConnectionId, builder, now, true);// 创建 metadata 请求
doSend(clientRequest, true, now);
}
所以,每次 Producer 请求更新 metadata 时,会有以下几种情况:
- 如果 node 可以发送请求,则直接发送请求;
- 如果该 node 正在建立连接,则直接返回;
- 如果该 node 还没建立连接,则向 broker 初始化链接。
而 KafkaProducer 线程之前是一直阻塞在两个 while 循环中,直到 metadata 更新
- sender 线程第一次调用 poll() 方法时,初始化与 node 的连接;
- sender 线程第二次调用 poll() 方法时,发送 Metadata 请求;
- sender 线程第三次调用 poll() 方法时,获取 metadataResponse,并更新 metadata。
经过上述 sender 线程三次调用 poll()方法,所请求的 metadata 信息才会得到更新,此时 Producer 线程也不会再阻塞,开始发送消息。
NetworkClient 接收到 Server 端对 Metadata 请求的响应后,更新 Metadata 信息。
// 处理任何已经完成的接收响应
private void handleCompletedReceives(List<ClientResponse> responses, long now) {
for (NetworkReceive receive : this.selector.completedReceives()) {
String source = receive.source();
InFlightRequest req = inFlightRequests.completeNext(source);
AbstractResponse body = parseResponse(receive.payload(), req.header);
log.trace("Completed receive from node {}, for key {}, received {}", req.destination, req.header.apiKey(), body);
if (req.isInternalRequest && body instanceof MetadataResponse) // 如果是 meta 响应
metadataUpdater.handleCompletedMetadataResponse(req.header, now, (MetadataResponse) body);
else if (req.isInternalRequest && body instanceof ApiVersionsResponse)
handleApiVersionsResponse(responses, req, now, (ApiVersionsResponse) body); // 如果是其他响应
else
responses.add(req.completed(body, now));
}
}
// 处理 Server 端对 Metadata 请求处理后的 response
public void handleCompletedMetadataResponse(RequestHeader requestHeader, long now, MetadataResponse response) {
this.metadataFetchInProgress = false;
Cluster cluster = response.cluster();
// check if any topics metadata failed to get updated
Map<String, Errors> errors = response.errors();
if (!errors.isEmpty())
log.warn("Error while fetching metadata with correlation id {} : {}", requestHeader.correlationId(), errors);
// don't update the cluster if there are no valid nodes...the topic we want may still be in the process of being
// created which means we will get errors and no nodes until it exists
if (cluster.nodes().size() > 0) {
this.metadata.update(cluster, now); // 更新 meta 信息
} else { // 如果 metadata 中 node 信息无效,则不更新 metadata 信息
log.trace("Ignoring empty metadata response with correlation id {}.", requestHeader.correlationId());
this.metadata.failedUpdate(now);
}
}
- initiateConnect 失败时
org.apache.kafka.clients.NetworkClient
类:
private void initiateConnect(Node node, long now) {
String nodeConnectionId = node.idString();
try {
log.debug("Initiating connection to node {} at {}:{}.", node.id(), node.host(), node.port());
this.connectionStates.connecting(nodeConnectionId, now);
selector.connect(nodeConnectionId,
new InetSocketAddress(node.host(), node.port()),
this.socketSendBuffer,
this.socketReceiveBuffer);
} catch (IOException e) {
/* attempt failed, we'll try again after the backoff */
connectionStates.disconnected(nodeConnectionId, now);
/* maybe the problem is our metadata, update it */
metadataUpdater.requestUpdate(); // 尝试更新 Metadata
log.debug("Error connecting to node {} at {}:{}:", node.id(), node.host(), node.port(), e);
}
}
- Poll 有连接断开时
org.apache.kafka.clients.NetworkClient
类:
private void handleDisconnections(List<ClientResponse> responses, long now) {
for (String node : this.selector.disconnected()) {
log.debug("Node {} disconnected.", node);
processDisconnection(responses, node, now);
}
// we got a disconnect so we should probably refresh our metadata and see if that broker is dead
if (this.selector.disconnected().size() > 0)
metadataUpdater.requestUpdate(); // 尝试更新 Metadata
}
- Poll 有连接超时时
org.apache.kafka.clients.NetworkClient
类:
private void handleTimedOutRequests(List<ClientResponse> responses, long now) {
List<String> nodeIds = this.inFlightRequests.getNodesWithTimedOutRequests(now, this.requestTimeoutMs);
for (String nodeId : nodeIds) {
// close connection to the node
this.selector.close(nodeId);
log.debug("Disconnecting from node {} due to request timeout.", nodeId);
processDisconnection(responses, nodeId, now);
}
// we disconnected, so we should probably refresh our metadata
if (!nodeIds.isEmpty())
metadataUpdater.requestUpdate(); // 尝试更新 Metadata
}
- 生产时有部分 Partition 的 Leader 找不到
org.apache.kafka.clients.producer.internals.Sender
类:
void run(long now) {
Cluster cluster = metadata.fetch();
// get the list of partitions with data ready to send
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
// if there are any partitions whose leaders are not known yet, force metadata update
if (!result.unknownLeaderTopics.isEmpty()) {
// The set of topics with unknown leader contains topics with leader election pending as well as
// topics which may have expired. Add the topic again to metadata to ensure it is included
// and request metadata update, since there are messages to send to the topic.
for (String topic : result.unknownLeaderTopics)
this.metadata.add(topic);
this.metadata.requestUpdate(); // 尝试更新 Metadata
}
- 返回的 Response 有异常时
org.apache.kafka.clients.producer.internals.Sender
类:
private void handleProduceResponse(ClientResponse response, Map<TopicPartition, RecordBatch> batches, long now) {
int correlationId = response.requestHeader().correlationId();
if (response.wasDisconnected()) {
log.trace("Cancelled request {} due to node {} being disconnected", response, response.destination());
for (RecordBatch batch : batches.values())
completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION), correlationId, now);
} else if (response.versionMismatch() != null) {
log.warn("Cancelled request {} due to a version mismatch with node {}",
response, response.destination(), response.versionMismatch());
for (RecordBatch batch : batches.values())
completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.INVALID_REQUEST), correlationId, now);
} else {
...
}
}
private void completeBatch(RecordBatch batch, ProduceResponse.PartitionResponse response, long correlationId,
long now) {
Errors error = response.error;
if (error != Errors.NONE && canRetry(batch, error)) {
// retry
log.warn("Got error produce response with correlation id {} on topic-partition {}, retrying ({} attempts left). Error: {}",
correlationId,
batch.topicPartition,
this.retries - batch.attempts - 1,
error);
this.accumulator.reenqueue(batch, now);
this.sensors.recordRetries(batch.topicPartition.topic(), batch.recordCount);
} else {
RuntimeException exception;
if (error == Errors.TOPIC_AUTHORIZATION_FAILED)
exception = new TopicAuthorizationException(batch.topicPartition.topic());
else
exception = error.exception();
// tell the user the result of their request
batch.done(response.baseOffset, response.logAppendTime, exception);
this.accumulator.deallocate(batch);
if (error != Errors.NONE)
this.sensors.recordErrors(batch.topicPartition.topic(), batch.recordCount);
}
if (error.exception() instanceof InvalidMetadataException) {
if (error.exception() instanceof UnknownTopicOrPartitionException)
log.warn("Received unknown topic or partition error in produce request on partition {}. The " +
"topic/partition may not exist or the user may not have Describe access to it", batch.topicPartition);
metadata.requestUpdate(); // 尝试更新 Metadata
}
// Unmute the completed partition.
if (guaranteeMessageOrder)
this.accumulator.unmutePartition(batch.topicPartition);
}
总而言之,发生各式各样的异常,数据不同步,都认为 metadata 可能出问题了,都尝试去更新。
另外,Metadata 更新的特点:
- 更新请求 MetadataRequest 是 nio 异步发送的,在 poll 的返回中,处理 MetadataResponse 的时候,才真正更新 Metadata;
- 更新的时候,是从 metadata 保存的所有 Node,或者说 Broker 中,选负载最小的那个,也就是当前接收请求最少的那个。向其发送 MetadataRequest 请求,获取新的 Cluster 对象。