Ceph: Troubleshooting - AaronPei/blog GitHub Wiki
Ceph troubleshooting: "rbd: error: image still has watchers"
Problem description
Sometimes a virtual machine cannot be deleted, and nova-compute.log shows that removing its image failed:
2019-03-22 10:09:44.135 5354 TRACE nova.openstack.common.rpc.amqp rbd.RBD().remove(client.ioctx, volume)
2019-03-22 10:09:44.135 5354 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/rbd.py", line 300, in remove
2019-03-22 10:09:44.135 5354 TRACE nova.openstack.common.rpc.amqp raise make_ex(ret, 'error removing image')
2019-03-22 10:09:44.135 5354 TRACE nova.openstack.common.rpc.amqp 'ImageBusy: error removing image'
Since Nova cannot delete the VM's image, what happens if we remove it directly with rbd rm?
[root@xxx ~]# rbd rm ac49-48ab-993e-8a666fbbe658_disk.swap -p vms
2019-03-22 10:12:23.099964 7f35442bb760 -1 librbd: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.
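The removal fails in the same way, with the error "image still has watchers".
Cause
The image is still being accessed by a client, which shows up as a watcher on the image. If that client has crashed or hung without releasing its watch, the image cannot be deleted.
What is a watcher?
Ceph has a watch/notify mechanism that works at object granularity. It passes notifications between clients so that their state stays consistent. Every client that registers a watch is, from the Ceph cluster's point of view, a watcher.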
Solution
The approach: find the watcher, add it to the Ceph blacklist, and then delete the image.
View the image's current watchers
1) Find the image's header object
[root@xxx ~]# rbd info ac49-48ab-993e-8a666fbbe658_disk.swap -p vms
rbd image 'ac49-48ab-993e-8a666fbbe658_disk.swap':
size 512 MB in 128 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.bc2ae8944a
format: 2
features: layering
The image's block_name_prefix is rbd_data.bc2ae8944a, so its header object is rbd_header.bc2ae8944a. With the header object known, we can look up its watchers.
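The prefix-to-header mapping can be scripted. The following is a minimal sketch, assuming a format 2 image whose prefix has the form rbd_data.<id>; the function name header_object_from_info is hypothetical, and it only parses text on stdin, so it never touches the cluster by itself.

```shell
# Read `rbd info` output on stdin and print the matching header object name.
# Assumption: format 2 image, block_name_prefix of the form rbd_data.<id>.
# Prints nothing if no such prefix is found (e.g. a format 1 image).
header_object_from_info() {
    awk '$1 == "block_name_prefix:" && sub(/^rbd_data\./, "rbd_header.", $2) {
        print $2
        exit
    }'
}

# Example (needs access to the cluster):
#   rbd info ac49-48ab-993e-8a666fbbe658_disk.swap -p vms | header_object_from_info
```

Keeping the parsing separate from the rbd call makes it easy to check against saved rbd info output before using it on a live cluster.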
2) List the watchers on the image's header object
[root@xxx ~]# rados -p vms listwatchers rbd_header.bc2ae8944a
watcher=10.xxx.xxx.47:0/2024175 client.38437876 cookie=1
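The address needed for the blacklist is the value after watcher= on each line. A small sketch for extracting it, assuming the output format shown above (watcher=<addr> client.<id> cookie=<n>); watcher_addrs is a hypothetical helper name.

```shell
# Read `rados listwatchers` output on stdin and print one watcher address per line.
# Assumption: each line looks like "watcher=<addr> client.<id> cookie=<n>".
watcher_addrs() {
    sed -n 's/^watcher=\([^ ]*\) .*/\1/p'
}

# Example (needs access to the cluster):
#   rados -p vms listwatchers rbd_header.bc2ae8944a | watcher_addrs
```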
Remove the watcher from the image
1) First add the watcher to the Ceph blacklist
[root@xxx ~]# ceph osd blacklist add 10.xxx.xxx.47:0/2024175
blacklisting 10.xxx.xxx.47:0/2024175 until 2019-03-22 11:21:25.710014 (3600 sec)
2) Delete the image
[root@xxx ~]# rbd rm ac49-48ab-993e-8a666fbbe658_disk.swap -p vms
Removing image: 100% complete...done.
3) Delete the virtual machine that could not be deleted earlier
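The manual steps above (find the header object, list its watchers, blacklist them, remove the image) can be strung together. The following is only a sketch under the same assumptions as the walkthrough: a format 2 image and the rbd, rados and ceph invocations shown above. The function name force_remove_image is hypothetical. Blacklisting cuts the client off from the cluster, so confirm that the watcher really is a dead or stale client before running anything like this.

```shell
# Sketch: blacklist every watcher of an RBD image, then remove the image.
# Usage: force_remove_image <pool> <image>
# force_remove_image is a hypothetical helper, not part of Ceph.
force_remove_image() {
    local pool=$1 image=$2 header addr

    # Step 1: derive rbd_header.<id> from the block_name_prefix (format 2 only).
    header=$(rbd info "$image" -p "$pool" |
        awk '$1 == "block_name_prefix:" && sub(/^rbd_data\./, "rbd_header.", $2) { print $2; exit }')
    if [ -z "$header" ]; then
        echo "could not determine header object for $pool/$image" >&2
        return 1
    fi

    # Step 2: blacklist each watcher address found on the header object.
    rados -p "$pool" listwatchers "$header" |
        sed -n 's/^watcher=\([^ ]*\) .*/\1/p' |
        while read -r addr; do
            ceph osd blacklist add "$addr"
        done

    # Step 3: remove the image now that its watchers are blacklisted.
    rbd rm "$image" -p "$pool"
}

# Example (destructive, needs access to the cluster):
#   force_remove_image vms ac49-48ab-993e-8a666fbbe658_disk.swap
```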
Restoring the blacklist
1) List the Ceph blacklist
[root@xxx ~]# ceph osd blacklist ls
listed 2 entries
10.xxx.xxx.47:0/2024175 2019-03-22 11:18:55.627116
2) Remove the entry from the blacklist
[root@xxx ~]# ceph osd blacklist rm 10.xxx.xxx.47:0/2024175
un-blacklisting 10.xxx.xxx.47:0/2024175
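The cleanup can be made a little safer by checking that the address is actually listed before removing it. A minimal sketch, assuming the ceph osd blacklist ls output format shown above (one "<addr> <expiry date> <expiry time>" entry per line); blacklisted_addrs and unblacklist_if_present are hypothetical helper names.

```shell
# Read `ceph osd blacklist ls` output on stdin and print only the addresses.
# Lines such as "listed 2 entries" are skipped because their first field
# does not look like <ip>:<port>/<nonce>.
blacklisted_addrs() {
    awk '$1 ~ /:[0-9]+\/[0-9]+$/ { print $1 }'
}

# Remove an address from the blacklist only if it is currently listed.
unblacklist_if_present() {
    local addr=$1
    if ceph osd blacklist ls 2>/dev/null | blacklisted_addrs | grep -Fxq -- "$addr"; then
        ceph osd blacklist rm "$addr"
    else
        echo "$addr is not blacklisted" >&2
    fi
}

# Example (needs access to the cluster):
#   unblacklist_if_present 10.xxx.xxx.47:0/2024175
```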