$ date
Tue Oct 13 15:10:19 CDT 2015
$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled :-) 2015-10-13 20:12:21
nova-cert localhost.localdomain internal enabled :-) 2015-10-13 20:12:24
nova-network localhost.localdomain internal enabled :-) 2015-10-13 20:12:27
nova-compute localhost.localdomain nova enabled :-) 2015-10-13 20:12:28
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-13 20:12:20
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-13 20:12:20
- Change the date to the future (note the State column is :-) for all services)
$ date
Tue Oct 13 17:23:36 CDT 2015
$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled :-) 2015-10-13 22:25:01
nova-cert localhost.localdomain internal enabled :-) 2015-10-13 22:25:01
nova-network localhost.localdomain internal enabled :-) 2015-10-13 22:25:01
nova-compute localhost.localdomain nova enabled :-) 2015-10-13 22:25:01
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-13 22:25:01
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-13 22:25:01
- Change the date back (note the State column is XXX for all services)
$ date
Tue Oct 13 15:28:25 CDT 2015
$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled XXX 2015-10-13 22:28:01
nova-cert localhost.localdomain internal enabled XXX 2015-10-13 22:28:01
nova-network localhost.localdomain internal enabled XXX 2015-10-13 22:28:01
nova-compute localhost.localdomain nova enabled XXX 2015-10-13 22:28:01
nova-scheduler localhost.localdomain internal enabled XXX 2015-10-13 22:28:01
nova-consoleauth localhost.localdomain internal enabled XXX 2015-10-13 22:28:01
2015-10-13 15:28:51.023 DEBUG nova.servicegroup.drivers.db [req-3d8f32c3-baf3-46fc-a572-bd286bd16e84 None None] Seems service is down. Last heartbeat was 2015-10-13 22:28:01. Elapsed time is -7149.976574 from (pid=29718) is_up /opt/stack/nova/nova/servicegroup/drivers/db.py:75
- Does this bug happen on nova master? Yes, it looks like it.
----------> clean devstack (master), nova at commit e55130a8d6eb348b004d0c1ee0b39b94c2e0b4b5
$ date
Wed Oct 14 17:27:23 CDT 2015
$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled :-) 2015-10-14 22:27:36
nova-cert localhost.localdomain internal enabled :-) 2015-10-14 22:27:29
nova-network localhost.localdomain internal enabled :-) 2015-10-14 22:27:33
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-14 22:27:35
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-14 22:27:31
nova-compute localhost.localdomain nova enabled :-) 2015-10-14 22:27:36
----------> move clock ahead 2 hours
$ date
Wed Oct 14 19:28:08 CDT 2015
$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled :-) 2015-10-15 00:28:21
nova-cert localhost.localdomain internal enabled :-) 2015-10-15 00:28:21
nova-network localhost.localdomain internal enabled :-) 2015-10-15 00:28:21
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-15 00:28:20
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-15 00:28:21
nova-compute localhost.localdomain nova enabled :-) 2015-10-15 00:28:21
----------> move clock back to the actual time
$ date
Wed Oct 14 17:29:54 CDT 2015
$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled XXX 2015-10-15 00:28:51
nova-cert localhost.localdomain internal enabled XXX 2015-10-15 00:28:51
nova-network localhost.localdomain internal enabled XXX 2015-10-15 00:28:51
nova-scheduler localhost.localdomain internal enabled XXX 2015-10-15 00:28:50
nova-consoleauth localhost.localdomain internal enabled XXX 2015-10-15 00:28:51
nova-compute localhost.localdomain nova enabled XXX 2015-10-15 00:28:51
----------> Note that `State` is `XXX` for all services above
$ date
Wed Oct 14 17:39:25 CDT 2015
$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled XXX 2015-10-15 00:28:51
nova-cert localhost.localdomain internal enabled XXX 2015-10-15 00:28:51
nova-network localhost.localdomain internal enabled XXX 2015-10-15 00:28:51
nova-scheduler localhost.localdomain internal enabled XXX 2015-10-15 00:28:50
nova-consoleauth localhost.localdomain internal enabled XXX 2015-10-15 00:28:51
nova-compute localhost.localdomain nova enabled XXX 2015-10-15 00:28:51
----------> Also note that the heartbeats have stopped altogether (see `Updated_At` above)
- The following patch appears to fix most of the services. (A rough sketch of the wall-clock vs. monotonic issue it works around follows the diff.)
diff --git a/nova/openstack/common/loopingcall.py b/nova/openstack/common/loopingcall.py
index 82411af..9369cb2 100644
--- a/nova/openstack/common/loopingcall.py
+++ b/nova/openstack/common/loopingcall.py
@@ -16,21 +16,16 @@
# under the License.
import sys
-import time
from eventlet import event
from eventlet import greenthread
+from oslo_utils import timeutils
from nova.openstack.common.gettextutils import _LE, _LW
from nova.openstack.common import log as logging
LOG = logging.getLogger(__name__)
-# NOTE(zyluo): This lambda function was declared to avoid mocking collisions
-# with time.time() called in the standard logging module
-# during unittests.
-_ts = lambda: time.time()
-
class LoopingCallDone(Exception):
"""Exception to break out and stop a LoopingCallBase.
@@ -76,13 +71,15 @@ class FixedIntervalLoopingCall(LoopingCallBase):
greenthread.sleep(initial_delay)
try:
+ watch = timeutils.StopWatch()
while self._running:
- start = _ts()
+ start = watch.restart()
self.f(*self.args, **self.kw)
- end = _ts()
+ watch.stop()
if not self._running:
break
- delay = end - start - interval
+ elapsed = watch.elapsed()
+ delay = elapsed - interval
if delay > 0:
LOG.warn(_LW('task %(func_name)s run outlasted '
'interval by %(delay).2f sec'),
diff --git a/nova/openstack/common/periodic_task.py b/nova/openstack/common/periodic_task.py
index ebbb98b..fff5c5c 100644
--- a/nova/openstack/common/periodic_task.py
+++ b/nova/openstack/common/periodic_task.py
@@ -14,6 +14,7 @@
import random
import time
+from monotonic import monotonic as now
from oslo.config import cfg
import six
@@ -77,7 +78,7 @@ def periodic_task(*args, **kwargs):
if f._periodic_immediate:
f._periodic_last_run = None
else:
- f._periodic_last_run = time.time()
+ f._periodic_last_run = now()
return f
# NOTE(sirp): The `if` is necessary to allow the decorator to be used with
@@ -154,7 +155,7 @@ def _nearest_boundary(last_run, spacing):
do not synchronize. This jitter is rounded to the nearest second, this
means that spacings smaller than 20 seconds will not have jitter.
"""
- current_time = time.time()
+ current_time = now()
if last_run is None:
return current_time
delta = current_time - last_run
@@ -184,7 +185,7 @@ class PeriodicTasks(object):
# Check if due, if not skip
idle_for = min(idle_for, spacing)
if last_run is not None:
- delta = last_run + spacing - time.time()
+ delta = last_run + spacing - now()
if delta > 0:
idle_for = min(idle_for, delta)
continue
diff --git a/nova/service.py b/nova/service.py
index 5f851fe..8826c0c 100644
--- a/nova/service.py
+++ b/nova/service.py
@@ -21,6 +21,8 @@ import os
import random
import sys
+import eventlet
+import monotonic
from oslo.config import cfg
from oslo import messaging
@@ -132,6 +134,12 @@ class Service(service.Service):
periodic_interval_max=None, db_allowed=True,
*args, **kwargs):
super(Service, self).__init__()
+
+ # bz #1217389: use a monotonic clock for the eventlet event loop
+ # to not hang when the system clock is modified (ex: NTP update)
+ hub = eventlet.hubs.get_hub()
+ hub.clock = monotonic.monotonic
+
self.host = host
self.binary = binary
self.topic = topic
diff --git a/requirements.txt b/requirements.txt
index d9f458d..c0749cc 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -45,3 +45,5 @@ lockfile<=0.8,>=0.8
simplejson<=3.3.1,>=2.2.0
rfc3986<=0.2.0,>=0.2.0 # Apache-2.0
oslo.vmware<0.9.0,>=0.6.0 # Apache-2.0
+monotonic>=0.4
+oslo.utils>=1.6.0 # Apache-2.0
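- Rough sketch (not nova code) of what the patch is working around: the unpatched FixedIntervalLoopingCall derives its next sleep from time.time() deltas (the `delay = end - start - interval` line in the diff above), so a backward clock jump makes the elapsed time negative and the next heartbeat gets pushed hours out; a monotonic clock never sees the jump. The clock change is simulated here rather than done with `date`, and the monotonic package added to requirements.txt above is assumed to be installed.
import time
import monotonic  # third-party monotonic clock, as added to requirements.txt

interval = 10.0  # report_interval-style heartbeat period

def next_sleep(start, end):
    # Mirrors the unpatched loop: delay = end - start - interval, then sleep
    # whatever is left of the interval (never a negative amount).
    delay = end - start - interval
    return -delay if delay < 0 else 0

# Wall clock: simulate `date` being set back two hours between start and end.
start = time.time()
end = start - 7200
print(next_sleep(start, end))       # 7210.0 -> no heartbeat for ~2 hours

# Monotonic clock: the same external change is invisible, so the loop keeps
# sleeping roughly one interval at a time.
start = monotonic.monotonic()
end = monotonic.monotonic()
print(next_sleep(start, end))       # ~10.0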
- Here's me trying to reproduce it, post-patch:
[diana@localhost devstack]$ date
Fri Oct 16 14:57:57 CDT 2015
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled :-) 2015-10-16 19:58:00
nova-cert localhost.localdomain internal enabled :-) 2015-10-16 19:58:03
nova-network localhost.localdomain internal enabled :-) 2015-10-16 19:57:56
nova-compute localhost.localdomain nova enabled :-) 2015-10-16 19:57:57
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-16 19:58:00
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-16 19:57:59
[diana@localhost devstack]$ date
Fri Oct 16 16:58:49 CDT 2015
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled :-) 2015-10-16 21:58:59
nova-cert localhost.localdomain internal enabled :-) 2015-10-16 21:58:57
nova-network localhost.localdomain internal enabled :-) 2015-10-16 21:58:50
nova-compute localhost.localdomain nova enabled :-) 2015-10-16 21:58:51
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-16 21:58:54
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-16 21:58:53
[diana@localhost devstack]$ date
Fri Oct 16 15:00:08 CDT 2015
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled XXX 2015-10-16 21:59:39
nova-cert localhost.localdomain internal enabled :-) 2015-10-16 20:00:03
nova-network localhost.localdomain internal enabled :-) 2015-10-16 20:00:06
nova-compute localhost.localdomain nova enabled :-) 2015-10-16 20:00:07
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-16 20:00:10
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-16 20:00:09
[diana@localhost devstack]$ date
Fri Oct 16 15:01:08 CDT 2015
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled XXX 2015-10-16 21:59:39
nova-cert localhost.localdomain internal enabled :-) 2015-10-16 20:01:03
nova-network localhost.localdomain internal enabled :-) 2015-10-16 20:01:06
nova-compute localhost.localdomain nova enabled :-) 2015-10-16 20:01:07
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-16 20:01:10
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-16 20:01:09
- What's up with nova-conductor?
- What's the difference between eventlet.sleep(0) and time.sleep(0)?
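- Partial answer sketch for the sleep question above (assuming eventlet is installed and monkey patching has not been applied): eventlet.sleep(0) yields to the eventlet hub, so pending greenthreads get a chance to run and the wakeup is scheduled against hub.clock; time.sleep(0) is a plain OS call that never enters the hub. In nova services that call eventlet.monkey_patch(), time.sleep is itself replaced by eventlet's sleep.
import time
import eventlet

ran = []
eventlet.spawn(lambda: ran.append('greenthread ran'))

time.sleep(0)      # plain OS call: no hub switch, spawned greenthread still pending
print(ran)         # []

eventlet.sleep(0)  # cooperative yield: the hub runs the pending greenthread
print(ran)         # ['greenthread ran']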
- Is patching just periodic task sufficient, or does looping call need to be patched too? The above patch patched both.
- Patching just periodic task appears to be sufficient.
- Patching just periodic task and service. Note the State column each time, and that only nova-conductor can't seem to heal itself.
[diana@localhost devstack]$ date
Sat Oct 17 02:55:45 IST 2015
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled :-) 2015-10-17 01:55:56
nova-cert localhost.localdomain internal enabled :-) 2015-10-17 01:55:59
nova-network localhost.localdomain internal enabled :-) 2015-10-17 01:56:02
nova-compute localhost.localdomain nova enabled :-) 2015-10-17 01:55:53
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-17 01:55:55
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-17 01:55:55
[diana@localhost devstack]$ date
Sat Oct 17 04:57:37 IST 2015
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled :-) 2015-10-17 03:57:40
nova-cert localhost.localdomain internal enabled :-) 2015-10-17 03:57:34
nova-network localhost.localdomain internal enabled :-) 2015-10-17 03:57:37
nova-compute localhost.localdomain nova enabled :-) 2015-10-17 03:57:38
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-17 03:57:40
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-17 03:57:40
[diana@localhost devstack]$ date
Sat Oct 17 04:58:02 IST 2015
[diana@localhost devstack]$ date
Sat Oct 17 02:58:09 IST 2015
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled XXX 2015-10-17 03:58:00
nova-cert localhost.localdomain internal enabled :-) 2015-10-17 01:58:09
nova-network localhost.localdomain internal enabled :-) 2015-10-17 01:58:12
nova-compute localhost.localdomain nova enabled XXX 2015-10-17 03:57:58
nova-scheduler localhost.localdomain internal enabled XXX 2015-10-17 03:58:00
nova-consoleauth localhost.localdomain internal enabled XXX 2015-10-17 03:58:00
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled XXX 2015-10-17 03:58:00
nova-cert localhost.localdomain internal enabled :-) 2015-10-17 01:58:09
nova-network localhost.localdomain internal enabled :-) 2015-10-17 01:58:12
nova-compute localhost.localdomain nova enabled :-) 2015-10-17 01:58:13
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-17 01:58:15
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-17 01:58:15
[diana@localhost devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor localhost.localdomain internal enabled XXX 2015-10-17 03:58:00
nova-cert localhost.localdomain internal enabled :-) 2015-10-17 01:58:19
nova-network localhost.localdomain internal enabled :-) 2015-10-17 01:58:12
nova-compute localhost.localdomain nova enabled :-) 2015-10-17 01:58:13
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-17 01:58:15
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-17 01:58:15
2015-10-17 03:03:15.111 DEBUG nova.servicegroup.drivers.db [req-99ffe846-0438-434c-84b2-e6263075495e None None] Seems service is down. Last heartbeat was 2015-10-17 03:58:00. Elapsed time is -6884.888095 from (pid=13288) is_up /opt/stack/nova/nova/servicegroup/drivers/db.py:75
diff --git a/nova/openstack/common/periodic_task.py b/nova/openstack/common/periodic_task.py
index ebbb98b..fff5c5c 100644
--- a/nova/openstack/common/periodic_task.py
+++ b/nova/openstack/common/periodic_task.py
@@ -14,6 +14,7 @@
import random
import time
+from monotonic import monotonic as now
from oslo.config import cfg
import six
@@ -77,7 +78,7 @@ def periodic_task(*args, **kwargs):
if f._periodic_immediate:
f._periodic_last_run = None
else:
- f._periodic_last_run = time.time()
+ f._periodic_last_run = now()
return f
# NOTE(sirp): The `if` is necessary to allow the decorator to be used with
@@ -154,7 +155,7 @@ def _nearest_boundary(last_run, spacing):
do not synchronize. This jitter is rounded to the nearest second, this
means that spacings smaller than 20 seconds will not have jitter.
"""
- current_time = time.time()
+ current_time = now()
if last_run is None:
return current_time
delta = current_time - last_run
@@ -184,7 +185,7 @@ class PeriodicTasks(object):
# Check if due, if not skip
idle_for = min(idle_for, spacing)
if last_run is not None:
- delta = last_run + spacing - time.time()
+ delta = last_run + spacing - now()
if delta > 0:
idle_for = min(idle_for, delta)
continue
diff --git a/nova/service.py b/nova/service.py
index 5f851fe..8826c0c 100644
--- a/nova/service.py
+++ b/nova/service.py
@@ -21,6 +21,8 @@ import os
import random
import sys
+import eventlet
+import monotonic
from oslo.config import cfg
from oslo import messaging
@@ -132,6 +134,12 @@ class Service(service.Service):
periodic_interval_max=None, db_allowed=True,
*args, **kwargs):
super(Service, self).__init__()
+
+ # bz #1217389: use a monotonic clock for the eventlet event loop
+ # to not hang when the system clock is modified (ex: NTP update)
+ hub = eventlet.hubs.get_hub()
+ hub.clock = monotonic.monotonic
+
self.host = host
self.binary = binary
self.topic = topic
diff --git a/requirements.txt b/requirements.txt
index d9f458d..866b7a0 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -45,3 +45,4 @@ lockfile<=0.8,>=0.8
simplejson<=3.3.1,>=2.2.0
rfc3986<=0.2.0,>=0.2.0 # Apache-2.0
oslo.vmware<0.9.0,>=0.6.0 # Apache-2.0
+monotonic>=0.4
- Query logs, database rows, service list
2015-10-20 13:34:14.910 INFO sqlalchemy.engine.base.Engine [req-4be85c99-4f8c-4dfb-8162-493277649116 None None] SELECT services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.id AS services_id, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.disabled_reason AS services_disabled_reason
FROM services
WHERE services.deleted = %s AND services.id = %s
LIMIT %s
2015-10-20 13:34:14.910 INFO sqlalchemy.engine.base.Engine [req-4be85c99-4f8c-4dfb-8162-493277649116 None None] (0, 4, 1)
2015-10-20 13:34:14.912 INFO sqlalchemy.engine.base.Engine [req-4be85c99-4f8c-4dfb-8162-493277649116 None None] UPDATE services SET updated_at=%s, report_count=%s WHERE services.id = %s
2015-10-20 13:34:14.913 INFO sqlalchemy.engine.base.Engine [req-4be85c99-4f8c-4dfb-8162-493277649116 None None] (datetime.datetime(2015, 10, 20, 18, 34, 14, 912251), 6828, 4L)
2015-10-20 13:34:14.915 INFO sqlalchemy.engine.base.Engine [req-4be85c99-4f8c-4dfb-8162-493277649116 None None] COMMIT
$ nova-manage service list
nova-conductor localhost.localdomain internal enabled XXX 2015-10-20 18:40:12
nova-cert localhost.localdomain internal enabled :-) 2015-10-20 18:34:55
nova-network localhost.localdomain internal enabled :-) 2015-10-20 18:34:33
nova-compute localhost.localdomain nova enabled :-) 2015-10-20 18:34:24
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-20 18:34:56
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-20 18:35:05
MariaDB [nova]> select * from services order by id\G
*************************** 1. row ***************************
created_at: 2015-10-19 23:19:40
updated_at: 2015-10-20 18:40:12
deleted_at: NULL
id: 1
host: localhost.localdomain
binary: nova-conductor
topic: conductor
report_count: 6065
disabled: 0
deleted: 0
disabled_reason: NULL
*************************** 2. row ***************************
created_at: 2015-10-19 23:19:43
updated_at: 2015-10-20 18:37:15
deleted_at: NULL
id: 2
host: localhost.localdomain
binary: nova-cert
topic: cert
report_count: 6903
disabled: 0
deleted: 0
disabled_reason: NULL
*************************** 3. row ***************************
created_at: 2015-10-19 23:19:46
updated_at: 2015-10-20 18:34:33
deleted_at: NULL
id: 3
host: localhost.localdomain
binary: nova-network
topic: network
report_count: 6830
disabled: 0
deleted: 0
disabled_reason: NULL
*************************** 4. row ***************************
created_at: 2015-10-19 23:19:47
updated_at: 2015-10-20 18:34:24
deleted_at: NULL
id: 4
host: localhost.localdomain
binary: nova-compute
topic: compute
report_count: 6829
disabled: 0
deleted: 0
disabled_reason: NULL
*************************** 5. row ***************************
created_at: 2015-10-19 23:19:49
updated_at: 2015-10-20 18:37:16
deleted_at: NULL
id: 5
host: localhost.localdomain
binary: nova-scheduler
topic: scheduler
report_count: 6945
disabled: 0
deleted: 0
disabled_reason: NULL
*************************** 6. row ***************************
created_at: 2015-10-19 23:19:58
updated_at: 2015-10-20 18:37:15
deleted_at: NULL
id: 6
host: localhost.localdomain
binary: nova-consoleauth
topic: consoleauth
report_count: 6944
disabled: 0
deleted: 0
disabled_reason: NULL
6 rows in set (0.00 sec)
- Restart so that all services are in a good state. Are the logs from service id 1 different at all?
2015-10-20 13:55:15.221 INFO sqlalchemy.engine.base.Engine [-] SELECT services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.id AS services_id, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.disabled_reason AS services_disabled_reason
FROM services
WHERE services.deleted = %s AND services.id = %s
LIMIT %s
2015-10-20 13:55:15.221 INFO sqlalchemy.engine.base.Engine [-] (0, 1L, 1)
2015-10-20 13:55:15.224 INFO sqlalchemy.engine.base.Engine [-] UPDATE services SET updated_at=%s, report_count=%s WHERE services.id = %s
2015-10-20 13:55:15.224 INFO sqlalchemy.engine.base.Engine [-] (datetime.datetime(2015, 10, 20, 18, 55, 15, 224134), 6089L, 1L)
2015-10-20 13:55:15.226 INFO sqlalchemy.engine.base.Engine [-] COMMIT
2015-10-20 13:55:14.965 INFO sqlalchemy.engine.base.Engine [req-700cda20-4939-4ecd-a17a-1793e53e88aa None None] SELECT services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.id AS services_id, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.disabled_reason AS services_disabled_reason
FROM services
WHERE services.deleted = %s AND services.id = %s
LIMIT %s
2015-10-20 13:55:14.966 INFO sqlalchemy.engine.base.Engine [req-700cda20-4939-4ecd-a17a-1793e53e88aa None None] (0, 4, 1)
2015-10-20 13:55:14.971 INFO sqlalchemy.engine.base.Engine [req-700cda20-4939-4ecd-a17a-1793e53e88aa None None] UPDATE services SET updated_at=%s, report_count=%s WHERE services.id = %s
2015-10-20 13:55:14.971 INFO sqlalchemy.engine.base.Engine [req-700cda20-4939-4ecd-a17a-1793e53e88aa None None] (datetime.datetime(2015, 10, 20, 18, 55, 14, 971237), 6895, 4L)
2015-10-20 13:55:14.972 INFO sqlalchemy.engine.base.Engine [req-700cda20-4939-4ecd-a17a-1793e53e88aa None None] COMMIT
- Perhaps the issues with nova-conductor are fixed in master. The logs above are from Juno. Same issue in Liberty with a similar patch? Answer: yup, also in Liberty (even with the eventlet patch).
[diana@reserved devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor reserved internal enabled :-) 2015-10-20 21:00:33
nova-cert reserved internal enabled :-) 2015-10-20 21:00:30
nova-network reserved internal enabled :-) 2015-10-20 21:00:34
nova-scheduler reserved internal enabled :-) 2015-10-20 21:00:38
nova-consoleauth reserved internal enabled :-) 2015-10-20 21:00:34
nova-compute reserved nova enabled :-) 2015-10-20 21:00:38
[diana@reserved devstack]$ date
Tue Oct 20 16:00:43 CDT 2015
[diana@reserved devstack]$ date
Tue Oct 20 16:10:29 CDT 2015
[diana@reserved devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor reserved internal enabled :-) 2015-10-20 21:10:31
nova-cert reserved internal enabled :-) 2015-10-20 21:10:34
nova-network reserved internal enabled :-) 2015-10-20 21:10:28
nova-scheduler reserved internal enabled :-) 2015-10-20 21:10:32
nova-consoleauth reserved internal enabled :-) 2015-10-20 21:10:28
nova-compute reserved nova enabled :-) 2015-10-20 21:10:32
[diana@reserved devstack]$ date
Tue Oct 20 16:01:58 CDT 2015
[diana@reserved devstack]$ nova-manage service list
Binary Host Zone Status State Updated_At
nova-conductor reserved internal enabled XXX 2015-10-20 21:10:41
nova-cert reserved internal enabled :-) 2015-10-20 21:02:00
nova-network reserved internal enabled :-) 2015-10-20 21:01:54
nova-scheduler reserved internal enabled :-) 2015-10-20 21:01:58
nova-consoleauth reserved internal enabled :-) 2015-10-20 21:02:04
nova-compute reserved nova enabled :-) 2015-10-20 21:01:58
- Relevant snippets of code
class DbDriver(base.Driver):

    def join(self, member, group, service=None):
        LOG.debug('DB_Driver: join new ServiceGroup member %(member)s to '
                  'the %(group)s group, service = %(service)s',
                  {'member': member, 'group': group,
                   'service': service})
        if service is None:
            raise RuntimeError(_('service is a mandatory argument for DB based'
                                 ' ServiceGroup driver'))
        report_interval = service.report_interval
        if report_interval:
            service.tg.add_timer(report_interval, self._report_state,
                                 api.INITIAL_REPORTING_DELAY, service)
    def is_up(self, service_ref):
        """Moved from nova.utils

        Check whether a service is up based on last heartbeat.
        """
        # Keep checking 'updated_at' if 'last_seen_up' isn't set.
        # Should be able to use only 'last_seen_up' in the M release
        last_heartbeat = (service_ref.get('last_seen_up') or
            service_ref['updated_at'] or service_ref['created_at'])
        if isinstance(last_heartbeat, six.string_types):
            # NOTE(russellb) If this service_ref came in over rpc via
            # conductor, then the timestamp will be a string and needs to be
            # converted back to a datetime.
            last_heartbeat = timeutils.parse_strtime(last_heartbeat)
        else:
            # Objects have proper UTC timezones, but the timeutils comparison
            # below does not (and will fail)
            last_heartbeat = last_heartbeat.replace(tzinfo=None)
        # Timestamps in DB are UTC.
        elapsed = timeutils.delta_seconds(last_heartbeat, timeutils.utcnow())
        is_up = abs(elapsed) <= self.service_down_time
        if not is_up:
            LOG.debug('Seems service is down. Last heartbeat was %(lhb)s. '
                      'Elapsed time is %(el)s',
                      {'lhb': str(last_heartbeat), 'el': str(elapsed)})
        return is_up
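- Worked example of the is_up() arithmetic above, using the numbers from the first reproduction: the heartbeat row was written while the clock was about two hours ahead, then the clock was set back, so elapsed goes strongly negative, abs(elapsed) blows past service_down_time (60 seconds by default), and the service shows XXX until a fresh heartbeat overwrites the future timestamp.
import datetime

service_down_time = 60  # nova's default CONF.service_down_time

last_heartbeat = datetime.datetime(2015, 10, 13, 22, 28, 1)  # written while the clock was ahead
utcnow = datetime.datetime(2015, 10, 13, 20, 28, 51)         # clock set back to real time

elapsed = (utcnow - last_heartbeat).total_seconds()
print(elapsed)                            # -7150.0, matching the debug log above (-7149.98)
print(abs(elapsed) <= service_down_time)  # False -> nova-manage shows XXX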
    def _report_state(self, service):
        """Update the state of this service in the datastore."""
        try:
            service.service_ref.report_count += 1
            service.service_ref.save()

            if getattr(service, 'model_disconnected', False):
                service.model_disconnected = False
                LOG.info(
                    _LI('Recovered connection to nova-conductor '
                        'for reporting service status.'))

        # because we are communicating over conductor, a failure to
        # connect is going to be a messaging failure, not a db error.
        except messaging.MessagingTimeout:
            if not getattr(service, 'model_disconnected', False):
                service.model_disconnected = True
                LOG.warn(_LW('Lost connection to nova-conductor '
                             'for reporting service status.'))
@oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
def service_update(context, service_id, values):
    session = get_session()
    with session.begin():
        service_ref = _service_get(context, service_id, session=session)
        # Only servicegroup.drivers.db.DbDriver._report_state() updates
        # 'report_count', so if that value changes then store the timestamp
        # as the last time we got a state report.
        if 'report_count' in values:
            if values['report_count'] > service_ref.report_count:
                service_ref.last_seen_up = timeutils.utcnow()

        service_ref.update(values)
    return service_ref
    def add_timer(self, interval, callback, initial_delay=None,
                  *args, **kwargs):
        pulse = loopingcall.FixedIntervalLoopingCall(callback, *args, **kwargs)
        pulse.start(interval=interval,
                    initial_delay=initial_delay)
        self.timers.append(pulse)
        return pulse
MariaDB [nova]> select * from services order by id asc\G
*************************** 1. row ***************************
created_at: 2015-10-21 18:11:54
updated_at: 2015-10-21 22:12:30
deleted_at: NULL
id: 1
host: localhost.localdomain
binary: nova-conductor
topic: conductor
report_count: 1270
disabled: 0
deleted: 0
disabled_reason: NULL
last_seen_up: 2015-10-21 22:12:30
forced_down: 0
version: 2
*************************** 2. row ***************************
created_at: 2015-10-21 18:11:57
updated_at: 2015-10-21 22:08:53
deleted_at: NULL
id: 3
host: localhost.localdomain
binary: nova-cert
topic: cert
report_count: 1401
disabled: 0
deleted: 0
disabled_reason: NULL
last_seen_up: 2015-10-21 22:08:53
forced_down: 0
version: 2
*************************** 3. row ***************************
created_at: 2015-10-21 18:12:00
updated_at: 2015-10-21 22:08:53
deleted_at: NULL
id: 4
host: localhost.localdomain
binary: nova-network
topic: network
report_count: 1361
disabled: 0
deleted: 0
disabled_reason: NULL
last_seen_up: 2015-10-21 22:08:53
forced_down: 0
version: 2
*************************** 4. row ***************************
created_at: 2015-10-21 18:12:03
updated_at: 2015-10-21 22:08:59
deleted_at: NULL
id: 5
host: localhost.localdomain
binary: nova-scheduler
topic: scheduler
report_count: 1395
disabled: 0
deleted: 0
disabled_reason: NULL
last_seen_up: 2015-10-21 22:08:59
forced_down: 0
version: 2
*************************** 5. row ***************************
created_at: 2015-10-21 18:12:09
updated_at: 2015-10-21 22:08:54
deleted_at: NULL
id: 6
host: localhost.localdomain
binary: nova-consoleauth
topic: consoleauth
report_count: 1392
disabled: 0
deleted: 0
disabled_reason: NULL
last_seen_up: 2015-10-21 22:08:54
forced_down: 0
version: 2
*************************** 6. row ***************************
created_at: 2015-10-21 18:12:13
updated_at: 2015-10-21 22:08:53
deleted_at: NULL
id: 7
host: localhost.localdomain
binary: nova-compute
topic: compute
report_count: 1355
disabled: 0
deleted: 0
disabled_reason: NULL
last_seen_up: 2015-10-21 22:08:53
forced_down: 0
version: 2
6 rows in set (0.00 sec)
$ nova-manage service list
nova-conductor localhost.localdomain internal enabled XXX 2015-10-21 22:12:30
nova-cert localhost.localdomain internal enabled :-) 2015-10-21 22:09:33
nova-network localhost.localdomain internal enabled :-) 2015-10-21 22:09:33
nova-scheduler localhost.localdomain internal enabled :-) 2015-10-21 22:09:29
nova-consoleauth localhost.localdomain internal enabled :-) 2015-10-21 22:09:34
nova-compute localhost.localdomain nova enabled :-) 2015-10-21 22:09:33
2015-10-22 16:38:15.564 service_update(7)
2015-10-22 16:38:20.124 BEFORE (11, '_report_state')
2015-10-22 16:38:20.128 service_update(1)
2015-10-22 16:38:20.138 BEFORE (11, '_report_state')
2015-10-22 16:38:20.141 service_update(1)
2015-10-22 16:38:20.145 11 '_report_state' sleeping for 9.98 seconds
2015-10-22 16:38:20.160 11 '_report_state' sleeping for 9.98 seconds
2015-10-22 16:38:25.551 service_update(4)
2015-10-22 16:38:25.567 service_update(7)
2015-10-22 16:38:30.126 AFTER (11, 9.979305732005741)
2015-10-22 16:38:30.126 BEFORE (12, '_report_state')
2015-10-22 16:38:30.133 service_update(1)
2015-10-22 16:38:30.144 AFTER (11, 9.979391832006513)
2015-10-22 16:38:30.145 BEFORE (12, '_report_state')
2015-10-22 16:38:30.150 service_update(1)
2015-10-22 16:38:30.156 12 '_report_state' sleeping for 9.99 seconds
2015-10-22 16:38:30.157 12 '_report_state' sleeping for 9.97 seconds
2015-10-22 16:35:25.619 service_update(4)
2015-10-22 16:35:25.633 service_update(7)
2015-10-22 16:35:35.620 service_update(4)
2015-10-22 16:35:35.632 service_update(7)
...
- logs truncated from 16:35:35 to 16:38:25
- there are no logs where service_id=1 until the actual time catches up to the future time
- the only logs are service_update calls for service ids 4 & 7
2015-10-22 16:38:25.616 service_update(4)
2015-10-22 16:38:25.629 service_update(7)
2015-10-22 16:38:35.617 service_update(4)
2015-10-22 16:38:35.629 service_update(7)
2015-10-22 16:38:40.127 AFTER (12, 199.90615309600253) <---- BOOM!
2015-10-22 16:38:40.127 BEFORE (13, '_report_state')
2015-10-22 16:38:40.131 service_update(1)
2015-10-22 16:38:40.145 AFTER (12, 199.92564915500407) <---- BOOM!
2015-10-22 16:38:40.145 BEFORE (13, '_report_state')
- The BEFORE / AFTER / "sleeping for" lines above come from this instrumentation added to the oslo.service looping-call loop (foo is just a per-iteration counter):
while self._running:
    foo += 1
    LOG.error('-------------------> BEFORE (%d, %r)', foo, func_name)
    watch.restart()
    result = func(*self.args, **self.kw)
    watch.stop()
    if not self._running:
        break
    idle = idle_for_func(result, watch.elapsed())
    LOG.error('%(foo)s %(func_name)r sleeping '
              'for %(idle).02f seconds',
              {'func_name': func_name, 'idle': idle,
               'foo': foo})
    watch.restart()
    greenthread.sleep(idle)
    watch.stop()
    LOG.error('-------------------> AFTER (%d, %r)', foo, watch.elapsed())
- Yup, in all the other services hub.clock is monotonic. Only the conductor service is still pointing to time.
nova-scheduler:
2015-10-22 17:09:17.727 ERROR oslo.service.loopingcall [-] -----> hub.clock: <function monotonic at 0x7fde77199500>
nova-conductor:
2015-10-22 17:12:10.528 ERROR oslo.service.loopingcall [-] -----> hub.clock: <built-in function time>
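- Minimal check (not nova code) that produces the kind of output shown above: ask the running eventlet hub which clock it uses, and point it at monotonic the way the Service.__init__ hunk in the patch does.
import eventlet.hubs
import monotonic

hub = eventlet.hubs.get_hub()
print('hub.clock: %r' % hub.clock)  # <built-in function time> means the wall clock

hub.clock = monotonic.monotonic     # what the nova/service.py hunk does
print('hub.clock: %r' % hub.clock)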