#8 - "linux, xenbus mutex hangs when rebooting dom0 and guests hung."

Owner: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Date: Tue May 28 18:15:02 2013

Last Update: Tue May 28 18:15:02 2013

Severity: normal

Affects:

State: Closed



Missing Control message: <20141024181544.GA16071@laptop.dumpdata.com>; (Archives: marc.info, gmane)


From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: xen@bugs.xenproject.org, "Ren, Yongjie" <yongjie.ren@intel.com>, george.dunlap@eu.citrix.com
Cc: "Liu, SongtaoX" <songtaox.liu@intel.com>, "Tian, Yongxue" <yongxue.tian@intel.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, "Xu, YongweiX" <yongweix.xu@intel.com>
Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
Date: Tue, 28 May 2013 11:21:56 -0400
Message-ID: <20130528152156.GB3027@phenom.dumpdata.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

> > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> >   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> 
> Ok, I can reproduce that too.

This is what dom0 tells me:

[  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
[  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  483.620747] init            D ffff880062b59c78  5904  4163      1 0x00000000
[  483.637699]  ffff880062b59bc8 0000000000000
[  483.655189]  ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
[  483.672505]  ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
[  483.689527] Call Trace:
[  483.706298]  [<ffffffff816a0814>] schedule+0x24/0x70
[  483.723604]  [<ffffffff813bb0dd>] read_reply+0xad/0x160
[  483.741162]  [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
[  483.758572]  [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
[  483.775741]  [<ffffffff813bb3c6>] xs_single+0x46/0x60
[  483.792791]  [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
[  483.809929]  [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
[  483.826947]  [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
[  483.843792]  [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
[  483.860412]  [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
[  483.877312]  [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
[  483.894036]  [<ffffffff8142e275>] device_shutdown+0x15/0x180
[  483.910605]  [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
[  483.927100]  [<ffffffff810a88a1>] kernel_restart+0x11
[  483.943262]  [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
[  483.959480]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[  483.975786]  [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
[  483.991819]  [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
[  484.007675]  [<ffffffff8115c725>] ? __free_pages+0x25/0x
[  484.023336]  [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
[  484.039176]  [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
[  484.055174]  [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
[  484.070747]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[  484.086121]  [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
[  484.101318]  [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
[  484.116585] 3 locks held by init/4163:
[  484.131650]  #0:  (reboot_mutex){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
[  484.147704]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
[  484.164359]  #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0

create !
title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: George Dunlap <george.dunlap@eu.citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Liu, SongtaoX" <songtaox.liu@intel.com>, "Tian, Yongxue" <yongxue.tian@intel.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, xen@bugs.xenproject.org, "Xu, YongweiX" <yongweix.xu@intel.com>, "Ren, Yongjie" <yongjie.ren@intel.com>
Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
Date: Tue, 28 May 2013 16:24:48 +0100
Message-ID: <51A4CC40.1090802@eu.citrix.com>


On 28/05/13 16:21, Konrad Rzeszutek Wilk wrote:
>>> 5. Dom0 cannot be shutdown before PCI device detachment from guest
>>>    http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
>> Ok, I can reproduce that too.
> This is what dom0 tells me:
>
> [  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> [  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  483.620747] init            D ffff880062b59c78  5904  4163      1 0x00000000
> [  483.637699]  ffff880062b59bc8 0000000000000
> [  483.655189]  ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> [  483.672505]  ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> [  483.689527] Call Trace:
> [  483.706298]  [<ffffffff816a0814>] schedule+0x24/0x70
> [  483.723604]  [<ffffffff813bb0dd>] read_reply+0xad/0x160
> [  483.741162]  [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> [  483.758572]  [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> [  483.775741]  [<ffffffff813bb3c6>] xs_single+0x46/0x60
> [  483.792791]  [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> [  483.809929]  [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
> [  483.826947]  [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> [  483.843792]  [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> [  483.860412]  [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> [  483.877312]  [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> [  483.894036]  [<ffffffff8142e275>] device_shutdown+0x15/0x180
> [  483.910605]  [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> [  483.927100]  [<ffffffff810a88a1>] kernel_restart+0x11
> [  483.943262]  [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> [  483.959480]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [  483.975786]  [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> [  483.991819]  [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> [  484.007675]  [<ffffffff8115c725>] ? __free_pages+0x25/0x
> [  484.023336]  [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> [  484.039176]  [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> [  484.055174]  [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> [  484.070747]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [  484.086121]  [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> [  484.101318]  [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> [  484.116585] 3 locks held by init/4163:
> [  484.131650]  #0:  (reboot_mutex){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> [  484.147704]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> [  484.164359]  #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
>
> create !
> title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."

1. I think that these commands have to come at the top
2. You don't need quotes in the title
3. You need to be polite and say "thanks" at the end so it knows it can 
stop paying attention. :-)

  -George




From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: "Ren, Yongjie" <yongjie.ren@intel.com>, xen@bugs.xenproject.org, george.dunlap@eu.citrix.com
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, "Liu, SongtaoX" <songtaox.liu@intel.com>, "Tian, Yongxue" <yongxue.tian@intel.com>, "Xu, YongweiX" <yongweix.xu@intel.com>
Subject: [Xen-devel] Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Date: Fri, 8 Nov 2013 11:21:21 -0500
Message-ID: <20131108162121.GA25007@phenom.dumpdata.com>


On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > >   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> > 
> > Ok, I can reproduce that too.
> 
> This is what dom0 tells me:
> 
> [  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> [  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  483.620747] init            D ffff880062b59c78  5904  4163      1 0x00000000
> [  483.637699]  ffff880062b59bc8 0000000000000
> [  483.655189]  ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> [  483.672505]  ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> [  483.689527] Call Trace:
> [  483.706298]  [<ffffffff816a0814>] schedule+0x24/0x70
> [  483.723604]  [<ffffffff813bb0dd>] read_reply+0xad/0x160
> [  483.741162]  [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> [  483.758572]  [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> [  483.775741]  [<ffffffff813bb3c6>] xs_single+0x46/0x60
> [  483.792791]  [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> [  483.809929]  [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
> [  483.826947]  [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> [  483.843792]  [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> [  483.860412]  [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> [  483.877312]  [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> [  483.894036]  [<ffffffff8142e275>] device_shutdown+0x15/0x180
> [  483.910605]  [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> [  483.927100]  [<ffffffff810a88a1>] kernel_restart+0x11
> [  483.943262]  [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> [  483.959480]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [  483.975786]  [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> [  483.991819]  [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> [  484.007675]  [<ffffffff8115c725>] ? __free_pages+0x25/0x
> [  484.023336]  [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> [  484.039176]  [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> [  484.055174]  [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> [  484.070747]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [  484.086121]  [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> [  484.101318]  [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> [  484.116585] 3 locks held by init/4163:
> [  484.131650]  #0:  (reboot_mutex){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> [  484.147704]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> [  484.164359]  #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
> 

A bit of debugging shows that when we are in this state:


Sent SIGKILL to
[  100.454603] xen-pciback pci-1-0: shutdown

telnet> send brk 
[  110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z) 

... snip..

 xenstored       x 0000000000000002  5504  3437      1 0x00000006
  ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
  ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
  ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
 Call Trace:
  [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
  [<ffffffff816b1594>] schedule+0x24/0x70
  [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
  [<ffffffff8109c981>] do_group_exit+0x51/0x140
  [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
  [<ffffffff8104c49f>] do_signal+0x4f/0x610
  [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
  [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
  [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
  [<ffffffff816bc372>] int_signal+0x12/0x17


The 'x' means that the task has been killed.

(The other two threads 'xenbus' and 'xenwatch' are sleeping).

Since xenstored can nowadays run in a separate domain and not
just in the initial domain, and xenstored can be restarted at any time, we
can't depend on the task pid. Nor can we depend on the other
domain telling us that it is dead.

The best we can do is to get out of the way of the shutdown
process and not hang forever.

This patch should solve it:
From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 8 Nov 2013 10:48:58 -0500
Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
 shutdown/restart.

'read_reply' works together with 'process_msg' to read a reply from
XenBus. 'process_msg' runs within the 'xenbus' thread. Whenever
a message shows up in XenBus it is put on the xs_state.reply_list list
and 'read_reply' picks it up.

The problem is when the backend domain or the xenstored process is killed.
In that case 'xenbus' is still waiting - and 'read_reply', if called, is
stuck forever waiting for the reply_list to have some contents.

This is normally not a problem - as the backend domain can come back
or the xenstored process can be restarted. However if the domain
is in the process of being powered off/restarted/halted - there is no
point in waiting for it to come back - as we are effectively being
terminated and should not impede the progress.

This patch solves this problem by checking the 'system_state' value
to see if we are heading towards shutdown. We also make the wait
mechanism a bit more asynchronous.

Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index b6d5fff..177fb19 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
 
 	while (list_empty(&xs_state.reply_list)) {
 		spin_unlock(&xs_state.reply_lock);
-		/* XXX FIXME: Avoid synchronous wait for response here. */
-		wait_event(xs_state.reply_waitq,
-			   !list_empty(&xs_state.reply_list));
+		wait_event_timeout(xs_state.reply_waitq,
+				   !list_empty(&xs_state.reply_list),
+				   msecs_to_jiffies(500));
+
+		/*
+		 * If we are in the process of being shut-down there is
+		 * no point of trying to contact XenBus - it is either
+		 * killed (xenstored application) or the other domain
+		 * has been killed or is unreachable.
+		 */
+		switch (system_state) {
+			case SYSTEM_POWER_OFF:
+			case SYSTEM_RESTART:
+			case SYSTEM_HALT:
+				return ERR_PTR(-EIO);
+			default:
+				break;
+		}
 		spin_lock(&xs_state.reply_lock);
 	}
 
@@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
 
 	mutex_unlock(&xs_state.request_mutex);
 
+	if (IS_ERR(ret))
+		return ret;
+
 	if ((msg->type == XS_TRANSACTION_END) ||
 	    ((req_msg.type == XS_TRANSACTION_START) &&
 	     (msg->type == XS_ERROR)))
-- 
1.7.7.6




From: Matt Wilson <msw@linux.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Ren, Yongjie" <yongjie.ren@intel.com>, "Xu, YongweiX" <yongweix.xu@intel.com>, xen@bugs.xenproject.org, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, "Liu, SongtaoX" <songtaox.liu@intel.com>, george.dunlap@eu.citrix.com, "Tian, Yongxue" <yongxue.tian@intel.com>
Subject: Re: [Xen-devel] Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Date: Sun, 10 Nov 2013 12:20:18 -0800
Message-ID: <20131110202018.GA20536@u109add4315675089e695.ant.amazon.com>


On Fri, Nov 08, 2013 at 11:21:21AM -0500, Konrad Rzeszutek Wilk wrote:
[...]
> This patch should solve it:
> From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Fri, 8 Nov 2013 10:48:58 -0500
> Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
>  shutdown/restart.
> 
> 'read_reply' works together with 'process_msg' to read a reply from
> XenBus. 'process_msg' runs within the 'xenbus' thread. Whenever
> a message shows up in XenBus it is put on the xs_state.reply_list list
> and 'read_reply' picks it up.
> 
> The problem is when the backend domain or the xenstored process is killed.
> In that case 'xenbus' is still waiting - and 'read_reply', if called, is
> stuck forever waiting for the reply_list to have some contents.
> 
> This is normally not a problem - as the backend domain can come back
> or the xenstored process can be restarted. However if the domain
> is in the process of being powered off/restarted/halted - there is no
> point in waiting for it to come back - as we are effectively being
> terminated and should not impede the progress.
> 
> This patch solves this problem by checking the 'system_state' value
> to see if we are heading towards shutdown. We also make the wait
> mechanism a bit more asynchronous.
> 
> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Makes sense to me.

Acked-by: Matt Wilson <msw@amazon.com>

> ---
>  drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index b6d5fff..177fb19 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
>  
>  	while (list_empty(&xs_state.reply_list)) {
>  		spin_unlock(&xs_state.reply_lock);
> -		/* XXX FIXME: Avoid synchronous wait for response here. */
> -		wait_event(xs_state.reply_waitq,
> -			   !list_empty(&xs_state.reply_list));
> +		wait_event_timeout(xs_state.reply_waitq,
> +				   !list_empty(&xs_state.reply_list),
> +				   msecs_to_jiffies(500));
> +
> +		/*
> +		 * If we are in the process of being shut-down there is
> +		 * no point of trying to contact XenBus - it is either
> +		 * killed (xenstored application) or the other domain
> +		 * has been killed or is unreachable.
> +		 */
> +		switch (system_state) {
> +			case SYSTEM_POWER_OFF:
> +			case SYSTEM_RESTART:
> +			case SYSTEM_HALT:
> +				return ERR_PTR(-EIO);
> +			default:
> +				break;
> +		}
>  		spin_lock(&xs_state.reply_lock);
>  	}
>  
> @@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
>  
>  	mutex_unlock(&xs_state.request_mutex);
>  
> +	if (IS_ERR(ret))
> +		return ret;
> +
>  	if ((msg->type == XS_TRANSACTION_END) ||
>  	    ((req_msg.type == XS_TRANSACTION_START) &&
>  	     (msg->type == XS_ERROR)))



From: "Liu, SongtaoX" <songtaox.liu@intel.com>
To: "george.dunlap@eu.citrix.com" <george.dunlap@eu.citrix.com>, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, "Zhou, Chao" <chao.zhou@intel.com>, "Zhang, Yang Z" <yang.z.zhang@intel.com>, "Xu, Jiajun" <jiajun.xu@intel.com>, "xen@bugs.xenproject.org" <xen@bugs.xenproject.org>
Cc: "Xu, YongweiX" <yongweix.xu@intel.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: [Xen-devel] linux, xenbus mutex hangs when rebooting dom0 and guests hung." Was:Re: test report for Xen 4.3 RC1
Date: Mon, 11 Nov 2013 02:40:39 +0000
Message-ID: <582FB90AB890394081254B69739046FC01322F5F@SHSMSX101.ccr.corp.intel.com>


Yes, the patch fixed the dom0 hang issue when rebooting with a PCI device still assigned to a guest.
Thanks.


Regards
Songtao

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Saturday, November 09, 2013 12:21 AM
> To: Ren, Yongjie; george.dunlap@eu.citrix.com; xen@bugs.xenproject.org
> Cc: Xu, YongweiX; Liu, SongtaoX; Tian, Yongxue; xen-devel@lists.xen.org
> Subject: Is: linux, xenbus mutex hangs when rebooting dom0 and guests hung."
> Was:Re: [Xen-devel] test report for Xen 4.3 RC1
> 
> On Tue, May 28, 2013 at 11:21:56AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > 5. Dom0 cannot be shutdown before PCI device detachment from guest
> > > >   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
> > >
> > > Ok, I can reproduce that too.
> >
> > This is what dom0 tells me:
> >
> > [  483.586675] INFO: task init:4163 blocked for more than 120 seconds.
> > [  483.603675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  483.620747] init            D ffff880062b59c78  5904  4163      1 0x00000000
> > [  483.637699]  ffff880062b59bc8 0000000000000
> > [  483.655189]  ffff880062b58000 ffff880062b58000 ffff880062b58010 ffff880062b58000
> > [  483.672505]  ffff880062b59fd8 ffff880062b58000 ffff880062f20180 ffff880078bca500
> > [  483.689527] Call Trace:
> > [  483.706298]  [<ffffffff816a0814>] schedule+0x24/0x70
> > [  483.723604]  [<ffffffff813bb0dd>] read_reply+0xad/0x160
> > [  483.741162]  [<ffffffff810b6b10>] ? wake_up_bit+0x40/0x40
> > [  483.758572]  [<ffffffff813bb274>] xs_talkv+0xe4/0x1f0
> > [  483.775741]  [<ffffffff813bb3c6>] xs_single+0x46/0x60
> > [  483.792791]  [<ffffffff813bbab4>] xenbus_transaction_start+0x24/0x60
> > [  483.809929]  [<ffffffff813ba202>] __xenbus_switch_state+0x32/0x120
> > [  483.826947]  [<ffffffff8142df39>] ? __dev_printk+0x39/0x90
> > [  483.843792]  [<ffffffff8142dfde>] ? _dev_info+0x4e/0x50
> > [  483.860412]  [<ffffffff813ba2fb>] xenbus_switch_state+0xb/0x10
> > [  483.877312]  [<ffffffff813bd487>] xenbus_dev_shutdown+0x37/0xa0
> > [  483.894036]  [<ffffffff8142e275>] device_shutdown+0x15/0x180
> > [  483.910605]  [<ffffffff810a8841>] kernel_restart_prepare+0x31/0x40
> > [  483.927100]  [<ffffffff810a88a1>] kernel_restart+0x11
> > [  483.943262]  [<ffffffff810a8ab5>] SYSC_reboot+0x1b5/0x260
> > [  483.959480]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> > [  483.975786]  [<ffffffff810ed5fd>] ? trace_hardirqs_on+0xd/0x10
> > [  483.991819]  [<ffffffff8119db03>] ? kmem_cache_free+0x123/0x360
> > [  484.007675]  [<ffffffff8115c725>] ? __free_pages+0x25/0x
> > [  484.023336]  [<ffffffff8115c9ac>] ? free_pages+0x4c/0x50
> > [  484.039176]  [<ffffffff8108b527>] ? __mmdrop+0x67/0xd0
> > [  484.055174]  [<ffffffff816aae95>] ? sysret_check+0x22/0x5d
> > [  484.070747]  [<ffffffff810ed52d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> > [  484.086121]  [<ffffffff810a8b69>] SyS_reboot+0x9/0x10
> > [  484.101318]  [<ffffffff816aae69>] system_call_fastpath+0x16/0x1b
> > [  484.116585] 3 locks held by init/4163:
> > [  484.131650]  #0:  (reboot_mutex){+.+.+.}, at: [<ffffffff810a89e0>] SYSC_reboot+0xe0/0x260
> > [  484.147704]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8142e323>] device_shutdown+0xc3/0x180
> > [  484.164359]  #2:  (&xs_state.request_mutex){+.+...}, at: [<ffffffff813bb1fb>] xs_talkv+0x6b/0x1f0
> >
> 
> A bit of debugging shows that when we are in this state:
> 
> 
> Sent SIGKILL to
> [  100.454603] xen-pciback pci-1-0: shutdown
> 
> telnet> send brk
> [  110.134554] SysRq : HELP : loglevel(0-9) reboot(b) crash(c)
> terminate-all-tasks(e) memory-full-oom-kill(f) debug(g) kill-all-tasks(i)
> thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
> show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p)
> show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V)
> show-blocked-tasks(w) dump-ftrace-buffer(z)
> 
> ... snip..
> 
>  xenstored       x 0000000000000002  5504  3437      1 0x00000006
>   ffff88006b6efc88 0000000000000246 0000000000000d6d ffff88006b6ee000
>   ffff88006b6effd8 ffff88006b6ee000 ffff88006b6ee010 ffff88006b6ee000
>   ffff88006b6effd8 ffff88006b6ee000 ffff88006bc39500 ffff8800788b5480
>  Call Trace:
>   [<ffffffff8110fede>] ? cgroup_exit+0x10e/0x130
>   [<ffffffff816b1594>] schedule+0x24/0x70
>   [<ffffffff8109c43d>] do_exit+0x79d/0xbc0
>   [<ffffffff8109c981>] do_group_exit+0x51/0x140
>   [<ffffffff810ae6f4>] get_signal_to_deliver+0x264/0x760
>   [<ffffffff8104c49f>] do_signal+0x4f/0x610
>   [<ffffffff811c62ce>] ? __sb_end_write+0x2e/0x60
>   [<ffffffff811c3d39>] ? vfs_write+0x129/0x170
>   [<ffffffff8104cabd>] do_notify_resume+0x5d/0x80
>   [<ffffffff816bc372>] int_signal+0x12/0x17
> 
> 
> The 'x' means that the task has been killed.
> 
> (The other two threads 'xenbus' and 'xenwatch' are sleeping).
> 
> Since xenstored can nowadays run in a separate domain and not
> just in the initial domain, and xenstored can be restarted at any time, we
> can't depend on the task pid. Nor can we depend on the other
> domain telling us that it is dead.
> 
> The best we can do is to get out of the way of the shutdown
> process and not hang forever.
> 
> This patch should solve it:
> From 228bb2fcde1267ed2a0b0d386f54d79ecacd0eb4 Mon Sep 17 00:00:00
> 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Fri, 8 Nov 2013 10:48:58 -0500
> Subject: [PATCH] xen/xenbus: Avoid synchronous wait on XenBus stalling
>  shutdown/restart.
> 
> 'read_reply' works together with 'process_msg' to read a reply from
> XenBus. 'process_msg' runs within the 'xenbus' thread. Whenever
> a message shows up in XenBus it is put on the xs_state.reply_list list
> and 'read_reply' picks it up.
> 
> The problem is when the backend domain or the xenstored process is killed.
> In that case 'xenbus' is still waiting - and 'read_reply', if called, is
> stuck forever waiting for the reply_list to have some contents.
> 
> This is normally not a problem - as the backend domain can come back
> or the xenstored process can be restarted. However if the domain
> is in the process of being powered off/restarted/halted - there is no
> point in waiting for it to come back - as we are effectively being
> terminated and should not impede the progress.
> 
> This patch solves this problem by checking the 'system_state' value
> to see if we are heading towards shutdown. We also make the wait
> mechanism a bit more asynchronous.
> 
> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  drivers/xen/xenbus/xenbus_xs.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index b6d5fff..177fb19 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type
> *type, unsigned int *len)
> 
>  	while (list_empty(&xs_state.reply_list)) {
>  		spin_unlock(&xs_state.reply_lock);
> -		/* XXX FIXME: Avoid synchronous wait for response here. */
> -		wait_event(xs_state.reply_waitq,
> -			   !list_empty(&xs_state.reply_list));
> +		wait_event_timeout(xs_state.reply_waitq,
> +				   !list_empty(&xs_state.reply_list),
> +				   msecs_to_jiffies(500));
> +
> +		/*
> +		 * If we are in the process of being shut-down there is
> +		 * no point of trying to contact XenBus - it is either
> +		 * killed (xenstored application) or the other domain
> +		 * has been killed or is unreachable.
> +		 */
> +		switch (system_state) {
> +			case SYSTEM_POWER_OFF:
> +			case SYSTEM_RESTART:
> +			case SYSTEM_HALT:
> +				return ERR_PTR(-EIO);
> +			default:
> +				break;
> +		}
>  		spin_lock(&xs_state.reply_lock);
>  	}
> 
> @@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct
> xsd_sockmsg *msg)
> 
>  	mutex_unlock(&xs_state.request_mutex);
> 
> +	if (IS_ERR(ret))
> +		return ret;
> +
>  	if ((msg->type == XS_TRANSACTION_END) ||
>  	    ((req_msg.type == XS_TRANSACTION_START) &&
>  	     (msg->type == XS_ERROR)))
> --
> 1.7.7.6




From: Ian Campbell <Ian.Campbell@citrix.com>
To: George Dunlap <george.dunlap@eu.citrix.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: [Xen-devel] test report for Xen 4.3 RC1
Date: Mon, 11 Nov 2013 10:22:23 +0000
Message-ID: <1384165343.3189.181.camel@kazak.uk.xensource.com>


On Tue, 2013-05-28 at 16:24 +0100, George Dunlap wrote:
> > create !
> > title -1 "linux, xenbus mutex hangs when rebooting dom0 and guests hung."
> 
> 1. I think that these commands have to come at the top
> 2. You don't need quotes in the title
> 3. You need to be polite and say "thanks" at the end so it knows it can 
> stop paying attention. :-)

4. Use Bcc and not Cc so that the entire subsequent thread doesn't get
sent to the bot when folks reply-all.
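Putting the four points together, a control mail to the bot would
presumably look something like the following. The command syntax is taken
only from the messages in this thread; the addresses and the message-id
are hypothetical placeholders:

```text
To: xen@bugs.xenproject.org
Bcc: xen-devel@lists.xen.org

create <some-message-id@example.com>
title -1 linux, xenbus mutex hangs when rebooting dom0 and guests hung.
thanks

(the body of the report follows here, after the commands)
```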

Ian.



From: Ian Campbell <Ian.Campbell@citrix.com>
To: David Vrabel <david.vrabel@citrix.com>
Cc: xen-devel@lists.xenproject.org, linux-kernel@vger.kernel.org, JBeulich@suse.com, boris.ostrovsky@oracle.com
Subject: Re: [Xen-devel] [PATCH 4/4] xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart.
Date: Fri, 22 Nov 2013 09:30:25 +0000
Message-ID: <1385112625.25845.6.camel@kazak.uk.xensource.com>


graft 8 <20130528152156.GB3027@phenom.dumpdata.com>
prune 8 <20130528181149.GA27718@phenom.dumpdata.com>
thanks

On Thu, 2013-11-21 at 17:52 +0000, David Vrabel wrote:
> > Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
> 
> This bug link has no useful information in it.

Looks like the intention was for it to reference this mail:
http://thread.gmane.org/gmane.comp.emulators.xen.devel/160720/focus=160828
This has the exact same contents as the control mail that created this
bug, which I dug out of the bug tracker's spool. The problem was that in
the original the commands were appended instead of placed at the front of
the message, so they got ignored. Then when the commands were correctly
sent, the mail in question used "!" (meaning this mail) but didn't go to
xen-devel, so it didn't actually refer to a known thread. The correct
thing to do in that resend would have been to reference the relevant
message-id directly.

I think I've fixed it up with the above commands.

Ian.

