#30 - libxl should implement non-suspend-cancel based resume path

Owner: Ian Campbell <Ian.Campbell@citrix.com>

Date: Wed Jan 8 13:30:03 2014

Last Update: Wed Jan 8 13:30:04 2014

Severity: normal

Affects:

State: Open


From: Ian Jackson <Ian.Jackson@eu.citrix.com>
To: konrad.wilk@oracle.com
Cc: xen-devel@lists.xen.org
Subject: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Tue, 7 Jan 2014 18:55:56 +0000
Message-ID: <21196.19900.136146.867552@mariner.uk.xensource.com>

I did the following test:

   mv /etc/xen/scripts/block /etc/xen/scripts/block.aside
   xl migrate debian.guest.osstest localhost

xl did what appears to be the right thing: it did most of the
migration, failed to run the block scripts at the end of the
migration, and destroyed the destination domain and instead resumed
the source guest.

However, the source guest immediately went mad spewing WARNINGs and
was after that no longer contactable via the network and not
apparently responsive on the console.  See below.

This is with:

  [    0.000000] Linux version 3.4.70+ (osstest@rice-weevil) (gcc
  version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Wed Dec 4 03:14:51 GMT 2013

For reasons I don't understand it doesn't seem to print the actual
kernel git hash in dmesg, but I think it was that from flight 22264,
i.e.  234d96ee0f3b8e49501d068a2a3165aa4db60903.  It's i386, on a
64-bit Xen.

Thanks,
Ian.

debian login: [  124.595658] PM: freeze of devices complete after 2.980 msecs
[  124.595991] PM: late freeze of devices complete after 0.013 msecs
[  124.600919] PM: noirq freeze of devices complete after 4.884 msecs
[  124.601105] Grant tables using version 2 layout.
[  124.601105] ------------[ cut here ]------------
[  124.601105] kernel BUG at drivers/xen/events.c:1582!
[  124.601105] invalid opcode: 0000 [#1] SMP 
[  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
[  124.601105] 
[  124.601105] Pid: 6, comm: migration/0 Not tainted 3.4.70+ #1  
[  124.601105] EIP: 0061:[<c12f5d25>] EFLAGS: 00010082 CPU: 0
[  124.601105] EIP is at xen_irq_resume+0x215/0x370
[  124.601105] EAX: ffffffef EBX: deadbeef ECX: deadbeef EDX: 00000000
[  124.601105] ESI: c190b020 EDI: df461f24 EBP: df451eb8 ESP: df451e10
[  124.601105]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
[  124.601105] CR0: 8005003b CR2: 08b7c8a8 CR3: 038f0000 CR4: 00002660
[  124.601105] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  124.601105] DR6: ffff0ff0 DR7: 00000400
[  124.601105] Process migration/0 (pid: 6, ti=df450000 task=df43d860 task.ti=df450000)
[  124.601105] Stack:
[  124.601105]  c104ea40 df451e18 c398b80c deadbeef df461f10 df451e58 c12f350a c19b165c
[  124.601105]  df451e94 00000003 df451e78 c190b080 c190b020 00000000 00000010 00000000
[  124.601105]  00000000 00000000 9420f17e 0008a6c2 fc798ba3 0008a6df 00000004 00000413
[  124.601105] Call Trace:
[  124.601105]  [<c104ea40>] ? xen_iret_crit_fixup+0x3c/0x3c
[  124.601105]  [<c12f350a>] ? gnttab_map_frames_v2+0xda/0x120
[  124.601105]  [<c1055b90>] ? xen_spin_lock+0xa0/0x100
[  124.601105]  [<c104d155>] ? xen_mm_unpin_all+0x65/0x80
[  124.601105]  [<c12f6cad>] xen_suspend+0x8d/0xc0
[  124.601105]  [<c10e750b>] stop_machine_cpu_stop+0x9b/0x110
[  124.601105]  [<c10e71f7>] cpu_stopper_thread+0xc7/0x1a0
[  124.601105]  [<c10b3f6f>] ? finish_task_switch+0x5f/0xe0
[  124.601105]  [<c10e7470>] ? stop_one_cpu_nowait+0x40/0x40
[  124.601105]  [<c10b682b>] ? default_wake_function+0xb/0x10
[  124.601105]  [<c10af990>] ? __wake_up_common+0x40/0x70
[  124.601105]  [<c16441ad>] ? _raw_spin_unlock_irqrestore+0x2d/0x50
[  124.601105]  [<c10b2479>] ? complete+0x49/0x60
[  124.601105]  [<c10e7130>] ? res_counter_charge+0x180/0x180
[  124.601105]  [<c10a7474>] kthread+0x74/0x80
[  124.601105]  [<c10a7400>] ? kthread_freezable_should_stop+0x60/0x60
[  124.601105]  [<c164b276>] kernel_thread_helper+0x6/0x10
[  124.601105] Code: 22 e8 ff ff 8b 55 8c 89 d8 e8 88 e6 ff ff 83 45 94 01 83 7d 94 04 0f 84 80 fe ff ff 8b 55 8c 8b 04 95 e0 11 88 c1 e9 64 ff ff ff <0f> 0b eb fe 0f 0b eb fe 8b 1d 00 60 85 c1 81 fb 00 60 85 c1 74 
[  124.601105] EIP: [<c12f5d25>] xen_irq_resume+0x215/0x370 SS:ESP 0069:df451e10
[  124.601105] ---[ end trace 69a5c8cd56e77bce ]---
[  124.601105] ------------[ cut here ]------------
[  124.601105] WARNING: at kernel/time/tick-sched.c:464 tick_nohz_idle_enter+0x7a/0x90()
[  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
[  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D      3.4.70+ #1
[  124.601105] Call Trace:
[  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
[  124.601105]  [<c10d1b1a>] ? tick_nohz_idle_enter+0x7a/0x90
[  124.601105]  [<c10d1b1a>] ? tick_nohz_idle_enter+0x7a/0x90
[  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
[  124.601105]  [<c10d1b1a>] tick_nohz_idle_enter+0x7a/0x90
[  124.601105]  [<c105e22a>] cpu_idle+0x1a/0xa0
[  124.601105]  [<c16242f8>] rest_init+0x58/0x60
[  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
[  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
[  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
[  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
[  124.601105] ---[ end trace 69a5c8cd56e77bcf ]---
[  124.601105] ------------[ cut here ]------------
[  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
[  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
[  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
[  124.601105] Call Trace:
[  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
[  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
[  124.601105]  [<c10d1664>] tick_nohz_stop_sched_tick+0x34/0x3f0
[  124.601105]  [<c10886ea>] ? print_oops_end_marker+0x2a/0x30
[  124.601105]  [<c10888fd>] ? warn_slowpath_common+0x7d/0xa0
[  124.601105]  [<c10d1b1a>] ? tick_nohz_idle_enter+0x7a/0x90
[  124.601105]  [<c10d1ae5>] tick_nohz_idle_enter+0x45/0x90
[  124.601105]  [<c105e22a>] cpu_idle+0x1a/0xa0
[  124.601105]  [<c16242f8>] rest_init+0x58/0x60
[  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
[  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
[  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
[  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
[  124.601105] ---[ end trace 69a5c8cd56e77bd0 ]---
[  124.601105] ------------[ cut here ]------------
[  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
[  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
[  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
[  124.601105] Call Trace:
[  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
[  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
[  124.601105]  [<c10d1379>] tick_check_idle+0x39/0xf0
[  124.601105]  [<c108f06c>] irq_enter+0x4c/0x70
[  124.601105]  [<c12f4e36>] xen_evtchn_do_upcall+0x16/0x30
[  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
[  124.601105]  [<c1002227>] ? hypercall_page+0x227/0x1000
[  124.601105]  [<c104e16a>] ? xen_force_evtchn_callback+0x1a/0x30
[  124.601105]  [<c104e994>] check_events+0x8/0xc
[  124.601105]  [<c104e93c>] ? xen_clocksource_get_cycles+0xc/0xc
[  124.601105]  [<c104e953>] ? xen_irq_enable_direct_reloc+0x4/0x4
[  124.601105]  [<c10d1af4>] ? tick_nohz_idle_enter+0x54/0x90
[  124.601105]  [<c105e22a>] cpu_idle+0x1a/0xa0
[  124.601105]  [<c16242f8>] rest_init+0x58/0x60
[  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
[  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
[  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
[  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
[  124.601105] ---[ end trace 69a5c8cd56e77bd1 ]---
[  124.601105] ------------[ cut here ]------------
[  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
[  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
[  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
[  124.601105] Call Trace:
[  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
[  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
[  124.601105]  [<c10d1664>] tick_nohz_stop_sched_tick+0x34/0x3f0
[  124.601105]  [<c12f4288>] ? info_for_irq+0x8/0x20
[  124.601105]  [<c12f47c3>] ? evtchn_from_irq+0x13/0x40
[  124.601105]  [<c104e789>] ? xen_clocksource_read+0x19/0x20
[  124.601105]  [<c12f4b68>] ? __xen_evtchn_do_upcall+0x258/0x2b0
[  124.601105]  [<c10d1a5f>] tick_nohz_irq_exit+0x3f/0x80
[  124.601105]  [<c108ef1f>] irq_exit+0x4f/0xb0
[  124.601105]  [<c12f4e40>] xen_evtchn_do_upcall+0x20/0x30
[  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
[  124.601105]  [<c1002227>] ? hypercall_page+0x227/0x1000
[  124.601105]  [<c104e16a>] ? xen_force_evtchn_callback+0x1a/0x30
[  124.601105]  [<c104e994>] check_events+0x8/0xc
[  124.601105]  [<c104e93c>] ? xen_clocksource_get_cycles+0xc/0xc
[  124.601105]  [<c104e953>] ? xen_irq_enable_direct_reloc+0x4/0x4
[  124.601105]  [<c10d1af4>] ? tick_nohz_idle_enter+0x54/0x90
[  124.601105]  [<c105e22a>] cpu_idle+0x1a/0xa0
[  124.601105]  [<c16242f8>] rest_init+0x58/0x60
[  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
[  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
[  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
[  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
[  124.601105] ---[ end trace 69a5c8cd56e77bd2 ]---
[  124.601105] ------------[ cut here ]------------
[  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
[  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
[  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
[  124.601105] Call Trace:
[  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
[  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
[  124.601105]  [<c10d1379>] tick_check_idle+0x39/0xf0
[  124.601105]  [<c108f06c>] irq_enter+0x4c/0x70
[  124.601105]  [<c12f4e36>] xen_evtchn_do_upcall+0x16/0x30
[  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
[  124.601105]  [<c10023a7>] ? hypercall_page+0x3a7/0x1000
[  124.601105]  [<c104e1c2>] ? xen_safe_halt+0x12/0x20
[  124.601105]  [<c104e1b0>] ? xen_irq_disable+0x10/0x10
[  124.601105]  [<c105ed2b>] default_idle+0x5b/0x190
[  124.601105]  [<c1040054>] ? svm_set_tsc_khz+0x74/0x140
[  124.601105]  [<c105e27f>] cpu_idle+0x6f/0xa0
[  124.601105]  [<c16242f8>] rest_init+0x58/0x60
[  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
[  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
[  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
[  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
[  124.601105] ---[ end trace 69a5c8cd56e77bd3 ]---
[  124.601105] ------------[ cut here ]------------
[  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
[  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
[  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
[  124.601105] Call Trace:
[  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
[  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
[  124.601105]  [<c10d1664>] tick_nohz_stop_sched_tick+0x34/0x3f0
[  124.601105]  [<c12f4288>] ? info_for_irq+0x8/0x20
[  124.601105]  [<c12f47c3>] ? evtchn_from_irq+0x13/0x40
[  124.601105]  [<c104e789>] ? xen_clocksource_read+0x19/0x20
[  124.601105]  [<c12f4b68>] ? __xen_evtchn_do_upcall+0x258/0x2b0
[  124.601105]  [<c10d1a5f>] tick_nohz_irq_exit+0x3f/0x80
[  124.601105]  [<c108ef1f>] irq_exit+0x4f/0xb0
[  124.601105]  [<c12f4e40>] xen_evtchn_do_upcall+0x20/0x30
[  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
[  124.601105]  [<c10023a7>] ? hypercall_page+0x3a7/0x1000
[  124.601105]  [<c104e1c2>] ? xen_safe_halt+0x12/0x20
[  124.601105]  [<c104e1b0>] ? xen_irq_disable+0x10/0x10
[  124.601105]  [<c105ed2b>] default_idle+0x5b/0x190
[  124.601105]  [<c1040054>] ? svm_set_tsc_khz+0x74/0x140
[  124.601105]  [<c105e27f>] cpu_idle+0x6f/0xa0
[  124.601105]  [<c16242f8>] rest_init+0x58/0x60
[  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
[  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
[  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
[  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
[  124.601105] ---[ end trace 69a5c8cd56e77bd4 ]---
[  124.601105] ------------[ cut here ]------------
[  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
[  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
[  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
[  124.601105] Call Trace:
[  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
[  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
[  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
[  124.601105]  [<c10d1379>] tick_check_idle+0x39/0xf0
[  124.601105]  [<c108f06c>] irq_enter+0x4c/0x70
[  124.601105]  [<c12f4e36>] xen_evtchn_do_upcall+0x16/0x30
[  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
[  124.601105]  [<c10023a7>] ? hypercall_page+0x3a7/0x1000
[  124.601105]  [<c104e1c2>] ? xen_safe_halt+0x12/0x20
[  124.601105]  [<c104e1b0>] ? xen_irq_disable+0x10/0x10
[  124.601105]  [<c105ed2b>] default_idle+0x5b/0x190
[  124.601105]  [<c1040054>] ? svm_set_tsc_khz+0x74/0x140
[  124.601105]  [<c105e27f>] cpu_idle+0x6f/0xa0
[  124.601105]  [<c16242f8>] rest_init+0x58/0x60
[  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
[  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
[  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
[  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
[  124.601105] ---[ end trace 69a5c8cd56e77bd5 ]---
[  124.601105] ------------[ cut here ]------------

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: boris.ostrovsky@oracle.com, david.vrabel@citrix.com, Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: xen-devel@lists.xen.org
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Tue, 7 Jan 2014 14:11:56 -0500
Message-ID: <20140107191156.GA10370@phenom.dumpdata.com>

On Tue, Jan 07, 2014 at 06:55:56PM +0000, Ian Jackson wrote:
> I did the following test:
> 
>    mv /etc/xen/scripts/block /etc/xen/scripts/block.aside
>    xl migrate debian.guest.osstest localhost
> 
> xl did what appears to be the right thing: it did most of the
> migration, failed to run the block scripts at the end of the
> migration, and destroyed the destination domain and instead resumed
> the source guest.
> 
> However, the source guest immediately went mad spewing WARNINGs and
> was after that no longer contactable via the network and not
> apparently responsive on the console.  See below.
> 
> This is with:
> 
>   [    0.000000] Linux version 3.4.70+ (osstest@rice-weevil) (gcc
>   version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Wed Dec 4 03:14:51 GMT 2013
> 
> For reasons I don't understand it doesn't seem to print the actual
> kernel git hash in dmesg, but I think it was that from flight 22264,
> i.e.  234d96ee0f3b8e49501d068a2a3165aa4db60903.  It's i386, on a
> 64-bit Xen.

This is a rather ancient kernel. Does it show up with 3.12?

CC-ing the other maintainers.
> 
> Thanks,
> Ian.
> 
> debian login: [  124.595658] PM: freeze of devices complete after 2.980 msecs
> [  124.595991] PM: late freeze of devices complete after 0.013 msecs
> [  124.600919] PM: noirq freeze of devices complete after 4.884 msecs
> [  124.601105] Grant tables using version 2 layout.
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] kernel BUG at drivers/xen/events.c:1582!
> [  124.601105] invalid opcode: 0000 [#1] SMP 
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] 
> [  124.601105] Pid: 6, comm: migration/0 Not tainted 3.4.70+ #1  
> [  124.601105] EIP: 0061:[<c12f5d25>] EFLAGS: 00010082 CPU: 0
> [  124.601105] EIP is at xen_irq_resume+0x215/0x370
> [  124.601105] EAX: ffffffef EBX: deadbeef ECX: deadbeef EDX: 00000000
> [  124.601105] ESI: c190b020 EDI: df461f24 EBP: df451eb8 ESP: df451e10
> [  124.601105]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
> [  124.601105] CR0: 8005003b CR2: 08b7c8a8 CR3: 038f0000 CR4: 00002660
> [  124.601105] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [  124.601105] DR6: ffff0ff0 DR7: 00000400
> [  124.601105] Process migration/0 (pid: 6, ti=df450000 task=df43d860 task.ti=df450000)
> [  124.601105] Stack:
> [  124.601105]  c104ea40 df451e18 c398b80c deadbeef df461f10 df451e58 c12f350a c19b165c
> [  124.601105]  df451e94 00000003 df451e78 c190b080 c190b020 00000000 00000010 00000000
> [  124.601105]  00000000 00000000 9420f17e 0008a6c2 fc798ba3 0008a6df 00000004 00000413
> [  124.601105] Call Trace:
> [  124.601105]  [<c104ea40>] ? xen_iret_crit_fixup+0x3c/0x3c
> [  124.601105]  [<c12f350a>] ? gnttab_map_frames_v2+0xda/0x120
> [  124.601105]  [<c1055b90>] ? xen_spin_lock+0xa0/0x100
> [  124.601105]  [<c104d155>] ? xen_mm_unpin_all+0x65/0x80
> [  124.601105]  [<c12f6cad>] xen_suspend+0x8d/0xc0
> [  124.601105]  [<c10e750b>] stop_machine_cpu_stop+0x9b/0x110
> [  124.601105]  [<c10e71f7>] cpu_stopper_thread+0xc7/0x1a0
> [  124.601105]  [<c10b3f6f>] ? finish_task_switch+0x5f/0xe0
> [  124.601105]  [<c10e7470>] ? stop_one_cpu_nowait+0x40/0x40
> [  124.601105]  [<c10b682b>] ? default_wake_function+0xb/0x10
> [  124.601105]  [<c10af990>] ? __wake_up_common+0x40/0x70
> [  124.601105]  [<c16441ad>] ? _raw_spin_unlock_irqrestore+0x2d/0x50
> [  124.601105]  [<c10b2479>] ? complete+0x49/0x60
> [  124.601105]  [<c10e7130>] ? res_counter_charge+0x180/0x180
> [  124.601105]  [<c10a7474>] kthread+0x74/0x80
> [  124.601105]  [<c10a7400>] ? kthread_freezable_should_stop+0x60/0x60
> [  124.601105]  [<c164b276>] kernel_thread_helper+0x6/0x10
> [  124.601105] Code: 22 e8 ff ff 8b 55 8c 89 d8 e8 88 e6 ff ff 83 45 94 01 83 7d 94 04 0f 84 80 fe ff ff 8b 55 8c 8b 04 95 e0 11 88 c1 e9 64 ff ff ff <0f> 0b eb fe 0f 0b eb fe 8b 1d 00 60 85 c1 81 fb 00 60 85 c1 74 
> [  124.601105] EIP: [<c12f5d25>] xen_irq_resume+0x215/0x370 SS:ESP 0069:df451e10
> [  124.601105] ---[ end trace 69a5c8cd56e77bce ]---
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] WARNING: at kernel/time/tick-sched.c:464 tick_nohz_idle_enter+0x7a/0x90()
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D      3.4.70+ #1
> [  124.601105] Call Trace:
> [  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
> [  124.601105]  [<c10d1b1a>] ? tick_nohz_idle_enter+0x7a/0x90
> [  124.601105]  [<c10d1b1a>] ? tick_nohz_idle_enter+0x7a/0x90
> [  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
> [  124.601105]  [<c10d1b1a>] tick_nohz_idle_enter+0x7a/0x90
> [  124.601105]  [<c105e22a>] cpu_idle+0x1a/0xa0
> [  124.601105]  [<c16242f8>] rest_init+0x58/0x60
> [  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
> [  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
> [  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
> [  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
> [  124.601105] ---[ end trace 69a5c8cd56e77bcf ]---
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
> [  124.601105] Call Trace:
> [  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
> [  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
> [  124.601105]  [<c10d1664>] tick_nohz_stop_sched_tick+0x34/0x3f0
> [  124.601105]  [<c10886ea>] ? print_oops_end_marker+0x2a/0x30
> [  124.601105]  [<c10888fd>] ? warn_slowpath_common+0x7d/0xa0
> [  124.601105]  [<c10d1b1a>] ? tick_nohz_idle_enter+0x7a/0x90
> [  124.601105]  [<c10d1ae5>] tick_nohz_idle_enter+0x45/0x90
> [  124.601105]  [<c105e22a>] cpu_idle+0x1a/0xa0
> [  124.601105]  [<c16242f8>] rest_init+0x58/0x60
> [  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
> [  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
> [  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
> [  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
> [  124.601105] ---[ end trace 69a5c8cd56e77bd0 ]---
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
> [  124.601105] Call Trace:
> [  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
> [  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
> [  124.601105]  [<c10d1379>] tick_check_idle+0x39/0xf0
> [  124.601105]  [<c108f06c>] irq_enter+0x4c/0x70
> [  124.601105]  [<c12f4e36>] xen_evtchn_do_upcall+0x16/0x30
> [  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
> [  124.601105]  [<c1002227>] ? hypercall_page+0x227/0x1000
> [  124.601105]  [<c104e16a>] ? xen_force_evtchn_callback+0x1a/0x30
> [  124.601105]  [<c104e994>] check_events+0x8/0xc
> [  124.601105]  [<c104e93c>] ? xen_clocksource_get_cycles+0xc/0xc
> [  124.601105]  [<c104e953>] ? xen_irq_enable_direct_reloc+0x4/0x4
> [  124.601105]  [<c10d1af4>] ? tick_nohz_idle_enter+0x54/0x90
> [  124.601105]  [<c105e22a>] cpu_idle+0x1a/0xa0
> [  124.601105]  [<c16242f8>] rest_init+0x58/0x60
> [  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
> [  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
> [  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
> [  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
> [  124.601105] ---[ end trace 69a5c8cd56e77bd1 ]---
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
> [  124.601105] Call Trace:
> [  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
> [  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
> [  124.601105]  [<c10d1664>] tick_nohz_stop_sched_tick+0x34/0x3f0
> [  124.601105]  [<c12f4288>] ? info_for_irq+0x8/0x20
> [  124.601105]  [<c12f47c3>] ? evtchn_from_irq+0x13/0x40
> [  124.601105]  [<c104e789>] ? xen_clocksource_read+0x19/0x20
> [  124.601105]  [<c12f4b68>] ? __xen_evtchn_do_upcall+0x258/0x2b0
> [  124.601105]  [<c10d1a5f>] tick_nohz_irq_exit+0x3f/0x80
> [  124.601105]  [<c108ef1f>] irq_exit+0x4f/0xb0
> [  124.601105]  [<c12f4e40>] xen_evtchn_do_upcall+0x20/0x30
> [  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
> [  124.601105]  [<c1002227>] ? hypercall_page+0x227/0x1000
> [  124.601105]  [<c104e16a>] ? xen_force_evtchn_callback+0x1a/0x30
> [  124.601105]  [<c104e994>] check_events+0x8/0xc
> [  124.601105]  [<c104e93c>] ? xen_clocksource_get_cycles+0xc/0xc
> [  124.601105]  [<c104e953>] ? xen_irq_enable_direct_reloc+0x4/0x4
> [  124.601105]  [<c10d1af4>] ? tick_nohz_idle_enter+0x54/0x90
> [  124.601105]  [<c105e22a>] cpu_idle+0x1a/0xa0
> [  124.601105]  [<c16242f8>] rest_init+0x58/0x60
> [  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
> [  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
> [  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
> [  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
> [  124.601105] ---[ end trace 69a5c8cd56e77bd2 ]---
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
> [  124.601105] Call Trace:
> [  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
> [  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
> [  124.601105]  [<c10d1379>] tick_check_idle+0x39/0xf0
> [  124.601105]  [<c108f06c>] irq_enter+0x4c/0x70
> [  124.601105]  [<c12f4e36>] xen_evtchn_do_upcall+0x16/0x30
> [  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
> [  124.601105]  [<c10023a7>] ? hypercall_page+0x3a7/0x1000
> [  124.601105]  [<c104e1c2>] ? xen_safe_halt+0x12/0x20
> [  124.601105]  [<c104e1b0>] ? xen_irq_disable+0x10/0x10
> [  124.601105]  [<c105ed2b>] default_idle+0x5b/0x190
> [  124.601105]  [<c1040054>] ? svm_set_tsc_khz+0x74/0x140
> [  124.601105]  [<c105e27f>] cpu_idle+0x6f/0xa0
> [  124.601105]  [<c16242f8>] rest_init+0x58/0x60
> [  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
> [  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
> [  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
> [  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
> [  124.601105] ---[ end trace 69a5c8cd56e77bd3 ]---
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
> [  124.601105] Call Trace:
> [  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
> [  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
> [  124.601105]  [<c10d1664>] tick_nohz_stop_sched_tick+0x34/0x3f0
> [  124.601105]  [<c12f4288>] ? info_for_irq+0x8/0x20
> [  124.601105]  [<c12f47c3>] ? evtchn_from_irq+0x13/0x40
> [  124.601105]  [<c104e789>] ? xen_clocksource_read+0x19/0x20
> [  124.601105]  [<c12f4b68>] ? __xen_evtchn_do_upcall+0x258/0x2b0
> [  124.601105]  [<c10d1a5f>] tick_nohz_irq_exit+0x3f/0x80
> [  124.601105]  [<c108ef1f>] irq_exit+0x4f/0xb0
> [  124.601105]  [<c12f4e40>] xen_evtchn_do_upcall+0x20/0x30
> [  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
> [  124.601105]  [<c10023a7>] ? hypercall_page+0x3a7/0x1000
> [  124.601105]  [<c104e1c2>] ? xen_safe_halt+0x12/0x20
> [  124.601105]  [<c104e1b0>] ? xen_irq_disable+0x10/0x10
> [  124.601105]  [<c105ed2b>] default_idle+0x5b/0x190
> [  124.601105]  [<c1040054>] ? svm_set_tsc_khz+0x74/0x140
> [  124.601105]  [<c105e27f>] cpu_idle+0x6f/0xa0
> [  124.601105]  [<c16242f8>] rest_init+0x58/0x60
> [  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
> [  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
> [  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
> [  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
> [  124.601105] ---[ end trace 69a5c8cd56e77bd4 ]---
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] WARNING: at kernel/time/timekeeping.c:266 ktime_get+0xe9/0x100()
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] Pid: 0, comm: swapper/0 Tainted: G      D W    3.4.70+ #1
> [  124.601105] Call Trace:
> [  124.601105]  [<c10888ed>] warn_slowpath_common+0x6d/0xa0
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c10c9fc9>] ? ktime_get+0xe9/0x100
> [  124.601105]  [<c108893d>] warn_slowpath_null+0x1d/0x20
> [  124.601105]  [<c10c9fc9>] ktime_get+0xe9/0x100
> [  124.601105]  [<c10d1379>] tick_check_idle+0x39/0xf0
> [  124.601105]  [<c108f06c>] irq_enter+0x4c/0x70
> [  124.601105]  [<c12f4e36>] xen_evtchn_do_upcall+0x16/0x30
> [  124.601105]  [<c164b2c7>] xen_do_upcall+0x7/0xc
> [  124.601105]  [<c10023a7>] ? hypercall_page+0x3a7/0x1000
> [  124.601105]  [<c104e1c2>] ? xen_safe_halt+0x12/0x20
> [  124.601105]  [<c104e1b0>] ? xen_irq_disable+0x10/0x10
> [  124.601105]  [<c105ed2b>] default_idle+0x5b/0x190
> [  124.601105]  [<c1040054>] ? svm_set_tsc_khz+0x74/0x140
> [  124.601105]  [<c105e27f>] cpu_idle+0x6f/0xa0
> [  124.601105]  [<c16242f8>] rest_init+0x58/0x60
> [  124.601105]  [<c1887919>] start_kernel+0x355/0x35b
> [  124.601105]  [<c1887435>] ? kernel_init+0x1cf/0x1cf
> [  124.601105]  [<c18870ba>] i386_start_kernel+0xa9/0xb0
> [  124.601105]  [<c188b733>] xen_start_kernel+0x5c4/0x5cc
> [  124.601105] ---[ end trace 69a5c8cd56e77bd5 ]---
> [  124.601105] ------------[ cut here ]------------

From: Ian Jackson <Ian.Jackson@eu.citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: david.vrabel@citrix.com, xen-devel@lists.xen.org, boris.ostrovsky@oracle.com
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Tue, 7 Jan 2014 19:23:32 +0000
Message-ID: <21196.21556.181273.225889@mariner.uk.xensource.com>

Konrad Rzeszutek Wilk writes ("Re: 3.4.70+ kernel WARNING spew dysfunction on failed migration"):
> On Tue, Jan 07, 2014 at 06:55:56PM +0000, Ian Jackson wrote:
> > For reasons I don't understand it doesn't seem to print the actual
> > kernel git hash in dmesg, but I think it was that from flight 22264,
> > i.e.  234d96ee0f3b8e49501d068a2a3165aa4db60903.  It's i386, on a
> > 64-bit Xen.
> 
> This a bit of ancient kernel. Does it show up with 3.12?

3.4.70 is what the osstest push gate is using.  (ISTR trying to switch
to 3.11 but encountering some problem.)

I haven't tried 3.12 but can do so.

Ian.

From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: david.vrabel@citrix.com, xen-devel@lists.xen.org
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Tue, 07 Jan 2014 14:36:44 -0500
Message-ID: <52CC574C.8080405@oracle.com>

On 01/07/2014 02:23 PM, Ian Jackson wrote:
> Konrad Rzeszutek Wilk writes ("Re: 3.4.70+ kernel WARNING spew dysfunction on failed migration"):
>> On Tue, Jan 07, 2014 at 06:55:56PM +0000, Ian Jackson wrote:
>>> For reasons I don't understand it doesn't seem to print the actual
>>> kernel git hash in dmesg, but I think it was that from flight 22264,
>>> i.e.  234d96ee0f3b8e49501d068a2a3165aa4db60903.  It's i386, on a
>>> 64-bit Xen.
>> This a bit of ancient kernel. Does it show up with 3.12?
> 3.4.70 is what the osstest push gate is using.  (ISTR trying to switch
> to 3.11 but encountering some problem.)
>
> I haven't tried 3.12 but can do so.
>
> Ian.

This is the hypercall failing, btw:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/drivers/xen/events.c?id=refs/tags/v3.4.75#n1582

-boris

From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: david.vrabel@citrix.com, xen-devel@lists.xen.org
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Tue, 07 Jan 2014 15:05:48 -0500
Message-ID: <52CC5E1C.90704@oracle.com>

On 01/07/2014 02:36 PM, Boris Ostrovsky wrote:
> On 01/07/2014 02:23 PM, Ian Jackson wrote:
>> Konrad Rzeszutek Wilk writes ("Re: 3.4.70+ kernel WARNING spew 
>> dysfunction on failed migration"):
>>> On Tue, Jan 07, 2014 at 06:55:56PM +0000, Ian Jackson wrote:
>>>> For reasons I don't understand it doesn't seem to print the actual
>>>> kernel git hash in dmesg, but I think it was that from flight 22264,
>>>> i.e.  234d96ee0f3b8e49501d068a2a3165aa4db60903.  It's i386, on a
>>>> 64-bit Xen.
>>> This a bit of ancient kernel. Does it show up with 3.12?
>> 3.4.70 is what the osstest push gate is using.  (ISTR trying to switch
>> to 3.11 but encountering some problem.)
>>
>> I haven't tried 3.12 but can do so.
>>
>> Ian.
>
> This is hypercall failing, btw:
>
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/drivers/xen/events.c?id=refs/tags/v3.4.75#n1582 
>

More specifically, it fails

     if ( v->virq_to_evtchn[virq] != 0 )
         ERROR_EXIT(-EEXIST);

in Xen's evtchn_bind_virq().
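
A toy model of that check may help show why a second bind fails when the VIRQ was never unbound. This is only a sketch under stated assumptions, not Xen's code: `toy_vcpu`, `toy_bind_virq` and `NR_VIRQS_TOY` are hypothetical names, and only the `virq_to_evtchn[virq] != 0` test and the `-EEXIST` result come from the snippet above.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative table size; not taken from Xen. */
#define NR_VIRQS_TOY 24

/* Hypothetical stand-in for per-vCPU state: 0 means "VIRQ not bound". */
struct toy_vcpu {
    int virq_to_evtchn[NR_VIRQS_TOY];
};

/*
 * Bind `virq` to `port`. Mirrors the quoted check: if the VIRQ already
 * has a port recorded, refuse with -EEXIST instead of rebinding.
 * Returns the port on success.
 */
int toy_bind_virq(struct toy_vcpu *v, int virq, int port)
{
    assert(virq >= 0 && virq < NR_VIRQS_TOY && port > 0);

    if (v->virq_to_evtchn[virq] != 0)
        return -EEXIST;

    v->virq_to_evtchn[virq] = port;
    return port;
}
```

In this model a guest that resumes in its original domain and binds the same VIRQ again gets `-EEXIST`, because the binding from before the suspend is still recorded.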

Would be interesting to see if this is still a problem in new kernels.

-boris

From: Ian Campbell <Ian.Campbell@citrix.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: xen-devel@lists.xen.org
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Tue, 7 Jan 2014 22:43:01 +0000
Message-ID: <1389134581.6917.19.camel@dagon.hellion.org.uk>

On Tue, 2014-01-07 at 18:55 +0000, Ian Jackson wrote:
> I did the following test:
> 
>    mv /etc/xen/scripts/block /etc/xen/scripts/block.aside
>    xl migrate debian.guest.osstest localhost
> 
> xl did what appears to be the right thing: it did most of the
> migration, failed to run the block scripts at the end of the
> migration, and destroyed the destination domain and instead resumed
> the source guest.
> 
> However, the source guest immediately went mad spewing WARNINGs and
> was after that no longer contactable via the network and not
> apparently responsive on the console.  See below.

Might this be the libxl resume thing described at the end of:
http://lists.xen.org/archives/html/xen-devel/2013-02/msg00130.html ?

I thought we'd switched to using fast resume by default to work around
this, but looking at the code it seems not.

It'd be lovely if the slow path finally got implemented instead of
falling through the cracks again.

Ian.



From: Ian Campbell <Ian.Campbell@citrix.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: xen-devel@lists.xen.org
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Wed, 8 Jan 2014 13:02:24 +0000
Message-ID: <1389186144.4883.60.camel@kazak.uk.xensource.com>

create ^
title it libxl should implement non-suspend-cancel based resume path
owner Ian Jackson <Ian.Jackson@eu.citrix.com>
thanks

To summarise what I just said to Ian J in the corridor (and let's have a
bug to record it):

There are two mechanisms by which a suspend can be aborted and the
original domain resumed.

The older method is that the toolstack resets a bunch of state (see
tools/python/xen/xend/XendDomainInfo.py resumeDomain) and then restarts
the domain. The domain will see HYPERVISOR_suspend return 0 and will
continue without any realisation that it is actually running in the
original domain and not in a new one. This method is supposed to be
implemented by libxl_domain_resume(suspend_cancel=0) but it is not.

The other method is newer and in this case the toolstack arranges that
HYPERVISOR_suspend returns 1 and restarts it (I believe). The domain will
observe this and realise that it has been restarted in the same domain
and will behave accordingly. This method is implemented, correctly
AFAIK, by libxl_domain_resume(suspend_cancel=1).
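
The two cases can be sketched from the guest's point of view. This is a minimal model, assuming only what is described above (a return value of 0 makes the guest behave as if it were in a new domain, 1 tells it the suspend was cancelled). `toy_guest`, `toy_guest_resume` and the `rebind_attempts` counter are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

enum toy_resume_kind {
    TOY_RESUME_NEW_DOMAIN,  /* suspend returned 0 */
    TOY_RESUME_CANCELLED    /* suspend returned 1 */
};

/* Hypothetical guest state: counts how often resources are rebuilt. */
struct toy_guest {
    int rebind_attempts;
};

/*
 * `suspend_ret` models the value the guest sees from HYPERVISOR_suspend.
 * On 1 the guest knows it is still in the original domain and keeps its
 * existing bindings. On 0 it assumes a fresh domain and rebinds, which
 * is only safe if the toolstack really reset that state beforehand.
 */
enum toy_resume_kind toy_guest_resume(struct toy_guest *g, int suspend_ret)
{
    assert(g != NULL);

    if (suspend_ret == 1)
        return TOY_RESUME_CANCELLED;

    g->rebind_attempts++;
    return TOY_RESUME_NEW_DOMAIN;
}
```

The model makes the failure mode visible: if the guest sees 0 but the toolstack did not reset the old state, the rebind runs against bindings that still exist.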

However the newer method is not available in all kernels. It does date
from the Linux 2.6.18 days and is implemented in all Linux pvops
kernels, but I can't speak for others (e.g. BSD). The toolstack is
supposed to check for the XEN_ELFNOTE_SUSPEND_CANCEL ELF note when
building the domain. The presence/absence of this flag needs to be
remembered so that it can be consulted on resume (this also implies
preserving that knowledge over migration).
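
As a sketch of that bookkeeping, assuming a per-domain record that the toolstack fills in at build time and carries across migration: `toy_domain_record` and `toy_choose_suspend_cancel` are hypothetical names and do not exist in libxl. The result is the value one would pass as the `suspend_cancel` argument mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical per-domain record. The flag is set when the kernel image
 * carried XEN_ELFNOTE_SUSPEND_CANCEL at domain build, and must survive
 * migration so it can still be consulted on the receiving host.
 */
struct toy_domain_record {
    bool has_suspend_cancel_note;
};

/*
 * Pick the resume method: 1 (newer, cancel-based) only if the guest
 * advertised support, otherwise 0 (older, reset-and-restart).
 */
int toy_choose_suspend_cancel(const struct toy_domain_record *rec)
{
    assert(rec != NULL);
    return rec->has_suspend_cancel_note ? 1 : 0;
}
```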

xl currently uses libxl_domain_resume(suspend_cancel=0) on migration
failure which as it stands won't work for *any* domain. Arguably
switching to suspend_cancel=1 for now will mean that some subset of
kernels will work, and those which don't will not have regressed, until
we can correctly implement the suspend_cancel=0 and the necessary
tracking of XEN_ELFNOTE_SUSPEND_CANCEL.

I've also just noticed that on failure to save (as opposed to migrate)
xl does use suspend_cancel=1.

Ian.




From: David Vrabel <david.vrabel@citrix.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>, xen-devel@lists.xen.org, Ian Campbell <ian.campbell@citrix.com>
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Wed, 8 Jan 2014 14:19:37 +0000
Message-ID: <52CD5E79.9000008@citrix.com>

On 07/01/14 18:55, Ian Jackson wrote:
> I did the following test:
> 
>    mv /etc/xen/scripts/block /etc/xen/scripts/block.aside
>    xl migrate debian.guest.osstest localhost
> 
> xl did what appears to be the right thing: it did most of the
> migration, failed to run the block scripts at the end of the
> migration, and destroyed the destination domain and instead resumed
> the source guest.
> 
> However, the source guest immediately went mad spewing WARNINGs and
> was after that no longer contactable via the network and not
> apparently responsive on the console.  See below.
> 
> This is with:
> 
>   [    0.000000] Linux version 3.4.70+ (osstest@rice-weevil) (gcc
>   version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Wed Dec 4 03:14:51 GMT 2013
> 
> For reasons I don't understand it doesn't seem to print the actual
> kernel git hash in dmesg, but I think it was that from flight 22264,
> i.e.  234d96ee0f3b8e49501d068a2a3165aa4db60903.  It's i386, on a
> 64-bit Xen.
> 
> Thanks,
> Ian.
> 
> debian login: [  124.595658] PM: freeze of devices complete after 2.980 msecs
> [  124.595991] PM: late freeze of devices complete after 0.013 msecs
> [  124.600919] PM: noirq freeze of devices complete after 4.884 msecs
> [  124.601105] Grant tables using version 2 layout.
> [  124.601105] ------------[ cut here ]------------
> [  124.601105] kernel BUG at drivers/xen/events.c:1582!
> [  124.601105] invalid opcode: 0000 [#1] SMP 
> [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> [  124.601105] 
> [  124.601105] Pid: 6, comm: migration/0 Not tainted 3.4.70+ #1  
> [  124.601105] EIP: 0061:[<c12f5d25>] EFLAGS: 00010082 CPU: 0
> [  124.601105] EIP is at xen_irq_resume+0x215/0x370

We shouldn't be calling xen_irq_resume() when resuming the source VM.
The EVTCHNOP_bind_virq is failing because the VIRQ is still bound.

This would suggest that the suspend hypercall has not correctly returned
the cancelled state.

Could this be because of the tools issue mentioned by Ian C?

David

From: Ian Campbell <Ian.Campbell@citrix.com>
To: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>, xen-devel@lists.xen.org, Boris Ostrovsky <boris.ostrovsky@oracle.com>
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Wed, 8 Jan 2014 14:24:28 +0000
Message-ID: <1389191068.4883.86.camel@kazak.uk.xensource.com>

On Wed, 2014-01-08 at 14:19 +0000, David Vrabel wrote:
> On 07/01/14 18:55, Ian Jackson wrote:
> > I did the following test:
> > 
> >    mv /etc/xen/scripts/block /etc/xen/scripts/block.aside
> >    xl migrate debian.guest.osstest localhost
> > 
> > xl did what appears to be the right thing: it did most of the
> > migration, failed to run the block scripts at the end of the
> > migration, and destroyed the destination domain and instead resumed
> > the source guest.
> > 
> > However, the source guest immediately went mad spewing WARNINGs and
> > was after that no longer contactable via the network and not
> > apparently responsive on the console.  See below.
> > 
> > This is with:
> > 
> >   [    0.000000] Linux version 3.4.70+ (osstest@rice-weevil) (gcc
> >   version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Wed Dec 4 03:14:51 GMT 2013
> > 
> > For reasons I don't understand it doesn't seem to print the actual
> > kernel git hash in dmesg, but I think it was that from flight 22264,
> > i.e.  234d96ee0f3b8e49501d068a2a3165aa4db60903.  It's i386, on a
> > 64-bit Xen.
> > 
> > Thanks,
> > Ian.
> > 
> > debian login: [  124.595658] PM: freeze of devices complete after 2.980 msecs
> > [  124.595991] PM: late freeze of devices complete after 0.013 msecs
> > [  124.600919] PM: noirq freeze of devices complete after 4.884 msecs
> > [  124.601105] Grant tables using version 2 layout.
> > [  124.601105] ------------[ cut here ]------------
> > [  124.601105] kernel BUG at drivers/xen/events.c:1582!
> > [  124.601105] invalid opcode: 0000 [#1] SMP 
> > [  124.601105] Modules linked in: [last unloaded: scsi_wait_scan]
> > [  124.601105] 
> > [  124.601105] Pid: 6, comm: migration/0 Not tainted 3.4.70+ #1  
> > [  124.601105] EIP: 0061:[<c12f5d25>] EFLAGS: 00010082 CPU: 0
> > [  124.601105] EIP is at xen_irq_resume+0x215/0x370
> 
> We shouldn't be calling xen_irq_resume() when resuming the source VM.
> The EVTCHNOP_bind_irq is failing because the VIRQ is still bound.
> 
> This would suggest that the suspend hypercall has not correctly returned
> the cancelled state.
> 
> Could this be because of the tools issue mentioned by Ian C?

I'm fairly confident that it is, yes.

(Well, "this" is actually: the toolstack failed to implement the
old-style resume but told the guest it had, by not returning cancel...)

Ian.



From: Ian Jackson <Ian.Jackson@eu.citrix.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: xen-devel@lists.xen.org
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Thu, 9 Jan 2014 19:08:41 +0000
Message-ID: <21198.62393.819485.361532@mariner.uk.xensource.com>

Ian Campbell writes ("Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration"):
> The older method is that the toolstack resets a bunch of state (see
> tools/python/xen/xend/XendDomainInfo.py resumeDomain) and then restarts
> the domain. The domain will see HYPERVISOR_suspend return 0 and will
> continue without any realisation that it is actually running in the
> original domain and not in a new one. This method is supposed to be
> implemented by libxl_domain_resume(suspend_cancel=0) but it is not.

I have looked into this and I think I can fairly simply implement the
old protocol in libxl.  This is necessary, I think, to preserve our
back-to-3.0 ABI compatibility guarantee.

Looking at a modern pvops Linux kernel, it does seem to try to cope with
older hypervisors which don't do the "new" protocol.  So that's a
reasonable thing to start with, but looking at the code in Linux I
suspect it may not actually work very well.  So if anyone has an
ancient test case of some kind that would be helpful...

Ian.

From: Ian Campbell <Ian.Campbell@citrix.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: xen-devel@lists.xen.org
Subject: Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration
Date: Fri, 10 Jan 2014 10:26:31 +0000
Message-ID: <1389349591.19142.25.camel@kazak.uk.xensource.com>

On Thu, 2014-01-09 at 19:08 +0000, Ian Jackson wrote:
> Ian Campbell writes ("Re: [Xen-devel] 3.4.70+ kernel WARNING spew dysfunction on failed migration"):
> > The older method is that the toolstack resets a bunch of state (see
> > tools/python/xen/xend/XendDomainInfo.py resumeDomain) and then restarts
> > the domain. The domain will see HYPERVISOR_suspend return 0 and will
> > continue without any realisation that it is actually running in the
> > original domain and not in a new one. This method is supposed to be
> > implemented by libxl_domain_resume(suspend_cancel=0) but it is not.
> 
> I have looked into this and I think I can fairly simply implement the
> old protocol in libxl.  This is necessary, I think, to preserve our
> back-to-3.0 ABI compatibility guarantee.
> 
> Looking at a modern pvops Linux kernel, does seem to try to cope with
> older hypervisors which don't do the "new" protocol.  So that's a
> reasonable thing to start with, but looking at the code in Linux I
> suspect it may not actually work very well.  So if anyone has an
> ancient test case of some kind that would be helpful...

The linux-2.6.18-xen.hg kernel ought to work in the old mode I think. Or
any of the SLES fwd ports?

Looks like RHEL4 (linux-2.6.9-89.0.16.EL kernel) doesn't have the
support for the new mode at all.

It would probably be wise to validate this under xend before chasing
red herrings with xl.

Ian.

