From xen-devel-bounces@lists.xen.org Fri Sep 11 17:54:38 2015 Received: (at maildrop) by bugs.xenproject.org; 11 Sep 2015 16:54:38 +0000 Received: from lists.xen.org ([50.57.142.19]) by bugs.xenproject.org with esmtp (Exim 4.80) (envelope-from ) id 1ZaRav-0004Ze-VT for xen-devel-maildrop-Eithu9ie@bugs.xenproject.org; Fri, 11 Sep 2015 17:54:37 +0100 Received: from localhost ([127.0.0.1] helo=lists.xen.org) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1ZaRXV-0007cq-5M; Fri, 11 Sep 2015 16:51:05 +0000 Received: from mail6.bemta3.messagelabs.com ([195.245.230.39]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1ZaRXU-0007cl-2t for xen-devel@lists.xensource.com; Fri, 11 Sep 2015 16:51:04 +0000 Received: from [85.158.137.68] by server-10.bemta-3.messagelabs.com id 94/7B-23203-77603F55; Fri, 11 Sep 2015 16:51:03 +0000 X-Env-Sender: konrad.wilk@oracle.com X-Msg-Ref: server-2.tower-31.messagelabs.com!1441990261!12419958!1 X-Originating-IP: [141.146.126.69] X-SpamReason: No, hits=0.0 required=7.0 tests=sa_preprocessor: VHJ1c3RlZCBJUDogMTQxLjE0Ni4xMjYuNjkgPT4gMjc3MjE4\n X-StarScan-Received: X-StarScan-Version: 6.13.16; banners=-,-,- X-VirusChecked: Checked Received: (qmail 23032 invoked from network); 11 Sep 2015 16:51:02 -0000 Received: from aserp1040.oracle.com (HELO aserp1040.oracle.com) (141.146.126.69) by server-2.tower-31.messagelabs.com with DHE-RSA-AES256-SHA encrypted SMTP; 11 Sep 2015 16:51:02 -0000 Received: from aserv0022.oracle.com (aserv0022.oracle.com [141.146.126.234]) by aserp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with ESMTP id t8BGojv7014875 (version=TLSv1 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Fri, 11 Sep 2015 16:50:45 GMT Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by aserv0022.oracle.com (8.13.8/8.13.8) with ESMTP id t8BGoiNb025746 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL); Fri, 11 Sep 2015 16:50:44 GMT Received: from abhmp0015.oracle.com (abhmp0015.oracle.com [141.146.116.21]) by userv0121.oracle.com (8.13.8/8.13.8) with ESMTP id t8BGoheS014964; Fri, 11 Sep 2015 16:50:43 GMT Received: from l.oracle.com (/10.137.176.158) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Fri, 11 Sep 2015 09:50:42 -0700 Received: by l.oracle.com (Postfix, from userid 1000) id 79B706A3C80; Fri, 11 Sep 2015 12:50:41 -0400 (EDT) Date: Fri, 11 Sep 2015 12:50:41 -0400 From: Konrad Rzeszutek Wilk To: Zhu Yanhai Message-ID: <20150911165041.GB27598@l.oracle.com> References: <1383720072-6242-1-git-send-email-gaoyang.zyh@taobao.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1383720072-6242-1-git-send-email-gaoyang.zyh@taobao.com> User-Agent: Mutt/1.5.23 (2014-03-12) X-Source-IP: aserv0022.oracle.com [141.146.126.234] Cc: Zhu Yanhai , xen-devel@lists.xensource.com, Ian Campbell , George Dunlap , Andrew Cooper , xen@bugs.xenproject.org, Charles Wang , Shen Yiben , Wan Jia Subject: Re: [Xen-devel] [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler X-BeenThere: xen-devel@lists.xen.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: Xen developer discussion List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org On Wed, Nov 06, 2013 at 02:41:12PM +0800, Zhu Yanhai wrote: > As we know Intel X86's CR0.TS is a sticky bit, which means once set > it remains set until cleared by some software routines, in other words, > the exception handler expects the bit is set when it starts to execute. > > However xen doesn't simulate this behavior quite well for PV guests - > vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning, > so the guest kernel's #NM handler runs with CR0.TS cleared. Generally speaking > it's fine since the linux kernel executes the exception handler with > interrupt disabled and a sane #NM handler will clear the bit anyway > before it exits, but there's a catch: if it's the first FPU trap for the process, > the linux kernel must allocate a piece of SLAB memory for it to save > the FPU registers, which opens a schedule window as the memory > allocation might sleep -- and with CR0.TS keeps clear! > With the Ingo's FPU rewrite we haven't been able to retrigger this. (Tests ran for 2 weeks while they would have failed within two hours). And when I dug in this I found the reason: commit 0c8c0f03e3a292e031596484275c14cf39c0ab7a Author: Dave Hansen Date: Fri Jul 17 12:28:11 2015 +0200 x86/fpu, sched: Dynamically allocate 'struct fpu' The FPU rewrite removed the dynamic allocations of 'struct fpu'. But, this potentially wastes massive amounts of memory (2k per task on systems that do not have AVX-512 for instance). Instead of having a separate slab, this patch just appends the space that we need to the 'task_struct' which we dynamically allocate already. This saves from doing an extra slab allocation at fork(). And that when the #NM is called ('do_device_not_available') it does: fpu__restore(¤t->thread.fpu); /* interrupts still off */ |+- fpu__activate_curr (which just inits the already allocated space) | \- memset(state, 0, xstate_size); |+- fpregs_activate \- stts() So there is no scheduling window during this time, while in kernels prior to Linux 4.2 there was. And it took a bit of time to figure out what exactly the problem was. I appreciate folks emails (and this giant thread) about this but without some sort of diagram it was hard to understand this (at least to me). So here it is in case somebody is doing code archaeology: For simplicity we assume the guest/baremetal use the lazy mechanism not eager. That makes 'switch_fpu_prepare' (called by schedule()) effectively: if (previous task had PF_USED_MATH set) stts (CR0.TS=1) else ; I am ignoring the case if the task had used the FPU more than five times - where we do things a bit different. The time diagram looks great at 132x42. Anyhow, lets assume that we have two tasks: A and B. Both haven't used the FPU. This is on PVHVM: CR0.TS=1 CR0.TS=1 CR0.TS=0 CR0.TS=1 CR0.TS=0 ------------------------------------------------------------------------------------+--------+-------------------+-------+ task A | #NM |task B| |taskB | | task A | |taskA | MMX |math_state_restore | | | | | | | | op | \- fpu_init | | | | | | | | | \- .. schedule() | | | | | | | | | [swap task B] | | | | | | | | | [since task A | | | | | | | | | hadn't set | | | | | | | | | PF_USED_MATH | | | | | | | | | we don't muck| | | | | | | | | with CR0.TS] | | | | | | | | | |MMX op| | | | | | | | | |#NM | | | | | | | | |math_state_restore | | | | | | | | | fpu_init worked | | | | | | | | | clts() | | | | | | | | |task_B->flags |= | | | | | | | | | PF_USED_MATH | | | | | | | | | return; | | | | | | | | | |syscall| | | | | | | | | |schedule() | | | | | | | | |[swap task A] | | | | | | | | |[taskB has | | | | | | | | | PF_USED_MATH] | | | | | | | | |[so CR0.TS=1] | | | | | | | | | task A runs | | | | | | | | | |MMX op | | | | | | | | | |#NM | | | | | | | | | fpu_init works | | | | | | | | | clts() | | | | | | | | | taskA->flags |= | | | | | | | | | PF_USED_MATH | | | | | | | | | return | | | | | | | | | |MMX op | And under Xen PV: CR0.TS=1 CR0.TS=0 CR0.TS=0 CR0.TS=0 CR0.TS=0 [but Xen sets it to CR0.TS=0 and calls Linux #NM:] ------------------------------------------------------------------------------------+--------+-------------------+-------+ task A | #NM |task B| |taskB | | task A | |taskA | MMX |math_state_restore | | | | | | | | op | \- fpu_init | | | | | | | | | \- .. schedule() | | | | | | | | | [swap task B] | | | | | | | | | [since task A | | | | | | | | | hadn't set | | | | | | | | | PF_USED_MATH | | | | | | | | | we don't muck| | | | | | | | | with CR0.TS] | | | | | | | | | |MMX op| | | | | | | | | |[no trap to Linux or| | | | | | | | |Xen as CR0.TS=0] | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |syscall| | | | | | | | | |schedule() | | | | | | | | |[swap task A] | | | | | | | | |[with task B] | | | | | | | | | task A runs | | | | | | | | | |MMX op | | | | | | | | | |[again, no trap to | | | | | | | | | Xen or Linux b/c | | | | | | | | | CR0.TS=0 *1] | | | | | | | | | |MMX op | And so on - FPU registers are effectively leaking across tasks. The [*1] refers to the Xen scheduler. If any of the syscalls that the user application called, ended in the Linux kernel halt (xen_safe_halt) routine - we would deschedule the guest VCPU. When that VCPU is re-scheduled, Xen would set CR0.TS=1 back so the #NM would function again. Not pretty - and only happening if the fpu_alloc() ends up calling the schedule(). I followed Jan's recommendation and cobbled this up: diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 324ab52..06b7843 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -837,11 +837,17 @@ asmlinkage __visible void __attribute__((weak)) smp_threshold_interrupt(void) * Must be called with kernel preemption disabled (eg with local * local interrupts as in the case of do_device_not_available). */ +#include void math_state_restore(void) { struct task_struct *tsk = current; if (!tsk_used_math(tsk)) { + /* + * See http://bugs.xenproject.org/xen/bug/40 + */ + if (xen_pv_domain()) + stts(); local_irq_enable(); /* * does a slab alloc which can sleep @@ -854,6 +860,8 @@ void math_state_restore(void) return; } local_irq_disable(); + if (xen_pv_domain()) + clts(); } /* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */ For the older pvops kernels and trying it out now. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel