#51 - libxl exit() on ENOMEM incompatible with gc'd languages

Owner: Ian Campbell <ian.campbell@citrix.com>

Date: Thu Nov 19 11:30:02 2015

Last Update: Thu Nov 19 11:30:02 2015

Severity: normal

Affects:

State: Open

[ Retrieve as mbox ]


From: Ian Campbell <ian.campbell@citrix.com>
To: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>, Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 19 Nov 2015 11:23:15 +0000
Message-ID: <1447932195.5647.46.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

create !
title it libxl exit() on ENOMEM incompatible with gc'd languages
thanks

On Thu, 2015-11-19 at 10:55 +0000, Andrew Cooper wrote:
> On 19/11/15 09:20, Ian Campbell wrote:
> > On Wed, 2015-11-18 at 18:32 +0000, Martin Osterloh wrote:
> > 
> > > I wanted to inquire about the current state of LibXL and in
> > > particular 
> > > if there are any issues with using it in long-running processes.
> > It is currently being used by libvirtd which I think has shaken out
> > most of
> > the issues with that environment.
> > 
> > There are certain to be other bugs, but nothing show-stopping.
> 
> There really is a show-stopper, which I have stated before.

Ah yes, nobody ever made a proposal to fix this so it slipped off my radar.

I've recorded it in the BTS this time.

> Languages such as OCaml use -ENOMEM as a hint to run the garbage
> collector some more.  I expect Haskell is the same.
> 
> It is not appropriate for libxl (or any library for that matter) to use
> exit() as its method of resolving out-of-memory conditions.

Note that the decision to take this approach was widely consulted at the
time, including with Ocaml folks, it wasn't just done on a whim. That's not
to say we cannot reconsider and find a different, better, approach which
works for such languages.

I think retrofitting all the necessary error paths to libxl to report
memory allocation failures back up to the caller is going to be untenable
and probably a project which would never actually be complete.

But I notice that there are only ~10 calls to libxl__alloc_failed in libxl.

I think it would be possible to turn each of those into a retry loop which
calls an application provided hook function on each iteration.

If no hook is provided by the application then the current behaviour would
remain.

Language bindings for Ocaml/haskel/etc would set the hook and use it to
call into their gc. You'd probably want to limit the number of retry
attempts and provide a way for the app to say "no, we really are out of
memory" if it wants.

Would such an approach work for Ocaml and haskell?

In particular I'd be concerned about things like the ocaml interpreter lock
(caml_enter_blocking_section etc) and calls to libxl not consistently
holding/dropping it potentially leading to deadlocks.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


From: Andrew Cooper <andrew.cooper3@citrix.com>
To: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Ian Campbell <ian.campbell@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 19 Nov 2015 11:33:39 +0000
Message-ID: <564DB393.3070805@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On 19/11/15 11:23, Ian Campbell wrote:
> create !
> title it libxl exit() on ENOMEM incompatible with gc'd languages
> thanks

Can this be extended to "should not use exit() in general" ?

andrewcoop@andrewcoop:/local/xen.git/xen$ git grep exit\( --
:/tools/libxl/libxl*
../tools/libxl/libxl.c:1707:        _exit(0);
../tools/libxl/libxl.c:1711:            _exit(errno);
../tools/libxl/libxl.c:1716:            _exit(-1);
../tools/libxl/libxl_aoutils.c:478:            if (r) {
LOGE(ERROR,"openpty failed"); _exit(-1); }
../tools/libxl/libxl_aoutils.c:482:        if (rc) { LOGE(ERROR,"sendmsg
to parent failed"); _exit(-1); }
../tools/libxl/libxl_aoutils.c:483:        _exit(0);
../tools/libxl/libxl_bootloader.c:557:        if (r) { LOGE(ERROR,
"login_tty failed"); exit(-1); }
../tools/libxl/libxl_bootloader.c:559:        exit(-1);
../tools/libxl/libxl_event.c:1387:    exit(-1);
../tools/libxl/libxl_event.h:104:   * and call exit(-1).
../tools/libxl/libxl_exec.c:106:    _exit(-1);
../tools/libxl/libxl_exec.c:316:        exit(255);
../tools/libxl/libxl_exec.c:324:        _exit(127);
../tools/libxl/libxl_exec.c:340:    _exit(r);
../tools/libxl/libxl_internal.c:28:    _exit(-1);
../tools/libxl/libxl_remus_disk_drbd.c:229:    _exit(ackwait);
../tools/libxl/libxl_save_callout.c:184:                exit(-1);
../tools/libxl/libxl_save_helper.c:64:    if (r < 0) { perror("memory
allocation failed during logging"); exit(-1); }
../tools/libxl/libxl_save_helper.c:102:    exit(-1);
../tools/libxl/libxl_save_helper.c:122:    if (!r) { perror("memory
allocation failed"); exit(-1); }
../tools/libxl/libxl_save_helper.c:189:        if (r<0) {
perror("write"); exit(-1); }
../tools/libxl/libxl_save_helper.c:210:    if (r<=0) exit(-2);
../tools/libxl/libxl_save_helper.c:230:    exit(0);
../tools/libxl/libxlu_cfg_l.c:1763:     exit( YY_EXIT_FAILURE );
../tools/libxl/libxlu_disk_l.c:2248:    exit( YY_EXIT_FAILURE );

The majority of those are cases are not appropriate uses of exit(). 
AFAIIR, the *only* valid use of exit() in a library is to clean up in a
child process from a library-initiated fork().

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Campbell <ian.campbell@citrix.com>
To: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>, Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 19 Nov 2015 11:48:47 +0000
Message-ID: <1447933727.5647.51.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, 2015-11-19 at 11:33 +0000, Andrew Cooper wrote:
> 
> The majority of those are cases are not appropriate uses of exit(). 
> AFAIIR, the *only* valid use of exit() in a library is to clean up in a
> child process from a library-initiated fork().

... or (in this case) in the libxl-save-helper (separate process).

The only one I can find which isn't one of this is
in libxl__event_disaster, and that is only if the applications (or language
bindings) haven't provided a suitable disaster callback.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Campbell <ian.campbell@citrix.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Andrew Cooper <andrew.cooper3@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>
Cc: Yang Hongyang <yanghy@cn.fujitsu.com>, Shriram Rajagopalan <rshriram@cs.ubc.ca>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 19 Nov 2015 11:55:20 +0000
Message-ID: <1447934120.5647.54.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, 2015-11-19 at 11:48 +0000, Ian Campbell wrote:
> On Thu, 2015-11-19 at 11:33 +0000, Andrew Cooper wrote:
> >  
> > The majority of those are cases are not appropriate uses of exit(). 
> > AFAIIR, the *only* valid use of exit() in a library is to clean up in a
> > child process from a library-initiated fork().
> 
> ... or (in this case) in the libxl-save-helper (separate process).
> 
> The only one I can find which isn't one of this is
> in libxl__event_disaster, and that is only if the applications (or language
> bindings) haven't provided a suitable disaster callback.

Was looking at 4.4, in staging I also see a very odd one in
drbd_preresume_async, which isn't obviously in a child process AFAICT.

Hongyang, what prevents that exit from killing the whole toolstack process?

Ian.

> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Campbell <ian.campbell@citrix.com>
To: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Andrew Cooper <andrew.cooper3@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Yang Hongyang <yanghy@cn.fujitsu.com>, Shriram Rajagopalan <rshriram@cs.ubc.ca>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 19 Nov 2015 12:16:51 +0000
Message-ID: <1447935411.5647.55.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, 2015-11-19 at 11:55 +0000, Ian Campbell wrote:
> On Thu, 2015-11-19 at 11:48 +0000, Ian Campbell wrote:
> > On Thu, 2015-11-19 at 11:33 +0000, Andrew Cooper wrote:
> > >  
> > > The majority of those are cases are not appropriate uses of exit(). 
> > > AFAIIR, the *only* valid use of exit() in a library is to clean up in
> > > a
> > > child process from a library-initiated fork().
> > 
> > ... or (in this case) in the libxl-save-helper (separate process).
> > 
> > The only one I can find which isn't one of this is
> > in libxl__event_disaster, and that is only if the applications (or
> > language
> > bindings) haven't provided a suitable disaster callback.
> 
> Was looking at 4.4, in staging I also see a very odd one in
> drbd_preresume_async, which isn't obviously in a child process AFAICT.
> 
> Hongyang, what prevents that exit from killing the whole toolstack
> process?

I had missed an _async suffix on that function versus the one which was the
actual callback, it is invoked via drbd_async_call which involves a fork().

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: George Dunlap <George.Dunlap@eu.citrix.com>
To: Ian Campbell <ian.campbell@citrix.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>, Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 19 Nov 2015 15:34:24 +0000
Message-ID: <CAFLBxZZQ7cdK=mmwVWp+ax0bKE3zOAR2cGcOc=STuVQAX26etg@mail.gmail.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, Nov 19, 2015 at 11:23 AM, Ian Campbell <ian.campbell@citrix.com> wrote:
> create !
> title it libxl exit() on ENOMEM incompatible with gc'd languages
> thanks

Actually, can I suggest that someone send a new thread with an
appropriate title, so that people who aren't directly following the
thread know what kinds of things are being discussed?

Also so Martin can decide whether he wants to follow it or not. :-)

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Yang Hongyang <hongyang.yang@easystack.cn>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>, Ian Campbell <ian.campbell@citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>, Yang Hongyang <yanghy@cn.fujitsu.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Fri, 20 Nov 2015 08:30:46 +0800
Message-ID: <564E69B6.3020509@easystack.cn>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On 2015年11月19日 20:16, Ian Campbell wrote:
> On Thu, 2015-11-19 at 11:55 +0000, Ian Campbell wrote:
>> On Thu, 2015-11-19 at 11:48 +0000, Ian Campbell wrote:
>>> On Thu, 2015-11-19 at 11:33 +0000, Andrew Cooper wrote:
>>>>
>>>> The majority of those are cases are not appropriate uses of exit().
>>>> AFAIIR, the *only* valid use of exit() in a library is to clean up in
>>>> a
>>>> child process from a library-initiated fork().
>>>
>>> ... or (in this case) in the libxl-save-helper (separate process).
>>>
>>> The only one I can find which isn't one of this is
>>> in libxl__event_disaster, and that is only if the applications (or
>>> language
>>> bindings) haven't provided a suitable disaster callback.
>>
>> Was looking at 4.4, in staging I also see a very odd one in
>> drbd_preresume_async, which isn't obviously in a child process AFAICT.
>>
>> Hongyang, what prevents that exit from killing the whole toolstack
>> process?
>
> I had missed an _async suffix on that function versus the one which was the
> actual callback, it is invoked via drbd_async_call which involves a fork().

Yeah, it is in a child process.

>
> Ian.
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

-- 
Thanks,
Yang

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: George Dunlap <dunlapg@umich.edu>
To: Ian Campbell <ian.campbell@citrix.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>, Ian Jackson <Ian.Jackson@eu.citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>, Yang Hongyang <yanghy@cn.fujitsu.com>, Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:09:24 +0000
Message-ID: <CAFLBxZaUU0z3W9GU6imwooA+pZVqheLHDFzPq-8vc-0RsTXNRA@mail.gmail.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, Nov 19, 2015 at 12:16 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
> On Thu, 2015-11-19 at 11:55 +0000, Ian Campbell wrote:
>> On Thu, 2015-11-19 at 11:48 +0000, Ian Campbell wrote:
>> > On Thu, 2015-11-19 at 11:33 +0000, Andrew Cooper wrote:
>> > >
>> > > The majority of those are cases are not appropriate uses of exit().
>> > > AFAIIR, the *only* valid use of exit() in a library is to clean up in
>> > > a
>> > > child process from a library-initiated fork().
>> >
>> > ... or (in this case) in the libxl-save-helper (separate process).
>> >
>> > The only one I can find which isn't one of this is
>> > in libxl__event_disaster, and that is only if the applications (or
>> > language
>> > bindings) haven't provided a suitable disaster callback.
>>
>> Was looking at 4.4, in staging I also see a very odd one in
>> drbd_preresume_async, which isn't obviously in a child process AFAICT.
>>
>> Hongyang, what prevents that exit from killing the whole toolstack
>> process?
>
> I had missed an _async suffix on that function versus the one which was the
> actual callback, it is invoked via drbd_async_call which involves a fork().

So what was the conclusion here?  It looks like we've confirmed that
exit() is only called:

1. In the case of a malloc() failure
2. in libxl-save-helper (a separate process forked by the library)
3. In libxl__event_disaster(), if no callback is provided

Which just leaves #1 as something to be discussed?

(Also, I had suggested making a new thread, but I can't find any
threads continuing this conversation.)

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Jackson <Ian.Jackson@eu.citrix.com>
To: George Dunlap <dunlapg@umich.edu>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>, Ian Campbell <ian.campbell@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>, Yang Hongyang <yanghy@cn.fujitsu.com>, Andrew Cooper <andrew.cooper3@citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:19:56 +0000
Message-ID: <22213.64828.978431.135803@mariner.uk.xensource.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

George Dunlap writes ("Re: [Xen-devel] Current LibXL Status"):
> So what was the conclusion here?  It looks like we've confirmed that
> exit() is only called:
> 
> 1. In the case of a malloc() failure
> 2. in libxl-save-helper (a separate process forked by the library)
> 3. In libxl__event_disaster(), if no callback is provided
  4. In other processes forked by the library

But, yes,l basically.

> Which just leaves #1 as something to be discussed?

Is this crashing on malloc failure a problem ?

From the point of view of libxl's innards, making malloc failures
fatal means that nothing that allocates memory needs to care about
malloc failures.  That massively reduces the number of error paths to
be considered and eliminates an entire class of (largely theoretical)
bugs.

And often there is no good recovery possible (and logging the error is
hard too).

I'm not sure whether I'd want to change this policy even if someone
wanted to commit to auditing libxl and submitting the necessary
patches to cope with every malloc failure.  Having to cope with malloc
failure would be a continual burden on every proposed change or new
feature.

If in practice this is a problem it would probaby be better to run
libxl in a process which can be restarted if it dies due to malloc
failure.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Campbell <ian.campbell@citrix.com>
To: George Dunlap <dunlapg@umich.edu>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Yang Hongyang <yanghy@cn.fujitsu.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>, Shriram Rajagopalan <rshriram@cs.ubc.ca>, Ian Jackson <Ian.Jackson@eu.citrix.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:24:21 +0000
Message-ID: <1455816261.6225.57.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, 2016-02-18 at 17:09 +0000, George Dunlap wrote:
> On Thu, Nov 19, 2015 at 12:16 PM, Ian Campbell <ian.campbell@citrix.com>
> wrote:
> > On Thu, 2015-11-19 at 11:55 +0000, Ian Campbell wrote:
> > > On Thu, 2015-11-19 at 11:48 +0000, Ian Campbell wrote:
> > > > On Thu, 2015-11-19 at 11:33 +0000, Andrew Cooper wrote:
> > > > > 
> > > > > The majority of those are cases are not appropriate uses of
> > > > > exit().
> > > > > AFAIIR, the *only* valid use of exit() in a library is to clean
> > > > > up in
> > > > > a
> > > > > child process from a library-initiated fork().
> > > > 
> > > > ... or (in this case) in the libxl-save-helper (separate process).
> > > > 
> > > > The only one I can find which isn't one of this is
> > > > in libxl__event_disaster, and that is only if the applications (or
> > > > language
> > > > bindings) haven't provided a suitable disaster callback.
> > > 
> > > Was looking at 4.4, in staging I also see a very odd one in
> > > drbd_preresume_async, which isn't obviously in a child process
> > > AFAICT.
> > > 
> > > Hongyang, what prevents that exit from killing the whole toolstack
> > > process?
> > 
> > I had missed an _async suffix on that function versus the one which was
> > the
> > actual callback, it is invoked via drbd_async_call which involves a
> > fork().
> 
> So what was the conclusion here?  It looks like we've confirmed that
> exit() is only called:
> 
> 1. In the case of a malloc() failure
> 2. in libxl-save-helper (a separate process forked by the library)
> 3. In libxl__event_disaster(), if no callback is provided
> 
> Which just leaves #1 as something to be discussed?

Somewhere/when I proposed handling #1 by having the small number of places
which call libxl__alloc_failed() call an application (or language binding)
provided "please free some memory" hook (i.e. "please run your garbage
collector") and then retry some appropriate number of times. Perhaps with
an error code available for the hook to say "this is never going to help".

There's less than a dozen such call sites so this is quite doable, vastly
so compared with adding OOM error handling and reporting back up the
callchain to all libxl functions.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Campbell <ian.campbell@citrix.com>
To: George Dunlap <dunlapg@umich.edu>, Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Yang Hongyang <yanghy@cn.fujitsu.com>, Martin Osterloh <osterlohm@ainfosec.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Shriram Rajagopalan <rshriram@cs.ubc.ca>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:26:33 +0000
Message-ID: <1455816393.6225.59.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, 2016-02-18 at 17:19 +0000, Ian Jackson wrote:
> George Dunlap writes ("Re: [Xen-devel] Current LibXL Status"):
> > So what was the conclusion here?  It looks like we've confirmed that
> > exit() is only called:
> > 
> > 1. In the case of a malloc() failure
> > 2. in libxl-save-helper (a separate process forked by the library)
> > 3. In libxl__event_disaster(), if no callback is provided
>   4. In other processes forked by the library
> 
> But, yes,l basically.
> 
> > Which just leaves #1 as something to be discussed?
> 
> Is this crashing on malloc failure a problem ?

It is for non-C language bindings which might be using garbage collection,
since they might be OOM from a malloc perspective but actually have loads
of spare memory waiting to be collected (which they might plausibly be
doing quite lazily).

I just reminded people of my proposal to provide a callback to allow the
app/bindings to run their gc here in my reply to George before I saw your
reply.

> From the point of view of libxl's innards, making malloc failures
> fatal means that nothing that allocates memory needs to care about
> malloc failures.  That massively reduces the number of error paths to
> be considered and eliminates an entire class of (largely theoretical)
> bugs.
> 
> And often there is no good recovery possible (and logging the error is
> hard too).
> 
> I'm not sure whether I'd want to change this policy even if someone
> wanted to commit to auditing libxl and submitting the necessary
> patches to cope with every malloc failure.  Having to cope with malloc
> failure would be a continual burden on every proposed change or new
> feature.

I agree that we don't want to change this policy, but I think an OOM hook
is sufficient to solve the actual problem.


> If in practice this is a problem it would probaby be better to run
> libxl in a process which can be restarted if it dies due to malloc
> failure.
> 
> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: George Dunlap <George.Dunlap@eu.citrix.com>
To: Ian Campbell <ian.campbell@citrix.com>
Cc: Jonathan Ludlam <jonathan.ludlam@eu.citrix.com>, Rob Hoes <Rob.Hoes@citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Andrew Cooper <andrew.cooper3@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:26:52 +0000
Message-ID: <CAFLBxZbFOOJCx2bQ=kGfSDAso4D-MNuB72wMVLEuLXSnh8+xqg@mail.gmail.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, Nov 19, 2015 at 11:23 AM, Ian Campbell <ian.campbell@citrix.com> wrote:
> create !
> title it libxl exit() on ENOMEM incompatible with gc'd languages
> thanks
>
> On Thu, 2015-11-19 at 10:55 +0000, Andrew Cooper wrote:
>> On 19/11/15 09:20, Ian Campbell wrote:
>> > On Wed, 2015-11-18 at 18:32 +0000, Martin Osterloh wrote:
>> >
>> > > I wanted to inquire about the current state of LibXL and in
>> > > particular
>> > > if there are any issues with using it in long-running processes.
>> > It is currently being used by libvirtd which I think has shaken out
>> > most of
>> > the issues with that environment.
>> >
>> > There are certain to be other bugs, but nothing show-stopping.
[snip]
>> Languages such as OCaml use -ENOMEM as a hint to run the garbage
>> collector some more.  I expect Haskell is the same.

So the implication you've made here, is that if libxl were to return
-ENOMEM to ocaml, that it would run the garbage collector and free up
some heap space, so that malloc() could succeed and the libxl
operation could proceed.

In preparation for writing up a descripion for an "outreach project"
(Outreachy/OPW/GSoC), I went and had a chat with some of the guys on
the xapi team (cc'd) about what such a thing might look like.

According to them, the ocaml garbage collector never frees memory.  It
grows its own internal heap as necessary, but it never reduces the
size of its heap.  So no -ENOMEM call or OOM callback can make a
failed malloc succeed.

One might then ask if libxl could simply allocate memory *from the
ocaml heap* itself.  It turns out that is also not tenable: Data in
the ocaml heap is stored in a heavily coded format.  (For example
integers are stored as (n*2+1), so that pointers can all be even and
non-pointers can all be odd.)

The only thing they said might be improved is:

1.  To know that libxl would never call exit() for any other reason
(which it seems is true)

2. To  have a callback in OOM conditions.  It's unlikely the process
as a whole could do anything but exit, but the ocaml runtime itself
might still have internal heap available, which would allow it to exit
more gracefully.

I think if Haskell or any library *is* capable of integrating with the
garbage collector, we'll have to wait until someone who understands
the language actually writes bindings and can ask for something they
know to be useful for that language.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Campbell <ian.campbell@citrix.com>
To: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jonathan Ludlam <jonathan.ludlam@eu.citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>, Andrew Cooper <andrew.cooper3@citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Rob Hoes <Rob.Hoes@citrix.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:39:27 +0000
Message-ID: <1455817167.6225.65.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, 2016-02-18 at 17:26 +0000, George Dunlap wrote:

> According to them, the ocaml garbage collector never frees memory.  It
> grows its own internal heap as necessary, but it never reduces the
> size of its heap.

Assume that's true: Wow!

Although, I presume this is a factor of the current/particular
implementation, rather than a fundamental property of an ocaml runtime, so
in theory it could change (and we have here a good real world argument why
it perhaps should do!)

> So no -ENOMEM call or OOM callback can make a failed malloc succeed.
> 
> One might then ask if libxl could simply allocate memory *from the
> ocaml heap* itself.  It turns out that is also not tenable: Data in
> the ocaml heap is stored in a heavily coded format.  (For example
> integers are stored as (n*2+1), so that pointers can all be even and
> non-pointers can all be odd.)
> 
> The only thing they said might be improved is:
> 
> 1.  To know that libxl would never call exit() for any other reason
> (which it seems is true)
> 
> 2. To  have a callback in OOM conditions.  It's unlikely the process
> as a whole could do anything but exit, but the ocaml runtime itself
> might still have internal heap available, which would allow it to exit
> more gracefully.
> 
> I think if Haskell or any library *is* capable of integrating with the
> garbage collector, we'll have to wait until someone who understands
> the language actually writes bindings and can ask for something they
> know to be useful for that language.

Agreed.

Hopefully it is possible to make the callback without needing to malloc
anything!

If I were adding an API for #2 to libxl (which must therefore be stable) I
think I might be inclined to allow for the possibility of the callback
returning "please retry" since it would be inconvenient in the usual API
stability ways to retrofit it, I would guess, but really until a language
even exposes that possibility we don't really know if it is worthwhile
doing so.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: George Dunlap <dunlapg@umich.edu>
To: Ian Campbell <ian.campbell@citrix.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>, Ian Jackson <Ian.Jackson@eu.citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Yang Hongyang <yanghy@cn.fujitsu.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:40:42 +0000
Message-ID: <CAFLBxZaiC4RHDhoPc72J3AaCC2nCpJT1cRBWNS=xLm99cqJxCA@mail.gmail.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, Feb 18, 2016 at 5:26 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
> On Thu, 2016-02-18 at 17:19 +0000, Ian Jackson wrote:
>> George Dunlap writes ("Re: [Xen-devel] Current LibXL Status"):
>> > So what was the conclusion here?  It looks like we've confirmed that
>> > exit() is only called:
>> >
>> > 1. In the case of a malloc() failure
>> > 2. in libxl-save-helper (a separate process forked by the library)
>> > 3. In libxl__event_disaster(), if no callback is provided
>>   4. In other processes forked by the library
>>
>> But, yes,l basically.
>>
>> > Which just leaves #1 as something to be discussed?
>>
>> Is this crashing on malloc failure a problem ?
>
> It is for non-C language bindings which might be using garbage collection,
> since they might be OOM from a malloc perspective but actually have loads
> of spare memory waiting to be collected (which they might plausibly be
> doing quite lazily).
>
> I just reminded people of my proposal to provide a callback to allow the
> app/bindings to run their gc here in my reply to George before I saw your
> reply.
>
>> From the point of view of libxl's innards, making malloc failures
>> fatal means that nothing that allocates memory needs to care about
>> malloc failures.  That massively reduces the number of error paths to
>> be considered and eliminates an entire class of (largely theoretical)
>> bugs.
>>
>> And often there is no good recovery possible (and logging the error is
>> hard too).
>>
>> I'm not sure whether I'd want to change this policy even if someone
>> wanted to commit to auditing libxl and submitting the necessary
>> patches to cope with every malloc failure.  Having to cope with malloc
>> failure would be a continual burden on every proposed change or new
>> feature.
>
> I agree that we don't want to change this policy, but I think an OOM hook
> is sufficient to solve the actual problem.

And in the situation like ocaml seems to be at the moment, it also
gives the process an opportunity to attempt to shut down as gracefully
as it can even if it knows it can't free up any more memory to libxl.
(And who knows, some future version of ocaml may even allow you to ask
to shrink the heap.)

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: George Dunlap <George.Dunlap@eu.citrix.com>
To: Ian Campbell <ian.campbell@citrix.com>
Cc: Rob Hoes <Rob.Hoes@citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Andrew Cooper <andrew.cooper3@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>, Jonathan Ludlam <jonathan.ludlam@eu.citrix.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:47:53 +0000
Message-ID: <CAFLBxZZPt8bwrmiJGK6aAkCasDMzMUxE9RapgHowV88tP9ek-A@mail.gmail.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, Feb 18, 2016 at 5:39 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
> On Thu, 2016-02-18 at 17:26 +0000, George Dunlap wrote:
>
>> According to them, the ocaml garbage collector never frees memory.  It
>> grows its own internal heap as necessary, but it never reduces the
>> size of its heap.
>
> Assume that's true: Wow!
>
> Although, I presume this is a factor of the current/particular
> implementation, rather than a fundamental property of an ocaml runtime, so
> in theory it could change (and we have here a good real world argument why
> it perhaps should do!)
>
>> So no -ENOMEM call or OOM callback can make a failed malloc succeed.
>>
>> One might then ask if libxl could simply allocate memory *from the
>> ocaml heap* itself.  It turns out that is also not tenable: Data in
>> the ocaml heap is stored in a heavily coded format.  (For example
>> integers are stored as (n*2+1), so that pointers can all be even and
>> non-pointers can all be odd.)
>>
>> The only thing they said might be improved is:
>>
>> 1.  To know that libxl would never call exit() for any other reason
>> (which it seems is true)
>>
>> 2. To  have a callback in OOM conditions.  It's unlikely the process
>> as a whole could do anything but exit, but the ocaml runtime itself
>> might still have internal heap available, which would allow it to exit
>> more gracefully.
>>
>> I think if Haskell or any library *is* capable of integrating with the
>> garbage collector, we'll have to wait until someone who understands
>> the language actually writes bindings and can ask for something they
>> know to be useful for that language.
>
> Agreed.
>
> Hopefully it is possible to make the callback without needing to malloc
> anything!
>
> If I were adding an API for #2 to libxl (which must therefore be stable) I
> think I might be inclined to allow for the possibility of the callback
> returning "please retry" since it would be inconvenient in the usual API
> stability ways to retrofit it, I would guess, but really until a language
> even exposes that possibility we don't really know if it is worthwhile
> doing so.

Well we could just make the semantics such that if the callback
returns at all, that libxl should retry.  That shouldn't be
particularly harder than just making a callback we don't expect to
ever return (which is what it sounds like the xapi folks would be
happy with).  Then when we come across a language runtime /
application that *can* do something useful, it might possibly Just
Work; but if it doesn't we can implement what they actually need.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Campbell <ian.campbell@citrix.com>
To: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jonathan Ludlam <jonathan.ludlam@eu.citrix.com>, Rob Hoes <Rob.Hoes@citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Andrew Cooper <andrew.cooper3@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 17:50:41 +0000
Message-ID: <1455817841.6225.70.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

On Thu, 2016-02-18 at 17:39 +0000, Ian Campbell wrote:
> On Thu, 2016-02-18 at 17:26 +0000, George Dunlap wrote:
> 
> > According to them, the ocaml garbage collector never frees memory.  It
> > grows its own internal heap as necessary, but it never reduces the
> > size of its heap.
> 
> Assume that's true: Wow!

https://github.com/ocaml/ocaml/blob/trunk/byterun/compact.c#L374
is a call by do_compaction to caml_shrink_heap at
https://github.com/ocaml/ocaml/blob/trunk/byterun/memory.c#L408
which calls caml_free_for_heap at
https://github.com/ocaml/ocaml/blob/trunk/byterun/memory.c#L290
which certainly looks like it is returning memory to the malloc pool.

Maybe this never happens in practice for some reason, or perhaps the
"byterun" in those paths is relevant? I couldn't find another heap
implementation which would be associated with native binaries.

http://caml.inria.fr/mantis/view.php?id=5389 also seems to suggest that it
is expected for the heap to shrink (i.e. it was considered a bug that it
did not in this case).

Ian.

> 
> Although, I presume this is a factor of the current/particular
> implementation, rather than a fundamental property of an ocaml runtime,
> so
> in theory it could change (and we have here a good real world argument
> why
> it perhaps should do!)
> 
> > So no -ENOMEM call or OOM callback can make a failed malloc succeed.
> > 
> > One might then ask if libxl could simply allocate memory *from the
> > ocaml heap* itself.  It turns out that is also not tenable: Data in
> > the ocaml heap is stored in a heavily coded format.  (For example
> > integers are stored as (n*2+1), so that pointers can all be even and
> > non-pointers can all be odd.)
> > 
> > The only thing they said might be improved is:
> > 
> > 1.  To know that libxl would never call exit() for any other reason
> > (which it seems is true)
> > 
> > 2. To  have a callback in OOM conditions.  It's unlikely the process
> > as a whole could do anything but exit, but the ocaml runtime itself
> > might still have internal heap available, which would allow it to exit
> > more gracefully.
> > 
> > I think if Haskell or any library *is* capable of integrating with the
> > garbage collector, we'll have to wait until someone who understands
> > the language actually writes bindings and can ask for something they
> > know to be useful for that language.
> 
> Agreed.
> 
> Hopefully it is possible to make the callback without needing to malloc
> anything!
> 
> If I were adding an API for #2 to libxl (which must therefore be stable)
> I
> think I might be inclined to allow for the possibility of the callback
> returning "please retry" since it would be inconvenient in the usual API
> stability ways to retrofit it, I would guess, but really until a language
> even exposes that possibility we don't really know if it is worthwhile
> doing so.
> 
> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Jackson <Ian.Jackson@eu.citrix.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>, Martin Osterloh <osterlohm@ainfosec.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: [Xen-devel] libxl and malloc failure (Re:  Current LibXL Status)
Date: Thu, 18 Feb 2016 18:15:23 +0000
Message-ID: <22214.2619.86697.724156@mariner.uk.xensource.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

Andrew Cooper writes ("Re: [Xen-devel] Current LibXL Status"):
> There really is a show-stopper, which I have stated before.
> 
> Languages such as OCaml use -ENOMEM as a hint to run the garbage
> collector some more.  I expect Haskell is the same.

This cannot possibly be true (if what you mean is that they use
ENOMEM[1] from malloc as such an indication).  It would make it
impossible to write a correct binding to a normal C library.

Typically C library which calls malloc will do so in the middle of its
execution.  Even if the library returns the resulting error up as
a distinguishable error code, you can't just make the same library
call again - it may have done half its work but not the other half.

So I think you must be wrong.

Ian.

[1] NB that when malloc fails there is no -ENOMEM, only ENOMEM.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Jackson <Ian.Jackson@eu.citrix.com>
To: Ian Campbell <ian.campbell@citrix.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>, Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [Xen-devel] Current LibXL Status
Date: Thu, 18 Feb 2016 18:30:57 +0000
Message-ID: <22214.3553.260609.617265@mariner.uk.xensource.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

Ian Campbell writes ("Re: [Xen-devel] Current LibXL Status"):
> The only one I can find which isn't one of this is
> in libxl__event_disaster, and that is only if the applications (or language
> bindings) haven't provided a suitable disaster callback.

libxl__event_disaster can currently happen in the following cases.

Disasters which are specific to a specific requested event (type!=0):

  xc_domain_getinfolist failure (breaks reporting domain death, only
  xw_write failure for acknowledging disk eject (breaks reporting
      ejects, only)

General disasters (type==0) due to xenstore problems:

  POLLERR or POLLHUP on xenstore fd
  xs_check_watch fails (represents trouble on xenstore fd or with xenstore)
  xs_read for domain death check

General disasters (type==0) due to hypercall trouble:

  xenevtchn_pending fails other than with EAGAIN
  (unrequested) event other than POLLIN on evtchn fd

General disasters (type==0) due to syscall failure:

  poll syscall fails
  (unrequested) event other than POLLIN on self-pipe
  read() or write() fails on a self-pipe (other than EAGAIN/EWOULDBLOCK)
  waitpid() syscall fails

General disasters (type==0) induced by the application:

  waitpid reports ECHILD when we know we have a child (probably,
    something else in the process reaped it; arguably this should abort)
  application's childproc_hooks->reaped_callback failure

Disasters are not reported unless it is impossible to proceed.  So in
general it is not possible for a process to recover from a disaster
reported by libxl.

But disasters are never expected.  They should occur only if
everything is totally doomed anyway.  Having a daemon crash is a
perfectly reasonable response.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

From: Ian Campbell <ian.campbell@citrix.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>
Subject: Re: [Xen-devel] libxl and malloc failure (Re: Current LibXL Status)
Date: Fri, 19 Feb 2016 10:52:11 +0000
Message-ID: <1455879131.6225.89.camel@citrix.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

graft 51 ^
thanks

On Thu, 2016-02-18 at 18:15 +0000, Ian Jackson wrote:
> Andrew Cooper writes ("Re: [Xen-devel] Current LibXL Status"):
> > There really is a show-stopper, which I have stated before.
> > 
> > Languages such as OCaml use -ENOMEM as a hint to run the garbage
> > collector some more.  I expect Haskell is the same.
> 
> This cannot possibly be true (if what you mean is that they use
> ENOMEM[1] from malloc as such an indication).  It would make it
> impossible to write a correct binding to a normal C library.
> 
> Typically C library which calls malloc will do so in the middle of its
> execution.  Even if the library returns the resulting error up as
> a distinguishable error code, you can't just make the same library
> call again - it may have done half its work but not the other half.

FWIW there is some code in the Ocaml runtime which raises an Out_of_memory
exception in response to ENOMEM in C land (and from ocaml code in some
circumstances too).

The Unix module (which wraps standard Unix libc type APIs) looks to (when
appropriate) reasonably consistently return errors as Unix_error(errno),
that is handled generically so will include raising Unix_error(ENOMEM) as
necessary.

In principal it ought to be possible to write an ocaml program which
correctly dealt with those (and similar ones from other lower level
bindings), for example the interactive command line interpreter catches
Out_of_memory (but not Unix_error) and calls the gc before iterating. (I
did see one reference in a book which suggested Out_of_memory wasn't
catcheable, but that appears to be out of date from what I can tell).

So I suppose it ought to be possible to write an ocaml daemon which caught
errors like these at the top level and aborted only the currently attempted
operation without taking down the whole daemon process. How feasible that
is in practice I can't say.

This of course relies on all the C bindings correctly turning errors into
exceptions and the semantics of your particular daemon allowing for
abandoning things half way through (which I suppose is a relative of some
kind to crash-only s/w).

Ian.

> 
> So I think you must be wrong.
> 
> Ian.
> 
> [1] NB that when malloc fails there is no -ENOMEM, only ENOMEM.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


From: Ian Jackson <Ian.Jackson@eu.citrix.com>
To: Ian Campbell <ian.campbell@citrix.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Martin Osterloh <osterlohm@ainfosec.com>, Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [Xen-devel] libxl and malloc failure (Re: Current LibXL Status)
Date: Mon, 22 Feb 2016 16:48:50 +0000
Message-ID: <22219.15346.138200.277441@mariner.uk.xensource.com>

[ Reply to this message; Retrieve Raw Message; Archives: marc.info, gmane ]

Ian Campbell writes ("Re: libxl and malloc failure (Re: [Xen-devel] Current LibXL Status)"):
> On Thu, 2016-02-18 at 18:15 +0000, Ian Jackson wrote:
> > Andrew Cooper writes ("Re: [Xen-devel] Current LibXL Status"):
> > > Languages such as OCaml use -ENOMEM as a hint to run the garbage
> > > collector some more.  I expect Haskell is the same.
> > 
> > This cannot possibly be true (if what you mean is that they use
> > ENOMEM[1] from malloc as such an indication).  It would make it
> > impossible to write a correct binding to a normal C library.
> > 
> > Typically C library which calls malloc will do so in the middle of its
> > execution.  Even if the library returns the resulting error up as
> > a distinguishable error code, you can't just make the same library
> > call again - it may have done half its work but not the other half.
> 
> FWIW there is some code in the Ocaml runtime which raises an Out_of_memory
> exception in response to ENOMEM in C land (and from ocaml code in some
> circumstances too).

ISTM that either there is something else which is turning ocaml out of
heap errors into ENOMEM, or that this code you describe is
fundamentally misconceived.

> In principal it ought to be possible to write an ocaml program which
> correctly dealt with those (and similar ones from other lower level
> bindings), for example the interactive command line interpreter catches
> Out_of_memory (but not Unix_error) and calls the gc before iterating. (I
> did see one reference in a book which suggested Out_of_memory wasn't
> catcheable, but that appears to be out of date from what I can tell).

But iterating in the cli would not recover the operation which had
failed.

In the general case, there is no recovery possible when a C language
library function returns ENOMEM.  Sometimes the internal state of the
library will now be unsuitable.  Certainly, a policy of always
retrying the operation after a gc would be very wrong indeed.

Ie if ocaml has the behaviour claimed by Andrew (ie, that the need for
a gc can be detected during malloc and reported as ENOMEM), all ocaml
programs would already suffer from occasional random failures of
operations being performed by code written in C.

Note that "need for a gc" is not an abnormal part of operation in a
consing garbage collected language like ocaml.  It is entirely
routine.  Ie these random failures would occur in normal, otherwise
entirely successful, operation.  They would occur even if the
program's actual memory use was modest.

> This of course relies on all the C bindings correctly turning errors into
> exceptions and the semantics of your particular daemon allowing for
> abandoning things half way through (which I suppose is a relative of some
> kind to crash-only s/w).

Even if the operations are properly abandonded, they have still been
erroneously made to fail.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel