guile-user
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: out-of-control GC


From: Linas Vepstas
Subject: Re: out-of-control GC
Date: Sat, 9 Sep 2017 23:14:46 -0500

To make this conversation less crazy, and more down-to-earth, here is a
demo that seems to spend most of its time in GC. Hard for me to to say that
this is "wrong", right now, but it illustrates the scandalous nature of the
problem.

Its a simple loop, about 6 lines of code total, it mostly just wastes cpu
time.  There's also a longer progress-report printer, similar to the
previous one I posted.

It runs at about 1 million iterations/minute, for me, of which 15 seconds
are spent "doing real work", and 45-75 seconds in gc.  Note that GC runs in
parallel, usually using about 3-4 cores, so the fraction-cpu-time is the
fraction relative to wall-clock, so that %350 cpu time is typical (i.e. 3.5
cpu cores are running)

This seems to be a reasonable model of the behavior that I am seeing.

--linas

; Report the average amount of time spent in GC
(define avg-gc-cpu-time
(let ((last-gc (gc-stats))
(start-time (get-internal-real-time))
(run-time (get-internal-run-time)))
(lambda ()
(define now (get-internal-real-time))
(define run (get-internal-run-time))
(define cur (gc-stats))
(define gc-time-taken (* 1.0e-9 (- (cdar cur) (cdar last-gc))))
(define elapsed-time (* 1.0e-9 (- now start-time)))
(define cpu-time (* 1.0e-9 (- run run-time)))
(define non-gc-cpu (- cpu-time gc-time-taken))
(define ngc (- (assoc-ref cur 'gc-times)
(assoc-ref last-gc 'gc-times)))
(format #t "Elapsed: ~5f secs. non-gc: ~5f secs Rate: ~5f gc/min %GC: ~5f%
 %cpu: ~5f%\n"
elapsed-time
non-gc-cpu
(/ (* ngc 60) elapsed-time)
(* 100 (/ gc-time-taken elapsed-time))
(* 100 (/ cpu-time elapsed-time))
)
(set! last-gc cur)
(set! start-time now)
(set! run-time run))))

; a tail-recursive loop, seems to waste a lot of time in gc.
(define (make-bgl lis cnt)
; a bogus longer list
(define longer (cons (format #f "~A" cnt) lis))
; periodically report statistics
(if (eq? 0 (modulo cnt (* 1000 1000)))
(begin (format #t "~A " cnt)  (avg-gc-cpu-time)))
; periodically trim the list
(if (eq? 0 (modulo cnt (* 4123 1123)))
(set! longer '()))
; loop for a while.
(if (eq? 0 cnt) lis
(make-bgl longer (- cnt 1))))

; (define foo (make-bgl '() (* 1235 1000 1000)))




On Sat, Sep 9, 2017 at 2:47 PM, Linas Vepstas <address@hidden>
wrote:

>
> I've been experiencing problems with guile GC for a while now, that I've
> mostly been able to ignore, but that are becoming debilitating, and I need
> help, or need to find some solution.
>
> Background: I have a C++ library that has guile bindings; it is wrapped up
> as a guile module. It manages a lot of data, anywhere from a few gigabytes,
> to a over a hundred gigabytes.  Almost all this RAM usage should be
> invisible, opaque to guile: so, for example,  while my library is managing
> 5 or 10 GBytes, guile itself is managing maybe 20 or 50 or 100MBytes,
> according to (gc-stats). This is normal and expected. Some of my programs
> have guile use considerably more than this, in the multi-GByte range
>
> However, the larger the RAM usage, the more frequently guile GC seems to
> run, and the longer it takes, so that by the time my dataset is 100GBytes,
> guile is using maybe 1GBytes, and is spending 99% of its time in GC. My app
> slows to a crawl, and nothing gets done.  I need to get GC under control,
> because it is running pointlessly, because its clear that (almost) nothing
> can be reclaimed, (usually) nothing is being reclaimed.
>
> Here's my tool:
>
>    ; Report the average amount of time spent in GC
>    (define avg-gc-cpu-time
>       (let ((last-gc (gc-stats))
>             (start-time (get-internal-real-time))
>             (run-time (get-internal-run-time)))
>          (lambda ()
>             (define now (get-internal-real-time))
>             (define run (get-internal-run-time))
>             (define cur (gc-stats))
>             (define gc-time-taken (* 1.0e-9 (- (cdar cur) (cdar last-gc))))
>             (define elapsed-time (* 1.0e-9 (- now start-time)))
>             (define cpu-time (* 1.0e-9 (- run run-time)))
>             (format #t "Elapsed time: ~5f secs. GC: ~5f%  cpu-usage:
> ~5f%\n"
>                elapsed-time
>                (* 100 (/ gc-time-taken elapsed-time))
>                (* 100 (/ cpu-time elapsed-time))
>             )
>             (set! last-gc cur)
>             (set! start-time now)
>             (set! run-time run))))
>
> Here's what it prints:
>
> guile-en-r> (avg-gc-cpu-time)
> Elapsed time: 46.03 secs. GC: 342.8%  cpu-usage: 346.4%
>
> guile-en-r> (avg-gc-cpu-time)
> Elapsed time: 7288. secs. GC: 329.5%  cpu-usage: 330.9%
>
> guile-en-r> (avg-gc-cpu-time)
> Elapsed time: 3205. secs. GC: 318.0%  cpu-usage: 319.4%
>
> Here's a typical gc from the start/middle of this.
>
> > (gc-stats)
> ((gc-time-taken . 634825386689) (heap-size . 16199680) (heap-free-size .
> 1925120) (heap-total-allocated . 15424779760) (heap-allocated-since-gc .
> 7692512) (protected-objects . 7) (gc-times . 3786))
>
> In this case, the heap is 16MBytes, little or no work is being done in
> guile, almost all is done in the C++ code. GC has taken 634 seconds ... a
> lot but acceptable.  After this point, the number of GC's skyrockets -- a
> gc ever few seconds, sometimes tens of them per second.  It sucks up cpu
> time, and makes the guile prompt be very laggy.  I'd give more examples,
> except that now, my app has moved to the stage where it really is creating
> giant-sized scheme lists:
>
> > (gc-stats)
> ((gc-time-taken . 281465535649189) (heap-size . 5175664640
> <(517)%20566-4640>) (heap-free-size
> . 1538387968) (heap-total-allocated . 67649385200) (heap-allocated-since-gc
> . 6112) (protected-objects . 7) (gc-times . 11319))
>
> That's right: 5GBytes of scheme stuff, and 281465 seconds = 78 cpu-hours
> spent in GC.  top shows 116GBytes total in the app, which is normal and
> expected.
>
> When my app is about 50GBytes in size, it will run for about 12 hours,
> with an outrageous amount of time spent in GC, which I just ignore out of
> habit, because I can wait 12 hours.  This one, a bigger dataset, appears to
> be stalled -- after 24 hours, less than 1% of the computations have been
> completed.  I'm screwed until I can somehow disable guile GC long enough
> for the thing to make forward progress.
>
> I'm sort-of willing to hack on guile internals, except that my project is
> already late and I am panicking ...
>
> --linas
>
> --
> *"The problem is not that artificial intelligence will get too smart and
> take over the world," computer scientist Pedro Domingos writes, "the
> problem is that it's too stupid and already has." *
>



-- 
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *


reply via email to

[Prev in Thread] Current Thread [Next in Thread]