bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#39413: 26.2; Emacs gets hung


From: chiaki-ishikawa-thunderbird-account
Subject: bug#39413: 26.2; Emacs gets hung
Date: Wed, 20 May 2020 13:31:29 +0900
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0

Hi,

I am back.

I have observed another long hung.

This time, I did verify that it is during GC.

The hung lasted close to 8 minutes give or take.

My comment inline.

On 2020/04/30 11:45, chiaki-ishikawa-thunderbird-account wrote:
Hi,

Thank you for your tips.

My comment inline.:

On 2020/04/29 15:55, Eli Zaretskii wrote:
From: "ISHIKAWA,chiaki" <ishikawa@yk.rim.or.jp>
Date: Wed, 29 Apr 2020 06:36:46 +0900
Cc: chiaki-ishikawa-thunderbird-account <chiaki.ishikawa@ubin.jp>,
  Noam Postavsky <npostavs@gmail.com>

I am back to this strange guru meditation of emacs.
Emacs would eventually begin working, but the silent period where no key
or mouse event is acknowledged is rather longish.

Several weeks after I installed GNU Emacs 28.0.50 (that is what
|emacs-version| variable says. I installed from git repo.) , I began
noticing the occasional long pause (no response) again.
I suggest to begin by establishing whether these pauses are caused by
GC.  To that end, set garbage-collection-messages to a non-nil value,
and see if Emacs displays a message about GC in progress when these
pauses occur.  Also, try to tell how frequent are these pauses and
much time each pause takes.

If you see that every time you experience a pause there's GC running,
we can then proceed to investigating whether the frequency of GC and
the time it takes on your system is something abnormal, and why.
I have set garbage-collection-messages to t.

Let me see what happens when the next apparent long guru meditation happens.

Since the last message, I did not get a very long hung  until this morning. Maybe 5-10 second long GC only. Those I can tolerate. I feel "what?" after maybe a five second hung and then realize something is holding up emacs, and then the hung is over. I tell myself GC was rather long, but that is it.

But this morning I get a rather long hung, or GC, twice.

Attaching gdb to the emacs process then after a 10 second or so of the first hung, and noticed that sweep_cons() took three seconds to finish after I entered "finish" command to gdb. That certainly felt long during gdb session.

So that was an initial clue that something is wrong.

(Now I suspect some kind of memory area corruption. Before this happened I loaded a several large log files each about 20MB, and when I tried to erase a rather large *shell* buffer that contained the log from a large compilation job, this garbage collection kicked in.)

Anyway, I checked and found gc-elapsed showed a number in 400-500 seconds range. I really wished I wrote down the exact number.



So my take on this is sweep_conses() takes a rather long time and not
letting emacs to take any input. (And there could be other functions
before it.)
The question is how long is "long time".  We need some quantitative
measure of this before we can tell whether it's normal or not.
I agree. It  is all relative.

I agree "long" is a relative term.

But after the somewhat long pause of maybe 20-30 seconds, I get another really long pause which prompted me to write this mail.

The second long hung this morning was maybe longer than 8 minutes give or take a minute or so.

I stopped emacs  a few times after the initial five minute or so of the long hung to check where it spends time occasionally using gdb. But it definitely is close to 8 minutes. I would have killed emacs and start over if I had not learned about gc-elapsed and other internal variables.

After GC finally finished and control returned to keyboard, I checked gc-elapsed.
It now shows:
gc-elapsed is a variable defined in ‘C source code’.
Its value is 1067.998007492

It matches my observation that gc took about 8 minutes give or take. (Of course, there were a few times I interrupted the operation from Control-C from GDB and even captured a screen shot or two. But that was less than two minutes in total.)


Question.: Has there been any change in emacs's alloc.c code in the
last year or two?
Yes, a lot.  But I don't see any significant change in frequency or
time duration of GC on my system due to those changes, FWIW.

Maybe I should take a look when my time permits.


I am running emacs under Debian GNU/linux inside VirtualBox which is
assigned 16GB memory.
It is possible that the virtual machine's memory management is
misconfigured, so it slows down Emacs when it needs to access a large
portion of its address space (which happens during GC).

I have not touched any thing (famous last words) for my linux configuration for the last few years aside from simply upgrading the packages from time to time.


gc-elapsed is a variable defined in ‘C source code’.
Its value is 68.3795433
The question is how many GC cycles took 68 seconds.  What is the value
of gcs-done?  Also, is this in a session to which you attached GDB? if
so, then some of the elapsed time was due to Emacs being stopped under
the debugger.  We need these values in a session that is never stopped
by a debugger.


I will monitor gcs-done when the next guru meditation happens.

Is there a hook to garbage-collection so that I can print gcs-done and gc-elapsed maybe after the garbage-collection is run?

I failed to check gcs-done. It shows 1032 after the fact.

But given that I monitored gc using gdb and it took almost 8 minutes before successive "finish" commands to gdb (after sweep_floats(), etc.) to come up to the interactive functions, I am quite sure that one GC-related entry point used approximately 8 min to finish.
And I think that entry is maybe_gc() or maybe_garbage_collect().

(That is why I suspected some type of memory corruption software-wise. My PC has ECC enabled and so it is rather unlikely memory got corrupted spontaneously.)




PS: Given that this may have something to do with the paging activity
of the emacs process, the linux version change (from 4 to 5) may have
affected the paging behavior of the program, but it is hard to say.  I
now see this linux instance is using 293M SWAP.  Compiling mozilla
thunderbird source runs |make -j6| and g++-9 compiles CPP source files
which are very large and so they exert heavy memory pressure. A bad
memory access pattern of emacs, not friendly to locality of access,
would add to page faults. Except, |xosview| program does not show any
paging activity. I am not sure if |xosview| is correct or not.
Building a large program with -j6 also uses CPU, and that could make
CPU-intensive operations, such as GC, slow.  Was that build running in
the same VirtualBox as Emacs?  If so, how many physical CPU execution
units do you have allocated to VirtualBox?

Yes, the build was running in the same VirtualBox as Emacs.
I have assigned 7 cores to VirtualBox.
(Oh wait!!! My current VirtualBox says it is assigned only 6 cores!?
I thought I did assign 7 cores to this VirtualBox image.
My PC runs Windows 10 natively on AMD Ryzen 7 Eight-Core processor that has additional 8 hyperthreads.
So there are 16 virtual cores.
I have no idea why only 6 core is assigned to VM image today.
The number has been tweaked when an earlier VBox had an unresolved issue when AMD Ryzen CPU came out.
[I used to run this image on XEON CPU.]
Maybe this could explain the problem on my HOME PC.However, the same guru meditation has been observed on office PC with a different Intel CPU, too. Again, Emacs is run inside virtualBox on that PC that runs Windows 10 professional natively.))

However, the important point is this. I have not noticed this strange guru meditation until about 12 months ago or so. That is why I suspect that there could be an issue with the recent alloc.c or something, but I think I need to gather more data
to focus in it yet.

Stay tuned.

Thank you again for your help. I will try to gather more circumstantial evidence so that we can make an educated guess why Emacs seems to enter guru meditation on my PC.

Chiaki


Since last message, I have re-assigned 7 CPUs to my virtual box image.

At the time of long hung this morning, I have no largish build or anything. I was only editing a message file or something. I wanted to clear *shell* buffer with "erase-buffer" when this series of long hungs hit emacs.

I am attaching two screen dumps.
sweep-conses shows that emacs is spending time in sweep-conses after I typed "finish" command to gdb. Emacs process in the background did not respond to X repaint event and thus only a portion of its screen, most notably echo area is partially shown.
The window on the right is xosview system resource monitor program.
That shows CPP 2, 3, and 4 are used 100% successively (and that is emacs).
The durration of time covered by the horizontal length of the montor window is about 15 seconds. Emacs spent time almost like 8 minutes (I need to subtract a minute or two for gdb interaction)
like this. Just churning CPU and nothing else.

memory-consolidate-1.PNG is interesting in that this was the first time I saw something two level below sweep_conses().
A routine malloc_consolidate() is visible.
Maybe the memory consolidation may take time.

A repeated question, which I think has No answer is as follows.:
Is there a hook that gets executed just before GC happens?

Now I think the answer is "No" because GC is invoked when the available conses are few and we don't want to run arbitrary lisp program whch may lead to another GC invocation.

Short of such feature, is there a post hook that gets executed AFTER GC finishes (well in the stacktrace I see maybe_gc() called from call2(), and so
the hook may have to be placedon maybe_gc() routine if possible.

I am asking the questions about the hook(s) since I want to record some statistics about memory usage, etc. automatically before/after GC so that we can have a scientific/measurable data to discuss what is going on. If we can have a set of such pre/post data for the last few dozen GC invocations, we may be able to home into the cause of the program. Short of lisp function, maybe I can add message print out in alloc.c to print out such information so that it gets recorded in *Message* buffer. It is done only when gabage-collection-message is t and there should be some throttling mechanism to avoid spamming the user.


I now feel both alloc.c and malloc.c may be to blame, most probably in terms of locality of references that may interfere badly with CPU cache and linux memory system. However, do not there does not seem to be much paging activity AND more than 4 GB real memory is more or less available  still. That is visible in xosview's screen in sweep-conses.png. That lead to me wonder if there is algorithmic slow down due to non-linear execution time of memory size? But again, this is hard to believe in this age of home users having GB of memory. Such algorithmic lapse can't remain undetected for so long.

Hmm, that makes me to throw the net on linux AND virtualbox which I use to run linux on my home Windows PC. I have 32 GB of memory on my home PC and assign 16 GB of memory to VirtualBox image. Given that I still have 2 GB of free memory on Window side right now (I just checked with system performance monitor), all the memory references during GC ought to go into the real memory without paging on the Window side even, .i.e., the virtual memory is mapped straight to the real memory on the PC.

OR, do I feel lucky and compile emacs with GCC-9's -O3 flag and see if it speeds up emacs? To be honest, I always prefer correctness rather than speed, but 6 minutes GC is not something I can put up with in an interactive program. Come to think of it if Emacs is going multi-threaded (I mean Emacs-Lisp interpreter) then low-level GC can become incremental opportunistic non-blocking GC?

Any pointer re the hook(s) short of rewriting alloc.c is appreciated. Well, if worst comes to worst, I don't mind rewriting alloc.c to create an array of fixed size to store the last dozen or so GC-related information that is updated before and after invocation of a certain GC entry point.
I can print that out after a long GC to learn the memory situation.
But that sounds a bit drastic. OTOH, it may be the only way to go.
But the last time I worked on alloc.c to debug it to run on non-UNIX platform was close to 30 years ago. :-)

Thank you in advance for your attention on this matter.

Chiaki


Attachment: malloc-consolidate-1.PNG
Description: PNG image

Attachment: sweep-conses.png
Description: PNG image


reply via email to

[Prev in Thread] Current Thread [Next in Thread]