bug#39413: 26.2; Emacs gets hung

bug-gnu-emacs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#39413: 26.2; Emacs gets hung

From:	chiaki-ishikawa-thunderbird-account
Subject:	bug#39413: 26.2; Emacs gets hung
Date:	Wed, 20 May 2020 13:31:29 +0900
User-agent:	Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0

Hi,

I am back.

I have observed another long hung.

This time, I did verify that it is during GC.

The hung lasted close to 8 minutes give or take.

My comment inline.

On 2020/04/30 11:45, chiaki-ishikawa-thunderbird-account wrote:

Hi,

Thank you for your tips.

My comment inline.:

On 2020/04/29 15:55, Eli Zaretskii wrote:

From: "ISHIKAWA,chiaki" <ishikawa@yk.rim.or.jp>
Date: Wed, 29 Apr 2020 06:36:46 +0900
Cc: chiaki-ishikawa-thunderbird-account <chiaki.ishikawa@ubin.jp>,
  Noam Postavsky <npostavs@gmail.com>

I am back to this strange guru meditation of emacs.

Emacs would eventually begin working, but the silent period where nokey

or mouse event is acknowledged is rather longish.

Several weeks after I installed GNU Emacs 28.0.50 (that is what
|emacs-version| variable says. I installed from git repo.) , I began
noticing the occasional long pause (no response) again.

I suggest to begin by establishing whether these pauses are caused by
GC.  To that end, set garbage-collection-messages to a non-nil value,
and see if Emacs displays a message about GC in progress when these
pauses occur.  Also, try to tell how frequent are these pauses and
much time each pause takes.

If you see that every time you experience a pause there's GC running,
we can then proceed to investigating whether the frequency of GC and
the time it takes on your system is something abnormal, and why.

I have set garbage-collection-messages to t.

Let me see what happens when the next apparent long guru meditationhappens.

Since the last message, I did not get a very long hung until thismorning. Maybe 5-10 second long GC only. Those I can tolerate.I feel "what?" after maybe a five second hung and then realize somethingis holding up emacs, and then the hung is over. I tell myself GC wasrather long, but that is it.


But this morning I get a rather long hung, or GC, twice.

Attaching gdb to the emacs process then after a 10 second or so of thefirst hung, and noticed that sweep_cons() took three seconds to finishafter I entered "finish" command to gdb. That certainly felt long duringgdb session.


So that was an initial clue that something is wrong.

(Now I suspect some kind of memory area corruption. Before this happenedI loaded a several large log files each about 20MB, and when I tried toerase a rather large *shell* buffer that contained the log from a largecompilation job, this garbage collection kicked in.)

Anyway, I checked and found gc-elapsed showed a number in 400-500seconds range. I really wished I wrote down the exact number.

So my take on this is sweep_conses() takes a rather long time and not
letting emacs to take any input. (And there could be other functions
before it.)

The question is how long is "long time".  We need some quantitative
measure of this before we can tell whether it's normal or not.

I agree. It  is all relative.

I agree "long" is a relative term.

But after the somewhat long pause of maybe 20-30 seconds, I get anotherreally long pause which prompted me to write this mail.

The second long hung this morning was maybe longer than 8 minutes giveor take a minute or so.

I stopped emacs a few times after the initial five minute or so of thelong hung to check where it spends time occasionally using gdb. But itdefinitely is close to 8 minutes. I would have killed emacs and startover if I had not learned about gc-elapsed and other internal variables.

After GC finally finished and control returned to keyboard, I checkedgc-elapsed.

It now shows:
gc-elapsed is a variable defined in ‘C source code’.
Its value is 1067.998007492

It matches my observation that gc took about 8 minutes give or take. (Ofcourse, there were a few times I interrupted the operation fromControl-C from GDB and even captured a screen shot or two. But that wasless than two minutes in total.)

Question.: Has there been any change in emacs's alloc.c code in the
last year or two?

Yes, a lot.  But I don't see any significant change in frequency or
time duration of GC on my system due to those changes, FWIW.


Maybe I should take a look when my time permits.

I am running emacs under Debian GNU/linux inside VirtualBox which is
assigned 16GB memory.

It is possible that the virtual machine's memory management is
misconfigured, so it slows down Emacs when it needs to access a large
portion of its address space (which happens during GC).

I have not touched any thing (famous last words) for my linuxconfiguration for the last few years aside from simply upgrading thepackages from time to time.

gc-elapsed is a variable defined in ‘C source code’.
Its value is 68.3795433

The question is how many GC cycles took 68 seconds.  What is the value
of gcs-done?  Also, is this in a session to which you attached GDB? if
so, then some of the elapsed time was due to Emacs being stopped under
the debugger.  We need these values in a session that is never stopped
by a debugger.


I will monitor gcs-done when the next guru meditation happens.

Is there a hook to garbage-collection so that I can print gcs-done andgc-elapsed maybe after the garbage-collection is run?


I failed to check gcs-done. It shows 1032 after the fact.

But given that I monitored gc using gdb and it took almost 8 minutesbefore successive "finish" commands to gdb (after sweep_floats(), etc.)to come up to the interactive functions, I am quite sure that oneGC-related entry point used approximately 8 min to finish.

And I think that entry is maybe_gc() or maybe_garbage_collect().

(That is why I suspected some type of memory corruption software-wise.My PC has ECC enabled and so it is rather unlikely memory got corruptedspontaneously.)

PS: Given that this may have something to do with the paging activity
of the emacs process, the linux version change (from 4 to 5) may have
affected the paging behavior of the program, but it is hard to say.  I
now see this linux instance is using 293M SWAP.  Compiling mozilla
thunderbird source runs |make -j6| and g++-9 compiles CPP source files
which are very large and so they exert heavy memory pressure. A bad
memory access pattern of emacs, not friendly to locality of access,
would add to page faults. Except, |xosview| program does not show any
paging activity. I am not sure if |xosview| is correct or not.
Building a large program with -j6 also uses CPU, and that could make
CPU-intensive operations, such as GC, slow.  Was that build running in
the same VirtualBox as Emacs?  If so, how many physical CPU execution
units do you have allocated to VirtualBox?
Yes, the build was running in the same VirtualBox as Emacs.
I have assigned 7 cores to VirtualBox.
(Oh wait!!! My current VirtualBox says it is assigned only 6 cores!?
I thought I did assign 7 cores to this VirtualBox image.
My PC runs Windows 10 natively on AMD Ryzen 7 Eight-Core processorthat has additional 8 hyperthreads.
So there are 16 virtual cores.
I have no idea why only 6 core is assigned to VM image today.
The number has been tweaked when an earlier VBox had an unresolvedissue when AMD Ryzen CPU came out.
[I used to run this image on XEON CPU.]
Maybe this could explain the problem on my HOME PC.However, the sameguru meditation has been observed on office PC with a different IntelCPU, too.Again, Emacs is run inside virtualBox on that PC that runs Windows 10professional natively.))
However, the important point is this. I have not noticed this strangeguru meditation until about 12 months ago or so.That is why I suspect that there could be an issue with the recentalloc.c or something, but I think I need to gather more data
to focus in it yet.

Stay tuned.
Thank you again for your help. I will try to gather morecircumstantial evidence so that we can make an educated guess whyEmacs seems to enter guru meditation on my PC.
Chiaki

Since last message, I have re-assigned 7 CPUs to my virtual box image.

At the time of long hung this morning, I have no largish build oranything. I was only editing a message file or something.I wanted to clear *shell* buffer with "erase-buffer" when this series oflong hungs hit emacs.


I am attaching two screen dumps.

sweep-conses shows that emacs is spending time in sweep-conses after Ityped "finish" command to gdb.Emacs process in the background did not respond to X repaint event andthus only a portion of its screen, most notably echo area is partiallyshown.

The window on the right is xosview system resource monitor program.
That shows CPP 2, 3, and 4 are used 100% successively (and that is emacs).

The durration of time covered by the horizontal length of the montorwindow is about 15 seconds.Emacs spent time almost like 8 minutes (I need to subtract a minute ortwo for gdb interaction)

like this. Just churning CPU and nothing else.

memory-consolidate-1.PNG is interesting in that this was the first timeI saw something two level below sweep_conses().

A routine malloc_consolidate() is visible.
Maybe the memory consolidation may take time.

A repeated question, which I think has No answer is as follows.:
Is there a hook that gets executed just before GC happens?

Now I think the answer is "No" because GC is invoked when the availableconses are few and we don't want to run arbitrary lisp program whch maylead to another GC invocation.

Short of such feature, is there a post hook that gets executed AFTER GCfinishes (well in the stacktrace I see maybe_gc() called from call2(),and so

the hook may have to be placedon maybe_gc() routine if possible.

I am asking the questions about the hook(s) since I want to record somestatistics about memory usage, etc. automaticallybefore/after GC so that we can have a scientific/measurable data todiscuss what is going on.If we can have a set of such pre/post data for the last few dozen GCinvocations, we may be able to home into the cause of the program.Short of lisp function, maybe I can add message print out in alloc.c toprint out such information so that it gets recorded in *Message* buffer.It is done only when gabage-collection-message is t and there should besome throttling mechanism to avoid spamming the user.

I now feel both alloc.c and malloc.c may be to blame, most probably interms of locality of references that may interfere badly with CPU cacheand linux memory system. However, do not there does not seem to be muchpaging activity AND more than 4 GB real memory is more or lessavailable still. That is visible in xosview's screen insweep-conses.png. That lead to me wonder if there is algorithmic slowdown due to non-linear execution time of memory size? But again, this ishard to believe in this age of home users having GB of memory. Suchalgorithmic lapse can't remain undetected for so long.

Hmm, that makes me to throw the net on linux AND virtualbox which I useto run linux on my home Windows PC.I have 32 GB of memory on my home PC and assign 16 GB of memory toVirtualBox image.Given that I still have 2 GB of free memory on Window side right now (Ijust checked with system performance monitor),all the memory references during GC ought to go into the real memorywithout paging on the Window side even, .i.e., the virtual memory ismapped straight to the real memory on the PC.

OR, do I feel lucky and compile emacs with GCC-9's -O3 flag and see ifit speeds up emacs?To be honest, I always prefer correctness rather than speed, but 6minutes GC is not something I can put up with in an interactive program.Come to think of it if Emacs is going multi-threaded (I mean Emacs-Lispinterpreter) then low-level GC can become incremental opportunisticnon-blocking GC?

Any pointer re the hook(s) short of rewriting alloc.c is appreciated.Well, if worst comes to worst, I don't mind rewriting alloc.c to createan array of fixed size to store the last dozen or so GC-relatedinformation that is updated before and after invocation of a certain GCentry point.

I can print that out after a long GC to learn the memory situation.
But that sounds a bit drastic. OTOH, it may be the only way to go.

But the last time I worked on alloc.c to debug it to run on non-UNIXplatform was close to 30 years ago. :-)


Thank you in advance for your attention on this matter.

Chiaki

malloc-consolidate-1.PNG
Description: PNG image

sweep-conses.png
Description: PNG image

[Prev in Thread]

Current Thread

[Next in Thread]

bug#39413: 26.2; Emacs gets hung, chiaki-ishikawa-thunderbird-account <=
- bug#39413: 26.2; Emacs gets hung, Noam Postavsky, 2020/05/25
  - bug#39413: 26.2; Emacs gets hung, chiaki-ishikawa-thunderbird-account, 2020/05/25

Prev by Date: bug#41386: 28.0.50; Gnus nnimap OAuth 2.0 support
Next by Date: bug#41373: GNU ELPA: Add "reporting bugs" to individual package pages
Previous by thread: bug#41414: 28.0.50; `find-file-noselect' + `file-truename' gotcha
Next by thread: bug#39413: 26.2; Emacs gets hung
Index(es):
- Date
- Thread