octave-bug-tracker
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Octave-bug-tracker] [bug #59277] [octave-forge](io) xls2oct is slow whe


From: Philip Nienhuis
Subject: [Octave-bug-tracker] [bug #59277] [octave-forge](io) xls2oct is slow when a spreadsheet contains many text cells
Date: Sat, 24 Oct 2020 10:32:13 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0

Update of bug #59277 (project octave):

                 Summary: xls2oct and/or openxls behave unexpected =>
[octave-forge](io) xls2oct is slow when a spreadsheet contains many text cells

    _______________________________________________________

Follow-up Comment #21:

A progress report:

First, I've adapted the title to the actual issue you entered this bug report
for.

Then, again working with the code I remembered the way  __OCT_xlsx2oct__.m  is
supposed to work (it's quite some time ago for me).

Most of the time is probably spent in the regexp call in L.99; that one
actually parses all individual cells and extracts cell addresses, content
types and values. For sharedString cells those "values" are actually pointers
into the sharedStrings XML (0-based).
In 2017 Markus and I have spent quite some time figuring out how to make that
regexp fast. I think there's not much to gain there anymore.

The code in section 2.B, ~L.139-151 where you worked on, merely parses the
sharedStrings XML. As the name implies it holds text cell contents from the
entire spreadsheet. That means that if the spreadsheet contains many different
text cells, considerable time will be spent there. For most spreadsheets out
there in the wild that's no big deal as they contain much more numerical than
text data. But you are apparently in the niche where it *does* count.
You mentioned in comment #4 that presence of extra worksheets increases the
time spent here; that fits perfectly with the sharedStrings XML containing the
strings of the entire spreadsheet rather than just one worksheet.
Nothing can be done about it; it's just the way OOXML spreadsheets stick
together.
Looking a bit further and after some experimenting I think there's no way the
remaining for loop in L.150-155 can be vectorized.
All in all I think the speed increase you obtained (good work BTW) is about
all we can get for the moment.

Finally, the individual sharedStrings values parsed in section 2.B are
assigned to the "val" array where they replace the pointers mentioned above.
That is merely assignment and is usually very fast.

There are additional steps that take time, e.g. figuring out column and row
numbers from cell addresses and copying "val" contents into the proper
locations in the output cellstr array. All those calls have also been
vectorized as far as possible.

I started thinking about caching the parsed sharedStrings data, but soon I
remembered I thought about that before :-) An obstacle is that if even only
one text cell anywhere in the spreadsheet is changed the entire sharedStrings
data is invalidated and needs to be reparsed/rebuilt. I suppose there are ways
to do that and/or circumvent or mitigate it but:
0 It'll add quite a bit of complexity, and the OCT interface already is a
beast;
0 AFAICS it'll only benefit a few use cases;
0 I simply have other priorities,
so I'm not going to spend much more time on it now.

I'm always available for advice if you want to try that route.


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?59277>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/




reply via email to

[Prev in Thread] Current Thread [Next in Thread]