[Octave-bug-tracker] [bug #51512] of-io: Missing or wrong types when usi

From:	Markus Mützel
Subject:	[Octave-bug-tracker] [bug #51512] of-io: Missing or wrong types when using xlsread with OCT interface
Date:	Wed, 2 Aug 2017 16:48:26 -0400 (EDT)
User-agent:	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0

Follow-up Comment #16, bug #51512 (project octave):

It definitely is an improvement if more data can be fetched. And you could
reduce code complexity at the same time. Excellent.

I had problems reading a big file with approximately 1000x1000 cells saved
with Excel 2016 that contains doubles, Booleans, and strings (no dates) as
values and as formula results. (That file did not upload last time, probably
because of the file size limit: the file is 8 MB and the XML string is 44.2 MB.)
With Octave 4.2.1 on Windows, it gave the following warning and then caused a
segfault:

warning: your pattern caused PCRE to hit its MATCH_LIMIT; trying harder now,
but this will be slow
warning: called from
    __OCT_xls2oct__ at line 141 column 11
    xls2oct at line 201 column 27


The help string for "regexp" was updated recently by Dan Sebald regarding this
issue.

Additionally, a formula can span several rows (a shared formula). I didn't
realize that until now. In this case, a cell can be described like this in XML:

<c r="I3" t="str"><f t="shared" si="554"/><v>3</v></c>


Note that the f-tag has no content. That isn't yet matched by the regexp in
the file from comment #13.
I tried with the following regexp in the failing line:

    valf2 = cell2mat (regexp (rawdata, ...
      ['<c r="(\w+)"[^>]*(?: t="str")[^>]*>' ...
       '(?:(?:<f[^>]*>[^<]*</f>)|(?:<f[^/>]*/>))' ...
       '(?:<v[^>]*>([^<]*)</v>)?'], "tokens"));


I don't know whether this is the most efficient regular expression. But it
doesn't segfault for me and matches all strings.
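
A minimal sketch of how the pattern above can be checked on a small input. The
two sample cells are made up for illustration: the first has formula text inside
the f-tag (the "UPPER(A2)" formula is hypothetical), the second is the
self-closing shared-formula cell quoted earlier. The sketch uses vertcat instead
of cell2mat to get one row per matched cell; that is an assumption made for
readability, not what the io package does.

```octave
% Sample XML fragment with both f-tag forms (illustrative data only).
rawdata = ['<c r="A1" t="str"><f>UPPER(A2)</f><v>AB</v></c>' ...
           '<c r="I3" t="str"><f t="shared" si="554"/><v>3</v></c>'];

% Same pattern as in the message, split into pieces for readability.
pat = ['<c r="(\w+)"[^>]*(?: t="str")[^>]*>' ...
       '(?:(?:<f[^>]*>[^<]*</f>)|(?:<f[^/>]*/>))' ...
       '(?:<v[^>]*>([^<]*)</v>)?'];

tok = regexp (rawdata, pat, "tokens");

% One row per matched cell: {cell address, cached value}.
cells = vertcat (tok{:});
for k = 1:rows (cells)
  printf ("%s -> %s\n", cells{k,1}, cells{k,2});
endfor
```

With these two sample cells, both the cell with formula text and the cell with
the empty f-tag should be matched, giving the pairs A1/AB and I3/3.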
Performance-wise:

>> profile clear
>> profile on
>> tic; [~, ~, raw] = xlsread ("Excel2013_1001x999.xlsx", 1, "", "oct"); toc
Elapsed time is 100.848 seconds.
>> profile off
>> profshow

   #         Function Attr     Time (s)   Time (%)        Calls
---------------------------------------------------------------
  54           regexp            68.435      68.35         1128
  62       str2double            10.411      10.40            8
  60              cat             4.931       4.93         1132
  57          cellfun             3.709       3.70        10083
  77          col2num             3.158       3.15      1999998
  70 __OCT_xlsx2oct__             2.993       2.99            1
  67          xls2oct             1.312       1.31            1
  47           system             1.209       1.21            1
  81           strrep             0.995       0.99            5
  55         cell2mat             0.845       0.84         2239
  71         num2cell             0.807       0.81            5
  74            clear             0.555       0.55            6
   5           ischar             0.333       0.33      1000022
  66        postfix '             0.083       0.08            9
  82        parsecell             0.079       0.08            1
  19             cell             0.068       0.07            6
  49            fread             0.066       0.07            4
   2          xlsread             0.044       0.04            1
   7        binary ==             0.012       0.01         6720
   6         prefix !             0.010       0.01        10145


Compared to "COM" (Excel 2016):

>> tic; [~, ~, raw] = xlsread ("Excel2013_1001x999.xlsx", 1, "", "com"); toc
Checking requested interface(s):
COM*;
Elapsed time is 6.22531 seconds.


Excel (via COM) is still 16x faster for me. But you should probably compare
with the same file on the same system to get comparable results. I can send
you the file I used for testing by email if you would like to test with that
one.
My system: Windows 10 Home 1703, Core i7 7500U @ 2.7 GHz, 8 GB RAM, Octave
installed on an SSD but reading the test file from an HDD.

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?51512>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
