[Octave-bug-tracker] [bug #51512] of-io: Missing or wrong types when usi
From: |
Markus Mützel |
Subject: |
[Octave-bug-tracker] [bug #51512] of-io: Missing or wrong types when using xlsread with OCT interface |
Date: |
Wed, 2 Aug 2017 16:48:26 -0400 (EDT) |
User-agent: |
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0 |
Follow-up Comment #16, bug #51512 (project octave):
It definitely is an improvement if more data can be fetched. And you could
reduce code complexity at the same time. Excellent.
I had problems reading a big file with approximately 1000x1000 cells saved
with Excel 2016 that contains doubles, Booleans, and strings (no dates) as
values and as formula results. (That file did not upload last time, probably
because of file size limits: the file is 8 MB and the XML string is 44.2 MB.)
With Octave 4.2.1 on Windows, it gave the following warning and then caused a
segfault:
warning: your pattern caused PCRE to hit its MATCH_LIMIT; trying harder now,
but this will be slow
warning: called from
__OCT_xls2oct__ at line 141 column 11
xls2oct at line 201 column 27
The help string for "regexp" was updated recently by Dan Sebald regarding this
issue.
Additionally, a formula can span several rows (a shared formula). I didn't
realize that until now. In this case, a cell could be described like this in
XML:
<c r="I3" t="str"><f t="shared" si="554"/><v>3</v></c>
Note that the <f> tag is self-closing and has no content. The regexp in the
file from comment #13 doesn't match that yet.
I tried with the following regexp in the failing line:
valf2 = cell2mat (regexp (rawdata, ...
  ['<c r="(\w+)"[^>]*(?: t="str")[^>]*>' ...
   '(?:(?:<f[^>]*>[^<]*</f>)|(?:<f[^/>]*/>))' ...
   '(?:<v[^>]*>([^<]*)</v>)?'], "tokens"));
I don't know whether this is the most efficient regular expression. But it
doesn't segfault for me and matches all strings.
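To illustrate the difference, here is a minimal sketch on two made-up cells
(the sample XML, the variable names and the simplified "content only" pattern
are my own assumptions for illustration; the latter is not the actual regexp
from comment #13). It only checks that the extended pattern also picks up the
cell with the empty <f> tag:

```octave
% Two hypothetical cells: one with formula text, one with an empty shared <f/> tag.
rawdata = ['<c r="I2" t="str"><f t="shared" ref="I2:I3" si="554">A2&amp;B2</f><v>2</v></c>' ...
           '<c r="I3" t="str"><f t="shared" si="554"/><v>3</v></c>'];

% Simplified stand-in for a pattern that requires <f>...</f> with a closing tag.
pat_old = ['<c r="(\w+)"[^>]*(?: t="str")[^>]*>' ...
           '<f[^>]*>[^<]*</f>' ...
           '(?:<v[^>]*>([^<]*)</v>)?'];

% Pattern from this comment: accepts <f>...</f> as well as a self-closing <f .../>.
pat_new = ['<c r="(\w+)"[^>]*(?: t="str")[^>]*>' ...
           '(?:(?:<f[^>]*>[^<]*</f>)|(?:<f[^/>]*/>))' ...
           '(?:<v[^>]*>([^<]*)</v>)?'];

tok_old = regexp (rawdata, pat_old, "tokens");
tok_new = regexp (rawdata, pat_new, "tokens");

% One row per matched cell: {cell address, cached value}.
cells = vertcat (tok_new{:});

printf ("stand-in pattern matched %d cell(s), extended pattern %d\n", ...
        numel (tok_old), numel (tok_new));
c = cells';
printf ("%s -> %s\n", c{:});
```

On such a small string the cost is negligible, so this says nothing about
whether the alternation is the cheapest way to do it on the 44 MB string.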
Performance-wise:
>> profile clear
>> profile on
>> tic; [~, ~, raw] = xlsread ("Excel2013_1001x999.xlsx", 1, "", "oct"); toc
Elapsed time is 100.848 seconds.
>> profile off
>> profshow
   #         Function Attr   Time (s)   Time (%)     Calls
----------------------------------------------------------
  54           regexp          68.435      68.35      1128
  62       str2double          10.411      10.40         8
  60              cat           4.931       4.93      1132
  57          cellfun           3.709       3.70     10083
  77          col2num           3.158       3.15   1999998
  70 __OCT_xlsx2oct__           2.993       2.99         1
  67          xls2oct           1.312       1.31         1
  47           system           1.209       1.21         1
  81           strrep           0.995       0.99         5
  55         cell2mat           0.845       0.84      2239
  71         num2cell           0.807       0.81         5
  74            clear           0.555       0.55         6
   5           ischar           0.333       0.33   1000022
  66        postfix '           0.083       0.08         9
  82        parsecell           0.079       0.08         1
  19             cell           0.068       0.07         6
  49            fread           0.066       0.07         4
   2          xlsread           0.044       0.04         1
   7        binary ==           0.012       0.01      6720
   6         prefix !           0.010       0.01     10145
Compared to "COM" (Excel 2016):
>> tic; [~, ~, raw] = xlsread ("Excel2013_1001x999.xlsx", 1, "", "com"); toc
Checking requested interface(s):
COM*;
Elapsed time is 6.22531 seconds.
Excel (via COM) is still about 16 times faster for me (6.2 s versus 100.8 s).
But you should probably compare with the same file on the same system to get
comparable results. I can send you the file I used for testing by email if you
would like to test with that one.
My system: Windows 10 Home 1703, Core i7 7500U @ 2.7 GHz, 8 GB RAM, Octave
installed on an SSD but reading the test file from an HDD.
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?51512>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/