Re: Test files [WAS: Re: xlsread in Octave 3.6.4]

Markus,

Before answering, just a quick question:
What mailer do you use?

just ugly googlemail webinterface.

- Its indentation seems tabified and (IMO) screws up readibility;

Thank you google :D Maybe i've change my Mailadress for the list.

- Could you please strip my email address from your replies? I know it is easily reconstructed, but to just shout it all over the place....
Thanks

i'll try. sorry.

Read on...

Markus Bergholz wrote:

On Sun, Sep 15, 2013 at 9:56 PM, Philip Nienhuis wrote

:
<snip>
:
This weekend I'll sent you a few test Excel files with empty
row/columns, merged cells, dates, times, booleans, formulas,
erroneous
formulas and -consequently- error values in the cached value
attributes,
etc.
Attached. I've combined a few test spreadsheets I had lying around.

I used most of these for exploring ranges of occupied cells (which
in OOXML turns out to be extremely easy, the range is in one of the
first nodes of the worksheets) and cell value types. You'll find a
few spreadsheet formulas which -intentionally- contain errors.

Good luck with it,

<snip>

>
> * Mergedranges.xlsx
> looks nearly the same as with matlab R2010 at work. One difference is
> Matlab start read in column J. My xlsxread start at A as defined in the
> sheet. So i guess matlab ignore the real defined size.
> the values are the same. (as far as i quickly see).

Sorry, Matlab is actually right.

Yes and no.

# Matlab 2013a

>> tic, a=xlsread('testOOXML.xlsx'); toc

Elapsed time is 0.453576 seconds.

>> a(:,1)

ans =

1.0000

NaN

4.2500

NaN

>> size(a)

ans =

8 29

>> a(isnan(a))=0; sum(sum(a))

ans =

21.5277

# Octave

octave:16> tic, a=xlsxread('../testOOXML.xlsx'); toc

Elapsed time is 0.1616 seconds.

octave:17> a(:,1)

ans =

NaN

1.0000

NaN

4.2500

NaN

octave:18> size(a)

ans =

15 29

octave:19> a(isnan(a))=0; sum(sum(a))

ans = 21.528

octave:20>

According to sheet1.xml <dimension ref="B1:AD15"/>:

As you see, Matlab follows the beginning of the sheet at B1 (Octave style (1,2)). Both (Matlab and xlsxread) do the same here.

And AD15: AD=30 columns, but minus 1 because we start at B. So we should have 29 columns and 15 rows.

My xlsxread follows this - Matlab do not. The values are the same as you can see.

But this is the part i've implemented to the best of my logical belief without testing before how Matlab is handling this.

That's what i've mean with

> (get rid of|change) line 102-109+122

That are the lines where i've implemented this behavior. But this Matlab error should be easy to backport to xlsxread.

I deliberately merged the ten "topleftmost" rows and columns into one *empty* cell. So your xlsxread.m fails at that hurdle. Oopsie...

Here is what xlsfinfo.m (in Octave's OF-io package, using LibreOffice ("UNO") interface) says:

=============================================================
>> xlsfinfo ("Mergedranges.xlsx", "uno")
1: Sheet1 (Used range ~ J1:AMJ15)
ans = Microsoft Excel Spreadsheet
============================================================

eh dafuq m$ my ass!

Now, in the expanded xml (expanded using 7-zip) I see:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
- <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<dimension ref="A1:AMJ15" />

Hey.....?? ^^^^^^^^^^^^^^^^^^^ ??

<v>

1.0011000000000001</v>

</c>

So foldered cells start with s="2" and ends with s="3".

This needs one more balancing act^^

>
> * testOOXML.xlsx
> same for this file.
> the values are the same.

Octave (io package) does:
=============================================================
>> xlsfinfo ("testOOXML.xlsx", "uno")
1: FirstSheet (Used range ~ B2:AD9)
2: ThirdSheet (Used range ~ A4:AMJ39)
3: Sheet#5 (Used range ~ F2:Q8)
4: Sheet5 (Used range ~ B2:F14)
5: Sixth (Used range ~ A1:B3)
6: 7th (Used range ~ B2:K9)
7: Sheet_nr_3 (Used range ~ A2:M12)
ans = Microsoft Excel Spreadsheet
>>
=============================================================

...and that exactly matches the worksheets in question.

In the xml itself (again, expanded using 7-zip) I see:

testOOXML.xlsx, for example 4th worksheet ("Sheet5"):
---------------
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
- <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<dimension ref="B1:F65536" />

==> Wrong again. Should be B2:F14

I think this points to a more general issue/problem/bug with MS-Excel: it actually hands you the range of formatted cells (incl. empty ones!), rather than the range of cells with actual data contents.
So detecting the occupied cell range isn't quite as easy as I hoped it would be; it appears one cannot avoid scanning the entire worksheet :-(

Until now i've figured out 3 variety how to read this ugly format. The first one (add \n after each >) was the easiest and the slowest way to do it). But i guess this could be the easiest way to detect those insane ranges. To get it faster, it should be rewrite in C. Alternativly i'll take a look if mix xmlread + regex is easier to handle it – I have to think about it a few days...but if so, this would be the 4th rewrite of xlsxread.

Oh and this is the very reason that Matlab often gives you rows and columns of NaNs surrounding the actual data matrix. IMO Matlab behaves extremely dumb here.
In the xls2oct.m, ods2oct.m and parsecell.m functions in the OF io package I've added (default) options to strip away such empty rows and columns (and to keep track of which spreadsheet cell ranges the stripped data originate from).

Another issue I think needs some attention is that reading data values from raw spreadsheets as we do here, carries the risk of actually reading cached formula values rather than recalculated ones as MS-Excel and LibreOffice would do

I can't follow this issue. Formula values are ignored by xlsxread. I doesn't matte if they are cached or recalculated.

(if invoked through ActiveX or Java). IOW, one might read outdated values (depending on recalculation settings in the spreadsheet program used to create the file).
Note that the same applies to JExcelAPI (JXL) and OpenXLS (OXS) for .xls/.xlsx, and ODF Toolkit (OTK) and jOpenDocument (JOD) for .ods.
Of the "pure" Java interfaces, only Apache POI (POI; .xls/.xlsx) has a built-in formula evaluator (that I do invoke in the relevant io package functions).

I guess it's a bit to early to talk about forumla evaluation :)

In my opinion this is not needed. You shouldn't do math in excel and then read it in Octave - that's heavy dump. But yes, maybe it's not a personaly fault if you get already one with calulations.

>
> the question is now if (get rid of|change) line 102-109+122 to get the
> same wrong result as matlab or ignore this and we are done?

As you can guess, I don't think "we" are done yet :-)

Yes, thank you Bill Gates!

Philip

--
icq: 167498924
XMPP|Jabber: address@hidden

From:	Markus Bergholz
Subject:	Re: Test files [WAS: Re: xlsread in Octave 3.6.4]
Date:	Tue, 17 Sep 2013 22:02:29 +0200