octave-maintainers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Spreadsheet support in core Octave instead of io package?


From: Philip Nienhuis
Subject: Re: Spreadsheet support in core Octave instead of io package?
Date: Sun, 19 Oct 2014 22:25:03 +0200
User-agent: Mozilla/5.0 (X11; Linux i686; rv:25.0) Gecko/20100101 Firefox/25.0 Iceape/2.22

John W. Eaton wrote:
xlsread, xlswrite, and xlsfinfo are all core functions in Matlab.  I'd
like to make them core functions in Octave as well, as I suspect many
users would like to have these functions available by default without
having to install a separate package.  Is there any objection to moving
them from the io package to core Octave?

I'd also consider moving the odsread, odswrite, and odsfinfo functions
to core Octave at the same time, though I also wonder whether it might
make sense to just put the XLS and ODS capabilities in a single function
instead of splitting them up in separate functions.

Comments?

No of course I have no objections, au contraire. But I think it does require a bit more preparation and planning than might appear at first sight.

Some questions, sorry :-) plus some comments and clarifications.

Apologies for an extremely long post.


Timing / planning
-----------------
1. Do you want to integrate these functions in the default branch (future 4.2+)? I suppose 4.0's (gui-release) target is only, well, the GUI.

- That would imply continued maintenance of the io package as it stands until 4.2 is official, maybe a little later for those users who cannot upgrade so fast. (I expect end of 2015 at the earliest for 4.2). Anyway I see no problems there as maintainer. - That would also give time (for me) to polish the spreadsheet stuff before inclusion into core Octave. That polishing is a long-standing wish of mine anyway, lots of duplicate code sneaked in that I want to eliminate. I suppose parts of polished spreadsheet I/O could already be transferred (copied) early next year.
- I'd rather fiddle in an OF package than in core Octave.

All in all a delay would be welcome.


Choice of interfaces
--------------------
"Interface" refers to the external dependencies for some particular file format, or set of file formats. The various interfaces and file formats constitute a fairly complicated patchwork of partly overlapping sets. Below is the latest situation as copied from the io-2.2.2 NEWS. Additional complications are that each interface also has its pros and cons, rate of matureness, and license. So there's the main reason that the io package has so "many" interfaces.

                                 Interface
File extension       COM POI POI/OOXML JXL OXS UNO OTK JOD OCT
--------------------------------------------------------------
.xls (Excel95)        R                 R       R
.xls (Excel97-2003)   +   +      +      +   +   +
.xlsx (Excel2007+)    ~          +         (+)  R           +
.xlsb, .xlsm          ~                     ?   R           R?
.wk1                  +                         R
.wks                  +                         R
.dbf                  +                         +
.ods                  ~                         +   +   +   +
.sxc                                            +       +
.fods                                           +
.uos                                            +
.dif                                            +
.csv                  +                         R
.gnumeric                                                   +
--------------------------------------------------------------
(~ = dependent on Excel/LO/OOo version); + = read/write; R = only reading )

So we have:

* No external dependencies (but limited flexibility):
"OCT", with support for only a limited number of XML-based file formats: .xlsx (OOXML, or Excel 2007+), ,ods (LibreOffice) and .gnumeric. .xls (Excel'97) is still lacking, a little unfortunate as there's so much of that format out there, and still so much legacy SW that can only produce .xls. Markus and I are looking into Mac OSX' Numbers file format, I did peek into oleo and applixware a while ago (looked promising. It is all XML).

* Java-based, using ready-baked Java class libs (.jar files). Offers much more ready-baked flexibility (like deleting data, inserting graphs, cell formatting) for those users that are prepared to dive in the relevant javadocs.
- "POI" (Apache POI), only support for .xls (BIFF8 / Excel'97) and .xlsx
- "JXL" (JExcelAPI), supports .xls BIFF8 plus reading .xls (BIFF5 / Excel'95)
- "OXS" (OpenXLS), supports .xls (BIFF8) plus buggy .xlsx
- "OTK" (ODF Toolkit), supports .ods (LibreOffice)
- "JOD" (JOpenDocument), supports .ods and (reading) .sxc (old StarOffice/OpenOffice.org)

* Java-based, invoking LibreOffice/OpenOffice.org:
- "UNO" (Java UNO bridge). Supports in principle any file format that LO or OOo can process, but not all has been implemented in the io package. Also here a lot of (yet unimplemented) flexibility is within reach.

* On Windows only (with the OF windows package):
- "COM", (ActiveX) using the COM server (system library) to invoke Excel itself (a sort of "system (...)" but more direct). Like UNO, any file format is supported that Excel can handle. This is what Matlab does too. As regards licenses I think this is OK for an OF package but close to the edge; the io package's spreadsheet I/O doesn't depend on it. I wouldn't want core Octave to be associated so tightly with proprietary SW, so this is no candidate interface for core Octave.
Another limitation is that ActiveX / COM only works with 32-bit MS-Office.
Just to clarify, COM isn't limited to Excel only:
- At my work we use it to send Matlab graphics directly into MS Word or Powerpoint documents; - The COM interface can also be used to invoke OSS like LibreOffice. I tried a few times but couldn't find a way to create arrays with COM & Starbasic only and gave up there, but I did get LibreOffice to execute basic commands from the Octave terminal.

The point to be aware of is that external support SW (Java-based) should be available for users, either somehow included in self-contained binaries (for Windows) or through distro package managers. AFAIK Redhat does have Apache POI and maybe even JExcelAPI in its repos (the io package's user support function can find it). FYI, the 11 Java class libs for all interfaces except UNO occupy around 22 MB on my systems.


Support functions
-----------------
Apart from the actual spreadsheet I/O routines there's quite a bit of other required stuff. For the XML-based formats two basic XML functions are needed (isolating a node from a char array and getting node attributes), there are support functions for parsing mixed cell arrays into numeric and text components, and some functions relating to spreadsheet cell addresses. There's also a user support function that PKG_ADD and PKG_DEL use to modify the javaclasspath.
All of this has to be transferred too.


Splitting up or one comprehensive function
------------------------------------------
There are 2 dimensions here, I suppose you mean "B":
A. Making xlsread just one comprehensive function.
B. Combining ODS and XLS into one xlsread or so.

A.
Matlab's xlsread and xlswrite are just one function (from a user perspective) and in at least my experience that has been a bad decision. That way, reading data from or writing data to file can only be done one worksheet at a time as follows:
1. The entire spreadsheet file has to be read in memory
2. The data have to be read from or written to one worksheet only
3. (for writing) the entire spreadsheet file has to be written to disk
4. The file handle or background process (Excel in case of Matlab) has to be closed. All in all a very inefficient procedure once more worksheets in one file need to be processed.

Octave's xlsread/xlswrite/odsread/odswrite and even xlsfinfo/odsfinfo are mere wrappers around functions that
- open and read a spreadsheet file into RAM and return a handle;
- functions to read from or write to a worksheet using the handle;
- functions for parsing a raw cell array into numeric and char components (in case of reading from worksheet); - functions for closing the handle and -if needed- write the file to disk, optionally with another file name and -location. I've even thought of implementing writing to another file format. For the UNO and COM interfaces that's already implicitly in place and there are workarounds for the other Java-based interfaces.

B.
To me the name "xlsread" has the connotation of linking to some proprietary SW; even though Matlab's xlsread can read ODS (provided Excel2007 is the backend in use). I'd rather have "spshread" or "spshwrite". In fact, that's one of the things I wanted to do - replace the xlsopen/xls2oct/oct2xls/xlsclose and odsopen/ods2oct/oct2ods/odsclose functions by wrappers calling spshopen/spsh2oct/oct2spsh/spshclose functions, same for xlsread etc. This way, xlsread could be just a thin wrapper or duplicate for spshread, same for odsread and similar to the other ones (xls/odsfinfo & xls/odswrite). Maybe even gnmread and gnmwrite for gnumeric? I know, it does fill up the namespace. But apart from a lot of names, there are no foreseen name clashes.


Other remarks
-------------
The io package's spreadsheet stuff is self-contained and independent of any other OF package. There are only suggested dependencies. I took care that adding or deleting interfaces should be easy (of course, writing the actual interface-specific parts themselves is another story).


Philip




reply via email to

[Prev in Thread] Current Thread [Next in Thread]