Re: casefile random access

pspp-dev



From:	John Darrington
Subject:	Re: casefile random access
Date:	Sun, 4 Jun 2006 10:57:08 +0800
User-agent:	Mutt/1.5.4i

On Sat, Jun 03, 2006 at 11:33:49AM -0700, Ben Pfaff wrote:

     If you're using a casefile, then it's backed either by an array
     of cases in memory or by a disk file in the casefile format.
     Random access in an array is trivial.  A disk file can be read
     sequentially or randomly.  We currently do only sequential
     access.  Adding random access wouldn't change that: procedures
     would still read the casefile sequentially.
     
[:snip:]

     In short, I think that random access for interactive usage is
     fine.
     
[:snip:]

OK.  

If you're able and willing to write a random access casereader, then
that will certainly make the GUI code simpler.  I've got a change
almost ready to commit that will remove the GUI's hard limit on the
number of variables.  The next major step for the GUI is to remove the
limit on the number of cases, so I'm just about ready to use such a
casereader.


My idea of the interface would be something along the lines of the
following, but you may have better ideas.

#include <stdbool.h>

struct ccase;
struct ra_casereader;

/* Read case CNUM into C. */
bool
ra_casereader_read (struct ra_casereader *reader, int cnum,
                    struct ccase *c);
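
To make the proposal concrete, here is a minimal sketch of how such a
reader might look when the cases are held in an in-memory array.  Only
the ra_casereader_read signature comes from the proposal above; the
struct layouts, ra_casereader_init, and the simplified struct ccase are
hypothetical stand-ins, not PSPP's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for PSPP's struct ccase (assumption). */
struct ccase
  {
    double values[2];
  };

/* Hypothetical reader over an in-memory array of cases. */
struct ra_casereader
  {
    const struct ccase *cases;  /* Borrowed array; not owned by the reader. */
    size_t case_cnt;            /* Number of cases in the array. */
  };

/* Initializes READER to serve the CASE_CNT cases in CASES. */
void
ra_casereader_init (struct ra_casereader *reader,
                    const struct ccase *cases, size_t case_cnt)
{
  assert (reader != NULL);
  assert (cases != NULL || case_cnt == 0);
  reader->cases = cases;
  reader->case_cnt = case_cnt;
}

/* Reads case CNUM into C.
   Returns true on success, false if CNUM is out of range. */
bool
ra_casereader_read (struct ra_casereader *reader, int cnum,
                    struct ccase *c)
{
  if (cnum < 0 || (size_t) cnum >= reader->case_cnt)
    return false;
  *c = reader->cases[cnum];
  return true;
}
```

A caller such as the data viewer would ask only for the rows currently
on screen and treat a false return as "no such case".  A disk-backed
variant could keep the same interface and seek instead of indexing.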


     There's another issue here.  All this assumes that your data is
     in a casefile.  But you're really talking about a system file,
     which is a different beast.  To do what I'm talking about above,
     you'd have to copy the system file's data in a casefile.  This
     would double the disk space needed (the original system file plus
     a copy in a disk-based casefile).  What you might really want is
     to be able to operate directly on the system file's content.  The
     system file interface doesn't support random access, but it
     could, just as the casefile interface could.  At least, it could
     easily for non-compressed system files; it would require extra
     time or extra space to support random access in compressed system
     files (because there's no way to know where to seek to).
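
The reason the uncompressed case is easy can be sketched in a few
lines: when every case occupies the same number of bytes, the position
of case N is a simple multiplication.  The sketch below assumes 8-byte
values, native byte order, and numeric-only data; case_offset,
read_case_at, and the data_start parameter are hypothetical names, not
part of sysfile-reader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed size in bytes of one value in an uncompressed data record. */
enum { VALUE_SIZE = 8 };

_Static_assert (sizeof (double) == VALUE_SIZE,
                "this sketch assumes 8-byte doubles");

/* Returns the byte offset of case CNUM, given that case data starts at
   DATA_START and every case holds VALUE_CNT fixed-size values. */
long
case_offset (long data_start, size_t value_cnt, long cnum)
{
  return data_start + cnum * (long) (value_cnt * VALUE_SIZE);
}

/* Seeks to case CNUM in FILE and reads its VALUE_CNT values into VALUES.
   Returns true on success, false on a seek error or short read. */
bool
read_case_at (FILE *file, long data_start, size_t value_cnt,
              long cnum, double *values)
{
  assert (file != NULL && values != NULL);
  if (cnum < 0)
    return false;
  if (fseek (file, case_offset (data_start, value_cnt, cnum), SEEK_SET) != 0)
    return false;
  return fread (values, VALUE_SIZE, value_cnt, file) == value_cnt;
}
```

With compression the size of each case varies, so no such formula
exists; a reader would need either a scan from the start or an index of
case offsets built on a first pass.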


I don't think it's worth messing with the internals of sysfile-reader
just for the GUI's benefit.  It's true that at the moment the GUI
operates only on system files, but that was just a temporary
convenience that was easy(ish) to program.  Clearly what is needed is
for the GUI to have its own casefile.

     
     I'll conclude by adding a final, forward-looking issue.
     Currently, every procedure reads the active file from somewhere,
     such as a system file or casefile, transforms it, and then writes
     the transformed version to a casefile[*].  That is, it always
     makes a new copy (and if the old version is a casefile, throws
     away the old copy).  (In SPSS syntax, this is equivalent to
     always running the CACHE utility.)  But, as you've pointed out
     before, this is wasteful; it is usually[+] possible to avoid
     writing a new copy, if you just retain the old transformations
     and re-apply them to the old data, followed by any new
     transformations, on the next procedure.  I'm planning to
     implement this relatively soon.  But after that, there's no
     obvious place for the data viewer window to get its data from,
     because the data it wants to show is not actually stored
     anywhere; it's just defined in terms of a source file plus a
     bunch of transformations.  I'm not sure what we'll want to do
     about that; one option would be to, when using the GUI, always
     write out a new copy.
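
One way to picture "a source file plus a bunch of transformations" is
the sketch below, which applies the retained transformations each time
a case is read rather than storing the result.  It assumes every
transformation depends only on the case it is given and never drops
cases; all names (lazy_source, transform_func, the two sample
transformations) and the simplified struct ccase are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for PSPP's struct ccase (assumption). */
struct ccase
  {
    double values[2];
  };

/* A transformation modifies one case in place. */
typedef void transform_func (struct ccase *c);

/* Untransformed source data plus the transformations to re-apply. */
struct lazy_source
  {
    const struct ccase *cases;      /* Source data, never modified. */
    size_t case_cnt;                /* Number of source cases. */
    transform_func *const *xforms;  /* Retained transformations, in order. */
    size_t xform_cnt;               /* Number of transformations. */
  };

/* Copies source case CNUM into C, then applies every transformation.
   Returns false if CNUM is out of range. */
bool
lazy_source_read (const struct lazy_source *src, size_t cnum,
                  struct ccase *c)
{
  size_t i;

  assert (src != NULL && c != NULL);
  if (cnum >= src->case_cnt)
    return false;
  *c = src->cases[cnum];
  for (i = 0; i < src->xform_cnt; i++)
    src->xforms[i] (c);
  return true;
}

/* Two sample transformations, for illustration only. */
void
double_first (struct ccase *c)
{
  c->values[0] *= 2.0;
}

void
add_one_second (struct ccase *c)
{
  c->values[1] += 1.0;
}
```

Under those assumptions a viewer could display transformed data without
a stored copy.  Transformations that look at other cases or filter
cases out would break the one-to-one mapping between row number and
source case, which is where writing out a new copy becomes the simpler
answer.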

There was a nagging worry beginning to grow in the back of my mind
about this.  Like you say, probably the answer is to explicitly read
into a new casefile if necessary.

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
