pspp-dev

Re: very long string support


From: John Darrington
Subject: Re: very long string support
Date: Wed, 3 May 2006 09:15:29 +0800
User-agent: Mutt/1.5.9i

I can see where you're coming from.  You want the copy_mangle and
copy_demangle functions to be wholly contained inside sys-file-write.c
and sys-file-read.c respectively.  That's how I had originally
envisaged it, and I tried hard to do it that way.

But I ran into problems when implementing it. Consider the
simple case of reading from a file:

GET /FILE='file-with-very-long-strings.sav'.

LIST.

When the values are printed in the list statement, where do they come
from? The casefile source in this instance is the *.sav file itself.
Thus the only way to implement it as you suggest would be to copy
the entire file into temporary storage and demangle it there.  Since
the file could have > 100,000,000 cases, this is impractical.

I wanted to keep the mangling and demangling with the
sfm_{read,write}_case functions.  I think I decided that to do this I
would have to change the signature of case_data_all so that it was
aware of the variable widths, and that exposing the mangling was the
lesser of the two evils.  Having said that, looking at it afresh, I
see that the sfm_* structs do contain the variable widths so by
completely rewriting the sfm_{read,write}_case functions it may be
possible (but I've not yet convinced myself).  Notice that even if we
can mangle/demangle early, we'll have to change the mangling so that
the spaces which are removed are placed at the end of the string,
because we cannot change the size of cases.
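
To make that last point concrete, here is a rough sketch of a
demangle that keeps the buffer size fixed.  It assumes the layout Ben
describes below (each 255 data bytes stored in 256 bytes, with a
padding byte last).  The names demangle_in_place, SEG_DATA and
SEG_SIZE are made up for illustration; this is not the copy_demangle
that is checked in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed layout: 255 data bytes followed by 1 padding byte. */
enum { SEG_DATA = 255, SEG_SIZE = 256 };

/* Hypothetical helper, not part of PSPP.
   Compacts the data bytes of the LEN-byte mangled string in BUF
   to the front of BUF and fills the freed tail with spaces, so
   that the total size of the buffer (and hence of the case) does
   not change.  Returns the number of data bytes. */
size_t
demangle_in_place (char *buf, size_t len)
{
  size_t src = 0;               /* Read offset in mangled layout. */
  size_t dst = 0;               /* Write offset in compacted layout. */

  assert (buf != NULL || len == 0);

  while (src < len)
    {
      size_t left = len - src;
      size_t n = left < SEG_DATA ? left : SEG_DATA;

      /* Regions can overlap, so memmove rather than memcpy. */
      memmove (buf + dst, buf + src, n);
      dst += n;
      src += SEG_SIZE;          /* Skip the padding byte. */
    }

  /* The removed padding bytes reappear as trailing spaces. */
  memset (buf + dst, ' ', len - dst);
  return dst;
}
```

For a 512-byte mangled buffer this leaves 510 data bytes at the front
and two spaces at the end, so the case stays the same size.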

The only other way I can see to isolate the (de)mangle functions
would be to add a transformation which does the demangling for us. But
there are issues with that approach which I'm unsure about.

J'

On Tue, May 02, 2006 at 03:35:32PM -0700, Ben Pfaff wrote:
     I took a look at your changes for supporting very long strings.
     It looks good, except that I don't understand the choice of where
     to "mangle" and "demangle" strings.
     
     Here's what I understand about the system file format for very
     long strings, based on what you checked in.  The basic system
     file format only supports strings up to 255 bytes long.
     Therefore, for compatibility very long strings are broken up into
     255-byte segments with unique names, and a system file extension
     record explains how to paste those segments back together.
     
     I think that what you checked in changed the *internal* PSPP
     storage of strings so that it stores each 255 bytes of a string
     in 256 bytes, putting a space in the 256th byte.  In other words,
     it changes PSPP internals to match the stupid but compatible
     format of SPSS system files.
     
     To me that looks like a mistake.  System files have to be
     compatible, so they have a stupid format for very long strings.
     But that's no reason to use that stupid format internally and
     then have to deal with it potentially all over (it doesn't look
     to me like you fixed up everything that can use very long
     strings, e.g. AGGREGATE, and I'd rather that we not have to).
     Instead, we should translate between the obvious, normal format
     and the stupid one on input and output, and then all the internal
     code can stay the way it is.
     
     Most system files won't need translation at all, because most of
     them don't have very long strings, so on input this can just be
     another thing that happens in the "slow path" in sfm_read_case().
     On output we don't currently have a "slow path" (except for
     compressed data) so we'll have to add something.
     
     Does all of that make sense?
     -- 
     Ben Pfaff 
     email: address@hidden
     web: http://benpfaff.org

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

