pspp-dev

Re: very long string support


From: Ben Pfaff
Subject: Re: very long string support
Date: Tue, 02 May 2006 19:51:02 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> I can see where you're coming from.  You want the copy_mangle and
> copy_demangle functions to be wholly contained inside sys-file-write.c
> and sys-file-read.c respectively.  That's how I had originally
> envisaged it to be, and I tried hard to do it that way.
>
> But I ran into problems when implementing it. Consider the
> simple case of reading from a file:
>
> GET /FILE='file-with-very-long-strings.sav'.
>
> LIST.
>
> When the values are printed in the list statement, where do they come
> from? The casefile source in this instance is the *.sav file itself.
> Thus, the only way to implement it as you suggest would be to copy
> the entire file into temporary storage and demangle it there.  Since
> the file could have > 100,000,000 cases, this is impractical.

Currently PSPP *always* makes a copy of the data source, whether
in memory or on disk.[*]  Thus, your 1e8 case .sav file could be a
problem anyway.

But there's no reason to convert the .sav file in place or all at
once.  You just read a case from disk and convert that case in
memory and pass the converted case along.  
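
To make that concrete, here is a minimal sketch of converting one
case at a time as it comes off the disk.  The segment layout
(SEG_DATA bytes of string data padded out to SEG_DISK bytes) and the
names demangle_case and read_next_case are invented for illustration.
They are not the real .sav layout or the actual PSPP functions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical layout, for illustration only: each on-disk segment
   holds SEG_DATA bytes of string data followed by padding up to
   SEG_DISK bytes.  The in-memory case has no padding. */
enum { SEG_DATA = 3, SEG_DISK = 4, N_SEGS = 2 };
enum { DISK_CASE = SEG_DISK * N_SEGS, MEM_CASE = SEG_DATA * N_SEGS };

/* Converts one on-disk case in DISK into its in-memory form in MEM
   by copying each segment's data and skipping its padding. */
static void
demangle_case (const char *disk, char *mem)
{
  for (int i = 0; i < N_SEGS; i++)
    memcpy (mem + i * SEG_DATA, disk + i * SEG_DISK, SEG_DATA);
}

/* Reads the next case from FILE and stores its converted form in
   MEM, which must have room for MEM_CASE bytes.
   Returns 1 on success, 0 at end of file or on error. */
static int
read_next_case (FILE *file, char *mem)
{
  char disk[DISK_CASE];

  if (fread (disk, 1, sizeof disk, file) != sizeof disk)
    return 0;
  demangle_case (disk, mem);
  return 1;
}
```

Only one case's worth of on-disk data is ever held in the temporary
buffer, so the size of the file does not matter to the conversion.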

[*] It's not impossible to change it so that it does not need to
do so.  In fact it should become relatively easy to do so once my
current group of patches is checked in.  But that's an
orthogonal issue that does not change what we're discussing here.

I think you may have a misconception about how procedures work.
This is essentially what happens:

        1. A case gets read from the source (represented as a
           "struct case_source"), in this case a system file.
           The source is responsible for putting the case into
           the "struct ccase" format.  If we're very lucky, which
           happens if the source is a casefile or a system file
           with certain constraints, no transformation is needed
           and just calling "fread" is sufficient.  But that's by
           no means necessary; arbitrary changes may be
           necessary, e.g. decompression, translating ASCII into
           binary, etc.

        2. The case passes through transformations.  New
           variables might be added or existing variables
           modified.  It might get dropped.  If not, it passes to
           the next step.

           (In your example there aren't any transformations.)

        3. The case is written to the sink (represented as a
           "struct case_sink"), usually, as in this case, a
           casefile.  This sink will become the source for the
           next procedure.

        4. The case passes through transformations that follow
           TEMPORARY.  New variables might be added or existing
           variables modified.  It might get dropped.  If not, it
           passes to the next step.

           (In your example there aren't any temporary
           transformations.)

        5. The case is passed to the procedure.  It does with it
           whatever it likes.

This happens *per case*.  After one case traverses these steps,
the next case starts at step 1.  It's possible for the case to
increase in size or be modified in steps 2 or 4.  The case has to
come from somewhere in step 1, and there's nothing restricting
how it's derived.  Certainly, adding or deleting spaces is not a
problem.
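
The five steps can be sketched as a loop along these lines.  This is
a toy model under stated assumptions: struct toy_case, struct
pipeline, the callback typedefs, and run_procedure are all invented
for illustration and are much simpler than the real "struct ccase",
"struct case_source", and "struct case_sink" interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a case: a single numeric value. */
struct toy_case { double value; };

/* Hypothetical callback types; PSPP's real interfaces differ. */
typedef bool read_func (void *aux, struct toy_case *);   /* step 1 */
typedef bool trns_func (struct toy_case *);  /* steps 2, 4; false drops */
typedef void case_func (void *aux, const struct toy_case *); /* 3, 5 */

struct pipeline
  {
    read_func *read; void *source_aux;   /* Step 1: source. */
    trns_func **perm; size_t n_perm;     /* Step 2: transformations. */
    case_func *write; void *sink_aux;    /* Step 3: sink. */
    trns_func **temp; size_t n_temp;     /* Step 4: after TEMPORARY. */
    case_func *proc; void *proc_aux;     /* Step 5: procedure. */
  };

/* Applies the N transformations in T to C in order.
   Returns false as soon as one of them drops the case. */
static bool
apply_all (trns_func **t, size_t n, struct toy_case *c)
{
  for (size_t i = 0; i < n; i++)
    if (!t[i] (c))
      return false;
  return true;
}

/* Runs every case through steps 1 to 5, one case at a time.
   Returns the number of cases that reached the procedure. */
static size_t
run_procedure (const struct pipeline *p)
{
  struct toy_case c;
  size_t n = 0;

  while (p->read (p->source_aux, &c))            /* 1 */
    {
      if (!apply_all (p->perm, p->n_perm, &c))   /* 2 */
        continue;
      p->write (p->sink_aux, &c);                /* 3 */
      if (!apply_all (p->temp, p->n_temp, &c))   /* 4 */
        continue;
      p->proc (p->proc_aux, &c);                 /* 5 */
      n++;
    }
  return n;
}

/* Example callbacks for trying the loop out. */
struct array_source { const double *data; size_t n, pos; };

static bool
array_read (void *aux, struct toy_case *c)
{
  struct array_source *s = aux;
  if (s->pos >= s->n)
    return false;
  c->value = s->data[s->pos++];
  return true;
}

static bool drop_negative (struct toy_case *c) { return c->value >= 0; }
static bool double_it (struct toy_case *c) { c->value *= 2; return true; }

static void
sum_case (void *aux, const struct toy_case *c)
{
  *(double *) aux += c->value;
}
```

The read callback is the only place that knows what the data looked
like on disk.  Everything after step 1 sees only the converted case.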

> I wanted to keep the mangling and demangling with the
> sfm_{read,write}_case functions.  I think I decided that to do this I
> would have to change the signature of case_data_all so that it was
> aware of the variable widths, and that exposing the mangling was the
> lesser of the two evils.  Having said that, looking at it afresh, I
> see that the sfm_* structs do contain the variable widths so by
> completely rewriting the sfm_{read,write}_case functions it may be
> possible (but I've not yet convinced myself).  

There shouldn't be any need to modify case_data_all() or anything
outside the system file reader/writer.

> Notice that even if we can mangle/demangle early, we'll have to
> change the mangling so that the spaces which are removed are
> placed at the end of the string, because we cannot change the
> size of cases.

This doesn't make sense to me.  We can change the format of data
on input or output as much as we want.

Are you confusing system files and casefiles by any chance?
They're completely separate sets of code.

> The only other way I can see to isolate the (de)mangle functions
> would be to add a transformation which does the demangling for us. But
> there are issues which I'm unsure about.

Should not be necessary.
-- 
Ben Pfaff 
email: address@hidden
web: http://benpfaff.org



