pspp-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: very long string support


From: Ben Pfaff
Subject: Re: very long string support
Date: Tue, 02 May 2006 22:07:49 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> On Tue, May 02, 2006 at 07:51:02PM -0700, Ben Pfaff wrote:
>      
>      Currently PSPP *always* makes a copy of the data source, whether
>      in memory or on disk.[*]  Thus, your 1e8 case .sav file could be a
>      problem anyway.
>
> ???? Are you saying that in order to read a 1e8 case by 10 variable
> .sav file, the machine needs to have a spare 1e9 x 8 byte chunk of
> memory?  That's not how I thought it works.

Memory *or disk space*, yes.  Every bit of data that gets sucked
from the data source gets written to a casefile.  The casefile
will start out in memory and then if it gets too big it'll get
dumped out to disk.

As I said, we can change that.  Until now it hasn't been a
problem.

>      > Notice that even if we can mangle/demangle early, we'll have to
>      > change the mangling so that the spaces which are removed are
>      > placed at the end of the string, because we cannot change the
>      > size of cases.
>      
>      This doesn't make sense to me.  We can change the format of data
>      on input or output as much as we want.
>
> What if the modified case isn't a multiple of 8 bytes long?  Would
> that be a problem? And even if it is a n x 8 bytes, wouldn't there be
> a problem when an expression like  case_data (c, v->fv) is
> encountered?

Strings stored in "struct ccase" should always constructed be
according to the rules we've followed until now: numbers are
stored as 8-byte doubles, and strings are padded to an 8-byte
boundary.  This strategy should continue to be perfectly feasible.

> Eg: If our dictionary contains:
>
> VAR   Length  Case data offset  FV
> ===   ======  ================  ==
> a     8       0                       0
> b     2550    8                 1
> c     8       2568              321
>
>
> Then after we've dropped all the spaces it'll look like:
>
> VAR   Length  Case data offset  FV
> ===   ======  ================  ==
> a     8       0                       0
> b     2550    8                 1
> c     8       2558              321
>
> which means that casedata(c, v->fv) will index into the wrong place in
> the case.  

The former should never be entered into the dictionary.  The
dictionary should only contain values for unmangled data, and
mangling/unmangling should only happen on output.  There should
really be no confusion about it.  If the reader or writer code
needs to track some kind of separate "mangled length" then it
should create a separate data structure to do so.  In fact
there's already a "struct sfm_var" that can be used for that
purpose.

> Of course, we could make the demangling process update the fv
> values, but then it needs to be aware of the dictionary.  And
> the dictionary would have two states, one for unmangled cases
> and one for mangled, which seems error prone.

Consider the following code in sfm_read_case():

      for (i = 0; i < r->value_cnt; i++)
        {
          struct sfm_var *v = &r->vars[i];

          if (v->width == 0)
            {
              flt64 f = *bounce_cur++;
              if (r->reverse_endian)
                bswap_flt64 (&f);
              case_data_rw (c, v->fv)->f = f == r->sysmis ? SYSMIS : f;
            }
          else if (v->width != -1)
            {
              memcpy (case_data_rw (c, v->fv)->s, bounce_cur, v->width);
              bounce_cur += DIV_RND_UP (v->width, sizeof (flt64));
            }
        }

I believe that the following, or close to it, would implement all
the unmangling necessary on input, assuming that v->width
receives the unmangled variable width (and "fillers" get changed
to -1 width or removed).  Only the second "if" clause's statement
changes:

      for (i = 0; i < r->value_cnt; i++)
        {
          struct sfm_var *v = &r->vars[i];

          if (v->width == 0)
            {
              flt64 f = *bounce_cur++;
              if (r->reverse_endian)
                bswap_flt64 (&f);
              case_data_rw (c, v->fv)->f = f == r->sysmis ? SYSMIS : f;
            }
          else if (v->width != -1)
            {
              int ofs = 0;
              while (ofs < v->width)
                {
                  int chunk = MIN (255, v->width - ofs);
                  memcpy (case_data_rw (c, v->fv)->s + ofs, bounce_cur, chunk);
                  bounce_cur += DIV_RND_UP (chunk, sizeof (flt64));
                  ofs += chunk;
                }
            }
        }

See how it works?  It takes 256-byte chunks from the bounce
buffer and copies only the first 255 bytes of them into the case,
and keeps doing this while there's still some data to copy.

Similar changes apply to sfm_write_case().  We'd want to change
this code

      for (i = 0; i < w->var_cnt; i++) 
        {
          struct sfm_var *v = &w->vars[i];

          memset(bounce_cur, ' ', v->flt64_cnt * sizeof (flt64));

          if (v->width == 0) 
            *bounce_cur = case_num (c, v->fv);
          else 
            {
              buf_copy_rpad((char*)bounce_cur, v->flt64_cnt * sizeof (flt64),
                            case_data(c, v->fv)->s, 
                            v->width);
            }
          bounce_cur += v->flt64_cnt;
        }

to

      for (i = 0; i < w->var_cnt; i++) 
        {
          struct sfm_var *v = &w->vars[i];

          memset(bounce_cur, ' ', v->flt64_cnt * sizeof (flt64));

          if (v->width == 0) 
            {
              *bounce_cur = case_num (c, v->fv);
              bounce_cur++;
            }
          else 
            {
              int ofs = 0;
              while (ofs < v->width)
                {
                  int chunk = MIN (255, v->width - ofs);
                  int nv = DIV_ROUND_UP (chunk, sizeof (flt64);
                  buf_copy_rpad ((char *) bounce_cur, nv * sizeof (flt64),
                                 case_data (c, v->fv)->s + ofs, chunk);
                  bounce_cur += nv;
                  ofs += chunk;
                }
            }
        }

I haven't tested either of these but I'm pretty sure they're
conceptually sound.  They assume that, say, a 2551-byte string is
formatted in a system file as 10 consecutive 255-byte strings
followed by a 1-byte string (with each of those strings padded to
an 8-byte boundary), but I haven't carefully verified that (and
it isn't explicitly stated in the docs you wrote).

In both cases we'd also need to add something like
"w->has_very_long_strings" to the conditions to take the slow
path.
-- 
Ben Pfaff 
email: address@hidden
web: http://benpfaff.org




reply via email to

[Prev in Thread] Current Thread [Next in Thread]