pspp-dev
Re: very long string support


From: John Darrington
Subject: Re: very long string support
Date: Wed, 3 May 2006 14:48:56 +0800
User-agent: Mutt/1.5.9i

On Tue, May 02, 2006 at 10:07:49PM -0700, Ben Pfaff wrote:

     >      > Notice that even if we can mangle/demangle early, we'll have to
     >      > change the mangling so that the spaces which are removed are
     >      > placed at the end of the string, because we cannot change the
     >      > size of cases.
     >      
     >      This doesn't make sense to me.  We can change the format of data
     >      on input or output as much as we want.
     >
     > What if the modified case isn't a multiple of 8 bytes long?  Would
     > that be a problem? And even if it is a n x 8 bytes, wouldn't there be
     > a problem when an expression like  case_data (c, v->fv) is
     > encountered?
     
     Strings stored in "struct ccase" should always be constructed
     according to the rules we've followed until now: numbers are
     stored as 8-byte doubles, and strings are padded to an 8-byte
     boundary.  This strategy should continue to be perfectly feasible.
     
OK.

     > Of course, we could make the demangling process update the fv
     > values, but then it needs to be aware of the dictionary.  And
     > the dictionary would have two states, one for unmangled cases
     > and one for mangled, which seems error prone.
     
     Consider the following code in sfm_read_case():
     
           for (i = 0; i < r->value_cnt; i++)
             {
               struct sfm_var *v = &r->vars[i];
     
               if (v->width == 0)
                 {
                   flt64 f = *bounce_cur++;
                   if (r->reverse_endian)
                     bswap_flt64 (&f);
                   case_data_rw (c, v->fv)->f = f == r->sysmis ? SYSMIS : f;
                 }
               else if (v->width != -1)
                 {
                   memcpy (case_data_rw (c, v->fv)->s, bounce_cur, v->width);
                   bounce_cur += DIV_RND_UP (v->width, sizeof (flt64));
                 }
             }
     
     I believe that the following, or close to it, would implement all
     the unmangling necessary on input, assuming that v->width
     receives the unmangled variable width (and "fillers" get changed
     to -1 width or removed).  Only the second "if" clause's statement
     changes:
     
           for (i = 0; i < r->value_cnt; i++)
             {
               struct sfm_var *v = &r->vars[i];
     
               if (v->width == 0)
                 {
                   flt64 f = *bounce_cur++;
                   if (r->reverse_endian)
                     bswap_flt64 (&f);
                   case_data_rw (c, v->fv)->f = f == r->sysmis ? SYSMIS : f;
                 }
               else if (v->width != -1)
                 {
                   int ofs = 0;
                   while (ofs < v->width)
                     {
                       int chunk = MIN (255, v->width - ofs);
                       memcpy (case_data_rw (c, v->fv)->s + ofs, bounce_cur,
                               chunk);
                       bounce_cur += DIV_RND_UP (chunk, sizeof (flt64));
                       ofs += chunk;
                     }
                 }
             }
     
     See how it works?  It takes 256-byte chunks from the bounce
     buffer and copies only the first 255 bytes of them into the case,
     and keeps doing this while there's still some data to copy.
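
The chunked read described above can be tried outside PSPP with a minimal sketch like the one below. It is not the sfm_read_case() code: the name unmangle_string and the plain char buffers are hypothetical, the constant 8 stands in for sizeof (flt64), and the layout (segments of at most 255 bytes, each padded to an 8-byte boundary) is the assumption made in this message, not something verified against real system files.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIV_RND_UP(x, y) (((x) + (y) - 1) / (y))

/* Hypothetical helper: copies WIDTH bytes of string data from SRC to DST.
   SRC is assumed to hold segments of at most 255 bytes, each padded to an
   8-byte boundary.  Returns the number of bytes of SRC consumed. */
static size_t
unmangle_string (char *dst, const char *src, size_t width)
{
  size_t ofs = 0;   /* bytes written to DST so far */
  size_t used = 0;  /* bytes consumed from SRC so far */

  assert (dst != NULL && src != NULL);
  while (ofs < width)
    {
      size_t chunk = width - ofs < 255 ? width - ofs : 255;
      memcpy (dst + ofs, src + used, chunk);
      used += DIV_RND_UP (chunk, 8) * 8;  /* skip the segment's padding */
      ofs += chunk;
    }
  return used;
}
```

For a 300-byte string this consumes 256 bytes for the first segment and 48 for the second, so the padding byte at offset 255 of the source never reaches the destination.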
     
     Similar changes apply to sfm_write_case().  We'd want to change
     this code
     
           for (i = 0; i < w->var_cnt; i++) 
             {
               struct sfm_var *v = &w->vars[i];
     
               memset (bounce_cur, ' ', v->flt64_cnt * sizeof (flt64));
     
               if (v->width == 0) 
                 *bounce_cur = case_num (c, v->fv);
               else 
                 {
                   buf_copy_rpad ((char *) bounce_cur, v->flt64_cnt * sizeof (flt64),
                                  case_data (c, v->fv)->s, 
                                  v->width);
                 }
               bounce_cur += v->flt64_cnt;
             }
     
     to
     
           for (i = 0; i < w->var_cnt; i++) 
             {
               struct sfm_var *v = &w->vars[i];
     
               memset (bounce_cur, ' ', v->flt64_cnt * sizeof (flt64));
     
               if (v->width == 0) 
                 {
                   *bounce_cur = case_num (c, v->fv);
                   bounce_cur++;
                 }
               else 
                 {
                   int ofs = 0;
                   while (ofs < v->width)
                     {
                       int chunk = MIN (255, v->width - ofs);
                       int nv = DIV_RND_UP (chunk, sizeof (flt64));
                       buf_copy_rpad ((char *) bounce_cur, nv * sizeof (flt64),
                                      case_data (c, v->fv)->s + ofs, chunk);
                       bounce_cur += nv;
                       ofs += chunk;
                     }
                 }
             }
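
The write side can be sketched the same way, again outside PSPP. The function name mangle_string is hypothetical, plain char buffers replace the flt64 bounce buffer, memcpy plus memset stands in for buf_copy_rpad, and the segment layout is the assumption made in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIV_RND_UP(x, y) (((x) + (y) - 1) / (y))

/* Hypothetical helper: writes the WIDTH bytes at SRC into DST as segments
   of at most 255 bytes, each space-padded to an 8-byte boundary.
   Returns the number of bytes written to DST. */
static size_t
mangle_string (char *dst, const char *src, size_t width)
{
  size_t ofs = 0;  /* bytes read from SRC so far */
  size_t out = 0;  /* bytes written to DST so far */

  assert (dst != NULL && src != NULL);
  while (ofs < width)
    {
      size_t chunk = width - ofs < 255 ? width - ofs : 255;
      size_t padded = DIV_RND_UP (chunk, 8) * 8;

      memcpy (dst + out, src + ofs, chunk);
      memset (dst + out + chunk, ' ', padded - chunk);  /* right-pad */
      out += padded;
      ofs += chunk;
    }
  return out;
}
```

The caller has to size DST for the padded length, which is larger than WIDTH whenever a segment is not a multiple of 8 bytes long.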
     
     I haven't tested either of these but I'm pretty sure they're
     conceptually sound.  They assume that, say, a 2551-byte string is
     formatted in a system file as 10 consecutive 255-byte strings
     followed by a 1-byte string (with each of those strings padded to
     an 8-byte boundary), but I haven't carefully verified that (and
     it isn't explicitly stated in the docs you wrote).
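
To make the arithmetic of that assumed layout concrete, here is a small sketch that counts the 8-byte slots a string would occupy. The function name assumed_slot_count is hypothetical, and the layout is only the unverified assumption stated above, not a description of what SPSS actually writes.

```c
#include <assert.h>
#include <stddef.h>

#define DIV_RND_UP(x, y) (((x) + (y) - 1) / (y))

/* Hypothetical helper: number of 8-byte slots occupied by a WIDTH-byte
   string if it is stored as segments of at most 255 bytes, each padded
   to an 8-byte boundary. */
static size_t
assumed_slot_count (size_t width)
{
  size_t slots = 0;

  while (width > 0)
    {
      size_t chunk = width < 255 ? width : 255;
      slots += DIV_RND_UP (chunk, 8);
      width -= chunk;
    }
  return slots;
}
```

Under this assumption a 2551-byte string takes 10 x 32 + 1 = 321 slots, or 2568 bytes in the file.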

It seems that SPSS formats a 2551-byte string as 10 consecutive
255-byte strings followed by a 103-byte string with the last 102 bytes
empty.
Or in general, an N-byte string (where N > 255) is formatted as
(N div 252) consecutive 255-byte strings followed by a
8xsup(N - (N div 252)x255/8) byte string (God knows why!).
When reading, we just have to assume that there'll be "a number" of
255-byte strings, followed by a string <= 255 bytes in length.
     
     In both cases we'd also need to add something like
     "w->has_very_long_strings" to the conditions to take the slow
     path.

OK.  I think that's where I ran into problems.  I was trying to
ensure that the fast path would always be taken.  Maybe that was a
misguided goal.

I'm not sure about the comments you made about AGGREGATE.  Can you
provide a test case which fails?  Then I can check it in as a
regression test.

Thanks.

J'
-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

