From: Rik
Subject: [Octave-bug-tracker] [bug #60928] Performance of sort unexpectedly slow for DIM=2
Date: Sat, 14 Aug 2021 11:05:55 -0400 (EDT)
User-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36

Follow-up Comment #7, bug #60928 (project octave):

The Octave architecture is built around a library of mathematical routines (in
liboctave) and an interpreter (in libinterp).  The structure was created this
way so that, if desired, programmers could write their own applications in
whatever language they want (Fortran, C, C++, others) and link against the
Octave library (liboctave).

If a performance change is made in an m-file (a new sort.m which does the
permute and then calls the existing C++ sort) or in
libinterp/corefcn/data.cc, then the sort algorithm will still be slow for
programmers who link directly against liboctave and call the sort routines on
Array objects.  I don't think many people use liboctave directly, but it
would still be nice to preserve that capability.

Based on the discussion above, I think any fix should be in liboctave.  There
are a few specializations of sort, such as for Range objects, but the
principal function is templated code in Array.cc, so the fix would still
amount to a single change in one location.

The code to review is in liboctave/array/Array.cc:1762.  The issue seems to
be one of cache-line efficiency.  The code begins by calculating a stride:
the distance in memory between consecutive elements along the sort dimension
of an N-D array.


  octave_idx_type ns = dv(dim);
  octave_idx_type iter = dv.numel () / ns;
  octave_idx_type stride = 1;

  for (int i = 0; i < dim; i++)
    stride *= dv(i);


In the code above, dim is the 0-based dimension to sort on and dv is the
dimension vector for the array.  For this example,


tmp = rand (3,4,5,6);
sort (tmp, 2)


then


dim = 1
dv = [3, 4, 5, 6];


When sorting over the first dimension the stride is 1, which means the
subsequent code takes a shortcut: it can grab the data from the source
matrix one value after another and then place the sorted data into the output
matrix one after another.  Cache efficiency helps twice here: both the reads
from the source and the writes to the destination are sequential.

When not sorting over the first dimension, the correct data has to be gathered
from the source matrix, sorted, and then scattered back to the destination
matrix.  The code is complicated by the addition of NaN processing, but you
can see how the gather/scatter works and how it could be tremendously
inefficient when the stride is large.


          // gather and partition out NaNs.
          // FIXME: impact on integer types noticeable?
          octave_idx_type kl = 0;
          octave_idx_type ku = ns;
          for (octave_idx_type i = 0; i < ns; i++)
            {
              T tmp = ov[i*stride + offset];
              if (sort_isnan<T> (tmp))
                buf[--ku] = tmp;
              else
                buf[kl++] = tmp;
            }

          // sort.
          lsort.sort (buf, kl);

          if (ku < ns)
            {
              // NaNs are in reverse order
              std::reverse (buf + ku, buf + ns);
              if (mode == DESCENDING)
                std::rotate (buf, buf + ku, buf + ns);
            }

          // scatter.
          for (octave_idx_type i = 0; i < ns; i++)
            v[i*stride + offset] = buf[i];


It seems this could be avoided by using the equivalent of permute/ipermute,
since those functions are also templated code in Array.cc.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?60928>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



