

From: Michael Leitner
Subject: [Octave-bug-tracker] [bug #57805] Some array permutations are much slower than others
Date: Fri, 14 Feb 2020 03:41:16 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0

Follow-up Comment #1, bug #57805 (project octave):

To answer your question: yes, I would say it is desirable to perform as fast
as the specific problem allows. On my computer, I get the timings


t =

   3.990061   3.176484   3.510465   2.557613   2.339767   0.015316


where P is 


P =

   3   2   1
   3   1   2
   2   3   1
   2   1   3
   1   3   2
   1   2   3



For completeness, I let i go up to 6. The permutation with respect to P(6,:)
is of course a no-op, so it is naturally the fastest. P(4,:) leaves the third
dimension where it is, corresponding to 100 independent 100x100-matrix
transpositions, while P(5,:) leaves the first dimension in place,
corresponding to a single 100x100-matrix transposition whose elements are,
however, 100-element columns. So it is not surprising that these two are
faster than the first three, and just looking at these results, I would find
no aspect in which the present implementation is lacking.
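
For reference, a loop along the following lines reproduces this kind of
measurement (a minimal sketch: the 100x100x100 array size, the loop index i
and the variables t and P appear above; the variable name A and the
repetition count are my assumptions):

  A = rand (100, 100, 100);      # 100x100x100 array of doubles
  P = perms (1:3);               # the six permutations, rows ordered as shown above
  t = zeros (1, rows (P));
  for i = 1:rows (P)
    tic;
    for k = 1:100                # repeat to get measurable timings (assumed count)
      B = permute (A, P(i,:));
    endfor
    t(i) = toc;
  endfor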

However, you say that Matlab is overall faster, so this means there is
potential for optimization. But then you say that Matlab is even faster if
you do it multithreaded, and this I find hard to believe. I would have
expected these operations to be essentially memory-bandwidth-limited, that
is, that it is the rate at which data can be transferred from and to main
memory that decides the performance, and not the computing power. And if you
have multiple CPUs working, they should still have to go through the same
memory bus, I think. This should definitely be the case for P(5,:), where
contiguous 800-byte chunks (the 100-element columns) are copied around. Are
you sure that your data are large enough to be out of cache? A 100x100x100
array of doubles is just 8,000,000 bytes, which is in the range of today's L3
caches.
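
One way to check would be to repeat the measurement with an array that
clearly exceeds any L3 cache, for example (the size here is just an
assumption):

  A = rand (400, 400, 400);            # 400^3 * 8 bytes = 512 MB, well beyond L3
  tic; B = permute (A, [2 1 3]); toc   # same pattern as P(4,:), now out of cache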

Perhaps somebody can point out the position in the source code where the
copying around is actually done (I could only follow it as far as
args(0).permute (vec, inv) in data.cc), to see whether there is something
that can be optimized. For a pointer to current investigations of such
algorithms, see https://doi.org/10.1016/j.aej.2015.03.024

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?57805>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



