Re: conv2 performance
From: Jaroslav Hajek
Subject: Re: conv2 performance
Date: Tue, 2 Mar 2010 10:56:33 +0100
On Mon, Mar 1, 2010 at 9:46 AM, Jaroslav Hajek <address@hidden> wrote:
> On Sun, Feb 28, 2010 at 4:53 PM, John W. Eaton <address@hidden> wrote:
>> I received a complaint that Octave's conv2 function is about 5 times
>> slower than Matlab's. By making a few simple changes, I was able to
>> improve the performance somewhat, but not by a factor of 5. My changes
>> are here:
>>
>> http://hg.savannah.gnu.org/hgweb/octave/rev/dc8637fd7a76
>>
>> On my system, I see the following improvement after these changes:
>>
>> > n = 1024; m = 64; x = rand(n,n); k = rand(m,m); tic; y = conv2(x,k); toc
>>
>> old: Elapsed time is 26.2576 seconds.
>> new: Elapsed time is 13.6104 seconds.
>>
>> and
>>
>> > n = 1024; m = 512; x = rand(n,1); k = rand(m,1); m = rand (m,n);
>> > tic; y = conv2(x,k,m); toc
>>
>> old: Elapsed time is 8.53273 seconds.
>> new: Elapsed time is 3.27103 seconds.
>>
>> Maybe there is still room for improvement here. I would happily use a
>> free software library with a GPL-compatible license to implement this
>> function, but I don't know whether one is available.
>>
>> jwe
>>
>>
>
>
> I have some ideas for speeding up the stock conv2. Using an
> existing library would of course be better, but I don't know of a
> suitable one. Unless anyone can come up with an existing solution,
> I'll try to convert the ideas to code eventually.
>
>
I think John Swensen has a point in that FFT-based convolution is
asymptotically better. 4th-order loops can outperform it only for
small kernels (what counts as "small" depends on the implementation).
Hence I think any loop-based code should primarily target small
kernels.
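To make the asymptotics concrete: direct loops on an NxN matrix with
an mxm kernel cost O(N^2 m^2) operations, while the FFT route costs
roughly O((N+m)^2 log(N+m)). A minimal Octave sketch of the FFT
approach (assuming real inputs; this is not what the new liboctave
code does, just the large-kernel alternative):

    function y = fftconv2 (x, k)
      ## "full" 2-D convolution via the FFT: zero-pad both operands
      ## to the size of the result, multiply transforms pointwise.
      [xr, xc] = size (x);
      [kr, kc] = size (k);
      r = xr + kr - 1;
      c = xc + kc - 1;
      ## real () assumes real x and k
      y = real (ifft2 (fft2 (x, r, c) .* fft2 (k, r, c)));
    endfunction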
I committed an initial implementation of convolutions in liboctave:
http://hg.savannah.gnu.org/hgweb/octave/rev/978f5c94b11f
This instantiates specialized code for all kernel sizes up to 8x8.
Larger kernels are split into 8x8 blocks.
The 4th-order loop is arranged so that the innermost two loops have
fixed sizes and hence can be completely unrolled.
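In Octave terms the loop structure corresponds to something like the
sketch below (a toy equivalent, not the C++ code itself; the block
update in the body plays the role of the two innermost fixed-size
loops that the compiler unrolls):

    function y = conv2_outer (x, k)
      ## "full" convolution, outer (axpy-style) accumulation:
      ## scatter each x(i,j) times the whole kernel into the result.
      [xr, xc] = size (x);
      [kr, kc] = size (k);
      y = zeros (xr + kr - 1, xc + kc - 1);
      for j = 1:xc
        for i = 1:xr
          ## this block update = the two fixed-size kernel loops
          y(i:i+kr-1, j:j+kc-1) += x(i,j) * k;
        endfor
      endfor
    endfunction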
I used the attached benchmark to measure the time for convolution of a
1000x1000 matrix with all kernel sizes up to 10x10.
Times are averaged over 5 runs. Results on my computer (Core 2 Duo @
2.83 GHz, g++ -O3 -march=native) are attached.
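For reference, the benchmark does roughly the following (a
hypothetical reconstruction with made-up names; the actual script is
the attached ttc2.m):

    x = rand (1000);
    nrun = 5;
    tocs = zeros (10);
    for kr = 1:10
      for kc = 1:10
        k = rand (kr, kc);
        tic;
        for r = 1:nrun
          y = conv2 (x, k);
        endfor
        ## average the elapsed time over the runs
        tocs(kr,kc) = toc / nrun;
      endfor
    endfor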
octave:1> load all_data
with a recent tip:
octave:2> tocs_oct0
tocs_oct0 =

   0.012844   0.013554   0.016711   0.022400   0.023357   0.026729   0.030247   0.033534   0.036832   0.040352
   0.021440   0.029549   0.039860   0.042910   0.046509   0.043920   0.050186   0.055437   0.060768   0.064602
   0.014037   0.022069   0.032884   0.041644   0.046854   0.052915   0.061239   0.067863   0.073653   0.079161
   0.015658   0.030140   0.034210   0.045213   0.052253   0.060553   0.069365   0.078562   0.086291   0.093605
   0.019656   0.028900   0.045464   0.047990   0.058807   0.066490   0.079297   0.088274   0.097667   0.108786
   0.022072   0.033146   0.047572   0.060943   0.070684   0.081349   0.092902   0.100911   0.115715   0.127048
   0.021898   0.030693   0.043972   0.058956   0.077249   0.084425   0.100613   0.111503   0.126935   0.141250
   0.021071   0.037299   0.050950   0.070723   0.081438   0.101516   0.114788   0.125197   0.141521   0.158815
   0.024811   0.036528   0.052532   0.073733   0.095367   0.101299   0.117457   0.132726   0.147585   0.165222
   0.024415   0.045528   0.062038   0.077777   0.094293   0.110504   0.127776   0.146341   0.163502   0.178777
with the new patch:
octave:3> tocs_oct1
tocs_oct1 =

   0.0074930   0.0093061   0.0101123   0.0059468   0.0067618   0.0073412   0.0095744   0.0111704   0.0155106   0.0155500
   0.0056431   0.0061924   0.0084282   0.0121774   0.0137026   0.0166219   0.0200053   0.0235966   0.0279384   0.0284500
   0.0060744   0.0084982   0.0126268   0.0165213   0.0221927   0.0273326   0.0332757   0.0390677   0.0434884   0.0455091
   0.0063169   0.0110016   0.0172338   0.0247400   0.0311880   0.0387136   0.0465136   0.0544024   0.0592641   0.0642496
   0.0066828   0.0141782   0.0221559   0.0309312   0.0406902   0.0501870   0.0602919   0.0492872   0.0548416   0.0618096
   0.0084610   0.0162350   0.0308314   0.0393550   0.0504898   0.0641784   0.0500531   0.0585372   0.0652503   0.0736086
   0.0105060   0.0235956   0.0373542   0.0465218   0.0594338   0.0500659   0.0609870   0.0713593   0.0790979   0.0899705
   0.0148870   0.0233453   0.0382854   0.0579996   0.0480443   0.0571550   0.0695515   0.0805356   0.0899501   0.1026436
   0.0159232   0.0315198   0.0468793   0.0580004   0.0528670   0.0641594   0.0777459   0.0901399   0.1043304   0.1173685
   0.0191256   0.0280614   0.0450657   0.0671604   0.0608538   0.0769042   0.0929764   0.1040684   0.1214832   0.1341881
using Matlab 2007a:
octave:4> tocs_matlab
tocs_matlab =

   0.0062376   0.0139384   0.0180750   0.0221970   0.0220130   0.0257478   0.0305874   0.0344506   0.0383642   0.0428446
   0.0109510   0.0266958   0.0351854   0.0458640   0.0620206   0.0698450   0.0828816   0.0961360   0.1055158   0.1111318
   0.0134432   0.0272864   0.0418616   0.0544420   0.0615062   0.0739450   0.0873426   0.0982644   0.1111086   0.1225830
   0.0184896   0.0378428   0.0543014   0.0658802   0.0816770   0.0981762   0.1139014   0.1303218   0.1464620   0.1628398
   0.0219104   0.0428282   0.0660560   0.0862450   0.1027792   0.1222084   0.1425500   0.1629794   0.1833898   0.2025316
   0.0265670   0.0535164   0.0782622   0.0981846   0.1218348   0.1462216   0.1700228   0.1945714   0.2187592   0.2437806
   0.0340150   0.0575714   0.0851972   0.1174214   0.1405062   0.1686618   0.1965610   0.2246892   0.2526790   0.2803052
   0.0348728   0.0733926   0.1020340   0.1292908   0.1611636   0.1934834   0.2251806   0.2569800   0.2883594   0.3213730
   0.0423292   0.0779326   0.1099018   0.1503224   0.1818022   0.2175340   0.2541414   0.2897038   0.3259232   0.3611042
   0.0422816   0.0899742   0.1258612   0.1622720   0.2018848   0.2427072   0.2823120   0.3228700   0.3632112   0.4034948
You can see that Matlab is actually slower than Octave in most cases
(even compared with the previous implementation).
The speed-ups (in %) of the new code over the old one (more is
better, 0 means no speed-up):
octave:5> 100 * (tocs_oct0 ./ tocs_oct1 - 1)
ans =
    71.410    45.645    65.257   276.671   245.425   264.099   215.913   200.206   137.462   159.501
   279.935   377.184   372.942   252.378   239.414   164.228   150.863   134.938   117.508   127.072
   131.090   159.694   160.428   152.063   111.125    93.595    84.036    73.706    69.362    73.946
   147.882   173.962    98.506    82.752    67.542    56.413    49.128    44.409    45.604    45.689
   194.121   103.835   105.200    55.150    44.523    32.485    31.522    79.101    78.090    76.002
   160.868   104.162    54.297    54.854    39.997    26.754    85.606    72.388    77.340    72.599
   108.435    30.080    17.717    26.727    29.975    68.627    64.974    56.256    60.479    56.995
    41.537    59.772    33.078    21.937    69.506    77.616    65.040    55.455    57.333    54.725
    55.819    15.888    12.059    27.125    80.391    57.886    51.079    47.244    41.459    40.772
    27.655    62.245    37.662    15.809    54.951    43.691    37.429    40.620    34.589    33.229
Apparently (and as expected), small kernels benefit the most.
I think 8x8 is unnecessarily high (it generates 64 specialized
functions); I'll probably reduce that to 7x5 or something like that.
For comparison, the speed-ups relative to Matlab:
ans =
   -16.755    49.776    78.742   273.261   225.548   250.730   219.470   208.410   147.342   175.528
    94.059   331.103   317.473   276.633   352.618   320.198   314.298   307.414   277.673   290.621
   121.310   221.083   231.530   229.527   177.146   170.538   162.481   151.523   155.490   169.359
   192.703   243.976   215.087   166.290   161.886   153.596   144.878   139.552   147.134   153.449
   227.861   202.071   198.142   178.828   152.589   143.506   136.433   230.673   234.399   227.670
   213.994   229.637   153.839   149.484   141.306   127.836   239.685   232.389   235.262   231.185
   223.769   143.992   128.079   152.401   136.408   236.879   222.300   214.870   219.451   211.552
   134.249   214.378   166.509   122.917   235.448   238.524   223.761   219.089   220.577   213.096
   165.833   147.250   134.435   159.175   243.886   239.052   226.887   221.394   212.395   207.667
   121.073   220.633   179.284   141.619   231.754   215.597   203.638   210.248   198.981   200.693
Hmmm. Maybe there have been some improvements in Matlab since then?
There are still a few gotchas:
First, only the "full" convolution is computed directly; the other
cases are extracted from it. I can try to add direct code for the
"valid" case in the future.
Also, quoting the sources:
// FIXME: Only the axpy-form accumulation is used. This is natural for outer
// convolution as it requires no boundary treatment.
// This typically requires one more memory store per operation, but as the
// memory access pattern is equivalent (switching ro and rw parts), I wouldn't
// expect a significant difference. cf. for instance sum(), where row-wise sum
// (axpy-based) is no slower than column-wise (dot-based).
// It would be nice, however, if the inner convolution was computed directly by
// dot-based accumulation.
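For the record, dot-based accumulation of the inner ("valid")
convolution would look like this in Octave terms (a sketch of the
idea, not code from the changeset):

    function y = conv2_valid_dot (x, k)
      ## each output element is an inner product of the flipped
      ## kernel with the sub-block of x it overlaps - no boundary
      ## treatment needed, one dot product per output element.
      [xr, xc] = size (x);
      [kr, kc] = size (k);
      kf = rot90 (k, 2);      # convolution flips the kernel
      y = zeros (xr - kr + 1, xc - kc + 1);
      for j = 1:columns (y)
        for i = 1:rows (y)
          y(i,j) = sum (sum (x(i:i+kr-1, j:j+kc-1) .* kf));
        endfor
      endfor
    endfunction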
Looking at the assembler code generated for oct-convn.cc, I would
expect the inner loops to be vectorized, but that doesn't appear to
be the case - they're only unrolled. This is a bit of a surprise to
me, because with newer gcc -ftree-vectorize is enabled at -O3. I
googled a little and found that aliasing might be the issue, but
adding __restrict__ doesn't seem to help. One more thing to figure
out.
regards
--
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz
Attachment: ttc2.m (the benchmark script)
Attachment: all_data (binary data)