From: | Andreas Stahel |
Subject: | Re: CPU usage by call of C++ code through system() on Linux |
Date: | Fri, 7 Aug 2020 09:55:11 +0200 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 |
On 6/29/20 4:37 PM, Andreas Stahel wrote:
On 29.06.20 09:49, Kai Torben Ohlhus wrote:
On 6/26/20 3:58 PM, Andreas Stahel wrote:
On 6/26/20 6:28 AM, Kai Torben Ohlhus wrote:
On 6/26/20 1:17 AM, Andreas Stahel wrote:

Dear Octave Users

Maybe one of you can give me a hint on how to make my Octave code run faster. Within a good-size program (run time 40 sec) the command system() is used to call a C++ code. The C++ code uses pthreads. While the code is running, htop shows approximately 40% of the load on each CPU as kernel load and 60% as "normal" load (user space?). When running the same code in Matlab, only the "normal" load shows, with very little kernel load on the CPUs. The computation time in Matlab is also only 60% of the time consumed by Octave (5.2.0). The system is Ubuntu 20.04 on an AMD Ryzen 3950X.

Any hints on what is slowing Octave down?

With best regards
Andreas

Dear Andreas,

Maybe I do not understand your setup correctly. You have a C++ code using threads compiled to, e.g., "code.exe" (the suffix does not matter), and an Octave script "benchmark.m" that somewhere contains the code line

    system ("code.exe")

First question: do "benchmark.m" and "code.exe" interact with each other? That is, does "code.exe" compute something that "benchmark.m" processes further by importing results? What is the purpose of Octave calling "code.exe"? Benchmarking with tic-toc?

Second question: is it "code.exe" (standalone, without Octave or Matlab) or "benchmark.m" (called from Octave or Matlab) that has a run time of 40 seconds?

Now to your observation. When running "benchmark.m" in Octave and Matlab you observe that Octave is slower. I do not understand how this is related to the CPU "kernel" and "normal" usage. What is the runtime of "benchmark.m" in Matlab and Octave, respectively? Is your complaint that not all CPU cores are used?

Maybe it is best to give us (some) code to better understand the situation.

Kai

Dear Kai

Thank you for the quick reply and the attempt to locate the problem. The code in "benchmark.m" is a loop with 600 iterations.
In each iteration a C++ code is called through system(). The C++ code is heavily threaded and uses FFTW extensively. FFTW is used as a single-threaded library; thus the multithreading is "hand coded".

I have two options set up:
NumIter = 0, no FFT computations
NumIter = 2, many FFT computations

In addition I called the binary with a loop in bash. These are the observed wall times, averaged for one call of the binary:
– Octave NumIter=2 : 59.6 ms, NumIter=0 : 16.3 ms
– MATLAB NumIter=2 : 38.3 ms, NumIter=0 : 20.1 ms
– bash NumIter=2 : 37.9 ms, NumIter=0 : 19.2 ms

This puzzles me thoroughly!

Andreas

PS. On nabble these messages show up in the wrong thread!

Dear Andreas,

The maintainers list was not in the CC. Sorry for the late reply.

I am still not really convinced that I understand your setup and the purpose of your computation. Is there any output or synchronization between "code.exe" and "benchmark.m"?

The Octave interpreter alone already consumes "lots of time" interpreting a for-loop, compared to your fast overall computation time.

    a = 0; tic; for i = 1:600, a = a + i; end; toc

Octave 1.53995 ms.
Matlab 0.025 ms.

So maybe you just measure "slow" code interpretation when the body of the for-loop is "heavier" than the one shown above?

Do you measure your wall time inside "code.exe" or in "benchmark.m" by tic-toc, like in my example? Maybe you will find no differences if you use a more precise C/C++ library to measure the wall time and return it for further processing by Octave or Matlab?

Kai

Dear Kai

Thank you for your effort. Here is an attempt to clear up the situation.

The loop runs over 600 frames; the timings are given as averages per frame. In the code "benchmark.m" the time per frame is measured by a tic()/toc() pair.

    tic();
    system(command);   %% this is where the computations are performed
    systemtime = toc();
    display(sprintf('time = %f',systemtime)) % to get an impression while it is running
    systemtimetotal = systemtimetotal+systemtime;

Based on your suggestion I added two system calls to gettimeofday() in the C code. The observed timing is consistent with the tic()/toc() result, i.e. tic()/toc() is slightly higher.

The C code was compiled with

    gcc -O3 -Wall RunMultipleTH_z_Neumann2.c -lpthread -lm -lfftw3 -o RunMultipleTH_z_Neumann2

"benchmark.m" and "code.exe" exchange some information through files. I timed those file reads and writes; they use very little time.

On a host with a Ryzen 3950X CPU:
* running "code.exe" in a bash loop leads to 33 ms per frame
  htop has almost all of the CPU load assigned to the user
* running the code in Octave leads to 59 ms per frame
  htop has a sizable part of the CPU load assigned to the kernel
* running the code in Matlab leads to 37 ms per frame
  htop has almost all of the CPU load assigned to the user

If I reduce the FFTW computations within "code.exe", Octave is faster than bash or Matlab, but by very little. The multiple threads are still launched within the C code, but no 2D FFT operations are applied.

On a host with an Intel Xeon E5-1650 CPU a similar effect occurs, though not quite as drastic:
bash 80 ms
Matlab 99 ms
Octave 127 ms

I have no idea what could cause this surprising effect.

Enjoy the day
Andreas
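
The bash-loop measurement mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: `true` is a stand-in for the real binary, the variable names are made up for this sketch, and `date +%s%N` relies on GNU date as found on typical Linux systems.

```shell
#!/bin/sh
# Hypothetical stand-in for the real binary; replace with the actual program.
CMD="true"
FRAMES=600

start=$(date +%s%N)          # nanoseconds since the epoch (GNU date)
i=0
while [ "$i" -lt "$FRAMES" ]; do
    $CMD                     # one call per frame, as in the benchmark loop
    i=$((i + 1))
done
end=$(date +%s%N)

# Average wall time per call in microseconds (integer arithmetic).
avg_us=$(( (end - start) / FRAMES / 1000 ))
echo "frames=$i avg_us=$avg_us"
```

Comparing this figure with the tic()/toc() average from Octave or Matlab separates the cost of the binary itself from whatever the calling environment adds.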

Question answered. The effect is caused by using OpenBLAS. If the environment variable is set with "export OPENBLAS_NUM_THREADS=1" before starting Octave, then the speed is similar to Matlab or bash.

Enjoy the day
Andreas

--
Andreas Stahel
Mathematics, BFH-TI    E-Mail: Andreas.Stahel@[ANTI-SPAM]bfh.ch
Quellgasse 21          HuCE, Institute for Human Centered Engineering
CH-2502 Biel           WWW: https://web.sha1.bfh.science
Switzerland            Phone: ++41 +32 32 16 258
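
The fix reported in this message, setting OPENBLAS_NUM_THREADS=1 before starting Octave, can be sketched as below. The message does not spell out the mechanism, so this only shows how the variable is exported and inherited by child processes. The `sh -c` child is a stand-in for Octave so that the sketch runs without Octave installed, and the `octave benchmark.m` lines in the comments assume the script name used earlier in the thread.

```shell
#!/bin/sh
# Limit OpenBLAS to one thread for everything started from this shell.
export OPENBLAS_NUM_THREADS=1

# Child processes inherit exported variables; `sh -c` stands in for Octave here.
child_value=$(sh -c 'echo "$OPENBLAS_NUM_THREADS"')
echo "child sees OPENBLAS_NUM_THREADS=$child_value"

# With Octave installed, one would then start it from the same shell, e.g.:
#   octave benchmark.m
# or set the variable for a single invocation only:
#   OPENBLAS_NUM_THREADS=1 octave benchmark.m
```

The variable has to be in the environment before Octave starts, which is why it is exported in the shell rather than set from within the running session.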