First, let me make a confession: I'm a little obsessed by performance. I don't mind reading assembly language and get a kick out of being able to make a tweak that makes things faster on 3 architectures and 2 different compilers.
But now I've run into something that has got me stumped. it started with a complaint by an engineer that Octave 2.9.9 was about 15% slower at solving a optimization problem using an octave-forge package than Octave 2.9.5. After working with it a bit, I found other applications that had slowed down as well. I began to suspect that the cost of doing a function call had increased, so I started benchmarking with a simple little function that I use to test function
function uck = quack(m,n) % global calls % calls = calls + 1 ; if(m == 0 ) uck = n + 1 ; return; end if(n == 0 ) uck = quack(m-1,1); return; end uck = quack(m-1,quack(m,n-1)); return;
Now this little devil makes lots of function calls. If you un-comment the 'global calls counter', for quack(3,6), you will find a total of 172,233 function calls are made to compute the number 509. Of course, you will need to set your max_recursion_depth to 10+(2^(n+3)) because those function calls get nested verrrrry deeply.
So then I ran it under a few different versions of Octave to see how well they performed:
Octave Version | ACK(3,6) X 2 Time | 5 | 6.911 | 6 | 8.149 | 7 | 8.388 | 8 | 8.728 | 9 | 8.309 | 10 | 8.569 |
As you can see, something went awry between Octave 2.9.5 and 2.9.6, and has stayed bad since, to the tune of 20% less goodness.
My next impulse was to try compiling octave 2.9.6 using the '-pg' switch and build a profiling version of octave 2.9.6 and see what were the top 50 things octave spent its timing doing while running this program:
% time | cumulative seconds | self seconds | calls | Calls Per Ack | | Name | 5.26 | 2.05 | 2.05 | 357507 | 1.038 | | unwind_protect::run_frame(std::string const&) | 4.21 | 3.69 | 1.64 | 344464 | 1.000 | | symbol_table::push_context() | 4.17 | 5.32 | 1.63 | 345337 | 1.003 | | symbol_table::clear() | 4.05 | 6.9 | 1.58 | 344464 | 1.000 | | symbol_table::pop_context() | 3.72 | 8.35 | 1.45 | 5727363 | 16.627 | | unwind_protect::add(void (*)(void*), void*) | 3.49 | 9.71 | 1.36 | 345329 | 1.003 | | charMatrix::charMatrix(string_vector const&) | 3.1 | 10.92 | 1.21 | 1916765 | 5.564 | | tree_identifier::rvalue(int) | 3.03 | 12.1 | 1.18 | 2411248 | 7.000 | | symbol_record::push_context() | 3 | 13.27 | 1.17 | 2411248 | 7.000 | | symbol_record::pop_context() | 2.92 | 14.41 | 1.14 | 1743921 | 5.063 | | symbol_table::lookup(std::string const&, bool, bool) | 2.8 | 15.5 | 1.09 | 5931242 | 17.219 | | std::vector<octave_value, std::allocator<octave_value> >::operator=(std::vector<octave_value, std::allocator<octave_value> > const&) | 2.45 | 16.45 | 0.96 | 2075833 | 6.026 | | symbol_record::variable_reference() | 2.1 | 17.27 | 0.82 | 2426192 | 7.043 | | symbol_record::clear() | 2.04 | 18.07 | 0.8 | 35985058 | 104.466 | | octave_value::~octave_value() | 1.8 | 18.77 | 0.7 | 345337 | 1.003 | | octave_user_function::do_multi_index_op(int, octave_value_list const&) | 1.68 | 19.42 | 0.66 | 5582276 | 16.206 | | octave_value_list::operator=(octave_value_list const&) | 1.55 | 20.03 | 0.61 | 10172344 | 29.531 | | Array<std::string>::~Array() | 1.49 | 20.61 | 0.58 | 1038927 | 3.016 | | tree_binary_expression::rvalue() | 1.45 | 21.17 | 0.57 | 11861282 | 34.434 | | octave_value::octave_value() | 1.38 | 21.71 | 0.54 | 352868 | 1.024 | | tree_index_expression::rvalue(int) | 1.33 | 22.23 | 0.52 | 6080792 | 17.653 | | octave_allocator::alloc(unsigned long) | 1.31 | 22.74 | 0.51 | 345161 | 1.002 | | tree_parameter_list::define_from_arg_vector(octave_value_list const&) | 1.28 | 23.24 | 0.5 | 3647156 | 10.588 | | saved_variable::restore(void*) | 1.15 | 23.69 | 0.45 | 1916765 | 5.564 | | tree_identifier::rvalue() | 1.13 | 24.13 | 0.44 | 352861 | 1.024 | | tree_argument_list::convert_to_const_vector(octave_value const*) | 1.1 | 24.56 | 0.43 | 348966 | 1.013 | | tree_simple_assignment::rvalue() | 1.05 | 24.97 | 0.41 | 1214929 | 3.527 | | tree_statement::eval(bool, int, bool) | 1.03 | 25.37 | 0.4 | 352517 | 1.023 | | tree_index_expression::rvalue() | 0.95 | 25.74 | 0.37 | 1214931 | 3.527 | | unwind_protect::run() | 0.94 | 26.11 | 0.37 | 345331 | 1.003 | | octave_user_function::subsref(std::string const&, std::list<octave_value_list, std::allocator<octave_value_list> > const&, int) | 0.9 | 26.46 | 0.35 | 521125 | 1.513 | | tree_statement_list::eval(bool, int) | 0.87 | 26.8 | 0.34 | 11850042 | 34.401 | | octave_base_value::~octave_base_value() | 0.82 | 27.12 | 0.32 | 2075833 | 6.026 | | tree_identifier::lvalue() | 0.74 | 27.7 | 0.29 | 690657 | 2.005 | | tree_parameter_list::undefine() | 0.74 | 27.41 | 0.29 | | 0.000 | | octave_base_value::clone() const | 0.72 | 27.98 | 0.28 | 5115165 | 14.850 | | octave_value* std::__uninitialized_copy_aux<__gnu_cxx::__normal_iterator<octave_value const*, std::vector<octave_value, std::allocator<octave_value> > >, octave_value*>(__gnu_cxx::__normal_iterator<octave_value const*, std::vector<octave_value, std::allocator<octave_value> > >, __gnu_cxx::__normal_iterator<octave_value const*, std::vector<octave_value, std::allocator<octave_value> > >, octave_value*, __false_type) | 0.64 | 28.48 | 0.25 | 3120063 | 9.058 | | octave_value::operator=(octave_value const&) | 0.64 | 28.23 | 0.25 | 8640634 | 25.084 | | octave_base_scalar<double>::is_defined() const | 0.58 | 28.7 | 0.23 | 2428813 | 7.051 | | std::_List_base<octave_value_list, std::allocator<octave_value_list> >::__clear() | 0.55 | 29.13 | 0.22 | 1917009 | 5.565 | | lookup(symbol_record*, bool) | 0.55 | 29.35 | 0.22 | 1045453 | 3.035 | | Array<octave_value (*)(octave_base_value const&, octave_base_value const&)>::checkelem(int, int, int) | 0.55 | 28.92 | 0.22 | 6431971 | 18.672 | | octave_call_stack::instance_ok() | 0.54 | 29.77 | 0.21 | 3666711 | 10.645 | | octave_value_list::elem(int) | 0.54 | 29.98 | 0.21 | 348966 | 1.013 | | tree_simple_assignment::rvalue(int) | 0.54 | 29.56 | 0.21 | | 0.000 | | octave_base_scalar<double>::nnz() const | 0.51 | 30.18 | 0.2 | 2075004 | 6.024 | | octave_lvalue::assign(octave_value::assign_op, octave_value const&) | 0.51 | 30.38 | 0.2 | 520804 | 1.512 | | tree_if_command_list::eval() | 0.47 | 30.94 | 0.19 | 345337 | 1.003 | | octave_user_function::restore_args_passed() | 0.47 | 31.13 | 0.19 | 345337 | 1.003 | | octave_user_function::install_automatic_vars() | 0.49 | 30.57 | 0.19 | 2966199 | 8.611 | | octave_value* std::__uninitialized_fill_n_aux<octave_value*, unsigned long, octave_value>(octave_value*, unsigned long, octave_value const&, __false_type) |
The "Calls Per Ack" column is the number of "Calls" divided by 172,233*2. It gives you hint about how many times a function is executed as a result of each call to quack(). Some numbers are scary-big, like the number of times the octave_value destructor is called (104 Calls per ACK) or the Array<std::string> destructor (29+ times per ACK). I don't at all understand why so many octave_value and string array objects are being created for so simple a program, they are not directly big chunks of time. Most of the time is pretty spread out among the classes unwind_protect::, symbol_table::, and tree_*::.
I tried taking a look at the diff set between 2.9.5 and 2.9.6, but was not able to see anything that leapt out at me as an obvious problem.
I then generated profiling octaves for all versions from 2.9.5 to 2.9.10 and ran the same application through. Though the profiles differer in the order of what's on the top, the same cast of characters is present.
It is almost as if the cost of doing all of these varied operations just increased, not one method or one class.
Does anyone have any hints as to what might be going on or what can be done or tried?
--Luis Ortiz
|