bug-texinfo
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

C++ hash map usage


From: Gavin Smith
Subject: C++ hash map usage
Date: Mon, 14 Oct 2024 13:25:29 +0100

On Sun, Oct 06, 2024 at 02:28:02PM +0200, Patrice Dumas wrote:
> On Sun, Oct 06, 2024 at 12:09:45PM +0100, Gavin Smith wrote:
> > On Sun, Oct 06, 2024 at 09:53:22AM +0200, Patrice Dumas wrote:
> > > > How much slower would the linear search actually be?
> > > 
> > > It is much slower (if I recall well, it was something like 100 times
> > > slower for the texi2any manual).  I juste tested the overall effect and
> > > for the pure C demonstrator for the Texinfo manual, time is about 0.375
> > > s with linear search and 0.305 with the hash map.
> 
> > Does it scale with the size of the input?  Perhaps a certain part of
> > this 0.07 sec is fixed regardless of the size of the input file.  Have
> > you tried it with a larger manual like glibc or elisp?
> 
> I think that it should be worse for bigger manuals, as the linear search
> time increases with the number of entries for each search, while the
> hashmap time traversal should be less dependent to the number of
> entries.
> 
> For the elisp manual, the times are 1.12s with hashmaps and 1.51s
> with linear search.

I get a similar comparison.

I understand that the C++ hash maps make a large difference in
relative terms, but it is still no more than half a second shorter run
time.  I doubt this is worth the extra complexity of having part of
the source code in C++. 

Even if there is back-up code if a C++ compiler isn't available, it will
tend not be tested and may become buggy and/or inefficient over time.  The
presence of C++ code will also make more C++ more likely in the future.

It may be possible to get almost the same amount of efficiency with a
relatively small amount of C code.

I thought it would be interesting to see the hash usage for some manuals
(tested with teximakehtml):

manual  number of keys          number of buckets
elisp   9779                    10273
libc    8018                    10273
texinfo 3997                    5087
gawk    5079                    5087
m4      793                     1109

(groff.texi gave a segmentation fault - I haven't investigated.)

These numbers do not seem so high that we need an ultra-optimised
algorithm (developed by C++ boffins) to deal with them - anything faster
than a linear search would probably good enough.

I have some spare time over the next couple of days so may be able
to come up with something.

We do not need the full level of generality that a general purpose library
in C would likely provide as the hashmap is from std::string (char * from
the C code) and maps to an int, and moreover keys only take the value 1 or
are undefined.

Tested with the following patch:

diff --git a/tp/Texinfo/XS/convert/call_html_cxx_function.cpp 
b/tp/Texinfo/XS/convert/call_html_cxx_function.cpp
index 429d0ab600..4d174da100 100644
--- a/tp/Texinfo/XS/convert/call_html_cxx_function.cpp
+++ b/tp/Texinfo/XS/convert/call_html_cxx_function.cpp
@@ -52,6 +52,7 @@ hashmap_register_id (CONVERTER *self, const char *in_string)
   CXX_HASHMAP *registered_id = (CXX_HASHMAP *)self->registered_ids_hashmap;
   std::string string (in_string);
   (*registered_id)[string] = 1;
+  std::cout << registered_id->size() << ":" << registered_id->bucket_count() 
<< "\n";
 }
 
 void
diff --git a/tp/Texinfo/XS/teximakehtml.c b/tp/Texinfo/XS/teximakehtml.c
index 16029d4bd8..9f91ae5a06 100644
--- a/tp/Texinfo/XS/teximakehtml.c
+++ b/tp/Texinfo/XS/teximakehtml.c
@@ -314,9 +314,8 @@ main (int argc, char *argv[])
                                    &convert_options,
    /* default, use C++ hashmap if available */
                                    0);
-   /* to test linear search
-                                   CONVF_string_list);
-    */
+   /* to test linear search */
+                                   //CONVF_string_list);
 
   free_strings_list (&converter_texinfo_language_config_dirs);
   free_strings_list (&texinfo_language_config_dirs);




reply via email to

[Prev in Thread] Current Thread [Next in Thread]