guile-user

Re: map-par slower than map


From: Damien Mattei
Subject: Re: map-par slower than map
Date: Sun, 23 Oct 2022 03:06:01 +0200

After intense coding I finally got good results. My assumption about the
global hash table was right: it is incompatible with 'future'; the
competition between threads writing into the hash table breaks the code.

If the writes to the hash table are done outside the // (parallel)
region, everything is OK:

A simple function that isolates the hash-table access:

(define (tag-minterms i umt)
  (when umt
     {mt1 <+ (first {minterms-vector[i]})}
     {mt2 <+ (second {minterms-vector[i]})}
     {minterms-ht[mt1] <- #t}
     {minterms-ht[mt2] <- #t}))
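The curly-brace forms above are the author's infix (Scheme+) syntax. For readers who only know stock Guile, here is a rough equivalent of tag-minterms; the sample minterms-vector and minterms-ht definitions are placeholders invented just to make the sketch self-contained:

```scheme
(use-modules (srfi srfi-1))   ; first, second

;; Placeholder data, invented for this sketch: in the real program
;; minterms-vector holds two-element lists of minterms.
(define minterms-vector (vector (list 'a 'b) (list 'c 'd)))
(define minterms-ht (make-hash-table))

;; Plain-Guile reading of tag-minterms: mark both minterms of cell i
;; as seen in the shared hash table.
(define (tag-minterms i umt)
  (when umt
    (let ((mts (vector-ref minterms-vector i)))
      (hash-set! minterms-ht (first mts) #t)
      (hash-set! minterms-ht (second mts) #t))))
```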



;; proc to be called with futures
(define (proc-unify-minterms-seg seg)

  {function-unify-minterms-list <+ (λ (L) (apply function-unify-two-minterms L))}
  ;;{function-unify-minterms-list <+ (λ (L) (apply function-unify-two-minterms-and-tag L))}

  {start <+ (segment-start seg)}
  {end <+ (segment-end seg)}
  (for ({i <+ start} {i <= end} {i <- {i + 1}})
       {mtL <+ {minterms-vector[i]}}
       (nodebug
          (dv mtL))
       {unified-minterms-vector-1[i] <- (function-unify-minterms-list mtL)}))


(declare minterms-vector unified-minterms-vector-1)
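For context, the segment/run-in-parallel machinery used above can be approximated in a few lines of stock Guile. The names here (split-range, run-segments!) and the segment representation are invented for this sketch; they are not the definitions from logiki+.scm:

```scheme
(use-modules (ice-9 futures))  ; future, touch

;; Split the index range [0, n) into at most nb-workers contiguous
;; segments, returned as (start . end) pairs, end exclusive.
(define (split-range n nb-workers)
  (let ((size (max 1 (quotient (+ n nb-workers -1) nb-workers))))
    (let loop ((start 0) (acc '()))
      (if (>= start n)
          (reverse acc)
          (let ((end (min n (+ start size))))
            (loop end (cons (cons start end) acc)))))))

;; Run work! on each segment in its own future, then wait for all of
;; them.  work! receives a start (inclusive) and end (exclusive) index.
(define (run-segments! n nb-workers work!)
  (let ((futs (map (lambda (seg)
                     (future (work! (car seg) (cdr seg))))
                   (split-range n nb-workers))))
    (for-each touch futs)))
```

As in the author's code, correctness rests on each worker writing only to indices inside its own segment.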

https://github.com/damien-mattei/library-FunctProg/blob/master/guile/logiki%2B.scm#L2893

and the call to the code that uses the hash table is now completely
outside the // region:

 (debug
   (display-nl "before //"))

  (run-in-parallel segmts proc-unify-minterms-seg) ;; run the parallel code

  (debug
   (display-nl "after //"))

  (vector-for-each tag-minterms unified-minterms-vector-1) ;; tag the minterms in the hash table

With a partitioned vector it works, because no single cell is ever
accessed by two different threads.
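That structure — futures writing only their own vector cells, the shared hash table mutated only after every future has been touched — can be reduced to a minimal sketch (the names and the squaring workload are invented for illustration):

```scheme
(use-modules (ice-9 futures))

(define results (make-vector 8 #f))
(define seen-ht (make-hash-table))

;; Parallel phase: each future owns exactly one cell of `results`,
;; so no two threads ever write the same location.
(define futs
  (map (lambda (i) (future (vector-set! results i (* i i))))
       (iota 8)))
(for-each touch futs)   ; join all futures: the // region ends here

;; Serial phase: only now is it safe to mutate the shared hash table.
(for-each (lambda (x) (hash-set! seen-ht x #t))
          (vector->list results))
```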

Other features, like continuations, seem to work with 'future'. I wanted
to rewrite the call/cc code used inside 'future', but it would have
taken too long, and apparently it works anyway; in any case I do not see
why call/cc would cause trouble in a thread...

I think 'futures' are good if you use them carefully, like OpenMP
parallel regions in other languages.

The code is very, very fast now: I reached C15 in a few minutes on an
Apple M1:
scheme@(guile-user)>  (current-processor-count)
$1 = 8

C15 = ((B1 ∧ B3 ∧ B4 ∧ B6 ∧ B7 ∧ B9 ∧ B10 ∧ B12 ∧ B13) ∨ (B1 ∧ B3 ∧ B4 ∧ B6
∧ B7 ∧ B9 ∧ B10 ∧ B12 ∧ B14) ∨ (B1 ∧ B3 ∧ B4 ∧ B6 ∧ B7 ∧ B9 ∧ B11 ∧ B12 ∧
B14) ∨ (B1 ∧ B3 ∧ B4 ∧ B6 ∧ B7 ∧ B9 ∧ B11 ∧ B13) ∨ (B1 ∧ B3 ∧ B4 ∧ B6 ∧ B8
∧ B9 ∧ B11 ∧ B12 ∧ B14) ∨ (B1 ∧ B3 ∧ B4 ∧ B6 ∧ B8 ∧ B9 ∧ B11 ∧ B13) ∨ (B1 ∧
B3 ∧ B4 ∧ B6 ∧ B8 ∧ B10 ∧ B11 ∧ B13) ∨ (B1 ∧ B3 ∧ B4 ∧ B6 ∧ B8 ∧ B10 ∧ B12
∧ B13) ∨ (B1 ∧ B3 ∧ B4 ∧ B6 ∧ B8 ∧ B10 ∧ B12 ∧ B14) ∨ (B1 ∧ B3 ∧ B5 ∧ B6 ∧
B8 ∧ B9 ∧ B11 ∧ B12 ∧ B14) ∨ (B1 ∧ B3 ∧ B5 ∧ B6 ∧ B8 ∧ B9 ∧ B11 ∧ B13) ∨
(B1 ∧ B3 ∧ B5 ∧ B6 ∧ B8 ∧ B10 ∧ B11 ∧ B13) ∨ (B1 ∧ B3 ∧ B5 ∧ B6 ∧ B8 ∧ B10
∧ B12 ∧ B13) ∨ (B1 ∧ B3 ∧ B5 ∧ B6 ∧ B8 ∧ B10 ∧ B12 ∧ B14) ∨ (B1 ∧ B3 ∧ B5 ∧
B7 ∧ B8 ∧ B10 ∧ B11 ∧ B13) ∨ (B1 ∧ B3 ∧ B5 ∧ B7 ∧ B8 ∧ B10 ∧ B12 ∧ B13) ∨
(B1 ∧ B3 ∧ B5 ∧ B7 ∧ B8 ∧ B10 ∧ B12 ∧ B14) ∨ (B1 ∧ B3 ∧ B5 ∧ B7 ∧ B9 ∧ B10
∧ B12 ∧ B13) ∨ (B1 ∧ B3 ∧ B5 ∧ B7 ∧ B9 ∧ B10 ∧ B12 ∧ B14) ∨ (B1 ∧ B3 ∧ B5 ∧
B7 ∧ B9 ∧ B11 ∧ B12 ∧ B14) ∨ (B1 ∧ B3 ∧ B5 ∧ B7 ∧ B9 ∧ B11 ∧ B13) ∨ (B2 ∧
B3 ∧ B5 ∧ B6 ∧ B8 ∧ B9 ∧ B11 ∧ B12 ∧ B14) ∨ (B2 ∧ B3 ∧ B5 ∧ B6 ∧ B8 ∧ B9 ∧
B11 ∧ B13) ∨ (B2 ∧ B3 ∧ B5 ∧ B6 ∧ B8 ∧ B10 ∧ B11 ∧ B13) ∨ (B2 ∧ B3 ∧ B5 ∧
B6 ∧ B8 ∧ B10 ∧ B12 ∧ B13) ∨ (B2 ∧ B3 ∧ B5 ∧ B6 ∧ B8 ∧ B10 ∧ B12 ∧ B14) ∨
(B2 ∧ B3 ∧ B5 ∧ B7 ∧ B8 ∧ B10 ∧ B11 ∧ B13) ∨ (B2 ∧ B3 ∧ B5 ∧ B7 ∧ B8 ∧ B10
∧ B12 ∧ B13) ∨ (B2 ∧ B3 ∧ B5 ∧ B7 ∧ B8 ∧ B10 ∧ B12 ∧ B14) ∨ (B2 ∧ B3 ∧ B5 ∧
B7 ∧ B9 ∧ B10 ∧ B12 ∧ B13) ∨ (B2 ∧ B3 ∧ B5 ∧ B7 ∧ B9 ∧ B10 ∧ B12 ∧ B14) ∨
(B2 ∧ B3 ∧ B5 ∧ B7 ∧ B9 ∧ B11 ∧ B12 ∧ B14) ∨ (B2 ∧ B3 ∧ B5 ∧ B7 ∧ B9 ∧ B11
∧ B13) ∨ (B2 ∧ B4 ∧ B5 ∧ B7 ∧ B8 ∧ B10 ∧ B11 ∧ B13) ∨ (B2 ∧ B4 ∧ B5 ∧ B7 ∧
B8 ∧ B10 ∧ B12 ∧ B13) ∨ (B2 ∧ B4 ∧ B5 ∧ B7 ∧ B8 ∧ B10 ∧ B12 ∧ B14) ∨ (B2 ∧
B4 ∧ B5 ∧ B7 ∧ B9 ∧ B10 ∧ B12 ∧ B13) ∨ (B2 ∧ B4 ∧ B5 ∧ B7 ∧ B9 ∧ B10 ∧ B12
∧ B14) ∨ (B2 ∧ B4 ∧ B5 ∧ B7 ∧ B9 ∧ B11 ∧ B12 ∧ B14) ∨ (B2 ∧ B4 ∧ B5 ∧ B7 ∧
B9 ∧ B11 ∧ B13) ∨ (B2 ∧ B4 ∧ B6 ∧ B7 ∧ B9 ∧ B10 ∧ B12 ∧ B13) ∨ (B2 ∧ B4 ∧
B6 ∧ B7 ∧ B9 ∧ B10 ∧ B12 ∧ B14) ∨ (B2 ∧ B4 ∧ B6 ∧ B7 ∧ B9 ∧ B11 ∧ B12 ∧
B14) ∨ (B2 ∧ B4 ∧ B6 ∧ B7 ∧ B9 ∧ B11 ∧ B13) ∨ (B2 ∧ B4 ∧ B6 ∧ B8 ∧ B9 ∧ B11
∧ B12 ∧ B14) ∨ (B2 ∧ B4 ∧ B6 ∧ B8 ∧ B9 ∧ B11 ∧ B13) ∨ (B2 ∧ B4 ∧ B6 ∧ B8 ∧
B10 ∧ B11 ∧ B13) ∨ (B2 ∧ B4 ∧ B6 ∧ B8 ∧ B10 ∧ B12 ∧ B13) ∨ (B2 ∧ B4 ∧ B6 ∧
B8 ∧ B10 ∧ B12 ∧ B14))

and it consumes more than 2 GB of memory:


PID    COMMAND  %CPU   TIME      #TH   #WQ  #PORT  MEM      PURG  CMPRS  PGRP   PPID   STATE    BOOSTS  %CPU_ME
15081  guile    442.8  57:24.28  17/8  0    44     *2423M*  0B    118M   15081  10551  running  *0[1]   0.0000

I checked part of the logical computation with Mathematica online and it
was OK, but I will write a Python program to check the logical
expression again and compare speed with SymPy...

As for the space complexity, I do not think it can be reduced.

good night all

On Sat, Oct 22, 2022 at 6:01 PM Damien Mattei <damien.mattei@gmail.com>
wrote:

> just for fun: I just found that the parallelized results are false :-)
> even for low indices:
> the non-// result: C5 = ((B1 ∧ B3) ∨ (B2 ∧ B3) ∨ (B2 ∧ B4))
> is different from the parallelized result below:
> C5 = ((B1 ∧ B3) ∨ (B2 ∧ B3) ∨ (B2 ∧ B4) ∨ (B3 ∧ B4))
>
> I have to check things again. I think futures are easy to use but
> limited in solving acute // problems, such as those that use a global
> variable like a hash table...
>
> Damien
>
>
>
> On Mon, Oct 17, 2022 at 3:17 PM Damien Mattei <damien.mattei@gmail.com>
> wrote:
>
>> Hello,
>>
>> sorry for my late answer (I wrote the code Friday but could only debug
>> it today, being busy and traveling last weekend)
>>
>> I modified my code to work with 'futures'; the speed is now incredible
>> !!! (and the results are exact). The speed seems multiplied by even
>> more than the number of CPUs (6):
>> cat /proc/cpuinfo  | grep processor
>> processor : 0
>> processor : 1
>> processor : 2
>> processor : 3
>> processor : 4
>> processor : 5
>>
>>
>> https://github.com/damien-mattei/library-FunctProg/blob/master/guile/logiki%2B.scm#L1900
>>
>> (declare minterms-vector unified-minterms-vector-1)
>>
>> (define (funct-unify-minterms-set-1-unit-future set1 set2)
>>
>>   {function-unify-minterms-list <+ (λ (L) (apply function-unify-two-minterms-and-tag L))}
>>
>>   ;; note : sorting is useless
>>
>>   ;; set multiplication : create a list of pairs of minterms
>>   {minterms-set <+ (product-set-with-set-imperative-sorted set1 set2)}
>>   ;; alternatives: (product-set-with-set-imperative set1 set2)
>>   ;;               (product-set-with-set set1 set2)
>>   ;;               (associate-set-with-set set1 set2)
>>
>>   {minterms-vector <- (list->vector minterms-set)}
>>
>>   {minterms-vector-length <+ (vector-length minterms-vector)}
>>
>>   {nb-procs <+ (current-processor-count)}
>>
>>   {segmts <+ (segment 0 {minterms-vector-length - 1} nb-procs)} ;; compute the segments
>>
>>   {unified-minterms-vector-1 <- (make-vector minterms-vector-length #f)}
>>
>>   (run-in-parallel segmts proc-unify-minterms-seg) ;; run the parallel code
>>
>>   {unified-minterms-set-1 <+ (vector->list unified-minterms-vector-1)}
>>
>>   {unified-minterms-set-2 <+ (filter (λ (x) x) unified-minterms-set-1)} ;; remove #f results
>>
>>   {unified-minterms-set <+ (remove-duplicates-sorted unified-minterms-set-2)} ;; uniq
>>
>>   unified-minterms-set)
>>
>> I kept your 'segment' definitions to index an array of data after
>> converting it from a list to a vector (list->vector), which probably
>> consumes additional time on big data lists (I had lists of more than
>> 1000000 elements at some point).
>>
>> I used a simplified version of run-in-parallel (one that does not
>> 'reduce' after processing the data):
>>
>>
>> https://github.com/damien-mattei/library-FunctProg/blob/master/guile/logiki%2B.scm#L1794
>>
>> the computation on a segment is done by those functions:
>>
>>
>> https://github.com/damien-mattei/library-FunctProg/blob/master/guile/logiki%2B.scm#L1842
>>
>> {function-unify-minterms-list <+ (λ (L) (apply function-unify-two-minterms-and-tag L))}
>>
>> ;; proc to be called with futures
>> (define (proc-unify-minterms-seg seg)
>>   {start <+ (segment-start seg)}
>>   {end <+ (segment-end seg)}
>>   (for ({i <+ start} {i <= end} {i <- {i + 1}})
>>        {mtL <+ {minterms-vector[i]}}
>>        {unified-minterms-vector-1[i] <- (function-unify-minterms-list mtL)}))
>>
>>
>> I use this code in another program to symbolically compute a value
>> named Cn:
>>
>> C0 = T
>> C1 = T
>> C2 = T
>> C3 = (B1 ∨ B2)
>> C4 = (B2 ∨ (B1 ∧ B3))
>> C5 = ((B1 ∧ B3) ∨ (B2 ∧ B3) ∨ (B2 ∧ B4) ∨ (B3 ∧ B4))
>> C6 = ((B1 ∧ B3 ∧ B5) ∨ (B2 ∧ B3 ∧ B5) ∨ (B2 ∧ B4) ∨ (B3 ∧ B4) ∨ (B4 ∧ B5))
>> C7 = ((B1 ∧ B3 ∧ B5) ∨ (B2 ∧ B3 ∧ B5) ∨ (B2 ∧ B4 ∧ B6) ∨ (B3 ∧ B4 ∧ B6) ∨
>> (B4 ∧ B5) ∨ (B5 ∧ B6))
>> C8 = ((B1 ∧ B3 ∧ B5 ∧ B7) ∨ (B2 ∧ B3 ∧ B5 ∧ B7) ∨ (B2 ∧ B4 ∧ B6) ∨ (B3 ∧
>> B4 ∧ B6) ∨ (B4 ∧ B5 ∧ B7) ∨ (B5 ∧ B6) ∨ (B6 ∧ B7))
>> C9 = ((B1 ∧ B3 ∧ B5 ∧ B7) ∨ (B2 ∧ B3 ∧ B5 ∧ B7) ∨ (B2 ∧ B4 ∧ B6 ∧ B8) ∨
>> (B3 ∧ B4 ∧ B6 ∧ B8) ∨ (B4 ∧ B5 ∧ B7) ∨ (B5 ∧ B6 ∧ B8) ∨ (B6 ∧ B7) ∨ (B7 ∧
>> B8))
>>
>> C1 to C9 used to be fast: less than a minute for the whole set, with
>> or without //.
>>
>> C10 = ((B1 ∧ B3 ∧ B5 ∧ B7 ∧ B9) ∨ (B2 ∧ B3 ∧ B5 ∧ B7 ∧ B9) ∨ (B2 ∧ B4 ∧
>> B6 ∧ B8) ∨ (B3 ∧ B4 ∧ B6 ∧ B8) ∨ (B4 ∧ B5 ∧ B7 ∧ B9) ∨ (B5 ∧ B6 ∧ B8) ∨ (B6
>> ∧ B7 ∧ B9) ∨ (B7 ∧ B8) ∨ (B8 ∧ B9))
>>
>> C10 takes a few minutes,
>>
>> but C11 used to take one day before the parallelization with 'future',
>> and now I got the result in less than one hour!!!
>>
>> C11 = ((B1 ∧ B3 ∧ B5 ∧ B7 ∧ B9) ∨ (B10 ∧ B2 ∧ B4 ∧ B6 ∧ B8) ∨ (B10 ∧ B3 ∧
>> B4 ∧ B6 ∧ B8) ∨ (B10 ∧ B5 ∧ B6 ∧ B8) ∨ (B10 ∧ B7 ∧ B8) ∨ (B10 ∧ B9) ∨ (B2 ∧
>> B3 ∧ B5 ∧ B7 ∧ B9) ∨ (B4 ∧ B5 ∧ B7 ∧ B9) ∨ (B6 ∧ B7 ∧ B9) ∨ (B8 ∧ B9))
>>
>> I never got C12 in the past, even with 7 days of computation! And I
>> got it today during the lunch break!!!
>>
>> C12 = ((B1 ∧ B11 ∧ B3 ∧ B5 ∧ B7 ∧ B9) ∨ (B10 ∧ B11) ∨ (B10 ∧ B2 ∧ B4 ∧ B6
>> ∧ B8) ∨ (B10 ∧ B3 ∧ B4 ∧ B6 ∧ B8) ∨ (B10 ∧ B5 ∧ B6 ∧ B8) ∨ (B10 ∧ B7 ∧ B8)
>> ∨ (B10 ∧ B9) ∨ (B11 ∧ B2 ∧ B3 ∧ B5 ∧ B7 ∧ B9) ∨ (B11 ∧ B4 ∧ B5 ∧ B7 ∧ B9) ∨
>> (B11 ∧ B6 ∧ B7 ∧ B9) ∨ (B11 ∧ B8 ∧ B9))
>>
>> (note that a wise person could find a general formula for Cn based on
>> Cn-1, but that is another (mathematical) story....)
>>
>> I'm computing C13, but I see that it consumes 12 GB out of the 16 GB
>> of my system! It stopped on the Linux box:
>> GC Warning: Repeated allocation of very large block (appr. size
>> 510935040):
>> May lead to memory leak and poor performance
>> Processus arrêté  [process killed]
>> but it is still running on the Mac laptop... OK, I will look into
>> that, but I think the data set is now huge and there is nothing that
>> can be done about it...
>>
>> Note that there is still a hash table used in
>> function-unify-two-minterms-and-tag; I will use another data structure
>> and algorithm to avoid that. I feared that the concurrent access to
>> the hash table could cause the Guile 'future' to crash, fail, or slow
>> down the process, but no! The result is incredibly fast. Also, I use
>> continuations, and I read that they can cause problems with 'future';
>> I will improve that too...
>>
>> I will see where I can improve the algorithm and data structures to
>> optimize further, but this is already really good.
>>
>> Thanks
>>
>> Damien
>>
>>
>> On Fri, Oct 14, 2022 at 10:39 AM Zelphir Kaltstahl <
>> zelphirkaltstahl@posteo.de> wrote:
>>
>>> Hello Damien,
>>> On 10/14/22 10:21, Damien Mattei wrote:
>>>
>>> yes Zelphir, I think there is a problem in the threading part of
>>> Guile: whatever the code is, it runs well the first time and is
>>> unstable afterwards. Long ago, at the very beginning of the Java
>>> language, on my master's project at university, I experienced the
>>> same problem with Java "green" threads, under Windows and/or Linux,
>>> I can't remember which.
>>>
>>> I'm trying your code and futures, which perhaps also offer
>>> portability to other Schemes; futures can provide "light" //, and
>>> with careful code it could be OK.
>>>
>>> In your segment.scm code, can segment-count be replaced by the number
>>> of available CPUs or nodes, if I understand correctly?
>>>
>>> Regards,
>>> Damien
>>>
>>> Correct. For each segment one future is used.
>>>
>>> However, there are scenarios where one would rather interleave the
>>> numbers of the whole range, to get a "fairer" workload per future,
>>> for example:
>>>
>>> (1 2 3 4 5 6 7 8 9)
>>>
>>> -> (1 4 7), (2 5 8), (3 6 9)
>>>
>>> instead of
>>>
>>> -> (1 2 3) (4 5 6) (7 8 9)
>>>
>>> (I seem to remember this to be the case for Mandelbrot set calculations,
>>> but might be wrong.)
>>>
>>> But that is all a matter of forming segments and of what a segment
>>> is, not really part of the logic of creating futures. So one could
>>> create them in any way that fits one's use case. I have not
>>> generalized the segment logic that far, but it is doable.
>>>
>>> Anyway, if anyone more knowledgeable could chime in on what the
>>> performance differences between futures and parallel map are, that would be
>>> great! Or pointing out some flaws that this kind of future usage might
>>> have. Or when the point is reached to switch to guile-fibers library.
>>>
>>> Regards,
>>> Zelphir
>>>
>>> --
>>> repositories: https://notabug.org/ZelphirKaltstahl
>>>
>>>
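The interleaved ("round-robin") partitioning Zelphir describes in the quoted message can be sketched with a small helper. This is hypothetical illustration code, not part of the posted segment.scm:

```scheme
;; Round-robin split of a list into n sublists: worker k receives the
;; elements at positions k, k+n, k+2n, ...  This balances the load when
;; the cost of an element grows with its position (as is said to happen
;; for Mandelbrot set rows).
(define (interleave-split lst n)
  (map (lambda (k)
         (let loop ((xs lst) (i 0) (acc '()))
           (cond ((null? xs) (reverse acc))
                 ((= (modulo i n) k)
                  (loop (cdr xs) (+ i 1) (cons (car xs) acc)))
                 (else (loop (cdr xs) (+ i 1) acc)))))
       (iota n)))
```

With the example from the message, (interleave-split '(1 2 3 4 5 6 7 8 9) 3) yields ((1 4 7) (2 5 8) (3 6 9)) instead of the contiguous ((1 2 3) (4 5 6) (7 8 9)).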

