guile-user
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Bug in Guile 2.2.6 parallel forms implementation?


From: Zelphir Kaltstahl
Subject: Re: Bug in Guile 2.2.6 parallel forms implementation?
Date: Fri, 17 Jan 2020 01:32:19 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2

Hi Guile Users!

I have now checked every single procedure, which is used in the code,
which runs in parallel, for any kind of mutation and the following is
the result, a list of everything that is used down to standard functions
of Guile like + - * / and a short explanation of what each procedure is
and as sub list elements other procedures, which are used inside the
procedures (org mode format, copied from my notes file, so ignore the
tilde characters, if not viewing in org mode):

~~~~START~~~~
- ~select-min-cost-split~
  - can raise a condition in case of wrong arguments (empty list of splits)
- ~get-best-split-for-column~
  - ~make-split~
    - is a constructor
    - only ever creates new struct instances
  - ~dataset-get-col~
    - ~dataset-map~
      - uses only ~map~ internally, is just part of an abstraction layer over 
lists
    - ~data-point-get-col~
      - is just an alias for ~vector-ref~
    - ~dataset-column-empty?~
      - is an alias for ~null?~, is part of an abstraction layer
    - ~dataset-column-first~
      - is an alias for ~car~, is part of an abstraction layer
    - ~split-data~
      - makes use of ~receive~, inside which the bug happened, that ~<~ got an 
argument of the wrong type
      - ~dataset-partition~
        - is a wrapper for ~partition~ (or actually an alias)
        - is part of the abstraction layer over lists
        - ~partition~
          - is a library function from ~srfi-1~
        - ~data-point-get-col~
          - alias for vector-ref
          - part of abstraction layer over vectors for data points
        - ~<~
          - standard mathematical function
        - ~list~
          - is the list constructor
          - standard function
    - ~split-quality-proc~
      - is in this case actually ~gini-index~, given as an argument
      - ~gini-index~
        - ~+~
          - standard math function
        - ~apply~
          - standard higher-order function
        - ~map~
          - standard higher-order function
        - ~calc-proportion~
          - ~dataset-empty?~
            - abstraction layer over lists
            - alias for ~null?~
          - ~dataset-length~
            - abstraction layer over lists
            - alias for ~length~
          - ~count~
            - SRFI-1 higher-order function
          - ~data-point-get-col~
            - alias for ~vector-ref~
            - part of abstraction layer over vectors for data points
          - ~/~
            - standard math function
          - ~*~
            - standard math function
          - ~-~
            - standard math function
    - ~iter-col-vals~
      - is a named let, defined in ~get-best-split-for-column~ itself
    - ~dataset-column-rest~
      - is an alias for ~cdr~
      - is part of an abstraction layer over lists
~~~~~END~~~~~

I was not able to find a single procedure, which I wrote, or usage of
mutation, which could cause problems, when I run the code in multiple
threads using parallel forms. This makes me think the following:

  * Perhaps some function from SRFI 1 is not suitable for running in
    parallel?
  * Perhaps the implementation of the parallel forms is not correct?
  * Perhaps the underlying threading code for parallel forms is not correct?
  * Perhaps recursively splitting into parallel executions for branches
    of the tree is not a use case foreseen for parallel forms?
  * Perhaps a language primitive is not usable in parallel code? (unlikely?)
  * Maybe the ~receive~ form is buggy in parallel execution.

My algorithm in itself should be correct. It works just fine and never
has any "wrong type" errors in non-concurrent execution. It also has
many tests and they cover most procedures I wrote and none of those fails.

I will look into making use of fibers as well. If it turns out, that no
error happens when I am using fibers, then that would be even more an
indication of parallel forms implementation containing some kind of
error. If there is still an error like the ones I described in my
previous e-mail, then perhaps the whole thing is about recursively
spawning futures and fibers, or some primitive I used does not work in
parallel for some reason.

What else can I try?

Regards,
Zelphir

On 1/14/20 10:55 PM, Zelphir Kaltstahl wrote:
> Hi Guile Users!
>
> For my project of implementing a decision tree algorithm, which I ported
> to Guile, I am trying ways of parallelizing the algorithm. Since I did
> not make use of mutation or global state in the algorithm, it should be
> fairly simple to parallelize: "Simply split execution into 2 parts at
> every split of the tree and let those run on different cores." – That's
> at least the theory. I am using Guile 2.2.6, installed via Guix package
> manager.
>
> However, I think there might be a bug in Guile's implementation of the
> parallel forms. There is still a small chance, that somehow the bug is
> in my code and I overlooked something, that would only matter in a
> concurrent or parallel scenario.
>
> The commit id is:
> https://notabug.org/ZelphirKaltstahl/guile-ml/src/d61a9535508264dd065760b20caa25a820ed00be
>
> I wrote a more detailed bug report in my todo file:
> https://notabug.org/ZelphirKaltstahl/guile-ml/src/d61a9535508264dd065760b20caa25a820ed00be/todo.org
> (view the non-rendered version, notabug does not render org files correctly)
>
> Here is why I am saying that there might be a bug in the parallel forms
> somewhere:
>
> * When I run the algorithm without concurrency or parallelism, no error
> happens. Especially all procedures receive correct types of arguments
> and run without problem.
> * As stated, I took care to not use mutation anywhere. I am not aware of
> any mutation in my code. I think almost, if not all, participating
> procedures have referential transparency.
> * The bug only happens sometimes and only when using parallel forms. (I
> have yet to try fibers and futures.)
> * The project has passing tests, which include fitting the decision
> tree. Although they do not use the same data set, the structure of the
> data is the same. No "wrong type" errors happen. The tests do not yet
> cover every single procedure, but the most important once, I think.
> * Once the parallel forms are involved, I get the following errors
> seemingly at random, perhaps approximately every 10th run of the very
> same script, which makes use of the algorithm. I seemingly get them more
> often when running on my core 2 duo machine than on my Lenovo Thinkpad
> X200, which has 2 cores. I can sometimes run the algorithm 5-10 times
> without error and then the next time I run it, an error will happen:
>
> --------8<--------8<--------
> scheme@(guile-user)> (define tree
>   (fit #:train-data (shuffle-dataset banking-dataset #:seed 12345)
>        #:feature-column-indices (list 0 1 2 3)
>        #:label-column-index 4
>        #:max-depth 5
>        #:min-data-points 12
>        #:min-data-points-ratio 0.02
>        #:min-impurity-split (expt 10 -7)
>        #:stop-at-no-impurity-improvement #t))
> recursive split on depth: 1
> will check for stop conditions now
> checking for stop condition: max-depth-reached?
> checking for stop condition: insufficient-data-points-for-split?
> checking for stop condition: insufficient-data-points-ratio-for-split?
> checking for stop condition: all-same-label?
> INFO: CONTINUING SPLITT: still got 1372 data points
> ice-9/threads.scm:289:22: In procedure loop:
> In procedure <: Wrong type argument in position 1: 0.39142673091293256
>
> Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
> -------->8-------->8--------
>
> This never happens when not running concurrently and also does not
> happen on every run. Also I think floats are OK for `<` to handle. I am
> using my own procedures with correct arguments. Somehow Guile gets confused.
>
> Here is another one:
>
> --------8<--------8<--------
> scheme@(guile-user)> (define tree
>   (fit #:train-data (shuffle-dataset banking-dataset #:seed 12345)
>        #:feature-column-indices (list 0 1 2 3)
>        #:label-column-index 4
>        #:max-depth 5
>        #:min-data-points 12
>        #:min-data-points-ratio 0.02
>        #:min-impurity-split (expt 10 -7)
>        #:stop-at-no-impurity-improvement #t))
> recursive split on depth: 1
> will check for stop conditions now
> checking for stop condition: max-depth-reached?
> checking for stop condition: insufficient-data-points-for-split?
> checking for stop condition: insufficient-data-points-ratio-for-split?
> checking for stop condition: all-same-label?
> INFO: CONTINUING SPLITT: still got 1372 data points
> ice-9/threads.scm:289:22: In procedure loop:
> In procedure <: Wrong type argument in position 1: 0.39142673091293256
>
> Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
> scheme@(guile-user) [1]> ,bt
> In current input:
>     161:2  3 (_)
> In decision-tree.scm:
>    218:18  2 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) 
> #(-1.3931 1.5664 7.5382 0.78403 0) #(-5.4414 7.2363 0.10938 -7.5642 1) 
> #(-2.7908 -5.7133 5.953 0.45946 1) #(-3.518 2.8763 0.1548 # 1) # …) …)
>    110:13  1 (get-best-split _ _ _ #:split-quality-proc _)
> In ice-9/threads.scm:
>    289:22  0 (loop _)
> scheme@(guile-user) [1]>
> -------->8-------->8--------
>
> It even claims that I cannot use `<` with a floating point number, which
> is weird, I think:
>
> In procedure <: Wrong type argument in position 1: 0.39142673091293256
>
> And here another one:
>
> --------8<--------8<--------
> [03:05:29]:[~/development/Guile/guile-ml]: guile --debug -L . run.scm
> recursive split on depth: 1
> will check for stop conditions now
> checking for stop condition: max-depth-reached?
> checking for stop condition: insufficient-data-points-for-split?
> checking for stop condition: insufficient-data-points-ratio-for-split?
> checking for stop condition: all-same-label?
> INFO: CONTINUING SPLITT: still got 1372 data points
> calculating best split for feature columns (0 1 2 3)
> inside get-best-split / let
> calculating best split for feature column 0
> in get-best-split-for-column with column 0
> calculating best split for feature column 1
> calculating best split for feature column 2
> in get-best-split-for-column with column 1
> in get-best-split-for-column with column 1
> in get-best-split-for-column with column 2
> calculating best split for feature column 3
> in get-best-split-for-column with column 3
> Backtrace:
>           13 (apply-smob/1 #<catch-closure 1cbb940>)
> In ice-9/boot-9.scm:
>     705:2 12 (call-with-prompt _ _ #<procedure default-prompt-handler (k 
> proc)>)
> In ice-9/eval.scm:
>     619:8 11 (_ (#(-0.2361 9.3221 2.1307 -4.3793 0) #(-2.4115 -9.1359 9.3444 
> -0.65259 1) #(# # # …) …))
> In ice-9/boot-9.scm:
>    2312:4 10 (save-module-excursion _)
>   3832:12  9 (_)
> In run.scm:
>      95:2  8 (_)
> In decision-tree.scm:
>    236:18  7 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) 
> #(-1.3931 1.5664 7.5382 # 0) # …) …)
>    124:13  6 (get-best-split _ _ _ #:split-quality-proc _)
> In ice-9/threads.scm:
>    288:21  5 (loop _)
> In decision-tree.scm:
>    128:30  4 (_ _)
>     86:34  3 (get-best-split-for-column (#(-2.7028 1.6327 0.83598 -0.091393 
> 1) #(-1.3931 1.5664 …) …) …)
>      31:8  2 (split-data _ _ _)
> In unknown file:
>            1 (partition #<procedure 1c67b40 at decision-tree.scm:31:27 
> (data-point)> (#(-2.7028 # …) …))
> In decision-tree.scm:
>     32:29  0 (_ _)
>
> decision-tree.scm:32:29: In procedure <: Wrong type: #(-0.72068 -6.7583 
> 5.8408 0.62369 1)
> -------->8-------->8--------
>
> The log shows at least a place in my code, where supposedly the wrong
> argument gets in. But there should not be such a wrong argument, as the
> structure of a data set for the algorithm is always a list of vectors as
> data points.
>
> After adding more debug logging another one:
>
> --------8<--------8<--------
> will compare the following values: 0.4565034338822441 and 0.5763181258315033
> wwill compare the following values: 0.4812939489026074 and 0.7189958135901167
> before recurring in iter-col-vals
> in select-min-cost-split
> will compare the following values: 0.48037557849768603 and 0.4984472198525817
> before recurring in iter-col-vals
> in select-min-cost-split
> will compare the following values: 0.4938631012588292 and 0.9781115994890177
> before recurring in iter-col-vals
> in select-min-cost-split
> will compare the following values: before recurring in 
> iter-col-vals0.48037557849768603 and  and e recurring in 
> iter-col-vals0.480375578497686030.6404180444583852
> in select-min-cost-split
> will compare the following values: 0.4565034338822441 and 0.5299176721713759
> before recurring in iter-col-vals
> in select-min-cost-split
> will compare the following values: (Segmentation fault (core dumped)
> -------->8-------->8--------
>
> That one is even a segmentation fault, which I thought could never
> happen from my algorithm, as I am only using "normal" Guile stuff +
> libraries, no low level manipulations of bits or fiddling with OS
> threads or anything like that. It should come somewhere from the
> underlying stuff of parallel forms. Or perhaps from too excessive
> logging using (display …).
>
> I also got one in combination with the usage of `<` and the segmentation
> fault:
>
> --------8<--------8<--------
> got back result for column
> Backtrace:
>            8 (apply-smob/1 #<catch-closure 130b940>)
> In ice-9/boot-9.scm:
>     705:2  7 (call-with-prompt _ _ #<procedure default-prompt-handler (k 
> proc)>)
> In ice-9/eval.scm:
>     619:8  6 (_ #(#(#<directory (guile-user) 1397140>)))
> In ice-9/boot-9.scm:
>    2312:4  5 (save-module-excursion _)
>   3832:12  4 (_)
> In run.scm:
>      95:2  3 (_)
> In decision-tree.scm:
>    239:18  2 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) 
> #(-1.3931 1.5664 7.5382 # 0) # …) …)
>    127:13  1 (get-best-split _ _ _ #:split-quality-proc _)
> In ice-9/threads.scm:
>    289:22  0 (loop _)
>
> ice-9/threads.scm:289:22: In procedure loop:
> In procedure <: Wrong type argument in position 1: Segmentation fault (core 
> dumped)
> -------->8-------->8--------
>
> None of this has ever happened when running single core, without
> concurrency. It is also strange, that it does not seem to happen
> deterministically at a specific time, like always at the beginning or
> first split, but can happen at any time during fitting the tree to the
> training data.
>
> I still have to try using fibers instead of parallel forms.
>
> I don't really know how to best debug this, besides adding debug logs
> and seeing how far it comes before getting the errors. But that is not
> reliable, because multiple threads try to display logs and the outputs
> could be mixed. I already have test cases and they all pass.
>
> Any advice on how to debug such an issue is welcome. I've looked at this
> for hours, but so far could not find anything standing out, possibly
> causing this bug in my code.
>
> Regards,
> Zelphir
>


reply via email to

[Prev in Thread] Current Thread [Next in Thread]