bug#50672: python-pytorch is not reproducible


From: Ludovic Courtès
Subject: bug#50672: python-pytorch is not reproducible
Date: Sun, 19 Sep 2021 11:57:14 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Bad news!

--8<---------------cut here---------------start------------->8---
$ guix challenge python-pytorch
/gnu/store/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0 contents differ:
  no local build for '/gnu/store/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0'
  https://ci.guix.gnu.org/nar/lzip/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0: 0i55iwy3z4da4lhn93dnrmz775s9ga5kyfli6cmrchacacf9xfpq
  https://bordeaux.guix.gnu.org/nar/lzip/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0: 1fl2v4pd0gcw7wp5k662q0zd4lvvzsggcm5ii8b4kq4v6synhkic
  differing file:
    /lib/python3.8/site-packages/torch/lib/libtorch_cpu.so

1 store items were analyzed:
  - 0 (0.0%) were identical
  - 1 (100.0%) differed
  - 0 (0.0%) were inconclusive
$ guix describe 
Generation 189   Aug 30 2021 12:09:27    (current)
  guix f91ae94
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: f91ae9425bb385b60396a544afe27933896b8fa3
--8<---------------cut here---------------end--------------->8---

The file is 165 MiB and Diffoscope (which reads the output of ‘objdump’)
takes forever on it.

However, by comparing the output of ‘strings’ on each file, we get a
hint:

diff -ubBr --show-c-function /tmp/str2 /tmp/str1
--- /tmp/str2   2021-09-19 11:14:47.806798779 +0200
+++ /tmp/str1   2021-09-19 11:14:41.962761127 +0200
@@ -1100584,472 +1100584,472 @@ compute_fast_convolution_input_gradient
 compute_grad_kernel_transform
 compute_fast_convolution_kernel_gradient.isra.0
 compute_fast_convolution_output
-nnp_fft8x8_with_offset_and_stream__avx2.__local0
-nnp_fft8x8_with_offset_and_stream__avx2.__local13
-nnp_fft8x8_with_offset_and_stream__avx2.__local18
-nnp_fft8x8_with_offset_and_stream__avx2.__local1
+nnp_fft8x8_with_offset_and_stream__avx2.__local5
 nnp_fft8x8_with_offset_and_stream__avx2.__local16
+nnp_fft8x8_with_offset_and_stream__avx2.__local6
+nnp_fft8x8_with_offset_and_stream__avx2.__local11
+nnp_fft8x8_with_offset_and_stream__avx2.__local0
 nnp_fft8x8_with_offset_and_stream__avx2.__local2
 nnp_fft8x8_with_offset_and_stream__avx2.__local7
-nnp_fft8x8_with_offset_and_stream__avx2.__local17
-nnp_fft8x8_with_offset_and_stream__avx2.__local10
-nnp_fft8x8_with_offset_and_stream__avx2.__local8
 nnp_fft8x8_with_offset_and_stream__avx2.__local15
+nnp_fft8x8_with_offset_and_stream__avx2.__local8
 nnp_fft8x8_with_offset_and_stream__avx2.__local3
-nnp_fft8x8_with_offset_and_stream__avx2.__local6
-nnp_fft8x8_with_offset_and_stream__avx2.__local14
-nnp_fft8x8_with_offset_and_stream__avx2.__local9
+nnp_fft8x8_with_offset_and_stream__avx2.__local1
 nnp_fft8x8_with_offset_and_stream__avx2.__local4
[…]
 nnp_shdotxf8__avx2.__local13
-nnp_shdotxf8__avx2.__local15
 nnp_shdotxf8__avx2.__local0
+nnp_shdotxf8__avx2.__local9
+nnp_shdotxf8__avx2.__local10
+nnp_shdotxf8__avx2.__local11
+nnp_shdotxf8__avx2.__local12
+nnp_shdotxf8__avx2.__local2
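
The ‘strings’ comparison above can be approximated in Python when the tool is not at hand.  This is only a rough sketch: the helper names are made up for illustration, the extraction is a simplified stand-in for ‘strings’ (runs of printable ASCII, no tabs), and ‘difflib’ is likely to be slow on a 165 MiB file.  The order-insensitive check is an extra diagnostic, not something done in the analysis above: it tells whether both files contain the same strings in a different order.

```python
import collections
import difflib
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least MIN_LEN printable ASCII bytes, roughly like 'strings'."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def diff_strings(a: bytes, b: bytes) -> List[str]:
    """Unified diff of the printable strings found in A and B."""
    return list(difflib.unified_diff(printable_strings(a), printable_strings(b),
                                     "a", "b", lineterm=""))


def same_strings_ignoring_order(a: bytes, b: bytes) -> bool:
    """True if A and B contain the same strings, possibly in a different order."""
    return (collections.Counter(printable_strings(a))
            == collections.Counter(printable_strings(b)))
```

To use it, read both copies of libtorch_cpu.so in binary mode and pass the two byte strings to diff_strings or same_strings_ignoring_order.
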

This appears to come from NNPACK, one of the libraries that are still
bundled.  These functions seem to be generated by Python scripts that
use PeachPy, such as NNPACK/src/x86_64-fma/2d-fourier-8x8.py:

--8<---------------cut here---------------start------------->8---
for post_operation in ["stream", "store"]:
    fft8x8_arguments = (arg_t_pointer, arg_f_pointer, arg_t_stride, arg_f_stride, arg_row_count, arg_column_count, arg_row_offset, arg_column_offset)
    with Function("nnp_fft8x8_with_offset_and_{post_operation}__avx2".format(post_operation=post_operation),
        fft8x8_arguments, target=uarch.default + isa.fma3 + isa.avx2):
[…]
--8<---------------cut here---------------end--------------->8---


The ‘__local’ bit in the name comes from PeachPy, in peachpy/name.py:

--8<---------------cut here---------------start------------->8---
            suffixed_name = "__local" + str(suffix)
            for name_object in iter(unnamed_objects):
                # Generate a non-conflicting name by appending a suffix
                while suffixed_name in self.names:
                    suffix += 1
                    suffixed_name = "__local" + str(suffix)
--8<---------------cut here---------------end--------------->8---

So the problem may be that these names get generated in parallel, and
thus the numbering is non-deterministic.
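
To illustrate why the order matters, here is a toy model of that naming loop.  It is not PeachPy's actual code: the function name and the label names are hypothetical, and only the suffix logic mirrors the excerpt above.  It shows that the same set of ‘__localN’ names comes out either way, but which object gets which number depends on the order in which unnamed objects are visited.  Whether that order varies because of parallelism or for some other reason is not established here.

```python
def assign_local_names(unnamed_objects, taken=()):
    """Toy allocator: give each unnamed object the next free '__localN' name."""
    names = set(taken)
    assigned = {}
    suffix = 0
    for obj in unnamed_objects:
        suffixed_name = "__local" + str(suffix)
        # Generate a non-conflicting name by bumping the suffix.
        while suffixed_name in names:
            suffix += 1
            suffixed_name = "__local" + str(suffix)
        names.add(suffixed_name)
        assigned[obj] = suffixed_name
    return assigned


# Hypothetical unnamed labels, visited in two different orders.
labels = ["loop_head", "loop_tail", "exit"]
first = assign_local_names(labels)
second = assign_local_names(list(reversed(labels)))

# Same set of names, different assignment.
print(sorted(first.values()) == sorted(second.values()))
print(first["loop_head"], second["loop_head"])
```
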

NNPACK/CMakeLists.txt has this bit to generate targets to build all
that:

--8<---------------cut here---------------start------------->8---
      ADD_CUSTOM_COMMAND(
        OUTPUT ${obj}
        COMMAND "PYTHONPATH=${PEACHPY_PYTHONPATH}"
          ${PYTHON_EXECUTABLE} -m peachpy.x86_64
            -mabi=sysv -g4 -mimage-format=${PEACHPY_IMAGE_FORMAT}
            "-I${PROJECT_SOURCE_DIR}/src" "-I${PROJECT_SOURCE_DIR}/src/x86_64-fma" "-I${FP16_SOURCE_DIR}/include"
            -o ${obj} "${PROJECT_SOURCE_DIR}/${src}"
        DEPENDS ${NNPACK_BACKEND_PEACHPY_OBJS})
--8<---------------cut here---------------end--------------->8---

It might be that building just those targets sequentially would solve
the problem.

To be continued…

Ludo’.
