bug#50672: python-pytorch is not reproducible
From: Ludovic Courtès
Subject: bug#50672: python-pytorch is not reproducible
Date: Sun, 19 Sep 2021 11:57:14 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Bad news!
--8<---------------cut here---------------start------------->8---
$ guix challenge python-pytorch
/gnu/store/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0 contents
differ:
no local build for
'/gnu/store/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0'
https://ci.guix.gnu.org/nar/lzip/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0:
0i55iwy3z4da4lhn93dnrmz775s9ga5kyfli6cmrchacacf9xfpq
https://bordeaux.guix.gnu.org/nar/lzip/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0:
1fl2v4pd0gcw7wp5k662q0zd4lvvzsggcm5ii8b4kq4v6synhkic
differing file:
/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
1 store items were analyzed:
- 0 (0.0%) were identical
- 1 (100.0%) differed
- 0 (0.0%) were inconclusive
$ guix describe
Generation 189 Aug 30 2021 12:09:27 (current)
guix f91ae94
repository URL: https://git.savannah.gnu.org/git/guix.git
branch: master
commit: f91ae9425bb385b60396a544afe27933896b8fa3
--8<---------------cut here---------------end--------------->8---
The differing file is 165 MiB, and Diffoscope (which reads the output of
‘objdump’) takes forever on it.
However, by comparing the output of ‘strings’ on each file, we get a
hint:
diff -ubBr --show-c-function /tmp/str2 /tmp/str1
--- /tmp/str2 2021-09-19 11:14:47.806798779 +0200
+++ /tmp/str1 2021-09-19 11:14:41.962761127 +0200
@@ -1100584,472 +1100584,472 @@ compute_fast_convolution_input_gradient
compute_grad_kernel_transform
compute_fast_convolution_kernel_gradient.isra.0
compute_fast_convolution_output
-nnp_fft8x8_with_offset_and_stream__avx2.__local0
-nnp_fft8x8_with_offset_and_stream__avx2.__local13
-nnp_fft8x8_with_offset_and_stream__avx2.__local18
-nnp_fft8x8_with_offset_and_stream__avx2.__local1
+nnp_fft8x8_with_offset_and_stream__avx2.__local5
nnp_fft8x8_with_offset_and_stream__avx2.__local16
+nnp_fft8x8_with_offset_and_stream__avx2.__local6
+nnp_fft8x8_with_offset_and_stream__avx2.__local11
+nnp_fft8x8_with_offset_and_stream__avx2.__local0
nnp_fft8x8_with_offset_and_stream__avx2.__local2
nnp_fft8x8_with_offset_and_stream__avx2.__local7
-nnp_fft8x8_with_offset_and_stream__avx2.__local17
-nnp_fft8x8_with_offset_and_stream__avx2.__local10
-nnp_fft8x8_with_offset_and_stream__avx2.__local8
nnp_fft8x8_with_offset_and_stream__avx2.__local15
+nnp_fft8x8_with_offset_and_stream__avx2.__local8
nnp_fft8x8_with_offset_and_stream__avx2.__local3
-nnp_fft8x8_with_offset_and_stream__avx2.__local6
-nnp_fft8x8_with_offset_and_stream__avx2.__local14
-nnp_fft8x8_with_offset_and_stream__avx2.__local9
+nnp_fft8x8_with_offset_and_stream__avx2.__local1
nnp_fft8x8_with_offset_and_stream__avx2.__local4
[…]
nnp_shdotxf8__avx2.__local13
-nnp_shdotxf8__avx2.__local15
nnp_shdotxf8__avx2.__local0
+nnp_shdotxf8__avx2.__local9
+nnp_shdotxf8__avx2.__local10
+nnp_shdotxf8__avx2.__local11
+nnp_shdotxf8__avx2.__local12
+nnp_shdotxf8__avx2.__local2
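To tell whether the two files contain the same strings in a different order, or genuinely different strings, the comparison can be done on multisets rather than line by line. Below is a minimal sketch, assuming both files fit in memory; `printable_strings` only roughly approximates ‘strings’, and both function names are made up for illustration:

```python
import re
from collections import Counter


def printable_strings(data, min_len=4):
    """Return runs of printable ASCII of at least min_len bytes (roughly what strings(1) prints)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]


def compare_strings(a, b):
    """Compare the printable strings of two byte blobs, in order and as multisets."""
    sa, sb = printable_strings(a), printable_strings(b)
    ca, cb = Counter(sa), Counter(sb)
    return {
        "same_sequence": sa == sb,    # identical strings in identical order
        "same_multiset": ca == cb,    # identical strings, order ignored
        "only_in_a": sorted((ca - cb).elements()),
        "only_in_b": sorted((cb - ca).elements()),
    }
```

For the real files one would pass `pathlib.Path(...).read_bytes()` for each copy of libtorch_cpu.so. If "same_multiset" is true while "same_sequence" is false, only the ordering differs.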
This appears to come from NNPACK, one of the libraries that are still
bundled. These functions seem to be generated by Python scripts that
use PeachPy, such as NNPACK/src/x86_64-fma/2d-fourier-8x8.py:
--8<---------------cut here---------------start------------->8---
for post_operation in ["stream", "store"]:
    fft8x8_arguments = (arg_t_pointer, arg_f_pointer, arg_t_stride,
                        arg_f_stride, arg_row_count, arg_column_count,
                        arg_row_offset, arg_column_offset)
    with Function("nnp_fft8x8_with_offset_and_{post_operation}__avx2".format(post_operation=post_operation),
                  fft8x8_arguments, target=uarch.default + isa.fma3 + isa.avx2):
        […]
--8<---------------cut here---------------end--------------->8---
The ‘__local’ bit in the name comes from PeachPy, in peachpy/name.py:
--8<---------------cut here---------------start------------->8---
suffixed_name = "__local" + str(suffix)
for name_object in iter(unnamed_objects):
    # Generate a non-conflicting name by appending a suffix
    while suffixed_name in self.names:
        suffix += 1
        suffixed_name = "__local" + str(suffix)
--8<---------------cut here---------------end--------------->8---
So the problem may be that these object files get generated in parallel,
and thus the ‘__local’ numbering is non-deterministic.
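Whatever makes the order vary, the loop quoted from peachpy/name.py hands out suffixes in the order in which unnamed objects are visited, so any change in that order changes the resulting names. Here is a simplified model of that loop (not PeachPy's actual code; `assign_local_names` and the object labels are made up for illustration):

```python
def assign_local_names(unnamed_objects, taken_names=()):
    """Give each unnamed object the next free "__localN" name, in visiting order."""
    names = set(taken_names)
    assigned = {}
    suffix = 0
    for obj in unnamed_objects:
        suffixed_name = "__local" + str(suffix)
        # Skip suffixes that are already in use.
        while suffixed_name in names:
            suffix += 1
            suffixed_name = "__local" + str(suffix)
        names.add(suffixed_name)
        assigned[obj] = suffixed_name
    return assigned


# Same three objects, visited in two different orders.
first = assign_local_names(["twiddle", "mask", "scale"])
second = assign_local_names(["scale", "twiddle", "mask"])
```

Both runs use the same set of names, but "twiddle" is ‘__local0’ in the first and ‘__local1’ in the second, which resembles the shuffled suffixes in the diff above.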
NNPACK/CMakeLists.txt has this bit to generate targets to build all
that:
--8<---------------cut here---------------start------------->8---
ADD_CUSTOM_COMMAND(
  OUTPUT ${obj}
  COMMAND "PYTHONPATH=${PEACHPY_PYTHONPATH}"
    ${PYTHON_EXECUTABLE} -m peachpy.x86_64
      -mabi=sysv -g4 -mimage-format=${PEACHPY_IMAGE_FORMAT}
      "-I${PROJECT_SOURCE_DIR}/src"
      "-I${PROJECT_SOURCE_DIR}/src/x86_64-fma" "-I${FP16_SOURCE_DIR}/include"
      -o ${obj} "${PROJECT_SOURCE_DIR}/${src}"
  DEPENDS ${NNPACK_BACKEND_PEACHPY_OBJS})
--8<---------------cut here---------------end--------------->8---
It might be that building just those targets sequentially would solve
the problem.
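One way to check whether a change such as sequential builds actually helps would be to generate the same object several times and compare digests. This is a rough sketch; `generate` is a hypothetical callable that runs the generation step once and returns the resulting bytes:

```python
import hashlib


def distinct_digests(generate, runs=5):
    """Call generate() several times and return the set of SHA-256 digests of its output."""
    return {hashlib.sha256(generate()).hexdigest() for _ in range(runs)}


def is_reproducible(generate, runs=5):
    """True if every run produced bit-identical output."""
    return len(distinct_digests(generate, runs)) == 1
```

In practice `generate` could wrap the peachpy command shown above and read back the object file. A single distinct digest over a handful of runs is only a hint, not proof, of determinism.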
To be continued…
Ludo’.