guix-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: PyTorch with ROCm


From: David Elsing
Subject: Re: PyTorch with ROCm
Date: Wed, 03 Apr 2024 22:21:43 +0000

Hello,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Yeah, we could think about a transformation option.  Maybe
> ‘--with-configure-flags=python-pytorch=-DAMDGPU_TARGETS=xyz’ would work,
> and if not, we can come up with a specific transformation and/or an
> procedure that takes a list of architectures and returns a package.

I think that would work for python-pytorch itself, but it would need to
be set for all ROCm dependencies as well. It would be good to make sure
that the targets for a package are a subset of the intersection of the
targets specified for its dependencies.

>>>> - Many tests assume a GPU to be present, so they need to be disabled.
>>>
>>> Yes.  I/we’d like to eventually support that.  (There’d need to be some
>>> annotation in derivations or packages specifying what hardware is
>>> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
>>> to honor that.)
>>
>> That sounds like a good idea, could this also include CPU ISA
>> extensions, such as AVX2 and AVX-512?
>
> That’d be great, yes.  Don’t hold your breath though as I/we haven’t
> scheduled work on this yet.  If you’re interested in working on it, we
> can discuss it of course.

I am definitively interested, but am not familiar with Cuirass. Would
this also require support by the build daemon to determine which
hardware is available?

>> I think the issue is simply that elf-file? just checks the magic bytes
>> and has-elf-header? checks for the entire header. If the former returns
>> #t and the latter #f, an error is raised by parse-elf in guix/elf.scm.
>> It seems some ROCm (or tensile?) ELF files have another header format.
>
> Uh, never came across such a situation.  What’s so special about those
> ELF files?  How are they created?

After checking again, I noticed that the error actually only occurs for
rocblas. :)
Here, the problematic ELF files are generated by Tensile [1], and are
installed in lib/rocblas/library (by library/src/CMakeLists.txt, which
calls a CMake function from the Tensile package). They are shared object
libraries for the GPU architecture(s) [2]. Tensile uses
`clang-offload-builder` (from rocm-toolchain) to create the files, and
it seems to me that the "ELF" header comes from there, but I don't know
why it is special.

Thanks,
David

[1] 
https://github.com/ROCm/Tensile/blob/17df881bde80fc20f997dfb290f4bb4b0e05a7e9/Tensile/TensileCreateLibrary.py#L283
[2] 
https://github.com/ROCm/Tensile/wiki/TensileCreateLibrary#code-object-libraries



reply via email to

[Prev in Thread] Current Thread [Next in Thread]