Re: PyTorch with ROCm


From: David Elsing
Subject: Re: PyTorch with ROCm
Date: Sun, 31 Mar 2024 22:21:26 +0000

Hi!

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> I’m happy to merge your changes in the ‘guix-hpc’ channel for the time
> being (I can create you an account there if you wish so you can create
> merge requests etc.).  Let me know!

OK, sure, that sounds good! So far I have only made the packages for
ROCm 6.0.2, though.

> I agree with Ricardo that this should be merged into Guix proper
> eventually.  This is still in flux and we’d need to check what Kjetil
> and Thomas at AMD think, in particular wrt. versions, so no ETA so far.

Yes, I agree; the ROCm packages are not ready to be merged yet.

> Is PyTorch able to build code for several GPU architectures and pick the
> right one at run time?  If it does, that would seem like the better
> option for me, unless that is indeed so computationally expensive that
> it’s not affordable.

It is the same as for other HIP/ROCm libraries: the GPU architectures
chosen at build time are all available at runtime, and the right one is
picked automatically. For reference, the Arch Linux package for
PyTorch [1] enables 12 architectures. I think the set of architectures
that can be chosen at compile time also depends on the ROCm version.
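
To illustrate (this is only a sketch, not the actual package
definition; the variant name, phase name, and architecture list are
placeholders), the selection could look roughly like this, since
PyTorch's build reads the PYTORCH_ROCM_ARCH environment variable:

  (define-public python-pytorch-rocm/two-archs   ;hypothetical variant
    (package
      (inherit python-pytorch-rocm)
      (arguments
       (substitute-keyword-arguments (package-arguments python-pytorch-rocm)
         ((#:phases phases #~%standard-phases)
          #~(modify-phases #$phases
              (add-before 'build 'set-rocm-architectures
                (lambda _
                  ;; Code objects for both architectures end up in the
                  ;; shared libraries; the right one is used at runtime.
                  (setenv "PYTORCH_ROCM_ARCH" "gfx90a;gfx1030")))))))))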

>> I'm not sure they can be combined however, as the GPU code is included
>> in the shared libraries. Thus all dependent packages like
>> python-pytorch-rocm would need to be built for each architecture as
>> well, which is a large duplication for the non-GPU parts.
>
> Yeah, but maybe that’s OK if we keep the number of supported GPU
> architectures to a minimum?

If it's no issue for the build farm, it would probably be good to
include a set of default architectures (the officially supported
ones?) like you suggested, and make it easy to recompile all dependent
packages for other architectures. Maybe this can be done with a
package transformation, like for '--tune'? IIRC, building
composable-kernel for the default architectures with 16 threads
exceeded 32 GB of memory before I cancelled the build and set it to
only one architecture.
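
Just as an illustration of the idea (assuming the existing
'--with-configure-flag' transformation is applicable here; the package
name and architecture below are placeholders), something like this
already covers a single library, and a dedicated transformation could
do the same for every HIP/ROCm package in the graph:

  (use-modules (guix transformations))

  ;; Sketch only: change the architecture list of one ROCm library;
  ;; dependent packages are rebuilt because their inputs change.
  (define rebuild-for-gfx1100
    (options->transformation
     '((with-configure-flag . "rocblas=-DAMDGPU_TARGETS=gfx1100"))))

  ;; (rebuild-for-gfx1100 python-pytorch-rocm) would then return a
  ;; variant whose rocblas is built for gfx1100 only.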

>> - Many tests assume a GPU to be present, so they need to be disabled.
>
> Yes.  I/we’d like to eventually support that.  (There’d need to be some
> annotation in derivations or packages specifying what hardware is
> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
> to honor that.)

That sounds like a good idea. Could this also include CPU ISA
extensions such as AVX2 and AVX-512?

>> - For several packages (e.g. rocfft), I had to disable the
>>   validate-runpath? phase, as there was an error when reading ELF
>>   files. It is however possible that I also disabled it for packages
>>   where it was not necessary, but it was the case for rocblas at
>>   least. Here, kernels generated are contained in ELF files, which are
>>   detected by elf-file? in guix/build/utils.scm, but rejected by
>>   has-elf-header? in guix/elf.scm, which leads to an error.
>
> Weird.  We’d need to look more closely into the errors you got.

I think the issue is simply that elf-file? only checks the magic bytes,
while has-elf-header? checks the entire header. If the former returns
#t and the latter #f, an error is raised by parse-elf in guix/elf.scm.
It seems some ROCm (or tensile?) ELF files have a different header
format.
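
For reference, the workaround in my patches boils down to something
like this (a sketch only; rocblas here refers to the package from the
patch series, not something in Guix proper):

  ;; Skip the RUNPATH validation phase, which otherwise fails when
  ;; parse-elf hits the embedded kernel code objects.
  (define rocblas/no-runpath-check
    (package
      (inherit rocblas)
      (arguments
       (substitute-keyword-arguments (package-arguments rocblas)
         ((#:validate-runpath? _ #t) #f)))))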

> Oh, just noticed your patch bring a lot of things beyond PyTorch itself!
> I think there’s some overlap with
> <https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>, we
> should synchronize.

Ah, I did not see this before; the overlap seems to be tensile,
roctracer, and rocblas. For rocblas, I saw that they set
"-DAMDGPU_TARGETS=gfx1030;gfx90a", probably for testing?

Thank you!
David

[1] https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch/-/blob/ae90c1e8bdb99af458ca0a545c5736950a747690/PKGBUILD


