Re: Spreading parallel across nodes on HPC system
From: Ken Mankoff
Subject: Re: Spreading parallel across nodes on HPC system
Date: Fri, 11 Nov 2022 08:24:54 +0100
User-agent: mu4e 1.8.10; emacs 27.1
I'll try to simplify my original question...
If I run

parallel --slf hostfile -j 1000 <script> ::: $(seq 1000)

and hostfile has some hosts with 1 CPU and some hosts with hundreds of
CPUs, does parallel take care of handling this?
I've now read the man page in more detail, and just above --slf, in the -S
documentation, I see:
> GNU parallel will determine the number of CPUs on the remote computers
> and run the number of jobs as specified by -j.
So I *think* that if I leave "-j" off the command line, parallel will use the
maximum number of available CPUs. This all sounds good.
Last question, which I may be able to figure out with trial-and-error testing.
Does parallel detect the total number of CPUs on a host, or the number of CPUs
allocated to me and my job? I only have access to the latter...
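If it turns out to be the total, one fallback would be to build the
"CPUs/hostname" file myself from SLURM_JOB_CPUS_PER_NODE (format quoted
below). Here is a minimal sketch, not something I have run on the cluster.
It assumes a POSIX shell with awk, and that the node names arrive in the
same order as the CPU counts, as the SLURM docs describe. The function name
make_sshloginfile is made up for illustration.

```sh
# Read node names on stdin (one per line) and print "CPUs/hostname" lines
# in the format that --slf accepts. $1 is a SLURM_JOB_CPUS_PER_NODE value
# such as '72(x2),36'. make_sshloginfile is a hypothetical helper name.
make_sshloginfile() {
    awk -v spec="$1" '
        BEGIN {
            n = split(spec, items, ",")
            for (k = 1; k <= n; k++) {
                # "72(x2)" splits into "72" and "2"; a bare "36" has no repeat
                split(items[k], part, /[(x)]+/)
                reps = (part[2] != "") ? part[2] + 0 : 1
                for (i = 0; i < reps; i++) cpus[++total] = part[1]
            }
        }
        # Hosts beyond the listed counts are printed without a prefix
        { print ((NR in cpus) ? cpus[NR] "/" : "") $0 }
    '
}
```

Inside a job this would be used roughly as

scontrol show hostnames "$SLURM_JOB_NODELIST" | make_sshloginfile "$SLURM_JOB_CPUS_PER_NODE" > hostfile

and hostfile then passed to --slf.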
Thanks,
-k.
On 2022-11-10 at 20:49 +01, Ken Mankoff <mankoff@gmail.com> wrote:
> Hello,
>
> I'm trying to run parallel on multiple nodes. Each node may have a
> different number of CPUs. It appears the best syntax for this is from
> the man page --slf section:
>
> 8/my-8-cpu-server.example.com
> 2/my_other_username@my-dualcore.example.net
>
> My problem is that I'm running in the SLURM environment. I can get the
> hostnames with
>
> scontrol show hostnames $SLURM_JOB_NODELIST > nodelist.0
>
> But I cannot easily get the CPUS-per-node. From the SLURM docs,
>
> SLURM_JOB_CPUS_PER_NODE: Count of CPUs available to the job on the
> nodes in the allocation, using the format
> CPU_count[(xnumber_of_nodes)][,CPU_count [(xnumber_of_nodes)] ...].
> For example: SLURM_JOB_CPUS_PER_NODE='72(x2),36' indicates that on the
> first and second nodes (as listed by SLURM_JOB_NODELIST) the
> allocation has 72 CPUs, while the third node has 36 CPUs.
>
> So, parsing '72(x2),36' seems complicated.
>
> If I requested a total of 1000 tasks, but have no control over how
> many nodes, can I just call parallel with -j1000 and pass it a
> hostfile without the "CPUs/" prepended to the hostname? Would parallel
> then start however many jobs it can per node, and if for some reason I
> was allocated 1000 CPUs on 1 node, that would work fine, as would 1
> CPU on 1000 different nodes?
>
> Thanks,
>
> -k.