Re: Spreading parallel across nodes on HPC system

From:	Ken Mankoff
Subject:	Re: Spreading parallel across nodes on HPC system
Date:	Sat, 12 Nov 2022 09:01:26 +0100
User-agent:	mu4e 1.8.10; emacs 27.1

Dear Ole,


I do not have the error messages in front of me at the moment, but parallel 
reported that it could not detect the CPUs on the remote host, and was 
spawning only 1 job.

The solution was documented here 
https://curc.readthedocs.io/en/iaasce-954_grouper/software/GNUParallel.html

I use sbatch to launch a script that is allocated 32 tasks (cores; unknown 
number of hosts and CPUs per host). The script then launches 32 srun jobs. It 
all seems to work well.

I'm happy to close this issue, but if it is important for your development, I 
am happy to run various commands (e.g. 'env') and share the output with you.

  -k.


On 2022-11-11 at 19:33 +01, Ole Tange <ole@tange.dk> wrote:
> On Fri, Nov 11, 2022 at 5:58 PM Ken Mankoff <mankoff@gmail.com> wrote:
>
>> I'll try to simplify my original question...
>>
>> If I run
>>
>> parallel --slf hostfile -j 1000 <script> ::: $(seq 1000)
>>
>> And hostfile has some hosts that have 1 CPU, and some hosts that have 100s 
>> of CPUs, does parallel take care of handling this?
>>
>> I've now just read the man page in more detail and above --slf under the -S 
>> documentation I see
>>
>> > GNU parallel will determine the number of CPUs on the remote computers
>> > and run the number of jobs as specified by -j.
>>
>> So I *think* that if I leave "-j" off the command line, parallel will use 
>> the maximum number of available CPUs. This all sounds good.
>>
>> Last question, which I may be able to figure out with trial-and-error 
>> testing. Does parallel detect the total number of CPUs on host, or the 
>> number of CPUs allocated to me and my job? I only have access to the 
>> latter...
>
> Try running this:
>
> $ seq 100000 | parallel -Slo,h --eta true
>
> Computers / CPU cores / Max jobs to run
> 1:h / 2 / 2
> 2:lo / 8 / 8
>
> Computer:jobs running/jobs completed/%of started jobs/Average seconds
> to complete
> ETA: 10558s Left: 99920 AVG: 0.10s  h:2/21/26%/1.5s  lo:8/59/73%/0.5s
>
> The server h has 2 CPU threads, the server lo has 8 CPU threads.
>
> So GNU Parallel detects the number of CPU threads the server has.
>
> It does not detect how many threads are reserved for you by SLURM.
>
> What happens if you use more threads than allocated for you?
>
>> > SLURM_JOB_CPUS_PER_NODE: Count of CPUs available to the job on the
>> > nodes in the allocation, using the format
>> > CPU_count[(xnumber_of_nodes)][,CPU_count [(xnumber_of_nodes)] ...].
>> > For example: SLURM_JOB_CPUS_PER_NODE='72(x2),36' indicates that on the
>> > first and second nodes (as listed by SLURM_JOB_NODELIST) the
>> > allocation has 72 CPUs, while the third node has 36 CPUs.
>
> It seems SLURM sets a lot of other env vars. Maybe one of those is
> easier to parse? Could you get a sample output of `env`?
>
> It seems it should be possible to generate a --slf by merging
> SLURM_JOB_CPUS_PER_NODE and SLURM_JOB_NODELIST. But I really need to
> see real examples of SLURM_JOB_CPUS_PER_NODE and SLURM_JOB_NODELIST to
> confirm that.
>
> /Ole
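
The merge Ole describes above could be sketched roughly as follows. This is only an illustration, not something either poster tested: it assumes the node names have already been expanded into a plain list (for example with `scontrol show hostnames "$SLURM_JOB_NODELIST"`), and that SLURM_JOB_CPUS_PER_NODE follows the format quoted from the SLURM documentation. The function names `expand_cpus_per_node` and `make_slf_lines` and the host names `node01`..`node03` are hypothetical. Each output line uses the `ncpus/host` form that GNU parallel accepts in an sshlogin file.

```python
import re


def expand_cpus_per_node(spec):
    """Expand a SLURM_JOB_CPUS_PER_NODE string such as '72(x2),36'
    into one CPU count per node, e.g. [72, 72, 36]."""
    counts = []
    for item in spec.split(","):
        # Each entry is either 'N' or 'N(xM)': N CPUs, repeated on M nodes.
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if match is None:
            raise ValueError(f"unrecognised entry: {item!r}")
        cpus = int(match.group(1))
        repeat = int(match.group(2)) if match.group(2) else 1
        counts.extend([cpus] * repeat)
    return counts


def make_slf_lines(hostnames, cpus_spec):
    """Pair each (already expanded) host name with its CPU count and
    return lines in the 'ncpus/host' sshlogin-file form."""
    counts = expand_cpus_per_node(cpus_spec)
    if len(counts) != len(hostnames):
        raise ValueError(
            f"{len(hostnames)} hosts but {len(counts)} CPU counts"
        )
    return [f"{cpus}/{host}" for cpus, host in zip(counts, hostnames)]


if __name__ == "__main__":
    # Hypothetical host names; in a real job they would come from SLURM.
    hosts = ["node01", "node02", "node03"]
    for line in make_slf_lines(hosts, "72(x2),36"):
        print(line)
```

Writing those lines to a file and passing it with --slf would cap the jobs on each node at the allocated CPU count rather than the detected one. Whether that interacts well with a given site's SLURM setup would still need checking against real values of both variables.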
