From: | Christian Meesters |
Subject: | Re: file permissions on joblog |
Date: | Thu, 28 Jul 2022 19:46:12 +0200 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0 |
This is not a SLURM job file, as it contains no '#SBATCH' directives. (Yes, they could also be given on the command line.)
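For reference, a minimal job file carries such directives in its header; the same requests can be passed as flags to sbatch instead (all names and values here are placeholders):

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --time=24:00:00
srun ./my_application

# equivalent, without directives in the file:
#   sbatch --job-name=myjob --nodes=1 --time=24:00:00 jobscript.sh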
It is also a bit peculiar that you apparently find it necessary to
adjust permissions yourself. This is usually done in so-called prolog
scripts, which run prior to the job start. If your cluster deviates
from this, you should discuss it with your admins, as it makes your
work cumbersome and error-prone. Likewise, it is not necessary to
infer the number of CPUs on a node: the number of CPUs available to
your particular job is exposed in environment variables (see the
wiki link I have given, and the example below). Please raise both
points with your administrators.
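For instance (a sketch; which of these variables SLURM sets depends on how the job was requested):

# inside the job script, instead of parsing /proc/cpuinfo:
cores=$SLURM_CPUS_ON_NODE          # CPUs allocated to the job on this node
# or, if the job was submitted with --cpus-per-task:
cores=${SLURM_CPUS_PER_TASK:-1}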
As for the job log: SLURM gathers stdout/stderr as specified by
the sbatch -o and -e options. These should point to a shared file
system; anything local to the job may no longer be accessible after
the job has finished. Whether /scratch is a global file system or a
local one cannot be told from the context.
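In a job file that could look like this (the path is a placeholder; %x and %j expand to the job name and job id):

#SBATCH -o /path/on/shared/fs/%x-%j.out
#SBATCH -e /path/on/shared/fs/%x-%j.err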
All in all, you should contact your local helpdesk: several of
these issues may be due to the application or the cluster settings,
not to parallel.
On Jul 28, 2022, at 1:10 AM, Christian Meesters <meesters@uni-mainz.de> wrote:

Hi, not quite. Under SLURM the job step starter (SLURM lingo) is "srun". You do not ssh from job host to job host, but rather use "parallel" as a semaphore, avoiding oversubscription of job steps started with "srun". I summarized this approach here: https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts (uh-oh - I need to clean up that site, many outdated sections there, but this one should still be ok). One advantage: you can safely utilize the resources of both (or more) hosts - the master host and all secondaries. How many resources you require depends on your application and the work it does. Be sure to consider I/O (e.g. stage in files to avoid random I/O from too many concurrent applications, etc.), if this is an issue for your application.

Cheers
Christian

Christian,

My use of GNU parallel does not include ssh. Rather I simply fill the slurm node with --jobs=ncores.

On 7/28/22 09:28, Christian Meesters wrote:

That would require an interactive job, ncores_per_node/threads_per_application ssh connections, and manually triggering the script. My solution is to use parallel in a SLURM job context, which avoids the manual synchronization step whilst offering a potentially multi-node job with SMP applications. It's your choice, of course.
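A minimal sketch of the pattern Christian describes above, with parallel acting as the semaphore and srun placing the job steps (node/task counts and the application are placeholders; newer SLURM versions use --exact instead of --exclusive for non-overlapping steps):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=16
# at most $SLURM_NTASKS steps in flight; srun binds each step to a free
# slot of the allocation, so no host is oversubscribed
parallel --jobs "$SLURM_NTASKS" \
    srun --nodes=1 --ntasks=1 --exclusive ./my_app {} ::: input_*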
On 7/28/22 14:56, Rob Sargent wrote:

if I follow correctly that is what I am doing. Here's my slurm job. If the complete job finishes nicely, then I can read/write the job log; the trap is there in case the slurm job exceeds its time limit. But while things are running, I cannot look at the '.ll' file.

#!/bin/bash
# shared log area on /scratch; loosen permissions on existing entries
LOGDIR=/scratch/general/pe-nfs1/u0138544/logs
chmod a+x $LOGDIR/*
days=$1; shift    # first argument: run time in days
tid=$1; shift     # second (optional) argument: run id
if [[ -z "$tid" ]]   # no run id given: create a fresh job directory
then
JOBDIR=$(mktemp --directory --tmpdir=$LOGDIR XXXXXX)
tid=$(basename $JOBDIR)
else
JOBDIR=$LOGDIR/$tid
mkdir -p $JOBDIR
fi
. /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/sgsCP.sh
chmod -R a+rwx $JOBDIR
rnow=$(date +%s)                # now, in epoch seconds
rsec=$(( days * 24 * 3600 ))    # requested run time in seconds
endtime=$(( rnow + rsec ))      # absolute end time handed to the workers
cores=$(grep -c processor /proc/cpuinfo)
cores=$(( cores / 2 ))          # halve, presumably to discount hyperthreads
# if SLURM signals the job (e.g. on hitting the time limit), open up the logs
trap "chmod -R a+rw $JOBDIR" SIGCONT SIGTERM
# run inputs 1..750, at most $cores at a time, recording a joblog
parallel \
--joblog $JOBDIR/${tid}.ll \
--verbose \
--jobs $cores \
--delay 1 \
/uufs/chpc.utah.edu/sys/installdir/sgspub/bin/chaser-10Mt 83a9a2ad-fe16-4872-b629-b9ba70ed5bbb $endtime $JOBDIR ::: {1..750}
chmod a+rw $JOBDIR/${tid}.ll    # the joblog becomes readable only here
rjs
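One more note on the original question: truncating a file does not reset its permissions, so pre-creating the joblog with open permissions before the parallel call should make it readable while the job runs (a sketch, untested):

touch $JOBDIR/${tid}.ll
chmod a+rw $JOBDIR/${tid}.ll
parallel --joblog $JOBDIR/${tid}.ll ...   # rest of the call as above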