# Fixing problems

## Singularity

### No space left on device

```
singularity --debug pull docker://nvcr.io/nvidia/tensorflow:19.05-py3
Unable to pull docker://nvcr.io/nvidia/tensorflow:19.05-py3: While running mksquashfs: exit status 1:  Write failed because No space left on device  FATAL ERROR:Failed to write to output filesystem
```

Cause:

Singularity's default temporary directory is `/tmp`. This directory may fill up while the `pull` command unpacks the image, hence the error. You can solve this by setting a custom location for Singularity to store its temporary files.

Solution:

```bash
mkdir -p singtmp; SINGULARITY_TMPDIR=./singtmp singularity --debug pull docker://nvcr.io/nvidia/tensorflow:19.05-py3
```

Notice that this cannot be combined with a `singularity build` command, as that requires `sudo singularity build` and the environment variable will not be transferred. In that case you can only rely on the `/tmp` directory being large enough. If this is a problem, please contact us.
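
If you pull images regularly, you can make the setting persistent so that every future `singularity pull` uses it. This is only a sketch: `~/singularity_tmp` is an example path, and any directory with enough free space will do.

```bash
# Persist SINGULARITY_TMPDIR for future pulls (example path, adjust to your storage)
mkdir -p ~/singularity_tmp
echo 'export SINGULARITY_TMPDIR=$HOME/singularity_tmp' >> ~/.bashrc
source ~/.bashrc
singularity pull docker://nvcr.io/nvidia/tensorflow:19.05-py3
```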

## SLURM

### Why does my SLURM job fail to run?

There can be many reasons why a job fails to run. They can be divided into two groups: problems on the SLURM/server side, and problems in your own code.

To inspect SLURM/server-side problems, you can use the following commands:

```bash
sinfo -R                  # Show reasons why the server is not accepting jobs
scontrol show job <JOBID> # Pay attention to the "Reason" that SLURM gives
squeue
squeue -o '%.5i %.20u %.2t %.4M %.5D %7H %6I %7J %6z %R' # View squeue output with a custom output format
```
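
If you only want to see why your own jobs are still waiting, a filtered `squeue` such as the sketch below can help; `-t PENDING` and the `%r` (reason) format specifier are standard SLURM options:

```bash
# Show only your pending jobs together with the reason SLURM gives for not starting them
squeue -u $USER -t PENDING -o '%.10i %.20j %.10M %r'
```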

To inspect problems with your own code or anything else, check the standard output (stdout) of the job. By default it is written to a file named `slurm-<JOBID>.out` in the directory from which you submitted the job. Fix any problems you find in the stdout file.
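
For example, assuming the default `slurm-<JOBID>.out` naming, you can follow a running job's output or open the newest output file in the submit directory:

```bash
# Follow the stdout of job 12345 while it runs (12345 is just an example job id)
tail -f slurm-12345.out

# Or open the most recently written SLURM output file in the current directory
less "$(ls -t slurm-*.out | head -n 1)"
```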

## Tips and tricks

### Cannot dlopen some TensorRT libraries

This error is caused by configuration settings in your user directory interfering with the software in your container. To learn more about the reasons and how to enable containers to "peacefully co-exist", see Installing Python packages with pip in Singularity containers.
The following example uses the simple and brutal solution of just deleting things that are getting in your way:

```console
nv-ai-fe01:~$ srun --time=10:00 --pty singularity exec --nv tensorflow_20.03-tf2-py3.sif ipython -c "import tensorflow"
srun: job 41527 queued and waiting for resources
srun: job 41527 has been allocated resources
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
2020-06-03 12:53:49.413305: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /tmp/tmp.nfy1Wgac8o/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2020-06-03 12:53:49.413707: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /tmp/tmp.nfy1Wgac8o/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2020-06-03 12:53:49.413727: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
$ rm -rf .byobu/ .config/ .ipython/ .jupyter/ .keras/ .local/ .nano/ .singularity/ .ssh/
$ srun --time=10:00 --pty singularity exec --nv tensorflow_20.03-tf2-py3.sif ipython -c "import tensorflow"
srun: job 41530 queued and waiting for resources
srun: job 41530 has been allocated resources
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
WARNING: group: unknown groupid 116157
2020-06-03 13:20:14.206562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-06-03 13:20:15.674483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-06-03 13:20:15.680294: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
```
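
Deleting these directories is drastic (note that `.ssh/` is on the list, so your SSH keys go with it). A gentler alternative, sketched below under the assumption that pip packages installed in `~/.local` are the culprit, is to move that directory aside and only remove it once the container works again:

```bash
# Move user-site Python packages out of the way instead of deleting them outright.
# ~/.local.bak is just an example name; restore it with `mv ~/.local.bak ~/.local` if needed.
mv ~/.local ~/.local.bak
srun --time=10:00 --pty singularity exec --nv tensorflow_20.03-tf2-py3.sif \
    ipython -c "import tensorflow"
```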

### It's tedious to type some common commands again and again!

You should set up aliases. Read more about bash aliases.

For example, the following are some aliases and helper functions for SLURM. Put them into your `~/.bash_aliases` and reload the file by running `source ~/.bash_aliases`. It will be loaded automatically the next time you log in.

```bash
# some useful aliases to put into ~/.bash_aliases
alias si="sinfo -R" # Show reasons why the server is not accepting jobs
# show info for job
sjob(){
    jobid=$1
    if [[ "x$jobid" == "x" ]]; then
        jobid=$(squeue --noheader -O jobid -u $USER|tail -n 1)
    fi
    scontrol show job $jobid
}
alias sqa="squeue -o '%.5i %.20u %.2t %.4M %.5D %7H %6I %7J %6z %R'"
# from https://www.chpc.utah.edu/documentation/software/slurm.php#usercomm
#alias sq="squeue -o \"%8i %12j %4t %20u %20q %20a %10g %20P %10Q %5D %11l %11L %R\""
alias sq="squeue --Format='jobid:10,username:25,tres-alloc:40,state:10,timeused:15,account:10,endtime:20,reason:10,nodelist:20'"
alias sqinfo="squeue --Format='jobid:10,username:25,tres-per-node:10,timeused:15,workdir:50,command:50'"

# follow stdout of a job
# if no jobid provided, it will try to fetch the latest existing jobid for current user
sfstdout(){
    jobid=$1
    if [[ "x$jobid" == "x" ]]; then
        jobid=$(squeue --noheader -O jobid -u $USER|tail -n 1)
    fi
    tail -f $(squeue --noheader -O stdout:100 --job $jobid)
}

# read stdout  of a job
# if no jobid provided, it will try to fetch the latest existing jobid for current user
srstdout(){
    jobid=$1
    if [[ "x$jobid" == "x" ]]; then
        jobid=$(squeue --noheader -O jobid -u $USER|tail -n 1)
    fi
    less $(squeue --noheader -O stdout:100 --job $jobid)
}
# sacctmgr modify account where cluster=ngc name=et set Description="Energy technology department"
alias squs="sacctmgr show qos format=name,maxtresperuser,maxwalldurationperjob"
alias myslurmaccts='printf "%-15s%-25s%s\n" "Cluster" "Account" "Partition" && sacctmgr -p show assoc user=$USER | awk -F"|" "NR>1 { printf \"%-15s%-25s%s\n\", \$1, \$2, \$18 }" | sort'

# show what slurm account(s) a user has
showuser(){
    userID="$1"
    if [[ "x$userID" == "x" ]]; then
        userID="$USER"
    fi
    sacctmgr show user "$userID" withassoc
}
```
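
Once the file has been sourced, the helpers can be used like this (41527 is just an example job id):

```bash
# Example usage after sourcing ~/.bash_aliases
si               # why is the cluster not accepting jobs?
sjob 41527       # show details for a specific job; with no argument it uses your latest job
sfstdout 41527   # follow the stdout of a running job
showuser         # which SLURM account(s) is my user associated with?
```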