Fixing problems
Singularity
No space left on device
singularity --debug pull docker://nvcr.io/nvidia/tensorflow:19.05-py3
Unable to pull docker://nvcr.io/nvidia/tensorflow:19.05-py3: While running mksquashfs: exit status 1: Write failed because No space left on device FATAL ERROR:Failed to write to output filesystem
Cause:
Singularity's default temporary directory is /tmp. This directory can fill up while the pull command unpacks the image, which produces the error above. The fix is to point Singularity at a custom location for its temporary files.
Solution:
mkdir -p singtmp; SINGULARITY_TMPDIR=./singtmp singularity --debug pull docker://nvcr.io/nvidia/tensorflow:19.05-py3
Note that this cannot be combined with a singularity build command: building requires sudo singularity build, and the environment variable will not be transferred to the root environment. In that case you can only rely on /tmp being large enough. If this is a problem, please contact us.
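Related to this, pulled image layers are also cached, by default under ~/.singularity/cache in Singularity 3.x, and the cache can exhaust your home quota. A minimal sketch, assuming a Singularity 3.x installation, that relocates the cache alongside the temporary directory and frees it afterwards:
# keep both the layer cache and the temporary files on a large filesystem
mkdir -p singtmp singcache
export SINGULARITY_TMPDIR=$PWD/singtmp
export SINGULARITY_CACHEDIR=$PWD/singcache
singularity pull docker://nvcr.io/nvidia/tensorflow:19.05-py3
# free the cached layers again once the .sif image has been written
singularity cache clean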
SLURM
Why does my SLURM job fail to run?
There can be many reasons why a job fails to run. They fall into two groups: problems on the SLURM/server side, and problems in your own code.
To inspect SLURM/server-side problems, you can use the following commands:
sinfo -R # Show reasons why the server is not accepting jobs
scontrol show job <JOBID> # Pay attention to the "reason" that SLURM gives
squeue
squeue -o '%.5i %.20u %.2t %.4M %.5D %7H %6I %7J %6z %R' # View squeue output with custom output format
To inspect problems with your code or something else, check the job's standard output (stdout). By default it is written to a file named slurm-<JOBID>.out in the directory from which you submitted the job. Fix any problems reported there.
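If you prefer more descriptive log file names, you can set them yourself in the batch script. A minimal sketch, where the job name, time limit, and payload command are placeholder values; SLURM expands %x to the job name and %j to the job ID:
#!/bin/bash
#SBATCH --job-name=myjob          # hypothetical job name
#SBATCH --time=00:10:00           # placeholder time limit
#SBATCH --output=%x-%j.out        # stdout goes to <jobname>-<jobid>.out
#SBATCH --error=%x-%j.err         # stderr in a separate file (optional)

srun hostname                     # replace with your actual command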
Tips and tricks
Cannot dlopen some TensorRT libraries
This error is caused by configuration settings in your user directory interfering with the software in your container. To learn more about the reasons and how to enable containers to "peacefully co-exist", see Installing Python packages with pip in Singularity containers.
The following example uses the simple and brutal solution of just deleting things that are getting in your way:
nv-ai-fe01:~$ srun --time=10:00 --pty singularity exec --nv tensorflow_20.03-tf2-py3.sif ipython -c "import tensorflow"
srun: job 41527 queued and waiting for resources
srun: job 41527 has been allocated resources
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
2020-06-03 12:53:49.413305: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /tmp/tmp.nfy1Wgac8o/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2020-06-03 12:53:49.413707: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /tmp/tmp.nfy1Wgac8o/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2020-06-03 12:53:49.413727: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
$ rm -rf .byobu/ .config/ .ipython/ .jupyter/ .keras/ .local/ .nano/ .singularity/ .ssh/
$ srun --time=10:00 --pty singularity exec --nv tensorflow_20.03-tf2-py3.sif ipython -c "import tensorflow"
srun: job 41530 queued and waiting for resources
srun: job 41530 has been allocated resources
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
WARNING: group: unknown groupid 116157
2020-06-03 13:20:14.206562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-06-03 13:20:15.674483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-06-03 13:20:15.680294: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
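If you would rather not delete anything outright, a less destructive variant of the same idea is to move the likely culprits aside and restore whatever you still need once the container works again. The list of directories below is only an assumption about the usual offenders:
# move the conflicting per-user directories aside instead of deleting them
for d in .local .ipython .jupyter .keras .config; do
    [ -e "$HOME/$d" ] && mv "$HOME/$d" "$HOME/$d.bak"
done
# re-run the srun test above; remove the *.bak copies only when everything works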
It's tedious to type some common commands again and again!
You should set up aliases; read more about bash aliases if they are new to you.
For example, below are some useful aliases and functions for SLURM. Put them into your ~/.bash_aliases and reload the file by running source ~/.bash_aliases. It will be loaded automatically the next time you log in.
# some useful aliases to put into ~/.bash_aliases
alias si="sinfo -R" # Show reasons why the server is not accepting jobs
# show info for a job
# if no jobid provided, it will try to fetch the latest existing jobid for current user
sjob(){
jobid=$1
if [[ "x$jobid" == "x" ]]; then
jobid=$(squeue --noheader -O jobid -u $USER|tail -n 1)
fi
scontrol show job $jobid
}
alias sqa="squeue -o '%.5i %.20u %.2t %.4M %.5D %7H %6I %7J %6z %R'"
# from https://www.chpc.utah.edu/documentation/software/slurm.php#usercomm
#alias sq="squeue -o \"%8i %12j %4t %20u %20q %20a %10g %20P %10Q %5D %11l %11L %R\""
alias sq="squeue --Format='jobid:10,username:25,tres-alloc:40,state:10,timeused:15,account:10,endtime:20,reason:10,nodelist:20'"
alias sqinfo="squeue --Format='jobid:10,username:25,tres-per-node:10,timeused:15,workdir:50,command:50'"
# follow stdout of a job
# if no jobid provided, it will try to fetch the latest existing jobid for current user
sfstdout(){
jobid=$1
if [[ "x$jobid" == "x" ]]; then
jobid=$(squeue --noheader -O jobid -u $USER|tail -n 1)
fi
tail -f $(squeue --noheader -O stdout:100 --job $jobid)
}
# read stdout of a job
# if no jobid provided, it will try to fetch the latest existing jobid for current user
srstdout(){
jobid=$1
if [[ "x$jobid" == "x" ]]; then
jobid=$(squeue --noheader -O jobid -u $USER|tail -n 1)
fi
less $(squeue --noheader -O stdout:100 --job $jobid)
}
# sacctmgr modify account where cluster=ngc name=et set Description="Energy technology department"
# show QOS limits (max resources per user, max walltime per job)
alias squs="sacctmgr show qos format=name,maxtresperuser,maxwalldurationperjob"
# list the cluster/account/partition combinations the current user may submit to
alias myslurmaccts='printf "%-15s%-25s%s\n" "Cluster" "Account" "Partition" && sacctmgr -p show assoc user=$USER | awk -F"|" "NR>1 { printf \"%-15s%-25s%s\n\", \$1, \$2, \$18 }" | sort'
# show what slurm account(s) a user has
showuser (){
userID="$1"
if [[ "x$userID" == "x" ]]; then
userID="$USER"
fi
sacctmgr show user "$userID" withassoc
}
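Once the file is sourced, the helpers above can be used like this (the job ID is just an example):
source ~/.bash_aliases
si                 # reasons why nodes are not accepting jobs
sq                 # compact overview of the queue
sjob 41530         # details for a specific job
sfstdout           # follow stdout of your most recent job
showuser $USER     # show your SLURM account associations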