My own version of Steven’s WATO cheat sheet :)
# SLURM Guide
The SLURM environment is configured to be as close to the general-use environment as possible. All of the same network drives and software are available. However, there are some differences:
- The SLURM environment uses a `/tmp` drive for temporary storage instead of `/mnt/scratch` on general-use machines. Temporary storage can be requested using the `--gres tmpdisk:<size_in_MiB>` flag.
- The SLURM environment does not have a user-space systemd for managing background processes like the Docker daemon. Please follow the instructions in the Running Docker section to start the Docker daemon.
## Starting SLURM

In a non-VSCode terminal, run the following to get a beefy interactive environment to work in:

```bash
srun --cpus-per-task 10 --mem 16G --gres tmpdisk:40960,shard:8192 --time 1:00:00 --pty bash
```
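To check on or cancel your allocation, the standard SLURM client commands work as usual (a minimal sketch; the job ID below is a placeholder taken from the `squeue` output):

```bash
# List your running and pending jobs
squeue -u $USER

# Cancel a job by its ID
scancel <job_id>
```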
## Running Docker

To actually use Docker, you need to start the Docker daemon manually, since the SLURM environment does not provide a user-space systemd for managing background processes like the Docker daemon:

```bash
slurm-start-dockerd.sh
```
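To confirm the daemon came up, a quick sanity check (this assumes `slurm-start-dockerd.sh` takes care of pointing the Docker CLI at the right socket):

```bash
# Should print server info without errors if the daemon is running
docker info

# Optional end-to-end check
docker run --rm hello-world
```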
## CUDA

If your workload requires CUDA, you have a few options (not exhaustive). For example, you can use the `nvidia/cuda` Docker image:

```bash
docker run --rm -it --gpus all -v $(pwd):/workspace nvidia/cuda:12.0.0-devel-ubuntu22.04 nvcc --version
```
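To double-check that the GPUs you requested via `--gres` are actually visible inside a container, something like the following should work (assumes `nvidia-smi` is exposed by the NVIDIA container runtime):

```bash
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```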
## Starting SSH Server

Inside a tmux session, run the following to start an SSH server on the SLURM node:

```bash
tmux
/usr/sbin/sshd -D -p 2222 -f /dev/null -h ${HOME}/.ssh/id_rsa
```

Then press Ctrl-B, then D, to detach from the tmux session.
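To verify the server is actually listening before you try to connect (a quick check; assumes `lsof` is installed, as used elsewhere in this guide):

```bash
# Should show sshd bound to port 2222
lsof -i :2222
```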
## Connecting

### Using Remote Explorer

At this point you can launch VSCode and SSH into the SLURM node by connecting to `slurm-node` within Remote Explorer.
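This assumes a `slurm-node` host is defined in your `~/.ssh/config`; a rough sketch of what that entry might look like (hostname, user, and any jump host are placeholders for your own setup):

```
Host slurm-node
    HostName <slurm_node_hostname>
    Port 2222
    User <your_username>
    IdentityFile ~/.ssh/id_rsa
```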
### VSCode Tunnel

To connect to a VSCode instance using a tunnel instead (don't do this, it doesn't play well with Foxglove):

```bash
# In the slurm job, run:
cd /tmp
curl --silent --location 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' | tar -xzv
./code tunnel --name slurm-${SLURM_JOB_ID}
```
## Disk Quota Exceeded

The best way to avoid this is to frequently prune unused containers and images... but if you still hit the quota, work through the steps below.
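For routine pruning, the standard Docker cleanup command works (note that it removes all stopped containers, unused images, and build cache):

```bash
docker system prune --all
```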
Check if the daemon is running:

```bash
systemctl --user status docker-rootless
```

Check the journal for disk-quota-exceeded errors:

```bash
journalctl --user --catalog --pager-end --unit docker-rootless.service
```

Free up space:

```bash
buildah unshare rm -r /var/lib/cluster/users/$(id -u)/docker
```

Restart the daemon:

```bash
systemctl --user restart docker-rootless
```
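To see how much space your rootless Docker root is actually using before and after the cleanup (a sketch; the path matches the one used in the removal command above):

```bash
du -sh /var/lib/cluster/users/$(id -u)/docker
```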
## Perception Useful Commands

Port forwarding for Foxglove (kill anything already bound to port 55200, then forward it from your local machine to the cluster):

```bash
lsof -ti:55200 | xargs kill -9
ssh -NfL 55200:localhost:55200 danielrhuynh@derek3-ubuntu2.cluster.watonomous.ca
```
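To confirm the tunnel is up before opening Foxglove (same `lsof` tool as above, run locally):

```bash
# Should list the ssh process listening on 55200
lsof -i :55200
```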
## Watod

Watod is our WATO Docker orchestration script. Some useful commands include:

```bash
watod build <module>
watod -dev up <name_of_container>
watod -t lidar_object_detection
```
## LiDAR Object Detection Specific Stuff

### Running Different Rosbags

Year 3 bag:

```bash
ros2 bag play -l ~/ament_ws/rosbags2/year3/test_track_days/W20.0/public_road_to_TT
```

NuScenes mini batch:

```bash
watod run data_stream ros2 bag play ./nuscenes/NuScenes-v1.0-mini-scene-0061/NuScenes-v1.0-mini-scene-0061_0.mcap --loop
```
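If you are not sure what topics a bag contains, `ros2 bag info` prints its metadata (standard ros2 CLI; the path is whichever bag you are about to play):

```bash
ros2 bag info ~/ament_ws/rosbags2/year3/test_track_days/W20.0/public_road_to_TT
```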
### Running Current LiDAR Object Detection

```bash
watod build lidar_object_detection
watod up foxglove data_stream lidar_object_detection
```

Then play the rosbag of your choice.
### Training Models

Fix ownership inside the running container (run as root):

```bash
docker exec -u root danielrhuynh-lidar_object_detection-1 chown -R bolty:bolty /home/bolty/OpenPCDet
```

Train:

```bash
python3 train.py --cfg_file cfgs/nuscenes_models/cbgs_voxel0075_voxelnext.yaml --batch_size 3 --epochs 20
```

Test:

```bash
python3 test.py --cfg_file cfgs/nuscenes_models/transfusion_lidar.yaml --batch_size 10 --ckpt /home/bolty/OpenPCDet/models/checkpoint_epoch_50.pth
```
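To keep an eye on GPU utilization while a training run is going (a minimal sketch; assumes the container name above and that `nvidia-smi` is available inside it):

```bash
docker exec -it danielrhuynh-lidar_object_detection-1 nvidia-smi
```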
### Running Legacy Static LiDAR Detections

Check out the specific commit hash for 2D LiDAR object detection, or for static 3D LiDAR object detection.

Build the ROS image:

```bash
docker build -f lidar_object_detection.Dockerfile -t lidar-object-det /home/danielrhuynh/
```

Run the container, mounting either the NuScenes samples from wato-drive2 or a local data directory:

```bash
docker run -it -p 8765:8765 -v /mnt/wato-drive2/nuscenes-1.0-mini/samples/LIDAR_TOP:/home/bolty/data lidar-object-det bash
docker run -it -p 8765:8765 -v /home/danielrhuynh/data:/home/bolty/data lidar-object-det bash
```

Inside the container:

```bash
colcon build
ros2 run lidar_object_detection inference
cd ../OpenPCDet/ && python3 setup.py develop && cd ../ament_ws/
ros2 launch foxglove_bridge foxglove_bridge_launch.xml port:=8765
```
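Since the container publishes the Foxglove bridge on port 8765, you can forward that port the same way as the Perception tunnel above and then point Foxglove at `ws://localhost:8765` (hostname reused from the earlier example; swap in your own):

```bash
ssh -NfL 8765:localhost:8765 danielrhuynh@derek3-ubuntu2.cluster.watonomous.ca
```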