My own version of Steven’s WATO cheat sheet :)

SLURM Guide

The SLURM environment is configured to be as close to the general-use environment as possible. All of the same network drives and software are available. However, there are some differences:

  • The SLURM environment uses a /tmp drive for temporary storage instead of /mnt/scratch on general-use machines. Temporary storage can be requested using the --gres tmpdisk:<size_in_MiB> flag (see the quick check after this list).
  • The SLURM environment does not have a user-space systemd for managing background processes like the Docker daemon. Please follow the instructions in the Using Docker section to start the Docker daemon.
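
To confirm the tmpdisk allocation actually landed, a quick check inside the job (df is standard; the size should roughly match what you passed to --gres tmpdisk):

df -h /tmp    # should show approximately the size you requested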

Starting SLURM

Basically, in a non-VSCode terminal, just run

srun --cpus-per-task 10 --mem 16G --gres tmpdisk:40960,shard:8192 --time 1:00:00 --pty bash

for a beefy environment to do work in.
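
Once the job starts, you can sanity-check what you actually got (standard SLURM commands):

squeue -u $USER
scontrol show job $SLURM_JOB_ID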

Running Docker

To actually do things in Docker, we need to start the daemon manually, since (as noted above) the SLURM environment has no user-space systemd to do it for us:

slurm-start-dockerd.sh
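
After the script finishes, a quick way to verify the daemon is up:

docker info
docker run --rm hello-world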

CUDA

If your workload requires CUDA, you have a few options (not exhaustive). One is the nvidia/cuda Docker image:

docker run --rm -it --gpus all -v $(pwd):/workspace nvidia/cuda:12.0.0-devel-ubuntu22.04 nvcc --version
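
A quick sanity check that the GPU is actually visible inside the container (same image as above):

docker run --rm --gpus all nvidia/cuda:12.0.0-devel-ubuntu22.04 nvidia-smi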

Starting SSH Server

Run the following inside the SLURM job to start an SSH server:

tmux
# -D: stay in foreground, -p 2222: listen port, -f /dev/null: empty config, -h: reuse your user key as the host key
/usr/sbin/sshd -D -p 2222 -f /dev/null -h ${HOME}/.ssh/id_rsa

Then Ctrl-B followed by D to detach from the tmux session.
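
Note that sshd needs the -h key file to exist; if you don't have one yet, generate it first (standard ssh-keygen, empty passphrase):

ssh-keygen -t rsa -f ${HOME}/.ssh/id_rsa -N ""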

Connecting

Using Remote-Explorer

At this point you can launch VSCode and SSH into the SLURM node by connecting to slurm-node within Remote-Explorer.
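
Remote-Explorer picks slurm-node up from your SSH config; a minimal sketch of such an entry (the hostnames below are placeholders for your cluster, so adjust accordingly):

Host slurm-node
    HostName <compute-node-hostname>   # placeholder: the node your srun job landed on
    Port 2222
    User <your-username>               # placeholder
    ProxyJump <cluster-login-node>     # placeholder; drop if the node is directly reachable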

VSCode Tunnel

To connect through a VSCode tunnel instead (don't do this; it doesn't play well with Foxglove):

# In the slurm job, run: 
cd /tmp 
curl --silent --location 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' | tar -xzv 
./code tunnel --name slurm-${SLURM_JOB_ID}
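
On first run the tunnel prints a device-login URL; after authenticating, you can reach it from a local VSCode via the Remote - Tunnels extension, or in a browser at https://vscode.dev/tunnel/slurm-<job_id>.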

Disk Quota Exceeded

The best way to avoid this is to frequently prune unused containers and images… but if it's already too late, the recovery steps are below.
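
For routine pruning, the standard Docker commands apply:

docker system prune                 # stopped containers, dangling images, build cache
docker system prune -a --volumes   # more aggressive: also unused images and volumes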

Check if daemon is running

systemctl --user status docker-rootless

Check journal for disk quota exceeded

journalctl --user --catalog --pager-end --unit docker-rootless.service

Free up space

buildah unshare rm -r /var/lib/cluster/users/$(id -u)/docker

Restart daemon

systemctl --user restart docker-rootless

Perception Useful Commands

Port forwarding for Foxglove

# Free up local port 55200, then forward it to the cluster machine (swap in your own user/host)
lsof -ti:55200 | xargs kill -9
ssh -NfL 55200:localhost:55200 danielrhuynh@derek3-ubuntu2.cluster.watonomous.ca

Watod

Watod is our WATO Docker orchestration script. Some useful commands include:

watod build <module>
watod -dev up <name_of_container>
watod -t lidar_object_detection
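
A typical end-to-end flow looks something like this (build/up/-t are from the list above; watod down is my assumption that watod forwards the usual docker compose verbs):

watod build lidar_object_detection   # build the module's images
watod up lidar_object_detection      # start the containers
watod -t lidar_object_detection      # shell into the running container
watod down                           # assumption: tears everything down, like docker compose down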

LiDAR Object Detection Specific Stuff

Running different Rosbags

Year 3 bag

ros2 bag play -l ~/ament_ws/rosbags2/year3/test_track_days/W20.0/public_road_to_TT

NuScenes mini bag

watod run data_stream ros2 bag play ./nuscenes/NuScenes-v1.0-mini-scene-0061/NuScenes-v1.0-mini-scene-0061_0.mcap --loop

Running Current LiDAR Object Detection

watod build lidar_object_detection
watod up foxglove data_stream lidar_object_detection

Then play the rosbag of your choice (see above).

Training Models

Exec into the container as root to fix OpenPCDet ownership (the container name is per-user):

docker exec -u root danielrhuynh-lidar_object_detection-1 chown -R bolty:bolty /home/bolty/OpenPCDet
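
The train/test commands below run inside the container; in a stock OpenPCDet checkout they live under tools/, so (assuming this image follows that layout):

watod -t lidar_object_detection
cd /home/bolty/OpenPCDet/tools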

Train:

python3 train.py --cfg_file cfgs/nuscenes_models/cbgs_voxel0075_voxelnext.yaml --batch_size 3 --epochs 20 

Test:

python3 test.py --cfg_file cfgs/nuscenes_models/transfusion_lidar.yaml --batch_size 10 --ckpt /home/bolty/OpenPCDet/models/checkpoint_epoch_50.pth

Running Legacy Static LiDAR Detections

Check out a specific commit hash for 2D LiDAR object detection or static 3D LiDAR object detection.

Building and running the ROS image:

# Build the image
docker build -f lidar_object_detection.Dockerfile -t lidar-object-det /home/danielrhuynh/
# Start a container, picking one of the two volume mounts (NuScenes samples on wato-drive2, or your own data)
docker run -it -p 8765:8765 -v /mnt/wato-drive2/nuscenes-1.0-mini/samples/LIDAR_TOP:/home/bolty/data lidar-object-det bash
docker run -it -p 8765:8765 -v /home/danielrhuynh/data:/home/bolty/data lidar-object-det bash
# Inside the container:
colcon build
ros2 run lidar_object_detection inference
cd ../OpenPCDet/ && python3 setup.py develop && cd ../ament_ws/
ros2 launch foxglove_bridge foxglove_bridge_launch.xml port:=8765