Commit da6b1c82 authored by Wenwei Zhang, committed by Kai Chen

Add instruction to setup communication port. (#1838)


* add doc for setting slurm port

* update default

* add instruction for port

* add evaluate interval

* update docs

* fix typo

Co-authored-by: Kai Chen <chenkaidev@gmail.com>
parent 537bb34c
@@ -155,6 +155,11 @@ which uses `MMDistributedDataParallel` and `MMDataParallel` respectively.
All outputs (log files and checkpoints) will be saved to the working directory,
which is specified by `work_dir` in the config file.
By default, we evaluate the model on the validation set after each epoch. You can change the evaluation interval by adding the `interval` argument to the training config.
```python
evaluation = dict(interval=12)  # This evaluates the model every 12 epochs.
```
**\*Important\***: The default learning rate in config files is for 8 GPUs and 2 img/gpu (batch size = 8*2 = 16).
According to the [Linear Scaling Rule](https://arxiv.org/abs/1706.02677), you need to set the learning rate proportional to the batch size if you use different GPUs or images per GPU, e.g., lr=0.01 for 4 GPUs * 2 img/gpu and lr=0.08 for 16 GPUs * 4 img/gpu.
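For example, a minimal sketch of the scaled setting for 4 GPUs * 2 img/gpu (batch size = 8, half of the default 16, so the learning rate is halved as well); the `momentum` and `weight_decay` values here are placeholders, not prescribed by this guide:
```python
# Sketch: linearly scaled learning rate for 4 GPUs * 2 img/gpu (batch size = 8).
# momentum and weight_decay are placeholder values; keep whatever your config already uses.
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
```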
@@ -202,6 +207,36 @@ If you have just multiple machines connected with ethernet, you can refer to
pytorch [launch utility](https://pytorch.org/docs/stable/distributed_deprecated.html#launch-utility).
It is usually slow if you do not have high-speed networking like InfiniBand.
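As a rough sketch of such a launch (the IP address, GPU count, and node count are placeholders; the flags are those of the PyTorch launch utility linked above):
```shell
# On the first machine (node rank 0); 10.1.1.1 is a placeholder for its IP address.
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --master_addr=10.1.1.1 \
    --master_port=29500 --nproc_per_node=8 ./tools/train.py ${CONFIG_FILE} --launcher pytorch
# On the second machine (node rank 1), using the same master address and port.
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --master_addr=10.1.1.1 \
    --master_port=29500 --nproc_per_node=8 ./tools/train.py ${CONFIG_FILE} --launcher pytorch
```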
### Launch multiple jobs on a single machine
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
you need to specify different ports (29500 by default) for each job to avoid communication conflicts.
If you use `dist_train.sh` to launch training jobs, you can set the port in the commands.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
```
If you launch training jobs with Slurm, you need to modify the config files (usually the 6th line from the bottom) to set different communication ports.
In `config1.py`,
```python
dist_params = dict(backend='nccl', port=29500)
```
In `config2.py`,
```python
dist_params = dict(backend='nccl', port=29501)
```
Then you can launch two jobs with `config1.py` and `config2.py`.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} 4
```
## Useful tools
@@ -5,6 +5,7 @@ PYTHON=${PYTHON:-"python"}
 CONFIG=$1
 CHECKPOINT=$2
 GPUS=$3
+PORT=${PORT:-29500}
-$PYTHON -m torch.distributed.launch --nproc_per_node=$GPUS \
+$PYTHON -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
     $(dirname "$0")/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
@@ -4,6 +4,7 @@ PYTHON=${PYTHON:-"python"}
 CONFIG=$1
 GPUS=$2
+PORT=${PORT:-29500}
-$PYTHON -m torch.distributed.launch --nproc_per_node=$GPUS \
+$PYTHON -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
     $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}