# Training

This repository supports finetuning SAM3 models on custom datasets, either in a multi-node setup or locally. The training script is located at `sam3/train/train.py` and uses Hydra configuration management to handle complex training setups.

## Installation

```bash
cd sam3
pip install -e ".[train]"
```

### Training Script Usage

The main training script is `sam3/train/train.py`. It uses Hydra configuration management to handle complex training setups.

#### Basic Usage

```bash
# Example: Train on Roboflow dataset
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml

# Example: Train on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only_train.yaml
```

Follow [`Roboflow 100-VL`](https://github.com/roboflow/rf100-vl/) to download the Roboflow 100-VL datasets, and follow [`GLIP`](https://github.com/microsoft/GLIP) to download the ODinW datasets. Organize the data folders as shown below, and set your `roboflow_vl_100_root` and `odinw_data_root` in the job configs.

```
roboflow_vl_100_root/
    13-lkc01/
        train/
        valid/
        test/
    2024-frc/
    actions/
    ...
odinw_data_root/
    AerialMaritimeDrone/
        large/
            train/
            valid/
            test/
    Aquarium/
    ...
```
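
Before launching a sweep, it can save time to verify that every dataset folder has the expected `train`/`valid`/`test` splits. A minimal sketch (the helper names are illustrative, not part of the repo):

```python
from pathlib import Path

def missing_splits(dataset_dir, splits=("train", "valid", "test")):
    """Return the expected split subdirectories that are absent."""
    dataset_dir = Path(dataset_dir)
    return [s for s in splits if not (dataset_dir / s).is_dir()]

def check_root(root):
    """Map each dataset under the root to its missing splits (empty dict = all good)."""
    root = Path(root)
    problems = {}
    for dataset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = missing_splits(dataset_dir)
        if missing:
            problems[dataset_dir.name] = missing
    return problems
```

Run `check_root(roboflow_vl_100_root)` once after downloading; for ODinW the splits sit one level deeper (e.g. under `AerialMaritimeDrone/large`), so point the check at that subfolder.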

#### Command Line Arguments

The training script supports several command line arguments:

```bash
python sam3/train/train.py \
    -c CONFIG_NAME \
    [--use-cluster 0|1] \
    [--partition PARTITION_NAME] \
    [--account ACCOUNT_NAME] \
    [--qos QOS_NAME] \
    [--num-gpus NUM_GPUS] \
    [--num-nodes NUM_NODES]
```

**Arguments:**

- `-c, --config`: **Required.** Path to the configuration file (e.g., `sam3/train/configs/roboflow_v100_full_ft_100_images.yaml`)
- `--use-cluster`: Whether to launch on a cluster (0: local, 1: cluster). Default: uses the config setting
- `--partition`: SLURM partition name for cluster execution
- `--account`: SLURM account name for cluster execution
- `--qos`: SLURM QOS (Quality of Service) setting
- `--num-gpus`: Number of GPUs per node. Default: uses the config setting
- `--num-nodes`: Number of nodes for distributed training. Default: uses the config setting

#### Local Training Examples

```bash
# Single-GPU training
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0 --num-gpus 1

# Multi-GPU training on a single node
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0 --num-gpus 4

# Force local execution even if the config specifies cluster execution
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0
```

#### Cluster Training Examples

```bash
# Basic cluster training with default settings from config
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 1

# Cluster training with specific SLURM settings
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
    --use-cluster 1 \
    --partition gpu_partition \
    --account my_account \
    --qos high_priority \
    --num-gpus 8 \
    --num-nodes 2
```

### Configuration Files

Training configurations are stored in `sam3/train/configs/`. The configuration files use Hydra's YAML format and support:

- **Dataset Configuration**: Data paths, transforms, and loading parameters
- **Model Configuration**: Architecture settings, checkpoint paths, and model parameters
- **Training Configuration**: Batch sizes, learning rates, and optimization settings
- **Launcher Configuration**: Distributed training and cluster settings
- **Logging Configuration**: TensorBoard, experiment tracking, and output directories

#### Key Configuration Sections

```yaml
# Paths to datasets and checkpoints
paths:
  bpe_path: /path/to/bpe/file
  dataset_root: /path/to/dataset
  experiment_log_dir: /path/to/logs

# Launcher settings for local/cluster execution
launcher:
  num_nodes: 1
  gpus_per_node: 2
  experiment_log_dir: ${paths.experiment_log_dir}

# Cluster execution settings
submitit:
  use_cluster: True
  timeout_hour: 72
  cpus_per_task: 10
  partition: null
  account: null
```
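
For a quick local smoke test, the same keys can be overridden in a job config. A hypothetical fragment (the key names follow the sections above; the values are illustrative):

```yaml
# Hypothetical local-debug override: one GPU, no SLURM
launcher:
  num_nodes: 1
  gpus_per_node: 1
submitit:
  use_cluster: False
```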

### Monitoring Training

The training script automatically sets up logging and saves outputs to the experiment directory:

```bash
# Logs are saved to the experiment_log_dir specified in config
experiment_log_dir/
├── config.yaml           # Original configuration
├── config_resolved.yaml  # Resolved configuration with all variables expanded
├── checkpoints/          # Model checkpoints (if skip_checkpointing=False)
├── tensorboard/          # TensorBoard logs
├── logs/                 # Text logs
└── submitit_logs/        # Cluster job logs (if using cluster)
```
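
When a run has produced several checkpoints, a small helper can pick the most recent one for resuming or evaluation. A sketch assuming the `checkpoints/` layout above (the function name and glob pattern are illustrative):

```python
from pathlib import Path

def latest_checkpoint(experiment_log_dir, pattern="*"):
    """Return the newest file in checkpoints/ by modification time, or None."""
    ckpt_dir = Path(experiment_log_dir) / "checkpoints"
    ckpts = [p for p in ckpt_dir.glob(pattern) if p.is_file()]
    if not ckpts:
        return None
    return max(ckpts, key=lambda p: p.stat().st_mtime)
```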

You can monitor training progress using TensorBoard:

```bash
tensorboard --logdir /path/to/experiment_log_dir/tensorboard
```

### Job Arrays for Dataset Sweeps

The Roboflow and ODinW configurations support job arrays for training multiple models on different datasets. This feature is enabled via:

```yaml
submitit:
  job_array:
    num_tasks: 100
    task_index: 0
```

The configuration includes a complete list of 100 Roboflow supercategories, and `submitit.job_array.task_index` automatically selects which dataset to use based on the array job index.
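
Conceptually, the mapping works like the sketch below: SLURM exports `SLURM_ARRAY_TASK_ID` for each array task, and that index picks one entry from the dataset list (the list contents and helper here are illustrative, not the actual config values):

```python
import os

# Illustrative stand-in for the 100-entry list in the config
all_roboflow_supercategories = ["13-lkc01", "2024-frc", "actions"]

def select_dataset(categories, task_index):
    """Map a job-array task index onto one dataset name."""
    if not 0 <= task_index < len(categories):
        raise IndexError(f"task index {task_index} outside sweep of size {len(categories)}")
    return categories[task_index]

# SLURM sets SLURM_ARRAY_TASK_ID per array task; default to 0 for local runs
task_index = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
dataset = select_dataset(all_roboflow_supercategories, task_index)
```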

```bash
# Submit a job array to train on different Roboflow datasets
# The job array index selects which dataset from all_roboflow_supercategories
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
    --use-cluster 1
```

### Reproduce ODinW13 10-shot results

Running the following job gives the results for ODinW13 with seed 300; see `odinw_train.train_file: fewshot_train_shot10_seed300` in the config file.

```bash
# Example: Train on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only_train.yaml
```

Change `odinw_train.train_file` to `fewshot_train_shot10_seed30` or `fewshot_train_shot10_seed3` to get the results for the other two seeds. Final results are aggregated over the three seeds. Note that a small number of jobs may diverge during training; in that case, use the result from the last checkpoint before divergence.
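
The aggregation step is simply a mean over the three seeds. A sketch with made-up numbers (the metric values are purely illustrative):

```python
# Illustrative per-seed metric values for the three few-shot seeds
results_by_seed = {300: 51.2, 30: 49.8, 3: 50.5}

def aggregate_over_seeds(results):
    """Final reported number: mean of the per-seed results."""
    return sum(results.values()) / len(results)

final_metric = aggregate_over_seeds(results_by_seed)
```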

### Eval Script Usage

With a setup similar to the training configs, the training script `sam3/train/train.py` can also be used for evaluation by setting `trainer.mode` to `val` in the job config. Running the following jobs gives the zero-shot results on the RF100-VL and ODinW13 datasets:

```bash
# Example: Evaluate on Roboflow dataset
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_eval.yaml

# Example: Evaluate on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only.yaml
```