# Training

This repository supports finetuning SAM3 models on custom datasets, either in a multi-node setup or locally. The training script is located at `sam3/train/train.py` and uses Hydra configuration management to handle complex training setups.

## Installation

```bash
cd sam3
pip install -e ".[train]"
```

### Training Script Usage

The main training script is `sam3/train/train.py`. It uses Hydra configuration management to handle complex training setups.

#### Basic Usage

```bash
# Example: Train on Roboflow dataset
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml

# Example: Train on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only_train.yaml
```

Follow [`Roboflow 100-VL`](https://github.com/roboflow/rf100-vl/) to download the Roboflow 100-VL datasets, and follow [`GLIP`](https://github.com/microsoft/GLIP) to download the ODinW datasets. Organize the data folders as shown below, and set your `roboflow_vl_100_root` and `odinw_data_root` in the job configs.

```
roboflow_vl_100_root/
    13-lkc01/
        train/
        valid/
        test/
    2024-frc/
    actions/
    ...
odinw_data_root/
    AerialMaritimeDrone/
        large/
            train/
            valid/
            test/
    Aquarium/
    ...
```
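
Before launching a sweep, it can save time to verify that every dataset folder has the expected `train`/`valid`/`test` splits. A minimal sketch (the helper names are illustrative, not part of the repo):

```python
from pathlib import Path

def missing_splits(dataset_dir, splits=("train", "valid", "test")):
    """Return the expected split subdirectories that are absent."""
    dataset_dir = Path(dataset_dir)
    return [s for s in splits if not (dataset_dir / s).is_dir()]

def check_root(root):
    """Map each dataset under the root to its missing splits (empty dict = all good)."""
    root = Path(root)
    problems = {}
    for dataset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = missing_splits(dataset_dir)
        if missing:
            problems[dataset_dir.name] = missing
    return problems
```

Run `check_root(roboflow_vl_100_root)` once after downloading; for ODinW the splits sit one level deeper (e.g. under `AerialMaritimeDrone/large`), so point the check at that subfolder.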

#### Command Line Arguments

The training script supports several command line arguments:

```bash
python sam3/train/train.py \
    -c CONFIG_NAME \
    [--use-cluster 0|1] \
    [--partition PARTITION_NAME] \
    [--account ACCOUNT_NAME] \
    [--qos QOS_NAME] \
    [--num-gpus NUM_GPUS] \
    [--num-nodes NUM_NODES]
```

**Arguments:**

- `-c, --config`: **Required.** Path to the configuration file (e.g., `sam3/train/configs/roboflow_v100_full_ft_100_images.yaml`)
- `--use-cluster`: Whether to launch on a cluster (0: local, 1: cluster). Default: uses the config setting
- `--partition`: SLURM partition name for cluster execution
- `--account`: SLURM account name for cluster execution
- `--qos`: SLURM QOS (Quality of Service) setting
- `--num-gpus`: Number of GPUs per node. Default: uses the config setting
- `--num-nodes`: Number of nodes for distributed training. Default: uses the config setting

#### Local Training Examples

```bash
# Single-GPU training
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0 --num-gpus 1

# Multi-GPU training on a single node
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0 --num-gpus 4

# Force local execution even if the config specifies cluster execution
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0
```

#### Cluster Training Examples

```bash
# Basic cluster training with default settings from config
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 1

# Cluster training with specific SLURM settings
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
    --use-cluster 1 \
    --partition gpu_partition \
    --account my_account \
    --qos high_priority \
    --num-gpus 8 \
    --num-nodes 2
```

### Configuration Files

Training configurations are stored in `sam3/train/configs/`. The configuration files use Hydra's YAML format and support:

- **Dataset Configuration**: Data paths, transforms, and loading parameters
- **Model Configuration**: Architecture settings, checkpoint paths, and model parameters
- **Training Configuration**: Batch sizes, learning rates, and optimization settings
- **Launcher Configuration**: Distributed training and cluster settings
- **Logging Configuration**: TensorBoard, experiment tracking, and output directories

#### Key Configuration Sections

```yaml
# Paths to datasets and checkpoints
paths:
  bpe_path: /path/to/bpe/file
  dataset_root: /path/to/dataset
  experiment_log_dir: /path/to/logs

# Launcher settings for local/cluster execution
launcher:
  num_nodes: 1
  gpus_per_node: 2
  experiment_log_dir: ${paths.experiment_log_dir}

# Cluster execution settings
submitit:
  use_cluster: True
  timeout_hour: 72
  cpus_per_task: 10
  partition: null
  account: null
```
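
For a quick local smoke test, the same keys can be overridden in a job config. A hypothetical fragment (the key names follow the sections above; the values are illustrative):

```yaml
# Hypothetical local-debug override: one GPU, no SLURM
launcher:
  num_nodes: 1
  gpus_per_node: 1
submitit:
  use_cluster: False
```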

### Monitoring Training

The training script automatically sets up logging and saves outputs to the experiment directory:

```bash
# Logs are saved to the experiment_log_dir specified in config
experiment_log_dir/
├── config.yaml           # Original configuration
├── config_resolved.yaml  # Resolved configuration with all variables expanded
├── checkpoints/          # Model checkpoints (if skip_checkpointing=False)
├── tensorboard/          # TensorBoard logs
├── logs/                 # Text logs
└── submitit_logs/        # Cluster job logs (if using cluster)
```
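
When a run has produced several checkpoints, a small helper can pick the most recent one for resuming or evaluation. A sketch assuming the `checkpoints/` layout above (the function name and glob pattern are illustrative):

```python
from pathlib import Path

def latest_checkpoint(experiment_log_dir, pattern="*"):
    """Return the newest file in checkpoints/ by modification time, or None."""
    ckpt_dir = Path(experiment_log_dir) / "checkpoints"
    ckpts = [p for p in ckpt_dir.glob(pattern) if p.is_file()]
    if not ckpts:
        return None
    return max(ckpts, key=lambda p: p.stat().st_mtime)
```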

You can monitor training progress using TensorBoard:

```bash
tensorboard --logdir /path/to/experiment_log_dir/tensorboard
```

### Job Arrays for Dataset Sweeps

The Roboflow and ODinW configurations support job arrays for training multiple models on different datasets. This feature is enabled via:

```yaml
submitit:
  job_array:
    num_tasks: 100
    task_index: 0
```

The configuration includes a complete list of 100 Roboflow supercategories, and `submitit.job_array.task_index` automatically selects which dataset to use based on the array job index.
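
Conceptually, the mapping works like the sketch below: SLURM exports `SLURM_ARRAY_TASK_ID` for each array task, and that index picks one entry from the dataset list (the list contents and helper here are illustrative, not the actual config values):

```python
import os

# Illustrative stand-in for the 100-entry list in the config
all_roboflow_supercategories = ["13-lkc01", "2024-frc", "actions"]

def select_dataset(categories, task_index):
    """Map a job-array task index onto one dataset name."""
    if not 0 <= task_index < len(categories):
        raise IndexError(f"task index {task_index} outside sweep of size {len(categories)}")
    return categories[task_index]

# SLURM sets SLURM_ARRAY_TASK_ID per array task; default to 0 for local runs
task_index = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
dataset = select_dataset(all_roboflow_supercategories, task_index)
```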

```bash
# Submit a job array to train on different Roboflow datasets
# The job array index selects which dataset from all_roboflow_supercategories
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
    --use-cluster 1
```

### Reproduce ODinW13 10-shot results

Running the following job gives the results for ODinW13 with seed 300; see `odinw_train.train_file: fewshot_train_shot10_seed300` in the config file.

```bash
# Example: Train on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only_train.yaml
```

Change `odinw_train.train_file` to `fewshot_train_shot10_seed30` or `fewshot_train_shot10_seed3` to get the results for the other two seeds. Final results are aggregated over the three seeds. Note that a small number of jobs may diverge during training; in that case, use the result from the last checkpoint before divergence.
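
The aggregation step is simply a mean over the three seeds. A sketch with made-up numbers (the metric values are purely illustrative):

```python
# Illustrative per-seed metric values for the three few-shot seeds
results_by_seed = {300: 51.2, 30: 49.8, 3: 50.5}

def aggregate_over_seeds(results):
    """Final reported number: mean of the per-seed results."""
    return sum(results.values()) / len(results)

final_metric = aggregate_over_seeds(results_by_seed)
```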

### Eval Script Usage

With a setup similar to the training configs, the training script `sam3/train/train.py` can also be used for evaluation by setting `trainer.mode` to `val` in the job config. Running the following jobs gives the zero-shot results on the RF100-VL and ODinW13 datasets:

```bash
# Example: Evaluate on Roboflow dataset
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_eval.yaml

# Example: Evaluate on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only.yaml
```