Initial commit

fbshipit-source-id: da6be2f26e3a1202f4bffde8cb980e2dcb851294
This commit is contained in:
facebook-github-bot
2025-11-18 23:07:42 -08:00
commit a13e358df4
504 changed files with 122758 additions and 0 deletions

scripts/eval/gold/README.md Normal file

@@ -0,0 +1,299 @@
# SA-Co/Gold benchmark
SA-Co/Gold is a benchmark for promptable concept segmentation (PCS) in images. The benchmark contains images paired with text labels, also referred to as Noun Phrases (NPs), each annotated exhaustively with masks on all object instances that match the label. SA-Co/Gold comprises 7 subsets, each targeting a different annotation domain: MetaCLIP captioner NPs, SA-1B captioner NPs, Attributes, Crowded Scenes, Wiki-Common1K, Wiki-Food/Drink, Wiki-Sports Equipment. The images are originally from the MetaCLIP and SA-1B datasets.
For each subset, the annotations are reviewed by 3 independent human annotators. Each row in the figure shows an image and noun phrase pair from one of the domains, and masks from the 3 annotators. Dashed borders indicate special group masks that cover more than a single instance, used when separating into instances is deemed too difficult. Annotators sometimes disagree on precise mask borders, the number of instances, and whether the phrase exists. Having 3 independent annotations allows us to measure human agreement on the task, which serves as an upper bound for model performance.
<p align="center">
<img src="../../../assets/saco_gold_annotation.png?" style="width:80%;" />
</p>
# Preparation
## Download annotations
The ground-truth (GT) annotations can be downloaded from [Hugging Face](https://huggingface.co/datasets/facebook/SACo-Gold) or [Roboflow](https://universe.roboflow.com/sa-co-gold).
## Download images
There are two image sources for the evaluation dataset: MetaCLIP and SA-1B.
1) The MetaCLIP images are used in 6 out of 7 subsets (MetaCLIP captioner NPs, Attributes, Crowded Scenes, Wiki-Common1K, Wiki-Food/Drink, Wiki-Sports Equipment) and can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-gold/gold-metaclip-merged-a-release-test/).
2) The SA-1B images are used in 1 out of 7 subsets (SA-1B captioner NPs) and can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-gold/gold-sa-1b-merged-a-release-test/). Alternatively, they can be downloaded from [here](https://ai.meta.com/datasets/segment-anything-downloads/): use the link for `sa_co_gold.tar` among the dynamic links available under `Download text file` to download the SA-1B images used in SA-Co/Gold.
# Usage
## Visualization
- Visualize GT annotations: [saco_gold_silver_vis_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_gold_silver_vis_example.ipynb)
- Visualize GT annotations and sample predictions side-by-side: [sam3_data_and_predictions_visualization.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/sam3_data_and_predictions_visualization.ipynb)
## Run evaluation
The official metric for SA-Co/Gold is cgF1. Please refer to the SAM3 paper for details.
Our evaluator inherits from the official COCO evaluator, with some modifications. Recall that in SA-Co/Gold there are three annotations for each datapoint. We evaluate against each of them and pick the most favorable one (oracle setting). The evaluator has minimal dependencies (pycocotools, numpy and scipy) to ease reuse in other projects. In this section we provide several pointers for evaluating SAM3 or third-party models.
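The oracle setting can be illustrated with a small sketch. This is not the repository's evaluator: `oracle_best` and the toy `iou` on pixel sets are hypothetical names introduced here, and the actual cgF1 computation (see the SAM3 paper) is more involved. The sketch only shows the idea of scoring a prediction against each of the three annotations and keeping the most favorable result.

```python
from typing import Callable, Iterable, Set, Tuple

Pixel = Tuple[int, int]


def iou(pred: Set[Pixel], gt: Set[Pixel]) -> float:
    """Toy intersection-over-union on sets of pixel coordinates."""
    union = pred | gt
    if not union:
        return 1.0  # both masks empty: treat as perfect agreement
    return len(pred & gt) / len(union)


def oracle_best(
    pred: Set[Pixel],
    annotations: Iterable[Set[Pixel]],
    score_fn: Callable[[Set[Pixel], Set[Pixel]], float] = iou,
) -> float:
    """Score `pred` against every annotator and keep the best score (hypothetical helper)."""
    return max(score_fn(pred, gt) for gt in annotations)


# One predicted mask and three (toy) annotator masks for the same image-NP pair.
pred = {(0, 0), (0, 1), (1, 0)}
annotator_a = {(0, 0), (0, 1)}
annotator_b = {(0, 0), (0, 1), (1, 0), (1, 1)}
annotator_c = {(5, 5)}

best = oracle_best(pred, [annotator_a, annotator_b, annotator_c])
print(best)  # 0.75, obtained against annotator_b
```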
### Evaluate SAM3
We provide inference configurations to reproduce the evaluation of SAM3.
First, please edit the file [eval_base.yaml](https://github.com/facebookresearch/sam3/blob/main/sam3/train/configs/eval_base.yaml) with the paths where you downloaded the images and annotations above.
There are 7 subsets and as many configurations to be run.
Let's take the first subset as an example. The inference can be run locally using the following command (you can adjust the number of GPUs):
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml --use-cluster 0 --num-gpus 1
```
The predictions will be dumped in the folder specified in eval_base.yaml.
We also provide support for SLURM-based cluster inference. Edit the eval_base.yaml file to reflect your SLURM configuration (partition, qos, ...), then run:
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml --use-cluster 1
```
The commands for all subsets are listed below.
#### MetaCLIP captioner NPs
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml --use-cluster 1
```
#### SA-1B captioner NPs
Refer to SA-1B images for this subset. For the other 6 subsets, refer to MetaCLIP images.
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_sa1b_nps.yaml --use-cluster 1
```
#### Attributes
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_attributes.yaml --use-cluster 1
```
#### Crowded Scenes
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_crowded.yaml --use-cluster 1
```
#### Wiki-Common1K
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_wiki_common.yaml --use-cluster 1
```
#### Wiki-Food/Drink
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_fg_food.yaml --use-cluster 1
```
#### Wiki-Sports Equipment
```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_fg_sports.yaml --use-cluster 1
```
### Offline evaluation
If you have the predictions in the COCO result format (see [here](https://cocodataset.org/#format-results)), then we provide scripts to easily run the evaluation.
For an example on how to run the evaluator on all subsets and aggregate results, see the following notebook: [saco_gold_silver_eval_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_gold_silver_eval_example.ipynb)
Alternatively, you can run `python scripts/eval/gold/eval_sam3.py --gt-folder <folder_with_gts> --pred-folder <folder_with_predictions>`.
If you have a prediction file for a given subset, you can run the evaluator specifically for that one using the standalone script. Example:
```bash
python scripts/eval/standalone_cgf1.py --pred_file /path/to/coco_predictions_segm.json --gt_files /path/to/annotations/gold_metaclip_merged_a_release_test.json /path/to/annotations/gold_metaclip_merged_b_release_test.json /path/to/annotations/gold_metaclip_merged_c_release_test.json
```
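For reference, the COCO result format for segmentation is a JSON list with one entry per predicted mask, holding `image_id`, `category_id`, `segmentation` (RLE) and `score`. The sketch below writes such a file. It rests on assumptions not confirmed in this README: that `image_id` should be the `id` of the image-NP pair and that `category_id` is always 1, as in the annotation format section. The RLE string is copied from the sample annotation purely as a placeholder, and the score is made up.

```python
import json
import os
import tempfile

# One prediction for the image-NP pair with id 10000000 (assumed mapping).
predictions = [
    {
        "image_id": 10000000,
        "category_id": 1,  # assumed constant, mirroring the GT annotations
        "segmentation": {
            "size": [600, 600],
            # Placeholder RLE copied from the sample annotation in this README.
            "counts": "aWd51db05M1O2N100O1O1O1O1O1O010O100O10O10O010O010O01O100O100O1O00100O1O100O1O2MZee4",
        },
        "score": 0.9,  # illustrative confidence value
    }
]

out_path = os.path.join(tempfile.mkdtemp(), "coco_predictions_segm.json")
with open(out_path, "w") as f:
    json.dump(predictions, f)

# Read the file back to confirm that it round-trips as plain JSON.
with open(out_path) as f:
    loaded = json.load(f)
print(len(loaded), sorted(loaded[0]))
```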
# Results
Here we collect the segmentation results for SAM3 and some baselines. Note that baselines that do not produce masks are evaluated by converting their boxes to masks using SAM2.
<table style="border-color:black;border-style:solid;border-width:1px;border-collapse:collapse;border-spacing:0;text-align:right" class="tg"><thead>
<tr><th style="text-align:center"></th><th style="text-align:center" colspan="3">Average</th><th style="text-align:center" colspan="3">Captioner metaclip</th><th style="text-align:center" colspan="3">Captioner sa1b</th>
<th style="text-align:center" colspan="3">Crowded</th><th style="text-align:center" colspan="3">FG food</th><th style="text-align:center" colspan="3">FG sport</th><th style="text-align:center" colspan="3">Attributes</th>
<th style="text-align:center" colspan="3">Wiki common</th></tr>
</thead>
<tbody>
<tr><td ></td><td >cgF1</td><td >IL_MCC</td><td >positive_micro_F1</td>
<td >cgF1</td><td >IL_MCC</td><td >positive_micro_F1</td><td >cgF1</td>
<td >IL_MCC</td><td >positive_micro_F1</td><td >cgF1</td><td >IL_MCC</td>
<td >positive_micro_F1</td><td >cgF1</td><td >IL_MCC</td><td >positive_micro_F1</td>
<td >cgF1</td><td >IL_MCC</td><td >positive_micro_F1</td><td >cgF1</td>
<td >IL_MCC</td><td >positive_micro_F1</td><td >cgF1</td><td >IL_MCC</td>
<td >positive_micro_F1</td></tr>
<tr><td >gDino-T</td><td >3.25</td><td >0.15</td><td >16.2</td>
<td >2.89</td><td >0.21</td><td >13.88</td><td >3.07</td>
<td >0.2</td><td >15.35</td><td >0.28</td><td >0.08</td>
<td >3.37</td><td >0.96</td><td >0.1</td><td >9.83</td>
<td >1.12</td><td >0.1</td><td >11.2</td><td >13.75</td>
<td >0.29</td><td >47.3</td><td >0.7</td><td >0.06</td>
<td >12.14</td></tr>
<tr><td >OWLv2*</td><td >24.59</td><td >0.57</td><td >42</td>
<td >17.69</td><td >0.52</td><td >34.27</td><td >13.32</td>
<td >0.5</td><td >26.83</td><td >15.8</td><td >0.51</td>
<td >30.74</td><td >31.96</td><td >0.65</td><td >49.35</td>
<td >36.01</td><td >0.64</td><td >56.19</td><td >35.61</td>
<td >0.63</td><td >56.23</td><td >21.73</td><td >0.54</td>
<td >40.25</td></tr>
<tr><td >OWLv2</td><td >17.27</td><td >0.46</td><td >36.8</td>
<td >12.21</td><td >0.39</td><td >31.33</td><td >9.76</td>
<td >0.45</td><td >21.65</td><td >8.87</td><td >0.36</td>
<td >24.77</td><td >24.36</td><td >0.51</td><td >47.85</td>
<td >24.44</td><td >0.52</td><td >46.97</td><td >25.85</td>
<td >0.54</td><td >48.22</td><td >15.4</td><td >0.42</td>
<td >36.64</td></tr>
<tr><td >LLMDet-L</td><td >6.5</td><td >0.21</td><td >27.3</td>
<td >4.49</td><td >0.23</td><td >19.36</td><td >5.32</td>
<td >0.23</td><td >22.81</td><td >2.42</td><td >0.18</td>
<td >13.74</td><td >5.5</td><td >0.19</td><td >29.12</td>
<td >4.39</td><td >0.17</td><td >25.34</td><td >22.17</td>
<td >0.39</td><td >57.13</td><td >1.18</td><td >0.05</td>
<td >23.3</td></tr>
<tr><td >APE</td><td >16.41</td><td >0.4</td><td >36.9</td>
<td >12.6</td><td >0.42</td><td >30.11</td><td >2.23</td>
<td >0.22</td><td >10.01</td><td >7.15</td><td >0.35</td>
<td >20.3</td><td >22.74</td><td >0.51</td><td >45.01</td>
<td >31.79</td><td >0.56</td><td >56.45</td><td >26.74</td>
<td >0.47</td><td >57.27</td><td >11.59</td><td >0.29</td>
<td >39.46</td></tr>
<tr><td >DINO-X</td><td >21.26</td><td >0.38</td><td >55.2</td>
<td >17.21</td><td >0.35</td><td >49.17</td><td >19.66</td>
<td >0.48</td><td >40.93</td><td >12.86</td><td >0.34</td>
<td >37.48</td><td >30.07</td><td >0.49</td><td >61.72</td>
<td >28.36</td><td >0.41</td><td >69.4</td><td >30.97</td>
<td >0.42</td><td >74.04</td><td >9.72</td><td >0.18</td>
<td >53.52</td></tr>
<tr><td >Gemini 2.5</td><td >13.03</td><td >0.29</td><td >46.1</td>
<td >9.9</td><td >0.29</td><td >33.79</td><td >13.1</td>
<td >0.41</td><td >32.1</td><td >8.15</td><td >0.27</td>
<td >30.34</td><td >19.63</td><td >0.33</td><td >59.52</td>
<td >15.07</td><td >0.28</td><td >53.5</td><td >18.84</td>
<td >0.3</td><td >63.14</td><td >6.5</td><td >0.13</td>
<td >50.32</td></tr>
<tr><td >SAM 3</td><td >54.06</td><td >0.82</td><td >66.11</td>
<td >47.26</td><td >0.81</td><td >58.58</td><td >53.69</td>
<td >0.86</td><td >62.55</td><td >61.08</td><td >0.9</td>
<td >67.73</td><td >53.41</td><td >0.79</td><td >67.28</td>
<td >65.52</td><td >0.89</td><td >73.75</td><td >54.93</td>
<td >0.76</td><td >72</td><td >42.53</td><td >0.7</td>
<td >60.85</td></tr>
</tbody></table>
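The `Average` columns appear to be the unweighted mean over the 7 subsets. The README does not state this explicitly, so treat it as an observation: the short check below recomputes the average for the SAM 3 row from the per-subset values copied from the table.

```python
from statistics import mean

# (cgF1, IL_MCC, positive_micro_F1) per subset, copied from the SAM 3 row.
sam3_rows = {
    "Captioner metaclip": (47.26, 0.81, 58.58),
    "Captioner sa1b": (53.69, 0.86, 62.55),
    "Crowded": (61.08, 0.90, 67.73),
    "FG food": (53.41, 0.79, 67.28),
    "FG sport": (65.52, 0.89, 73.75),
    "Attributes": (54.93, 0.76, 72.00),
    "Wiki common": (42.53, 0.70, 60.85),
}

# Transpose to get one column per metric, then take the plain mean of each.
average = tuple(round(mean(column), 2) for column in zip(*sam3_rows.values()))
print(average)  # (54.06, 0.82, 66.11), matching the Average columns
```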
# Annotation format
The annotation format is derived from [COCO format](https://cocodataset.org/#format-data). Notable data fields are:
- `images`: a `list` of `dict` features, containing all image-NP pairs. Each entry corresponds to one image-NP pair and has the following items.
  - `id`: an `int` feature, unique identifier for the image-NP pair
  - `text_input`: a `string` feature, the noun phrase for the image-NP pair
  - `file_name`: a `string` feature, the relative image path in the corresponding data folder.
  - `height`/`width`: dimensions of the image
  - `is_instance_exhaustive`: Boolean (0 or 1). If it is 1, then all the instances are correctly annotated. For instance segmentation, we only use those datapoints. Otherwise, there may be either missing instances or crowd segments (a segment covering multiple instances).
  - `is_pixel_exhaustive`: Boolean (0 or 1). If it is 1, then the union of all masks covers all pixels corresponding to the prompt. This is weaker than `is_instance_exhaustive` since it allows crowd segments. It can be used for semantic segmentation evaluations.
- `annotations`: a `list` of `dict` features, containing all annotations, including bounding box, segmentation mask, area etc.
  - `image_id`: an `int` feature, maps to the identifier of the image-NP pair in `images`
  - `bbox`: a `list` of float features, containing the bounding box in [x,y,w,h] format, normalized by the image dimensions
  - `segmentation`: a `dict` feature, containing the segmentation mask in RLE format
  - `category_id`: For compatibility with the COCO format. Will always be 1 and is unused.
  - `iscrowd`: Boolean (0 or 1). If 1, then the segment overlaps several instances (used in cases where instances are not separable, e.g. due to poor image quality)
- `categories`: a `list` of `dict` features, containing all categories. Here, we provide the category key for compatibility with the COCO format, but in open-vocabulary detection we do not use it. Instead, the text prompt is stored directly in each image (`text_input` in `images`). Note that in our setting, a unique image (`id` in `images`) actually corresponds to an (image, text prompt) combination.

We refer to an `id` in `images` that has corresponding annotations (i.e. it appears as `image_id` in `annotations`) as a "positive" NP, and to an `id` in `images` without any annotations (i.e. it does not appear as `image_id` in `annotations`) as a "negative" NP.
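The positive/negative distinction can be sketched in a few lines. The helper name `split_positive_negative` is hypothetical, and the dictionary is a trimmed toy example using only the fields described above.

```python
from typing import Dict, List, Tuple


def split_positive_negative(gt: Dict) -> Tuple[List[Dict], List[Dict]]:
    """Split image-NP pairs by whether any annotation references their id."""
    annotated_ids = {ann["image_id"] for ann in gt["annotations"]}
    positives = [img for img in gt["images"] if img["id"] in annotated_ids]
    negatives = [img for img in gt["images"] if img["id"] not in annotated_ids]
    return positives, negatives


# Toy ground truth: two image-NP pairs, only the first one has masks.
gt = {
    "images": [
        {"id": 10000000, "text_input": "chili"},
        {"id": 10000001, "text_input": "the fish ball"},
    ],
    "annotations": [
        {"id": 1, "image_id": 10000000},
        {"id": 2, "image_id": 10000000},
    ],
}

positives, negatives = split_positive_negative(gt)
print([img["text_input"] for img in positives])  # ['chili']
print([img["text_input"] for img in negatives])  # ['the fish ball']
```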
A sample annotation from Wiki-Food/Drink domain looks as follows:
#### images
```
[
{
"id": 10000000,
"file_name": "1/1001/metaclip_1_1001_c122868928880ae52b33fae1.jpeg",
"text_input": "chili",
"width": 600,
"height": 600,
"queried_category": "0",
"is_instance_exhaustive": 1,
"is_pixel_exhaustive": 1
},
{
"id": 10000001,
"file_name": "1/1001/metaclip_1_1001_c122868928880ae52b33fae1.jpeg",
"text_input": "the fish ball",
"width": 600,
"height": 600,
"queried_category": "2001",
"is_instance_exhaustive": 1,
"is_pixel_exhaustive": 1
}
]
```
#### annotations
```
[
{
"id": 1,
"image_id": 10000000,
"source": "manual",
"area": 0.002477777777777778,
"bbox": [
0.44333332777023315,
0.0,
0.10833333432674408,
0.05833333358168602
],
"segmentation": {
"counts": "`kk42fb01O1O1O1O001O1O1O001O1O00001O1O001O001O0000000000O1001000O010O02O001N10001N0100000O10O1000O10O010O100O1O1O1O1O0000001O0O2O1N2N2Nobm4",
"size": [
600,
600
]
},
"category_id": 1,
"iscrowd": 0
},
{
"id": 2,
"image_id": 10000000,
"source": "manual",
"area": 0.001275,
"bbox": [
0.5116666555404663,
0.5716666579246521,
0.061666667461395264,
0.036666665226221085
],
"segmentation": {
"counts": "aWd51db05M1O2N100O1O1O1O1O1O010O100O10O10O010O010O01O100O100O1O00100O1O100O1O2MZee4",
"size": [
600,
600
]
},
"category_id": 1,
"iscrowd": 0
}
]
```
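Because `bbox` is normalized by the image dimensions, it has to be scaled back before drawing or comparing with pixel-space boxes. A minimal sketch follows, using the first sample annotation. `bbox_to_pixels` is a hypothetical helper, and treating `area` as normalized by `width * height` is an assumption that merely looks consistent with the sample values.

```python
from typing import List, Sequence


def bbox_to_pixels(bbox: Sequence[float], width: int, height: int) -> List[float]:
    """Convert a normalized [x, y, w, h] box to pixel units."""
    x, y, w, h = bbox
    return [x * width, y * height, w * width, h * height]


# Values copied from the first sample annotation (600x600 image).
bbox = [0.44333332777023315, 0.0, 0.10833333432674408, 0.05833333358168602]
pixels = bbox_to_pixels(bbox, width=600, height=600)
print([round(v, 2) for v in pixels])  # [266.0, 0.0, 65.0, 35.0]

# Assumption: `area` is normalized by the total number of pixels.
area_px = 0.002477777777777778 * 600 * 600
print(round(area_px))  # 892
```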
# Data Stats
Here are the stats for the 7 annotation domains. The `# Image-NPs` column gives the total number of unique image-NP pairs, including both “positive” and “negative” NPs.
| Domain | Media | # Image-NPs | # Image-NP-Masks|
|--------------------------|--------------|---------------| ----------------|
| MetaCLIP captioner NPs | MetaCLIP | 33393 | 20144 |
| SA-1B captioner NPs | SA-1B | 13258 | 30306 |
| Attributes | MetaCLIP | 9245 | 3663 |
| Crowded Scenes | MetaCLIP | 20687 | 50417 |
| Wiki-Common1K | MetaCLIP | 65502 | 6448 |
| Wiki-Food/Drink | MetaCLIP | 13951 | 9825 |
| Wiki-Sports Equipment | MetaCLIP | 12166 | 5075 |


@@ -0,0 +1,104 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
"""Script to run the evaluator offline given the GTs for the SA-Co/Gold test set and SAM3 model prediction files.
It reports the cgF1, IL_MCC and pmF1 metrics for each subset of the SA-Co/Gold test set.
Usage: python eval_sam3.py --gt-folder <folder_with_gts> --pred-folder <folder_with_predictions>
"""
import argparse
import os
from sam3.eval.cgf1_eval import CGF1Evaluator
# Relative file names for GT files for 7 SA-Co/Gold subsets
saco_gold_gts = {
    # MetaCLIP Captioner
    "metaclip_nps": [
        "gold_metaclip_merged_a_release_test.json",
        "gold_metaclip_merged_b_release_test.json",
        "gold_metaclip_merged_c_release_test.json",
    ],
    # SA-1B captioner
    "sa1b_nps": [
        "gold_sa1b_merged_a_release_test.json",
        "gold_sa1b_merged_b_release_test.json",
        "gold_sa1b_merged_c_release_test.json",
    ],
    # Crowded
    "crowded": [
        "gold_crowded_merged_a_release_test.json",
        "gold_crowded_merged_b_release_test.json",
        "gold_crowded_merged_c_release_test.json",
    ],
    # FG Food
    "fg_food": [
        "gold_fg_food_merged_a_release_test.json",
        "gold_fg_food_merged_b_release_test.json",
        "gold_fg_food_merged_c_release_test.json",
    ],
    # FG Sports
    "fg_sports_equipment": [
        "gold_fg_sports_equipment_merged_a_release_test.json",
        "gold_fg_sports_equipment_merged_b_release_test.json",
        "gold_fg_sports_equipment_merged_c_release_test.json",
    ],
    # Attributes
    "attributes": [
        "gold_attributes_merged_a_release_test.json",
        "gold_attributes_merged_b_release_test.json",
        "gold_attributes_merged_c_release_test.json",
    ],
    # Wiki common
    "wiki_common": [
        "gold_wiki_common_merged_a_release_test.json",
        "gold_wiki_common_merged_b_release_test.json",
        "gold_wiki_common_merged_c_release_test.json",
    ],
}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-g",
        "--gt-folder",
        type=str,
        help="Path to the folder containing the ground truth json files.",
    )
    parser.add_argument(
        "-p",
        "--pred-folder",
        type=str,
        help="Path to the folder containing the predictions json files.",
    )
    args = parser.parse_args()

    results = ""

    for subset_name, gts in saco_gold_gts.items():
        print("Processing subset: ", subset_name)
        gt_paths = [os.path.join(args.gt_folder, gt) for gt in gts]
        evaluator = CGF1Evaluator(
            gt_path=gt_paths, verbose=True, iou_type="segm"
        )  # change to bbox if you want detection performance

        pred_path = os.path.join(
            args.pred_folder,
            f"gold_{subset_name}/dumps/gold_{subset_name}/coco_predictions_segm.json",
        )
        summary = evaluator.evaluate(pred_path)

        cgf1 = str(round(summary["cgF1_eval_segm_cgF1"] * 100, 2))
        il_mcc = str(round(summary["cgF1_eval_segm_IL_MCC"], 2))
        pmf1 = str(round(summary["cgF1_eval_segm_positive_micro_F1"] * 100, 2))
        final_str = f"{cgf1},{il_mcc},{pmf1}"
        results += subset_name + ": " + final_str + "\n"

    print("Subset name, CG_F1, IL_MCC, pmF1")
    print(results)


if __name__ == "__main__":
    main()


@@ -0,0 +1,24 @@
path_annotations: <YOUR_ANNOTATIONS_PATH>/saco_frames_test_sets/annotations/
# Paths with downloaded data
droid_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/droid/
sav_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/sav/
ego4d_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/ego4d/
yt1b_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/yt1b/
# Configuration to download and extract video frames
cookies_path: <YOUR_COOKIES_PATH>/cookies.txt # Required to download YT1B videos
update_annotation_yt1b: true
update_annotation_ego4d: true
sav_videos_fps_6_download_path: ''
remove_downloaded_videos_yt1b: false
remove_downloaded_videos_droid: false
remove_downloaded_videos_ego4d: false
remove_downloaded_videos_sav: false
# Configuration for visualization of data
num_images_show: 5
saco_subset_show: yt1b # Options: [yt1b, ego4d, sav, droid]
directory_save: <YOUR_SAVE_DIR>


@@ -0,0 +1,405 @@
# SA-Co/Silver benchmark
SA-Co/Silver is a benchmark for promptable concept segmentation (PCS) in images. The benchmark contains images paired with text labels (also referred to as Noun Phrases, or NPs), each annotated exhaustively with masks on all object instances that match the label.
SA-Co/Silver comprises 10 subsets covering a diverse array of domains, including food, art, robotics and driving. Unlike SA-Co/Gold, there is only a single ground truth for each datapoint, so the results may have more variance and tend to underestimate model performance, since they do not account for different valid interpretations of each query.
- BDD100k
- DROID
- Ego4D
- MyFoodRepo-273
- GeoDE
- iNaturalist-2017
- National Gallery of Art
- SA-V
- YT-Temporal-1B
- Fathomnet
This README contains instructions on how to download and set up the annotations and image data to prepare them for evaluation on SA-Co/Silver.
# Preparation
## Download annotations
The ground-truth (GT) annotations can be downloaded from [Hugging Face](https://huggingface.co/datasets/facebook/SACo-Silver) or [Roboflow](https://universe.roboflow.com/sa-co-silver).
## Download images and video frames
### Image Datasets
#### GeoDE
The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/geode/), or you can follow the steps below to prepare them.
1. Download the dataset with raw images from [GeoDE](https://geodiverse-data-collection.cs.princeton.edu/).
2. Extract the downloaded file to a location, say `<RAW_GEODE_IMAGES_FOLDER>`.
3. Run the command below to pre-process the images and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_GEODE_IMAGES_FOLDER>`.
```
python preprocess_silver_geode_bdd100k_food_rec.py --annotation_file <FOLDER_WITH_SILVER_ANNOTATIONS>/silver_geode_merged_test.json --raw_images_folder <RAW_GEODE_IMAGES_FOLDER> --processed_images_folder <PROCESSED_GEODE_IMAGES_FOLDER> --dataset_name geode
```
#### National Gallery of Art (NGA)
The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/national-gallery-of-art/), or you can follow the step below to prepare them.
1. Run the command below to download the raw images and pre-process them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_NGA_IMAGES_FOLDER>`.
```
python download_preprocess_nga.py --annotation_file <FOLDER_WITH_SILVER_ANNOTATIONS>/silver_nga_art_merged_test.json --raw_images_folder <RAW_NGA_IMAGES_FOLDER> --processed_images_folder <PROCESSED_NGA_IMAGES_FOLDER>
```
#### Berkeley Driving Dataset (BDD) 100k
The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/bdd100k-gwmh6/), or you can follow the steps below to prepare them.
1. Download the raw images from the `100K Images` dataset in [BDD100k](http://bdd-data.berkeley.edu/download.html).
2. Extract the downloaded file to a location, say `<RAW_BDD_IMAGES_FOLDER>`.
3. Run the command below to pre-process the images and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_BDD_IMAGES_FOLDER>`.
```
python preprocess_silver_geode_bdd100k_food_rec.py --annotation_file <FOLDER_WITH_SILVER_ANNOTATIONS>/silver_bdd100k_merged_test.json --raw_images_folder <RAW_BDD_IMAGES_FOLDER> --processed_images_folder <PROCESSED_BDD_IMAGES_FOLDER> --dataset_name bdd100k
```
#### Food Recognition Challenge 2022
1. Download the raw images from the [website](https://www.aicrowd.com/challenges/food-recognition-benchmark-2022): get the `[Round 2] public_validation_set_2.0.tar.gz` file.
2. Extract the downloaded file to a location, say `<RAW_FOOD_IMAGES_FOLDER>`.
3. Run the command below to pre-process the images and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_FOOD_IMAGES_FOLDER>`.
```
python preprocess_silver_geode_bdd100k_food_rec.py --annotation_file <FOLDER_WITH_SILVER_ANNOTATIONS>/silver_food_rec_merged_test.json --raw_images_folder <RAW_FOOD_IMAGES_FOLDER> --processed_images_folder <PROCESSED_FOOD_IMAGES_FOLDER> --dataset_name food_rec
```
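GeoDE, BDD100k and Food Recognition share the same pre-processing script and differ only in the annotation file and the `--dataset_name` value. As a convenience, the three invocations can be generated from one table. This is only a sketch: `build_preprocess_command` and the folder arguments are hypothetical, while the script name, flags and annotation file names are taken from the commands in this README.

```python
import shlex
from typing import List

# dataset_name -> annotation file, as used in the commands of this README.
ANNOTATION_FILES = {
    "geode": "silver_geode_merged_test.json",
    "bdd100k": "silver_bdd100k_merged_test.json",
    "food_rec": "silver_food_rec_merged_test.json",
}


def build_preprocess_command(
    dataset_name: str, annotations_dir: str, raw_dir: str, processed_dir: str
) -> List[str]:
    """Assemble the argument list for one dataset (hypothetical helper)."""
    annotation_file = f"{annotations_dir}/{ANNOTATION_FILES[dataset_name]}"
    return [
        "python",
        "preprocess_silver_geode_bdd100k_food_rec.py",
        "--annotation_file", annotation_file,
        "--raw_images_folder", raw_dir,
        "--processed_images_folder", processed_dir,
        "--dataset_name", dataset_name,
    ]


# Placeholder folders; replace them with your own locations.
cmd = build_preprocess_command("geode", "/data/annotations", "/data/raw/geode", "/data/processed/geode")
print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the folders point at real data.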
#### iNaturalist
The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/inaturalist-2017/), or you can follow the step below to prepare them.
1. Run the command below to download and extract the images into `<RAW_INATURALIST_IMAGES_FOLDER>` and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_INATURALIST_IMAGES_FOLDER>`.
```
python download_inaturalist.py --raw_images_folder <RAW_INATURALIST_IMAGES_FOLDER> --processed_images_folder <PROCESSED_INATURALIST_IMAGES_FOLDER>
```
#### Fathomnet
The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/fathomnet-kmz5d/), or you can follow the steps below to prepare them.
1. Install the FathomNet API:
```
pip install fathomnet
```
2. Run the command below to download the images and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_FATHOMNET_IMAGES_FOLDER>`.
```
python download_fathomnet.py --processed_images_folder <PROCESSED_FATHOMNET_IMAGES_FOLDER>
```
### Frame Datasets
These datasets correspond to annotations for individual frames coming from videos. The file `CONFIG_FRAMES.yaml` is used to unify the downloads for the datasets, as explained below.
Before following the other dataset steps, update `CONFIG_FRAMES.yaml` with the correct `path_annotations` path where the annotation files are.
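Since `CONFIG_FRAMES.yaml` ships with `<YOUR_...>` placeholders, a quick check that none are left can save a failed download. The sketch below scans the file text with a regular expression instead of parsing YAML, so it needs no extra dependency. `find_unset_placeholders` is a hypothetical helper, and the example text is an abbreviated, partially filled-in copy of the config.

```python
import re
from typing import List, Tuple

PLACEHOLDER = re.compile(r"<YOUR_[A-Z_]+>")


def find_unset_placeholders(config_text: str) -> List[Tuple[int, str, str]]:
    """Return (line number, key, placeholder) for every value still unset."""
    unset = []
    for line_no, line in enumerate(config_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore trailing comments
        match = PLACEHOLDER.search(code)
        if match:
            key = code.split(":", 1)[0].strip()
            unset.append((line_no, key, match.group(0)))
    return unset


example = """path_annotations: /data/saco/annotations/
droid_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/droid/
cookies_path: <YOUR_COOKIES_PATH>/cookies.txt # Required to download YT1B videos
"""

unset = find_unset_placeholders(example)
for line_no, key, placeholder in unset:
    print(f"line {line_no}: {key} still contains {placeholder}")
```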
#### DROID
The processed frames needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/droid-cfual/), or you can follow the steps below to prepare them.
1. Install the gsutil package:
```bash
pip install gsutil
```
2. Modify the `droid_path` variable in `CONFIG_FRAMES.yaml`. This is the path where the DROID data will be downloaded.
3. _\[Optional\]_ Update the variable `remove_downloaded_videos_droid` to (not) remove the videos after the frames have been extracted.
4. Download the data:
```bash
python download_videos.py droid
```
5. Extract the frames:
```bash
python extract_frames.py droid
```
See the [DROID website](https://droid-dataset.github.io/droid/the-droid-dataset#-using-the-dataset) for more information.
#### SA-V
The processed frames needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/sa-v), or you can follow the steps below to prepare them.
1. Follow instructions in the [Segment Anything official website](https://ai.meta.com/datasets/segment-anything-video-downloads/) to obtain access to the download links (they are dynamic links).
2. Update `CONFIG_FRAMES.yaml`:
- Update the `sav_path` variable, where the frames will be saved.
- Update the `sav_videos_fps_6_download_path` variable. Copy paste the path corresponding to the `videos_fps_6.tar` in the list that you obtained in step 1.
- _\[Optional\]_ Update the variable `remove_downloaded_videos_sav` to (not) remove the videos after the frames have been extracted.
3. Download the videos:
```bash
python download_videos.py sav
```
4. Extract the frames:
```
python extract_frames.py sav
```
#### Ego4D
The processed frames needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/ego4d-w7fiu/), or you can follow the steps below to prepare them.
1. Review and accept the license agreement in the [official Ego4D website](https://ego4d-data.org/docs/start-here/#license-agreement).
2. Configure AWS credentials. Run:
```bash
pip install awscli
aws configure
```
and copy the values shown in the email you received after step 1 (you can leave "region name" and "output format" empty). You can verify that the variables were set up correctly:
```bash
cat ~/.aws/credentials
```
3. Install the Ego4D library:
```bash
pip install ego4d
```
4. Update `CONFIG_FRAMES.yaml`:
- Set up AWS credentials following the instructions in the email you received after step 1. Modify the following variables: `aws_access_key_id` and `aws_secret_access_key`.
- Update the `ego4d_path` variable, where the frames will be saved.
- _\[Optional\]_ Update the variable `remove_downloaded_videos_ego4d` to (not) remove the videos after the frames have been extracted.
5. Download the `clips` subset of the Ego4D dataset:
```bash
python download_videos.py ego4d
```
6. Extract the frames:
```
python extract_frames.py ego4d
```
See the [official CLI](https://ego4d-data.org/docs/CLI/) and the [explanation about the videos](https://ego4d-data.org/docs/data/videos/) for more information.
#### YT1B
The processed frames needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/yt-temporal-1b/), or you can follow the steps below to prepare them.
1. Install the yt-dlp library:
```bash
python3 -m pip install -U "yt-dlp[default]"
```
2. Create a `cookies.txt` file following the yt-dlp instructions in [exporting-youtube-cookies](https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies) and [pass-cookies-to-yt-dlp](https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp). This is required to download YouTube videos. Then set the variable `cookies_path` in `CONFIG_FRAMES.yaml` to the path of that file.
3. Update `CONFIG_FRAMES.yaml`:
- Update the `yt1b_path`, where the frames will be saved.
- _\[Optional\]_ Some YouTube videos may not be available on YouTube anymore. Set `update_annotation_yt1b` to `True` in `CONFIG_FRAMES.yaml` to remove the annotations corresponding to such videos. Note that the evaluations will not be directly comparable with other reported evaluations.
- _\[Optional\]_ Update the variable `remove_downloaded_videos_yt1b` to (not) remove the videos after the frames have been extracted.
4. Run the following code to download the videos:
```
python download_videos.py yt1b
```
5. Extract the frames:
```
python extract_frames.py yt1b
```
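The optional `update_annotation_yt1b` setting described in step 3 removes annotations for videos that are no longer available. How the repository implements this is not shown here. The sketch below only illustrates the idea, under the assumption that the Silver annotation files use the same COCO-derived layout as Gold (`images` entries with `id` and `file_name`, `annotations` entries with `image_id`). `drop_unavailable` is a hypothetical helper and the file names are made up.

```python
from typing import Dict, Set


def drop_unavailable(gt: Dict, available_files: Set[str]) -> Dict:
    """Keep only image-NP pairs whose frame exists, plus their annotations."""
    kept_images = [img for img in gt["images"] if img["file_name"] in available_files]
    kept_ids = {img["id"] for img in kept_images}
    kept_annotations = [ann for ann in gt["annotations"] if ann["image_id"] in kept_ids]
    return {**gt, "images": kept_images, "annotations": kept_annotations}


# Toy annotation file: the frame of the second video could not be downloaded.
gt = {
    "images": [
        {"id": 1, "file_name": "video_a/00001.jpg", "text_input": "dog"},
        {"id": 2, "file_name": "video_b/00007.jpg", "text_input": "cat"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1},
        {"id": 11, "image_id": 2},
    ],
}

filtered = drop_unavailable(gt, available_files={"video_a/00001.jpg"})
print(len(filtered["images"]), len(filtered["annotations"]))  # 1 1
```

As the README notes, scores computed on a filtered annotation file are not directly comparable with other reported evaluations.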
# Usage
## Visualization
- Visualize GT annotations: [saco_gold_silver_vis_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_gold_silver_vis_example.ipynb)
## Run evaluation
The official metric for SA-Co/Silver is cgF1. Please refer to the SAM3 paper for details.
Unlike SA-Co/Gold, SA-Co/Silver has only a single annotation per datapoint. Performance may therefore be underestimated, because the model may be wrongly penalized for choosing an interpretation that is valid but differs from that of the human annotator.
### Evaluate SAM3
We provide inference configurations to reproduce the evaluation of SAM3.
First, please edit the file [eval_base.yaml](https://github.com/facebookresearch/sam3/blob/main/sam3/train/configs/eval_base.yaml) with the paths where you downloaded the images and annotations above.
There are 10 subsets and as many configurations to be run.
Let's take the first subset as an example. The inference can be run locally using the following command (you can adjust the number of GPUs):
```bash
python sam3/train/train.py -c configs/silver_image_evals/sam3_gold_image_bdd100k.yaml --use-cluster 0 --num-gpus 1
```
The predictions will be dumped in the folder specified in eval_base.yaml.
We also provide support for SLURM-based cluster inference. Edit the eval_base.yaml file to reflect your SLURM configuration (partition, qos, ...), then run:
```bash
python sam3/train/train.py -c configs/silver_image_evals/sam3_gold_image_bdd100k.yaml --use-cluster 1
```
### Offline evaluation
If you have the predictions in the COCO result format (see [here](https://cocodataset.org/#format-results)), then we provide scripts to easily run the evaluation.
For an example on how to run the evaluator on all subsets and aggregate results, see the following notebook: [saco_gold_silver_eval_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_gold_silver_eval_example.ipynb)
If you have a prediction file for a given subset, you can run the evaluator specifically for that one using the standalone script. Example:
```bash
python scripts/eval/standalone_cgf1.py --pred_file /path/to/coco_predictions_segm.json --gt_files /path/to/annotations/silver_bdd100k_merged_test.json
```
# Results
<table style="border-color:black;border-style:solid;border-width:1px;border-collapse:collapse;border-spacing:0;text-align:right" class="tg"><thead>
<tr style="text-align:center">
<th></th>
<th colspan="3">Average</th>
<th colspan="3">BDD100k</th>
<th colspan="3">Droids</th>
<th colspan="3">Ego4d</th>
<th colspan="3">Food Rec</th>
<th colspan="3">Geode</th>
<th colspan="3">iNaturalist</th>
<th colspan="3">Nga Art</th>
<th colspan="3">SAV</th>
<th colspan="3">YT1B</th>
<th colspan="3">Fathomnet</th>
</tr></thead>
<tbody>
<tr>
<td></td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
<td>CGF1</td>
<td>IL_MCC</td>
<td>pmF1</td>
</tr>
<tr>
<td>gDino-T</td> <td>3.09</td> <td>0.12</td> <td>19.75</td> <td>3.33</td> <td>0.17</td> <td>19.54</td> <td>4.26</td> <td>0.15</td> <td>28.38</td> <td>2.87</td> <td>0.1</td>
<td>28.72</td> <td>0.69</td> <td>0.05</td> <td>13.88</td> <td>9.61</td> <td>0.24</td> <td>40.03</td> <td>0</td> <td>0</td> <td>1.97</td> <td>1.31</td> <td>0.09</td>
<td>14.57</td> <td>5.18</td> <td>0.19</td> <td>27.25</td> <td>3.6</td> <td>0.16</td> <td>22.5</td> <td>0</td> <td>0</td> <td>0.64</td>
</tr>
<tr>
<td>OWLv2*</td> <td>11.23</td> <td>0.32</td> <td>31.18</td> <td>14.97</td> <td>0.46</td> <td>32.34</td> <td>10.84</td> <td>0.36</td> <td>30.1</td> <td>7.36</td> <td>0.23</td>
<td>31.99</td> <td>19.35</td> <td>0.44</td> <td>43.98</td> <td>27.04</td> <td>0.5</td> <td>54.07</td> <td>3.92</td> <td>0.14</td> <td>27.98</td> <td>8.05</td> <td>0.31</td>
<td>25.98</td> <td>10.59</td> <td>0.32</td> <td>33.1</td> <td>10.15</td> <td>0.38</td> <td>26.7</td> <td>0.04</td> <td>0.01</td> <td>5.57</td>
</tr>
<tr>
<td>OWLv2</td> <td>8.18</td> <td>0.23</td> <td>32.55</td> <td>8.5</td> <td>0.31</td> <td>27.79</td> <td>7.21</td> <td>0.25</td> <td>28.84</td> <td>5.64</td> <td>0.18</td>
<td>31.35</td> <td>14.18</td> <td>0.32</td> <td>44.32</td> <td>13.04</td> <td>0.28</td> <td>46.58</td> <td>3.62</td> <td>0.1</td> <td>36.23</td> <td>7.22</td> <td>0.25</td>
<td>28.88</td> <td>10.86</td> <td>0.32</td> <td>33.93</td> <td>11.7</td> <td>0.35</td> <td>33.43</td> <td>-0.14</td> <td>-0.01</td> <td>14.15</td>
</tr>
<tr>
<td>LLMDet-L</td> <td>6.73</td> <td>0.17</td> <td>28.19</td> <td>1.69</td> <td>0.08</td> <td>19.97</td> <td>2.56</td> <td>0.1</td> <td>25.59</td> <td>2.39</td>
<td>0.08</td> <td>29.92</td> <td>0.98</td> <td>0.06</td> <td>16.26</td> <td>20.82</td> <td>0.37</td> <td>56.26</td> <td>27.37</td> <td>0.46</td> <td>59.5</td>
<td>2.17</td> <td>0.13</td> <td>16.68</td> <td>5.37</td> <td>0.19</td> <td>28.26</td> <td>3.73</td> <td>0.16</td> <td>23.32</td> <td>0.24</td> <td>0.04</td> <td>6.1</td>
</tr>
<tr>
<td>Gemini 2.5</td> <td>9.67</td> <td>0.19</td> <td>45.51</td> <td>5.83</td> <td>0.19</td> <td>30.66</td> <td>5.61</td> <td>0.14</td> <td>40.07</td>
<td>0.38</td> <td>0.01</td> <td>38.14</td> <td>10.92</td> <td>0.24</td> <td>45.52</td> <td>18.28</td> <td>0.26</td> <td>70.29</td> <td>26.57</td> <td>0.36</td>
<td>73.81</td> <td>8.18</td> <td>0.2</td> <td>40.91</td> <td>9.48</td> <td>0.22</td> <td>43.1</td> <td>8.66</td> <td>0.23</td> <td>37.65</td> <td>2.8</td>
<td>0.08</td> <td>34.99</td>
</tr>
<tr> <td>SAM3</td> <td>49.57</td> <td>0.76</td> <td>65.17</td> <td>46.61</td> <td>0.78</td> <td>60.13</td> <td>45.58</td> <td>0.76</td>
<td>60.35</td> <td>38.64</td> <td>0.62</td> <td>62.56</td> <td>52.96</td> <td>0.79</td> <td>67.21</td> <td>70.07</td> <td>0.89</td>
<td>78.73</td> <td>65.8</td> <td>0.82</td> <td>80.67</td> <td>38.06</td> <td>0.66</td> <td>57.62</td> <td>44.36</td> <td>0.67</td>
<td>66.05</td> <td>42.07</td> <td>0.72</td> <td>58.36</td> <td>51.53</td> <td>0.86</td> <td>59.98</td>
</tr>
</tbody></table>
# Annotation format
The annotation format is derived from the [COCO format](https://cocodataset.org/#format-data). Notable data fields are:
- `images`: a `list` of `dict` features, one per image-NP pair. Each entry has the following items:
  - `id`: an `int` feature, unique identifier for the image-NP pair
  - `text_input`: a `string` feature, the noun phrase for the image-NP pair
  - `file_name`: a `string` feature, the relative image path in the corresponding data folder
  - `height`/`width`: dimensions of the image
  - `is_instance_exhaustive`: Boolean (0 or 1). If 1, all instances are correctly annotated. For instance segmentation, we only use those datapoints. Otherwise, there may be either missing instances or crowd segments (a segment covering multiple instances).
  - `is_pixel_exhaustive`: Boolean (0 or 1). If 1, the union of all masks covers all pixels corresponding to the prompt. This is weaker than `is_instance_exhaustive` since it allows crowd segments. It can be used for semantic segmentation evaluations.
- `annotations`: a `list` of `dict` features, containing all annotations, including bounding box, segmentation mask, area, etc.
  - `image_id`: an `int` feature, maps to the identifier (`id`) of the image-NP pair in `images`
  - `bbox`: a `list` of `float` features, containing the bounding box in [x,y,w,h] format, normalized by the image dimensions
  - `segmentation`: a `dict` feature, containing the segmentation mask in RLE format
  - `category_id`: For compatibility with the COCO format. Will always be 1 and is unused.
  - `iscrowd`: Boolean (0 or 1). If 1, the segment overlaps several instances (used in cases where instances are not separable, e.g. due to poor image quality)
- `categories`: a `list` of `dict` features, containing all categories. We provide the category key for compatibility with the COCO format, but in open-vocabulary detection we do not use it. Instead, the text prompt is stored directly in each image (`text_input` in `images`). Note that in our setting, a unique image (`id` in `images`) actually corresponds to an (image, text prompt) combination.
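Because `bbox` is stored normalized, it has to be rescaled before it can be drawn on the image or compared with pixel-space boxes. Here is a minimal sketch, assuming that x and w are normalized by the image width and y and h by the image height; the helper name `bbox_to_pixels` is hypothetical.

```python
def bbox_to_pixels(bbox, width, height):
    """Convert a normalized [x, y, w, h] box to pixel units.

    Assumption: x and w were divided by the image width, and y and h by
    the image height.
    """
    x, y, w, h = bbox
    return [x * width, y * height, w * width, h * height]
```

The `width` and `height` values come from the `images` entry whose `id` equals the annotation's `image_id`.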
An `id` in `images` that has corresponding annotations (i.e. it appears as an `image_id` in `annotations`) is referred to as a "positive" NP. An `id` in `images` without any annotations (i.e. it does not appear as an `image_id` in `annotations`) is referred to as a "negative" NP.
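The positive/negative distinction can be recovered from a loaded annotation file with a set lookup. The following is an illustrative sketch; the function name `split_positive_negative` is hypothetical, and `dataset` is assumed to be the parsed JSON dict with the `images` and `annotations` keys described above.

```python
def split_positive_negative(dataset):
    """Split image-NP pairs into positives (with masks) and negatives (without)."""
    annotated_ids = {ann["image_id"] for ann in dataset["annotations"]}
    positives = [img for img in dataset["images"] if img["id"] in annotated_ids]
    negatives = [img for img in dataset["images"] if img["id"] not in annotated_ids]
    return positives, negatives
```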
A sample annotation from the DROID domain looks as follows:
#### images
```
[
  {
    "id": 10000000,
    "file_name": "AUTOLab_failure_2023-07-07_Fri_Jul__7_18:50:36_2023_recordings_MP4_22008760/00002.jpg",
    "text_input": "the large wooden table",
    "width": 1280,
    "height": 720,
    "queried_category": "3",
    "is_instance_exhaustive": 1,
    "is_pixel_exhaustive": 1
  }
]
```
#### annotations
```
[
{
"area": 0.17324327256944444,
"id": 1,
"image_id": 10000000,
"source": "created by SAM3",
"bbox": [
0.03750000149011612,
0.5083333253860474,
0.8382812738418579,
0.49166667461395264
],
"segmentation": {
"counts": "[^R11]f03O0O100O2N100O1O100O100O100O100O1O100O100O100O100O100O1O10000O1O10000O1O100O10000O1O100O100O100O100O100O100O100O100O100O100O1O100O100O10000O100O100O100O101N100O1O011O0O1O101OO0010O100O1O100O2OO0100O100O100O100O100O10000O100O100O1O100O10000O1O100O100O100O10000O1O100O100O100O10000O1O10000O1O100O100O100O100O100O100O1O100O100O100O100O100O100O100O100O100O100O100O100O100O100O10000O100O100O1O100O10000O100O100O100O100O1O100O100O100O100O100O100O10O0100O100O2O000O1O10000O1O10000O100O100O100O1O100O100O100O100O100O100O100O100O100O100O100O100O1O100O100O100O10000O100O100O100O100O100O100O100O100O100O100O100O100O100O10000O100O100O100O100O100O100O1O10000O1O10000O100O1O100O100O100O100O100O100O100O100O10000O1O100O100O100O100O1O10000O10\\MP@hNo?W1U@gNk?X1W@gNh?Y1Z@fNf?Y1\\@fNc?[1^@dNb?[1`@dN_?]1b@bN^?]1e@aNZ?_1i@_NW?a1l@\\NS?d1RAXNn>h1TAVNk>k1VATNj>k1XATNg>m1YASNg>m1YASNf>m1[ASNe>m1[ASNd>m1]ASNc>m1]ASNb>l1`ATN`>i1cAWN\\>d1jA\\NV>_1oAaNP>^1RBbNn=\\1TBdNk=\\1VBdNj=1`@dNGO02P2Z1h=L_AfNj0^1g=FmC;R<EoC;Q<DPD<o;DRD<n;DQD=n;DjAnN?^1g=DhAQO?\\1h=DhAUO<W1l=EeAZO:R1P>F]ABa0h0Q>Hd@lNDV1e17S>k1iAWNW>i1hAXNW>j1gAWNY>i1fAXNY>j1eAWNZ>k1dAVN\\>k1bAVN^>k1`AVN_>l1`ATN`>m1^ATNa>o1]AQNc>P2[AQNd>P2\\APNd>Q2[AoMd>R2[AoMd>R2\\AnMd>S2ZAnMe>S2[AmMe>T2YAmMf>T2YAmMg>T2WAmMh>U2VAlMj>U2TAlMl>U2PAnMo>U2j@PNV?e4O100O100O100O100O100O100O100O100O100O100O100O100O101N100O100O10O0100O100O100O100O100O100O1000000O1000000O100O100O1O1O1O100O100O1O100O100O100O100O100O100O100O100O100O1O100O100O100O100O100O10000O100O1O100O100O100O100O100O100OkK_B]Oa=7oBEP=4YCKg<1^CNa<1bCN^<OeC1[<LhC4W<KlC4S<KoC5Q<JPD6o;JRD6n;JSD5l;LTD4l;LTD4k;MUD3k;MUD4j;LWD2i;OWD1i;OWD1h;0XD0h;1WDOh;2XDOg;1ZDNe;3[DMe;3[DNc;3]DLd;4\\DLc;5]DKb;7]DIc;7^DHa;9_DGa;9_DG`;:`DF`;;_DE`;<`DCa;=^DDa;=_DC`;>_DCa;>^DBb;[OUCiMW1n2c;YO[CeMn0V3g;TO^CeMf0[3k;POaCdM>b3Q<iNbCfM7f3V<dNeCeMKQ4`<YNgCfMAX4g<RNiCk2W<SMlCl2S<TMnCl2R<SMoCm2Q<RMQDm2n;TMRDl2n;SMTDl2k;UMUDk2k;UMVDj2i;VMXDj2h;VMXDj2g;VM[Di2e;VM\\Dj2c;VM^Dj2b;TMaDk2^;PMhDP3X;aL`CjM`1e5o:\\L^Ed3b:WLdEh3[:n
KPFR4P:jKTFV4k9hKXFX4h9hKXFX4g9hKYFY4f9hKZFX4f9hKZFX4e9iKZFW4g9iKXFX4g9iKPElN\\O\\5c;iKeDYOEo4f;iK]DAJh4g;iKTDJ3^4i;jKkCO;X4i;hMVDX2j;hMUDY2j;iMUDW2k;iMTDW2l;kMSDU2m;kMRDV2m;lMRDT2n;mMPDT2P<mMoCS2P<oMnCR2R<V4O100O100OiInCR2Q<kMWDQ2i;kM_DQ2`;lMoDi1Q;TNWEg1h:XN^Ed1a:\\NdE`1\\:^NjE^1U:aNPF]1o9aNUF]1k9bNXF\\1g9dN]FY1c9fN`FX1_9hNdFV1\\9iNhFT1W9lNmFQ1S9nNQGo0n8QOTGn0l8ROWGk0h8UO[Gi0e8VO^Gh0a8YO`Gf0`8YOcGe0\\8\\OeGc0[8\\OiGa0V8@lG>T8AnG>Q8BQH=o7CRH<m7DVH:j7FWH9h7HYH7g7H[H7d7J^H4b7L^H4b7K`H4_7MbH2^7NcH1\\7OfH0Z70gHOX72iHMW73jHLV74jHLU74mHKS75mHKS75nHJR76oHIQ77oHIR7jMkDP1U4U1S7RM_D0h0g1f3W1^8hNcGV1_8iNaGX1_8gNaGY1`8fNaGY1_8gNaGY1`8fNaGY1_8gNaGY1`8fNaGY1_8gNaGY1`8fNaGY1_8gNaGY1_8gNaGY1_8gNbGX1_8gNaGY1_8gNaGY1_8fNbGY1`8fNaGY1_8gNaGY1_8gNaGY1_8gNaGY1_8gNbGX1^8hNbGX1^8hNbGX1^8hNbGX1^8hNbGX1^8iNbGV1^8jNbGV1^8jNbGV1^8jNbGV1^8jNbGV1^8jNbGV1^8jNbGV1]8lNbGT1^8lNcGS1\\8nNdGR1\\8nNdGR1[8oNeGQ1Z8POfGP1X8SOhGl0W8UOiGk0U8WOkGi0S8YOmGg0P8\\OPHd0n7_ORH`0l7BTH>j7DVH<g7HYH7d7L\\H4b7N^H2`71_HO^74bHL[77eHIY7:fHFX7<hHDV7>jHBT7a0kH_OT7b0mH]OR7d0nH\\OQ7f0nH]OQ7g0oHZOQ7g0oHYOQ7h0nHXOR7h0nHXOR7h0nHXOR7i0mHWOT7h0kHYOU7h0jHXOV7h0iHYOW7g0iHYOW7h0hHXOY7g0fHZOZ7f0eH[O\\7e0cHhNlKSNa;U3bHeNSLTN\\;W3_HbN]LRNU;\\3]H^Nb8c1\\G\\Ng8c1XG\\Nj8e1TGZNo8e1PGYNS9h1lFUNW9l1gFRN]9m1bFRN`9o1^FPNe9o1[FoMg9R2WFnMj9S2TFmMn9R2RFnMn9S2PFmMR:R2nEmMS:T2kEmMU:T2jEkMX:T2gEmMY:T2fElMZ:U2dEkM^:T2aEmM_:T2`ElM`:U2^ElMc:S2\\EmMe:T2YEmMg:T2WEmMj:S2UEmMk:T2SEmMn:S2PEnMP;S2nDoMQ;R2mDoMT;Q2kDoMU;R2iDoMX;Q2fDQNY;P2eDQN[;P2cDQN^;o1`DSN_;n1^DTNc;l1[DVNd;k1ZDVNg;j1WDXNh;j1UDWNk;j1SDWNn;i1oCZNP<h1mCYNS<h1kCZNU<g1gC\\NX<e1fC\\N[<d1cC^N\\<d1aC^N_<c1^C_Na<b1\\CaNc<a1ZCaNf<_1XCcNg<_1UCeNj<^1oBfNP=]1iBiN?gL^;e4hCkNf0dLb;`8YDcGg;^8VDdGk;^8mChGR<_8bCfG_<U900001N101O00001O001O00001O00001O0O2N1O1O2N1O2N100O2N1O1O2N1O2N1O1O2N1O2M200O2M2O2N1N2O2N1N3N1O1N3N1N3M2O2kMkAkKW>Q4RBiKo=8^AR2j0`Mk=:aAP2i0bMh==eAj1g0eMf=?hAh1f0eMd=?lAg1c0gMc=`0nAe1c0hMa=a0oAd1b0iM`=a0QBc1c0iM]=c0SB`1d0iM\\=e0SB^1e0jMY=g0VB[1e0jMV=k0WBW1V`0gNn_OT1T`0lNo_Oo0S`0POS@i0P`0VOT@d0n?\\OT
@`0n?@T@<o?CR@^OUN6ka0=P@XO\\N6ga0a0j@WOY?i0X3O001O00010O00001O0010O0001O00010O001O00001O001O01O01O00001O001O000O2O0O2O0O2N1O2N1O2M3MYl51fSJ3L3O1O100O1O100000000001O000000001O00000000001O01OO1000000000001O000001O000O10000000000000000O10000O10000O10000O100O1O100O1O1O1O1O1O1N2O1O1O1O1O1O1O1O1O1O1O1O1O1O1O1O1N2O1O1O1O1O1O1O100O100N21O00001O001O2N1O1O2N1O2N1O2M3N4IVT_3",
"size": [
720,
1280
]
},
"category_id": 1,
"iscrowd": 0
}
]
```
### Data Stats
Here are the stats for the 10 annotation domains. `# Image-NPs` is the total number of unique image-NP pairs, including both “positive” and “negative” NPs.
| Domain | # Image-NPs | # Image-NP-Masks|
|--------------------------|--------------| ----------------|
| BDD100k | 5546 | 13210 |
| DROID | 9445 | 11098 |
| Ego4D | 12608 | 24049 |
| MyFoodRepo-273 | 20985 | 28347 |
| GeoDE | 14850 | 7570 |
| iNaturalist-2017 | 1439051 | 48899 |
| National Gallery of Art | 22294 | 18991 |
| SA-V | 18337 | 39683 |
| YT-Temporal-1B | 7816 | 12221 |
| Fathomnet | 287193 | 14174 |


@@ -0,0 +1,62 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import json
import os
from multiprocessing import Pool
from pathlib import Path
import requests
from fathomnet.api import images
from tqdm import tqdm


def download_imgs(args, image_uuids):
    flag = 0
    for uuid in tqdm(image_uuids, desc="Downloading images"):
        image = images.find_by_uuid(uuid)
        file_name = (
            Path(args.processed_images_folder)
            / f"{image.uuid}.{image.url.split('.')[-1]}"
        )
        if not file_name.exists():
            try:
                resp = requests.get(image.url, stream=True)
                resp.raise_for_status()
                with open(file_name, "wb") as f:
                    for chunk in resp.iter_content(chunk_size=1024):
                        f.write(chunk)
                flag += 1
            except requests.exceptions.RequestException as e:
                print(f"Error downloading {image.url}: {e}")
    print(f"Downloaded {flag} new images to {args.processed_images_folder}")


def main():
    parser = argparse.ArgumentParser(description="Download images from FathomNet")
    parser.add_argument("--processed_images_folder", help="Path to downloaded images")
    parser.add_argument(
        "--image-uuids",
        default="fathomnet_image_uuids.json",
        help="Path to JSON file containing image uuids to download",
    )
    parser.add_argument(
        "--num-procs", type=int, default=16, help="Number of parallel processes"
    )
    args = parser.parse_args()

    with open(args.image_uuids, "r") as f:
        all_uuids = json.load(f)

    Path(args.processed_images_folder).mkdir(parents=True, exist_ok=True)

    # Guard against a zero chunk size (fewer UUIDs than processes), which would
    # make range() raise a ValueError.
    chunk_size = max(1, len(all_uuids) // args.num_procs)
    chunks = [
        all_uuids[i : i + chunk_size] for i in range(0, len(all_uuids), chunk_size)
    ]

    with Pool(processes=args.num_procs) as pool:
        pool.starmap(download_imgs, [(args, chunk) for chunk in chunks])


if __name__ == "__main__":
    main()


@@ -0,0 +1,81 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import json
import shutil
import subprocess
import sys
import tarfile
from pathlib import Path
from tqdm import tqdm


def download_archive(url, dest_dir):
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / url.split("/")[-1]
    if not archive_path.exists():
        print(f"Downloading archive to {archive_path}...")
        result = subprocess.run(["wget", "-O", str(archive_path), url])
        if result.returncode != 0:
            print("Download failed.")
            sys.exit(1)
    else:
        print(f"Archive already exists at {archive_path}")
    return archive_path


def extract_archive(archive_path, dest_dir):
    print(f"Extracting {archive_path} to {dest_dir}...")
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)
    print("Extraction complete.")


def copy_images(subset_json, untar_dir, output_dir):
    with open(subset_json, "r") as f:
        image_dict = json.load(f)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = 0  # count only the images that were actually copied
    for target_name, rel_path in tqdm(image_dict.items(), "Copying image subset"):
        src = Path(untar_dir) / rel_path
        dst = output_dir / target_name
        if not src.exists():
            print(f"Warning: Source image {src} does not exist, skipping.")
            continue
        shutil.copy2(src, dst)
        copied += 1
    print(f"Copied {copied} of {len(image_dict)} images to {output_dir}")


def main():
    parser = argparse.ArgumentParser(
        description="Download, extract, and copy subset of iNaturalist images from archive."
    )
    parser.add_argument(
        "--raw_images_folder",
        help="Path where the archive is downloaded and extracted",
    )
    parser.add_argument("--processed_images_folder", help="Path to processed images")
    parser.add_argument(
        "--subset-json",
        default="inaturalist_image_subset.json",
        help="Path to iNaturalist images subset",
    )
    parser.add_argument(
        "--archive-url",
        default="https://ml-inat-competition-datasets.s3.amazonaws.com/2017/train_val_images.tar.gz",
        help="URL of the archive to download",
    )
    args = parser.parse_args()

    dest_dir = Path(args.raw_images_folder)
    images_dir = Path(args.processed_images_folder)

    archive_path = download_archive(args.archive_url, dest_dir)
    extract_archive(archive_path, dest_dir)
    untar_dir = dest_dir / "train_val_images"
    copy_images(args.subset_json, untar_dir, images_dir)


if __name__ == "__main__":
    main()


@@ -0,0 +1,140 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import os
from functools import partial
from multiprocessing import Pool
from pathlib import Path
import numpy as np
import pandas as pd
import requests
import utils
from PIL import Image
from tqdm import tqdm
METADATA_FILE = "published_images.csv"
METADATA_URL = "https://raw.githubusercontent.com/NationalGalleryOfArt/opendata/refs/heads/main/data"  # data/published_images.csv from https://github.com/NationalGalleryOfArt/opendata/tree/main
IMG_URL = "https://api.nga.gov/iiif/%s/full/%s/0/default.jpg"
METADATA_FOLDER = "metadata"
EXTENSION = ".jpg"


def download_metadata(annotation_folder):
    output_folder = annotation_folder / METADATA_FOLDER
    output_folder.mkdir(exist_ok=True)
    url = f"{METADATA_URL}/{METADATA_FILE}"
    print(url)
    response = requests.get(url)
    if response.status_code == 200:
        with open(output_folder / METADATA_FILE, "wb") as f:
            f.write(response.content)


def download_url(row):
    if np.isnan(row.maxpixels) or (
        row.maxpixels > row.width and row.maxpixels > row.height
    ):
        url = IMG_URL % (row.uuid, "full")
    else:
        url = IMG_URL % (row.uuid, f"!{row.maxpixels},{row.maxpixels}")
    return url


def download_item(item, output_folder):
    uuid, url = item
    try:
        if (output_folder / f"{uuid}{EXTENSION}").exists():
            print("skipping", uuid, "already downloaded")
            return
        response = requests.get(url)
        if response.status_code == 200:
            with open(output_folder / f"{uuid}{EXTENSION}", "wb") as f:
                f.write(response.content)
    except Exception as e:  # avoid a bare except, which would also swallow KeyboardInterrupt
        print("errored", item, e)
        return


def remove_non_compliant_image(item, output_folder):
    uuid, max_pixels = item
    if np.isnan(max_pixels):
        return
    path = output_folder / f"{uuid}{EXTENSION}"
    if not path.exists():
        return
    # Close the file handle before deleting the image.
    with Image.open(path) as img:
        too_large = img.width > max_pixels or img.height > max_pixels
    if too_large:
        os.remove(path)  # delete image
        return uuid


def reshape_image(rel_path, filename_size_map, output_folder):
    w, h = filename_size_map[rel_path]
    path = output_folder / f"{rel_path}"
    img = Image.open(path)
    if img.width != w or img.height != h:
        new_size = (w, h)
        resized_img = img.resize(new_size)
        resized_img.save(path)


def main(args, workers=20):
    raw_folder = Path(args.raw_images_folder)
    processed_folder = Path(args.processed_images_folder)
    utils.setup(raw_folder)
    utils.setup(processed_folder)
    uuids = utils.get_image_ids(args.annotation_file)
    filename_size_map = utils.get_filename_size_map(args.annotation_file)
    if not ((raw_folder / METADATA_FOLDER) / METADATA_FILE).exists():
        download_metadata(raw_folder)

    metadata = pd.read_csv((raw_folder / METADATA_FOLDER) / METADATA_FILE)
    metadata["download_url"] = metadata.apply(download_url, axis=1)
    available_uuids = list(uuids.intersection(set(metadata["uuid"].tolist())))
    print(len(available_uuids), "available for download out of", len(uuids), "target")
    url_data = list(
        metadata.set_index("uuid")
        .loc[available_uuids]
        .to_dict()["download_url"]
        .items()
    )

    download_single = partial(download_item, output_folder=(processed_folder))
    print("Preparing to download", len(url_data), "items")
    # Use the `workers` argument instead of a hard-coded pool size.
    with Pool(workers) as p:
        for _ in tqdm(p.imap(download_single, url_data), total=len(url_data)):
            continue

    check_img_size = partial(
        remove_non_compliant_image, output_folder=(processed_folder)
    )
    max_pixels_dict_all = metadata.set_index("uuid").to_dict()["maxpixels"]
    max_pixels_dict = {item[0]: max_pixels_dict_all[item[0]] for item in url_data}
    print("Checking all images within size constraints")
    non_compliant = set()
    with Pool(workers) as p:
        for each in tqdm(
            p.imap(check_img_size, max_pixels_dict.items()), total=len(max_pixels_dict)
        ):
            if each is not None:
                non_compliant.add(each)
    print(len(non_compliant), "not compliant size, removed")

    reshape_single = partial(
        reshape_image,
        filename_size_map=(filename_size_map),
        output_folder=(processed_folder),
    )
    rel_paths = os.listdir(args.processed_images_folder)
    print("Preparing to reshape", len(rel_paths), "items")
    with Pool(workers) as p:
        for _ in tqdm(p.imap(reshape_single, rel_paths), total=len(rel_paths)):
            continue


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--annotation_file", help="Path to annotation file")
    parser.add_argument("--raw_images_folder", help="Path to downloaded images")
    parser.add_argument("--processed_images_folder", help="Path to processed images")
    args = parser.parse_args()
    main(args)


@@ -0,0 +1,260 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import ast
import concurrent.futures
import os
import shutil
import subprocess
import sys
from concurrent.futures import as_completed, ThreadPoolExecutor
from pathlib import Path
import yt_dlp
from utils import (
annotation_files,
config,
load_json,
run_command,
save_json,
update_annotations,
)


def construct_gcs_path(original_video):
    """
    Convert original_video string to GCS path.

    Example:
    'AUTOLab_failure_2023-07-07_Fri_Jul__7_18:50:36_2023_recordings_MP4_22008760.mp4'
    ->
    'gs://gresearch/robotics/droid_raw/1.0.1/AUTOLab/failure/2023-07-07/Fri_Jul__7_18:50:36_2023/recordings/MP4/22008760.mp4'
    """
    parts = original_video.split("_")
    lab = parts[0]
    failure = parts[1]
    date = parts[2]
    time = "_".join(parts[3:-3])
    recordings = parts[-3]
    mp4 = parts[-2]
    file_id = parts[-1].split(".")[0]
    gcs_path = (
        f"gs://gresearch/robotics/droid_raw/1.0.1/"
        f"{lab}/{failure}/{date}/{time}/{recordings}/{mp4}/{file_id}.mp4"
    )
    return gcs_path


def download_video(args):
    gcs_path, dst_dir, json_file = args
    # Ensure subdirectory exists
    subdir = Path(dst_dir)
    os.makedirs(subdir, exist_ok=True)
    # Save file with its original name inside the subdir
    print(json_file)
    local_path = subdir / json_file
    cmd = f'gsutil cp "{gcs_path}" "{local_path}"'
    print(f"Running: {cmd}")
    try:
        run_command(cmd)
        return (gcs_path, True, None)
    except Exception as e:
        return (gcs_path, False, str(e))


def download_youtube_video(youtube_id, output_path=None):
    try:
        if output_path is None:
            output_path = os.path.join(
                config["yt1b_path"], "downloaded_videos", f"video_{youtube_id}.mp4"
            )
        url = f"https://www.youtube.com/watch?v={youtube_id}"
        if os.path.exists(output_path):
            return youtube_id, None
        # 720p or lower, max 30fps (named to avoid shadowing the built-in `format`)
        video_format = "best[height<=720][fps<=30]/best[height<=720]/best"
        ydl_opts = {
            "format": video_format,
            "outtmpl": output_path,
            "merge_output_format": "mp4",
            "quiet": True,
            "cookiefile": config["cookies_path"],
            "socket_timeout": 60,  # Increase timeout to 60 seconds (default is 10)
        }
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            ydl.download([url])
        return youtube_id, None
    except Exception as e:
        return youtube_id, str(e)


def download_youtube():
    all_videos_to_download = set()
    for annotation_file in annotation_files["yt1b"]:
        ann = load_json(os.path.join(config["path_annotations"], annotation_file))
        for video_info in ann["images"]:
            youtube_id = video_info["original_video"]
            all_videos_to_download.add(youtube_id)

    videos_to_download_still = all_videos_to_download
    videos_downloaded = set()
    videos_unavailable = set()
    num_download_retries = 3
    for _ in range(num_download_retries):
        if len(videos_to_download_still) == 0:
            break
        videos_error = set()
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [
                executor.submit(download_youtube_video, youtube_id)
                for youtube_id in videos_to_download_still
            ]
            for future in concurrent.futures.as_completed(futures):
                youtube_id, exception = future.result()
                if exception is None:
                    videos_downloaded.add(youtube_id)
                elif "unavailable" in exception or "members-only" in exception:
                    videos_unavailable.add(youtube_id)
                else:
                    videos_error.add(youtube_id)
        videos_to_download_still = (
            all_videos_to_download - videos_downloaded - videos_unavailable
        )
        assert videos_to_download_still == videos_error

    if len(videos_unavailable) + len(videos_to_download_still) > 0:
        message = "Some videos are either no longer available on YouTube, or are set to private, or resulted in some other error. "
        if config["update_annotation_yt1b"]:
            message += "The unavailable videos will be ***REMOVED*** from the annotation file. This will make the test results NOT DIRECTLY COMPARABLE to other reported results."
            print(message)
            update_annotations("yt1b", videos_downloaded)
        else:
            message += "You may want to either re-try the download, or remove these videos from the evaluation json"
            print(message)


def download_droid():
    ann_dir = Path(config["path_annotations"])
    dst_dir = Path(config["droid_path"]) / "downloaded_videos"
    json_files = annotation_files["droid"]

    download_tasks = []
    original_videos = set()
    for json_file in json_files:
        json_path = ann_dir / json_file
        data = load_json(json_path)
        for img in data["images"]:
            original_video = img["original_video"]
            original_videos.add(original_video)

    print(len(original_videos))
    for original_video in original_videos:
        gcs_path = construct_gcs_path(original_video)
        download_tasks.append((gcs_path, dst_dir, original_video))

    max_workers = min(16, len(download_tasks))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_task = {
            executor.submit(download_video, task): task for task in download_tasks
        }
        for future in as_completed(future_to_task):
            gcs_path, success, error = future.result()
            if not success:
                print(f"Failed to download {gcs_path}: {error}")


def download_ego4d():
    output_dir = os.path.join(config["ego4d_path"], "downloaded_videos")

    ann_dir = Path(config["path_annotations"])
    json_files = annotation_files["ego4d"]
    original_videos = set()
    for json_file in json_files:
        json_path = ann_dir / json_file
        data = load_json(json_path)
        for img in data["images"]:
            original_video = img["original_video"]
            original_videos.add(original_video)

    original_video_uids = [
        video_uid.replace(".mp4", "") for video_uid in original_videos
    ]
    video_ids_download = original_video_uids
    num_download_retries = 2
    download_correct = False
    message = ""
    for _ in range(num_download_retries):
        cmd = (
            [
                # "python", "-m", "ego4d.cli.cli",
                "ego4d",
                "--output_directory",
                output_dir,
                "--datasets",
                "clips",
                "--version",
                "v1",
                "--video_uids",
            ]
            + video_ids_download
            + ["--yes"]
        )

        # Run the command
        result = subprocess.run(cmd, capture_output=True, text=True)
        message = result.stderr
        if (
            "RuntimeError: The following requested video UIDs could not be found in the manifest for version:"
            in result.stderr
        ):
            not_findable_videos = ast.literal_eval(result.stderr.split("\n")[-2])
            video_ids_download = [
                video_uid
                for video_uid in video_ids_download
                if video_uid not in not_findable_videos
            ]
        else:
            download_correct = True
            break

    if not download_correct:
        print(f"There was an error downloading the Ego4D data: {message}")

    if len(video_ids_download) != len(original_video_uids):
        message = "Some videos are no longer available. "
        if config["update_annotation_ego4d"]:
            message += "The unavailable videos will be ***REMOVED*** from the annotation file. This will make the test results NOT DIRECTLY COMPARABLE to other reported results."
            print(message)
            update_annotations("ego4d", video_ids_download)
        else:
            message += "You may want to either re-try the download, or remove these videos from the evaluation json"
            print(message)


def download_sav():
    tar_url = config["sav_videos_fps_6_download_path"]
    tar_file = "videos_fps_6.tar"
    sav_data_dir = os.path.join(config["sav_path"], "downloaded_videos")
    os.makedirs(sav_data_dir, exist_ok=True)

    subprocess.run(["wget", tar_url, "-O", tar_file], cwd=sav_data_dir, check=True)
    subprocess.run(["tar", "-xvf", tar_file], cwd=sav_data_dir, check=True)
    subprocess.run(["rm", tar_file], cwd=sav_data_dir, check=True)


def main():
    assert len(sys.argv) > 1, "You have to provide the name of the dataset"
    dataset_name = sys.argv[1]
    assert (
        dataset_name in annotation_files
    ), f"The dataset can be one of {list(annotation_files.keys())}"

    if dataset_name == "yt1b":
        download_youtube()
    elif dataset_name == "droid":
        download_droid()
    elif dataset_name == "ego4d":
        download_ego4d()
    elif dataset_name == "sav":
        download_sav()


if __name__ == "__main__":
    main()


@@ -0,0 +1,99 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
"""
This file extracts the frames for the frame datasets in SA-CO/Gold and Silver.
Call like:
> python extract_frames.py <dataset_name>
"""
import json
import os
import shutil
import sys
from multiprocessing import Pool
from PIL import Image
from tqdm import tqdm
from utils import (
annotation_files,
config,
get_frame_from_video,
is_valid_image,
update_annotations,
)


def extract_frame(path_video, global_frame_idx, path_frame, image_size, file_name):
    frame = get_frame_from_video(path_video, global_frame_idx)
    os.makedirs(os.path.dirname(path_frame), exist_ok=True)
    img = Image.fromarray(frame)
    if frame.shape[:2] != image_size:
        print(f"Resizing image {file_name} from {frame.shape[:2]} to {image_size}")
        height, width = image_size
        # Uses Pillow's default resampling filter, which depends on the Pillow
        # version (NEAREST in older releases, BICUBIC in newer ones).
        img = img.resize((width, height))
    img.save(path_frame)


def process_image(args):
    image, dataset_name, config = args
    original_video, global_frame_idx, file_name, image_size = image
    extra_subpath = ""
    if dataset_name == "ego4d":
        extra_subpath = "v1/clips"
    elif dataset_name == "yt1b":
        original_video = f"video_{original_video}.mp4"
    elif dataset_name == "sav":
        extra_subpath = "videos_fps_6"
    path_video = os.path.join(
        config[f"{dataset_name}_path"],
        "downloaded_videos",
        extra_subpath,
        original_video,
    )
    path_frame = os.path.join(config[f"{dataset_name}_path"], "frames", file_name)
    to_return = file_name
    try:
        extract_frame(path_video, global_frame_idx, path_frame, image_size, file_name)
        if not is_valid_image(path_frame):
            print(f"Invalid image in {path_frame}")
            to_return = None
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        print(f"Invalid image in {path_frame}")
        to_return = None
    return to_return


def main():
    assert len(sys.argv) > 1, "You have to provide the name of the dataset"
    dataset_name = sys.argv[1]
    assert (
        dataset_name in annotation_files
    ), f"The dataset can be one of {list(annotation_files.keys())}"
    all_outputs = []
    for file in annotation_files[dataset_name]:
        with open(os.path.join(config["path_annotations"], file), "r") as f:
            annotation = json.load(f)
        images = annotation["images"]
        images = set(
            (
                image["original_video"],
                image["global_frame_idx"],
                image["file_name"],
                tuple(image["image_size"]),
            )
            for image in images
        )
        args_list = [(image, dataset_name, config) for image in images]
        with Pool(os.cpu_count()) as pool:
            outputs = list(
                tqdm(pool.imap_unordered(process_image, args_list), total=len(images))
            )
        all_outputs.extend(outputs)
    # Check the outputs of every annotation file, not only the last one.
    if any(out is None for out in all_outputs):
        update_annotations(dataset_name, all_outputs, key="file_name")
    if config[f"remove_downloaded_videos_{dataset_name}"]:
        shutil.rmtree(os.path.join(config[f"{dataset_name}_path"], "downloaded_videos"))


if __name__ == "__main__":
    main()

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff


@@ -0,0 +1,70 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
from multiprocessing import Pool
from pathlib import Path
import pandas as pd
import utils
from tqdm import tqdm
def main(args, n_workers=20):
    raw_folder = Path(args.raw_images_folder)
    processed_folder = Path(args.processed_images_folder)
    utils.setup(processed_folder)
    img_ids = utils.get_image_ids(args.annotation_file)
    if args.dataset_name == "geode":
        metadata = pd.read_csv(raw_folder / "index.csv")
        # Flatten the nested relative path into a single file name
        metadata["flat_filepath"] = metadata.file_path.apply(
            lambda x: x.replace("/", "_")
        )
        metadata["original_absolute_path"] = metadata.file_path.apply(
            lambda x: str((raw_folder / "images") / x)
        )
        metadata["new_absolute_path"] = metadata.flat_filepath.apply(
            lambda x: str(processed_folder / x)
        )
        metadata["filestem"] = metadata.new_absolute_path.apply(lambda x: Path(x).stem)
        img_id_mapping = metadata.set_index("filestem").to_dict()
        paths = [
            (
                img_id_mapping["original_absolute_path"][each],
                img_id_mapping["new_absolute_path"][each],
            )
            for each in img_ids
        ]
    elif args.dataset_name == "bdd100k":
        bdd_subfolder = "100k/train"
        img_filenames = utils.get_filenames(args.annotation_file)
        raw_folder_bdd_images = raw_folder / bdd_subfolder
        paths = [
            (raw_folder_bdd_images / each, processed_folder / each)
            for each in img_filenames
        ]
    elif args.dataset_name == "food_rec":
        food_subfolder = "public_validation_set_2.0/images"
        img_filenames = utils.get_filenames(args.annotation_file)
        raw_folder_food_images = raw_folder / food_subfolder
        paths = [
            (
                # The raw file name is the last "_"-separated token of the flattened name
                raw_folder_food_images
                / f'{Path(each).stem.split("_")[-1]}{Path(each).suffix}',
                processed_folder / each,
            )
            for each in img_filenames
        ]
    else:
        # Fail early instead of hitting an undefined `paths` below
        raise ValueError(f"Unsupported dataset_name: {args.dataset_name}")
    print("Preparing to copy and flatten filename for", len(paths), "images")
    # Use the n_workers argument instead of a hard-coded pool size
    with Pool(n_workers) as p:
        for _ in tqdm(p.imap(utils.copy_file, paths), total=len(paths)):
            continue
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--annotation_file", help="Path to annotation file")
    parser.add_argument("--raw_images_folder", help="Path to downloaded images")
    parser.add_argument("--processed_images_folder", help="Path to processed images")
    parser.add_argument(
        "--dataset_name", help="Dataset name: geode, bdd100k, or food_rec"
    )
    args = parser.parse_args()
    main(args)

@@ -0,0 +1,148 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import json
import os
import shutil
import subprocess
from io import BytesIO
from pathlib import Path
import cv2
import matplotlib.pyplot as plt
import numpy as np
import yaml
from PIL import Image
from pycocotools import mask as mask_utils
from tqdm import tqdm
annotation_files = {
"droid": [
"silver_droid_merged_test.json",
],
"sav": [
"silver_sav_merged_test.json",
],
"yt1b": [
"silver_yt1b_merged_test.json",
],
"ego4d": [
"silver_ego4d_merged_test.json",
],
}
def load_yaml(filename):
with open(filename, "r") as f:
return yaml.safe_load(f)
def load_json(filename):
with open(filename, "r") as f:
return json.load(f)
def save_json(content, filename):
with open(filename, "w") as f:
json.dump(content, f)
def run_command(cmd):
"""Run a shell command and raise if it fails."""
result = subprocess.run(cmd, shell=True)
if result.returncode != 0:
raise RuntimeError(f"Command failed: {cmd}")
config = load_yaml("CONFIG_FRAMES.yaml")
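def is_valid_image(img_path):
    """Return True if the file can be opened and decoded as an RGB image."""
    try:
        # The context manager closes the file handle once decoding is done
        with Image.open(img_path) as img:
            img.convert("RGB")
        return True
    except Exception:
        return False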
def get_frame_from_video(video_path, frame_id):
    """Return frame `frame_id` of the video as an RGB array."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_id)
    ret, frame = cap.read()
    cap.release()
    if not ret:
        # Some videos cannot be opened with OpenCV, so fall back to PyAV
        # and decode sequentially until the requested frame is reached
        import av

        with av.open(video_path) as container:
            stream = container.streams.video[0]
            for i, frame in tqdm(
                enumerate(container.decode(stream)),
                desc="Decoding with AV",
                total=frame_id + 1,
            ):
                if i == frame_id:
                    img = frame.to_ndarray(format="rgb24")
                    return img
        raise ValueError(
            f"Could not read frame {frame_id} from video {video_path} (out of frame)"
        )
    # OpenCV decodes to BGR, so convert to RGB before returning
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    return frame_rgb
def update_annotations(dataset_name, file_names_keep, key="original_video"):
for annotation_file in annotation_files[dataset_name]:
path_ann = os.path.join(config["path_annotations"], annotation_file)
path_original_ann = os.path.join(
config["path_annotations"],
annotation_file.replace(".json", "_original.json"),
)
ann = load_json(path_ann)
shutil.copy(path_ann, path_original_ann)
new_images = []
image_ids_keep = set()
for image in ann["images"]:
if image[key].replace(".mp4", "") in file_names_keep:
new_images.append(image)
image_ids_keep.add(image["id"])
new_annotations = []
for annotation in ann["annotations"]:
if annotation["image_id"] in image_ids_keep:
new_annotations.append(annotation)
ann["images"] = new_images
ann["annotations"] = new_annotations
save_json(ann, path_ann)
def get_filename_size_map(annotation_path):
with open(annotation_path) as f:
annotations = json.load(f)
filename_size_map = {}
for each in annotations["images"]:
filename_size_map[each["file_name"]] = (each["width"], each["height"])
return filename_size_map
def get_filenames(annotation_path):
with open(annotation_path) as f:
annotations = json.load(f)
filenames = {Path(each["file_name"]) for each in annotations["images"]}
return filenames
def get_image_ids(annotation_path):
filenames = get_filenames(annotation_path)
filestems = {Path(each).stem for each in filenames}
return filestems
def setup(folder):
print("Making dir", folder)
folder.mkdir(parents=True, exist_ok=True)
def copy_file(paths):
old_path, new_path = paths
print("Copy from", old_path, "to", new_path)
if not Path(new_path).exists():
shutil.copy2(old_path, new_path)

@@ -0,0 +1,48 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
"""Simple script to run the CGF1 evaluator given a prediction file and GT file(s).
Usage: python standalone_cgf1.py --pred_file <path_to_prediction_file> --gt_files <path_to_gt_file1> <path_to_gt_file2> ...
"""
import argparse
import os

from sam3.eval.cgf1_eval import CGF1Evaluator


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pred_file",
        type=str,
        required=True,
        help="Path to the prediction file in COCO format.",
    )
    parser.add_argument(
        "--gt_files",
        type=str,
        nargs="+",
        required=True,
        help="Paths to the ground truth files in COCO format.",
    )
    args = parser.parse_args()
    if len(args.gt_files) == 0:
        raise ValueError("At least one GT file must be provided.")
    # Check the file name itself: the previous split("_")[-1] token could never
    # start with "gold_", so the warning below was unreachable
    is_gold = os.path.basename(args.gt_files[0]).startswith("gold_")
    if is_gold and len(args.gt_files) < 3:
        print(
            "WARNING: based on the name, it seems you are using gold GT files. Typically, there should be 3 GT files for gold subsets (a, b, c)."
        )
    evaluator = CGF1Evaluator(
        gt_path=args.gt_files, verbose=True, iou_type="segm"
    )  # change to bbox if you want detection performance
    results = evaluator.evaluate(args.pred_file)
    print(results)


if __name__ == "__main__":
    main()

@@ -0,0 +1,244 @@
# SA-Co/VEval Dataset
**License**: each domain has its own license.
* SA-Co/VEval - SA-V: CC-BY-NC 4.0
* SA-Co/VEval - YT-Temporal-1B: CC-BY-NC 4.0
* SA-Co/VEval - SmartGlasses: CC-BY 4.0
**SA-Co/VEval** is an evaluation dataset comprising 3 domains, each with a val and a test split.
* SA-Co/VEval - SA-V: videos are from the [SA-V dataset](https://ai.meta.com/datasets/segment-anything-video/)
* SA-Co/VEval - YT-Temporal-1B: videos are from the [YT-Temporal-1B](https://cove.thecvf.com/datasets/704)
* SA-Co/VEval - SmartGlasses: egocentric videos from [Smart Glasses](https://huggingface.co/datasets/facebook/SACo-VEval/blob/main/media/saco_sg.tar.gz)
## Environment
Install the environment required for SA-Co/VEval:
```
pip install -e ".[veval]"
```
This makes it possible to run:
* `scripts/eval/veval/saco_yt1b_downloader.py`: prepares the frames for SA-Co/VEval - YT-Temporal-1B
* `examples/saco_veval_eval_example.ipynb`: an example of running the offline evaluator
* `examples/saco_veval_vis_example.ipynb`: an example of loading and visualizing the data
## Download
### The expected folder structure
The following folder structure is expected after finishing all the download and pre-processing steps in this section:
```
data/
├── annotation/
│ ├── saco_veval_sav_test.json
│ ├── saco_veval_sav_val.json
│ ├── saco_veval_smartglasses_test.json
│ ├── saco_veval_smartglasses_val.json
│ ├── saco_veval_yt1b_test.json
│ ├── saco_veval_yt1b_val.json
└── media/
├── saco_sav
│ └── JPEGImages_24fps
├── saco_sg
│ └── JPEGImages_6fps
└── saco_yt1b
└── JPEGImages_6fps
```
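As a quick sanity check, the layout above can be verified with a few lines of Python. This is a minimal sketch and not part of the repository: the helper name `missing_paths` and the `EXPECTED_PATHS` list are assumptions made for illustration, and the relative paths are simply copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the expected folder structure above.
EXPECTED_PATHS = [
    "annotation/saco_veval_sav_test.json",
    "annotation/saco_veval_sav_val.json",
    "annotation/saco_veval_smartglasses_test.json",
    "annotation/saco_veval_smartglasses_val.json",
    "annotation/saco_veval_yt1b_test.json",
    "annotation/saco_veval_yt1b_val.json",
    "media/saco_sav/JPEGImages_24fps",
    "media/saco_sg/JPEGImages_6fps",
    "media/saco_yt1b/JPEGImages_6fps",
]


def missing_paths(data_root):
    """Return the expected relative paths that do not exist under data_root.

    Hypothetical helper: it only checks that files and folders exist,
    not that their contents are complete.
    """
    root = Path(data_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

Calling `missing_paths("data")` after the download steps should return an empty list if everything is in place.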
### Download ready-to-use data
The following links provide ready-to-use data hosted on Roboflow. This data has already been through the pre-processing steps outlined in the next section.
For each domain:
- [SA-Co/VEval - SA-V](https://universe.roboflow.com/sa-co-veval/sa-v-test/)
- [SA-Co/VEval - YT-Temporal-1B](https://universe.roboflow.com/sa-co-veval/yt-temporal-1b-test/)
- [SA-Co/VEval - SmartGlasses](https://universe.roboflow.com/sa-co-veval/smartglasses-test/)
For all three domains:
- [SA-Co/VEval](https://universe.roboflow.com/sa-co-veval)
### Download via preprocessing steps
#### Download annotations
The GT annotations are available at Hugging Face:
* [SA-Co/VEval](https://huggingface.co/datasets/facebook/SACo-VEval/tree/main)
* SA-Co/VEval SA-V
* Test: `annotation/saco_veval_sav_test.json`
* Val: `annotation/saco_veval_sav_val.json`
* SA-Co/VEval YT-Temporal-1B
* Test: `annotation/saco_veval_yt1b_test.json`
* Val: `annotation/saco_veval_yt1b_val.json`
* SA-Co/VEval SmartGlasses
* Test: `annotation/saco_veval_smartglasses_test.json`
* Val: `annotation/saco_veval_smartglasses_val.json`
#### Download videos or frames
##### SA-Co/VEval - SAV
Follow the instructions on the [SA-V dataset](https://ai.meta.com/datasets/segment-anything-video/) page. Only the following two archives are needed:
* sav_test.tar
* sav_val.tar
After extracting the archives:
```
sav_test/
├── Annotations_6fps [ignore: these are the SAM 2 annotations]
└── JPEGImages_24fps
sav_val/
├── Annotations_6fps [ignore: these are the SAM 2 annotations]
└── JPEGImages_24fps
```
Then merge the two `JPEGImages_24fps` folders into one so that the paths match those in our annotation JSON files, e.g.
```
media/
└── saco_sav
└── JPEGImages_24fps [merged from the two JPEGImages_24fps above]
```
Example commands to download the archives and merge the folders:
```
cd ../data/media/saco_sav
wget -O sav_test.tar <sav_test.tar download link from the SA-V dataset page>
wget -O sav_val.tar <sav_val.tar download link from the SA-V dataset page>
tar -xf sav_test.tar
tar -xf sav_val.tar
mkdir JPEGImages_24fps
chmod -R u+w sav_test/
chmod -R u+w sav_val/
mv sav_test/JPEGImages_24fps/* JPEGImages_24fps/
mv sav_val/JPEGImages_24fps/* JPEGImages_24fps/
```
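The merge step can also be done from Python, which makes it easy to re-run safely. The following is a minimal sketch under stated assumptions: `merge_frame_folders` is a hypothetical helper, not a script shipped with the repository, and it assumes entry names do not collide between `sav_test` and `sav_val` (colliding entries are skipped rather than overwritten).

```python
import shutil
from pathlib import Path


def merge_frame_folders(source_dirs, target_dir):
    """Move every entry of each source directory into target_dir.

    Hypothetical helper mirroring the `mv .../JPEGImages_24fps/*` commands above.
    Returns the number of entries that were moved.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = 0
    for source in map(Path, source_dirs):
        for entry in sorted(source.iterdir()):
            destination = target / entry.name
            if destination.exists():
                # Skip instead of overwriting, so that a re-run is harmless
                continue
            shutil.move(str(entry), str(destination))
            moved += 1
    return moved
```

For the layout above, this would be called as `merge_frame_folders(["sav_test/JPEGImages_24fps", "sav_val/JPEGImages_24fps"], "JPEGImages_24fps")` from inside `media/saco_sav`.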
##### SA-Co/VEval - YT-Temporal-1B
Two files are needed to download the YouTube videos for SA-Co/VEval - YT-Temporal-1B.
* Download `media/yt1b_start_end_time.json` from [SA-Co/VEval](https://huggingface.co/datasets/facebook/SACo-VEval/tree/main). It contains the YouTube video ids and the start and end times used in SA-Co/VEval - YT-Temporal-1B.
* Prepare the `cookies.txt` file. Follow the yt-dlp instructions in [exporting-youtube-cookies](https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies) and [pass-cookies-to-yt-dlp](https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp) to prepare the cookies file.
* Please read the full **WARNINGS** in the yt-dlp documentation regarding the risk of a YouTube account ban!
Then run `scripts/eval/veval/saco_yt1b_downloader.py` to download the videos and prepare the frames, e.g.
```
python saco_yt1b_downloader.py \
--data_dir ../data/media/saco_yt1b \
--cookies_file ../data/media/saco_yt1b/cookies.txt \
--yt1b_start_end_time_file ../data/media/saco_yt1b/yt1b_start_end_time.json \
--yt1b_frame_prep_log_file ../data/media/saco_yt1b/yt1b_frame_prep.log
```
* data_dir: the directory in which to download the YouTube videos and store the extracted frames
* cookies_file: the `cookies.txt` prepared above
* yt1b_start_end_time_file: the `yt1b_start_end_time.json` downloaded above
* yt1b_frame_prep_log_file: a log file that tracks the status of video downloading and frame extraction
Then run `scripts/eval/veval/saco_yt1b_annot_update.py` to update the annotations based on video availability, e.g.
```
python saco_yt1b_annot_update.py \
--yt1b_media_dir ../data/media/saco_yt1b/JPEGImages_6fps \
--yt1b_input_annot_path ../data/annotation/saco_veval_yt1b_val.json \
--yt1b_output_annot_path ../data/annotation/saco_veval_yt1b_val_updated.json \
--yt1b_annot_update_log_path ../data/annotation/saco_veval_yt1b_val_updated.log
```
**NOTE**:
* Not all YouTube videos may still be available, since videos can be deleted or made private. The script `saco_yt1b_annot_update.py` removes the annotations of the unavailable videos.
* **Frame Shifting Alert!!** Even when a video is still available, its specifications, such as fps and duration, may differ from those used during annotation when it is re-downloaded from YouTube. In addition, `ffmpeg` does not always extract exactly the same frames from the same video across different environments. As a result, the re-downloaded and re-extracted frames may be misaligned with our annotations due to frame shifting. Please be aware of this caveat when evaluating on SA-Co/VEval - YT-Temporal-1B.
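A cheap first check for the frame shifting issue is to compare the number of extracted frames per video with the count recorded in the annotation file. The sketch below is illustrative only: `find_frame_count_mismatches` is a hypothetical helper, and it assumes that the `length` field of each entry in `videos` is the number of frames and that the frames of a video live in `<media_dir>/<video_name>/` as `.jpg` files.

```python
import json
from pathlib import Path


def find_frame_count_mismatches(annot_path, media_dir):
    """Map video_name -> (expected, found) for videos whose frame count differs.

    Hypothetical helper. Assumes video["length"] is the expected number of
    frames and that frames are stored as <media_dir>/<video_name>/*.jpg.
    """
    with open(annot_path, "r") as f:
        data = json.load(f)
    mismatches = {}
    for video in data["videos"]:
        expected = video["length"]
        frame_dir = Path(media_dir) / video["video_name"]
        # A missing folder counts as zero extracted frames
        found = len(list(frame_dir.glob("*.jpg"))) if frame_dir.is_dir() else 0
        if found != expected:
            mismatches[video["video_name"]] = (expected, found)
    return mismatches
```

A matching frame count does not prove that the frames are aligned with the annotations; it only rules out the most obvious mismatches.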
##### SA-Co/VEval - SmartGlasses
Go to [SACo-VEval](https://huggingface.co/datasets/facebook/SACo-VEval/tree/main) and download `media/saco_sg.tar.gz`:
```
cd ../data
hf download facebook/SACo-VEval media/saco_sg.tar.gz --repo-type dataset --local-dir .
cd ../data/media
tar -xzf saco_sg.tar.gz
```
## Annotation Format
The format is similar to the [YTVIS](https://youtube-vos.org/dataset/vis/) format.
In the annotation JSON, e.g. `saco_veval_sav_test.json`, there are 5 fields:
* info:
* A dict containing the dataset info
* E.g. {'version': 'v1', 'date': '2025-09-24', 'description': 'SA-Co/VEval SA-V Test'}
* videos
* A list of videos that are used in the current annotation json
* It contains {id, video_name, file_names, height, width, length}
* annotations
* A list of **positive** masklets and their related info
* It contains {id, segmentations, bboxes, areas, iscrowd, video_id, height, width, category_id, noun_phrase}
* video_id matches the `videos - id` field above
* category_id matches the `categories - id` field below
* segmentations is a list of [RLE](https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/mask.py)
* categories
* A **global** noun phrase id map, shared across all 3 domains.
* It contains {id, name}
* name is the noun phrase
* video_np_pairs
* A list of video-np pairs, both **positive** and **negative**, used in the current annotation json
* It contains {id, video_id, category_id, noun_phrase, num_masklets}
* video_id should match the `videos - id` above
* category_id should match the `categories - id` above
* when `num_masklets > 0` it is a positive video-np pair, and the corresponding masklets can be found in the annotations field
* when `num_masklets = 0` it is a negative video-np pair, meaning that no masklet is present at all
```
data {
"info": info
"videos": [video]
"annotations": [annotation]
"categories": [category]
"video_np_pairs": [video_np_pair]
}
video {
"id": int
"video_name": str # e.g. sav_000000
"file_names": List[str]
"height": int
"width": int
"length": int
}
annotation {
"id": int
"segmentations": List[RLE]
"bboxes": List[List[int, int, int, int]]
"areas": List[int]
"iscrowd": int
"video_id": str
"height": int
"width": int
"category_id": int
"noun_phrase": str
}
category {
"id": int
"name": str
}
video_np_pair {
"id": int
"video_id": str
"category_id": int
"noun_phrase": str
"num_masklets": int
}
```
[sam3/examples/saco_veval_vis_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_veval_vis_example.ipynb) shows some examples of the data format and data visualization.
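To make the relation between `video_np_pairs` and `annotations` concrete, here is a minimal sketch that counts positive and negative pairs and cross-checks `num_masklets`. The helpers `load_annotation` and `summarize_video_np_pairs` are hypothetical, and the sketch assumes that each entry in `annotations` is one masklet and that a pair is identified by its `(video_id, category_id)` combination.

```python
import json
from collections import Counter


def load_annotation(path):
    """Load a SA-Co/VEval annotation JSON file into a dict."""
    with open(path, "r") as f:
        return json.load(f)


def summarize_video_np_pairs(data):
    """Count positive/negative pairs and list pairs whose num_masklets looks off.

    Hypothetical helper. Assumes one annotation entry per masklet, keyed by
    (video_id, category_id).
    """
    masklets_per_pair = Counter(
        (ann["video_id"], ann["category_id"]) for ann in data["annotations"]
    )
    positives, negatives, inconsistent = 0, 0, []
    for pair in data["video_np_pairs"]:
        if pair["num_masklets"] > 0:
            positives += 1
        else:
            negatives += 1
        key = (pair["video_id"], pair["category_id"])
        if masklets_per_pair.get(key, 0) != pair["num_masklets"]:
            inconsistent.append(pair["id"])
    return {"positive": positives, "negative": negatives, "inconsistent": inconsistent}
```

For example, `summarize_video_np_pairs(load_annotation("data/annotation/saco_veval_sav_test.json"))` would report how many pairs of each kind the SA-V test split contains; the path follows the folder structure described earlier.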
## Run Offline Eval
An example notebook and an eval script have been provided for offline evaluation.
```
sam3/
├── examples/
│ └── saco_veval_eval_example.ipynb # this notebook will load eval res or run the eval on the fly, and print the results
└── sam3/eval/
└── saco_veval_eval.py # this script will run the offline evaluator
```
`saco_veval_eval.py` supports two modes, `one` and `all`.
* `one`: evaluates a single pair of GT and pred files
* `all`: evaluates all 6 SA-Co/VEval datasets
Example usage
```
python saco_veval_eval.py one \
--gt_annot_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_gt.json \
--pred_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_pred.json \
--eval_res_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_eval_res.json
```
* `gt_annot_file`: the location of the GT file
* `pred_file`: the location of the Pred file
* `eval_res_file`: the location where the eval result will be written to
```
python saco_veval_eval.py all \
--gt_annot_dir ../data/annotation \
--pred_dir ../data/pred \
--eval_res_dir ../data/pred
```
* `gt_annot_dir`: the location of the GT files
* `pred_dir`: the location of the Pred files
* `eval_res_dir`: the location where the eval results will be written to

@@ -0,0 +1 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved

@@ -0,0 +1,136 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import json
import logging
import os
import pandas as pd
logger = logging.getLogger(__name__)
def get_available_saco_yt1b_ids(yt1b_media_dir, data):
    """Return the video names that have at least one extracted frame on disk."""
    vdf = pd.DataFrame(data["videos"])
    expected_saco_yt1b_ids = vdf.video_name.tolist()
    # A set keeps the membership test below cheap for large annotation files
    expected_id_set = set(expected_saco_yt1b_ids)
    yt1b_media_folders = os.listdir(yt1b_media_dir)
    available_saco_yt1b_ids = []
    for yt1b_media_folder in yt1b_media_folders:
        if yt1b_media_folder not in expected_id_set:
            continue
        jpeg_folder_dir = os.path.join(yt1b_media_dir, yt1b_media_folder)
        jpeg_count = len(os.listdir(jpeg_folder_dir))
        if jpeg_count > 0:
            available_saco_yt1b_ids.append(yt1b_media_folder)
        else:
            logger.info(
                f"No JPEG images found for {yt1b_media_folder}. The annotation related to this video will be removed."
            )
    logger.info(
        f"Expected {len(expected_saco_yt1b_ids)} videos for {data['info']}. Found {len(available_saco_yt1b_ids)} videos available in {yt1b_media_dir}."
    )
    return available_saco_yt1b_ids
def update_yt1b_annot_per_field(data, field, id_col, available_ids):
field_data = data[field]
new_field_data = []
for data_entry in field_data:
if data_entry[id_col] not in available_ids:
logger.info(
f"{field}: Removing {data_entry} due to the video being unavailable."
)
continue
new_field_data.append(data_entry)
data[field] = new_field_data
logger.info(
f"Updated {field} by {id_col} - Before: {len(field_data)}, After: {len(new_field_data)}, Removed: {len(field_data) - len(new_field_data)}"
)
return data
def update_yt1b_annot(yt1b_input_annot_path, yt1b_media_dir, yt1b_output_annot_path):
with open(yt1b_input_annot_path, "r") as f:
data = json.load(f)
available_saco_yt1b_ids = get_available_saco_yt1b_ids(yt1b_media_dir, data)
data = update_yt1b_annot_per_field(
data=data,
field="videos",
id_col="video_name",
available_ids=available_saco_yt1b_ids,
)
videos_data = data["videos"]
available_video_incremental_ids = [data_entry["id"] for data_entry in videos_data]
data = update_yt1b_annot_per_field(
data=data,
field="annotations",
id_col="video_id",
available_ids=available_video_incremental_ids,
)
data = update_yt1b_annot_per_field(
data=data,
field="video_np_pairs",
id_col="video_id",
available_ids=available_video_incremental_ids,
)
with open(yt1b_output_annot_path, "w") as f:
json.dump(data, f)
return data
def main():
    parser = argparse.ArgumentParser(
        description="Update a SA-Co/VEval YT-Temporal-1B annotation file based on video availability"
    )
    parser.add_argument(
        "--yt1b_media_dir",
        type=str,
        required=True,
        help="Path to the directory where the yt1b media is stored, e.g. media/saco_yt1b/JPEGImages_6fps",
    )
    parser.add_argument(
        "--yt1b_input_annot_path",
        type=str,
        required=True,
        help="Path to the saco_veval_yt1b input annotation file, e.g. annotation/saco_veval_yt1b_test.json or annotation/saco_veval_yt1b_val.json",
    )
    parser.add_argument(
        "--yt1b_output_annot_path",
        type=str,
        required=True,
        help="Path to the output annotation file, e.g. annotation/saco_veval_yt1b_test_updated.json or annotation/saco_veval_yt1b_val_updated.json",
    )
    parser.add_argument(
        "--yt1b_annot_update_log_path",
        type=str,
        required=True,
        help="Path to the yt1b annot update log file, e.g. annotation/yt1b_annot_update_log.log",
    )
    args = parser.parse_args()
    # os.path.dirname returns "" for a bare file name, which os.makedirs rejects
    for path in (args.yt1b_annot_update_log_path, args.yt1b_output_annot_path):
        parent_dir = os.path.dirname(path)
        if parent_dir:
            os.makedirs(parent_dir, exist_ok=True)
    logging.basicConfig(
        filename=args.yt1b_annot_update_log_path,
        format="%(asctime)s [%(threadName)s] %(levelname)s: %(message)s",
        level=logging.INFO,
        filemode="w",
    )
    _ = update_yt1b_annot(
        yt1b_input_annot_path=args.yt1b_input_annot_path,
        yt1b_media_dir=args.yt1b_media_dir,
        yt1b_output_annot_path=args.yt1b_output_annot_path,
    )
    print("Done!! Check the log at", args.yt1b_annot_update_log_path)


if __name__ == "__main__":
    main()

@@ -0,0 +1,136 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import logging
import multiprocessing as mp
import os
from functools import partial
import pandas as pd
from saco_yt1b_frame_prep_util import YtVideoPrep
from tqdm import tqdm
logger = logging.getLogger(__name__)
def download_and_extract_frames(saco_yt1b_id, args):
    """Download one video and extract its frames; return True only if both steps succeed."""
    video_prep = YtVideoPrep(
        saco_yt1b_id=saco_yt1b_id,
        data_dir=args.data_dir,
        cookies_file=args.cookies_file,
        yt1b_start_end_time_file=args.yt1b_start_end_time_file,
        ffmpeg_timeout=args.ffmpeg_timeout,
        sleep_interval=args.sleep_interval,
        max_sleep_interval=args.max_sleep_interval,
    )
    status = video_prep.download_youtube_video()
    logger.info(f"[video download][{saco_yt1b_id}] download status {status}")
    if status not in ["already exists", "success"]:
        logger.warning(
            f"Video download failed for {saco_yt1b_id}, skipping frame generation"
        )
        return False
    status = video_prep.extract_frames_in_6fps_and_width_1080()
    logger.info(f"[frame extracting][{saco_yt1b_id}] frame extracting status {status}")
    # Report the extraction result instead of always returning True
    return status
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--data_dir",
type=str,
required=True,
)
parser.add_argument(
"--cookies_file",
type=str,
required=True,
)
parser.add_argument(
"--yt1b_start_end_time_file",
type=str,
required=True,
)
parser.add_argument(
"--yt1b_frame_prep_log_file",
type=str,
required=True,
)
parser.add_argument(
"--ffmpeg_timeout",
type=int,
default=7200, # Timeout in seconds; kept long because processing large videos can be slow
)
parser.add_argument(
"--sleep_interval",
type=int,
default=10,
)
parser.add_argument(
"--max_sleep_interval",
type=int,
default=30,
)
parser.add_argument(
"--num_workers",
type=int,
default=4,
)
args = parser.parse_args()
log_dir = os.path.dirname(args.yt1b_frame_prep_log_file)
if log_dir:
os.makedirs(log_dir, exist_ok=True)
# Set up logging to both file and console
# Configure the ROOT logger so all child loggers inherit the configuration
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(processName)s/%(threadName)s] %(name)s - %(levelname)s: %(message)s",
handlers=[
logging.FileHandler(args.yt1b_frame_prep_log_file, mode="w"),
logging.StreamHandler(),
],
force=True, # Override any existing configuration
)
YT_DLP_WARNING_STR = """ ==========
NOTICE!!
This script uses yt-dlp to download youtube videos.
See the youtube account banning risk in https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies
==========
"""
logger.info(YT_DLP_WARNING_STR)
with open(args.yt1b_start_end_time_file, "r") as f:
yt1b_start_end_time_df = pd.read_json(f)
saco_yt1b_ids = yt1b_start_end_time_df.saco_yt1b_id.unique()
num_workers = args.num_workers
logger.info(
f"Starting with {num_workers} parallel worker(s) (sleep_interval={args.sleep_interval}-{args.max_sleep_interval}s)"
)
with mp.Pool(num_workers) as p:
download_func = partial(download_and_extract_frames, args=args)
list(tqdm(p.imap(download_func, saco_yt1b_ids), total=len(saco_yt1b_ids)))
done_str = f""" ==========
All DONE!!
Download, frame extraction, and frame matching are all done! YT1B frames are now ready to use in {args.data_dir}/JPEGImages_6fps
Check the video frame preparation log at {args.yt1b_frame_prep_log_file}
Some videos might no longer be available, which will affect eval reproducibility
==========
"""
logger.info(done_str)
if __name__ == "__main__":
main()

@@ -0,0 +1,265 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import logging
import os
import subprocess
import pandas as pd
import yt_dlp
logger = logging.getLogger(__name__)
class YtVideoPrep:
def __init__(
self,
saco_yt1b_id: str,
data_dir: str,
cookies_file: str,
yt1b_start_end_time_file: str,
ffmpeg_timeout: int,
sleep_interval: int = 10,
max_sleep_interval: int = 30,
):
self.saco_yt1b_id = saco_yt1b_id # saco_yt1b_id is like saco_yt1b_000000
self.data_dir = data_dir
self.cookies_file = cookies_file
self.ffmpeg_timeout = ffmpeg_timeout
self.sleep_interval = sleep_interval
self.max_sleep_interval = max_sleep_interval
self.yt1b_start_end_time_df = pd.read_json(yt1b_start_end_time_file)
(
self.yt_video_id,
self.yt_video_id_w_timestamps,
self.start_time,
self.end_time,
self.expected_num_frames,
) = self._get_yt_video_id_map_info()
self.raw_video_dir = os.path.join(self.data_dir, "raw_videos")
self.raw_video_path = os.path.join(
self.raw_video_dir, f"{self.yt_video_id}.mp4"
)
self.JPEGImages_6fps_dir = os.path.join(
self.data_dir, "JPEGImages_6fps", self.saco_yt1b_id
)
self.JPEGImages_6fps_pattern = os.path.join(
self.JPEGImages_6fps_dir, "%05d.jpg"
)
os.makedirs(self.raw_video_dir, exist_ok=True)
os.makedirs(self.JPEGImages_6fps_dir, exist_ok=True)
def _get_yt_video_id_map_info(self):
df = self.yt1b_start_end_time_df[
self.yt1b_start_end_time_df.saco_yt1b_id == self.saco_yt1b_id
]
assert (
len(df) == 1
), f"Expected exactly 1 row for saco_yt1b_id: {self.saco_yt1b_id}, found {len(df)}"
id_and_frame_map_row = df.iloc[0]
yt_video_id = (
id_and_frame_map_row.yt_video_id
) # yt_video_id is like -06NgWyZxC0
yt_video_id_w_timestamps = id_and_frame_map_row.yt_video_id_w_timestamps
start_time = id_and_frame_map_row.start_time
end_time = id_and_frame_map_row.end_time
expected_num_frames = id_and_frame_map_row.length
return (
yt_video_id,
yt_video_id_w_timestamps,
start_time,
end_time,
expected_num_frames,
)
def download_youtube_video(self):
video_url = f"https://youtube.com/watch?v={self.yt_video_id}"
assert os.path.exists(
self.cookies_file
), f"Cookies file '{self.cookies_file}' not found. Must have it to download videos."
outtmpl = self.raw_video_path
# Check if the output file already exists
if os.path.exists(outtmpl) and os.path.isfile(outtmpl):
return "already exists"
ydl_opts = {
"format": "best[height<=720]/best", # 720p or lower
"outtmpl": outtmpl,
"merge_output_format": "mp4",
"noplaylist": True,
"quiet": True,
"cookiefile": self.cookies_file,
"sleep_interval": self.sleep_interval, # Sleep before each download to avoid rate limiting
"max_sleep_interval": self.max_sleep_interval, # Random sleep for more human-like behavior
}
if self.yt_video_id in ["euohdDLEMRg", "nzfAn7n4d-0"]:
# For "euohdDLEMRg", we have to specify the https protocol or the video sometimes can't be downloaded completely
# For "nzfAn7n4d-0", without the https protocol, the video will be downloaded as 654×480, however we need 490×360 to match the frame matching after the 1080 width resizing
ydl_opts["format"] = (
"best[height<=720][ext=mp4][protocol^=https]/best[ext=mp4][protocol^=https]/best[height<=720]/best"
)
try:
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
ydl.download([video_url])
return "success"
except Exception as e:
logger.warning(
f"[video download][{self.saco_yt1b_id}] Error downloading video {self.yt_video_id}: {e}"
)
return f"error {e}"
def extract_frames_in_6fps_and_width_1080(self):
"""
Extract target frames in 6fps and width 1080.
"""
if not os.path.exists(self.raw_video_path):
logger.warning(
f"[frame extracting][{self.saco_yt1b_id}] Raw video file not found at {self.raw_video_path}"
)
os.rmdir(self.JPEGImages_6fps_dir)
return False
if (
os.path.exists(self.JPEGImages_6fps_dir)
and len(os.listdir(self.JPEGImages_6fps_dir)) == self.expected_num_frames
):
logger.info(
f"[frame extracting][{self.saco_yt1b_id}] JPEGImages_6fps directory already exists at {self.JPEGImages_6fps_dir} and expected number of frames {self.expected_num_frames} matches"
)
return True
# Clear the directory before extracting new frames
for file in os.listdir(self.JPEGImages_6fps_dir):
os.remove(os.path.join(self.JPEGImages_6fps_dir, file))
args = [
"-nostdin",
"-y",
# select video segment
"-ss",
str(self.start_time),
"-to",
str(self.end_time),
"-i",
self.raw_video_path,
# resample to 6 fps and scale to a width of 1080 px (height follows the aspect ratio, rounded to an even number)
"-vf",
"fps=6,scale=1080:-2",
"-vsync",
"0", # passthrough mode - no frame duplication/dropping
"-q:v",
"2", # high quality JPEG output
"-start_number",
"0", # start frame numbering from 0
self.JPEGImages_6fps_pattern,
]
result = subprocess.run(
["ffmpeg"] + args,
timeout=self.ffmpeg_timeout,
capture_output=True,
text=True,
)
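if result.returncode != 0:
logger.warning(
f"[frame extracting][{self.saco_yt1b_id}] Failed to extract raw frames: {result.stderr}"
)
# ffmpeg may have written some frames before failing; remove them so that os.rmdir succeeds
for file in os.listdir(self.JPEGImages_6fps_dir):
os.remove(os.path.join(self.JPEGImages_6fps_dir, file))
os.rmdir(self.JPEGImages_6fps_dir)
return False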
if len(os.listdir(self.JPEGImages_6fps_dir)) != self.expected_num_frames:
logger.warning(
f"[frame extracting][{self.saco_yt1b_id}] Expected {self.expected_num_frames} frames but extracted {len(os.listdir(self.JPEGImages_6fps_dir))}"
)
# Clear the directory after failed extraction
for file in os.listdir(self.JPEGImages_6fps_dir):
os.remove(os.path.join(self.JPEGImages_6fps_dir, file))
os.rmdir(self.JPEGImages_6fps_dir)
return False
logger.info(
f"[frame extracting][{self.saco_yt1b_id}] Successfully extracted {self.expected_num_frames} frames to {self.JPEGImages_6fps_dir}"
)
return True
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--saco_yt1b_id", type=str, required=True)
parser.add_argument(
"--data_dir",
type=str,
required=True,
)
parser.add_argument(
"--cookies_file",
type=str,
required=True,
)
parser.add_argument(
"--yt1b_start_end_time_file",
type=str,
required=True,
)
parser.add_argument(
"--yt1b_frame_prep_log_file",
type=str,
required=True,
)
parser.add_argument(
"--ffmpeg_timeout",
type=int,
default=7200, # Timeout in seconds; kept long because processing large videos can be slow
)
parser.add_argument(
"--sleep_interval",
type=int,
default=10,
)
parser.add_argument(
"--max_sleep_interval",
type=int,
default=30,
)
args = parser.parse_args()
logging.basicConfig(
filename=args.yt1b_frame_prep_log_file,
format="%(asctime)s [%(threadName)s] %(levelname)s: %(message)s",
level=logging.INFO,
filemode="w",
)
video_prep = YtVideoPrep(
saco_yt1b_id=args.saco_yt1b_id,
data_dir=args.data_dir,
cookies_file=args.cookies_file,
yt1b_start_end_time_file=args.yt1b_start_end_time_file,
ffmpeg_timeout=args.ffmpeg_timeout,
sleep_interval=args.sleep_interval,
max_sleep_interval=args.max_sleep_interval,
)
status = video_prep.download_youtube_video()
logger.info(f"[video download][{args.saco_yt1b_id}] download status {status}")
status = video_prep.extract_frames_in_6fps_and_width_1080()
logger.info(
f"[frame extracting][{args.saco_yt1b_id}] frame extracting status {status}"
)
if __name__ == "__main__":
main()

@@ -0,0 +1,95 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
"""This script summarizes ODinW results.

Usage:
python3 scripts/extract_odinw_results.py --res_dir /path/to/results/directory
Expected directory structure:
results_directory/
├── AerialMaritimeDrone_large/val_stats.json
├── Aquarium/val_stats.json
├── CottontailRabbits/val_stats.json
└── ...
"""
import argparse
import json
import os
VAL13_SET = [
"AerialMaritimeDrone_large",
"Aquarium",
"CottontailRabbits",
"EgoHands_generic",
"NorthAmericaMushrooms",
"Packages",
"PascalVOC",
"Raccoon",
"ShellfishOpenImages",
"VehiclesOpenImages",
"pistols",
"pothole",
"thermalDogsAndPeople",
]
METRIC_NAME = "coco_eval_bbox_AP"
def parse_args():
parser = argparse.ArgumentParser("ODinW results aggregation script")
parser.add_argument(
"--res_dir",
required=True,
type=str,
help="Parent directory containing subdirectories for each dataset with val_stats.json files",
)
return parser.parse_args()
def main(args):
# Dictionary to store results for each metric type
metric_results = {METRIC_NAME: []}
subset_results = {subset: {} for subset in VAL13_SET}
# Process each subset directory
for subset in VAL13_SET:
subset_dir = os.path.join(args.res_dir, subset)
val_stats_path = os.path.join(subset_dir, "val_stats.json")
if not os.path.exists(val_stats_path):
print(f"Warning: {val_stats_path} not found, skipping {subset}")
continue
try:
with open(val_stats_path) as f:
res = json.load(f)
subset_results[subset] = res
# Extract metrics for this subset and group by metric type
for key, value in res.items():
if key.endswith(METRIC_NAME):
metric_results[METRIC_NAME].append(value)
except (json.JSONDecodeError, IOError) as e:
print(f"Error reading {val_stats_path}: {e}")
continue
# Print results
values = metric_results[METRIC_NAME]
if values:
avg = sum(values) / len(values)
print(f"Average {METRIC_NAME}: {avg:.4f} ({len(values)} datasets)")
# Show individual dataset results
for subset in VAL13_SET:
if subset in subset_results and subset_results[subset]:
for res_key, res_value in subset_results[subset].items():
if res_key.endswith(METRIC_NAME):
print(f" {subset}: {res_value:.4f}")
break
else:
print(f"No results found for {METRIC_NAME}")
if __name__ == "__main__":
main(parse_args())


@@ -0,0 +1,380 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
"""
Script to extract and analyze training results from Roboflow VL100 experiments.
This script processes training logs and configuration files to extract model performance
metrics and training parameters for analysis and comparison.
"""
import argparse
import json
import os
from typing import Any, Dict, List, Optional
import pandas as pd
import yaml
# Constants
CONFIG_FILENAME = "config_resolved.yaml"
RESULTS_FILENAME = "val_stats.json"
BBOX_AP_METRIC = "Meters_train/val_roboflow100/detection/coco_eval_bbox_AP"
# Roboflow dataset categories organized by domain
ROBOFLOW_CATEGORIES = {
    "sports": [
        "actions",
        "aerial-pool",
        "ball",
        "bibdetection",
        "football-player-detection",
        "lacrosse-object-detection",
    ],
    "other": [
        "buoy-onboarding",
        "car-logo-detection",
        "clashroyalechardetector",
        "cod-mw-warzone",
        "countingpills",
        "everdaynew",
        "flir-camera-objects",
        "halo-infinite-angel-videogame",
        "mahjong",
        "new-defects-in-wood",
        "orionproducts",
        "pill",
        "soda-bottles",
        "taco-trash-annotations-in-context",
        "the-dreidel-project",
    ],
    "aerial": [
        "aerial-airport",
        "aerial-cows",
        "aerial-sheep",
        "apoce-aerial-photographs-for-object-detection-of-construction-equipment",
        "electric-pylon-detection-in-rsi",
        "floating-waste",
        "human-detection-in-floods",
        "sssod",
        "uavdet-small",
        "wildfire-smoke",
        "zebrasatasturias",
    ],
    "medical": [
        "canalstenosis",
        "crystal-clean-brain-tumors-mri-dataset",
        "dentalai",
        "inbreast",
        "liver-disease",
        "nih-xray",
        "spinefrxnormalvindr",
        "stomata-cells",
        "train",
        "ufba-425",
        "urine-analysis1",
        "x-ray-id",
        "xray",
    ],
    "document": [
        "activity-diagrams",
        "all-elements",
        "circuit-voltages",
        "invoice-processing",
        "label-printing-defect-version-2",
        "macro-segmentation",
        "paper-parts",
        "signatures",
        "speech-bubbles-detection",
        "wine-labels",
    ],
    "industrial": [
        "-grccs",
        "13-lkc01",
        "2024-frc",
        "aircraft-turnaround-dataset",
        "asphaltdistressdetection",
        "cable-damage",
        "conveyor-t-shirts",
        "dataconvert",
        "deeppcb",
        "defect-detection",
        "fruitjes",
        "infraredimageofpowerequipment",
        "ism-band-packet-detection",
        "l10ul502",
        "needle-base-tip-min-max",
        "recode-waste",
        "screwdetectclassification",
        "smd-components",
        "truck-movement",
        "tube",
        "water-meter",
        "wheel-defect-detection",
    ],
    "flora_fauna": [
        "aquarium-combined",
        "bees",
        "deepfruits",
        "exploratorium-daphnia",
        "grapes-5",
        "grass-weeds",
        "gwhd2021",
        "into-the-vale",
        "jellyfish",
        "marine-sharks",
        "orgharvest",
        "peixos-fish",
        "penguin-finder-seg",
        "pig-detection",
        "roboflow-trained-dataset",
        "sea-cucumbers-new-tiles",
        "thermal-cheetah",
        "tomatoes-2",
        "trail-camera",
        "underwater-objects",
        "varroa-mites-detection--test-set",
        "wb-prova",
        "weeds4",
    ],
}


def load_jsonl_last_row(file_path: str, keys: List[str]) -> Optional[Dict[str, Any]]:
    """
    Load the last row from a JSONL file and extract specific keys.

    Args:
        file_path: Path to the JSONL file
        keys: List of keys to extract from the last row

    Returns:
        Dictionary with extracted key-value pairs, or None if file not found/empty
    """
    if not os.path.exists(file_path):
        print(f"Warning: File not found: {file_path}")
        return None

    last_row = None
    try:
        with open(file_path, "r") as file:
            for line in file:
                line = line.strip()
                # Skip blank lines (e.g. trailing empty lines) rather than
                # failing to parse them as JSON
                if line:
                    last_row = json.loads(line)
        if last_row is None:
            print(f"Warning: Empty JSONL file: {file_path}")
            return None
        return {key: last_row.get(key) for key in keys}
    except json.JSONDecodeError as e:
        print(f"Error: Failed to parse JSON in {file_path}: {e}")
        return None
    except Exception as e:
        print(f"Error: Failed to read {file_path}: {e}")
        return None


def find_config_files(directory: str, filename: str = CONFIG_FILENAME) -> List[str]:
    """
    Recursively find configuration files with a specific filename.

    Args:
        directory: Root directory to search
        filename: Target filename to search for

    Returns:
        List of full paths to matching files
    """
    matching_files = []
    for root, _, files in os.walk(directory):
        # Skip code directories
        if "/code/" in root:
            continue
        if filename in files:
            matching_files.append(os.path.join(root, filename))
    return matching_files


def extract_config_parameters(config_path: str, keys: List[str]) -> Dict[str, Any]:
    """
    Extract specific parameters from a YAML configuration file.

    Args:
        config_path: Path to the YAML configuration file
        keys: List of keys to extract from the 'scratch' section

    Returns:
        Dictionary containing extracted parameters
    """
    try:
        with open(config_path, "r") as file:
            data = yaml.safe_load(file)

        # Extract parameters from scratch section
        scratch_params = {key: data["scratch"].get(key) for key in keys}

        # Add computed parameters
        launcher = data.get("launcher", {})
        scratch_params["batch_size"] = int(launcher.get("gpus_per_node", 1)) * int(
            launcher.get("num_nodes", 1)
        )
        scratch_params["lr_scale"] = data["scratch"].get("lr_scale")
        roboflow_train = data.get("roboflow_train", {})
        scratch_params["roboflow_num_images"] = roboflow_train.get("num_images")
        return scratch_params
    except Exception as e:
        print(f"Error: Failed to parse config file {config_path}: {e}")
        return {}


def calculate_average(values_dict: Dict[str, float]) -> float:
    """
    Calculate the average of values in a dictionary.

    Args:
        values_dict: Dictionary with numeric values

    Returns:
        Average of all values, or 0 if empty
    """
    if not values_dict:
        return 0.0
    return sum(values_dict.values()) / len(values_dict)


def extract_category_results(log_dir: str, categories: List[str]) -> Dict[str, float]:
    """
    Extract bbox AP results for specific categories from log files.

    Args:
        log_dir: Directory containing category log subdirectories
        categories: List of category names to extract results for

    Returns:
        Dictionary mapping category names to bbox AP scores
    """
    results = {}
    metric_keys = [BBOX_AP_METRIC]
    for category in categories:
        result_file = os.path.join(log_dir, f"logs/{category}/{RESULTS_FILENAME}")
        category_result = load_jsonl_last_row(result_file, metric_keys)
        if category_result is not None and category_result[BBOX_AP_METRIC] is not None:
            results[category] = category_result[BBOX_AP_METRIC]
    return results


def analyze_experiment_results(config_path: str) -> None:
    """
    Analyze results from a single experiment configuration.

    Args:
        config_path: Path to the experiment configuration file
    """
    print("=" * 80)
    print(f"Analyzing experiment: {config_path}")
    print("=" * 80)

    # Extract configuration parameters
    config_keys = [
        "lr_transformer",
        "lr_vision_backbone",
        "lr_language_backbone",
        "max_data_epochs",
    ]
    config_params = extract_config_parameters(config_path, config_keys)
    print("Configuration Parameters:")
    for key, value in config_params.items():
        print(f" {key}: {value}")
    print()

    # Extract results for each category
    experiment_dir = os.path.dirname(config_path)
    category_results = {}
    category_averages = {}
    all_scores = []
    for super_category, categories in ROBOFLOW_CATEGORIES.items():
        category_results[super_category] = extract_category_results(
            experiment_dir, categories
        )
        if category_results[super_category]:
            category_averages[super_category] = calculate_average(
                category_results[super_category]
            )
            all_scores.extend(category_results[super_category].values())

    # Print results summary
    print("Results by Category:")
    for super_category, avg_score in category_averages.items():
        num_categories = len(category_results[super_category])
        print(f" {super_category}: {avg_score:.4f} (n={num_categories})")
    print("\nOverall Results:")
    # Mean of the per-super-category means: each super-category counts
    # equally, regardless of how many datasets it contains
    print(f" Weighted average: {calculate_average(category_averages):.4f}")
    print(f" Total categories: {len(all_scores)}")
    # Mean over all individual datasets; guard against an empty score list
    # to avoid a ZeroDivisionError when no results were found
    true_average = sum(all_scores) / len(all_scores) if all_scores else 0.0
    print(f" True average: {true_average:.4f}")
    print()


def print_results_table(results_data: List[Dict[str, Any]]) -> None:
    """
    Print results in a formatted table.

    Args:
        results_data: List of dictionaries containing results data
    """
    if not results_data:
        print("No results data to display.")
        return
    df = pd.DataFrame(results_data)
    print("\nResults Summary Table:")
    print("=" * 60)
    print(df.to_string(index=False))


def main() -> None:
    """Main function to orchestrate the results extraction and analysis."""
    parser = argparse.ArgumentParser(
        description="Extract and analyze Roboflow VL100 training results"
    )
    parser.add_argument(
        "-p",
        "--path",
        type=str,
        required=True,
        help="Root directory path containing experiment results",
    )
    args = parser.parse_args()

    # Find all configuration files
    config_files = find_config_files(args.path, CONFIG_FILENAME)
    if not config_files:
        print(f"No configuration files found in {args.path}")
        return
    print(f"Found {len(config_files)} experiment configurations")
    print()

    # Analyze each experiment
    for config_file in config_files:
        try:
            analyze_experiment_results(config_file)
        except Exception as e:
            print(f"Error analyzing {config_file}: {e}")
            continue


if __name__ == "__main__":
    main()