Initial commit
fbshipit-source-id: da6be2f26e3a1202f4bffde8cb980e2dcb851294

scripts/eval/gold/README.md
# SA-Co/Gold benchmark

SA-Co/Gold is a benchmark for promptable concept segmentation (PCS) in images. The benchmark contains images paired with text labels, also referred to as Noun Phrases (NPs), each annotated exhaustively with masks on all object instances that match the label. SA-Co/Gold comprises 7 subsets, each targeting a different annotation domain: MetaCLIP captioner NPs, SA-1B captioner NPs, Attributes, Crowded Scenes, Wiki-Common1K, Wiki-Food/Drink, and Wiki-Sports Equipment. The images originally come from the MetaCLIP and SA-1B datasets.

For each subset, the annotations are multi-reviewed by 3 independent human annotators. Each row in the figure below shows an image and noun phrase pair from one of the domains, along with masks from the 3 annotators. Dashed borders indicate special group masks that cover more than a single instance, used when separating into instances is deemed too difficult. Annotators sometimes disagree on precise mask borders, the number of instances, and whether the phrase exists in the image at all. Having 3 independent annotations allows us to measure human agreement on the task, which serves as an upper bound for model performance.

<p align="center">
  <img src="../../../assets/saco_gold_annotation.png?" style="width:80%;" />
</p>

# Preparation
## Download annotations

The GT annotations can be downloaded from [Hugging Face](https://huggingface.co/datasets/facebook/SACo-Gold) or [Roboflow](https://universe.roboflow.com/sa-co-gold).
## Download images

There are two image sources for the evaluation dataset: MetaCLIP and SA-1B.

1) The MetaCLIP images are referenced by 6 of the 7 subsets (MetaCLIP captioner NPs, Attributes, Crowded Scenes, Wiki-Common1K, Wiki-Food/Drink, Wiki-Sports Equipment) and can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-gold/gold-metaclip-merged-a-release-test/).

2) The SA-1B images are referenced by 1 of the 7 subsets (SA-1B captioner NPs) and can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-gold/gold-sa-1b-merged-a-release-test/). Alternatively, they can be downloaded from [here](https://ai.meta.com/datasets/segment-anything-downloads/): follow the dynamic links available under `Download text file` and fetch `sa_co_gold.tar`, which contains the SA-1B images referenced in SA-Co/Gold.
# Usage

## Visualization

- Visualize GT annotations: [saco_gold_silver_vis_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_gold_silver_vis_example.ipynb)
- Visualize GT annotations and sample predictions side-by-side: [sam3_data_and_predictions_visualization.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/sam3_data_and_predictions_visualization.ipynb)

## Run evaluation

The official metric for SA-Co/Gold is cgF1. Please refer to the SAM3 paper for details.

Our evaluator inherits from the official COCO evaluator, with some modifications. Recall that in the Gold subset there are three annotations for each datapoint; we evaluate against each of them and pick the most favorable (oracle setting). The evaluator has minimal dependencies (pycocotools, numpy, and scipy) to ease reuse in other projects. In this section we provide several pointers for running the evaluation of SAM3 or third-party models.
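The oracle setting can be sketched in a few lines: score the predictions against each annotator's ground truth independently and keep the most favorable result. Below is a toy illustration in which masks are abstracted as sets of instance IDs and a plain F1 stands in for cgF1; the function names (`f1`, `oracle_score`) are ours, not part of the evaluator, which matches masks by IoU via pycocotools.

```python
# Toy sketch of the oracle setting: evaluate against each of the
# 3 annotators' ground truths and keep the most favorable score.

def f1(pred: set, gt: set) -> float:
    """Plain F1 between a predicted and a ground-truth set of instance IDs."""
    if not pred and not gt:
        return 1.0  # both empty: perfect agreement on a negative NP
    tp = len(pred & gt)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)

def oracle_score(pred: set, gts_per_annotator: list) -> float:
    """Score against each annotator's GT and keep the best (oracle)."""
    return max(f1(pred, gt) for gt in gts_per_annotator)

pred = {1, 2, 3}
gts = [{1, 2}, {1, 2, 3, 4}, {5}]
print(oracle_score(pred, gts))
```

The real evaluator (`CGF1Evaluator`) applies the same "max over annotators" idea, but with cgF1 computed from IoU-based mask matching.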
### Evaluate SAM3

We provide inference configurations to reproduce the evaluation of SAM3.
First, edit the file [eval_base.yaml](https://github.com/facebookresearch/sam3/blob/main/sam3/train/configs/eval_base.yaml) with the paths where you downloaded the images and annotations above.

There are 7 subsets, and as many configurations to run.
Taking the first subset as an example, inference can be run locally using the following command (you can adjust the number of GPUs):

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml --use-cluster 0 --num-gpus 1
```

The predictions will be dumped in the folder specified in `eval_base.yaml`.
We also provide support for SLURM-based cluster inference. Edit the `eval_base.yaml` file to reflect your SLURM configuration (partition, QOS, ...), then run

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml --use-cluster 1
```

We provide the commands for all subsets below.
#### MetaCLIP captioner NPs

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml --use-cluster 1
```

#### SA-1B captioner NPs

This subset uses the SA-1B images; the other 6 subsets use the MetaCLIP images.

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_sa1b_nps.yaml --use-cluster 1
```

#### Attributes

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_attributes.yaml --use-cluster 1
```

#### Crowded Scenes

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_crowded.yaml --use-cluster 1
```

#### Wiki-Common1K

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_wiki_common.yaml --use-cluster 1
```

#### Wiki-Food/Drink

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_fg_food.yaml --use-cluster 1
```

#### Wiki-Sports Equipment

```bash
python sam3/train/train.py -c configs/gold_image_evals/sam3_gold_image_fg_sports.yaml --use-cluster 1
```
### Offline evaluation

If you have predictions in the COCO result format (see [here](https://cocodataset.org/#format-results)), we provide scripts to easily run the evaluation.
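For reference, a segmentation result file in this format is a flat JSON list with one entry per predicted mask. A minimal sketch of producing such a file; the RLE `counts` string below is a placeholder and all field values are illustrative:

```python
import json

# One entry per predicted mask; "segmentation" holds a COCO RLE.
# The "counts" string here is a placeholder, not a real encoded mask.
predictions = [
    {
        "image_id": 10000000,  # matches the `id` of an image-NP pair
        "category_id": 1,      # always 1 in SA-Co/Gold
        "segmentation": {"size": [600, 600], "counts": "<RLE>"},
        "score": 0.97,         # model confidence
    },
]

with open("coco_predictions_segm.json", "w") as f:
    json.dump(predictions, f)
```

In practice the RLE would come from `pycocotools.mask.encode` applied to a binary mask.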
For an example of how to run the evaluator on all subsets and aggregate results, see the following notebook: [saco_gold_silver_eval_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_gold_silver_eval_example.ipynb)

Alternatively, you can run `python scripts/eval/gold/eval_sam3.py`.

If you have a prediction file for a single subset, you can run the evaluator on just that subset using the standalone script. Example:

```bash
python scripts/eval/standalone_cgf1.py --pred_file /path/to/coco_predictions_segm.json --gt_files /path/to/annotations/gold_metaclip_merged_a_release_test.json /path/to/annotations/gold_metaclip_merged_b_release_test.json /path/to/annotations/gold_metaclip_merged_c_release_test.json
```
# Results

Here we collect the segmentation results for SAM3 and some baselines. Note that baselines that do not produce masks are evaluated by converting their boxes to masks with SAM2.
<table style="border-color:black;border-style:solid;border-width:1px;border-collapse:collapse;border-spacing:0;text-align:right" class="tg"><thead>
<tr><th style="text-align:center"></th><th style="text-align:center" colspan="3">Average</th><th style="text-align:center" colspan="3">Captioner metaclip</th><th style="text-align:center" colspan="3">Captioner sa1b</th>
<th style="text-align:center" colspan="3">Crowded</th><th style="text-align:center" colspan="3">FG food</th><th style="text-align:center" colspan="3">FG sport</th><th style="text-align:center" colspan="3">Attributes</th>
<th style="text-align:center" colspan="3">Wiki common</th></tr>
</thead>
<tbody>
<tr><td ></td><td >cgF1</td><td >IL_MCC</td><td >positive_micro_F1</td>
<td >cgF1</td><td >IL_MCC</td><td >positive_micro_F1</td><td >cgF1</td>
<td >IL_MCC</td><td >positive_micro_F1</td><td >cgF1</td><td >IL_MCC</td>
<td >positive_micro_F1</td><td >cgF1</td><td >IL_MCC</td><td >positive_micro_F1</td>
<td >cgF1</td><td >IL_MCC</td><td >positive_micro_F1</td><td >cgF1</td>
<td >IL_MCC</td><td >positive_micro_F1</td><td >cgF1</td><td >IL_MCC</td>
<td >positive_micro_F1</td></tr>
<tr><td >gDino-T</td><td >3.25</td><td >0.15</td><td >16.2</td>
<td >2.89</td><td >0.21</td><td >13.88</td><td >3.07</td>
<td >0.2</td><td >15.35</td><td >0.28</td><td >0.08</td>
<td >3.37</td><td >0.96</td><td >0.1</td><td >9.83</td>
<td >1.12</td><td >0.1</td><td >11.2</td><td >13.75</td>
<td >0.29</td><td >47.3</td><td >0.7</td><td >0.06</td>
<td >12.14</td></tr>
<tr><td >OWLv2*</td><td >24.59</td><td >0.57</td><td >42</td>
<td >17.69</td><td >0.52</td><td >34.27</td><td >13.32</td>
<td >0.5</td><td >26.83</td><td >15.8</td><td >0.51</td>
<td >30.74</td><td >31.96</td><td >0.65</td><td >49.35</td>
<td >36.01</td><td >0.64</td><td >56.19</td><td >35.61</td>
<td >0.63</td><td >56.23</td><td >21.73</td><td >0.54</td>
<td >40.25</td></tr>
<tr><td >OWLv2</td><td >17.27</td><td >0.46</td><td >36.8</td>
<td >12.21</td><td >0.39</td><td >31.33</td><td >9.76</td>
<td >0.45</td><td >21.65</td><td >8.87</td><td >0.36</td>
<td >24.77</td><td >24.36</td><td >0.51</td><td >47.85</td>
<td >24.44</td><td >0.52</td><td >46.97</td><td >25.85</td>
<td >0.54</td><td >48.22</td><td >15.4</td><td >0.42</td>
<td >36.64</td></tr>
<tr><td >LLMDet-L</td><td >6.5</td><td >0.21</td><td >27.3</td>
<td >4.49</td><td >0.23</td><td >19.36</td><td >5.32</td>
<td >0.23</td><td >22.81</td><td >2.42</td><td >0.18</td>
<td >13.74</td><td >5.5</td><td >0.19</td><td >29.12</td>
<td >4.39</td><td >0.17</td><td >25.34</td><td >22.17</td>
<td >0.39</td><td >57.13</td><td >1.18</td><td >0.05</td>
<td >23.3</td></tr>
<tr><td >APE</td><td >16.41</td><td >0.4</td><td >36.9</td>
<td >12.6</td><td >0.42</td><td >30.11</td><td >2.23</td>
<td >0.22</td><td >10.01</td><td >7.15</td><td >0.35</td>
<td >20.3</td><td >22.74</td><td >0.51</td><td >45.01</td>
<td >31.79</td><td >0.56</td><td >56.45</td><td >26.74</td>
<td >0.47</td><td >57.27</td><td >11.59</td><td >0.29</td>
<td >39.46</td></tr>
<tr><td >DINO-X</td><td >21.26</td><td >0.38</td><td >55.2</td>
<td >17.21</td><td >0.35</td><td >49.17</td><td >19.66</td>
<td >0.48</td><td >40.93</td><td >12.86</td><td >0.34</td>
<td >37.48</td><td >30.07</td><td >0.49</td><td >61.72</td>
<td >28.36</td><td >0.41</td><td >69.4</td><td >30.97</td>
<td >0.42</td><td >74.04</td><td >9.72</td><td >0.18</td>
<td >53.52</td></tr>
<tr><td >Gemini 2.5</td><td >13.03</td><td >0.29</td><td >46.1</td>
<td >9.9</td><td >0.29</td><td >33.79</td><td >13.1</td>
<td >0.41</td><td >32.1</td><td >8.15</td><td >0.27</td>
<td >30.34</td><td >19.63</td><td >0.33</td><td >59.52</td>
<td >15.07</td><td >0.28</td><td >53.5</td><td >18.84</td>
<td >0.3</td><td >63.14</td><td >6.5</td><td >0.13</td>
<td >50.32</td></tr>
<tr><td >SAM 3</td><td >54.06</td><td >0.82</td><td >66.11</td>
<td >47.26</td><td >0.81</td><td >58.58</td><td >53.69</td>
<td >0.86</td><td >62.55</td><td >61.08</td><td >0.9</td>
<td >67.73</td><td >53.41</td><td >0.79</td><td >67.28</td>
<td >65.52</td><td >0.89</td><td >73.75</td><td >54.93</td>
<td >0.76</td><td >72</td><td >42.53</td><td >0.7</td>
<td >60.85</td></tr>
</tbody></table>
# Annotation format

The annotation format is derived from [COCO format](https://cocodataset.org/#format-data). Notable data fields are:
- `images`: a `list` of `dict` features, containing all image-NP pairs. Each entry corresponds to one image-NP pair and has the following items:
  - `id`: an `int` feature, unique identifier for the image-NP pair
  - `text_input`: a `string` feature, the noun phrase for the image-NP pair
  - `file_name`: a `string` feature, the relative image path in the corresponding data folder
  - `height`/`width`: dimensions of the image
  - `is_instance_exhaustive`: Boolean (0 or 1). If 1, all instances are correctly annotated; for instance segmentation, we only use those datapoints. Otherwise, there may be missing instances or crowd segments (a segment covering multiple instances)
  - `is_pixel_exhaustive`: Boolean (0 or 1). If 1, the union of all masks covers all pixels corresponding to the prompt. This is weaker than `is_instance_exhaustive`, since it allows crowd segments. It can be used for semantic segmentation evaluations.
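For instance, selecting the instance-exhaustive datapoints used for instance segmentation is a simple filter over `images`; a sketch with toy entries (in practice the lists come from `json.load` on an annotation file):

```python
# Keep only image-NP pairs that are exhaustively annotated at the
# instance level (the only ones used for instance segmentation).
images = [
    {"id": 10000000, "text_input": "chili", "is_instance_exhaustive": 1},
    {"id": 10000001, "text_input": "the fish ball", "is_instance_exhaustive": 0},
]

instance_eval_images = [im for im in images if im["is_instance_exhaustive"] == 1]
print([im["id"] for im in instance_eval_images])
```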
- `annotations`: a `list` of `dict` features, containing all annotations, including bounding box, segmentation mask, area, etc.
  - `image_id`: an `int` feature, maps to the identifier of the image-NP pair in `images`
  - `bbox`: a `list` of `float` features, the bounding box in [x,y,w,h] format, normalized by the image dimensions
  - `segmentation`: a `dict` feature, the segmentation mask in RLE format
  - `category_id`: present for compatibility with the COCO format; always 1 and unused
  - `iscrowd`: Boolean (0 or 1). If 1, the segment overlaps several instances (used when instances are not separable, e.g. due to poor image quality)
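Since `bbox` is normalized (unlike standard COCO, which stores pixel units), converting a box back to pixel coordinates is a per-coordinate multiplication by the image size. A small sketch; the helper name is ours:

```python
def denormalize_bbox(bbox, width, height):
    """Convert a normalized [x, y, w, h] box to pixel coordinates."""
    x, y, w, h = bbox
    return [x * width, y * height, w * width, h * height]

# First annotation of the sample further below: a 600x600 image
print(denormalize_bbox(
    [0.44333332777023315, 0.0, 0.10833333432674408, 0.05833333358168602],
    600, 600,
))
```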
- `categories`: a `list` of `dict` features, containing all categories. The category key is provided only for compatibility with the COCO format; in open-vocabulary detection we do not use it. Instead, the text prompt is stored directly in each image (`text_input` in `images`). Note that in our setting, a unique image (`id` in `images`) actually corresponds to an (image, text prompt) combination.

We call an `id` in `images` a "positive" NP if it has corresponding annotations (i.e. it appears as an `image_id` in `annotations`), and a "negative" NP if it has no annotations at all (i.e. it never appears as an `image_id` in `annotations`).
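Splitting the image-NP pairs into positive and negative NPs is then a membership check of each `id` against the `image_id` values in `annotations`; a sketch with toy data:

```python
# Toy annotation file: one positive NP (two masks) and two negative NPs.
images = [{"id": 10000000}, {"id": 10000001}, {"id": 10000002}]
annotations = [
    {"id": 1, "image_id": 10000000},
    {"id": 2, "image_id": 10000000},
]

annotated_ids = {ann["image_id"] for ann in annotations}
positive_nps = [im["id"] for im in images if im["id"] in annotated_ids]
negative_nps = [im["id"] for im in images if im["id"] not in annotated_ids]
print(positive_nps, negative_nps)
```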
A sample annotation from the Wiki-Food/Drink domain looks as follows:

#### images

```
[
  {
    "id": 10000000,
    "file_name": "1/1001/metaclip_1_1001_c122868928880ae52b33fae1.jpeg",
    "text_input": "chili",
    "width": 600,
    "height": 600,
    "queried_category": "0",
    "is_instance_exhaustive": 1,
    "is_pixel_exhaustive": 1
  },
  {
    "id": 10000001,
    "file_name": "1/1001/metaclip_1_1001_c122868928880ae52b33fae1.jpeg",
    "text_input": "the fish ball",
    "width": 600,
    "height": 600,
    "queried_category": "2001",
    "is_instance_exhaustive": 1,
    "is_pixel_exhaustive": 1
  }
]
```
#### annotations

```
[
  {
    "id": 1,
    "image_id": 10000000,
    "source": "manual",
    "area": 0.002477777777777778,
    "bbox": [
      0.44333332777023315,
      0.0,
      0.10833333432674408,
      0.05833333358168602
    ],
    "segmentation": {
      "counts": "`kk42fb01O1O1O1O001O1O1O001O1O00001O1O001O001O0000000000O1001000O010O02O001N10001N0100000O10O1000O10O010O100O1O1O1O1O0000001O0O2O1N2N2Nobm4",
      "size": [
        600,
        600
      ]
    },
    "category_id": 1,
    "iscrowd": 0
  },
  {
    "id": 2,
    "image_id": 10000000,
    "source": "manual",
    "area": 0.001275,
    "bbox": [
      0.5116666555404663,
      0.5716666579246521,
      0.061666667461395264,
      0.036666665226221085
    ],
    "segmentation": {
      "counts": "aWd51db05M1O2N100O1O1O1O1O1O010O100O10O10O010O010O01O100O100O1O00100O1O100O1O2MZee4",
      "size": [
        600,
        600
      ]
    },
    "category_id": 1,
    "iscrowd": 0
  }
]
```
# Data Stats

Here are the stats for the 7 annotation domains. The # Image-NPs column gives the total number of unique image-NP pairs, including both “positive” and “negative” NPs.

| Domain                 | Media    | # Image-NPs | # Image-NP-Masks |
|------------------------|----------|-------------|------------------|
| MetaCLIP captioner NPs | MetaCLIP | 33393       | 20144            |
| SA-1B captioner NPs    | SA-1B    | 13258       | 30306            |
| Attributes             | MetaCLIP | 9245        | 3663             |
| Crowded Scenes         | MetaCLIP | 20687       | 50417            |
| Wiki-Common1K          | MetaCLIP | 65502       | 6448             |
| Wiki-Food&Drink        | MetaCLIP | 13951       | 9825             |
| Wiki-Sports Equipment  | MetaCLIP | 12166       | 5075             |
scripts/eval/gold/eval_sam3.py
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved

"""Script to run the evaluator offline, given the GTs for the SA-Co/Gold test set and SAM3 model prediction files.

It reports cgF1, IL_MCC, PM_F1 metrics for each subset of the SA-Co/Gold test set.

Usage: python eval_sam3.py --gt-folder <folder_with_gts> --pred-folder <folder_with_predictions>
"""

import argparse
import os

from sam3.eval.cgf1_eval import CGF1Evaluator

# Relative file names of the GT files for the 7 SA-Co/Gold subsets
saco_gold_gts = {
    # MetaCLIP Captioner
    "metaclip_nps": [
        "gold_metaclip_merged_a_release_test.json",
        "gold_metaclip_merged_b_release_test.json",
        "gold_metaclip_merged_c_release_test.json",
    ],
    # SA-1B captioner
    "sa1b_nps": [
        "gold_sa1b_merged_a_release_test.json",
        "gold_sa1b_merged_b_release_test.json",
        "gold_sa1b_merged_c_release_test.json",
    ],
    # Crowded
    "crowded": [
        "gold_crowded_merged_a_release_test.json",
        "gold_crowded_merged_b_release_test.json",
        "gold_crowded_merged_c_release_test.json",
    ],
    # FG Food
    "fg_food": [
        "gold_fg_food_merged_a_release_test.json",
        "gold_fg_food_merged_b_release_test.json",
        "gold_fg_food_merged_c_release_test.json",
    ],
    # FG Sports
    "fg_sports_equipment": [
        "gold_fg_sports_equipment_merged_a_release_test.json",
        "gold_fg_sports_equipment_merged_b_release_test.json",
        "gold_fg_sports_equipment_merged_c_release_test.json",
    ],
    # Attributes
    "attributes": [
        "gold_attributes_merged_a_release_test.json",
        "gold_attributes_merged_b_release_test.json",
        "gold_attributes_merged_c_release_test.json",
    ],
    # Wiki common
    "wiki_common": [
        "gold_wiki_common_merged_a_release_test.json",
        "gold_wiki_common_merged_b_release_test.json",
        "gold_wiki_common_merged_c_release_test.json",
    ],
}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-g",
        "--gt-folder",
        type=str,
        help="Path to the folder containing the ground truth json files.",
    )
    parser.add_argument(
        "-p",
        "--pred-folder",
        type=str,
        help="Path to the folder containing the predictions json files.",
    )
    args = parser.parse_args()

    results = ""

    for subset_name, gts in saco_gold_gts.items():
        print("Processing subset: ", subset_name)
        gt_paths = [os.path.join(args.gt_folder, gt) for gt in gts]
        evaluator = CGF1Evaluator(
            gt_path=gt_paths, verbose=True, iou_type="segm"
        )  # change to bbox if you want detection performance

        pred_path = os.path.join(
            args.pred_folder,
            f"gold_{subset_name}/dumps/gold_{subset_name}/coco_predictions_segm.json",
        )
        summary = evaluator.evaluate(pred_path)

        cgf1 = str(round(summary["cgF1_eval_segm_cgF1"] * 100, 2))
        il_mcc = str(round(summary["cgF1_eval_segm_IL_MCC"], 2))
        pmf1 = str(round(summary["cgF1_eval_segm_positive_micro_F1"] * 100, 2))
        final_str = f"{cgf1},{il_mcc},{pmf1}"
        results += subset_name + ": " + final_str + "\n"

    print("Subset name, CG_F1, IL_MCC, pmF1")
    print(results)


if __name__ == "__main__":
    main()
scripts/eval/silver/CONFIG_FRAMES.yaml
path_annotations: <YOUR_ANNOTATIONS_PATH>/saco_frames_test_sets/annotations/

# Paths with downloaded data
droid_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/droid/
sav_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/sav/
ego4d_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/ego4d/
yt1b_path: <YOUR_DATASET_PATH>/saco_frames_test_sets/yt1b/

# Configuration to download and extract video frames
cookies_path: <YOUR_COOKIES_PATH>/cookies.txt # Required to download YT1B videos
update_annotation_yt1b: true
update_annotation_ego4d: true

sav_videos_fps_6_download_path: ''

remove_downloaded_videos_yt1b: false
remove_downloaded_videos_droid: false
remove_downloaded_videos_ego4d: false
remove_downloaded_videos_sav: false

# Configuration for visualization of data
num_images_show: 5
saco_subset_show: yt1b # Options: [yt1b, ego4d, sav, droid]
directory_save: <YOUR_SAVE_DIR>
scripts/eval/silver/README.md
# SA-Co/Silver benchmark

SA-Co/Silver is a benchmark for promptable concept segmentation (PCS) in images. The benchmark contains images paired with text labels (also referred to as Noun Phrases, or NPs), each annotated exhaustively with masks on all object instances that match the label.

SA-Co/Silver comprises 10 subsets, covering a diverse array of domains including food, art, robotics, driving, etc. Unlike SA-Co/Gold, there is only a single ground truth for each datapoint, which means the results may have somewhat more variance and tend to underestimate model performance, since they don't account for possible different interpretations of each query.

- BDD100k
- DROID
- Ego4D
- MyFoodRepo-273
- GeoDE
- iNaturalist-2017
- National Gallery of Art
- SA-V
- YT-Temporal-1B
- Fathomnet

This README contains instructions on how to download and set up the annotations and image data to prepare them for evaluation on SA-Co/Silver.
# Preparation

## Download annotations

The GT annotations can be downloaded from [Hugging Face](https://huggingface.co/datasets/facebook/SACo-Silver) or [Roboflow](https://universe.roboflow.com/sa-co-silver).

## Download images and video frames

### Image Datasets
#### GeoDE

The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/geode/), OR follow the steps below to prepare them yourself.

1. Download the dataset with raw images from [GeoDE](https://geodiverse-data-collection.cs.princeton.edu/).
2. Extract the downloaded file to a location, say `<RAW_GEODE_IMAGES_FOLDER>`.
3. Run the command below to pre-process the images and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_GEODE_IMAGES_FOLDER>`:
```bash
python preprocess_silver_geode_bdd100k_food_rec.py --annotation_file <FOLDER_WITH_SILVER_ANNOTATIONS>/silver_geode_merged_test.json --raw_images_folder <RAW_GEODE_IMAGES_FOLDER> --processed_images_folder <PROCESSED_GEODE_IMAGES_FOLDER> --dataset_name geode
```
#### National Gallery of Art (NGA)

The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/national-gallery-of-art/), OR follow the step below to prepare them yourself.

1. Run the command below to download the raw images and pre-process them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_NGA_IMAGES_FOLDER>`:
```bash
python download_preprocess_nga.py --annotation_file <FOLDER_WITH_SILVER_ANNOTATIONS>/silver_nga_art_merged_test.json --raw_images_folder <RAW_NGA_IMAGES_FOLDER> --processed_images_folder <PROCESSED_NGA_IMAGES_FOLDER>
```
#### Berkeley Driving Dataset (BDD) 100k

The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/bdd100k-gwmh6/), OR follow the steps below to prepare them yourself.

1. Download the data with raw images from the `100K Images` dataset in [BDD100k](http://bdd-data.berkeley.edu/download.html).
2. Extract the downloaded file to a location, say `<RAW_BDD_IMAGES_FOLDER>`.
3. Run the command below to pre-process the images and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_BDD_IMAGES_FOLDER>`:
```bash
python preprocess_silver_geode_bdd100k_food_rec.py --annotation_file <FOLDER_WITH_SILVER_ANNOTATIONS>/silver_bdd100k_merged_test.json --raw_images_folder <RAW_BDD_IMAGES_FOLDER> --processed_images_folder <PROCESSED_BDD_IMAGES_FOLDER> --dataset_name bdd100k
```
#### Food Recognition Challenge 2022

1. Download the data with raw images from the [website](https://www.aicrowd.com/challenges/food-recognition-benchmark-2022); download the `[Round 2] public_validation_set_2.0.tar.gz` file.
2. Extract the downloaded file to a location, say `<RAW_FOOD_IMAGES_FOLDER>`.
3. Run the command below to pre-process the images and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_FOOD_IMAGES_FOLDER>`:
```bash
python preprocess_silver_geode_bdd100k_food_rec.py --annotation_file <FOLDER_WITH_SILVER_ANNOTATIONS>/silver_food_rec_merged_test.json --raw_images_folder <RAW_FOOD_IMAGES_FOLDER> --processed_images_folder <PROCESSED_FOOD_IMAGES_FOLDER> --dataset_name food_rec
```
#### iNaturalist

The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/inaturalist-2017/), OR follow the step below to prepare them yourself.

1. Run the command below to download and extract the images into `<RAW_INATURALIST_IMAGES_FOLDER>` and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_INATURALIST_IMAGES_FOLDER>`:
```bash
python download_inaturalist.py --raw_images_folder <RAW_INATURALIST_IMAGES_FOLDER> --processed_images_folder <PROCESSED_INATURALIST_IMAGES_FOLDER>
```
#### Fathomnet

The processed images needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/fathomnet-kmz5d/), OR follow the steps below to prepare them yourself.

1. Install the FathomNet API:
```bash
pip install fathomnet
```

2. Run the command below to download the images and prepare them for evaluation. The processed images will be saved to the location specified in `<PROCESSED_FATHOMNET_IMAGES_FOLDER>`:
```bash
python download_fathomnet.py --processed_images_folder <PROCESSED_FATHOMNET_IMAGES_FOLDER>
```
### Frame Datasets

These datasets correspond to annotations for individual frames coming from videos. The file `CONFIG_FRAMES.yaml` is used to unify the downloads for the datasets, as explained below.

Before following the other dataset steps, update `CONFIG_FRAMES.yaml` with the correct `path_annotations` path where the annotation files are.
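The config is plain YAML, so scripts would typically read it with PyYAML's `yaml.safe_load`. For illustration, here is a dependency-free sketch that handles only the flat `key: value` entries shown above (all values kept as strings):

```python
# Minimal stdlib-only reader for the flat "key: value" entries in
# CONFIG_FRAMES.yaml. In practice, prefer yaml.safe_load from PyYAML.
def read_flat_yaml(text: str) -> dict:
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config

sample = """
path_annotations: /data/saco_frames_test_sets/annotations/
update_annotation_yt1b: true  # re-filter annotations for missing videos
num_images_show: 5
"""
cfg = read_flat_yaml(sample)
print(cfg["path_annotations"])
```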
#### DROID

The processed frames needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/droid-cfual/), OR follow the steps below to prepare them yourself.

1. Install the gsutil package:
```bash
pip install gsutil
```
2. Modify the `droid_path` variable in `CONFIG_FRAMES.yaml`. This is the path where the DROID data will be downloaded.
3. _\[Optional\]_ Update the variable `remove_downloaded_videos_droid` to (not) remove the videos after the frames have been extracted.
4. Download the data:
```bash
python download_videos.py droid
```
5. Extract the frames:
```bash
python extract_frames.py droid
```

See the [DROID website](https://droid-dataset.github.io/droid/the-droid-dataset#-using-the-dataset) for more information.
#### SA-V

The processed frames needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/sa-v), OR follow the steps below to prepare them yourself.

1. Follow the instructions on the [Segment Anything official website](https://ai.meta.com/datasets/segment-anything-video-downloads/) to obtain access to the download links (they are dynamic links).
2. Update `CONFIG_FRAMES.yaml`:
   - Update the `sav_path` variable, where the frames will be saved.
   - Update the `sav_videos_fps_6_download_path` variable: copy in the path corresponding to `videos_fps_6.tar` from the list you obtained in step 1.
   - _\[Optional\]_ Update the variable `remove_downloaded_videos_sav` to (not) remove the videos after the frames have been extracted.
3. Download the videos:
```bash
python download_videos.py sav
```
4. Extract the frames:
```bash
python extract_frames.py sav
```
#### Ego4D

The processed frames needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/ego4d-w7fiu/), OR follow the steps below to prepare them yourself.

1. Review and accept the license agreement on the [official Ego4D website](https://ego4d-data.org/docs/start-here/#license-agreement).
2. Configure AWS credentials. Run:
```bash
pip install awscli
aws configure
```
and copy the values shown in the email you received after step 1 (you can leave "region name" and "output format" empty). You can verify that the variables were set up correctly:
```bash
cat ~/.aws/credentials
```
3. Install the Ego4D library:
```bash
pip install ego4d
```
4. Update `CONFIG_FRAMES.yaml`:
   - Set up AWS credentials following the instructions in the email you received after step 1. Modify the following variables: `aws_access_key_id` and `aws_secret_access_key`.
   - Update the `ego4d_path` variable, where the frames will be saved.
   - _\[Optional\]_ Update the variable `remove_downloaded_videos_ego4d` to (not) remove the videos after the frames have been extracted.
5. Download the `clips` subset of the Ego4D dataset:
```bash
python download_videos.py ego4d
```
6. Extract the frames:
```bash
python extract_frames.py ego4d
```

See the [official CLI](https://ego4d-data.org/docs/CLI/) and the [explanation about the videos](https://ego4d-data.org/docs/data/videos/) for more information.
|
||||
|
||||
#### YT1B

The processed frames needed for evaluation can be downloaded from [Roboflow](https://universe.roboflow.com/sa-co-silver/yt-temporal-1b/), or follow the steps below to prepare them.

1. Install the yt-dlp library:
   ```bash
   python3 -m pip install -U "yt-dlp[default]"
   ```
2. Create a `cookies.txt` file following the yt-dlp instructions in [exporting-youtube-cookies](https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies) and [pass-cookies-to-yt-dlp](https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp). This is required to download YouTube videos. Then set the `cookies_path` variable in `CONFIG_FRAMES.yaml` to the path of that file.
3. Update `CONFIG_FRAMES.yaml`:
   - Update the `yt1b_path` variable, where the frames will be saved.
   - _\[Optional\]_ Some videos may no longer be available on YouTube. Set `update_annotation_yt1b` to `True` in `CONFIG_FRAMES.yaml` to remove the annotations corresponding to such videos. Note that the resulting evaluations will not be directly comparable with other reported evaluations.
   - _\[Optional\]_ Update the `remove_downloaded_videos_yt1b` variable to control whether the videos are removed after the frames have been extracted.
4. Download the videos:
   ```bash
   python download_videos.py yt1b
   ```
5. Extract the frames:
   ```bash
   python extract_frames.py yt1b
   ```
# Usage

## Visualization

- Visualize GT annotations: [saco_gold_silver_vis_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_gold_silver_vis_example.ipynb)

## Run evaluation

The official metric for SA-Co/Silver is cgF1. Please refer to the SAM3 paper for details.
Unlike Gold, the Silver subsets have only a single annotation per image. Performance may therefore be underestimated: the model can be wrongly penalized for choosing an interpretation that is valid but different from the human annotator's.
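As a quick sanity check on the reported numbers, the per-subset scores in the results table are consistent with cgF1 being the product of IL_MCC and pmF1. This decomposition is an assumption based on the SAM3 paper, and `cgf1` below is an illustrative helper, not part of the codebase:

```python
def cgf1(il_mcc: float, pmf1: float) -> float:
    """Classification-gated F1: positive-set mask quality (pmF1, in %)
    gated by image-level classification quality (IL_MCC, in [-1, 1])."""
    return il_mcc * pmf1

# SAM3 on BDD100k from the results table: 0.78 * 60.13 ~= 46.9 (reported: 46.61)
print(round(cgf1(0.78, 60.13), 2))
```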
### Evaluate SAM3

We provide inference configurations to reproduce the evaluation of SAM3.
First, edit [eval_base.yaml](https://github.com/facebookresearch/sam3/blob/main/sam3/train/configs/eval_base.yaml) with the paths where you downloaded the images and annotations above.

There are 10 subsets, and as many configurations to run.
Taking the first subset as an example, inference can be run locally with the following command (you can adjust the number of GPUs):
```bash
python sam3/train/train.py -c configs/silver_image_evals/sam3_gold_image_bdd100k.yaml --use-cluster 0 --num-gpus 1
```
The predictions will be dumped in the folder specified in `eval_base.yaml`.
We also provide support for SLURM-based cluster inference. Edit `eval_base.yaml` to reflect your SLURM configuration (partition, QOS, ...), then run:

```bash
python sam3/train/train.py -c configs/silver_image_evals/sam3_gold_image_bdd100k.yaml --use-cluster 1
```
### Offline evaluation

If you have predictions in the COCO result format (see [here](https://cocodataset.org/#format-results)), we provide scripts to easily run the evaluation.
||||
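For reference, a COCO result file is a JSON list with one entry per predicted mask. A minimal illustrative sketch (field values are made up; the file name matches the one used in the standalone example):

```python
import json

# One entry per predicted mask. "image_id" refers to the image-NP pair id in
# the GT "images" list, "category_id" is always 1 in this benchmark, and
# "segmentation" is a COCO RLE dict (the counts string here is a placeholder).
predictions = [
    {
        "image_id": 10000000,
        "category_id": 1,
        "segmentation": {"size": [720, 1280], "counts": "<RLE string>"},
        "score": 0.93,
    }
]

with open("coco_predictions_segm.json", "w") as f:
    json.dump(predictions, f)
```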
For an example of how to run the evaluator on all subsets and aggregate the results, see the following notebook: [saco_gold_silver_eval_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_gold_silver_eval_example.ipynb)

If you have a prediction file for a given subset, you can run the evaluator on just that subset using the standalone script. Example:
```bash
python scripts/eval/standalone_cgf1.py --pred_file /path/to/coco_predictions_segm.json --gt_files /path/to/annotations/silver_bdd100k_merged_test.json
```
# Results

<table style="border-color:black;border-style:solid;border-width:1px;border-collapse:collapse;border-spacing:0;text-align:right" class="tg"><thead>
<tr style="text-align:center">
<th></th>
<th colspan="3">Average</th>
<th colspan="3">BDD100k</th>
<th colspan="3">Droids</th>
<th colspan="3">Ego4d</th>
<th colspan="3">Food Rec</th>
<th colspan="3">Geode</th>
<th colspan="3">iNaturalist</th>
<th colspan="3">Nga Art</th>
<th colspan="3">SAV</th>
<th colspan="3">YT1B</th>
<th colspan="3">Fathomnet</th>
</tr></thead>
<tbody>
<tr>
<td></td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
<td>cgF1</td> <td>IL_MCC</td> <td>pmF1</td>
</tr>
<tr>
<td>gDino-T</td> <td>3.09</td> <td>0.12</td> <td>19.75</td> <td>3.33</td> <td>0.17</td> <td>19.54</td> <td>4.26</td> <td>0.15</td> <td>28.38</td> <td>2.87</td> <td>0.1</td>
<td>28.72</td> <td>0.69</td> <td>0.05</td> <td>13.88</td> <td>9.61</td> <td>0.24</td> <td>40.03</td> <td>0</td> <td>0</td> <td>1.97</td> <td>1.31</td> <td>0.09</td>
<td>14.57</td> <td>5.18</td> <td>0.19</td> <td>27.25</td> <td>3.6</td> <td>0.16</td> <td>22.5</td> <td>0</td> <td>0</td> <td>0.64</td>
</tr>
<tr>
<td>OWLv2*</td> <td>11.23</td> <td>0.32</td> <td>31.18</td> <td>14.97</td> <td>0.46</td> <td>32.34</td> <td>10.84</td> <td>0.36</td> <td>30.1</td> <td>7.36</td> <td>0.23</td>
<td>31.99</td> <td>19.35</td> <td>0.44</td> <td>43.98</td> <td>27.04</td> <td>0.5</td> <td>54.07</td> <td>3.92</td> <td>0.14</td> <td>27.98</td> <td>8.05</td> <td>0.31</td>
<td>25.98</td> <td>10.59</td> <td>0.32</td> <td>33.1</td> <td>10.15</td> <td>0.38</td> <td>26.7</td> <td>0.04</td> <td>0.01</td> <td>5.57</td>
</tr>
<tr>
<td>OWLv2</td> <td>8.18</td> <td>0.23</td> <td>32.55</td> <td>8.5</td> <td>0.31</td> <td>27.79</td> <td>7.21</td> <td>0.25</td> <td>28.84</td> <td>5.64</td> <td>0.18</td>
<td>31.35</td> <td>14.18</td> <td>0.32</td> <td>44.32</td> <td>13.04</td> <td>0.28</td> <td>46.58</td> <td>3.62</td> <td>0.1</td> <td>36.23</td> <td>7.22</td> <td>0.25</td>
<td>28.88</td> <td>10.86</td> <td>0.32</td> <td>33.93</td> <td>11.7</td> <td>0.35</td> <td>33.43</td> <td>-0.14</td> <td>-0.01</td> <td>14.15</td>
</tr>
<tr>
<td>LLMDet-L</td> <td>6.73</td> <td>0.17</td> <td>28.19</td> <td>1.69</td> <td>0.08</td> <td>19.97</td> <td>2.56</td> <td>0.1</td> <td>25.59</td> <td>2.39</td>
<td>0.08</td> <td>29.92</td> <td>0.98</td> <td>0.06</td> <td>16.26</td> <td>20.82</td> <td>0.37</td> <td>56.26</td> <td>27.37</td> <td>0.46</td> <td>59.5</td>
<td>2.17</td> <td>0.13</td> <td>16.68</td> <td>5.37</td> <td>0.19</td> <td>28.26</td> <td>3.73</td> <td>0.16</td> <td>23.32</td> <td>0.24</td> <td>0.04</td> <td>6.1</td>
</tr>
<tr>
<td>Gemini 2.5</td> <td>9.67</td> <td>0.19</td> <td>45.51</td> <td>5.83</td> <td>0.19</td> <td>30.66</td> <td>5.61</td> <td>0.14</td> <td>40.07</td>
<td>0.38</td> <td>0.01</td> <td>38.14</td> <td>10.92</td> <td>0.24</td> <td>45.52</td> <td>18.28</td> <td>0.26</td> <td>70.29</td> <td>26.57</td> <td>0.36</td>
<td>73.81</td> <td>8.18</td> <td>0.2</td> <td>40.91</td> <td>9.48</td> <td>0.22</td> <td>43.1</td> <td>8.66</td> <td>0.23</td> <td>37.65</td> <td>2.8</td>
<td>0.08</td> <td>34.99</td>
</tr>
<tr>
<td>SAM3</td> <td>49.57</td> <td>0.76</td> <td>65.17</td> <td>46.61</td> <td>0.78</td> <td>60.13</td> <td>45.58</td> <td>0.76</td>
<td>60.35</td> <td>38.64</td> <td>0.62</td> <td>62.56</td> <td>52.96</td> <td>0.79</td> <td>67.21</td> <td>70.07</td> <td>0.89</td>
<td>78.73</td> <td>65.8</td> <td>0.82</td> <td>80.67</td> <td>38.06</td> <td>0.66</td> <td>57.62</td> <td>44.36</td> <td>0.67</td>
<td>66.05</td> <td>42.07</td> <td>0.72</td> <td>58.36</td> <td>51.53</td> <td>0.86</td> <td>59.98</td>
</tr>
</tbody></table>
# Annotation format

The annotation format is derived from the [COCO format](https://cocodataset.org/#format-data). Notable data fields are:

- `images`: a `list` of `dict` features, listing all image-NP pairs. Each entry corresponds to one image-NP pair and has the following items:
  - `id`: an `int` feature, the unique identifier for the image-NP pair
  - `text_input`: a `string` feature, the noun phrase for the image-NP pair
  - `file_name`: a `string` feature, the relative image path in the corresponding data folder
  - `height`/`width`: dimensions of the image
  - `is_instance_exhaustive`: Boolean (0 or 1). If 1, all instances are correctly annotated. For instance segmentation we only use those datapoints; otherwise there may be missing instances or crowd segments (a segment covering multiple instances).
  - `is_pixel_exhaustive`: Boolean (0 or 1). If 1, the union of all masks covers all pixels corresponding to the prompt. This is weaker than `is_instance_exhaustive`, since it allows crowd segments. It can be used for semantic segmentation evaluations.

- `annotations`: a `list` of `dict` features, containing all annotations, including bounding box, segmentation mask, area, etc.
  - `image_id`: an `int` feature, mapping to the identifier of the image-NP pair in `images`
  - `bbox`: a `list` of `float` features, the bounding box in `[x,y,w,h]` format, normalized by the image dimensions
  - `segmentation`: a `dict` feature, the segmentation mask in RLE format
  - `category_id`: for compatibility with the COCO format; always 1 and unused
  - `iscrowd`: Boolean (0 or 1). If 1, the segment overlaps several instances (used when instances are not separable, e.g. due to poor image quality)

- `categories`: a `list` of `dict` features, containing all categories. The category key is provided for compatibility with the COCO format, but open-vocabulary detection does not use it; instead, the text prompt is stored directly in each image (`text_input` in `images`). Note that in our setting, a unique image (`id` in `images`) actually corresponds to an (image, text prompt) combination.
We refer to an `id` in `images` that has corresponding annotations (i.e. it appears as an `image_id` in `annotations`) as a "positive" NP, and to an `id` in `images` without any annotations as a "negative" NP.
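The positive/negative distinction above can be sketched in a few lines (`split_positive_negative` is an illustrative helper, not part of the codebase):

```python
def split_positive_negative(ann):
    """Split image-NP pair ids into "positive" (has at least one mask)
    and "negative" (no masks) NPs, per the definition above."""
    annotated = {a["image_id"] for a in ann["annotations"]}
    positives = [img["id"] for img in ann["images"] if img["id"] in annotated]
    negatives = [img["id"] for img in ann["images"] if img["id"] not in annotated]
    return positives, negatives

# Tiny illustrative example (not real data):
ann = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"image_id": 1}],
}
print(split_positive_negative(ann))  # ([1], [2])
```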
A sample annotation from the DROID domain looks as follows:

#### images

```json
[
  {
    "id": 10000000,
    "file_name": "AUTOLab_failure_2023-07-07_Fri_Jul__7_18:50:36_2023_recordings_MP4_22008760/00002.jpg",
    "text_input": "the large wooden table",
    "width": 1280,
    "height": 720,
    "queried_category": "3",
    "is_instance_exhaustive": 1,
    "is_pixel_exhaustive": 1
  }
]
```

#### annotations

```json
[
  {
    "area": 0.17324327256944444,
    "id": 1,
    "image_id": 10000000,
    "source": "created by SAM3",
    "bbox": [
      0.03750000149011612,
      0.5083333253860474,
      0.8382812738418579,
      0.49166667461395264
    ],
    "segmentation": {
"counts": "[^R11]f03O0O100O2N100O1O100O100O100O100O1O100O100O100O100O100O1O10000O1O10000O1O100O10000O1O100O100O100O100O100O100O100O100O100O100O1O100O100O10000O100O100O100O101N100O1O011O0O1O101OO0010O100O1O100O2OO0100O100O100O100O100O10000O100O100O1O100O10000O1O100O100O100O10000O1O100O100O100O10000O1O10000O1O100O100O100O100O100O100O1O100O100O100O100O100O100O100O100O100O100O100O100O100O100O10000O100O100O1O100O10000O100O100O100O100O1O100O100O100O100O100O100O10O0100O100O2O000O1O10000O1O10000O100O100O100O1O100O100O100O100O100O100O100O100O100O100O100O100O1O100O100O100O10000O100O100O100O100O100O100O100O100O100O100O100O100O100O10000O100O100O100O100O100O100O1O10000O1O10000O100O1O100O100O100O100O100O100O100O100O10000O1O100O100O100O100O1O10000O10\\MP@hNo?W1U@gNk?X1W@gNh?Y1Z@fNf?Y1\\@fNc?[1^@dNb?[1`@dN_?]1b@bN^?]1e@aNZ?_1i@_NW?a1l@\\NS?d1RAXNn>h1TAVNk>k1VATNj>k1XATNg>m1YASNg>m1YASNf>m1[ASNe>m1[ASNd>m1]ASNc>m1]ASNb>l1`ATN`>i1cAWN\\>d1jA\\NV>_1oAaNP>^1RBbNn=\\1TBdNk=\\1VBdNj=1`@dNGO02P2Z1h=L_AfNj0^1g=FmC;R<EoC;Q<DPD<o;DRD<n;DQD=n;DjAnN?^1g=DhAQO?\\1h=DhAUO<W1l=EeAZO:R1P>F]ABa0h0Q>Hd@lNDV1e17S>k1iAWNW>i1hAXNW>j1gAWNY>i1fAXNY>j1eAWNZ>k1dAVN\\>k1bAVN^>k1`AVN_>l1`ATN`>m1^ATNa>o1]AQNc>P2[AQNd>P2\\APNd>Q2[AoMd>R2[AoMd>R2\\AnMd>S2ZAnMe>S2[AmMe>T2YAmMf>T2YAmMg>T2WAmMh>U2VAlMj>U2TAlMl>U2PAnMo>U2j@PNV?e4O100O100O100O100O100O100O100O100O100O100O100O100O101N100O100O10O0100O100O100O100O100O100O1000000O1000000O100O100O1O1O1O100O100O1O100O100O100O100O100O100O100O100O100O1O100O100O100O100O100O10000O100O1O100O100O100O100O100O100OkK_B]Oa=7oBEP=4YCKg<1^CNa<1bCN^<OeC1[<LhC4W<KlC4S<KoC5Q<JPD6o;JRD6n;JSD5l;LTD4l;LTD4k;MUD3k;MUD4j;LWD2i;OWD1i;OWD1h;0XD0h;1WDOh;2XDOg;1ZDNe;3[DMe;3[DNc;3]DLd;4\\DLc;5]DKb;7]DIc;7^DHa;9_DGa;9_DG`;:`DF`;;_DE`;<`DCa;=^DDa;=_DC`;>_DCa;>^DBb;[OUCiMW1n2c;YO[CeMn0V3g;TO^CeMf0[3k;POaCdM>b3Q<iNbCfM7f3V<dNeCeMKQ4`<YNgCfMAX4g<RNiCk2W<SMlCl2S<TMnCl2R<SMoCm2Q<RMQDm2n;TMRDl2n;SMTDl2k;UMUDk2k;UMVDj2i;VMXDj2h;VMXDj2g;VM[Di2e;VM\\Dj2c;VM^Dj2b;TMaDk2^;PMhDP3X;aL`CjM`1e5o:\\L^Ed3b:WLdEh3[:n
KPFR4P:jKTFV4k9hKXFX4h9hKXFX4g9hKYFY4f9hKZFX4f9hKZFX4e9iKZFW4g9iKXFX4g9iKPElN\\O\\5c;iKeDYOEo4f;iK]DAJh4g;iKTDJ3^4i;jKkCO;X4i;hMVDX2j;hMUDY2j;iMUDW2k;iMTDW2l;kMSDU2m;kMRDV2m;lMRDT2n;mMPDT2P<mMoCS2P<oMnCR2R<V4O100O100OiInCR2Q<kMWDQ2i;kM_DQ2`;lMoDi1Q;TNWEg1h:XN^Ed1a:\\NdE`1\\:^NjE^1U:aNPF]1o9aNUF]1k9bNXF\\1g9dN]FY1c9fN`FX1_9hNdFV1\\9iNhFT1W9lNmFQ1S9nNQGo0n8QOTGn0l8ROWGk0h8UO[Gi0e8VO^Gh0a8YO`Gf0`8YOcGe0\\8\\OeGc0[8\\OiGa0V8@lG>T8AnG>Q8BQH=o7CRH<m7DVH:j7FWH9h7HYH7g7H[H7d7J^H4b7L^H4b7K`H4_7MbH2^7NcH1\\7OfH0Z70gHOX72iHMW73jHLV74jHLU74mHKS75mHKS75nHJR76oHIQ77oHIR7jMkDP1U4U1S7RM_D0h0g1f3W1^8hNcGV1_8iNaGX1_8gNaGY1`8fNaGY1_8gNaGY1`8fNaGY1_8gNaGY1`8fNaGY1_8gNaGY1`8fNaGY1_8gNaGY1_8gNaGY1_8gNbGX1_8gNaGY1_8gNaGY1_8fNbGY1`8fNaGY1_8gNaGY1_8gNaGY1_8gNaGY1_8gNbGX1^8hNbGX1^8hNbGX1^8hNbGX1^8hNbGX1^8iNbGV1^8jNbGV1^8jNbGV1^8jNbGV1^8jNbGV1^8jNbGV1^8jNbGV1]8lNbGT1^8lNcGS1\\8nNdGR1\\8nNdGR1[8oNeGQ1Z8POfGP1X8SOhGl0W8UOiGk0U8WOkGi0S8YOmGg0P8\\OPHd0n7_ORH`0l7BTH>j7DVH<g7HYH7d7L\\H4b7N^H2`71_HO^74bHL[77eHIY7:fHFX7<hHDV7>jHBT7a0kH_OT7b0mH]OR7d0nH\\OQ7f0nH]OQ7g0oHZOQ7g0oHYOQ7h0nHXOR7h0nHXOR7h0nHXOR7i0mHWOT7h0kHYOU7h0jHXOV7h0iHYOW7g0iHYOW7h0hHXOY7g0fHZOZ7f0eH[O\\7e0cHhNlKSNa;U3bHeNSLTN\\;W3_HbN]LRNU;\\3]H^Nb8c1\\G\\Ng8c1XG\\Nj8e1TGZNo8e1PGYNS9h1lFUNW9l1gFRN]9m1bFRN`9o1^FPNe9o1[FoMg9R2WFnMj9S2TFmMn9R2RFnMn9S2PFmMR:R2nEmMS:T2kEmMU:T2jEkMX:T2gEmMY:T2fElMZ:U2dEkM^:T2aEmM_:T2`ElM`:U2^ElMc:S2\\EmMe:T2YEmMg:T2WEmMj:S2UEmMk:T2SEmMn:S2PEnMP;S2nDoMQ;R2mDoMT;Q2kDoMU;R2iDoMX;Q2fDQNY;P2eDQN[;P2cDQN^;o1`DSN_;n1^DTNc;l1[DVNd;k1ZDVNg;j1WDXNh;j1UDWNk;j1SDWNn;i1oCZNP<h1mCYNS<h1kCZNU<g1gC\\NX<e1fC\\N[<d1cC^N\\<d1aC^N_<c1^C_Na<b1\\CaNc<a1ZCaNf<_1XCcNg<_1UCeNj<^1oBfNP=]1iBiN?gL^;e4hCkNf0dLb;`8YDcGg;^8VDdGk;^8mChGR<_8bCfG_<U900001N101O00001O001O00001O00001O0O2N1O1O2N1O2N100O2N1O1O2N1O2N1O1O2N1O2M200O2M2O2N1N2O2N1N3N1O1N3N1N3M2O2kMkAkKW>Q4RBiKo=8^AR2j0`Mk=:aAP2i0bMh==eAj1g0eMf=?hAh1f0eMd=?lAg1c0gMc=`0nAe1c0hMa=a0oAd1b0iM`=a0QBc1c0iM]=c0SB`1d0iM\\=e0SB^1e0jMY=g0VB[1e0jMV=k0WBW1V`0gNn_OT1T`0lNo_Oo0S`0POS@i0P`0VOT@d0n?\\OT
@`0n?@T@<o?CR@^OUN6ka0=P@XO\\N6ga0a0j@WOY?i0X3O001O00010O00001O0010O0001O00010O001O00001O001O01O01O00001O001O000O2O0O2O0O2N1O2N1O2M3MYl51fSJ3L3O1O100O1O100000000001O000000001O00000000001O01OO1000000000001O000001O000O10000000000000000O10000O10000O10000O100O1O100O1O1O1O1O1O1N2O1O1O1O1O1O1O1O1O1O1O1O1O1O1O1O1N2O1O1O1O1O1O1O100O100N21O00001O001O2N1O1O2N1O2N1O2M3N4IVT_3",
      "size": [
        720,
        1280
      ]
    },
    "category_id": 1,
    "iscrowd": 0
  }
]
```
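Since `bbox` (like `area`) is normalized by the image dimensions, converting a box back to pixel coordinates is a simple rescale; a minimal sketch (`denormalize_bbox` is an illustrative helper, not part of the codebase):

```python
def denormalize_bbox(bbox, width, height):
    """Convert a normalized [x, y, w, h] box to pixel coordinates."""
    x, y, w, h = bbox
    return [x * width, y * height, w * width, h * height]

# Using the sample annotation above (a 1280x720 image):
print(denormalize_bbox([0.0375, 0.5083, 0.8383, 0.4917], 1280, 720))
```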
### Data Stats

Here are the stats for the 10 annotation domains. "# Image-NPs" is the total number of unique image-NP pairs, including both "positive" and "negative" NPs.

| Domain                    | # Image-NPs | # Image-NP-Masks |
|---------------------------|-------------|------------------|
| BDD100k                   | 5546        | 13210            |
| DROID                     | 9445        | 11098            |
| Ego4D                     | 12608       | 24049            |
| MyFoodRepo-273            | 20985       | 28347            |
| GeoDE                     | 14850       | 7570             |
| iNaturalist-2017          | 1439051     | 48899            |
| National Gallery of Art   | 22294       | 18991            |
| SA-V                      | 18337       | 39683            |
| YT-Temporal-1B            | 7816        | 12221            |
| Fathomnet                 | 287193      | 14174            |
62
scripts/eval/silver/download_fathomnet.py
Normal file
@@ -0,0 +1,62 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import json
from multiprocessing import Pool
from pathlib import Path

import requests
from fathomnet.api import images
from tqdm import tqdm


def download_imgs(args, image_uuids):
    num_downloaded = 0
    for uuid in tqdm(image_uuids, desc="Downloading images"):
        image = images.find_by_uuid(uuid)
        file_name = (
            Path(args.processed_images_folder)
            / f"{image.uuid}.{image.url.split('.')[-1]}"
        )
        if not file_name.exists():
            try:
                resp = requests.get(image.url, stream=True)
                resp.raise_for_status()
                with open(file_name, "wb") as f:
                    for chunk in resp.iter_content(chunk_size=1024):
                        f.write(chunk)
                num_downloaded += 1
            except requests.exceptions.RequestException as e:
                print(f"Error downloading {image.url}: {e}")
    print(f"Downloaded {num_downloaded} new images to {args.processed_images_folder}")


def main():
    parser = argparse.ArgumentParser(description="Download images from FathomNet")
    parser.add_argument("--processed_images_folder", help="Path to downloaded images")
    parser.add_argument(
        "--image-uuids",
        default="fathomnet_image_uuids.json",
        help="Path to JSON file containing image uuids to download",
    )
    parser.add_argument(
        "--num-procs", type=int, default=16, help="Number of parallel processes"
    )
    args = parser.parse_args()

    with open(args.image_uuids, "r") as f:
        all_uuids = json.load(f)

    Path(args.processed_images_folder).mkdir(parents=True, exist_ok=True)

    # Guard against chunk_size == 0 when there are fewer uuids than processes
    chunk_size = max(1, len(all_uuids) // args.num_procs)
    chunks = [
        all_uuids[i : i + chunk_size] for i in range(0, len(all_uuids), chunk_size)
    ]

    with Pool(processes=args.num_procs) as pool:
        pool.starmap(download_imgs, [(args, chunk) for chunk in chunks])


if __name__ == "__main__":
    main()
81
scripts/eval/silver/download_inaturalist.py
Normal file
@@ -0,0 +1,81 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import json
import shutil
import subprocess
import sys
import tarfile
from pathlib import Path

from tqdm import tqdm


def download_archive(url, dest_dir):
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / url.split("/")[-1]
    if not archive_path.exists():
        print(f"Downloading archive to {archive_path}...")
        result = subprocess.run(["wget", "-O", str(archive_path), url])
        if result.returncode != 0:
            print("Download failed.")
            sys.exit(1)
    else:
        print(f"Archive already exists at {archive_path}")
    return archive_path


def extract_archive(archive_path, dest_dir):
    print(f"Extracting {archive_path} to {dest_dir}...")
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)
    print("Extraction complete.")


def copy_images(subset_json, untar_dir, output_dir):
    with open(subset_json, "r") as f:
        image_dict = json.load(f)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for target_name, rel_path in tqdm(image_dict.items(), "Copying image subset"):
        src = Path(untar_dir) / rel_path
        dst = output_dir / target_name
        if not src.exists():
            print(f"Warning: Source image {src} does not exist, skipping.")
            continue
        shutil.copy2(src, dst)
    print(f"Copied {len(image_dict)} images to {output_dir}")


def main():
    parser = argparse.ArgumentParser(
        description="Download, extract, and copy a subset of iNaturalist images from the archive."
    )
    parser.add_argument(
        "--raw_images_folder", help="Path where the archive is downloaded and extracted"
    )
    parser.add_argument("--processed_images_folder", help="Path to processed images")
    parser.add_argument(
        "--subset-json",
        default="inaturalist_image_subset.json",
        help="Path to iNaturalist images subset",
    )
    parser.add_argument(
        "--archive-url",
        default="https://ml-inat-competition-datasets.s3.amazonaws.com/2017/train_val_images.tar.gz",
        help="URL of the archive to download",
    )
    args = parser.parse_args()

    dest_dir = Path(args.raw_images_folder)
    images_dir = Path(args.processed_images_folder)

    archive_path = download_archive(args.archive_url, dest_dir)
    extract_archive(archive_path, dest_dir)

    untar_dir = dest_dir / "train_val_images"
    copy_images(args.subset_json, untar_dir, images_dir)


if __name__ == "__main__":
    main()
140
scripts/eval/silver/download_preprocess_nga.py
Normal file
@@ -0,0 +1,140 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import os
from functools import partial
from multiprocessing import Pool
from pathlib import Path

import numpy as np
import pandas as pd
import requests
import utils
from PIL import Image
from tqdm import tqdm

METADATA_FILE = "published_images.csv"
METADATA_URL = "https://raw.githubusercontent.com/NationalGalleryOfArt/opendata/refs/heads/main/data"  # data/published_images.csv from https://github.com/NationalGalleryOfArt/opendata/tree/main
IMG_URL = "https://api.nga.gov/iiif/%s/full/%s/0/default.jpg"
METADATA_FOLDER = "metadata"
EXTENSION = ".jpg"


def download_metadata(annotation_folder):
    output_folder = annotation_folder / METADATA_FOLDER
    output_folder.mkdir(exist_ok=True)
    url = f"{METADATA_URL}/{METADATA_FILE}"
    print(url)
    response = requests.get(url)
    if response.status_code == 200:
        with open(output_folder / METADATA_FILE, "wb") as f:
            f.write(response.content)


def download_url(row):
    if np.isnan(row.maxpixels) or (
        row.maxpixels > row.width and row.maxpixels > row.height
    ):
        url = IMG_URL % (row.uuid, "full")
    else:
        url = IMG_URL % (row.uuid, f"!{row.maxpixels},{row.maxpixels}")
    return url


def download_item(item, output_folder):
    uuid, url = item
    try:
        if (output_folder / f"{uuid}{EXTENSION}").exists():
            print("skipping", uuid, "already downloaded")
            return
        response = requests.get(url)
        if response.status_code == 200:
            with open(output_folder / f"{uuid}{EXTENSION}", "wb") as f:
                f.write(response.content)
    except Exception:
        print("errored", item)
        return


def remove_non_compliant_image(item, output_folder):
    uuid, max_pixels = item
    if np.isnan(max_pixels):
        return
    if not (output_folder / f"{uuid}{EXTENSION}").exists():
        return
    img = Image.open(output_folder / f"{uuid}{EXTENSION}")
    if img.width > max_pixels or img.height > max_pixels:
        os.remove(output_folder / f"{uuid}{EXTENSION}")  # delete image
        return uuid


def reshape_image(rel_path, filename_size_map, output_folder):
    w, h = filename_size_map[rel_path]
    path = output_folder / f"{rel_path}"
    img = Image.open(path)
    if img.width != w or img.height != h:
        new_size = (w, h)
        resized_img = img.resize(new_size)
        resized_img.save(path)


def main(args, workers=20):
    raw_folder = Path(args.raw_images_folder)
    processed_folder = Path(args.processed_images_folder)
    utils.setup(raw_folder)
    utils.setup(processed_folder)
    uuids = utils.get_image_ids(args.annotation_file)
    filename_size_map = utils.get_filename_size_map(args.annotation_file)
    if not ((raw_folder / METADATA_FOLDER) / METADATA_FILE).exists():
        download_metadata(raw_folder)

    metadata = pd.read_csv((raw_folder / METADATA_FOLDER) / METADATA_FILE)
    metadata["download_url"] = metadata.apply(download_url, axis=1)
    available_uuids = list(uuids.intersection(set(metadata["uuid"].tolist())))
    print(len(available_uuids), "available for download out of", len(uuids), "target")
    url_data = list(
        metadata.set_index("uuid")
        .loc[available_uuids]
        .to_dict()["download_url"]
        .items()
    )

    download_single = partial(download_item, output_folder=processed_folder)

    print("Preparing to download", len(url_data), "items")
    with Pool(workers) as p:
        for _ in tqdm(p.imap(download_single, url_data), total=len(url_data)):
            continue
    check_img_size = partial(
        remove_non_compliant_image, output_folder=processed_folder
    )
    max_pixels_dict_all = metadata.set_index("uuid").to_dict()["maxpixels"]
    max_pixels_dict = {item[0]: max_pixels_dict_all[item[0]] for item in url_data}
    print("Checking all images within size constraints")
    non_compliant = set()
    with Pool(workers) as p:
        for each in tqdm(
            p.imap(check_img_size, max_pixels_dict.items()), total=len(max_pixels_dict)
        ):
            if each is not None:
                non_compliant.add(each)
    print(len(non_compliant), "not compliant size, removed")

    reshape_single = partial(
        reshape_image,
        filename_size_map=filename_size_map,
        output_folder=processed_folder,
    )
    rel_paths = os.listdir(args.processed_images_folder)
    print("Preparing to reshape", len(rel_paths), "items")
    with Pool(workers) as p:
        for _ in tqdm(p.imap(reshape_single, rel_paths), total=len(rel_paths)):
            continue


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--annotation_file", help="Path to annotation file")
    parser.add_argument("--raw_images_folder", help="Path to downloaded images")
    parser.add_argument("--processed_images_folder", help="Path to processed images")
    args = parser.parse_args()
    main(args)
260
scripts/eval/silver/download_videos.py
Normal file
@@ -0,0 +1,260 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
|
||||
import ast
|
||||
import concurrent.futures
|
||||
import os
|
||||
import shutil
|
||||
import subprocess
|
||||
import sys
|
||||
from concurrent.futures import as_completed, ThreadPoolExecutor
|
||||
from pathlib import Path
|
||||
|
||||
import yt_dlp
|
||||
|
||||
from utils import (
|
||||
annotation_files,
|
||||
config,
|
||||
load_json,
|
||||
run_command,
|
||||
save_json,
|
||||
update_annotations,
|
||||
)
|
||||
|
||||
|
||||
def construct_gcs_path(original_video):
|
||||
"""
|
||||
Convert original_video string to GCS path.
|
||||
Example:
|
||||
'AUTOLab_failure_2023-07-07_Fri_Jul__7_18:50:36_2023_recordings_MP4_22008760.mp4'
|
||||
->
|
||||
'gs://gresearch/robotics/droid_raw/1.0.1/AUTOLab/failure/2023-07-07/Fri_Jul__7_18:50:36_2023/recordings/MP4/22008760.mp4'
|
||||
"""
|
||||
parts = original_video.split("_")
|
||||
lab = parts[0]
|
||||
failure = parts[1]
|
||||
date = parts[2]
|
||||
time = "_".join(parts[3:-3])
|
||||
recordings = parts[-3]
|
||||
mp4 = parts[-2]
|
||||
file_id = parts[-1].split(".")[0]
|
||||
gcs_path = (
|
||||
f"gs://gresearch/robotics/droid_raw/1.0.1/"
|
||||
f"{lab}/{failure}/{date}/{time}/{recordings}/{mp4}/{file_id}.mp4"
|
||||
)
|
||||
return gcs_path
|
||||
|
||||
|
||||
def download_video(args):
|
||||
gcs_path, dst_dir, json_file = args
|
||||
# Ensure subdirectory exists
|
||||
subdir = Path(dst_dir)
|
||||
os.makedirs(subdir, exist_ok=True)
|
||||
# Save file with its original name inside the subdir
|
||||
print(json_file)
|
||||
local_path = subdir / json_file
|
||||
cmd = f'gsutil cp "{gcs_path}" "{local_path}"'
|
||||
print(f"Running: {cmd}")
|
||||
try:
|
||||
run_command(cmd)
|
||||
return (gcs_path, True, None)
|
||||
except Exception as e:
|
||||
return (gcs_path, False, str(e))
|
||||
|
||||
|
||||
def download_youtube_video(youtube_id, output_path=None):
|
||||
try:
|
||||
if output_path is None:
|
||||
output_path = os.path.join(
|
||||
config["yt1b_path"], "downloaded_videos", f"video_{youtube_id}.mp4"
|
||||
)
|
||||
url = f"https://www.youtube.com/watch?v={youtube_id}"
|
||||
if os.path.exists(output_path):
|
||||
return youtube_id, None
|
||||
format = "best[height<=720][fps<=30]/best[height<=720]/best" # 720p or lower, max 30fps
|
||||
ydl_opts = {
|
||||
"format": format,
|
||||
"outtmpl": output_path,
|
||||
"merge_output_format": "mp4",
|
||||
"quiet": True,
|
||||
"cookiefile": config["cookies_path"],
|
||||
"socket_timeout": 60, # Increase timeout to 60 seconds (default is 10)
|
||||
}
|
||||
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
|
||||
ydl.download([url])
|
||||
return youtube_id, None
|
||||
except Exception as e:
|
||||
return youtube_id, str(e)


def download_youtube():
    all_videos_to_download = set()
    for annotation_file in annotation_files["yt1b"]:
        ann = load_json(os.path.join(config["path_annotations"], annotation_file))
        for video_info in ann["images"]:
            youtube_id = video_info["original_video"]
            all_videos_to_download.add(youtube_id)

    videos_to_download_still = all_videos_to_download
    videos_downloaded = set()
    videos_unavailable = set()
    num_download_retries = 3
    for _ in range(num_download_retries):
        if len(videos_to_download_still) == 0:
            break
        videos_error = set()
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [
                executor.submit(download_youtube_video, youtube_id)
                for youtube_id in videos_to_download_still
            ]
            for future in concurrent.futures.as_completed(futures):
                youtube_id, exception = future.result()
                if exception is None:
                    videos_downloaded.add(youtube_id)
                elif "unavailable" in exception or "members-only" in exception:
                    videos_unavailable.add(youtube_id)
                else:
                    videos_error.add(youtube_id)
        videos_to_download_still = (
            all_videos_to_download - videos_downloaded - videos_unavailable
        )
        assert videos_to_download_still == videos_error

    if len(videos_unavailable) + len(videos_to_download_still) > 0:
        message = "Some videos are either no longer available on YouTube, or are set to private, or resulted in some other error. "
        if config["update_annotation_yt1b"]:
            message += "The unavailable videos will be ***REMOVED*** from the annotation file. This will make the test results NOT DIRECTLY COMPARABLE to other reported results."
            print(message)
            update_annotations("yt1b", videos_downloaded)
        else:
            message += "You may want to either re-try the download, or remove these videos from the evaluation json"
            print(message)


def download_droid():
    ann_dir = Path(config["path_annotations"])
    dst_dir = Path(config["droid_path"]) / "downloaded_videos"
    json_files = annotation_files["droid"]

    download_tasks = []
    original_videos = set()
    for json_file in json_files:
        json_path = ann_dir / json_file
        data = load_json(json_path)
        for img in data["images"]:
            original_video = img["original_video"]
            original_videos.add(original_video)

    print(f"Found {len(original_videos)} videos to download")
    for original_video in original_videos:
        gcs_path = construct_gcs_path(original_video)
        download_tasks.append((gcs_path, dst_dir, original_video))

    max_workers = min(16, len(download_tasks))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_task = {
            executor.submit(download_video, task): task for task in download_tasks
        }
        for future in as_completed(future_to_task):
            gcs_path, success, error = future.result()
            if not success:
                print(f"Failed to download {gcs_path}: {error}")


def download_ego4d():
    output_dir = os.path.join(config["ego4d_path"], "downloaded_videos")

    ann_dir = Path(config["path_annotations"])
    json_files = annotation_files["ego4d"]
    original_videos = set()
    for json_file in json_files:
        json_path = ann_dir / json_file
        data = load_json(json_path)
        for img in data["images"]:
            original_video = img["original_video"]
            original_videos.add(original_video)

    original_video_uids = [
        video_uid.replace(".mp4", "") for video_uid in original_videos
    ]
    video_ids_download = original_video_uids
    num_download_retries = 2
    download_correct = False
    message = ""
    for _ in range(num_download_retries):
        cmd = (
            [
                # "python", "-m", "ego4d.cli.cli",
                "ego4d",
                "--output_directory",
                output_dir,
                "--datasets",
                "clips",
                "--version",
                "v1",
                "--video_uids",
            ]
            + video_ids_download
            + ["--yes"]
        )

        # Run the command
        result = subprocess.run(cmd, capture_output=True, text=True)
        message = result.stderr
        if (
            "RuntimeError: The following requested video UIDs could not be found in the manifest for version:"
            in result.stderr
        ):
            not_findable_videos = ast.literal_eval(result.stderr.split("\n")[-2])
            video_ids_download = [
                video_uid
                for video_uid in video_ids_download
                if video_uid not in not_findable_videos
            ]
        else:
            download_correct = True
            break

    if not download_correct:
        print(f"There was an error downloading the Ego4D data: {message}")

    if len(video_ids_download) != len(original_video_uids):
        message = "Some videos are no longer available. "
        if config["update_annotation_ego4d"]:
            message += "The unavailable videos will be ***REMOVED*** from the annotation file. This will make the test results NOT DIRECTLY COMPARABLE to other reported results."
            print(message)
            update_annotations("ego4d", video_ids_download)
        else:
            message += "You may want to either re-try the download, or remove these videos from the evaluation json"
            print(message)
def download_sav():
    tar_url = config["sav_videos_fps_6_download_path"]
    tar_file = "videos_fps_6.tar"
    sav_data_dir = os.path.join(config["sav_path"], "downloaded_videos")
    os.makedirs(sav_data_dir, exist_ok=True)

    subprocess.run(["wget", tar_url, "-O", tar_file], cwd=sav_data_dir, check=True)
    subprocess.run(["tar", "-xvf", tar_file], cwd=sav_data_dir, check=True)
    subprocess.run(["rm", tar_file], cwd=sav_data_dir, check=True)

def main():
    assert len(sys.argv) > 1, "You have to provide the name of the dataset"
    dataset_name = sys.argv[1]
    assert (
        dataset_name in annotation_files
    ), f"The dataset can be one of {list(annotation_files.keys())}"

    if dataset_name == "yt1b":
        download_youtube()
    elif dataset_name == "droid":
        download_droid()
    elif dataset_name == "ego4d":
        download_ego4d()
    elif dataset_name == "sav":
        download_sav()


if __name__ == "__main__":
    main()
99
scripts/eval/silver/extract_frames.py
Normal file
@@ -0,0 +1,99 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
"""
This file extracts the frames for the frame datasets in SA-Co/Gold and Silver.

Call like:
> python extract_frames.py <dataset_name>
"""

import json
import os
import shutil
import sys
from multiprocessing import Pool

from PIL import Image
from tqdm import tqdm
from utils import (
    annotation_files,
    config,
    get_frame_from_video,
    is_valid_image,
    update_annotations,
)


def extract_frame(path_video, global_frame_idx, path_frame, image_size, file_name):
    frame = get_frame_from_video(path_video, global_frame_idx)
    os.makedirs(os.path.dirname(path_frame), exist_ok=True)
    img = Image.fromarray(frame)
    if frame.shape[:2] != image_size:
        print(f"Resizing image {file_name} from {frame.shape[:2]} to {image_size}")
        height, width = image_size
        img = img.resize((width, height))  # uses Pillow's default resampling filter
    img.save(path_frame)


def process_image(args):
    image, dataset_name, config = args
    original_video, global_frame_idx, file_name, image_size = image
    extra_subpath = ""
    if dataset_name == "ego4d":
        extra_subpath = "v1/clips"
    elif dataset_name == "yt1b":
        original_video = f"video_{original_video}.mp4"
    elif dataset_name == "sav":
        extra_subpath = "videos_fps_6"
    path_video = os.path.join(
        config[f"{dataset_name}_path"],
        "downloaded_videos",
        extra_subpath,
        original_video,
    )
    path_frame = os.path.join(config[f"{dataset_name}_path"], "frames", file_name)
    to_return = file_name
    try:
        extract_frame(path_video, global_frame_idx, path_frame, image_size, file_name)
        if not is_valid_image(path_frame):
            print(f"Invalid image in {path_frame}")
            to_return = None
    except Exception:
        print(f"Invalid image in {path_frame}")
        to_return = None
    return to_return


def main():
    assert len(sys.argv) > 1, "You have to provide the name of the dataset"
    dataset_name = sys.argv[1]
    assert (
        dataset_name in annotation_files
    ), f"The dataset can be one of {list(annotation_files.keys())}"
    all_outputs = []
    for file in annotation_files[dataset_name]:
        with open(os.path.join(config["path_annotations"], file), "r") as f:
            annotation = json.load(f)
        images = annotation["images"]
        images = set(
            (
                image["original_video"],
                image["global_frame_idx"],
                image["file_name"],
                tuple(image["image_size"]),
            )
            for image in images
        )
        args_list = [(image, dataset_name, config) for image in images]
        with Pool(os.cpu_count()) as pool:
            outputs = list(
                tqdm(pool.imap_unordered(process_image, args_list), total=len(images))
            )
        all_outputs.extend(outputs)
    if any(out is None for out in all_outputs):
        update_annotations(dataset_name, all_outputs, key="file_name")
    if config[f"remove_downloaded_videos_{dataset_name}"]:
        shutil.rmtree(os.path.join(config[f"{dataset_name}_path"], "downloaded_videos"))


if __name__ == "__main__":
    main()
9224
scripts/eval/silver/fathomnet_image_uuids.json
Normal file
File diff suppressed because it is too large
46423
scripts/eval/silver/inaturalist_image_subset.json
Normal file
File diff suppressed because it is too large
@@ -0,0 +1,70 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
from multiprocessing import Pool
from pathlib import Path

import pandas as pd
import utils
from tqdm import tqdm

def main(args, n_workers=20):
    raw_folder = Path(args.raw_images_folder)
    processed_folder = Path(args.processed_images_folder)
    utils.setup(processed_folder)
    img_ids = utils.get_image_ids(args.annotation_file)
    if args.dataset_name == "geode":
        metadata = pd.read_csv(raw_folder / "index.csv")
        metadata["flat_filepath"] = metadata.file_path.apply(
            lambda x: x.replace("/", "_")
        )
        metadata["original_absolute_path"] = metadata.file_path.apply(
            lambda x: str((raw_folder / "images") / x)
        )
        metadata["new_absolute_path"] = metadata.flat_filepath.apply(
            lambda x: str(processed_folder / x)
        )
        metadata["filestem"] = metadata.new_absolute_path.apply(lambda x: Path(x).stem)
        img_id_mapping = metadata.set_index("filestem").to_dict()
        paths = [
            (
                img_id_mapping["original_absolute_path"][each],
                img_id_mapping["new_absolute_path"][each],
            )
            for each in img_ids
        ]
    elif args.dataset_name == "bdd100k":
        bdd_subfolder = "100k/train"
        img_filenames = utils.get_filenames(args.annotation_file)
        raw_folder_bdd_images = raw_folder / bdd_subfolder
        paths = [
            (raw_folder_bdd_images / each, processed_folder / each)
            for each in img_filenames
        ]
    elif args.dataset_name == "food_rec":
        food_subfolder = "public_validation_set_2.0/images"
        img_filenames = utils.get_filenames(args.annotation_file)
        raw_folder_food_images = raw_folder / food_subfolder
        paths = [
            (
                raw_folder_food_images
                / f'{Path(each).stem.split("_")[-1]}{Path(each).suffix}',
                processed_folder / each,
            )
            for each in img_filenames
        ]
    print("Preparing to copy and flatten filenames for", len(paths), "images")
    with Pool(n_workers) as p:
        for _ in tqdm(p.imap(utils.copy_file, paths), total=len(paths)):
            continue


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--annotation_file", help="Path to annotation file")
    parser.add_argument("--raw_images_folder", help="Path to downloaded images")
    parser.add_argument("--processed_images_folder", help="Path to processed images")
    parser.add_argument(
        "--dataset_name", help="Name of the dataset (geode, bdd100k, or food_rec)"
    )
    args = parser.parse_args()
    main(args)
148
scripts/eval/silver/utils.py
Normal file
@@ -0,0 +1,148 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import json
import os
import shutil
import subprocess
from io import BytesIO
from pathlib import Path

import cv2
import matplotlib.pyplot as plt
import numpy as np
import yaml
from PIL import Image
from pycocotools import mask as mask_utils
from tqdm import tqdm


annotation_files = {
    "droid": [
        "silver_droid_merged_test.json",
    ],
    "sav": [
        "silver_sav_merged_test.json",
    ],
    "yt1b": [
        "silver_yt1b_merged_test.json",
    ],
    "ego4d": [
        "silver_ego4d_merged_test.json",
    ],
}


def load_yaml(filename):
    with open(filename, "r") as f:
        return yaml.safe_load(f)


def load_json(filename):
    with open(filename, "r") as f:
        return json.load(f)


def save_json(content, filename):
    with open(filename, "w") as f:
        json.dump(content, f)


def run_command(cmd):
    """Run a shell command and raise if it fails."""
    result = subprocess.run(cmd, shell=True)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")


config = load_yaml("CONFIG_FRAMES.yaml")


def is_valid_image(img_path):
    try:
        Image.open(img_path).convert("RGB")
        return True
    except Exception:
        return False


def get_frame_from_video(video_path, frame_id):
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_id)
    ret, frame = cap.read()
    cap.release()
    if not ret:
        # Some videos cannot be opened with OpenCV
        import av

        container = av.open(video_path)
        stream = container.streams.video[0]
        for i, frame in tqdm(
            enumerate(container.decode(stream)),
            desc="Decoding with AV",
            total=frame_id + 1,
        ):
            if i == frame_id:
                img = frame.to_ndarray(format="rgb24")
                return img
        raise ValueError(
            f"Could not read frame {frame_id} from video {video_path} (out of frame)"
        )
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    return frame_rgb


def update_annotations(dataset_name, file_names_keep, key="original_video"):
    for annotation_file in annotation_files[dataset_name]:
        path_ann = os.path.join(config["path_annotations"], annotation_file)
        path_original_ann = os.path.join(
            config["path_annotations"],
            annotation_file.replace(".json", "_original.json"),
        )
        ann = load_json(path_ann)
        shutil.copy(path_ann, path_original_ann)
        new_images = []
        image_ids_keep = set()
        for image in ann["images"]:
            if image[key].replace(".mp4", "") in file_names_keep:
                new_images.append(image)
                image_ids_keep.add(image["id"])
        new_annotations = []
        for annotation in ann["annotations"]:
            if annotation["image_id"] in image_ids_keep:
                new_annotations.append(annotation)
        ann["images"] = new_images
        ann["annotations"] = new_annotations
        save_json(ann, path_ann)


def get_filename_size_map(annotation_path):
    with open(annotation_path) as f:
        annotations = json.load(f)
    filename_size_map = {}
    for each in annotations["images"]:
        filename_size_map[each["file_name"]] = (each["width"], each["height"])
    return filename_size_map


def get_filenames(annotation_path):
    with open(annotation_path) as f:
        annotations = json.load(f)
    filenames = {Path(each["file_name"]) for each in annotations["images"]}
    return filenames


def get_image_ids(annotation_path):
    filenames = get_filenames(annotation_path)
    filestems = {Path(each).stem for each in filenames}
    return filestems


def setup(folder):
    print("Making dir", folder)
    folder.mkdir(exist_ok=True)


def copy_file(paths):
    old_path, new_path = paths
    print("Copy from", old_path, "to", new_path)
    if not Path(new_path).exists():
        shutil.copy2(old_path, new_path)
48
scripts/eval/standalone_cgf1.py
Normal file
@@ -0,0 +1,48 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved

"""Simple script to run the CGF1 evaluator given a prediction file and GT file(s).

Usage: python standalone_cgf1.py --pred_file <path_to_prediction_file> --gt_files <path_to_gt_file1> <path_to_gt_file2> ...
"""

import argparse
import os

from sam3.eval.cgf1_eval import CGF1Evaluator


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pred_file",
        type=str,
        required=True,
        help="Path to the prediction file in COCO format.",
    )
    parser.add_argument(
        "--gt_files",
        type=str,
        nargs="+",
        required=True,
        help="Paths to the ground truth files in COCO format.",
    )
    args = parser.parse_args()
    if len(args.gt_files) == 0:
        raise ValueError("At least one GT file must be provided.")

    is_gold = "gold" in os.path.basename(args.gt_files[0])
    if is_gold and len(args.gt_files) < 3:
        print(
            "WARNING: based on the name, it seems you are using gold GT files. Typically, there should be 3 GT files for gold subsets (a, b, c)."
        )

    evaluator = CGF1Evaluator(
        gt_path=args.gt_files, verbose=True, iou_type="segm"
    )  # change to "bbox" if you want detection performance

    results = evaluator.evaluate(args.pred_file)

    print(results)


if __name__ == "__main__":
    main()
244
scripts/eval/veval/README.md
Normal file
@@ -0,0 +1,244 @@
# SA-Co/VEval Dataset
**License**: each domain has its own license
* SA-Co/VEval - SA-V: CC-BY-NC 4.0
* SA-Co/VEval - YT-Temporal-1B: CC-BY-NC 4.0
* SA-Co/VEval - SmartGlasses: CC-BY 4.0

**SA-Co/VEval** is an evaluation dataset comprising 3 domains; each domain has a val and a test split.
* SA-Co/VEval - SA-V: videos are from the [SA-V dataset](https://ai.meta.com/datasets/segment-anything-video/)
* SA-Co/VEval - YT-Temporal-1B: videos are from [YT-Temporal-1B](https://cove.thecvf.com/datasets/704)
* SA-Co/VEval - SmartGlasses: egocentric videos from [Smart Glasses](https://huggingface.co/datasets/facebook/SACo-VEval/blob/main/media/saco_sg.tar.gz)

## Environment
Install the environment required for SA-Co/VEval
```
pip install -e ".[veval]"
```
This will allow us to run:
* `scripts/eval/veval/saco_yt1b_downloader.py` to prepare frames for SA-Co/VEval - YT-Temporal-1B
* `examples/saco_veval_eval_example.ipynb`, an example of running an offline evaluator
* `examples/saco_veval_vis_example.ipynb`, an example of loading and visualizing the data

## Download
### The expected folder structure
The following folder structure is expected after finishing all the download and pre-processing steps in this section
```
data/
├── annotation/
│   ├── saco_veval_sav_test.json
│   ├── saco_veval_sav_val.json
│   ├── saco_veval_smartglasses_test.json
│   ├── saco_veval_smartglasses_val.json
│   ├── saco_veval_yt1b_test.json
│   └── saco_veval_yt1b_val.json
└── media/
    ├── saco_sav
    │   └── JPEGImages_24fps
    ├── saco_sg
    │   └── JPEGImages_6fps
    └── saco_yt1b
        └── JPEGImages_6fps
```
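As a quick sanity check after downloading, the expected layout above can be verified programmatically. This is a hedged sketch (the function name and the idea of returning a list of missing paths are illustrative, not part of the repo's tooling):

```python
import os

EXPECTED_ANNOTS = [
    "saco_veval_sav_test.json", "saco_veval_sav_val.json",
    "saco_veval_smartglasses_test.json", "saco_veval_smartglasses_val.json",
    "saco_veval_yt1b_test.json", "saco_veval_yt1b_val.json",
]
EXPECTED_MEDIA = {
    "saco_sav": "JPEGImages_24fps",
    "saco_sg": "JPEGImages_6fps",
    "saco_yt1b": "JPEGImages_6fps",
}


def check_data_layout(data_root):
    """Return a list of annotation files / media folders missing under data_root."""
    missing = []
    for name in EXPECTED_ANNOTS:
        if not os.path.isfile(os.path.join(data_root, "annotation", name)):
            missing.append(f"annotation/{name}")
    for domain, jpeg_dir in EXPECTED_MEDIA.items():
        if not os.path.isdir(os.path.join(data_root, "media", domain, jpeg_dir)):
            missing.append(f"media/{domain}/{jpeg_dir}")
    return missing
```

An empty return value means every expected file and folder is in place.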
### Download ready-to-use data
The following links provide ready-to-use data, hosted on Roboflow, after completing the pre-processing steps outlined in the next section.

For each domain:
- [SA-Co/VEval - SA-V](https://universe.roboflow.com/sa-co-veval/sa-v-test/)
- [SA-Co/VEval - YT-Temporal-1B](https://universe.roboflow.com/sa-co-veval/yt-temporal-1b-test/)
- [SA-Co/VEval - SmartGlasses](https://universe.roboflow.com/sa-co-veval/smartglasses-test/)

For all three domains:
- [SA-Co/VEval](https://universe.roboflow.com/sa-co-veval)

### Download via preprocessing steps
#### Download annotations
The GT annotations are available on Hugging Face:
* [SA-Co/VEval](https://huggingface.co/datasets/facebook/SACo-VEval/tree/main)
  * SA-Co/VEval SA-V
    * Test: `annotation/saco_veval_sav_test.json`
    * Val: `annotation/saco_veval_sav_val.json`
  * SA-Co/VEval YT-Temporal-1B
    * Test: `annotation/saco_veval_yt1b_test.json`
    * Val: `annotation/saco_veval_yt1b_val.json`
  * SA-Co/VEval SmartGlasses
    * Test: `annotation/saco_veval_smartglasses_test.json`
    * Val: `annotation/saco_veval_smartglasses_val.json`

#### Download videos or frames
##### SA-Co/VEval - SA-V
Follow the instructions in the [SA-V dataset](https://ai.meta.com/datasets/segment-anything-video/). Only the following two archives are needed:
* sav_test.tar
* sav_val.tar

After untarring:
```
sav_test/
├── Annotations_6fps [ignore: this is the SAM 2 annotation]
└── JPEGImages_24fps
sav_val/
├── Annotations_6fps [ignore: this is the SAM 2 annotation]
└── JPEGImages_24fps
```
Then merge the two JPEGImages_24fps folders together to match our annotation json file paths, e.g.
```
media/
└── saco_sav
    └── JPEGImages_24fps [merged from the two JPEGImages_24fps above]
```
Example commands to download and merge the folders
```
cd ../data/media/saco_sav
wget -O sav_test.tar <sav_test.tar download link from the SA-V dataset page>
wget -O sav_val.tar <sav_val.tar download link from the SA-V dataset page>
tar -xf sav_test.tar
tar -xf sav_val.tar
mkdir JPEGImages_24fps
chmod -R u+w sav_test/
chmod -R u+w sav_val/
mv sav_test/JPEGImages_24fps/* JPEGImages_24fps/
mv sav_val/JPEGImages_24fps/* JPEGImages_24fps/
```

##### SA-Co/VEval - YT-Temporal-1B
Two files are needed to download the SA-Co/VEval - YT-Temporal-1B YouTube videos.
* Download `media/yt1b_start_end_time.json` from [SA-Co/VEval](https://huggingface.co/datasets/facebook/SACo-VEval/tree/main), which contains the YouTube video ids and the start and end times used in SA-Co/VEval - YT-Temporal-1B.
* Prepare the `cookies.txt` file. Follow the yt-dlp instructions in [exporting-youtube-cookies](https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies) and [pass-cookies-to-yt-dlp](https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp) to prepare the cookies file.
  * Please see the full **WARNINGS** in yt-dlp regarding the risk of a YouTube account ban!!

Then run `scripts/eval/veval/saco_yt1b_downloader.py` to download the videos and prepare the frames, e.g.
```
python saco_yt1b_downloader.py \
    --data_dir ../data/media/saco_yt1b \
    --cookies_file ../data/media/saco_yt1b/cookies.txt \
    --yt1b_start_end_time_file ../data/media/saco_yt1b/yt1b_start_end_time.json \
    --yt1b_frame_prep_log_file ../data/media/saco_yt1b/yt1b_frame_prep.log
```
* `data_dir`: the directory to download the YouTube videos to and store the extracted frames
* `cookies_file`: the `cookies.txt` prepared above
* `yt1b_start_end_time_file`: the `yt1b_start_end_time.json` downloaded above
* `yt1b_frame_prep_log_file`: a log file to track the video downloading and frame extraction status

Then run `scripts/eval/veval/saco_yt1b_annot_update.py` to update the annotation based on video availability, e.g.
```
python saco_yt1b_annot_update.py \
    --yt1b_media_dir ../data/media/saco_yt1b/JPEGImages_6fps \
    --yt1b_input_annot_path ../data/annotation/saco_veval_yt1b_val.json \
    --yt1b_output_annot_path ../data/annotation/saco_veval_yt1b_val_updated.json \
    --yt1b_annot_update_log_path ../data/annotation/saco_veval_yt1b_val_updated.log
```

**NOTE**:
* Not all YouTube videos may be available, as YouTube videos can be deleted or become private. The script `saco_yt1b_annot_update.py` removes the annotations of the unavailable videos.
* **Frame Shifting Alert!!** Even when the videos are still available, their specifications, such as fps and duration, may differ from those used during annotation when re-downloaded from YouTube. Additionally, `ffmpeg` sometimes cannot guarantee consistent frame extraction from the same video across different environments. This may cause the re-downloaded and re-extracted frames to be misaligned with our annotations due to frame shifting. Please be aware of this caveat when evaluating on SA-Co/VEval - YT-Temporal-1B.
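
One cheap way to surface frame shifting is to compare the number of extracted frames per video against the annotation's `file_names` list (see the Annotation Format section). The sketch below assumes frames are stored as `<media_dir>/<video_name>/*.jpg`; the function name is illustrative:

```python
import json
import os


def find_frame_count_mismatches(annot_path, media_dir):
    """Return video_names whose number of extracted JPEG frames differs from
    the number of frames listed in the annotation (a hint of frame shifting)."""
    with open(annot_path, "r") as f:
        data = json.load(f)
    mismatched = []
    for video in data["videos"]:
        frame_dir = os.path.join(media_dir, video["video_name"])
        if not os.path.isdir(frame_dir):
            mismatched.append(video["video_name"])
            continue
        n_extracted = len([p for p in os.listdir(frame_dir) if p.endswith(".jpg")])
        if n_extracted != len(video["file_names"]):
            mismatched.append(video["video_name"])
    return mismatched
```

A count mismatch does not prove misalignment (and matching counts do not rule it out), but any flagged video deserves a visual spot check before evaluation.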

##### SA-Co/VEval - SmartGlasses
Go to [SACo-VEval](https://huggingface.co/datasets/facebook/SACo-VEval/tree/main) and download `media/saco_sg.tar.gz`
```
cd ../data
hf download facebook/SACo-VEval media/saco_sg.tar.gz --repo-type dataset --local-dir .
cd ../data/media
tar -xzf saco_sg.tar.gz
```

## Annotation Format
The format is similar to the [YTVIS](https://youtube-vos.org/dataset/vis/) format.

In the annotation json, e.g. `saco_veval_sav_test.json`, there are 5 fields:
* info
  * A dict containing the dataset info
  * E.g. {'version': 'v1', 'date': '2025-09-24', 'description': 'SA-Co/VEval SA-V Test'}
* videos
  * A list of the videos used in the current annotation json
  * It contains {id, video_name, file_names, height, width, length}
* annotations
  * A list of **positive** masklets and their related info
  * It contains {id, segmentations, bboxes, areas, iscrowd, video_id, height, width, category_id, noun_phrase}
  * video_id should match the `videos - id` field above
  * category_id should match the `categories - id` field below
  * segmentations is a list of [RLE](https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/mask.py)s
* categories
  * A **globally** shared noun phrase id map, valid across all 3 domains
  * It contains {id, name}
  * name is the noun phrase
* video_np_pairs
  * A list of video-np pairs, both **positive** and **negative**, used in the current annotation json
  * It contains {id, video_id, category_id, noun_phrase, num_masklets}
  * video_id should match the `videos - id` above
  * category_id should match the `categories - id` above
  * when `num_masklets > 0`, it is a positive video-np pair, and the corresponding masklets can be found in the annotations field
  * when `num_masklets = 0`, it is a negative video-np pair, meaning no masklet is present at all
```
data {
    "info": info
    "videos": [video]
    "annotations": [annotation]
    "categories": [category]
    "video_np_pairs": [video_np_pair]
}
video {
    "id": int
    "video_name": str  # e.g. sav_000000
    "file_names": List[str]
    "height": int
    "width": int
    "length": int
}
annotation {
    "id": int
    "segmentations": List[RLE]
    "bboxes": List[List[int, int, int, int]]
    "areas": List[int]
    "iscrowd": int
    "video_id": str
    "height": int
    "width": int
    "category_id": int
    "noun_phrase": str
}
category {
    "id": int
    "name": str
}
video_np_pair {
    "id": int
    "video_id": str
    "category_id": int
    "noun_phrase": str
    "num_masklets": int
}
```
[sam3/examples/saco_veval_vis_example.ipynb](https://github.com/facebookresearch/sam3/blob/main/examples/saco_veval_vis_example.ipynb) shows some examples of the data format and data visualization.
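
As a small illustration of how these fields link together, the sketch below (assuming only the `videos` and `video_np_pairs` fields described above; the function name is illustrative) separates positive from negative video-NP pairs:

```python
def split_video_np_pairs(data):
    """Split video_np_pairs into positive (num_masklets > 0) and
    negative (num_masklets == 0) (video_name, noun_phrase) pairs."""
    videos_by_id = {v["id"]: v for v in data["videos"]}
    positives, negatives = [], []
    for pair in data["video_np_pairs"]:
        entry = (videos_by_id[pair["video_id"]]["video_name"], pair["noun_phrase"])
        if pair["num_masklets"] > 0:
            positives.append(entry)
        else:
            negatives.append(entry)
    return positives, negatives
```

The per-frame `segmentations` RLEs of a positive pair's masklets can then be decoded with `pycocotools.mask.decode`, as linked in the format description above.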

## Run Offline Eval
An example notebook and an eval script are provided for offline evaluation.
```
sam3/
├── examples/
│   └── saco_veval_eval_example.ipynb  # this notebook loads eval results or runs the eval on the fly, and prints the results
└── sam3/eval/
    └── saco_veval_eval.py  # this script runs the offline evaluator
```
`saco_veval_eval.py` supports two modes, `one` and `all`.
* `one`: evaluates a single pair of GT and pred files
* `all`: evaluates all 6 SA-Co/VEval datasets

Example usage
```
python saco_veval_eval.py one \
    --gt_annot_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_gt.json \
    --pred_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_pred.json \
    --eval_res_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_eval_res.json
```
* `gt_annot_file`: the location of the GT file
* `pred_file`: the location of the pred file
* `eval_res_file`: the location where the eval result will be written

```
python saco_veval_eval.py all \
    --gt_annot_dir ../data/annotation \
    --pred_dir ../data/pred \
    --eval_res_dir ../data/pred
```
* `gt_annot_dir`: the location of the GT files
* `pred_dir`: the location of the pred files
* `eval_res_dir`: the location where the eval results will be written
1
scripts/eval/veval/__init__.py
Normal file
@@ -0,0 +1 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
136
scripts/eval/veval/saco_yt1b_annot_update.py
Normal file
@@ -0,0 +1,136 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import json
import logging
import os

import pandas as pd


logger = logging.getLogger(__name__)


def get_available_saco_yt1b_ids(yt1b_media_dir, data):
    vdf = pd.DataFrame(data["videos"])
    expected_saco_yt1b_ids = vdf.video_name.tolist()

    yt1b_media_folders = os.listdir(yt1b_media_dir)

    available_saco_yt1b_ids = []
    for yt1b_media_folder in yt1b_media_folders:
        if yt1b_media_folder not in expected_saco_yt1b_ids:
            continue
        jpeg_folder_dir = os.path.join(yt1b_media_dir, yt1b_media_folder)
        jpeg_count = len(os.listdir(jpeg_folder_dir))
        if jpeg_count > 0:
            available_saco_yt1b_ids.append(yt1b_media_folder)
        else:
            logger.info(
                f"No JPEG images found for {yt1b_media_folder}. The annotation related to this video will be removed."
            )

    logger.info(
        f"Expected {len(expected_saco_yt1b_ids)} videos for {data['info']}. Found {len(available_saco_yt1b_ids)} videos available in {yt1b_media_dir}."
    )
    return available_saco_yt1b_ids
|
||||
|
||||
def update_yt1b_annot_per_field(data, field, id_col, available_ids):
|
||||
field_data = data[field]
|
||||
new_field_data = []
|
||||
for data_entry in field_data:
|
||||
if data_entry[id_col] not in available_ids:
|
||||
logger.info(
|
||||
f"{field}: Removing {data_entry} due to the video being unavailable."
|
||||
)
|
||||
continue
|
||||
new_field_data.append(data_entry)
|
||||
|
||||
data[field] = new_field_data
|
||||
logger.info(
|
||||
f"Updated {field} by {id_col} - Before: {len(field_data)}, After: {len(new_field_data)}, Removed: {len(field_data) - len(new_field_data)}"
|
||||
)
|
||||
return data
|
||||
|
||||
|
||||
def update_yt1b_annot(yt1b_input_annot_path, yt1b_media_dir, yt1b_output_annot_path):
|
||||
with open(yt1b_input_annot_path, "r") as f:
|
||||
data = json.load(f)
|
||||
|
||||
available_saco_yt1b_ids = get_available_saco_yt1b_ids(yt1b_media_dir, data)
|
||||
|
||||
data = update_yt1b_annot_per_field(
|
||||
data=data,
|
||||
field="videos",
|
||||
id_col="video_name",
|
||||
available_ids=available_saco_yt1b_ids,
|
||||
)
|
||||
|
||||
videos_data = data["videos"]
|
||||
available_video_incremental_ids = [data_entry["id"] for data_entry in videos_data]
|
||||
|
||||
data = update_yt1b_annot_per_field(
|
||||
data=data,
|
||||
field="annotations",
|
||||
id_col="video_id",
|
||||
available_ids=available_video_incremental_ids,
|
||||
)
|
||||
data = update_yt1b_annot_per_field(
|
||||
data=data,
|
||||
field="video_np_pairs",
|
||||
id_col="video_id",
|
||||
available_ids=available_video_incremental_ids,
|
||||
)
|
||||
|
||||
with open(yt1b_output_annot_path, "w") as f:
|
||||
json.dump(data, f)
|
||||
|
||||
return data
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Run video grounding evaluators")
|
||||
parser.add_argument(
|
||||
"--yt1b_media_dir",
|
||||
type=str,
|
||||
help="Path to the directory where the yt1b media is stored e.g media/saco_yt1b/JPEGImages_6fps",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--yt1b_input_annot_path",
|
||||
type=str,
|
||||
help="Path to the saco_veval_yt1b input annotation file e.g annotation/saco_veval_yt1b_test.json or annotation/saco_veval_yt1b_val.json",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--yt1b_output_annot_path",
|
||||
type=str,
|
||||
help="Path to the output annotation file e.g annotation/saco_veval_yt1b_test_updated.json or annotation/saco_veval_yt1b_val_updated.json",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--yt1b_annot_update_log_path",
|
||||
type=str,
|
||||
help="Path to the yt1b annot update log file e.g annotation/yt1b_annot_update_log.log",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
os.makedirs(os.path.dirname(args.yt1b_annot_update_log_path), exist_ok=True)
|
||||
os.makedirs(os.path.dirname(args.yt1b_output_annot_path), exist_ok=True)
|
||||
|
||||
logging.basicConfig(
|
||||
filename=args.yt1b_annot_update_log_path,
|
||||
format="%(asctime)s [%(threadName)s] %(levelname)s: %(message)s",
|
||||
level=logging.INFO,
|
||||
filemode="w",
|
||||
)
|
||||
|
||||
_ = update_yt1b_annot(
|
||||
yt1b_input_annot_path=args.yt1b_input_annot_path,
|
||||
yt1b_media_dir=args.yt1b_media_dir,
|
||||
yt1b_output_annot_path=args.yt1b_output_annot_path,
|
||||
)
|
||||
|
||||
print("Done!! Check the log at", args.yt1b_annot_update_log_path)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
136
scripts/eval/veval/saco_yt1b_downloader.py
Normal file
@@ -0,0 +1,136 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import logging

import multiprocessing as mp
import os
from functools import partial

import pandas as pd
from saco_yt1b_frame_prep_util import YtVideoPrep
from tqdm import tqdm

logger = logging.getLogger(__name__)


def download_and_extract_frames(saco_yt1b_id, args):
    video_prep = YtVideoPrep(
        saco_yt1b_id=saco_yt1b_id,
        data_dir=args.data_dir,
        cookies_file=args.cookies_file,
        yt1b_start_end_time_file=args.yt1b_start_end_time_file,
        ffmpeg_timeout=args.ffmpeg_timeout,
        sleep_interval=args.sleep_interval,
        max_sleep_interval=args.max_sleep_interval,
    )

    status = video_prep.download_youtube_video()
    logger.info(f"[video download][{saco_yt1b_id}] download status {status}")

    if status not in ["already exists", "success"]:
        logger.warning(
            f"Video download failed for {saco_yt1b_id}, skipping frame generation"
        )
        return False

    status = video_prep.extract_frames_in_6fps_and_width_1080()
    logger.info(f"[frame extracting][{saco_yt1b_id}] frame extracting status {status}")
    return True


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_dir",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--cookies_file",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--yt1b_start_end_time_file",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--yt1b_frame_prep_log_file",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--ffmpeg_timeout",
        type=int,
        default=7200,  # Use a longer timeout so large videos do not time out during processing
    )
    parser.add_argument(
        "--sleep_interval",
        type=int,
        default=10,
    )
    parser.add_argument(
        "--max_sleep_interval",
        type=int,
        default=30,
    )
    parser.add_argument(
        "--num_workers",
        type=int,
        default=4,
    )
    args = parser.parse_args()

    log_dir = os.path.dirname(args.yt1b_frame_prep_log_file)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)

    # Set up logging to both file and console.
    # Configure the ROOT logger so all child loggers inherit the configuration.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(processName)s/%(threadName)s] %(name)s - %(levelname)s: %(message)s",
        handlers=[
            logging.FileHandler(args.yt1b_frame_prep_log_file, mode="w"),
            logging.StreamHandler(),
        ],
        force=True,  # Override any existing configuration
    )

    YT_DLP_WARNING_STR = """ ==========
    NOTICE!!
    This script uses yt-dlp to download YouTube videos.
    See the YouTube account banning risk in https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies
    ==========
    """

    logger.info(YT_DLP_WARNING_STR)

    with open(args.yt1b_start_end_time_file, "r") as f:
        yt1b_start_end_time_df = pd.read_json(f)

    saco_yt1b_ids = yt1b_start_end_time_df.saco_yt1b_id.unique()
    num_workers = args.num_workers
    logger.info(
        f"Starting with {num_workers} parallel worker(s) (sleep_interval={args.sleep_interval}-{args.max_sleep_interval}s)"
    )

    with mp.Pool(num_workers) as p:
        download_func = partial(download_and_extract_frames, args=args)
        list(tqdm(p.imap(download_func, saco_yt1b_ids), total=len(saco_yt1b_ids)))

    done_str = f""" ==========
    All DONE!!
    Download, frame extraction, and frame matching are all done! YT1B frames are now ready to use in {args.data_dir}/JPEGImages_6fps
    Check the video frame preparation log at {args.yt1b_frame_prep_log_file}
    Some videos might not be available anymore, which will affect the eval reproducibility
    ==========
    """
    logger.info(done_str)


if __name__ == "__main__":
    main()
265
scripts/eval/veval/saco_yt1b_frame_prep_util.py
Normal file
@@ -0,0 +1,265 @@
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved
import argparse
import logging
import os
import subprocess

import pandas as pd
import yt_dlp

logger = logging.getLogger(__name__)


class YtVideoPrep:
    def __init__(
        self,
        saco_yt1b_id: str,
        data_dir: str,
        cookies_file: str,
        yt1b_start_end_time_file: str,
        ffmpeg_timeout: int,
        sleep_interval: int = 10,
        max_sleep_interval: int = 30,
    ):
        self.saco_yt1b_id = saco_yt1b_id  # saco_yt1b_id is like saco_yt1b_000000
        self.data_dir = data_dir
        self.cookies_file = cookies_file
        self.ffmpeg_timeout = ffmpeg_timeout
        self.sleep_interval = sleep_interval
        self.max_sleep_interval = max_sleep_interval

        self.yt1b_start_end_time_df = pd.read_json(yt1b_start_end_time_file)
        (
            self.yt_video_id,
            self.yt_video_id_w_timestamps,
            self.start_time,
            self.end_time,
            self.expected_num_frames,
        ) = self._get_yt_video_id_map_info()

        self.raw_video_dir = os.path.join(self.data_dir, "raw_videos")
        self.raw_video_path = os.path.join(
            self.raw_video_dir, f"{self.yt_video_id}.mp4"
        )

        self.JPEGImages_6fps_dir = os.path.join(
            self.data_dir, "JPEGImages_6fps", self.saco_yt1b_id
        )
        self.JPEGImages_6fps_pattern = os.path.join(
            self.JPEGImages_6fps_dir, "%05d.jpg"
        )

        os.makedirs(self.raw_video_dir, exist_ok=True)
        os.makedirs(self.JPEGImages_6fps_dir, exist_ok=True)

    def _get_yt_video_id_map_info(self):
        df = self.yt1b_start_end_time_df[
            self.yt1b_start_end_time_df.saco_yt1b_id == self.saco_yt1b_id
        ]
        assert (
            len(df) == 1
        ), f"Expected exactly 1 row for saco_yt1b_id: {self.saco_yt1b_id}, found {len(df)}"
        id_and_frame_map_row = df.iloc[0]

        yt_video_id = (
            id_and_frame_map_row.yt_video_id
        )  # yt_video_id is like -06NgWyZxC0
        yt_video_id_w_timestamps = id_and_frame_map_row.yt_video_id_w_timestamps
        start_time = id_and_frame_map_row.start_time
        end_time = id_and_frame_map_row.end_time
        expected_num_frames = id_and_frame_map_row.length

        return (
            yt_video_id,
            yt_video_id_w_timestamps,
            start_time,
            end_time,
            expected_num_frames,
        )

    def download_youtube_video(self):
        video_url = f"https://youtube.com/watch?v={self.yt_video_id}"

        assert os.path.exists(
            self.cookies_file
        ), f"Cookies file '{self.cookies_file}' not found. It is required to download videos."

        outtmpl = self.raw_video_path

        # Check if the output file already exists
        if os.path.exists(outtmpl) and os.path.isfile(outtmpl):
            return "already exists"

        ydl_opts = {
            "format": "best[height<=720]/best",  # 720p or lower
            "outtmpl": outtmpl,
            "merge_output_format": "mp4",
            "noplaylist": True,
            "quiet": True,
            "cookiefile": self.cookies_file,
            "sleep_interval": self.sleep_interval,  # Sleep before each download to avoid rate limiting
            "max_sleep_interval": self.max_sleep_interval,  # Random sleep for more human-like behavior
        }

        if self.yt_video_id in ["euohdDLEMRg", "nzfAn7n4d-0"]:
            # For "euohdDLEMRg", we have to specify the https protocol or the video sometimes can't be downloaded completely
            # For "nzfAn7n4d-0", without the https protocol, the video will be downloaded as 654×480; however, we need 490×360 to match the frame matching after the 1080-width resizing
            ydl_opts["format"] = (
                "best[height<=720][ext=mp4][protocol^=https]/best[ext=mp4][protocol^=https]/best[height<=720]/best"
            )

        try:
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                ydl.download([video_url])
            return "success"
        except Exception as e:
            logger.warning(
                f"[video download][{self.saco_yt1b_id}] Error downloading video {self.yt_video_id}: {e}"
            )
            return f"error {e}"

    def extract_frames_in_6fps_and_width_1080(self):
        """
        Extract target frames at 6 fps and width 1080.
        """
        if not os.path.exists(self.raw_video_path):
            logger.warning(
                f"[frame extracting][{self.saco_yt1b_id}] Raw video file not found at {self.raw_video_path}"
            )
            os.rmdir(self.JPEGImages_6fps_dir)
            return False

        if (
            os.path.exists(self.JPEGImages_6fps_dir)
            and len(os.listdir(self.JPEGImages_6fps_dir)) == self.expected_num_frames
        ):
            logger.info(
                f"[frame extracting][{self.saco_yt1b_id}] JPEGImages_6fps directory already exists at {self.JPEGImages_6fps_dir} and expected number of frames {self.expected_num_frames} matches"
            )
            return True

        # Clear the directory before extracting new frames
        for file in os.listdir(self.JPEGImages_6fps_dir):
            os.remove(os.path.join(self.JPEGImages_6fps_dir, file))

        args = [
            "-nostdin",
            "-y",
            # select video segment
            "-ss",
            str(self.start_time),
            "-to",
            str(self.end_time),
            "-i",
            self.raw_video_path,
            # set output video to 6 fps and at most 1080 width
            "-vf",
            "fps=6,scale=1080:-2",
            "-vsync",
            "0",  # passthrough mode - no frame duplication/dropping
            "-q:v",
            "2",  # high quality JPEG output
            "-start_number",
            "0",  # start frame numbering from 0
            self.JPEGImages_6fps_pattern,
        ]

        result = subprocess.run(
            ["ffmpeg"] + args,
            timeout=self.ffmpeg_timeout,
            capture_output=True,
            text=True,
        )

        if result.returncode != 0:
            logger.warning(
                f"[frame extracting][{self.saco_yt1b_id}] Failed to extract raw frames: {result.stderr}"
            )
            os.rmdir(self.JPEGImages_6fps_dir)
            return False

        if len(os.listdir(self.JPEGImages_6fps_dir)) != self.expected_num_frames:
            logger.warning(
                f"[frame extracting][{self.saco_yt1b_id}] Expected {self.expected_num_frames} frames but extracted {len(os.listdir(self.JPEGImages_6fps_dir))}"
            )
            # Clear the directory after a failed extraction
            for file in os.listdir(self.JPEGImages_6fps_dir):
                os.remove(os.path.join(self.JPEGImages_6fps_dir, file))

            os.rmdir(self.JPEGImages_6fps_dir)
            return False

        logger.info(
            f"[frame extracting][{self.saco_yt1b_id}] Successfully extracted {self.expected_num_frames} frames to {self.JPEGImages_6fps_dir}"
        )
        return True


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--saco_yt1b_id", type=str, required=True)
    parser.add_argument(
        "--data_dir",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--cookies_file",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--yt1b_start_end_time_file",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--yt1b_frame_prep_log_file",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--ffmpeg_timeout",
        type=int,
        default=7200,  # Use a longer timeout so large videos do not time out during processing
    )
    parser.add_argument(
        "--sleep_interval",
        type=int,
        default=10,
    )
    parser.add_argument(
        "--max_sleep_interval",
        type=int,
        default=30,
    )
    args = parser.parse_args()

    logging.basicConfig(
        filename=args.yt1b_frame_prep_log_file,
        format="%(asctime)s [%(threadName)s] %(levelname)s: %(message)s",
        level=logging.INFO,
        filemode="w",
    )

    video_prep = YtVideoPrep(
        saco_yt1b_id=args.saco_yt1b_id,
        data_dir=args.data_dir,
        cookies_file=args.cookies_file,
        yt1b_start_end_time_file=args.yt1b_start_end_time_file,
        ffmpeg_timeout=args.ffmpeg_timeout,
        sleep_interval=args.sleep_interval,
        max_sleep_interval=args.max_sleep_interval,
    )

    status = video_prep.download_youtube_video()
    logger.info(f"[video download][{args.saco_yt1b_id}] download status {status}")

    status = video_prep.extract_frames_in_6fps_and_width_1080()
    logger.info(
        f"[frame extracting][{args.saco_yt1b_id}] frame extracting status {status}"
    )


if __name__ == "__main__":
    main()