models OpenAI CLIP Image Text Embeddings ViT Large Patch14 336 - Azure/azureml-assets GitHub Wiki

OpenAI-CLIP-Image-Text-Embeddings-ViT-Large-Patch14-336

Overview

The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like `CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within.

This model uses a ViT-L/14 Transformer architecture trained at 336x336 pixel resolution as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.

The primary intended users of these models are AI researchers for tasks requiring image and/or text embeddings such as text and image retrieval.

Training Details

Training Data

The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the training data comes from the authors' crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users.

Evaluation Results

This model was evaluated for text retrieval and image retrieval tasks on the Flickr30k and MSCOCO datasets. The results from Table 13 in the original CLIP paper are summarized below

Text Retrieval

Dataset R@1 R@5
Flickr30k 88.0 98.7
MSCOCO 58.4 81.5

Image Retrieval

Dataset R@1 R@5
Flickr30k 68.7 90.6
MSCOCO 37.8 62.4

Limitations and Biases

Limitations

CLIP currently struggles with respect to certain tasks such as fine grained classification and counting objects. CLIP also poses issues with regards to fairness and bias which the authors discuss in the paper and is described briefly in the next section. Additionally, the authors' approach to testing CLIP also has an important limitation- in many cases they have used linear probes to evaluate the performance of CLIP and there is evidence suggesting that linear probes can underestimate model performance.

Bias

The authors of the original CLIP paper found that the performance of the model and its biases can depend significantly on class design and the choices one makes for categories to include and exclude. They tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories. They found significant disparities with respect to race and gender, which could shift based on how the classes were constructed. The authors also tested the performance of CLIP on gender, race, and age classification using the Fairface dataset. They found that the accuracy for gender classification was greater than 96% across all races, with 'Middle Eastern' having the highest accuracy (98.4%) and 'White' having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification.

License

MIT License

Inference Samples

Inference type Python sample (Notebook) CLI with YAML
Real time image-text-embeddings-online-endpoint.ipynb image-text-embeddings-online-endpoint.sh
Batch image-text-embeddings-batch-endpoint.ipynb image-text-embeddings-batch-endpoint.sh

Sample input and output for image embeddings

Sample input

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", ""],
         ["image2", ""]
      ]
   }
}

Note: "image1" and "image2" should be publicly accessible urls or strings in base64 format

Sample output

[
    {
        "image_features": [-0.92, -0.13, 0.02, ... , 0.13],
    },
    {
        "image_features": [0.54, -0.83, 0.13, ... , 0.26],
    }
]

Note: returned embeddings have dimension 768 and are not normalized

Sample input and output for text embeddings

Sample input

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["", "sample text 1"],
         ["", "sample text 2"]
      ]
   }
}

Sample output

[
    {
        "text_features": [0.42, -0.13, -0.92, ... , 0.63],
    },
    {
        "text_features": [-0.14, 0.93, -0.15, ... , 0.66],
    }
]

Note: returned embeddings have dimension 768 and are not normalized

Sample input and output for image and text embeddings

Sample input

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", "sample text 1"],
         ["image2", "sample text 2"]
      ]
   }
}

Note: "image1" and "image2" should be publicly accessible urls or strings in base64 format

Sample output

[
    {
        "image_features": [0.92, -0.13, 0.02, ... , -0.13],
        "text_features": [0.42, 0.13, -0.92, ... , -0.63]
    },
    {
        "image_features": [-0.54, -0.83, 0.13, ... , -0.26],
        "text_features": [-0.14, -0.93, 0.15, ... , 0.66]
    }
]

Note: returned embeddings have dimension 768 and are not normalized

Version: 2

Tags

Preview huggingface_model_id : openai/clip-vit-large-patch14-336 SharedComputeCapacityEnabled license : mit task : embeddings hiddenlayerscanned inference_compute_allow_list : ['Standard_DS2_v2', 'Standard_D2a_v4', 'Standard_D2as_v4', 'Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_F4s_v2', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']

View in Studio: https://ml.azure.com/registries/azureml/models/OpenAI-CLIP-Image-Text-Embeddings-ViT-Large-Patch14-336/version/2

License: mit

Properties

SharedComputeCapacityEnabled: True

inference-min-sku-spec: 2|0|7|14

inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

⚠️ **GitHub.com Fallback** ⚠️