OpenAI CLIP Image Text Embeddings ViT Large Patch14 336
The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment: to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they're being deployed within.
This model uses a ViT-L/14 Transformer architecture trained at 336x336 pixel resolution as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
The primary intended users of these models are AI researchers for tasks requiring image and/or text embeddings such as text and image retrieval.
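The same checkpoint is published on the Hugging Face Hub as `openai/clip-vit-large-patch14-336` (see the metadata below), so embeddings can also be computed locally. The following is a minimal sketch using the `transformers` library; the image URL and candidate texts are placeholders.

```python
# Minimal sketch: compute CLIP image/text embeddings locally with Hugging Face transformers.
# Assumes `transformers`, `torch`, `Pillow`, and `requests` are installed; the image URL is a placeholder.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # placeholder URL
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])  # shape: (1, 768)
    text_features = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )                                                                                # shape: (2, 768)

# Cosine similarity between the image and each text (the raw embeddings are not normalized).
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print((image_features @ text_features.T).squeeze(0))
```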
The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly used pre-existing image datasets such as YFCC100M. A large portion of the training data comes from the authors' crawling of the internet, which means the data is more representative of the people and societies most connected to the internet; these tend to skew towards more developed nations and towards younger, male users.
This model was evaluated for text retrieval and image retrieval tasks on the Flickr30k and MSCOCO datasets. The results from Table 13 of the original CLIP paper are summarized below:
Text Retrieval
Dataset | R@1 (%) | R@5 (%) |
---|---|---|
Flickr30k | 88.0 | 98.7 |
MSCOCO | 58.4 | 81.5 |
Image Retrieval
Dataset | R@1 (%) | R@5 (%) |
---|---|---|
Flickr30k | 68.7 | 90.6 |
MSCOCO | 37.8 | 62.4 |
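Recall@K is the fraction of queries whose matching item appears among the top K results when ranked by embedding similarity. The sketch below shows one way to compute it from paired text and image embeddings; it is illustrative only and is not the evaluation code used in the paper.

```python
# Illustrative sketch: Recall@K for text-to-image retrieval from paired embeddings.
# `text_emb` and `image_emb` are assumed to be aligned (row i of each is a matching pair).
import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int) -> float:
    # L2-normalize so the dot product equals cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = text_emb @ image_emb.T                     # (num_texts, num_images)
    topk = np.argsort(-sims, axis=1)[:, :k]           # indices of the k most similar images per text
    matches = (topk == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(matches.mean())

# Example with random data just to show the call shape.
rng = np.random.default_rng(0)
t, i = rng.normal(size=(100, 768)), rng.normal(size=(100, 768))
print(recall_at_k(t, i, k=5))
```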
CLIP currently struggles with certain tasks such as fine-grained classification and counting objects. CLIP also raises fairness and bias concerns, which the authors discuss in the paper and which are described briefly in the next section. Additionally, the authors' approach to testing CLIP has an important limitation: in many cases they used linear probes to evaluate the performance of CLIP, and there is evidence suggesting that linear probes can underestimate model performance.
The authors of the original CLIP paper found that the performance of the model and its biases can depend significantly on class design and the choices one makes for categories to include and exclude. They tested the risk of certain kinds of denigration with CLIP by classifying images of people from the FairFace dataset into crime-related and non-human animal categories. They found significant disparities with respect to race and gender, which could shift based on how the classes were constructed. The authors also tested the performance of CLIP on gender, race, and age classification using the FairFace dataset. They found that the accuracy for gender classification was greater than 96% across all races, with 'Middle Eastern' having the highest accuracy (98.4%) and 'White' having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification.
MIT License
Inference type | Python sample (Notebook) | CLI with YAML |
---|---|---|
Real time | image-text-embeddings-online-endpoint.ipynb | image-text-embeddings-online-endpoint.sh |
Batch | image-text-embeddings-batch-endpoint.ipynb | image-text-embeddings-batch-endpoint.sh |
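After deploying a real-time endpoint (for example by following the notebook above), it can be scored with a plain HTTPS POST using the request schema shown below. This is a sketch only; the scoring URI, key, and image URLs are placeholders for values from your own deployment.

```python
# Sketch only: call a deployed real-time endpoint with the request schema shown below.
# The scoring URI, API key, and image URLs are placeholders.
import json
import requests

scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
api_key = "<endpoint-key>"                                                      # placeholder

payload = {
    "input_data": {
        "columns": ["image", "text"],
        "index": [0, 1],
        "data": [
            ["https://example.com/image1.jpg", ""],  # image-only rows return image_features
            ["https://example.com/image2.jpg", ""],
        ],
    }
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"}
response = requests.post(scoring_uri, data=json.dumps(payload), headers=headers)
print(response.json())
```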
Sample input (image embeddings):
{
    "input_data": {
        "columns": ["image", "text"],
        "index": [0, 1],
        "data": [
            ["image1", ""],
            ["image2", ""]
        ]
    }
}
Note: "image1" and "image2" should be publicly accessible URLs or base64-encoded image strings
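If the images are not publicly reachable, they can be passed inline as base64 strings instead of URLs. A minimal sketch of building the `data` rows that way (file names are placeholders):

```python
# Sketch: base64-encode local image files for the "image" column of the request.
import base64

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

request_data = [
    [encode_image("sample1.jpg"), ""],  # image-only row; file names are placeholders
    [encode_image("sample2.jpg"), ""],
]
```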
Sample output:
[
    {
        "image_features": [-0.92, -0.13, 0.02, ... , 0.13]
    },
    {
        "image_features": [0.54, -0.83, 0.13, ... , 0.26]
    }
]
Note: returned embeddings have dimension 768 and are not normalized
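Since the returned vectors are not normalized, it is common to L2-normalize them before computing cosine similarities. A small sketch using NumPy and the (truncated) sample response above:

```python
# Sketch: L2-normalize the 768-dimensional embeddings returned by the endpoint.
import numpy as np

def l2_normalize(features):
    features = np.asarray(features, dtype=np.float32)
    return features / np.linalg.norm(features, axis=-1, keepdims=True)

# Truncated sample values from the response above; real vectors have 768 entries.
response = [
    {"image_features": [-0.92, -0.13, 0.02, 0.13]},
    {"image_features": [0.54, -0.83, 0.13, 0.26]},
]
image_embeddings = l2_normalize([row["image_features"] for row in response])
print(image_embeddings.shape)  # (2, 4) here; (2, 768) with real responses
```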
Sample input (text embeddings):
{
    "input_data": {
        "columns": ["image", "text"],
        "index": [0, 1],
        "data": [
            ["", "sample text 1"],
            ["", "sample text 2"]
        ]
    }
}
Sample output:
[
    {
        "text_features": [0.42, -0.13, -0.92, ... , 0.63]
    },
    {
        "text_features": [-0.14, 0.93, -0.15, ... , 0.66]
    }
]
Note: returned embeddings have dimension 768 and are not normalized
Sample input (image and text embeddings):
{
    "input_data": {
        "columns": ["image", "text"],
        "index": [0, 1],
        "data": [
            ["image1", "sample text 1"],
            ["image2", "sample text 2"]
        ]
    }
}
Note: "image1" and "image2" should be publicly accessible URLs or base64-encoded image strings
Sample output:
[
    {
        "image_features": [0.92, -0.13, 0.02, ... , -0.13],
        "text_features": [0.42, 0.13, -0.92, ... , -0.63]
    },
    {
        "image_features": [-0.54, -0.83, 0.13, ... , -0.26],
        "text_features": [-0.14, -0.93, 0.15, ... , 0.66]
    }
]
Note: returned embeddings have dimension 768 and are not normalized
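When both columns are populated, each response row contains an `image_features` and a `text_features` vector, which can be compared directly, for example to find the best-matching text for each image. A hedged sketch with NumPy (values truncated from the sample above):

```python
# Sketch: rank the supplied texts against each image using cosine similarity
# of the returned (unnormalized) image_features and text_features.
import numpy as np

def cosine_similarity_matrix(image_features, text_features):
    img = np.asarray(image_features, dtype=np.float32)
    txt = np.asarray(text_features, dtype=np.float32)
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return img @ txt.T  # (num_images, num_texts)

# `response` stands in for the rows returned by the endpoint (truncated sample values).
response = [
    {"image_features": [0.92, -0.13, 0.02, -0.13], "text_features": [0.42, 0.13, -0.92, -0.63]},
    {"image_features": [-0.54, -0.83, 0.13, -0.26], "text_features": [-0.14, -0.93, 0.15, 0.66]},
]
sims = cosine_similarity_matrix(
    [row["image_features"] for row in response],
    [row["text_features"] for row in response],
)
print(sims.argmax(axis=1))  # index of the best-matching text for each image
```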
Version: 2
Preview
huggingface_model_id : openai/clip-vit-large-patch14-336
SharedComputeCapacityEnabled
license : mit
task : embeddings
hiddenlayerscanned
inference_compute_allow_list : ['Standard_DS2_v2', 'Standard_D2a_v4', 'Standard_D2as_v4', 'Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_F4s_v2', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
View in Studio: https://ml.azure.com/registries/azureml/models/OpenAI-CLIP-Image-Text-Embeddings-ViT-Large-Patch14-336/version/2
License: mit
SharedComputeCapacityEnabled: True
inference-min-sku-spec: 2|0|7|14 (vCPUs | GPUs | memory in GB | storage in GB)
inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2