models OpenAI CLIP Image Text Embeddings vit base patch32 - Azure/azureml-assets GitHub Wiki
OpenAI's CLIP (Contrastive Language–Image Pre-training) model was designed to investigate the factors that contribute to the robustness of computer vision tasks. It can seamlessly adapt to a range of image classification tasks without requiring specific training for each, demonstrating efficiency, flexibility, and generality.
In terms of architecture, CLIP utilizes a ViT-B/32 Transformer for image encoding and a masked self-attention Transformer for text encoding. These encoders undergo training to improve the similarity of (image, text) pairs using a contrastive loss.
For training purposes, CLIP leverages image-text pairs from the internet and engages in a proxy task: when presented with an image, predict the correct text snippet from a set of 32,768 randomly sampled options. This approach allows CLIP to comprehend visual concepts and establish associations with their textual counterparts, enhancing its performance across various visual classification tasks.
The design of CLIP effectively tackles notable challenges, including the dependence on expensive labeled datasets, the need for fine-tuning on new datasets to achieve optimal performance across diverse tasks, and the disparity between benchmark and real-world performance.
The primary intended users of these models are AI researchers for tasks requiring image and/or text embeddings such as text and image retrieval.
For more details on CLIP model, review the original-paper or the original-model-card.
The training of the CLIP model involved utilizing publicly accessible image-caption data obtained by crawling several websites and incorporating commonly-used existing image datasets like YFCC100M. Researchers curated a novel dataset comprising 400 million image-text pairs sourced from diverse publicly available internet outlets. This dataset, referred to as WIT (WebImageText), possesses a word count comparable to the WebText dataset employed in training GPT-2.
As a consequence, the data in WIT is reflective of individuals and societies predominantly linked to the internet, often leaning towards more developed nations and a demographic skewed towards younger, male users.
The Vision Transformers ViT-B/32 underwent training for 32 epochs, employing the Adam optimizer with applied decoupled weight decay regularization. The learning rate was decayed using a cosine schedule. The learnable temperature parameter τ was initialized to the equivalent of 0.07. Training utilized a very large mini-batch size of 32,768, and mixed-precision techniques were employed to expedite training and conserve memory. The largest Vision Transformer was trained over a period of 12 days on 256 V100 GPUs. For a more in-depth understanding, refer to sections 2 and 3 of the original-paper.
The performance of CLIP has been evaluated on a wide range of benchmarks across a variety of computer vision datasets such as OCR to texture recognition to fine-grained classification. The section 3 and 4 of the paper describes model performance on multiple datasets.
CLIP has difficulties with tasks such as fine-grained classification and object counting. Its performance also raises concerns regarding fairness and bias. Additionally, there is a notable limitation in the evaluation approach, with the use of linear probes potentially underestimating CLIP's true performance, as suggested by evidence.
CLIP's performance and inherent biases can vary depending on class design and category choices. Assessing Fairface images unveiled significant racial and gender disparities, influenced by class construction. Evaluations on gender, race, and age classification using the Fairface dataset indicated gender accuracy exceeding 96%, with variations among races. Racial classification achieved approximately 93%, while age classification reached around 63%. These assessments aim to gauge model performance across demographics, pinpoint potential risks, and are not intended to endorse or promote such tasks. For a more details, refer to sections 6 and 7 of the original-paper.
MIT License
Inference type | Python sample (Notebook) | CLI with YAML |
---|---|---|
Real time | image-text-embeddings-online-endpoint.ipynb | image-text-embeddings-online-endpoint.sh |
Batch | image-text-embeddings-batch-endpoint.ipynb | image-text-embeddings-batch-endpoint.sh |
{
"input_data":{
"columns":[
"image", "text"
],
"index":[0, 1],
"data":[
["image1", ""],
["image2", ""]
]
}
}
Note: "image1" and "image2" should be publicly accessible urls or strings in base64
format
[
{
"image_features": [-0.92, -0.13, 0.02, ... , 0.13],
},
{
"image_features": [0.54, -0.83, 0.13, ... , 0.26],
}
]
Note: returned embeddings have dimension 512 and are not normalized
{
"input_data":{
"columns":[
"image", "text"
],
"index":[0, 1],
"data":[
["", "sample text 1"],
["", "sample text 2"]
]
}
}
[
{
"text_features": [0.42, -0.13, -0.92, ... , 0.63],
},
{
"text_features": [-0.14, 0.93, -0.15, ... , 0.66],
}
]
Note: returned embeddings have dimension 512 and are not normalized
{
"input_data":{
"columns":[
"image", "text"
],
"index":[0, 1],
"data":[
["image1", "sample text 1"],
["image2", "sample text 2"]
]
}
}
Note: "image1" and "image2" should be publicly accessible urls or strings in base64
format
[
{
"image_features": [0.92, -0.13, 0.02, ... , -0.13],
"text_features": [0.42, 0.13, -0.92, ... , -0.63]
},
{
"image_features": [-0.54, -0.83, 0.13, ... , -0.26],
"text_features": [-0.14, -0.93, 0.15, ... , 0.66]
}
]
Note: returned embeddings have dimension 512 and are not normalized
Version: 10
Preview
huggingface_model_id : openai/clip-vit-base-patch32
SharedComputeCapacityEnabled
author : OpenAI
license : mit
task : embeddings
hiddenlayerscanned
inference_compute_allow_list : ['Standard_DS2_v2', 'Standard_D2a_v4', 'Standard_D2as_v4', 'Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_F4s_v2', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
View in Studio: https://ml.azure.com/registries/azureml/models/OpenAI-CLIP-Image-Text-Embeddings-vit-base-patch32/version/10
License: mit
SharedComputeCapacityEnabled: True
inference-min-sku-spec: 2|0|7|14
inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2