models Salesforce BLIP image captioning base - Azure/azureml-assets GitHub Wiki
BLIP (Bootstrapping Language-Image Pre-training) is a new VLP framework for unified vision-language understanding and generation that expands the scope of downstream tasks compared to existing methods. The framework encompasses two key contributions, from the model and data perspectives:
- BLIP incorporates the Multi-modal Mixture of Encoder-Decoder (MED), an innovative model architecture designed to facilitate effective multi-task pre-training and flexible transfer learning. This model is jointly pre-trained using three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modeling.
- BLIP introduces Captioning and Filtering (CapFilt), a distinctive dataset bootstrapping method aimed at learning from noisy image-text pairs. The pre-trained MED is fine-tuned into a captioner that generates synthetic captions from web images, and a filter that removes noisy captions from both the original web texts and synthetic texts.
The authors of BLIP make the following key observations based on extensive experiments and analysis. The collaboration between the captioner and the filter significantly enhances performance across diverse downstream tasks through caption bootstrapping, with greater diversity in captions leading to more substantial gains. BLIP achieves state-of-the-art performance in various vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog. It also achieves state-of-the-art zero-shot performance when directly applied to video-language tasks such as text-to-video retrieval and video question answering.
Researchers should carefully assess the safety and fairness of the model before deploying it in any real-world applications.
This model was fine-tuned on the COCO dataset with the language modeling (LM) loss to generate captions from images, and uses the base architecture (with a ViT-Base backbone). For more details on image captioning with BLIP, review Section 5.2 of the original paper.
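The underlying checkpoint is also published on the Hugging Face Hub as `Salesforce/blip-image-captioning-base` (see the model tags below). For reference, a minimal local captioning sketch with the `transformers` library, assuming `transformers`, `torch`, and `Pillow` are installed and using a placeholder image URL:

```python
# Minimal local captioning sketch with Hugging Face transformers.
# Assumes transformers, torch, and Pillow are installed; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Load any publicly accessible image.
image_url = "https://example.com/sample.jpg"  # placeholder URL
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Unconditional captioning: encode the image, generate caption tokens, decode to text.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```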
BSD 3-Clause License
Inference type | Python sample (Notebook) | CLI with YAML |
---|---|---|
Real time | image-to-text-online-endpoint.ipynb | image-to-text-online-endpoint.sh |
Batch | image-to-text-batch-endpoint.ipynb | image-to-text-batch-endpoint.sh |
```json
{
  "input_data": {
    "columns": ["image"],
    "index": [0, 1],
    "data": [
      ["image1"],
      ["image2"]
    ]
  }
}
```
Note: "image1" and "image2" should be publicly accessible URLs or strings in base64 format.
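As a rough, non-authoritative sketch of how such a payload could be assembled and sent to a deployed real-time endpoint, assuming local image files and placeholder values for the scoring URI and key (substitute the details of your own deployment):

```python
# Sketch: build the request payload with base64-encoded local images and call a
# deployed online endpoint. The file names, scoring URI, and key are placeholders.
import base64
import json
import requests

def encode_image(path: str) -> str:
    """Read a local image file and return it as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

payload = {
    "input_data": {
        "columns": ["image"],
        "index": [0, 1],
        "data": [
            [encode_image("image1.jpg")],  # placeholder local file
            [encode_image("image2.jpg")],  # placeholder local file
        ],
    }
}

scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
api_key = "<endpoint-key>"  # placeholder

response = requests.post(
    scoring_uri,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
    data=json.dumps(payload),
)
response.raise_for_status()
print(response.json())
```

A successful call returns a list of caption objects in the format shown in the sample output below.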
```json
[
  {
    "text": "a box of food sitting on top of a table"
  },
  {
    "text": "a stream in the middle of a forest"
  }
]
```
For the sample image below, the output text is "a stream in the middle of a forest".
Version: 6
Preview
license : bsd-3-clause
task : image-to-text
SharedComputeCapacityEnabled
huggingface_model_id : Salesforce/blip-image-captioning-base
author : Salesforce
hiddenlayerscanned
inference_compute_allow_list : ['Standard_DS2_v2', 'Standard_D2a_v4', 'Standard_D2as_v4', 'Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_F4s_v2', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
View in Studio: https://ml.azure.com/registries/azureml/models/Salesforce-BLIP-image-captioning-base/version/6
License: bsd-3-clause
SharedComputeCapacityEnabled: True
SHA: 89b09ea1789f7addf2f6d6f0dfc4ce10ab58ef84
inference-min-sku-spec: 2|0|7|14 (vCPUs | GPUs | memory in GB | storage in GB)
inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2