models Salesforce BLIP image captioning base - Azure/azureml-assets GitHub Wiki
BLIP (Bootstrapping Language-Image Pre-training) is a new VLP framework for unified vision-language understanding and generation that expands the scope of downstream tasks compared to existing methods. The framework encompasses two key contributions, from the model and data perspectives:
- BLIP incorporates the Multi-modal Mixture of Encoder-Decoder (MED), an innovative model architecture designed to facilitate effective multi-task pre-training and flexible transfer learning. This model is jointly pre-trained using three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modeling.
- BLIP introduces Captioning and Filtering (CapFilt), a distinctive dataset bootstrapping method aimed at learning from noisy image-text pairs. The pre-trained MED is fine-tuned into a captioner that generates synthetic captions from web images, and a filter that removes noisy captions from both the original web texts and synthetic texts.
The authors of BLIP make the following key observations based on extensive experiments and analysis. The collaboration between the captioner and the filter significantly enhances performance across diverse downstream tasks through caption bootstrapping, with greater diversity in captions leading to more substantial gains. BLIP achieves state-of-the-art performance in various vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog. It also achieves state-of-the-art zero-shot performance when directly applied to video-language tasks such as text-to-video retrieval and video question answering.
Researchers should carefully assess the safety and fairness of the model before deploying it in any real-world applications.
This model was fine-tuned on the COCO dataset with the language modeling (LM) loss to generate captions from images, and uses the base architecture (with a ViT-Base backbone). For more details on image captioning with BLIP, review Section 5.2 of the original paper.
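The underlying checkpoint is also published on the Hugging Face Hub as `Salesforce/blip-image-captioning-base` (see the model tags below). For reference, a minimal local captioning sketch with the `transformers` library, assuming `transformers`, `torch`, and `Pillow` are installed and using a placeholder image URL:

```python
# Minimal local captioning sketch with Hugging Face transformers.
# Assumes transformers, torch, and Pillow are installed; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Load any publicly accessible image.
image_url = "https://example.com/sample.jpg"  # placeholder URL
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Unconditional captioning: encode the image, generate caption tokens, decode to text.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```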
BSD 3-Clause License
Inference type | Python sample (Notebook) | CLI with YAML |
---|---|---|
Real time | image-to-text-online-endpoint.ipynb | image-to-text-online-endpoint.sh |
Batch | image-to-text-batch-endpoint.ipynb | image-to-text-batch-endpoint.sh |
```json
{
  "input_data": {
    "columns": ["image"],
    "index": [0, 1],
    "data": [
      ["image1"],
      ["image2"]
    ]
  }
}
```
Note: "image1" and "image2" should be publicly accessible URLs or strings in base64 format.
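As a rough, non-authoritative sketch of how such a payload could be assembled and sent to a deployed real-time endpoint, assuming local image files and placeholder values for the scoring URI and key (substitute the details of your own deployment):

```python
# Sketch: build the request payload with base64-encoded local images and call a
# deployed online endpoint. The file names, scoring URI, and key are placeholders.
import base64
import json
import requests

def encode_image(path: str) -> str:
    """Read a local image file and return it as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

payload = {
    "input_data": {
        "columns": ["image"],
        "index": [0, 1],
        "data": [
            [encode_image("image1.jpg")],  # placeholder local file
            [encode_image("image2.jpg")],  # placeholder local file
        ],
    }
}

scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
api_key = "<endpoint-key>"  # placeholder

response = requests.post(
    scoring_uri,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
    data=json.dumps(payload),
)
response.raise_for_status()
print(response.json())
```

A successful call returns a list of caption objects in the format shown in the sample output below.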
```json
[
  {
    "text": "a box of food sitting on top of a table"
  },
  {
    "text": "a stream in the middle of a forest"
  }
]
```
For the sample image below, the output text is "a stream in the middle of a forest".
Version: 6
Preview
license : bsd-3-clause
task : image-to-text
SharedComputeCapacityEnabled
huggingface_model_id : Salesforce/blip-image-captioning-base
author : Salesforce
hiddenlayerscanned
inference_compute_allow_list : ['Standard_DS2_v2', 'Standard_D2a_v4', 'Standard_D2as_v4', 'Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_F4s_v2', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
View in Studio: https://ml.azure.com/registries/azureml/models/Salesforce-BLIP-image-captioning-base/version/6
License: bsd-3-clause
SharedComputeCapacityEnabled: True
SHA: 89b09ea1789f7addf2f6d6f0dfc4ce10ab58ef84
inference-min-sku-spec: 2|0|7|14 (vCPUs | GPUs | memory in GB | storage in GB)
inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2