models Salesforce BLIP 2 opt 2 7b image to text - Azure/azureml-assets GitHub Wiki
The BLIP-2 model, which uses OPT-2.7b (a large language model with 2.7 billion parameters), was introduced in the paper "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". BLIP-2 is a generic and efficient pre-training strategy that bootstraps vision-language pre-training (VLP) from frozen pre-trained vision models and large language models (LLMs). This model was made available in this repository.
BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model. The authors initialize the weights of the image encoder and the large language model from pre-trained checkpoints and keep them frozen while training the Q-Former, a BERT-like Transformer encoder that maps a set of "query tokens" to query embeddings. These query embeddings bridge the gap between the embedding space of the image encoder and that of the large language model.
The model's objective is to predict the next text token given the query embeddings and the previous text. This allows the model to perform tasks such as image captioning, visual question answering (VQA), and chat-like conversation, using the image and the preceding chat as the prompt.
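As an illustration outside of Azure ML, below is a minimal captioning and VQA sketch using the Hugging Face transformers library. It assumes transformers, torch, and Pillow are installed; the image URL, prompt text, and generation lengths are arbitrary examples, not part of this asset.

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the BLIP-2 OPT-2.7b checkpoint from the Hugging Face Hub.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Example image (any RGB image works); the URL here is only an illustration.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Image captioning: pass the image only, with no text prompt.
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())

# Visual question answering: prepend a question as the text prompt.
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```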
BLIP2-OPT uses off-the-shelf OPT as its language model, so it shares the potential risks and limitations outlined in Meta's OPT model card, which notes:

> Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models.
BLIP-2 is fine-tuned on internet-collected image-text datasets, which raises concerns about generating inappropriate content or replicating inherent biases from the underlying data. The model has not been tested in real-world applications, and direct deployment is not advised. Researchers should carefully assess the model's safety and fairness in the specific deployment context before considering its use.
| Inference type | Python sample (Notebook) | CLI with YAML |
|---|---|---|
| Real time | image-to-text-online-endpoint.ipynb | image-to-text-online-endpoint.sh |
| Batch | image-to-text-batch-endpoint.ipynb | image-to-text-batch-endpoint.sh |
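The samples above cover deployment end to end. As a rough sketch, a real-time deployment of this registry model can also be created with the azure-ai-ml Python SDK; the subscription, resource group, workspace, endpoint name, and deployment name below are placeholders, and the instance type is one pick from the allowed compute list further down.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

# Connect to the target workspace (placeholders below).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# Reference version 6 of this model in the azureml registry.
model_id = (
    "azureml://registries/azureml/models/"
    "Salesforce-BLIP-2-opt-2-7b-image-to-text/versions/6"
)

# Create a managed online endpoint (example name).
endpoint = ManagedOnlineEndpoint(name="blip2-image-to-text", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy the model on a GPU SKU from the allowed compute list.
deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name=endpoint.name,
    model=model_id,
    instance_type="Standard_NC6s_v3",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```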
```json
{
  "input_data": {
    "columns": [
      "image"
    ],
    "index": [0, 1],
    "data": [
      ["image1"],
      ["image2"]
    ]
  }
}
```
Note:
- "image1" and "image2" should be publicly accessible URLs or strings in base64 format (see the request-building sketch below).
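For example, here is a sketch of building such a request with a base64-encoded local image and invoking a deployed online endpoint. The file names reuse hypothetical placeholders, and `ml_client`, the endpoint name, and the deployment name are taken from the deployment sketch above.

```python
import base64
import json

# Encode a local image as a base64 string; publicly accessible URLs
# can be passed as plain strings instead.
with open("sample_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Build the request payload in the input format shown above.
request = {
    "input_data": {
        "columns": ["image"],
        "index": [0],
        "data": [[image_b64]],
    }
}

with open("sample_request.json", "w") as f:
    json.dump(request, f)

# Invoke the deployed online endpoint (ml_client as constructed in the
# deployment sketch above).
response = ml_client.online_endpoints.invoke(
    endpoint_name="blip2-image-to-text",
    deployment_name="default",
    request_file="sample_request.json",
)
print(response)
```

The response is a JSON list with one generated caption per input row, as in the sample output below.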
```json
[
  {
    "text": "a stream running through a forest with rocks and trees"
  },
  {
    "text": "a grassy hillside with trees and a sunset"
  }
]
```
For example, the second sample image yields the output text "a grassy hillside with trees and a sunset".
Version: 6
Preview
license: mit
task: image-to-text
SharedComputeCapacityEnabled
huggingface_model_id: Salesforce/blip2-opt-2.7b
author: Salesforce
hiddenlayerscanned
inference_compute_allow_list: ['Standard_DS5_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_FX4mds', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
View in Studio: https://ml.azure.com/registries/azureml/models/Salesforce-BLIP-2-opt-2-7b-image-to-text/version/6
License: mit
SharedComputeCapacityEnabled: True
SHA: 6e723d92ee91ebcee4ba74d7017632f11ff4217b
inference-min-sku-spec: 4|0|32|64
inference-recommended-sku: Standard_DS5_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_FX4mds, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
model_id: Salesforce/blip2-opt-2.7b