models Salesforce BLIP 2 opt 2 7b image to text - Azure/azureml-assets GitHub Wiki
The BLIP-2 model, which uses OPT-2.7b (a large language model with 2.7 billion parameters), was introduced in the paper "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". BLIP-2 is a generic and efficient pre-training strategy that bootstraps vision-language pre-training (VLP) from frozen pre-trained vision models and large language models (LLMs). This model was made available in this repository.
BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model. The authors initialize the weights of the image encoder and the large language model from pre-trained checkpoints and keep them frozen while training the Q-Former, a BERT-like Transformer encoder that maps a set of "query tokens" to query embeddings. These query embeddings bridge the gap between the embedding space of the image encoder and that of the large language model.
The model's objective is to predict the next text token given the query embeddings and the previous text. This allows the model to perform tasks such as image captioning, visual question answering (VQA), and chat-like conversation, using the image and the preceding chat as the prompt.
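As an illustration outside of Azure ML, below is a minimal captioning and VQA sketch using the Hugging Face transformers library. It assumes transformers, torch, and Pillow are installed; the image URL, prompt text, and generation lengths are arbitrary examples, not part of this asset.

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the BLIP-2 OPT-2.7b checkpoint from the Hugging Face Hub.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Example image (any RGB image works); the URL here is only an illustration.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Image captioning: pass the image only, with no text prompt.
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())

# Visual question answering: prepend a question as the text prompt.
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```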
BLIP2-OPT uses off-the-shelf OPT as its language model, so it shares the potential risks and limitations outlined in Meta's OPT model card, which notes:

> Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models.
BLIP-2 is fine-tuned on internet-collected image-text datasets, which raises concerns about generating inappropriate content or replicating inherent biases from the underlying data. The model has not been tested in real-world applications, and direct deployment is not advised. Researchers should carefully assess the model's safety and fairness in the specific deployment context before considering its use.
| Inference type | Python sample (Notebook) | CLI with YAML |
|---|---|---|
| Real time | image-to-text-online-endpoint.ipynb | image-to-text-online-endpoint.sh |
| Batch | image-to-text-batch-endpoint.ipynb | image-to-text-batch-endpoint.sh |
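The samples above cover deployment end to end. As a rough sketch, a real-time deployment of this registry model can also be created with the azure-ai-ml Python SDK; the subscription, resource group, workspace, endpoint name, and deployment name below are placeholders, and the instance type is one pick from the allowed compute list further down.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

# Connect to the target workspace (placeholders below).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# Reference version 6 of this model in the azureml registry.
model_id = (
    "azureml://registries/azureml/models/"
    "Salesforce-BLIP-2-opt-2-7b-image-to-text/versions/6"
)

# Create a managed online endpoint (example name).
endpoint = ManagedOnlineEndpoint(name="blip2-image-to-text", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy the model on a GPU SKU from the allowed compute list.
deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name=endpoint.name,
    model=model_id,
    instance_type="Standard_NC6s_v3",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```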
```json
{
  "input_data": {
    "columns": [
      "image"
    ],
    "index": [0, 1],
    "data": [
      ["image1"],
      ["image2"]
    ]
  }
}
```
Note:
- "image1" and "image2" should be publicly accessible URLs or strings in base64 format (see the request-building sketch below).
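For example, here is a sketch of building such a request with a base64-encoded local image and invoking a deployed online endpoint. The file names reuse hypothetical placeholders, and `ml_client`, the endpoint name, and the deployment name are taken from the deployment sketch above.

```python
import base64
import json

# Encode a local image as a base64 string; publicly accessible URLs
# can be passed as plain strings instead.
with open("sample_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Build the request payload in the input format shown above.
request = {
    "input_data": {
        "columns": ["image"],
        "index": [0],
        "data": [[image_b64]],
    }
}

with open("sample_request.json", "w") as f:
    json.dump(request, f)

# Invoke the deployed online endpoint (ml_client as constructed in the
# deployment sketch above).
response = ml_client.online_endpoints.invoke(
    endpoint_name="blip2-image-to-text",
    deployment_name="default",
    request_file="sample_request.json",
)
print(response)
```

The response is a JSON list with one generated caption per input row, as in the sample output below.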
```json
[
  {
    "text": "a stream running through a forest with rocks and trees"
  },
  {
    "text": "a grassy hillside with trees and a sunset"
  }
]
```
For example, the second sample image yields the output text "a grassy hillside with trees and a sunset".
Version: 6
Preview
license: mit
task: image-to-text
SharedComputeCapacityEnabled
huggingface_model_id: Salesforce/blip2-opt-2.7b
author: Salesforce
hiddenlayerscanned
inference_compute_allow_list: ['Standard_DS5_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_FX4mds', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
View in Studio: https://ml.azure.com/registries/azureml/models/Salesforce-BLIP-2-opt-2-7b-image-to-text/version/6
License: mit
SharedComputeCapacityEnabled: True
SHA: 6e723d92ee91ebcee4ba74d7017632f11ff4217b
inference-min-sku-spec: 4|0|32|64
inference-recommended-sku: Standard_DS5_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_FX4mds, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
model_id: Salesforce/blip2-opt-2.7b