models tiiuae falcon 40b - Azure/azureml-assets GitHub Wiki
Falcon-40B is a large language model (LLM) developed by the Technology Innovation Institute (TII) with 40 billion parameters. It is a causal decoder-only model trained on 1 trillion tokens from the RefinedWeb dataset, enhanced with curated corpora. Falcon-40B supports English, German, Spanish, and French languages, with limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. It is available under the Apache 2.0 license.
According to its model card, Falcon-40B was the best open-source model available at the time of its release. Its architecture is optimized for inference, using FlashAttention and multiquery attention. Fine-tuning is nevertheless recommended for specific use cases.
Falcon-40B was trained on 384 A100 40GB GPUs over two months. The model carries biases and stereotypes commonly encountered online, so production use requires appropriate precautions such as task-specific fine-tuning and guardrails. Technical specifications, training details, and pointers to evaluation results are provided below.
The above summary was generated using ChatGPT. Review the original model card to understand the data used to train the model, evaluation metrics, license, intended uses, limitations and bias before using the model.
Falcon-40B was trained on 1,000B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. Significant components from our curated corpora were inspired by The Pile (Gao et al., 2020).
Data source | Fraction | Tokens | Sources |
---|---|---|---|
RefinedWeb-English | 75% | 750B | massive web crawl |
RefinedWeb-Europe | 7% | 70B | European massive web crawl |
Books | 6% | 60B | |
Conversations | 5% | 50B | Reddit, StackOverflow, HackerNews |
Code | 5% | 50B | |
Technical | 2% | 20B | arXiv, PubMed, USPTO, etc. |
RefinedWeb-Europe is made of the following languages:
Language | Fraction of multilingual data | Tokens |
---|---|---|
German | 26% | 18B |
Spanish | 24% | 17B |
French | 23% | 16B |
Italian | 7% | 5B |
Portuguese | 4% | 3B |
Polish | 4% | 3B |
Dutch | 4% | 3B |
Romanian | 3% | 2B |
Czech | 3% | 2B |
Swedish | 2% | 1B |
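The two tables above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original model card: it recomputes the token counts from the stated fractions, assuming the counts in the tables are rounded to the nearest billion. The names `split_tokens`, `DATA_MIX_PCT`, and `EUROPE_PCT` are illustrative.

```python
TOTAL_TOKENS_B = 1000  # 1,000B training tokens, as stated above

# Top-level data mix, in percent of all training tokens.
DATA_MIX_PCT = {
    "RefinedWeb-English": 75,
    "RefinedWeb-Europe": 7,
    "Books": 6,
    "Conversations": 5,
    "Code": 5,
    "Technical": 2,
}

# Language split of RefinedWeb-Europe, in percent of the multilingual data.
EUROPE_PCT = {
    "German": 26, "Spanish": 24, "French": 23, "Italian": 7,
    "Portuguese": 4, "Polish": 4, "Dutch": 4,
    "Romanian": 3, "Czech": 3, "Swedish": 2,
}


def split_tokens(total_b, shares_pct):
    """Return billions of tokens per entry, rounded to the nearest billion."""
    return {name: round(total_b * pct / 100) for name, pct in shares_pct.items()}


mix_tokens = split_tokens(TOTAL_TOKENS_B, DATA_MIX_PCT)
europe_tokens = split_tokens(mix_tokens["RefinedWeb-Europe"], EUROPE_PCT)

for language, tokens_b in europe_tokens.items():
    print(f"{language}: {tokens_b}B")
```

Both sets of fractions sum to 100%, and the rounded per-language counts add up to the 70B tokens attributed to RefinedWeb-Europe.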
The data was tokenized with the Falcon-7B/40B tokenizer.
Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.
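As a quick sanity check, the three parallelism degrees multiply out to the GPU count. The sketch below illustrates this; the rank layout in `rank_to_coords` (tensor-parallel index varying fastest, then pipeline, then data) is a hypothetical convention chosen for illustration and is not taken from the Falcon training code.

```python
TENSOR_PARALLEL = 8    # TP: each layer is sharded across 8 GPUs
PIPELINE_PARALLEL = 4  # PP: the layer stack is split into 4 stages
DATA_PARALLEL = 12     # DP: 12 replicas of the TP x PP grid


def world_size(tp, pp, dp):
    """Total number of GPUs needed for a TP x PP x DP grid."""
    return tp * pp * dp


def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_rank, pp_rank, tp_rank).

    Assumed layout: tensor-parallel index varies fastest.
    """
    if not 0 <= rank < world_size(tp, pp, dp):
        raise ValueError(f"rank {rank} is outside the grid")
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank


print(world_size(TENSOR_PARALLEL, PIPELINE_PARALLEL, DATA_PARALLEL))
print(rank_to_coords(383, TENSOR_PARALLEL, PIPELINE_PARALLEL, DATA_PARALLEL))
```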
Hyperparameter | Value | Comment |
---|---|---|
Precision | bfloat16 | |
Optimizer | AdamW | |
Learning rate | 1.85e-4 | 4B tokens warm-up, cosine decay to 1.85e-5 |
Weight decay | 1e-1 | |
Z-loss | 1e-4 | |
Batch size | 1152 | 100B tokens ramp-up |
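The learning-rate row can be read as a schedule over tokens seen. The sketch below is one plausible interpretation, not the actual training code: it assumes a linear warm-up from zero over the first 4B tokens and a cosine decay that reaches 1.85e-5 at the end of the 1,000B-token run. Both the warm-up shape and the decay horizon are assumptions.

```python
import math

PEAK_LR = 1.85e-4
FINAL_LR = 1.85e-5
WARMUP_TOKENS = 4e9
TOTAL_TOKENS = 1e12  # assumed decay horizon: the full 1,000B-token run


def learning_rate(tokens_seen):
    """Learning rate after `tokens_seen` training tokens."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumed linear warm-up from 0 to the peak value.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    progress = min(1.0, progress)
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


for tokens in (0, 2e9, 4e9, 5e11, 1e12):
    print(f"{tokens:.0e} tokens -> lr {learning_rate(tokens):.3e}")
```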
Training started in December 2022 and took two months.
Paper coming soon.
See the OpenLLM Leaderboard for early results.
Falcon-40B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020), with the following differences:
- Positional embeddings: rotary (Su et al., 2021);
- Attention: multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022);
- Decoder-block: parallel attention/MLP with two layer norms.
For multiquery, we are using an internal variant which uses independent keys and values per tensor parallel degree.
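To illustrate the rotary embeddings mentioned above, here is a minimal pure-Python sketch. It follows the interleaved-pair formulation from Su et al. (2021) with the commonly used base of 10000; Falcon's actual kernels may lay out the pairs differently, so treat this as an illustration of the idea rather than the model's implementation. The helper names `rotary` and `dot` are illustrative.

```python
import math


def rotary(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("rotary embeddings need an even dimension")
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; lower pairs rotate faster.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]

# The attention score depends only on the relative offset (here 2),
# not on the absolute positions.
score_near = dot(rotary(query, 5), rotary(key, 3))
score_far = dot(rotary(query, 105), rotary(key, 103))
print(abs(score_near - score_far) < 1e-9)
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and a rotated key depends only on their relative distance, which is the property that makes rotary embeddings attractive for attention.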
Hyperparameter | Value | Comment |
---|---|---|
Layers | 60 | |
d_model | 8192 | |
head_dim | 64 | Reduced to optimise for FlashAttention |
Vocabulary | 65024 | |
Sequence length | 2048 | |
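The values in this table make it possible to estimate why sharing keys and values across heads helps at inference time. The sketch below is a back-of-the-envelope calculation under several assumptions that are not stated in the model card: the number of query heads is taken as `d_model / head_dim`, the key/value cache is stored in bfloat16 (2 bytes per value), and the "one key/value set per tensor parallel degree" variant is modelled as 8 key/value heads (matching TP=8).

```python
D_MODEL = 8192
HEAD_DIM = 64
N_LAYERS = 60
SEQ_LEN = 2048
BYTES_PER_VALUE = 2  # assumption: cache stored in bfloat16

# Assumption: query heads fill the full model width.
N_QUERY_HEADS = D_MODEL // HEAD_DIM


def kv_cache_bytes(n_kv_heads, seq_len=SEQ_LEN):
    """Size of the key/value cache for one sequence of `seq_len` tokens."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * N_LAYERS * n_kv_heads * HEAD_DIM
            * seq_len * BYTES_PER_VALUE)


full_multi_head = kv_cache_bytes(N_QUERY_HEADS)  # one K/V head per query head
grouped = kv_cache_bytes(8)                      # assumed: one K/V head per TP rank
single = kv_cache_bytes(1)                       # classic multiquery

for label, size in [("multi-head", full_multi_head),
                    ("8 K/V heads", grouped),
                    ("1 K/V head", single)]:
    print(f"{label}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumptions, a full multi-head cache would take 3840 MiB per 2048-token sequence, against 240 MiB with 8 key/value heads, a 16x reduction.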
Falcon-40B was trained on AWS SageMaker, on 384 A100 40GB GPUs in P4d instances.
Falcon-40B was trained using a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.).
Falcon-40B is made available under the Apache 2.0 license.
Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
---|---|---|---|---|
Text Classification | Emotion Detection | Emotion | emotion-detection.ipynb | emotion-detection.sh |
Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
---|---|---|---|---|
Text generation | Text generation | cnn_dailymail | evaluate-model-text-generation.ipynb | evaluate-model-text-generation.yml |
Inference type | Python sample (Notebook) | CLI with YAML |
---|---|---|
Real time | text-generation-online-endpoint.ipynb | text-generation-online-endpoint.sh |
Batch | text-generation-batch-endpoint.ipynb | coming soon |
Sample input:
{
    "input_data": {
        "input_string": ["The meaning of the life is"]
    }
}
Sample output:
[
    {
        "0": "The meaning of the life is to find your gift. The purpose of life is to give it away"
    }
]
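The request and response shapes above can be handled with the standard `json` module. This is a minimal sketch that only builds and parses the payloads; it deliberately does not call an endpoint, since the scoring URL and authentication depend on your deployment. The helper names `build_request` and `extract_generations` are hypothetical, and the handling of responses with several prompts is an assumption based on the single-prompt sample.

```python
import json


def build_request(prompts):
    """Serialize a list of prompts into the request body shown above."""
    return json.dumps({"input_data": {"input_string": list(prompts)}})


def extract_generations(response_body):
    """Return generated texts from a response shaped like the sample output.

    Assumption: the response is a list of objects whose keys are
    stringified integer indices ("0", "1", ...).
    """
    records = json.loads(response_body)
    texts = []
    for record in records:
        for _, text in sorted(record.items(), key=lambda item: int(item[0])):
            texts.append(text)
    return texts


body = build_request(["The meaning of the life is"])
print(body)

sample_response = json.dumps([
    {"0": "The meaning of the life is to find your gift. "
          "The purpose of life is to give it away"}
])
print(extract_generations(sample_response))
```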
Version: 10
Featured
license : apache-2.0
SharedComputeCapacityEnabled
task : text-generation
author : tiiuae
hiddenlayerscanned
huggingface_model_id : tiiuae/falcon-40b
evaluation_compute_allow_list : ['Standard_NC24s_v3', 'Standard_NC24rs_v3', 'Standard_ND40rs_v2', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4']
inference_compute_allow_list : ['Standard_ND40rs_v2', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4']
finetune_compute_allow_list : ['Standard_ND40rs_v2', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4']
model_specific_defaults : ordereddict({'apply_lora': 'true', 'precision': '4'})
inference_supported_envs : ['vllm']
View in Studio: https://ml.azure.com/registries/azureml/models/tiiuae-falcon-40b/version/10
License: apache-2.0
SharedComputeCapacityEnabled: True
SHA: 3d7c5902f1dc9da830979a826cd96114b3ba4ec1
datasets: tiiuae/falcon-refinedweb
languages: en, de, es, fr
evaluation-min-sku-spec: 24|4|448|2900
evaluation-recommended-sku: Standard_NC24s_v3, Standard_NC24rs_v3, Standard_ND40rs_v2, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4
finetune-min-sku-spec: 40|8|672|2900
finetune-recommended-sku: Standard_ND40rs_v2, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4
finetuning-tasks: text-classification
inference-min-sku-spec: 40|8|672|2900
inference-recommended-sku: Standard_ND40rs_v2, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4
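The `*-min-sku-spec` values above are pipe-separated numbers. The sketch below parses them under the assumption that the four fields are CPU cores, GPU count, memory in GB, and storage in GB. That reading is consistent with the listed SKUs (for example, 24 cores and 4 GPUs for `Standard_NC24s_v3`) but is not documented on this page. The `SkuSpec` type and `parse_sku_spec` function are hypothetical names.

```python
from typing import NamedTuple


class SkuSpec(NamedTuple):
    # Field meanings are an assumption, not documented on this page.
    cpu_cores: int
    gpus: int
    memory_gb: int
    storage_gb: int


def parse_sku_spec(spec):
    """Parse a spec string such as '40|8|672|2900' into a SkuSpec."""
    parts = spec.split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 pipe-separated fields, got {len(parts)}")
    return SkuSpec(*(int(part) for part in parts))


evaluation_min = parse_sku_spec("24|4|448|2900")
inference_min = parse_sku_spec("40|8|672|2900")

print(evaluation_min)
print(inference_min.gpus >= evaluation_min.gpus)
```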