models tiiuae falcon 40b - Azure/azureml-assets GitHub Wiki

tiiuae-falcon-40b

Overview

Description

Falcon-40B is a large language model (LLM) developed by the Technology Innovation Institute (TII) with 40 billion parameters. It is a causal decoder-only model trained on 1 trillion tokens from the RefinedWeb dataset, enhanced with curated corpora. Falcon-40B supports English, German, Spanish, and French languages, with limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. It is available under the Apache 2.0 license.

At the time of its release, TII described Falcon-40B as the best open-source model available, pointing to the OpenLLM Leaderboard for results. Its architecture is optimized for inference, using FlashAttention and multiquery attention. Fine-tuning is still recommended for specific use cases.

Falcon-40B was trained on 384 A100 40GB GPUs over two months. The model carries the biases and stereotypes commonly encountered online, so production use requires appropriate precautions, such as task-specific fine-tuning and guardrails. Technical specifications, training details, and evaluation references are provided in the sections below.

The above summary was generated using ChatGPT. Review the original model card to understand the data used to train the model, evaluation metrics, license, intended uses, limitations and bias before using the model.

Training Details

Training Data

Falcon-40B was trained on 1,000B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. Significant components from our curated corpora were inspired by The Pile (Gao et al., 2020).

| Data source | Fraction | Tokens | Sources |
|---|---|---|---|
| RefinedWeb-English | 75% | 750B | massive web crawl |
| RefinedWeb-Europe | 7% | 70B | European massive web crawl |
| Books | 6% | 60B | |
| Conversations | 5% | 50B | Reddit, StackOverflow, HackerNews |
| Code | 5% | 50B | |
| Technical | 2% | 20B | arXiv, PubMed, USPTO, etc. |
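
As a quick consistency check on the table above, the following sketch recomputes each token budget from its percentage of the 1,000B-token total. The dictionary and function names are illustrative and not part of any Falcon or Azure ML tooling.

```python
# Data mix from the table above: source -> percentage of the training tokens.
DATA_MIX_PERCENT = {
    "RefinedWeb-English": 75,
    "RefinedWeb-Europe": 7,
    "Books": 6,
    "Conversations": 5,
    "Code": 5,
    "Technical": 2,
}
TOTAL_TOKENS_B = 1_000  # total training tokens, in billions


def tokens_per_source(mix_percent, total_b):
    """Return the token budget (in billions) implied by each percentage."""
    return {name: total_b * pct // 100 for name, pct in mix_percent.items()}


budget = tokens_per_source(DATA_MIX_PERCENT, TOTAL_TOKENS_B)
for name, tokens_b in budget.items():
    print(f"{name:<20} {tokens_b:>4}B")
```

The percentages sum to 100, so the per-source budgets add up to the full 1,000B tokens.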

RefinedWeb-Europe is made of the following languages:

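| Language | Fraction of multilingual data | Tokens |
|---|---|---|
| German | 26% | 18B |
| Spanish | 24% | 17B |
| French | 23% | 16B |
| Italian | 7% | 5B |
| Portuguese | 4% | 3B |
| Polish | 4% | 3B |
| Dutch | 4% | 3B |
| Romanian | 3% | 2B |
| Czech | 3% | 2B |
| Swedish | 2% | 1B |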
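
The per-language token counts appear to be the listed fractions of the 70B-token RefinedWeb-Europe budget, rounded to the nearest billion. A minimal sketch of that arithmetic, with illustrative names and the rounding rule as an assumption:

```python
EUROPE_TOKENS_B = 70  # RefinedWeb-Europe budget, in billions of tokens

# Language -> percentage of the multilingual data, from the table above.
LANGUAGE_PERCENT = {
    "German": 26, "Spanish": 24, "French": 23, "Italian": 7,
    "Portuguese": 4, "Polish": 4, "Dutch": 4,
    "Romanian": 3, "Czech": 3, "Swedish": 2,
}


def language_tokens(percentages, total_b):
    """Tokens per language in billions, rounded to the nearest billion (assumed)."""
    return {lang: round(total_b * pct / 100) for lang, pct in percentages.items()}


per_language = language_tokens(LANGUAGE_PERCENT, EUROPE_TOKENS_B)
for lang, tokens_b in per_language.items():
    print(f"{lang:<12} {tokens_b:>3}B")
```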

The data was tokenized with the Falcon-7B/40B tokenizer.

Training Procedure

Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.
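
The three parallelism degrees multiply to the GPU count: 8 x 4 x 12 = 384. The sketch below shows that product and one possible way to map a flat GPU rank to its (data, pipeline, tensor) coordinates. The rank layout, with tensor parallelism innermost, is an assumption chosen for illustration; the source does not describe how Gigatron assigns ranks.

```python
TP, PP, DP = 8, 4, 12  # tensor-, pipeline-, and data-parallel degrees


def world_size(tp=TP, pp=PP, dp=DP):
    """Total number of GPUs needed for a tp x pp x dp grid."""
    return tp * pp * dp


def rank_to_coords(rank, tp=TP, pp=PP, dp=DP):
    """Map a flat rank to (dp_idx, pp_idx, tp_idx).

    Assumed layout: tensor-parallel index varies fastest, then pipeline
    stage, then data-parallel replica.
    """
    if not 0 <= rank < tp * pp * dp:
        raise ValueError(f"rank {rank} is outside the {tp * pp * dp}-GPU grid")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


print(world_size())         # 384
print(rank_to_coords(383))  # (11, 3, 7)
```

Keeping the tensor-parallel index innermost places each group of 8 tensor-parallel ranks on consecutive GPUs, which is a common choice because tensor parallelism is the most communication-intensive dimension.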

Training Hyperparameters

| Hyperparameter | Value | Comment |
|---|---|---|
| Precision | bfloat16 | |
| Optimizer | AdamW | |
| Learning rate | 1.85e-4 | 4B tokens warm-up, cosine decay to 1.85e-5 |
| Weight decay | 1e-1 | |
| Z-loss | 1e-4 | |
| Batch size | 1152 | 100B tokens ramp-up |
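
The learning-rate row describes a warm-up followed by a cosine decay. A minimal sketch of such a schedule, expressed in tokens seen, might look like the following. The linear shape of the warm-up and the assumption that the decay spans the rest of the 1,000B-token run are not stated in the source.

```python
import math

PEAK_LR = 1.85e-4
MIN_LR = 1.85e-5
WARMUP_TOKENS = 4e9
TOTAL_TOKENS = 1_000e9  # assumption: the decay ends at the 1,000B-token mark


def learning_rate(tokens_seen):
    """Learning rate after `tokens_seen` training tokens."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumed linear warm-up from 0 to the peak learning rate.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining tokens.
    progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for tokens in (0, 2e9, 4e9, 502e9, 1_000e9):
    print(f"{tokens / 1e9:>6.0f}B tokens -> lr = {learning_rate(tokens):.3e}")
```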

Speeds, Sizes, Times

Training started in December 2022 and took two months.

Evaluation

Paper coming soon.

See the OpenLLM Leaderboard for early results.

Technical Specifications

Model Architecture and Objective

Falcon-40B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

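The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020), with the following differences: rotary positional embeddings, multiquery attention with FlashAttention, and a decoder block that runs attention and the MLP in parallel with two layer norms.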

For multiquery, we are using an internal variant which uses independent keys and values per tensor parallel degree.

| Hyperparameter | Value | Comment |
|---|---|---|
| Layers | 60 | |
| d_model | 8192 | |
| head_dim | 64 | Reduced to optimise for FlashAttention |
| Vocabulary | 65024 | |
| Sequence length | 2048 | |
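
To illustrate why multiquery attention helps at inference time, the sketch below derives the number of query heads from the table and compares key/value cache sizes for a full 2048-token sequence. It assumes 8 key/value groups (one per tensor-parallel degree, following TP=8 from the training setup) and a bfloat16 cache; both are assumptions for illustration, not values stated in this section.

```python
N_LAYERS = 60
D_MODEL = 8192
HEAD_DIM = 64
SEQ_LEN = 2048
BYTES_PER_VALUE = 2  # bfloat16 (assumed cache precision)

N_QUERY_HEADS = D_MODEL // HEAD_DIM  # 8192 / 64 = 128 query heads
N_KV_GROUPS = 8  # assumption: one key/value pair per tensor-parallel degree


def kv_cache_bytes(n_kv_heads, seq_len=SEQ_LEN):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers the two cached tensors (K and V) in every layer.
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * seq_len * BYTES_PER_VALUE


multi_head = kv_cache_bytes(N_QUERY_HEADS)  # one K/V pair per query head
multi_query = kv_cache_bytes(N_KV_GROUPS)   # shared K/V pairs

print(f"query heads:       {N_QUERY_HEADS}")
print(f"multi-head cache:  {multi_head / 2**30:.2f} GiB")
print(f"multiquery cache:  {multi_query / 2**20:.0f} MiB")
print(f"reduction factor:  {multi_head // multi_query}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value groups, which leaves more GPU memory for larger batches.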

Compute Infrastructure

Hardware

Falcon-40B was trained on AWS SageMaker, on 384 A100 40GB GPUs in P4d instances.

Software

Falcon-40B was trained using a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.).

License

Falcon-40B is made available under the Apache 2.0 license.

Finetuning samples

| Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
|---|---|---|---|---|
| Text Classification | Emotion Detection | Emotion | emotion-detection.ipynb | emotion-detection.sh |

Model Evaluation Sample

| Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
|---|---|---|---|---|
| Text generation | Text generation | cnn_dailymail | evaluate-model-text-generation.ipynb | evaluate-model-text-generation.yml |

Inference samples

| Inference type | Python sample (Notebook) | CLI with YAML |
|---|---|---|
| Real time | text-generation-online-endpoint.ipynb | text-generation-online-endpoint.sh |
| Batch | text-generation-batch-endpoint.ipynb | coming soon |

Sample input (for real-time inference)

{
  "input_data": {
      "input_string":["The meaning of the life is"]
  }
}

Sample output

[
  {
    "0": "The meaning of the life is to find your gift. The purpose of life is to give it away"
  }
]
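
The request and response shapes above can be built and parsed with the Python standard library alone. The following is a minimal sketch: `endpoint_url` and `api_key` are placeholders for values from your own deployment, the helper names are hypothetical, and the bearer-token header assumes key-based authentication on the endpoint.

```python
import json
import urllib.request


def build_payload(prompts):
    """Wrap prompts in the request shape shown in the sample input."""
    return {"input_data": {"input_string": list(prompts)}}


def parse_response(body):
    """Extract generated strings from a body shaped like the sample output."""
    results = json.loads(body)
    texts = []
    for item in results:
        # Keys are assumed to be stringified indices such as "0", "1", ...
        for _, text in sorted(item.items(), key=lambda kv: int(kv[0])):
            texts.append(text)
    return texts


def score(endpoint_url, api_key, prompts):
    """POST prompts to a deployed endpoint (placeholder URL and key)."""
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(build_payload(prompts)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_response(response.read().decode("utf-8"))


print(json.dumps(build_payload(["The meaning of the life is"]), indent=2))
```

Calling `score(...)` requires a live deployment; `build_payload` and `parse_response` can be exercised offline against the sample input and output.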

Version: 10

Tags

Featured

license : apache-2.0

SharedComputeCapacityEnabled

task : text-generation

author : tiiuae

hiddenlayerscanned

huggingface_model_id : tiiuae/falcon-40b

evaluation_compute_allow_list : ['Standard_NC24s_v3', 'Standard_NC24rs_v3', 'Standard_ND40rs_v2', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4']

inference_compute_allow_list : ['Standard_ND40rs_v2', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4']

finetune_compute_allow_list : ['Standard_ND40rs_v2', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4']

model_specific_defaults : ordereddict({'apply_lora': 'true', 'precision': '4'})

inference_supported_envs : ['vllm']

View in Studio: https://ml.azure.com/registries/azureml/models/tiiuae-falcon-40b/version/10

License: apache-2.0

Properties

SharedComputeCapacityEnabled: True

SHA: 3d7c5902f1dc9da830979a826cd96114b3ba4ec1

datasets: tiiuae/falcon-refinedweb

languages: en, de, es, fr

evaluation-min-sku-spec: 24|4|448|2900

evaluation-recommended-sku: Standard_NC24s_v3, Standard_NC24rs_v3, Standard_ND40rs_v2, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4

finetune-min-sku-spec: 40|8|672|2900

finetune-recommended-sku: Standard_ND40rs_v2, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4

finetuning-tasks: text-classification

inference-min-sku-spec: 40|8|672|2900

inference-recommended-sku: Standard_ND40rs_v2, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4
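
The `*-min-sku-spec` properties pack four numbers into one pipe-separated string. The sketch below parses them under the assumption that the fields are, in order, CPU cores, GPUs, memory in GB, and storage in GB. That field order is inferred from the values matching the listed SKUs and is not stated on this page; the type and function names are illustrative.

```python
from typing import NamedTuple


class SkuSpec(NamedTuple):
    # Field order is an assumption: cores | GPUs | memory (GB) | storage (GB).
    cpus: int
    gpus: int
    memory_gb: int
    storage_gb: int


def parse_sku_spec(spec):
    """Parse a string such as '40|8|672|2900' into a SkuSpec."""
    parts = spec.split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 pipe-separated fields, got {len(parts)}")
    return SkuSpec(*(int(part) for part in parts))


def meets_minimum(candidate, minimum):
    """True if every resource in `candidate` is at least the minimum."""
    return all(have >= need for have, need in zip(candidate, minimum))


evaluation_min = parse_sku_spec("24|4|448|2900")
inference_min = parse_sku_spec("40|8|672|2900")

print(inference_min)
print(meets_minimum(inference_min, evaluation_min))  # True
print(meets_minimum(evaluation_min, inference_min))  # False
```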
