Falcon-40B

Falcon-40B: The Most Powerful Open-source Model


Falcon-40B is a large language model (LLM) in the Falcon LLM family, with 40 billion parameters trained on 1,000B tokens of web data and curated corpora. It was developed by the Technology Innovation Institute (TII) in Abu Dhabi and open-sourced under the Apache 2.0 license. Falcon-40B features an architecture optimized for inference, with FlashAttention and multiquery attention. It outperforms other open-source LLMs such as LLaMA, StableLM, RedPajama, and MPT.

Falcon-40B Features

Best open-source model

Falcon-40B outperforms other open-source models such as LLaMA, StableLM, RedPajama, and MPT, and is among the top-ranked models on the Hugging Face Open LLM Leaderboard.

Optimized architecture

Falcon-40B uses FlashAttention and multiquery attention to improve inference speed and efficiency.

Permissive license

Falcon-40B is released under the Apache 2.0 license, which allows commercial use without royalties or restrictions.

Multilingual capabilities

Falcon-40B supports English, German, Spanish, and French, with limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish.

Data quality at scale

Falcon-40B is trained on data from a pipeline that extracts high-quality content from the web using extensive filtering and deduplication.
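
The exact pipeline is described in the RefinedWeb work; as a rough, hypothetical illustration of the deduplication step only, one could hash normalized documents and drop exact repeats (the real pipeline also applies fuzzy deduplication and quality filtering):

import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies hash identically.
    return " ".join(text.lower().split())

def deduplicate(documents):
    # Keep only the first occurrence of each normalized document (exact-match dedup).
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["Hello   World", "hello world", "Falcon-40B was trained on RefinedWeb."]
print(list(deduplicate(docs)))  # the near-duplicate "hello world" is dropped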

Get Started with Falcon-40B

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b"

# Load the tokenizer and build a text-generation pipeline that downloads the weights
# from the Hugging Face Hub and spreads them across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory use
    trust_remote_code=True,       # Falcon ships custom modeling code on the Hub
    device_map="auto",
)

# Generate one completion for the prompt; sampling with top_k adds some variety.
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Training Details

Falcon-40B was trained on 1,000B tokens drawn mostly from RefinedWeb, a web dataset built with high-quality filtering and deduplication. TII also added curated corpora to enhance the dataset, some of which were based on The Pile (Gao et al., 2020).

Data source          Fraction   Tokens   Sources
RefinedWeb-English   75%        750B     massive web crawl
RefinedWeb-Europe    7%         70B      European massive web crawl
Books                6%         60B
Conversations        5%         50B      Reddit, StackOverflow, HackerNews
Code                 5%         50B
Technical            2%         20B      arXiv, PubMed, USPTO, etc.

RefinedWeb-Europe is made of the following languages:

Language     Fraction of multilingual data   Tokens
German       26%                             18B
Spanish      24%                             17B
French       23%                             16B
Italian      7%                              5B
Portuguese   4%                              3B
Polish       4%                              3B
Dutch        4%                              3B
Romanian     3%                              2B
Czech        3%                              2B
Swedish      2%                              1B

The data was tokenized with the Falcon-7B/40B tokenizer.
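
As a small illustrative check (not part of the training pipeline), you can load that tokenizer from the Hugging Face Hub and inspect how it splits text:

from transformers import AutoTokenizer

# Falcon-7B and Falcon-40B share the same tokenizer, with a 65,024-token vocabulary.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

text = "Falcon-40B was trained on 1,000B tokens of RefinedWeb data."
ids = tokenizer(text)["input_ids"]
print(len(ids), "tokens:", tokenizer.convert_ids_to_tokens(ids))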

Training Procedure

Using 3D parallelism (TP=8, PP=4, DP=12) and ZeRO, Falcon-40B was trained on 384 A100 40GB GPUs.
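
The three parallelism degrees multiply to the GPU count, as a quick sanity check shows:

TP, PP, DP = 8, 4, 12    # tensor, pipeline, and data parallel degrees
print(TP * PP * DP)      # 384, matching the 384 A100 40GB GPUs used for training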

Training Hyperparameters

Hyperparameter   Value      Comment
Precision        bfloat16
Optimizer        AdamW
Learning rate    1.85e-4    4B tokens warm-up, cosine decay to 1.85e-5
Weight decay     1e-1
Z-loss           1e-4
Batch size       1152       100B tokens ramp-up
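
As a minimal PyTorch sketch of such a schedule (warm-up to the peak learning rate, then cosine decay down to 1.85e-5), assuming placeholder step counts since TII's training code is not public:

import math
import torch

peak_lr, final_lr = 1.85e-4, 1.85e-5
warmup_steps, total_steps = 1_000, 100_000   # illustrative values, not TII's actual step counts

model = torch.nn.Linear(16, 16)              # stand-in module
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=1e-1)

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)   # linear warm-up to the peak learning rate
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    floor = final_lr / peak_lr               # decay to 1.85e-5 rather than to zero
    return floor + (1.0 - floor) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call optimizer.step() followed by scheduler.step() once per training step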

How long did it take to train Falcon-40B?

Training started in December 2022 and took two months.

Technical Specifications

Falcon-40B is a decoder-only model that learns to generate the next token in a sequence. It is based on the GPT-3 architecture (Brown et al., 2020), with some modifications:

  • Rotary positional embeddings (Su et al., 2021) to encode the relative positions of tokens;
  • Multiquery attention (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022) to efficiently compute attention scores;
  • Parallel attention/MLP decoder blocks with two layer normalization steps to stabilize the training.

Internally, Falcon-40B uses a variant of multiquery attention in which each tensor-parallel degree has its own shared key and value heads.
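
The implementation shipped with the model is fused and optimized; as a rough, unoptimized sketch of plain multiquery attention (all query heads share a single key/value head, which shrinks the inference-time KV cache), using the head counts from the table below:

import torch

def multiquery_attention(x, w_q, w_kv, n_heads=128, head_dim=64):
    # x: (batch, seq, d_model); w_q: (d_model, n_heads * head_dim); w_kv: (d_model, 2 * head_dim)
    b, s, _ = x.shape
    q = (x @ w_q).view(b, s, n_heads, head_dim).transpose(1, 2)   # (b, heads, seq, head_dim)
    k, v = (x @ w_kv).split(head_dim, dim=-1)                     # a single shared K and V head
    k, v = k.unsqueeze(1), v.unsqueeze(1)                         # broadcast across query heads
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5          # causal mask omitted for brevity
    out = scores.softmax(dim=-1) @ v                              # (b, heads, seq, head_dim)
    return out.transpose(1, 2).reshape(b, s, n_heads * head_dim)

d_model, n_heads, head_dim = 8192, 128, 64
x = torch.randn(1, 8, d_model)
out = multiquery_attention(x, torch.randn(d_model, n_heads * head_dim),
                           torch.randn(d_model, 2 * head_dim))
print(out.shape)   # torch.Size([1, 8, 8192])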

Hyperparameter    Value   Comment
Layers            60
d_model           8192
head_dim          64      Reduced to optimise for FlashAttention
Vocabulary        65024
Sequence length   2048
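
The number of query heads follows directly from these values:

d_model, head_dim = 8192, 64
print(d_model // head_dim)   # 128 query heads; keys and values are shared via multiquery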

Limitations

Falcon-40B is a multilingual model that can handle English, German, Spanish, French, and some other languages to a lesser extent. However, it is not suitable for languages outside its training data. Moreover, it may reflect the online prejudices and biases that are present in its large-scale web-based corpus.

FAQ

What is Falcon-40B and what can it do?

Falcon-40B is a 40-billion-parameter causal decoder-only model built by TII and trained on 1,000 billion tokens of RefinedWeb enhanced with curated corpora. It can generate text for tasks such as summarization, open-ended text generation, and chatbots.


How can I use Falcon-40B?

You can use Falcon-40B with the Hugging Face Transformers library. You need PyTorch 2.0 installed; then import the AutoTokenizer and AutoModelForCausalLM classes from the transformers module, load the model and tokenizer under the name “tiiuae/falcon-40b”, and use the pipeline function to generate text from a given prompt.
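
For instance, the model and tokenizer can also be loaded directly without the pipeline helper (a sketch that assumes enough GPU memory; in bfloat16 the 40B weights alone take roughly 80 GB):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # halves memory use vs. float32
    trust_remote_code=True,       # Falcon originally shipped custom modeling code
    device_map="auto",            # spread layers across the available GPUs
)

inputs = tokenizer("Falcon-40B is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))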


What are the advantages of Falcon-40B?

Falcon-40B is the best open-source model currently available. It outperforms other models such as LLaMA, StableLM, RedPajama, and MPT on the Open LLM Leaderboard. It also features an architecture optimized for inference, with FlashAttention and multiquery attention, and is made available under the permissive Apache 2.0 license, allowing commercial use without royalties or restrictions.


What is the difference between Falcon-40B and Falcon-40B-Instruct?

Falcon-40B is a raw, pre-trained model that can be further finetuned for specific use cases. Falcon-40B-Instruct is a version of Falcon-40B that has been finetuned on a chat dataset and can take generic instructions in a chat format.
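
Swapping the checkpoint name is enough to try the instruction-tuned variant with the same pipeline code (a minimal sketch; prompt formatting beyond a plain instruction is optional):

import torch
import transformers
from transformers import AutoTokenizer

instruct = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(instruct)
pipeline = transformers.pipeline(
    "text-generation",
    model=instruct,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
result = pipeline("Write a short poem about falcons.", max_new_tokens=100, do_sample=True)
print(result[0]["generated_text"])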





Belmechri

I am an IT engineer, content creator, and proud father with a passion for innovation and excellence. In both my personal and professional life, I strive for excellence and am committed to finding innovative solutions to complex problems.