Effortless Fine-Tuning of Falcon Models Using QLoRA
Introduction to Falcon Models
The Falcon models have rapidly gained popularity as some of the leading large language models available today due to several compelling factors:
- They excel at solving complex problems.
- They are more compact than many other LLMs while outperforming them.
- They are completely free to use under the Apache 2.0 License.
- Different versions are available, including an instruct-version that simulates ChatGPT's behavior.
With innovative techniques like QLoRA, fine-tuning Falcon models can now be accomplished on consumer-grade hardware. Previous discussions have covered QLoRA and the fine-tuning process for Falcon models.
Finding Simplicity with Falcontune
Fine-tuning Falcon models using QLoRA is straightforward with the Hugging Face libraries. However, an even simpler solution that requires minimal coding is available: Falcontune.
Falcontune is an open-source initiative (Apache 2.0 license) created by Rumen Mihaylov. According to the project page:
Falcontune enables fine-tuning of FALCON models (e.g., falcon-40b-4bit) using just one consumer-grade A100 40GB GPU.
While fine-tuning a model with 40 billion parameters on a GPU with 40GB of VRAM sounds fantastic, referring to the A100 40GB as “consumer-grade” is a bit misleading, given its price tag of over $5,000. In contrast, we will focus on the Falcon 7B parameter model, which can comfortably run on consumer GPUs like the RTX 3060 with 12GB of VRAM.
Fine-Tuning Falcon-7B and Falcon-40B in One Command
Note: The commands below are tailored for Falcon-7B. Simply replace “7B” with “40B” to adjust for Falcon-40B.
Requirements
I conducted tests using a free instance of Google Colab.
To get started, we first need to clone Falcontune:
Next, install the required dependencies:
cd falcontune
pip install -r requirements.txt
python setup.py install
We also need the Falcon model. For this article, I utilized Falcon-7B provided by TheBloke:
Let's also download some sample datasets:
Now we're prepared!
The Command Line for Fine-Tuning
The “setup.py install” command earlier provided us with a “falcontune” command. To fine-tune Falcon-7B using the Alpaca dataset, run the following command:
falcontune finetune \
    --model=falcon-7b-instruct-4bit \
    --weights=./gptq_model-4bit-64g.safetensors \
    --dataset=./alpaca_data_cleaned.json \
    --data_type=alpaca \
    --lora_out_dir=./falcon-7b-instruct-4bit-alpaca/ \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --target_modules='["query_key_value"]' \
    --backend=triton
Expect the process to take a while: around 24 hours, which exceeds the 12-hour session limit of a free Google Colab instance, so a full run may not complete there. The Alpaca dataset is sizable, so you may want to reduce it for testing purposes. Thanks to LoRA, we are fine-tuning only 2,359,296 parameters.
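That parameter count is consistent with Falcon-7B's architecture: the model has 32 decoder layers, a hidden size of 4544, and a fused query_key_value projection of width 4672, and a LoRA adapter of rank r adds r × (input dim + output dim) parameters per adapted matrix. A quick sanity check in the shell:

```shell
# LoRA trainable parameters for Falcon-7B with r=8 on query_key_value:
# 32 layers x r x (hidden_size + qkv_width) = 32 x 8 x (4544 + 4672)
echo $((32 * 8 * (4544 + 4672)))   # prints 2359296
```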
If you wish to use your own dataset, refer to the format expected in the “alpaca_data_cleaned.json” file.
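For reference, entries in the Alpaca format are JSON objects with instruction, input, and output fields, where input may be empty (the example text below is illustrative, not taken from the dataset):

```json
[
  {
    "instruction": "Summarize the following text.",
    "input": "The Falcon models are open-source large language models...",
    "output": "Falcon is a family of open-source LLMs."
  },
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  }
]
```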
During the fine-tuning process, the peak memory usage was 4.0 GB for CPU RAM and 8.3 GB for GPU VRAM, which is a manageable setup for home-based fine-tuning. Keep in mind that the 40B version of Falcon will necessitate a more robust machine.
Testing Inference
To test the model's inference capabilities, execute:
falcontune generate \
    --interactive \
    --model=falcon-7b-instruct-4bit \
    --weights=./gptq_model-4bit-64g.safetensors \
    --lora_apply_dir falcon-7b-instruct-4bit-alpaca/ \
    --max_new_tokens 50 \
    --use_cache \
    --do_sample \
    --instruction "How to prepare pasta?" \
    --backend triton
And that’s all there is to it! You've successfully created an efficient chat model on your local machine.
If you appreciate this content and wish to support my work, consider following me on Medium.