In this post, I will go through the process of training a large language model on chat data, specifically the LLaMA-7b model. When fine-tuned on cleaned alpaca data, the resulting model has been shown to perform on par with or better than most Hugging Face variants. I will go into the benefits of using DeepSpeed for training and how LORA (Low-Rank Adaptation) can be combined with DeepSpeed to make supervised fine-tuning of large models very efficient.

The code can be found on GitHub.

Setting up the Environment

First, let’s set up a container for our training environment. Go to the dockercontext folder and run the following commands:

cd dockercontext
make build

Then, enter the container:

make enter

Preparing the Dataset

I have used cleaned alpaca data for training the model, but any data in a similar format will work. To set up the dataset, execute the following command:

python src/setup.py --local --model 'decapoda-research/llama-7b-hf' --data_file data/alpaca_data_cleaned.json --max_tokens 512

If the --local flag is not set, the dataset will be uploaded to Azure ML along with the tokenizer model. The --model argument accepts Hugging Face decoder models.
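Any file with the same structure as the cleaned alpaca data will work: a JSON list of instruction/input/output records. The snippet below is a small, hypothetical sanity check of that layout, not part of the repo's setup.py:

import json

# Load the dataset and inspect its layout: a list of records with
# "instruction", "input" and "output" fields.
with open("data/alpaca_data_cleaned.json") as f:
    records = json.load(f)

print(len(records))       # number of training examples
print(records[0].keys())  # expected: dict_keys(['instruction', 'input', 'output'])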

DeepSpeed: Why Do We Need It?

Training large language models can be computationally expensive, and memory constraints are often a significant issue. DeepSpeed is a library that optimizes the training process and reduces memory requirements.

DeepSpeed uses a form of model parallelism in which the model and its training state are divided into smaller parts, and each part is placed on a different GPU. This makes it possible to scale up the model size without running into memory limitations: models that would not fit on a single GPU can be partitioned into shards that are spread across multiple GPUs.

DeepSpeed Stages

DeepSpeed stages refer to the different levels of ZeRO optimization applied during training: roughly, stage 1 partitions the optimizer states across GPUs, stage 2 additionally partitions the gradients, and stage 3 also partitions the model parameters themselves. These stages help control memory usage and improve training efficiency. The table below shows the memory requirements for different scenarios when training the LLaMA 7 billion model with LORA rank 16:

Scenario  | Model    | Precision | Batch Size | DeepSpeed Stage | Memory (GB)
----------|----------|-----------|------------|-----------------|------------
Inference | Llama 7b | 16-bit    | 1          | N/A             | 13
Inference | Llama 7b | 8-bit     | 1          | N/A             | 8
Train     | Llama 7b | 16-bit    | 2          | N/A             | > 49
Train     | Llama 7b | 16-bit    | 2          | 2               | 27
Train     | Llama 7b | 16-bit    | 4          | 2               | 38
Train     | Llama 7b | 16-bit    | 4          | 2               | 43
Train     | Llama 7b | 16-bit    | 4          | 3               | 45
Train     | Llama 7b | 16-bit    | 6          | 2               | > 49
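For reference, the ZeRO part of a DeepSpeed configuration looks roughly like this, shown here as a Python dict (an illustrative sketch; the actual options used in this project live in config.yaml and may be named differently):

ds_config = {
    "train_micro_batch_size_per_gpu": 2,   # per-GPU batch size (cf. the table above)
    "gradient_accumulation_steps": 2,
    "fp16": {"enabled": True},             # 16-bit training
    "zero_optimization": {
        "stage": 2,  # partition optimizer states and gradients across GPUs
        # "stage": 3 would additionally partition the model parameters themselves
    },
}

A dict like this can be passed directly to PyTorch Lightning's DeepSpeedStrategy, or saved as JSON and handed to DeepSpeed itself.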

Training the Model

The LLaMA-7b model was trained using a set of configuration options (see config.yaml) chosen to balance training speed, memory utilization, and model performance. Training used 16-bit precision, which considerably reduces memory usage and speeds up training compared to 32-bit precision.

A batch size of five was chosen to make efficient use of the available GPU memory, combined with gradient accumulation over two steps, which doubles the effective batch size to 10 without increasing GPU memory requirements. This allows stable training with larger effective batch sizes even on GPUs with limited memory capacity.

Training ran for five epochs, which took approximately 10 hours to complete on two NVIDIA RTX A6000 GPUs (49 GB of memory each).
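As a rough sketch of how these settings map onto a PyTorch Lightning trainer with the DeepSpeed strategy (the repo's train_lightning.py reads them from config.yaml instead; this is not its actual code):

import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                            # two RTX A6000s
    precision=16,                         # 16-bit training
    strategy=DeepSpeedStrategy(stage=2),  # ZeRO stage 2
    accumulate_grad_batches=2,            # with a dataloader batch size of 5, effective batch size 10
    max_epochs=5,
)
# trainer.fit(model, train_dataloader)    # placeholders for the repo's LightningModule and dataloader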

To train the model, run the following command:

python src/train_lightning.py --config config.yaml --local

The config.yaml file contains all options that control the model training, from DeepSpeed configuration to model options and flags for LORA training. If the --local flag is not set, the finished model will be uploaded to Azure ML.

[Figure: Loss during training]
[Figure: Accuracy during training]

The Importance of LORA (Low-Rank Adaptation)

LORA builds on the insight that the weight matrices $W_{i}$ in a pre-trained model contain significant redundancy: the update needed to adapt $y = W_{i} x$ to a new task has an intrinsic rank that is much smaller than the embedding dimension.

To exploit this, LORA applies a low-rank decomposition to the weight update, approximating it by a product of two much smaller matrices of rank $r$:

$$ y = W_{i} x + \alpha (W^L_{b} W^L_{a} x) = (W_{i} + \alpha W^L_{b} W^L_{a}) x = W^{*}_{i} x $$

where $W^L_{a}$ maps $x$ from $[n, 1]$ down to $[r, 1]$ (its intrinsic dimension), with $n \gg r$, and $W^L_{b}$ then expands the low-rank representation back to the original number of dimensions.

During fine-tuning on the task-specific data, LORA only updates the smaller, low-rank matrices while the original weights are kept frozen, maintaining the model's expressiveness. This allows the model to adapt to the specific task with far fewer trainable parameters and less compute. After fine-tuning, the three weight matrices can be combined into $W^{*}_{i}$, resulting in zero extra computational cost at inference compared to a model trained without LORA.
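The equation above translates directly into a small PyTorch module. The sketch below is purely illustrative (it is not the code used in this repo) and shows how a single linear layer can be augmented with LORA:

import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LORA layer: y = W_i x + alpha * (W^L_b W^L_a x), with W_i frozen."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze the pre-trained W_i
        self.lora_a = nn.Linear(base.in_features, r, bias=False)    # W^L_a: n -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)   # W^L_b: r -> n
        nn.init.zeros_(self.lora_b.weight)                          # zero update at initialization
        self.alpha = alpha                                          # many implementations scale by alpha / r

    def forward(self, x):
        return self.base(x) + self.alpha * self.lora_b(self.lora_a(x))

After training, the low-rank update can be added into the frozen weight, which is why the merged model carries no extra inference cost.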

By employing LORA during fine-tuning, we can achieve the following benefits:

  • Improved training efficiency: LORA significantly reduces the number of parameters to be fine-tuned, which in turn decreases training time and computational resources required.
  • Better generalization: LORA helps the model adapt to the task-specific data while still generalizing well to new examples.
  • Reduced overfitting: The low-rank adaptation reduces the risk of overfitting to the training data by constraining the model’s capacity.
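In practice, LORA is usually added through the peft library rather than by hand; the sketch below shows one common way to do it for a LLaMA-style model (the target modules and hyperparameters are typical choices, not necessarily what this repo uses internally):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
lora_config = LoraConfig(
    r=16,                                 # the rank used in the memory table above
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections are a common target
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the low-rank matrices are trainable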

Training the Model Using the Reacher Library

The Reacher library simplifies the process of setting up a remote training environment. First, set up a Reacher instance that connects to the remote machine running the Docker image built in dockercontext:

from reacher.reacher import Reacher, ReacherDocker, RemoteClient
from reacher.reacher import create_notebook, create_tensorboard

reacher = Reacher(
    build_name="training_alpacha_lora",
    host="",
    port=8961,
    user="root",
    ssh_key_filepath=".ssh/id_rsa",
    prefix_cmd="PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home//.local/bin"
)
reacher.setup()

You can also create a TensorBoard session on the remote and set up port forwarding between the remote and your local machine:

create_tensorboard(reacher, remote_port=55069, local_port=55069, logdir=reacher.build_path+"/artifacts/runs")

A notebook server can be set up on the remote, with port forwarding to your local machine:

create_notebook(reacher, remote_port=55068, local_port=55068)

To process the data and train the model using the Reacher library, execute the following commands:

# Process the data
reacher.execute(
    "python src/setup.py --local --model 'decapoda-research/llama-7b-hf' --data_file alpaca_data_cleaned.json --max_tokens 512",
    context=["src", "data/alpaca_data_cleaned.json"],
    wrap_in_screen=True,
    named_session="setup",
)
# Train the model
reacher.execute(
    "python src/train_lightning.py --config config.yaml --local",
    context=["src", "config.yaml"],
    wrap_in_screen=True,
    named_session="training",
)

After training, you can grab the checkpoint:

reacher.get("<path>/final.ckpt", "weights/")

Inference

For inference, refer to the inference.ipynb notebook. Load the configuration used during training and enable 8-bit mode for inference:

import yaml
from pathlib import Path

conf = yaml.safe_load(Path("config.yaml").read_text())
conf["model"]["load_in_8bit"] = True
conf["model"]["device_map"] = "auto"

Set up the model:

model = GeneratorModel(**conf["model"], torch_dtype=torch.float16)
model.load_lora_pretrained("weights/alpacha_lora/adapter_model.bin")

Finally, use the trained model to generate text:

for text in texts:
    response = generate_text(
        model=model,
        tokenizer=tokenizer,
        text=text,
        max_length=150,
        device="cuda",
    )
    print(response + "\n\n" + "-" * 40 + "\n\n")
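The model was trained on alpaca-style prompts, so generate_text presumably wraps the raw text in a similar template before tokenization. The helper below is hypothetical and only shows the standard alpaca prompt format:

def build_prompt(instruction: str, input_text: str = "") -> str:
    """Format an instruction (and optional input) in the standard alpaca prompt style."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )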

Inference Examples

In this section, I have collected some examples of responses generated by the trained model. These examples demonstrate how the model can provide contextually relevant and creative responses to different inputs.

Keep in mind that the model might produce some hallucinations or inaccuracies, but overall, it demonstrates an understanding of the given context and instructions.

Example 1:

Instruction: Think of a way to use the following material to build a shelter.

Input: sticks and a piece of cloth

Response: You can create a simple lean-to by using sticks as poles and wrapping them around a tree or other structure. The stick will hold up the roof while the cloth will provide protection from rain and wind.

Example 2:

Instruction: How can you stack the objects in a stable manner?

Input: a book, 9 eggs, a nail, a plant, a shoe

Response: The egg should be placed on top of the book and then the shoes could go next to it. The plant would need to be placed at the bottom of the pile. Then the nails could be used as support for the whole structure by placing them underneath the base of the plant. This way they will provide stability without being seen or touching any part of the object.

Example 3:

Instruction: Which is the third largest city in Sweden, list the first and second as well

Response: The third largest city in Sweden is Gothenburg (Göteborg). The first two are Stockholm and Malmö.

Example 4:

Instruction: Write a poem describing my love for my wife

Input: We meet 10 years ago

Response: My life has been filled with joy since I met you so long ago I’ve found my soul mate, our bond strong as steel Our hearts are one, forevermore Through thick and thin, through it all The world can never separate us two She’s always there to comfort me And she’ll be by my side till death do us part Love of this kindredominever Above the stars above Brighting brightest star Shining on high Inspiring me to fly higher than ever before Without her I would have no where to go Her heart is mine, and mine hers Foreverlasting No matter what comes or

Example 5:

Instruction: Who is the prime minister in Sweden?

Response: The Prime Minister of Sweden is Stefan Löfven, who has been serving since 2014. He was appointed by King Carl XVI Gustaf.

Example 6:

Instruction: Make up a game for two people that likes math and physics, make some examples as well

Response: A fun game to play with someone who enjoys math and physics could be called “The Physics Game”. In this game you have to answer questions about physics such as how far away from Earth would you need to travel in order to reach light speed? The first person to get all of their answers correct wins! Another example of a game like this one can be called “Physics Puzzle” where players must solve puzzles based on physics principles. For instance, they are given a picture of a car and then asked what type of engine it has or which way does its wheels move when it turns left or right. This game requires knowledge of basic physics concepts but also problem-solving skills. 2 more games like these include “Physics

Conclusion

In conclusion, training large (7b+) language models on chat data with DeepSpeed and LORA enables efficient fine-tuning, reducing memory requirements and improving generalization. DeepSpeed provides a scalable and memory-efficient way to train very large models, while LORA lets the model adapt to task-specific data through low-rank adaptation of the network’s weights. Combined, these techniques make fine-tuning state-of-the-art language models very efficient.