In this post, I will go through all the steps needed to set up and train a state-of-the-art LLM for sentiment analysis on Twitter data (the same steps apply to many other NLP tasks). I will cover the entire pipeline, from creating the training dataset to deploying the model using TorchServe and Kubernetes on Azure.

See the GitHub repository for the code.

Setting up the Training Pipeline

Setting Up the Environment

First, we need to set up our environment to run the training and deployment steps. I’m using Docker to create a container with all the necessary dependencies. To build and enter the Docker container, execute the following commands:

cd dockercontext
make build 
make enter

Creating the Training Dataset

Next, we will create the training data using the following command:

python src/setup.py --dataset_name twitter-sentiment

During this step, the raw data is downloaded, tokenized using the tokenizer associated with the model, and finally uploaded to Azure ML Datasets for later use.
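
To give a feel for what this step does, here is a simplified sketch in Python (the data source and tokenizer checkpoint are my assumptions; the repo may use different ones):

from datasets import load_dataset
from transformers import AutoTokenizer

# Download the raw tweets and the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
raw = load_dataset("tweet_eval", "sentiment")

# Tokenize the texts so training can consume them directly
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True)

# Save locally before registering the result as an Azure ML dataset
tokenized.save_to_disk("artifacts/twitter-sentiment")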

Training the Model

Now, it’s time to train the model on the dataset we just created. To do this, run the following command:

python train.py --dataset twitter-sentiment --lock_embedding --lock_first_n_layers 7 --batch_size 25 --iterations 50000

I also provide an alternative training script that uses PyTorch Lightning with Low-Rank Adaptation (LORA) and a cosine learning rate scheduler:

python src/train_lightning.py --config config.yaml

After training is complete, we use the torch.jit compiler to trace the trained model. During this step, all computational operations are recorded, optimized, and serialized into a single file. The traced model can later be used stand-alone, without access to the Python code that originally set up the model.
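
A rough sketch of the tracing step (the checkpoint name, sequence length, and output path are placeholders; in the real pipeline you trace the fine-tuned model, not the base checkpoint):

import torch
from transformers import AutoModelForSequenceClassification

# torchscript=True makes the model return tuples, which tracing requires
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", torchscript=True
)
model.eval()

# Example inputs with the shapes the model will see at inference time
input_ids = torch.ones(1, 128, dtype=torch.long)
attention_mask = torch.ones(1, 128, dtype=torch.long)

# Record the forward pass into a static graph and save it as a single file
traced = torch.jit.trace(model, (input_ids, attention_mask))
traced.save("artifacts/traced.pt")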

The tokenizer, the Python model, and the traced model are all uploaded to Azure ML Models for use in the next steps.
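
For reference, registering an artifact with Azure ML Models looks roughly like this with the v1 azureml-core SDK (the workspace config, path, and model name here are assumptions):

from azureml.core import Workspace, Model

# Connect to the workspace described by the local config.json
ws = Workspace.from_config()

# Register the traced model so later steps can fetch it by name
Model.register(
    workspace=ws,
    model_path="artifacts/traced.pt",
    model_name="xlmr_sentiment_traced",
)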

Low-Rank Adaptation - LORA

LORA builds on the insight that the weight matrices $W_{i}$ in a pre-trained model contain significant redundancy; in other words, the intrinsic dimension of $x$ in the equation $y = W_{i} x$ is much smaller than the embedding dimension.

To exploit this, LORA applies low-rank decomposition to these matrices, effectively approximating them using a product of two smaller matrices with lower rank $r$.

$$ y = W_{i} x + \alpha (W^L_{b} W^L_{a} x) = (W_{i} + \alpha W^L_{b} W^L_{a}) x = W^{*}_{i} x $$

where $W^L_{a}$ maps $x$ from $[n, 1]$ to $[r, 1]$ (its intrinsic dimension), with $n \gg r$. $W^L_{b}$ then expands the intrinsic representation back to the original number of dimensions.

During the fine-tuning process, LORA only updates the smaller, low-rank matrices, while the original weights are kept frozen. This allows the model to adapt to the specific task with fewer trainable parameters and less compute. After fine-tuning, the three weight matrices can be combined into $W^{*}_{i}$, resulting in zero extra computational cost compared to a model trained without LORA.
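
To make the idea concrete, here is a minimal sketch of a LORA-wrapped linear layer in PyTorch (illustrative only, not the implementation used in the repo):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W_i stays frozen
        # W^L_a projects from n down to r, W^L_b projects from r back up
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.alpha = alpha

    def forward(self, x):
        # y = W_i x + alpha * (W^L_b W^L_a x)
        return self.base(x) + self.alpha * (x @ self.lora_a.T @ self.lora_b.T)

    def merge(self):
        # W* = W_i + alpha * W^L_b W^L_a; inference then costs the same as the original layer
        with torch.no_grad():
            self.base.weight += self.alpha * (self.lora_b @ self.lora_a)

Since $W^L_{b}$ starts at zero, the wrapped layer initially behaves exactly like the frozen original, and only the two small matrices receive gradients during fine-tuning.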

Packaging the Model for TorchServe

Once the training is complete, we package the model and its dependencies into a .mar file for TorchServe, using the following command:

python package.py --model_name xlmr_sentiment_traced --version 1 --handler src/handler.py --requirements dockercontext_torchserve/requirements.txt

The package script will download the tokenizer and the traced model.

Once all files are downloaded, the .mar file is created using the following command:

torch-model-archiver \
    --force \
    --model-name sentiment \
    --version {args.version} \
    --serialized-file {DESTINATION_FOLDER}/traced.pt \
    --handler {args.handler} \
    --export-path {DESTINATION_FOLDER} \
    --requirements-file {args.requirements} \
    --extra-files {DESTINATION_FOLDER}/special_tokens_map.json,{DESTINATION_FOLDER}/tokenizer_config.json,{DESTINATION_FOLDER}/tokenizer.json

Here we package the traced model, the tokenizer, all dependencies specified in the requirements file, and the handler. The handler is responsible for initializing the model and tokenizer and for defining the functions the API calls when a request arrives. The handler implements the functions below:

from typing import Dict, List

from ts.torch_handler.base_handler import BaseHandler


class SentimentHandler(BaseHandler):

    def initialize(self, context):
        # Load the traced model and the tokenizer from the model archive
        pass

    def preprocess(self, requests: List[Dict[str, bytearray]]):
        # Decode the request bodies and tokenize the input texts
        pass

    def inference(self, model_input):
        # Run the traced model on the tokenized batch
        pass

    def postprocess(self, data):
        # Convert the output tensor to a list of predictions
        pass

Running the Pipeline Remotely with Reacher

To run the pipeline on a remote machine (necessary for training if you are working on a laptop), I use the Reacher library. First, set up a Reacher instance with your remote configuration:

from reacher.reacher import RemoteClient, ReacherDocker
from dotenv import dotenv_values

config = dotenv_values()

reacher = ReacherDocker(
    build_name="pytorch_base",
    image_name="pytorch_base",
    build_context="dockercontext",
    host=config["HOST"],
    user=config["USER"],
    password=config["PASSWORD"],
    ssh_key_filepath=config["SSH_KEY_PATH"],
)

reacher.build()

reacher.setup(
    ports=[8888, 6666],
    envs=dotenv_values(".env") 
)

Then, execute the different pipeline steps on the remote machine:

reacher.execute(
    context=["src"],
    command="python <SCRIPT>.py <ARGS>... ",
    named_session="",
)

Here you use the same commands as when running locally; just specify the files needed to run the scripts, and they will be uploaded to the remote machine before the command is executed.
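
For example, the training step from earlier could be launched remotely like this (the session name is just a label I chose):

reacher.execute(
    context=["src"],
    command=(
        "python train.py --dataset twitter-sentiment "
        "--lock_embedding --lock_first_n_layers 7 "
        "--batch_size 25 --iterations 50000"
    ),
    named_session="train_sentiment",
)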

Once all steps are done, fetch the .mar file containing the trained model:

reacher.get("artifacts/sentiment.mar", "artifacts")

Deployment on Azure Kubernetes Cluster with TorchServe

Setting up the Kubernetes cluster on Azure

To set up the Kubernetes cluster on Azure, export environment variables from the .env file:

export $(cat .env | xargs)

Create the cluster:

az aks create --resource-group ${AZURE_RESOURCE_GROUP} --name ${COMPUTE_CLUSTER_NAME} --node-vm-size Standard_DC2s_v2 --node-count 1 --generate-ssh-keys

Here you can specify a more performant node VM size than Standard_DC2s_v2.

Install the necessary tools, such as kubectl:

az aks install-cli

Fetch the cluster credentials and write them to ~/.kube/config so that kubectl can be used:

az aks get-credentials --resource-group ${AZURE_RESOURCE_GROUP}  --name ${COMPUTE_CLUSTER_NAME}

Configuring the cluster

If you are using GPU-powered nodes, you need to set up the nvidia-device-plugin so that the pods can access the GPUs.

kubectl apply -f k8s/nvidia-device-plugin-ds.yaml

Configuring storage

Create a storage class for sharing files across the nodes:

kubectl apply -f k8s/Azure_file_sc.yaml

Create a PersistentVolumeClaim to store mar files and configs:

kubectl apply -f k8s/AKS_pv_claim.yaml

Create a pod that mounts the PersistentVolume, which we will use for copying mar and config files:

kubectl apply -f k8s/model_store_pod.yaml

Check the running pods:

kubectl get pods -o wide

Uploading mar/config files to Kubernetes

Create a folder on the pod's persistent volume for the mar files:

kubectl exec --tty pod/model-store-pod -- mkdir /mnt/azure/model-store/

Copy the mar file to the pod with PersistentVolume:

kubectl cp dockercontext_torchserve/model_store/sentiment.mar model-store-pod:/mnt/azure/model-store/sentiment.mar

Copy the config files:

kubectl exec --tty pod/model-store-pod -- mkdir /mnt/azure/config/
kubectl cp dockercontext_torchserve/config.properties model-store-pod:/mnt/azure/config/config.properties

Check if all the files have been uploaded:

kubectl exec --tty pod/model-store-pod -- find /mnt/azure/

Deployment for public access

Set up a TorchServe deployment and a service that forwards the inference/management/metric ports to the load balancer for external access:

kubectl create -f k8s/torchserve_public.yaml

Get information about the TorchServe service:

kubectl describe service torchserve

To get information about all services:

kubectl get service -A

Use the external IP along with the correct port to start the sentiment model on the deployment:

curl -v -X POST "http://<EXTERNAL_IP>:<EXTERNAL_MGM_PORT>/models?initial_workers=1&batch_size=5&maxWorkers=5&max_batch_delay=1000&synchronous=true&url=sentiment.mar"

Here, we start with one worker and allow TorchServe to scale up to five if needed. We set a maximum batch size of 5: requests arriving within 1000 ms of each other are grouped into batches of up to that size.

To delete the deployment and service:

kubectl delete deployment torchserve
kubectl delete service torchserve

Deploying the Model for Private Access

To deploy the model for private access, create a pod with the TorchServe image:

kubectl create -f k8s/torchserve_private.yaml

Access the TorchServe pod:

kubectl exec --stdin --tty torchserve -- /bin/bash

Now, you can set up and test TorchServe in a similar way as for the public deployment. Change <EXTERNAL_IP> to 127.0.0.1 and use the ports specified in config.properties.

Inference on the Model

Finally, you can use your deployed model to perform sentiment analysis on text inputs:

curl -X POST -H "Content-Type: application/json" -d '["The movie was so good", "The movie was so bad"]' http://<EXTERNAL_IP>:<EXTERNAL_INFERENCE_PORT>/v1/models/sentiment:predict

The output will look like this:

[
  0.9991590976715088,
  0.0054536801762878895
]

The model is confident that the first text has a positive sentiment, while the second text has a negative sentiment.
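
If you prefer Python over curl, a minimal client using the requests package could look like this (same endpoint placeholders as above):

import requests

# Same endpoint as in the curl example
url = "http://<EXTERNAL_IP>:<EXTERNAL_INFERENCE_PORT>/v1/models/sentiment:predict"
texts = ["The movie was so good", "The movie was so bad"]

response = requests.post(url, json=texts, timeout=10)
response.raise_for_status()
print(response.json())  # one sentiment score per input text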

Cleaning Up Resources

To delete the deployment and service, run the following commands:

kubectl delete deployment torchserve
kubectl delete service torchserve

Conclusion

In this post, I have walked through the entire process of training an LLM for sentiment analysis, packaging it for TorchServe, and deploying it on an Azure Kubernetes cluster. With this, you can now create, train, and deploy your own models for various use cases.