Fine-tuning Llama-3, Mistral and Mixtral with Anyscale

By Marwan Sarieddine and Kamil Kaczmarek   

In this blog post, we’ll explore the process of fine-tuning some of the most popular large language models (LLMs) such as Llama-3, Mistral, and Mixtral using Anyscale. Specifically, we’ll demonstrate how to:

  1. Fine-tune an LLM using Anyscale’s llm-forge: We’ll cover the fine-tuning process end to end, from preparing the input data to launching the fine-tuning job and monitoring the process.

  2. Serve your model with Anyscale’s ray-llm: We’ll explain how to serve both LoRA and full-parameter fine-tuned models, and how to leverage ray-llm for easy deployment and LoRA multiplexing.

Let’s get started!

Fine-tuning LLMs with Anyscale - Overview

Here is a high-level overview of the steps involved in fine-tuning LLMs with Anyscale: 

  1. Prepare the data: Ensure that your training and validation data is in the OpenAI format.

  2. Launch a fine-tuning job: Use the llm-forge library to launch a fine-tuning job on Anyscale. 

  3. Monitor the fine-tuning process: Keep an eye on the training logs and metrics to track performance. 

  4. Inspect the produced model checkpoint: Check the model checkpoint and use it for deployment. 

Below is a diagram that helps visualize the process:

Steps involved in fine-tuning LLMs with Anyscale

Step 1 - Prepare the data

Before you can start fine-tuning your LLM, you need to ensure that your training and validation data is in the correct format. The llm-forge library expects the data to be in the OpenAI dataset format, which is a JSON-based format that represents a conversation between a user and an assistant. Each JSON object in the dataset contains a “messages” field with a list of messages exchanged between the user and the assistant. Each message has a “role” field (either “system”, “user”, or “assistant”) and a “content” field that holds the actual text of the message.

Here’s an example of what the data should look like:

1{"messages": [
2  {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
3  {"role": "user", "content": "What's the capital of France?"},
4  {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}
5]}
6{"messages": [
7  {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
8  {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
9  {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}
10]}

llm-forge will convert the OpenAI dataset format into the appropriate prompt format for the chosen base model. This conversion is crucial because any slight misspecification in the prompt format can significantly impact the quality of the fine-tuned model.
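If you want to sanity-check your files before launching a job, a small script can catch malformed records early. Below is a minimal validation sketch (our own, not part of llm-forge) that checks every line of a JSONL file against the format above; the file name train.jsonl is just a placeholder.

import json

# Hypothetical input file in the OpenAI chat format described above.
DATA_PATH = "train.jsonl"
VALID_ROLES = {"system", "user", "assistant"}

with open(DATA_PATH) as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)  # each line must be a standalone JSON object
        messages = example.get("messages")
        assert isinstance(messages, list) and messages, f"line {i}: missing 'messages' list"
        for message in messages:
            assert message.get("role") in VALID_ROLES, f"line {i}: unexpected role {message.get('role')!r}"
            assert isinstance(message.get("content"), str), f"line {i}: 'content' must be a string"

print("All examples look well-formed")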

Once you have prepared your data, you can proceed to the next step of launching the fine-tuning job on Anyscale.

Step 2 - Launch a fine-tuning job

With llm-forge, you launch a fine-tuning job by preparing a YAML configuration file. This file allows you to specify all the details of the fine-tuning process, including the data, model, training hyperparameters, and output settings. llm-forge offers pre-built YAML configurations for the most common base models so you don’t have to figure out most parameters from scratch.

Launching a LoRA-based fine-tuning job

Here’s an example YAML configuration for LoRA-based fine-tuning of the Meta-Llama-3-8B model:

model_id: meta-llama/Meta-Llama-3-8B # <-- change this to the model you want to fine-tune
train_path: s3://air-example-data/gsm8k/train.jsonl # <-- change this to the path to your training data
valid_path: s3://air-example-data/gsm8k/test.jsonl # <-- change this to the path to your validation data. This is optional
context_length: 512 # <-- change this to the context length you want to use
num_devices: 4 # <-- change this to total number of GPUs that you want to use
num_epochs: 10 # <-- change this to the number of epochs that you want to train for
train_batch_size_per_device: 16
eval_batch_size_per_device: 16
learning_rate: 1e-4
padding: "longest" # This will pad batches to the longest sequence. Use "max_length" when profiling to profile the worst case.
num_checkpoints_to_keep: 1
dataset_size_scaling_factor: 10000 # internal flag. No need to change
output_dir: /mnt/local_storage
deepspeed:
  config_path: deepspeed_configs/zero_3_offload_optim+param.json
flash_attention_2: true
trainer_resources:
  memory: 53687091200 # 50 GB memory
worker_resources:
  accelerator_type:A10G: 0.001
lora_config:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
    - embed_tokens
    - lm_head
  task_type: "CAUSAL_LM"
  modules_to_save: []
  bias: "none"
  fan_in_fan_out: false
  init_lora_weights: true

This configuration sets up a LoRA-based fine-tuning job for the Meta-Llama-3-8B model, using 4 A10 GPUs and training for 10 epochs. To launch the fine-tuning job, use the following command:

llmforge anyscale finetune "finetune/lora/meta-llama-3-8b.yaml"

You can also integrate with tools like Weights & Biases by setting your API key before launching your fine-tuning job:

export WANDB_API_KEY={YOUR_WANDB_API_KEY}
llmforge anyscale finetune "finetune/lora/meta-llama-3-8b.yaml"

Launching a full-parameter fine-tuning job

Alternatively, here’s an example YAML configuration for a full-parameter fine-tuning of the Llama 3 8B model:

model_id: meta-llama/Meta-Llama-3-8B # <-- change this to the model you want to fine-tune
train_path: s3://air-example-data/gsm8k/train.jsonl # <-- change this to the path to your training data
valid_path: s3://air-example-data/gsm8k/test.jsonl # <-- change this to the path to your validation data. This is optional
context_length: 512 # <-- change this to the context length you want to use
num_devices: 16 # <-- change this to total number of GPUs that you want to use
num_epochs: 10 # <-- change this to the number of epochs that you want to train for
train_batch_size_per_device: 8
eval_batch_size_per_device: 16
learning_rate: 5e-6
padding: "longest" # This will pad batches to the longest sequence. Use "max_length" when profiling to profile the worst case.
num_checkpoints_to_keep: 1
dataset_size_scaling_factor: 10000 # internal flag. No need to change
output_dir: /mnt/local_storage
deepspeed:
  config_path: deepspeed_configs/zero_3_offload_optim+param.json
flash_attention_2: true
trainer_resources:
  memory: 53687091200 # 50 GB memory
worker_resources:
  accelerator_type:A10G: 0.001

This configuration sets up a full-parameter fine-tuning job for the Llama 3 8B model, using 16 A10 GPUs and training for 10 epochs. Note that both LoRA and full-parameter fine-tuning can leverage DeepSpeed optimizations, which you can read more about on the DeepSpeed configuration page.

The key difference between the LoRA and full-parameter configurations is that the latter is significantly more resource intensive, requiring 16 GPUs as opposed to 4, even with half the per-device training batch size. We’ll discuss the trade-offs between these two approaches in the “Advanced fine-tuning tips” section.
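To build intuition for this resource gap, here is a rough back-of-envelope estimate (our own approximation, not a number produced by llm-forge) of the memory each approach needs for an 8B-parameter model, assuming bf16 weights and gradients, fp32 Adam optimizer states, and even sharding under ZeRO stage 3; activations, buffers, and fragmentation are ignored.

# Rough memory arithmetic for full-parameter vs. LoRA fine-tuning of an 8B model.
N_PARAMS = 8e9                        # Llama 3 8B
BYTES_PER_TRAINED_PARAM = 2 + 2 + 12  # bf16 weights + bf16 grads + fp32 Adam states (master copy, m, v)

# Full-parameter: every weight is trained, so all of these states must fit somewhere.
full_states_gb = N_PARAMS * BYTES_PER_TRAINED_PARAM / 1e9
print(f"Full-parameter model states: ~{full_states_gb:.0f} GB "
      f"(~{full_states_gb / 16:.0f} GB per GPU when sharded across 16 A10s)")

# LoRA: the base weights are frozen, so gradients and optimizer states only exist
# for the small adapter; the dominant cost is the frozen bf16 base model.
lora_base_gb = N_PARAMS * 2 / 1e9
print(f"LoRA frozen base weights: ~{lora_base_gb:.0f} GB "
      f"(~{lora_base_gb / 4:.0f} GB per GPU when sharded across 4 A10s)")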

To launch the fine-tuning job, use the following command:

llmforge anyscale finetune "finetune/full_param/meta-llama-3-8b.yaml"

Advanced fine-tuning tips

Should you perform LoRA-based or full-parameter fine-tuning?

There is no general answer to this, but here are some considerations:

  • The quality of the fine-tuned models will, in most cases, be comparable if not the same.

  • LoRA shines if:

    • You want to serve many fine-tuned models at once yourself.

    • You want to rapidly experiment (because fine-tuning, downloading, and serving the model take less time).

  • Full-parameter shines if:

    • You want to ensure that your fine-tuned model has the maximum quality.

    • You want to serve only one fine-tuned version of the model.

You can learn more about this in one of our blog posts, where you’ll also find guidance on the LoRA parameters and why, in most cases, you don’t need to change them.

How to optimize for compute cost?

Before optimizing for compute, ensure that you have selected a context length that is long enough for your dataset. If only a handful of data points require a much larger context than the rest, consider removing them. Your choice of model and fine-tuning technique should also suit your data.
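For example, here is a minimal filtering sketch (our own, not part of llm-forge) that drops examples whose rough token count exceeds the configured context length. It tokenizes only the concatenated message contents, so it slightly undercounts the prompt-format tokens llm-forge adds; the file names are placeholders, and access to the gated Llama 3 tokenizer on Hugging Face is assumed.

import json
from transformers import AutoTokenizer

CONTEXT_LENGTH = 512  # should match context_length in your fine-tuning YAML
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # requires access to the gated repo

kept, dropped = [], 0
with open("train.jsonl") as f:  # placeholder input path
    for line in f:
        example = json.loads(line)
        text = "\n".join(m["content"] for m in example["messages"])
        n_tokens = len(tokenizer(text)["input_ids"])  # rough estimate; ignores extra prompt-format tokens
        if n_tokens <= CONTEXT_LENGTH:
            kept.append(line)
        else:
            dropped += 1

print(f"Dropped {dropped} examples longer than ~{CONTEXT_LENGTH} tokens")
with open("train_filtered.jsonl", "w") as f:  # placeholder output path
    f.writelines(kept)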

If you want to use a different compute setup, we suggest the following workflow to find a suitable configuration:

  • Start with a batch size of 1

  • Choose a GPU instance type that you think will give you good flops/$. If you are not sure, here is a rough guideline:

    • A10 nodes for high availability

    • A100 nodes for lower availability but better flops/$

    • Anything higher-end if you have the means of acquiring them

  • Do some trial-and-error iterations on instance types and DeepSpeed settings to fit the workload while keeping other settings fixed

    • Use DeepSpeed stage 3 (all default configs in this template use stage 3)

    • Use DeepSpeed offloading only if it reduces the minimum number of instances you have to use

      • DeepSpeed offloading slows down training but allows for larger batch sizes because of a smaller GPU memory footprint (an illustrative offload config sketch appears at the end of this section)

    • Use as few instances as possible. Fine-tune on the same machine if possible.

      • GPU-to-GPU communication across machines is very expensive compared to the memory savings it can provide. You can use a cheap CPU instance as a head node for development and a GPU instance that can scale down as a worker node for the heavy lifting.

      • Training single-node on A100s may end up cheaper than multi-node on A10s if availability is not an issue

  • Be aware that evaluation and checkpointing introduce their own memory requirements

    • If things look good, run fine-tuning for a full epoch with evals.

  • After you have followed the steps above, increase batch size as much as possible without OOMing.

We do not guarantee that this will give you optimal settings, but have found this workflow to be helpful ourselves in the past.
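As an illustration of the offloading option mentioned in the workflow above, the sketch below writes a ZeRO stage 3 DeepSpeed config with optimizer and parameter offloading to CPU. These are standard DeepSpeed fields, but the exact file llm-forge ships as deepspeed_configs/zero_3_offload_optim+param.json may differ.

import json

# Illustrative ZeRO stage 3 config with CPU offloading; not the exact file used by llm-forge.
zero3_offload_config = {
    "zero_optimization": {
        "stage": 3,                              # shard parameters, gradients, and optimizer states
        "offload_optimizer": {"device": "cpu"},  # move optimizer states to CPU RAM
        "offload_param": {"device": "cpu"},      # move parameters to CPU RAM when not in use
    },
    "bf16": {"enabled": True},
}

with open("zero_3_offload_optim+param.json", "w") as f:
    json.dump(zero3_offload_config, f, indent=2)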

Step 3 - Monitor the fine-tuning process

As your fine-tuning job runs on the Anyscale platform, you’ll want to monitor the progress and performance metrics. Anyscale provides a convenient Ray Dashboard that allows you to monitor various aspects of your training run.

Ray Dashboard

In the Ray Dashboard, you can check the GPU usage and other key metrics.

In the logs of the fine-tuning job, you’ll see detailed information about the training process and performance metrics, such as:

  • Training loss over time

  • Validation loss and perplexity

  • Throughput of trained tokens

  • Time spent on forward and backward passes

  • Overall training time and iterations

Below is an example of the output you should see in the logs:

Training finished iteration 84 at 2024-06-03 11:11:32. Total running time: 16min 44s
╭─────────────────────────────────────────────────────────╮
│ Training result                                         │
├─────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                   checkpoint_000005 │
│ time_this_iter_s                               36.40451 │
│ time_total_s                                  844.85971 │
│ training_iteration                                   84 │
│ avg_bwd_time_per_epoch                          3.04739 │
│ avg_fwd_time_per_epoch                          2.93967 │
│ avg_train_loss_epoch                            0.07368 │
│ bwd_time                                        2.63804 │
│ epoch                                                11 │
│ eval_loss                                       0.26874 │
│ eval_time_per_epoch                            13.15118 │
│ fwd_time                                        2.78809 │
│ learning_rate                                        0. │
│ num_iterations                                        7 │
│ perplexity                                      1.30831 │
│ step                                                  6 │
│ total_trained_steps                                  84 │
│ total_update_time                             511.17685 │
│ train_loss_batch                                 0.0213 │
│ train_time_per_epoch                           42.60657 │
│ train_time_per_step                             5.42907 │
│ trained_tokens                                   704400 │
│ trained_tokens_this_iter                           2172 │
│ trained_tokens_throughput                    1377.99668 │
│ trained_tokens_throughput_this_iter           400.28001 │
╰─────────────────────────────────────────────────────────╯

Additionally, the output includes information about the checkpoint directory, where the fine-tuned model is being saved. This allows you to easily access the checkpoint and use the fine-tuned model for deployment or further experimentation.

By closely monitoring the fine-tuning process, you can ensure that your model is converging as expected and make any necessary adjustments to the training configuration or hyperparameters. Finally, if you enable Weights and Biases, you can track your fine-tuning experiments and compare performance across different runs.

Step 4 - Serve the fine-tuned models

Now that you’ve successfully fine-tuned your model, it’s time to serve it for inference.

The key steps involved are:

  1. Save the fine-tuned model checkpoint to a location accessible by Anyscale (e.g., a cloud storage service like S3 or GCS).

  2. Create a ray-llm compatible YAML configuration file that specifies the model to be served and any additional settings.

  3. Deploy the custom model to Anyscale Services and monitor it, which will make it available as a scalable, production-ready endpoint.

Below is a diagram that helps visualize the process:

Serve the fine-tuned models

Serving full-parameter fine-tuned models

For full-parameter fine-tuning, you can use the stored checkpoint from the fine-tuning process and serve the model using Anyscale Services. The checkpoint will contain the complete set of fine-tuned model parameters.

Here’s an example of the checkpoint location:

Best checkpoint is stored in:
{CHECKPOINT_PATH}/TorchTrainer_2024-06-03_10-54-47/TorchTrainer_5e49d_00000_0_2024-06-03_10-54-48/checkpoint_000003

And here’s an example YAML configuration for serving the full-parameter fine-tuned model:

applications:
- args:
    embedding_models: []
    function_calling_models: []
    models: []
    multiplex_lora_adapters: []
    multiplex_models: []
    vllm_base_models:
    - ./model_config/meta-llama--Meta-Llama-3-8B-Instruct.yaml
  import_path: aviary_private_endpoints.backend.server.run:router_application
  name: llm-endpoint
  route_prefix: /

In this configuration, we reference the fine-tuned model configuration in the vllm_base_models field, which points to a YAML file containing the model checkpoint location.

Furthermore, you can refer to this working example that demonstrates how to deploy your own LLM.
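Once the service is up, you can query it like any OpenAI-compatible endpoint, which is what ray-llm exposes. The sketch below uses the openai Python client; the base URL, bearer token, and served model id are placeholders for your own deployment.

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-service-url>/v1",  # placeholder: your Anyscale Service URL
    api_key="<your-service-bearer-token>",     # placeholder: your service token
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # the served model id from your config
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)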

Serving LoRA-based fine-tuned models

For LoRA-based fine-tuning, you will use a YAML configuration file very similar to the full-parameter one, except that:

  • instead of referencing the model under vllm_base_models, you will use multiplex_models

  • you will store your LoRA model checkpoint under the dynamic_lora_loading_path

The dynamic_lora_loading_path is a “directory” that contains all the LoRAs that you want to multiplex on the same base model. This approach allows you to load the LoRA adapter weights dynamically, which is significantly more efficient than loading each fine-tune as a full parameter model.

Here’s an example YAML configuration for serving a LoRA-based fine-tuned model:

applications:
- args:
    dynamic_lora_loading_path: {{your_dynamic_loading_path_here}}
    embedding_models: []
    function_calling_models: []
    models: []
    multiplex_lora_adapters: []
    multiplex_models:
    - ./model_config/meta-llama--Meta-Llama-3-8B-Instruct.yaml
    vllm_base_models: []
  import_path: aviary_private_endpoints.backend.server.run:router_application
  name: llm-endpoint
  route_prefix: /

As long as all the LoRA models are stored in the same location specified by dynamic_lora_loading_path, you can serve multiple LoRA models using a single ray-llm endpoint. ray-llm will efficiently multiplex the models for you, avoiding costly LoRA loading operations by optimally routing requests to the correct model based on the request headers.

Below is a diagram visualizing multiple LoRA models being served using ray-llm:

Multiple LoRA models
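Sending a request to one of the multiplexed LoRA models looks just like querying the base model, except for the model id. The sketch below assumes the common convention in which the LoRA model id is the base model id followed by the fine-tuned checkpoint’s folder name under dynamic_lora_loading_path; the URL, token, and suffix are placeholders.

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-service-url>/v1",  # placeholder: your Anyscale Service URL
    api_key="<your-service-bearer-token>",     # placeholder: your service token
)

# ray-llm routes the request to a replica that already has this adapter loaded,
# so different LoRA ids can share the same base model replicas.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct:<your-lora-suffix>:<id>",  # placeholder LoRA model id
    messages=[{"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}],
)
print(response.choices[0].message.content)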

Conclusion

By following the instructions in this post, you should now have an understanding of: 

  • How to fine-tune large language models using Anyscale’s llm-forge library 

  • How to serve fine-tuned models using Anyscale’s ray-llm library

To proceed with fine-tuning your custom models, you should:

  1. Create an account on the Anyscale platform and get access to free compute credits

  2. Check out this working example on fine-tuning an LLM.

  3. Check out this working example on deploying an LLM.

Finally, if you want an end-to-end workflow for LLMs that you can adopt, check out our detailed guide here.
