Fine-tuning LLM on your laptop: VRAM vs Shared Memory vs GPU Load, Performance Considerations

I have been playing with Supervised Fine-Tuning (SFT) and LoRA on my laptop with an NVIDIA RTX 4060 8GB. The subject of SFT is vast, picking the correct training hyperparameters is more magic than science, and there's a good deal of experimentation involved…

Still, let me share one small finding: the effect of GPU utilization and shared memory on training speed.

I used the Stable LM 2 1.6B base model and turned it into a chat model using 4,400 samples from the OASST2 dataset. Here is the training file.
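For context, here is a minimal sketch of this kind of LoRA SFT setup using the Hugging Face transformers/peft/trl stack. The model id, LoRA values, dataset path and training params are illustrative assumptions rather than the exact configuration of these runs, and SFTTrainer argument names vary a bit between trl versions.

```python
# Minimal LoRA SFT sketch (transformers + peft + trl); values are illustrative.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "stabilityai/stablelm-2-1_6b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
)

# OASST2 samples already flattened into chat-formatted text (hypothetical file name)
dataset = load_dataset("json", data_files="oasst2_chat_samples.jsonl", split="train")

peft_config = LoraConfig(          # LoRA adapter; rank and target modules are illustrative
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=TrainingArguments(
        output_dir="stablelm-2-chat-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=4,
        bf16=True,
        report_to="wandb",   # the system metrics in the screenshot come from W&B
    ),
)
trainer.train()
```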

Below is a screenshot from W&B showing system metrics for 2 runs.

The only difference* was the number of epochs; all other params were the same:

1) Red run, 4 epochs – 18.5 minutes per epoch
2) Pink run, 11 epochs – 25.5 minutes per epoch

* In fact, during the first run the GPU was at stock frequencies, while the second one had a 10% overclock. If anything, that makes the argument even more pronounced.

Pay attention to GPU Power Usage and how the red line fluctuates around 80W while the pink one averages 62W. Apparently, the GPU was loaded better during the 1st run, which is why its epochs finished faster; the 2nd run's epochs took roughly 40% longer.

Also, pay attention to memory usage: 97% in run #1 and 99.5% in run #2. The share of GPU time spent accessing memory is 30% vs 50%, respectively.

The reason for that is the use of system RAM instead of VRAM: when the GPU runs out of fast memory, it happily spills its data over into slower system memory. Spilling just a few percent of the data was enough to slow the whole process down by roughly 40%. The larger the portion of data sitting in system RAM, the slower the training.
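If you want to keep an eye on this from inside the training script rather than from Task Manager, a quick check against the VRAM ceiling can serve as an early warning. This is just a sketch using standard PyTorch memory queries; it was not part of the original runs.

```python
# Quick check of how close the job is to the VRAM ceiling (PyTorch).
# If reserved memory approaches total_memory, the driver may start spilling
# into shared system RAM silently, which is exactly the slowdown described above.
import torch

def vram_report(device: int = 0) -> None:
    props = torch.cuda.get_device_properties(device)
    total = props.total_memory
    reserved = torch.cuda.memory_reserved(device)    # memory held by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated(device)  # memory actually occupied by tensors
    print(f"{props.name}: total {total / 2**30:.1f} GiB | "
          f"reserved {reserved / 2**30:.1f} GiB ({reserved / total:.0%}) | "
          f"allocated {allocated / 2**30:.1f} GiB")

vram_report()
```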

E.g. during run #2 the GPU tab in Task Manager looked like this (1GB was already consumed before the start of the training):

The takeaway is that you should be mindful of GPU load and watch out for the training job spilling outside of VRAM. It happens silently, without any warnings, and can stretch your training into another day 🙂 If you are only slightly short of VRAM, you'd better play with quantization or batch size and see if there's a way to fit all the data into VRAM and keep nothing in system RAM.
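For what it's worth, here is one way those two knobs can be turned, assuming the Hugging Face stack with bitsandbytes; the values are illustrative, not the settings from these runs.

```python
# Squeezing the job back into VRAM: load the base model in 4-bit with bitsandbytes
# (QLoRA-style) and trade per-step batch size for gradient accumulation.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-1_6b",
    quantization_config=bnb_config,
    device_map="cuda",
)

# Keep the effective batch size constant while shrinking the per-step footprint:
# e.g. batch 4 -> batch 1 with 4 gradient accumulation steps.
args = TrainingArguments(
    output_dir="stablelm-2-chat-lora-4bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
```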

P.S.>

GPU load is not the only factor, and not the most important one, in determining total execution time. E.g. I could see 100W usage when trying out the GaLore method (in blue), yet it was way slower with the same dataset and similar params.

(The gap/straight line is due to a missing internet connection during this period.)
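For reference, GaLore can be switched on through the Hugging Face Trainer roughly like this. This assumes a recent transformers release with the GaLore integration and the galore-torch package installed; the module patterns are illustrative, not the ones used in that run.

```python
# Enabling the GaLore optimizer via TrainingArguments (recent transformers + galore-torch).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="stablelm-2-galore",
    per_device_train_batch_size=1,
    num_train_epochs=4,
    bf16=True,
    optim="galore_adamw",                  # GaLore-wrapped AdamW
    optim_target_modules=["attn", "mlp"],  # layers whose gradients get low-rank projection
)
```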

Besides, I saw roughly a 50% speed-up with the original LoRA version by simply enabling Flash Attention 2, though I had to use WSL2 and run the job under Linux since Windows is not supported yet.
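Enabling it is essentially a one-argument change at model load time, assuming the flash-attn package is installed and you are in that Linux/WSL2 environment; the model id here is the same illustrative one as above.

```python
# Loading the model with Flash Attention 2 (requires the flash-attn package and fp16/bf16 weights).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-1_6b",
    torch_dtype=torch.bfloat16,                 # flash-attn only works with half precision
    attn_implementation="flash_attention_2",    # raises an error if flash-attn is not installed
    device_map="cuda",
)
```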

P.P.S.>

A small life hack to quickly see if the GPU is fully utilized – check the GPU temp in Windows Task Manager.

If it is just a few degrees above idle temp (e.g. 50-60°C), you are underutilizing the GPU. If you hit 80°C, you are good. E.g. the RTX 4060 has a throttling temperature of 87°C, so sitting near that point effectively means 100% utilization. The advice won't hold for desktops: they typically have extra cooling capacity, so the GPU won't be thermally throttled even at full load.
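If you prefer to check this programmatically rather than via Task Manager, the same numbers are available through NVML. A small sketch (pip install pynvml); the polling loop and interval are arbitrary.

```python
# Poll GPU temperature, utilization and power draw via NVML (pip install pynvml).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # .gpu and .memory, in percent
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
    print(f"temp {temp}°C | GPU util {util.gpu}% | mem util {util.memory}% | power {power:.0f} W")
    time.sleep(5)

pynvml.nvmlShutdown()
```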
