Fine-tuning LLM on your laptop: VRAM vs Shared Memory vs GPU Load, Performance Considerations

I have been playing with Supervised Fine-Tuning (SFT) and LoRA on my laptop with an NVIDIA RTX 4060 8GB. The subject of SFT is vast, picking the correct training hyperparameters is more magic than science, and there's a good deal of experimentation involved…

Still, let me share one small finding: the effect of GPU utilization and shared memory on training speed.

I used the Stable LM 2 1.6B base model and turned it into a chat model using 4,400 samples from the OASST2 dataset. Here is the training file.
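For context, here is a minimal sketch of this kind of LoRA SFT setup using the Hugging Face transformers/peft/trl stack. The model id, LoRA values, dataset path and training params are illustrative assumptions rather than the exact configuration of these runs, and SFTTrainer argument names vary a bit between trl versions.

```python
# Minimal LoRA SFT sketch (transformers + peft + trl); values are illustrative.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "stabilityai/stablelm-2-1_6b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
)

# OASST2 samples already flattened into chat-formatted text (hypothetical file name)
dataset = load_dataset("json", data_files="oasst2_chat_samples.jsonl", split="train")

peft_config = LoraConfig(          # LoRA adapter; rank and target modules are illustrative
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=TrainingArguments(
        output_dir="stablelm-2-chat-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=4,
        bf16=True,
        report_to="wandb",   # the system metrics in the screenshot come from W&B
    ),
)
trainer.train()
```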

Below is a screenshot from W&B showing system metrics for 2 runs.

The only difference* was the number of epochs; all other params were the same:

1) Red run, 4 epochs – 18.5 minutes per epoch
2) Pink run, 11 epochs – 25.5 minutes per epoch

* In fact, during the first run the GPU was at stock frequencies, while the second one had a 10% overclock. If anything, that makes the argument even more pronounced.

Pay attention to GPU Power Usage and how the red line fluctuates around 80W while the pink one averages 62W. Apparently, the GPU was loaded better during the 1st run, which is why its epochs finished faster; the 2nd run's epochs took roughly 40% longer.

Also, pay attention to memory usage: 97% in run #1 and 99.5% in run #2. The share of GPU time spent accessing memory is 30% vs 50%, respectively.

The reason for that is the use of system RAM instead of VRAM: when the GPU runs out of fast memory, it happily spills its data over into slower system memory. Spilling just a few percent of the data was enough to slow the whole process down by roughly 40%. The larger the portion of data sitting in system RAM, the slower the training.
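If you want to keep an eye on this from inside the training script rather than from Task Manager, a quick check against the VRAM ceiling can serve as an early warning. This is just a sketch using standard PyTorch memory queries; it was not part of the original runs.

```python
# Quick check of how close the job is to the VRAM ceiling (PyTorch).
# If reserved memory approaches total_memory, the driver may start spilling
# into shared system RAM silently, which is exactly the slowdown described above.
import torch

def vram_report(device: int = 0) -> None:
    props = torch.cuda.get_device_properties(device)
    total = props.total_memory
    reserved = torch.cuda.memory_reserved(device)    # memory held by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated(device)  # memory actually occupied by tensors
    print(f"{props.name}: total {total / 2**30:.1f} GiB | "
          f"reserved {reserved / 2**30:.1f} GiB ({reserved / total:.0%}) | "
          f"allocated {allocated / 2**30:.1f} GiB")

vram_report()
```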

E.g. during run #2 the GPU tab in Task Manager looked like this (1GB was already consumed before the start of the training):

The takeaway is that you should be mindful of GPU load and watch out for the training job spilling outside of VRAM. It happens silently, without any warnings, and can stretch your training into another day 🙂 If you are only slightly short of VRAM, you'd better play with quantization or batch size and see if there's a way to fit all the data into VRAM and keep nothing in system RAM.
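For what it's worth, here is one way those two knobs can be turned, assuming the Hugging Face stack with bitsandbytes; the values are illustrative, not the settings from these runs.

```python
# Squeezing the job back into VRAM: load the base model in 4-bit with bitsandbytes
# (QLoRA-style) and trade per-step batch size for gradient accumulation.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-1_6b",
    quantization_config=bnb_config,
    device_map="cuda",
)

# Keep the effective batch size constant while shrinking the per-step footprint:
# e.g. batch 4 -> batch 1 with 4 gradient accumulation steps.
args = TrainingArguments(
    output_dir="stablelm-2-chat-lora-4bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
```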

P.S.>

GPU load is not the only factor, and not the most important one, in determining total execution time. E.g. I could see 100W usage when trying out the GaLore method (in blue), yet it was way slower with the same dataset and similar params.

(The gap/straight line is due to a missing internet connection during this period.)
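For reference, GaLore can be switched on through the Hugging Face Trainer roughly like this. This assumes a recent transformers release with the GaLore integration and the galore-torch package installed; the module patterns are illustrative, not the ones used in that run.

```python
# Enabling the GaLore optimizer via TrainingArguments (recent transformers + galore-torch).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="stablelm-2-galore",
    per_device_train_batch_size=1,
    num_train_epochs=4,
    bf16=True,
    optim="galore_adamw",                  # GaLore-wrapped AdamW
    optim_target_modules=["attn", "mlp"],  # layers whose gradients get low-rank projection
)
```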

Besides, I saw roughly a 50% speed-up with the original LoRA version by simply enabling Flash Attention 2, though I had to use WSL2 and run the job under Linux since Windows is not supported yet.
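Enabling it is essentially a one-argument change at model load time, assuming the flash-attn package is installed and you are in that Linux/WSL2 environment; the model id here is the same illustrative one as above.

```python
# Loading the model with Flash Attention 2 (requires the flash-attn package and fp16/bf16 weights).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-1_6b",
    torch_dtype=torch.bfloat16,                 # flash-attn only works with half precision
    attn_implementation="flash_attention_2",    # raises an error if flash-attn is not installed
    device_map="cuda",
)
```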

P.P.S.>

A small life hack to quickly see if the GPU is fully utilized – check the GPU temp in Windows Task Manager.

If it is just a few degrees above idle temp (e.g. 50-60°C), you are underutilizing the GPU. If you hit 80°C, you are good. E.g. the RTX 4060 has a throttling temperature of 87°C, so sitting near that point effectively means 100% utilization. The advice won't hold for desktops: they typically have extra cooling capacity, so the GPU won't be thermally throttled even at full load.
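If you prefer to check this programmatically rather than via Task Manager, the same numbers are available through NVML. A small sketch (pip install pynvml); the polling loop and interval are arbitrary.

```python
# Poll GPU temperature, utilization and power draw via NVML (pip install pynvml).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # .gpu and .memory, in percent
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
    print(f"temp {temp}°C | GPU util {util.gpu}% | mem util {util.memory}% | power {power:.0f} W")
    time.sleep(5)

pynvml.nvmlShutdown()
```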
