How And Why To Quantize Large Language Models
Learn how recent advances make it possible to deploy and fine-tune LLMs with billions of parameters on consumer hardware, even on your own laptop!
Large Language Models (LLMs) are taking the world by storm, with recent developments like ChatGPT/GPT-4 as well as open-source models like Llama 2. However, these models are extremely memory intensive.
Let’s calculate how much memory is needed to hold a 10 billion parameter model. If each parameter is a 32-bit float (4 bytes), we need 40 billion bytes in total, or 40 GB! By the same math, a 1B parameter model requires 4 GB, and a 100B parameter model requires 400 GB. For fast inference, these weights need to sit in RAM, or preferably in GPU memory, which makes deploying the models very expensive.
Floating Point Arithmetic And Quantization
Enter quantization: the idea is to convert a 32-bit model to lower precision, such as 8-bit or 4-bit, which reduces memory proportionally. An 8-bit model takes 4x less memory than a 32-bit one, making it much more affordable to serve. Smaller models (under ~10B parameters) can then be run locally on consumer-grade CPUs or GPUs.
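To make the arithmetic above concrete, here is a minimal back-of-the-envelope helper in plain Python. It only counts the weights themselves; real deployments also need memory for activations, the KV cache, and framework overhead.

```python
# Rough memory needed just to hold model weights, ignoring activations,
# the KV cache, and framework overhead.
BYTES_PER_PARAM = {
    "fp32": 4.0,       # 32-bit float
    "fp16/bf16": 2.0,  # 16-bit float
    "int8": 1.0,       # 8-bit quantized
    "int4": 0.5,       # 4-bit quantized
}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (using 1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for params in (1e9, 10e9, 100e9):
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):,.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{params / 1e9:>5.0f}B params -> {sizes}")

# 10B params -> fp32: 40.0 GB, fp16/bf16: 20.0 GB, int8: 10.0 GB, int4: 5.0 GB
```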
For an introduction to floating point precision, check out the image below. As you can see, FP32 contains 32 bits in total, twice as many as FP16.
Training an ML model in fp16 halves the memory footprint, but it does not always work out well due to fp16’s limited dynamic range: model weights and gradients can go out of range during training, which often hurts performance. To address this, new floating point formats were introduced for deep learning: BF16 and TF32.
As you can see below, the bfloat16 format has the same dynamic range as fp32 (it keeps the 8 exponent bits). The drawback is that it has fewer mantissa bits and therefore less precision, but researchers found that you can train models in bfloat16 with almost the same performance as fp32.
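If you have PyTorch installed, you can inspect this trade-off directly: `torch.finfo` reports the representable range (`max`) and the precision (`eps`, machine epsilon) of each floating point type.

```python
import torch

# bf16 keeps fp32's 8 exponent bits (same range) but has far fewer
# mantissa bits (less precision); fp16 makes the opposite trade-off.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  eps={info.eps:.3e}")

# torch.float32   max=3.403e+38  eps=1.192e-07
# torch.float16   max=6.550e+04  eps=9.766e-04
# torch.bfloat16  max=3.390e+38  eps=7.812e-03
```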
Similarly, NVIDIA introduced the TF32 format, which they showed delivers up to a 6x speedup when training BERT.
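On Ampere-or-newer NVIDIA GPUs, PyTorch can use TF32 tensor cores for float32 matrix multiplications. As a sketch, opting in is a matter of flipping a couple of flags (the defaults have changed between PyTorch versions, so check yours):

```python
import torch

# Allow TF32 tensor cores for matmuls and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent higher-level switch in recent PyTorch versions:
# "high" permits TF32 for float32 matmuls, "highest" keeps full fp32.
torch.set_float32_matmul_precision("high")
```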
Model Quantization
While the above formats concern the precision used while training a model, there are two other directions of quantization: post-training quantization and quantization-aware training.
GPTQ was a groundbreaking paper that tackles the loss of information that comes with quantization. It converts models to roughly int4 precision, yielding nearly 4x memory savings relative to fp16, by minimizing the mean squared error between the outputs of the original and the quantized weights, layer by layer.
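As a sketch of how this looks with the Hugging Face Transformers integration (which relies on the `optimum` and `auto-gptq` packages): the model name and calibration dataset below are just examples, and exact arguments may differ between library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-1.3b"  # example model; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a small calibration dataset to minimize the layer-wise
# quantization error; "c4" is one of the built-in options.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Loading with a GPTQConfig triggers quantization layer by layer.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# The quantized weights can then be saved and re-loaded like any other model.
model.save_pretrained("opt-1.3b-gptq-4bit")
```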
The bitsandbytes package also makes quantization easy for Hugging Face Transformers models. Overall, bitsandbytes quantization is slightly slower at inference than GPTQ-quantized models. In experiments, the quantized models in most cases ran slower at inference (measured as throughput in tokens generated per second) than the corresponding fp16 models.
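With bitsandbytes you don’t quantize ahead of time; weights are quantized on the fly as the model loads. A minimal sketch, assuming recent `transformers` and `bitsandbytes` versions and an example model name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example; any causal LM works

# 4-bit NF4 quantization with bf16 compute, as popularized by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
)

inputs = tokenizer("Quantization lets you", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```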
Run Quantized LLMs On Your Laptop!
Two factors have led to the boom of desktop apps like LM Studio that let you download models and run them locally. The first is model quantization: you can now download a 7B parameter model, and with 4-bit quantization it takes only about 3–4 GB of memory, so it can be hosted in RAM or GPU memory with decent inference speed.

The second is the ggml library, a C/C++ tensor library for machine learning designed to run LLMs efficiently on commodity hardware. Because it ships as plain C/C++ executables, these models are easy to run on practically any operating system. GGML’s quantization is simpler and less sophisticated than GPTQ: it essentially just rounds weights to lower precision. A prolific Hugging Face community member, TheBloke, has added 350+ ggml fine-tuned and quantized models to the Hugging Face model hub. These can easily be imported into most LLM desktop apps, like LM Studio, to chat with.
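To make the “just rounds weights” point concrete, here is a minimal round-to-nearest sketch in NumPy. It is not GGML’s actual on-disk format (GGML quantizes weights in small blocks, each with its own scale), only the basic idea of symmetric absmax quantization to int8:

```python
import numpy as np

def quantize_rtn_int8(w: np.ndarray):
    """Symmetric absmax round-to-nearest quantization to int8."""
    scale = np.abs(w).max() / 127.0                       # map largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32) * 0.02      # fake weight row
q, scale = quantize_rtn_int8(w)
w_hat = dequantize(q, scale)

print("memory: %d -> %d bytes" % (w.nbytes, q.nbytes))    # 16384 -> 4096
print("max abs error: %.6f" % np.abs(w - w_hat).max())    # small, but nonzero
```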
Fine-Tuning And Quantization
It is not possible to perform full training on quantized models, but you can use parameter-efficient fine-tuning with Low-Rank Adapters (LoRA): instead of updating the entire model, you fine-tune small adapter layers injected into it. So the workflow is to take an existing pre-trained LLM (or train one from scratch), quantize it with GPTQ or bitsandbytes, and then fine-tune it with QLoRA (the Q stands for quantized).
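A sketch of this recipe using the `peft` library, building on the 4-bit model loaded in the earlier bitsandbytes example; `target_modules` depends on the model architecture, and the training loop itself is omitted:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit bitsandbytes model from the earlier sketch.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],   # attention projections, model-specific
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable

# From here you can train with the usual Trainer / SFTTrainer loop:
# the base weights stay frozen in 4-bit, only the LoRA adapters get gradients.
```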
Takeaways And Future
From a model/app architecture perspective, quantization is a no-brainer: it reduces the model’s memory requirements, making it easier and cheaper to deploy and fine-tune. Of course, you want minimal response degradation from quantized models. Nothing is free, and quantizing a model reduces the precision of its weights. However, Hugging Face experiments have shown minimal degradation for GPTQ- and bitsandbytes-quantized models at inference time. Much less is known about ggml-quantized models, and the quality of their responses remains largely unexplored.
In an era of GPU scarcity and significant hardware bottlenecks for deploying 10–100B parameter models, model quantization and quantized fine-tuning look like promising solutions. There has been an explosion of LLM desktop apps leveraging quantized models that can run locally on your laptop. But validating the response quality of these smaller models in custom domains is a whole different challenge.
Exciting times ahead!
References:
https://huggingface.co/blog/hf-bitsandbytes-integration
https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
https://huggingface.co/blog/overview-quantization-transformers
https://arxiv.org/pdf/2210.17323.pdf
Introduction to Quantization cooked in 🤗 with 💗🧑🍳 (huggingface.co)
LM Studio — Discover, download, and run local LLMs
GitHub — ggerganov/ggml: Tensor library for machine learning
TheBloke (Tom Jobbins) (huggingface.co)
Here are some related articles:
LLM Evaluations:
4 Crucial Factors for Evaluating Large Language Models in Industry Applications
How Do You Evaluate Large Language Model Apps — When 99% is just not good enough?
Extractive vs Generative Q&A — Which is better for your business?
When Should You Fine-Tune LLMs?
Unleashing the Power of Generative AI For Your Customers
LLM Deployment:
Deploying Open-Source LLMs As APIs
LLM Systems Design:
Build Industry-Specific LLMs Using Retrieval Augmented Generation
How Do You Build A ChatGPT-Powered App?
Fine-Tune Transformer Models For Question Answering On Custom Data
LLM Economics: ChatGPT vs Open-Source
The Economics of Large Language Models
LLM Tooling:
Build A Custom AI Based ChatBot Using Langchain, Weviate, and Streamlit
How OpenAI’s New Function Calling Breaks Programming Boundaries