Here’s how it works: first, we “quantize” the weights in the neural network that makes up our language model. This means taking all those fancy numbers (which are usually floating point values with lots of decimal places) and mapping them onto small integers, typically along with a shared scale factor so we can approximately recover the original values. This might sound like a bad idea at first, but it dramatically reduces the amount of memory needed to store these weights in your phone’s RAM.
For example, let’s say we have a weight value that looks something like this: 0.3456789123456789. If we quantize it to an 8-bit integer, we don’t store that float directly; we divide it by a scale factor shared across a group of weights and round the result to the nearest whole number between -128 and 127, so it might end up stored as something like 45, with the scale saved once for the whole group. That throws away some precision, but it’s not as bad as you might think! In practice, careful quantization loses surprisingly little accuracy, and combining it with techniques like pruning or sparsity can shrink the model even further.
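To make that concrete, here’s a minimal sketch of symmetric 8-bit quantization in plain NumPy. It’s an illustration of the storage idea only; GPTQ itself is more sophisticated (it quantizes weights in groups using calibration data to minimize error), and the function names here are made up for this example.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights onto signed 8-bit integers using one shared scale factor."""
    scale = np.max(np.abs(weights)) / 127.0           # largest weight maps to +/-127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored integers."""
    return q.astype(np.float32) * scale

# The weight 0.3456789... becomes a small integer plus a shared scale.
w = np.array([0.3456789123456789, -0.98, 0.01], dtype=np.float32)
q, scale = quantize_int8(w)
print(q, scale)                   # roughly [ 45 -127    1] with scale ~0.0077
print(dequantize_int8(q, scale))  # close to the original values, 1/4 the storage
```

Each weight now takes one byte instead of four (or two), which is exactly where the memory savings come from.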
We also need a way to keep these quantized weights available without them hogging your phone’s limited RAM. This is where “swap memory” comes in. Instead of loading the whole model up front, we keep the weight file on the phone’s flash storage and let the operating system page chunks of it into RAM on demand, evicting the chunks we haven’t touched in a while.
For example, let’s say you’re using your phone to write an email and your language model needs to access some weights in order to generate a response. Instead of holding the entire model in your phone’s main RAM (which is expensive on a device that’s also running everything else), only the pages of weights actually being used get pulled in from storage, and recently used pages stay cached so repeated accesses are fast. This way, the weights are there when you need them without the whole model competing for memory with the rest of your apps.
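Here’s a rough sketch of the same idea using NumPy’s memory mapping; the file name and matrix sizes are hypothetical, and the operating system’s paging machinery does the actual swapping for us.

```python
import numpy as np

ROWS, COLS = 4_096, 4_096   # hypothetical quantized weight matrix (~16 MB as int8)

# Write the quantized weights to storage once (e.g. when exporting the model).
weights_int8 = np.random.randint(-128, 128, size=(ROWS, COLS), dtype=np.int8)
weights_int8.tofile("weights_int8.bin")

# Later, on the phone: map the file instead of loading it all into RAM.
mapped = np.memmap("weights_int8.bin", dtype=np.int8, mode="r",
                   shape=(ROWS, COLS))

# Only the rows we actually read get paged into RAM; untouched rows stay on
# storage, and the OS may evict pages that haven't been used recently.
needed_rows = mapped[100:110]   # triggers a small, targeted read
print(needed_rows.shape)        # (10, 4096)
```

Real inference engines use the same trick at a larger scale: memory-map the quantized weight file and let demand paging decide what actually lives in RAM at any moment.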
Quantization and Swap Memory for GPTQ: a fancy-sounding title that actually makes sense once you break it down. And who knows? Maybe someday running these language models entirely on our phones, no internet connection required, will just be the norm (although let’s not get too ahead of ourselves).