Comparing Advantages: Merged vs. On-the-Fly LoRA
LoRA (Low-Rank Adaptation) is a popular and efficient way to fine-tune a model: rather than updating all parameters of a pre-trained model, LoRA freezes the original weights and introduces small trainable low-rank matrices whose product is added to a subset of the model’s weight matrices.
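To make that concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. The class name, rank, and scaling are illustrative choices, not taken from any particular library:

```python
# Minimal sketch of a LoRA-style linear layer: the pre-trained weight W is
# frozen and only the low-rank factors A and B are trained. Rank and alpha
# values here are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)          # freeze W
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # On-the-fly form: W x plus the scaled low-rank update B(Ax).
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```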
I was researching the differences and trade-offs between loading a LoRA adapter on the fly and merging it into the base model. This is what I found out:
In theory, a merged LoRA should produce exactly the same results as one applied on the fly, but in practice there are edge cases where this might not hold:
Quantization Effects
If the base model is quantized (reduced precision to save memory), merging the non-quantized LoRA weights can actually degrade performance. This occurs because:
- Quantized models use reduced precision (e.g., 4-bit or 8-bit) representations.
- LoRA adapters typically use full precision (16-bit or 32-bit).
- When merged, the base weights generally have to be dequantized, updated with the adapter, and re-quantized, which introduces additional rounding error. In such cases, applying LoRA on the fly can preserve model quality better than merging (see the sketch below).
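As a concrete illustration, here is a minimal sketch (assuming the Hugging Face transformers, bitsandbytes, and peft libraries; the model id and adapter path are placeholders) that keeps a 4-bit quantized base model intact and attaches the LoRA adapter at runtime instead of merging it:

```python
# Keep a 4-bit quantized base model and attach a LoRA adapter on the fly.
# The adapter weights stay in 16-bit and are applied at runtime, so the
# quantized base weights are left untouched. Model id and adapter path
# are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "base-model-name",                    # placeholder model id
    quantization_config=bnb_config,
)

model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
```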
Numerical Precision Variations
- Even with non-quantized models, the two approaches perform the same computation in a different order: a merged model computes (W + BA)x, while on-the-fly application computes Wx + B(Ax). These are identical in exact arithmetic, but floating-point rounding can introduce microscopic differences between them. Such differences are typically negligible, though they could in principle accumulate in extremely deep networks (the toy example below illustrates the effect).
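A toy PyTorch example of this effect; the shapes and values are arbitrary, and the point is only that the two computation orders round differently:

```python
# Toy illustration: merged vs. on-the-fly computation of a LoRA layer.
# In exact arithmetic both give the same output; in floating point the
# results usually differ by a tiny rounding error.
import torch

torch.manual_seed(0)
W = torch.randn(512, 512)   # base weight
A = torch.randn(8, 512)     # LoRA down-projection
B = torch.randn(512, 8)     # LoRA up-projection
x = torch.randn(512)

merged_out = (W + B @ A) @ x           # merged: fold the update into W first
on_the_fly_out = W @ x + B @ (A @ x)   # on-the-fly: separate low-rank path

print((merged_out - on_the_fly_out).abs().max())  # tiny, usually non-zero
```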
Advantages of Merged Models
- Performance: Merged models typically have faster inference times since they eliminate the need for additional computational steps during inference.
- Memory efficiency: During inference, merged models require less runtime memory as they don’t need to store separate sets of weights.
- Deployment simplicity: A merged model is a single artifact, making deployment more straightforward in production environments (a merging sketch follows this list).
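For completeness, here is a minimal merging sketch using peft's merge_and_unload; the model id and paths are placeholders:

```python
# Merge a LoRA adapter into a full-precision base model and save the
# result as a single standalone checkpoint for deployment.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-name")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Folds the low-rank update into the base weights and drops the adapter
# modules, leaving a plain transformers model.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model/")
```

The result is a plain transformers checkpoint with no adapter machinery needed at inference time.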
Advantages of On-the-Fly Application
- Storage efficiency: Storing multiple LoRA adapters separately requires significantly less disk space than storing multiple complete merged models.
- Flexibility: Users can dynamically swap different LoRA adapters in and out of memory without reloading the entire model.
- Adapter combination: Multiple LoRA adapters can be applied simultaneously with different weights, enabling more complex adaptation scenarios (see the sketch after this list).
- Preserves quantization benefits: When working with quantized models, on-the-fly application avoids potential quality degradation from merging.
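A minimal sketch of adapter swapping and combination with peft; adapter names and paths are placeholders, and the available combination options depend on the peft version:

```python
# Keep one base model in memory and swap or combine LoRA adapters at
# runtime. Adapter paths and names below are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-name")
model = PeftModel.from_pretrained(base, "path/to/lora-style-a",
                                  adapter_name="style_a")

# Load a second adapter alongside the first and switch between them
# without reloading the base model.
model.load_adapter("path/to/lora-style-b", adapter_name="style_b")
model.set_adapter("style_b")

# Blend the two adapters into a new weighted combination.
model.add_weighted_adapter(
    adapters=["style_a", "style_b"],
    weights=[0.7, 0.3],
    adapter_name="blended",
    combination_type="linear",
)
model.set_adapter("blended")
```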
