
Comparing Advantages: Merged vs. On-the-Fly LoRA

Mar 24, 2025

LoRA (Low-Rank Adaptation) is a popular and efficient way to quickly fine-tune a model. Rather than updating all parameters of a pre-trained model, LoRA introduces trainable low-rank matrices that modify only a subset of the model’s weights.
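As a rough sketch of that idea in plain PyTorch (the LoRALinear wrapper below is hypothetical, not any particular library’s implementation), the trainable low-rank factors A and B add an update on top of a frozen linear layer:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad_(False)
        # Trainable low-rank factors: A is (r, in_features), B is (out_features, r)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # On-the-fly application: the low-rank path runs alongside the frozen base layer
        return self.base(x) + self.scaling * ((x @ self.lora_A.T) @ self.lora_B.T)
```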

I was researching the differences and advantages of applying a LoRA on the fly versus merging it into the base model. This is what I found out:

Theoretically, a merged LoRA should produce the same results as an on-the-fly LoRA, but in the real world there are edge cases where this might not be true:

Quantization Effects

If the base model is quantized (reduced precision to save memory), merging the non-quantized LoRA weights can actually degrade performance. This occurs because:

  • Quantized models use reduced precision (e.g., 4-bit or 8-bit) representations.
  • LoRA adapters typically use full precision (16-bit or 32-bit).
  • When merging, the full-precision adapter has to be folded into the quantized weights (typically by dequantizing, adding the update, and re-quantizing), which can introduce extra rounding error, so the precision benefits of quantization may be partially lost. In such cases, applying LoRA on the fly might preserve model quality better than merging (see the sketch below).
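As a minimal sketch of the on-the-fly route, assuming the Hugging Face transformers and peft libraries with bitsandbytes 4-bit quantization (the model name and adapter path are placeholders): the base weights stay 4-bit, while the adapter stays in higher precision and is applied at inference time:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantized base model (placeholder model name)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
)

# The adapter weights remain in higher precision and are applied on the fly,
# so the quantized base weights are never rewritten.
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
```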

Numerical Precision Variations

  • Even with non-quantized models, tiny differences in computation order or floating-point arithmetic might introduce microscopic variances between the two approaches. These differences are typically negligible but could theoretically accumulate in extremely deep networks; the short example below illustrates the effect.
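A toy illustration in PyTorch (shapes and rank are arbitrary): the merged path computes (W + BA)x in a single matmul, while the on-the-fly path computes Wx + B(Ax); the two orderings usually disagree in the last few bits:

```python
import torch

torch.manual_seed(0)
W = torch.randn(1024, 1024)   # base weight
A = torch.randn(8, 1024)      # low-rank factor A (rank 8)
B = torch.randn(1024, 8)      # low-rank factor B
x = torch.randn(1024)

merged = (W + B @ A) @ x          # merge first, then one matmul
on_the_fly = W @ x + B @ (A @ x)  # base path plus low-rank path

# Tiny (often non-zero) discrepancy purely from floating-point ordering
print((merged - on_the_fly).abs().max())
```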

Advantages of Merged Models

  • Performance: Merged models typically have faster inference, since the low-rank update is already folded into the base weights and no extra computation is needed per forward pass.
  • Memory efficiency: During inference, merged models require less runtime memory as they don’t need to store separate sets of weights.
  • Deployment simplicity: A merged model is a single artifact, making deployment more straightforward in production environments (a merge-and-save sketch follows this list).
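For deployment, a common pattern with the peft library is to fold the adapter into the base weights and save a single checkpoint. A sketch, again with placeholder names (merge_and_unload is peft’s method for this):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

merged = model.merge_and_unload()       # W <- W + (alpha/r) * BA, adapter modules removed
merged.save_pretrained("llama-merged")  # one self-contained checkpoint to deploy
```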

Advantages of On-the-Fly Application

  • Storage efficiency: Storing multiple LoRA adapters separately requires significantly less disk space than storing multiple complete merged models.
  • Flexibility: Users can dynamically swap different LoRA adapters in and out of memory without reloading the entire model.
  • Adapter combination: Multiple LoRA adapters can be applied simultaneously with different weights, enabling more complex adaptation scenarios (a blend of two adapters is shown in the sketch after this list).
  • Preserves quantization benefits: When working with quantized models, on-the-fly application avoids potential quality degradation from merging.
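A sketch of the on-the-fly workflow with peft (adapter paths and names are placeholders): one base model stays in memory while adapters are loaded, switched, and blended:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder
model = PeftModel.from_pretrained(base, "path/to/summarization-lora",
                                  adapter_name="summarize")
model.load_adapter("path/to/translation-lora", adapter_name="translate")

# Swap tasks without reloading the multi-gigabyte base model
model.set_adapter("translate")

# Blend two adapters with chosen weights into a new adapter
model.add_weighted_adapter(
    adapters=["summarize", "translate"],
    weights=[0.7, 0.3],
    adapter_name="blend",
    combination_type="linear",
)
model.set_adapter("blend")
```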

Written by Alessandro Cauduro

Technology and sports enthusiast. Chief AI Officer @ azion.com
