Generative AI on your own data: finetuning with LoRA
Generative AI models open many doors for organizations that want to handle information more intelligently. Think of systems that generate internal reports, summarize documents, or answer questions in a company-specific style. The most challenging step is translating that general AI power into a context where your own data is central.
In the ecosystem around language models, there are three main routes for this. The figure alongside positions them clearly. Prompt Engineering is the most accessible approach: you design targeted instructions and send them to the model, with or without adding your company-specific data. Retrieval-Augmented Generation (RAG) links an external knowledge base to an LLM so that it can respond based on retrieved context. And with Fine-Tuning, the model itself is retrained to perform structurally better within your domain.
In this blog post, we focus on the last of these techniques: finetuning. We discuss what finetuning entails, when it is best used, and how modern techniques such as LoRA and QLoRA make this step practically feasible.
Quick facts
- Finetuning as an alternative to RAG and prompting
- Large models, large challenges
- Finetuning with limited hardware: (Q)LoRA
- Prevent catastrophic forgetting
Finetuning as a foundation for specialized AI models
From base model to customized model
Finetuning is the process of further training an existing language model with a smaller, task-specific dataset. The figure below illustrates this as a simple flow: you start with a model that has learned from a large general corpus and refine its behavior by training it further on your own data. This results in a final model that better fits concrete real-world applications.

Finetuning ensures that knowledge and task behavior no longer have to be supplied through the prompt each time, but are directly present in the model. This shortens prompts and reduces inference cost.
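To make this concrete, here is a minimal sketch of such a further-training step with the Hugging Face transformers library. The model name and the training file are placeholder assumptions; in practice you would use your own base model and domain data.

```python
# Minimal finetuning sketch with Hugging Face transformers.
# Model name and data file are placeholders, not recommendations.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"                  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token                # needed for padding
model = AutoModelForCausalLM.from_pretrained(model_name)

# Your own, smaller task-specific dataset (e.g. internal reports as plain text).
dataset = load_dataset("text", data_files={"train": "company_docs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # the resulting model carries the domain behavior itself
```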
When Do You Use Finetuning?
Finetuning is not always necessary. For many use cases, a well-thought-out prompt or a RAG architecture is sufficient. You opt for finetuning when the needs become more specific and structured.

The figure above illustrates this. Typical scenarios for using finetuning are:
- The model needs in-depth expertise in a specific domain, at a larger scale than RAG alone can cover.
- You want to apply a model to a new (programming) language.
- You have a new dataset with a large number of documents that remain relevant.
- Inference speed is important.
- The information is too extensive or too specialized for RAG.
Finetuning is recommended once you are working at a scale larger than what RAG alone can handle. The downsides of finetuning are the cost of training the model, the large amount of data needed, and the risk of hallucination and of forgetting the information the model was originally trained on (catastrophic forgetting).
Overview of Finetuning Methods
Within finetuning, there are various technical approaches. We discuss two of them.
Full Finetuning

The above figure shows the principle of full finetuning. In this case, all parameters of the base model are adjusted. This provides complete freedom but requires heavy hardware and a lot of training data. The result can be very performant for one task but less suitable for modular environments.
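As a small illustration (assuming PyTorch and the transformers library, with a small stand-in model), full finetuning simply means that every parameter stays trainable and the optimizer has to track state for all of them:

```python
# Sketch: in full finetuning every parameter of the base model stays trainable.
# "gpt2" is just a small stand-in; the principle is the same for larger LLMs.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

for p in model.parameters():
    p.requires_grad = True                        # everything gets updated

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")     # all ~124M of them

# AdamW keeps extra state per parameter, which is what makes this so memory-hungry.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```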
Parameter Efficient Finetuning (PEFT)

PEFT is a whole family of methods that you can use to efficiently retrain your model. The PEFT landscape is very broad. The central idea is that you do not retrain the entire model, but add small extensions.
PEFT offers:
- Limited memory requirements
- Faster training
- Less risk of overfitting and catastrophic forgetting
- Reduced need for data
- Use of adapters alongside the base model
Within this landscape, LoRA is the most used and most production-ready technique.
How does LoRA work?
When a language model is finetuned in the classical way, a gradient must be maintained for every parameter during training. In large LLMs, this leads to an enormous demand for GPU memory (a 7-billion-parameter model already occupies roughly 14 to 17 gigabytes in 16-bit precision before gradients and optimizer states are added). However, many of those parameters contain overlapping or redundant information. LoRA specifically addresses this.
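A rough back-of-the-envelope calculation (assuming 16-bit weights and gradients and a standard AdamW optimizer) shows why the memory demand explodes for full finetuning:

```python
# Back-of-the-envelope memory estimate for fully finetuning a 7B-parameter model.
# Assumes fp16 weights/gradients and AdamW with two fp32 state tensors per parameter.
params = 7e9

weights   = params * 2   # 2 bytes per fp16 weight   -> ~13 GB
gradients = params * 2   # 2 bytes per fp16 gradient -> ~13 GB
optimizer = params * 8   # 2 x 4 bytes of fp32 AdamW state -> ~52 GB

total_gb = (weights + gradients + optimizer) / 1024**3
print(f"~{total_gb:.0f} GB needed before activations are even counted")   # ~78 GB
```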

LoRA (Low-Rank Adaptation) starts from the idea that you do not need to modify the original, pre-trained model. Instead, the existing weights are completely frozen. On top of certain layers of the model, small additional matrices are added that only learn the necessary corrections.

Concretely, the full weight matrix is not retrained; instead, a compact "delta matrix" is learned that is split into two much smaller low-rank matrices. During inference, the model first calculates the answer with the original weights, and the LoRA layers then add a limited, targeted adjustment to it.
This creates a model that exhibits specialized behavior, while the base model remains generic and stable.
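The sketch below shows this idea in plain PyTorch. It is a simplified illustration, not the actual LoRA implementation: the original weight matrix stays frozen and only the two small matrices A and B are trained.

```python
# Simplified LoRA layer: the frozen base weight is left untouched and only the
# low-rank matrices A and B (whose product forms the "delta matrix") are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                           # freeze original weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))          # d_out x r, starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # original answer + a limited, targeted low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
frozen = 4096 * 4096                                          # ~16.8M frozen weights
trainable = sum(p.numel() for p in (layer.A, layer.B))        # only ~65k trainable
print(frozen, trainable)
```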
Main benefits:
- Only a fraction of the parameters is trainable.
- Training becomes possible on lighter hardware.
- Adapters are small and quick to store.
- Multiple LoRA adapters can coexist.
- Quick switching between tasks.
- No catastrophic forgetting.
LoRA makes fine-tuning modular: you can combine one powerful base model with various task-specific extensions.
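In practice you rarely write this yourself. A sketch with the Hugging Face PEFT library could look as follows; the model name, the target_modules, and the adapter directory name are assumptions for a Llama-style model.

```python
# Sketch with the Hugging Face PEFT library; model name and target_modules are
# assumptions for a Llama-style architecture and differ per model family.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank matrices
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which layers receive adapters
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()    # typically well under 1% of all parameters

# After training, only the small adapter is saved; the base model stays generic.
peft_model.save_pretrained("adapter-company-reports")   # hypothetical adapter name
```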
What is QLoRA?
QLoRA builds on LoRA but adds an extra optimization step: quantization of the base model.
In QLoRA, the pre-trained language model is first loaded in a highly compressed representation, usually 4-bit. This means that each parameter takes up much less memory than in 16- or 32-bit precision. LoRA layers are then placed on this quantized model, which are trained in higher precision.

The combination works as follows:
- Download and load a quantized base model
- Freeze all original weights
- Train only the compact LoRA adapters
- Use the finetuned result for fast inference
Because the base model is so lightweight, QLoRA can also be applied to models with tens of billions of parameters without the need for professional GPU clusters. The inference remains efficient because the LoRA layers learn exactly which information is crucial for the specific task.
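A QLoRA setup could look like the sketch below, using the transformers, peft, and bitsandbytes libraries; the model name is again an illustrative assumption.

```python
# Sketch of a QLoRA setup: a 4-bit quantized base model with LoRA adapters on top.
# Assumes the transformers, peft and bitsandbytes packages are installed.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # compress the base weights to 4-bit
    bnb_4bit_quant_type="nf4",               # the NormalFloat4 data type used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",             # larger base model, loaded in compressed form
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)   # freeze and prepare the quantized base

lora_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)       # only the adapters will be trained
```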
QLoRA offers:
- Drastically lower hardware requirements
- Faster and cheaper training
- Possibility to use very large models
- Lower energy consumption
- Production-worthy performance
Combine techniques for the best result
Finetuning produces a specialized model, but it is rarely used in isolation. In practice, you want deep domain knowledge, up-to-date information, and control over behavior.
The figure "Combine fine-tuning and RAG" shows that each technique has its strengths. It is therefore best to combine them as follows:
- Prompting guides the interaction style of your agent. Here you can optionally add a limited amount of your own data, kept small so that costs do not rise significantly.
- RAG provides current, dynamic context from your internal documents. This allows for quickly adding new data without retraining.
- Finetuning with LoRA/QLoRA integrates stable domain expertise directly into the model. You do this from time to time to help your model learn to handle a large amount of static new data.
By using a fine-tuned base model within a pipeline with prompting and retrieval, you get:
- shorter prompts
- fewer hallucinations
- faster inference
- maximum flexibility with new data
This is how you build generative AI systems that are both performant and maintainable.
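As a closing sketch, such a combined pipeline could look like the code below. The retrieve() function and the adapter directory are hypothetical placeholders standing in for your own retriever and your own finetuned adapter.

```python
# Sketch of the combined pipeline: LoRA-finetuned model + retrieval + style prompt.
# retrieve() and the adapter directory are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Llama-2-7b-hf"                        # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name)
model = PeftModel.from_pretrained(base, "adapter-company-reports")  # finetuned expertise

def retrieve(question: str) -> str:
    # Stand-in for a real retriever (e.g. a vector search over internal documents).
    return "…relevant passages from internal documents…"

def answer(question: str) -> str:
    context = retrieve(question)                              # RAG: current context
    prompt = (
        "You answer in our company's reporting style.\n"      # prompting: behavior
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```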
This blog post is made possible by the Interreg Flanders-Netherlands project Art-IE.
