Energy-efficient AI: lower consumption, same performance
AI models are becoming increasingly powerful, but often also larger. Large models require a lot of computing power, memory, and energy, which makes them less suitable for edge devices and for real-time computer vision applications such as the Howest Virtual Mirror. Fortunately, there are techniques to shrink AI models without sacrificing accuracy. In this blog post, we discuss the main optimization techniques: quantization, pruning, and knowledge distillation. We also look at which tools you can use and how significant the impact can be.
Quick facts
- From 32-bit to 8-bit? Same model, 4x smaller.
- Pruning = faster inference.
- Small model, greater intelligence: that is distillation.
Three techniques
1. Quantization
Introduction
Inside an AI model there are millions to billions of decimal numbers, also known as "weights" and "activations". Each of these numbers typically takes up 32 bits (zeros and ones) of memory.
For a computer, calculating with decimal numbers is actually quite expensive. A single calculation takes only a few nanoseconds (or less), but because there are so many of them, the delay accumulates to the point where the final model needs seconds to minutes to process its data.
Example: every sub-word generated by GPT-4 has to be processed by 1 to 2 trillion parameters. Just imagine how many calculations are needed to generate a complete text. This is why there is such a great need for powerful GPUs.
Quantizing an existing model
Post-training quantization (PTQ) is the process by which the numbers in a pre-trained model are converted from 32-bit floating point (FP32) to a lower precision, such as 16-bit (FP16) or even 8-bit integers (INT8). Concretely, you then keep fewer digits after the decimal point, or none at all.
Impact:
- Up to 4x reduction in model size (from FP32 to INT8).
- 2x to 4x speed improvement.
- Often minimal loss in accuracy (strongly depends on the type of model).
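
To make the 4x figure above concrete, here is a quick back-of-the-envelope calculation for a hypothetical model of 1 billion parameters (the parameter count is purely illustrative):

```python
# Rough storage estimate for a hypothetical 1-billion-parameter model.
num_params = 1_000_000_000

fp32_bytes = num_params * 4  # FP32: 32 bits = 4 bytes per weight
int8_bytes = num_params * 1  # INT8:  8 bits = 1 byte per weight

print(f"FP32: {fp32_bytes / 1e9:.1f} GB")            # ~4.0 GB
print(f"INT8: {int8_bytes / 1e9:.1f} GB")            # ~1.0 GB
print(f"Reduction: {fp32_bytes / int8_bytes:.0f}x")  # 4x
```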
What people often do is give the quantized model a short retraining or fine-tuning pass. This often brings the accuracy back to the original level of the unquantized model.

Quantization-aware training
A second possibility is Quantization-Aware Training (QAT). Here, the model is trained with already rounded numbers to simulate the quantization effect. This typically yields even better results than PTQ, but you have to plan for it in advance, because it means (re)training your model.
Examples: on Ollama you can find QAT variants of some language models, such as the gemma3 family, that are up to 3 times faster (https://ollama.com/library/gemma3:4b-it-qat). For computer vision models like YOLO there are also PTQ variants, such as YOLOv8-Detection-Quantized (https://huggingface.co/qualcomm/YOLOv8-Detection-Quantized).
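
As a hedged sketch of what QAT can look like in code, the snippet below uses the TensorFlow Model Optimization Toolkit; the tiny Keras model and the commented-out training call are placeholders, not a real use case:

```python
# Minimal QAT sketch: fake-quantization ops are inserted into the model
# so that it learns to cope with INT8 rounding during training.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.Sequential([          # placeholder architecture
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# qat_model.fit(x_train, y_train, epochs=3)  # train (or fine-tune) as usual

# Afterwards, convert to a fully quantized TFLite model for deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```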
Frameworks
There are several Python libraries that contain functions to quantize your model:
- Use the "tf.lite.Optimize" library forTensorFlowmodels.
- Use the "torch.ao.quantization" library forPyTorchmodels. This is a part of the PyTorch Architecture Optimization library.
- Use the "onnxruntime.quantization" library for models that have been exported as a standardizedONNX format.
- Hugging Face'sOptimumThe library has quantization capabilities for ONNX models and Intel systems.
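
As a small illustration of the PyTorch option, the sketch below applies dynamic post-training quantization with torch.ao.quantization; the model is a placeholder and only the nn.Linear layers are converted to INT8:

```python
# Minimal PTQ sketch: dynamic quantization converts the weights of the
# selected layer types (here nn.Linear) from FP32 to INT8.
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

model = nn.Sequential(          # placeholder for any trained FP32 model
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

quantized_model = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# The quantized model is a drop-in replacement for the original.
x = torch.randn(1, 784)
print(quantized_model(x).shape)  # torch.Size([1, 10])
```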
2. Pruning
Pruning removes elements from the model that contribute little to the final result. Removing unnecessary parameters decreases the number of calculations that have to be performed, which makes the model faster.
There are two types of pruning:
- Unstructured pruning: removal of individual weights or connections.
- Structured pruning: removal of entire filters, neurons, or layers.

Here, too, you can retrain or fine-tune the model after pruning to compensate for the small loss in accuracy.
Frameworks
There are Python libraries that contain functions to prune your model:
- Use the "torch.nn.utils.prune" library forPyTorchmodels.
- Use the "tensorflow_model_optimization.sparsity.keras" library forTensorFlowmodels.
- Hugging Face'sOptimumThe library has pruning capabilities for Intel systems.
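
Below is a minimal sketch of unstructured pruning with torch.nn.utils.prune; the layer is a stand-in for a layer from a trained model, and the 30% pruning ratio is an arbitrary choice:

```python
# Minimal unstructured-pruning sketch: zero out the 30% of weights
# with the smallest absolute value (L1 criterion).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 10)  # placeholder for a layer from a trained model

prune.l1_unstructured(layer, name="weight", amount=0.3)

# The pruning mask is stored next to the original weights until you
# make the sparsity permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.0%}")  # roughly 30%
```

Note that unstructured sparsity only leads to real speed-ups on runtimes and hardware that can exploit sparse weights; structured pruning (removing whole neurons or filters) shrinks the computation more directly.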
3. Knowledge distillation
Knowledge distillation is a technique in which a small model is trained to imitate the output of a large, powerful model. This is often described as a student-teacher relationship.
The teacher model was trained on large amounts of data. The student model, on the other hand, is trained only on the output of the (already trained) teacher model and never actually comes into contact with the original data.

In practice, student models often achieve almost the same accuracy as the teacher model while being much smaller and faster.
Example: the Chinese company DeepSeek develops distilled language models such as DeepSeek-R1-Distill-Qwen-7B (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B). An earlier success story of distillation is TinyBERT: it is over 7 times smaller and 9 times faster than the original BERT, with a drop of only about 3% in accuracy.
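
To make the student-teacher idea concrete, here is a minimal sketch of a single distillation step in PyTorch; the architectures, temperature, and random batch are illustrative assumptions:

```python
# Minimal knowledge-distillation sketch: the student learns to match the
# teacher's softened output distribution via a KL-divergence loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 10))
student = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
teacher.eval()  # the teacher is already trained and stays frozen

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature: softens the probability distributions

x = torch.randn(32, 784)  # placeholder batch of inputs

with torch.no_grad():
    teacher_logits = teacher(x)
student_logits = student(x)

loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)  # rescale to keep gradient magnitudes comparable

optimizer.zero_grad()
loss.backward()
optimizer.step()
```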
Summary
Optimization techniques such as quantization, pruning, and distillation can be very beneficial in terms of speed, energy consumption, cost, and hardware requirements. Especially in applications such as computer vision and language models, they make a difference.
It does take extra time to experiment and to assess and/or compensate for the loss in accuracy. Whether that extra development time is worth it depends on the scale of the project. Additionally, not all of the techniques discussed are equally useful on every type of hardware or model.
