Executive summary
Sovereign AI is becoming a strategic priority as companies rethink how much control they have over the models, infrastructure, software harnesses, and data that power their AI systems. Rather than relying fully on third-party platforms, organizations can now combine open-weight models, private or sovereign compute, and self-hosted orchestration tools to create AI systems that are more portable, secure, compliant, and resilient. The key message: AI sovereignty is not all-or-nothing, but a series of architectural choices that determine whether AI becomes a vendor dependency or a private, permanent business asset.
Until very recently, AI systems (particularly those employing LLMs) followed a predictable architecture that relied heavily on 3rd-party resources. Recently, however, the conversation around “Sovereign AI” has been growing rapidly, with many starting to recognize the risks of 3rd-party control over their strategically important systems.
Simply, AI sovereignty is the degree to which an individual or organization owns and controls the various parts of their “AI Stack”, which can largely be broken down into Models, Compute, Harness, and Data. In this blog, we will aim to shed light on what is possible today across the first three of these pillars at varying degrees of sovereignty.
Being the headline component of any LLM based system, the model in use is an obvious place to start when considering the sovereignty of your AI system. Luckily, the open-weights ecosystem of models has been maturing steadily for a long time and there are lots of high performance players to consider.
When considering which of these models to use, the best way to get started is to try several options and see which best fits your solution. The best place to see all the available models is huggingface.co, which is something like a ‘GitHub’ for AI models. Additionally, the “Model Gardens” from each of the key hyperscalers provide something of a marketplace for models (Azure Foundry Models, GCP Agent Platform, and AWS Bedrock). Here you will find both proprietary and open-weight models, though they typically offer only the flagship options from key model providers.
One of the key things you will notice when browsing the options is that each is usually available in a range of parameter sizes. For example, gemma4 is available in 2B, 4B, 12B, and 31B (the B standing for Billions of parameters).
Parameters represent the total number of internal connections in a model, directly dictating its depth of knowledge and reasoning capabilities. Models with higher parameter counts tend to have better logic and problem-solving abilities but require more GPU memory to run. (We include a table summarising Model Size to VRAM Req. relationship in the hardware section below)
The size of the model (number of parameters) that you will choose will largely come down to a trade-off between the desired capabilities of the model in your system and the available hardware on which to run the model, which presents a nice segue to the next pillar of sovereignty in your AI stack.
Once you have selected the best model for your needs, you need some hardware on which to run it. The least “sovereign” approach would be to opt for one of the Model-Hosting-as-a-Service products available from the Big 3 hyperscaler vendors. The below table offers a comparison of the key offerings from the Big 3.
|
Attribute |
Microsoft Foundry |
Google Vertex |
Amazon SageMaker |
|
BYOM Hub |
Azure AI Foundry/ Azure ML |
Vertex Model Garden |
SageMaker Inference Components |
|
Weight Storage |
Blob Storage (Safetensors, GGUF) |
Cloud Storage (Hugging Face format) |
S3 (model.tar.gz) |
|
Runtime Freedom |
Custom Docker/Python (vLLM) |
Custom Containers (via Model Registry) or Pre-built (vLLM/Hex-LLM) |
Custom Docker or native LMI |
|
Isolation |
Azure VNet / Entra ID |
Private Service Connect |
AWS PrivateLink / Private VPC |
|
Key Tech |
Foundry Model Router / OpenAPI |
Vertex AI Agent platform |
HTTP/2 Bidirectional Streaming |
|
Billing Model |
Hourly VM instance rate |
Hourly machine + GPU/TPU rate |
Hourly instance rate |
|
Mid-Tier Node (8B–32B) |
$1.43 / hr (A10 8vCPU 110GB RAM) |
$4.50 / hr |
$1.51- $7.09/ hr (ml.g5.2xlarge - ml.g5.12xlarge) |
|
Enterprise Node (70B+ / MoE) |
$9.08 – $18.00+ / hr (H100 320-640GB RAM) |
$36.08 / hr (a2-highgpu-8g, 8xA100, 320GB VRAM, 96vCPU, 680GB RAM) |
$25.25 / hr (ml.p4d.24xlarge) |
These services offer all the benefits that have made the cloud at large such an attractive offer over the last couple of decades: high availability, redundancy, reliability, security, and scalability - all whilst requiring a much lower maintenance overhead compared to managing one’s own on-prem compute resources integrated in your existing cloud stack/bill.
However, the ownership of the platform is very much out of your hands, keeping you beholden to the vendors and their whims. Pricing and feature sets could change at any time. Not very Sovereign.
Luckily, running open-weight large language models locally is more accessible than one might think.
On the cheapest end of the scale, older graphics cards can be bought quite cheap and provide reasonable performance with the smaller quantizations. The GTX 1060 with 6GB of VRAM is available second hand for as little as €70.
For example, Gemma4 E4B running on a GTX 1060 6GB achieved an output speed of ~13.5 token/s and an input speed of ~510 tokens/s. This is not really suitable for the most advanced precision-sensitive cases or for real time chat applications, but its multi-modal and reasoning capabilities could make it a strong option for agentic applications that run in the background (i.e: Second Brain/ personal wiki assistant).
Ultimately, this setup represents the floor of what is currently possible with cheap, self-owned hardware, and the sky is the limit with prices for NVidia H100 cards reaching in excess of €25k.
Representing a middle ground between the pricey-poles, Apple systems have recently been a popular choice for Local LLM enthusiasts. Their System-On-A-Chip designs and specifically Unified memory for CPU and GPU tasks (with the option to utilise the SSD for even more memory if needed) make them an ideal choice for those looking for Local LLM work-horses.
Windows die-hards will also be encouraged to hear that NVidia are working on their own platform for Laptops and Mini-PCs, the RTX Spark. Announced in June of 2026, the new platform promises to bring agentic workloads to Windows personal computers and could make Windows machines competitive to Mac for AI tasks.
While the RTX Spark targets consumer Windows devices, the DGX Spark is a Linux-powered desktop mini-supercomputer. It utilizes the same GB10 Grace Blackwell chip and 128GB of unified memory, but trades mobility for maximum thermal headroom and unthrottled performance. This creates a completely air-gapped, plug-and-play sandbox for developers to safely run and fine-tune heavy workloads or models up to 70B parameters entirely offline.
Whatever your budget, the key calculation to make is how much VRAM you will need to support the types of models you are looking to run. The table below breaks down the requirements for a range of Model Sizes. Q4 and Q8 are levels of “Quantization”, a reduction of the model’s precision which improves memory performance at the cost of a little bit of ‘intelligence’.
|
Model Size (Parameters) |
FP16 (Full Quality) |
Q8 (8-bit) |
Q4 (4-bit) |
|
3B |
~6 GB |
~3.5 GB |
~2.5 GB |
|
7B – 8B |
~16 GB |
~9 GB |
~6 GB |
|
14B |
~28 GB |
~15 GB |
~10 GB |
|
32B |
~64 GB |
~34 GB |
~20 GB |
|
70B |
~140 GB |
~75 GB |
~40 GB |
Selecting your open-weight model and hosting hardware is only half the battle; the final critical piece is the software harness stack.
Because LLMs are purely text-prediction engines, they alone cannot act autonomously. Instead, they require software scaffolding to parse their context and manage multi-step reasoning loops.
Generally, this software ecosystem is formed of three layers:
As you can imagine, the open-source landscape is evolving fast, and many tools deliberately cross boundaries to offer all-in-one solutions. When mapping these tools to your stack, it helps to look at them by their primary architectural focus:
These tools focus on providing a secure and polished place to interact with AI. They do not run the models or sandbox code themselves; they plug into backend engines. Tools to check-out:
These tools bundle a user interface directly with a secure agentic orchestration sandbox.
These tools are the "brains." They have no consumer-facing chat interface; they exist purely to manage model weights and serve local APIs on your hardware.
(Note: Developer frameworks like LangGraph or CrewAI also sit deeply in Layer 2, acting as the underlying code libraries you use to build these custom agentic workflows from scratch.)
When architecting a Sovereign AI platform, remember that sovereignty is only as strong as your weakest link.
If you use a secure, local engine like vLLM but plug it into a cloud-hosted interface that tracks user telemetry, you have broken the privacy barrier. Conversely, if you deploy a beautiful local UI but connect it to an un-sandboxed agent framework, you risk exposing your private systems to rogue code.
How to choose your path:
By ensuring your software stack covers all three layers securely on your own infrastructure, you transform static, open-weight models into a highly functional, entirely autonomous, and completely private corporate asset.
With that being said, it’s worth bearing in mind that sovereignty isn't a binary choice. The EU Cloud Sovereignty Framework defines eight degrees of sovereignty, each with distinct trade-offs. The primary one is effort: managed services offer the easiest deployment path, but at the lowest level of sovereignty.
Achieving Sovereign AI isn't about isolating your organization; it is about choosing exactly where your data boundaries live. True sovereignty means moving away from a rental-only model of intelligence and intentionally designing control points across three critical pillars.
Your choice of model weights determines your long-term independence and portability.
Where your models physically execute dictates which legal jurisdictions and data-privacy laws apply to your corporate data.
The user interfaces, agentic loops, and local engines that translate raw model text-prediction into actual business workflows.
Sovereignty is only as strong as your weakest link. If you run a cutting-edge open-weight model on private hardware, but route user interactions through a cloud-based interface tracking telemetry, your privacy boundary is broken.
For highly regulated workloads or core proprietary IP, the blueprint is clear: combine open-weight models, host them on verifiably local infrastructure, and orchestrate them with a self-hosted software harness that you control from end to end. By owning the full stack, you transform artificial intelligence from a compliance risk into a private, permanent corporate asset.