Sovereign AI: Taking Ownership of your AI Stack

19:58

Executive summary

Sovereign AI is becoming a strategic priority as companies rethink how much control they have over the models, infrastructure, software harnesses, and data that power their AI systems. Rather than relying fully on third-party platforms, organizations can now combine open-weight models, private or sovereign compute, and self-hosted orchestration tools to create AI systems that are more portable, secure, compliant, and resilient. The key message: AI sovereignty is not all-or-nothing, but a series of architectural choices that determine whether AI becomes a vendor dependency or a private, permanent business asset.

Sovereign AI?

Until very recently, AI systems (particularly those employing LLMs) followed a predictable architecture that relied heavily on 3rd-party resources. Recently, however, the conversation around “Sovereign AI” has been growing rapidly, with many starting to recognize the risks of 3rd-party control over their strategically important systems.

Simply, AI sovereignty is the degree to which an individual or organization owns and controls the various parts of their “AI Stack”, which can largely be broken down into Models, Compute, Harness, and Data. In this blog, we will aim to shed light on what is possible today across the first three of these pillars at varying degrees of sovereignty.

1. The Brains: Frontier Open Weights

Being the headline component of any LLM based system, the model in use is an obvious place to start when considering the sovereignty of your AI system. Luckily, the open-weights ecosystem of models has been maturing steadily for a long time and there are lots of high performance players to consider.

Mistral

The Flagships: Mistral Small 4 (119B MoE) and the voice-native Voxtral TTS.
The Architecture Secret: Merges reasoning, multimodal vision, and agentic coding into a single, efficient 128-expert MoE architecture on an Apache 2.0 license, activating roughly 6B parameters per token.
Where They Excel: High-speed function calling, real-time local audio/voice interfaces, and developer coding pipelines.
The Use Case: Built for multimodal actions. supports agents that can visually parse a layout, write frontend code, and verbally explain deployment steps using high-fidelity text-to-speech completely offline.

DeepSeek

The Flagships: DeepSeek-V4 (and the reasoning-heavy DeepSeek-R1).
The Architecture Secret: A 1.6-trillion parameter Mixture-of-Experts (MoE) array with three native reasoning modes (Non-Thinking, Think High, Think Max)
Where They Excel: High-throughput data pipelines, complex programming, and long-context information retrieval.
The Use Case: software engineering and codebase refactoring. The Think Max mode uses internal self-verification loops for advanced coding logic on a local workstation.

Llama (Meta)

The Flagships: Llama 4 Maverick and the long-context Llama 4 Scout.
The Architecture Secret: Mixture-of-Experts Architecture optimised for large context windows
Where They Excel: Autonomous agent workflows, native processing of large codebases, and sweeping legal/research corpora without truncation.
The Use Case: Suitable for long-horizon agent loops, enabling agents to scan multi-thousand-file software architectures, map dependencies, and trace bugs in a single turn without complex vector databases.

Gemma (Google)

The Flagships: Gemma 4 (31B Dense / 26B MoE) and the laptop-optimized Gemma 4 (12B Unified).
The Architecture Secret: native step-by-step thinking mode, a 256K context window, encoder-free multimodal architecture that processes audio, video, and image signals within the core LLM backbone.
Where They Excel: Long-context agentic workflows, on-device audio transcription/voice-editing, complex multimodal data processing, and edge deployments.
The Use Case: Gemma 4 eliminates the VRAM overhead of separate vision/audio encoders. The 12B Unified model allows running multimodal, voice-driven local agents that execute scripts and review visual data directly on a standard 16GB consumer laptop.

Qwen (Alibaba)

The Flagships: Qwen 3.6 Plus (for open-weights deployment)
The Architecture Secret: Optimized for long-horizon execution in the "Agent Era." It generalizes across foreign runtime harnesses (OpenClaw, Claude Code) without behavioral degradation.
Where They Excel: Multi-lingual enterprise agent swarms, long-running office productivity automation, and tool-use precision.
The Use Case: Suitable for systems that coordinate dozens of cross-functional sub-agents executing hundreds of sequential local tool-calls over multiple hours, offering agent-first execution capabilities for a local workspace.

Phi (Microsoft)

The Flagship: Phi-4-mini (3.8B).
The Architecture Secret: Relies on highly filtered synthetic data and textbook-grade logic strings, offering a 128K context window and strong mathematical capabilities in a compact footprint.
Where They Excel: Extreme-edge computing, mobile app integration, completely offline mobile tasks, and low-latency loops.
The Use Case: Optimized for mobile applications requiring text formatting, complex script handling, or mathematical logic to run 100% offline on a user's phone without relying on an external network or excessive battery drain. Phi-4-mini is an architecture used for this.

When considering which of these models to use, the best way to get started is to try several options and see which best fits your solution. The best place to see all the available models is huggingface.co, which is something like a ‘GitHub’ for AI models. Additionally, the “Model Gardens” from each of the key hyperscalers provide something of a marketplace for models (Azure Foundry Models, GCP Agent Platform, and AWS Bedrock). Here you will find both proprietary and open-weight models, though they typically offer only the flagship options from key model providers.

One of the key things you will notice when browsing the options is that each is usually available in a range of parameter sizes. For example, gemma4 is available in 2B, 4B, 12B, and 31B (the B standing for Billions of parameters).

Parameters represent the total number of internal connections in a model, directly dictating its depth of knowledge and reasoning capabilities. Models with higher parameter counts tend to have better logic and problem-solving abilities but require more GPU memory to run. (We include a table summarising Model Size to VRAM Req. relationship in the hardware section below)

The size of the model (number of parameters) that you will choose will largely come down to a trade-off between the desired capabilities of the model in your system and the available hardware on which to run the model, which presents a nice segue to the next pillar of sovereignty in your AI stack.

2. Hardware

Once you have selected the best model for your needs, you need some hardware on which to run it. The least “sovereign” approach would be to opt for one of the Model-Hosting-as-a-Service products available from the Big 3 hyperscaler vendors. The below table offers a comparison of the key offerings from the Big 3.

Attribute	Microsoft Foundry / Azure ML	Google Vertex Model Garden	Amazon SageMaker
BYOM Hub	Azure AI Foundry/ Azure ML	Vertex Model Garden	SageMaker Inference Components
Weight Storage	Blob Storage (Safetensors, GGUF)	Cloud Storage (Hugging Face format)	S3 (model.tar.gz)
Runtime Freedom	Custom Docker/Python (vLLM)	Custom Containers (via Model Registry) or Pre-built (vLLM/Hex-LLM)	Custom Docker or native LMI
Isolation	Azure VNet / Entra ID	Private Service Connect	AWS PrivateLink / Private VPC
Key Tech	Foundry Model Router / OpenAPI	Vertex AI Agent platform	HTTP/2 Bidirectional Streaming
Billing Model	Hourly VM instance rate	Hourly machine + GPU/TPU rate	Hourly instance rate
Mid-Tier Node (8B–32B)	$1.43 / hr (A10 8vCPU 110GB RAM)	$4.50 / hr (a2-highgpu-1g, 12vCPU, 85GB RAM)	$1.51- $7.09/ hr (ml.g5.2xlarge - ml.g5.12xlarge)
Enterprise Node (70B+ / MoE)	$9.08 – $18.00+ / hr (H100 320-640GB RAM)	$36.08 / hr (a2-highgpu-8g, 8xA100, 320GB VRAM, 96vCPU, 680GB RAM)	$25.25 / hr (ml.p4d.24xlarge)

These services offer all the benefits that have made the cloud at large such an attractive offer over the last couple of decades: high availability, redundancy, reliability, security, and scalability - all whilst requiring a much lower maintenance overhead compared to managing one’s own on-prem compute resources integrated in your existing cloud stack/bill.

However, the ownership of the platform is very much out of your hands, keeping you beholden to the vendors and their whims. Pricing and feature sets could change at any time. Not very Sovereign.

Luckily, running open-weight large language models locally is more accessible than one might think.

On the cheapest end of the scale, older graphics cards can be bought quite cheap and provide reasonable performance with the smaller quantizations. The GTX 1060 with 6GB of VRAM is available second hand for as little as €70.

For example, Gemma4 E4B running on a GTX 1060 6GB achieved an output speed of ~13.5 token/s and an input speed of ~510 tokens/s. This is not really suitable for the most advanced precision-sensitive cases or for real time chat applications, but its multi-modal and reasoning capabilities could make it a strong option for agentic applications that run in the background (i.e: Second Brain/ personal wiki assistant).

Ultimately, this setup represents the floor of what is currently possible with cheap, self-owned hardware, and the sky is the limit with prices for NVidia H100 cards reaching in excess of €25k.

Representing a middle ground between the pricey-poles, Apple systems have recently been a popular choice for Local LLM enthusiasts. Their System-On-A-Chip designs and specifically Unified memory for CPU and GPU tasks (with the option to utilise the SSD for even more memory if needed) make them an ideal choice for those looking for Local LLM work-horses.

Windows die-hards will also be encouraged to hear that NVidia are working on their own platform for Laptops and Mini-PCs, the RTX Spark. Announced in June of 2026, the new platform promises to bring agentic workloads to Windows personal computers and could make Windows machines competitive to Mac for AI tasks.

While the RTX Spark targets consumer Windows devices, the DGX Spark is a Linux-powered desktop mini-supercomputer. It utilizes the same GB10 Grace Blackwell chip and 128GB of unified memory, but trades mobility for maximum thermal headroom and unthrottled performance. This creates a completely air-gapped, plug-and-play sandbox for developers to safely run and fine-tune heavy workloads or models up to 70B parameters entirely offline.

Whatever your budget, the key calculation to make is how much VRAM you will need to support the types of models you are looking to run. The table below breaks down the requirements for a range of Model Sizes. Q4 and Q8 are levels of “Quantization”, a reduction of the model’s precision which improves memory performance at the cost of a little bit of ‘intelligence’.

Model Size (Parameters)	FP16 (Full Quality)	Q8 (8-bit)	Q4 (4-bit)
3B	~6 GB	~3.5 GB	~2.5 GB
7B – 8B	~16 GB	~9 GB	~6 GB
14B	~28 GB	~15 GB	~10 GB
32B	~64 GB	~34 GB	~20 GB
70B	~140 GB	~75 GB	~40 GB

3. Open Source Harnesses

Selecting your open-weight model and hosting hardware is only half the battle; the final critical piece is the software harness stack.

Because LLMs are purely text-prediction engines, they alone cannot act autonomously. Instead, they require software scaffolding to parse their context and manage multi-step reasoning loops.

Generally, this software ecosystem is formed of three layers:

User Interface Layer: The self-hosted front-end for secure, telemetry-free user interaction.
Orchestration & Policy Layer: The secure sandbox that manages agent workflows, memory, tool connections, and guardrails.
Inference & Serving Layer: The local software engines that load model weights into memory and serve APIs from your hosting infrastructure.

As you can imagine, the open-source landscape is evolving fast, and many tools deliberately cross boundaries to offer all-in-one solutions. When mapping these tools to your stack, it helps to look at them by their primary architectural focus:

1. Dedicated Front-Ends & Workspaces (Primary Layer 1)

These tools focus on providing a secure and polished place to interact with AI. They do not run the models or sandbox code themselves; they plug into backend engines. Tools to check-out:

Open WebUI: self-hosted interface for managing local engines like Ollama, complete with built-in document uploading (RAG) and extensive plugin support.
LibreChat: A customizable, open-source platform that replicates the premium ChatGPT experience while unifying many different local and cloud AI providers in one dashboard.

2. Vertically Integrated Agent Harnesses (Straddling Layers 1 & 2)

These tools bundle a user interface directly with a secure agentic orchestration sandbox.

PI: Specialized software-engineering harness. Provides an interface to view code and an isolated Docker sandbox where the agent can safely run commands and edit files.
Open Claw: An autonomous personal assistant tool. Includes a web dashboard or chat app interface and can orchestrate 24/7 background tasks, tools, and long-term memory
Odysseus: A privacy-first productivity workspace. Unifies chat, document editing, and email management into a single interface and supports autonomous background agents, local tool calling, and deep research pipelines.

3. Core Engine Runtimes & Inference (Primary Layer 3)

These tools are the "brains." They have no consumer-facing chat interface; they exist purely to manage model weights and serve local APIs on your hardware.

llama.cpp: The lightweight, native standard for running quantized models on local workstations.
vLLM & SGLang: High-throughput, enterprise-grade engines designed to serve open weights across private data centers or Kubernetes clusters with maximum memory efficiency.

(Note: Developer frameworks like LangGraph or CrewAI also sit deeply in Layer 2, acting as the underlying code libraries you use to build these custom agentic workflows from scratch.)

When architecting a Sovereign AI platform, remember that sovereignty is only as strong as your weakest link.

If you use a secure, local engine like vLLM but plug it into a cloud-hosted interface that tracks user telemetry, you have broken the privacy barrier. Conversely, if you deploy a beautiful local UI but connect it to an un-sandboxed agent framework, you risk exposing your private systems to rogue code.

How to choose your path:

For a modular, enterprise approach: Pair a dedicated front-end (Open WebUI) with a robust inference engine (llama.cpp), using developer frameworks (LangGraph) to hand-craft your business logic layer by layer.
For immediate, out-of-the-box utility: Lean on vertically integrated harnesses (PI, OpenHands, or Open Claw) that handle the complex interplay of interface, tools, and sandboxing right from day one.

By ensuring your software stack covers all three layers securely on your own infrastructure, you transform static, open-weight models into a highly functional, entirely autonomous, and completely private corporate asset.

With that being said, it’s worth bearing in mind that sovereignty isn't a binary choice. The EU Cloud Sovereignty Framework defines eight degrees of sovereignty, each with distinct trade-offs. The primary one is effort: managed services offer the easiest deployment path, but at the lowest level of sovereignty.

4. Tying It All Together

Achieving Sovereign AI isn't about isolating your organization; it is about choosing exactly where your data boundaries live. True sovereignty means moving away from a rental-only model of intelligence and intentionally designing control points across three critical pillars.

1. Model Selection

Your choice of model weights determines your long-term independence and portability.

The Full Sovereignty Approach: Deploying top-tier open-weight models (such as Llama, Qwen, or Mistral) on your own infrastructure. You own the weights forever, and no vendor can pull the plug or alter the model's behavior overnight.
The Hybrid Approach: Utilizing open-weight models but leaning on specialized, regional closed-source APIs that operate under strict contractual guarantees.
The Traditional Approach (No Sovereignty): Total reliance on global, proprietary, closed-source APIs where your operational intelligence is completely locked into a single vendor's cloud.

2. Hosting & Compute Infrastructure

Where your models physically execute dictates which legal jurisdictions and data-privacy laws apply to your corporate data.

The Full Sovereignty Approach: On-Premises Bare Metal. Running private GPU clusters directly inside your own physical data centers for absolute data isolation and zero external network leakage.
The Hybrid Approach: Sovereign Cloud & Virtual Private Clouds (VPCs). Utilizing specialized regional cloud providers or dedicated, geography-locked hyperscaler instances designed specifically to meet strict national sovereignty compliance guidelines.
The Traditional Approach (No Sovereignty): Standard multi-tenant public cloud hosting where data routinely crosses borders and mixes with public web traffic.

3. The Software Harness

The user interfaces, agentic loops, and local engines that translate raw model text-prediction into actual business workflows.

The Full Sovereignty Approach: Modular Open Source. Pairing a telemetry-free front-end (Open WebUI) with local inference engines (llama.cpp) running entirely inside your private network.
The Hybrid Approach: Integrated Frameworks. Utilizing all-in-one, vertically integrated open-source harnesses (PI, or Open Claw) deployed within a protected corporate VPC.
The Traditional Approach (No Sovereignty): Locking your enterprise workflows into proprietary SaaS agent platforms that track user behavior, prompt history, and operational metadata.

The Strategic Takeaway

Sovereignty is only as strong as your weakest link. If you run a cutting-edge open-weight model on private hardware, but route user interactions through a cloud-based interface tracking telemetry, your privacy boundary is broken.

For highly regulated workloads or core proprietary IP, the blueprint is clear: combine open-weight models, host them on verifiably local infrastructure, and orchestrate them with a self-hosted software harness that you control from end to end. By owning the full stack, you transform artificial intelligence from a compliance risk into a private, permanent corporate asset.

Sovereign AI: Taking Ownership of your AI Stack

Liam Campbell

Sovereign AI?

1. The Brains: Frontier Open Weights

Mistral

DeepSeek

Llama (Meta)

Gemma (Google)

Qwen (Alibaba)

Phi (Microsoft)

2. Hardware

3. Open Source Harnesses

1. Dedicated Front-Ends & Workspaces (Primary Layer 1)

2. Vertically Integrated Agent Harnesses (Straddling Layers 1 & 2)

3. Core Engine Runtimes & Inference (Primary Layer 3)

4. Tying It All Together

1. Model Selection

2. Hosting & Compute Infrastructure

3. The Software Harness

The Strategic Takeaway

The answers you've been looking for

Frequently asked questions

Sovereign AI?

1. The Brains: Frontier Open Weights

Mistral

DeepSeek

Llama (Meta)

Gemma (Google)

Qwen (Alibaba)

Phi (Microsoft)

2. Hardware

3. Open Source Harnesses

1. Dedicated Front-Ends & Workspaces (Primary Layer 1)

2. Vertically Integrated Agent Harnesses (Straddling Layers 1 & 2)

3. Core Engine Runtimes & Inference (Primary Layer 3)

4. Tying It All Together

1. Model Selection

2. Hosting & Compute Infrastructure

3. The Software Harness

The Strategic Takeaway

The answers you've been looking for

Frequently asked questions

1.What is Sovereign AI and what are the components of an AI stack?

2.Which open-weight LLMs are best for building a sovereign AI system?

3.How much GPU memory (VRAM) do I need to run an open-weight LLM locally?

4.What software do I need to run an open-weight model — isn't the model enough on its own?

5.Do I have to host everything on-premises, or can I be partially sovereign?