Finding the Virtualization Sweet Spot with Proxmox and Ansible
What to choose among physical servers, virtual machines (VMs), Docker containers/LXC, Kubernetes, and hyperscalers like GCP/AWS? Proxmox, apparently.
Navigating compute infrastructure (servers) is difficult; that's why whole teams are centred around it. How do you weigh the trade-offs between bare-metal/physical machines, virtual machines (VMs), Docker containers/LXC, Kubernetes, OpenStack, and hyperscalers like GCP/AWS? Proxmox, apparently. Let me talk about how I arrived there, and which trade-offs matter most for my work on the Blasphemess project.
Long story short, I'm trialing Proxmox VE for the time being, and automating it with Ansible to promote infrastructure-as-code and configuration-as-code standards. For the kinds of team sizes I anticipate, this is the sweet spot between simplicity and scale.
I need to be able to run maybe 25 VMs at any given time and, more importantly, perform bulk operations on them whilst occasionally taking some down or bringing new ones up. Proxmox easily meets that requirement for scale.
To get to the point of hosting on Proxmox, first I need to get prototypes online. For that I'll be using public clouds, that way I can grow the project and work out issues before committing to the expense of hardware.
Choosing Between Self-Hosted, Co-Located, Hybrid Cloud, and Public Cloud Infrastructure
Server infrastructure is where your software runs, and choosing it is an important decision.
Many developers are used to managed services, which provide a way to run software in a virtualised environment: virtual machines, containers, serverless functions, and so on. Using public cloud infrastructure is an attractive notion, because it eliminates all the toilsome tasks of managing your own servers.
Electricity and internet utilities, estimating capacity, swapping failing disks, cycling ageing hardware, and so on? Let the dedicated operational teams handle that. You get a stable, dependable platform-as-a-service to base your work on, for a cost.
That cost can be quite significant over time, though.
Contrast this with the other end of the spectrum, so to speak: self-hosted infrastructure that you have total control over. That means being a servant to the machine spirit, sometimes, but you can tailor your experience with enough know-how and resources.
I love self-hosting software. I also love redundancy and reliability. Seeing as my home does not have fibre lines and backup power generation and LTE failover networking... let's just say that I quickly ruled out the at-home server closet idea. I also don't want to worry about data integrity issues from fires or theft.
Since I'm a tiny solo dev, I'm also immediately ruling out the idea of "build a data centre building of my own" for the expense of it, and "rent an office to store servers in" for similar reasons to the at-home closet idea.
Self-hosting is great for hobby or non-critical projects! There might even be a place for it in Blasphemess, for e.g. off-site backups and maybe specific continuous integration/automation tasks.
With self-hosting ruled out, let's look at the closest reasonable alternative: co-located servers. This is essentially just renting rack space in a data centre, and letting them handle almost all the physical infrastructure whilst you handle the servers and software.
I'm drawn to this idea of co-lo rack space. It's cheaper in the long run than public clouds and provides ample opportunity to develop new skills on the hardware side. The downside with this approach is that it takes a significant investment up-front to get rolling. Server hardware is not cheap, and you need to jump through hoops to rent rack space.
Enter the hybrid cloud. Many professionals are already familiar with it: at a single company, a lot of software runs in both public clouds and the company's own data centres. You get to use hyperscalers where they make sense, and your own hardware where that makes sense.
This is likely to be the approach in the early releases of Blasphemess; I already use Google Cloud Platform for pre-alpha testing and prototyping of the game.
Leveraging a public cloud eases startup friction. I just intend to pivot away in the future, once things are proven viable.
Let's Talk Physical Servers and the Pets vs. Cattle Philosophy
If I intend to run my servers in a co-lo data centre, then comes the immediate question: "What am I going to run them on?"
Physical servers are the old way of running most things; you get a server, you install an Operating System (OS), and you run software on that server directly.
This is great when the use case calls for simplicity, when performance requirements can't tolerate virtualisation overhead, or when you're dealing with constraints like legal compliance.
Side note: physical servers are sometimes referred to as metal or bare metal. The terms are confusing, but in public cloud contexts they usually refer to physical hosts that do not have virtualisation in place.
Physical servers cover plenty of use-cases. They also typically grow to be unique, like snowflakes or beloved pets. They get distinct names, are fixed through debugging and manual operations when issues crop up, and are often tailored to the software they run – like "the database server" and "the web server."
Such unique treatment is a problem when you want to scale horizontally, or recreate environments from scratch (such as after a disaster). Who can remember every single little detail to set up a server, especially years later?
This repeatable-setup issue is at least partially addressed with infrastructure-as-code (IaC) and configuration-as-code (CaC). These principles mean the server environments I run should be described in code and checked into a source control tool such as Git, so that I can replay that setup every time I build a new server.
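To make that concrete, here's a minimal sketch of what configuration-as-code in a Git repository can look like, using Ansible since that's what I've landed on; the role layout, package list, and config file are hypothetical stand-ins.

```yaml
# roles/base/tasks/main.yml -- a hypothetical baseline role kept in Git.
# Re-running it against a fresh server converges it to the same state.
- name: Install baseline packages
  ansible.builtin.apt:
    name:
      - unattended-upgrades
      - fail2ban
    state: present
    update_cache: true

- name: Enforce a known sshd configuration
  ansible.builtin.copy:
    src: sshd_config        # the desired config lives in the repo, not only on the server
    dest: /etc/ssh/sshd_config
    mode: "0600"
  notify: Restart sshd      # handler defined alongside the role
```

Because every change goes through the repository, "how was this server set up?" has an answer you can read, diff, and replay.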
However, there's more to it than that: why treat servers like unique pets at all? Why not treat them like cattle? Make a bunch of interchangeable servers, and you can apply bulk operations much more easily. There's also a lot of fault tolerance (reliability) to be gained if you do it well.
I've written before about how I horizontally scale Blasphemess. This is more like a cattle approach. To accomplish that, I don't want to work with physical servers.
Instead, I want to step into the world of virtualisation...
The Start of Ease: Virtual Machines and Virtual Private Servers (VPS)
Renting a virtual private server is a common developer experience. You often get a virtual machine, which is an emulation of a whole server. It's "private" in the sense that other customers of the VPS provider can't access your machine (allegedly): their VMs and yours may run on the same hardware, but the hypervisor isolates them from each other.
Virtual machines are neat. That's an understatement, as they revolutionised a lot of the compute world. Nowadays, so much of everything is running on them.
VMs made infrastructure easier, and I consider them the start of my era of developer ease. Servers became a commodity with lower barriers to entry, such that some companies offered free virtual private servers to developers to get them hooked on the platform.
There are a lot of details in virtualisation: VM images that you spin up and run immediately without needing an OS installer, projects like cloud-init that help with first-boot configuration, live VM migrations between host nodes for higher availability of services, and even fancy things like "hyperconverged infrastructure."
Those details aren't important for this blog post, but they are considerations I factored into my decision on how to build Blasphemess infrastructure.
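As one small illustration of the cloud-init piece: a user-data file baked into a VM template might look something like this sketch, where the user name, key, and package list are placeholders. A clone of the template then comes up on first boot already reachable and ready for further configuration.

```yaml
#cloud-config
# Hypothetical cloud-init user-data for a VM template.
users:
  - name: blasphemess-admin            # placeholder admin account
    groups: sudo
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...example     # replace with a real public key
package_update: true
packages:
  - qemu-guest-agent                   # lets the hypervisor query/shut down the guest cleanly
  - python3                            # needed on the target for Ansible modules
```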
I picked Proxmox VE as a free & open source virtualisation platform. It's popular enough among self-hosted home-lab enthusiasts, so I decided to trial it with my own home lab.
It's working thus far, so let's see what the future holds!
The Promised Future: Docker, Linux Containers (LXC), and Kubernetes
When I started my career, Docker and Kubernetes were just being adopted. The containerisation future has been arriving every day since then.
Linux containers can be thought of as really lightweight virtual machines, or like a chroot if you're familiar with that. They share the host's OS kernel, but are contained just enough that developers can run workloads without fear of significant outside interference.
Containers come with a whole host of benefits. Why deal with "well, it runs on my machine" when you can ship the whole environment you run in? Why not make the image's initial state immutable and reliable, so that destroying a container and recreating it is safe?
Containers are great for dealing with packaging and updates; just change the tag you're running, then swap everything over.
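A minimal Docker Compose fragment shows the idea; the registry, image name, and port here are hypothetical.

```yaml
# docker-compose.yml (fragment) -- image name and tag are hypothetical.
services:
  game-server:
    image: registry.example.com/blasphemess/server:1.4.2   # bump this tag to upgrade
    restart: unless-stopped
    ports:
      - "7777:7777"
```

After editing the tag, `docker compose up -d` recreates only the changed service, and the previous image is still on disk if a quick rollback is needed.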
Kubernetes takes Linux containers, runs them in bulk, and performs a lot of tasks automatically to make things "just keep working." There's commentary to be had here about "self-healing" and "rolling updates" and so on, but that's not a big factor for what I'm doing with Blasphemess; it's going to be a niche game, and downtime is not a big deal.
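For a sense of what that looks like, here's a minimal Deployment sketch (names and image are hypothetical, and purely illustrative since I'm not using Kubernetes for Blasphemess) showing the replica count and rolling-update knobs it layers on top of plain containers.

```yaml
# A hypothetical Deployment: Kubernetes keeps three replicas running,
# restarts unhealthy ones, and rolls out new image tags gradually.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: game-server
spec:
  replicas: 3
  selector:
    matchLabels: { app: game-server }
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # replace pods one at a time
  template:
    metadata:
      labels: { app: game-server }
    spec:
      containers:
        - name: server
          image: registry.example.com/blasphemess/server:1.4.2
```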
At my day job, I often use Kubernetes. I love it, and see it as the future of compute: what cores and threads did for generic computers, Kubernetes does at the next level of abstraction up. It's built by incident-hardened operators, and has countless good ideas baked into the system.
The downside of Kubernetes, though, is added complexity and a weak experience around stateful apps. The complexity can be significant, too: it's a whole ecosystem to learn, which is a major ask for anyone I would want to onboard onto the project.
Kubernetes is not the right tool for my stack, as it adds significant complexity and its benefits are negligible for my use case. Most application errors will not be detected at the Kubernetes level. I'm also going to be a small development team (and am solo at the moment), so standardising on Kubernetes does little to benefit development.
Neat Detour: OpenStack and Open Source Clouds
OpenStack is an open-source alternative to the hyperscaler public clouds. It has plenty of features, including VM hosting and container orchestration, and there's more to it than that, too.
It's neat to see such a project exist. However, it doesn't fit my use case: it's trying to be a platform for whole teams of people to collaborate on, rather than a way to simply host services. It's just too big for me to use!
I may consider it again in the future, though, especially for larger projects where I want to self-host my cloud.
The Sweet Spot
I'm using virtualised servers with Proxmox VE at the moment, and on top of that I run Docker Compose.
Docker Compose is a great basic orchestration tool, and I've used it in other projects I've hosted online. The biggest benefit I'm aiming for is consistency between what my staging/production deployments use and what local development uses. It's one tool to learn for any contributors I end up picking up for the project.
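One common way to get that consistency, and the approach sketched here with hypothetical contents, is a shared base docker-compose.yml used everywhere plus a small docker-compose.override.yml that Compose picks up automatically for local development.

```yaml
# docker-compose.override.yml -- merged automatically by `docker compose up`
# during local development; staging/production use only the base file.
# Service, paths, and variables are hypothetical.
services:
  game-server:
    build: .                           # build from local source instead of pulling a tag
    volumes:
      - ./config/dev:/etc/blasphemess:ro
    environment:
      LOG_LEVEL: debug
```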
I don't need serverless functions, I don't need Kubernetes, and I don't need the newest shiny deployment tools; likewise, I don't care for the old-fashioned approach of unpacking my code directly onto a VM to run it.
Containers are a great, simple middle ground, and they let me pivot if I want to add that complexity later.
Ansible and Proxmox Automation
How do I set up the Proxmox nodes and VMs initially though? How do I automate standard playbooks for operations, like stopping a service? How do I manage configuration files?
I considered Puppet, Chef, and SaltStack alongside Ansible. Ultimately, I don't want to install an agent on the VMs, which rules out the others. There's also a decent amount of complexity to learning any of the tools.
Ansible is not the greatest tool by any stretch, but it is a great place to write down, in code, what you've done to configure a server. It's like a history documenting your environments.
My Ansible playbooks handle a few tasks after a fresh Proxmox install: they add admin users to the host, configure SSH keys, harden the node's security, install needed software, and create a VM template for the project servers.
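Roughly, the top-level play looks like this sketch; the group and role names are hypothetical stand-ins for my actual repository layout.

```yaml
# site.yml -- hypothetical post-install play for a fresh Proxmox node.
- name: Prepare a fresh Proxmox node
  hosts: proxmox_nodes
  become: true
  roles:
    - admin_users      # create admin accounts and drop in SSH public keys
    - ssh_hardening    # key-only auth, no root password logins
    - base_packages    # backups, monitoring, other needed software
    - vm_template      # build the golden VM template for project servers
```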
I can also write playbooks for standard operations, like bringing up Blasphemess services, stopping them to take a database backup, or cloning the template VM for new clusters. These playbooks can then be applied through automation.
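As a sketch of the cloning step, a task using the community.general.proxmox_kvm module might look like this; the host, token, node, and VM names are placeholders, and the secret would live in Ansible Vault rather than in the playbook.

```yaml
# clone-vm.yml -- hypothetical playbook that clones the project template.
- name: Clone the project template into a new VM
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Create a full clone from the template
      community.general.proxmox_kvm:
        api_host: pve.example.internal
        api_user: automation@pve
        api_token_id: ansible
        api_token_secret: "{{ proxmox_token_secret }}"   # stored in Ansible Vault
        node: pve1
        clone: blasphemess-template      # source template to clone from
        name: blasphemess-worker-01      # name of the new VM
        full: true
        state: present
```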
This is effectively the sweet spot I'm comfortable with, somewhere between server administration and virtualised services. I think it's just complex enough to still be teachable to new devs.
We'll see what refinements come with time and growth.