The Linux cloud swap that spells trouble for Microsoft and VMware

Containers just wanna be hypervisors

Just occasionally, you get it right. Six years ago, I called containers "every sysadmin's dream," and look at them now. Even the Linux Foundation's annual bash has been renamed from "LinuxCon + CloudOpen + Embedded Linux Conference" to "LinuxCon + ContainerCon".

Why? Because virtualization has been enterprise IT's favourite toy for more than a decade, and the rise of "cloud computing" has only boosted it further. When something gets that big, everyone jumps on board and starts looking for an edge – and containers are much more efficient than whole-system virtualization, so there are savings to be made and performance gains to win. The price is that admins have to learn new security and management skills and tools.

But an important recent trend is one I didn't expect: these two very different technologies beginning to merge.

Traditional virtualization is a special kind of emulation: you emulate a system on itself. Mainframes have had it for about 40 years, but everyone thought it was impossible on x86. All the "type 1" and "type 2 hypervisor" stuff is marketing guff – VMware came up with a near-native-speed PC emulator for the PC. It's how everything from KVM to Hyper-V works. Software emulates a whole PC, from the BIOS to the disks and NICs, so you can run one OS under another.

It's conceptually simple. The hard part was making it fast. VMware's big innovation was running most of the guest's code natively, and finding a way to trap just the "ring 0" kernel-mode code and run only that through its software x86 CPU emulation. Later, others worked out how and did the same, then Intel and AMD extended their chips to hardware-accelerate running ring-0 code under another OS – by inserting a "ring -1" underneath.
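You can ask the chip directly whether it has that hardware assist. Here's a minimal C sketch (not from any vendor's tooling) that checks the CPUID flags Intel and AMD use to advertise VT-x and AMD-V respectively; note the firmware can still disable the feature even when the flag is present:

/* check_vt.c -- does this CPU advertise hardware virtualization?
 * Intel exposes VT-x as the VMX flag (CPUID leaf 1, ECX bit 5);
 * AMD exposes AMD-V as the SVM flag (CPUID leaf 0x80000001, ECX bit 2).
 * Build with: cc -o check_vt check_vt.c
 */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
        puts("Intel VT-x (VMX) present: ring-0 guest code can run in hardware");
    else if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
        puts("AMD-V (SVM) present: ring-0 guest code can run in hardware");
    else
        puts("No hardware virtualization flag: a hypervisor must trap and emulate ring 0 in software");

    return 0;
}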

But it's still very inefficient. Yes, there are hacks to allow RAM over-commit, sparse disk allocation and so on, but overdo it and performance suffers badly. The sysadmin has to partition stuff up manually and the VMs take ages to boot, limiting rapid scaling.
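Sparse disk allocation, for instance, is just the filesystem's ability to leave holes in a file. A rough C sketch of the idea (the image filename is only an example): a "10GB" guest disk that occupies next to no real space until the guest writes to it.

/* sparse_disk.c -- what "sparse disk allocation" means in practice:
 * set the file's length without writing data, and no blocks are allocated.
 * Build with: cc -o sparse_disk sparse_disk.c
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "guest-disk.img";              /* hypothetical image name */
    const long long size = 10LL * 1024 * 1024 * 1024; /* 10GB apparent size */

    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)size) < 0) { perror("ftruncate"); return 1; }
    close(fd);

    struct stat st;
    stat(path, &st);
    printf("apparent size: %lld bytes, actually allocated: %lld bytes\n",
           (long long)st.st_size, (long long)st.st_blocks * 512LL);
    return 0;
}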

In computing terms, this is stone-age stuff. The whole point of half a century of R&D on dynamic memory management and multitasking operating systems was to avoid having to do this stuff manually. VMs squander all that.

Yes, it's improved, there are good management tools and so on, but all PC OSes were designed around the assumption that they run on their own dedicated hardware. Virtualization is still a kludge – but just one so very handy that everyone uses it.

That's why containers are much more efficient: they provide isolation without emulation. Normal PC OSes are divided into two parts: the kernel and drivers in ring 0, and all the ordinary unprivileged code – the "GNU" part of GNU/Linux – and your apps, in ring 3.

With containers, a single kernel runs multiple separate, walled-off userlands (the ring 3 stuff). Each thinks it's the only thing on the machine. But the kernel keeps total control of all the processes in all the containers.

There's no emulation, no separate memory spaces or virtual disks. A single kernel juggles multiple processes in one memory space, as it was designed to do. It doesn't matter if a container holds one process or a thousand. To the kernel, they're just ordinary programs – they load and can be paused, duplicated, killed or restarted in milliseconds.
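The mechanism behind that juggling act is the kernel's namespace support. As a bare-bones illustration (nothing that LXD or Docker actually ships), this C sketch uses clone() to start a child in its own hostname and PID namespaces: the child thinks it's PID 1 on a machine called "container", while to the kernel it's just another process. It needs root.

/* tiny_ns.c -- one kernel, two views: a child in its own UTS and PID namespaces.
 * Needs root (or CAP_SYS_ADMIN). Build with: cc -o tiny_ns tiny_ns.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];   /* stack for the cloned child */

static int child_main(void *arg)
{
    (void)arg;
    /* Changing the hostname here only affects this namespace, not the host. */
    sethostname("container", strlen("container"));

    char hostname[64];
    gethostname(hostname, sizeof(hostname));
    printf("child : hostname=%s pid=%ld (PID 1 in its own namespace)\n",
           hostname, (long)getpid());
    return 0;
}

int main(void)
{
    /* New UTS (hostname) and PID namespaces for the child; no emulation involved. */
    pid_t child = clone(child_main, child_stack + sizeof(child_stack),
                        CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    if (child == -1) {
        perror("clone");
        exit(EXIT_FAILURE);
    }

    char hostname[64];
    gethostname(hostname, sizeof(hostname));
    printf("parent: hostname=%s sees the child as ordinary pid %ld\n",
           hostname, (long)child);

    waitpid(child, NULL, 0);
    return 0;
}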

But there's only one kernel, so you can only run Linux containers on Linux. And because there's only one copy of the core OS, if an app in a container needs a kernel update, every container gets it and the whole machine must be rebooted.

These are fundamentally different approaches. So how can the two be merged, or the gap between them narrowed?

The hypervisor that isn't a hypervisor

Canonical has come up with something like a combination – although it admittedly has limitations. Its LXD "containervisor" runs system containers – ones holding a complete Linux distro from the init system upwards. The "container machines" share nothing but the kernel, so they can contain different versions of Ubuntu to the host – or even completely different distros.

LXD uses btrfs or zfs to provide snapshotting and copy-on-write, permitting rapid live-migration between hosts. Block devices on the host – disk drives, network connections, almost anything – can be dedicated to particular containers, and limits set, and dynamically changed, on RAM, disk, processor and IO usage. You can change how many CPU cores a container has on the fly, or pin containers to particular cores.
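Under the hood, those limits are the kernel's own control-group and scheduler knobs rather than anything LXD invents. As a hedged sketch of that machinery (it assumes a cgroup v2 host with the cpu and memory controllers enabled, and a hypothetical "demo" group created beforehand; LXD's own paths differ), this caps a process at half a CPU core and 256MB of RAM, then pins it to core 0:

/* limit_and_pin.c -- the kernel knobs behind "give this container half a CPU,
 * 256MB of RAM, and pin it to core 0".
 * Assumes cgroup v2 mounted at /sys/fs/cgroup and that the group exists:
 *   mkdir /sys/fs/cgroup/demo
 * Needs root. Build with: cc -o limit_and_pin limit_and_pin.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(EXIT_FAILURE); }
    fprintf(f, "%s", value);
    fclose(f);
}

int main(void)
{
    char buf[64];

    /* Half a CPU: 50,000us of runtime per 100,000us period. */
    write_file("/sys/fs/cgroup/demo/cpu.max", "50000 100000");

    /* Hard memory ceiling of 256MB. */
    write_file("/sys/fs/cgroup/demo/memory.max", "268435456");

    /* Move this process into the group; its children inherit the limits. */
    snprintf(buf, sizeof(buf), "%ld", (long)getpid());
    write_file("/sys/fs/cgroup/demo/cgroup.procs", buf);

    /* Pin to core 0 -- the moral equivalent of LXD's CPU pinning. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }

    printf("Limited and pinned; replacing myself with a shell to play in.\n");
    execlp("sh", "sh", (char *)NULL);
    perror("execlp");
    return EXIT_FAILURE;
}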

Despite some of the marketing folks' claims, it's not a full hypervisor. You can't run non-Linux containers. Indeed you can only run distros that will work on the kernel of the host's version of Ubuntu. You can't even freely migrate containers between hosts running different Ubuntu versions. Also, any global restrictions in the Linux kernel – such as number of network connections or IP addresses – apply to all the containers on the host put together.

However, it does offer most of the functionality of Xen or KVM-style Linux-on-Linux virtualization but with considerably greater efficiency, meaning lower overheads in both resources and licence costs. Importantly for Canonical, it allows Ubuntu Server to run software that's only certified or supported on more established enterprise distros, such as RHEL or SLES.

LXD is a pure container system that looks as much as possible like a full hypervisor, without actually being one.

... and containers that aren't really containers

What's the flipside of trying to make containers look like VMs? A hypervisor trying very hard to make VMs look like containers, complete with endorsement from an unexpected source.

When IBM invented hypervisors back in the 1960s, it created two different flavours of mainframe OS – ones designed to host others in VMs, and radically different ones designed solely to run inside VMs.

Some time ago, Intel modified Linux into something akin to a mainframe-style system: a dedicated guest OS, plus a special hypervisor designed to run only that OS. The pairing of a hypervisor that will only run one specific Linux kernel, plus a kernel that can only run under that hypervisor, allowed Intel to dispense with a lot of baggage on both sides. The VMs aren't PC-compatible. There's no BIOS or boot process – just copy the kernel into RAM and execute it. No need to emulate a display or other IO – everything, including the root filesystem, is accessed over a simple, fast, purely virtual network connection. Guest control is over ssh, just like with containers.

The result is a tiny, simple hypervisor and tiny VMs, which start in a fraction of a second and require a fraction of the storage of conventional ones, with almost no emulation involved. In other words, much like containers.
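That "copy the kernel into RAM and execute it" model is essentially what Linux's own KVM interface exposes. The sketch below is a generic illustration of that API, not Intel's Clear Containers code: it opens /dev/kvm, creates a VM and a virtual CPU, copies a few bytes of guest code into the VM's memory and runs them on the hardware. No BIOS, no boot process, no emulated disks or display; most error handling is trimmed for brevity.

/* mini_vm.c -- the bare bones of a "copy code into RAM and run it" VM via
 * /dev/kvm. The guest is three instructions: write 'A' to port 0x3f8, then halt.
 * Needs VT-x/AMD-V and /dev/kvm. Build with: cc -o mini_vm mini_vm.c
 */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Guest code, 16-bit real mode: mov dx,0x3f8; mov al,'A'; out dx,al; hlt */
    const uint8_t code[] = { 0xba, 0xf8, 0x03, 0xb0, 'A', 0xee, 0xf4 };

    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) { perror("/dev/kvm"); return 1; }
    int vm = ioctl(kvm, KVM_CREATE_VM, 0);

    /* One page of guest "RAM", mapped at guest physical address 0x1000. */
    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(mem, code, sizeof(code));
    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0x1000,
        .memory_size = 0x1000, .userspace_addr = (uint64_t)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* Point the vCPU at the code: segment base 0, instruction pointer 0x1000. */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0; sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);
    struct kvm_regs regs = { .rip = 0x1000, .rflags = 0x2 };
    ioctl(vcpu, KVM_SET_REGS, &regs);

    /* Run until the guest halts, echoing anything it writes to its "serial port". */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        if (run->exit_reason == KVM_EXIT_IO && run->io.port == 0x3f8 &&
            run->io.direction == KVM_EXIT_IO_OUT)
            putchar(*((char *)run + run->io.data_offset));
        else if (run->exit_reason == KVM_EXIT_HLT)
            break;
    }
    putchar('\n');
    return 0;
}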

Intel announced this under the slightly misleading banner of "Clear Containers" some years ago. It didn't take the world by storm, but slowly, support is growing. First, CoreOS added support for Clear Containers to its rkt container runtime. Later, Microsoft added it to Azure. Now, though, Docker supports it, which might speed adoption.

Summary? Now both Docker and CoreOS rkt containers can be started in actual VMs, for additional isolation and security – whereas a Linux distro vendor is offering a container system that aims to look and work like a hypervisor. These are strange times. Perhaps the only common element is that it's bad news for both VMware and Microsoft. ®
