For example, if I'm running 5 VMs, there is a good chance that many of the pages are identical. Not only do I want those pages to be deduplicated, but I want them to be zero-copy (i.e., not deduplicated after the fact by some daemon).
To do that, the guest block cache needs to be integrated with the host block-cache, so that whenever some guest application tries to map data from disk, the host notices that another virtual machine has already caused this data to be loaded, so we can just map the same page of already loaded data into the VM that is asking.
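To make the idea concrete, here is a toy model of a host page cache shared between guests. It is only a sketch: `HostPageCache`, `Guest`, and `map_page` are made-up names, and a real hypervisor would do this with page tables and the kernel page cache rather than Python dicts. The point is just that the second guest gets the same page object, with no second disk read and no copy.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # bytes per page in this toy model


@dataclass
class HostPageCache:
    """Hypothetical host cache keyed by (backing file, page index)."""
    pages: dict = field(default_factory=dict)
    disk_reads: int = 0

    def get_page(self, backing_file: str, page_index: int) -> bytearray:
        key = (backing_file, page_index)
        if key not in self.pages:
            # Only the first guest to touch this page pays for the disk read.
            self.disk_reads += 1
            self.pages[key] = bytearray(PAGE_SIZE)  # stand-in for real data
        return self.pages[key]


@dataclass
class Guest:
    """Hypothetical guest that maps host pages instead of copying them."""
    name: str
    cache: HostPageCache
    mappings: dict = field(default_factory=dict)

    def map_page(self, backing_file: str, page_index: int) -> bytearray:
        page = self.cache.get_page(backing_file, page_index)
        self.mappings[(backing_file, page_index)] = page  # a reference, not a copy
        return page


cache = HostPageCache()
vm_a = Guest("vm-a", cache)
vm_b = Guest("vm-b", cache)
page_a = vm_a.map_page("base.img", 42)
page_b = vm_b.map_page("base.img", 42)
print(page_a is page_b, cache.disk_reads)
```

Note that the key here is the identity of the block on the shared backing file, which is exactly the thing that stops being shared once guests start writing their own copies.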
Zero-copy is harder, as one system upgrade on one of them will trash it, but KSM is overall pretty effective at saving some memory on similar VMs.
Better to not make copies in the first place.
An OS isn't large. Your spotify/slack/browser instance is of comparable size. Says more about browser based apps but still.
A fairly recent Windows 11 Pro image is ~26GB unpacked and 141k dirents. After finishing OOBE it's already running like >100 processes, >1000 threads, and >100k handles. My Chrome install is ~600MB and 115 dirents. (Not including UserData.) It runs ~1 process per tab. Comparable in scope and complexity? That's debatable, but I tend to agree that modern browsers are pretty similar in scope to what an OS should be. (The other day my "web browser" flashed the firmware on the microcontroller for my keyboard.)
They're not even close to "being comparable in size," although I guess that says more about Windows.
And remember that as well as RAM savings, you also get 'instant loading' because there is no need to do slow SSD accesses to load hundreds of megabytes of a chromium binary to get slack running...
If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck, nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload. (Different ISA, different syscalls, different executable format, etc.)
Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest, the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. have a concept of layers, and you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; it is much harder to do at the level of a block device.
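A rough sketch of why the sharing evaporates, using a toy copy-on-write clone. `Snapshot`, `Clone`, and `shared_fraction` are illustrative names, not any real volume manager's API, and the block counts are made up for the example.

```python
class Snapshot:
    """Immutable base image: maps LBA -> block contents."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)


class Clone:
    """Copy-on-write clone: reads fall through to the base, writes go to a delta."""

    def __init__(self, base):
        self.base = base
        self.delta = {}

    def read(self, lba):
        return self.delta.get(lba, self.base.blocks.get(lba))

    def write(self, lba, data):
        self.delta[lba] = data

    def shared_fraction(self):
        # Fraction of base blocks this clone still shares with the snapshot.
        total = len(self.base.blocks)
        overwritten = sum(1 for lba in self.delta if lba in self.base.blocks)
        return (total - overwritten) / total


base = Snapshot({lba: b"base" for lba in range(1000)})
clone = Clone(base)
before = clone.shared_fraction()

# One "large update" inside the guest rewrites 600 of the 1000 blocks.
for lba in range(600):
    clone.write(lba, b"updated")

after = clone.shared_fraction()
print(before, after)
```

The block layer has no way to know that two clones applied the same update, so each clone carries its own delta even when the new contents are identical.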
Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore, storing the information to dedup common blocks is non-trivial: go look at the manpage for ZFS's deduplication and it is littered with warnings about the performance, memory, and storage implications of dealing with the dedup table.
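For a feel of what a dedup table costs, here is a minimal content-hash table plus a back-of-the-envelope size estimate. This is a sketch, not how ZFS implements it: `DedupTable` and `table_bytes` are made-up names, and the 320 bytes per entry is an assumed placeholder for illustration, not a measured figure. Plug in whatever your filesystem's documentation says.

```python
import hashlib


class DedupTable:
    """Toy block-level dedup table: content digest -> reference count."""

    def __init__(self):
        self.entries = {}

    def write_block(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.entries[digest] = self.entries.get(digest, 0) + 1
        return digest


def table_bytes(unique_data_bytes: int, block_size: int, entry_bytes: int) -> int:
    """Estimate table size: one entry per unique block (rounded up)."""
    unique_blocks = -(-unique_data_bytes // block_size)  # ceiling division
    return unique_blocks * entry_bytes


table = DedupTable()
table.write_block(b"hello, world")
table.write_block(b"hello, world")   # duplicate: bumps the refcount only
table.write_block(b"something else")

# ASSUMPTION: 320 bytes per entry is an illustrative placeholder.
estimate = table_bytes(unique_data_bytes=2**40, block_size=128 * 1024, entry_bytes=320)
print(len(table.entries), estimate / 2**30, "GiB")
```

Under those assumed numbers, 1 TiB of unique 128 KiB blocks needs about 2.5 GiB of table, and the table has to be consulted on every write.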
The host block cache will end up deduplicating it automatically because all the 'copies' lead back to the same block on disk.
https://learn.microsoft.com/en-us/windows/security/applicati...
By 'zero copy', I mean that when a guest tries to read a page, if another guest has that page in RAM, then no copy operation is done to get it into the memory space of the 2nd guest.
For those unfamiliar, the informal distinction between type-1 and type-2 is that type-1 hypervisors are in direct control of the allocation of all resources of the physical computer, while type-2 hypervisors operate as some combination of being “part of” / “running on” a host operating system, which owns and allocates the resources. KVM (for example) gives privileged directions to the Linux kernel and its virtualization kernel module for how to manage VMs, and the kernel then schedules and allocates the appropriate system resources. Yes, the type-2 hypervisor needs kernel-mode primitives for managing VMs, and the kernel runs right on the hardware, but those primitives aren’t making management decisions for the division of hardware resources and time between VMs. The type-2 hypervisor is making those decisions, and the hypervisor is scheduled by the OS like any other user-mode process.
It was never popularly used in a way accurate to the origin of the classification. The original paper by Popek and Goldberg talked about formal proofs for the two types, and they really have very little to do with how the terms began being used in the 90s and 00s. Things have changed a lot with computers since the 70s when the paper was written and the terminology was coined.
So, language evolves, and Type-1 and Type-2 came to mean something else in common usage. And this might have made sense to differentiate something like esx from vmware workstation in their capabilities, but it's lost that utility in trying to differentiate Xen from KVM for the overwhelming majority of use cases.
Why would I say it's useless in trying to differentiate, say, Xen and KVM? Couple of reasons:
1) There's no performance benefit to type-1 - a lot of performance sits on the device emulation side, and both are going to default to qemu there. Other parts are based heavily on CPU extensions, and Xen and KVM have equal access there. Both can pass through hardware, support sr-iov, etc., as well.
2) There's no overhead benefit in Xen - you still need a dom0 VM, which is going to arguably be even more overhead than a stripped down KVM setup. There's been work on dom0less Xen, but it's frankly in a rough state and the related drawbacks make it challenging to use in a production environment.
Neither term provides any real advantage or benefit in reasoning between modern hypervisors.
What actually can change is the amount of work that the kernel-mode hypervisor leaves to a less privileged (user space) component.
For more detail see https://www.spinics.net/lists/kvm/msg150882.html
[1]: https://www.redhat.com/en/topics/virtualization/what-is-KVM
Although I’ll note that the line between a VMM and a hypervisor is not always clear. E.g., KVM includes some things that other hypervisors delegate to the VMM (such as instruction completion). And macOS’s hypervisor.framework is almost a pass through to the CPU’s raw capabilities.
Is there any article that tells the difference and relationship between KVM, QEMU, libvirt, virt-manager, Xen, Proxmox etc. with their typical use cases?
Qemu is a user space system emulator. It can emulate different architectures like ARM, x86, etc. in software. It can also emulate devices, networking, disks, etc. It is invoked from the command line.
The reason you'll see Qemu/KVM a lot is because Qemu is the emulator, the thing actually running the VM. And it utilizes KVM (on Linux; OSX has HVF, for example) to accelerate the VM when the host architecture matches the VM's.
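As a sketch of that "accelerate when the architectures match" rule, this builds a QEMU command line without running it. `pick_accelerator` and `qemu_command` are made-up helper names; `-accel kvm`, `-accel hvf`, and `-accel tcg` are real QEMU options, but the selection logic is a simplification (it does not check that /dev/kvm actually exists, and non-x86 guests usually need a machine type as well).

```python
import platform

# platform.machine() spellings mapped to QEMU's binary suffixes.
ARCH_ALIASES = {"arm64": "aarch64", "AMD64": "x86_64"}


def pick_accelerator(host_os: str, host_arch: str, guest_arch: str) -> str:
    """Return a QEMU accelerator name for this host/guest pair."""
    host_arch = ARCH_ALIASES.get(host_arch, host_arch)
    if host_arch != guest_arch:
        return "tcg"   # pure software emulation across architectures
    if host_os == "Linux":
        return "kvm"
    if host_os == "Darwin":
        return "hvf"
    return "tcg"


def qemu_command(guest_arch, memory_mib, disk_path, host_os=None, host_arch=None):
    """Build (but do not run) a minimal QEMU invocation."""
    host_os = host_os or platform.system()
    host_arch = host_arch or platform.machine()
    accel = pick_accelerator(host_os, host_arch, guest_arch)
    return [
        f"qemu-system-{guest_arch}",
        "-accel", accel,
        "-m", str(memory_mib),
        "-drive", f"file={disk_path},format=qcow2",
    ]


print(" ".join(qemu_command("x86_64", 1024, "disk.qcow2",
                            host_os="Linux", host_arch="x86_64")))
```

The same x86_64 guest on an ARM host falls back to `tcg`, which is why cross-architecture VMs feel so much slower.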
Libvirt is an XML based API on top of Qemu (and others). It allows you to define networks, VMs (it calls them domains), and much more with a unified XML schema through libvirtd.
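To give an idea of what that XML looks like, here is a small sketch that generates a bare-bones domain definition with the standard library. The element names (`domain`, `name`, `memory`, `vcpu`, `os`/`type`) follow libvirt's domain XML format, but `domain_xml` is a made-up helper and the result is deliberately incomplete: a usable domain also needs disks, network interfaces, and so on.

```python
import xml.etree.ElementTree as ET


def domain_xml(name: str, memory_mib: int, vcpus: int) -> str:
    """Build a minimal libvirt-style domain definition as a string."""
    domain = ET.Element("domain", type="kvm")
    ET.SubElement(domain, "name").text = name
    ET.SubElement(domain, "memory", unit="MiB").text = str(memory_mib)
    ET.SubElement(domain, "vcpu").text = str(vcpus)
    os_el = ET.SubElement(domain, "os")
    ET.SubElement(os_el, "type", arch="x86_64").text = "hvm"
    return ET.tostring(domain, encoding="unicode")


xml_text = domain_xml("demo-vm", 1024, 2)
print(xml_text)
```

A file like this, once fleshed out with devices, is the kind of thing you would hand to `virsh define`.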
Virsh is a CLI tool to manage libvirtd. Virt-manager is a GUI to do the same.
Proxmox is Debian under the hood with Qemu/KVM running VMs. It provides a robust web UI and easy clustering capabilities, along with nice-to-haves like easy management of disks, ceph, etc. You can also manage Ceph through an API with Terraform.
Xen is an alternative hypervisor (like ESXi). Instead of running on top of Linux, Xen has its own microkernel. This means less flexibility (there's no Linux body running things), but it is also simpler to manage and has less attack surface. I haven't played much with Xen though; KVM is kind of the de facto standard, but IIRC AWS used to use a modified Xen before KVM came along and ate Xen's lunch.
Use cases: a Proxmox web interface exposed on your local network, on a Linux box with KVM that uses QEMU to run VMs. Proxmox will allow you to manage that from the web. QEMU on its own is great for a single machine or a small fleet, but should be automated for any heavy lifting. Proxmox will do that.
In really simple terms, so simple that I'm not 100% sure they are correct:
* KVM is a hypervisor, or rather it lets you turn linux into a hypervisor [1], which will let you run VMs on your machine. I've heard KVM is rather hard to work with (steep learning curve). (Xen is also a hypervisor.)
* QEMU is a wrapper-of-a-sorts (a "machine emulator and virtualizer" [2]) which can be used on top of KVM (or Xen). "When used as a virtualizer, QEMU achieves near native performance by executing the guest code directly on the host CPU. QEMU supports virtualization when executing under the Xen hypervisor or using the KVM kernel module in Linux." [2]
* libvirt "is a toolkit to manage virtualization platforms" [3] and is used, e.g., by VDSM to communicate with QEMU.
* virt-manager is "a desktop user interface for managing virtual machines through libvirt" [4]. The screenshots on the project page should give an idea of what its typical use-case is - think VirtualBox and similar solutions.
* Proxmox is the above toolstack (-ish) but as one product.
---
[1] https://www.redhat.com/en/topics/virtualization/what-is-KVM
It’s like with “isomorphic” code. That just sounds much cooler than “js that runs on the client and the server”.
Is it good to think of libvirt as a virtual machine monitor, or is that more "virtual machine management"?
"API to virtualization system" would probably be closest approximation but it also does some more advanced stuff like coordinating cross-host VM migration
Firecracker has a balloon device you can inflate (ie: acquire as much memory inside the VM as possible) and then deflate... returning the memory to the host. You can do this while the VM is running.
https://github.com/firecracker-microvm/firecracker/blob/main...
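Roughly what driving the balloon looks like from the outside. This sketch only builds the request bodies; it does not talk to a running Firecracker. The endpoint and field names (`PUT /balloon`, `PATCH /balloon`, `amount_mib`, `deflate_on_oom`, `stats_polling_interval_s`) are from my reading of Firecracker's ballooning doc, so check them against the version you run. The helper names are made up.

```python
import json


def balloon_install_request(amount_mib: int, deflate_on_oom: bool = True,
                            stats_interval_s: int = 0):
    """Request that configures the balloon device before the VM boots."""
    body = {
        "amount_mib": amount_mib,
        "deflate_on_oom": deflate_on_oom,
        "stats_polling_interval_s": stats_interval_s,
    }
    return ("PUT", "/balloon", json.dumps(body))


def balloon_resize_request(amount_mib: int):
    """Request that changes the balloon target while the VM is running."""
    return ("PATCH", "/balloon", json.dumps({"amount_mib": amount_mib}))


# Inflate to 512 MiB (the guest gives up that memory), then deflate back to 0.
inflate = balloon_resize_request(512)
deflate = balloon_resize_request(0)
print(inflate, deflate)
```

In practice these would be sent over the VMM's API Unix socket, and how much memory actually comes back depends on what the guest can free.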
for many mostly "general purpose" use cases it's quite viable, or else ~fly.io~ AWS Fargate wouldn't be able to use it
this doesn't mean it's easy to implement the necessary automated tooling etc.
so depending on your dev resources and priorities it might be a bad choice
still I feel the article was quite a bit subtly judgemental while moving some quite relevant parts of the content into a footnote and also omitting that this "supposedly unusable tool" is used successfully by various other companies...
as if it was written by an engineer being overly defensive about their decision due to having to defend it for the 100th time because shareholders, customers, higher level management just wouldn't shut up about "but that uses Firecracker"
That depends on the workload and the maximum memory allocated to the guest OS.
A lot of workloads rely on the OS cache/buffers to manage IO so unless RAM is quite restricted you can call in to release that pretty easily prior to having the balloon driver do its thing. In fact I'd not be surprised to be told the balloon process does this automatically itself.
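A quick way to see how much a guest could plausibly give back is to look at /proc/meminfo inside it. This is a sketch: `parse_meminfo` and `reclaimable_kib` are made-up helpers, the sample text is invented, and counting all of `Cached` as reclaimable is optimistic (it includes things like tmpfs that cannot simply be dropped).

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def reclaimable_kib(meminfo: dict) -> int:
    """Optimistic upper bound on memory the guest could hand back."""
    return (meminfo.get("MemFree", 0)
            + meminfo.get("Buffers", 0)
            + meminfo.get("Cached", 0))


# Invented sample; on a real guest use open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:        2048000 kB
MemFree:          256000 kB
Buffers:           64000 kB
Cached:           900000 kB
"""

info = parse_meminfo(SAMPLE)
print(reclaimable_kib(info), "kB potentially reclaimable")
```

On Linux, writing to /proc/sys/vm/drop_caches is the usual way to ask the kernel to release clean cache before inflating a balloon.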
If the workload does its own IO management and memory allocation (something like SQL Server which will eat what RAM it can and does its own IO caching) or the VM's memory allocation is too small for OS caching to be a significant use after the rest of the workload (you might pare memory down to the bare minimum like this for a “fairly static content” server that doesn't see much variation in memory needs and can be allowed to swap a little if things grow temporarily), then I'd believe it is more difficult. That is hardly the use case for Firecracker though, so if that is the sort of workload being run, perhaps reassessing the tool used for the job was the right call.
Having said that my use of VMs is generally such that I can give them a good static amount of RAM for their needs and don't need to worry about dynamic allocation, so I'm far from a subject expert here.
And, isn't Firecracker more geared towards short-lived VMs, quick to spin up, do a job, spin down immediately (or after only a short idle timeout if the VM might answer another request if one comes in immediately or is already queued), so you are better off cycling VMs, which is probably happening anyway, than messing around with memory balloons? Again, I'm not talking from a position of personal experience here so corrections/details welcome!
It's absolutely usable in practice, it just makes oversubscription more challenging.
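The oversubscription problem is just arithmetic: if guests never hand memory back, you have to plan for every guest hitting its maximum at once. A small sketch with made-up numbers and made-up helper names:

```python
def oversubscription_ratio(guest_max_mib, host_mib):
    """Sum of guest memory ceilings divided by physical host memory."""
    return sum(guest_max_mib) / host_mib


def worst_case_shortfall_mib(guest_max_mib, host_mib):
    """Memory the host would be missing if every guest hit its ceiling."""
    return max(0, sum(guest_max_mib) - host_mib)


guests = [4096] * 10   # ten guests, 4 GiB ceiling each (illustrative)
host = 32768           # 32 GiB host (illustrative)

ratio = oversubscription_ratio(guests, host)
shortfall = worst_case_shortfall_mib(guests, host)
print(ratio, shortfall)
```

A ratio above 1.0 only works if guests reliably return what they are not using, which is the part that takes extra tooling.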
I will never understand the whole virtual machine and cloud craze. Your operating system is better than any hypervisor at sharing resources efficiently.
And if you're running untrusted code, then using a virtualized environment is the easiest (I'd even say best) way to go about it.
Automatic scaling is great. Cloud parallelization (a.k.a fork) is absolutely wild once you get it rolling. Code deployments are incredibly simple. Never having to worry about physical machines or variable traffic loads is worth the small overhead they charge me for the wrapper. The generic system wide permissions model is an absolute joy once you get over the learning curve.
The fact of the matter is that it's just inefficient, slow and expensive.
Bare metal is simple, fast, and keeps you in control.
...firecracker does fine what it was designed to - short running fast start workloads.
(oh, and the article starts by slightly misusing a bunch of technical terms, firecracker's not technically a hypervisor per se)
so while Firecracker was designed for things running just a few seconds there are many places running it with jobs running way longer than that
the problem is if you want to make it work with long running general purpose images you don't control you have to put a ton of work into making it work nicely on all levels of your infrastructure and code ... which is costly ... which a startup competing on an online dev environment compared to e.g. a vm hosting service probably shouldn't waste time on
So AFAIK the decision in the article makes sense, but the reasons listed for the decision are oversimplified to a point you could say they aren't quite right. Idk why, could be anything from the engineer believing that to them avoiding issues with some shareholder/project lead who is obsessed with "we need to do Firecracker because competition does so too".
EDIT: Okay, https://www.geekwire.com/2018/firecracker-amazon-web-service... says my "pretty sure" memory is in fact correct.
mainly it's optimized to run code only shortly (init time max 10s, max usage is 15min, and default max request time 130s AFAIK)
also it's focused on thin serverless functions, like e.g. deserialize some request, run some thin simple business logic and then delegate to other lambdas based on it. These kinds of functions often have similar memory usage per call, and if a call is an outlier it can just discard the VM instance soon after (i.e. at most after starting up a new instance, i.e. at most 10s later)
We reclaim memory with a memory balloon device, for the disk trimming we discard (& compress) the disk, and for i/o speed we use io_uring (which we only use for scratch disks, the project disks are network disks).
It's a tradeoff. It's more work and does require custom implementations. For us that made sense, because in return we get a lightweight VMM that we can more easily extend with functionality like memory snapshotting and live VM cloning [1][2].
[1]: https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...
[2]: https://codesandbox.io/blog/cloning-microvms-using-userfault...
> i/o speed we use io_uring
custom io_uring based driver for the VM block devices? or what do you mean here?
> custom io_uring based driver for the VM block devices? or what do you mean here?
We're using the async io backend that's shipped with Firecracker for our scratch disks.
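For anyone wondering what selecting that backend looks like, here is a sketch of a drive config request with the async engine. It only builds the JSON, it does not contact a VMM. The field names (`drive_id`, `path_on_host`, `is_root_device`, `is_read_only`, `io_engine` with `"Sync"`/`"Async"`) are my recollection of Firecracker's drive API and should be verified against its docs; `drive_request` and the path are hypothetical.

```python
import json

VALID_ENGINES = ("Sync", "Async")  # "Async" is the io_uring-backed engine


def drive_request(drive_id: str, path_on_host: str,
                  io_engine: str = "Async", read_only: bool = False):
    """Build a drive config request for a non-root (scratch) disk."""
    if io_engine not in VALID_ENGINES:
        raise ValueError(f"unknown io_engine: {io_engine!r}")
    body = {
        "drive_id": drive_id,
        "path_on_host": path_on_host,
        "is_root_device": False,
        "is_read_only": read_only,
        "io_engine": io_engine,
    }
    return ("PUT", f"/drives/{drive_id}", json.dumps(body))


# Hypothetical scratch disk path, for illustration only.
request = drive_request("scratch", "/tmp/scratch.ext4")
print(request)
```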
This is why paying for dedicated memory is often more expensive than its counterpart, because that dedicated memory is not considered as part of pooling.
I think the complaints are perfectly valid.
Interesting. I guess we are reading a different website.
I understand that they need to sell their product, but jeez, don't leave us hanging like that.
Other than making sure we release unused memory to the host, we didn't customize QEMU that much. Although we do have a cool layered storage solution - basically a faster alternative to QCOW2 that's also VMM independent. It's called overlaybd, and was created and implemented at Alibaba. That will probably be another blog post. https://github.com/containerd/overlaybd
This just works on Hyper-V Linux guests btw. For all the crap MS gets they do some things very right.
I would be scared to let unknown persons use QEMU that bind mounts volumes as that is a huge security risk. Firecracker, I think, was designed from the start to run un-sanitized workloads, hence, no bind mounting.
Most dangerous 12-word sentence.
I didn't know it existed until they posted, but QEMU has a Firecracker-inspired target:
> microvm is a machine type inspired by Firecracker and constructed after its machine model.
> It’s a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.
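For reference, a sketch of what launching that machine type looks like, modelled on the example in QEMU's microvm documentation. `microvm_command` is a made-up helper that only builds the argument list; the kernel and disk paths are placeholders, and the flags assume an x86_64 Linux host with KVM available.

```python
def microvm_command(kernel: str, disk_image: str,
                    memory_mib: int = 512, vcpus: int = 2) -> list:
    """Build (but do not run) a QEMU microvm invocation."""
    return [
        "qemu-system-x86_64",
        "-M", "microvm",                 # the Firecracker-inspired machine type
        "-enable-kvm", "-cpu", "host",
        "-m", f"{memory_mib}m",
        "-smp", str(vcpus),
        "-kernel", kernel,               # direct kernel boot
        "-append", "console=ttyS0 root=/dev/vda",
        "-nodefaults", "-no-user-config", "-nographic",
        "-serial", "stdio",
        # No PCI on microvm, so the disk is attached as a virtio-mmio device.
        "-drive", f"id=disk0,file={disk_image},format=raw,if=none",
        "-device", "virtio-blk-device,drive=disk0",
    ]


cmd = microvm_command("vmlinux", "rootfs.img")
print(" ".join(cmd))
```

The `virtio-blk-device` (rather than `virtio-blk-pci`) is the visible consequence of the "without PCI" part of the quote.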
Light is cool but for many tasks that level of Spartan is overkill
If I’m investing time in light it might as well be wasm tech
Okay.