
KSM me baby

One of the common methods for increasing density in the virtual workload space today is deduplication. Any computing storage resource can be deduped: the process scans for identical blocks, stores a single copy, and replaces each duplicate with a pointer to the original.

In the above image the blue blocks are identical. As you can see, only one copy is stored in hypervisor memory and shared among three different virtual machines. For environments that run identical operating systems or applications, this can free up some memory and allow for greater density.
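As a toy illustration of the block-dedupe idea (this is not how KSM or any hypervisor actually implements it), you can count duplicate 4KiB blocks in a disk image from the shell. disk.img is a stand-in for any image you have handy, and the loop is slow on anything large:

blocks=$(( $(stat -c%s disk.img) / 4096 ))      # number of full 4KiB blocks in the image
unique=$(for i in $(seq 0 $(( blocks - 1 ))); do
  dd if=disk.img bs=4096 skip=$i count=1 2>/dev/null | md5sum
done | sort -u | wc -l)                         # hash every block, count distinct hashes
echo "$blocks blocks total, $unique unique"     # the difference is what dedupe could reclaim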

You might think that something as simple as a hash scan and COW pointer replacement would be available in any type 1 hypervisor. VMware ESXi used to have this functionality enabled; it was called transparent page sharing (TPS). Since the late-2014 patches across the ESXi 5.x line, inter-VM sharing has been disabled by default, and every virtual machine's .vmx file must be given a matching hash salt value in order for TPS to work across VMs. This decision was likely made due to security implications that are more theoretical than concrete, the two obvious issues being the weakening of ASLR and the possibility of rowhammer-type memory leak attacks. Later on, large page support made TPS less useful, since an entire 2MB memory page is far less likely to be identical to another than a 4KB page. As such, VMware documentation suggests this feature is mainly useful during memory overcommit to avoid OOM situations.
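Re-enablement comes down to two knobs, per VMware's salting documentation: the host advanced setting Mem.ShareForceSalting (0 restores the old salt-free behavior) and the per-VM .vmx entry sched.mem.pshare.salt (VMs with matching salt values may share pages). A sketch, with an arbitrary salt value; check the docs for your exact build:

Mem.ShareForceSalting = 0
sched.mem.pshare.salt = "1234"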

Microsoft’s Hyper-V platform has no such memory deduplication technology available.

The Linux kernel has had full-scale memory deduplication available since 2009, when the 2.6.32 kernel merged Kernel Same-page Merging (KSM). This is a fully configurable daemon that scans host memory and merges identical pages to reduce memory usage. At the time, KSM allowed a KVM hypervisor host with 16GB of RAM to run 52 instances of Windows XP, each allocated 1GB of memory. That's 3.25 times the host's physical memory! Let's see what it is doing on a production host with minimal OS re-use. This is far from an ideal scenario, but it should yield some benefit.

root@cluster0n0:~# cat /sys/kernel/mm/ksm/pages_sharing 
326715

So we have 326,715 pages shared, at 4,096 bytes each. Some simple math shows a savings of roughly 1.34GB (about 1.25GiB) of memory due to KSM. Most of this is likely down to the two Windows servers running on the host.
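The host can do that math for you; this assumes the standard x86 page size of 4KiB:

awk '{ printf "%.2f GiB saved\n", $1 * 4096 / 1024^3 }' /sys/kernel/mm/ksm/pages_sharing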

For a more detailed academic overview you can check out this paper. Since KSM is totally tunable you can read about the available options in the Red Hat documentation.
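For a quick feel for the knobs, the daemon is driven entirely through sysfs. The values below are illustrative, not recommendations:

echo 1 > /sys/kernel/mm/ksm/run                 # 0 stops the daemon, 1 runs it, 2 unmerges everything
echo 100 > /sys/kernel/mm/ksm/pages_to_scan     # pages examined per wake-up
echo 20 > /sys/kernel/mm/ksm/sleep_millisecs    # pause between scan batches
cat /sys/kernel/mm/ksm/pages_shared             # how many merged "master" pages exist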


KVM/Qemu cache performance

There are a few different options when it comes time to set up your virtual storage backing within the KVM environment.

Your options are:

  1. No cache
  2. Direct sync
  3. Writethrough
  4. Writeback

Let’s take a look at what we get with each.

The no cache and writethrough methods both emphasize data safety, just with a different focus. No cache leaves the disk's write cache enabled, which allows for better write performance; writethrough uses the host page cache, which offers improved read performance. Writeback has both caches enabled and will perform the best, but at the cost of data safety: a power loss or system lockup means any data still sitting in the buffers is lost. While journaling and copy-on-write based filesystems reduce the likelihood of inconsistent data leading to filesystem corruption, certain applications may have issues with data loss of this type. Databases, for instance, will have told themselves that those transactions were flushed to storage and will expect those blocks to have been written. For these sensitive applications, directsync is the safest. Note that directsync can still be fast if you have caching done at another level; a hardware RAID controller, for instance, allows for safe write caching when paired with a battery.
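The mode is chosen per disk. With plain QEMU it is a single property on -drive (the image path and bus here are examples), and libvirt exposes the same choice as the cache attribute on a disk's driver element:

qemu-system-x86_64 -m 2048 \
  -drive file=/var/lib/libvirt/images/guest.qcow2,if=virtio,cache=none

Swap cache=none for directsync, writethrough, or writeback to compare the modes.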

Below are some testing results for a large array of disks on software RAID. The four methods are benchmarked with small 4k transfers and with sequential transfers, at queue depths of 1 and 32.

And we can see performance that tracks our predictions. Writethrough beats no-cache in reads. No-cache comes back and beats writethrough in writes. Writeback absolutely dominates both as it tells the application that data has been written as soon as it hits a cache.
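For anyone wanting to reproduce this style of benchmark inside a guest, fio can exercise all four corners of the test matrix. A sketch of one corner (4k random writes at queue depth 32; the file name and size are arbitrary):

fio --name=4k-randwrite --filename=/tmp/fio.test --size=1G \
    --bs=4k --rw=randwrite --ioengine=libaio --direct=1 \
    --iodepth=32 --runtime=60 --time_based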


Proxmox – Ready for the Big Leagues?

In the world of virtualization, a decently long pedigree and strong support are king. There is an old saying that nobody ever got fired for buying Cisco; the hypervisor equivalent would probably be Hyper-V or VMware. So is there room for a more homegrown, Linux-based virtualization platform?

I tried to flesh out some details on Proxmox in my homelab to get a general idea of performance, stability, and support. I grabbed the latest spin of the Debian-based virtualization environment and went to work. With the cast-off machines I have available for testing, I didn't expect big-boy server performance, but I do know what kind of performance I can get from ESXi on these same machines. The install was painless, and it took about 30 minutes to get my four machines booted and configured. Here is where I ran into my first problem: the official documentation has two different wiki pages for setting up clustering, each with a different procedure. Some forum digging later, I learned that the pvecm command supersedes the older method. Also undocumented is the fact that you must copy the root user's ssh key from each machine to every other machine before you begin tying the cluster together. My first attempt without doing this resulted in an insufficient quorum (three machines are required): the third machine was unable to join, as the cluster leader was awaiting quorum and did not process the added key. Be sure to ssh-copy-id before you begin; I had to reinstall and start over.
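Pulling that together, the pvecm flow looks roughly like this (host names and the IP are examples), with the key copying done up front:

for h in node1 node2 node3; do ssh-copy-id root@$h; done   # run from every node, targeting the others
pvecm create mycluster    # first node only; the cluster name is arbitrary
pvecm add 192.168.1.10    # on each remaining node, pointing at the first node's IP
pvecm status              # confirm quorum once everyone has joined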

After I got the cluster created, I was really excited to try out the built-in cluster storage management suite, Ceph. Ceph is a really elegant approach to storage. Each node (server) contributes whatever storage you attach to it; you take these block devices, called OSDs, and add them to a pool. Transactions in flight occur on the OSDs or on separate block devices you designate as journals, so it's best to use something with great latency characteristics for a journal, such as an SSD. But I'm getting ahead of myself because, yet again, there are two separate documents outlining two separate procedures for setting up the Ceph cluster. After more digging, the second link turned out to be the more up-to-date one. Unfortunately, I was unable to utilize the Ceph pool that I created, as an undocumented key signing issue got in the way. What you need to do is copy the key created during cluster setup into a new directory and name it after what you wish to call your storage pool.

mkdir -p /etc/pve/priv/ceph                                                     # directory Proxmox expects keyrings in
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/ceph-sshd_01.keyring  # file name must match the storage pool ID
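For completeness, the more current of the two procedures drives setup through the pveceph wrapper. A rough sketch of the sequence from that era's documentation (the network and device paths are examples):

pveceph install                                   # pull in the Ceph packages on each node
pveceph init --network 10.10.10.0/24              # once, on the first node
pveceph createmon                                 # on each monitor node
pveceph createosd /dev/sdb -journal_dev /dev/sdc  # spinning disk backed by an SSD journal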

This was the third issue I ran into just trying to configure Proxmox, before I even began using it to host anything. I soldiered on, only to discover that I was unable to upload any reasonably sized .iso to mount for OS installation. The option exists in the UI, but the web server is, by default, incorrectly configured and fails silently and repeatedly when handed any installation .iso. After researching the issue, it appears that a lead developer on the project doesn't know how to configure Apache. Apache can be configured to accept any size of upload you want; case in point, my ownCloud setup:

[Image: owncloud-upload]
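For an Apache-hosted PHP app like ownCloud, permitting large uploads generally comes down to two php.ini directives. The 16G value is just whatever ceiling you want, and the exact php.ini path varies by distro; restart Apache after changing them:

upload_max_filesize = 16G
post_max_size = 16G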

After using scp to copy my installation media to the appropriate directory (the one-liner is below), I was finally able to install some virtual machines. Performance seemed good, especially in terms of CPU and memory utilization: right about where I expected a lean hypervisor to land. The storage, however, was great at reads but had deal-breaking issues with writes. A write benchmark within a Linux VM put sustained writes at 8.5MB/s; for hosts connected by gigabit Ethernet, that sort of performance is inexcusable. ZFS is available as a storage backend for Proxmox, and I recommend using it, though you unfortunately lose some of the versatility and scalability that come with a clustered, network filesystem.
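The copy itself is a one-liner; /var/lib/vz/template/iso is the default ISO location on a node (the host and file names here are examples):

scp installer.iso root@pve-node:/var/lib/vz/template/iso/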

Through all the documentation issues, setup foibles, and general usability problems, I persevered. The final nail in the coffin was live migration: it was not supported for LXC guests, and I saw more downtime during live migration than I would with Hyper-V or a vCenter cluster.

I’m really glad that a Linux-based virtualization platform is being developed. It is coming along nicely, but as of December 2015 I’d have to conclude that VMware is still king by a country mile.

Our Colo Adventure Pt II – The Software

So you’ve got an absolute monster of a server, what do you run on it? There are some really great options. If you are like my co-worker and prefer clicky clicky management and a familiar interface, there is always Windows Server. If you are hardcore and want the most stable and most performant option, FreeBSD is a strong contender. Linux is a great all-around performer and has very strong software support for whatever task you are trying to accomplish. These days though, you aren’t really all that limited with any of them as we can leverage virtualization.

In order to keep the natives from getting restless, Windows Server was our first choice. We could leverage the switch-independent NIC teaming that is new to Windows Server 2012, along with the native hypervisor for separating all of our services and controlling network access through the vswitch. We both use Hyper-V at work and support numerous customer deployments on the platform. It's not a bad system, but the management utilities are slightly lacking, especially for advanced features such as clustering. As we have only a single server, though, those issues would have had minimal impact for us.

Windows Server installed just fine, and getting things up and running was plenty easy. But after placing a load on the machine, we discovered several problems. The first issue was an absolute showstopper: NIC aggregation simply would not work in a stable manner with our dual Intel NICs. Nothing we tried in terms of driver versions or settings solved the issue. Under sustained load the network would last anywhere from 30 seconds to an hour before becoming entirely unresponsive, and the system needed a reboot to become functional again. While this was a dealbreaker, and Windows was dead to us at that point, we were curious about another feature new to Server 2012. Storage Spaces sounds like just a magically awesome technology for pooling drives and managing storage: you use the GUI to make virtual disks on top of whatever physical disks you want, then carve out partitions on top of the storage as normal and go to town…when it works. For our purposes the only configuration that made sense was dual parity. We couldn't just throw away half our disk space on a mirrored configuration, and with dual parity we would only lose two disks' worth of space to data redundancy. While I had read that parity spaces perform badly, I was shocked to find out just how poorly. The writes were most concerning:

[Image: terribad_writes]

80MB/s writes, under the most ideal conditions, across twenty-four 7,200RPM spindles are wildly unacceptable. 4k writes and re-writes were so bad you'd almost get better performance from a 3.5″ floppy drive. I was shocked!

[Image: terribad_reads]

The read performance was sawtoothing all over the place, so I took Windows Server out behind the woodshed and put it out of its misery.

The next choice was Ubuntu Server LTS. It has a nice, long five-year support window, and it's just a sane Linux implementation. We would have gone with CentOS, but I needed a newer kernel because I wanted to employ ZFS. I could write an entire post on why ZFS is the most amazing thing since sliced bread, and I might even do that sometime, but staying on point: ZFS was light-years ahead of Spaces. We put all the disks into a single pool built from three separate RAID-Z1 vdevs. Benchmarking this pool yields 1.2+ GB (yes, gigabytes) per second for sustained uncached reads, and writes hover around the 800MB/s range. The performance is just phenomenal. On top of that we get to use our 48GB of RAM for read caching, which makes things even faster. I couldn't be happier with our disk performance.
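That layout translates into a single zpool create, assuming an even 8-disk split across the three vdevs. The pool name and device names are placeholders, and /dev/disk/by-id paths would be saner in real life:

zpool create tank \
  raidz1 sda sdb sdc sdd sde sdf sdg sdh \
  raidz1 sdi sdj sdk sdl sdm sdn sdo sdp \
  raidz1 sdq sdr sds sdt sdu sdv sdw sdx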

UFW makes for a really simple iptables frontend for managing our traffic and locking down ports.
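A handful of commands gets you a default-deny posture with holes only for what the box actually serves (the ports here are examples):

ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp     # ssh management
ufw allow 443/tcp    # example hosted service
ufw enable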

We also started using KVM to keep things sane and separated, which leads us into part III.