A bit on SANs and system dependencies

It’s fairly well known that I’m not a fan of SANs for mysql. I’m really not a fan of them in general, but most of that comes from not being a fan of them for mysql. Over the past few decades the server world has migrated from a few large expensive servers to lots of small cheap servers. For the most part this is accepted as a good thing. Large mainframe computers were slowly replaced by relatively smaller yet still expensive servers, which are now being replaced by smaller, cheaper servers. I apply the same logic and trend to storage. I think large centralized storage should be replaced by smaller local storage or storage services.

The idea of a SAN seems great on paper. You get a large pool of storage which can be sliced up into smaller pools as needed by different clients. If one client demands more storage it’s fairly straightforward to allocate them more and expand the filesystem over it. If you need more storage you simply add it to the pool and dole it out as needed. When it comes to mysql there are some problems with this approach. Some of the problems are technical; others are with people and process.

People tend to think of storage as a sack that can hold N apples. When the sack is full of apples, either get another sack or a larger sack and add more apples. This is only one dimension of storage. The other one is more difficult to plan for and is often overlooked until it’s too late. It is the rate at which apples can be added to or removed from the sack, and how quickly you can sort through the sack to get at specific apples. The time it takes to access data and the number of concurrent requests that can be supported are more important for the speed of a database than the amount of data that can be stored.

Not all SAN storage is created equal. It’s possible that the large shared pool of storage is built from different underlying disks. Some of these disks can only support a few hundred IOPS (input/output operations per second) while others can support several thousand. Once the types of disks are mixed, the large shared storage pool becomes a set of smaller pools classified by the rate at which they can read and write data. Many SANs also support caching reads or writes in front of those disks. So not only do you now have to decide which clients will use which speed of disks, you also have to make sure the cache isn’t overloaded by any one of the clients.

On top of all of this there is the problem of changing workloads. During almost every SAN performance-related conversation I have ever had, someone suggests that I benchmark the SAN vs DAS and let the benchmark drive the architecture. This is fine if the SAN performance doesn’t change, but as I’ve already established, allocating SAN space is an ongoing task. The workload of clients can change and the number of clients on the SAN can also change. Adding more clients or changing the workload of existing clients can change the performance for the other clients.

I understand that it is possible to allocate clients in a way that will prevent some clients from interfering with others, but in practice I’ve never seen this work out. Inevitably some clients interfere with others. This becomes a huge problem because people tend to think of i/o latency as a constant within some small range of variation. I’ve seen i/o latency go from a few milliseconds to several seconds for the same storage. Tracking down the source of these problems can be difficult, especially if the clients are separate teams or business units. Adding new clients to a SAN or changing the workload invalidates any previous benchmarks or capacity plans done against it.
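Since a one-time benchmark goes stale as soon as the SAN's clients change, one option is to keep sampling latency from the database host itself and watch the tail rather than the average. The following is only a rough sketch, assuming Python 3 and a directory on the same filesystem as the mysql data; the function names, sample count, and payload size are all made up for illustration and this is not how any particular monitoring tool works.

```python
import os
import statistics
import sys
import tempfile
import time


def sample_fsync_latency_ms(directory, samples=100, payload=b"x" * 4096):
    """Append a small payload and fsync it, timing each write+fsync in ms.

    Assumption: a scratch file in `directory` sees roughly the same
    storage path as the database files stored there.
    """
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # block until the storage reports the write as durable
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Report median, approximate 99th percentile, and worst case."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical usage: pass the directory to probe, default to the temp dir.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print(summarize(sample_fsync_latency_ms(target)))
```

Run on a schedule and graphed over time, the max and 99th percentile are where a noisy neighbor on shared storage would be expected to show up first; the median alone can look healthy while individual commits stall.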

Back to mysql. Most mysql installations commit transactions serially. As part of this process they run an fsync on the transaction log to make sure the transaction is synced to disk. This means that doubling the i/o latency will halve the number of transactions that can be committed in a given time. This makes mysql more sensitive to i/o latency than other types of applications. Mysql slaves perform every i/o of a transaction sequentially, not just the commit phase, which makes them even more sensitive. While it’s technically possible to have a SAN function correctly, it isn’t as simple as carving up the global pool of storage for clients.
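To put numbers on the latency arithmetic, here is a small sketch. It assumes a strictly serial commit path in which each commit waits on exactly one fsync and nothing else costs time; the function name and the latency values are illustrative, not measurements from any real system.

```python
def max_serial_commits_per_second(fsync_latency_ms):
    """Upper bound on commits/sec when each commit waits on one fsync.

    Assumption: commits are strictly serial and the fsync dominates the
    commit time. Any batching of commits would raise this ceiling.
    """
    if fsync_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / fsync_latency_ms


if __name__ == "__main__":
    # Hypothetical latencies, from a healthy write cache to a badly congested SAN.
    for latency_ms in (0.5, 1.0, 2.0, 10.0, 2000.0):
        rate = max_serial_commits_per_second(latency_ms)
        print(f"{latency_ms:8.1f} ms per fsync -> at most {rate:8.1f} commits/sec")
```

Under these assumptions the ceiling is simply the inverse of the latency, so the jump from a few milliseconds to several seconds described earlier takes the bound from hundreds of commits per second to less than one.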

I have to be fair and say that there is a potential for performance problems with DAS as well. Performance can degrade for various reasons, such as when a disk dies or a controller battery dies. In these situations performance can degrade a little bit or it can degrade severely, such as when the controller refuses to cache writes without a battery. The difference between these scenarios and the allocation scenarios on a SAN is that DAS failures are states that can be monitored. If a disk dies your monitor sends an alert and you replace it. If the battery dies, simply replace it. This means your monitoring has to be in place and you need an HA pair of machines, but I don’t feel like that is a tall order.

The upside of using DAS is that each machine gets its own set of disks. If it overloads those disks and causes problems, those problems won’t be reflected in the other machines’ i/o latency. The downside is that it’s more difficult to add space to each machine, meaning that you need to plan accordingly. It can also mean that there is a larger pool of unused space because each machine needs a certain percentage of overhead. This depends on the workload of course, but it must be considered. It also means that it is more difficult to scale vertically, although SSDs and Fusion-io have raised the vertical scalability ceiling significantly in the past few years.

 

[ Update 2011-06-19 Daniël van Eeden has posted a good followup on his blog ]

2 Comments

  1. One big problem I have with DAS is that native file systems on Linux do not support a lot of useful storage capabilities, like properly working snapshots or efficient provisioning of new copies across hosts (for example, NetApp's SnapMirror capability). Suppose you have a slave that falls over and needs to be fixed. With snapshots you can just revert back to before the problem. If you don’t have snapshots you re-provision from another system. These capabilities are increasingly important as you get to the 500GB+ range, because at that point a reload from another host using Xtrabackup (for example) can run to 24 hours or more, followed by N hours of catchup time.

    Without the high-end SAN/NAS features you end up being restricted to relatively small datasets, which is exactly the opposite of the tendency I see out in the field.

  2. Eric Bergen says:

    Those features are nice and they do make it easier to provision new machines, but which architecture do you go with? It is nice to be able to deploy one SAN and provision slaves by snapshotting the master. The problem is that all your storage now relies on one SAN. Sure, you can snapmirror that to another site or another local SAN, but the recovery time from a SAN failure will include rolling forward file system and transaction logs and heating up caches. Alternatively, do you deploy half the machines to one SAN and half to another? This is a safer architecture, but it means you can’t really CoW provision new slaves, although you can copy between SANs. I can see benefits to both if either justifies the cost of a SAN.

    Sadly the most common deployment I see is one big SAN that is shared not only by multiple different mysql installations but other completely unrelated systems. The group that manages the SAN is independent from the group managing the databases. It’s a really scary combination and one that I hear about all too often when talking about past system failures.

    I wish I could deploy to a pair of cheap SANs that snap mirrored to each other, with masters on one and slaves on the other, where all of the caches on the SAN were perfectly partitioned so no client could affect the others’ performance. The SANs would be monitored for hot spots, disk space, and latency by a group of operators who scheduled maintenance tasks and upgrades so that they didn’t impact users. They would also establish SLAs with their users for both uptime and latency, which they always met.

    When someone shows me that deployment I will copy it and change my mind. I haven’t seen it yet.
