Yesterday my friend and ex-colleague Iain sent me a link to this
dailyWTF article. I felt the content of the article and of (most of) the
comments were so wrong on so many levels that I had to write something
about it.
The RH Performance Tuning course
I suspect he did this because he remembered a heated discussion I had in a
team meeting with our team leader back when Iain and I worked together:
our team leader had just come back from a "Red Hat Performance Tuning" course
and said there were a lot of things we could do to improve the performance of
our systems, including:
- ensure that all systems had swap defined as twice the amount of RAM
- ensure that the /tmp partitions were created on the outside parts of the
  disk platters
I expressed serious doubts about the validity of those assumptions in a
modern IT environment.
First of all, memory is cheap nowadays and QoS matters. In most cases, a
swapping server is the best guarantee that it won't be able to offer the
right level of service: active swapping indicates either that there is
something wrong with the software, like a memory leak, or that the server is
not properly sized for its task.
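To tell an actively swapping server from one that merely has old pages parked
in swap, a quick check of the swap-in/swap-out rates is enough. Here is a
minimal sketch assuming Linux (the swap_rate helper and its sampling interval
are my own):

    import time

    def swap_rate(interval=5):
        """Sample pswpin/pswpout from /proc/vmstat (Linux): pages actually
        swapped in and out per second. A sustained non-zero rate means the
        box is actively swapping, not merely holding stale pages in swap."""
        def read_counters():
            counters = {}
            with open("/proc/vmstat") as f:
                for line in f:
                    key, value = line.split()
                    counters[key] = int(value)
            return counters["pswpin"], counters["pswpout"]

        in1, out1 = read_counters()
        time.sleep(interval)
        in2, out2 = read_counters()
        return (in2 - in1) / interval, (out2 - out1) / interval

    swap_in, swap_out = swap_rate()
    print(f"swap-in: {swap_in:.1f} pages/s, swap-out: {swap_out:.1f} pages/s")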
The partitioning issue is very similar to the case described in the dailyWTF
article. It is based on physical assumptions that are not necessarily true
nowadays, especially when we are talking about partitions created on a
hardware RAID1 volume built from multi-platter drives. In my opinion, there
was no guarantee that the firmware of the RAID controller, or that of the
drives themselves, would do what we think it does.
Proof versus Belief
Interestingly enough, it seems that I was wrong and that drive
manufacturers do their best to keep a sector mapping that is still in sync
with the common belief, as proved by the zcav tests pointed to in one of the
article's comments.
What is important here is the experimental evidence as opposed to beliefs or
possibly outdated knowledge.
Still, it is important to remember that the published zcav data are only
valid in the context of those tests: they might not hold for your production
system, with your set of drives, your RAID controller and, above all, your
workload.
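If in doubt, it is easy enough to measure yourself. Here is a minimal,
zcav-like sketch assuming Linux, root access and a hypothetical /dev/sda (the
device name, sample size and offsets are mine): it times sequential reads at
a few positions across the drive to see whether the outer zones really are
faster. Note that posix_fadvise only asks the kernel to drop cached pages, so
treat the numbers as indicative:

    import os
    import time

    DEV = "/dev/sda"            # hypothetical device, adapt to your system
    SAMPLE = 64 * 1024 * 1024   # read 64 MiB at each position

    fd = os.open(DEV, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)

    for fraction in (0.0, 0.25, 0.5, 0.75, 0.95):
        offset = (int(size * fraction) // 4096) * 4096
        # ask the kernel to drop cached pages so we measure the drive, not RAM
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        start = time.monotonic()
        done = 0
        while done < SAMPLE:
            chunk = os.pread(fd, min(1 << 20, SAMPLE - done), offset + done)
            if not chunk:
                break
            done += len(chunk)
        elapsed = time.monotonic() - start
        print(f"{fraction:5.0%} into the device: {done / elapsed / 1e6:.0f} MB/s")

    os.close(fd)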
Of course, with enough experience with a specific application, generic rules
can sometimes be deduced, as long as they are methodically derived and not
just wild assertions.
Alas, if the conditions change, the rules are no longer valid, so you can't
blindly follow them when doing performance optimization; you can only use them
as hints or possible things to try: only in-situ measurements can validate a
hypothesis and prove a performance increase.
Which means that if you have to optimize performance, you have to use your
brain and your common sense, and produce reproducible test results!
With such a complicated setup, it can be difficult to measure the right
thing and there can be plenty of unwanted interactions.
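At the very least, a change should be timed over repeated runs rather than
once. Here is a minimal benchmark harness sketch (the benchmark helper and
the toy workload are hypothetical), where the median and the spread make a
lucky outlier harder to mistake for a real improvement:

    import statistics
    import time

    def benchmark(fn, runs=10, warmup=2):
        """Time fn() several times and report the median and the spread,
        so that a one-off outlier doesn't pass for a real improvement."""
        for _ in range(warmup):          # let caches settle first
            fn()
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        return statistics.median(samples), statistics.stdev(samples)

    # toy workload standing in for the real operation being tuned
    median, spread = benchmark(lambda: sum(range(1_000_000)))
    print(f"median {median * 1e3:.1f} ms, stdev {spread * 1e3:.1f} ms")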
On the other hand, if you can't prove that a change makes a difference, you
are just over-complicating things for the sake of it, which means the system
will be more difficult to maintain and diagnose for no provable benefit.
If it does bring more performance, the benefits still have to be weighed
against the operational risks introduced by the added complexity.
Not only the amount of complexity but also where you add it matters: the
further down the stack you push complexity, the harder it is to change
things. A configuration option in an application is easier to apply (and
revert) than a change of software version.
If you depend on a specific software and hardware stack for your
system/application to work, you are tied down and have very limited ways to
make your solution evolve or adapt. This tends to create systems where every
change induces a lot of risk.
It is not necessarily a problem, especially if your system does not evolve
much, either in functionality or in scale, and the risks can be mitigated by
tests, but those costs must be clearly understood when the decision is made.
Usually a lot of optimization can be done at the highest level, i.e. the user
side, with limited risks and effort. However, it can't be achieved if you
don't understand what you're doing, nor if you don't understand what the
client applications and users are doing.
Instead of shooting in the dark by applying random recipes, talking to the
users to get the Big Picture can help make the system more in sync with the
actual needs, and will let you identify which paths can be explored.
Sometimes it can be as easy as spreading the load over the course of a
day/week/month instead of having everyone doing their queries at the same
time.
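One low-tech way to spread the load, as a sketch (the run_with_jitter helper,
report_job and the one-hour window are hypothetical), is to add a random
delay in front of scheduled jobs so that identical cron entries on many hosts
don't all hit the database at the same instant:

    import random
    import time

    def run_with_jitter(job, window_seconds=3600):
        """Delay a scheduled job by a random amount within the window, so
        that identical cron entries on many hosts don't all fire against
        the database at the same instant."""
        time.sleep(random.uniform(0, window_seconds))
        job()

    def report_job():
        # hypothetical nightly report query
        print("running the nightly report")

    run_with_jitter(report_job)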
Also, provided the DB is not used by a black-box system, there are different
things that can be done either on the DB side or at the application level:
- pruning the tables
- optimizing the schema/indexes/queries
- queuing/asynchronous queries
- splitting/sharding the tables (see the sketch after this list)
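As a sketch of the last item (the shard_for helper, the orders_N table names
and the shard count are all hypothetical), splitting can be as simple as a
stable hash of a key deciding which table a row lives in, so that each table
stays small enough to index and prune efficiently:

    import hashlib

    N_SHARDS = 4  # hypothetical shard count

    def shard_for(customer_id: str) -> str:
        """Map a customer to one of N_SHARDS tables (orders_0..orders_3).
        A stable hash keeps the mapping deterministic across processes,
        unlike Python's built-in hash(), which is salted per process."""
        digest = hashlib.md5(customer_id.encode()).hexdigest()
        return f"orders_{int(digest, 16) % N_SHARDS}"

    print(shard_for("customer-42"))  # always lands in the same table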
From my experience, the need to do performance tuning usually comes from a
bad design and the inability to scale horizontally. The IT industry has been
relying too much on the ability to do vertical scaling, and unfortunately, it
seems that apart from the big web players, only a few companies have realized
that vertical scaling is barely an option now.
The communication problem
I believe the communication issue and the lack of trust to be the most
important problems in that story.
It is obvious that the company has an "Us vs Them" syndrome between DBAs and
Ops, indicating a big silo problem, and I doubt this problem is limited to
the Ops/DBA interaction; it probably extends to the interactions between
other teams as well.
I think the Ops person and his boss did the wrong thing there by sweeping
things under the rug. I believe it was only pride that made the Ops guy
behave the way he did, and I think it will only create more problems for him
in the future.
Maybe a better approach would have been to show the DBA that there was a
better way to do things at the system level, to build a trust relationship
with him, and to encourage communication with the DB users.
Pouring oil on the fire is not going to stop it from spreading.