
HP DL380 G5 drive write cache (BBWC)

Friday, June 19th, 2009

In one of my previous articles I wrote about the tools I use for benchmarking database performance, and especially for discovering system bottlenecks. In this article I will show you how large an impact a drive write cache can have on filesystem performance.

HP DL360 G4 server

HP and its line of DL servers are highly respected among IT engineers. In my experience they are reliable, well built and easy to maintain, and they come with excellent server management software called Integrated Lights-Out (iLO). The DL storage subsystem (SCSI or SAS) is usually driven by Smart Array controllers, which may or may not be integrated onto the motherboard. A fresh HP customer gets all the nifty RAID levels to play with and is usually satisfied. But what a fresh customer usually DOES NOT KNOW is that Smart Array write performance really sucks if the write cache is not enabled. And to enable it, the customer needs to install a special module with an attached battery, called BBWC (Battery-Backed Write Cache). So, HP, if you happen to be reading this, please do tell your customers about such things.

HP DL380 G5 server

This is the other server on which I ran the benchmark. Unlike the DL360 G4, it already came with BBWC installed. To simulate the absence of a write cache, I disabled it with the command-line tool hpacucli (HP Array Configuration Utility Command Line Interface).

Benchmark methodology

For this benchmark I used two tools from the sql-bench suite that heavily stress the filesystem with lots of file creations and deletions: test-alter-table and test-create. The former is faster and only gives a rough figure. The latter creates and deletes around 50,000 MyISAM tables; since each MyISAM table consists of three files (.frm, .MYD and .MYI), that results in roughly 150,000 files created and deleted. I ran both benchmarks first with the write cache disabled and then with it enabled.
One of the machines (the DL380) was already in production, but it was benchmarked at night, when usage is negligible.
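sql-bench itself needs a running MySQL server. As a rough, hypothetical stand-in for the same kind of file churn, the Python sketch below creates and then deletes three small files per "table" (named after MyISAM's .frm, .MYD and .MYI files). The function name `churn` and the choice to fsync every file are my assumptions: forcing each write to stable storage is meant to approximate synchronous metadata writes, and MySQL may not sync its files the same way, so absolute numbers will not match sql-bench. Running it on the same volume with the controller cache disabled and then enabled might still show a similar trend.

```python
import os
import tempfile
import time

# Hypothetical: mimic the three files MyISAM keeps for every table.
EXTENSIONS = (".frm", ".MYD", ".MYI")


def churn(directory: str, tables: int, sync: bool = True) -> float:
    """Create, then delete, tables * 3 small files in `directory`.

    Returns elapsed wall-clock seconds. With sync=True each file is
    fsync'ed (an assumption approximating synchronous writes), so every
    create has to reach the storage layer, where a write cache matters.
    """
    start = time.perf_counter()
    for i in range(tables):
        for ext in EXTENSIONS:
            path = os.path.join(directory, f"t{i}{ext}")
            with open(path, "wb") as f:
                f.write(b"\0" * 1024)
                f.flush()
                if sync:
                    os.fsync(f.fileno())
    for i in range(tables):
        for ext in EXTENSIONS:
            os.remove(os.path.join(directory, f"t{i}{ext}"))
    if sync:
        # Flush the directory entries too (works on POSIX systems).
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(f"{churn(d, 200):.2f} s for 200 tables (600 files)")
```

Point `directory` at a path on the array under test rather than a temporary directory, which may live on a different device or in RAM.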

Test systems

HP DL360 G4

  • Controller: Smart Array 6i
  • Filesystem: ufs
  • OS: FreeBSD 6.0
  • MySQL: mysql-5.0.41-freebsd6.0-i386

HP DL380 G5

  • Controller: Smart Array P400i
  • Filesystem: ext3
  • OS: Slackware 12.2
  • MySQL: mysql-5.0.77 compiled from source

Results on HP DL380 G5

Drive write cache DISABLED

Testing of ALTER TABLE
Time for insert (1000) 0 wallclock secs ( 0.02 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.02 CPU)
Time for alter_table_add (100): 17 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for create_index (8): 2 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for drop_index (8): 2 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for alter_table_drop (91): 16 wallclock secs ( 0.01 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.01 CPU)
Total time: 37 wallclock secs ( 0.03 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.03 CPU)

Testing the speed of creating and dropping tables
Time for create_MANY_tables (10000): 253 wallclock secs ( 0.26 usr 0.06 sys + 0.00 cusr 0.00 csys = 0.32 CPU)
Time to select_group_when_MANY_tables (10000): 1 wallclock secs ( 0.09 usr 0.07 sys + 0.00 cusr 0.00 csys = 0.16 CPU)
Time for drop_table_when_MANY_tables (10000): 1 wallclock secs ( 0.09 usr 0.03 sys + 0.00 cusr 0.00 csys = 0.12 CPU)
Time for create+drop (10000): 259 wallclock secs ( 0.24 usr 0.18 sys + 0.00 cusr 0.00 csys = 0.42 CPU)
Time for create_key+drop (10000): 255 wallclock secs ( 0.41 usr 0.11 sys + 0.00 cusr 0.00 csys = 0.52 CPU)
Total time: 769 wallclock secs ( 1.09 usr 0.45 sys + 0.00 cusr 0.00 csys = 1.54 CPU)

Drive write cache ENABLED

Testing of ALTER TABLE
Time for insert (1000) 0 wallclock secs ( 0.02 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.02 CPU)
Time for alter_table_add (100): 3 wallclock secs ( 0.01 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.02 CPU)
Time for create_index (8): 1 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for drop_index (8): 0 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for alter_table_drop (91): 4 wallclock secs ( 0.01 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.01 CPU)
Total time: 8 wallclock secs ( 0.04 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.05 CPU)

Testing the speed of creating and dropping tables
Time for create_MANY_tables (10000): 9 wallclock secs ( 0.34 usr 0.06 sys + 0.00 cusr 0.00 csys = 0.40 CPU)
Time to select_group_when_MANY_tables (10000): 1 wallclock secs ( 0.06 usr 0.03 sys + 0.00 cusr 0.00 csys = 0.09 CPU)
Time for drop_table_when_MANY_tables (10000): 1 wallclock secs ( 0.10 usr 0.04 sys + 0.00 cusr 0.00 csys = 0.14 CPU)
Time for create+drop (10000): 9 wallclock secs ( 0.44 usr 0.10 sys + 0.00 cusr 0.00 csys = 0.54 CPU)
Time for create_key+drop (10000): 10 wallclock secs ( 0.35 usr 0.10 sys + 0.00 cusr 0.00 csys = 0.45 CPU)
Total time: 30 wallclock secs ( 1.29 usr 0.33 sys + 0.00 cusr 0.00 csys = 1.62 CPU)

Drive write cache               DISABLED   ENABLED   Speedup (DISABLED time / ENABLED time)
sql-bench: test-alter-table     37 s       8 s       462%
sql-bench: test-create          769 s      30 s      2563%
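The percentages in this table are the ratio of the time without cache to the time with cache (37 s / 8 s ≈ 4.6, i.e. 462%), truncated to a whole percent, rather than a difference. The small Python helper below, a hypothetical name of my own, reproduces them from the totals in the logs above using integer arithmetic.

```python
def speedup_percent(seconds_disabled: int, seconds_enabled: int) -> int:
    """Ratio of two wall-clock times as a whole percentage (floored)."""
    return seconds_disabled * 100 // seconds_enabled


# DL380 G5 totals (cache disabled, cache enabled) from the sql-bench output.
dl380 = {
    "test-alter-table": (37, 8),
    "test-create": (769, 30),
}

for test, (off, on) in dl380.items():
    print(f"{test}: {speedup_percent(off, on)}% (about {off / on:.1f}x faster)")
```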

Results on HP DL360 G4

Drive write cache DISABLED

Testing of ALTER TABLE
Time for insert (1000) 0 wallclock secs ( 0.02 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.04 CPU)
Time for alter_table_add (100): 33 wallclock secs ( 0.01 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.02 CPU)
Time for create_index (8): 4 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for drop_index (8): 3 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for alter_table_drop (91): 33 wallclock secs ( 0.01 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.02 CPU)
Total time: 75 wallclock secs ( 0.04 usr 0.03 sys + 0.00 cusr 0.00 csys = 0.07 CPU)

Testing the speed of creating and dropping tables
Testing with 10000 tables and 10000 loop count
Time for create_MANY_tables (10000): 1035 wallclock secs ( 1.27 usr 0.25 sys + 0.00 cusr 0.00 csys = 1.52 CPU)
Time to select_group_when_MANY_tables (10000): 83 wallclock secs ( 0.63 usr 0.16 sys + 0.00 cusr 0.00 csys = 0.79 CPU)
Time for drop_table_when_MANY_tables (10000): 493 wallclock secs ( 0.50 usr 0.19 sys + 0.00 cusr 0.00 csys = 0.69 CPU)
Time for create+drop (10000): 958 wallclock secs ( 1.59 usr 0.38 sys + 0.00 cusr 0.00 csys = 1.97 CPU)
(NOTICE: I could not wait for the create_key+drop test to finish, because the machine had to go back into production.)
(I therefore assume it would take around 900 seconds, to be on the safe side; it would probably take more.)
Total time (estimated): around 3400 seconds
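The estimated total is simply the sum of the four phases that did finish plus the assumed 900 s for the interrupted create_key+drop phase. The sketch below does that arithmetic: it comes to 3,469 s, which is rounded to "around 3400 seconds" above. Because 900 s is described as a conservative guess, the real total (and therefore the speedup in the table below) is probably higher.

```python
# Wall-clock seconds for the test-create phases that completed on the
# DL360 G4 without write cache (copied from the log above).
measured = {
    "create_MANY_tables": 1035,
    "select_group_when_MANY_tables": 83,
    "drop_table_when_MANY_tables": 493,
    "create+drop": 958,
}

# create_key+drop was interrupted; 900 s is the article's own estimate,
# which it calls a safe-side (probably too low) figure.
ASSUMED_CREATE_KEY_DROP = 900

estimated_total = sum(measured.values()) + ASSUMED_CREATE_KEY_DROP
print(estimated_total)  # 3469
```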

Drive write cache ENABLED

Testing of ALTER TABLE
Time for insert (1000) 0 wallclock secs ( 0.02 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.02 CPU)
Time for alter_table_add (100): 8 wallclock secs ( 0.02 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.02 CPU)
Time for create_index (8): 1 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for drop_index (8): 1 wallclock secs ( 0.00 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.00 CPU)
Time for alter_table_drop (91): 8 wallclock secs ( 0.01 usr 0.01 sys + 0.00 cusr 0.00 csys = 0.02 CPU)
Total time: 18 wallclock secs ( 0.05 usr 0.02 sys + 0.00 cusr 0.00 csys = 0.06 CPU)

Testing the speed of creating and dropping tables
Time for create_MANY_tables (10000): 104 wallclock secs ( 1.07 usr 0.24 sys + 0.00 cusr 0.00 csys = 1.31 CPU)
Time to select_group_when_MANY_tables (10000): 27 wallclock secs ( 0.48 usr 0.18 sys + 0.00 cusr 0.00 csys = 0.66 CPU)
Time for drop_table_when_MANY_tables (10000): 53 wallclock secs ( 0.32 usr 0.09 sys + 0.00 cusr 0.00 csys = 0.41 CPU)
Time for create+drop (10000): 143 wallclock secs ( 1.31 usr 0.30 sys + 0.00 cusr 0.00 csys = 1.62 CPU)
Time for create_key+drop (10000): 164 wallclock secs ( 1.56 usr 0.36 sys + 0.00 cusr 0.00 csys = 1.92 CPU)
Total time: 491 wallclock secs ( 4.74 usr 1.18 sys + 0.00 cusr 0.00 csys = 5.92 CPU)

Drive write cache               ABSENT            ENABLED   Speedup (ABSENT time / ENABLED time)
sql-bench: test-alter-table     75 s              18 s      416%
sql-bench: test-create          ~3400 s (est.)    491 s     ~692%

Analysis

The test-alter-table results look reasonable: a speedup of slightly over 4x on both machines. What bothers me is the test-create difference. I expected the HP DL360 G4 to improve more and finish this test below the 100-second barrier; heck, I actually expected it below 50 seconds. It is true that this machine uses a different operating system and filesystem. But almost 500 seconds still seems like too much to me, especially when the HP DL380 G5 excels at 30 seconds. If anyone knows the answer, please drop it in the comments.
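One way to narrow down where the DL360 G4 falls behind is to compare the per-phase speedups of test-create on both machines. The sketch below (the helper `phase_speedups` is a hypothetical name) uses the per-phase times from the logs above and leaves out the DL360 G4's create_key+drop phase, since that run was interrupted. The file-creating phases on the DL380 G5 come out at roughly 25x to 29x, while those on the DL360 G4 land between about 7x and 10x. Phases that took 1 s on the DL380 G5 are at the resolution of the sql-bench output, so their 1.0x ratios say nothing. This only localizes the gap; it does not explain it.

```python
# Per-phase wall-clock seconds for test-create: (without cache, with cache),
# copied from the sql-bench logs in this article.
PHASES = {
    "DL380 G5": {
        "create_MANY_tables": (253, 9),
        "select_group_when_MANY_tables": (1, 1),
        "drop_table_when_MANY_tables": (1, 1),
        "create+drop": (259, 9),
        "create_key+drop": (255, 10),
    },
    "DL360 G4": {
        "create_MANY_tables": (1035, 104),
        "select_group_when_MANY_tables": (83, 27),
        "drop_table_when_MANY_tables": (493, 53),
        "create+drop": (958, 143),
    },
}


def phase_speedups(phases: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Return {phase: time_without / time_with}, rounded to one decimal."""
    return {name: round(off / on, 1) for name, (off, on) in phases.items()}


for server, phases in PHASES.items():
    print(server)
    for name, ratio in phase_speedups(phases).items():
        print(f"  {name:<32}{ratio:>6}x")
```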

Conclusion

I believe this article clearly shows why one must run even such synthetic tests before deploying systems to a production environment, and even before running “real-world benchmarks”. By “real-world benchmark” I mean a comparative benchmark of a given application on the existing production systems and on the ones being tested. Hardware often goes without an upgrade for quite some time, which means the new hardware is a few generations newer than the existing one. The new hardware is far more powerful, so when a “real-world benchmark” shows some improvement, one easily misses a not-so-innocent bottleneck. Thus, I believe, newer systems MUST perform better than older ones in every synthetic benchmark (if the systems are comparable, of course), and only then should we start conducting “real-world benchmarks”.
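The rule "a newer system must beat the older one in every synthetic benchmark" can be turned into a simple gate. The sketch below is an illustration only: the helper `regressions` is hypothetical, and it reuses this article's totals, pitting the older DL360 G4 with BBWC against the newer DL380 G5 with its cache disabled. The two run different operating systems and filesystems, so they are not strictly comparable; the point is that the gate flags exactly the kind of hidden bottleneck described above.

```python
def regressions(old: dict[str, float], new: dict[str, float]) -> list[str]:
    """Names of benchmarks where the new system is not faster than the old one.

    Times are wall-clock seconds, so lower is better; benchmarks missing
    from `new` are ignored.
    """
    return sorted(
        name for name, t_old in old.items() if name in new and new[name] >= t_old
    )


# Totals from this article, used purely as an illustration.
older_with_cache = {"test-alter-table": 18, "test-create": 491}
newer_without_cache = {"test-alter-table": 37, "test-create": 769}
newer_with_cache = {"test-alter-table": 8, "test-create": 30}

print(regressions(older_with_cache, newer_without_cache))  # both benchmarks flagged
print(regressions(older_with_cache, newer_with_cache))     # nothing flagged
```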

How did I discover this “issue”?

It happened to me back in 2004: I deployed such an HP server to a colocation facility and only later discovered that it performed worse than an old test machine lying under my desk. After a couple of hours of googling I assumed that the missing BBWC was our problem. I had to order it and then pull the server out of colocation, because I also wanted to upgrade all the firmware, just in case. On top of that, I still had to figure out how to install ‘hpacucli’ on a non-Red Hat Linux distribution. After a long weekend the machine was back in production and never caused a single problem again.