Following the paper on block size, I decided to write something more on the Linux I/O schedulers and their interaction with Oracle.
This paper describes a series of tests stressing Oracle with a TPC-C workload while the database relies on different Linux I/O schedulers.
The purpose of the I/O scheduler is to sort and merge the I/O requests in the I/O queues in order to increase efficiency and boost performance.
Using the /sys pseudo file system, you can change and tune the I/O scheduler for a given block device.
Each scheduler exposes a different directory tree of tuning options.
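For example, for the deadline scheduler on /dev/sdb (the device name is just an example), listing the tunables gives something like this:

ls /sys/block/sdb/queue/iosched/
fifo_batch  front_merges  read_expire  write_expire  writes_starved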
There are four schedulers available at the moment: noop, anticipatory (as), deadline, and cfq.
The command below tells you which scheduler you are using.
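A sketch, assuming /dev/sdb is the device of interest:

cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

The scheduler shown in square brackets is the one currently active.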
On newer kernels you can change the scheduler without a reboot by simply issuing a command like the one below.
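For example, to switch /dev/sdb to the deadline scheduler (the same command appears again later in the setup):

echo deadline > /sys/block/sdb/queue/scheduler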
The testing software:
The chosen tool is hammerora, which generates a TPC-C workload trying to "hammer" Oracle as much as possible. Definitely a good stress test.
With the latest version (1.26) I had scalability problems: the number of transactions per minute (tpm) was low and I noticed lots of 'read by other session' wait events in my DB.
Investigating further, I saw that the ITEM table (used by hammerora) was growing and a lot of table scans were being performed on it.
I simply created an index with this DDL:
CREATE INDEX TPCC.ITEM_ID
ON TPCC.ITEM (I_ID)
INITRANS 255 MAXTRANS 255
TABLESPACE USERS
PCTFREE 60;
And the problem disappeared.
I also increased INITRANS to 255 on every index and table, trying to increase the concurrency.
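The statements were along these lines (CUSTOMER is just one example, the same was done for every table and index of the TPC-C schema; note that a changed INITRANS only applies to blocks formatted after the change):

ALTER TABLE TPCC.CUSTOMER INITRANS 255;
ALTER INDEX TPCC.CUSTOMER_PK INITRANS 255;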
The difference was a hundredfold increase in tpm.
The virtual users for the initial tests were 10.
My DB:
Oracle 10.2.0.2 (10g Release 2 with the first patchset).
SQL> show sga
Total System Global Area 838860800 bytes
Fixed Size 1263572 bytes
Variable Size 83888172 bytes
Database Buffers 746586112 bytes
Redo Buffers 7122944 bytes
SQL> show parameter sga_target
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
sga_target big integer 800M
SQL> show parameter pga_aggregate_target
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 103M
Asynchronous I/O is enabled while direct I/O is disabled (to make sure the features of the I/O scheduler are actually used).
I configured AWR to take a snapshot every 10 minutes.
I'm going to measure the results using the reports created with AWR (similar to the old statspack).
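For reference, this is roughly how it is done from SQL*Plus as a privileged user (the 10-minute interval matches the setup above):

EXEC DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 10);
@?/rdbms/admin/awrrpt.sql

The first call sets the snapshot interval, the second runs the interactive script that generates the report between two snapshots.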
Disk Layout:
For the first test all the database files are on the same disk: sdb.
They are split across two ReiserFS file systems: one for the datafiles (4 KB block size) and one for the redo logs (512 bytes).
Hardware:
IBM x335
2 Xeon(TM) 2.00 GHz CPUs
1.5 GB RAM
6 disks of 36 GB in three RAID 1 arrays (/dev/sda, /dev/sdb, /dev/sdc)
Operating system:
SUSE Linux Enterprise Server 10 beta8.
I chose this version since it is going to be certified with Oracle soon and because it is the first SUSE Enterprise release where the I/O scheduler can be changed on the fly.
This last characteristic is really important.
With a simple command like:
echo deadline > /sys/block/sdb/queue/scheduler
the scheduler is changed.
On older SUSE versions like SLES9 the I/O scheduler can be changed at boot time with the parameter elevator=[name of the scheduler] where the name can be: noop, deadline, as, cfq.
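For example, a kernel line in the GRUB menu.lst might look like this (the kernel image and root device are just placeholders):

kernel /boot/vmlinuz root=/dev/sda1 elevator=deadline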
Unfortunately with this method you have one scheduler for all the block devices of the system.
It is not possible to combine multiple I/O schedulers, so the tuning capabilities are limited.
Testing methodology:
With 10 virtual users a constant workload is kept on the database.
After 30 minutes the scheduler is changed. The default scheduler parameters are kept in place.
After three cycles of all I/O schedulers the AWR snapshots are used to generate reports and to compare them.
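The switching itself can be scripted; a minimal sketch, assuming the database files are on /dev/sdb and the hammerora workload is already running:

# one cycle through the four schedulers, 30 minutes each (repeated three times)
for sched in noop anticipatory deadline cfq; do
    echo $sched > /sys/block/sdb/queue/scheduler
    sleep 1800
done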
First results:
For each scheduler you can see an AWR report by following the links:
scheduler    | transactions per second | log file sync % | user calls | physical reads | physical writes
noop         | 74.61                   | 3.2             | 350.39     | 153.69         | 117.57
anticipatory | 30.12                   | 44.4            | 140.62     | 67.31          | 53.94
deadline     | 77.74                   | 3.1             | 362.25     | 151.96         | 118.71
cfq          | 23.13                   | 36.8            | 107.97     | 51.58          | 40.22
The winner is the deadline scheduler. It is interesting to see that cfq and anticipatory have the lowest number of transactions per second (about 23 and 30 against more than 70 for deadline and noop).
This is probably due to the high 'log file sync' waits of cfq and anticipatory: they are the clear losers on redo log file writes!
This is worrying since cfq is the default scheduler of the SLES distribution (and Red Hat AS).
If you are going to implement an OLTP system, it is better to test your application with different schedulers. Maybe the default is not right for you.
Deadline seems to be the best scheduler for this kind of workload, but it wins only narrowly against noop.
It would be interesting to separate the redo logs from the datafiles on different block devices, set the deadline scheduler on the redo log device, and then retest while switching the scheduler only on the datafile device (you can set a different scheduler for each block device).
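A sketch of what that would look like, assuming for this second test that the redo logs are moved to /dev/sdc while the datafiles stay on /dev/sdb (the device assignment is hypothetical):

echo deadline > /sys/block/sdc/queue/scheduler   # redo log device stays on deadline
echo cfq > /sys/block/sdb/queue/scheduler        # datafile device: this is the one being switched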
This second test is going to be performed here.
Contact information:
fabrizio.magni _at_ gmail.com