Heretical: Oracle Blocksize on Linux

I know that by posting this I'm risking public mockery, but I believe that one of the common rules about choosing the DB block size is something of an Oracle myth.

In particular, it is common practice to choose a DB block size identical to the filesystem block size when not using direct I/O.

I'm going to discuss I/O on Linux systems, but I won't extend my arguments to other Unix flavours since I don't know their I/O subsystems well enough.

Steve Adams wrote an article, http://www.ixora.com.au/tips/buffered_block_size.htm,
where he gives two reasons for avoiding block sizes larger than the operating system buffer size:

- the I/O is split into several calls, each the size of the OS buffer;
- readahead can be triggered incorrectly.

Howard J. Rogers (in http://www.dizwell.com/html/db_block_size.html) writes of I/O being split into units of the "file system buffer size".

Please read the articles; I don't wish to quote them verbatim and I'm only paraphrasing here.
Both authors deserve far better than my mangling of their elegant sentences (yes, I really like how they write, HJR in particular).

First of all, I'd like to demonstrate that the Linux I/O scheduler(s) can merge I/O requests, and that even on databases under different kinds of load a DB block size that is a multiple of the filesystem block size doesn't cause performance issues (on the contrary: it can help batch and bulk operations).

Let's start with a simple example using dd and iostat.

The methodology:
I'm using a spare device not touched by any application.
I start
iostat -x <devicename> 1
and check that I obtain lines of zeroes (nothing is writing to the disk) at 1-second intervals.
(In the output below, wrqm/s counts write requests merged per second, w/s the writes actually issued to the device, and avgrq-sz the average request size in 512-byte sectors.)

Now for the tests:

dd if=/dev/zero of=/u02/foo bs=4k count=100
100+0 records in
100+0 records out

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
ida/c0d1     0.00  94.06  0.00  4.95    0.00  792.08     0.00   396.04   160.00     0.08   15.60   3.60   1.78

iostat shows that out of 100 writes of 4k (the dd block size) I didn't get 100 writes per second (w/s) but only 4.95. Why?
The other writes have been merged (wrqm/s) into those 4.95 requests: 94.06 merged plus 4.95 issued is roughly the 100 original writes, and avgrq-sz = 160 sectors means about 80 kB per issued request.

Now I bind the same device to /dev/raw/raw1 (using the raw(8) command) and test again:

breonldblc03:~ # dd if=/dev/zero of=/dev/raw/raw1 bs=8k count=100
100+0 records in
100+0 records out

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
ida/c0d1     0.00   0.00  0.00 100.00    0.00 1600.00     0.00   800.00    16.00     0.41    4.08   4.08  40.80

I doubled the block size and used the raw device, but the results are quite different:
100 writes were performed and no merging happened. Why?
We are using direct I/O and the writes are synchronous.
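To make explicit what's happening, here is a minimal C sketch of this kind of unmergeable I/O (the device path and sizes are just examples, and a device bound with raw(8) behaves this way even without the O_DIRECT flag): every write() is sent to the device as its own synchronous request, so the scheduler has nothing to merge.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char *buf;
        int i, fd = open("/dev/raw/raw1", O_WRONLY | O_SYNC | O_DIRECT);

        if (fd < 0) { perror("open"); return 1; }

        /* Direct I/O requires a properly aligned buffer */
        if (posix_memalign((void **)&buf, 4096, 8192) != 0) return 1;
        memset(buf, 0, 8192);

        /* 100 synchronous 8k writes: each one reaches the device as a
         * single request, exactly what iostat showed above
         * (w/s = 100, wrqm/s = 0) */
        for (i = 0; i < 100; i++)
                if (write(fd, buf, 8192) < 0) { perror("write"); break; }

        free(buf);
        close(fd);
        return 0;
}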

If the theory of the articles quoted above is correct, then an 8k block size dd on raw (direct I/O) should outperform an 8k block size dd on the filesystem (the latter should be split into I/O calls of 4k, the filesystem block size, producing noticeable overhead).

Here are the results (two iostat lines are reported per test for comparison):

time dd if=/dev/zero of=/dev/raw/raw3 bs=8k count=10000
10000+0 records in
10000+0 records out

real    0m41.063s
user    0m0.016s
sys     0m0.412s

avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.75    0.00    2.01   23.31   72.93

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
ida/c0d1     0.00   0.00  0.00 239.22    0.00 3827.45     0.00  1913.73    16.00     0.97    4.07   4.07  97.35

avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.00    0.00    1.75   24.00   72.25

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
ida/c0d1     0.00   0.00  0.00 246.46    0.00 3943.43     0.00  1971.72    16.00     0.99    4.02   4.02  99.19

time dd if=/dev/zero of=/u02/foo bs=8k count=10000
10000+0 records in
10000+0 records out

real    0m2.324s
user    0m0.005s
sys     0m0.536s

avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.00    0.00   12.75   12.00   73.25

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
ida/c0d1     0.00 10020.00  0.00 234.00    0.00 83224.00     0.00 41612.00   355.66    72.44  177.41   2.21  51.60

avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.24    0.00    2.49   47.13   48.13

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
ida/c0d1     0.00 9277.00  0.00 328.00    0.00 76776.00     0.00 38388.00   234.07   142.87  383.93   3.05 100.10

The second, which according to the theory should be the slower one, is almost 20 times faster (2.3s vs 41s).
The reason: the number of I/O calls per second is almost the same, but on the filesystem each write averaged about 117 kB (wkB/s divided by w/s: 38388/328) against the fixed 8 kB writes of the raw device.

You could object that dd is not an Oracle DB: a DB opens its files with O_SYNC and writes with pwrite().
So I wrote a simple C program.

It takes as a parameter the number of buffer pages to write/read in every I/O call.
So if you run
./rw 2
it will write blocks of twice the size of a system buffer page (on x86 the page size is 4k).

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
        int nr_pages = 1;
        off_t file_offset;
        int len, pagesize = getpagesize();
        char *message;

        /* Number of pages per I/O call, from the command line (default 1) */
        if (argc > 1 && atoi(argv[1]) != 0)
                nr_pages = atoi(argv[1]);

        int fd_i = open("/dev/zero", O_RDWR|O_SYNC);
        if (fd_i < 0) {
                printf("Unable to open input file, errno is %d.\n", errno);
                return 1;
        }

        /* The output file must already exist */
        int fd_o = open("/u02/app/oracle/foo", O_RDWR|O_SYNC);
        if (fd_o < 0) {
                printf("Unable to open output file, errno is %d.\n", errno);
                return 1;
        }

        printf("Pagesize: %d\n", pagesize);

        /* Page-aligned buffer, big enough for nr_pages pages */
        if (posix_memalign((void **)&message, pagesize,
                           pagesize * nr_pages) != 0) {
                perror("posix_memalign");
                return 1;
        }
        memset(message, 0xff, pagesize * nr_pages);
        printf("%p\n", (void *)message);

        /* 100 synchronous I/O calls of nr_pages * pagesize bytes each */
        for (file_offset = 0; file_offset < (off_t)pagesize * nr_pages * 100;
             file_offset += pagesize * nr_pages) {

                printf("Offset: %ld, have to reach: %ld\n",
                       (long)file_offset, (long)pagesize * nr_pages * 100);

                if ((len = read(fd_i, message, pagesize * nr_pages)) < 0)
                        perror("read");
                else
                        printf("%d bytes read.\n", len);

                if ((len = pwrite(fd_o, message, pagesize * nr_pages,
                                  file_offset)) < 0)
                        perror("pwrite");
                else
                        printf("%d bytes written.\n", len);
        }

        free(message);
        close(fd_i);
        close(fd_o);

        return 0;
}

I compiled it (gcc -o rw rw.c) and ran it with different arguments (avoid 0), reporting the results:

NR_PAGES = 1 (total: 409600 bytes)

real    0m0.270s
user    0m0.027s
sys     0m0.151s

NR_PAGES = 2 (total: 819200 bytes)

real    0m0.262s
user    0m0.028s
sys     0m0.158s

NR_PAGES = 4 (total: 1638400 bytes)

real    0m0.381s
user    0m0.037s
sys     0m0.225s

NR_PAGES = 8 (total: 3276800 bytes)

real    0m0.515s
user    0m0.048s
sys     0m0.335s

NR_PAGES = 16 (total: 6553600 bytes)

real    0m0.650s
user    0m0.047s
sys     0m0.430s

Time increases, but the total bytes written also doubles at every run.
So for nearly three times the execution time (real: 0.650s vs 0.270s) you get 16 times the bytes written.

A typical iostat line:

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb          0.00 437.37  0.00 65.66    0.00 4072.73     0.00  2036.36    62.03     0.12    1.85   1.85  12.12

Linux is still merging: about 437 writes per second were merged (wrqm/s) into only 66 issued writes (w/s), with an average request size of 62 sectors (about 31 kB).

But again this is not a database.

I took one server and created several DBs: TEST2K, TEST4K, TEST8K, TEST16K and TEST32K, and I'm running several benchmarks on them (product: Benchmark Factory).

The machine is a DL585 with 4 CPUs running SLES9 x86-64 (kernel 2.6.5-7.97-smp #1 SMP) and Oracle 10.1.0.3 for x86-64.
The I/O scheduler is CFQ, the filesystem is ext3 with a 4k block size (mounted with default parameters), and the database is on a dedicated device separate from the operating system and the Oracle binaries.
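(If you want to double-check this kind of setup programmatically, here is a small sketch; the device name sdb and the mount point /u02 are examples, not necessarily the ones used in this benchmark.)

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
        struct statvfs vfs;
        char line[256];

        /* The active I/O scheduler is shown in brackets, e.g. "noop ... [cfq]" */
        FILE *f = fopen("/sys/block/sdb/queue/scheduler", "r");
        if (f != NULL) {
                if (fgets(line, sizeof(line), f))
                        printf("schedulers: %s", line);
                fclose(f);
        }

        /* Filesystem block size as reported by the kernel (4096 for this ext3) */
        if (statvfs("/u02", &vfs) == 0)
                printf("filesystem block size: %lu bytes\n",
                       (unsigned long)vfs.f_bsize);
        return 0;
}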

The tablespace TPC, where the benchmark objects are stored, is on a different partition, just to get cleaner iostat results.

The DBs were created as concurrently as possible, to ensure a low degree of block contiguity.

For every run the machine is restarted and only one database is started.

At the moment I can only give you the TPC-C results of TEST4K against TEST8K, TEST16K and TEST32K.

The test ran for 52 minutes on each database:

TEST4K:

TPS: 24.82
kBPS: 50.409
Total executions: 5584
Total Rows: 67179
Total kBytes: 11338.879
Average response time: 0.034
Maximum response time: 0.464

TEST8K:

TPS: 24.72
kBPS: 50.090
Total executions: 5564
Total Rows: 66809
Total kBytes: 11272.059
Average response time: 0.036
Maximum response time: 0.246

TEST16K:

TPS: 24.46
kBPS: 50.017
Total executions: 5505
Total Rows: 66770
Total kBytes: 11255.488
Average response time: 0.038
Maximum response time: 0.283

TEST32K:

TPS: 24.20
kBPS: 49.426
Total executions: 5444
Total Rows: 65975
Total kBytes: 11118.851
Average response time: 0.040
Maximum response time: 0.227

TEST4K performed slightly better than TEST32K... but it didn't beat TEST32K by as much as it should have (each test had been run only once at the time of this post).

TPC-B

TEST2K:

TPS: 40.66
kBPS: 0.325
Total executions: 9150
Total Rows: 9150
Total kBytes: 73.200
Average response time: 0.018
Maximum response time: 0.102

TEST4K:

TPS: 41.20
kBPS: 0.330
Total executions: 9271
Total Rows: 9271
Total kBytes: 74.168
Average response time: 0.017
Maximum response time: 0.052

TEST8K:

TPS: 40.67
kBPS: 0.325
Total executions: 9150
Total Rows: 9150
Total kBytes: 73.200
Average response time: 0.018
Maximum response time: 0.057

TEST32K:

TPS: 40.67
kBPS: 0.325
Total executions: 9150
Total Rows: 9150
Total kBytes: 73.200
Average response time: 0.018
Maximum response time: 0.057

In the next few days I'm going to run TPC-B on all the databases to stress the I/O under a batch workload.

And what about readahead?
I can only suggest this good article from the 2004 Linux Symposium:
http://www.linuxsymposium.org/proceedings/reprints/Reprint-Pai-OLS2004.pdf
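The paper describes the kernel side; from user space an application can at least hint the readahead logic with posix_fadvise(2). A minimal sketch (the file name is just an example):

#define _XOPEN_SOURCE 600       /* for posix_fadvise */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/u02/foo", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Declare a sequential access pattern: the kernel may enlarge
         * the readahead window for this file... */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        /* ...or declare random access, which turns readahead off:
         * posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); */

        close(fd);
        return 0;
}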

Finally, some export/import timings (a batch-type workload stressing bulk I/O):

time exp system/******@rmprod file=olap_dfi.dmp log=olap_dfi_exp.log owner=olap_dfi statistics=none

real    18m31.340s
user    3m25.312s
sys     0m34.573s

time imp system/test32k@test32k file=olap_dfi.dmp log=olap_dfi_imp32.log fromuser=olap_dfi touser=tpc ignore=y statistics=none

real    8m6.429s
user    2m10.214s
sys     0m24.037s

time imp system/test16k@test16k file=olap_dfi.dmp log=olap_dfi_imp16k.log fromuser=olap_dfi touser=tpc ignore=y statistics=none grants=none

real    9m21.775s
user    2m11.890s
sys     0m25.383s

time imp system/test4k@test4k file=olap_dfi.dmp log=olap_dfi_imp4k.log fromuser=olap_dfi touser=tpc ignore=y statistics=none grants=none

real    13m52.711s
user    2m11.094s
sys     0m25.023s

Second execution of the same command:

real    12m22.458s
user    2m10.727s
sys     0m24.392s

200MB table import and select (no indexes):

time imp tpc/tpc@test32k file=table.dmp fromuser=moses touser=tpc ignore=y grants=n

real    0m14.212s
user    0m3.600s
sys     0m0.752s

time echo "select VERSION_LIABILITY from AST_FIXED_BOOK_RETURN;"|sqlplus tpc/tpc@test32k > /dev/null

real    0m12.485s
user    0m7.793s
sys     0m1.721s

500MB table import and select (no indexes):

time imp tpc/tpc@test32k file=table.dmp fromuser=moses touser=tpc ignore=y grants=n

real    0m55.348s
user    0m17.205s
sys     0m3.263s

time echo "select VERSION_LIABILITY from AST_LIAB_DATA_PROD;"|sqlplus tpc/tpc@test32k > /dev/null

real    0m13.584s
user    0m8.466s
sys     0m1.846s

200MB table import and select (no indexes):

time imp tpc/tpc@test4k file=table.dmp fromuser=moses touser=tpc ignore=y grants=n

real    0m20.408s
user    0m3.700s
sys     0m0.683s

time echo "select VERSION_LIABILITY from AST_FIXED_BOOK_RETURN;"|sqlplus tpc/tpc@test4k > /dev/null

real    0m12.531s
user    0m7.795s
sys     0m1.680s

500MB table import and select (no indexes):

time imp tpc/tpc@test4k file=table.dmp fromuser=moses touser=tpc ignore=y grants=n

real    1m49.915s
user    0m17.113s
sys     0m3.280s

time echo "select VERSION_LIABILITY from AST_LIAB_DATA_PROD;"|sqlplus tpc/tpc@test4k > /dev/null

real    0m13.425s
user    0m8.530s
sys     0m1.774s

------------------------

time echo "select VERSION_LIABILITY from AST_LIAB_DATA_PROD;"|sqlplus tpc/tpc@test32k > /dev/null

real    0m14.067s
user    0m8.431s
sys     0m1.704s

time echo "select VERSION_LIABILITY from AST_LIAB_DATA_PROD;"|sqlplus tpc/tpc@test4k > /dev/null

real    0m14.501s
user    0m8.235s
sys     0m1.846s


Contact information:
fabrizio.magni _at_ gmail.com