2011-11-28

bzip2 twice?


To check the performance of RMAN backups I recently started to trace them a little. As most of the time was spent neither in reading from disk nor in any write-to-media-manager-library event, it had to be on CPU. It's good to know the CPUs are put to use, but since I still wanted to know what was going on, I had to dig deeper. CPU cycles are not just a magic black box where we put in a problem and the answer comes out after some time. At one abstraction layer it's a chain of functions, each called by another, and only the last one is actually doing the work. There is not much information in that fact per se, but developers are humans too, and they give the functions they write meaningful names.


So I just had to find those names (and where most of the time is spent) to figure out what's going on. To save myself some time I remembered Tanel Poder's Advanced Oracle Troubleshooting Guide, Part 9 – Process stack profiling from sqlplus using OStackProf, where he describes his tool ostackprof. It did all the work for me; I just had to find an RMAN session.
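In essence ostackprof takes a series of short stack snapshots of the target server process and reports which stacks occur most often (the OS PID of an RMAN channel's server process can be found by joining v$session and v$process). Just to illustrate the idea - this is not Tanel's script - here is a minimal Python sketch, assuming a Linux box with the gdb-based pstack in the PATH:

#!/usr/bin/env python
# Minimal stack-sampling sketch: repeatedly capture the call stack of an OS
# process with pstack and count which stack shows up most often.
# Assumes the gdb-based pstack found on Linux; Solaris pstack output differs.
import subprocess
import sys
import time
from collections import Counter

def sample_stack(pid):
    """Run pstack once and return the stack as one 'f1()->f2()->...' string."""
    out = subprocess.run(["pstack", str(pid)],
                         capture_output=True, text=True).stdout
    frames = []
    for line in out.splitlines():
        # gdb-style frames look like: '#0  0x... in kgccgmtf () from ...'
        parts = line.split()
        if parts and parts[0].startswith("#") and "in" in parts:
            frames.append(parts[parts.index("in") + 1] + "()")
    frames.reverse()                      # caller first, like ostackprof output
    return "->".join(frames)

def profile(pid, samples=100, interval=0.1):
    counts = Counter()
    for _ in range(samples):
        counts[sample_stack(pid)] += 1
        time.sleep(interval)
    for stack, hits in counts.most_common(5):
        print("%3d%%  %s" % (100 * hits // samples, stack))

if __name__ == "__main__":
    profile(int(sys.argv[1]))             # PID of the RMAN channel process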

Here's the shortstack where most of the time was spent:
(This backup was done with COMPRESSION ALGORITHM ‘BASIC’)
->__libc_start_main()->main()->ssthrdmain()->opimai_real()->sou2o()->opidrv()->opiodr()->opiino()->opitsk()->ttcpip()->opiodr()->kporpc()->kkxrpc()->prient()->prient2()->pricbr()->pricar()->plsql_run()->pfrrun()->pfrrun_no_tool()->pfrinstr_ICAL()->pevm_icd_call_common()->krbibpc()->krbbpc()->krbb3crw()->krbbcdo()->kgccdo()->kgccbz2pseudodo()->kgccbz2do()->kgccm()->kgccbuf()->kgccgmtf()->__sighandler()->->

Oracle does not publicly document its function naming conventions, but for several reasons I'm quite sure functions starting with krb are related to backup, whereas kgcc is used for compression. Especially the function doing the actual work, kgccgmtf, reads like "generate Move To Front".
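For reference, Move To Front is the small recoding step bzip2 runs between the Burrows-Wheeler transform and the final Huffman stage: every symbol is replaced by its current position in a recency list, so runs of repeated symbols turn into runs of small numbers that compress very well. A toy Python sketch of the transform (just to illustrate the name, this is of course not Oracle's kgccgmtf):

def move_to_front(data):
    # Toy MTF encoder: emit each byte's index in a recency list,
    # then move that byte to the front of the list.
    alphabet = list(range(256))           # recency list, most recent first
    output = []
    for byte in data:
        index = alphabet.index(byte)      # current position of the symbol
        output.append(index)
        alphabet.pop(index)
        alphabet.insert(0, byte)          # move the symbol to the front
    return output

# Repeated symbols (typical after a Burrows-Wheeler transform) become
# mostly zeros: [97, 0, 0, 0, 98, 0, 0, 0, 1, 0, 0, 0]
print(move_to_front(b"aaaabbbbaaaa"))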

At that point I had a lot more information than before, but still no idea how to improve the backup speed. As we have licensed Advanced Compression for that particular node, we tested the other compression methods. LOW and MEDIUM were faster, with less compression than our previous BASIC. But HIGH was even slower!

So again I used ostackprof, and this is the topmost stack trace for HIGH:
->__libc_start_main()->main()->ssthrdmain()->opimai_real()->sou2o()->opidrv()->opiodr()->opiino()->opitsk()->ttcpip()->opiodr()->kporpc()->kkxrpc()->prient()->prient2()->pricbr()->pricar()->plsql_run()->pfrrun()->pfrrun_no_tool()->pfrinstr_ICAL()->pevm_icd_call_common()->krbibpc()->krbbpc()->krbb3crw()->krbbcdo()->kgccdo()->__PGOSF209_kgccbzip2pseudodo()->kgccbzip2do()->BZ2_bzCompress()->handle_compress()->BZ2_compressBlock()->generateMTFValues()->__sighandler()->->


Do you see the difference? Up to kgccdo there is none! And even afterwards the functions are somewhat similar. One more thing is worth mentioning: the bzip2 implementation for HIGH does not follow Oracle's internal naming convention, so it's worth searching for these names on the internet. One of my best hits was a compress.c File Reference.

Did Oracle reinvent the wheel? No. To me it looks as if they first tried their best (with their own kgcc implementation) and later preferred simple copy & paste. Maybe they should just drop one of the two; the remaining one could still be driven by parameters to achieve different compression quality.
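By the way, the BZ2_* functions in the HIGH stack are the public entry points of the stock bzip2 library, and that library already exposes such a parameter: the block size (the -1 .. -9 of the command line tool). Python's bz2 module wraps the same library, so it gives a quick feel for the speed/size trade-off; this is purely illustrative and says nothing about which parameters Oracle actually passes:

import bz2
import os
import time

# Compress the same data at different bzip2 block sizes and compare
# wall-clock time and resulting size. compresslevel maps to libbz2's
# blockSize100k parameter, i.e. the -1 .. -9 of the bzip2 CLI.
data = os.urandom(1 << 20) + b"A" * (1 << 20)    # 1 MB random + 1 MB trivial

for level in (1, 5, 9):
    start = time.time()
    compressed = bz2.compress(data, compresslevel=level)
    print("level %d: %8d bytes in %.3f s"
          % (level, len(compressed), time.time() - start))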


If someone is interested in our results:
for a single datafile of 30 GB (100% used) we achieved on a production system, with all its ongoing tasks:


Type      time (min:sec)   backup size (GB)
BASIC     13:32            5.8
LOW       5:17             8
MEDIUM    8:52             6.14
HIGH      65:29            4.25
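Turning those numbers into throughput and compression ratio makes the trade-off easier to see (simple arithmetic on the table above, assuming the sizes are in GB):

# Throughput and compression ratio for the 30 GB datafile, derived from
# the measured times (min:sec) and backup sizes (GB) above.
results = {"BASIC": ("13:32", 5.8), "LOW": ("5:17", 8.0),
           "MEDIUM": ("8:52", 6.14), "HIGH": ("65:29", 4.25)}

for algo, (duration, size_gb) in results.items():
    minutes, seconds = map(int, duration.split(":"))
    elapsed = minutes * 60 + seconds
    print("%-6s %5.1f MB/s  ratio %4.1f:1"
          % (algo, 30 * 1024 / elapsed, 30 / size_gb))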

We decided to choose MEDIUM.