2017-11-16

Backup stalled due to a stuck ASM rebalance

I hit an issue where a full backup took much longer than normal.
No alarm had fired yet, as no threshold had been reached. But I was working on the DB for some other reason, and out of habit I usually start an ASH viewer whenever I work on a system. Even if I only check data, it's worth keeping an eye on the system.
In this case I saw some top sessions waiting on 'ASM file metadata operation' and 'KSV master wait'.
They weren't my query session (so I didn't break anything) but some RMAN worker processes.
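
If no ASH viewer is at hand, roughly the same picture can be pulled from the database instance with a query. This is only a sketch: it assumes the Diagnostics Pack is licensed (required for v$active_session_history), and the 10-minute window is an arbitrary choice of mine.

```sql
-- Top wait events of the last 10 minutes, grouped by program.
-- Assumption: Diagnostics Pack license; the window length is arbitrary.
SELECT program,
       event,
       COUNT(*) AS samples
FROM   v$active_session_history
WHERE  sample_time > SYSTIMESTAMP - INTERVAL '10' MINUTE
AND    session_state = 'WAITING'
GROUP  BY program, event
ORDER  BY samples DESC;
```

RMAN channel processes showing up at the top with 'ASM file metadata operation' would match what the viewer displayed.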

That's worth investigating. After some research (Google and MetaLink) I found a connection between ASM rebalance and 'ASM file metadata operation'.

Checking the ASM instance, there really was an ASM rebalance ongoing, but it made no progress (no change in v$asm_operation.SOFAR over several minutes). It had been initiated the other evening by a colleague who added a disk to the DG. I agree with Kevin this is a bad habit, but in this environment it doesn't cause enough pain (and multiple teams are involved) to rework all the processes. The RBAL process was waiting on 'enq: RB - contention'.
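
The check can be sketched as two queries on the ASM instance. Run the first one a few times, some minutes apart: if SOFAR does not move while the operation is still listed, the rebalance is not progressing. The LIKE filter on the program name in the second query is my assumption about how RBAL appears in v$session; adjust as needed.

```sql
-- Rebalance progress: compare SOFAR between runs a few minutes apart.
SELECT group_number, operation, state, power,
       sofar, est_work, est_rate, est_minutes
FROM   v$asm_operation;

-- What is the RBAL background process waiting on?
-- Assumption: the program column contains 'RBAL' for this process.
SELECT sid, program, event, seconds_in_wait
FROM   v$session
WHERE  program LIKE '%RBAL%';
```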

As an ASM rebalance can easily be stopped or restarted with a different priority, I gave this a chance and ran ALTER DISKGROUP dg REBALANCE POWER 2. The power value is not important here; the point is only to stop the current (stalled) rebalance and issue another.
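
Put together, the restart and a follow-up check look roughly like this on the ASM instance. `dg` is the placeholder disk group name from above, not a real one, and the join to v$asm_diskgroup is only there to show the name next to the progress figures.

```sql
-- Replace the stalled rebalance with a fresh one; 'dg' is a placeholder name.
ALTER DISKGROUP dg REBALANCE POWER 2;

-- Verify that the new operation is running and SOFAR increases over time.
SELECT g.name, o.operation, o.state, o.power,
       o.sofar, o.est_work, o.est_minutes
FROM   v$asm_operation o
JOIN   v$asm_diskgroup g ON g.group_number = o.group_number;
```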

The ASH viewer immediately showed the unusual waits disappear, and the RMAN logs showed normal progress right away.

To be honest, I did not do much analysis here, so a deeper investigation might be worthwhile. But in this case it was sufficient, and the issue was solved even before an alarm was raised about the blocked backup.

Once again, ASH (and my curiosity) helped solve the issue.