High load average, high IO wait, slow process startup, slow RAC communications

We have been having performance problems on our test Exadata for several months. I have opened five Oracle service requests for multiple symptoms. While the cpu utilization was fairly low, Oracle background processes would hang, the OEM 12c agent would hang, backup jobs would hang, we would experience slow communications between RAC nodes, half of the cpus would be in 100% IO wait state, the system load average would exceed 6000, etc. We noticed one of the NFS mounts was unreachable and it happened to be the mount point where we keep our DBA scripts. Processes would be in IO wait state “D” and we noticed several were the DBA scripts running from NFS. We could resolve the problem by killing these scripts. So I moved the backup scripts to local drives and eliminated some issues.
The problem kept returning though so I kept opening more SRs with no solution. Yesterday I escalated and had one of my SRs reopened. I was finally able to get to the correct person in Oracle support who gave me two things to try. The first was to add the “noac” option for the NFS mounts. The idea was that this would resolve issues where synchronous writes are induced. Since we are backing up to NFS using RMAN and tar this seemed a good bet. And it did help a lot. But we were still able to bring the problem back by tarring to NFS.
These NFS mounts are across Infiniband to an Exalogic ZFS storage system. The second fix was based on the fact that the new OEL kernel 11.2.3.2.1 update has memory management changes that may result in high TCP/IP traffic causing memory starvation for contiguous memory free space. See Knowledge base article 1546861.1 System hung with large numbers of page allocation failures with “order:5″ : <Future Exadata releases will be changing the MTU size on the InfiniBand Interfaces to 7000 (down from 65520) for new installations, so the 7000 MTU for Exadata environments is known to be appropriate> ┬áSo I changed the Infiniband MTU from 65520 to 7000 and restarted the network service. That finally fixed the issue.

About these ads

One thought on “High load average, high IO wait, slow process startup, slow RAC communications

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s