Oracle ZFS Storage Appliance Administration Ed 1 LVC

This was a four-day course delivered online. The Oracle labs were inaccessible from my work laptop and from my work network, so I used my personal laptop from home. This turned out to be a good thing because work interruptions were minimal and I could focus on the class.

The instructor, Barry Isaacman, has many years of experience with Sun hardware and has done many ZFS Appliance installations, so he had lots of tips on how to set things up optimally. He also provided documents outside the course materials to support further learning. I was really interested in the API and Python integration, and he was very helpful.

The labs are excellent, and they use real ZFS Appliance hardware, so performance is pretty good. There is also a ZFS Appliance simulator available. The hardest part was interpreting the class-specific instructions for student IP address assignments. After getting the initial settings correct, everything worked fine.

I was able to apply my new knowledge as soon as I returned to work this week. We recently installed ZFS appliances for database backups over InfiniBand in two data centers, and the installer set up replication between them. I reviewed the configuration and was pleased to find that the best practices taught in the class had been implemented in the new installation.

I’m just starting to experiment with the API using Python. Here are some links that are promising:

http://www.oracle.com/technetwork/server-storage/sun-unified-storage/documentation/restfulapi-zfssa-0914-2284451.pdf

https://blogs.oracle.com/jkremer/entry/oracle_zfs_storage_appliance_restful
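
As a starting point for that kind of experiment, here is a minimal sketch of building a REST request in Python using only the standard library. It assumes, based on my reading of the REST API guide linked above, that the appliance serves the API over HTTPS on port 215, that storage pools are listed under `/api/storage/v1/pools`, and that HTTP basic authentication is accepted. The host name and credentials are hypothetical placeholders, and the function only builds the request without sending it.

```python
import base64
import urllib.request


def build_pools_request(host, user, password, port=215):
    """Build (but do not send) a GET request for the appliance's pool list.

    Assumptions: port 215 and the /api/storage/v1/pools path come from the
    REST API guide; verify both against your appliance's software release.
    """
    url = f"https://{host}:{port}/api/storage/v1/pools"
    # HTTP basic authentication: base64 of "user:password"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/json")
    return req


if __name__ == "__main__":
    # Hypothetical host and credentials, for illustration only
    req = build_pools_request("zfssa.example.com", "student", "secret")
    print(req.get_method(), req.full_url)
```

To actually send it, pass the request to `urllib.request.urlopen` and decode the JSON body. An appliance with a self-signed certificate will probably also need a suitable `ssl` context.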

High load average, high IO wait, slow process startup, slow RAC communications

We have been having performance problems on our test Exadata for several months, and I have opened five Oracle service requests for multiple symptoms. While CPU utilization was fairly low, Oracle background processes would hang, the OEM 12c agent would hang, backup jobs would hang, communications between RAC nodes would slow down, half of the CPUs would sit in 100% IO wait, and the system load average would exceed 6000. We noticed that one of the NFS mounts was unreachable, and it happened to be the mount point where we keep our DBA scripts. Processes would be stuck in the uninterruptible IO wait state “D”, and several of them were DBA scripts running from NFS. We could resolve the problem by killing these scripts, so I moved the backup scripts to local drives, which eliminated some of the issues.
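
A quick way to spot processes stuck in state “D” is to filter `ps` output. The sketch below is not the exact procedure we followed. It assumes the text comes from `ps -eo pid,stat,args` on Linux, and the script paths in the sample are hypothetical.

```python
def find_d_state(ps_output):
    """Return (pid, command) for each process whose state starts with 'D'.

    Expects the text produced by `ps -eo pid,stat,args`, header included.
    """
    blocked = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)          # pid, stat, rest of command line
        if len(parts) < 3:
            continue
        pid, stat, args = parts
        if stat.startswith("D"):             # D = uninterruptible sleep
            blocked.append((int(pid), args))
    return blocked


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only
    sample = (
        "  PID STAT COMMAND\n"
        "    1 Ss   /sbin/init\n"
        " 4242 D    /bin/sh /nfs/dba/scripts/backup.sh\n"
    )
    for pid, cmd in find_d_state(sample):
        print(pid, cmd)
```

On a live system the input could come from `subprocess.run(["ps", "-eo", "pid,stat,args"], capture_output=True, text=True).stdout`.
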
The problem kept returning, though, so I kept opening more SRs with no solution. Yesterday I escalated and had one of my SRs reopened. I was finally able to reach the right person in Oracle support, who gave me two things to try. The first was to add the “noac” option to the NFS mounts. The idea was that this would resolve issues where synchronous writes are induced. Since we back up to NFS using RMAN and tar, this seemed a good bet, and it did help a lot. But we could still bring the problem back by tarring to NFS.
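
To confirm which NFS mounts actually picked up the option after a remount, one approach is to scan `/proc/mounts`. This is a rough sketch that assumes the usual Linux `/proc/mounts` layout (device, mount point, file system type, options) and that “noac” shows up in the options column for mounts that use it. The server and path names in the sample are hypothetical.

```python
def nfs_mounts_missing_option(mounts_text, option="noac"):
    """Return mount points of NFS file systems that lack the given option."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("nfs", "nfs4") and option not in options.split(","):
            missing.append(mount_point)
    return missing


if __name__ == "__main__":
    # Hypothetical /proc/mounts content, for illustration only
    sample = (
        "/dev/sda1 / ext4 rw,relatime 0 0\n"
        "zfs-ib:/export/backup /backup nfs rw,noac,vers=3,proto=tcp 0 0\n"
        "zfs-ib:/export/dba /dba_scripts nfs rw,vers=3,proto=tcp 0 0\n"
    )
    print(nfs_mounts_missing_option(sample))
```

On a real host, pass in `open("/proc/mounts").read()` instead of the sample string.
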
These NFS mounts go across InfiniBand to an Exalogic ZFS storage system. The second fix was based on the fact that the new OEL kernel 11.2.3.2.1 update includes memory management changes under which high TCP/IP traffic can starve the system of contiguous free memory. See Knowledge Base article 1546861.1, System hung with large numbers of page allocation failures with “order:5”, which states: “Future Exadata releases will be changing the MTU size on the InfiniBand Interfaces to 7000 (down from 65520) for new installations, so the 7000 MTU for Exadata environments is known to be appropriate.” So I changed the InfiniBand MTU from 65520 to 7000 and restarted the network service. That finally fixed the issue.
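
To verify the change took effect, the current MTU of each interface can be read from sysfs. This is a minimal sketch, assuming a Linux host that exposes `/sys/class/net/<interface>/mtu`. The interface names `ib0` and `ib1` are assumptions, so substitute whatever your InfiniBand or bonded interfaces are called.

```python
import os


def read_mtu(interface, sysfs_root="/sys/class/net"):
    """Return the current MTU of an interface as reported by sysfs."""
    path = os.path.join(sysfs_root, interface, "mtu")
    with open(path) as f:
        return int(f.read().strip())


def interfaces_over_mtu(interfaces, limit=7000, sysfs_root="/sys/class/net"):
    """Return {interface: mtu} for interfaces whose MTU exceeds the limit."""
    over = {}
    for name in interfaces:
        mtu = read_mtu(name, sysfs_root)
        if mtu > limit:
            over[name] = mtu
    return over


if __name__ == "__main__":
    # "ib0" and "ib1" are assumed names; adjust for your system
    present = [n for n in ("ib0", "ib1") if os.path.isdir(f"/sys/class/net/{n}")]
    print(interfaces_over_mtu(present))
```

An empty result means every interface checked is at or below 7000. For context, an “order:5” allocation is a request for 2^5 = 32 physically contiguous pages, which is 128 KiB with 4 KiB pages.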