For as long as I have been using TeamQuest products I have wanted them to provide a solution for my Network Appliance brand filer devices. It was something I could have written myself a long time ago, but frankly it was a low priority. I had a custom script that would run “df -Ah” on the filer, cut out the columns I wanted, and write them to a CSV file that certain people could read in and turn into an Excel chart. It was adequate, so other higher-priority items were worked and this languished for… a really long time. I now, finally, have something like the gauges chart below:
It finally happened because this last summer we migrated our Solaris web environment into zones. My previous job ran on the old systems, and while I could have moved the stupid job, I was holding out to force myself to get this written. So after a couple of months of running it manually whenever the user needed the data, I bit the bullet and wrote the code necessary to get the capacity data into TeamQuest. Basically I leveraged my inherent laziness to finally make myself get it done the proper way.
So, this blog documents my efforts to write a real solution to make a beautiful chart in my TeamQuest IT Service Analyzer and Reporter, where all the wild things go to get charted for management. I wanted more than just the storage utilization metrics we currently provide, but that was the most important first step to accomplish and it is what this blog covers. A follow-on blog should cover the CPU, network, failed disk, failed power supply, and other interesting metrics that can be gathered, monitored, charted, and alerted on.
How to duplicate my results in your environment
The first item is to get SNMP version 3 working on your filers under a limited user account on the filer. Wide counters are necessary in today’s multi-terabyte world because the 32-bit fields that SNMP version 1 is limited to cannot account for the insane amount of “bytes” reported. Strictly speaking, 64-bit counters arrived with SNMP version 2c, but version 3 adds per-user authentication, which is what makes the limited account possible. Yes, it has to report in bytes. So be sure to download the Word doc available at the NetApp community site and follow through step one. Yes, just the first step is all that is really needed, but don’t forget to set a password for the new user account, which is allowed (and only allowed) to use SNMP.
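To put numbers on why 32-bit fields run out, here is a small Python sketch of the arithmetic. The 10 TiB aggregate is a made-up example size, not a figure from my environment.

```python
# Largest value a 32-bit SNMP field can carry, and the 64-bit equivalent.
MAX_32BIT = 2**32 - 1
MAX_64BIT = 2**64 - 1


def fits_in_32_bits(value: int) -> bool:
    """Return True if value can be reported in a 32-bit field without wrapping."""
    return 0 <= value <= MAX_32BIT


# Hypothetical 10 TiB aggregate, expressed in bytes and in KiB.
ten_tib_bytes = 10 * 1024**4
ten_tib_kib = ten_tib_bytes // 1024

print("32-bit ceiling in bytes:", MAX_32BIT)
print("10 TiB in bytes fits in 32 bits:", fits_in_32_bits(ten_tib_bytes))
print("10 TiB in KiB fits in 32 bits:", fits_in_32_bits(ten_tib_kib))
print("10 TiB in bytes fits in 64 bits:", ten_tib_bytes <= MAX_64BIT)
```

A 32-bit field tops out just under 4 GiB when counting bytes, and just under 4 TiB when counting KiB, so either way a large aggregate needs the 64-bit variant.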
Create a location for your scripts and data files. I like to put my scripts in /opt/teamquest/scripts, with a data directory underneath that. The TeamQuest User Table Agent will run as the user ‘daemon’ and group root, so be sure that user can read and execute the directory path and the script, and can write to the data directory.
Make sure your system has SNMP binaries. The Solaris ones are adequate and will probably be in /usr/sfw/bin if you installed the packages. The OpenCSW binaries are great, too. You will notice I am actually using the OpenCSW binaries at times, but I have no good reason to, except that I typically like to have some base OpenCSW packages installed so that I have gtail, which lets me tail multiple files at the same time.
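A quick way to confirm those permissions is a small check like the Python sketch below. It is only an illustration: the function name is hypothetical, the paths in the usage line are placeholders under my preferred directory, and `os.access` tests the user running the check, so it would need to be run as the ‘daemon’ user to be meaningful.

```python
import os


def check_agent_paths(script_path, data_dir):
    """Return a list of permission problems for the user running this check."""
    problems = []
    script_dir = os.path.dirname(script_path) or "."
    # The agent must be able to enter and list the script directory.
    if not os.access(script_dir, os.R_OK | os.X_OK):
        problems.append(f"cannot read/traverse directory: {script_dir}")
    # The script itself must be readable and executable.
    if not os.access(script_path, os.R_OK | os.X_OK):
        problems.append(f"cannot read/execute script: {script_path}")
    # The data directory must be writable (and traversable) for output files.
    if not os.access(data_dir, os.W_OK | os.X_OK):
        problems.append(f"cannot write to data directory: {data_dir}")
    return problems


# Placeholder paths; substitute your own script name and data directory.
for problem in check_agent_paths("/opt/teamquest/scripts/your-script",
                                 "/opt/teamquest/scripts/data"):
    print(problem)
```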
Download the following files
- NetApp MIB for SNMP from NetApp
- My Script for NetApp Volume Statistics
- TeamQuest table definition for NetApp Volume
Drop the latest NetApp SNMP MIB into the data directory and copy my scripts to the script directory. Use “less” to look through the NetApp MIB and see what it offers. There is a lot in there. I focused on the following values, which I will use between this blog (volume statistics) and a future blog on system statistics: dfTable, productVersion, productModel, productFirmwareVersion, cpuBusyTimePerCent, envFailedFanCount, envFailedPowerSupplyCount, nvramBatteryStatus, diskTotalCount, diskFailedCount, diskSpareCount, misc64NetRcvdBytes, and misc64NetSentBytes. If you see a “64” version of a statistic, you will want to use that one to make sure that you are getting the real figure out of the system.
Test your user and SNMP client with some command line operations before you start editing the script to run in your environment. A command would look like this:
/opt/csw/bin/snmpget -v3 -n "" -u yourSNMPuser -l authNoPriv -A yourSNMPuserPassword -a Md5 -O qv -c yourcommunity -m /opt/teamquest/scripts/data/netapp.mib yourfilername NETWORK-APPLIANCE-MIB::misc64NetRcvdBytes.0
We will work on this statistic next time, but today we are looking at the dfTable statistic for all the stats you want on your storage. So be sure to also test this different SNMP command:
/usr/sfw/bin/snmptable -v3 -n "" -u yourSNMPuser -l authNoPriv -A yourSNMPuserPassword -a Md5 -c yourcommunity -m /opt/teamquest/scripts/data/netapp.mib yourfilername NETWORK-APPLIANCE-MIB::dfTable
Then marvel at the amount of data that comes across your terminal.
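If you would rather drive those tests from a script than retype them, the Python sketch below assembles the same argument lists. It is not the downloadable script. The function names are hypothetical, and the flags are copied from the commands above rather than from any deeper knowledge of your SNMP setup.

```python
import shlex
import subprocess


def build_snmp_command(tool, filer, oid, user, password, community,
                       mib_path, output_opts=None):
    """Assemble the argument list for snmpget or snmptable, mirroring the flags above."""
    cmd = [tool, "-v3", "-n", "", "-u", user, "-l", "authNoPriv",
           "-A", password, "-a", "Md5"]
    if output_opts:
        # For example "qv", which the snmpget command above passes to -O.
        cmd += ["-O", output_opts]
    cmd += ["-c", community, "-m", mib_path, filer, oid]
    return cmd


def run_snmp(cmd):
    """Run the command and return its stdout; raises CalledProcessError on failure."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Placeholder values; replace them with your own user, password, and filer name.
table_cmd = build_snmp_command(
    "/usr/sfw/bin/snmptable", "yourfilername",
    "NETWORK-APPLIANCE-MIB::dfTable",
    "yourSNMPuser", "yourSNMPuserPassword", "yourcommunity",
    "/opt/teamquest/scripts/data/netapp.mib",
)
# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(table_cmd))
```

Passing the arguments as a list avoids shell quoting problems, in particular with the empty context name given to `-n`. Call `run_snmp(table_cmd)` once the printed command looks right.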
If all is successful with your command line tests, then you are ready to edit the script and get it configured for your environment. You may be changing the path to the SNMP command and the MIB file, but you will definitely be changing the username, password, and community string. There are several other options to tweak too: do you want to import all volumes or just aggregates? Do you want to ignore snapshots? Test the script several times and make sure it is returning the data the way you want it. You will notice that you have to pass the filer names (comma separated, no spaces) in on the command line. This makes it easy to add and remove filers from your environment without adding or removing User Table Agents from your TeamQuest manager; you simply edit the command line options passed to the script. Don’t forget to test with and without the -t=interval option for the TeamQuest format, where the interval matches the frequency at which you want the agent to run. And don’t worry about the extra options for snapshots or aggregates-only. These can be tweaked at any time to limit the data being absorbed by TeamQuest, and when you report or alert you can easily filter out what you don’t want.
When you are ready, import the third file, the table, into your TeamQuest database. You may want to use a test database for a while and then eventually add it to a production database. The command to import the table is “$manager/bin/tqtblprb -i -f NetApp_VolumeStats.tbl” but I heartily recommend you have the command line manual handy and consult it regularly for adding databases and tables, and for deleting said items when things go wrong. IT happens.
When the table is entered into the database you are ready to add your very own User Table Agent. Connect to the TQ manager of the desired system using your browser. Set the database if you are not using the production database, and then click Collection Agents. On the far right you will see the link “Add Agent”; click that and then “User Table Instance”. Begin to enter the information that makes sense to you, such as a name for this agent, the path to the executable, and the interval at which you want collection to happen. The class and subclass must match exactly what is in the table file that was imported. They will be “USER” and “NetAppVolumes” unless you changed them. The Program arguments field is where you pass in the comma separated list of filer names (no spaces!), a single space, and -t=<interval>. Make sure that interval equals what you have entered below for the actual collection interval. After you save and apply the new settings you simply have to wait until the clock hits the time of the next collection (every five minute increment of the hour if you are using the 300 second interval like I am).
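For illustration, the command line convention described above (comma separated filer names plus an optional -t=interval) could be parsed along these lines in Python. This is a hypothetical sketch, not the parsing code in the downloadable script, and it does not model the snapshot or aggregates-only options.

```python
def parse_args(argv):
    """Split arguments into a list of filer names and an optional interval in seconds."""
    filers = []
    interval = None
    for arg in argv:
        if arg.startswith("-t="):
            # TeamQuest-format output is requested with -t=<interval>.
            interval = int(arg[len("-t="):])
        elif not arg.startswith("-"):
            # Filer names arrive as one comma separated token with no spaces.
            filers.extend(name for name in arg.split(",") if name)
        # Any other dash option is ignored in this sketch.
    if not filers:
        raise ValueError("at least one filer name is required")
    return filers, interval


# Hypothetical filer names, with and without the TeamQuest interval option.
print(parse_args(["filer1,filer2", "-t=300"]))
print(parse_args(["filer1"]))
```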
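If you want to know how long that wait will be, the next boundary is easy to compute. The Python sketch below assumes collections line up on multiples of the interval counted from the top of the hour, which matches what I see with 300 seconds, and that the interval divides an hour evenly. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def next_collection(now, interval_seconds=300):
    """Return the first interval boundary strictly after 'now', counted from the top of the hour."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    elapsed = (now - top_of_hour).total_seconds()
    intervals_done = int(elapsed // interval_seconds)
    return top_of_hour + timedelta(seconds=(intervals_done + 1) * interval_seconds)


# Example timestamp chosen for illustration only.
sample = datetime(2024, 1, 1, 10, 7, 30)
print("next collection after", sample, "is", next_collection(sample))
```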
Be sure to launch TQView and look at the table directly for accurate statistics, play with the filter options, etc. Tip: you can create a test database in ITSAR that only harvests from this test database so that you can test end to end.
You will notice that I dropped the actual FlexibleVolume volume type data from my gathering. It may be useful at some point in the future and it can be re-added with a simple edit to the script, but for this first stage all I care about is overall health of the filer and so my ITSAR chart for management is a simple global view of the filer cluster-pair. For this, I use the statistic “FilerCapacity” that the script calculates by summing all of the VolumeType “aggregate” on each filer node. You can see that I have a total of four nodes in my environment (names withheld to protect the innocent).
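To make the per-node summing concrete, here is a rough Python sketch of that kind of roll-up. It is not the code from the downloadable script. The row field names are hypothetical, and it assumes the figure being charted is percent used across all aggregates on a node, which may differ from how “FilerCapacity” is actually computed.

```python
from collections import defaultdict


def filer_capacity(rows):
    """Sum used and total KBytes over 'aggregate' rows per filer; return percent used."""
    totals = defaultdict(lambda: [0, 0])  # filer -> [used, total]
    for row in rows:
        if row["volume_type"] != "aggregate":
            continue  # skip flexible volumes and anything else
        totals[row["filer"]][0] += row["kbytes_used"]
        totals[row["filer"]][1] += row["kbytes_total"]
    return {
        filer: round(100.0 * used / total, 1) if total else 0.0
        for filer, (used, total) in totals.items()
    }


# Made-up sample rows; the names and sizes are illustrative only.
sample_rows = [
    {"filer": "nodeA", "volume_type": "aggregate", "kbytes_used": 50, "kbytes_total": 100},
    {"filer": "nodeA", "volume_type": "aggregate", "kbytes_used": 150, "kbytes_total": 300},
    {"filer": "nodeA", "volume_type": "flexibleVolume", "kbytes_used": 10, "kbytes_total": 20},
    {"filer": "nodeB", "volume_type": "aggregate", "kbytes_used": 25, "kbytes_total": 100},
]
print(filer_capacity(sample_rows))
```

Summing used and total separately before dividing weights each aggregate by its size, which averaging per-aggregate percentages would not do.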
And that is it for the first stage! On to writing alerts and getting the system stats working.