Archive for the ‘teamquest’ Category

My previous two posts were about getting utilization statistics out of my Network Appliance filers into a TeamQuest database for my IT Service Analyzer and Reporter charts. They are working great and I am using them in a production environment. The thing that bothered me about them is that they seemed so slow. The volume stats report would take just over a second for four filer heads and the system stats script seemed to take FOREVER. I timed it. It was only five seconds for four filers but the feeling was still FOREVER.

[Image: ptime output of the old volstats script]

[Image: ptime output of the old systats script]

I knew what the problem was and I knew I would have to buckle down and learn SNMP even better, and especially learn the Perl SNMP modules, in order to tune it back to my acceptable standards of runtime. That first script was a quick and dirty hack really, and like most hacks it is just functional. All the SNMP requests were running system commands that could be easily run and debugged from a command line. It’s a great way to learn and get something functional at the same time. But it’s like a baby eating from a bottle: it needs to grow up, eat solid food, go to school, and get a job to support itself. Or, in Perl terms, it needs to use pure Perl code to do the work instead of system commands.

So, enter version 2 of both scripts. My new volume stats script literally runs twice as fast as the old script. My new system stats, also quite literally, runs TEN times as fast. Woo hoo! How is that for tuning code and making things better?

[Image: ptime output of the new volstats script]

[Image: ptime output of the new sys stats script]

These new versions run no system commands but do all the work using the Perl module that ships with Net-SNMP (the module is simply called SNMP, not to be confused with the separate Net::SNMP module). The process of learning the SNMP module took several days of trial and error around my other work. The biggest issue with Perl is the confusing number of modules available to do the same job. Often, a few Google searches will reveal which module has the most support and I would choose that one. But in the case of the Perl SNMP modules there is no clear winner. Both have an equal number of blogs and confused postings looking for help with the modules.

So I picked one. It was the wrong one initially, of course. I picked Net::SNMP to start with because it can be built using the CPAN shell (e.g., ‘perl -MCPAN -e shell’). The other primary SNMP module in use is the one provided with the Net-SNMP command line packages. This can be more of a challenge to build, but more often than not it can just be installed as a package for your system, which is the easy route I chose. I used the OpenCSW package.

The reason I say that the Net::SNMP package was the wrong path is the challenge for an SNMP illiterate to understand SNMP and specialized MIBs. It appeared that you needed to know the confusingly long ID number of the statistic (its numeric OID) to use this module. I was (and am still) trying to learn about SNMP and could not figure out the proper way to find the statistics I wanted using this module. So I switched to the other module, which allowed me to use the names for statistics that I was used to, like “df64AvailKBytes” to find the full and correct number of kilobytes available to a filesystem.

So I set off to learn the module. I started small with test scripts to just gather one or a few statistics. This allowed me to make some quick progress and learn how to address the desired statistics as a scalar, array, or hash, and to grow and process multiple statistics in relation to each other.

I ended up using the VarList class within the module (SNMP::VarList). It allows the script to retrieve a bunch of statistics over a single session. This is much more efficient than the old script, which would make up to a dozen SNMP command requests to each filer head to get the desired statistics. The new approach gets them once and then lets me step through them one row at a time.

View/download my scripts here:

  1. new version 2 netapp volume stats script
  2. new version 2 netapp sys stats script

There is one thing that bothered me when I worked on the volume statistics script (the second one I tackled), and I never figured it out. When using the command line utilities the entire disk table can be requested using the name ‘dfTable’. This would not work using the Perl SNMP module even though ‘volTable’ and ‘ifTable’ would work. I do not understand the difference, so I punted and again used VarList with individually named statistics, with great success. If you know why, please leave a comment. I wonder if I could shave a few tenths of a second off using dfTable… 😉


This is a follow-on post to my previous article on getting the NetApp filer disk/volume/aggregate statistics charting using TeamQuest ITSAR (IT Service Analyzer and Reporter). So if you are interested in getting some other statistics on usage and utilization of your Network Appliance filers like the one below, read on.

[Image: NetApp systats chart, utilization statistics in ITSAR]

This script and user table agent definition detail how to get the actual filer utilization such as CPU busy, network kilobytes in and out, and some other useful things for potential alerts. Potential alerts? Yes, some of the statistics that can be gathered using the SNMP agent are things like failed disks, failed power supplies, failed fans, the number of spare disks, and more. Simply peruse the Network Appliance SNMP MIB to see everything that is available to us. The table definition and my script can easily be extended before implementation to include the additional information you may be interested in.

Personally, I really trust the NetApp auto-support ability. Our NetApp filers are extremely capable of alerting us when a disk or anything fails. The filer heads are clustered and extremely redundant so I trust them (just not the devil inside, to quote a movie), so I might as well gather a few stats that I may track and alert on at a future time.

I won’t spend a lot of time covering the setup of SNMP on the filer or the TeamQuest host because that’s already done in the previous blog on the subject. Instead I will jump straight into the files and table setup for these new statistics.

The first step is to download the two additional files needed for the filer system statistics.

  1. The Network Appliance TeamQuest table definition for System statistics
  2. The Network Appliance Systats perl script

By now you have all the recommendations on hand and ready to go from my last blog… so save the files above to the same directory. Edit the script to configure the paths, username, password, and community string just like last time. Also make sure that the data directory has write privilege for the user that will be running the TeamQuest UTA which is usually daemon:root. Run it a few times to make sure it is working correctly, but take the time to make sure that the logfiles are writable by the user daemon after you are finished testing.

The script writes two files necessary for calculating the true network statistics. The SNMP statistic delivered is a running total in bytes since the system last booted. I don’t think it needs to be stated, but this is not a very flexible statistic to work with for charting. It’s huge! And it gets humongous since the filers never, ever need to restart except for upgrades. I developed the script to use a log file to store the statistic from the last run and do a little math to give us a useful number for ongoing utilization. On execution the script handles the network statistics like this:

  1. Gets the current network statistic
  2. Gets the last network statistic from the log file
  3. Calculates the difference
  4. Converts it to kilobytes
  5. Saves the current statistic (as read from the filer) to the log file
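
The five steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the Perl of my actual script; the function name, the state file name, the 1024-byte kilobyte, and the choice to report zero on the first run or after a counter reset are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def kilobytes_since_last_run(current_bytes: int, state_file: Path) -> int:
    """Turn a since-boot byte counter into KB transferred since the last run."""
    # Step 2: read the reading saved by the previous run, if there is one.
    previous_bytes = None
    if state_file.exists():
        text = state_file.read_text().strip()
        if text.isdigit():
            previous_bytes = int(text)

    # Steps 3 and 4: difference, then bytes -> kilobytes (assumed 1024 bytes).
    # With no previous reading, or a counter that went backwards (for
    # example after a reboot), there is no usable delta, so report 0.
    delta_kb = 0
    if previous_bytes is not None and current_bytes >= previous_bytes:
        delta_kb = (current_bytes - previous_bytes) // 1024

    # Step 5: save the raw reading, exactly as read, for the next run.
    state_file.write_text(f"{current_bytes}\n")
    return delta_kb


# Hypothetical readings standing in for step 1 (the value read from the filer).
state = Path(tempfile.mkdtemp()) / "net_rcvd.last"
first = kilobytes_since_last_run(5_000_000, state)
second = kilobytes_since_last_run(5_000_000 + 2048 * 1024, state)
print(first, second)
```

Saving the raw counter rather than the computed delta means a missed run only widens the next interval instead of losing data.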

That’s it! It’s pretty easy to set up and run. The most difficult part of the setup was reading through all the many possible options for defining the statistics in the table definitions. I think I saved you a bit of work there – and in fact, some of the praise there goes to TeamQuest themselves. I was having issues with the way some of the statistics were being averaged and I opened a ticket with them. They were very patient with me and we got it resolved. Tickle me happy!

So import the table definition into your test or production database (“$manager/bin/tqtblprb -i -d testdatabase -f NetApp_sysStats.tbl”). And when that is done build your User Table Agent same as before but referencing the second script and the new table (USER:NetAppSysStats).

I may go ahead and set up some alerts on some of the statistics. There is more to be done!

[Image: NetApp systats, all table data]

For as long as I have been using TeamQuest products I have wanted them to provide a solution for my Network Appliance brand filer devices. It was something I could have written a long time ago but frankly it was a low priority. I had a custom script that would run “df -Ah” on the filer, cut out the columns I wanted and write them to a CSV file that certain people could read in and make an Excel chart with. It was adequate, so other higher priority items were worked and this languished for… a really long time. I now, finally, have something like the gauges chart below:

[Image: ITSAR chart of my NetApp appliances’ storage utilization]

It finally happened because this last summer we migrated our Solaris web environment into zones. My previous job ran on the old systems, and while I could have moved the stupid job, I was holding out to force myself to get this written. So after a couple of months of running it manually when the user needed the data, I bit down and wrote the code necessary to get the capacity data into TeamQuest. Basically I leveraged that inherent laziness in me to finally make myself get it done the proper way.

So, this blog documents my efforts to write a real solution to make a beautiful chart in my TeamQuest IT Service Analyzer and Reporter, where all the wild things go to get charted for management. I wanted more than just the storage utilization metrics we currently provide, but that was the most important first step to accomplish and will be covered in this blog. A follow-on blog should cover the CPU, network, failed disk, failed power supply, and other interesting metrics that can be gathered, monitored, charted, and alerted on.

How to duplicate my results in your environment

The first item is to get SNMP version 3 working on your filers under a limited user account on the filer. Large counters are a must in today’s multi-terabyte world because the 32-bit fields that SNMP version 1 is limited to cannot account for the insane amount of “bytes” reported. The 64-bit counters that can hold those numbers arrived with SNMP version 2c and carry over into version 3, which also adds the per-user authentication used here. Yes, it has to report in bytes. So be sure to download the Word Doc available at the NetApp community site and follow through step one. Yes, just the first step is all that is really needed, but don’t forget to set a password for the new user account who is allowed (and only allowed) to use SNMP.
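
A quick bit of arithmetic shows why 32-bit fields run out. This is a minimal sketch in Python, assuming an unsigned 32-bit field that counts bytes; the names are mine and only for illustration.

```python
# Largest value an unsigned 32-bit field can hold.
MAX_32BIT = 2**32 - 1


def fits_in_32_bits(value: int) -> bool:
    """True if the value can be stored in an unsigned 32-bit field."""
    return 0 <= value <= MAX_32BIT


one_terabyte_in_bytes = 1024**4

# A single terabyte, counted in bytes, is already far too large.
print(fits_in_32_bits(one_terabyte_in_bytes))  # False

# The ceiling expressed in gigabytes (1024**3 bytes each): just under 4.
print(MAX_32BIT / 1024**3)
```

An unsigned 64-bit field tops out at 2**64 - 1 instead, which leaves plenty of headroom for multi-terabyte volumes.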

Create a location for your scripts and data files. I like to put my scripts in /opt/teamquest/scripts, with a data directory underneath that. The Teamquest User Table Agent will run as the user ‘daemon’ and group root, so be sure to set appropriate permissions on the directory path and script for read and execution, and the data directory for write permission.

Make sure your system has snmp binaries. The Solaris ones are adequate and will probably be in /usr/sfw/bin if you installed the packages. The OpenCSW binaries are great, too. You will notice I am actually using the OpenCSW binaries at times but I have no good reason to, except that I typically like to have some base OpenCSW packages installed so that I have gtail, allowing me to tail multiple files at the same time.

Download the following files

  1. NetApp MIB for SNMP from NetApp
  2. My Script for NetApp Volume Statistics
  3. TeamQuest table definition for NetApp Volume

Drop the latest NetApp SNMP MIB into the data directory and copy my scripts to the script directory. Use “less” to look into the NetApp MIB and see some of the options you can get in there. There are a lot. I focused on the following values that I will use between this blog (volume statistics) and a future blog on system statistics: dfTable, productVersion, productModel, productFirmwareVersion, cpuBusyTimePerCent, envFailedFanCount, envFailedPowerSupplyCount, nvramBatteryStatus, diskTotalCount, diskFailedCount, diskSpareCount, misc64NetRcvdBytes, and misc64NetSentBytes. If you see a “64” version of a statistic you will want to use that one to make sure that you are getting the real data figure out of the system.

Test your user and SNMP client with some command line operations before you start editing the script to run in your environment. A command would look like this: ‘/opt/csw/bin/snmpget -v3 -n "" -u yourSNMPuser -l authNoPriv -A yourSNMPuserPassword -a Md5 -O qv -c yourcommunity -m /opt/teamquest/scripts/data/netapp.mib your-filername NETWORK-APPLIANCE-MIB::misc64NetRcvdBytes.0’. We will work on this statistic next time, but today we are looking at the dfTable statistic for all the stats you want on your storage. So be sure to also test this different SNMP command: ‘/usr/sfw/bin/snmptable -v3 -n "" -u yourSNMPuser -l authNoPriv -A yourSNMPuserPassword -a Md5 -c yourcommunity -m /opt/teamquest/scripts/data/netapp.mib yourfilername NETWORK-APPLIANCE-MIB::dfTable’ and marvel at the amount of data that comes across your terminal.

If all is successful with your command line tests then you are ready to edit the script and get it configured for your environment. You may be changing the path to the SNMP command and the MIB file, but you will definitely be changing the username, password, and community string. There are several other options to tweak too: do you want to import all volumes or just aggregates? Do you want to ignore snapshots? Test the script several times and make sure it is returning the data the way you want it. You will notice that you have to pass the filer names (comma separated, no spaces) in on the command line. This makes it easy to add and remove filers from your environment without adding or removing User Table Agents from your TeamQuest manager; simply edit the command line options passed to the script. Don’t forget to test with and without the -t=interval option for the TeamQuest format, where the interval matches the frequency at which you want the agent to run. And don’t worry about the extra options for snapshots or aggregates-only. They can be tweaked at any time to limit the data being absorbed by TeamQuest, and when you report or alert you can easily filter out what you don’t want.
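
The command line convention described above (a comma separated filer list plus an optional -t=interval) is easy to picture in code. This is a hedged sketch in Python, not the parsing from my Perl script; the function name and the filer names are made up for the example.

```python
def parse_agent_arguments(argv):
    """Split arguments into a list of filer names and an optional interval."""
    filers = []
    interval = None
    for arg in argv:
        if arg.startswith("-t="):
            # -t=<interval>: the collection interval in seconds.
            interval = int(arg[len("-t="):])
        elif not arg.startswith("-"):
            # Comma separated filer names, no spaces.
            filers.extend(name for name in arg.split(",") if name)
    if not filers:
        raise ValueError("at least one filer name is required")
    return filers, interval


# Hypothetical filer names, with and without the interval option.
print(parse_agent_arguments(["filer1a,filer1b", "-t=300"]))
print(parse_agent_arguments(["filer1a"]))
```

Keeping the filer list on the command line is what lets one agent definition cover any number of filers.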

When you are ready, import the third file, the table, into your TeamQuest database. You may want to use a test database for a while and then eventually add it to a production database. The command to import the table is “$manager/bin/tqtblprb -i -f NetApp_VolumeStats.tbl”, but I heartily recommend you have the command line manual handy and consult it regularly for adding databases and tables, and for deleting said items when things go wrong. IT happens.

Adding User Table Agent configuration


When the table is entered into the database you are ready to add your very own User Table Agent. Connect to the TQ manager of the desired system using your browser. Set the database if you are not using the production database, and then click Collection Agents. On the far right you will see the link “Add Agent”. Click that and then “User Table Instance”. Begin to enter the information that makes sense to you, such as a name for this agent, the path to the executable, and the interval at which you want collection to happen. The class and subclass must match exactly what is in the table file that was imported. They will be “USER” and “NetAppVolumes” unless you changed them. The Program arguments field is where you pass in the comma separated list of filer names (no spaces!), a single space, and -t=<interval>. Make sure that interval equals what you have entered below for the actual collection interval. After you save and apply the new settings you simply have to wait until the clock hits the time of the next collection (every five minute increment of the hour if you are using the 300 second interval like I am).
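
To get a feel for how long that wait is, here is a small Python sketch. It assumes collections line up on whole multiples of the interval counted from the top of the hour, which matches the five minute marks I see with 300 seconds; the function name is hypothetical and this is not a description of the agent’s internals.

```python
def seconds_until_next_collection(minute: int, second: int, interval: int) -> int:
    """Seconds to wait until the next interval boundary within the hour."""
    seconds_past_hour = minute * 60 + second
    # Exactly on a boundary, the next collection is a full interval away.
    return interval - (seconds_past_hour % interval)


# At 3 minutes 20 seconds past the hour with a 300 second interval,
# the next five minute mark is 100 seconds away.
print(seconds_until_next_collection(3, 20, 300))  # 100
```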

Be sure to launch TQView and look at the table directly for accurate statistics, play with the filter options, etc. Tip: you can create a test database in ITSAR that only harvests from this test database so that you can test end to end.

[Image: using TQView to examine the actual data gathered, table view]

You will notice that I dropped the actual FlexibleVolume volume type data from my gathering. It may be useful at some point in the future and it can be re-added with a simple edit to the script, but for this first stage all I care about is overall health of the filer and so my ITSAR chart for management is a simple global view of the filer cluster-pair. For this, I use the statistic “FilerCapacity” that the script calculates by summing all of the VolumeType “aggregate” on each filer node. You can see that I have a total of four nodes in my environment (names withheld to protect the innocent).
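
The “FilerCapacity” roll-up described above amounts to a grouped sum. Here is a minimal sketch in Python (the real script is Perl), assuming each row carries a filer name, a volume type, and a size in kilobytes; the field names and sample figures are invented for illustration.

```python
from collections import defaultdict


def filer_capacity(rows):
    """Sum the size of every 'aggregate' row, grouped by filer node."""
    totals = defaultdict(int)
    for row in rows:
        if row["volume_type"] == "aggregate":
            totals[row["filer"]] += row["total_kb"]
    return dict(totals)


# Invented sample rows: two aggregates on one node, one on another,
# plus a flexible volume that must not be counted.
sample_rows = [
    {"filer": "node-a", "volume_type": "aggregate", "total_kb": 1000},
    {"filer": "node-a", "volume_type": "aggregate", "total_kb": 2500},
    {"filer": "node-a", "volume_type": "flexibleVolume", "total_kb": 400},
    {"filer": "node-b", "volume_type": "aggregate", "total_kb": 3000},
]
print(filer_capacity(sample_rows))  # {'node-a': 3500, 'node-b': 3000}
```

Counting only aggregates avoids double counting, since the volumes live inside the aggregates.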

And that is it for the first stage! On to writing alerts and getting the system stats working.