Posts Tagged ‘TeamQuest’

My previous two posts were about getting utilization statistics out of my Network Appliance filers into a Teamquest database for my IT Service Analyzer and Reporter charts. They are working great and I am using them in a production environment. The thing that bothered me about them is they seemed so slow. The volume stats report would take just over a second for four filer heads and the system stats script seemed to take FOREVER. I timed it. It was only five seconds for four filers but the feeling was still FOREVER.

timing old script

ptime of old volstats

timing the old systats

ptime of old systats

I knew what the problem was and I knew I would have to buckle down and learn SNMP even better, and especially learn Perl SNMP modules in order to tune it back my acceptable standards of runtime. That first script was a quick and dirty hack really, and like most hacks it is just functional. All the SNMP requests were running system commands that could be easily run and debugged from a command line. It’s a great way to learn and get something functional at the same time. But it’s like a baby eating from a bottle, it needs to grow up, eat solid food, go to school, and get a job to support itself. Or, in Perl terms, it needs to use pure Perl code to do the work instead of system commands.

So, enter version 2 of both scripts. My new volume stats script literally runs twice as fast as the old script. My new system stats, also quite literally, runs TEN times as fast. Woo hoo! How is that for tuning code and making things better?

timing new volstats

ptime of new volstats

timing new sys stats

ptime of new sys stats

These new versions run no system commands but do all work using the Net-SNMP Perl modules (not to be confused with Net::SNMP Perl modules). The process of learning the SNMP module took several days of trial and error around my other work. The biggest issue with Perl is the confusing amount of Perl modules available to do the same job. Often, a few google searches will reveal which module has the most support and I would choose that one. But in the case of the Perl SNMP modules there is no clear winner. Both have equal number of blogs and confused postings looking for help with the modules.

So I picked one. It was the wrong one initially, of course. I picked Net::SNMP to start with because it can be built using the CPAN shell (eg, ‘perl -MCPAN -e shell’). The other primary SNMP module being used is the one provided by the Net-SNMP command line packages. This can be more of a challenge to build, but more often than not it can just be installed as a package for your system, which is the easy route I chose. I used the OpenCSW package.

The reason I say that the Net::SNMP package was the wrong path is the challenge for an SNMP illiterate to understand SNMP and specialized MIBs. It appeared that you needed to know the confusingly long ID number of the statistic to use this module. I was (and am still) trying to learn about SNMP and could not figure out the proper way to find the statistics I wanted using this module. So I switched to the other package module which allowed me to use names for statistics that I was used to, like “df64AvailKBytes” to find the full and correct amount of Kilobytes available to a filesystem.

So I set off to learn the module. I started small with test scripts to just gather one or a few statistics. This allowed me to make some quick progress and learn how to address the desired statistics as a scalar, array, or hash, and to grow and process multiple statistics in relation to each other.

I ended up using the VarList method within the module. It allows the script to retrieve a bunch of statistics with a single connection. This is much more efficient than the old script which would make up to a dozen SNMP command requests to each filer head to get the desired statistics. This new method gets them once and then let me step through them one row at a time.

View/download my scripts here:

  1. new version 2 netapp volume stats script
  2. new version 2 netapp sys stats script

There is one thing that bothered me and I never figured out when I worked on the volume statistics script (the second one I tackled). When using the command line utilities the entire disk table can be requested using the name ‘dfTable’. This would not work using the Perl SNMP module even though ‘volTable’ and ‘ifTable’ would work. I do not understand the difference, but instead punted and again used the VarList method for named individual statistics with great success. If you know why, please make a comment. I wonder if I could shave a few tenths of a second off using dfTable… 😉

This is a follow on post to my previous article on getting the NetApp filer disk/volume/aggregate statistics charting using TeamQuest ITSAR (IT Service Analyzer and Reporter). So if are you interested in getting some other statistics on usage and utilization of your Network Appliance filers like the one below, read on.

NetApp systats Chart

Utilization statistics in ITSAR

This script and user table agent definition detail how to get the actual filer utilization such as CPU busy, network kilobytes in and out, and some other useful things for potential alerts. Potential alerts? Yes, some of the statistics that can be gathered using the SNMP agent are things like failed disks, failed power supplies, failed fans, the number of spare disks, and more. Simple peruse the Network Appliance SNMP MIB to see everything that is available to us. The table definition and my script can easily be extended before implementation to include the additional information you may be interested in.

Personally, I really trust the NetApp auto-support ability. Our NetApp filers are extremely capable of alerting us when a disk or anything fails. The filer heads are clustered and extremely redundant so I trust them (just not the devil inside, to quote a movie), so I might as well gather a few stats that I may track and alert on at a future time.

I won’t spend a lot of time covering the setup of SNMP on the filer or the TeamQuest host because that’s already done in the previous blog on the subject. Instead I will jump straight into the files and table setup for these new statistics.

The first step is to download the two additional files needed for the filer system statistics.

  1. The Network Appliance TeamQuest table definition for System statistics
  2. The Network Appliance Systats perl script

By now you have all the recommendations on hand and ready to go from my last blog… so save the files above to the same directory. Edit the script to configure the paths, username, password, and community string just like last time. Also make sure that the data directory has write privilege for the user that will be running the TeamQuest UTA which is usually daemon:root. Run it a few times to make sure it is working correctly, but take the time to make sure that the logfiles are writable by the user daemon after you are finished testing.

The script writes two files necessary for calculating the true network statistics. The SNMP statistic delivered is a number in bytes since the system last booted. I don’t think it needs to be stated, but this is not a very flexible statistic to work with for charting. It’s huge! And it gets humongous since the filers never, ever need to restart except for upgrades. I developed the script to use a log file to store the statistic from the last run and do a little math to give us a useful number for ongoing utilization. On execution the script operates like this in regards to the network statistics:

  1. Gets current network statistic
  2. Get last network statistic from log file
  3. Calculates difference
  4. converts to kilobytes
  5. saves current statistic (as read from filer) to the logfile

That’s it! It’s pretty easy to setup and run. The most difficult part of the setup was reading through all the many possible options for defining the statistics in the table definitions. I think I saved you a bit of work there – and in fact, some of the praise there goes to TeamQuest themselves. I was having issues with the way some of the statistics were being averaged and I opened a ticket with them. They were very patient with me and we got it resolved. Tickle me happy!

So import the table definition into your test or production database (“$manager/bin/tqtblprb -i -d testdatabase -f NetApp_sysStats.tbl”). And when that is done build your User Table Agent same as before but referencing the second script and the new table (USER:NetAppSysStats).

I may go ahead and setup some alerts on some of the statistics, there is more to be done!

NetApp all table data

NetApp systats

There’s a secret to being a good sysadmin: You have to be just a little lazy. Just enough that you can see a better way to doing boring, repetitive, tedious tasks and write a script to do it, letting you get back to more important tasks. This usually involves making a tool do something for you. And for a Unix admin, that means writing a quick little script.

A good Unix sysadmin isn’t content with just one medium for his scripts. He should be using shell and Perl so that he has both round and square pegs for all the different shaped holes that need to be plugged by a good script.

I was recently trying to import data about our backups into my TeamQuest reporting tool so that I could graph the usage and reliably plot trends. The backup administrator found a great command for pulling stats out the NetBackup database. The NetBackup command is found in the install directory bin/adminbin subdirectory. It takes a variety of options so be sure to read the man page. I found two basic commands I needed to give it to get the data I needed. One was for gathering live data, and one was for accessing historical data that I wanted to import for a really clear picture of things.

Going back the statement about sysadmins have a touch of laziness– I ask myself, why manually pull data when you can automate data collection?

After experimenting with TeamQuest and weekly and daily stats I finally determined that I really need to gather data hourly in order for some of the automated graph methods to be able to do their job. If the truth were known (and it’s about to be) I’d really prefer just to grab the data once a week so that I can look an entire backup spectrum of Full backups and all incrementals. But it is minor oversight by TeamQuest that the new-ish tools ITSAR (IT Service Analyzer and Reporter) cannot take a macroscopic view like single point of data for a week and graph it over a six month period. Minor oversight, I forgive them, and I’m sure it will be corrected sooner than later.

So here is my “live” data command to get an hour summary data from NetBackup that is instantly imported into TeamQuest. This runs at the top of every hour as a summary of the previous hour.
/usr/openv/netbackup/bin/admincmd/bpimagelist -U -hoursago 1
01/12/2012 18:35 02/02/2012 41904 3763745 N Differential Int_unix
01/12/2012 18:35 02/02/2012 42070 4150810 N Differential Int_unix

Of course it can’t be imported into my TeamQuest database straight like that! The command prints out a bunch of data based on the number of different jobs that ran while TeamQuest really needs it summed up cleanly. So I wrote a perl script that runs the NetBackup command and sums it up, formatting it nicely for TeamQuest as a total kilobytes and number of files backed up. The TeamQuest required specific header fields are a time field (in quotes), an interval in seconds, and the servername. I’ve specified in my table definitions that I’m also providing another field for week of the year so that I can combine data for an entire week, and then the total number of files and kilobytes backed up.
An interesting note about the week-of-the-year field…. I have a bit in my Perl code that determines which week of the year I want it to be counted as. Most date modules will default to the week beginning on Sunday per the Gregorian standard, but for my backup standards the week really begins Friday at 6pm when the full backups kick off. Every backup after that should be an incremental or part of that backup set extended from Friday night.

A sample run from my script
# ./ -t -hourly
"1/25/2012 18:00:00" 3600 backupservername "3/2012"
185118 17720160

Sweet! If you see a message “no entity was found” don’t worry about it. It’s just a message from the NetBackup database command (printed to STDERR) that there wasn’t a job that particular hour. Zeroes will be imported for the data that hour.

So now my backup server runs an hourly job that imports this data into the Teamquest test database. We are looking good, going forward. But that’s only half the battle! I still need to get historical data into TQ so that I can make proper analysis.

I expand my Perl script so that I can pass in historical start and stop times at the command line.
# ./ -t -hourly -a "01/25/2012 17:00:00"
"1/25/2012 18:00:00" 3600 blade193 "3/2012"
394697 192310787

This is going great! I can run this a bunch of times for each hour of historical data that I need and append the output to a single text file. When it is done I import a single file into the TQ database and then make some pretty graphs.
So… let’s see. I’d like to go back about six months, so that’s about 180 days give or take, by 24 hours, ohhh, that’s running my command 4,320 times. Yeah… about that. I can hear Al say “I don’t think so, Tim”.

But, I really don’t want to extend my Perl script any more because this is running hourly already, and going so smoothly. If I keep hacking at it with my lowly coding skills I may break it or corrupt my data that I am collecting now. This is pretty much going to be a one-off straight forward linear loop to run this 4,320 times. Six-off at best, if I am willing to make a run per month with a few minor changes in between runs. This sounds like a shell script. Sure, I could do it Perl. But for super simple loops that is not parsing data I prefer to just use a shell script. It’s a square peg and this is a square hole.

Here’s my double loop shell script that runs my Perl script once per hour for each day of a month–

let dd=1
let lastday=31
let mm=07
let yy=2011

while [ $dd -le $lastday ]
let hh=0
echo " Running stats for day $dd" 1>&2
while [ $hh -lt 24 ]
echo " Running stats for hour $hh" 1>&2
./ -t -hourly -a "$mm/$dd/$yy $hh:00:00"
hh=`expr $hh + 1`
dd=`expr $dd + 1`
echo "Incremented day to $dd" 1>&2

Pretty simple, really. Oh, and I am sending status info lines from the shell script to STDERR output so that the STDOUT can be directed safely and cleanly into a file ready to import into TeamQuest but yet the sysadmin can easily observe how the script is progressing.

# ./makegoodhourly >import.august
Running stats for day 1
Running stats for hour 0
no entity was found
Running stats for hour 1
Running stats for hour 2
Running stats for hour 3
no entity was found
Running stats for hour 4
no entity was found
Running stats for hour 5
no entity was found
Running stats for hour 6
no entity was found
Running stats for hour 7

Make a few tweaks to the shell script to change the month number and the total number of days per month, and run it again. Easy. I ran it once per month for September through January and imported my data, and I was done.

Here’s the Perl script. It will default to daily stats if neither hourly nor weekly is specified. Why? Well that was just because that is was a middle step before I realized I needed to go hourly. I didn’t want to completely remove weekly or daily statistics for future possibilities.

I’m sure there are some better ways to accomplish the things I do in my scripts– I’d like to hear them in the comments below. I’m always eager to improve my skills.


# 1.13.12 - ver 0 - K.Creason -
# To get weekly stats out of the NetBackup database

# First we define some things that are tunable
# The statcmd is the netbackup command that generates the output summary of
# all backup jobs based fields passed to it.
# We are going to use 168 hours ago for seven days to get a full weeks summary

my $statcmd="/usr/openv/netbackup/bin/admincmd/bpimagelist -U ";

# No more tunables, so these are some defaults that we will define for later

my $DEBUG,$VERBOSE,$filesummary,$datasummary,$files,$data,@data,$tqout,

use Date::Calc qw(:all);

# process the command line arguments
if ("$ARGV[0]" eq "-h") {die "\n\nUsage: $0 [-d for debug] [-v for verbose stats] [-t for Teamquest format] [-hourly or -w for weekly summary] [-a MM/DD/YYYY for alternate start date, if hourly should include HH:MM:ss within quotes]\n\n";}
if ("$ARGV[0]" eq "-d")
{ shift @ARGV; $DEBUG++; print STDERR "Debug on.\n"; }
if ("$ARGV[0]" eq "-v")
{ shift @ARGV; $VERBOSE++; print "Verbose on.\n";}
if ("$ARGV[0]" eq "-t") { $tqout=1; shift @ARGV; if ($DEBUG>0){print STDERR "TeamQuest report on.\n";}}
if ("$ARGV[0]" eq "-hourly") { $hourly=1; shift @ARGV; if ($DEBUG>0){print STDERR "Hourly report on.\n";}}
if ("$ARGV[0]" eq "-w") {$datespec++; $weekly=1; shift @ARGV; if ($DEBUG>0){print STDERR "Weekly report on.\n";}}
if ("$ARGV[0]" eq "-a")
shift @ARGV;
$date=$ARGV[0]; if ($DEBUG>0){print STDERR "Alternate date is \"$date\".\n";}
shift @ARGV;

if ("$date" eq "")
( $yy, $mm, $dd ) = Today(); $date="$mm/$dd/$yy";
if ($DEBUG>0){print STDERR "The end date is TODAY, $date.\n";}

if ($weekly lt 1)
$begindate=$date; if ($DEBUG>0){print STDERR "Begin date is end date, $begindate.\n";}
# need to add hourly check and if turned on calculate an end date of plus one hour
if (($hourly gt 0)&&("$date" =~/\:/))
if ($DEBUG>0){ print STDERR "Calculating an end date of plus one hour from $begindate.\n";}
my $cal,$time,$hh,$min,$sec;
($cal,$time)= split (/ /,$date);
($yy,$mm,$dd) = Decode_Date_US($cal);
($hh,$min,$sec) = split(/:/,$time);
if ($DEBUG>0){ print STDERR "Splitting end date to $yy, $mm, $dd, $hh, $min, $sec.\n";}

# Before we add an hour, check to make sure the start hour is two digits
if ( (length $hh) lt 2)
{ $hh="0$hh"; $begindate="$mm/$dd/$yy $hh:00:00"; }
($yy,$mm,$dd,$hh,$min,$sec) = Add_Delta_DHMS($yy,$mm,$dd,$hh,$min,$sec,0,+1,0,0);
if ( (length $hh) lt 2){ $hh="0$hh";}
$date="$mm/$dd/$yy $hh:00:00";
if ($DEBUG>0){ print STDERR "Calculated the end date of plus one hour to $date.\n";}
# weekly, so have to calculate a begin date
($mm, $dd, $yy) = split (/\//,$date);
if ($DEBUG>0){ print STDERR "Date ($date) is split year $yy, day $dd, month $mm.\n"; }
( $yy, $mm, $dd ) = Add_Delta_Days($yy,$mm,$dd , -7 ); $begindate="$mm/$dd/$yy";
if ($DEBUG>0){ print STDERR "Begin Date is calculated to $begindate.\n"; }

# Now we need to calculate which week of the year the backup stats belong to
# paying careful attention to use the weeknumber for Friday. So if the day of week
# is monday-thurs we take week the weeknumber of the previous Friday
# which is tricky if it happens to split a new year... Oy vey.
($mm, $dd, $yy) = split (/\//,$begindate);
my $dow = Day_of_Week($yy,$mm,$dd); if ($DEBUG>0){print STDERR "Day of Week is $dow.\n";}
if ($dow gt 4)
{ ($weekno,$yy)=Week_of_Year($yy,$mm,$dd);if($DEBUG>0){print STDERR "Week of year calculated for a Fri/Sat/Sun to be $weekno/$yy.\n";}}
# This is the more complicated route. First calculate what last Friday was and then the weekno of that day.
# Think we can just substract seven for last week
my $lyy,$lmm,$ldd;
($lyy,$lmm,$ldd)= Add_Delta_Days($yy,$mm,$dd,-7); if ($DEBUG>0){print STDERR "Date of a week ago is $mm/$dd/$yy.\n"; }
{ ($weekno,$yy)=Week_of_Year($lyy,$lmm,$ldd);if($DEBUG>0){print STDERR "Week of year calculated for M-Th to be $weekno/$yy.\n";}}

# sample data
# 01/12/2012 18:35 02/02/2012 41904 3763745 N Differential Int_unix
# 01/12/2012 18:35 02/02/2012 42070 4150810 N Differential Int_unix

if ($datespec gt 0)
$statcmd="$statcmd -d $begindate -e $date";
(@data) = map {(split)[0,3,4]} grep /^[0-9]/, `$statcmd`;
if ($DEBUG>0){print STDERR "Date specified command executed \"$statcmd\".\n";}
elsif ($hourly gt 0)
$statcmd="$statcmd -hoursago 1";
(@data) = map {(split)[0,3,4]} grep /^[0-9]/, `$statcmd`;
if ($DEBUG>0){print STDERR "Hourly command executed \"$statcmd\".\n";}
$statcmd="$statcmd -hoursago 24";
(@data) = map {(split)[0,3,4]} grep /^[0-9]/, `$statcmd`;
if($DEBUG>0){print STDERR "Daily/24 hour command executed \"$statcmd\".\n";}

my $a=0;
foreach (@data)
if ($a==0){$begindate=$_;$a++; if ($DEBUG>0){print STDERR "\tDate: $begindate. ";}}
elsif ($a==1){$files=$files+$_;$a++; if ($DEBUG>0){print STDERR " files now $files.";}}
elsif ($a==2){$data=$data+$_;$a=0; if ($DEBUG>0){print STDERR " data now $data.\n";}}

if ($tqout lt 1){ print "Files backed up: $files\nData backed up $data\n";}
else {
# Check for ENV Localhost
if ("$ENV{LOCALHOST}" eq ""){ chomp($ENV{LOCALHOST}=`hostname`);}

# if we are doing a weekly report for TQ it's a different time, at least for early testing
# and format, with the interval
if ($weekly gt 0)
{$date="\"$date 12:00:00\" $ENV{LOCALHOST} \"$weekno/$yy\"";}
elsif($hourly gt 0)
if ("$date" =~ /\:/ )
#then we have a time already, use it
$date="\"$date\" 3600 $ENV{LOCALHOST} \"$weekno/$yy\"";
else { ($hh,$mm,$dd)=Now(); $date="\"$date $hh:00:00\" 3600 $ENV{LOCALHOST} \"$weekno/$yy\""; }
else {$date="\"$date 12:00:00\" 86400 $ENV{LOCALHOST} \"$weekno/$yy\"";}
if ($DEBUG>0){print STDERR "DEBUG: $date\n$files $data\n\n"; }
print "$date\n$files $data\n\n";

I run it via two cronjobs on the backup server. One gives us a weekly summary via email, and the other is the hourly TeamQuest data import.

# test teamquest weekly stats gathering on Friday mornings
30 10 * * 5 /usr/openv/netbackup/bin/admincmd/ -w |mailx -s "NetBackup weekly summary" staff
0 * * * * /opt/teamquest/manager/bin/tqtblprb -d testuser -n NetBackupHourly >/dev/null 2>&1

And what does my data look like?
Backup data six months

Wow… there’s been a bunch more to backup lately.