Archive for the ‘solaris’ Category

For as long as I have been using TeamQuest products I have wanted them to provide a solution for my Network Appliance brand filer devices. It was something I could have written a long time ago, but frankly it was a low priority. I had a custom script that would run "df -Ah" on the filer, cut out the columns I wanted, and write them to a CSV file that certain people could read in and make an Excel chart with. It was adequate, so other higher priority items were worked and this languished for… a really long time. I now, finally, have something like the gauges chart below:


ITSAR chart of my NetApp appliances’ utilization

It finally happened because this last summer we migrated our Solaris web environment into zones. My previous job ran on the old systems, and while I could have moved the stupid job, I was holding out to force myself to get this written. So after a couple of months of running it manually whenever the user needed the data, I bit down and wrote the code necessary to get the capacity data into TeamQuest. Basically, I leveraged my inherent laziness to finally make myself get it done the proper way.

So, this blog documents my efforts to write a real solution: a beautiful chart in my TeamQuest IT Service Analyzer and Reporter, where all the wild things go to get charted for management. I want more than just the storage utilization metrics we currently provide, but those were the most important first step and are what this post covers. A follow-on blog should cover the CPU, network, failed disk, failed power supply, and other interesting metrics that can be gathered, monitored, charted, and alerted on.

How to duplicate my results in your environment

The first item is to get SNMP version 3 working on your filers under a limited user account. SNMP version 3 is necessary in today's multi-terabyte world because the fields defined in the MIB for SNMP versions 1 and 2 cannot hold the insane number of "bytes" reported. Yes, it has to report in bytes. So be sure to download the Word doc available at the NetApp community site and follow through step one. Yes, just the first step is all that is really needed, but don't forget to set a password for the new user account who is allowed (and only allowed) to use SNMP.

Create a location for your scripts and data files. I like to put my scripts in /opt/teamquest/scripts, with a data directory underneath that. The Teamquest User Table Agent will run as the user ‘daemon’ and group root, so be sure to set appropriate permissions on the directory path and script for read and execution, and the data directory for write permission.
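The directory setup can be sketched as a small shell function (a sketch assuming the paths above; the chown to daemon:root will only take effect when run as root):

```shell
# Create the script and data directories with permissions for the
# TeamQuest User Table Agent (which runs as user daemon, group root).
setup_tq_dirs() {
  base=${1:-/opt/teamquest/scripts}
  mkdir -p "$base/data"
  # ownership change needs root; warn instead of failing silently
  chown -R daemon:root "$base" 2>/dev/null || \
      echo "note: run as root to set daemon:root ownership" 1>&2
  chmod 755 "$base"        # read + execute down the path for the agent
  chmod 775 "$base/data"   # the data directory must be writable
}
```

Call it as `setup_tq_dirs /opt/teamquest/scripts` (or pass a different base directory).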

Make sure your system has SNMP binaries — the Solaris ones are adequate and will probably be in /usr/sfw/bin if you installed the packages. The OpenCSW binaries are great, too. You will notice I am actually using the OpenCSW binaries at times, but I have no good reason to — except that I typically like to have some base OpenCSW packages installed so that I have gtail, allowing me to tail multiple files at the same time.

Download the following files

  1. NetApp MIB for SNMP from NetApp
  2. My Script for NetApp Volume Statistics
  3. TeamQuest table definition for NetApp Volume

Drop the latest NetApp SNMP MIB into the data directory and copy my scripts to the script directory. Use "less" to look through the NetApp MIB at some of the values available in there. There are a lot. I focused on the following values that I will use between this blog (volume statistics) and a future blog on system statistics: dfTable, productVersion, productModel, productFirmwareVersion, cpuBusyTimePerCent, envFailedFanCount, envFailedPowerSupplyCount, nvramBatteryStatus, diskTotalCount, diskFailedCount, diskSpareCount, misc64NetRcvdBytes, and misc64NetSentBytes. If you see a "64" version of a statistic, use that one to make sure you are getting the real data figure out of the system.

Test your user and SNMP client with some command line operations before you start editing the script to run in your environment. A command would look like this:

/opt/csw/bin/snmpget -v3 -n "" -u yourSNMPuser -l authNoPriv -A yourSNMPuserPassword -a Md5 -O qv -c yourcommunity -m /opt/teamquest/scripts/data/netapp.mib your-filername NETWORK-APPLIANCE-MIB::misc64NetRcvdBytes.0

We will work on that statistic next time, but today we are looking at the dfTable statistic for all the stats you want on your storage. So be sure to also test this different SNMP command and marvel at the amount of data that comes across your terminal:

/usr/sfw/bin/snmptable -v3 -n "" -u yourSNMPuser -l authNoPriv -A yourSNMPuserPassword -a Md5 -c yourcommunity -m /opt/teamquest/scripts/data/netapp.mib yourfilername NETWORK-APPLIANCE-MIB::dfTable

If all is successful with your command line tests then you are ready to edit the script and get it configured for your environment. You may be changing the path to the SNMP command and the MIB file, but you will definitely be changing the username, password, and community string. There are several other options to tweak too: do you want to import all volumes or just aggregates? Do you want to ignore snapshots? Test the script several times and make sure it is returning the data the way you want it. You will notice that you have to pass the filer names (comma separated, no spaces) in on the command line. This makes it easy to add and remove filers without adding or removing User Table Agents from your TeamQuest manager; simply edit the command line options passed to the script. Don't forget to test with and without the -t=interval option for the TeamQuest format, where the interval matches the frequency at which the agent will run. And don't worry about the extra options for snapshots or aggregates-only; these can be tweaked at any time to limit the data being absorbed by TeamQuest, and when you report or alert you can easily filter out what you don't want.

When you are ready, import the third file, the table definition, into your TeamQuest database. You may want to use a test database for a while and then eventually add it to a production database. The command to import the table is "$manager/bin/tqtblprb -i -f NetApp_VolumeStats.tbl", but I heartily recommend you have the command line manual handy and consult it regularly for adding databases and tables, and for deleting said items when things go wrong. IT happens.

Adding User Table Agent configuration


When the table is entered into the database you are ready to add your very own User Table Agent. Connect to the TQ manager of the desired system using your browser. Set the database if you are not using the production database, and then click Collection Agents. On the far right you will see the link "Add Agent"; click that and then "User Table Instance". Enter the information that makes sense to you, such as a name for this agent, the path to the executable, and the interval you want collection to happen at. The class and subclass must match exactly what is in the table file that was imported. It will be "USER" and "NetAppVolumes" unless you changed it. The Program arguments field is where you pass in the comma separated list of filer names (no spaces!), a single space, and -t=<interval>. Make sure that interval equals the collection interval you entered below it. After you save and apply the new settings you simply have to wait until the clock hits the next collection time (every five minute increment of the hour if you are using the 300 second interval like I am).

Be sure to launch TQView and look at the table directly for accurate statistics, play with the filter options, etc. Tip: you can create a test database in ITSAR that only harvests from this test database so that you can test end to end.


Using TQview to examine actual data gathered

You will notice that I dropped the actual FlexibleVolume volume type data from my gathering. It may be useful at some point in the future and can be re-added with a simple edit to the script, but for this first stage all I care about is the overall health of the filer, so my ITSAR chart for management is a simple global view of the filer cluster-pair. For this, I use the statistic "FilerCapacity", which the script calculates by summing all of the VolumeType "aggregate" entries on each filer node. You can see that I have a total of four nodes in my environment (names withheld to protect the innocent).

And that is it for the first stage! On to writing alerts and getting the system stats working.

There’s a fun little geeky comic online that you may have heard of, XKCD.

A while back the author had a comic that resonated with me about password security. I'm not buff enough in my math skills to keep up with the equation, but I could follow the principle. The idea is that four (or so) random words are more secure than an extremely complex password with numbers and special characters embedded and letters replaced. The challenge with that somewhat standard practice of l33tspeak is that it has to be written down. But… since we are people who love stories, four random words in real English will be more memorable, because we can make up a story to remember them.
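The idea is easy to sketch in shell (an illustration of the principle only, not the script below; the dictionary path varies by system, e.g. /usr/dict/words on older Solaris, /usr/share/dict/words on most others):

```shell
# Print n random words from a dictionary file as a passphrase candidate.
passphrase() {
  words=${1:-/usr/share/dict/words}
  n=${2:-4}
  awk -v n="$n" -v seed="$$" '
    BEGIN { srand(seed) }
    { w[NR] = $0 }                 # slurp every word in the dictionary
    END {
      for (i = 1; i <= n; i++) {
        printf "%s%s", w[int(rand() * NR) + 1], (i < n ? " " : "\n")
      }
    }' "$words"
}
```

Running `passphrase` with no arguments picks four words; pass a different file or count to taste.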

Here’s the famous comic.

Recently, while working on one of my other Perl scripts, I saw a post on an online forum about how to make a random sentence. That tickled my fancy and I came up with a quick and dirty little CGI script to generate a random sentence suitable for passwords. Unfortunately a lot of places still require special characters and numbers, but this little script will meet those requirements: the spaces meet special character requirements (most of the time) and a number between 1 and 99 is included.

The results? They are often amusing and poetic. Sometimes they are risque. It just depends on what is in your system’s local dictionary. Download Perl Password Poetry Producer


#!/usr/bin/perl
# Perl Password Poetry Producer - CGI script
print "Content-type: text/html\n\nPassword Poetry Generator\n";

# slurp the system dictionary into @w
open (INWORDS,"< /usr/dict/words") or die "cannot open dictionary: $!";
chomp(my @w = <INWORDS>);
close INWORDS;

my $poem;
my $randiddly=int(rand(99))+1;   # a number between 1 and 99

# odd number: word NUMBER word; even number: word word NUMBER
if ($randiddly%2==1){ $poem= join" ",(map{$w[rand@w]}1),$randiddly,(map{$w[rand@w]}1); }
else { $poem= join" ",(map{$w[rand@w]}1..2),$randiddly; }

# append one more word, capitalized if nothing is capitalized yet
if ("$poem" !~/[A-Z]/){ $poem= join" ",$poem,( map{ ucfirst ($w[rand@w])}1);}
else { $poem= join" ",$poem, (map{$w[rand@w]}1) ; }

print "$poem\n";


#some html code has been stripped for wordpress

Some sample passwords:

Expirations happen.

But when those SSL certificates expire before being replaced, well, that’s bad. That’s egg on your face. This little Perl script is to put the egg back in the burrito.

All you have to do is make a directory tree where you save your public certificates (you don't need the private key). Name them with a .cert extension if you use my code exactly, or tweak the extension in the code to match, and set up this little Perl script as a weekly cronjob to send you an email warning before they go bust!

You may need to add a few modules to your Perl repository. The modules I am using are Date::Calc, Crypt::OpenSSL::X509, Term::ANSIColor, and MIME::Lite. The Crypt Openssl module was a major pain in the butt to compile on Solaris. I should do a blog about that.

Oh, and the MIME::Lite module seems to require root or trusted-user privilege to run. At least on my Solaris boxes. It works great on Mac OS X, but I'm probably a trusted user on that system; I will be testing Linux before long. So, tweak the locations of the script in my examples below to meet your needs.

Setup the directory –
mkdir /home/billSpreston/mycerts

Copy the certs from your various servers, naming them with .cert extension —

ls mycerts
server1.cert server2.cert server3.cert

Touch a file for the Perl script and make it executable

touch ~/
chmod +x ~/

Now edit the file with your favorite editor (vim, or Smultron rocks!) and add this code in the PDF. (code with HTML tags is very hard to add to a blog).
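If you can't get at the PDF, here is a rough sketch of the same check using the openssl command line instead of the Perl modules (a sketch only, not the actual script; the HTML mail part is left out):

```shell
# Warn about any .cert file in a directory that expires within N days.
check_certs() {
  certdir=$1
  days=${2:-30}
  warn_secs=$((days * 24 * 3600))
  for cert in "$certdir"/*.cert; do
    [ -f "$cert" ] || continue
    # -checkend exits non-zero if the cert expires within warn_secs
    if ! openssl x509 -in "$cert" -noout -checkend "$warn_secs" >/dev/null 2>&1; then
      expires=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)
      echo "WARNING: $cert expires $expires"
    fi
  done
}
```

Something like `check_certs ~/mycerts 30 | mailx -s "cert expiry" you@example.com` from cron gives the same weekly warning, minus the pretty HTML.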

Be sure to run it a few times to make sure it works the way you want it. Debug or verbose mode is useful in this phase, as is playing with expiration time. You could also create certificate using openssl that expires next week to test, or find an old expired cert as well. And when you are satisfied create a cronjob to run it weekly on your schedule and get pretty HTML reports in your mailbox. Don’t forget to turn off debug or verbose mode unless you just like noise.

There’s a secret to being a good sysadmin: You have to be just a little lazy. Just enough that you can see a better way to doing boring, repetitive, tedious tasks and write a script to do it, letting you get back to more important tasks. This usually involves making a tool do something for you. And for a Unix admin, that means writing a quick little script.

A good Unix sysadmin isn’t content with just one medium for his scripts. He should be using shell and Perl so that he has both round and square pegs for all the different shaped holes that need to be plugged by a good script.

I was recently trying to import data about our backups into my TeamQuest reporting tool so that I could graph the usage and reliably plot trends. The backup administrator found a great command for pulling stats out of the NetBackup database: bpimagelist, found in the install directory's bin/admincmd subdirectory. It takes a variety of options so be sure to read the man page. I found two basic commands I needed to give it to get the data I needed: one for gathering live data, and one for accessing historical data that I wanted to import for a really clear picture of things.

Going back to the statement about sysadmins having a touch of laziness– I ask myself, why manually pull data when you can automate data collection?

After experimenting with TeamQuest and weekly and daily stats I finally determined that I really need to gather data hourly in order for some of the automated graph methods to be able to do their job. If the truth were known (and it's about to be) I'd really prefer to grab the data just once a week so that I can look at an entire backup spectrum of full backups and all incrementals. But it is a minor oversight by TeamQuest that the new-ish ITSAR (IT Service Analyzer and Reporter) tool cannot take a macroscopic view, like a single data point per week graphed over a six month period. A minor oversight, I forgive them, and I'm sure it will be corrected sooner rather than later.

So here is my “live” data command to get an hour summary data from NetBackup that is instantly imported into TeamQuest. This runs at the top of every hour as a summary of the previous hour.
/usr/openv/netbackup/bin/admincmd/bpimagelist -U -hoursago 1
01/12/2012 18:35 02/02/2012 41904 3763745 N Differential Int_unix
01/12/2012 18:35 02/02/2012 42070 4150810 N Differential Int_unix

Of course it can't be imported into my TeamQuest database straight like that! The command prints out a line for every job that ran, while TeamQuest really needs it summed up cleanly. So I wrote a Perl script that runs the NetBackup command and sums it up, formatting it nicely for TeamQuest as total kilobytes and the number of files backed up. The header fields TeamQuest requires are a time field (in quotes), an interval in seconds, and the server name. I've specified in my table definitions that I'm also providing another field for week of the year so that I can combine data for an entire week, and then the total number of files and kilobytes backed up.
An interesting note about the week-of-the-year field…. I have a bit in my Perl code that determines which week of the year I want it to be counted as. Most date modules will default to the week beginning on Sunday per the Gregorian standard, but for my backup standards the week really begins Friday at 6pm when the full backups kick off. Every backup after that should be an incremental or part of that backup set extended from Friday night.

A sample run from my script
# ./ -t -hourly
"1/25/2012 18:00:00" 3600 backupservername "3/2012"
185118 17720160

Sweet! If you see a message “no entity was found” don’t worry about it. It’s just a message from the NetBackup database command (printed to STDERR) that there wasn’t a job that particular hour. Zeroes will be imported for the data that hour.

So now my backup server runs an hourly job that imports this data into the Teamquest test database. We are looking good, going forward. But that’s only half the battle! I still need to get historical data into TQ so that I can make proper analysis.

I expand my Perl script so that I can pass in historical start and stop times at the command line.
# ./ -t -hourly -a "01/25/2012 17:00:00"
"1/25/2012 18:00:00" 3600 blade193 "3/2012"
394697 192310787

This is going great! I can run this a bunch of times for each hour of historical data that I need and append the output to a single text file. When it is done I import a single file into the TQ database and then make some pretty graphs.
So… let’s see. I’d like to go back about six months, so that’s about 180 days give or take, by 24 hours, ohhh, that’s running my command 4,320 times. Yeah… about that. I can hear Al say “I don’t think so, Tim”.

But I really don't want to extend my Perl script any more, because it is running hourly already, and going so smoothly. If I keep hacking at it with my lowly coding skills I may break it or corrupt the data I am collecting now. This is pretty much going to be a one-off, straightforward linear loop to run the command 4,320 times. Six-off at best, if I am willing to make a run per month with a few minor changes in between runs. This sounds like a shell script. Sure, I could do it in Perl, but for super simple loops that aren't parsing data I prefer a shell script. It's a square peg and this is a square hole.

Here’s my double loop shell script that runs my Perl script once per hour for each day of a month–

#!/bin/ksh
let dd=1
let lastday=31
let mm=07
let yy=2011

while [ $dd -le $lastday ]
do
  let hh=0
  echo " Running stats for day $dd" 1>&2
  while [ $hh -lt 24 ]
  do
    echo " Running stats for hour $hh" 1>&2
    ./ -t -hourly -a "$mm/$dd/$yy $hh:00:00"
    hh=`expr $hh + 1`
  done
  dd=`expr $dd + 1`
  echo "Incremented day to $dd" 1>&2
done
Pretty simple, really. Oh, and I am sending the status lines from the shell script to STDERR so that STDOUT can be directed safely and cleanly into a file ready to import into TeamQuest, while the sysadmin can still easily observe how the script is progressing.

# ./makegoodhourly >import.august
Running stats for day 1
Running stats for hour 0
no entity was found
Running stats for hour 1
Running stats for hour 2
Running stats for hour 3
no entity was found
Running stats for hour 4
no entity was found
Running stats for hour 5
no entity was found
Running stats for hour 6
no entity was found
Running stats for hour 7

Make a few tweaks to the shell script to change the month number and the total number of days per month, and run it again. Easy. I ran it once per month for September through January and imported my data, and I was done.

Here's the Perl script. It will default to daily stats if neither hourly nor weekly is specified. Why? Well, that was just a middle step before I realized I needed to go hourly, and I didn't want to completely remove weekly or daily statistics for future possibilities.

I’m sure there are some better ways to accomplish the things I do in my scripts– I’d like to hear them in the comments below. I’m always eager to improve my skills.


# 1.13.12 - ver 0 - K.Creason -
# To get weekly stats out of the NetBackup database

# First we define some things that are tunable
# The statcmd is the netbackup command that generates the output summary of
# all backup jobs based on fields passed to it.
# We are going to use 168 hours ago for seven days to get a full weeks summary

my $statcmd="/usr/openv/netbackup/bin/admincmd/bpimagelist -U ";

# No more tunables, so these are some defaults that we will define for later

my ($DEBUG,$VERBOSE,$filesummary,$datasummary,$files,$data,@data,$tqout,
    $hourly,$weekly,$datespec,$date,$begindate,$weekno,$yy,$mm,$dd,$hh);

use Date::Calc qw(:all);

# process the command line arguments
if ("$ARGV[0]" eq "-h") {die "\n\nUsage: $0 [-d for debug] [-v for verbose stats] [-t for Teamquest format] [-hourly or -w for weekly summary] [-a MM/DD/YYYY for alternate start date, if hourly should include HH:MM:ss within quotes]\n\n";}
if ("$ARGV[0]" eq "-d")
{ shift @ARGV; $DEBUG++; print STDERR "Debug on.\n"; }
if ("$ARGV[0]" eq "-v")
{ shift @ARGV; $VERBOSE++; print "Verbose on.\n";}
if ("$ARGV[0]" eq "-t") { $tqout=1; shift @ARGV; if ($DEBUG>0){print STDERR "TeamQuest report on.\n";}}
if ("$ARGV[0]" eq "-hourly") { $hourly=1; shift @ARGV; if ($DEBUG>0){print STDERR "Hourly report on.\n";}}
if ("$ARGV[0]" eq "-w") {$datespec++; $weekly=1; shift @ARGV; if ($DEBUG>0){print STDERR "Weekly report on.\n";}}
if ("$ARGV[0]" eq "-a")
{
  shift @ARGV;
  $datespec++;  # an alternate date means we query with -d/-e instead of -hoursago
  $date=$ARGV[0]; if ($DEBUG>0){print STDERR "Alternate date is \"$date\".\n";}
  shift @ARGV;
}

if ("$date" eq "")
{
  ( $yy, $mm, $dd ) = Today(); $date="$mm/$dd/$yy";
  if ($DEBUG>0){print STDERR "The end date is TODAY, $date.\n";}
}

if ($weekly < 1)
{
  $begindate=$date; if ($DEBUG>0){print STDERR "Begin date is end date, $begindate.\n";}
  # need to add hourly check and if turned on calculate an end date of plus one hour
  if (($hourly > 0)&&("$date" =~/\:/))
  {
    if ($DEBUG>0){ print STDERR "Calculating an end date of plus one hour from $begindate.\n";}
    my ($cal,$time,$min,$sec);
    ($cal,$time)= split (/ /,$date);
    ($yy,$mm,$dd) = Decode_Date_US($cal);
    ($hh,$min,$sec) = split(/:/,$time);
    if ($DEBUG>0){ print STDERR "Splitting end date to $yy, $mm, $dd, $hh, $min, $sec.\n";}

    # Before we add an hour, check to make sure the start hour is two digits
    if ( (length $hh) < 2)
    { $hh="0$hh"; $begindate="$mm/$dd/$yy $hh:00:00"; }
    ($yy,$mm,$dd,$hh,$min,$sec) = Add_Delta_DHMS($yy,$mm,$dd,$hh,$min,$sec,0,+1,0,0);
    if ( (length $hh) < 2){ $hh="0$hh";}
    $date="$mm/$dd/$yy $hh:00:00";
    if ($DEBUG>0){ print STDERR "Calculated the end date of plus one hour to $date.\n";}
  }
}
else
{
  # weekly, so have to calculate a begin date
  ($mm, $dd, $yy) = split (/\//,$date);
  if ($DEBUG>0){ print STDERR "Date ($date) is split year $yy, day $dd, month $mm.\n"; }
  ( $yy, $mm, $dd ) = Add_Delta_Days($yy,$mm,$dd , -7 ); $begindate="$mm/$dd/$yy";
  if ($DEBUG>0){ print STDERR "Begin Date is calculated to $begindate.\n"; }
}

# Now we need to calculate which week of the year the backup stats belong to
# paying careful attention to use the weeknumber for Friday. So if the day of week
# is monday-thurs we take the weeknumber of the previous Friday
# which is tricky if it happens to split a new year... Oy vey.
($mm, $dd, $yy) = split (/\//,$begindate);
$yy =~ s/ .*//;   # drop any trailing time from the year field
my $dow = Day_of_Week($yy,$mm,$dd); if ($DEBUG>0){print STDERR "Day of Week is $dow.\n";}
if ($dow > 4)
{ ($weekno,$yy)=Week_of_Year($yy,$mm,$dd);if($DEBUG>0){print STDERR "Week of year calculated for a Fri/Sat/Sun to be $weekno/$yy.\n";}}
else
{
  # This is the more complicated route. First calculate what last Friday was and then the weekno of that day.
  # Think we can just subtract seven for last week
  my ($lyy,$lmm,$ldd);
  ($lyy,$lmm,$ldd)= Add_Delta_Days($yy,$mm,$dd,-7); if ($DEBUG>0){print STDERR "Date of a week ago is $lmm/$ldd/$lyy.\n"; }
  ($weekno,$yy)=Week_of_Year($lyy,$lmm,$ldd);if($DEBUG>0){print STDERR "Week of year calculated for M-Th to be $weekno/$yy.\n";}
}

# sample data
# 01/12/2012 18:35 02/02/2012 41904 3763745 N Differential Int_unix
# 01/12/2012 18:35 02/02/2012 42070 4150810 N Differential Int_unix

if ($datespec > 0)
{
  $statcmd="$statcmd -d $begindate -e $date";
  (@data) = map {(split)[0,3,4]} grep /^[0-9]/, `$statcmd`;
  if ($DEBUG>0){print STDERR "Date specified command executed \"$statcmd\".\n";}
}
elsif ($hourly > 0)
{
  $statcmd="$statcmd -hoursago 1";
  (@data) = map {(split)[0,3,4]} grep /^[0-9]/, `$statcmd`;
  if ($DEBUG>0){print STDERR "Hourly command executed \"$statcmd\".\n";}
}
else
{
  $statcmd="$statcmd -hoursago 24";
  (@data) = map {(split)[0,3,4]} grep /^[0-9]/, `$statcmd`;
  if($DEBUG>0){print STDERR "Daily/24 hour command executed \"$statcmd\".\n";}
}

my $a=0;
foreach (@data)
{
  if ($a==0){$begindate=$_;$a++; if ($DEBUG>0){print STDERR "\tDate: $begindate. ";}}
  elsif ($a==1){$files=$files+$_;$a++; if ($DEBUG>0){print STDERR " files now $files.";}}
  elsif ($a==2){$data=$data+$_;$a=0; if ($DEBUG>0){print STDERR " data now $data.\n";}}
}

if ($tqout < 1){ print "Files backed up: $files\nData backed up $data\n";}
else {
  # Check for ENV Localhost
  if ("$ENV{LOCALHOST}" eq ""){ chomp($ENV{LOCALHOST}=`hostname`);}

  # if we are doing a weekly report for TQ it's a different time, at least for early testing
  # and format, with the interval
  if ($weekly > 0)
  {$date="\"$date 12:00:00\" $ENV{LOCALHOST} \"$weekno/$yy\"";}
  elsif($hourly > 0)
  {
    if ("$date" =~ /\:/ )
    {
      # then we have a time already, use it
      $date="\"$date\" 3600 $ENV{LOCALHOST} \"$weekno/$yy\"";
    }
    else { ($hh,$mm,$dd)=Now(); $date="\"$date $hh:00:00\" 3600 $ENV{LOCALHOST} \"$weekno/$yy\""; }
  }
  else {$date="\"$date 12:00:00\" 86400 $ENV{LOCALHOST} \"$weekno/$yy\"";}
  if ($DEBUG>0){print STDERR "DEBUG: $date\n$files $data\n\n"; }
  print "$date\n$files $data\n\n";
}

I run it via two cronjobs on the backup server. One gives us a weekly summary via email, and the other is the hourly TeamQuest data import.

# test teamquest weekly stats gathering on Friday mornings
30 10 * * 5 /usr/openv/netbackup/bin/admincmd/ -w |mailx -s "NetBackup weekly summary" staff
0 * * * * /opt/teamquest/manager/bin/tqtblprb -d testuser -n NetBackupHourly >/dev/null 2>&1

And what does my data look like?
Backup data six months

Wow… there’s been a bunch more to backup lately.

Boot and Nuke for SPARC

Posted: November 29, 2011 in solaris, tech

I'm preparing to excess old hardware at my day job– it's a very satisfying turn of events. It means a job well done: we've replaced old hardware with newer, faster, shinier stuff and we can say buh-bye to the old slow crap!

Some people will just pull the hard drives and run them through the degausser, which turns them into useless lumps of metal and poisonous stuff that we don't want going to the landfill. I prefer to run a program on the drives to completely remove their identity to DOD standards. After all, this is the government, and the old hardware will go to auction where you can buy a pallet of hardware for $20. I'd like the buyer to receive the hardware with working drives, not poisonous metal, and save the degausser for hard drives that die (but still have data on the platters).

There’s a free product called Darik’s Boot and Nuke that works fantastic on Wintel type machines. I can send a Windows or junior admin over with a CD, DVD, or USB drive to boot and nuke specific machines. But nothing for Sun (aka Oracle) SPARC last time I looked. I admit, I last googled for a SPARC boot and nuke about six years ago– I haven’t looked this time because I have a script for that (I should trademark that phrase!).

I save this script to my jumpstart server and configure an install profile to run it as a pre-install script. This script uses the builtin Solaris “format” command’s ability to run a series of commands and the “purge” function to completely erase each hard disk to DOD specs.

The steps are —

add the MAC address of the SPARC system to your jumpstart server’s /etc/ethers file with the host name “wipeme”.

add an unused IP address to the jumpstart server’s /etc/hosts file with the host name “wipeme”

Create a directory tree on the jumpstart server that is /jumpstart/install/wipe. In the wipe directory you need to have a very generic “profile” file and a “sysidcfg” file as required by jumpstart or it won’t build the rule checksum.

sysidcfg:

network_interface=primary {netmask= protocol_ipv6=no default_route=}

profile:

install_type initial_install
system_type standalone
cluster SUNWCuser

Create the script in the wipe directory, or a subdirectory such as pre-install.

The script is very simple and that is just perfect! It finds all device types for hard disks and by simple elimination eliminates ROM drives. The only thing I am not sure about is the next generation of SPARC systems with SAS drives and their wonky C-numbering. But, hey, I've got five years to cross that bridge.

# Build a command file for format(1M): enter the analyze menu and purge
echo "analyze
purge
quit" > /tmp/fcmd

CMD="format -f /tmp/fcmd"

# strip the slice suffix from each rdsk entry to get one name per disk
for i in `ls /dev/rdsk |cut -f1 -d"s" |sort |uniq`
do
  echo "Executing command: \"$CMD -d $i\" \n"
  $CMD -d $i
done

Edit your /jumpstart/rules file and add an entry like so: "hostname wipeme install/wipe/pre-script/  install/wipe/profile - "

Now run your "check" routine to build the rules.ok checksum and you are off to the races to "boot net - install" just like normal. The only difference is it will purge the drives and not install anything, leaving a tabula rasa for the new owner.

Unix systems have some great command line tools: Find, grep, cut, split, tr, sed, awk — all amazing tools.
But sometimes I still can’t quickly see what I need to see in a fast scrolling window. The text is the same font and color, and the background never changes. When scrolling through using ‘less’ and using the search option the word can be bolded or in other ways marked– but ‘less’ is not always useful when you are looking at the output of ps, tailing a logfile, top, prstat, snoop, tshark…
Well I finally found something. Cobbled together something is a little closer to the truth.

Sometime back I found a suggestion on using one of my favorites to change color: perl. This hack is pretty easy but involved remembering the very complicated one-liner to type it when needed, or pulling it out of history, or little notes files in my home directory. It uses the color mechanisms within the terminal, so your terminal and shell naturally have to support color.

It would go something like this (where 31 is red and 43 is yellow [I think]):

tail system.log | perl -pe 's/Throttling/\e[1;31;43m$&\e[0m/ig'

Perl one-liner typed in command pipeline

It definitely was a start. But it wasn’t easy. I used it for years this way– look at the file or output and realize my eyes are lost. So I would cat my note file in my home directory, run the command again piping it through the perl one-liner to highlight the word I wanted (like “Throttling” in my example).

One day I had had enough. "Kevin, there has to be an easier way," I said to myself. I messed about with trying 'alias' commands to set it up. That was fine for a static word to highlight, but it wasn't possible to stick in a variable to highlight different words when I needed to.

So… I went back to shell basics and rediscovered "functions". By building several shell functions using my original Perl one-liner with color changes and different variable names, I can now highlight multiple different words all at the same time:

Functions highlighting multiple words at the same time


Pretty awesome!

So here's how it is set up. First, I find some colors that look good in my terminal. I use the "Novel" color scheme in my Mac terminal and found these three useful color combinations, but you could easily have more and change them to match your heart's desires:

  • Red on Yellow : 31;43m
  • Lt Blue on Dark: 32;44m
  • Lt Blue on purple: 32;45m
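To preview a combination in your own terminal before committing it to your .profile, a quick loop like this prints a sample line in each (assuming an ANSI-capable terminal):

```shell
# Print a sample line in each candidate foreground;background combo.
for combo in "31;43" "32;44" "32;45"; do
  printf '\033[1;%sm sample text (%s) \033[0m\n' "$combo" "$combo"
done
```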

Next up is combining them with individual variables and putting them in my .profile so that they are active when I login to a system. I edit my .profile and at the bottom I add these three lines:

  • redy () { command perl -pe "s/$redy/\e[1;31;43m$&\e[0m/ig" ; }
  • blue () { command perl -pe "s/$blue/\e[1;32;44m$&\e[0m/ig" ; }
  • purp () { command perl -pe "s/$purp/\e[1;32;45m$&\e[0m/ig" ; }

The first word is the name of the function. It can be called just like a command or an alias, but it runs the command inside the braces, which is the original Perl one-liner modified with a variable. The variable is not a Perl scalar but a shell variable that will be replaced with its contents before Perl executes.

Once you’ve edited your .profile you need to log out and back in, or source the .profile into your current environment. You can type ‘set’ to confirm the functions are in your environment before attempting to use them.

When the functions are built and in your environment you are ready to add them to your pipeline. The pipeline must start by defining the variables and, within the same session, executing your command before handing it off to the pipeline. The way to do this is with the “&&” joining construct. It tells the shell “set this and, if successful, do this,” and the pipeline follows, so the whole enchilada is fed to the next command. It’s not complicated, just messy to describe. So let me show you:

  • blue=5004 && redy=5007 && purp=5008 && snoop -Vd ce0 port 5007 or port 5008 or port 5004 | redy | purp | blue
    • set blue to be the word 5004
    • set redy to be the word 5007
    • set purp to be the word 5008
    • execute snoop command
    • pipe feed output to function redy
    • pipe feed output to function purp
    • pipe feed output to function blue
  • Sip coffee and watch your magic!

    Example Execution of commands

It is still a fair bit of typing and still requires some biological memory, but it is easier.

I hope this helps someone!

I always have Terminal running on my Macbook Pro at work. I use it to SSH into numerous servers during the day to check things. You know, to administer the boxes. I have 59 Unix systems to take care of (some virtual, but they still need administering). It doesn’t take long to get tired of typing your password and a command 59 times just to check whether all 59 boxes mount filerB or use DNS2.

If you are in a situation remotely like mine you’ve already set up your system to use Kerberos or SSH keys so you don’t have to type the password 59 times. But there’s still room for improvement. You’d have to print off a piece of paper listing all 59 servers and check them off one by one, or open a spreadsheet and toggle back and forth between Terminal and Excel as you type the command and then record each answer.

Boring and tedious. So I wrote a little Perl script to do most of it for me. It’s not elegant, it’s just useful. Please note, I’m not really a programmer, just a hack. So if you see it and like it, use it! If you want to improve on it, go ahead– you can. I wouldn’t mind a peek at your changes.

Sure, I know some of you are going to recommend an App For That. Like Spiceworks. Or Cohesion. Go ahead– fire away. Let’s see what you recommend. We’ve tried a few and haven’t found one that fits our environment or budget yet. But I am interested in hearing about some options.

If you want to try my script you will need to make a few edits and provide a few things beforehand.

First – you will need to be able to get into your servers from Terminal or an xterm without password prompts, using SSH. The script runs as the user id that executes it and uses that account to make the SSH connections. Don’t use root. Just don’t. This is a reporting tool, not an administration tool.

Second – you will need to provide a way for the script to get a list of servers. We maintain a text file on a NetApp qtree, mountable over CIFS or NFS, for our list of servers. It has more information in it than we need– such as whether a host is a global zone or a non-global zone. The critical part is that each line begins with the resolvable server name; the rest is discarded. In the script I have defined two locations where the file can be found, in scalar variables called location1 and location2. Why two? Because I want to be able to run this from my Mac or from a Unix system, and the mount paths are different. If the script fails to locate this file it falls back on a canned array of hostnames (@hostlist), so you will want to edit that as well. It’s a fallback and guaranteed to be out of date in my environment. You can find these three things to edit in the subroutine MakeServerArray.
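For illustration only, a server.list in this scheme might look like the following (host names from my canned fallback, the other fields invented); the script keeps the first space-delimited word on each line and discards the rest:

```
kona      global      T5220   production
java      non-global  kona    web
arabica   non-global  kona    database
```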

Third – the script uses two CPAN modules that you will need to provide to the Perl that runs the script. MIME::Lite is the first; it handles the SMTP connection to send the report via email. The second is Term::ANSIColor, which dresses up the report in the Terminal window if that is the reporting method you choose.

Here’s my little script (this is going to look terrible in my current blog format):

# Run a command over SSH on lots of hosts and print a report on the results
# Simple html reports by email, or color to the screen
# Version 1, 9.19.2011, K.Creason

use MIME::Lite;
use Term::ANSIColor qw(:constants);
my $DEBUG=0;
my $cmd; my $title; my $email; my $report; my @hostlist;

# we require some arguments from the command line
if ("$ARGV[0]" eq "" ) { &help;}

# check for debug option
if ("$ARGV[0]" eq "-d")
{ $DEBUG=2; shift; print BOLD,YELLOW,ON_BLACK "Running in super-duper verbose debug mode:",RESET,"\n";}
# or if we have the not-quite debug verbose option
if ("$ARGV[0]" eq "-v")
{ $DEBUG=1; shift; }

# check to see if we are going to email the report
if ("$ARGV[0]" eq "-e" )
{ shift; $email=shift; }
if ($DEBUG > 1){ print BOLD,YELLOW,ON_BLACK "\tThe report will be emailed to \"$email\"",RESET,"\n"; }

# check to see if the title of the report will be specified
if ("$ARGV[0]" eq "-t" )
{ shift; $title=shift; }
if ($DEBUG > 1){ print BOLD,YELLOW,ON_BLACK "\tThe report title is \"$title\"",RESET,"\n"; }

# and finally check for a command to run
if ("$ARGV[0]" eq "-c" )
{ shift; $cmd=shift; }
if ($DEBUG > 1){ print BOLD,YELLOW,ON_BLACK "\tWe have a command to run which is \"$cmd\"",RESET,"\n"; }

if ("$cmd" eq ""){ &help;}

if ("$title" eq ""){$title="Perl Report for \"$cmd\"."; }

# if an email address delivery is true we are doing HTML rather than color report to screen
if ("$email" ne "")
{ $report="<h2> $title </h2>\n<h5>PERL host report using command \"$cmd\"</h5>\n<table border=1>\n";}
else {print BLUE,BOLD "Report $title",RESET,BLUE "\nUsing command \"$cmd\"",RESET,"\n"; }

# find the servers to run the report on
&MakeServerArray;

# now we are ready to go
my %host;
foreach $host (@hostlist)
{
  # Here we run it over SSH and send STDERR to the bit bucket
  chop($host{value}=`ssh $host "$cmd" 2>/dev/null`);
  if ($DEBUG>1){ print YELLOW,ON_BLACK "\tDEBUG: command execute on $host returned \"$host{value}\".",RESET,"\n";}

  # start the table row, then pick a cell format: a result that still contains
  # a linefeed was multi-line, so keep it in <pre>; an empty result was null
  $report=$report."<tr><td>$host</td>";
  if ($host{value} =~ /\n/) { $report=$report."<td><pre>$host{value}</pre></td></tr>\n";}
  elsif ("$host{value}" eq ""){ $report=$report."<td>Unable to fulfill the request.</td></tr>\n";}
  else { $report=$report."<td>$host{value}</td></tr>\n"; }
  if ("$email" eq ""){ print "$host => ",RED,"$host{value}",RESET,"\n"; }
}

# build and send the email report, or finish the screen report
if ("$email" ne "")
{
  my $msg = MIME::Lite->new(
    To      => "$email",
    Subject => "System Report: $title",
    Type    => 'multipart/related',
  );
  $msg->attach(
    Type => 'text/html',
    Data => qq{$report</table>},
  );
  $msg->send;
}
else { print "\n\n"; }

# subs
sub MakeServerArray
{
  my $location1="/Volumes/homeshare/systems/server.list";
  my $location2="/share/systems/server.list";
  my $discard; my $b;

  if ( -f "$location1")
  {
    if ($DEBUG>0){print "Found a locally mounted server.list on a Mac system.\n\n";}
    open SLIST, "<$location1";
    foreach (<SLIST>){ if ( /^[a-zA-Z]/ ){ ($b,$discard)=(split( / /,$_,2)); push @hostlist,$b; } }
    close SLIST;
  }
  elsif ( -f "$location2")
  {
    if ($DEBUG>0){print "Found a server.list on an NFS mounted share.\n\n";}
    open SLIST, "<$location2";
    foreach (<SLIST>){ if ( /^[a-zA-Z]/ ){ ($b,$discard)=(split( / /,$_,2)); push @hostlist,$b; } }
    close SLIST;
  }
  else
  {
    print "\n\n\t!!!!!!\n\tUsing Canned Server List!! Unable to read the server.list in a known share location.\n\t!!!!!!\n\n";
    @hostlist=qw(kona java arabica typica robusta bourbon duncan macbeth macduff agent86 agent99 hymie denali k2 everest bubba bubbajoe bubbabob michelangelo donatelo raphael leonardo splinter shredder);
  }

  foreach (@hostlist)
  { if ($DEBUG>0){print "DEBUG: processing server.list entry \"$_\"\n"; } }
}


sub help
{
  die "\nUsage: $0 [-d or -v for optional debug OR verbose] [-e address] -t \"title\" -c \"command to run\"
Use simple args-- if you have complex args and special or quoted
characters make a script to call with this utility.\n\n";
}


perl/  -t "Unix Resolv.conf settings" -c "cat /etc/resolv.conf"

It didn’t take long for the fragile 15-year-old Solaris app to break again. Six years ago we moved it to Solaris 10 quite successfully thanks to library interposers, which let us change the hostid a process sees to match the hostid it expected from nine-year-old hardware. It was much, much happier on new hardware.
A few months ago I wanted to move it into a virtual Solaris environment (a zone) rather than waste an entire physical box on the application(s). Solaris had recently added the ability to set your own hostid for a virtual environment, so it looked like it should be a slam dunk. The issue we ran into, which DTrace solved, was that the inodes for the virtual root directory were wonky. The DTrace memory edit, built on the copyin function, fixed that.
All’s well that ends well, but this hasn’t ended yet.
It’s running on a test system and it is time to move it to a production system so we can have our test platform back.
So I loaded up some new T3-1B systems with Solaris 10 release 10. And I shouldn’t have. Or maybe it’s good I did: better to find the issue earlier than later.
The problem is that with this release the System Info HW Provider is now ‘Oracle Corporation’ instead of Sun_Microsystems. This unfortunately breaks flexlm, which wants to confirm that the license it is trying to use is running on the hardware it was issued for. 😦

I’ve been hacking and slashing my way to a unified solution and it’s mostly there. It starts up, but I’m getting an error associated with creating/managing the lock file. I’ll pick that up tomorrow, but without further ado here is my SI_HW_Provider fix.

To replace “Oracle Corporation” with Sun_Microsystems:

#!/usr/sbin/dtrace -Cs

#include <sys/systeminfo.h>

#pragma D option destructive

/* remember the user buffer when the app asks sysinfo(2) for the HW provider */
syscall::systeminfo:entry
/execname == "binaryname" && arg0 == SI_HW_PROVIDER/
{
        self->mach = arg1;
}

/* overwrite the answer in the app's buffer before it reads it */
syscall::systeminfo:return
/self->mach && execname == "binaryname"/
{
        copyoutstr("Sun_Microsystems ", self->mach, 18);
        self->mach = 0;
}

/* the same root-directory inode fix as before, for the zone */
syscall::getdents64:entry
/ execname == "binaryname" /
{ self->buf = arg1; }

syscall::getdents64:return
/self->buf && arg1 > 0/
{
        this->dep = (struct dirent *)copyin(self->buf, sizeof (struct dirent));
        this->dep->d_ino = 4;
        copyout(this->dep, self->buf, sizeof (struct dirent));
        self->buf = 0;
}

Dtrace Saves the Day

Posted: July 5, 2011 in dtrace, solaris, tech


Ten years ago when I got here we had an X Windows app that was old then. I think it qualifies for the Antiques Roadshow now. The application and data date from the beginning of the Station project, when it was just Freedom, not “International”.

The tool, which probably should remain nameless to protect the guilty, was sold, and new versions came out that ran on Windows only. We had an administrator spend about six months migrating the data and datafiles to the Windows version, only to have the users refuse to run it. So they stayed on the original X Windows app, designed and initially installed for Solaris 2.5.1, just as it was on the old e4000.

Five years ago we really wanted to get rid of the e-series SPARC systems, especially this 4000, and get to Solaris 10 at the same time. We contacted the vendor of the application to find it had been sold again, and the latest owner was not able to issue new license files for the Unix application. A real Unix admin doesn’t take ‘no’ for an answer, so an administrator found a tool that is compiled with the target hostid and loaded via the LD_PRELOAD option. This effectively changes the hostid for any process (like the Flexible License Manager of flexlm) to the old hostid. Using this we were able to migrate the database, binaries, and data files to a Blade 2000 running Solaris 10. The users were very happy with the upgrade, and we were too, as we now had a single Unix OS to support for the environment.


It’s now 2011 and the Blade 2000 needs to be retired. The hostid won’t be a problem thanks to our last fix, and today’s systems are so powerful there has to be a better way: specifically, to use Solaris Containers/Zones and not require a dedicated box to support this tired old product. A way to easily keep this running another ten years for the life of Station… that’s my goal.

A new feature for Zones allows administrators to assign a hostid to a zone– so the original problem goes away. This should make our life easier. But try as we might, the Flex License Manager process (lmgrd) starts but will not start the subprocess containing the licenses.

Googling for the error reveals some others have encountered this or similar errors:

 7/05 14:17:03 (lmgrd) FLEXlm (v2.40) started on xxxxxx (Sun) (7/5/111)
 7/05 14:17:03 (lmgrd) License file: "/tools/flexlm/xxxxx.license.dat"
 7/05 14:17:03 (lmgrd) Started YYYY
 7/05 14:17:03 (YYYY) Cannot open daemon lock file
 7/05 14:17:03 (lmgrd) MULTIPLE "YYYY" servers running.
 7/05 14:17:03 (lmgrd) Please kill, and run lmreread
 7/05 14:17:03 (lmgrd)
 7/05 14:17:03 (lmgrd) This error probably results from either:
 7/05 14:17:03 (lmgrd)   1. Another copy of lmgrd running
 7/05 14:17:03 (lmgrd)   2. A prior lmgrd was killed with "kill -9"
 7/05 14:17:03 (lmgrd)       (which would leave the vendor daemon running)
 7/05 14:17:03 (lmgrd) To correct this, do a "ps -ax | grep YYYY"
 7/05 14:17:03 (lmgrd)   (or equivalent "ps" command)
 7/05 14:17:03 (lmgrd) and kill the "YYYY" process
 7/05 14:17:03 (lmgrd)

While a solution was not found immediately, an answer began to emerge: in particular a link here where Peter says “This can be fixed using a fairly simple dtrace script to fool cdslmd into seeing /. and /.. with the same inode number (getdents64()).”

In a non-global zone under Solaris 10 the root file system does not start its inode numbering at the low numbers a real root filesystem has, because it is not a root filesystem. It is just a directory in the global zone, on an existing filesystem. In our case, the fourth zone is getting pretty high up the inode tree and we are at six digits! It turns out this is the problem with the license manager daemon– no idea why, but it is the problem. To get around it I tried creating a new file system for the root directory of the non-global zone, and was successful in that the zone ran and the license manager started. But this had unintended side effects and broke Solaris’ Live Upgrade process and other utilities for system maintenance. In other words, worse than no fix: the medicine would kill you faster.
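To see what your zone’s root is actually reporting (and to get the inode value you will need for the dtrace script later), you can ask ls(1) from inside the non-global zone. This is a generic sketch, not specific to any zone:

```shell
# print the inode numbers that readdir reports for "." and ".." at the root;
# in a non-global zone these come from the global zone's filesystem
ls -ai / | awk '$2 == "." || $2 == ".." { print $1, $2 }'
```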

Back to Peter’s assertion that dtrace could fix this, much like the old LD_PRELOAD hostid fix. But how? I researched the functions and probes within dtrace, but the heights it can reach are way, way over my head. All my research into dtrace’s power kept leading me to Brendan Gregg’s blog and book, so I reached out to Brendan through Twitter, managed to pique his curiosity, and in the space of a day had a blog post leading me to a solution.

My solution in four “easy” steps:

  1. Grant Dtrace privileges to the non-global zone
  2. Grant dtrace_user privilege to the user running the flexible license server within the non-global zone
  3. Write a dtrace script that detects the YYYY license package running and will fix the inode numbers as they pass through memory and cleanly terminates
  4. Add new dtrace script to execute and detach before the normal license manager process starts– and sleep ten seconds just in case the system is busy.

So to bring it out in more detail, read on.

Grant Dtrace privileges to the non-global zone

Assuming your zone is already running, use this command to add the privilege

zonecfg -z zonename 'set limitpriv="default,dtrace_proc,dtrace_user"'

Then halt and boot the zone.

When you log in as root to the zone you should be able to use ‘dtrace -l’ to see a list of probes available. This would have been zilch before.

Grant dtrace_user privilege to the user running the flexible license server within the non-global zone

Log in as root to the zone and use this command to grant the dtrace_user privilege to the user account who will run the license manager. This should not be root…

echo "flexlm::::defaultpriv=basic,dtrace_user" >>/etc/user_attr

Now login or su to the user account and test the privilege is assigned by using the same command we used to test as root, above, ‘dtrace -l’.

Write a dtrace script that detects the YYYY license package running and will fix the inode numbers as they pass through memory and cleanly terminates

This is the magic section, and it is all owed to Brendan Gregg. Your license process is the second process started by the license manager daemon. In my case the license manager is lmgrd, and my license is in all caps like “YYYY”, but that is not always the case. Whatever the name of your second license process is, that is what goes into the dtrace script below. The script runs before the license manager starts up; it detects the actual license process and fixes the inode stat in memory before the process sees the mismatched values for “.” and “..”. Just be sure to replace “zonename” with your zone name, “procname” with your license process name, and the inode value with your inode value.

#!/usr/sbin/dtrace -Cs
/* the zonename and process name in the first predicate must be changed */
/* the d_ino value must be changed to the inode of the root dir's .. */
#pragma D option destructive

syscall::getdents64:entry
/zonename == "zonename" && execname == "procname"/
{
        self->buf = arg1;
}

syscall::getdents64:return
/self->buf && arg1 > 0/
{
        /* modify the first entry of the getdents() buffer, as ls(1) would see it */
        this->dep = (struct dirent *)copyin(self->buf, sizeof (struct dirent));
        this->dep->d_ino = 415469;
        copyout(this->dep, self->buf, sizeof (struct dirent));
        self->buf = 0;
}

Save that and make sure it has execute permission for your flexlm user. If you want to test it, change procname to “ls”. Run it as your flexlm user in one terminal window, and in another terminal window, as your flexlm user, run ‘ls -ai /’. Compare the output with the dtrace running and without. The dtrace script should terminate quietly once the ls has completed. When satisfied with how it works, change procname to the license process and proceed to modify your SMF start script.

Add new dtrace script to execute and detach before the normal license manager process starts– and sleep ten seconds just in case the system is busy.

My Service Management Facility start method is /tools/flexlm/SMF/flexlm. I added two lines above the actual lmgrd start. You may start it differently, but here is my example.

case "$1" in
start)
   echo "Starting up FLEXlm ..."
   su - flexlm -c "/tools/flexlm/ " &
   sleep 15
   su - flexlm -c "/tools/flexlm/lmgrd -c /tools/flexlm/xxxxx.license.dat -l /tools/flexlm/log " &
   ;;
esac

And that is it. It’s enough to run– it needs some tweaking to be a bit better, but it is enough to know we can go forward by moving this system to a zone.

root@xxxxxx:~ $ps -fu flexlm
     UID   PID  PPID   C    STIME TTY         TIME CMD
root@xxxxxx:~ $svcadm enable flexlm
root@xxxxxx:~ $ps -fu flexlm
     UID   PID  PPID   C    STIME TTY         TIME CMD
  flexlm  2873  2872   1 14:49:10 ?           0:02 /usr/sbin/dtrace -Cs /tools/flexlm/
root@xxxxxx:~ $ps -fu flexlm
     UID   PID  PPID   C    STIME TTY         TIME CMD
  flexlm  2906  2897   0 14:49:26 ?           0:00 YYYY -T xxxxxx 4 -c /tools/flexlm/xxxxxx.license.dat
  flexlm  2873  8499   1 14:49:10 ?           0:03 /usr/sbin/dtrace -Cs /tools/flexlm/
  flexlm  2897  8499   0 14:49:25 ?           0:00 /tools/flexlm/lmgrd -c /tools/flexlm/xxxxxx.license.dat -l /tools/flexlm/log
root@xxxxxxx:~ $ps -fu flexlm
     UID   PID  PPID   C    STIME TTY         TIME CMD
  flexlm  2906  2897   0 14:49:26 ?           0:00 YYYY -T xxxxxx 4 -c /tools/flexlm/xxxxxx.license.dat
  flexlm  2897  8499   0 14:49:25 ?           0:00 /tools/flexlm/lmgrd -c /tools/flexlm/xxxxxx.license.dat -l /tools/flexlm/log