Techspeak for the socially diminished

Microsoft has released an update to the MSMQ (version 3) management pack.

System Center Pack for: Message Queuing 3.0
Version: 6.0.6615.0
Released on: 12/14/2009

Message Queuing (also known as MSMQ) is a server application that enables applications to communicate across heterogeneous networks and systems that may be temporarily offline or otherwise inaccessible. Instead of an application communicating with a service on another computer, it sends its information to Message Queuing, which sends the information to a Message Queuing service on the target computer where it is made available to the other application. Message Queuing provides guaranteed delivery, efficient routing, security, and priority based messaging.

Now, what’s really interesting is what you will find in the MP Guide under “Supported Configurations”.

The Message Queuing Management Pack for Operations Manager 2007 is designed to monitor Message Queuing version 3 only.

The Message Queuing Management Pack supports the following platforms:

· Windows Server 2003

· Windows XP

The Message Queuing Management Pack also supports monitoring clustered MSMQ components.

Text coloration is obviously added by me to highlight the interesting part. ;)

Finally, MSMQ monitoring seems to be cluster-aware, which might mean that the home-made pack I built to cover those (numerous) queues can be sent to the scrap heap. This is also confirmed under “Changes in This Update”.

The December 2009 update to this management pack includes the following change:

· Fixed a problem when working with an instance of MSMQ in a Cluster. The MP is now able to discover and monitor public and private queues in a cluster.

· Fixed a problem when discovering the local and cluster instance of MSMQ. The MP is now able to discover and monitor both instances.

The confusing double RunAs profiles seem to have been cleaned up too (you only have to worry about one now). Some sloppy mistakes in the previous scripts have been fixed as well (no Option Explicit? C’mon Microsoft! You write the best practices, try to stick to them.), and display and documentation have generally been improved.

Gonna import this to our staging environment today and let it roll during the holidays.

Cheers! Oh, and happy holidays!

Download and documentation:
http://www.microsoft.com/downloads/details.aspx?FamilyId=1D2B4398-8BC2-4A43-850C-852EBB0D983B&displaylang=en&displaylang=en

Here’s a little trouble-shooting guide for discovering Linux systems from OpsMgr R2 when getting the following error from the wizard:

<stdout>Generating certificate with hostname="COMPUTERNAME"

[/home/serviceb/TfsCoreWrkSpcRedhat/source/code/tools/scx_ssl_config/scxsslcert.cpp:198]

Failed to allocate resource of type random data: Failed to get random data - not enough entropy

</stdout><stderr>error: %post(scx-1.0.4-248.i386) scriptlet failed, exit status 1

</stderr><returnCode>1</returnCode>

<DataItem type="Microsoft.SSH.SSHCommandData" time="2009-08-05T11:15:01.5800358-04:00" sourceHealthServiceId="0EB1D6DA-202C-7FC5-3D46-BDBB9208547D"><SSHCommandData><stdout>Generating certificate with hostname="COMPUTERNAME"

[/home/serviceb/TfsCoreWrkSpcRedhat/source/code/tools/scx_ssl_config/scxsslcert.cpp:198]

Failed to allocate resource of type random data: Failed to get random data - not enough entropy

</stdout><stderr>error: %post(scx-1.0.4-248.i386) scriptlet failed, exit status 1

</stderr><returnCode>1</returnCode></SSHCommandData></DataItem>

But first, a little background on the actual “problem”. To generate the certificate, the system needs enough entropy to produce random data for the certificate creation. Without the certificate, the OpsMgr agent won’t be able to open up communications with the MS.

So, what creates this entropy we need? Bluntly put, a selection of hardware components that are likely to produce non-predictable data, like a keyboard, a mouse and a monitor or video card. Of course, there’s a lot more to it, but we really don’t need to know this. What we need to know is that there has to be a “bit bucket” of more than 256 bytes of entropy for the certificate creation process to succeed. We also need to know that more enterprise-ish servers, like rack or blade servers, tend to be void of things like the directly attached keyboards, mice and monitors that the Linux kernel needs to be able to generate entropy.

And herein lies the problem. A new server that is not yet in full service (likely, since we are trying to deploy the monitoring on it) has little random data flowing through its hardware, and with no keyboard, mouse or monitor connected there is quite a risk that the system entropy is going to be very low. Of the Linux systems that I have been deploying OpsMgr agents to, about half have failed because of “Not enough entropy”.

So, here are the steps I usually take to ensure that discovery works. I use PuTTY to connect to the soon-to-be-monitored servers. This guide also assumes that you have su (root) rights on the system, since all of these steps (except #1) need them.

  1. Check your current entropy
    cat /proc/sys/kernel/random/entropy_avail

    Is it less than, or close to, 256? It probably is. If you don’t feel like connecting a mouse and wiggling it around to see if the entropy increases (not really feasible in a data center), you can generate your own random data.

  2. Generate your own random data.
    Be advised that this forced entropy will not be as random as the system-created one and thus not as secure. How much more insecure it is, I don’t know, and quite frankly I prefer to have my systems monitored yet slightly less secure than not monitored at all. Anyway, you can force your own random data by running:

    dd if=/dev/urandom of=~/.rnd bs=1 count=1024

    This creates a .rnd file with 1024 bytes of random data. If the file exists, the certificate creation process will use it instead of the system entropy.

  3. Uninstall and re-discover
    The first failed attempt of discovery will most likely leave a non-working agent installation that we have to remove. Otherwise we will just be stuck with an “Access Denied” error. Run:

    rpm -e scx

    Now, try to discover the system again.

  4. Failed again?
    Try generating the certificate manually by running:

    /opt/microsoft/scx/bin/tools/scxsslconfig -f -v
    /opt/microsoft/scx/bin/tools/scxadmin -restart

    Retry discovery again.

  5. Still fails?
    Uninstall the agent once more as instructed in step 3.

These steps have solved my problems 100% on both SUSE and RedHat and hopefully they will help you too.
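Steps 1 and 2 can be rolled into a small helper if you have a lot of servers to go through. This is only a rough sketch: the function names are my own invention and not part of the OpsMgr agent, and the 256 threshold is simply the rule of thumb from above, passed in as an argument.

```shell
#!/bin/sh
# Sketch of steps 1 and 2. Function names and arguments are illustrative
# assumptions; they are not part of the OpsMgr agent or its installer.

# Succeeds (exit 0) when the entropy counter in file $1 is at or below $2.
entropy_is_low() {
    avail=$(cat "$1") || return 2
    [ "$avail" -le "$2" ]
}

# Writes $2 bytes from /dev/urandom to file $1, like the dd command in step 2.
seed_rnd_file() {
    dd if=/dev/urandom of="$1" bs=1 count="$2" 2>/dev/null
}

# Example:
#   entropy_is_low /proc/sys/kernel/random/entropy_avail 256 \
#       && seed_rnd_file "$HOME/.rnd" 1024
```

The same caveat as in step 2 applies: the seeded file is a convenience, not a security improvement.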

Interestingly enough, these problems seem to be connected to some changes in the 2.6 kernel, and basically everything that uses SSL-ish certificates will be affected, even though the symptoms may be a bit more subtle, like time-outs and disconnects. For “headless” servers like those I usually administer, where the random data tends to be much lower, there’s even specialised hardware whose sole purpose is to generate random data, like the Entropy Key. I have also been told that new servers are likely to be equipped with entropy chipsets to make sure that there’s chaos enough to avoid these new-found oddities.

Sources:
http://social.technet.microsoft.com/Forums/en-US/crossplatformsles/thread/f94ec905-23ac-4444-b9f8-644fec3ae357

http://www.askrenzo.com/oracle/SCOM/SCOM_discovering_nodes.html

Microsoft has released an updated MP for SCCM SP2 (v6.0.6000.2, released on 10/28/2009) for OpsMgr R2.

The update basically contains the support for x64 that was missing in the previous release.

The Configuration Manager 2007 SP2 Management Pack adds support for monitoring Configuration Manager 2007 SP2 in a 64-bit environment with Operations Manager 2007 R2 or Operations Manager 2007 SP1 with hotfix (KB971541) installed. This enables the Configuration Manager 2007 SP2 Management Pack to work with either the 32-bit or the 64-bit Operations Manager 2007 agent. Except for the 64-bit support, the other features and guidance for Configuration Manager 2007 Management Packs remain intact.

(coloration added by me)

Read more and download here:
http://www.microsoft.com/downloads/details.aspx?FamilyID=a8443173-46c2-4581-b3b8-ce67160f627b

This update hasn’t shown up in the MP Catalog yet, but the System Center Operations Manager 2007 R2 Cross Platform Update can be downloaded here.

Besides SUSE 11 support, here’s the short overview.

The System Center Operations Manager 2007 R2 Cross Platform Update adds fixes for a defunct process issue on Unix/Linux Servers, as well as, adds support for SUSE Linux Enterprise Server 11 (both 32-bit and 64-bit versions) and Solaris Zone support.
Feature Summary:
The System Center Operations Manager 2007 R2 Cross Platform Update supports the monitoring of Unix/Linux Servers including:

  • Monitoring of SUSE Linux Enterprise Server 11 servers (both 32-bit and 64-bit versions)
  • Support of Solaris Zones
  • Fix for defunct Process issue
  • The Cross Platform Agent may not discover soft partitions on Solaris systems. Therefore, the disk provider may be unloaded, and the Cross Platform Agent may stop collecting information from the system disks.
  • The Cross Platform Agent may not restart after the AIX server reboots.

The latest versions of all the Operations Manager 2007 R2 Unix/Linux agents are included in this update.

Perfect timing, I must say, since I really need this today. :D

Update:
This is no small MP update, which is probably the reason that we do not find it in the MP Catalog, but a ~250MB OpsMgr R2 Software Update. You need to run this on all Operations Manager Servers (RMS/MS, GW?) since it actually updates many of the Cross Platform agent binaries. It does add a new MP for SUSE 11 that you have to import from disk if you need it.

So, the installation goes somewhat like this:

  1. Install the Software Update (pick the right Architecture) on all OpsMgr R2 Servers
  2. Import the SUSE 11 MP if necessary
  3. Re-discover your Unix/Linux machines.

Files updated in this update for R2:

  • .\Microsoft.Enterprisemanagement.UI.Administration.dll (Version 6.1.7043.1)
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.aix.5.ppc.lpp.gz
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.aix.6.ppc.lpp.gz
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.hpux.11iv2.ia64.depot.Z
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.hpux.11iv2.parisc.depot.Z
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.hpux.11iv3.ia64.depot.Z
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.hpux.11iv3.parisc.depot.Z
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.rhel.4.x64.rpm
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.rhel.4.x86.rpm
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.rhel.5.x64.rpm
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.rhel.5.x86.rpm
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.sles.10.x64.rpm
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.sles.10.x86.rpm
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.sles.9.x86.rpm
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.solaris.10.sparc.pkg.Z
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.solaris.10.x86.pkg.Z
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.solaris.8.sparc.pkg.Z
  • .\AgentManagement\UnixAgents\scx-1.0.4-248.solaris.9.sparc.pkg.Z

Files added:

  • Microsoft.Linux.SLES.11.MP

All in all, the update contains the following fixes:

  • KB969342
  • KB973583
  • Q954049
  • Q956240

Microsoft released an updated MP (v6.1.7533.0, released on 10/8/2009) for monitoring the health of the Operations Manager components.

The most significant updates, in my opinion, would seem to be:

Fixed an issue that was previously preventing all rules related to agentless exception monitoring from generating alerts.

Added the rule “Collects Opsmgr SDK Service\Client Connections” to collect the number of connected clients for a given management group. This data is shown in the view “Console and SDK Connection Count” under the folder “Operations Manager\Management Server Performance”.

Updated a number of monitors and rules to ensure that data is reported to the correct management group for multihomed agents.

Fixed the configuration of the rule “IIS Discovery Probe Module Execution Failure” so that the parameter replacement will now work correctly for alert suppression and generating the details of the alert’s description.

The rest is mostly polishing, fine-tuning and complementary updates. Nothing really ground-breaking here, but still a welcome update.

Download at: http://www.microsoft.com/downloads/details.aspx?FamilyID=61365290-3c38-4004-b717-e90bb0f6c148

If you are looking into replacing an Operations Manager 2007 Gateway Server (or just switching agents to another primary) for any reason, there’s a little more to consider than just right-clicking the clients and selecting “Change Primary Management Server” in the Operations Console.
You could end up with agents not being able to connect to the Management Group at all, due to a small problem with the order in which Operations Manager does things.

Here’s basically what happens:

  • You tell Operations Manager to change Primary Management Server for AGENTX from GW1 to GW2.
  • The SDK Service (I guess) tells GW1 that “You’re no longer the Primary Management Server for AGENTX”
  • GW1 acknowledges this and stops talking to AGENTX. And I mean Completely stops talking to AGENTX.
  • OpsMgr then tells GW2 to start accepting communication from AGENTX.
  • OpsMgr tries to tell AGENTX that it should talk to GW2 since GW1 won’t listen.

Spotted the problem?
This modus operandi probably works when agents are on the same network and in the same domain, where fail-over is sort of automatic. The problem we are facing now is that the server tells the Gateway to stop accepting communications to and from the agent before the agent is notified that there is a new Gateway server to talk to. The agent will continue to talk to GW1 but will be completely ignored, and you will probably start seeing events with EventID 20000 in the Operations Manager event log on GW1.

How do I get around this little feature then?

No matter if you found this article after running into the mentioned troubles or if you are googling ahead of time to be prepared, the fix is the same and consists of a few PowerShell scripts. These scripts are out there already, but in different contexts, hence this post.

First step: Install the new Gateway

Documentation on this from Microsoft is good enough, but here’s the short version.

  1. Verify name resolution to and from Gateway server and Management Server
  2. Create certificate for the Gateway server
  3. Approve the Gateway server
  4. Install Gateway server
  5. Import certificates on Windows system
  6. Run MOMCertImport.exe on Gateway server to add the certificate into Gateway server configuration
  7. Wait

The wait is for the Gateway server to get all needed configuration from the RMS, download all necessary management packs, run all the discovery scripts and so on. When the Operations Manager event log has calmed down a bit, move to step two.

Second step: Configure Agent Failover

Connect to an Operations Manager Command Shell. Any will do, as long as it’s connected to the correct Management Group.
Then run the following script:

# New Gateway: becomes primary for the agents
$primaryGw = Get-ManagementServer | where {$_.Name -eq 'GW2.domain.local'}
# Old Gateway: stays on as fail-over so it keeps listening to the agents
$failoverGw = Get-ManagementServer | where {$_.Name -eq 'GW1.domain.local'}
# All agents that currently report to the old Gateway
$agents = Get-Agent | where {$_.PrimaryManagementServerName -eq 'GW1.domain.local'}
Set-ManagementServer -AgentManagedComputer: $agents -PrimaryManagementServer: $primaryGw -FailoverServer: $failoverGw

Remember to change “GW1.domain.local” to your OLD Gateway server name and “GW2.domain.local” to your NEW Gateway server name.
If you don’t know PowerShell, this script basically configures all agents using the old Gateway to use the new one as primary, but keeps the old one as a fail-over server. The Gateways will still get to know the changes before the agents, but since the old one is still listening to the agents (though as the fail-over host) it will be able to tell them to go to the new one, GW2.

Just wanted to raise a word of caution about the TCP Port Check in Operations Manager 2007.

Some customers have noticed that the system logs on some Unix machines are completely swamped with “connection error”, “TCP Connect failed”, “TCP Session Lost” and similar messages. After a bit of research the problematic servers were narrowed down to those monitored by Operations Manager, specifically those that are targeted by a TCP Port Check.

It would seem like the TCP connection never fully initializes on the target server. Kind of like knocking on your neighbour’s door and then hiding: when the door opens, no one is there.

Maybe there’s a setting somewhere to modify how “deep” a Port Check should go before closing, perhaps fully initializing and then sending a proper “Close” instead of just cutting the connection. In a few extreme cases we have noticed that the target server even goes so far as to start a session that never ends, since there’s no closure, until it finally has no sessions to spare for the real users. But on most servers it’s just an annoyance, since the “real” errors are very hard to find among all the connection-related log entries.

Anyway. Just a good thing to keep in mind when running TCP Port Checks from Operations Manager 2007. Keep an eye on the logs when implementing the port checks.
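For keeping an eye on the logs, something as simple as counting the noisy lines before and after enabling a port check goes a long way. A rough sketch, assuming the messages look like the ones quoted above; the function name is made up, and the log path you point it at (/var/log/messages on many systems) is an assumption that depends on your distribution and syslog setup.

```shell
#!/bin/sh
# Sketch: count log lines that look like the connection noise described above.
# The function name and the pattern list are illustrative assumptions.

count_connection_noise() {
    # $1: path to a log file; prints the number of matching lines.
    [ -r "$1" ] || return 2
    grep -c -i -E 'connection error|TCP Connect failed|TCP Session Lost' "$1" || true
}

# Example:
#   count_connection_noise /var/log/messages
```

If the count jumps right after a port check is enabled against that server, you have most likely found the culprit.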

The MSMQ Management Pack seems to have a few problems with its discovery script that can lead to the following error showing up in the logs:

The process started at 13:34:40 failed to create System.Discovery.Data. Errors found in output:

C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 49\9788\DiscoverQueues.vbs(107, 4) Microsoft VBScript runtime error: Subscript out of range: '[number: 0]'

Command executed: "C:\WINDOWS\system32\cscript.exe" /nologo "DiscoverQueues.vbs" {615D37C9-477D-62E2-0833-6ECBF0E89A87} {A176AC83-CC31-01C3-5DE9-E2DFF64E7CC7} "MASKED.server.fqdn" "MSMQ" "true" "true" "False" "false"
Working Directory: C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 49\9788\

One or more workflows were affected by this.

Workflow name: Microsoft.MSMQ.2003.DiscoverQueues

Instance name: MASKED.server.fqdn

Instance ID: {A176AC83-CC31-01C3-5DE9-E2DFF64E7CC7}

Management group: MASKED

This seems to be related to the discovery of public queues on servers that have none. One quick fix, or rather work-around, is to override the discovery on these servers and set DiscoverPublic to False.

Don’t know how I missed this when writing the last post, but Microsoft released the MP for Windows Server 2008 NLB yesterday (28/4 -09). This is the initial release for Win2k8 NLB so I guess we just have to try it out then.

Quick Details

File Name: Microsoft Server 2008 Network Load Balancing System Center Operations Manager 2007 MP.msi
Version: 6.0.6573.0
Date Published: 4/28/2009
Language: English
Download Size: 519 KB

Feature Summary

  • Monitor the NLB Node status.
  • Based on the status of individual cluster nodes, determine the overall state of the cluster.
  • Where an integration management pack exists, determine the health state of a cluster node by looking at the health state of the load balanced application, such as IIS.
  • Alert on errors and warnings that are reported by the NLB driver, such as an incorrectly configured NLB cluster.
  • Take the node out of the NLB cluster if the underlying load-balanced application becomes unhealthy, and add the node back to the cluster when the application becomes healthy again.

Requires OpsMgr 2007 SP1 or later, the Base Operating System MP for 2008, the QFEs for Windows Server 2008 and that you are not running the converted 2003 NLB MP. If you are running the old converted NLB MP, upgrade first. Microsoft also recommends in the MP Guide that you install the QFE for wmiprvse.exe problems on Windows Server 2008.

No support for Mixed-mode (2008 and 2003) clusters though.

I have seen this error popping up every now and then at multiple customer sites and haven’t really been able to solve it yet. It does not look like I am alone either.
The error message usually looks like this:

Error doing IIS Discovery

Error: 0x80070002
Details: The system cannot find the file specified.

One or more workflows were affected by this. 

Workflow name: Microsoft.Windows.InternetInformationServices.2003.DiscoverBase
Instance name: Microsoft.Windows.InternetInformationServices.2003.ServerRole
Instance ID: {A81E4808-4D05-9BFE-4043-DC668527F2D0}
Management group: MASKED

Or…

Error doing IIS Discovery

Error: 0x80070006
Details: The handle is invalid.

One or more workflows were affected by this. 

Workflow name: Microsoft.Windows.InternetInformationServices.2000.DiscoverWebSites26to50
Instance name: IIS Web Server
Instance ID: {D36DA76A-027F-8F3E-4160-115279A1E23A}
Management group: MASKED

I have been trying to figure out what file is missing and/or if the “invalid handle” is related. Possibly a file handle? Could be, but not necessarily, since these two errors occur on different servers with increasing repeat count (at least once a day). The IIS MP does call the IIS*.VBS scripts in %windir%\System32 but as far as I can tell, on the systems I have tried it on, the scripts return valid data. This by no means proves that there is no error, and evidently I am missing something. But what? Does anyone have a clue to this?

References and other victims:

And no, neither of these provides even a hint to a working solution.
