Here’s a little trouble-shooting guide for discovering Linux systems from OpsMgr R2 when getting the following error from the wizard:
<stdout>Generating certificate with hostname="COMPUTERNAME"
[/home/serviceb/TfsCoreWrkSpcRedhat/source/code/tools/scx_ssl_config/scxsslcert.cpp:198]
Failed to allocate resource of type random data: Failed to get random data - not enough entropy
</stdout><stderr>error: %post(scx-1.0.4-248.i386) scriptlet failed, exit status 1
</stderr><returnCode>1</returnCode>
<DataItem type="Microsoft.SSH.SSHCommandData" time="2009-08-05T11:15:01.5800358-04:00" sourceHealthServiceId="0EB1D6DA-202C-7FC5-3D46-BDBB9208547D"><SSHCommandData><stdout>Generating certificate with hostname="COMPUTERNAME"
[/home/serviceb/TfsCoreWrkSpcRedhat/source/code/tools/scx_ssl_config/scxsslcert.cpp:198]
Failed to allocate resource of type random data: Failed to get random data - not enough entropy
</stdout><stderr>error: %post(scx-1.0.4-248.i386) scriptlet failed, exit status 1
</stderr><returnCode>1</returnCode></SSHCommandData></DataItem>
But first, a little background on the actual “problem”. To generate the certificate, the system needs enough entropy to produce the random data used in certificate creation. Without the certificate, the OpsMgr agent won’t be able to open up communications with the MS. So, what creates this entropy we need? Bluntly put, a selection of hardware components that are likely to produce non-predictable data, like a keyboard, mouse and a monitor or video card. Of course, there’s a lot more to it, but we really don’t need to know this. What we need to know is that there has to be a “bit bucket” of more than 256 bytes of entropy for the certificate creation process to succeed. We also need to know that more enterprise-ish servers, like rack or blade servers, tend to be void of things like directly attached keyboards, mice and monitors that the Linux kernel needs to be able to generate entropy.
And herein lies the problem. A new server that is not yet in full service (likely, since we are trying to deploy the monitoring on it) has little random data flowing through the hardware, and with no keyboard, mouse or monitor connected to it, there is quite a risk that the system entropy is going to be very low. Of the Linux systems that I have been deploying OpsMgr agents to, about half have failed because of “Not enough entropy”.
So, here are the steps I usually take to ensure that discovery works. I use PuTTY to connect to the soon-to-be-monitored servers. This guide also assumes that you have SU rights on the system since all of these steps (except #1) need it.
- Check your current entropy
cat /proc/sys/kernel/random/entropy_avail
Is it less than, or close to, 256? It probably is. If you don’t feel like connecting a mouse and wiggling it around to see if the entropy increases (not really feasible in a data center), you can generate your own random data.
- Generate your own random data.
Be advised that this forced entropy will not be as random as the system-created one and thus not as secure. How much more insecure it is, I don’t know, and quite frankly I prefer to have my systems monitored yet slightly less secure than not monitored at all. Anyway, you can force your own random data by running:
dd if=/dev/urandom of=~/.rnd bs=1 count=1024
This creates a .rnd file with 1024B of random data that the certificate creation process will use instead of the system entropy if the file exists.
- Uninstall and re-discover
The first failed attempt of discovery will most likely leave a non-working agent installation that we have to remove. Otherwise we will just be stuck with an “Access Denied” error. Run:
rpm -e scx
Now, try to discover the system again.
- Failed again?
Try generating the certificate manually by running:
/opt/microsoft/scx/bin/tools/scxsslconfig -f -v
/opt/microsoft/scx/bin/tools/scxadmin -restart
Retry discovery again.
- Still fails?
Uninstall the agent once more as instructed in step 3.
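Steps 1 and 2 can be rolled into a small helper. This is only a sketch: the function names and the THRESHOLD, ENTROPY_FILE and RND_FILE variables are my own illustrative choices, not anything the scx installer defines, and the 256 limit is simply the number used above.

```shell
#!/bin/sh
# Illustrative defaults (assumptions, override as needed).
THRESHOLD="${THRESHOLD:-256}"
ENTROPY_FILE="${ENTROPY_FILE:-/proc/sys/kernel/random/entropy_avail}"
RND_FILE="${RND_FILE:-$HOME/.rnd}"

# Exit 0 if the number stored in file $1 is less than or equal to $2,
# exit 1 if it is higher, exit 2 if the file cannot be read.
entropy_is_low() {
    avail=$(cat "$1" 2>/dev/null) || return 2
    [ "$avail" -le "$2" ]
}

# Write 1024 bytes from /dev/urandom to file $1 (same dd call as in step 2).
seed_rnd_file() {
    dd if=/dev/urandom of="$1" bs=1 count=1024 2>/dev/null
}

# Step 1 followed by step 2: only seed the file when entropy looks low.
ensure_entropy() {
    status=0
    entropy_is_low "$ENTROPY_FILE" "$THRESHOLD" || status=$?
    case "$status" in
        0) echo "Entropy at or below $THRESHOLD, seeding $RND_FILE"
           seed_rnd_file "$RND_FILE" ;;
        1) echo "Entropy is above $THRESHOLD, nothing to do" ;;
        *) echo "Could not read $ENTROPY_FILE" >&2
           return 1 ;;
    esac
}
```

Source the file and run ensure_entropy as the user that will perform the agent installation, so that ~/.rnd ends up in the right home directory, then continue with step 3.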
These steps have solved my problems 100% on both SUSE and RedHat and hopefully they will help you too.
Interestingly enough, these problems seem to be connected to some changes in the 2.6 kernel, and basically everything that uses SSL-ish certificates will be affected, even though the symptoms may be a bit more subtle, like time-outs and disconnects. For “headless” servers like those I usually administer, where the amount of random data tends to be much lower, there’s even specialised hardware whose sole purpose is to generate random data, like the Entropy Key. I have also been told that new servers are likely to be equipped with entropy chipsets to make sure that there’s chaos enough to avoid these new-found oddities.
Sources:
http://social.technet.microsoft.com/Forums/en-US/crossplatformsles/thread/f94ec905-23ac-4444-b9f8-644fec3ae357
http://www.askrenzo.com/oracle/SCOM/SCOM_discovering_nodes.html
In SQL Server 2005 and 2008 the local Administrators group is not sysadmin by default. This makes it even more important that whoever sets up the database server also remembers to add a SQL Server admins group to the sysadmin role. If this step is forgotten, the user installing the database server is the only one that will ever be sysadmin.
In some extreme cases I’ve seen situations where no one except some dude on vacation is sysadmin and there’s a bunch of applications that need to be installed/upgraded. In these cases I have also been assigned Local Administrator rights on the server, but since the local Administrators group isn’t sysadmin either I still cannot log in to the SQL Server.
What to do!?
Thanks to Raul Garcia’s blog post it’s not that big a deal. The instructions are written for SQL Server 2005, but work equally well on SQL Server 2008, and the only requirement is that you are a local server administrator.
Here’s what to do:
- Open the SQL Server Configuration Manager.
- In SQL Server Services, open the properties for the SQL Server instance you need access to.
- In the Advanced tab, locate Startup Parameters.
- Add “;-m” to the end of that line.
- Press OK and restart the SQL Server into “Maintenance Mode” or “Single User Mode” if you like. (Check that a restart is OK first.)
- Open a command prompt (right-click, “Run as Administrator” in Windows 2008) and go to C:\Program Files\Microsoft SQL Server\100\Tools\Binn\
(C:\Program Files\Microsoft SQL Server\90\Tools\Binn\ for SQL2005)
- Execute sqlcmd -E -Sadmin:<instancename> (use . for local default instance)
- In the CLI, execute:
EXEC sp_addsrvrolemember 'DOMAIN\yourusername', 'sysadmin';
GO
- Return to the SQL Server Configuration Manager and restore the Startup Parameters to their previous settings.
- Restart the SQL Server instance to allow users to access it again.
Now, you should be able to log in to the SQL Server with sysadmin rights using your current user. This would also be a good point in time to actually establish a SQL Server Admins group (local or domain) and add it to the sysadmin role, so that others don’t have to do the above steps when you, yourself, happen to be on vacation.
As Raul Garcia points out in his original post, this is really a disaster recovery procedure and there’s definitely nothing sneaky about it since it leaves quite a lot of trails in the event logs.
All in all, a great article by Raul and all credit should go his way.
Let’s say you have followed this guide: http://support.microsoft.com/kb/938245/
Still not working? The one thing I forgot, or rather did not find in any of the guides, was to change the website application pool to “Classic .NET AppPool”. It is actually noted in KB938245 but only after the installation, during the configuration. For some reason I have not been able to install Reporting Services 2005 on Windows 2008 without changing this prior to the installation.
Maybe I am doing it wrong but this seems to be working all right for me.
If you are looking into replacing an Operations Manager 2007 Gateway Server (or just switching agents to another primary) for any reason, there’s a little more to consider than just right-clicking the clients and selecting “Change Primary Management Server” in the Operations Console.
You could end up with agents not being able to connect to the Management Group at all due to a small problem with the order in which Operations Manager does things.
Here’s basically what happens:
- You tell Operations Manager to change Primary Management Server for AGENTX from GW1 to GW2.
- The SDK Service (I guess) tells GW1 that “You’re no longer the Primary Management Server for AGENTX”
- GW1 acknowledges this and stops talking to AGENTX. And I mean completely stops talking to AGENTX.
- OpsMgr then tells GW2 to start accepting communication from AGENTX.
- OpsMgr tries to tell AGENTX that it should talk to GW2 since GW1 won’t listen.
Spotted the problem?
This modus operandi probably works when agents are on the same network and in the same domain, where fail-over is sort of automatic. The problem we are facing now is that the server tells the Gateway to stop accepting communications to and from the agent before the agent is notified that there is a new Gateway server to talk to. The agent will continue to talk to GW1 but will be completely ignored, and you will probably start seeing events with EventID 20000 in the Operations Manager event log on GW1.
How do I get around this little feature then?
No matter if you found this article after running into the mentioned troubles or if you are googling ahead of time to be prepared, the fix is the same and consists of a few PowerShell commands. These scripts are out there already, but in different contexts, hence this post.
First step: Install the new Gateway
Documentation on this from Microsoft is good enough, but here’s the short version.
- Verify name resolution to and from Gateway server and Management Server
- Create certificate for the Gateway server
- Approve the Gateway server
- Install Gateway server
- Import certificates on Windows system
- Run MOMCertImport.exe on Gateway server to add the certificate into Gateway server configuration
- Wait
The wait is for the gateway server to get all needed configuration from the RMS, download all necessary management packs, run all the discovery scripts and so on. When the Operations Manager event log has calmed down a bit, move to step two.
Second step: Configure Agent Failover
Connect to an Operations Manager Command Shell. Any will do, as long as it’s connected to the correct Management Group.
Then run the following script:
# New gateway becomes primary; old gateway stays on as failover.
$primaryGW = Get-ManagementServer | where {$_.Name -eq 'GW2.domain.local'}
$failoverGw = Get-ManagementServer | where {$_.Name -eq 'GW1.domain.local'}
# All agents currently reporting to the old gateway.
$agents = Get-Agent | where {$_.PrimaryManagementServerName -eq 'GW1.domain.local'}
Set-ManagementServer -AgentManagedComputer: $agents -PrimaryManagementServer: $primaryGW -FailoverServer: $failoverGw
Remember to change “GW1.domain.local” to your OLD Gateway server name and “GW2.domain.local” to your NEW Gateway server name.
If you don’t know PowerShell, this script basically configures all agents using the old Gateway to use the new one as primary, but keeps the old one as a fail-over server. The Gateways will still get to know the changes before the agents, but since the old one is still listening to the agents (though as the fail-over host) it will be able to tell them to go to the new one, GW2.
Jonathan Almquist posted (a while ago) an article on how to clear discovered objects after you have disabled the discovery rules in OpsMgr, and I think it deserves a mention.
Read more about it at Jonathan Almquist on Operations Manager : Remove-DisabledMonitoringObject.