A few weeks ago someone (aka zombie) thought they were manually uninstalling a SCOM agent from a managed server. What they didn’t know was that they were NOT on the server they thought they were on; they were on the Root Management Server. What came next? They uninstalled the SCOM software from the Root Management Server. I knew this because I was on the server troubleshooting bad agents when everything closed out on me. I immediately checked the event logs to see why, and the Operations Manager event log was GONE. I know, RIGHT! (aka Zombie Apocalypse)
But luck was on our side, and we were able to reinstall SCOM and apply the backup keys. Everything was going well, or so we thought.
About 24 hours later the Root Management Server started logging the following event in the Operations Manager event log. I mean nothing but this event; it was pouring in many times every second.
We could not find much out there in the world about this event ID.
Event Type: Information
Event Source: OpsMgr Connector
Event Category: None
Event ID: 21042
Computer: RMS.FQDN
Description:
Operations Manager has discarded 1 item in management group <Management Group Name>, which came from $$ROOT$$. These items have been discarded because no valid route exists at this time. This can happen when devices are added to the topology but the completed topology has not been distributed yet. The discarded items will be regenerated.
Funny thing was, each of the management servers that had agents reporting to it was being flooded with this event:
Event Type: Information
Event Source: OpsMgr Connector
Event Category: None
Event ID: 20000
Computer: RMS.FQDN
Description:
A device which is not part of this management group has attempted to access this Health Service. Requesting Device Name: ServerName.FQDN
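With this many event 20000 entries, it helps to know which agents are actually being rejected. Here is a rough Python sketch that pulls the requesting device name out of each description and counts how often it shows up. It assumes you have already exported the event descriptions as plain strings (how you export them is up to you), and the function name is just something I made up for illustration.

```python
import re
from collections import Counter

# Matches the device name in the event 20000 description quoted above.
DEVICE_RE = re.compile(r"Requesting Device Name\s*:\s*(\S+)")


def count_requesting_devices(descriptions):
    """Return a Counter of device names found in event 20000 descriptions."""
    counts = Counter()
    for text in descriptions:
        match = DEVICE_RE.search(text)
        if match:
            # Drop a trailing period in case the name ends a sentence.
            counts[match.group(1).rstrip(".")] += 1
    return counts
```

Calling `count_requesting_devices(descriptions).most_common()` gives you the noisiest agents first, which is a reasonable order to work through them.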
The quick fix was to manually “re-enter” each of the Run As accounts. Once this was done, the 21042 events went away. However, not all of the event ID 20000 entries cleared up. It took a manual process of stopping the agent service, deleting the Health Service State folder, and restarting the service to get some of the agents communicating again.
In my lab I was able to re-create the issue and correct it the same way. I don’t fully understand all the details just yet, but I’m working on that if I can find some more time for it.
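If you have more than a handful of agents to fix, the stop, delete, restart routine can be scripted. This is a minimal Python sketch, not a polished tool. It assumes the agent's Windows service is named `HealthService` and that you pass in the path to the Health Service State folder yourself, since the location depends on your SCOM version and install directory. It has to run on the agent machine with administrator rights.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: the agent's Windows service is registered as "HealthService".
SERVICE_NAME = "HealthService"


def reset_health_service_state(state_dir, service=SERVICE_NAME, run=subprocess.run):
    """Stop the agent service, delete its state folder, start the service again.

    Returns True if a folder was deleted, False if it did not exist.
    """
    state_dir = Path(state_dir)
    # "net stop" waits for the service to stop before returning.
    run(["net", "stop", service], check=True)
    try:
        existed = state_dir.is_dir()
        if existed:
            shutil.rmtree(state_dir)
    finally:
        # Always try to bring the agent back, even if the delete failed.
        run(["net", "start", service], check=True)
    return existed
```

You would call it as `reset_health_service_state(r"<agent install folder>\Health Service State")`, filling in the install folder for your environment. The `run` parameter is only there so the function can be dry-run with a stand-in instead of really stopping the service.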