One corrupt discovery data record (DDR) can completely lock up a primary site server. In this article I'll help you troubleshoot typical primary site server DDR processing problems. I'm also going to document a not-so-typical problem that I had in my own SMS implementation.
Discovery Data Overview
Discovery data records are small files (about 2K), with a .ddr extension, containing basic client system information that is subject to change. A DDR contains the network address, username, GUID, machine name, and a few other elements. The file is generated on the client, based on the site's discovery configuration, and is located at ms\sms\core\data\smsdisc.ddr. Methods of client discovery include NT Logon Discovery, Heartbeat Discovery, and Network Discovery. DDRs are used by SMS to "keep tabs" on clients: they allow a few data elements to be refreshed on a regular basis so that remote tools and advertised programs can function in a "reliable" manner. To view discovery data, open any collection and double-click any client in that collection. What you'll see is the discovery data for that client as it has been entered into the SMS database.
The DDR is typically sent by the client to the logon point's \smslogon\ddr.box, forwarded on to the CAP's \ddr.box, and then forwarded on to the site server's \sms\inboxes\ddm.box. From there it is finally placed into the SMS database. As you have probably already experienced, many things can go wrong along the way. This article focuses on DDRs backing up in the site server's ddm.box. For more details about the process I've just covered, please check out the SMS Admin Guide, the SMS Resource Kit (in BORK 4.5), and/or the SMS Admin's Companion.
Don't confuse the discovery data process with hardware inventory. The two are completely different mechanisms used for different purposes.
Troubleshooting Typical DDR Processing Issues
Discovery Data Manager is the server-side thread of smsexec.exe responsible for processing DDRs. If discovery data is not being updated according to the configuration you have chosen within the site's hierarchy properties, the following information should help.
From a collection, open a client's properties and check the date on one of the Agent Time entries. It should match your discovery method frequency. Just a tip: you should have at least one discovery method configured to occur at least once a day.
You can browse to the primary site server's ddm.box inbox and view the backed-up files for yourself. Turn on the Details view in Windows Explorer and compare the dates. If there are no backed-up DDRs, look at flow charts of the discovery data process to find other possible sources of the problem.
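If you'd rather check for a backlog from a command prompt than from Explorer, a short script can list the .ddr files in ddm.box, oldest first. This is just a sketch in Python, not an SMS tool; the DDM_BOX path is an example and should be pointed at your own site server's SMS install drive.

```python
import os

# Sketch: list backed-up DDRs in ddm.box, oldest first.
DDM_BOX = r"D:\SMS\inboxes\ddm.box"  # example path; adjust to your site server

def oldest_first(entries):
    """Sort (filename, mtime) pairs so the oldest file comes first."""
    return sorted(entries, key=lambda e: e[1])

if os.path.isdir(DDM_BOX):
    ddrs = [(f.name, f.stat().st_mtime)
            for f in os.scandir(DDM_BOX)
            if f.name.lower().endswith(".ddr")]
    for name, mtime in oldest_first(ddrs):
        print(name, mtime)
```

A healthy ddm.box should show only files with very recent timestamps; anything days old at the top of the list points to a processing stall.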
Look for processing errors appearing in the Site Status tree under the Discovery Data Manager thread.
View the SMS site server logs. Look in ddm.log for lines that indicate Discovery Data Manager is attempting to process the same DDR over and over.
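Scanning ddm.log by eye for that repetition can be tedious. The sketch below, a hypothetical helper rather than anything SMS ships, counts how often each .ddr filename appears in the log lines; a file mentioned far more than the rest is the likely stuck record.

```python
import re
from collections import Counter

def repeated_ddrs(log_lines, threshold=10):
    """Return {ddr_name: count} for .ddr names mentioned at least
    `threshold` times in the given ddm.log lines."""
    hits = Counter()
    for line in log_lines:
        for name in re.findall(r"[\w$]+\.ddr", line, re.IGNORECASE):
            hits[name.lower()] += 1
    return {name: count for name, count in hits.items() if count >= threshold}
```

Feed it the lines of ddm.log (for example via `open(path).readlines()`) and anything it returns is worth a closer look.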
The solution is really pretty simple. Stop the Discovery Data Manager thread within the SMS Service Manager (or just stop the SMS component in NT Service Manager) and delete the file that is not being processed. For troubleshooting purposes, it might be a good idea to copy the file first for further research. Keep an eye on ddm.log and ddm.box. Within an hour or so it should look like you opened the dam and let all those backed up DDRs out.
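The copy-then-delete step can be scripted so you never lose the evidence. This is a minimal sketch, assuming you have already stopped the SMS_DISCOVERY_DATA_MANAGER thread and identified the stuck file; both paths are examples.

```python
import os
import shutil

def quarantine(ddr_path, holding_dir):
    """Copy a stuck DDR to a holding directory for later analysis,
    then remove the original from ddm.box so processing can resume."""
    os.makedirs(holding_dir, exist_ok=True)
    shutil.copy2(ddr_path, holding_dir)  # copy2 preserves the timestamp
    os.remove(ddr_path)
```

After running it, watch ddm.log to confirm Discovery Data Manager has moved past the bad record and is draining the backlog.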
Symptoms of My Atypical Problem
When a corrupt DDR entered ddm.box the following symptoms appeared:
When I browsed to the site server's ddm.box inbox, I found several days' worth of backed-up DDRs.
Ddm.log indicated that Discovery Data Manager was attempting to process the same DDR over and over.
The first two symptoms are very common for corrupt DDRs; the next two are not.
Processing errors appeared multiple times in the Site Status tree under the Discovery Data Manager thread as DDRs backed up. They began the first time Discovery Data Manager attempted to process the corrupt record.
Message ID 669 Component raised an exception but failed to handle it.
Here is another atypical symptom that I saw: I had about 250 SMS crash dumps, located in site server\SMS\Logs\CrashDumps. Each crash.log file contained the following message:
Time = 06/15/2001 13:42:24.925
Service name = SMS_EXECUTIVE
Thread name = SMS_DISCOVERY_DATA_MANAGER
Executable = D:\SMS\bin\i386\smsexec.exe
Process ID = 394 (0x18a)
Thread ID = 655 (0x28f)
Instruction address = 5f4040fd
Exception = c0000005 (EXCEPTION_ACCESS_VIOLATION)
Description = "The thread tried to read from the virtual address C35EC67F for which it does not have the appropriate access."
Raised inside CService mutex = No
CService mutex description = ""
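With 250 dumps on disk, it helps to summarize them rather than open each crash.log by hand. Here is a small sketch (my own helper, not part of SMS) that parses the "Key = value" lines of a crash.log, as shown above, into a dictionary so many dumps can be grouped by thread name or exception code.

```python
def parse_crash_log(text):
    """Parse the 'Key = value' lines of an SMS crash.log into a dict."""
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip().strip('"')
    return fields
```

Looping this over every CrashDumps subfolder quickly shows whether all the dumps share one thread name and exception, as mine did.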
The only TechNet article that even remotely relates to this issue is Q223755, "SMS Executive Crashes when enumerating a non-Microsoft Server."
Because of the problems covered above, I had these additional issues that served to complicate the troubleshooting process:
The drive containing the SMS install was completely full; normally it has 3.5 GB free. This was due to the tremendous amount of space that SMS crash dumps take up. Each crash dump makes a copy of all SMS logs, whether logging is turned on or not. It's a pretty nice feature, as long as you don't have 250 of them. <grin>
After the drive was full, the following appeared in Site Status under Discovery Data Manager:
Message ID 2636 SMS Discovery Data Manager Failed to update the following discovery data record server\sms\inboxes\ddm.box\HQCL4GPB.ddr because it cannot update the data source.
I believe this occurred due to the lack of space on the SMS server install drive.
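A runaway pile of crash dumps is easy to spot early if you check how much space the CrashDumps tree is using. A minimal sketch, assuming a hypothetical path like the one in my environment:

```python
import os

def tree_size(root):
    """Total size in bytes of all files under root (e.g. SMS\\Logs\\CrashDumps)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Running something like this on a schedule, and alerting when the total climbs past a few hundred megabytes, would have flagged my problem days earlier.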
Despite the non-standard symptoms, this was a standard DDR corruption problem, and the solution, again, was quite simple: I found and deleted the corrupt DDR, and I also deleted the 3.5 GB worth of crash dumps. After that, there were no more problems.
Normally this type of problem wouldn't go unnoticed in my SMS implementation, but... the corrupt DDR was received on a Friday. I have site status messages set to clear on Sunday evenings, and I didn't pay attention to the logs on Monday. On Tuesday, as I prepared for an important software distribution to 1,500 machines, I noticed the problem.