
Son of SPA: AD Data Collector Sets in Win2008 and beyond

Hello, David Everett here again. This time I’m going to cover configuration and management of Active Directory Diagnostics Data Collector Sets. Data Collector Sets are the next generation of a utility called Server Performance Advisor (SPA).

Prior to Windows Server 2008, troubleshooting Active Directory performance issues often required installing SPA. SPA is helpful because its Active Directory data set collects performance data and generates XML-based diagnostic reports that make analyzing AD performance issues easier by identifying the IP addresses of the highest-volume callers and the types of network traffic placing the most load on the CPU. A screenshot of SPA is shown here with the Active Directory data set selected.

[Screenshot: SPA with the Active Directory data set selected]

Those who came to rely upon this tool will be happy to know its functionality has been built into Windows Server 2008 and Windows Server 2008 R2.

This performance feature is located in the Server Manager snap-in under the Diagnostics node. When the Active Directory Domain Services role is installed, the Active Directory Diagnostics data collector set is automatically created under System, as shown here. It can also be accessed by running "Perfmon" from the RUN command.

[Screenshot: the Active Directory Diagnostics data collector set under System in Server Manager]

Like SPA, the Active Directory Diagnostics data collector set runs for a default of 5 minutes. This duration cannot be modified for the built-in collector; however, the collection can be stopped manually by clicking the Stop button or from the command line. If you need to reduce or increase the time that a data collector set runs, and manually stopping the collection is not desirable, see How to Create a User Defined Data Collection Set below. Like SPA, the data is stored under %systemdrive%\PerfLogs, only now it is under the \ADDS folder. Each data collection run creates a new subfolder named YYYYMMDD-####, where YYYY = year, MM = month, DD = day, and #### starts with 0001.

Once the data collection completes the report is generated on the fly and is ready for review under the Reports node.

Just as SPA could be managed from the command line with spacmd.exe, data collector sets can also be managed from the command line.

How to gather Active Directory Diagnostics from the command line

  • To START a collection of data from the command line issue this command from an elevated command prompt:

logman start "system\Active Directory Diagnostics" -ets

  • To STOP the collection of data before the default 5 minutes, issue this command:

logman stop "system\Active Directory Diagnostics" -ets

NOTE: To gather data from remote systems just add “-s servername” to the commands above like this:

logman -s servername start "system\Active Directory Diagnostics" -ets

logman -s servername stop "system\Active Directory Diagnostics" -ets

These commands also work if the target is Server Core. If you cannot connect using Server Manager, you can view the report by connecting from another computer to the C$ admin share and opening the report.html file under \\servername\C$\PerfLogs\ADDS\YYYYMMDD-000#.
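
For example, from another computer (the date-stamped folder name below is only an illustration – use whichever folder your collection run actually created):

start "" "\\servername\C$\PerfLogs\ADDS\20100815-0001\report.html"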

See LaNae’s blog post on How to Enable Remote Administration of Server Core via MMC using NETSH to open the necessary firewall ports.

In the event you need a Data Collection set run for a shorter or longer period of time, or if some other default setting is not to your liking you can create a User Defined Data Collector Set using the Active Directory Diagnostics collector set as a template.

NOTE: Increasing the duration that a data collector set runs will require more time for the data to be converted and could increase load on CPU, memory, and disk.

Once your customized Data Collector Set is defined to your liking, you can export the information to an XML file and import it to any server you wish using Server Manager or logman.exe.

How to Create a User Defined Data Collection Set

 

  1. Open Server Manager on a Full version of Windows Server 2008 or later.
  2. Expand Diagnostics > Reliability and Performance > Data Collector Sets.
  3. Right-click User Defined and select New > Data Collector Set.
  4. Type in a name like Active Directory Diagnostics, leave the default selection of Create from a template (Recommended) selected, and click Next.
  5. Select Active Directory Diagnostics from the list of templates, click Next, and follow the Wizard prompts, making any changes you think are necessary.
  6. Right-click the new User Defined data collector set and view the Properties.
  7. To change the run time, modify the Overall Duration settings in the Stop Condition tab and click OK to apply the changes.

Once the settings have been configured to your liking you can run this directly from Server Manager or you can export this and deploy it to specific DCs.

Deploying a User Defined Data Collection Set

  • In Server Manager on a Full version of Windows Server 2008 or later:
    1. Expand Diagnostics > Reliability and Performance > Data Collector Sets > User Defined.
    2. Right-click the newly created data collector set and select Save Template…
  • From the command line

1. Enumerate all User Defined data collector sets

logman query

NOTE: If running this from a remote computer, add "-s servername" to the command to target the remote server

logman -s servername query

2. Export the desired collection set

logman export -n "Active Directory Diagnostics" -xml addiag.xml

3. Import the collection set to the target server.

logman import -n "Active Directory Diagnostics" -xml addiag.xml

NOTE: If you get the error below then there’s an SDDL string in the XML file between the <Security></Security> tags that is not correct. This can happen if you export the Active Directory Diagnostics collector set under System. To correct this, remove everything between <Security></Security> tags in the XML file.

Error:

This security ID may not be assigned as the owner of this object.

4. Verify the collector set is installed

 logman query

5. Now that the data collector set is imported you’re ready to gather data. See How to gather Active Directory Diagnostics from the command line above to do this from the command line.

Once you’ve gathered your data, you will have these interesting and useful reports to aid in your troubleshooting and server performance trending:

[Screenshots: sample Active Directory Diagnostics report output]

In short, all the goodness of SPA is now integrated into the operating system, not requiring an install or reboot. Follow the steps above, and you'll be on your way to gathering and analyzing lots of performance goo.

David “highly excitable” Everett


Friday Mail Sack: Newfie from the Grave Edition

Heya, Ned here again. Since this is another of those catch-up mail sacks, there’s plenty of interesting stuff to discuss. Today we talk NSPI, DFSR, USMT, NT 4.0 (!!!), Win2008/R2 AD upgrades, Black Hat 2010, and Irish people who live on icebergs.

Faith and Begorrah!

Question

A vendor told me that I need to follow KB2019948 to raise the number of “NSPI max sessions per user” from 50 to 10,000 for their product to work. Am I setting myself up for failure?

Answer

Starting in Windows Server 2008, global catalogs are limited to 50 concurrent NSPI connections per user from messaging applications. That is because previous experience with letting apps use unlimited connections has been unpleasant. :) So when your vendor tells you to do this, they are putting you in the position where your DCs will be allocating a huge number of memory pages to handle what amounts to a denial-of-service attack caused by a poorly written app that does not know how to re-use sessions correctly.

We wrote an article you can use to confirm this is your issue (BlackBerry Enterprise Server currently does this and yikes, Outlook 2007 did at some point too! There are probably others):

949469    NSPI connections to a Windows 2008-based domain controller may cause MAPI client applications to fail with an error code: "MAPI_E_LOGON_FAILED"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;949469

The real answer is to fix the calling application so that it doesn’t behave this way. As a grotesque bandage, you can use the registry change on your GCs. Make sure these DCs are running an x64 OS and are not memory bound before you start, as it’s likely to hurt. Try raising the value in increments before going to something astronomical like 10,000 – it may be that significantly fewer are needed per user and the vendor was pulling that number out of their butt. It’s not like they will be the ones on the phone with you all night when the DC tanks, right?
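
If you do end up applying the bandage, the change is a DWORD under the NTDS parameters key; a minimal sketch, assuming the value name and location described in KB 949469 (verify against the KB before using it, and the data of 200 is an arbitrary example, not a recommendation):

reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "NSPI max sessions per user" /t REG_DWORD /d 200 /f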

Question

I have recently started deploying Windows Server 2008 R2 as part of a large DFSR infrastructure. When I use the DFS Management (DFSMGMT.MSC) snap-in on the old Win2008 and Win2003 servers to examine my RGs, the new RGs don’t show up. Even when I select “Add replication groups to display” and hit the “Show replication groups” button I don’t see the new RGs. What’s up?

Answer

We have had some changes in the DFSMGMT snap-in that intentionally lead to behaviors like these. For example:

Here’s Win2008 R2:

[Screenshot: DFS Management on Windows Server 2008 R2]

and here’s Win2003 R2:

[Screenshot: DFS Management on Windows Server 2003 R2 – the new RGs are missing]

See the difference? The missing RG names give a clue. :)

This is because the msDFSR-Version attribute on the RG gets set to “3.0” when creating an RG with clustered memberships or an RG containing read-only memberships. Since a Win2003 or Win2008 server cannot correctly manage those new-model RGs, their snap-ins are not allowed to see them.

[Screenshot]

In both cases this is only at creation time; if you go back later and do stuff with cluster or RO, then the version may not necessarily be updated and you can end up with 2003/2008 seeing stuff they cannot manage. For that reason I recommend you avoid managing DFSR with anything but the latest DFSMGMT.MSC. The snap-ins just can’t really coexist effectively. There’s never likely to be a backport because – why bother? The only way to have the problem is to already have the solution.

Question

Is there a way with USMT 4.0 to take a bunch of files scattered around the computer and put them into one central destination folder during loadstate? For example, PST files?

Answer

Sure thing, USMT supports a concept called “rerouting” that relies on an XML element called “locationModify”. Here’s an example:

<migration urlid="http://www.microsoft.com/migration/1.0/migxmlext/pstconsolidate">
  <component type="Documents" context="System">
    <displayName>
All .pst files to a single folder
</displayName>
    <role role="Data">
      <rules>
        <include>
          <objectSet>
            <script>MigXmlHelper.GenerateDrivePatterns ("* [*.pst]", "Fixed")</script>
          </objectSet>
        </include>
      
<!-- Migrates all the .pst files in the store to the C:\PSTFiles folder during LoadState -->
        <locationModify script="MigXmlHelper.Move('C:\PSTFiles')">
          <objectSet>
            <script>MigXmlHelper.GenerateDrivePatterns ("* [*.pst]", "Fixed")</script>
          </objectSet>
        </locationModify>
      </rules>
    </role>
  </component>
</migration>

The <locationModify> element allows you to choose from the MigXmlHelpers of RelativeMove, Move, and ExactMove. Move is typically the best option as it just preserves the old source folder structure under the new parent folder to which you redirected. ExactMove is less desirable as it will flatten out the source directory structure, which means you then need to explore the <merge> element and decide how you want to handle conflicts. Those could involve various levels of precedence (where some files will be overwritten permanently) or simply renaming files with (1), (2), etc. added to the tail. Pretty gross. I don’t recommend it and your users will not appreciate it. RelativeMove allows you to take from one known spot in the scanstate and move to another new known spot in the loadstate.
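
Assuming you save the XML above as pstconsolidate.xml (a name I just made up), you pass it to both scanstate and loadstate with /i, alongside your other migration XML files; a sketch with a placeholder store path:

scanstate \\server\migstore\%computername% /i:migdocs.xml /i:migapp.xml /i:pstconsolidate.xml /o /c
loadstate \\server\migstore\%computername% /i:migdocs.xml /i:migapp.xml /i:pstconsolidate.xml /c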

Question

I’m running into some weird issues with pre-seeding DFSR using robocopy with Win2008 and Win2008 R2, even when following your instructions from an old post. It looks like my hashes are not matching as I’m seeing a lot of conflicts. I also remember you saying that there will be a new article on pre-seeding coming?

Answer

1. Make sure you install these QFEs, which fix several problems with ACLs and other elements not being copied correctly on 2008/2008 R2 – all file elements are used by DFSR to calculate the SHA-1 hash, so any difference (including security) will conflict the file:

973776  The security configuration information, such as the ACL, is not copied if a backup operator uses the Robocopy.exe utility together with the /B option to copy a file on a computer that is running Windows Vista or Windows Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;973776

979808    "Robocopy /B" does not copy the security information such as ACL in Windows 7 and in Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;979808

2. Here’s my recommended robocopy syntax. You will want to ensure that the base folders (where you are copying from and to) have the same security and inheritance settings prior to copying, of course.

[Screenshot: recommended robocopy command line]
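
For reference, here is a representative pre-seeding command line – a sketch in the spirit of the screenshot, not necessarily its exact switches, and the source and destination paths are placeholders:

robocopy "D:\RF" "\\DEST-SRV\D$\RF" /E /B /COPYALL /R:6 /W:5 /XD DfsrPrivate /LOG:preseed.log /TEE

/B needs the hotfixes above to copy security correctly, /COPYALL keeps the ACLs and other metadata that feed the SHA-1 hash, and /XD DfsrPrivate keeps the DFSR private folder out of the copy.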

3. If you are using Windows Server 2008 R2 (or have a Win7 computer lying around), you can use the updated version of DFSRDIAG.EXE that supports the FILEHASH command. It will allow you to test and see if your pre-seeding was done correctly before continuing:

C:\>dfsrdiag.exe filehash
Command "FileHash" or "Hash" Help:
   Displays a hash value identical to that computed by the DFS Replication
   service for the specified file or folder
   Usage: DFSRDIAG FileHash </FilePath:filepath>

   </FilePath> or </Path>
     File full path name
     Example: /FilePath:d:\directory\filename.ext

It only works on a per-file basis, so it’s either for “spot checking” or you’d have to script it to crawl everything (probably overkill). So you could do your pre-seeding test, then use this to check how it went on some files:

dfsrdiag filehash /path:\\srv1\rf\somefile.txt
dfsrdiag filehash /path:\\srv2\rf\somefile.txt

If the hashes fit, you must acquit!
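
If you do decide to script a crawl instead of spot checking, here is a rough sketch from an interactive prompt (double the percent signs in a batch file; srv1 and the rf share are placeholders):

pushd \\srv1\rf
for /r %f in (*) do @(echo %f & dfsrdiag filehash /path:"%f") >> %temp%\hashes-srv1.txt
popd

Run the same loop against the other member and compare the two output files with fc.exe.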

Still working on the full blog post, sorry. It’s big and requires a lot of repro and validation, just needs more time – but it had that nice screenshot for you. :)

Question

  1. Can a Windows NT 4.0 member join a Windows Server 2008 R2 domain?
  2. Can Windows7/2008 R2 join an NT 4.0 domain?
  3. Can I create a two-way or outbound trust between an NT 4.0 PDC and Windows Server 2008 R2 PDCE?

Short Snarky Answer

  1. Yes, but good grief, really!?!
  2. No.
  3. Heck no.

Long Helpful Answer

  1. If you enable the AllowNt4Crypto Netlogon setting and all the other ridiculously insecure settings required for NT 4.0 below you will be good to go. At least until you get hacked due to using a 15 year old OS that has not gotten a security hotfix in half a decade.

    823659    Client, service, and program incompatibilities that may occur when you modify security settings and user rights assignments
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;823659

    942564    The Net Logon service on Windows Server 2008 and on Windows Server 2008 R2 domain controllers does not allow the use of older cryptography algorithms that are compatible with Windows NT 4.0 by default
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;942564
     
  2. Windows 7 and 2008 R2 computers cannot join NT 4.0 domains due to fundamental security changes. No, this will not change. No, there is no workaround.  

    940268    Error message when you try to join a Windows Vista, Windows Server 2008, Windows 7, or Windows Server 2008 R2-based computer to a Windows NT 4.0 domain: "Logon failure: unknown user name or bad password"
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;940268

  3. Windows Server 2008 R2 PDCEs cannot create an outbound or two-way trust to NT 4.0 due to fundamental security changes. We have a specific article in mind for this right now, but KB942564 was updated to reflect this also. No, this will not change. No, there is no workaround.  

The real solution here is to stop expending all this energy to be insecure and keep ancient systems running. You obviously have newer model OS’s in the environment, just go whole hog. Upgrade, migrate or toss your NT 4.0 environments. Windows 2000 support just ended, for goodness sake, and it was 5 years younger than NT 4.0! For every one customer that tells me they need an NT 4.0 domain for some application to run (which no one ever actually checks to see if that’s true, because they secretly know it is not true), the other nineteen admit that they just haven’t bothered out of sheer inertia.

Let me try this another way – go here: http://www.microsoft.com/technet/security/bulletin/summary.mspx. This is the list of all Microsoft security bulletins in the past seven years. For five of those years, NT 4.0 has not gotten a single hotfix. Windows 2000 – remember, not supported now either– has gotten 174 security updates in the past four years alone. If you think your NT 4.0 environment is not totally compromised, it’s only because you keep it locked in an underwater vault with piranha fish and you keep the servers turned off. It’s an OS based on using NTLM’s challenge response security, which people are still gleefully attacking with new vectors.

You need Kerberos.

Question

We use a lot of firewalls between network segments inside our environment. We have deployed DFSR and it works like a champ, replicating without issues. But when I try to gather a health report for a computer that is behind a firewall, it fails with an RPC error. My event log shows:

Error Event Source: DCOM
Event Category: None
Event ID: 10006
Date: 7/15/2010
Time: 2:51:52 PM
User: N/A
Computer: SRVBEHINDFIREWALL
Description: DCOM got error "The RPC server is unavailable."

Answer

If replication is working with the firewall but health reports are not, it sounds like DCOM/WMI traffic is being filtered out. Make sure the firewalls are not blocking or filtering the DCOM traffic specifically; a later model firewall that supports packet inspection may be deciding to block the DCOM types of traffic based on some rule. A double-sided network capture is how you will figure this out – the computer running MMC will connect remotely to DCOM over port 135, get back a response packet that (internally) states the remote port for subsequent connections, then the MMC will connect to that port for all subsequent conversations. If that port is blocked, no report.

For example, here I connect to port 135 (DCOM/EPM) and get a response packet that contains the new dynamic listening port to connect to for DCOM – that port happens to be 55158 (but will differ every time). I then connect to that remote port in order to get health diagnostic output using the IServerHealthReport call. If you create a double-sided network capture, you will likely see either the first conversation fail or, if it succeeds, the subsequent conversation failing – failing due to the firewall dropping the packets so that they never appear on the remote host. That’s why you must use double-sided captures.

[Screenshot: network capture of the DCOM endpoint mapper conversation and the dynamic port]
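
Before you break out captures on both sides, PortQry (a free Microsoft download) gives a quick read on whether the endpoint mapper and the dynamic port are reachable; a sketch using the example ports above:

portqry -n SRVBEHINDFIREWALL -e 135
portqry -n SRVBEHINDFIREWALL -p tcp -e 55158

If 135 answers but the dynamic port comes back FILTERED, the firewall is your culprit.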

Question

I know USMT cannot migrate local printers, but can it migrate TCP-port connected printers?

Answer

No, and for the same reason: those printers are not mapped to a print server that can send you a device driver, and they are (technically) also local printers. Dirty secret time: USMT doesn’t really migrate network printers, it just migrates these two registry keys:

HKCU\Printers\Connections
HKCU\Printers\DevModes2
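
A quick way to see exactly what USMT will pick up for the current user is to dump those keys:

reg query "HKCU\Printers\Connections" /s
reg query "HKCU\Printers\DevModes2"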

So if your printer is in those keys, USMT is win – and the only kind that live there are mapped network printers. When you first log on and access the printer on your newly restored computer, Windows will just download the driver for you and away you go. Considering that you are in the middle of this big migration, now would be a good time to get rid of these old (wrong?) ways of connecting printers. Windows 7 has plenty of options for printer deployment through Group Policy and Group Policy Preferences, and you can even make the right printers appear based on the user’s location. For example, here’s what I see when I add a printer here at my desk – all I see are the printers in my little building on the nearest network. Not the ones across the street, not the ones I cannot use, not the ones I have no business seeing. Do this right and most users will only see printers within 50 feet of them. :)

[Screenshot: Add Printer dialog showing only nearby printers]

To quote from the book of Bourdain: That does not suck.

Question

What are the best documents for planning, deploying, and completing a forest upgrade from Win2000/2003 to Win2008/2008R2? [Asked at least 10 times a week – Ned]

Answer

Here:

Upgrading Active Directory Domains to Windows Server 2008 and Windows Server 2008 R2 AD DS Domains
http://technet.microsoft.com/en-us/library/cc731188(WS.10).aspx

If you are planning a domain upgrade, this should be your new homepage until the operation is complete. It is fantastic documentation with checklists, guides, known issues, recommended hotfixes, and best practices. It’s the bee’s knees, the wasp’s elbows, and the caterpillar's feets.


 

Moving on to other things not directly sack-related…

There are a couple of interesting takeaways from Black Hat US 2010 this week:

  • We announced our new Coordinated Vulnerability Disclosure process. Adobe is onboard already, hopefully more to come.
  • These folks claim they have a workable attack on Kerberos smart card logons. Except that we’ve had a way to prevent the attack for three years, starting in Vista using Strict KDC Validation – so that kinda takes the wind out of their sails. You can read more about how to make sure you are protected here and here and soon here. Pretty amazing also that this is the first time – that I’ve heard of, at least – in 11 years of MS Kerberos smart cards that anyone was talking attacks past the theoretical stage.
  • Of 102 topics, 10 are directly around Microsoft and Windows attacks. 48 are around web, java, and browser attacks. How much attention are you giving your end-to-end web security?
  • 10 topics were also around attacking iPhones and Google apps. How much attention are you giving those products in your environment? They are now as interesting to penetrate as all of Windows, according to Black Hat.
  • 5 topics on cloud computing attacks. Look for that number to double next year, and then double again the year after. Bet on it, buddy.

Finally, remember my old boss Mike O’Reilly? Yes, that guy that made the Keebler tree and who was the manager in charge of this blog and whom I worked with for 6 years. Out of the blue he sends me this email today – using his caveman Newfie mental gymnastics:

Ned,

I never ever read the Askds blog when I worked there.  I was reading it today and just realized that you are funny. 

What a guy. Have a nice weekend folks,

- Ned “I have 3 bosses, Bob” Pyle


New DNS and AD DS BPA’s released (or: the most accurate list of DNS recommendations you will ever find from Microsoft)

Hi folks, Ned here again. We’ve released another wave of Best Practices Analyzer rules for Windows Server 2008 / R2, and if you care about Directory Services you care about these:

AD DS rules update

Info: Update for the AD DS Best Practices Analyzer rules in Windows Server 2008 R2
Download: Rules Update for Active Directory Domain Services Best Practice Analyzer for Windows Server 2008 R2 x64 Editions (KB980360)

This BPA update for Active Directory Domain Services includes seven rule changes and updates, some of which are well known but a few that are not.

DNS Analyzer 2.0

Operation info: Best Practices Analyzer for Domain Name System – Ops
Configuration info: Best Practices Analyzer for Domain Name System – Config
Download: Microsoft DNS (Domain Name System) Model for Microsoft Baseline Configuration Analyzer 2.0

Remember when – a few weeks back – I wrote about recommended DNS configuration and promised more info? Well, here it is, in all its glory. Despite what you might have heard, misheard, remembered, or argued about, this is the official recommended list, written by the Product Group and appended/vetted/munged by Support. Which includes:

Awww yeaaaahhh… just memorize that and you’ll win any "Microsoft recommended DNS" bar bets you can imagine. That’s the cool thing about this ongoing BPA project: not only do you get a tool that will check your work in later OS versions, but the valid documentation gets centralized.

- Ned “Arren hates cowboys” Pyle


Multi-NIC File Server Dissection

Ned here. Our friend and colleague Jose Barreto from the File Server development team has posted a very interesting article around multiple NIC usage on Win2008/R2 file servers. Here's the intro:

When you set up a File Server, there are advantages to configuring multiple Network Interface Cards (NICs). However, there are many options to consider depending on how your network and services are laid out. Since networking (along with storage) is one of the most common bottlenecks in a file server deployment, this is a topic worth investigating.

Throughout this blog post, we will look into different configurations for Windows Server 2008 (and 2008 R2) where a file server uses multiple NICs. Next, we’ll describe how the behavior of the SMB client can help distribute the load for a file server with multiple NICs. We will also discuss SMB2 Durability and how it can recover from certain network failure in configuration where multiple network paths between clients and servers are available. Finally, we will look closely into the configuration of a Clustered File Server with multiple client-facing NICs.

I highly recommend giving the whole thing a read if you are interested in increasing file server throughput and reliability on the network in a recommended fashion.

http://blogs.technet.com/b/josebda/archive/2010/09/03/using-the-multiple-nics-of-your-file-server-running-windows-server-2008-and-2008-r2.aspx

- Ned "I am team Edward" Pyle


What does DCDIAG actually… do?

Hi folks, Ned here again. I recently wrote a KB article about some expected DCDIAG.EXE behaviors. This required reviewing DCDIAG.EXE as I wasn’t finding anything deep in TechNet about the “Services” test that had my interest. By the time I was done, I had found a dozen other test behaviors I had never known existed. While we have documented the version of DCDIAG that shipped with Windows Server 2008 – sometimes with excellent specificity, like Justin Hall’s article about the DNS tests– mostly it’s a black box and you only find out what it tests when the test fails. Oh, we have help of course: just run DCDIAG /? to see it. But it’s help written by developers. Meaning you get wording like this:

Advertising
Checks whether each DSA is advertising itself, and whether it is advertising itself as having the capabilities of a DSA.

So, it checks each DSA (whatever that is) to see if it’s advertising (whatever that means). The use of an undefined acronym is an especially nice touch, as even within Microsoft, DSA could mean:

Naturally, this brings out my particular brand of OCD. What follows is the result of my compulsion to understand. I’m not documenting every last switch in DCDIAG, just the tests. I am only documenting Windows Server 2008 R2 SP1 behavior – I have no idea where the source code is for the ancient Support Tools version of DCDIAG and you aren’t paying me enough here to find it :-).  The Windows Server 2008 RTM through Windows Server 2008 R2 SP1 versions are nearly identical except for bug fixes:

KB2401600 The Dcdiag.exe VerifyReferences test fails on an RODC that is running Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;en-US;2401600

KB979294 The Dcdiag.exe tool takes a long time to run in Windows Server 2008 R2 and in Windows 7
http://support.microsoft.com/default.aspx?scid=kb;EN-US;979294

KB978387 FIX: The connectivity test that is run by the Dcdiag.exe tool fails together with error code 0x621
http://support.microsoft.com/default.aspx?scid=kb;EN-US;978387

Everything I describe below you can discover and confirm yourself with careful examination of network captures and logging, to include the public functions being used– but why walk when you can ride? Using /v can also provide considerable details on some tests. No internal source code is described nor do I show any special hidden functionality.

For info on all the network protocols I list out – or if you run into network errors when using DCDIAG – see Service overview and network port requirements for the Windows Server system. I went pretty link-happy in general in this post to help people using it as a reference; that way if you just look at your one little test it has all the info you need. I don’t always call out name resolution being tested because it is implicit; it’s also testing TCP, UDP, and IP.

Finally: this post is more of a reference than my usual lighthearted fare. Do not operate heavy machinery while reading.

Initial Required Tests

This tests general connectivity and responsiveness of a DC, to include:

  • Verifying the DC can be located in DNS.
  • Verifying the DC responds to ICMP pings.
  • Verifying the DC allows LDAP connectivity by binding to the instance.
  • Verifying the DC allows binding to the AD RPC interface using the DsBindWithCred function.

The DNS test can be satisfied out of the client cache so restarting the DNS client service locally is advisable when running DCDIAG to guarantee a full test of name resolution. For example:

Net stop "dns client" & net start "dns client" & dcdiag /test:verifyreplicas /s:DC-01

The initial tests cannot be skipped.

The initial tests use ICMP, LDAP, DNS, and RPC on the network.

Editorial note: Blocking ICMP will prevent DCDIAG from working. While blocking ICMP is highly recommended at the Internet-edge of your network, internally blocking ICMP traffic mainly just leads to administrative headaches like breaking legacy group policy, breaking black hole router detection (or leading to highly inefficient MTU sizes due to lack of a discovery option), and breaking troubleshooting tools like ping.exe or tracert.exe. It creates an illusion of security; there are a great many other easy ways for a malicious internal user to locate computers.

Advertising

This test validates that the public DsGetDcName function used by computers to locate domain controllers will correctly locate any DCs specified on the command line with the /s, /a, or /e parameter. It checks that the server successfully reports itself with DS_Flags for:

  • DC
  • LDAP server
  • Writable or Read-Only DC
  • KDC
  • Time Server
  • GC or not (and if claiming to be a GC, whether the GC is ready to respond to requests)

Note that “advertising” is not the same as “working”. For instance, if the KDC service is stopped the Advertising test will fail since the flag returned from DsGetDcName will not include KDC. But if port 88 over TCP and UDP are blocked on a firewall, the Advertising test will pass – even though the KDC is not going to be able to answer requests for Kerberos tickets.

This test is done using RPC over SMB (using a Netlogon named pipe) to the DC plus LDAP to locate the DC’s site information.
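
To run just this test against one DC, and to eyeball the same locator flags from a client’s perspective (DC-01 and contoso.com are placeholder names; nltest is simply another way to see DsGetDcName output):

dcdiag /test:advertising /s:DC-01
nltest /dsgetdc:contoso.com

The Flags line in the nltest output (GC, KDC, TIMESERV, WRITABLE, and so on) maps to the capabilities listed above.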

CheckSDRefDom

This test validates that your application partition cross reference objects (located in “cn=partitions,cn=configuration,dc=<forest root domain>”) contain the correct domain names in their msDS-SDReferenceDomain attributes.

I find no history of anyone ever seeing the error message that can be displayed here.

The test uses LDAP.

CheckSecurityError

This test does a variety of checks around the security components of a DC like Kerberos. For it to be more specifically useful you should provide /replsource:<some partner DC> as the default checks are not as comprehensive.

This test:

  • Validates that at least one KDC is online for each domain and they are reachable (first in the same site, then anywhere in the domain if that fails)
  • Checks if packet fragmentation of Kerberos over UDP might be an issue based on current MTU size by sending non-fragmenting ICMP packets
  • Checks if the DC’s computer account exists in AD, if it’s within the default “Domain Controllers” OU, if it has the correct UserAccountControl flags for DCs, that the correct ServerReference attributes are set, and if the minimum Service Principal Names are set
  • Validates that the DCs computer object has replicated to other DCs
  • Validates that there are no replication or KCC connection issues for connected partners by querying the function DsReplicaGetInfo to get any security-related errors

When the /replsource is added, a few more tests happen. The partner is checked for all of the above also, then:

  • Time skew is calculated between the servers to verify it is less than 300 seconds for Kerberos. It does not check the Kerberos policy to see if allowed skew has been modified
  • Permissions are checked on all the naming contexts (such as Schema, Configuration, etc.) on the source DC to validate that replication and connectivity will work between DCs
  • Connectivity is checked to validate that the user running DCDIAG (and therefore in theory, all other users) can connect to and read the SYSVOL and NETLOGON shares without any security errors. It also checks IPC$, but inability to connect there would have broken many earlier tests
  • The "Access this computer from the network" privilege on the DC is checked to verify it is held by Administrators, Authenticated Users, and Everyone groups
  • The DC's computer object is checked to ensure it is the latest version on the DCs. This is done to prove replication convergence since a very stale DC might lead to security issues for users, problems with the DCs own computer account password, or secure channels to other servers. It checks versions, USNs, originating servers, and timestamps

These tests are performed using LDAP, RPC, RPC over SMB, and ICMP.
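
For example, to run the fuller set of checks against a DC and one of its replication partners (both server names are placeholders):

dcdiag /test:checksecurityerror /s:DC-01 /replsource:DC-02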

Connectivity

No matter what you specify for tests, this always runs as part of Initial Required Tests.

CrossRefValidation

This test retrieves a list of naming contexts (located in “cn=partitions,cn=configuration,dc=<forest root domain>”) with their cross references and then validates them, similar to the CheckSDRefDom test above. It looks at the nCName, dnsRoot, nETBIOSName, and systemFlags attributes to:

  • Make sure the names or DNs are not invalid or null
  • Confirm DNs are not otherwise mangled with CNF or 0ADEL (which happens during Conflict or Deletion operations)
  • Ensure the systemFlags are correct for that object
  • Call out any empty (orphaned) replica sets

The test uses LDAP.

CutoffServers

Tests the AD replication topology to ensure there are no DCs without working connection objects between partners. Any servers that cannot replicate inbound or outbound from any DCs are considered “cut off”. It uses the function DsReplicaSyncAll to do this which means this “test” actually triggers replication on the DCs so use with caution if you are the owner of crud WAN links that you keep clean with schedules, and certainly consider this before using /e.

This test is rather misleading in its help description; if it cannot contact a server that is actually unavailable to LDAP on the network then it gives no error or test results, even if the /v parameter is specified. You have to notice that there is no series of “analyzing the alive system replication topology” or “performing upstream (of target) analysis” messages being printed for a cutoff server. However, the Connectivity test will fail if the server is unreachable so it’s a wash.

The test uses RPC.

DcPromo

The DCpromo test is one of the two oddballs in DCDIAG (the other is ‘DNS’). It is designed to test how well a DCPROMO would proceed if you were to run it on the server where DCDIAG is launched. It also has a number of required switches for each kind of promotion operation. All of the tests are against the server specified first in the client DNS settings. It tests:

  • If at least one network adapter has a primary DNS server set
  • If you would have a disjoint namespace based on the DNS suffix
  • That the proposed authoritative DNS zone can be contacted
  • If dynamic DNS updates are possible for the server’s A record. It checks both the setting on the authoritative DNS zone as well as the client registry configuration of DnsUpdateOnAllAdapters and DisableDynamicUpdate
  • If an LDAP DClocator record (i.e. “_ldap._tcp.dc._msdcs.<domain>”) is returned when querying for existing forests

The test uses DNS on the network.

DNS

This series of enterprise-wide DNS tests are already well documented here:

http://technet.microsoft.com/en-us/library/cc731968(WS.10).aspx

The tests use DNS, RPC, and WMI protocols.

FrsEvent

This test validates the File Replication Service’s health by reading (and printing, if using /v) FRS event log warning and error entries from the past 24 hours. It’s possible this service won’t be running or installed on Windows Server 2008 or later if SYSVOL has been migrated to DFSR. On Windows Server 2008, some events may be misleading as they may refer to custom replica sets and not necessarily SYSVOL; on Windows Server 2008 R2, however, FRS can be used for SYSVOL only.

By default, remote connections to the event log are disabled by the Windows Server 2008/R2 firewall rules so this test will fail. KB2512643 covers enabling those rules to allow the test to succeed.

The test uses RPC, specifically with the EventLog Remoting Protocol.
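
If it is the Windows Firewall doing the blocking, the built-in rule group covering remote event log access can be enabled as described in KB2512643; a minimal sketch, run on the target DC:

netsh advfirewall firewall set rule group="Remote Event Log Management" new enable=yes

Enabling it once also unblocks the DFSREvent, KccEvent, and SystemLog tests described below.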

DFSREvent

This test validates the Distributed File System Replication service’s health by reading (and printing, if using /v) DFSR event log warning and error entries from the past 24 hours. It’s possible this service won’t be running or installed on Windows Server 2008 if SYSVOL is still using FRS; on Windows Server 2008 R2 the service is always present on DCs. While this ostensibly tests DFSR-enabled SYSVOL, any errors within custom DFSR replication groups would also appear here, naturally.

By default, remote connections to the event log are disabled by the Windows Server 2008/R2 firewall rules so this test will fail. KB2512643 covers enabling those rules to allow the test to succeed.

The test uses RPC, specifically with the EventLog Remoting Protocol.

SysVolCheck

This test reads the DC’s Netlogon SysvolReady registry value to validate that SYSVOL is being advertised:

HKEY_Local_Machine\System\CurrentControlSet\Services\Netlogon\Parameters
SysvolReady=1

The value name has to exist with a value of 1 to pass the test. This test works with either FRS- or DFSR-replicated SYSVOLs. It doesn’t check whether the SYSVOL and NETLOGON shares are actually accessible, though (CheckSecurityError does that).

The test uses RPC over SMB (through a named pipe to WinReg).
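
You can make the same check manually with reg.exe, for example:

reg query "HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters" /v SysvolReady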

LocatorCheck

This test validates that DCLocator queries return the five “capabilities” that any DC must know of to operate correctly.

If the DC does not host a capability itself, it will refer to another DC that can satisfy the request; this means you must carefully examine the output under /v to make sure a server you thought was supposed to be holding a capability is actually the one returned. If no DC answers or if the queries return errors then the test will fail.

The tests use RPC over SMB with the standard DsGetDcName DCLocator queries.

Intersite

This test uses Directory Replication Service (DRS) functions to check for conditions that would prevent inter-site AD replication within a specific site or all sites:

  • Locates and connects to the Intersite Topology Generators (ISTG)
  • Locates and connects to the bridgehead servers
  • Reports back any replication failures after triggering a replication
  • Validates that all DCs within sites with inbound connections to this site are available
  • Checks the KCC values for “IntersiteFailuresAllowed” and “MaxFailureTimeForIntersiteLink” overrides within the registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

You must be careful with this test’s command-line arguments and always provide /a or /e. Not providing a site means that the test runs but skips actually testing anything (you can see this under /v).

All tests use RPC over the network to test the replication aspects and will make registry connections (RPC over SMB to WinReg) to check for those NTDS settings override entries. LDAP is also used to locate connection info.

KccEvent

This test queries the Knowledge Consistency Checker on a DC for KCC errors and warnings generated in the Directory Services event log during the last 15 minutes. This 15 minute threshold is irrespective of the Repl topology update period (secs) registry value on the DC.

By default, remote connections to the event log are disabled by the Windows Server 2008/R2 firewall rules so this test will fail. KB2512643 covers enabling those rules to allow the test to succeed.

The test uses RPC, specifically with the EventLog Remoting Protocol.

KnowsOfRoleHolders

This test returns the DC's knowledge of the five Flexible Single Master Operation (FSMO) roles. The test does not inherently check all DCs' knowledge for consistency, but using the /e parameter would provide data sufficient to allow comparison.

The test uses RPC, calling DsListRoles within the Directory Replication Service (DRS) functions.
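
For example, to collect every DC's view for comparison, with a second opinion from NETDOM:

dcdiag /test:knowsofroleholders /e /v
netdom query fsmo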

MachineAccount

This test checks if:

  • The DC's computer account exists in AD
  • It’s within the Domain Controllers OU
  • It has the correct UserAccountControl flags for DCs
  • The correct ServerReference attributes are set
  • The minimum Service Principal Names are set. For those paying close attention, this is identical to one test aspect of CheckSecurityError; this is because they use the same internal test

This test also mentions two repair options:

  • /RecreateMachineAccount will recreate a missing DC computer object. This is not a recommended fix as it does not recreate any child objects of a DC, such as FRS and DFSR subscriptions. The best practice is to use a valid SystemState backup to authoritatively restore the DC's deleted object and child objects. If you do use this /RecreateMachineAccount option then the DC should then be gracefully demoted and promoted to repair all the missing relationships
  • /FixMachineAccount will add the UserAccountControl flags to a DCs computer object for “TRUSTED_FOR_DELEGATION” and “SERVER_TRUST_ACCOUNT”. It’s safe to use as a DC missing those bit flags will not function and it does not remove other bit flags present. Using this repair option is preferred over trying to set these flags yourself through ADSIEDIT or other LDAP editors

This test uses LDAP and RPC over SMB.

NCSecDesc

This test checks permissions on all the naming contexts (such as Schema, Configuration, etc.) on the source DC to validate that replication and connectivity will work between DCs. It makes sure that the “Enterprise Domain Controllers” and “Administrators” groups have the correct minimum permissions. This is the same test performed within CheckSecurityError.

The test uses LDAP.

NetLogons

This test is designed to:

  • Validate that the user running DCDIAG (and therefore in theory, all other users) can connect to and read the SYSVOL and NETLOGON shares without any security errors. It also checks IPC$, but inability to connect there would have broken many earlier tests
  • Verify that the Administrators, Authenticated Users, and Everyone group have the “access this computer from the network” privilege on the DC. If not, you’d see a ton of other errors here though, naturally

Both of these tests are also performed by CheckSecurityError.

The tests use SMB and RPC over SMB (through named pipes).

ObjectsReplicated

This test verifies that replication of a few key objects and attributes has occurred and displays up-to-dateness info if replication is stale. By default the two objects validated are:

  • The “CN=NTDS Settings” object of each DC exists and is up to date on all other DCs.
  • The “CN=<DC name>” object of each DC exists and is up to date on all other DCs.

This test is not valuable unless run with /e or /a as it just asks the DC about itself when those are not specified. Using /v will give more details on objects thought to be stale based on version.

You can also specify arbitrary objects to test with /objectdn /n, which can be useful after creating a “canary” object to validate replication.

The tests are done using RPC with Directory Replication Service (DRS) functions.
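
For example (the canary object's DN and the naming context below are hypothetical):

dcdiag /test:objectsreplicated /e
dcdiag /test:objectsreplicated /e /objectdn:"cn=canary,ou=Tests,dc=contoso,dc=com" /n:contoso.com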

OutboundSecureChannels

This test is designed to check external trusts. It will not run by default, and it fails even when provided correct /testdomain parameters, even when the secure channel validates with NLTEST.EXE, and even with a working external trust. It does state that the secure channel is valid but then mistakenly reports that there are no working trust objects. I’ll update this post when I find out more. This test should not be used.

RegisterLocatorDnsCheck

Validates many of the same aspects as the Dcpromo test. It requires the /dnsdomain switch to specify a domain that would be the target of registration; this can be a different domain than the current primary one. It specifically verifies:

  • If at least one network adapter has a primary DNS server set.
  • If you would have a disjoint namespace based on the DNS suffix
  • That the proposed authoritative DNS zone can be contacted
  • If dynamic DNS updates are possible for the server’s A record. It checks both the setting on the authoritative DNS zone as well as the client registry configuration of DnsUpdateOnAllAdapters and DisableDynamicUpdate
  • If an LDAP DClocator record (i.e. “_ldap._tcp.dc._msdcs.<domain>”) is returned when querying for existing forests
  • That the authoritative DNS zone can be contacted

The test uses DNS on the network.

Replications

This test checks all AD replication connection objects for all naming contexts on specified DC(s) to see:

  • If the last replication attempted was successful or returned an error
  • If replication is disabled
  • If replication latency is more than 12 hours

The tests are done with LDAP and RPC using DsReplicaGetInfo.

RidManager

This test validates that the RID Master FSMO role holder:

  • Can be located and contacted through a DsBind
  • Has valid RID pool values

This role must be online and accessible for DCs to be able to create security principals (users, computers, and groups) as well as for further DCs to be promoted within a domain.

The test uses LDAP and RPC.

Services

This test validates that various AD-dependent services are running, accessible, and set to specific start types:

  • RPCSS - Start Automatically – Runs in Shared Process
  • EVENTSYSTEM - Start Automatically - Runs in Shared Process
  • DNSCACHE - Start Automatically - Runs in Shared Process
  • NTFRS - Start Automatically - Runs in Own Process (if domain functional level is less than Windows Server 2008. Does not trigger on SYSVOL being replicated by FRS)
  • ISMSERV - Start Automatically - Runs in Shared Process
  • KDC - Start Automatically - Runs in Shared Process
  • SAMSS - Start Automatically - Runs in Shared Process
  • SERVER - Start Automatically - Runs in Shared Process
  • WORKSTATION - Start Automatically - Runs in Shared Process
  • W32TIME - Start Manually or Automatically - Runs in Shared Process
  • NETLOGON - Start Automatically - Runs in Shared Process

(If target is Windows Server 2008 or later)

  • NTDS - Start Automatically - Runs in Shared Process
  • DFSR - Start Automatically - Runs in Own Process (if domain functional level is Windows Server 2008 or greater. Does not trigger on SYSVOL being replicated by DFSR)

(If using SMTP-based AD replication)

  • IISADMIN - Start Automatically - Runs in Shared Process
  • SMTPSVC - Start Automatically - Runs in Shared Process

These are the “real” service names listed in HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services. If this test is specified when targeting Windows Server 2003 DCs it is expected to fail on RpcSs. See KB2512643.

The test uses RPC and the Service Control Manager remote protocol.

SystemLog

This test validates the System Event Log’s health by reading and printing entries from the past 60 minutes (stopping at computer startup timestamp if less than 60 minutes). Errors and warnings will be printed, with no evaluation done of them being expected or not – this is left to the DCDIAG user.

By default, remote connections to the event log are disabled by the Windows Server 2008/R2 firewall rules so this test will fail. KB2512643 covers enabling those rules to allow the test to succeed.

The test uses RPC, specifically with the EventLog Remoting Protocol.

Topology

This test checks that a server has a fully-connected AD replication topology. This test must be explicitly run. It checks:

The test uses DsReplicaSyncAll with the DS_REPSYNCALL_DO_NOT_SYNC flag, meaning that it analyzes and validates the replication topology without actually replicating changes. The test does not validate the availability of replication partners – having a partner offline will not cause failures in this test. Nor does it test whether the schedule is completely closed, preventing replication; to see those active replication results, use the Replications or CutoffServers tests.

The test uses RPC and LDAP.

VerifyEnterpriseReferences

This test verifies computer reference attributes for all DCs, including:

  • ServerReference attribute correct for a DC on cn=<DC name>,cn=<site>,cn=sites,cn=configuration,dc=<domain>
  • ServerReferenceBL attribute correct for a DC site object on a DC on cn=<DC Name>,ou=domain controllers,dc=<domain>
  • frsComputerReference attribute correct for a DC site object on cn=domain system volume (sysvol share),cn=ntfrs subscriptions,cn=<DC Name>,ou=domain controllers,DC=<domain>
  • frsComputerReferenceBL attribute correct for a DC object on cn=<DC Name>,cn=domain system volume (sysvol share),cn=file replication service,cn=system,dc=<domain>
  • hasMasterNCs attribute correct for a DC on cn=ntds settings,cn=<DC Name>,cn=<site>,cn=sites,cn=configuration,dc=<domain>
  • nCName attribute correct for a partition at cn=<partition name>,cn=partitions,cn=configuration,dc=<domain>
  • msDFSR-ComputerReference attribute correct for a DC DFSR replication object on cn=<DC Name>,cn=topology,cn=domain system volume,cn=dfsr-globalsettings,cn=system,dc=<domain>
  • msDFSR-ComputerReferenceBL attribute correct for a DC site object on a DC on cn=<DC Name>,ou=domain controllers,dc=<domain>

Note that the two DFSR tests are only performed if domain functional level is Windows Server 2008 or higher. This means there will be an expected failure if DFSR has not been migrated to SYSVOL as the test does not actually care if FRS is still in use.

The test uses LDAP. The DCs are not all individually contacted; only the specified DCs are contacted.

VerifyReferences

This test verifies computer reference attributes for a single DC, including:

  • ServerReference attribute correct for a DC on cn=<DC name>,cn=<site>,cn=sites,cn=configuration,dc=<domain>
  • ServerReferenceBL attribute correct for a DC site object on a DC on cn=<DC Name>,ou=domain controllers,dc=<domain>
  • frsComputerReference attribute correct for a DC site object on cn=domain system volume (sysvol share),cn=ntfrs subscriptions,cn=<DC Name>,ou=domain controllers,DC=<domain>
  • frsComputerReferenceBL attribute correct for a DC object on cn=<DC Name>,cn=domain system volume (sysvol share),cn=file replication service,cn=system,dc=<domain>
  • msDFSR-ComputerReference attribute correct for a DC DFSR replication object on cn=<DC Name>,cn=topology,cn=domain system volume,cn=dfsr-globalsettings,cn=system,dc=<domain>
  • msDFSR-ComputerReferenceBL attribute correct for a DC site object on a DC on cn=<DC Name>,ou=domain controllers,dc=<domain>

This is similar to the VerifyEnterpriseReferences test except that it does not check partition cross references or all other DC objects.

The test uses LDAP.

VerifyReplicas

This test verifies that the specified server does indeed host the application partitions specified by its crossref attributes in the partitions container. It operates exactly like CheckSDRefDom except that it does not show output data and validates hosting.

This test uses LDAP.

 

That’s all folks.

- Ned “that was seriously un-fun to write” Pyle


What is the Impact of Upgrading the Domain or Forest Functional Level?

Hello all, Jonathan here again. Today, I want to address a question that we see regularly. As customers upgrade Active Directory and inevitably reach the point where they are ready to change the Domain or Forest Functional Level, they sometimes become fraught with worry. Why is this necessary? What does this mean? What’s going to happen? How can this change be undone?

What Does That Button Do?

Before these question can be properly addressed, if must first be understood exactly what purposes the Domain and Forest Functional Levels serve. Each new version of Active Directory on Windows Server incorporates new features that can only be taken advantage of when all domain controllers (DC) in either the domain or forest have been upgraded to the same version. For example, Windows Server 2008 R2 introduces the AD Recycle Bin, a feature that allows the Administrator to restore deleted objects from Active Directory. In order to support this new feature, changes were made in the way that delete operations are performed in Active Directory, changes that are only understood and adhered to by DCs running on Windows Server 2008 R2. In mixed domains, containing both Windows Server 2008 R2 DCs as well as DCs on earlier versions of Windows, the AD Recycle Bin experience would be inconsistent as deleted objects may or may not be recoverable depending on the DC on which the delete operation occurred. To prevent this, a mechanism is needed by which certain new features remain disabled until all DCs in the domain, or forest, have been upgraded to the minimum OS level needed to support them.

After upgrading all DCs in the domain, or forest, the Administrator is able to raise the Functional Level, and this Level acts as a flag informing the DCs, and other components as well, that certain features can now be enabled. You'll find a complete list of Active Directory features that have a dependency on the Domain or Forest Functional Level here:

Appendix of Functional Level Features
http://technet.microsoft.com/en-us/library/understanding-active-directory-functional-levels(WS.10).aspx

There are two important restrictions of the Domain or Forest Functional Level to understand, and once they are understood, these restrictions are obvious. Once the Functional Level has been upgraded, new DCs running on downlevel versions of Windows Server cannot be added to the domain or forest. The problems that might arise when installing downlevel DCs become pronounced with new features that change the way objects are replicated (for example, Linked Value Replication). To prevent these issues from arising, a new DC must be at the same level as, or greater than, the functional level of the domain or forest.

The second restriction, for which there is a limited exception on Windows Server 2008 R2, is that once upgraded, the Domain or Forest Functional Level cannot later be downgraded. The only purpose that having such ability would serve would be so that downlevel DCs could be added to the domain. As has already been shown, this is generally a bad idea.

Starting in Windows Server 2008 R2, however, you do have a limited ability to lower the Domain or Forest Functional Levels. The Windows Server 2008 R2 Domain or Forest Functional level can be lowered to Windows Server 2008, and no lower, if and only if none of the Active Directory features that require a Windows Server 2008 R2 Functional Level has been activated. You can find details on this behavior - and how to revert the Domain or Forest Functional Level - here.

What Happens Next?

Another common question: what impact does changing the Domain or Forest Functional Level have on enterprise applications like Exchange or Lync, or on third party applications? First, new features that rely on the Functional Level are generally limited to Active Directory itself. For example, objects may replicate in a new and different way, aiding in the efficiency of replication or increasing the capabilities of the DCs. There are exceptions that have nothing to do with Active Directory, such as allowing NTFRS replacement by DFSR to replicate SYSVOL, but there is a dependency on the version of the operating system. Regardless, changing the Domain or Forest Functional Level should have no impact on an application that depends on Active Directory.

Let's fall back on a metaphor. Imagine that Active Directory is just a big room. You don't actually know what is in the room, but you do know that if you pass something into the room through a slot in the locked door you will get something returned to you that you could use. When you change the Domain or Forest Functional Level, what you can pass in through that slot does not change, and what is returned to you will continue to be what you expect to see. Perhaps some new slots are added to the door through which you pass in different things, and get back different things, but that is the extent of any change. How Active Directory actually processes the stuff you pass in to produce the stuff you get back, what happens behind that locked door, really isn't relevant to you.

If you carry this metaphor forward into the real world, then if an application like Exchange uses Active Directory to store its objects, or to perform various operations, none of that functionality should be affected when the Domain or Forest Functional Level changes. In fact, if your applications are also written to take advantage of new features introduced in Active Directory, you may find that the capabilities of your applications increase when the Level changes.

The answer to the question about the impact of changing the Domain or Forest Functional Level is that there should be no impact. If you still have concerns about any third party applications, then you should contact the vendor to find out if they tested the product at the proposed Level, and if so, with what result. The general expectation, however, should be that nothing will change. Besides, you do test your applications against proposed changes to your production AD, do you not? Discuss any issues with the vendor before engaging Microsoft Support.

Where’s the Undo Button?

Even after all this, however, there is great concern about the change being irreversible, such that you must have a rollback plan just in case something unforeseen and catastrophic occurs to Active Directory. This is another common question, and there is a supported mechanism to restore the Domain or Forest Functional Level. You take a System State backup of one DC in each domain in the forest. To recover, flatten all the DCs in the forest, restore one for each domain from the backup, and then DCPROMO the rest back into their respective domains. This is a Forest Restore, and the steps are outlined in detail in the following guide:

Planning for Active Directory Forest Recovery
http://technet.microsoft.com/en-us/library/planning-active-directory-forest-recovery(WS.10).aspx

By the way, do you know how often we’ve had to help a customer perform a complete forest restore because something catastrophic happened when they raised the Domain or Forest Functional Level? Never.

Best Practices

What can be done prior to making this change to ensure that you have as few issues as possible? Actually, there are some best practices here that you can follow:

1. Verify that all DCs in the domain are, at a minimum, at the OS version to which you will raise the functional level. Yes… I know this sounds obvious, but you’d be surprised. What about that DC that you decommissioned but for which you failed to perform metadata cleanup? Yes, this does happen.
Another good one that is not so obvious is the Lost and Found container in the Configuration container. Is there an NTDS Settings object in there for some downlevel DC? If so, that will block raising the Domain Functional Level, so you’d better clean that up.

2. Verify that Active Directory is replicating properly to all DCs. The Domain and Forest Functional Levels are essentially just attributes in Active Directory. The Domain Functional Level for all domains must be properly replicated before you’ll be able to raise the Forest Functional level. This practice also addresses the question of how long one should wait to raise the Forest Functional Level after you’ve raised the Domain Functional Level for all the domains in the forest. Well…what is your end-to-end replication latency? How long does it take a change to replicate to all the DCs in the forest? Well, there’s your answer.

Best practices are covered in the following article:

322692 How to raise Active Directory domain and forest functional levels
http://support.microsoft.com/default.aspx?scid=kb;EN-US;322692

There, you’ll find some tools you can use to properly inventory your DCs, and validate your end-to-end replication.
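
If you just want a quick-and-dirty version with tools you already have, something along these lines covers both checks (a sketch – the AD PowerShell inventory assumes at least one Win2008 R2 DC or the AD Management Gateway; repadmin ships in the AD DS tools):

Get-ADDomainController -Filter * | Sort-Object OperatingSystem | Format-Table Name,OperatingSystem,OperatingSystemServicePack
repadmin /replsummary
repadmin /showrepl * /csv > replication.csv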

Update: Woo, we found an app that breaks! It has a hotfix though (thanks Paolo!). Make sure you install this everywhere if you are using .NET 3.5 applications that use the DomainMode enumeration.

FIX: "The requested mode is invalid" error message when you run a managed application that uses the .NET Framework 3.5 SP1 or an earlier version to access a Windows Server 2008 R2 domain or forest  
http://support.microsoft.com/kb/2260240

Conclusion

To summarize, the Domain or Forest Functional Levels are flags that tell Active Directory and other Windows components that all DCs in the domain or forest are at a certain minimum level. When that occurs, new features that require a minimum OS on all DCs are enabled and can be leveraged by the Administrator. Older functionality is still supported so any applications or services that used those functions will continue to work as before -- queries will be answered, domain or forest trusts will still be valid, and all should remain right with the world. This projection is supported by over eleven years of customer issues, not one of which has had a change to the Domain or Forest Functional Level as its root cause. In fact, the only related cases involve a Domain or Forest Functional Level increase failing because the prerequisites had not been met; overwhelmingly, these cases end with the customer's Active Directory being successfully upgraded.

If you want to read more about Domain or Forest Functional Levels, review the following documentation:

What Are Active Directory Functional Levels?
http://technet.microsoft.com/en-us/library/cc787290(WS.10).aspx

Functional Levels Background Information
http://technet.microsoft.com/en-us/library/cc738038(WS.10).aspx

Jonathan “Con-Function Junction” Stephens


Last Week Before Vista and Win2008 SP1 Support Ends

Last chance folks. Windows Vista and Windows Server 2008 Service Pack 1 support ends on July 12. As in, one week from now. This means computers running SP1 won’t get security updates after the next Patch Tuesday.


The bomb represents malware. Or your next review…

For those not running WSUS or SCCM, grab SP2 here:

Not sure which computers are still running SP1? Check out these inventory techniques you can use to find SP1 computers in the domain using AD PowerShell; all you need is one Win7 computer or Win2008 R2 DC. Only have older DCs? That’s ok, use the AD Management Gateway. Think PowerShell is t3h sux 4 n00b$? That’s ok, Amish IT, use CSVDE.EXE to get a list back from your DCs that you can examine in Excel. For example, all the Vista non-SP2 computers:

csvde -f c:\sp.csv -p subtree -d "dc=contoso,dc=com" -r "(&(operatingsystem=windows vista*)(!operatingsystemservicepack=Service Pack 2))" -l operatingsystem,operatingsystemservicepack -u
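
If you'd rather skip CSVDE, a rough AD PowerShell equivalent of that query looks like this (a sketch, assuming the ActiveDirectory module is loaded; the output path is just an example):

Get-ADComputer -Filter {OperatingSystem -like "Windows Vista*" -and OperatingSystemServicePack -ne "Service Pack 2"} -Properties OperatingSystem,OperatingSystemServicePack | Export-Csv c:\sp.csv -NoTypeInformation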


No complaints that this finds stale computers and doesn’t tell you IP addresses – AD PowerShell does all that and bakes delicious pies.

Windows Server 2008 shipped with SP1 built-in as it were, so there is no way for it to have no service pack at all. Whatever you do, it’s closing time for SP1. You don’t have to go home but you can’t stay here.

Ned “I recently upgraded to Windows 2000 – it’s pretty slick!” Pyle


Windows Server 2008 SP1 and Windows Vista SP1 now UNSUPPORTED


Cluster and Stale Computer Accounts

Hi, Mike here again. Today, I want to write about a common administrative task that can lead to disaster: removing stale computer accounts from Active Directory.

Removing stale computer accounts is simply good hygiene-- it’s the brushing and flossing of Active Directory. Like tartar, computer accounts have the tendency to build up until they become a problem (difficult to identify and remove, and can lead to lengthy backup times).

Oops… my bad

Many environments separate administrative roles. The Active Directory administrator is not the Cluster Administrator. Each role holder performs their duties in a somewhat isolated manner-- the Cluster admins do their thing and the AD admins do theirs. The AD admin cares about removing stale computer accounts. The cluster admin does not… until the AD admin accidentally deletes a computer account associated with a functioning Failover Cluster because it looks like a stale account.

Unexpected deletion of a Cluster Name Object (CNO) or Virtual Computer Object (VCO) is one of the top issues worked by our engineers that support Clustering and High-Availability. Everyone does their job and boom-- clustered servers stop working because the CNOs or VCOs are missing. What to do?

What's wrong here

I'll paraphrase an article posted on the Clustering and High-Availability TechNet blog that solves this scenario. Typically, domain admins key on two different attributes to determine if a computer account is stale: pwdLastSet and lastLogonTimeStamp. Domains that are not configured to the Windows Server 2003 Domain Functional Level use the pwdLastSet attribute. However, domains configured to the Windows Server 2003 Domain Functional Level or later should use the lastLogonTimeStamp attribute. What you may not know is that a Failover Cluster (CNO and VCO) does not update the lastLogonTimeStamp the same way as a real computer.

Cluster updates the lastLogonTimeStamp when it brings a clustered network name resource online. Once online, it caches the authentication token. Therefore, a clustered network name resource working in production for months will never update the lastLogonTimeStamp. This appears as a stale computer account to the AD administrator. Being a good citizen, the AD administrator deletes the stale computer account that has not logged on in months. Oops.

The Solution

There are a few things that you can do to avoid this situation.

  • Use the servicePrincipalName attribute in addition to the lastLogonTimeStamp attribute when determining stale computer accounts. If any variation of MSClusterVirtualServer appears in this attribute, then leave the computer account alone and consult with the cluster administrator (a query along these lines is sketched after this list).
  • Encourage the Cluster administrator to use the -CleanupAD option when destroying a cluster, so the computer accounts that are no longer needed get deleted.
  • If you are using Windows Server 2008 R2, then consider implementing the Active Directory Recycle Bin. The concept is identical to the recycle bin for the file system, but for AD objects. The following ASKDS blogs can help you evaluate if AD Recycle Bin is a good option for your environment.
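
Here is a rough sketch of that first bullet in AD PowerShell – find computers that have not logged on in 90 days, but skip anything carrying a cluster SPN (the 90-day window and output fields are just examples):

$cutoff = (Get-Date).AddDays(-90)
Get-ADComputer -Filter {LastLogonTimeStamp -lt $cutoff} -Properties LastLogonTimeStamp,servicePrincipalName |
    Where-Object { -not ($_.servicePrincipalName -like "*MSClusterVirtualServer*") } |
    Select-Object Name,@{Name="LastLogon";Expression={[DateTime]::FromFileTime($_.LastLogonTimeStamp)}}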

Mike "Four out of Five AD admins recommend ASKDS" Stephens


Friday Mail Sack: They Pull Me Back in Edition

Hiya world, Ned is back with your best questions and comments. I’ve been off to teach this fall’s MCM, done Win8 stuff, and generally been slacking… er, keeping busy; sorry for the delay in posting. That means a hefty backlog - get ready to slurp.

Today we talk:

I know it was you, Fredo.

Question

If I run netdom query dc only writable DCs are returned. If I instead run nltest /dclist:contoso.com, both writable and RODCs are returned. Is it by design that netdom can't find RODC?

Answer

It’s by design, but not by any specific intention. Netdom was written for NT 4.0 and uses a very old function when you invoke QUERY DC, which means that if a domain controller is not of type SV_TYPE_DOMAIN_CTRL or SV_TYPE_DOMAIN_BAKCTRL, it is not shown in the list. Effectively, it queries for all the DCs just like Nltest, but it doesn’t know what RODCs are, so it won’t show them to you.

Nltest is old too, but its owners have updated it more consistently. When it returns all the DCs (using what amounts to the same lookup functions), it knows modern information. For instance, when it became a Win2008 tool, its owners updated it to use the DS_DOMAIN_CONTROLLER_INFO_3 structure, which is why it can tell you the FQDN, which servers are RODCs, who the PDCE is, and what sites map to each server.


When all this new RODC stuff came about, the developers either forgot about Netdom or more likely, didn’t feel it necessary to update both with redundant capabilities – so they updated Nltest only. Remember that these were formerly out-of-band support tools that were not owned by the Windows team until Vista/2008 – in many cases, the original developers had been gone for more than a decade.

Now that we’ve decided to make PowerShell the first class citizen, I wouldn’t expect any further improvements in these legacy utilities.

Question

We’re trying to use DSRevoke on Win2008 R2 to enumerate access control entries. We are finding it spits out: “Error occurred in finding ACEs.” This seems to have gone belly up in Server 2008. Is this tool in fact deprecated, and if so do you know of a replacement?

Answer

According to the download page, it only works on Win2003 (Win2000 being its original platform, and being dead). It’s not an officially supported tool in any case – just made by some random internal folks. You might say it was deprecated the day it released. :)

I also find that it fails as you said on Win2008 R2, so you are not going crazy. As for why it’s failing on 2008 and 2008 R2, I have not the foggiest idea, and I cannot find any info on who created this tool or if it even still has source code (it is not in the Windows source tree, I checked). I thought at first it might be an artifact of User Account Control, but even on a Win2008 R2 Core server, it is still a spaz.

I don’t know of any purpose-built replacements, although if I want to enumerate access on OUs (or anything), I’d use AD PowerShell and Get-ACL. For example, a human-readable output:

Import-Module ActiveDirectory

cd AD:

Get-Acl (Get-ADObject "<some DN in quotes>") | Format-List


Or to get all the OUs:

Get-Acl (Get-ADOrganizationalUnit -Filter *) | Format-List


Or fancy spreadsheets using select-object and export-csv (note – massaged in Excel, it won’t come out this purty):
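
Something like this gets you there (a sketch – it assumes you have already imported the module and changed to the AD: drive as above, and the output path is just an example):

Get-ADOrganizationalUnit -Filter * | ForEach-Object {
    $ou = $_.DistinguishedName
    (Get-Acl $ou).Access | Select-Object @{Name="OU";Expression={$ou}},IdentityReference,ActiveDirectoryRights,AccessControlType
} | Export-Csv c:\temp\ou-acls.csv -NoTypeInformation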



Or whatever. The world is your oyster at that point.

You can also use Dsacls.exe, but it’s not as easy to control the output. And there are the fancy/free Quest AD PowerShell tools, but I can’t speak to them (Get-QADPermission is the cmdlet for this).

Question

We are thinking about removing evil WINS name resolution from our environment. We hear that this has been done successfully in several organizations. Is there anything we need to watch out for in regards to Active Directory infrastructure? Are there any gotchas you've seen with environments in general? Also, it seems that the days of WINS may be numbered. Can you offer any insight into this?

Answer

Nothing “current” in Windows has any reliance on WINS resolution – even the classic components like DFS Namespaces have long ago offered DNS alternatives - but legacy products may still need it. I’m not aware of any list of Microsoft products with all dependencies, but we know Exchange 2003 and 2007 require it, for instance (and 2010 does not). Anything here that requires port 137 Netbios name resolution may fail if it doesn’t also use DNS. Active Directory technologies do not need it; they are all from the DNS era.

A primary limitation of WINS and NetBT is that they do not support IPv6, so anything written for Server 2008 and up wouldn’t have been tested without DNS-only resolution. If you have legacy applications with WINS dependency for specific static records, and they are running at least Server 2008 for DNS, you can replace the single-label resolution functionality provided by WINS with the DNS GlobalNames zone. See http://technet.microsoft.com/en-us/library/cc731744.aspx. Do not disable the TCP/IP NetBIOS Helper service on any computers, even if you get rid of WINS. All heck will break loose.
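
If you do go the GlobalNames route, deployment boils down to roughly the following (a sketch based on the linked article – server, alias and target names are examples; the first command needs to run on every authoritative DNS server, while the zone only gets created once):

dnscmd DC01 /config /enableglobalnamessupport 1
dnscmd DC01 /zoneadd GlobalNames /dsprimary /dp /forest
dnscmd DC01 /recordadd GlobalNames legacyapp CNAME legacyapp.contoso.com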

Rest assured that WINS is still included in the Windows 8 Server Developer Preview, and Microsoft itself still runs many WINS servers; odds are good that you have at least 12 more years of WINS in your future. Yay!

I expect to hear horror stories in the Comments…

Question

What is the expected behavior with respect to any files created in DFSR-replicated folders if they're made prior to initial sync completion? I.e. data in the replicated folder is added or modified on the non-authoritative server during the initial sync?

Answer

  1. If it’s a brand new file created by the user on the downstream, or if the file has already “replicated” from the upstream (meaning that its hash and File ID are now recorded by the downstream server, not that the file actually replicates) and is later changed by the user before initial replication is fully complete, nothing “bad” happens. Once initial sync completes, their original changes and edits will replicate back outbound without issues.
  2. If the user has bad timing and starts modifying existing pre-seeded files that have not yet had their file ID and hashes replicated (which would probably take a really big dataset combined with a really poor network), their files will get conflicted and changes wiped out, in favor of the upstream server.

Question

During initial DFSR replication of a lot of data, I often see debug log messages like:

20111028 17:06:30.308 9092 CRED   105 CreditManager::GetCredits [CREDIT] No update credits available. Suspending Task:00000000010D3850 listSize:1 this:00000000010D3898

 

20111028 17:06:30.308 9092 IINC   281 IInConnectionCreditManager::GetCredits [CREDIT] No connection credits available, queuing request.totalConnectionCreditsGranted:98 totalGlobalCreditsGranted:98 csId:{6A576AEE-561E-8F93-8C99-048D2348D524} csName:GooconnId:{B34747C-4142-478F-96AF-D2121E732B16} sessionTaskPtr:000000000B4D5040

And just what are DFSR “Credits?” Does this amount just control how many files can be replicated to a partner before another request has to be made?  Is it a set amount for a specific amount of time per server?

Answer

Not how many files, per se - how many updates. A credit maps to a "change" - create, modify, delete.  All the Credit Manager code does is allow an upstream server to ration out how many updates each downstream server can request in a batch. Once that pool is used up, the downstream can ask again. It ensures that one server doesn't get to replicate all the time and other servers never replicate - except in Win2003/2008, this still happened. Because we suck. In Win2008 R2, the credit manager now correctly puts you to the back of the queue if you just showed up asking for more credits, and gives other servers a chance. As an update replicates, a credit is "given back" until your list is exhausted. It has nothing to do with time, just work.

"No update credits available" is normal and expected if you are replicating a bung-load of updates. And in initial sync, you are.

Question

The registry changes I made after reading your DFSR tuning article made a world of difference. I do have a question though: is the max number of replicating server only 64?

Answer

Not the overall max, just the max simultaneously. I.e. 64 servers replicating a file at this exact instant in time. We have some customers with more than a thousand replicating servers (thankfully, using pretty static data).

Question

Can members of the Event Log Readers group automatically access all event logs?

Answer

Almost all. To see the security on any particular event log, you can use wevtutil gl <log name>. For example:

wevtutil gl security


Note the S-1-5-32-573 SID there on the end – that is the Event Log Readers well-known built-in SID. If you wanted to see the security on all your event logs, you could use this in a batch file (wraps):

@echo off

if exist %temp%\eventlistmsft.txt del %temp%\eventlistmsft.txt

if exist %temp%\eventlistmsft2.txt del %temp%\eventlistmsft2.txt

Wevtutil el > %temp%\eventlistmsft.txt

For /f "delims=;" %%i in (%temp%\eventlistmsft.txt) do wevtutil gl "%%i" >> %temp%\eventlistmsft2.txt

notepad %temp%\eventlistmsft2.txt
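
Or, if you prefer PowerShell to batch, a rough one-liner equivalent:

wevtutil el | ForEach-Object { "### $_" ; wevtutil gl "$_" } | Out-File "$env:temp\eventlistmsft2.txt" ; notepad "$env:temp\eventlistmsft2.txt"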

My own quick look showed that a few do not ACL with that group – Internet Explorer, Microsoft-Windows-CAPI2, Microsoft-Windows-Crypto-RNG, Group Policy, Microsoft-Windows-Firewall with advanced security. IE seems like an accident, but the others were likely just considered sensitive by their developers.

Other stuff

Happy Birthday to Bill Gates and to Windows XP. You’re equally responsible for nearly every reader or writer of this blog having a job. And in my case, one not digging ditches. So thanks, you crazy kids.

The ten best Jeremy Clarkson Top Gear lines… in the world!

Halloween Part 1: Awesome jack-o-lantern templates, courtesy of ThinkGeek. Yes, they have NOTLD!

Halloween Part 2: Dogs in costume, courtesy of Bing. The AskDS favorite, of course, is:


 

Thanks to Japan, you can now send your boss the most awesome emoticon ever, when you fix an issue but couldn’t get root cause:

¯\_(ツ)_/¯

Pluto returning to planet status? It better be; that do-over was lame…

Finally – my new favorite place to get Sci-Fi and Fantasy pics is Cgsociety. Check out some of 3D and 2D samples from the Showcase Gallery:

 

That last one makes a great lock screen

Have a great weekend, folks.

- Ned “They hit him with five shots and he's still alive!” Pyle


Does your logon hang after a password change on win 8.1 /2012 R2/win10?

Hi, Linda Taylor here, Senior Escalation Engineer from the Directory Services team in the UK.

I have been working on this issue which seems to be affecting many of you globally on windows 8.1, 2012 R2 and windows 10, so I thought it would be a good idea to explain the issue and workarounds while we continue to work on a proper fix here.

The symptoms are such that after a password change, logon hangs forever on the welcome screen:


How annoying….

The underlying issue is a deadlock between several components including DPAPI and the redirector.

For full details of the issue, workarounds and related fixes, check out my post on the ASKPFEPLAT blog here: http://blogs.technet.com/b/askpfeplat/archive/2016/01/11/does-your-win-8-1-2012-r2-win10-logon-hang-after-a-password-change.aspx

This is now fixed in the following updates:

Windows 8.1, 2012 R2, 2012 install:

For Windows 10 TH2 build 1511 install:

I hope this helps,

Linda

Previewing Server 2016 TP4: Temporary Group Memberships

Disclaimer: Windows Server 2016 is still in a Technical Preview state – the information contained in this post may become inaccurate in the future as the product continues to evolve. More specifically, there are still issues being ironed out in other parts of Privileged Access Management in Technical Preview 4 for multi-forest deployments.   Watch for more updates as we get closer to general availability!

Hello, Ryan Ries here again with some juicy new Active Directory hotness. Windows Server 2016 is right around the corner, and it’s bringing a ton of new features and improvements with it. Today we’re going to talk about one of the new things you’ll be seeing in Active Directory, which you might see referred to as “expiring links,” or what I like to call “temporary group memberships.”

One of the challenges that every security-conscious Active Directory administrator has faced is how to deal with contractors, vendors, temporary employees and anyone else who needs temporary access to resources within your Active Directory environment. Let’s pretend that your Information Security team wants to perform an automated vulnerability scan of all the devices on your network, and to do this, they will need a service account with Domain Administrator privileges for 5 business days. Because you are a wise AD administrator, you don’t like the idea of this service account that will be authenticating against every device on the network having Domain Administrator privileges, but the CTO of the company says that you have to give the InfoSec team what they want.

(Trust me, this stuff really happens.)

So you strike a compromise, claiming that you will grant this service account temporary membership in the Domain Admins group for 5 days while the InfoSec team conducts their vulnerability scan. Now you could just manually remove the service account from the group after 5 days, but you are a busy admin and you know you’re going to forget to do that. You could also set up a scheduled task to run after 5 days that runs a script that removes the service account from the Domain Admins group, but let’s explore a couple of more interesting options.

The Old Way

One old-school way of accomplishing this is through the use of dynamic objects in 2003 and later. Dynamic objects are automatically deleted (leaving no tombstone behind) after their entryTTL expires. Using this knowledge, our plan is to create a security group called “Temp DA for InfoSec” as a dynamic object with a TTL (time-to-live) of 5 days. Then we’re going to put the service account into the temporary security group. Then we are going to add the temporary security group to the Domain Admins group. The service account is now a member of Domain Admins because of the nested group membership, and once the temporary security group automatically disappears in 5 days, the nested group membership will be broken and the service account will no longer be a member of Domain Admins.

Creating dynamic objects is not as simple as just right-clicking in AD Users & Computers and selecting “New > Dynamic Object,” but it’s still pretty easy if you use ldifde.exe and a simple text file. Below is an example:


Figure 1: Creating a Dynamic Object with ldifde.exe.

dn: cn=Temp DA For InfoSec,ou=Information Security,dc=adatum,dc=com
changeType: add
objectClass: group
objectClass: dynamicObject
entryTTL: 432000
sAMAccountName: Temp DA For InfoSec

In the text file, just supply the distinguished name of the security group you want to create, and make sure it has both the group objectClass and the dynamicObject objectClass. I set the entryTTL to 432000 in the screen shot above, which is 5 days in seconds. Import the object into AD using the following command:
  ldifde -i -f dynamicGroup.txt

Now if you go look at the newly-created group in AD Users & Computers, you’ll see that it has an entryTTL attribute that is steadily counting down to 0:


Figure 2: Dynamic Security Group with an expiry date.
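
You can also watch the countdown from PowerShell. entryTTL is a constructed attribute, so you have to ask for it explicitly (a sketch using the example DN from above):

Get-ADObject "cn=Temp DA For InfoSec,ou=Information Security,dc=adatum,dc=com" -Properties entryTTL | Select-Object Name,entryTTL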

You can create all sorts of objects as Dynamic Objects by the way, not just groups. But enough about that. We came here to see how the situation has improved in Windows Server 2016. I think you’ll like it better than the somewhat convoluted Dynamic Objects solution I just described.

The New Hotness (Windows Server 2016 Technical Preview 4, version 1511.10586.122)

For our next trick, we’ll need to enable the Privileged Access Management Feature in our Windows Server 2016 forest. This is an AD optional feature, much like the AD Recycle Bin. Keep in mind that just like the AD Recycle Bin, once you enable the Privileged Access Management feature in your forest, you can’t turn it off. This feature also requires a Windows Server 2016 or “Windows Threshold” forest functional level:


Figure 3: This AD Optional Feature requires a Windows Server 2016 or “Windows Threshold” Forest Functional Level.

It’s easy to enable with PowerShell:
Enable-ADOptionalFeature 'Privileged Access Management Feature' -Scope ForestOrConfigurationSet -Target adatum.com

Now that you’ve done this, you can start setting time limits on group memberships directly. It’s so easy:
Add-ADGroupMember -Identity 'Domain Admins' -Members 'InfoSecSvcAcct' -MemberTimeToLive (New-TimeSpan -Days 5)

Now isn’t that a little easier and more straightforward? Our InfoSec service account now has temporary membership in the Domain Admins group for 5 days. And if you want to view the time remaining in a temporary group membership in real time:
Get-ADGroup 'Domain Admins' -Property member -ShowMemberTimeToLive


Figure 4: Viewing the time-to-live on a temporary group membership.

So that’s cool, but in addition to convenience, there is a real security benefit to this feature that we’ve never had before. I’d be remiss not to mention that with the new Privileged Access Management feature, when you add a temporary group membership like this, the domain controller will actually constrain the Kerberos TGT lifetime to the shortest TTL that the user currently has. What that means is that if a user account only has 5 minutes left in its Domain Admins membership when it logs on, the domain controller will give that account a TGT that’s only good for 5 more minutes before it has to be renewed, and when it is renewed, the PAC (privilege attribute certificate) will no longer contain that group membership! You can see this in action using klist.exe:


Figure 5: My Kerberos ticket is only good for about 8 minutes because of my soon-to-expire group membership.

Awesome.

Lastly, it’s worth noting that this is just one small aspect of the upcoming Privileged Access Management feature in Windows Server 2016. There’s much more to it, like shadow security principals, bastion forests, new integrations with Microsoft Identity Manager, and more. Read more about what’s new in Windows Server 2016 here.

Until next time,

Ryan “Domain Admin for a Minute” Ries


Updated 3/21/16 with additional text in Disclaimer – “Disclaimer: Server 2016 is still in a Technical Preview state – the information contained in this post may become inaccurate in the future as the product continues to evolve.  More specifically, there are still issues being ironed out in other parts of Privileged Access Management in Technical Preview 4 for multi-forest deployments.   Watch for more updates as we get closer to general availability!”

Are your DCs too busy to be monitored?: AD Data Collector Set solutions for long report compile times or report data deletion

Hi all, Herbert Mauerer here. In this post we’re back to talk about the built-in AD Diagnostics Data collector set available for Active Directory Performance (ADPERF) issues and how to ensure a useful report is generated when your DCs are under heavy load.

Why are my domain controllers so busy you ask? Consider this: Active Directory stands in the center of the Identity Management for many customers. It stores the configuration information for many critical line of business applications. It houses certificate templates, is used to distribute group policy and is the account database among many other things. All sorts of network-based services use Active Directory for authentication and other services.

As mentioned there are many applications which store their configuration in Active Directory, including the details of the user context relative to the application, plus objects specifically created for the use of these applications.

There are also applications that use Active Directory as a store to synchronize directory data. There are products like Forefront Identity Manager (and now Microsoft Identity Manager) where synchronizing data is the only purpose. I will not discuss whether these applications are meta-directories or virtual directories, or what class our Office 365 DirSync belongs to…

One way or the other, the volume and complexity of Active Directory queries has a constant trend of increasing, and there is no end in sight.

So what are my Domain Controllers doing all day?

We get this question a lot from our customers. It often seems as if the AD Admins are the last to know what kind of load is put onto the domain controllers by scripts, applications and synchronization engines. And they are not made aware of even significant application changes.

But even small changes can have a drastic effect on the DC performance. DCs are resilient, but even the strongest warrior may fall against an overwhelming force.  Think along the lines of “death by a thousand cuts”.  Consider applications or scripts that run non-optimized or excessive queries on many, many clients during or right after logon and it will feel like a distributed DoS. In this scenario, the domain controller may get bogged down due to the enormous workload issued by the clients. This is one of the classic scenarios when it comes to Domain Controller performance problems.

What resources exist today to help you troubleshoot AD Performance scenarios?

We have already discussed the overall topic in this blog, and today many customer requests start with the complaint that the response times are bad and the LSASS CPU time is high. There also is a blog post specifically on the toolset we’ve had since Windows Server 2008. We also updated and brought back the Server Performance Advisor toolset. This toolset is now more targeted at trend analysis and base-lining.  If a video is more your style, Justin Turner revealed our troubleshooting process at Ignite.

The reports generated by this data collection are hugely useful for understanding what is burdening the Domain Controllers. Less common are cases where DCs respond slowly but show no significant utilization. We released a blog on that scenario and also gave you a simple method to troubleshoot long-running LDAP queries at our sister site.  So what’s new with this post?

The AD Diagnostic Data Collector set report “report.html” is missing or compile time is very slow

In recent months, we have seen an increasing number of customers with incomplete Data Collector Set reports. Most of the time, the “report.html” file is missing:

This is a folder where the creation of the report.html file was successful:


This folder has exceeded the limits for reporting:


Notice the report.html file is missing in the second folder example. Also take note that the ETL and BLG files are bigger. What’s the reason for this?

The Data Collector Set report generation process uncovered:

  • When the data collection ends, the process “tracerpt.exe” is launched to create a report for the folder where the data was collected.
  • “tracerpt.exe” runs with “below normal” priority so it does not get full CPU attention especially if LSASS is busy as well.
  • “tracerpt.exe” runs with one worker thread only, so it cannot take advantage of more than one CPU core.
  • “tracerpt.exe” accumulates RAM usage as it runs.
  • “tracerpt.exe” has six hours to complete a report. If it is not done within this time, the report is terminated.
  • The default settings of the built-in AD data collector delete the biggest data set first when the 1 gigabyte limit is exceeded. The biggest single file in a report folder is typically "Active Directory.etl". The report.html file will not get created if this file does not exist.

I worked with a customer recently with a pretty well-equipped Domain Controller (24 server-class CPUs, 256 GB RAM). The customer was kind enough to run a few tests for various report sizes, and found the following metrics:

  • Until the time-out of six hours is hit, “tracerpt.exe” consumes up to 12 GB of RAM.
  • During this time, one CPU core was allocated 100%. If a DC is in a high-load condition, you may want to increase the base priority of “tracerpt.exe” to get the report to complete. This comes at the expense of CPU time, potentially impacting the server’s primary workload and, in turn, its clients.
  • The biggest data set that could be completed within the six hours had an “Active Directory.etl” of 3 GB.

If you have lower-spec and busier machines, you shouldn’t expect the same results as this example (On a lower spec machine with a 3 GB ETL file, the report.html file would likely fail to compile within the 6-hour window).
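
If you do decide to bump the priority of “tracerpt.exe” while a report compiles on a busy DC, a minimal sketch from an elevated PowerShell prompt looks like this (weigh it against the extra CPU pressure on LSASS):

Get-Process tracerpt | ForEach-Object { $_.PriorityClass = "AboveNormal" }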

What a bummer, how do you get Performance Logging done then?

Fortunately, there are a number of parameters for a Data Collector Set that come to the rescue. Before you can use any of them, you first need to create a custom Data Collector Set. You can play with a variety of settings, based on the purpose of the collection.

In Performance Monitor you can create a custom set on the “User Defined” folder by right-clicking it, to bring up the New -> Data Collector Set option in the context menu:


This launches a wizard that prompts you for a number of parameters for the new set.

The first thing it wants is a name for the new set:


The next step is to select a template. It may be one of the built-in templates or one exported from another computer as an XML file you select through the “Browse” button. In our case, we want to create a clone of “Active Directory Diagnostics”:


The next step is optional, and it specifies the storage location for the reports. You may want to select a volume with more space or lower IO load than the default volume:


There is one more page in the wizard, but there is no reason to make any more changes here. You can click “Finish” on this page.

The default settings are fine for an idle DC, but if you find your ETL files are too large, your reports are not generated, or it takes too long to process the data, you will likely want to make the following configuration changes.

For a real “Big Data Collector Set” we first want to make important changes to the storage strategy of the set, which are available in the “Data Manager” settings:


The most relevant settings are “Resource Policy” and “Maximum Root Path Size”. I recommend starting with the settings as shown below:


Notice, I’ve changed the Resource policy from “Delete largest” to “Delete oldest”. I’ve also increased the Maximum root path size from 1024 to 2048 MB.  You can run some reports to learn what the best size settings are for you. You might very well end up using 10 GB or more for your reports.

The second crucial parameter for your custom sets is the run interval for the data collection. It is five minutes by default. You can adjust that in the properties of the collector in the “Stop Condition” tab. In many cases shortening the data collection is a viable step if you see continuous high load:


You should avoid going shorter than two minutes, as this is the maximum LDAP query duration by default. (If you have LDAP queries that reach this threshold, they would not show up in a report that is less than two minutes in length.) In fact, I would suggest the minimum interval be set to three minutes.

One very attractive option is automatically restarting the data collection when it exceeds a certain size. You need to use common sense when you look at the multiple resulting reports – for example, the ratio of long-running queries is then split across the logs – but it is definitely better than no report.

If you expect to exceed the 1 GB limit often, you certainly should adjust the total size of collections (Maximum root path size) in the “Data Manager”.

So how do I know how big the collection is while running it?

You can take a look at the folder of the data collection in Explorer, but you will notice it is pretty lazy updating it with the current size of the collection:


Explorer only updates the folder if you are doing something with the files. It sounds strange, but attempting to delete a file will trigger an update:


Now that makes more sense…

If you see the log is growing beyond your expectations, you can manually stop it before the stop condition hits the threshold you have configured:


Of course, you can also start and stop the reporting from a command line using the logman instructions in this post.
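
For a user-defined set like the one created above, that looks roughly like the following – just drop the -ets switch and use whatever name you gave your set (the name below is an example):

logman start "ADDS Big Collector"
logman stop "ADDS Big Collector"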

Room for improvement

We are aware there is room for improvement to get bigger data sets reported in a shorter time. The good news is that most of these special configuration changes won’t be needed once your DCs are running on Windows Server 2016. We will talk about that in a future post.

Thanks for reading.

Herbert

Setting up Virtual Smart card logon using Virtual TPM for Windows 10 Hyper-V VM Guests

Hello Everyone, my name is Raghav and I’m a Technical Advisor for one of the Microsoft Active Directory support teams. This is my first blog and today I’ll share with you how to configure a Hyper-V environment in order to enable virtual smart card logon to VM guests by leveraging a new Windows 10 feature: virtual Trusted Platform Module (TPM).

Here’s a quick overview of the terminology discussed in this post:
  • Smart cards are physical authentication devices, which improve on the concept of a password by requiring that users actually have their smart card device with them to access the system, in addition to knowing the PIN, which provides access to the smart card.
  • Virtual smart cards (VSCs) emulate the functionality of traditional smart cards, but instead of requiring the purchase of additional hardware, they utilize technology that users already own and are more likely to have with them at all times. Theoretically, any device that can provide the three key properties of smart cards (non-exportability, isolated cryptography, and anti-hammering) can be commissioned as a VSC, though the Microsoft virtual smart card platform is currently limited to the use of the Trusted Platform Module (TPM) chip onboard most modern computers. This blog will mostly concern TPM virtual smart cards.
    For more information, read Understanding and Evaluating Virtual Smart Cards.
  • Trusted Platform Module – (As Christopher Delay explains in his blog) TPM is a cryptographic device that is attached at the chip level to a PC, Laptop, Tablet, or Mobile Phone. The TPM securely stores measurements of various states of the computer, OS, and applications. These measurements are used to ensure the integrity of the system and software running on that system. The TPM can also be used to generate and store cryptographic keys. Additionally, cryptographic operations using these keys take place on the TPM preventing the private keys of certificates from being accessed outside the TPM.
  • Virtualization-based security – The following Information is taken directly from https://technet.microsoft.com/en-us/itpro/windows/keep-secure/windows-10-security-guide
    • One of the most powerful changes to Windows 10 is virtual-based security. Virtual-based security (VBS) takes advantage of advances in PC virtualization to change the game when it comes to protecting system components from compromise. VBS is able to isolate some of the most sensitive security components of Windows 10. These security components aren’t just isolated through application programming interface (API) restrictions or a middle-layer: They actually run in a different virtual environment and are isolated from the Windows 10 operating system itself.
    • VBS and the isolation it provides is accomplished through the novel use of the Hyper-V hypervisor. In this case, instead of running other operating systems on top of the hypervisor as virtual guests, the hypervisor supports running the VBS environment in parallel with Windows and enforces a tightly limited set of interactions and access between the environments. Think of the VBS environment as a miniature operating system: It has its own kernel and processes. Unlike Windows, however, the VBS environment runs a micro-kernel and only two processes called trustlets:
  • Local Security Authority (LSA) enforces Windows authentication and authorization policies. LSA is a well-known security component that has been part of Windows since 1993. Sensitive portions of LSA are isolated within the VBS environment and are protected by a new feature called Credential Guard.
  • Hypervisor-enforced code integrity verifies the integrity of kernel-mode code prior to execution. This is a part of the Device Guard feature.
VBS provides two major improvements in Windows 10 security: a new trust boundary between key Windows system components and a secure execution environment within which they run. A trust boundary between key Windows system components is enabled though the VBS environment’s use of platform virtualization to isolate the VBS environment from the Windows operating system. Running the VBS environment and Windows operating system as guests on top of Hyper-V and the processor’s virtualization extensions inherently prevents the guests from interacting with each other outside the limited and highly structured communication channels between the trustlets within the VBS environment and Windows operating system.
VBS acts as a secure execution environment because the architecture inherently prevents processes that run within the Windows environment – even those that have full system privileges – from accessing the kernel, trustlets, or any allocated memory within the VBS environment. In addition, the VBS environment uses TPM 2.0 to protect any data that is persisted to disk. Similarly, a user who has access to the physical disk is unable to access the data in an unencrypted form.
VBS requires a system that includes:
  • Windows 10 Enterprise Edition
  • A 64-bit processor
  • UEFI with Secure Boot
  • Second-Level Address Translation (SLAT) technologies (for example, Intel Extended Page Tables [EPT], AMD Rapid Virtualization Indexing [RVI])
  • Virtualization extensions (for example, Intel VT-x, AMD RVI)
  • I/O memory management unit (IOMMU) chipset virtualization (Intel VT-d or AMD-Vi)
  • TPM 2.0
Note: TPM 1.2 and 2.0 provides protection for encryption keys that are stored in the firmware. TPM 1.2 is not supported on Windows 10 RTM (Build 10240); however, it is supported in Windows 10, Version 1511 (Build 10586) and later.
Among other functions, Windows 10 uses the TPM to protect the encryption keys for BitLocker volumes, virtual smart cards, certificates, and the many other keys that the TPM is used to generate. Windows 10 also uses the TPM to securely record and protect integrity-related measurements of select hardware.



Now that we have the terminology clarified, let’s talk about how to set this up.


Setting up Virtual TPM
First we will ensure we meet the basic requirements on the Hyper-V host.
On the Hyper-V host, launch msinfo32 and confirm the following values:

The BIOS Mode should state “UEFI”.

Secure Boot State should be On.

Next, we will enable VBS on the Hyper-V host.
  1. Open up the Local Group Policy Editor by running gpedit.msc.
  2. Navigate to the following settings: Computer Configuration, Administrative Templates, System, Device Guard. Double-click Turn On Virtualization Based Security. Set the policy to Enabled and click OK.

Now we will enable Isolated User Mode on the Hyper-V host.
1. To do that, go to Run, type appwiz.cpl, and in the left pane find Turn Windows Features on or off.
Check Isolated User Mode, click OK, and then reboot when prompted.

This completes the initial steps needed for the Hyper-V host.
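
Once the host has rebooted, you can sanity-check that VBS actually came up by querying the Device Guard WMI class from PowerShell (a quick check – the exact fields returned vary by build):

Get-CimInstance -Namespace root\Microsoft\Windows\DeviceGuard -ClassName Win32_DeviceGuard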


Now we will enable support for virtual TPM on your Hyper-V VM guest
Note: Support for Virtual TPM is only included in Generation 2 VMs running Windows 10.
To enable this on your Windows 10 Generation 2 VM, open up the VM settings and review the configuration under the Hardware, Security section. Enable Secure Boot and Enable Trusted Platform Module should both be selected.
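
If you prefer to script this instead of using the VM settings UI, the equivalent Hyper-V PowerShell on the host looks roughly like this (the VM name is an example, and a local key protector is the simplest choice for a lab):

Set-VMFirmware -VMName "Win10-Gen2" -EnableSecureBoot On
Set-VMKeyProtector -VMName "Win10-Gen2" -NewLocalKeyProtector
Enable-VMTPM -VMName "Win10-Gen2"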

That completes the Virtual TPM part of the configuration. We will now move on to the virtual smart card configuration.

Setting up Virtual Smart Card
In the next section, we create a certificate template so that we can request a certificate that has the required parameters needed for Virtual Smart Card logon.
These steps are adapted from the following TechNet article: https://technet.microsoft.com/en-us/library/dn579260.aspx

Prerequisites and Configuration for Certificate Authority (CA) and domain controllers
  • Active Directory Domain Services
  • Domain controllers must be configured with a domain controller certificate to authenticate smartcard users. The following article covers Guidelines for enabling smart card logon: http://support.microsoft.com/kb/281245
  • An Enterprise Certification Authority running on Windows Server 2012 or Windows Server 2012 R2. Again, Chris’s blog neatly covers how to set up a PKI environment.
  • Active Directory must have the issuing CA in the NTAuth store to authenticate users to Active Directory.
Create the certificate template
1. On the CA console (certsrv.msc) right click on Certificate Templates and select Manage

2. Right-click the Smartcard Logon template and then click Duplicate Template

3. On the Compatibility tab, set the compatibility settings as below

4. On the Request Handling tab, in the Purpose section, select Signature and smartcard logon from the drop down menu

5. On the Cryptography tab, select the Requests must use one of the following providers radio button and then select the Microsoft Base Smart Card Crypto Provider option.

Optionally, you can use a Key Storage Provider (KSP) instead. To do so, under Provider Category select Key Storage Provider, then select the Requests must use one of the following providers radio button and select the Microsoft Smart Card Key Storage Provider option.

6. On the General tab: Specify a name, such as TPM Virtual Smart Card Logon. Set the validity period to the desired value and choose OK


7. Navigate to Certificate Templates. Right click on Certificate Templates and select New, then Certificate Template to Issue.  Select the new template you created in the prior steps.


Note that it usually takes some time for this certificate to become available for issuance.


Create the TPM virtual smart card

Next we’ll create a virtual Smart Card on the Virtual Machine by using the Tpmvscmgr.exe command-line tool.

1. On the Windows 10 Gen 2 Hyper-V VM guest, open an Administrative Command Prompt and run the following command:
tpmvscmgr.exe create /name myVSC /pin default /adminkey random /generate
You will be prompted for a PIN. Enter at least eight characters and confirm the entry. (You will need this PIN in later steps.)


Enroll for the certificate on the Virtual Smart Card on the Virtual Machine.
1. In certmgr.msc, right click Certificates, click All Tasks then Request New Certificate.

2. On the certificate enrollment select the new template you created earlier.

3. It will prompt for the PIN associated with the Virtual Smart Card. Enter the PIN and click OK.

4. If the request completes successfully, it will display the Certificate Installation Results page

5. On the virtual machine, select Sign-in options, select the security device, and enter the PIN
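
Before you try the logon, you can optionally confirm from inside the VM that the certificate really landed on the virtual smart card (you will be prompted for the PIN):

certutil -scinfo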

That completes the steps on how to deploy Virtual Smart Cards using a virtual TPM on virtual machines.  Thanks for reading!

Raghav Mahajan

The Version Store Called, and They’re All Out of Buckets

Hello, Ryan Ries back at it again with another exciting installment of esoteric Active Directory and ESE database details!

I think we need to have another little chat about something called the version store.

The version store is an inherent mechanism of the Extensible Storage Engine and a commonly seen concept among databases in general. (ESE is sometimes referred to as Jet Blue. Sometimes old codenames are so catchy that they just won’t die.) Therefore, the following information should be relevant to any application or service that uses an ESE database (such as Exchange,) but today I’m specifically focused on its usage as it pertains to Active Directory.

The version store is one of those details that the majority of customers will never need to think about. The stock configuration of the version store for Active Directory will be sufficient to handle any situation encountered by 99% of AD administrators. But for that 1% out there with exceptionally large and/or busy Active Directory deployments, (or for those who make “interesting” administrative choices,) the monitoring and tuning of the version store can become a very important topic. And quite suddenly too, as replication throughout your environment grinds to a halt because of version store exhaustion and you scramble to figure out why.

The purpose of this blog post is to provide up-to-date (as of the year 2016) information and guidance on the version store, and to do it in a format that may be more palatable to many readers than sifting through reams of old MSDN and TechNet documentation that may or may not be accurate or up to date. I can also offer more practical examples than you would probably get from straight technical documentation. There has been quite an uptick lately in the number of cases we’re seeing here in Support that center around version store exhaustion. While the job security for us is nice, knowing this stuff ahead of time can save you from having to call us and spend lots of costly support hours.

Version Store: What is it?

As mentioned earlier, the version store is an integral part of the ESE database engine. It’s an area of temporary storage in memory that holds copies of objects that are in the process of being modified, for the sake of providing atomic transactions. This allows the database to roll back transactions in case it can’t commit them, and it allows other threads to read from a copy of the data while it’s in the process of being modified. All applications and services that utilize an ESE database use version store to some extent. The article “How the Data Store Works” describes it well:

“ESE provides transactional views of the database. The cost of providing these views is that any object that is modified in a transaction has to be temporarily copied so that two views of the object can be provided: one to the thread inside that transaction and one to threads in other transactions. This copy must remain as long as any two transactions in the process have different views of the object. The repository that holds these temporary copies is called the version store. Because the version store requires contiguous virtual address space, it has a size limit. If a transaction is open for a long time while changes are being made (either in that transaction or in others), eventually the version store can be exhausted. At this point, no further database updates are possible.”

When Active Directory was first introduced, it was deployed on machines with a single x86 processor with less than 4 GB of RAM supporting NTDS.DIT files that ranged between 2MB and a few hundred MB. Most of the documentation you’ll find on the internet regarding the version store still has its roots in that era and was written with the aforementioned hardware in mind. Today, things like hardware refreshes, OS version upgrades, cloud adoption and an improved understanding of AD architecture are driving massive consolidation in the number of forests, domains and domain controllers in them, and DIT sizes are getting bigger… all while still relying on default configuration values from the Windows 2000 era.

The number-one killer of version store is long-running transactions. Transactions that tend to be long-running include, but are not limited to:

– Deleting a group with 100,000 members
– Deleting any object, not just a group, with 100,000 or more forward/back links to clean
– Modifying ACLs in Active Directory on a parent container that propagate down to many thousands of inheriting child objects
– Creating new database indices
– Having underpowered or overtaxed domain controllers, causing transactions to take longer in general
– Anything that requires boat-loads of database modification
– Large SDProp and garbage collection tasks
– Any combination thereof

I will show some examples of the errors that you would see in your event logs when you experience version store exhaustion in the next section.

Monitoring Version Store Usage

To monitor version store usage, leverage the Performance Monitor (perfmon) counter:

‘\\dc01\Database ==> Instances(lsass/NTDSA)\Version buckets allocated’


(Figure 1: The ‘Version buckets allocated’ perfmon counter.)
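
If you prefer to sample this counter from PowerShell rather than the perfmon UI, something like the following sketch should do it (DC01 is a placeholder for your domain controller name; the counter path is the same one shown above):

# Sample 'Version buckets allocated' every 5 seconds for one minute and print the raw values
Get-Counter -ComputerName DC01 -Counter '\Database ==> Instances(lsass/NTDSA)\Version buckets allocated' -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Select-Object -Property TimeStamp, CookedValue }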

The version store divides the amount of memory that it has been given into “buckets,” or “pages.” Version store pages need not (and in AD, they do not) equal the size of database pages elsewhere in the database. We’ll get into the exact size of these buckets in a minute.

During typical operation, when the database is not busy, this counter will be low. It may even be zero if the database really just isn’t doing anything. But when you perform one of those actions that I mentioned above that qualify as “long-running transactions,” you will trigger a spike in the version store usage. Here is an example of me deleting a group that contains 200,000 members, on a DC running 2012 R2 with 1 64bit CPU:

(Figure 2: Deleting a group containing 200k members on a 2012 R2 DC with 1 64bit CPU.)

The version store spikes to 5332 buckets allocated here, seconds after I deleted the group, but as long as the DC recovers and falls back down to nominal levels, you’ll be alright. If it stays high or even maxed out for extended periods of time, then no more database transactions for you. This includes no more replication. This is just an example using the common member/memberOf relationship, but any linked-value attribute relationship can cause this behavior. (I’ve talked a little about linked value attributes before here.) There are plenty of other types of objects that may invoke this same kind of behavior, such as deleting an RODC computer object, whose msDS-RevealedUsers links must then be processed.

I’m not saying that deleting a group with fewer than 200K members couldn’t also trigger version store exhaustion if there are other transactions taking place on your domain controller simultaneously or other extenuating circumstances. I’ve seen transactions involving as few as 70K linked values cause major problems.

After you delete an object in AD, and the domain controller turns it into a tombstone, each domain controller has to process the linked-value attributes of that object to maintain the referential integrity of the database. It does this in “batches,” usually 1000 or 10,000 depending on Windows version and configuration. This was only very recently documented here. Since each “batch” of 1000 or 10,000 is considered a single transaction, a smaller batch size will tend to complete faster and thus require less version store usage. (But the overall job will take longer.)

An interesting curveball here is that having the AD Recycle Bin enabled will defer this action by an msDs-DeletedObjectLifetime number of days after an object is deleted, since that’s the appeal behind the AD Recycle Bin – it allows you to easily restore deleted objects with all their links intact. (More detail on the AD Recycle Bin here.)

When you run out of version storage, no other database transactions can be committed until the transaction or transactions that are causing the version store exhaustion are completed or rolled back. At this point, most people start rebooting their domain controllers, and this may or may not resolve the immediate issue for them depending on exactly what’s going on. Another thing that may alleviate this issue is offline defragmentation of the database. (Or reducing the links batch size, or increasing the version store size – more on that later.) Again, we’re usually looking at 100+ gigabyte DITs when we see this kind of issue, so we’re essentially talking about pushing the limits of AD. And we’re also talking about hours of downtime for a domain controller while we do that offline defrag and semantic database analysis.
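
For reference, the offline defrag itself is the classic ntdsutil sequence; a rough sketch only (the destination path is a placeholder, AD DS must be stopped first, and you should review the official guidance before trying this on a production DC):

# Stop AD DS (and dependent services), compact the DIT to a scratch location, then verify integrity
net stop ntds /y
ntdsutil "activate instance ntds" files "compact to C:\Temp\DefragDIT" quit quit
# Copy the compacted ntds.dit over the original and delete the old log files, then:
ntdsutil "activate instance ntds" files integrity quit quit
net start ntds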

Here, Active Directory is completely tapping out the version store. Notice the plateau once it has reached its max:

(Figure 3: Version store being maxed out at 13078 buckets on a 2012 R2 DC with 1 64bit CPU.)

So it has maxed out at 13,078 buckets.

When you hit this wall, you will see events such as these in your event logs:

Log Name: Directory Service
Source: Microsoft-Windows-ActiveDirectory_DomainService
Date: 5/16/2016 5:54:52 PM
Event ID: 1519
Task Category: Internal Processing
Level: Error
Keywords: Classic
User: S-1-5-21-4276753195-2149800008-4148487879-500
Computer: DC01.contoso.com
Description:
Internal Error: Active Directory Domain Services could not perform an operation because the database has run out of version storage.

And also:

Log Name: Directory Service
Source: NTDS ISAM
Date: 5/16/2016 5:54:52 PM
Event ID: 623
Task Category: (14)
Level: Error
Keywords: Classic
User: N/A
Computer: DC01.contoso.com
Description:
NTDS (480) NTDSA: The version store for this instance (0) has reached its maximum size of 408Mb. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.

The peculiar “408Mb” figure that comes along with that last event leads us into the next section…

How big is the Version Store by default?

The “How the Data Store Works” article that I linked to earlier says:

“The version store has a size limit that is the lesser of the following: one-fourth of total random access memory (RAM) or 100 MB. Because most domain controllers have more than 400 MB of RAM, the most common version store size is the maximum size of 100 MB.”

Incorrect.

And then you have other articles that have even gone to print, such as this one, that say:

“Typically, the version store is 25 percent of the physical RAM.”

Extremely incorrect.

What about my earlier question about the bucket size? Well if you consulted this KB article you would read:

The value for the setting is the number of 16KB memory chunks that will be reserved.”

Nope, that’s wrong.

Or if I go to the MSDN documentation for ESE:

“JET_paramMaxVerPages
This parameter reserves the requested number of version store pages for use by an instance.

Each version store page as configured by this parameter is 16KB in size.”

Not true.

The pages are not 16KB anymore on 64bit DCs. And the only time that the “100MB” figure was ever even close to accurate was when domain controllers were 32bit and had 1 CPU. But today, domain controllers are 64bit and have lots of CPUs. Both the version store bucket size and the default number of version store buckets allocated double based on whether your domain controller is 32bit or 64bit. And the figure also scales a little bit based on how many CPUs are in your domain controller.

So without further ado, here is how to calculate the actual number of buckets that Active Directory will allocate by default:

(2 * (3 * (15 + 4 + 4 * #CPUs)) + 6400) * PointerSize / 4

Pointer size is 4 if you’re using a 32bit processor, and 8 if you’re using a 64bit processor.

And secondly, version store pages are 16KB if you’re on a 32bit processor, and 32KB if you’re on a 64bit processor. So using a 64bit processor effectively quadruples the default size of your AD version store. To convert number of buckets allocated into bytes for a 32bit processor:

(((2 * (3 * (15 + 4 + 4 * 1)) + 6400) * 4 / 4) * 16KB) / 1MB

And for a 64bit processor:

(((2 * (3 * (15 + 4 + 4 * 1)) + 6400) * 8 / 4) * 32KB) / 1MB

So using the above formulae, the version store size for a single-core, 64bit DC would be ~408MB, which matches that event ID 623 we got from ESE earlier. It also conveniently matches 13078 * 32KB buckets, which is where we plateaued with our perfmon counter earlier.
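
To save yourself the mental arithmetic, here is a small sketch that simply evaluates the formula above (illustration only, not a sizing tool):

# Evaluate the default version store size for a given CPU count and processor architecture
function Get-DefaultVersionStoreSize {
    param([int]$Cpus = 1, [ValidateSet(32, 64)][int]$Bits = 64)
    $pointerSize = if ($Bits -eq 64) { 8 } else { 4 }    # 4 bytes on x86, 8 bytes on x64
    $bucketKB    = if ($Bits -eq 64) { 32 } else { 16 }  # bucket (page) size in KB
    $buckets     = (2 * (3 * (15 + 4 + 4 * $Cpus)) + 6400) * $pointerSize / 4
    [pscustomobject]@{ Buckets = $buckets; SizeMB = [math]::Round($buckets * $bucketKB / 1024, 1) }
}
Get-DefaultVersionStoreSize -Cpus 1 -Bits 64   # roughly 13,000 buckets / ~408 MB, matching the event above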

If you had a 4-core, 64bit domain controller, the formula would come out to ~412MB, and you will see this line up with the event log event ID 623 on that machine. When a 4-core, Windows 2008 R2 domain controller with default configuration runs out of version store:

Log Name:      Directory Service
Source:        NTDS ISAM
Date:          5/15/2016 1:18:25 PM
Event ID:      623
Task Category: (14)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      dc02.fabrikam.com
Description:
NTDS (476) NTDSA: The version store for this instance (0) has reached its maximum size of 412Mb. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.

The version store size for a single-core, 32bit DC is ~102MB. This must be where the original “100MB” adage came from. But as you can see now, that information is woefully outdated.

The 6400 number in the equation comes from the fact that 6400 is the absolute, hard-coded minimum number of version store pages/buckets that AD will give you. Turns out that’s about 100MB, if you assumed 16KB pages, or 200MB if you assume 32KB pages. The interesting side-effect from this is that the documented “EDB max ver pages (increment over the minimum)” registry entry, which is the supported way of increasing your version store size, doesn’t actually have any effect unless you set it to some value greater than 6400 decimal. If you set that registry key to something less than 6400, then it will just get overridden to 6400 when AD starts. But if you set that registry entry to, say, 9600 decimal, then your version store size calculation will be:

(((2 *(3 * (15 + 4 + 4 * 1)) + 9600) * 8 / 4) * 32KB) / 1MB = 608.6MB

For a 64bit, 1-core domain controller.
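
For reference, the registry entry is a REG_DWORD named exactly as quoted above, under the NTDS Parameters key; a hedged sketch of setting the 9600 from this example (AD DS must be restarted, or the DC rebooted, before the new value is read):

# Raise the version store to 9600 buckets (decimal) on this DC
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters' `
    -Name 'EDB max ver pages (increment over the minimum)' `
    -PropertyType DWord -Value 9600 -Force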

So let’s set those values on a DC, then run up the version store, and let’s get empirical up in here:

(Figure 4: Version store exhaustion at 19478 buckets on a 2012 R2 DC with 1 64bit CPU.)

(19478 * 32KB) / 1MB = 608.7MB

And wouldn’t you know it, the event log now reads:

(Figure 5: The event log from the previous version store exhaustion, showing the effect of setting the “EDB max ver pages (increment over the minimum)” registry value to 9600.)

Here’s a table that shows version store sizes based on the “EDB max ver pages (increment over the minimum)” value and common CPU counts:

Buckets              1 CPU                       2 CPUs                      4 CPUs                      8 CPUs                      16 CPUs
6400 (the default)   x64: 410 MB, x86: 103 MB    x64: 412 MB, x86: 103 MB    x64: 415 MB, x86: 104 MB    x64: 421 MB, x86: 105 MB    x64: 433 MB, x86: 108 MB
9600                 x64: 608 MB, x86: 152 MB    x64: 610 MB, x86: 153 MB    x64: 613 MB, x86: 153 MB    x64: 619 MB, x86: 155 MB    x64: 631 MB, x86: 158 MB
12800                x64: 808 MB, x86: 202 MB    x64: 810 MB, x86: 203 MB    x64: 813 MB, x86: 203 MB    x64: 819 MB, x86: 205 MB    x64: 831 MB, x86: 208 MB
16000                x64: 1008 MB, x86: 252 MB   x64: 1010 MB, x86: 253 MB   x64: 1013 MB, x86: 253 MB   x64: 1019 MB, x86: 255 MB   x64: 1031 MB, x86: 258 MB
19200                x64: 1208 MB, x86: 302 MB   x64: 1210 MB, x86: 303 MB   x64: 1213 MB, x86: 303 MB   x64: 1219 MB, x86: 305 MB   x64: 1231 MB, x86: 308 MB

Sorry for the slight rounding errors – I just didn’t want to deal with decimals. As you can see, the number of CPUs in your domain controller only has a slight effect on the version store size. The processor architecture, however, makes all the difference. Good thing absolutely no one uses x86 DCs anymore, right?

Now I want to add a final word of caution.

I want to make it clear that we recommend changing the “EDB max ver pages (increment over the minimum)” only when necessary; when the event ID 623s start appearing. (If it ain’t broke, don’t fix it.) I also want to reiterate the warnings that appear on the support KB: you must not set this value arbitrarily high, you should increase this setting in small increments (50 MB or 100 MB at a time), and if setting the value to 19200 buckets still does not resolve your issue, then you should contact Microsoft Support. If you are going to change this value, it is advisable to change it consistently across all domain controllers, but you must also carefully consider the processor architecture and available memory on each DC before you change this setting. The version store requires a contiguous allocation of memory – precious real-estate – and raising the value too high can prevent lsass from being able to perform other work. Once the problem has subsided, you should then return this setting back to its default value.

In my next post on this topic, I plan on going into more detail on how one might actually troubleshoot the issue and track down the reason behind why the version store exhaustion is happening.

Conclusions

There is a lot of old documentation out there that has misled many an AD administrator on this topic. It was essentially accurate at the time it was written, but AD has evolved since then. I hope that with this post I was able to shed more light on the topic than you probably ever thought was necessary. It’s an undeniable truth that more and more of our customers continue to push the limits of AD beyond that which was originally conceived. I also want to remind the reader that the majority of the information in this article is AD-specific. If you’re thinking about Exchange or Certificate Services or Windows Update or DFSR or anything else that uses an ESE database, then you need to go figure out your own application-specific details, because we don’t use the same page sizes or algorithms as those guys.

I hope this will be valuable to those who find themselves asking questions about the ESE version store in Active Directory.

With love,

Ryan “Buckets of Fun” Ries



Deploying Group Policy Security Update MS16-072 \ KB3163622

My name is Ajay Sarkaria & I work with the Windows Supportability team at Microsoft. There have been many questions on deploying the newly released security update MS16-072.

This post was written to provide guidance and answer questions needed by administrators to deploy the newly released security update, MS16-072 that addresses a vulnerability. The vulnerability could allow elevation of privilege if an attacker launches a man-in-the-middle (MiTM) attack against the traffic passing between a domain controller and the target machine on domain-joined Windows computers.

The table below summarizes the KB article number for the relevant Operating System:

Article # Title Context / Synopsis
MSKB 3163622 MS16-072: Security Updates for Group Policy: June 14, 2016 Main article for MS16-072
MSKB 3159398 MS16-072: Description of the security update for Group Policy: June 14, 2016 MS16-072 for Windows Vista / Windows Server 2008, Windows 7 / Windows Server 2008 R2, Windows Server 2012, Windows 8.1 / Windows Server 2012 R2
MSKB 3163017 Cumulative update for Windows 10: June 14, 2016 MS16-072 For Windows 10 RTM
MSKB 3163018 Cumulative update for Windows 10 Version 1511 and Windows Server 2016 Technical Preview 4: June 14, 2016 MS16-072 For Windows 10 1511 + Windows Server 2016 TP4
MSKB 3163016 Cumulative Update for Windows Server 2016 Technical Preview 5: June 14 2016 MS16-072 For Windows Server 2016 TP5
TN: MS16-072 Microsoft Security Bulletin MS16-072 – Important Overview of changes in MS16-072
What does this security update change?

The most important aspect of this security update is to understand the behavior changes affecting the way User Group Policy is applied on a Windows computer. MS16-072 changes the security context with which user group policies are retrieved. Traditionally, when a user group policy is retrieved, it is processed using the user’s security context.

After MS16-072 is installed, user group policies are retrieved by using the computer’s security context. This by-design behavior change protects domain joined computers from a security vulnerability.

When a user group policy is retrieved using the computer’s security context, the computer account will now need “read” access to retrieve the group policy objects (GPOs) needed to apply to the user.

Traditionally, all group policies were read if the “user” had read access, either directly or through membership in a domain group (e.g. Authenticated Users).

What do we need to check before deploying this security update?

As discussed above, by default “Authenticated Users” have “Read” and “Apply Group Policy” on all Group Policy Objects in an Active Directory Domain.

Below is a screenshot from the Default Domain Policy:


If the permissions on the Group Policy Objects in your Active Directory domain have not been modified and are still using the defaults, and as long as Kerberos authentication is working fine in your Active Directory forest (i.e. there are no Kerberos errors visible in the system event log on client computers while accessing domain resources), there is nothing else you need to verify before you deploy the security update.

In some deployments, administrators may have removed the “Authenticated Users” group from some or all Group Policy Objects (Security filtering, etc.)

In such cases, you will need to make sure of the following before you deploy the security update:

  1. Check if “Authenticated Users” group read permissions were removed intentionally by the admins. If not, then you should probably add those back. For example, if you do not use any security filtering to target specific group policies to a set of users, you could add “Authenticated Users” back with the default permissions as shown in the example screenshot above.
  2. If the “Authenticated Users” permissions were removed intentionally (security filtering, etc), then as a result of the by-design change in this security update (i.e. to now use the computer’s security context to retrieve user policies), you will need to add the computer account retrieving the group policy object (GPO) to “Read” Group Policy (and not “Apply group policy“).

    Example Screenshot:


In the above example screenshot, let’s say an Administrator wants “User-Policy” (Name of the Group Policy Object) to only apply to the user with name “MSFT Ajay” and not to any other user, then the above is how the Group Policy would have been filtered for other users. “Authenticated Users” has been removed intentionally in the above example scenario.

Notice that no other user or group is included with “Read” or “Apply Group Policy” permissions other than the default Domain Admins and Enterprise Admins. These groups do not have “Apply Group Policy” by default, so the GPO would not apply to users of these groups; it applies only to the user “MSFT Ajay”.

What will happen if there are Group Policy Objects (GPOs) in an Active Directory domain that are using security filtering as discussed in the example scenario above?

Symptoms when you have security filtering Group Policy Objects (GPOs) like the above example and you install the security update MS16-072:

  • Printers or mapped drives assigned through Group Policy Preferences disappear.
  • Shortcuts to applications on users’ desktops are missing
  • Group policies that use security filtering are no longer processed
  • You may see the following change in gpresult: Filtering: Not Applied (Unknown Reason)
  • If you are using Folder Redirection and the Folder Redirection group policy removal option is set to “Redirect the folder back to the user profile location when policy is removed,” the redirected folders are moved back to the client machine after installing this security update
What is the Resolution?

Simply adding the “Authenticated Users” group with the “Read” permissions on the Group Policy Objects (GPOs) should be sufficient. Domain Computers are part of the “Authenticated Users” group. “Authenticated Users” have these permissions on any new Group Policy Objects (GPOs) by default. Again, the guidance is to add just “Read” permissions and not “Apply Group Policy” for “Authenticated Users”

What if adding Authenticated Users with Read permissions is not an option?

If adding “Authenticated Users” with just “Read” permissions is not an option in your environment, then you will need to add the “Domain Computers” group with “Read” Permissions. If you want to limit it beyond the Domain Computers group: Administrators can also create a new domain group and add the computer accounts to the group so you can limit the “Read Access” on a Group Policy Object (GPO). However, computers will not pick up membership of the new group until a reboot. Also keep in mind that with this security update installed, this additional step is only required if the default “Authenticated Users” Group has been removed from the policy where user settings are applied.
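
If you have the GroupPolicy module available (RSAT/GPMC), the permission change itself can also be scripted; a minimal sketch, using the “User-Policy” GPO name from the earlier example as a placeholder:

# Requires the GroupPolicy module (RSAT / GPMC)
# Grant Authenticated Users read (not apply) on a single GPO
Set-GPPermission -Name 'User-Policy' -TargetName 'Authenticated Users' -TargetType Group -PermissionLevel GpoRead

# Or, if Authenticated Users must stay off that GPO, grant Domain Computers read instead
Set-GPPermission -Name 'User-Policy' -TargetName 'Domain Computers' -TargetType Group -PermissionLevel GpoRead

# The -All switch makes the same change on every GPO in the domain
Set-GPPermission -All -TargetName 'Authenticated Users' -TargetType Group -PermissionLevel GpoRead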

Example Screenshots:


Now in the above scenario, after you install the security update, user group policy is retrieved using the computer’s security context. Because a domain-joined computer is part of the “Domain Computers” security group by default, the client computer will be able to retrieve the user policies that need to be applied to the user, and they will be processed successfully.

How to identify GPOs with issues:

In case you have already installed the security update and need to identify Group Policy Objects (GPOs) that are affected, the easy way is to run a simple gpupdate /force on a Windows client computer and then run gpresult /h new-report.html. Open new-report.html and review it for any errors such as: “Reason Denied: Inaccessible, Empty or Disabled”

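For reference, the commands look like this when run from an elevated prompt on the client (the report path is just an example):

gpupdate /force
gpresult /h C:\Temp\new-report.html
start C:\Temp\new-report.html   # open the report in the default browser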

What if there are a lot of GPOs?

A script is available which can detect all Group Policy Objects (GPOs) in your domain that may be missing “Read” permissions for “Authenticated Users”.
You can get the script from here: https://gallery.technet.microsoft.com/Powershell-script-to-cc281476

Pre-Reqs:

  • The script can run only on Windows 7 and above Operating Systems which have the RSAT or GPMC installed or Domain Controllers running Windows Server 2008 R2 and above
  • The script works in a single domain scenario.
  • The script will detect all GPOs in your domain (Not Forest) which are missing “Authenticated Users” permissions & give the option to add “Authenticated Users” with “Read” Permissions (Not Apply Group Policy). If you have multiple domains in your Active Directory Forest, you will need to run this for each domain.
    • Domain Computers are part of the Authenticated Users group
  • The script can only add permissions to the Group Policy Objects (GPOs) in the same domain as the context of the current user running the script. In a multi domain forest, you must run it in the context of the Domain Admin of the other domain in your forest.

Sample Screenshots when you run the script:

In the first sample screenshot below, running the script detects all Group Policy Objects (GPOs) in your domain that are missing the Read permission for “Authenticated Users”.


If you hit “Y”, you will see the below message:


What if there are AGPM managed Group Policy Objects (GPOs)?

Follow the steps below to add “Authenticated Users” with Read Permissions:

To change the permissions for all managed GPOs and add the Authenticated Users Read permission, follow these steps:

Re-import all Group Policy Objects (GPOs) from production into the AGPM database. This ensures AGPM has the latest copy of the production GPOs.


Add either “Authenticated Users” or “Domain Computers” with the READ permission on the Production Delegation tab by selecting the security principal, granting the “Read” role, and then clicking “OK”.


Grant the selected security principal the “Read” role.


Delegation tab depicting Authenticated Users having the READ permissions.


Select and Deploy GPOs again:
Note:  To modify permissions on multiple AGPM-managed GPOs, use Shift+click or Ctrl+click to select multiple GPOs at a time, then deploy them in a single operation.
Ctrl+A does not select all policies.


The targeted GPOs now have the new permissions when viewed in AD:


Below are some Frequently asked Questions we have seen:

Frequently Asked Questions (FAQs):

Q1) Do I need to install the fix on only client OS? OR do I also need to install it on the Server OS?

A1) It is recommended you patch Windows and Windows Server computers which are running Windows Vista, Windows Server 2008 and newer Operating Systems (OS), regardless of SKU or role, in your entire domain environment. These updates only change behavior from a client (as in “client-server distributed system architecture”) standpoint, but all computers in a domain are “clients” to SYSVOL and Group Policy; even the Domain Controllers (DCs) themselves

Q2) Do I need to enable any registry settings to enable the security update?

A2) No, this security update will be enabled when you install the MS16-072 security update, however you need to check the permissions on your Group Policy Objects (GPOs) as explained above

Q3) What will change in regard to how group policy processing works after the security update is installed?

A3) To retrieve user policy, the connection to the Windows domain controller (DC) prior to the installation of MS16-072 is done under the user’s security context. With this security update installed, instead of user’s security context, Windows group policy clients will now force local system’s security context, therefore forcing Kerberos authentication

Q4) We already have the security update MS15-011 & MS15-014 installed which hardens the UNC paths for SYSVOL & NETLOGON & have the following registry keys being pushed using group policy:

  • RequirePrivacy=1
  • RequireMutualAuthentication=1
  • RequireIntegrity=1

Should the UNC Hardening security update with the above registry settings not take care of this vulnerability when processing group policy from the SYSVOL?

A4) No. UNC Hardening alone will not protect against this vulnerability. In order to protect against this vulnerability, one of the following scenarios must apply: UNC Hardened access is enabled for SYSVOL/NETLOGON as suggested, and the client computer is configured to require Kerberos FAST Armoring

– OR –

UNC Hardened Access is enabled for SYSVOL/NETLOGON, and this particular security update (MS16-072 \ KB3163622) is installed

Q5) If we have security filtering on Computer objects, what change may be needed after we install the security update?

A5) Nothing will change in regard to how Computer Group Policy retrieval and processing works

Q6) We are using security filtering for user objects and after installing the update, group policy processing is not working anymore

A6) As noted above, the security update changes the way user group policy settings are retrieved. The reason for group policy processing failing after the update is installed is because you may have removed the default “Authenticated Users” group from the Group Policy Object (GPO). The computer account will now need “read” permissions on the Group Policy Object (GPO). You can add “Domain Computers” group with “Read” permissions on the Group Policy Object (GPO) to be able to retrieve the list of GPOs to download for the user

Example Screenshot as below:


Q7) Will installing this security update impact cross forest user group policy processing?

A7) No, this security update will not impact cross forest user group policy processing. When a user from one forest logs onto a computer in another forest and the group policy setting “Allow Cross-Forest User Policy and Roaming User Profiles” is enabled, the user group policy during the cross forest logon will be retrieved using the user’s security context.

Q8) Is there a need to specifically add “Domain Computers” to make user group policy processing work or adding “Authenticated Users” with just read permissions should suffice?

A8) Yes, just adding “Authenticated Users” with Read permissions should suffice. If you already have “Authenticated Users” added with at-least read permissions on a GPO, there is no further action required. “Domain Computers” are by default part of the “Authenticated Users” group & user group policy processing will continue to work. You only need to add “Domain Computers” to the GPO with read permissions if you do not want to add “Authenticated Users” to have “Read”

Thanks,

Ajay Sarkaria

Supportability Program Manager – Windows

Edits:
6/29/16 – added script link and prereqs
7/11/16 – added information about AGPM
8/16/16 – added note about folder redirection

Access-Based Enumeration (ABE) Concepts (part 1 of 2)

Hello everyone, Hubert from the German Networking Team here.  Today I want to revisit a topic that I wrote about in 2009: Access-Based Enumeration (ABE)

This is the first part of a 2-part Series. This first part will explain some conceptual things around ABE.  The second part will focus on diagnostic and troubleshooting of ABE related problems.  The second post is here.

Access-Based Enumeration has existed since Windows Server 2003 SP1 and has not changed in any significant form since my Blog post in 2009. However, what has significantly changed is its popularity.

With its integration into V2 (2008 Mode) DFS Namespaces and the increasing demand for data privacy, it became a tool of choice for many architects. However, the same strict limitations and performance impact it had in Windows Server 2003 still apply today. With this post, I hope to shed some more light here as these limitations and the performance impact are either unknown or often ignored. Read on to gain a little insight and background on ABE so that you:

  1. Understand its capabilities and limitations
  2. Gain the background knowledge needed for my next post on how to troubleshoot ABE

Two things to keep in mind:

  • ABE is not a security feature (it’s more of a convenience feature)
  • There is no guarantee that ABE will perform well under all circumstances. If performance issues come up in your deployment, disabling ABE is a valid solution.

So without any further ado let’s jump right in:

What is ABE and what can I do with it?

From the TechNet topic:

“Access-based enumeration displays only the files and folders that a user has permissions to access. If a user does not have Read (or equivalent) permissions for a folder, Windows hides the folder from the user’s view. This feature is active only when viewing files and folders in a shared folder; it is not active when viewing files and folders in the local file system.”

Note that ABE has to check the user’s permissions at the time of enumeration and filter out files and folders they don’t have Read permissions to. Also note that this filtering only applies if the user is attempting to access the share via SMB versus simply browsing the same folder structure in the local file system.

For example, let’s assume you have an ABE enabled file server share with 500 files and folders, but a certain user only has read permissions to 5 of those folders. The user is only able to view 5 folders when accessing the share over the network. If the user logs on to this server and browses the local file system, they will see all of the files and folders.

In addition to file server shares, ABE can also be used to filter the links in DFS Namespaces.

With V2 Namespaces DFSN got the capability to store permissions for each DFSN link, and apply those permissions to the local file system of each DFSN Server.

Those NTFS permissions are then used by ABE to filter directory enumerations against the DFSN root share thus removing DFSN links from the results sent to the client.

Therefore, ABE can be used to either hide sensitive information in the link/folder names, or to increase usability by hiding hundreds of links/folders the user does not have access to.
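
For reference, on Windows Server 2012 and later both of these can be switched on from PowerShell; a minimal sketch, where the share name Data and the namespace \\contoso.com\Public are placeholders:

# Enable ABE on a file server share
Set-SmbShare -Name 'Data' -FolderEnumerationMode AccessBased -Force

# Enable ABE on a DFS namespace
Set-DfsnRoot -Path '\\contoso.com\Public' -EnableAccessBasedEnumeration $true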

How does it work?

The filtering happens on the file server at the time of the request.

Any Object (File / Folder / Shortcut / Reparse Point / etc.) where the user has less than generic read permissions is omitted in the response by the server.

Generic Read means:

  • List Folder / Read Data
  • Read Attributes
  • Read Extended Attributes
  • Read Permissions

If you take any of these permissions away, ABE will hide the object.
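
As an illustration, a group that should see a folder needs exactly this set of rights, which icacls bundles as generic read; a hedged sketch (the folder path and group name are made up):

# (GR) grants generic read; (OI)(CI) make the grant inherit to files and subfolders
icacls "D:\Shares\Projects" /grant "CONTOSO\Project-Readers:(OI)(CI)(GR)"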

So you could create a scenario (for example, by removing only the Read Permissions right) where the object is hidden from the user, but he/she could still open and read the file or folder if they know its name.

That brings us to the next important conceptual point we need to understand:

ABE does not do access control.

It only filters the response to a Directory Enumeration. The access control is still done through NTFS.

Aside from that ABE only works when the access happens through the Server Service (aka the Fileserver). Any access locally to the file system is not affected by ABE. Restated:

“Access-based enumeration does not prevent users from obtaining a referral to a folder target if they already know the DFS path of the folder with targets. Permissions set using Windows Explorer or the Icacls command on namespace roots or folders without targets control whether users can access the DFS folder or namespace root. However, they do not prevent users from directly accessing a folder with targets. Only the share permissions or the NTFS file system permissions of the shared folder itself can prevent users from accessing folder targets.” Recall what I said earlier, “ABE is not a security feature”. TechNet

ABE does not do any caching.

Every request causes a filter calculation. There is no cache. ABE will repeat the exact same work for identical directory enumerations by the same user.

ABE cannot predict the permissions or the result.

It has to do the calculations for each object in every level of your folder hierarchy every time it is accessed.

If you use inheritance on the folder structure, a user will have the same permission and thus the same filter result from ABE throughout the entire folder structure. Still, ABE has to calculate this result, consuming CPU cycles in the process.

If you enable ABE on such a folder structure you are just wasting CPU cycles without any gain.

With those basics out of the way, let’s dive into the mechanics behind the scenes:

How the filtering calculation works

  1. When a QUERY_DIRECTORY request (https://msdn.microsoft.com/en-us/library/cc246551.aspx) or its SMB1 equivalent arrives at the server, the server will get a list of objects within that directory from the filesystem.
  2. With ABE enabled, this list is not immediately sent out to the client, but instead passed over to the ABE for processing.
  3. ABE will iterate through EVERY object of this list and compare the permissions of the user with the object’s ACL.
  4. The objects where the user does not have generic read access are removed from the list.
  5. After ABE has completed its processing, the client receives the filtered list.

This yields two effects:

  • This comparison is an active operation and thus consumes CPU Cycles.
  • This comparison takes time, and this time is passed down to the user, as the results are only sent when the comparisons for the entire directory are completed.

This brings us directly to the core point of this Blog:

In order to successfully use ABE in your environment you have to manage both effects.

If you don’t, ABE can cause a wide spread outage of your File services.

The first effect can cause a complete saturation of your CPUs (all cores at 100%).

This not only increases the response times of the fileserver to its clients to a magnitude where the server stops accepting any new connections, or the clients kill their connection after not getting a response from the server for several minutes; it can also prevent you from establishing a remote desktop connection to the server to make any changes (like disabling ABE, for instance).

The second effect can increase the response times of your fileserver (even if it is otherwise idle) to a magnitude that is no longer accepted by the users.

The comparison for a single directory enumeration by a single user can keep one CPU in your server busy for quite some time, thus making it more likely for new incoming requests to overlap with already running ABE calculations. This eventually results in a Backlog adding further to the delays experienced by your clients.

To illustrate this let’s roll some numbers:

A little disclaimer:

The following calculation is what I’ve seen, your results may differ as there are many moving pieces in play here. In other words, your mileage may vary. That aside, the numbers seen here are not entirely off but stem from real production environments. Performance of Disk and CPU and other workloads play into these numbers as well.

Thus the calculation and numbers are for illustration purposes only. Don’t use it to calculate your server’s performance capabilities.

Let’s assume you have a DFS Namespace with 10,000 links that is hosted on DFS Servers that have 4 CPUs with 3.5 GHz (also assuming RSS is configured correctly and all 4 CPUs are used by the File service: https://blogs.technet.microsoft.com/networking/2015/07/24/receive-side-scaling-for-the-file-servers/ ).

We usually expect single digit millisecond response times measured at the fileserver to achieve good performance (network latency obviously adds to the numbers seen on the client).

In our scenario above (10,000 links, ABE, 3.5 GHz CPU) it is not unusual for a single enumeration of the namespace to take 500 ms.

CPU cores and speed DFS Namespace Links RSS configured per recommendations ABE enabled? Response time
4 @ 3.5 GHz 10,000 Yes No <10ms
4 @ 3.5 GHz 10,000 Yes Yes 300 – 500 ms

That means a single CPU can handle up to 2 Directory Enumerations per Second. Multiplied by 4 CPUs the server can handle 8 User Requests per Second. Any more than those 8 requests and we push the Server into a backlog.

Backlog in this case means new requests are stuck in the Processor Queue behind other requests, therefore multiplying the wait time.

This can reach dimensions where the client (and the user) is waiting for minutes and the client eventually decides to kill the TCP connection, and in case of DFSN, fail over to another server.

Anyone remotely familiar with Fileserver Scalability probably instantly recognizes how bad and frightening those numbers are.  Please keep in mind, that not every request sent to the server is a QUERY_DIRECTORY request, and all other requests such as Write, Read, Open, Close etc. do not cause an ABE calculation (however they suffer from an ABE-induced lack of CPU resources in the same way).

Furthermore, the Windows File Service Client caches the directory enumeration results if SMB2 or SMB3 is used (https://technet.microsoft.com/en-us/library/ff686200(v=ws.10).aspx ).

There is no such Cache for SMB1. Thus SMB1 Clients will send more Directory Enumeration Requests than SMB2 or SMB3 Clients (particularly if you keep the F5 key pressed).

It should now be obvious that you should use SMB2/3 versus SMB1 and ensure you leave the caches enabled if you use ABE on your servers.

As you might have realized by now, there is no easy or reliable way to predict the CPU demand of ABE. If you are developing a completely new environment you usually cannot forecast the proportion of QUERY_DIRECTORY requests in relation to the other requests or the frequency of the same.

Recommendations!

The most important recommendation I can give you is:

Do not enable ABE unless you really need to.

Let’s take the Users Home shares as an example:

Usually there is no user browsing manually through this structure, but instead the users get a mapped drive pointing to their folder. So the usability aspect does not count.  Additionally most users will know (or can find out from the Office Address book) the names or aliases of their colleagues. So there is no sensitive information to hide here.  For ease of management most home shares live in big namespaces or server shares, which makes them very unfit to be used with ABE.  In many cases the user has full control (or at least write permissions) inside his own home share.  Why should I waste my CPU Cycles to filter the requests inside someone’s Home Share?

Considering all those points, I would be intrigued to learn about a telling argument to enable ABE on User Home Shares or Roaming Profile Shares.  Please sound off in the comments.

If you have a data structure where you really need to enable ABE, your file service concept needs to facilitate these four requirements:

You need Scalability.

You need the ability to increase the number of CPUs doing the ABE calculations in order to react to increasing numbers (directory sizes, number of clients, usage frequency) and thus performance demand.

The easiest way to achieve this is to do ABE Filtering exclusively in DFS Domain Namespaces and not on the Fileservers.

By that you can add easily more CPUs by just adding further Namespace Servers in the sites where they are required.

Also keep in mind, that you should have some redundancy and that another server might not be able to take the full additional load of a failing server on top of its own load.

You need small chunks

The number of objects that ABE needs to check for each calculation is the single most important factor for the performance requirement.

Instead of having a single big 10,000 link namespace (same applies to directories on file servers) build 10 smaller 1,000 link-namespaces and combine them into a DFS Cascade.

By that ABE just needs to filter 1,000 objects for every request.

Just re-do the example calculation above with 250ms, 100ms, 50ms or even less.

You will notice that you are suddenly able to reach very decent numbers in terms of requests per second.
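
If you want to play with the numbers yourself, here is a quick sketch of that recalculation (per-CPU throughput is simply one second divided by the per-enumeration time, shown here for a 4-CPU server):

# Requests per second at various ABE calculation times
foreach ($ms in 500, 250, 100, 50) {
    '{0,4} ms per enumeration -> {1,4:N1} requests/sec per CPU, {2,5:N1} with 4 CPUs' -f $ms, (1000 / $ms), (4 * 1000 / $ms)
}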

The other nice side effect is that you will do fewer calculations, as the user will usually follow only one branch in the directory tree and thus will not cause ABE calculations for the other branches.

You need Separation of Workloads.

Having your SQL Server run on the same machine as your ABE Server can cause a lack of Performance for both workloads.

Having ABE run on your Domain Controller exposes the Domain Controller role to the risk of being starved of CPU cycles and thus no longer being able to service domain logons.

You need to test and monitor your performance

In many cases you are deploying a new file service concept into an existing environment.

Thus you can get some numbers regarding QUERY_DIRECTORY requests, from the existing DFS / Fileservers.

Build up your Namespace / Shares as you envisioned and use the File Server Capacity Tool (https://msdn.microsoft.com/en-us/library/windows/hardware/dn567658(v=vs.85).aspx ) to simulate the expected load against it.

Monitor the SMB Service Response Times, the Processor utilization and Queue length and the feel on the client while browsing through the structures.

This should give you an idea on how many servers you will need, and if it is required to go for a slimmer design of the data structures.

Keep monitoring those values through the lifecycle of your file server deployment in order to scale up in time.

Any deployment of new software, clients or the normal increase in data structure size could throw off your initial calculations and test results.

This point should imho be outlined very clearly in any concept documentation.

This concludes the first part of this Blog Series.

I hope you found it worthwhile and got an understanding how to successfully design a File service with ABE.

Now to round off your knowledge, or if you need to troubleshoot a Performance Issue on an ABE-enabled Server, I strongly encourage you to read the second part of this Blog Series. This post will be updated as soon as it’s live.

With best regards,

Hubert

Access-Based Enumeration (ABE) Troubleshooting (part 2 of 2)

Hello everyone! Hubert from the German Networking Team here again with part two of my little Blog Post Series about Access-Based Enumeration (ABE). In the first part I covered some of the basic concepts of ABE. In this second part I will focus on monitoring and troubleshooting Access-based enumeration.
We will begin with a quick overview of Windows Explorer’s directory change notification mechanism (Change Notify), and how that mechanism can lead to performance issues before moving on to monitoring your environment for performance issues.

Change Notify and its impact on DFSN servers with ABE

Let’s say you are viewing the contents of a network share while a file or folder is added to the share remotely by someone else. Your view of this share will be updated automatically with the new contents of the share without you having to manually refresh (press F5) your view.
Change Notify is the mechanism that makes this work in all SMB Protocols (1,2 and 3).
The way it works is quite simple:

  1. The client sends a CHANGE_NOTIFY request to the server indicating the directory or file it is interested in. Windows Explorer (as an application on the client) does this by default for the directory that is currently in focus.
  2. Once there is a change to the file or directory in question, the server will respond with a CHANGE_NOTIFY Response, indicating that a change happened.
  3. This causes the client to send a QUERY_DIRECTORY request (in case it was a directory or DFS Namespace) to the server to find out what has changed.

    QUERY_DIRECTORY is the thing we discussed in the first post that causes ABE filter calculations. Recall that it’s these filter calculation that result in CPU load and client-side delays.
    Let’s look at a common scenario:
  4. During login, your users get a mapped drive pointing at a share in a DFS Namespace.
  5. This mapped drive causes the clients to connect to your DFSN Servers
  6. The client sends a Change Notification (even if the user hasn’t tried to open the mapped drive in Windows Explorer yet) for the DFS Root.

    Nothing more happens until there is a change on the server-side. Administrative work, such as adding and removing links, typically happens during business hours, whenever the administrators find the time, or the script that does it, runs.

    Back to our scenario. Let’s have a server-side change to illustrate what happens next:
  7. We add a Link to the DFS Namespace.
  8. Once the DFSN Server picks up the new link in the namespace from Active directory, it will create the corresponding reparse point in its local file system.
    If you do not use Root Scalability Mode (RSM) this will happen almost at the same time on all of the DFS Servers in that namespace. With RSM the changes will usually be applied by the different DFS servers over the next hour (or whatever your SyncInterval is set to).
  9. These changes trigger CHANGE_NOTIFY responses to be sent out to any client that indicated interest in changes to the DFS Root on that server. This usually applies to hundreds of clients per DFS server.
  10. This causes hundreds of Clients to send QUERY_DIRECTORY requests simultaneously.

What happens next strongly depends on the size of your namespace (larger namespaces lead to longer duration per ABE calculation) and the number of Clients (aka Requests) per CPU of the DFSN Server (remember the calculation from the first part?)

As your Server does not have hundreds of CPUs there will definitely be some backlog. The numbers above decide how big this backlog will be, and how long it takes for the server to work its way back to normal. Keep in mind that while pedaling out of the backlog situation, your server still has to answer other, ongoing requests that are unrelated to our Change Notify Event.
Suffice it to say, this backlog and the CPU demand associated with it can also have negative impact to other jobs.  For example, if you use this DFSN server to make a bunch of changes to your namespace, these changes will appear to take forever, simply because the executing server is starved of CPU Cycles. The same holds true if you run other workloads on the same server or want to RDP into the box.

So! What can you do about it?
As is common with an overloaded server, there are a few different approaches you could take:

  • Distribute the load across more servers (and CPU cores)
  • Make changes outside of business hours
  • Disable Change Notify in Windows Explorer

Approach: Distribute the load / scale up

Method: An expensive way to handle the excessive load is to throw more servers/CPU cores into the DFS infrastructure. In theory, you could increase the number of servers and the number of CPUs to a level where you can handle such peak loads without any issues, but that can be a very expensive approach.

Approach: Make changes outside business hours

Method: Depending on your organization’s structure, your business needs, SLAs and other requirements, you could simply make planned administrative changes to your namespaces outside the main business hours, when fewer clients are connected to your DFSN servers.

Approach: Disable Change Notify in Windows Explorer

Method: You can set the NoRemoteChangeNotify and NoRemoteRecursiveEvents registry values (see https://support.microsoft.com/en-us/kb/831129) to prevent Windows Explorer from sending Change Notification requests. This is, however, a client-side setting that disables change notify not just for DFS shares but for any fileserver the client is working with. Thus you have to actively press F5 to see changes to a folder or a share in Windows Explorer. This might or might not be a big deal for your users.

Monitoring ABE

As you may have realized by now, ABE is not a fire and forget technology—it needs constant oversight and occasional tuning. We’ve mainly discussed the design and “tuning” aspect so far. Let’s look into the monitoring aspect.

Using Task Manager / Process Explorer

This is a bit tricky, unfortunately, as any load caused by ABE shows up in Task Manager inside the System process (as do many other things on the server). In order to correlate high CPU utilization in the System process to ABE load, you need to use a tool such as Process Explorer and configure it to use public symbols. With this configured properly, you can drill deeper inside the System Process and see the different threads and the component names. We need to note, that ABE and the Fileserver both use functions in srv.sys and srv2.sys. So strictly speaking it’s not possible to differentiate between them just by the component names. However, if you are troubleshooting a performance problem on an ABE-enabled server where most of the threads in the System process are sitting in functions from srv.sys and srv2.sys, then it’s very likely due to expensive ABE filter calculations. This is, aside from disabling ABE, the best approach to reliably prove your problem to be caused by ABE.

Using Network trace analysis

Looking at CPU utilization shows us the server-side problem. We must use other measures to determine what the client-side impact is. One approach is to take a network trace and analyze the SMB/SMB2 service response times. You may, however, end up having to capture the trace on a mirrored switch port. To make analysis of this a bit easier, Message Analyzer has an SMB Service Performance chart you can use.


You get there by using a New Viewer, like below.


Wireshark also has a feature that provides you with statistics under Statistics -> Service Response Times -> SMB2. Ignore the values for ChangeNotify (it’s normal for them to be several seconds or even minutes). All other response times translate into delays for the clients. If you see values over a second, you can consider your file service not only slow but outright broken.
While you have that trace in front of you, you can also look for SMB/TCP connections that are terminated abnormally by the client because the server failed to respond to the SMB requests in time. If you have any of those, then you have clients unable to connect to your file service, likely throwing error messages.

Using Performance Monitor

If your server is running Windows Server 2012 or newer, the following performance counters are available:

Object              Counter                     Instance
SMB Server Shares   Avg. sec/Data Request       <Share that has ABE Enabled>
SMB Server Shares   Avg. sec/Read               ‘’
SMB Server Shares   Avg. sec/Request            ‘’
SMB Server Shares   Avg. sec/Write              ‘’
SMB Server Shares   Avg. Data Queue Length      ‘’
SMB Server Shares   Avg. Read Queue Length      ‘’
SMB Server Shares   Avg. Write Queue Length     ‘’
SMB Server Shares   Current Pending Requests    ‘’


Most noticeable here is the Avg. sec/Request counter, as it contains the response time to the QUERY_DIRECTORY requests (Wireshark displays them as Find Requests). The other values will suffer from a lack of CPU cycles in varying ways, but all indicate delays for the clients. As mentioned in the first part: we expect single-digit millisecond response times from non-ABE fileservers that are performing well. For ABE-enabled servers (more precisely, shares) the values for QUERY_DIRECTORY / Find Requests will always be higher due to the inevitable length of the ABE calculation.

When you have reached a state where all SMB requests other than QUERY_DIRECTORY are consistently answered in less than 10 ms, and QUERY_DIRECTORY consistently in less than 50 ms, you have a very well performing server with ABE.
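
A small sketch for sampling the most interesting of these counters from PowerShell (the share name Data is a placeholder; the counter names are the ones from the table above):

# Watch Avg. sec/Request and the pending request count on an ABE-enabled share
Get-Counter -Counter '\SMB Server Shares(Data)\Avg. sec/Request',
                     '\SMB Server Shares(Data)\Current Pending Requests' -SampleInterval 30 -MaxSamples 10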

Other Symptoms

There are other symptoms of ABE problems that you may observe, however, none of them on their own is very telling, without the information from the points above.

At first glance, high CPU utilization and a high processor queue length are indicators of an ABE problem; however, they are also indicators of other CPU-related performance issues. Not to mention there are cases where you encounter ABE performance problems without saturating all your CPUs.

The Server Work Queues\Active Threads (NonBlocking) counter will usually rise to its maximum allowed limit (MaxThreadsPerQueue), and the Server Work Queues\Queue Length will increase as well. Both indicate that the file server is busy, but on their own they don't tell you how bad the situation is. There are also scenarios where the file server will not use up all the worker threads allowed, because of a bottleneck somewhere else such as the disk subsystem or the CPU cycles available to it.

Should you choose to set up long-term monitoring (which you should), collect the following in order to get some trends:

Number of Objects per Directory or Number of DFS Links
Number of Peak User requests (Performance Counter: Requests / sec.)
Peak Server Response time to Find Requests or Performance Counter: Avg. sec/Request
Peak CPU Utilization and Peak Processor Queue length.

If you collect these values every day (or at a shorter interval), you get a pretty good picture of how much headroom your servers currently have and whether there are trends you need to react to.
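One way to implement such long-term monitoring with on-board tools is a binary performance log written via Get-Counter and Export-Counter. This is a sketch, assuming C:\PerfLogs exists and a 5-minute sample interval over 24 hours is acceptable; the counter selection simply mirrors the points above:

Get-Counter -Counter @(
    '\SMB Server Shares(*)\Avg. sec/Request',
    '\Server Work Queues(*)\Queue Length',
    '\Processor(_Total)\% Processor Time',
    '\System\Processor Queue Length'
) -SampleInterval 300 -MaxSamples 288 |
    Export-Counter -Path C:\PerfLogs\abe-trend.blg -FileFormat blg

Alternatively, a user-defined data collector set (logman create counter) gives you scheduling and automatic file rotation.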

Feel free to add more information to your monitoring to get a better picture of the situation. For example, gather information on how many DFS servers were active on a given day for a certain site, so you can tell whether unusually high numbers of user requests on the other servers are the result of a server being down.

ABELevel

Some of you might have heard about the registry value ABELevel. The ABELevel value specifies the maximum folder level on which ABE filtering is enabled. While the title of the KB sounds very promising, and the hotfix is presented as a "Resolution", the hotfix and registry value have very little practical application. Here's why:
ABELevel is a system-wide setting and does not differentiate between different shares on the same server. If you host several shares, you cannot filter them to different depths, as the setting forces you to go with the deepest folder hierarchy. This results in unnecessary filter calculations on the shares that would not need that depth.

Usually the widest directories are on the upper levels, which are exactly the levels that you need to filter. Disabling the filtering for the lower-level directories doesn't yield much of a performance gain, as those small directories don't have much impact on server performance, while the big top-level directories do. Furthermore, the registry value doesn't make any sense for DFS Namespaces, as you have only one folder level there, and you should avoid filtering on your file servers anyway.

While we are talking about Updates

Here is one that you should install:
High CPU usage and performance issues occur when access-based enumeration is enabled in Windows 8.1 or Windows 7 – https://support.microsoft.com/en-us/kb/2920591

Furthermore, you should definitely review the lists of recommended updates for your server components:
DFS
https://support.microsoft.com/en-us/kb/968429 (2008 / 2008 R2)
https://support.microsoft.com/en-us/kb/2951262 (2012 / 2012 R2)

File Services
https://support.microsoft.com/en-us/kb/2473205 (2008 / 2008 R2)
https://support.microsoft.com/en-us/kb/2899011 (2012 / 2012 R2)

Well then, this concludes this small (my first) blog series.
I hope you found reading it worthwhile and got some input for your infrastructures out there.

With best regards
Hubert

Troubleshooting failed password changes after installing MS16-101

Hi!

Linda Taylor here, Senior Escalation Engineer in the Directory Services space.

I have spent the last month working with customers worldwide who experienced password change failures after installing the updates under the MS16-101 security bulletin KBs (listed below), as well as working with the product group to get those issues addressed and documented in the public KB articles under the known issues section. It has been busy!

In this post I will aim to provide you with a quick “cheat sheet” of known issues and needed actions as well as ideas and troubleshooting techniques to get there.

Let’s start by understanding the changes.

The following 6 articles describe the changes in MS16-101 as well as a list of Known issues. If you have not yet applied MS16-101 I would strongly recommend reading these and understanding how they may affect you.

        3176492 Cumulative update for Windows 10: August 9, 2016
        3176493 Cumulative update for Windows 10 Version 1511: August 9, 2016
        3176495 Cumulative update for Windows 10 Version 1607: August 9, 2016
        3178465 MS16-101: Security update for Windows authentication methods: August 9, 2016
        3167679 MS16-101: Description of the security update for Windows authentication methods: August 9, 2016
        3177108 MS16-101: Description of the security update for Windows authentication methods: August 9, 2016

The good news is that this month’s updates address some of the known issues with MS16-101.

The bad news is that not all of the issues are caused by a code defect in MS16-101; in some cases the right solution is to make your environment more secure by ensuring that the password change can happen over Kerberos and does not need to fall back to NTLM. That may include opening the TCP ports used by Kerberos, fixing other Kerberos problems such as missing SPNs, or changing your application code to pass in a valid domain name.

Let’s start with the basics…

Symptoms:

After applying MS16-101 fixes listed above, password changes may fail with the error code

“The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you.”
Or
“The system cannot contact a domain controller to service the authentication request. Please try again later.”

This text maps to the error codes below:

Hexadecimal   Decimal        Symbolic                    Friendly
0xc0000388    -1073740920    STATUS_DOWNGRADE_DETECTED   The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you.
0x4f1         1265           ERROR_DOWNGRADE_DETECTED    The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you.

Question: What does MS16-101 do and why would password changes fail after installing it?

Answer: As documented in the listed KB articles, the security updates that are provided in MS16-101 disable the ability of the Microsoft Negotiate SSP to fall back to NTLM for password change operations in the case where Kerberos fails with the STATUS_NO_LOGON_SERVERS (0xc000005e) error code.
In this situation, the password change will now fail (post MS16-101) with the above mentioned error codes (ERROR_DOWNGRADE_DETECTED / STATUS_DOWNGRADE_DETECTED).
Important: Password RESET is not affected by MS16-101 at all in any scenario. Only password change using the Negotiate package is affected.

Now that you understand the change, let's look at the known issues and learn how best to identify and resolve them.

Summary and Cheat Sheet

To make it easier to follow I have matched the ordering of known issues in this post with the public KB articles above.

First, when troubleshooting a failed password change after MS16-101, you need to understand HOW and WHERE the password change is happening, and whether it is for a domain account or a local account. Here is a cheat sheet.

Summary of scenarios and a quick reference of the actions needed:

Scenario / Known issue 1

Description: Domain password change fails via CTRL+ALT+DEL with an error dialog that reads: "The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you."

Action needed: Troubleshoot using this guide and fix Kerberos.

Scenario / Known issue 2

Description: Domain password change fails via application code with an INCORRECT/UNEXPECTED error code when a password that does not meet password complexity is entered. For example, before installing MS16-101, such a password change may have returned a status like STATUS_PASSWORD_RESTRICTION; after installing MS16-101 it returns STATUS_DOWNGRADE_DETECTED, causing your application to behave in an unexpected way or even crash. Note: in these cases the password change works fine when a correct new password is entered that complies with the password policy.

Action needed: Install the October fixes in the table below.

Scenario / Known issue 3

Description: Local user account password change fails via CTRL+ALT+DEL or application code.

Action needed: Install the October fixes in the table below.

Scenario / Known issue 4

Description: Passwords for disabled and locked-out user accounts cannot be changed using the Negotiate method.

Action needed: None. By design.

Scenario / Known issue 5

Description: Domain password change fails via application code when a good password is entered. This is the case where you pass a server name to NetUserChangePassword: the password change previously worked only because it relied on the NTLM fallback, and it now fails post MS16-101. NTLM is insecure and Kerberos is always preferred, so passing a domain name here is the way forward. Note that most of the ADSI and C#/.NET ChangePassword APIs end up calling NetUserChangePassword under the hood, so passing invalid domain names to these APIs will fail as well. I have provided a detailed walkthrough example in this post with log snippets.

Action needed: Troubleshoot using this guide and fix the code to use Kerberos.

Scenario / Known issue 6

Description: After you install the MS16-101 update, you may encounter 0xC0000022 NTLM authentication errors.

Action needed: See KB3195799, NTLM authentication fails with 0xC0000022 error for Windows Server 2012, Windows 8.1, and Windows Server 2012 R2 after update is applied.

Scenario / Known issue 7

Description: After you install the security updates that are described in MS16-101, remote programmatic changes of a local user account password, as well as password changes across an untrusted forest, fail with the STATUS_DOWNGRADE_DETECTED error as documented in this post. This happens because the operation relies on the NTLM fallback, since there is no Kerberos without a trust. NTLM fallback is forbidden by MS16-101.

Action needed: Install the October fixes in the table below and set the registry value NegoAllowNtlmPwdChangeFallback documented in the KBs below, which allows the NTLM fallback to happen again and unblocks this scenario.

http://support.microsoft.com/kb/3178465
http://support.microsoft.com/kb/3167679
http://support.microsoft.com/kb/3177108
http://support.microsoft.com/kb/3176492
http://support.microsoft.com/kb/3176495
http://support.microsoft.com/kb/3176493

Note: you may also consider using this registry value in an emergency for known issue 5 when it takes time to update the application code. However, please read the above articles carefully and only consider this a short-term solution for scenario 5.
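For reference, a minimal sketch of setting that value from an elevated prompt is shown below. The value name comes from the KB articles above; the exact location (the Lsa key) and the data (1 to re-enable the fallback) are assumptions on my part, so verify both against KB3178465/KB3167679 before deploying this anywhere:

reg add HKLM\SYSTEM\CurrentControlSet\Control\Lsa /v NegoAllowNtlmPwdChangeFallback /t REG_DWORD /d 1 /f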


Table of fixes for the known issues above, released 2016.10.11, taken from the MS16-101 Security Bulletin:

OS                 Fix needed
Vista / W2K8       Re-install 3167679, re-released 2016.10.11
Win7 / W2K8 R2     Install 3192391 (security only) or 3185330 (monthly rollup that includes security fixes)
WS12               Install 3192393 (security only) or 3185332 (monthly rollup that includes security fixes)
Win8.1 / WS12 R2   Install 3192392 (security only) or 3185331 (monthly rollup that includes security fixes)
Windows 10         For 1511: 3192441 Cumulative update for Windows 10 Version 1511: October 11, 2016
                   For 1607: 3194798 Cumulative update for Windows 10 Version 1607 and Windows Server 2016: October 11, 2016

Troubleshooting

As I mentioned, this post is intended to supplement the documentation of the known issues in the MS16-101 KB articles and provide help and guidance for troubleshooting. It should help you identify which known issue you are experiencing, as well as provide resolution suggestions for each case.

I have also included a troubleshooting walkthrough of some of the more complex example cases. We will start with the problem definition, and then look at the available logs and tools to identify a suitable resolution. The idea is to teach "how to fish", because there can be many different scenarios, and hopefully you can apply these techniques and use the log files documented here to resolve the issues when needed.

Once you know the scenario you are dealing with, the next step is usually to collect some data on the server or client where the password change is occurring. For example, if you have a web server running a password change application that changes passwords on behalf of users, you will need to collect the logs there. If in doubt, collect the logs from all the machines involved and then find the one doing the password change using the snippets in the examples. Here are the helpful logs.

DATA COLLECTION

The same logs will help in all of the scenarios.

LOGS

1. SPNEGO debug log / LSASS.log

To enable this log, run the following commands from an elevated admin CMD prompt to set the registry values below:

reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v SPMInfoLevel /t REG_DWORD /d 0xC03E3F /f
reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v LogToFile /t REG_DWORD /d 1 /f
reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v NegEventMask /t REG_DWORD /d 0xF /f


  • This will log Negotiate debug output to the %windir%\system32\lsass.log.
  • There is no need for reboot. The log is effective immediately.
  • Lsass.log is a text file that is easy to read with a text editor such as Wordpad.
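When you have finished troubleshooting, you will probably want to turn this logging off again. A minimal sketch, assuming the three values did not exist before you created them (removing them restores the default behavior):

reg delete HKLM\SYSTEM\CurrentControlSet\Control\LSA /v SPMInfoLevel /f
reg delete HKLM\SYSTEM\CurrentControlSet\Control\LSA /v LogToFile /f
reg delete HKLM\SYSTEM\CurrentControlSet\Control\LSA /v NegEventMask /f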

2. Netlogon.log:

This log has been around for many years and is useful for troubleshooting DC LOCATOR traffic. It can be used together with a network trace to understand why the STATUS_NO_LOGON_SERVERS is being returned for the Kerberos password change attempt.

· To enable Netlogon debug logging run the following command from an elevated CMD prompt:

            nltest /dbflag:0x26FFFFFF

· The resulting log is found in %windir%\debug\netlogon.log & netlogon.bak

· There is no need for reboot. The log is effective immediately. See also 109626 Enabling debug logging for the Net Logon service

· The Netlogon.log (and Netlogon.bak) is a text file.

           Open the log with any text editor (I like good old Notepad.exe)
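When you are done, you can turn Netlogon debug logging back off again:

            nltest /dbflag:0x0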

3. Collect a Network trace during the password change issue using the tool of your choice.

Scenario’s, Explanations and Walkthrough’s:

When reading this, keep in mind that you may be seeing more than one scenario. The best thing to do is to start with one, fix it, and then see whether any other problems remain.

1. Domain password change fails via CTRL+ALT+DEL

This is most likely a Kerberos DC locator failure of some kind, where the password changes were relying on NTLM before MS16-101 was installed and are now failing. This is the simplest and easiest case to resolve using basic Kerberos troubleshooting methods.

Solution: Fix Kerberos.

Some tips from cases we have seen:

1. Use the network trace to identify whether the necessary communication ports are open. This was quite a common issue, so start by checking this.

         For Kerberos password changes to work, communication on TCP port 464 needs to be open between the client doing the
         password change and the domain controller.

Note on RODC: Read-only domain controllers (RODCs) can service password changes if the user is allowed by the RODCs password replication policy. Users who are not allowed by the RODC password policy require network connectivity to a read/write domain controller (RWDC) in the user account domain to be able to change the password.

           To check whether TCP port 464 is open, follow these steps (also documented in KB3167679):

             a. Create an equivalent display filter for your network monitor parser. For example:

                            ipv4.address== <ip address of client> && tcp.port==464

             b. In the results, look for the “TCP:[SynReTransmit” frame.

If you find these, then investigate the firewall and open ports. It is often useful to take simultaneous traces from the client and the domain controller and check whether the packets are arriving at the other end.
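As a quick first check before digging through traces, you can also probe the port from the client with PowerShell. This is just a sketch; replace DC1.contoso.com with one of your own domain controllers:

Test-NetConnection -ComputerName DC1.contoso.com -Port 464

If TcpTestSucceeded comes back False while other traffic to the same DC works, a firewall blocking TCP port 464 is the likely suspect.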

2. Make sure that the target Kerberos names are valid.

  • IP addresses are not valid Kerberos names
  • Kerberos supports short names and fully qualified domain names. Like CONTOSO or Contoso.com

3. Make sure that service principal names (SPNs) are registered correctly.
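A quick way to verify an SPN is to query for it with the setspn tool from any domain-joined machine (the SPN below is just an example; substitute the one you see in your trace):

setspn -Q ldap/DC1.contoso.com

If the SPN is registered, setspn reports the account that holds it; if it reports that no such SPN was found, you have found your problem.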

For more information on troubleshooting Kerberos see https://blogs.technet.microsoft.com/askds/2008/05/14/troubleshooting-kerberos-authentication-problems-name-resolution-issues/ or https://technet.microsoft.com/en-us/library/cc728430(v=ws.10).aspx

2. Domain password change fails via application code with an INCORRECT/UNEXPECTED Error code when a password which does not meet password complexity is entered.

For example, before installing MS16-101, such a password change may have returned a status like STATUS_PASSWORD_RESTRICTION. After installing MS16-101 it returns STATUS_DOWNGRADE_DETECTED, causing your application to behave in an unexpected way or even crash.

Note: In this scenario, password change succeeds when correct new password is entered that complies with the password policy.

Cause:

This issue is caused by a code defect in ADSI whereby the status returned from Kerberos was not correctly passed back to the caller by ADSI.
Here is a more detailed explanation of this one, for the geek in you:

Before MS16-101 behavior:

           1. An application calls the ChangePassword method using the ADSI LDAP provider.
           Setting and changing passwords with the ADSI LDAP provider is documented here.
           Under the hood this calls Negotiate/Kerberos to change the password using a valid realm name.
           Kerberos returns STATUS_PASSWORD_RESTRICTION or another failure code.

           2. A second ChangePassword call is made via the NetUserChangePassword API with a <dcname> as the realm name, which uses
           Negotiate and will retry Kerberos. Kerberos fails with STATUS_NO_LOGON_SERVERS because a DC name is not a valid realm name.

           3. Negotiate then retries over NTLM, which either succeeds or returns the same failure status as before.

The password change fails if a bad password was entered, and the NTLM error code is returned to the application. If a valid password was entered, everything works: the first ChangePassword call passes in a good name, and if Kerberos works, the password change operation succeeds and you never get to step 3.

Post MS16-101 behavior /why it fails with MS16-101 installed:

         1. An application calls the ChangePassword method using the ADSI LDAP provider. This calls Negotiate for the password change with
          a valid realm name.
          Kerberos returns STATUS_PASSWORD_RESTRICTION or another failure code.

         2. A second ChangePassword call is made via NetUserChangePassword with a <dcname> as the realm name, which fails over Kerberos with
         STATUS_NO_LOGON_SERVERS and triggers the NTLM fallback.

          3. Because the NTLM fallback is blocked by MS16-101, the error STATUS_DOWNGRADE_DETECTED is returned to the calling application.

Solution: Easy. Install the October update, which fixes this issue. The fix lies in adsmsext.dll, included in the October updates.

Again, here are the updates you need to install, taken from the MS16-101 Security Bulletin:

OS                 Fix needed
Vista / W2K8       Re-install 3167679, re-released 2016.10.11
Win7 / W2K8 R2     Install 3192391 (security only) or 3185330 (monthly rollup that includes security fixes)
WS12               Install 3192393 (security only) or 3185332 (monthly rollup that includes security fixes)
Win8.1 / WS12 R2   Install 3192392 (security only) or 3185331 (monthly rollup that includes security fixes)
Windows 10         For 1511: 3192441 Cumulative update for Windows 10 Version 1511: October 11, 2016
                   For 1607: 3194798 Cumulative update for Windows 10 Version 1607 and Windows Server 2016: October 11, 2016

3. Local user account password change fails via CTRL+ALT+DEL or application code.

Installing the October updates listed above should also resolve this.

MS16-101 had a defect where Negotiate did not correctly determine that the password change was local, and would try to find a DC using the local machine name as the domain name.

This failed, and since NTLM fallback is no longer allowed post MS16-101, the password changes failed with STATUS_DOWNGRADE_DETECTED.

Example:

One such scenario I saw, where password changes of local user accounts via CTRL+ALT+DEL failed with the message "The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you.", was when the following group policy is set and you try to change the password of a local account:

Policy:      Computer Configuration \ Administrative Templates \ System \ Logon \ "Assign a default domain for logon"
Path:        HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\DefaultLogonDomain
Setting:     DefaultLogonDomain
Data Type:   REG_SZ
Value:       "." (without the quotes). The period, or "dot", designates the local machine name.
Cause: In this case, post MS16-101, Negotiate incorrectly determined that the account was not local and tried to discover a DC using \\<machinename> as the domain, which failed. This caused the password change to fail with the STATUS_DOWNGRADE_DETECTED error.

Solution: Install October fixes listed in the table at the top of this post.

4. Passwords for disabled and locked-out user accounts cannot be changed using the Negotiate method.

MS16-101 intentionally disabled changing the passwords of locked-out or disabled user accounts via Negotiate.

Important: Password reset is not affected by MS16-101 in any scenario; only password change is. Therefore, any application that performs a password reset is unaffected by MS16-101.

Another important thing to note is that MS16-101 only affects applications using Negotiate. Therefore, it is still possible to change locked-out and disabled account passwords using other methods, such as LDAPS.

For example, the PowerShell cmdlet Set-ADAccountPassword will continue to work for locked out and disabled account password changes as it does not use Negotiate.
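For illustration, a password change (as opposed to a reset) with that cmdlet would look something like the sketch below. The account name and passwords are placeholders, and the ActiveDirectory PowerShell module is assumed to be installed:

Import-Module ActiveDirectory
$old = ConvertTo-SecureString 'oldPassword!123' -AsPlainText -Force
$new = ConvertTo-SecureString 'newPassword!123' -AsPlainText -Force
Set-ADAccountPassword -Identity TestUser -OldPassword $old -NewPassword $new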

5. Troubleshooting Domain password change failure via application code when a good password is entered.

This is one of the most difficult scenarios to identify and troubleshoot, so I have provided a more detailed example here, including sample code, the cause and the solution.

In summary, the solution in these cases is almost always to correct the application code, which may be passing in an invalid domain name such that Kerberos fails with STATUS_NO_LOGON_SERVERS.

Scenario:

An application uses the System.DirectoryServices.AccountManagement namespace to change a user's password.
https://msdn.microsoft.com/en-us/library/system.directoryservices.accountmanagement(v=vs.110).aspx

After installing MS16-101, password changes fail with STATUS_DOWNGRADE_DETECTED. Here is an example of failing .NET code (driven from PowerShell) that worked before MS16-101:

<snip>

Add-Type -AssemblyName System.DirectoryServices.AccountManagement
$ct = [System.DirectoryServices.AccountManagement.ContextType]::Domain
$ctoptions = [System.DirectoryServices.AccountManagement.ContextOptions]::SimpleBind -bor [System.DirectoryServices.AccountManagement.ContextOptions]::ServerBind
$pc = New-Object System.DirectoryServices.AccountManagement.PrincipalContext($ct, "contoso.com", "OU=Accounts,DC=Contoso,DC=Com", $ctoptions)
$idType = [System.DirectoryServices.AccountManagement.IdentityType]::SamAccountName
$up = [System.DirectoryServices.AccountManagement.UserPrincipal]::FindByIdentity($pc, $idType, "TestUser")
$up.ChangePassword("oldPassword!123", "newPassword!123")

<snip>

Data Analysis

There are 2 possibilities here:
(a) The application code is passing an incorrect domain name parameter, causing the Kerberos password change to fail to locate a DC.
(b) The application code is good, and the Kerberos password change fails for another reason, such as a blocked port, a DNS issue or a missing SPN.

Let's start with (a): the application code is passing an incorrect domain name/parameter, causing the Kerberos password change to fail to locate a DC.

(a) Data Analysis Walkthrough Example based on a real case:

1. Start with Lsass.log (SPNEGO trace)

If you are troubleshooting a password change failure after MS16-101, look for the following text in Lsass.log; it indicates that Kerberos failed and the fallback to NTLM was forbidden by MS16-101:

Failing Example:

[ 9/13 10:23:36] 492.2448> SPM-WAPI: [11b0.1014] Dispatching API (Message 0)
[ 9/13 10:23:36] 492.2448> SPM-Trace: [11b0] LpcDispatch: dispatching ChangeAccountPassword (1a)
[ 9/13 10:23:36] 492.2448> SPM-Trace: [11b0] LpcChangeAccountPassword()
[ 9/13 10:23:36] 492.2448> SPM-Helpers: [11b0] LsapCopyFromClient(0000005EAB78C9D8, 000000DA664CE5E0, 16) = 0
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword:
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: NegoExtender
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: Kerberos
[ 9/13 10:23:36] 492.2448> SPM-Warning: Failed to change password for account Test: 0xc000005e
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: NTLM
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, NTLM failed: not allowed to change domain passwords
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, returning: 0xc0000388

  • 0xc000005E is STATUS_NO_LOGON_SERVERS
  • 0xc0000388 is STATUS_DOWNGRADE_DETECTED

If you see this, it means Kerberos failed to locate a domain controller in the domain and the fallback to NTLM was not allowed by MS16-101. Next, look at the Netlogon.log and the network trace to understand why.

2. Network trace

Look at the network trace and filter the traffic based on the client IP, DNS, and any authentication-related traffic.
You may see the client requesting a Kerberos ticket using an invalid SPN, like this:


Source    Destination    Description
Client    DC1            KerberosV5:TGS Request Realm: CONTOSO.COM Sname: ldap/contoso.com   {TCP:45, IPv4:7}
DC1       Client         KerberosV5:KRB_ERROR – KDC_ERR_S_PRINCIPAL_UNKNOWN (7)   {TCP:45, IPv4:7}

So here the client tried to get a ticket for the ldap/contoso.com SPN and failed with KDC_ERR_S_PRINCIPAL_UNKNOWN because this SPN is not registered anywhere.

  • This is expected. A valid LDAP SPN looks like ldap/DC1.contoso.com.

Next let’s check the Netlogon.log

3. Netlogon.log:

Open the log with any text editor (I like good old Notepad.exe) and check the following:

  • Is a valid domain name being passed to the DC locator?

Invalid names such as \\servername.contoso.com or an IP address such as \\x.y.z.w will cause the DC locator to fail, and thus the Kerberos password change returns STATUS_NO_LOGON_SERVERS. Once that happens, NTLM fallback is not allowed and you get a failed password change.

If you find this issue, examine the application code and make the necessary changes to ensure that a correctly formatted domain name is passed to whichever ChangePassword API is being used.

Example of failure in Netlogon.log:

[MISC] [PID] DsGetDcName function called: client PID=1234, Dom:\\contoso.com Acct:(null) Flags: IP KDC
[MISC] [PID] DsGetDcName function returns 1212 (client PID=1234): Dom:\\contoso.com Acct:(null) Flags: IP KDC

\\contoso.com is not a valid domain name. (contoso.com is a valid domain name)

This Error translates to:

0x4bc    1212    ERROR_INVALID_DOMAINNAME    The format of the specified domain name is invalid.    (winerror.h)

So what happened here?

The application code passed an invalid TargetName to Kerberos. It used the domain name as a server name, which is why we see the SPN ldap/contoso.com.

The client tried to get a ticket for this SPN and failed with KDC_ERR_S_PRINCIPAL_UNKNOWN because the SPN is not registered anywhere. As noted above, this is expected; a valid LDAP SPN looks like ldap/DC1.contoso.com.

The application code then tried the password change again and passed in \\contoso.com as the domain name. Anything beginning with \\ is not a valid domain name, and an IP address is not valid either, so the DC locator fails to locate a DC when given this name. We can see this in the Netlogon.log and in the network trace.

Conclusion and Solution

If the domain name is invalid here, examine the code snippet which is doing the password change to understand why the wrong name is passed in.

The fix in these cases is to change the code so that a valid domain name is passed to Kerberos, allowing the password change to happen over Kerberos rather than NTLM. NTLM is not secure; if Kerberos is possible, it should be the protocol used.

SOLUTION

The solution here was to remove "ContextOptions.ServerBind | ContextOptions.SimpleBind" and allow the code to use the default (Negotiate). The combination of a Domain context with ServerBind is what caused the issue; Negotiate with a Domain context is the option that works and is successfully able to use Kerberos.

Working code:

<snip>
Add-Type -AssemblyName System.DirectoryServices.AccountManagement
$ct = [System.DirectoryServices.AccountManagement.ContextType]::Domain
$pc = New-Object System.DirectoryServices.AccountManagement.PrincipalContext($ct, "contoso.com", "OU=Accounts,DC=Contoso,DC=Com")
$idType = [System.DirectoryServices.AccountManagement.IdentityType]::SamAccountName
$up = [System.DirectoryServices.AccountManagement.UserPrincipal]::FindByIdentity($pc, $idType, "TestUser")
$up.ChangePassword("oldPassword!123", "newPassword!123")

<snip>

Why does this code work before MS16-101 and fail after?

ContextOptions are documented here: https://msdn.microsoft.com/en-us/library/system.directoryservices.accountmanagement.contextoptions(v=vs.110).aspx

Specifically: “This parameter specifies the options that are used for binding to the server. The application can set multiple options that are linked with a bitwise OR operation. “

Passing in a domain name such as contoso.com with the ContextOptions ServerBind or SimpleBind causes the client to attempt to use an SPN like ldap/contoso.com, because it expects the name that is passed in to be a server name.

This is not a valid SPN and does not exist, so Kerberos fails with STATUS_NO_LOGON_SERVERS.
Before MS16-101, the Negotiate package would fall back to NTLM in this scenario, attempt the password change using NTLM and succeed.
Post MS16-101 this fallback is not allowed and Kerberos is enforced.

(b) If Application Code is good but Kerberos fails to locate a DC for other reason

If you see a correct domain name and correct SPNs in the logs above, then Kerberos is failing for some other reason, such as blocked TCP ports. In that case, go back to Scenario 1 to troubleshoot why Kerberos failed to locate a domain controller.

There is also a chance that you have both (a) and (b). Traces and logs are the best tools to identify which.

Scenario 6: After you install the MS16-101 update, you may encounter 0xC0000022 NTLM authentication errors.

I will not go into detail on this scenario, as it is well described in KB3195799, NTLM authentication fails with 0xC0000022 error for Windows Server 2012, Windows 8.1, and Windows Server 2012 R2 after update is applied.

That’s all for today! I hope you find this useful. I will update this post if any new information arises.

Linda Taylor | Senior Escalation Engineer | Windows Directory Services
(A well established member of the content police.)

Using Debugging Tools to Find Token and Session Leaks

Hello AskDS readers and Identity aficionados. Long time no blog.

Ryan Ries here, and today I have a relatively “hardcore” blog post that will not be for the faint of heart. However, it’s about an important topic.

The behavior surrounding security tokens and logon sessions has recently changed on all supported versions of Windows. IT professionals – developers and administrators alike – should understand what this new behavior is, how it can affect them, and how to troubleshoot it.

But first, a little background…


Figure 1 – Tokens

Windows uses security tokens (or access tokens) extensively to control access to system resources. Every thread running on the system uses a security token, and may own several at a time. Threads inherit the security tokens of their parent processes by default, but they may also use special security tokens that represent other identities in an activity known as impersonation. Since security tokens are used to grant access to resources, they should be treated as highly sensitive, because if a malicious user can gain access to someone else’s security token, they will be able to access resources that they would not normally be authorized to access.

Note: Here are some additional references you should read first if you want to know more about access tokens:

If you are an application developer, your application or service may want to create or duplicate tokens for the legitimate purpose of impersonating another user. A typical example would be a server application that wants to impersonate a client to verify that the client has permissions to access a file or database. The application or service must be diligent in how it handles these access tokens by releasing/destroying them as soon as they are no longer needed. If the code fails to call the CloseHandle function on a token handle, that token can then be “leaked” and remain in memory long after it is no longer needed.

And that brings us to Microsoft Security Bulletin MS16-111.

Here is an excerpt from that Security Bulletin:

Multiple Windows session object elevation of privilege vulnerabilities exist in the way that Windows handles session objects.

A locally authenticated attacker who successfully exploited the vulnerabilities could hijack the session of another user.
To exploit the vulnerabilities, the attacker could run a specially crafted application.
The update corrects how Windows handles session objects to prevent user session hijacking.

Those vulnerabilities were fixed with that update, and I won’t further expound on the “hacking/exploiting” aspect of this topic. We’re here to explore this from a debugging perspective.

This update is significant because it changes how the relationship between tokens and logon sessions is treated across all supported versions of Windows going forward. Applications and services that erroneously leak tokens have always been with us, but the penalty paid for leaking tokens is now greater than before. After MS16-111, when security tokens are leaked, the logon sessions associated with those security tokens also remain on the system until all associated tokens are closed… even after the user has logged off the system. If the tokens associated with a given logon session are never released, then the system now also has a permanent logon session leak as well. If this leak happens often enough, such as on a busy Remote Desktop/Terminal Server where users are logging on and off frequently, it can lead to resource exhaustion on the server, performance issues and denial of service, ultimately causing the system to require a reboot to be returned to service.

Therefore, it’s more important than ever to be able to identify the symptoms of token and session leaks, track down token leaks on your systems, and get your application vendors to fix them.

How Do I Know If My Server Has Leaks?

As mentioned earlier, this problem affects heavily-utilized Remote Desktop Session Host servers the most, because users are constantly logging on and logging off the server. The issue is not limited to Remote Desktop servers, but symptoms will be most obvious there.

Figuring out that you have logon session leaks is the easy part. Just run qwinsta at a command prompt:


Figure 2 – qwinsta

Pay close attention to the session ID numbers, and notice the large gap between session 2 and session 152. This is the clue that the server has a logon session leak problem. The next user that logs on will get session 153, the next user will get session 154, the next user will get session 155, and so on. But the session IDs will never be reused. We have 150 “leaked” sessions in the screenshot above, where no one is logged on to those sessions, no one will ever be able to log on to those sessions ever again (until a reboot,) yet they remain on the system indefinitely. This means each user who logs onto the system is inadvertently leaving tokens lying around in memory, probably because some application or service on the system duplicated the user’s token and didn’t release it. These leaked sessions will forever be unusable and soak up system resources. And the problem will only get worse as users continue to log on to the system. In an optimal situation where there were no leaks, sessions 3-151 would have been destroyed after the users logged out and the resources consumed by those sessions would then be reusable by subsequent logons.

How Do I Find Out Who’s Responsible?

Now that you know you have a problem, next you need to track down the application or service that is responsible for leaking access tokens. When an access token is created, the token is associated to the logon session of the user who is represented by the token, and an internal reference count is incremented. The reference count is decremented whenever the token is destroyed. If the reference count never reaches zero, then the logon session is never destroyed or reused. Therefore, to resolve the logon session leak problem, you must resolve the underlying token leak problem(s). It’s an all-or-nothing deal. If you fix 10 token leaks in your code but miss 1, the logon session leak will still be present as if you had fixed none.

Before we proceed: I would recommend debugging this issue on a lab machine, rather than on a production machine. If you have a logon session leak problem on your production machine, but don’t know where it’s coming from, then install all the same software on a lab machine as you have on the production machine, and use that for your diagnostic efforts. You’ll see in just a second why you probably don’t want to do this in production.

The first step to tracking down the token leaks is to enable token leak tracking on the system.

Modify this registry setting:

HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Kernel
    SeTokenLeakDiag = 1 (DWORD)

The registry setting won’t exist by default unless you’ve done this before, so create it. It also did not exist prior to MS16-111, so don’t expect it to do anything if the system does not have MS16-111 installed. This registry setting enables extra accounting on token issuance that you will be able to detect in a debugger, and there may be a noticeable performance impact on busy servers. Therefore, it is not recommended to leave this setting in place unless you are actively debugging a problem. (i.e. don’t do it in production exhibit A.)

Prior to the existence of this registry setting, token leak tracing of this kind used to require using a checked build of Windows. And Microsoft seems to not be releasing a checked build of Server 2016, so… good timing.

Next, you need to configure the server to take a full or kernel memory dump when it crashes. (A live kernel debug may also be an option, but that is outside the scope of this article.) I recommend using DumpConfigurator to configure the computer for complete crash dumps. A kernel dump should be enough to see most of what we need, but get a Complete dump if you can.


Figure 3 – DumpConfigurator

Then reboot the server for the settings to take effect.

Next, you need users to log on and off the server, so that the logon session IDs continue to climb. Since you’re doing this in a lab environment, you might want to use a script to automatically logon and logoff a set of test users. (I provided a sample script for you here.) Make sure you’ve waited 10 minutes after the users have logged off to verify that their logon sessions are permanently leaked before proceeding.

Finally, crash the box. Yep, just crash it. (i.e. don't do it in production exhibit B.) On a physical machine, this can be done by holding Right Ctrl and pressing Scroll Lock twice, if you configured the appropriate setting with DumpConfigurator earlier. If this is a Hyper-V machine, you can use the following PowerShell cmdlet on the Hyper-V host:

Debug-VM -VM (Get-VM RDS1) -InjectNonMaskableInterrupt

You may have at your disposal other means of getting a non-maskable interrupt to the machine, such as an out-of-band management card (iLO/DRAC, etc.,) but the point is to deliver an NMI to the machine, and it will bugcheck and generate a memory dump.

Now transfer the memory dump file (C:\Windows\Memory.dmp usually) to whatever workstation you will use to perform your analysis.

Note: Memory dumps may contain sensitive information, such as passwords, so be mindful when sharing them with strangers.

Next, install the Windows Debugging Tools on your workstation if they’re not already installed. I downloaded mine for this demo from the Windows Insider Preview SDK here. But they also come with the SDK, the WDK, WPT, Visual Studio, etc. The more recent the version, the better.

Next, download the MEX Debugging Extension for WinDbg. Engineers within Microsoft have been using the MEX debugger extension for years, but only recently has a public version of the extension been made available. The public version is stripped-down compared to the internal version, but it’s still quite useful. Unpack the file and place mex.dll into your C:\Debuggers\winext directory, or wherever you installed WinDbg.

Now, ensure that your symbol path is configured correctly to use the Microsoft public symbol server within WinDbg:


Figure 4 – Example Symbol Path in WinDbg

The example symbol path above tells WinDbg to download symbols from the specified URL, and store them in your local C:\Symbols directory.
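For reference, a symbol path of that shape (assuming C:\Symbols as the local cache directory) looks like this:

srv*C:\Symbols*https://msdl.microsoft.com/download/symbols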

Finally, you are ready to open your crash dump in WinDbg:


Figure 5 – Open Crash Dump from WinDbg

After opening the crash dump, the first thing you’ll want to do is load the MEX debugging extension that you downloaded earlier, by typing the command:


Figure 6 – .load mex

The next thing you probably want to do is start a log file. It will record everything that goes on during this debugging session, so that you can refer to it later in case you forgot what you did or where you left off.


Figure 7 – !logopen

Another useful command that is among the first things I always run is !DumpInfo, abbreviated !di, which simply gives some useful basic information about the memory dump itself, so that you can verify at a glance that you’ve got the correct dump file, which machine it came from and what type of memory dump it is.


Figure 8 – !DumpInfo

You’re ready to start debugging.

At this point, I have good news and I have bad news.

The good news is that there already exists a super-handy debugger extension that lists all the logon session kernel objects, their associated token reference counts, what process was responsible for creating the token, and even the token creation stack, all with a single command! It’s
!kdexts.logonsession, and it is awesome.

The bad news is that it doesn’t work… not with public symbols. It only works with private symbols. Here is what it looks like with public symbols:


Figure 9 – !kdexts.logonsession – public symbols lead to lackluster output

As you can see, most of the useful stuff is zeroed out.

Since public symbols are all you have unless you work at Microsoft, (and we wish you did,) I’m going to teach you how to do what
!kdexts.logonsession does, manually. The hard way. Plus some extra stuff. Buckle up.

First, you should verify whether token leak tracking was turned on when this dump was taken. (That was the registry setting mentioned earlier.)


Figure 10 – x nt!SeTokenLeakTracking = <no type information>

OK… That was not very useful. We’re getting <no type information> because we’re using public symbols. But this symbol corresponds to the SeTokenLeakDiag registry setting that we configured earlier, and we know that’s just 0 or 1, so we can just guess what type it is:


Figure 11 – db nt!SeTokenLeakTracking L1

The db command means “dump bytes.” (dd, or “dump DWORDs,” would have worked just as well.) You should have a symbol for
nt!SeTokenLeakTracking if you configured your symbol path properly, and the L1 tells the debugger to just dump the first byte it finds. It should be either 0 or 1. If it’s 0, then the registry setting that we talked about earlier was not set properly, and you can basically just discard this dump file and get a new one. If it’s 1, you’re in business and may proceed.

Next, you need to locate the logon session lists.


Figure 12 – dp nt!SepLogonSessions L1

Like the previous step, dp means “display pointer,” then the name of the symbol, and L1 to just display a single pointer. The 64-bit value on the right is the pointer, and the 64-bit value on the left is the memory address of that pointer.

Now we know where our lists of logon sessions begin. (Lists, plural.)

The SepLogonSessions pointer points to not just a list, but an array of lists. These lists are made up of _SEP_LOGON_SESSION_REFERENCES structures.

Using the dps command (display contiguous pointers) and specifying the beginning of the array that we got from the last step, we can now see where each of the lists in the array begins:


Figure 13 – dps 0xffffb808`3ea02650 – displaying pointers that point to the beginning of each list in the array

If there were not very many logon sessions on the system when the memory dump was taken, you might notice that not all the lists are populated:


Figure 14 – Some of the logon session lists are empty because not very many users had logged on in this example

The array doesn’t fill up contiguously, which is a bummer. You’ll have to skip over the empty lists.

If we wanted to walk just the first list in the array (we’ll talk more about dt and linked lists in just a minute,) it would look something like this:


Figure 15 – Walking the first list in the array and using !grep to filter the output

Notice that I used the !grep command to filter the output for the sake of brevity and readability. It’s part of the Mex debugger extension. I told you it was handy. If you omit the !grep AccountName part, you would get the full, unfiltered output. I chose “AccountName” arbitrarily as a keyword because I knew that was a word that was unique to each element in the list. !grep will only display lines that contain the keyword(s) that you specify.

Next, if we wanted to walk through the entire array of lists all at once, it might look something like this:


Figure 16 – Walking through the entire array of lists!

OK, I realize that I just went bananas there, but I’ll explain what just happened step-by-step.

When you are using the Mex debugger extension, you have access to many new text parsing and filtering commands that can truly enhance your debugging experience. When you look at a long command like the one I just showed, read it from right to left. The commands on the right are fed into the command to their left.

So from right to left, let’s start with !cut -f 2 dps ffffb808`3ea02650

We already showed what the dps <address> command did earlier. The !cut -f 2 command filters that command’s output so that it only displays the second part of each line separated by whitespace. So essentially, it will display only the pointers themselves, and not their memory addresses.

Like this:


Figure 17 – Using !cut to select just the second token in each line of output

Then that is “piped” line-by-line into the next command to the left, which was:

!fel -x “dt nt!_SEP_LOGON_SESSION_REFERENCES @#Line -l Next”

!fel is an abbreviation for !foreachline.

This command instructs the debugger to execute the given command for each line of output supplied by the previous command, where the @#Line pseudo-variable represents the individual line of output. For each line of output that came from the dps command, we are going to use the dt command with the -l parameter to walk that list. (More on walking lists in just a second.)

Next, we use the !grep command to filter all of that output so that only a single unique line is shown from each list element, as I showed earlier.

Finally, we use the !count -q command to suppress all of the output generated up to that point, and instead only tell us how many lines of output it would have generated. This should be the total number of logon sessions on the system.

And 380 was in fact the exact number of logon sessions on the computer when I collected this memory dump. (Refer to Figure 16.)

Alright… now let’s take a deep breath and a step back. We just walked an entire array of lists of structures with a single line of commands. But now we need to zoom in and take a closer look at the data structures contained within those lists.

Remember, ffffb808`3ea02650 was the very beginning of the entire array.

Let’s examine just the very first _SEP_LOGON_SESSION_REFERENCES entry of the first list, to see what such a structure looks like:


Figure 18 – dt _SEP_LOGON_SESSION_REFERENCES* ffffb808`3ea02650

That’s a logon session!

Let’s go over a few of the basic fields in this structure. (Skipping some of the more advanced ones.)

  • Next: This is a pointer to the next element in the list. You might notice that there’s a “Next,” but there’s no “Previous.” So, you can only walk the list in one direction. This is a singly-linked list.
  • LogonId: Every logon gets a unique one. For example, “0x3e7” is always the “System” logon.
  • ReferenceCount: This is how many outstanding token references this logon session has. This is the number that must reach zero before the logon session can be destroyed. In our example, it’s 4.
  • AccountName: The user who does or used to occupy this session.
  • AuthorityName: Will be the user’s Active Directory domain, typically. Or the computer name if it’s a local account.
  • TokenList: This is a doubly or circularly-linked list of the tokens that are associated with this logon session. The number of tokens in this list should match the ReferenceCount.

The following is an illustration of a doubly-linked list:


Figure 19 – Doubly or circularly-linked list

“Flink” stands for Forward Link, and “Blink” stands for Back Link.

So now that we understand that the TokenList member of the _SEP_LOGON_SESSION_REFERENCES structure is a linked list, here is how you walk that list:


Figure 20 – dt nt!_LIST_ENTRY* 0xffffb808`500bdba0+0x0b0 -l Flink

The dt command stands for “display type,” followed by the symbol name of the type that you want to cast the following address to. The reason why we specified the address 0xffffb808`500bdba0 is because that is the address of the _SEP_LOGON_SESSION_REFERENCES object that we found earlier. The reason why we added +0x0b0 after the memory address is because that is the offset from the beginning of the structure at which the TokenList field begins. The -l parameter specifies that we’re trying to walk a list, and finally you must specify a field name (Flink in this case) that tells the debugger which field to use to navigate to the next node in the list.

We walked a list of tokens and what did we get? A list head and 4 data nodes, 5 entries total, which lines up with the ReferenceCount of 4 tokens that we saw earlier. One of the nodes won’t have any data – that’s the list head.

Now, for each entry in the linked list, we can examine its data. We know the payloads that these list nodes carry are tokens, so we can use dt to cast them as such:


Figure 21 – dt _TOKEN*0xffffb808`4f565f40+8+8 – Examining the first token in the list

The reason for the +8+8 on the end is because that’s the offset of the payload. It’s just after the Flink and Blink as shown in Figure 19. You want to skip over them.

We can see that this token is associated to SessionId 0x136/0n310. (Remember I had 380 leaked sessions in this dump.) If you examine the UserAndGroups member by clicking on its DML (click the link,) you can then use !sid to see the SID of the user this token represents:


Figure 22 – Using !sid to see the security identifier in the token

The token also has a DiagnosticInfo structure, which is super-interesting, and is the coolest thing that we unlocked when we set the SeTokenLeakDiag registry setting on the machine earlier. Let’s look at it:


Figure 23 – Examining the DiagnosticInfo structure of the first token

We now have the process ID and the thread ID that was responsible for creating this token! We could examine the ImageFileName, or we could use the ProcessCid to see who it is:


Figure 24 – Using !mex.tasklist to find a process by its PID

Oh… Whoops. Looks like this particular token leak is lsass’s fault. You’re just going to have to let the *ahem* application vendor take care of that one.

Let’s move on to a different token leak. We’re moving on to a different memory dump file as well, so the memory addresses are going to be different from here on out.

I created a special token-leaking application specifically for this article. It looks like this:


Figure 25 – RyansTokenGrabber.exe

It monitors the system for users logging on, and as soon as they do, it duplicates their token via the DuplicateToken API call. I purposely never release those tokens, so if I collect a memory dump of the machine while this is running, then evidence of the leak should be visible in the dump, using the same steps as before.

Using the same debugging techniques I just demonstrated, I verified that I have leaked logon sessions in this memory dump as well, and each leaked session has an access token reference that looks like this:


Figure 26 – A _TOKEN structure shown with its attached DiagnosticInfo

And then by looking at the token’s DiagnosticInfo, we find that the guilty party responsible for leaking this token is indeed RyansTokenGrabber.exe:


Figure 27 – The process responsible for leaking this token

By this point you know who to blame, and now you can go find the author of RyansTokenGrabber.exe, and show them the stone-cold evidence that you’ve collected about how their application is leaking access tokens, leading to logon session leaks, causing you to have to reboot your server every few days, which is a ridiculous and inconvenient thing to have to do, and you shouldn’t stand for it!

We're almost done, but I have one last trick to show you.

If you examine the StackTrace member of the token’s DiagnosticInfo, you’ll see something like this:


Figure 28 – DiagnosticInfo.CreateTrace

This is a stack trace. It's a snapshot of all the function calls that led up to this token's creation. These stack traces grow upwards, so the function at the top of the stack was called last. But the function addresses are not resolving; we must do a little more work to figure out the names of the functions.

First, clean up the output of the stack trace:


Figure 29 – Using !grep and !cut to clean up the output

Now, using all the snazzy new Mex magic you've learned, see if you can unassemble (that's the u command) each address to see whether it resolves to a function name:


Figure 30 – Unassemble instructions at each address in the stack trace

The output continues beyond what I’ve shown above, but you get the idea.

The function on top of the trace will almost always be SepDuplicateToken, but could also be SepCreateToken or SepFilterToken, and whether one creation method was used versus another could be a big hint as to where in the program’s code to start searching for the token leak. You will find that the usefulness of these stacks will vary wildly from one scenario to the next, as things like inlined functions, lack of symbols, unloaded modules, and managed code all influence the integrity of the stack. However, you (or the developer of the application you’re using) can use this information to figure out where the token is being created in this program, and fix the leak.

Alright, that’s it. If you’re still reading this, then… thank you for hanging in there. I know this wasn’t exactly a light read.

And lastly, allow me to reiterate that this is not just a contrived, unrealistic scenario; there's a lot of software out there on the market that does this kind of thing. And if you happen to write such software, then I really hope you read this blog post. It may help you improve the quality of your software in the future. Windows needs application developers to be "good citizens" and avoid writing software with the ability to destabilize the operating system. Hopefully this blog post helps someone out there do just that.

Until next time,
Ryan “Too Many Tokens” Ries
