Ask the Directory Services Team

Active Directory Experts: apply within


Hi all! Justin Turner here from the Directory Services team with a brief announcement: We are hiring!

Would you like to join the U.S. Directory Services team and work on the most technically challenging and interesting Active Directory problems? Do you want to be the next Ned Pyle or Linda Taylor?

Then read more…

We are an escalation team based out of Irving, Texas; Charlotte, North Carolina; and Fargo, North Dakota. We work with enterprise customers, helping them resolve the most critical Active Directory infrastructure problems as well as enabling them to get the best of Microsoft Windows and Identity-related technologies. The work we do is no ordinary support – we work with a huge variety of customer environments, and rarely are two problems the same.

You will need strong AD knowledge and strong troubleshooting skills, along with great collaboration, teamwork, and customer service skills.

If this sounds like you, please apply here:

Irving, Texas (Las Colinas):
https://careers.microsoft.com/jobdetails.aspx?ss=&pg=0&so=&rw=2&jid=290399&jlang=en&pp=ss

Charlotte, North Carolina:
https://careers.microsoft.com/jobdetails.aspx?ss=&pg=0&so=&rw=1&jid=290398&jlang=en&pp=ss

U.S. citizenship is required for these positions.


Justin


Introducing Lingering Object Liquidator v2


Greetings again AskDS!

Ryan Ries here. Got something exciting to talk about.

You might be familiar with the original Lingering Object Liquidator tool that was released a few years ago.

Today, we’re proud to announce version 2 of Lingering Object Liquidator!

Because Justin’s blog post from 2014 covers the fundamentals of what lingering objects are so well, I don’t think I need to go over it again here. If you need to know what lingering objects in Active Directory are, and why you want to get rid of them, then please go read that post first.

The new version of the Lingering Object Liquidator tool began its life as an attempt to address some of the long-standing limitations of the old version. For example, the old version would just stop the entire scan when it encountered a single domain controller that was unreachable. The new version will just skip the unreachable DC and continue scanning the other DCs that are reachable. There are multiple other improvements in the tool as well, such as multithreading and more exhaustive logging.

Before we take a look at the new tool, there are some things you should know:

1) Lingering Object Liquidator – neither the old version nor the new version – is covered by CSS Support Engineers. A small group of us (including yours truly) have provided this tool as a convenience to you, but it comes with no guarantees. If you find a problem with the tool, or have a feature request, drop a line to the public AskDS email address, or submit feedback to the Windows Server UserVoice forum, but please don’t bother Support Engineers with it on a support case.

2) Don’t immediately go into your production Active Directory forest and start wildly deleting things just because they show up as lingering objects in the tool. Please carefully review and consider any AD objects that are reported to be lingering objects before deleting.

3) The tool may report some false positives for deleted objects that are very close to the garbage collection age. To mitigate this issue, you can manually initiate garbage collection on your domain controllers before using this tool; see the sketch after this list. (We may make the tool do this automatically in the future.)

4) The tool will continue to evolve and improve based on your feedback! Contact the AskDS alias or the Uservoice forum linked to in #1 above with any questions, concerns, bug reports or feature requests.
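For reference, one way to trigger garbage collection manually (item 3 above) is the documented doGarbageCollection rootDSE operational attribute. Here is a minimal sketch – DC01 is a placeholder name, and the operation must be repeated per domain controller:

# Ask a DC to run garbage collection immediately via the doGarbageCollection rootDSE attribute
$rootDSE = [ADSI]'LDAP://DC01/RootDSE'   # DC01 is a placeholder DC name
$rootDSE.Put('doGarbageCollection', 1)
$rootDSE.SetInfo()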

Graphical User Interface Elements

Let’s begin by looking at the graphical user interface. Below is a legend that explains each UI element:

[Screenshot: the annotated Lingering Object Liquidator v2 user interface]

A) “Help/About” label. Click this and a page should open up in your default web browser with extra information and detail regarding Lingering Object Liquidator.

B) “Check for Updates” label. Click this and the tool will check for a newer version than the one you’re currently running.

C) “Detect AD Topology” button. This is the first button that should be clicked in most scenarios. The AD Topology must be generated first, before proceeding on to the later phases of lingering object detection and removal.

D) “Naming Context” drop-down menu. (Naming Contexts are sometimes referred to as partitions.) Note that this drop-down menu is only available after AD Topology has been successfully discovered. It contains each Active Directory naming context in the forest. If you know precisely which Active Directory naming context you want to scan for lingering objects, you can select it from this menu. (Note: The Schema partition is omitted because it does not support deletion, so in theory it cannot contain lingering objects.) If you do not know which naming contexts may contain lingering objects, you can select the “[Scan All NCs]” option and the tool will scan each naming context that it was able to discover during the AD Topology phase.

E) “Reference DC” drop-down menu. Note that this drop-down menu is only available after AD Topology has been successfully discovered. The reference DC is the “known-good” DC against which you will compare other domain controllers for lingering objects. If a domain controller contains AD objects that do not exist on the Reference DC, they will be considered lingering objects. If you select the “[Scan Entire Forest]” option, then the tool will (arbitrarily) select one global catalog from each domain in the forest that is known to be reachable. It is recommended that you wisely choose a known-good DC yourself, because the tool doesn’t necessarily know “the best” reference DC to pick. It will pick one at random.

F) “Target DC” drop-down menu. Note that this drop-down menu is only available after AD Topology has been successfully discovered. The Target DC is the domain controller that is suspected of containing lingering objects. The Target DC will be compared against the Reference DC, and each object that exists on the Target DC but not on the Reference DC is considered a lingering object. If you aren’t sure which DC(s) contain lingering objects, or just want to scan all domain controllers, select the “[Target All DCs]” option from the drop-down menu.

G) “Detect Lingering Objects” button. Note that this button is only available after AD Topology has been successfully discovered. After you have made the appropriate selections in the three aforementioned drop-down menus, click the Detect Lingering Objects button to run the scan. Clicking this button only runs a scan; it does not delete anything. The tool will automatically detect and avoid certain nonsensical situations, such as the user specifying the same Reference and Target DCs, or selecting a Read-Only Domain Controller (RODC) as a Reference DC.

H) “Select All” button. Note that this button does not become available until after lingering objects have been detected. Clicking it merely selects all rows from the table below.

I) “Remove Selected Lingering Objects” button. This button will attempt to delete the lingering objects you have selected from the list of detected objects. You can select a range of items from the list using the shift key and the arrow keys. You can select and unselect specific items by holding down the control key and clicking on them. If you want to just select all items, click the “Select All” button.

J) “Removal Method” radio buttons. These are mutually exclusive. You can choose which of the two supported methods you want to use to remove the lingering objects that have been detected. The “removeLingeringObject” method refers to the rootDSE modify operation, which can be used to “spot-remove” individual lingering objects. In contrast, the DsReplicaVerifyObjects method will remove all lingering objects at once. That intention is reflected in the GUI by all lingering objects automatically being selected when the DsReplicaVerifyObjects method is chosen. (A command-line sketch of this method appears after this legend.)

K) “Import” button. This imports a previously-exported list of lingering objects.

L) “Export” button. This exports the selected lingering objects to a file.

M) The “Target DC Column.” This column tells you which Target DC contained the lingering object.

N) The “Reference DC Column.” This column tells you which Reference DC was used to determine that the object in question was lingering.

O) The “Object DN Column.” This column contains the distinguished name of the lingering object.

P) The “Naming Context Column.” This column contains the Naming Context that the lingering object resides in.

Q) The “Lingering Object ListView”. This “ListView” works similarly to a spreadsheet. It will display all lingering objects that were detected. You can think of each row as a lingering object. You can click on the column headers to sort the rows in ascending or descending order, and you can resize the columns to fit your needs. NOTE: If you right-click on the lingering object listview, the selected lingering objects (if any) will be copied to your clipboard.

R) The “Status” box. The status box contains diagnostics and operational messages from the Lingering Object Liquidator tool. Everything that is logged to the status box in the GUI is also mirrored to a text log file.
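Incidentally, the DsReplicaVerifyObjects method described in item J is the same mechanism that repadmin exposes from the command line. A hedged sketch, with placeholders in angle brackets, in case you want to compare the tool’s behavior against repadmin:

# Advisory mode only reports what would be removed; omit /advisory_mode to actually remove
repadmin /removelingeringobjects <TargetDC> <ReferenceDC-NTDS-Settings-objectGUID> <NamingContext-DN> /advisory_mode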

User-configurable Settings

 

The user-configurable settings in Lingering Object Liquidator are alluded to in the Status box when the application first starts.

Key: HKLM\SOFTWARE\LingeringObjectLiquidator
Value: DetectionTimeoutPerDCSeconds
Type: DWORD
Default: 300 seconds

This setting affects the “Detect Lingering Objects” scan. Lingering Object Liquidator establishes event log “subscriptions” to each Target DC that it needs to scan. The tool then waits for the DC to log an event (Event ID 1942 in the Directory Service event log) signaling that lingering object detection has completed for a specific naming context. Only once a certain number of those events (depending on your choices in the “Naming Context” drop-down menu) have been received from the remote domain controller does the tool know that particular domain controller has been fully scanned. However, there is an overall timeout, and if the tool does not receive the requisite number of Event ID 1942s in the allotted time, it “gives up” and proceeds to the next domain controller.

Key: HKLM\SOFTWARE\LingeringObjectLiquidator
Value: ThreadCount
Type: DWORD
Default: 8 threads

This setting sets the maximum number of threads to use during the “Detect Lingering Objects” scan. Using more threads may decrease the overall time it takes to complete a scan, especially in very large environments.
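For example, here is a sketch that raises both settings (the values are illustrative, not recommendations; the key is created if it does not already exist):

$lol = 'HKLM:\SOFTWARE\LingeringObjectLiquidator'
New-Item -Path $lol -Force | Out-Null
Set-ItemProperty -Path $lol -Name 'DetectionTimeoutPerDCSeconds' -Value 600 -Type DWord   # allow 10 minutes per DC
Set-ItemProperty -Path $lol -Name 'ThreadCount' -Value 16 -Type DWord                     # use more scan threads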

Tips

The domain controllers must allow the network connectivity required for remote event log management for Lingering Object Liquidator to work. You can enable the required Windows Firewall rules using the following line of PowerShell:

Get-NetFirewallRule -DisplayName "Remote Event Log*" | Enable-NetFirewallRule

Check the download site often for new versions! (There’s also a handy “check for updates” option in the tool.)

 

Final Words

We provide this tool because we at AskDS want your Active Directory lingering object removal experience to go as smoothly as possible. If you find any bugs or have any feature requests, please drop a note to our public contact alias.

Download: http://aka.ms/msftlol

 

Release Notes

v2.0.19:
– Initial release to the public.

v2.0.21:
– Added new radio buttons that allow the user more control over which lingering object removal method they want to use – the DsReplicaVerifyObjects method or removeLingeringObject method.
– Fixed issue with Export button not displaying the full path of the export file.
– Fixed crash when unexpected or corrupted data is returned from event log subscription.

ESE Deep Dive: Part 1: The Anatomy of an ESE database


Hi!

Get your crash helmets on and strap into your seatbelts for a JET engine / ESE database special…

This is Linda Taylor, Senior AD Escalation Engineer from the UK, here again. And WAIT…… I also somehow managed to persuade Brett Shirley to join me in this post. Brett is a Principal Software Engineer in the ESE Development team, so you can be sure the information in this post is going to be deep and confusing but really interesting and useful – the kind you cannot find anywhere else :- )
BTW, Brett used to write blogs before he grew up and got very busy. And just for fun, you might find this old “Brett” classic entertaining. I have never forgotten it. :- )
Back to today’s post… this will be a rather more grown-up post, although we will talk about DITs – but in a very scientific fashion.

In this post, we will start from the ground up and dive deep into the overall file format of an ESE database file, including practical skills with esentutl, such as how to look at raw database pages. And as the title suggests, this is Part 1, so there will be more!

What is an ESE database?

Let’s start with the basics. The Extensible Storage Engine (ESE), also known as JET Blue, is a database engine from Microsoft that does not speak SQL. And Brett also says… for those with a historical bent, or from academia, who remember ‘before SQL’ rather than ‘NoSQL’: ESE is modelled after the ISAMs (indexed sequential access method) that were in vogue in the mid-70s. ;-p
If you work with Active Directory (which you must do if you are reading this post 🙂 then you will (I hope!) know that it uses an ESE database. The respective binary is esent.dll (or, since Brett loves Exchange: it’s ese.dll for the Exchange Server install). Applications like Active Directory are all ESE clients and use the JET APIs to access the ESE database.

[Diagram: AD and other applications calling the JET APIs into the ESE database engine]

This post will dive deep into the Blue parts above. The ESE side of things. AD is one huge client of ESE, but there are many other Windows components which use an ESE database (and non-Microsoft software too), so your knowledge in this area is actually very applicable for those other areas. Some examples are below:

[Diagram: examples of Windows components and other software that use ESE databases]

Tools

There are several built-in command line tools for looking into an ESE database and related files. 

  1. esentutl. This is a tool that ships in Windows Server by default, for use with Active Directory, Certificate Authority and any other built-in ESE databases. This is what we will be using in this post, and it can be used to look at any ESE database.

  2. eseutil. This is the Exchange version of the same tool, and it typically gets installed in the Microsoft\Exchange\V15\Bin sub-directory of the Program Files directory.

  3. ntdsutil. This is a tool specifically for managing AD or ADLDS databases; it cannot be used with generic ESE databases (such as the one produced by the Certificate Authority service). It is installed by default when you add the AD DS or ADLDS role.

For read operations such as dumping file or log headers, it doesn’t matter which tool you use. But for operations which write to the database, you MUST use the matching tool for the application and version (for instance, it is not safe to run esentutl /r from Windows Server 2016 on a Windows Server 2008 DB). Further, throughout this article, if you are looking at an Exchange database instead, you should use eseutil.exe rather than esentutl.exe. For AD and ADLDS, always use ntdsutil or esentutl. They have different capabilities, so I use a mixture of both. And Brett says: if you think you can NOT keep the read operations straight from the write operations, play it safe and match the versions and application.

During this post, we will use an AD database as our victim example. We may use other ones, like ADLDS for variety in later posts.

Database logical format – Tables

Let’s start with the logical format. From a logical perspective, an ESE database is a set of tables which have rows and columns and indices.

Below is a visual of the list of tables from an AD database in Windows Server 2016. Different ESE databases will have different table names and use those tables in their own ways.

[Screenshot: the list of tables in a Windows Server 2016 AD database]

In this post, we won’t go into detail about DNTs, PDNTs, and how to analyze an AD database dump taken with LDP, because that is AD-specific and here we are looking at things at the ESE level. Also, there are other blogs and sources where this has already been explained – for example, here on AskPFEPlat. However, if such a post is wanted, tell me and I will endeavor to write one!!

It is also worth noting that all ESE databases have a table called MSysObjects, and a table called MSysObjectsShadow, which is a backup of MSysObjects. These are also known as “the catalog” of the database, and they store metadata about the client’s schema of the database – i.e.:

  1. All the tables and their table names, where their associated B+ trees start in the database, and other miscellaneous metadata.

  2. All the columns for each table and their names (of course), the type of data stored in them, and various schema constraints.

  3. All the indexes on the tables and their names, and where their associated B+ trees start in the database.

This is the boot-strap information for ESE to be able to service client requests for opening tables to eventually retrieve rows of data.

Database physical format

From a physical perspective, an ESE database is just a file on disk. It is a collection of fixed size pages arranged into B+ tree structures. Every database has its page size stamped in the header (and it can vary between different clients, AD uses 8 KB). At a high level it looks like this:

[Diagram: high-level database file layout – header, shadow header, then fixed-size numbered pages]

The first “page” is the Header (H).

The second “page” is a Shadow Header (SH) which is a copy of the header.

However, in ESE “page number” (also frequently abbreviated “pgno”) has a very specific meaning (and often shows up in ESE events) and the first NUMBERED page of the actual database is page number / pgno 1 but is actually the third “page” (if you are counting from the beginning :-).

From here on out though, we will not consider the header and shadow header proper pages, and page number 1 will be the third page, at byte offset = <page size> * 2 = 8192 * 2 (for AD databases).
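Expressed as a tiny sketch (the function name is ours), the byte offset of any numbered page is:

# pgno 1 is the third physical page, because the header and shadow header come first
function Get-PageOffset {
    param([int]$Pgno, [int]$PageSize = 8192)
    ($Pgno + 1) * $PageSize
}
Get-PageOffset -Pgno 1   # 16384 for an 8 KB page size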

If you don’t know the page size, you can dump the database header with esentutl /mh.

Here is a dump of the header for an NTDS.DIT file – the AD database:

[Screenshot: esentutl /mh header dump of an NTDS.DIT file]

The page size is the cbDbPage value. AD and ADLDS use a page size of 8 KB; other databases use different page sizes.

A caveat is that to be able to do this, the database must not be in use. So, you’d have to stop the NTDS service on the DC, or run esentutl against an offline copy of the database.

But the good news is that in WS2016 and above we can now dump a LIVE DB header with the /vss switch! The command you need would be “esentutl /mh ntds.dit /vss” (note: must be run as administrator).
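Assuming the default database path (your NTDS.DIT location may differ), the two variants look like this:

# Offline dump: the NTDS service is stopped, or you are pointing at a copy of the database
esentutl /mh C:\Windows\NTDS\ntds.dit

# Windows Server 2016 and later: dump the header of a LIVE database from an elevated prompt
esentutl /mh C:\Windows\NTDS\ntds.dit /vss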

All these numbered database pages logically are “owned” by various B+ trees where the actual data for the client is contained … and all these B+ trees have a “type of tree” and all of a tree’s pages have a “placement in the tree” flag (Root, or Leaf or implicitly Internal – if not root or leaf).

Ok, Brett, that was “proper” tree and page talk –  I think we need some pictures to show them…

Logically the ownership / containing relationship looks like this:

[Diagram: the database file contains B+ trees, which in turn own pages]

More about B+ Trees

The pages are in turn arranged into B+ trees, where the top page is known as the ‘Root’ page and the bottom pages are ‘Leaf’ pages, where all the data is kept. Something like this (note this particular example does not show ‘Internal’ B+ tree pages):

[Diagram: a B+ tree with a root page on top and leaf pages below]

  • The upper / parent page has partial keys indicating that all entries with 4245 + A* can be found in pgno 13, and all entries with 4245 + E* can be found in pgno 14, etc.

  • Note this is a highly simplified representation of what ESE does … it’s a bit more complicated.

  • This is not specific to ESE; many database engines have either B trees or B+ trees as a fundamental arrangement of data in their database files.

The Different trees

You should know that there are different types of B+ trees inside the ESE database that are needed for different purposes. These are:

  1. Data / Primary Trees – hold the table’s primary records, which are used to store data for regular (and small) column data.

  2. Long Value (LV) Trees – used to store long values. In other words, large chunks of data which don’t fit into the primary record.

  3. Index Trees – B+ trees used to store indexes.

  4. Space Trees – used to track which pages are owned and which are free / available as new pages for a given B+ tree. Each of the previous three types of B+ tree (Data, LV, and Index) may (if the tree is large) have a set of two space trees associated with it.

Storing large records

Each row of a table is limited to 8 KB (or whatever the page size is) in Active Directory and AD LDS – i.e., each record has to fit into a single 8 KB database page. But you are probably aware that you can fit a LOT more than 8 KB into an AD object or an Exchange e-mail! So how do we store large records?

Well, we have different types of columns as illustrated below:

[Diagram: record layout showing fixed, variable, and tagged columns]

Tagged columns can be split out into what we call the Long Value (LV) tree. So in the tagged column we store a simple 4-byte number called a LID (Long Value ID), which points to an entry in the LV tree. We take the large piece of data, break it up into small chunks, and prefix those with the key for the LID and the offset.

So, if every part of the record were a LID / pointer to an LV, then essentially we can fit about 1300 LV pointers onto the 8 KB page. By the way, this is what creates the 1300-attribute limit in AD. It’s all down to the ESE page size.

Now you can also start to see that when you are looking at a whole AD object you may read pages from various trees to get all the information about your object. For example, for a user with many attributes and group memberships you may have to get data from a page in the ”datatable” \ Primary tree + “datatable” \ LV tree + sd_table \ Primary tree + link_table \ Primary tree.

Index Trees

An index is used for a couple of purposes. Firstly, to make a list of the records in an intelligent order, such as by surname in an alphabetical order. And then secondly to also cut down the number of records which sometimes greatly helps speed up searches (especially when the ‘selectivity is high’ – meaning few entries match).

Below is a visual illustration (with the B+ trees turned on their side to make the diagram easier) of a primary index, which is the DNT index in the AD database – the Data Tree – and a secondary index over dNSHostName. You can see that the secondary index only contains the records which have a dNSHostName populated. It is smaller.

[Diagram: the DNT primary index (data tree) alongside the smaller dNSHostName secondary index]

You can also see that in the secondary index, the primary key is the data portion (the name) and then the data is the actual Key that links us back to the REAL record itself.

Inside a Database page

Each database page has a fixed header. And the header has a checksum as well as other information like how much free space is on that page and which B-tree it belongs to.

Then we have these things called TAGS (or nodes), which store the data.

A node can be many things, such as a record in a database table or an entry in an index.

The TAGS are actually out of order on the page, but order is established by the tag array at the end of the page.

  • TAG 0 = Page External Header

This contains variable sized special information on the page, depending upon the type of B-tree and type of page in B tree (space vs. regular tree, and root vs. leaf).

  • TAG 1, 2, 3, etc. are all “nodes” or lines, and the order is tracked.

The key & data are specific to the B+ tree type.

And TAG 1 is actually node 0!!! So here is a visual picture of what an ESE database page looks like:

[Diagram: an ESE database page – page header, TAG 0 (external header), nodes, and the tag array at the end]

It is possible to calculate this key if you have an object’s primary key. In AD this is a DNT.

The formula for that (if you are ever crazy enough to need it) would be:

  • Start with 0x7F, and if it is a signed INT append a 0x80000000 and then OR in the number

  • For example 4248 –> in hex 1098 –> as key 7F80001098 (note 5 bytes).

  • Note: Key buffer uses big endian, not little endian (like x86/amd64 arch).

  • If it was a 64-bit int, just insert zeros in the middle (9 byte key).

  • If it is an unsigned INT, start with 0x7F and just append the number.

  • Note: Long Value (LID) trees and ESE’s Space Trees (pgno) are special, no 0x7F (4 byte keys).

  • And finally other non-integers column types, such as String and Binary types, have a different more complicated formatting for keys.

Why is this useful? Because, for example, you can take the DNT of an object, calculate its key, and then seek to its page using esentutl.exe’s dump-page functionality (/m) with the /k option.
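As a worked example, here is a small PowerShell sketch of the signed-integer case described above (the function name is ours):

# Build the 5-byte ESE key for a signed 32-bit integer primary key, such as an AD DNT
function Get-EseKeyFromDnt {
    param([int]$Dnt)
    $num   = [uint32]0x80000000 -bor [uint32]$Dnt   # set the prefix bit for signed INTs
    $bytes = [BitConverter]::GetBytes($num)
    [Array]::Reverse($bytes)                        # the key buffer is big endian
    '7F' + (($bytes | ForEach-Object { $_.ToString('X2') }) -join '')
}
Get-EseKeyFromDnt -Dnt 4248   # returns 7F80001098, matching the example above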

The Nodes also look different (containing different data) depending on the ESE B+tree type. Below is an illustration of the different nodes in a Space tree, a Data Tree, a LV tree and an Index tree.

[Diagram: node layout in space, data, LV, and index trees – keys in green, data in dark blue]

The green are the keys. The dark blue is data.

What does a REAL page look like?

You can use esentutl to dump pages of the database if you are investigating some corruption for example.

Before we can dump a page, we want to find a page of interest (picking a random page could give you just a blank page)… so first we need some info about the table schema. To start, you can dump all the tables and their associated root page numbers like this:

[Screenshot: esentutl /mm output filtered with findstr, showing tables with their objidFDP and pgnoFDP]

Note, we have used findstr to filter the output again, to get a nice view of just the tables and their pgnoFDP and objidFDP. Findstr.exe is case-sensitive, so use the exact format or use the /i switch.

objidFDP identifies this table in the catalog metadata. When looking at a database page we can use its objidFDP to tell which table this page belongs to.

pgnoFDP is the page number of the Father Data Page – the very top page of that B+ tree, also known as the root page. If you run esentutl /mm <dbname> on its own, you will see a huge list of every table and B+ tree (except internal “space” trees), including all the indexes.

So, in this example page 31 is the root page of the datatable here.
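The command behind that screenshot looks roughly like this (the path is the default one; point it at your offline copy):

# Dump table metadata, then filter it; findstr is case-sensitive unless you add /i
esentutl /mm C:\Windows\NTDS\ntds.dit | findstr /i "objidFDP pgnoFDP datatable"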

Dumping a page

You can dump a page with esentutl using /m and /p. Below is an example of dumping page 31 from the database – the root page of the “datatable” table as above.

[Screenshots: esentutl dump of page 31 – the root page of the datatable]

The objidFDP is the number indicating which B-tree the page belongs to. And the cbFree tells us how much of this page is free. (cb = count of bytes). Each database page has a double header checksum – one ECC (Error Correcting Code) checksum for single bit data correction, and a higher fidelity XOR checksum to catch all other errors, including 3 or more bit errors that the ECC may not catch.  In addition, we compute a logged data checksum from the page data, but this is not stored in the header, and only utilized by the Exchange 2016 Database Divergence Detection feature.

You can see this is a root page and it has 3 nodes (4 TAGS – remember, TAG 1 is node 0, also known as line 0! 🙂 and it is nearly empty! (cbFree = 8092 bytes, so only about 100 bytes are used for these 3 nodes + page header + external header).

The objidFDP tells us which B-Tree this page belongs to.

And notice the PageFlushType, which is related to the JET Flush Map file we could talk about in another post later.

The nodes here point to pages lower down in the tree. We could dump a next-level page (pgno 1438)… and we can see the pages getting deeper and more spread out, with more nodes.

[Screenshots: esentutl dump of page 1438, a ParentOfLeaf page with 294 nodes]

So you can see this page has 294 nodes! Which again all point to other pages. It is also a ParentOfLeaf meaning these pgno / page numbers actually point to leaf pages (with the final data on them).

Are you bored yet? 😄

Or are you enjoying this like a geek? Either way, we are nearly done with the page internals and the tree climbing here.

If you navigate further down, you will eventually get to a page with some data on it. For example, let’s dump page 69, which TAG 6 is pointing to:

[Screenshots: esentutl dump of page 69, a leaf page containing data]

So this one has some data on it (as indicated by the “Leaf page” indicator under the fFlags). 

Finally, you can also dump the data – the contents of a node (ie TAG) with the /n switch like this:

[Screenshot: esentutl /n node dump showing raw column data]

Remember: The /n specifier takes a pgno:line (or node) specifier… this means that the :3 here dumped TAG 4 from the previous screen. And note that trying to dump “/n69:4” would actually fail.

This /n switch will dump all the raw data on the page, along with information about the columns, their contents, and their types. The output also needs some translation, because it gives us the column ID (711 in the above example) and not the attribute name in AD (or whatever your database may be). The application developer would then be able to translate those column IDs to something meaningful. For AD and ADLDS, we can translate those to attribute names using the source code.

Finally, there really should be no need to do this in real life, other than in a situation where you are debugging a database problem. However, we hope this provided a good and ‘realistic’ demo to help understand and visualize the structure of an ESE database and how the data is stored inside it!

Stay tuned for more parts …. which Brett says will be significantly more useful to everyday administrators! 😉

The End!

Linda & Brett

TLS Handshake errors and connection timeouts? Maybe it’s the CTL engine….


Hi There! 

Marius and Tolu from the Directory Services Escalation Team. 

Today, we’re going to talk about a little twist on some scenarios you may have come across at some point, where TLS connections fail or time out for a variety of reasons.

You’re probably already familiar with some of the usual suspects like cipher suite mismatches, certificate validation errors and TLS version incompatibility, to name a few.   

Here are just some examples for illustration (but there is a wealth of information out there)  

Recently we’ve seen a number of cases with a variety of symptoms affecting different customers which all turned out to have a common root cause.  

We’ve managed to narrow it down to an unlikely source; a built-in OS feature working in its default configuration.  

We’re talking about the automatic root update and automatic disallowed roots update mechanisms based on CTLs.

Starting with Windows Vista, root certificates are updated on Windows automatically.

When a user on a Windows client visits a secure Web site (by using HTTPS/TLS), reads a secure email (S/MIME), or downloads an ActiveX control that is signed (code signing) and encounters a certificate which chains to a root certificate not present in the root store, Windows will automatically check the appropriate Microsoft Update location for the root certificate.  

If it finds it, it downloads it to the system. To the user, the experience is seamless; they don’t see any security dialog boxes or warnings and the download occurs automatically, behind the scenes.  

Additional information in:   

How Root Certificate Distribution Works 

During TLS handshakes, any certificate chains involved in the connection will need to be validated, and, from Windows Vista/2008 onwards, the automatic disallowed root update mechanism is also invoked to verify if there are any changes to the untrusted CTL (Certificate Trust List).

A certificate trust list (CTL) is a predefined list of items that are authenticated and signed by a trusted entity.

The mechanism is described in more detail in the following article:

An automatic updater of untrusted certificates is available for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2

It expands on the automatic root update mechanism technology (for trusted root certificates) mentioned earlier to let certificates that are compromised or are untrusted in some way be specifically flagged as untrusted.

Customers therefore benefit from periodic automatic updates to both trusted and untrusted CTLs.

So, after the preamble, what scenarios are we talking about today?

Here are some examples of issues we’ve come across recently.

 

1)

  • Your users may experience browser errors after several seconds when trying to browse to secure (https) websites behind a load balancer.
  • They might receive an error like “The page cannot be displayed. Turn on TLS 1.0, TLS 1.1, and TLS 1.2 in the Advanced settings and try connecting to https://contoso.com again. If this error persists, contact your site administrator.”
  • If they try to connect to the website via the IP address of the server hosting the site, the https connection works after showing a certificate name mismatch error.
  • All TLS versions ARE enabled when checking in the browser settings:

[Screenshot: Internet Options – Advanced settings showing TLS 1.0, 1.1, and 1.2 enabled]

 

2)

  • You have a 3rd party appliance making TLS connections to a Domain Controller via LDAPS (secure LDAP over SSL), which may experience delays of up to 15 seconds during the TLS handshake.
  • The issue occurs randomly when connecting to any eligible DC in the environment targeted for authentication.
  • There are no intervening devices that filter or modify traffic between the appliance and the DCs.

2a)

  • A very similar scenario* to the above is in fact described in the following article by our esteemed colleague, Herbert:

Understanding ATQ performance counters, yet another twist in the world of TLAs 

Where he details:

Scenario 2 

DC supports LDAP over SSL/TLS 

A user sends a certificate on a session. The server needs to check for certificate revocation, which may take some time.*

This becomes problematic if network communication is restricted and the DC cannot reach the Certificate Distribution Point (CDP) for a certificate.

To determine if your clients are using secure LDAP (LDAPs), check the counter “LDAP New SSL Connections/sec”. 

If there are a significant number of sessions, you might want to look at CAPI-Logging.

 

3)

  • A 3rd party meeting server performing LDAPs queries against a Domain Controller may fail the TLS handshake on the first attempt after surpassing a pre-configured timeout (e.g. 5 seconds) on the application side
  • Subsequent connection attempts are successful

 

So, what’s the story? Are these issues related in any way?

Well, as it turns out, they do have something in common.

As we mentioned earlier, certificate chain validation occurs during TLS handshakes.

Again, there is plenty of documentation on this subject.

During certificate validation operations, the CTL engine gets periodically invoked to verify whether there are any changes to the untrusted CTLs.

In the example scenarios we described earlier, if the default public URLs for the CTLs are unreachable, and there is no alternative internal CTL distribution point configured (more on this in a minute), the TLS handshake will be delayed until the WinHttp call to access the default CTL URL times out.

By default, this timeout is usually around 15 seconds, which can cause problems when load balancers or 3rd party applications are involved and have their own (more aggressive) timeouts configured.
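A quick way to check whether a machine can actually reach the default CTL download points (the authroot URL appears later in this post; the disallowedcert URL is its documented counterpart):

# Test reachability of the two default CTL URLs; failures here line up with the WinHttp timeout described above
'http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab',
'http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/disallowedcertstl.cab' |
    ForEach-Object {
        try   { "$_ : $((Invoke-WebRequest -Uri $_ -UseBasicParsing -TimeoutSec 15).StatusCode)" }
        catch { "$_ : FAILED" }
    }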

If we enable CAPI2 Diagnostic logging, we should be able to see evidence of when and why the timeouts are occurring.
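CAPI2 logging is off by default; it can be toggled with the built-in wevtutil (turn it back off when finished, as it is verbose):

# Enable the CAPI2 Operational event log, reproduce the issue, then disable it again
wevtutil sl Microsoft-Windows-CAPI2/Operational /e:true
wevtutil sl Microsoft-Windows-CAPI2/Operational /e:false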

We will see events like the following: 

Event ID 20 – Retrieve Third-Party Root Certificate from Network:

  

  • Trusted CTL attempt

[Screenshot: Event ID 20 – trusted CTL retrieval attempt]

 

  • Disallowed CTL attempt

 

[Screenshot: Event ID 20 – disallowed CTL retrieval attempt]

 

 

Event ID 53 error message details showing that we have failed to access the disallowed CTL:

 

[Screenshot: Event ID 53 – failed to retrieve the disallowed CTL]

 

 The following article gives a more detailed overview of the CAPI2 diagnostics feature available on Windows systems, which is very useful when looking at any certificate validation operations occurring on the system:  

Troubleshooting PKI Problems on Windows Vista 

To help us confirm that the CTL updater engine is indeed affecting the TLS delays and timeouts we’ve described, we can temporarily disable it for both the trusted and untrusted CTLs and then attempt our TLS connections again.

 

To disable it:

  • Create a backup of this registry key (export and save a copy)

HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\SystemCertificates\AuthRoot

  • Then create the following DWORD registry values under the key:

“EnableDisallowedCertAutoUpdate”=dword:00000000

“DisableRootAutoUpdate”=dword:00000001
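The same temporary change as a PowerShell sketch (export the key first, and remember this is only for confirming the root cause):

$authRoot = 'HKLM:\SOFTWARE\Policies\Microsoft\SystemCertificates\AuthRoot'
New-Item -Path $authRoot -Force | Out-Null
Set-ItemProperty -Path $authRoot -Name 'EnableDisallowedCertAutoUpdate' -Value 0 -Type DWord
Set-ItemProperty -Path $authRoot -Name 'DisableRootAutoUpdate' -Value 1 -Type DWord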

 

After applying these steps, you should find that your previously failing TLS connections no longer time out. Your symptoms may vary slightly, but you should see speedier connection times, because we have eliminated the delay of trying and failing to reach the CTL URLs.

So, what now? 

We should now REVERT the above registry changes by restoring the backup we created, and evaluate the following, more permanent solutions.

We previously stated that disabling the updater engine should only be a temporary measure to confirm the root cause of the timeouts in the above scenarios.  

 

  • For the untrusted CTL:

  • For the trusted CTL:
    • For server systems, you might consider deploying the trusted 3rd party CA certificates via GPO on an as-needed basis: Manage Trusted Root Certificates (particularly to avoid hitting the TLS protocol limitation described here: SSL/TLS communication problems after you install KB 931125)

 

  • For client systems, you should consider one of the following:

    • Allowing access to the public allowed Microsoft CTL URL http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab

    OR

    • Defining and maintaining an internal trusted CTL distribution point, as outlined in Configure Trusted Roots and Disallowed Certificates

    OR

    • If you require more granular control of which CAs are trusted by client machines, deploying the 3rd Party CA certificates as needed via GPO: Manage Trusted Root Certificates

So there you have it. We hope you found this interesting, and now have an additional factor to take into account when troubleshooting TLS/SSL communication failures.

Deep Dive: Active Directory ESE Version Store Changes in Server 2019


Hey everybody. Ryan Ries here to help you fellow AD ninjas celebrate the launch of Server 2019.

Warning: As is my wont, this is a deep dive post. Make sure you’ve had your coffee before proceeding.

Last month at Microsoft Ignite, many exciting new features rolling out in Server 2019 were talked about. (Watch MS Ignite sessions here and here.)

But now I want to talk about an enhancement to on-premises Active Directory in Server 2019 that you won’t read or hear anywhere else. This specific topic is near and dear to my heart personally.

The intent of the first section of this article is to discuss how Active Directory’s sizing of the ESE version store has changed in Server 2019 going forward. The second section of this article will discuss some basic debugging techniques related to the ESE version store.

Active Directory, also known as NT Directory Services (NTDS), uses Extensible Storage Engine (ESE) technology as its underlying database.

One component of all ESE database instances is known as the version store. The version store is an in-memory temporary storage location where ESE stores snapshots of the database during open transactions. This allows the database to roll back transactions and return to a previous state in case the transactions cannot be committed. When the version store is full, no more database transactions can be committed, which effectively brings NTDS to a halt.

In 2016, the CSS Directory Services support team blog (also known as AskDS) published some previously undocumented (and some lightly-documented) internals regarding the ESE version store. Those new to the concept of the ESE version store should read that blog post first.

In the blog post linked to previously, it was demonstrated how Active Directory had calculated the size of the ESE version store since AD’s introduction in Windows 2000. When the NTDS service first started, a complex algorithm was used to calculate the version store size. This algorithm included the machine’s native pointer size, the number of CPUs, the version store page size (based on an assumption that was incorrect on 64-bit operating systems), the maximum number of simultaneous RPC calls allowed, the maximum number of ESE sessions allowed per thread, and more.

Since the version store is a memory resource, it follows that the most important factor in determining the optimal ESE version store size is the amount of physical memory in the machine, and that – ironically – seems to have been the only variable not considered in the equation!

The way that Active Directory calculated the version store size did not age well. The original algorithm was written during a time when all machines running Windows were 32-bit, and even high-end server machines had maybe one or two gigabytes of RAM.

As a result, many customers have contacted Microsoft Support over the years for issues arising on their domain controllers that could be attributed to or at least exacerbated by an undersized ESE version store. Furthermore, even though the default ESE version store size can be augmented by the “EDB max ver pages (increment over the minimum)” registry setting, customers are often hesitant to use the setting because it is a complex topic that warrants heavier and more generous amounts of documentation than what has traditionally been available.

The algorithm is now greatly simplified in Server 2019:

When NTDS first starts, the ESE version store size is now calculated as 10% of physical RAM, with a minimum of 400MB and a maximum of 4GB.

The same calculation applies to physical machines and virtual machines. In the case of virtual machines with dynamic memory, the calculation will be based off of the amount of “starting RAM” assigned to the VM.

The “EDB max ver pages (increment over the minimum)” registry setting can still be used, as before, to add additional buckets over the default calculation (even beyond 4GB if desired). The registry setting is in terms of “buckets,” not bytes. Version store buckets are 32KB each on 64-bit systems. (They are 16KB on 32-bit systems, but Microsoft no longer supports any 32-bit server OSes.) Therefore, if one adds 5000 “buckets” by setting the registry entry to 5000 (decimal), then 156MB will be added to the default version store size.

A minimum of 400MB was chosen for backwards compatibility, because with the old algorithm the default version store size for a DC with a single 64-bit CPU was ~410MB, regardless of how much memory it had. (There is no way to configure less than the minimum of 400MB, similar to previous Windows versions.)

The advantage of the new algorithm is that the version store size now scales linearly with the amount of memory the domain controller has, when previously it did not.

Defaults:

Physical Memory in the Domain Controller | Default ESE Version Store Size
1GB   | 400MB
2GB   | 400MB
3GB   | 400MB
4GB   | 400MB
5GB   | 500MB
6GB   | 600MB
8GB   | 800MB
12GB  | 1.2GB
24GB  | 2.4GB
48GB  | 4GB
128GB | 4GB
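Expressed as a minimal PowerShell sketch (the function name is ours; the dynamic-memory caveat above still applies):

# Server 2019 default: 10% of physical RAM, clamped to the 400 MB..4 GB range
function Get-DefaultVersionStoreMB {
    param([double]$PhysicalRamGB)
    $tenPercentMB = $PhysicalRamGB * 1024 * 0.10
    [math]::Min([math]::Max($tenPercentMB, 400), 4096)
}
Get-DefaultVersionStoreMB -PhysicalRamGB 12   # 1228.8, i.e. the 1.2GB row in the table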

 

This new calculation will result in larger default ESE version store sizes for domain controllers with greater than 4GB of physical memory when compared to the old algorithm. This means more version store space to process database transactions, and fewer cases of version store exhaustion. (Which means fewer customers needing to call us!)

Note: This enhancement currently only exists in Server 2019 and there are not yet any plans to backport it to older Windows versions.

Note: This enhancement applies only to Active Directory and not to any other application that uses an ESE database such as Exchange, etc.

 

ESE Version Store Advanced Debugging and Troubleshooting

 

This section will cover some basic ESE version store triage, debugging and troubleshooting techniques.

As covered in the AskDS blog post linked to previously, the performance counter used to see how many ESE version store buckets are currently in use is:

\\.\Database ==> Instances(lsass/NTDSA)\Version buckets allocated
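A hedged Get-Counter translation of that path (the instance name can vary; enumerate the Database counter set if this one does not match):

# Sample the version store usage every 5 seconds, 12 times
Get-Counter -Counter '\Database(lsass/NTDSA)\Version buckets allocated' -SampleInterval 5 -MaxSamples 12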

 

Once that counter has reached its limit (~12,000 buckets, or ~400MB, by default), events will be logged to the Directory Services event log, indicating the exhaustion:

Figure 1: NTDS version store exhaustion.

 

The event can also be viewed graphically in Performance Monitor:

Figure 2: The plateau at 12,314 means that the performance counter “Version Buckets Allocated” cannot go any higher. The flat line represents a dead patient.

 

As long as the domain controller still has available RAM, try increasing the version store size using the previously mentioned registry setting. Increase it in gradual increments until the domain controller is no longer exhausting the ESE version store, or the server has no more free RAM, whichever comes first. Keep in mind that the more memory that is used for version store, the less memory will be available for other resources such as the database cache, so a sensible balance must be struck to maintain optimal performance for your workload. (i.e. no one size fits all.)
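As a sketch of such an increase (the value is an example; the NTDS Parameters key is the conventional home of this setting, and NTDS must be restarted because the size is calculated at service start):

$ntdsParams = 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters'
# 5000 extra buckets x 32 KB = ~156 MB added to the default version store size
Set-ItemProperty -Path $ntdsParams -Name 'EDB max ver pages (increment over the minimum)' -Value 5000 -Type DWord
Restart-Service NTDS -Force   # the calculation happens when NTDS starts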

If the “Version Buckets Allocated” performance counter is still pegged at the maximum amount, then there is some further investigation that can be done using the debugger.

The eventual goal will be to determine the nature of the activity within NTDS that is primarily responsible for exhausting the domain controller of all its version store, but first, some setup is required.

First, generate a process memory dump of lsass on the domain controller while the machine is “in state” – that is, while the domain controller is at or near version store exhaustion. To do this, the “Create dump file” option can be used in Task Manager by right-clicking on the lsass process on the Details tab. Optionally, another tool such as Sysinternals’ procdump.exe can be used (with the -ma switch).

In case the issue is transient and only occurs when no one is watching, data collection can be configured on a trigger, using procdump with the -p switch.
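For example (the counter-trigger form is sketched from procdump’s -p usage; verify against your procdump version’s help, and the 10000-bucket threshold is an arbitrary example):

# On demand, while the DC is in state:
procdump.exe -ma lsass.exe C:\dumps\lsass.dmp
# On a trigger: dump once Version buckets allocated exceeds 10000
procdump.exe -ma -p "\Database(lsass/NTDSA)\Version buckets allocated" 10000 lsass.exe C:\dumps\lsass_trigger.dmp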

Note: Do not share lsass memory dump files with unauthorized persons, as these memory dumps can contain passwords and other sensitive data.

 

It is a good idea to generate the dump after the Version Buckets Allocated performance counter has risen to an abnormally elevated level, but before the version store has plateaued completely. The reason is that the database transaction responsible may be terminated once exhaustion occurs, and the guilty thread would then no longer be present in the memory dump. If the guilty thread is no longer alive when the memory dump is taken, troubleshooting will be much more difficult.

Next, gather a copy of %windir%\System32\esent.dll from the same Server 2019 domain controller. The esent.dll file contains a debugger extension, but it is highly dependent upon the correct Windows version, or else it could output incorrect results. It should match the same version of Windows as the memory dump file.

Next, download WinDbg from the Microsoft Store, or from this link.

Once WinDbg is installed, configure the symbol path for Microsoft’s public symbol server:

Figure 3: srv*c:\symbols*http://msdl.microsoft.com/download/symbol

 

Now load the lsass.dmp memory dump file, and load the esent.dll module that you had previously collected from the same domain controller:

Figure 4: .load esent.dll

 

Now the ESE database instances present in this memory dump can be viewed with the command !ese dumpinsts:

Figure 5: !ese dumpinsts – The only ESE instance present in an lsass dump on a DC should be NTDSA.

 

Notice that the current version bucket usage is 11,189 out of 12,802 buckets total. The version store in this memory dump is very nearly exhausted. The database is not in a particularly healthy state at this moment.

The command !ese param <instance> can also be used, specifying the same database instance gotten from the previous command, to see global configuration parameters for that ESE database instance. Notice that JET_paramMaxVerPages is set to 12800 buckets, which is 400MB worth of 32KB buckets:

Figure 6: !ese param <instance>

 

To see much more detail regarding the ESE version store, use the !ese verstore <instance> command, specifying the same database instance:

 

Figure 7: !ese verstore <instance>

 

The output of the command above shows us that there is an open, long-running database transaction, how long it’s been running, and which thread started it. This also matches the same information displayed in the Directory Services event log event pictured previously.

Neither the event log event nor the esent debugger extension were always quite so helpful; they have both been enhanced in recent versions of Windows.

In older versions of the esent debugger extension, the thread ID could be found in the dwTrxContext field of the PIB, (command: !ese dump PIB 0x000001AD71621320) and the start time of the transaction could be found in m_trxidstack as a 64-bit file time. But now the debugger extension extracts that data automatically for convenience.

Switch to the thread that was identified earlier and look at its call stack:

Figure 8: The guilty-looking thread responsible for the long-running database transaction.

 

The four functions that are highlighted by a red rectangle in the picture above are interesting, and here’s why:

When an object is deleted on a domain controller, and that object has links to other objects, those links must also be deleted/cleaned by the domain controller. For example, when an Active Directory user becomes a member of a security group, a database link between the user and the group is created that represents that relationship. The same principle applies to all linked attributes in Active Directory. If the Active Directory Recycle Bin is enabled, then the link-cleaning process will be deferred until the deleted object surpasses its Deleted Object Lifetime – typically 60 or 180 days after being deleted. This is why, when the AD Recycle Bin is enabled, a deleted user can be easily restored with all of its group memberships still intact – because the user account object’s links are not cleaned until after its time in the Recycle Bin has expired.

The trouble begins when an object with many backlinks is deleted. Some security groups, distribution lists, RODC password replication policies, etc., may contain hundreds of thousands or even millions of members. Deleting such an object will give the domain controller a lot of work to do. As you can see in the thread call stack shown above, the domain controller had been busily processing links on a deleted object for 47 seconds and still wasn’t done. All the while, more and more ESE version store space was being consumed.

When the AD Recycle Bin is enabled, this can cause even more confusion, because no one remembers that they deleted that gigantic security group 6 months ago. A time bomb has been sitting in the AD Recycle Bin for months. But suddenly, AD replication grinds to a standstill throughout the domain and the admins are scrambling to figure out why.

The performance counter “\\.\DirectoryServices ==> Instances(NTDS)\Link Values Cleaned/sec” would also show increased activity during this time.

There are two main ways to fight this: increasing the version store size with the “EDB max ver pages (increment over the minimum)” registry setting, decreasing the batch size with the “Links process batch size” registry setting, or a combination of both. Domain controllers process the deletion of these links in batches; the smaller the batch size, the shorter the individual database transactions will be, thus relieving pressure on the ESE version store.

Though the default values are properly-sized for almost all Active Directory deployments and most administrators should never have to worry about them, the two previously-mentioned registry settings are supported and well-informed enterprise administrators are encouraged to tweak the values – within reason – to avoid ESE version store depletion. Contact Microsoft customer support before making any modifications if there is any uncertainty.

At this point, one could continue diving deeper, using various approaches (e.g. consider not only debugging process memory, but also consulting DS Object Access audit logs, object metadata from repadmin.exe, etc.) to find out which exact object with many thousands of links was just deleted, but in the end that’s a moot point. There’s nothing else that can be done with that information. The domain controller simply must complete the work of link processing.

In other situations, however, it will be apparent, using the same techniques shown previously, that an incoming LDAP query from some network client performing inefficient searches is leading to version store exhaustion. Other times it will be DirSync clients. Other times it may be something else. In those instances, there may be more you can do besides just tweaking the version store variables, such as tracking down and silencing the offending network client(s), optimizing LDAP queries, creating database indices, etc.

 

Thanks for reading,
– Ryan “Where’s My Ship-It Award” Ries

AskDS Is Moving!

Hello readers. The AskDS blog, as with all the other TechNet and MSDN blogs, will be moving to a new home on TechCommunity. The migration is currently in progress. This post will be updated with the new URL when the migration is complete. Thanks!

