Today I’m happy to announce a new 5-day live training to take place in December, “C++ Programming Masterclass”. This is remote training presented live, where the sessions are recorded for later viewing if you miss any.
The course is scheduled for the beginning of December with 10 half-days spanning two and a half weeks.
The full syllabus, who should attend, with the exact dates and times can be found here: C++ Programming Masterclass.
Here is a quick summary of the course modules:
Module 1: Introduction to Modern C++
Module 2: C++ Fundamentals
Module 3: Standard Library Basics
Module 4: Object Oriented Design and Programming
Module 5: Modern Language Features and Libraries (Part 1)
Module 6: Templates
Module 7: Modern Language Features and Libraries (Part 2)
Module 8: The C++ Standard Library
Module 9: Concurrency in Modern C++
The course is currently heavily discounted, with the a price increase after November 1st. Register here: C++ Programming Masterclass
If you have any questions, feel free to get in touch via email (zodiacon@live.com), Linkedin, or X (@zodiacon)
In part 3 we implemented the bulk of what makes a DataStack – push, pop and clear operations. We noted a few remaining deficiencies that need to be taken care of. Let’s begin.
Object Destruction
A DataStack object is deallocated when the last reference to it removed (typically all handles are closed). Any other cleanup must be done explicitly. The DeleteProcedure member of the OBJECT_TYPE_INITIALIZER is an optional callback we can set to be called just before the structure is freed:
init.DeleteProcedure = OnDataStackDelete;
The callback is simple – it’s called with the object about to be destroyed. We can use the cleanup support from part 3 to free the dynamic state of the stacked items:
The native API provides many functions starting with NtQueryInformation* with an object type like process, thread, file, etc. We’ll add a similar function for querying information about DataStack objects. A few declarations are in order, mimicking similar declarations used by other query APIs:
The implementation (in kernel mode) is not complicated, just verbose. As with other APIs, we’ll start by getting the object itself from the handle, asking for DATA_STACK_QUERY access mask:
// if no buffer provided then ReturnLength must be
// non-NULL and buffer size must be zero
//
if (!ARGUMENT_PRESENT(Buffer) && (!ARGUMENT_PRESENT(ReturnLength) || BufferSize != 0))
return STATUS_INVALID_PARAMETER;
//
// if buffer provided, then size must be non-zero
//
if (ARGUMENT_PRESENT(Buffer) && BufferSize == 0)
return STATUS_INVALID_PARAMETER;
The rest is pretty standard. Let’s look at one information class:
ULONG len = 0;
switch (InformationClass) {
case DataStackItemCount:
len = sizeof(ULONG); break;
case DataStackTotalSize:
len = sizeof(ULONG_PTR); break;
case DataStackConfiguration:
len = sizeof(DATA_STACK_CONFIGURATION); break;
default:
return STATUS_INVALID_INFO_CLASS;
}
if (BufferSize < len) {
status = STATUS_BUFFER_TOO_SMALL;
}
else {
if (ExGetPreviousMode() != KernelMode) {
__try {
if (ARGUMENT_PRESENT(Buffer))
ProbeForWrite(Buffer, BufferSize, 1);
if (ARGUMENT_PRESENT(ReturnLength))
ProbeForWrite(ReturnLength, sizeof(ULONG), 1);
}
__except (EXCEPTION_EXECUTE_HANDLER) {
return GetExceptionCode();
}
}
switch (InformationClass) {
case DataStackItemCount:
{
ExAcquireFastMutex(&ds->Lock);
auto count = ds->Count;
ExReleaseFastMutex(&ds->Lock);
if (ExGetPreviousMode() != KernelMode) {
__try {
*(ULONG*)Buffer = count;
}
__except (EXCEPTION_EXECUTE_HANDLER) {
return GetExceptionCode();
}
}
else {
*(ULONG*)Buffer = count;
}
break;
}
//...
//
// set returned bytes if requested
//
if (ARGUMENT_PRESENT(ReturnLength)) {
if (ExGetPreviousMode() != KernelMode) {
__try {
*ReturnLength = len;
}
__except (EXCEPTION_EXECUTE_HANDLER) {
return GetExceptionCode();
}
}
else {
*ReturnLength = len;
}
}
ObDereferenceObjectWithTag(ds, DataStackTag);
return status;
You can find the other information classes implemented in the source code in a similar fashion.
To round it up, we’ll add Win32-like APIs that call the native APIs. The Native APIs call the driver in a similar way as the other native API user-mode implementations.
BOOL WINAPI GetDataStackSize(HANDLE hDataStack, ULONG_PTR* pSize) {
auto status = NtQueryInformationDataStack(hDataStack,
DataStackTotalSize, pSize, sizeof(ULONG_PTR), nullptr);
if (!NT_SUCCESS(status))
SetLastError(RtlNtStatusToDosError(status));
return NT_SUCCESS(status);
}
BOOL WINAPI GetDataStackItemCount(HANDLE hDataStack, ULONG* pCount) {
auto status = NtQueryInformationDataStack(hDataStack,
DataStackItemCount, pCount, sizeof(ULONG), nullptr);
if (!NT_SUCCESS(status))
SetLastError(RtlNtStatusToDosError(status));
return NT_SUCCESS(status);
}
BOOL WINAPI GetDataStackConfig(HANDLE hDataStack, DATA_STACK_CONFIG* pConfig) {
auto status = NtQueryInformationDataStack(hDataStack,
DataStackConfiguration, pConfig,
sizeof(DATA_STACK_CONFIG), nullptr);
if (!NT_SUCCESS(status))
SetLastError(RtlNtStatusToDosError(status));
return NT_SUCCESS(status);
}
Waitable Objects
Waitable objects, also called Dispatcher objects, maintain a state called Signaled or Non-Signaled, where the meaning of “signaled” depends on the object type. For example, process objects are signaled when terminated. Same for thread objects. Job objects are signaled when all processes in the job terminate. And so on.
Waitable objects can be waited on with WaitForSingleObject/ WaitForMultipleObjectsand friends in the Windows API, which call native APIs like NtWaitForSingleObject / NtWaitForMultipleObjects, which eventually get to the kernel and call ObWaitForSingleObject / ObWaitForMultipleObjects which finally invoke KeWaitForSingleObject/ KeWaitForMultipleObjects(both documented in the WDK).
It would be nice if DataStack objects would be dispatcher objects, where “signaled” would mean the data stack is not empty, and vice-versa. The first thing to do is make sure that the SYNCHRONIZE access mask is valid for the object type. This is the default, so nothing special to do here. GENERIC_READ also adds SYNCHRONIZE for convenience.
In order to be a dispatcher object, the structure managing the object must start with a DISPATCHER_HEADER structure (which is provided by the WDK headers). For example, KPROCESS and KTHREAD start with DISPATCHER_HEADER. Same for all other dispatcher objects – well, almost. If we look at an EJOB (using symbols), we’ll see the following:
The advantage of using a KEVENT is that the event API is available – this is taken advantage of by the Job implementation. For processes and threads, the work of signaling is done internally by the Process and Thread APIs.
For the DataStack implementation, we’ll take the Job approach, as the scheduler APIs are internal and undocumented. The DataStack now looks like this:
The event is initialized as a Notification Event (Manual Reset in user mode terminology). Why? This is just a choice. We could extend the DataStack creation API to allow choosing Notification (manual reset) vs. Synchronization (auto reset) – I’ll leave that for interested coder.
Next, we need to set or reset the event when appropriate. It starts in the non-signaled state (the FALSE in KeInitializeEvent), since the data stack starts empty. In the implementation of DsPushDataStack we signal the event if the count is incremented from zero to 1:
These operations are performed under the protection of the fast mutex, of course.
Testing
Here is one way to amend the test application to use WaitForSingleObject:
// wait 5 seconds at most for data to appear
while (WaitForSingleObject(h, 5000) == WAIT_OBJECT_0) {
DWORD size = sizeof(buffer);
if (!PopDataStack(h, buffer, &size) && GetLastError() != ERROR_NO_DATA) {
printf("Error in PopDataStack (%u)\n", GetLastError());
break;
}
//...
DWORD count;
DWORD_PTR total;
if (GetDataStackItemCount(h, &count) && GetDataStackSize(h, &total))
printf("Data stack Item count: %u Size: %zu\n", count, total);
}
Refer to the project source code for the full sample.
Summary
This four-part series demonstrated creating a new kernel object type and using only exported functions to implement it. I hope this sheds more light on certain mechanisms used by the Windows kernel.
In the first part we looked at creating a new kernel object type. In the second part we implemented creation of new DataStack objects and opening existing objects by name. In this part, we’ll implement the main functionality of a DataStack, that makes a DataStack what it is.
Before we get on with that, there is one correction I must make. A good comment from Luke (@lukethezoid) on X said that although the code returns a 32-bit HANDLE to 32-bit callers, it’s not nearly enough because all pointers passed using NtDeviceIoControlFile/DeviceIoControl would be wrong – 32-bit pointers would be passed to the 64-bit kernel and that would cause a crash unless we do some work. For example, a UNICODE_STRING provided by a 32-bit client has a different binary layout than a 64-bit one. Wow64 processes (32-bit x86 running on x64) have two versions of NtDll.Dll in their address space. One is the “native” 64-bit, and the other is a special 32-bit variant. I say “special” because it’s not the same NtDll.Dll you would find on a true 32-bit system. This is because this special DLL knows it’s on a 64-bit system, and provides conversions for parameters (pointer and structures) before invoking the “real” 64-bit DLL API. Here is a snapshot from Process Explorer showing a 32-bit process with two NtDll.Dll files – the native 64-bit loaded into a high address, while the other loaded into a low address (within the first 4GB of address space):
The changes required to support Wow64 processes are not difficult to make, but not too interesting, either, so I won’t be implementing them. Instead, 32-bit clients will be blocked from using DataStacks. We should block on the kernel side for sure, and also in user mode to fail faster. In the kernel, we could use something like this in IRP_MJ_CREATE or IRP_MJ_DEVICE_CONTROL handles:
if (IoIs32bitProcess(Irp)) {
status = STATUS_NOT_IMPLEMENTED;
// complete the request...
}
In user mode, we can prevent DataStack.Dll from loading in the first place in Wow64 processes:
BOOL APIENTRY DllMain(HMODULE hModule, DWORD reason, LPVOID) {
switch (reason) {
case DLL_PROCESS_ATTACH:
// C++ 17
if (BOOL wow; IsWow64Process(GetCurrentProcess(), &wow) && wow)
return FALSE;
//...
Note, however, that on true 32-bit systems everything should work just fine, as user mode and kernel mode are bitness-aligned.
Implementing the Actual Data Stack
Now we’re ready to focus on implementing the stack functionality – push, pop, and clear.
We’ll start from user mode, and gradually move to kernel mode. First, we want nice APIs for clients to use that have “Win32” conventions like so:
The APIs return BOOL to indicate success/failure, and GetLastError could be used to get additional information in case of an error. Let’s start with push:
BOOL WINAPI PushDataStack(HANDLE hDataStack, const PVOID buffer, DWORD size) {
auto status = NtPushDataStack(hDataStack, buffer, size);
if (!NT_SUCCESS(status))
SetLastError(RtlNtStatusToDosError(status));
return NT_SUCCESS(status);
}
Nothing to it. Call NtPushDataStack and update the last error is needed. NtPushDataStack just packs the arguments in a helper structure and sends to the driver, just like we did with CreateDataStack and OpenDataStack:
Now we switch to the kernel side. The DeviceIoControl handler just forwards the arguments to the “real” NtPushDataStack:
case IOCTL_DATASTACK_PUSH:
{
auto data = (DataStackPush*)Irp->AssociatedIrp.SystemBuffer;
if (dic.InputBufferLength < sizeof(*data)) {
status = STATUS_BUFFER_TOO_SMALL;
break;
}
status = NtPushDataStack(data->DataStackHandle, data->Buffer, data->Size);
break;
}
Now we’re getting to the interesting parts. First, we need to check if the arguments make sense:
NTSTATUS NTAPI NtPushDataStack(HANDLE DataStackHandle,
const PVOID Item, ULONG ItemSize) {
if (ItemSize == 0)
return STATUS_INVALID_PARAMETER_3;
if (!ARGUMENT_PRESENT(Item))
return STATUS_INVALID_PARAMETER_2;
If the pushed data size is zero or the pointer to the data is NULL, then it’s an error. The ARGUMENT_PRESENT macro returns true if the given pointer is not NULL. Notice that we can return specific “invalid parameter” error based on the parameter index. Unfortunately, user mode callers always get a generic ERROR_INVALID_PARAMETER regardless. But at least kernel callers can benefit from the extra detail.
Next, we need to get to the DataStack object corresponding to the given handle; in fact, the handle could be bad, or not point to a DataStack object at all. This is a job for ObReferenceObjectByHandleor one of its variants:
DataStack* ds;
auto status = ObReferenceObjectByHandleWithTag(DataStackHandle,
DATA_STACK_PUSH, g_DataStackType, ExGetPreviousMode(),
DataStackTag, (PVOID*)&ds, nullptr);
if (!NT_SUCCESS(status))
return status;
ObReferenceObjectByHandleWithTagattempts to retrieve the object pointer given a handle, a type object, and access. The added tag provides a simple way to “track” the caller taking the reference. I’ve defined DataStackTag to be used for dynamic memory allocations as well (we’ll do very soon):
const ULONG DataStackTag = 'ktsD';
It’s the same kind of tag typically provided to allocation functions. You can read the tag from right to left (that’s how it would be displayed in tools as we’ll see later) – “Dstk”, kind of short for “Data Stack”. The function also checks if the handle has the required access mask (DATA_STACK_PUSH in this case), and will fail if the handle is not powerful enough. ExGetPreviousMode is provided to the function to indicate who is the original caller – for kernel callers, any access will be granted (the access mask is not really used).
With the object in hand, we can do the work, delegated to a separate function, and then not to forget to dereference the object or it will leak:
status = DsPushDataStack(ds, Item, ItemSize);
ObDereferenceObjectWithTag(ds, DataStackTag);
return status;
Technically, we don’t have to use a separate function, but it mimics how the kernel works for complex objects: there is another layer of implementation that works with the object directly (no more handles involved).
The DataStack structure mentioned in the previous parts holds the data for the implementation, and will be treated as a classic C structure – no member functions just to mimic how the Windows kernel is implemented – almost everything in C, not C++:
The “stack” will be managed by a linked list (the classic LIST_ENTRY and friends) implementation provided by the kernel. We’ll push by adding to the tail and pop by removing from the tail. (Clearly, it would be just as easy to implement a queue, rather than, or in addition to, a stack.) We need a lock to prevent corruption of the list, and a fast mutex will do the job. Count stores the number of items currently on the data stack, and Size has the total size in bytes. MaxSize and MaxItemCount are initialized when the DataStack is created and provide some control over the limits of the data stack. Values of zero indicate no special limit.
The second structure, DataBlock is the one that holds the actual data, along with its size, and of course a link to the list. Data[1] is just a placeholder, where the data is going to be copied to, assuming we allocate the correct size.
We’ll start the push implementation in an optimistic manner, and allocate the DataBlock structure with the required size based on the data provided:
We use the (relatively) new ExAllocatePool2 API to allocate the memory block (ExAllocatePoolWithTag is deprecated from Windows version 2004, but you can use it with an old-enough WDK or if you turn off the deprecation warning). We allocate the buffer uninitialized, as we’ll copy the data to it very soon, so no need for zeroed buffer. Technically, we allocate one extra byte beyond what we need, but that’s not a big deal. Now we can copy the data from the client’s provided buffer, being careful to probe user-mode buffers under exception protection:
If an exception occurs because of a bad user-mode buffer, we can return the exception code and that’s it. Note that ProbeForRead does not work for kernel addresses, and cannot prevent crashes – this is intentional. Only user-mode buffers are out of control from the kernel’s side.
Now we can add the item to the stack if the limits are not violated, while updating the DataStack’s stats:
ExAcquireFastMutex(&ds->Lock);
do {
if (ds->MaxItemCount == ds->Count) {
status = STATUS_NO_MORE_ENTRIES;
break;
}
if (ds->MaxItemSize && ItemSize > ds->MaxItemSize) {
status = STATUS_NOT_CAPABLE;
break;
}
if (ds->MaxSize && ds->Size + ItemSize > ds->MaxSize) {
status = STATUS_NOT_CAPABLE;
break;
}
} while (false);
if (NT_SUCCESS(status)) {
InsertTailList(&ds->Head, &buffer->Link);
ds->Count++;
ds->Size += ItemSize;
}
ExReleaseFastMutex(&ds->Lock);
if (!NT_SUCCESS(status))
ExFreePool(buffer);
return status;
First, we acquire the fast mutex to prevent data races. I opted not to use any C++ RAII type here to make things as clear as possible – we have to be careful not to return before releasing the fast mutex. Next, the do/while non-loop is used to check if any setting is violated, in which case the status is set to some failure. The status values I chose may not look perfect – the right thing to do is create new NTSTATUS values that would be specific for DataStacks, but I was too lazy. The interested reader/coder is welcome to do it right.
Inserting the item involves calling InsertTailList, and then just updating the item count and total byte size. If anything fails, we are careful to free the buffer to prevent a memory leak. This is for push.
Popping Items
The pop operation works along similar lines. In this case, the client asks to pop an item but needs to provide a large-enough buffer to store the data. We’ll use an additional size pointer argument, that on input indicates the buffer’s size, and on output indicates the actual item size. First, the “Win32” API:
BOOL WINAPI PopDataStack(HANDLE hDataStack, PVOID buffer, DWORD* size) {
auto status = NtPopDataStack(hDataStack, buffer, size);
if (!NT_SUCCESS(status))
SetLastError(RtlNtStatusToDosError(status));
return NT_SUCCESS(status);
}
Just delegating the work to the native API, which forwards to the kernel with a helper structure:
This should be expected by now. On the kernel side, things are more interesting. First, get the object based on the handle, then send it to the lower-layer function if successful:
The input buffer size is extracted, being careful to probe the user mode pointer. The real work is done in DsPopDataStack. First, take the lock. Second, see if the data stack is empty – if so, no pop operation possible. If the input size is zero, return the size of the top element:
The locking here works differently than the push implementation by using a __finally block, which is the one releasing the fast mutex. This ensures that no matter how we leave the __try block, the lock will be released for sure.
The CONTAINING_RECORD macro is used correctly to get to the item from the link (LIST_ENTRY). Technically, in this case we could just make a simple cast, as the LIST_ENTRY member is the first in a DataBlock. Notice how we get to the top item: Head.Blink, which points to the tail (last) item.
If the data stack is empty, we place zero in the item size pointer and return an error (abusing yet another existing error):
If manage to get beyond this point, then there is an item, and we need to remove it, copy the data to the client’s buffer (if it’s big enough), and free the kernel’s copy of the buffer:
The call to RemoveTailList removes the top item from the list. The next assert verifies the list wasn’t empty before the removal (it can’t be as we dealt with that case in the previous code section). Remember, that if a list is empty calling RemoveTailList or RemoveHeadList returns the head’s pointer.
If the client’s buffer is too small, we reinsert the item back and bail. Otherwise, we copy the data, update the data stack’s stats and free our copy of the item.
Cleanup
The stack clear operation is relatively straightforward of them all. Here is the kernel part that matters:
We take the lock, and then go over the list, removing and freeing each item. Finally, we update the stats to zero items and zero bytes.
Testing
Here is one way to test – having an executable run twice, the first instance pushes some items, and the second one popping items. main creates a data stack with a name. If it’s a new object, it assumes the role of “pusher”. Otherwise, it assumes the role of “popper”:
int main() {
HANDLE hDataStack = CreateDataStack(nullptr, 0, 100,
10 << 20, L"MyDataStack");
if (!hDataStack) {
printf("Failed to create data stack (%u)\n", GetLastError());
return 1;
}
printf("Handle created: 0x%p\n", hDataStack);
if (GetLastError() == ERROR_ALREADY_EXISTS) {
printf("Opened an existing object... will pop elements\n");
PopItems(hDataStack);
}
else {
Sleep(5000);
PushString(hDataStack, "Hello, data stack!");
PushString(hDataStack, "Pushing another string...");
for (int i = 1; i <= 10; i++) {
Sleep(100);
PushDataStack(hDataStack, &i, sizeof(i));
}
}
CloseHandle(hDataStack);
return 0;
}
When creating a named object, if GetLastError returns ERROR_ALREADY_EXISTS, it means a handle is returned to an existing object. In our current implementation, this actually won’t work. We have to fix the CreateDataStack implementation like so:
HANDLE hDataStack;
auto status = NtCreateDataStack(&hDataStack, &attr, maxItemSize, maxItemCount, maxSize);
if (NT_SUCCESS(status)) {
const NTSTATUS STATUS_OBJECT_NAME_EXISTS = 0x40000000;
if (status == STATUS_OBJECT_NAME_EXISTS) {
SetLastError(ERROR_ALREADY_EXISTS);
}
else {
SetLastError(0);
}
return hDataStack;
}
After calling NtCreateDataStack we fix the returned “error” if the kernel returns STATUS_OBJECT_NAME_EXISTS. Now the previous will work correctly.
PushString is a little helper to push strings:
bool PushString(HANDLE h, std::string const& text) {
auto ok = PushDataStack(h, (PVOID)text.c_str(), (ULONG)text.length() + 1);
if (!ok)
printf("Error in PushString: %u\n", GetLastError());
return ok;
}
Finally, PopItems does some popping:
void PopItems(HANDLE h) {
BYTE buffer[256];
auto tick = GetTickCount64();
while (GetTickCount64() - tick < 10000) {
DWORD size = sizeof(buffer);
if (!PopDataStack(h, buffer, &size) && GetLastError() != ERROR_NO_DATA) {
printf("Error in PopDataStack (%u)\n", GetLastError());
break;
}
if (size) {
printf("Popped %u bytes: ", size);
if (size > sizeof(int))
printf("%s\n", (PCSTR)buffer);
else
printf("%d\n", *(int*)buffer);
}
Sleep(300);
}
}
Not very exciting, but is good enough for this simple test. Here is some output, first from the “pusher” and then the “popper”:
Are we done? Not quite. Astute readers may have noticed a little problem. What happens if a DataStack object is destroyed (e.g., the last handle to it is closed), but the stack is not empty? That memory will leak, as we have no “desctructor”. Running the “pusher” a few times without a second process that pops items results in a leak. Here is my PoolMonX tool showing the leak:
Notice the “Dstk” tag and the number of allocations being higher that deallocations.
Another feature we are missing is the ability to wait on a DataStack until data is available, if the stack is empty, maybe by calling the WaitForSingleObject API. It would be nice to have that.
Yet another missing element is the ability to query DataStack objects – how much memory is being used, how many items, etc.