Let’s Get Stacking! (Part 3)

In the first part we looked at creating a new kernel object type. In the second part we implemented creation of new DataStack objects and opening existing objects by name. In this part, we’ll implement the main functionality of a DataStack, that makes a DataStack what it is.

Before we get on with that, there is one correction I must make. A good comment from Luke (@lukethezoid) on X said that although the code returns a 32-bit HANDLE to 32-bit callers, it’s not nearly enough because all pointers passed using NtDeviceIoControlFile/DeviceIoControl would be wrong – 32-bit pointers would be passed to the 64-bit kernel and that would cause a crash unless we do some work. For example, a UNICODE_STRING provided by a 32-bit client has a different binary layout than a 64-bit one. Wow64 processes (32-bit x86 running on x64) have two versions of NtDll.Dll in their address space. One is the “native” 64-bit, and the other is a special 32-bit variant. I say “special” because it’s not the same NtDll.Dll you would find on a true 32-bit system. This is because this special DLL knows it’s on a 64-bit system, and provides conversions for parameters (pointer and structures) before invoking the “real” 64-bit DLL API. Here is a snapshot from Process Explorer showing a 32-bit process with two NtDll.Dll files – the native 64-bit loaded into a high address, while the other loaded into a low address (within the first 4GB of address space):

The changes required to support Wow64 processes are not difficult to make, but not too interesting, either, so I won’t be implementing them. Instead, 32-bit clients will be blocked from using DataStacks. We should block on the kernel side for sure, and also in user mode to fail faster. In the kernel, we could use something like this in IRP_MJ_CREATE or IRP_MJ_DEVICE_CONTROL handles:

if (IoIs32bitProcess(Irp)) {
	status = STATUS_NOT_IMPLEMENTED;
    // complete the request...
}

In user mode, we can prevent DataStack.Dll from loading in the first place in Wow64 processes:

BOOL APIENTRY DllMain(HMODULE hModule, DWORD reason, LPVOID) {
	switch (reason) {
		case DLL_PROCESS_ATTACH:
            // C++ 17
			if (BOOL wow; IsWow64Process(GetCurrentProcess(), &wow) && wow)
				return FALSE;
//...

Note, however, that on true 32-bit systems everything should work just fine, as user mode and kernel mode are bitness-aligned.

Implementing the Actual Data Stack

Now we’re ready to focus on implementing the stack functionality – push, pop, and clear.

We’ll start from user mode, and gradually move to kernel mode. First, we want nice APIs for clients to use that have “Win32” conventions like so:

BOOL WINAPI PushDataStack(_In_ HANDLE hDataStack, _In_ const PVOID buffer, _In_ DWORD size);
BOOL WINAPI PopDataStack(_In_ HANDLE hDataStack, _Out_ PVOID buffer, _Inout_ DWORD* size);
BOOL WINAPI ClearDataStack(_In_ HANDLE hDataStack);

The APIs return BOOL to indicate success/failure, and GetLastError could be used to get additional information in case of an error. Let’s start with push:

BOOL WINAPI PushDataStack(HANDLE hDataStack, const PVOID buffer, DWORD size) {
	auto status = NtPushDataStack(hDataStack, buffer, size);
	if (!NT_SUCCESS(status))
		SetLastError(RtlNtStatusToDosError(status));

	return NT_SUCCESS(status);
}

Nothing to it. Call NtPushDataStack and update the last error is needed. NtPushDataStack just packs the arguments in a helper structure and sends to the driver, just like we did with CreateDataStack and OpenDataStack:

NTSTATUS NTAPI NtPushDataStack(_In_ HANDLE DataStackHandle, 
    _In_ const PVOID Item, _In_ ULONG ItemSize) {
	DataStackPush data;
	data.DataStackHandle = DataStackHandle;
	data.Buffer = Item;
	data.Size = ItemSize;

	IO_STATUS_BLOCK ioStatus;
	return NtDeviceIoControlFile(g_hDevice, nullptr, nullptr, nullptr, &ioStatus,
		IOCTL_DATASTACK_PUSH, &data, sizeof(data), nullptr, 0);
}

Now we switch to the kernel side. The DeviceIoControl handler just forwards the arguments to the “real” NtPushDataStack:

case IOCTL_DATASTACK_PUSH:
{
	auto data = (DataStackPush*)Irp->AssociatedIrp.SystemBuffer;
	if (dic.InputBufferLength < sizeof(*data)) {
		status = STATUS_BUFFER_TOO_SMALL;
		break;
	}
	status = NtPushDataStack(data->DataStackHandle, data->Buffer, data->Size);
	break;
}

Now we’re getting to the interesting parts. First, we need to check if the arguments make sense:

NTSTATUS NTAPI NtPushDataStack(HANDLE DataStackHandle, 
    const PVOID Item, ULONG ItemSize) {
	if (ItemSize == 0)
		return STATUS_INVALID_PARAMETER_3;

	if (!ARGUMENT_PRESENT(Item))
		return STATUS_INVALID_PARAMETER_2;

If the pushed data size is zero or the pointer to the data is NULL, then it’s an error. The ARGUMENT_PRESENT macro returns true if the given pointer is not NULL. Notice that we can return specific “invalid parameter” error based on the parameter index. Unfortunately, user mode callers always get a generic ERROR_INVALID_PARAMETER regardless. But at least kernel callers can benefit from the extra detail.

Next, we need to get to the DataStack object corresponding to the given handle; in fact, the handle could be bad, or not point to a DataStack object at all. This is a job for ObReferenceObjectByHandle or one of its variants:

DataStack* ds;
auto status = ObReferenceObjectByHandleWithTag(DataStackHandle,
    DATA_STACK_PUSH, g_DataStackType, ExGetPreviousMode(), 
    DataStackTag, (PVOID*)&ds, nullptr);
if (!NT_SUCCESS(status))
	return status;

ObReferenceObjectByHandleWithTag attempts to retrieve the object pointer given a handle, a type object, and access. The added tag provides a simple way to “track” the caller taking the reference. I’ve defined DataStackTag to be used for dynamic memory allocations as well (we’ll do very soon):

const ULONG DataStackTag = 'ktsD';

It’s the same kind of tag typically provided to allocation functions. You can read the tag from right to left (that’s how it would be displayed in tools as we’ll see later) – “Dstk”, kind of short for “Data Stack”. The function also checks if the handle has the required access mask (DATA_STACK_PUSH in this case), and will fail if the handle is not powerful enough. ExGetPreviousMode is provided to the function to indicate who is the original caller – for kernel callers, any access will be granted (the access mask is not really used).

With the object in hand, we can do the work, delegated to a separate function, and then not to forget to dereference the object or it will leak:

status = DsPushDataStack(ds, Item, ItemSize);
ObDereferenceObjectWithTag(ds, DataStackTag);

return status;

Technically, we don’t have to use a separate function, but it mimics how the kernel works for complex objects: there is another layer of implementation that works with the object directly (no more handles involved).

The DataStack structure mentioned in the previous parts holds the data for the implementation, and will be treated as a classic C structure – no member functions just to mimic how the Windows kernel is implemented – almost everything in C, not C++:

struct DataStack {
	LIST_ENTRY Head;
	FAST_MUTEX Lock;
	ULONG Count;
	ULONG MaxItemCount;
	ULONG_PTR Size;
	ULONG MaxItemSize;
	ULONG_PTR MaxSize;
};

struct DataBlock {
	LIST_ENTRY Link;
	ULONG Size;
	UCHAR Data[1];
};

The “stack” will be managed by a linked list (the classic LIST_ENTRY and friends) implementation provided by the kernel. We’ll push by adding to the tail and pop by removing from the tail. (Clearly, it would be just as easy to implement a queue, rather than, or in addition to, a stack.) We need a lock to prevent corruption of the list, and a fast mutex will do the job. Count stores the number of items currently on the data stack, and Size has the total size in bytes. MaxSize and MaxItemCount are initialized when the DataStack is created and provide some control over the limits of the data stack. Values of zero indicate no special limit.

The second structure, DataBlock is the one that holds the actual data, along with its size, and of course a link to the list. Data[1] is just a placeholder, where the data is going to be copied to, assuming we allocate the correct size.

We’ll start the push implementation in an optimistic manner, and allocate the DataBlock structure with the required size based on the data provided:

NTSTATUS DsPushDataStack(DataStack* ds, PVOID Item, ULONG ItemSize) {
	auto buffer = (DataBlock*)ExAllocatePool2(POOL_FLAG_PAGED | POOL_FLAG_UNINITIALIZED, 
		ItemSize + sizeof(DataBlock), DataStackTag);
	if (buffer == nullptr)
		return STATUS_INSUFFICIENT_RESOURCES;

We use the (relatively) new ExAllocatePool2 API to allocate the memory block (ExAllocatePoolWithTag is deprecated from Windows version 2004, but you can use it with an old-enough WDK or if you turn off the deprecation warning). We allocate the buffer uninitialized, as we’ll copy the data to it very soon, so no need for zeroed buffer. Technically, we allocate one extra byte beyond what we need, but that’s not a big deal. Now we can copy the data from the client’s provided buffer, being careful to probe user-mode buffers under exception protection:

auto status = STATUS_SUCCESS;
if (ExGetPreviousMode() != KernelMode) {
	__try {
		ProbeForRead(Item, ItemSize, 1);
		memcpy(buffer->Data, Item, ItemSize);
	}
	__except (EXCEPTION_EXECUTE_HANDLER) {
		ExFreePool(buffer);
		return GetExceptionCode();
	}
}
else {
	memcpy(buffer->Data, Item, ItemSize);
}
buffer->Size = ItemSize;

If an exception occurs because of a bad user-mode buffer, we can return the exception code and that’s it. Note that ProbeForRead does not work for kernel addresses, and cannot prevent crashes – this is intentional. Only user-mode buffers are out of control from the kernel’s side.

Now we can add the item to the stack if the limits are not violated, while updating the DataStack’s stats:

ExAcquireFastMutex(&ds->Lock);
do {
	if (ds->MaxItemCount == ds->Count) {
		status = STATUS_NO_MORE_ENTRIES;
		break;
	}

	if (ds->MaxItemSize && ItemSize > ds->MaxItemSize) {
		status = STATUS_NOT_CAPABLE;
		break;
	}

	if (ds->MaxSize && ds->Size + ItemSize > ds->MaxSize) {
		status = STATUS_NOT_CAPABLE;
		break;
	}
} while (false);

if (NT_SUCCESS(status)) {
	InsertTailList(&ds->Head, &buffer->Link);
	ds->Count++;
	ds->Size += ItemSize;
}
ExReleaseFastMutex(&ds->Lock);

if (!NT_SUCCESS(status))
	ExFreePool(buffer);

return status;

First, we acquire the fast mutex to prevent data races. I opted not to use any C++ RAII type here to make things as clear as possible – we have to be careful not to return before releasing the fast mutex. Next, the do/while non-loop is used to check if any setting is violated, in which case the status is set to some failure. The status values I chose may not look perfect – the right thing to do is create new NTSTATUS values that would be specific for DataStacks, but I was too lazy. The interested reader/coder is welcome to do it right.

Inserting the item involves calling InsertTailList, and then just updating the item count and total byte size. If anything fails, we are careful to free the buffer to prevent a memory leak. This is for push.

Popping Items

The pop operation works along similar lines. In this case, the client asks to pop an item but needs to provide a large-enough buffer to store the data. We’ll use an additional size pointer argument, that on input indicates the buffer’s size, and on output indicates the actual item size. First, the “Win32” API:

BOOL WINAPI PopDataStack(HANDLE hDataStack, PVOID buffer, DWORD* size) {
	auto status = NtPopDataStack(hDataStack, buffer, size);
	if (!NT_SUCCESS(status))
		SetLastError(RtlNtStatusToDosError(status));

	return NT_SUCCESS(status);
}

Just delegating the work to the native API, which forwards to the kernel with a helper structure:

NTSTATUS NTAPI NtPopDataStack(_In_ HANDLE DataStackHandle, _In_ PVOID Buffer, _Inout_ PULONG ItemSize) {
	DataStackPop data;
	data.DataStackHandle = DataStackHandle;
	data.Buffer = Buffer;
	data.Size = ItemSize;

	IO_STATUS_BLOCK ioStatus;
	return NtDeviceIoControlFile(g_hDevice, nullptr, nullptr, nullptr, &ioStatus,
		IOCTL_DATASTACK_POP, &data, sizeof(data), nullptr, 0);
}

This should be expected by now. On the kernel side, things are more interesting. First, get the object based on the handle, then send it to the lower-layer function if successful:

NTSTATUS NTAPI NtPopDataStack(HANDLE DataStackHandle, 
    PVOID Buffer, PULONG BufferSize) {
	if (!ARGUMENT_PRESENT(BufferSize))
		return STATUS_INVALID_PARAMETER_3;

	ULONG size;
	if (ExGetPreviousMode() != KernelMode) {
		__try {
			ProbeForRead(BufferSize, sizeof(ULONG), 1);
			size = *BufferSize;
		}
		__except (EXCEPTION_EXECUTE_HANDLER) {
			return GetExceptionCode();
		}
	}
	else {
		size = *BufferSize;
	}

	if (!ARGUMENT_PRESENT(Buffer) && size != 0)
		return STATUS_INVALID_PARAMETER_2;

	DataStack* ds;
	auto status = ObReferenceObjectByHandleWithTag(DataStackHandle,
        DATA_STACK_POP, g_DataStackType, ExGetPreviousMode(), 
        DataStackTag, (PVOID*)&ds, nullptr);
	if (!NT_SUCCESS(status))
		return status;

	status = DsPopDataStack(ds, Buffer, size, BufferSize);
	ObDereferenceObjectWithTag(ds, DataStackTag);
	return status;
}

The input buffer size is extracted, being careful to probe the user mode pointer. The real work is done in DsPopDataStack. First, take the lock. Second, see if the data stack is empty – if so, no pop operation possible. If the input size is zero, return the size of the top element:

NTSTATUS DsPopDataStack(DataStack* ds, PVOID buffer, 
    ULONG inputSize, ULONG* itemSize) {
	ExAcquireFastMutex(&ds->Lock);
	__try {
		if (inputSize == 0) {
			//
			// return size of next item
			//			
			__try {
				if (ds->Count == 0) {
					//
					// stack empty
					//
					*itemSize = 0;
				}
				else {
					auto top = CONTAINING_RECORD(ds->Head.Blink, DataBlock, Link);
					*itemSize = top->Size;
				}
				return STATUS_SUCCESS;
			}
			__except (EXCEPTION_EXECUTE_HANDLER) {
				return GetExceptionCode();
			}
		}

The locking here works differently than the push implementation by using a __finally block, which is the one releasing the fast mutex. This ensures that no matter how we leave the __try block, the lock will be released for sure.

The CONTAINING_RECORD macro is used correctly to get to the item from the link (LIST_ENTRY). Technically, in this case we could just make a simple cast, as the LIST_ENTRY member is the first in a DataBlock. Notice how we get to the top item: Head.Blink, which points to the tail (last) item.

If the data stack is empty, we place zero in the item size pointer and return an error (abusing yet another existing error):

if (ds->Count == 0) {
	__try {
		*itemSize = 0;
	}
	__except (EXCEPTION_EXECUTE_HANDLER) {
		return GetExceptionCode();
	}
	return STATUS_PIPE_EMPTY;
}

If manage to get beyond this point, then there is an item, and we need to remove it, copy the data to the client’s buffer (if it’s big enough), and free the kernel’s copy of the buffer:

	auto link = RemoveTailList(&ds->Head);
	NT_ASSERT(link != &ds->Head);
	
	auto item = CONTAINING_RECORD(link, DataBlock, Link);
	__try {
		*itemSize = item->Size;
		if (inputSize < item->Size) {
			//
			// buffer too small
			// reinsert item
			//
			InsertTailList(&ds->Head, link);
			return STATUS_BUFFER_TOO_SMALL;
		}
		else {
			memcpy(buffer, item->Data, item->Size);
			ds->Count--;
			ds->Size -= item->Size;
			ExFreePool(item);
			return STATUS_SUCCESS;
		}
	}
	__except (EXCEPTION_EXECUTE_HANDLER) {
		return GetExceptionCode();
	}
}
__finally {
	ExReleaseFastMutex(&ds->Lock);
}

The call to RemoveTailList removes the top item from the list. The next assert verifies the list wasn’t empty before the removal (it can’t be as we dealt with that case in the previous code section). Remember, that if a list is empty calling RemoveTailList or RemoveHeadList returns the head’s pointer.

If the client’s buffer is too small, we reinsert the item back and bail. Otherwise, we copy the data, update the data stack’s stats and free our copy of the item.

Cleanup

The stack clear operation is relatively straightforward of them all. Here is the kernel part that matters:

NTSTATUS DsClearDataStack(DataStack* ds) {
	ExAcquireFastMutex(&ds->Lock);
	LIST_ENTRY* link;

	while ((link = RemoveHeadList(&ds->Head)) != &ds->Head) {
		auto item = CONTAINING_RECORD(link, DataBlock, Link);
		ExFreePool(item);
	}
	ds->Count = 0;
	ds->Size = 0;
	ExReleaseFastMutex(&ds->Lock);

	return STATUS_SUCCESS;
}

We take the lock, and then go over the list, removing and freeing each item. Finally, we update the stats to zero items and zero bytes.

Testing

Here is one way to test – having an executable run twice, the first instance pushes some items, and the second one popping items. main creates a data stack with a name. If it’s a new object, it assumes the role of “pusher”. Otherwise, it assumes the role of “popper”:

int main() {
	HANDLE hDataStack = CreateDataStack(nullptr, 0, 100, 
        10 << 20, L"MyDataStack");
	if (!hDataStack) {
		printf("Failed to create data stack (%u)\n", GetLastError());
		return 1;
	}

	printf("Handle created: 0x%p\n", hDataStack);

	if (GetLastError() == ERROR_ALREADY_EXISTS) {
		printf("Opened an existing object... will pop elements\n");
		PopItems(hDataStack);
	}
	else {
		Sleep(5000);

		PushString(hDataStack, "Hello, data stack!");
		PushString(hDataStack, "Pushing another string...");
		for (int i = 1; i <= 10; i++) {
			Sleep(100);
			PushDataStack(hDataStack, &i, sizeof(i));
		}
	}

	CloseHandle(hDataStack);
	return 0;
}

When creating a named object, if GetLastError returns ERROR_ALREADY_EXISTS, it means a handle is returned to an existing object. In our current implementation, this actually won’t work. We have to fix the CreateDataStack implementation like so:

HANDLE hDataStack;
auto status = NtCreateDataStack(&hDataStack, &attr, maxItemSize, maxItemCount, maxSize);
if (NT_SUCCESS(status)) {
	const NTSTATUS STATUS_OBJECT_NAME_EXISTS = 0x40000000;

	if (status == STATUS_OBJECT_NAME_EXISTS) {
		SetLastError(ERROR_ALREADY_EXISTS);
	}
	else {
		SetLastError(0);
	}
	return hDataStack;
}

After calling NtCreateDataStack we fix the returned “error” if the kernel returns STATUS_OBJECT_NAME_EXISTS. Now the previous will work correctly.

PushString is a little helper to push strings:

bool PushString(HANDLE h, std::string const& text) {
	auto ok = PushDataStack(h, (PVOID)text.c_str(), (ULONG)text.length() + 1);
	if (!ok)
		printf("Error in PushString: %u\n", GetLastError());
	return ok;
}

Finally, PopItems does some popping:

void PopItems(HANDLE h) {
	BYTE buffer[256];

	auto tick = GetTickCount64();
	while (GetTickCount64() - tick < 10000) {
		DWORD size = sizeof(buffer);
		if (!PopDataStack(h, buffer, &size) && GetLastError() != ERROR_NO_DATA) {
			printf("Error in PopDataStack (%u)\n", GetLastError());
			break;
		}
		if (size) {
			printf("Popped %u bytes: ", size);
			if (size > sizeof(int))
				printf("%s\n", (PCSTR)buffer);
			else
				printf("%d\n", *(int*)buffer);
		}
		Sleep(300);
	}
}

Not very exciting, but is good enough for this simple test. Here is some output, first from the “pusher” and then the “popper”:

E:\Test>DSTest.exe
Handle created: 0x00000000000000F8
E:\Test>DSTest.exe
Handle created: 0x0000000000000104
Opened an existing object... will popup elements
Popped 4 bytes: 2
Popped 4 bytes: 5
Popped 4 bytes: 8
Popped 4 bytes: 10
Popped 4 bytes: 9
Popped 4 bytes: 7
Popped 4 bytes: 6
Popped 4 bytes: 4
Popped 4 bytes: 3
Popped 4 bytes: 1
Popped 26 bytes: Pushing another string...
Popped 19 bytes: Hello, data stack!

What’s Next?

Are we done? Not quite. Astute readers may have noticed a little problem. What happens if a DataStack object is destroyed (e.g., the last handle to it is closed), but the stack is not empty? That memory will leak, as we have no “desctructor”. Running the “pusher” a few times without a second process that pops items results in a leak. Here is my PoolMonX tool showing the leak:

Notice the “Dstk” tag and the number of allocations being higher that deallocations.

Another feature we are missing is the ability to wait on a DataStack until data is available, if the stack is empty, maybe by calling the WaitForSingleObject API. It would be nice to have that.

Yet another missing element is the ability to query DataStack objects – how much memory is being used, how many items, etc.

We’ll deal with these aspects in the next part.

Published by

Unknown's avatar

Pavel Yosifovich

Developer, trainer, author and speaker. Loves all things software

One thought on “Let’s Get Stacking! (Part 3)”

Leave a comment