# Vulkan with DXGI - experiment results

Mon 19 Nov 2018

In my previous post, I described a way to get GPU memory usage in a Windows Vulkan app by using DXGI. This API, designed for Direct3D, seems to work with Vulkan as well. In this post I would like to share detailed results of my experiment on two different platforms with two graphics cards from different vendors. But before that, a disclaimer:

## AMD

Now to the data: I ran my program on two different machines. The first one was:

OS: Windows 10 64-bit version 1803 (OS Build 17134.407)
RAM: 16 GB
GPU: AMD Radeon RX 580 8 GB
Graphics driver: 18.11.1

At program startup, before any Vulkan objects are created:

Local: Budget=7252479180 CurrentUsage=0 AvailableForReservation=3839547801 CurrentReservation=0
Nonlocal: Budget=7699177267 CurrentUsage=0 AvailableForReservation=4063454668 CurrentReservation=0

So with 8 GB of GPU memory and 16 GB of CPU memory we have Local Budget 6.75 GB (84%) and NonLocal Budget 7.17 GB (45%). AvailableForReservation Local is 3.58 GB (45%), NonLocal is 3.78 GB (24%).
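The conversions above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not code from my test program: the helper names `BytesToGiB` and `PercentOfTotal` are made up for illustration, and it assumes the "GB" figures are binary units (2^30 bytes), which is what the numbers match.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bytes in one binary gigabyte (2^30). Assumption: the "GB" values in the
// text are computed with this unit.
constexpr double kBytesPerGiB = 1024.0 * 1024.0 * 1024.0;

// Converts a raw byte count, as reported by DXGI, to binary gigabytes.
inline double BytesToGiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / kBytesPerGiB;
}

// Expresses `bytes` as a percentage of the physical memory size `totalGiB`.
inline double PercentOfTotal(std::uint64_t bytes, double totalGiB) {
    return 100.0 * BytesToGiB(bytes) / totalGiB;
}
```

For example, `PercentOfTotal(7252479180, 8.0)` gives the Local Budget as a share of the 8 GB card, and `PercentOfTotal(7699177267, 16.0)` gives the NonLocal Budget as a share of system RAM.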

Budget, AvailableForReservation, and CurrentReservation stay the same for the entire run of the program. Only CurrentUsage changes.
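Since only CurrentUsage moves, each result below is a difference between the value read before and after a test. A rough sketch of that bookkeeping follows. `MemorySnapshot`, `UsageDelta` and `Diff` are hypothetical names, and the struct is only a stand-in for the CurrentUsage fields of the two segment groups, so the example compiles without Windows headers. The delta is signed on purpose, because usage can also go down.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in holding CurrentUsage of both DXGI segment groups
// at one point in time (the real values are 64-bit unsigned byte counts).
struct MemorySnapshot {
    std::uint64_t localUsage;
    std::uint64_t nonLocalUsage;
};

// Signed change in usage between two snapshots, in bytes.
struct UsageDelta {
    std::int64_t local;
    std::int64_t nonLocal;
};

// Returns after - before for both segment groups. The casts happen before
// the subtraction so a decrease yields a negative value, not a wrap-around.
inline UsageDelta Diff(const MemorySnapshot& before, const MemorySnapshot& after) {
    return {
        static_cast<std::int64_t>(after.localUsage) -
            static_cast<std::int64_t>(before.localUsage),
        static_cast<std::int64_t>(after.nonLocalUsage) -
            static_cast<std::int64_t>(before.nonLocalUsage),
    };
}
```

Take one snapshot before creating the objects under test and one after, then print `Diff(before, after)`.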

Tests I made:

Swapchain
VkSwapchain is created. Format = B8G8R8A8_UNORM, minImageCount = 2, imageExtent = 1424 x 704
CurrentUsage Local +9,732,096
Expected size of this object is 1424 * 704 * 2 * 4 bytes per pixel = 8,019,968, so we have about 21% overhead relative to the expected size.
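The expected-size estimate can be written down as a small helper. This is only a sketch under two assumptions: the driver creates exactly minImageCount images, and overhead is measured relative to the expected size (the same convention as in the NVIDIA swapchain result later in this post). The function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Raw pixel storage for a swapchain: width * height * images * bytes per pixel.
// Assumption: no padding, alignment, or extra images are included.
inline std::uint64_t ExpectedSwapchainBytes(std::uint32_t width, std::uint32_t height,
                                            std::uint32_t imageCount,
                                            std::uint32_t bytesPerPixel) {
    return static_cast<std::uint64_t>(width) * height * imageCount * bytesPerPixel;
}

// Overhead of the reported usage over the expected size, as a percentage
// of the expected size.
inline double OverheadPercent(std::uint64_t reported, std::uint64_t expected) {
    return 100.0 * (static_cast<double>(reported) - static_cast<double>(expected)) /
           static_cast<double>(expected);
}
```

B8G8R8A8_UNORM has 4 bytes per pixel, so the call for this test is `ExpectedSwapchainBytes(1424, 704, 2, 4)`.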

Memory 1.1
Allocating 1 VkDeviceMemory block with size 100 MB, out of DEVICE_LOCAL memory type (GPU memory, D3D12 equivalent is DEFAULT heap).
CurrentUsage Local +104,857,600
It’s exactly 100 MB, zero overhead.

Memory 1.2
Allocating 1,024 VkDeviceMemory blocks with size 100 KB, out of DEVICE_LOCAL memory type.
CurrentUsage Local +104,857,600
Same total memory size and same result as before - exactly 100 MB. No overhead for separate allocations.

Memory 2.1
Allocating 1 VkDeviceMemory block with size 100 MB, out of DEVICE_LOCAL + HOST_VISIBLE memory type. (That’s a special 256 MB heap available only on AMD cards and only in Vulkan, not in D3D12.)
CurrentUsage Local +104,857,600.
Exactly same amount. Apparently this heap counts to Local in DXGI.

Memory 2.2
Allocating 1,024 VkDeviceMemory blocks with size 100 KB, out of DEVICE_LOCAL + HOST_VISIBLE memory type.
CurrentUsage Local +104,857,600.
Again, same as before. Allocations from this heap account to Local and reported usage is byte-exact as total size of allocations made.

Memory 3.1
Allocating 1 VkDeviceMemory block with size 100 MB, out of HOST_VISIBLE + HOST_COHERENT memory type (CPU memory, D3D12 equivalent is UPLOAD heap).
CurrentUsage NonLocal +104,857,600
Same amount, exactly the size of allocation, reported as used out of NonLocal memory this time.

Memory 3.2
Allocating 1,024 VkDeviceMemory blocks with size 100 KB, out of HOST_VISIBLE + HOST_COHERENT memory type.
CurrentUsage NonLocal +104,857,600
Same as before. Exact sum of allocated memory added to NonLocal. No overhead for separate allocations.

Memory 4.1
Allocating 1 VkDeviceMemory block with size 100 MB, out of HOST_VISIBLE + HOST_COHERENT + HOST_CACHED memory type (cached CPU memory, D3D12 equivalent is READBACK heap).
CurrentUsage NonLocal +104,857,600
Same as before, which means this heap also accounts to NonLocal in DXGI.

Memory 4.2
Allocating 1,024 VkDeviceMemory blocks with size 100 KB, out of HOST_VISIBLE + HOST_COHERENT + HOST_CACHED memory type.
CurrentUsage NonLocal +104,857,600
Same as before.

Buffers
5,000 VkBuffer objects created and bound to a VkDeviceMemory that has been allocated before, so we are counting buffer objects alone and not the memory allocation.
No change in CurrentUsage, either Local or NonLocal.

Images
5,000 VkImage objects created and bound to a VkDeviceMemory that has been allocated before, so we are counting image objects alone and not the memory allocation.
No change in CurrentUsage, either Local or NonLocal.

Queries 1
1 VkQueryPool object is created that contains 1,000,000 queries of type TIMESTAMP.
CurrentUsage Local +8,003,584
That gives 8 bytes per timestamp query.

Queries 2
5,000 VkQueryPool objects are created with 200 TIMESTAMP queries each.
CurrentUsage Local +8,126,464
With the same number of queries but 4,999 more query pools, we have 122,880 more bytes, so we can assume the overhead for a separate query pool is about 25 bytes per pool.
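The per-pool estimate is a marginal cost: the same payload is split across more objects, and the extra bytes are divided by the number of extra objects. A minimal sketch of that calculation, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Extra bytes per additional object when the same total payload is spread
// over `countMany` objects instead of `countFew`.
// Assumption: the payload itself costs the same in both runs, so the whole
// difference is attributed to per-object overhead.
inline double MarginalOverheadPerObject(std::int64_t usageFew, std::int64_t countFew,
                                        std::int64_t usageMany, std::int64_t countMany) {
    return static_cast<double>(usageMany - usageFew) /
           static_cast<double>(countMany - countFew);
}
```

For the query pools this is `MarginalOverheadPerObject(8003584, 1, 8126464, 5000)`. The same function applies to the command buffer tests that follow.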

Command buffers 1
1 VkCommandPool and 1 VkCommandBuffer with 10,000 render passes.
CurrentUsage Local +1,835,008
CurrentUsage NonLocal +77,594,624
This means the command buffer is allocated mostly in CPU memory: about 184 bytes of GPU memory and 7,759 bytes of CPU memory per render pass.

Command buffers 2
1 VkCommandPool and 2,000 VkCommandBuffer objects with 5 render passes each.
CurrentUsage Local +30,932,992
CurrentUsage NonLocal +184,549,376
With the same number of render passes and 1,999 more command buffers, we have 29,097,984 more bytes of GPU memory and 106,954,752 more bytes of CPU memory, so we can say that the overhead of a separate command buffer is 14,556 bytes of GPU memory and 53,504 bytes of CPU memory.

Descriptors 1
1 VkDescriptorPool, 1 VkDescriptorSet, 20,000 descriptors of type COMBINED_IMAGE_SAMPLER, as 1 binding
CurrentUsage Local +962,560
CurrentUsage NonLocal +962,560
We can see that descriptors are shadowed in both CPU and GPU memory, 48 bytes per descriptor.

Descriptors 2
1 VkDescriptorPool, 5,000 VkDescriptorSet objects with 4 descriptors each
CurrentUsage Local +962,560
CurrentUsage NonLocal +962,560
Same number of descriptors gives same amount of memory used, so there is no overhead for separate descriptor set.

Descriptors 3
1,000 VkDescriptorPool objects, each with 5 VkDescriptorSet objects of 4 descriptors
CurrentUsage Local +786,432
CurrentUsage NonLocal +786,432
Surprisingly, creating many smaller descriptor pools instead of one large pool uses less memory. Now it’s 39 bytes per descriptor.
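All three descriptor tests use the same total of 20,000 descriptors, which is what makes the per-descriptor figures comparable. A short sketch of that check follows. The `DescriptorTest` struct and function names are hypothetical, and it assumes the set and descriptor counts are per pool and per set respectively.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// One descriptor test configuration plus the CurrentUsage increase it produced.
struct DescriptorTest {
    std::uint32_t pools;
    std::uint32_t setsPerPool;
    std::uint32_t descriptorsPerSet;
    std::uint64_t reportedBytes;
};

// Total number of descriptors allocated across all pools and sets.
inline std::uint64_t TotalDescriptors(const DescriptorTest& t) {
    return static_cast<std::uint64_t>(t.pools) * t.setsPerPool * t.descriptorsPerSet;
}

// Average reported bytes per descriptor for one configuration.
inline double BytesPerDescriptor(const DescriptorTest& t) {
    return static_cast<double>(t.reportedBytes) /
           static_cast<double>(TotalDescriptors(t));
}
```

With the AMD numbers, `{1, 1, 20000, 962560}` and `{1, 5000, 4, 962560}` give the same average, while `{1000, 5, 4, 786432}` gives a lower one.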

Pipeline 1
1 VkPipeline with a very long fragment shader, containing thousands of instructions.
CurrentUsage Local +180,224

Pipeline 2
1,000 VkPipeline objects with quite simple shaders. All fragment shaders slightly different, so driver cannot reuse same object.
CurrentUsage Local +524,288
That gives 524 bytes per pipeline.

Call time
Calling IDXGIAdapter3::QueryVideoMemoryInfo function 10,000,000 times, for either DXGI_MEMORY_SEGMENT_GROUP_LOCAL or DXGI_MEMORY_SEGMENT_GROUP_NON_LOCAL, took around 3.36 seconds, which means 336 nanoseconds per call - so short that we can safely call it every frame or even multiple times per frame.
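A measurement like this can be made with a simple timing loop. The sketch below is not my actual test code: `AverageNanosPerCall` and `NanosPerCall` are hypothetical helpers, and the callable passed in stands for whatever is being timed (here it would wrap the QueryVideoMemoryInfo call).

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>

// Calls `fn` `iterations` times and returns the average wall-clock time
// per call in nanoseconds. Loop overhead is included in the result.
template <typename Fn>
double AverageNanosPerCall(Fn&& fn, std::uint64_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        fn();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Converts a total time in seconds over `calls` calls to nanoseconds per call.
inline double NanosPerCall(double totalSeconds, std::uint64_t calls) {
    return totalSeconds * 1e9 / static_cast<double>(calls);
}
```

`NanosPerCall(3.36, 10000000)` reproduces the per-call figure quoted above from the total time.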

Another Vulkan application is launched
No change in any parameters, whether it’s CurrentUsage, Budget, or AvailableForReservation, Local or NonLocal.

## NVIDIA

Now, the second test machine:

OS: Windows 10 64-bit version 1803 (OS Build 17134.407)
RAM: 32 GB
GPU: NVIDIA GeForce GTX 1070 8 GB
Graphics driver: 416.94 Release date: 11/13/2018

At program startup, which is before any Vulkan objects are created:

Local: Budget=7208750284 CurrentUsage=0 AvailableForReservation=3816397209 CurrentReservation=0
Nonlocal: Budget=15442303795 CurrentUsage=0 AvailableForReservation=8150104780 CurrentReservation=0

So with 8 GB of GPU memory and 32 GB of CPU memory we have Local Budget 6.71 GB (84%) and NonLocal Budget 14.38 GB (45%) - exactly same percentage as on AMD.
AvailableForReservation is Local 3.55 GB (44%) and NonLocal 7.59 GB (24%) - again practically the same percentages as on AMD.

Again, Budget, AvailableForReservation, and CurrentReservation stay the same for the entire run of the program. Only CurrentUsage changes.

Swapchain
VkSwapchain is created. Format = B8G8R8A8_UNORM, minImageCount = 3, imageExtent = 1904 x 990
CurrentUsage Local +31,195,136
Expected size of this object is 1904 * 990 * 3 * 4 bytes per pixel = 22,619,520, so we have about 38% overhead.

Memory 1.1
Allocating 1 VkDeviceMemory block with size 100 MB, out of DEVICE_LOCAL memory type (GPU memory, D3D12 equivalent is DEFAULT heap).
CurrentUsage Local +104,857,600.
It’s exactly 100 MB, zero overhead.

Memory 1.2
Allocating 1,024 VkDeviceMemory blocks with size 100 KB, out of DEVICE_LOCAL memory type.
CurrentUsage Local +29,360,128
That’s much less than 100 MB. Maybe NVIDIA sub-allocates such smaller blocks out of their own larger blocks and the driver already had some free space available in the existing ones. Or maybe they lazily allocate after something is bound to the memory. Or maybe memory usage reporting is inaccurate. I don’t know.

Memory 2.x
There is no DEVICE_LOCAL + HOST_VISIBLE memory type on NVIDIA, so these tests are skipped.

Memory 3.1
Allocating 1 VkDeviceMemory block with size 100 MB, out of HOST_VISIBLE + HOST_COHERENT memory type (CPU memory, D3D12 equivalent is UPLOAD heap).
CurrentUsage NonLocal +104,857,600
Same amount, exactly the size of allocation, reported as used out of NonLocal memory this time.

Memory 3.2
Allocating 1,024 VkDeviceMemory blocks with size 100 KB, out of HOST_VISIBLE + HOST_COHERENT memory type.
CurrentUsage NonLocal +4,194,304
Just as in test 1.2 - reported usage is much less than allocation size.

Memory 4.1
Allocating 1 VkDeviceMemory block with size 100 MB, out of HOST_VISIBLE + HOST_COHERENT + HOST_CACHED memory type (cached CPU memory, D3D12 equivalent is READBACK heap).
CurrentUsage NonLocal +104,857,600
Same as before, which means this heap also accounts to NonLocal in DXGI.

Memory 4.2
Allocating 1,024 VkDeviceMemory blocks with size 100 KB, out of HOST_VISIBLE + HOST_COHERENT + HOST_CACHED memory type.
CurrentUsage NonLocal -105,906,176
That allocation actually decreased usage number by more than 100 MB! Which confirms again that reported usage is unpredictable when allocating smaller blocks.

Buffers
5,000 VkBuffer objects created and bound to a VkDeviceMemory that has been allocated before, so we are counting buffer objects alone and not the memory allocation.
No change in CurrentUsage, either Local or NonLocal.

Images
5,000 VkImage objects created and bound to a VkDeviceMemory that has been allocated before, so we are counting image objects alone and not the memory allocation.
No change in CurrentUsage, either Local or NonLocal.

Queries 1
1 VkQueryPool object is created that contains 1,000,000 queries of type TIMESTAMP.
CurrentUsage NonLocal +16,003,072
That gives 16 bytes per timestamp query. Please note that they are accounted in NonLocal memory, while on AMD they were in Local memory.

Queries 2
5,000 VkQueryPool objects are created with 200 TIMESTAMP queries each.
CurrentUsage NonLocal -106,180,608
Again, we see some unexpected, large fluctuation in memory usage when creating large number of small objects.

Command buffers 1
1 VkCommandPool and 1 VkCommandBuffer with 10,000 render passes.
No change in CurrentUsage, either Local or NonLocal. Please note that on AMD there was both Local and (mostly) NonLocal usage reported for command buffers.

Command buffers 2
1 VkCommandPool and 2,000 VkCommandBuffer objects with 5 render passes each.
Same as above - no change in CurrentUsage, either Local or NonLocal.

Descriptors 1
1 VkDescriptorPool, 1 VkDescriptorSet, 20,000 descriptors of type COMBINED_IMAGE_SAMPLER, as 1 binding
No change in CurrentUsage, either Local or NonLocal. Please note that on AMD there was both Local and NonLocal (same amount) usage reported for descriptors.

Descriptors 2
1 VkDescriptorPool, 5,000 VkDescriptorSet objects with 4 descriptors each
CurrentUsage NonLocal -15,728,640
Again, we can see unexpected fluctuation in memory usage when operating on large number of objects.

Descriptors 3
1,000 VkDescriptorPool objects, each with 5 VkDescriptorSet objects of 4 descriptors
CurrentUsage NonLocal +65,011,712
If we believe this number is accurate and not the result of some complex, internal driver memory management, then it gives 65,011 bytes per descriptor pool with capacity of 20 descriptors, or 3,251 bytes per descriptor, but it doesn’t look reasonable.

Pipeline 1
1 VkPipeline with a very long fragment shader, containing thousands of instructions.
CurrentUsage Local +196,608

Pipeline 2
The program hung in an infinite loop inside the NVIDIA driver on the first call to vkCreateGraphicsPipelines. I didn’t investigate it further. The top of the stack was showing:

nvoglv64.dll+0x81b75
nvoglv64.dll!vkGetInstanceProcAddr+0x15513
nvoglv64.dll!vkGetInstanceProcAddr+0xe104
nvoglv64.dll!vkGetInstanceProcAddr+0x246e6

Call time
Calling IDXGIAdapter3::QueryVideoMemoryInfo function 10,000,000 times, for either DXGI_MEMORY_SEGMENT_GROUP_LOCAL or DXGI_MEMORY_SEGMENT_GROUP_NON_LOCAL, took around 11.42 seconds, which means 1.14 microseconds per call - slower than on AMD, but still so short that we can safely call it every frame or even multiple times per frame.

Another Vulkan application is launched
Same as on AMD - no change in any parameters, whether it’s CurrentUsage, Budget, or AvailableForReservation, Local or NonLocal.

## Conclusions

Maybe someday Vulkan will reach official D3D12-level support for GPU memory usage query. Until then, I guess that’s the best thing we have…
