Blog

# Scaling is everywhere, pixel-perfect is the past

Thu 04 Oct 2018

A long time ago, when computers were slow and screen resolutions were low, everything had to be pixel-perfect. For example, the Atari 2600 game console could display only 160x192 pixels. Back then, game characters and all other graphics had to be drawn pixel by pixel to include every intended detail, like Mario's moustache. This is known as pixel art.



(Image source: The Evolution of Mario)

Years later, with higher screen resolutions, game sprites could be drawn using different methods, or even rendered from 3D models, but icons and other GUI elements were still prepared to be displayed pixel by pixel. The same applied to web pages.

Nowadays, even GUI icons are scaled. They can be enlarged smoothly and displayed on various monitors, where a 4K monitor has 4x as many pixels as Full HD. Setting desktop DPI scaling to anything other than 100% scales all the apps in Windows. Modern web pages created according to "responsive design" principles have to look good on all kinds of devices, from small smartphones to huge monitors. Scaling is everywhere.

When programmable cellphones first appeared, making apps and games for them was like going back in time. Just like on retro platforms and the first PCs, screens had very low resolutions, and pixel art was the way to go when drawing game characters. Now mobile games have to work on all sorts of smartphones, many of them with resolutions like our PC monitors - Full HD or even higher.

What seems like the last relic of pixel-perfection is the rendering of 3D scenes. Since the introduction of 3D graphics, we have tended to rasterize and shade our triangles at the same resolution as the image displayed on screen, ideally equal to the native resolution of the monitor. Otherwise, any gamer who cares about image quality would call it ugly. Or would they?

Some things can be rendered at lower resolution. There are games that render the layer with alpha-blended, translucent objects (especially particle effects like fire, smoke, and clouds) to a 4x smaller texture and then upscale it while compositing it with the main, opaque geometry. Such elements tend not to contain many high-frequency (small) details anyway, so the quality degradation due to lower resolution is barely noticeable, while the smaller number of pixels that need to be shaded and blended saves a lot of rendering time.

But that's not the full story. Regardless of resolution, antialiasing is, and always will be, necessary to smooth out jagged edges. The ideal solution is known as Super-Sampling Anti-Aliasing (SSAA), which is nothing more than rendering the scene at a higher resolution and then downscaling it, e.g. averaging each 2x2 block of rendered pixels into a single output pixel. It can be done by the game, or introduced by the graphics driver. AMD offers this feature in its driver under the name "Virtual Super Resolution".
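To make the downscaling step concrete, here is a minimal sketch of a 2x2 box filter over a CPU-side RGBA8 image (illustrative only - in a real engine this happens on the GPU, e.g. in a resolve or a fullscreen pass; the function name and layout are made up for this example):

#include <cstdint>
#include <vector>

// Downscale an RGBA8 image by 2x in each dimension by averaging each 2x2 block.
// src has srcW x srcH pixels (both assumed even), 4 bytes per pixel, row-major.
std::vector<uint8_t> Downscale2x2(const std::vector<uint8_t>& src, uint32_t srcW, uint32_t srcH)
{
    const uint32_t dstW = srcW / 2, dstH = srcH / 2;
    std::vector<uint8_t> dst(dstW * dstH * 4);
    for(uint32_t y = 0; y < dstH; ++y)
    {
        for(uint32_t x = 0; x < dstW; ++x)
        {
            for(uint32_t c = 0; c < 4; ++c) // R, G, B, A
            {
                const uint32_t sum =
                    src[((2*y  ) * srcW + 2*x  ) * 4 + c] +
                    src[((2*y  ) * srcW + 2*x+1) * 4 + c] +
                    src[((2*y+1) * srcW + 2*x  ) * 4 + c] +
                    src[((2*y+1) * srcW + 2*x+1) * 4 + c];
                dst[(y * dstW + x) * 4 + c] = (uint8_t)(sum / 4);
            }
        }
    }
    return dst;
}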

SSAA is of course a slow method, because rendering 4x more pixels requires a lot of computation and memory bandwidth. Various methods exist that provide more efficient antialiasing. Multisample Anti-Aliasing (MSAA), which GPUs support in hardware, lets you shade a pixel (calculate its RGB color in a pixel shader) only once, but store it in multiple per-pixel samples, depending on the shape of the edge being rendered. Numerous screen-space postprocessing algorithms exist that intelligently blur the already rendered image to smooth the edges, e.g. FXAA and MLAA.

This interchangeability between rendering at higher resolution and higher-quality antialiasing, as well as the possibility of filtering the rendered image, is probably best exploited by the engine behind Call of Duty. Jorge Jimenez (Graphics R&D Technical Director at Activision Blizzard) explained it in his talks: "Dynamic temporal antialiasing and upsampling in Call of Duty" (Digital Dragons 2017) and "Dynamic Temporal Antialiasing in Call of Duty: Infinite Warfare" (SIGGRAPH 2017). They dynamically scale the rendering resolution depending on the current game load to maintain a sufficient framerate. The scene is then scaled to the screen resolution. Their technique "combines dynamic resolution with temporal upsampling". Such techniques are especially useful where high FPS and smooth gameplay are important, even at the expense of graphics quality - in fast-paced games, professional e-sports, and VR.

Screen resolutions keep getting higher, but GPU performance doesn't necessarily scale at the same rate, and single-pixel details become harder to notice. That's why it can make sense to render at resolutions even lower than the output resolution and then interpolate the missing pixels. Of course, no interpolation algorithm is perfect, and using just a bilinear filter would look horrible. That's why techniques are being developed that try to minimize the quality loss in this process, e.g. temporal methods (that reuse the image from the previous frame), checkerboard rendering, or the new Deep Learning Super Sampling (DLSS) from NVIDIA.

It also makes sense to shade pixels at a lower rate in parts of the image where details are hard to notice, e.g. where the player is not looking (peripheral vision in VR, especially if eye tracking is available), where objects are moving fast (based on screen-space motion vectors), or where there are few high-frequency details (based on analysis of the previous frame). Shading per pixel or per sample is just one option. NVIDIA cards support techniques like Multi-Res Shading or their latest invention - Variable Rate Shading (VRS), where a helper texture can locally control the shading rate from once per 16 pixels all the way up to 8 times per pixel.

Finally, the rate of shading (lighting calculation) can be completely decoupled from the rate at which the final image is rendered (rasterization) and done in a different space, at a different framerate, or even completely asynchronously. This is known as Object-Space Shading or Texture-Space Shading. It has been used successfully by Oxide Games in Ashes of the Singularity and may soon become more widespread.

I think we can say that scaling is everywhere and pixel-perfect is the past. It is not necessarily a bad thing. If the goal of advancements in 3D rendering in games is to look as photorealistic as movies, then we should realize that movies are never pixel-perfect - there is always scaling and filtering involved at various stages. At the very beginning, camera sensors have a mosaic pattern of R, G, B photosites that must be interpolated into full RGB triplets.

The frames are then often encoded using chroma subsampling (like 4:2:2) and compressed with video codecs. Interpolation and filtering may be involved at many stages of processing, e.g. frame rate conversion, deinterlacing, noise reduction, or finally, the sharpening commonly applied by modern smart TVs (which I'm very allergic to, but there must be some reason behind it). Recorded videos are never pixel-perfect. Rendered 3D games don't have to be either.

Comments | #graphics Share

# Efficient way of using std::vector

Sat 22 Sep 2018

Some people say that the C++ STL is slow. I would rather say it's the way we use it that makes it slow. Sure, there are many potential sources of slowness when using the STL. For example, std::list and std::map tend to allocate many small objects, and dynamic allocation is a time-consuming operation. Making many copies of objects like std::string is also costly - that's why I created the str_view project. But std::vector is just a wrapper over a dynamically allocated array that also remembers its size and can reallocate when you add new elements. Without the STL, you would need to implement the same functionality yourself.

When it comes to traversing the elements of a vector (e.g. to sum the numerical values contained in it), there are many ways to do it. The STL is notorious for being very slow in the Debug configuration, but as it turns out, this heavily depends on which method you choose.

Here is a small experiment I've just made. In this code, I create a vector of 100,000,000 integers, then sum its elements using 5 different methods, measuring how long each of them takes. The results (averaged over 5 iterations for each method) are as follows. Notice the logarithmic scale on the horizontal axis.

Here is the full source code of my testing program:

#include <cstdio>
#include <cstdint>
#include <vector>
#include <chrono>
#include <numeric>

typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::high_resolution_clock::duration duration;
inline time_point now() { return std::chrono::high_resolution_clock::now(); }
inline double durationToMilliseconds(duration d) { return std::chrono::duration<double, std::milli>(d).count(); }

int main()
{
    printf("Iteration,Method,Sum,Time (ms)\n");
    
    for(uint32_t iter = 0; iter < 5; ++iter)
    {
        std::vector<int> numbers(100000000ull);
        numbers[0] = 1; numbers[1] = 2; numbers.back() = 3;

        {
            time_point timeBeg = now();

            // Method 1: Use STL algorithm std::accumulate.
            int sum = std::accumulate(numbers.begin(), numbers.end(), 0);

            printf("%u,accumulate,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 2: Use the new C++11 range-based for loop.
            int sum = 0;
            for(auto value : numbers)
                sum += value;

            printf("%u,Range-based for loop,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 3: Use traditional loop, traverse vector using its iterator.
            int sum = 0;
            for(auto it = numbers.begin(); it != numbers.end(); ++it)
                sum += *it;

            printf("%u,Loop with iterator,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 4: Use traditional loop, traverse using index.
            int sum = 0;
            for(size_t i = 0; i < numbers.size(); ++i)
                sum += numbers[i];

            printf("%u,Loop with indexing,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }

        {
            time_point timeBeg = now();

            // Method 5: Get pointer to raw array and its size, then use a loop to traverse it.
            int sum = 0;
            int* dataPtr = numbers.data();
            size_t count = numbers.size();
            for(size_t i = 0; i < count; ++i)
                sum += dataPtr[i];

            printf("%u,Loop with pointer,%i,%g\n", iter, sum, durationToMilliseconds(now() - timeBeg));
        }
    }
}

As you can see, some methods are slower than others in the Debug configuration by more than 3 orders of magnitude! The difference is so big that if you write your program or game like this, it may not be possible to use its Debug version with any reasonably sized input data. But if you look at the disassembly, it should be no surprise. For example, method 4 calls the vector methods size() and operator[] in every iteration of the loop. In the Debug configuration, functions are not inlined or otherwise optimized, so these are real function calls:

On the other hand, method 5, which operates on a raw pointer to the vector's underlying data, is not that much slower in the Debug configuration compared to Release. Disassembly from the Debug version:

So my conclusion is: using std::vector to handle memory management and reallocation, while accessing its data through a raw pointer, is the best way to go.

My testing environment was:

CPU: Intel Core i7-6700K 4.00 GHz
RAM: DDR4, Dual-Channel, current memory clock 1066 MHz
OS: Windows 10 Version 1803 (OS Build 17134.285)
Compiler: Microsoft Visual Studio Community 2017 Version 15.4.8
Configuration options: x64 Debug/Release
Windows SDK Version 10.0.16299.0

Comments | #stl #c++ #optimization Share

# Debugging D3D12 driver crash

Wed 12 Sep 2018

The new generation of explicit graphics APIs (Vulkan and DirectX 12) is more efficient and involves less CPU overhead. Part of the reason is that they don't check most errors. In the old APIs (Direct3D 9, OpenGL), every function call was validated internally and returned a success or failure code, while a driver crash indicated a bug in the driver code. The new APIs, on the other hand, rely on the developer doing the right thing. Of course, some functions still return an error code (especially those that allocate memory or create some resource), but the ones that record commands into a command list just return void. If you do something illegal, you can expect undefined behavior. You can use the Validation Layers / Debug Layer to do some checks, but otherwise everything may work fine on some GPUs, you may get an incorrect result, or you may experience a driver crash or timeout (called "TDR"). The good thing is that (contrary to old Windows XP) a crash inside the graphics driver doesn't cause a "blue screen of death" or machine restart. The system just restarts the graphics hardware and driver, while your program receives the DXGI_ERROR_DEVICE_REMOVED code from one of the functions like IDXGISwapChain::Present. Unfortunately, you then don't know which specific draw call or other command caused the crash.
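For context, the point where you learn about the crash typically looks like this (a minimal sketch; device and swapChain are assumed to be valid, already created objects, and error handling is omitted):

#include <d3d12.h>
#include <dxgi1_4.h>

// Present a frame and check whether the GPU device was lost.
void PresentAndCheckDeviceRemoval(ID3D12Device* device, IDXGISwapChain3* swapChain)
{
    HRESULT hr = swapChain->Present(1, 0);
    if(hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET)
    {
        // Returns the reason, e.g. DXGI_ERROR_DEVICE_HUNG.
        HRESULT reason = device->GetDeviceRemovedReason();
        (void)reason; // Log it - but we still don't know which command caused the crash.
    }
}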

NVIDIA proposed a solution for that: the NVIDIA Aftermath library. It lets you (among other things) record commands that write custom "marker" data to a buffer that survives a driver crash, so you can later read it and see which command was successfully executed last. Unfortunately, this library works only with NVIDIA graphics cards.

Some time ago I showed a portable solution for Vulkan in my post "Debugging Vulkan driver crash - equivalent of NVIDIA Aftermath". Now I'd like to present a solution for Direct3D 12. It turns out that this API also provides a standardized way to achieve this, in the form of the method ID3D12GraphicsCommandList2::WriteBufferImmediate. One caveat: this new version of the interface requires:

  • The user to have at least Windows 10 Fall Creators Update.
  • The developer to have a Windows SDK version targeting at least Windows 10 Fall Creators Update.
  • The developer to use Visual Studio 2017 - required by that version of the Windows SDK.

I created a simple library that implements all the required logic behind an easy interface, which I called D3d12AfterCrash. You can find all the details and instructions for how to use it in the file "D3d12AfterCrash.h".

I guess it would be better to allocate the buffer using the WinAPI function VirtualAlloc(NULL, bufferSize, MEM_COMMIT, PAGE_READWRITE), then call ID3D12Device3::OpenExistingHeapFromAddress and ID3D12Device::CreatePlacedResource, but my simple way of just calling ID3D12Device::CreateCommittedResource seems to work - the buffer survives the driver crash and preserves its content. I checked it on an AMD as well as an NVIDIA card.
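For illustration, writing a marker before a suspected command might look roughly like this (a sketch of the raw API usage, not the actual D3d12AfterCrash code; markerBufferGpuAddress is assumed to point into the crash-surviving buffer):

#include <d3d12.h>

// Write a 32-bit marker into the crash-surviving buffer right before a suspected
// command, so it can be read back after DXGI_ERROR_DEVICE_REMOVED.
void WriteMarker(ID3D12GraphicsCommandList2* cmdList,
                 D3D12_GPU_VIRTUAL_ADDRESS markerBufferGpuAddress,
                 UINT32 markerValue)
{
    D3D12_WRITEBUFFERIMMEDIATE_PARAMETER param = {};
    param.Dest = markerBufferGpuAddress;
    param.Value = markerValue;
    // MARKER_IN: the write is guaranteed to occur after preceding commands have started executing.
    const D3D12_WRITEBUFFERIMMEDIATE_MODE mode = D3D12_WRITEBUFFERIMMEDIATE_MODE_MARKER_IN;
    cmdList->WriteBufferImmediate(1, &param, &mode);
}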

Comments | #directx #graphics #libraries #productions Share

# Macro with current function name - __func__ vs __FUNCTION__

Tue 11 Sep 2018

Today, while programming in C++, I wanted to write an assert-like macro that would throw an exception when a given condition is not satisfied. I wanted to include as much information as possible in the message string. I know that the condition expression, which is the argument of my macro, can be turned into a string using the # preprocessor operator.

Next, I searched for a way to also obtain the name of the current function. At first, I found __func__, as described here (C++11) and here (C99). Unfortunately, the following code fails to compile:

#define CHECK(cond) if(!(cond)) { \
    throw std::runtime_error("ERROR: Condition " #cond " in function " __func__); }

void ProcessData()
{
    CHECK(itemCount > 0); // Compilation error!
    // (...)
}

This is because this identifier is not a string literal but an implicit local variable, as if declared static const char __func__[] = "...", so it cannot be concatenated with adjacent string literals at compile time.

Then I recalled that Visual Studio defines the __FUNCTION__ macro as a custom Microsoft extension. See the documentation here. This one works as I expected - it can be concatenated with other string literals, because it expands to a string literal itself. The following macro definition fixes the problem:

#define CHECK(cond) if(!(cond)) \
    { throw std::runtime_error("ERROR: Condition " #cond " in function " __FUNCTION__); }

When itemCount is 0, the exception is thrown and ex.what() returns the following string:

ERROR: Condition itemCount > 0 in function ProcessData

Well... For any experienced C++ developer, it should be no surprise that the C++ standard committee comes up with solutions that are far from useful in practice :)
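If portability matters, one workaround (a sketch; it builds the message at runtime instead of relying on compile-time literal concatenation, and itemCount is shown as a parameter just to make the fragment self-contained) is to keep __func__ and concatenate with std::string:

#include <stdexcept>
#include <string>

#define CHECK(cond) if(!(cond)) { \
    throw std::runtime_error(std::string("ERROR: Condition " #cond " in function ") + __func__); }

void ProcessData(int itemCount)
{
    CHECK(itemCount > 0); // Throws with the function name included, on any conforming compiler.
}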

Comments | #c++ Share

# Operations on power of two numbers

Sun 09 Sep 2018

Numbers that are powers of two (i.e. 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 and so on...) are especially important in programming, due to the way computers work - they operate on binary representations. Sometimes there is a need to ensure that a certain number is a power of two. For example, it might be important for the size and alignment of some memory blocks. This property simplifies operations on such quantities - they can be manipulated using bitwise operations instead of arithmetic ones.

In this post I'd like to present efficient algorithms for 3 common operations on power-of-2 numbers, in C++. I do it just to gather them in one place, because they can easily be found in many other places around the Internet. These operations could be implemented with other algorithms as well - the most obvious implementation would loop over bits, but that gives O(n) time complexity relative to the number of bits in the operand type. The following algorithms use clever bit tricks to be more efficient. They run in constant or logarithmic time and don't use any flow control.

1. Check if a number is a power of two. Examples:

IsPow2(0)   == true (!!)
IsPow2(1)   == true
IsPow2(2)   == true
IsPow2(3)   == false
IsPow2(4)   == true
IsPow2(123) == false
IsPow2(128) == true
IsPow2(129) == false

This one I know off the top of my head. The trick is based on the observation that a number is a power of two when its binary representation has exactly one bit set, e.g. 128 = 0b10000000. If you decrement it, all the less significant bits become set: 127 = 0b01111111. The bitwise AND then checks that the two numbers have no bits set in common.

template <typename T> bool IsPow2(T x)
{
    return (x & (x-1)) == 0;
}

2. Find the smallest power of two greater than or equal to a given number. Examples:

NextPow2(0)   == 0
NextPow2(1)   == 1
NextPow2(2)   == 2
NextPow2(3)   == 4
NextPow2(4)   == 4
NextPow2(123) == 128
NextPow2(128) == 128
NextPow2(129) == 256

This one I had in my library for a long time.

uint32_t NextPow2(uint32_t v)
{
    v--;
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8;
    v |= v >> 16;
    v++;
    return v;
}
uint64_t NextPow2(uint64_t v)
{
    v--;
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8;
    v |= v >> 16; v |= v >> 32;
    v++;
    return v;
}

3. Find the largest power of two less than or equal to a given number. Examples:

PrevPow2(0) == 0
PrevPow2(1) == 1
PrevPow2(2) == 2
PrevPow2(3) == 2
PrevPow2(4) == 4
PrevPow2(123) == 64
PrevPow2(128) == 128
PrevPow2(129) == 128

I needed this one just recently and it took me a while to find it on Google. Finally, I found it in this post on Stack Overflow.

uint32_t PrevPow2(uint32_t v)
{
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8;
    v |= v >> 16;
    v = v ^ (v >> 1);
    return v;
}
uint64_t PrevPow2(uint64_t v)
{
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8;
    v |= v >> 16; v |= v >> 32;
    v = v ^ (v >> 1);
    return v;
}

Update 2018-09-10: As I've been notified on Twitter, C++20 is also getting such functions in the standard header <bit>.
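For reference, the <bit> functions as they were eventually standardized in C++20 cover these exact operations:

#include <bit>

static_assert(std::has_single_bit(128u));    // IsPow2(128)   == true
static_assert(std::bit_ceil(123u) == 128u);  // NextPow2(123) == 128
static_assert(std::bit_floor(123u) == 64u);  // PrevPow2(123) == 64
// Note: std::bit_ceil(0u) == 1, unlike NextPow2(0) == 0 above.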

Comments | #math #c++ #algorithms Share

# Iteration time is everything

Thu 06 Sep 2018

I still remember Demobit 2018 in February in Bratislava, Slovakia. During this demoscene party, one of the talks was given by Matt Swoboda "Smash", the author of Notch. Notch is a program that lets you create audio-visual content, like demos or interactive visual shows accompanying concerts, in a visual way - by connecting blocks, somewhat like Blueprints in Unreal Engine. (The name is not to be confused with the nickname of the author of Minecraft.) See also Number one / Another one by CNCD/Fairlight - the latest demo made in it.

During his talk, Smash referred to music production. He said that musicians couldn't imagine working without the possibility of instantly hearing the effect of the changes they make to their project. He said that graphics artists deserve the same level of interactivity - WYSIWYG, instant feedback, with no need for a lengthy "build" or "render". That's why Notch was created. Then I thought: what about programmers? Don't they deserve it too? Shorter iteration times mean better work efficiency and higher quality results. Meanwhile, a programmer sometimes has to wait minutes or even hours to be able to test a change in their code, no matter how small it is. I think it's a big problem.

This is exactly what I like about the development of desktop Windows applications and games: they can usually be built, run, and tested locally within a few seconds. The same applies to games made in Unity and Unreal Engine - the developer can usually hit the "Play" button and quickly test the gameplay. That is often not the case with development for smaller devices (like mobile or embedded) or larger ones (like servers/cloud).

I think that iteration time - the time after which we can observe the effects of our changes - is critical for developers' work efficiency, as well as their well-being. We programmers should demand better tools - all of us, including low-level C and C++ programmers. We are currently in a good position in the job market, so we can choose the companies and projects we work on. Let's use that and vote with our feet. Decision makers and architects of software/hardware platforms may think that developers are smart, so they can work efficiently even in harsh conditions. They forget that wasting developers' precious time means wasting a lot of money, not to mention the frustration. Creating better tools is an investment that will pay off.

Now, whenever I get a job offer for a developer position, I ask two simple questions:

1. What is the typical iteration time, from the moment I change something in the code, through compilation, deployment, application launch and loading, until I can observe the effect of my change? If the answer is: "Usually it's just a matter of a few seconds. The files you changed are recompiled, then launching the app takes a few seconds and that's it." - that's fine. But if the answer is more like: "Well, the whole project needs to be rebuilt. You don't do it locally. You shelve your changes in Perforce so that the build server picks them up and makes the build. The build is then deployed to the target device, which then needs to reboot and load your app. It takes 15-20 minutes." - then it's a NOPE for me.

2. How do you debug the application? Can you experiment by setting breakpoints and watching variables in a convenient way? If the answer is: "Yes, we have a debugger nicely integrated with Visual Studio/WinDBG/Eclipse/another IDE and we debug whenever we see a problem." - that's fine. But when I hear: "Well, command-line GDB should work with this environment, but to be honest, it's so hard to set up that no one uses it here. We just put debug console prints in the code and recompile whenever we want to make a debugging experiment." - then that's a red light for me.

Comments | #career #tools #philosophy Share

# Vulkan Memory Allocator 2.1.0

Tue 28 Aug 2018

Yesterday I merged the changes to Vulkan Memory Allocator that I've been working on for the past few months into the "master" branch. I consider it a major milestone, so I marked it as version 2.1.0-beta.1. There are many new features, including:

  • Added linear allocation algorithm, accessible for custom pools, that can be used as free-at-once, stack, double stack, or ring buffer.
  • Added feature to record sequence of calls to the library to a file and replay it using dedicated application.
  • Improved support for non-coherent memory.
  • Improved debug features related to detecting incorrect mapped memory usage.
  • Changed format of JSON dump to include more information and allow better coloring in VmaDumpVis.

The release also includes many smaller bug fixes, improvements, and additions. Everything is tested and documented. Still, I call it a "beta" version to encourage you to test it in your project and send me your feedback.
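For example, the new linear allocation algorithm mentioned above is enabled per custom pool, roughly like this (a sketch; the block size and other parameters are made up for illustration, and error handling is omitted - see the library documentation for details):

#include "vk_mem_alloc.h"

// Create a custom pool that uses the linear allocation algorithm,
// suitable e.g. for a ring buffer of per-frame data.
// `allocator` and `memoryTypeIndex` are assumed to be set up already.
VmaPool CreateLinearPool(VmaAllocator allocator, uint32_t memoryTypeIndex)
{
    VmaPoolCreateInfo poolInfo = {};
    poolInfo.memoryTypeIndex = memoryTypeIndex;
    poolInfo.flags = VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT;
    poolInfo.blockSize = 64ull * 1024 * 1024; // One 64 MB block.
    poolInfo.maxBlockCount = 1;               // Keep it to a single block.

    VmaPool pool = VK_NULL_HANDLE;
    vmaCreatePool(allocator, &poolInfo, &pool);
    return pool;
}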

Comments | #vulkan #libraries #productions #graphics Share

# str_view - null-termination-aware string-view class for C++

Sun 19 Aug 2018

tl;dr I've written a small library, which I called "str_view - null-termination-aware string-view class for C++". You can find the code and documentation on GitHub - sawickiap/str_view. Read on for the full story behind it...

Let me disclose my controversial belief: I like the C++ STL. I think that any programming language needs to provide built-in strings and containers to be called modern and suitable for developing large programs. But of course I'm aware that careless use of classes like std::list or std::map makes a program very slow due to the large number of dynamic allocations.

What I value the most is RAII - the concept that memory is automatically freed whenever an object held by value is destroyed. That's why I use std::unique_ptr all over the place in my personal code. Whenever I create and own an array, I use std::vector, but when I just pass it to some other code for reading, I pass a raw pointer and the number of elements - myVec.data() and myVec.size(). Similarly, whenever I own and build a string, I use std::string (or rather std::wstring - I like Unicode), but when I pass it somewhere for reading, I use a raw pointer.

There are multiple ways a string can be passed. One is a pointer to the first character plus the number of characters. Another is a pointer to the first character plus a pointer to one past the last character - a pair of iterators, also called a range. These two can be trivially converted into each other. Of the two, I prefer pointer + length, because I think the number of characters is needed slightly more often than the pointer past the end.

But there is another way of passing strings, common in C and C++ programs - just one pointer to a string that is required to be null-terminated. I think that the null-terminated string is one of the worst and most stupid inventions in computer science. Not only does it limit the set of characters available for the string content by excluding '\0', but it also makes calculating the string length an O(n) operation. It also creates opportunities for security bugs. Still, we have to deal with it, because that's the format most libraries expect.

I came up with an idea for a class that would encapsulate a reference to an externally owned, immutable string, or a piece thereof. Objects of such a class could be used to pass strings to library functions instead of e.g. a pointer to a null-terminated string or a pair of iterators. They can then be queried for length(), indexed to access individual characters, etc., as well as asked for a null-terminated copy using the c_str() method - similar to std::string.

Code like this already exists, e.g. C++17 introduces the class std::string_view. But my implementation has a twist that I'm quite happy with, which made me call my class "null-termination-aware". My str_view class remembers not only the pointer and length of the referenced string, but also the way it was created, so it can avoid unnecessary operations and lazily evaluate only those that are requested, as sketched after the list below.

  • If it was created from a null-terminated string:
    • c_str() trivially returns a pointer to the original string.
    • The length is unknown, so it is calculated upon the first call to length().
  • On the other hand, if it was created from a string that is not null-terminated:
    • The length is explicitly known, so length() trivially returns it.
    • c_str() creates a local, null-terminated copy of the string upon the first call.
If you consider such a class useful in your C++ code, see the GitHub - sawickiap/str_view project for the code (it's just a single header file), documentation, and an extensive set of tests. I share this code for free, under the MIT license. Feel free to contact me if you find any bugs or have any suggestions regarding this library.

Comments | #productions #libraries #c++ Share
