
Free DX 12 Rift Engine Code

lamour42 Posts: 108
Art3mis
edited December 2016 in PC Development
Hi,

If you want to write code for the Rift using DirectX 12, you might want to take
a look at the code I have published on GitHub: https://github.com/ClemensX/ShadedPath12.git

The sample engine is extremely limited in its drawing abilities: it can only draw lines!
But it may serve as a learning playground for DirectX 12 and Rift programming.

I find it fascinating how a bunch of simple lines suddenly becomes great when you can walk around them and view them from any direction while wearing the Rift!

The current state of the code is a first step in porting my older DX 11 engine to DX 12.
Feel free to use any code you like in your own projects.

I want to express my gratitude to galopin, who came up with a detailed 8-step guide on how to combine
DirectX 12 with Oculus SDK rendering. See this thread: https://forums.oculus.com/viewtopic.php?f=20&t=25900
When I found out that calling the Oculus API ovr_CreateSwapTextureSetD3D11 on a
D3D11On12Device throws null-pointer exceptions, I would have given up if he had not given this advice!

Some features of the code example:
  • Engine / Sample separation. Look at Sample1.cpp to see what you can currently do with this engine and see how it is done.
  • Oculus Rift support (head tracking and rendering). See vr.cpp
  • Post Effect Shader: Copy rendered frame to texture - Rift support is built on top of this feature
  • Use Threads to update GPU data. See LinesEffect::update()
  • Synchronize GPU and CPU via fences (see the sketch after this list)
  • Free float camera - use WASD or arrow keys to navigate. Or just walk/turn/duck if you wear the Rift
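
For illustration, here is a minimal sketch of the fence handshake used to synchronize CPU and GPU in DX12. This is not the engine's actual code; it assumes an existing device and commandQueue, and error handling is omitted:

    #include <windows.h>
    #include <d3d12.h>
    #include <wrl.h>
    using Microsoft::WRL::ComPtr;

    // Created once (assumes 'device' and 'commandQueue' already exist).
    ComPtr<ID3D12Fence> fence;
    UINT64 fenceValue = 0;
    HANDLE fenceEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // After submitting the command lists of a frame, signal the fence:
    const UINT64 signalValue = ++fenceValue;
    commandQueue->Signal(fence.Get(), signalValue);

    // Before the CPU reuses that frame's resources (e.g. its command allocator),
    // wait until the GPU has passed the signaled value:
    if (fence->GetCompletedValue() < signalValue)
    {
        fence->SetEventOnCompletion(signalValue, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }

With several frames in flight you keep one fence value per frame and only have to wait when the oldest frame has not finished yet.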

Any feedback welcome.

Comments

  • cybereality Posts: 20,661 Oculus Staff
    Nice.
    AMD Ryzen 7 1800X | MSI X370 Titanium | G.Skill 32GB DDR4 3200 | EVGA SuperNOVA 1000 | Corsair Hydro H110i
    PowerColor RX 480 x2 | Samsung 960 Evo M.2 500GB | Seagate FireCuda SSHD 2TB | Phanteks ENTHOO EVOLV
  • glaze Posts: 43
    Thanks! I'll probably learn from this codebase when adding Rift support into my engine's D3D12 renderer.
  • galopin Posts: 351
    Nexus 6
    Gratitude accepted :)

    If only Oculus could add real support, integrating d3d12 queues and fences, plus manual management of the surface memory...
  • lamour42 Posts: 108
    Art3mis
    [screenshot]

    Added a geometry shader to draw 3D text at any world position. It also allows drawing a coordinate system.

    The text is copied to the GPU together with some positional data; the geometry shader then parses the text and produces all the lines needed to draw the letters. It is intended as a diagnostics tool.

    While it looks lame in the picture above, it is far more impressive when viewed with the Rift. Makes you want to touch the lines.
  • lamour42 Posts: 108
    Art3mis
    [screenshot]

    Added texture support.

    The texture shader uploads DDS image files to the GPU, and a billboard shader draws them at a user-defined world position and size.

    Texture support re-uses the DX12 DDSTextureLoader from Microsoft's MiniEngine, slightly changed to be easier to use outside MiniEngine.
  • blazespinnaker2 Posts: 52
    Virtual Boy (or Girl)
    This looks great. Just curious, have you done any perf testing compared to the tiny room demo?
  • lamour42 Posts: 108
    Art3mis
    No, I didn't compare performance to other examples. But performance is a big topic for me, and the Microsoft tools are very good at showing you the bottlenecks in your code.

    Some remarks regarding performance:
    • I copied the approach of the Microsoft-provided DX12 examples of keeping 3 frames in flight at the same time and synchronizing them with fences. Unfortunately the documentation on this topic lacks depth, so a lot of questions remain unanswered. I found it hard to come up with a system that really runs in parallel and doesn't limit access to your central objects. Certainly a topic that needs to be revisited.
    • Texture preloading. For a small framework like mine I think it is ok, even beneficial, to preload all textures during the startup phase. It is just a lot easier when you know that all textures are already in GPU memory when you start rendering. Not something a big engine for big games could do, but for smaller applications I think it is the right way to go.
    • Threaded approach. I experimented a lot with threads for the 3D text shader. Since it is meant as a diagnostic tool, it doesn't matter if text changes show up in the world a few frames late. So at render time the input buffer that is already present on the GPU is simply reused; only a few bytes with the current view/projection matrix have to be copied to the GPU before rendering can start. A background thread updates the GPU input buffer for all the text and then switches to the new buffer once it is ready. With this approach it doesn't really matter how much text you display (at least not until the text shader on the GPU becomes the bottleneck). In my example I display over 1000 lines of text without seeing any performance degradation at all. I still get several hundred frames per second in a window and a constant 75 fps in the Rift.
    • Rift optimizations. One advantage of starting with a completely new framework is that I do not have to pay attention to existing shader code and more traditional ways of rendering. Usually, in most existing engines, each shader updates its data for each frame in an update method and then renders in a draw method. When you draw for the Rift, you draw the two images for the eyes right after one another. The images are very similar, but obviously not exactly the same, and there is a lot of overhead in going through all the update and drawing code twice. My shaders are designed to avoid as much of this duplicated work as possible: all setup (like input buffers and updating the world positions of your objects) is done only once. When it is time to issue the actual draw call on the GPU, only the corrected model/view/projection matrix for the current eye is copied to the GPU and rendering starts (see the sketch below).
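
    To illustrate the last point, here is a rough sketch of what such a per-eye draw can look like. This is my simplification, not the engine's real code; ComputeEyeView, ComputeEyeProjection, cbMappedPtr and the other names are placeholders:

    // Per-frame setup (input buffers, world positions) was already done once.
    // Only the small per-eye constant is copied before each draw.
    for (int eye = 0; eye < 2; ++eye)
    {
        // placeholder helpers: per-eye view/projection from the tracked head pose
        DirectX::XMMATRIX view = ComputeEyeView(eyePoses[eye]);
        DirectX::XMMATRIX proj = ComputeEyeProjection(eye);
        DirectX::XMMATRIX mvp  = DirectX::XMMatrixTranspose(world * view * proj);

        memcpy(cbMappedPtr[eye], &mvp, sizeof(mvp));   // only a few bytes per eye

        commandList->SetGraphicsRootConstantBufferView(0, cbGpuAddress[eye]);
        commandList->RSSetViewports(1, &eyeViewport[eye]);
        commandList->DrawIndexedInstanced(indexCount, instanceCount, 0, 0, 0);
    }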
  • galopin Posts: 351
    Nexus 6
    Overhead to render in DX12? That is because you do not think GPU :)

    Yes, pushing on the CPU can be light speed compared to DX11 when you use it in the old CPU-fashioned way, but the real strength is doing things differently, more streamlined for the GPU.

    I broke my Oculus mode right now, but the screenshots still show 256K objects with per-instance textures, in a couple of dispatches plus one ExecuteIndirect that contains 4096 draws when no culling and no occlusion is performed (to emulate a collection of different objects; it should really be one ExecuteIndirect per PSO; my commands are made of one index buffer, two vertex buffers and a draw-instanced call), and a few draw calls for text, debug draws, GPU timers and blits.

    The CPU cost of the app right now is near zero. If I look at the sky, I am still GPU bound at 0.8 ms, of which 0.6 ms is the draw indirect (it should be zero, but the GPU is not able to reclaim performance when the count buffer value is smaller than the max argument count; nvidia needs to fix that, and AMD is just plain broken right now on ExecuteIndirect, no kidding). Imagine you culled 256K objects on the CPU: even with the right hierarchical structure, you are way behind that.

    In a real app, because of DX12 bindless, the number of real CPU-unique draw calls is lower than it could have been on DX11. For a stereo render you can imagine a lot of techniques; mine is to double the groups in the ExecuteIndirect, add an extra root constant to the command signature that says left/right, and use that to pick the proper view/projection/viewport matrix plus an extra clip plane between the two fake viewports (because VPAndRTArrayIndexFromAnyShaderFeedingRasterizerSupportedWithoutGSEmulation is false on nvidia, or it would be even simpler, just a semantic to output from the VS).
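
    For reference, a rough sketch (my own reconstruction, not galopin's actual code) of such a command signature: a root constant for the eye index, one index buffer, two vertex buffers and the indexed draw arguments. IndirectCommand is an assumed CPU-side struct mirroring the argument layout, and the root parameter slot is an assumption:

    D3D12_INDIRECT_ARGUMENT_DESC args[5] = {};
    args[0].Type = D3D12_INDIRECT_ARGUMENT_TYPE_CONSTANT;        // eye index (left/right)
    args[0].Constant.RootParameterIndex = 1;                     // assumed root slot
    args[0].Constant.DestOffsetIn32BitValues = 0;
    args[0].Constant.Num32BitValuesToSet = 1;
    args[1].Type = D3D12_INDIRECT_ARGUMENT_TYPE_INDEX_BUFFER_VIEW;
    args[2].Type = D3D12_INDIRECT_ARGUMENT_TYPE_VERTEX_BUFFER_VIEW;
    args[2].VertexBuffer.Slot = 0;
    args[3].Type = D3D12_INDIRECT_ARGUMENT_TYPE_VERTEX_BUFFER_VIEW;
    args[3].VertexBuffer.Slot = 1;
    args[4].Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride = sizeof(IndirectCommand);   // assumed CPU struct mirroring the arguments
    desc.NumArgumentDescs = _countof(args);
    desc.pArgumentDescs = args;

    Microsoft::WRL::ComPtr<ID3D12CommandSignature> commandSignature;
    device->CreateCommandSignature(&desc, rootSignature.Get(),
                                   IID_PPV_ARGS(&commandSignature));

    // One call then submits up to maxCommandCount draws:
    // commandList->ExecuteIndirect(commandSignature.Get(), maxCommandCount,
    //                              argumentBuffer.Get(), 0, countBuffer.Get(), 0);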

    The culling/occlusion pass is the small red bar in the top-left part of the screen. The red part on the right is the blit to the back buffer prior to text and GPU timers, and the purple one just before it is the depth-buffer pyramid used for occlusion in the next frame. Most of my stuff is still rough and not optimal, and of course, you do not want to know how many millions of triangles are in these screenshots :)

    full sized images

    Only frustum culling:
    [screenshot]

    With occlusion culling:
    [screenshot]

    What was hidden:
    [screenshot]

    The striped grey bar shows the milliseconds.
  • lamour42 Posts: 108
    Art3mis
    Yes, you get the feeling that once your stuff is on the GPU the speed is limitless. Here we are looking at 1 million billboard textures from inside the Rift, at a totally constant 75 FPS.

    [screenshot]
  • cybereality Posts: 20,661 Oculus Staff
    Wow!
  • galopin Posts: 351
    Nexus 6
    Wow!

    All this just means that the GPU is again likely to be the bottleneck of an application, and this is great. With a faster CPU side, sending two views instead of one is less of an issue, latency is also reduced, and the risk of missing a frame deadline is lower.

    Even if GPUs are damn fast and offloading part of what used to be done on the CPU can be a clear net win, it still means you take a little from your GPU for that. If you are short on GPU, it can still be better to stay on the CPU side for things like culling and occlusion, or to use a hybrid approach.

    But yes, it is possible to render an entire environment made of hundreds of different meshes with hundreds of materials in a single draw call. That is because you can have all the information you need (object positions, textures, material properties) visible to the GPU directly. You can have thousands of textures bound at the same time and pick whichever one you need based on a material id retrieved from the mesh instance data or whatever.
  • lamour42 Posts: 108
    Art3mis
    Changed the billboard vertex shader so that all of the million billboards face the camera at all times. No change in FPS.
    I think it's a nice Rift demo to fly through all the images and see them turn towards you.

    Also, I made a pre-release version on GitHub that includes all the textures needed to run the demo.

    [screenshot]
  • cybereality Posts: 20,661 Oculus Staff
    That's amazing!

    How is framerate and on what machine?

    This gives me some ideas. Maybe I will revive my engine project and update to DX12.
  • lamour42 Posts: 108
    Art3mis
    cybereality wrote:
    That's amazing!

    How is framerate and on what machine?

    This gives me some ideas. Maybe I will revive my engine project and update to DX12.

    My machine is this:
    • 6GB NVIDIA GeForce GTX 980 Ti
    • Intel Core i7-6700K
    • 16 GB RAM
    • Oculus Rift DK 2
    In single window mode I get around 90 frames per second.

    In VR mode with the DK2 I get a constant 75 frames per second. The framerate in VR stays at 75 up to around 2 million billboards; beyond that it slowly begins to drop. At 4 million billboards the framerate is at 45.
  • galopin Posts: 351
    Nexus 6
    FPS is the worst indicator of performance. You need to think in milliseconds, which is more natural for performance work, and to cut things up logically: a frame has different elements, some variable, like the scene geometry, and some with a fixed cost, like various blits and things such as the Oculus Rift presentation warp.

    With DX12 it is easier to measure this stuff. You can see bars on my screenshot in the previous post; they represent GPU timings in my application.

    Here is how to do it (a rough C++ sketch of the setup follows at the end of this post):

    1. Prevent the GPU from idling and keep timestamps consistent across command lists (they are always consistent within a single command list), and query the timestamp frequency. Optionally, get a synchronized pair of GPU and CPU values to obtain an accurate delta between them:
    ID3D12Device::SetStablePowerState
    ID3D12CommandQueue::GetTimestampFrequency
    ID3D12CommandQueue::GetClockCalibration
    

    2. Create a query heap of timestamps plus a regular buffer to resolve them into, to be used in a shader:
    ID3D12Device::CreateQueryHeap
    

    3. In your frame, generate timestamps. Create a class that keeps track of the hierarchy and remembers where the begin and end are stored in the timestamp heap. Optionally, you can use BeginEvent( 0, wstr, (wcslen(wstr)+1)*2 )/EndEvent on the command list or queue; it will organize a VSGD or Nsight capture for debugging purposes.
    ID3D12GraphicsCommandList::EndQuery
    

    4. Copy the timestamps to the buffer
    ID3D12GraphicsCommandList::ResolveQueryData
    

    5. Optionally, copy the data back to a readback buffer for CPU access.

    6. Write a shader that reads the values and displays bars. Here is mine, free of charge. I use a per-instance vertex buffer (4 vertices per quad) that contains begin, end and depth plus a color; the final vertices are derived from the timestamps and the vertex id:
    #define MyRS1 \
    "RootFlags( ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT | DENY_PIXEL_SHADER_ROOT_ACCESS ), " \
    "RootConstants(num32BitConstants=7, b0, visibility=SHADER_VISIBILITY_VERTEX)," \
    "SRV(t0)" \
    ""
    
    struct VSPS {
    	float4 pos : SV_Position;
    	float4 color : COLOR;
    	float ms : MS;
    	uint isMain : MAIN;
    };
    
    struct RootConstants
    {
    	float rtWidth;
    	float rtHeight;
    	float barHeight;
    	uint first;
    	uint last;
    	uint freq;
    	float fpsWidth;
    };
    
    ConstantBuffer<RootConstants> root : register(b0);
    
    struct Stamp {
    	uint high;
    	uint low;
    };
    StructuredBuffer<Stamp> stamps : register(t0);
    
    struct IA {
    	uint3 location : BARPOS;
    	float4 color : BARCOLOR;
    	uint subVert : SV_VertexID;
    	uint instId : SV_InstanceID;
    };
    
    uint DiffStamp(uint end, uint beg) {
    	Stamp a = stamps[beg];
    	Stamp b = stamps[end];
    	if(a.low==b.low)
    		return b.high - a.high;
    	else
    		return uint(0xffffffff) - a.high + b.high + 1u;
    }
    [RootSignature(MyRS1)]
    void main(IA input, out VSPS output) {
    	uint frameLen = DiffStamp(root.last,root.first);
    
    	float width;
    	if ( root.fpsWidth > 0.f)
    		width = root.rtWidth/ (float(root.freq) / root.fpsWidth);
    	else
    		width = root.rtWidth / float(frameLen);
    	float height = root.barHeight;
    
    	float startx = float(DiffStamp(input.location.x,root.first));
    	float endx = float(DiffStamp(input.location.y,root.first));
    
    	float starty = (height + 2) * float(input.location.z + 1);
    	float endy = starty + height;
    	float y = (input.subVert & 1) ? endy : starty;
    	float x = input.subVert < 2 ? startx : endx;
    
    	output.ms = x * 1000.f / float(root.freq);
    	output.isMain = input.location.z == 0;
    	x *= width;
    	float2 pos = float2(x, y);
    
    	pos /= float2(root.rtWidth, root.rtHeight);
    	pos.y = 1 - pos.y;
    	pos *= 2;
    	pos -= 1;
    
    	output.pos = float4(pos, 0, 1);
    	output.color = input.color;
    }
    
    [RootSignature(MyRS1)]
    float4 main(in VSPS input) : SV_TARGET {
    	float l = 1.f;
    	if (input.isMain && (int(trunc(input.ms))&1))
    			l = 0.2f;
    	return float4(input.color.rgb * l, input.color.a);
    }
    

    Because the timestamps are stable, you can even collect, if you want, the gap between before Present and after. I only display the current frame right now, but I could display the N past frames, or also display the same hierarchy from the CPU point of view and visualize the latency between a CPU operation and when it happens on the GPU.
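
    For steps 1 to 5, here is a bare-bones C++ sketch (my own simplification, not galopin's code). It resolves directly into a readback buffer for CPU access instead of a shader-visible buffer, and buffer creation plus error handling are omitted:

    // 1. Keep the GPU clock stable and get the tick frequency (ticks per second).
    device->SetStablePowerState(TRUE);            // needs developer mode enabled
    UINT64 gpuFrequency = 0;
    commandQueue->GetTimestampFrequency(&gpuFrequency);

    // 2. A query heap for two timestamps plus a readback buffer to resolve into.
    D3D12_QUERY_HEAP_DESC heapDesc = {};
    heapDesc.Type = D3D12_QUERY_HEAP_TYPE_TIMESTAMP;
    heapDesc.Count = 2;
    Microsoft::WRL::ComPtr<ID3D12QueryHeap> queryHeap;
    device->CreateQueryHeap(&heapDesc, IID_PPV_ARGS(&queryHeap));
    // readbackBuffer: a 2 * sizeof(UINT64) buffer on a READBACK heap (creation omitted)

    // 3. Surround the work you want to measure with timestamps.
    commandList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 0);
    // ... record the draws you want to measure ...
    commandList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 1);

    // 4./5. Resolve both stamps into the readback buffer.
    commandList->ResolveQueryData(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP,
                                  0, 2, readbackBuffer.Get(), 0);

    // After the GPU has finished the frame, map and convert to milliseconds.
    UINT64* stamps = nullptr;
    D3D12_RANGE readRange = { 0, 2 * sizeof(UINT64) };
    readbackBuffer->Map(0, &readRange, reinterpret_cast<void**>(&stamps));
    double ms = double(stamps[1] - stamps[0]) * 1000.0 / double(gpuFrequency);
    readbackBuffer->Unmap(0, nullptr);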
  • lamour42 Posts: 108
    Art3mis
    Hi galopin,

    I would disagree. Looking at milliseconds and at exactly which call took how long is of course very important for the developer, but for the end user FPS is the ultimate test. Anything below 75 FPS (for DK2) just feels very bad, much worse than some missing object or other glitch in the rendered world.

    I like your innovative approach of displaying performance data inside your engine. I have also done my share of diagnostic tools that display something directly in VR. For exact timings, however, I would recommend just using the capabilities of Visual Studio 2015. There is not much about performance measurement that they don't provide, right down to how many nanoseconds each and every GPU call took. The CPU measurements can also help you find bottlenecks in your code very quickly; simple things like iterating over a std::vector element list without using references are revealed very easily.
  • galopin Posts: 351
    Nexus 6
    I am not talking about the end user but the dev team. And no, Visual Studio is not enough, and no, the performance burden does not fall only on programmers with Visual Studio.

    Without in-game perf tools, how could an artist or designer understand the impact of things and find out what is important to optimise first? They can't. What is the shadow map budget, how expensive is the deferred light here, why does my fps drop by 10 when I do that, oh, it is because a new postprocess went nuts…

    The fps is a function of everything inside, and you can never have enough perf tools (good ones, I mean) to analyse things. The right tool for the right problem. And interactive real-time feedback is a must for GPU frames.
  • glaze Posts: 43
    cybereality wrote:
    That's amazing!

    Maybe I will revive my engine project and update to DX12.

    I liked your engine blog posts.
  • cybereality Posts: 20,661 Oculus Staff
    glaze wrote:
    That's amazing!

    Maybe I will revive my engine project and update to DX12.

    I liked your engine blog posts.

    Thanks. I'm glad someone appreciated them.
  • galopin Posts: 351
    Nexus 6
    glaze wrote:
    That's amazing!

    Maybe I will revive my engine project and update to DX12.

    I liked your engine blog posts.

    Thanks. I'm glad someone appreciated them.

    A URL?
  • cybereality Posts: 20,661 Oculus Staff
    It's on my blog ( http://www.cybereality.com ).

    Just click the three-line icon in the top right corner to access the 3D engine series.

    Honestly, most of it was just my thoughts on development (not a lot of code), but I am considering doing some more posts with pieces of code, depending on how I feel about reviving the project.
  • lamour42 Posts: 108
    Art3mis
    Bone animation is in. See the ObjectViewer app.
    Be warned, however, that until I provide documentation and tools for mesh creation you will have a hard time creating your own animated objects.
    If you are interested, my content creation chain is this:

    Blender --> Collada export --> parse the Collada XML with Java and produce a custom binary .b format --> the engine reads the .b files at runtime

    Still, if you are interested in looking at code that does CPU-side bone animation you might want to take a look. Of course, GPU-based animation is the ultimate goal, but that will come (much) later.
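
    For readers who just want the gist of CPU-side skinning, here is a generic matrix-palette sketch. It is not the engine's actual .b data structures, and it assumes each joint matrix already includes the inverse bind pose:

    #include <DirectXMath.h>
    #include <cstdint>
    #include <vector>
    using namespace DirectX;

    struct SkinnedVertex {
        XMFLOAT3 position;
        uint32_t jointIndex[4];   // up to 4 influencing joints
        float    weight[4];       // weights sum to 1
    };

    // jointMatrices[i] = animated joint transform * inverse bind pose of joint i
    void SkinVertices(const std::vector<SkinnedVertex>& in,
                      const std::vector<XMMATRIX>& jointMatrices,
                      std::vector<XMFLOAT3>& out)
    {
        out.resize(in.size());
        for (size_t v = 0; v < in.size(); ++v) {
            XMVECTOR p = XMLoadFloat3(&in[v].position);
            XMVECTOR skinned = XMVectorZero();
            for (int j = 0; j < 4; ++j) {
                if (in[v].weight[j] == 0.0f) continue;
                XMVECTOR tp = XMVector3Transform(p, jointMatrices[in[v].jointIndex[j]]);
                skinned = XMVectorAdd(skinned, XMVectorScale(tp, in[v].weight[j]));
            }
            XMStoreFloat3(&out[v], skinned);
        }
    }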

    Here is the simplest example you can get: a single joint.
    [screenshot]

    An animated worm:
    [screenshot]
  • cybereality Posts: 20,661 Oculus Staff
    Awesome!
  • lamour42 Posts: 108
    Art3mis
    cybereality wrote:
    Awesome!

    Thanks Cyber! It really helps to get some encouraging words along the way! :)

    In the meantime I added ambient, directional and point lights to the engine, as well as support for background music and directional sound.

    Although there are many, many things I would like to add and enhance (like shadows and terrain rendering), I think there is now enough functionality available to try for more entertaining demos. I will do exactly that, and along the way certainly fix and enhance the engine while building the demo.
  • cybereality Posts: 20,661 Oculus Staff
    @lamour42: Was there any trick to getting the 1 million objects running? I finally got something somewhat working but performance seems worse than with DX11. Even with just around 2,000 cubes, the performance is tanking. The code is really hacked together at this point, so I'm probably doing some silly stuff, but maybe you have some advice.
  • galopin Posts: 351
    Nexus 6
    cybereality wrote:
    @lamour42: Was there any trick to getting the 1 million objects running? I finally got something somewhat working but performance seems worse than with DX11. Even with just around 2,000 cubes, the performance is tanking. The code is really hacked together at this point, so I'm probably doing some silly stuff, but maybe you have some advice.

    There is no real trick to being fast; it should come naturally. You really need to put in a lot of effort to be slower than DX11. One possible cause is trying to recycle an allocator or command list that is still in use by the GPU and fence-waiting for completion. Doing so would create bad idle bubbles.

    I will put my sample online at some point too, but for the moment it is my platform for reporting bugs to nvidia/amd/microsoft. Things like Xbox One code can't go public and will have to be stripped out too :)

    For nvidia, I highly recommend the 364.xx driver; they fixed some very bad memory corruption in the d3d12 drivers :)
  • lamour42 Posts: 108
    Art3mis
    Hi,

    I disagree somewhat with galopin here. I find it very easy to be slower with DX12 than with DX11. That is because the driver layer is much thinner, and things that may have been done in parallel with DX11 won't automatically be parallel in DX12.

    To be fast you should make sure that everything large is already on the GPU for processing, so that only minimal data, like the WorldViewProjection matrix and maybe some parameters, needs to be transferred to the GPU before the draw call.

    Or, if you have to transmit larger amounts of data, e.g. vertex data, you have to make sure this runs in its own thread so that everything else does not have to wait.

    And of course C++ comes into play big time. At one point I had very bad performance just because I iterated over my vertex data with an auto loop over a vector, forgetting to use a referenced auto variable (see the sketch below).
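
    For illustration, the difference is a single character (Vertex is a hypothetical type, not one from the engine):

    #include <vector>

    struct Vertex { float pos[3]; float uv[2]; };   // hypothetical vertex layout

    void TouchVertices(std::vector<Vertex>& vertices) {
        // Slow: 'auto v' copies every Vertex, and only the copy gets modified.
        for (auto v : vertices) { v.pos[1] += 1.0f; }

        // Fast and correct: 'auto& v' works on the element in place.
        for (auto& v : vertices) { v.pos[1] += 1.0f; }
    }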

    To find performance bottlenecks I strongly recommend trying the diagnostic features of Visual Studio. They give a very detailed look at everything that goes on in the CPU and GPU.
  • galopin Posts: 351
    Nexus 6
    Your C++ mistake is irrelevant to DX11 vs. DX12 here :) My point is that there is no more hidden cost in the API, no black magic in the Present black box, and there is not a single costly call on the rendering side of things (minus multiple SetDescriptorHeaps on intel inside a command list…); all the costly things are on the creation side, which is irrelevant to a main loop.

    Most good practices from DX11 are still relevant, like keeping big things prepared once and for good in local memory and updating as little as possible, since the GPU has slow bandwidth from main RAM. It is true that DX12 allows multithreaded feeding of command lists, something that deferred contexts in DX11 failed to provide improvements for. But even in raw processing power, a single command-list feed, similar to a DX11 implementation, is still far faster. The fully set-up PSO allows a brainless driver now.

    So unless you explicitly wait on a fence every frame and kill the CPU/GPU parallelism, you cannot be slow!

    EDIT: haha "intel inside"…
  • terranwolf Posts: 14
    Virtual Boy (or Girl)
    Happy to see engine stuff here. Keep it up! :P
    VR Dev, Game Programmer