Compute Shader Learning Notes (I) Getting Started

Tags: Getting Started/Shader/Compute Shader/GPU Optimization

img

Preface

Compute Shader is relatively complex and requires certain programming knowledge, graphics knowledge, and GPU-related hardware knowledge to master it well. The study notes are divided into four parts:

  • Get to know Compute Shader and implement some simple effects
  • Draw circles, planet orbits, noise maps, manipulate Meshes, and more
  • Post-processing, particle system
  • Physical simulation, drawing grass
  • Fluid simulation

The main references are as follows:

  • https://www.udemy.com/course/compute-shaders/?couponCode=LEADERSALE24A
  • https://catlikecoding.com/unity/tutorials/basics/compute-shaders/
  • https://medium.com/ericzhan-publication/shader notes-a preliminary exploration of compute-shader-9efeebd579c1
  • https://docs.unity3d.com/Manual/class-ComputeShader.html
  • https://docs.unity3d.com/ScriptReference/ComputeShader.html
  • https://learn.microsoft.com/en-us/windows/win32/api/D3D11/nf-d3d11-id3d11devicecontext-dispatch
  • lygyue:Compute Shader(Very interesting)
  • https://medium.com/@sengallery/unity-compute-shader-basic-understanding-5a99df53cea1
  • https://kylehalladay.com/blog/tutorial/2014/06/27/Compute-Shaders-Are-Nifty.html (too old and outdated)
  • http://www.sunshine2k.de/coding/java/Bresenham/RasterisingLinesCircles.pdf
  • Wang Jiangrong: [Unity] Basic Introduction and Usage of Compute Shader
  • …To be continued

L1 Introduction to Compute Shader

1. Introduction to Compute Shader

Simply put, you can use Compute Shader to calculate a material and then display it through Renderer. It should be noted that Compute Shader can do more than just this.

img
img

You can copy the following two codes and test them.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class AssignTexture : MonoBehaviour
{
    // ComputeShader is used to perform computing tasks on the GPU
    public ComputeShader shader;

    // Texture resolution
    public int texResolution = 256;

    // Renderer component
    private Renderer rend;
    // Render texture
    private RenderTexture outputTexture;
    // Compute shader kernel handle
    private int kernelHandle;

    // Start is called once when the script is started
    void Start()
    {
        // Create a new render texture, specifying width, height, and bit depth (here the bit depth is 0)
        outputTexture = new RenderTexture(texResolution, texResolution, 0);
        // Allow random write
        outputTexture.enableRandomWrite = true;
        // Create a render texture instance
        outputTexture.Create();

        // Get the renderer component of the current object
        rend = GetComponent<Renderer>();
        // Enable the renderer
        rend.enabled = true;

        InitShader();
    }

    private void InitShader()
    {
        // Find the handle of the compute shader kernel "CSMain"
        kernelHandle = shader.FindKernel("CSMain");

        // Set up the texture used in the compute shader
        shader.SetTexture(kernelHandle, "Result", outputTexture);

        // Set the render texture as the material's main texture
        rend.Material.SetTexture("_MainTex", outputTexture);

        // Schedule the execution of the compute shader, passing in the size of the compute group
        // Here it is assumed that each working group is 16x16
        // Simply put, how many groups should be allocated to complete the calculation. Currently, only half of x and y are divided, so only 1/4 of the screen is rendered.
        DispatchShader(texResolution / 16, texResolution / 16);
    }

    private void DispatchShader(int x, int y)
    {
        // Schedule the execution of the compute shader
        // x and y represent the number of calculation groups, 1 represents the number of calculation groups in the z direction (here there is only one)
        shader.Dispatch(kernelHandle, x, y, 1);
    }

    void Update()
    {
        // Check every frame whether there is keyboard input (button U is released)
        if (Input.GetKeyUp(KeyCode.U))
        {
            // If the U key is released, reschedule the compute shader
            DispatchShader(texResolution / 8, texResolution / 8);
        }
    }
}

Unity's default Compute Shader:

// Each #kernel tells which function to compile; you can have many kernels
#Pragmas kernel CSMain

// Create a RenderTexture with enableRandomWrite flag and set it
// with cs.SetTexture
RWTexture2D<float4> Result;

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID) { 
  // TODO: insert actual code here! Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0); 
}

In this example, we can see that a fractal structure called Sierpinski net is drawn in the lower left quarter. This is not important. Unity officials think this graphic is very representative and use it as the default code.

Let's talk about the Compute Shader code in detail. You can refer to the comments for the C# code.

#pragma kernel CSMain This line of code indicates the entry of Compute Shader. You can change the name of CSMain at will.

RWTexture2D Result This line of code is a readable and writable 2D texture. R stands for Read and W stands for Write.

Focus on this line of code:

[numthreads(8,8,1)]

In the Compute Shader file, this line of code specifies the size of a thread group. For example, in this 8 * 8 * 1 thread group, there are 64 threads in total. Each thread calculates a unit of pixels (RWTexture).

In the C# file above, we use shader.Dispatch to specify the number of thread groups.

img
img
img

Next, let's ask a question. If the current thread group is specified as 881, so how many thread groups do we need to render a RWTexture of size res*res?

The answer is: res/8. However, our code currently only calls res/16, so only the 1/4 area in the lower left corner is rendered.

In addition, the parameters passed into the entry function are also worth mentioning: uint3 id: SV_DispatchThreadID This id represents the unique identifier of the current thread.

2. Quarter pattern

Before you learn to walk, you must first learn to crawl. First, specify the task (Kernel) to be performed in C#.

img

Currently we have written it in stone, now we expose a parameter that indicates that different rendering tasks can be performed.

public string kernelName = "CSMain"; ... kernelHandle = shader.FindKernel(kernelName);

In this way, you can modify it at will in the Inspector.

img

However, it is not enough to just put the plate on the table, we need to serve the dish. We cook the dish in the Compute Shader.

Let's set up a few menus first.

#pragma kernel CSMain // 刚刚我们已经声明好了
#pragma kernel SolidRed // 定义一个新的菜,并且在下面写出来就好了
... // 可以写很多
[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID){ ... }
[numthreads(8,8,1)]
void SolidRed (uint3 id : SV_DispatchThreadID){
 Result[id.xy] = float4(1,0,0,0); 
}

You can enable different Kernels by modifying the corresponding names in the Inspector.

img

What if I want to pass data to the Compute Shader? For example, pass the resolution of a material to the Compute Shader.

shader.SetInt("texResolution", texResolution);
img
img

And in the Compute Shader, it must also be declared.

img

Think about a question, how to achieve the following effect?

img
[numthreads(8,8,1)]
void SplitScreen (uint3 id : SV_DispatchThreadID)
{
    int halfRes = texResolution >> 1;
    Result[id.xy] = float4(step(halfRes, id.x),step(halfRes, id.y),0,1);
}

To explain, the step function is actually:

step(edge, x){
    return x>=edge ? 1 : 0;
}

(uint)res >> 1 means that the bits of res are shifted one position to the right. This is equivalent to dividing by 2 (binary content).

This calculation method simply depends on the current thread id.

The thread at the bottom left corner always outputs black because the step return is always 0.

For the lower left thread, id.x > halfRes , so 1 is returned in the red channel.

If you are not convinced, you can do some calculations to help you understand the relationship between thread ID, thread group and thread group group.

img
img

3. Draw a circle

The principle sounds simple. It checks whether (id.x, id.y) is inside the circle. If yes, it outputs 1. Otherwise, it outputs 0. Let's try it.

img
float inCircle( float2 pt, float radius ){
    return ( length(pt)<radius ) ? 1.0 : 0.0;
}

[numthreads(8,8,1)]
void Circle (uint3 id : SV_DispatchThreadID)
{
    int halfRes = texResolution >> 1;
    int isInside = inCircle((float2)((int2)id.xy-halfRes), (float)(halfRes>>1));
    Result[id.xy] = float4(0.0,isInside ,0,1);
}

img

4. Summary/Quiz

If the output is a RWTexture with a side length of 256, which answer will produce a completely red texture?

RWTexture2D<float4> output;

[numthreads(16,16,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
     output[id.xy] = float4(1.0, 0.0, 0.0, 1.0);
}

img

Which answer will give red on the left side of the texture output and yellow on the right side?

img

L2 has begun

1. Passing values to the GPU

img

Without further ado, let's draw a circle. Here are two initial codes.

PassData.cs: https://pastebin.com/PMf4SicK

PassData.compute: https://pastebin.com/WtfUmhk2

The general structure is the same as above. You can see that a drawCircle function is called to draw a circle.

[numthreads(1,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    int2 centre = (texResolution >> 1);
    int radius = 80;
    drawCircle( centre, radius );
}

The circle drawing method used here is a very classic rasterization drawing method. If you are interested in the mathematical principles, you can read http://www.sunshine2k.de/coding/java/Bresenham/RasterisingLinesCircles.pdf. The general idea is to use a symmetric idea to generate.

The difference is that here we use (1,1,1) as the size of a thread group. Call CS on the CPU side:

private void DispatchKernel(int count) { shader.Dispatch(circlesHandle, count, 1, 1); } void Update() { DispatchKernel(1); }

The question is, how many times does a thread execute?

Answer: It is executed only once. Because a thread group has only 111 = 1 thread, and only 1 is called on the CPU side11 = 1 thread group is used for calculation. Therefore, only one thread is used to draw a circle. In other words, one thread can draw an entire RWTexture at a time, instead of one thread drawing one pixel as before.

This also shows that there is an essential difference between Compute Shader and Fragment Shader. Fragment Shader only calculates the color of a single pixel, while Compute Shader can perform more or less arbitrary operations!

img

Back to Unity, if you want to draw a good-looking circle, you need an outline color and a fill color. Pass these two parameters to CS.

float4 clearColor; float4 circleColor;

And add color filling kernel, and modify the Circles kernel. If multiple kernels access a RWTexture at the same time, you can add the shared keyword.

#Pragmas kernel Circles
#Pragmas kernel Clear
    ...
shared RWTexture2D<float4> Result;
    ...
[numthreads(32,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    // int2 center = (texResolution >> 1);
    int2 centre = (int2)(random2((float)id.x) * (float)texResolution);
    int radius = (int)(random((float)id.x) * 30);
    drawCircle( centre, radius );
}

[numthreads(8,8,1)]
void Clear (uint3 id : SV_DispatchThreadID)
{
    Result[id.xy] = clearColor;
}

Get the Clear kernel on the CPU side and pass in the data.

private int circlesHandle; private int clearHandle; ... shader.SetVector( "clearColor", clearColor); shader.SetVector( "circleColor", circleColor); ... private void DispatchKernels(int count) { shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1); shader.Dispatch(circlesHandle, count, 1, 1); } void Update() { DispatchKernels(1); // There are now 32 circles on the screen }

A question, if the code is changed to: DispatchKernels(10), how many circles will there be on the screen?

Answer: 320. Initially, Dispatch is 111=1, a thread group has 3211=32 threads, each thread draws a circle. Elementary school mathematics.

Next, add the _Time variable to make the circle change with time. Since there seems to be no such variable as _time in the Compute Shader, it can only be passed in by the CPU.

On the CPU side, note that variables updated in real time need to be updated before each Dispatch (outputTexture does not need to be updated because this outputTexture actually points to a reference to the GPU texture!):

private void DispatchKernels(int count) { shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1); shader.SetFloat( "time", Time.time); shader.Dispatch(circlesHandle, count, 1, 1) ; }

Compute Shader:

float time; ... void Circles (uint3 id : SV_DispatchThreadID){ ... int2 center = (int2)(random2((float)id.x + time) * (float)texResolution); ... }

Current version code:

  • Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Circle_Time/Assets/Shaders/PassData.compute
  • CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Circle_Time/Assets/Scripts/PassData.cs

But now the circles are very messy. The next step is to use Buffer to make the circles look more regular.

img

At the same time, you don't need to worry about multiple threads trying to write to the same memory location (such as RWTexture) at the same time, which may cause race conditions. The current API will handle this problem well.

2. Use Buffer to pass data to GPU

So far, we have learned how to transfer some simple data from the CPU to the GPU. How do we pass a custom structure?

img

We can use Buffer as a medium, where Buffer is of course stored in the GPU, and the CPU side (C#) only stores its reference. First, declare a structure on the CPU, and then declare the CPU-side reference and the GPU-sideReferences.

struct Circle
{
    public Vector2 origin;
    public Vector2 velocity;
    public float radius;
}
    Circle[] circleData;  // 在CPU上
    ComputeBuffer buffer; // 在GPU上

To get the size information of a thread group, you can do this. The following code only gets the number of threads in the x direction of the circlesHandles thread group, ignoring y and z (because it is assumed that the y and z of the thread group are both 1). And multiply it by the number of allocated thread groups to get the total number of threads.

uint threadGroupSizeX; shader.GetKernelThreadGroupSizes(circlesHandle, out threadGroupSizeX, out _, out _); int total = (int)threadGroupSizeX * count;

Now prepare the data to be passed to the GPU. Here we create circles with the number of threads, circleData[threadNums].

circleData = new Circle[total];
float speed = 100;
float halfSpeed = speed * 0.5f;
float minRadius = 10.0f;
float maxRadius = 30.0f;
float radiusRange = maxRadius - minRadius;
for(int i=0; i<total; i++)
{
    Circle circle = circleData[i];
    circle.origin.x = Random.value * texResolution;
    circle.origin.y = Random.value * texResolution;
    circle.velocity.x = (Random.value * speed) - halfSpeed;
    circle.velocity.y = (Random.value * speed) - halfSpeed;
    circle.radius = Random.value * radiusRange + minRadius;
    circleData[i] = circle;
}

Then accept this Buffer in the Compute Shader. Declare an identical structure (Vector2 and Float2 are the same), and then create a reference to the Buffer.

// Compute Shader struct circle { float2 origin; float2 velocity; float radius; }; StructuredBuffer circlesBuffer;

Note that the StructureBuffer used here is read-only, which is different from the RWStructureBuffer mentioned in the next section.

Back to the CPU side, send the CPU data just prepared to the GPU through the Buffer. First, we need to make clear the size of the Buffer we applied for, that is, how big we want to pass to the GPU. Here, a circle data has two float2 variables and one float variable, a float is 4 bytes (may be different on different platforms, you can use sizeof(float) to determine), and there are circleData.Length pieces of circle data to be passed. circleData.Length indicates how many circle objects the buffer needs to store, and stride defines how many bytes each object's data occupies. After opening up such a large space, use SetData() to fill the data into the buffer, that is, in this step, pass the data to the GPU. Finally, bind the GPU reference where the data is located to the Kernel specified by the Compute Shader.

int stride = (2 + 2 + 1) * 4; //2 floats origin, 2 floats velocity, 1 float radius - 4 bytes per float buffer = new ComputeBuffer(circleData.Length, stride); buffer.SetData(circleData); shader.SetBuffer(circlesHandle, "circlesBuffer", buffer);

So far, we have passed some data prepared by the CPU to the GPU through Buffer.

img

OK, now let’s make use of the data that was transferred to the GPU with great difficulty.

[numthreads(32,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    int2 centre = (int2)(circlesBuffer[id.x].origin + circlesBuffer[id.x].velocity * time);
    while (centre.x>texResolution) centre.x -= texResolution;
    while (centre.x<0) centre.x += texResolution;
    while (centre.y>texResolution) centre.y -= texResolution;
    while (centre.y<0) centre.y += texResolution;
    uint radius = (int)circlesBuffer[id.x].radius;
    drawCircle( centre, radius );
}

You can see that the circle is now moving continuously because our Buffer stores the position of the circle indexed by id.x in the previous frame and the movement status of the circle.

img

To sum up, in this section we learned how to customize a structure (data structure) on the CPU side, pass it to the GPU through a Buffer, and process the data on the GPU.

In the next section, we will learn how to get data from the GPU back to the CPU.

  • Current version code:
  • Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Using_Buffer/Assets/Shaders/BufferJoy.compute
  • CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Using_Buffer/Assets/Scripts/BufferJoy.cs

3. Get data from GPU

As usual, create a Buffer to transfer data from the GPU to the CPU. Define an array on the CPU side to receive the data. Then create the buffer, bind it to the shader, and finally create variables on the CPU ready to receive GPU data.

ComputeBuffer resultBuffer; // Buffer
Vector3[] output;           // CPU接受
...
    //buffer on the gpu in the ram
    resultBuffer = new ComputeBuffer(starCount, sizeof(float) * 3);
    shader.SetBuffer(kernelHandle, "Result", resultBuffer);
    output = new Vector3[starCount];

Compute Shader also accepts such a Buffer. The Buffer here is readable and writable, which means that the Buffer can be modified by Compute Shader. In the previous section, Compute Shader only needs to read the Buffer, so StructuredBuffer is enough. Here we need to use RW.

RWStructuredBuffer Result;

Next, use GetData after Dispatch to receive the data.

shader.Dispatch(kernelHandle, groupSizeX, 1, 1); resultBuffer.GetData(output);
img

The idea is so simple. Now let's try to make a scene where a lot of stars move around the center of the sphere.

The task of calculating the star coordinates is put on the GPU to complete, and finally the calculated position data of each star is obtained, and the object is instantiated in C#.

In Compute Shader, each thread calculates the position of a star and outputs it to the Buffer.

[numthreads(64,1,1)]
void OrbitingStars (uint3 id : SV_DispatchThreadID)
{
    float3 sinDir = normalize(random3(id.x) - 0.5);
    float3 vec = normalize(random3(id.x + 7.1393) - 0.5);
    float3 cosDir = normalize(cross(sinDir, vec));
    float scaledTime = time * 0.5 + random(id.x) * 712.131234;
    float3 pos = sinDir * sin(scaledTime) + cosDir * cos(scaledTime);
    Result[id.x] = pos * 2;
}

Get the calculation result through GetData on the CPU side, and modify the Pos of the corresponding previously instantiated GameObject at any time.

void Update()
{
    shader.SetFloat("time", Time.time);
    shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
    resultBuffer.GetData(output);
    for (int i = 0; i < stars.Length; i++)
        stars[i].localPosition = output[i];
}
img

Current version code:

  • Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_GetData_From_Buffer/Assets/Shaders/OrbitingStars.compute
  • CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_GetData_From_Buffer/Assets/Scripts/OrbitingStars.cs

4. Use noise

Generating a noise map using Compute Shader is very simple and very efficient.

float random (float2 pt, float seed) {
    const float a = 12.9898;
    const float b = 78.233;
    const float c = 43758.543123;
    return frac(sin(seed + dot(pt, float2(a, b))) * c );
}

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    float4 white = 1;
    Result[id.xy] = random(((float2)id.xy)/(float)texResolution, time) * white;
}
img

There is a library to get more various noises. https://pastebin.com/uGhMLKeM

#include "noiseSimplex.cginc" // Paste the code above and named "noiseSimplex.cginc"

...

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    float3 POS = (((float3)id)/(float)texResolution) * 2.0;
    float n = snoise(POS);
    float ring = frac(noiseScale * n);
    float delta = pow(ring, ringScale) + n;

    Result[id.xy] = lerp(darkColor, paleColor, delta);
}

img

5. Deformed Mesh

In this section, we will transform a Cube into a Sphere through Compute Shader, and we will also need an animation process with gradual changes!

img

As usual, declare vertex parameters on the CPU side, then throw them into the GPU for calculation, and apply the calculated new coordinates newPos to the Mesh.

Vertex structure declaration. We attach a constructor to the CPU declaration for convenience. The GPU declaration is similar. Here, we intend to pass two buffers to the GPU, one read-only and the other read-write. At first, the two buffers are the same. As time changes (gradually), the read-write buffer gradually changes, and the Mesh changes from a cube to a ball.

// CPU
public struct Vertex
{
    public Vector3 position;
    public Vector3 normal;
    public Vertex( Vector3 p, Vector3 n )
    {
        position.x = p.x;
        position.y = p.y;
        position.z = p.z;
        normal.x = n.x;
        normal.y = n.y;
        normal.z = n.z;
    }
}
...
Vertex[] vertexArray;
Vertex[] initialArray;
ComputeBuffer vertexBuffer;
ComputeBuffer initialBuffer;
// GPU
struct Vertex {
    float3 position;
    float3 normal;
};
...
RWStructuredBuffer<Vertex>  vertexBuffer;
StructuredBuffer<Vertex>    initialBuffer;

The complete steps of initialization ( Start() function) are as follows:

  1. On the CPU side, initialize the kernel and obtain the Mesh reference
  2. Transfer Mesh data to CPU
  3. Declare the Buffer of Mesh data in GPU
  4. Passing Mesh data and other parameters to the GPU

After completing these operations, every frame Update, we apply the new vertices obtained from the GPU to the mesh.

So how do we implement GPU computing?

It's quite simple, we just need to normalize each vertex in the model space! Imagine that when all vertex position vectors are normalized, the model becomes a sphere.

img

In the actual code, we also need to calculate the normal at the same time. If we don't change the normal, the lighting of the object will be very strange. So the question is, how to calculate the normal? It's very simple. The coordinates of the original vertices of the cube are the final normal vectors of the ball!

img

In order to achieve the "breathing" effect, a sine function is added to control the normalization coefficient.

float delta = (Mathf.Sin(Time.time) + 1)/ 2;

Since the code is a bit long, I'll put a link.

Current version code:

  • Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Mesh_Cube2Sphere/Assets/Shaders/MeshDeform.compute
  • CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Mesh_Cube2Sphere/Assets/Scripts/MeshDeform.cs
img

6. Summary/Quiz

How this structure should be defined on the GPU:

struct Circle { public Vector2 origin; public Vector2 velocity; public float radius; }
img

How should this structure set the size of ComputeBuffer?

struct Circle { public Vector2 origin; public Vector2 velocity; public float radius; }
img

Why is the following code wrong?

StructuredBuffer<float3> positions;
//Inside a kernel
...
positions[id.x] = fixed3(1,0,0);
img

References

https://cmwdexint.com/2017/10/20/indirect-compute-shader

Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

en_USEN