  • Unity Tessellation Explained

    Unity Tessellation

    Tags: Getting Started/Shader/Tessellation Shader/Displacement Map/LOD/Smooth Outline/Early Culling

    The word tessellation refers to a broad category of design activities, usually involving the arrangement of tiles of various geometric shapes next to each other to form a pattern on a flat surface. Its purpose can be artistic or practical, and many examples date back thousands of years. — Tessellation, Wikipedia, accessed July 2020.

    This article mainly refers to:

    In game development, tessellation usually subdivides a flat triangle (or quad) and then displaces the new vertices, either with a displacement map or with the Phong tessellation or PN triangles algorithms implemented in this article.

    Phong tessellation needs no adjacency (topology) information and relies only on interpolation, which makes it cheaper than PN triangles and similar algorithms. Loop and Schaefer, mentioned in GAMES101, approximate Catmull-Clark surfaces with low-degree quad patches. In those methods each input polygon is replaced by a polynomial patch, whereas the Phong tessellation in this article needs no extra step to correct additional geometric patches.

    1. Overview of the tessellation process

    This chapter introduces the process of surface subdivision in the rendering pipeline.

    The tessellation stages sit after the vertex shader and consist of three steps: Hull, Tessellator and Domain. Of these, the Tessellator is not programmable.

    The first step is the Hull Stage (known in OpenGL as the Tessellation Control Shader, TCS), which outputs control points and tessellation factors. This stage consists of two functions that run in parallel: the Hull Function and the Patch Constant Function.

    Both functions receive a patch, which is a set of vertex indices. One patch describes one primitive; for example, a triangle patch is made up of three vertex indices.

    The Hull Function is executed once per control point (vertex), and the Patch Constant Function is executed once per patch. The former outputs the (possibly modified) control point data, usually the vertex position and optionally normals, texture coordinates and other attributes. The latter outputs constant data for the whole patch, namely the tessellation factors, which tell the next stage (the tessellator) how to subdivide the patch.

    In general, the Hull Function modifies each control point, while the Patch Constant Function determines the level of subdivision based on the distance from the camera.

    Next comes the non-programmable stage, the tessellator. It receives the patch and the tessellation factors just computed, generates the new vertices, and outputs a barycentric coordinate for each of them.

    The last step is the Domain Stage (known in OpenGL as the Tessellation Evaluation Shader, TES), which is programmable. It consists of the domain function, which is executed once per vertex of the tessellated mesh. It receives the barycentric coordinates together with the outputs of the two Hull Stage functions. Most of the logic is written here. Above all, this is where you can reposition the vertices, which is the whole point of tessellation.

    If there is a geometry shader, it runs after the Domain Stage. Otherwise the pipeline proceeds to rasterization.

    In summary, the vertex shader runs first. The Hull Stage takes the vertex data and decides how to subdivide the mesh. The tessellator then generates the subdivided mesh, and finally the Domain Stage outputs the vertices that are rasterized for the fragment shader.

    2. Surface subdivision analysis

    This chapter contains code analysis of Unity's surface subdivision, practical example effects display and an overview of the underlying principles.

    2.1 Key code analysis

    2.1.1 Basic settings of Unity tessellation

    First of all, the tessellation shader needs to use shader target 5.0.

    #pragma target 5.0 // 5.0 required for tessellation
    #pragma vertex Vertex
    #pragma hull Hull
    #pragma domain Domain
    #pragma fragment Fragment

    2.1.2 Hull Stage Code 1 – Hull Function

    In the classic setup, the vertex shader transforms the position and normal into world space and passes the result to the Hull Stage. Note that, unlike a regular vertex shader output, the control point position is tagged with the INTERNALTESSPOS semantic instead of POSITION. The reason is that these positions are not handed to the rasterizer; they are only consumed by the tessellation stages. The distinct semantic also makes the intent clearer to developers.

    struct Attributes {
        float3 positionOS : POSITION;
        float3 normalOS : NORMAL;
        UNITY_VERTEX_INPUT_INSTANCE_ID // Needed by the instancing macros below
    };

    struct TessellationControlPoint {
        float3 positionWS : INTERNALTESSPOS;
        float3 normalWS : NORMAL;
        UNITY_VERTEX_INPUT_INSTANCE_ID
    };

    TessellationControlPoint Vertex(Attributes input) {
        TessellationControlPoint output;
        UNITY_SETUP_INSTANCE_ID(input);
        UNITY_TRANSFER_INSTANCE_ID(input, output);
        // URP helpers that transform the object-space position and normal
        VertexPositionInputs posnInputs = GetVertexPositionInputs(input.positionOS);
        VertexNormalInputs normalInputs = GetVertexNormalInputs(input.normalOS);
        output.positionWS = posnInputs.positionWS;
        output.normalWS = normalInputs.normalWS;
        return output;
    }

    Below are some setting parameters for the Hull Shader.

    The first line, domain, defines the domain type of the tessellation shader; tri means that both input and output are triangle primitives. The options include tri (triangle) and quad (quadrilateral).

    The second line, outputcontrolpoints, gives the number of output control points; 3 corresponds to the three vertices of the triangle.

    The third line, outputtopology, gives the topology of the primitives after subdivision. The options are triangle_cw (triangles wound clockwise), triangle_ccw (triangles wound counterclockwise) and line (line segments). Choosing the correct winding order ensures that the surface faces outward.

    The fourth line, patchconstantfunc, names the other function of the Hull Stage, which outputs per-patch constant data such as the tessellation factors. It is executed only once per patch.

    The fifth line, partitioning, specifies how the additional vertices are distributed along the edges of the original patch. The options are integer, fractional_even and fractional_odd. Choosing a suitable mode makes the subdivision smoother and more uniform.

    The sixth line, maxtessfactor, sets the maximum tessellation factor. Capping it keeps the rendering cost under control.
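
    The six attributes described above could be written as follows. This is a minimal sketch: the values for domain, outputcontrolpoints, outputtopology and partitioning are the ones used later in this article, PatchConstantFunction is the function defined in 2.1.3, and the cap of 64.0 for maxtessfactor is an assumed value, not one taken from the original code.

    ```hlsl
    // Attributes sit directly above the hull function they configure.
    // TessellationControlPoint is the struct defined in 2.1.2.
    [domain("tri")]                                // Patches are triangles
    [outputcontrolpoints(3)]                       // One control point per triangle corner
    [outputtopology("triangle_cw")]                // Emit clockwise-wound triangles
    [patchconstantfunc("PatchConstantFunction")]   // Name of the per-patch function
    [partitioning("integer")]                      // Or fractional_even / fractional_odd
    [maxtessfactor(64.0)]                          // Assumed cap; adjust to taste
    TessellationControlPoint Hull(
        InputPatch<TessellationControlPoint, 3> patch,
        uint id : SV_OutputControlPointID) {
        // Pass the control point through unchanged
        return patch[id];
    }
    ```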


    The Hull Function is invoked once per control point, so it runs as many times as there are control points. To find out which control point is being processed, we use the parameter id, tagged with the SV_OutputControlPointID semantic. The function also receives a special InputPatch structure, which gives array-like access to every control point in the patch.

    TessellationControlPoint Hull(
        InputPatch<TessellationControlPoint, 3> patch, uint id : SV_OutputControlPointID) {
        // Hull shader code here; this version passes the control point through unchanged
        return patch[id];
    }

    2.1.3 Hull Stage Code 2 – Patch Constant Function

    In addition to the Hull Shader, there is another function in the Hull Stage that runs in parallel, the patch constant function. The signature of this function is relatively simple. It inputs a patch and outputs the calculated subdivision factor. The output structure contains the tessellation factor specified for each edge of the triangle. These factors are identified by the special system value semantics SV_TessFactor. Each tessellation factor defines how many small segments the corresponding edge should be subdivided into, thereby affecting the density and details of the resulting mesh. Let's take a closer look at what this factor specifically contains.

    struct TessellationFactors {
        float edge[3] : SV_TessFactor;
        float inside : SV_InsideTessFactor;
    };

    // The patch constant function runs once per triangle, or "patch"
    // It runs in parallel to the hull function
    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        UNITY_SETUP_INSTANCE_ID(patch[0]); // Set up instancing
        // Calculate tessellation factors
        TessellationFactors f;
        f.edge[0] = _FactorEdge1.x;
        f.edge[1] = _FactorEdge1.y;
        f.edge[2] = _FactorEdge1.z;
        f.inside = _FactorInside;
        return f;
    }

    First, the TessellationFactors structure holds the edge tessellation factors edge[3], tagged SV_TessFactor. When triangles are the tessellation primitive, each edge lies opposite the vertex with the same index: edge 0 runs between vertex 1 and vertex 2, edge 1 between vertex 2 and vertex 0, and edge 2 between vertex 0 and vertex 1. An easy way to remember this is that an edge has the same index as the one vertex it does not touch, which helps when matching edges to vertices in shader code.

    There is also an inside tessellation factor, inside, tagged SV_InsideTessFactor. It controls the subdivision density in the interior of the triangle, that is, how the interior is further divided into smaller triangles, whereas the edge factors control how many segments each original edge is split into.

    Patch Constant Function can also output other useful data, but it must be labeled with the correct semantics. For example, BEZIERPOS semantics is very useful and can represent float3 data. This semantics will be used later to output the control points of the smoothing algorithm based on the Bezier curve.

    2.1.4 Domain Stage Code

    Next, we enter the Domain Stage. The Domain Function also carries a domain attribute, which must match the domain of the Hull Function; in this example it is tri. The function takes the patch from the Hull Function, the output of the Patch Constant Function and, most importantly, the barycentric coordinates of the vertex. Its output structure closely resembles that of a vertex shader: it contains the clip-space position and the lighting data required by the fragment shader.

    Don't worry if its purpose is not clear yet. Read Chapter 4 of this article and then come back to it.

    Simply put, each new vertex that is subdivided will run this domain function.

    struct Interpolators {
        float3 normalWS                 : TEXCOORD0;
        float3 positionWS               : TEXCOORD1;
        float4 positionCS               : SV_POSITION;
        UNITY_VERTEX_INPUT_INSTANCE_ID
        UNITY_VERTEX_OUTPUT_STEREO
    };

    // Call this macro to interpolate between a triangle patch, passing the field name
    #define BARYCENTRIC_INTERPOLATE(fieldName) \
            patch[0].fieldName * barycentricCoordinates.x + \
            patch[1].fieldName * barycentricCoordinates.y + \
            patch[2].fieldName * barycentricCoordinates.z

    // The domain function runs once per vertex in the final, tessellated mesh
    // Use it to reposition vertices and prepare for the fragment stage
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, // The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
        Interpolators output;
        // Setup instancing and stereo support (for VR)
        UNITY_SETUP_INSTANCE_ID(patch[0]);
        UNITY_TRANSFER_INSTANCE_ID(patch[0], output);
        UNITY_INITIALIZE_VERTEX_OUTPUT_STEREO(output);
        float3 positionWS = BARYCENTRIC_INTERPOLATE(positionWS);
        float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);
        output.positionCS = TransformWorldToHClip(positionWS);
        output.normalWS = normalWS;
        output.positionWS = positionWS;
        return output;
    }

    In this function, Unity gives us the tessellation factors, the three control points of the patch and the barycentric coordinates of the new vertex. We can use this data for displacement and similar processing.
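
    As a minimal sketch of the displacement mentioned here, a new vertex can be pushed along its interpolated normal. The helper name DisplaceAlongNormal and the parameters height and heightScale are assumptions made for illustration; they do not come from the original code.

    ```hlsl
    // Hypothetical helper: moves a world-space position along its normal.
    // height is expected in [0, 1] (e.g. read from a displacement map),
    // heightScale is the maximum offset in world units.
    float3 DisplaceAlongNormal(float3 positionWS, float3 normalWS, float height, float heightScale) {
        // Interpolated normals are generally not unit length, so renormalize first
        float3 n = normalize(normalWS);
        return positionWS + n * (height * heightScale);
    }
    ```

    In the Domain function this would be called on the interpolated positionWS before TransformWorldToHClip. If height comes from a texture, it has to be sampled with an explicit mip level, because screen-space derivatives are not available outside the fragment stage.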

    2.2 Detailed explanation of subdivision factors and division modes

    Copy the code from the linked source, create a material for it and turn on wireframe mode. So far we only output vertices for the mesh and do nothing in the fragment shader, so it looks transparent.

    If any component of the edge factors is set to 0 or less, the mesh disappears completely. The following figure shows what that looks like (with the Unity editor's selection outline turned on). This behavior is very important: it is what the patch culling in Chapter 3 relies on.

    2.2.1 Overview of subdivision factors

    Put bluntly, once the Hull Stage has set these factors (the edge factors and the inside factor), the tessellator stage simply turns them into barycentric coordinates for the new vertices. This assumes the tri domain throughout; for quad the tessellator works in uv coordinates instead, which may be more complicated and is not covered here. This stage is fixed-function and cannot be programmed.

    For now, take the integer (uniform) partitioning mode as an example: [partitioning("integer")]. The domain is triangles, [domain("tri")]; the number of output control points is 3, [outputcontrolpoints(3)]; and the output topology is clockwise triangles, [outputtopology("triangle_cw")].

    2.2.2 Preparatory work and potential parallel issues

    Modify the code to the following:

    // .shader
    _FactorEdge1("[Float3]Edge factors,[Float]Inside factor", Vector) = (1, 1, 1, 1) // -- Edited -- 
    // .hlsl
    float4 _FactorEdge1; // -- Edited -- 
    f.edge[0] = _FactorEdge1.x;
    f.edge[1] = _FactorEdge1.y; // -- Edited -- 
    f.edge[2] = _FactorEdge1.z; // -- Edited -- 
    f.inside = _FactorEdge1.w; // -- Edited --

    There is a potential problem here. Sometimes the compiler splits the Patch Constant Function and computes each factor in parallel, which can cause some factors to be dropped so that they inexplicably end up as 0. The fix, used in the code above, is to pack the factors into a single vector so that the compiler never reads an undefined value. The following is a simple reproduction of what can go wrong.

    Modify the Patch Constant Function as follows and expose two new properties in the material panel.

    The modified lines are marked with // -- Edited --.

    // .hlsl
    // The patch constant function runs once per triangle, or "patch"
    // It runs in parallel to the hull function
    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        UNITY_SETUP_INSTANCE_ID(patch[0]); // Set up instancing
        // Calculate tessellation factors
        TessellationFactors f;
        f.edge[0] = _FactorEdge1.x;
        f.edge[1] = _FactorEdge2; // -- Edited --
        f.edge[2] = _FactorEdge3; // -- Edited --
        f.inside = _FactorInside;
        return f;
    }
    // .shader
    _FactorEdge2("Edge 2 factor", Float) = 1 // -- Edited --
    _FactorEdge3("Edge 3 factor", Float) = 1 // -- Edited --

    2.2.3 Edge Factor – SV_TessFactor

    It can be seen that the edge factors correspond approximately to the number of times the corresponding edge is split, and the internal factor corresponds to the complexity of the center.

    The edge factors only affect the edges of the original triangle. The complex interior pattern is controlled by the inside factor and the partitioning mode.

    It should be noted that the surface subdivision in "integer cutting mode" is rounded up, for example, 2.1 is rounded up to 3.

    One picture says it all.

    2.2.4 Inside Factor – SV_InsideTessFactor

    Let's take the integer mode as an example. The inside factor only affects the complexity of the interior pattern; the details are described below. To summarize, the edge factors affect the triangulation between the outermost layer and the first inner layer, the inside factor determines how many layers there are, and the partitioning mode determines how each inner layer is subdivided.

    With the edge factors set to (2, 3, 4) and only the inside factor varying, an interesting property shows up: when the inside factor n is even, there is a vertex located exactly at the centroid, with barycentric coordinates (1/3, 1/3, 1/3).

    Generally, it is good to set the edge factors to the same value. Here, different values are set, and the graph may be more confusing, but the most essential rules can be seen.

    Looking further, the number of vertices on any edge of the layer closest to the outermost triangle is tied to the inside factor n by Numpoint = n − 1. That is, the number of vertices on such an edge always equals the inside factor minus 1.

    The number of vertices per edge drops by 2 with each layer. That is, if the first inner layer (the outermost layer is not counted, since the inside factor does not subdivide it) has k vertices per edge, the second layer inward has k − 2, and so on.

    Combining these three observations gives a conjecture (of no practical use; I worked it out in an idle moment): the total number of interior vertices follows a closed formula. Let $a_k$ be that total, where $k$ is the inside factor minus 1; note that the inside factor starts at 2, so $k \ge 1$. Then $a_{2n} = 3n^2$ and $a_{2n-1} = 3n(n-1) + 1$. These combine into $a_k = -0.125\,(-1)^k + 0.75\,k^2 + 0.125$, or, using only integer operations, $a_k = \left\lfloor \frac{-(-1)^k + 6k^2 + 1}{8} \right\rfloor$.
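
    The two cases can be checked by summing over the layers. This is a sketch under two assumptions read off the observations above: a layer with $j > 1$ vertices per edge is a triangular ring holding $3(j-1)$ vertices, and a layer with $j = 1$ is the single centroid vertex.

    ```latex
    % Even case, k = 2n: layers have 2n, 2n-2, ..., 2 vertices per edge
    a_{2n} = \sum_{i=1}^{n} 3\,(2i - 1) = 3n^2
    % Odd case, k = 2n-1: layers have 2n-1, ..., 3 vertices per edge, plus the centroid
    a_{2n-1} = \sum_{i=2}^{n} 3\,(2i - 2) + 1 = 3n(n-1) + 1
    % Examples: inside factor 4 gives k = 3 and a_3 = 7; inside factor 5 gives k = 4 and a_4 = 12
    ```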

    2.2.5 Partitioning Mode – [partitioning(“_”)]

    The above only covers the simplest mode, integer partitioning, which splits edges into a whole number of equal parts. Let's look at the other modes. Simply put, Fractional Odd and Fractional Even are extended versions of Integer: the former extends the odd integer factors and the latter the even ones. The extension is that the fractional part of the factor is used as well, so the segments no longer have to be equal in length.

    Fractional Odd: the inside factor may be fractional (it is not rounded up with ceil), and the denominator is odd. The denominator here is the denominator of the barycentric coordinates of each vertex. Consistent with the observation in 2.2.4, an even denominator places a vertex exactly at the centroid of the triangle, while an odd one does not.


    Fractional Even: similar to fractional_odd, but with an even denominator. I am not sure how to choose between the two.


    Pow2 (power of 2): This mode only allows the use of powers of 2 (such as 1, 2, 4, 8, etc.) as subdivision levels. Generally used for texture mapping or shadow calculations.

    3. Subdivision Optimization

    3.1 View Frustum Culling

    Generating this many vertices is very bad for performance, so we need ways to improve rendering efficiency. Vertices outside the view frustum are culled before rasterization anyway, but culling unnecessary patches early, in the Hull Stage (TCS), relieves the tessellation stages of that work.

    If the tessellation factors are set to 0 in the Patch Constant Function, the tessellator ignores the patch. Culling here therefore applies to whole patches, rather than vertex by vertex as in ordinary frustum culling.

    We test every point in the patch to see whether it is out of view. To do this, every point of the patch must be in clip space, so we compute the clip-space position of each point in the vertex shader and pass it to the Hull Stage. GetVertexPositionInputs provides what we need.

    struct TessellationControlPoint {
        // ... existing fields unchanged
        float4 positionCS : SV_POSITION; // -- Edited --
    };

    TessellationControlPoint Vertex(Attributes input) {
        TessellationControlPoint output;
        // ... existing code unchanged
        VertexPositionInputs posnInputs = GetVertexPositionInputs(input.positionOS);
        output.positionCS = posnInputs.positionCS; // -- Edited --
        return output;
    }

    Then write a test function above the Patch Constant Function that decides whether to cull the patch. It takes the three points in clip space and, for now, simply returns false.

    // Returns true if it should be clipped due to frustum or winding culling
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        return false;
    }

    Then write the IsOutOfBounds function to test whether a point is outside the bounds. The bounds can also be specified, and this method can be used in another function to determine whether a point is outside the view frustum.

    // Returns true if the point is outside the bounds set by lower and higher
    bool IsOutOfBounds(float3 p, float3 lower, float3 higher) {
        return p.x < lower.x || p.x > higher.x || p.y < lower.y || p.y > higher.y || p.z < lower.z || p.z > higher.z;
    }

    // Returns true if the given vertex is outside the camera frustum and should be culled
    bool IsPointOutOfFrustum(float4 positionCS) {
        float3 culling = positionCS.xyz;
        float w = positionCS.w;
        // UNITY_RAW_FAR_CLIP_VALUE is either 0 or 1, depending on graphics API
        // Most use 0, however OpenGL uses 1
        float3 lowerBounds = float3(-w, -w, -w * UNITY_RAW_FAR_CLIP_VALUE);
        float3 higherBounds = float3(w, w, w);
        return IsOutOfBounds(culling, lowerBounds, higherBounds);
    }

    In clip space, the w component is the homogeneous coordinate that determines whether a point lies inside the view frustum. If x, y or z falls outside the range [-w, w], the point is culled because it is outside the frustum. Different APIs handle depth differently, so we need to be careful when using this component as the bound. DirectX and Vulkan use a left-handed system with a clip depth range of [0, 1], so UNITY_RAW_FAR_CLIP_VALUE is 0. OpenGL uses a right-handed system with a clip depth range of [-1, 1], so UNITY_RAW_FAR_CLIP_VALUE is 1.
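
    Written out, the bounds tested by IsPointOutOfFrustum are as follows, with $F$ standing for UNITY_RAW_FAR_CLIP_VALUE. The sample point is made up purely for illustration.

    ```latex
    % A point (x, y, z, w) is inside the frustum when all three conditions hold
    -w \le x \le w, \qquad -w \le y \le w, \qquad -wF \le z \le w
    % Example with F = 0: (x, y, z, w) = (1.5, -0.5, 0.25, 1.0)
    % x = 1.5 > w = 1.0, so the point lies outside the frustum
    ```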

    After preparing these, you can determine whether a patch needs to be culled. Go back to the function at the beginning and determine whether all the points of a patch need to be culled.

    // Returns true if it should be clipped due to frustum or winding culling
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
            IsPointOutOfFrustum(p1PositionCS) &&
            IsPointOutOfFrustum(p2PositionCS); // -- Edited --
        return allOutside; // -- Edited --
    }

    3.2 Backface Culling

    In addition to frustum culling, patches can also undergo backface culling, using the normal vector to determine whether a patch needs to be culled.


    The normal vector is obtained by taking the cross product of two vectors. Since we are currently in Clip space, we need to do a perspective division to get NDC, which should be in the range of [-1,1]. The reason for converting to NDC is that the position in Clip space is nonlinear, which may cause the position of the vertex to be distorted. Converting to a linear space like NDC can more accurately determine the front and back relationship of the vertices.

    // Returns true if the points in this triangle are wound counter-clockwise
    bool ShouldBackFaceCull(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        // Perspective divide: clip space to NDC
        float3 point0 = p0PositionCS.xyz / p0PositionCS.w;
        float3 point1 = p1PositionCS.xyz / p1PositionCS.w;
        float3 point2 = p2PositionCS.xyz / p2PositionCS.w;
        float3 normal = cross(point1 - point0, point2 - point0);
        return dot(normal, float3(0, 0, 1)) < 0;
    }

    The above code still has a cross-platform problem. The viewing direction is different in different APIs, so modify the code.

    // In clip space, the view direction is float3(0, 0, 1), so we can just test the z coord
    #if UNITY_REVERSED_Z
        return cross(point1 - point0, point2 - point0).z < 0;
    #else // In OpenGL, the test is reversed
        return cross(point1 - point0, point2 - point0).z > 0;
    #endif
    Finally, add the function you just wrote to ShouldClipPatch to determine backface culling.

    // Returns true if it should be clipped due to frustum or winding culling
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
            IsPointOutOfFrustum(p1PositionCS) &&
            IsPointOutOfFrustum(p2PositionCS);
        return allOutside || ShouldBackFaceCull(p0PositionCS, p1PositionCS, p2PositionCS); // -- Edited --
    }

    Then, in PatchConstantFunction, set the tessellation factors of any patch to be culled to 0.

    if (ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS)) {
        f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0; // Cull the patch
    }

    3.3 Increase Tolerance

    You may want to verify the correctness of the code, or there may be some unexpected exclusions. In this case, adding a tolerance is a flexible approach.

    The first is the frustum culling tolerance. If the tolerance is positive, the culling boundaries will be expanded so that some objects near the edge of the frustum will not be culled even if they are partially out of bounds. This method can reduce the frequent changes in culling state due to small perspective changes or object dynamics.

    // Returns true if the given vertex is outside the camera frustum and should be culled
    bool IsPointOutOfFrustum(float4 positionCS, float tolerance) {
        float3 culling = positionCS.xyz;
        float w = positionCS.w;
        // UNITY_RAW_FAR_CLIP_VALUE is either 0 or 1, depending on graphics API
        // Most use 0, however OpenGL uses 1
        float3 lowerBounds = float3(-w - tolerance, -w - tolerance, -w * UNITY_RAW_FAR_CLIP_VALUE - tolerance);
        float3 higherBounds = float3(w + tolerance, w + tolerance, w + tolerance);
        return IsOutOfBounds(culling, lowerBounds, higherBounds);
    }

    Next, backface culling is adjusted. In practice the test compares against a tolerance instead of zero, which avoids problems with numerical precision. The primitive is treated as a back face only if the z component of the cross product is below the negated tolerance (or above the tolerance on OpenGL) rather than merely below zero. With a positive tolerance this provides an additional buffer, ensuring that only clearly back-facing primitives are culled.

    // Returns true if the points in this triangle are wound counter-clockwise
    bool ShouldBackFaceCull(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS, float tolerance) {
        float3 point0 = p0PositionCS.xyz / p0PositionCS.w;
        float3 point1 = p1PositionCS.xyz / p1PositionCS.w;
        float3 point2 = p2PositionCS.xyz / p2PositionCS.w;
        // In clip space, the view direction is float3(0, 0, 1), so we can just test the z coord
    #if UNITY_REVERSED_Z
        return cross(point1 - point0, point2 - point0).z < -tolerance;
    #else // In OpenGL, the test is reversed
        return cross(point1 - point0, point2 - point0).z > tolerance;
    #endif
    }

    It is possible to expose a Range in the Material Panel.

    // .shader
    _tolerance("_tolerance", Range(-0.002, 0.001)) = 0
    // .hlsl
    float _tolerance;
    // Returns true if it should be clipped due to frustum or winding culling
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        bool allOutside = IsPointOutOfFrustum(p0PositionCS, _tolerance) &&
            IsPointOutOfFrustum(p1PositionCS, _tolerance) &&
            IsPointOutOfFrustum(p2PositionCS, _tolerance); // -- Edited --
        return allOutside || ShouldBackFaceCull(p0PositionCS, p1PositionCS, p2PositionCS, _tolerance); // -- Edited --
    }

    3.4 Dynamic subdivision factor

    So far, our algorithm subdivides every face indiscriminately. A complex mesh, however, can contain both large and small faces, i.e. the mesh area is uneven. Large faces are more visible because of their area and need more subdivision to keep the surface smooth and detailed. Small faces cover little area, so their subdivision level can be reduced without much visual impact. Varying the factor with edge length is a common approach: faces with longer edges get a higher tessellation factor.

    Besides the face sizes of the mesh itself, the distance between the camera and the patch can also drive the factor. Objects farther from the camera can use a lower tessellation factor because they cover fewer pixels on screen. The viewing angle and gaze direction can be used as well: faces that face the camera get priority, while faces that face away from the camera or sideways are subdivided less.

    3.4.1 Fixed Subdivision Scaling

    Take the distance between two vertices: the larger the distance, the larger the tessellation factor. A scale is exposed in the material panel with the range [0, 1]. When the scale is 1, the factor equals the distance between the two points; the closer the scale gets to 0, the larger the factor becomes (a scale of exactly 0 would divide by zero). In addition, a bias is added as an initial value. Finally, the result is clamped to at least 1, since a factor of 0 would make the patch disappear.

    // Calculate the tessellation factor for an edge
    float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float3 p1PositionWS) {
        float factor = distance(p0PositionWS, p1PositionWS) / scale;
        return max(1, factor + bias);
    }
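
    Because the scale range includes 0, the division above can produce an infinite factor. A guarded variant might look like the sketch below; the name EdgeTessellationFactorSafe and the floor of 1e-4 are assumptions made for illustration, not part of the original code.

    ```hlsl
    // Hypothetical variant of EdgeTessellationFactor that tolerates scale == 0
    float EdgeTessellationFactorSafe(float scale, float bias, float3 p0PositionWS, float3 p1PositionWS) {
        // Keep the divisor away from zero so the factor stays finite
        float safeScale = max(scale, 1e-4);
        float factor = distance(p0PositionWS, p1PositionWS) / safeScale;
        // Never go below 1: a factor of 0 would cull the patch
        return max(1, factor + bias);
    }
    ```

    The result is still limited by maxtessfactor, so very small scales simply saturate at the configured maximum.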

    Then modify the material panel and Patch Constant Function. Generally speaking, the average value of the edge subdivision factor is used as the internal subdivision factor, which will give a more consistent visual effect.

    // .shader
    _TessellationBias("_TessellationBias", Range(-1,5)) = 1
    _TessellationFactor("_TessellationFactor", Range(0,1)) = 0
    // .hlsl
    f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS);
    f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS);
    f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS);
    f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;

    The degree of subdivision now varies dynamically with the size of each patch; the effect is as follows.

    By the way, if you find that your internal factor pattern is very strange, this may be caused by the compiler. Try to modify the internal factor code to the following to solve it.

    f.inside = ( // If the compiler doesn't play nice...
      EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS) + 
      EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS) + 
      EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS)
      ) / 3.0;

    3.4.2 Screen Space Subdivision Scaling

    Next, we need to take camera distance into account. We can directly use the screen-space distance to adjust the subdivision level, which handles both problems, large versus small faces and distance on screen, at the same time.

    Since we already have the data in Clip space, and since screen space is very similar to NDC space, we only need to convert it to NDC, that is, do a perspective division.

    float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float4 p0PositionCS, float3 p1PositionWS, float4 p1PositionCS) {
        // The perspective division converts clip space to NDC
        float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) / scale;
        return max(1, factor + bias);
    }

    Next, pass the Clip space coordinates into the Patch Constant Function.

    f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
      patch[1].positionWS, patch[1].positionCS, patch[2].positionWS, patch[2].positionCS);
    f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
      patch[2].positionWS, patch[2].positionCS, patch[0].positionWS, patch[0].positionCS);
    f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
      patch[0].positionWS, patch[0].positionCS, patch[1].positionWS, patch[1].positionCS);
    f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;

    The result is already quite good: the subdivision level changes dynamically with the camera distance (that is, with the on-screen edge length). A partitioning mode other than INTEGER gives a more consistent result.

    There is still room for improvement, for example the unit of the scaling factor. So far its range was [0,1], which is awkward to tune. Multiplying the NDC distance by the screen height and changing the range to [0,1080] makes it easier to adjust, because the value is now roughly in pixels. Update the material property accordingly.

    // .hlsl
    float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) * _ScreenParams.y / scale;
    // .shader
    _TessellationFactor("_TessellationFactor",Range(0,1080)) = 320
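
    As a rough sanity check of the pixel-based scale, here is a back-of-the-envelope example. It assumes a render target 1080 pixels high ($H = 1080$), the default scale of 320, and an edge that differs only in NDC $y$. NDC $y$ runs from $-1$ to $1$, so an edge spanning the full screen height has NDC length 2:

    $$
    \text{factor} = \frac{\lVert \Delta_{\text{NDC}} \rVert \cdot H}{\text{scale}} = \frac{2 \cdot 1080}{320} = 6.75
    $$

    Under these assumptions a full-height edge is split into about 7 segments before the bias is added. Because the NDC range is two units wide, each segment covers roughly scale / 2 pixels rather than scale pixels.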

    3.4.3 Camera distance subdivision scaling

    Camera-distance scaling is just as simple. We take the ratio between the edge length and the distance from the edge midpoint to the camera (squared in the code below). The larger the ratio, the more screen space the edge covers, and the more subdivision it needs.

    // .hlsl
    float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float3 p1PositionWS) {
        float edgeLength = distance(p0PositionWS, p1PositionWS);
        float distanceToCamera = distance(GetCameraPositionWS(), (p0PositionWS + p1PositionWS) * 0.5);
        float factor = edgeLength / (scale * distanceToCamera * distanceToCamera);
        return max(1, factor + bias);
    }
    // In the Patch Constant Function
    f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS);
    f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS);
    f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS);
    // .shader
    _TessellationFactor("_TessellationFactor",Range(0, 1)) = 0.02

    Note that the scaling factor is no longer in pixels but back in the original [0,1] range. Screen pixels are not meaningful for this method, so world-space coordinates are used again.

    Screen-space scaling and camera-distance scaling give similar results. In practice, a shader keyword (macro) can be used to switch between the dynamic-factor modes above. This is left as an exercise for the reader.
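
    One possible way to wire up such a switch is sketched below. The keyword names (_TESS_MODE_WORLD, _TESS_MODE_SCREEN, _TESS_MODE_CAMERA) and the function name are made up for illustration, and the camera position and screen height are passed in as parameters so the function does not depend on any particular include file.

    ```hlsl
    // Hypothetical keywords, declared in the pass for example with:
    // #pragma shader_feature_local _TESS_MODE_WORLD _TESS_MODE_SCREEN _TESS_MODE_CAMERA
    float EdgeTessellationFactorByMode(float scale, float bias,
        float3 p0PositionWS, float4 p0PositionCS,
        float3 p1PositionWS, float4 p1PositionCS,
        float3 cameraPositionWS, float screenHeight) {
        float factor = 1; // Fallback when no keyword is defined
    #if defined(_TESS_MODE_SCREEN)
        // Screen-space edge length (section 3.4.2)
        factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) * screenHeight / scale;
    #elif defined(_TESS_MODE_CAMERA)
        // Edge length relative to camera distance (section 3.4.3)
        float edgeLength = distance(p0PositionWS, p1PositionWS);
        float distanceToCamera = distance(cameraPositionWS, (p0PositionWS + p1PositionWS) * 0.5);
        factor = edgeLength / (scale * distanceToCamera * distanceToCamera);
    #elif defined(_TESS_MODE_WORLD)
        // World-space edge length (section 3.4.1)
        factor = distance(p0PositionWS, p1PositionWS) / scale;
    #endif
        return max(1, factor + bias);
    }
    ```

    The call site would pass GetCameraPositionWS() and _ScreenParams.y for the last two arguments. Keep in mind that the sensible range of the scale differs per mode ([0,1] versus [0,1080]), so the material property range has to cover whichever modes are enabled.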

    3.5 Specifying subdivision factors

    3.5.1 Vertex Storage Subdivision Factor

    In the previous section, we used different strategies to guess suitable subdivision factors. If we know exactly how the mesh should be subdivided, we can store a multiplier for the factors in the mesh itself. The multiplier is a single float, so one vertex color channel is enough. The following is pseudocode, just to illustrate the idea.

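    float EdgeTessellationFactor(float scale, float bias, float multiplier) {
        float factor = scale; // Placeholder: substitute any strategy from section 3.4
        return max(1, (factor + bias) * multiplier);
    }
    // PCF()
    float multipliers[3];
    [unroll] for (int i = 0; i < 3; i++) {
        multipliers[i] = patch[i].color.g;
    }
    // Calculate tessellation factors (edge i lies opposite vertex i)
    f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, (multipliers[1] + multipliers[2]) / 2);
    f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, (multipliers[2] + multipliers[0]) / 2);
    f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, (multipliers[0] + multipliers[1]) / 2);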

    3.5.2 SDF Control Surface Subdivision Factor

    Combining a Signed Distance Field (SDF) with the tessellation factor is a neat trick. This section does not cover how the SDF is generated; it assumes the distance is available through a ready-made function, CalculateSDFDistance.

    For a given mesh, use CalculateSDFDistance to compute the distance from each vertex of a patch to the shape represented by the SDF (a sphere, for example). From these distances, estimate how much subdivision the patch needs and subdivide accordingly.
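
    For the sphere case mentioned above, CalculateSDFDistance could look like the following minimal sketch. The two material properties (_SDFSphereCenterWS, _SDFSphereRadius) are hypothetical names introduced only for this example.

    ```hlsl
    // Hypothetical material properties describing the sphere in world space
    float3 _SDFSphereCenterWS;
    float _SDFSphereRadius;

    // Signed distance from a world-space point to the sphere surface:
    // negative inside, zero on the surface, positive outside
    float CalculateSDFDistance(float3 positionWS) {
        return length(positionWS - _SDFSphereCenterWS) - _SDFSphereRadius;
    }
    ```

    Inside the sphere the distance is negative, so an interpolation weight of the form 1 - sdfDistance / threshold exceeds 1 there. Wrapping that weight in saturate(), or taking abs() of the distance, keeps the result between the minimum and maximum factors.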

    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        float multipliers[3];
        // Loop through each vertex
        [unroll] for (int i = 0; i < 3; i++) {
            // Calculate the distance from each vertex to the SDF surface
            float sdfDistance = CalculateSDFDistance(patch[i].positionWS);
            // Adjust subdivision factor based on SDF distance
            if (sdfDistance < _TessellationDistanceThreshold) {
                multipliers[i] = lerp(_MinTessellationFactor, _MaxTessellationFactor, (1 - sdfDistance / _TessellationDistanceThreshold));
            } else {
                multipliers[i] = _MinTessellationFactor;
            }
        }
        // Calculate the final subdivision factors
        // Edge i lies opposite vertex i, as in the earlier sections
        TessellationFactors f;
        f.edge[0] = max(multipliers[1], multipliers[2]);
        f.edge[1] = max(multipliers[2], multipliers[0]);
        f.edge[2] = max(multipliers[0], multipliers[1]);
        f.inside = (multipliers[0] + multipliers[1] + multipliers[2]) / 3;
        return f;
    }

    I have not implemented this in practice yet, so for now this is only an attempt to understand the idea.

    4. Vertex offset – contour smoothing

    The easiest way to add detail to a mesh is to use high-resolution textures. The bottom line, however, is that adding vertices to a mesh works better than increasing texture resolution. A normal map, for example, can change the normal direction of each fragment, but it does not change the geometry. Even a 128K texture cannot remove jagged silhouettes and pointy edges.

    Therefore, we need to tessellate the surface and then offset the vertices. All the tessellation so far keeps the new vertices in the plane of the patch. If we want to bend them, one of the simplest approaches is Phong tessellation.

    4.1 Phong subdivision

    First, the original paper is attached.

    Phong shading should be familiar to you. It is a technique that uses linear interpolation of normal vectors to obtain smooth shading. Phong subdivision is inspired by Phong shading and extends the concept of Phong shading to the spatial domain.

    The core idea of Phong subdivision is to use the vertex normals of each corner of the triangle to affect the position of new vertices during the subdivision process, thereby creating a curved surface instead of a flat surface.

    It is worth noting that many tutorials here use triangle corner to represent vertices. I think they are all the same, so I will still use vertices in this article.

    First, in the Domain function, Unity gives us the barycentric coordinates of the new vertex to process. Suppose we are currently processing $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.

    Each vertex of a patch has a normal. Imagine a tangent plane emanating from each vertex, perpendicular to the respective normal vector.

    Then project the current vertex onto these three tangent planes respectively.

    In mathematical terms: $P' = P - ((P - V) \cdot N) N$

    where:

    • $P$ is the initially interpolated position on the flat triangle.
    • $V$ is a vertex of the triangle, which lies on its tangent plane.
    • $N$ is the normal at vertex $V$.
    • $\cdot$ denotes the dot product.
    • $P'$ is the projection of $P$ onto the tangent plane.

    This yields three projected points $P'$, one per vertex.

    The three projected points form a new triangle. Applying the barycentric coordinates of the current vertex to this new triangle gives the final position.

    //Calculate Phong projection offset
    float3 PhongProjectedPosition(float3 flatPositionWS, float3 cornerPositionWS, float3 normalWS) {
        return flatPositionWS - dot(flatPositionWS - cornerPositionWS, normalWS) * normalWS;
    }
    // Apply Phong smoothing
    float3 CalculatePhongPosition(float3 bary, float3 p0PositionWS, float3 p0NormalWS,
        float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
        // Position on the flat triangle (BarycentricInterpolate is defined further down)
        float3 flatPositionWS = BarycentricInterpolate(bary, p0PositionWS, p1PositionWS, p2PositionWS);
        float3 smoothedPositionWS =
            bary.x * PhongProjectedPosition(flatPositionWS, p0PositionWS, p0NormalWS) +
            bary.y * PhongProjectedPosition(flatPositionWS, p1PositionWS, p1NormalWS) +
            bary.z * PhongProjectedPosition(flatPositionWS, p2PositionWS, p2NormalWS);
        return smoothedPositionWS;
    }
    // The domain function runs once per vertex in the final, tessellated mesh
    // Use it to reposition vertices and prepare for the fragment stage
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
        Interpolators output;
        float3 positionWS = CalculatePhongPosition(barycentricCoordinates, 
          patch[0].positionWS, patch[0].normalWS, 
          patch[1].positionWS, patch[1].normalWS, 
          patch[2].positionWS, patch[2].normalWS);
        float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);
        float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
        output.positionCS = TransformWorldToHClip(positionWS);
        output.normalWS = normalWS;
        output.positionWS = positionWS;
        output.tangentWS = float4(tangentWS, patch[0].tangentWS.w);
        return output;
    }

    Note that the structs now need to carry the tangent vector alongside the normal, and it has to be written in both Vertex and Domain. We also add a helper function for the barycentric interpolation used to combine the three $P'$.

    struct Attributes {
        // ... existing fields
        float4 tangentOS : TANGENT;
    };
    struct TessellationControlPoint {
        // ... existing fields
        float4 tangentWS : TANGENT;
    };
    struct Interpolators {
        // ... existing fields
        float4 tangentWS : TANGENT;
    };
    TessellationControlPoint Vertex(Attributes input) {
        TessellationControlPoint output;
        // ... existing code
        // tangent.w holds the sign coefficient (bitangent multiplier)
        output.tangentWS = float4(normalInputs.tangentWS, input.tangentOS.w);
        return output;
    }
    // Barycentric interpolation as a function
    float3 BarycentricInterpolate(float3 bary, float3 a, float3 b, float3 c) {
        return bary.x * a + bary.y * b + bary.z * c;
    }

    The original Phong tessellation paper adds a factor α to control the degree of curvature. The authors recommend setting it globally to three-quarters for the best visual result. With the α factor, the algorithm produces quadratic Bézier patches, which cannot have inflection points but are sufficient for practical development.

    First, let’s look at the formula in the original paper.
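
    A transcription of that formula into this article's notation is given below. The symbol names are chosen here for illustration: $V_i$ and $N_i$ are the position and normal of vertex $i$, $\pi_i$ is the projection onto the tangent plane at vertex $i$, and $(u, v, w)$ are the barycentric coordinates.

    $$
    \begin{aligned}
    P(u, v, w) &= u V_0 + v V_1 + w V_2 \\
    \pi_i(P) &= P - ((P - V_i) \cdot N_i) N_i \\
    P^{*}_{\alpha}(u, v, w) &= (1 - \alpha) P + \alpha \left( u\, \pi_0(P) + v\, \pi_1(P) + w\, \pi_2(P) \right)
    \end{aligned}
    $$

    The last line has the shape of a linear interpolation between the flat position and the projected position, with α as the weight.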

    Essentially, α controls the degree of interpolation. When α=0, all vertices stay on the original plane, which is equivalent to no displacement. When α=1, the new vertices are fully displaced to the Phong-projected positions. You can also try values below zero or above one; the results are quite interesting. ~~It doesn't matter if you don't understand the formula in the paper. I will just use a lerp to interpolate.~~

    // Apply Phong smoothing
    float3 CalculatePhongPosition(float3 bary, float smoothing, float3 p0PositionWS, float3 p0NormalWS,
        float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
        float3 flatPositionWS = BarycentricInterpolate(bary, p0PositionWS, p1PositionWS, p2PositionWS);
        float3 smoothedPositionWS =
            bary.x * PhongProjectedPosition(flatPositionWS, p0PositionWS, p0NormalWS) +
            bary.y * PhongProjectedPosition(flatPositionWS, p1PositionWS, p1NormalWS) +
            bary.z * PhongProjectedPosition(flatPositionWS, p2PositionWS, p2NormalWS);
        return lerp(flatPositionWS, smoothedPositionWS, smoothing);
    }

    Don't forget to expose the smoothing factor in the material panel.

    // .shader
    _TessellationSmoothing("_TessellationSmoothing", Range(0,1)) = 0.5
    // .hlsl
    float _TessellationSmoothing;
    Interpolators Domain( .... ) {
        float smoothing = _TessellationSmoothing;
        float3 positionWS = CalculatePhongPosition(barycentricCoordinates, smoothing,
          patch[0].positionWS, patch[0].normalWS, 
          patch[1].positionWS, patch[1].normalWS, 
          patch[2].positionWS, patch[2].normalWS);
        // ...
    }

    Note that some models need to be modified first. If the edges of a model are very sharp, the vertex normals there are almost parallel to the face normals. In Phong tessellation this places the projection onto the tangent plane very close to the original position, which weakens the effect of the subdivision.

    To solve this problem, you can add more geometric details by performing what is called "adding loop edges" or "loop cuts" in the modeling software. Insert additional edge loops near the edges of the original model to increase the subdivision density. The specific operation will not be expanded here.

    In general, Phong tessellation offers good results at good performance. If you want higher-quality smoothing, consider PN triangles, a technique based on curved triangles built from Bézier patches.

    4.2 PN triangles subdivision

    First, here is the original paper.

    PN triangles do not need information about neighboring triangles, which keeps them relatively inexpensive. The algorithm only requires the positions and normals of the three vertices of a patch; everything else is computed from them. Note that the surface is evaluated in barycentric coordinates.

    The PN algorithm computes 10 control points for the subdivided surface, as shown in the figure below: the three triangle vertices, one center point, and three pairs of control points on the edges. The computed Bézier control points are passed to the Domain function. Since the control points are the same for every vertex of a patch, the Patch Constant Function is the natural place to compute them.

    The calculation method in the paper is as follows:

    $$
    \begin{aligned}
    b_{300} &= P_1 \\
    b_{030} &= P_2 \\
    b_{003} &= P_3 \\
    w_{ij} &= (P_j - P_i) \cdot N_i \in \mathbf{R} \quad \text{(here "} \cdot \text{" is the scalar product)} \\
    b_{210} &= (2 P_1 + P_2 - w_{12} N_1) / 3 \\
    b_{120} &= (2 P_2 + P_1 - w_{21} N_2) / 3 \\
    b_{021} &= (2 P_2 + P_3 - w_{23} N_2) / 3 \\
    b_{012} &= (2 P_3 + P_2 - w_{32} N_3) / 3 \\
    b_{102} &= (2 P_3 + P_1 - w_{31} N_3) / 3 \\
    b_{201} &= (2 P_1 + P_3 - w_{13} N_1) / 3 \\
    E &= (b_{210} + b_{120} + b_{021} + b_{012} + b_{102} + b_{201}) / 6 \\
    V &= (P_1 + P_2 + P_3) / 3 \\
    b_{111} &= E + (E - V) / 2
    \end{aligned}
    $$

    Each $w_{ij}$ is computed twice per edge, once in each direction, so six times in total. For example, $w_{12}$ is the length of the projection of the vector from $P_1$ to $P_2$ onto the normal at $P_1$. Multiplying it by that normal gives the projection vector itself.

    Take the control point near $P_1$ as an example. The nearer vertex should weigh more, so $P_1$ is multiplied by $2$, which pulls the control point toward it. Subtracting the projection vector corrects for $P_2$ not lying in the tangent plane defined by the normal at $P_1$; it moves the control point into that plane and reduces distortion. Finally, dividing by 3 normalizes the weights.

    Next, compute $E$, the average of the six edge control points, which captures where the boundary control points are concentrated. Then compute $V$, the average of the three triangle vertices. Finally, add half of the difference between $E$ and $V$ to $E$. This is the tenth control point, $b_{111}$.

    To summarize: the first three control points are the triangle vertices themselves (so they do not need to be stored in the struct), six are computed from the weighted formula, and the last one is derived from the averages. The code is simple to write.
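
    A quick sanity check of these formulas is the flat case, worked out here as an illustration: assume all three vertex normals equal the face normal $N$. Every edge vector then lies in the triangle plane, so all weights vanish and the edge control points sit at the thirds of each edge:

    $$
    \begin{aligned}
    w_{ij} &= (P_j - P_i) \cdot N = 0 \\
    b_{210} &= (2 P_1 + P_2) / 3, \quad b_{120} = (2 P_2 + P_1) / 3, \quad \ldots \\
    E &= \frac{6 P_1 + 6 P_2 + 6 P_3}{3 \cdot 6} = \frac{P_1 + P_2 + P_3}{3} = V \\
    b_{111} &= E + (E - V) / 2 = V
    \end{aligned}
    $$

    All ten control points lie in the triangle plane, so the patch stays flat, as expected. Any curvature comes only from vertex normals that deviate from the face normal.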

    struct TessellationFactors {
        float edge[3] : SV_TessFactor;
        float inside : SV_InsideTessFactor;
        float3 bezierPoints[7] : BEZIERPOS;
    };
    //Bezier control point calculations
    float3 CalculateBezierControlPoint(float3 p0PositionWS, float3 aNormalWS, float3 p1PositionWS, float3 bNormalWS) {
        float w = dot(p1PositionWS - p0PositionWS, aNormalWS);
        return (p0PositionWS * 2 + p1PositionWS - w * aNormalWS) / 3.0;
    }
    void CalculateBezierControlPoints(inout float3 bezierPoints[7],
        float3 p0PositionWS, float3 p0NormalWS, float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
        bezierPoints[0] = CalculateBezierControlPoint(p0PositionWS, p0NormalWS, p1PositionWS, p1NormalWS);
        bezierPoints[1] = CalculateBezierControlPoint(p1PositionWS, p1NormalWS, p0PositionWS, p0NormalWS);
        bezierPoints[2] = CalculateBezierControlPoint(p1PositionWS, p1NormalWS, p2PositionWS, p2NormalWS);
        bezierPoints[3] = CalculateBezierControlPoint(p2PositionWS, p2NormalWS, p1PositionWS, p1NormalWS);
        bezierPoints[4] = CalculateBezierControlPoint(p2PositionWS, p2NormalWS, p0PositionWS, p0NormalWS);
        bezierPoints[5] = CalculateBezierControlPoint(p0PositionWS, p0NormalWS, p2PositionWS, p2NormalWS);
        float3 avgBezier = 0;
        [unroll] for (int i = 0; i < 6; i++) {
            avgBezier += bezierPoints[i];
        }
        avgBezier /= 6.0;
        float3 avgControl = (p0PositionWS + p1PositionWS + p2PositionWS) / 3.0;
        bezierPoints[6] = avgBezier + (avgBezier - avgControl) / 2.0;
    }
    // The patch constant function runs once per triangle, or "patch"
    // It runs in parallel to the hull function
    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        TessellationFactors f = (TessellationFactors)0;
        // Check if this patch should be culled (it is out of view)
        if (ShouldClipPatch(...)) {
            // The factors stay at 0, which culls the patch
        } else {
            // ... calculate the tessellation factors as before
            CalculateBezierControlPoints(f.bezierPoints, patch[0].positionWS, patch[0].normalWS, 
              patch[1].positionWS, patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
        }
        return f;
    }

    Then, in the Domain function, use the ten control points: the three vertices plus the seven points output by the Patch Constant Function. Evaluate the cubic Bézier surface with the formula given in the paper, then blend the result with the flat position using the smoothing factor exposed on the material panel.

    $$
    \begin{aligned}
    b &: \quad R^2 \mapsto R^3, \quad \text{for } w = 1 - u - v, \quad u, v, w \geq 0 \\
    b(u, v) &= \sum_{i+j+k=3} b_{ijk} \frac{3!}{i!\,j!\,k!} u^i v^j w^k \\
    &= b_{300} w^3 + b_{030} u^3 + b_{003} v^3 \\
    &+ b_{210}\, 3 w^2 u + b_{120}\, 3 w u^2 + b_{201}\, 3 w^2 v \\
    &+ b_{021}\, 3 u^2 v + b_{102}\, 3 w v^2 + b_{012}\, 3 u v^2 \\
    &+ b_{111}\, 6 w u v
    \end{aligned}
    $$

    // Barycentric interpolation as a function
    float3 BarycentricInterpolate(float3 bary, float3 a, float3 b, float3 c) {
        return bary.x * a + bary.y * b + bary.z * c;
    }
    float3 CalculateBezierPosition(float3 bary, float smoothing, float3 bezierPoints[7],
        float3 p0PositionWS, float3 p1PositionWS, float3 p2PositionWS) {
        float3 flatPositionWS = BarycentricInterpolate(bary, p0PositionWS, p1PositionWS, p2PositionWS);
        // bary.x, bary.y, bary.z play the roles of w, u, v in the paper's formula
        float3 smoothedPositionWS =
            p0PositionWS * (bary.x * bary.x * bary.x) +
            p1PositionWS * (bary.y * bary.y * bary.y) +
            p2PositionWS * (bary.z * bary.z * bary.z) +
            bezierPoints[0] * (3 * bary.x * bary.x * bary.y) +
            bezierPoints[1] * (3 * bary.y * bary.y * bary.x) +
            bezierPoints[2] * (3 * bary.y * bary.y * bary.z) +
            bezierPoints[3] * (3 * bary.z * bary.z * bary.y) +
            bezierPoints[4] * (3 * bary.z * bary.z * bary.x) +
            bezierPoints[5] * (3 * bary.x * bary.x * bary.z) +
            bezierPoints[6] * (6 * bary.x * bary.y * bary.z);
        return lerp(flatPositionWS, smoothedPositionWS, smoothing);
    }
    // The domain function runs once per vertex in the final, tessellated mesh
    // Use it to reposition vertices and prepare for the fragment stage
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
        Interpolators output;
        // Calculate tessellation smoothing multipler
        float smoothing = _TessellationSmoothing;
        smoothing *= BARYCENTRIC_INTERPOLATE(color.r); // Multiply by the vertex's red channel
        float3 positionWS = CalculateBezierPosition(barycentricCoordinates,
          smoothing, factors.bezierPoints, 
          patch[0].positionWS, patch[1].positionWS, patch[2].positionWS);
        float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);
        float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
        // ...
    }

    Compare the results with PN triangles off and on.

    4.3 Improved PN triangles – Output subdivided normals

    Traditional PN triangles only change the positions of the vertices. We can also use the vertex normals to output normals that vary across the patch, which gives better lighting.

    In the original algorithm, the normals change very abruptly. As shown in the figure below (top), the normals at the two vertices of the original triangle may not represent the variation of the underlying surface well. We want the result shown in the figure below (bottom), so we use quadratic interpolation to capture how the surface may vary within a single patch.

    Since the surface is a cubic Bézier surface, the normal should be interpolated as a quadratic Bézier surface, so three additional normal control points are required. TheTus's article explains this clearly; for the detailed mathematics, see Ref10.

    The following is a brief introduction on how to obtain the normal direction of the subdivision.

    First, take the normals at the two endpoints A and B of an edge and compute their average.

    Construct the plane that is perpendicular to the segment AB and passes through its midpoint.

    Reflect the average normal across that plane.

    Repeat this for each edge, which gives three normal control points.
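
    Written as a formula for the edge from $P_1$ to $P_2$ (the other two edges follow by permuting the indices), the reflection reads as follows. The names $v_{12}$, $h_{110}$ and $n_{110}$ are used here for the intermediate values and the resulting control normal:

    $$
    \begin{aligned}
    v_{12} &= 2\, \frac{(P_2 - P_1) \cdot (N_1 + N_2)}{(P_2 - P_1) \cdot (P_2 - P_1)} \\
    h_{110} &= N_1 + N_2 - v_{12} (P_2 - P_1) \\
    n_{110} &= h_{110} / \lVert h_{110} \rVert
    \end{aligned}
    $$

    This is the standard reflection $a - 2 \frac{a \cdot d}{d \cdot d} d$ of $a = N_1 + N_2$ across the plane with normal $d = P_2 - P_1$. Using the sum instead of the average only changes the length, which the final normalization removes.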

    struct TessellationFactors {
        float edge[3] : SV_TessFactor;
        float inside : SV_InsideTessFactor;
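        // 7 position control points + 3 normal control points
        // The [7] array parameters of the earlier functions must become [10] as well
        float3 bezierPoints[10] : BEZIERPOS;
    };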
    float3 CalculateBezierControlNormal(float3 p0PositionWS, float3 aNormalWS, float3 p1PositionWS, float3 bNormalWS) {
        float3 d = p1PositionWS - p0PositionWS;
        float v = 2 * dot(d, aNormalWS + bNormalWS) / dot(d, d);
        return normalize(aNormalWS + bNormalWS - v * d);
    }
    void CalculateBezierNormalPoints(inout float3 bezierPoints[10],
        float3 p0PositionWS, float3 p0NormalWS, float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
        bezierPoints[7] = CalculateBezierControlNormal(p0PositionWS, p0NormalWS, p1PositionWS, p1NormalWS);
        bezierPoints[8] = CalculateBezierControlNormal(p1PositionWS, p1NormalWS, p2PositionWS, p2NormalWS);
        bezierPoints[9] = CalculateBezierControlNormal(p2PositionWS, p2NormalWS, p0PositionWS, p0NormalWS);
    }
    // The patch constant function runs once per triangle, or "patch"
    // It runs in parallel to the hull function
    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        TessellationFactors f = (TessellationFactors)0;
        // Check if this patch should be culled (it is out of view)
        if (ShouldClipPatch(...)) {
            // The factors stay at 0, which culls the patch
        } else {
            CalculateBezierControlPoints(f.bezierPoints,
              patch[0].positionWS, patch[0].normalWS, patch[1].positionWS, 
              patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
            CalculateBezierNormalPoints(f.bezierPoints,
              patch[0].positionWS, patch[0].normalWS, patch[1].positionWS, 
              patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
        }
        return f;
    }

    Note that all interpolated normal vectors need to be normalized.

    float3 CalculateBezierNormal(float3 bary, float3 bezierPoints[10],
        float3 p0NormalWS, float3 p1NormalWS, float3 p2NormalWS) {
        return p0NormalWS * (bary.x * bary.x) +
            p1NormalWS * (bary.y * bary.y) +
            p2NormalWS * (bary.z * bary.z) +
            bezierPoints[7] * (2 * bary.x * bary.y) +
            bezierPoints[8] * (2 * bary.y * bary.z) +
            bezierPoints[9] * (2 * bary.z * bary.x);
    }
    float3 CalculateBezierNormalWithSmoothFactor(float3 bary, float smoothing, float3 bezierPoints[10],
        float3 p0NormalWS, float3 p1NormalWS, float3 p2NormalWS) {
        float3 flatNormalWS = BarycentricInterpolate(bary, p0NormalWS, p1NormalWS, p2NormalWS);
        float3 smoothedNormalWS = CalculateBezierNormal(bary, bezierPoints, p0NormalWS, p1NormalWS, p2NormalWS);
        return normalize(lerp(flatNormalWS, smoothedNormalWS, smoothing));
    }
    // The domain function runs once per vertex in the final, tessellated mesh
    // Use it to reposition vertices and prepare for the fragment stage
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
        Interpolators output;
        // Calculate tessellation smoothing multipler
        float smoothing = _TessellationSmoothing;
        float3 positionWS = CalculateBezierPosition(barycentricCoordinates, smoothing, factors.bezierPoints, patch[0].positionWS, patch[1].positionWS, patch[2].positionWS);
        float3 normalWS = CalculateBezierNormalWithSmoothFactor(
            barycentricCoordinates, smoothing, factors.bezierPoints,
            patch[0].normalWS, patch[1].normalWS, patch[2].normalWS);
        float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
        // ...
    }

    There is one more problem to be aware of. Once we use the interpolated normal, the tangent vector that came with it is no longer orthogonal to the normal. To restore orthogonality, a new tangent vector has to be calculated.

    void CalculateBezierNormalAndTangent(
        float3 bary, float smoothing, float3 bezierPoints[10],
        float3 p0NormalWS, float3 p0TangentWS, 
        float3 p1NormalWS, float3 p1TangentWS, 
        float3 p2NormalWS, float3 p2TangentWS,
        out float3 normalWS, out float3 tangentWS) {
        float3 flatNormalWS = BarycentricInterpolate(bary, p0NormalWS, p1NormalWS, p2NormalWS);
        float3 smoothedNormalWS = CalculateBezierNormal(bary, bezierPoints, p0NormalWS, p1NormalWS, p2NormalWS);
        normalWS = normalize(lerp(flatNormalWS, smoothedNormalWS, smoothing));
        float3 flatTangentWS = BarycentricInterpolate(bary, p0TangentWS, p1TangentWS, p2TangentWS);
        float3 flatBitangentWS = cross(flatNormalWS, flatTangentWS);
        tangentWS = normalize(cross(flatBitangentWS, normalWS));
    }
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
        // ...
        float3 normalWS, tangentWS;
        CalculateBezierNormalAndTangent(
            barycentricCoordinates, smoothing, factors.bezierPoints,
            patch[0].normalWS, patch[0].tangentWS.xyz, 
            patch[1].normalWS, patch[1].tangentWS.xyz, 
            patch[2].normalWS, patch[2].tangentWS.xyz,
            normalWS, tangentWS);
        // ...
    }
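
    As an alternative to the double cross product, the tangent could also be re-orthogonalized with a single Gram-Schmidt step. This is a different technique from the one used above, shown only as a minimal sketch; the function name OrthogonalizeTangent is made up for this example.

    ```hlsl
    // Gram-Schmidt: remove the component of the tangent that points along the normal
    // Assumes the tangent is not parallel to the normal (otherwise the result is undefined)
    float3 OrthogonalizeTangent(float3 normalWS, float3 tangentWS) {
        float3 n = normalize(normalWS);
        float3 t = tangentWS - n * dot(tangentWS, n);
        return normalize(t);
    }
    ```

    It would be called with the smoothed normal and the barycentrically interpolated tangent. The sign stored in tangentWS.w is left unchanged, since the step only adjusts the direction of the tangent within the plane spanned by the original tangent and the normal.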


  • Unity可互动可砍断八叉树草海渲染 – 几何、计算着色器(BIRP/URP)

    Unity Interactive, Cuttable Grass Field Rendering with an Octree – Geometry and Compute Shaders (BIRP/URP)

    Project (BIRP) on Github:

    First, here is a screenshot of 100,500 blades of grass rendered with a Compute Shader on my M1 Pro without any optimization. It runs at more than 200 FPS.

    After adding octree frustum culling, distance fading and other operations, the frame rate became less stable (I want to die). My guess is that the CPU is under too much pressure each frame, since it has to maintain such a large amount of grass data. But as long as enough grass is culled, 700+ FPS is no problem (what a relief). The octree depth also needs to be tuned to the actual scene; in the figure below it is set to 5.


    This article keeps getting longer. I mainly use it to review what I have learned, so you may find that it contains a lot of basic material. I am a complete novice, and I welcome discussion and corrections.

    This article mainly has two stages:

    • The GS + TS method achieves the most basic effect of grass rendering
    • Then I used CS to re-render the sea of grass, adding various optimization methods

    The rendering method of geometry shader + tessellation shader should be relatively simple, but the performance ceiling is relatively low and the platform compatibility is poor.

    The method of combining compute shaders with GPU Instancing should be the mainstream method in the current industry, and it can also run well on mobile terminals.

    The Compute Shader rendering of the grass field in this article mainly follows the implementations of Colin and Minions Art, and is closer to a hybrid of the two (the former has been analyzed by an expert on Zhihu in "Grass rendering study notes based on GPU Instance"). It uses three ComputeBuffers: one holding all the grass, one that is appended to and passed to the Material, and one visible buffer (rebuilt in real time from frustum culling). Space is partitioned with a quad/octree (odd-even depth), and frustum culling yields the indices of all grass inside the current frustum. These indices are passed to the Compute Shader for further processing (mesh generation, quaternion rotation, LOD and so on), and a variable-length ComputeBuffer (ComputeBufferType.Append) then passes the grass to be rendered to the Material through instancing for the final draw.
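
    To make the append-buffer idea concrete, here is a minimal compute shader sketch that culls blades per thread on the GPU. It is a simplified stand-in, not the octree-based CPU culling described above, and every buffer, property and kernel name in it is hypothetical. The frustum planes are assumed to be uploaded from C# as float4 values (xyz = inward-facing normal, w = distance), for example from GeometryUtility.CalculateFrustumPlanes.

    ```hlsl
    #pragma kernel CullGrass

    // All names below are hypothetical and only serve this sketch
    StructuredBuffer<float3> _AllGrassPositions;       // World-space root of every blade
    AppendStructuredBuffer<uint> _VisibleGrassIndices; // Filled with the indices that survive
    float4 _FrustumPlanes[6];                          // xyz = inward normal, w = distance
    uint _GrassCount;
    float _GrassBoundRadius;                           // Conservative radius around a blade

    [numthreads(64, 1, 1)]
    void CullGrass(uint3 id : SV_DispatchThreadID) {
        if (id.x >= _GrassCount) {
            return;
        }
        float3 positionWS = _AllGrassPositions[id.x];
        bool visible = true;
        // A blade is culled when its bounding sphere lies fully behind any plane
        for (int i = 0; i < 6; i++) {
            float signedDistance = dot(_FrustumPlanes[i].xyz, positionWS) + _FrustumPlanes[i].w;
            if (signedDistance < -_GrassBoundRadius) {
                visible = false;
            }
        }
        if (visible) {
            _VisibleGrassIndices.Append(id.x);
        }
    }
    ```

    The append buffer's counter has to be reset before each dispatch, and the number of appended elements can be copied into an indirect-arguments buffer for the instanced draw. With an octree on the CPU, the same kernel could run only over the blades of the nodes that pass the coarse test.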

    You can also use the Hi-Z solution to eliminate it. I'm digging a hole and working hard to learn.

    In addition, I referred to the article by Minions Art and copied a set of editor grass brushing tools (incomplete version), which stores the positions of all grass vertices by maintaining a vertex list.

    Furthermore, by maintaining another set of Cut Buffer, if the grass is marked with a -1 value, it will not be processed. If it is marked with a non--1 value of the chopper height, it will be passed to the Material, and through the WorldPos + Split.y plus the lerp operation, the upper half of the grass will be made invisible, and the color of the grass will be modified, and finally some grass clippings will be added to achieve a grass-cutting effect.

    The previous article introduced in detail what a tessellation shader is and how to optimize it. Next, I integrate tessellation into actual development. I also combine it with the compute shaders I learned a few days ago to create a grass field; more details on compute shaders are in the note linked here. The following small effects are implemented in this article, with complete code attached:

    • Grass Rendering
        • Geometry Shader (BIRP/URP)
            • Define grass width, height, orientation, tilt, curvature, gradient, color, band, normal
            • Integer tessellation
            • Visibility Map added in URP
        • Compute Shader (BIRP/URP), works on macOS
            • Octree frustum culling
            • Distance fading
    • Grass Interaction
        • Interactive Geometry Shader (BIRP/URP)
        • Interactive Compute Shader (BIRP), works on macOS
    • Unity custom grass generation tool
    • Grass cutting system

    Main reference articles (borrowed from heavily):

    There are many ways to render grass, two of which are shown in this article:

    • Geometry Shader + Tessellation Shader
    • Compute Shaders + GPU Instancing

    First, the first solution has serious limitations. Many mobile devices, and Metal, do not support GS, and a GS regenerates the mesh every frame, which is quite expensive.

    Second, does this mean macOS can no longer run geometry shaders? Not quite. If you want to use GS, you must use OpenGL rather than Metal. Note, however, that Apple only supports OpenGL up to version 4.1, and that version does not support compute shaders. Macs of the Intel era could support OpenGL 4.3 and run CS and GS at the same time, but M-series chips have no such option: it is either OpenGL 4.1 or Metal. On my M1 Pro MacBook Pro, even in a virtual machine (Parallels 18+ provides DX11 and Vulkan), Vulkan is translated on macOS and is essentially Metal underneath, so there is still no GS. In short, there has been no native GS on macOS since the M1.

    Furthermore, Metal does not even support tessellation shaders directly. Apple does not want to support these two stages in hardware at all, because they are too inefficient. On M-series chips, TS is even emulated with CS.

    To sum up, geometry shaders are a dead-end technology, especially since the arrival of mesh shaders. GS is very popular in Unity, but any similar effect can be achieved with CS plus instancing, and more efficiently. New graphics cards will continue to support GS, and quite a few games on the market still use it; Apple simply chose not to keep compatibility and cut it.

    This article explains in detail why GS is so slow: simply put, Intel optimized GS by blocking threads and similar techniques, while other chips do not have this optimization.

    This article is a study note and is likely to contain errors.

    1. Overview of Geometry Shader Rendering Grass (BIRP)

    This chapter is a concise summary of Roystan's tutorial. If you need the project files or the final code, you can download them from the original article, or read the article by Socrates has no bottom.

    1.1 Overview

    After the Domain Stage, you can choose to use a geometry shader.

    A geometry shader takes a whole primitive as input and is able to generate vertices on output. The input to a geometry shader is the vertices of a complete primitive (three vertices for a triangle, two vertices for a line or a single vertex for a point). The geometry shader is called once for each primitive.

    Download the initial project from the web.

    1.2 Drawing a triangle

    Draw a triangle.

    // Add inside the CGINCLUDE block.
    struct geometryOutput
    {
        float4 pos : SV_POSITION;
    };
    // Vertex shader: pass the object-space position through unchanged.
    //     return vertex;
    // Emit one triangle per input primitive.
    [maxvertexcount(3)]
    void geo(triangle float4 IN[3] : SV_POSITION, inout TriangleStream<geometryOutput> triStream)
    {
        geometryOutput o;
        o.pos = UnityObjectToClipPos(float4(0.5, 0, 0, 1));
        triStream.Append(o);
        o.pos = UnityObjectToClipPos(float4(-0.5, 0, 0, 1));
        triStream.Append(o);
        o.pos = UnityObjectToClipPos(float4(0, 1, 0, 1));
        triStream.Append(o);
    }
    // Add inside the SubShader Pass, just below the #pragma fragment frag line.
    #pragma geometry geo

    We actually draw a triangle for each vertex in the mesh, but the positions we assign to the triangle vertices are constant - they don't change for each input vertex - placing all the triangles on top of each other.

    1.3 Vertex Offset

    Therefore, we can just make an offset according to the position of each vertex.

    // Add to the top of the geometry shader.
    float3 pos = IN[0].xyz;
    // Update each assignment of o.pos.
    o.pos = UnityObjectToClipPos(pos + float3(0.5, 0, 0));
    o.pos = UnityObjectToClipPos(pos + float3(-0.5, 0, 0));
    o.pos = UnityObjectToClipPos(pos + float3(0, 1, 0));

    1.4 Rotating blades

    Note that all triangles are currently emitted in the same direction, regardless of the surface they grow on. To correct this, a tangent-to-local (TBN) matrix is built from the vertex normal and tangent, and every offset is multiplied by it. The code is also tidied up at this point.

    float3 vNormal = IN[0].normal;
    float4 vTangent = IN[0].tangent;
    float3 vBinormal = cross(vNormal, vTangent.xyz) * vTangent.w;
    float3x3 tangentToLocal = float3x3(
        vTangent.x, vBinormal.x, vNormal.x,
        vTangent.y, vBinormal.y, vNormal.y,
        vTangent.z, vBinormal.z, vNormal.z
    );
    triStream.Append(VertexOutput(pos + mul(tangentToLocal, float3(0.5, 0, 0))));
    triStream.Append(VertexOutput(pos + mul(tangentToLocal, float3(-0.5, 0, 0))));
    triStream.Append(VertexOutput(pos + mul(tangentToLocal, float3(0, 0, 1))));

    1.5 Coloring

    Then define the upper and lower colors of the grass, and use UV to make a lerp gradient.

    return lerp(_BottomColor, _TopColor, i.uv.y);

    1.6 Rotation Matrix Principle

    Give each blade a random orientation by constructing a rotation matrix. The principle is also covered in GAMES101, and there is a very clear video of the formula derivation. The idea of the derivation: assuming the vector $a$ rotates around the axis $n$ to become $b$, decompose $a$ into a component parallel to $n$ (which stays constant under the rotation) plus a component perpendicular to $n$.

    // Rotation matrix for a rotation of `angle` radians about the unit vector `axis`.
    float3x3 AngleAxis3x3(float angle, float3 axis)
    {
        float c, s;
        sincos(angle, s, c);
        float t = 1 - c;
        float x = axis.x;
        float y = axis.y;
        float z = axis.z;
        return float3x3(
            t * x * x + c, t * x * y - s * z, t * x * z + s * y,
            t * x * y + s * z, t * y * y + c, t * y * z - s * x,
            t * x * z - s * y, t * y * z + s * x, t * z * z + c
        );
    }

    The rotation matrix $R$ is calculated here using Rodrigues' rotation formula: $$R = I + \sin(\theta)\,[k]_{\times} + (1-\cos(\theta))\,[k]_{\times}^{2}$$

    Here $\theta$ is the rotation angle, $k$ is the unit rotation axis, $I$ is the identity matrix, and $[k]_{\times}$ is the antisymmetric (cross-product) matrix corresponding to the axis $k$.

    For a unit vector $k=(x,y,z)$, the antisymmetric matrix is $[k]_{\times}=\left[\begin{array}{ccc} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{array}\right]$. Writing $c=\cos\theta$, $s=\sin\theta$ and $t=1-c$, as in the code, the matrix elements are:

    $$ R = \left[\begin{array}{ccc} tx^2 + c & txy - sz & txz + sy \\ txy + sz & ty^2 + c & tyz - sx \\ txz - sy & tyz + sx & tz^2 + c \end{array}\right] $$

    float3x3 facingRotationMatrix = AngleAxis3x3(rand(pos) * UNITY_TWO_PI, float3(0, 0, 1));
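
    As a sanity check of the matrix above, the following Python sketch (run offline, not shader code) ports AngleAxis3x3 and compares it with a direct evaluation of Rodrigues' formula. The helper names matmul, rodrigues and max_abs_diff, and the test axis and angle, are my own choices for illustration.

```python
import math

def angle_axis_3x3(angle, axis):
    """Port of AngleAxis3x3: rotation about a unit axis, rows listed top to bottom."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]

def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rodrigues(angle, axis):
    """R = I + sin(theta) K + (1 - cos(theta)) K^2, K being the cross-product matrix."""
    x, y, z = axis
    K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    K2 = matmul(K, K)
    s, t = math.sin(angle), 1.0 - math.cos(angle)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + t * K2[i][j]
             for j in range(3)] for i in range(3)]

def max_abs_diff(a, b):
    """Largest element-wise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))

axis = (2 / 7, 3 / 7, 6 / 7)   # unit vector, since 4 + 9 + 36 = 49
angle = 1.234                  # arbitrary test angle in radians
difference = max_abs_diff(angle_axis_3x3(angle, axis), rodrigues(angle, axis))
print(difference)              # expected to be at floating-point rounding level
```

    The two agree only when the axis is a unit vector, which is why the shader passes fixed unit axes such as float3(0, 0, 1).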

    1.7 Blade tipping

    Now that each blade faces a random direction, tilt it by a random angle. In the code below the rotation is about the negative x axis.

    float3x3 bendRotationMatrix = AngleAxis3x3(rand(pos.zzx) * _BendRotationRandom * UNITY_PI * 0.5, float3(-1, 0, 0));

    1.8 Leaf size

    Adjust the width and height of the grass. Originally, both the height and the width were one unit. To make the grass look more natural, a random offset is added to each.

    _BladeWidth("Blade Width", Float) = 0.05
    _BladeWidthRandom("Blade Width Random", Float) = 0.02
    _BladeHeight("Blade Height", Float) = 0.5
    _BladeHeightRandom("Blade Height Random", Float) = 0.3
    float height = (rand(pos.zyx) * 2 - 1) * _BladeHeightRandom + _BladeHeight;
    float width = (rand(pos.xzy) * 2 - 1) * _BladeWidthRandom + _BladeWidth;
    triStream.Append(VertexOutput(pos + mul(transformationMatrix, float3(width, 0, 0)), float2(0, 0)));
    triStream.Append(VertexOutput(pos + mul(transformationMatrix, float3(-width, 0, 0)), float2(1, 0)));
    triStream.Append(VertexOutput(pos + mul(transformationMatrix, float3(0, 0, height)), float2(0.5, 1)));

    1.9 Tessellation

    Since the ground mesh has too few vertices and therefore too few blades, the ground surface is tessellated here to produce more.

    1.10 Perturbations

    To animate the grass, scroll the UVs of a wind distortion texture with _Time. Sample the texture, then build a wind rotation matrix from the sample and apply it to the grass.

    float2 uv = pos.xz * _WindDistortionMap_ST.xy + _WindDistortionMap_ST.z + _WindFrequency * _Time.y;
    float2 windSample = (tex2Dlod(_WindDistortionMap, float4(uv, 0, 0)).xy * 2 - 1) * _WindStrength;
    float3 wind = normalize(float3(windSample.x, windSample.y, 0));
    float3x3 windRotation = AngleAxis3x3(UNITY_PI * windSample, wind);
    float3x3 transformationMatrix = mul(mul(mul(tangentToLocal, windRotation), facingRotationMatrix), bendRotationMatrix);

    1.11 Fixed blade rotation issue

    At this point the wind may also rotate the base of the blade about the x and y axes, which shows up as:

    For the two base vertices, use a separate matrix that only applies the facing rotation about z.

    float3x3 transformationMatrixFacing = mul(tangentToLocal, facingRotationMatrix);
    triStream.Append(VertexOutput(pos + mul(transformationMatrixFacing, float3(width, 0, 0)), float2(0, 0)));
    triStream.Append(VertexOutput(pos + mul(transformationMatrixFacing, float3(-width, 0, 0)), float2(1, 0)));

    1.12 Blade curvature

    To give the blades curvature, more vertices are needed. Since double-sided rendering is enabled, the winding order of the vertices does not matter. A for loop interpolates manually along the blade to build the triangles, and a forward offset is computed to bend the blade.

    float forward = rand(pos.yyz) * _BladeForward;
    for (int i = 0; i < BLADE_SEGMENTS; i++)
    {
        float t = i / (float)BLADE_SEGMENTS;
        // Add below the line declaring float t.
        float segmentHeight = height * t;
        float segmentWidth = width * (1 - t);
        float segmentForward = pow(t, _BladeCurve) * forward;
        float3x3 transformMatrix = i == 0 ? transformationMatrixFacing : transformationMatrix;
        triStream.Append(GenerateGrassVertex(pos, segmentWidth, segmentHeight, segmentForward, float2(0, t), transformMatrix));
        triStream.Append(GenerateGrassVertex(pos, -segmentWidth, segmentHeight, segmentForward, float2(1, t), transformMatrix));
    }
    // Tip vertex.
    triStream.Append(GenerateGrassVertex(pos, 0, height, forward, float2(0.5, 1), transformationMatrix));
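
    To see what the loop produces, here is a small Python sketch (offline, not shader code) that lists the blade-local offsets before any matrix is applied. The function name blade_vertices and the sample values are assumptions for illustration only.

```python
def blade_vertices(width, height, forward, curve, segments):
    """Return (x, forward_offset, z, uv) tuples in triangle-strip order."""
    verts = []
    for i in range(segments):
        t = i / segments
        seg_height = height * t
        seg_width = width * (1.0 - t)          # blade narrows towards the tip
        seg_forward = (t ** curve) * forward   # bend grows with height
        verts.append((seg_width, seg_forward, seg_height, (0.0, t)))
        verts.append((-seg_width, seg_forward, seg_height, (1.0, t)))
    verts.append((0.0, forward, height, (0.5, 1.0)))  # single tip vertex
    return verts

# Sample values chosen only for this illustration.
blade = blade_vertices(width=0.05, height=0.5, forward=0.2, curve=2.0, segments=3)
for vertex in blade:
    print(vertex)
```

    Each segment contributes two vertices and the tip one more, so a blade has 2 * BLADE_SEGMENTS + 1 vertices, and the maxvertexcount of the geometry shader has to allow at least that many.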

    1.13 Creating Shadows

    Cast shadows from a second Pass.

    Pass
    {
        Tags { "LightMode" = "ShadowCaster" }
        CGPROGRAM
        #pragma vertex vert
        #pragma geometry geo
        #pragma fragment frag
        #pragma hull hull
        #pragma domain domain
        #pragma target 4.6
        #pragma multi_compile_shadowcaster
        float4 frag(geometryOutput i) : SV_Target
        {
            SHADOW_CASTER_FRAGMENT(i)
        }
        ENDCG
    }

    1.14 Receiving Shadows

    Use SHADOW_ATTENUATION directly in the fragment shader to read the shadow.

    // geometryOutput struct.
    unityShadowCoord4 _ShadowCoord : TEXCOORD1;
    o._ShadowCoord = ComputeScreenPos(o.pos);
    #pragma multi_compile_fwdbase

    1.15 Removing shadow acne

    Apply a linear shadow bias to remove shadow acne on the surface.

        o.pos = UnityApplyLinearShadowBias(o.pos);

    1.16 Adding Normals

    Add normal information to vertices generated by the geometry shader.

    struct geometryOutput
    {
        float4 pos : SV_POSITION;
        float2 uv : TEXCOORD0;
        unityShadowCoord4 _ShadowCoord : TEXCOORD1;
        float3 normal : NORMAL;
    };
    o.normal = UnityObjectToWorldNormal(normal);

    1.17 Full code‼️ (BIRP)

    The final effect.



    2. Geometry Shader Rendering Grass (URP)

    2.1 References

    I have already written the BIRP version, and now I just need to port it.

    • URP code specification reference:
    • BIRP->URP quick reference table:

    You can follow this article by Daniel, or follow along with my changes to the code. Note that the space transformation code in the original repo has problems; a fix can be found in its pull requests.

    Now port the BIRP tessellation shader above:

    • Change the Tags to URP
    • Replace the included header files with their URP versions
    • Wrap the variables in a CBuffer
    • Port the shadow casting and receiving code

    2.2 Start to change

    Declare the URP pipeline.

    LOD 100
    Cull Off
    Tags
    {
        "RenderType" = "Opaque"
        "Queue" = "Geometry"
        "RenderPipeline" = "UniversalPipeline"
    }

    Import the URP library.

    #include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Core.hlsl"
    #include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Lighting.hlsl"
    #include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/ShaderVariablesFunctions.hlsl"
    o._ShadowCoord = ComputeScreenPos(o.pos);

    Change the function.

    // o.normal = UnityObjectToWorldNormal(normal);
    o.normal = TransformObjectToWorldNormal(normal);

    URP receives the shadow. It is best to calculate this in the vertex shader, but for convenience, it is all calculated in the geometry shader.

    Then generate the shadows. ShadowCaster Pass.

        Name "ShadowCaster"
        Tags{ "LightMode" = "ShadowCaster" }
        ZWrite On
        ZTest LEqual
            half4 frag(geometryOutput input) : SV_TARGET {
                return 1;
            }

    2.3 Full code‼️(URP)

    3. Optimize tessellation logic (BIRP/URP)

    3.1 Organize the code

    Above we only used a fixed subdivision level, which is not good enough. If you do not understand the principle of tessellation, see my tessellation article, which details several ways to optimize subdivision.

    I use the BIRP version of the code that I completed in Section 1 as an example. The current version only has the Uniform subdivision.

    _TessellationUniform("Tessellation Uniform", Range(1, 64)) = 1

    The output structures of each stage are quite confusing, so let's reorganize them.

    3.1 Partitioning Mode

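    [KeywordEnum(INTEGER, FRAC_EVEN, FRAC_ODD, POW2)] _PARTITIONING("Partition algorithm", Float) = 0
    // Select the hull shader partitioning attribute from the keyword.
    #if defined(_PARTITIONING_INTEGER)
        [partitioning("integer")]
    #elif defined(_PARTITIONING_FRAC_EVEN)
        [partitioning("fractional_even")]
    #elif defined(_PARTITIONING_FRAC_ODD)
        [partitioning("fractional_odd")]
    #elif defined(_PARTITIONING_POW2)
        [partitioning("pow2")]
    #endif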

    3.2 Subdivided Frustum Culling

    In BIRP, use _ProjectionParams.z to represent the far plane, and in URP use UNITY_RAW_FAR_CLIP_VALUE.

    bool IsOutOfBounds(float3 p, float3 lower, float3 higher) { // Axis-aligned box test
        return p.x < lower.x || p.x > higher.x || p.y < lower.y || p.y > higher.y || p.z < lower.z || p.z > higher.z;
    }
    bool IsPointOutOfFrustum(float4 positionCS) { // View frustum test
        float3 culling = positionCS.xyz;
        float w = positionCS.w;
        float3 lowerBounds = float3(-w, -w, -w * _ProjectionParams.z);
        float3 higherBounds = float3(w, w, w);
        return IsOutOfBounds(culling, lowerBounds, higherBounds);
    }
    // Cull the patch only when all three control points are outside the frustum.
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
            IsPointOutOfFrustum(p1PositionCS) &&
            IsPointOutOfFrustum(p2PositionCS);
        return allOutside;
    }
    TessellationControlPoint vert(Attributes v)
    {
        TessellationControlPoint o;
        o.positionCS = UnityObjectToClipPos(v.vertex);
        // ... other members
        return o;
    }
    TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch)
    {
        TessellationFactors f;
        if(ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS)){
            // A factor of 0 discards the patch.
            f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0;
        } else {
            f.edge[0] = _TessellationFactor;
            f.edge[1] = _TessellationFactor;
            f.edge[2] = _TessellationFactor;
            f.inside = _TessellationFactor;
        }
        return f;
    }

    Note, however, that the test is performed on the clip-space positions of the ground triangle the grass grows from. If the triangle has completely left the screen but its grass is tall enough to still be on screen, the grass will suddenly disappear. Whether this is acceptable depends on the project: with a top-down camera and relatively short grass, this culling can be used.
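
    The over-culling can be reproduced offline with a Python sketch of the same clip-space test. The raw_far_clip parameter stands in for the far-plane value discussed above, and the tolerance parameter is a hypothetical mitigation of my own, not part of the original shader: it pads the bounds so that patches just off screen are kept.

```python
def is_out_of_bounds(p, lower, higher):
    """True if any component of p lies outside [lower, higher]."""
    return any(c < lo or c > hi for c, lo, hi in zip(p, lower, higher))

def is_point_out_of_frustum(position_cs, raw_far_clip=1.0, tolerance=0.0):
    """Clip-space test: -w <= x, y <= w and -w * raw_far_clip <= z <= w."""
    x, y, z, w = position_cs
    lower = (-w - tolerance, -w - tolerance, -w * raw_far_clip - tolerance)
    higher = (w + tolerance, w + tolerance, w + tolerance)
    return is_out_of_bounds((x, y, z), lower, higher)

def should_clip_patch(p0, p1, p2, raw_far_clip=1.0, tolerance=0.0):
    """Cull only when all three control points are outside the frustum."""
    return all(is_point_out_of_frustum(p, raw_far_clip, tolerance)
               for p in (p0, p1, p2))

# Ground triangle just below the bottom edge of the screen (y < -w).
ground = [(-0.2, -1.2, 0.5, 1.0), (0.0, -1.2, 0.5, 1.0), (0.2, -1.2, 0.5, 1.0)]
# A blade tip grown from that triangle would still be on screen.
tip = (0.0, -0.7, 0.5, 1.0)

culled = should_clip_patch(*ground)
tip_visible = not is_point_out_of_frustum(tip)
culled_with_padding = should_clip_patch(*ground, tolerance=0.5)
print(culled, tip_visible, culled_with_padding)
```

    The patch is culled even though its blade tip would be visible; with the padded bounds the same patch is kept, at the cost of tessellating some patches that are truly off screen.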

    From a top-down view this is not a big problem.

    From a prone, ground-level view, the grass is incomplete because it is over-culled.

    3.3 Fine-grained control of screen distance

    The grass is dense nearby and sparse in the distance, with the density based on the on-screen distance between vertices (measured in clip space, CS). This method is affected by the resolution.

    float EdgeTessellationFactor(float scale, float4 p0PositionCS, float4 p1PositionCS) {
        // Distance between the two control points after the perspective divide.
        float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) / scale;
        return max(1, factor);
    }
    TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch)
    {
        TessellationFactors f;
        f.edge[0] = EdgeTessellationFactor(_TessellationFactor, 
            patch[1].positionCS, patch[2].positionCS);
        f.edge[1] = EdgeTessellationFactor(_TessellationFactor, 
            patch[2].positionCS, patch[0].positionCS);
        f.edge[2] = EdgeTessellationFactor(_TessellationFactor, 
            patch[0].positionCS, patch[1].positionCS);
        f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;
        #if defined(_CUTTESS_TRUE)
            if(ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS))
                f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0;
        #endif
        return f;
    }

    Tessellation Factor = 0.08

    Fractional partitioning is not recommended with this method: it causes strong flickering, which is very distracting. I do not like this method very much.
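
    A rough numeric illustration of the screen-distance factor, as an offline Python sketch: two edges of the same clip-space extent, one close to the camera (w = 1) and one far away (w = 10), with the scale of 0.08 used in the screenshot above. The sample coordinates are made up for this example.

```python
import math

def edge_tessellation_factor(scale, p0_cs, p1_cs):
    """Distance between two clip-space points after the perspective divide,
    divided by scale and clamped to at least 1."""
    a = [c / p0_cs[3] for c in p0_cs[:3]]
    b = [c / p1_cs[3] for c in p1_cs[:3]]
    return max(1.0, math.dist(a, b) / scale)

# Same edge extent in clip space, different w (distance from the camera).
near = edge_tessellation_factor(0.08, (-0.5, 0.0, 0.5, 1.0), (0.5, 0.0, 0.5, 1.0))
far = edge_tessellation_factor(0.08, (-0.5, 0.0, 9.0, 10.0), (0.5, 0.0, 9.0, 10.0))
print(near, far)
```

    The near edge gets a factor roughly ten times larger than the far edge, because the perspective divide shrinks the far edge by its w.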

    3.4 Camera distance classification

    Calculate the ratio of "the distance between two points" to "the distance between the midpoint of the two vertices and the camera position". The larger the ratio, the larger the space occupied on the screen, and the more subdivision is required.

    float EdgeTessellationFactor_WorldBase(float scale, float3 p0PositionWS, float3 p1PositionWS) {
        float length = distance(p0PositionWS, p1PositionWS);
        float distanceToCamera = distance(_WorldSpaceCameraPos, (p0PositionWS + p1PositionWS) * 0.5);
        // The distance is squared here, so the factor falls off quickly.
        float factor = length / (scale * distanceToCamera * distanceToCamera);
        return max(1, factor);
    }
    f.edge[0] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
        patch[1].vertex, patch[2].vertex);
    f.edge[1] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
        patch[2].vertex, patch[0].vertex);
    f.edge[2] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
        patch[0].vertex, patch[1].vertex);
    f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;

    There is still room for improvement. Adjust the density of the grass so that the grass at close distance is not too dense, and the grass curve at medium distance is smoother, and introduce a nonlinear factor to control the relationship between distance and tessellation factor.

    float EdgeTessellationFactor_WorldBase(float scale, float3 p0PositionWS, float3 p1PositionWS) {
        float length = distance(p0PositionWS, p1PositionWS);
        float distanceToCamera = distance(_WorldSpaceCameraPos, (p0PositionWS + p1PositionWS) * 0.5);
        // Use the square root function to adjust the effect of distance to make the tessellation factor change more smoothly at medium distances
        float adjustedDistance = sqrt(distanceToCamera);
        // Adjust the impact of scale. You may need to further fine-tune the coefficient here based on the actual effect.
        float factor = length / (scale * adjustedDistance);
        return max(1, factor);
    }

    This is more appropriate.
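
    To compare the two falloffs numerically, here is a short offline Python sketch. The scale of 0.01 and the edge length of 1 are made-up values chosen only to make the difference visible.

```python
import math

def factor_inverse_square(scale, edge_length, dist):
    """First version: divide by the squared camera distance."""
    return max(1.0, edge_length / (scale * dist * dist))

def factor_sqrt(scale, edge_length, dist):
    """Adjusted version: divide by the square root of the camera distance."""
    return max(1.0, edge_length / (scale * math.sqrt(dist)))

# Illustrative values only: scale = 0.01, edge length = 1 world unit.
rows = [(d, factor_inverse_square(0.01, 1.0, d), factor_sqrt(0.01, 1.0, d))
        for d in (1.0, 4.0, 16.0, 64.0)]
for d, inv_sq, root in rows:
    print(f"distance {d:5.1f}: inverse-square {inv_sq:7.2f}, sqrt {root:7.2f}")
```

    Both versions start from the same factor at distance 1. The inverse-square version has already collapsed to the minimum of 1 at distance 16, while the square-root version still subdivides noticeably there, which matches the smoother medium-distance behavior described above.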

    3.5 Visibility Map Controls Grass Subdivision

    The vertex shader samples the visibility texture and passes the value on to the tessellation stage, where the patch constant function (PCF) uses it to compute the tessellation factors.

    Take FIXED mode as an example:

    _VisibilityMap("Visibility Map", 2D) = "white" {}
    TEXTURE2D(_VisibilityMap); SAMPLER(sampler_VisibilityMap);
    struct Attributes
    {
        // ... other members
        float2 uv : TEXCOORD0;
    };
    struct TessellationControlPoint
    {
        // ... other members
        float visibility : TEXCOORD1;
    };
    TessellationControlPoint vert(Attributes v){
        TessellationControlPoint o;
        // ... other members
        float visibility = SAMPLE_TEXTURE2D_LOD(_VisibilityMap, sampler_VisibilityMap, v.uv, 0).r;
        o.visibility = visibility;
        return o;
    }
    TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch){
        TessellationFactors f;
        float averageVisibility = (patch[0].visibility + patch[1].visibility + patch[2].visibility) / 3; // Average grayscale value of the three vertices
        float baseTessellationFactor = _TessellationFactor_FIXED;
        float tessellationMultiplier = lerp(0.1, 1.0, averageVisibility); // Scale the factor by the average gray value
        #if defined(_DYNAMIC_FIXED)
            f.edge[0] = baseTessellationFactor * tessellationMultiplier;
            f.edge[1] = baseTessellationFactor * tessellationMultiplier;
            f.edge[2] = baseTessellationFactor * tessellationMultiplier;
            f.inside  = baseTessellationFactor * tessellationMultiplier;
        #endif
        return f;
    }
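
    The effect of the multiplier can be checked with a few lines of Python (an offline sketch; the base factor of 10 is an arbitrary example value):

```python
def lerp(a, b, t):
    """Linear interpolation, matching HLSL lerp."""
    return a + (b - a) * t

def visibility_factor(base_factor, v0, v1, v2):
    """Scale the base tessellation factor by the average visibility of a patch."""
    average = (v0 + v1 + v2) / 3.0
    return base_factor * lerp(0.1, 1.0, average)

white = visibility_factor(10.0, 1.0, 1.0, 1.0)   # fully visible patch
black = visibility_factor(10.0, 0.0, 0.0, 0.0)   # fully masked patch
mixed = visibility_factor(10.0, 1.0, 0.0, 0.5)   # patch on a mask boundary
print(white, black, mixed)
```

    A white texel keeps the full factor, while a black texel still leaves 10% of it, so masked areas become sparse rather than completely empty.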

    3.6 Complete code‼️ (BIRP)

    Grass Shader:

    3.7 Full code ‼ ️ (URP)

    URP differs in a few places. For example, the shadow bias is computed as follows. I will not go into detail; please read the code.

        // o.pos = UnityApplyLinearShadowBias(o.pos);
        o.shadowCoord = TransformWorldToShadowCoord(ApplyShadowBias(posWS, norWS, 0));

    Grass Shader:

    4. Interactive Grassland

    URP and BIRP are exactly the same.

    4.1 Implementation steps

    The principle is very simple. The script transmits the character's world coordinates, and then bends the grass according to the set radius and interaction strength.

    uniform float3 _PositionMoving; // Object position
    float _Radius; // Object interaction radius
    float _Strength; // Interaction strength

    In the grass generation loop, calculate the distance between each grass fragment and the object and adjust the grass position according to this distance.

    float dis = distance(_PositionMoving, posWS); // Calculate distance
    float radiusEffect = 1 - saturate(dis / _Radius); // Calculate effect attenuation based on distance
    float3 sphereDisp = pos - _PositionMoving; // Calculate the position difference
    sphereDisp *= radiusEffect * _Strength; // Apply falloff and intensity
    sphereDisp = clamp(sphereDisp, -0.8, 0.8); // Limit the maximum displacement

    The new positions are then calculated within each blade of grass.

    // Apply interactive effects
    float3 newPos = i == 0 ? pos : pos + (sphereDisp * t);
    triStream.Append(GenerateGrassVertex(newPos, segmentWidth, segmentHeight, segmentForward, float2(0, t), transformMatrix));
    triStream.Append(GenerateGrassVertex(newPos, -segmentWidth, segmentHeight, segmentForward, float2(1, t), transformMatrix));

    Do not forget the vertex outside the for loop, which is the tip of the blade.

    // Final grass fragment
    float3 newPosTop = pos + sphereDisp;
    triStream.Append(GenerateGrassVertex(newPosTop, 0, height, forward, float2(0.5, 1), transformationMatrix));
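
    The displacement logic can be tried out offline with the following Python sketch. It assumes that the blade position used for the distance and for the direction is the same point; the function names and sample values are mine.

```python
import math

def sphere_displacement(pos, mover, radius, strength, limit=0.8):
    """Push a blade away from the moving object, fading out towards the radius."""
    dis = math.dist(mover, pos)
    radius_effect = 1.0 - min(max(dis / radius, 0.0), 1.0)   # 1 - saturate(dis / radius)
    disp = [(p - m) * radius_effect * strength for p, m in zip(pos, mover)]
    return [min(max(d, -limit), limit) for d in disp]        # clamp per component

def displaced_segment(pos, disp, t, is_base):
    """Base vertices stay put; higher segments move in proportion to t."""
    return list(pos) if is_base else [p + d * t for p, d in zip(pos, disp)]

inside = sphere_displacement((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), radius=2.0, strength=1.0)
outside = sphere_displacement((3.0, 0.0, 0.0), (0.0, 0.0, 0.0), radius=2.0, strength=1.0)
clamped = sphere_displacement((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), radius=2.0, strength=10.0)
halfway = displaced_segment((1.0, 0.0, 0.0), inside, t=0.5, is_base=False)
print(inside, outside, clamped, halfway)
```

    Because the base vertices are left in place and the offset is scaled by t, the blade leans away from the object instead of sliding across the ground.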

    In URP, using uniform float3 _PositionMoving may cause SRP Batcher to fail.

    4.2 Script Code

    Bind the object that needs interaction.

    using UnityEngine;
    public class ShaderInteractor : MonoBehaviour
    {
        // Update is called once per frame
        void Update()
        {
            // Publish this object's world position to all shaders.
            Shader.SetGlobalVector("_PositionMoving", transform.position);
        }
    }

    4.3 Full code ‼ ️ (URP)

    Grass shader:

    5. Compute Shader Rendering Grass v1.0

    Why v1.0? Because rendering a sea of grass with compute shaders is quite hard, and many features that are missing now can be added gradually later. I have also written some notes on compute shaders:

    1. Compute Shader Study Notes (I)
    2. Compute Shader Learning Notes (II) Post-processing Effects
    3. Compute Shader Learning Notes (II) Particle Effects and Cluster Behavior Simulation
    4. Compute Shader Learning Notes (Part 3) Grass Rendering

    5.1 Review/Organization

    The Compute Shader notes above fully describe how to write a stylized grass sea from scratch in CS. If you forgot, review it here.

    The CPU still has a lot to do in the initialization stage. It defines the grass mesh and fills the buffers (the width and height of each blade, the position where each blade is generated, its random orientation and its random color shade). It also has to pass the maximum bend value and the grass interaction radius to the Compute Shader.

    Every frame, the CPU additionally passes the time, wind direction, wind strength/speed and wind field scale to the Compute Shader.

    The Compute Shader uses this information to calculate how each blade should rotate, and outputs the result as a quaternion.

    Finally, the shader uses the instance ID to look up the computed results: it first applies the vertex offset, then the quaternion rotation, and finally corrects the normal.

    This demo can be optimized further. More of the work could move into the Compute Shader, such as mesh generation, blade width and height, and random tilting. More parameters could be made adjustable in real time. Various kinds of culling could be added, such as distance culling from the camera position or view frustum culling; this requires some atomic operations. Multi-object interaction could be supported. The interaction logic itself could be refined, for example by making the deformation proportional to a power of the distance to the interacting object. Engine-side features could be added as well, such as a grass-painting tool, which may require quadtree storage.

    Also, in a Compute Shader, use vectors instead of scalars where possible.

    First, tidy up the code. Put every variable that does not need to be sent to the Compute Shader each frame into one initialization function, and organize the Inspector panel. (This involves many code changes.)

    After that, almost all calculations run on the GPU; the exception is the world position of each blade, which is calculated on the CPU and passed to the GPU through a buffer.

    The size of this buffer depends entirely on the size of the ground mesh and the configured density, so for a very large open world the buffer becomes very large. For a 5*5 grass field with Density set to 0.5, approximately 312576 blades are sent, and the data amounts to 4*312576*4 = 5001216 bytes. At a CPU->GPU transfer rate of 8 GB/s, the raw copy alone takes about 0.6 milliseconds, before any driver or synchronization overhead.
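
    The arithmetic can be checked with a few lines of Python. The sketch assumes 8 GB/s means 8×10⁹ bytes per second, counts only the raw copy, and for the larger field assumes that the blade count scales with the area at unchanged density.

```python
blades = 312_576                 # blade count quoted above for the 5*5 field
floats_per_blade = 4
bytes_per_float = 4
buffer_bytes = blades * floats_per_blade * bytes_per_float

bandwidth_bytes_per_s = 8e9      # assumed: 8 GB/s taken as 8 * 10^9 bytes/s
transfer_ms = buffer_bytes / bandwidth_bytes_per_s * 1e3

# Assumption: the blade count scales with the area at unchanged density.
area_ratio = (100 * 100) / (5 * 5)
scaled_bytes = buffer_bytes * area_ratio
scaled_ms = transfer_ms * area_ratio

print(buffer_bytes, round(transfer_ms, 3))
print(area_ratio, scaled_bytes / 1e9, round(scaled_ms, 1))
```

    Under these assumptions the 5*5 field is a copy of well under a millisecond, while a 100*100 field would be about 2 GB and take on the order of a quarter of a second, which is why culling before upload matters.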

    Fortunately, this buffer does not need to be transmitted every frame, but it still deserves attention. If the grass field grows to 100*100 at the same density, the area, and with it the data and the transfer time, grows by a factor of about 400. Moreover, many of these vertices may never be used, which wastes a lot of performance.

    I added a function to generate perlin noise in the Compute Shader, as well as the xorshift128 random number generation algorithm.

    // Perlin-style noise: hashed lattice values, bilinearly interpolated and summed over 8 octaves
    float hash(float x, float y) {
        return frac(abs(sin(sin(123.321 + x) * (y + 321.123)) * 456.654));
    }
    float perlin(float x, float y){
        float col = 0.0;
        for (int i = 0; i < 8; i++) {
            float fx = floor(x); float fy = floor(y);
            float cx = ceil(x); float cy = ceil(y);
            float a = hash(fx, fy); float b = hash(fx, cy);
            float c = hash(cx, fy); float d = hash(cx, cy);
            col += lerp(lerp(a, b, frac(y)), lerp(c, d, frac(y)), frac(x));
            col /= 2.0; x /= 2.0; y /= 2.0;
        }
        return col;
    }
    // XorShift128 random number algorithm, edited to output normalized data directly
    static uint state[4]; // static: per-thread state that the shader may write to
    void xorshift_init(uint s) {
        state[0] = s; state[1] = s | 0xffff0000u;
        state[2] = s << 16; state[3] = s >> 16;
    }
    float xorshift128() {
        uint t = state[3]; uint s = state[0];
        state[3] = state[2]; state[2] = state[1]; state[1] = s;
        t ^= t << 11u; t ^= t >> 8u;
        state[0] = t ^ s ^ (s >> 19u);
        return (float)state[0] / float(0xffffffffu);
    }
    void BendGrass (uint3 id : SV_DispatchThreadID)
    {
        // Seed the generator from the thread ID.
        xorshift_init(id.x * 73856093u ^ id.y * 19349663u ^ id.z * 83492791u);
        // ...
    }
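
    For checking the generator outside Unity, here is a Python port of the xorshift128 routine (a sketch; Python integers are unbounded, so 32-bit wrap-around is emulated with a mask). The class name and the seed_from_id helper are my own.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic

def seed_from_id(x, y, z):
    """Same hash of the thread ID as in BendGrass, wrapped to 32 bits."""
    return ((x * 73856093) ^ (y * 19349663) ^ (z * 83492791)) & MASK

class XorShift128:
    def __init__(self, seed):
        s = seed & MASK
        self.state = [s, s | 0xFFFF0000, (s << 16) & MASK, s >> 16]

    def next_float(self):
        """Advance the state and return a value normalized to [0, 1]."""
        t, s = self.state[3], self.state[0]
        self.state[3], self.state[2], self.state[1] = self.state[2], self.state[1], s
        t ^= (t << 11) & MASK
        t ^= t >> 8
        self.state[0] = t ^ s ^ (s >> 19)
        return self.state[0] / MASK

rng_a = XorShift128(seed_from_id(3, 0, 0))
rng_b = XorShift128(seed_from_id(3, 0, 0))
samples_a = [rng_a.next_float() for _ in range(5)]
samples_b = [rng_b.next_float() for _ in range(5)]
print(samples_a)
```

    The same thread ID always yields the same sequence, so every blade keeps its random attributes from frame to frame.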

    To review: at present, the CPU uses the AABB of the ground to lay out grass evenly and generate every possible grass vertex. These are then passed to the GPU, where the Compute Shader performs culling, LOD and other operations.

    So far I have three buffers.

    m_InputBuffer holds the structure on the left of the picture above and sends all the grass to the GPU without any culling.

    m_OutputBuffer is a variable-length buffer that is filled gradually in the Compute Shader. If the blade handled by the current thread ID passes the tests, it is appended to this buffer for instanced rendering later. It holds the structure on the right of the picture above.

    m_argsBuffer is an arguments buffer and differs from the others. It passes parameters to the draw call: the number of indices per instance, the number of instances to render, and so on. In detail:

    The first parameter is the index count per instance. My grass mesh has seven triangles, so there are 21 indices to render.

    The second parameter, the instance count, is temporarily set to 0, meaning that nothing is rendered yet. After the Compute Shader has finished, it is set dynamically from the length of m_OutputBuffer, so it equals the number of blades appended in the Compute Shader.

    The third and fourth parameters are the start index location and the base vertex location.

    I have not used the fifth parameter. According to Unity's documentation for Graphics.DrawMeshInstancedIndirect, it is the start instance location.

    The last step looks like this, passing in the Mesh, material, AABB and parameter Buffer.

    5.2 Customizing Unity Tools

    Create a new C# script and save it in the project's Editor directory (create the directory if it does not exist). The script inherits from Editor and is annotated with [CustomEditor(typeof(XXX))], which means that this editor serves the component XXX; mine serves GrassControl. What you write is then drawn in the Inspector of XXX. You can also create a separate window instead, in which case the class should inherit from EditorWindow.

    Write tools in the OnInspectorGUI() function, for example, write a Label.

    GUILayout.Label("== Remo Grass Generator ==");

    To center the label in the Inspector, pass a style as a second parameter.

    GUILayout.Label("== Remo Grass Generator ==", new GUIStyle(EditorStyles.boldLabel) { alignment = TextAnchor.MiddleCenter });

    Too crowded? Just add a line of space, for example with EditorGUILayout.Space().


    If you want your tools to appear above the default Inspector of XXX, write all the logic in OnInspectorGUI before the call that draws the default Inspector.

    ... // Write here
    // The default Inspector interface of GrassControl
    base.OnInspectorGUI();

    Create a button and run code when it is pressed:

    if (GUILayout.Button("xxx"))
    {
        ... // Code to run after the button is pressed
    }

    Anyway, these are the ones I use now.

    5.3 Selecting the object to generate grass on in the Editor

    It is also very simple to get the object that carries the script this editor serves and display it in the Inspector.

    [SerializeField] private GameObject grassObject;
    grassObject = (GameObject)EditorGUILayout.ObjectField("Write any name", grassObject, typeof(GameObject), true);
    if (grassObject == null)
        grassObject = FindObjectOfType<GrassControl>()?.gameObject;

    After obtaining it, you can access the contents of the current script through GameObject.

    How to get the object selected in the Editor window? It can be done with one line of code.

    foreach (GameObject obj in Selection.gameObjects)

    Display the selected objects in the Inspector panel. Note that you need to handle multiple selection, otherwise a warning is issued.

    // Display the objects currently selected in the Editor in real time and control the availability of the button
    EditorGUILayout.LabelField("Selection Info:", EditorStyles.boldLabel);
    bool hasSelection = Selection.activeGameObject != null;
    GUI.enabled = hasSelection;
    if (hasSelection)
    {
        foreach (GameObject obj in Selection.gameObjects)
            EditorGUILayout.LabelField(obj.name); // assumed loop body: the original line is missing
    }
    else
    {
        EditorGUILayout.LabelField("No active object selected.");
    }

    Next, get the MeshFilter and Renderer of the selected object. Since raycasting is required, also get a Collider; if none exists, create one.

    I won't go through the grass-painting code here.

    5.4 Processing AABBs

    After generating a batch of grass, add each blade to the AABB and finally pass the AABB to the instanced draw call.

    I assume that each blade of grass is the size of a unit cube, so its size is Vector3.one. If the grass is particularly tall, this needs to be adjusted.

    Encapsulate each blade of grass in the big AABB and write the new AABB back to the script's m_LocalBounds for instancing.

    Graphics.DrawMeshInstancedIndirect(blade, 0, m_Material, m_LocalBounds, m_argsBuffer);
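
    The encapsulation step can be sketched outside Unity as follows. This is a minimal illustration under the unit-cube assumption above; the Aabb struct and GrassBoundsSketch class are hypothetical stand-ins, and System.Numerics.Vector3 replaces UnityEngine.Vector3.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Numerics;

    public struct Aabb
    {
        public Vector3 Min, Max;

        public Aabb(Vector3 center, Vector3 size)
        {
            Min = center - size * 0.5f;
            Max = center + size * 0.5f;
        }

        // Grow this box so that it also contains `other`.
        public void Encapsulate(Aabb other)
        {
            Min = Vector3.Min(Min, other.Min);
            Max = Vector3.Max(Max, other.Max);
        }
    }

    public static class GrassBoundsSketch
    {
        // Assumes every blade occupies a unit cube centered on its position.
        public static Aabb Compute(IReadOnlyList<Vector3> bladePositions)
        {
            if (bladePositions.Count == 0) throw new ArgumentException("no grass to bound");
            var bounds = new Aabb(bladePositions[0], Vector3.One);
            for (int i = 1; i < bladePositions.Count; i++)
                bounds.Encapsulate(new Aabb(bladePositions[i], Vector3.One));
            return bounds;
        }
    }
    ```

    In Unity itself, Bounds.Encapsulate plays the same role, so the loop body reduces to a single call per blade.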

    5.5 Surface Shader – Pitfalls

    There is a small problem here. The current Material is a Surface Shader, and the vertex stage of a Surface Shader already applies the center of the AABB as a vertex offset by default, so the world coordinates passed in earlier cannot be used directly. You also need to pass in the center of the AABB and subtract it. This feels odd; I wonder whether there is a more elegant way.

    5.6 Simple Camera Distance Culling + Fade

    Currently, the CPU passes all generated grass to the Compute Shader, which then adds every blade to the AppendBuffer. In other words, there is no culling logic at all.

    The simplest culling solution is to cull grass based on its distance from the camera. Expose a value in the Inspector panel that represents the culling distance, then calculate the distance between the camera and the current grass instance. If it is greater than the set value, the instance is not added to the AppendBuffer.

    First, pass the world coordinates of the camera from C# to the Compute Shader. Here is the semi-pseudocode:

    // Get the camera
    private Camera m_MainCamera;
    m_MainCamera = Camera.main;
    if (m_MainCamera != null)
        m_ComputeShader.SetVector(ID_camreaPos, m_MainCamera.transform.position);

    In CS, calculate the distance between the grass and the camera:

    float distanceFromCamera = distance(input.position, _CameraPositionWS);

    The fade factor is computed as follows:

    float distanceFade = 1 - saturate((distanceFromCamera - _MinFadeDist) / (_MaxFadeDist - _MinFadeDist));

    If the value is practically 0, return early.

    // skip if out of fading range too
    if (distanceFade < 0.001f) return;

    For grass in the transition zone between culled and not culled, scale the width, height and Fade value by distanceFade to achieve a fading effect.

    Result.height = (bladeHeight + bladeHeightOffset * (xorshift128()*2-1)) * distanceFade;
    Result.width = (bladeWeight + bladeWeightOffset * (xorshift128()*2-1)) * distanceFade;
    Result.fade = xorshift128() * distanceFade;
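
    To make the behavior of the fade factor easy to check, here is a CPU-side sketch of the same formula. The class and method names are hypothetical, and it assumes maxFadeDist is greater than minFadeDist.

    ```csharp
    using System;

    public static class DistanceFadeSketch
    {
        // Mirrors HLSL saturate(): clamp to [0, 1].
        static float Saturate(float x) => Math.Clamp(x, 0f, 1f);

        // 1 up to minFadeDist, 0 from maxFadeDist onward, linear in between.
        public static float DistanceFade(float distanceFromCamera, float minFadeDist, float maxFadeDist)
        {
            return 1f - Saturate((distanceFromCamera - minFadeDist) / (maxFadeDist - minFadeDist));
        }

        // A blade is kept only if its fade factor is noticeably above zero.
        public static bool IsVisible(float distanceFromCamera, float minFadeDist, float maxFadeDist)
        {
            return DistanceFade(distanceFromCamera, minFadeDist, maxFadeDist) >= 0.001f;
        }
    }
    ```

    With minFadeDist = 20 and maxFadeDist = 40, a blade 10 units away keeps its full size, a blade 30 units away is scaled to half, and a blade 50 units away is skipped.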

    In the figure below, both fade distances are set relatively small for the convenience of demonstration.

    I think the actual effect is quite good and smooth. Without scaling the width and height of the grass, the effect is much less convincing.

    Of course, you can also change the logic: instead of completely removing the grass beyond the maximum drawing range, reduce the amount drawn; or selectively draw the grass in the transition area.

    Both approaches are acceptable. If it were me, I would choose the latter.

    5.7 Maintaining a set of visible ID buffers

    Frustum culling here means using various methods at the CPU stage to spare the GPU redundant calculations.

    So how do I let the Compute Shader know which grass needs to be rendered and which needs to be culled? My approach is to maintain a list of IDs whose maximum length is the total number of grass blades. A blade that needs to be culled is left out; for a blade that needs to be rendered, its index is recorded.

    // int matches the List<int> produced by Enumerable.Range and consumed by RetrieveLeaves
    List<int> grassVisibleIDList = new List<int>();
    // buffer that contains the ids of all visible instances
    private ComputeBuffer m_VisibleIDBuffer;
    private const int VISIBLE_ID_STRIDE = 1 * sizeof(uint);
    m_VisibleIDBuffer = new ComputeBuffer(grassData.Count, VISIBLE_ID_STRIDE,
        ComputeBufferType.Structured); // uint only, per visible grass
    m_ComputeShader.SetBuffer(m_ID_GrassKernel, "_VisibleIDBuffer", m_VisibleIDBuffer);

    Since some grass is removed before being passed to the Compute Shader, the dispatch size is no longer based on the total number of grass blades but on the length of the current list.

    // Cast to float first: integer division would truncate before the ceiling is taken
    m_DispatchSize = Mathf.CeilToInt((float)grassVisibleIDList.Count / threadGroupSize);
    // m_ComputeShader.Dispatch(m_ID_GrassKernel, m_DispatchSize, 1, 1);
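
    The rounding matters here: dividing two integers truncates before any ceiling is applied, so the last partial group would be lost. A small sketch of the group count using integer arithmetic only (DispatchSketch and GroupCount are hypothetical names):

    ```csharp
    using System;

    public static class DispatchSketch
    {
        // Smallest number of groups such that groups * threadGroupSize >= itemCount.
        public static int GroupCount(int itemCount, int threadGroupSize)
        {
            if (threadGroupSize <= 0) throw new ArgumentOutOfRangeException(nameof(threadGroupSize));
            return (itemCount + threadGroupSize - 1) / threadGroupSize;
        }
    }
    ```

    For example, 130 visible blades with a group size of 64 need 3 groups. The last group then contains threads with no blade to process, so the kernel has to check its index against the visible count before reading.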

    Generate an ID sequence in which every blade is visible:

    void GrassFastList(int count)
    {
        grassVisibleIDList = Enumerable.Range(0, count).ToList();
    }

    The list must be uploaded to the GPU every frame. With the preparation complete, the next step is to use a quadtree to fill this list.

    5.8 Storing Grass Indices in a Quadtree/Octree

    You can divide one AABB into multiple sub-AABBs and then use a quadtree to store and manage them.

    Currently, all grass is in a single AABB. Next, we build an octree and distribute all the grass in this AABB among its branches. This makes it easy to do frustum culling early, on the CPU.

    How should it be stored? If the terrain under the grass has little vertical variation, a quadtree is enough. If it is an open world with undulating mountains, use an octree. However, since grass is much denser horizontally than vertically, I use a combined quadtree + octree structure here: the parity of the depth determines whether the current level is divided into four nodes or eight. If strong height division is not needed, a pure octree also works, though I suspect it is slightly less efficient. Here the space is simply divided evenly; a later optimization could use a division scheme whose cell sizes adapt dynamically.

    if (depth % 2 == 0)
    {
        // Even depth: split into four nodes
        m_children.Add(new CullingTreeNode(topLeftSingle, depth - 1));
        m_children.Add(new CullingTreeNode(bottomRightSingle, depth - 1));
        m_children.Add(new CullingTreeNode(topRightSingle, depth - 1));
        m_children.Add(new CullingTreeNode(bottomLeftSingle, depth - 1));
    }
    else
    {
        // Odd depth: split into eight nodes
        m_children.Add(new CullingTreeNode(topLeft, depth - 1));
        m_children.Add(new CullingTreeNode(bottomRight, depth - 1));
        m_children.Add(new CullingTreeNode(topRight, depth - 1));
        m_children.Add(new CullingTreeNode(bottomLeft, depth - 1));
        m_children.Add(new CullingTreeNode(topLeft2, depth - 1));
        m_children.Add(new CullingTreeNode(bottomRight2, depth - 1));
        m_children.Add(new CullingTreeNode(topRight2, depth - 1));
        m_children.Add(new CullingTreeNode(bottomLeft2, depth - 1));
    }

    The detection of the view frustum and AABB can be done with GeometryUtility.TestPlanesAABB.

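    public void RetrieveLeaves(Plane[] frustum, List<Bounds> list, List<int> visibleIDList)
    {
        if (!GeometryUtility.TestPlanesAABB(frustum, m_bounds)) return;
        if (m_children.Count == 0)
        {
            // Leaf: record its bounds and grass indices (body reconstructed from the description below)
            if (grassIDHeld.Count > 0) { list.Add(m_bounds); visibleIDList.AddRange(grassIDHeld); }
            return;
        }
        foreach (CullingTreeNode child in m_children)
            child.RetrieveLeaves(frustum, list, visibleIDList);
    }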

    This code is the key part, passing in:

    • The six planes of the camera frustum Plane[]
    • A list of Bounds objects that receives all nodes within the frustum
    • A list that receives the indices of all grass contained in the nodes within the frustum

    By calling the method of this quad/octree, you can get the list of all bounding boxes and grass within the frustum.

    Then all the grass indexes can be made into a Buffer and passed to the Compute Shader.
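
    The traversal idea can be tried outside Unity with a simplified sketch. The assumptions are stated loudly: this version is a pure quadtree on the XZ plane, an axis-aligned rectangle stands in for the camera frustum, and Rect2 and QuadNode are hypothetical types rather than the project's CullingTreeNode.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Axis-aligned rectangle on the XZ plane.
    public readonly struct Rect2
    {
        public readonly float MinX, MinZ, MaxX, MaxZ;
        public Rect2(float minX, float minZ, float maxX, float maxZ)
        { MinX = minX; MinZ = minZ; MaxX = maxX; MaxZ = maxZ; }

        public bool Contains(float x, float z) => x >= MinX && x <= MaxX && z >= MinZ && z <= MaxZ;
        public bool Overlaps(Rect2 o) => MinX <= o.MaxX && MaxX >= o.MinX && MinZ <= o.MaxZ && MaxZ >= o.MinZ;
    }

    public sealed class QuadNode
    {
        readonly Rect2 bounds;
        readonly List<QuadNode> children = new List<QuadNode>();
        readonly List<int> grassIds = new List<int>(); // only filled on leaves

        public QuadNode(Rect2 bounds, int depth)
        {
            this.bounds = bounds;
            if (depth <= 0) return; // leaf node
            float midX = (bounds.MinX + bounds.MaxX) * 0.5f;
            float midZ = (bounds.MinZ + bounds.MaxZ) * 0.5f;
            children.Add(new QuadNode(new Rect2(bounds.MinX, bounds.MinZ, midX, midZ), depth - 1));
            children.Add(new QuadNode(new Rect2(midX, bounds.MinZ, bounds.MaxX, midZ), depth - 1));
            children.Add(new QuadNode(new Rect2(bounds.MinX, midZ, midX, bounds.MaxZ), depth - 1));
            children.Add(new QuadNode(new Rect2(midX, midZ, bounds.MaxX, bounds.MaxZ), depth - 1));
        }

        // Stores the grass index in the first leaf that contains the point.
        public bool Insert(int id, float x, float z)
        {
            if (!bounds.Contains(x, z)) return false;
            if (children.Count == 0) { grassIds.Add(id); return true; }
            foreach (QuadNode child in children)
                if (child.Insert(id, x, z)) return true;
            return false;
        }

        // Collects the grass indices of every leaf that overlaps the view region.
        public void RetrieveLeaves(Rect2 view, List<int> visibleIds)
        {
            if (!bounds.Overlaps(view)) return;
            if (children.Count == 0) { visibleIds.AddRange(grassIds); return; }
            foreach (QuadNode child in children)
                child.RetrieveLeaves(view, visibleIds);
        }
    }
    ```

    Note that culling happens per leaf, not per blade: a blade just outside the view is still returned if its leaf overlaps the view. That is acceptable here because the Compute Shader can still apply finer per-blade tests afterwards.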


    To visualize the AABBs, use the OnDrawGizmos() method.

    Pass all the AABBs that survive frustum culling into this function. This way you can see the AABBs directly.

    Also write all the grass inside the view frustum to the visible grass list.

    5.9 Flickering grass problem – Pitfalls

    Here I fell into a small pit. I finished the octree and successfully divided the space into many sub-AABBs, as shown above. But when I moved the camera, the grass flickered wildly. I was too lazy to record a GIF, so compare the two pictures below: I only moved the view slightly, which changed the current visibility list, yet the grass jumped around a lot and appeared to flicker continuously.

    I couldn't figure it out. There is no problem with the Compute Shader culling.

    The dispatch size is also calculated from the length of the visibility list, so there are certainly enough threads for the Compute Shader.

    And there is no problem with DrawMeshInstancedIndirect.

    So what is the problem?

    After a long debugging session, I found that the problem lies in how the Compute Shader draws random numbers with Xorshift.

    Before _VisibleIDBuffer was introduced, each blade of grass corresponded to one thread ID, fixed from the moment the grass was created. Now there is an extra level of indexing, but the ID used to seed the random value was still the thread ID rather than the visible ID. The same blade therefore received different random numbers whenever the visibility list changed.

    The fix is to replace every use of the previous ID with the index value read from _VisibleIDBuffer!
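
    The effect of seeding by the blade's own ID can be illustrated on the CPU. This is only a sketch: the hash below is a single xorshift32 step used as a stand-in, not the project's xorshift128, and all names are hypothetical.

    ```csharp
    using System;

    public static class StableRandomSketch
    {
        // One xorshift32 step, used here as a cheap hash of the seed.
        static uint XorShift32(uint state)
        {
            state ^= state << 13;
            state ^= state >> 17;
            state ^= state << 5;
            return state;
        }

        // Maps a grass ID to a repeatable value in [0, 1].
        public static float Random01(uint grassId)
        {
            // Scramble first so that neighboring IDs produce unrelated values.
            uint seed = unchecked(grassId * 747796405u + 2891336453u);
            return XorShift32(seed) / (float)uint.MaxValue;
        }

        // Value seen by a thread when it seeds with the ID looked up from the visible list.
        public static float ForThread(uint[] visibleIds, int threadIndex)
        {
            return Random01(visibleIds[threadIndex]);
        }
    }
    ```

    With visible lists {3, 7, 9} and {7, 9}, blade 7 sits in slot 1 of the first list and slot 0 of the second, yet ForThread returns the same value in both cases. Seeding with the slot index instead would give the blade a different height and width each time the list changes, which is exactly the flicker described above.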

    5.10 Multi-object Interaction

    Currently only one trampler can be passed in, and if none is passed in, an error is reported, which is unacceptable.

    There are three parameters about interaction:

    • pos – Vector3
    • trampleStrength – Float
    • trampleRadius – Float

    Now pack trampleRadius into pos, turning it into a Vector4 (or pack a different parameter, depending on your needs), and pass the array of positions with SetVectorArray. This way each interactive object can have its own interaction radius: larger for bulky objects, smaller for thin ones. That is, remove the following line:

    // In SetGrassDataBase, no need to upload every frame
    // m_ComputeShader.SetFloat("trampleRadius", trampleRadius);


    // In SetGrassDataUpdate, each frame must be uploaded
    // Set up multiple interactive objects
    if (trampler.Length > 0)
    {
        Vector4[] positions = new Vector4[trampler.Length];
        for (int i = 0; i < trampler.Length; i++)
        {
            // The original line is cut off after z; per the text, the fourth component carries this object's radius
            positions[i] = new Vector4(trampler[i].transform.position.x, trampler[i].transform.position.y, trampler[i].transform.position.z,
        }
        m_ComputeShader.SetVectorArray(ID_tramplePos, positions);
    }

    Then you also have to pass the number of interactive objects, so that the Compute Shader knows how many it needs to process. This also has to be updated every frame. For values that are updated every frame I usually cache a property ID, which is more efficient than looking the property up by name each time.

    // Initializing
    ID_trampleLength = Shader.PropertyToID("_trampleLength");
    // In each frame
    m_ComputeShader.SetFloat(ID_trampleLength, trampler.Length);

    I repackaged it:

    By modifying the corresponding code, you can adjust the radius of each interactive object on the panel. If you want to enrich this adjustment function, you can consider passing a separate Buffer into it.

    In the Compute Shader, combining the rotations from multiple tramplers is relatively simple.

    // Trampler
    float4 qt = float4(0, 0, 0, 1); // identity quaternion: imaginary part 0, real part 1
    for (int trampleIndex = 0; trampleIndex < trampleLength; trampleIndex++)
    {
        float trampleRadius = tramplePos[trampleIndex].a; // radius packed into the fourth component
        float3 relativePosition = input.position - tramplePos[trampleIndex].xyz;
        float dist = length(relativePosition);
        if (dist < trampleRadius) {
            // Square the falloff to strengthen the effect at close range
            float eff = pow((trampleRadius - dist) / trampleRadius, 2) * trampleStrength;
            float3 direction = normalize(relativePosition);
            float3 newTargetDirection = float3(direction.x * eff, 1, direction.z * eff);
            qt = quatMultiply(MapVector(float3(0, 1, 0), newTargetDirection), qt);
        }
    }
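
    The falloff term eff is the part that depends on the packed radius. As a CPU-side sketch of just that term (hypothetical names, with System.Numerics.Vector4 standing in for the packed float4 whose fourth component holds the radius):

    ```csharp
    using System;
    using System.Numerics;

    public static class TrampleSketch
    {
        // Quadratic falloff: `strength` at the trampler's center, 0 at and beyond its radius.
        public static float TrampleEffect(Vector3 grassPosition, Vector4 trampler, float strength)
        {
            float radius = trampler.W;
            var center = new Vector3(trampler.X, trampler.Y, trampler.Z);
            float dist = Vector3.Distance(grassPosition, center);
            if (dist >= radius) return 0f;
            float t = (radius - dist) / radius;
            return t * t * strength;
        }
    }
    ```

    A blade 1 unit away from a trampler of radius 2 receives a quarter of the strength, while the same blade is unaffected by a trampler of radius 2 that is 3 units away.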

    5.11 Editor real-time preview

    The camera currently passed to the Compute Shader is the main camera, the one used by the Game window. In the editor, I want to temporarily use the camera of the Scene window instead and switch back to the main camera once the game starts. This can be done with the Scene View's GUI drawing event, SceneView.duringSceneGui.

    Here is how I adapted my current code:

    SceneView view;
    void OnDestroy()
    {
        // When the window is destroyed, remove the delegate
        // so that it will no longer do any drawing.
        SceneView.duringSceneGui -= this.OnScene;
    }
    void OnScene(SceneView scene)
    {
        view = scene;
        if (!Application.isPlaying)
        {
            if (view.camera != null)
                m_MainCamera = view.camera;
        }
        else
        {
            m_MainCamera = Camera.main;
        }
    }
    private void OnValidate()
    {
        // Set up components
        if (!Application.isPlaying)
        {
            if (view != null)
                m_MainCamera = view.camera;
        }
        else
        {
            m_MainCamera = Camera.main;
        }
    }

    When initializing the shader, first subscribe to the event, then check whether the game is currently running and assign a camera accordingly. In edit mode, m_MainCamera may still be null at this point.

    void InitShader()
    {
        SceneView.duringSceneGui += this.OnScene;
        if (!Application.isPlaying)
        {
            if (view != null && view.camera != null)
                m_MainCamera = view.camera;
        }
        if (Application.isPlaying)
            m_MainCamera = Camera.main;
    }

    In the per-frame Update function, if m_MainCamera turns out to be null, the current mode is taken to be edit mode:

    // Pass in the camera coordinates
    if (m_MainCamera != null)
        m_ComputeShader.SetVector(ID_camreaPos, m_MainCamera.transform.position);
    else if (view != null && view.camera != null)
        m_ComputeShader.SetVector(ID_camreaPos, view.camera.transform.position);

    6. Cutting Grass

    Maintain a set of Cut Buffers

    // added for cutting
    private ComputeBuffer m_CutBuffer;
    float[] cutIDs;

    Initialize the buffer:

    private const int CUT_ID_STRIDE = 1 * sizeof(float);
    // added for cutting
    m_CutBuffer = new ComputeBuffer(grassData.Count, CUT_ID_STRIDE, ComputeBufferType.Structured);
    // added for cutting
    m_ComputeShader.SetBuffer(m_ID_GrassKernel, "_CutBuffer", m_CutBuffer);

    Don't forget to release it when the script is disabled.

    // added for cutting
    m_CutBuffer?.Release(); // assumed: the original line is missing here

    Define a method that takes the current position and a radius and finds the grass within that radius. For each such blade, store the cut height in the corresponding cutID; a value of -1 means uncut.

    // newly added for cutting
    public void UpdateCutBuffer(Vector3 hitPoint, float radius)
    {
        // can't cut grass if there is no grass in the scene
        if (grassData.Count > 0)
        {
            List<int> grasslist = new List<int>();
            // Get the list of IDs that are near the hit point within the radius
            cullingTree.ReturnLeafList(hitPoint, grasslist, radius);
            Vector3 brushPosition = this.transform.position;
            // Compute the squared radius to avoid square root calculations
            float squaredRadius = radius * radius;
            for (int i = 0; i < grasslist.Count; i++)
            {
                int currentIndex = grasslist[i];
                Vector3 grassPosition = grassData[currentIndex].position + brushPosition;
                // Calculate the squared distance
                float squaredDistance = (hitPoint - grassPosition).sqrMagnitude;
                // Check if the squared distance is within the squared radius
                // Check if there is grass to cut, or if the grass is uncut (-1)
                if (squaredDistance <= squaredRadius && (cutIDs[currentIndex] > hitPoint.y || cutIDs[currentIndex] == -1))
                {
                    // store cutting point
                    cutIDs[currentIndex] = hitPoint.y;
                }
            }
        }
    }
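
    The cut test itself can be checked in isolation. The sketch below is a brute-force version under simplifying assumptions: it loops over all blades instead of first narrowing the candidates with the culling tree, it uses System.Numerics.Vector3, and the names CutSketch and ApplyCut are hypothetical.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Numerics;

    public static class CutSketch
    {
        public const float Uncut = -1f;

        // Marks every blade within `radius` of hitPoint as cut at height hitPoint.Y.
        // A blade that is already cut is only updated if the new cut is lower.
        // Returns the number of entries that changed.
        public static int ApplyCut(IReadOnlyList<Vector3> bladePositions, float[] cutIds, Vector3 hitPoint, float radius)
        {
            float squaredRadius = radius * radius; // compare squared values to avoid square roots
            int changed = 0;
            for (int i = 0; i < bladePositions.Count; i++)
            {
                float squaredDistance = (hitPoint - bladePositions[i]).LengthSquared();
                bool canCut = cutIds[i] == Uncut || cutIds[i] > hitPoint.Y;
                if (squaredDistance <= squaredRadius && canCut)
                {
                    cutIds[i] = hitPoint.Y;
                    changed++;
                }
            }
            return changed;
        }
    }
    ```

    After such an update, the array still has to reach the GPU before the Compute Shader can see it; in Unity that would typically be done with ComputeBuffer.SetData on the cut buffer.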

    Then attach a script to the object that does the cutting:

    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;
    public class Cutgrass : MonoBehaviour
    {
        // Serialized so the reference can be assigned in the Inspector
        [SerializeField] GrassControl grassComputeScript;
        float radius = 1f;
        public bool updateCuts;
        Vector3 cachedPos;
        // Start is called before the first frame update
        void Start() { }
        // Update is called once per frame
        void Update()
        {
            if (updateCuts && transform.position != cachedPos)
            {
                grassComputeScript.UpdateCutBuffer(transform.position, radius);
                cachedPos = transform.position;
            }
        }
        private void OnDrawGizmos()
        {
            Gizmos.color = new Color(1, 0, 0, 0.3f);
            Gizmos.DrawWireSphere(transform.position, radius);
        }
    }

    In the Compute Shader, simply modify the grass height. (Very straightforward...) You can change the effect to whatever you want.

    StructuredBuffer<float> _CutBuffer; // added for cutting
        float cut = _CutBuffer[usableID];
        Result.height = (bladeHeight + bladeHeightOffset * (xorshift128()*2-1)) * distanceFade;
        if (cut != -1) {
            Result.height *= 0.1f;
        }




  • Compute Shader学习笔记(四)之 草地渲染

    Compute Shader Learning Notes (IV) Grass Rendering

    Project address:



    L5 Grass Rendering

    The current result is quite ugly and many details are still rough; it is merely "implemented". I am still a beginner, so please point out anything I have written or done poorly.


    Summary of knowledge points:

    • Grass rendering solutions
    • bounds.extents
    • Raycasting
    • Rodrigues' rotation
    • Quaternion rotation

    Preface 1

    Preface Reference Articles:


    There are many ways to render grass.

    The simplest way is to directly paste a grass texture on it.


    In addition, dragging individual grass meshes into the scene one by one is also common. This method leaves a lot of room for manual control, and every blade of grass is under your control. Although you can use Batching and other methods to optimize and reduce the transfer time from CPU to GPU, it will wear out the Ctrl, C, V and D keys on your keyboard. However, you can enter L(a, b) in the Transform component to distribute the selected objects evenly between a and b. If you want randomness, you can use R(a, b). For more related operations, see the official documentation.


    You can also combine geometry shaders and tessellation shaders. This method looks good, but one shader can only correspond to one type of geometry (grass). If you want to generate flowers or rocks on the same mesh, you need to modify the code in the geometry shader. That is not the most critical problem, though. The more serious problem is that many mobile devices and Metal do not support geometry shaders at all. Even where they are supported, they are only emulated in software, with poor performance. In addition, the grass mesh is recalculated every frame, which wastes performance.


    Rendering grass with billboards is also a widely used and long-lived method. It works very well when high-fidelity images are not needed. The method simply renders a Quad plus a texture (with alpha clipping), drawn using DrawProcedural. However, it only holds up from a distance; up close, the trick is exposed.


    With Unity's Terrain System you can also draw very nice grass, and Unity uses instancing to keep the performance up. The best part is its brush tool. If your workflow does not include the terrain system, you can use third-party plugins instead.


    While searching for information, I also came across Impostors. It is an interesting technique that combines the vertex savings of billboards with the ability to reproduce objects realistically from multiple angles. It "photographs" a real grass Mesh from multiple angles in advance and stores the results in a Texture. At runtime, the appropriate texture is selected for rendering according to the viewing direction of the current camera. It is essentially an upgraded version of billboarding. I think Impostors are well suited to objects that are large but that players may view from multiple angles, such as trees or complex buildings. However, the method may show problems when the camera is very close or transitions between two angles. A more reasonable solution is to use a mesh-based method at very close range, Impostors at medium range, and billboards at long range.


    The method implemented in this article is based on GPU Instancing and could be called "per-blade mesh grass". This approach is used in games such as "Ghost of Tsushima", "Genshin Impact" and "The Legend of Zelda: Breath of the Wild". Each blade of grass is its own entity, and the light and shadow effects are quite realistic.


    Rendering process:


    Preface 2

    Unity's Instancing technology is quite complex, and I have only scratched the surface. Please correct me if you find any mistakes. The current code is written according to the documentation. GPU instancing currently supports the following platforms:

    • Windows: DX11 and DX12 with SM 4.0 and above / OpenGL 4.1 and above
    • OS X and Linux: OpenGL 4.1 and above
    • Mobile: OpenGL ES 3.0 and above / Metal
    • PlayStation 4
    • Xbox One

    In addition, Graphics.DrawMeshInstancedIndirect has been deprecated. You should use Graphics.RenderMeshIndirect instead. This function will automatically calculate the Bounding Box, but that is a story for later. For details, please see the official documentation: RenderMeshIndirect. This article was also helpful:

    The principle of GPU Instancing is to issue a single Draw Call for multiple objects that share the same Mesh. The CPU first collects all the per-instance information, puts it into an array and sends it to the GPU in one go. The limitation is that the Material and Mesh of these objects must be the same. This is how so much grass can be drawn at once while maintaining high performance. To draw millions of Meshes with GPU Instancing, you need to follow some rules:

    • All meshes need to use the same Material
    • Check GPU Instancing
    • Shader needs to support instancing
    • SkinnedMeshRenderer is not supported

    Since SkinnedMeshRenderer is not supported, in the previous article we bypassed it, took the Mesh of each key frame directly and passed it to the GPU. This is also why the question was raised at the end of the previous article.

    There are two main types of Instancing in Unity: GPU Instancing and Procedural Instancing (which involves Compute Shaders and indirect drawing). There is also the stereo rendering path (UNITY_STEREO_INSTANCING_ENABLED), which I won't go into here. In a Shader, the former uses #pragma multi_compile_instancing and the latter uses #pragma instancing_options procedural:setup. For details, please see the official documentation, Creating shaders that support GPU instancing.

    Also, the SRP pipelines currently do not support custom GPU Instancing Shaders; only BIRP does.

    Then there is UNITY_PROCEDURAL_INSTANCING_ENABLED. This macro indicates whether Procedural Instancing is enabled. When using a Compute Shader or the indirect drawing API, the attributes of an instance (such as position and color) can be calculated on the GPU in real time and used directly for rendering, without CPU intervention. In the source code, the core of this macro is:

    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        #ifndef UNITY_INSTANCING_PROCEDURAL_FUNC
            #error "UNITY_INSTANCING_PROCEDURAL_FUNC must be defined."
        #else
            void UNITY_INSTANCING_PROCEDURAL_FUNC(); // Forward declaration of the procedural function
            #define DEFAULT_UNITY_SETUP_INSTANCE_ID(input) { UnitySetupInstanceID(UNITY_GET_INSTANCE_ID(input)); UNITY_INSTANCING_PROCEDURAL_FUNC();}
        #endif
    #else
        #define DEFAULT_UNITY_SETUP_INSTANCE_ID(input) { UnitySetupInstanceID(UNITY_GET_INSTANCE_ID(input));}
    #endif

    The Shader is required to define a UNITY_INSTANCING_PROCEDURAL_FUNC function, which is actually the setup() function. If there is no setup() function, an error will be reported.

    Generally speaking, what the setup() function needs to do is to extract the corresponding (unity_InstanceID) data from the Buffer, and then calculate the current instance's position, transformation matrix, color, metalness, or custom data and other attributes.

    GPU Instancing is just one of Unity's many optimization methods, and you still need to continue learning.

    1. Swaying 3-Quad Grass

    All the Compute Shader knowledge used in this chapter was covered in the previous article; only the context has changed. Here is a simple diagram.


    The implementation uses GPU Instancing, that is, rendering a large number of Meshes in one go. The core code is a single line:

    Graphics.DrawMeshInstancedIndirect(mesh, 0, material, bounds, argsBuffer);

    The Mesh is composed of three Quads and a total of six triangles.


    Then add a texture + Alpha Test.


    The data structure of grass:

    • Position
    • Lean angle
    • Random noise value (used to calculate random lean angles)

    struct GrassClump
    {
        public Vector3 position; // World coordinates, need to be calculated
        public float lean;
        public float noise;
        public GrassClump(Vector3 pos)
        {
            position.x = pos.x;
            position.y = pos.y;
            position.z = pos.z;
            lean = 0;
            noise = Random.Range(0.5f, 1);
            if (Random.value < 0.5f) noise = -noise;
        }
    }

    Pass the buffer of grass to be rendered (whose world coordinates need to be calculated) to the GPU. First determine where the grass is generated and how much of it. Get the AABB of the current object's Mesh (assume it is a Plane Mesh for now).

    Bounds bounds = mf.sharedMesh.bounds;
    Vector3 clumps = bounds.extents;

    Determine the extent of the grass, then randomly generate grass on the xOz plane.



    It should be noted that we are still in object space, so we need to convert Object Space to World Space.

    pos = transform.TransformPoint(pos);

    Combine the density parameter with the object's scale to calculate how many blades of grass to render in total.

    Vector3 vec = transform.localScale / 0.1f * density;
    clumps.x *= vec.x;
    clumps.z *= vec.z;
    int total = (int)clumps.x * (int)clumps.z;

    Since each Compute Shader thread calculates one blade of grass, the number of blades to render is very likely not a multiple of the thread group size. The number is therefore rounded up to a multiple of the thread group size. In other words, when the density factor = 1, the number of blades rendered equals the number of threads in a thread group.

    groupSize = Mathf.CeilToInt((float)total / (float)threadGroupSize);
    int count = groupSize * (int)threadGroupSize;

    Let the Compute Shader calculate the lean angle of each blade.

    GrassClump clump = clumpsBuffer[id.x];
    clump.lean = sin(time) * maxLean * clump.noise;
    clumpsBuffer[id.x] = clump;

    Passing the grass position and rotation angle to the GPU Buffer is not the end. The Material must decide the final appearance of the rendered instance before Graphics.DrawMeshInstancedIndirect can be executed.

    In the rendering process, before the instancing stage (that is, in the procedural:setup function), use unity_InstanceID to determine which grass is currently being rendered. Fetch the current grass's world-space position and its lean value.

    GrassClump clump = clumpsBuffer[unity_InstanceID];
    _Position = clump.position;
    _Matrix = create_matrix(clump.position, clump.lean);

    The rotation + translation matrix itself:

    float4x4 create_matrix(float3 pos, float theta){
        float c = cos(theta); // Cosine of the rotation angle
        float s = sin(theta); // Sine of the rotation angle
        // Return a 4x4 transformation matrix
        return float4x4(
            c, -s, 0, pos.x, // First row: rotation about the z-axis, x translation
            s,  c, 0, pos.y, // Second row: rotation about the z-axis, y translation
            0,  0, 1, pos.z, // Third row: z unchanged by the rotation, z translation
            0,  0, 0, 1      // Fourth row: homogeneous coordinates (unchanged)
        );
    }

    How is this matrix derived? Substitute the axis (0,0,1) into Rodrigues' rotation formula to get a rotation matrix, then extend it to homogeneous coordinates, which gives the matrix used in the code.
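
    Written out, the substitution looks as follows. Here K is the cross-product matrix of the unit axis k, and p = (p_x, p_y, p_z) is the grass position; the symbols K, M and p are introduced only for this derivation.

    ```latex
    R(\theta) = I + \sin\theta \, K + (1 - \cos\theta) \, K^{2},
    \qquad
    K = \begin{pmatrix} 0 & -k_z & k_y \\ k_z & 0 & -k_x \\ -k_y & k_x & 0 \end{pmatrix}

    \mathbf{k} = (0, 0, 1): \quad
    K = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},
    \quad
    K^{2} = \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}

    R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix},
    \qquad
    M = \begin{pmatrix} \cos\theta & -\sin\theta & 0 & p_x \\ \sin\theta & \cos\theta & 0 & p_y \\ 0 & 0 & 1 & p_z \\ 0 & 0 & 0 & 1 \end{pmatrix}
    ```

    The matrix M matches create_matrix row by row, with c = cos θ and s = sin θ.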


    Multiply this matrix by the object-space vertices to get the leaned and translated vertex coordinates.

    v.vertex.xyz *= _Scale;
    float4 rotatedVertex = mul(_Matrix, v.vertex);
    v.vertex = rotatedVertex;

    Now comes the problem. The grass is currently not a single plane but a three-dimensional shape made of three Quads.


    If you simply rotate all vertices about the z-axis, the grass roots shift considerably.


    Therefore, we use v.texcoord.y to lerp between the vertex positions before and after the rotation. The higher the Y value of the texture coordinate (that is, the closer the vertex is to the top of the model), the more the rotation affects the vertex. Since the Y value at the grass root is 0, the root does not move after the lerp.

    v.vertex.xyz *= _Scale;
    float4 rotatedVertex = mul(_Matrix, v.vertex);
    // v.vertex = rotatedVertex;
    v.vertex.xyz += _Position;
    v.vertex = lerp(v.vertex, rotatedVertex, v.texcoord.y);
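
    The blend can be checked numerically with a small CPU-side sketch. It is an illustration under assumptions, not the shader itself: System.Numerics replaces the HLSL types, the rotation is about the z-axis through the grass position (called pivot here), and the names are hypothetical.

    ```csharp
    using System;
    using System.Numerics;

    public static class LeanBlendSketch
    {
        // Rotates `vertex` about the z-axis through `pivot` by theta,
        // then blends between the unrotated and rotated positions by uvY.
        public static Vector3 LeanVertex(Vector3 vertex, Vector3 pivot, float theta, float uvY)
        {
            Vector3 local = vertex - pivot;
            float c = MathF.Cos(theta);
            float s = MathF.Sin(theta);
            Vector3 rotated = new Vector3(c * local.X - s * local.Y, s * local.X + c * local.Y, local.Z) + pivot;
            // uvY = 0 (root) keeps the original vertex; uvY = 1 (tip) takes the full rotation.
            return Vector3.Lerp(vertex, rotated, uvY);
        }
    }
    ```

    A root vertex (uvY = 0) comes back unchanged whatever the angle, while a tip vertex (uvY = 1) follows the rotation completely; vertices in between move proportionally, which keeps the roots planted.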

    The result is poor; the grass looks too fake. This kind of Quad grass only holds up from a distance.

    • Stiff swaying
    • Stiff leaves
    • Poor lighting

    Current version code:

    2. Stylized Grass

    In the previous section, I used several Quads and grass with alpha maps, and used sin waves for disturbance, but the effect was very average. Now I will use stylized grass and Perlin noise to improve it.

    Define the grass' vertices, normals and UVs in C# and pass them to the GPU as a Mesh.

    Vector3[] vertices = { new Vector3(-halfWidth, 0, 0), new Vector3( halfWidth, 0, 0), new Vector3(-halfWidth, rowHeight, 0), new Vector3( halfWidth, rowHeight, 0), new Vector3 (-halfWidth*0.9f, rowHeight*2, 0), new Vector3( halfWidth*0.9f, rowHeight*2, 0), new Vector3(-halfWidth*0.8f, rowHeight*3, 0), new Vector3( halfWidth*0.8f, rowHeight*3, 0), new Vector3( 0, rowHeight*4, 0) } ; Vector3 normal = new Vector3(0, 0, -1); Vector3[] normals = { normal, normal, normal, normal, normal, normal, normal, normal, normal }; Vector2[] uvs = { new Vector2(0,0), new Vector2(1,0), new Vector2(0,0.25f), new Vector2(1,0.25f), new Vector2(0,0.5f), new Vector2(1,0.5f) , new Vector2(0,0.75f), new Vector2(1,0.75f), new Vector2(0.5f,1) };

    Unity's Mesh also needs its vertex (winding) order set. Unity treats a triangle as front-facing when its vertices run clockwise as seen by the viewer. If you wind the triangles the other way and enable backface culling, you won't see anything.

    int[] indices = {
        0,1,2, 1,3,2, // row 1
        2,3,4, 3,5,4, // row 2
        4,5,6, 5,7,6, // row 3
        6,7,8         // row 4
    };
    mesh.SetIndices(indices, MeshTopology.Triangles, 0);
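
    The index list follows a regular pattern: two triangles per quad row, plus one triangle for the tip. As a hedged illustration (blade_indices is a hypothetical helper, and it assumes the vertices are stored as left/right pairs per row with the tip vertex last), the same list can be generated in Python:

```python
def blade_indices(rows):
    """Triangle indices for `rows` quad rows followed by a single tip triangle.

    Assumes vertices are stored as (left, right) pairs from bottom to top,
    with one tip vertex at the end, as in the vertex array above.
    """
    indices = []
    for r in range(rows):
        a = 2 * r  # left vertex on the lower edge of this row
        indices += [a, a + 1, a + 2, a + 1, a + 3, a + 2]
    tip = 2 * rows  # left vertex of the last edge below the tip
    indices += [tip, tip + 1, tip + 2]
    return indices
```

    Calling blade_indices(3) reproduces the hand-written array, and changing the row count gives blades with more or fewer segments.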

    The wind direction, size and noise ratio are set in the code, packed into a float4, and passed to the Compute Shader to calculate the swinging direction of a blade of grass.

    Vector4 wind = new Vector4(Mathf.Cos(theta), Mathf.Sin(theta), windSpeed, windScale);

    The data structure for a single blade of grass:

    struct GrassBlade {
        public Vector3 position;
        public float bend;  // Random blade tilt
        public float noise; // Noise value computed by the Compute Shader
        public float fade;  // Random blade brightness
        public float face;  // Blade facing
        public GrassBlade(Vector3 pos) {
            position.x = pos.x;
            position.y = pos.y;
            position.z = pos.z;
            bend = 0;
            noise = Random.Range(0.5f, 1) * 2 - 1;
            fade = Random.Range(0.5f, 1);
            face = Random.Range(0, Mathf.PI);
        }
    }

    Currently, the grass blades are all oriented in the same direction. In the Setup function, first change the blade orientation.

    // Create a rotation matrix around the Y axis (facing)
    float4x4 rotationMatrixY = AngleAxis4x4(blade.position, blade.face, float3(0,1,0));

    The logic of tipping the grass blades (since AngleAxis4x4 includes displacement, the following figure only demonstrates the tipping of the blades without random orientation. If you want to get the effect shown in the figure below, remember to add displacement to the code):

    // Create a rotation matrix around the X axis (tilt)
    float4x4 rotationMatrixX = AngleAxis4x4(float3(0,0,0), blade.bend, float3(1,0,0));

    Then combine the two rotation matrices.

    _Matrix = mul(rotationMatrixY, rotationMatrixX);

    The lighting is now very strange because the normals are not modified.

    // Build the normal matrix from the transpose of the 3x3 rotation part
    float3x3 normalMatrix = (float3x3)transpose(((float3x3)_Matrix));
    // Transform the normal
    v.normal = mul(normalMatrix, v.normal);

    Here is the code for the transpose function used above:

    float3x3 transpose(float3x3 m) {
        return float3x3(
            float3(m[0][0], m[1][0], m[2][0]), // Column 1 of m
            float3(m[0][1], m[1][1], m[2][1]), // Column 2 of m
            float3(m[0][2], m[1][2], m[2][2])  // Column 3 of m
        );
    }

    For readability, the rotation is wrapped in a homogeneous 4x4 transformation matrix. The 3x3 part is the well-known axis-angle (Rodrigues) rotation formula, and the last column holds the translation:

    float4x4 AngleAxis4x4(float3 pos, float angle, float3 axis){
        float c, s;
        sincos(angle*2*3.14, s, c);
        float t = 1 - c;
        float x = axis.x;
        float y = axis.y;
        float z = axis.z;
        return float4x4(
            t * x * x + c    , t * x * y - s * z, t * x * z + s * y, pos.x,
            t * x * y + s * z, t * y * y + c    , t * y * z - s * x, pos.y,
            t * x * z - s * y, t * y * z + s * x, t * z * z + c    , pos.z,
            0,0,0,1
        );
    }
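
    To sanity-check the matrix layout, here is a small Python sketch of the same construction. It is an illustration only: angle_axis_4x4 and transform are hypothetical names, the angle is taken directly in radians (the shader version scales its input by 2*3.14 first), and the axis is assumed to be unit length.

```python
import math

def angle_axis_4x4(pos, angle, axis):
    """Axis-angle rotation (radians, unit axis) with `pos` as the translation column."""
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    x, y, z = axis
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y, pos[0]],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x, pos[1]],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c,     pos[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]

def transform(m, p):
    """Apply the 4x4 matrix m to the point p, treating p as (x, y, z, 1)."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[r][k] * v[k] for k in range(4)) for r in range(3))
```

    Rotating (1, 0, 0) by a quarter turn around the z-axis with zero translation lands on (0, 1, 0), up to floating point error. A non-zero pos is simply added after the rotation.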

    What if you want to spawn on uneven ground?


    You only need to change the logic that generates the initial height of the grass, using a MeshCollider and raycasting.

    bladesArray = new GrassBlade[count];
    gameObject.AddComponent<MeshCollider>();
    RaycastHit hit;
    Vector3 v = new Vector3();
    Debug.Log(… + bounds.extents.y);
    v.y = (… + bounds.extents.y);
    v = transform.TransformPoint(v);
    float heightWS = v.y + 0.01f; // Guard against floating point error
    v.Set(0, 0, 0);
    v.y = (… - bounds.extents.y);
    v = transform.TransformPoint(v);
    float neHeightWS = v.y;
    float range = heightWS - neHeightWS;
    // heightWS += 10; // Increase the margin slightly and adjust it yourself
    int index = 0;
    int loopCount = 0;
    while (index < count && loopCount < (count * 10)) {
        loopCount++;
        Vector3 pos = new Vector3(
            Random.value * bounds.extents.x * 2 - bounds.extents.x + …,
            0,
            Random.value * bounds.extents.z * 2 - bounds.extents.z + …);
        pos = transform.TransformPoint(pos);
        pos.y = heightWS;
        if (Physics.Raycast(pos, Vector3.down, out hit)) {
            pos.y = hit.point.y;
            GrassBlade blade = new GrassBlade(pos);
            bladesArray[index++] = blade;
        }
    }

    Here, rays are used to detect the position of each grass and calculate its correct height.


    You can also adjust it so that the higher the altitude, the sparser the grass.


    As shown above, calculate the ratio of the two green arrows. The higher the altitude, the lower the probability of generation.

    float deltaHeight = (pos.y - neHeightWS) / range;
    if (Random.value > deltaHeight) {
        // Grass
    }

    Current code link:

    Now there is no problem with lighting or shadow.

    3. Interactive Grass

    In the previous section, we first rotated the grass to its facing direction and then tilted it. Now we need one more rotation: when an object approaches, the grass should fall away from the object. Chaining a third rotation matrix is awkward, so the rotations are switched to quaternions. The quaternion is computed in the Compute Shader, stored in the grass blade structure, and passed to the material. Finally, in the vertex shader, the quaternion is converted back to an affine matrix to apply the rotation.

    Here we add random width and height of grass. Because each grass mesh is the same, we can't modify the height of grass by modifying the mesh. So we can only do vertex offset in Vert.

    // C#
    [Range(0,0.5f)] public float width = 0.2f;
    [Range(0,1f)] public float rd_width = 0.1f;
    [Range(0,2)] public float height = 1f;
    [Range(0,1f)] public float rd_height = 0.2f;

    GrassBlade blade = new GrassBlade(pos);
    blade.height = Random.Range(-rd_height, rd_height);
    blade.width = Random.Range(-rd_width, rd_width);
    bladesArray[index++] = blade;

    // At the start of Setup
    GrassBlade blade = bladesBuffer[unity_InstanceID];
    _HeightOffset = blade.height_offset;
    _WidthOffset = blade.width_offset;

    // At the start of Vert
    float tempHeight = v.vertex.y * _HeightOffset;
    float tempWidth = v.vertex.x * _WidthOffset;
    v.vertex.y += tempHeight;
    v.vertex.x += tempWidth;

    To sort it out, the current grass Buffer stores:

    struct GrassBlade{
        public Vector3 position;      // World position - needs to be initialized
        public float height;          // Grass height offset - needs to be initialized
        public float width;           // Grass width offset - needs to be initialized
        public float dir;             // Blade orientation - needs to be initialized
        public float fade;            // Random blade shading - needs to be initialized
        public Quaternion quaternion; // Rotation parameters - computed in the CS, used in Vert
        public float padding;
        public GrassBlade(Vector3 pos){
            position.x = pos.x;
            position.y = pos.y;
            position.z = pos.z;
            height = width = 0;
            dir = Random.Range(0, 180);
            fade = Random.Range(0.99f, 1);
            quaternion = Quaternion.identity;
            padding = 0;
        }
    }
    int SIZE_GRASS_BLADE = 12 * sizeof(float);

    The quaternion q that represents the rotation from vector v1 to vector v2 is built from the half-way vector between them:

    float4 MapVector(float3 v1, float3 v2){
        v1 = normalize(v1);
        v2 = normalize(v2);
        float3 v = v1+v2;
        v = normalize(v);
        float4 q = 0;
        q.w = dot(v, v2);
        q.xyz = cross(v, v2);
        return q;
    }
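
    The half-vector trick can be verified numerically. Below is a Python sketch under the assumption that quaternions are stored as (x, y, z, w); map_vector mirrors the function above, while rotate is an extra helper added here purely for checking and is not part of the original code.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def map_vector(v1, v2):
    """Quaternion (x, y, z, w) rotating v1 onto v2, built from the half-way vector."""
    v1, v2 = normalize(v1), normalize(v2)
    h = normalize(tuple(a + b for a, b in zip(v1, v2)))
    x, y, z = cross(h, v2)
    return (x, y, z, dot(h, v2))

def rotate(q, v):
    """Rotate v by the unit quaternion q: v' = v + 2w(u x v) + 2 u x (u x v)."""
    u, w = q[:3], q[3]
    uv = cross(u, v)
    uuv = cross(u, uv)
    return tuple(v[i] + 2.0 * (w * uv[i] + uuv[i]) for i in range(3))
```

    Note that the construction breaks down when v1 and v2 point in exactly opposite directions, because the half-way vector is then zero and cannot be normalized.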

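    To combine two rotation quaternions, multiply them (note that the order matters).

    Suppose there are two quaternions $q_1 = (w_1, \mathbf{v}_1)$ and $q_2 = (w_2, \mathbf{v}_2)$. The formula for calculating their product is:

    $$q_1 q_2 = \left(w_1 w_2 - \mathbf{v}_1 \cdot \mathbf{v}_2,\; w_1 \mathbf{v}_2 + w_2 \mathbf{v}_1 + \mathbf{v}_1 \times \mathbf{v}_2\right)$$

    where $w_1$ and $\mathbf{v}_1$ are the real and imaginary components of $q_1$, and $w_2$ and $\mathbf{v}_2$ are the real and imaginary components of $q_2$.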

    float4 quatMultiply(float4 q1, float4 q2) {
        // Quaternions are stored as float4(x, y, z, w), where w is the real part
        // Result = q1 * q2
        return float4(
            q1.w * q2.x + q1.x * q2.w + q1.y * q2.z - q1.z * q2.y, // X component
            q1.w * q2.y - q1.x * q2.z + q1.y * q2.w + q1.z * q2.x, // Y component
            q1.w * q2.z + q1.x * q2.y - q1.y * q2.x + q1.z * q2.w, // Z component
            q1.w * q2.w - q1.x * q2.x - q1.y * q2.y - q1.z * q2.z  // W (real) component
        );
    }
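
    A quick way to check a quaternion product is the identity i * j = k (and j * i = -k, which shows why the order matters). This Python sketch assumes the same (x, y, z, w) storage order; quat_multiply is an illustrative name, not part of the project code.

```python
def quat_multiply(q1, q2):
    """Hamilton product of two quaternions stored as (x, y, z, w)."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return (
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,  # x
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,  # y
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,  # z
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,  # w (real part)
    )

I = (1, 0, 0, 0)
J = (0, 1, 0, 0)
K = (0, 0, 1, 0)
IDENTITY = (0, 0, 0, 1)
```

    Multiplying by IDENTITY on either side leaves a quaternion unchanged, which makes it a convenient starting value when accumulating several rotations.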

    To determine where the grass should fall, you need the position of the interacting object, the trampler, taken from its Transform component. This position is passed to the Compute Shader every frame through SetVector, so the property name is converted to an integer ID once with Shader.PropertyToID and that ID is reused, instead of looking the property up by string every frame. You also need to define the range within which the grass falls and how it transitions between falling and not falling, so a trampleRadius is passed to the GPU. Since this is a constant that does not change every frame, it can simply be set once by string name.

    // C#
    public Transform trampler;
    [Range(0.1f,5f)] public float trampleRadius = 3f;
    ...
    Init(){
        shader.SetFloat("trampleRadius", trampleRadius);
        tramplePosID = Shader.PropertyToID("tramplePos");
    }
    Update(){
        shader.SetVector(tramplePosID, pos);
    }

    In this section, all rotation operations are moved into the Compute Shader and calculated in one go, and a single quaternion is returned to the material. q1 is the quaternion for the random orientation, q2 for the random tilt, and qt for the interactive tilt. You can also expose an interaction coefficient in the Inspector here.

    [numthreads(THREADGROUPSIZE,1,1)]
    void BendGrass (uint3 id : SV_DispatchThreadID) {
        GrassBlade blade = bladesBuffer[id.x];
        float3 relativePosition = blade.position - …;
        float dist = length(relativePosition);
        float4 qt;
        if (dist …

    Then the method of converting quaternion to rotation matrix is:

    float4x4 quaternion_to_matrix(float4 quat) {
        float4x4 m = float4x4(float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0));
        float x = quat.x, y = quat.y, z = quat.z, w = quat.w;
        float x2 = x + x, y2 = y + y, z2 = z + z;
        float xx = x * x2, xy = x * y2, xz = x * z2;
        float yy = y * y2, yz = y * z2, zz = z * z2;
        float wx = w * x2, wy = w * y2, wz = w * z2;
        m[0][0] = 1.0 - (yy + zz); m[0][1] = xy - wz; m[0][2] = xz + wy;
        m[1][0] = xy + wz; m[1][1] = 1.0 - (xx + zz); m[1][2] = yz - wx;
        m[2][0] = xz - wy; m[2][1] = yz + wx; m[2][2] = 1.0 - (xx + yy);
        m[0][3] = _Position.x; m[1][3] = _Position.y; m[2][3] = _Position.z;
        m[3][3] = 1.0;
        return m;
    }
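
    The conversion can be checked against a rotation whose matrix is known. This Python sketch mirrors the function above, except that the translation is passed in as a parameter rather than read from a global; the name quaternion_to_matrix and the (x, y, z, w) ordering are carried over as assumptions.

```python
import math

def quaternion_to_matrix(quat, position=(0.0, 0.0, 0.0)):
    """4x4 matrix for the unit quaternion (x, y, z, w) with `position` as translation."""
    x, y, z, w = quat
    x2, y2, z2 = x + x, y + y, z + z
    xx, xy, xz = x * x2, x * y2, x * z2
    yy, yz, zz = y * y2, y * z2, z * z2
    wx, wy, wz = w * x2, w * y2, w * z2
    return [
        [1.0 - (yy + zz), xy - wz,         xz + wy,         position[0]],
        [xy + wz,         1.0 - (xx + zz), yz - wx,         position[1]],
        [xz - wy,         yz + wx,         1.0 - (xx + yy), position[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]

# A quarter turn around the z-axis, as a unit quaternion.
QUARTER_TURN_Z = (0.0, 0.0, math.sqrt(0.5), math.sqrt(0.5))
```

    For QUARTER_TURN_Z the upper-left 3x3 block comes out as the familiar rotation that sends the x-axis to the y-axis.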

    Then apply it.

    void vert(inout appdata_full v, out Input data) {
        UNITY_INITIALIZE_OUTPUT(Input, data);
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        float tempHeight = v.vertex.y * _HeightOffset;
        float tempWidth = v.vertex.x * _WidthOffset;
        v.vertex.y += tempHeight;
        v.vertex.x += tempWidth;
        // Apply model vertex transformation
        v.vertex = mul(_Matrix, v.vertex);
        … += _Position;
        // Transform the normal with the transposed matrix
        v.normal = mul((float3x3)transpose(_Matrix), v.normal);
        #endif
    }

    void setup() {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        // Get Compute Shader calculation results
        GrassBlade blade = bladesBuffer[unity_InstanceID];
        _HeightOffset = blade.height_offset;
        _WidthOffset = blade.width_offset;
        _Fade = blade.fade; // Set shading
        _Matrix = quaternion_to_matrix(blade.quaternion); // Set the final rotation matrix
        _Position = blade.position; // Set position
        #endif
    }

    Current code link:

    4. Summary/Quiz

    How do you programmatically get the thread group sizes of a kernel?


    When defining a Mesh in code, the number of normals must be the same as the number of vertex positions. True or false.

  • Compute Shader学习笔记(三)之 粒子效果与群集行为模拟

    Compute Shader Learning Notes (Part 3) Particle Effects and Cluster Behavior Simulation


    Following the previous article

    remoooo: Compute Shader Learning Notes (II) Post-processing Effects

    L4 particle effects and crowd behavior simulation

    This chapter uses Compute Shader to generate particles. Learn how to use DrawProcedural and DrawMeshInstancedIndirect, also known as GPU Instancing.

    Summary of knowledge points:

    • Compute Shader, Material, C# script and Shader work together
    • Graphics.DrawProcedural
    • material.SetBuffer()
    • xorshift random algorithm
    • Swarm Behavior Simulation
    • Graphics.DrawMeshInstancedIndirect
    • Rotation, translation, and scaling matrices, homogeneous coordinates
    • Surface Shader
    • ComputeBufferType.Default
    • #pragma instancing_options procedural:setup
    • unity_InstanceID
    • Skinned Mesh Renderer
    • Data alignment

    1. Introduction and preparation

    In addition to being able to process large amounts of data at the same time, Compute Shader also has a key advantage, which is that the Buffer is stored in the GPU. Therefore, the data processed by the Compute Shader can be directly passed to the Shader associated with the Material, that is, the Vertex/Fragment Shader. The key here is that the material can also SetBuffer() like the Compute Shader, accessing data directly from the GPU's Buffer!


    Using Compute Shader to create a particle system can fully demonstrate the powerful parallel capabilities of Compute Shader.

    During rendering, the Vertex Shader reads the position and other attributes of each particle from the Compute Buffer and converts them into vertices on the screen. The Fragment Shader then generates pixels from the information of these vertices (such as position and color). With the Graphics.DrawProcedural method, Unity can directly render these shader-processed vertices without a pre-defined mesh structure and without relying on a Mesh Renderer, which is particularly effective for rendering a large number of particles.

    2. Hello Particle

    The steps are also very simple. Define the particle information (position, speed and life cycle) in C#, initialize and pass the data to Buffer, bind Buffer to Compute Shader and Material. In the rendering stage, call Graphics.DrawProceduralNow in OnRenderObject() to achieve efficient particle rendering.


    Create a new scene and create an effect: millions of particles follow the mouse and bloom into life, as follows:


    Writing this makes me think a lot. The life cycle of a particle is very short, ignited in an instant like a spark, and disappearing like a meteor. Despite thousands of hardships, I am just a speck of dust among billions of dust, ordinary and insignificant. These particles may float randomly in space (Use the "Xorshift" algorithm to calculate the position of particle spawning), may have unique colors, but they can't escape the fate of being programmed. Isn't this a portrayal of my life? I play my role step by step, unable to escape the invisible constraints.

    “God is dead! And how can we who have killed him not feel the greatest pain?” – Friedrich Nietzsche

    Nietzsche not only announced the disappearance of religious beliefs, but also pointed out the sense of nothingness faced by modern people, that is, without the traditional moral and religious pillars, people feel unprecedented loneliness and lack of direction. Particles are defined and created in the C# script, move and die according to specific rules, which is quite similar to the state of modern people in the universe described by Nietzsche. Although everyone tries to find their own meaning, they are ultimately restricted by broader social and cosmic rules.

    Life is full of various inevitable pains, reflecting the inherent emptiness and loneliness of human existence. The particle death logic yet to be written confirms what Nietzsche said: nothing in life is permanent. The particles in the same buffer will inevitably disappear at some point in the future, which reflects the loneliness of modern people described by Nietzsche. Individuals may feel unprecedented isolation and helplessness, so everyone is a lonely warrior who must learn to face the inner tornado and the indifference of the outside world alone.

    But it doesn’t matter, “Summer will come again and again, and those who are meant to meet will meet again.” The particles in this article will also be regenerated after the end, embracing their own Buffer in the best state.

    Summer will come around again. People who meet will meet again.


    The current version of the code can be copied and run by yourself (all with comments):

    • Compute Shader:
    • CPU:
    • Shader:

    Enough of the nonsense, let’s first take a look at how the C# script is written.


    As usual, first define the particle buffer (structure), initialize it, and then pass it to the GPU. The key lies in the last three lines, which bind the Buffer to the shaders. There is not much to say about the elided code below. These are all routine operations, so they are only mentioned in comments.

    struct Particle{
        public Vector3 position; // Particle position
        public Vector3 velocity; // Particle velocity
        public float life;       // Particle life cycle
    }
    ComputeBuffer particleBuffer; // GPU Buffer
    ...
    // Init()
    // Initialize particle array
    Particle[] particleArray = new Particle[particleCount];
    for (int i = 0; i < particleCount; i++){
        // Generate random positions and normalize
        ...
        // Set the initial position and velocity of the particle
        ...
        // Set the life cycle of the particle
        particleArray[i].life = Random.value * 5.0f + 1.0f;
    }
    // Create and set up the Compute Buffer
    ...
    // Find the kernel ID in the Compute Shader
    ...
    // Bind the Compute Buffer to the shaders
    shader.SetBuffer(kernelID, "particleBuffer", particleBuffer);
    material.SetBuffer("particleBuffer", particleBuffer);
    material.SetInt("_PointSize", pointSize);

    The key rendering stage is OnRenderObject(). material.SetPass selects the material pass to render with. The DrawProceduralNow method draws geometry without using a traditional mesh. MeshTopology.Points sets the topology to points: the GPU treats each vertex as a point and forms no lines or faces between vertices. The second parameter, 1, is the number of vertices per instance, and particleCount is the number of instances. In other words, the GPU draws one point per particle, and the vertex shader picks the particle by its instance ID.

    void OnRenderObject() { material.SetPass(0); Graphics.DrawProceduralNow(MeshTopology.Points, 1, particleCount); }

    The current mouse position is obtained in OnGUI(). Note that this method may be called multiple times per frame. The z value is set to the camera's near clipping plane plus an offset. Here, 14 is added to get a world coordinate at a more suitable visual depth (you can adjust it yourself).

    void OnGUI() {
        Vector3 p = new Vector3();
        Camera c = Camera.main;
        Event e = Event.current;
        Vector2 mousePos = new Vector2();
        // Get the mouse position from Event.
        // Note that the y position from Event is inverted.
        mousePos.x = e.mousePosition.x;
        mousePos.y = c.pixelHeight - e.mousePosition.y;
        p = c.ScreenToWorldPoint(new Vector3(mousePos.x, mousePos.y, c.nearClipPlane + 14));
        cursorPos.x = p.x;
        cursorPos.y = p.y;
    }

    ComputeBuffer particleBuffer has been passed to Compute Shader and Shader above.

    Let's first look at the data structure of the Compute Shader. Nothing special.

    // Define particle data structure
    struct Particle {
        float3 position; // particle position
        float3 velocity; // particle velocity
        float life;      // particle remaining life time
    };
    // Structured buffer used to store and update particle data, readable and writable from the GPU
    RWStructuredBuffer<Particle> particleBuffer;
    // Variables set from the CPU
    float deltaTime;      // Time difference from the previous frame to the current frame
    float2 mousePosition; // Current mouse position

    Here I will briefly talk about a particularly useful random number sequence generation method, the xorshift algorithm. It will be used to randomly control the movement direction of particles as shown above. The particles will move randomly in three-dimensional directions.

    • For more information, please refer to:
    • Original paper link:

    This algorithm was proposed by George Marsaglia in 2003. Its advantages are that it is extremely fast and very space-efficient. Even the simplest Xorshift implementation has a very long pseudo-random number cycle.

    The basic operations are shift and XOR. Hence the name of the algorithm. Its core is to maintain a non-zero state variable and generate random numbers by performing a series of shift and XOR operations on this state variable.

    // State variable for random number generation
    uint rng_state;
    uint rand_xorshift() {
        // Xorshift algorithm from George Marsaglia's paper
        rng_state ^= (rng_state << 13); // Shift the state left by 13 bits, then XOR it with the original state
        rng_state ^= (rng_state >> 17); // Shift the updated state right by 17 bits, and XOR it again
        rng_state ^= (rng_state << 5);  // Finally, shift the state left by 5 bits, and XOR it one last time
        return rng_state;               // Return the updated state as the generated random number
    }

    That covers the core of the basic Xorshift algorithm, but different shift combinations give rise to multiple variants. The original paper also describes the Xorshift128 variant, which uses a 128-bit state and updates it with a different set of shift and XOR operations. The code is as follows:

    // C version
    uint32_t xorshift128(void) {
        static uint32_t x = 123456789;
        static uint32_t y = 362436069;
        static uint32_t z = 521288629;
        static uint32_t w = 88675123;
        uint32_t t = x ^ (x << 11);
        x = y; y = z; z = w;
        w = w ^ (w >> 19) ^ (t ^ (t >> 8));
        return w;
    }

    This produces a longer period and better statistical behavior. The period of this variant is 2^128 - 1, which is very impressive.

    In general, this algorithm is completely sufficient for game development, but it is not suitable for use in fields such as cryptography.

    When using this algorithm in a Compute Shader, note that Xorshift produces values over the whole uint32 range, so one more mapping is needed ([0, 2^32-1] is mapped to [0, 1]):

    float tmp = (1.0 / 4294967296.0); // conversion factor
    float(rand_xorshift()) * tmp

    The direction of particle movement is signed, so we just need to subtract 0.5 from it. Random movement in three directions:

    float f0 = float(rand_xorshift()) * tmp - 0.5;
    float f1 = float(rand_xorshift()) * tmp - 0.5;
    float f2 = float(rand_xorshift()) * tmp - 0.5;
    float3 normalF3 = normalize(float3(f0, f1, f2)) * 0.8f; // Scale the direction of movement
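
    The whole chain, from the 32-bit generator to a scaled random direction, can be reproduced on the CPU. The Python sketch below is an illustration: the class Xorshift32 and the function random_direction are hypothetical names, and the explicit 32-bit mask stands in for the wrap-around that a uint gets for free on the GPU.

```python
import math

MASK32 = 0xFFFFFFFF

class Xorshift32:
    """32-bit xorshift generator with the (13, 17, 5) shift triple."""

    def __init__(self, seed):
        if seed & MASK32 == 0:
            raise ValueError("xorshift state must be non-zero")
        self.state = seed & MASK32

    def next_uint(self):
        x = self.state
        x ^= (x << 13) & MASK32  # left shifts must be truncated to 32 bits
        x ^= x >> 17
        x ^= (x << 5) & MASK32
        self.state = x
        return x

    def next_unit(self):
        """Map the 32-bit output into [0, 1)."""
        return self.next_uint() * (1.0 / 4294967296.0)

def random_direction(rng, scale=0.8):
    """Random direction of length `scale`, built from three centred samples."""
    f = [rng.next_unit() - 0.5 for _ in range(3)]
    n = math.sqrt(sum(c * c for c in f))
    return tuple(c / n * scale for c in f)
```

    A zero seed is rejected because an all-zero state would stay zero forever, which is why the state variable must be non-zero.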

    Each Kernel needs to complete the following:

    • First, read the particle's data from the previous frame out of the Buffer
    • Update the particle (calculate its velocity, update its position and remaining life) and write it back to the Buffer
    • If the remaining life is less than 0, regenerate the particle

    To regenerate a particle, use the random direction obtained from Xorshift above to place it, then reset its life and velocity.

    // Set the new position and life of the particle
    particleBuffer[id].position = float3(normalF3.x + mousePosition.x, normalF3.y + mousePosition.y, normalF3.z + 3.0);
    particleBuffer[id].life = 4; // Reset life
    particleBuffer[id].velocity = float3(0,0,0); // Reset velocity
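
    The per-particle work of the kernel can be sketched on the CPU as follows. This is a simplified Python illustration, not the actual kernel: the velocity update is left out because the source does not show it, and update_particle and spawn_offset are hypothetical names (spawn_offset plays the role of normalF3).

```python
from dataclasses import dataclass

@dataclass
class Particle:
    position: tuple
    velocity: tuple
    life: float

def update_particle(p, delta_time, mouse_xy, spawn_offset):
    """Age and move one particle; respawn it near the mouse once its life runs out."""
    p.life -= delta_time
    p.position = tuple(x + v * delta_time for x, v in zip(p.position, p.velocity))
    if p.life < 0.0:
        p.position = (spawn_offset[0] + mouse_xy[0],
                      spawn_offset[1] + mouse_xy[1],
                      spawn_offset[2] + 3.0)
        p.life = 4.0                    # reset life
        p.velocity = (0.0, 0.0, 0.0)    # reset velocity
    return p
```

    On the GPU, each thread would run this logic for the single particle selected by its thread ID, which is what makes the update parallel.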

    Finally, the basic data structure of Shader:

    struct Particle{
        float3 position;
        float3 velocity;
        float life;
    };
    struct v2f{
        float4 position : SV_POSITION;
        float4 color : COLOR;
        float life : LIFE;
        float size: PSIZE;
    };
    // particles' data
    StructuredBuffer<Particle> particleBuffer;

    The vertex shader then calculates the particle's vertex color and clip-space position, and passes along the point size.

    v2f vert(uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID){
        v2f o = (v2f)0;
        // Color
        float life = particleBuffer[instance_id].life;
        float lerpVal = life * 0.25f;
        o.color = fixed4(1.0f - lerpVal+0.1, lerpVal+0.1, 1.0f, lerpVal);
        // Position
        o.position = UnityObjectToClipPos(float4(particleBuffer[instance_id].position, 1.0f));
        o.size = _PointSize;
        return o;
    }

    The fragment shader calculates the interpolated color.

    float4 frag(v2f i) : COLOR{ return i.color; }

    At this point, you can get the above effect.


    3. Quad particles

    In the previous section, each particle only had one point, which was not interesting. Now let's turn a point into a Quad. In Unity, there is no Quad, only a fake Quad composed of two triangles.

    Let's start working on it, based on the code above. Define the vertices in C#, the size of a Quad.

    // struct
    struct Vertex {
        public Vector3 position;
        public Vector2 uv;
        public float life;
    }
    const int SIZE_VERTEX = 6 * sizeof(float);
    public float quadSize = 0.1f; // Quad size

    On a per-particle basis, set the UV coordinates of the six vertices for use in the vertex shader, and draw them in the order specified by Unity.

    index = i*6;
    // Triangle 1 - bottom-left, top-left, top-right
    vertexArray[index].uv.Set(0,0);
    vertexArray[index+1].uv.Set(0,1);
    vertexArray[index+2].uv.Set(1,1);
    // Triangle 2 - bottom-left, top-right, bottom-right
    vertexArray[index+3].uv.Set(0,0);
    vertexArray[index+4].uv.Set(1,1);
    vertexArray[index+5].uv.Set(1,0);

    Finally, the array is passed to the Buffer. The halfSize here is passed to the Compute Shader, which uses it to calculate the position of each Quad vertex.

    vertexBuffer = new ComputeBuffer(numVertices, SIZE_VERTEX);
    vertexBuffer.SetData(vertexArray);
    shader.SetBuffer(kernelID, "vertexBuffer", vertexBuffer);
    shader.SetFloat("halfSize", quadSize*0.5f);
    material.SetBuffer("vertexBuffer", vertexBuffer);

    During the rendering phase, the points are changed into triangles with six points.

    void OnRenderObject() { material.SetPass(0); Graphics.DrawProceduralNow(MeshTopology.Triangles, 6, numParticles); }

    Change the Shader so that it receives the vertex data and a texture for display. Alpha blending is required.

    _MainTex("Texture", 2D) = "white" {}
    ...
    Tags{ "Queue"="Transparent" "RenderType"="Transparent" "IgnoreProjector"="True" }
    LOD 200
    Blend SrcAlpha OneMinusSrcAlpha
    ZWrite Off
    ...
    struct Vertex{
        float3 position;
        float2 uv;
        float life;
    };
    StructuredBuffer<Vertex> vertexBuffer;
    sampler2D _MainTex;
    v2f vert(uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID) {
        v2f o = (v2f)0;
        int index = instance_id*6 + vertex_id;
        float lerpVal = vertexBuffer[index].life * 0.25f;
        o.color = fixed4(1.0f - lerpVal+0.1, lerpVal+0.1, 1.0f, lerpVal);
        o.position = UnityWorldToClipPos(float4(vertexBuffer[index].position, 1.0f));
        o.uv = vertexBuffer[index].uv;
        return o;
    }
    float4 frag(v2f i) : COLOR {
        fixed4 color = tex2D( _MainTex, i.uv ) * i.color;
        return color;
    }

    In the Compute Shader, add receiving vertex data and halfSize.

    struct Vertex {
        float3 position;
        float2 uv;
        float life;
    };
    RWStructuredBuffer<Vertex> vertexBuffer;
    float halfSize;

    Calculate the positions of the six vertices of each Quad.

    // Set the vertex buffer
    int index = id.x * 6;
    // Triangle 1 - bottom-left, top-left, top-right
    vertexBuffer[index].position.x = p.position.x-halfSize;
    vertexBuffer[index].position.y = p.position.y-halfSize;
    vertexBuffer[index].position.z = p.position.z;
    vertexBuffer[index].life = …;
    vertexBuffer[index+1].position.x = p.position.x-halfSize;
    vertexBuffer[index+1].position.y = p.position.y+halfSize;
    vertexBuffer[index+1].position.z = p.position.z;
    vertexBuffer[index+1].life = …;
    vertexBuffer[index+2].position.x = p.position.x+halfSize;
    vertexBuffer[index+2].position.y = p.position.y+halfSize;
    vertexBuffer[index+2].position.z = p.position.z;
    vertexBuffer[index+2].life = …;
    // Triangle 2 - bottom-left, top-right, bottom-right
    vertexBuffer[index+3].position.x = p.position.x-halfSize;
    vertexBuffer[index+3].position.y = p.position.y-halfSize;
    vertexBuffer[index+3].position.z = p.position.z;
    vertexBuffer[index+3].life = …;
    vertexBuffer[index+4].position.x = p.position.x+halfSize;
    vertexBuffer[index+4].position.y = p.position.y+halfSize;
    vertexBuffer[index+4].position.z = p.position.z;
    vertexBuffer[index+4].life = …;
    vertexBuffer[index+5].position.x = p.position.x+halfSize;
    vertexBuffer[index+5].position.y = p.position.y-halfSize;
    vertexBuffer[index+5].position.z = p.position.z;
    vertexBuffer[index+5].life = …;
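
    The six assignments boil down to four corners emitted in a fixed order. As a compact illustration (quad_vertices is a hypothetical helper, and the quad is assumed to lie in the xy-plane at the particle's depth, as in the code above):

```python
def quad_vertices(center, half_size):
    """Six vertices (two triangles) of an axis-aligned quad around `center`."""
    cx, cy, cz = center
    bl = (cx - half_size, cy - half_size, cz)  # bottom-left
    tl = (cx - half_size, cy + half_size, cz)  # top-left
    tr = (cx + half_size, cy + half_size, cz)  # top-right
    br = (cx + half_size, cy - half_size, cz)  # bottom-right
    # Triangle 1: bottom-left, top-left, top-right
    # Triangle 2: bottom-left, top-right, bottom-right
    return [bl, tl, tr, bl, tr, br]
```

    The vertex order matches the UV order set in the C# script, so vertex k of the quad always pairs with UV k.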

    Mission accomplished.


    Current version code:

    • Compute Shader:
    • CPU:
    • Shader:

    In the next section, we will upgrade the Mesh to a prefab and try to simulate the flocking behavior of birds in flight.

    4. Flocking simulation


    Flocking is an algorithm that simulates the collective movement of animals in nature, such as flocks of birds and schools of fish. It is built on three basic behavioral rules, proposed by Craig Reynolds at SIGGRAPH '87, and is often referred to as the "Boids" algorithm:

    • Separation: particles must not get too close to each other; they need a sense of boundaries. Specifically, each particle looks at the particles within a certain radius around it and computes a direction that steers away from them to avoid collision.
    • Alignment: the velocity of an individual tends toward the average velocity of the group, giving a sense of belonging. Specifically, each particle computes the average velocity (both magnitude and direction) of the particles within its visual range. This visual range is determined by the actual biological characteristics of the bird, which will be covered in the next section.
    • Cohesion: the position of an individual tends toward the average position (the center of the group), for a sense of safety. Specifically, each particle finds the geometric center of its neighbors and computes a movement vector toward it (the end result is the average location).
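
    The three rules above can be sketched in a few lines of CPU code. The following Python is a brute-force illustration under simple assumptions: positions and velocities are lists of 3-tuples, a single radius is used for all three rules, and flocking_terms is a hypothetical name. It returns the raw separation, alignment and cohesion vectors without combining or weighting them.

```python
def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def scale(a, k):
    return tuple(x * k for x in a)

def flocking_terms(i, positions, velocities, radius):
    """Return (separation, alignment, cohesion) for boid i by checking every other boid."""
    sep = (0.0, 0.0, 0.0)
    vel_sum = (0.0, 0.0, 0.0)
    pos_sum = (0.0, 0.0, 0.0)
    count = 0
    for j, (p, v) in enumerate(zip(positions, velocities)):
        if j == i:
            continue
        offset = sub(positions[i], p)          # points away from the neighbour
        dist = sum(c * c for c in offset) ** 0.5
        if dist < radius:
            sep = add(sep, offset)
            vel_sum = add(vel_sum, v)
            pos_sum = add(pos_sum, p)
            count += 1
    if count == 0:
        return sep, velocities[i], (0.0, 0.0, 0.0)
    alignment = scale(vel_sum, 1.0 / count)                        # average neighbour velocity
    cohesion = sub(scale(pos_sum, 1.0 / count), positions[i])      # vector toward neighbour centre
    return sep, alignment, cohesion
```

    Because every boid loops over every other boid, the cost grows quadratically with the flock size, which is the bottleneck discussed next.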

    Think about it, which of the above three rules is the most difficult to implement?

    Answer: Separation. As we all know, calculating collisions between objects is very difficult to achieve. Because each individual needs to compare distances with all other individuals, this will cause the time complexity of the algorithm to be close to O(n^2), where n is the number of particles. For example, if there are 1,000 particles, then nearly 500,000 distance calculations may be required in each iteration. In the original paper, the author took 95 seconds to render one frame (80 birds) in the original unoptimized algorithm (time complexity O(N^2)), and it took nearly 9 hours to render a 300-frame animation.

    Generally speaking, using a quadtree or spatial hashing method can optimize the calculation. You can also maintain a neighbor list to store the individuals around each individual at a certain distance. Of course, you can also use Compute Shader to perform hard calculations.
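
    As a rough illustration of the spatial hashing idea, the Python sketch below buckets particles into a uniform grid and only tests the 27 cells around a particle. The names build_grid and neighbours are hypothetical, and the sketch assumes the cell size is at least as large as the search radius.

```python
from collections import defaultdict
from itertools import product

def build_grid(positions, cell):
    """Bucket particle indices by the integer grid cell they fall into."""
    grid = defaultdict(list)
    for i, p in enumerate(positions):
        grid[tuple(int(c // cell) for c in p)].append(i)
    return grid

def neighbours(i, positions, grid, cell, radius):
    """Indices within `radius` of particle i; requires cell >= radius."""
    p = positions[i]
    base = tuple(int(c // cell) for c in p)
    found = []
    for d in product((-1, 0, 1), repeat=3):  # the 27 surrounding cells
        key = tuple(b + o for b, o in zip(base, d))
        for j in grid.get(key, []):
            if j != i and sum((a - b) ** 2 for a, b in zip(p, positions[j])) < radius * radius:
                found.append(j)
    return sorted(found)
```

    With roughly uniform density, each query touches only a handful of particles instead of all of them, at the cost of rebuilding the grid whenever particles move.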


    Without further ado, let’s get started.

    First download the prepared project files (if not prepared in advance):

    • Bird's Prefab:
    • Script:
    • Compute Shader:

    Then add it to an empty GO.


    Start the project and you'll see a bunch of birds.


    Below are some parameters for group behavior simulation.

    // Define the parameters for the crowd behavior simulation. public float rotationSpeed = 1f; // Rotation speed. public float boidSpeed = 1f; // Boid speed. public float neighbourDistance = 1f; // Neighboring distance. public float boidSpeedVariation = 1f; // Speed variation. public GameObject boidPrefab; // Prefab of Boid object. public int boidsCount; // Number of Boids. public float spawnRadius; // Radius of Boid spawn. public Transform target; // The moving target of the crowd.

    Except for the Boid prefab boidPrefab and the spawn radius spawnRadius, everything else needs to be passed to the GPU.

    For the sake of convenience, this section takes a naive shortcut. We only calculate each bird's position and direction on the GPU, and then read the results back to the CPU for the following processing:

    ...
    boidsBuffer.GetData(boidsArray);
    // Update the position and direction of each bird
    for (int i = 0; i < boidsArray.Length; i++){
        boids[i].transform.localPosition = boidsArray[i].position;
        if (!boidsArray[i].direction.Equals(…)) {
            boids[i].transform.rotation = Quaternion.LookRotation(boidsArray[i].direction);
        }
    }

    The Quaternion.LookRotation() method is used to create a rotation so that an object faces a specified direction.

    Calculate the position of each bird in the Compute Shader.

    #pragma kernel CSMain
    #define GROUP_SIZE 256

    struct Boid{
        float3 position;
        float3 direction;
    };
    RWStructuredBuffer<Boid> boidsBuffer;
    float time;
    float deltaTime;
    float rotationSpeed;
    float boidSpeed;
    float boidSpeedVariation;
    float3 flockPosition;
    float neighborDistance;
    int boidsCount;


    void CSMain (uint3 id : SV_DispatchThreadID) { … // Continue below }

    First write the alignment and cohesion logic, and finally output the actual position and direction to the buffer.

    Boid boid = boidsBuffer[id.x];

    float3 separation = 0;           // Separation
    float3 alignment = 0;            // Alignment - direction
    float3 cohesion = flockPosition; // Cohesion - position
    uint nearbyCount = 1;            // Count itself as a surrounding individual.

    for (int i=0; i

    This is what happens without a separation term: the individuals have no sense of personal space, so they crowd together and overlap.


    Add the following code.

    if (distance(boid.position, temp.position) < neighborDistance)
    {
        float3 offset = boid.position - temp.position;
        float dist = length(offset);
        if (dist < neighborDistance)
        {
            dist = max(dist, 0.000001); // Avoid division by zero
            separation += offset * (1.0/dist - 1.0/neighborDistance);
        }
        ...

    The closer two Boids are, the larger 1.0/dist becomes, which means the separation force should be greater. 1.0/neighborDistance is a constant determined by the configured neighbor distance. The difference between the two describes how strongly the separation force responds to distance. If the distance between two Boids is exactly neighborDistance, the value is zero (no separation force). If the distance is less than neighborDistance, the value is positive, and the smaller the distance, the larger the value.
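
    To make the weighting concrete, here is a small CPU-side sketch in Python rather than HLSL. It mirrors the separation term above under my own assumptions: the function names separation_weight and separation are made up for illustration, and positions are plain 3-element sequences.

```python
import math

def separation_weight(dist, neighbor_distance, eps=1e-6):
    """Scale factor applied to the offset vector for one neighbour."""
    if dist >= neighbor_distance:
        return 0.0  # outside the neighbourhood: no contribution
    dist = max(dist, eps)  # avoid division by zero for coincident boids
    return 1.0 / dist - 1.0 / neighbor_distance

def separation(boid, others, neighbor_distance):
    """Sum the weighted offsets pointing away from every nearby boid."""
    total = [0.0, 0.0, 0.0]
    for other in others:
        offset = [b - o for b, o in zip(boid, other)]
        dist = math.sqrt(sum(c * c for c in offset))
        w = separation_weight(dist, neighbor_distance)
        total = [t + c * w for t, c in zip(total, offset)]
    return total
```

    Because the offset is not normalized, the length of each contribution works out to 1 - dist/neighborDistance, so a single neighbor can never push with a force larger than 1.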


    Current code:

    The next section will use Instanced Mesh to improve performance.

    5. GPU Instancing Optimization

    First, let's review the content of this chapter. In both the "Hello Particle" and "Quad Particle" examples, we used instancing (Graphics.DrawProceduralNow()) to pass the particle positions calculated by the Compute Shader directly to the vertex/fragment shader.


    The DrawMeshInstancedIndirect function used in this section draws a large number of instances of the same mesh. The instances are similar, but their positions, rotations or other parameters differ slightly. Compared with DrawProceduralNow, which regenerates the geometry and renders it every frame, DrawMeshInstancedIndirect only needs the instance information to be set up once; the GPU can then render all instances in one go based on that information. It is well suited to rendering grass or groups of animals.


    This function has many parameters; only some of them are used here.

    Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer);
    1. boidMesh: The mesh to draw, here the bird mesh.
    2. subMeshIndex: The submesh index to draw. Usually 0 if the mesh has only one submesh.
    3. boidMaterial: The material applied to the instanced objects.
    4. bounds: The bounding volume that surrounds all the instances to be drawn. Unity uses it to cull the draw call as a whole; individual instances are not culled against it.
    5. argsBuffer: A ComputeBuffer of draw arguments, including the number of indices per instance and the number of instances.

    What is this argsBuffer? It tells Unity how many indices each instance of the mesh has and how many instances we want to render. It is supplied as a special buffer.

    When initializing the shader, a special buffer is created with the ComputeBufferType.IndirectArguments flag. This type of buffer exists specifically to hand draw arguments to the GPU so that indirect draw commands can be executed there. Note that the first parameter of new ComputeBuffer is 1: the buffer holds one args array, and that array consists of 5 uints (the second parameter is its size in bytes). Don't mix the two up.

    ComputeBuffer argsBuffer;
    ...
    argsBuffer = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
    if (boidMesh != null)
    {
        args[0] = (uint)boidMesh.GetIndexCount(0); // Index count per instance
        args[1] = (uint)numOfBoids;                // Instance count
    }
    argsBuffer.SetData(args);
    ...
    Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer);
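
    To show why the buffer is "1 element of 5 uints", here is a byte-level sketch in Python. As far as I know from the Unity documentation for DrawMeshInstancedIndirect, the five values are, in order: index count per instance, instance count, start index location, base vertex location and start instance location. The helper name make_indirect_args and the example numbers are my own.

```python
import struct

def make_indirect_args(index_count, instance_count,
                       start_index=0, base_vertex=0, start_instance=0):
    """Pack the five draw arguments as little-endian 32-bit unsigned ints."""
    return struct.pack("<5I", index_count, instance_count,
                       start_index, base_vertex, start_instance)

# One args element for a hypothetical bird mesh with 300 indices and 1000 birds.
args_bytes = make_indirect_args(300, 1000)
stride = len(args_bytes)  # size of the single element, i.e. 5 * sizeof(uint)
```

    Only the first two slots are filled in by the C# code above; the remaining three stay at zero.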

    Building on the previous section, a noise_offset field is added to the individual data structure; the Compute Shader uses it as a per-bird offset. In addition, the initial direction is interpolated with Slerp: 70% keeps the original direction and 30% is random. The result of Slerp is a quaternion, so it is converted to Euler angles through the quaternion's eulerAngles property before being passed to the constructor.

    public float noise_offset;
    ...
    Quaternion rot = Quaternion.Slerp(transform.rotation, Random.rotation, 0.3f);
    boidsArray[i] = new Boid(pos, rot.eulerAngles, offset);

    After passing this new attribute noise_offset to the Compute Shader, a noise value in the range [-1, 1] is calculated and applied to the bird's speed.

    float noise = clamp(noise1(time / 100.0 + boid.noise_offset), -1, 1) * 2.0 - 1.0;
    float velocity = boidSpeed * (1.0 + noise * boidSpeedVariation);

    Then we optimized the algorithm a bit. Compute Shader is basically the same.

    if (distance(boid_pos, boidsBuffer[i].position) < neighborDistance)
    {
        float3 tempBoid_position = boidsBuffer[i].position;
        float3 offset = boid.position - tempBoid_position;
        float dist = length(offset);
        if (dist

    The biggest difference is in the shader. This section uses a surface shader instead of a vertex/fragment shader. A surface shader is really a packaged vertex and fragment shader in which Unity has already done a lot of the tedious work, such as lighting and shadows. You can still specify a custom vertex function.

    When writing a shader for the material, instanced objects need special handling. The position, rotation and other properties of an ordinary rendered object are set by Unity. For instanced objects, these parameters change constantly, so the rendering pipeline needs a special mechanism to set the position and parameters of each instance dynamically. The method used here is procedural instancing, which renders all instances at once instead of drawing them one by one. In other words, one batched draw.

    The shader uses procedural instancing. The setup function runs before vert, so each instance gets its own rotation, translation and scaling matrix.

    Now we need to create a rotation matrix for each instantiated object. From the Buffer, we get the basic information of the bird calculated by the Compute Shader (in the previous section, the data was sent back to the CPU, and here it is directly sent to the Shader for instantiation):


    In the shader, the data structure and the buffer declaration are wrapped in the following macro.

    // .shader
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    struct Boid
    {
        float3 position;
        float3 direction;
        float noise_offset;
    };

    StructuredBuffer<Boid> boidsBuffer;
    #endif

    Since args[1] of DrawMeshInstancedIndirect in C# specifies only the number of birds to instantiate (which is also the size of the buffer), unity_InstanceID can be used directly as the index into the buffer.

    #pragma instancing_options procedural:setup

    void setup()
    {
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _BoidPosition = boidsBuffer[unity_InstanceID].position;
        _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
    }

    The calculation of the space transformation matrix here involves homogeneous coordinates; you can review the GAMES101 course. A point is written (x,y,z,1) and a vector (direction) is written (x,y,z,0).

    If you apply the affine transformation as a separate rotation followed by a translation, the code is as follows:

    void setup()
    {
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _BoidPosition = boidsBuffer[unity_InstanceID].position;
        _LookAtMatrix = look_at_matrix(boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
    }

    void vert(inout appdata_full v, out Input data)
    {
        UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        v.vertex = mul(_LookAtMatrix, v.vertex); // Rotate first
        v.vertex.xyz += _BoidPosition;           // Then translate
    #endif
    }

    That is not elegant enough. With homogeneous coordinates, one matrix handles rotation, translation and scaling!

    void setup()
    {
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _BoidPosition = boidsBuffer[unity_InstanceID].position;
        _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
    }

    void vert(inout appdata_full v, out Input data)
    {
        UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        v.vertex = mul(_Matrix, v.vertex);
    #endif
    }
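
    To see what the w component buys us, here is a tiny Python sketch. It is not the create_matrix function from the shader: trs_y is a stand-in I made up that only rotates about the Y axis and then translates, which is enough to show that one 4x4 matrix moves points but leaves directions untranslated.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def trs_y(angle, t):
    """4x4 matrix that rotates about the Y axis by angle, then translates by t."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c,   0.0, s,   t[0]],
            [0.0, 1.0, 0.0, t[1]],
            [-s,  0.0, c,   t[2]],
            [0.0, 0.0, 0.0, 1.0]]

m = trs_y(math.pi / 2, (1.0, 2.0, 3.0))
point = mat_vec(m, [0.0, 0.0, 1.0, 1.0])      # w = 1: rotated and translated
direction = mat_vec(m, [0.0, 0.0, 1.0, 0.0])  # w = 0: rotated only
```

    The translation sits in the last column, so it is multiplied by w: points (w = 1) pick it up and directions (w = 0) ignore it.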

    Now, we are done! The current frame rate is nearly doubled compared to the previous section.


    Current version code:

    • Compute Shader:
    • CPU:
    • Shader:

    6. Apply skin animation


    In this section we use the Animator component to bake the mesh of each keyframe into a buffer before instancing the objects. By selecting different indices, we get the mesh in different poses. How to author the skeletal animation itself is beyond the scope of this article.

    You just need to modify the code based on the previous chapter and add the Animator logic. I have written comments below, you can take a look.

    And the individual data structure is updated:

    struct Boid
    {
        float3 position;
        float3 direction;
        float noise_offset;
        float speed;     // not useful for now
        float frame;     // indicates the current frame index in the animation
        float3 padding;  // ensure data alignment
    };

    Let's talk about alignment in detail. The size of a data structure stored in a buffer should preferably be an integer multiple of 16 bytes.

    • float3 position; (12 bytes)
    • float3 direction; (12 bytes)
    • float noise_offset; (4 bytes)
    • float speed; (4 bytes)
    • float frame; (4 bytes)
    • float3 padding; (12 bytes)

    Without padding, the size is 36 bytes, which is not a multiple of 16. With padding, the size is 48 bytes, perfect!
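
    The same arithmetic as a few lines of Python, in case you want to check your own struct. The field sizes are the ones listed above; the helper names are mine.

```python
# (field name, size in bytes) for the Boid struct without its padding field
BOID_FIELDS = [
    ("position", 12),      # float3
    ("direction", 12),     # float3
    ("noise_offset", 4),   # float
    ("speed", 4),          # float
    ("frame", 4),          # float
]

def struct_size(fields):
    """Total size of tightly packed fields."""
    return sum(size for _, size in fields)

def padding_to_multiple(size, multiple=16):
    """Bytes needed to round size up to the next multiple."""
    return (-size) % multiple

unpadded = struct_size(BOID_FIELDS)
padding = padding_to_multiple(unpadded)
padded = unpadded + padding
```

    Here padding comes out as 12 bytes, which is exactly one float3.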

    private SkinnedMeshRenderer boidSMR; // References the SkinnedMeshRenderer component that contains the skinned mesh.
    private Animator animator;
    public AnimationClip animationClip;  // Specific animation clip, usually used to calculate animation-related parameters.
    private int numOfFrames;             // The number of frames in the animation, used to determine how many frames of data to store in the GPU buffer.
    public float boidFrameSpeed = 10f;   // Controls the speed at which the animation plays.
    MaterialPropertyBlock props;         // Passes parameters to the shader without creating a new material instance. The material properties of the instance (such as color, lighting coefficient, etc.) can be changed without affecting other objects using the same material.
    Mesh boidMesh;                       // Stores the mesh data baked from the SkinnedMeshRenderer.
    ...

    void Start()
    {
        // First initialize the Boid data here, then call GenerateSkinnedAnimationForGPUBuffer to prepare the animation data, and finally call InitShader to set the Shader parameters required for rendering.
        ...
        // This property block is used only for avoiding an instancing bug.
        props = new MaterialPropertyBlock();
        props.SetFloat("_UniqueID", Random.value);
        ...
        InitBoids();
        GenerateSkinnedAnimationForGPUBuffer();
        InitShader();
    }

    void InitShader()
    {
        // This method configures the Shader and material properties so that the animation is displayed correctly for each instance. Enabling or disabling frameInterpolation determines whether to interpolate between animation frames for smoother animation.
        ...
        if (boidMesh) // Set by GenerateSkinnedAnimationForGPUBuffer
        ...
        shader.SetFloat("boidFrameSpeed", boidFrameSpeed);
        shader.SetInt("numOfFrames", numOfFrames);
        boidMaterial.SetInt("numOfFrames", numOfFrames);
        if (frameInterpolation && !boidMaterial.IsKeywordEnabled("FRAME_INTERPOLATION"))
            boidMaterial.EnableKeyword("FRAME_INTERPOLATION");
        if (!frameInterpolation && boidMaterial.IsKeywordEnabled("FRAME_INTERPOLATION"))
            boidMaterial.DisableKeyword("FRAME_INTERPOLATION");
    }

    void Update()
    {
        ...
        // The last two parameters:
        // 1. 0: Offset into the parameter buffer, used to specify where to start reading parameters.
        // 2. props: The MaterialPropertyBlock created earlier, containing properties shared by all instances.
        Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer, 0, props);
    }

    void OnDestroy()
    {
        ...
        if (vertexAnimationBuffer != null) vertexAnimationBuffer.Release();
    }

    private void GenerateSkinnedAnimationForGPUBuffer()
    {
        ... // Continued
    }

    To provide the Shader with meshes in different poses at different times, the GenerateSkinnedAnimationForGPUBuffer() function extracts the mesh vertex data of each frame from the Animator and SkinnedMeshRenderer, and then stores the data in a ComputeBuffer on the GPU for use in instanced rendering.

    GetCurrentAnimatorStateInfo obtains the state information of the current animation layer, which is used later to control animation playback precisely.

    numOfFrames is set to the power of two closest to the product of the animation length and the frame rate, which can optimize GPU memory access.

    Then create a ComputeBuffer, vertexAnimationBuffer, to store the vertex data of all frames.

    In the for loop, bake all animation frames. Specifically, at each sampleTime the animator is played and updated immediately, and the mesh of the current animation frame is baked into bakedMesh. The vertices of the newly baked mesh are then copied into the array vertexAnimationData, which is finally uploaded to the GPU.

    // ...continued from above
    boidSMR = boidObject.GetComponentInChildren<SkinnedMeshRenderer>();
    boidMesh = boidSMR.sharedMesh;
    animator = boidObject.GetComponentInChildren<Animator>();
    int iLayer = 0;
    AnimatorStateInfo aniStateInfo = animator.GetCurrentAnimatorStateInfo(iLayer);

    Mesh bakedMesh = new Mesh();
    float sampleTime = 0;
    float perFrameTime = 0;

    numOfFrames = Mathf.ClosestPowerOfTwo((int)(animationClip.frameRate * animationClip.length));
    perFrameTime = animationClip.length / numOfFrames;

    var vertexCount = boidSMR.sharedMesh.vertexCount;
    // Stride 16 bytes: one float4 per vertex per frame
    vertexAnimationBuffer = new ComputeBuffer(vertexCount * numOfFrames, 16);
    Vector4[] vertexAnimationData = new Vector4[vertexCount * numOfFrames];
    for (int i = 0; i < numOfFrames; i++)
    {
        animator.Play(aniStateInfo.shortNameHash, iLayer, sampleTime);
        animator.Update(0f);

        boidSMR.BakeMesh(bakedMesh);
        // Fetch the array once per frame: Mesh.vertices returns a copy on every access
        Vector3[] bakedVertices = bakedMesh.vertices;

        for (int j = 0; j < vertexCount; j++)
        {
            Vector4 vertex = bakedVertices[j];
            vertex.w = 1;
            // Layout: all frames of vertex j are stored next to each other
            vertexAnimationData[(j * numOfFrames) + i] = vertex;
        }

        sampleTime += perFrameTime;
    }

    vertexAnimationBuffer.SetData(vertexAnimationData);
    boidMaterial.SetBuffer("vertexAnimation", vertexAnimationBuffer);
    boidObject.SetActive(false);
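
    The buffer layout is easier to see with small numbers, so here is a Python sketch of the bookkeeping only (no Unity involved). closest_power_of_two is my own stand-in for Mathf.ClosestPowerOfTwo; how Unity resolves an exact tie between two powers is not something I rely on here, and the 30 fps, one-second clip is a made-up example.

```python
def closest_power_of_two(n):
    """Nearest power of two to n (n >= 1). Ties round up, which is an assumption."""
    lower = 1 << (n.bit_length() - 1)
    upper = lower << 1
    return lower if (n - lower) < (upper - n) else upper

def flat_index(vertex, frame, num_frames):
    """Position of (vertex, frame) in the flat buffer: frames of one vertex are contiguous."""
    return vertex * num_frames + frame

frame_rate, clip_length = 30.0, 1.0                 # hypothetical clip
num_frames = closest_power_of_two(int(frame_rate * clip_length))
per_frame_time = clip_length / num_frames
```

    With this layout the shader can fetch any pose of a vertex with a single index computation, vertex index times numOfFrames plus the frame.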

    In the Compute Shader, advance the frame variable stored in each individual's data structure.

    boid.frame = boid.frame + velocity * deltaTime * boidFrameSpeed;
    if (boid.frame >= numOfFrames) boid.frame -= numOfFrames;

    In the Shader, lerp between neighboring animation frames. The left side is without frame interpolation, and the right side is with interpolation. The difference is very noticeable.



    void vert(inout appdata_custom v)
    {
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        // v.id is assumed to be the vertex index (SV_VertexID) declared in appdata_custom
        #ifdef FRAME_INTERPOLATION
            v.vertex = lerp(vertexAnimation[v.id * numOfFrames + _CurrentFrame], vertexAnimation[v.id * numOfFrames + _NextFrame], _FrameInterpolation);
        #else
            v.vertex = vertexAnimation[v.id * numOfFrames + _CurrentFrame];
        #endif
        v.vertex = mul(_Matrix, v.vertex);
    #endif
    }

    void setup()
    {
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
        _CurrentFrame = boidsBuffer[unity_InstanceID].frame;
        #ifdef FRAME_INTERPOLATION
            _NextFrame = _CurrentFrame + 1;
            if (_NextFrame >= numOfFrames) _NextFrame = 0;
            _FrameInterpolation = frac(boidsBuffer[unity_InstanceID].frame);
        #endif
    #endif
    }
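
    The frame bookkeeping in setup() boils down to three numbers, sketched here in Python. I am assuming _CurrentFrame is an integer, so that assigning the float frame to it truncates; frame_lookup and lerp are illustrative names.

```python
import math

def frame_lookup(frame, num_frames):
    """Split a fractional frame into (current, next, blend factor)."""
    current = int(frame)              # integer part, assuming an int variable
    nxt = current + 1
    if nxt >= num_frames:
        nxt = 0                       # wrap around so the animation loops
    t = frame - math.floor(frame)     # same as frac() in HLSL
    return current, nxt, t

def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t
```

    For example, a frame value of 3.25 blends 25% of the way from frame 3 to frame 4, and the last frame blends back into frame 0.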

    It was not easy, but it is finally complete.


    Complete project link:

    8. Summary/Quiz

    When rendering points which gives the best answer?


    What are the three key steps in flocking?


    When creating an arguments buffer for DrawMeshInstancedIndirect, how many uints are required?


    We created the wing flapping by using a skinned mesh shader. True or False.


    In a shader used by DrawMeshInstancedIndirect, which variable name gives the correct index for the instance?



    2. Flocks, Herds, and Schools: A Distributed Behavioral Model
  • Compute Shader学习笔记(二)之 后处理效果

    Compute Shader Learning Notes (II) Post-processing Effects



    Get a preliminary understanding of Compute Shader and implement some simple effects. All the codes are in:

    The main branch holds the initial code. You can download the complete project and follow along. PS: I have opened a separate branch for each version of the code.


    This article covers how to use Compute Shader to make:

    • Post-processing effects
    • Particle System

    The previous article did not mention the GPU architecture because I felt that it would be difficult to understand if I explained a bunch of terms right at the beginning. With the experience of actually writing Compute Shader, you can connect the abstract concepts with the actual code.

    From the point of view of program execution, CUDA on the GPU can be described as a three-level hierarchy:

    • Grid – corresponds to a Kernel
    • |-Block – A Grid has multiple Blocks, executing the same program
    • | |-Thread – The most basic computing unit on the GPU

    The thread is the most basic unit of the GPU, and different threads naturally need to exchange information. To support a large number of parallel threads efficiently and to meet their data-exchange needs, memory is organized into multiple levels. From the storage point of view, it can likewise be divided into three levels:

    • Per-Thread memory – Within a Thread, the transmission cycle is one clock cycle (less than 1 nanosecond), which can be hundreds of times faster than global memory.
    • Shared memory – Shared between the threads within a block; much faster than global memory.
    • Global memory – between all threads, but the slowest, usually the bottleneck of the GPU. The Volta architecture uses HBM2 as the global memory of the device, while Turing uses GDDR6.

    If the memory size limit is exceeded, it will be pushed to larger but slower storage space.

    Shared Memory and L1 cache share the same physical space, but they are functionally different: the former needs to be managed manually, while the latter is automatically managed by hardware. My understanding is that Shared Memory is functionally similar to a programmable L1 cache.


    In NVIDIA's CUDA architecture, the Streaming Multiprocessor (SM) is a processing unit on the GPU that is responsible for executing the threads in blocks. Stream processors, also known as "CUDA cores", are the processing elements within an SM, and each stream processor can process multiple threads in parallel. In general:

    • GPU -> Multi-Processors (SMs) -> Stream Processors

    That is, the GPU contains multiple SMs (multiprocessors), each of which contains multiple stream processors. Each stream processor is responsible for executing the computing instructions of one or more threads.

    On the GPU, a thread is the smallest unit that performs calculations. A warp is the basic execution unit in CUDA.

    In NVIDIA's CUDA architecture, each warp usually contains 32 threads (AMD has 64). A block is a thread group that contains multiple threads, and a block can contain multiple warps. A kernel is a function executed on the GPU. You can think of it as a specific piece of code that is executed in parallel by all activated threads. In general:

    • Kernel -> Grid -> Blocks -> Warps -> Threads

    But in daily development, the number of threads that need to be executed is usually far more than 32.

    In order to solve the mismatch between software requirements and hardware architecture, the GPU adopts a strategy: grouping threads belonging to the same block. This grouping is called a "Warp", and each Warp contains a fixed number of threads. When the number of threads that need to be executed exceeds the number that a Warp can contain, the GPU will schedule additional Warps. The principle of doing this is to ensure that no thread is missed, even if it means starting more Warps.

    For example, if a block has 128 threads and my graphics card is from NVIDIA (32 threads per warp), then the block has 128/32 = 4 warps. To give an extreme example, if there are 129 threads, then 5 warps are needed, and 31 thread slots simply sit idle! Therefore, when we write a compute shader, the product a*b*c in [numthreads(a,b,c)] should preferably be a multiple of 32 to avoid wasting CUDA cores.
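
    The warp arithmetic above can be checked with a couple of lines of Python. The function names are mine, and the warp size of 32 is the NVIDIA figure quoted in the text.

```python
def warps_needed(threads, warp_size=32):
    """Number of warps required to cover all threads (rounds up)."""
    return -(-threads // warp_size)

def idle_lanes(threads, warp_size=32):
    """Thread slots that are allocated but have nothing to do."""
    return warps_needed(threads, warp_size) * warp_size - threads

# [numthreads(8, 8, 1)] gives 8 * 8 * 1 = 64 threads per group.
group_threads = 8 * 8 * 1
```

    A group size such as 8x8x1 fills exactly two warps, which is why it is a common choice for image kernels.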

    You must be confused after reading this. I drew a picture based on my personal understanding. Please point out any mistakes.


    L3 post-processing effects

    The current build is based on the BIRP pipeline, and the SRP pipeline only requires a few code changes.

    The key to this chapter is to build an abstract base class to manage the resources required by Compute Shader (Section 1). Then, based on this abstract base class, write some simple post-processing effects, such as Gaussian blur, grayscale effect, low-resolution pixel effect, and night vision effect. A brief summary of the knowledge points in this chapter:

    • Get and process the Camera's rendering texture
    • The ExecuteInEditMode attribute
    • Using SystemInfo.supportsComputeShaders to check whether the system supports compute shaders
    • Using the Graphics.Blit() function (Blit is short for bit block transfer)
    • Using smoothstep() to create various effects
    • Passing data between multiple kernels, and the shared keyword

    1. Introduction and preparation

    Post-processing effects require two textures, one read-only and the other read-write. As for where the textures come from, since it is post-processing, it must be obtained from the camera, that is, the Target Texture on the Camera component.

    • Source: Read-only
    • Destination: Readable and writable, used for final output

    Since a variety of post-processing effects will be implemented later, a base class is abstracted to reduce the workload in the later stage.

    The following features are encapsulated in the base class:

    • Initialize resources (create textures, buffers, etc.)
    • Manage resources (for example, recreate buffers when screen resolution changes, etc.)
    • Hardware check (check whether the current device supports Compute Shader)

    Abstract class complete code link:

    First, when the script instance is activated or attached to a live GO, OnEnable() is called. Write the initialization operations in it. Check whether the hardware supports it, check whether the Compute Shader is bound in the Inspector, get the specified Kernel, get the Camera component of the current GO, create a texture, and set the initialized state to true.

    if (!SystemInfo.supportsComputeShaders) ...
    if (!shader) ...
    kernelHandle = shader.FindKernel(kernelName);
    thisCamera = GetComponent<Camera>();
    if (!thisCamera) ...
    CreateTextures();
    init = true;

    Create two textures CreateTextures(), one Source and one Destination, with the size of the camera resolution.

    texSize.x = thisCamera.pixelWidth;
    texSize.y = thisCamera.pixelHeight;

    if (shader)
    {
        uint x, y;
        shader.GetKernelThreadGroupSizes(kernelHandle, out x, out y, out _);
        // Round up so that the groups cover the whole texture
        groupSize.x = Mathf.CeilToInt((float)texSize.x / (float)x);
        groupSize.y = Mathf.CeilToInt((float)texSize.y / (float)y);
    }

    CreateTexture(ref output);
    CreateTexture(ref renderedSource);

    shader.SetTexture(kernelHandle, "source", renderedSource);
    shader.SetTexture(kernelHandle, "outputrt", output);
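
    The group count is just a rounded-up division. A quick Python check, using a 1920x1080 target and 8x8 thread groups as example numbers of my own choosing:

```python
import math

def group_count(texture_size, threads_per_group):
    """Thread groups needed along one axis so every pixel is covered."""
    return math.ceil(texture_size / threads_per_group)

groups_x = group_count(1920, 8)
groups_y = group_count(1080, 8)
```

    When the size is not an exact multiple, the last group hangs over the edge of the texture, so some threads receive ids outside the image.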

    Creation of specific textures:

    protected void CreateTexture(ref RenderTexture textureToMake, int divide=1)
    {
        textureToMake = new RenderTexture(texSize.x/divide, texSize.y/divide, 0);
        textureToMake.enableRandomWrite = true; // Required so the compute shader can write to it
        textureToMake.Create();
    }

    This completes the initialization. When the camera finishes rendering the scene and is ready to display it on the screen, Unity calls OnRenderImage(), which is where the Compute Shader is invoked. If initialization has not completed or there is no shader, the source is simply blitted to the destination, that is, nothing is done. CheckResolution(out _) checks whether the resolution of the render textures needs to be updated and, if so, regenerates them. After that comes the Dispatch stage: the source image is blitted into the texture the kernel reads, and once the calculation is complete the output is blitted to the destination.

    protected virtual void OnRenderImage(RenderTexture source, RenderTexture destination)
    {
        if (!init || shader == null)
        {
            Graphics.Blit(source, destination);
        }
        else
        {
            CheckResolution(out _);
            DispatchWithSource(ref source, ref destination);
        }
    }

    Note that we don't use any SetData() or GetData() operations here. Because all the data is on the GPU now, we can just instruct the GPU to do it by itself, and the CPU should not get involved. If we fetch the texture back to memory and then pass it to the GPU, the performance will be very poor.

    protected virtual void DispatchWithSource(ref RenderTexture source, ref RenderTexture destination)
    {
        Graphics.Blit(source, renderedSource);
        shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
        Graphics.Blit(output, destination);
    }

    I didn't believe it, so I tried transferring the texture back to the CPU and then back to the GPU. The test results were quite shocking: performance was more than 4 times worse. Therefore, we need to reduce the communication between the CPU and GPU, which is very important when using Compute Shader.

    // Dumb method
    protected virtual void DispatchWithSource(ref RenderTexture source, ref RenderTexture destination)
    {
        // Blit the source texture to the texture for processing
        Graphics.Blit(source, renderedSource);

        // Process the texture using the compute shader
        shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);

        // Copy the output texture into a Texture2D object so we can read the data to the CPU
        Texture2D tempTexture = new Texture2D(renderedSource.width, renderedSource.height, TextureFormat.RGBA32, false);
        RenderTexture.active = output; // ReadPixels reads from the active render texture
        tempTexture.ReadPixels(new Rect(0, 0, output.width, output.height), 0, 0);
        tempTexture.Apply();
        RenderTexture.active = null;

        // Pass the Texture2D data back to the GPU to a new RenderTexture
        RenderTexture tempRenderTexture = RenderTexture.GetTemporary(output.width, output.height);
        Graphics.Blit(tempTexture, tempRenderTexture);

        // Finally blit the processed texture to the target texture
        Graphics.Blit(tempRenderTexture, destination);

        // Clean up resources
        RenderTexture.ReleaseTemporary(tempRenderTexture);
        Destroy(tempTexture);
    }

    Next, we will start writing our first post-processing effect.

    Interlude: Strange BUG

    Also insert a strange bug.

    In Compute Shader, if the final output map result is named output, there will be problems in some APIs such as Metal. The solution is to change the name.

    RWTexture2D<float4> outputrt;


    2. RingHighlight effect


    Create the RingHighlight class, inheriting from the base class just written.


    Override the initialization method and specify the kernel.

    protected override void Init()
    {
        center = new Vector4();
        kernelName = "Highlight";
        base.Init();
    }

    Override the rendering method. To focus on a certain character, the character's screen-space coordinates, center, must be passed to the Compute Shader. And if the screen resolution has changed before Dispatch, the properties are set again.

    protected void SetProperties()
    {
        float rad = (radius / 100.0f) * texSize.y;
        shader.SetFloat("radius", rad);
        shader.SetFloat("edgeWidth", rad * softenEdge / 100.0f);
        shader.SetFloat("shade", shade);
    }

    protected override void OnRenderImage(RenderTexture source, RenderTexture destination)
    {
        if (!init || shader == null)
        {
            Graphics.Blit(source, destination);
        }
        else
        {
            if (trackedObject && thisCamera)
            {
                Vector3 pos = thisCamera.WorldToScreenPoint(trackedObject.position);
                center.x = pos.x;
                center.y = pos.y;
                shader.SetVector("center", center);
            }
            bool resChange = false;
            CheckResolution(out resChange);
            if (resChange) SetProperties();
            DispatchWithSource(ref source, ref destination);
        }
    }

    To see parameter changes take effect in real time while editing values in the Inspector panel, add the OnValidate() method.

    private void OnValidate()
    {
        if (!init) Init();
        SetProperties();
    }

    On the GPU, how do we make a circle that is unshaded inside, has a smooth transition at its edge, and is shaded outside the transition band? Building on the test from the previous article for whether a point lies inside a circle, we can use smoothstep() to handle the transition band.

    #pragma kernel Highlight

    Texture2D<float4> source;
    RWTexture2D<float4> outputrt;
    float radius;
    float edgeWidth;
    float shade;
    float4 center;

    // Returns 1 inside the circle, 0 outside, with a smooth falloff across edgeWidth
    float inCircle( float2 pt, float2 center, float radius, float edgeWidth ){
        float len = length(pt - center);
        return 1.0 - smoothstep(radius-edgeWidth, radius, len);
    }

    [numthreads(8, 8, 1)]
    void Highlight(uint3 id : SV_DispatchThreadID)
    {
        float4 srcColor = source[id.xy];
        float4 shadedSrcColor = srcColor * shade;
        float highlight = inCircle( (float2)id.xy, center.xy, radius, edgeWidth);
        float4 color = lerp( shadedSrcColor, srcColor, highlight );
        outputrt[id.xy] = color;
    }


    Current version code:

    • Compute Shader:
    • CPU:

    3. Blur effect


    The principle of blur effect is very simple. The final effect can be obtained by taking the weighted average of the n*n pixels around each pixel sample.

    But there is an efficiency problem. As we all know, reducing the number of texture sampling is very important for optimization. If each pixel needs to sample 20*20 surrounding pixels, then rendering one pixel requires 400 samplings, which is obviously unacceptable. Moreover, for a single pixel, the operation of sampling a whole rectangular pixel around it is difficult to handle in the Compute Shader. How to solve it?

    The usual practice is to sample once horizontally and once vertically. What does this mean? For each pixel, only 20 pixels are sampled in the x direction and 20 pixels in the y direction, a total of 20+20 pixels are sampled, and then weighted average is taken. This method not only reduces the number of samples, but also conforms to the logic of Compute Shader. For horizontal sampling, set a kernel; for vertical sampling, set another kernel.

    #pragma kernel HorzPass
    #pragma kernel Highlight

    Since Dispatch is executed sequentially, after we calculate the horizontal blur, we use the calculated result to sample vertically again.

    shader.Dispatch(kernelHorzPassID, groupSize.x, groupSize.y, 1);
    shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);

    After completing the blur operation, combine it with the RingHighlight in the previous section, and you’re done!

    One difference is, after calculating the horizontal blur, how do we pass the result to the next kernel? The answer is obvious: just use the shared keyword. The specific steps are as follows.

    Declare a reference to the horizontal blurred texture in the CPU, create a kernel for the horizontal texture, and bind it.

    RenderTexture horzOutput = null;
    int kernelHorzPassID;

    protected override void Init()
    {
        ...
        kernelHorzPassID = shader.FindKernel("HorzPass");
        ...
    }

    Additional space needs to be allocated on the GPU to store the result of the first kernel.

    protected override void CreateTextures()
    {
        base.CreateTextures();
        shader.SetTexture(kernelHorzPassID, "source", renderedSource);
        CreateTexture(ref horzOutput);
        shader.SetTexture(kernelHorzPassID, "horzOutput", horzOutput);
        shader.SetTexture(kernelHandle, "horzOutput", horzOutput);
    }

    On the GPU side, the resources are declared like this:

    shared Texture2D<float4> source;
    shared RWTexture2D<float4> horzOutput;
    RWTexture2D<float4> outputrt;

    Another question: it seems to make no difference whether the shared keyword is present. In actual testing, different kernels can access the texture either way. So what is the point of shared?

    In HLSL, shared is documented as a hint that marks a variable for sharing between effects. In a Unity compute shader, what actually lets two kernels see the same resource is binding the same texture or buffer to each kernel from C# with SetTexture or SetBuffer, as done above. That matches the test result that the code works with or without the keyword, so treat shared here as an annotation of intent rather than a requirement.


    When calculating the pixels at the border, there may be a situation where the number of available pixels is insufficient. Either the remaining pixels on the left are insufficient for blurRadius, or the remaining pixels on the right are insufficient. Therefore, first calculate the safe left index, and then calculate the maximum number that can be taken from left to right.

    [numthreads(8, 8, 1)]
    void HorzPass(uint3 id : SV_DispatchThreadID)
    {
        int left = max(0, (int)id.x-blurRadius);
        int count = min(blurRadius, (int)id.x) + min(blurRadius, source.Length.x - (int)id.x);
        float4 color = 0;
        uint2 index = uint2((uint)left, id.y);
        [unroll(100)]
        for(int x=0; x
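    The border rule described above can be modelled for a single row on the CPU. This Python sketch reuses the `left` and `count` expressions from the kernel; because the loop body is cut off in the listing, it assumes that the kernel sums `count` texels starting at `left` and divides by `count`. The function name `horz_pass_row` is hypothetical.

```python
def horz_pass_row(row, blur_radius):
    """Model one row of the horizontal pass using the left/count rule from the text."""
    width = len(row)
    out = []
    for x in range(width):
        left = max(0, x - blur_radius)
        count = min(blur_radius, x) + min(blur_radius, width - x)
        # Assumption: average `count` texels starting at index `left`.
        window = row[left:left + count]
        out.append(sum(window) / count)
    return out

flat = horz_pass_row([1.0] * 10, 3)
spike = horz_pass_row([0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0], 2)
```

    Since `left + count` never exceeds the row width, the window always stays inside the texture, which is the point of computing the two values up front.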


    4. Gaussian Blur

    The difference from the above is that the samples are no longer averaged uniformly; each sample is weighted by a Gaussian function:

    $$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$$

    where $\sigma$ is the standard deviation, which controls the width of the curve.


    Evaluating this formula for every sample of every pixel would be expensive, so the weights are precomputed on the CPU and passed to the GPU through a Buffer. Since both kernels need them, the Buffer is declared with shared.

    float[] SetWeightsArray(int radius, float sigma)
    {
        int total = radius * 2 + 1;
        float[] weights = new float[total];
        float sum = 0.0f;
        for (int n=0; n
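    Because the C# listing is cut off, here is a hedged Python sketch of what such a weight table could look like. It evaluates the standard 1D Gaussian at offsets from -radius to +radius and, as an assumption suggested by the `sum` variable, normalizes the weights so they add up to 1. The name `gaussian_weights` is illustrative only.

```python
import math

def gaussian_weights(radius, sigma):
    """Precompute 2*radius+1 Gaussian weights, normalised to sum to 1 (assumed)."""
    total = radius * 2 + 1
    weights = []
    for n in range(total):
        x = n - radius  # signed offset from the centre texel
        g = math.exp(-(x * x) / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)
        weights.append(g)
    s = sum(weights)
    return [w / s for w in weights]

weights = gaussian_weights(3, 1.5)
```

    Normalizing keeps the overall brightness of the image unchanged, since a flat region multiplied by weights that sum to 1 stays flat.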



    5. Low-resolution effects

    GPU: It’s really a refreshing computing experience.


    Give a high-definition texture a blocky, low-resolution look without changing its actual resolution. The implementation is simple: for every n*n block of pixels, only the color of the pixel in the lower-left corner is used. Thanks to integer division, dividing the id.xy index by n and then multiplying by n snaps every pixel to the lower-left corner of its block.

    uint2 index = (uint2(id.x, id.y)/3) * 3;
    float3 srcColor = source[index].rgb;
    float3 finalColor = srcColor;
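    The integer-division trick is easy to verify outside the shader. A small Python sketch (the function name `block_origin` is made up) that maps a pixel coordinate to the lower-left corner of its n*n block:

```python
def block_origin(x, y, n):
    """Snap (x, y) to the lower-left corner of its n x n block via integer division."""
    return (x // n) * n, (y // n) * n

corners = {block_origin(x, y, 3) for x in range(6) for y in range(6)}
```

    All nine pixels of a 3*3 block map to the same corner, so a 6*6 region collapses to four distinct sample positions.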

    The effect is already there, but it is too harsh, so add noise to soften the jagged edges.

    uint2 index = (uint2(id.x, id.y)/3) * 3;
    float noise = random(id.xy, time);
    float3 srcColor = lerp(source[id.xy].rgb, source[index].rgb, noise);
    float3 finalColor = srcColor;

    Each pixel in an n*n block no longer takes the lower-left color directly, but a random interpolation between its original color and the lower-left color. The result looks much more refined. When n is relatively large, you can also get the following effect. It is not exactly pretty, but it could be worth exploring for glitch-style looks.


    If you want a noisier picture, you can try scaling the endpoints of the lerp, for example:

    float3 srcColor = lerp(source[id.xy].rgb * 2, source[index].rgb, noise);

    6. Grayscale Effect and Tinting

    Converting a color image to grayscale means converting the RGB value of each pixel into a single value, computed as a weighted average of the three channels. There are two methods here: a simple average, and a weighted average that matches human perception.

    1. Average method (simple but inaccurate):

    $$Gray = \frac{R + G + B}{3}$$

    This method gives equal weight to all color channels.

    2. Weighted average method (more accurate, reflects human eye perception):

    $$Gray = 0.299R + 0.587G + 0.114B$$

    This method gives different weights to different color channels, based on the fact that the human eye is most sensitive to green, less sensitive to red, and least sensitive to blue. (In the screenshot below the difference is subtle; honestly, I can't tell them apart.)


    After computing the gray value, multiply it by the tint color, then lerp between the source color and the tinted color so that the tint strength is controllable.

    uint2 index = (uint2(id.x, id.y)/6) * 6;
    float noise = random(id.xy, time);
    float3 srcColor = lerp(source[id.xy].rgb, source[index].rgb, noise);
    // float3 finalColor = srcColor;
    float3 grayScale = (srcColor.r+srcColor.g+srcColor.b)/3.0;
    // float3 grayScale = srcColor.r*0.299f+srcColor.g*0.587f+srcColor.b*0.114f;
    float3 tinted = grayScale * tintColor.rgb;
    float3 finalColor = lerp(srcColor, tinted, tintStrength);
    outputrt[id.xy] = float4(finalColor, 1);
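    The two grayscale variants and the tint step can be compared numerically. This is a CPU-side Python sketch of the same arithmetic, with made-up function names (`gray_average`, `gray_luma`, `tint`); colors are plain tuples of floats in [0,1].

```python
def lerp(a, b, t):
    return a + (b - a) * t

def gray_average(rgb):
    """Equal-weight average of the three channels."""
    return (rgb[0] + rgb[1] + rgb[2]) / 3.0

def gray_luma(rgb):
    """Perceptual weights, matching the commented-out line in the shader."""
    return 0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2]

def tint(rgb, tint_color, strength, gray_fn=gray_luma):
    """Multiply the gray value by the tint colour, then blend with the source."""
    g = gray_fn(rgb)
    tinted = [g * c for c in tint_color]
    return [lerp(s, t, strength) for s, t in zip(rgb, tinted)]

pure_green = (0.0, 1.0, 0.0)
untouched = tint((0.2, 0.4, 0.6), (1.0, 0.5, 0.0), 0.0)
```

    Pure green shows the difference between the methods most clearly: the average gives about 0.33, while the weighted version gives 0.587.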

    Dye a wasteland color:


    7. Screen scan line effect

    First, uvY normalizes the y coordinate to [0,1].

    lines is a parameter that controls the number of scan lines.

    Then add a time offset; its coefficient controls the scrolling speed. You can expose a parameter to control that speed.

    float uvY = (float)id.y/(float)source.Length.y;
    float scanline = saturate(frac(uvY * lines + time * 3));

    These lines don't look much like lines yet, so make them thinner.

    float uvY = (float)id.y/(float)source.Length.y;
    float scanline = saturate(smoothstep(0.1,0.2,frac(uvY * lines + time * 3)));

    Then lerp the colors.

    float uvY = (float)id.y/(float)source.Length.y;
    float scanline = saturate(smoothstep(0.1, 0.2, frac(uvY * lines + time*3)) + 0.3);
    finalColor = lerp(source[id.xy].rgb*0.5, finalColor, scanline);

    Here are the results before and after thinning the lines; pick whichever suits you!


    8. Night Vision Effect

    This section combines everything above to build a night vision effect. First, make a single-eye (monocular) effect.

    float2 pt = (float2)id.xy;
    float2 center = (float2)(source.Length >> 1);
    float inVision = inCircle(pt, center, radius, edgeWidth);
    float3 blackColor = float3(0,0,0);
    finalColor = lerp(blackColor, finalColor, inVision);

    The difference between the binocular effect and the monocular effect is that there are two circle centers. The two resulting masks can be merged using max() or saturate().

    float2 pt = (float2)id.xy;
    float2 centerLeft = float2(source.Length.x / 3.0, source.Length.y / 2);
    float2 centerRight = float2(source.Length.x / 3.0 * 2.0, source.Length.y / 2);
    float inVisionLeft = inCircle(pt, centerLeft, radius, edgeWidth);
    float inVisionRight = inCircle(pt, centerRight, radius, edgeWidth);
    float3 blackColor = float3(0,0,0);
    // float inVision = max(inVisionLeft, inVisionRight);
    float inVision = saturate(inVisionLeft + inVisionRight);
    finalColor = lerp(blackColor, finalColor, inVision);


    9. Smooth transition lines

    Think about how we should draw a smooth straight line on the screen.


    The smoothstep() function can do this. Readers familiar with it can skip this section. The function creates a smooth gradient: smoothstep(edge0, edge1, x) outputs a value rising from 0 to 1 as x moves from edge0 to edge1. If x < edge0, it returns 0; if x > edge1, it returns 1. In between, the output is computed with Hermite interpolation:

    $$t = \mathrm{clamp}\left(\frac{x - edge_0}{edge_1 - edge_0},\ 0,\ 1\right), \qquad \mathrm{smoothstep}(edge_0, edge_1, x) = t^2 (3 - 2t)$$
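    For readers who want to experiment with the curve outside a shader, here is a small Python version of the same definition (clamp the normalized position, then apply the cubic 3t² - 2t³). It is a sketch of the standard formula, not Unity's implementation.

```python
def smoothstep(edge0, edge1, x):
    """Hermite interpolation between 0 and 1 as x goes from edge0 to edge1."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

samples = [smoothstep(0.0, 1.0, i / 4.0) for i in range(5)]
```

    The sampled values rise monotonically from 0 to 1 and pass through 0.5 at the midpoint, with a flat slope at both ends.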

    float onLine(float position, float center, float lineWidth, float edgeWidth)
    {
        float halfWidth = lineWidth / 2.0;
        float edge0 = center - halfWidth - edgeWidth;
        float edge1 = center - halfWidth;
        float edge2 = center + halfWidth;
        float edge3 = center + halfWidth + edgeWidth;
        return smoothstep(edge0, edge1, position) - smoothstep(edge2, edge3, position);
    }

    In the above code, all parameters have been normalized to [0,1]. position is the coordinate of the point being tested, center is the center of the line, lineWidth is the width of the solid part of the line, and edgeWidth is the width of the soft edge on each side, used for the smooth transition. The first smoothstep ramps from 0 to 1 across the leading edge (edge0 to edge1) and the second ramps from 0 to 1 across the trailing edge (edge2 to edge3), so their difference is 1 inside the line, 0 far away from it, and a smooth falloff in between.
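    The difference-of-two-smoothsteps idea can be probed with a few sample positions. The Python sketch below mirrors the onLine function; `on_line` and the chosen widths are illustrative values only.

```python
def smoothstep(edge0, edge1, x):
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def on_line(position, center, line_width, edge_width):
    """1 inside the line, 0 outside, smooth ramp across each soft edge."""
    half_width = line_width / 2.0
    edge0 = center - half_width - edge_width
    edge1 = center - half_width
    edge2 = center + half_width
    edge3 = center + half_width + edge_width
    return smoothstep(edge0, edge1, position) - smoothstep(edge2, edge3, position)

inside = on_line(0.5, 0.5, 0.1, 0.05)      # centre of the line
far_left = on_line(0.0, 0.5, 0.1, 0.05)    # both smoothsteps return 0
far_right = on_line(1.0, 0.5, 0.1, 0.05)   # both smoothsteps return 1
mid_edge = on_line(0.425, 0.5, 0.1, 0.05)  # halfway across the leading edge
```

    Far to the right both terms are 1 and cancel out, which is why the subtraction produces a band rather than a step.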


    Think about how to draw a circle with a smooth transition.

    For each point, first compute the offset vector from the center of the circle and store it in position, then compute its length and store it in len.

    Following the same difference-of-two-smoothsteps approach as above, subtracting the outer-edge interpolation from the inner-edge interpolation produces a ring.

    float circle(float2 position, float2 center, float radius, float lineWidth, float edgeWidth)
    {
        position -= center;
        float len = length(position);
        float result = smoothstep(radius - lineWidth / 2.0 - edgeWidth, radius - lineWidth / 2.0, len)
                     - smoothstep(radius + lineWidth / 2.0, radius + lineWidth / 2.0 + edgeWidth, len);
        return result;
    }

    10. Scanline Effect

    Then add a horizontal line, a vertical line, and a few circles to create a radar scanning effect.

    float3 color = float3(0.0f,0.0f,0.0f);
    color += onLine(uv.y, center.y, 0.002, 0.001) * axisColor.rgb; //xAxis
    color += onLine(uv.x, center.x, 0.002, 0.001) * axisColor.rgb; //yAxis
    color += circle(uv, center, 0.2f, 0.002, 0.001) * axisColor.rgb;
    color += circle(uv, center, 0.3f, 0.002, 0.001) * axisColor.rgb;
    color += circle(uv, center, 0.4f, 0.002, 0.001) * axisColor.rgb;

    Then draw a sweeping scan line with a fading trail.

    float sweep(float2 position, float2 center, float radius, float lineWidth, float edgeWidth)
    {
        float2 direction = position - center;
        float theta = time + 6.3;
        float2 circlePoint = float2(cos(theta), -sin(theta)) * radius;
        float projection = clamp(dot(direction, circlePoint) / dot(circlePoint, circlePoint), 0.0, 1.0);
        float lineDistance = length(direction - circlePoint * projection);
        float gradient = 0.0;
        const float maxGradientAngle = PI * 0.5;
        if (length(direction) < radius) {
            float angle = fmod(theta + atan2(direction.y, direction.x), PI2);
            gradient = clamp(maxGradientAngle - angle, 0.0, maxGradientAngle) / maxGradientAngle * 0.5;
        }
        return gradient + 1.0 - smoothstep(lineWidth, lineWidth + edgeWidth, lineDistance);
    }
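    The core of sweep is the projection `dot(direction, circlePoint) / dot(circlePoint, circlePoint)`, clamped to [0,1], which finds the closest point on the segment from the center to the rim. The Python sketch below isolates just that step; the function name `distance_to_segment` is hypothetical, and the segment is assumed to start at the origin.

```python
import math

def distance_to_segment(point, end):
    """Distance from `point` to the segment running from the origin to `end`."""
    dot_pe = point[0] * end[0] + point[1] * end[1]
    dot_ee = end[0] * end[0] + end[1] * end[1]
    projection = min(max(dot_pe / dot_ee, 0.0), 1.0)  # clamp so we stay on the segment
    closest = (end[0] * projection, end[1] * projection)
    return math.hypot(point[0] - closest[0], point[1] - closest[1])

above_middle = distance_to_segment((1.0, 1.0), (2.0, 0.0))
past_the_end = distance_to_segment((3.0, 0.0), (2.0, 0.0))
behind_start = distance_to_segment((-1.0, 0.0), (2.0, 0.0))
```

    Without the clamp the distance would be measured to the infinite line, and the sweep would extend through the center to the opposite side of the radar.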

    Add to the color.

    ... color += sweep(uv, center, 0.45f, 0.003, 0.001) * sweepColor.rgb; ...


    11. Gradient background shadow effect

    This effect can be used in subtitles or some explanatory text. Although you can directly add a texture to the UI Canvas, using Compute Shader can achieve more flexible effects and resource optimization.


    The background of subtitles and dialogue text is usually at the bottom of the screen, and the top is left unprocessed. Higher contrast is also required, so the original picture is converted to gray and darkened by a shade factor.

    if (id.y<(uint)tintHeight){
        float3 grayScale = (srcColor.r + srcColor.g + srcColor.b) * 0.33 * tintColor.rgb;
        float3 shaded = lerp(srcColor.rgb, grayScale, tintStrength) * shade;
        ... // Continued below
    }else{
        color = srcColor;
    }

    Gradient effect.

    ... // Continued from above
    float srcAmount = smoothstep(tintHeight-edgeWidth, (float)tintHeight, (float)id.y);
    ... // Continued below

    Finally, lerp the two together.

    ... // Continued from above
    color = lerp(float4(shaded, 1), srcColor, srcAmount);

    12. Summary/Quiz

    If id.xy = [100, 30], what would be the return value of inCircle((float2)id.xy, float2(130, 40), 40, 0.1)?


    When creating a blur effect, which answer best describes our approach?


    Which answer would create a blocky low-resolution version of the source image?


    What is smoothstep(5, 10, 6)?


    If a and b are both vectors, which answer best describes dot(a,b)/dot(b,b)?


    What is _MainTex_TexelSize.x if _MainTex has a resolution of 512 x 256 pixels?


    13. Use Blit and Material for post-processing

    In addition to using a Compute Shader for post-processing, there is another simple method.

    // .cs
    Graphics.Blit(source, dest, material, passIndex);

    // .shader
    Pass{
        CGPROGRAM
        #pragma vertex vert_img
        #pragma fragment frag

        fixed4 frag(v2f_img input) : SV_Target{
            return tex2D(_MainTex, input.uv);
        }
        ENDCG
    }

    The image data is processed by the Shader attached to the material.

    So the questions are: what is the difference between the two approaches? And since the input is a texture, where do the vertices come from?


    The first question. This method is called "screen space shading" and is fully integrated into Unity's graphics pipeline. For simple full-screen effects it often performs at least as well as a Compute Shader, although this depends on the effect and the hardware. A Compute Shader, in turn, provides finer-grained control over GPU resources: it is not restricted by the graphics pipeline and can directly access and modify resources such as textures and buffers.

    The second question. Pay attention to vert_img. In UnityCG, you can find the following definition:


    Unity automatically draws a rectangle made of two triangles that fills the screen and samples the incoming texture across it. When writing post-processing with the material method, we can therefore put the logic directly in the frag function.

    In the next chapter, you will learn how to connect Material, Shader, Compute Shader and C#.

  • Compute Shader学习笔记(一)之 入门

    Compute Shader Learning Notes (I) Getting Started

    Tags: Getting Started/Shader/Compute Shader/GPU Optimization



    Compute Shader is relatively complex: mastering it requires a certain amount of programming knowledge, graphics knowledge, and knowledge of GPU hardware. The study notes are divided into the following parts:

    • Get to know Compute Shader and implement some simple effects
    • Draw circles, planet orbits, noise maps, manipulate Meshes, and more
    • Post-processing, particle system
    • Physical simulation, drawing grass
    • Fluid simulation

    The main references are as follows:

    • notes-a preliminary exploration of compute-shader-9efeebd579c1
    • lygyue:Compute Shader(Very interesting)
    • (too old and outdated)
    • Wang Jiangrong: [Unity] Basic Introduction and Usage of Compute Shader
    • …To be continued

    L1 Introduction to Compute Shader

    1. Introduction to Compute Shader

    Simply put, you can use a Compute Shader to compute a texture and then display it through a Renderer. Note that a Compute Shader can do far more than this.


    You can copy the following two code listings and test them.

    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;

    public class AssignTexture : MonoBehaviour
    {
        // ComputeShader is used to perform computing tasks on the GPU
        public ComputeShader shader;
        // Texture resolution
        public int texResolution = 256;
        // Renderer component
        private Renderer rend;
        // Render texture
        private RenderTexture outputTexture;
        // Compute shader kernel handle
        private int kernelHandle;

        // Start is called once when the script is started
        void Start()
        {
            // Create a new render texture, specifying width, height, and depth buffer bits (0 here)
            outputTexture = new RenderTexture(texResolution, texResolution, 0);
            // Allow random write
            outputTexture.enableRandomWrite = true;
            // Create a render texture instance
            outputTexture.Create();
            // Get the renderer component of the current object
            rend = GetComponent<Renderer>();
            // Enable the renderer
            rend.enabled = true;
            // Bind the resources and dispatch the compute shader once
            InitShader();
        }

        private void InitShader()
        {
            // Find the handle of the compute shader kernel "CSMain"
            kernelHandle = shader.FindKernel("CSMain");
            // Set up the texture used in the compute shader
            shader.SetTexture(kernelHandle, "Result", outputTexture);
            // Set the render texture as the material's main texture
            rend.material.SetTexture("_MainTex", outputTexture);
            // Schedule the execution of the compute shader, passing in the number of thread groups.
            // The kernel declares 8x8 threads per group, so texResolution / 16 groups per axis
            // cover only half of x and half of y: only 1/4 of the texture is rendered.
            DispatchShader(texResolution / 16, texResolution / 16);
        }

        private void DispatchShader(int x, int y)
        {
            // Schedule the execution of the compute shader
            // x and y are the numbers of thread groups; 1 is the number of groups in the z direction
            shader.Dispatch(kernelHandle, x, y, 1);
        }

        void Update()
        {
            // Check every frame whether the U key has been released
            if (Input.GetKeyUp(KeyCode.U))
            {
                // If so, dispatch again with enough groups to cover the whole texture
                DispatchShader(texResolution / 8, texResolution / 8);
            }
        }
    }

    Unity's default Compute Shader:

    // Each #kernel tells which function to compile; you can have many kernels
    #pragma kernel CSMain

    // Create a RenderTexture with enableRandomWrite flag and set it
    // with cs.SetTexture
    RWTexture2D<float4> Result;

    [numthreads(8,8,1)]
    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        // TODO: insert actual code here!
        Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
    }

    In this example, a fractal pattern known as the Sierpinski triangle is drawn in the lower-left quarter. The pattern itself is not important; Unity simply uses it as the default template code.

    Let's go through the Compute Shader code in detail. For the C# code, refer to the comments.

    #pragma kernel CSMain declares the kernel, the entry point of the Compute Shader. You can rename CSMain freely, as long as the function name and the name used in FindKernel match.

    RWTexture2D<float4> Result declares a readable and writable 2D texture. R stands for Read and W stands for Write.

    Focus on this line of code:

    [numthreads(8,8,1)]

    In the Compute Shader file, this line specifies the size of a thread group. For example, this 8 * 8 * 1 thread group contains 64 threads in total. Each thread computes one pixel of the RWTexture.

    In the C# file above, we use shader.Dispatch to specify the number of thread groups.


    Next, a question: if the thread group size is 8*8*1, how many thread groups do we need to render a RWTexture of size res*res?

    The answer is res/8 groups along each of x and y. However, our code currently dispatches only res/16 groups per axis, so only the quarter of the texture in the lower-left corner is rendered.
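    The arithmetic behind this answer can be written down as a tiny Python sketch. The helper names are made up; `groups_needed` rounds up so that resolutions that are not a multiple of the group size are still covered, which is an addition to what the text states.

```python
def groups_needed(res, threads_per_group):
    """Thread groups required along one axis (rounded up)."""
    return (res + threads_per_group - 1) // threads_per_group

def covered_fraction(res, groups_per_axis, threads_per_group):
    """Fraction of a res x res texture written by a square dispatch."""
    covered = min(res, groups_per_axis * threads_per_group)
    return (covered * covered) / (res * res)

full = groups_needed(256, 8)                   # groups per axis for full coverage
quarter = covered_fraction(256, 256 // 16, 8)  # what the Start() dispatch covers
whole = covered_fraction(256, 256 // 8, 8)     # what pressing U covers
```

    With res = 256 and 8 threads per axis, 32 groups per axis are needed; dispatching 16 per axis covers half of each axis and therefore a quarter of the area.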

    In addition, the parameter passed into the entry function is worth mentioning: uint3 id : SV_DispatchThreadID. This id is the unique identifier of the current thread within the whole dispatch.
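    How that id relates to the group layout can be sketched in Python. In HLSL, SV_DispatchThreadID equals the group id multiplied by the numthreads size plus the thread's index inside its group; the function name below is illustrative.

```python
def dispatch_thread_id(group_id, group_thread_id, numthreads):
    """SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID, per axis."""
    return tuple(g * n + t for g, t, n in zip(group_id, group_thread_id, numthreads))

# Thread (3, 5, 0) inside group (2, 1, 0) of an 8x8x1 kernel:
example = dispatch_thread_id((2, 1, 0), (3, 5, 0), (8, 8, 1))

# Every thread of a 2x2 dispatch of 8x8x1 groups gets a distinct id.
all_ids = {
    dispatch_thread_id((gx, gy, 0), (tx, ty, 0), (8, 8, 1))
    for gx in range(2) for gy in range(2)
    for tx in range(8) for ty in range(8)
}
```

    Because the ids are unique and contiguous, `id.xy` can be used directly as a pixel coordinate.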

    2. Quarter pattern

    Before you learn to walk, you must first learn to crawl. First, specify the task (Kernel) to be performed in C#.


    So far the kernel name is hard-coded. Now we expose a parameter so that different rendering tasks can be selected.

    public string kernelName = "CSMain";
    ...
    kernelHandle = shader.FindKernel(kernelName);

    In this way, you can change it at will in the Inspector.


    However, setting the table is not enough; we still need to serve the dishes, and the dishes are cooked in the Compute Shader.

    Let's put a few dishes on the menu first.

    #pragma kernel CSMain   // The kernel we declared earlier
    #pragma kernel SolidRed // Define a new dish and write it below
    ... // You can declare many more

    [numthreads(8,8,1)]
    void CSMain (uint3 id : SV_DispatchThreadID){ ... }

    [numthreads(8,8,1)]
    void SolidRed (uint3 id : SV_DispatchThreadID){
        Result[id.xy] = float4(1,0,0,0);
    }

    You can enable different kernels by entering the corresponding name in the Inspector.


    What if I want to pass data to the Compute Shader? For example, pass the resolution of the texture to the Compute Shader.

    shader.SetInt("texResolution", texResolution);

    The variable must also be declared in the Compute Shader.

    int texResolution;

    Think about a question: how would you achieve the following effect?

    [numthreads(8,8,1)]
    void SplitScreen (uint3 id : SV_DispatchThreadID)
    {
        int halfRes = texResolution >> 1;
        Result[id.xy] = float4(step(halfRes, id.x),step(halfRes, id.y),0,1);
    }

    To explain, the step function is in effect:

    step(edge, x){
        return x>=edge ? 1 : 0;
    }

    texResolution >> 1 shifts the bits of the value one position to the right, which for a non-negative integer is equivalent to dividing by 2.

    The output color depends only on the current thread id.

    Threads in the lower-left quadrant always output black, because both step calls return 0.

    For threads in the lower-right quadrant, id.x >= halfRes, so the red channel is 1.
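    The four quadrant colors follow directly from the two step calls. A CPU-side Python sketch of the same expression (function names mirror the shader but are otherwise illustrative):

```python
def step(edge, x):
    return 1.0 if x >= edge else 0.0

def split_screen(x, y, res):
    """Colour written by thread (x, y) for a res x res texture, as (r, g, b, a)."""
    half_res = res >> 1  # same as res // 2 for non-negative integers
    return (step(half_res, x), step(half_res, y), 0.0, 1.0)

lower_left = split_screen(0, 0, 256)       # black
lower_right = split_screen(200, 0, 256)    # red
upper_left = split_screen(0, 200, 256)     # green
upper_right = split_screen(200, 200, 256)  # red + green = yellow
```

    Red depends only on x and green only on y, so the texture splits into black, red, green and yellow quadrants.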

    If you are not convinced, work through a few values by hand; it helps in understanding the relationship between thread ids and thread groups.


    3. Draw a circle

    The principle sounds simple. It checks whether (id.x, id.y) is inside the circle. If yes, it outputs 1. Otherwise, it outputs 0. Let's try it.

    float inCircle( float2 pt, float radius ){
        return ( length(pt)<radius ) ? 1.0 : 0.0;
    }

    [numthreads(8,8,1)]
    void Circle (uint3 id : SV_DispatchThreadID)
    {
        int halfRes = texResolution >> 1;
        float isInside = inCircle((float2)((int2)id.xy-halfRes), (float)(halfRes>>1));
        Result[id.xy] = float4(0.0,isInside ,0,1);
    }


    4. Summary/Quiz

    If the output is a RWTexture with a side length of 256, which answer will produce a completely red texture?

    RWTexture2D<float4> output;

    void CSMain (uint3 id : SV_DispatchThreadID)
    {
         output[id.xy] = float4(1.0, 0.0, 0.0, 1.0);
    }


    Which answer will give red on the left side of the texture output and yellow on the right side?


    L2 has begun

    1. Passing values to the GPU


    Without further ado, let's draw a circle. Here are two initial codes.



    The general structure is the same as above. You can see that a drawCircle function is called to draw a circle.

    [numthreads(1,1,1)]
    void Circles (uint3 id : SV_DispatchThreadID)
    {
        int2 centre = (texResolution >> 1);
        int radius = 80;
        drawCircle( centre, radius );
    }

    The circle drawing method used here is a classic rasterization algorithm. The general idea is to exploit the symmetry of the circle: compute the points of one arc and mirror them to generate the rest.
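    The drawCircle function itself is not shown in this section. As an illustration of the symmetry idea, here is a Python sketch of the well-known midpoint circle algorithm, which computes one octant and mirrors it eight ways. This is an assumption about the kind of algorithm meant, not a transcription of the article's drawCircle; `circle_points` is a made-up name.

```python
def circle_points(cx, cy, radius):
    """Integer points on a circle outline via the midpoint algorithm (one octant, mirrored)."""
    points = set()
    x, y = radius, 0
    d = 1 - radius  # decision variable
    while x >= y:
        # Mirror the octant point (x, y) into all eight octants.
        for px, py in ((x, y), (y, x), (-x, y), (-y, x),
                       (x, -y), (y, -x), (-x, -y), (-y, -x)):
            points.add((cx + px, cy + py))
        y += 1
        if d < 0:
            d += 2 * y + 1
        else:
            x -= 1
            d += 2 * (y - x) + 1
    return points

pts = circle_points(0, 0, 3)
```

    Only integer additions and comparisons are needed, which is why this family of algorithms is a classic choice for rasterizing circles.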

    The difference is that here we use (1,1,1) as the size of a thread group. The CS is called on the CPU side as follows:

    private void DispatchKernel(int count)
    {
        shader.Dispatch(circlesHandle, count, 1, 1);
    }

    void Update()
    {
        DispatchKernel(1);
    }

    The question is: how many times is the kernel function executed per dispatch?

    Answer: only once. A thread group has only 1*1*1 = 1 thread, and the CPU side dispatches only 1*1*1 = 1 thread group, so a single thread draws the circle. In other words, one thread can write to any part of the RWTexture, instead of one thread drawing one pixel as before.

    This also shows that there is an essential difference between Compute Shader and Fragment Shader. Fragment Shader only calculates the color of a single pixel, while Compute Shader can perform more or less arbitrary operations!


    Back to Unity, if you want to draw a good-looking circle, you need an outline color and a fill color. Pass these two parameters to CS.

    float4 clearColor;
    float4 circleColor;

    Add a kernel that fills the background color, and modify the Circles kernel. If multiple kernels access the same RWTexture, you can add the shared keyword.

    #pragma kernel Circles
    #pragma kernel Clear

    shared RWTexture2D<float4> Result;

    [numthreads(32,1,1)]
    void Circles (uint3 id : SV_DispatchThreadID)
    {
        // int2 centre = (texResolution >> 1);
        int2 centre = (int2)(random2((float)id.x) * (float)texResolution);
        int radius = (int)(random((float)id.x) * 30);
        drawCircle( centre, radius );
    }

    [numthreads(8,8,1)]
    void Clear (uint3 id : SV_DispatchThreadID)
    {
        Result[id.xy] = clearColor;
    }

    Get the Clear kernel on the CPU side and pass in the data.

    private int circlesHandle;
    private int clearHandle;
    ...
    shader.SetVector( "clearColor", clearColor);
    shader.SetVector( "circleColor", circleColor);
    ...
    private void DispatchKernels(int count)
    {
        shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1);
        shader.Dispatch(circlesHandle, count, 1, 1);
    }

    void Update()
    {
        DispatchKernels(1); // There are now 32 circles on the screen
    }

    A question: if the code is changed to DispatchKernels(10), how many circles will there be on the screen?

    Answer: 320. With DispatchKernels(1) there is 1*1*1 = 1 thread group, a thread group has 32*1*1 = 32 threads, and each thread draws one circle, giving 32 circles. With 10 thread groups there are 10 * 32 = 320 threads, hence 320 circles. Elementary school mathematics.
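    The same count, written as a short Python sketch (the helper name `total_threads` is made up): the total number of kernel invocations is the product of the dispatched group counts and the numthreads sizes.

```python
def total_threads(groups, numthreads):
    """Total kernel invocations = groups.x*y*z * numthreads.x*y*z."""
    gx, gy, gz = groups
    nx, ny, nz = numthreads
    return gx * gy * gz * nx * ny * nz

one_group = total_threads((1, 1, 1), (32, 1, 1))
ten_groups = total_threads((10, 1, 1), (32, 1, 1))
single_thread = total_threads((1, 1, 1), (1, 1, 1))
```

    One circle per thread means the number of circles equals the number of threads, as long as each thread draws exactly one.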

    Next, add a time variable so that the circles change over time. Compute Shaders do not seem to have a built-in variable like _Time, so it has to be passed in from the CPU.

    On the CPU side, note that variables that change every frame must be set before each Dispatch (outputTexture does not need to be set again, because it is a reference to the texture that lives on the GPU):

    private void DispatchKernels(int count)
    {
        shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1);
        shader.SetFloat( "time", Time.time);
        shader.Dispatch(circlesHandle, count, 1, 1);
    }

    Compute Shader:

    float time;
    ...
    void Circles (uint3 id : SV_DispatchThreadID){
        ...
        int2 centre = (int2)(random2((float)id.x + time) * (float)texResolution);
        ...
    }


    But now the circles are very messy. The next step is to use Buffer to make the circles look more regular.


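    In this example, you don't need to worry about multiple threads writing to the same memory location (such as a pixel of the RWTexture) at the same time: overlapping circles all write the same circleColor, so the order of the writes does not change the result. When the result does depend on the write order, race conditions become a real concern and need to be handled explicitly, for example with atomic operations.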

    2. Use Buffer to pass data to GPU

    So far, we have learned how to transfer some simple data from the CPU to the GPU. How do we pass a custom structure?


    We can use a Buffer as the medium. The Buffer itself lives on the GPU, and the CPU side (C#) only holds a reference to it. First declare a structure on the CPU, then declare the CPU-side array and the reference to the GPU-side buffer.

    struct Circle
    {
        public Vector2 origin;
        public Vector2 velocity;
        public float radius;
    }

    Circle[] circleData;  // on CPU
    ComputeBuffer buffer; // on GPU

    To get the size of a thread group, you can do the following. The code only reads the number of threads in the x direction of the circlesHandle kernel's thread group and ignores y and z (both are assumed to be 1). Multiplying it by the number of dispatched thread groups gives the total number of threads.

    uint threadGroupSizeX;
    shader.GetKernelThreadGroupSizes(circlesHandle, out threadGroupSizeX, out _, out _);
    int total = (int)threadGroupSizeX * count;

    Now prepare the data to be passed to the GPU. Here we create as many circles as there are threads.

    circleData = new Circle[total];
    float speed = 100;
    float halfSpeed = speed * 0.5f;
    float minRadius = 10.0f;
    float maxRadius = 30.0f;
    float radiusRange = maxRadius - minRadius;
    for(int i=0; i

    Then receive this Buffer in the Compute Shader. Declare a structure with the same layout (Vector2 in C# and float2 in HLSL both consist of two 32-bit floats), and then declare the Buffer.

    // Compute Shader
    struct circle
    {
        float2 origin;
        float2 velocity;
        float radius;
    };

    StructuredBuffer<circle> circlesBuffer;

    Note that the StructuredBuffer used here is read-only, unlike the RWStructuredBuffer introduced in the next section.

    Back on the CPU side, send the prepared data to the GPU through the Buffer. First we need to know how much memory to request. One circle consists of two float2 values and one float, and a float is 4 bytes (you can write sizeof(float) instead of the literal 4), so one element occupies (2 + 2 + 1) * 4 = 20 bytes. circleData.Length is the number of circle elements the buffer must hold, and stride is the number of bytes each element occupies. After allocating the buffer, use SetData() to fill it; this is the step that transfers the data to the GPU. Finally, bind the buffer to the kernel of the Compute Shader that uses it.

    int stride = (2 + 2 + 1) * 4; // 2 floats origin, 2 floats velocity, 1 float radius, 4 bytes per float
    buffer = new ComputeBuffer(circleData.Length, stride);
    buffer.SetData(circleData);
    shader.SetBuffer(circlesHandle, "circlesBuffer", buffer);
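    The stride calculation can be double-checked by packing the same five floats into raw bytes. This Python sketch uses the standard struct module with a tightly packed little-endian layout; `CIRCLE_FORMAT` and `pack_circles` are illustrative names, and the sample values are arbitrary.

```python
import struct

# origin.x, origin.y, velocity.x, velocity.y, radius: five 32-bit floats, no padding
CIRCLE_FORMAT = "<5f"
STRIDE = struct.calcsize(CIRCLE_FORMAT)

def pack_circles(circles):
    """Lay the circles out back to back, one STRIDE-sized element each."""
    return b"".join(struct.pack(CIRCLE_FORMAT, *c) for c in circles)

data = pack_circles([
    (10.0, 20.0, 1.0, -1.0, 15.0),
    (5.0, 5.0, 0.0, 2.0, 30.0),
])
second_radius = struct.unpack_from(CIRCLE_FORMAT, data, STRIDE)[4]
```

    The element count times the stride gives the total buffer size, which is what the ComputeBuffer constructor is told.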

    So far, we have passed some data prepared by the CPU to the GPU through Buffer.


    OK, now let’s make use of the data that was transferred to the GPU with great difficulty.

    [numthreads(32,1,1)]
    void Circles (uint3 id : SV_DispatchThreadID)
    {
        int2 centre = (int2)(circlesBuffer[id.x].origin + circlesBuffer[id.x].velocity * time);
        // Wrap the centre back into the texture when it moves off an edge
        while (centre.x>texResolution) centre.x -= texResolution;
        while (centre.x<0) centre.x += texResolution;
        while (centre.y>texResolution) centre.y -= texResolution;
        while (centre.y<0) centre.y += texResolution;
        uint radius = (int)circlesBuffer[id.x].radius;
        drawCircle( centre, radius );
    }
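    The wrap-around logic in the kernel can be modelled on the CPU. The Python sketch below reproduces the while loops for one coordinate and compares them with a modulo; both function names are made up. Note that Python's `%` returns a non-negative result for a positive divisor, which is not how the remainder of a negative integer behaves in HLSL, so the modulo variant is shown only as a Python-side comparison.

```python
def wrap_while(v, res):
    """Same loops as the kernel: bring v back into [0, res]."""
    while v > res:
        v -= res
    while v < 0:
        v += res
    return v

def wrap_mod(v, res):
    """Python-only shortcut: result is always in [0, res)."""
    return v % res

positions = [wrap_while(100 + 60 * t, 256) for t in range(6)]
```

    The two versions agree everywhere except at v == res, which the loops leave unchanged while the modulo maps it to 0.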

    You can see that the circles now move continuously: the Buffer stores the origin and velocity of the circle indexed by id.x, and the kernel computes the current position as origin + velocity * time.


    To sum up, in this section we learned how to customize a structure (data structure) on the CPU side, pass it to the GPU through a Buffer, and process the data on the GPU.

    In the next section, we will learn how to get data from the GPU back to the CPU.


    3. Get data from GPU

    As usual, create a Buffer, this time to transfer data from the GPU back to the CPU. Create the buffer, bind it to the shader, and allocate an array on the CPU side that is ready to receive the GPU data.

    ComputeBuffer resultBuffer; // Buffer
    Vector3[] output;           // Array on the CPU that receives the data
    ...
    // Buffer on the GPU
    resultBuffer = new ComputeBuffer(starCount, sizeof(float) * 3);
    shader.SetBuffer(kernelHandle, "Result", resultBuffer);
    output = new Vector3[starCount];

    The Compute Shader declares the matching Buffer. This Buffer is readable and writable, which means the Compute Shader can modify it. In the previous section the Compute Shader only needed to read the Buffer, so StructuredBuffer was enough. Here we need the RW variant.

    RWStructuredBuffer<float3> Result;

    Next, call GetData after Dispatch to receive the data.

    shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
    resultBuffer.GetData(output);

    The idea is that simple. Now let's build a scene in which a large number of stars orbit around the center of a sphere.

    The star positions are computed on the GPU; the CPU then reads back the computed position of every star and applies it to objects instantiated in C#.

    In the Compute Shader, each thread computes the position of one star and writes it to the Buffer.

    [numthreads(64,1,1)]
    void OrbitingStars (uint3 id : SV_DispatchThreadID)
    {
        float3 sinDir = normalize(random3(id.x) - 0.5);
        float3 vec = normalize(random3(id.x + 7.1393) - 0.5);
        float3 cosDir = normalize(cross(sinDir, vec));
        float scaledTime = time * 0.5 + random(id.x) * 712.131234;
        float3 pos = sinDir * sin(scaledTime) + cosDir * cos(scaledTime);
        Result[id.x] = pos * 2;
    }
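    Why this produces an orbit: sinDir and cosDir are unit vectors, and cosDir is perpendicular to sinDir because it comes from a cross product with sinDir, so sinDir * sin(t) + cosDir * cos(t) traces a unit circle, scaled to radius 2. The Python sketch below checks this numerically; the fixed input vectors replace the kernel's random3 calls and are arbitrary, and all function names are illustrative.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def length(v):
    return math.sqrt(sum(c * c for c in v))

def orbit_position(sin_dir, vec, scaled_time, scale=2.0):
    """Position on a circle spanned by two perpendicular unit vectors."""
    s = normalize(sin_dir)
    c = normalize(cross(s, vec))  # perpendicular to s by construction
    return tuple(scale * (s[i] * math.sin(scaled_time) + c[i] * math.cos(scaled_time))
                 for i in range(3))

radii = [length(orbit_position((1.0, 2.0, 3.0), (0.3, -0.5, 0.8), t * 0.37))
         for t in range(20)]
```

    Every sampled position lies at distance 2 from the origin, so each star stays on a circle on the surface of a sphere of radius 2.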

    On the CPU side, read the results back with GetData every frame and update the position of each previously instantiated GameObject.

    void Update()
    {
        shader.SetFloat("time", Time.time);
        shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
        resultBuffer.GetData(output);
        for (int i = 0; i < stars.Length; i++)
            stars[i].localPosition = output[i];
    }

    Current version code:

    • Compute Shader:
    • CPU:

    4. Use noise

    Generating a noise map using Compute Shader is very simple and very efficient.

    float random (float2 pt, float seed) {
        const float a = 12.9898;
        const float b = 78.233;
        const float c = 43758.543123;
        return frac(sin(seed + dot(pt, float2(a, b))) * c );
    }

    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        float4 white = 1;
        Result[id.xy] = random(((float2)id.xy)/(float)texResolution, time) * white;
    }
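The hash above can be tried out on the CPU. The following is a rough Python port for experimentation only; `random_hash` and `noise_image` are illustrative names, and because Python uses 64-bit floats the values will not match the GPU's 32-bit results exactly.

```python
import math

def frac(x):
    """Fractional part, like HLSL frac()."""
    return x - math.floor(x)

def random_hash(pt, seed):
    """Pseudo-random value derived from a 2D point and a seed."""
    a, b, c = 12.9898, 78.233, 43758.543123
    return frac(math.sin(seed + pt[0] * a + pt[1] * b) * c)

def noise_image(resolution, seed):
    """One value per texel, using normalized texel coordinates as the hash input."""
    return [[random_hash((x / resolution, y / resolution), seed)
             for x in range(resolution)]
            for y in range(resolution)]
```

Passing the current time as the seed, as the kernel does, gives a different pattern every frame.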

    A library can provide many more kinds of noise.

    #include "noiseSimplex.cginc" // Paste the library code into a file named "noiseSimplex.cginc"

    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        float3 POS = (((float3)id)/(float)texResolution) * 2.0;
        float n = snoise(POS);
        float ring = frac(noiseScale * n);
        float delta = pow(ring, ringScale) + n;
        Result[id.xy] = lerp(darkColor, paleColor, delta);
    }


    5. Deformed Mesh

    In this section, we will use a Compute Shader to turn a Cube into a Sphere, with a gradual animated transition between the two!


    As usual, declare vertex parameters on the CPU side, then throw them into the GPU for calculation, and apply the calculated new coordinates newPos to the Mesh.

    Vertex structure declaration. We attach a constructor to the CPU declaration for convenience. The GPU declaration is similar. Here, we intend to pass two buffers to the GPU, one read-only and the other read-write. At first, the two buffers are the same. As time changes (gradually), the read-write buffer gradually changes, and the Mesh changes from a cube to a ball.

    // CPU
    public struct Vertex
    {
        public Vector3 position;
        public Vector3 normal;

        public Vertex( Vector3 p, Vector3 n )
        {
            position.x = p.x;
            position.y = p.y;
            position.z = p.z;
            normal.x = n.x;
            normal.y = n.y;
            normal.z = n.z;
        }
    }
    ...
    Vertex[] vertexArray;
    Vertex[] initialArray;
    ComputeBuffer vertexBuffer;
    ComputeBuffer initialBuffer;

    // GPU
    struct Vertex {
        float3 position;
        float3 normal;
    };
    ...
    RWStructuredBuffer<Vertex> vertexBuffer;
    StructuredBuffer<Vertex> initialBuffer;

    The complete initialization steps (the Start() function) are as follows:

    1. On the CPU side, initialize the kernel and obtain the Mesh reference
    2. Copy the Mesh data into CPU-side arrays
    3. Declare the Buffers that hold the Mesh data on the GPU
    4. Pass the Mesh data and other parameters to the GPU

    After completing these operations, every frame Update, we apply the new vertices obtained from the GPU to the mesh.

    So how do we implement GPU computing?

    It's quite simple, we just need to normalize each vertex in the model space! Imagine that when all vertex position vectors are normalized, the model becomes a sphere.


    In the actual code, we also need to calculate the normal at the same time. If we don't change the normal, the lighting of the object will look very strange. So how do we calculate the normal? It's very simple: the normalized coordinates of the original cube vertices are the final normal vectors of the sphere!


    In order to achieve the "breathing" effect, a sine function is added to control the normalization coefficient.

    float delta = (Mathf.Sin(Time.time) + 1)/ 2;
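As a rough illustration of the per-vertex work, here is a small Python sketch of the morph. It assumes the mesh is centered at the origin in model space; `morph_vertex`, `breathing_delta` and the `radius` parameter are illustrative names, and the linked shader may organize the blend differently.

```python
import math

def lerp(a, b, t):
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def morph_vertex(initial_pos, initial_normal, radius, delta):
    """Blend a cube vertex toward the sphere of the given radius (delta in [0, 1])."""
    sphere_normal = normalize(initial_pos)                  # direction from the center
    sphere_pos = tuple(c * radius for c in sphere_normal)   # same direction, fixed length
    pos = lerp(initial_pos, sphere_pos, delta)
    normal = normalize(lerp(initial_normal, sphere_normal, delta))
    return pos, normal

def breathing_delta(t):
    """Maps sin(t) from [-1, 1] to [0, 1], as in the C# line above."""
    return (math.sin(t) + 1.0) / 2.0
```

At `delta = 0` the vertex keeps its cube position and normal; at `delta = 1` it lies on the sphere and its normal points straight out from the center.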

    Since the code is a bit long, I'll put a link.

    Current version code:

    • Compute Shader:
    • CPU:

    6. Summary/Quiz

    How this structure should be defined on the GPU:

    struct Circle
    {
        public Vector2 origin;
        public Vector2 velocity;
        public float radius;
    }

    How should this structure set the size of ComputeBuffer?

    struct Circle
    {
        public Vector2 origin;
        public Vector2 velocity;
        public float radius;
    }

    Why is the following code wrong?

    StructuredBuffer<float3> positions;
    // Inside a kernel
    ...
    positions[id.x] = fixed3(1,0,0);


  • Games202 作业三 SSR实现

    Games202 Assignment 3 SSR Implementation

    Assignment source code:

    TODO List

    • Implement shading of the scene's direct lighting (taking shadows into account).
    • Implement screen space ray intersection (SSR).
    • Implement shading of the scene's indirect lighting.
    • Implement RayMarch with dynamic step size.
    • (Not written yet) Bonus 1: Screen Space Ray Tracing with Mipmap Optimization.

    Number of samples: 32

    Foreword

    The basic part of this assignment is the easiest among all the assignments in 202. There is nothing particularly complicated. But I don't know how to start with the bonus part. Can someone please help me?

    Depth buffer problem of framework

    This time, the assignment ran into a fairly serious problem on macOS: the part of the cube close to the ground showed jagged clipping artifacts that changed with the camera distance. This did not happen on Windows, which was quite strange.


    I personally feel that this is related to the accuracy of the depth buffer, and may be caused by z-fighting, in which two or more overlapping surfaces compete for the same pixel. There are generally several solutions to this problem:

    • Adjust the near and far planes: don't make the near plane too close to the camera, and don't make the far plane too far away.
    • Improve the precision of the depth buffer: use 32-bit or higher precision.
    • Multi-Pass Rendering: Use different rendering schemes for objects in different distance ranges.

    The simplest fix is to adjust the near and far plane distances, on line 25 of the framework's engine.js.

    // engine.js
    // const camera = new THREE.PerspectiveCamera(75, gl.canvas.clientWidth / gl.canvas.clientHeight, 0.0001, 1e5);
    const camera = new THREE.PerspectiveCamera(75, gl.canvas.clientWidth / gl.canvas.clientHeight, 5e-2, 1e2);
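To get a feel for why the plane distances matter, the following Python sketch estimates the depth resolution of a standard perspective projection. The 24-bit fixed-point depth buffer is an assumption made for illustration (the actual format depends on the browser and GPU), and the function names are made up.

```python
def window_depth(z, near, far):
    """Depth in [0, 1] stored for a point at eye distance z (standard perspective projection)."""
    return (1.0 / near - 1.0 / z) / (1.0 / near - 1.0 / far)

def depth_resolution(z, near, far, bits=24):
    """Approximate smallest eye-space separation that is distinguishable at distance z."""
    # d(window_depth)/dz = 1 / (z^2 * (1/near - 1/far)); one buffer step is 2^-bits.
    return z * z * (1.0 / near - 1.0 / far) / float(2 ** bits)

old = depth_resolution(10.0, 1e-4, 1e5)   # original planes
new = depth_resolution(10.0, 5e-2, 1e2)   # adjusted planes
print(old, new, old / new)
```

Under these assumptions the original planes give a depth step roughly 500 times coarser at a distance of 10 units, and the ratio does not depend on the bit depth. Almost all of the gain comes from pushing the near plane away from the camera.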

    This will give you a pretty sharp border.


    Added "Pause Rendering" function

    This section is optional. To reduce the strain on your computer, simply write a button to pause the rendering.

    // engine.js
    let settings = { 'Render Switch': true };

    function createGUI() {
        ...
        // Add the boolean switch here
        gui.add(settings, 'Render Switch');
        ...
    }

    function mainLoop (now) {
        if(settings['Render Switch']){
            cameraControls.update();
            renderer.render();
        }
        requestAnimationFrame(mainLoop);
    }
    requestAnimationFrame(mainLoop);


    1. Implementing direct lighting

    Implement EvalDiffuse(vec3 wi, vec3 wo, vec2 uv) and EvalDirectionalLight(vec2 uv) in shaders/ssrShader/ssrFragment.glsl.

    // ssrFragment.glsl
    vec3 EvalDiffuse(vec3 wi, vec3 wo, vec2 screenUV) {
        vec3 reflectivity = GetGBufferDiffuse(screenUV);
        vec3 normal = GetGBufferNormalWorld(screenUV);
        float cosi = max(0., dot(normal, wi));
        vec3 f_r = reflectivity * cosi;
        return f_r;
    }

    vec3 EvalDirectionalLight(vec2 screenUV) {
        vec3 Li = uLightRadiance * GetGBufferuShadow(screenUV);
        return Li;
    }

    The first code snippet actually implements the Lambertian reflection model, which corresponds to $f_r \cdot \text{cos}(\theta_i)$ in the rendering equation.

    Strictly speaking, the Lambertian BRDF should be divided by $\pi$, but judging from the reference results in the assignment framework, no division is applied, so it is left out here.

    The second part is responsible for direct lighting (including shadow occlusion), corresponding to $L_i \cdot V$ in the rendering equation.


    Let's review the Lambertian reflection model here. We noticed that EvalDiffuse passed in two directions, wi and wo, but we only used the direction of the incident light, wi. This is because the Lambertian model has nothing to do with the direction of observation, but only with the surface normal and the cosine value of the incident light.
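The view independence can be checked with a few lines of Python. This is only a sketch mirroring the shape of EvalDiffuse; plain tuples stand in for the G-buffer lookups, and like the shader it omits the division by π.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def eval_diffuse(albedo, normal, wi, wo):
    """Lambertian term times the cosine factor; wo is accepted but never used."""
    cos_i = max(0.0, dot(normal, wi))
    return tuple(a * cos_i for a in albedo)
```

Calling it with two different `wo` directions returns the same value, and a light direction below the surface gives zero because the cosine is clamped.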

    Finally, set the result in main().

    // ssrFragment.glsl
    void main() {
        float s = InitRand(gl_FragCoord.xy);
        vec3 L = vec3(0.0);

        vec3 wi = normalize(uLightDir);
        vec3 wo = normalize(uCameraPos -;
        vec2 worldPos = GetScreenCoordinate(;

        L = EvalDiffuse(wi, wo, worldPos) * EvalDirectionalLight(worldPos);

        vec3 color = pow(clamp(L, vec3(0.0), vec3(1.0)), vec3(1.0 / 2.2));
        gl_FragColor = vec4(vec3(color.rgb), 1.0);
    }

    2. Specular SSR – Implementing RayMarch

    Implement the RayMarch(ori, dir, out hitPos) function, which finds the intersection between the ray and the scene and returns whether the ray hit anything. The parameters ori and dir are given in world coordinates and represent the origin and direction of the ray, where the direction is a unit vector. For more information, please refer to EA's SIGGRAPH 2015 (SIG15) course report.

    The "cube1" scene in the assignment framework includes the ground, so the final SSR result for it is not very pretty. "Pretty" here means the clarity of the result images in papers, or the refinement of water reflections in games.

    To be precise, what we implement in this article is the most basic "mirror SSR", namely Basic mirror-only SSR.


    The easiest way to implement "mirror SSR" is to use Linear Raymarch, which gradually determines the occlusion relationship between the current position and the depth position of gBuffer through small steps.

    // ssrFragment.glsl
    bool RayMarch(vec3 ori, vec3 dir, out vec3 hitPos) {
        const int totalStepTimes = 60;
        const float threshold = 0.0001;
        float step = 0.05;
        vec3 stepDir = normalize(dir) * step;
        vec3 curPos = ori;
        for(int i = 0; i < totalStepTimes; i++) {
            vec2 screenUV = GetScreenCoordinate(curPos);
            float rayDepth = GetDepth(curPos);
            float gBufferDepth = GetGBufferDepth(screenUV);

            // Check if the ray has hit an object
            if(rayDepth > gBufferDepth + threshold){
                hitPos = curPos;
                return true;
            }
            curPos += stepDir;
        }
        return false;
    }
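The same loop can be tried outside the shader. The sketch below is a simplified 2D stand-in written in Python: a position is an `(x, depth)` pair, and `gbuffer_depth` is a hypothetical callback that plays the role of the G-buffer depth lookup. It only illustrates the stepping logic, not the framework's coordinate transforms.

```python
import math

def ray_march(origin, direction, gbuffer_depth, step=0.05, max_steps=60, threshold=1e-4):
    """Return the first sample (x, depth) that lies behind the depth buffer, or None."""
    length = math.hypot(direction[0], direction[1])
    dx = direction[0] / length * step
    dz = direction[1] / length * step
    x, z = origin
    for _ in range(max_steps):
        if z > gbuffer_depth(x) + threshold:  # the ray is now behind the visible surface
            return (x, z)
        x += dx
        z += dz
    return None
```

With a flat surface at depth 1.0, a ray heading toward it reports a hit slightly behind the surface, overshooting by at most one step. Lowering `max_steps` far enough makes the same ray give up before it arrives, which is the "not enough steps" failure described in the text.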

    Finally, fine-tune the step size. I ended up with 0.05. If the step size is too large, the reflection will be "broken". If the step size is too small and the number of steps is not enough, the calculation may be terminated because the step distance is not enough where the reflection should be. The maximum number of steps in the figure below is 150.

    // ssrFragment.glsl
    vec3 EvalSSR(vec3 wi, vec3 wo, vec2 screenUV) {
        vec3 worldNormal = GetGBufferNormalWorld(screenUV);
        vec3 reflectDir = normalize(reflect(-wo, worldNormal));
        vec3 hitPos;
        if(RayMarch( ,reflectDir, hitPos)){
            vec2 INV_screenUV = GetScreenCoordinate(hitPos);
            return GetGBufferDiffuse(INV_screenUV);
        }
        else{
            return vec3(0.);
        }
    }

    Write a function that calls RayMarch and wraps it up so it can be used in main().

    // ssrFragment.glsl
    void main() {
        float s = InitRand(gl_FragCoord.xy);
        vec3 L = vec3(0.0);

        vec3 wi = normalize(uLightDir);
        vec3 wo = normalize(uCameraPos -;
        vec2 screenUV = GetScreenCoordinate(;

        // Basic mirror-only SSR
        float reflectivity = 0.2;

        L = EvalDiffuse(wi, wo, screenUV) * EvalDirectionalLight(screenUV);
        L += EvalSSR(wi, wo, screenUV) * reflectivity;

        vec3 color = pow(clamp(L, vec3(0.0), vec3(1.0)), vec3(1.0 / 2.2));
        gl_FragColor = vec4(vec3(color.rgb), 1.0);
    }

    If you just want to test the effect of SSR, please adjust it yourself in main().


    Before the release of "Killzone Shadow Fall" in 2013, SSR technology was still subject to great restrictions, because in actual development, we usually need to simulate glossy objects. Due to the performance limitations at the time, SSR technology was not widely adopted. With the release of "Killzone Shadow Fall", it marks a significant progress in real-time reflection technology. Thanks to the special hardware of PS4, it is possible to render high-quality glossy and semi-reflective objects in real time.


    In the following years, SSR technology developed rapidly, especially in combination with technologies such as PBR.

    Starting with Nvidia's RTX graphics cards, the rise of real-time ray tracing has gradually replaced SSR in some scenarios. However, in most development scenarios, traditional SSR still plays a considerable role.

    The future development trend will still be a mixture of traditional SSR technology and ray tracing technology.

    3. Indirect lighting

    Write it following the pseudocode, i.e. use the Monte Carlo method to solve the rendering equation. Unlike before, the samples this time are all in screen space. For sampling, you can use SampleHemisphereUniform(inout s, out pdf) and SampleHemisphereCos(inout s, out pdf) provided by the framework. Both functions return a direction in local coordinates; their parameters are the random number s and the sampling probability pdf.

    For this part, you need to understand the pseudo code in the figure below, and then complete EvalIndirectionLight() accordingly.


    First of all, our sampling is still based on screen space, so anything that is not on screen (not in the gBuffer) is treated as non-existent. Think of the scene as a single layer of shell facing the camera.

    Indirect lighting involves random sampling of the upper hemisphere direction and the calculation of the corresponding PDF. Use InitRand(screenUV) to get the random number, then choose one of the two, SampleHemisphereUniform(inout float s, out float pdf) or SampleHemisphereCos(inout float s, out float pdf), update the random number and get the corresponding PDF and the position dir of the local coordinate system on the unit hemisphere.

    Pass the normal coordinates of the current Shading Point into the function LocalBasis(n, out b1, out b2), and then return b1, b2, where the three unit vectors n, b1, b2 are orthogonal to each other. Through the local coordinate system formed by these three vectors, dir is converted to world coordinates. I will write about the principle of LocalBasis() at the end.

    By the way, the matrix constructed with the vectors n (normal), b1, and b2 is commonly referred to as the TBN matrix in computer graphics.
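For reference, here is a Python sketch of the two sampling strategies together with their PDFs, plus the local-to-world transform. These are the standard textbook formulas written from scratch, not the framework's GLSL; the function names and the use of Python's `random.Random` in place of the shader's random state `s` are assumptions.

```python
import math
import random

def sample_hemisphere_uniform(rng):
    """Uniform direction on the +z hemisphere; pdf = 1 / (2*pi)."""
    u, v = rng.random(), rng.random()
    z = u
    r = math.sqrt(max(0.0, 1.0 - z * z))
    phi = 2.0 * math.pi * v
    return (r * math.cos(phi), r * math.sin(phi), z), 1.0 / (2.0 * math.pi)

def sample_hemisphere_cos(rng):
    """Cosine-weighted direction on the +z hemisphere; pdf = cos(theta) / pi."""
    u, v = rng.random(), rng.random()
    z = math.sqrt(1.0 - u)
    r = math.sqrt(u)
    phi = 2.0 * math.pi * v
    return (r * math.cos(phi), r * math.sin(phi), z), z / math.pi

def to_world(local_dir, b1, b2, n):
    """Equivalent of mat3(b1, b2, n) * local_dir: b1, b2, n are the matrix columns."""
    return tuple(local_dir[0] * b1[i] + local_dir[1] * b2[i] + local_dir[2] * n[i]
                 for i in range(3))

def estimate_cos_integral(sampler, count, rng):
    """Monte Carlo estimate of the hemisphere integral of cos(theta), which equals pi."""
    total = 0.0
    for _ in range(count):
        direction, pdf = sampler(rng)
        total += direction[2] / pdf
    return total / count
```

Both samplers converge to the same integral, but the cosine-weighted one has no variance for this particular integrand, since every sample contributes exactly π. That is why it is usually the better choice for diffuse surfaces.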

    // ssrFragment.glsl
    #define SAMPLE_NUM 5

    vec3 EvalIndirectionLight(vec3 wi, vec3 wo, vec2 screenUV){
        vec3 L_ind = vec3(0.0);
        float s = InitRand(screenUV);
        vec3 normal = GetGBufferNormalWorld(screenUV);
        vec3 b1, b2;
        LocalBasis(normal, b1, b2);

        for(int i = 0; i < SAMPLE_NUM; i++){
            float pdf;
            vec3 direction = SampleHemisphereUniform(s, pdf);
            vec3 worldDir = normalize(mat3(b1, b2, normal) * direction);

            vec3 position_1;
            if(RayMarch(, worldDir, position_1)){ // The sampling ray hits position_1
                vec2 hitScreenUV = GetScreenCoordinate(position_1);
                vec3 bsdf_d = EvalDiffuse(worldDir, wo, screenUV); // Direct lighting
                vec3 bsdf_i = EvalDiffuse(wi, worldDir, hitScreenUV); // Indirect lighting
                L_ind += bsdf_d / pdf * bsdf_i * EvalDirectionalLight(hitScreenUV);
            }
        }
        L_ind /= float(SAMPLE_NUM);
        return L_ind;
    }

    // ssrFragment.glsl
    // Main entry point for the shader
    void main() {
        vec3 wi = normalize(uLightDir);
        vec3 wo = normalize( uCameraPos -;
        vec2 screenUV = GetScreenCoordinate(;

        // Basic mirror-only SSR coefficient
        float ssrCoeff = 0.0;
        // Indirection Light coefficient
        float indCoeff = 0.3;

        // Direction Light
        vec3 L_d = EvalDiffuse(wi, wo, screenUV) * EvalDirectionalLight(screenUV);
        // SSR Light
        vec3 L_ssr = EvalSSR(wi, wo, screenUV) * ssrCoeff;
        // Indirection Light
        vec3 L_i = EvalIndirectionLight(wi, wo, screenUV) * indCoeff;

        vec3 result = L_d + L_ssr + L_i;
        vec3 color = pow(clamp(result, vec3(0.0), vec3(1.0)), vec3(1.0 / 2.2));
        gl_FragColor = vec4(vec3(color.rgb), 1.0);
    }

    Show only indirect lighting. Samples = 5.


    Direct lighting + indirect lighting. Number of samples = 5.


    It was such a headache to write this part. Even with SAMPLE_NUM set to 1, my computer was sweating profusely. Once the Live Server was turned on, even typing lagged. I couldn't stand it. Is this the performance of the M1 Pro? And what I can't stand the most: when the Safari browser is stuck, why is the whole system stuck? Is this macOS's User First strategy? I don't understand. I had no choice but to take out my gaming computer and test the project over the LAN (sad). I just didn't expect that the RTX 3070 would also sweat profusely when running. It seems that the algorithm I wrote is a pile of shit, and my life is also a pile of shit.

    4. RayMarch Improvements

    The current RayMarch() is actually problematic and will cause light leakage.


    With 5 samples, the frame rate is only about 46.2 fps. My device is an M1 Pro with 16 GB.


    Here we will focus on why light leakage occurs. See the figure below. Our gBuffer only has the depth information of the blue part. Even if our algorithm above has determined that the current curPos is deeper than the depth of gBuffer, it cannot ensure that this curPos is the collision point. Therefore, the algorithm above does not consider the situation in the figure, which leads to light leakage.


    To solve the light leakage problem, we introduce a threshold (yes, it is an approximation). If the difference between the depth of curPos and the depth recorded in the gBuffer is greater than a certain threshold, we are in the situation shown in the figure below. In that case, screen space cannot provide correct reflection information, so the SSR result for this Shading Point is vec3(0). It is that simple and crude!


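    The idea of the code is similar to the previous one. At each step, the depth of the next step position is compared with the depth in the gBuffer. If the next position is still in front of the gBuffer (nextDepth < gBufferDepth), the ray keeps advancing. Otherwise, if the gBuffer depth at curPos exceeds the ray depth at curPos by more than the threshold, the march gives up and returns false; if not, the ray takes the step and runs the same hit test as before.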

    bool RayMarch(vec3 ori, vec3 dir, out vec3 hitPos) {
        const float EPS = 1e-2;
        const int totalStepTimes = 60;
        const float threshold = 0.1;
        float step = 0.05;
        vec3 stepDir = normalize(dir) * step;
        vec3 curPos = ori + stepDir;
        vec3 nextPos = curPos + stepDir;
        for(int i = 0; i < totalStepTimes; i++) {
            if(GetDepth(nextPos) < GetGBufferDepth(GetScreenCoordinate(nextPos))){
                curPos = nextPos;
                nextPos += stepDir;
            }else if(GetGBufferDepth(GetScreenCoordinate(curPos)) - GetDepth(curPos) + EPS > threshold){
                return false;
            }else{
                curPos += stepDir;
                vec2 screenUV = GetScreenCoordinate(curPos);
                float rayDepth = GetDepth(curPos);
                float gBufferDepth = GetGBufferDepth(screenUV);
                if(rayDepth > gBufferDepth + threshold){
                    hitPos = curPos;
                    return true;
                }
            }
        }
        return false;
    }

    The frame rate dropped to around 42.6, but the picture was significantly improved! At least there was no noticeable light leakage.


    However, the picture still has flaws: there are burr-like reflection artifacts at the edges, which means the light leakage problem is not fully solved, as shown in the following figure:


    The method above does indeed have a problem. When comparing against the threshold, we mistakenly used curPos (i.e., Step n in the figure below), which sends the code into the third branch and returns the wrong curPos as hitPos.


    Taking a step back, we cannot guarantee that the final curPos falls exactly on the line between the edge of the object and the camera origin. Put bluntly, the blue line in the figure below is sampled quite coarsely. We would like to find the curPos that lies "just" on the boundary and then handle the defect over the stretch from "Step n" to that curPos (the burr artifact above), but for various precision reasons we cannot obtain it. In the figure below, the green line represents one step.


    Even if we adjust the ratio of threshold/step to make it close to 1, we can hardly eliminate the problem and can only alleviate it, as shown in the figure below.


    Therefore, we need to improve the "anti-light leakage" method again.

    The idea of the improvement is very simple: since I cannot obtain the "exact" curPos, I guess it, using a direct linear interpolation. Before interpolating, I make one approximation: the view rays are treated as parallel to each other. That gives the similar triangles shown in the figure below, from which the desired curPos is estimated and used as hitPos.



    bool RayMarch(vec3 ori, vec3 dir, out vec3 hitPos) {
        bool result = false;
        const float EPS = 1e-3;
        const int totalStepTimes = 60;
        const float threshold = 0.1;
        float step = 0.05;
        vec3 stepDir = normalize(dir) * step;
        vec3 curPos = ori + stepDir;
        vec3 nextPos = curPos + stepDir;
        for(int i = 0; i < totalStepTimes; i++) {
            if(GetDepth(nextPos) < GetGBufferDepth(GetScreenCoordinate(nextPos))){
                curPos = nextPos;
                nextPos += stepDir;
                continue;
            }
            float s1 = GetGBufferDepth(GetScreenCoordinate(curPos)) - GetDepth(curPos) + EPS;
            float s2 = GetDepth(nextPos) - GetGBufferDepth(GetScreenCoordinate(nextPos)) + EPS;
            if(s1 < threshold && s2 < threshold){
                hitPos = curPos + stepDir * s1 / (s1 + s2);
                result = true;
            }
            break;
        }
        return result;
    }
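The interpolation line is the core of this version, so here it is in isolation as a Python sketch. `interpolate_hit` is an illustrative name; `s1` is how far the ray is in front of the surface at curPos, and `s2` is how far it is behind the surface at nextPos.

```python
def interpolate_hit(cur_pos, step_dir, s1, s2):
    """Estimate the crossing point between cur_pos and cur_pos + step_dir by similar triangles."""
    t = s1 / (s1 + s2)  # fraction of the step travelled before crossing the surface
    return tuple(p + d * t for p, d in zip(cur_pos, step_dir))
```

If the ray is as far in front at curPos as it is behind at nextPos, the estimate is the midpoint of the step. The closer curPos already is to the surface, the closer the estimate stays to curPos.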

    The effect is quite good, with no ghosting or border artifacts. And the frame rate is similar to the original algorithm, averaging around 49.2.


    Next, we will focus on optimizing performance, specifically:

    • Add adaptive step
    • Off-screen ignored judgment

    The off-screen check is very simple: if the uvScreen of curPos is not between 0 and 1, the march is abandoned.

    The adaptive step is covered in detail further down. The off-screen check just adds two lines at the start of the for loop, and the frame rate rises slightly, by about 2-3 fps.

    vec2 uvScreen = GetScreenCoordinate(curPos);
    if(any(bvec4(lessThan(uvScreen, vec2(0.0)), greaterThan(uvScreen, vec2(1.0))))) break;

    The adaptive step is not difficult. First, set a larger initial step. If, after stepping, curPos is off screen, or its depth is deeper than the gBuffer, or "s1 < threshold && s2 < threshold" is not satisfied, halve the step to preserve accuracy.

    bool RayMarch(vec3 ori, vec3 dir, out vec3 hitPos) {
        const float EPS = 1e-2;
        const int totalStepTimes = 20;
        const float threshold = 0.1;
        bool result = false, firstIn = false;
        float step = 0.8;
        vec3 curPos = ori;
        vec3 nextPos;
        for(int i = 0; i < totalStepTimes; i++) {
            nextPos = curPos + dir * step;
            vec2 uvScreen = GetScreenCoordinate(curPos);
            if(any(bvec4(lessThan(uvScreen, vec2(0.0)), greaterThan(uvScreen, vec2(1.0))))) break;
            if(GetDepth(nextPos) < GetGBufferDepth(GetScreenCoordinate(nextPos))){
                curPos += dir * step;
                if(firstIn) step *= 0.5;
                continue;
            }
            firstIn = true;
            if(step < EPS){
                float s1 = GetGBufferDepth(GetScreenCoordinate(curPos)) - GetDepth(curPos) + EPS;
                float s2 = GetDepth(nextPos) - GetGBufferDepth(GetScreenCoordinate(nextPos)) + EPS;
                if(s1 < threshold && s2 < threshold){
                    hitPos = curPos + 2.0 * dir * step * s1 / (s1 + s2);
                    result = true;
                }
                break;
            }
            if(firstIn) step *= 0.5;
        }
        return result;
    }
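To show why the adaptive step needs far fewer iterations, here is a one-dimensional Python sketch of the halving idea. It is a simplification rather than a port of the shader: the ray is reduced to a parameter `t`, the surface sits at a hypothetical `surface_t`, and the step is halved only when the next sample would overshoot.

```python
def adaptive_march(surface_t, init_step=0.8, eps=5e-2, max_iters=20):
    """Bracket the surface along the ray; the ray is in front of it while t < surface_t."""
    t, step = 0.0, init_step
    for _ in range(max_iters):
        if t + step < surface_t:   # next sample is still in front: safe to advance
            t += step
        elif step < eps:           # overshoot with a small enough step: surface is bracketed
            return t, t + step
        else:                      # overshoot with a coarse step: refine
            step *= 0.5
    return None
```

For a surface about 3 units away, this brackets the crossing to within `eps` in roughly a dozen iterations, whereas a fixed step of 0.05 would need about 60 steps to cover the same distance. The returned pair corresponds to curPos and nextPos, the two ends used for the interpolated hit position.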

    After the improvement, the frame rate suddenly reached 100 frames, almost doubling.


    Finally, tidy up the code.

    #define EPS 5e-2
    #define TOTAL_STEP_TIMES 20
    #define THRESHOLD 0.1
    #define INIT_STEP 0.8

    bool outScreen(vec3 curPos){
        vec2 uvScreen = GetScreenCoordinate(curPos);
        return any(bvec4(lessThan(uvScreen, vec2(0.0)), greaterThan(uvScreen, vec2(1.0))));
    }

    bool testDepth(vec3 nextPos){
        return GetDepth(nextPos) < GetGBufferDepth(GetScreenCoordinate(nextPos));
    }

    bool RayMarch(vec3 ori, vec3 dir, out vec3 hitPos) {
        float step = INIT_STEP;
        bool result = false, firstIn = false;
        vec3 nextPos, curPos = ori;
        for(int i = 0; i < TOTAL_STEP_TIMES; i++) {
            nextPos = curPos + dir * step;
            if(outScreen(curPos)) break;
            if(testDepth(nextPos)){ // Still in front: safe to advance
                curPos += dir * step;
                continue;
            }else{ // Overshot the surface
                firstIn = true;
                if(step < EPS){
                    float s1 = GetGBufferDepth(GetScreenCoordinate(curPos)) - GetDepth(curPos) + EPS;
                    float s2 = GetDepth(nextPos) - GetGBufferDepth(GetScreenCoordinate(nextPos)) + EPS;
                    if(s1 < THRESHOLD && s2 < THRESHOLD){
                        hitPos = curPos + 2.0 * dir * step * s1 / (s1 + s2);
                        result = true;
                    }
                    break;
                }
                if(firstIn) step *= 0.5;
            }
        }
        return result;
    }

    Switching to the cave scene with the sample count set to 32, the frame rate is only a pitiful 4 fps.


    And the quality of the secondary light source is very good.


    However, this algorithm will cause new problems when applied to reflections, especially the following picture, which has serious distortion.


    5. Mipmap Implementation

    Hierarchical-Z map based occlusion culling

    6. How LocalBasis builds the TBN basis

    Generally speaking, the tangent-space vectors (normal, tangent, and bitangent) are constructed with cross products. The method is very simple: first pick a helper vector that is not parallel to the normal and take the cross product of the two to get the tangent; then take the cross product of the normal and the tangent to get the bitangent. The code looks like this:

    void CalculateTBN(const vec3 &normal, vec3 &tangent, vec3 &bitangent) {
        vec3 helperVec;
        if (abs(normal.x) < abs(normal.y))
            helperVec = vec3(1.0, 0.0, 0.0);
        else
            helperVec = vec3(0.0, 1.0, 0.0);
        tangent = normalize(cross(helperVec, normal));
        bitangent = normalize(cross(normal, tangent));
    }

    The code in the assignment framework, however, avoids the cross product, which is very clever. Simply put, it ensures that the pairwise dot products of the vectors are all 0.

    • $b1⋅n=0$
    • $b2⋅n=0$
    • $b1⋅b2=0$
    void LocalBasis(vec3 n, out vec3 b1, out vec3 b2) {
        float sign_ = sign(n.z);
        if (n.z == 0.0) {
            sign_ = 1.0;
        }
        float a = -1.0 / (sign_ + n.z);
        float b = n.x * n.y * a;
        b1 = vec3(1.0 + sign_ * n.x * n.x * a, sign_ * b, -sign_ * n.x);
        b2 = vec3(b, sign_ + n.y * n.y * a, -n.y);
    }
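The three dot-product conditions can be checked numerically. Below is a direct Python transcription of LocalBasis (the name `local_basis` and the tuple representation are the only changes), which can be run against a few unit normals, including ones pointing along -z.

```python
import math

def local_basis(n):
    """Return (b1, b2) such that b1, b2 and the unit normal n are mutually orthogonal."""
    nx, ny, nz = n
    sign = 1.0 if nz >= 0.0 else -1.0   # sign(n.z), with sign(0) treated as +1
    a = -1.0 / (sign + nz)              # sign + nz is never 0 for a unit vector
    b = nx * ny * a
    b1 = (1.0 + sign * nx * nx * a, sign * b, -sign * nx)
    b2 = (b, sign + ny * ny * a, -ny)
    return b1, b2

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))
```

For a unit-length input, `b1` and `b2` also come out with unit length, so no extra normalization is needed. The sign trick keeps the denominator `sign + nz` at least 1 in magnitude, which is where the robustness near `n.z = -1` comes from.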

    This algorithm is heuristic and introduces a sign function, which is quite impressive. It also handles the division-by-zero case, which shows real attention to detail. The four lines that follow, however, look like something the author arrived at by rearranging a formula one day. Here I will try to retrace that rearrangement, i.e. derive it in reverse.


    By the way, the sign function in the code can be multiplied in the last step.

    In fact, I can create a hundred such formulas, and I don’t know the difference between them. If you know, please tell me QAQ. If you insist, then it can be explained like this:

    Traditional cross-product-based methods can be numerically unstable: when the helper vector is nearly parallel to the normal, the cross product is close to the zero vector. The method used in the framework is a heuristic that builds an orthogonal basis through a series of carefully designed steps. It pays special attention to numerical stability, so it stays effective and stable for normals close to extreme directions.

    Thanks to @I am a dragon set little fruit for pointing out that the method above is indeed well founded: the algorithm provided in the assignment framework was obtained by Tom Duff et al. in 2017 by improving Frisvad's method. For details, please refer to the following two papers.


    1. Games 202
    2. LearnOpenGL – Normal Mapping
  • Games202 作业二 PRT实现

    Games202 Assignment 2 PRT Implementation


    Because I am also a newbie, I can't ensure that everything is correct. I hope the experts can correct me.

    Zhihu's formula rendering is a bit ugly; you can read this on GitHub instead.

    Project source code:

    Precomputed spherical harmonic coefficients

    The spherical harmonics coefficients are pre-computed using the framework nori.

    Ambient lighting: Calculate the spherical harmonic coefficients for each pixel of the cubemap

    ProjEnv::PrecomputeCubemapSH(images, width, height, channel); uses a Riemann sum to compute the spherical harmonic coefficients of the ambient light.

    Complete code

    // TODO: here you need to compute light sh of each face of cubemap of each pixel
    Eigen::Vector3f dir = cubemapDirs[i * width * height + y * width + x];
    int index = (y * width + x) * channel;
    Eigen::Array3f Le(images[i][index + 0], images[i][index + 1],
                      images[i][index + 2]);
    // Describe the current direction in spherical coordinates
    double theta = acos(dir.z());
    double phi = atan2(dir.y(), dir.x());
    // Traverse each basis function of spherical harmonics
    for (int l = 0; l <= SHOrder; l++){
        for (int m = -l; m <= l; m++){
            float sh = sh::EvalSH(l, m, phi, theta);
            float delta = CalcArea((float)x, (float)y, width, height);
            SHCoeffiecents[l*(l+1)+m] += Le * sh * delta;
        }
    }


    The spherical harmonic coefficients are the projections of a function defined on the sphere onto the spherical harmonic basis functions; together they describe how the function is distributed over the sphere. Since we have three RGB channels, each coefficient is stored as a three-dimensional vector. The part that needs to be completed:

    /// prt.cpp - PrecomputeCubemapSH()
    // TODO: here you need to compute light sh of each face of cubemap of each pixel
    Eigen::Vector3f dir = cubemapDirs[i * width * height + y * width + x];
    int index = (y * width + x) * channel;
    Eigen::Array3f Le(images[i][index + 0], images[i][index + 1],
                      images[i][index + 2]);

    First, for each pixel of the six cubemap faces (the images array), we take its direction (a 3D vector from the center to the pixel) and convert it to spherical coordinates (theta and phi).

    Then the spherical coordinates are passed to sh::EvalSH() to evaluate each spherical harmonic (basis function) as a real value sh, and the fraction of the sphere's area covered by the pixel, delta, is computed.

    Finally, we accumulate the spherical harmonic coefficients. In the code, we accumulate over all the pixels of the cubemap, which approximates the integral that defines the coefficient $c_l^m$ of each spherical harmonic:

    $$c_l^m = \int_0^{2\pi}\int_0^{\pi} f(\theta,\phi)\, Y_l^m(\theta,\phi)\, \sin\theta \, d\theta \, d\phi$$

    • θ is the zenith angle, ranging from 0 to π; ϕ is the azimuth angle, ranging from 0 to 2π.
    • f(θ,ϕ) is the value of the function at a point on the sphere.
    • Ylm is a spherical harmonic function, which consists of the corresponding Legendre polynomials Plm and some trigonometric functions.
    • l is the order of the spherical harmonics; m is the ordinal number of the spherical harmonics, ranging from −l to l.

    To make this more concrete, here is the discrete estimate of the coefficient $c_l^m$ that the code computes, i.e. a Riemann sum:

    $$c_l^m \approx \sum_{i=1}^{N} f(\theta_i,\phi_i)\, Y_l^m(\theta_i,\phi_i)\, \Delta\omega_i$$

    • f(θi,ϕi) is the value of the function at a point on the sphere.
    • Ylm(θi,ϕi) is the value of the spherical harmonics at that point.
    • Δωi is the tiny area or weight of the point on the sphere.
    • N is the total number of discrete points.
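The Riemann sum can be tried on a function whose projection is known in closed form. The Python sketch below is an illustration under assumptions: it uses a latitude-longitude grid with weights sin θ Δθ Δϕ instead of the cubemap texel areas from CalcArea, and it hard-codes only the first two zonal basis functions rather than calling sh::EvalSH.

```python
import math

def y00(theta, phi):
    return 0.5 * math.sqrt(1.0 / math.pi)

def y10(theta, phi):
    return math.sqrt(3.0 / (4.0 * math.pi)) * math.cos(theta)

def project(f, basis, n_theta=64, n_phi=128):
    """Riemann-sum estimate of the integral of f * basis over the sphere."""
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            d_omega = math.sin(theta) * d_theta * d_phi   # solid angle of this cell
            total += f(theta, phi) * basis(theta, phi) * d_omega
    return total

def light(theta, phi):
    """Toy lighting function: brighter toward +z."""
    return 1.0 + math.cos(theta)
```

For this toy function the exact coefficients are 2√π for the (0, 0) basis and √(4π/3) for the (1, 0) basis, and the sum gets close to both with a modest grid. The cubemap version follows the same pattern, with one term per texel.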

    Code Details

    • Get RGB lighting information from cubemap
    Eigen::Array3f Le(images[i][index + 0], images[i][index + 1],
                      images[i][index + 2]);

    The value of channel is 3, corresponding to the three channels of RGB. Therefore, index points to the position of the red channel of a pixel, index + 1 points to the position of the green channel, and index + 2 points to the position of the blue channel.

    • Convert direction vector to spherical coordinates
    double theta = acos(dir.z());
    double phi = atan2(dir.y(), dir.x());

    theta is the angle from the positive z-axis to the direction dir, and phi is the angle from the positive x-axis to the projection of dir onto the xy plane.
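A quick round trip makes the convention explicit. This Python sketch assumes `dir` is a unit vector, as in the code above; the clamp before `acos` is an extra guard against rounding and is not part of the original snippet.

```python
import math

def to_spherical(d):
    """Unit direction -> (theta, phi): theta measured from +z, phi measured from +x in the xy plane."""
    x, y, z = d
    theta = math.acos(max(-1.0, min(1.0, z)))
    phi = math.atan2(y, x)
    return theta, phi

def from_spherical(theta, phi):
    s = math.sin(theta)
    return (s * math.cos(phi), s * math.sin(phi), math.cos(theta))
```

Note that `atan2` returns phi in (-π, π] rather than [0, 2π). That is harmless when the angle is only fed into sine and cosine terms, which are periodic.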

    • Traversing the basis functions of spherical harmonics
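    for (int l = 0; l <= SHOrder; l++){
        for (int m = -l; m <= l; m++){
            // Value of the basis function (l, m) in this pixel's direction
            float sh = sh::EvalSH(l, m, phi, theta);
            // Area on the sphere covered by this pixel, used as the weight
            float delta = CalcArea((float)x, (float)y, width, height);
            // l*(l+1)+m flattens (l, m) into a single array index
            SHCoeffiecents[l*(l+1)+m] += Le * sh * delta;
        }
    }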
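    The body of CalcArea is not shown in this article. One common way to compute the area a cubemap texel covers on the unit sphere is the closed-form solid angle of a rectangle on the plane z = 1. The sketch below assumes that formulation; AreaElement and TexelSolidAngle are hypothetical names and may differ from what the framework does.

```cpp
#include <cassert>
#include <cmath>

// Signed solid angle of the rectangle spanned by (0, 0) and (x, y) on the
// plane z = 1, as seen from the origin.
double AreaElement(double x, double y) {
    return std::atan2(x * y, std::sqrt(x * x + y * y + 1.0));
}

// Solid angle subtended by texel (px, py) of a width x height cube face.
double TexelSolidAngle(int px, int py, int width, int height) {
    assert(width > 0 && height > 0);
    // Texel center in [-1, 1] face coordinates.
    const double u = 2.0 * (px + 0.5) / width - 1.0;
    const double v = 2.0 * (py + 0.5) / height - 1.0;
    // Half the texel size in face coordinates.
    const double halfW = 1.0 / width;
    const double halfH = 1.0 / height;
    const double x0 = u - halfW, x1 = u + halfW;
    const double y0 = v - halfH, y1 = v + halfH;
    // Inclusion-exclusion over the four corners of the texel.
    return AreaElement(x0, y0) - AreaElement(x0, y1)
         - AreaElement(x1, y0) + AreaElement(x1, y1);
}
```

    Summing the result over every texel of one face should give 4π/6, one sixth of the full sphere, and texels near the face center cover more solid angle than texels near the corners.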

    Unshadowed diffuse term

    scene->getIntegrator()->preprocess(scene); computes the Diffuse Unshadowed term. We simplify the rendering equation and substitute the spherical harmonic expansion from the previous section to compute the spherical harmonic projection coefficients of the BRDF (transfer) term. The key function is ProjectFunction: we need to write a lambda expression for it that computes the transfer function term.


    For the diffuse transfer term, there are three cases to consider: unshadowed, shadowed, and interreflected.

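    Let's first consider the simplest case, without shadows. We start from the rendering equation

    $$L(x,\omega_o) = \int_S f_r(x,\omega_o,\omega_i)\, L_i(x,\omega_i)\, H(x,\omega_i)\, d\omega_i$$

    where f_r is the BRDF and:

    • L_i is the incident radiance.
    • H is a geometric function; it describes how the microscopic properties of the surface relate to the direction of the incident light.
    • ω_i is the incident light direction.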

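    For a diffuse surface that reflects equally in all directions, this simplifies to the unshadowed lighting equation

    $$L_{DU}(x) = \frac{\rho}{\pi} \int_S L_i(x,\omega_i)\, \max(N_x \cdot \omega_i, 0)\, d\omega_i$$

    where ρ is the diffuse albedo and:

    • L_DU is the diffuse outgoing radiance at the point x.
    • N_x is the surface normal.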

    The incident radiance and transfer function terms are independent of each other, as the former represents the contribution of the light sources in the scene, and the latter represents how the surface responds to the incident light. Therefore, these two components are treated independently.

    Specifically, when approximating with spherical harmonics, we expand the two terms separately. The input of the former is the incident direction of the light, and the input of the latter is the reflected (outgoing) direction. Each expansion yields an array of coefficients, which we store in a precomputed look-up table (LUT).
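    Because the spherical harmonic basis is orthonormal, the integral of the product of two expanded functions reduces to a dot product of their coefficient arrays. The sketch below shows that reconstruction step for one vertex. The layout is an assumption for illustration (one RGB triple of light coefficients per basis function, one scalar transfer coefficient per basis function), and RGB and ShadeVertex are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using RGB = std::array<double, 3>;

// Approximate the outgoing radiance at one vertex as the dot product of the
// light coefficients and that vertex's transfer coefficients.
RGB ShadeVertex(const std::vector<RGB>& lightCoeffs,
                const std::vector<double>& transferCoeffs) {
    assert(lightCoeffs.size() == transferCoeffs.size());
    RGB out = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < lightCoeffs.size(); ++i) {
        for (int c = 0; c < 3; ++c) {
            out[c] += lightCoeffs[i][c] * transferCoeffs[i];
        }
    }
    return out;
}
```

    With both arrays precomputed, shading a vertex costs one multiply-add per basis function and color channel, which is why the two expansions are stored as tables.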

    auto shCoeff = sh::ProjectFunction(SHOrder, shFunc, m_SampleCount);

    The most important call here is ProjectFunction. We need to write a lambda expression (shFunc) and pass it to this function as a parameter; the lambda computes the transfer function term.

    ProjectFunction takes the following parameters:

    • The spherical harmonic order
    • The function to project onto the basis functions (the lambda we need to write)
    • The number of samples

    For each sample, this function takes the value returned by the lambda and projects it onto the basis functions to get a set of coefficients. It then sums the coefficients over all samples and multiplies the sums by a weight to get the final coefficients for the vertex.
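    The projection described above can be sketched as a Monte Carlo estimate. This is not the library's implementation: it assumes uniform sampling over the sphere (so every sample has weight 4π / sampleCount), hard-codes the real basis up to l = 1 in one common sign convention (conventions differ between libraries), and EvalBasis and ProjectMonteCarlo are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

// Real SH basis up to l = 1, indexed as l*(l+1)+m. Hard-coded for illustration.
double EvalBasis(int index, double theta, double phi) {
    const double s = std::sin(theta);
    switch (index) {
        case 0:  return 0.282095;                      // l=0, m=0
        case 1:  return 0.488603 * s * std::sin(phi);  // l=1, m=-1 (y)
        case 2:  return 0.488603 * std::cos(theta);    // l=1, m=0  (z)
        default: return 0.488603 * s * std::cos(phi);  // l=1, m=1  (x)
    }
}

// Monte Carlo projection: draw uniform directions on the sphere, accumulate
// func * basis for each sample, then scale the sums by 4*pi / sampleCount.
std::vector<double> ProjectMonteCarlo(
        const std::function<double(double, double)>& func,
        int sampleCount, unsigned seed) {
    assert(sampleCount > 0);
    const double kPi = 3.14159265358979323846;
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<double> coeffs(4, 0.0);
    for (int s = 0; s < sampleCount; ++s) {
        // Uniform sphere sampling: cos(theta) uniform in (-1, 1], phi in [0, 2*pi).
        const double theta = std::acos(1.0 - 2.0 * uni(rng));
        const double phi = 2.0 * kPi * uni(rng);
        const double value = func(theta, phi);
        for (int i = 0; i < 4; ++i) {
            coeffs[i] += value * EvalBasis(i, theta, phi);
        }
    }
    // Each sample carries weight 1/pdf = 4*pi; divide by the sample count.
    const double weight = 4.0 * kPi / sampleCount;
    for (double& c : coeffs) c *= weight;
    return coeffs;
}
```

    Passing a lambda that returns max(cos θ, 0) / π, the unshadowed transfer term for a normal along +z, should give an l = 1, m = 0 coefficient near 0.488603 · 2/3, with the usual Monte Carlo noise that shrinks as the sample count grows.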

    Complete code

    Compute the geometric terms, i.e. the transfer function terms.

    // Cosine between the sample direction and the normal, divided by pi
    double H = wi.normalized().dot(n.normalized()) / M_PI;
    if (m_Type == Type::Unshadowed){
        // TODO: here you need to calculate unshadowed transport term of a given direction
        // Directions below the surface contribute nothing
        return (H > 0.0) ? H : 0.0;
    }

    In short, remember to divide the final integral result by π before passing it to m_TransportSHCoeffs.

    Shadowed Diffuse Term

    scene->getIntegrator()->preprocess(scene); computes the Diffuse Shadowed term, which adds a visibility term to the unshadowed case.

    $$L_{DS}(x) = \frac{\rho}{\pi} \int_S L_i(x,\omega_i)\, V(\omega_i)\, \max(N_x \cdot \omega_i, 0)\, d\omega_i$$

    The visibility term V(ω_i) is either 1 or 0. The bool rayIntersect(const Ray3f &ray) function casts a ray from the vertex position along the sampling direction. If the ray hits an object, the direction is considered blocked (in shadow) and 0 is returned; if the ray hits nothing, the visibility is 1 and the unshadowed value is returned unchanged.
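    The effect of the visibility term can be isolated in a small, self-contained sketch. The scene query is replaced by a hypothetical occluded callback that returns true when a ray along the sample direction hits geometry; Vec3, Dot and ShadowedTransfer are illustrative names, not part of nori.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct Vec3 { double x, y, z; };

double Dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Shadowed diffuse transfer value for one sample direction wi at a point with
// normal n (both assumed to be unit length). `occluded` is a stand-in for the
// scene's ray query.
double ShadowedTransfer(const Vec3& n, const Vec3& wi,
                        const std::function<bool(const Vec3&)>& occluded) {
    assert(std::fabs(Dot(n, n) - 1.0) < 1e-6);
    assert(std::fabs(Dot(wi, wi) - 1.0) < 1e-6);
    const double kPi = 3.14159265358979323846;
    const double cosine = Dot(n, wi);
    if (cosine <= 0.0) return 0.0;  // direction is below the surface
    const double visibility = occluded(wi) ? 0.0 : 1.0;
    return visibility * cosine / kPi;
}
```

    Checking the cosine before querying occlusion skips the ray test for directions that cannot contribute, which matches the order of the conditions in the framework code.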

    Complete code

    double H = wi.normalized().dot(n.normalized()) / M_PI;
    // TODO: here you need to calculate shadowed transport term of a given direction
    // Keep H only when the direction is above the surface and the ray is not blocked
    if (H > 0.0 && !scene->rayIntersect(Ray3f(v, wi.normalized())))
        return H;
    return 0.0;

    In short, remember to divide the final integral result by π before passing it to m_TransportSHCoeffs.

    Export calculation results

    The nori framework will generate two pre-calculated result files.

    Add run parameters:


    In prt.xml, make the following changes. You can choose which environment-light cubemap to render; the model, camera parameters, and so on can also be modified as needed.

    <!-- Render the visible surface normals -->
    <integrator type="prt">
        <string name="type" value="unshadowed" />
        <integer name="bounce" value="1" />
        <integer name="PRTSampleCount" value="100" />
    <!--        <string name="cubemap" value="cubemap/GraceCathedral" />-->
    <!--        <string name="cubemap" value="cubemap/Indoor" />-->
    <!--        <string name="cubemap" value="cubemap/Skybox" />-->
        <string name="cubemap" value="cubemap/CornellBox" />
    </integrator>

    The available values for each tag are:

    • type: unshadowed, shadowed, interreflection
    • bounce: the number of light bounces for the interreflection type (not yet implemented)
    • PRTSampleCount: the number of samples per vertex used for the transfer term
    • cubemap: cubemap/GraceCathedral, cubemap/Indoor, cubemap/Skybox, cubemap/CornellBox

    The pictures above show the unshadowed rendering results for GraceCathedral, Indoor, Skybox and CornellBox, with a sample count of 1.

    Shading using spherical harmonics

    Manually drag the files generated by nori into the real-time rendering framework and make some changes to the real-time framework.

    After the calculation in the previous chapter is completed, copy the light.txt and transport.txt in the corresponding cubemap path to the cubemap folder of the real-time rendering framework.

    Precomputed data analysis

    Uncomment lines 88-114 in engine.js. This code parses the txt files that were just added.

    // engine.js
    // file parsing
    ... // Uncomment this code

    Import model/create and use PRT material shader

    Create the file PRTMaterial.js in the materials folder.

    class PRTMaterial extends Material {
        constructor(vertexShader, fragmentShader) {
            super({
                'uPrecomputeL[0]': { type: 'precomputeL', value: null},
                'uPrecomputeL[1]': { type: 'precomputeL', value: null},
                'uPrecomputeL[2]': { type: 'precomputeL', value: null},
            },
            // (any arguments between the uniforms and the shaders are not shown in this excerpt)
            vertexShader, fragmentShader, null);
        }
    }

    async function buildPRTMaterial(vertexPath, fragmentPath) {
        let vertexShader = await getShaderString(vertexPath);
        let fragmentShader = await getShaderString(fragmentPath);
        return new PRTMaterial(vertexShader, fragmentShader);
    }

    Then import it in index.html.

