Tag: Unity

  • Unity Tessellation in Detail

    A detailed breakdown of surface subdivision (tessellation) in Unity

    Tags: Getting Started/Shader/Tessellation Shader/Displacement Map/LOD/Smooth Outline/Early Culling

    The word tessellation refers to a broad category of design activities, usually involving the arrangement of tiles of various geometric shapes next to each other to form a pattern on a flat surface. Its purpose can be artistic or practical, and many examples date back thousands of years. — Tessellation, Wikipedia, accessed July 2020.


    This article mainly refers to:

    https://nedmakesgames.medium.com/mastering-tessellation-shaders-and-their-many-uses-in-unity-9caeb760150e

    In game development, tessellation is generally performed on a flat triangle (or quad) patch, after which the new vertices are displaced, either with a displacement map or with the Phong or PN-triangles subdivision schemes implemented in this article.

    Phong subdivision does not need any information about the adjacent topology; it only uses interpolation, which makes it cheaper than algorithms such as PN triangles. The Loop and Schaefer method mentioned in GAMES101 approximates Catmull-Clark surfaces with low-degree quadrilateral patches; methods of that kind replace the input polygons with a polynomial surface. The Phong subdivision used in this article requires no extra work to correct additional geometric regions.

    1. Overview of the tessellation process

    This chapter introduces the process of surface subdivision in the rendering pipeline.

    The tessellation shader sits after the vertex shader, and tessellation itself is divided into three steps: Hull, Tessellator and Domain, of which the Tessellator is not programmable.

    The first step of tessellation is the Hull Stage (also known as the Tessellation Control Shader, TCS), which outputs control points and tessellation factors. This stage consists of two functions that run in parallel: the Hull Function and the Patch Constant Function.

    Both functions receive a patch, which is a set of vertex indices; a triangle, for example, is represented by three vertex indices. One patch forms one primitive, so a triangle patch is made up of three vertex indices.

    Moreover, the Hull Function is executed once per control point (vertex), and the Patch Constant Function is executed once per patch. The former outputs the modified control-point data (usually the vertex position, possibly normals, texture coordinates and other attributes), while the latter outputs constant data for the whole patch, namely the tessellation factors. The tessellation factors tell the next stage (the tessellator) how finely to subdivide each patch.

    In general, the Hull Function modifies each control point, while the Patch Constant Function determines the level of subdivision based on the distance from the camera.

    Next comes the non-programmable stage, the tessellator. It receives the patch and the tessellation factors just computed, and generates a barycentric coordinate for each new vertex.

    Next comes the last step, the Domain Stage (also known as the Tessellation Evaluation Shader, TES), which is programmable. It consists of the Domain Function, which is executed once per generated vertex. It receives the barycentric coordinates together with the outputs of the two functions of the Hull Stage. Most of the logic is written here; most importantly, this is where vertices can be repositioned, which is the whole point of tessellation.

    If there is a geometry shader, it will be executed after the Domain Stage. But if not, it will come to the rasterization stage.

    In summary, the first thing is the vertex shader. The Hull stage accepts vertex data and decides how to subdivide the mesh. Then the tessellator stage processes the subdivided mesh, and finally the Domain stage outputs vertices for the fragment shader.

    2. Surface subdivision analysis

    This chapter contains code analysis of Unity's surface subdivision, practical example effects display and an overview of the underlying principles.

    2.1 Key code analysis

    2.1.1 Basic settings of Unity tessellation

    First of all, the tessellation shader needs to use shader target 5.0.

    HLSLPROGRAM
    #pragma target 5.0 // 5.0 required for tessellation
    
    #pragma vertex Vertex
    #pragma hull Hull
    #pragma domain Domain
    #pragma fragment Fragment
    
    ENDHLSL

    2.1.2 Hull Stage Code 1 – Hull Function

    In the classic setup, the vertex shader converts the position and normal into world space and passes the result to the Hull Stage. Note that, unlike in the vertex shader, the vertex position in the Hull Stage uses the INTERNALTESSPOS semantic instead of POSITION. The reason is that the Hull Stage does not output these positions to the rest of the pipeline; they only feed its internal tessellation algorithm, so they may be converted into a coordinate system better suited to tessellation. It also lets developers tell the two apart more clearly.

    struct Attributes {
        float3 positionOS : POSITION;
        float3 normalOS : NORMAL;
        UNITY_VERTEX_INPUT_INSTANCE_ID
    };
    
    struct TessellationControlPoint {
        float3 positionWS : INTERNALTESSPOS;
        float3 normalWS : NORMAL;
        UNITY_VERTEX_INPUT_INSTANCE_ID
    };
    
    TessellationControlPoint Vertex(Attributes input) {
        TessellationControlPoint output;
    
        UNITY_SETUP_INSTANCE_ID(input);
        UNITY_TRANSFER_INSTANCE_ID(input, output);
    
        VertexPositionInputs posnInputs = GetVertexPositionInputs(input.positionOS);
        VertexNormalInputs normalInputs = GetVertexNormalInputs(input.normalOS);
    
        output.positionWS = posnInputs.positionWS;
        output.normalWS = normalInputs.normalWS;
        return output;
    }

    Below are some setting parameters for the Hull Shader.

    The first line, domain, defines the domain type of the tessellation shader; here both the input and output are triangle primitives. The options are tri (triangle), quad (quadrilateral), etc.

    The second line, outputcontrolpoints, is the number of output control points; 3 corresponds to the three vertices of a triangle.

    The third line, outputtopology, is the topology of the primitives produced by subdivision. triangle_cw means the output triangles are wound clockwise; the correct winding ensures the surface faces outward. The options are triangle_cw (clockwise triangle), triangle_ccw (counterclockwise triangle) and line (line segment).

    The fourth line, patchconstantfunc, names the other function of the Hull Stage, which outputs constant data such as the tessellation factors. It is executed once per patch.

    The fifth line, partitioning, specifies how the additional vertices are distributed along the edges of the original patch primitive, which makes the subdivision smoother and more uniform. The options are integer, fractional_even and fractional_odd.

    The sixth line, maxtessfactor, is the maximum tessellation factor. Limiting it keeps the rendering cost under control.

    [domain("tri")]
    [outputcontrolpoints(3)]
    [outputtopology("triangle_cw")]
    [patchconstantfunc("PatchConstantFunction")]
    [partitioning("fractional_even")]
    [maxtessfactor(64.0)]

    In the Hull Shader, each control point is processed by its own invocation, so the function runs once per control point. To know which vertex is currently being processed, we use the id parameter with the SV_OutputControlPointID semantic. The function is also passed a special structure that lets us access any control point of the patch as if it were an array.

    TessellationControlPoint Hull(
        InputPatch<TessellationControlPoint, 3> patch, // The input patch of three control points
        uint id : SV_OutputControlPointID) {           // Index of the control point being processed
        // Hull shader code here; for now, pass the control point through unchanged
        return patch[id];
    }

    2.1.3 Hull Stage Code 2 – Patch Constant Function

    In addition to the Hull Shader, there is another function in the Hull Stage that runs in parallel, the patch constant function. The signature of this function is relatively simple. It inputs a patch and outputs the calculated subdivision factor. The output structure contains the tessellation factor specified for each edge of the triangle. These factors are identified by the special system value semantics SV_TessFactor. Each tessellation factor defines how many small segments the corresponding edge should be subdivided into, thereby affecting the density and details of the resulting mesh. Let's take a closer look at what this factor specifically contains.

    struct TessellationFactors {
        float edge[3] : SV_TessFactor;
        float inside : SV_InsideTessFactor;
    };
    // The patch constant function runs once per triangle, or "patch"
    // It runs in parallel to the hull function
    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        UNITY_SETUP_INSTANCE_ID(patch[0]); // Set up instancing
        //Calculate tessellation factors
        TessellationFactors f;
        f.edge[0] = _FactorEdge1.x;
        f.edge[1] = _FactorEdge1.y;
        f.edge[2] = _FactorEdge1.z;
        f.inside = _FactorInside;
        return f;
    }

    First, there is an edge tessellation factor edge[3] in the TessellationFactors structure, marked as SV_TessFactor. When using triangles as the basic primitives for tessellation, each edge is defined as being located relative to the vertex with the same index. Specifically: edge 0 corresponds to vertex 1 and vertex 2. Edge 1 corresponds to vertex 2 and vertex 0. Edge 2 corresponds to vertex 0 and vertex 1. Why is this so? The intuitive explanation is that the index of the edge is the same as the index of the vertex it is not connected to. This helps to quickly identify and process the edges corresponding to specific vertices when writing shader code.
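
    To make the convention concrete, here is a minimal sketch (my own illustration, not from the referenced article) of how the two endpoints of edge i can be looked up inside the Patch Constant Function, where the patch variable is available; the helper variable names are hypothetical:

    // Edge i is "opposite" vertex i, so its endpoints are the other two vertices.
    float3 edgeLengths;
    [unroll] for (uint i = 0; i < 3; i++) {
        float3 a = patch[(i + 1) % 3].positionWS; // first endpoint of edge i
        float3 b = patch[(i + 2) % 3].positionWS; // second endpoint of edge i
        edgeLengths[i] = distance(a, b);          // world-space length of edge i
    }
    // edgeLengths[0] now describes the edge between vertices 1 and 2, and so on,
    // which is exactly the pairing used for f.edge[0..2] later in this article.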

    There is also a center tessellation factor inside labeled SV_InsideTessFactor. This factor directly changes the final tessellation pattern, and more essentially determines the number of edge subdivisions, which is used to control the subdivision density inside the triangle. Compared with the edge subdivision factor, the center tessellation factor controls how the inside of the triangle is further subdivided into smaller triangles, while the edge tessellation factor affects the number of edge subdivisions.

    Patch Constant Function can also output other useful data, but it must be labeled with the correct semantics. For example, BEZIERPOS semantics is very useful and can represent float3 data. This semantics will be used later to output the control points of the smoothing algorithm based on the Bezier curve.

    2.1.4 Domain Stage Code

    Next, we enter the Domain Stage. The Domain Function also has a Domain property, which should be the same as the output topology type of the Hull Function. In this example, it is set to a triangle. This function inputs the patch from the Hull Function, the output of the Patch Constant Function, and the most important vertex barycentric coordinates. The output structure is very similar to the output structure of the vertex shader, containing the position of the Clip space, as well as the lighting data required by the fragment shader.

    It doesn’t matter if you don’t know what it is for now. Just read Chapter 4 of this article and then come back to study it.

    Simply put, each new vertex that is subdivided will run this domain function.

    struct Interpolators {
        float3 normalWS                 : TEXCOORD0;
        float3 positionWS               : TEXCOORD1;
        float4 positionCS               : SV_POSITION;
    };
    
    // Call this macro to interpolate between a triangle patch, passing the field name
    #define BARYCENTRIC_INTERPOLATE(fieldName) \
            patch[0].fieldName * barycentricCoordinates.x + \
            patch[1].fieldName * barycentricCoordinates.y + \
            patch[2].fieldName * barycentricCoordinates.z
    
    // The domain function runs once per vertex in the final, tessellated mesh
    // Use it to reposition vertices and prepare for the fragment stage
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
    
        Interpolators output;
    
        // Setup instancing and stereo support (for VR)
        UNITY_SETUP_INSTANCE_ID(patch[0]);
        UNITY_TRANSFER_INSTANCE_ID(patch[0], output);
        UNITY_INITIALIZE_VERTEX_OUTPUT_STEREO(output);
    
        float3 positionWS = BARYCENTRIC_INTERPOLATE(positionWS);
        float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);
    
        output.positionCS = TransformWorldToHClip(positionWS);
        output.normalWS = normalWS;
        output.positionWS = positionWS;
    
        return output;
    }

    In this function, Unity gives us the tessellation factors, the three control points of the patch, and the barycentric coordinates of the current new vertex. We can use this data to do displacement and similar processing; see the hedged sketch below.
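
    As a concrete (hypothetical) example of such processing, a displacement-map sketch could look like the following; the _HeightMap texture, _DisplacementStrength property and the interpolated uv are assumptions that are not part of the code above:

    // Hypothetical displacement sketch, assuming a UV coordinate has been added to the
    // control point struct and interpolated with BARYCENTRIC_INTERPOLATE like the other fields.
    TEXTURE2D(_HeightMap); SAMPLER(sampler_HeightMap);
    float _DisplacementStrength;

    // Inside Domain(), after interpolating positionWS, normalWS and uv:
    //     float height = SAMPLE_TEXTURE2D_LOD(_HeightMap, sampler_HeightMap, uv, 0).r;
    //     positionWS += normalWS * height * _DisplacementStrength; // push the new vertex along its normal
    //     output.positionCS = TransformWorldToHClip(positionWS);   // then transform as before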

    2.2 Detailed explanation of subdivision factors and division modes

    Copy the code from the linked source, then create the corresponding material and turn on wireframe mode. We have only output the mesh's vertices and applied no shading in the fragment shader, so it looks transparent.

    If any component of the Edge Factors is set to 0 or below, the mesh disappears completely. The figure below shows what that looks like (with the Unity editor's object outline enabled). This behavior is very important.

    2.2.1 Overview of subdivision factors

    To put it bluntly, once these factors (the edge factors and the inside factor) have been set in the Hull Stage, the tessellator stage simply and crudely turns them into barycentric coordinates for the new vertices. (That is assuming the domain is tri; for quad it is computed with uv instead, which may be more complicated, I don't know.) This simple, crude stage is not programmable.

    Take "integer (uniform) cutting mode" as an example. (temporarily) [partitioning("integer")] The domain is all triangles [domain("tri")] The number of output vertices is also 3. [outputcontrolpoints(3)] And the output topology is a triangle clockwise. [outputtopology("triangle_cw")]

    2.2.2 Preparatory work and potential parallel issues

    Modify the code to the following:

    // .shader
    _FactorEdge1("[Float3]Edge factors,[Float]Inside factor", Vector) = (1, 1, 1, 1) // -- Edited -- 
    
    // .hlsl
    float4 _FactorEdge1; // -- Edited -- 
    ...
    f.edge[0] = _FactorEdge1.x;
    f.edge[1] = _FactorEdge1.y; // -- Edited -- 
    f.edge[2] = _FactorEdge1.z; // -- Edited -- 
    f.inside = _FactorEdge1.w; // -- Edited --

    There may be a problem here. Sometimes the compiler will split the Patch Constant Function and calculate each factor in parallel, which may cause some factors to be deleted, and the factors may be inexplicably equal to 0. The solution is to pack these factors into a vector so that the compiler will not use undefined quantities. The following is a simple reproduction of what may happen.

    Modify the Path Constant Function as follows and open two new properties in the panel.

    The modified code lines are marked with a // -- Edited -- comment.

    // The patch constant function runs once per triangle, or "patch"
    // It runs in parallel to the hull function
    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        UNITY_SETUP_INSTANCE_ID(patch[0]); // Set up instancing
        // Calculate tessellation factors
        TessellationFactors f;
        f.edge[0] = _FactorEdge1.x;
        f.edge[1] = _FactorEdge2; // -- Edited --
        f.edge[2] = _FactorEdge3; // -- Edited --
        f.inside = _FactorInside;
        return f;
    }

    // .shader
    _FactorEdge2("Edge 2 factor", Float) = 1 // -- Edited --
    _FactorEdge3("Edge 3 factor", Float) = 1 // -- Edited --

    2.2.3 Edge Factor – SV_TessFactor

    It can be seen that each edge factor roughly corresponds to the number of segments the corresponding edge is split into, and the inside factor corresponds to the complexity of the center.

    The edge factors only affect the original triangle edges; the more complex internal pattern is controlled by the Inside Factor and the partitioning mode.

    It should be noted that the surface subdivision in "integer cutting mode" is rounded up, for example, 2.1 is rounded up to 3.

    One picture says it all.

    2.2.4 Inside Factor – SV_InsideTessFactor

    Let's take the INTEGER mode as an example. The inside factor only affects the complexity of the internal pattern; its specific influence is described in detail below. To summarize: the edge factors control the subdivision between the outermost ring and the first inner layer, the inside factor controls how many layers there are, and the partitioning mode controls how each inner layer is subdivided.

    Assuming the Edge Factors are set to (2, 3, 4) and only the Inside Factor is modified, an interesting property can be observed: when the inside factor n is even, one vertex lands exactly on the centroid, with barycentric coordinates (1/3, 1/3, 1/3).

    Generally, it is good to set the edge factors to the same value. Here, different values are set, and the graph may be more confusing, but the most essential rules can be seen.

    It can be further observed that the number of vertices on each edge of the layer closest to the outermost triangle is tied to the Inside Factor $n$: that edge always has $n - 1$ vertices ($N_{point} = n - 1$).

    Moving inward, the number of vertices per edge decreases by 2 with each layer. That is, if the first layer (not counting the outermost layer, which is not subdivided further) has $m$ vertices per edge, the second layer inward has $m - 2$, and so on.

    Combining the above three observations, we can make a guess and draw a conclusion (it's useless, but I worked it out when I had nothing to do): the total number of internal vertices can be computed with the following formulas, where the index $k$ is the inside factor minus 1 (note that the inside factor starts at 2): $a_{2n} = 3n^2$ and $a_{2n-1} = 3n(n-1) + 1$. These can be simplified and combined into $a_k = -0.125(-1)^k + 0.75k^2 + 0.125$, or, in pure integer arithmetic, $a_k = \left\lfloor \frac{-(-1)^k + 6k^2 + 1}{8} \right\rfloor$.
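
    As a quick sanity check of that closed form (my own arithmetic, not part of the referenced article), an inside factor of 3 corresponds to $k = 2$ and an inside factor of 4 to $k = 3$:

    $$
    a_2 = 3 \cdot 1^2 = 3 = \left\lfloor \frac{-(-1)^2 + 6 \cdot 2^2 + 1}{8} \right\rfloor, \qquad
    a_3 = 3 \cdot 2 \cdot 1 + 1 = 7 = \left\lfloor \frac{-(-1)^3 + 6 \cdot 3^2 + 1}{8} \right\rfloor
    $$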

    2.2.5 Partitioning Mode – [partitioning(“_”)]

    The above only covers the simplest mode, integer, which subdivides uniformly by integer multiples. Now for the other modes. Simply put, Fractional Odd and Fractional Even are upgraded versions of Integer: the former behaves like Integer at odd factors and the latter at even factors. The upgrade is that the fractional part of the factor is used, so the subdivisions are no longer equal in size.

    Fractional Odd: the Inside Factor can be fractional (it is not ceiled), and the denominator is an odd number. Note that the denominator here is really the denominator of each vertex's barycentric coordinates. With an odd denominator one vertex always falls on the triangle's centroid; with an even one it never does.

    Gif

    Fractional Even: Similar to fractional_odd, but with an even denominator. I'm not sure how to choose this.

    Gif

    Pow2 (power of 2): This mode only allows the use of powers of 2 (such as 1, 2, 4, 8, etc.) as subdivision levels. Generally used for texture mapping or shadow calculations.

    3. Segment Optimization

    3.1 View Frustum Culling

    Generating this many vertices is very bad for performance! So we need some ways to improve rendering efficiency. Although vertices outside the frustum are culled before rasterization anyway, culling unnecessary patches early in the TCS reduces the load on the tessellation shader.

    If the tessellation factor is set to 0 in the Patch Constant Function, the tessellation generator will ignore the patch, which means that the culling here is for the entire patch, rather than the vertex-by-vertex culling in the frustum culling.

    We test every point in the patch to see if they are out of view. To do this, transform every point in the patch into clip space. So we need to calculate the clip space coordinates of each point in the vertex shader and pass it to the Hull Stage. Use GetVertexPositionInputs to get what we want.

    struct TessellationControlPoint {
        float4 positionCS : SV_POSITION; // -- Edited -- 
        ...
    };
    
    TessellationControlPoint Vertex(Attributes input) {
        TessellationControlPoint output;
        ...
        VertexPositionInputs posnInputs = GetVertexPositionInputs(input.positionOS);
        ...
        output.positionCS = posnInputs.positionCS; // -- Edited -- 
        ...
        return output;
    }

    Then write a test function above the Patch Constant Function to determine whether to cull the patch. Temporarily pass false here. The function passes in three points in the clipping space.

    // Returns true if it should be clipped due to frustum or winding culling
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        return false;
    }

    Then write the IsOutOfBounds function to test whether a point is outside the bounds. The bounds can also be specified, and this method can be used in another function to determine whether a point is outside the view frustum.

    // Returns true if the point is outside the bounds set by lower and higher
    bool IsOutOfBounds(float3 p, float3 lower, float3 higher) {
        return p.x < lower.x || p.x > higher.x || p.y < lower.y || p.y > higher.y || p.z < lower.z || p.z > higher.z;
    }
    
    // Returns true if the given vertex is outside the camera frustum and should be culled
    bool IsPointOutOfFrustum(float4 positionCS) {
        float3 culling = positionCS.xyz;
        float w = positionCS.w;
        // UNITY_RAW_FAR_CLIP_VALUE is either 0 or 1, depending on graphics API
        // Most use 0, however OpenGL uses 1
        float3 lowerBounds = float3(-w, -w, -w * UNITY_RAW_FAR_CLIP_VALUE);
        float3 higherBounds = float3(w, w, w);
        return IsOutOfBounds(culling, lowerBounds, higherBounds);
    }

    In clip space, the w component is the homogeneous coordinate used to decide whether a point is inside the view frustum: if any of x, y, z falls outside the range [-w, w], the point is outside the frustum and will be culled. Different graphics APIs handle depth differently, so we need to pay attention when using this component as a boundary. DirectX and Vulkan use a clip-space depth range of [0, 1] (and Unity uses reversed Z on them), so UNITY_RAW_FAR_CLIP_VALUE is 0; OpenGL uses a depth range of [-1, 1], so UNITY_RAW_FAR_CLIP_VALUE is 1.

    After preparing these, you can determine whether a patch needs to be culled. Go back to the function at the beginning and determine whether all the points of a patch need to be culled.

    // Returns true if it should be clipped due to frustum or winding culling
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
            IsPointOutOfFrustum(p1PositionCS) &&
            IsPointOutOfFrustum(p2PositionCS); // -- Edited -- 
        return allOutside; // -- Edited -- 
    }

    3.2 Backface Culling

    In addition to frustum culling, patches can also undergo backface culling, using the normal vector to determine whether a patch needs to be culled.


    The normal vector is obtained by taking the cross product of two vectors. Since we are currently in Clip space, we need to do a perspective division to get NDC, which should be in the range of [-1,1]. The reason for converting to NDC is that the position in Clip space is nonlinear, which may cause the position of the vertex to be distorted. Converting to a linear space like NDC can more accurately determine the front and back relationship of the vertices.

    // Returns true if the points in this triangle are wound counter-clockwise
    bool ShouldBackFaceCull(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        float3 point0 = p0PositionCS.xyz / p0PositionCS.w;
        float3 point1 = p1PositionCS.xyz / p1PositionCS.w;
        float3 point2 = p2PositionCS.xyz / p2PositionCS.w;
        float3 normal = cross(point1 - point0, point2 - point0);
        return dot(normal, float3(0, 0, 1)) < 0;
    }

    The above code still has a cross-platform problem. The viewing direction is different in different APIs, so modify the code.

    // In clip space, the view direction is float3(0, 0, 1), so we can just test the z coord
    #if UNITY_REVERSED_Z
        return cross(point1 - point0, point2 - point0).z < 0;
    #else // In OpenGL, the test is reversed
        return cross(point1 - point0, point2 - point0).z > 0;
    #endif

    Finally, add the function you just wrote to ShouldClipPatch to determine backface culling.

    // Returns true if it should be clipped due to frustum or winding culling
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
            IsPointOutOfFrustum(p1PositionCS) &&
            IsPointOutOfFrustum(p2PositionCS);
        return allOutside || ShouldBackFaceCull(p0PositionCS, p1PositionCS, p2PositionCS); // -- Edited -- 
    }

    Then set the vertex factor of the patch to be culled to 0 in PatchConstantFunction.

    ...
    if (ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS)) {
            f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0; // Cull the patch
    }
    ...

    3.3 Increase Tolerance

    You may want to verify the correctness of the code, or you may see some patches being culled unexpectedly. In either case, adding a tolerance is a flexible way to handle it.

    The first is the frustum culling tolerance. If the tolerance is positive, the culling boundaries will be expanded so that some objects near the edge of the frustum will not be culled even if they are partially out of bounds. This method can reduce the frequent changes in culling state due to small perspective changes or object dynamics.

    // Returns true if the given vertex is outside the camera frustum and should be culled
    bool IsPointOutOfFrustum(float4 positionCS, float tolerance) {
        float3 culling = positionCS.xyz;
        float w = positionCS.w;
        // UNITY_RAW_FAR_CLIP_VALUE is either 0 or 1, depending on graphics API
        // Most use 0, however OpenGL uses 1
        float3 lowerBounds = float3(-w - tolerance, -w - tolerance, -w * UNITY_RAW_FAR_CLIP_VALUE - tolerance);
        float3 higherBounds = float3(w + tolerance, w + tolerance, w + tolerance);
        return IsOutOfBounds(culling, lowerBounds, higherBounds);
    }

    Next, backface culling is adjusted. In practice, this is done by comparing to a tolerance instead of zero to avoid issues with numerical precision. If the dot product result is less than some small positive value (the tolerance) instead of being strictly less than zero, then the primitive is considered a backface. This approach provides an additional buffer, ensuring that only explicitly backface primitives are culled.

    // Returns true if the points in this triangle are wound counter-clockwise
    bool ShouldBackFaceCull(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS, float tolerance) {
        float3 point0 = p0PositionCS.xyz / p0PositionCS.w;
        float3 point1 = p1PositionCS.xyz / p1PositionCS.w;
        float3 point2 = p2PositionCS.xyz / p2PositionCS.w;
        // In clip space, the view direction is float3(0, 0, 1), so we can just test the z coord
    #if UNITY_REVERSED_Z
        return cross(point1 - point0, point2 - point0).z < -tolerance;
    #else // In OpenGL, the test is reversed
        return cross(point1 - point0, point2 - point0).z > tolerance;
    #endif
    }

    It is possible to expose a Range in the Material Panel.

    // .shader
    Properties{
        _tolerance("_tolerance",Range(-0.002,0.001)) = 0
        ...
    }
    // .hlsl
    float _tolerance;
    ...
    // Returns true if it should be clipped due to frustum or winding culling
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        bool allOutside = IsPointOutOfFrustum(p0PositionCS, _tolerance) &&
            IsPointOutOfFrustum(p1PositionCS, _tolerance) &&
            IsPointOutOfFrustum(p2PositionCS, _tolerance); // -- Edited -- 
        return allOutside || ShouldBackFaceCull(p0PositionCS, p1PositionCS, p2PositionCS,_tolerance); // -- Edited -- 
    }

    3.4 Dynamic subdivision factor

    So far, our algorithm subdivides every face indiscriminately. However, a complex mesh may contain both large and small faces (uneven mesh area). A large face is visually prominent because of its area and needs more subdivision to keep the surface smooth and detailed; a small face contributes little visually, so its subdivision level can be reduced without much impact. Dynamically varying the factor with edge length is a common approach: give faces with longer edges a higher subdivision factor.

    Besides the size of the faces themselves, the distance between the camera and the patch can also drive the factor: objects farther from the camera can use a lower tessellation factor because they cover fewer pixels on screen. The viewing angle and gaze direction can also be taken into account, prioritizing faces that face the camera and lowering the subdivision of faces that point away or sideways.

    3.4.1 Fixed Segment Scaling

    Take the distance between two vertices: the larger the distance, the larger the subdivision factor. The scale is exposed on the material panel in the range [0, 1]. When the scale is 1, the factor is contributed directly by the distance between the two points; the closer the scale gets to 0, the larger the factor. An initial bias is also added, and finally the result is clamped to at least 1 to keep the factor valid.

    //Calculate the tessellation factor for an edge
    float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float3 p1PositionWS) {
        float factor = distance(p0PositionWS, p1PositionWS) / scale;
    
        return max(1, factor + bias);
    }

    Then modify the material panel and Patch Constant Function. Generally speaking, the average value of the edge subdivision factor is used as the internal subdivision factor, which will give a more consistent visual effect.

    // .shader
    Properties{
        ...
        _TessellationBias("_TessellationBias", Range(-1,5)) = 1
         _TessellationFactor("_TessellationFactor", Range(0,1)) = 0
    }
    
    // .hlsl
    
    f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS);
    f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS);
    f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS);
    f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;

    The degree of subdivision of fragments of different sizes will change dynamically, and the effect is as follows.

    By the way, if you find that your internal factor pattern is very strange, this may be caused by the compiler. Try to modify the internal factor code to the following to solve it.

    f.inside = ( // If the compiler doesn't play nice...
      EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS) + 
      EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS) + 
      EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS)
      ) / 3.0;

    3.4.2 Screen Space Subdivision Scaling

    Next, we need to take camera distance into account. We can directly use the screen-space distance to adjust the subdivision level, which solves the large/small-face problem and the camera-distance problem at the same time!

    Since we already have the data in Clip space, and since screen space is very similar to NDC space, we only need to convert it to NDC, that is, do a perspective division.

    float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float4 p0PositionCS, float3 p1PositionWS, float4 p1PositionCS) {
        float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) / scale;
    
        return max(1, factor + bias);
    }

    Next, pass the Clip space coordinates into the Patch Constant Function.

    f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
      patch[1].positionWS, patch[1].positionCS, patch[2].positionWS, patch[2].positionCS);
    f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
      patch[2].positionWS, patch[2].positionCS, patch[0].positionWS, patch[0].positionCS);
    f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
      patch[0].positionWS, patch[0].positionCS, patch[1].positionWS, patch[1].positionCS);
    f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;

    The current effect is quite good, and the level of subdivision changes dynamically as the camera distance (screen space distance) changes. If you use a subdivision mode other than INTEGER, you will get a more consistent effect.

    There are still some areas that can be improved, for example the unit of the scaling factor. So far we controlled it in [0, 1], which is awkward to tune. Multiplying by the screen resolution and changing the range to [0, 1080] makes it more convenient to adjust; the factor is now a ratio expressed in pixels. Then modify the material panel property accordingly.

    // .hlsl
    float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) * _ScreenParams.y / scale;
    
    // .shader
    _TessellationFactor("_TessellationFactor",Range(0,1080)) = 320

    3.4.3 Camera distance subdivision scaling

    How do we scale by camera distance? Very simply: we relate the distance between the two vertices to the distance from their midpoint to the camera position. The larger that ratio, the more screen space the edge occupies and the more subdivision it needs. (Note that the code below divides by the squared camera distance, so the factor falls off faster with distance.)

    // .hlsl
    float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float3 p1PositionWS) {
        float length = distance(p0PositionWS, p1PositionWS);
        float distanceToCamera = distance(GetCameraPositionWS(), (p0PositionWS + p1PositionWS) * 0.5);
        float factor = length / (scale * distanceToCamera * distanceToCamera);
        return max(1, factor + bias);
    }
    ...
            f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS);
            f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS);
            f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS);
    
    // .shader
    _TessellationFactor("_TessellationFactor",Range(0, 1)) = 0.02

    Note that the scaling factor is no longer in pixels but back in the original [0, 1] range: screen pixels are not meaningful in this method, so they are not used, and world-space coordinates are used again.

    The results of screen-space scaling and camera-distance scaling are similar. Typically a keyword or macro is used to switch between these dynamic-factor modes; this is left to the reader, but a minimal sketch follows below.
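
    For reference, a minimal sketch of such a switch using a local shader keyword; the keyword and property names are placeholders, and it assumes both EdgeTessellationFactor overloads from 3.4.2 and 3.4.3 are kept in the include file:

    // .shader (Properties) -- generates the _TESS_CAMERA_DISTANCE keyword
    [Toggle(_TESS_CAMERA_DISTANCE)] _TessCameraDistance("Use camera distance scaling", Float) = 0

    // .shader (inside HLSLPROGRAM)
    #pragma shader_feature_local _TESS_CAMERA_DISTANCE

    // .hlsl (Patch Constant Function) -- pick one overload per mode
    #if defined(_TESS_CAMERA_DISTANCE)
        // Camera-distance version: world-space positions only, _TessellationFactor in [0, 1]
        f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias,
            patch[1].positionWS, patch[2].positionWS);
    #else
        // Screen-space version: also needs clip-space positions, _TessellationFactor in pixels
        f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias,
            patch[1].positionWS, patch[1].positionCS, patch[2].positionWS, patch[2].positionCS);
    #endif
    // ...edge[1], edge[2] and inside are handled the same way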

    3.5 Specifying subdivision factors

    3.5.1 Vertex Storage Subdivision Factor

    In the previous sections we used different strategies to guess an appropriate subdivision factor. If we know exactly how the mesh should be subdivided, we can store a per-vertex multiplier for the factor in the mesh itself. Since the multiplier only needs a single float, one color channel is enough. The following is pseudocode; just give it a try.

    float EdgeTessellationFactor(float scale, float bias, float multiplier) {
        ...
        return max(1, (factor + bias) * multiplier);
    }
    
    ...
    // PCF()
    [unroll] for (int i = 0; i < 3; i++) {
        multipliers[i] = patch[i].color.g;
    }
    //Calculate tessellation factors
    f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, (multipliers[1] + multipliers[2]) / 2);

    3.5.2 SDF Control Surface Subdivision Factor

    It is quite cool to combine the Signed Distance Field (SDF) to control the tessellation factor. Of course, this section does not involve the generation of SDF, assuming that it can be directly obtained through the ready-made function CalculateSDFDistance.

    For a given Mesh, use CalculateSDFDistance to calculate the distance from each vertex in each patch to the shape represented by the SDF (such as a sphere). After obtaining the distance, evaluate the subdivision requirements of the patch and perform subdivision.

    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        float multipliers[3];
    
        // Loop through each vertex
        [unroll] for (int i = 0; i < 3; i++) {
            // Calculate the distance from each vertex to the SDF surface
            float sdfDistance = CalculateSDFDistance(patch[i].positionWS);
    
            // Adjust subdivision factor based on SDF distance
            if (sdfDistance < _TessellationDistanceThreshold) {
                multipliers[i] = lerp(_MinTessellationFactor, _MaxTessellationFactor, (1 - sdfDistance / _TessellationDistanceThreshold));
            } else {
                multipliers[i] = _MinTessellationFactor;
            }
        }
    
        // Calculate the final subdivision factor
        TessellationFactors f;
        f.edge[0] = max(multipliers[0], multipliers[1]);
        f.edge[1] = max(multipliers[1], multipliers[2]);
        f.edge[2] = max(multipliers[2], multipliers[0]);
        f.inside = (multipliers[0] + multipliers[1] + multipliers[2]) / 3;
    
        return f;
    }

    I don't know how to implement it specifically, so I'll try to understand it first.
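
    For what it's worth, here is a minimal stand-in for CalculateSDFDistance under the assumption that the "SDF" is just an analytic sphere; the _SDFSphereCenter and _SDFSphereRadius properties are hypothetical and not part of the article:

    // Hypothetical analytic SDF: distance from a world-space point to a sphere surface.
    float3 _SDFSphereCenter;
    float _SDFSphereRadius;

    float CalculateSDFDistance(float3 positionWS) {
        // Unsigned distance to the sphere surface; the patch constant function above only
        // compares it against _TessellationDistanceThreshold, so the sign is not needed here.
        return abs(distance(positionWS, _SDFSphereCenter) - _SDFSphereRadius);
    }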

    4. Vertex offset – contour smoothing

    The easiest way to add detail to a mesh is to use various high-resolution textures, but ultimately adding more vertices to the mesh beats endlessly increasing texture resolution. For example, a normal map can change the shading normal of each fragment, but it does not change the geometry: even a 128K texture cannot remove the aliased, pointy silhouette.

    Therefore, we need to tessellate the surface and then offset the vertices. All the tessellation operations just mentioned are operated on the plane where the patch is located. If we want to bend these vertices, one of the simplest operations is Phong tessellation.

    4.1 Phong subdivision

    First, the original paper is attached. https://perso.telecom-paristech.fr/boubek/papers/PhongTessellation/PhongTessellation.pdf

    Phong shading should be familiar to you. It is a technique that uses linear interpolation of normal vectors to obtain smooth shading. Phong subdivision is inspired by Phong shading and extends the concept of Phong shading to the spatial domain.

    The core idea of Phong subdivision is to use the vertex normals of each corner of the triangle to affect the position of new vertices during the subdivision process, thereby creating a curved surface instead of a flat surface.

    It is worth noting that many tutorials here use triangle corner to represent vertices. I think they are all the same, so I will still use vertices in this article.

    First, in the Domain Function, Unity gives us the barycentric coordinates of the new vertex to process. Suppose we are currently processing (1/3, 1/3, 1/3).

    Each vertex of a patch has a normal. Imagine a tangent plane emanating from each vertex, perpendicular to the respective normal vector.

    Then project the current vertex onto these three tangent planes respectively.

    Described in mathematical language: $P' = P - ((P - V) \cdot N)\,N$

    where:

    • $P$ is the initially interpolated plane position.
    • $V$ is a vertex position on the plane.
    • $N$ is the normal at vertex $V$.
    • $\cdot$ denotes the dot product.
    • $P'$ is the projection of $P$ onto the plane.

    Doing this for all three tangent planes gives three projected points $P'$.

    The three points projected on the three tangent planes are re-formed into a new triangle, and then the centroid coordinates of the current vertex are applied to the new triangle to calculate the new point.

    //Calculate Phong projection offset
    float3 PhongProjectedPosition(float3 flatPositionWS, float3 cornerPositionWS, float3 normalWS) {
        return flatPositionWS - dot(flatPositionWS - cornerPositionWS, normalWS) * normalWS;
    }
    
    // Apply Phong smoothing
    float3 CalculatePhongPosition(float3 bary, float3 p0PositionWS, float3 p0NormalWS,
        float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
        // Flat (linearly interpolated) position on the original triangle plane
        float3 flatPositionWS = bary.x * p0PositionWS + bary.y * p1PositionWS + bary.z * p2PositionWS;
        float3 smoothedPositionWS =
            bary.x * PhongProjectedPosition(flatPositionWS, p0PositionWS, p0NormalWS) +
            bary.y * PhongProjectedPosition(flatPositionWS, p1PositionWS, p1NormalWS) +
            bary.z * PhongProjectedPosition(flatPositionWS, p2PositionWS, p2NormalWS);
        return smoothedPositionWS;
    }
    
    // The domain function runs once per vertex in the final, tessellated mesh
    // Use it to reposition vertices and prepare for the fragment stage
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
    
        Interpolators output;
        ...
        float3 positionWS = CalculatePhongPosition(barycentricCoordinates, 
          patch[0].positionWS, patch[0].normalWS, 
          patch[1].positionWS, patch[1].normalWS, 
          patch[2].positionWS, patch[2].normalWS);
        float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);
        float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
        ...
        output.positionCS = TransformWorldToHClip(positionWS);
        output.normalWS = normalWS;
        output.positionWS = positionWS;
        output.tangentWS = float4(tangentWS, patch[0].tangentWS.w);
        ...
    }

    Note that we also need the tangent vector here, so add it to the structs and write it through the Vertex and Domain stages. We also add a helper function that does the barycentric interpolation (used to get the flat position $P$ that is then projected to the $P'$ points).

    struct Attributes {
        ...
        float4 tangentOS : TANGENT;
    };
    struct TessellationControlPoint {
        ...
        float4 tangentWS : TANGENT;
    };
    struct Interpolators {
        ...
        float4 tangentWS : TANGENT;
    };
    TessellationControlPoint Vertex(Attributes input) {
        TessellationControlPoint output;
        ...
        // The w component is the sign used to reconstruct the bitangent
        output.tangentWS = float4(normalInputs.tangentWS, input.tangentOS.w); // tangent.w contains bitangent multiplier
    }
    // Barycentric interpolation as a function
    float3 BarycentricInterpolate(float3 bary, float3 a, float3 b, float3 c) {
        return bary.x * a + bary.y * b + bary.z * c;
    }

    In the original Phong subdivision paper, an α factor was added to control the degree of curvature. The original author recommends setting this value globally to three-quarters for the best visual effect. Expanding the algorithm with the α factor can produce a quadratic Bezier curve, which does not provide an inflection point but is sufficient for practical development.

    First, let’s look at the formula in the original paper.
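
    The formula itself is not reproduced here, but, roughly (and consistent with the lerp used in the code below), the α-weighted Phong tessellation of the flat barycentric point $\bar{p}(u,v,w)$ can be written as:

    $$
    p^{\alpha}(u, v, w) = (1 - \alpha)\,\bar{p}(u, v, w) + \alpha\left(u\,\pi_1(\bar{p}) + v\,\pi_2(\bar{p}) + w\,\pi_3(\bar{p})\right),
    \qquad \pi_i(q) = q - \left((q - P_i) \cdot N_i\right) N_i
    $$

    Here $\pi_i$ is the projection onto the tangent plane at vertex $i$, which is exactly what PhongProjectedPosition computes below.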

    Essentially, α controls the degree of interpolation: when α = 0 all vertices stay on the original plane, which is equivalent to no displacement; when α = 1 the new vertices depend entirely on the Phong-projected (bent) positions. You can also try values below zero or above one, and the results are quite interesting. ~~It doesn’t matter if you don’t fully understand the formulas in the paper; I just use a lerp to do the interpolation.~~

    // Apply Phong smoothing
    float3 CalculatePhongPosition(float3 bary, float smoothing, float3 p0PositionWS, float3 p0NormalWS,
        float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
        float3 flatPositionWS = BarycentricInterpolate(bary, p0PositionWS, p1PositionWS, p2PositionWS);
        float3 smoothedPositionWS =
            bary.x * PhongProjectedPosition(flatPositionWS, p0PositionWS, p0NormalWS) +
            bary.y * PhongProjectedPosition(flatPositionWS, p1PositionWS, p1NormalWS) +
            bary.z * PhongProjectedPosition(flatPositionWS, p2PositionWS, p2NormalWS);
        return lerp(flatPositionWS, smoothedPositionWS, smoothing);
    }
    

    Don't forget to expose in the material panel.

    // .shader
    _TessellationSmoothing("_TessellationSmoothing", Range(0,1)) = 0.5
    
    // .hlsl
    float _TessellationSmoothing;
    
    
    
    Interpolators Domain( .... ) {
        ...
        float smoothing = _TessellationSmoothing;
        float3 positionWS = CalculatePhongPosition(barycentricCoordinates, smoothing,
          patch[0].positionWS, patch[0].normalWS, 
          patch[1].positionWS, patch[1].normalWS, 
          patch[2].positionWS, patch[2].normalWS);
        ...
    }

    It is important to note that some models require some modification. If the edges of the model are very sharp, it means that the normal of this vertex is almost parallel to the normal of the face. In Phong Tessellation, this will cause the projection of the vertex on the tangent plane to be very close to the original vertex position, thus reducing the impact of subdivision.

    To solve this problem, you can add more geometric details by performing what is called "adding loop edges" or "loop cuts" in the modeling software. Insert additional edge loops near the edges of the original model to increase the subdivision density. The specific operation will not be expanded here.

    In general, the effect and performance of Phong subdivision are relatively good. However, if you want a higher quality smoothing effect, you can consider PN triangles. This technology is based on the curved triangle of Bezier curve.

    4.2 PN triangles subdivision

    First, here is the original paper. http://alex.vlachos.com/graphics/CurvedPNTriangles.pdf

    PN Triangles does not require information about neighboring triangles and is less expensive. The PN Triangles algorithm only requires the positions and normals of the three vertices in the patch. The rest of the data can be calculated. Note that all data is in barycentric coordinates.

    In the PN algorithm, 10 control points need to be calculated for surface subdivision, as shown in the figure below. Three triangle vertices, a centroid, and three pairs of control points on the edges constitute all the control points. The calculated Bezier curve control points will be passed to the Domain. Since the control points of each triangle patch are consistent, it is very appropriate to place the step of calculating the control points in the Patch Constant Function.

    The calculation method in the paper is as follows:

    $$
    \begin{aligned}
    b_{300} &= P_1 \\
    b_{030} &= P_2 \\
    b_{003} &= P_3 \\
    w_{ij} &= \left(P_j - P_i\right) \cdot N_i \in \mathbf{R} \quad \text{(here } \cdot \text{ is the scalar product)} \\
    b_{210} &= \left(2 P_1 + P_2 - w_{12} N_1\right) / 3 \\
    b_{120} &= \left(2 P_2 + P_1 - w_{21} N_2\right) / 3 \\
    b_{021} &= \left(2 P_2 + P_3 - w_{23} N_2\right) / 3 \\
    b_{012} &= \left(2 P_3 + P_2 - w_{32} N_3\right) / 3 \\
    b_{102} &= \left(2 P_3 + P_1 - w_{31} N_3\right) / 3 \\
    b_{201} &= \left(2 P_1 + P_3 - w_{13} N_1\right) / 3 \\
    E &= \left(b_{210} + b_{120} + b_{021} + b_{012} + b_{102} + b_{201}\right) / 6 \\
    V &= \left(P_1 + P_2 + P_3\right) / 3 \\
    b_{111} &= E + (E - V) / 2
    \end{aligned}
    $$

    The $w_{ij}$ term is computed twice per edge, six times in total. For example, $w_{12}$ is the length of the projection of the vector from $P_1$ to $P_2$ onto the normal at $P_1$; multiplying it by that normal gives the projection vector itself, of length $w_{12}$.

    Take the control point closest to $P_1$ as an example. The current vertex should carry more weight, so $P_1$ is multiplied by 2, pulling the computed control point closer to it. The projection vector is then subtracted to correct for the fact that $P_2$ does not lie in the plane defined by the normal at $P_1$, which keeps the patch more consistent and reduces distortion. Finally the sum is divided by 3 to normalize it.

    Next, calculate the average of the six edge control points, $E$, which represents where the boundary control points are concentrated, and the average of the three triangle vertices, $V$. The tenth and final control point then pushes $E$ away from $V$ by half of that offset: $b_{111} = E + (E - V)/2$.

    To summarize, the first three are the positions of the triangle vertices (so they don't need to be written in the structure), six are calculated by weight, and the last one is the average of the previous calculations. The code is very simple to write.

    struct TessellationFactors {
        float edge[3] : SV_TessFactor;
        float inside : SV_InsideTessFactor;
        float3 bezierPoints[7] : BEZIERPOS;
    };
    
    //Bezier control point calculations
    float3 CalculateBezierControlPoint(float3 p0PositionWS, float3 aNormalWS, float3 p1PositionWS, float3 bNormalWS) {
        float w = dot(p1PositionWS - p0PositionWS, aNormalWS);
        return (p0PositionWS * 2 + p1PositionWS - w * aNormalWS) / 3.0;
    }
    
    void CalculateBezierControlPoints(inout float3 bezierPoints[7],
        float3 p0PositionWS, float3 p0NormalWS, float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
        bezierPoints[0] = CalculateBezierControlPoint(p0PositionWS, p0NormalWS, p1PositionWS, p1NormalWS);
        bezierPoints[1] = CalculateBezierControlPoint(p1PositionWS, p1NormalWS, p0PositionWS, p0NormalWS);
        bezierPoints[2] = CalculateBezierControlPoint(p1PositionWS, p1NormalWS, p2PositionWS, p2NormalWS);
        bezierPoints[3] = CalculateBezierControlPoint(p2PositionWS, p2NormalWS, p1PositionWS, p1NormalWS);
        bezierPoints[4] = CalculateBezierControlPoint(p2PositionWS, p2NormalWS, p0PositionWS, p0NormalWS);
        bezierPoints[5] = CalculateBezierControlPoint(p0PositionWS, p0NormalWS, p2PositionWS, p2NormalWS);
        float3 avgBezier = 0;
        [unroll] for (int i = 0; i < 6; i++) {
            avgBezier += bezierPoints[i];
        }
        avgBezier /= 6.0;
        float3 avgControl = (p0PositionWS + p1PositionWS + p2PositionWS) / 3.0;
        bezierPoints[6] = avgBezier + (avgBezier - avgControl) / 2.0;
    }
    
    // The patch constant function runs once per triangle, or "patch"
    // It runs in parallel to the hull function
    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        ...
        TessellationFactors f = (TessellationFactors)0;
        // Check if this patch should be culled (it is out of view)
        if (ShouldClipPatch(...)) {
            ...
        } else {
            ...
            CalculateBezierControlPoints(f.bezierPoints, patch[0].positionWS, patch[0].normalWS, 
              patch[1].positionWS, patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
        }
        return f;
    }

    Then, in the Domain Function, use the ten control points: the three patch vertices plus the seven values output by the Patch Constant Function. Following the formula given in the paper, compute the final cubic Bezier surface position, then lerp with the flat position and expose the smoothing factor on the material panel.

    $$
    \begin{aligned}
    & b: \; R^2 \mapsto R^3, \quad \text{for } w = 1 - u - v, \quad u, v, w \geq 0 \\
    & b(u, v) = \sum_{i+j+k=3} b_{ijk}\, \frac{3!}{i!\,j!\,k!}\, u^i v^j w^k \\
    &= b_{300} w^3 + b_{030} u^3 + b_{003} v^3 \\
    &\quad + b_{210}\, 3 w^2 u + b_{120}\, 3 w u^2 + b_{201}\, 3 w^2 v \\
    &\quad + b_{021}\, 3 u^2 v + b_{102}\, 3 w v^2 + b_{012}\, 3 u v^2 \\
    &\quad + b_{111}\, 6 w u v
    \end{aligned}
    $$

    // Barycentric interpolation as a function
    float3 BarycentricInterpolate(float3 bary, float3 a, float3 b, float3 c) {
        return bary.x * a + bary.y * b + bary.z * c;
    }
    
    float3 CalculateBezierPosition(float3 bary, float smoothing, float3 bezierPoints[7],
        float3 p0PositionWS, float3 p1PositionWS, float3 p2PositionWS) {
        float3 flatPositionWS = BarycentricInterpolate(bary, p0PositionWS, p1PositionWS, p2PositionWS);
        float3 smoothedPositionWS =
            p0PositionWS * (bary.x * bary.x * bary.x) +
            p1PositionWS * (bary.y * bary.y * bary.y) +
            p2PositionWS * (bary.z * bary.z * bary.z) +
            bezierPoints[0] * (3 * bary.x * bary.x * bary.y) +
            bezierPoints[1] * (3 * bary.y * bary.y * bary.x) +
            bezierPoints[2] * (3 * bary.y * bary.y * bary.z) +
            bezierPoints[3] * (3 * bary.z * bary.z * bary.y) +
            bezierPoints[4] * (3 * bary.z * bary.z * bary.x) +
            bezierPoints[5] * (3 * bary.x * bary.x * bary.z) +
            bezierPoints[6] * (6 * bary.x * bary.y * bary.z);
        return lerp(flatPositionWS, smoothedPositionWS, smoothing);
    }
    
    // The domain function runs once per vertex in the final, tessellated mesh
    // Use it to reposition vertices and prepare for the fragment stage
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
    
        Interpolators output;
        ...
        // Calculate tessellation smoothing multiplier
        float smoothing = _TessellationSmoothing;
    #ifdef _TESSELLATION_SMOOTHING_VCOLORS
        smoothing *= BARYCENTRIC_INTERPOLATE(color.r); // Multiply by the vertex's red channel
    #endif
    
        float3 positionWS = CalculateBezierPosition(barycentricCoordinates,
          smoothing, factors.bezierPoints, 
          patch[0].positionWS, patch[1].positionWS, patch[2].positionWS);
        float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);
        float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
        ...
    }

    Compare the effect with PN triangles off and on.

    4.3 Improved PN triangles – Output subdivided normals

    Traditional PN triangles only move vertex positions. We can also use the vertex normals to output smoothly varying normals, which gives better lighting.

    In the original algorithm, the normals vary in a very discrete way across the surface. As shown in the figure below (top), the normals of the two original triangle vertices may not represent how the true surface normal changes. We want the effect shown in the figure below (bottom), so we use quadratic interpolation to approximate the surface variation within a single patch.

    Since the position surface is a cubic Bezier surface, the normal should be interpolated as a quadratic Bezier surface, so three additional normal control points are required. The detailed mathematical derivation is explained clearly in Ref. 10.

    The following is a brief description of how each subdivided normal control point is obtained.

    First, take the normals of the two endpoints of an edge AB and average them.

    Construct a plane perpendicular to the segment AB and passing through its midpoint.

    Reflect the averaged normal about that plane; the result is the control normal for this edge.

    Do this for each of the three edges, giving three control normals.

    struct TessellationFactors {
        float edge[3] : SV_TessFactor;
        float inside : SV_InsideTessFactor;
        float3 bezierPoints[10] : BEZIERPOS;
    };
    
    float3 CalculateBezierControlNormal(float3 p0PositionWS, float3 aNormalWS, float3 p1PositionWS, float3 bNormalWS) {
        float3 d = p1PositionWS - p0PositionWS;
        float v = 2 * dot(d, aNormalWS + bNormalWS) / dot(d, d);
        return normalize(aNormalWS + bNormalWS - v * d);
    }
    
    void CalculateBezierNormalPoints(inout float3 bezierPoints[10],
        float3 p0PositionWS, float3 p0NormalWS, float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
        bezierPoints[7] = CalculateBezierControlNormal(p0PositionWS, p0NormalWS, p1PositionWS, p1NormalWS);
        bezierPoints[8] = CalculateBezierControlNormal(p1PositionWS, p1NormalWS, p2PositionWS, p2NormalWS);
        bezierPoints[9] = CalculateBezierControlNormal(p2PositionWS, p2NormalWS, p0PositionWS, p0NormalWS);
    }
    
    // The patch constant function runs once per triangle, or "patch"
    // It runs in parallel to the hull function
    TessellationFactors PatchConstantFunction(
        InputPatch<TessellationControlPoint, 3> patch) {
        ...
        TessellationFactors f = (TessellationFactors)0;
        // Check if this patch should be culled (it is out of view)
        if (ShouldClipPatch(...)) {
            ...
        } else {
            ...
            CalculateBezierControlPoints(f.bezierPoints, 
              patch[0].positionWS, patch[0].normalWS, patch[1].positionWS, 
              patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
            CalculateBezierNormalPoints(f.bezierPoints, 
              patch[0].positionWS, patch[0].normalWS, patch[1].positionWS, 
              patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
        }
        return f;
    }

    Note that all interpolated normal vectors need to be normalized.

    float3 CalculateBezierNormal(float3 bary, float3 bezierPoints[10],
        float3 p0NormalWS, float3 p1NormalWS, float3 p2NormalWS) {
        return p0NormalWS * (bary.x * bary.x) +
            p1NormalWS * (bary.y * bary.y) +
            p2NormalWS * (bary.z * bary.z) +
            bezierPoints[7] * (2 * bary.x * bary.y) +
            bezierPoints[8] * (2 * bary.y * bary.z) +
            bezierPoints[9] * (2 * bary.z * bary.x);
    }
    
    float3 CalculateBezierNormalWithSmoothFactor(float3 bary, float smoothing, float3 bezierPoints[10],
        float3 p0NormalWS, float3 p1NormalWS, float3 p2NormalWS) {
        float3 flatNormalWS = BarycentricInterpolate(bary, p0NormalWS, p1NormalWS, p2NormalWS);
        float3 smoothedNormalWS = CalculateBezierNormal(bary, bezierPoints, p0NormalWS, p1NormalWS, p2NormalWS);
        return normalize(lerp(flatNormalWS, smoothedNormalWS, smoothing));
    }
    
    // The domain function runs once per vertex in the final, tessellated mesh
    // Use it to reposition vertices and prepare for the fragment stage
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
    
        Interpolators output;
        ...
        // Calculate tessellation smoothing multiplier
        float smoothing = _TessellationSmoothing;
        float3 positionWS = CalculateBezierPosition(barycentricCoordinates, smoothing, factors.bezierPoints, patch[0].positionWS, patch[1].positionWS, patch[2].positionWS);
        float3 normalWS = CalculateBezierNormalWithSmoothFactor(
            barycentricCoordinates, smoothing, factors.bezierPoints,
            patch[0].normalWS, patch[1].normalWS, patch[2].normalWS);
        float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
        ...
    }

    There is another problem to note. Once we use the interpolated normal, the original tangent vector is no longer orthogonal to it. To keep the basis orthogonal, a new tangent vector needs to be computed.

    void CalculateBezierNormalAndTangent(
        float3 bary, float smoothing, float3 bezierPoints[10],
        float3 p0NormalWS, float3 p0TangentWS, 
        float3 p1NormalWS, float3 p1TangentWS, 
        float3 p2NormalWS, float3 p2TangentWS,
        out float3 normalWS, out float3 tangentWS) {
    
        float3 flatNormalWS = BarycentricInterpolate(bary, p0NormalWS, p1NormalWS, p2NormalWS);
        float3 smoothedNormalWS = CalculateBezierNormal(bary, bezierPoints, p0NormalWS, p1NormalWS, p2NormalWS);
        normalWS = normalize(lerp(flatNormalWS, smoothedNormalWS, smoothing));
    
        float3 flatTangentWS = BarycentricInterpolate(bary, p0TangentWS, p1TangentWS, p2TangentWS);
        float3 flatBitangentWS = cross(flatNormalWS, flatTangentWS);
        tangentWS = normalize(cross(flatBitangentWS, normalWS));
    }
    
    [domain("tri")] // Signal we're inputting triangles
    Interpolators Domain(
        TessellationFactors factors, //The output of the patch constant function
        OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
        float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
        ...
        float3 normalWS, tangentWS;
        CalculateBezierNormalAndTangent(
            barycentricCoordinates, smoothing, factors.bezierPoints,
            patch[0].normalWS, patch[0].tangentWS.xyz, 
            patch[1].normalWS, patch[1].tangentWS.xyz, 
            patch[2].normalWS, patch[2].tangentWS.xyz,
            normalWS, tangentWS);
        ...
    }

    References

    1. https://www.youtube.com/watch?v=63ufydgBcIk
    2. https://nedmakesgames.medium.com/mastering-tessellation-shaders-and-their-many-uses-in-unity-9caeb760150e
    3. https://zhuanlan.zhihu.com/p/148247621
    4. https://zhuanlan.zhihu.com/p/124235713
    5. https://zhuanlan.zhihu.com/p/141099616
    6. https://zhuanlan.zhihu.com/p/42550699
    7. https://en.wikipedia.org/wiki/Barycentric_coordinate_system
    8. https://zhuanlan.zhihu.com/p/359999755
    9. https://zhuanlan.zhihu.com/p/629364817
    10. https://zhuanlan.zhihu.com/p/629202115
    11. https://perso.telecom-paristech.fr/boubek/papers/PhongTessellation/PhongTessellation.pdf
    12. http://alex.vlachos.com/graphics/CurvedPNTriangles.pdf
  • Unity可互动可砍断八叉树草海渲染 – 几何、计算着色器(BIRP/URP)

    Unity interactive and chopable octree grass sea rendering – geometry, compute shader (BIRP/URP)

    Project (BIRP) on Github:

    https://github.com/Remyuu/Unity-Interactive-Grass

    First, here is a screenshot of 100,500 blades of grass driven by a Compute Shader on my M1 Pro without any optimization. It runs at over 200 FPS.

    After adding octree frustum culling, distance fading and other operations, the frame rate became less stable (painful). I suspect the CPU is under too much pressure maintaining such a large amount of grass data every frame. But as long as enough culling is done, 700+ FPS is no problem. The depth of the octree also needs to be tuned to the actual scene; in the figure below I set the octree depth to 5.

    Preface

    This article keeps getting longer. I mainly use it to review my own knowledge, so it may read as fairly basic. I am still a beginner, and discussion and corrections are very welcome.

    This article mainly has two stages:

    • The GS + TS method achieves the most basic effect of grass rendering
    • Then I used CS to re-render the sea of grass, adding various optimization methods

    The rendering method of geometry shader + tessellation shader should be relatively simple, but the performance ceiling is relatively low and the platform compatibility is poor.

    The method of combining compute shaders with GPU Instancing should be the mainstream method in the current industry, and it can also run well on mobile terminals.

    The CS grass-sea rendering in this article mainly refers to the implementations by Colin and Minions Art, and is closer to a hybrid of the two (the former has already been analyzed on Zhihu in "Grass rendering study notes based on GPU Instance"). Three ComputeBuffers are used: one holds all the grass, one is an append buffer consumed by the material, and one is a visible-index buffer obtained in real time from frustum culling. A quad/octree (alternating by depth parity) divides the space; frustum culling against it yields the indices of all grass inside the current frustum, which are passed to the Compute Shader for further processing (mesh generation, quaternion rotation, LoD, and so on). A variable-length ComputeBuffer (ComputeBufferType.Append) then passes the grass to be rendered to the material via instancing for final drawing.

    Hi-Z occlusion culling could also be used for this; I am still learning it and leave it as future work.

    In addition, I referred to the article by Minions Art and wrote a (partial) copy of an editor grass-painting tool, which stores the positions of all grass vertices by maintaining a vertex list.

    Furthermore, another Cut buffer is maintained. If a blade is marked with -1 it is left untouched; otherwise the stored value is the cutter height, which is passed to the material. Using the world position and that height plus a lerp, the upper part of the blade is hidden and its color adjusted, and finally some grass clippings are added to complete the grass-cutting effect.
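    A minimal sketch of that cut test on the shader side, assuming a per-blade cut value is passed through (here called cutHeight, with -1 meaning uncut) together with a hypothetical _CutColor property:

    // Fragment-stage sketch; input.cutHeight, input.positionWS and _CutColor are illustrative names.
    float cut = input.cutHeight;                          // -1 means this blade has not been cut
    if (cut > -0.5)
    {
        clip(cut - input.positionWS.y);                   // discard fragments above the cut plane
        float nearCut = saturate(1 - (cut - input.positionWS.y));
        col.rgb = lerp(col.rgb, _CutColor.rgb, nearCut);  // tint the blade toward the cut color near the cut
    }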

    In the previous article I introduced tessellation shaders and various optimization methods in detail; next I will integrate tessellation into actual development. I also spent a few days learning compute shaders and built a compute-shader-based grass field; more details can be found in that note. The following is the effect this article will achieve, with complete code attached:

    • Grass Rendering
    • Grass Rendering – Geometry Shader (BIRP/URP)
    • Define grass width, height, orientation, bend, curvature, gradient, color, banding, and normals
    • INTEGER tessellation
    • URP adds Visibility Map
    • Grass rendering – Compute Shader (BIRP/URP) work on MacOS
    • Octree frustum culling
    • Distance fades
    • Grass Interaction
    • Interactive Geometry Shaders (BIRP/URP)
    • Interactive Compute Shader (BIRP) work on MacOS
    • Unity custom grass generation tool
    • Grass cutting system

    Main reference articles (from which I borrow heavily):

    There are many ways to render grass, two of which are shown in this article:

    • Geometry Shader + Tessellation Shader
    • Compute Shaders + GPU Instancing

    First of all, the first solution has great limitations. Many mobile devices and Metal do not support GS, and GS will recalculate the Mesh every frame, which is quite expensive.

    Secondly, does that mean macOS can no longer run geometry shaders at all? Not quite. To use GS you must use OpenGL rather than Metal, but note that Apple only supports OpenGL up to 4.1, and that version does not support Compute Shaders. Intel-era Macs could use OpenGL 4.3 and run CS and GS at the same time; the M-series chips cannot, so it is either OpenGL 4.1 or Metal. On my M1 Pro MacBook Pro, even in a virtual machine (Parallels 18+ provides DX11 and Vulkan), the Vulkan layer on macOS is translated on top of Metal, so there is still no GS. In short, there is no native GS on Apple Silicon.

    Furthermore, Metal doesn't even support tessellation shaders directly. Apple simply doesn't want to support these two stages in hardware, because their efficiency is too low; on the M chips, TS is even emulated with compute shaders!

    To sum up, geometry shaders are a dying technology, especially after the advent of Mesh Shaders. GS is still widely used in Unity, but any similar effect can be achieved with CS-driven instancing, usually more efficiently. New graphics cards still support GS and quite a few shipped games use it; Apple simply chose not to keep the compatibility and cut it off.

    This article explains in detail why GS is so slow: http://www.joshbarczak.com/blog/?p=667. Simply put, Intel optimized GS with thread blocking and similar techniques, while other vendors' chips do not have this optimization.

    This article is a study note and is likely to contain errors.

    1. Overview of Geometry Shader Rendering Grass (BIRP)

    This chapter is a concise summary of Roystan's tutorial. If you need the project files or the final code, you can download them from the original article, or read the write-up by Socrates has no bottom.

    1.1 Overview

    After the Domain Stage, you can choose to use a geometry shader.

    A geometry shader takes a whole primitive as input and is able to generate vertices on output. The input to a geometry shader is the vertices of a complete primitive (three vertices for a triangle, two vertices for a line or a single vertex for a point). The geometry shader is called once for each primitive.

    Download the starter project from the original tutorial.

    1.2 Drawing a triangle

    Draw a triangle.

    // Add inside the CGINCLUDE block.
    struct geometryOutput
    {
        float4 POS : SV_POSITION;
    };

    ...
        // Vertex shader
        return vertex;
    ...

    [maxvertexcount(3)]
    void geo(triangle float4 IN[3] : SV_POSITION, inout TriangleStream<geometryOutput> triStream)
    {
        geometryOutput o;

        o.POS = UnityObjectToClipPos(float4(0.5, 0, 0, 1));
        triStream.Append(o);

        o.POS = UnityObjectToClipPos(float4(-0.5, 0, 0, 1));
        triStream.Append(o);

        o.POS = UnityObjectToClipPos(float4(0, 1, 0, 1));
        triStream.Append(o);
    }



    // Add inside the SubShader Pass, just below the #pragma fragment frag line.
    #pragma geometry geo

    We actually draw a triangle for each vertex in the mesh, but the positions we assign to the triangle vertices are constant - they don't change for each input vertex - placing all the triangles on top of each other.

    1.3 Vertex Offset

    Therefore, we can just make an offset according to the position of each vertex.

    // Add to the top of the geometry shader.
    float3 POS = IN[0];
    
    
    
    // Update each assignment of o.POS.
    o.POS = UnityObjectToClipPos(POS + float3(0.5, 0, 0));
    
    
    
    o.POS = UnityObjectToClipPos(POS + float3(-0.5, 0, 0));
    
    
    
    o.POS = UnityObjectToClipPos(POS + float3(0, 1, 0));

    1.4 Rotating blades

    Note, however, that all triangles are currently emitted facing the same direction, so we correct for the surface orientation: a tangent-space (TBN) matrix is constructed and applied to the offsets, and the code is tidied up.

    float3 vNormal = IN[0].normal;
    float4 vTangent = IN[0].tangent;
    float3 vBinormal = cross(vNormal, vTangent.xyz) * vTangent.w;
    
    float3x3 tangentToLocal = float3x3(
        vTangent.x, vBinormal.x, vNormal.x,
        vTangent.y, vBinormal.y, vNormal.y,
        vTangent.z, vBinormal.z, vNormal.z
        );
    
    triStream.Append(VertexOutput(POS + mul(tangentToLocal, float3(0.5, 0, 0))));
    triStream.Append(VertexOutput(POS + mul(tangentToLocal, float3(-0.5, 0, 0))));
    triStream.Append(VertexOutput(POS + mul(tangentToLocal, float3(0, 0, 1))));

    1.5 Coloring

    Then define the upper and lower colors of the grass, and use UV to make a lerp gradient.

    return lerp(_BottomColor, _TopColor, i.uv.y);

    1.6 Rotation Matrix Principle

    Give each blade a random facing. A rotation matrix is constructed here; the principle is also covered in GAMES101, and there is a very clear video deriving the formula. The simple idea of the derivation is: assuming vector $a$ rotates around the axis $n$ to $b$, decompose $a$ into the component parallel to $n$ (which is unchanged) plus the component perpendicular to $n$.
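    Written out, the decomposition looks like this (a quick sketch of the idea, not quoted from the video):

    $$
    \begin{aligned}
    a &= a_{\parallel} + a_{\perp}, \qquad a_{\parallel} = (a \cdot n)\,n, \qquad a_{\perp} = a - (a \cdot n)\,n \\
    b &= a_{\parallel} + \cos\theta \, a_{\perp} + \sin\theta \,(n \times a)
    \end{aligned}
    $$

    The parallel part is unchanged, while the perpendicular part rotates in the plane spanned by $a_{\perp}$ and $n \times a$; collecting the terms yields the Rodrigues matrix used in the code below.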

    float3x3 AngleAxis3x3(float angle, float3 axis)
    {
        float c, s;
        sincos(angle, s, c);
    
        float t = 1 - c;
        float x = axis.x;
        float y = axis.y;
        float z = axis.z;
    
        return float3x3(
            t * x * x + c, t * x * y - s * z, t * x * z + s * y,
            t * x * y + s * z, t * y * y + c, t * y * z - s * x,
            t * x * z - s * y, t * y * z + s * x, t * z * z + c
            );
    }

    The rotation matrix $R$ is calculated here using Rodrigues' rotation formula: $$R = I + \sin(\theta)\,[k]_{\times} + (1-\cos(\theta))\,[k]_{\times}^2$$

    Among them, $\theta$ is the rotation angle. $k$ is the unit rotation axis. $I$ is the identity matrix. $[k]_{\times}$ is the antisymmetric matrix corresponding to the axis $k$.

    For a unit vector $k=(x,y,z)$ , the antisymmetric matrix $[k]_{\times}=\left[\begin{array}{ccc} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{array}\right]$ finally obtains the matrix elements:

    $$ \left[\begin{array}{ccc} t x^2 + c & t x y - s z & t x z + s y \\ t x y + s z & t y^2 + c & t y z - s x \\ t x z - s y & t y z + s x & t z^2 + c \end{array}\right] $$

    float3x3 facingRotationMatrix = AngleAxis3x3(rand(POS) * UNITY_TWO_PI, float3(0, 0, 1));

    1.7 Blade tipping

    Now that the blade has a random facing, tilt it by a random angle around its local x axis so that it leans over.

    float3x3 bendRotationMatrix = AngleAxis3x3(rand(POS.zzx) * _BendRotationRandom * UNITY_PI * 0.5, float3(-1, 0, 0));

    1.8 Leaf size

    Adjust the width and height of the grass. Originally, we set the height and width to be one unit. To make the grass more natural, we add rand to this step to make it look more natural.

    _BladeWidth("Blade Width", Float) = 0.05
    _BladeWidthRandom("Blade Width Random", Float) = 0.02
    _BladeHeight("Blade Height", Float) = 0.5
    _BladeHeightRandom("Blade Height Random", Float) = 0.3
    
    
    float height = (rand(POS.zyx) * 2 - 1) * _BladeHeightRandom + _BladeHeight;
    float width = (rand(POS.xzy) * 2 - 1) * _BladeWidthRandom + _BladeWidth;
    
    
    triStream.Append(VertexOutput(POS + mul(transformationMatrix, float3(width, 0, 0)), float2(0, 0)));
    triStream.Append(VertexOutput(POS + mul(transformationMatrix, float3(-width, 0, 0)), float2(1, 0)));
    triStream.Append(VertexOutput(POS + mul(transformationMatrix, float3(0, 0, height)), float2(0.5, 1)));

    1.9 Tessellation

    Since one blade per vertex is far too few, the ground surface is tessellated here to generate more emission points.
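    As a reminder of what that wiring looks like, here is a minimal uniform hull/domain pair (a sketch loosely following Roystan's CustomTessellation include; vertexInput, vertexOutput and vert are placeholders for this shader's actual vertex structures and vertex function):

    // Requires #pragma hull hull, #pragma domain domain and #pragma target 4.6 in the Pass.
    struct TessellationFactors {
        float edge[3] : SV_TessFactor;
        float inside  : SV_InsideTessFactor;
    };

    TessellationFactors patchConstantFunction(InputPatch<vertexInput, 3> patch) {
        TessellationFactors f;
        f.edge[0] = f.edge[1] = f.edge[2] = _TessellationUniform;
        f.inside = _TessellationUniform;
        return f;
    }

    [domain("tri")]
    [outputcontrolpoints(3)]
    [outputtopology("triangle_cw")]
    [partitioning("integer")]
    [patchconstantfunc("patchConstantFunction")]
    vertexInput hull(InputPatch<vertexInput, 3> patch, uint id : SV_OutputControlPointID) {
        return patch[id];
    }

    [domain("tri")]
    vertexOutput domain(TessellationFactors factors,
        OutputPatch<vertexInput, 3> patch,
        float3 bary : SV_DomainLocation) {
        vertexInput v;
        v.vertex  = patch[0].vertex  * bary.x + patch[1].vertex  * bary.y + patch[2].vertex  * bary.z;
        v.normal  = patch[0].normal  * bary.x + patch[1].normal  * bary.y + patch[2].normal  * bary.z;
        v.tangent = patch[0].tangent * bary.x + patch[1].tangent * bary.y + patch[2].tangent * bary.z;
        return vert(v); // run the subdivided vertex through the regular vertex function
    }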

    1.10 Perturbations

    To animate the grass, perturb the wind texture sample with _Time: sample the distortion texture, build a wind rotation matrix from the sample, and apply it to the blade.

    float2 uv = POS.xz * _WindDistortionMap_ST.xy + _WindDistortionMap_ST.zw + _WindFrequency * _Time.y;
    
    float2 windSample = (tex2Dlod(_WindDistortionMap, float4(uv, 0, 0)).xy * 2 - 1) * _WindStrength;
    
    float3 wind = normalize(float3(windSample.x, windSample.y, 0));
    
    float3x3 windRotation = AngleAxis3x3(UNITY_PI * windSample, wind);
    
    float3x3 transformationMatrix = mul(mul(mul(tangentToLocal, windRotation), facingRotationMatrix), bendRotationMatrix);

    1.11 Fixed blade rotation issue

    At this point the wind can also rotate the blade around the x and y axes at its base, which shows up as blades spinning out of the ground:

    So for the two base vertices, use a matrix that only contains the facing rotation around z (no wind or bend).

    float3x3 transformationMatrixFacing = mul(tangentToLocal, facingRotationMatrix);
    
    
    
    triStream.Append(VertexOutput(POS + mul(transformationMatrixFacing, float3(width, 0, 0)), float2(0, 0)));
    triStream.Append(VertexOutput(POS + mul(transformationMatrixFacing, float3(-width, 0, 0)), float2(1, 0)));

    1.12 Blade curvature

    To give the blades curvature, we have to add vertices. Since double-sided rendering is currently enabled, the vertex winding order does not matter. A manual interpolation for-loop is used to construct the triangles, and a forward offset is computed to bend each blade.

    float forward = rand(POS.yyz) * _BladeForward;
    
    
    for (int i = 0; i < BLADE_SEGMENTS; i++)
    {
        float t = i / (float)BLADE_SEGMENTS;
        // Add below the line declaring float t.
        float segmentHeight = height * t;
        float segmentWidth = width * (1 - t);
        float segmentForward = pow(t, _BladeCurve) * forward;
        float3x3 transformMatrix = i == 0 ? transformationMatrixFacing : transformationMatrix;
        triStream.Append(GenerateGrassVertex(POS, segmentWidth, segmentHeight, segmentForward, float2(0, t), transformMatrix));
        triStream.Append(GenerateGrassVertex(POS, -segmentWidth, segmentHeight, segmentForward, float2(1, t), transformMatrix));
    }
    
    triStream.Append(GenerateGrassVertex(POS, 0, height, forward, float2(0.5, 1), transformationMatrix));

    1.13 Creating Shadows

    Create shadows in another Pass and output.

    Pass{
        Tags{
            "LightMode" = "ShadowCaster"
        }
    
        CGPROGRAM
        #pragma vertex vert
        #pragma geometry geo
        #pragma fragment frag
        #pragma hull hull
        #pragma domain domain
        #pragma target 4.6
        #pragma multi_compile_shadowcaster
    
        float4 frag(geometryOutput i) : SV_Target{
            SHADOW_CASTER_FRAGMENT(i)
        }
    
        ENDCG
    }

    1.14 Receiving Shadows

    Use SHADOW_ATTENUATION directly in Frag to determine the shadow.

    // geometryOutput struct.
    unityShadowCoord4 _ShadowCoord : TEXCOORD1;
    ...
    o._ShadowCoord = ComputeScreenPos(o.POS);
    ...
    #pragma multi_compile_fwdbase
    ...
    return SHADOW_ATTENUATION(i);

    1.15 Removing shadow acne

    Apply Unity's linear shadow bias in the ShadowCaster pass to remove acne.

    #if UNITY_PASS_SHADOWCASTER
        o.POS = UnityApplyLinearShadowBias(o.POS);
    #endif

    1.16 Adding Normals

    Add normal information to vertices generated by the geometry shader.

    struct geometryOutput
    {
        float4 POS : SV_POSITION;
        float2 uv : TEXCOORD0;
        unityShadowCoord4 _ShadowCoord : TEXCOORD1;
        float3 normal : NORMAL;
    };
    ...
    o.normal = UnityObjectToWorldNormal(normal);

    1.17 Full code‼️ (BIRP)

    The final effect.

    Code:

    https://pastebin.com/8u1ytGgU

    Complete: https://pastebin.com/U14m1Nu0

    2. Geometry Shader Rendering Grass (URP)

    2.1 References

    I have already written the BIRP version, and now I just need to port it.

    • URP code specification reference: https://www.cyanilux.com/tutorials/urp-shader-code/
    • BIRP->URP quick reference table: https://cuihongzhi1991.github.io/blog/2020/05/27/builtinttourp/

    You can follow this article by Daniel, or follow my modifications below. Note that the space-transformation code in the original repository has problems; the fix can be found in its pull requests.

    Now put the above BIRP tessellation shader together.

    • Tags changed to URP
    • The header file is introduced and replaced with the URP version
    • Variables are surrounded by CBuffer
    • Shadow casting, receiving code

    2.2 Start to change

    Declare the URP pipeline.

    LOD 100
    Cull Off
    Pass{
        Tags{
            "RenderType" = "Opaque"
            "Queue" = "Geometry"
            "RenderPipeline" = "UniversalPipeline"
        }

    Import the URP library.

    #include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Core.hlsl"
    #include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Lighting.hlsl"
    #include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/ShaderVariablesFunctions.hlsl"
    
    o._ShadowCoord = ComputeScreenPos(o.POS);

    Change the function.

    // o.normal = UnityObjectToWorldNormal(normal);
    o.normal = TransformObjectToWorldNormal(normal);

    URP receives the shadow. It is best to calculate this in the vertex shader, but for convenience, it is all calculated in the geometry shader.
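    A minimal sketch of that shadow sampling with URP's helpers (the interpolator field names are illustrative; the full version is in the complete code below):

    // Needs: #pragma multi_compile _ _MAIN_LIGHT_SHADOWS _MAIN_LIGHT_SHADOWS_CASCADE
    // Geometry (or vertex) stage: build the shadow coordinate from the world position.
    o.shadowCoord = TransformWorldToShadowCoord(positionWS);

    // Fragment stage: sample the main light with that coordinate.
    Light mainLight = GetMainLight(i.shadowCoord);
    float shadow = mainLight.shadowAttenuation;
    return lerp(_BottomColor, _TopColor, i.uv.y) * shadow;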

    Then generate the shadows. ShadowCaster Pass.

    Pass{
        Name "ShadowCaster"
        Tags{ "LightMode" = "ShadowCaster" }
    
        ZWrite On
        ZTest LEqual
    
        HLSLPROGRAM
    
            half4 frag(geometryOutput input) : SV_TARGET{
                return 1;
            }
    
        ENDHLSL
    }

    2.3 Full code‼️(URP)

    https://pastebin.com/6KveEKMZ

    3. Optimize tessellation logic (BIRP/URP)

    3.1 Organize the code

    Above we only used a fixed tessellation level, which I cannot accept. If you don't understand the principles of tessellation, see my tessellation article, which details several ways to drive the subdivision factor.

    I use the BIRP version of the code that I completed in Section 1 as an example. The current version only has the Uniform subdivision.

    _TessellationUniform("Tessellation Uniform", Range(1, 64)) = 1

    The output structures of each stage are quite confusing, so let's reorganize them.

    3.2 Partitioning Mode

    [KeywordEnum(INTEGER, FRAC_EVEN, FRAC_ODD, POW2)] _PARTITIONING("Partition algorithm", Float) = 0
    
    #pragma shader_feature_local _PARTITIONING_INTEGER _PARTITIONING_FRAC_EVEN _PARTITIONING_FRAC_ODD _PARTITIONING_POW2
    
    #if defined(_PARTITIONING_INTEGER)
        [partitioning("integer")]
    #elif defined(_PARTITIONING_FRAC_EVEN)
        [partitioning("fractional_even")]
    #elif defined(_PARTITIONING_FRAC_ODD)
        [partitioning("fractional_odd")]
    #elif defined(_PARTITIONING_POW2)
        [partitioning("pow2")]
    #else 
        [partitioning("integer")]
    #endif

    3.3 Subdivided Frustum Culling

    In BIRP, use _ProjectionParams.z to represent the far plane, and in URP use UNITY_RAW_FAR_CLIP_VALUE.

    bool IsOutOfBounds(float3 p, float3 lower, float3 higher) { // bounds containment test
        return p.x < lower.x || p.x > higher.x || p.y < lower.y || p.y > higher.y || p.z < lower.z || p.z > higher.z;
    }
    bool IsPointOutOfFrustum(float4 positionCS) { // frustum test
        float3 culling = positionCS.xyz;
        float w = positionCS.w;
        float3 lowerBounds = float3(-w, -w, -w * _ProjectionParams.z);
        float3 higherBounds = float3(w, w, w);
        return IsOutOfBounds(culling, lowerBounds, higherBounds);
    }
    bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
        bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
            IsPointOutOfFrustum(p1PositionCS) &&
            IsPointOutOfFrustum(p2PositionCS);
        return allOutside;
    }
    
    TessellationControlPoint vert(Attributes v)
    {
        ...
        o.positionCS = UnityObjectToClipPos(v.vertex);
        ...
    }
    
    TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch)
    {
        TessellationFactors f;
        if(ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS)){
            f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0;
        }else{
            f.edge[0] = _TessellationFactor;
            f.edge[1] = _TessellationFactor;
            f.edge[2] = _TessellationFactor;
            f.inside = _TessellationFactor;
        }
        return f;
    }

    Note, however, that the test uses the clip-space positions of the grass roots. If the root triangle leaves the screen entirely but the blades are tall and still partially visible, the grass will suddenly pop out of view. Whether this matters depends on the project: for a high, downward-looking camera and short grass, this is fine.

    From a normal, higher viewing angle this is not a big problem.

    From a low, ground-hugging "Voldemort" angle, however, the grass is over-culled and looks incomplete.
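    One possible mitigation (a sketch, not part of the shader above) is to expand the test bounds by a tolerance so patches whose roots are just outside the frustum still survive:

    // _FrustumCullTolerance is an assumed extra property (roughly the maximum blade height).
    bool IsPointOutOfFrustum(float4 positionCS, float tolerance) {
        float3 culling = positionCS.xyz;
        float w = positionCS.w;
        float3 lowerBounds = float3(-w - tolerance, -w - tolerance, -w * _ProjectionParams.z - tolerance);
        float3 higherBounds = float3(w + tolerance, w + tolerance, w + tolerance);
        return IsOutOfBounds(culling, lowerBounds, higherBounds);
    }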

    3.4 Fine-grained control of screen distance

    Grass should be dense near the camera and sparse far away. Here the factor is based on edge length in clip space (screen distance), so the result is resolution-dependent.

    float EdgeTessellationFactor(float scale, float4 p0PositionCS, float4 p1PositionCS) {
        float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) / scale;
        return max(1, factor);
    }
    
    TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch)
    {
        TessellationFactors f;
    
        f.edge[0] = EdgeTessellationFactor(_TessellationFactor, 
            patch[1].positionCS, patch[2].positionCS);
        f.edge[1] = EdgeTessellationFactor(_TessellationFactor, 
            patch[2].positionCS, patch[0].positionCS);
        f.edge[2] = EdgeTessellationFactor(_TessellationFactor, 
            patch[0].positionCS, patch[1].positionCS);
        f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;
    
    
        #if defined(_CUTTESS_TRUE)
            if(ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS))
                f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0;
        #endif
    
        return f;
    }

    Tessellation Factor = 0.08

    It is not recommended to use Frac as the partitioning mode here, otherwise there is strong swimming/jittering that is very noticeable. I don't like this method very much.

    3.5 Camera distance classification

    Calculate the ratio of "the distance between two points" to "the distance between the midpoint of the two vertices and the camera position". The larger the ratio, the larger the space occupied on the screen, and the more subdivision is required.

    float EdgeTessellationFactor_WorldBase(float scale, float3 p0PositionWS, float3 p1PositionWS) {
        float length = distance(p0PositionWS, p1PositionWS);
        float distanceToCamera = distance(_WorldSpaceCameraPos, (p0PositionWS + p1PositionWS) * 0.5);
        float factor = length / (scale * distanceToCamera * distanceToCamera);
        return max(1, factor);
    }
    ...
    f.edge[0] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
        patch[1].vertex, patch[2].vertex);
    f.edge[1] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
        patch[2].vertex, patch[0].vertex);
    f.edge[2] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
        patch[0].vertex, patch[1].vertex);
    f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;

    There is still room for improvement: adjust the density so that grass close to the camera is not overly dense and grass at medium distance changes more smoothly, by introducing a nonlinear mapping between distance and tessellation factor.

    float EdgeTessellationFactor_WorldBase(float scale, float3 p0PositionWS, float3 p1PositionWS) {
        float length = distance(p0PositionWS, p1PositionWS);
        float distanceToCamera = distance(_WorldSpaceCameraPos, (p0PositionWS + p1PositionWS) * 0.5);
        // Use the square root function to adjust the effect of distance to make the tessellation factor change more smoothly at medium distances
        float adjustedDistance = sqrt(distanceToCamera);
        // Adjust the impact of scale. You may need to further fine-tune the coefficient here based on the actual effect.
        float factor = length / (scale * adjustedDistance);
        return max(1, factor);
    }

    This is more appropriate.

    3.6 Visibility Map Controls Grass Subdivision

    The vertex shader samples the visibility texture and passes the value on to the tessellation stage, and the patch constant function uses it to scale the tessellation factors.

    Take FIXED mode as an example:

    _VisibilityMap("Visibility Map", 2D) = "white" {}
    TEXTURE2D (_VisibilityMap);SAMPLER(sampler_VisibilityMap);
    struct Attributes
    {
        ...
        float2 uv : TEXCOORD0;
    };
    struct TessellationControlPoint
    {
        ...
        float visibility : TEXCOORD1;
    };
    TessellationControlPoint vert(Attributes v){
        ...
        float visibility = SAMPLE_TEXTURE2D_LOD(_VisibilityMap, sampler_VisibilityMap, v.uv, 0).r; 
        o.visibility    = visibility;
        ...
    }
    TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch){
        ...
        float averageVisibility = (patch[0].visibility + patch[1].visibility + patch[2].visibility) / 3; // Calculate the average grayscale value of the three vertices
        float baseTessellationFactor = _TessellationFactor_FIXED; 
        float tessellationMultiplier = lerp(0.1, 1.0, averageVisibility); // Adjust the factor based on the average gray value
        #if defined(_DYNAMIC_FIXED)
            f.edge[0] = _TessellationFactor_FIXED * tessellationMultiplier;
            f.edge[1] = _TessellationFactor_FIXED * tessellationMultiplier;
            f.edge[2] = _TessellationFactor_FIXED * tessellationMultiplier;
            f.inside  = _TessellationFactor_FIXED * tessellationMultiplier;
        ...

    3.7 Complete code‼️ (BIRP)

    Grass Shader:

    https://pastebin.com/TD0AupGz

    3.8 Full code‼️ (URP)

    There are some differences in URP. For example, to calculate ShadowBias, you need to do the following. I won’t expand on it. Just look at the code yourself.

    #if UNITY_PASS_SHADOWCASTER
        // o.pos = UnityApplyLinearShadowBias(o.pos);
        o.shadowCoord = TransformWorldToShadowCoord(ApplyShadowBias(posWS, norWS, 0));
    #endif

    Grass Shader:

    https://pastebin.com/2ZX2aVm9

    4. Interactive Grassland

    The approach is exactly the same in URP and BIRP.

    4.1 Implementation steps

    The principle is very simple. The script transmits the character's world coordinates, and then bends the grass according to the set radius and interaction strength.

    uniform float3 _PositionMoving; // Object position
    float _Radius;                  // Object interaction radius
    float _Strength;                // Interaction strength

    In the grass generation loop, calculate the distance between each grass fragment and the object and adjust the grass position according to this distance.

    float dis = distance(_PositionMoving, posWS); // Calculate distance
    float radiusEffect = 1 - saturate(dis / _Radius); // Calculate effect attenuation based on distance
    float3 sphereDisp = POS - _PositionMoving; // Calculate the position difference
    sphereDisp *= radiusEffect * _Strength; // Apply falloff and intensity
    sphereDisp = clamp(sphereDisp, -0.8, 0.8); // Limit the maximum displacement

    The new positions are then calculated within each blade of grass.

    // Apply interactive effects
    float3 newPos = i == 0 ? POS : POS + (sphereDisp * t);
    triStream.Append(GenerateGrassVertex(newPos, segmentWidth, segmentHeight, segmentForward, float2(0, t), transformMatrix));
    triStream.Append(GenerateGrassVertex(newPos, -segmentWidth, segmentHeight, segmentForward, float2(1, t), transformMatrix));

    Don't forget the outside of the for loop, which is the top vertex.

    // Final grass fragment
    float3 newPosTop = POS + sphereDisp;
    triStream.Append(GenerateGrassVertex(newPosTop, 0, height, forward, float2(0.5, 1), transformationMatrix));
    triStream.RestartStrip();

    In URP, using uniform float3 _PositionMoving may cause SRP Batcher to fail.

    4.2 Script Code

    Bind the object that needs interaction.

    using UnityEngine;
    
    public class ShaderInteractor : MonoBehaviour
    {
        // Update is called once per frame
        void Update()
        {
            Shader.SetGlobalVector("_PositionMoving", transform.position);
        }
    }

    4.3 Full code ‼ ️ (URP)

    Grass shader:

    https://pastebin.com/Zs77EQgy

    5. Compute Shader Rendering Grass v1.0

    Why v1.0? Because rendering a grass sea with compute shaders is genuinely hard, and many of the missing pieces can be improved gradually later. I also wrote some Compute Shader notes:

    1. Compute Shader Study Notes (I)
    2. Compute Shader Learning Notes (II) Post-processing Effects
    3. Compute Shader Learning Notes (II) Particle Effects and Cluster Behavior Simulation
    4. Compute Shader Learning Notes (Part 3) Grass Rendering

    5.1 Review/Organization

    The Compute Shader notes above fully describe how to write a stylized grass sea from scratch in CS. If you forgot, review it here.

    The CPU still does a lot in the initialization stage: it defines the grass Mesh and fills the buffers (blade width and height, each blade's spawn position, random facing, random color depth), and it also passes the maximum bend value and the grass interaction radius to the Compute Shader.

    For each frame, the CPU also passes the time variable, wind direction, wind force/speed, and wind field scaling factor to the Compute Shader.
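    A sketch of that per-frame upload on the C# side (the property names here are illustrative, not necessarily the ones used in the project):

    // Called from Update(): push the values the Compute Shader needs every frame.
    void SetGrassDataUpdate()
    {
        m_ComputeShader.SetFloat("_Time", Time.time);
        m_ComputeShader.SetVector("_WindDirection", windDirection.normalized);
        m_ComputeShader.SetFloat("_WindSpeed", windSpeed);
        m_ComputeShader.SetFloat("_WindScale", windFieldScale);
    }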

    Compute Shader uses the information passed by the CPU to calculate how the grass should turn, using quaternions as output.

    Finally, the material shader consumes the instance ID and the computed results: it first applies the vertex offset, then the quaternion rotation, and finally adjusts the normals.

    This demo can be optimized further: move more work into the Compute Shader (mesh generation, blade width/height, random tilt); expose more parameters for real-time tweaking; add culling, such as distance culling against the camera position or frustum culling, which requires some atomic operations; support multiple interacting objects; refine the interaction logic, for example making the bend amount proportional to a power of the distance to the interacting object; and extend the tooling with a grass-painting brush, which likely needs something like a quadtree storage scheme.

    And in Compute Shader, use vectors instead of scalars when possible.

    First, organize the code. Put all variables that do not need to be sent to the Compute Shader every frame into a function for unified initialization. Organize the Inspector panel. (There are many code changes)

    First, basically all calculations are run on the GPU, except that the world coordinates of each grass are calculated in the CPU and passed to the GPU through a Buffer.

    The size of this buffer depends entirely on the size of the ground mesh and the chosen density. In other words, for a super large open world the buffer becomes huge. For a 5*5 patch of grass with Density set to 0.5, roughly 312,576 blades of data are sent, which is 4 * 312576 * 4 = 5,001,216 bytes. Even at a CPU->GPU transfer speed of around 8 GB/s, that upload is far from free.

    Fortunately, this buffer does not need to be uploaded every frame, but it still deserves attention. If the field grows to 100*100 the cost multiplies several times over, which is scary. Moreover, many of those blades may never be used, which wastes a lot of performance.

    I added a function to generate perlin noise in the Compute Shader, as well as the xorshift128 random number generation algorithm.

    // Perlin random number algorithm
    float hash(float x, float y) {
        return frac(abs(sin(sin(123.321 + x) * (y + 321.123)) * 456.654));
    }
    float perlin(float x, float y){
        float col = 0.0;
        for (int i = 0; i < 8; i++) {
            float fx = floor(x); float fy = floor(y);
            float cx = ceil(x); float cy = ceil(y);
            float a = hash(fx, fy); float b = hash(fx, cy);
            float c = hash(cx, fy); float d = hash(cx, cy);
            col += lerp(lerp(a, b, frac(y)), lerp(c, d, frac(y)), frac(x));
            col /= 2.0; x /= 2.0; y /= 2.0;
        }
        return col;
    }
    // XorShift128 random number algorithm -- Edited Directly output normalized data
    uint state[4];
    void xorshift_init(uint s) {
        state[0] = s; state[1] = s | 0xffff0000u;
        state[2] = s << 16; state[3] = s >> 16;
    }
    float xorshift128() {
        uint t = state[3]; uint s = state[0];
        state[3] = state[2]; state[2] = state[1]; state[1] = s;
        t ^= t << 11u; t ^= t >> 8u;
        state[0] = t ^ s ^ (s >> 19u);
        return (float)state[0] / float(0xffffffffu);
    }
    
    [numthreads(THREADGROUPSIZE,1,1)]
    void BendGrass (uint3 id : SV_DispatchThreadID)
    {
        xorshift_init(id.x * 73856093u ^ id.y * 19349663u ^ id.z * 83492791u);
        ...
    }

    To review: at present the CPU evenly scatters all candidate grass positions inside an AABB, and these are passed to the GPU, where the Compute Shader performs culling, LoD and other operations.

    So far I have three Buffers.

    m_InputBuffer is the structure on the left of the above picture that sends all the grass to the GPU without any culling.

    m_OutputBuffer is a variable length buffer that increases slowly in the Compute Shader. If the grass of the current thread ID is suitable, it will be added to this buffer for instanced rendering later. The structure on the right of the above picture.

    m_argsBuffer is an indirect-arguments buffer, different from the other buffers: it is passed to the draw call and specifies things like the number of vertices to render per instance and the number of instances. Let's look at it in detail:

    First parameter, my grass mesh has seven triangles, so there are 21 vertices to render.

    The second parameter is temporarily set to 0, indicating that nothing needs to be rendered. This number will be dynamically set according to the length of m_OutputBuffer after the Compute Shader calculation is completed. In other words, the number here will be the same as the number of grasses appended in the Compute Shader.

    The third and fourth parameters represent respectively: the index of the first rendered vertex and the index of the first instantiation.

    I haven't used the fifth parameter, so I don't know what it is used for.

    The last step looks like this, passing in the Mesh, material, AABB and parameter Buffer.
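    Put together, the argument-buffer setup looks roughly like this (a sketch; blade is the grass Mesh, and the append buffer's counter is copied into the instance-count slot after dispatch):

    // Five uints, as Graphics.DrawMeshInstancedIndirect expects:
    // [0] index count per instance, [1] instance count,
    // [2] start index location, [3] base vertex location, [4] start instance location.
    uint[] args = new uint[5] { 0, 0, 0, 0, 0 };
    args[0] = blade.GetIndexCount(0);   // 21 indices for the 7-triangle blade
    args[1] = 0;                        // filled in after the Compute Shader runs
    m_argsBuffer = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
    m_argsBuffer.SetData(args);

    // After Dispatch: copy the AppendBuffer's element count into args[1].
    ComputeBuffer.CopyCount(m_OutputBuffer, m_argsBuffer, sizeof(uint));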

    5.2 Customizing Unity Tools

    Create a new C# script and save it in the project's Editor directory (create the folder if it doesn't exist). The script inherits from Editor, and you mark it with [CustomEditor(typeof(XXX))], which means it "works for" XXX: whatever you draw here is attached to XXX's Inspector. Mine works for GrassControl. You can also build a standalone window instead, which should inherit from EditorWindow.

    Write tools in the OnInspectorGUI() function, for example, write a Label.

    GUILayout.Label("== Remo Grass Generator ==");

    To center the Inspector, add a parameter.

    GUILayout.Label("== Remo Grass Generator ==", new GUIStyle(EditorStyles.boldLabel) { alignment = TextAnchor.MiddleCenter });

    Too crowded? Just add a line of space.

    EditorGUILayout.Space();

    If you want your tools to appear above XXX's default Inspector, write all of your logic before calling base.OnInspectorGUI().

    ... // Write here
    // The default Inspector interface of GrassControl
    base.OnInspectorGUI();

    Create a button; the code inside the if block runs when it is pressed:

    if (GUILayout.Button("xxx"))
    {
        ...//Code after pressing

    Anyway, these are the ones I use now.
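    Put together, a minimal editor script of this kind might look like the following (a sketch; the button label and its body are placeholders):

    using UnityEditor;
    using UnityEngine;

    [CustomEditor(typeof(GrassControl))]
    public class GrassControlEditor : Editor
    {
        public override void OnInspectorGUI()
        {
            // Custom tools drawn above the component's default Inspector
            GUILayout.Label("== Remo Grass Generator ==",
                new GUIStyle(EditorStyles.boldLabel) { alignment = TextAnchor.MiddleCenter });
            EditorGUILayout.Space();

            if (GUILayout.Button("Generate Grass"))
            {
                // generation logic goes here
            }

            // Default Inspector of GrassControl below
            base.OnInspectorGUI();
        }
    }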

    5.3 Editor selects the object to generate grass

    It is also easy to grab the GameObject that hosts the target script and display it in the Inspector.

    [SerializeField] private GameObject grassObject;
    ...
    grassObject = (GameObject)EditorGUILayout.ObjectField("Write any name", grassObject, typeof(GameObject), true);
    if (grassObject == null)
    {
        grassObject = FindObjectOfType<GrassControl>()?.gameObject;
    }

    After obtaining it, you can access the contents of the current script through GameObject.

    How to get the object selected in the Editor window? It can be done with one line of code.

    foreach (GameObject obj in Selection.gameObjects)

    Display the selected objects in the Inspector panel. Note that you need to handle the case of multiple selections, otherwise a Warning will be issued.

    // Display the current Editor selected object in real time and control the availability of the button
    EditorGUILayout.LabelField("Selection Info:", EditorStyles.boldLabel);
    bool hasSelection = Selection.activeGameObject != null;
    GUI.enabled = hasSelection;
    if (hasSelection)
        foreach (GameObject obj in Selection.gameObjects)
            EditorGUILayout.LabelField(obj.name);
    else
        EditorGUILayout.LabelField("No active object selected.");

    Next, get the MeshFilter and Renderer of the selected object. Since Raycast detection is required, get a Collider. If it does not exist, create one.

    I won't go into the grass-painting code itself here.

    5.4 Processing AABBs

    After generating a bunch of grass, add each grass to the AABB and finally pass it to Instancing.

    I assume each blade of grass occupies a unit cube, hence Vector3.one. If the grass is particularly tall, this needs to be adjusted.

    Stuff each blade of grass into the big AABB and pass the new AABB back to the script's m_LocalBounds for Instancing.
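    A sketch of that accumulation, assuming grassData holds each blade's world-space position (the result is then handed to the draw call below):

    // Grow one bounds that encloses every blade, padding each by roughly one unit.
    Bounds bounds = new Bounds(grassData[0].position, Vector3.one);
    for (int i = 1; i < grassData.Count; i++)
    {
        bounds.Encapsulate(new Bounds(grassData[i].position, Vector3.one));
    }
    m_LocalBounds = bounds;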

    Graphics.DrawMeshInstancedIndirect(blade, 0, m_Material, m_LocalBounds, m_argsBuffer);

    5.5 Surface Shader – Pitfalls

    There is a small pitfall here. Because the current material is a Surface Shader, its vertex stage already offsets vertices relative to the center of the AABB by default, so the world positions passed in earlier cannot be used directly: the AABB center must also be passed in and subtracted. It feels awkward, and I wonder if there is a more elegant way.

    5.6 Simple Camera Distance Culling + Fade

    Currently, all generated grass is passed to the Compute Shader on the CPU, and then all grass is added to the AppendBuffer, which means there is no culling logic.

    The simplest culling scheme is to cull by the distance between the camera and the blade. Expose a culling distance in the Inspector, compute the distance between the camera and the current blade instance, and if it exceeds that value, simply do not append the blade to the AppendBuffer.

    First, get the camera's world position in C# and pass it to the Compute Shader. Semi-pseudocode:

    // Get the camera
    private Camera m_MainCamera;
    
    m_MainCamera = Camera.main;
    
    if (m_MainCamera != null)
        m_ComputeShader.SetVector(ID_camreaPos, m_MainCamera.transform.position);

    In CS, calculate the distance between the grass and the camera:

    float distanceFromCamera = distance(input.position, _CameraPositionWS);

    The distance function code is as follows:

    float distanceFade = 1 - saturate((distanceFromCamera - _MinFadeDist) / (_MaxFadeDist - _MinFadeDist));

    If the value is less than 0, return directly.

    // skip if out of fading range too
    if (distanceFade < 0.001f)
    {
        return;
    }

    In the transition band between kept and culled, scale the blade width and height by the fade value to get a smooth fade-out.

    Result.height = (bladeHeight + bladeHeightOffset * (xorshift128()*2-1)) * distanceFade;
    Result.width = (bladeWeight + bladeWeightOffset * (xorshift128()*2-1)) * distanceFade;
    ...
    Result.fade = xorshift128() * distanceFade;

    In the figure below, both are set to be relatively small for the convenience of demonstration.

    I think the actual effect is quite good and smooth. If the width and height of the grass are not modified, the effect will be greatly reduced.

    Of course, you can also modify the logic: do not completely remove the grass that exceeds the maximum drawing range, but reduce the number of drawings; or selectively draw the grass in the transition area.

    Both logics are acceptable, and if it were me I would choose the latter.

    5.7 Maintaining a set of visible ID buffers

    Frustum culling here means using the CPU to cut down redundant GPU work before dispatch.

    So how does the Compute Shader know which grass should be rendered and which culled? My approach is to maintain an ID list whose capacity is the total number of blades: culled blades are simply not added, while for visible blades the list records their index.

    List<uint> grassVisibleIDList = new List<uint>();
    
    // buffer that contains the ids of all visible instances
    private ComputeBuffer m_VisibleIDBuffer;
    
    private const int VISIBLE_ID_STRIDE        =  1 * sizeof(uint);
    
    m_VisibleIDBuffer = new ComputeBuffer(grassData.Count, VISIBLE_ID_STRIDE,
        ComputeBufferType.Structured); //uint only, per visible grass
    m_ComputeShader.SetBuffer(m_ID_GrassKernel, "_VisibleIDBuffer", m_VisibleIDBuffer);
    
    m_VisibleIDBuffer?.Release();

    Since some grass has already been removed before reaching the Compute Shader, the dispatch size is no longer based on the total grass count but on the length of the current list.

    // m_ComputeShader.Dispatch(m_ID_GrassKernel, m_DispatchSize, 1, 1);
    
    m_DispatchSize = Mathf.CeilToInt(grassVisibleIDList.Count / (float)threadGroupSize);

    Generates a fully visible ID sequence.

    void GrassFastList(int count)
    {
        grassVisibleIDList = Enumerable.Range(0, count).Select(i => (uint)i).ToList();
    }

    This list is uploaded to the GPU every frame. With the preparation done, the next step is to fill it using a quad/octree.

    5.8 Quad/Octtree Storing Grass Index

    You can consider dividing an AABB into multiple sub-AABBs and then use a quadtree to store and manage them.

    Currently, all grass is in one AABB. Next, we build an octree and put all the grass in this AABB into branches. This makes it easy to do frustum culling in the early stages of the CPU.

    How should it be stored? If the terrain has little vertical variation, a quadtree is enough; for an open world with rolling hills, an octree makes more sense. Since grass density is mostly horizontal, I use a mixed quad/octree here: the parity of the depth decides whether a node splits into four or eight children. If strong height partitioning is not needed, a pure octree also works, just likely a bit less efficiently. The splits here are uniform; a later optimization could size the child AABBs dynamically.

    if (depth % 2 == 0)
    {
        ...
        m_children.Add(new CullingTreeNode(topLeftSingle, depth - 1));
        m_children.Add(new CullingTreeNode(bottomRightSingle, depth - 1));
        m_children.Add(new CullingTreeNode(topRightSingle, depth - 1));
        m_children.Add(new CullingTreeNode(bottomLeftSingle, depth - 1));
    }
    else
    {
        ...
        m_children.Add(new CullingTreeNode(topLeft, depth - 1));
        m_children.Add(new CullingTreeNode(bottomRight, depth - 1));
        m_children.Add(new CullingTreeNode(topRight, depth - 1));
        m_children.Add(new CullingTreeNode(bottomLeft, depth - 1));
    
        m_children.Add(new CullingTreeNode(topLeft2, depth - 1));
        m_children.Add(new CullingTreeNode(bottomRight2, depth - 1));
        m_children.Add(new CullingTreeNode(topRight2, depth - 1));
        m_children.Add(new CullingTreeNode(bottomLeft2, depth - 1));
    }

    The detection of the view frustum and AABB can be done with GeometryUtility.TestPlanesAABB.

    public void RetrieveLeaves(Plane[] frustum, List<Bounds> list, List<int> visibleIDList)
    {
        if (GeometryUtility.TestPlanesAABB(frustum, m_bounds))
        {
            if (m_children.Count == 0)
            {
                if (grassIDHeld.Count > 0)
                {
                    list.Add(m_bounds);
                    visibleIDList.AddRange(grassIDHeld);
                }
            }
            else
            {
                foreach (CullingTreeNode child in m_children)
                {
                    child.RetrieveLeaves(frustum, list, visibleIDList);
                }
            }
        }
    }

    This code is the key part, passing in:

    • The six planes of the camera frustum Plane[]
    • A list of Bounds objects storing all nodes within the frustum
    • A list that collects the indices of all grass contained in the nodes inside the frustum

    By calling the method of this quad/octree, you can get the list of all bounding boxes and grass within the frustum.

    Then all the grass indexes can be made into a Buffer and passed to the Compute Shader.

    m_VisibleIDBuffer.SetData(grassVisibleIDList);

    To get a visual AABB, use the OnDrawGizmos() method.

    Pass all the AABBs obtained by culling the view frustum into this function. This way you can see the AABBs intuitively.
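    A sketch of that gizmo drawing, assuming the culled bounds are collected into a list (here called BoundsListVis) every frame:

    void OnDrawGizmos()
    {
        if (BoundsListVis == null) return;
        Gizmos.color = new Color(1, 0, 0, 0.3f);
        foreach (Bounds b in BoundsListVis)
        {
            Gizmos.DrawWireCube(b.center, b.size);
        }
    }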

    Everything inside the view frustum is also written into the visible-grass list.

    5.9 Flickering grass problem – Pitfalls

    I hit a small pit here. I finished the octree and successfully divided the many sub-AABBs shown above, but when I moved the camera the grass flickered wildly (I was too lazy to record a GIF). Compare the two screenshots below: I only moved the view slightly, which changed the current visibility list, yet the grass positions jumped around a lot, so the grass appeared to flicker constantly.

    I can't figure it out, there is no problem with Compute Shader culling.

    The number of dispatches is also calculated based on the length of the visibility list, so there must be enough threads to compute the shader.

    And there is no problem with DrawMeshInstancedIndirect.

    What's the problem?

    After a long debugging, I found that the problem lies in the process of taking random numbers by Xorshift of Compute Shader.

    Before _VisibleIDBuffer was introduced, each blade corresponded to a fixed thread ID, determined the moment the blade was created, so its random values never changed. Now that the index list is in place, a thread ID no longer identifies a fixed blade; if the random seed still comes from the raw thread ID instead of the visible ID, each blade's random values change whenever the visible list changes, which looks like wild flickering.

    In other words, every place that previously used the thread ID must now use the index value read from _VisibleIDBuffer!
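    So the kernel should first look up its stable grass index through _VisibleIDBuffer and seed the random generator with that, roughly (the HLSL-side buffer and struct names here are illustrative):

    [numthreads(THREADGROUPSIZE,1,1)]
    void BendGrass (uint3 id : SV_DispatchThreadID)
    {
        // Map the thread ID to the blade's stable index before generating anything random,
        // so a blade keeps the same random values no matter where it lands in the visible list.
        uint grassIndex = _VisibleIDBuffer[id.x];
        xorshift_init(grassIndex * 73856093u);
        GrassData input = _SourceBuffer[grassIndex];
        ...
    }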

    5.10 Multi-object Interaction

    Currently only one trampler can be passed in, and if none is assigned an error is thrown, which is unacceptable.

    There are three parameters about interaction:

    • pos – Vector3
    • trampleStrength – Float
    • trampleRadius – Float

    Now pack trampleRadius into the w component of pos (making it a Vector4, or use another slot, depending on your needs), and upload the positions as an array with SetVectorArray. This way each interactive object gets its own interaction radius: larger for bulky objects, smaller for slim ones. That is, remove the following line:

    // In SetGrassDataBase, no need to upload every frame
    // m_ComputeShader.SetFloat("trampleRadius", trampleRadius);

    and replace it with:

    // In SetGrassDataUpdate, each frame must be uploaded
    // Set up multiple interactive objects
    if (trampler.Length > 0)
    {
        Vector4[] positions = new Vector4[trampler.Length];
        for (int i = 0; i < trampler.Length; i++)
        {
            positions[i] = new Vector4(trampler[i].transform.position.x, trampler[i].transform.position.y, trampler[i].transform.position.z,
                trampleRadius);
        }
        m_ComputeShader.SetVectorArray(ID_tramplePos, positions);
    }

    Then you need to pass the number of interactive objects so that the Compute Shader knows how many to process. This also has to be updated every frame. I like to cache a shader property ID for anything that is set every frame, which is more efficient than looking it up by string each time.

    // Initializing
    ID_trampleLength = Shader.PropertyToID("_trampleLength");
    // In each frame
    m_ComputeShader.SetFloat(ID_trampleLength, trampler.Length);

    I repackaged it:

    By modifying the corresponding code, you can adjust the radius of each interactive object in the Inspector. If you want to make this even more flexible, consider passing in a separate Buffer.
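
    For example, a minimal sketch of that idea, assuming a hypothetical per-trampler ComputeBuffer named m_TramplerBuffer holding one float4 per object (xyz = position, w = radius):

    // Hypothetical buffer: one float4 per interactive object (xyz = position, w = radius).
    ComputeBuffer m_TramplerBuffer;
    Vector4[] m_TramplerData;

    void InitTramplerBuffer(int count)
    {
        m_TramplerData = new Vector4[count];
        m_TramplerBuffer = new ComputeBuffer(count, 4 * sizeof(float));
        m_ComputeShader.SetBuffer(m_ID_GrassKernel, "_TramplerBuffer", m_TramplerBuffer);
    }

    void UpdateTramplerBuffer()
    {
        for (int i = 0; i < trampler.Length; i++)
        {
            Vector3 p = trampler[i].transform.position;
            // The per-object radius could come from a serialized array or a component on the trampler.
            m_TramplerData[i] = new Vector4(p.x, p.y, p.z, trampleRadius);
        }
        m_TramplerBuffer.SetData(m_TramplerData);
    }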

    In the Compute Shader, it is relatively simple to combine multiple rotations.

    // Trampler
    float4 qt = float4(0, 0, 0, 1); // 1 in quaternion is like this, the imaginary part is 0
    for (int trampleIndex = 0; trampleIndex < trampleLength; trampleIndex++)
    {
        float trampleRadius = tramplePos[trampleIndex].a;
        float3 relativePosition = input.position - tramplePos[trampleIndex].xyz;
        float dist = length(relativePosition);
        if (dist < trampleRadius) {
            // Use the power to enhance the effect at close range
            float eff = pow((trampleRadius - dist) / trampleRadius, 2) * trampleStrength;
            float3 direction = normalize(relativePosition);
            float3 newTargetDirection = float3(direction.x * eff, 1, direction.z * eff);
            qt = quatMultiply(MapVector(float3(0, 1, 0), newTargetDirection), qt);
        }
    }

    5.11 Editor real-time preview

    The camera currently passed to the Compute Shader is the main camera, i.e. the one used for the Game window. We would like to temporarily use the Scene-view camera while working in the editor and switch back once the game starts. This can be done by hooking into the Scene view GUI event (SceneView.duringSceneGui).

    Here is how I reworked my current code:

    #if UNITY_EDITOR
        SceneView view;
    
        void OnDestroy()
        {
            // When the window is destroyed, remove the delegate
            // so that it will no longer do any drawing.
            SceneView.duringSceneGui -= this.OnScene;
        }
    
        void OnScene(SceneView scene)
        {
            view = scene;
            if (!Application.isPlaying)
            {
                if (view.camera != null)
                {
                    m_MainCamera = view.camera;
                }
            }
            else
            {
                m_MainCamera = Camera.main;
            }
        }
        private void OnValidate()
        {
            // Set up components
            if (!Application.isPlaying)
            {
                if (view != null)
                {
                    m_MainCamera = view.camera;
                }
            }
            else
            {
                m_MainCamera = Camera.main;
            }
        }
    #endif

    When initializing the shader, subscribe to the event first, then check whether we are in Play mode and assign the camera accordingly. In edit mode, m_MainCamera may still be null at this point.

    void InitShader()
    {
    #if UNITY_EDITOR
        SceneView.duringSceneGui += this.OnScene;
        if (!Application.isPlaying)
        {
            if (view != null && view.camera != null)
            {
                m_MainCamera = view.camera;
            }
        }
    #endif
        if (Application.isPlaying)
        {
            m_MainCamera = Camera.main;
        }
        ...

    In the per-frame Update function, if m_MainCamera is null we treat it as edit mode and fall back to the Scene-view camera:

    // Pass in the camera coordinates
            if (m_MainCamera != null)
                m_ComputeShader.SetVector(ID_camreaPos, m_MainCamera.transform.position);
    #if UNITY_EDITOR
            else if (view != null && view.camera != null)
            {
                m_ComputeShader.SetVector(ID_camreaPos, view.camera.transform.position);
            }
    
    #endif

    6. Cutting Grass

    Maintain a set of Cut Buffers

    // added for cutting
    private ComputeBuffer m_CutBuffer;
    float[] cutIDs;

    Initializing Buffer

    private const int CUT_ID_STRIDE            =  1 * sizeof(float);
    // added for cutting
    m_CutBuffer = new ComputeBuffer(grassData.Count, CUT_ID_STRIDE, ComputeBufferType.Structured);
    // added for cutting
    m_ComputeShader.SetBuffer(m_ID_GrassKernel, "_CutBuffer", m_CutBuffer);
    m_CutBuffer.SetData(cutIDs);
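
    One detail not shown above: cutIDs presumably needs to be allocated and filled with -1 (the "uncut" marker used below) before the first SetData call. A minimal sketch:

    // Allocate one entry per blade and mark everything as uncut (-1).
    cutIDs = new float[grassData.Count];
    for (int i = 0; i < cutIDs.Length; i++)
    {
        cutIDs[i] = -1;
    }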

    Don't forget to release it when you disable it.

    // added for cutting
    m_CutBuffer?.Release();

    Define a method that takes the current hit position and a radius, finds the grass blades within that radius, and stores the cut height in the corresponding cutIDs entries (a value of -1 means the blade is uncut).

    // newly added for cutting
    public void UpdateCutBuffer(Vector3 hitPoint, float radius)
    {
        // can't cut grass if there is no grass in the scene
        if (grassData.Count > 0)
        {
            List<int> grasslist = new List<int>();
            // Get the list of IDS that are near the hitpoint within the radius
            cullingTree.ReturnLeafList(hitPoint, grasslist, radius);
            Vector3 brushPosition = this.transform.position;
            // Compute the squared radius to avoid square root calculations
            float squaredRadius = radius * radius;
    
            for (int i = 0; i < grasslist.Count; i++)
            {
                int currentIndex = grasslist[i];
                Vector3 grassPosition = grassData[currentIndex].position + brushPosition;
    
                // Calculate the squared distance
                float squaredDistance = (hitPoint - grassPosition).sqrMagnitude;
    
                // Check if the squared distance is within the squared radius
                // Check whether the grass is still uncut (-1) or was previously cut at a higher point
                if (squaredDistance <= squaredRadius && (cutIDs[currentIndex] > hitPoint.y || cutIDs[currentIndex] == -1))
                {
                    // store cutting point
                    cutIDs[currentIndex] = hitPoint.y;
                }
    
            }
        }
        m_CutBuffer.SetData(cutIDs);
    }

    Then bind a script to the object that needs to be cut:

    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;
    
    
    public class Cutgrass : MonoBehaviour
    {
        [SerializeField]
        GrassControl grassComputeScript;
    
        [SerializeField]
        float radius = 1f;
    
        public bool updateCuts;
    
        Vector3 cachedPos;
        // Start is called before the first frame update
    
    
        // Update is called once per frame
        void Update()
        {
            if (updateCuts && transform.position != cachedPos)
            {
                Debug.Log("Cutting");
                grassComputeScript.UpdateCutBuffer(transform.position, radius);
                cachedPos = transform.position;
    
            }
        }
    
        private void OnDrawGizmos()
        {
            Gizmos.color = new Color(1, 0, 0, 0.3f);
            Gizmos.DrawWireSphere(transform.position, radius);
        }
    }

    In the Compute Shader, just modify the grass height. (Very straightforward...) You can change the effect to whatever you want.

    StructuredBuffer<float> _CutBuffer;// added for cutting
    
        float cut = _CutBuffer[usableID];
        Result.height = (bladeHeight + bladeHeightOffset * (xorshift128()*2-1)) * distanceFade;
        if(cut != -1){
            Result.height *= 0.1f;
        }

    Done!

    References

    1. https://learn.microsoft.com/zh-cn/windows/uwp/graphics-concepts/geometry-shader-stage–gs-
    2. https://roystan.net/articles/grass-shader/
    3. https://danielilett.com/2021-08-24-tut5-17-stylised-grass/
    4. https://catlikecoding.com/unity/tutorials/basics/compute-shaders/
    5. Notes - A preliminary exploration of compute-shader
    6. https://www.patreon.com/posts/53587750
    7. https://www.youtube.com/watch?v=xKJHL8nQiuM
    8. https://www.patreon.com/posts/40090373
    9. https://www.patreon.com/posts/47447321
    10. https://www.patreon.com/posts/wip-patron-only-83683483
    11. https://www.youtube.com/watch?v=DeATXF4Szqo
    12. https://catlikecoding.com/unity/tutorials/basics/compute-shaders/
    13. https://docs.unity3d.com/Manual/class-ComputeShader.html
    14. https://docs.unity3d.com/ScriptReference/ComputeShader.html
    15. https://learn.microsoft.com/en-us/windows/win32/api/D3D11/nf-d3d11-id3d11devicecontext-dispatch
    16. https://zhuanlan.zhihu.com/p/102104374
    17. Unity-compute-shader-Basic knowledge
    18. https://kylehalladay.com/blog/tutorial/2014/06/27/Compute-Shaders-Are-Nifty.html
    19. https://cuihongzhi1991.github.io/blog/2020/05/27/builtinttourp/
    20. https://jadkhoury.github.io/files/MasterThesisFinal.pdf

  • Compute Shader学习笔记(四)之 草地渲染

    Compute Shader Learning Notes (IV) Grass Rendering

    Project address:

    https://github.com/Remyuu/Unity-Compute-Shader-Learn

    img


    L5 Grass Rendering

    The current effect is still quite rough and many details are unpolished; it is merely "working". Since I am also a beginner, please correct me where I get things wrong.

    img

    Summary of knowledge points:

    • Grass Rendering Solution
    • UNITY_PROCEDURAL_INSTANCING_ENABLED
    • bounds.extents
    • Raycasting
    • Rodrigues rotation
    • Quaternion rotation

    Preface 1

    Preface Reference Articles:

    img

    There are many ways to render grass.

    The simplest way is to directly paste a grass texture on it.

    img

    Another common approach is to drag individual grass meshes into the scene by hand. This gives you the most control, since every blade of grass is placed explicitly. Although batching and similar techniques can reduce the CPU-to-GPU transfer cost, this approach will wear out the Ctrl, C, V and D keys on your keyboard. You can, however, type L(a, b) into a Transform field to distribute the selected objects evenly between a and b, or R(a, b) for a random distribution; see the official documentation for more of these operations.

    img

    Grass can also be generated by combining geometry shaders with tessellation shaders. The result looks good, but one shader can only emit one type of geometry (grass); if you want flowers or rocks on the same mesh you have to modify the geometry shader. That is not even the biggest problem: many mobile devices and Metal do not support geometry shaders at all, and where they are supported they are often only emulated in software with poor performance. On top of that, the grass mesh is recomputed every frame, which wastes performance.

    img

    Billboard rendering of grass is also a widely used and long-lived technique. It works very well when high fidelity is not needed: simply render a quad with a texture (alpha clipping), e.g. via DrawProcedural. However, it only holds up at a distance; up close the trick becomes obvious.

    img

    Unity's terrain system can also draw very nice grass, and it uses instancing to keep performance up. The best part is its brush tool; if your workflow does not include the terrain system, third-party plugins can do the same job.

    img

    While researching, I also came across Impostors. It is an interesting combination of the vertex savings of billboards with the ability to reproduce an object convincingly from multiple angles. The technique "photographs" the real mesh from many directions ahead of time and stores the results in textures; at runtime the appropriate texture is chosen based on the camera's viewing direction. It is essentially an upgraded billboard. Impostors are well suited to large objects that players may view from many angles, such as trees or complex buildings, but they can break down when the camera is very close or transitions between two angles. A more reasonable setup is: mesh-based rendering at very close range, Impostors at medium range, and billboards at long range.

    img

    The method to be implemented in this article is based on GPU Instancing, which should be called "per-blade mesh grass". This solution is used in games such as "Ghost of Tsushima", "Genshin Impact" and "The Legend of Zelda: Breath of the Wild". Each grass has its own entity, and the light and shadow effects are quite realistic.

    img

    Rendering process:

    img

    Preface 2

    Unity's instancing technology is quite complex, and I have only scratched the surface; please correct me if you find mistakes. The current code is written according to the documentation. GPU instancing currently supports the following platforms:

    • Windows: DX11 and DX12 with SM 4.0 and above / OpenGL 4.1 and above
    • OS X and Linux: OpenGL 4.1 and above
    • Mobile: OpenGL ES 3.0 and above / Metal
    • PlayStation 4
    • Xbox One

    In addition, Graphics.DrawMeshInstancedIndirect has since been superseded; you should use Graphics.RenderMeshIndirect instead. That function will calculate the bounding box automatically. This is a story for later; for details see the official documentation for RenderMeshIndirect. This article was also helpful:

    https://zhuanlan.zhihu.com/p/403885438.
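
    As a rough sketch (not the code used in this article, and assuming a Unity version that has GraphicsBuffer.IndirectDrawIndexedArgs), a RenderMeshIndirect call looks something like this:

    // Sketch only: draw `instanceCount` copies of `mesh` with `material`.
    var commandBuffer = new GraphicsBuffer(GraphicsBuffer.Target.IndirectArguments, 1,
        GraphicsBuffer.IndirectDrawIndexedArgs.size);
    var commandData = new GraphicsBuffer.IndirectDrawIndexedArgs[1];
    commandData[0].indexCountPerInstance = mesh.GetIndexCount(0);
    commandData[0].instanceCount = (uint)instanceCount;
    commandBuffer.SetData(commandData);

    var rp = new RenderParams(material);
    rp.worldBounds = new Bounds(Vector3.zero, 1000f * Vector3.one); // bounds used for culling
    Graphics.RenderMeshIndirect(rp, mesh, commandBuffer);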

    The principle of GPU Instancing is to send a Draw Call to multiple objects with the same Mesh. The CPU first collects all the information, then puts it into an array and sends it to the GPU at once. The limitation is that the Material and Mesh of these objects must be the same. This is the principle of being able to draw so much grass at a time while maintaining high performance. To achieve GPU Instancing to draw millions of Meshes, you need to follow some rules:

    • All meshes need to use the same Material
    • Check GPU Instancing
    • Shader needs to support instancing
    • Skin Mesh Renderer is not supported

    Since the Skinned Mesh Renderer is not supported, in the previous article we bypassed the SMR and extracted the mesh of each keyframe directly, passing it to the GPU. That is also why the question at the end of the previous article was raised.

    Unity's instancing mainly comes in two flavors: regular GPU Instancing and Procedural Instancing (which involves Compute Shaders and indirect drawing); there is also the stereo rendering path (UNITY_STEREO_INSTANCING_ENABLED), which I won't go into here. In a shader, the former uses #pragma multi_compile_instancing and the latter uses #pragma instancing_options procedural:setup. For details, see the official documentation, Creating shaders that support GPU instancing.

    Also note that the SRP pipelines currently do not support custom GPU Instancing shaders; only the built-in render pipeline (BIRP) does.

    Then there is UNITY_PROCEDURAL_INSTANCING_ENABLED. This macro indicates whether Procedural Instancing is enabled. When using a Compute Shader or the indirect drawing API, instance attributes (such as position and color) can be computed on the GPU in real time and used directly for rendering, without CPU intervention. In the source code, the core of this macro is:

    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        #ifndef UNITY_INSTANCING_PROCEDURAL_FUNC
            #error "UNITY_INSTANCING_PROCEDURAL_FUNC must be defined."
        #else
            void UNITY_INSTANCING_PROCEDURAL_FUNC(); // Forward declaration of the procedural function
            #define DEFAULT_UNITY_SETUP_INSTANCE_ID(input) { UnitySetupInstanceID(UNITY_GET_INSTANCE_ID(input)); UNITY_INSTANCING_PROCEDURAL_FUNC();}
        #endif
    #else
        #define DEFAULT_UNITY_SETUP_INSTANCE_ID(input) { UnitySetupInstanceID(UNITY_GET_INSTANCE_ID(input));}
    #endif

    The Shader is required to define a UNITY_INSTANCING_PROCEDURAL_FUNC function, which is actually the setup() function. If there is no setup() function, an error will be reported.

    Generally speaking, what the setup() function needs to do is to extract the corresponding (unity_InstanceID) data from the Buffer, and then calculate the current instance's position, transformation matrix, color, metalness, or custom data and other attributes.

    GPU Instancing is just one of Unity's many optimization methods, and you still need to continue learning.

    1. Swaying 3-Quad Grass

    All the CS knowledge points used in this chapter have been covered in the previous article, but the background is changed. Draw a simple diagram.

    img

    The implementation uses GPU instancing, i.e. rendering a large number of mesh instances in one go. The core code is a single call:

    Graphics.DrawMeshInstancedIndirect(mesh, 0, material, bounds, argsBuffer);

    The Mesh is composed of three Quads and a total of six triangles.

    img

    Then add a texture + Alpha Test.

    img

    The data structure of grass:

    • Location
    • Tilt Angle
    • Random noise value (used to calculate random tilt angles)
    public Vector3 position; // World coordinates, need to be calculated
    public float lean;
    public float noise;

    public GrassClump(Vector3 pos)
    {
        position.x = pos.x;
        position.y = pos.y;
        position.z = pos.z;
        lean = 0;
        noise = Random.Range(0.5f, 1);
        if (Random.value < 0.5f) noise = -noise;
    }

    Pass the buffer of the grass to be rendered (the world coordinates need to be calculated) to the GPU. First determine where the grass is generated and how much is generated. Get the AABB of the current object's Mesh (assuming it is a Plane Mesh for now).

    Bounds bounds = mf.sharedMesh.bounds;
    Vector3 clumps = bounds.extents;
    img

    Determine the extent of the grass, then randomly generate grass on the xOz plane.

    img


    It should be noted that we are still in object space, so we need to convert Object Space to World Space.

    pos = transform.TransformPoint(pos);

    Combined with the density parameter and the object scaling factor, calculate how many grasses to render in total.

    Vector3 vec = transform.localScale / 0.1f * density;
    clumps.x *= vec.x;
    clumps.z *= vec.z;
    int total = (int)clumps.x * (int)clumps.z;

    Since each Compute Shader thread computes one blade of grass, the number of blades to render will usually not be an exact multiple of the thread group size, so it is rounded up to the next multiple. In other words, when the density factor is 1, the number of blades rendered equals the thread count of one thread group.

    groupSize = Mathf.CeilToInt((float)total / (float)threadGroupSize);
    int count = groupSize * (int)threadGroupSize;

    Let the Compute Shader calculate the tilt angle of each grass.

    GrassClump clump = clumpsBuffer[id.x];
    clump.lean = sin(time) * maxLean * clump.noise;
    clumpsBuffer[id.x] = clump;

    Passing the grass position and rotation angle to the GPU Buffer is not the end. The Material must decide the final appearance of the rendered instance before Graphics.DrawMeshInstancedIndirect can be executed.

    In the rendering process, before the instancing phase (that is, in the procedural:setup function), use unity_InstanceID to determine which clump of grass is currently being rendered, and fetch its world-space position and lean value.

    GrassClump clump = clumpsBuffer[unity_InstanceID];
    _Position = clump.position;
    _Matrix = create_matrix(clump.position, clump.lean);

    Specific rotation + displacement matrix:

    float4x4 create_matrix(float3 pos, float theta)
    {
        float c = cos(theta); // Cosine of the rotation angle
        float s = sin(theta); // Sine of the rotation angle
        // Return a 4x4 transformation matrix: a rotation about the Z axis plus a translation
        return float4x4(
            c, -s, 0, pos.x,
            s,  c, 0, pos.y,
            0,  0, 1, pos.z,
            0,  0, 0, 1
        );
    }

    How is this matrix derived? Substitute the axis (0, 0, 1) into the Rodrigues rotation formula to get a 3x3 rotation matrix, then extend it to homogeneous coordinates and add the translation; that gives the code above.

    img
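
    For reference, here is that derivation written out (a standard identity, my own recap rather than something from the referenced article). The Rodrigues formula for a rotation by angle $\theta$ about a unit axis with cross-product matrix $K$ is $R = I + \sin\theta\,K + (1-\cos\theta)K^2$; with the axis $(0,0,1)$:

    $$
    K=\begin{pmatrix}0&-1&0\\1&0&0\\0&0&0\end{pmatrix}
    \;\Rightarrow\;
    R=\begin{pmatrix}\cos\theta&-\sin\theta&0\\\sin\theta&\cos\theta&0\\0&0&1\end{pmatrix},
    \qquad
    M=\begin{pmatrix}R & p\\ \mathbf{0}^{T} & 1\end{pmatrix}
    $$

    which, written out, is exactly the 4x4 matrix in create_matrix above.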

    Multiply this matrix by the vertices of Object Space to get the vertex coordinates of the dumped + displaced vertex.

    v.vertex.xyz *= _Scale;
    float4 rotatedVertex = mul(_Matrix, v.vertex);
    v.vertex = rotatedVertex;

    Now comes the problem. Currently the grass is not a plane, but a three-dimensional figure composed of three groups of Quads.

    img

    If you simply rotate all vertices along the z-axis, the grass roots will be greatly offset.

    img

    Therefore, we use v.texcoord.y to lerp the vertex positions before and after the rotation. In this way, the higher the Y value of the texture coordinate (that is, the closer the vertex is to the top of the model), the greater the rotation effect on the vertex. Since the Y value of the grass root is 0, the grass root will not shake after lerp.

    v.vertex.xyz *= _Scale;
    float4 rotatedVertex = mul(_Matrix, v.vertex);
    // v.vertex = rotatedVertex;
    v.vertex.xyz += _Position;
    v.vertex = lerp(v.vertex, rotatedVertex, v.texcoord.y);

    The effect is very poor, the grass is too fake. This kind of Quad grass can only be used from a distance.

    • Swinging stiffness
    • Stiff leaves
    • Poor lighting effects
    img

    Current version code:

    2. Stylized Grass

    In the previous section, I used several Quads and grass with alpha maps, and used sin waves for disturbance, but the effect was very average. Now I will use stylized grass and Perlin noise to improve it.

    Define the grass' vertices, normals and UVs in C# and pass them to the GPU as a Mesh.

    Vector3[] vertices =
    {
        new Vector3(-halfWidth, 0, 0),
        new Vector3( halfWidth, 0, 0),
        new Vector3(-halfWidth, rowHeight, 0),
        new Vector3( halfWidth, rowHeight, 0),
        new Vector3(-halfWidth * 0.9f, rowHeight * 2, 0),
        new Vector3( halfWidth * 0.9f, rowHeight * 2, 0),
        new Vector3(-halfWidth * 0.8f, rowHeight * 3, 0),
        new Vector3( halfWidth * 0.8f, rowHeight * 3, 0),
        new Vector3( 0, rowHeight * 4, 0)
    };
    Vector3 normal = new Vector3(0, 0, -1);
    Vector3[] normals = { normal, normal, normal, normal, normal, normal, normal, normal, normal };
    Vector2[] uvs =
    {
        new Vector2(0, 0),     new Vector2(1, 0),
        new Vector2(0, 0.25f), new Vector2(1, 0.25f),
        new Vector2(0, 0.5f),  new Vector2(1, 0.5f),
        new Vector2(0, 0.75f), new Vector2(1, 0.75f),
        new Vector2(0.5f, 1)
    };

    Unity's Mesh also requires a consistent vertex winding order: triangles wound clockwise, as seen from the front, are treated as front-facing. If you wind them the wrong way with backface culling enabled, you won't see anything.

    img
    int[] indices =
    {
        0, 1, 2,  1, 3, 2,  // row 1
        2, 3, 4,  3, 5, 4,  // row 2
        4, 5, 6,  5, 7, 6,  // row 3
        6, 7, 8             // row 4
    };
    mesh.SetIndices(indices, MeshTopology.Triangles, 0);

    The wind direction, size and noise ratio are set in the code, packed into a float4, and passed to the Compute Shader to calculate the swinging direction of a blade of grass.

    Vector4 wind = new Vector4(Mathf.Cos(theta), Mathf.Sin(theta), windSpeed, windScale);

    A blade of grass data structure

    struct GrassBlade
    {
        public Vector3 position;
        public float bend;   // Random blade lean
        public float noise;  // Noise value computed by the Compute Shader
        public float fade;   // Random blade brightness
        public float face;   // Blade facing

        public GrassBlade(Vector3 pos)
        {
            position.x = pos.x;
            position.y = pos.y;
            position.z = pos.z;
            bend = 0;
            noise = Random.Range(0.5f, 1) * 2 - 1;
            fade = Random.Range(0.5f, 1);
            face = Random.Range(0, Mathf.PI);
        }
    }

    Currently, the grass blades are all oriented in the same direction. In the Setup function, first change the blade orientation.

    // Create a rotation matrix around the Y axis (facing)
    float4x4 rotationMatrixY = AngleAxis4x4(blade.position, blade.face, float3(0,1,0));
    img

    The logic of tipping the grass blades (since AngleAxis4x4 includes displacement, the following figure only demonstrates the tipping of the blades without random orientation. If you want to get the effect shown in the figure below, remember to add displacement to the code):

    // Create a rotation matrix around the X axis (lean)
    float4x4 rotationMatrixX = AngleAxis4x4(float3(0,0,0), blade.bend, float3(1,0,0));
    img

    Then combine the two rotation matrices.

    _Matrix = mul(rotationMatrixY, rotationMatrixX);
    img

    The lighting is now very strange because the normals are not modified.

    // Compute the transpose matrix used for the normal transformation
    float3x3 normalMatrix = (float3x3)transpose((float3x3)_Matrix);
    // Transform the normal
    v.normal = mul(normalMatrix, v.normal);

    Here is the transpose helper:

    float3x3 transpose(float3x3 m)
    {
        return float3x3(
            float3(m[0][0], m[1][0], m[2][0]), // Column 1
            float3(m[0][1], m[1][1], m[2][1]), // Column 2
            float3(m[0][2], m[1][2], m[2][2])  // Column 3
        );
    }

    For readability, fold the translation into a homogeneous 4x4 matrix built from the well-known axis-angle rotation formula:

    float4x4 AngleAxis4x4(float3 pos, float angle, float3 axis)
    {
        float c, s;
        sincos(angle * 2 * 3.14, s, c);
        float t = 1 - c;
        float x = axis.x;
        float y = axis.y;
        float z = axis.z;
        return float4x4(
            t * x * x + c,     t * x * y - s * z, t * x * z + s * y, pos.x,
            t * x * y + s * z, t * y * y + c,     t * y * z - s * x, pos.y,
            t * x * z - s * y, t * y * z + s * x, t * z * z + c,     pos.z,
            0, 0, 0, 1
        );
    }
    img
    img
    img

    What if you want to spawn on uneven ground?

    img

    You only need to modify the logic of generating the initial height of the grass, and use MeshCollider and ray detection.

    bladesArray = new GrassBlade[count];
    gameObject.AddComponent<MeshCollider>();
    RaycastHit hit;
    Vector3 v = new Vector3();
    Debug.Log(bounds.center.y + bounds.extents.y);
    v.y = bounds.center.y + bounds.extents.y;
    v = transform.TransformPoint(v);
    float heightWS = v.y + 0.01f; // Small offset for floating point error
    v.Set(0, 0, 0);
    v.y = bounds.center.y - bounds.extents.y;
    v = transform.TransformPoint(v);
    float neHeightWS = v.y;
    float range = heightWS - neHeightWS;
    // heightWS += 10; // Increase the margin slightly; tune as needed
    int index = 0;
    int loopCount = 0;
    while (index < count && loopCount < (count * 10))
    {
        loopCount++;
        Vector3 pos = new Vector3(
            Random.value * bounds.extents.x * 2 - bounds.extents.x + bounds.center.x,
            0,
            Random.value * bounds.extents.z * 2 - bounds.extents.z + bounds.center.z);
        pos = transform.TransformPoint(pos);
        pos.y = heightWS;
        if (Physics.Raycast(pos, Vector3.down, out hit))
        {
            pos.y = hit.point.y;
            GrassBlade blade = new GrassBlade(pos);
            bladesArray[index++] = blade;
        }
    }

    Here, rays are used to detect the position of each grass and calculate its correct height.

    img

    You can also adjust it so that the higher the altitude, the sparser the grass.

    img

    As shown above, calculate the ratio of the two green arrows. The higher the altitude, the lower the probability of generation.

    float deltaHeight = (pos.y - neHeightWS) / range;
    if (Random.value > deltaHeight)
    {
        // Spawn grass here
    }
    img
    img

    Current code link:

    Now there is no problem with lighting or shadow.

    3. Interactive Grass

    In the previous section we first rotated the blade's facing and then applied its lean. Now we need one more rotation: when an object approaches, the grass should fall away from it. Chaining matrices for this gets awkward, so we switch to quaternions. The quaternion is computed in the Compute Shader, stored in the blade's struct, and passed to the material; finally, in the vertex shader, the quaternion is converted back to a transformation matrix to apply the rotation.

    Here we add random width and height of grass. Because each grass mesh is the same, we can't modify the height of grass by modifying the mesh. So we can only do vertex offset in Vert.

    // C#
    [Range(0, 0.5f)] public float width = 0.2f;
    [Range(0, 1f)]   public float rd_width = 0.1f;
    [Range(0, 2)]    public float height = 1f;
    [Range(0, 1f)]   public float rd_height = 0.2f;

    GrassBlade blade = new GrassBlade(pos);
    blade.height = Random.Range(-rd_height, rd_height);
    blade.width = Random.Range(-rd_width, rd_width);
    bladesArray[index++] = blade;

    // setup() starts with
    GrassBlade blade = bladesBuffer[unity_InstanceID];
    _HeightOffset = blade.height_offset;
    _WidthOffset = blade.width_offset;

    // vert() starts with
    float tempHeight = v.vertex.y * _HeightOffset;
    float tempWidth = v.vertex.x * _WidthOffset;
    v.vertex.y += tempHeight;
    v.vertex.x += tempWidth;

    To sort it out, the current grass Buffer stores:

    struct GrassBlade
    {
        public Vector3 position;      // World position - needs initialization
        public float height;          // Grass height offset - needs initialization
        public float width;           // Grass width offset - needs initialization
        public float dir;             // Blade facing - needs initialization
        public float fade;            // Random blade shading - needs initialization
        public Quaternion quaternion; // Rotation - computed in the CS, consumed in vert
        public float padding;

        public GrassBlade(Vector3 pos)
        {
            position.x = pos.x;
            position.y = pos.y;
            position.z = pos.z;
            height = width = 0;
            dir = Random.Range(0, 180);
            fade = Random.Range(0.99f, 1);
            quaternion = Quaternion.identity;
            padding = 0;
        }
    }
    int SIZE_GRASS_BLADE = 12 * sizeof(float);

    The quaternion q used to represent the rotation from vector v1 to vector v2 is:

    float4 MapVector(float3 v1, float3 v2)
    {
        v1 = normalize(v1);
        v2 = normalize(v2);
        float3 v = normalize(v1 + v2);
        float4 q = 0;
        q.w = dot(v, v2);
        q.xyz = cross(v, v2);
        return q;
    }

    To combine two rotational quaternions, you need to use multiplication (note the order).

    Suppose there are two quaternions q1 = (w1, x1, y1, z1) and q2 = (w2, x2, y2, z2), each written as a real part w plus an imaginary part (x, y, z). The formula for their product q1 q2 is:

    $$
    q_1 q_2 = (w_1 w_2 - x_1 x_2 - y_1 y_2 - z_1 z_2)
            + (w_1 x_2 + x_1 w_2 + y_1 z_2 - z_1 y_2)\,i
            + (w_1 y_2 - x_1 z_2 + y_1 w_2 + z_1 x_2)\,j
            + (w_1 z_2 + x_1 y_2 - y_1 x_2 + z_1 w_2)\,k
    $$

    float4 quatMultiply(float4 q1, float4 q2)
    {
        // q1 = q1.w + q1.x i + q1.y j + q1.z k
        // q2 = q2.w + q2.x i + q2.y j + q2.z k
        // Result = q1 * q2, stored as (x, y, z, w)
        return float4(
            q1.w * q2.x + q1.x * q2.w + q1.y * q2.z - q1.z * q2.y, // X (i) component
            q1.w * q2.y - q1.x * q2.z + q1.y * q2.w + q1.z * q2.x, // Y (j) component
            q1.w * q2.z + q1.x * q2.y - q1.y * q2.x + q1.z * q2.w, // Z (k) component
            q1.w * q2.w - q1.x * q2.x - q1.y * q2.y - q1.z * q2.z  // W (real) component
        );
    }

    To know where the grass should fall, we need the position of the interactive object (the trampler), i.e. its Transform. It is uploaded every frame with SetVector, so the shader property is cached as an ID rather than looked up by string each time. We also need trampleRadius to define the affected range and the falloff between bent and unbent grass; since it is constant, it does not change every frame and can be set once using the string name.

    // C#
    public Transform trampler;
    [Range(0.1f, 5f)] public float trampleRadius = 3f;
    ...
    Init(){
        shader.SetFloat("trampleRadius", trampleRadius);
        tramplePosID = Shader.PropertyToID("tramplePos");
    }
    Update(){
        shader.SetVector(tramplePosID, pos);
    }

    In this section, all rotation operations are thrown into the Compute Shader and computed at once, and a single quaternion is returned to the material. q1 is the quaternion for the random facing, q2 the random lean, and qt the interactive lean. You can also expose an interaction strength factor in the Inspector.

    [numthreads(THREADGROUPSIZE,1,1)]
    void BendGrass (uint3 id : SV_DispatchThreadID)
    {
        GrassBlade blade = bladesBuffer[id.x];
        float3 relativePosition = blade.position - tramplePos.xyz;
        float dist = length(relativePosition);
        float4 qt;
        if (dist < trampleRadius)
        {
            ...
        }
        ...
    }

    Then the method of converting quaternion to rotation matrix is:

    float4x4 quaternion_to_matrix(float4 quat)
    {
        float4x4 m = float4x4(float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0));

        float x = quat.x, y = quat.y, z = quat.z, w = quat.w;
        float x2 = x + x, y2 = y + y, z2 = z + z;
        float xx = x * x2, xy = x * y2, xz = x * z2;
        float yy = y * y2, yz = y * z2, zz = z * z2;
        float wx = w * x2, wy = w * y2, wz = w * z2;

        m[0][0] = 1.0 - (yy + zz); m[0][1] = xy - wz;         m[0][2] = xz + wy;
        m[1][0] = xy + wz;         m[1][1] = 1.0 - (xx + zz); m[1][2] = yz - wx;
        m[2][0] = xz - wy;         m[2][1] = yz + wx;         m[2][2] = 1.0 - (xx + yy);

        m[0][3] = _Position.x;
        m[1][3] = _Position.y;
        m[2][3] = _Position.z;
        m[3][3] = 1.0;

        return m;
    }

    Then apply it.

    void vert(inout appdata_full v, out Input data)
    {
        UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        float tempHeight = v.vertex.y * _HeightOffset;
        float tempWidth = v.vertex.x * _WidthOffset;
        v.vertex.y += tempHeight;
        v.vertex.x += tempWidth;
        // Apply the model vertex transformation
        v.vertex = mul(_Matrix, v.vertex);
        v.vertex.xyz += _Position;
        // Transform the normal with the transpose of the rotation
        v.normal = mul((float3x3)transpose(_Matrix), v.normal);
    #endif
    }

    void setup()
    {
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        // Fetch the Compute Shader results
        GrassBlade blade = bladesBuffer[unity_InstanceID];
        _HeightOffset = blade.height_offset;
        _WidthOffset = blade.width_offset;
        _Fade = blade.fade;                                // Set the shading
        _Matrix = quaternion_to_matrix(blade.quaternion);  // Set the final rotation matrix
        _Position = blade.position;                        // Set the position
    #endif
    }
    img
    img

    Current code link:

    4. Summary/Quiz

    How do you programmatically get the thread group sizes of a kernel?

    img

    When defining a Mesh in code, the number of normals must be the same as the number of vertex positions. True or false.

    img
  • Compute Shader学习笔记(三)之 粒子效果与群集行为模拟

    Compute Shader Learning Notes (Part 3) Particle Effects and Cluster Behavior Simulation

    img

    This follows directly from the previous article:

    remoooo: Compute Shader Learning Notes (II): Post-Processing Effects

    L4 Particle Effects and Flocking Simulation

    This chapter uses a Compute Shader to generate particles, and covers how to use DrawProcedural and DrawMeshInstancedIndirect, i.e. GPU Instancing.

    Summary of knowledge points:

    • Compute Shader, Material, C# script and Shader working together
    • Graphics.DrawProcedural
    • material.SetBuffer()
    • The xorshift random number algorithm
    • Flocking simulation
    • Graphics.DrawMeshInstancedIndirect
    • Rotation/translation/scale matrices and homogeneous coordinates
    • Surface Shader
    • ComputeBufferType.Default
    • #pragma instancing_options procedural:setup
    • unity_InstanceID
    • Skinned Mesh Renderer
    • Data alignment

    1. Introduction and Preparation

    Besides processing large amounts of data in parallel, a Compute Shader has another key advantage: its buffers live in GPU memory. Data produced by a Compute Shader can therefore be handed directly to the Shader associated with a Material, i.e. the vertex/fragment shader. The key point is that a Material can call SetBuffer() just like a Compute Shader, reading data straight out of a GPU buffer!

    img

    Building a particle system with a Compute Shader is a great showcase of its parallel power.

    During rendering, the vertex shader reads each particle's position and other attributes from the Compute Buffer and turns them into on-screen vertices; the fragment shader then produces pixels from those vertices' data (such as position and color). With Graphics.DrawProcedural, Unity can render these shader-produced vertices directly, without a predefined mesh and without a Mesh Renderer, which is especially effective for rendering huge numbers of particles.

    2. Hello, Particles

    The steps are simple: define the particle data (position, velocity, lifetime) in C#, initialize it and upload it to a buffer, then bind the buffer to both the Compute Shader and the Material. For rendering, call Graphics.DrawProceduralNow inside OnRenderObject() to render the particles efficiently.

    img

    Create a new scene and build this effect: a million particles blooming and following the mouse, as shown below:

    img

    Writing this, my thoughts start to wander. A particle's life is brief: it ignites in an instant like a spark and is gone like a shooting star. Whatever hardships there are, I too am just one mote among billions of dust grains, ordinary and small. These particles may drift randomly through space (their spawn positions computed with the "Xorshift" algorithm) and may each have a unique color, but in the end they cannot escape the fate the program has written for them. Isn't that a portrait of my own life, dutifully playing its assigned role, unable to escape invisible constraints?

    "God is dead! And we who have killed him, how can we not feel the greatest pain?" – Friedrich Nietzsche

    Nietzsche was not only announcing the fading of religious faith; he was pointing at the nihilism facing modern people: without the traditional pillars of morality and religion, people feel an unprecedented loneliness and loss of direction. The particles are defined and created in a C# script and move and die by fixed rules, which is much like the condition of modern people in the universe Nietzsche described: everyone tries to find their own meaning, yet remains bound by wider social and cosmic rules.

    Life is full of unavoidable pain, reflecting the inherent emptiness and loneliness of human existence. Breakups, partings and deaths, setbacks at work, and the particle-death logic we are about to write all bear out what Nietzsche expressed: nothing in life is permanent. The particles in a buffer are bound to disappear at some future moment, which mirrors the loneliness of the modern individual; everyone is a solitary fighter who must learn to face the inner storm and the world's indifference alone.

    But that is all right: "Summer will come around again, and the people who are meant to meet will meet again." The particles in this article will likewise be respawned after they die, embracing their buffer in their best state.

    Summer will come around again. People who meet will meet again.

    img

    Code for this version; feel free to copy it and run it yourself (everything is commented):

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Shaders/ParticleFun.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Scripts/ParticleFun.cs
    • Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Shaders/Particle.shader

    Enough rambling; let's see how the C# script is written.

    img

    As usual, first define the particle struct for the buffer, initialize it, and upload it to the GPU; the key part is the last three lines, which bind the buffer to the shaders. The code elided below is routine and is covered by the comments.

    struct Particle{
        public Vector3 position; // Particle position
        public Vector3 velocity; // Particle velocity
        public float life;       // Particle lifetime
    }
    ComputeBuffer particleBuffer; // The GPU-side buffer
    ...
    // In Init()
        // Initialize the particle array
        Particle[] particleArray = new Particle[particleCount];
        for (int i = 0; i < particleCount; i++){
            // Generate a random position and normalize it
            ...
            // Set the particle's initial position and velocity
            ... 
            // Set the particle's lifetime
            particleArray[i].life = Random.value * 5.0f + 1.0f;
        }
        // Create and fill the Compute Buffer
        ...
        // Find the kernel ID in the Compute Shader
        ...
        // Bind the Compute Buffer to the shaders
        shader.SetBuffer(kernelID, "particleBuffer", particleBuffer);
        material.SetBuffer("particleBuffer", particleBuffer);
        material.SetInt("_PointSize", pointSize);

    Now comes the key rendering step, OnRenderObject(). material.SetPass selects the material pass to render with. DrawProceduralNow draws geometry without a conventional mesh. MeshTopology.Points sets the topology to points, so the GPU treats each vertex as an isolated point and forms no edges or faces between vertices. The second parameter (1) is the vertex count per instance, and particleCount is the instance count, i.e. it tells the GPU how many points to render in total.

    void OnRenderObject()
    {
        material.SetPass(0);
        Graphics.DrawProceduralNow(MeshTopology.Points, 1, particleCount);
    }

    Here is how the current mouse position is obtained. Note that OnGUI() may be called several times per frame. The z value is set to the camera's near clip plane plus an offset; adding 14 here gives a world-space position at a more suitable visual depth (adjust it as you like).

    void OnGUI()
    {
        Vector3 p = new Vector3();
        Camera c = Camera.main;
        Event e = Event.current;
        Vector2 mousePos = new Vector2();
        // Get the mouse position from Event.
        // Note that the y position from Event is inverted.
        mousePos.x = e.mousePosition.x;
        mousePos.y = c.pixelHeight - e.mousePosition.y;
        p = c.ScreenToWorldPoint(new Vector3(mousePos.x, mousePos.y, c.nearClipPlane + 14));
        cursorPos.x = p.x;
        cursorPos.y = p.y;
    }

    The ComputeBuffer particleBuffer has now been handed to both the Compute Shader and the regular Shader.

    First, the Compute Shader's data structures; nothing special here.

    // Particle data structure
    struct Particle
    {
        float3 position;  // Particle position
        float3 velocity;  // Particle velocity
        float life;       // Remaining lifetime
    };
    // Structured buffer that stores and updates the particle data, readable and writable on the GPU
    RWStructuredBuffer<Particle> particleBuffer;
    // Variables set from the CPU
    float deltaTime;       // Time elapsed since the previous frame
    float2 mousePosition;  // Current mouse position
    img

    Here is a quick introduction to a particularly handy random number generator, the xorshift algorithm. It will be used in a moment to randomize the particles' motion directions, as in the image above; particles will move in random 3D directions.

    • Reference: https://en.wikipedia.org/wiki/Xorshift
    • Original paper: https://www.jstatsoft.org/article/view/v008i14

    The algorithm was proposed by George Marsaglia in 2003. Its strengths are that it is extremely fast and uses very little memory; even the simplest Xorshift implementation has a fairly long pseudo-random period.

    The basic operations are bit shifts and XORs, which is where the name comes from. The core idea is to maintain a non-zero state variable and generate random numbers by applying a series of shift and XOR operations to it.

    // State variable used to generate random numbers
    uint rng_state;
    uint rand_xorshift() {
        // Xorshift algorithm from George Marsaglia's paper
        rng_state ^= (rng_state << 13);  // Shift the state left by 13 bits and XOR with the previous state
        rng_state ^= (rng_state >> 17);  // Shift the updated state right by 17 bits and XOR again
        rng_state ^= (rng_state << 5);   // Finally shift left by 5 bits and XOR one last time
        return rng_state;                // Return the updated state as the random number
    }

    The core of the basic Xorshift algorithm is described above, but different shift combinations produce many variants. The original paper also presents the Xorshift128 variant, which uses 128 bits of state updated by four different shift and XOR operations. The code is as follows:

    img
    // C language version
    uint32_t xorshift128(void) {
        static uint32_t x = 123456789;
        static uint32_t y = 362436069;
        static uint32_t z = 521288629;
        static uint32_t w = 88675123; 
        uint32_t t = x ^ (x << 11);
        x = y; y = z; z = w;
        w = w ^ (w >> 19) ^ (t ^ (t >> 8));
        return w;
    }

    It yields a longer period and better statistical quality; the period of this variant is close to 2^128 − 1, which is impressive.

    Overall, this algorithm is more than good enough for game development; it is just not suitable for cryptography and similar fields.

    When using it in a Compute Shader, note that Xorshift produces values over the whole uint32 range, so they need to be remapped from [0, 2^32−1] to [0, 1]:

    float tmp = (1.0 / 4294967296.0);  // Conversion factor
    float value01 = float(rand_xorshift()) * tmp;

    The particles' motion directions are signed, so simply subtract 0.5 from that. Random motion along the three axes:

    float f0 = float(rand_xorshift()) * tmp - 0.5;
    float f1 = float(rand_xorshift()) * tmp - 0.5;
    float f2 = float(rand_xorshift()) * tmp - 0.5;
    float3 normalF3 = normalize(float3(f0, f1, f2)) * 0.8f; // Scale the motion direction

    Each kernel invocation needs to:

    • Read the particle's data from the previous frame out of the buffer
    • Update the particle (compute its velocity, update its position and lifetime) and write it back to the buffer
    • If the lifetime drops below 0, respawn the particle

    When respawning a particle, use the Xorshift random numbers for its initial position, set its lifetime, and reset its velocity.

    // Set the particle's new position and lifetime
    particleBuffer[id].position = float3(normalF3.x + mousePosition.x, normalF3.y + mousePosition.y, normalF3.z + 3.0);
    particleBuffer[id].life = 4;  // Reset the lifetime
    particleBuffer[id].velocity = float3(0,0,0);  // Reset the velocity

    Finally, the Shader's basic data structures:

    struct Particle{
        float3 position;
        float3 velocity;
        float life;
    };
    struct v2f{
        float4 position : SV_POSITION;
        float4 color : COLOR;
        float life : LIFE;
        float size: PSIZE;
    };
    // particles' data
    StructuredBuffer<Particle> particleBuffer;

    Then, in the vertex shader, compute each particle's vertex color and clip-space position, and pass along a point size.

    v2f vert(uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID){
        v2f o = (v2f)0;
        // Color
        float life = particleBuffer[instance_id].life;
        float lerpVal = life * 0.25f;
        o.color = fixed4(1.0f - lerpVal+0.1, lerpVal+0.1, 1.0f, lerpVal);
        // Position
        o.position = UnityObjectToClipPos(float4(particleBuffer[instance_id].position, 1.0f));
        o.size = _PointSize;
        return o;
    }

    The fragment shader simply outputs the interpolated color.

    float4 frag(v2f i) : COLOR{
        return i.color;
    }

    With that, you get the effect shown above.

    img

    3. Quad Particles

    In the previous section each particle was only a single point, which is not very interesting. Now let's turn each point into a quad. In Unity there is no true quad here, only a fake quad made of two triangles.

    Let's get to it, building on the code above. In C#, define the vertex struct and the size of a quad.

    // struct
    struct Vertex
    {
        public Vector3 position;
        public Vector2 uv;
        public float life;
    }
    const int SIZE_VERTEX = 6 * sizeof(float);
    public float quadSize = 0.1f; // Quad size
    img

    For each particle, set the UV coordinates of its six vertices for the vertex shader, and lay them out in the winding order Unity expects.

    index = i*6;
        //Triangle 1 - bottom-left, top-left, top-right
        vertexArray[index].uv.Set(0,0);
        vertexArray[index+1].uv.Set(0,1);
        vertexArray[index+2].uv.Set(1,1);
        //Triangle 2 - bottom-left, top-right, bottom-right
        vertexArray[index+3].uv.Set(0,0);
        vertexArray[index+4].uv.Set(1,1);
        vertexArray[index+5].uv.Set(1,0);

    Finally, upload everything to the buffers. halfSize is passed to the Compute Shader so it can compute the positions of each quad's vertices.

    vertexBuffer = new ComputeBuffer(numVertices, SIZE_VERTEX);
    vertexBuffer.SetData(vertexArray);
    shader.SetBuffer(kernelID, "vertexBuffer", vertexBuffer);
    shader.SetFloat("halfSize", quadSize*0.5f);
    material.SetBuffer("vertexBuffer", vertexBuffer);

    In the render step, switch the topology from points to triangles, six vertices per particle.

    void OnRenderObject()
    {
        material.SetPass(0);
        Graphics.DrawProceduralNow(MeshTopology.Triangles, 6, numParticles);
    }

    Adjust the Shader to receive the vertex data, plus a texture for display; the alpha also needs handling (blending/clipping).

    _MainTex("Texture", 2D) = "white" {}     
    ...
    Tags{ "Queue"="Transparent" "RenderType"="Transparent" "IgnoreProjector"="True" }
    LOD 200
    Blend SrcAlpha OneMinusSrcAlpha
    ZWrite Off
    ...
        struct Vertex{
            float3 position;
            float2 uv;
            float life;
        };
        StructuredBuffer<Vertex> vertexBuffer;
        sampler2D _MainTex;
        v2f vert(uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID)
        {
            v2f o = (v2f)0;
            int index = instance_id*6 + vertex_id;
            float lerpVal = vertexBuffer[index].life * 0.25f;
            o.color = fixed4(1.0f - lerpVal+0.1, lerpVal+0.1, 1.0f, lerpVal);
            o.position = UnityWorldToClipPos(float4(vertexBuffer[index].position, 1.0f));
            o.uv = vertexBuffer[index].uv;
            return o;
        }
        float4 frag(v2f i) : COLOR
        {
            fixed4 color = tex2D( _MainTex, i.uv ) * i.color;
            return color;
        }

    In the Compute Shader, add the vertex buffer and halfSize.

    struct Vertex
    {
        float3 position;
        float2 uv;
        float life;
    };
    RWStructuredBuffer<Vertex> vertexBuffer;
    float halfSize;

    Compute the positions of the six vertices of each quad.

    img
    //Set the vertex buffer //
        int index = id.x * 6;
        //Triangle 1 - bottom-left, top-left, top-right   
        vertexBuffer[index].position.x = p.position.x-halfSize;
        vertexBuffer[index].position.y = p.position.y-halfSize;
        vertexBuffer[index].position.z = p.position.z;
        vertexBuffer[index].life = p.life;
        vertexBuffer[index+1].position.x = p.position.x-halfSize;
        vertexBuffer[index+1].position.y = p.position.y+halfSize;
        vertexBuffer[index+1].position.z = p.position.z;
        vertexBuffer[index+1].life = p.life;
        vertexBuffer[index+2].position.x = p.position.x+halfSize;
        vertexBuffer[index+2].position.y = p.position.y+halfSize;
        vertexBuffer[index+2].position.z = p.position.z;
        vertexBuffer[index+2].life = p.life;
        //Triangle 2 - bottom-left, top-right, bottom-right  // // 
        vertexBuffer[index+3].position.x = p.position.x-halfSize;
        vertexBuffer[index+3].position.y = p.position.y-halfSize;
        vertexBuffer[index+3].position.z = p.position.z;
        vertexBuffer[index+3].life = p.life;
        vertexBuffer[index+4].position.x = p.position.x+halfSize;
        vertexBuffer[index+4].position.y = p.position.y+halfSize;
        vertexBuffer[index+4].position.z = p.position.z;
        vertexBuffer[index+4].life = p.life;
        vertexBuffer[index+5].position.x = p.position.x+halfSize;
        vertexBuffer[index+5].position.y = p.position.y-halfSize;
        vertexBuffer[index+5].position.z = p.position.z;
        vertexBuffer[index+5].life = p.life;

    And we're done.

    img

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Shaders/QuadParticles.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Scripts/QuadParticles.cs
    • Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Shaders/QuadParticle.shader

    In the next section, the mesh is upgraded to a prefab, and we try to simulate the flocking behavior of birds in flight.

    4. Flocking Simulation

    img

    Flocking is an algorithm that simulates the collective motion of animals such as bird flocks and fish schools. Its core is three basic behavioral rules, proposed by Craig Reynolds at SIGGRAPH '87 and commonly known as the "Boids" algorithm:

    • Separation: individuals must not get too close to each other; they need a sense of personal space. Concretely, look at the neighbors within a certain radius and compute a direction that avoids collisions.
    • Alignment: an individual's velocity tends toward the group's average velocity, giving a sense of belonging. Concretely, compute the average velocity (speed and direction) of the boids within its field of view; how large that field should be depends on the bird's actual biology, which the next section touches on.
    • Cohesion: an individual's position tends toward the average position (the center of the group), giving a sense of safety. Concretely, each boid finds the geometric center of its neighbors and computes a movement vector toward that average location.
    img
    img

    Think about it: which of the three rules above is the hardest to implement?

    Answer: Separation. As is well known, collision-style queries between objects are hard to do cheaply, because every individual has to compare distances against every other individual, which pushes the time complexity toward O(n^2), where n is the number of boids. With 1,000 boids, that is nearly 500,000 distance checks per iteration. In the original paper, the unoptimized O(n^2) algorithm took 95 seconds to render a single frame with 80 birds, and almost 9 hours to render a 300-frame animation.

    In general, spatial partitioning such as a quadtree or spatial hashing can speed this up; you can also maintain a neighbor list storing, for each individual, the others within a certain distance. And of course, you can simply brute-force it with a Compute Shader.
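    Here is a minimal C# sketch of the spatial hashing idea (illustrative only, not part of the project code): bucket each boid into a grid cell keyed by its integer cell coordinates, then only compare against boids in the 27 neighbouring cells.

    using System.Collections.Generic;
    using UnityEngine;

    public static class SpatialHash
    {
        // Bucket boid positions into cells of size `cellSize`.
        public static Dictionary<Vector3Int, List<int>> Build(Vector3[] positions, float cellSize)
        {
            var grid = new Dictionary<Vector3Int, List<int>>();
            for (int i = 0; i < positions.Length; i++)
            {
                Vector3Int cell = Vector3Int.FloorToInt(positions[i] / cellSize);
                if (!grid.TryGetValue(cell, out var list)) grid[cell] = list = new List<int>();
                list.Add(i);
            }
            return grid;
        }

        // Gather candidate neighbours from the 3x3x3 block of cells around `pos`.
        public static List<int> Nearby(Dictionary<Vector3Int, List<int>> grid, Vector3 pos, float cellSize)
        {
            var result = new List<int>();
            Vector3Int c = Vector3Int.FloorToInt(pos / cellSize);
            for (int x = -1; x <= 1; x++)
            for (int y = -1; y <= 1; y++)
            for (int z = -1; z <= 1; z++)
                if (grid.TryGetValue(c + new Vector3Int(x, y, z), out var list))
                    result.AddRange(list);
            return result; // still needs an exact distance check against neighbourDistance
        }
    }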

    img

    Enough talk, let's get started.

    First, download the prepared project files (if you haven't already):

    • Bird prefab: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Prefabs/Boid.prefab
    • Script: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Scripts/SimpleFlocking.cs
    • Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Shaders/SimpleFlocking.compute

    Then add them to an empty GameObject.

    img

    Start the project and you will see a crowd of birds.

    img

    Below are the parameters for the flocking simulation.

    // Parameters for the flocking simulation.
        public float rotationSpeed = 1f;       // Rotation speed.
        public float boidSpeed = 1f;           // Boid speed.
        public float neighbourDistance = 1f;   // Neighbour distance.
        public float boidSpeedVariation = 1f;  // Speed variation.
        public GameObject boidPrefab;          // Prefab for a boid.
        public int boidsCount;                 // Number of boids.
        public float spawnRadius;              // Spawn radius of the boids.
        public Transform target;               // Movement target of the flock.

    Except for the boid prefab boidPrefab and the spawn radius spawnRadius, everything else needs to be passed to the GPU.

    For convenience, this section does something a bit silly first: it computes the boids' positions and directions on the GPU, then reads them back to the CPU and applies them like this:

    ...
    boidsBuffer.GetData(boidsArray);
    // Update every boid's position and orientation
    for (int i = 0; i < boidsArray.Length; i++){
        boids[i].transform.localPosition = boidsArray[i].position;
        if (!boidsArray[i].direction.Equals(Vector3.zero)){
            boids[i].transform.rotation = Quaternion.LookRotation(boidsArray[i].direction);
        }
    }

    Quaternion.LookRotation() creates a rotation that makes an object face the given direction.

    Each boid's position is computed in the Compute Shader.

    #pragma kernel CSMain
    #define GROUP_SIZE 256    
    struct Boid{
        float3 position;
        float3 direction;
    };
    RWStructuredBuffer<Boid> boidsBuffer;
    float time;
    float deltaTime;
    float rotationSpeed;
    float boidSpeed;
    float boidSpeedVariation;
    float3 flockPosition;
    float neighbourDistance;
    int boidsCount;
    

    [numthreads(GROUP_SIZE,1,1)]

    void CSMain (uint3 id : SV_DispatchThreadID){ ... } // body continued below

    Write the alignment and cohesion logic first; the final position and direction are written back to the buffer.

    Boid boid = boidsBuffer[id.x];
        float3 separation = 0;           // Separation
        float3 alignment = 0;            // Alignment - direction
        float3 cohesion = flockPosition; // Cohesion - position
        uint nearbyCount = 1;            // Count the boid itself as a neighbour.
        for (int i = 0; i < boidsCount; i++)
        {
            if (i != (int)id.x) // Exclude itself
            {
                Boid temp = boidsBuffer[i];
                // Only consider individuals within the neighbour radius
                if (distance(boid.position, temp.position) < neighbourDistance){
                    alignment += temp.direction;
                    cohesion += temp.position;
                    nearbyCount++;
                }
            }
        }
        float avg = 1.0 / nearbyCount;
        alignment *= avg;
        cohesion *= avg;
        cohesion = normalize(cohesion - boid.position);
        // Combine into a single movement direction
        float3 direction = alignment + separation + cohesion;
        // Smooth the turn and update the position
        boid.direction = lerp(direction, normalize(boid.direction), 0.94);
        // deltaTime keeps the movement speed independent of the frame rate.
        boid.position += boid.direction * boidSpeed * deltaTime;
        boidsBuffer[id.x] = boid;

    This is what happens without a sense of personal space (the separation term): all the individuals become rather too intimate and pile up on top of each other.

    img

    Add the following code.

    if(distance(boid.position, temp.position)< neighbourDistance)
    {
        float3 offset = boid.position - temp.position;
        float dist = length(offset);
        if(dist < neighbourDistance)
        {
            dist = max(dist, 0.000001);
            separation += offset * (1.0/dist - 1.0/neighbourDistance);
        }
        ...

    The closer two boids are, the larger 1.0/dist becomes, meaning the separation force should be stronger. 1.0/neighbourDistance is a constant based on the defined neighbour distance. Their difference expresses how strongly the separation force responds to distance: if two boids are exactly neighbourDistance apart the value is zero (no separation force); if they are closer than neighbourDistance the value is positive, and it grows as the distance shrinks.

    img

    Current code: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Flocking/Assets/Shaders/SimpleFlocking.compute

    The next section switches to instanced meshes to improve performance.

    5. GPU Instancing Optimization

    First, a quick recap of this chapter. In both the "Hello, Particles" and "Quad Particles" examples we used an instanced drawing technique (Graphics.DrawProceduralNow()), passing the particle positions computed by the Compute Shader directly to the vertex/fragment shader.

    img

    This section uses DrawMeshInstancedIndirect, which draws a large number of instances of the same geometry that differ only in position, rotation, or other parameters. Compared with DrawProceduralNow, which regenerates and renders the geometry every frame, DrawMeshInstancedIndirect only needs the instance information to be set up once, after which the GPU renders all instances in one go. This is the function to use for grass or herds of animals.

    img

    The function has many parameters; we only use some of them.

    img
    Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer);
    1. boidMesh: the bird mesh.
    2. subMeshIndex: index of the submesh to draw; usually 0 if the mesh has only one submesh.
    3. boidMaterial: the material applied to the instanced objects.
    4. bounds: a bounding box that limits where instances are drawn; instances are only rendered inside it. This is there for performance.
    5. argsBuffer: a ComputeBuffer of arguments, including the index count of the instance geometry and the number of instances.

    So what is this argsBuffer? It tells Unity which mesh we want to render and how many instances to render, supplied as a special kind of buffer.

    When initializing the shader, create a special buffer marked ComputeBufferType.IndirectArguments. This buffer type exists specifically to be handed to the GPU so it can execute indirect draw commands. Note that the first argument of new ComputeBuffer here is 1, meaning one args array (each array holds five uints); don't misread it.

    ComputeBuffer argsBuffer;
    ...
    argsBuffer = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
    if (boidMesh != null)
    {
        args[0] = (uint)boidMesh.GetIndexCount(0);
        args[1] = (uint)numOfBoids;
    }
    argsBuffer.SetData(args);
    ...
    Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer);

    Building on the previous chapter, the individual's data structure gains a noise offset, used in the Compute Shader to perturb the direction. The initial orientation is also interpolated with Slerp: 70% of the original rotation, 30% random. Slerp returns a quaternion, which has to be converted to Euler angles before being passed to the constructor.

    public float noise_offset;
    ...
    Quaternion rot = Quaternion.Slerp(transform.rotation, Random.rotation, 0.3f);
    boidsArray[i] = new Boid(pos, rot.eulerAngles, offset);

    After passing this new noise_offset property into the Compute Shader, compute a noise value in the range [-1, 1] and apply it to the boid's speed.

    float noise = clamp(noise1(time / 100.0 + boid.noise_offset), -1, 1) * 2.0 - 1.0;
    float velocity = boidSpeed * (1.0 + noise * boidSpeedVariation);

    The algorithm is then optimized slightly; the Compute Shader is largely unchanged.

    if (distance(boid_pos, boidsBuffer[i].position) < neighbourDistance)
    {
        float3 tempBoid_position = boidsBuffer[i].position;
        float3 offset = boid.position - tempBoid_position;
        float dist = length(offset);
        if (dist<neighbourDistance){
            dist = max(dist, 0.000001);//Avoid division by zero
            separation += offset * (1.0/dist - 1.0/neighbourDistance);
        }
        alignment += boidsBuffer[i].direction;
        cohesion += tempBoid_position;
        nearbyCount += 1;
    }

    The biggest difference is in the Shader. This section uses a Surface Shader instead of a hand-written fragment shader. A Surface Shader is essentially a pre-packaged vertex and fragment shader in which Unity has already taken care of lighting, shadows, and other tedious work; you can still supply your own vertex function.

    When writing the shader for this material, instanced objects need special handling. Ordinary render objects have positions, rotations and other attributes that are static as far as Unity is concerned, but the instanced objects we are building change those parameters constantly, so the render pipeline needs a special mechanism to set each instance's position and parameters dynamically. The approach here is procedural instancing, which renders all the instances in one batch instead of drawing them one by one.

    The shader applies the instancing technique as follows: the instancing stage (setup) runs before vert, so every instanced object gets its own rotation, translation and scale matrix.

    Now each instanced object needs its own rotation matrix. From the buffer we get the boid data computed by the Compute Shader (in the previous section this data was read back to the CPU; here it is passed straight to the shader for instancing):

    img

    In the Shader, wrap the buffer's data structure and the related operations in the following macro guard.

    // .shader
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    struct Boid
    {
        float3 position;
        float3 direction;
        float noise_offset;
    };
    StructuredBuffer<Boid> boidsBuffer; 
    #endif

    Since args[1] of DrawMeshInstancedIndirect on the C# side specifies the instance count (the number of boids, which is also the buffer size), the buffer can simply be indexed with unity_InstanceID.

    #pragma instancing_options procedural:setup
    void setup()
    {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            _BoidPosition = boidsBuffer[unity_InstanceID].position;
            _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
        #endif
    }

    Computing the spatial transformation matrix here involves homogeneous coordinates (the GAMES101 course is a good refresher): a point is (x, y, z, 1) and a direction vector is (x, y, z, 0).
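
    As a quick recap (standard material, not specific to this project), the homogeneous 4x4 matrix packs rotation and translation together, and the trailing 1 or 0 decides whether the translation applies:

    $$
    \begin{pmatrix}R & t\\ \mathbf{0}^{T} & 1\end{pmatrix}
    \begin{pmatrix}p\\ 1\end{pmatrix}
    =\begin{pmatrix}Rp+t\\ 1\end{pmatrix},
    \qquad
    \begin{pmatrix}R & t\\ \mathbf{0}^{T} & 1\end{pmatrix}
    \begin{pmatrix}v\\ 0\end{pmatrix}
    =\begin{pmatrix}Rv\\ 0\end{pmatrix}
    $$

    so points are affected by the translation while direction vectors are not.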

    With an affine-transform style (a separate rotation matrix plus a position offset), the code looks like this:

    void setup()
    {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _BoidPosition = boidsBuffer[unity_InstanceID].position;
        _LookAtMatrix = look_at_matrix(boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
        #endif
    }
     void vert(inout appdata_full v, out Input data)
    {
        UNITY_INITIALIZE_OUTPUT(Input, data);
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        v.vertex = mul(_LookAtMatrix, v.vertex);
        v.vertex.xyz += _BoidPosition;
        #endif
    }

    Not elegant enough. Let's use homogeneous coordinates directly, so a single matrix handles rotation, translation and scale!

    void setup()
    {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _BoidPosition = boidsBuffer[unity_InstanceID].position;
        _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
        #endif
    }
     void vert(inout appdata_full v, out Input data)
    {
        UNITY_INITIALIZE_OUTPUT(Input, data);
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        v.vertex = mul(_Matrix, v.vertex);
        #endif
    }

    And that's it! The frame rate is now nearly double that of the previous section.

    img
    img

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Shaders/InstancedFlocking.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Scripts/InstancedFlocking.cs
    • Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Shaders/InstancedFlocking.shader

    6. Applying Skinned Animation

    img

    In this section we use the Animator component to bake the mesh of each keyframe into a buffer before instancing the objects. Picking a different index then yields the mesh in a different pose. How the skeletal animation itself is authored is outside the scope of this article.

    We only need to modify the previous chapter's code and add the Animator-related logic. I've added comments below; take a look.

    The per-boid data structure is also updated:

    struct Boid{
        float3 position;
        float3 direction;
        float noise_offset;
        float speed; // not actually used yet
        float frame; // index of the current frame in the animation
        float3 padding; // keeps the struct 16-byte aligned
    };

    A word on alignment: the total size of a struct like this should ideally be a multiple of 16 bytes.

    • float3 position; (12 bytes)
    • float3 direction; (12 bytes)
    • float noise_offset; (4 bytes)
    • float speed; (4 bytes)
    • float frame; (4 bytes)
    • float3 padding; (12 bytes)

    Without the padding the struct would be 36 bytes, which is not a typical alignment size. With the padding it comes to 48 bytes. Perfect!
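
    On the C# side, the matching struct and buffer stride might look like the sketch below (field names assumed to mirror the HLSL struct; the 48-byte stride follows the layout above):

    using System.Runtime.InteropServices;
    using UnityEngine;

    // Mirrors the HLSL Boid struct: 12 floats = 48 bytes in total.
    [StructLayout(LayoutKind.Sequential)]
    struct Boid
    {
        public Vector3 position;    // 12 bytes
        public Vector3 direction;   // 12 bytes
        public float noise_offset;  //  4 bytes
        public float speed;         //  4 bytes
        public float frame;         //  4 bytes
        public Vector3 padding;     // 12 bytes
    }

    public class BoidBufferSketch
    {
        public static ComputeBuffer Create(int numOfBoids)
        {
            int stride = Marshal.SizeOf(typeof(Boid)); // 48 bytes per boid
            return new ComputeBuffer(numOfBoids, stride);
        }
    }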

    private SkinnedMeshRenderer boidSMR; // references the SkinnedMeshRenderer that holds the skinned mesh
    private Animator animator;
    public AnimationClip animationClip; // the animation clip, used to compute animation-related parameters
    private int numOfFrames; // number of animation frames, i.e. how many frames of data to store in the GPU buffer
    public float boidFrameSpeed = 10f; // controls the animation playback speed
    MaterialPropertyBlock props; // passes per-instance parameters to the shader without creating new material instances, so properties (color, lighting factors, ...) can differ per instance without affecting other objects that share the material
    Mesh boidMesh; // stores the mesh baked from the SkinnedMeshRenderer
    ...
    void Start(){ // initialize the boid data, then call GenerateSkinnedAnimationForGPUBuffer to prepare the animation data, and finally InitShader to set up the shader parameters needed for rendering
        ...
        // This property block is used only for avoiding an instancing bug.
        props = new MaterialPropertyBlock();
        props.SetFloat("_UniqueID", Random.value);
        ...
        InitBoids();
        GenerateSkinnedAnimationForGPUBuffer();
        InitShader();
    }
    void InitShader(){ // configures the shader and material properties so the animation displays correctly for each instance; enabling or disabling frameInterpolation decides whether to interpolate between animation frames for smoother playback
        ...
        if (boidMesh)//Set by the GenerateSkinnedAnimationForGPUBuffer
        ...
        shader.SetFloat("boidFrameSpeed", boidFrameSpeed);
        shader.SetInt("numOfFrames", numOfFrames);
        boidMaterial.SetInt("numOfFrames", numOfFrames);
        if (frameInterpolation && !boidMaterial.IsKeywordEnabled("FRAME_INTERPOLATION"))
        boidMaterial.EnableKeyword("FRAME_INTERPOLATION");
        if (!frameInterpolation && boidMaterial.IsKeywordEnabled("FRAME_INTERPOLATION"))
        boidMaterial.DisableKeyword("FRAME_INTERPOLATION");
    }
    void Update(){
        ...
        // The last two parameters:
            // 1. 0: offset into the argument buffer, i.e. where to start reading the arguments.
            // 2. props: the MaterialPropertyBlock created earlier, holding properties shared by all instances.
        Graphics.DrawMeshInstancedIndirect( boidMesh, 0, boidMaterial, bounds, argsBuffer, 0, props);
    }
    void OnDestroy(){ 
        ...
        if (vertexAnimationBuffer != null) vertexAnimationBuffer.Release();
    }
    private void GenerateSkinnedAnimationForGPUBuffer()
    {
        ... // continued below
    }

    To give the shader a differently posed mesh at different times, GenerateSkinnedAnimationForGPUBuffer() extracts the mesh vertex data of every frame from the Animator and SkinnedMeshRenderer and stores it in a ComputeBuffer on the GPU for use during instanced rendering.

    GetCurrentAnimatorStateInfo fetches the current animation layer's state info, which is used later to control playback precisely.

    numOfFrames is set to the power of two closest to the product of the clip's length and frame rate, which helps GPU memory access.

    Then a ComputeBuffer, vertexAnimationBuffer, is created to store the vertex data of all frames.

    The for loop bakes every animation frame: at each sampleTime the animation is played and immediately updated, the current frame's mesh is baked into bakedMesh, its vertices are copied into the vertexAnimationData array, and finally everything is uploaded to the GPU.

    // ...continued from above
    boidSMR = boidObject.GetComponentInChildren<SkinnedMeshRenderer>();
    boidMesh = boidSMR.sharedMesh;
    animator = boidObject.GetComponentInChildren<Animator>();
    int iLayer = 0;
    AnimatorStateInfo aniStateInfo = animator.GetCurrentAnimatorStateInfo(iLayer);
    Mesh bakedMesh = new Mesh();
    float sampleTime = 0;
    float perFrameTime = 0;
    numOfFrames = Mathf.ClosestPowerOfTwo((int)(animationClip.frameRate * animationClip.length));
    perFrameTime = animationClip.length / numOfFrames;
    var vertexCount = boidSMR.sharedMesh.vertexCount;
    vertexAnimationBuffer = new ComputeBuffer(vertexCount * numOfFrames, 16);
    Vector4[] vertexAnimationData = new Vector4[vertexCount * numOfFrames];
    for (int i = 0; i < numOfFrames; i++)
    {
        animator.Play(aniStateInfo.shortNameHash, iLayer, sampleTime);
        animator.Update(0f);
        boidSMR.BakeMesh(bakedMesh);
        for(int j = 0; j < vertexCount; j++)
        {
            Vector4 vertex = bakedMesh.vertices[j];
            vertex.w = 1;
            vertexAnimationData[(j * numOfFrames) +  i] = vertex;
        }
        sampleTime += perFrameTime;
    }
    vertexAnimationBuffer.SetData(vertexAnimationData);
    boidMaterial.SetBuffer("vertexAnimation", vertexAnimationBuffer);
    boidObject.SetActive(false);

    In the Compute Shader, the frame value stored in each boid's struct is advanced over time.

    boid.frame = boid.frame + velocity * deltaTime * boidFrameSpeed;
    if (boid.frame >= numOfFrames) boid.frame -= numOfFrames;

    In the shader, lerp between animation frames. The left side has no frame interpolation, the right side does; the difference is striking.


    void vert(inout appdata_custom v)
    {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            #ifdef FRAME_INTERPOLATION
                v.vertex = lerp(vertexAnimation[v.id * numOfFrames + _CurrentFrame], vertexAnimation[v.id * numOfFrames + _NextFrame], _FrameInterpolation);
            #else
                v.vertex = vertexAnimation[v.id * numOfFrames + _CurrentFrame];
            #endif
            v.vertex = mul(_Matrix, v.vertex);
        #endif
    }
    void setup()
    {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
            _CurrentFrame = boidsBuffer[unity_InstanceID].frame;
            #ifdef FRAME_INTERPOLATION
                _NextFrame = _CurrentFrame + 1;
                if (_NextFrame >= numOfFrames) _NextFrame = 0;
                _FrameInterpolation = frac(boidsBuffer[unity_InstanceID].frame);
            #endif
        #endif
    }

    It took some effort, but it's finally complete.

    img

    Full project link: https://github.com/Remyuu/Unity-Compute-Shader-Learn/tree/L4_Skinned/Assets/Scripts

    8. Summary / Quiz

    When rendering points which gives the best answer?

    img

    What are the three key steps in flocking?

    img

    When creating an arguments buffer for DrawMeshInstancedIndirect, how many uints are required?

    img

    We created the wing flapping by using a skinned mesh shader. True or False.

    img

    In a shader used by DrawMeshInstancedIndirect, which variable name gives the correct index for the instance?

    img

    References

    1. https://en.wikipedia.org/wiki/Boids
    2. Flocks, Herds, and Schools: A Distributed Behavioral Model
  • Compute Shader学习笔记(二)之 后处理效果

    Compute Shader Learning Notes (II) Post-processing Effects

    img

    Preface

    Having gained a first understanding of Compute Shader and implemented some simple effects, all the code lives at:

    https://github.com/Remyuu/Unity-Compute-Shader-Learn

    The main branch holds the starting code; you can download the full project and follow along. PS: each version of the code has its own branch.

    img

    This article covers how to use Compute Shader to build:

    • Post-processing effects
    • Particle systems

    The previous article didn't touch on GPU architecture, because dumping a pile of terminology on you right away would have made no sense. Now that we have some hands-on Compute Shader experience, the abstract concepts can be tied back to real code.

    A CUDA program executing on the GPU can be described with a three-level hierarchy:

    • Grid – corresponds to one kernel
    • |-Block – a grid contains many blocks, all executing the same program
    • | |-Thread – the most basic unit of computation on the GPU
    img

    The thread is the GPU's most basic unit, and threads naturally need to exchange information. To run huge numbers of parallel threads efficiently and satisfy their data-exchange needs, memory is organized into multiple levels. From the storage point of view there are also three layers:

    • Per-thread memory – within one thread; access takes one clock cycle (under a nanosecond) and can be hundreds of times faster than global memory.
    • Shared memory – shared within one block; much faster than global memory.
    • Global memory – visible to all threads, but the slowest, and usually the GPU's bottleneck. The Volta architecture uses HBM2 as device global memory; Turing uses GDDR6.

    Data that exceeds a level's capacity gets pushed to larger but slower storage.

    Shared memory and the L1 cache share the same physical space but differ in function: the former is managed manually, the latter automatically by hardware. My mental model is that shared memory is essentially a programmable L1 cache.

    img

    In NVIDIA's CUDA architecture, a Streaming Multiprocessor (SM) is a processing unit on the GPU responsible for executing the threads in the blocks assigned to it. Stream processors, also called "CUDA cores", are the processing elements inside an SM; each stream processor can handle multiple threads in parallel. In short:

    • GPU -> Multi-Processors (SMs) -> Stream Processors

    That is, the GPU contains multiple SMs (multiprocessors), each SM contains multiple stream processors, and each stream processor executes the instructions of one or more threads.

    On the GPU, a thread is the smallest unit that performs computation, and a warp is CUDA's basic unit of execution.

    In NVIDIA's CUDA architecture each warp usually holds 32 threads (AMD's equivalent holds 64). A block is a group of threads; one block can contain several warps. A kernel is a function executed on the GPU, a specific piece of code run in parallel by all active threads. Altogether:

    • Kernel -> Grid -> Blocks -> Warps -> Threads

    In day-to-day development, though, the number of threads you want to run at once is usually far more than 32.

    To bridge the mismatch between what the software asks for and what the hardware provides, the GPU groups the threads of a block into warps, each containing a fixed number of threads. When more threads are needed than a single warp can hold, the GPU schedules additional warps. The guiding principle is that no thread is left out, even if that means launching extra warps.

    For example, if a block has 128 threads and my graphics card wears a leather jacket (NVIDIA: 32 threads per warp), the block will have 128/32 = 4 warps. An extreme case: with 129 threads, 5 warps are launched and 31 thread slots sit idle! So when writing a Compute Shader, a*b*c in [numthreads(a,b,c)] should preferably be a multiple of 32 to avoid wasting CUDA cores.
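
    The same "no thread left behind" rule applies when choosing how many thread groups to dispatch. A minimal C# sketch, assuming an 8x8 group size and a texResolution-sized target as used later in this article:

    using UnityEngine;

    public class DispatchSketch : MonoBehaviour
    {
        public ComputeShader shader;
        public int texResolution = 250; // deliberately not a multiple of 8

        void Start()
        {
            int kernel = shader.FindKernel("CSMain");
            // Round up so the groups cover every pixel, even when the
            // resolution is not an exact multiple of the group size.
            int groupsX = Mathf.CeilToInt(texResolution / 8f); // 32 groups for 250 px
            int groupsY = Mathf.CeilToInt(texResolution / 8f);
            shader.Dispatch(kernel, groupsX, groupsY, 1);
        }
    }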

    If you're confused by now, that's understandable. I've drawn a diagram based on my own understanding; please point out any mistakes.

    img

    L3 Post-processing Effects

    This build targets the built-in render pipeline (BIRP); only a few lines need to change for SRP.

    The key to this chapter is building an abstract base class that manages the resources a Compute Shader needs (section 1). On top of it we then write some simple post-processing effects such as Gaussian blur, grayscale, a low-resolution pixel effect and a night-vision effect. A quick summary of what this chapter covers:

    • Grabbing and processing the Camera's render texture
    • The ExecuteInEditMode attribute
    • Checking device support with SystemInfo.supportsComputeShaders
    • Using Graphics.Blit() (short for Bit Block Transfer)
    • Building various effects with smoothstep()
    • Passing data between kernels with the shared keyword

    1. Introduction and Setup

    A post-processing effect needs two textures: one read-only and one read-write. Where do they come from? It's post-processing, so from the camera, i.e. the Camera component's target texture.

    • Source: read-only
    • Destination: read-write, used for the final output
    img

    Since several post-processing effects will be implemented, we extract a base class to cut down on later work.

    The base class encapsulates the following:

    • Initializing resources (creating textures, buffers and so on)
    • Managing resources (e.g. recreating buffers when the screen resolution changes)
    • Hardware checks (whether the current device supports Compute Shader)

    Full code of the abstract base class: https://pastebin.com/9pYvHHsh

    First, OnEnable() is called when the script instance is enabled or attached to an active GameObject, so the initialization goes there: check hardware support, check that a Compute Shader is assigned in the Inspector, find the requested kernel, grab this GameObject's Camera component, create the textures, and set the initialized flag to true.

    if (!SystemInfo.supportsComputeShaders)
        ...
    if (!shader)
        ...
    kernelHandle = shader.FindKernel(kernelName);
    thisCamera = GetComponent<Camera>();
    if (!thisCamera)
        ...
    CreateTextures();
    init = true;

    CreateTextures() creates the two textures, one Source and one Destination, at the camera's resolution.

    texSize.x = thisCamera.pixelWidth;
    texSize.y = thisCamera.pixelHeight;
    if (shader)
    {
        uint x, y;
        shader.GetKernelThreadGroupSizes(kernelHandle, out x, out y, out _);
        groupSize.x = Mathf.CeilToInt((float)texSize.x / (float)x);
        groupSize.y = Mathf.CeilToInt((float)texSize.y / (float)y);
    }
    CreateTexture(ref output);
    CreateTexture(ref renderedSource);
    shader.SetTexture(kernelHandle, "source", renderedSource);
    shader.SetTexture(kernelHandle, "outputrt", output);

    The texture creation itself:

    protected void CreateTexture(ref RenderTexture textureToMake, int divide=1)
    {
        textureToMake = new RenderTexture(texSize.x/divide, texSize.y/divide, 0);
        textureToMake.enableRandomWrite = true;
        textureToMake.Create();
    }

    That completes initialization. When the camera has finished rendering the scene and is about to present it, Unity calls OnRenderImage(), and that's where the Compute Shader work starts. If we're not initialized or have no shader, we simply Blit source to destination, i.e. do nothing. CheckResolution(out _) checks whether the render textures need to be recreated for a new resolution and regenerates them if so. After that comes the familiar Dispatch step: the source texture is handed to the GPU, and after the computation the result ends up in destination.

    protected virtual void OnRenderImage(RenderTexture source, RenderTexture destination)
    {
        if (!init || shader == null)
        {
            Graphics.Blit(source, destination);
        }
        else
        {
            CheckResolution(out _);
            DispatchWithSource(ref source, ref destination);
        }
    }

    Notice that there is no SetData() or GetData() here. All the data already lives on the GPU, so we just let the GPU produce and consume it itself; the CPU stays out of it. Pulling the texture back to main memory and re-uploading it would be terrible for performance.

    protected virtual void DispatchWithSource(ref RenderTexture source, ref RenderTexture destination)
    {
        Graphics.Blit(source, renderedSource);
        shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
        Graphics.Blit(output, destination);
    }

    I didn't believe it and insisted on round-tripping through the CPU anyway; the test result was startling: more than 4x slower. So minimize communication between the CPU and GPU; it's one of the things that matters most when using Compute Shader.

    // The naive approach
    protected virtual void DispatchWithSource(ref RenderTexture source, ref RenderTexture destination)
    {
        // Blit the source texture into the texture used for processing
        Graphics.Blit(source, renderedSource);
        // Process the texture with the compute shader
        shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
        // Copy the output into a Texture2D so the data can be read back on the CPU
        Texture2D tempTexture = new Texture2D(renderedSource.width, renderedSource.height, TextureFormat.RGBA32, false);
        RenderTexture.active = output;
        tempTexture.ReadPixels(new Rect(0, 0, output.width, output.height), 0, 0);
        tempTexture.Apply();
        RenderTexture.active = null;
        // Upload the Texture2D data back to the GPU into a new RenderTexture
        RenderTexture tempRenderTexture = RenderTexture.GetTemporary(output.width, output.height);
        Graphics.Blit(tempTexture, tempRenderTexture);
        // Finally Blit the processed texture to the destination
        Graphics.Blit(tempRenderTexture, destination);
        // Clean up
        RenderTexture.ReleaseTemporary(tempRenderTexture);
        Destroy(tempTexture);
    }
    img

    Now let's write the first post-processing effect.

    Interlude: a strange bug

    A quick aside about a strange bug.

    In a Compute Shader, if the final output texture is named output, some APIs such as Metal run into problems. The fix is simply to rename it.

    RWTexture2D<float4> outputrt;
    img


    2. RingHighlight Effect

    img

    Create a RingHighlight class that inherits from the base class we just wrote.

    img

    Override the initialization method and specify the kernel.

    protected override void Init()
    {
        center = new Vector4();
        kernelName = "Highlight";
        base.Init();
    }

    Override the render method. To focus on a particular character we pass the character's screen-space position, center, to the Compute Shader. And if the screen resolution changes before the Dispatch, we reinitialize.

    protected void SetProperties()
    {
        float rad = (radius / 100.0f) * texSize.y;
        shader.SetFloat("radius", rad);
        shader.SetFloat("edgeWidth", rad * softenEdge / 100.0f);
        shader.SetFloat("shade", shade);
    }
    protected override void OnRenderImage(RenderTexture source, RenderTexture destination)
    {
        if (!init || shader == null)
        {
            Graphics.Blit(source, destination);
        }
        else
        {
            if (trackedObject && thisCamera)
            {
                Vector3 pos = thisCamera.WorldToScreenPoint(trackedObject.position);
                center.x = pos.x;
                center.y = pos.y;
                shader.SetVector("center", center);
            }
            bool resChange = false;
            CheckResolution(out resChange);
            if (resChange) SetProperties();
            DispatchWithSource(ref source, ref destination);
        }
    }

    To see parameter changes in real time while editing the Inspector, add an OnValidate() method.

    private void OnValidate()
    {
        if(!init)
            Init();
        SetProperties();
    }

    On the GPU, how do we make the inside of the circle unshaded, the edge fade smoothly, and everything outside the transition band shaded? Building on the point-in-circle test from the previous article, we just use smoothstep() for the transition band.

    #pragma kernel Highlight
    
    Texture2D<float4> source;
    RWTexture2D<float4> outputrt;
    float radius;
    float edgeWidth;
    float shade;
    float4 center;
    
    float inCircle( float2 pt, float2 center, float radius, float edgeWidth ){
        float len = length(pt - center);
        return 1.0 - smoothstep(radius-edgeWidth, radius, len);
    }
    
    [numthreads(8, 8, 1)]
    void Highlight(uint3 id : SV_DispatchThreadID)
    {
        float4 srcColor = source[id.xy];
        float4 shadedSrcColor = srcColor * shade;
        float highlight = inCircle( (float2)id.xy, center.xy, radius, edgeWidth);
        float4 color = lerp( shadedSrcColor, srcColor, highlight );
    
        outputrt[id.xy] = color;
    
    }

    img

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_RingHighlight/Assets/Shaders/RingHighlight.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_RingHighlight/Assets/Scripts/RingHighlight.cs

    3. Blur Effect

    img

    The idea behind a blur is simple: each pixel samples the surrounding n*n pixels and takes a weighted average.

    But there's an efficiency problem. As we all know, reducing texture samples matters a lot for performance. If every pixel sampled a 20*20 neighborhood, rendering one pixel would take 400 samples, which is clearly unacceptable. On top of that, gathering an entire rectangle of neighbors per pixel is awkward to handle in a Compute Shader. So what do we do?

    The usual answer is a separable blur: sample once horizontally, then once vertically. For each pixel, sample 20 pixels along x and 20 along y, 20+20 samples in total, and weight-average them. This not only cuts the sample count but also fits the Compute Shader model nicely: one kernel for the horizontal pass, another for the vertical pass.
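
    The savings grow quickly with the blur radius r. For a (2r+1)-tap kernel:

    2D blur: (2r+1)^2 samples per pixel        separable blur: 2*(2r+1) samples per pixel

    so for r = 10 that is 441 samples versus 42 per pixel.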

    #pragma kernel HorzPass
    #pragma kernel Highlight

    Because Dispatch calls execute in order, after computing the horizontal blur we run the vertical pass over the already-blurred result.

    shader.Dispatch(kernelHorzPassID, groupSize.x, groupSize.y, 1);
    shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);

    After the blur, combine it with the RingHighlight from the previous section, and we're done!

    One difference: once the horizontal blur is computed, how do we pass the result to the next kernel? The answer suggests itself: use the shared keyword. The steps are as follows.

    On the CPU, declare a reference to the horizontal-blur texture, find the horizontal-pass kernel, and bind it.

    RenderTexture horzOutput = null;
    int kernelHorzPassID;
    protected override void Init()
    {
        ...
        kernelHorzPassID = shader.FindKernel("HorzPass");
        ...
    }

    We also need to allocate extra space on the GPU to hold the first kernel's result.

    protected override void CreateTextures()
    {
        base.CreateTextures();
        shader.SetTexture(kernelHorzPassID, "source", renderedSource);
        CreateTexture(ref horzOutput);
        shader.SetTexture(kernelHorzPassID, "horzOutput", horzOutput);
        shader.SetTexture(kernelHandle, "horzOutput", horzOutput);
    }

    On the GPU side:

    shared Texture2D<float4> source;
    shared RWTexture2D<float4> horzOutput;
    RWTexture2D<float4> outputrt;

    One question, though: the shared keyword seems to make no difference; in practice both kernels can access the texture either way. So what is shared actually for?

    In Unity, prefixing a variable with shared indicates that the resource is not reinitialized on every call but keeps its state across different shaders or dispatches, which helps share data between shader invocations. Marking it shared can also help the compiler generate better-performing code.

    img

    When processing pixels near the border there may not be enough neighbors: either fewer than blurRadius pixels remain on the left, or too few on the right. So first compute a safe left index, then compute how many pixels can actually be taken from left to right.

    [numthreads(8, 8, 1)]
    void HorzPass(uint3 id : SV_DispatchThreadID)
    {
        int left = max(0, (int)id.x-blurRadius);
        int count = min(blurRadius, (int)id.x) + min(blurRadius, source.Length.x - (int)id.x);
        float4 color = 0;
        uint2 index = uint2((uint)left, id.y);
        [unroll(100)]
        for(int x=0; x<count; x++){
            color += source[index];
            index.x++;
        }
        color /= (float)count;
        horzOutput[id.xy] = color;
    }
    [numthreads(8, 8, 1)]
    void Highlight(uint3 id : SV_DispatchThreadID)
    {
        //Vert blur
        int top = max(0, (int)id.y-blurRadius);
        int count = min(blurRadius, (int)id.y) + min(blurRadius, source.Length.y - (int)id.y);
        float4 blurColor = 0;
        uint2 index = uint2(id.x, (uint)top);
        [unroll(100)]
        for(int y=0; y<count; y++){
            blurColor += horzOutput[index];
            index.y++;
        }
        blurColor /= (float)count;
        float4 srcColor = source[id.xy];
        float4 shadedBlurColor = blurColor * shade;
        float highlight = inCircle( (float2)id.xy, center.xy, radius, edgeWidth);
        float4 color = lerp( shadedBlurColor, srcColor, highlight );
        outputrt[id.xy] = color;
    }

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_BlurEffect/Assets/Shaders/BlurHighlight.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_BlurEffect/Assets/Scripts/BlurHighlight.cs

    4. Gaussian Blur

    Unlike the blur above, the samples are no longer averaged uniformly; they are weighted with a Gaussian function.

    The weights follow the one-dimensional Gaussian G(x) = (1 / (√(2π)·σ)) · exp(−x² / (2σ²)), where σ is the standard deviation that controls the width; the constant 0.39894 in the code below is 1/√(2π).

    More on blurs: https://www.gamedeveloper.com/programming/four-tricks-for-fast-blurring-in-software-and-hardware#close-modal

    Since this expression is not cheap to evaluate, computing it for every pixel would be wasteful. Instead we precompute the weights and pass them to the GPU through a buffer. Because both kernels need them, the buffer declaration gets the shared keyword.

    float[] SetWeightsArray(int radius, float sigma)
    {
        int total = radius * 2 + 1;
        float[] weights = new float[total];
        float sum = 0.0f;
        for (int n=0; n<radius; n++)
        {
            float weight = 0.39894f * Mathf.Exp(-0.5f * n * n / (sigma * sigma)) / sigma;
            weights[radius + n] = weight;
            weights[radius - n] = weight;
            if (n != 0)
                sum += weight * 2.0f;
            else
                sum += weight;
        }
        // normalize kernels
        for (int i=0; i<total; i++) weights[i] /= sum;
        return weights;
    }
    private void UpdateWeightsBuffer()
    {
        if (weightsBuffer != null)
            weightsBuffer.Dispose();
        float sigma = (float)blurRadius / 1.5f;
        weightsBuffer = new ComputeBuffer(blurRadius * 2 + 1, sizeof(float));
        float[] blurWeights = SetWeightsArray(blurRadius, sigma);
        weightsBuffer.SetData(blurWeights);
        shader.SetBuffer(kernelHorzPassID, "weights", weightsBuffer);
        shader.SetBuffer(kernelHandle, "weights", weightsBuffer);
    }
    img

    Full code:

    • https://pastebin.com/0qWtUKgy
    • https://pastebin.com/A6mDKyJE

    5. Low-Resolution Effect

    GPU: now that was a thoroughly satisfying bit of computation.

    img

    Make a high-resolution texture look blocky without changing its actual resolution. The implementation is simple: for every n*n block of pixels, use only the color of the bottom-left pixel. Thanks to integer arithmetic, dividing the id.x index by n and then multiplying by n again does the trick.

    uint2 index = (uint2(id.x, id.y)/3) * 3;
    float3 srcColor = source[index].rgb;
    float3 finalColor = srcColor;

    The result is shown above. But it's too harsh, so we add noise to soften the jaggies.

    uint2 index = (uint2(id.x, id.y)/3) * 3;
    float noise = random(id.xy, time);
    float3 srcColor = lerp(source[id.xy].rgb, source[index],noise);
    float3 finalColor = srcColor;
    img

    Each n*n block no longer uses only the bottom-left color; instead each pixel takes a random interpolation between its own color and the bottom-left color, which immediately looks much finer. With a larger n you can also get results like the one below; not exactly pretty, but worth exploring further for glitch-style looks.

    img

    For a noisier image, try scaling the endpoints of the lerp, for example:

    float3 srcColor = lerp(source[id.xy].rgb * 2, source[index],noise);
    img

    6. Grayscale Effect & Tinting

    Grayscale Effect & Tinted

    Converting a color image to grayscale means collapsing each pixel's RGB into a single value, a weighted average of the three channels. There are two common approaches: a simple average, and a weighted average that matches human perception.

    1. Averaging (simple but inaccurate): gray = (R + G + B) / 3.

    This gives every channel the same weight. 2. Weighted average (more accurate, reflects human perception): gray = 0.299·R + 0.587·G + 0.114·B.

    This weights the channels according to the eye's sensitivity: most sensitive to green, then red, least to blue. (The screenshots below don't show it well; I can't tell the difference either lol)

    img

    After the grayscale conversion, simply multiply by the tint color, then lerp to get a controllable tint strength.

    uint2 index = (uint2(id.x, id.y)/6) * 6;
    float noise = random(id.xy, time);
    float3 srcColor = lerp(source[id.xy].rgb, source[index],noise);
    // float3 finalColor = srcColor;
    float3 grayScale = (srcColor.r+srcColor.g+srcColor.b)/3.0;
    // float3 grayScale = srcColor.r*0.299f+srcColor.g*0.587f+srcColor.b*0.114f;
    float3 tinted = grayScale * tintColor.rgb;
    float3 finalColor = lerp(srcColor, tinted, tintStrength);
    outputrt[id.xy] = float4(finalColor, 1);

    Tinted with a wasteland palette:

    img

    7. Screen Scanline Effect

    First, uvY normalizes the coordinate to [0, 1].

    lines is a parameter controlling the number of scanlines.

    Then add a time offset; the coefficient controls how fast the lines scroll and could be exposed as a parameter.

    float uvY = (float)id.y/(float)source.Length.y;
    float scanline = saturate(frac(uvY * lines + time * 3));
    img

    These "lines" don't look line-like enough; let's slim them down.

    float uvY = (float)id.y/(float)source.Length.y;
    float scanline = saturate(smoothstep(0.1,0.2,frac(uvY * lines + time * 3)));
    img

    Then lerp in the color.

    float uvY = (float)id.y/(float)source.Length.y;
    float scanline = saturate(smoothstep(0.1, 0.2, frac(uvY * lines + time*3)) + 0.3);
    finalColor = lerp(source[id.xy].rgb*0.5, finalColor, scanline);
    img

    Before and after the "slimming": take your pick!

    img

    8. Night-Vision Effect

    This section pulls together everything above into a night-vision effect. Start with a single-eyepiece version.

    float2 pt = (float2)id.xy;
    float2 center = (float2)(source.Length >> 1);
    float inVision = inCircle(pt, center, radius, edgeWidth);
    float3 blackColor = float3(0,0,0);
    finalColor = lerp(blackColor, finalColor, inVision);
    img

    The binocular version differs only in having two circle centers; the two vision masks are merged with max() or saturate().

    float2 pt = (float2)id.xy;
    float2 centerLeft = float2(source.Length.x / 3.0, source.Length.y /2);
    float2 centerRight = float2(source.Length.x / 3.0 * 2.0, source.Length.y /2);
    float inVisionLeft = inCircle(pt, centerLeft, radius, edgeWidth);
    float inVisionRight = inCircle(pt, centerRight, radius, edgeWidth);
    float3 blackColor = float3(0,0,0);
    // float inVision = max(inVisionLeft, inVisionRight);
    float inVision = saturate(inVisionLeft + inVisionRight);
    finalColor = lerp(blackColor, finalColor, inVision);
    img

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_NightVision/Assets/Shaders/NightVision.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_NightVision/Assets/Scripts/NightVision.cs

    9. Smoothly Blended Lines

    Think about it: how would we draw a smoothly blended straight line on the screen?

    img

    The smoothstep() function does exactly this; readers already familiar with it can skip this part. It creates a smooth gradient: smoothstep(edge0, edge1, x) ramps from 0 to 1 as x moves between edge0 and edge1, returning 0 when x < edge0 and 1 when x > edge1. The output is computed with Hermite interpolation:

    img
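
    For reference, the standard Hermite form of smoothstep is:

    t = clamp((x - edge0) / (edge1 - edge0), 0, 1)
    smoothstep(edge0, edge1, x) = t * t * (3 - 2 * t)
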
    float onLine(float position, float center, float lineWidth, float edgeWidth) {
        float halfWidth = lineWidth / 2.0;
        float edge0 = center - halfWidth - edgeWidth;
        float edge1 = center - halfWidth;
        float edge2 = center + halfWidth;
        float edge3 = center + halfWidth + edgeWidth;
        return smoothstep(edge0, edge1, position) - smoothstep(edge2, edge3, position);
    }

    In the code above, all parameters are normalized to [0, 1]: position is the point being evaluated, center is the center of the line, lineWidth is the line's actual width, and edgeWidth is the width of the soft edge used for the transition. I'm not thrilled with my own powers of explanation, so here's a diagram instead!

    Roughly: the first smoothstep rises from 0 to 1 across [edge0, edge1], the second rises across [edge2, edge3], and their difference is 1 only between edge1 and edge2, falling smoothly to 0 across both edge bands.

    img

    Now think about how to draw a smoothly blended circle.

    For each point, first compute the vector from the circle center (stored back into position) and its length (stored in len).

    Mimicking the two-smoothstep difference above, subtracting the outer-edge interpolation produces a ring-shaped line.

    float circle(float2 position, float2 center, float radius, float lineWidth, float edgeWidth){
        position -= center;
        float len = length(position);
        //Change true to false to soften the edge
        float result = smoothstep(radius - lineWidth / 2.0 - edgeWidth, radius - lineWidth / 2.0, len) - smoothstep(radius + lineWidth / 2.0, radius + lineWidth / 2.0 + edgeWidth, len);
        return result;
    }
    img

    10. Radar Sweep Effect

    Then, with one horizontal line, one vertical line and a few nested circles, we put together a radar-style display.

    float3 color = float3(0.0f,0.0f,0.0f);
    color += onLine(uv.y, center.y, 0.002, 0.001) * axisColor.rgb;//xAxis
    color += onLine(uv.x, center.x, 0.002, 0.001) * axisColor.rgb;//yAxis
    color += circle(uv, center, 0.2f, 0.002, 0.001) * axisColor.rgb;
    color += circle(uv, center, 0.3f, 0.002, 0.001) * axisColor.rgb;
    color += circle(uv, center, 0.4f, 0.002, 0.001) * axisColor.rgb;

    Then draw the sweeping line itself, complete with a fading trail.

    float sweep(float2 position, float2 center, float radius, float lineWidth, float edgeWidth) {
        float2 direction = position - center;
        float theta = time + 6.3;
        float2 circlePoint = float2(cos(theta), -sin(theta)) * radius;
        float projection = clamp(dot(direction, circlePoint) / dot(circlePoint, circlePoint), 0.0, 1.0);
        float lineDistance = length(direction - circlePoint * projection);
        float gradient = 0.0;
        const float maxGradientAngle = PI * 0.5;
        if (length(direction) < radius) {
            float angle = fmod(theta + atan2(direction.y, direction.x), PI2);
            gradient = clamp(maxGradientAngle - angle, 0.0, maxGradientAngle) / maxGradientAngle * 0.5;
        }
        return gradient + 1.0 - smoothstep(lineWidth, lineWidth + edgeWidth, lineDistance);
    }

    Add it to the color.

    ...
    color += sweep(uv, center, 0.45f, 0.003, 0.001) * sweepColor.rgb;
    ...
    img

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_HUDOverlay/Assets/Shaders/HUDOverlay.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_HUDOverlay/Assets/Scripts/HUDOverlay.cs

    11. Gradient Background Shade Effect

    This effect can sit behind subtitles or other explanatory text. You could simply add a texture to the UI Canvas, but a Compute Shader allows more flexible effects and better use of resources.

    img

    Subtitle and dialogue backgrounds usually sit at the bottom of the screen, leaving the top untouched. They also need fairly high contrast, so the underlying image is pushed toward grayscale and darkened with a shade factor.

    if (id.y<(uint)tintHeight){
        float3 grayScale = (srcColor.r + srcColor.g + srcColor.b) * 0.33 * tintColor.rgb;
        float3 shaded = lerp(srcColor.rgb, grayScale, tintStrength) * shade;
        ... // continued below
    }else{
        color = srcColor;
    }
    img

    The gradient:

    ...// continued from above
        float srcAmount = smoothstep(tintHeight-edgeWidth, (float)tintHeight, (float)id.y);
        ...// continued below
    img

    Finally, lerp it all together.

    ...// continued from above
        color = lerp(float4(shaded, 1), srcColor, srcAmount);
    img

    12. Summary / Quiz

    If id.xy = [ 100, 30 ]. What would be the return value of inCircle((float2)id.xy, float2(130, 40), 40, 0.1)

    img

    When creating a blur effect which answer describes our approach best?

    img

    Which answer would create a blocky low resolution version of the source image?

    img

    What is smoothstep(5, 10, 6); ?

    img

    If a and b are both vectors, which answer best describes dot(a,b)/dot(b,b)?

    img

    What is _MainTex_TexelSize.x if _MainTex has a resolution of 512 x 256 pixels?

    img

    13. Post-processing with Blit and a Material

    Besides Compute Shader, there's a simpler way to do post-processing.

    // .cs
    Graphics.Blit(source, dest, material, passIndex);
    // .shader
    Pass{
        CGPROGRAM
        #pragma vertex vert_img
        #pragma fragment frag
        fixed4 frag(v2f_img input) : SV_Target{
            return tex2D(_MainTex, input.uv);
        }
        ENDCG
    }

    Here the image data is processed by a regular shader instead.

    So the questions are: what's the difference between the two approaches? And since what comes in is just a texture, where do the vertices come from?

    Answer:

    First question: this approach is known as screen-space shading. It is fully integrated into Unity's graphics pipeline and in practice performs better than a Compute Shader. A Compute Shader, on the other hand, offers finer-grained control over GPU resources; it is not bound by the graphics pipeline and can directly access and modify textures, buffers and other resources.

    Second question: look at vert_img. UnityCG contains the following definitions:

    img
    img

    Unity automatically turns the incoming texture into two triangles (a quad that fills the screen), so when writing post-processing with the material approach we only need to write the frag function.

    The next chapter looks at how Material, Shader, Compute Shader and C# tie together.

  • Compute Shader学习笔记(一)之 入门

    Compute Shader Learning Notes (I) Getting Started

    Tags: Getting Started/Shader/Compute Shader/GPU Optimization

    img

    Preface

    Compute Shader is fairly involved; mastering it takes some programming knowledge, graphics knowledge and an understanding of GPU hardware. These learning notes are split into four parts:

    • A first look at Compute Shader, implementing some simple effects
    • Drawing circles, planetary orbits, noise textures, manipulating meshes, and more
    • Post-processing and particle systems
    • Physics simulation and grass rendering
    • Fluid simulation

    The main references are:

    • https://www.udemy.com/course/compute-shaders/?couponCode=LEADERSALE24A
    • https://catlikecoding.com/unity/tutorials/basics/compute-shaders/
    • https://medium.com/ericzhan-publication/shader notes-a preliminary exploration of compute-shader-9efeebd579c1
    • https://docs.unity3d.com/Manual/class-ComputeShader.html
    • https://docs.unity3d.com/ScriptReference/ComputeShader.html
    • https://learn.microsoft.com/en-us/windows/win32/api/D3D11/nf-d3d11-id3d11devicecontext-dispatch
    • lygyue: Compute Shader (very interesting)
    • https://medium.com/@sengallery/unity-compute-shader-基礎認識-5a99df53cea1
    • https://kylehalladay.com/blog/tutorial/2014/06/27/Compute-Shaders-Are-Nifty.html (quite old, outdated)
    • http://www.sunshine2k.de/coding/java/Bresenham/RasterisingLinesCircles.pdf
    • 王江荣:【Unity】Compute Shader的基础介绍与使用
    • …to be continued

    L1 Introducing Compute Shader

    1. First Look at Compute Shader

    Simply put, a Compute Shader can compute a texture that is then displayed through a Renderer. Keep in mind that this is far from all a Compute Shader can do.

    img
    img

    Copy the two pieces of code below and give them a try.

    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;
    
    public class AssignTexture : MonoBehaviour
    {
        // ComputeShader used to run compute work on the GPU
        public ComputeShader shader;

        // Texture resolution
        public int texResolution = 256;

        // Renderer component
        private Renderer rend;
        // Render texture
        private RenderTexture outputTexture;
        // Handle of the compute shader kernel
        private int kernelHandle;
    
        // Start is called once when the script is enabled
        void Start()
        {
            // Create a new render texture with the given width, height and depth buffer (0 bits here)
            outputTexture = new RenderTexture(texResolution, texResolution, 0);
            // Allow random write access
            outputTexture.enableRandomWrite = true;
            // Actually create the render texture
            outputTexture.Create();

            // Get this object's Renderer component
            rend = GetComponent<Renderer>();
            // Enable the renderer
            rend.enabled = true;
    
            InitShader();
        }
    
        private void InitShader()
        {
            // Find the handle of the compute shader kernel "CSMain"
            kernelHandle = shader.FindKernel("CSMain");

            // Set the texture used by the compute shader
            shader.SetTexture(kernelHandle, "Result", outputTexture);

            // Use the render texture as the material's main texture
            rend.material.SetTexture("_MainTex", outputTexture);

            // Dispatch the compute shader, passing in the number of thread groups.
            // Each work group is assumed to be 16x16 here.
            // In short: how many groups are needed to cover the work? We currently dispatch only half along x and half along y, so only 1/4 of the image is rendered.
            DispatchShader(texResolution / 16, texResolution / 16);
        }
    
        private void DispatchShader(int x, int y)
        {
            // Dispatch the compute shader.
            // x and y are the numbers of thread groups; the 1 is the group count along z (only one here).
            shader.Dispatch(kernelHandle, x, y, 1);
        }
    
        void Update()
        {
            // Each frame, check for keyboard input (the U key being released)
            if (Input.GetKeyUp(KeyCode.U))
            {
                // If U was released, dispatch the compute shader again
                DispatchShader(texResolution / 8, texResolution / 8);
            }
        }
    }

    Unity's default Compute Shader:

    // Each #kernel tells which function to compile; you can have many kernels
    #pragma kernel CSMain
    
    // Create a RenderTexture with enableRandomWrite flag and set it
    // with cs.SetTexture
    RWTexture2D<float4> Result;
    
    [numthreads(8,8,1)]
    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        // TODO: insert actual code here!
        Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
    }

    In this example the bottom-left quarter of the texture shows a fractal pattern known as the Sierpinski gasket. That detail doesn't matter much; Unity simply considers it a representative pattern and ships it as the default code.

    Let's walk through the Compute Shader code; for the C# side the comments are enough.

    #pragma kernel CSMain declares the entry point of the Compute Shader; the name CSMain can be changed freely.

    RWTexture2D<float4> Result declares a readable and writable 2D texture: R for Read, W for Write.

    Pay special attention to this line:

    [numthreads(8,8,1)]

    In the Compute Shader file, this line sets the size of one thread group. An 8 * 8 * 1 group, for instance, contains 64 threads, and each thread computes one pixel of the RWTexture.

    In the C# file above, shader.Dispatch specifies how many thread groups to launch.

    img
    img
    img

    Now a question: if the thread group size is 8 x 8 x 1, how many thread groups do we need to render an RWTexture of size res*res?

    The answer is res/8 groups along each axis. Our code currently dispatches only res/16 along each axis, so only the bottom-left quarter is rendered.

    The entry function's parameter is also worth a mention: in uint3 id : SV_DispatchThreadID, id is the unique identifier of the current thread.

    2. Quadrant Pattern

    Before learning to walk, learn to crawl. First, specify in C# which task (kernel) to run.

    img

    So far the kernel name is hard-coded; now we expose it as a parameter so that different tasks can be run.

    public string kernelName = "CSMain";
    ...
    kernelHandle = shader.FindKernel(kernelName);

    Now it can be changed freely in the Inspector.

    img

    But serving an empty plate won't do; we need food on it. The cooking happens in the Compute Shader.

    First, set up a few items on the menu.

    #pragma kernel CSMain // already declared above
    #pragma kernel SolidRed // declare a new dish here and simply write it below
    ... // you can add many more
    [numthreads(8,8,1)]
    void CSMain (uint3 id : SV_DispatchThreadID){ ... }
    [numthreads(8,8,1)]
    void SolidRed (uint3 id : SV_DispatchThreadID){
     Result[id.xy] = float4(1,0,0,0); 
    }

    Changing the name in the Inspector switches to a different kernel.

    img

    What if I want to pass data to the Compute Shader, for example the resolution of the texture?

    shader.SetInt("texResolution", texResolution);
    img
    img

    And it needs to be declared in the Compute Shader as well (e.g. int texResolution;).

    img

    Here's a question to think about: how would you produce the effect below?

    img
    [numthreads(8,8,1)]
    void SplitScreen (uint3 id : SV_DispatchThreadID)
    {
        int halfRes = texResolution >> 1;
        Result[id.xy] = float4(step(halfRes, id.x),step(halfRes, id.y),0,1);
    }

    To explain, the step function is simply:

    step(edge, x){
        return x>=edge ? 1 : 0;
    }

    (uint)res >> 1 shifts the bits of res one position to the right, which is the same as dividing by 2 (binary arithmetic).

    The computation depends only on the current thread id.

    Threads in the bottom-left quadrant always output black, because both step calls return 0.

    Threads in the right half have id.x >= halfRes, so the red channel gets 1.

    And so on; it's very simple. If you're not convinced, work through a concrete case, which also helps clarify the relationship between thread ids, thread groups and the dispatch grid. For example, with texResolution = 256 and halfRes = 128, the thread at id = (200, 50) gives step(128, 200) = 1 and step(128, 50) = 0, so that pixel comes out red.

    img
    img

    3. Drawing a Circle

    The idea sounds simple: test whether (id.x, id.y) lies inside the circle and output 1 if it does, 0 otherwise. Give it a try.

    img
    float inCircle( float2 pt, float radius ){
        return ( length(pt)<radius ) ? 1.0 : 0.0;
    }
    
    [numthreads(8,8,1)]
    void Circle (uint3 id : SV_DispatchThreadID)
    {
        int halfRes = texResolution >> 1;
        int isInside = inCircle((float2)((int2)id.xy-halfRes), (float)(halfRes>>1));
        Result[id.xy] = float4(0.0,isInside ,0,1);
    }

    img

    4. Summary/Quiz

    If the output is an RWTexture with side length 256, which answer produces a completely red texture?

    RWTexture2D<float4> output;
    
    [numthreads(16,16,1)]
    void CSMain (uint3 id : SV_DispatchThreadID)
    {
         output[id.xy] = float4(1.0, 0.0, 0.0, 1.0);
    }

    img

    Which answer gives red on the left half of the output texture and yellow on the right?

    img

    L2 Getting Going

    1. Passing Values to the GPU

    img

    Without further ado, let's draw a circle. The two starting files are here.

    PassData.cs: https://pastebin.com/PMf4SicK

    PassData.compute: https://pastebin.com/WtfUmhk2

    The overall structure is unchanged from before; you can see that a drawCircle function is ultimately called to draw the circle.

    [numthreads(1,1,1)]
    void Circles (uint3 id : SV_DispatchThreadID)
    {
        int2 centre = (texResolution >> 1);
        int radius = 80;
        drawCircle( centre, radius );
    }

    The circle is drawn with the classic rasterization approach; if you're interested in the math, see http://www.sunshine2k.de/coding/java/Bresenham/RasterisingLinesCircles.pdf . The rough idea is to exploit the circle's symmetry.

    What's different is that here the thread group size is specified as (1,1,1). On the CPU side the compute shader is invoked like this:

    private void DispatchKernel(int count)
    {
        shader.Dispatch(circlesHandle, count, 1, 1);
    }
    void Update()
    {
        DispatchKernel(1);
    }

    Here's a question: how many times does a thread execute?

    Answer: exactly once. A thread group has only 1 x 1 x 1 = 1 thread, and the CPU dispatches only 1 x 1 x 1 = 1 group, so a single thread draws the whole circle. In other words, one thread can draw an entire RWTexture in one go; it doesn't have to be one thread per pixel as before.

    This also shows the fundamental difference between a Compute Shader and a fragment shader: a fragment shader only computes the color of a single pixel, while a Compute Shader can perform more or less arbitrary work!

    img

    Back in Unity: to draw a nicer circle we need an outline color and a fill color, so pass these two parameters to the compute shader.

    float4 clearColor;
    float4 circleColor;

    Also add a color-fill kernel and modify the Circles kernel. When several kernels access the same RWTexture, the shared keyword can be added.

    #pragma kernel Circles
    #pragma kernel Clear
        ...
    shared RWTexture2D<float4> Result;
        ...
    [numthreads(32,1,1)]
    void Circles (uint3 id : SV_DispatchThreadID)
    {
        // int2 centre = (texResolution >> 1);
        int2 centre = (int2)(random2((float)id.x) * (float)texResolution);
        int radius = (int)(random((float)id.x) * 30);
        drawCircle( centre, radius );
    }
    
    [numthreads(8,8,1)]
    void Clear (uint3 id : SV_DispatchThreadID)
    {
        Result[id.xy] = clearColor;
    }

    On the CPU, find the Clear kernel and pass in the data.

    private int circlesHandle;
    private int clearHandle;
        ...
    shader.SetVector( "clearColor", clearColor);
    shader.SetVector( "circleColor", circleColor);
        ...
    private void DispatchKernels(int count)
    {
        shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1);
        shader.Dispatch(circlesHandle, count, 1, 1);
    }
    void Update()
    {
        DispatchKernels(1); // there are now 32 circles on screen
    }

    A question: if the call becomes DispatchKernels(10), how many circles appear on screen?

    Answer: 320. With a dispatch of 1 x 1 x 1, one thread group has 32 x 1 x 1 = 32 threads and each thread draws one circle; the rest is grade-school math.

    Next, add a time variable so the circles change over time. Since a Compute Shader apparently has no built-in _Time, it has to be passed in from the CPU.

    On the CPU side, note that values that update in real time must be set before every Dispatch (outputTexture doesn't need this, because it is really just a reference to the GPU texture!):

    private void DispatchKernels(int count)
    {
        shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1);
        shader.SetFloat( "time", Time.time);
        shader.Dispatch(circlesHandle, count, 1, 1);
    }

    Compute Shader:

    float time;
    ...
    void Circles (uint3 id : SV_DispatchThreadID){
        ...
        int2 centre = (int2)(random2((float)id.x + time) * (float)texResolution);
        ...
    }

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Circle_Time/Assets/Shaders/PassData.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Circle_Time/Assets/Scripts/PassData.cs

    But right now the circles are chaotic; the next step is to use a buffer to make them behave more regularly.

    img

    Also, don't worry about multiple threads trying to write to the same memory location (such as an RWTexture) and causing a race condition; the current APIs handle this well.

    2. Passing Data to the GPU with a Buffer

    So far we've seen how to send simple values from the CPU to the GPU. How do we pass a custom struct?

    img

    We use a buffer as the medium. The buffer itself of course lives on the GPU; the CPU side (C#) only holds a reference to it. First declare a struct on the CPU, then declare the CPU-side array and the GPU-side buffer reference.

    struct Circle
    {
        public Vector2 origin;
        public Vector2 velocity;
        public float radius;
    }
        Circle[] circleData;  // lives on the CPU
        ComputeBuffer buffer; // lives on the GPU

    Thread group size info can be queried like this. The code below only fetches the x-dimension thread count of the circlesHandle kernel's group, discarding y and z (assumed to be 1). Multiplying by the number of dispatched groups gives the total thread count.

    uint threadGroupSizeX;
    shader.GetKernelThreadGroupSizes(circlesHandle, out threadGroupSizeX, out _, out _);
    int total = (int)threadGroupSizeX * count;

    Now prepare the data to send to the GPU. Here we create one circle per thread, i.e. circleData has one entry per thread.

    circleData = new Circle[total];
    float speed = 100;
    float halfSpeed = speed * 0.5f;
    float minRadius = 10.0f;
    float maxRadius = 30.0f;
    float radiusRange = maxRadius - minRadius;
    for(int i=0; i<total; i++)
    {
        Circle circle = circleData[i];
        circle.origin.x = Random.value * texResolution;
        circle.origin.y = Random.value * texResolution;
        circle.velocity.x = (Random.value * speed) - halfSpeed;
        circle.velocity.y = (Random.value * speed) - halfSpeed;
        circle.radius = Random.value * radiusRange + minRadius;
        circleData[i] = circle;
    }

    Then receive the buffer in the Compute Shader: declare an identical struct (Vector2 and float2 match) and a buffer reference.

    // Compute Shader
    struct circle
    {
        float2 origin;
        float2 velocity;
        float radius;
    };
    StructuredBuffer<circle> circlesBuffer;

    Note that the StructuredBuffer used here is read-only, unlike the RWStructuredBuffer covered in the next section.

    Back on the CPU, send the prepared data to the GPU through the buffer. First work out how big the buffer needs to be, i.e. how much we're sending: each circle holds two float2 fields and one float, and a float is 4 bytes (this can differ per platform; sizeof(float) will tell you), with circleData.Length circles to transfer. circleData.Length says how many circle objects the buffer must store, while the stride defines how many bytes each object occupies. With that space allocated, SetData() fills the buffer, and this is the step that actually transfers the data to the GPU. Finally, bind the GPU-side buffer to the chosen kernel of the Compute Shader.

    int stride = (2 + 2 + 1) * 4; //2 floats origin, 2 floats velocity, 1 float radius - 4 bytes per float
    buffer = new ComputeBuffer(circleData.Length, stride);
    buffer.SetData(circleData);
    shader.SetBuffer(circlesHandle, "circlesBuffer", buffer);

    At this point, the data prepared on the CPU has been handed to the GPU through the buffer.

    img

    OK, now let's put that hard-won GPU data to use.

    [numthreads(32,1,1)]
    void Circles (uint3 id : SV_DispatchThreadID)
    {
        int2 centre = (int2)(circlesBuffer[id.x].origin + circlesBuffer[id.x].velocity * time);
        while (centre.x>texResolution) centre.x -= texResolution;
        while (centre.x<0) centre.x += texResolution;
        while (centre.y>texResolution) centre.y -= texResolution;
        while (centre.y<0) centre.y += texResolution;
        uint radius = (int)circlesBuffer[id.x].radius;
        drawCircle( centre, radius );
    }

    You can now see the circles moving continuously, because the buffer stores each circle's origin and velocity (indexed by id.x), so its position is a smooth function of time.

    img

    To sum up, this section showed how to define a custom struct on the CPU, pass it to the GPU through a buffer, and process the data on the GPU.

    In the next section we learn how to read data back from the GPU to the CPU.

    • Current version code:
    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Using_Buffer/Assets/Shaders/BufferJoy.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Using_Buffer/Assets/Scripts/BufferJoy.cs

    3. Getting Data Back from the GPU

    Same routine: create a buffer, this time to carry data from the GPU back to the CPU, and define an array on the CPU to receive it. Create the buffer, bind it to the shader, and set up the CPU-side variable that will receive the GPU data.

    ComputeBuffer resultBuffer; // the buffer (GPU memory)
    Vector3[] output;           // receives the data on the CPU
    ...
        //buffer on the gpu in the ram
        resultBuffer = new ComputeBuffer(starCount, sizeof(float) * 3);
        shader.SetBuffer(kernelHandle, "Result", resultBuffer);
        output = new Vector3[starCount];

    Receive the buffer in the Compute Shader as well. This one is read-write, meaning the Compute Shader can modify it. In the previous section the shader only read the buffer, so a StructuredBuffer was enough; here we need the RW version.

    RWStructuredBuffer<float3> Result;

    Then, after the Dispatch, simply call GetData to receive the data.

    shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
    resultBuffer.GetData(output);
    img

    That's all there is to it. Now let's build a scene full of stars orbiting a common center.

    The star positions are computed on the GPU; the results are read back and applied to objects instantiated in C#.

    In the Compute Shader, each thread computes one star's position and writes it into the buffer.

    [numthreads(64,1,1)]
    void OrbitingStars (uint3 id : SV_DispatchThreadID)
    {
        float3 sinDir = normalize(random3(id.x) - 0.5);
        float3 vec = normalize(random3(id.x + 7.1393) - 0.5);
        float3 cosDir = normalize(cross(sinDir, vec));
        float scaledTime = time * 0.5 + random(id.x) * 712.131234;
        float3 pos = sinDir * sin(scaledTime) + cosDir * cos(scaledTime);
        Result[id.x] = pos * 2;
    }

    On the CPU, GetData retrieves the results, and the positions of the pre-instantiated GameObjects are updated every frame.

    void Update()
    {
        shader.SetFloat("time", Time.time);
        shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
        resultBuffer.GetData(output);
        for (int i = 0; i < stars.Length; i++)
            stars[i].localPosition = output[i];
    }
    img

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_GetData_From_Buffer/Assets/Shaders/OrbitingStars.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_GetData_From_Buffer/Assets/Scripts/OrbitingStars.cs

    4. Using Noise

    Generating a noise texture with a Compute Shader is simple and very fast.

    float random (float2 pt, float seed) {
        const float a = 12.9898;
        const float b = 78.233;
        const float c = 43758.543123;
        return frac(sin(seed + dot(pt, float2(a, b))) * c );
    }
    
    [numthreads(8,8,1)]
    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        float4 white = 1;
        Result[id.xy] = random(((float2)id.xy)/(float)texResolution, time) * white;
    }
    img

    There's a library that provides many more kinds of noise: https://pastebin.com/uGhMLKeM

    #include "noiseSimplex.cginc" // Paste the code above and named "noiseSimplex.cginc"
    
    ...
    
    [numthreads(8,8,1)]
    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        float3 POS = (((float3)id)/(float)texResolution) * 2.0;
        float n = snoise(POS);
        float ring = frac(noiseScale * n);
        float delta = pow(ring, ringScale) + n;
    
        Result[id.xy] = lerp(darkColor, paleColor, delta);
    }

    img

    5. Deforming a Mesh

    In this section a cube is morphed into a sphere by a Compute Shader, and the change is animated as a gradual transition!

    img

    As usual, declare the vertex data on the CPU, hand it to the GPU for computation, and apply the resulting new positions back to the mesh.

    For the vertex struct, the CPU-side declaration gets a convenience constructor; the GPU side mirrors it. We pass two buffers to the GPU, one read-only and one read-write. They start out identical; as time passes the read-write buffer gradually changes and the mesh morphs from a cube into a sphere.

    // CPU
    public struct Vertex
    {
        public Vector3 position;
        public Vector3 normal;
        public Vertex( Vector3 p, Vector3 n )
        {
            position.x = p.x;
            position.y = p.y;
            position.z = p.z;
            normal.x = n.x;
            normal.y = n.y;
            normal.z = n.z;
        }
    }
    ...
    Vertex[] vertexArray;
    Vertex[] initialArray;
    ComputeBuffer vertexBuffer;
    ComputeBuffer initialBuffer;
    // GPU
    struct Vertex {
        float3 position;
        float3 normal;
    };
    ...
    RWStructuredBuffer<Vertex>  vertexBuffer;
    StructuredBuffer<Vertex>    initialBuffer;

    The full initialization (in Start()) goes as follows:

    1. On the CPU, initialize the kernel and get a reference to the mesh
    2. Copy the mesh data into the CPU-side arrays
    3. Create the GPU buffers for the mesh data
    4. Upload the mesh data and the other parameters to the GPU

    Once that's done, every frame in Update we take the new vertices computed on the GPU and apply them to the mesh, as sketched below.
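
    A minimal sketch of that per-frame readback (field and kernel names are assumed to match the declarations above; the real project code is linked at the end of this section):

    void Update()
    {
        // Drive the cube-to-sphere blend with a 0..1 "breathing" factor.
        float delta = (Mathf.Sin(Time.time) + 1f) / 2f;
        shader.SetFloat("delta", delta);
        shader.Dispatch(kernelHandle, groupSize, 1, 1);

        // Read the deformed vertices back and push them into the Mesh.
        vertexBuffer.GetData(vertexArray);
        Vector3[] vertices = new Vector3[vertexArray.Length];
        Vector3[] normals = new Vector3[vertexArray.Length];
        for (int i = 0; i < vertexArray.Length; i++)
        {
            vertices[i] = vertexArray[i].position;
            normals[i] = vertexArray[i].normal;
        }
        mesh.vertices = vertices;
        mesh.normals = normals;
    }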

    So how is the GPU computation implemented?

    The trick is remarkably simple: just normalize every vertex in model space! If all the vertex position vectors are normalized, the model becomes a sphere.

    img

    In the actual code we also have to recompute the normals; otherwise the lighting looks very odd. So how are the normals computed? Very simply: the original cube vertex positions, normalized, are exactly the final sphere's normal vectors!

    img

    To get a "breathing" effect, a sine function drives the blend factor toward the normalized positions.

    float delta = (Mathf.Sin(Time.time) + 1)/ 2;
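
    Per vertex, the math is just a blend between the original cube-space position and its normalized (spherical) position. A C#-side sketch of the idea (the actual kernel does the same thing in HLSL):

    using UnityEngine;

    static class DeformSketch
    {
        // Blend a cube vertex (delta = 0) toward its spherical shape (delta = 1).
        public static void Deform(Vector3 cubePos, Vector3 cubeNormal, float delta,
                                  out Vector3 newPos, out Vector3 newNormal)
        {
            Vector3 spherePos = cubePos.normalized; // point on the unit sphere, also the sphere's normal
            newPos = Vector3.Lerp(cubePos, spherePos, delta);
            newNormal = Vector3.Lerp(cubeNormal, spherePos, delta).normalized;
        }
    }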

    The code is a bit long, so here's a link instead.

    Current version code:

    • Compute Shader:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Mesh_Cube2Sphere/Assets/Shaders/MeshDeform.compute
    • CPU:https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Mesh_Cube2Sphere/Assets/Scripts/MeshDeform.cs
    img

    6. Summary / Quiz

    How should this struct be defined on the GPU?

    struct Circle
    {
        public Vector2 origin;
        public Vector2 velocity;
        public float radius;
    }
    img

    What stride should the ComputeBuffer be given for this struct?

    struct Circle
    {
        public Vector2 origin;
        public Vector2 velocity;
        public float radius;
    }
    img

    Why is the code below wrong?

    StructuredBuffer<float3> positions;
    //Inside a kernel
    ...
    positions[id.x] = fixed3(1,0,0);
    img
