
Remo

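Tags: Unity

Unity Tessellation Explained in Detail
Tags: Beginner / Shader / Tessellation Shader / Displacement Map / LOD / Smooth Silhouettes / Early Culling
The term tessellation refers to a broad class of design activities, usually the arrangement of tiles of various geometric shapes side by side on a flat surface to form a pattern. Its purpose can be artistic or practical, and many examples date back thousands of years. (Tessellation, Wikipedia, accessed July 2020.)
This article is mainly based on:
https://nedmakesgames.medium.com/mastering-tessellation-shaders-and-their-many-uses-in-unity-9caeb760150e
In game development, tessellation usually subdivides a triangle (or quad) patch to increase the vertex count, and then displaces the new vertices, either with a displacement map or with the Phong tessellation or PN triangles schemes implemented in this article.
Phong tessellation needs no information about neighboring topology; it relies on interpolation alone and is more efficient than algorithms such as PN triangles. The Loop and Schaefer method mentioned in GAMES101 approximates Catmull-Clark surfaces with low-degree quadrilateral patches; in these methods every input polygon is replaced by a polynomial surface. The Phong tessellation used here, by contrast, needs no extra operations to patch up additional geometric regions.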
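1. Tessellation Pipeline Overview
This chapter introduces where tessellation sits in the rendering pipeline.
The tessellation stages come after the vertex shader and consist of three steps: Hull, Tessellator, and Domain. The Tessellator is not programmable.
The first step is the hull stage (also known as the Tessellation Control Shader, TCS), which outputs control points and tessellation factors. It consists of two functions that run in parallel: the Hull Function and the Patch Constant Function.
Both functions receive one patch at a time, i.e. a set of vertex indices; a triangle, for example, is represented by three vertex indices. Each patch makes up one primitive, so a triangle primitive consists of three vertex indices.
The Hull Function runs once per vertex and the Patch Constant Function runs once per patch. The former outputs the modified control point data (usually the vertex position, and possibly normals, texture coordinates, and other attributes). The latter outputs constant data for the whole patch, namely the tessellation factors, which tell the next stage (the tessellator) how to subdivide each patch.
Roughly speaking, the Hull Function modifies each control point, while the Patch Constant Function decides the tessellation level, for example based on camera distance.
Next comes the non-programmable stage, the tessellator. It receives the patch and the tessellation factors just computed, and generates barycentric coordinates for every vertex it creates.
The last step is the domain stage (Domain Stage, also known as the Tessellation Evaluation Shader, TES), which is programmable. It consists of the domain function, which runs once per vertex and receives the barycentric coordinates, the patch, and the results of both hull-stage functions. Most of the logic is written here. Above all, this is where you can reposition vertices, which is the key step of tessellation.
If there is a geometry shader, it runs after the Domain Stage. Otherwise rasterization follows.
To summarize: the vertex shader runs first. The hull stage receives the vertex data and decides how to subdivide the mesh. The tessellator stage then generates the subdivided mesh, and finally the domain stage outputs vertices for the fragment shader.
2. Tessellation Analysis
This chapter analyzes Unity tessellation code, shows practical examples, and outlines the underlying principles.
2.1 Key Code Analysis
2.1.1 Basic Tessellation Setup in Unity
First, a tessellation shader needs shader target 5.0.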
```
HLSLPROGRAM
#pragma target 5.0 // 5.0 required for tessellation

#pragma vertex Vertex
#pragma hull Hull
#pragma domain Domain
#pragma fragment Fragment

ENDHLSL
```
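2.1.2 Hull Stage Code 1: Hull Function
The flow is the usual one: the vertex shader converts the position and normal to world space and passes the result to the hull stage. Note that, unlike a regular vertex shader output, the control point position uses the INTERNALTESSPOS semantic rather than POSITION. The reason is that these positions are not handed on to the rasterizer; they are consumed by the tessellation stages themselves, so they should not be tagged as the final output position. The semantic does not transform the data, but it also makes the distinction clearer to the developer.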
```
struct Attributes {
    float3 positionOS : POSITION;
    float3 normalOS : NORMAL;
    UNITY_VERTEX_INPUT_INSTANCE_ID
};

struct TessellationControlPoint {
    float3 positionWS : INTERNALTESSPOS;
    float3 normalWS : NORMAL;
    UNITY_VERTEX_INPUT_INSTANCE_ID
};

TessellationControlPoint Vertex(Attributes input) {
    TessellationControlPoint output;

    UNITY_SETUP_INSTANCE_ID(input);
    UNITY_TRANSFER_INSTANCE_ID(input, output);

    VertexPositionInputs posnInputs = GetVertexPositionInputs(input.positionOS);
    VertexNormalInputs normalInputs = GetVertexNormalInputs(input.normalOS);

    output.positionWS = posnInputs.positionWS;
    output.normalWS = normalInputs.normalWS;
    return output;
}
```
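Below are the configuration attributes of the hull shader.
The first line, domain, defines the domain type of the tessellation shader; here both input and output are triangle primitives. Options include tri (triangles), quad (quadrilaterals), and so on.
The second line, outputcontrolpoints, is the number of output control points; 3 matches the three vertices of a triangle.
The third line, outputtopology, is the topology of the primitives after subdivision. triangle_cw means the vertices of the output triangles are ordered clockwise; the correct order makes sure the front face points outward. The options are triangle_cw (clockwise triangles), triangle_ccw (counter-clockwise triangles), and line (line segments).
The fourth line, patchconstantfunc, names the other function of the hull stage, which outputs constant data such as the tessellation factors. It runs once per patch.
The fifth line, partitioning, is the partitioning mode. It specifies how the extra vertices are distributed along the edges of the original patch, which can make the subdivision smoother and more even. The options are integer, fractional_even, and fractional_odd.
The sixth line, maxtessfactor, is the maximum tessellation factor; capping it keeps the rendering cost under control.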
```
[domain("tri")]
[outputcontrolpoints(3)]
[outputtopology("triangle_cw")]
[patchconstantfunc("PatchConstantFunction")]
[partitioning("fractional_even")]
[maxtessfactor(64.0)]
```
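In the hull shader, the function is invoked once for every control point, so it runs as many times as there are control points. To know which vertex is currently being processed, we use the variable id with the SV_OutputControlPointID semantic. The function also receives a special struct that gives array-like access to any control point in the patch.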
```
TessellationControlPoint Hull(
    InputPatch<TessellationControlPoint, 3> patch, uint id : SV_OutputControlPointID) {
    TessellationControlPoint h;
    // Hull shader code here

    return patch[id];
}
```
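2.1.3 Hull Stage Code 2: Patch Constant Function
Besides the hull function, the hull stage has a second function that runs in parallel with it: the patch constant function. Its signature is simple: it takes a patch and outputs the computed tessellation factors. The output struct contains a tessellation factor for each edge of the triangle, marked with the special system-value semantic SV_TessFactor. Each factor defines how many segments the corresponding edge is split into, which affects the density and detail of the final mesh. Let's look at what these factors contain.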
```
struct TessellationFactors {
    float edge[3] : SV_TessFactor;
    float inside : SV_InsideTessFactor;
};
// The patch constant function runs once per triangle, or "patch"
// It runs in parallel to the hull function
TessellationFactors PatchConstantFunction(
    InputPatch<TessellationControlPoint, 3> patch) {
    UNITY_SETUP_INSTANCE_ID(patch[0]); // Set up instancing
    // Calculate tessellation factors
    TessellationFactors f;
    f.edge[0] = _FactorEdge1.x;
    f.edge[1] = _FactorEdge1.y;
    f.edge[2] = _FactorEdge1.z;
    f.inside = _FactorInside;
    return f;
}
```
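First, the TessellationFactors struct holds the edge tessellation factors edge[3], marked SV_TessFactor. When triangles are the base primitive, each edge lies opposite the vertex with the same index. Specifically: edge 0 runs between vertex 1 and vertex 2, edge 1 between vertex 2 and vertex 0, and edge 2 between vertex 0 and vertex 1. Why? Intuitively, an edge's index equals the index of the one vertex it does not touch. This makes it quick to identify the edge that belongs to a given vertex when writing shader code.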
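The edge-to-vertex rule above is easy to get backwards, so here is a minimal sketch of it in Python (a CPU-side illustration, not shader code). The helper name `edge_endpoints` is made up for this example.

```python
def edge_endpoints(edge_index: int) -> tuple[int, int]:
    """Return the two vertex indices joined by a tri-domain edge.

    Edge i lies opposite vertex i, so it joins the other two vertices.
    """
    return (edge_index + 1) % 3, (edge_index + 2) % 3


for i in range(3):
    a, b = edge_endpoints(i)
    print(f"edge {i}: vertex {a} <-> vertex {b}")
```

The same index pattern shows up later when per-edge factors are computed from pairs of patch vertices.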
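There is also the center factor inside, marked SV_InsideTessFactor. This factor visibly changes the final pattern: it controls the subdivision density inside the triangle, i.e. how the interior is split into smaller triangles, whereas the edge factors control how many segments each outer edge is split into.
The Patch Constant Function can output other useful data as well, as long as it carries a proper semantic. The BEZIERPOS semantic, for instance, is very handy and can hold float3 data. Later we will use it to output the control points of a Bezier-based smoothing algorithm.
2.1.4 Domain Stage Code
Next comes the Domain Stage. The domain function also carries a domain attribute, which should match the domain declared on the Hull Function; this example uses triangles. The function takes the patch output by the Hull Function, the output of the Patch Constant Function, and most importantly the barycentric coordinates of the vertex. Its output struct is very close to a vertex shader's output: the clip-space position plus the lighting data the fragment shader needs.
Don't worry if its purpose is not clear yet; come back to it after reading chapter 4.
Put simply, every new vertex produced by the subdivision runs through this domain function once.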
```
struct Interpolators {
    float3 normalWS                 : TEXCOORD0;
    float3 positionWS               : TEXCOORD1;
    float4 positionCS               : SV_POSITION;
};

// Call this macro to interpolate between a triangle patch, passing the field name
#define BARYCENTRIC_INTERPOLATE(fieldName) \
        patch[0].fieldName * barycentricCoordinates.x + \
        patch[1].fieldName * barycentricCoordinates.y + \
        patch[2].fieldName * barycentricCoordinates.z

// The domain function runs once per vertex in the final, tessellated mesh
// Use it to reposition vertices and prepare for the fragment stage
[domain("tri")] // Signal we're inputting triangles
Interpolators Domain(
    TessellationFactors factors, // The output of the patch constant function
    OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
    float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle

    Interpolators output;

    // Setup instancing and stereo support (for VR)
    UNITY_SETUP_INSTANCE_ID(patch[0]);
    UNITY_TRANSFER_INSTANCE_ID(patch[0], output);
    UNITY_INITIALIZE_VERTEX_OUTPUT_STEREO(output);

    float3 positionWS = BARYCENTRIC_INTERPOLATE(positionWS);
    float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);

    output.positionCS = TransformWorldToHClip(positionWS);
    output.normalWS = normalWS;
    output.positionWS = positionWS;

    return output;
}
```
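In this function Unity hands us the tessellation factors, the three vertices of the patch, and the barycentric coordinates of the current new vertex. We can use this data for displacement and similar processing.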
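To see what the BARYCENTRIC_INTERPOLATE macro computes, here is a small Python sketch of the same weighted sum, run on the CPU. The function name and the sample triangle are assumptions made for illustration.

```python
def barycentric_interpolate(bary, a, b, c):
    """Blend three per-vertex values with barycentric weights (x, y, z)."""
    return tuple(
        bary[0] * va + bary[1] * vb + bary[2] * vc
        for va, vb, vc in zip(a, b, c)
    )


# Hypothetical patch: a right triangle in the z = 0 plane
p0, p1, p2 = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)

# Weights (1, 0, 0) reproduce the first corner
corner = barycentric_interpolate((1.0, 0.0, 0.0), p0, p1, p2)
# Equal weights give the centroid of the triangle
centroid = barycentric_interpolate((1 / 3, 1 / 3, 1 / 3), p0, p1, p2)
print(corner, centroid)
```

Any attribute stored per control point (position, normal, UV) can be blended the same way.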
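2.2 Tessellation Factors and Partitioning Modes in Detail
Copy the code from the referenced source, create a matching material, and turn on wireframe mode. So far we only output vertices for the mesh and do nothing in the fragment shader, so it looks transparent.
If any component of the edge factors is set to 0 or less, the mesh disappears completely. The figure shows the mesh after it has disappeared (with the Unity editor's selection outline turned on). This property is very important.
2.2.1 Tessellation Factor Overview
Put bluntly, once these factors (edge factors, inside factor) are set in the hull stage, the tessellator stage simply turns them into barycentric coordinates. (This assumes the tri domain; for quad the tessellator works with UV coordinates, which may be more complicated, and I have not looked into it.) This blunt stage is not programmable.
Take the integer (uniform) partitioning mode as the example for now: [partitioning("integer")]. The domain is triangles, [domain("tri")], the number of output control points is also 3, [outputcontrolpoints(3)], and the output topology is clockwise triangles, [outputtopology("triangle_cw")].
2.2.2 Preparation and a Potential Parallelism Issue
Change the code as follows: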
```
// .shader
_FactorEdge1("[Float3]Edge factors,[Float]Inside factor", Vector) = (1, 1, 1, 1) // --  Edited  -- 

// .hlsl
float4 _FactorEdge1; // --  Edited  -- 
...
f.edge[0] = _FactorEdge1.x;
f.edge[1] = _FactorEdge1.y; // --  Edited  -- 
f.edge[2] = _FactorEdge1.z; // --  Edited  -- 
f.inside = _FactorEdge1.w; // --  Edited  --
```
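There may be a problem here. The compiler sometimes splits the Patch Constant Function and computes each factor in parallel, which can cause some factors to be dropped, so a factor may inexplicably end up as 0. The fix is to pack the factors into a single vector so the compiler does not use undefined values. Below is a simple reproduction of the situation that can trigger it.
Modify the Patch Constant Function as follows and expose two new properties in the material inspector.
Modified lines are marked with the comment // --  Edited  --.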
```
// The patch constant function runs once per triangle, or "patch"
// It runs in parallel to the hull function
TessellationFactors PatchConstantFunction(
    InputPatch<TessellationControlPoint, 3> patch) {
    UNITY_SETUP_INSTANCE_ID(patch[0]); // Set up instancing
    // Calculate tessellation factors
    TessellationFactors f;
    f.edge[0] = _FactorEdge1.x;
    f.edge[1] = _FactorEdge2; // --  Edited  --
    f.edge[2] = _FactorEdge3; // --  Edited  --
    f.inside = _FactorInside;
    return f;
}

// .shader
_FactorEdge2("Edge 2 factor", Float) = 1 // --  Edited  --
_FactorEdge3("Edge 3 factor", Float) = 1 // --  Edited  --
```
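2.2.3 Edge Factor Effect: SV_TessFactor
As can be seen, each edge factor roughly corresponds to the number of segments its edge is split into, and the inside factor corresponds to the complexity of the interior.
The edge factors only affect the subdivision along the edges of the original triangle. The intricate interior pattern is controlled by the inside factor and the partitioning mode.
Note that in integer partitioning mode the factors are rounded up; 2.1 becomes 3, for example.
One figure says it all.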
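2.2.4 Inside Factor: SV_InsideTessFactor
We stay with integer mode as the example. The inside factor only affects how complex the interior pattern is; the details follow. In short, the edge factors affect the triangles between the outermost boundary and the first inner ring, the inside factor determines how many rings there are, and the partitioning mode affects how each inner ring is subdivided.
Suppose the edge factors are set to (2,3,4) and only the inside factor changes. An interesting property shows up: when the inside factor $n$ is even, one vertex sits exactly at the centroid, $(\frac{1}{3},\frac{1}{3},\frac{1}{3})$.
Normally the edge factors are all set to the same value. Using different values here makes the picture messier, but it exposes the underlying rule.
A further observation: the number of vertices on any edge of the ring closest to the outer triangle is tied to the inside factor $n$ by $\text{NumPoint} = n - 1$. That is, the vertex count on such an edge always equals the inside factor minus 1.
Each ring further inward has 2 fewer vertices per edge. Writing $m = n - 1$, the first ring (not counting the outermost boundary, which the inside factor does not subdivide) has $m$ vertices per edge, the second ring has $m - 2$, and so on.
Combining the three observations gives a conjecture and a result (not very useful, but I worked it out in an idle moment). The total number of interior vertices $a_k$ follows a formula in which $k$ is the inside factor minus 1 (keep in mind that the inside factor starts from 2): $a_{2n} = 3n^2$ and $a_{2n-1} = 3n(n-1) + 1$. These merge into $a_k = -0.125(-1)^k + 0.75k^2 + 0.125$, and an all-integer version is $a_k = \lfloor \frac{-(-1)^k + 6k^2 + 1}{8} \rfloor$.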
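The closed form can be checked against a direct ring-by-ring count. The Python sketch below assumes the model described in the observations (integer partitioning, each ring losing two vertices per edge, a ring with one vertex per edge collapsing to a single point); it is not a reading of what any particular GPU does. The function names are made up for this example.

```python
def interior_vertex_count(k: int) -> int:
    """Count interior vertices ring by ring.

    k is the number of vertices along one edge of the first inner ring
    (the inside factor minus 1). A triangular ring with k vertices per
    edge has 3 * (k - 1) vertices; k == 1 is a single center point.
    """
    total = 0
    while k >= 1:
        total += 1 if k == 1 else 3 * (k - 1)
        k -= 2  # each ring inward has two fewer vertices per edge
    return total


def closed_form(k: int) -> int:
    """Integer version of the formula: (-(-1)^k + 6k^2 + 1) / 8."""
    return (-((-1) ** k) + 6 * k * k + 1) // 8


for k in range(1, 9):
    print(k, interior_vertex_count(k), closed_form(k))
```

Both columns agree for every k in the loop, which at least supports the conjecture under this counting model.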
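2.2.5 Partitioning Mode: [partitioning("_")]
So far we have only covered the simplest mode, uniform integer partitioning, which subdivides by whole numbers. Now for the others. In short, Fractional Odd and Fractional Even are advanced versions of Integer: the former extends the cases where Integer takes odd values, the latter the cases where it takes even values. The extension is that the fractional part of the factor is used, so the subdivision is no longer uniform.
Fractional Odd: the inside factor can be fractional (it is not rounded up), and the denominator is odd. Note that the denominator here really means the denominator in each vertex's barycentric coordinates. A partition with an odd denominator always places a vertex on the centroid of the triangle; one with an even denominator does not. The figures here are borrowed from 凯奥斯.
[Animated figure: fractional_odd partitioning]
Fractional Even: similar to fractional_odd, but with an even denominator. I am not sure how to choose between the two in practice.
[Animated figure: fractional_even partitioning]
Pow2 (powers of two): this mode only allows powers of two (1, 2, 4, 8, and so on) as subdivision levels. It is generally used for texture mapping or shadow computation.
3. Tessellation Optimization
3.1 Frustum Culling
Generating this many vertices is bad for performance, so we need ways to improve rendering efficiency. Geometry outside the view frustum is culled before rasterization anyway, but if patches that do not need subdividing are culled early, in the TCS, the tessellation stages have less work to do.
If the tessellation factors are set to 0 in the Patch Constant Function, the tessellator ignores the patch. In other words, the culling here removes a whole patch at a time, unlike regular frustum culling, which is precise down to the vertex.
We test every point of the patch to see whether all of them are out of view. To do that, each point of the patch is converted to clip space, so the vertex shader has to compute the clip-space position of each point and pass it to the hull stage. GetVertexPositionInputs gives us exactly what we need.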
```
struct TessellationControlPoint {
    float4 positionCS : SV_POSITION; // --  Edited  -- 
    ...
};

TessellationControlPoint Vertex(Attributes input) {
    TessellationControlPoint output;
    ...
    VertexPositionInputs posnInputs = GetVertexPositionInputs(input.positionOS);
    ...
    output.positionCS = posnInputs.positionCS; // --  Edited  -- 
    ...
    return output;
}
```
Then write a test function above the Patch Constant Function that decides whether to cull the patch. For now it just returns false. It takes three clip-space points.
```
// Returns true if it should be clipped due to frustum or winding culling
bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
    return false;
}
```
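Next write an IsOutOfBounds function that tests whether a point lies outside given bounds. The bounds can be specified freely; a second function uses it to decide whether a point is outside the view frustum.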
```
// Returns true if the point is outside the bounds set by lower and higher
bool IsOutOfBounds(float3 p, float3 lower, float3 higher) {
    return p.x < lower.x || p.x > higher.x || p.y < lower.y || p.y > higher.y || p.z < lower.z || p.z > higher.z;
}

// Returns true if the given vertex is outside the camera frustum and should be culled
bool IsPointOutOfFrustum(float4 positionCS) {
    float3 culling = positionCS.xyz;
    float w = positionCS.w;
    // UNITY_RAW_FAR_CLIP_VALUE is either 0 or 1, depending on graphics API
    // Most use 0, however OpenGL uses 1
    float3 lowerBounds = float3(-w, -w, -w * UNITY_RAW_FAR_CLIP_VALUE);
    float3 higherBounds = float3(w, w, w);
    return IsOutOfBounds(culling, lowerBounds, higherBounds);
}
```
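In clip space, the W component is the homogeneous coordinate and determines whether a point is inside the view frustum. If x, y, or z falls outside the range [-w, w], the point is culled because it lies outside the frustum. Graphics APIs handle depth differently, so care is needed when using the z component as a bound. DirectX and Vulkan use a left-handed convention with clip depth in [0, 1], so UNITY_RAW_FAR_CLIP_VALUE is 0. OpenGL is right-handed with clip depth in [-1, 1], and UNITY_RAW_FAR_CLIP_VALUE is 1.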
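The bounds test can be tried out on the CPU with a few hand-picked points. The Python sketch below mirrors IsPointOutOfFrustum; the parameter `far_clip_value` is a stand-in for UNITY_RAW_FAR_CLIP_VALUE, and `should_clip_patch` is a hypothetical helper that applies the "all points outside" rule described in this section.

```python
def is_point_out_of_frustum(position_cs, far_clip_value):
    """Test clip-space x, y, z against the bounds derived from w.

    far_clip_value stands in for UNITY_RAW_FAR_CLIP_VALUE (0 or 1).
    """
    x, y, z, w = position_cs
    lower = (-w, -w, -w * far_clip_value)
    higher = (w, w, w)
    return any(
        value < lo or value > hi
        for value, lo, hi in zip((x, y, z), lower, higher)
    )


def should_clip_patch(points_cs, far_clip_value):
    """Cull the patch only if every one of its points is outside."""
    return all(is_point_out_of_frustum(p, far_clip_value) for p in points_cs)


inside = (0.0, 0.0, 0.5, 1.0)
too_far_right = (2.0, 0.0, 0.5, 1.0)
print(is_point_out_of_frustum(inside, 0))         # inside the bounds
print(is_point_out_of_frustum(too_far_right, 0))  # x exceeds w
```

One caveat with a per-point test: a large triangle can have all three corners outside the frustum and still cross the view, which is part of why a tolerance is useful.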
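With these in place we can decide whether a patch should be culled. Go back to the first function and check there whether all points of the patch should be culled.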
```
// Returns true if it should be clipped due to frustum or winding culling
bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
    bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
        IsPointOutOfFrustum(p1PositionCS) &&
        IsPointOutOfFrustum(p2PositionCS); // --  Edited  -- 
    return allOutside; // --  Edited  -- 
}
```
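3.2 Back-Face Culling
Besides frustum culling, a patch can also go through back-face culling, using the normal vector to decide whether the patch should be culled.
The cross product of two edge vectors gives the normal. Since we are currently in clip space, a perspective divide is needed first to get NDC, whose range should be [-1,1]. The conversion is needed because clip-space positions are not linear (they have not been divided by w yet), which can distort the vertex positions; a linear space such as NDC gives a more accurate picture of how the vertices are arranged.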
```
// Returns true if the points in this triangle are wound counter-clockwise
bool ShouldBackFaceCull(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
    float3 point0 = p0PositionCS.xyz / p0PositionCS.w;
    float3 point1 = p1PositionCS.xyz / p1PositionCS.w;
    float3 point2 = p2PositionCS.xyz / p2PositionCS.w;
    float3 normal = cross(point1 - point0, point2 - point0);
    return dot(normal, float3(0, 0, 1)) < 0;
}
```
The code above still has a cross-platform problem. The view direction points differently in different APIs, so the code needs a small change.
```
// In clip space, the view direction is float3(0, 0, 1), so we can just test the z coord
#if UNITY_REVERSED_Z
    return cross(point1 - point0, point2 - point0).z < 0;
#else // In OpenGL, the test is reversed
    return cross(point1 - point0, point2 - point0).z > 0;
#endif
```
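The winding test boils down to the sign of the z component of a 2D cross product after the perspective divide. The following Python sketch shows that on the CPU; which sign counts as back-facing depends on the platform, as the #if above shows, so the sketch only returns the signed value. The function names are made up for this example.

```python
def ndc_xy(position_cs):
    """Perspective divide: clip-space (x, y, z, w) to NDC x and y."""
    x, y, _, w = position_cs
    return x / w, y / w


def winding_z(p0_cs, p1_cs, p2_cs):
    """z component of cross(p1 - p0, p2 - p0), computed in NDC."""
    x0, y0 = ndc_xy(p0_cs)
    x1, y1 = ndc_xy(p1_cs)
    x2, y2 = ndc_xy(p2_cs)
    return (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)


a = (0.0, 0.0, 0.0, 1.0)
b = (1.0, 0.0, 0.0, 1.0)
c = (0.0, 1.0, 0.0, 1.0)
print(winding_z(a, b, c))  # positive for this vertex order
print(winding_z(a, c, b))  # swapping two vertices flips the sign
```

Swapping any two vertices flips the sign, which is exactly what separates front-facing from back-facing triangles.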
Last of all, add the new function to ShouldClipPatch so that it also handles back-face culling.
```
// Returns true if it should be clipped due to frustum or winding culling
bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
    bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
        IsPointOutOfFrustum(p1PositionCS) &&
        IsPointOutOfFrustum(p2PositionCS);
    return allOutside || ShouldBackFaceCull(p0PositionCS, p1PositionCS, p2PositionCS); // --  Edited  -- 
}
```
Then, in PatchConstantFunction, set the tessellation factors of any patch that should be culled to 0.
```
...
if (ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS)) {
        f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0; // Cull the patch
}
...
```
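3.3 Adding Tolerance
You may want to verify that the code is correct, or you may run into patches being culled unexpectedly. Adding a tolerance is a flexible way to deal with both.
First, the frustum culling tolerance. If the tolerance is positive, the culling bounds expand, so objects near the edge of the frustum are not culled even when they are partly out of bounds. This reduces how often the culling state flips due to small camera movements or object motion.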
```
// Returns true if the given vertex is outside the camera frustum and should be culled
bool IsPointOutOfFrustum(float4 positionCS, float tolerance) {
    float3 culling = positionCS.xyz;
    float w = positionCS.w;
    // UNITY_RAW_FAR_CLIP_VALUE is either 0 or 1, depending on graphics API
    // Most use 0, however OpenGL uses 1
    float3 lowerBounds = float3(-w - tolerance, -w - tolerance, -w * UNITY_RAW_FAR_CLIP_VALUE - tolerance);
    float3 higherBounds = float3(w + tolerance, w + tolerance, w + tolerance);
    return IsOutOfBounds(culling, lowerBounds, higherBounds);
}
```
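Next, adjust back-face culling. In practice, comparing against a tolerance instead of zero avoids problems caused by numerical precision. A primitive is treated as back-facing only when the z component of the cross product passes the tolerance threshold (below -tolerance, or above tolerance on platforms where the test is reversed), not merely when it crosses zero. This provides an extra buffer and makes sure only clearly back-facing primitives are culled.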
```
// Returns true if the points in this triangle are wound counter-clockwise
bool ShouldBackFaceCull(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS, float tolerance) {
    float3 point0 = p0PositionCS.xyz / p0PositionCS.w;
    float3 point1 = p1PositionCS.xyz / p1PositionCS.w;
    float3 point2 = p2PositionCS.xyz / p2PositionCS.w;
    // In clip space, the view direction is float3(0, 0, 1), so we can just test the z coord
#if UNITY_REVERSED_Z
    return cross(point1 - point0, point2 - point0).z < -tolerance;
#else // In OpenGL, the test is reversed
    return cross(point1 - point0, point2 - point0).z > tolerance;
#endif
}
```
You can expose a Range for it in the material inspector.
```
// .shader
Properties{
    _tolerance("_tolerance",Range(-0.002,0.001)) = 0
    ...
}
// .hlsl
float _tolerance;
...
// Returns true if it should be clipped due to frustum or winding culling
bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
    bool allOutside = IsPointOutOfFrustum(p0PositionCS, _tolerance) &&
        IsPointOutOfFrustum(p1PositionCS, _tolerance) &&
        IsPointOutOfFrustum(p2PositionCS, _tolerance); // --  Edited  -- 
    return allOutside || ShouldBackFaceCull(p0PositionCS, p1PositionCS, p2PositionCS,_tolerance); // --  Edited  -- 
}
```
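3.4 Dynamic Tessellation Factors
So far the algorithm subdivides every surface indiscriminately. A complex mesh, however, may contain faces of very different sizes. Large faces are more prominent visually and need more subdivision to keep the surface smooth and detailed. Small faces can get by with less subdivision without much visual impact. Changing the factors based on edge length is a common approach: design the algorithm so that faces with longer edges get higher tessellation factors.
Apart from the face sizes of the mesh itself, the distance between the camera and the patch can also drive the factors. Objects far from the camera can use lower factors because they cover fewer pixels on screen. The viewing angle and view direction can be used as well: faces turned toward the camera get subdivided first, while faces turned away or seen edge-on get less subdivision.
3.4.1 Fixed Tessellation Scaling
Take the distance between two vertices; the larger the distance, the larger the tessellation factor. scale is exposed in the material inspector with the range [0,1]. When scale is 1, the factor comes directly from the distance between the two points; the closer scale is to 0, the larger the factor (a scale of exactly 0 divides by zero, so keep it slightly above 0). A base value, bias, is added on top. Finally the factor is clamped to 1 or more so that it stays valid.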
```
// Calculate the tessellation factor for an edge
float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float3 p1PositionWS) {
    float factor = distance(p0PositionWS, p1PositionWS) / scale;

    return max(1, factor + bias);
}
```
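To get a feel for the numbers this produces, here is the same formula as a Python sketch, evaluated for a long and a short edge. The sample positions and parameter values are assumptions made for illustration.

```python
import math


def edge_tessellation_factor(scale, bias, p0_ws, p1_ws):
    """Edge length divided by scale, plus bias, never below 1."""
    factor = math.dist(p0_ws, p1_ws) / scale
    return max(1.0, factor + bias)


long_edge = edge_tessellation_factor(0.5, 0.0, (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
short_edge = edge_tessellation_factor(0.5, 0.0, (0.0, 0.0, 0.0), (0.1, 0.0, 0.0))
print(long_edge, short_edge)
```

The long edge gets a factor of 4 while the short one is clamped to the minimum of 1, so large faces receive more subdivision than small ones.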
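Then update the material properties and the Patch Constant Function. Using the average of the edge factors as the inside factor generally gives a visually coherent result.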
```
// .shader
Properties{
    ...
    _TessellationBias("_TessellationBias", Range(-1,5)) = 1
     _TessellationFactor("_TessellationFactor", Range(0,1)) = 0
}

// .hlsl

f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS);
f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS);
f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS);
f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;
```
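Primitives of different sizes now get different levels of subdivision.
By the way, if the interior pattern looks very odd, the compiler may be the cause; changing the inside factor code to the following should fix it.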
```
f.inside = ( // If the compiler doesn't play nice...
  EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS) + 
  EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS) + 
  EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS)
  ) / 3.0;
```
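3.4.2 Screen-Space Tessellation Scaling
Next, bring the camera distance into play. We can use the screen-space distance directly to adjust the tessellation level, which neatly handles both uneven face sizes and distance on screen at once!
We already have the clip-space data. Because screen space is very similar to NDC space, converting to NDC, i.e. doing a perspective divide, is enough.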
```
float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float4 p0PositionCS, float3 p1PositionWS, float4 p1PositionCS) {
    float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) / scale;

    return max(1, factor + bias);
}
```
Next, pass the clip-space coordinates in from the Patch Constant Function.
```
f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
  patch[1].positionWS, patch[1].positionCS, patch[2].positionWS, patch[2].positionCS);
f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
  patch[2].positionWS, patch[2].positionCS, patch[0].positionWS, patch[0].positionCS);
f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, 
  patch[0].positionWS, patch[0].positionCS, patch[1].positionWS, patch[1].positionCS);
f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;
```
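The result is quite good: the tessellation level changes dynamically with the camera distance (the screen-space distance). A partitioning mode other than integer gives a smoother transition.
A few things can still be improved, such as the unit of the scale value. We limited it to [0,1] earlier, which is not very convenient to tune. Multiply by the screen resolution and change the range of the scale to [0,1080], which is easier to adjust, then update the material property. The scale is now expressed in pixels.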
```
// .hlsl
float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) * _ScreenParams.y / scale;

// .shader
_TessellationFactor("_TessellationFactor",Range(0,1080)) = 320
```
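The screen-space version can be sketched on the CPU as well. The Python below assumes a 1080-pixel-high screen in place of _ScreenParams.y and uses made-up clip-space points; it shows the same edge shrinking on screen when w grows.

```python
import math


def screen_space_edge_factor(scale_px, bias, p0_cs, p1_cs, screen_height_px):
    """Distance between two points in NDC, scaled by the screen height."""
    ndc0 = [c / p0_cs[3] for c in p0_cs[:3]]
    ndc1 = [c / p1_cs[3] for c in p1_cs[:3]]
    factor = math.dist(ndc0, ndc1) * screen_height_px / scale_px
    return max(1.0, factor + bias)


# Hypothetical edge, first near the camera (w = 1), then farther away (w = 4)
near = screen_space_edge_factor(320.0, 0.0, (-1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0), 1080.0)
far = screen_space_edge_factor(320.0, 0.0, (-1.0, 0.0, 0.0, 4.0), (1.0, 0.0, 0.0, 4.0), 1080.0)
print(near, far)
```

Since NDC spans two units from one side of the screen to the other, the product is roughly twice the edge length in pixels; the scale slider absorbs that constant.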
3.4.3 相机距离细分缩放
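How do we scale by camera distance? It is simple: compute the ratio between the distance between the two points and the distance from their midpoint to the camera position (the code below divides by the square of that distance). The larger the ratio, the more screen space the edge covers and the more subdivision it needs.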
```
// .hlsl
float EdgeTessellationFactor(float scale, float bias, float3 p0PositionWS, float3 p1PositionWS) {
    float length = distance(p0PositionWS, p1PositionWS);
    float distanceToCamera = distance(GetCameraPositionWS(), (p0PositionWS + p1PositionWS) * 0.5);
    float factor = length / (scale * distanceToCamera * distanceToCamera);
    return max(1, factor + bias);
}
...
        f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[1].positionWS, patch[2].positionWS);
        f.edge[1] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[2].positionWS, patch[0].positionWS);
        f.edge[2] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, patch[0].positionWS, patch[1].positionWS);

// .shader
_TessellationFactor("_TessellationFactor",Range(0, 1)) = 0.02
```
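Note that the scale is no longer in pixels here; it goes back to the original [0,1] range. Screen pixels do not mean much for this method, so they are dropped, and world-space coordinates are used again.
Screen-space scaling and camera-distance scaling give fairly similar results. A common approach is to expose a keyword that switches between the dynamic factor modes above; this is left for the reader.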
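For comparison, here is the camera-distance formula from this section as a Python sketch. The camera position is passed in explicitly as a stand-in for GetCameraPositionWS(), and the sample values are assumptions.

```python
import math


def camera_distance_edge_factor(scale, bias, p0_ws, p1_ws, camera_ws):
    """Edge length divided by scale times the squared distance to the camera."""
    length = math.dist(p0_ws, p1_ws)
    midpoint = [(a + b) * 0.5 for a, b in zip(p0_ws, p1_ws)]
    distance_to_camera = math.dist(camera_ws, midpoint)
    factor = length / (scale * distance_to_camera * distance_to_camera)
    return max(1.0, factor + bias)


camera = (0.0, 0.0, 0.0)
close = camera_distance_edge_factor(0.02, 0.0, (-1.0, 0.0, 2.0), (1.0, 0.0, 2.0), camera)
distant = camera_distance_edge_factor(0.02, 0.0, (-1.0, 0.0, 10.0), (1.0, 0.0, 10.0), camera)
print(close, distant)
```

The same edge drops from a factor of about 25 at distance 2 to the minimum of 1 at distance 10, so the falloff is steep because of the squared distance.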
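3.5 Specifying Tessellation Factors
3.5.1 Storing Tessellation Factors in Vertices
In the previous section we used different strategies to guess suitable tessellation factors. If we know exactly how the mesh should be subdivided, we can store multipliers for the factors in the mesh itself. Each multiplier is a single float, so one color channel is enough. The following is pseudocode, just to convey the idea.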
```
float EdgeTessellationFactor(float scale, float bias, float multiplier) {
    ...
    return max(1, (factor + bias) * multiplier);
}

...
// PCF()
[unroll] for (int i = 0; i < 3; i++) {
    multipliers[i] = patch[i].color.g;
}
// Calculate tessellation factors
f.edge[0] = EdgeTessellationFactor(_TessellationFactor, _TessellationBias, (multipliers[1] + multipliers[2]) / 2);
```
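3.5.2 Controlling Tessellation Factors with an SDF
Driving the tessellation factors with a signed distance field (SDF) is rather cool. This section does not cover generating the SDF; we assume it is available through a ready-made function, CalculateSDFDistance.
For a given mesh, use CalculateSDFDistance to compute the distance from each vertex of each patch to the shape represented by the SDF (a sphere, for example). With the distances known, evaluate how much subdivision the patch needs and subdivide accordingly.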
```
TessellationFactors PatchConstantFunction(
    InputPatch<TessellationControlPoint, 3> patch) {
    float multipliers[3];

    // Process each vertex of the patch
    [unroll] for (int i = 0; i < 3; i++) {
        // Distance from this vertex to the SDF surface
        float sdfDistance = CalculateSDFDistance(patch[i].positionWS);

        // Adjust the multiplier based on the SDF distance
        if (sdfDistance < _TessellationDistanceThreshold) {
            multipliers[i] = lerp(_MinTessellationFactor, _MaxTessellationFactor, (1 - sdfDistance / _TessellationDistanceThreshold));
        } else {
            multipliers[i] = _MinTessellationFactor;
        }
    }

    // Compute the final tessellation factors
    // Edge i lies opposite vertex i, so it uses the other two vertices
    TessellationFactors f;
    f.edge[0] = max(multipliers[1], multipliers[2]);
    f.edge[1] = max(multipliers[2], multipliers[0]);
    f.edge[2] = max(multipliers[0], multipliers[1]);
    f.inside = (multipliers[0] + multipliers[1] + multipliers[2]) / 3;

    return f;
}
```
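I do not know how to implement this in full myself, so take it as a rough sketch for now.
4. Vertex Displacement: Silhouette Smoothing
The simplest way to add detail to a mesh is to apply all kinds of high-resolution textures. But geometry outranks textures: adding mesh vertices works better than raising texture resolution. A normal map, for example, can change the normal direction of every fragment, but it does not change the geometric shape. Even a 128K texture cannot get rid of jagged and pointy edges.
So we turn to tessellation and then displace the vertices. All the tessellation operations so far take place in the plane of the patch. If we want to bend these vertices, one of the simplest options is Phong tessellation.
4.1 Phong Tessellation
First, the original paper: https://perso.telecom-paristech.fr/boubek/papers/PhongTessellation/PhongTessellation.pdf
Phong shading should be familiar: it linearly interpolates normal vectors to get smooth shading. Phong tessellation is inspired by Phong shading and extends the idea to the spatial domain.
The core idea of Phong tessellation is to let the vertex normal at each corner of the triangle influence the position of the new vertices created during subdivision, producing a curved surface instead of a flat one.
Note that many tutorials say triangle corner where this article says vertex. I think they mean much the same, and this article sticks with vertex.
First, inside the domain function Unity gives us the barycentric coordinates of the new vertex to process. Suppose we are currently processing $(\frac{1}{3},\frac{1}{3},\frac{1}{3})$.
Every vertex of the patch has a normal. Imagine a tangent plane at each vertex, perpendicular to that vertex's normal.
Then project the current vertex onto each of these three tangent planes.
In mathematical terms: $P' = P - ((P - V) \cdot N) N$
where:
- $P$ is the initially interpolated position on the flat triangle.
- $V$ is a vertex position lying on the plane.
- $N$ is the normal at vertex $V$.
- $\cdot$ denotes the dot product.
- $P'$ is the projection of $P$ onto the plane.
This gives three points $P'$.
The three projected points, one on each tangent plane, form a new triangle. Applying the barycentric coordinates of the current vertex to this new triangle gives the new point.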
```
// Calculate Phong projection offset
float3 PhongProjectedPosition(float3 flatPositionWS, float3 cornerPositionWS, float3 normalWS) {
    return flatPositionWS - dot(flatPositionWS - cornerPositionWS, normalWS) * normalWS;
}

// Apply Phong smoothing
float3 CalculatePhongPosition(float3 bary, float3 p0PositionWS, float3 p0NormalWS,
    float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
    // Flat (unsmoothed) position: barycentric blend of the three corners
    float3 flatPositionWS = bary.x * p0PositionWS + bary.y * p1PositionWS + bary.z * p2PositionWS;
    float3 smoothedPositionWS =
        bary.x * PhongProjectedPosition(flatPositionWS, p0PositionWS, p0NormalWS) +
        bary.y * PhongProjectedPosition(flatPositionWS, p1PositionWS, p1NormalWS) +
        bary.z * PhongProjectedPosition(flatPositionWS, p2PositionWS, p2NormalWS);
    return smoothedPositionWS;
}

// The domain function runs once per vertex in the final, tessellated mesh
// Use it to reposition vertices and prepare for the fragment stage
[domain("tri")] // Signal we're inputting triangles
Interpolators Domain(
    TessellationFactors factors, // The output of the patch constant function
    OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
    float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle

    Interpolators output;
    ...
    float3 positionWS = CalculatePhongPosition(barycentricCoordinates, 
      patch[0].positionWS, patch[0].normalWS, 
      patch[1].positionWS, patch[1].normalWS, 
      patch[2].positionWS, patch[2].normalWS);
    float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);
    float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
    ...
    output.positionCS = TransformWorldToHClip(positionWS);
    output.normalWS = normalWS;
    output.positionWS = positionWS;
    output.tangentWS = float4(tangentWS, patch[0].tangentWS.w);
    ...
}
```
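Note that a tangent vector has to be added here and passed through Vertex and Domain. Also write a function that performs the barycentric interpolation used to combine the $P'$ points.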
```
struct Attributes {
    ...
    float4 tangentOS : TANGENT;
};
struct TessellationControlPoint {
    ...
    float4 tangentWS : TANGENT;
};
struct Interpolators {
    ...
    float4 tangentWS : TANGENT;
};
TessellationControlPoint Vertex(Attributes input) {
    TessellationControlPoint output;
    ...
    // The last component (w) is the sign factor
    output.tangentWS = float4(normalInputs.tangentWS, input.tangentOS.w); // tangent.w contains bitangent multiplier
}
// Barycentric interpolation as a function
float3 BarycentricInterpolate(float3 bary, float3 a, float3 b, float3 c) {
    return bary.x * a + bary.y * b + bary.z * c;
}
```
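The original Phong tessellation paper also introduces a factor α that controls the amount of bending. The authors recommend setting it globally to three quarters for the best visual result. Expanding the algorithm with the α factor yields a quadratic Bezier patch; it cannot produce inflection points, but it is enough for practical development.
First, look at the formula in the original paper.
In essence it controls the degree of interpolation. A quick quantitative check: when α=0, all vertices stay in the original plane, which amounts to no displacement at all. When α=1, the new vertices depend entirely on the Phong-bent positions. You can of course try values below zero or above one, and the results are quite interesting too. ~~If the math in the paper makes no sense to you, no problem: I just throw in a lerp and interpolate with abandon.~~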
```
// Apply Phong smoothing
float3 CalculatePhongPosition(float3 bary, float smoothing, float3 p0PositionWS, float3 p0NormalWS,
    float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
    float3 flatPositionWS = BarycentricInterpolate(bary, p0PositionWS, p1PositionWS, p2PositionWS);
    float3 smoothedPositionWS =
        bary.x * PhongProjectedPosition(flatPositionWS, p0PositionWS, p0NormalWS) +
        bary.y * PhongProjectedPosition(flatPositionWS, p1PositionWS, p1NormalWS) +
        bary.z * PhongProjectedPosition(flatPositionWS, p2PositionWS, p2NormalWS);
    return lerp(flatPositionWS, smoothedPositionWS, smoothing);
}

```
Don't forget to expose it in the material inspector.
```
// .shader
_TessellationSmoothing("_TessellationSmoothing", Range(0,1)) = 0.5

// .hlsl
float _TessellationSmoothing;



Interpolators Domain( .... ) {
    ...
    float smoothing = _TessellationSmoothing;
    float3 positionWS = CalculatePhongPosition(barycentricCoordinates, smoothing,
      patch[0].positionWS, patch[0].normalWS, 
      patch[1].positionWS, patch[1].normalWS, 
      patch[2].positionWS, patch[2].normalWS);
    ...
}
```
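The projection and the α blend can be checked numerically outside the shader. The Python sketch below reimplements PhongProjectedPosition and the lerp on the CPU; the triangle and its outward-tilted normals are made up for illustration, and the vector helpers are hypothetical stand-ins for HLSL's built-in operators.

```python
import math


def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def mul(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def normalize(a): return mul(a, 1.0 / math.sqrt(dot(a, a)))


def blend(bary, a, b, c):
    """Barycentric blend of three points."""
    return add(add(mul(a, bary[0]), mul(b, bary[1])), mul(c, bary[2]))


def phong_projected_position(flat, corner, normal):
    """Project flat onto the tangent plane through corner (unit normal)."""
    return sub(flat, mul(normal, dot(sub(flat, corner), normal)))


def phong_position(bary, alpha, positions, normals):
    flat = blend(bary, *positions)
    projected = [phong_projected_position(flat, p, n)
                 for p, n in zip(positions, normals)]
    smoothed = blend(bary, *projected)
    # Same as lerp(flat, smoothed, alpha)
    return add(mul(flat, 1.0 - alpha), mul(smoothed, alpha))


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
normals = [normalize(n) for n in [(-0.5, -0.5, 1.0), (0.5, 0.0, 1.0), (0.0, 0.5, 1.0)]]
centroid = (1 / 3, 1 / 3, 1 / 3)
flat_point = phong_position(centroid, 0.0, positions, normals)
curved_point = phong_position(centroid, 0.75, positions, normals)
print(flat_point, curved_point)
```

With α=0 the centroid stays in the triangle's plane; with α=0.75 it lifts off the plane, and the corners themselves never move because each corner already lies on its own tangent plane.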
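One thing needs special attention: some models need a bit of touching up. If an edge of the model is very sharp, the vertex normal is almost parallel to the normal of the face it belongs to. In Phong tessellation this means the projection onto the tangent plane ends up very close to the original position, so the subdivision has less effect.
To fix this, add more geometric detail in the modeling software by adding loop edges (loop cuts): insert extra edge loops near the edges of the original model to raise the subdivision density. The exact steps are not covered here.
Overall, Phong tessellation offers decent quality and performance. If you want higher-quality smoothing, consider PN triangles, a technique that bends triangles using Bezier curves.
4.2 PN Triangles Tessellation
First, the original paper: http://alex.vlachos.com/graphics/CurvedPNTriangles.pdf
PN Triangles needs no information about neighboring triangles and is fairly cheap. The algorithm only requires the positions and normals of the three vertices in the patch; everything else can be computed from them. Note that all data is expressed in barycentric coordinates.
The PN algorithm first computes 10 control points for the subdivision, as shown in the figure: the three vertices of the triangle, one center point, and three pairs of control points along the edges. The computed Bezier control points are passed to the domain stage. Because the control points are the same for the whole triangle patch, the Patch Constant Function is a very suitable place to compute them.
The paper computes them as follows: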
$\begin{aligned} b_{300} & =P_1 \\ b_{030} & =P_2 \\ b_{003} & =P_3 \\ w_{i j} & =\left(P_j-P_i\right) \cdot N_i \in \mathbf{R} \quad \text{(here } \cdot \text{ is the scalar product)} \\ b_{210} & =\left(2 P_1+P_2-w_{12} N_1\right) / 3 \\ b_{120} & =\left(2 P_2+P_1-w_{21} N_2\right) / 3 \\ b_{021} & =\left(2 P_2+P_3-w_{23} N_2\right) / 3 \\ b_{012} & =\left(2 P_3+P_2-w_{32} N_3\right) / 3 \\ b_{102} & =\left(2 P_3+P_1-w_{31} N_3\right) / 3 \\ b_{201} & =\left(2 P_1+P_3-w_{13} N_1\right) / 3 \\ E & =\left(b_{210}+b_{120}+b_{021}+b_{012}+b_{102}+b_{201}\right) / 6 \\ V & =\left(P_1+P_2+P_3\right) / 3 \\ b_{111} & =E+(E-V) / 2 \end{aligned}$
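The $w_{i j}$ in the formulas is computed twice per edge, so six times in total. For example, $w_{12}$ is the length of the projection of the vector from $P_1$ to $P_2$ onto the normal at $P_1$. Multiplying it by that normal gives the projection vector of length $w$.
Take the control point close to $P_1$ as an example. The current vertex should carry more weight, so it is multiplied by $2$, which pulls the computed control point toward it. The projection vector is subtracted to correct the error caused by $P_2$ not lying in the plane defined by the normal at $P_1$, so that the control point fits that plane and distortion is reduced. Finally everything is divided by 3 to normalize.
Next compute the average Bezier control point $E$, the mean position of the six edge control points, which represents where the boundary control points are concentrated. Then compute $V$, the average position of the triangle's vertices. Finally take half of the difference between the two averages, $(E - V) / 2$, and add it to $E$. That is the tenth and last control point.
To sum up: the first three control points are the triangle's vertex positions (so they do not need to be stored in the struct), six are computed with the weights, and the last one combines averages of the earlier ones. The code is simple to write.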
```
struct TessellationFactors {
    float edge[3] : SV_TessFactor;
    float inside : SV_InsideTessFactor;
    float3 bezierPoints[7] : BEZIERPOS;
};

//Bezier control point calculations
float3 CalculateBezierControlPoint(float3 p0PositionWS, float3 aNormalWS, float3 p1PositionWS, float3 bNormalWS) {
    float w = dot(p1PositionWS - p0PositionWS, aNormalWS);
    return (p0PositionWS * 2 + p1PositionWS - w * aNormalWS) / 3.0;
}

void CalculateBezierControlPoints(inout float3 bezierPoints[7],
    float3 p0PositionWS, float3 p0NormalWS, float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
    bezierPoints[0] = CalculateBezierControlPoint(p0PositionWS, p0NormalWS, p1PositionWS, p1NormalWS);
    bezierPoints[1] = CalculateBezierControlPoint(p1PositionWS, p1NormalWS, p0PositionWS, p0NormalWS);
    bezierPoints[2] = CalculateBezierControlPoint(p1PositionWS, p1NormalWS, p2PositionWS, p2NormalWS);
    bezierPoints[3] = CalculateBezierControlPoint(p2PositionWS, p2NormalWS, p1PositionWS, p1NormalWS);
    bezierPoints[4] = CalculateBezierControlPoint(p2PositionWS, p2NormalWS, p0PositionWS, p0NormalWS);
    bezierPoints[5] = CalculateBezierControlPoint(p0PositionWS, p0NormalWS, p2PositionWS, p2NormalWS);
    float3 avgBezier = 0;
    [unroll] for (int i = 0; i < 6; i++) {
        avgBezier += bezierPoints[i];
    }
    avgBezier /= 6.0;
    float3 avgControl = (p0PositionWS + p1PositionWS + p2PositionWS) / 3.0;
    bezierPoints[6] = avgBezier + (avgBezier - avgControl) / 2.0;
}

// The patch constant function runs once per triangle, or "patch"
// It runs in parallel to the hull function
TessellationFactors PatchConstantFunction(
    InputPatch<TessellationControlPoint, 3> patch) {
    ...
    TessellationFactors f = (TessellationFactors)0;
    // Check if this patch should be culled (it is out of view)
    if (ShouldClipPatch(...)) {
        ...
    } else {
        ...
        CalculateBezierControlPoints(f.bezierPoints, patch[0].positionWS, patch[0].normalWS, 
          patch[1].positionWS, patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
    }
    return f;
}
```
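The control-point formulas can be sanity-checked on the CPU. The Python sketch below follows the same order as CalculateBezierControlPoints; the function names and the sample flat triangle are assumptions made for illustration. For a flat triangle with identical normals, every $w_{ij}$ is 0 and the center point should coincide with the centroid.

```python
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))


def bezier_control_point(p_a, n_a, p_b):
    """(2 * P_a + P_b - w * N_a) / 3 with w = (P_b - P_a) . N_a."""
    w = dot(sub(p_b, p_a), n_a)
    return tuple((2 * a + b - w * n) / 3 for a, b, n in zip(p_a, p_b, n_a))


def pn_control_points(p, n):
    """Return the six edge control points followed by the center point."""
    edge_points = [
        bezier_control_point(p[0], n[0], p[1]),  # b210
        bezier_control_point(p[1], n[1], p[0]),  # b120
        bezier_control_point(p[1], n[1], p[2]),  # b021
        bezier_control_point(p[2], n[2], p[1]),  # b012
        bezier_control_point(p[2], n[2], p[0]),  # b102
        bezier_control_point(p[0], n[0], p[2]),  # b201
    ]
    e = tuple(sum(c) / 6 for c in zip(*edge_points))  # average of edge points
    v = tuple(sum(c) / 3 for c in zip(*p))            # average of the corners
    center = tuple(ei + (ei - vi) / 2 for ei, vi in zip(e, v))  # b111
    return edge_points + [center]


positions = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
normals = [(0.0, 0.0, 1.0)] * 3
points = pn_control_points(positions, normals)
print(points[0], points[6])
```

On this flat input the edge points sit at the one-third marks along each edge, so the patch reproduces the flat triangle, which is a useful baseline before trying tilted normals.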
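Next, the domain function uses the ten control points (the three patch vertices plus the seven points output by the Patch Constant Function). Following the formula from the paper, it computes the final position on the cubic Bezier surface. The result is then interpolated with the flat position, with the blend amount exposed in the material inspector.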
$\begin{aligned} & b: \quad R^2 \mapsto R^3, \quad \text { for } w=1-u-v, \quad u, v, w \geq 0 \\ & b(u, v)= \sum_{i+j+k=3} b_{i j k} \frac{3!}{i!j!k!} u^i v^j w^k \\ &= b_{300} w^3+b_{030} u^3+b_{003} v^3 \\ &+b_{210} 3 w^2 u+b_{120} 3 w u^2+b_{201} 3 w^2 v \\ &+b_{021} 3 u^2 v+b_{102} 3 w v^2+b_{012} 3 u v^2 \\ &+b_{111} 6 w u v . \end{aligned}$
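A quick way to convince yourself that the ten weights in the expanded formula are sensible is to check that they always sum to 1 (they are the terms of $(u+v+w)^3$). The Python sketch below does that on the CPU; the function name is made up for this example.

```python
def bezier_triangle_weights(u, v):
    """Weights of the ten control points in the expanded formula."""
    w = 1.0 - u - v
    return {
        "b300": w ** 3, "b030": u ** 3, "b003": v ** 3,
        "b210": 3 * w * w * u, "b120": 3 * w * u * u, "b201": 3 * w * w * v,
        "b021": 3 * u * u * v, "b102": 3 * w * v * v, "b012": 3 * u * v * v,
        "b111": 6 * w * u * v,
    }


for u, v in [(0.0, 0.0), (1 / 3, 1 / 3), (0.2, 0.5)]:
    weights = bezier_triangle_weights(u, v)
    print(u, v, sum(weights.values()))
```

At a corner such as u = v = 0 only the b300 weight is nonzero, so the surface passes through the triangle's original vertices.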
```
// Barycentric interpolation as a function
float3 BarycentricInterpolate(float3 bary, float3 a, float3 b, float3 c) {
    return bary.x * a + bary.y * b + bary.z * c;
}

float3 CalculateBezierPosition(float3 bary, float smoothing, float3 bezierPoints[7],
    float3 p0PositionWS, float3 p1PositionWS, float3 p2PositionWS) {
    float3 flatPositionWS = BarycentricInterpolate(bary, p0PositionWS, p1PositionWS, p2PositionWS);
    float3 smoothedPositionWS =
        p0PositionWS * (bary.x * bary.x * bary.x) +
        p1PositionWS * (bary.y * bary.y * bary.y) +
        p2PositionWS * (bary.z * bary.z * bary.z) +
        bezierPoints[0] * (3 * bary.x * bary.x * bary.y) +
        bezierPoints[1] * (3 * bary.y * bary.y * bary.x) +
        bezierPoints[2] * (3 * bary.y * bary.y * bary.z) +
        bezierPoints[3] * (3 * bary.z * bary.z * bary.y) +
        bezierPoints[4] * (3 * bary.z * bary.z * bary.x) +
        bezierPoints[5] * (3 * bary.x * bary.x * bary.z) +
        bezierPoints[6] * (6 * bary.x * bary.y * bary.z);
    return lerp(flatPositionWS, smoothedPositionWS, smoothing);
}

// The domain function runs once per vertex in the final, tessellated mesh
// Use it to reposition vertices and prepare for the fragment stage
[domain("tri")] // Signal we're inputting triangles
Interpolators Domain(
    TessellationFactors factors, // The output of the patch constant function
    OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
    float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle

    Interpolators output;
    ...
    // Calculate tessellation smoothing multiplier
    float smoothing = _TessellationSmoothing;
#ifdef _TESSELLATION_SMOOTHING_VCOLORS
    smoothing *= BARYCENTRIC_INTERPOLATE(color.r); // Multiply by the vertex's red channel
#endif

    float3 positionWS = CalculateBezierPosition(barycentricCoordinates,
      smoothing, factors.bezierPoints, 
      patch[0].positionWS, patch[1].positionWS, patch[2].positionWS);
    float3 normalWS = BARYCENTRIC_INTERPOLATE(normalWS);
    float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
    ...
}
```
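As a quick sanity check of the evaluation above, here is a minimal CPU-side sketch in Python (not part of the shader). It mirrors the weights used in `CalculateBezierPosition`; the function names `bezier_triangle` and `weight_sum` are made up for this illustration. It shows that the ten weights always sum to 1 and that the patch passes through the original corners.

```python
def bezier_triangle(bary, corners, edge_points, center):
    """Evaluate a cubic Bezier triangle at barycentric coordinates `bary`.

    corners:     the 3 original vertex positions
    edge_points: the 6 edge control points (bezierPoints[0..5] in the shader)
    center:      the interior control point (bezierPoints[6] in the shader)
    """
    x, y, z = bary
    p0, p1, p2 = corners
    weighted = [
        (x ** 3, p0), (y ** 3, p1), (z ** 3, p2),
        (3 * x * x * y, edge_points[0]), (3 * y * y * x, edge_points[1]),
        (3 * y * y * z, edge_points[2]), (3 * z * z * y, edge_points[3]),
        (3 * z * z * x, edge_points[4]), (3 * x * x * z, edge_points[5]),
        (6 * x * y * z, center),
    ]
    return tuple(sum(w * p[i] for w, p in weighted) for i in range(3))


def weight_sum(bary):
    """Sum of the ten Bernstein weights; algebraically equal to (x + y + z) ** 3."""
    x, y, z = bary
    return (x ** 3 + y ** 3 + z ** 3
            + 3 * (x * x * y + y * y * x + y * y * z
                   + z * z * y + z * z * x + x * x * z)
            + 6 * x * y * z)
```

Because the weights sum to 1 whenever the barycentric coordinates do, the smoothed position is always a weighted average of the control points, which is why blending it with the flat position via `lerp` behaves predictably.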
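Comparison of the result with PN triangles disabled and enabled.
4.3 Improved PN triangles – outputting tessellated normals
The PN triangles implementation so far only changes vertex positions. We can also use the vertex normals to output normals that vary across the patch, which gives better lighting and reflections.
With the version above, the normals change very coarsely. As the figure (top) shows, the normals at the two ends of an edge may not capture how the normal of the underlying curved surface actually varies. We want the behavior in the figure (bottom), so the normal is interpolated quadratically to capture the curvature that may exist inside a single patch.
Because the surface is a cubic Bézier patch, the normals are interpolated as a quadratic Bézier patch, which needs three extra normal control points. TheTus's article already explains this clearly; see reference 10 for the detailed math.
Here is a brief outline of how the tessellated normal direction is obtained.
First take the normals at the two endpoints A and B of an edge and average them.
Construct the plane that passes through the midpoint of segment AB and is perpendicular to it.
Reflect the averaged normal across that plane.
Do this for every edge, giving three control normals.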
```
struct TessellationFactors {
    float edge[3] : SV_TessFactor;
    float inside : SV_InsideTessFactor;
    float3 bezierPoints[10] : BEZIERPOS;
};

float3 CalculateBezierControlNormal(float3 p0PositionWS, float3 aNormalWS, float3 p1PositionWS, float3 bNormalWS) {
    float3 d = p1PositionWS - p0PositionWS;
    float v = 2 * dot(d, aNormalWS + bNormalWS) / dot(d, d);
    return normalize(aNormalWS + bNormalWS - v * d);
}

void CalculateBezierNormalPoints(inout float3 bezierPoints[10],
    float3 p0PositionWS, float3 p0NormalWS, float3 p1PositionWS, float3 p1NormalWS, float3 p2PositionWS, float3 p2NormalWS) {
    bezierPoints[7] = CalculateBezierControlNormal(p0PositionWS, p0NormalWS, p1PositionWS, p1NormalWS);
    bezierPoints[8] = CalculateBezierControlNormal(p1PositionWS, p1NormalWS, p2PositionWS, p2NormalWS);
    bezierPoints[9] = CalculateBezierControlNormal(p2PositionWS, p2NormalWS, p0PositionWS, p0NormalWS);
}

// The patch constant function runs once per triangle, or "patch"
// It runs in parallel to the hull function
TessellationFactors PatchConstantFunction(
    InputPatch<TessellationControlPoint, 3> patch) {
    ...
    TessellationFactors f = (TessellationFactors)0;
    // Check if this patch should be culled (it is out of view)
    if (ShouldClipPatch(...)) {
        ...
    } else {
        ...
        CalculateBezierControlPoints(f.bezierPoints, 
          patch[0].positionWS, patch[0].normalWS, patch[1].positionWS, 
          patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
        CalculateBezierNormalPoints(f.bezierPoints, 
          patch[0].positionWS, patch[0].normalWS, patch[1].positionWS, 
          patch[1].normalWS, patch[2].positionWS, patch[2].normalWS);
    }
    return f;
}
```
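The reflection step can be checked on the CPU with a small Python sketch. It follows the same arithmetic as `CalculateBezierControlNormal` above; `control_normal` and `dot` are hypothetical helper names for this illustration only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def control_normal(p0, n0, p1, n1):
    """Reflect the summed endpoint normals across the plane perpendicular to the edge."""
    d = [b - a for a, b in zip(p0, p1)]      # edge direction p1 - p0
    s = [a + b for a, b in zip(n0, n1)]      # summed normals (unnormalized average)
    v = 2 * dot(d, s) / dot(d, d)            # twice the projection of s onto the edge
    r = [si - v * di for si, di in zip(s, d)]
    length = math.sqrt(dot(r, r))
    return [ri / length for ri in r]
```

On a flat edge (both normals equal and perpendicular to the edge) the control normal equals the vertex normal. When the two normals differ, the component along the edge is flipped, which is what lets the quadratic interpolation bulge in the middle of the edge.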
Note that every interpolated normal vector must be normalized.
```
float3 CalculateBezierNormal(float3 bary, float3 bezierPoints[10],
    float3 p0NormalWS, float3 p1NormalWS, float3 p2NormalWS) {
    return p0NormalWS * (bary.x * bary.x) +
        p1NormalWS * (bary.y * bary.y) +
        p2NormalWS * (bary.z * bary.z) +
        bezierPoints[7] * (2 * bary.x * bary.y) +
        bezierPoints[8] * (2 * bary.y * bary.z) +
        bezierPoints[9] * (2 * bary.z * bary.x);
}

float3 CalculateBezierNormalWithSmoothFactor(float3 bary, float smoothing, float3 bezierPoints[10],
    float3 p0NormalWS, float3 p1NormalWS, float3 p2NormalWS) {
    float3 flatNormalWS = BarycentricInterpolate(bary, p0NormalWS, p1NormalWS, p2NormalWS);
    float3 smoothedNormalWS = CalculateBezierNormal(bary, bezierPoints, p0NormalWS, p1NormalWS, p2NormalWS);
    return normalize(lerp(flatNormalWS, smoothedNormalWS, smoothing));
}

// The domain function runs once per vertex in the final, tessellated mesh
// Use it to reposition vertices and prepare for the fragment stage
[domain("tri")] // Signal we're inputting triangles
Interpolators Domain(
    TessellationFactors factors, // The output of the patch constant function
    OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
    float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle

    Interpolators output;
    ...
    // Calculate tessellation smoothing multiplier
    float smoothing = _TessellationSmoothing;
    float3 positionWS = CalculateBezierPosition(barycentricCoordinates, smoothing, factors.bezierPoints, patch[0].positionWS, patch[1].positionWS, patch[2].positionWS);
    float3 normalWS = CalculateBezierNormalWithSmoothFactor(
        barycentricCoordinates, smoothing, factors.bezierPoints,
        patch[0].normalWS, patch[1].normalWS, patch[2].normalWS);
    float3 tangentWS = BARYCENTRIC_INTERPOLATE(tangentWS.xyz);
    ...
}
```
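One more thing to watch: once the interpolated normal is used, the tangent that was paired with the original normal is no longer orthogonal to it. To restore orthogonality, a new tangent vector has to be computed.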
```
void CalculateBezierNormalAndTangent(
    float3 bary, float smoothing, float3 bezierPoints[10],
    float3 p0NormalWS, float3 p0TangentWS, 
    float3 p1NormalWS, float3 p1TangentWS, 
    float3 p2NormalWS, float3 p2TangentWS,
    out float3 normalWS, out float3 tangentWS) {

    float3 flatNormalWS = BarycentricInterpolate(bary, p0NormalWS, p1NormalWS, p2NormalWS);
    float3 smoothedNormalWS = CalculateBezierNormal(bary, bezierPoints, p0NormalWS, p1NormalWS, p2NormalWS);
    normalWS = normalize(lerp(flatNormalWS, smoothedNormalWS, smoothing));

    float3 flatTangentWS = BarycentricInterpolate(bary, p0TangentWS, p1TangentWS, p2TangentWS);
    float3 flatBitangentWS = cross(flatNormalWS, flatTangentWS);
    tangentWS = normalize(cross(flatBitangentWS, normalWS));
}

[domain("tri")] // Signal we're inputting triangles
Interpolators Domain(
    TessellationFactors factors, // The output of the patch constant function
    OutputPatch<TessellationControlPoint, 3> patch, // The Input triangle
    float3 barycentricCoordinates : SV_DomainLocation) { // The barycentric coordinates of the vertex on the triangle
    ...
    float3 normalWS, tangentWS;
    CalculateBezierNormalAndTangent(
        barycentricCoordinates, smoothing, factors.bezierPoints,
        patch[0].normalWS, patch[0].tangentWS.xyz, 
        patch[1].normalWS, patch[1].tangentWS.xyz, 
        patch[2].normalWS, patch[2].tangentWS.xyz,
        normalWS, tangentWS);
    ...
}
```
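The two cross products in `CalculateBezierNormalAndTangent` can be verified with a short Python sketch. The helper names below are made up for this illustration; the logic mirrors the shader: build a bitangent from the flat frame, then cross it with the new normal.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def reorthogonalize_tangent(flat_normal, flat_tangent, new_normal):
    """Return a unit tangent perpendicular to new_normal, close to flat_tangent."""
    flat_bitangent = cross(flat_normal, flat_tangent)
    return normalize(cross(flat_bitangent, new_normal))
```

The result is orthogonal to the new normal by construction, because a cross product is always perpendicular to both of its inputs.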
References
1. https://www.youtube.com/watch?v=63ufydgBcIk
2. https://nedmakesgames.medium.com/mastering-tessellation-shaders-and-their-many-uses-in-unity-9caeb760150e
3. https://zhuanlan.zhihu.com/p/148247621
4. https://zhuanlan.zhihu.com/p/124235713
5. https://zhuanlan.zhihu.com/p/141099616
6. https://zhuanlan.zhihu.com/p/42550699
7. https://en.wikipedia.org/wiki/Barycentric_coordinate_system
8. https://zhuanlan.zhihu.com/p/359999755
9. https://zhuanlan.zhihu.com/p/629364817
10. https://zhuanlan.zhihu.com/p/629202115
11. https://perso.telecom-paristech.fr/boubek/papers/PhongTessellation/PhongTessellation.pdf
12. http://alex.vlachos.com/graphics/CurvedPNTriangles.pdf
2024-06-25
Unity interactive, cuttable grass-sea rendering with an octree – geometry and compute shaders (BIRP/URP)
The project (BIRP) is on GitHub:
https://github.com/Remyuu/Unity-Interactive-Grass
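To start, here is a screenshot of 100,500 grass blades running on a compute shader, with no optimization at all, on my M1 Pro at a little over 200 FPS.
After adding octree frustum culling, distance fading and so on, the frame rate actually became less stable (painful). My guess is that the per-frame CPU workload is too heavy, since it has to maintain this much grass data. As long as enough is culled, though, 700+ FPS is no problem (some consolation). The octree depth also needs tuning for the actual scene; in the image below I set it to 5.
Preface
This article keeps getting longer. It is mainly a set of notes for my own review, so experienced readers may find much of it basic. I am a complete beginner and welcome discussion and corrections.
The article has two stages:
- Implement the most basic grass rendering with GS + TS
- Then re-implement the grass sea with CS and add various optimizations
Rendering with a geometry shader plus a tessellation shader is relatively simple, but its performance ceiling is low and its platform compatibility is poor.
Compute shaders combined with GPU instancing are probably the mainstream approach in the industry today, and they also run well on mobile.
The CS grass demo in this article is mainly based on the implementations by Colin and Minions Art, and is something like a lower-grade hybrid of the two (the former has already been analyzed on Zhihu in a set of study notes on GPU-instance-based grass rendering). It uses three groups of ComputeBuffers: one holding all the grass, one append buffer that is handed to the Material, and one visibility buffer (rebuilt in real time from frustum culling). Space is partitioned with a combined quadtree/octree (chosen by odd or even depth). Frustum culling yields the indices of all grass inside the current frustum, which are passed to the compute shader for further processing (mesh generation, quaternion rotation, LoD and so on). A variable-length ComputeBuffer (ComputeBufferType.Append) then passes the grass to be rendered to the Material through instancing for the final draw.
Hi-Z culling is another option. I am leaving that as a to-do while I keep studying.
I also followed Minions Art's article to recreate an in-editor grass painting tool (an incomplete version), which maintains a vertex list storing the positions of all grass.
Going one step further, a separate Cut Buffer is maintained. Grass marked with -1 is left untouched. If it is marked with a cut height other than -1, that value is passed to the Material, where WorldPos + Split.y together with a lerp makes the upper part of the blade invisible and changes the blade color. Adding some grass-clipping particle effects completes the cutting effect.
The swan song of GS
The previous article covered in detail what tessellation shaders are and the various optimization methods. Here tessellation is put to practical use. In addition, using the compute shader knowledge I crammed in a few days, I put together compute-shader-based grass; see the corresponding notes for details. The following effects are implemented in this article, with full code attached: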
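- Grass rendering
- Grass rendering – geometry shader (BIRP/URP)
- Configurable blade width, height, facing, bend, curvature, gradient color and normals
- INTEGER tessellation
- Visibility Map added for URP
- Grass rendering – Compute Shader (BIRP/URP), works on macOS
- Octree frustum culling
- Distance fading
- Grass interaction
- Interactive geometry shader (BIRP/URP)
- Interactive Compute Shader (BIRP), works on macOS
- Custom Unity grass generation tool
- Grass cutting system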
Main reference articles (borrowed from heavily):
- Grass with a geometry shader (BIRP): https://roystan.net/articles/grass-shader/
- Grass with a geometry shader (URP): https://danielilett.com/2021-08-24-tut5-17-stylised-grass/
- Compute shader tutorial 1: https://catlikecoding.com/unity/tutorials/basics/compute-shaders/
- Compute shader tutorial 2: https://medium.com/ericzhan-publication/shader筆記-初探compute-shader-9efeebd579c1
- Grass with a compute shader: https://www.patreon.com/posts/53587750
- Grass painting tool integration: https://www.youtube.com/watch?v=xKJHL8nQiuM
- Interactive geometry shader grass (BIRP): https://www.patreon.com/posts/40090373
- Interactive geometry shader grass (URP): https://www.patreon.com/posts/47447321
- Interactive compute shader grass (BIRP/URP): https://www.patreon.com/posts/wip-patron-only-83683483
- Ned's reference: https://www.youtube.com/watch?v=DeATXF4Szqo
- URP compute shader grass reference code: https://github.com/ColinLeung-NiloCat/UnityURP-MobileDrawMeshInstancedIndirectExample
- Compute shader reference code: https://github.com/ellioman/Indirect-Rendering-With-Compute-Shaders
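There are many ways to render grass. This article covers two:
- Geometry shader + tessellation shader
- Compute shader + GPU instancing
First, the first approach is very limited. Many mobile devices, as well as Metal, do not support GS, and a GS regenerates the mesh every frame, which is fairly expensive.
Second, does that mean macOS cannot run geometry shaders at all? Not quite. To use GS you have to use OpenGL instead of Metal. Note, however, that Apple's OpenGL support stops at 4.1, and compute shaders only arrived in OpenGL 4.3, so OpenGL on macOS can run GS but not CS. On M-series chips the choice is either OpenGL 4.1 or Metal. On my M1 Pro MacBook Pro, even a virtual machine does not help (Parallels 18+ provides DX11 and Vulkan): Vulkan on macOS is translated and is still Metal underneath, so there is still no GS. From the M1 onward, macOS therefore has no native GS.
Furthermore, Metal does not even support tessellation shaders directly. Apple simply does not want to support these two stages on its chips, because they are too inefficient. On M-series chips, TS is even emulated with CS!
To sum up, the geometry shader is a dead-end technology, especially now that mesh shaders exist. GS is popular in Unity, but any similar effect can be produced with CS plus instancing, and more efficiently. New GPUs still support GS because quite a few games on the market still use it; Apple simply chose not to care about compatibility and dropped it.
DX11 on macOS
This article explains in detail why GS is so slow: http://www.joshbarczak.com/blog/?p=667. In short, Intel optimized GS by means such as blocking threads, while other chips have no such optimization.
These are study notes and may well contain mistakes.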
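I. Overview of rendering grass with a geometry shader (BIRP)
This chapter is a condensed summary of Roystan's tutorial. If you need the project files or the final code, download them from the original article, or read the article by 苏格拉没有底.
1.1 Overview
After the domain stage, a geometry shader can optionally be used.
A geometry shader takes a whole primitive as input and can generate vertices on output. Its input is the vertices of a complete primitive (three vertices for a triangle, two for a line, one for a point). The geometry shader is invoked once per primitive.
Download the starter project from the web page.
1.2 Drawing a triangle
Draw a single triangle.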
```
// Add inside the CGINCLUDE block.
struct geometryOutput
{
    float4 pos : SV_POSITION;
};

...
    // vertex shader
return vertex;
...

[maxvertexcount(3)]
void geo(triangle float4 IN[3] : SV_POSITION, inout TriangleStream<geometryOutput> triStream)
{
    geometryOutput o;

    o.pos = UnityObjectToClipPos(float4(0.5, 0, 0, 1));
    triStream.Append(o);

    o.pos = UnityObjectToClipPos(float4(-0.5, 0, 0, 1));
    triStream.Append(o);

    o.pos = UnityObjectToClipPos(float4(0, 1, 0, 1));
    triStream.Append(o);
}

…

// Add inside the SubShader Pass, just below the #pragma fragment frag line.
#pragma geometry geo
```
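In effect we draw one triangle for every vertex in the mesh, but the positions assigned to the triangle's vertices are constant. They do not change per input vertex, so all the triangles are stacked on top of one another.
1.3 Vertex offset
So simply offset the triangle by each input vertex's position.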
```
// Add to the top of the geometry shader.
float3 pos = IN[0];

…

// Update each assignment of o.pos.
o.pos = UnityObjectToClipPos(pos + float3(0.5, 0, 0));

…

o.pos = UnityObjectToClipPos(pos + float3(-0.5, 0, 0));

…

o.pos = UnityObjectToClipPos(pos + float3(0, 1, 0));
```
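1.4 Rotating the blades
Note that all triangles currently point in the same direction, so a correction based on the normal is added: build a TBN matrix and multiply the offset by it. The code is also tidied up.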
```
float3 vNormal = IN[0].normal;
float4 vTangent = IN[0].tangent;
float3 vBinormal = cross(vNormal, vTangent) * vTangent.w;

float3x3 tangentToLocal = float3x3(
    vTangent.x, vBinormal.x, vNormal.x,
    vTangent.y, vBinormal.y, vNormal.y,
    vTangent.z, vBinormal.z, vNormal.z
    );

triStream.Append(VertexOutput(pos + mul(tangentToLocal, float3(0.5, 0, 0))));
triStream.Append(VertexOutput(pos + mul(tangentToLocal, float3(-0.5, 0, 0))));
triStream.Append(VertexOutput(pos + mul(tangentToLocal, float3(0, 0, 1))));
```
1.5 Coloring
Then define a bottom and a top color for the grass and blend between them with a lerp on the UV.
```
return lerp(_BottomColor, _TopColor, i.uv.y);
```
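1.6 How the rotation matrix works
Give the blades a random facing. A rotation matrix is built for this. The principle is also covered in GAMES101, and there is a Bilibili video with a very clear derivation. The idea of the derivation, briefly: suppose vector $a$ is rotated about axis $n$ into $b$. Decompose $a$ into a component parallel to $n$ (which turns out to be unchanged) plus a component perpendicular to $n$.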
```
float3x3 AngleAxis3x3(float angle, float3 axis)
{
    float c, s;
    sincos(angle, s, c);

    float t = 1 - c;
    float x = axis.x;
    float y = axis.y;
    float z = axis.z;

    return float3x3(
        t * x * x + c, t * x * y - s * z, t * x * z + s * y,
        t * x * y + s * z, t * y * y + c, t * y * z - s * x,
        t * x * z - s * y, t * y * z + s * x, t * z * z + c
        );
}
```
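The rotation matrix $R$ is computed here with Rodrigues' rotation formula: $R = I + \sin(\theta)\,[k]_{\times} + (1-\cos(\theta))\,[k]_{\times}^{2}$
Here $\theta$ is the rotation angle, $k$ is the unit rotation axis, $I$ is the identity matrix, and $[k]_{\times}$ is the skew-symmetric matrix of the axis $k$.
For a unit vector $k=(x,y,z)$, the skew-symmetric matrix is $[k]_{\times}=\left[\begin{array}{ccc} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{array}\right]$. With $c=\cos\theta$, $s=\sin\theta$ and $t=1-c$, as in the code, the resulting matrix entries are:
$R=\left[\begin{array}{ccc} tx^2 + c & txy - sz & txz + sy \\ txy + sz & ty^2 + c & tyz - sx \\ txz - sy & tyz + sx & tz^2 + c \end{array}\right]$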
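To convince yourself that the expanded matrix really is Rodrigues' formula, the two can be compared numerically. The following is a small Python sketch, not shader code: `angle_axis_3x3` transcribes the HLSL function of the same name, while `rodrigues` and `max_abs_difference` are helper names invented for this check.

```python
import math


def angle_axis_3x3(angle, axis):
    """Explicit matrix, transcribed from the HLSL AngleAxis3x3."""
    c, s = math.cos(angle), math.sin(angle)
    t = 1 - c
    x, y, z = axis
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]


def rodrigues(angle, axis):
    """R = I + sin(theta) K + (1 - cos(theta)) K^2, with K the skew matrix of axis."""
    x, y, z = axis
    k = [[0, -z, y], [z, 0, -x], [-y, x, 0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    s, t = math.sin(angle), 1 - math.cos(angle)
    return [[(1 if i == j else 0) + s * k[i][j] + t * k2[i][j]
             for j in range(3)] for i in range(3)]


def max_abs_difference(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))
```

For a unit axis the two matrices agree to floating-point precision. The axis must be normalized; the expansion relies on $x^2+y^2+z^2=1$.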
```
float3x3 facingRotationMatrix = AngleAxis3x3(rand(pos) * UNITY_TWO_PI, float3(0, 0, 1));
```
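1.7 Bending the blades
The blades now face random directions. Next, tip each blade over by a random amount; the code below rotates about the negative x axis.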
```
float3x3 bendRotationMatrix = AngleAxis3x3(rand(pos.zzx) * _BendRotationRandom * UNITY_PI * 0.5, float3(-1, 0, 0));
```
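1.8 Blade size
Adjust the width and height of the blades. Until now both were one unit by default. Adding rand to this step as well makes the grass look more natural.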
```
_BladeWidth("Blade Width", Float) = 0.05
_BladeWidthRandom("Blade Width Random", Float) = 0.02
_BladeHeight("Blade Height", Float) = 0.5
_BladeHeightRandom("Blade Height Random", Float) = 0.3


float height = (rand(pos.zyx) * 2 - 1) * _BladeHeightRandom + _BladeHeight;
float width = (rand(pos.xzy) * 2 - 1) * _BladeWidthRandom + _BladeWidth;


triStream.Append(VertexOutput(pos + mul(transformationMatrix, float3(width, 0, 0)), float2(0, 0)));
triStream.Append(VertexOutput(pos + mul(transformationMatrix, float3(-width, 0, 0)), float2(1, 0)));
triStream.Append(VertexOutput(pos + mul(transformationMatrix, float3(0, 0, height)), float2(0.5, 1)));
```
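1.9 Tessellation
The mesh yields too few blades, so tessellation is applied here.
1.10 Wind distortion
To make the grass move, add a perturbation that varies with _Time: sample a texture, compute a wind rotation matrix from it, and apply it to the blade.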
```
float2 uv = pos.xz * _WindDistortionMap_ST.xy + _WindDistortionMap_ST.zw + _WindFrequency * _Time.y;

float2 windSample = (tex2Dlod(_WindDistortionMap, float4(uv, 0, 0)).xy * 2 - 1) * _WindStrength;

float3 wind = normalize(float3(windSample.x, windSample.y, 0));

float3x3 windRotation = AngleAxis3x3(UNITY_PI * windSample, wind);

float3x3 transformationMatrix = mul(mul(mul(tangentToLocal, windRotation), facingRotationMatrix), bendRotationMatrix);
```
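1.11 Fixing the blade rotation problem
At this point the wind can also rotate a blade about the x and y axes, which displaces the base of the blade as well.
Give the two base vertices their own matrix that only rotates about z.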
```
float3x3 transformationMatrixFacing = mul(tangentToLocal, facingRotationMatrix);

…

triStream.Append(VertexOutput(pos + mul(transformationMatrixFacing, float3(width, 0, 0)), float2(0, 0)));
triStream.Append(VertexOutput(pos + mul(transformationMatrixFacing, float3(-width, 0, 0)), float2(1, 0)));
```
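1.12 Blade curvature
To give a blade curvature, more vertices are needed. Since double-sided rendering is enabled, vertex winding order no longer matters. The triangles are built by interpolating manually in a for loop, and a forward value is computed to bend the blade.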
```
float forward = rand(pos.yyz) * _BladeForward;


for (int i = 0; i < BLADE_SEGMENTS; i++)
{
    float t = i / (float)BLADE_SEGMENTS;
    // Add below the line declaring float t.
    float segmentHeight = height * t;
    float segmentWidth = width * (1 - t);
    float segmentForward = pow(t, _BladeCurve) * forward;
    float3x3 transformMatrix = i == 0 ? transformationMatrixFacing : transformationMatrix;
    triStream.Append(GenerateGrassVertex(pos, segmentWidth, segmentHeight, segmentForward, float2(0, t), transformMatrix));
    triStream.Append(GenerateGrassVertex(pos, -segmentWidth, segmentHeight, segmentForward, float2(1, t), transformMatrix));
}

triStream.Append(GenerateGrassVertex(pos, 0, height, forward, float2(0.5, 1), transformationMatrix));
```
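The loop above is easier to picture as a list of points. Here is a minimal Python sketch of the blade profile it produces, under the assumption (not shown in this section) that `GenerateGrassVertex` places each vertex at (width, forward, height) in tangent space before the matrix is applied. `blade_vertices` is a hypothetical name.

```python
def blade_vertices(width, height, forward, curve, segments):
    """Tangent-space (x, forward, up) points of one blade, in strip order.

    Assumes each vertex sits at (width, forward, height) before transformation,
    which is an assumption about GenerateGrassVertex, not something shown above.
    """
    verts = []
    for i in range(segments):
        t = i / segments
        seg_w = width * (1 - t)          # blade narrows toward the tip
        seg_h = height * t               # evenly spaced heights
        seg_f = (t ** curve) * forward   # forward bend grows with height
        verts.append((seg_w, seg_f, seg_h))
        verts.append((-seg_w, seg_f, seg_h))
    verts.append((0.0, forward, height))  # single tip vertex
    return verts
```

With `segments = 3` this yields 2 * 3 + 1 = 7 vertices, so the geometry shader's `maxvertexcount` has to be at least that large for the chosen `BLADE_SEGMENTS`.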
1.13 Casting shadows
Shadows are cast from a separate pass.
```
Pass{
    Tags{
        "LightMode" = "ShadowCaster"
    }

    CGPROGRAM
    #pragma vertex vert
    #pragma geometry geo
    #pragma fragment frag
    #pragma hull hull
    #pragma domain domain
    #pragma target 4.6
    #pragma multi_compile_shadowcaster

    float4 frag(geometryOutput i) : SV_Target{
        SHADOW_CASTER_FRAGMENT(i)
    }

    ENDCG
}
```
1.14 Receiving shadows
In the fragment shader, use SHADOW_ATTENUATION directly to determine shadowing.
```
// geometryOutput struct.
unityShadowCoord4 _ShadowCoord : TEXCOORD1;
...
o._ShadowCoord = ComputeScreenPos(o.pos);
...
#pragma multi_compile_fwdbase
...
return SHADOW_ATTENUATION(i);
```
1.15 Removing shadow acne
Remove the shadow acne on the surface.
```
#if UNITY_PASS_SHADOWCASTER
    o.pos = UnityApplyLinearShadowBias(o.pos);
#endif
```
1.16 Adding normals
Add normal information to the vertices generated by the geometry shader.
```
struct geometryOutput
{
    float4 pos : SV_POSITION;
    float2 uv : TEXCOORD0;
    unityShadowCoord4 _ShadowCoord : TEXCOORD1;
    float3 normal : NORMAL;
};
...
o.normal = UnityObjectToWorldNormal(normal);
```
1.17 Full code ‼️ (BIRP)
Final result.
Code:
https://pastebin.com/8u1ytGgU
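Full version: https://pastebin.com/U14m1Nu0
II. Rendering grass with a geometry shader (URP)
2.1 References
The BIRP version is done, so now it only needs to be ported.
- URP shader code conventions: https://www.cyanilux.com/tutorials/urp-shader-code/
- BIRP->URP cheat sheet: https://cuihongzhi1991.github.io/blog/2020/05/27/builtinttourp/
You can follow Daniel's article and write it from scratch, or follow along as I modify the code above. Note that the space-transformation code in the original repo has a problem; a fix can be found in its pull requests.
Now gather the BIRP tessellation shader from above in one place and port it:
- Change the Tags to URP
- Replace the included headers with the URP versions
- Wrap variables in a CBuffer
- Shadow casting and receiving code
2.2 Making the changes
Declare the URP pipeline.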
```
LOD 100
Cull Off
Pass{
    Tags{
        "RenderType" = "Opaque"
        "Queue" = "Geometry"
        "RenderPipeline" = "UniversalPipeline"
    }
```
Include the URP libraries.
```
#include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Core.hlsl"
#include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Lighting.hlsl"
#include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/ShaderVariablesFunctions.hlsl"

o._ShadowCoord = ComputeScreenPos(o.pos);
```
Swap out the function.
```
// o.normal = UnityObjectToWorldNormal(normal);
o.normal = TransformObjectToWorldNormal(normal);
```
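Receiving shadows in URP. This is best computed in the vertex shader, but for convenience everything is computed in the geometry shader here.
Then cast shadows with a ShadowCaster pass.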
```
Pass{
    Name "ShadowCaster"
    Tags{ "LightMode" = "ShadowCaster" }

    ZWrite On
    ZTest LEqual

    HLSLPROGRAM

        half4 frag(geometryOutput input) : SV_TARGET{
            return 1;
        }

    ENDHLSL
}
```
2.3 Full code ‼️ (URP)
https://pastebin.com/6KveEKMZ
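III. Improving the tessellation logic (BIRP/URP)
3.1 Tidying the code
So far only a fixed tessellation level has been used, which I cannot accept. If you are not familiar with how tessellation works, see my tessellation article, which covers several ways to optimize subdivision in detail.
I use the BIRP code completed in Part I as the example. The current version only has uniform tessellation.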
```
_TessellationUniform("Tessellation Uniform", Range(1, 64)) = 1
```
The structs output by the various stages are quite messy at the moment, so reorganize them.
3.1 Partitioning mode
```
[KeywordEnum(INTEGER, FRAC_EVEN, FRAC_ODD, POW2)] _PARTITIONING("Partition algorithm", Float) = 0

#pragma shader_feature_local _PARTITIONING_INTEGER _PARTITIONING_FRAC_EVEN _PARTITIONING_FRAC_ODD _PARTITIONING_POW2

#if defined(_PARTITIONING_INTEGER)
    [partitioning("integer")]
#elif defined(_PARTITIONING_FRAC_EVEN)
    [partitioning("fractional_even")]
#elif defined(_PARTITIONING_FRAC_ODD)
    [partitioning("fractional_odd")]
#elif defined(_PARTITIONING_POW2)
    [partitioning("pow2")]
#else 
    [partitioning("integer")]
#endif
```
3.2 Frustum culling for tessellation
In BIRP, _ProjectionParams.z is used for the far plane; in URP, UNITY_RAW_FAR_CLIP_VALUE is used.
```
bool IsOutOfBounds(float3 p, float3 lower, float3 higher) { // test against an axis-aligned box
    return p.x < lower.x || p.x > higher.x || p.y < lower.y || p.y > higher.y || p.z < lower.z || p.z > higher.z;
}
bool IsPointOutOfFrustum(float4 positionCS) { // clip-space frustum test
    float3 culling = positionCS.xyz;
    float w = positionCS.w;
    float3 lowerBounds = float3(-w, -w, -w * _ProjectionParams.z);
    float3 higherBounds = float3(w, w, w);
    return IsOutOfBounds(culling, lowerBounds, higherBounds);
}
bool ShouldClipPatch(float4 p0PositionCS, float4 p1PositionCS, float4 p2PositionCS) {
    bool allOutside = IsPointOutOfFrustum(p0PositionCS) &&
        IsPointOutOfFrustum(p1PositionCS) &&
        IsPointOutOfFrustum(p2PositionCS);
    return allOutside;
}

TessellationControlPoint vert(Attributes v)
{
    ...
    o.positionCS = UnityObjectToClipPos(v.vertex);
    ...
}

TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch)
{
    TessellationFactors f;
    if(ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS)){
        f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0;
    }else{
        f.edge[0] = _TessellationFactor;
        f.edge[1] = _TessellationFactor;
        f.edge[2] = _TessellationFactor;
        f.inside = _TessellationFactor;
    }
    return f;
}
```
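The per-vertex test is simple enough to try out on the CPU. Below is a minimal Python sketch of the same logic. The function names are transliterated from the shader, and `z_lower_scale` is a stand-in parameter (an assumption of this sketch) for whatever factor the pipeline uses on the lower z bound.

```python
def is_point_out_of_frustum(clip, z_lower_scale=1.0):
    """True if a clip-space point (x, y, z, w) lies outside the bounds."""
    x, y, z, w = clip
    lower = (-w, -w, -w * z_lower_scale)
    upper = (w, w, w)
    return any(v < lo or v > hi for v, lo, hi in zip((x, y, z), lower, upper))


def should_clip_patch(p0, p1, p2, z_lower_scale=1.0):
    """Cull only when all three vertices are outside."""
    return all(is_point_out_of_frustum(p, z_lower_scale) for p in (p0, p1, p2))
```

One caveat this makes easy to see: the test looks at vertices only. A triangle with corners at (-5, -5), (5, -5) and (0, 5) in clip space covers the whole screen, yet all three vertices are outside, so it is culled. For small ground triangles this rarely matters, but it is worth keeping in mind for large patches.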
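Note, however, that the test here uses the clip-space coordinates of the ground patch. If a ground triangle is entirely off screen but its grass is tall enough to still be visible, the grass will suddenly disappear, which is a visual bug. Whether this is acceptable depends on the project: with an elevated, downward-looking camera and fairly short grass, this culling can be used.
With an elevated view it is not much of a problem.
With a prone, ground-level view the grass is incomplete; it is over-culled.
3.3 Tessellation control by screen-space distance
This makes nearby grass dense and distant grass sparse, based on screen-space (clip-space) distance. The method is affected by resolution.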
```
float EdgeTessellationFactor(float scale, float4 p0PositionCS, float4 p1PositionCS) {
    float factor = distance(p0PositionCS.xyz / p0PositionCS.w, p1PositionCS.xyz / p1PositionCS.w) / scale;
    return max(1, factor);
}

TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch)
{
    TessellationFactors f;

    f.edge[0] = EdgeTessellationFactor(_TessellationFactor, 
        patch[1].positionCS, patch[2].positionCS);
    f.edge[1] = EdgeTessellationFactor(_TessellationFactor, 
        patch[2].positionCS, patch[0].positionCS);
    f.edge[2] = EdgeTessellationFactor(_TessellationFactor, 
        patch[0].positionCS, patch[1].positionCS);
    f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;


    #if defined(_CUTTESS_TRUE)
        if(ShouldClipPatch(patch[0].positionCS, patch[1].positionCS, patch[2].positionCS))
            f.edge[0] = f.edge[1] = f.edge[2] = f.inside = 0;
    #endif

    return f;
}
```
Tessellation Factor = 0.08
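Fractional partitioning modes are not recommended here; they cause strong jitter that is very hard on the eyes. I do not much like this method.
3.4 Tessellation by camera distance
Compute the ratio of the distance between two vertices to the distance from their midpoint to the camera (the code below divides by the squared distance). The larger the ratio, the more screen space the edge occupies and the more subdivision it needs.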
```
float EdgeTessellationFactor_WorldBase(float scale, float3 p0PositionWS, float3 p1PositionWS) {
    float length = distance(p0PositionWS, p1PositionWS);
    float distanceToCamera = distance(_WorldSpaceCameraPos, (p0PositionWS + p1PositionWS) * 0.5);
    float factor = length / (scale * distanceToCamera * distanceToCamera);
    return max(1, factor);
}
...
f.edge[0] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
    patch[1].vertex, patch[2].vertex);
f.edge[1] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
    patch[2].vertex, patch[0].vertex);
f.edge[2] = EdgeTessellationFactor_WorldBase(_TessellationFactor_WORLD_BASE, 
    patch[0].vertex, patch[1].vertex);
f.inside = (f.edge[0] + f.edge[1] + f.edge[2]) / 3.0;
```
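There is still room for improvement. To keep nearby grass from becoming too dense and to make the falloff at mid range smoother, introduce a non-linear term that controls how distance maps to the tessellation factor.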
```
float EdgeTessellationFactor_WorldBase(float scale, float3 p0PositionWS, float3 p1PositionWS) {
    float length = distance(p0PositionWS, p1PositionWS);
    float distanceToCamera = distance(_WorldSpaceCameraPos, (p0PositionWS + p1PositionWS) * 0.5);
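    // Use a square root to soften the effect of distance so the factor changes more smoothly at mid range
    float adjustedDistance = sqrt(distanceToCamera);
    // The influence of scale may need further tuning based on the actual result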
    float factor = length / (scale * adjustedDistance);
    return max(1, factor);
}
```
This works much better.
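To compare the two falloffs side by side, here is a small Python sketch of both versions of `EdgeTessellationFactor_WorldBase`. The function names and the sample values (scale 0.01, edge length 1) are made up for illustration.

```python
import math


def edge_factor_inverse_square(scale, edge_length, distance_to_camera):
    """First version: divide by the squared camera distance."""
    return max(1.0, edge_length / (scale * distance_to_camera ** 2))


def edge_factor_sqrt(scale, edge_length, distance_to_camera):
    """Second version: divide by the square root of the camera distance."""
    return max(1.0, edge_length / (scale * math.sqrt(distance_to_camera)))
```

With these sample values, at a distance of 50 units the inverse-square version has already hit the floor of 1, while the square-root version is still around 14. The square-root falloff therefore keeps a usable range of factors much further out, at the cost of needing a different `scale` to get the same density near the camera.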
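3.5 Controlling grass tessellation with a Visibility Map
The vertex shader samples the texture and passes the value to the tessellation stage, where the patch constant function (PCF) computes the subdivision.
Taking FIXED mode as the example: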
```
_VisibilityMap("Visibility Map", 2D) = "white" {}
TEXTURE2D (_VisibilityMap);SAMPLER(sampler_VisibilityMap);
struct Attributes
{
    ...
    float2 uv : TEXCOORD0;
};
struct TessellationControlPoint
{
    ...
    float visibility : TEXCOORD1;
};
TessellationControlPoint vert(Attributes v){
    ...
    float visibility = SAMPLE_TEXTURE2D_LOD(_VisibilityMap, sampler_VisibilityMap, v.uv, 0).r; 
    o.visibility    = visibility;
    ...
}
TessellationFactors patchConstantFunction (InputPatch<TessellationControlPoint, 3> patch){
    ...
    float averageVisibility = (patch[0].visibility + patch[1].visibility + patch[2].visibility) / 3; // average grayscale of the three vertices
    float baseTessellationFactor = _TessellationFactor_FIXED; 
    float tessellationMultiplier = lerp(0.1, 1.0, averageVisibility); // scale the factor by the average grayscale
    #if defined(_DYNAMIC_FIXED)
        f.edge[0] = _TessellationFactor_FIXED * tessellationMultiplier;
        f.edge[1] = _TessellationFactor_FIXED * tessellationMultiplier;
        f.edge[2] = _TessellationFactor_FIXED * tessellationMultiplier;
        f.inside  = _TessellationFactor_FIXED * tessellationMultiplier;
    ...
```
3.6 Full code ‼️ (BIRP)
Grass Shader:
https://pastebin.com/TD0AupGz
3.7 Full code ‼️ (URP)
URP differs in a few places. Computing the shadow bias, for example, has to be done as shown below. I will not go into detail; see the code.
```
#if UNITY_PASS_SHADOWCASTER
    // o.pos = UnityApplyLinearShadowBias(o.pos);
    o.shadowCoord = TransformWorldToShadowCoord(ApplyShadowBias(posWS, norWS, 0));
#endif
```
Grass Shader:
https://pastebin.com/2ZX2aVm9
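IV. Interactive grass
URP and BIRP are identical here.
4.1 Implementation steps
The idea is simple: a script passes in the character's world position, and the grass is bent according to a preset radius and interaction strength.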
```
uniform float3 _PositionMoving; // position of the interacting object
float _Radius; // interaction radius of the object
float _Strength; // interaction strength
```
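Inside the loop that generates the grass, compute the distance between each blade segment and the object, and adjust the grass position based on that distance.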
```
float dis = distance(_PositionMoving, posWS); // distance to the object
float radiusEffect = 1 - saturate(dis / _Radius); // falloff based on distance
float3 sphereDisp = pos - _PositionMoving; // offset from the object
sphereDisp *= radiusEffect * _Strength; // apply falloff and strength
sphereDisp = clamp(sphereDisp, -0.8, 0.8); // limit the maximum displacement
```
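The displacement math can be tried out in isolation. The following Python sketch reproduces the five shader lines above for a single point, treating `pos` and `posWS` as the same world-space position (an assumption of this sketch). `sphere_displacement` is a hypothetical name.

```python
import math


def sphere_displacement(pos, mover, radius, strength, limit=0.8):
    """Push `pos` away from `mover`, fading to zero at `radius`."""
    dist = math.dist(pos, mover)
    radius_effect = 1.0 - min(max(dist / radius, 0.0), 1.0)   # saturate
    return tuple(
        min(max((p - m) * radius_effect * strength, -limit), limit)
        for p, m in zip(pos, mover)
    )
```

A blade halfway into the radius gets half the offset, a blade outside the radius gets none, and the clamp keeps a large `strength` from throwing blades across the scene.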
Then compute the new position for each blade segment.
```
// apply the interaction
float3 newPos = i == 0 ? pos : pos + (sphereDisp * t);
triStream.Append(GenerateGrassVertex(newPos, segmentWidth, segmentHeight, segmentForward, float2(0, t), transformMatrix));
triStream.Append(GenerateGrassVertex(newPos, -segmentWidth, segmentHeight, segmentForward, float2(1, t), transformMatrix));
```
Do not forget the vertex outside the for loop, that is, the topmost one.
```
// final (tip) vertex of the blade
float3 newPosTop = pos + sphereDisp;
triStream.Append(GenerateGrassVertex(newPosTop, 0, height, forward, float2(0.5, 1), transformationMatrix));
triStream.RestartStrip();
```
In URP, declaring uniform float3 _PositionMoving may break SRP Batcher compatibility.
4.2 Script code
Attach this script to whichever object should interact with the grass.
```
using UnityEngine;

public class ShaderInteractor : MonoBehaviour
{
    // Update is called once per frame
    void Update()
    {
        Shader.SetGlobalVector("_PositionMoving", transform.position);
    }
}
```
4.3 Full code ‼️ (URP)
Grass shader:
https://pastebin.com/Zs77EQgy
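V. Rendering grass with a compute shader v1.0
Why v1.0? Because rendering a grass sea with a compute shader is fairly hard, and the parts I cannot do yet can be added over time. I have also written some notes on compute shaders.
5.1 Review and cleanup
The compute shader notes mentioned above describe in full how to write a stylized grass sea with CS from scratch. Here is a recap in case you have forgotten.
The CPU has quite a lot to do during initialization: define the grass mesh, upload the buffers (random blade width and height, the position of each blade, random facing, random color shade), and also pass the maximum bend value and the interaction radius to the compute shader.
Every frame the CPU also passes the time, wind direction, wind strength/speed and the wind-field scale factor to the compute shader.
The compute shader uses this information to work out how each blade should rotate, and outputs the result as a quaternion.
Finally, the shader uses the instance ID and the computed results to first offset the vertices, then apply the quaternion rotation, and lastly adjust the normals.
This demo can be optimized further, for example by moving more work into the compute shader: mesh generation, blade width and height, random facing and bend, and so on. More parameters could be made adjustable in real time. Various kinds of culling could be added, such as culling by distance from the camera position or frustum culling, which requires some atomic operations. Multiple interacting objects could be supported. The deformation logic could be refined, for example by making the interaction strength a power function of the distance to the interacting object. Editor features could be added, such as painting grass with a brush, which would probably need a quadtree storage system.
Also, in a compute shader, prefer vector operations over scalar ones wherever possible.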
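First, tidy the code. Variables that do not need to be sent to the compute shader every frame are initialized together in one function, and the Inspector panel is reorganized. (The code changes a lot.)
Almost all computation now runs on the GPU. The exception is each blade's world position, which is computed on the CPU and passed to the GPU in a buffer.
The size of that buffer depends entirely on the size of the ground mesh and the configured density, so for a huge open world it would become enormous. A 5*5 patch of grass with Density set to 0.5 sends about 312576 blades; the actual data comes to 4*312576*4=5001216 bytes. At a CPU->GPU transfer rate of 8 GB/s, that takes about 0.6 ms.
Fortunately this buffer does not need to be uploaded every frame, but it still deserves attention. If the grass area grows to 100*100, the time required multiplies many times over, which is alarming. Many of those vertices may never be used, which wastes a lot of performance.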
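The estimate is easy to reproduce. Here is a back-of-the-envelope Python sketch using the figures quoted above (4 values of 4 bytes per blade, 8 GB/s). The function names are made up, and the assumption that blade count scales linearly with area is mine.

```python
def buffer_bytes(blade_count, values_per_blade=4, bytes_per_value=4):
    """Size of the position buffer in bytes."""
    return blade_count * values_per_blade * bytes_per_value


def transfer_ms(num_bytes, bandwidth_bytes_per_s=8e9):
    """Idealized upload time in milliseconds at a fixed bandwidth."""
    return num_bytes / bandwidth_bytes_per_s * 1000.0


small = buffer_bytes(312576)                       # 5*5 patch
large = buffer_bytes(312576 * (100 * 100) // (5 * 5))  # 100*100 patch, 400x the area
```

Plugging in the numbers gives roughly 0.6 ms for the 5*5 patch and roughly 250 ms (about 2 GB) for a 100*100 patch at the same density. Real uploads also pay driver and allocation overhead, so treat these as lower bounds.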
I added a Perlin-style noise function and the xorshift128 random number generator to the compute shader.
```
// Perlin-style noise
float hash(float x, float y) {
    return frac(abs(sin(sin(123.321 + x) * (y + 321.123)) * 456.654));
}
float perlin(float x, float y){
    float col = 0.0;
    for (int i = 0; i < 8; i++) {
        float fx = floor(x); float fy = floor(y);
        float cx = ceil(x); float cy = ceil(y);
        float a = hash(fx, fy); float b = hash(fx, cy);
        float c = hash(cx, fy); float d = hash(cx, cy);
        col += lerp(lerp(a, b, frac(y)), lerp(c, d, frac(y)), frac(x));
        col /= 2.0; x /= 2.0; y /= 2.0;
    }
    return col;
}
// XorShift128 random number generator -- edited to output normalized values directly
uint state[4];
void xorshift_init(uint s) {
    state[0] = s; state[1] = s | 0xffff0000u;
    state[2] = s << 16; state[3] = s >> 16;
}
float xorshift128() {
    uint t = state[3]; uint s = state[0];
    state[3] = state[2]; state[2] = state[1]; state[1] = s;
    t ^= t << 11u; t ^= t >> 8u;
    state[0] = t ^ s ^ (s >> 19u);
    return (float)state[0] / float(0xffffffffu);
}

[numthreads(THREADGROUPSIZE,1,1)]
void BendGrass (uint3 id : SV_DispatchThreadID)
{
    xorshift_init(id.x * 73856093u ^ id.y * 19349663u ^ id.z * 83492791u);
    ...
}
```
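Since HLSL `uint` arithmetic wraps at 32 bits, a CPU port has to mask explicitly. Below is a minimal Python port of the generator above, useful for checking values offline. The class name `XorShift128` and the method `next_float` are chosen for this sketch.

```python
MASK32 = 0xFFFFFFFF


class XorShift128:
    """Python port of the shader's xorshift128, with explicit 32-bit wrapping."""

    def __init__(self, seed):
        s = seed & MASK32
        self.state = [s, s | 0xFFFF0000, (s << 16) & MASK32, s >> 16]

    def next_float(self):
        t = self.state[3]
        s = self.state[0]
        self.state[3] = self.state[2]
        self.state[2] = self.state[1]
        self.state[1] = s
        t ^= (t << 11) & MASK32
        t ^= t >> 8
        self.state[0] = (t ^ s ^ (s >> 19)) & MASK32
        return self.state[0] / MASK32   # normalized to [0, 1]
```

One thing the port makes visible: for a small seed the first output is almost the seed itself divided by 2^32, so it is poorly mixed. The shader hashes the thread ID with large primes before seeding, which helps spread the seeds out.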
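To recap: the CPU currently generates every possible blade position by spreading grass evenly over the ground's AABB, then sends them to the GPU, where the compute shader performs culling, LoD and similar operations.
So far I have three buffers.
m_InputBuffer sends all the grass to the GPU in one go, with no culling at all. It holds the struct on the left of the figure above.
m_OutputBuffer is a variable-length buffer that is filled gradually in the compute shader. If the blade for the current thread ID qualifies, it is appended to this buffer for the instanced draw that follows. It holds the struct on the right of the figure above.
m_argsBuffer is an arguments buffer, and its type differs from the other buffers. It is passed to the draw call and specifies the number of indices per instance, the number of instances and so on. In detail:
The first argument: my grass mesh has seven triangles, so 21 indices are rendered per instance.
The second argument is initially 0, meaning nothing is rendered. It is set dynamically after the compute shader finishes, from the length of m_OutputBuffer. In other words, it becomes the number of blades appended in the compute shader.
The third and fourth arguments are the start index location and the base vertex location.
The fifth argument, which I have not used, is the start instance location.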
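For reference, the arguments buffer is just five consecutive 32-bit unsigned integers. Here is a tiny Python sketch of that layout; the field names follow the descriptions above, and `make_args` is a made-up helper, not a Unity API.

```python
import struct

ARGS_FIELDS = (
    "index_count_per_instance",
    "instance_count",
    "start_index_location",
    "base_vertex_location",
    "start_instance_location",
)


def make_args(index_count_per_instance, instance_count=0):
    """Pack the five uint arguments in little-endian order (20 bytes total)."""
    return struct.pack("<5I", index_count_per_instance, instance_count, 0, 0, 0)
```

For the seven-triangle blade this is `make_args(21)`: 21 indices per instance and an instance count of 0 until the compute pass has filled the append buffer.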
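The final step looks like this: the mesh, material, AABB and arguments buffer are passed in.
5.2 Custom Unity tools
Create a new C# script and save it in the project's Editor folder (create one if it does not exist). The script inherits from Editor and is annotated with [CustomEditor(typeof(XXX))], which declares that it serves XXX. Mine serves GrassControl, so what is written here gets attached to the GrassControl inspector. A standalone window is also possible, presumably by inheriting from EditorWindow.
Write the tool inside OnInspectorGUI(). For example, a label: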
```
GUILayout.Label("== Remo Grass Generator ==");
```
To center it in the Inspector, add a style argument.
```
GUILayout.Label("== Remo Grass Generator ==", new GUIStyle(EditorStyles.boldLabel) { alignment = TextAnchor.MiddleCenter });
```
Too cramped? Add a blank spacer.
```
EditorGUILayout.Space();
```
To place the tool above XXX's own inspector, write all the logic before the base.OnInspectorGUI() call.
```
... // write your code here
// the default GrassControl Inspector UI
base.OnInspectorGUI();
```
Create a button and the code that runs when it is pressed:
```
if (GUILayout.Button("xxx"))
{
    ... // code to run when pressed
}
```
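These are all I have used so far.
5.3 Generating grass on the object selected in the Editor
Getting the object of the script being served and showing it in the Inspector is also simple.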
```
[SerializeField] private GameObject grassObject;
...
grassObject = (GameObject)EditorGUILayout.ObjectField("Any label", grassObject, typeof(GameObject), true);
if (grassObject == null)
{
    grassObject = FindObjectOfType<GrassControl>()?.gameObject;
}
```
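Once it has been obtained, the contents of that script can be accessed through the GameObject.
How do you get the objects selected in the Editor window? One line does it.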
```
foreach (GameObject obj in Selection.gameObjects)
```
Show the selected objects in the Inspector. Note that multi-selection has to be handled here, otherwise there will be a warning.
```
// Show the current Editor selection in real time and control whether the buttons are enabled
EditorGUILayout.LabelField("Selection Info:", EditorStyles.boldLabel);
bool hasSelection = Selection.activeGameObject != null;
GUI.enabled = hasSelection;
if (hasSelection)
    foreach (GameObject obj in Selection.gameObjects)
        EditorGUILayout.LabelField(obj.name);
else
    EditorGUILayout.LabelField("No active object selected.");
```
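Next, get the MeshFilter and Renderer of the selected object. Since raycasts are used for detection, also get a Collider, creating one if there is none.
Then comes the code that generates the grass, which is not covered here.
5.4 Handling the AABB
After generating the grass, every blade has to be added to the AABB, which is finally passed to the instanced draw.
I assume each blade occupies a unit cube, hence Vector3.one. If the grass is particularly tall, this probably needs to change.
Each blade is encapsulated into the overall AABB, and the new AABB is written back to the script's m_LocalBounds for instancing to use.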
```
Graphics.DrawMeshInstancedIndirect(blade, 0, m_Material, m_LocalBounds, m_argsBuffer);
```
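The "grow the big AABB around every blade" step can be sketched in standalone C as follows. This is only an illustration under the unit-cube assumption stated above; `Vec3`, `Aabb`, and both helper functions are hypothetical names, not Unity's `Bounds` API.

```c
typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 min, max; } Aabb;

/* Build a box from a center and a full size (Vector3.one means size 1,1,1). */
Aabb aabb_from_center_size(Vec3 center, Vec3 size)
{
    Aabb b;
    b.min.x = center.x - size.x * 0.5f; b.max.x = center.x + size.x * 0.5f;
    b.min.y = center.y - size.y * 0.5f; b.max.y = center.y + size.y * 0.5f;
    b.min.z = center.z - size.z * 0.5f; b.max.z = center.z + size.z * 0.5f;
    return b;
}

/* Grow `total` in place so that it also contains `blade`. */
void aabb_encapsulate(Aabb *total, Aabb blade)
{
    if (blade.min.x < total->min.x) total->min.x = blade.min.x;
    if (blade.min.y < total->min.y) total->min.y = blade.min.y;
    if (blade.min.z < total->min.z) total->min.z = blade.min.z;
    if (blade.max.x > total->max.x) total->max.x = blade.max.x;
    if (blade.max.y > total->max.y) total->max.y = blade.max.y;
    if (blade.max.z > total->max.z) total->max.z = blade.max.z;
}
```

Starting from the first blade's box and encapsulating the rest yields the bounds handed to the draw call. For tall grass, the size passed to `aabb_from_center_size` would need a larger Y component.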
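5.5 Surface Shader pitfall
There is a small issue here. The current material uses a Surface Shader, whose vertex stage already offsets vertices by the AABB center by default, so the world-space positions passed in earlier cannot be used directly. The AABB center also has to be passed in and subtracted. That is odd, and I do not know whether there is a more elegant way.
5.6 Simple camera-distance culling with fade-out
Right now the CPU passes all generated grass to the Compute Shader, and every blade is added to the AppendBuffer. In other words, there is no culling logic at all.
The simplest culling scheme is based on the distance between the camera and the grass. Expose a value in the Inspector for the culling distance, compute the distance between the camera and the current grass instance, and skip the append if it exceeds that value.
First pass the camera's world position in from C#. The following is semi-pseudocode: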
```
// get the camera
private Camera m_MainCamera;

m_MainCamera = Camera.main;

if (m_MainCamera != null)
    m_ComputeShader.SetVector(ID_camreaPos, m_MainCamera.transform.position);
```
In the Compute Shader, compute the distance between the blade and the camera:
```
float distanceFromCamera = distance(input.position, _CameraPositionWS);
```
The fade factor derived from that distance is computed as follows:
```
float distanceFade = 1 - saturate((distanceFromCamera - _MinFadeDist) / (_MaxFadeDist - _MinFadeDist));
```
If the fade factor is effectively zero (below 0.001), return immediately.
```
// skip if out of fading range too
if (distanceFade < 0.001f)
{
    return;
}
```
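To see how the fade factor behaves at the two ends of the range, here is a small standalone C sketch of the same arithmetic. `saturate` mirrors the HLSL intrinsic; the function names and the sample distances in the note below are assumptions made for illustration.

```c
/* Clamp to [0, 1], like the HLSL saturate() intrinsic. */
float saturate(float v)
{
    if (v < 0.0f) return 0.0f;
    if (v > 1.0f) return 1.0f;
    return v;
}

/* 1 at or inside min_fade_dist, 0 at or beyond max_fade_dist, linear in between. */
float distance_fade(float dist, float min_fade_dist, float max_fade_dist)
{
    return 1.0f - saturate((dist - min_fade_dist) / (max_fade_dist - min_fade_dist));
}

/* Same early-out threshold as the shader snippet above. */
int is_culled(float fade)
{
    return fade < 0.001f;
}
```

With a fade range of 10 to 30 units, a blade at distance 20 gets a factor of 0.5, and anything at 30 or beyond is culled.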
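In the zone between culled and not culled, multiply the blade's width, height and fade value by the fade factor to get a gradual fade-out.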
```
result.height = (bladeHeight + bladeHeightOffset * (xorshift128()*2-1)) * distanceFade;
result.width = (bladeWeight + bladeWeightOffset * (xorshift128()*2-1)) * distanceFade;
...
result.fade = xorshift128() * distanceFade;
```
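For the demonstration in the figure below, both distances are set fairly small.
I think the result looks quite good and very smooth. Without scaling the blade width and height, the effect is much worse.
Of course, the logic could also be changed: instead of culling everything beyond the maximum draw range, reduce the number of blades drawn, or draw the blades in the transition zone selectively.
Either works. I would choose the latter.
5.7 Maintaining a visible-ID buffer
Frustum culling means using the CPU stage to cut down on redundant GPU work.
How does the Compute Shader know which blades to render and which to cull? My approach is to maintain an ID list whose capacity is the total number of blades. A culled blade is simply left out; otherwise its index is recorded for rendering.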
```
List<int> grassVisibleIDList = new List<int>(); // int matches Enumerable.Range and RetrieveLeaves; same 4-byte stride as uint

// buffer that contains the ids of all visible instances
private ComputeBuffer m_VisibleIDBuffer;

private const int VISIBLE_ID_STRIDE        =  1 * sizeof(uint);

m_VisibleIDBuffer = new ComputeBuffer(grassData.Count, VISIBLE_ID_STRIDE,
    ComputeBufferType.Structured); //uint only, per visible grass
m_ComputeShader.SetBuffer(m_ID_GrassKernel, "_VisibleIDBuffer", m_VisibleIDBuffer);

m_VisibleIDBuffer?.Release();
```
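Since some blades are already culled before the data reaches the Compute Shader, the dispatch size is no longer based on the total number of blades but on the length of the current list.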
```
// m_ComputeShader.Dispatch(m_ID_GrassKernel, m_DispatchSize, 1, 1);

// cast to float first; integer division would truncate before CeilToInt runs
m_DispatchSize = Mathf.CeilToInt((float)grassVisibleIDList.Count / threadGroupSize);
```
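The rounding matters because an integer division truncates before any ceiling is applied. A minimal C sketch of the group-count calculation, using the usual integer round-up idiom (the function name is hypothetical):

```c
/* Number of thread groups needed so that every item gets a thread. */
unsigned group_count(unsigned item_count, unsigned threads_per_group)
{
    return (item_count + threads_per_group - 1u) / threads_per_group;
}
```

For 1000 visible blades and 128 threads per group this gives 8 groups, whereas plain `1000 / 128` gives 7 and would leave the last 104 blades without a thread.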
Generate an ID sequence in which everything is visible:
```
void GrassFastList(int count)
{
    grassVisibleIDList = Enumerable.Range(0, count).ToArray().ToList();
}
```
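Upload it to the GPU every frame. That completes the preparation. Next, a quadtree is used to manipulate this array.
5.8 Storing grass indices in a quadtree/octree
One option is to split an AABB into several child AABBs and manage them with a quadtree.
At the moment all of the grass sits in a single AABB. Next, build a tree and place every blade in that AABB into its branches. This makes early frustum culling on the CPU straightforward.
How should the blades be stored? If the terrain has little vertical variation, a quadtree is enough. For an open world with rolling mountains, use an octree. Since grass is much denser horizontally, I use a mixed quadtree + octree structure: the parity of the depth decides whether a level splits into four or eight children. Using an octree at every level also works if you prefer, though I suspect it is slightly less efficient. The space here is simply split evenly. A later optimization could split the AABB adaptively based on its side lengths.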
```
if (depth % 2 == 0)
{
    ...
    m_children.Add(new CullingTreeNode(topLeftSingle, depth - 1));
    m_children.Add(new CullingTreeNode(bottomRightSingle, depth - 1));
    m_children.Add(new CullingTreeNode(topRightSingle, depth - 1));
    m_children.Add(new CullingTreeNode(bottomLeftSingle, depth - 1));
}
else
{
    ...
    m_children.Add(new CullingTreeNode(topLeft, depth - 1));
    m_children.Add(new CullingTreeNode(bottomRight, depth - 1));
    m_children.Add(new CullingTreeNode(topRight, depth - 1));
    m_children.Add(new CullingTreeNode(bottomLeft, depth - 1));

    m_children.Add(new CullingTreeNode(topLeft2, depth - 1));
    m_children.Add(new CullingTreeNode(bottomRight2, depth - 1));
    m_children.Add(new CullingTreeNode(topRight2, depth - 1));
    m_children.Add(new CullingTreeNode(bottomLeft2, depth - 1));
}
```
For the frustum-versus-AABB test, GeometryUtility.TestPlanesAABB is all you need.
```
public void RetrieveLeaves(Plane[] frustum, List<Bounds> list, List<int> visibleIDList)
{
    if (GeometryUtility.TestPlanesAABB(frustum, m_bounds))
    {
        if (m_children.Count == 0)
        {
            if (grassIDHeld.Count > 0)
            {
                list.Add(m_bounds);
                visibleIDList.AddRange(grassIDHeld);
            }
        }
        else
        {
            foreach (CullingTreeNode child in m_children)
            {
                child.RetrieveLeaves(frustum, list, visibleIDList);
            }
        }
    }
}
```
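This code is the key part. It takes:
- The six planes of the camera frustum, Plane[]
- A list that collects the Bounds of every node inside the frustum
- A list that collects the grass indices held by every node inside the frustum
Calling this method on the quadtree/octree returns all bounding boxes and all grass indices inside the frustum.
The resulting grass indices can then be packed into a buffer and passed to the Compute Shader.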
```
m_VisibleIDBuffer.SetData(grassVisibleIDList);
```
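For intuition about what a planes-versus-AABB test of this kind does, here is a standalone C sketch of the common "positive vertex" approach. It is not Unity's implementation of GeometryUtility.TestPlanesAABB. The `Plane` and `Box` types, the inward-facing normal convention, and the function name are assumptions for illustration.

```c
/* Plane in the form dot(n, p) + d >= 0 for points on the inside. */
typedef struct { float nx, ny, nz, d; } Plane;
typedef struct { float min[3], max[3]; } Box;

/* Returns 0 if the box lies completely outside any plane, 1 otherwise.
   Like most fast tests, it is conservative: a box that is outside the volume
   but not fully behind any single plane is still reported as visible. */
int box_touches_planes(const Plane *planes, int plane_count, const Box *box)
{
    int i;
    for (i = 0; i < plane_count; i++) {
        /* Pick the box corner that lies farthest along the plane normal. */
        float px = planes[i].nx >= 0.0f ? box->max[0] : box->min[0];
        float py = planes[i].ny >= 0.0f ? box->max[1] : box->min[1];
        float pz = planes[i].nz >= 0.0f ? box->max[2] : box->min[2];
        if (planes[i].nx * px + planes[i].ny * py + planes[i].nz * pz + planes[i].d < 0.0f)
            return 0; /* even the best corner is behind this plane */
    }
    return 1;
}
```

In the tree traversal above, a node that fails this kind of test is skipped together with all of its children, which is where the CPU savings come from.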
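To visualize the AABBs, use the OnDrawGizmos() method.
Pass in all of the AABBs that survived frustum culling, and they can be seen directly in the scene.
The indices of everything inside the frustum also have to be written to the visible-grass list.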
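5.9 Flickering blades pitfall
I fell into a small trap here. I had finished the octree and successfully split the space into many child AABBs, as in the figure above. But when I moved the camera, the grass flickered wildly. I was too lazy to record a GIF, so compare the two images below: I only moved the view slightly, which changed the current visibility list, and the grass jumped around. Viewed continuously, that is flickering.
I was baffled. The culling in the Compute Shader was fine.
The dispatch count is derived from the length of the visibility list, so enough compute threads were definitely launched.
DrawMeshInstancedIndirect was fine too.
So where was the problem?
After a long debugging session, I found it in the way the Compute Shader draws its Xorshift random numbers.
Before _VisibleIDBuffer was introduced, each blade mapped to one thread ID, fixed from the moment the blade was created. With the index list added, if the ID fed into the random generator is not switched to the visible ID, the same blade receives different random numbers whenever the list changes.
The fix is to replace every use of the old id with the index read from _VisibleIDBuffer!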
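The fix can be illustrated with a small standalone C sketch: the random value for a blade is seeded from the blade's own index, looked up through the visible list, instead of from the thread index. The function names and the `+ 1` offset that keeps the state non-zero are assumptions, not the project's actual shader code.

```c
#include <stdint.h>

/* One step of the basic 32-bit xorshift generator (shifts 13, 17, 5). */
uint32_t xorshift32(uint32_t state)
{
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

/* Random value for the blade handled by a given thread. Seeding from the
   stable blade ID keeps the value identical from frame to frame, no matter
   where the blade sits in the visible list. */
uint32_t blade_random(const uint32_t *visible_ids, uint32_t thread_index)
{
    uint32_t blade_id = visible_ids[thread_index];
    return xorshift32(blade_id + 1u); /* +1 so blade 0 does not seed with 0 */
}
```

If blade 7 is the second visible blade in one frame and the first in the next, it still gets the same value in both frames, so its shape no longer changes when the camera moves.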
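5.10 Interaction with multiple objects
Currently only one trampler is passed in, and not passing one raises an error, which is unacceptable.
There are three interaction parameters: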
- pos – Vector3
- trampleStrength – Float
- trampleRadius – Float
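Now pack trampleRadius into pos (a Vector4; packing the other parameter works too, depending on your needs) and pass the position array with SetVectorArray. That way each interacting object can have its own radius: larger for bulky objects, smaller for thin ones. In other words, remove the following line: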
```
// in SetGrassDataBase; no need to upload every frame
// m_ComputeShader.SetFloat("trampleRadius", trampleRadius);
```
and replace it with:
```
// in SetGrassDataUpdate; uploaded every frame
// set up multiple interacting objects
if (trampler.Length > 0)
{
    Vector4[] positions = new Vector4[trampler.Length];
    for (int i = 0; i < trampler.Length; i++)
    {
        positions[i] = new Vector4(trampler[i].transform.position.x, trampler[i].transform.position.y, trampler[i].transform.position.z,
            trampleRadius);
    }
    m_ComputeShader.SetVectorArray(ID_tramplePos, positions);
}
```
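The number of interacting objects also has to be passed so that the Compute Shader knows how many to process. This needs updating every frame as well. For values that are updated every frame I like to cache a property ID, which is more efficient than looking the name up by string each time.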
```
// during initialization
ID_trampleLength = Shader.PropertyToID("_trampleLength");
// every frame
m_ComputeShader.SetFloat(ID_trampleLength, trampler.Length);
```
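I wrapped this up a bit further:
With the corresponding code changes, each interacting object's radius can be tuned freely in the panel. To make this more flexible, consider passing a dedicated buffer.
In the Compute Shader, loop over the tramplers and combine their rotations, which is fairly simple.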
```
// Trampler
float4 qt = float4(0, 0, 0, 1); // identity quaternion: the imaginary parts are all 0
for (int trampleIndex = 0; trampleIndex < trampleLength; trampleIndex++)
{
    float trampleRadius = tramplePos[trampleIndex].a;
    float3 relativePosition = input.position - tramplePos[trampleIndex].xyz;
    float dist = length(relativePosition);
    if (dist < trampleRadius) {
        // square the falloff to strengthen the effect at close range
        float eff = pow((trampleRadius - dist) / trampleRadius, 2) * trampleStrength;
        float3 direction = normalize(relativePosition);
        float3 newTargetDirection = float3(direction.x * eff, 1, direction.z * eff);
        qt = quatMultiply(MapVector(float3(0, 1, 0), newTargetDirection), qt);
    }
}
```
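5.11 Real-time preview in the Editor
The camera currently passed to the Compute Shader is the main camera, the one used by the Game window. Now I want the Scene view camera to stand in for it while editing, and the main camera restored once the game starts. This can be done with the Scene View GUI drawing event.
Below is an example adapted from my current code: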
```
#if UNITY_EDITOR
    SceneView view;

    void OnDestroy()
    {
        // When the window is destroyed, remove the delegate
        // so that it will no longer do any drawing.
        SceneView.duringSceneGui -= this.OnScene;
    }

    void OnScene(SceneView scene)
    {
        view = scene;
        if (!Application.isPlaying)
        {
            if (view.camera != null)
            {
                m_MainCamera = view.camera;
            }
        }
        else
        {
            m_MainCamera = Camera.main;
        }
    }
    private void OnValidate()
    {
        // Set up components
        if (!Application.isPlaying)
        {
            if (view != null)
            {
                m_MainCamera = view.camera;
            }
        }
        else
        {
            m_MainCamera = Camera.main;
        }
    }
#endif
```
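When initializing the shader, subscribe to the event first, then check whether the game is playing, and only in that case assign the main camera. In edit mode, m_MainCamera may still be NULL if the Scene view camera is not available yet.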
```
void InitShader()
{
#if UNITY_EDITOR
    SceneView.duringSceneGui += this.OnScene;
    if (!Application.isPlaying)
    {
        if (view != null && view.camera != null)
        {
            m_MainCamera = view.camera;
        }
    }
#endif
    if (Application.isPlaying)
    {
        m_MainCamera = Camera.main;
    }
    ...
```
In the per-frame update function, if m_MainCamera is NULL, assume edit mode and fall back to the Scene view camera:
```
// pass in the camera position
        if (m_MainCamera != null)
            m_ComputeShader.SetVector(ID_camreaPos, m_MainCamera.transform.position);
#if UNITY_EDITOR
        else if (view != null && view.camera != null)
        {
            m_ComputeShader.SetVector(ID_camreaPos, view.camera.transform.position);
        }

#endif
```
6. Cutting grass
Maintain a cut buffer:
```
// added for cutting
private ComputeBuffer m_CutBuffer;
float[] cutIDs;
```
Initialize the buffer:
```
private const int CUT_ID_STRIDE            =  1 * sizeof(float);
// added for cutting
m_CutBuffer = new ComputeBuffer(grassData.Count, CUT_ID_STRIDE, ComputeBufferType.Structured);
// added for cutting
m_ComputeShader.SetBuffer(m_ID_GrassKernel, "_CutBuffer", m_CutBuffer);
m_CutBuffer.SetData(cutIDs);
```
Do not forget to release it on disable.
```
// added for cutting
m_CutBuffer?.Release();
```
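Define a method that takes the hit position and a radius and finds the blades within range. An uncut blade holds -1 in cutIDs. A cut blade stores the height of the cut (hitPoint.y).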
```
// newly added for cutting
public void UpdateCutBuffer(Vector3 hitPoint, float radius)
{
    // can't cut grass if there is no grass in the scene
    if (grassData.Count > 0)
    {
        List<int> grasslist = new List<int>();
        // Get the list of IDS that are near the hitpoint within the radius
        cullingTree.ReturnLeafList(hitPoint, grasslist, radius);
        Vector3 brushPosition = this.transform.position;
        // Compute the squared radius to avoid square root calculations
        float squaredRadius = radius * radius;

        for (int i = 0; i < grasslist.Count; i++)
        {
            int currentIndex = grasslist[i];
            Vector3 grassPosition = grassData[currentIndex].position + brushPosition;

            // Calculate the squared distance
            float squaredDistance = (hitPoint - grassPosition).sqrMagnitude;

            // Check if the squared distance is within the squared radius
            // Check if there is grass to cut, or if the grass is uncut (-1)
            if (squaredDistance <= squaredRadius && (cutIDs[currentIndex] > hitPoint.y || cutIDs[currentIndex] == -1))
            {
                // store cutting point
                cutIDs[currentIndex] = hitPoint.y;
            }

        }
    }
    m_CutBuffer.SetData(cutIDs);
}
```
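The cut rule inside the loop can be isolated in a short standalone C sketch. Unlike the method above, it brute-forces over all blades instead of asking the culling tree for candidates. `Vec3`, `UNCUT`, and `apply_cut` are hypothetical names used only for this illustration.

```c
#define UNCUT (-1.0f)

typedef struct { float x, y, z; } Vec3;

/* Record a cut at hit.y for every blade within `radius` of `hit`.
   A blade is updated when it is still uncut or when the new cut is lower. */
void apply_cut(float *cut_heights, const Vec3 *positions, int count,
               Vec3 hit, float radius)
{
    float squared_radius = radius * radius; /* compare squared distances, no sqrt */
    int i;
    for (i = 0; i < count; i++) {
        float dx = hit.x - positions[i].x;
        float dy = hit.y - positions[i].y;
        float dz = hit.z - positions[i].z;
        float squared_distance = dx * dx + dy * dy + dz * dz;
        if (squared_distance <= squared_radius &&
            (cut_heights[i] == UNCUT || cut_heights[i] > hit.y)) {
            cut_heights[i] = hit.y;
        }
    }
}
```

Cutting at a higher point after a lower one leaves the stored height unchanged, so a blade can only get shorter.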
Then attach a script to the object that should cut the grass:
```
using System.Collections;
using System.Collections.Generic;
using UnityEngine;


public class Cutgrass : MonoBehaviour
{
    [SerializeField]
    GrassControl grassComputeScript;

    [SerializeField]
    float radius = 1f;

    public bool updateCuts;

    Vector3 cachedPos;
    // Start is called before the first frame update


    // Update is called once per frame
    void Update()
    {
        if (updateCuts && transform.position != cachedPos)
        {
            Debug.Log("Cutting");
            grassComputeScript.UpdateCutBuffer(transform.position, radius);
            cachedPos = transform.position;

        }
    }

    private void OnDrawGizmos()
    {
        Gizmos.color = new Color(1, 0, 0, 0.3f);
        Gizmos.DrawWireSphere(transform.position, radius);
    }
}
```
In the Compute Shader, just change the blade height directly (very blunt...). Swap in whatever effect you like.
```
StructuredBuffer<float> _CutBuffer;// added for cutting

    float cut = _CutBuffer[usableID];
    result.height = (bladeHeight + bladeHeightOffset * (xorshift128()*2-1)) * distanceFade;
    if(cut != -1){
        result.height *= 0.1f;
    }
```
Done!
2024-06-23
Compute Shader Study Notes (4): Grass Rendering
Project repository:
https://github.com/Remyuu/Unity-Compute-Shader-Learn
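L5 Grass rendering
The result so far is quite ugly and many details are unfinished. It merely "works". I am still a beginner, so corrections on anything written or done poorly are welcome.
Summary of topics:
- Grass rendering approaches
- UNITY_PROCEDURAL_INSTANCING_ENABLED
- bounds.extents
- Raycasting
- Rodrigues rotation
- Quaternion rotation
Preface 1
Reference article for this preface:
There are many ways to render grass.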
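The simplest is to slap a grass texture onto the ground.
Dragging individual grass meshes into the scene is also common. This leaves a lot of room for control, since every blade is in your hands. Batching and similar techniques can reduce CPU-to-GPU transfer time, but the approach wears out the Ctrl, C, V and D keys on your keyboard. You can, however, type L(a, b) into a Transform field to distribute the selected objects evenly between a and b, or R(a, b) for random values. See the official documentation for more of these operations.
Geometry shaders can also be combined with tessellation shaders. This looks good, but one shader handles only one kind of geometry (grass). Generating flowers or rocks on the same mesh means modifying the geometry shader. That is not the biggest problem. Worse, many mobile devices and Metal do not support geometry shaders at all, and where they are supported they may only be emulated in software, with poor performance. The grass mesh is also recomputed every frame, which wastes performance.
Billboard grass is another widely used and long-lived method. It works very well when high-fidelity visuals are not required. It simply renders a quad plus a texture (alpha clipped), and DrawProcedural is enough. But it only holds up from a distance. Up close it falls apart.
Unity's terrain system can also paint very nice grass, and Unity uses instancing to keep it performant. Its best feature is the brush tool. If the terrain system is not part of your workflow, third-party plugins can do the same.
While searching for material I also came across a technique called Impostors. It combines the vertex savings of billboards with the ability to reproduce an object faithfully from several angles, which is quite interesting. The technique "photographs" a real grass mesh from several angles in advance and stores the results in a texture. At runtime it picks the appropriate texture based on the camera's view direction. It is essentially an upgraded billboard. I think Impostors suit large objects that players may view from many angles, such as trees or complex buildings. However, the method can break down when the camera is very close or moves between two captured angles. A sensible scheme is: mesh-based rendering at very close range, Impostors at medium range, and billboards far away.
The method implemented in this article is based on GPU Instancing and could be called "per-blade mesh grass". Games such as Ghost of Tsushima, Genshin Impact and The Legend of Zelda: Breath of the Wild use this approach. Every blade is its own entity, and lighting and shadows look quite realistic.
Rendering flow: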
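Preface 2
Unity's instancing is fairly complex and I have only seen a small part of it, so please point out any mistakes. The code so far follows the documentation. Platforms that currently support GPU instancing: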
- Windows: DX11 and DX12 with SM 4.0 and above / OpenGL 4.1 and above
- OS X and Linux: OpenGL 4.1 and above
- Mobile: OpenGL ES 3.0 and above / Metal
- PlayStation 4
- Xbox One
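Also, Graphics.DrawMeshInstancedIndirect is now obsolete and Graphics.RenderMeshIndirect should be used instead. As I understand it, that function computes the bounding box automatically, but that is a topic for later. See the official documentation for RenderMeshIndirect for details. This article is also helpful:
https://zhuanlan.zhihu.com/p/403885438
GPU Instancing works by issuing a single draw call for many objects that share the same mesh. The CPU first collects all of the information, puts it into arrays and sends it to the GPU in one go. The limitation is that these objects must share both Material and Mesh. This is how so much grass can be drawn at once while keeping performance high. To draw millions of meshes with GPU Instancing, a few rules apply:
- All meshes must use the same Material
- GPU Instancing must be enabled on the material
- The shader must support instancing
- Skinned Mesh Renderer is not supported
Because Skinned Mesh Renderer is not supported, the previous article bypassed it and passed the meshes of individual keyframes straight to the GPU. That is also why the question at the end of the previous article was raised.
Instancing in Unity comes in two main kinds: GPU Instancing and Procedural Instancing (which involves Compute Shaders and indirect drawing). There is also a stereo rendering path (UNITY_STEREO_INSTANCING_ENABLED), which is not covered here. In the shader, the former uses #pragma multi_compile_instancing and the latter uses #pragma instancing_options procedural:setup. See the official documentation, Creating shaders that support GPU instancing, for details.
As far as I can tell, the SRP pipelines currently do not support custom GPU Instancing shaders of this kind. Only BIRP does.
Next is UNITY_PROCEDURAL_INSTANCING_ENABLED. This macro indicates whether Procedural Instancing is enabled. With a Compute Shader or the indirect drawing API, per-instance properties (position, color and so on) can be computed on the GPU in real time and used directly for rendering, without CPU involvement. In the source code, the core of this macro is: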
```
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    #ifndef UNITY_INSTANCING_PROCEDURAL_FUNC
        #error "UNITY_INSTANCING_PROCEDURAL_FUNC must be defined."
    #else
        void UNITY_INSTANCING_PROCEDURAL_FUNC(); // forward declaration of the procedural function
        #define DEFAULT_UNITY_SETUP_INSTANCE_ID(input)      { UnitySetupInstanceID(UNITY_GET_INSTANCE_ID(input)); UNITY_INSTANCING_PROCEDURAL_FUNC();}
    #endif
#else
    #define DEFAULT_UNITY_SETUP_INSTANCE_ID(input)          { UnitySetupInstanceID(UNITY_GET_INSTANCE_ID(input));}
#endif
```
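The shader is required to define a UNITY_INSTANCING_PROCEDURAL_FUNC function, which is simply the setup() function. Without setup(), compilation fails with the error above.
In general, setup() fetches the data for the current instance (unity_InstanceID) from the buffer and computes the instance's position, transform matrix, color, metallic value or any custom properties.
GPU Instancing is only one of Unity's many optimization techniques, and there is still more to learn.
1. Swaying 3-quad grass
All of the Compute Shader knowledge used in this chapter was covered in the previous article. Only the setting is different. Here is a simple diagram.
The implementation uses GPU Instancing, that is, rendering a large batch of meshes at once. The core is a single line: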
```
Graphics.DrawMeshInstancedIndirect(mesh, 0, material, bounds, argsBuffer);
```
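The mesh is made of three quads, six triangles in total.
Then apply a texture with alpha test.
The grass data structure:
- Position
- Lean angle
- Random noise value (used to compute a random lean angle)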
```
public Vector3 position; // world-space position, needs to be computed
public float lean;
public float noise;
public GrassClump( Vector3 pos){
    position.x = pos.x;
    position.y = pos.y;
    position.z = pos.z;
    lean = 0;
    noise = Random.Range(0.5f, 1);
    if (Random.value < 0.5f) noise = -noise;
}
```
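Pass the buffer of grass to render (with world-space positions, which need computing) to the GPU. First decide where the grass is generated and how much of it. Get the AABB of the current object's mesh (assume a Plane mesh for now).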
```
Bounds bounds = mf.sharedMesh.bounds;
Vector3 clumps = bounds.extents;
```
With the extent determined, generate grass at random positions on the xOz plane.
Note that the positions are still in object space, so they have to be converted from Object Space to World Space.
```
pos = transform.TransformPoint(pos);
```
Then combine the density parameter with the object's scale to work out how many blades to render in total.
```
Vector3 vec = transform.localScale / 0.1f * density;
clumps.x *= vec.x;
clumps.z *= vec.z;
int total = (int)clumps.x * (int)clumps.z;
```
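Since the Compute Shader handles one blade per thread, the number of blades to render is very likely not a multiple of the thread-group size. The count is therefore rounded up to a multiple of the thread-group size. In other words, when the density factor is 1, the number of blades rendered equals the number of threads in one thread group.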
```
groupSize = Mathf.CeilToInt((float)total / (float)threadGroupSize);
int count = groupSize * (int)threadGroupSize;
```
Let the Compute Shader compute the lean angle of each clump.
```
GrassClump clump = clumpsBuffer[id.x];
clump.lean = sin(time) * maxLean * clump.noise;
clumpsBuffer[id.x] = clump;
```
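Passing positions and rotation angles to the GPU buffer is not the end of it. The Material still has to decide the final look of each rendered instance before Graphics.DrawMeshInstancedIndirect can run.
In the rendering flow, before the instancing stage (that is, inside the procedural:setup function), use unity_InstanceID to determine which clump is being drawn. Fetch the clump's world-space position and its lean value.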
```
GrassClump clump = clumpsBuffer[unity_InstanceID];
_Position = clump.position;
_Matrix = create_matrix(clump.position, clump.lean);
```
The rotation + translation matrix itself:
```
float4x4 create_matrix(float3 pos, float theta){
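    float c = cos(theta); // cosine of the rotation angle
    float s = sin(theta); // sine of the rotation angle
    // return a 4x4 transform: rotation about the Z axis plus translation
    return float4x4(
        c, -s, 0, pos.x, // row 1: rotation in the XY plane, X translation
        s,  c, 0, pos.y, // row 2: rotation in the XY plane, Y translation
        0,  0, 1, pos.z, // row 3: Z is unchanged by the rotation, Z translation
        0,  0, 0, 1     // row 4: homogeneous row (unchanged)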
    );
}
```
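How is this formula derived? Substitute the axis (0,0,1) into Rodrigues' rotation formula to get a 3×3 rotation matrix, then extend it to homogeneous coordinates. The result is the matrix in the code.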
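As a worked version of that substitution, write $\theta$ for the lean angle, $\mathbf{k}$ for the unit rotation axis and $\mathbf{p}$ for the clump position (symbols chosen here for illustration). Rodrigues' formula in matrix form is

$$R = \cos\theta\, I + (1-\cos\theta)\,\mathbf{k}\mathbf{k}^{T} + \sin\theta\,[\mathbf{k}]_{\times}$$

For $\mathbf{k} = (0,0,1)^{T}$:

$$\mathbf{k}\mathbf{k}^{T} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad [\mathbf{k}]_{\times} = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

so that

$$R = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad M = \begin{pmatrix} \cos\theta & -\sin\theta & 0 & p_x \\ \sin\theta & \cos\theta & 0 & p_y \\ 0 & 0 & 1 & p_z \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

where $M$ is $R$ extended to homogeneous coordinates with the translation $\mathbf{p}$ in the last column, matching create_matrix row by row.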
Multiply the object-space vertex by this matrix to get the leaned and translated vertex position.
```
v.vertex.xyz *= _Scale;
float4 rotatedVertex = mul(_Matrix, v.vertex);
v.vertex = rotatedVertex;
```
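Now a problem appears. The grass is not a single plane but a 3D shape made of three quads.
If all vertices are simply rotated about the z axis, the roots of the grass shift noticeably.
So use v.texcoord.y to lerp between the vertex positions before and after rotation. The higher the texture coordinate's Y value (that is, the closer the vertex is to the top of the model), the more the vertex is affected by the rotation. Since Y is 0 at the root, the root no longer swings around after the lerp.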
```
v.vertex.xyz *= _Scale;
float4 rotatedVertex = mul(_Matrix, v.vertex);
// v.vertex = rotatedVertex;
v.vertex.xyz += _Position;
v.vertex = lerp(v.vertex, rotatedVertex, v.texcoord.y);
```
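The result is poor and the grass looks fake. This kind of quad grass is only usable in the distance.
- Stiff swaying
- Stiff blades
- Poor lighting and shadows
Current version of the code:
2. Stylized grass blades
The previous section used a few quads with an alpha texture and a sine wave for motion, and the result was mediocre. This section improves on it with stylized blades and Perlin noise.
Define the blade's vertices, normals and UVs in C# and upload them to the GPU as a Mesh.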
```
Vector3[] vertices =
{
    new Vector3(-halfWidth, 0, 0),
    new Vector3( halfWidth, 0, 0),
    new Vector3(-halfWidth, rowHeight, 0),
    new Vector3( halfWidth, rowHeight, 0),
    new Vector3(-halfWidth*0.9f, rowHeight*2, 0),
    new Vector3( halfWidth*0.9f, rowHeight*2, 0),
    new Vector3(-halfWidth*0.8f, rowHeight*3, 0),
    new Vector3( halfWidth*0.8f, rowHeight*3, 0),
    new Vector3( 0, rowHeight*4, 0)
};
Vector3 normal = new Vector3(0, 0, -1);
Vector3[] normals =
{
    normal, normal, normal, normal, normal, normal, normal, normal, normal
};
Vector2[] uvs =
{
    new Vector2(0,0),
    new Vector2(1,0),
    new Vector2(0,0.25f),
    new Vector2(1,0.25f),
    new Vector2(0,0.5f),
    new Vector2(1,0.5f),
    new Vector2(0,0.75f),
    new Vector2(1,0.75f),
    new Vector2(0.5f,1)
};
```
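A Unity Mesh also needs its vertex winding order set. Unity treats triangles whose indices run clockwise, as seen by the camera, as front-facing. If the order is wrong for the side you look from and back-face culling is enabled, nothing shows up.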
```
int[] indices =
{
    0,1,2,1,3,2,//row 1
    2,3,4,3,5,4,//row 2
    4,5,6,5,7,6,//row 3
    6,7,8//row 4
};
mesh.SetIndices(indices, MeshTopology.Triangles, 0);
```
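On the C# side, set the wind direction, speed and noise scale, pack them into a float4 and pass it to the Compute Shader, which computes the swing direction of each blade.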
```
Vector4 wind = new Vector4(Mathf.Cos(theta), Mathf.Sin(theta), windSpeed, windScale);
```
The data structure for one blade:
```
struct GrassBlade
{
    public Vector3 position;
    public float bend; // random blade bend
    public float noise;// noise value computed by the CS
    public float fade; // random blade brightness
    public float face; // blade facing
    public GrassBlade( Vector3 pos)
    {
        position.x = pos.x;
        position.y = pos.y;
        position.z = pos.z;
        bend = 0;
        noise = Random.Range(0.5f, 1) * 2 - 1;
        fade = Random.Range(0.5f, 1);
        face = Random.Range(0, Mathf.PI);
    }
}
```
All blades currently face the same direction. In the setup function, change the blade facing first.
```
// build the rotation matrix about the Y axis (facing)
float4x4 rotationMatrixY = AngleAxis4x4(blade.position, blade.face, float3(0,1,0));
```
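Next, the logic that bends the blade. (AngleAxis4x4 includes the translation. The figure below only demonstrates the bend, without random facing. To reproduce the figure, remember to add the translation in your code.)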
```
// build the rotation matrix about the X axis (bend)
float4x4 rotationMatrixX = AngleAxis4x4(float3(0,0,0), blade.bend, float3(1,0,0));
```
Then combine the two rotation matrices.
```
_Matrix = mul(rotationMatrixY, rotationMatrixX);
```
The lighting looks very odd at this point because the normals have not been transformed.
```
// compute the matrix used to transform normals
float3x3 normalMatrix = (float3x3)transpose(((float3x3)_Matrix));
// transform the normal
v.normal = mul(normalMatrix, v.normal);
```
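Here is the code for the helper, which is a plain transpose (for a pure rotation matrix, the inverse equals the transpose):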
```
float3x3 transpose(float3x3 m)
{
    return float3x3(
        float3(m[0][0], m[1][0], m[2][0]), // Column 1
        float3(m[0][1], m[1][1], m[2][1]), // Column 2
        float3(m[0][2], m[1][2], m[2][2])  // Column 3
    );
}
```
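For readability, here is the homogeneous transform matrix as well, now upgraded to the well-known axis-angle (Rodrigues) rotation formula: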
```
float4x4 AngleAxis4x4(float3 pos, float angle, float3 axis){
    float c, s;
    sincos(angle*2*3.14, s, c);
    float t = 1 - c;
    float x = axis.x;
    float y = axis.y;
    float z = axis.z;
    return float4x4(
        t * x * x + c    , t * x * y - s * z, t * x * z + s * y, pos.x,
        t * x * y + s * z, t * y * y + c    , t * y * z - s * x, pos.y,
        t * x * z - s * y, t * y * z + s * x, t * z * z + c    , pos.z,
        0,0,0,1
        );
}
```
What about generating grass on uneven ground?
Only the logic that sets each blade's initial height needs to change, using a MeshCollider plus raycasting:
```
bladesArray = new GrassBlade[count];
gameObject.AddComponent<MeshCollider>();
RaycastHit hit;
Vector3 v = new Vector3();
Debug.Log(bounds.center.y + bounds.extents.y);
v.y = (bounds.center.y + bounds.extents.y);
v = transform.TransformPoint(v);
float heightWS = v.y + 0.01f; // small margin for floating-point error
v.Set(0, 0, 0);
v.y = (bounds.center.y - bounds.extents.y);
v = transform.TransformPoint(v);
float neHeightWS = v.y;
float range = heightWS - neHeightWS;
// heightWS += 10; // raise it a little; tune the margin yourself
int index = 0;
int loopCount = 0;
while (index < count && loopCount < (count * 10))
{
    loopCount++;
    Vector3 pos = new Vector3( Random.value * bounds.extents.x * 2 - bounds.extents.x + bounds.center.x,
        0,
        Random.value * bounds.extents.z * 2 - bounds.extents.z + bounds.center.z);
    pos = transform.TransformPoint(pos);
    pos.y = heightWS;
    if (Physics.Raycast(pos, Vector3.down, out hit))
    {
        pos.y = hit.point.y;
        GrassBlade blade = new GrassBlade(pos);
        bladesArray[index++] = blade;
    }
}
```
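A raycast is used here at each blade's position to find its correct height.
One more tweak is possible: the higher the elevation, the sparser the grass.
See the figure above. Compute the ratio of the two green arrows. The higher the elevation, the lower the probability that a blade is generated.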
```
float deltaHeight = (pos.y - neHeightWS) / range;
if (Random.value > deltaHeight)
{
    // spawn a blade
}
```
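The elevation rule can be written as a tiny standalone C function. The `roll` parameter stands in for Random.value so that the behavior is easy to check. The function name and parameter names are assumptions for illustration.

```c
/* Decide whether to spawn a blade at world height `height_ws`.
   `min_height_ws` and `range` describe the terrain's vertical extent;
   `roll` is a uniform random number in [0, 1]. */
int should_spawn(float height_ws, float min_height_ws, float range, float roll)
{
    float normalized_height = (height_ws - min_height_ws) / range; /* 0 at the bottom, 1 at the top */
    return roll > normalized_height;
}
```

A blade at the very bottom is almost always kept, while one at 90% of the height survives only when the roll exceeds 0.9, so the slopes thin out toward the top.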
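Link to the current code:
Lighting and shadows are now fine.
3. Interactive grass
In the previous section we first rotated the blade's facing and then changed its bend. Now one more rotation is added: when an object approaches, the grass should lie down away from the object. That means yet another rotation. It is awkward to set up with matrices, so quaternions are used instead. The quaternion math runs in the Compute Shader, and what is passed to the material is also a quaternion, stored in the blade struct. Finally, the vertex shader converts the quaternion back into an affine matrix and applies the rotation.
Random blade width and height are added here as well. Since every blade uses the same Mesh, the height cannot be changed by editing the Mesh, so it has to be done with a vertex offset in the vertex function.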
```
// C#
[Range(0,0.5f)]
public float width = 0.2f;
[Range(0,1f)]
public float rd_width = 0.1f;
[Range(0,2)]
public float height = 1f;
[Range(0,1f)]
public float rd_height = 0.2f;
    GrassBlade blade = new GrassBlade(pos);
    blade.height = Random.Range(-rd_height, rd_height);
    blade.width = Random.Range(-rd_width, rd_width);
    bladesArray[index++] = blade;
// start of Setup
GrassBlade blade = bladesBuffer[unity_InstanceID];
_HeightOffset = blade.height_offset;
_WidthOffset = blade.width_offset;
// start of Vert
float tempHeight = v.vertex.y * _HeightOffset;
float tempWidth = v.vertex.x * _WidthOffset;
v.vertex.y += tempHeight;
v.vertex.x += tempWidth;
```
To summarize, each entry of the grass buffer currently stores:
```
struct GrassBlade{
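    public Vector3 position; // world-space position - must be initialized
    public float height; // blade height offset - must be initialized
    public float width; // blade width offset - must be initialized
    public float dir; // blade facing - must be initialized
    public float fade; // random blade brightness - must be initialized
    public Quaternion quaternion; // rotation - computed in the CS, consumed in Vert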
    public float padding;
    public GrassBlade( Vector3 pos){
        position.x = pos.x;
        position.y = pos.y;
        position.z = pos.z;
        height = width = 0;
        dir = Random.Range(0, 180);
        fade = Random.Range(0.99f, 1);
        quaternion = Quaternion.identity;
        padding = 0;
    }
}
int SIZE_GRASS_BLADE = 12 * sizeof(float);
```
The quaternion q that represents the rotation from vector v1 to vector v2:
```
float4 MapVector(float3 v1, float3 v2){
    v1 = normalize(v1);
    v2 = normalize(v2);
    float3 v = v1+v2;
    v = normalize(v);
    float4 q = 0;
    q.w = dot(v, v2);
    q.xyz = cross(v, v2);
    return q;
}
```
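To combine two rotation quaternions, multiply them (order matters).
Suppose there are two quaternions $q_1 = a + bi + cj + dk$ and $q_2 = x + yi + zj + wk$. Their product is:
$$q_1 q_2 = (ax - by - cz - dw) + (ay + bx + cw - dz)\,i + (az - bw + cx + dy)\,j + (aw + bz - cy + dx)\,k$$
where $a$ and $(b, c, d)$ are the real and imaginary components of $q_1$, and $x$ and $(y, z, w)$ are the real and imaginary components of $q_2$. Note that the float4 in the code below stores the imaginary parts in .xyz and the real part in .w.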
```
float4 quatMultiply(float4 q1, float4 q2) {
    // q1 = a + bi + cj + dk
    // q2 = x + yi + zj + wk
    // Result = q1 * q2
    return float4(
        q1.w * q2.x + q1.x * q2.w + q1.y * q2.z - q1.z * q2.y, // X component
        q1.w * q2.y - q1.x * q2.z + q1.y * q2.w + q1.z * q2.x, // Y component
        q1.w * q2.z + q1.x * q2.y - q1.y * q2.x + q1.z * q2.w, // Z component
        q1.w * q2.w - q1.x * q2.x - q1.y * q2.y - q1.z * q2.z  // W (real) component
    );
}
```
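A quick way to sanity-check the multiplication is to port it to standalone C and verify the basic identities $ij = k$ and $ji = -k$. The `Quat` struct below mirrors the float4 layout (x, y, z imaginary, w real). The names are otherwise hypothetical.

```c
typedef struct { float x, y, z, w; } Quat; /* w is the real part */

/* Hamilton product q1 * q2, term for term the same as the shader version. */
Quat quat_multiply(Quat q1, Quat q2)
{
    Quat r;
    r.x = q1.w * q2.x + q1.x * q2.w + q1.y * q2.z - q1.z * q2.y;
    r.y = q1.w * q2.y - q1.x * q2.z + q1.y * q2.w + q1.z * q2.x;
    r.z = q1.w * q2.z + q1.x * q2.y - q1.y * q2.x + q1.z * q2.w;
    r.w = q1.w * q2.w - q1.x * q2.x - q1.y * q2.y - q1.z * q2.z;
    return r;
}
```

Multiplying the unit quaternions i and j in the two possible orders gives k and -k, which is a compact reminder that the order of rotations matters.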
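To decide which way the grass falls, the position of the interacting object (the trampler) is needed, taken from its Transform component. It is passed to the GPU every frame with SetVector for the Compute Shader to use, so the property ID is cached instead of looking the name up by string each time. The range within which grass lies down, and the transition between lying down and standing, also have to be defined, so a trampleRadius is passed to the GPU. Since it is a constant that does not change every frame, setting it once by string is fine.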
```
// CSharp
public Transform trampler;
[Range(0.1f,5f)]
public float trampleRadius = 3f;
...
Init(){
    shader.SetFloat("trampleRadius", trampleRadius);
    tramplePosID = Shader.PropertyToID("tramplePos");
}
Update(){
    shader.SetVector(tramplePosID, pos);
}
```
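This section moves all rotation work into the Compute Shader, which computes everything in one pass and returns a single quaternion to the material. In the code below, q1 is the wind-driven bend, q2 is the blade facing, and qt is the bend caused by interaction. An interaction strength coefficient can be exposed in the Inspector here.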
```
[numthreads(THREADGROUPSIZE,1,1)]
void BendGrass (uint3 id : SV_DispatchThreadID)
{
    GrassBlade blade = bladesBuffer[id.x];
    float3 relativePosition = blade.position - tramplePos.xyz;
    float dist = length(relativePosition);
    float4 qt;
    if (dist<trampleRadius){
        float eff = ((trampleRadius - dist)/trampleRadius) * 0.6;
        qt = MapVector(float3(0,1,0), float3(relativePosition.x*eff,1,relativePosition.z*eff));
    }else{
        qt = MapVector(float3(0,1,0),float3(0,1,0));
    }
    float2 offset = (blade.position.xz + wind.xy * time * wind.z) * wind.w;
    float noise = perlin(offset.x, offset.y) * 2 - 1;
    noise *= maxBend;
    float4 q1 = MapVector(float3(0,1,0), (float3(wind.x * noise,1,wind.y*noise)));
    float faceTheta = blade.dir * 3.1415f / 180.0f;
    float4 q2 = MapVector(float3(1,0,0),float3(cos(faceTheta),0,sin(faceTheta)));
    blade.quaternion = quatMultiply(qt,quatMultiply(q2,q1));
    bladesBuffer[id.x] = blade;
}
```
The conversion from quaternion to rotation matrix is:
```
float4x4 quaternion_to_matrix(float4 quat)
{
    float4x4 m = float4x4(float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0));
    float x = quat.x, y = quat.y, z = quat.z, w = quat.w;
    float x2 = x + x, y2 = y + y, z2 = z + z;
    float xx = x * x2, xy = x * y2, xz = x * z2;
    float yy = y * y2, yz = y * z2, zz = z * z2;
    float wx = w * x2, wy = w * y2, wz = w * z2;
    m[0][0] = 1.0 - (yy + zz);
    m[0][1] = xy - wz;
    m[0][2] = xz + wy;
    m[1][0] = xy + wz;
    m[1][1] = 1.0 - (xx + zz);
    m[1][2] = yz - wx;
    m[2][0] = xz - wy;
    m[2][1] = yz + wx;
    m[2][2] = 1.0 - (xx + yy);
    m[0][3] = _Position.x;
    m[1][3] = _Position.y;
    m[2][3] = _Position.z;
    m[3][3] = 1.0;
    return m;
}
```
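The rotation part of this conversion can be checked in isolation with a standalone C sketch. It copies the same doubled-product scheme into a 3×3 matrix (leaving out the translation column) and applies it to a vector. `Quat`, `quat_to_mat3`, and `mat3_mul_vec` are hypothetical names.

```c
typedef struct { float x, y, z, w; } Quat; /* w is the real part */

/* Rotation part of the quaternion-to-matrix conversion (unit quaternion assumed). */
void quat_to_mat3(Quat q, float m[3][3])
{
    float x2 = q.x + q.x, y2 = q.y + q.y, z2 = q.z + q.z;
    float xx = q.x * x2, xy = q.x * y2, xz = q.x * z2;
    float yy = q.y * y2, yz = q.y * z2, zz = q.z * z2;
    float wx = q.w * x2, wy = q.w * y2, wz = q.w * z2;
    m[0][0] = 1.0f - (yy + zz); m[0][1] = xy - wz;          m[0][2] = xz + wy;
    m[1][0] = xy + wz;          m[1][1] = 1.0f - (xx + zz); m[1][2] = yz - wx;
    m[2][0] = xz - wy;          m[2][1] = yz + wx;          m[2][2] = 1.0f - (xx + yy);
}

/* out = m * v, treating v as a column vector. */
void mat3_mul_vec(float m[3][3], const float v[3], float out[3])
{
    int row;
    for (row = 0; row < 3; row++)
        out[row] = m[row][0] * v[0] + m[row][1] * v[1] + m[row][2] * v[2];
}
```

For a 90-degree rotation about the Z axis, the quaternion is approximately (0, 0, 0.7071, 0.7071), and the matrix sends the X axis to the Y axis up to floating-point error.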
Then apply it.
```
void vert(inout appdata_full v, out Input data)
{
    UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    float tempHeight = v.vertex.y * _HeightOffset;
    float tempWidth = v.vertex.x * _WidthOffset;
    v.vertex.y += tempHeight;
    v.vertex.x += tempWidth;
    // apply the model vertex transform
    v.vertex = mul(_Matrix, v.vertex);
    v.vertex.xyz += _Position;
    // transform the normal with the transposed matrix
    v.normal = mul((float3x3)transpose(_Matrix), v.normal);
    #endif
}
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        // fetch the result computed by the Compute Shader
        GrassBlade blade = bladesBuffer[unity_InstanceID];
        _HeightOffset = blade.height_offset;
        _WidthOffset = blade.width_offset;
        _Fade = blade.fade; // set the brightness
        _Matrix = quaternion_to_matrix(blade.quaternion); // set the final rotation matrix
        _Position = blade.position; // set the position
    #endif
}
```
Link to the current code:
4. Summary / quiz
How do you programmatically get the thread group sizes of a kernel?
When defining a Mesh in code, the number of normals must be the same as the number of vertex positions. True or false.
2024-06-04
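Compute Shader Study Notes (3): Particle Effects and Flocking Simulation
Continuing from the previous post:
remoooo: Compute Shader Study Notes (2): Post-Processing Effects
L4 Particle effects and flocking simulation
This chapter uses a Compute Shader to generate particles and covers how to use DrawProcedural and DrawMeshInstancedIndirect, that is, GPU Instancing.
Summary of topics:
- Compute Shader, Material, C# script and Shader working together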
- Graphics.DrawProcedural
- material.SetBuffer()
- The xorshift random number algorithm
- Flocking simulation
- Graphics.DrawMeshInstancedIndirect
- Rotation, translation and scale matrices, homogeneous coordinates
- Surface Shader
- ComputeBufferType.Default
- #pragma instancing_options procedural:setup
- unity_InstanceID
- Skinned Mesh Renderer
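- Data alignment
1. Introduction and preparation
Besides processing large amounts of data in parallel, Compute Shaders have another key advantage: the buffers live on the GPU. Data processed by a Compute Shader can therefore be handed directly to the Shader attached to a Material, that is, the Vertex/Fragment Shader. The key point is that a material can call SetBuffer() just like a Compute Shader can, and read the data straight from the GPU buffer!
Building a particle system with a Compute Shader shows off its parallel power well.
During rendering, the Vertex Shader reads each particle's position and other attributes from the Compute Buffer and turns them into vertices on screen. The Fragment Shader then generates pixels from the vertex information (such as position and color). With Graphics.DrawProcedural, Unity can render these shader-processed vertices directly, with no predefined mesh and no Mesh Renderer, which is especially effective for large numbers of particles.
2. Hello, particles
The steps are simple. Define the particle data (position, velocity and lifetime) in C#, initialize it and upload it to a buffer, then bind the buffer to both the Compute Shader and the Material. For rendering, call Graphics.DrawProceduralNow inside OnRenderObject() to draw the particles efficiently.
Create a new scene and build this effect: a million particles that follow the mouse and burst into life, as shown below: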
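Writing this, my thoughts wander. A particle's life is brief. It lights up like a spark and vanishes like a shooting star. Whatever hardships I go through, I am just one speck among billions, ordinary and small. These particles may drift randomly through space (their spawn positions come from the "Xorshift" algorithm) and may each have a unique color, but they cannot escape the fate the program has set for them. Is that not a portrait of my own life? Playing my part step by step, unable to slip the invisible restraints.
"God is dead! And we who killed him, how can we not feel the greatest pain?" – Friedrich Nietzsche
Nietzsche not only announced the fading of religious faith but also pointed to the sense of emptiness facing modern people: without the traditional pillars of morality and religion, people feel unprecedented loneliness and a loss of direction. Particles are defined and created in a C# script, then move and die according to fixed rules, which resembles the state of modern people in the universe as Nietzsche described it. Everyone tries to find their own meaning, yet remains bound by the wider rules of society and the universe.
Life is full of unavoidable pain, reflecting the emptiness and loneliness inherent in human existence. Heartbreak, parting and death, setbacks at work, and the particle death logic we are about to write all confirm what Nietzsche expressed: nothing in life is permanent. Particles in the same buffer will inevitably vanish at some point, which echoes the loneliness of modern people he described. Each individual may feel more isolated than ever, so everyone is a lone warrior who must learn to face the tornado within and the indifference of the outside world alone.
But that is fine. "Summer will come around again. People who meet will meet again." The particles in this article will also respawn after they end, embracing their buffer in the best possible state.
The current version of the code, which you can copy and run yourself (everything is commented):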
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Shaders/ParticleFun.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Scripts/ParticleFun.cs
- Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Shaders/Particle.shader
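Enough rambling. Let us look at how the C# script is written.
As usual, define the particle struct for the buffer, initialize it and pass it to the GPU. The key is the last three lines, which bind the buffer to the shaders. The elided code is routine, so it is only summarized in comments.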
```
struct Particle{
    public Vector3 position; // particle position
    public Vector3 velocity; // particle velocity
    public float life;       // particle lifetime
}
ComputeBuffer particleBuffer; // buffer on the GPU
...
// inside Init()
    // initialize the particle array
    Particle[] particleArray = new Particle[particleCount];
    for (int i = 0; i < particleCount; i++){
        // generate a random position and normalize it
        ...
        // set the particle's initial position and velocity
        ... 
        // set the particle's lifetime
        particleArray[i].life = Random.value * 5.0f + 1.0f;
    }
    // create the Compute Buffer and upload the data
    ...
    // find the kernel ID in the Compute Shader
    ...
    // bind the Compute Buffer to the shaders
    shader.SetBuffer(kernelID, "particleBuffer", particleBuffer);
    material.SetBuffer("particleBuffer", particleBuffer);
    material.SetInt("_PointSize", pointSize);
```
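Now the key rendering stage, OnRenderObject(). material.SetPass selects the material pass to render with. DrawProceduralNow draws geometry without a conventional mesh. MeshTopology.Points sets the topology to points: the GPU treats every vertex as a point and does not connect vertices into lines or faces. The second argument, 1, is the vertex count per instance, and the third, particleCount, is the instance count. Together they tell the GPU to draw particleCount instances of a single point each, one per particle.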
```
void OnRenderObject()
{
    material.SetPass(0);
    Graphics.DrawProceduralNow(MeshTopology.Points, 1, particleCount);
}
```
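This is how the current mouse position is obtained. OnGUI() may be called several times per frame. The z value is the camera's near clip plane plus an offset. Adding 14 gives a world-space position at a more suitable visual depth (tune it as you like).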
```
void OnGUI()
{
    Vector3 p = new Vector3();
    Camera c = Camera.main;
    Event e = Event.current;
    Vector2 mousePos = new Vector2();
    // Get the mouse position from Event.
    // Note that the y position from Event is inverted.
    mousePos.x = e.mousePosition.x;
    mousePos.y = c.pixelHeight - e.mousePosition.y;
    p = c.ScreenToWorldPoint(new Vector3(mousePos.x, mousePos.y, c.nearClipPlane + 14));
    cursorPos.x = p.x;
    cursorPos.y = p.y;
}
```
At this point the ComputeBuffer particleBuffer has been bound to both the compute shader and the rendering shader.
First, the data structures in the compute shader. Nothing special here.
```
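// Particle data structure
struct Particle
{
    float3 position;  // particle position
    float3 velocity;  // particle velocity
    float life;       // remaining lifetime
};
// Structured buffer that stores and updates particle data; readable and writable on the GPU
RWStructuredBuffer<Particle> particleBuffer;
// Variables set from the CPU
float deltaTime;       // time elapsed since the previous frame
float2 mousePosition;  // current mouse position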
```
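Now a quick look at xorshift, a very handy algorithm for generating random number sequences. It will be used to randomize each particle's direction of motion, so that particles move in random directions in 3D.
- Reference: https://en.wikipedia.org/wiki/Xorshift
- Original paper: https://www.jstatsoft.org/article/view/v008i14
The algorithm was proposed by George Marsaglia in 2003. Its advantages are that it is extremely fast and needs very little state. Even the simplest xorshift implementation has a long period.
The basic operations are shift and exclusive or (xor), hence the name. The core idea is to keep a nonzero state variable and generate random numbers by applying a sequence of shifts and xors to it.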
```
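// State variable for the random number generator
uint rng_state;
uint rand_xorshift() {
    // Xorshift algorithm from George Marsaglia's paper
    rng_state ^= (rng_state << 13);  // shift the state left by 13 bits and xor with the original state
    rng_state ^= (rng_state >> 17);  // shift the updated state right by 17 bits and xor again
    rng_state ^= (rng_state << 5);   // finally, shift left by 5 bits and xor one last time
    return rng_state;                // return the updated state as the random number
}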
```
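The core of basic xorshift is described above; different shift combinations give different variants. The original paper also describes the xorshift128 variant, which keeps 128 bits of state and updates it with a combination of shifts and xors. The code is as follows: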
```
// c language Ver
uint32_t xorshift128(void) {
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123; 
    uint32_t t = x ^ (x << 11);
    x = y; y = z; z = w;
    w = w ^ (w >> 19) ^ (t ^ (t >> 8));
    return w;
}
```
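This variant has a longer period and better statistical properties. Its period is $2^{128}-1$, which is impressive.
Overall, the algorithm is more than adequate for game development; it is just not suitable for cryptography and similar fields.
When using it in a compute shader, note that xorshift produces values across the full uint32 range, so one more mapping is needed, from [0, 2^32-1] to [0, 1]: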
```
float tmp = (1.0 / 4294967296.0);  // conversion factor, 1 / 2^32
float r = float(rand_xorshift()) * tmp;  // r is roughly in [0, 1]
```
Particle directions are signed, so subtract 0.5 from this value to get a range of roughly [-0.5, 0.5]. Random motion along the three axes:
```
float f0 = float(rand_xorshift()) * tmp - 0.5;
float f1 = float(rand_xorshift()) * tmp - 0.5;
float f2 = float(rand_xorshift()) * tmp - 0.5;
float3 normalF3 = normalize(float3(f0, f1, f2)) * 0.8f; // scale the direction of motion
```
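To check the arithmetic outside Unity, here is a minimal CPU-side sketch in Python of the same (13, 17, 5) xorshift step and the mapping to a scaled random direction. The names `xorshift32` and `random_direction` and the zero-length guard are assumptions made for illustration; they do not appear in the project. Python integers are unbounded, so the uint32 wraparound is emulated with a mask.

```python
MASK32 = 0xFFFFFFFF  # emulate uint32 wraparound, since Python ints do not overflow

def xorshift32(state):
    """One xorshift step with the (13, 17, 5) shift triple; state must be nonzero."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state

def random_direction(state, scale=0.8):
    """Draw three values in [-0.5, 0.5), normalize them, and scale the result.

    Returns (direction, new_state) so the caller can keep advancing the state.
    """
    comps = []
    for _ in range(3):
        state = xorshift32(state)
        comps.append(state / 4294967296.0 - 0.5)  # map uint32 to [0, 1), then center
    length = sum(c * c for c in comps) ** 0.5
    # Assumption: guard against a zero-length vector (the shader code has no such guard)
    if length == 0.0:
        return (0.0, 0.0, 0.0), state
    return tuple(scale * c / length for c in comps), state
```

In the compute shader each thread needs its own nonzero seed (for example derived from the thread ID), otherwise every particle would draw the same sequence.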
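Each kernel invocation has to do the following:
- Read the particle's state from the previous frame out of the buffer
- Update the particle (compute its velocity, update its position and life) and write it back to the buffer
- If its life drops below 0, respawn the particle
To respawn a particle, use the xorshift random numbers above for its initial position, set its life, and reset its velocity.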
```
// Set the particle's new position and life
particleBuffer[id].position = float3(normalF3.x + mousePosition.x, normalF3.y + mousePosition.y, normalF3.z + 3.0);
particleBuffer[id].life = 4;  // reset the life
particleBuffer[id].velocity = float3(0,0,0);  // reset the velocity
```
Finally, the basic data structures in the shader:
```
struct Particle{
    float3 position;
    float3 velocity;
    float life;
};
struct v2f{
    float4 position : SV_POSITION;
    float4 color : COLOR;
    float life : LIFE;
    float size: PSIZE;
};
// particles' data
StructuredBuffer<Particle> particleBuffer;
```
Then the vertex shader computes the particle's vertex color and clip-space position, and passes along a point size.
```
v2f vert(uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID){
    v2f o = (v2f)0;
    // Color
    float life = particleBuffer[instance_id].life;
    float lerpVal = life * 0.25f;
    o.color = fixed4(1.0f - lerpVal+0.1, lerpVal+0.1, 1.0f, lerpVal);
    // Position
    o.position = UnityObjectToClipPos(float4(particleBuffer[instance_id].position, 1.0f));
    o.size = _PointSize;
    return o;
}
```
The fragment shader outputs the interpolated color.
```
float4 frag(v2f i) : COLOR{
    return i.color;
}
```
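This produces the effect shown above.
3. Quad Particles
In the previous section each particle was a single point, which is not very interesting. Now turn each point into a quad. Unity has no quad primitive, only a pseudo-quad built from two triangles.
Let's get started, building on the code above. In C#, define the vertex struct and the size of a quad.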
```
// struct
struct Vertex
{
    public Vector3 position;
    public Vector2 uv;
    public float life;
}
const int SIZE_VERTEX = 6 * sizeof(float);
public float quadSize = 0.1f; // size of the quad
```
For each particle, set the UV coordinates of its six vertices for use by the vertex shader, following the vertex order Unity expects.
```
index = i*6;
    //Triangle 1 - bottom-left, top-left, top-right
    vertexArray[index].uv.Set(0,0);
    vertexArray[index+1].uv.Set(0,1);
    vertexArray[index+2].uv.Set(1,1);
    //Triangle 2 - bottom-left, top-right, bottom-right
    vertexArray[index+3].uv.Set(0,0);
    vertexArray[index+4].uv.Set(1,1);
    vertexArray[index+5].uv.Set(1,0);
```
Finally, upload the data to the buffer. halfSize is passed to the compute shader, which uses it to compute the position of each quad vertex.
```
vertexBuffer = new ComputeBuffer(numVertices, SIZE_VERTEX);
vertexBuffer.SetData(vertexArray);
shader.SetBuffer(kernelID, "vertexBuffer", vertexBuffer);
shader.SetFloat("halfSize", quadSize*0.5f);
material.SetBuffer("vertexBuffer", vertexBuffer);
```
In the rendering stage, switch from points to triangles, with six vertices per particle.
```
void OnRenderObject()
{
    material.SetPass(0);
    Graphics.DrawProceduralNow(MeshTopology.Triangles, 6, numParticles);
}
```
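Adjust the shader settings so that it reads the vertex data and samples a texture for display. The texture's alpha has to be handled as well, here with alpha blending.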
```
_MainTex("Texture", 2D) = "white" {}     
...
Tags{ "Queue"="Transparent" "RenderType"="Transparent" "IgnoreProjector"="True" }
LOD 200
Blend SrcAlpha OneMinusSrcAlpha
ZWrite Off
...
    struct Vertex{
        float3 position;
        float2 uv;
        float life;
    };
    StructuredBuffer<Vertex> vertexBuffer;
    sampler2D _MainTex;
    v2f vert(uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID)
    {
        v2f o = (v2f)0;
        int index = instance_id*6 + vertex_id;
        float lerpVal = vertexBuffer[index].life * 0.25f;
        o.color = fixed4(1.0f - lerpVal+0.1, lerpVal+0.1, 1.0f, lerpVal);
        o.position = UnityWorldToClipPos(float4(vertexBuffer[index].position, 1.0f));
        o.uv = vertexBuffer[index].uv;
        return o;
    }
    float4 frag(v2f i) : COLOR
    {
        fixed4 color = tex2D( _MainTex, i.uv ) * i.color;
        return color;
    }
```
In the compute shader, add the vertex buffer and halfSize.
```
struct Vertex
{
    float3 position;
    float2 uv;
    float life;
};
RWStructuredBuffer<Vertex> vertexBuffer;
float halfSize;
```
Compute the positions of the six vertices of each quad.
```
//Set the vertex buffer //
    int index = id.x * 6;
    //Triangle 1 - bottom-left, top-left, top-right   
    vertexBuffer[index].position.x = p.position.x-halfSize;
    vertexBuffer[index].position.y = p.position.y-halfSize;
    vertexBuffer[index].position.z = p.position.z;
    vertexBuffer[index].life = p.life;
    vertexBuffer[index+1].position.x = p.position.x-halfSize;
    vertexBuffer[index+1].position.y = p.position.y+halfSize;
    vertexBuffer[index+1].position.z = p.position.z;
    vertexBuffer[index+1].life = p.life;
    vertexBuffer[index+2].position.x = p.position.x+halfSize;
    vertexBuffer[index+2].position.y = p.position.y+halfSize;
    vertexBuffer[index+2].position.z = p.position.z;
    vertexBuffer[index+2].life = p.life;
    //Triangle 2 - bottom-left, top-right, bottom-right  // // 
    vertexBuffer[index+3].position.x = p.position.x-halfSize;
    vertexBuffer[index+3].position.y = p.position.y-halfSize;
    vertexBuffer[index+3].position.z = p.position.z;
    vertexBuffer[index+3].life = p.life;
    vertexBuffer[index+4].position.x = p.position.x+halfSize;
    vertexBuffer[index+4].position.y = p.position.y+halfSize;
    vertexBuffer[index+4].position.z = p.position.z;
    vertexBuffer[index+4].life = p.life;
    vertexBuffer[index+5].position.x = p.position.x+halfSize;
    vertexBuffer[index+5].position.y = p.position.y-halfSize;
    vertexBuffer[index+5].position.z = p.position.z;
    vertexBuffer[index+5].life = p.life;
```
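The 24 assignments above follow one pattern: offset the particle center by plus or minus halfSize in x and y, in the same corner order used for the UVs. A compact Python sketch of that pattern follows; the function name `quad_vertices` is hypothetical and the quad is assumed to be axis-aligned in the XY plane, as in the kernel.

```python
def quad_vertices(center, half_size):
    """Return six (position, uv) pairs for an axis-aligned quad in the XY plane."""
    cx, cy, cz = center
    # The four corners, each paired with the UV it receives on the C# side
    bl = ((cx - half_size, cy - half_size, cz), (0, 0))  # bottom-left
    tl = ((cx - half_size, cy + half_size, cz), (0, 1))  # top-left
    tr = ((cx + half_size, cy + half_size, cz), (1, 1))  # top-right
    br = ((cx + half_size, cy - half_size, cz), (1, 0))  # bottom-right
    # Triangle 1: bl, tl, tr; triangle 2: bl, tr, br
    return [bl, tl, tr, bl, tr, br]
```

Keeping the corner order identical on the CPU (UVs) and GPU (positions) is what makes the texture map correctly onto each quad.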
Done.
Code for this version:
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Shaders/QuadParticles.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Scripts/QuadParticles.cs
- Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Shaders/QuadParticle.shader
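The next section upgrades the mesh to a prefab and tries to simulate the flocking behavior of birds in flight.
4. Flocking Simulation
Flocking is an algorithm that simulates the collective motion of animals such as flocks of birds and schools of fish. It is built on three basic behavioral rules, proposed by Craig Reynolds at SIGGRAPH '87 and commonly known as the "Boids" algorithm:
- Separation: individuals must not get too close to one another; they need a sense of personal space. Concretely, find the individuals within a certain radius and compute a direction that avoids collision.
- Alignment: an individual's velocity tends toward the group's average velocity; a sense of belonging. Concretely, compute the average velocity (magnitude and direction) of the individuals within visual range. The visual range should depend on the actual biology of the bird, as mentioned in the next section.
- Cohesion: an individual's position tends toward the average position (the center of the group); a sense of safety. Concretely, each individual finds the geometric center of its neighbors and computes a movement vector toward it.
Which of these three rules is the hardest to implement?
Answer: separation. Computing collisions between objects is notoriously hard, because every individual has to compare its distance with every other individual, giving a time complexity close to O(n^2), where n is the number of particles. With 1000 particles, for example, each iteration may need nearly 500,000 distance computations. In the original paper, the unoptimized algorithm (O(N^2)) took 95 seconds to render one frame with 80 birds, and a 300-frame animation took nearly 9 hours.
In general, spatial partitioning such as a quadtree or spatial hashing can speed this up. Another option is to maintain a neighbor list that stores the individuals within a certain distance of each individual. And of course, one can simply brute-force it with a compute shader.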
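To make the spatial hashing idea concrete, here is a minimal Python sketch of a uniform grid. It is not part of the project, which takes the brute-force route; the names `build_grid` and `neighbours` are hypothetical, and the cell size is assumed to equal the neighbor radius so that only the 27 surrounding cells need checking.

```python
from collections import defaultdict
from itertools import product

def build_grid(positions, cell_size):
    """Map each integer cell coordinate to the indices of the points inside it."""
    grid = defaultdict(list)
    for i, p in enumerate(positions):
        cell = tuple(int(c // cell_size) for c in p)
        grid[cell].append(i)
    return grid

def neighbours(i, positions, grid, radius):
    """Indices strictly within `radius` of point i.

    Assumes the grid was built with cell_size == radius, so every neighbour
    lies in the point's own cell or one of the 26 adjacent cells.
    """
    p = positions[i]
    cell = tuple(int(c // radius) for c in p)
    found = []
    for offset in product((-1, 0, 1), repeat=3):
        key = tuple(c + o for c, o in zip(cell, offset))
        for j in grid.get(key, ()):
            dist_sq = sum((a - b) ** 2 for a, b in zip(p, positions[j]))
            if j != i and dist_sq < radius * radius:
                found.append(j)
    return sorted(found)
```

Each query now touches only nearby points instead of all n, at the cost of rebuilding the grid whenever the points move.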
Enough talk; let's get started.
First download the starter project files (if you have not already):
- Bird prefab: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Prefabs/Boid.prefab
- Script: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Scripts/SimpleFlocking.cs
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Shaders/SimpleFlocking.compute
Then add them to an empty GameObject.
Run the project and you will see a bunch of birds.
Below are the parameters of the flocking simulation.
```
// Parameters of the flocking simulation.
    public float rotationSpeed = 1f; // rotation speed
    public float boidSpeed = 1f; // boid speed
    public float neighbourDistance = 1f; // neighbour distance
    public float boidSpeedVariation = 1f; // speed variation
    public GameObject boidPrefab; // prefab of the boid object
    public int boidsCount; // number of boids
    public float spawnRadius; // radius within which boids are spawned
    public Transform target; // movement target of the flock
```
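Apart from the boid prefab boidPrefab and the spawn radius spawnRadius, all of these need to be passed to the GPU.
For convenience, this section takes a naive shortcut: the GPU computes only the birds' positions and directions, which are read back to the CPU and processed as follows: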
```
...
boidsBuffer.GetData(boidsArray);
// Update the position and orientation of each bird
for (int i = 0; i < boidsArray.Length; i++){
    boids[i].transform.localPosition = boidsArray[i].position;
    if (!boidsArray[i].direction.Equals(Vector3.zero)){
        boids[i].transform.rotation = Quaternion.LookRotation(boidsArray[i].direction);
    }
}
```
Quaternion.LookRotation() creates a rotation that makes the object face the given direction.
The compute shader calculates the position of each bird.
```
#pragma kernel CSMain
#define GROUP_SIZE 256    
struct Boid{
    float3 position;
    float3 direction;
};
RWStructuredBuffer<Boid> boidsBuffer;
float time;
float deltaTime;
float rotationSpeed;
float boidSpeed;
float boidSpeedVariation;
float3 flockPosition;
float neighbourDistance;
int boidsCount;
[numthreads(GROUP_SIZE,1,1)]
void CSMain (uint3 id : SV_DispatchThreadID){ ... // continued below }
```
Start with the alignment and cohesion logic, and write the resulting position and direction back to the buffer.
```
Boid boid = boidsBuffer[id.x];
    float3 separation = 0; // separation
    float3 alignment = 0; // alignment (direction)
    float3 cohesion = flockPosition; // cohesion (position)
    uint nearbyCount = 1; // count the boid itself as a neighbour
    for (int i=0; i<boidsCount; i++)
    {
        if(i!=(int)id.x) // skip itself
        {
            Boid temp = boidsBuffer[i];
            // accumulate the individuals within range
            if(distance(boid.position, temp.position)< neighbourDistance){
                alignment += temp.direction;
                cohesion += temp.position;
                nearbyCount++;
            }
        }
    }
    float avg = 1.0 / nearbyCount;
    alignment *= avg;
    cohesion *= avg;
    cohesion = normalize(cohesion-boid.position);
    // combine into a single movement direction
    float3 direction = alignment + separation + cohesion;
    // smooth the turn and update the position
    boid.direction = lerp(direction, normalize(boid.direction), 0.94);
    // deltaTime keeps the movement speed independent of frame rate
    boid.position += boid.direction * boidSpeed * deltaTime;
    boidsBuffer[id.x] = boid;
```
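This is what happens without a sense of personal space (the separation term): all the individuals get very intimate and end up overlapping.
Add the following code.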
```
if(distance(boid.position, temp.position)< neighbourDistance)
{
    float3 offset = boid.position - temp.position;
    float dist = length(offset);
    if(dist < neighbourDistance)
    {
        dist = max(dist, 0.000001);
        separation += offset * (1.0/dist - 1.0/neighbourDistance);
    }
    ...
```
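1.0/dist grows as two boids get closer, meaning the separation force should be stronger. 1.0/neighbourDistance is a constant based on the configured neighbor distance. Their difference is how strongly the separation force responds to distance. If two boids are exactly neighbourDistance apart, the value is zero (no separation force). If they are closer than neighbourDistance, the value is positive, and it grows as the distance shrinks.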
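The behavior of the weight 1.0/dist - 1.0/neighbourDistance is easy to verify numerically. Here is a small Python sketch of the contribution of a single neighbor; the function name `separation_term` is hypothetical, and the early return outside the radius mirrors the `if` in the shader.

```python
def separation_term(boid_pos, other_pos, neighbour_distance):
    """Contribution of one neighbour to the separation vector (zero outside the radius)."""
    offset = tuple(a - b for a, b in zip(boid_pos, other_pos))
    dist = sum(c * c for c in offset) ** 0.5
    if dist >= neighbour_distance:
        return (0.0, 0.0, 0.0)
    dist = max(dist, 1e-6)  # avoid division by zero, as in the shader
    weight = 1.0 / dist - 1.0 / neighbour_distance
    return tuple(c * weight for c in offset)
```

With neighbour_distance = 1, a neighbor at distance 0.5 gives a weight of 1 and one at distance 0.25 gives a weight of 3, so the push away from a neighbor grows quickly as it approaches.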
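Current code: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Flocking/Assets/Shaders/SimpleFlocking.compute
The next section switches to an instanced mesh to improve performance.
5. GPU Instancing Optimization
First, a recap of this chapter. In both the "Hello Particle" and "Quad Particles" examples we used instanced drawing (Graphics.DrawProceduralNow()) to pass the particle positions computed in the compute shader straight to the vertex/fragment shader.
This section uses DrawMeshInstancedIndirect, which draws a large number of instances of the same mesh that differ only in position, rotation, or other parameters. Whereas DrawProceduralNow regenerates and renders the geometry every frame, DrawMeshInstancedIndirect only needs the instance information set up once, after which the GPU renders all instances in one go. It is the function to reach for when rendering grass or groups of animals.
The function takes many parameters; only some of them are used here.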
```
Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer);
```
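1. boidMesh: the bird mesh.
2. subMeshIndex: index of the submesh to draw; usually 0 when the mesh has a single submesh.
3. boidMaterial: the material applied to the instances.
4. bounds: the bounding volume that encloses all the instances. Unity uses it to cull the draw as a whole, so it should cover the region the instances can occupy.
5. argsBuffer: a ComputeBuffer of draw arguments, including the index count per instance and the number of instances.
What is argsBuffer? It tells Unity how much of the mesh to render and how many instances of it to draw. A special kind of buffer is passed in for this purpose.
When initializing the shader, create a buffer flagged ComputeBufferType.IndirectArguments. This kind of buffer is meant to be handed to the GPU so that indirect draw commands can be executed there. The first argument of new ComputeBuffer is 1, meaning one args array (an array of 5 uints); do not misread it.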
```
ComputeBuffer argsBuffer;
...
argsBuffer = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
if (boidMesh != null)
{
    args[0] = (uint)boidMesh.GetIndexCount(0);
    args[1] = (uint)numOfBoids;
}
argsBuffer.SetData(args);
...
Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer);
```
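For reference, the five uints are, in order: index count per instance, instance count, start index location, base vertex location, and start instance location. The snippet above sets only the first two and leaves the rest at zero. A minimal Python sketch of the byte layout follows; the function name `make_indirect_args` and its parameter names are hypothetical, chosen to describe each slot.

```python
import struct

def make_indirect_args(index_count, instance_count,
                       start_index=0, base_vertex=0, start_instance=0):
    """Pack the five uint32 draw arguments in the order an indexed indirect draw reads them."""
    return struct.pack("<5I", index_count, instance_count,
                       start_index, base_vertex, start_instance)
```

The result is 20 bytes, which matches the `5 * sizeof(uint)` stride passed to the ComputeBuffer constructor.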
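Building on the previous section, each boid's data structure gains an offset, which the compute shader uses to offset the noise applied to it. In addition, the initial direction is interpolated with Slerp: 70% the original direction and 30% random. Slerp returns a quaternion, which is converted to Euler angles before being passed to the constructor.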
```
public float noise_offset;
...
Quaternion rot = Quaternion.Slerp(transform.rotation, Random.rotation, 0.3f);
boidsArray[i] = new Boid(pos, rot.eulerAngles, offset);
```
After the new noise_offset property is passed to the compute shader, a noise value intended to lie in [-1, 1] is computed and applied to the bird's speed.
```
float noise = clamp(noise1(time / 100.0 + boid.noise_offset), -1, 1) * 2.0 - 1.0;
float velocity = boidSpeed * (1.0 + noise * boidSpeedVariation);
```
The algorithm is then tidied up slightly. The compute shader is otherwise largely unchanged.
```
if (distance(boid_pos, boidsBuffer[i].position) < neighbourDistance)
{
    float3 tempBoid_position = boidsBuffer[i].position;
    float3 offset = boid.position - tempBoid_position;
    float dist = length(offset);
    if (dist<neighbourDistance){
        dist = max(dist, 0.000001);//Avoid division by zero
        separation += offset * (1.0/dist - 1.0/neighbourDistance);
    }
    alignment += boidsBuffer[i].direction;
    cohesion += tempBoid_position;
    nearbyCount += 1;
}
```
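The biggest difference is in the shader. This section uses a surface shader instead of a hand-written fragment shader. A surface shader is essentially a pre-packaged vertex and fragment shader: Unity takes care of lighting, shadows, and other tedious work, and you can still supply your own vertex function.
When writing the shader for the material, instanced objects need special handling. For ordinary rendered objects, position, rotation, and other properties are supplied by Unity per object. For the instances built here, position, rotation, and other parameters change constantly, so the rendering pipeline needs a dedicated mechanism to set them for each instance on the fly. The current approach is procedural instancing, which renders all instances at once instead of drawing them one by one, in other words a single batched draw.
The shader applies the instancing setup as follows. The instancing setup function runs before vert, so each instance gets its own rotation, translation, and scale matrix.
A rotation matrix now has to be built for each instance. The buffer provides the basic bird data computed by the compute shader (in the previous section this data was read back to the CPU; here it goes directly to the shader for instancing):
In the shader, wrap the buffer's data structure and the related operations in the following macro.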
```
// .shader
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
struct Boid
{
    float3 position;
    float3 direction;
    float noise_offset;
};
StructuredBuffer<Boid> boidsBuffer; 
#endif
```
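Since the number of instances (the number of birds, which is also the buffer size) is specified only in args[1] of the DrawMeshInstancedIndirect call in C#, the buffer can simply be indexed with unity_InstanceID.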
```
#pragma instancing_options procedural:setup
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _BoidPosition = boidsBuffer[unity_InstanceID].position;
        _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
}
```
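Computing this transformation matrix involves homogeneous coordinates; the GAMES101 course is a good refresher. A point is (x,y,z,1) and a vector is (x,y,z,0).
Using an affine transformation written as a rotation matrix followed by a separate translation, the code looks like this: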
```
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    _BoidPosition = boidsBuffer[unity_InstanceID].position;
    _LookAtMatrix = look_at_matrix(boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
}
 void vert(inout appdata_full v, out Input data)
{
    UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    v.vertex = mul(_LookAtMatrix, v.vertex);
    v.vertex.xyz += _BoidPosition;
    #endif
}
```
That is not very elegant. Instead, use homogeneous coordinates directly: a single matrix handles rotation, translation, and scale.
```
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    _BoidPosition = boidsBuffer[unity_InstanceID].position;
    _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
}
 void vert(inout appdata_full v, out Input data)
{
    UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    v.vertex = mul(_Matrix, v.vertex);
    #endif
}
```
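The body of create_matrix is not shown in this article. As a rough sketch only, one common way to build such a matrix is a look-at basis with the position in the fourth column; the Python below assumes that construction, and the helper names `normalize`, `cross`, and `mul` are illustrative. The project's actual implementation may differ.

```python
def normalize(v):
    n = sum(c * c for c in v) ** 0.5
    return tuple(c / n for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def create_matrix(position, direction, up):
    """Assumed construction: rotation columns (x, y, z) plus a translation column."""
    z = normalize(direction)        # the model's forward axis points along direction
    x = normalize(cross(up, z))     # right axis; undefined if direction is parallel to up
    y = cross(z, x)                 # recomputed up axis, orthogonal to both
    return [
        [x[0], y[0], z[0], position[0]],
        [x[1], y[1], z[1], position[1]],
        [x[2], y[2], z[2], position[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]

def mul(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))
```

Multiplying a point (w = 1) applies both rotation and translation, while a vector (w = 0) is only rotated, which is exactly the point/vector distinction in homogeneous coordinates.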
And with that, it is done. The frame rate is nearly double that of the previous section.
Code for this version:
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Shaders/InstancedFlocking.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Scripts/InstancedFlocking.cs
- Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Shaders/InstancedFlocking.shader
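6. Applying Skinned Animation
This section uses the Animator component to capture the mesh at each keyframe into a buffer before instancing. Selecting a different index yields the mesh in a different pose. Authoring the skeletal animation itself is outside the scope of this article.
Only the code from the previous section needs modifying, adding the Animator logic. The comments below explain the details.
The per-boid data structure is also updated: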
```
struct Boid{
    float3 position;
    float3 direction;
    float noise_offset;
    float speed; // unused for now
    float frame; // current frame index within the animation
    float3 padding; // keeps the struct size aligned
};
```
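A closer look at the alignment. The size of a struct should ideally be a multiple of 16 bytes.
- float3 position; (12 bytes)
- float3 direction; (12 bytes)
- float noise_offset; (4 bytes)
- float speed; (4 bytes)
- float frame; (4 bytes)
- float3 padding; (12 bytes)
Without the padding the size is 36 bytes, which is not a common alignment size. With the padding it comes to 48 bytes, a multiple of 16.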
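The byte counts can be double-checked with a few lines of Python, treating the struct as a flat run of 4-byte floats. The constant and function names below are made up for this check.

```python
import struct

# position (3) + direction (3) + noise_offset + speed + frame = 9 floats
UNPADDED_SIZE = struct.calcsize("<9f")
# the same fields plus the float3 padding = 12 floats
PADDED_SIZE = struct.calcsize("<12f")

def is_16_byte_multiple(size):
    """True when a struct of this size packs into whole 16-byte rows."""
    return size % 16 == 0
```

The padded size is also the stride that the C# side must pass when creating the ComputeBuffer, so both sides have to agree on the 12-float layout.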
```
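private SkinnedMeshRenderer boidSMR; // reference to the SkinnedMeshRenderer that holds the skinned mesh
private Animator animator;
public AnimationClip animationClip; // the animation clip, used to compute animation-related parameters
private int numOfFrames; // number of frames in the animation; determines how many frames are stored in the GPU buffer
public float boidFrameSpeed = 10f; // animation playback speed
MaterialPropertyBlock props; // passes parameters to the shader without creating a new material instance, so properties can change without affecting other objects that use the same material
Mesh boidMesh; // mesh data taken from the SkinnedMeshRenderer
...
void Start(){ // initialize the boid data, call GenerateSkinnedAnimationForGPUBuffer to prepare the animation data, then call InitShader to set the shader parameters needed for rendering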
    ...
    // This property block is used only for avoiding an instancing bug.
    props = new MaterialPropertyBlock();
    props.SetFloat("_UniqueID", Random.value);
    ...
    InitBoids();
    GenerateSkinnedAnimationForGPUBuffer();
    InitShader();
}
void InitShader(){ // configures the shader and material properties so that each instance shows the correct animation frame; frameInterpolation toggles interpolation between frames for smoother animation
    ...
    if (boidMesh)//Set by the GenerateSkinnedAnimationForGPUBuffer
    ...
    shader.SetFloat("boidFrameSpeed", boidFrameSpeed);
    shader.SetInt("numOfFrames", numOfFrames);
    boidMaterial.SetInt("numOfFrames", numOfFrames);
    if (frameInterpolation && !boidMaterial.IsKeywordEnabled("FRAME_INTERPOLATION"))
    boidMaterial.EnableKeyword("FRAME_INTERPOLATION");
    if (!frameInterpolation && boidMaterial.IsKeywordEnabled("FRAME_INTERPOLATION"))
    boidMaterial.DisableKeyword("FRAME_INTERPOLATION");
}
void Update(){
    ...
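    // The last two arguments:
        // 1. 0: byte offset into the args buffer at which to start reading the arguments
        // 2. props: the MaterialPropertyBlock created earlier, holding properties shared by all instances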
    Graphics.DrawMeshInstancedIndirect( boidMesh, 0, boidMaterial, bounds, argsBuffer, 0, props);
}
void OnDestroy(){ 
    ...
    if (vertexAnimationBuffer != null) vertexAnimationBuffer.Release();
}
private void GenerateSkinnedAnimationForGPUBuffer()
{
    ... // continued below
}
```
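To give the shader a mesh in a different pose at different times, GenerateSkinnedAnimationForGPUBuffer() extracts the mesh vertex data of every frame from the Animator and SkinnedMeshRenderer and stores it in a ComputeBuffer on the GPU for use during instanced rendering.
GetCurrentAnimatorStateInfo returns the state information of the current animation layer, used later to control playback precisely.
numOfFrames is the power of two closest to the product of the animation length and the frame rate, which can help GPU memory access.
Then a ComputeBuffer, vertexAnimationBuffer, is created to store all vertices of all frames.
The for loop bakes every animation frame. At each sampleTime it plays the animation and updates immediately, then bakes the current frame's mesh into bakedMesh. The freshly baked vertices are copied into the vertexAnimationData array, which is finally uploaded to the GPU.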
```
// ...continued from above
boidSMR = boidObject.GetComponentInChildren<SkinnedMeshRenderer>();
boidMesh = boidSMR.sharedMesh;
animator = boidObject.GetComponentInChildren<Animator>();
int iLayer = 0;
AnimatorStateInfo aniStateInfo = animator.GetCurrentAnimatorStateInfo(iLayer);
Mesh bakedMesh = new Mesh();
float sampleTime = 0;
float perFrameTime = 0;
numOfFrames = Mathf.ClosestPowerOfTwo((int)(animationClip.frameRate * animationClip.length));
perFrameTime = animationClip.length / numOfFrames;
var vertexCount = boidSMR.sharedMesh.vertexCount;
vertexAnimationBuffer = new ComputeBuffer(vertexCount * numOfFrames, 16);
Vector4[] vertexAnimationData = new Vector4[vertexCount * numOfFrames];
for (int i = 0; i < numOfFrames; i++)
{
    animator.Play(aniStateInfo.shortNameHash, iLayer, sampleTime);
    animator.Update(0f);
    boidSMR.BakeMesh(bakedMesh);
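    Vector3[] bakedVertices = bakedMesh.vertices; // the property copies the array on every access, so fetch it once per frame
    for(int j = 0; j < vertexCount; j++)
    {
        Vector4 vertex = bakedVertices[j];
        vertex.w = 1;
        vertexAnimationData[(j * numOfFrames) +  i] = vertex;
    }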
    sampleTime += perFrameTime;
}
vertexAnimationBuffer.SetData(vertexAnimationData);
boidMaterial.SetBuffer("vertexAnimation", vertexAnimationBuffer);
boidObject.SetActive(false);
```
In the compute shader, maintain the frame variable stored in each boid's data structure.
```
boid.frame = boid.frame + velocity * deltaTime * boidFrameSpeed;
if (boid.frame >= numOfFrames) boid.frame -= numOfFrames;
```
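In the shader, lerp between animation frames. On the left is the result without frame interpolation and on the right with it; the difference is very noticeable.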
```
void vert(inout appdata_custom v)
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        #ifdef FRAME_INTERPOLATION
            v.vertex = lerp(vertexAnimation[v.id * numOfFrames + _CurrentFrame], vertexAnimation[v.id * numOfFrames + _NextFrame], _FrameInterpolation);
        #else
            v.vertex = vertexAnimation[v.id * numOfFrames + _CurrentFrame];
        #endif
        v.vertex = mul(_Matrix, v.vertex);
    #endif
}
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
        _CurrentFrame = boidsBuffer[unity_InstanceID].frame;
        #ifdef FRAME_INTERPOLATION
            _NextFrame = _CurrentFrame + 1;
            if (_NextFrame >= numOfFrames) _NextFrame = 0;
            _FrameInterpolation = frac(boidsBuffer[unity_InstanceID].frame);
        #endif
    #endif
}
```
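The indexing and wrap-around logic can be tried on the CPU. Below is a minimal Python sketch assuming the same buffer layout as the baking loop, vertex_id * num_frames + frame; the function name `sample_vertex` is hypothetical.

```python
def sample_vertex(vertex_animation, vertex_id, frame, num_frames):
    """Interpolate one vertex between the current frame and the next, wrapping at the end."""
    current = int(frame)          # integer part selects the current frame
    nxt = current + 1
    if nxt >= num_frames:         # loop back to the first frame
        nxt = 0
    t = frame - current           # fractional part is the blend factor
    a = vertex_animation[vertex_id * num_frames + current]
    b = vertex_animation[vertex_id * num_frames + nxt]
    return tuple(pa + (pb - pa) * t for pa, pb in zip(a, b))
```

Because all frames of one vertex are stored contiguously, the two samples needed for interpolation are adjacent in memory except at the wrap-around.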
That was not easy, but it is finally complete.
Full project: https://github.com/Remyuu/Unity-Compute-Shader-Learn/tree/L4_Skinned/Assets/Scripts
8. Summary / Quiz
When rendering points which gives the best answer?
What are the three key steps in flocking?
When creating an arguments buffer for DrawMeshInstancedIndirect, how many uints are required?
We created the wing flapping by using a skinned mesh shader. True or False.
In a shader used by DrawMeshInstancedIndirect, which variable name gives the correct index for the instance?
References
1. https://en.wikipedia.org/wiki/Boids
2. Flocks, Herds, and Schools: A Distributed Behavioral Model
2024-05-28
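Compute Shader Study Notes (2): Post-Processing Effects
Preface
A first look at compute shaders, implementing a few simple effects. All the code is at:
https://github.com/Remyuu/Unity-Compute-Shader-Learn
The main branch holds the starting code; you can download the full project and type along. PS: every version of the code has its own branch.
This article covers how to use compute shaders to build:
- Post-processing effects
- Particle systems
The previous article did not cover GPU architecture, because opening with a pile of terminology would have been hard to follow. With some hands-on experience writing compute shaders, the abstract concepts can now be tied to actual code.
Program execution on the GPU under CUDA can be described as a three-level hierarchy: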
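- Grid – corresponds to one kernel
- |-Block – a grid contains multiple blocks, all running the same program
- | |-Thread – the most basic unit of computation on the GPU
Threads are the GPU's basic unit, and different threads naturally need to exchange information. To support large numbers of parallel threads efficiently and to meet their need to exchange data, memory is organized into several levels. From the storage point of view there are likewise three levels:
- Per-thread memory – private to one thread; access takes about one clock cycle (under 1 ns), which can be hundreds of times faster than global memory.
- Shared memory – shared within one block; much faster than global memory.
- Global memory – shared by all threads, but the slowest, and usually the bottleneck on a GPU. The Volta architecture uses HBM2 as device global memory, while Turing uses GDDR6.
Data that exceeds the capacity of one level is pushed to a larger but slower level.
Shared memory and the L1 cache share the same physical storage but differ in function: the former is managed manually, the latter automatically by the hardware. My understanding is that shared memory works like a programmable L1 cache.
In NVIDIA's CUDA architecture, a streaming multiprocessor (SM) is a processing unit on the GPU that executes the threads of the thread blocks assigned to it. Stream processors, also called CUDA cores, are the processing elements inside an SM, and each one can work on multiple threads. In short: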
- GPU -> Multi-Processors (SMs) -> Stream Processors
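That is, a GPU contains multiple SMs (multiprocessors), and each SM contains multiple stream processors. Each stream processor executes the instructions of one or more threads.
On the GPU, a thread is the smallest unit of computation, and a warp is the basic unit of scheduling and execution in CUDA.
In NVIDIA's CUDA architecture a warp contains 32 threads (AMD's equivalent, the wavefront, traditionally contains 64). A block is a group of threads, and in CUDA a block can contain multiple warps. A kernel is a function executed on the GPU; think of it as a piece of code that all active threads run in parallel. In short: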
- Kernel -> Grid -> Blocks -> Warps -> Threads
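In everyday development, however, the number of threads that need to run at once is usually far more than 32.
To bridge this mismatch between software needs and hardware architecture, the GPU groups the threads of a block into warps, each containing a fixed number of threads. When a block has more threads than one warp can hold, the GPU schedules additional warps. The principle is that no thread is left out, even if that means launching more warps.
For example, if a block has 128 threads and the graphics card wears a leather jacket (NVIDIA, 32 threads per warp), the block has 128/32 = 4 warps. As an extreme example, with 129 threads there will be 5 warps, and 31 thread slots sit idle. So when writing a compute shader, the product a*b*c in [numthreads(a,b,c)] should ideally be a multiple of 32 to avoid wasting CUDA cores.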
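The warp arithmetic in this example is a ceiling division. A tiny Python sketch, with a hypothetical function name and the warp size as a parameter so the same check works for 64-wide hardware:

```python
def warps_needed(threads_per_group, warp_size=32):
    """Return (number of warps, idle thread slots) for one thread group."""
    warps = -(-threads_per_group // warp_size)  # ceiling division
    idle = warps * warp_size - threads_per_group
    return warps, idle
```

For [numthreads(8, 8, 1)], used by the kernels in this article, the group has 64 threads, which fills exactly two 32-wide warps with no idle slots.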
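By this point things may feel confusing. I drew a diagram based on my own understanding; please point out any mistakes.
L3 Post-Processing Effects
The current build targets the built-in render pipeline (BIRP); only a few places need changing for SRP.
The key to this chapter is an abstract base class that manages the resources a compute shader needs (section 1). On top of it, a few simple post-processing effects are written, such as Gaussian blur, grayscale, low-resolution pixelation, and night vision. A short summary of what this chapter covers:
- Getting and processing the camera's render texture
- The ExecuteInEditMode attribute
- SystemInfo.supportsComputeShaders to check for system support
- Using Graphics.Blit() (blit is short for bit block transfer)
- Building various effects with smoothstep()
- Passing data between kernels with the shared keyword
1. Introduction and Preparation
A post-processing effect needs two textures, one read-only and one read-write. As for where the textures come from: since this is post-processing, they come from the camera, that is, the Target Texture on the Camera component.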
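- Source: read-only
- Destination: read-write, used for the final output
Since several post-processing effects will be implemented later, factor out a base class to reduce the work down the line.
The base class encapsulates the following:
- Initializing resources (creating textures, buffers, and so on)
- Managing resources (for example, recreating buffers when the screen resolution changes)
- Hardware checks (whether the current device supports compute shaders)
Full code for the abstract class: https://pastebin.com/9pYvHHsh
First, OnEnable() is called when the script instance is enabled or attached to an active GameObject. Initialization goes here: check hardware support, check that the compute shader is assigned in the Inspector, look up the named kernel, get the Camera component on the current GameObject, create the textures, and set the initialized flag to true.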
```
if (!SystemInfo.supportsComputeShaders)
    ...
if (!shader)
    ...
kernelHandle = shader.FindKernel(kernelName);
thisCamera = GetComponent<Camera>();
if (!thisCamera)
    ...
CreateTextures();
init = true;
```
CreateTextures() creates two textures, one source and one destination, sized to the camera resolution.
```
texSize.x = thisCamera.pixelWidth;
texSize.y = thisCamera.pixelHeight;
if (shader)
{
    uint x, y;
    shader.GetKernelThreadGroupSizes(kernelHandle, out x, out y, out _);
    groupSize.x = Mathf.CeilToInt((float)texSize.x / (float)x);
    groupSize.y = Mathf.CeilToInt((float)texSize.y / (float)y);
}
CreateTexture(ref output);
CreateTexture(ref renderedSource);
shader.SetTexture(kernelHandle, "source", renderedSource);
shader.SetTexture(kernelHandle, "outputrt", output);
```
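The group count above is a ceiling division of the texture size by the kernel's thread-group size, so that every pixel is covered by a thread. A short Python sketch of that calculation follows; the function name `thread_groups` is hypothetical and the 8x8 default matches the [numthreads(8, 8, 1)] kernels used in this article.

```python
import math

def thread_groups(tex_width, tex_height, threads_x=8, threads_y=8):
    """Number of thread groups to dispatch in x and y so the whole texture is covered."""
    return math.ceil(tex_width / threads_x), math.ceil(tex_height / threads_y)
```

When the resolution is not a multiple of the group size, the last group in that dimension extends past the texture edge, so some threads have no pixel to process.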
The texture creation itself:
```
protected void CreateTexture(ref RenderTexture textureToMake, int divide=1)
{
    textureToMake = new RenderTexture(texSize.x/divide, texSize.y/divide, 0);
    textureToMake.enableRandomWrite = true;
    textureToMake.Create();
}
```
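That completes initialization. When the camera has finished rendering the scene and is about to present it to the screen, Unity calls OnRenderImage(), which is where the compute shader is invoked. If initialization has not finished or no shader is assigned, simply Blit source to destination, which leaves the image unchanged. CheckResolution(out _) checks whether the render texture resolution needs updating and, if so, recreates the textures. After that comes the familiar Dispatch stage: the source texture is copied into the texture bound to the compute shader, and once the computation finishes the result is copied to destination.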
```
protected virtual void OnRenderImage(RenderTexture source, RenderTexture destination)
{
    if (!init || shader == null)
    {
        Graphics.Blit(source, destination);
    }
    else
    {
        CheckResolution(out _);
        DispatchWithSource(ref source, ref destination);
    }
}
```
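Note that there are no SetData() or GetData() calls here. All the data already lives on the GPU, so the GPU can be told to produce and consume it on its own without involving the CPU. Reading the texture back to main memory and uploading it again would perform terribly.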
```
protected virtual void DispatchWithSource(ref RenderTexture source, ref RenderTexture destination)
{
    Graphics.Blit(source, renderedSource);
    shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
    Graphics.Blit(output, destination);
}
```
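I did not believe it and tried the round trip through the CPU anyway. The result was striking: performance was more than 4x worse. Minimizing communication between the CPU and the GPU is therefore a major concern when using compute shaders.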
```
// The naive approach
protected virtual void DispatchWithSource(ref RenderTexture source, ref RenderTexture destination)
{
    // Blit the source texture into the texture used for processing
    Graphics.Blit(source, renderedSource);
    // Process the texture with the compute shader
    shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
    // Copy the output texture into a Texture2D so the data can be read on the CPU
    Texture2D tempTexture = new Texture2D(renderedSource.width, renderedSource.height, TextureFormat.RGBA32, false);
    RenderTexture.active = output;
    tempTexture.ReadPixels(new Rect(0, 0, output.width, output.height), 0, 0);
    tempTexture.Apply();
    RenderTexture.active = null;
    // Upload the Texture2D data back to the GPU into a new RenderTexture
    RenderTexture tempRenderTexture = RenderTexture.GetTemporary(output.width, output.height);
    Graphics.Blit(tempTexture, tempRenderTexture);
    // Finally, Blit the processed texture to the destination
    Graphics.Blit(tempRenderTexture, destination);
    // Release resources
    RenderTexture.ReleaseTemporary(tempRenderTexture);
    Destroy(tempTexture);
}
```
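Next, write the first post-processing effect.
Aside: a strange bug
A quick aside about an odd bug.
In a compute shader, if the final output texture is named output, some APIs such as Metal run into problems. The fix is to rename it.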
```
RWTexture2D<float4> outputrt;
```
2. RingHighlight Effect
Create a RingHighlight class that inherits from the base class written above.
Override the initialization method and specify the kernel.
```
protected override void Init()
{
    center = new Vector4();
    kernelName = "Highlight";
    base.Init();
}
```
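Override the render method. To focus on a particular character, pass the character's screen-space coordinates, center, to the compute shader. If the screen resolution has changed before Dispatch, set the properties again.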
```
protected void SetProperties()
{
    float rad = (radius / 100.0f) * texSize.y;
    shader.SetFloat("radius", rad);
    shader.SetFloat("edgeWidth", rad * softenEdge / 100.0f);
    shader.SetFloat("shade", shade);
}
protected override void OnRenderImage(RenderTexture source, RenderTexture destination)
{
    if (!init || shader == null)
    {
        Graphics.Blit(source, destination);
    }
    else
    {
        if (trackedObject && thisCamera)
        {
            Vector3 pos = thisCamera.WorldToScreenPoint(trackedObject.position);
            center.x = pos.x;
            center.y = pos.y;
            shader.SetVector("center", center);
        }
        bool resChange = false;
        CheckResolution(out resChange);
        if (resChange) SetProperties();
        DispatchWithSource(ref source, ref destination);
    }
}
```
To see parameter changes live while editing in the Inspector, add an OnValidate() method.
```
private void OnValidate()
{
    if(!init)
        Init();
    SetProperties();
}
```
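On the GPU, how do you produce an effect with no shading inside a circle, a smooth transition at the circle's edge, and shading outside the transition band? Building on the point-in-circle test from the previous article, use smoothstep() to handle the transition band.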
```
#pragma kernel Highlight

Texture2D<float4> source;
RWTexture2D<float4> outputrt;
float radius;
float edgeWidth;
float shade;
float4 center;

float inCircle( float2 pt, float2 center, float radius, float edgeWidth ){
    float len = length(pt - center);
    return 1.0 - smoothstep(radius-edgeWidth, radius, len);
}

[numthreads(8, 8, 1)]
void Highlight(uint3 id : SV_DispatchThreadID)
{
    float4 srcColor = source[id.xy];
    float4 shadedSrcColor = srcColor * shade;
    float highlight = inCircle( (float2)id.xy, center.xy, radius, edgeWidth);
    float4 color = lerp( shadedSrcColor, srcColor, highlight );

    outputrt[id.xy] = color;

}
```
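To see what inCircle returns across the transition band, here is a Python sketch of the same math. The `smoothstep` below follows the standard definition of the HLSL intrinsic (clamp, then the cubic 3t² - 2t³), and `edge_width` is assumed to be greater than zero.

```python
def smoothstep(edge0, edge1, x):
    """Cubic Hermite interpolation between 0 and 1 as x moves from edge0 to edge1."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def in_circle(pt, center, radius, edge_width):
    """1 inside the circle, 0 outside, with a smooth falloff over the last edge_width pixels."""
    length = ((pt[0] - center[0]) ** 2 + (pt[1] - center[1]) ** 2) ** 0.5
    return 1.0 - smoothstep(radius - edge_width, radius, length)
```

With radius 10 and edge width 2, a pixel at distance 9 sits halfway through the band and gets a highlight factor of 0.5, so the lerp in the kernel mixes the shaded and original colors equally there.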
Code for this version:
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_RingHighlight/Assets/Shaders/RingHighlight.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_RingHighlight/Assets/Scripts/RingHighlight.cs
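3. Blur Effect
The principle of blurring is simple: for each pixel, sample the surrounding n*n pixels and take a weighted average.
But there is an efficiency problem. Reducing the number of texture samples matters a lot for performance. If every pixel samples 20*20 surrounding pixels, rendering one pixel takes 400 samples, which is clearly unacceptable. Moreover, gathering a whole rectangle of pixels around a single pixel is awkward to handle in a compute shader. What is the solution?
The usual approach is to sample horizontally once and then vertically once. For each pixel, sample 20 pixels along x and 20 pixels along y, for 20+20 samples in total, and then take the weighted average. This not only cuts the number of samples but also fits the compute shader model better: one kernel for the horizontal pass and another for the vertical pass.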
```
#pragma kernel HorzPass
#pragma kernel Highlight
```
Because dispatches execute in order, the vertical pass can sample the result of the horizontal blur once it has been computed.
```
shader.Dispatch(kernelHorzPassID, groupSize.x, groupSize.y, 1);
shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
```
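After the blur, combine it with the RingHighlight from the previous section and it is done.
One difference: after the horizontal blur is computed, how is the result passed to the next kernel? The approach used here is the shared keyword. The steps are as follows.
On the CPU side, declare a reference to the texture that stores the horizontal blur, look up the horizontal-pass kernel, and bind the texture.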
```
RenderTexture horzOutput = null;
int kernelHorzPassID;
protected override void Init()
{
    ...
    kernelHorzPassID = shader.FindKernel("HorzPass");
    ...
}
```
Extra space also has to be allocated on the GPU to store the first kernel's result.
```
protected override void CreateTextures()
{
    base.CreateTextures();
    shader.SetTexture(kernelHorzPassID, "source", renderedSource);
    CreateTexture(ref horzOutput);
    shader.SetTexture(kernelHorzPassID, "horzOutput", horzOutput);
    shader.SetTexture(kernelHandle, "horzOutput", horzOutput);
}
```
On the GPU side, declare:
```
shared Texture2D<float4> source;
shared RWTexture2D<float4> horzOutput;
RWTexture2D<float4> outputrt;
```
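One open question: the shared keyword seems to make no difference whether it is present or not; in testing, different kernels can access the resource either way. So what is shared for?
In HLSL, shared is a storage-class modifier that marks a variable as shared between effects and acts as a hint to the compiler. In a Unity compute shader, resources declared at file scope are visible to every kernel in the file, and what actually connects the two kernels is binding the same texture to both with SetTexture, so the keyword appears to have no functional effect here.
When computing pixels near the border, fewer pixels are available than needed: either fewer than blurRadius pixels remain on the left, or fewer remain on the right. So first compute a safe left index, then compute how many pixels can be taken from there to the right.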
```
[numthreads(8, 8, 1)]
void HorzPass(uint3 id : SV_DispatchThreadID)
{
    int left = max(0, (int)id.x-blurRadius);
    int count = min(blurRadius, (int)id.x) + min(blurRadius, source.Length.x - (int)id.x);
    float4 color = 0;
    uint2 index = uint2((uint)left, id.y);
    [unroll(100)]
    for(int x=0; x<count; x++){
        color += source[index];
        index.x++;
    }
    color /= (float)count;
    horzOutput[id.xy] = color;
}
[numthreads(8, 8, 1)]
void Highlight(uint3 id : SV_DispatchThreadID)
{
    //Vert blur
    int top = max(0, (int)id.y-blurRadius);
    int count = min(blurRadius, (int)id.y) + min(blurRadius, source.Length.y - (int)id.y);
    float4 blurColor = 0;
    uint2 index = uint2(id.x, (uint)top);
    [unroll(100)]
    for(int y=0; y<count; y++){
        blurColor += horzOutput[index];
        index.y++;
    }
    blurColor /= (float)count;
    float4 srcColor = source[id.xy];
    float4 shadedBlurColor = blurColor * shade;
    float highlight = inCircle( (float2)id.xy, center.xy, radius, edgeWidth);
    float4 color = lerp( shadedBlurColor, srcColor, highlight );
    outputrt[id.xy] = color;
}
```
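The border handling in HorzPass can be checked on the CPU. The following is a minimal plain C# sketch, not the author's code: the class name BoxBlurSketch is hypothetical, and a single row of floats stands in for one row of the source texture. It reuses the same left/count arithmetic, so the window runs from x - blurRadius up to x + blurRadius - 1 and shrinks near the edges. It assumes blurRadius is at least 1.

```csharp
using System;

// Hypothetical CPU-side helper, for illustration only.
public static class BoxBlurSketch
{
    // Horizontal box blur over one row, mirroring the left/count logic of HorzPass.
    // Assumes blurRadius >= 1 so that count is never zero.
    public static float[] HorzPass(float[] row, int blurRadius)
    {
        int width = row.Length;
        float[] result = new float[width];
        for (int x = 0; x < width; x++)
        {
            // First index that is safe to read on the left side.
            int left = Math.Max(0, x - blurRadius);
            // Pixels available to the left plus pixels available from x to the right.
            int count = Math.Min(blurRadius, x) + Math.Min(blurRadius, width - x);
            float sum = 0f;
            for (int i = 0; i < count; i++)
                sum += row[left + i];
            result[x] = sum / count;
        }
        return result;
    }
}
```

Running the same function over each column of the horizontally blurred result would give the vertical pass.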
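Code at this stage:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_BlurEffect/Assets/Shaders/BlurHighlight.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_BlurEffect/Assets/Scripts/BlurHighlight.cs
4. Gaussian Blur
Unlike the blur above, the samples are no longer simply averaged; they are weighted with a Gaussian function:
$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}$$
where $\sigma$ is the standard deviation, which controls the width of the curve.
More on blurring: https://www.gamedeveloper.com/programming/four-tricks-for-fast-blurring-in-software-and-hardware#close-modal
This is still a fair amount of computation, and evaluating the expression for every pixel would be expensive. Instead, precompute the weights on the CPU and send them to the GPU through a Buffer. Both kernels need the weights, so the Buffer is bound to both, and its declaration is marked shared.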
```
float[] SetWeightsArray(int radius, float sigma)
{
    int total = radius * 2 + 1;
    float[] weights = new float[total];
    float sum = 0.0f;
    for (int n=0; n<=radius; n++) // include n == radius so the two outermost taps are filled
    {
        float weight = 0.39894f * Mathf.Exp(-0.5f * n * n / (sigma * sigma)) / sigma;
        weights[radius + n] = weight;
        weights[radius - n] = weight;
        if (n != 0)
            sum += weight * 2.0f;
        else
            sum += weight;
    }
    // normalize kernels
    for (int i=0; i<total; i++) weights[i] /= sum;
    return weights;
}
private void UpdateWeightsBuffer()
{
    if (weightsBuffer != null)
        weightsBuffer.Dispose();
    float sigma = (float)blurRadius / 1.5f;
    weightsBuffer = new ComputeBuffer(blurRadius * 2 + 1, sizeof(float));
    float[] blurWeights = SetWeightsArray(blurRadius, sigma);
    weightsBuffer.SetData(blurWeights);
    shader.SetBuffer(kernelHorzPassID, "weights", weightsBuffer);
    shader.SetBuffer(kernelHandle, "weights", weightsBuffer);
}
```
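To see what the precomputed weights look like without opening Unity, here is a small plain C# sketch of the same idea. It is an illustration under assumptions: the class name GaussianWeightsSketch is hypothetical, System.Math replaces Mathf, and the loop runs through n == radius so that all 2 * radius + 1 taps are filled. After normalization the weights should sum to 1 and be symmetric around the center.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class GaussianWeightsSketch
{
    // Builds 2 * radius + 1 normalized Gaussian weights; index radius is the center tap.
    public static float[] Build(int radius, float sigma)
    {
        int total = radius * 2 + 1;
        float[] weights = new float[total];
        float sum = 0f;
        for (int n = 0; n <= radius; n++)
        {
            // 1D Gaussian: exp(-n^2 / (2 sigma^2)) / (sqrt(2 pi) * sigma)
            float w = (float)(Math.Exp(-0.5 * n * n / (sigma * sigma))
                              / (Math.Sqrt(2.0 * Math.PI) * sigma));
            weights[radius + n] = w;
            weights[radius - n] = w;
            // The center tap is counted once, every other tap twice.
            sum += (n == 0) ? w : 2f * w;
        }
        // Normalize so the blur does not brighten or darken the image.
        for (int i = 0; i < total; i++)
            weights[i] /= sum;
        return weights;
    }
}
```

For example, GaussianWeightsSketch.Build(3, 2.0f) returns seven weights, largest in the middle and falling off equally on both sides.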
Full code:
- https://pastebin.com/0qWtUKgy
- https://pastebin.com/A6mDKyJE
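5. Low-Resolution Effect
GPU: now that was a satisfying round of computation.
Make a high-resolution texture look blocky without changing its resolution. The implementation is simple: for every n*n block of pixels, use only the color of the bottom-left pixel. Thanks to integer division, dividing the id.x index by n and then multiplying by n does the job.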
```
uint2 index = (uint2(id.x, id.y)/3) * 3;
float3 srcColor = source[index].rgb;
float3 finalColor = srcColor;
```
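The integer trick is easy to verify in isolation. A tiny C# sketch follows; the names PixelateSketch and SnapToBlock are hypothetical. Integer division discards the remainder, so multiplying back by the block size lands on the first coordinate of the block.

```csharp
// Hypothetical helper, for illustration only.
public static class PixelateSketch
{
    // Maps a pixel coordinate to the first coordinate of its block.
    // Integer division truncates, so (coord / blockSize) * blockSize rounds down
    // to a multiple of blockSize for non-negative coordinates.
    public static int SnapToBlock(int coord, int blockSize)
    {
        return (coord / blockSize) * blockSize;
    }
}
```

With a block size of 3, coordinates 6, 7, and 8 all map to 6, so the three pixels share one source color.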
That gives the result shown above. It looks too sharp, though, so add noise to soften the jagged edges.
```
uint2 index = (uint2(id.x, id.y)/3) * 3;
float noise = random(id.xy, time);
float3 srcColor = lerp(source[id.xy].rgb, source[index].rgb, noise);
float3 finalColor = srcColor;
```
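Each pixel in an n*n block no longer takes only the bottom-left color; it takes a random interpolation between its own color and the bottom-left color. The result immediately looks much finer. When n is fairly large you can also get the effect shown below. It is not exactly pretty, but it is worth exploring for glitch-style looks.
For a noisier image, try adding coefficients to the two ends of the lerp, for example: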
```
float3 srcColor = lerp(source[id.xy].rgb * 2, source[index].rgb, noise);
```
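6. Grayscale Effect and Tint
Converting a color image to grayscale means turning each pixel's RGB value into a single value, a weighted average of the RGB channels. There are two ways to do it: a simple average, and a weighted average that matches human perception.
1. Average method (simple but inaccurate):
$$Gray = \frac{R + G + B}{3}$$
This method gives every color channel the same weight.
2. Weighted average method (more accurate, reflects human perception):
$$Gray = 0.299R + 0.587G + 0.114B$$
This method weights the channels differently because the human eye is most sensitive to green, less to red, and least to blue. (The screenshot below does not show the difference well; I could not see it either, lol.)
After weighting, mix in the color with a simple multiplication, then lerp to get a tint with controllable strength.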
```
uint2 index = (uint2(id.x, id.y)/6) * 6;
float noise = random(id.xy, time);
float3 srcColor = lerp(source[id.xy].rgb, source[index].rgb, noise);
// float3 finalColor = srcColor;
float3 grayScale = (srcColor.r+srcColor.g+srcColor.b)/3.0;
// float3 grayScale = srcColor.r*0.299f+srcColor.g*0.587f+srcColor.b*0.114f;
float3 tinted = grayScale * tintColor.rgb;
float3 finalColor = lerp(srcColor, tinted, tintStrength);
outputrt[id.xy] = float4(finalColor, 1);
```
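The difference between the two grayscale methods is easier to see with numbers than with a screenshot. Below is a plain C# sketch; the class and method names are hypothetical, and colors are assumed to be in the 0..1 range. It computes both gray values and the per-channel tint blend used above.

```csharp
// Hypothetical helper, for illustration only.
public static class GrayscaleSketch
{
    // Equal weight for every channel.
    public static float Average(float r, float g, float b)
    {
        return (r + g + b) / 3f;
    }

    // Perceptual weights: green counts most, blue least.
    public static float Luma(float r, float g, float b)
    {
        return 0.299f * r + 0.587f * g + 0.114f * b;
    }

    // One channel of lerp(srcColor, grayScale * tintColor, tintStrength).
    public static float TintChannel(float src, float gray, float tint, float strength)
    {
        float tinted = gray * tint;
        return src + (tinted - src) * strength;
    }
}
```

For pure green (0, 1, 0), Average gives about 0.333 while Luma gives 0.587, so the weighted version keeps green noticeably brighter. For pure blue the order flips: about 0.333 versus 0.114.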
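Tinting with a wasteland color:
7. Screen Scanline Effect
First, uvY normalizes the coordinate to [0,1].
lines is a parameter that controls the number of scanlines.
Then add a time offset; the coefficient controls how fast the lines scroll. You could expose a parameter to control that speed.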
```
float uvY = (float)id.y/(float)source.Length.y;
float scanline = saturate(frac(uvY * lines + time * 3));
```
These "lines" are too fat to really look like lines, so slim them down.
```
float uvY = (float)id.y/(float)source.Length.y;
float scanline = saturate(smoothstep(0.1,0.2,frac(uvY * lines + time * 3)));
```
Then lerp the color in.
```
float uvY = (float)id.y/(float)source.Length.y;
float scanline = saturate(smoothstep(0.1, 0.2, frac(uvY * lines + time*3)) + 0.3);
finalColor = lerp(source[id.xy].rgb*0.5, finalColor, scanline);
```
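Before and after slimming: take whichever you prefer!
8. Night Vision Effect
This section brings together everything above to build a night-vision effect. Start with a single-eye version.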
```
float2 pt = (float2)id.xy;
float2 center = (float2)(source.Length >> 1);
float inVision = inCircle(pt, center, radius, edgeWidth);
float3 blackColor = float3(0,0,0);
finalColor = lerp(blackColor, finalColor, inVision);
```
The two-eye version differs in having two circle centers. Combine the two resulting vision masks with max() or saturate().
```
float2 pt = (float2)id.xy;
float2 centerLeft = float2(source.Length.x / 3.0, source.Length.y /2);
float2 centerRight = float2(source.Length.x / 3.0 * 2.0, source.Length.y /2);
float inVisionLeft = inCircle(pt, centerLeft, radius, edgeWidth);
float inVisionRight = inCircle(pt, centerRight, radius, edgeWidth);
float3 blackColor = float3(0,0,0);
// float inVision = max(inVisionLeft, inVisionRight);
float inVision = saturate(inVisionLeft + inVisionRight);
finalColor = lerp(blackColor, finalColor, inVision);
```
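Code at this stage:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_NightVision/Assets/Shaders/NightVision.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_NightVision/Assets/Scripts/NightVision.cs
9. Smooth-Edged Lines
Think about how to draw a line with smooth edges on screen.
The smoothstep() function does this; readers familiar with it can skip this paragraph. It creates smooth gradients. smoothstep(edge0, edge1, x) rises from 0 to 1 as x moves from edge0 to edge1. If x < edge0 it returns 0; if x > edge1 it returns 1. The output is computed with Hermite interpolation:
$$t = \mathrm{clamp}\left(\frac{x - edge0}{edge1 - edge0},\, 0,\, 1\right), \qquad \mathrm{smoothstep}(edge0, edge1, x) = t^2 (3 - 2t)$$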
```
float onLine(float position, float center, float lineWidth, float edgeWidth) {
    float halfWidth = lineWidth / 2.0;
    float edge0 = center - halfWidth - edgeWidth;
    float edge1 = center - halfWidth;
    float edge2 = center + halfWidth;
    float edge3 = center + halfWidth + edgeWidth;
    return smoothstep(edge0, edge1, position) - smoothstep(edge2, edge3, position);
}
```
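In the code above, all parameters are already normalized to [0,1]. position is the position of the point being tested, center is the center of the line, lineWidth is the actual width of the line, and edgeWidth is the width of the edge used for the smooth transition. I am honestly not happy with how I explained that! As for how it is computed, a picture helps.
Roughly: the result is 0 outside [edge0, edge3], rises from 0 to 1 between edge0 and edge1, stays at 1 between edge1 and edge2, and falls back to 0 between edge2 and edge3.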
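The shape of onLine can be probed on the CPU. The following is a plain C# sketch under assumptions: the class name SmoothLineSketch is hypothetical, and Smoothstep is a hand-written equivalent of the HLSL intrinsic (clamp, then t * t * (3 - 2t)). OnLine is a direct port of the shader function above.

```csharp
using System;

// Hypothetical CPU-side port, for illustration only.
public static class SmoothLineSketch
{
    // Hermite interpolation between edge0 and edge1, clamped to [0, 1].
    public static float Smoothstep(float edge0, float edge1, float x)
    {
        float t = (x - edge0) / (edge1 - edge0);
        t = Math.Max(0f, Math.Min(1f, t));
        return t * t * (3f - 2f * t);
    }

    // Rising ramp minus falling ramp: 1 on the line, 0 away from it,
    // with a smooth transition of width edgeWidth on each side.
    public static float OnLine(float position, float center, float lineWidth, float edgeWidth)
    {
        float halfWidth = lineWidth / 2f;
        float edge0 = center - halfWidth - edgeWidth;
        float edge1 = center - halfWidth;
        float edge2 = center + halfWidth;
        float edge3 = center + halfWidth + edgeWidth;
        return Smoothstep(edge0, edge1, position) - Smoothstep(edge2, edge3, position);
    }
}
```

With center 0.5, lineWidth 0.1, and edgeWidth 0.05, a position of 0.5 gives 1, a position of 0.3 gives 0, and a position halfway up the left ramp (0.425) gives roughly 0.5.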
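Now think about how to draw a circle with smooth edges.
For each point, first compute the offset vector from the circle center and store it back in position, then compute its length and store it in len.
Following the same difference-of-two-smoothsteps approach as above, subtracting the outer-edge interpolation produces a ring-shaped line.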
```
float circle(float2 position, float2 center, float radius, float lineWidth, float edgeWidth){
    position -= center;
    float len = length(position);
    //Change true to false to soften the edge
    float result = smoothstep(radius - lineWidth / 2.0 - edgeWidth, radius - lineWidth / 2.0, len) - smoothstep(radius + lineWidth / 2.0, radius + lineWidth / 2.0 + edgeWidth, len);
    return result;
}
```
10. Scanline Effect
Next, one horizontal line, one vertical line, and a few nested circles make a radar-scan look.
```
float3 color = float3(0.0f,0.0f,0.0f);
color += onLine(uv.y, center.y, 0.002, 0.001) * axisColor.rgb;//xAxis
color += onLine(uv.x, center.x, 0.002, 0.001) * axisColor.rgb;//yAxis
color += circle(uv, center, 0.2f, 0.002, 0.001) * axisColor.rgb;
color += circle(uv, center, 0.3f, 0.002, 0.001) * axisColor.rgb;
color += circle(uv, center, 0.4f, 0.002, 0.001) * axisColor.rgb;
```
Now draw a sweep line that leaves a trail behind it.
```
float sweep(float2 position, float2 center, float radius, float lineWidth, float edgeWidth) {
    float2 direction = position - center;
    float theta = time + 6.3;
    float2 circlePoint = float2(cos(theta), -sin(theta)) * radius;
    float projection = clamp(dot(direction, circlePoint) / dot(circlePoint, circlePoint), 0.0, 1.0);
    float lineDistance = length(direction - circlePoint * projection);
    float gradient = 0.0;
    const float maxGradientAngle = PI * 0.5;
    if (length(direction) < radius) {
        float angle = fmod(theta + atan2(direction.y, direction.x), PI2);
        gradient = clamp(maxGradientAngle - angle, 0.0, maxGradientAngle) / maxGradientAngle * 0.5;
    }
    return gradient + 1.0 - smoothstep(lineWidth, lineWidth + edgeWidth, lineDistance);
}
```
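The core of sweep is the distance from a point to a line segment, obtained by projecting direction onto circlePoint with dot(a,b)/dot(b,b) and clamping to [0, 1]. Here is a plain C# sketch of just that step, using System.Numerics. The class and method names are hypothetical, and the segment is assumed to start at the origin, as it does in the shader after subtracting center.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
public static class SegmentDistanceSketch
{
    // Distance from point to the segment running from the origin to segmentEnd.
    public static float DistanceToSegment(Vector2 point, Vector2 segmentEnd)
    {
        // Fraction along the segment of the point's projection.
        float projection = Vector2.Dot(point, segmentEnd) / Vector2.Dot(segmentEnd, segmentEnd);
        // Clamp so the closest point stays on the segment, not on the infinite line.
        projection = Math.Max(0f, Math.Min(1f, projection));
        return (point - segmentEnd * projection).Length();
    }
}
```

For a segment ending at (2, 0), the point (1, 1) projects to the middle and is at distance 1, while (3, 0) lies past the end, so its distance is measured to the endpoint and is also 1.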
Add the sweep to the color.
```
...
color += sweep(uv, center, 0.45f, 0.003, 0.001) * sweepColor.rgb;
...
```
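Code at this stage:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_HUDOverlay/Assets/Shaders/HUDOverlay.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_HUDOverlay/Assets/Scripts/HUDOverlay.cs
11. Gradient Background Shadow Effect
This effect can sit under subtitles or explanatory text. You could simply add an image in the UI Canvas, but a Compute Shader allows more flexible effects and saves on assets.
Subtitle and dialogue backgrounds usually sit at the bottom of the screen, while the top is left untouched. They also need high contrast, so convert the original image to grayscale and apply a shade.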
```
if (id.y<(uint)tintHeight){
    float3 grayScale = (srcColor.r + srcColor.g + srcColor.b) * 0.33 * tintColor.rgb;
    float3 shaded = lerp(srcColor.rgb, grayScale, tintStrength) * shade;
    ... // continued below
}else{
    color = srcColor;
}
```
The gradient.
```
...// continued from above
    float srcAmount = smoothstep(tintHeight-edgeWidth, (float)tintHeight, (float)id.y);
    ...// continued below
```
Finally, lerp them together.
```
...// continued from above
    color = lerp(float4(shaded, 1), srcColor, srcAmount);
```
12. Summary / Quiz
If id.xy = [ 100, 30 ]. What would be the return value of inCircle((float2)id.xy, float2(130, 40), 40, 0.1)
When creating a blur effect which answer describes our approach best?
Which answer would create a blocky low resolution version of the source image?
What is smoothstep(5, 10, 6); ?
If a and b are both vectors, which answer best describes dot(a,b)/dot(b,b)?
What is _MainTex_TexelSize.x? If _MainTex is 512 x 256 pixel resolution.
13. Post-Processing with Blit and a Material
Besides using a Compute Shader for post-processing, there is a simpler method.
```
// .cs
Graphics.Blit(source, dest, material, passIndex);
// .shader
Pass{
    CGPROGRAM
    #pragma vertex vert_img
    #pragma fragment frag
    fixed4 frag(v2f_img input) : SV_Target{
        return tex2D(_MainTex, input.uv);
    }
    ENDCG
}
```
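This processes the image data through a Shader.
That raises two questions. What is the difference between the two approaches? And since what we pass in is a texture, where do the vertices come from?
Answers:
First question. This approach is called "screen-space shading". It is fully integrated into Unity's graphics pipeline, and for simple per-pixel effects its performance can actually be better than a Compute Shader's. A Compute Shader, on the other hand, gives finer-grained control over GPU resources. It is not constrained by the graphics pipeline and can directly read and modify textures, buffers, and other resources.
Second question. Look at vert_img. It is defined in UnityCG.cginc as a minimal vertex function that transforms the vertex position to clip space and passes the texture coordinate through.
Graphics.Blit draws the incoming texture as a full-screen quad (two triangles), so when writing post-processing with a material we only need to write the frag function.
The next chapter covers how to tie Material, Shader, Compute Shader, and C# together.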
2024-05-27
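Compute Shader Study Notes (1): Getting Started
Tags: Getting Started/Shader/Compute Shader/GPU Optimization
Preface
Compute Shaders are fairly complex. Mastering them takes some programming knowledge, graphics knowledge, and familiarity with GPU hardware. These study notes are divided into the following parts:
- A first look at Compute Shaders and some simple effects
- Drawing circles, planetary orbits, noise textures, manipulating a Mesh, and so on
- Post-processing and particle systems
- Physics simulation and drawing grass
- Fluid simulation
The main references are: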
- https://www.udemy.com/course/compute-shaders/?couponCode=LEADERSALE24A
- https://catlikecoding.com/unity/tutorials/basics/compute-shaders/
- https://medium.com/ericzhan-publication/shader筆記-初探compute-shader-9efeebd579c1
- https://docs.unity3d.com/Manual/class-ComputeShader.html
- https://docs.unity3d.com/ScriptReference/ComputeShader.html
- https://learn.microsoft.com/en-us/windows/win32/api/D3D11/nf-d3d11-id3d11devicecontext-dispatch
- lygyue: Compute Shader (very interesting)
- https://medium.com/@sengallery/unity-compute-shader-基礎認識-5a99df53cea1
- https://kylehalladay.com/blog/tutorial/2014/06/27/Compute-Shaders-Are-Nifty.html (old and out of date)
- http://www.sunshine2k.de/coding/java/Bresenham/RasterisingLinesCircles.pdf
- 王江荣: [Unity] Basic Introduction to and Use of Compute Shader
- ...to be continued
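L1 Introducing Compute Shader
1. First Look at Compute Shader
Put simply, a Compute Shader can compute a texture, which a Renderer then displays. Keep in mind that Compute Shaders can do far more than this.
Copy the two listings below and try them out.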
```
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class AssignTexture : MonoBehaviour
{
    // ComputeShader used to run work on the GPU
    public ComputeShader shader;

    // Texture resolution
    public int texResolution = 256;

    // Renderer component
    private Renderer rend;
    // Render texture
    private RenderTexture outputTexture;
    // Compute shader kernel handle
    private int kernelHandle;

    // Start is called once when the script is enabled
    void Start()
    {
        // Create a new render texture with the given width, height, and depth buffer bits (0 here)
        outputTexture = new RenderTexture(texResolution, texResolution, 0);
        // Allow random writes
        outputTexture.enableRandomWrite = true;
        // Create the render texture on the GPU
        outputTexture.Create();

        // Get the renderer component of this object
        rend = GetComponent<Renderer>();
        // Enable the renderer
        rend.enabled = true;

        InitShader();
    }

    private void InitShader()
    {
        // Look up the handle of the compute shader kernel "CSMain"
        kernelHandle = shader.FindKernel("CSMain");

        // Bind the texture used by the compute shader
        shader.SetTexture(kernelHandle, "Result", outputTexture);

        // Use the render texture as the material's main texture
        rend.material.SetTexture("_MainTex", outputTexture);

        // Dispatch the compute shader, passing the number of thread groups.
        // The kernel declares 8x8 threads per group, so texResolution / 16 groups
        // cover only half of x and half of y: only 1/4 of the texture is rendered.
        DispatchShader(texResolution / 16, texResolution / 16);
    }

    private void DispatchShader(int x, int y)
    {
        // Dispatch the compute shader
        // x and y are the numbers of thread groups; 1 is the group count along z (just one here)
        shader.Dispatch(kernelHandle, x, y, 1);
    }

    void Update()
    {
        // Check for keyboard input every frame (U key released)
        if (Input.GetKeyUp(KeyCode.U))
        {
            // When U is released, dispatch again with enough groups to cover the whole texture
            DispatchShader(texResolution / 8, texResolution / 8);
        }
    }
}
```
Unity's default Compute Shader:
```
// Each #kernel tells which function to compile; you can have many kernels
#pragma kernel CSMain

// Create a RenderTexture with enableRandomWrite flag and set it
// with cs.SetTexture
RWTexture2D<float4> Result;

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    // TODO: insert actual code here!
    Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
}
```
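In this example, the bottom-left quarter of the area is filled with a fractal known as the Sierpinski triangle. The pattern itself does not matter; Unity apparently found it representative and made it the default code.
Let's go through the Compute Shader code in detail. For the C# code, the comments are enough.
#pragma kernel CSMain declares the entry point of the Compute Shader. The name CSMain can be changed freely, as long as the function name and the FindKernel call match it.
RWTexture2D<float4> Result declares a readable and writable 2D texture. R stands for Read and W for Write.
Pay particular attention to this line: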
```
[numthreads(8,8,1)]
```
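In the Compute Shader file, this line sets the size of a thread group. This 8 * 8 * 1 group has 64 threads in total, and each thread computes one pixel of the RWTexture.
In the C# file above, shader.Dispatch specifies the number of thread groups.
Now a question: if the thread group size is 8 * 8 * 1, how many thread groups do we need to cover a res*res RWTexture?
The answer: res/8 groups along x and res/8 along y. The Start code dispatches only res/16 per axis, so only the bottom-left quarter is rendered.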
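The arithmetic can be written down once and reused. Below is a plain C# sketch; the class and method names are hypothetical. GroupCount rounds up, so a resolution that is not a multiple of the group size is still fully covered (the kernel would then need its own bounds check, which is an assumption beyond the original text). CoveredFraction reproduces the "only a quarter" observation for a square texture.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class DispatchMathSketch
{
    // Number of thread groups needed along one axis, rounded up.
    public static int GroupCount(int resolution, int threadsPerGroup)
    {
        return (resolution + threadsPerGroup - 1) / threadsPerGroup;
    }

    // Fraction of a square texture covered when the same group count
    // and group size are used on both axes.
    public static float CoveredFraction(int resolution, int groupsPerAxis, int threadsPerGroup)
    {
        int coveredPerAxis = Math.Min(resolution, groupsPerAxis * threadsPerGroup);
        float perAxis = (float)coveredPerAxis / resolution;
        return perAxis * perAxis;
    }
}
```

For a 256 texture with 8 threads per axis, GroupCount returns 32. Dispatching 256 / 16 = 16 groups per axis covers 128 of 256 pixels on each axis, which is a quarter of the area.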
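The parameter of the entry function is also worth mentioning. In uint3 id : SV_DispatchThreadID, id is the unique identifier of the current thread within the whole dispatch.
2. Four-Quadrant Pattern
Learn to crawl before you walk. First, specify in C# which task (Kernel) to run.
So far the kernel name is hard-coded. Now expose a parameter so that different tasks can be selected.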
```
public string kernelName = "CSMain";
...
kernelHandle = shader.FindKernel(kernelName);
```
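Now it can be changed freely in the Inspector.
But serving an empty plate is not enough; there has to be food on it. We do the cooking in the Compute Shader.
First, set up the menu.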
```
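// CSMain was already declared earlier
#pragma kernel CSMain
// Declare a new dish here and implement it below.
// Keep comments off the #pragma kernel lines themselves; Unity can fail to compile them.
#pragma kernel SolidRed
// ... more kernels can be declared the same way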
[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID){ ... }
[numthreads(8,8,1)]
void SolidRed (uint3 id : SV_DispatchThreadID){
 Result[id.xy] = float4(1,0,0,0); 
}
```
Change the name in the Inspector to enable a different Kernel.
What if I want to pass data to the Compute Shader? For example, the resolution of the texture.
```
shader.SetInt("texResolution", texResolution);
```
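The variable must also be declared in the Compute Shader, here as int texResolution; .
A question to think about: how would you produce a four-quadrant pattern, with black at the bottom left, red at the bottom right, green at the top left, and yellow at the top right?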
```
[numthreads(8,8,1)]
void SplitScreen (uint3 id : SV_DispatchThreadID)
{
    int halfRes = texResolution >> 1;
    Result[id.xy] = float4(step(halfRes, id.x),step(halfRes, id.y),0,1);
}
```
To explain: the step function is essentially:
```
step(edge, x){
    return x>=edge ? 1 : 0;
}
```
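texResolution >> 1 shifts the bits of texResolution one place to the right, which is the same as dividing by 2.
The computation depends only on the current thread id.
Threads in the bottom-left quadrant always output black, because both step calls return 0.
Threads in the bottom-right quadrant have id.x >= halfRes, so the red channel is 1.
And so on; it is very simple. If you are not convinced, work through a few values by hand. It helps in understanding how thread ids, thread groups, and the dispatch grid relate.
3. Drawing a Circle
The principle sounds simple: test whether (id.x, id.y) lies inside the circle; output 1 if it does, 0 otherwise. Give it a try.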
```
float inCircle( float2 pt, float radius ){
    return ( length(pt)<radius ) ? 1.0 : 0.0;
}

[numthreads(8,8,1)]
void Circle (uint3 id : SV_DispatchThreadID)
{
    int halfRes = texResolution >> 1;
    float isInside = inCircle((float2)((int2)id.xy-halfRes), (float)(halfRes>>1));
    Result[id.xy] = float4(0.0,isInside ,0,1);
}
```
4. Summary / Quiz
If the output is an RWTexture with a side length of 256, which answer produces a completely red texture?
```
RWTexture2D<float4> output;

[numthreads(16,16,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
     output[id.xy] = float4(1.0, 0.0, 0.0, 1.0);
}
```
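Which answer gives red on the left side of the output texture and yellow on the right?
L2 Getting Going
1. Passing Values to the GPU
Without further ado, draw a circle first. The two starting files are here.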
PassData.cs: https://pastebin.com/PMf4SicK
PassData.compute: https://pastebin.com/WtfUmhk2
The overall structure is unchanged from before. The kernel ends by calling a drawCircle function to draw the circle.
```
[numthreads(1,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    int2 centre = (texResolution >> 1);
    int radius = 80;
    drawCircle( centre, radius );
}
```
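The circle-drawing method used here is a classic rasterization algorithm; if the math interests you, see http://www.sunshine2k.de/coding/java/Bresenham/RasterisingLinesCircles.pdf . The rough idea is to exploit the circle's symmetry.
The difference this time is that the thread group size is (1,1,1). Calling the CS from the CPU side: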
```
private void DispatchKernel(int count)
{
    shader.Dispatch(circlesHandle, count, 1, 1);
}
void Update()
{
    DispatchKernel(1);
}
```
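Here is a question: how many times does a thread run?
Answer: only once. A thread group has just 1×1×1 = 1 thread, and the CPU dispatches just 1×1×1 = 1 thread group, so a single thread draws the whole circle. In other words, one thread can draw across the entire RWTexture in one go, rather than one thread per pixel as before.
This also shows that a Compute Shader is fundamentally different from a Fragment Shader. A fragment shader only computes the color of a single pixel, whereas a Compute Shader can perform more or less arbitrary operations!
Back in Unity, a good-looking circle needs an outline color and a fill color. Pass both parameters to the CS.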
```
float4 clearColor;
float4 circleColor;
```
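Add a Kernel that fills the texture with a color, and modify the Circles kernel. When several kernels access the same RWTexture, you can add the shared keyword to its declaration.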
```
#pragma kernel Circles
#pragma kernel Clear
    ...
shared RWTexture2D<float4> Result;
    ...
[numthreads(32,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    // int2 centre = (texResolution >> 1);
    int2 centre = (int2)(random2((float)id.x) * (float)texResolution);
    int radius = (int)(random((float)id.x) * 30);
    drawCircle( centre, radius );
}

[numthreads(8,8,1)]
void Clear (uint3 id : SV_DispatchThreadID)
{
    Result[id.xy] = clearColor;
}
```
On the CPU side, get the Clear kernel and pass in the data.
```
private int circlesHandle;
private int clearHandle;
    ...
shader.SetVector( "clearColor", clearColor);
shader.SetVector( "circleColor", circleColor);
    ...
private void DispatchKernels(int count)
{
    shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1);
    shader.Dispatch(circlesHandle, count, 1, 1);
}
void Update()
{
    DispatchKernels(1); // the screen now shows 32 circles
}
```
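A question: if the code is changed to DispatchKernels(10), how many circles will appear?
Answer: 320. With a Dispatch of 1×1×1 = 1 group, one thread group has 32×1×1 = 32 threads and each thread draws one circle; ten groups give 10 × 32 = 320. Elementary arithmetic.
Next, add a time variable so that the circles change over time. Compute Shaders do not appear to have a built-in variable like _Time, so the value has to be passed in from the CPU.
On the CPU side, note that variables updated in real time must be set before each Dispatch (outputTexture does not need this, because outputTexture is really a reference to the GPU texture!):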
```
private void DispatchKernels(int count)
{
    shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1);
    shader.SetFloat( "time", Time.time);
    shader.Dispatch(circlesHandle, count, 1, 1);
}
```
Compute Shader:
```
float time;
...
void Circles (uint3 id : SV_DispatchThreadID){
    ...
    int2 centre = (int2)(random2((float)id.x + time) * (float)texResolution);
    ...
}
```
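Code at this stage:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Circle_Time/Assets/Shaders/PassData.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Circle_Time/Assets/Scripts/PassData.cs
The circles are very chaotic at the moment, though. The next step uses a Buffer to make them behave more regularly.
Also note that several threads may try to write to the same memory location (such as a pixel of the RWTexture) at the same time, which is a race condition. Here it is harmless, because every thread writes the same circleColor, so the result is the same whichever write lands last. In general the API does not resolve such conflicts for you; code whose result depends on write order needs atomics or synchronization.
2. Passing Data to the GPU with a Buffer
So far we have learned how to send some simple values from the CPU to the GPU. How do we pass a custom struct?
We can use a Buffer as the medium. The Buffer itself lives on the GPU, and the CPU side (C#) only holds a reference to it. First declare a struct on the CPU, then declare the CPU-side data and the reference to the GPU-side buffer.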
```
struct Circle
{
    public Vector2 origin;
    public Vector2 velocity;
    public float radius;
}
    Circle[] circleData;  // on the CPU
    ComputeBuffer buffer; // on the GPU
```
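To get the size of a thread group, do the following. The code below fetches only the number of threads along x for the circlesHandle kernel's thread group and discards y and z (assuming both are 1). Multiplying by the number of dispatched thread groups gives the total number of threads.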
```
uint threadGroupSizeX;
shader.GetKernelThreadGroupSizes(circlesHandle, out threadGroupSizeX, out _, out _);
int total = (int)threadGroupSizeX * count;
```
Now prepare the data to send to the GPU. This creates one circle per thread, total of them, stored in circleData.
```
circleData = new Circle[total];
float speed = 100;
float halfSpeed = speed * 0.5f;
float minRadius = 10.0f;
float maxRadius = 30.0f;
float radiusRange = maxRadius - minRadius;
for(int i=0; i<total; i++)
{
    Circle circle = circleData[i];
    circle.origin.x = Random.value * texResolution;
    circle.origin.y = Random.value * texResolution;
    circle.velocity.x = (Random.value * speed) - halfSpeed;
    circle.velocity.y = (Random.value * speed) - halfSpeed;
    circle.radius = Random.value * radiusRange + minRadius;
    circleData[i] = circle;
}
```
Then receive this Buffer in the Compute Shader. Declare an identical struct (Vector2 and float2 have the same layout), then declare a reference to the Buffer.
```
// Compute Shader
struct circle
{
    float2 origin;
    float2 velocity;
    float radius;
};
StructuredBuffer<circle> circlesBuffer;
```
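Note that the StructuredBuffer used here is read-only, unlike the RWStructuredBuffer mentioned in the next section.
Back on the CPU side, send the prepared CPU data to the GPU through the Buffer. First work out the size of the Buffer we request, that is, how much data we are sending to the GPU. One circle consists of two float2 variables and one float variable. A float is 4 bytes (sizeof(float) in C# confirms this), and there are circleData.Length circles to send. circleData.Length is the number of circle objects the buffer stores, and stride is the number of bytes each object occupies. With that much space allocated, SetData() fills the buffer; this is the step that actually transfers the data to the GPU. Finally, bind the buffer to the chosen Kernel of the Compute Shader.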
```
int stride = (2 + 2 + 1) * 4; //2 floats origin, 2 floats velocity, 1 float radius - 4 bytes per float
buffer = new ComputeBuffer(circleData.Length, stride);
buffer.SetData(circleData);
shader.SetBuffer(circlesHandle, "circlesBuffer", buffer);
```
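Hand-computing the stride works, but it is easy to get wrong once the struct grows. One option is to let the runtime measure the struct. The sketch below is plain C# under assumptions: Float2 and CircleData are hypothetical stand-ins for Unity's Vector2 and the Circle struct above, laid out sequentially, and Marshal.SizeOf is used to report the size in bytes.

```csharp
using System.Runtime.InteropServices;

// Hypothetical stand-in for UnityEngine.Vector2: two 4-byte floats.
[StructLayout(LayoutKind.Sequential)]
public struct Float2
{
    public float x;
    public float y;
}

// Hypothetical mirror of the Circle struct used in the article.
[StructLayout(LayoutKind.Sequential)]
public struct CircleData
{
    public Float2 origin;
    public Float2 velocity;
    public float radius;
}

public static class StrideSketch
{
    // Size of one element in bytes: (2 + 2 + 1) floats * 4 bytes.
    public static int CircleStride()
    {
        return Marshal.SizeOf<CircleData>();
    }
}
```

StrideSketch.CircleStride() returns 20, matching (2 + 2 + 1) * 4. In a Unity project the same call on the real Circle struct could be passed as the stride argument of the ComputeBuffer constructor.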
So far, we have sent data prepared on the CPU to the GPU through a Buffer.
OK, now put that hard-won GPU data to use.
```
[numthreads(32,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    int2 centre = (int2)(circlesBuffer[id.x].origin + circlesBuffer[id.x].velocity * time);
    while (centre.x>texResolution) centre.x -= texResolution;
    while (centre.x<0) centre.x += texResolution;
    while (centre.y>texResolution) centre.y -= texResolution;
    while (centre.y<0) centre.y += texResolution;
    uint radius = (int)circlesBuffer[id.x].radius;
    drawCircle( centre, radius );
}
```
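Now the circles move continuously. The Buffer stores the starting position and velocity of the circle at index id.x, and each frame its position is computed as origin + velocity * time.
To sum up, this section showed how to define a custom struct (data structure) on the CPU side, pass it to the GPU through a Buffer, and process the data on the GPU.
In the next section, we learn how to bring data back from the GPU to the CPU.
Code at this stage:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Using_Buffer/Assets/Shaders/BufferJoy.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Using_Buffer/Assets/Scripts/BufferJoy.cs
3. Getting Data Back from the GPU
As before, create a Buffer, this time to bring data back from the GPU to the CPU, and define an array on the CPU side to receive it. Create the buffer, bind it to the shader, and finally allocate the CPU-side array that will receive the GPU data.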
```
ComputeBuffer resultBuffer; // Buffer
Vector3[] output;           // received on the CPU
...
    // buffer in GPU memory
    resultBuffer = new ComputeBuffer(starCount, sizeof(float) * 3);
    shader.SetBuffer(kernelHandle, "Result", resultBuffer);
    output = new Vector3[starCount];
```
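The Compute Shader receives the same Buffer. This Buffer is readable and writable, meaning the Compute Shader can modify it. In the previous section the Compute Shader only needed to read the Buffer, so StructuredBuffer was enough. Here we need the RW variant.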
```
RWStructuredBuffer<float3> Result;
```
Then call GetData after Dispatch to receive the data.
```
shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
resultBuffer.GetData(output);
```
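The idea is that simple. Now let's build a scene with a swarm of stars orbiting a center point.
The star positions are computed on the GPU; the computed position of each star is then read back and applied to objects instantiated in C#.
In the Compute Shader, each thread computes the position of one star and writes it to the Buffer.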
```
[numthreads(64,1,1)]
void OrbitingStars (uint3 id : SV_DispatchThreadID)
{
    float3 sinDir = normalize(random3(id.x) - 0.5);
    float3 vec = normalize(random3(id.x + 7.1393) - 0.5);
    float3 cosDir = normalize(cross(sinDir, vec));
    float scaledTime = time * 0.5 + random(id.x) * 712.131234;
    float3 pos = sinDir * sin(scaledTime) + cosDir * cos(scaledTime);
    Result[id.x] = pos * 2;
}
```
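Why does this trace a circular orbit? cosDir is the normalized cross product of sinDir and another vector, so it is perpendicular to sinDir, and sinDir * sin(t) + cosDir * cos(t) then has length 1 for every t. The plain C# sketch below checks this with System.Numerics. The class name OrbitSketch is hypothetical, and the two input vectors stand in for the random3 results, assumed to be non-parallel.

```csharp
using System;
using System.Numerics;

// Hypothetical CPU-side check, for illustration only.
public static class OrbitSketch
{
    // Mirrors the kernel: two perpendicular unit axes span the orbit plane,
    // and the result is scaled by 2 to give an orbit radius of 2.
    public static Vector3 OrbitPosition(Vector3 a, Vector3 b, float scaledTime)
    {
        Vector3 sinDir = Vector3.Normalize(a);
        Vector3 vec = Vector3.Normalize(b);
        // Perpendicular to sinDir by construction (requires a and b not parallel).
        Vector3 cosDir = Vector3.Normalize(Vector3.Cross(sinDir, vec));
        Vector3 pos = sinDir * (float)Math.Sin(scaledTime) + cosDir * (float)Math.Cos(scaledTime);
        return pos * 2f;
    }
}
```

For any time value the returned position stays at distance 2 from the origin, so every star moves on a circle of radius 2 in its own randomly oriented plane.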
On the CPU side, fetch the results with GetData and keep updating the positions of the GameObjects that were instantiated beforehand.
```
void Update()
{
    shader.SetFloat("time", Time.time);
    shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
    resultBuffer.GetData(output);
    for (int i = 0; i < stars.Length; i++)
        stars[i].localPosition = output[i];
}
```
Code at this stage:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_GetData_From_Buffer/Assets/Shaders/OrbitingStars.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_GetData_From_Buffer/Assets/Scripts/OrbitingStars.cs
4. Using Noise
Generating a noise texture with a Compute Shader is very simple and very efficient.
```
float random (float2 pt, float seed) {
    const float a = 12.9898;
    const float b = 78.233;
    const float c = 43758.543123;
    return frac(sin(seed + dot(pt, float2(a, b))) * c );
}

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    float4 white = 1;
    Result[id.xy] = random(((float2)id.xy)/(float)texResolution, time) * white;
}
```
There is a library that provides many more kinds of noise: https://pastebin.com/uGhMLKeM
```
#include "noiseSimplex.cginc" // Paste the code above and named "noiseSimplex.cginc"

...

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    float3 pos = (((float3)id)/(float)texResolution) * 2.0;
    float n = snoise(pos);
    float ring = frac(noiseScale * n);
    float delta = pow(ring, ringScale) + n;

    Result[id.xy] = lerp(darkColor, paleColor, delta);
}
```
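5. Deforming a Mesh
In this section we turn a Cube into a sphere with a Compute Shader, and it has to be animated, changing gradually!
As usual, declare the vertex data on the CPU side, hand it to the GPU for computation, and apply the computed new position newPos to the Mesh.
For the vertex struct, the CPU-side declaration comes with a constructor for convenience. The GPU-side declaration follows the same pattern. Here we pass two Buffers to the GPU, one read-only and one readable and writable. The two Buffers start out identical; as time passes, the writable Buffer gradually changes and the Mesh keeps morphing from a cube into a sphere.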
```
// CPU
public struct Vertex
{
    public Vector3 position;
    public Vector3 normal;
    public Vertex( Vector3 p, Vector3 n )
    {
        position.x = p.x;
        position.y = p.y;
        position.z = p.z;
        normal.x = n.x;
        normal.y = n.y;
        normal.z = n.z;
    }
}
...
Vertex[] vertexArray;
Vertex[] initialArray;
ComputeBuffer vertexBuffer;
ComputeBuffer initialBuffer;
// GPU
struct Vertex {
    float3 position;
    float3 normal;
};
...
RWStructuredBuffer<Vertex>  vertexBuffer;
StructuredBuffer<Vertex>    initialBuffer;
```
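The full initialization steps (the Start() function) are:
1. On the CPU side, initialize the kernel and get a reference to the Mesh
2. Copy the Mesh data into CPU-side arrays
3. Create the Buffers for the Mesh data on the GPU
4. Send the Mesh data and the other parameters to the GPU
After that, on every Update, apply the new vertices obtained from the GPU to the mesh.
How is the GPU computation implemented?
Quite simply: normalize each vertex in model space! Picture it: once every vertex position vector is normalized, the model becomes a sphere.
In the actual code we also need to compute the normals at the same time; if the normals do not change, the lighting on the object looks very strange. So how do we compute the normals? Very simply: the normalized position of each original cube vertex is the normal vector of the final sphere!
To get a "breathing" effect, add a sine function that controls the blend factor.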
```
float delta = (Mathf.Sin(Time.time) + 1)/ 2;
```
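The per-vertex math can be sketched on the CPU to see what delta does. This is a plain C# illustration under assumptions, not the linked kernel: the class and method names are hypothetical, radius is an assumed parameter, and the blend is taken to be a linear interpolation between the original cube position and the same position pushed onto the sphere.

```csharp
using System;
using System.Numerics;

// Hypothetical CPU-side sketch, for illustration only.
public static class CubeToSphereSketch
{
    // delta = 0 keeps the cube vertex, delta = 1 puts it on a sphere of the given radius.
    public static Vector3 Deform(Vector3 initialPos, float radius, float delta)
    {
        Vector3 spherePos = Vector3.Normalize(initialPos) * radius;
        return Vector3.Lerp(initialPos, spherePos, delta);
    }

    // Same shape as the article's delta: oscillates between 0 and 1 over time.
    public static float BreathingDelta(float timeSeconds)
    {
        return ((float)Math.Sin(timeSeconds) + 1f) / 2f;
    }
}
```

A cube corner at (1, 1, 1) has length of about 1.732; with delta = 1 and radius 1 it moves to length 1, while a face center at (0, 0, 1) does not move at all. That difference is what rounds the cube off.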
The code is a bit long, so here are links instead.
Code at this stage:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Mesh_Cube2Sphere/Assets/Shaders/MeshDeform.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Mesh_Cube2Sphere/Assets/Scripts/MeshDeform.cs
6. Summary / Quiz
How should this struct be defined on the GPU?
```
struct Circle
{
    public Vector2 origin;
    public Vector2 velocity;
    public float radius;
}
```
How should the ComputeBuffer size be set for this struct?
```
struct Circle
{
    public Vector2 origin;
    public Vector2 velocity;
    public float radius;
}
```
Why is the following code wrong?
```
StructuredBuffer<float3> positions;
//Inside a kernel
...
positions[id.x] = fixed3(1,0,0);
```
References
Indirect Compute Shader
2024-05-27