Remo

Tags: Compute Shader

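Compute Shader Study Notes (4): Grass Rendering
Project repository:
https://github.com/Remyuu/Unity-Compute-Shader-Learn
L5 Grass Rendering
The current result is frankly ugly and many details are unfinished; it merely "works". I am still a beginner, so corrections are welcome wherever the writing or the implementation falls short.
Key points:
- Grass rendering approaches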
- UNITY_PROCEDURAL_INSTANCING_ENABLED
- bounds.extents
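- Raycasting
- Rodrigues rotation
- Quaternion rotation
Preface 1
Reference articles for this preface:
There are many ways to render grass.
The simplest is to put a grass texture straight onto the ground.
Dragging individual grass meshes into the scene is also common. This gives you a lot of room to work, since every blade is under your control. Batching and similar techniques can cut the CPU-to-GPU submission cost, but the approach wears out the Ctrl, C, V and D keys on your keyboard. That said, in the Transform component you can type L(a, b) into a field to distribute the selected objects evenly between a and b, or R(a, b) to randomize them. See the official documentation for more of these expressions.
You can also combine geometry shaders with tessellation shaders. It looks like a good approach, but one shader corresponds to only one kind of geometry (grass); to generate flowers or rocks on the same mesh you have to modify the geometry shader. That is not the worst part. The real problem is that many mobile devices, and Metal as a whole, do not support geometry shaders at all, and where they are supported they may only be emulated in software, with poor performance. The grass mesh is also recomputed every frame, which wastes performance.
Billboard grass is another widespread, time-tested method. It works very well when you do not need a high-fidelity image. You simply render a Quad with a texture (alpha clipped), and DrawProcedural is enough. It only holds up at a distance, though; up close the trick is obvious.
Unity's terrain system can also paint very nice grass, and it uses instancing to keep performance up. Its best feature is the brush tool. If the terrain system is not part of your workflow, third-party plugins can do the same.
While researching I also came across a technique called Impostors. It combines the vertex savings of billboards with the ability to faithfully reproduce an object from multiple angles, which I find quite interesting. The technique "photographs" a real grass mesh from many angles ahead of time and stores the results in a texture. At runtime it picks the appropriate texture for the current camera direction. It is essentially an upgraded billboard. I think Impostors suit large objects that the player may view from many angles, such as trees or complex buildings. Problems can appear when the camera gets very close or moves between two captured angles. A reasonable scheme is: mesh-based grass at very close range, Impostors at medium range, billboards in the distance.
The method implemented in this article is based on GPU Instancing and is best described as "per-blade mesh grass". Games such as Ghost of Tsushima, Genshin Impact and The Legend of Zelda: Breath of the Wild use this approach. Every blade is its own entity, and lighting and shadows look quite convincing.
Rendering pipeline:
Preface 2
Unity's instancing is fairly complex and I have only seen a small part of it, so please point out any mistakes. The code so far follows the documentation. Platforms that currently support GPU instancing: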
- Windows: DX11 and DX12 with SM 4.0 and above / OpenGL 4.1 and above
- OS X and Linux: OpenGL 4.1 and above
- Mobile: OpenGL ES 3.0 and above / Metal
- PlayStation 4
- Xbox One
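Also, Graphics.DrawMeshInstancedIndirect is now marked obsolete; Graphics.RenderMeshIndirect should be used instead. As far as I understand, that function takes care of the bounding box for you, but that is a topic for later. See the official RenderMeshIndirect documentation for details. This article is also helpful:
https://zhuanlan.zhihu.com/p/403885438
GPU Instancing works by issuing a single draw call for many objects that share the same Mesh. The CPU gathers all the per-instance data, packs it into arrays and sends it to the GPU in one go. The limitation is that these objects must share the same Material and Mesh. This is why so much grass can be drawn at once while keeping performance high. To draw millions of meshes with GPU Instancing, a few rules apply:
- All meshes must use the same Material
- GPU Instancing must be enabled (ticked)
- The shader must support instancing
- Skinned Mesh Renderer is not supported
Because Skinned Mesh Renderer is not supported, the previous article bypassed it and passed the meshes of individual keyframes straight to the GPU, which is also why that article ended with the question it did.
Unity's instancing comes in two main flavors: GPU Instancing and Procedural Instancing (which involves compute shaders and indirect drawing). There is also the stereo rendering path (UNITY_STEREO_INSTANCING_ENABLED), which I will not go into here. In a shader, the former uses #pragma multi_compile_instancing and the latter uses #pragma instancing_options procedural:setup. See the official documentation page "Creating shaders that support GPU instancing" for details.
Note that SRP pipelines currently do not support custom GPU Instancing shaders; only BIRP does.
Then there is UNITY_PROCEDURAL_INSTANCING_ENABLED. This macro indicates whether Procedural Instancing is enabled. With a compute shader or the indirect drawing API, per-instance properties (position, color and so on) can be computed on the GPU and used directly for rendering without CPU involvement. In Unity's shader source, the core code around this macro is: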
```
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    #ifndef UNITY_INSTANCING_PROCEDURAL_FUNC
        #error "UNITY_INSTANCING_PROCEDURAL_FUNC must be defined."
    #else
        void UNITY_INSTANCING_PROCEDURAL_FUNC(); // forward declaration of the procedural function
        #define DEFAULT_UNITY_SETUP_INSTANCE_ID(input)      { UnitySetupInstanceID(UNITY_GET_INSTANCE_ID(input)); UNITY_INSTANCING_PROCEDURAL_FUNC();}
    #endif
#else
    #define DEFAULT_UNITY_SETUP_INSTANCE_ID(input)          { UnitySetupInstanceID(UNITY_GET_INSTANCE_ID(input));}
#endif
```
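The shader must therefore define a UNITY_INSTANCING_PROCEDURAL_FUNC function, which in practice is the setup() function. Without it, compilation stops with the #error above.
Typically setup() reads the entry for the current instance (unity_InstanceID) from the buffer and computes that instance's position, transform matrix, color, metallic value or any custom data.
GPU Instancing is only one of Unity's many optimization techniques; there is plenty more to learn.
1. Swaying 3-Quad grass
Everything this chapter needs from compute shaders was covered in the previous article; only the setting changes. Here is a rough sketch of the idea.
The implementation uses GPU Instancing, i.e. rendering a large batch of meshes at once. The core is a single line: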
```
Graphics.DrawMeshInstancedIndirect(mesh, 0, material, bounds, argsBuffer);
```
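The mesh consists of three Quads, six triangles in total.
A texture with alpha test goes on top.
Per-clump data:
- Position
- Lean angle
- Random noise value (used to randomize the lean angle)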
```
public Vector3 position; // world-space position (has to be computed)
public float lean;
public float noise;
public GrassClump( Vector3 pos){
    position.x = pos.x;
    position.y = pos.y;
    position.z = pos.z;
    lean = 0;
    noise = Random.Range(0.5f, 1);
    if (Random.value < 0.5f) noise = -noise;
}
```
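Pass the buffer of grass to render (with world positions computed) to the GPU. First decide where the grass grows and how much of it there is. Take the AABB of the current object's mesh (assumed to be a Plane mesh for now).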
```
Bounds bounds = mf.sharedMesh.bounds;
Vector3 clumps = bounds.extents;
```
With the extents known, scatter grass randomly over the xOz plane.
Note that these positions are still in object space, so they must be transformed from Object Space to World Space.
```
pos = transform.TransformPoint(pos);
```
Then combine the density parameter with the object's scale to work out how many clumps to render in total.
```
Vector3 vec = transform.localScale / 0.1f * density;
clumps.x *= vec.x;
clumps.z *= vec.z;
int total = (int)clumps.x * (int)clumps.z;
```
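The compute shader runs one thread per clump, and the clump count will most likely not be a multiple of the thread group size. The count is therefore rounded up to a whole number of thread groups. In this setup, with a density factor of 1 the number of clumps rendered equals the number of threads in one thread group.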
```
groupSize = Mathf.CeilToInt((float)total / (float)threadGroupSize);
int count = groupSize * (int)threadGroupSize;
```
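The rounding above is an integer ceiling division. Below is a minimal sketch in plain C (runnable outside Unity); the function names `group_count` and `padded_count` are made up for illustration and do not come from the project.

```c
#include <stdint.h>

/* Number of thread groups needed so that groups * group_size >= total.
   Integer form of the ceil(total / group_size) used in the C# snippet. */
uint32_t group_count(uint32_t total, uint32_t group_size) {
    return (total + group_size - 1u) / group_size;
}

/* Buffer length padded up to a whole number of thread groups. */
uint32_t padded_count(uint32_t total, uint32_t group_size) {
    return group_count(total, group_size) * group_size;
}
```

For example, 1000 clumps with a group size of 256 need 4 groups, so the buffer is padded to 1024 entries and the last 24 threads work on padding entries.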
Let the compute shader calculate each clump's lean angle.
```
GrassClump clump = clumpsBuffer[id.x];
clump.lean = sin(time) * maxLean * clump.noise;
clumpsBuffer[id.x] = clump;
```
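Uploading positions and rotation angles to the GPU buffer is not the end of it; the Material still decides what each instance finally looks like before Graphics.DrawMeshInstancedIndirect can run.
During rendering, before the instance is shaded (inside the procedural:setup function), unity_InstanceID identifies which clump is being drawn. Fetch that clump's world-space position and lean value.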
```
GrassClump clump = clumpsBuffer[unity_InstanceID];
_Position = clump.position;
_Matrix = create_matrix(clump.position, clump.lean);
```
The rotation-plus-translation matrix:
```
float4x4 create_matrix(float3 pos, float theta){
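    float c = cos(theta); // cosine of the rotation angle
    float s = sin(theta); // sine of the rotation angle
    // 4x4 matrix: rotation about the Z axis plus translation
    return float4x4(
        c, -s, 0, pos.x, // row 1: rotation in the XY plane, translation x
        s,  c, 0, pos.y, // row 2: rotation in the XY plane, translation y
        0,  0, 1, pos.z, // row 3: Z is unchanged by the rotation, translation z
        0,  0, 0, 1     // row 4: homogeneous coordinate
    );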
    );
}
```
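How is this derived? Substituting the axis (0,0,1) into Rodrigues' rotation formula gives a 3×3 rotation matrix, which is then extended to homogeneous coordinates with the translation in the last column. The result is the matrix in the code.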
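For reference, the substitution can be written out as follows. This is the standard matrix form of Rodrigues' formula for a unit axis $\mathbf{k}$ and angle $\theta$, stated as a sketch rather than taken from the project; $\mathbf{p}$ stands for `pos`.

```latex
R = \cos\theta\, I + \sin\theta\,[\mathbf{k}]_\times + (1-\cos\theta)\,\mathbf{k}\mathbf{k}^{\mathsf T},
\qquad
[\mathbf{k}]_\times =
\begin{pmatrix} 0 & -k_z & k_y \\ k_z & 0 & -k_x \\ -k_y & k_x & 0 \end{pmatrix}

\mathbf{k} = (0,0,1):\quad
R_z(\theta) =
\begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
M = \begin{pmatrix} R_z(\theta) & \mathbf{p} \\ \mathbf{0}^{\mathsf T} & 1 \end{pmatrix}
```

The four rows of $M$ are exactly the four rows returned by create_matrix.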
Multiplying this matrix by the object-space vertex gives the leaned and translated vertex position.
```
v.vertex.xyz *= _Scale;
float4 rotatedVertex = mul(_Matrix, v.vertex);
v.vertex = rotatedVertex;
```
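Now a problem appears. The grass is not a flat plane but a 3D shape made of three Quads.
If all vertices are simply rotated about the z axis, the roots of the grass shift badly.
So use v.texcoord.y to lerp between the unrotated and rotated vertex positions. The higher the texture coordinate's Y value (the closer the vertex is to the top of the model), the more the rotation affects it. Since Y is 0 at the roots, after the lerp the roots stay put.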
```
v.vertex.xyz *= _Scale;
float4 rotatedVertex = mul(_Matrix, v.vertex);
// v.vertex = rotatedVertex;
v.vertex.xyz += _Position;
v.vertex = lerp(v.vertex, rotatedVertex, v.texcoord.y);
```
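The result is poor; the grass looks fake. Quad grass like this is only usable in the distance.
- Stiff swaying
- Stiff blades
- Poor lighting
Code for this version:
2. Procedural grass blades
The previous section used a few Quads with an alpha texture and a sine wave for motion, with mediocre results. This section improves on it with procedurally built blades and Perlin noise.
Define the blade's vertices, normals and UVs in C# and upload them to the GPU as a Mesh.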
```
Vector3[] vertices =
{
    new Vector3(-halfWidth, 0, 0),
    new Vector3( halfWidth, 0, 0),
    new Vector3(-halfWidth, rowHeight, 0),
    new Vector3( halfWidth, rowHeight, 0),
    new Vector3(-halfWidth*0.9f, rowHeight*2, 0),
    new Vector3( halfWidth*0.9f, rowHeight*2, 0),
    new Vector3(-halfWidth*0.8f, rowHeight*3, 0),
    new Vector3( halfWidth*0.8f, rowHeight*3, 0),
    new Vector3( 0, rowHeight*4, 0)
};
Vector3 normal = new Vector3(0, 0, -1);
Vector3[] normals =
{
    normal, normal, normal, normal, normal, normal, normal, normal, normal
};
Vector2[] uvs =
{
    new Vector2(0,0),
    new Vector2(1,0),
    new Vector2(0,0.25f),
    new Vector2(1,0.25f),
    new Vector2(0,0.5f),
    new Vector2(1,0.5f),
    new Vector2(0,0.75f),
    new Vector2(1,0.75f),
    new Vector2(0.5f,1)
};
```
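A Unity Mesh also needs a vertex winding order. Unity treats triangles whose vertices appear clockwise to the viewer as front-facing; wind them the other way with backface culling enabled and nothing shows up.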
```
int[] indices =
{
    0,1,2,1,3,2,//row 1
    2,3,4,3,5,4,//row 2
    4,5,6,5,7,6,//row 3
    6,7,8//row 4
};
mesh.SetIndices(indices, MeshTopology.Triangles, 0);
```
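On the C# side, set the wind direction, strength and noise scale, pack them into a float4 and pass it to the compute shader, which computes the sway direction of each blade.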
```
Vector4 wind = new Vector4(Mathf.Cos(theta), Mathf.Sin(theta), windSpeed, windScale);
```
The data structure for one blade:
```
struct GrassBlade
{
    public Vector3 position;
    public float bend; // random blade lean
    public float noise;// noise value computed in the compute shader
    public float fade; // random per-blade brightness
    public float face; // blade facing angle
    public GrassBlade( Vector3 pos)
    {
        position.x = pos.x;
        position.y = pos.y;
        position.z = pos.z;
        bend = 0;
        noise = Random.Range(0.5f, 1) * 2 - 1;
        fade = Random.Range(0.5f, 1);
        face = Random.Range(0, Mathf.PI);
    }
}
```
At this point all blades face the same direction. In the setup function, first set the blade's facing.
```
// rotation matrix about the Y axis (facing)
float4x4 rotationMatrixY = AngleAxis4x4(blade.position, blade.face, float3(0,1,0));
```
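The blade-lean logic (AngleAxis4x4 includes the translation; the figure below shows only the lean, without random facing, so remember to add the translation in code if you want to reproduce it):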
```
// rotation matrix about the X axis (lean)
float4x4 rotationMatrixX = AngleAxis4x4(float3(0,0,0), blade.bend, float3(1,0,0));
```
Then combine the two rotation matrices.
```
_Matrix = mul(rotationMatrixY, rotationMatrixX);
```
The lighting now looks very odd, because the normals have not been transformed.
```
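// Normal matrix. The general form is the inverse transpose of the 3x3 part; this code applies the transpose only
float3x3 normalMatrix = (float3x3)transpose(((float3x3)_Matrix));
// transform the normal
v.normal = mul(normalMatrix, v.normal);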
```
The transpose helper used here:
```
float3x3 transpose(float3x3 m)
{
    return float3x3(
        float3(m[0][0], m[1][0], m[2][0]), // Column 1
        float3(m[0][1], m[1][1], m[2][1]), // Column 2
        float3(m[0][2], m[1][2], m[2][2])  // Column 3
    );
}
```
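For readability, here is the homogeneous transform matrix as well, now upgraded to the general axis-angle form given by Rodrigues' rotation formula: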
```
float4x4 AngleAxis4x4(float3 pos, float angle, float3 axis){
    float c, s;
    sincos(angle*2*3.14, s, c);
    float t = 1 - c;
    float x = axis.x;
    float y = axis.y;
    float z = axis.z;
    return float4x4(
        t * x * x + c    , t * x * y - s * z, t * x * z + s * y, pos.x,
        t * x * y + s * z, t * y * y + c    , t * y * z - s * x, pos.y,
        t * x * z - s * y, t * y * z + s * x, t * z * z + c    , pos.z,
        0,0,0,1
        );
}
```
What about generating grass on uneven ground?
Only the logic for the initial height of each blade needs to change: add a MeshCollider and use raycasts.
```
bladesArray = new GrassBlade[count];
gameObject.AddComponent<MeshCollider>();
RaycastHit hit;
Vector3 v = new Vector3();
Debug.Log(bounds.center.y + bounds.extents.y);
v.y = (bounds.center.y + bounds.extents.y);
v = transform.TransformPoint(v);
float heightWS = v.y + 0.01f; // small margin for floating-point error
v.Set(0, 0, 0);
v.y = (bounds.center.y - bounds.extents.y);
v = transform.TransformPoint(v);
float neHeightWS = v.y;
float range = heightWS - neHeightWS;
// heightWS += 10; // raise the start height a little; tune the margin yourself
int index = 0;
int loopCount = 0;
while (index < count && loopCount < (count * 10))
{
    loopCount++;
    Vector3 pos = new Vector3( Random.value * bounds.extents.x * 2 - bounds.extents.x + bounds.center.x,
        0,
        Random.value * bounds.extents.z * 2 - bounds.extents.z + bounds.center.z);
    pos = transform.TransformPoint(pos);
    pos.y = heightWS;
    if (Physics.Raycast(pos, Vector3.down, out hit))
    {
        pos.y = hit.point.y;
        GrassBlade blade = new GrassBlade(pos);
        bladesArray[index++] = blade;
    }
}
```
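Each blade's position is raycast to find its correct height.
A further tweak: the higher the altitude, the sparser the grass.
As in the figure above, take the ratio of the two green arrows; the higher the altitude, the lower the spawn probability.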
```
float deltaHeight = (pos.y - neHeightWS) / range;
if (Random.value > deltaHeight)
{
    // spawn a blade here
}
```
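Code for this version:
Lighting and shading now look right.
3. Interactive grass
In the previous section we rotated the blade's facing and then its lean. Now we add one more rotation: when an object approaches, the grass bends away from it. That is yet another rotation, and it is awkward to set up with matrices, so we switch to quaternions. The quaternions are computed in the compute shader, stored in the blade struct and passed to the material. Finally the vertex shader converts the quaternion back into an affine matrix and applies the rotation.
Random width and height are added here as well. Every blade uses the same mesh, so the height cannot be changed by editing the mesh; the only option is a vertex offset in the vertex function.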
```
// C#
[Range(0,0.5f)]
public float width = 0.2f;
[Range(0,1f)]
public float rd_width = 0.1f;
[Range(0,2)]
public float height = 1f;
[Range(0,1f)]
public float rd_height = 0.2f;
    GrassBlade blade = new GrassBlade(pos);
    blade.height = Random.Range(-rd_height, rd_height);
    blade.width = Random.Range(-rd_width, rd_width);
    bladesArray[index++] = blade;
// start of setup()
GrassBlade blade = bladesBuffer[unity_InstanceID];
_HeightOffset = blade.height_offset;
_WidthOffset = blade.width_offset;
// start of vert()
float tempHeight = v.vertex.y * _HeightOffset;
float tempWidth = v.vertex.x * _WidthOffset;
v.vertex.y += tempHeight;
v.vertex.x += tempWidth;
```
To summarize, one entry of the blade buffer now stores:
```
struct GrassBlade{
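    public Vector3 position; // world-space position, set at init
    public float height; // height offset, set at init
    public float width; // width offset, set at init
    public float dir; // blade facing, set at init
    public float fade; // random per-blade brightness, set at init
    public Quaternion quaternion; // rotation, computed in the compute shader and read in vert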
    public float padding;
    public GrassBlade( Vector3 pos){
        position.x = pos.x;
        position.y = pos.y;
        position.z = pos.z;
        height = width = 0;
        dir = Random.Range(0, 180);
        fade = Random.Range(0.99f, 1);
        quaternion = Quaternion.identity;
        padding = 0;
    }
}
int SIZE_GRASS_BLADE = 12 * sizeof(float);
```
A quaternion q that rotates vector v1 onto vector v2:
```
float4 MapVector(float3 v1, float3 v2){
    v1 = normalize(v1);
    v2 = normalize(v2);
    float3 v = v1+v2;
    v = normalize(v);
    float4 q = 0;
    q.w = dot(v, v2);
    q.xyz = cross(v, v2);
    return q;
}
```
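To combine two rotations expressed as quaternions, multiply them (the order matters).
Given two quaternions $q_1 = w_1 + x_1 i + y_1 j + z_1 k$ and $q_2 = w_2 + x_2 i + y_2 j + z_2 k$, their product is:
$$
\begin{aligned}
q_1 q_2 ={}& (w_1 w_2 - x_1 x_2 - y_1 y_2 - z_1 z_2) \\
&+ (w_1 x_2 + x_1 w_2 + y_1 z_2 - z_1 y_2)\, i \\
&+ (w_1 y_2 - x_1 z_2 + y_1 w_2 + z_1 x_2)\, j \\
&+ (w_1 z_2 + x_1 y_2 - y_1 x_2 + z_1 w_2)\, k
\end{aligned}
$$
where $w_1$ and $(x_1, y_1, z_1)$ are the real part and imaginary components of $q_1$, and $w_2$ and $(x_2, y_2, z_2)$ are those of $q_2$.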
```
float4 quatMultiply(float4 q1, float4 q2) {
    // Layout is (x, y, z, w): w is the real part, (x, y, z) the imaginary part
    // q1 = w1 + x1 i + y1 j + z1 k,  q2 = w2 + x2 i + y2 j + z2 k
    // Result = q1 * q2
    return float4(
        q1.w * q2.x + q1.x * q2.w + q1.y * q2.z - q1.z * q2.y, // X component
        q1.w * q2.y - q1.x * q2.z + q1.y * q2.w + q1.z * q2.x, // Y component
        q1.w * q2.z + q1.x * q2.y - q1.y * q2.x + q1.z * q2.w, // Z component
        q1.w * q2.w - q1.x * q2.x - q1.y * q2.y - q1.z * q2.z  // W (real) component
    );
}
```
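Because the product is not commutative, it helps to check the order on the basis elements. The plain-C sketch below uses the same component layout as quatMultiply; the name `quat_mul` is made up for this example. With the usual convention v' = q v q^{-1}, the product q1 * q2 applies q2 first and then q1.

```c
typedef struct { double x, y, z, w; } Quat;   /* w is the real part */

/* Hamilton product q1 * q2, term by term as in quatMultiply. */
Quat quat_mul(Quat q1, Quat q2) {
    Quat r = {
        q1.w*q2.x + q1.x*q2.w + q1.y*q2.z - q1.z*q2.y,   /* x */
        q1.w*q2.y - q1.x*q2.z + q1.y*q2.w + q1.z*q2.x,   /* y */
        q1.w*q2.z + q1.x*q2.y - q1.y*q2.x + q1.z*q2.w,   /* z */
        q1.w*q2.w - q1.x*q2.x - q1.y*q2.y - q1.z*q2.z    /* w (real) */
    };
    return r;
}
```

Multiplying i by j gives k, while j by i gives -k, so swapping the arguments changes the resulting rotation.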
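To decide which way the grass falls, we need the position of the interacting object, the trampler, taken from its Transform component. The position is sent to the compute shader every frame with SetVector, so the integer property ID returned by Shader.PropertyToID is cached instead of looking the name up by string each time. We also need to decide how large an area of grass bends and how the transition between bent and unbent looks, so a trampleRadius is passed to the GPU. It is constant and does not change per frame, so setting it once by string is fine.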
```
// CSharp
public Transform trampler;
[Range(0.1f,5f)]
public float trampleRadius = 3f;
...
Init(){
    shader.SetFloat("trampleRadius", trampleRadius);
    tramplePosID = Shader.PropertyToID("tramplePos");
}
Update(){
    shader.SetVector(tramplePosID, trampler.position);
}
```
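This section moves all the rotation work into the compute shader, computes it in one pass and returns a single quaternion to the material. In the kernel below, q1 is the wind-driven bend, q2 is the random facing, and qt is the bend caused by the trampler. The strength of the interaction could be exposed as a coefficient in the Inspector.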
```
[numthreads(THREADGROUPSIZE,1,1)]
void BendGrass (uint3 id : SV_DispatchThreadID)
{
    GrassBlade blade = bladesBuffer[id.x];
    float3 relativePosition = blade.position - tramplePos.xyz;
    float dist = length(relativePosition);
    float4 qt;
    if (dist<trampleRadius){
        float eff = ((trampleRadius - dist)/trampleRadius) * 0.6;
        qt = MapVector(float3(0,1,0), float3(relativePosition.x*eff,1,relativePosition.z*eff));
    }else{
        qt = MapVector(float3(0,1,0),float3(0,1,0));
    }
    float2 offset = (blade.position.xz + wind.xy * time * wind.z) * wind.w;
    float noise = perlin(offset.x, offset.y) * 2 - 1;
    noise *= maxBend;
    float4 q1 = MapVector(float3(0,1,0), (float3(wind.x * noise,1,wind.y*noise)));
    float faceTheta = blade.dir * 3.1415f / 180.0f;
    float4 q2 = MapVector(float3(1,0,0),float3(cos(faceTheta),0,sin(faceTheta)));
    blade.quaternion = quatMultiply(qt,quatMultiply(q2,q1));
    bladesBuffer[id.x] = blade;
}
```
Converting the quaternion to a rotation matrix:
```
float4x4 quaternion_to_matrix(float4 quat)
{
    float4x4 m = float4x4(float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0), float4(0, 0, 0, 0));
    float x = quat.x, y = quat.y, z = quat.z, w = quat.w;
    float x2 = x + x, y2 = y + y, z2 = z + z;
    float xx = x * x2, xy = x * y2, xz = x * z2;
    float yy = y * y2, yz = y * z2, zz = z * z2;
    float wx = w * x2, wy = w * y2, wz = w * z2;
    m[0][0] = 1.0 - (yy + zz);
    m[0][1] = xy - wz;
    m[0][2] = xz + wy;
    m[1][0] = xy + wz;
    m[1][1] = 1.0 - (xx + zz);
    m[1][2] = yz - wx;
    m[2][0] = xz - wy;
    m[2][1] = yz + wx;
    m[2][2] = 1.0 - (xx + yy);
    m[0][3] = _Position.x;
    m[1][3] = _Position.y;
    m[2][3] = _Position.z;
    m[3][3] = 1.0;
    return m;
}
```
Then apply it.
```
void vert(inout appdata_full v, out Input data)
{
    UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    float tempHeight = v.vertex.y * _HeightOffset;
    float tempWidth = v.vertex.x * _WidthOffset;
    v.vertex.y += tempHeight;
    v.vertex.x += tempWidth;
    // apply the model vertex transform
    v.vertex = mul(_Matrix, v.vertex);
    v.vertex.xyz += _Position;
    // transform the normal with the transpose of _Matrix
    v.normal = mul((float3x3)transpose(_Matrix), v.normal);
    #endif
}
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        // fetch the compute shader's results
        GrassBlade blade = bladesBuffer[unity_InstanceID];
        _HeightOffset = blade.height_offset;
        _WidthOffset = blade.width_offset;
        _Fade = blade.fade; // brightness
        _Matrix = quaternion_to_matrix(blade.quaternion); // final rotation matrix
        _Position = blade.position; // position
    #endif
}
```
Code for this version:
4. Summary / quiz
How do you programmatically get the thread group sizes of a kernel?
When defining a Mesh in code, the number of normals must be the same as the number of vertex positions. True or false.
2024-06-04
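Compute Shader Study Notes (3): Particle Effects and Flocking Simulation
Continuing from the previous article
remoooo: Compute Shader Study Notes (2): Post-Processing Effects
L4 Particle effects and flocking simulation
This chapter uses a compute shader to generate particles and shows how to use DrawProcedural and DrawMeshInstancedIndirect, i.e. GPU Instancing.
Key points:
- Compute shader, Material, C# script and shader working together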
- Graphics.DrawProcedural
- material.SetBuffer()
- xorshift random number algorithm
- Flocking simulation
- Graphics.DrawMeshInstancedIndirect
- Rotation, translation and scale matrices; homogeneous coordinates
- Surface Shader
- ComputeBufferType.Default
- #pragma instancing_options procedural:setup
- unity_InstanceID
- Skinned Mesh Renderer
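- Data alignment
1. Introduction and setup
Besides processing large amounts of data in parallel, compute shaders have another key advantage: the buffer lives on the GPU. Data processed by a compute shader can therefore be handed directly to the shader attached to a Material, i.e. the vertex/fragment shader. The key point is that a material can call SetBuffer() just like a compute shader and read straight from the GPU buffer!
A particle system is a good showcase for the compute shader's parallel power.
During rendering, the vertex shader reads each particle's position and other attributes from the compute buffer and turns them into vertices on screen. The fragment shader then produces pixels from that vertex information (position, color and so on). With Graphics.DrawProcedural, Unity renders these shader-generated vertices directly, with no predefined mesh and no Mesh Renderer, which is particularly effective for large numbers of particles.
2. Hello, particles
The steps are simple: define the particle data in C# (position, velocity and lifetime), upload the initial data to a buffer, and bind the buffer to both the compute shader and the Material. At render time, call Graphics.DrawProceduralNow inside OnRenderObject() to draw the particles efficiently.
Create a new scene and build this effect: a million particles that follow the mouse and burst into life, as shown below:
Writing this sets my thoughts wandering. A particle's life is brief, lit in an instant like a spark and gone like a shooting star. Whatever hardships I go through, I am only one speck among billions, ordinary and small. These particles may drift randomly through space (their spawn positions come from the "Xorshift" algorithm) and may each have a unique color, yet they cannot escape the fate the program has set for them. Is that not a portrait of my own life? Playing my part step by step, unable to slip free of invisible constraints.
"God is dead! And we who have killed him, how could we not feel the greatest pain?" (Friedrich Nietzsche)
Nietzsche was not only announcing the fading of religious belief; he was pointing at the nihilism facing modern people: without the traditional moral and religious pillars, people feel unprecedented loneliness and a loss of direction. Particles are defined and created in a C# script, then move and die by fixed rules, which resembles the condition of modern people in the universe as Nietzsche described it. Everyone tries to find their own meaning, but in the end remains bound by wider social and cosmic rules.
Life is full of unavoidable pain, reflecting the emptiness and loneliness inherent in human existence. Heartbreak, partings in life and death, setbacks at work, and the particle death logic we are about to write all bear out what Nietzsche expressed: nothing in life is permanent. Particles in the same buffer will inevitably vanish at some point, which echoes the loneliness of modern people that Nietzsche described; an individual may feel more isolated and helpless than ever, so each of us is a lone warrior who must learn to face the storm within and the indifference without.
But that is all right. The particles in this article will also respawn after they die and embrace their buffer in the best shape they can.
Summer will come around again. People who meet will meet again.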
Code for this version, which you can copy and run (it is commented):
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Shaders/ParticleFun.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Scripts/ParticleFun.cs
- Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_First_Particle/Assets/Shaders/Particle.shader
Enough rambling; let's look at the C# script first.
As usual, define the particle struct, initialize an array of it, and upload it to the GPU. The key part is the last three lines, which bind the buffer to the shaders. The elided code is routine and is summarized in comments.
```
struct Particle{
    public Vector3 position; // particle position
    public Vector3 velocity; // particle velocity
    public float life;       // remaining lifetime
}
ComputeBuffer particleBuffer; // buffer on the GPU
...
// inside Init()
    // initialize the particle array
    Particle[] particleArray = new Particle[particleCount];
    for (int i = 0; i < particleCount; i++){
        // generate a random position and normalize it
        ...
        // set the particle's initial position and velocity
        ... 
        // set the particle's lifetime
        particleArray[i].life = Random.value * 5.0f + 1.0f;
    }
    // create the compute buffer and upload the data
    ...
    // find the kernel ID in the compute shader
    ...
    // bind the compute buffer to both shaders
    shader.SetBuffer(kernelID, "particleBuffer", particleBuffer);
    material.SetBuffer("particleBuffer", particleBuffer);
    material.SetInt("_PointSize", pointSize);
```
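Now the key rendering stage, OnRenderObject(). material.SetPass selects the material pass to render with. DrawProceduralNow draws geometry without a traditional mesh. MeshTopology.Points sets the topology to points, so the GPU treats each vertex as a point and forms no lines or faces between vertices. The second argument, 1, is the vertex count per instance, and particleCount is the instance count, i.e. how many points the GPU should draw in total; the vertex shader uses the instance ID to pick the particle.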
```
void OnRenderObject()
{
    material.SetPass(0);
    Graphics.DrawProceduralNow(MeshTopology.Points, 1, particleCount);
}
```
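Getting the current mouse position. OnGUI() may be called several times per frame. The z value is the camera's near clip plane plus an offset; 14 is added here to get a world position at a visually suitable depth (adjust to taste).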
```
void OnGUI()
{
    Vector3 p = new Vector3();
    Camera c = Camera.main;
    Event e = Event.current;
    Vector2 mousePos = new Vector2();
    // Get the mouse position from Event.
    // Note that the y position from Event is inverted.
    mousePos.x = e.mousePosition.x;
    mousePos.y = c.pixelHeight - e.mousePosition.y;
    p = c.ScreenToWorldPoint(new Vector3(mousePos.x, mousePos.y, c.nearClipPlane + 14));
    cursorPos.x = p.x;
    cursorPos.y = p.y;
}
```
The ComputeBuffer particleBuffer has now been bound to both the compute shader and the shader.
First the compute shader's data structures. Nothing special.
```
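// particle data structure
struct Particle
{
    float3 position;  // particle position
    float3 velocity;  // particle velocity
    float life;       // remaining lifetime
};
// structured buffer holding the particle data, readable and writable on the GPU
RWStructuredBuffer<Particle> particleBuffer;
// variables set from the CPU
float deltaTime;       // time elapsed since the previous frame
float2 mousePosition;  // current mouse position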
```
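Here is a quick look at a very handy random number generator, the xorshift algorithm. It will be used to randomize particle directions, as in the figure above, where particles move in random 3D directions.
- Reference: https://en.wikipedia.org/wiki/Xorshift
- Original paper: https://www.jstatsoft.org/article/view/v008i14
George Marsaglia proposed the algorithm in 2003. It is extremely fast and needs very little state, and even the simplest xorshift variants have a long period.
Its basic operations are shift and exclusive-or (xor), hence the name. The core idea is to keep a non-zero state variable and generate random numbers by applying a series of shifts and xors to it.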
```
// state variable of the random number generator
uint rng_state;
uint rand_xorshift() {
    // Xorshift algorithm from George Marsaglia's paper
    rng_state ^= (rng_state << 13);  // xor the state with itself shifted left by 13
    rng_state ^= (rng_state >> 17);  // xor the updated state with itself shifted right by 17
    rng_state ^= (rng_state << 5);   // xor once more with the state shifted left by 5
    return rng_state;                // the updated state is the random number
}
```
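That is the core of basic xorshift; different shift combinations give different variants. The original paper also describes the xorshift128 variant, which keeps 128 bits of state and updates it with a few shift and xor operations. The code: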
```
// c language Ver
uint32_t xorshift128(void) {
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123; 
    uint32_t t = x ^ (x << 11);
    x = y; y = z; z = w;
    w = w ^ (w >> 19) ^ (t ^ (t >> 8));
    return w;
}
```
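This gives a longer period and better statistical quality. The period of this variant is $2^{128}-1$, which is impressive.
Overall the algorithm is more than adequate for game development, but it is not suitable for cryptography.
When using it in a compute shader, note that the output covers the full uint32 range, so it needs one more mapping, from [0, 2^32-1] to [0, 1]: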
```
float tmp = (1.0 / 4294967296.0);  // conversion factor, 1 / 2^32
float r = float(rand_xorshift()) * tmp;  // random value in [0, 1]
```
Particle directions are signed, so just subtract 0.5 from this value. Random motion along three axes:
```
float f0 = float(rand_xorshift()) * tmp - 0.5;
float f1 = float(rand_xorshift()) * tmp - 0.5;
float f2 = float(rand_xorshift()) * tmp - 0.5;
float3 normalF3 = normalize(float3(f0, f1, f2)) * 0.8f; // normalized direction, scaled down
```
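The generator and the two mappings can be tried on the CPU. Below is a minimal plain-C sketch; it passes the state through a pointer instead of using a global, and it does the conversion in double precision, because in single precision the largest outputs round up to exactly 1.0. The function names are made up for this example.

```c
#include <stdint.h>

/* 32-bit xorshift with the 13/17/5 shift triple; the state must never be 0. */
uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Map the 32-bit output to [0, 1) by multiplying with 1 / 2^32. */
double rand01(uint32_t *state) {
    return (double)xorshift32(state) * (1.0 / 4294967296.0);
}

/* Signed value in [-0.5, 0.5), as used for the direction components. */
double rand_signed(uint32_t *state) {
    return rand01(state) - 0.5;
}
```

In a compute shader each thread needs its own non-zero seed, for example one derived from the thread ID, otherwise every particle draws the same sequence.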
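Each kernel invocation does the following:
- Read the particle's state from the previous frame out of the buffer
- Update the particle (compute velocity, update position and life) and write it back to the buffer
- If life drops below 0, respawn the particle
To respawn, use the xorshift random numbers above for the initial position, set the particle's life and reset its velocity.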
```
// set the respawned particle's position and life
particleBuffer[id].position = float3(normalF3.x + mousePosition.x, normalF3.y + mousePosition.y, normalF3.z + 3.0);
particleBuffer[id].life = 4;  // reset life
particleBuffer[id].velocity = float3(0,0,0);  // reset velocity
```
Finally, the basic data structures in the shader:
```
struct Particle{
    float3 position;
    float3 velocity;
    float life;
};
struct v2f{
    float4 position : SV_POSITION;
    float4 color : COLOR;
    float life : LIFE;
    float size: PSIZE;
};
// particles' data
StructuredBuffer<Particle> particleBuffer;
```
The vertex shader then computes the particle's vertex color and clip-space position, and passes along a point size.
```
v2f vert(uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID){
    v2f o = (v2f)0;
    // Color
    float life = particleBuffer[instance_id].life;
    float lerpVal = life * 0.25f;
    o.color = fixed4(1.0f - lerpVal+0.1, lerpVal+0.1, 1.0f, lerpVal);
    // Position
    o.position = UnityObjectToClipPos(float4(particleBuffer[instance_id].position, 1.0f));
    o.size = _PointSize;
    return o;
}
```
The fragment shader returns the interpolated color.
```
float4 frag(v2f i) : COLOR{
    return i.color;
}
```
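That produces the effect shown above.
3. Quad particles
In the previous section each particle was a single point, which is not very interesting. Now turn each point into a Quad. Unity has no real quad primitive, only a fake one made of two triangles.
Let's get to it, building on the code above. In C#, define the vertex struct and the size of a Quad.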
```
// struct
struct Vertex
{
    public Vector3 position;
    public Vector2 uv;
    public float life;
}
const int SIZE_VERTEX = 6 * sizeof(float);
public float quadSize = 0.1f; // size of the Quad
```
For each particle, set the UVs of its six vertices for the vertex shader, in the order Unity expects.
```
index = i*6;
    //Triangle 1 - bottom-left, top-left, top-right
    vertexArray[index].uv.Set(0,0);
    vertexArray[index+1].uv.Set(0,1);
    vertexArray[index+2].uv.Set(1,1);
    //Triangle 2 - bottom-left, top-right, bottom-right
    vertexArray[index+3].uv.Set(0,0);
    vertexArray[index+4].uv.Set(1,1);
    vertexArray[index+5].uv.Set(1,0);
```
Finally upload to the buffer. halfSize is passed to the compute shader so it can compute the Quad's vertex positions.
```
vertexBuffer = new ComputeBuffer(numVertices, SIZE_VERTEX);
vertexBuffer.SetData(vertexArray);
shader.SetBuffer(kernelID, "vertexBuffer", vertexBuffer);
shader.SetFloat("halfSize", quadSize*0.5f);
material.SetBuffer("vertexBuffer", vertexBuffer);
```
At render time switch from points to triangles, six vertices per particle.
```
void OnRenderObject()
{
    material.SetPass(0);
    Graphics.DrawProceduralNow(MeshTopology.Triangles, 6, numParticles);
}
```
Update the shader to read the vertex data and sample a texture for display. Transparency is handled here with alpha blending.
```
_MainTex("Texture", 2D) = "white" {}     
...
Tags{ "Queue"="Transparent" "RenderType"="Transparent" "IgnoreProjector"="True" }
LOD 200
Blend SrcAlpha OneMinusSrcAlpha
ZWrite Off
...
    struct Vertex{
        float3 position;
        float2 uv;
        float life;
    };
    StructuredBuffer<Vertex> vertexBuffer;
    sampler2D _MainTex;
    v2f vert(uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID)
    {
        v2f o = (v2f)0;
        int index = instance_id*6 + vertex_id;
        float lerpVal = vertexBuffer[index].life * 0.25f;
        o.color = fixed4(1.0f - lerpVal+0.1, lerpVal+0.1, 1.0f, lerpVal);
        o.position = UnityWorldToClipPos(float4(vertexBuffer[index].position, 1.0f));
        o.uv = vertexBuffer[index].uv;
        return o;
    }
    float4 frag(v2f i) : COLOR
    {
        fixed4 color = tex2D( _MainTex, i.uv ) * i.color;
        return color;
    }
```
In the compute shader, add the vertex buffer and halfSize.
```
struct Vertex
{
    float3 position;
    float2 uv;
    float life;
};
RWStructuredBuffer<Vertex> vertexBuffer;
float halfSize;
```
Compute the positions of each Quad's six vertices.
```
//Set the vertex buffer //
    int index = id.x * 6;
    //Triangle 1 - bottom-left, top-left, top-right   
    vertexBuffer[index].position.x = p.position.x-halfSize;
    vertexBuffer[index].position.y = p.position.y-halfSize;
    vertexBuffer[index].position.z = p.position.z;
    vertexBuffer[index].life = p.life;
    vertexBuffer[index+1].position.x = p.position.x-halfSize;
    vertexBuffer[index+1].position.y = p.position.y+halfSize;
    vertexBuffer[index+1].position.z = p.position.z;
    vertexBuffer[index+1].life = p.life;
    vertexBuffer[index+2].position.x = p.position.x+halfSize;
    vertexBuffer[index+2].position.y = p.position.y+halfSize;
    vertexBuffer[index+2].position.z = p.position.z;
    vertexBuffer[index+2].life = p.life;
    //Triangle 2 - bottom-left, top-right, bottom-right  // // 
    vertexBuffer[index+3].position.x = p.position.x-halfSize;
    vertexBuffer[index+3].position.y = p.position.y-halfSize;
    vertexBuffer[index+3].position.z = p.position.z;
    vertexBuffer[index+3].life = p.life;
    vertexBuffer[index+4].position.x = p.position.x+halfSize;
    vertexBuffer[index+4].position.y = p.position.y+halfSize;
    vertexBuffer[index+4].position.z = p.position.z;
    vertexBuffer[index+4].life = p.life;
    vertexBuffer[index+5].position.x = p.position.x+halfSize;
    vertexBuffer[index+5].position.y = p.position.y-halfSize;
    vertexBuffer[index+5].position.z = p.position.z;
    vertexBuffer[index+5].life = p.life;
```
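Done.
Code for this version:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Shaders/QuadParticles.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Scripts/QuadParticles.cs
- Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Quad/Assets/Shaders/QuadParticle.shader
The next section upgrades the mesh to a prefab and simulates the flocking behavior of birds in flight.
4. Flocking simulation
Flocking is an algorithm that simulates the collective motion of animals such as bird flocks and fish schools. It rests on three basic behavior rules proposed by Craig Reynolds at SIGGRAPH 1987, usually called the "Boids" algorithm:
- Separation: particles must not get too close to each other; they need a sense of boundaries. Concretely, look at the particles within a certain radius and compute a direction that avoids collision.
- Alignment: an individual's velocity tends toward the group's average velocity; it needs a sense of belonging. Concretely, compute the average velocity (magnitude and direction) of the particles within visual range. The visual range should follow the actual biology of the birds and is covered in the next section.
- Cohesion: an individual's position tends toward the average position (the center of the group); it needs a sense of safety. Concretely, each particle finds the geometric center of its neighbors and computes a movement vector toward it (the end result is the average position).
Which of the three rules is the hardest to implement?
Answer: Separation. Computing interactions between objects is notoriously expensive: every individual has to be compared against every other, so the time complexity approaches O(n^2), where n is the number of particles. With 1000 particles, for example, each iteration may need close to 500,000 distance computations. In the original paper, the unoptimized algorithm (O(N^2)) took about 95 seconds to render one frame with 80 birds, and a 300-frame animation took roughly 8 hours.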
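The numbers in the answer can be checked with two lines of arithmetic, counting each unordered pair once:

```latex
\binom{n}{2} = \frac{n(n-1)}{2}, \qquad
\binom{1000}{2} = \frac{1000 \cdot 999}{2} = 499\,500

95\ \text{s/frame} \times 300\ \text{frames} = 28\,500\ \text{s} \approx 7.9\ \text{h}
```

On the GPU each thread loops over all other boids, so the kernel as a whole evaluates $n(n-1)$ distances per frame, twice the pair count.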
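Spatial partitioning such as quadtrees or spatial hashing is the usual optimization. You can also maintain a neighbor list storing the individuals within a certain distance of each one. Or, of course, brute-force it in a compute shader.
Enough talk; let's get started.
First download the starter project files (if you have not already):
- Bird prefab: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Prefabs/Boid.prefab
- Script: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Scripts/SimpleFlocking.cs
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/main/Assets/Shaders/SimpleFlocking.compute
Then add them to an empty GameObject.
Run the project and you will see a flock of birds.
Below are the parameters of the flocking simulation.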
```
// Parameters of the flocking simulation.
    public float rotationSpeed = 1f; // rotation speed
    public float boidSpeed = 1f; // boid speed
    public float neighbourDistance = 1f; // neighbor distance
    public float boidSpeedVariation = 1f; // speed variation
    public GameObject boidPrefab; // prefab of the boid object
    public int boidsCount; // number of boids
    public float spawnRadius; // spawn radius
    public Transform target; // target the flock moves toward
```
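Apart from the prefab boidPrefab and the spawn radius spawnRadius, everything else has to be passed to the GPU.
For convenience this section does something naive first: compute only the birds' positions and directions on the GPU, read them back to the CPU and process them like this: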
```
...
boidsBuffer.GetData(boidsArray);
// update each bird's position and orientation
for (int i = 0; i < boidsArray.Length; i++){
    boids[i].transform.localPosition = boidsArray[i].position;
    if (!boidsArray[i].direction.Equals(Vector3.zero)){
        boids[i].transform.rotation = Quaternion.LookRotation(boidsArray[i].direction);
    }
}
```
Quaternion.LookRotation() creates a rotation that makes the object face the given direction.
Compute each bird's position in the compute shader.
```
#pragma kernel CSMain
#define GROUP_SIZE 256    
struct Boid{
    float3 position;
    float3 direction;
};
RWStructuredBuffer<Boid> boidsBuffer;
float time;
float deltaTime;
float rotationSpeed;
float boidSpeed;
float boidSpeedVariation;
float3 flockPosition;
float neighbourDistance;
int boidsCount;
[numthreads(GROUP_SIZE,1,1)]
void CSMain (uint3 id : SV_DispatchThreadID){ /* ... continued below */ }
```
Start with the alignment and cohesion logic, and write the final position and direction back to the buffer.
```
Boid boid = boidsBuffer[id.x];
    float3 separation = 0; // separation
    float3 alignment = 0; // alignment (direction)
    float3 cohesion = flockPosition; // cohesion (position)
    uint nearbyCount = 1; // count this boid itself as a neighbor
    for (int i=0; i<boidsCount; i++)
    {
        if(i!=(int)id.x) // skip itself
        {
            Boid temp = boidsBuffer[i];
            // accumulate the individuals within range
            if(distance(boid.position, temp.position)< neighbourDistance){
                alignment += temp.direction;
                cohesion += temp.position;
                nearbyCount++;
            }
        }
    }
    float avg = 1.0 / nearbyCount;
    alignment *= avg;
    cohesion *= avg;
    cohesion = normalize(cohesion-boid.position);
    // combine into one movement direction
    float3 direction = alignment + separation + cohesion;
    // smooth the turn and update the position
    boid.direction = lerp(direction, normalize(boid.direction), 0.94);
    // deltaTime keeps the movement speed independent of frame rate
    boid.position += boid.direction * boidSpeed * deltaTime;
    boidsBuffer[id.x] = boid;
```
This is what happens without boundaries (no separation term): all individuals get far too close and end up overlapping.
Add the following code.
```
if(distance(boid.position, temp.position)< neighbourDistance)
{
    float3 offset = boid.position - temp.position;
    float dist = length(offset);
    if(dist < neighbourDistance)
    {
        dist = max(dist, 0.000001);
        separation += offset * (1.0/dist - 1.0/neighbourDistance);
    }
    ...
```
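1.0/dist grows as boids get closer, meaning the separation force should be stronger. 1.0/neighbourDistance is a constant derived from the neighbor distance. Their difference is how strongly the separation force responds to distance. If two boids are exactly neighbourDistance apart the value is zero (no separation force). If they are closer than neighbourDistance the value is positive, and the smaller the distance, the larger the value.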
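The weight can be evaluated in isolation to see the behavior described above. The plain-C sketch below uses the same clamp as the kernel and assumes the caller has already checked that dist is below neighbour_distance; the function name is made up for this example.

```c
/* Scalar weight applied to the offset vector for one neighbor. */
double separation_weight(double dist, double neighbour_distance) {
    const double eps = 0.000001;      /* same lower clamp as the kernel */
    if (dist < eps) dist = eps;       /* avoid division by zero */
    return 1.0 / dist - 1.0 / neighbour_distance;
}
```

With neighbour_distance = 1, the weight is 0 at a distance of 1, 1 at a distance of 0.5, and grows without bound (up to the clamp) as the distance shrinks.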
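Current code: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Flocking/Assets/Shaders/SimpleFlocking.compute
The next section switches to instanced meshes for better performance.
5. GPU Instancing optimization
First a recap of this chapter. In both "Hello, particles" and "Quad particles" we used instanced drawing (Graphics.DrawProceduralNow()) and fed the particle positions computed by the compute shader directly to the vertex/fragment shader.
DrawMeshInstancedIndirect, used in this section, draws large numbers of geometry instances that are alike and differ only slightly in position, rotation or other parameters. Compared with DrawProceduralNow, where the geometry is regenerated and rendered every frame, DrawMeshInstancedIndirect only needs the instance information set up once, after which the GPU renders all instances in one go. This is the function to use for grass fields and animal groups.
The function takes many parameters; only some are used here.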
```
Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer);
```
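1. boidMesh: the bird mesh to draw.
2. subMeshIndex: index of the submesh to draw. Usually 0 when the mesh has a single submesh.
3. boidMaterial: the material applied to the instances.
4. bounds: the bounding volume enclosing all the instances. Unity uses it to cull the draw as a whole, so it should cover the entire flock.
5. argsBuffer: a ComputeBuffer of draw arguments, including the index count per instance and the number of instances.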
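What is argsBuffer? It tells Unity which mesh we are rendering and how many copies to render, and it is passed in as a special kind of buffer.
When initialising the shader, create a buffer flagged ComputeBufferType.IndirectArguments. This buffer type exists specifically for handing indirect draw commands to the GPU. Note that the first argument of new ComputeBuffer is 1, meaning one args array (an array of 5 uints: index count per instance, instance count, start index location, base vertex location, start instance location); do not misread it as a size.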
```
ComputeBuffer argsBuffer;
...
argsBuffer = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
if (boidMesh != null)
{
    args[0] = (uint)boidMesh.GetIndexCount(0);
    args[1] = (uint)numOfBoids;
}
argsBuffer.SetData(args);
...
Graphics.DrawMeshInstancedIndirect(boidMesh, 0, boidMaterial, bounds, argsBuffer);
```
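To make the layout of the arguments concrete, here is a minimal Python sketch that packs the same five uints into 20 bytes. make_indirect_args and the example counts (36 indices, 1000 instances) are illustrative assumptions; only the field order follows the indirect-draw argument layout described above.

```python
import struct


def make_indirect_args(index_count, instance_count,
                       start_index=0, base_vertex=0, start_instance=0):
    """Pack the five uint32 indirect draw arguments, little-endian."""
    return struct.pack("<5I", index_count, instance_count,
                       start_index, base_vertex, start_instance)


# Hypothetical values: a mesh with 36 indices drawn 1000 times.
args = make_indirect_args(36, 1000)
```

The packed size of 5 * 4 = 20 bytes matches the stride 5 * sizeof(uint) passed to the ComputeBuffer constructor.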
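Building on the previous section, the boid struct gains an offset that the Compute Shader uses to vary each boid. The initial direction is also produced with Slerp: 70% the original orientation, 30% random. Slerp returns a quaternion, which is converted to Euler angles before being passed to the constructor.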
```
public float noise_offset;
...
Quaternion rot = Quaternion.Slerp(transform.rotation, Random.rotation, 0.3f);
boidsArray[i] = new Boid(pos, rot.eulerAngles, offset);
```
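After the new noise_offset field is passed to the Compute Shader, a noise value intended to lie in [-1, 1] is computed (this holds as long as noise1 returns values in [0, 1]) and applied to the bird's speed.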
```
float noise = clamp(noise1(time / 100.0 + boid.noise_offset), -1, 1) * 2.0 - 1.0;
float velocity = boidSpeed * (1.0 + noise * boidSpeedVariation);
```
The algorithm is then tidied up a little. The Compute Shader is otherwise largely unchanged.
```
if (distance(boid_pos, boidsBuffer[i].position) < neighbourDistance)
{
    float3 tempBoid_position = boidsBuffer[i].position;
    float3 offset = boid.position - tempBoid_position;
    float dist = length(offset);
    if (dist<neighbourDistance){
        dist = max(dist, 0.000001);//Avoid division by zero
        separation += offset * (1.0/dist - 1.0/neighbourDistance);
    }
    alignment += boidsBuffer[i].direction;
    cohesion += tempBoid_position;
    nearbyCount += 1;
}
```
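The biggest difference is in the shader. This section uses a Surface Shader instead of a hand-written fragment shader. A Surface Shader is essentially a pre-packaged vertex and fragment shader: Unity takes care of lighting, shadows and the rest of the tedious work. You can still specify your own vert function.
When writing the shader for the material, instanced objects need special handling. For an ordinary rendered object, Unity supplies the position, rotation and other properties for us. For the instances we are building, position and rotation change every frame, so the shader needs a dedicated mechanism to set each instance's position and parameters dynamically. The approach here is procedural instancing, which renders all instances in one batch instead of drawing them one by one.
To apply instancing in the shader, a setup stage runs before vert. This gives every instance its own rotation, translation and scale matrix.
Now each instance needs its own rotation matrix. The buffer gives us each bird's basic data as computed by the Compute Shader (in the previous section this data went back to the CPU; here it goes straight to the shader for instancing):
In the shader, wrap the struct that comes from the buffer, and the related operations, in the following macro.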
```
// .shader
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
struct Boid
{
    float3 position;
    float3 direction;
    float noise_offset;
};
StructuredBuffer<Boid> boidsBuffer; 
#endif
```
Since the instance count (the number of birds, which is also the buffer size) is specified only in args[1] for DrawMeshInstancedIndirect on the C# side, the buffer can be indexed directly with unity_InstanceID.
```
#pragma instancing_options procedural:setup
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _BoidPosition = boidsBuffer[unity_InstanceID].position;
        _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
}
```
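The transform matrix here relies on homogeneous coordinates; the GAMES101 course is a good refresher. A point is (x,y,z,1) and a vector (direction) is (x,y,z,0).
If the affine transformation is applied in two steps (rotate with a matrix, then add the translation), the code looks like this: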
```
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    _BoidPosition = boidsBuffer[unity_InstanceID].position;
    _LookAtMatrix = look_at_matrix(boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
}
 void vert(inout appdata_full v, out Input data)
{
    UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    v.vertex = mul(_LookAtMatrix, v.vertex);
    v.vertex.xyz += _BoidPosition;
    #endif
}
```
That is not very elegant, so we use homogeneous coordinates directly: a single matrix handles rotation, translation and scale!
```
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    _BoidPosition = boidsBuffer[unity_InstanceID].position;
    _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
    #endif
}
 void vert(inout appdata_full v, out Input data)
{
    UNITY_INITIALIZE_OUTPUT(Input, data);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    v.vertex = mul(_Matrix, v.vertex);
    #endif
}
```
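The project's create_matrix helper is not shown here, so the following Python sketch is only an assumption of what such a function might do: build a look-at rotation from direction and up, and put the position in the fourth column. It illustrates why a point (w = 1) picks up the translation while a direction (w = 0) does not.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def create_matrix(position, direction, up):
    """Hypothetical stand-in: rotation columns x, y, z plus a translation column.

    Degenerate when direction is parallel to up (the cross product is zero).
    """
    z = normalize(direction)
    x = normalize(cross(up, z))
    y = cross(z, x)
    return [
        [x[0], y[0], z[0], position[0]],
        [x[1], y[1], z[1], position[1]],
        [x[2], y[2], z[2], position[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def mul(m, v):
    """4x4 matrix times a 4-component column vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))


m = create_matrix((1, 2, 3), (0, 0, 1), (0, 1, 0))
origin_moved = mul(m, (0, 0, 0, 1))    # a point: translated
forward_kept = mul(m, (0, 0, 1, 0))    # a direction: rotation only
```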
And with that, we are done! The frame rate is nearly double that of the previous section.
Current version of the code:
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Shaders/InstancedFlocking.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Scripts/InstancedFlocking.cs
- Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L4_Instanced/Assets/Shaders/InstancedFlocking.shader
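6. Applying Skinned Animation
In this section we use the Animator component to capture the mesh at each keyframe into a buffer before instancing. Picking a different index then yields the mesh in a different pose. Authoring the skeletal animation itself is beyond the scope of this article.
We only need to modify the previous section's code and add the Animator logic. I have written comments below; take a look.
The boid struct is also updated: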
```
struct Boid{
    float3 position;
    float3 direction;
    float noise_offset;
    float speed; // not used for now
    float frame; // current frame index in the animation
    float3 padding; // keeps the data aligned
};
```
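A word on the alignment here. The size of a struct should preferably be a multiple of 16 bytes.
- float3 position; (12 bytes)
- float3 direction; (12 bytes)
- float noise_offset; (4 bytes)
- float speed; (4 bytes)
- float frame; (4 bytes)
- float3 padding; (12 bytes)
Without the padding the struct is 36 bytes, which is not a multiple of 16. With the padding it comes to 48 bytes. Perfect!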
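The byte counts can be double-checked with Python's struct module. The format strings below are just a tightly packed mirror of the fields listed above (an assumption about layout: 4-byte floats with no implicit padding).

```python
import struct

# position, direction, noise_offset, speed, frame
BOID_WITHOUT_PADDING = "<3f3ffff"
# the same fields followed by the float3 padding
BOID_WITH_PADDING = BOID_WITHOUT_PADDING + "3f"

size_without = struct.calcsize(BOID_WITHOUT_PADDING)
size_with = struct.calcsize(BOID_WITH_PADDING)
```

The stride passed to the ComputeBuffer on the C# side has to agree with the padded size, otherwise the GPU reads the boids at the wrong offsets.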
```
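private SkinnedMeshRenderer boidSMR; // reference to the SkinnedMeshRenderer that holds the skinned mesh
private Animator animator;
public AnimationClip animationClip; // the animation clip, used to derive animation parameters
private int numOfFrames; // number of animation frames, i.e. how many frames are stored in the GPU buffer
public float boidFrameSpeed = 10f; // animation playback speed
MaterialPropertyBlock props; // passes parameters to the shader without creating a new material instance, so properties can change for this draw without affecting other objects that use the same material
Mesh boidMesh; // the mesh taken from the SkinnedMeshRenderer
...
void Start(){ // initialise the boid data, call GenerateSkinnedAnimationForGPUBuffer to prepare the animation data, then call InitShader to set the shader parameters needed for rendering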
    ...
    // This property block is used only for avoiding an instancing bug.
    props = new MaterialPropertyBlock();
    props.SetFloat("_UniqueID", Random.value);
    ...
    InitBoids();
    GenerateSkinnedAnimationForGPUBuffer();
    InitShader();
}
void InitShader(){ // configures the shader and material properties. Enabling or disabling frameInterpolation decides whether the shader interpolates between animation frames for smoother playback.
    ...
    if (boidMesh)//Set by the GenerateSkinnedAnimationForGPUBuffer
    ...
    shader.SetFloat("boidFrameSpeed", boidFrameSpeed);
    shader.SetInt("numOfFrames", numOfFrames);
    boidMaterial.SetInt("numOfFrames", numOfFrames);
    if (frameInterpolation && !boidMaterial.IsKeywordEnabled("FRAME_INTERPOLATION"))
    boidMaterial.EnableKeyword("FRAME_INTERPOLATION");
    if (!frameInterpolation && boidMaterial.IsKeywordEnabled("FRAME_INTERPOLATION"))
    boidMaterial.DisableKeyword("FRAME_INTERPOLATION");
}
void Update(){
    ...
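    // The last two arguments:
        // 1. 0: byte offset into the args buffer at which the draw arguments start.
        // 2. props: the MaterialPropertyBlock created earlier, holding properties shared by all instances.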
    Graphics.DrawMeshInstancedIndirect( boidMesh, 0, boidMaterial, bounds, argsBuffer, 0, props);
}
void OnDestroy(){ 
    ...
    if (vertexAnimationBuffer != null) vertexAnimationBuffer.Release();
}
private void GenerateSkinnedAnimationForGPUBuffer()
{
    ... // continued below
}
```
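To give the shader a differently posed mesh at different times, GenerateSkinnedAnimationForGPUBuffer() extracts the mesh vertices of every frame from the Animator and the SkinnedMeshRenderer and stores them in a ComputeBuffer on the GPU for use during instanced rendering.
GetCurrentAnimatorStateInfo fetches the state of the current animation layer, which is used afterwards to control playback precisely.
numOfFrames is set to the power of two closest to the product of the clip length and its frame rate, which can help GPU memory access.
Then a ComputeBuffer, vertexAnimationBuffer, is created to hold every vertex of every frame.
The for loop bakes all the animation frames: at each sampleTime it plays the animation and updates immediately, then bakes the current frame's mesh into bakedMesh. The vertices of the freshly baked mesh are copied into the vertexAnimationData array, which is finally uploaded to the GPU.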
```
// ...continued from above
boidSMR = boidObject.GetComponentInChildren<SkinnedMeshRenderer>();
boidMesh = boidSMR.sharedMesh;
animator = boidObject.GetComponentInChildren<Animator>();
int iLayer = 0;
AnimatorStateInfo aniStateInfo = animator.GetCurrentAnimatorStateInfo(iLayer);
Mesh bakedMesh = new Mesh();
float sampleTime = 0;
float perFrameTime = 0;
numOfFrames = Mathf.ClosestPowerOfTwo((int)(animationClip.frameRate * animationClip.length));
perFrameTime = animationClip.length / numOfFrames;
var vertexCount = boidSMR.sharedMesh.vertexCount;
vertexAnimationBuffer = new ComputeBuffer(vertexCount * numOfFrames, 16);
Vector4[] vertexAnimationData = new Vector4[vertexCount * numOfFrames];
for (int i = 0; i < numOfFrames; i++)
{
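    // Note: the third argument of Play is a normalized time (0..1), while sampleTime is in seconds,
    // so this sampling assumes a clip that is roughly one second long.
    animator.Play(aniStateInfo.shortNameHash, iLayer, sampleTime);
    animator.Update(0f);
    boidSMR.BakeMesh(bakedMesh);
    // Read the array once; the vertices property returns a new copy on every access.
    Vector3[] bakedVertices = bakedMesh.vertices;
    for(int j = 0; j < vertexCount; j++)
    {
        Vector4 vertex = bakedVertices[j];
        vertex.w = 1;
        vertexAnimationData[(j * numOfFrames) +  i] = vertex;
    }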
    sampleTime += perFrameTime;
}
vertexAnimationBuffer.SetData(vertexAnimationData);
boidMaterial.SetBuffer("vertexAnimation", vertexAnimationBuffer);
boidObject.SetActive(false);
```
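The buffer layout is easy to get backwards, so here is a small Python sketch of the vertex-major indexing used above (index = vertex * numOfFrames + frame). The bake function and the string placeholders for vertex positions are illustrative only.

```python
def animation_index(vertex_id, frame, num_frames):
    """Flat index of one vertex at one frame (vertex-major layout)."""
    return vertex_id * num_frames + frame


def bake(frames):
    """frames[f][v] holds vertex v at frame f; returns the flat buffer contents."""
    num_frames = len(frames)
    vertex_count = len(frames[0])
    flat = [None] * (vertex_count * num_frames)
    for f, verts in enumerate(frames):
        for v, position in enumerate(verts):
            flat[animation_index(v, f, num_frames)] = position
    return flat


# Two frames of a two-vertex mesh, with labels standing in for positions.
flat = bake([["v0f0", "v1f0"], ["v0f1", "v1f1"]])
```

All frames of one vertex end up next to each other, which is why the shader can step from the current frame to the next one by adding 1 to the index.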
In the Compute Shader, maintain the frame variable stored in each boid's struct.
```
boid.frame = boid.frame + velocity * deltaTime * boidFrameSpeed;
if (boid.frame >= numOfFrames) boid.frame -= numOfFrames;
```
In the shader, lerp between animation frames. Left is without frame interpolation, right is with it; the difference is striking.
```
void vert(inout appdata_custom v)
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        #ifdef FRAME_INTERPOLATION
            v.vertex = lerp(vertexAnimation[v.id * numOfFrames + _CurrentFrame], vertexAnimation[v.id * numOfFrames + _NextFrame], _FrameInterpolation);
        #else
            v.vertex = vertexAnimation[v.id * numOfFrames + _CurrentFrame];
        #endif
        v.vertex = mul(_Matrix, v.vertex);
    #endif
}
void setup()
{
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        _Matrix = create_matrix(boidsBuffer[unity_InstanceID].position, boidsBuffer[unity_InstanceID].direction, float3(0.0, 1.0, 0.0));
        _CurrentFrame = boidsBuffer[unity_InstanceID].frame;
        #ifdef FRAME_INTERPOLATION
            _NextFrame = _CurrentFrame + 1;
            if (_NextFrame >= numOfFrames) _NextFrame = 0;
            _FrameInterpolation = frac(boidsBuffer[unity_InstanceID].frame);
        #endif
    #endif
}
```
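The frame bookkeeping from the Compute Shader and the setup function can be mirrored on the CPU. This Python sketch uses illustrative function names; the wrap-around and the frac-based blend factor follow the snippets above.

```python
import math


def advance_frame(frame, velocity, delta_time, frame_speed, num_frames):
    """Advance the fractional frame counter and wrap it, as in the compute shader."""
    frame += velocity * delta_time * frame_speed
    if frame >= num_frames:
        frame -= num_frames
    return frame


def frame_lookup(frame, num_frames):
    """Return (current frame, next frame, blend factor) for a fractional frame."""
    current = int(frame)
    nxt = current + 1
    if nxt >= num_frames:
        nxt = 0  # loop back to the first frame
    t = frame - math.floor(frame)  # frac()
    return current, nxt, t


def lerp(a, b, t):
    return a + (b - a) * t
```

For example, a frame value of 3.25 in a 4-frame clip blends 25% of the way from frame 3 back to frame 0.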
It was not easy, but it is finally complete.
Full project link: https://github.com/Remyuu/Unity-Compute-Shader-Learn/tree/L4_Skinned/Assets/Scripts
8. Summary / Quiz
When rendering points which gives the best answer?
What are the three key steps in flocking?
When creating an arguments buffer for DrawMeshInstancedIndirect, how many uints are required?
We created the wing flapping by using a skinned mesh shader. True or False.
In a shader used by DrawMeshInstancedIndirect, which variable name gives the correct index for the instance?
References
1. https://en.wikipedia.org/wiki/Boids
2. Flocks, Herds, and Schools: A Distributed Behavioral Model
2024-05-28
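Compute Shader Study Notes (2): Post-Processing Effects
Preface
A first look at Compute Shaders, implementing some simple effects. All the code is at:
https://github.com/Remyuu/Unity-Compute-Shader-Learn
The main branch holds the starting code; you can download the full project and type along with me. PS: every version of the code has its own branch.
This article covers how to use Compute Shaders to build:
- Post-processing effects
- Particle systems
The previous article did not cover GPU architecture, because I felt that opening with a pile of terminology would be impossible to follow QAQ. With some hands-on Compute Shader experience, the abstract concepts can now be tied to actual code.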
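A program that CUDA runs on the GPU can be described with a three-level hierarchy:
- Grid – corresponds to one Kernel
- |-Block – a Grid has multiple Blocks, all executing the same program
- | |-Thread – the most basic unit of computation on the GPU
The Thread is the GPU's most basic unit, and different Threads naturally need to exchange information. To run huge numbers of parallel threads efficiently and serve the data exchange between them, memory is organised into several levels. From the storage point of view there are therefore also three levels:
- Per-Thread memory – private to one Thread; access takes about one clock cycle (under 1 ns), which can be hundreds of times faster than global memory.
- Shared memory – shared within one Block; much faster than global memory.
- Global memory – shared by all threads, but the slowest, and usually the GPU's bottleneck. The Volta architecture uses HBM2 as device global memory, while Turing uses GDDR6.
When a level's capacity is exceeded, data is pushed to a larger but slower level.
Shared Memory and the L1 cache share the same physical storage but differ in function: the former is managed manually, the latter automatically by hardware. My understanding is that Shared Memory works like a programmable L1 cache.
In NVIDIA's CUDA architecture, the Streaming Multiprocessor (SM) is a processing unit on the GPU that executes the threads of the Blocks assigned to it. Stream Processors, also called "CUDA cores", are the processing elements inside an SM, and each one can serve multiple threads. In short:
- GPU -> Multi-Processors (SMs) -> Stream Processors
That is, the GPU contains multiple SMs (multiprocessors), and each SM contains multiple stream processors. Each stream processor executes the instructions of one or more Threads.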
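On the GPU, the Thread is the smallest unit of computation, and the Warp is the basic unit of execution in CUDA.
In NVIDIA's CUDA architecture a Warp usually contains 32 threads (AMD's equivalent traditionally has 64). A Block is a group of threads, and in CUDA one Block can contain several Warps. A Kernel is a function executed on the GPU; think of it as a piece of code run in parallel by all active threads. In short:
- Kernel -> Grid -> Blocks -> Warps -> Threads
In everyday development, though, the number of Threads that need to run at once is far greater than 32.
To reconcile what the software asks for with what the hardware provides, the GPU groups the threads of a Block into Warps, each holding a fixed number of threads. When there are more threads than one Warp can hold, the GPU schedules additional Warps. The rule is that no thread is left out, even if that means launching more Warps.
For example, if a Block has 128 Threads and my graphics card wears a leather jacket (NVIDIA, 32 Threads per Warp), the Block has 128/32 = 4 Warps. As an extreme case, with 129 threads it takes 5 Warps, and 31 thread slots sit idle! So when writing a Compute Shader, the product a*b*c in [numthreads(a,b,c)] should preferably be a multiple of 32 to avoid wasting CUDA cores.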
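The arithmetic in this example can be written down as a short Python sketch. warps_needed is an illustrative helper, and the default warp size of 32 is the NVIDIA figure quoted above.

```python
def warps_needed(threads_per_group, warp_size=32):
    """Return (number of warps, idle thread slots) for one thread group."""
    warps = -(-threads_per_group // warp_size)  # ceiling division
    idle = warps * warp_size - threads_per_group
    return warps, idle
```

A group declared as [numthreads(8,8,1)] has 8*8*1 = 64 threads, so it fills exactly two warps with nothing idle.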
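By this point you are probably thoroughly confused. I drew a diagram from my own understanding. Please point out any mistakes.
L3 Post-Processing Effects
The current build targets the built-in render pipeline (BIRP); for SRP only a few places in the code need to change.
The key to this chapter is building an abstract base class that manages the resources a Compute Shader needs (section 1). On top of that base class we then write some simple post-processing effects, such as Gaussian blur, grayscale, a low-resolution pixel effect and a night-vision effect. A short summary of what this chapter covers:
- Getting and processing the Camera's render texture
- The ExecuteInEditMode attribute
- SystemInfo.supportsComputeShaders to check whether the system supports Compute Shaders
- Using Graphics.Blit() (the full name is Bit Block Transfer)
- Building various effects with smoothstep()
- Passing data between multiple Kernels, and the shared keyword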
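1. Introduction and Preparation
A post-processing effect needs two textures, one read-only and one read-write. As for where they come from: this is post-processing, so they come from the camera, i.e. the Target Texture on the Camera component.
- Source: read-only
- Destination: read-write, used for the final output
Several post-processing effects follow, so we abstract a base class to cut down on later work.
The base class encapsulates the following:
- Initialising resources (creating textures, buffers, etc.)
- Managing resources (for example, recreating buffers when the screen resolution changes)
- Hardware checks (whether the current device supports Compute Shaders)
Full code of the abstract class: https://pastebin.com/9pYvHHsh
First, OnEnable() is called when the script instance is enabled or attached to an active GameObject. Initialisation goes here: check hardware support, check that the Compute Shader is assigned in the Inspector, look up the named Kernel, get the Camera component on this GameObject, create the textures, and set the initialised flag to true.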
```
if (!SystemInfo.supportsComputeShaders)
    ...
if (!shader)
    ...
kernelHandle = shader.FindKernel(kernelName);
thisCamera = GetComponent<Camera>();
if (!thisCamera)
    ...
CreateTextures();
init = true;
```
CreateTextures() creates the two textures, one Source and one Destination, at the camera's resolution.
```
texSize.x = thisCamera.pixelWidth;
texSize.y = thisCamera.pixelHeight;
if (shader)
{
    uint x, y;
    shader.GetKernelThreadGroupSizes(kernelHandle, out x, out y, out _);
    groupSize.x = Mathf.CeilToInt((float)texSize.x / (float)x);
    groupSize.y = Mathf.CeilToInt((float)texSize.y / (float)y);
}
CreateTexture(ref output);
CreateTexture(ref renderedSource);
shader.SetTexture(kernelHandle, "source", renderedSource);
shader.SetTexture(kernelHandle, "outputrt", output);
```
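The group-count computation above is a ceiling division per axis. A minimal Python sketch, with group_count as an illustrative name:

```python
import math


def group_count(tex_size, threads_per_group):
    """Thread groups needed so that groups * threads covers every pixel."""
    return math.ceil(tex_size / threads_per_group)
```

When the texture size is not a multiple of the group size, the last group runs a few threads past the edge of the texture; for a 1081-pixel axis with 8 threads per group, 136 groups launch 1088 threads.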
The texture creation itself:
```
protected void CreateTexture(ref RenderTexture textureToMake, int divide=1)
{
    textureToMake = new RenderTexture(texSize.x/divide, texSize.y/divide, 0);
    textureToMake.enableRandomWrite = true;
    textureToMake.Create();
}
```
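That completes initialisation. When the camera has finished rendering the scene and is about to present it, Unity calls OnRenderImage(), and this is where the Compute Shader is invoked. If initialisation has not finished or there is no shader, just Blit source straight to destination, i.e. do nothing. CheckResolution(out _) checks whether the render textures' resolution needs updating and regenerates them if so. After that comes the familiar Dispatch stage: source is copied with Blit into the texture the kernel reads from, and once the computation finishes, the result is copied to destination.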
```
protected virtual void OnRenderImage(RenderTexture source, RenderTexture destination)
{
    if (!init || shader == null)
    {
        Graphics.Blit(source, destination);
    }
    else
    {
        CheckResolution(out _);
        DispatchWithSource(ref source, ref destination);
    }
}
```
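Note that there are no SetData() or GetData() calls here. All the data is already on the GPU, so we simply tell the GPU to produce and consume it itself; the CPU should stay out of it. Reading the texture back to system memory and then re-uploading it to the GPU would perform terribly.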
```
protected virtual void DispatchWithSource(ref RenderTexture source, ref RenderTexture destination)
{
    Graphics.Blit(source, renderedSource);
    shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
    Graphics.Blit(output, destination);
}
```
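I refused to believe it and insisted on a round trip through the CPU. The result was shocking: performance was more than 4 times worse. So we need to minimise communication between CPU and GPU, which is a major concern when using Compute Shaders.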
```
// The silly way
protected virtual void DispatchWithSource(ref RenderTexture source, ref RenderTexture destination)
{
    // Blit the source into the texture used for processing
    Graphics.Blit(source, renderedSource);
    // Process the texture with the compute shader
    shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
    // Copy the output into a Texture2D so the data can be read on the CPU
    Texture2D tempTexture = new Texture2D(renderedSource.width, renderedSource.height, TextureFormat.RGBA32, false);
    RenderTexture.active = output;
    tempTexture.ReadPixels(new Rect(0, 0, output.width, output.height), 0, 0);
    tempTexture.Apply();
    RenderTexture.active = null;
    // Upload the Texture2D data back to the GPU into a new RenderTexture
    RenderTexture tempRenderTexture = RenderTexture.GetTemporary(output.width, output.height);
    Graphics.Blit(tempTexture, tempRenderTexture);
    // Finally Blit the processed texture to the destination
    Graphics.Blit(tempRenderTexture, destination);
    // Clean up
    RenderTexture.ReleaseTemporary(tempRenderTexture);
    Destroy(tempTexture);
}
```
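Next, let's write the first post-processing effect.
Aside: a strange bug
One more odd bug to mention.
In a Compute Shader, if the final output texture is named output, some APIs such as Metal will run into problems. The fix is to rename it.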
```
RWTexture2D<float4> outputrt;
```
2. RingHighlight Effect
Create a RingHighlight class that inherits from the base class we just wrote.
Override the initialisation method and specify the Kernel.
```
protected override void Init()
{
    center = new Vector4();
    kernelName = "Highlight";
    base.Init();
}
```
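Override the render method. To focus on a character, pass the character's screen-space coordinate, center, to the Compute Shader. And if the screen resolution changed before the Dispatch, set the properties again.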
```
protected void SetProperties()
{
    float rad = (radius / 100.0f) * texSize.y;
    shader.SetFloat("radius", rad);
    shader.SetFloat("edgeWidth", rad * softenEdge / 100.0f);
    shader.SetFloat("shade", shade);
}
protected override void OnRenderImage(RenderTexture source, RenderTexture destination)
{
    if (!init || shader == null)
    {
        Graphics.Blit(source, destination);
    }
    else
    {
        if (trackedObject && thisCamera)
        {
            Vector3 pos = thisCamera.WorldToScreenPoint(trackedObject.position);
            center.x = pos.x;
            center.y = pos.y;
            shader.SetVector("center", center);
        }
        bool resChange = false;
        CheckResolution(out resChange);
        if (resChange) SetProperties();
        DispatchWithSource(ref source, ref destination);
    }
}
```
To see parameter changes live when editing values in the Inspector, add an OnValidate() method.
```
private void OnValidate()
{
    if(!init)
        Init();
    SetProperties();
}
```
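On the GPU, how do we get no shading inside a circle, a smooth transition at the circle's edge, and shading beyond the transition band? Building on the point-in-circle test from the previous article, we just handle the transition band with smoothstep().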
```
#pragma kernel Highlight

Texture2D<float4> source;
RWTexture2D<float4> outputrt;
float radius;
float edgeWidth;
float shade;
float4 center;

float inCircle( float2 pt, float2 center, float radius, float edgeWidth ){
    float len = length(pt - center);
    return 1.0 - smoothstep(radius-edgeWidth, radius, len);
}

[numthreads(8, 8, 1)]
void Highlight(uint3 id : SV_DispatchThreadID)
{
    float4 srcColor = source[id.xy];
    float4 shadedSrcColor = srcColor * shade;
    float highlight = inCircle( (float2)id.xy, center.xy, radius, edgeWidth);
    float4 color = lerp( shadedSrcColor, srcColor, highlight );

    outputrt[id.xy] = color;

}
```
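To get a feel for the values inCircle returns, here is a CPU-side Python mirror of it. The smoothstep here is a hand-written version of the HLSL intrinsic (clamp, then t*t*(3-2t)); the test coordinates are arbitrary examples.

```python
import math


def smoothstep(edge0, edge1, x):
    """Hermite interpolation between edge0 and edge1, clamped to [0, 1]."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def in_circle(pt, center, radius, edge_width):
    """1 inside the circle, 0 outside, smooth over the last edge_width pixels."""
    dist = math.hypot(pt[0] - center[0], pt[1] - center[1])
    return 1.0 - smoothstep(radius - edge_width, radius, dist)
```

The returned value is used directly as the lerp factor between the shaded colour and the original colour, so 1 means "fully highlighted" and 0 means "fully shaded".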
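Current version of the code:
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_RingHighlight/Assets/Shaders/RingHighlight.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_RingHighlight/Assets/Scripts/RingHighlight.cs
3. Blur Effect
The idea behind blur is simple: each pixel takes a weighted average of the surrounding n*n pixels.
But there is an efficiency problem. As everyone knows, reducing the number of texture samples matters a great deal for performance. If every pixel has to sample 20*20 neighbours, rendering one pixel takes 400 samples, which is clearly unacceptable. On top of that, gathering a whole rectangle of neighbours for a single pixel is awkward to handle in a Compute Shader. How do we solve this?
The usual approach is to sample horizontally first and then vertically. That is, each pixel samples only 20 pixels along x and 20 along y, 20+20 samples in total, and takes the weighted average. This cuts the sample count and also fits Compute Shader logic better: one Kernel for the horizontal pass, another Kernel for the vertical pass.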
```
#pragma kernel HorzPass
#pragma kernel Highlight
```
Dispatches execute in order, so after the horizontal blur is computed, its result is sampled again vertically.
```
shader.Dispatch(kernelHorzPassID, groupSize.x, groupSize.y, 1);
shader.Dispatch(kernelHandle, groupSize.x, groupSize.y, 1);
```
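Once the blur is done, combine it with the RingHighlight from the previous section, and we are finished!
One difference: after the horizontal blur has been computed, how does the result get to the next Kernel? Both kernels are bound to the same intermediate texture, which the shader declares with the shared keyword. The steps are as follows.
On the CPU, declare a reference for the horizontal-blur texture, find the kernel that produces it, and bind it.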
```
RenderTexture horzOutput = null;
int kernelHorzPassID;
protected override void Init()
{
    ...
    kernelHorzPassID = shader.FindKernel("HorzPass");
    ...
}
```
We also need extra space on the GPU to store the first kernel's result.
```
protected override void CreateTextures()
{
    base.CreateTextures();
    shader.SetTexture(kernelHorzPassID, "source", renderedSource);
    CreateTexture(ref horzOutput);
    shader.SetTexture(kernelHorzPassID, "horzOutput", horzOutput);
    shader.SetTexture(kernelHandle, "horzOutput", horzOutput);
}
```
On the GPU side, declare:
```
shared Texture2D<float4> source;
shared RWTexture2D<float4> horzOutput;
RWTexture2D<float4> outputrt;
```
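One open question: the shared keyword seems to make no difference whether it is present or not; in testing, different kernels could access the resource either way. So what is shared for?
In HLSL, shared is a storage-class modifier that marks a variable as shareable between effects, and it is only a hint to the compiler. Resources declared at global scope in a compute shader file are visible to every kernel in that file anyway, which matches the test result above; what actually carries the data across is binding the same RenderTexture to both kernels from C#.
When computing pixels near the border, there may not be enough pixels available: either fewer than blurRadius pixels remain on the left, or too few remain on the right. So first compute a safe left index, then work out how many pixels can be taken from left to right.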
```
[numthreads(8, 8, 1)]
void HorzPass(uint3 id : SV_DispatchThreadID)
{
    int left = max(0, (int)id.x-blurRadius);
    int count = min(blurRadius, (int)id.x) + min(blurRadius, source.Length.x - (int)id.x);
    float4 color = 0;
    uint2 index = uint2((uint)left, id.y);
    [unroll(100)]
    for(int x=0; x<count; x++){
        color += source[index];
        index.x++;
    }
    color /= (float)count;
    horzOutput[id.xy] = color;
}
[numthreads(8, 8, 1)]
void Highlight(uint3 id : SV_DispatchThreadID)
{
    //Vert blur
    int top = max(0, (int)id.y-blurRadius);
    int count = min(blurRadius, (int)id.y) + min(blurRadius, source.Length.y - (int)id.y);
    float4 blurColor = 0;
    uint2 index = uint2(id.x, (uint)top);
    [unroll(100)]
    for(int y=0; y<count; y++){
        blurColor += horzOutput[index];
        index.y++;
    }
    blurColor /= (float)count;
    float4 srcColor = source[id.xy];
    float4 shadedBlurColor = blurColor * shade;
    float highlight = inCircle( (float2)id.xy, center.xy, radius, edgeWidth);
    float4 color = lerp( shadedBlurColor, srcColor, highlight );
    outputrt[id.xy] = color;
}
```
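The clamped window logic from HorzPass can be traced on a single row of pixels in Python. horz_pass is an illustrative CPU mirror of the kernel, working on one list of grey values instead of a texture.

```python
def horz_pass(row, blur_radius):
    """Box-blur one row using the same left/count computation as the kernel."""
    width = len(row)
    out = []
    for x in range(width):
        left = max(0, x - blur_radius)
        count = min(blur_radius, x) + min(blur_radius, width - x)
        window = row[left:left + count]
        out.append(sum(window) / count)
    return out


blurred = horz_pass([0, 10, 20, 30, 40], 2)
```

Tracing it shows that away from the borders the window covers x - blurRadius up to x + blurRadius - 1, so it is 2 * blurRadius samples wide rather than symmetric around x.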
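Current version of the code:
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_BlurEffect/Assets/Shaders/BlurHighlight.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_BlurEffect/Assets/Scripts/BlurHighlight.cs
4. Gaussian Blur
Unlike the above, the samples are no longer simply averaged but weighted by a Gaussian function:
$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}$$
where $\sigma$ is the standard deviation, which controls the width.
More on blur: https://www.gamedeveloper.com/programming/four-tricks-for-fast-blurring-in-software-and-hardware#close-modal
This is still a fair amount of computation, and evaluating the formula for every pixel would be expensive. Instead we precompute the weights and send them to the GPU in a buffer. Since both kernels need them, add shared to the buffer declaration.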
```
float[] SetWeightsArray(int radius, float sigma)
{
    int total = radius * 2 + 1;
    float[] weights = new float[total];
    float sum = 0.0f;
    for (int n=0; n<=radius; n++) // include n == radius so the outermost taps weights[0] and weights[total-1] are filled
    {
        float weight = 0.39894f * Mathf.Exp(-0.5f * n * n / (sigma * sigma)) / sigma;
        weights[radius + n] = weight;
        weights[radius - n] = weight;
        if (n != 0)
            sum += weight * 2.0f;
        else
            sum += weight;
    }
    // normalize kernels
    for (int i=0; i<total; i++) weights[i] /= sum;
    return weights;
}
private void UpdateWeightsBuffer()
{
    if (weightsBuffer != null)
        weightsBuffer.Dispose();
    float sigma = (float)blurRadius / 1.5f;
    weightsBuffer = new ComputeBuffer(blurRadius * 2 + 1, sizeof(float));
    float[] blurWeights = SetWeightsArray(blurRadius, sigma);
    weightsBuffer.SetData(blurWeights);
    shader.SetBuffer(kernelHorzPassID, "weights", weightsBuffer);
    shader.SetBuffer(kernelHandle, "weights", weightsBuffer);
}
```
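The weight table is easy to sanity-check on the CPU. This Python sketch follows the same formula (0.39894 is approximately 1/sqrt(2*pi)); as an assumption of the sketch, it fills every tap including the outermost pair at n = radius.

```python
import math


def gaussian_weights(radius, sigma):
    """Normalised 1-D Gaussian kernel with 2 * radius + 1 taps."""
    total = radius * 2 + 1
    weights = [0.0] * total
    for n in range(radius + 1):
        w = math.exp(-0.5 * n * n / (sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)
        weights[radius + n] = w
        weights[radius - n] = w
    s = sum(weights)
    return [w / s for w in weights]


weights = gaussian_weights(3, 3 / 1.5)
```

Normalising by the sum means the blur neither brightens nor darkens the image, whatever sigma is chosen.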
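Full code:
- https://pastebin.com/0qWtUKgy
- https://pastebin.com/A6mDKyJE
5. Low-Resolution Effect
GPU: now that was a satisfying computation.
Make a high-resolution texture look pixelated without changing its resolution. The implementation is simple: for every n*n block of pixels, just take the colour of the bottom-left pixel. Thanks to integer arithmetic, dividing the id.x index by n and then multiplying by n does the job.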
```
uint2 index = (uint2(id.x, id.y)/3) * 3;
float3 srcColor = source[index].rgb;
float3 finalColor = srcColor;
```
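The integer trick can be verified with a few values in Python, where // plays the role of unsigned integer division in HLSL. block_index is an illustrative name.

```python
def block_index(x, y, n):
    """Snap a pixel coordinate to the corner pixel of its n*n block."""
    return (x // n) * n, (y // n) * n
```

Every pixel in a block maps to the same source pixel, so the block shows one flat colour.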
The result is shown above. But the effect is too sharp; adding noise softens the jagged edges.
```
uint2 index = (uint2(id.x, id.y)/3) * 3;
float noise = random(id.xy, time);
float3 srcColor = lerp(source[id.xy].rgb, source[index],noise);
float3 finalColor = srcColor;
```
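Each pixel in an n*n cell no longer takes only the bottom-left colour but a random interpolation between its original colour and the bottom-left colour. The result immediately looks much finer. When n is fairly large you can also get an effect like the one below. It is not exactly pretty, but it is worth exploring further for glitch-style looks.
For a noisier image, try adding coefficients to the two ends of the lerp, for example: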
```
float3 srcColor = lerp(source[id.xy].rgb * 2, source[index],noise);
```
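6. Grayscale and Tint
Grayscale Effect & Tinted
Converting a colour image to grayscale means turning each pixel's RGB value into a single value, which is a weighted average of R, G and B. There are two methods: a simple average, and a weighted average that matches human perception.
1. Average method (simple but inaccurate):
$$Gray = \frac{R + G + B}{3}$$
This method gives every colour channel the same weight.
2. Weighted average method (more accurate, reflects human perception):
$$Gray = 0.299R + 0.587G + 0.114B$$
This method weights the channels differently because the eye is most sensitive to green, less so to red and least to blue. (The screenshot below does not show it well; I could not tell the difference either lol)
After weighting, do a simple colour blend (multiplication), then lerp to get a controllable tint strength.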
```
uint2 index = (uint2(id.x, id.y)/6) * 6;
float noise = random(id.xy, time);
float3 srcColor = lerp(source[id.xy].rgb, source[index],noise);
// float3 finalColor = srcColor;
float3 grayScale = (srcColor.r+srcColor.g+srcColor.b)/3.0;
// float3 grayScale = srcColor.r*0.299f+srcColor.g*0.587f+srcColor.b*0.114f;
float3 tinted = grayScale * tintColor.rgb;
float3 finalColor = lerp(srcColor, tinted, tintStrength);
outputrt[id.xy] = float4(finalColor, 1);
```
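Both grayscale variants and the tint step can be compared on single colours with a small Python sketch. The function names are illustrative; the weights are the ones that appear in the shader snippet.

```python
def gray_average(r, g, b):
    """Simple average: every channel gets the same weight."""
    return (r + g + b) / 3.0


def gray_luma(r, g, b):
    """Perceptual weights: green counts most, blue least."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def tint(src, tint_color, strength, gray_fn=gray_luma):
    """Multiply the gray value by the tint colour, then lerp from the source."""
    gray = gray_fn(*src)
    tinted = tuple(gray * c for c in tint_color)
    return tuple(s + (t - s) * strength for s, t in zip(src, tinted))
```

A pure green pixel shows the difference clearly: the simple average gives about 0.33, while the weighted version gives 0.587.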
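A wasteland tint:
7. Screen Scanline Effect
First, uvY normalises the coordinate to [0,1].
lines is a parameter that controls the number of scanlines.
Then add a time offset, with a coefficient controlling the scroll speed. A parameter can be exposed to control how fast the lines move.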
```
float uvY = (float)id.y/(float)source.Length.y;
float scanline = saturate(frac(uvY * lines + time * 3));
```
These "lines" do not look line-like enough, so slim them down.
```
float uvY = (float)id.y/(float)source.Length.y;
float scanline = saturate(smoothstep(0.1,0.2,frac(uvY * lines + time * 3)));
```
Then lerp in the colour.
```
float uvY = (float)id.y/(float)source.Length.y;
float scanline = saturate(smoothstep(0.1, 0.2, frac(uvY * lines + time*3)) + 0.3);
finalColor = lerp(source[id.xy].rgb*0.5, finalColor, scanline);
```
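Before and after the "slimming"; take your pick!
8. Night-Vision Effect
This section pulls together everything above to build a night-vision effect. Start with a single-eye version.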
```
float2 pt = (float2)id.xy;
float2 center = (float2)(source.Length >> 1);
float inVision = inCircle(pt, center, radius, edgeWidth);
float3 blackColor = float3(0,0,0);
finalColor = lerp(blackColor, finalColor, inVision);
```
The two-eye version differs in having two circle centres; the two resulting vision masks are merged with max() or saturate().
```
float2 pt = (float2)id.xy;
float2 centerLeft = float2(source.Length.x / 3.0, source.Length.y /2);
float2 centerRight = float2(source.Length.x / 3.0 * 2.0, source.Length.y /2);
float inVisionLeft = inCircle(pt, centerLeft, radius, edgeWidth);
float inVisionRight = inCircle(pt, centerRight, radius, edgeWidth);
float3 blackColor = float3(0,0,0);
// float inVision = max(inVisionLeft, inVisionRight);
float inVision = saturate(inVisionLeft + inVisionRight);
finalColor = lerp(blackColor, finalColor, inVision);
```
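Current version of the code:
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_NightVision/Assets/Shaders/NightVision.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_NightVision/Assets/Scripts/NightVision.cs
9. Smoothly Blended Lines
Consider how to draw a smoothly blended straight line on screen.
smoothstep() can do this; readers familiar with it can skip this paragraph. The function creates smooth gradients. smoothstep(edge0, edge1, x) rises from 0 to 1 as x goes from edge0 to edge1. If x < edge0 it returns 0; if x > edge1 it returns 1. The output is computed by Hermite interpolation:
$$t = \mathrm{clamp}\left(\frac{x - edge_0}{edge_1 - edge_0},\ 0,\ 1\right), \qquad \mathrm{smoothstep}(edge_0, edge_1, x) = t^2 (3 - 2t)$$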
```
float onLine(float position, float center, float lineWidth, float edgeWidth) {
    float halfWidth = lineWidth / 2.0;
    float edge0 = center - halfWidth - edgeWidth;
    float edge1 = center - halfWidth;
    float edge2 = center + halfWidth;
    float edge3 = center + halfWidth + edgeWidth;
    return smoothstep(edge0, edge1, position) - smoothstep(edge2, edge3, position);
}
```
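The band shape produced by onLine can be probed with a CPU-side Python mirror. smoothstep here is a hand-written version of the HLSL intrinsic, and the sample values (a line centred at 0.5, width 0.1, edge 0.05) are arbitrary.

```python
def smoothstep(edge0, edge1, x):
    """Hermite interpolation between edge0 and edge1, clamped to [0, 1]."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def on_line(position, center, line_width, edge_width):
    """1 on the line, 0 far away, with soft ramps of width edge_width."""
    half_width = line_width / 2.0
    edge0 = center - half_width - edge_width
    edge1 = center - half_width
    edge2 = center + half_width
    edge3 = center + half_width + edge_width
    return smoothstep(edge0, edge1, position) - smoothstep(edge2, edge3, position)
```

The first smoothstep switches the line on, the second switches it off again, and their difference is the band in between.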
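In the code above, all arguments are already normalised to [0,1]. position is the position of the point being tested, center is the centre of the line, lineWidth is the line's actual width, and edgeWidth is the width of the edge used for the smooth transition. I am genuinely unhappy with my ability to explain this! As for how it works out, a picture helps.
Roughly: the result is 0 outside [edge0, edge3], rises from 0 to 1 between edge0 and edge1, stays at 1 between edge1 and edge2, and falls back to 0 between edge2 and edge3.
Now consider how to draw a smoothly blended circle.
For each point, first compute the offset vector from the circle centre, stored back into position, then compute its length, stored in len.
Following the same difference-of-two-smoothsteps approach, subtracting the outer-edge interpolation produces a ring-shaped line.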
```
float circle(float2 position, float2 center, float radius, float lineWidth, float edgeWidth){
    position -= center;
    float len = length(position);
    //Change true to false to soften the edge
    float result = smoothstep(radius - lineWidth / 2.0 - edgeWidth, radius - lineWidth / 2.0, len) - smoothstep(radius + lineWidth / 2.0, radius + lineWidth / 2.0 + edgeWidth, len);
    return result;
}
```
10. Sweep Line Effect
Then one horizontal line, one vertical line and a few nested circles make a radar display.
```
float3 color = float3(0.0f,0.0f,0.0f);
color += onLine(uv.y, center.y, 0.002, 0.001) * axisColor.rgb;//xAxis
color += onLine(uv.x, center.x, 0.002, 0.001) * axisColor.rgb;//yAxis
color += circle(uv, center, 0.2f, 0.002, 0.001) * axisColor.rgb;
color += circle(uv, center, 0.3f, 0.002, 0.001) * axisColor.rgb;
color += circle(uv, center, 0.4f, 0.002, 0.001) * axisColor.rgb;
```
Now draw a sweep line with a trail.
```
float sweep(float2 position, float2 center, float radius, float lineWidth, float edgeWidth) {
    float2 direction = position - center;
    float theta = time + 6.3;
    float2 circlePoint = float2(cos(theta), -sin(theta)) * radius;
    float projection = clamp(dot(direction, circlePoint) / dot(circlePoint, circlePoint), 0.0, 1.0);
    float lineDistance = length(direction - circlePoint * projection);
    float gradient = 0.0;
    const float maxGradientAngle = PI * 0.5;
    if (length(direction) < radius) {
        float angle = fmod(theta + atan2(direction.y, direction.x), PI2);
        gradient = clamp(maxGradientAngle - angle, 0.0, maxGradientAngle) / maxGradientAngle * 0.5;
    }
    return gradient + 1.0 - smoothstep(lineWidth, lineWidth + edgeWidth, lineDistance);
}
```
Add it to the colour.
```
...
color += sweep(uv, center, 0.45f, 0.003, 0.001) * sweepColor.rgb;
...
```
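Current version of the code:
- Compute Shader：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_HUDOverlay/Assets/Shaders/HUDOverlay.compute
- CPU：https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L3_HUDOverlay/Assets/Scripts/HUDOverlay.cs
11. Gradient Background Shade Effect
This effect can sit beneath subtitles or explanatory text. You could simply add an image to the UI Canvas, but a Compute Shader allows more flexible effects and saves on assets.
Backgrounds for subtitles and dialogue text are usually at the bottom of the screen, with the top left untouched. They also need fairly high contrast, so the original image is converted to grayscale and given a shade.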
```
if (id.y<(uint)tintHeight){
    float3 grayScale = (srcColor.r + srcColor.g + srcColor.b) * 0.33 * tintColor.rgb;
    float3 shaded = lerp(srcColor.rgb, grayScale, tintStrength) * shade;
    ... // continued below
}else{
    color = srcColor;
}
```
Gradient effect.
```
...// continued from above
    float srcAmount = smoothstep(tintHeight-edgeWidth, (float)tintHeight, (float)id.y);
    ...// continued below
```
Finally, lerp them together.
```
...// continued from above
    color = lerp(float4(shaded, 1), srcColor, srcAmount);
```
12. Summary / Quiz
If id.xy = [ 100, 30 ]. What would be the return value of inCircle((float2)id.xy, float2(130, 40), 40, 0.1)
When creating a blur effect which answer describes our approach best?
Which answer would create a blocky low resolution version of the source image?
What is smoothstep(5, 10, 6); ?
If a and b are both vectors, which answer best describes dot(a,b)/dot(b,b)?
What is _MainTex_TexelSize.x? If _MainTex is 512 x 256 pixel resolution.
13. Post-Processing with Blit and a Material
Besides Compute Shaders, there is a simpler way to do post-processing.
```
// .cs
Graphics.Blit(source, dest, material, passIndex);
// .shader
Pass{
    CGPROGRAM
    #pragma vertex vert_img
    #pragma fragment frag
    fixed4 frag(v2f_img input) : SV_Target{
        return tex2D(_MainTex, input.uv);
    }
    ENDCG
}
```
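Here the image data is processed with an ordinary shader.
So what is the difference between the two? And if what gets passed in is a texture, where do the vertices come from?
Answer:
First question. This approach is known as "screen-space shading". It is fully integrated into Unity's graphics pipeline and is often cheaper than a Compute Shader for simple full-screen effects. A Compute Shader, in turn, offers finer-grained control over GPU resources: it is not bound by the graphics pipeline and can read and modify textures, buffers and other resources directly.
Second question. Look at vert_img. Its definition can be found in UnityCG.
Unity automatically draws the incoming texture as two triangles (a rectangle that fills the screen), so when writing post-processing with a material we only need to write the frag function.
The next chapter covers how to tie Material, Shader, Compute Shader and C# together.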
2024-05-27
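Compute Shader Study Notes (1): Getting Started
Tags: Getting Started/Shader/Compute Shader/GPU Optimization
Preface
Compute Shaders are fairly complex; mastering them takes some programming knowledge, graphics knowledge and an understanding of GPU hardware. The study notes are split into the following parts:
- A first look at Compute Shaders, implementing some simple effects
- Drawing circles, planetary orbits, noise textures, manipulating meshes and more
- Post-processing, particle systems
- Physics simulation, rendering grass
- Fluid simulation
The main references are: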
- https://www.udemy.com/course/compute-shaders/?couponCode=LEADERSALE24A
- https://catlikecoding.com/unity/tutorials/basics/compute-shaders/
- https://medium.com/ericzhan-publication/shader筆記-初探compute-shader-9efeebd579c1
- https://docs.unity3d.com/Manual/class-ComputeShader.html
- https://docs.unity3d.com/ScriptReference/ComputeShader.html
- https://learn.microsoft.com/en-us/windows/win32/api/D3D11/nf-d3d11-id3d11devicecontext-dispatch
- lygyue: Compute Shader (very interesting)
- https://medium.com/@sengallery/unity-compute-shader-基礎認識-5a99df53cea1
- https://kylehalladay.com/blog/tutorial/2014/06/27/Compute-Shaders-Are-Nifty.html (old and outdated)
- http://www.sunshine2k.de/coding/java/Bresenham/RasterisingLinesCircles.pdf
- 王江荣: 【Unity】Compute Shader的基础介绍与使用
- ...to be continued
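L1 Introducing Compute Shaders
1. Meeting the Compute Shader
Put simply, a Compute Shader can compute a texture for a material, which a Renderer then displays. Keep in mind that Compute Shaders can do much more than this.
Copy the two pieces of code below and try them out.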
```
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class AssignTexture : MonoBehaviour
{
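    // ComputeShader that runs the computation on the GPU
    public ComputeShader shader;

    // Texture resolution
    public int texResolution = 256;

    // Renderer component
    private Renderer rend;
    // Render texture the compute shader writes into
    private RenderTexture outputTexture;
    // Compute shader kernel handle
    private int kernelHandle;

    // Start is called once when the script is enabled
    void Start()
    {
        // Create a render texture with the given width, height, and depth-buffer bits (0 here)
        outputTexture = new RenderTexture(texResolution, texResolution, 0);
        // Allow random-access writes
        outputTexture.enableRandomWrite = true;
        // Create the render texture on the GPU
        outputTexture.Create();

        // Get the Renderer component of this object
        rend = GetComponent<Renderer>();
        // Enable the renderer
        rend.enabled = true;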

        InitShader();
    }

    private void InitShader()
    {
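        // Find the handle of the kernel "CSMain"
        kernelHandle = shader.FindKernel("CSMain");

        // Bind the texture the compute shader writes to
        shader.SetTexture(kernelHandle, "Result", outputTexture);

        // Use the render texture as the material's main texture
        rend.material.SetTexture("_MainTex", outputTexture);

        // Dispatch the compute shader, passing the number of thread groups.
        // Dividing by 16 would be right for 16x16 thread groups, but the kernel declares 8x8,
        // so this dispatches only half the groups needed along x and along y,
        // and just 1/4 of the texture gets rendered.
        DispatchShader(texResolution / 16, texResolution / 16);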
    }

    private void DispatchShader(int x, int y)
    {
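        // Dispatch the compute shader.
        // x and y are the numbers of thread groups; 1 is the number of groups along z
        shader.Dispatch(kernelHandle, x, y, 1);
    }

    void Update()
    {
        // Check every frame for keyboard input (the U key being released)
        if (Input.GetKeyUp(KeyCode.U))
        {
            // When U is released, dispatch again with enough groups to cover the whole texture
            DispatchShader(texResolution / 8, texResolution / 8);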
        }
    }
}
```
Unity's default Compute Shader:
```
// Each #kernel tells which function to compile; you can have many kernels
#pragma kernel CSMain

// Create a RenderTexture with enableRandomWrite flag and set it
// with cs.SetTexture
RWTexture2D<float4> Result;

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID) {
  // TODO: insert actual code here!
  Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
}
```
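In this example, the lower-left quarter of the texture is filled with a fractal called the Sierpinski triangle. The pattern itself does not matter; Unity considers it representative and uses it as the default code.
Let's go through the Compute Shader code; for the C# code, the comments are enough.
#pragma kernel CSMain declares the entry point of the Compute Shader. The name CSMain can be changed freely.
RWTexture2D<float4> Result declares a readable and writable 2D texture. R stands for Read, W for Write.
Look closely at this line: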
```
[numthreads(8,8,1)]
```
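In the Compute Shader file, this line sets the size of one thread group: an 8 * 8 * 1 group contains 64 threads. Each thread computes one pixel of the RWTexture.
In the C# file above, shader.Dispatch specifies the number of thread groups.
Now a question: with a thread group size of 8 * 8 * 1 , how many thread groups are needed to render a RWTexture of size res*res ?
Answer: res/8 groups along each of x and y. The code currently dispatches only res/16 groups along each axis, so only the lower-left quarter is rendered.
The parameter of the entry function also deserves a mention: uint3 id : SV_DispatchThreadID is the unique identifier of the current thread.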
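The group-count arithmetic can be checked with a few lines of plain C#. This is only a sketch: the class ThreadGroups and its two methods are hypothetical helpers, not part of Unity, and the rounding-up division is an addition for resolutions that are not a multiple of the group size (the kernel would then need its own bounds check).

```csharp
using System;

// Hypothetical helper, not a Unity API: models the dispatch arithmetic on the CPU.
public static class ThreadGroups
{
    // Number of groups needed along one axis so that every pixel gets a thread.
    // Rounds up so that sizes that are not a multiple of groupSize are still covered.
    public static int Count(int resolution, int groupSize)
    {
        return (resolution + groupSize - 1) / groupSize;
    }

    // Fraction of a square texture that is covered when `groups` groups of
    // `groupSize` threads are dispatched along both x and y.
    public static double CoveredFraction(int resolution, int groupSize, int groups)
    {
        int side = Math.Min(resolution, groups * groupSize);
        return (double)side * side / ((double)resolution * resolution);
    }
}
```

With res = 256 and numthreads(8,8,1), ThreadGroups.Count(256, 8) is 32, while dispatching 256 / 16 = 16 groups per axis gives ThreadGroups.CoveredFraction(256, 8, 16) = 0.25, the quarter described above.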
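2. Four-Quadrant Pattern
Learn to crawl before you walk. First, specify in C# which task (kernel) to run.
So far the name is hard-coded; now we expose a parameter so that different kernels can be selected.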
```
public string kernelName = "CSMain";
...
kernelHandle = shader.FindKernel(kernelName);
```
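Now it can be changed freely in the Inspector.
But setting out plates is not enough; the dishes have to be served. The cooking happens in the Compute Shader.
First, write up a few menu items.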
```
#pragma kernel CSMain // already declared above
#pragma kernel SolidRed // declare a new dish, then define it below
... // as many as you like
[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID){ ... }
[numthreads(8,8,1)]
void SolidRed (uint3 id : SV_DispatchThreadID){
 Result[id.xy] = float4(1,0,0,0); 
}
```
Change the name in the Inspector to enable a different kernel.
What if I want to pass data to the Compute Shader? For example, the resolution of the texture.
```
shader.SetInt("texResolution", texResolution);
```
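It must also be declared in the Compute Shader, under the same name: int texResolution; .
A question to think about: how would you produce the effect below?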
```
[numthreads(8,8,1)]
void SplitScreen (uint3 id : SV_DispatchThreadID)
{
    int halfRes = texResolution >> 1;
    Result[id.xy] = float4(step(halfRes, id.x),step(halfRes, id.y),0,1);
}
```
To explain, the step function is essentially:
```
step(edge, x){
    return x>=edge ? 1 : 0;
}
```
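(uint)res >> 1 shifts the bits of res one place to the right, which is equivalent to dividing by 2 (binary arithmetic).
The computation depends only on the current thread id.
Threads in the lower-left quadrant always output black, because both step calls return 0.
Threads in the lower-right quadrant have id.x >= halfRes , so the red channel is 1.
And so on; it is very simple. If you are not convinced, work through a few values by hand, which also helps in understanding how thread ids, thread groups, and the grid of thread groups relate.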
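For working through values by hand, a CPU-side model of the SplitScreen kernel may help. This is a sketch in plain C#; SplitScreenModel and its methods are hypothetical names used only for illustration, and only the red and green channels are modelled.

```csharp
using System;

// CPU-side model of the SplitScreen kernel, for checking values by hand.
public static class SplitScreenModel
{
    // Same behaviour as HLSL step(edge, x): 1 when x >= edge, otherwise 0.
    public static float Step(float edge, float x)
    {
        return x >= edge ? 1f : 0f;
    }

    // Returns the (r, g) pair the kernel would write for thread id (x, y).
    public static (float r, float g) Color(int x, int y, int texResolution)
    {
        int halfRes = texResolution >> 1; // integer divide by 2
        return (Step(halfRes, x), Step(halfRes, y));
    }
}
```

For texResolution = 256, Color(10, 10, 256) gives (0, 0), which is black; Color(200, 10, 256) gives (1, 0), red; Color(10, 200, 256) gives (0, 1), green; and Color(200, 200, 256) gives (1, 1), yellow.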
3. Drawing a Circle
The principle sounds simple: test whether (id.x, id.y) lies inside the circle, output 1 if so and 0 otherwise. Give it a try.
```
float inCircle( float2 pt, float radius ){
    return ( length(pt)<radius ) ? 1.0 : 0.0;
}

[numthreads(8,8,1)]
void Circle (uint3 id : SV_DispatchThreadID)
{
    int halfRes = texResolution >> 1;
    int isInside = inCircle((float2)((int2)id.xy-halfRes), (float)(halfRes>>1));
    Result[id.xy] = float4(0.0,isInside ,0,1);
}
```
4. Summary / Quiz
If the output is a RWTexture with a side length of 256, which answer produces a completely red texture?
```
RWTexture2D<float4> output;

[numthreads(16,16,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
     output[id.xy] = float4(1.0, 0.0, 0.0, 1.0);
}
```
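Which answer gives red on the left side of the output texture and yellow on the right?
L2 Getting Started
1. Passing Values to the GPU
Without further ado, let's draw a circle first. The two starter files are here.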
PassData.cs: https://pastebin.com/PMf4SicK
PassData.compute: https://pastebin.com/WtfUmhk2
The overall structure is unchanged from the previous part. The kernel ends by calling a drawCircle function to draw the circle.
```
[numthreads(1,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    int2 centre = (texResolution >> 1);
    int radius = 80;
    drawCircle( centre, radius );
}
```
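The circle is drawn with a classic rasterization method; if the math interests you, see http://www.sunshine2k.de/coding/java/Bresenham/RasterisingLinesCircles.pdf . The rough idea is to exploit the circle's symmetry.
What is different is that here the thread group size is (1,1,1) . The CS is invoked on the CPU side: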
```
private void DispatchKernel(int count)
{
    shader.Dispatch(circlesHandle, count, 1, 1);
}
void Update()
{
    DispatchKernel(1);
}
```
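Question: how many times does a thread run?
Answer: just once. One thread group has only 1 * 1 * 1 = 1 thread, and the CPU side dispatches only 1 * 1 * 1 = 1 thread group, so a single thread draws the whole circle. In other words, one thread can draw across the entire RWTexture at once, rather than one thread per pixel as before.
This also shows that a Compute Shader differs fundamentally from a Fragment Shader. A fragment shader only computes the color of a single pixel, whereas a Compute Shader can perform more or less arbitrary operations!
Back in Unity: to draw a nice-looking circle we need an outline color and a fill color. Pass these two parameters to the CS.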
```
float4 clearColor;
float4 circleColor;
```
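Add a kernel that fills the color, and modify the Circles kernel. When several kernels access the same RWTexture, the shared keyword can be added to its declaration (in HLSL it is only a hint to the compiler).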
```
#pragma kernel Circles
#pragma kernel Clear
    ...
shared RWTexture2D<float4> Result;
    ...
[numthreads(32,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    // int2 centre = (texResolution >> 1);
    int2 centre = (int2)(random2((float)id.x) * (float)texResolution);
    int radius = (int)(random((float)id.x) * 30);
    drawCircle( centre, radius );
}

[numthreads(8,8,1)]
void Clear (uint3 id : SV_DispatchThreadID)
{
    Result[id.xy] = clearColor;
}
```
On the CPU side, get the Clear kernel and pass in the data.
```
private int circlesHandle;
private int clearHandle;
    ...
shader.SetVector( "clearColor", clearColor);
shader.SetVector( "circleColor", circleColor);
    ...
private void DispatchKernels(int count)
{
    shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1);
    shader.Dispatch(circlesHandle, count, 1, 1);
}
void Update()
{
    DispatchKernels(1); // 32 circles on screen now
}
```
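A question: if the code is changed to DispatchKernels(10) , how many circles appear?
Answer: 320. With Dispatch at 1 * 1 * 1 = 1 group, one thread group has 32 * 1 * 1 = 32 threads and each thread draws one circle, so ten groups draw 10 * 32 = 320. Elementary-school math.
Next, add a time variable so the circles change over time. A Compute Shader does not appear to have a built-in variable like _Time , so it has to be passed in from the CPU.
On the CPU side, note that variables updated in real time must be set before each Dispatch (outputTexture does not need this, because outputTexture is really a reference to the GPU texture!):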
```
private void DispatchKernels(int count)
{
    shader.Dispatch(clearHandle, texResolution/8, texResolution/8, 1);
    shader.SetFloat( "time", Time.time);
    shader.Dispatch(circlesHandle, count, 1, 1);
}
```
Compute Shader:
```
float time;
...
void Circles (uint3 id : SV_DispatchThreadID){
    ...
    int2 centre = (int2)(random2((float)id.x + time) * (float)texResolution);
    ...
}
```
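Current version of the code:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Circle_Time/Assets/Shaders/PassData.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Circle_Time/Assets/Scripts/PassData.cs
But the circles are quite chaotic right now; the next step is to use a Buffer to make them look more orderly.
Several threads may try to write to the same memory location (the same RWTexture pixel, say) at the same time, which is a potential race condition. There is no need to worry about it here: overlapping circles all write the same color, so the result does not depend on which write lands last. Where the order of writes matters, explicit synchronization is needed.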
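2. Passing Data to the GPU with a Buffer
So far we have learned how to send simple data from the CPU to the GPU. How do we pass a custom struct?
We can use a Buffer as the medium. The Buffer itself lives on the GPU, and the CPU side (C#) only holds a reference to it. First declare a struct on the CPU, then declare the CPU-side array and the reference to the GPU-side buffer.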
```
struct Circle
{
    public Vector2 origin;
    public Vector2 velocity;
    public float radius;
}
    Circle[] circleData;  // on the CPU
    ComputeBuffer buffer; // on the GPU
```
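The size of a thread group can be queried as shown below. The code reads only the number of threads along x for the circlesHandle kernel and discards y and z (assuming both are 1). Multiplying by the number of dispatched thread groups gives the total number of threads.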
```
uint threadGroupSizeX;
shader.GetKernelThreadGroupSizes(circlesHandle, out threadGroupSizeX, out _, out _);
int total = (int)threadGroupSizeX * count;
```
Now prepare the data to send to the GPU. One circle is created per thread, so circleData has total entries.
```
circleData = new Circle[total];
float speed = 100;
float halfSpeed = speed * 0.5f;
float minRadius = 10.0f;
float maxRadius = 30.0f;
float radiusRange = maxRadius - minRadius;
for(int i=0; i<total; i++)
{
    Circle circle = circleData[i];
    circle.origin.x = Random.value * texResolution;
    circle.origin.y = Random.value * texResolution;
    circle.velocity.x = (Random.value * speed) - halfSpeed;
    circle.velocity.y = (Random.value * speed) - halfSpeed;
    circle.radius = Random.value * radiusRange + minRadius;
    circleData[i] = circle;
}
```
Then receive this Buffer in the Compute Shader. Declare an identical struct (Vector2 and float2 have the same layout), then declare a reference to the Buffer.
```
// Compute Shader
struct circle
{
    float2 origin;
    float2 velocity;
    float radius;
};
StructuredBuffer<circle> circlesBuffer;
```
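Note that the StructuredBuffer used here is read-only, unlike the RWStructuredBuffer covered in the next section.
Back on the CPU side, send the prepared data to the GPU through the Buffer. First determine the size of the Buffer to allocate, that is, how much data goes to the GPU. One circle consists of two float2 variables and one float variable; a float is 4 bytes (in C# you can write sizeof(float) instead of hard-coding the 4), and there are circleData.Length circles to transfer. circleData.Length is the number of circle objects the buffer must hold, and stride is the number of bytes each object occupies. With the space allocated, SetData() fills the buffer; this is the step that transfers the data to the GPU. Finally, bind the GPU buffer to the chosen kernel of the Compute Shader.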
```
int stride = (2 + 2 + 1) * 4; //2 floats origin, 2 floats velocity, 1 float radius - 4 bytes per float
buffer = new ComputeBuffer(circleData.Length, stride);
buffer.SetData(circleData);
shader.SetBuffer(circlesHandle, "circlesBuffer", buffer);
```
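The hand-computed stride can be cross-checked against what the runtime reports. The following is a sketch under stated assumptions: CircleData is a hypothetical stand-in that spells out each Vector2 as two floats so the example runs without Unity, and CircleLayout is an illustrative helper, not part of the project.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for the Circle struct: Unity's Vector2 is two floats, so it is
// written out as plain float fields to keep the sketch free of Unity types.
[StructLayout(LayoutKind.Sequential)]
public struct CircleData
{
    public float originX, originY;     // Vector2 origin
    public float velocityX, velocityY; // Vector2 velocity
    public float radius;
}

public static class CircleLayout
{
    // Stride in bytes as computed by hand in the text: 5 floats of 4 bytes each.
    public const int ManualStride = (2 + 2 + 1) * sizeof(float);

    // Stride as reported by the runtime for the struct itself.
    public static int MeasuredStride()
    {
        return Marshal.SizeOf<CircleData>();
    }
}
```

Both values come out as 20 bytes, so a buffer of circleData.Length elements occupies circleData.Length * 20 bytes. In the Unity project the same check could presumably be made with Marshal.SizeOf on the real Circle struct.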
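At this point, the data prepared on the CPU has been passed to the GPU through the Buffer.
OK, now let's put the data we worked so hard to get onto the GPU to use.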
```
[numthreads(32,1,1)]
void Circles (uint3 id : SV_DispatchThreadID)
{
    int2 centre = (int2)(circlesBuffer[id.x].origin + circlesBuffer[id.x].velocity * time);
    while (centre.x>texResolution) centre.x -= texResolution;
    while (centre.x<0) centre.x += texResolution;
    while (centre.y>texResolution) centre.y -= texResolution;
    while (centre.y<0) centre.y += texResolution;
    uint radius = (int)circlesBuffer[id.x].radius;
    drawCircle( centre, radius );
}
```
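Now the circles move continuously, because the Buffer stores the starting position and velocity of the circle at index id.x, and the kernel computes the current position from them and time.
To sum up, this section showed how to define a custom struct (data structure) on the CPU side, pass it to the GPU through a Buffer, and process the data on the GPU.
In the next section we learn how to get data from the GPU back to the CPU.
- Current version of the code:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Using_Buffer/Assets/Shaders/BufferJoy.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Using_Buffer/Assets/Scripts/BufferJoy.cs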
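3. Getting Data Back from the GPU
As before, create a Buffer, this time to carry data from the GPU back to the CPU, and define an array on the CPU side to receive it. Create the buffer, bind it to the shader, and finally create the CPU-side variable that will receive the GPU data.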
```
ComputeBuffer resultBuffer; // Buffer
Vector3[] output;           // received on the CPU
...
    // buffer in GPU memory
    resultBuffer = new ComputeBuffer(starCount, sizeof(float) * 3);
    shader.SetBuffer(kernelHandle, "Result", resultBuffer);
    output = new Vector3[starCount];
```
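The Compute Shader receives the same Buffer. Here the Buffer is readable and writable, meaning the Compute Shader can modify it. In the previous section the Compute Shader only needed to read the Buffer, so StructuredBuffer was enough. Here we need the RW variant.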
```
RWStructuredBuffer<float3> Result;
```
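Then call GetData after Dispatch to receive the data. GetData is a synchronous readback: the CPU waits until the GPU has finished the work.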
```
shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
resultBuffer.GetData(output);
```
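The idea is that simple. Now let's build a scene with a large number of stars orbiting a center.
The star positions are computed on the GPU; the C# side instantiates the objects and reads back the computed position of each star.
In the Compute Shader, each thread computes the position of one star and writes it to the Buffer.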
```
[numthreads(64,1,1)]
void OrbitingStars (uint3 id : SV_DispatchThreadID)
{
    float3 sinDir = normalize(random3(id.x) - 0.5);
    float3 vec = normalize(random3(id.x + 7.1393) - 0.5);
    float3 cosDir = normalize(cross(sinDir, vec));
    float scaledTime = time * 0.5 + random(id.x) * 712.131234;
    float3 pos = sinDir * sin(scaledTime) + cosDir * cos(scaledTime);
    Result[id.x] = pos * 2;
}
```
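Why this formula traces a circle: sinDir and cosDir are unit vectors perpendicular to each other, so the length of pos is sqrt(sin^2 + cos^2) = 1 before the final scale by 2. The sketch below models it on the CPU with System.Numerics; OrbitModel is a hypothetical name, and the two directions are passed in by the caller instead of coming from random3 as they do in the kernel.

```csharp
using System;
using System.Numerics;

// CPU-side model of the orbit formula in the kernel, using System.Numerics.
public static class OrbitModel
{
    // sinDirRaw and helper are arbitrary non-parallel directions; in the kernel
    // they come from random3(id.x). Here the caller supplies them.
    public static Vector3 Position(Vector3 sinDirRaw, Vector3 helper, float scaledTime)
    {
        Vector3 sinDir = Vector3.Normalize(sinDirRaw);
        // cosDir is perpendicular to sinDir, so the pair spans the orbit plane.
        Vector3 cosDir = Vector3.Normalize(Vector3.Cross(sinDir, Vector3.Normalize(helper)));
        Vector3 pos = sinDir * (float)Math.Sin(scaledTime)
                    + cosDir * (float)Math.Cos(scaledTime);
        return pos * 2f; // orbit radius 2, as in the kernel
    }
}
```

Whatever time is passed in, the returned point stays at distance 2 from the origin (up to floating-point error), which is what makes every star follow a circular orbit.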
On the CPU side, read the results with GetData and keep updating the position of the corresponding pre-instantiated GameObjects.
```
void Update()
{
    shader.SetFloat("time", Time.time);
    shader.Dispatch(kernelHandle, groupSizeX, 1, 1);
    resultBuffer.GetData(output);
    for (int i = 0; i < stars.Length; i++)
        stars[i].localPosition = output[i];
}
```
Current version of the code:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_GetData_From_Buffer/Assets/Shaders/OrbitingStars.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_GetData_From_Buffer/Assets/Scripts/OrbitingStars.cs
4. Using Noise
Generating a noise map with a Compute Shader is simple and very efficient.
```
float random (float2 pt, float seed) {
    const float a = 12.9898;
    const float b = 78.233;
    const float c = 43758.543123;
    return frac(sin(seed + dot(pt, float2(a, b))) * c );
}

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    float4 white = 1;
    Result[id.xy] = random(((float2)id.xy)/(float)texResolution, time) * white;
}
```
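The random function above is a sine-based hash: it scales sin of a dot product by a large constant and keeps only the fractional part. A plain C# port makes it easy to poke at on the CPU. This is a sketch; HashNoise is a hypothetical name, and because it uses double where the GPU uses float, the exact values will differ from what the shader produces.

```csharp
using System;

public static class HashNoise
{
    // frac(x) in HLSL is x - floor(x).
    static double Frac(double x)
    {
        return x - Math.Floor(x);
    }

    // Port of the kernel's random(pt, seed), with pt passed as (u, v).
    public static double Random(double u, double v, double seed)
    {
        const double a = 12.9898;
        const double b = 78.233;
        const double c = 43758.543123;
        double dot = u * a + v * b; // dot(pt, float2(a, b))
        return Frac(Math.Sin(seed + dot) * c);
    }
}
```

The same (u, v, seed) always yields the same value, which is why the kernel feeds in time as the seed to make the noise change from frame to frame.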
A library provides many more kinds of noise: https://pastebin.com/uGhMLKeM
```
#include "noiseSimplex.cginc" // Paste the code above into a file named "noiseSimplex.cginc"

...

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    float3 pos = (((float3)id)/(float)texResolution) * 2.0;
    float n = snoise(pos);
    float ring = frac(noiseScale * n);
    float delta = pow(ring, ringScale) + n;

    Result[id.xy] = lerp(darkColor, paleColor, delta);
}
```
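5. Deforming a Mesh
In this section we turn a cube into a sphere with a Compute Shader, with an animated, gradual transition!
As usual, declare the vertex data on the CPU side, hand it to the GPU for computation, and apply the resulting new positions (newPos) to the Mesh.
For the vertex struct, the CPU-side declaration includes a constructor for convenience; the GPU side mirrors it. We pass two Buffers to the GPU, one read-only and one read-write. The two start out identical; as time passes, the read-write Buffer gradually changes and the Mesh morphs from a cube into a sphere.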
```
// CPU
public struct Vertex
{
    public Vector3 position;
    public Vector3 normal;
    public Vertex( Vector3 p, Vector3 n )
    {
        position.x = p.x;
        position.y = p.y;
        position.z = p.z;
        normal.x = n.x;
        normal.y = n.y;
        normal.z = n.z;
    }
}
...
Vertex[] vertexArray;
Vertex[] initialArray;
ComputeBuffer vertexBuffer;
ComputeBuffer initialBuffer;
// GPU
struct Vertex {
    float3 position;
    float3 normal;
};
...
RWStructuredBuffer<Vertex>  vertexBuffer;
StructuredBuffer<Vertex>    initialBuffer;
```
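The full initialization (the Start() function) goes as follows:
1. On the CPU side, initialize the kernel and get a reference to the Mesh
2. Copy the Mesh data into CPU-side arrays
3. Create the Buffers for the Mesh data on the GPU
4. Send the Mesh data and other parameters to the GPU
After that, on every Update, we apply the new vertices read back from the GPU to the mesh.
So how is the GPU computation implemented?
It is quite simple: just normalize each vertex in model space! If every vertex position vector is normalized, the model becomes a sphere.
In the actual code the normals must be computed too; if they stay unchanged, the lighting looks very odd. So how are the normals computed? Very simple: the normalized position of the original cube vertex is the normal vector of the final sphere!
To get a "breathing" effect, add a sine function that controls the normalization coefficient.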
```
float delta = (Mathf.Sin(Time.time) + 1)/ 2;
```
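One way to picture how delta drives the morph is a linear blend between the cube vertex and its normalized counterpart. The sketch below is an assumption about the shape of the computation, not the code from the linked repository: MorphModel, the radius parameter, and the use of Vector3.Lerp are all illustrative choices, and it runs on the CPU with System.Numerics.

```csharp
using System;
using System.Numerics;

// CPU-side model of the cube-to-sphere blend; the real work happens in the kernel.
public static class MorphModel
{
    // delta = 0 keeps the cube vertex, delta = 1 puts it on a sphere of the given radius.
    public static Vector3 Blend(Vector3 cubePos, float radius, float delta)
    {
        Vector3 spherePos = Vector3.Normalize(cubePos) * radius;
        return Vector3.Lerp(cubePos, spherePos, delta);
    }

    // Same breathing factor as in the text, mapping sin from [-1, 1] to [0, 1].
    public static float Delta(float time)
    {
        return ((float)Math.Sin(time) + 1f) / 2f;
    }
}
```

As time advances, Delta oscillates between 0 and 1, so each vertex slides back and forth between its cube position and its sphere position.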
The code is a bit long, so here are the links.
Current version of the code:
- Compute Shader: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Mesh_Cube2Sphere/Assets/Shaders/MeshDeform.compute
- CPU: https://github.com/Remyuu/Unity-Compute-Shader-Learn/blob/L2_Mesh_Cube2Sphere/Assets/Scripts/MeshDeform.cs
6. Summary / Quiz
How should this struct be defined on the GPU:
```
struct Circle
{
    public Vector2 origin;
    public Vector2 velocity;
    public float radius;
}
```
How should the size of the ComputeBuffer be set for this struct?
```
struct Circle
{
    public Vector2 origin;
    public Vector2 velocity;
    public float radius;
}
```
Why is the following code wrong?
```
StructuredBuffer<float3> positions;
//Inside a kernel
...
positions[id.x] = fixed3(1,0,0);
```
References
Indirect Compute Shader
2024-05-27