public:cs:rasterization_on_larrabee [2015/07/17 14:46] – [Original text] oakfire | public:cs:rasterization_on_larrabee [2021/01/06 17:31] (current version) – [Tile assignment] oakfire | ||
---|---|---|---|
Line 2: | Line 2: | ||
[[https:// | [[https:// | ||
- | ===== Original text | + | --- Translation: | ||
+ | ===== Rasterization on Larrabee | ||
**Rasterization on Larrabee** | **Rasterization on Larrabee** | ||
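- | **Rasterization on Larrabee** | + | Note: Larrabee is the development codename of Intel's general-purpose graphics processor (GPGPU)/ | ||
- | Note: Larrabee is the development codename of Intel's general-purpose graphics processor (GPGPU)/ | + | | ||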
**By Michael Abrash** | **By Michael Abrash** | ||
Line 29: | Line 29: | ||
==== Parallel programming for semi-parallel tasks ==== | ==== Parallel programming for semi-parallel tasks ==== | ||
+ | **Parallel programming for semi-parallel tasks** | ||
Last month, in "A First Look at the Larrabee New Instructions (LRBni)," | Last month, in "A First Look at the Larrabee New Instructions (LRBni)," | ||
+ | |||
+ | Last month, in " | ||
In the process of implementing the standard graphics pipeline for Larrabee, we (the Larrabee team at RAD: Atman Binstock, Tom Forsyth, Mike Sartain, and I) have gotten considerable insight into that question, because while the pipeline is largely vectorizable, | In the process of implementing the standard graphics pipeline for Larrabee, we (the Larrabee team at RAD: Atman Binstock, Tom Forsyth, Mike Sartain, and I) have gotten considerable insight into that question, because while the pipeline is largely vectorizable, | ||
+ | |||
+ | In the process of implementing the standard graphics pipeline for Larrabee, | ||
Before we begin, in order to avoid potential confusion, let me define " | Before we begin, in order to avoid potential confusion, let me define " | ||
- | ==== Rasterization was the problem child - but less problematic than we thought ==== | + | Before we begin, | ||
+ | ==== Rasterization was the problem child - but less problematic than we thought ==== | ||
+ | Rasterization was the problem child - but less problematic than we thought | ||
When we look at applying the Larrabee New Instructions to the graphics pipeline, for the most part it's easy to find 16-wide work for the vector unit to do. Depth, stencil, pixel shading, and blending all fall out of processing 4x4 blocks, with writemasking at the edges of triangles. Vertex shading is not quite a perfect fit, but 16 vertices can be shaded and cached in parallel, which works well because vertex usage tends to be localized. Setting up 16 triangles at a time also works pretty well; there' | When we look at applying the Larrabee New Instructions to the graphics pipeline, for the most part it's easy to find 16-wide work for the vector unit to do. Depth, stencil, pixel shading, and blending all fall out of processing 4x4 blocks, with writemasking at the edges of triangles. Vertex shading is not quite a perfect fit, but 16 vertices can be shaded and cached in parallel, which works well because vertex usage tends to be localized. Setting up 16 triangles at a time also works pretty well; there' | ||
+ | |||
+ | When we look at applying LRBni to the graphics pipeline, | ||
That leaves rasterization, | That leaves rasterization, | ||
+ | |||
+ | That leaves rasterization, | ||
The Larrabee rasterizer was at least the fifth major rasterizer I've written, but while the others were loosely related, the Larrabee rasterizer belongs to a whole different branch of the family - so different, in fact, that it almost didn't get written. Why not? Because, as is so often the case, I wanted to stick with the mental model I already understood and was comfortable with, rather than reexamining my assumptions. Long-time readers will recall my fondness for the phrase " | The Larrabee rasterizer was at least the fifth major rasterizer I've written, but while the others were loosely related, the Larrabee rasterizer belongs to a whole different branch of the family - so different, in fact, that it almost didn't get written. Why not? Because, as is so often the case, I wanted to stick with the mental model I already understood and was comfortable with, rather than reexamining my assumptions. Long-time readers will recall my fondness for the phrase " | ||
+ | |||
+ | The Larrabee rasterizer was at least the fifth major rasterizer I've written, | ||
Coming up with better solutions is all about trying out different mental models to see what you might be missing, a point perfectly illustrated by my efforts about 10 years ago to speed up a texture mapper. I had thrown a lot of tricks at it and had made it a lot faster, and figured I was done; but just to make sure, I ran it by my friend David Stafford, an excellent optimizer. David said he'd think about it, and would let me know if he came up with anything. | Coming up with better solutions is all about trying out different mental models to see what you might be missing, a point perfectly illustrated by my efforts about 10 years ago to speed up a texture mapper. I had thrown a lot of tricks at it and had made it a lot faster, and figured I was done; but just to make sure, I ran it by my friend David Stafford, an excellent optimizer. David said he'd think about it, and would let me know if he came up with anything. | ||
+ | |||
+ | Coming up with better solutions, | ||
When I got home that evening, there was a message on the answering machine from David, saying that he had gotten 2 cycles out of the inner loop. (It's been a long time, and my memories have faded, so it may not have been exactly 2 cycles, but you get the idea.) I called him back, but he wasn't home (this was before cellphones were widespread). So all that evening I tried to figure out how David could have gotten those 2 cycles. As I ate dinner I wondered; as I brushed my teeth I wondered; and as I lay there in bed not sleeping, I wondered. Finally I had an idea - but it only saved 1 cycle, not 2. And no matter how much longer I racked my brains, I couldn' | When I got home that evening, there was a message on the answering machine from David, saying that he had gotten 2 cycles out of the inner loop. (It's been a long time, and my memories have faded, so it may not have been exactly 2 cycles, but you get the idea.) I called him back, but he wasn't home (this was before cellphones were widespread). So all that evening I tried to figure out how David could have gotten those 2 cycles. As I ate dinner I wondered; as I brushed my teeth I wondered; and as I lay there in bed not sleeping, I wondered. Finally I had an idea - but it only saved 1 cycle, not 2. And no matter how much longer I racked my brains, I couldn' | ||
+ | |||
+ | When I got home that evening, | ||
The next morning, I called David, admitted that I couldn' | The next morning, I called David, admitted that I couldn' | ||
+ | |||
+ | The next morning, | ||
It's certainly a funny story, but, more to the point, the only thing keeping my code from getting faster had been my conviction that there were no better solutions - that my current mental model was correct and complete. After all, the only thing that changed between the original solution and the better one was that I acquired the belief that there was a better solution - that is, that there was a better model than my current one. As soon as I had to throw away my original model, I had a breakthrough. | It's certainly a funny story, but, more to the point, the only thing keeping my code from getting faster had been my conviction that there were no better solutions - that my current mental model was correct and complete. After all, the only thing that changed between the original solution and the better one was that I acquired the belief that there was a better solution - that is, that there was a better model than my current one. As soon as I had to throw away my original model, I had a breakthrough. | ||
+ | |||
+ | It's certainly a funny story, | ||
Software rasterization was a lot like that. | Software rasterization was a lot like that. | ||
+ | |||
+ | Software rasterization was a lot like that. | ||
The previous experience that Tom Forsyth and I had had with software rasterization was that it was much, much slower than hardware. Furthermore, | The previous experience that Tom Forsyth and I had had with software rasterization was that it was much, much slower than hardware. Furthermore, | ||
+ | |||
+ | The previous experience that Tom Forsyth and I had had with software rasterization was that it was slower than hardware, | ||
In the early days of the project, as the vector width and the core architecture were constantly changing, one of the hardware architects, Eric Sprangle, kept asking us whether it might be possible to rasterize efficiently in software using vector processing, and we kept patiently explaining that it wasn' | In the early days of the project, as the vector width and the core architecture were constantly changing, one of the hardware architects, Eric Sprangle, kept asking us whether it might be possible to rasterize efficiently in software using vector processing, and we kept patiently explaining that it wasn' | ||
+ | |||
+ | In the early days of the project, | ||
And we immediately failed. As soon as we had to think hard about how the inner loop could be structured with vector instructions, | And we immediately failed. As soon as we had to think hard about how the inner loop could be structured with vector instructions, | ||
+ | |||
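+ | And we immediately failed. As soon as we had to think hard about how the inner loop could be structured with vector instructions, and write down the numbers - that is, as soon as our mental model was forced to deal with reality - it became clear that software rasterization was in fact within the performance ballpark. After that, it was just an engineering problem, | ||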
+ | |||
First, though, it's worth pointing out that, in general, dedicated hardware will be able to perform any specific task more efficiently than software; this is true by definition, since dedicated hardware requires no multifunction capability, so in the limit it can be like a general-purpose core with the extraneous parts removed. However, by the same token, hardware lacks the flexibility of CPUs, and that flexibility can allow CPUs to close some or all of the performance gap. Hardware needs worst-case capacity for each component, so it often sits at least partly idle; CPUs, on the other hand, can just switch to doing something different, so ALUs are never idle. CPUs can also implement flexible, adaptive approaches, and that can make a big difference, as we'll see shortly. | First, though, it's worth pointing out that, in general, dedicated hardware will be able to perform any specific task more efficiently than software; this is true by definition, since dedicated hardware requires no multifunction capability, so in the limit it can be like a general-purpose core with the extraneous parts removed. However, by the same token, hardware lacks the flexibility of CPUs, and that flexibility can allow CPUs to close some or all of the performance gap. Hardware needs worst-case capacity for each component, so it often sits at least partly idle; CPUs, on the other hand, can just switch to doing something different, so ALUs are never idle. CPUs can also implement flexible, adaptive approaches, and that can make a big difference, as we'll see shortly. | ||
- | ==== Why rasterization was the problem child ==== | + | First, though, | ||
+ | ==== Why rasterization was the problem child ==== | ||
+ | Why rasterization was the problem child | ||
I can't do a proper tutorial on rasterization here, so I'll just run through a brief refresher. For our purposes, all rasterization will be of triangles. There are three edges per triangle, each defined by an equation Bx+Cy relative to any point on the edge, with the sign indicating whether a point is inside or outside the edge. Both x and y are in 15.8 fixed-point format, with a range of [-16K, +16K). The edge equations are tested at pixel or sample centers, and for cases where a pixel or sample center is right on an edge, well-defined fill rules must be observed (in this case, top-left fill rules, which are generally implemented by subtracting 1 from the edge equation for left edges and flat top edges). Rasterization is performed with discrete math, and must be exact, so there must be enough bits to represent the edge equation completely. Finally, multisampled antialiasing must be supported. | I can't do a proper tutorial on rasterization here, so I'll just run through a brief refresher. For our purposes, all rasterization will be of triangles. There are three edges per triangle, each defined by an equation Bx+Cy relative to any point on the edge, with the sign indicating whether a point is inside or outside the edge. Both x and y are in 15.8 fixed-point format, with a range of [-16K, +16K). The edge equations are tested at pixel or sample centers, and for cases where a pixel or sample center is right on an edge, well-defined fill rules must be observed (in this case, top-left fill rules, which are generally implemented by subtracting 1 from the edge equation for left edges and flat top edges). Rasterization is performed with discrete math, and must be exact, so there must be enough bits to represent the edge equation completely. Finally, multisampled antialiasing must be supported. | ||
+ | |||
+ | I can't do a proper tutorial on rasterization here, | ||
Let's look at a quick example of applying the edge equation. Figure 1 shows an edge from (12, 8) to (4, 24). The B coefficient of the edge equation is simply the edge's y length: (y1 - y0). The C coefficient is the negation of the edge's x length: (x0 - x1). Thus, the edge equation in Figure 1 is (24 - 8)x + (12 - 4)y. Since we only care about the sign of the result (which indicates inside or outside), not the magnitude, this can be simplified to 2x + 1y, where the x value used in the equation is the distance between the point of interest and any point at which the equation is known to be zero (which is to say any point on the line); usually a vertex is used, as for example the vertex at (12, 8) is used in Figure 1. All points on the edge have the value 0, as can be seen in Figure 1 for the point on the line at (8, 16). | Let's look at a quick example of applying the edge equation. Figure 1 shows an edge from (12, 8) to (4, 24). The B coefficient of the edge equation is simply the edge's y length: (y1 - y0). The C coefficient is the negation of the edge's x length: (x0 - x1). Thus, the edge equation in Figure 1 is (24 - 8)x + (12 - 4)y. Since we only care about the sign of the result (which indicates inside or outside), not the magnitude, this can be simplified to 2x + 1y, where the x value used in the equation is the distance between the point of interest and any point at which the equation is known to be zero (which is to say any point on the line); usually a vertex is used, as for example the vertex at (12, 8) is used in Figure 1. All points on the edge have the value 0, as can be seen in Figure 1 for the point on the line at (8, 16). | ||
+ | |||
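+ | Let's look at a quick example of applying the edge equation. Figure 1 shows an edge from (12, 8) to (4, 24). The B coefficient of the edge equation is simply the edge's y length: (y1 - y0). The C coefficient is the negation of the edge's x length: | ||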
{{: | {{: | ||
**Figure 1**: Points on an edge always have an edge equation value of zero. | **Figure 1**: Points on an edge always have an edge equation value of zero. | ||
+ | |||
+ | Figure 1: | ||
Points on one side of the edge will have positive values, as shown in Figure 2 for the point at (12, 16), which has a value of 8. | Points on one side of the edge will have positive values, as shown in Figure 2 for the point at (12, 16), which has a value of 8. | ||
+ | |||
+ | Points on one side of the edge will have positive values, | ||
{{: | {{: | ||
**Figure 2**: Points on one side of an edge are always positive. | **Figure 2**: Points on one side of an edge are always positive. | ||
+ | |||
+ | Figure 2: Points on one side of an edge are always positive. | ||
Points on the other side of the edge will have negative values, as shown in Figure 3 for the point at (4, 12), which has a value of -12. | Points on the other side of the edge will have negative values, as shown in Figure 3 for the point at (4, 12), which has a value of -12. | ||
+ | |||
+ | Points on the other side of the edge will have negative values, | ||
{{: | {{: | ||
**Figure 3**: Points on the other side of the edge are always negative. | **Figure 3**: Points on the other side of the edge are always negative. | ||
+ | |||
+ | Figure 3: Points on the other side of the edge are always negative. | ||
Simple though it is, the edge equation is the basis upon which the Larrabee rasterizer is built. By applying the three edge equations at once, it is possible to determine which points are inside a triangle, and which are not. Figure 4 shows an example of how this works; the pixels shown in green are considered to be inside the triangle formed by the edges, because their centers are inside all three edges. As you can see, the edge equation is negative on the side of each edge that's inside the triangle; in fact, it gets more negative the farther you get from the edge on the inside, and more positive the farther you get from the edge on the outside. | Simple though it is, the edge equation is the basis upon which the Larrabee rasterizer is built. By applying the three edge equations at once, it is possible to determine which points are inside a triangle, and which are not. Figure 4 shows an example of how this works; the pixels shown in green are considered to be inside the triangle formed by the edges, because their centers are inside all three edges. As you can see, the edge equation is negative on the side of each edge that's inside the triangle; in fact, it gets more negative the farther you get from the edge on the inside, and more positive the farther you get from the edge on the outside. | ||
+ | |||
+ | Simple though it is, | ||
{{: | {{: | ||
**Figure 4**: Rasterization of a triangle, defined by three edges, each with an inside (negative edge equation values) and an outside (positive edge equation values). Pixels are categorized as inside or outside based on edge equation values at pixel centers (white dots). | **Figure 4**: Rasterization of a triangle, defined by three edges, each with an inside (negative edge equation values) and an outside (positive edge equation values). Pixels are categorized as inside or outside based on edge equation values at pixel centers (white dots). | ||
+ | |||
+ | Figure 4: Rasterization of a triangle, | ||
Vectorization is an essential part of Larrabee performance - capable of producing a speedup of an order of magnitude or more - so the knotty question is how we can perform the evaluation shown in Figures 1-4 using vector processing. More accurately, the question is how we can efficiently perform the evaluation using vector processing; obviously we could use vector instructions to evaluate every pixel on the screen for every triangle, but that would involve a lot of wasted work. What's needed is some way of using vector instructions to quickly narrow in on the work that's really needed. | Vectorization is an essential part of Larrabee performance - capable of producing a speedup of an order of magnitude or more - so the knotty question is how we can perform the evaluation shown in Figures 1-4 using vector processing. More accurately, the question is how we can efficiently perform the evaluation using vector processing; obviously we could use vector instructions to evaluate every pixel on the screen for every triangle, but that would involve a lot of wasted work. What's needed is some way of using vector instructions to quickly narrow in on the work that's really needed. | ||
+ | |||
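+ | Vectorization is an essential part of Larrabee performance - capable of producing a speedup of an order of magnitude or more - so the knotty question is how we can use vector processing to perform (translator: | ||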
We considered a lot of approaches; let's take a look at a couple, so you can get a sense of what a different tack we had to take in order to vectorize a task that's not an obvious candidate for parallelization - and in order to leverage the unique strengths of CPUs. | We considered a lot of approaches; let's take a look at a couple, so you can get a sense of what a different tack we had to take in order to vectorize a task that's not an obvious candidate for parallelization - and in order to leverage the unique strengths of CPUs. | ||
- | ==== The Pixomatic 1 rasterization approach ==== | + | We considered a lot of approaches; | ||
+ | ==== The Pixomatic 1 rasterization approach ==== | ||
+ | The Pixomatic 1 rasterization approach (translator's note: Pixomatic is a software renderer for x86 machines) | ||
Pixomatic version 1 used a rasterization approach often used by scalar software rasterizers, | Pixomatic version 1 used a rasterization approach often used by scalar software rasterizers, | ||
+ | |||
+ | Pixomatic version 1 used a rasterization approach often used by scalar software rasterizers, | ||
{{: | {{: | ||
**Figure 5**: A standard software rasterization approach, used by Pixomatic 1, in which the triangle is rasterized as either one or two trapezoids. This triangle is subdivided into two trapezoids; first the yellow and pink edges are set up and stepped down to the dashed line to generate spans of covered pixels, and then the black edge is set up and the black and pink edges are stepped from the dashed line to the bottom of the triangle. | **Figure 5**: A standard software rasterization approach, used by Pixomatic 1, in which the triangle is rasterized as either one or two trapezoids. This triangle is subdivided into two trapezoids; first the yellow and pink edges are set up and stepped down to the dashed line to generate spans of covered pixels, and then the black edge is set up and the black and pink edges are stepped from the dashed line to the bottom of the triangle. | ||
+ | |||
+ | Figure 5: A standard software rasterization approach, | ||
This approach was efficient for scalar code, but it just doesn' | This approach was efficient for scalar code, but it just doesn' | ||
+ | |||
+ | This approach was efficient for scalar code, | ||
==== Sweep rasterization ==== | ==== Sweep rasterization ==== | ||
+ | Sweep rasterization (scanning rasterization :?: | ||
Another approach, often used by hardware, is sweep rasterization. An example of this is shown in Figure 6. Starting at a top vertex, a vector stamp of 4x4 pixels is swept left, then right, then down, and the process is repeated until the whole triangle has been swept. The edge equation is evaluated directly at each of the 16 pixels for each 4x4 block that's swept over. | Another approach, often used by hardware, is sweep rasterization. An example of this is shown in Figure 6. Starting at a top vertex, a vector stamp of 4x4 pixels is swept left, then right, then down, and the process is repeated until the whole triangle has been swept. The edge equation is evaluated directly at each of the 16 pixels for each 4x4 block that's swept over. | ||
+ | |||
+ | Another approach is sweep rasterization, | ||
{{: | {{: | ||
Line 119: | Line 181: | ||
**Figure 6**: Sweep rasterization. Starting at the top vertex, a 4x4 pixel stamp is swept left until it's off the triangle, then right until it's off the triangle, and finally down, and then the process is repeated, until the whole triangle has been rasterized. | **Figure 6**: Sweep rasterization. Starting at the top vertex, a 4x4 pixel stamp is swept left until it's off the triangle, then right until it's off the triangle, and finally down, and then the process is repeated, until the whole triangle has been rasterized. | ||
+ | |||
+ | Figure 6: | ||
Sweep rasterization is more vectorizable than the Pixomatic 1 approach, because evaluating the pixel stamp is well-suited to vectorization, | Sweep rasterization is more vectorizable than the Pixomatic 1 approach, because evaluating the pixel stamp is well-suited to vectorization, | ||
- | ==== A high-level view of Larrabee rasterization ==== | + | Sweep rasterization is more vectorizable than the Pixomatic 1 approach, | ||
+ | ==== A high-level view of Larrabee rasterization ==== | ||
+ | A high-level view of Larrabee rasterization | ||
Larrabee takes a substantially different approach, one better suited to vectorization. In the Larrabee approach, we evaluate 16 blocks of pixels at a time to figure out which blocks are even touched by the triangle, then descend into each block that's at least partially covered, evaluating 16 smaller blocks within it, continuing to descend recursively until we have identified all the pixels that are inside the triangle. Here's an example of how that might work for our sample triangle. | Larrabee takes a substantially different approach, one better suited to vectorization. In the Larrabee approach, we evaluate 16 blocks of pixels at a time to figure out which blocks are even touched by the triangle, then descend into each block that's at least partially covered, evaluating 16 smaller blocks within it, continuing to descend recursively until we have identified all the pixels that are inside the triangle. Here's an example of how that might work for our sample triangle. | ||
+ | |||
+ | Larrabee takes a substantially different approach, | ||
As I'll discuss shortly, the Larrabee renderer uses a chunking architecture. In a chunking architecture, | As I'll discuss shortly, the Larrabee renderer uses a chunking architecture. In a chunking architecture, | ||
+ | |||
+ | As I'll discuss shortly, | ||
{{: | {{: | ||
**Figure 7**: A triangle to be rasterized, shown against the pixels in a 64x64 tile. | **Figure 7**: A triangle to be rasterized, shown against the pixels in a 64x64 tile. | ||
+ | |||
+ | Figure 7: | ||
First, we test which of the 16x16 blocks (16 of them - we check 16 things at a time whenever possible, in order to leverage the 16-wide vector units) that make up the tile are touched by the triangle, as shown in Figure 8. | First, we test which of the 16x16 blocks (16 of them - we check 16 things at a time whenever possible, in order to leverage the 16-wide vector units) that make up the tile are touched by the triangle, as shown in Figure 8. | ||
+ | |||
+ | First, we test which of the 16x16 blocks in the tile (16 of them - we check 16 things at a time whenever possible, | ||
{{: | {{: | ||
**Figure 8**: The 16 16x16 blocks are tested to see which are touched by the triangle. | **Figure 8**: The 16 16x16 blocks are tested to see which are touched by the triangle. | ||
+ | |||
+ | Figure 8: The 16 16x16 blocks are tested to see which are touched by the triangle. | ||
We find that only one 16x16 block is touched, the block shown in yellow, so we descend into that block to determine exactly what is touched by the triangle, subdividing it into 16 4x4 blocks (once again, we check 16 things at a time to be vector-friendly), | We find that only one 16x16 block is touched, the block shown in yellow, so we descend into that block to determine exactly what is touched by the triangle, subdividing it into 16 4x4 blocks (once again, we check 16 things at a time to be vector-friendly), | ||
+ | |||
+ | We find that only one 16x16 block is touched, | ||
{{: | {{: | ||
**Figure 9**: The 16 4x4 blocks are tested to see which are touched by the triangle. | **Figure 9**: The 16 4x4 blocks are tested to see which are touched by the triangle. | ||
+ | |||
+ | Figure 9: The 16 4x4 blocks are tested to see which are touched by the triangle. | ||
We find that 5 of the 4x4s are touched, so we process each of them separately, descending to the pixel level to generate masks for the covered pixels. The pixel rasterization for the first block is shown in Figure 10. | We find that 5 of the 4x4s are touched, so we process each of them separately, descending to the pixel level to generate masks for the covered pixels. The pixel rasterization for the first block is shown in Figure 10. | ||
+ | |||
+ | We find that 5 of the 4x4 blocks are touched, | ||
{{: | {{: | ||
**Figure 10**: Rasterization of the pixels in the first 4x4 block touched by the triangle. | **Figure 10**: Rasterization of the pixels in the first 4x4 block touched by the triangle. | ||
+ | |||
+ | Figure 10: Rasterization of the pixels in the first 4x4 block touched by the triangle. | ||
Figure 11 shows the final result. | Figure 11 shows the final result. | ||
+ | |||
+ | Figure 11 shows the final result. | ||
{{: | {{: | ||
**Figure 11**: All 5 4x4 blocks touched by the triangle have been rasterized. | **Figure 11**: All 5 4x4 blocks touched by the triangle have been rasterized. | ||
+ | |||
+ | Figure 11: All 5 4x4 blocks touched by the triangle have been rasterized. | ||
As you can see, the Larrabee approach processes 4x4 blocks, like the sweep approach, but unlike the sweep approach it doesn' | As you can see, the Larrabee approach processes 4x4 blocks, like the sweep approach, but unlike the sweep approach it doesn' | ||
+ | |||
+ | As you can see, | ||
Many years ago, I got a call from a guy I had once worked for. He wanted me to do some consulting work to help speed up his new company' | Many years ago, I got a call from a guy I had once worked for. He wanted me to do some consulting work to help speed up his new company' | ||
+ | |||
+ | Many years ago, | ||
He put me in touch with the engineer who was working on the software, who immediately informed me that the problem was that the convolution filter involved a great many integer multiplies, which the Sparc did very slowly, since at the time it didn't have a hardware integer multiply instruction. Instead, it had a partial product instruction, | He put me in touch with the engineer who was working on the software, who immediately informed me that the problem was that the convolution filter involved a great many integer multiplies, which the Sparc did very slowly, since at the time it didn't have a hardware integer multiply instruction. Instead, it had a partial product instruction, | ||
+ | |||
+ | He put me in touch with the engineer who was working on the software, | ||
I suggested unrolling that loop into a series of partial product instructions, | I suggested unrolling that loop into a series of partial product instructions, | ||
+ | |||
+ | I suggested unrolling that loop into a series of partial product instructions, | ||
When I asked which was smaller, though, the engineer said there was no difference. When I persisted, he said they were random. When I said that I doubted they were random, since randomness is actually hard to come by, he grumbled. I don't know why he was reluctant to get me that information - I guess he thought it was a waste of time - but finally he agreed to gather the data and call me back. | When I asked which was smaller, though, the engineer said there was no difference. When I persisted, he said they were random. When I said that I doubted they were random, since randomness is actually hard to come by, he grumbled. I don't know why he was reluctant to get me that information - I guess he thought it was a waste of time - but finally he agreed to gather the data and call me back. | ||
+ | |||
+ | When I asked which was smaller, | ||
He didn't call me back that day, though. And he didn't call me back the next day. When he hadn't called me back the third day, I figured I might as well get it over with, and called him. He answered the phone, and, when I identified myself, he said, "Oh, hi. I'm just standing here with my managers, watching. We're all really happy." | He didn't call me back that day, though. And he didn't call me back the next day. When he hadn't called me back the third day, I figured I might as well get it over with, and called him. He answered the phone, and, when I identified myself, he said, "Oh, hi. I'm just standing here with my managers, watching. We're all really happy." | ||
+ | |||
+ | He didn't call me back that day, though, | ||
When I asked what exactly he was happy about, he replied, "Well, when I looked at the data, it turned out 90% of the values in the convolution kernel were zero, so I just put an if-not-zero around the multiply, and now the whole program runs three times faster!" | When I asked what exactly he was happy about, he replied, "Well, when I looked at the data, it turned out 90% of the values in the convolution kernel were zero, so I just put an if-not-zero around the multiply, and now the whole program runs three times faster!" | ||
+ | |||
+ | When I asked what exactly he was happy about, | ||
Not-rasterizing is a lot like that, as we'll see shortly. | Not-rasterizing is a lot like that, as we'll see shortly. | ||
+ | Not-rasterizing is a lot like that, | ||
==== Tile assignment ==== | ==== Tile assignment ==== | ||
+ | Tile assignment | ||
As noted earlier, Larrabee uses chunked (also known as binned or tiled) rendering, where the target is divided into multiple rectangles, called tiles. The rendering commands are sorted according to the tiles they touch and stored in the corresponding bins, and then the contents of each bin are rendered separately to the corresponding tile. It's a bit complex, but it considerably improves cache coherence and parallelization. | As noted earlier, Larrabee uses chunked (also known as binned or tiled) rendering, where the target is divided into multiple rectangles, called tiles. The rendering commands are sorted according to the tiles they touch and stored in the corresponding bins, and then the contents of each bin are rendered separately to the corresponding tile. It's a bit complex, but it considerably improves cache coherence and parallelization. | ||
+ | |||
+ | As noted earlier, | ||
For chunking, rasterization consists of two steps; the first identifies which tiles a triangle touches, and the second rasterizes the triangle within each tile. So it's a two-stage process, and I'm going to discuss the two stages separately. | For chunking, rasterization consists of two steps; the first identifies which tiles a triangle touches, and the second rasterizes the triangle within each tile. So it's a two-stage process, and I'm going to discuss the two stages separately. | ||
+ | |||
+ | 对于分块, | ||
Figure 12 shows an example of a triangle to be drawn to a tiled render target. The light blue area is a 256x256 render target, subdivided into four 128x128 tiles. | Figure 12 shows an example of a triangle to be drawn to a tiled render target. The light blue area is a 256x256 render target, subdivided into four 128x128 tiles. | ||
+ | |||
+ | 图例 12 显示了一个三角形将被画到一个瓦片的例子. 亮蓝色区域是 256x256 的渲染对象, | ||
**Figure 12**: A triangle to be drawn to a 256x256 render target consisting of four 128x128 tiles. | **Figure 12**: A triangle to be drawn to a 256x256 render target consisting of four 128x128 tiles. | ||
+ | |||
+ | 图例 12 : 一个三角形要被画到由四个 128x128 瓦片组成的 256x256 渲染对象上. | ||
With Larrabee' | With Larrabee' | ||
+ | |||
+ | 对于 Larrabee 的分块架构, | ||
Assignment of triangles to tiles can easily be performed for relatively small triangles - say, up to a tile in size, which covers 90% of all triangles - by doing bounding box tests. For example, it would be easy with bounding box tests to find out what two tiles the triangle in Figure 12 is in. Larger triangles are currently assigned to tiles by simply walking the bounding box and testing each tile against the triangle; that doesn' | Assignment of triangles to tiles can easily be performed for relatively small triangles - say, up to a tile in size, which covers 90% of all triangles - by doing bounding box tests. For example, it would be easy with bounding box tests to find out what two tiles the triangle in Figure 12 is in. Larger triangles are currently assigned to tiles by simply walking the bounding box and testing each tile against the triangle; that doesn' | ||
+ | |||
+ | 通过包围盒测试, 相对小的三角形很容易分配到瓦片 - 所谓相对小是指最大不超过一个瓦片大小, 这涵盖了 90% 的三角形. 例如, 很容易使用包围盒测试来找出图例 12 中三角形在哪两个瓦片中. 更大的三角形的分配是遍历包围盒并逐个测试每个瓦片与三角形的关系; | |
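As a rough illustration of the bounding box test for small triangles, here is a minimal Python sketch. The function name `tiles_touched_by_bbox`, the vertex format, and the default 128x128 tiles on a 256x256 target (borrowed from Figure 12) are assumptions made for this example, not Larrabee code. A bounding box test is conservative: when the box overlaps four tiles, it may list a tile the triangle itself never touches.

```python
import math

def tiles_touched_by_bbox(tri, tile_size=128, target_w=256, target_h=256):
    """Return the (tx, ty) indices of every tile overlapped by the
    triangle's bounding box. `tri` is a list of three (x, y) vertices."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    # Clamp the bounding box to the render target.
    x0, x1 = max(min(xs), 0), min(max(xs), target_w - 1)
    y0, y1 = max(min(ys), 0), min(max(ys), target_h - 1)
    if x0 > x1 or y0 > y1:
        return []  # the box lies entirely off the render target
    tx0, tx1 = math.floor(x0) // tile_size, math.floor(x1) // tile_size
    ty0, ty1 = math.floor(y0) // tile_size, math.floor(y1) // tile_size
    return [(tx, ty)
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]

# A small triangle straddling the boundary between the two upper tiles.
print(tiles_touched_by_bbox([(100, 40), (150, 60), (120, 100)]))
```

For a triangle no larger than a tile, the box can overlap at most a 2x2 group of tiles, which is why this cheap test is enough for the common case.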
Large-triangle assignment to tiles is performed with scalar code, for simplicity and because it's not a significant performance factor. Let's look at how that scalar process works, because it will help us understand vectorized intra-tile rasterization later. I'll use a small triangle for the example, for simplicity and to make the figures legible, but as noted above, normally such a small triangle would be assigned to its tile or tiles using bounding box tests. | Large-triangle assignment to tiles is performed with scalar code, for simplicity and because it's not a significant performance factor. Let's look at how that scalar process works, because it will help us understand vectorized intra-tile rasterization later. I'll use a small triangle for the example, for simplicity and to make the figures legible, but as noted above, normally such a small triangle would be assigned to its tile or tiles using bounding box tests. | ||
+ | |||
+ | 大三角形的瓦片分配使用标量代码执行, | ||
Once we've set up the equation for an edge (by calculating B and C, as discussed when we looked at Figure 1), the first thing we do is calculate its value at the trivial reject corner of each tile. The trivial reject corner is the corner at which an edge's equation is most negative within a tile; the selection of the trivial reject corner for a given edge is based on its slope, as we'll see shortly. We set things up so that negative means inside in order to allow us to generate masks directly from the sign bit, so you can think of the trivial reject corner as the point in the tile that's most inside the edge. If this point isn't inside the edge, no point in the tile can be inside the edge, and therefore the whole triangle can be ignored for that tile. | Once we've set up the equation for an edge (by calculating B and C, as discussed when we looked at Figure 1), the first thing we do is calculate its value at the trivial reject corner of each tile. The trivial reject corner is the corner at which an edge's equation is most negative within a tile; the selection of the trivial reject corner for a given edge is based on its slope, as we'll see shortly. We set things up so that negative means inside in order to allow us to generate masks directly from the sign bit, so you can think of the trivial reject corner as the point in the tile that's most inside the edge. If this point isn't inside the edge, no point in the tile can be inside the edge, and therefore the whole triangle can be ignored for that tile. | ||
+ | |||
+ | 当我们建立三角形一条边的方程式后( 计算 B 和 C 系数, 如图例 1 我们所讨论的), | ||
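The corner selection can be sketched in a few lines of Python. This is only an illustration under assumptions: I write the edge equation as `b*x + c*y + k`, where the constant term `k` and all function names are mine, and I treat a value of exactly zero as "not inside" because only the sign bit is examined.

```python
def edge_value(b, c, k, x, y):
    """Evaluate the edge equation; negative means inside the edge."""
    return b * x + c * y + k

def trivial_reject_corner(b, c, x0, y0, x1, y1):
    """Pick the tile corner where the edge equation is most negative.
    A positive coefficient is minimized at the low coordinate,
    a negative (or zero) one at the high coordinate."""
    rx = x0 if b > 0 else x1
    ry = y0 if c > 0 else y1
    return rx, ry

def tile_trivially_rejected(b, c, k, tile):
    """True if even the most-inside corner of the tile is outside the edge."""
    rx, ry = trivial_reject_corner(b, c, *tile)
    return edge_value(b, c, k, rx, ry) >= 0

# Edge x - 200 < 0, so everything left of x = 200 is inside.
print(tile_trivially_rejected(1, 0, -200, (0, 0, 128, 128)))    # tile left of the edge
print(tile_trivially_rejected(1, 0, -200, (256, 0, 384, 128)))  # tile right of the edge
```

Because the choice depends only on the signs of `b` and `c`, the corner is the same for every tile tested against a given edge.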
Figure 13 shows the trivial reject test in action. Tile 0 is trivially rejected for the black edge and can be ignored, because its trivial reject corner is positive, and therefore the whole tile must be positive and must lie outside the triangle, while the other three tiles must be investigated further. You can see here how the trivial reject corner is the corner of each tile most inside the black edge; that is, the point with the most negative value in the tile. | Figure 13 shows the trivial reject test in action. Tile 0 is trivially rejected for the black edge and can be ignored, because its trivial reject corner is positive, and therefore the whole tile must be positive and must lie outside the triangle, while the other three tiles must be investigated further. You can see here how the trivial reject corner is the corner of each tile most inside the black edge; that is, the point with the most negative value in the tile. | ||
+ | |||
+ | 图例 13 显示了测定平凡拒绝角的手法. 瓦片 0 被黑边平凡拒绝, | ||
**Figure 13**: The tile trivial reject test. | **Figure 13**: The tile trivial reject test. | ||
+ | |||
+ | 图例 13 : 瓦片平凡拒绝角的测定. | |
Note that which corner is the trivial reject corner will vary from edge to edge, depending on slope. For example, it would be the lower left corner of each tile for the edge shown in red in Figure 14, because that's the corner that's most inside that edge. | Note that which corner is the trivial reject corner will vary from edge to edge, depending on slope. For example, it would be the lower left corner of each tile for the edge shown in red in Figure 14, because that's the corner that's most inside that edge. | ||
+ | |||
+ | 注意哪个角是平凡拒绝角会随着边变化, | ||
**Figure 14**: Which corner is the trivial reject corner varies with edge slope. Here the lower left corner of each tile is the trivial reject corner. | **Figure 14**: Which corner is the trivial reject corner varies with edge slope. Here the lower left corner of each tile is the trivial reject corner. | ||
+ | |||
+ | 图例 14 : 平凡拒绝角随边斜率而改变, | ||
If you understand what we've just discussed, you're ninety percent of the way to understanding the whole Larrabee rasterizer. The trivial reject test is actually very straightforward once you understand it - it's just a matter of evaluating the sign of a simple equation at the right point - but it can take a little while to get it, so you may find it useful to re-read the previous section if you're at all uncertain or confused. | If you understand what we've just discussed, you're ninety percent of the way to understanding the whole Larrabee rasterizer. The trivial reject test is actually very straightforward once you understand it - it's just a matter of evaluating the sign of a simple equation at the right point - but it can take a little while to get it, so you may find it useful to re-read the previous section if you're at all uncertain or confused. | ||
+ | |||
+ | 如果你理解了刚才我们所讨论的, | ||
So that's the tile trivial reject test. The other tile test is the trivial accept test. For this, we take the value at the trivial reject corner (the corner we just discussed) and add the amount that the edge equation changes for a step all the way to the diagonally opposite tile corner, the tile trivial accept corner. This is the point in the tile where the edge equation is most positive; you can think of this as the point in the tile that's most outside the edge. If the trivial accept corner for an edge is negative, that whole tile is trivially accepted for that edge, and there' | So that's the tile trivial reject test. The other tile test is the trivial accept test. For this, we take the value at the trivial reject corner (the corner we just discussed) and add the amount that the edge equation changes for a step all the way to the diagonally opposite tile corner, the tile trivial accept corner. This is the point in the tile where the edge equation is most positive; you can think of this as the point in the tile that's most outside the edge. If the trivial accept corner for an edge is negative, that whole tile is trivially accepted for that edge, and there' | ||
+ | |||
+ | 这就是平凡拒绝角测验. 另一个瓦片测验是平凡接受测验. 这个测验从获取平凡拒绝角(刚讨论过的)边方程式的值转变成获取该瓦片对角的, | |
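To make the step to the opposite corner concrete, here is a small Python sketch. It assumes a square tile and an edge equation written as `b*x + c*y + k` (the constant `k` and the function name are my own, for illustration only). Moving from the trivial reject corner to the diagonally opposite corner always moves against the sign of each coefficient, so the change in the equation is `abs(b)*size + abs(c)*size`.

```python
def reject_and_accept_values(b, c, k, x0, y0, size):
    """Return the edge equation values at the trivial reject corner and
    at the trivial accept corner of a square tile."""
    # Trivial reject corner: where the equation is most negative.
    rx = x0 if b > 0 else x0 + size
    ry = y0 if c > 0 else y0 + size
    reject_value = b * rx + c * ry + k
    # One add steps to the diagonally opposite (most positive) corner.
    accept_value = reject_value + abs(b) * size + abs(c) * size
    return reject_value, accept_value

# Edge x - 200 < 0 against the two upper tiles of a 256x256 target.
print(reject_and_accept_values(1, 0, -200, 0, 0, 128))    # both negative: trivially accepted
print(reject_and_accept_values(1, 0, -200, 128, 0, 128))  # mixed signs: partially covered
```

A tile whose accept value is negative is entirely inside the edge; a tile whose reject value is negative but whose accept value is not is only partially inside.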
Figure 15 shows the trivial accept test in action. Since the trivial accept corner is the corner at which the edge's equation is most positive, if this point is negative - and therefore inside the edge - all points in the tile must be inside the edge. Thus, tiles 0 and 1 are not trivially accepted for the black edge, because the equation for the black edge is positive at their trivial accept corners, but tiles 2 and 3 are trivially accepted, so rasterization of this triangle in tiles 2 and 3 can ignore the black edge entirely, saving a good bit of work. | Figure 15 shows the trivial accept test in action. Since the trivial accept corner is the corner at which the edge's equation is most positive, if this point is negative - and therefore inside the edge - all points in the tile must be inside the edge. Thus, tiles 0 and 1 are not trivially accepted for the black edge, because the equation for the black edge is positive at their trivial accept corners, but tiles 2 and 3 are trivially accepted, so rasterization of this triangle in tiles 2 and 3 can ignore the black edge entirely, saving a good bit of work. | ||
+ | |||
+ | 图例 15 显示了如何测验平凡接受角. 由于平凡接受角是边方程式值最大的角, | |
**Figure 15**: The tile trivial accept test. | **Figure 15**: The tile trivial accept test. | ||
+ | |||
+ | 图例 15 : 平凡接受角测验. | ||
There' | There' | ||
+ | |||
+ | 这里有个重要的不对等. 当我们看平凡拒绝, | ||
**Figure 16**: Tile 3 is trivially accepted against the black edge, but trivially rejected against the red edge. | **Figure 16**: Tile 3 is trivially accepted against the black edge, but trivially rejected against the red edge. | ||
+ | |||
+ | 图例 16 :瓦片 3 被黑边平凡接受, | ||
In Figure 17, however, tile 3 is trivially accepted by all three edges, and here we come to a key point. | In Figure 17, however, tile 3 is trivially accepted by all three edges, and here we come to a key point. | ||
+ | |||
+ | 然而在图例 17 中, 瓦片 3 被三条边平凡接受, | ||
**Figure 17**: Tile 3 is trivially accepted against all three edges. | **Figure 17**: Tile 3 is trivially accepted against all three edges. | ||
+ | |||
+ | 图例 17 :瓦片 3 被三条边平凡接受. | ||
If all three edges are negative at their respective trivial accept corners, then the whole tile is inside the triangle, and no further rasterization tests are needed - and this is what I meant earlier when I said the rasterizer takes advantage of CPU smarts by not-rasterizing whenever possible. The tile-assignment code can just store a draw-whole-tile command in the bin, and the bin rendering code can simply do the equivalent of two nested loops around the shaders, resulting in a full-screen triangle rasterization speed of approximately infinity - one of my favorite performance numbers! | If all three edges are negative at their respective trivial accept corners, then the whole tile is inside the triangle, and no further rasterization tests are needed - and this is what I meant earlier when I said the rasterizer takes advantage of CPU smarts by not-rasterizing whenever possible. The tile-assignment code can just store a draw-whole-tile command in the bin, and the bin rendering code can simply do the equivalent of two nested loops around the shaders, resulting in a full-screen triangle rasterization speed of approximately infinity - one of my favorite performance numbers! | ||
- | By the way, this whole process should be familiar to 3-D programmers, | + | 如果三条边各自的平凡接受角都是负值, |
- | done in exactly the same way - although in three dimensions instead of two - with the | + | |
- | same use of sign to indicate inside and outside for trivial accept and reject. Also, structures such as octrees employ a 3-D version of the hierarchical recursion used by the Larrabee rasterizer. | + | By the way, this whole process should be familiar to 3-D programmers, |
+ | |||
+ | 顺便一提, | ||
That completes our overview of how rasterization of large triangles for tile assignment works. As I said, this is done as a scalar evaluation in the Larrabee pipeline, so the trivial accept and reject tests for each tile are performed separately. Intra-tile rasterization, | That completes our overview of how rasterization of large triangles for tile assignment works. As I said, this is done as a scalar evaluation in the Larrabee pipeline, so the trivial accept and reject tests for each tile are performed separately. Intra-tile rasterization, | ||
- | Intra-tile rasterization: | + | 这就是完整的大三角形瓦片分配的概况. 如我所说, |
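The scalar tile-assignment walk described in this section might look something like the following Python sketch. Everything here is an assumption for illustration: the function names, the `b*x + c*y + k` form of the edge equation, the way the equations are built from the vertices (the article's own derivation of B and C is covered elsewhere), the 1024x1024 target, and the command strings stored in the bins.

```python
import math

def make_edges(v0, v1, v2):
    """Build three (b, c, k) edge equations, oriented so that the value
    is negative on the inside of the triangle."""
    edges = []
    for p, q, r in ((v0, v1, v2), (v1, v2, v0), (v2, v0, v1)):
        b = q[1] - p[1]
        c = -(q[0] - p[0])
        k = -(b * p[0] + c * p[1])
        if b * r[0] + c * r[1] + k > 0:  # third vertex must be negative
            b, c, k = -b, -c, -k
        edges.append((b, c, k))
    return edges

def classify_tile(edges, x0, y0, size):
    """Return 'reject', 'accept', or 'partial' for one square tile."""
    all_accepted = True
    for b, c, k in edges:
        rx = x0 if b > 0 else x0 + size
        ry = y0 if c > 0 else y0 + size
        reject_value = b * rx + c * ry + k
        if reject_value >= 0:
            return "reject"        # one rejecting edge is enough
        if reject_value + (abs(b) + abs(c)) * size >= 0:
            all_accepted = False   # this edge crosses the tile
    return "accept" if all_accepted else "partial"

def assign_large_triangle(tri, tile_size=128, target=1024):
    """Walk the tiles under the bounding box and bin the triangle."""
    edges = make_edges(*tri)
    last = target // tile_size - 1
    tx0 = max(math.floor(min(x for x, _ in tri) / tile_size), 0)
    tx1 = min(math.floor(max(x for x, _ in tri) / tile_size), last)
    ty0 = max(math.floor(min(y for _, y in tri) / tile_size), 0)
    ty1 = min(math.floor(max(y for _, y in tri) / tile_size), last)
    bins = {}
    for ty in range(ty0, ty1 + 1):
        for tx in range(tx0, tx1 + 1):
            verdict = classify_tile(edges, tx * tile_size, ty * tile_size, tile_size)
            if verdict == "accept":
                bins[(tx, ty)] = "draw-whole-tile"
            elif verdict == "partial":
                bins[(tx, ty)] = "rasterize"
    return bins

bins = assign_large_triangle([(-10, -10), (1000, -10), (-10, 1000)])
print(bins[(0, 0)], bins[(3, 4)], (7, 7) in bins)
```

Note the asymmetry in `classify_tile`: a single rejecting edge ends the test immediately, while acceptance has to hold for all three edges.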
+ | |||
+ | ==== Intra-tile rasterization: | ||
Intra-tile rasterization starts at the level of a whole tile. Tile size varies, depending on factors such as pixel size, but let's assume that we're working with a 64x64 tile. Given that starting size, we calculate the edge equation values at the 16 trivial reject and trivial accept corners of the 16x16 blocks that make up the tile, just as we did at the tile level - but now we do these calculations 16 at a time. Let's start with the trivial reject test. | Intra-tile rasterization starts at the level of a whole tile. Tile size varies, depending on factors such as pixel size, but let's assume that we're working with a 64x64 tile. Given that starting size, we calculate the edge equation values at the 16 trivial reject and trivial accept corners of the 16x16 blocks that make up the tile, just as we did at the tile level - but now we do these calculations 16 at a time. Let's start with the trivial reject test. | ||
+ | |||
+ | 瓦片内光栅化先从整个瓦片级别开始. 瓦片尺寸是可变的, | ||
First, we calculate which corner of the tile is the trivial reject corner, calculate the value of the edge equation at that point, and set up a table containing the 16 steps of the edge equation from the value at the tile trivial reject corner to the trivial reject corners of the 16x16 blocks that make up the tile. The signs of the 16 values that result tell us which of the blocks are entirely outside the edge, and can therefore be ignored, and which are at least partially accepted, and therefore have to be evaluated further. | First, we calculate which corner of the tile is the trivial reject corner, calculate the value of the edge equation at that point, and set up a table containing the 16 steps of the edge equation from the value at the tile trivial reject corner to the trivial reject corners of the 16x16 blocks that make up the tile. The signs of the 16 values that result tell us which of the blocks are entirely outside the edge, and can therefore be ignored, and which are at least partially accepted, and therefore have to be evaluated further. | ||
+ | |||
+ | 首先, 我们计算哪个角是瓦片的平凡拒绝角, | ||
In Figure 18, for example, we calculate the trivial reject values for the black edge, by stepping from the value we calculated earlier at the trivial reject corner of the tile, and eliminate five of the 16x16 blocks that make up the tile. The trivial reject corner for the tile is shown in red, and the 16 trivial reject corners for the blocks are shown in white. The gray blocks are the ones that are rejected against the black edge; you can see that their trivial reject corners all have positive edge equation values. The other 11 blocks have negative values at their trivial reject corners, so they' | In Figure 18, for example, we calculate the trivial reject values for the black edge, by stepping from the value we calculated earlier at the trivial reject corner of the tile, and eliminate five of the 16x16 blocks that make up the tile. The trivial reject corner for the tile is shown in red, and the 16 trivial reject corners for the blocks are shown in white. The gray blocks are the ones that are rejected against the black edge; you can see that their trivial reject corners all have positive edge equation values. The other 11 blocks have negative values at their trivial reject corners, so they' | ||
+ | |||
+ | 图例 18 中, 举个例子, | ||
**Figure 18**: The trivial reject tests for the 16 16x16 blocks in the tile. | **Figure 18**: The trivial reject tests for the 16 16x16 blocks in the tile. | ||
+ | |||
+ | 图例 18 : 瓦片中 16 个 16x16 块的平凡拒绝角的测验. | |
To make this process clearer, in Figure 19 the arrows represent the 16 steps that are added to the black edge's tile trivial reject value. Each of these steps is just an add, and we can do 16 adds with a single vector instruction, | To make this process clearer, in Figure 19 the arrows represent the 16 steps that are added to the black edge's tile trivial reject value. Each of these steps is just an add, and we can do 16 adds with a single vector instruction, | ||
+ | |||
+ | 为了让这个过程更为清晰, | ||
**Figure 19**: The steps from the tile trivial reject corner for the black edge to the trivial reject corners of the 16 16x16 blocks for the black edge. | **Figure 19**: The steps from the tile trivial reject corner for the black edge to the trivial reject corners of the 16 16x16 blocks for the black edge. | ||
+ | |||
+ | 图例 19: | ||
All this comes down to just setting up the right values, then doing one vector add and one vector compare. Remember that the edge equation is of the form Bx + Cy; therefore, to step a distance across the tile, we just set x and y to the horizontal and vertical components of that distance, evaluate the equation, and add that to the starting value. So all we're doing in Figure 19 is adding the 16 values that step the edge equation to the 16 trivial reject corners. For example, to get the edge equation value at the trivial reject corner of the upper-left block, we'd start with the value at the tile trivial reject corner, and add the amount that the edge equation changes for a step of -48 pixels in x and -48 pixels in y, as shown by the yellow arrow in Figure 20. To get the edge equation value at the trivial reject corner of the lower-left block, we'd instead add the amount that the edge equation changes for a step of -48 pixels in x only, as shown by the purple arrow. And that's really all there is to the Larrabee rasterizer - it's just a matter of stepping the edge equation values around the tile so as to determine what blocks and pixels are inside and outside the edges. | All this comes down to just setting up the right values, then doing one vector add and one vector compare. Remember that the edge equation is of the form Bx + Cy; therefore, to step a distance across the tile, we just set x and y to the horizontal and vertical components of that distance, evaluate the equation, and add that to the starting value. So all we're doing in Figure 19 is adding the 16 values that step the edge equation to the 16 trivial reject corners. For example, to get the edge equation value at the trivial reject corner of the upper-left block, we'd start with the value at the tile trivial reject corner, and add the amount that the edge equation changes for a step of -48 pixels in x and -48 pixels in y, as shown by the yellow arrow in Figure 20. To get the edge equation value at the trivial reject corner of the lower-left block, we'd instead add the amount that the edge equation changes for a step of -48 pixels in x only, as shown by the purple arrow. And that's really all there is to the Larrabee rasterizer - it's just a matter of stepping the edge equation values around the tile so as to determine what blocks and pixels are inside and outside the edges. | ||
+ | |||
+ | 所有这些归结为只要建立对应的值, | ||
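The step table for a 64x64 tile can be sketched in Python as follows, with a list comprehension standing in for the single 16-wide vector add. The names `reject_step_table` and `block_reject_values`, and the row-major lane order with y increasing downward, are assumptions made for this illustration. With both coefficients negative, the trivial reject corner is the lower-right one, which reproduces the -48 pixel steps from the text.

```python
TILE = 64    # tile edge length in pixels
BLOCK = 16   # block edge length in pixels

def reject_step_table(b, c):
    """Precompute the 16 steps of the edge equation from the tile's
    trivial reject corner to each block's trivial reject corner."""
    def corner(size):
        # Offset of the most-negative corner within a square of `size`.
        return (0 if b > 0 else size, 0 if c > 0 else size)
    tcx, tcy = corner(TILE)
    bcx, bcy = corner(BLOCK)
    steps = []
    for row in range(TILE // BLOCK):
        for col in range(TILE // BLOCK):
            dx = col * BLOCK + bcx - tcx
            dy = row * BLOCK + bcy - tcy
            steps.append(b * dx + c * dy)   # edge equation of form Bx + Cy
    return steps

def block_reject_values(b, c, tile_reject_value):
    """One 'vector add': broadcast the tile value and add the table."""
    return [tile_reject_value + s for s in reject_step_table(b, c)]

steps = reject_step_table(-2, -3)
print(steps[0])    # upper-left block: step of -48 in x and -48 in y
print(steps[12])   # lower-left block: step of -48 in x only
```

The sign test on the resulting 16 values then plays the role of the vector compare: any lane that is not negative marks a trivially rejected block.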
**Figure 20**: Examples of stepping the edge equation Bx + Cy. | **Figure 20**: Examples of stepping the edge equation Bx + Cy. | ||
+ | |||
+ | 图例 20 : | ||
Once again, we'll do trivial accept tests as well as trivial reject tests. In Figure 21 we've calculated the trivial accept values, and determined that 6 of the 16x16 blocks are trivially accepted for the black edge, and 10 of them are not trivially accepted. We know this because the values of the equation of the black edge at the trivial accept corners of the 6 pink blocks are negative, so those blocks are entirely inside the edge, while the values at the trivial accept corners of the other 10 blocks are positive, so those blocks are not entirely inside the black edge. The trivial accept values for the blocks can be calculated by stepping in any of several different ways: from the tile trivial accept corner, from the tile trivial reject corner, or from the 16 trivial reject values for the blocks. Regardless, it again takes only one vector instruction to step and one vector instruction to test the results. Combined with the results of the trivial reject test, we also know that 5 blocks are partially accepted. | Once again, we'll do trivial accept tests as well as trivial reject tests. In Figure 21 we've calculated the trivial accept values, and determined that 6 of the 16x16 blocks are trivially accepted for the black edge, and 10 of them are not trivially accepted. We know this because the values of the equation of the black edge at the trivial accept corners of the 6 pink blocks are negative, so those blocks are entirely inside the edge, while the values at the trivial accept corners of the other 10 blocks are positive, so those blocks are not entirely inside the black edge. The trivial accept values for the blocks can be calculated by stepping in any of several different ways: from the tile trivial accept corner, from the tile trivial reject corner, or from the 16 trivial reject values for the blocks. Regardless, it again takes only one vector instruction to step and one vector instruction to test the results. Combined with the results of the trivial reject test, we also know that 5 blocks are partially accepted. | ||
+ | |||
+ | 再次, 我们会像平凡拒绝角那样来做平凡接受角测验. 在图例 21 中我们已经计算出这些平凡接受角的值, | ||
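Here is a minimal Python sketch of one of the stepping options mentioned above: deriving the 16 trivial accept values from the 16 trivial reject values with a single add per lane. The function name, the `b*x + c*y + k` form of the equation, and the sample edge are my own assumptions; the counts it prints belong to this made-up edge, not to the edge in Figure 21.

```python
def classify_blocks(b, c, k, tile_x=0, tile_y=0, tile=64, block=16):
    """Classify the 16x16 blocks of a tile against one edge.
    Returns three lists of booleans in row-major block order."""
    n = tile // block
    # Offset of each block's trivial reject corner from its top-left corner.
    ox = 0 if b > 0 else block
    oy = 0 if c > 0 else block
    reject_vals = [b * (tile_x + col * block + ox) +
                   c * (tile_y + row * block + oy) + k
                   for row in range(n) for col in range(n)]
    # Step every lane to the diagonally opposite corner of its block.
    diagonal_step = abs(b) * block + abs(c) * block
    accept_vals = [v + diagonal_step for v in reject_vals]
    rejected = [v >= 0 for v in reject_vals]
    accepted = [v < 0 for v in accept_vals]
    partial = [not r and not a for r, a in zip(rejected, accepted)]
    return rejected, accepted, partial

# Vertical edge x - 40 < 0 cutting through the third column of blocks.
rejected, accepted, partial = classify_blocks(1, 0, -40)
print(sum(rejected), sum(accepted), sum(partial))
```

Every block falls into exactly one of the three groups, which is how the reject and accept results combine to give the number of partially accepted blocks.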
**Figure 21**: The trivial accept tests for the 16 16x16 blocks in the tile. | **Figure 21**: The trivial accept tests for the 16 16x16 blocks in the tile. | ||
+ | |||
+ | 图例 21 : 瓦片上 16 个 16x16 的块的平凡接受角测验. | ||
The 16 trivial reject and trivial accept values can be calculated with a total of just two vector adds per edge, using tables generated at triangle set-up time, and can be tested with two vector compares, which generate mask registers describing which blocks are trivially rejected and which are trivially accepted. We do this for the three edges, ANDing the results to create masks for the triangle, do some bit manipulation on the masks so they describe trivial and partial accept, and bit-scan through the results to find the trivially and partially accepted blocks. Each 16x16 that's trivially accepted against all three edges becomes one bin command; again, no further rasterization is needed for pixels in trivially accepted blocks. | The 16 trivial reject and trivial accept values can be calculated with a total of just two vector adds per edge, using tables generated at triangle set-up time, and can be tested with two vector compares, which generate mask registers describing which blocks are trivially rejected and which are trivially accepted. We do this for the three edges, ANDing the results to create masks for the triangle, do some bit manipulation on the masks so they describe trivial and partial accept, and bit-scan through the results to find the trivially and partially accepted blocks. Each 16x16 that's trivially accepted against all three edges becomes one bin command; again, no further rasterization is needed for pixels in trivially accepted blocks. | ||
+ | |||
+ | 通过三角形建立时生成的表, | ||
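The mask handling can be mimicked in Python with plain integers used as 16-bit masks. This is a sketch under assumptions: the helper names are mine, a set bit means "negative at that corner", and the input values are hand-made rather than taken from any figure. It is not the LRBni instruction sequence.

```python
def sign_mask(values):
    """Build a 16-bit mask with bit i set when values[i] is negative,
    standing in for a vector compare that writes a mask register."""
    mask = 0
    for i, v in enumerate(values):
        if v < 0:
            mask |= 1 << i
    return mask

def bit_scan(mask):
    """Yield the indices of the set bits, lowest first."""
    while mask:
        low = mask & -mask            # isolate the lowest set bit
        yield low.bit_length() - 1
        mask ^= low                   # clear it and continue

def combine_edges(reject_vals_per_edge, accept_vals_per_edge):
    """AND the per-edge masks into triangle-level masks."""
    not_rejected = 0xFFFF
    accepted = 0xFFFF
    for rv, av in zip(reject_vals_per_edge, accept_vals_per_edge):
        not_rejected &= sign_mask(rv)   # survives this edge's reject test
        accepted &= sign_mask(av)       # fully inside this edge
    partial = not_rejected & ~accepted  # survives, but not fully inside
    return accepted, partial

reject_vals = [[-1] * 12 + [1] * 4, [-1] * 16, [-1] * 16]
accept_vals = [[-1] * 8 + [1] * 8, [1] * 4 + [-1] * 12, [-1] * 16]
accepted, partial = combine_edges(reject_vals, accept_vals)
print(hex(accepted), hex(partial), list(bit_scan(accepted)))
```

Each index produced by scanning `accepted` would correspond to one draw-whole-block bin command, while the indices in `partial` are the blocks that need to descend to the next level.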
This is not obvious stuff, so let's take a moment to visualize the process. First, let's just look at one edge and trivial accept. | This is not obvious stuff, so let's take a moment to visualize the process. First, let's just look at one edge and trivial accept. | ||
+ | |||
+ | 这并不怎么明显, | ||
For a given edge, say edge number 1, we take the edge equation value at the tile trivial accept corner, broadcast it out to a vector, and vector add it to the precalculated values of the 16 steps to the trivial accept corners of the 16x16 blocks. This gives us the edge 1 values at the trivial accept corners of the 16 blocks, as shown in Figure 22. (The values shown are illustrative, | For a given edge, say edge number 1, we take the edge equation value at the tile trivial accept corner, broadcast it out to a vector, and vector add it to the precalculated values of the 16 steps to the trivial accept corners of the 16x16 blocks. This gives us the edge 1 values at the trivial accept corners of the 16 blocks, as shown in Figure 22. (The values shown are illustrative, | ||
+ | |||
+ | 对给定边, | ||
**Figure 22**: Edge 1 trivial accept tests for the 16 16x16 blocks. | **Figure 22**: Edge 1 trivial accept tests for the 16 16x16 blocks. | ||
+ | |||
+ | 图例 22 : 16 个 16x16 块的边 1 的平凡接受测验. | |
The step values shown on the second line in Figure 22 are computed when the triangle is set up, using a vector multiply and a vector multiply-add. At the top level of the rasterizer - testing the 16x16 blocks that make up the tile, as shown in Figure 22 - those set-up instructions are just direct additional rasterization costs, since the top level only gets executed once per triangle, so it would be accurate to add 6 instructions to the cost of the 16x16 code we'll look at shortly in Listings 1 and 2. However, as the hierarchy is descended, the tables for the lower levels (16x16-to-4x4 and 4x4-to-mask) get reused multiple times. For example, when descending from 16x16 to 4x4 blocks, the same table is used for all partial 16x16 blocks in the tile. Likewise, there is only one table for generating masks for partial 4x4 blocks, so the additional cost per iteration in Listing 3 due to table set-up would be 2 instructions divided by the number of partial 4x4 blocks in the tile. This is generally much less than 1 instruction per 4x4 block per edge, although it gets higher the smaller the triangle is. | The step values shown on the second line in Figure 22 are computed when the triangle is set up, using a vector multiply and a vector multiply-add. At the top level of the rasterizer - testing the 16x16 blocks that make up the tile, as shown in Figure 22 - those set-up instructions are just direct additional rasterization costs, since the top level only gets executed once per triangle, so it would be accurate to add 6 instructions to the cost of the 16x16 code we'll look at shortly in Listings 1 and 2. However, as the hierarchy is descended, the tables for the lower levels (16x16-to-4x4 and 4x4-to-mask) get reused multiple times. For example, when descending from 16x16 to 4x4 blocks, the same table is used for all partial 16x16 blocks in the tile. Likewise, there is only one table for generating masks for partial 4x4 blocks, so the additional cost per iteration in Listing 3 due to table set-up would be 2 instructions divided by the number of partial 4x4 blocks in the tile. This is generally much less than 1 instruction per 4x4 block per edge, although it gets higher the smaller the triangle is. |