Optimizing shader performance for the Xperia™ PLAY
The Xperia™ PLAY incorporates Qualcomm’s 1 GHz Snapdragon processor with an Adreno™ 205 graphics processor platform. In this article, Matthew Rusch, a Staff Engineer for the Advanced Content Group at Qualcomm, will share some practical advice on how you can optimize mobile graphics applications, specifically shader performance, for the Xperia™ PLAY.
Matthew Rusch
Adreno™ 205 is an embedded graphics processing unit (GPU) that supports a complete range of 2D and 3D graphics, including hardware-level support for OpenGL ES 1.1/2.0, EGL 1.3, and OpenVG™ 1.1. The Xperia™ PLAY uses Qualcomm’s dedicated Adreno™ GPU, which has graphics internal memory (GMEM) with dedicated color, Z, and stencil buffers. Matthew’s article provides developer tips for optimizing mobile graphics applications, specifically shader-based visual effects, on the Adreno™ platform. These key performance tips will help developers achieve optimal performance, decrease GPU idle time, and identify performance bottlenecks.
Adreno™ 205 Overview
The Adreno™ 205 GPU has a sophisticated architecture that maximizes the use of graphics processing power to achieve superior performance and flexibility, allowing developers to implement effects that were previously quite challenging.
To achieve superior performance, the Adreno™ 205 GPU uses a shared-resource architecture that allows ALU (Arithmetic Logic Unit) and fetch resources to be shared by vertex shaders, pixel shaders, tessellation, and general-purpose processing. Vertices and pixels are processed in groups of four as a vector, or thread. Sequencer logic then assigns these threads to the shader ALUs, and when a thread stalls, the shader ALUs can be reassigned. The shader architecture is also multithreaded: if a fragment shader’s execution is stalled for any reason, such as a texture fetch, execution is given to another thread.
To offer greater flexibility, the Adreno™ 205 GPU uses a programmable 3D graphics pipeline that enables developers to create shader and program objects, and to write vertex and fragment shaders in the OpenGL ES Shading Language. For example, fixed pixel color computations and hardware vertex transform and lighting are now replaced by programmable processors, giving programmers direct control over transform, lighting, and pixel processing. This opens up a wide range of opportunities for developers to implement unique animation, lighting, and shading effects.
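As a concrete illustration, a minimal OpenGL ES 2.0 shader pair replaces the fixed-function transform and texturing stages. This is a sketch, not code from the article; names such as uMvp and aTexCoord are illustrative:

```glsl
// --- vertex shader ---
// Applies a model-view-projection transform computed on the CPU and
// passes the texture coordinate through to the fragment stage.
attribute vec4 aPosition;
attribute vec2 aTexCoord;
uniform mat4 uMvp;
varying vec2 vTexCoord;
void main() {
    vTexCoord = aTexCoord;
    gl_Position = uMvp * aPosition;
}

// --- fragment shader ---
// Samples a texture at the interpolated coordinate.
precision mediump float;
uniform sampler2D uTexture;
varying vec2 vTexCoord;
void main() {
    gl_FragColor = texture2D(uTexture, vTexCoord);
}
```

Everything that fixed-function hardware once did implicitly (transform, texturing) is now explicit in code the developer controls, which is what makes the effects discussed below possible.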
General tips for shaders
Shader performance: Since the hardware works strictly on post-compiled, microcoded instructions, it is recommended to use a tool (such as the AMD GPU ShaderAnalyzer) that can analyze and predict the theoretical performance of pixel, vertex, geometry, hull, domain, and compute shaders. Note that this is only ‘theoretical performance’ because cache usage, memory latency, and branch patterns cannot be predicted.
Minimize shader registers: Minimizing general-purpose registers (GPRs) can be the most important performance optimization. Feeding simpler shaders to the compiler helps guarantee optimal results. Sometimes modifying the OpenGL Shading Language source to save even a single instruction can save a GPR. Not unrolling loops can also save GPRs, though whether a loop is unrolled is ultimately up to the shader compiler.
Minimize shader instruction count: The compiler is good at optimizing specific instructions, but it does not always do so in the most efficient way. Analyze your shaders to save instructions wherever possible.
Precalculate as much as possible: Identify needless math on shader constants in your shaders and move those calculations from the fragment shader (FS) to the vertex shader (VS) if possible. Math on shader constants can be easier to spot in the post-compiled microcode, so it is recommended to use the GPU ShaderAnalyzer, or a similar tool, rather than examining the source code alone.
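A minimal sketch of the idea, using hypothetical uniform names: both fragment shaders below produce the same scaled light color, but the second avoids redoing per-fragment math on values that are uniform across the entire draw call.

```glsl
// Wasteful: uLightColor * uIntensity is recomputed for every fragment,
// even though both values are constant for the whole draw call.
precision mediump float;
uniform vec3 uLightColor;
uniform float uIntensity;
varying float vDiffuse;
void main() {
    vec3 scaled = uLightColor * uIntensity;   // constant math in the FS
    gl_FragColor = vec4(scaled * vDiffuse, 1.0);
}

// Better: compute the product once on the CPU (or in the VS) and pass
// the result as a single precomputed uniform.
precision mediump float;
uniform vec3 uScaledLightColor;               // = uLightColor * uIntensity
varying float vDiffuse;
void main() {
    gl_FragColor = vec4(uScaledLightColor * vDiffuse, 1.0);
}
```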
Pack shader interpolators: Shader interpolator values, or varyings, require a register to hold the data being fed into a fragment shader, so you should minimize their use. Use uniforms where a value is constant across the draw. All varyings have four components, whether they are used or not, so pack values together. Putting two vec2 texture coordinates into a single vec4 value is common practice, but other strategies use more creative packing.
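The common vec2-into-vec4 case might look like the following sketch (attribute and sampler names are illustrative). Instead of two vec2 varyings, each of which would occupy a full four-component interpolator, both texture coordinate sets share one vec4:

```glsl
// --- vertex shader ---
attribute vec4 aPosition;
attribute vec2 aTexCoord0;
attribute vec2 aTexCoord1;
varying vec4 vTexCoords;     // .xy = first set, .zw = second set
void main() {
    vTexCoords = vec4(aTexCoord0, aTexCoord1);
    gl_Position = aPosition;
}

// --- fragment shader ---
// Unpacks the two coordinate sets from the single packed varying.
precision mediump float;
uniform sampler2D uBase;
uniform sampler2D uDetail;
varying vec4 vTexCoords;
void main() {
    vec4 base   = texture2D(uBase,   vTexCoords.xy);
    vec4 detail = texture2D(uDetail, vTexCoords.zw);
    gl_FragColor = base * detail;
}
```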
Avoid uber-shaders: Uber-shaders combine multiple shaders into a single, larger one, using static branching to dictate the control flow. Using them makes sense if you are trying to reduce state changes and batch draw calls, but since this often comes at the expense of an increased GPR count, it does not always pay off.
Use dynamic branching carefully: Static branching performs well, but may create extra GPRs. Dynamic branching in a shader has a non-constant overhead that depends on the exact shader code, so it is not always a performance win. Since multiple pixels are processed as one thread, if some pixels take a branch while others do not, the GPU must execute both paths (all instructions really do execute, but a masking bit controls the output so that only the appropriate pixels are affected). In this case, performance is no better than if each pixel were processed individually. If all pixels take the same path, the GPU can truly take the branch, which is good for performance.
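When the branch body is cheap, computing both results and selecting between them is often no worse than a dynamic branch, since both sides execute anyway whenever pixels in a thread diverge. A hypothetical sketch using GLSL’s built-in mix() to make the selection explicit:

```glsl
// Branch-free selection: equivalent to
//   if (uUseTint > 0.5) color.rgb *= uTint;
// but with no divergent control flow. uUseTint is an illustrative
// toggle uniform set to 0.0 or 1.0 by the application.
precision mediump float;
uniform sampler2D uTexture;
uniform float uUseTint;
uniform vec3 uTint;
varying vec2 vTexCoord;
void main() {
    vec4 color = texture2D(uTexture, vTexCoord);
    color.rgb = mix(color.rgb, color.rgb * uTint, uUseTint);
    gl_FragColor = color;
}
```

As the article notes, whether this beats a real branch depends on the shader and the compiler, so measure both versions.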
Killing pixels in the fragment shader: Some developers believe that manually killing pixels in the fragment shader always boosts performance. However, this is not necessarily true, for two reasons:
If some pixels in a thread are killed, and others are not, the shader must still execute.
It is dependent on how the shader compiler generates microcode.
In theory, if all pixels in a thread are killed, the GPU stops processing that thread as soon as it can.
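For reference, killing a pixel in OpenGL ES 2.0 is done with the discard keyword, as in this alpha-test-style sketch (uniform names are illustrative):

```glsl
// Discards fragments whose sampled alpha is below a threshold. Note
// that, per the caveats above, the thread containing this fragment may
// still execute in full if its other pixels survive.
precision mediump float;
uniform sampler2D uTexture;
varying vec2 vTexCoord;
void main() {
    vec4 color = texture2D(uTexture, vTexCoord);
    if (color.a < 0.5) {
        discard;
    }
    gl_FragColor = color;
}
```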
Break up draw calls: If a shader is heavy on GPRs and heavy on texture cache demands, breaking the draw call into multiple passes may increase performance. Whether the results will be positive is very hard to predict, so measuring real-world performance both ways is the best way to decide. Ideally, a two-pass draw would combine its results with simple alpha blending. Some developers may consider using a true deferred rendering algorithm, but that approach has many drawbacks; notably, the graphics internal memory must be resolved for a previous pass to be used as input to a subsequent pass. Since resolves are not free, this is a performance cost that must be recouped elsewhere in the algorithm.
In conclusion, it is worth checking the internet for general graphics optimization advice. However, be aware that some advice centers on optimizing around state changes and driver-call overhead at the expense of shader complexity. It is important to understand the hardware you are targeting, and to use all the tools at your disposal.