C++ Coder

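HPC (high-performance computing) architecture, implementation, compiler instruction optimization, algorithm optimization: LLVM, Clang, OpenCL, CUDA, OpenACC, C++ AMP, OpenMP, MPI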

Low ALUBusy and low FetchUnitBusy

http://devgurus.amd.com/thread/158866

This question has not been answered.

NURBS 2012-3-19 1:35 PM

Hi,

When my kernel performs badly, the APP Profiler reports very low ALUBusy and FetchUnitBusy values (both less than 10%).

What can be the bottleneck here? Could it be because of the high number of code paths?

Thanks

NURBS

Helpful answer: Re: Low ALUBusy and low FetchUnitBusy
pesh 2012-3-26 7:07 AM (in reply to NURBS)
Hi, NURBS!
Can you provide information about your device? If it's an AMD APU then there were problems with performance counters in previous versions of APP Profiler.
Also, check the ALUPacking counter. If its value is low, then your code is VLIW limited and ALUBusy will be poor. In this case, try to reduce data dependencies between sequential operations; this allows the compiler to pack ALU instructions into VLIW bundles more effectively and make better use of ALU resources. Try to reduce control flow statements as well, since they affect the counters too. In your situation, maybe you have if-statements where one branch does a fetch operation and the other does some computation? That will cause part of the wavefront to do the fetch, and only after that will the remainder of the wavefront do its ALU operations, so you will use only part of the resources at any one time.
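Editor's sketch of the two suggestions above, written as plain C++ so it runs without an OpenCL device. The function names and the cycle costs are hypothetical, chosen only for illustration. The first pair shows how splitting one dependent chain into independent accumulators gives a VLIW compiler operations it could pack together. The last function is a toy cost model, not a profiler formula: it assumes a wavefront pays for every branch that at least one of its work-items takes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dependent chain: every addition needs the previous result, so there is
// nothing independent to place in the same VLIW bundle.
float sum_chained(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Four independent accumulators: the four additions in one iteration do not
// depend on each other, so a VLIW compiler is free to pack them together.
float sum_independent(const std::vector<float>& v) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < v.size(); ++i) a0 += v[i];  // leftover elements
    return (a0 + a1) + (a2 + a3);
}

// Toy model (an assumption): a wavefront executes each side of an if/else
// one after the other whenever at least one work-item takes that side.
// fetch_cost and alu_cost are made-up cycle counts.
int wavefront_branch_cost(const std::vector<bool>& takes_fetch_branch,
                          int fetch_cost, int alu_cost) {
    bool any_fetch = false;
    bool any_alu = false;
    for (bool takes_fetch : takes_fetch_branch) {
        if (takes_fetch) any_fetch = true;
        else any_alu = true;
    }
    return (any_fetch ? fetch_cost : 0) + (any_alu ? alu_cost : 0);
}
```

In this model a wavefront whose work-items all take the same branch pays for one side only, while a single work-item taking the other side makes the whole wavefront pay for both. Note that regrouping floating-point additions can change rounding in general, so the two sums agree exactly only for inputs where the additions are exact.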
- Re: Low ALUBusy and low FetchUnitBusy
  NURBS 2012-3-26 7:57 AM (in reply to pesh)
  I have dual Radeon 6950s with either the 12.3 or the new beta driver. It seems control flow was the issue; things are much better now. Is there an equation under which the counter values sum to 100%, so that I can be more certain I am not getting bogus numbers?
  - Re: Low ALUBusy and low FetchUnitBusy
    pesh 2012-3-26 8:46 AM (in reply to NURBS)
    I guess not; there is no such equation. First of all, when a wavefront executing on a compute unit issues a fetch instruction, that wavefront goes to the fetch unit, where it sits until the fetch is done. During this time other wavefronts are doing calculations, or waiting until the fetch unit becomes free so they can execute their next fetch instructions. So while some wavefronts are doing memory reads or writes, others can do computations; in the best case both counters can reach 100%, and the ALUFetchRatio counter will equal 1. Other important counters are FetchUnitStalled and WriteUnitStalled; try to keep them close to 0. If they are too high, then many wavefronts are waiting for the fetch unit to do memory reads/writes. To improve performance, first of all try to use a sequential memory access pattern, then try to use local memory if your algorithm reuses data several times within a workgroup.

posted on 2013-01-09 16:26 by jackdong. Category: OpenCL
