Peak efficiency is reached only when all 4 or 5 ALUs have work to do in every cycle.
If you disassemble your code (for example with the Kernel Analyzer), you can easily spot the idle ALUs.
For example:
10 x: SUB_INT T0.x, PV9.z, KC0[2].x
w: SETGE_UINT ____, PV9.z, KC0[2].x ;y,z,t sleeps
11 z: AND_INT ____, T0.y, PV10.w ;x,y,w,t sleeps
12 y: CNDE_INT T1.y, PV11.z, T0.z, T0.x ;x,z,w,t sleeps
13 x: ADD_INT ____, KC0[2].x, PV12.y ;y,z,w,t sleeps
This is quite suboptimal: it does only 5 operations in 4 clocks, while the possible maximum would be 4*5=20 operations (on VLIW5).
826 x: XOR_INT T1.x, R28.w, T0.w
y: SETGT_UINT ____, T1.x, T0.w
z: XOR_INT T3.z, KC0[13].z, R20.y VEC_021
w: SETGT_UINT T2.w, T2.w, R15.y VEC_201
t: SETGT_UINT T0.w, R9.x, T1.y
827 x: ADD_INT ____, T0.z, T2.z
y: ADD_INT T2.y, T0.y, T2.x VEC_021
z: ADD_INT T0.z, T1.z, T2.y VEC_210
w: ADD_INT ____, PV826.y, T3.y VEC_021
t: SETGT_UINT ____, T3.w, R5.x
This one is maximum utilization: 10 operations in 2 clocks.
There are tricks to improve local parallelism in code (other than simply vectorizing everything), like breaking dependency chains:
for example a+b+c+d -> (a+b)+(c+d)