Fast Stream Array Access

martinvicanek · Post by **martinvicanek** » Sun Oct 19, 2014 7:27 pm

Following KG's excellent ASM posts over at FS Guru I stumbled over a possibility to considerably cut down CPU load for stream array access. As an example I am attaching a low-CPU delay (integer and interpolated variants). The design borrows from Trogz Toolz; he has some smart and highly optimized stuff there. Hard to believe there was still a factor of 3(!) of optimization potential to gain. :shock:

Boy this opens up possibilities: fast lookup tables, fast wavetable oscillators, you name it.

tester · Post by **tester** » Sun Oct 19, 2014 7:51 pm

It's great news Martin. Another set of "impossible" (since the SM age) problems will be solved. I can't wait to see it. :-)

Exo · Post by **Exo** » Sun Oct 19, 2014 8:12 pm

Excellent, nice work! Yes, stream arrays have always been very slow because of the need to unpack the channels.
This really is a game changer because a huge bottleneck has been removed.

Gonna have a look to see if I can optimize a few other things with this.

Exo · Post by **Exo** » Sun Oct 19, 2014 8:27 pm

Hi Martin, do you think it is possible to do this trick with this code?

Code:

polyintin addr;
polyintin max;
streamin index;
streamout out;

int zero = 0;
int temp = 0;
stage2;
mov eax,addr[0];
cmp eax,0;
jz bypass;

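  cvtps2dq xmm0,index; //convert the four float indices to integers
  maxps xmm0,zero;     //clamp to >= 0
  minps xmm0,max;      //clamp to <= max
  pslld xmm0,2;        //index*4 = byte offset (4 bytes per float)
  paddd xmm0,addr;     //add the base address of the mem
  movaps temp,xmm0;    //store the four addresses for the scalar reads below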
  
  //Read
  mov eax,temp[0];
  fld [eax] ; fstp out[0]; 

  mov eax,temp[1];
  fld [eax] ; fstp out[1];
  
  mov eax,temp[2];
  fld [eax] ; fstp out[2];

  mov eax,temp[3];
  fld [eax] ; fstp out[3];
    
bypass:

This reads directly from the address of a mem, instead of from the mem input or an array. Here eax holds the actual memory address, and we read the value by doing [eax]. I know it can work easily with the mem input because it is copied into a standard code array.
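
For readers who don't read SSE fluently, here is a rough Python model of the address arithmetic in the code above. It is only a sketch: `element_addresses` and its parameter names are made up for illustration, and it assumes 4-byte floats and the default round-to-nearest mode of cvtps2dq.

```python
def element_addresses(base_addr, indices, max_index, elem_size=4):
    """Hypothetical model of cvtps2dq / maxps / minps / pslld / paddd for four channels."""
    addrs = []
    for idx in indices:
        i = round(idx)                           # cvtps2dq: float -> int (round half to even)
        i = max(0, min(i, max_index))            # maxps / minps: clamp to 0..max_index
        addrs.append(base_addr + i * elem_size)  # pslld 2 (times 4), then paddd base
    return addrs


if __name__ == "__main__":
    # Each channel gets its own address; out-of-range indices are clamped.
    print(element_addresses(1000, [0.0, 2.4, -3.0, 99.0], max_index=7))
```

Each of the four resulting addresses is then read separately through the FPU (fld / fstp), which is the slow part the thread is trying to get rid of.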

KG_is_back · Post by **KG_is_back** » Sun Oct 19, 2014 8:37 pm

It should be possible, as I have posted on the FS guru: http://flowstone.guru/blog/how-to-use-assembler-part-3-alu-fpu-and-array-management/ just after Martin's example post. I haven't tested it, though. In that particular case the problem is a little more complicated: you need to read values that sit in different channels and put them into the desired channel. The only way to do that is code branching to pick the right shufps action.
Another concern is what happens when the array size is not a multiple of 4 samples, because with the last values you would also read data outside the mem when using movaps (which works on 16-byte aligned data). That may or may not crash. Further testing has to be done...
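
The two concerns above can be illustrated with a small Python model. This is a sketch under assumptions, not FlowStone code: the function names are invented, and it simply treats a movaps read as fetching the whole 4-float block that contains the requested index.

```python
def split_index(index):
    """Split a sample index into (start of its 4-float block, lane inside the block)."""
    block_start = index & -4   # clear the two lowest bits: round down to a multiple of 4
    lane = index & 3           # keep only the two lowest bits: index % 4
    return block_start, lane


def block_read_overruns(index, length):
    """True if a 4-wide read of the block containing `index` would go past `length` samples."""
    block_start, _ = split_index(index)
    return block_start + 4 > length


if __name__ == "__main__":
    print(split_index(6))              # block start and lane for index 6
    print(block_read_overruns(5, 6))   # 6-sample array, reading around index 5
```

The lane value is what decides which shuffle is needed, and the overrun check shows why an array whose length is not a multiple of 4 is a problem for the last block.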

martinvicanek · Post by **martinvicanek** » Sun Oct 19, 2014 9:28 pm

Exo wrote:Hi Martin, do you think it is possible to do this trick with this code? [...]

Hehe, that's what I am after as well. :mrgreen:

So far I have only been able to do this with arrays declared in the same ASM module, though. KG has me lost, I'm curious what he will be pulling out of his sleeve next. :ugeek:

KG_is_back · Post by **KG_is_back** » Sun Oct 19, 2014 9:36 pm

Nope... it seems movaps works only with data that was declared as an SSE array, which mems are not.

martinvicanek · Post by **martinvicanek** » Sun Oct 19, 2014 9:44 pm

Okay, that explains it. So could we declare an SSE array and copy the external mem to it in stage0 (basically what mem input in 3.0.5 does)? Then we'd have fast movaps/shufps access in stage2.
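
A minimal Python sketch of that copy step, to make the idea concrete. The names are hypothetical, and padding the copy up to a multiple of 4 is an assumption added here so that a whole-block read can never run past the end; nothing in this thread confirms that FlowStone pads its arrays that way.

```python
def copy_to_padded_blocks(mem, pad_value=0.0):
    """Copy `mem` once (the 'stage0' step) into 4-float blocks, padding the last one."""
    padded_len = (len(mem) + 3) & -4                     # round length up to a multiple of 4
    padded = list(mem) + [pad_value] * (padded_len - len(mem))
    return [padded[i:i + 4] for i in range(0, padded_len, 4)]


def read(blocks, index):
    """Per-sample access (the 'stage2' step): fetch a whole block, then pick the lane."""
    return blocks[index >> 2][index & 3]


if __name__ == "__main__":
    blocks = copy_to_padded_blocks([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    print(blocks)
    print(read(blocks, 5))
```

The copy costs something once, but every later access only needs one block fetch plus a lane pick.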

KG_is_back · Post by **KG_is_back** » Sun Oct 19, 2014 9:47 pm

That should do the trick.

BTW here is the code I came up with:

Code:

streamin addr;
streamin max;
streamin index;
streamout out;

int zero = 0;
int temp = 0;
int temp2=0;
int I0=0;
int I1=1;
int I2=2;
int I3=3;
int In4=-4; //binary mask that clears the lowest two bits,
           //i.e. it rounds down to the nearest multiple of 4
           //I3 (declared above) keeps only the lowest two bits. It is actually N%4
float array[4];
stage2;
mov eax,addr[0];
cmp eax,0;
jz bypass;

  cvtps2dq xmm0,index;
  maxps xmm0,zero;
  minps xmm0,max;
  movaps xmm1,xmm0;
  andps xmm0,In4;
  pslld xmm0,2;
  paddd xmm0,addr; //this is the address for the 16-byte aligned read
  movaps temp,xmm0;
  andps xmm1,I3; //this will be used to shuffle the right sample into output
  movaps temp2,xmm1;
  pslld xmm1,4;
  //read for channel1 and store into array
  mov eax,temp[0];
  movaps xmm2,[eax];
  movd eax,xmm1;
  movaps array[eax],xmm2;
  
  //extract values from array and shuffle each value into index[0]
  mov eax,0;
  movaps xmm0,array[eax]; //xmm0 may contain desired value in ch(0) - no shuffling needed
  movaps xmm4,I0;
  cmpps xmm4,temp2,0; //true if index%4==0
  andps xmm0,xmm4; //mask xmm0 (not xmm1, which is overwritten below)
  
  add eax,16;
  movaps xmm1,array[eax]; //xmm1 may contain desired value in ch(1) - shuffle it to 0
  shufps xmm1,xmm1,1;
  movaps xmm4,I1;
  cmpps xmm4,temp2,0; //true if index%4==1
  andps xmm1,xmm4;
  
  add eax,16;
  movaps xmm2,array[eax]; //...
  shufps xmm2,xmm2,2;
  movaps xmm4,I2;
  cmpps xmm4,temp2,0; //true if index%4==2
  andps xmm2,xmm4;
  
  add eax,16;
  movaps xmm3,array[eax];
  shufps xmm3,xmm3,3;
  movaps xmm4,I3;
  cmpps xmm4,temp2,0; //true if index%4==3
  andps xmm3,xmm4;
  
  orps xmm0,xmm1;
  orps xmm0,xmm2;
  orps xmm0,xmm3;
  movaps out,xmm0; 
  
bypass:

it does not work because of the movaps xmm0,[eax] but replacing that with array should fix it.
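
The compare / and / or pattern in the second half of that code is a branchless select. Here is a rough Python model of just that idea, working on the 32-bit patterns of the floats. It is a sketch only: the helper names are invented, and it models a single output channel rather than the full four-channel SSE code.

```python
import struct


def float_bits(x):
    """32-bit pattern of a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_float(b):
    """Single-precision float from a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def select_lane(block, lane):
    """Pick block[lane] without branching: mask every candidate, then OR them together."""
    result = 0
    for k in range(4):
        mask = 0xFFFFFFFF if lane == k else 0   # cmpps ...,0 gives all ones or all zeros
        result |= float_bits(block[k]) & mask   # andps with the mask, orps into the result
    return bits_float(result)


if __name__ == "__main__":
    print(select_lane([1.5, -2.0, 3.25, 4.0], 2))
```

Exactly one mask is all ones, so the OR of the four masked values is the selected sample and the other three contribute zero bits.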

Exo · Post by **Exo** » Sun Oct 19, 2014 10:00 pm

KG_is_back wrote:it does not work because of the movaps xmm0,[eax] but replacing that with array should fix it.

Yes movaps xmm0,[eax]; is the first thing I tried. Shame really. Should it work?

I was going to ask you guys: are there any opcodes you really want/need? If you can give clear examples of the benefits of certain opcodes, I could get on to Malc to add them (I'm usually quite good at getting him to add little things if I give him a clear example and make it simple for him).

Maybe topic for another thread?
