**SIMD instructions and vectorization**

Vectorization is a compiler optimization that unrolls a loop and replaces the scalar operations in its body with SIMD instructions. Each SIMD (Single Instruction, Multiple Data) instruction operates on several data elements at once, so the loop needs fewer iterations to process the same data. With auto-vectorization, the compiler identifies suitable loops and vectorizes them on its own. AArch64 has 32 vector registers, each 128 bits wide, named V0 to V31, and SIMD instructions operate on them. You can refer to the ARM manual for more information about SIMD instructions and vector registers.
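
To make the idea concrete, here is a minimal sketch (not part of the lab) that contrasts a scalar loop with one that processes four ints per iteration. It assumes GCC or Clang, because the `vector_size` attribute is a compiler extension rather than standard C. The names `v4si`, `add_scalar`, and `add_vec4` are my own and only serve as illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four 32-bit ints packed into one 128-bit value (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Scalar version: one addition per loop iteration. */
void add_scalar(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Vector version: four additions per iteration.
   Assumption for this sketch: n is a multiple of 4. */
void add_vec4(const int *a, const int *b, int *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        v4si va, vb, vc;
        memcpy(&va, a + i, sizeof va);   /* load 4 ints from a */
        memcpy(&vb, b + i, sizeof vb);   /* load 4 ints from b */
        vc = va + vb;                    /* one lane-wise add covers 4 elements */
        memcpy(out + i, &vc, sizeof vc); /* store 4 results */
    }
}
```

Both functions produce the same results. The second one simply does the work in a quarter of the iterations, which is roughly what an auto-vectorizing compiler tries to arrange for you.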

**Writing vectorizable code and enabling auto-vectorization**

For this lab, I need to write a program that fills two 1000-element integer arrays with random numbers between -1000 and 1000, adds the two arrays element by element into a third array, then calculates the sum of all elements in the third array and prints the result. Here is my program that accomplishes these tasks without considering vectorization:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define RANDNUM 1000
int main(void)
{
    // Declare variables
    int array1[RANDNUM], array2[RANDNUM], array3[RANDNUM];
    int i, minNum = -1000, maxNum = 1000, sum = 0;
    // Randomize seed
    srand(time(NULL));
    for (i = 0; i < RANDNUM; i++) {
        // Store random numbers in two arrays
        array1[i] = minNum + rand() % (maxNum + 1 - minNum);
        array2[i] = minNum + rand() % (maxNum + 1 - minNum);
        // Sum array elements into third array
        array3[i] = array1[i] + array2[i];
        // Sum of third array elements
        sum += array3[i];
    }
    // Display sum of third array elements
    printf("Sum of all elements in the third array is: %d\n", sum);
    return 0;
}

I use the command “gcc -O0 lab5.c -o lab5” to compile my program with no optimization using the -O0 option. Here is the disassembly output for the section <main> using the “objdump -d” command:

0000000000400684 <main>:
400684: d285e010 mov x16, #0x2f00 // #12032
400688: cb3063ff sub sp, sp, x16
40068c: a9007bfd stp x29, x30, [sp]
400690: 910003fd mov x29, sp
400694: 12807ce0 mov w0, #0xfffffc18 // #-1000
400698: b92ef7a0 str w0, [x29,#12020]
40069c: 52807d00 mov w0, #0x3e8 // #1000
4006a0: b92ef3a0 str w0, [x29,#12016]
4006a4: b92efbbf str wzr, [x29,#12024]
4006a8: d2800000 mov x0, #0x0 // #0
4006ac: 97ffff99 bl 400510 <time@plt>
4006b0: 97ffffac bl 400560 <srand@plt>
4006b4: b92effbf str wzr, [x29,#12028]
4006b8: 14000038 b 400798 <main+0x114>
4006bc: 97ffff9d bl 400530 <rand@plt>
4006c0: 2a0003e1 mov w1, w0
4006c4: b96ef3a0 ldr w0, [x29,#12016]
4006c8: 11000402 add w2, w0, #0x1
4006cc: b96ef7a0 ldr w0, [x29,#12020]
4006d0: 4b000040 sub w0, w2, w0
4006d4: 1ac00c22 sdiv w2, w1, w0
4006d8: 1b007c40 mul w0, w2, w0
4006dc: 4b000021 sub w1, w1, w0
4006e0: b96ef7a0 ldr w0, [x29,#12020]
4006e4: 0b000022 add w2, w1, w0
4006e8: b9aeffa0 ldrsw x0, [x29,#12028]
4006ec: d37ef400 lsl x0, x0, #2
4006f0: 914007a1 add x1, x29, #0x1, lsl #12
4006f4: 913d4021 add x1, x1, #0xf50
4006f8: b8206822 str w2, [x1,x0]
4006fc: 97ffff8d bl 400530 <rand@plt>
400700: 2a0003e1 mov w1, w0
400704: b96ef3a0 ldr w0, [x29,#12016]
400708: 11000402 add w2, w0, #0x1
40070c: b96ef7a0 ldr w0, [x29,#12020]
400710: 4b000040 sub w0, w2, w0
400714: 1ac00c22 sdiv w2, w1, w0
400718: 1b007c40 mul w0, w2, w0
40071c: 4b000021 sub w1, w1, w0
400720: b96ef7a0 ldr w0, [x29,#12020]
400724: 0b000022 add w2, w1, w0
400728: b9aeffa0 ldrsw x0, [x29,#12028]
40072c: d37ef400 lsl x0, x0, #2
400730: 913ec3a1 add x1, x29, #0xfb0
400734: b8206822 str w2, [x1,x0]
400738: b9aeffa0 ldrsw x0, [x29,#12028]
40073c: d37ef400 lsl x0, x0, #2
400740: 914007a1 add x1, x29, #0x1, lsl #12
400744: 913d4021 add x1, x1, #0xf50
400748: b8606821 ldr w1, [x1,x0]
40074c: b9aeffa0 ldrsw x0, [x29,#12028]
400750: d37ef400 lsl x0, x0, #2
400754: 913ec3a2 add x2, x29, #0xfb0
400758: b8606840 ldr w0, [x2,x0]
40075c: 0b000022 add w2, w1, w0
400760: b9aeffa0 ldrsw x0, [x29,#12028]
400764: d37ef400 lsl x0, x0, #2
400768: 910043a1 add x1, x29, #0x10
40076c: b8206822 str w2, [x1,x0]
400770: b9aeffa0 ldrsw x0, [x29,#12028]
400774: d37ef400 lsl x0, x0, #2
400778: 910043a1 add x1, x29, #0x10
40077c: b8606820 ldr w0, [x1,x0]
400780: b96efba1 ldr w1, [x29,#12024]
400784: 0b000020 add w0, w1, w0
400788: b92efba0 str w0, [x29,#12024]
40078c: b96effa0 ldr w0, [x29,#12028]
400790: 11000400 add w0, w0, #0x1
400794: b92effa0 str w0, [x29,#12028]
400798: b96effa0 ldr w0, [x29,#12028]
40079c: 710f9c1f cmp w0, #0x3e7
4007a0: 54fff8ed b.le 4006bc <main+0x38>
4007a4: 90000000 adrp x0, 400000 <_init-0x4d8>
4007a8: 91220000 add x0, x0, #0x880
4007ac: b96efba1 ldr w1, [x29,#12024]
4007b0: 97ffff70 bl 400570 <printf@plt>
4007b4: 52800000 mov w0, #0x0 // #0
4007b8: a9407bfd ldp x29, x30, [sp]
4007bc: d285e010 mov x16, #0x2f00 // #12032
4007c0: 8b3063ff add sp, sp, x16
4007c4: d65f03c0 ret

The disassembly output above contains 81 lines of instructions.

Now, I use the command “gcc -O3 lab5.c -o lab5a” to compile my program with the -O3 option, which enables aggressive optimization, including auto-vectorization. Here is the disassembly output for the section <main>:

0000000000400580 <main>:
400580: a9bc7bfd stp x29, x30, [sp,#-64]!
400584: d2800000 mov x0, #0x0 // #0
400588: 910003fd mov x29, sp
40058c: a9025bf5 stp x21, x22, [sp,#32]
400590: 529a9c75 mov w21, #0xd4e3 // #54499
400594: a90153f3 stp x19, x20, [sp,#16]
400598: 72a83015 movk w21, #0x4180, lsl #16
40059c: f9001bf7 str x23, [sp,#48]
4005a0: 52807d13 mov w19, #0x3e8 // #1000
4005a4: 5280fa34 mov w20, #0x7d1 // #2001
4005a8: 52800017 mov w23, #0x0 // #0
4005ac: 97ffffd9 bl 400510 <time@plt>
4005b0: 97ffffec bl 400560 <srand@plt>
4005b4: 97ffffdf bl 400530 <rand@plt>
4005b8: 2a0003f6 mov w22, w0
4005bc: 97ffffdd bl 400530 <rand@plt>
4005c0: 9b357c03 smull x3, w0, w21
4005c4: 71000673 subs w19, w19, #0x1
4005c8: 9b357ec2 smull x2, w22, w21
4005cc: 9369fc63 asr x3, x3, #41
4005d0: 4b807c63 sub w3, w3, w0, asr #31
4005d4: 9369fc42 asr x2, x2, #41
4005d8: 4b967c42 sub w2, w2, w22, asr #31
4005dc: 1b148060 msub w0, w3, w20, w0
4005e0: 1b14d842 msub w2, w2, w20, w22
4005e4: 0b000040 add w0, w2, w0
4005e8: 511f4000 sub w0, w0, #0x7d0
4005ec: 0b0002f7 add w23, w23, w0
4005f0: 54fffe21 b.ne 4005b4 <main+0x34>
4005f4: 2a1703e1 mov w1, w23
4005f8: 90000000 adrp x0, 400000 <_init-0x4d8>
4005fc: 911f8000 add x0, x0, #0x7e0
400600: 97ffffdc bl 400570 <printf@plt>
400604: 52800000 mov w0, #0x0 // #0
400608: f9401bf7 ldr x23, [sp,#48]
40060c: a94153f3 ldp x19, x20, [sp,#16]
400610: a9425bf5 ldp x21, x22, [sp,#32]
400614: a8c47bfd ldp x29, x30, [sp],#64
400618: d65f03c0 ret
40061c: 00000000 .inst 0x00000000 ; undefined

The disassembly output above contains 40 lines of instructions, about half as many as in the first case, which indicates that optimization has occurred. Auto-vectorization is enabled, but the disassembly output does not contain SIMD instructions, which means that the code is not vectorized. In fact, the compiler has removed the three arrays altogether: the stack frame is only 64 bytes, and the loop adds each pair of random numbers directly into the running sum held in w23.

I need to change my code in order for it to become vectorizable. Instead of using one for loop, I will divide it into three for loops. The first loop stores random numbers into the two arrays. The second loop sums these two arrays element-by-element to a third array. The third loop calculates the sum of all of the elements in the third array. Here is my program with vectorizable code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define RANDNUM 1000
int main(void)
{
    // Declare variables
    int array1[RANDNUM], array2[RANDNUM], array3[RANDNUM];
    int i, minNum = -1000, maxNum = 1000, sum = 0;
    // Randomize seed
    srand(time(NULL));
    // Store random numbers in two arrays
    for (i = 0; i < RANDNUM; i++) {
        array1[i] = minNum + rand() % (maxNum + 1 - minNum);
        array2[i] = minNum + rand() % (maxNum + 1 - minNum);
    }
    // Sum array elements into third array
    for (i = 0; i < RANDNUM; i++) {
        array3[i] = array1[i] + array2[i];
    }
    // Sum of third array elements
    for (i = 0; i < RANDNUM; i++) {
        sum += array3[i];
    }
    // Display sum of third array elements
    printf("Sum of all elements in the third array is: %d\n", sum);
    return 0;
}

I use the command “gcc -O0 lab5b.c -o lab5b” to compile my program with no optimization using the -O0 option. Here is the disassembly output for the section <main>:

0000000000400684 <main>:
400684: d285e010 mov x16, #0x2f00 // #12032
400688: cb3063ff sub sp, sp, x16
40068c: a9007bfd stp x29, x30, [sp]
400690: 910003fd mov x29, sp
400694: 12807ce0 mov w0, #0xfffffc18 // #-1000
400698: b92ef7a0 str w0, [x29,#12020]
40069c: 52807d00 mov w0, #0x3e8 // #1000
4006a0: b92ef3a0 str w0, [x29,#12016]
4006a4: b92efbbf str wzr, [x29,#12024]
4006a8: d2800000 mov x0, #0x0 // #0
4006ac: 97ffff99 bl 400510 <time@plt>
4006b0: 97ffffac bl 400560 <srand@plt>
4006b4: b92effbf str wzr, [x29,#12028]
4006b8: 14000023 b 400744 <main+0xc0>
4006bc: 97ffff9d bl 400530 <rand@plt>
4006c0: 2a0003e1 mov w1, w0
4006c4: b96ef3a0 ldr w0, [x29,#12016]
4006c8: 11000402 add w2, w0, #0x1
4006cc: b96ef7a0 ldr w0, [x29,#12020]
4006d0: 4b000040 sub w0, w2, w0
4006d4: 1ac00c22 sdiv w2, w1, w0
4006d8: 1b007c40 mul w0, w2, w0
4006dc: 4b000021 sub w1, w1, w0
4006e0: b96ef7a0 ldr w0, [x29,#12020]
4006e4: 0b000022 add w2, w1, w0
4006e8: b9aeffa0 ldrsw x0, [x29,#12028]
4006ec: d37ef400 lsl x0, x0, #2
4006f0: 914007a1 add x1, x29, #0x1, lsl #12
4006f4: 913d4021 add x1, x1, #0xf50
4006f8: b8206822 str w2, [x1,x0]
4006fc: 97ffff8d bl 400530 <rand@plt>
400700: 2a0003e1 mov w1, w0
400704: b96ef3a0 ldr w0, [x29,#12016]
400708: 11000402 add w2, w0, #0x1
40070c: b96ef7a0 ldr w0, [x29,#12020]
400710: 4b000040 sub w0, w2, w0
400714: 1ac00c22 sdiv w2, w1, w0
400718: 1b007c40 mul w0, w2, w0
40071c: 4b000021 sub w1, w1, w0
400720: b96ef7a0 ldr w0, [x29,#12020]
400724: 0b000022 add w2, w1, w0
400728: b9aeffa0 ldrsw x0, [x29,#12028]
40072c: d37ef400 lsl x0, x0, #2
400730: 913ec3a1 add x1, x29, #0xfb0
400734: b8206822 str w2, [x1,x0]
400738: b96effa0 ldr w0, [x29,#12028]
40073c: 11000400 add w0, w0, #0x1
400740: b92effa0 str w0, [x29,#12028]
400744: b96effa0 ldr w0, [x29,#12028]
400748: 710f9c1f cmp w0, #0x3e7
40074c: 54fffb8d b.le 4006bc <main+0x38>
400750: b92effbf str wzr, [x29,#12028]
400754: 14000012 b 40079c <main+0x118>
400758: b9aeffa0 ldrsw x0, [x29,#12028]
40075c: d37ef400 lsl x0, x0, #2
400760: 914007a1 add x1, x29, #0x1, lsl #12
400764: 913d4021 add x1, x1, #0xf50
400768: b8606821 ldr w1, [x1,x0]
40076c: b9aeffa0 ldrsw x0, [x29,#12028]
400770: d37ef400 lsl x0, x0, #2
400774: 913ec3a2 add x2, x29, #0xfb0
400778: b8606840 ldr w0, [x2,x0]
40077c: 0b000022 add w2, w1, w0
400780: b9aeffa0 ldrsw x0, [x29,#12028]
400784: d37ef400 lsl x0, x0, #2
400788: 910043a1 add x1, x29, #0x10
40078c: b8206822 str w2, [x1,x0]
400790: b96effa0 ldr w0, [x29,#12028]
400794: 11000400 add w0, w0, #0x1
400798: b92effa0 str w0, [x29,#12028]
40079c: b96effa0 ldr w0, [x29,#12028]
4007a0: 710f9c1f cmp w0, #0x3e7
4007a4: 54fffdad b.le 400758 <main+0xd4>
4007a8: b92effbf str wzr, [x29,#12028]
4007ac: 1400000b b 4007d8 <main+0x154>
4007b0: b9aeffa0 ldrsw x0, [x29,#12028]
4007b4: d37ef400 lsl x0, x0, #2
4007b8: 910043a1 add x1, x29, #0x10
4007bc: b8606820 ldr w0, [x1,x0]
4007c0: b96efba1 ldr w1, [x29,#12024]
4007c4: 0b000020 add w0, w1, w0
4007c8: b92efba0 str w0, [x29,#12024]
4007cc: b96effa0 ldr w0, [x29,#12028]
4007d0: 11000400 add w0, w0, #0x1
4007d4: b92effa0 str w0, [x29,#12028]
4007d8: b96effa0 ldr w0, [x29,#12028]
4007dc: 710f9c1f cmp w0, #0x3e7
4007e0: 54fffe8d b.le 4007b0 <main+0x12c>
4007e4: 90000000 adrp x0, 400000 <_init-0x4d8>
4007e8: 91230000 add x0, x0, #0x8c0
4007ec: b96efba1 ldr w1, [x29,#12024]
4007f0: 97ffff60 bl 400570 <printf@plt>
4007f4: 52800000 mov w0, #0x0 // #0
4007f8: a9407bfd ldp x29, x30, [sp]
4007fc: d285e010 mov x16, #0x2f00 // #12032
400800: 8b3063ff add sp, sp, x16
400804: d65f03c0 ret

The disassembly output above contains 97 lines of instructions. That is more than in the first case with one loop, as expected, since there are now three loops, each with its own counter and branch. Also as expected, the disassembly output does not contain SIMD instructions, because -O0 does not enable auto-vectorization.

Now, I use the command “gcc -O3 lab5b.c -o lab5c” to compile my program with a lot of optimization using the -O3 option. Here is the disassembly output with my bolded comments for the section <main>:

0000000000400580 <main>:
**// main() function**
400580: d285e410 mov x16, #0x2f20 // #12064
400584: cb3063ff sub sp, sp, x16 **// stack pointer - x16**
400588: d2800000 mov x0, #0x0 // #0
40058c: a9007bfd stp x29, x30, [sp] **// store x29 and x30 to stack pointer address**
400590: 910003fd mov x29, sp **// move stack pointer to x29**
400594: a90153f3 stp x19, x20, [sp,#16] **// store x19 and x20 to stack pointer address with offset**
400598: 529a9c74 mov w20, #0xd4e3 // #54499
40059c: a9025bf5 stp x21, x22, [sp,#32] **// store x21 and x22 to stack pointer address with offset**
4005a0: 72a83014 movk w20, #0x4180, lsl #16 **// set the upper 16 bits of w20 to 0x4180, keeping the lower 16 bits**
4005a4: f9001bf7 str x23, [sp,#48] **// store x23 to stack pointer address with offset**
4005a8: 910103b6 add x22, x29, #0x40 **// x29 + 64 and store in x22**
4005ac: 913f83b5 add x21, x29, #0xfe0 **// x29 + 4064 and store in x21**
4005b0: 5280fa33 mov w19, #0x7d1 // #2001
4005b4: d2800017 mov x23, #0x0 // #0
4005b8: 97ffffd6 bl 400510 <time@plt> **// call time subroutine**
4005bc: 97ffffe9 bl 400560 <srand@plt> **// call srand subroutine**
**// first loop**
**// array1[i] = minNum + rand() % (maxNum + 1 - minNum)**
4005c0: 97ffffdc bl 400530 <rand@plt> **// call rand subroutine**
4005c4: 9b347c01 smull x1, w0, w20 **// w0 * w20 and store in x1**
4005c8: 9369fc21 asr x1, x1, #41 **// shift x1 value right by 41 bits**
4005cc: 4b807c21 sub w1, w1, w0, asr #31 **// subtract shifted register**
4005d0: 1b138020 msub w0, w1, w19, w0 **// multiply and subtract**
4005d4: 510fa000 sub w0, w0, #0x3e8 **// subtract**
4005d8: b8376ac0 str w0, [x22,x23] **// store w0 to an address**
**// array2[i] = minNum + rand() % (maxNum + 1 - minNum)**
4005dc: 97ffffd5 bl 400530 <rand@plt> **// call rand subroutine**
4005e0: 9b347c01 smull x1, w0, w20 **// w0 * w20 and store in x1**
4005e4: 9369fc21 asr x1, x1, #41 **// shift x1 value right by 41 bits**
4005e8: 4b807c21 sub w1, w1, w0, asr #31 **// subtract shifted register**
4005ec: 1b138020 msub w0, w1, w19, w0 **// multiply and subtract**
4005f0: 510fa000 sub w0, w0, #0x3e8 **// subtract**
4005f4: b8376aa0 str w0, [x21,x23] **// store w0 to an address**
**// loop if i < RANDNUM**
4005f8: 910012f7 add x23, x23, #0x4 **// x23 + 4 and store in x23**
4005fc: f13e82ff cmp x23, #0xfa0 **// test if x23 = 4000**
400600: 54fffe01 b.ne 4005c0 <main+0x40> **// repeat first loop if x23 not equal 4000**
400604: d283f002 mov x2, #0x1f80 // #8064
400608: 8b0203a1 add x1, x29, x2 **// x29 + x2 and store in x1**
40060c: d2800000 mov x0, #0x0 // #0
**// second loop**
**// array3[i] = array1[i] + array2[i];**
400610: 3ce06ac0 ldr q0, [x22,x0] **// load 4 elements of array1 into q0**
400614: 3ce06aa1 ldr q1, [x21,x0] **// load 4 elements of array2 into q1**
400618: 4ea18400 add v0.4s, v0.4s, v1.4s **// SIMD vector instruction: v0.4s + v1.4s and store in v0.4s**
40061c: 3ca06820 str q0, [x1,x0] **// store the 4 sums in q0 to array3**
**// loop if i < RANDNUM**
400620: 91004000 add x0, x0, #0x10 **// x0 + 16 and store in x0**
400624: f13e801f cmp x0, #0xfa0 **// test if x0 = 4000**
400628: 54ffff41 b.ne 400610 <main+0x90> **// repeat second loop if x0 not equal 4000**
40062c: 4f000400 movi v0.4s, #0x0 **// SIMD vector instruction: move immediate (vector)**
400630: aa0103e0 mov x0, x1 **// move x1 to x0**
400634: d285e401 mov x1, #0x2f20 // #12064
400638: 8b0103a1 add x1, x29, x1 **// x29 + x1 and store in x1**
**// third loop**
**// sum += array3[i];**
40063c: 3cc10401 ldr q1, [x0],#16 **// load 4 elements of array3 into q1, then add 16 to x0**
400640: 4ea18400 add v0.4s, v0.4s, v1.4s **// SIMD vector instruction: v0.4s + v1.4s and store in v0.4s**
400644: eb01001f cmp x0, x1 **// test if x0 = x1**
400648: 54ffffa1 b.ne 40063c <main+0xbc> **// repeat third loop if x0 not equal x1**
40064c: 4eb1b800 addv s0, v0.4s **// SIMD vector instruction: add across vector**
400650: 90000000 adrp x0, 400000 <_init-0x4d8> **// store address in x0**
400654: 91210000 add x0, x0, #0x840 **// x0 + 2112 and store in x0**
400658: 0e043c01 mov w1, v0.s[0] **// SIMD vector instruction: move v0.s[0] to w1**
40065c: 97ffffc5 bl 400570 <printf@plt> **// call printf subroutine**
400660: f9401bf7 ldr x23, [sp,#48] **// load register**
400664: a94153f3 ldp x19, x20, [sp,#16] **// load pair of registers**
400668: 52800000 mov w0, #0x0 // #0
40066c: a9425bf5 ldp x21, x22, [sp,#32] **// load pair of registers**
400670: d285e410 mov x16, #0x2f20 // #12064
400674: a9407bfd ldp x29, x30, [sp] **// load pair of registers**
400678: 8b3063ff add sp, sp, x16 **// stack pointer + x16 and store in stack pointer**
40067c: d65f03c0 ret **// return from subroutine**

The disassembly output above contains 64 lines of instructions, which is fewer than in the case with no optimization. This time the disassembly output contains SIMD instructions, which means that the code is vectorized. Specifically, the second and third loops are vectorized: both contain SIMD instructions that operate on vector registers. For example, the SIMD instruction “add v0.4s, v0.4s, v1.4s” performs 4 additions in a single instruction. In the register name “v0.4s”, “v0” is vector register 0, “4” is the number of data elements or lanes, and “s” is the data element size of 32 bits. One instruction uses “v0.s[0]”, which refers to a single element of a vector register, where “[0]” is the element index. Some SIMD instructions share their names with scalar instructions. For example, “add” and “mov” become SIMD instructions when their operands are vector registers.
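
The vectorized third loop can be mimicked in plain C. The sketch below is my own illustration of what the “movi”, “add”, and “addv” instructions do together and is not compiler output. The function names are hypothetical, and it assumes the element count is a multiple of 4, as it is for 1000 elements.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: the original scalar reduction, sum += array3[i]. */
int sum_scalar(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Sum n ints the way the vectorized loop does.
   Assumption for this sketch: n is a multiple of 4. */
int sum_four_lanes(const int *a, size_t n)
{
    int lane[4] = {0, 0, 0, 0};  /* like "movi v0.4s, #0x0" */
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* like "add v0.4s, v0.4s, v1.4s":
           each lane accumulates its own partial sum */
        for (int k = 0; k < 4; k++)
            lane[k] += a[i + k];
    }
    /* like "addv s0, v0.4s": add across the four lanes */
    return lane[0] + lane[1] + lane[2] + lane[3];
}
```

The loop keeps four independent partial sums and only combines them once at the end, which is why a single “addv” appears after the loop rather than inside it.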

There are a few things to consider when you want to write vectorizable loops. Simple loops are more likely to be vectorizable than complex loops. A loop will generally not be vectorized if it contains complex calculations or calls to functions such as rand(), as the first loop in my program does. The same is true if data dependencies exist within the loop, that is, when one iteration needs a value that an earlier iteration produced or overwrote. These factors explain why my first program, which has only one big loop, could not be vectorized. Writing vectorizable code is not easy because different compilers handle vectorization differently and we are unfamiliar with that process. It will probably take at least a couple of attempts at modifying the code to get it to work. There are some general guidelines that we can follow, but they may not always be helpful. On the other hand, it is not difficult to identify vectorized code in the disassembly output.
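
As a small illustration of the data dependency point, here is a sketch with two loops of my own (hypothetical names, not from the lab). In the first, every iteration is independent of the others. In the second, each iteration reads the value that the previous iteration just wrote. Whether a particular compiler vectorizes either one depends on its version and options, so treat this as an example of the pattern rather than a guarantee.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on a[i] and b[i],
   so several elements can be processed at the same time. */
void add_arrays(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Loop-carried dependency: a[i] needs the updated a[i - 1],
   so iteration i cannot start before iteration i - 1 has finished. */
void prefix_sum(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

The first function has the same shape as the second loop in my program, which the compiler did vectorize at -O3.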