In this post I will be looking at writing a simple loop in assembly, for the GNU assembler (gas) on x86_64, that prints the numbers 1 to 30, formatted as follows:
    Loop: 1
    Loop: 2
    Loop: 3
    ... (Lines removed to save space) ...
    Loop: 28
    Loop: 29
    Loop: 30
This started out as a group task in class and I expanded on it later on my own. I have two different solutions to the problem; one is much shorter but was more challenging to write. The first solution can be found here, loopnozero.docx. If you look at the comments you’ll notice that I decided to print each part (‘Loop:’, number, line break) separately. This was not the most efficient or smallest code I could have written. I was constrained by the time limit set by the end of class, so I chose this method because it was faster and didn’t involve any instructions not included in the lab guidelines.
The next part of the lab was to write the same program but on the AArch64 system. Because I was going to have to rewrite it I decided to modify my program on the x86_64 platform to be shorter and more efficient so that it would be less work to port over. The short version (47 lines instead of the original 82), loopshort.docx, is the one that I will be looking at in detail.
Assembly on x86_64
The first section of the loopShort.s file tells the assembler that this code goes in the .text section and declares _start, the program's entry point, which is roughly the equivalent of int main() in C. The first line after the _start label just moves the number 1 into the register that I used for the loop index.
    .text
    .globl _start
    _start:
        mov $1,%r15 /* loop index */
The next block of code starts with the label that marks the top of the main loop.
    loop:
        /* Divide number */
        movq %r15,%rax /* copy the loop index to register for division */
        movq $10,%r12 /* write 10 to register */
        movq $0,%rdx /* clear rdx (high half of dividend, later the remainder) */
        div %r12 /* divide by 10 */
        movq %rax,%r11 /* move result quotient to register 11 */
        movq %rdx,%r13 /* move remainder to number register 13 */
The first thing that happens in the loop is the division of the loop index by 10. There are six separate steps involved in doing this. Three registers are used to do division: rax holds the low 64 bits of the number to be divided, rdx holds the high 64 bits going in and receives the remainder coming out, and any other register can be used to hold the divisor (r12 in this case). The third move statement, which overwrites rdx with 0, was added because I found I was getting leftover information in my results. The reason is that div treats rdx:rax as a single 128-bit dividend, so a remainder left behind by the previous iteration would become part of the next number to be divided. Clearing rdx before every divide avoids that. The div instruction tells the processor to divide rdx:rax by r12 and store the quotient in rax and the remainder in rdx.
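For comparison, the same quotient-and-remainder split can be sketched in C. This is only an illustration and not part of the lab code; the names `digits` and `split_digits` are made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper, not from the lab: holds what div leaves behind. */
struct digits {
    uint64_t tens;  /* quotient, like rax after div */
    uint64_t ones;  /* remainder, like rdx after div */
};

/* Split a loop index into its tens and ones, as the div block does. */
static struct digits split_digits(uint64_t index)
{
    struct digits d;
    d.tens = index / 10;
    d.ones = index % 10;
    return d;
}
```

For an index of 27 this gives a tens value of 2 and a ones value of 7. C has no equivalent of clearing rdx because the compiler takes care of setting up the dividend.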
The next step is to move the results into the message that will eventually be printed. The message string is stored in the variable msg, which I have created at the end of the file:
    .section .data
    msg: .ascii "Loop:   \n"
    msgLen = . - msg
I have left three spaces between the colon and the new line character to allow for one space for formatting and two digits. Before I can insert the numbers into the string I have to convert them to ASCII characters. To do this I add 48 to them because 48 is the ASCII code for ‘0’. Then I can copy them into the message using a single-byte move. The ‘b’ at the end of ‘%r13b’ selects the lowest byte of the register, and the ‘+7’ means store it at offset 7 from the start of ‘msg’ (the eighth byte, since offsets start at 0). To avoid displaying the leading zero for numbers less than ten I used cmp to compare the value of the quotient to zero. The compare instruction sets flags in the processor so that when the jump if equal (je) instruction is executed in the next step it knows if the jump should be taken.
        /* Add ones to message */
        add $48,%r13 /* convert to ascii by adding 48 */
        mov %r13b,msg+7 /* copy one byte into msg at location 7 */
        cmp $0,%r11 /* check if quotient is 0 */
        je skipTens /* if it is jump to skipTens label */
        /* Add tens to message */
        add $48,%r11 /* convert to ascii by adding 48 */
        mov %r11b,msg+6 /* copy one byte into msg at location 6 */
    skipTens:
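The digit patching above might be easier to follow in C. The sketch below assumes the same nine-byte message layout (‘Loop:’, three spaces, newline); `patch_digits` is a hypothetical name used only for this example.

```c
#include <stddef.h>

/* Assumed layout, matching the post: "Loop:" + three spaces + newline. */
static char msg[] = "Loop:   \n";

/* Patch the digits of n (1..99) into msg in place. */
static void patch_digits(unsigned n)
{
    unsigned tens = n / 10;
    unsigned ones = n % 10;

    msg[7] = (char)('0' + ones);      /* '0' is ASCII 48 */
    if (tens != 0)                    /* same test as cmp / je skipTens */
        msg[6] = (char)('0' + tens);  /* skipped for 1..9, so no leading zero */
}
```

As in the assembly, a tens digit stays in the buffer until it is overwritten. That is fine for a counter that only goes up, but it would leave a stale digit behind if the numbers ever dropped back below ten.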
The je line tells it to jump to the label ‘skipTens’ if the comparison on the previous line found the two values equal. This jumps over the section where the quotient is added into byte 6 of the message. The third-to-last section is where the message gets written to stdout with a system call. There are four registers that must be used for this. The address of the message goes into rsi, the file descriptor (stdout = 1) into rdi, rax is the number of the syscall we want (1 for sys_write), and rdx holds the length of the message.
        /* print the message */
        movq $msg,%rsi /* pass the address of msg to rsi */
        movq $1,%rdi /* file descriptor stdout */
        movq $1,%rax /* syscall sys_write */
        movq $msgLen,%rdx /* set length to msgLen */
        syscall
        /* loop checking */
        inc %r15 /* increment index */
        cmp $31,%r15 /* see if we're done */
        jne loop /* loop if we're not */
        mov $0,%rdi /* exit status */
        mov $60,%rax /* syscall sys_exit */
        syscall
The syscall instruction actually makes the call and writes the line. The final bit of code increments (inc) the loop index, compares it to 31, and if it’s not there yet sends it back up to the top. The last three lines are a standard exit syscall.
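As a rough cross-check of the whole loop, here is how the same steps might look in C when calling write() directly instead of using printf. This is only a sketch and not code from the lab: `print_loop` is a hypothetical name, and it assumes a POSIX system and values below 100.

```c
#include <stddef.h>
#include <unistd.h>

/* Sketch: write "Loop: N" lines for first..last (values below 100) to fd.
 * Returns the total number of bytes written, or -1 on a failed or short write. */
static long print_loop(int fd, unsigned first, unsigned last)
{
    char msg[] = "Loop:   \n";          /* same 9-byte template as msg */
    const size_t len = sizeof msg - 1;  /* like msgLen: excludes the NUL */
    long total = 0;

    for (unsigned i = first; i <= last; i++) {
        msg[7] = (char)('0' + i % 10);        /* ones digit */
        if (i / 10 != 0)
            msg[6] = (char)('0' + i / 10);    /* tens digit, skipped below 10 */
        if (write(fd, msg, len) != (ssize_t)len)
            return -1;
        total += (long)len;
    }
    return total;
}
```

Calling `print_loop(1, 1, 30)` writes the thirty lines to stdout. The arguments to write() line up with the registers used for the syscall: the file descriptor (rdi), the buffer address (rsi), and the length (rdx).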
Differences between x86_64 and AArch64
When converting my program from x86_64 to AArch64 there were a number of significant changes, some easier to deal with than others. One of the obvious ones that was really easy to fix is that on AArch64 none of the registers, numbers or variables are preceded with ‘%’ or ‘$’. The other obvious but slightly annoying one was that the from and to locations are switched in every instruction. On x86_64 (in the AT&T syntax that gas uses by default) mov %r12,%r15 means move r12 into r15, but on AArch64 mov x12,x15 means move x15 into x12. To me the x86 version is the more natural way to read it. As an example you can see the difference in the first line of each program:
AArch64:
    _start:
        mov x15,1 /* set loop index (register 15) to 1 */
x86_64:
    _start:
        mov $1,%r15 /* loop index */
The first major difference I came across was trying to divide the loop index.
AArch64:
        mov x12,10 /* put number 10 in register 12 */
        udiv x11,x15,x12 /* quotient = loop index / divisor eg. r11 = r15 / r12 */
        msub x13,x12,x11,x15 /* remainder = loop index - (divisor * quotient) */
        /* r13 = r15 - (r12 * r11) */
x86_64:
        movq %r15,%rax /* copy the loop index to register for division */
        movq $10,%r12 /* write 10 to register */
        movq $0,%rdx /* clear rdx (high half of dividend, later the remainder) */
        div %r12 /* divide by 10 */
        movq %rax,%r11 /* move result quotient to register 11 */
        movq %rdx,%r13 /* move remainder to number register 13 */
Although the AArch64 code is 2 lines shorter I found it much more awkward to work with. The udiv instruction (unsigned division) does not provide a remainder. The remainder must be calculated in an additional step using the formula ‘remainder = loop index – (quotient * divisor)’ with the msub (multiply-subtract) instruction.
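To make the remainder step concrete, here is a small C sketch of what the udiv and msub pair computes. The function name `remainder_via_msub` is made up for this example; the comments map each line to the AArch64 instruction it stands in for.

```c
#include <stdint.h>

/* Mirrors udiv followed by msub: remainder = dividend - quotient * divisor. */
static uint64_t remainder_via_msub(uint64_t dividend, uint64_t divisor)
{
    uint64_t quotient = dividend / divisor;   /* udiv x11,x15,x12 */
    return dividend - quotient * divisor;     /* msub x13,x12,x11,x15 */
}
```

For a loop index of 27 and a divisor of 10 the quotient is 2, so the result is 27 - (2 * 10) = 7, the same value the % operator would give.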
When it comes to placing the number into the message the AArch64 system requires the additional step of moving the address of msg into a register using the adr instruction before it can insert the number. I do like the way that the add instruction on AArch64 allows you to specify the register for the result so you don’t always destroy the number you are adding to.
AArch64:
        add x13,x13,48 /* convert number to char: add 48 to remainder (r13) and store result in r13 */
        adr x14,msg /* load the address of msg into r14 */
        strb w13,[x14,7] /* put the remainder (r13) into msg (r14) */
        /* write one byte from r13 to the address in r14 plus 7 */
x86_64:
        add $48,%r13 /* convert to ascii by adding 48 */
        mov %r13b,msg+7 /* copy one byte into msg at location 7 */
Printing the message works the same way, but AArch64 uses numbered registers instead of named ones (the arguments go in x0 to x2 and the syscall number in x8), write is syscall 64 instead of 1, and the syscall is executed with svc 0, not syscall. AArch64 has no inc instruction, so to increment the loop count I just added 1 to it. The final difference I noticed was that AArch64 calls ‘jumps’ ‘branches’, so the instruction is bne instead of jne for branch not equal.
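The fact that the syscall number differs between the two platforms is easy to see from C, where the Linux headers supply the right value for whichever architecture you compile on. The sketch below assumes Linux with glibc; `raw_write` is a hypothetical wrapper name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_write: per-architecture syscall number */
#include <unistd.h>        /* syscall() */

/* Issue a raw write system call, letting the headers pick the number
 * (1 on x86_64, 64 on AArch64). */
static long raw_write(int fd, const char *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

A call such as `raw_write(1, "Loop:  1\n", 9)` should behave the same on both systems even though the number passed to the kernel is different, which is exactly the detail that has to be changed by hand in the assembly version.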
Debugging was challenging because I am unfamiliar with the language and there is a distinct lack of meaningful variable names, so it is easy to lose track of what is in each register. I found it very helpful to keep a text document open that I updated with the contents of each register when I started using a new one. The program didn’t require much debugging because each instruction was so simple it was easy to predict the results. The only situation that gave me pause was when unexpected characters were printed. Our professor pointed out fairly quickly that it was happening because I was relying on a length value stored in a register that was getting clobbered. His help saved us from having to debug that particular error.
Writing this program was an interesting task and did not take as long as I expected it to. That being said, it still took much longer and was much more work than writing the same program in C, which would have taken less than a minute and only 4-6 lines of code. The syntax for assembly on x86_64 felt more natural to read and write and took less effort to understand, but AArch64 had a number of features, like significantly more general-purpose registers (31 compared to 16) and selecting the output register for math instructions, that I think would be very handy for more complex applications.
I didn’t have to do much research for this post. The only resources I used are from the lab or in the resources section of these three pages:
Assembler Basics: Assembler_Basics
x86_64 Registers: X86_64_Register_and_Instruction_Quick_Start
AArch64 Registers: Aarch64_Register_and_Instruction_Quick_Start
The following is a zip file containing the source code for the various loop programs I wrote for this lab: assemblyLoopSource.zip
This lab had an optional challenge to write a program to print out the times tables up to 12. At the time of publishing this my program is incomplete: it uses nested loops to print the left side of the equation, but I have not added the functionality to do the actual multiplication. If I can find the time to finish it I will update this section.