Planet CDOT

April 22, 2016

Berwout de Vries Robles

Update on phase 3

Quick update:

I got a reply to my github pull request. Sadly it was not what I had hoped for. The package maintainer replied that they want to give users full control of their compiler optimizations and not force anything on them.

As there is, in theory, no downside to applying at least a base level of optimization, this argument seems a little strange to me.

The way I would do it is to apply a base level of optimization by default, so that users with no knowledge of compiler optimizations get the significant speedup without having to do anything. Advanced users could then opt out of the optimization if they wanted to. I realize that this is more work for advanced users. However, I think the situation where a user opts out of optimizations would come up very rarely, whereas the situation where a user should have applied some degree of optimization, but did not because they didn't know they should have, is far more common.

by Berwout de Vries Robles ( at April 22, 2016 05:58 AM

Andrei Topala

SPO600 Project: Phase Three and Conclusion

There is a saying attributed to Edison that goes something along the lines of “I have not failed–I have just found 10,000 ways that don’t work.”

The goal of the project was to optimize a selected open-source compression or checksum package and produce a meaningful improvement in performance. I chose lrzip, partly because it seemed interesting and partly because its creator is a physician, a career I had at length considered for myself back in high school before deciding that hospitals and life-and-death situations are scary. This might not have been the best choice (nor, I guess, the best reason to make a choice) because, well, lrzip looks pretty good. I could find no area of optimization that would amount to a substantive improvement. Whenever I thought I had come across an opportunity, it turned out not to be the case at all.

Optimization, I’ve come to realize, is pretty hard. Conversely, making a program run worse, even unintentionally, is very easy.

In phases one and two of this project I noted that lrzip crashed on AArch64 platforms when compressing a file with ZPAQ due to ZPAQ’s use of x86-specific JIT code. I thought to translate the x86 instructions into ARM instructions, so that JIT code could be used on both architectures, but that was too tall an order. I chose instead to focus on the C++ code that is used when JIT is unavailable, to try to optimize it and bring its runtime closer to that of the JIT version of the program. Also, I examined the LZMA code in the package (as LZMA is the default compression algorithm lrzip uses) and tried to optimize that as well.

Now, let me count the 10,000 ways.

The first order of business was to profile the program in order to see which functions took up the majority of the execution time. gprof was the tool for this. First I built the package with the -pg flag, which generates extra code to write profile information. This I did by setting it as one of the C, C++, and linker flags when calling the configure script:

[andrei@localhost lrzip-master]$ ./configure CFLAGS="-pg -O2" CXXFLAGS="-pg -O2" LDFLAGS="-pg"

Then I built the package using make (made the package?) and then ran lrzip on a 500MB text file.

[andrei@localhost lrzip-master]$ ./lrzip text
Output filename is: text.lrz
text - Compression Ratio: 3.937. Average Compression Speed:  2.041MB/s.
Total time: 00:04:04.53

This generated a gmon.out file in the directory, which lets us examine lrzip using the gprof tool. I used the following command to create an image from the profile information:

[andrei@localhost lrzip-master]$ gprof lrzip | gprof2dot | dot -Tpng > image.png


We can see from the graph that the program spent the vast majority of its time in the function GetMatchesSpec1, which is defined in the file lzma/C/LzFind.c. The graph tells us that the function was called 200 million times, so I figured that any marginal improvement I could make to the function would result in a significant gain in performance.
The function uses two Byte pointers to move along the stream of data read from the text file, and two UInt32 pointers (typedef’d as CLzRef) to keep track of matches from the data in a circular buffer.

CLzRef *ptr0 = son + (_cyclicBufferPos << 1) + 1;
CLzRef *ptr1 = son + (_cyclicBufferPos << 1);

Since the two pointers point to adjacent data locations I tried to define one in terms of the other (removing the “+1” part of ptr0’s definition then setting ptr1=ptr0++), but it had no effect, and I think the machine code generated would be the same regardless.

if (++len != lenLimit && pb[len] == cur[len])
          while (++len != lenLimit)
            if (pb[len] != cur[len])

This I rewrote as a single while loop, but as I said in phase two it had no (or a negative) effect on the executable. The objdump revealed that both this code and my edit resulted in identical instructions with only a slight change in layout.
I also rewrote this section of the code entirely to check multiple bytes at a time by XORing chunks of data from pb and cur, but even before fine-tuning the code to set len to the proper value after the first different chunk of data was found, I realized that the overhead of doing all this made the program significantly slower, so I abandoned that course of action. (For the text file, the loop only ran a few times with every function call, so checking very large chunks of data would be counter-productive).

After that, I tried to see if I could find something to optimize in LzmaEnc_CodeOneBlock, which is in lzma/C/LzmaEnc.c. I couldn’t, at least not beyond that failed experiment with memmove that I wrote about in phase two.

So, LZMA I couldn’t make better. I then went back to ZPAQ. Here’s the profile information for lrzip when it’s invoked with the -z flag after having been compiled with -DNOJIT to turn off the JIT code:


Both update0 and predict0 are structured as a for loop that uses a switch statement to check the value of an element of an array, which gets its value from an element in the block header array of a ZPAQL object, which represents ZPAQL instructions. I couldn’t find any way to optimize the functions without breaking the logic of the algorithm. The only thing I could do was replace some multiplication operations with bitshifts, but that didn’t affect the runtime (I think the compiler does that as well, when it can).
The execute function, which consists of a switch statement of over 200 cases, only takes up 3% of the execution time, and as much as I tried to optimize it I couldn’t get it to run more quickly by any measurable magnitude, so that didn’t work out either.

My attempts to edit the source code having been unsuccessful, I tried to see if any compiler options would make the program go faster, and, indeed, -Ofast (which enables -ffast-math) created an executable whose LZMA compression consistently outperformed the default -O2 executable by about 1% without introducing any errors into the compressed file. -ffast-math, though, is probably not something one would want in a compression tool, and there might be potential for mathematical errors somewhere in the code. Besides, the 1% difference, though I tested it quite a few times, might just have been a coincidence.

In the end, all I could do was add a small preprocessor condition in the libzpaq/libzpaq.h file to check for an x86 architecture and disable the JIT code if one is not found. I created a patch file, but I am not sure it's worth submitting, since 1) it is such an insignificant change and 2) it only checks for GCC and Visual Studio macros, so if the program is compiled on an x86 processor with a different compiler, the JIT code will be deactivated, which would have a significant negative impact on performance. Here is the patch file for, at least, posterity:

From 3976cdc2640adbe593a09bba010130fcf74ef809 Mon Sep 17 00:00:00 2001
From: Andrei Topala <>
Date: Thu, 21 Apr 2016 17:56:34 -0400
Subject: [PATCH] Added a preprocessor check for an x86 architecture in
 libzpaq.h as a requisite for enabling JIT code.

 libzpaq/libzpaq.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/libzpaq/libzpaq.h b/libzpaq/libzpaq.h
index 93387da..e2b5058 100644
--- a/libzpaq/libzpaq.h
+++ b/libzpaq/libzpaq.h
@@ -25,6 +25,11 @@ comprises the reference decoder for the ZPAQ level 2 standard.
 #ifndef LIBZPAQ_H
 #define LIBZPAQ_H

+// If we're not on an x86 processor, disable JIT
+#if !defined(i386) && !defined(_M_IX86) && !defined(__x86_64__) && !defined(_M_X64)
+#define NOJIT
+#endif
+
 #ifndef DEBUG
 #define NDEBUG 1

That concludes this chapter in our journey through the wide waters of software portability and optimization. SPO600 was certainly the most challenging and most interesting course I’ve taken at Seneca, and I am sure I learned very much. I do wish I’d had more time to blog about it, though. That is my only regret, and also the 10,000 failures.

by andrei600 at April 22, 2016 03:58 AM

Tom Ng

The End of an SPO600 Semester

Finally done with SPO600; and what a ride it was.

SPO600 was not what I expected in the beginning, but as I progressed through the semester, the course developed a charm of its own. The course really demystifies a lot of what happens in the background. It explains what is normally abstracted away or taken for granted by the programmer, or really, by most people.

The topics covered in SPO600 are very specialized. It really is hard to accurately say what this course is about, and “Software Portability and Optimization” is only the tip of the iceberg; it’s almost misleading about what the course has to teach you. The course covers a broad spectrum of topics, from compilers to architectures to optimizations. Sometimes it even covers things that may not have been course topics in the first place, such as useful Linux commands and programs.

This course is also unlike any other I’ve taken before. I’ve never taken a course where blogs and the Zenit wiki are used to communicate. There was also an emphasis on doing things and reporting the results, as well as a bit more leeway in how to approach a given problem. Other courses are more direct in what they have to teach and offer, but the approach used in SPO600 required you to think a bit more.

Unfortunately I was unable to take in everything that the course tried to teach, but I still feel I learned very much from it. I was initially curious only about a few things here and there relating to the topics of the course, and the course gave me a lot more than I bargained for. The course opens your eyes just enough that you realize you actually know absolutely nothing. Computing turned out to be a pretty complicated subject, and for the various topics SPO600 covered, it probably only scratched the surface. Scratching that surface, though, felt like gaining some deep knowledge of the subject, even if it was only the bare-bones basics. Regretfully, I was unable to find an optimization for my project, but I was still able to take away further knowledge about the make system and other compiler options from doing it. There really is no other course like SPO600, and even if I bit off a bit more than I could chew, I’m really glad I took this course.

by tng23spo600 at April 22, 2016 03:54 AM

Lab6: Auto-Vectorization of Loops

In this lab, I got to explore the auto-vectorization features of gcc. I started by creating a small program that creates two 1000-element int arrays filled with random numbers and a third array that stores the element-wise sums of the other two; the program prints the sum of all numbers in the third array before exiting. The code is shown below:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {

    int array1[1000], array2[1000], array3[1000];
    long int sum = 0;
    int i;

    /* Fill array 1 with random numbers */
    for (i = 0; i < 1000; i++) {
        array1[i] = rand();
    }

    /* Fill array 2 with random numbers */
    for (i = 0; i < 1000; i++) {
        array2[i] = rand();
    }

    /* Sum contents of arrays 1 and 2 into 3 */
    for (i = 0; i < 1000; i++) {
        array3[i] = array1[i] + array2[i];
    }

    /* Sum contents of array 3 into long int */
    for (i = 0; i < 1000; i++) {
        sum += array3[i];
    }

    printf("Sum is: %ld\n", sum);

    return 0;
}


I made two object dumps of this program to compare, by compiling with two different sets of compiler options. The first version was made with only -g to turn on debugging symbols: gcc -g arraysum.c -o arraysum.out. The second version was made with: gcc -g -O3 -ftree-vectorizer-verbose=1 arraysum.c -o arraysum.out. -O3 turns on more optimizations, including vectorization, while -ftree-vectorizer-verbose=1 reports which loops were and were not vectorized. The messages printed while attempting to vectorize were:

Analyzing loop at arraysum.c:29
arraysum.c:29: note: vect_recog_widen_sum_pattern: detected: patt_7 = _31 w+ sum_50;
Vectorizing loop at arraysum.c:29
arraysum.c:29: note: LOOP VECTORIZED.
Analyzing loop at arraysum.c:24
Vectorizing loop at arraysum.c:24
arraysum.c:24: note: LOOP VECTORIZED.
Analyzing loop at arraysum.c:19
Analyzing loop at arraysum.c:14
arraysum.c:5: note: vectorized 2 loops in function.

Of the four loops in the program, gcc was able to vectorize two of them: the loop where the third array is filled with the sums and the loop where the sum of all values in the third array is calculated. I initially used -ftree-vectorizer-verbose=2, but the messages were quite long, so I switched to verbose=1. The reason the first two loops were not vectorized is shown at verbose=2 but not at verbose=1. The likely issue is the calls to the rand() function, which the vectorizer cannot analyze. A snippet of the verbose=2 messages concerning the first two loops is below:

arraysum.c:19: note: not vectorized: loop contains function calls or data references that cannot be analyzed
arraysum.c:19: note: bad data references.
Analyzing loop at arraysum.c:14

arraysum.c:14: note: not vectorized: loop contains function calls or data references that cannot be analyzed
arraysum.c:14: note: bad data references.
arraysum.c:5: note: vectorized 2 loops in function.

The assembly for the vectorized loop that fills the third array is:

4005a8: 910083a3 add x3, x29, #0x20
4005ac: 8b000062 add x2, x3, x0
4005b0: 913f03a3 add x3, x29, #0xfc0
4005b4: 8b000061 add x1, x3, x0
4005b8: 4c407841 ld1 {v1.4s}, [x2]
4005bc: d283ec02 mov x2, #0x1f60 // #8032
4005c0: 4c407820 ld1 {v0.4s}, [x1]
4005c4: 8b1d0042 add x2, x2, x29
4005c8: 8b000041 add x1, x2, x0
4005cc: 4ea08420 add v0.4s, v1.4s, v0.4s
4005d0: 91004000 add x0, x0, #0x10
4005d4: 4c007820 st1 {v0.4s}, [x1]
4005d8: f13e801f cmp x0, #0xfa0
4005dc: 54fffe61 b.ne 4005a8 <main+0x58>

The assembly for the vectorized loop that sums up the elements of the third array is:

4005f0: 4cdf7800 ld1 {v0.4s}, [x0], #16
4005f4: 0f20a402 sxtl v2.2d, v0.2s
4005f8: 4ee18441 add v1.2d, v2.2d, v1.2d
4005fc: 4f20a400 sxtl2 v0.2d, v0.4s
400600: eb01001f cmp x0, x1
400604: 4ee18401 add v1.2d, v0.2d, v1.2d
400608: 54ffff41 b.ne 4005f0 <main+0xa0>

When comparing the object dump of the vectorized build with the non-vectorized one, the vectorized loops contained fewer instructions in the loop body. There were also instructions relating exclusively to vectorization, such as ld1, which loads elements from memory into a vector register. For example, in the first dump above, the instruction at 4005b8 is “ld1 {v1.4s}, [x2]”: v1 is the vector register, .4s means it is treated as four 32-bit lanes, and x2 holds the address to load from.

Vectorization could also be applied in the volume sampling lab, as it also contains many loops. The loops are used to fill up the lookup table, and vectorization should be able to speed up the process. The scale function called inside a loop might be the only issue, as it may be too complicated for gcc to analyze when determining whether the loop can be vectorized. Perhaps this can be alleviated by inlining it into the loop or by rewriting it more cleanly (I believe my implementation of the scale function could be less messy than the way I wrote it).

by tng23spo600 at April 22, 2016 03:50 AM

Nina Wip

Trying to get my changes upstream

As noted before, my optimization was to change the compiler flag from -O2 to -O3. This increased the speed by 11% on x86_64. During the testing phase of this optimization I changed the compiler flags in the Makefile itself. If I wanted this to get committed upstream, I'd have to find the place where the makefiles are generated.

This would normally be the configure file. However, I could not find where the flags were set for the makefiles, which I thought was very strange, because I am quite sure that when you configure and make the program, it compiles with the -O2 flag.

I used grep to find files where -O2 is used, and the only file it found was an instructions file describing how you can manually add -O2 while configuring, not as a default.
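One possible explanation (an assumption on my part, since it depends on md5deep's build system): with an autoconf-based package, the AC_PROG_CC macro itself defaults CFLAGS to "-g -O2" when the user supplies none, so the string never appears anywhere in the project tree. If that is the case here, the flag can be overridden without touching any file:

```shell
# With autoconf, CFLAGS passed to configure replaces the implicit
# "-g -O2" default (illustrative invocation, run in the package tree).
./configure CFLAGS="-g -O3"
make
```

Changing the *default* for all users, though, would still require a change to configure.ac, which is the harder problem described below.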

Then I tried using grep to find CFLAGS and where it is defined. What I discovered is that they use a pthread macro that helps find the right flags for compilation. I quote from the md5deep documentation:
#   This macro figures out how to build C programs using POSIX threads. It
#   sets the PTHREAD_LIBS output variable to the threads library and linker
#   flags, and the PTHREAD_CFLAGS output variable to any special C compiler
#   flags that are needed.
I did not know how to manipulate the pthread configuration so that it would always use -O3. Their instructions mentioned that the README file contained an email address for questions or remarks on the compiler options; however, it was not in the file, so I could not contact them personally either.

On that note, I'm sad to share that I could not get the right flags set in the configure step. This means I could not contribute anything to the github project, because I could not contact the maintainers to ask for help or an explanation.

by Nina ( at April 22, 2016 03:36 AM

Vishnu Santhosh Kumar

Lab 6 – Vectorization

Vectorization is the process of rewriting a loop so that instead of processing a single element, it can process multiple elements simultaneously.

To show this in action I used the following code:

#include <stdio.h>
#include <stdlib.h>
#define NUM 1000
int main() {

    int a1[NUM], a2[NUM];

    int i = 0;
    for (i = 0; i < NUM; i++) {
        a1[i] = rand() % 100;
    }
    for (i = 0; i < NUM; i++) {
        a2[i] = rand() % 100;
    }

    long int total = 0;
    int a3[NUM];
    for (i = 0; i < NUM; i++) {
        a3[i] = a1[i] + a2[i];
    }
    for (i = 0; i < NUM; i++) {
        total += a3[i];
    }

    printf("Total is: %ld\n", total);
    return 0;
}


The code above creates two 1000-element integer arrays and fills them with random numbers, then sums those two arrays to a third array, and finally sums the third array to a long int and prints the result.

The code is compiled with the -O3 option, which enables the optimizations needed for auto-vectorization.

402388: 91420222  add x2, x0, #0x80, lsl #12
40238c: 3c407c00  ld1 {v0.2d}, [x0]
402390: 3cdf7c21  ld1 {v1.2d}, [x1], #16
402394: 3e61d400  fadd v0.2d, v0.2d, v1.2d
402398: 4c9f2c00  st1 {v0.2d}, [x0], #16
40239c: eb02001f  cmp x0, x2

Here, st1 stores elements from a vector register to memory, while ld1 loads elements from memory into a vector register. These two instructions perform the actual vector loads and stores in the vectorized loop.


by vishnuhacks at April 22, 2016 02:51 AM

Berwout de Vries Robles

Phase 3 Pushing changes upstream

To get my changes accepted upstream, the first thing I needed to figure out was where in the make/build process to insert my change, since in the phase 2 optimization testing I had only edited the Makefile directly.

It turns out ncompress uses a Makefile.def, which contains the definitions used by the make call. The package is not built with ./configure like most; instead, after cloning, you can directly run make install. This means I had to add the following 2 lines to the Makefile.def:
#Compiler optimization flags
I tested whether the package still built correctly with my change to the Makefile.def and it built as it should with the -O3 optimization turned on.

Next I had to figure out how to contribute to the package. Luckily, since the package is available on github, I could fork it. So that is what I did: I forked the package, made the change and committed it to my branch. I then created the following pull request:

I've added a link to my blog and a short description in the pull request, since there is no other means of communication described on the project page. Now all we can do is wait for a response!

by Berwout de Vries Robles ( at April 22, 2016 01:46 AM

Vishnu Santhosh Kumar

Final Project : Phase 3.4 (last)

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve the performance of the software on AArch64 and x86_64 systems.

Phase 3.4 – Final Patches Submitted and Project Conclusion

Date: April 6, 2016

April 6, 2016

Submitted the final patches to the LZ4 community

Details:

The final patches of the code have been submitted to the lz4 community, and I am waiting for any further response. All the changes made in the previous phases of the project have been added to the final code. The community is very active, so I'm expecting a reply soon.

April 7, 2016

Response from LZ4 community.

Details:

The community has responded to my pull requests in GitHub. They have analysed my code and got the following replies.
The community replied that one of the optimizations I have done for decompression on  a function called “LZ4IO_compressFilename_extRess()” was not the right place to do that one, instead making that similar changes in a function named “LZ4IO_GetBlockSize_FromBlockId()” could have made the program much better. The optimizations I have done to improve the performance of the code in the x86 machines makes the community satisfied but they show me their concern regarding the performance of the code on some other platforms.  The community also told me that they compiler options I put for the platform specific flag won’t work in certain conditions or new platforms. He also pointed out one of my mistakes I raised in the pull requests regarding the cache size. When I examined the oldest versions of the LZ4 code, I noticed an option to set the cache size of the compression block, I increased this cache size to see the changes in the compression speed. My assumption was right and the compression speed got increased. However, this option was not found in the latest version of the  code. I added this statement in the pull request and community told me that, its still available but they changed the place of the code to a different file. The community owner also appreciated me to try lz4 optimization and he told me that he have gone through my blog.


The SPO course and this final project have greatly improved my coding and other skills. This course helped me explore the world of open source and many other new things. One such new thing is the habit of blogging: I never had a blog before, and now I'm really interested in blogging and sharing my research. The professor was very helpful and inspiring. He made the labs fun and helped us improve our teamwork skills by having us work in groups for all labs. He explained things very well and gave us a lot of references to follow. My classmates were enthusiastic and energetic. This was one of the best courses I have attended, and it gave me many new threads of research to continue. I once again thank my professor and my classmates for being with me and supporting me throughout this whole semester.


by vishnuhacks at April 22, 2016 01:34 AM

April 21, 2016

Berwout de Vries Robles

Project Phase 2

After last week, when we ended our project selection with a somewhat hasty switch to another package, I am happy to be able to just dig into this package a little. Sadly, I have been sick half of this week (still am), so progress is slower than I would have hoped.

In my project selection blog I mentioned that I wanted to look at vectorization the compiler can do for us if we make some small changes to the code. However, before looking into the code, I took a look at the makefile to see whether any vectorization flags are currently turned on.

This is what the compiler options line looks like on both betty and my own x86 machine:
What I found out is that CFLAGS, CPPFLAGS and LDFLAGS are all undefined in this case. What I also found strange is that these options define a fixed amount of memory for the program to use and a fixed number of registers the program can use.

The most glaring omission is that not even a base level of optimization is applied, so I set out to test that first. To verify that the code has not been changed in a way that alters its behavior, we use an md5 checksum to check that our compressed and uncompressed files remain the same. After slowly ramping up the optimization level, it turns out the program works at O3 without causing any problems.

During benchmarking I first encountered a couple of runs where uncompressing the files looked slower than before the optimization was applied, but after running some more benchmarks, it seems to me that this was probably background noise from other processes running on my system.

After the -O3 optimization level had been applied we see the following results in the benchmarks:
aarch64 testserver Betty:
2545934560(~2.4GB) textfile:
real:    64.467s
user:   63.280s
sys:     1.180s

1073741824(1GB) integerstream:
real:    9.413s
user:   8.980s
sys:     0.430s

2545934560(~2.4GB) textfile:
real:    10.882s
user:   9.510s
sys:     1.370s

1073741824(1GB) integerstream:
real:    4.734s
user:   3.920s
sys:     0.700s

my own x86_64 Fedora 23 installation:
2545934560(~2.4GB) textfile:
real:    34.812s
user:   18.821s
sys:     4.715s

1073741824(1GB) integerstream:
real:    13.274s
user:   3.789s
sys:     1.651s

2545934560(~2.4GB) textfile:
real:    45.669s
user:   5.779s
sys:     2.339s

1073741824(1GB) integerstream: 
real:    17.713s
user:   2.814s
sys:     1.107s

If we compare these with last week's results, we find that in every case compression was faster with the optimization turned on. Oddly enough, however, on my local machine decompression was notably slower! This could also be explained by disturbance from background processes on my machine, since I am testing inside a virtual machine; however, I have tested extensively and have yet to see the optimized version of uncompress not do worse.

To implement this optimization in the actual program I will have to add it to the Makefile recipe; for testing I have just been editing the Makefile itself. I have also been looking for a reason why compression on the aarch64 server betty is so terribly slow while decompression is very fast, but I have yet to find the answer.

by Berwout de Vries Robles ( at April 21, 2016 10:09 PM

Tom Ng

The results of Optimizing xzutils with -O3

Ever since I wrote my last post on my attempt to optimize xzutils, I have been compiling different builds of xzutils with different compiler flags. The results of my first two benchmarks showed that -O3 ran slower than the default compilation, which used -O2. These results surprised me, so I wanted to figure out how a program could run slower with -O3 than with -O2.

-O3 is simply -O2 with 9 additional optimizations turned on: -fgcse-after-reload, -finline-functions, -fipa-cp-clone, -fpredictive-commoning, -ftree-loop-distribute-patterns, -ftree-partial-pre, -ftree-vectorize, -funswitch-loops and -fvect-cost-model. This led me to create 9 builds of xzutils, each with one of those optimizations passed into the configure script. The results of these builds are shown in the picture below:

results(Click picture to enlarge)

The picture shows the times recorded by the time command when compressing and decompressing the test files. Highlighted in green are times where an improvement relative to the executable with the default optimizations (-g -O2) was found.

Overall, the results show that compiling with -O2 plus one individual -O3 optimization has a negative impact on the running speed of the program, since the times to complete operations mostly increased. Savings, if any, tend to happen more during decompression than compression. All the individual-optimization builds were also slower than the -O3 executable, which was itself slower than the original executable to begin with.

Furthermore, any reductions in time that a build achieved in one section were mostly insignificant and potentially false. For example, a saving in user time may not have been a real saving, since the time “saved” was simply shifted onto kernel time, resulting in no net gain.

Of the 9 optimizations tested, -finline-functions seems to be a bit of an outlier. The executable compiled with it is still slower than the -O2 and -O3 builds, but its timings are notably better than those of the other individual-optimization builds. In fact, its times are similar to the -O3 executable's, but it still stands that adding -finline-functions to -O2 produces a slower executable. In other words, -finline-functions adds considerably less runtime than the other optimizations, but the executable is still better off without it.

It was surprising to me that each individual -O3 optimization made the program slower, yet combined (-O3) the times were better (though still slower than plain -O2). This suggests that some optimizations only pay off in combination. It’s hard to determine which optimizations work with each other, but going by the names alone I tried -ftree-loop-distribute-patterns, -ftree-partial-pre and -ftree-vectorize together. That build did not yield favorable results either: much like the individual-optimization tests, it was slower than both the -O2 and -O3 versions.

In the end, I was unable to optimize xzutils by adding the -O3 compiler optimizations. None of the individual -O3 optimizations resulted in any meaningful speed improvement, and overall they slowed the program down. It seems -O3 isn’t used everywhere because it can actually slow programs down, and not just because it can break something, as I had initially thought. The xzutils devs are right not to use -O3, as -O2 without any of the -O3 optimizations produces the faster-running xzutils.

by tng23spo600 at April 21, 2016 09:39 PM

Lab7: Looking at Inline ASM in libmlx4-1.0

In this lab, I chose a package whose source includes inline assembly and attempted to locate it and find out what it does. In the source, I found only three lines of assembly. The assembly is used exclusively inside a series of C-preprocessor if-else directives, meaning that only one (or none) of the assembly statements is ever compiled in. The assembly specifically targets the i386, x86_64 and ia64 platforms.

The file mlx4.h contains the 3 asm statements found in the source code. The assembly appears in a preprocessor if-else block that defines wc_wmb(). Depending on the platform, wc_wmb() is substituted with a single line of assembly or with wmb(). The input and output operands are empty; declaring the asm volatile and putting “memory” in the clobber list ensures the statement doesn’t get optimized away or moved by the compiler. All three statements appear to implement a memory barrier, which prevents the compiler from compromising the order of memory operations through optimization. Throughout the qc.c file, wmb() is called in various places.

A single line of assembly is used if the target architecture is i386, x86_64 or ia64. If it is none of these platforms, the #else section defines wc_wmb() as wmb() without any assembly. I could not find wmb() defined or redefined anywhere else in the program, so it is possible that the call to wmb() does nothing beyond letting the program compile. If it does nothing, the memory barrier is not implemented on those platforms, and the ordering guarantees the barrier was meant to provide could be lost after optimization.

Only a single line of assembler is executed if one of the three platforms is detected, and none at all if the platform isn't recognized. The increase in code complexity is insignificant and only benefits the platforms it supports. No portability is lost; unsupported platforms just may not run as well. For such a small check, it is probably worth having the assembler in the source code.

by tng23spo600 at April 21, 2016 08:59 PM

Giuseppe Ranieri

Providing for the Upstream

I have submitted my work to the upstream through github.

I thought it would be best practice to first show my findings to the community so they could comment on benefits and drawbacks I don't know about. Unfortunately, even 4 days after posting, no one has responded yet. I assume that even if I had made a pull request, the same thing would have happened at this point in time, as no new pull requests have been answered either.

For the sake of time, though, if the work had been accepted, I would have had to go through their guidelines for contributing, which follow the Google Individual Contributor License Agreement (CLA). The guidelines state the following:

Before you contribute

Before we can use your code, you must sign the Google Individual Contributor License Agreement(CLA), which you can do online. The CLA is necessary mainly because you own the copyright to your changes, even after your contribution becomes part of our codebase, so we need your permission to use and distribute your code. We also need to be sure of various other things—for instance that you'll tell us if you know that your code infringes on other people's patents. You don't have to sign the CLA until after you've submitted your code for review and a member has approved it, but you must do it before we can put your code into our codebase. Before you start working on a larger contribution, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you. Coordinating up front makes it much easier to avoid frustration later on.

Code reviews

All submissions, including submissions by project members, require review. We use Github pull requests for this purpose.

The small print

Contributions made by corporations are covered by a different agreement than the one above, the Software Grant and Corporate Contributor License Agreement.

 I followed their suggestion of:

Before you start working on a larger contribution, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you. Coordinating up front makes it much easier to avoid frustration later on.

But since they have not answered me yet, I was not able to continue with the signing of the CLA.

by JoeyRanieri ( at April 21, 2016 05:55 PM

Vishnu Santhosh Kumar


My Tips for Future Data Scientists

  • Be flexible and adaptable – There is no single tool or technique that always works best.
  • Cleaning data is most of the work – Knowing where to find the right data, how to access the data, and how to properly format/standardize the data is a huge task. It usually takes more time than the actual analysis.
  • Not all model building – Like the previous tip, you must have skills beyond just building models.
  • Know the fundamentals of structuring data – Gain an understanding of relational databases. Also, learn how to collect and store good data. Not all data is useful.
  • Document what you do – This is important for others and your future self. A related tip: learn version control.
  • Know the business – Every business has different goals. It is not enough to do analysis just because you love data and numbers. Know how your analysis can make more money, positively impact more customers, or save more lives. This is very important when getting others to support your work.
  • Practice explaining your work – Presentation is essential for data scientists. Even if you think you are an excellent presenter, it always helps to practice. You don’t have to be comfortable in front of an audience, but you must be capable in front of an audience. Take every opportunity you can get to be in front of a crowd. Plus, it helps to build your reputation as an expert.
  • Spreadsheets are useful – Although they lack some of the computational power of other tools, spreadsheets are still widely used and understood by the business world. Don’t be afraid to use a spreadsheet if it can get the job done.
  • Don’t assume the audience understands – Many (non-data science) audiences will not have a solid understanding of math. Most will have lost their basic college and high school mathematics skills. Explain concepts such as correlation and avoid equations. Audiences understand visuals, so use them to explain concepts.
  • Be ready to continually learn – I do not know a single data scientist who has stopped learning. The field is large and expanding daily.
  • Learn the basics – Once you have a firm understanding of the basics in mathematics, statistics, and computer programming; it will be much simpler to continue learning new data science techniques.
  • Be a polymath – It helps to be a person with a wide range of knowledge.

by vishnuhacks at April 21, 2016 02:33 PM

April 20, 2016

David Wesley Au

[Lab 3] Assembly Lab



 mov x3,start          // loop counter, starting value
 mov x10,10            // divisor for splitting off digits
loop:
 adr x11,msg           // address of the message buffer
 udiv x4,x3,x10        // x4 = x3 / 10 (tens digit)
 msub x5,x10,x4,x3     // x5 = x3 - x4*10 (ones digit)
 cmp x4,0              // check the tens digit (leading-zero handling)
 add x4,x4,48          // convert both digits to ASCII
 add x5,x5,48
 strb w4,[x11,5]       // patch the digits into the message
 strb w5,[x11,6]
 mov x0,1              // write(stdout, msg, len)
 adr x1,msg
 mov x2,len
 mov x8,64             // syscall number for write
 svc 0
 add x3,x3,1           // increment the counter
 cmp x3,max
 b.ne loop             // loop until the counter reaches max
 mov x0,0              // exit(0)
 mov x8,93             // syscall number for exit
 svc 0


 mov $start,%r15       # loop counter, starting value
 mov $10,%r10          # divisor for splitting off digits
loop:
 mov %r15,%rax         # div needs the dividend in rax
 mov $0,%rdx           # and rdx cleared
 div %r10              # rax = quotient (tens), rdx = remainder (ones)
 cmp $0,%rax           # check the tens digit (leading-zero handling)
 mov %rax,%r14
 add $48,%r14          # convert both digits to ASCII
 mov %rdx,%r13
 add $48,%r13
 movb %r14b,msg+5      # patch the digits into the message
 movb %r13b,msg+6
 mov $len,%rdx         # write(stdout, msg, len)
 mov $msg,%rsi
 mov $1,%rdi
 mov $1,%rax           # syscall number for write
 syscall
 inc %r15              # increment the counter
 cmp $max,%r15
 jne loop              # loop until the counter reaches max
 mov $0,%rdi           # exit(0)
 mov $60,%rax          # syscall number for exit
 syscall

There is not much of a difference between the two. The methods may be somewhat different, but the logic is ultimately the same.

by David Au ( at April 20, 2016 07:43 PM

Project Phase 3

Since my attempt in Phase II, I have tried other possible optimizations, but have had no luck. As I have said before, after searching through the pxz.c file multiple times on different occasions, I cannot find another possible shortcut or optimization that could be made to the application. Between the two files, with a 1 MB difference, there is only a negligible time difference. From my understanding of PXZ so far, although the latest patch to the software dates back almost one year, it already seems solid in terms of code structure and speed. Given the tests I have done, there was almost no difference in time/speed between the different SPO600 servers.

Since PXZ was made (2009), a number of patches have been applied to the compression tool over four years or so. Given the benchmarks the PXZ community has run, PXZ is meant for multiple cores and processors, which increases its efficiency in compression time. The compression time of a single-threaded PXZ is equivalent to ordinary use of the XZ compression tool.

Although my optimization attempts ended in failure, I learned a lot during this project. I learned how compression works in code, and how everything fits together at the assembly level. In the future, I will attempt to contribute to other parts of the PXZ utility, be it the assembly, makefile, manpage or documentation.

by David Au ( at April 20, 2016 07:16 PM

[Lab 7] Inline Assembly

The Open Source Package I have chosen to blog about is called "Mosh". Mosh is a mobile shell, a remote terminal application that allows roaming, supports intermittent connectivity and provides line echo. It also has line editing of user keystrokes. Mosh can be used as a replacement for SSH, it is free to use and it is available across many platforms. (

1. Mosh has 14 different MAKEFILES

2. Mosh is available on Windows, OS X, Android, Chrome OS, Linux, BSD, and Solaris.

3. Mosh reduces the median keystroke response time to nearly instantaneous time. This makes remote servers feel like they are "local" machines.


5. When I tried to configure the package, I ran into this problem: "configure: error: cannot find protoc, the Protocol Buffers compiler", which stopped me from attempting to build the package.

by David Au ( at April 20, 2016 03:16 PM

[Lab 2] Code Building

1. hello-2.10

I used these commands to build the software:


I had no problems with either, and after each step there were more files available in the hello-2.10 directory.

These are the files present after make:

@aarchie hello-2.10]$ ls
ABOUT-NLS  COPYING    ChangeLog.O  INSTALL  NEWS    README-dev      THANKS  aclocal.m4  config.h   config.log     configure     contrib  hello    lib  po   stamp-h1
AUTHORS    ChangeLog  GNUmakefile  Makefile  README  README-release  TODO    build-aux  config.status  doc      hello.1  m4   man       src  tests

2. ghostscript

tar -xvf gnu-ghostscript-9.14.0.tar.xz


Since there is no Makefile, and it is suggested that we not install on the SPO600 servers, there is nothing else I can do after configure. These are the files available after ./configure.

@aarchie gnu-ghostscript-9.14.0]$ ls
AUTHORS    DroidSansFallback.NOTICE  Resource    config.log  devices   expat                    ghostscript_rt.vcxproj  install-sh  libtool    missing   toolbin
COPYING    LICENSE                   NEWS         aclocal.m4  base          config.sub  contrib       doc       ghostscript-ufst.vcproj  iccprofiles             jbig2dec  openjpeg  trio
ChangeLog  LICENSE.Unicode           README       arch        config.guess  configure   cups          examples  ghostscript.vcproj       ijs                     lib         man        psi

by David Au ( at April 20, 2016 02:58 PM

April 19, 2016

David Wesley Au

[Lab 5] - Algorithm Selection Lab

The first one should scale a signed 16-bit integer by multiplying it by a volume scaling factor expressed as a floating point number in the range of 0.000-1.000. This should be implemented as a function that accepts the sample (int16) and scaling factor (float) and returns the scaled sample (int16).
The second version of the function should do the same thing, using a lookup table (a pre-computed array of all 65536 possible values). The lookup table should be initialized every time a different volume factor is observed. This should be implemented as a drop-in replacement for the function above (same parameters and return value).


#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#define MAX 250000000
#define FACTOR 0.500

int16_t method1(int16_t, float);
int16_t method2(int16_t, float);
int16_t* sample;

int main() {
    int16_t* meth1;
    int16_t* meth2;
    meth1 = (int16_t*)malloc(MAX * 2);
    meth2 = (int16_t*)malloc(MAX * 2);
    int i;
    struct timeval stop, start;
    sample = (int16_t*)malloc(32769 * 2);

    srand(time(NULL)); // generate sound sample by using random number generator
    int16_t* sound;
    sound = (int16_t*)malloc(MAX * 2);
    for (i = 0; i < MAX; i++) {
        sound[i] = rand() % 65536 - 32768;
    }
    for (i = 0; i < MAX; i++) {
        printf("Value of sound sample: %d\n", sound[i]);
    }

    // Calculating time taken for method1
    gettimeofday(&start, NULL);
    for (i = 0; i < MAX; i++) {
        meth1[i] = method1(sound[i], FACTOR);
    }
    gettimeofday(&stop, NULL);
    printf("took %ld\n", stop.tv_sec - start.tv_sec);

    // Calculating time taken for method2 (build the lookup table first)
    gettimeofday(&start, NULL);
    for (i = 0; i < 32769; i++) {
        sample[i] = i * FACTOR;
    }
    for (i = 0; i < MAX; i++) {
        meth2[i] = method2(sound[i], FACTOR);
    }
    gettimeofday(&stop, NULL);
    printf("took %ld\n", stop.tv_sec - start.tv_sec);
    return 0;
}

int16_t method1(int16_t s, float f) {
    return s * f;
}

int16_t method2(int16_t s, float f) {
    int16_t result;
    if (s >= 0) {
        result = sample[s];
    } else {
        result = -sample[-s];
    }
    return result;
}

aarchie: Method1 took 215,498 ms || Method2 took 168,443 ms
x86_64: Method1 took 20,429 ms || Method2 took 23,564 ms

by David Au ( at April 19, 2016 08:33 PM

[Lab 4]

#include <stdio.h>
int main() {
    printf("Hello World!\n");
    return 0;
}

Compiler Options:
-g               # enable debugging information
-O0              # do not optimize (that's a capital letter and then the digit zero)
-fno-builtin     # do not use builtin function optimizations

-static option:

-It statically links the C library (all of it, not just the functions used in the .c file), making the executable much larger than the original.

Without "-fno-builtin" option:

-Total file size is reduced.

Without "-g" option:

-Total file size is reduced.

Addition printf statements:

-Each additional argument is loaded into a register (or pushed onto the stack once the argument registers run out) before the call.

Move printf to a separate function (output()):

-main calls output(), and $0x0 is moved to %eax to clear the register before each function call.

by David Au ( at April 19, 2016 08:32 PM

Andrei Topala

Inline Assembler Lab

Let’s look at the use of inline assembler in a C program.

Assembly-language code can be embedded into a C (or other high-level languages) source file in order to optimize a certain part of the program by using processor-specific instructions that are faster than what the compiler would create out of functionally-equivalent C code. Moreover, inline assembly code can be used for hardware-specific tasks that would be impossible with high-level code.

An example of inline assembly code can be found in K9Copy, a DVD backup tool. In the file src/dvdread/md5.h, there is the following piece of code:

#if defined __GNUC__ && defined __i386__
static md5_uint32
rol(md5_uint32 x, int n)
{
  __asm__("roll %%cl,%0"
          :"=r" (x)
          :"0" (x),"c" (n));
  return x;
}
#else
# define rol(x,n) ( ((x) << (n)) | ((x) >> (32-(n))) )
#endif

If the program is compiled with a GNU compiler and on an i386 processor, the function rol(md5_uint32 x, int n) is defined. It contains the assembly instruction roll (rotate left long), which rotates the 32-bit integer x to the left by n places, bringing bits that fall off the end back to the beginning. If the program is compiled with a different compiler or on a different processor, the function is instead replaced with a macro function that achieves the same result through bit shifts; it would use multiple instructions whereas the version using inline assembly uses only one.

This, I think, is a good use of inline assembly: it’s there to be used if it can make the executable faster, and if it can’t be used then there is an alternative. Unless there are machines that have the __i386__ macro defined but no rol function available (which I don’t think would be the case?), there are no problems with portability. The concept is sound. However, this particular source file was written in 1999. Modern compilers would almost certainly recognize the second function (that is, the macro function using bit shifts) and replace it with a rol instruction if possible. Using inline assembly for optimization is not so (relatively) straightforward a job as one might think, or as it would have been 20 years ago. Most attempts at optimization with inline assembly would probably be surpassed by the compiler’s optimizations; at worst, inline assembly might even impede the compiler and result in an executable that performs worse than it would have had the compiler been left to its own devices. There are, I’m sure, exceptions, and in some situations it could be the best course of action to use inline assembly in an optimizing capacity, but it probably shouldn’t be the first resort.

Inline assembly code should, of course, still be used for certain platform-specific operations or for tasks that cannot be done with high-level languages–those things, at least for the moment, lie beyond the powers of the compiler.

by andrei600 at April 19, 2016 01:51 AM

April 18, 2016

Tom Ng

Lab4: Compiler Options and their Impact

Lab 4 was about observing the impact of compiler options on a simple program: nothing more than a call to printf saying “Hello World”. In this lab, I compiled 7 different versions of the same program, each with different compiler options or with a slight difference in the source code that accomplishes (mostly) the same result.

Before experimenting with compiler options, it is necessary to see how the compiler compiles a program without any options. The command is simply: gcc hello.c. The compiler compiled the source to a binary of 8190 bytes.

The first variation of the program was compiled with the options -g -O0 -fno-builtin, giving the command gcc -g -O0 -fno-builtin hello.c. The -g option enables debugging symbols, -O0 sets the optimization level to 0 (no optimization) and -fno-builtin disables builtin-function optimizations. The resulting file was 9184 bytes, larger than before. Examining the object dump (objdump -d) of both executables showed only one change: in the version built without options, the call to printf was replaced by a call to puts, a builtin-function optimization that -fno-builtin disables.

The second variation was compiled with the -static option added in. This option links statically instead of dynamically. The executable produced with this option shot up in size to 681699 bytes, roughly 75 times larger. The object dump also gained massive amounts of instructions. While main was the same number of lines, the startup code (start_main), which was only 4 lines in the dynamically linked version, was many more lines; possibly too many for objdump to show, as the end of the section showed “…”, suggesting further lines were truncated.

The third variation was compiled without -fno-builtin. The file size became slightly smaller at 7149 bytes. The object dump reveals the same change described above: with builtins enabled, the printf call is replaced by puts.

The fourth variation was compiled without -g. Compared to the first variant, this resulted in no change to the file size or the assembly code. However, a hash check on the two executables produced different values, so there is some difference not visible in the file size or in objdump -d.

The fifth experiment differs in that it is a source code change rather than a compiler option change. The original program calls printf with one argument: the string containing “Hello world”. The change here is to add additional integer arguments to be printed along with the string. I made 10 sub-variations of the program, each passing a different number of parameters to printf (1 to 10). Comparing the object dumps, there were more mov instructions in the main header as the number of parameters increased, as well as more instructions pushing values onto the stack in that header. There were other changes throughout the object dumps, but they were only address changes.

The sixth experiment was much like the fifth in that the source code was changed rather than the compiler options. The original program simply calls printf inside main and then exits. Here, main calls a function output(), which contains the call to printf. In the disassembly, this was reflected by a new output header containing many of the instructions that had been in the main header. The main header was also reduced by one instruction.

The seventh and final experiment set the optimization level to the highest by changing -O0 to -O3. The file size went up from 9184 bytes to 10608 bytes. There were many changes in the assembly as revealed by the object dump: some headers, like main, were moved to other parts of the section, and the main header contained fewer lines of assembly to execute.

by tng23spo600 at April 18, 2016 04:22 AM

April 17, 2016

Tom Ng

On two Optimizations: -fmerge-constants and -fpeel-loops

Recently, I had the opportunity to spend time to examine two optimizations available in gcc: -fmerge-constants and -fpeel-loops. For each optimization, I examined what it did and tried to write code that would show the effect of the optimization.

-fmerge-constants is an optimization where, as its name suggests, the compiler attempts to merge all constants that have the same values. The idea is that if variables have the same value and are guaranteed not to change (constants), one copy can be referenced instead of storing several. This can produce smaller executables since fewer constants are stored. The optimization even spans compilation units, meaning it is not limited to the variables of one specific file in the source code.

The simple program I made to highlight the optimization simply prints constant variables with identical values in main, and includes the header of another function that does the exact same thing with the exact same values. I compiled two versions of the program: one with the optimization and one without. When comparing the disassembly of the two executables, the only things that changed were the addresses in the mov instructions. There was no increase or decrease in instructions, yet the optimized executable was smaller. The savings came from the .rodata section, where the value was stored only once instead of multiple times. -fmerge-constants is enabled at optimization levels -O, -O2, -O3 and -Os, and rightfully so, because it should almost always be used. There are essentially no downsides to this optimization unless your code violates the constness of one of the variables you declared.

-fpeel-loops is an optimization where the first or last iteration of a loop is hoisted out, which can potentially eliminate a variable or operation inside the loop. The first or last iteration of a loop may be a special case, different from the rest of the iterations. Handling this case tends to require extra code that ends up inside the loop. With this optimization enabled, the compiler can move the special case out of the loop, handling it before or after the loop, and the loop can then be adjusted to fit the other cases while eliminating a variable or operation. This can lead to further loop optimizations: for example, if the compiler determines the number of iterations is small enough, it may unroll the loop completely, eliminating it.

Unfortunately I don’t have any working code for this optimization, but here is a trivial example of what would happen with this optimization enabled. Given:

int x = 2000;
int y;
for (y = 0; y < 10; y++) {
    printf("%d\n", x * x);
    x = y;
}

The loop will print the square of x and then set x from the loop counter. Notice, however, that the way this code was written makes the first iteration an anomaly, in which x starts at a completely different number before becoming what would be normal for the loop. What the compiler can do is hoist the first iteration out and put it before the loop:

int x = 2000;
printf("%d\n", x * x);

Following this, the loop can be adjusted like:

for (x = 0; x < 9; x++) {
    printf("%d\n", x * x);
}

The end result looks like:

int x = 2000;
printf("%d\n", x * x);
for (x = 0; x < 9; x++) {
    printf("%d\n", x * x);
}

Notice the entire elimination of the y variable in the snippet. Furthermore, an entire operation inside the loop was eliminated (x = y). In this particular example the number of lines of code was reduced, but that does not always happen: it depends on the operation inside the loop, and additional logic would be needed to handle the special case if the exit condition of the loop is not constant.

by tng23spo600 at April 17, 2016 01:24 AM

April 16, 2016

Tom Ng

Lab3: Programming in Assembly

Lab 3 was a fun lab about learning assembly. Assembly is certainly very different from higher level languages but it has its charms. Programming in assembly felt like doing a time consuming puzzle that you know you can eventually figure out and when you do, it feels great. Assembly brings back the joy of printing your first variable or making your first working loop which is probably taken for granted when you’ve programmed enough. The fun part or challenge of assembly comes from its restrictions: there is a limited set of instructions which must be combined in order to do something that may have taken only one line of code in another language. Assembly also uses a limited number of registers which are used to store values and must be used sparingly.

I was tasked with coding a loop that would simply print the numbers from 0-9. Sounds simple enough: two or three lines of code in another language, but not in assembly. To start, I copied the “Hello World!” example and tried to compile it. It worked, but when I tried to execute it I got an exec format error. It took a while to figure out what was wrong, because I thought it was something like a syntax error, yet the message did not point to any line where an error could be. It turned out I had been trying to execute the object file all along: the assembler (as) doesn't link by default. I thought that was weird, because gcc, when compiling C or C++, links by default unless given the -c flag. I then had to call the linker ld myself, whereas with C or C++ I would use gcc and it would invoke ld. I guess I should have realized sooner, since I had to keep typing chmod +x before attempting to execute what I thought was the compiled executable. Well, now I could start making loops.

It turned out the loop was actually already given to me, and what I had to do instead was code what the loop was supposed to do. The requirement was to print “Loop: ” followed by the current iteration of the loop on one line. I thought of printing “Loop: ” and then printing the number, but it turns out you can modify the string itself by writing to the string's address, offset by the desired amount. This required declaring the string with extra spaces at the end: “Loop:   ”. It was also necessary to add 48 to the number in order to print its ASCII equivalent. The next part of the lab expanded on this by printing the numbers up to 30. That meant printing numbers of up to two digits, but only a single digit could be converted at a time. Breaking up the number meant using division and remainders, which the div instruction conveniently provides. Leading zeroes also had to be suppressed, which was simple enough since individual digits were being printed anyway: just before the code prints the tens digit, a check is made, and the digit is printed only if it is not zero; if it is zero, the code jumps past the instruction that would have printed it. The code I wrote for the lab only allows numbers of up to two digits to be printed, but the logic for printing larger numbers is there, probably at the cost of using more registers.

Finally, I had to replicate the x86_64 assembly and port it to aarch64 assembly: another deceptively hard task. I started by copying the code from the x86_64 system to the aarch64 system and recompiling it. There were enough errors that I decided to start from scratch, and upon further reading on the wiki about aarch64 assembly, it was different enough that I had to rewrite the program from scratch anyway. The first interesting change was the code to divide the number the loop was on. The x86_64 assembly had a lot of mov instructions because some instructions, like div, require certain values to be in certain registers. Aarch64 turned what was 6 lines of x86_64 code into just 2, since aarch64's udiv instruction does not require the numbers it works with to be in specific registers. The block of code used to print the message was also shorter by 3 lines, again because x86_64 assembly requires certain data to be in certain registers. Most of the code removed in the aarch64 version was, in fact, mov instructions. But aarch64 assembly has some downsides too: some instructions such as udiv accept only registers as operands, so a few mov instructions were still required to feed them values. Overall though, aarch64 required fewer lines of code than x86_64 assembly.

Doing lab 3 has taught me quite a bit about assembly, but I'm definitely still a beginner. Probably the most important thing I've learned is that there are different kinds of assembly for different architectures. I always thought assembly was one language, but someone who knows x86_64 assembly may not necessarily know aarch64 assembly. Assembly is hard stuff, and it's kind of amazing knowing there are people who work with it every day.

by tng23spo600 at April 16, 2016 07:45 PM

April 14, 2016

Vishnu Santhosh Kumar

Final Project : Phase 3.3

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve the performance of the software on AArch64 and x86_64 systems.

Phase 3.3 –  Interactions with the lz4  community in GitHub.

Date : March 29,2016 – March 30,2016

March 29,2016

Opened an issue to improve the documentation for the LZ4 compression program.

Details :

I opened an issue on GitHub to improve the documentation of the project. The information provided in the present documentation is not enough to understand how the program works or its flow. No documentation describes the purpose of each directory and the files in it. I believe better documentation would really help beginners understand the program and contribute to its development, so I opened this issue on GitHub.

I got the reply that the point of a format specification is to allow any kind of implementation to exist while remaining compatible with the others. This is a good thing, as different environments come with different restrictions and strengths, so a single implementation wouldn't fit them all. He also encouraged me to understand the format and then create my own program to read it, which would give better insight and, as a result, a better chance of understanding the reference source code. I found a similar issue opened on GitHub, and the response from the owner was that there is constantly a need to balance "simple usage for a beginner" against a "hardcore total-control experience" where little details matter a lot. Also, the project has been worked on for a very long time (17 years), and many patches were merged without proper documentation. He said he would try to convert the comments to Doxygen, a documentation generator, a tool for writing software reference documentation.


March 30,2016

Opened an issue regarding the increase in the compressed file

Details : 

I compressed a file containing only unique characters to check the performance of the code. The size of the file increased by nearly 150%, which I believe happened because of the encoding of the non-matching literals. So I opened an issue proposing a change to the algorithm: encode a literal only if a match has been found, and only if the encoded literal size is smaller than the uncompressed data. That way, uncompressed literals keep their original size and the worst-case file size can be reduced.
The response from the community was that the default minimum size of the header is 15 bytes. Even though my suggestion could improve the worst-case situation, the command line utility cannot do this, because it must generate a result decodable on many different systems and for different use cases, so it needs the frame headers to ensure decoding success. To compress a file without the header, a function called 'LZ4_compress()' can be used directly at the API level.



by vishnuhacks at April 14, 2016 05:52 PM

Final Project : Phase 3.2

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve the performance of the software on AArch64 and x86_64 systems.

Phase 3.2 – Interactions with the LZ4 community in the Google forum.

Date : March 28,2016


March 28,2016

Post : Which is the default format used, Frame format or Block format?

Details : 

While I was working on Phase 2.1, I came across two different formats used by the compression code: the frame format and the block format, which have different structures. By "format" I mean the structure of the compressed block of literals that I explained in Phase 1.1. The actual default format used by the code was not mentioned in any documentation, so I asked my question in the community forum.

I got the reply that the block format is used for compression in LZ4, and that there is no default compression format; the frame format is used for testing and debugging. I also understood that the frame format is easy to use, while the block format is meant for writing advanced applications with specific limitations (e.g. a tight working memory/bandwidth/size/speed budget). I decided to make changes in the block format to improve the optimization.

In short: the block format is used for compression, the frame format is used for testing and debugging, and there is no single default format.

by vishnuhacks at April 14, 2016 04:27 PM

Final Project : Phase 3.1

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve performance of the software on AArch64 and x86_64 systems.

Phase 3.1 – Interactions with the LZ4 maintainers on the Google community blog.

Date : March 22,2016  –  March 26,2016

March 22,2016

Post:  Difference between ‘lib’ directory and ‘programs’ directory?

Details :

During Phase 1.1 of my final project, I found it really hard to understand the purpose of the various files in the different directories. One of the most confusing pairs was the 'lib' directory and the 'programs' directory. The files in each had similar names, and the code was so confusing (e.g. a lot of macros) that I was unable to understand the purpose of these files. I posted my question in the Google community forum.

I got my first reply: the "lib" directory contains LZ4's library (core) part, while the "programs" directory contains application programs which use the LZ4 library: the LZ4 command line utility, the I/O submodule, and test tools (e.g. the benchmark). I understood that optimizations to the I/O module, like increasing buffer sizes or multi-threading the input/output, belong in the "programs" directory. It contains the programs that use the files in "lib", which holds the code for compression and decompression, the block formats of the compression, etc. Any optimizations to the compression/decompression itself should be done on the "lib" directory files.

The 'lib' directory maintains the core program, and the 'programs' directory holds the supporting programs for the user interface, input/output operations, buffer management, etc.


March 26,2016

Post:  How to determine maxOriginalSize and correctly allocate a buffer for decompression?

Details :

While I was analyzing the code for decompression, I noticed that the buffer size used to decompress the file is fixed (128 KB). I thought it would be better to include the original size of the file in the header of the compressed file; that way a better decompression buffer size could be derived from the header, which would help improve the decompression speed. I posted my question on the community blog.

I got my first reply the same day. The owner of the blog, a user named "Cyan", replied that a compression block uses a maximum of only 64 KB, so the 128 KB buffer in the decompression code is enough to decompress any file. He also said that future extensions of the compression format will allow embedding the file size in the header. Two other users posted their opinions after this, supporting "Cyan's" reply.

Version 6 of the LZ4 software is planned to add the file size to the header. That can be used by the decompression code to achieve the maximum possible decompression speed.


by vishnuhacks at April 14, 2016 03:59 PM

April 11, 2016

Andrei Topala

Project: Improving lrzip; Phase Two

In my last post, I mentioned that lrzip, when given the -z option, complains of an illegal instruction and crashes on AArch64. I did some debugging with GDB, hoping to find the cause of this, but the situation seemed to be that the program just sort of lost itself during execution:

Program received signal SIGILL, Illegal instruction.
0x000003ff7b000000 in ?? ()

Even though the package had been compiled with -g (to produce debugging information) and we’d downloaded all the separate debuginfos, the program still lost its way, and it couldn’t tell us where it was.

The reason for this turned out to be very simple. I took a look at the files in the libzpaq folder and found, in libzpaq.h, the following comment:

“By default, LIBZPAQ uses JIT (just in time) acceleration. This only
works on x86-32 and x86-64 processors that support the SSE2 instruction
set. To disable JIT, compile with -DNOJIT. To enable run time checks,
compile with -DDEBUG. Both options will decrease speed.”

JIT code is created during execution time and then executed. libzpaq.cpp contained several functions that filled various arrays with hex numbers representing machine instructions. Those arrays were then written to memory and executed as code. The problem is that those machine instructions are x86 instructions. So, on our AArch64 machines, the program was trying to execute data it had no business executing, and that’s why it crashed.

libzpaq.cpp has a check to make sure that the platform it’s being compiled on is x86, but it isn’t a very good check:

// x86? (not foolproof)
  const int S=sizeof(char*);      // 4 = x86, 8 = x86-64
  U32 t=0x12345678;
  if (*(char*)&t!=0x78 || (S!=4 && S!=8))
    error("JIT supported only for x86-32 and x86-64");

That snippet of code checks that the system uses little-endian format (which x86 does) and that a pointer is either 4 or 8 bytes long. AArch64 also uses little-endian format (AArch64 is bi-endian in regards to reading data, but by default it is little-endian) and has same-sized pointers, so it passes the test, which is something that shouldn’t happen.

It seems that this is a known problem. An issue was submitted to lrzip’s GitHub’s page a little over a month ago, but no solution was proposed beyond that mentioned in the libzpaq.h file (ie. passing to the C++ compiler the -DNOJIT flag when compiling the code).

I first thought to translate the encoded x86 instructions into the equivalent AArch64 instructions . . . but there are a lot of instructions, and I don’t think I could get it done in time (and I am sure I would run into trouble concerning the handling of things like x86’s variable-length vs AArch64’s 32-bit instructions, etc.)

I wanted to see how big of a difference the JIT code makes, so I tried running the ZPAQ algorithm on x86_64 with and without the -DNOJIT flag.

Here is the result of running lrzip -z normally:

[andrei@localhost lrzip-0.621]$ ./lrzip -z text 
Output filename is: text.lrz
text - Compression Ratio: 4.810. Average Compression Speed:  1.355MB/s. 
Total time: 00:06:08.42

To compile the program with the -DNOJIT flag, we just need to add it to the C++ flags when we run the configure script:

[andrei@localhost lrzip-master]$ ./configure CXXFLAGS="-g -O2 -DNOJIT"

The -g and -O2 flags are set by automake (which lrzip uses to create its configure script and makefiles) if CXXFLAGS isn’t defined by the user, so if we are defining CXXFLAGS we need to set them as well, for consistency’s sake.

Now, here’s the same operation without the JIT code:

[andrei@localhost lrzip-0.621]$ ./lrzip -z text 
Output filename is: text.lrz
text - Compression Ratio: 4.810. Average Compression Speed:  0.919MB/s. 
Total time: 00:09:04.74

Without the JIT code, the program takes almost three extra minutes: throughput drops by about 32% (1.355 MB/s down to 0.919 MB/s), which works out to nearly 48% more wall-clock time.

Chris suggested that such a large discrepancy between the JIT code and the compiled C code might indicate that the C code isn’t quite optimized as well as it could be. I looked at the C code and, while there are some small pieces I could try to optimize, I do not really understand the algorithm at all, and the code is difficult to read. I’ll try to keep going but I don’t have anything to show for my efforts here.

Anyway, at least I edited libzpaq.cpp. I added the following preprocessor directives near the top of the file:

// GNU
#ifdef __arm__
#define NOJIT 1
#endif
#ifdef __aarch64__
#define NOJIT 1
#endif
NOJIT, which is supposed to be set with the -DNOJIT flag, is checked in every function that uses JIT code, and if it is defined the program uses regular C code instead. So, with this change, if the preprocessor detects that we’re on an ARM machine, it just sets NOJIT and the other conditionals take care of the rest. It’s an inelegant solution (and I suppose it would make more sense to check instead that the program is running on an x86 architecture and enable NOJIT otherwise) but it works on aarchie and betty, and the ZPAQ algorithm defaults to using the C code. I’ve only tested this on gcc, though; other compilers have different (or no?) macros to check for ARM architectures, so this is not the best solution. I’ll try to refine it but I don’t know if it’ll be worth submitting to the community, since it is such a simple thing and could be achieved just as easily by compiling the code with the -DNOJIT flag.
Modifying the script to check the architecture and add -DNOJIT to CXXFLAGS accordingly also works, but I think it’s better just to have a safeguard in the libzpaq.cpp source file itself, because adding an extra C++ flag by default on non-x86 machines is not really expected behavior.

I took a break from the ZPAQ algorithm and tried to find something to optimize in the regular LZMA code (which is the path taken by default when lrzip is called without specifying an algorithm), but this, too, proved fruitless (although the code is much easier to read). Two functions take up 90% of the execution time, and I couldn’t improve either of them.

Two pieces of code in particular (the first from LzmaEnc.c and the second from LzFind.c) I tried to wrest some improvement in performance from.

p->reps[3] = p->reps[2];
p->reps[2] = p->reps[1];
p->reps[1] = p->reps[0];
p->reps[0] = pos;

Here I tried to use memmove to shift the array (it’s an UInt32 array) forward all at once, but it didn’t have any effect on performance.

if (++len != lenLimit && pb[len] == cur[len])
          while (++len != lenLimit)
            if (pb[len] != cur[len])

This I tried to condense to a single while loop (or at least one while loop and a single if statement) in order to reduce the number of branches and operations, but that actually made the code slower. It might be the case here that the compiler optimizes this particular logic layout best.

So, that’s where things stand at the moment. It still looks like the ZPAQ C code might be the most promising venture, if only because I can’t seem to optimize the LZMA code at all. I’ll keep trying both options, though. I also haven’t looked into other compiler options yet, so that’s also still a possibility.

by andrei600 at April 11, 2016 06:46 AM

Giuseppe Ranieri

The Trials and Tribulations of Optimizing Zpofli

What Didn't Work

When I first went about doing this, my methods of testing were incorrect and there was a lot of research I should have done better on.

For example, reading more about zopfli compression, the reason it is slow is that slowness is the entire point of its design: it is meant to be about 80x slower than gzip while providing up to an 8% improvement in compression. So it became a tall order to try to better it.

Here's a comparison of the output sizes between the two.


As you can see, in this case for the 10mb file made with the command:

base64 /dev/urandom | head -c 10000000 > 10mb.txt

it can only provide a 0.6% improvement. With less random text it can probably do better, but still, we're dealing with razor-thin margins.

What I first tried to do was follow this guide, but what ended up happening was that I got even worse performance than by using the makefile. And this wasn't a permanent solution either, as I wouldn't be able to port the exe file to Linux.

Next I tried to change the code anywhere I could to see even a bit of a boost, changing any data types where I thought it could help. The only problem was that it wouldn't compile correctly. The code was so advanced and exact that it was difficult to pinpoint where and what to change, so I gave up on that attempt.

What Did Work 

So I thought my only chance was to change the compiler options around and see what sticks. In the makefile and the documentation it's suggested that the program be compiled with -O2 (along with some other options that allow it to compile correctly). I thought this was odd: why not go for the full -O3?

Here are the results when comparing -O2 and -O3 on the xerxes server.

-O2: real    0m29.852s
-O3: real    0m29.847s

As you can see, no difference at all that can really be attributed to the higher optimization level. I ran it many times, with -O2 sometimes being faster than -O3 and vice versa. I can only assume the documentation suggests -O2 over -O3 because it compiles quicker, and maybe uses less memory, although I did not test for this.

So again I was stumped, as that means the optimization levels won't matter, until I remembered that I should try it on the aarchie server! And the results were terrific.

-O2: real    3m56.909s
-O3: real    3m50.971s

I tested multiple times at both levels to make sure it wasn't just a fluke, but no: results were always +/- 1 second from their respective levels. This means that changing between the levels can provide as much as a ~2% difference in speed, which is noticeable when a compression takes this long and could stretch into hours.
Testing with larger files on the xerxes server at different optimization levels showed the same speed across all of them, unlike the difference it showed on ARM64. That means the change is an ARM64-only improvement, which is probably why it was never noticed or included in the makefile.

What Now?

Before submitting these findings upstream I'll have to research why this happens. There must be one or two flags in the -O3 optimization level that help only on the ARM64 platform. Once I find out which, I could push an ARM64-specific makefile change so others can see the same performance boost.

by JoeyRanieri ( at April 11, 2016 03:28 AM

Berwout de Vries Robles

Phase 2 Process

So in my previous blog I found an optimization, but I was not too happy about it because it seemed like such a simple step to take. I guess that is sort of the point with compiler optimizations though. The more efficiently you can get things done, the better. The compiler assists with this very handily.

I tried to get the compiler to tell me some possible optimizations as well, using -fopt-info-vec-missed to see what vectorization possibilities it missed and why. The compiler came up with about 500 lines of missed vectorization, so I wasn't sure where to start with that. I then used -fopt-info-vec to see that some things did actually get vectorized, but that got me no further either.

I tried profiling the program using gprof; it turned out that the program spends its entire time in a single large function, whether compressing or uncompressing. This brought me no closer to finding any other optimization.

I have fiddled with the DUSERMEM and DREGISTERS values to see if they make a difference, but I think there is a set maximum to these values somewhere, although I do not know why. To me they are sort of magic numbers right now.

All in all, the searching and the constant waiting for benchmarks was frustrating, especially because I found next to nothing besides the obvious compiler flag. If I were to attempt a project like this again, I would add an extra phase, and my phase 2 would be to write an automatic benchmarking suite where you press a button and a reliable benchmark rolls out the other end.

I also wouldn't use my own laptop to benchmark on, because for some reason the virtual machine gives widely varying performance and I can never tell whether it is an accurate representation of a change or not. Luckily betty had my sanity covered there.

by Berwout de Vries Robles ( at April 11, 2016 02:13 AM

Nina Wip

Compiling Optimization Betty

After benchmarking x86_64 with the -O3 optimization flag, it's time to test this on AArch64.
Since I can only work from the command line on the server, I needed a command to replace -O2 with -O3 in all the Makefiles. The command I found easiest was the following:
find -name Makefile | xargs sed -i 's/-O2/-O3/' 
The following benchmarks were done on AArch64, on the server Betty.

With -O2 flag
            10.5 MB file   105 MB file   1.5 GB file
real:       0m0.037s       0m0.345s      0m4.792s
user:       0m0.028s       0m0.323s      0m4.551s
sys:        0m0.001s       0m0.027s      0m0.400s

With -O3 flag
            10.5 MB file   105 MB file   1.5 GB file
real:       0m0.036s       0m0.343s      0m4.768s
user:       0m0.028s       0m0.323s      0m4.499s
sys:        0m0.001s       0m0.027s      0m0.426s

As you can see, -O3 did not do anything on AArch64. I thought this was very strange and checked the executable file to see if it had changed at all. The file did change: it got larger, as expected. Yet there is no gain in speed; comparing the real time of the 1.5 GB file again, it wasn't even 1% faster. So for AArch64 I recommend using -O2, because -O3 barely changes the run time and your executable stays smaller.

For Betty I'll have to find another optimization possibility, although I wouldn't know what the next one would be. Probably something specific to AArch64, but this would cost much more time.

Another way to go is to add the compiler flag -fopt-info-missed to find missed optimizations and see if I can do something about them. Source:

by Nina ( at April 11, 2016 01:20 AM

Compiling Optimization x86_64

As mentioned in the previous post I am going to take a look at the compiler flags and how I might be able to optimize them.

The current flags used for the C compiler:
-pthread -g -pg -O2 -MD -D_FORTIFY_SOURCE=2 -Wpointer-arith -Wmissing-declarations -Wmissing-prototypes -Wshadow -Wwrite-strings -Wcast-align -Waggregate-return -Wbad-function-cast -Wcast-qual -Wundef -Wredundant-decls -Wdisabled-optimization -Wfloat-equal -Wmissing-format-attribute -Wmultichar -Wc++-compat -Wmissing-noreturn -funit-at-a-time -Wall -Wstrict-prototypes -Weffc++
As you can see, there are a lot of flags used to compile this program with the C compiler. What caught my eye was that they're using the -O2 flag and not -O3, so I saw this as an optimization possibility and went straight in to change it. Once I had changed all the flags, I ran another benchmark of the program to see if it had helped at all.

To check if it did anything, I looked at the executable file. The -O3 optimization made the file larger: it went from 2.7 MB to 3.5 MB.

Note: don't forget to take out the -pg flag as well before benchmarking; -pg makes your program slower, so you wouldn't have a correct comparison.

The following benchmarks are for x86_64

With -O2 flag
            10.5 MB file   105 MB file   1.5 GB file
real:       0m0.053s       0m0.287s      0m3.458s
user:       0m0.021s       0m0.200s      0m2.793s
sys:        0m0.003s       0m0.012s      0m0.184s

With -O3 flag
            10.5 MB file   105 MB file   1.5 GB file
real:       0m0.045s       0m0.258s      0m3.071s
user:       0m0.020s       0m0.197s      0m2.756s
sys:        0m0.002s       0m0.014s      0m0.177s

By comparing the real time you can see the -O3 flag makes the program 11.2% faster. Which is a pretty nice optimization. So I recommend they replace the -O2 flag with -O3.

Now it's time to benchmark the software on Betty and hopefully see an optimization there as well.

by Nina ( at April 11, 2016 01:12 AM

April 10, 2016

David Wesley Au

pxz.c Phase II - Optimizations

So I tried the optimizations I posted in my last blog post, using these two tar.xz files.

1. linux-4.6-rc2.tar.xz ( 85.3 MB

2. linux-4.5.tar.xz ( 84.3 MB

I used the decompression flags to test the speed and time it takes to do each, with the non-edited version and the "optimized" version. These are my results: 

Note: MM:SS:MS

Note: These were tested on an  ARMv8 AArch64 system known as aarchie.

Default Version:
./pxz -d linux-4.6-rc2.tar.xz  ===== 1:11:20

./pxz -dcv linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06 

./pxz -dcv -T4 linux-4.6-rc2.tar.xz > /dev/null ===== 9.5 MiB/s 1:06

./pxz -dcv -T8 linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06


./pxz -d linux-4.5.tar.xz ===== 1:08

./ pxz -dcv linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05

./pxz -dcv -T4 linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05

./pxz -dcv -T8 linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05


"Optimized" Version:
./pxz -d linux-4.6-rc2.tar.xz ===== 1:11:19

./pxz -dcv linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06

./pxz -dcv -T4 linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06

./pxz -dcv -T8 linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06


./pxz -d linux-4.5.tar.xz ===== 1:09

./ pxz -dcv linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05

./pxz -dcv -T4 linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05

./pxz -dcv -T8 linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05


Given these results, I can see that the small changes I made to the compression tool did almost nothing to the overall result, with the exception of the 1 ms difference in the default decompression. After searching through the pxz.c file multiple times on different occasions, I cannot find another possible shortcut or optimization that could be applied. Between the two files, which differ by 1 MB, there is only a negligible time difference.

by David Au ( at April 10, 2016 10:38 PM

Nina Wip


To enable gprof profiling you need to add the -pg flag to the compiler options. These options can be found in the Makefile: look for the variables CFLAGS, CPPFLAGS, and LDFLAGS, add the -pg flag to each of them, and profiling should be enabled.

At first I couldn't figure out why my software wasn't producing the gmon.out file, which the instrumented program generates for gprof when it runs. I had added -pg to all the flags in the top-level Makefile, but apparently there were more Makefiles where the flags needed changing: there were 6 in total, so I added -pg to the compiler flags in all of them.

Luckily this did generate the gmon.out file so I could take a look at the profiling by running the following command:
gprof md5deep > gprof.txt
This generates a readable file which contains the profile. The strange thing was that my gprof file said that every function took 0 seconds, even though some functions were called 8 times. The program probably goes through them so quickly that no time is registered.

The function that gets called 8 times is hash_final_sha1 which looks like this:
void hash_final_sha1(void * ctx, unsigned char *sum) {
  sha1_finish( (sha1_context *) ctx, sum );
}
Since it's a one-liner, there isn't much to optimize here, but it does call other functions which I can take a look at. I followed the chain of calls until I found a function that actually did things by itself instead of just forwarding to another function. I ended up at the following:
void sha1_update( sha1_context *ctx, const unsigned char *input, size_t ilen )
{
    size_t fill;
    unsigned long left;

    if( ilen <= 0 )
        return;

    left = ctx->total[0] & 0x3F;
    fill = 64 - left;

    ctx->total[0] += (unsigned long) ilen;
    ctx->total[0] &= 0xFFFFFFFF;

    if( ctx->total[0] < (unsigned long) ilen )
        ctx->total[1]++;

    if( left && ilen >= fill )
    {
        memcpy( (void *) (ctx->buffer + left),
                (const void *) input, fill );
        sha1_process( ctx, ctx->buffer );
        input += fill;
        ilen  -= fill;
        left = 0;
    }

    while( ilen >= 64 )
    {
        sha1_process( ctx, input );
        input += 64;
        ilen  -= 64;
    }

    if( ilen > 0 )
        memcpy( (void *) (ctx->buffer + left),
                (const void *) input, ilen );
}
Looking at it, I could not find a way to optimize this function. Since it took a while to find a suitable function and I couldn't find an optimization, I'm going to look into the compiler flags next: they use the -O2 flag, which could be changed to -O3.

by Nina ( at April 10, 2016 08:45 PM

Yunhao Wang

bzip2 optimization-phase two

There are two ways to optimize the bzip2 compression time.

Both are in the mainSort function of blocksort.c.

1. Use the memset() function instead of a for loop to set each element of the array to zero. This replaces the loop, and the branches it executes, with a single call to the optimized C library routine.

-for (i = 65536; i >= 0; i--) ftab[i] = 0;
+memset(ftab, 0, 65537 * sizeof(ftab[0]));

2. Manually unroll the loop. This will slightly increase run-time memory usage, and the author noted in the source that an unrolled version produced wrong code when compiled with gcc 2.7.x.

It reduces the branch count of the program, which reduces the compression time.

-for (i = 1; i <= 65536; i++) ftab[i] += ftab[i-1];

+for (i = 1; i<= 65536; i+=8){
+       ftab[i] += ftab[i-1];
+       ftab[i+1] += ftab[i];
+       ftab[i+2] += ftab[i+1];
+       ftab[i+3] += ftab[i+2];
+       ftab[i+4] += ftab[i+3];
+       ftab[i+5] += ftab[i+4];
+       ftab[i+6] += ftab[i+5];
+       ftab[i+7] += ftab[i+6];
+       }

After compiling the program, it passed the 6 tests of its own test suite on both the x86_64 and AArch64 machines.
Doing 6 tests (3 compress, 3 uncompress) …
If there’s a problem, things might stop at this point.

./bzip2 -1  < sample1.ref > sample1.rb2
./bzip2 -2  < sample2.ref > sample2.rb2
./bzip2 -3  < sample3.ref > sample3.rb2
./bzip2 -d  < sample1.bz2 > sample1.tst
./bzip2 -d  < sample2.bz2 > sample2.tst
./bzip2 -ds < sample3.bz2 > sample3.tst
cmp sample1.bz2 sample1.rb2
cmp sample2.bz2 sample2.rb2
cmp sample3.bz2 sample3.rb2
cmp sample1.tst sample1.ref
cmp sample2.tst sample2.ref
cmp sample3.tst sample3.ref

If you got this far and the ‘cmp’s didn’t complain, it looks
like you’re in business.

On the x86_64 machine, compressing the linux-kernel tarball (version 4.5), 628 MB (658,053,120 bytes):

The compression time before

real    1m39.944s
user    1m39.464s
sys     0m0.388s

The compression time after

real    1m27.951s
user    1m27.428s
sys     0m0.443s

Running cmp on the files compressed before and after the change showed that they were identical.

It reduced the compression time by nearly 12% from the previous version.

On the AArch64 machine, compressing the linux-kernel tarball (version 4.5), 628 MB (658,053,120 bytes):

The compression time before
real    13m32.371s
user    13m29.980s
sys     0m1.890s

The compression time after

real    11m59.926s
user    11m58.790s
sys     0m0.960s

Running cmp on the files compressed before and after the change showed that they were identical.

It reduced the compression time by nearly 11.5% from the previous version.


by yunhaowang at April 10, 2016 11:55 AM

Tom Ng

xzutils – The Search for an Optimization

A while ago, I announced for the SPO600 project that I would be looking into xzutils for an optimization. I particularly wanted to see if there were any optimizations to be had by simply changing the flags used in the compilation process.


Since I wanted to change the flags, the make process was where I had to look first. After extracting a fresh tarball, the only things that had to be done were running the configure script and then running the resulting makefile the script created. My first approach was to modify the root makefile generated by the configure script; the only changes I intended so far were to the flags included in the makefile variables CFLAGS, CPPFLAGS and LDFLAGS. Running the modified makefile, however, produced the usual output without my modifications. Since that didn't work, I then tried to modify all the makefiles created by the configure script with a sed command. Unfortunately there was still no change to the binary, either because sed didn't work or because the changes sed made still didn't change anything. I also attempted to hardcode the changes into the files, without success.


I did some poking around reading random documents that came with the tarball and chanced to read a doc intended for packagers which included an example that showed how to pass flags to the configure script. The flags I wanted to add were -g, -pg and -O3. I chose these flags because according to the output of the unmodified makefile, -O2 was used and I wanted to investigate why -O3 wasn’t. -g was already present but I wanted to add it just to make sure it was used everywhere. Finally -pg was used so that the program could be profiled. Instead of running just ./configure, I was able to pass the flags I wanted to test with ./configure CFLAGS=’-g -O3 -pg’. When running the resulting makefile, I was able to see the new flags I added in the output and the program compiled successfully with these flags.


I created three files in a test directory: a 1 GB file filled with zeroes, a 1 GB file filled with random data, and a relatively smaller file (only 743223 bytes) consisting of Lucy Montgomery's Anne of Green Gables and Shakespeare's The Tempest. The test process for both the unmodified and modified xzutils involved timing how long it took to compress a file, checking the resulting file's size, and checking the hash using sha256sum; then timing the decompression of the resulting .xz file, checking the size and hash again, and finally comparing them to the original.


In all three tests, the modified version was able to compress a file and decompress it without any data loss. This came as a surprise to me because I would imagine the reason why the xz developers would use -O2 instead of -O3 would be because one or more of the -O3 optimizations would break something in the (de)compression process or may produce a corrupted file. If the (de)compression process works fine with -O3, and using -O3 is supposed to make the program run even faster, then surely changing -O2 to -O3 would be an easy change for a speed boost. That turned out to be not the case; -O3 actually made the timings for the tests worse. More accurately though, -O3 created almost no significant changes in time and the negligible differences tended to lean on the slower side for -O3 xzutils. Well I can see now why they used -O2 instead of -O3.


It seems odd to me that -O3, which is supposed to optimize for performance, would produce an inferior performing binary than one compiled with -O2. As far as the tests showed, using -O3 instead of -O2 did not break anything but just made the program run slower instead. In searching for an optimization, I plan to examine which flags -O3 uses that -O2 doesn’t use and see which ones positively affect the program and which ones negatively affect the program. One or more of the flags certainly affect the program in a negative way but there also might be a flag or more that can speed the program up. It is also possible that an optimization may conflict with another resulting in slowing down the program. When querying gcc for which flags are turned on and off for -O2 and -O3 and using diff on the output, there are roughly 10 more flags enabled on -O3 so narrowing down the ‘bad’ flags shouldn’t be too hard. I look forward to investigating each flag with its own xz build and hopefully at least one of them can contribute positively to the performance of the program.

by tng23spo600 at April 10, 2016 12:22 AM

April 07, 2016

Vishnu Santhosh Kumar

Final Project:Phase 2.1

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve performance of the software on AArch64 and x86_64 systems.

Phase 2 – Aim: Optimize the performance of the software

The basic optimizations discussed in Phase 1.1 have been done, but they showed no performance difference in this phase. A higher level of optimization is required.

♣  A Surprising Result !!
The very oldest commit of the LZ4 software on GitHub is from April 2014. I downloaded that version (#r116) and tested it to see the performance difference against the latest release of LZ4 (#r131), which I have been working with. To my surprise, no optimization has happened in the software since then: the older version showed the same benchmarking results I got from the latest release. The older version also looks easier to understand and is well documented. The “lz4.c” file of that version has an option to set a memory flag, i.e. the amount of cache memory the compressor should use; a larger cache gives higher compression, and the reduced output size can in turn improve speed. I found that the change in running time was negligible compared to the better compression ratio a larger cache size provides. I added some preprocessor code to check the architecture of the system and give the compressor the largest suitable cache size. On the “matrix” system, the processor is a Dual-Core AMD Opteron(TM) Processor 8220 with a 64KB level-1 data cache (I did this performance comparison only on x86). By providing this maximum cache size instead of the 16KB default value, the compressor shows a 4.24% increase in compression over the latest release. Then came the disappointing part: I quickly looked into the latest release to make the same change there too, but unfortunately there was no such code to manage the cache memory size, nor anything similar. “Old is Gold!!!!!”

Hit- 1  

Objective : Increase the probability of finding similar literals in a compression block.

Link to GitHub

Performance Improvement :
      0.22% increase in Compression.

File Changed : /programs/lz4io.c

The compression uses a block size to read data (uncompressed data) from the source file when looking for matching literals. Before my changes, a fixed buffer size was read from the source file. I found that varying the size of the input buffer read from the source file helps find more matching patterns in the text; i.e., the probability of finding matching literals within a fixed block size is lower than the probability of finding them across varying block sizes.

Hit- 2

Objective :  Replace the most expensive operations in code to improve Compression Speed.

Link To GitHub

Performance Improvement :
1.1 MB/s
increase in Compression Speed

File Changed : /programs/lz4io.c

In the compression loop of the program, compression progress is displayed on each iteration of the loop. It used an expensive multiplication to compute the progressing compression ratio. I replaced the multiplication with a cheaper bitwise operation, which increased the compression speed by 1.1MB/s.

Hit- 3

Objective: Optimize the code using CFLAGS in Makefile.

Link To GitHub

Performance Improvement:
          24.9 MB/s increase in Compression Speed.
          313.4 MB/s Increase In De-compression Speed.

File Changed : /programs/Makefile

The ‘-march’ CFLAG was added in the Makefile to optimize the software for the x86 architecture. The ‘-march’ option generates instructions for a specific machine type. I included an ‘if’ condition in the Makefile to check whether the machine is x86 and, if so, add the -march option to CFLAGS. This increased the average compression and decompression speed of the software. This particular optimization is not applicable on AArch64.
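The post doesn't include the Makefile diff itself; a hypothetical sketch of such a condition (the variable name and the -march=native value are my assumptions, not the actual patch) might look like:

```makefile
# Detect the machine type; add -march only on x86_64 so that other
# architectures (e.g. AArch64) keep their default flags.
UNAME_M := $(shell uname -m)
ifeq ($(UNAME_M),x86_64)
CFLAGS += -march=native
endif
```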

Final Result

Link To GitHub

The compression ratio has slightly increased, by 0.22%, but the compression and decompression speeds have increased considerably. Compression speed has increased from 77.35MB/s to 102.25MB/s and decompression speed from 489.8MB/s to 805.3MB/s on the x86 machine. There is no notable optimization on AArch64; however, a slight increase of 1.03MB/s – 2.1MB/s was achieved. The final patch with all the above changes has been submitted on GitHub.

Link To The Final Submitted Patch.

by vishnuhacks at April 07, 2016 02:19 AM

David Wesley Au


Today I took a look at the pxz.c file and found two possible optimizations to try.

They are:
1. (Pre-Calculation)

Before:

    m = mmap(NULL, s.st_size, PROT_READ, MAP_SHARED|MAP_POPULATE, fileno(f), 0);
    if (m == MAP_FAILED) {
        perror("mmap failed");
        /* ... */
    }

After, folding the call into the test (keeping the assignment to m, since m is used later in the compression loop):

    if ((m = mmap(NULL, s.st_size, PROT_READ, MAP_SHARED|MAP_POPULATE, fileno(f), 0)) == MAP_FAILED) {
        perror("mmap failed");
        /* ... */
    }

2. (Dead Store Elimination)

Before, the output buffer is re-initialized at the top of every iteration:

    for (pt = 0; pt < len; pt += BUFFSIZE) {
        strm.next_in = &m[off+pt];
        strm.avail_in = len-pt >= BUFFSIZE ? BUFFSIZE : len-pt;
        strm.next_out = mo;
        strm.avail_out = BUFFSIZE;
        do {
            ret = lzma_code(&strm, LZMA_RUN);
            if (ret != LZMA_OK) {
                fprintf(stderr, "error in LZMA_RUN\n");
                /* ... */
            }
            if (BUFFSIZE - strm.avail_out > 0) {
                if (!fwrite(mo, 1, BUFFSIZE - strm.avail_out, ftemp[p])) {
                    perror("writing to temp file failed");
                    /* ... */
                }
                strm.next_out = mo;
                strm.avail_out = BUFFSIZE;
            }
        } while (strm.avail_in);
    }

After, the initial stores are hoisted out of the loop (the reset after each write already keeps the output buffer valid):

    strm.next_out = mo;
    strm.avail_out = BUFFSIZE;
    for (pt = 0; pt < len; pt += BUFFSIZE) {
        strm.next_in = &m[off+pt];
        strm.avail_in = len-pt >= BUFFSIZE ? BUFFSIZE : len-pt;
        do {
            ret = lzma_code(&strm, LZMA_RUN);
            if (ret != LZMA_OK) {
                fprintf(stderr, "error in LZMA_RUN\n");
                /* ... */
            }
            if (BUFFSIZE - strm.avail_out > 0) {
                if (!fwrite(mo, 1, BUFFSIZE - strm.avail_out, ftemp[p])) {
                    perror("writing to temp file failed");
                    /* ... */
                }
                strm.next_out = mo;
                strm.avail_out = BUFFSIZE;
            }
        } while (strm.avail_in);
    }

There might be more small optimizations I can try. I will take another look at pxz.c for more possible opportunities at a later date.

by David Au ( at April 07, 2016 12:58 AM

April 05, 2016

David Wesley Au

Project: pxz Compression Overview

The project I have chosen to work on and optimize in some minor way is pxz. pxz is a parallelized LZMA compressor that speeds up compression by running the XZ compression tool simultaneously on different parts of a file across multiple cores and processors.

pxz has different settings you can use when compressing file(s). They are:

pxz file1 file2
- This command compresses the two files in default settings

pxz -T4 file1
- This command compresses file 1 with default settings and with a maximum of 4 threads

pxz -k file1
- This command compresses file1 with default settings and keeps the original file; it also uses the default number of threads

pxz -f -T4 file1
- This command forces compression (overwriting any existing output) with a maximum of 4 threads

pxz -kvf -3 -T8 file1
- This command uses a -3 compression , keeps the original file, with a maximum of 8 threads and displays progress

tar Oc dir1 | pxz -D 12 -cv - > dir1.tar.xz
- This command tars dir1 and pipes it to pxz, which compresses it with a per-thread dictionary size setting of 12, displays progress, and writes the result to dir1.tar.xz

by David Au ( at April 05, 2016 12:32 PM

March 30, 2016

Tom Ng

SPO600 Project – xz

The package I chose for the SPO600 project was xz. It comprises the xz file format, which uses LZMA2 compression, and XZ Utils, which work with the .xz container. XZ was first used to fit the Tukaani distribution (a Slackware fork) onto a 700MB image. The new format was intended to replace the older .lzma format under the same extension, but the .lzma format rose in popularity before the replacement was finished, so the new format was named XZ instead.

I chose the package because I liked the name and because the project seemed neither too active nor too inactive. It turns out, according to the docs that came with the source code, that the name ‘XZ’ has no meaning; it was chosen simply because it wasn’t being used by anything else. The community communicates on a variety of platforms, including a forum, a mailing list, IRC and regular old e-mail. The preferred ways to submit bugs or patches are e-mail and IRC.

I downloaded the most recent tarball for XZ, version 5.2.2, and tried compiling it on archie and betty. Both compiled and worked right out of the box. I haven’t been able to do any detailed profiling yet, so I did some quick benchmarks with 3 different files instead.

The first thing I tried to compress was the history.txt file that came with the source. It consists of 1168 words according to wc -w history.txt and is 7427 bytes uncompressed. Just for fun, I tested XZ’s claim that it typically compresses files 30% smaller than gzip, but gzip actually compressed it to 3010 bytes, 10 bytes less than XZ’s 3020. The second thing I tried was a 128MB file filled with 0s made with the dd command; here XZ took 18.9 seconds to compress it down to 19.6MB. The final file was also made by dd: a 128MB file of random data from urandom. This time XZ took 1 minute 27 seconds but failed to shrink the file at all; in fact, the resulting .xz file was slightly larger than the original.

In terms of looking for potential optimizations, I am investigating the makefile templates for places where new optimization flags could be added. It’s quite confusing, since makefile templates and recursive makefiles are a first for me, but I look forward to it.

by tng23spo600 at March 30, 2016 03:55 AM

Berwout de Vries Robles

Compiler optimizations part 2

Last week we each chose two compiler optimizations; in my previous blog post I explained my first, and my second compiler optimization flag is -free. On some 64-bit architectures, when you load a word into the lower half of a register, the upper half of that register is implicitly zeroed. This means that if you need to lengthen an unsigned integer from 32 to 64 bits, you can do so just by loading the word.

From the gcc manual we find that -free is a flag that tries to remove redundant extension instructions. It is also one of the optimizations that works especially well on AArch64, which is why I wanted to take a closer look at it. It is enabled at -O2, -O3 and -Os.

Let me start by saying that even though I've tried multiple programs lengthening int16s and int32s to int64s, by either multiplication or addition, I have yet to find a case where -free kicks in; so far there has been no difference in the assembly. I even tried converting 32-bit floats to 64-bit integers to see if it would emit a lengthening instruction, but it never did. gcc did do something interesting, though, which I would like to highlight for a second. This is the program I used to try to generate a lengthening instruction:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <inttypes.h>

uint64_t sumArray(uint32_t *arr, uint32_t *arr2, int length){
    uint64_t result = 0;
    for(int i=0; i<length; i++){
        result += arr[i] + arr2[i];
    }
    return result;
}

int main(){
    const int length=256;
    uint32_t arr[length];
    uint32_t arr2[length];
    for(int i=0; i<length; i++){
        arr[i] = rand();
        arr2[i] = rand();
    }
    printf("%" PRIu64, sumArray(arr, arr2, length));
    return 0;
}
The funny thing is that this program did not generate a lengthening instruction, but instead contained the following instruction:
mov %eax, %eax
On the surface this looks very strange, because we are moving a 32-bit word into the same register; but if you look slightly deeper, the mov instruction actually zeroes out the top half of the register as well, so it implicitly lengthens the 32-bit unsigned integer to a 64-bit unsigned integer.
I've tried several versions of this program that multiply or add 32-bit and 64-bit unsigned integers, but I have yet to generate a single lengthening instruction. This means that -free sadly does nothing for the programs I have written.

I can tell you that if such a mov instruction were followed by a lengthening instruction (for an unsigned 32-bit int), that would be one of the cases where -free would kick in and remove the lengthening instruction; sadly, I was not able to generate one of those cases.

by Berwout de Vries Robles ( at March 30, 2016 03:34 AM

Project selection part 2

Alright, after the disappointing failure I experienced with slimdata, I am ready for a new try! I tried to install several of the leftover projects (most of them have been selected already). Some had fairly complicated build instructions, and for some of the packages listed I was not sure whether they were even still around. Notably, the latest version of coreutils has a bunch of prerequisites, and even though I followed the readme to the letter, the prerequisites would not install correctly. Also, sha2 and sha3 were somewhat ambiguous; it was hard to find the actual source code those listings referred to. After exploring these packages I settled on ncompress.

ncompress is a compression program that uses the Lempel-Ziv-Welch approach, a well-known compression algorithm. The basic idea is that it builds a lookup table of common sequences in the file, reducing the number of bits required to represent those sequences. This means there has to be some repetition in the file for there to be any improvement, so we cannot use a randomly generated file. Luckily, in my Project Selection part 1 blog I wrote a piece of code that writes a repeating integer pattern either directly to a file or as text. We can use the output of that program as input for ncompress.

The ncompress package is fairly old, with its last release in 2006; as such, I think there should be an optimization opportunity in a multi-processing or SIMD approach, which is where I will focus my attention. This is also because I am most interested in these types of optimizations.

ncompress is easy to install by just running the git command:
git clone
This clones the repo into a directory called ncompress; you can then navigate into it and make install the application.

I faced some difficulties copying my files to our AArch64 test servers aarchie and betty; for some reason my scp and sftp connections kept getting closed by the remote host after a certain percentage. After more than 20 tries, never reaching more than 20% of a 1GB file, I gave up and looked for an alternative. It turns out rsync is a very flexible command that lets you sync directories across servers. For aarchie and betty you do have to specify a port number with the -e option, but some googling of the rsync command gets you there very quickly.

I then set about obtaining benchmarks for the package, which are to be viewed below:

Some notes about the benchmarks:
- They were taken when the cache was "hot" (I disregarded the first run).
- They are averages of 3 runs.
- All runs used copies of the same files.
- I used time to measure the time spent by the entire command.
- There was a minimal amount of interference from other processes running at the same time, so keep that in mind. I did check that no large CPU-consuming applications were running.

aarch64 testserver Betty:
compression, 2545934560 (~2.4GB) textfile:
real: 83.652s
user: 82.590s
sys: 1.060s
resulting size: 49091023 (~1.93%)

compression, 1073741824 (1GB) integerstream:
real: 15.811s
user: 15.323s
sys: 0.480s
resulting size: 21505241 (~2.00%)

decompression, 2545934560 (~2.4GB) textfile:
user: 12.530s
sys: 1.570s

decompression, 1073741824 (1GB) integerstream:
real: 5.920s
user: 5.403s
sys: 0.510s

my own x86_64 Fedora 23 installation:
compression, 2545934560 (~2.4GB) textfile:
real: 39.341s
user: 28.147s
sys: 3.484s
resulting size: 49091023 (~1.93%)

compression, 1073741824 (1GB) integerstream:
real: 15.626s
user: 7.479s
sys: 2.328s
resulting size: 21505241 (~2.00%)

decompression, 2545934560 (~2.4GB) textfile:
real: 38.687s
user: 6.326s
sys: 2.160s

decompression, 1073741824 (1GB) integerstream:
real: 14.575s
user: 2.500s
sys: 0.856s

Something interesting to look into immediately: on Betty the uncompress algorithm performed much better relative to compression than on my own Fedora installation, where it ran at pretty much the same speed as the compression algorithm. I would like to do some more benchmarking on actual written text files and maybe some installers later, but as a basic benchmark I think this does alright. I was pressed for time because I had spent too long trying to benchmark the previous package before having to select a new project.

by Berwout de Vries Robles ( at March 30, 2016 03:29 AM

Nina Wip

MD5Deep Installation

The project I chose to optimize is called md5deep, a program that computes hashes for a number of files. md5deep doesn't only hash files with MD5; it can also use SHA variants, Tiger and Whirlpool. It is similar to the md5sum program, but md5deep has some additional features, which I quote from the project page:
  • Recursive operation - md5deep is able to recursive examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
  • Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match. Hash sets can be drawn from Encase, the National Software Reference Library, iLook Investigator, Hashkeeper, md5sum, BSD md5, and other generic hash generating programs. Users are welcome to add functionality to read other formats too!
  • Time estimation - md5deep can produce a time estimate when it's processing very large files.
  • Piecewise hashing - Hash input files in arbitrary sized blocks
  • File type mode - md5deep can process only files of a certain type, such as regular files, block devices, etc.
I'll be taking a look at the source code and attempting to find optimization possibilities. The possibilities are:
  • Improving a hashing algorithm
  • Adding code so it works specifically on AArch64, without influencing the other systems
  • Editing the makefile so it produces a faster program
I downloaded the source code from and ran the following commands to install it correctly:
sudo make install
To test that it worked, I hashed a few files and saved the hashes in a file. By saving the hashes you can find matching files using the comparison feature. The next step will be benchmarking.

by Nina ( at March 30, 2016 01:23 AM

MD5deep focus area

So my first thought for the focus area was implementing a way for md5deep to install correctly on AArch64, since I ran into trouble with the architecture. But the latest configure files I had to download already fix that issue, and they were dated 2016-03-24, so that is pretty recent.

Since I'm more familiar with code than all the configuration and make files, I decided to take a better look at the source code.

by Nina ( at March 30, 2016 01:22 AM

MD5deep benchmarking betty

Now that I'm done testing and benchmarking on my own x86_64 fedora system, I need to test md5deep on an AARCH64 system. I transferred the files I used in the x86_64 benchmark to the betty server using scp and sftp. Once I got the test files transferred I transferred the tar.gz file of md5deep so I could try to install it on betty. Here's where I ran into some issues...

The normal way to install the md5deep is to unpack it and run the following commands:
sudo ./configure
make install 
Sadly it stopped at the configure step: it could not recognize which build to use, since it did not know aarch64. Luckily, the error message pointed me to a website where I could download the most recent configure files, which did contain aarch64! So after some more use of sftp, the configure step finally worked and I could run the make install command.

On to the benchmarking. Just like on x86_64, I ran the md5deep program 100 times with 3 different file sizes. The results are as follows:
10.5 mb file
real: 0m0.037s
user: 0m0.028s
sys: 0m0.001s
105 mb file
real: 0m0.345s
user: 0m0.323s
sys: 0m0.027s
1.5 gb file
real: 0m4.792s
user: 0m4.551s
sys: 0m0.400s
If you only have one file md5deep can hash it pretty quickly. Even if you hash entire directory structures you wouldn't have to wait too long.

by Nina ( at March 30, 2016 12:58 AM

MD5deep benchmarking x86_64

I currently only have md5deep installed on my own fedora workstation, so I could test if I could get it working. Now that I did get it to work I can benchmark it for my laptop. When I get this to work I will do the same on betty.

My fedora specifications are:
Memory: 2.9 GiB
Processor: Intel Core i7-3630QM CPU @ 2.40GHz
Machine architecture: x86_64

To benchmark this program I need a few files of different sizes. With the following command I can create a file with complete random data and a size of approximately 100mb.
dd if=/dev/random of=test100m bs=1M count=100 iflag=fullblock
After running this command for different file sizes, I ended up with a 105mb testfile and a 10.5mb file.
By running the following command you can run the md5deep program and see how long it took to complete.
time md5deep test100m
To get a presentable benchmark you have to run this command many times and take the average. Instead of manually running it 100 times, I wrote the following command to do it for me:
time for i in {1..100}; do md5deep test100m; done
The time you get is the total for running it 100 times, so the result has to be divided by 100 to get the average time of a single run. I got the following results:
10.5 mb file
real: 0m0.053s
user: 0m0.021s
sys: 0m0.003s
105 mb file
real: 0m0.287s
user: 0m0.200s
sys: 0m0.012s

After seeing that these files still get hashed pretty quickly I decided to download a bigger file to see how long that would take. I downloaded the Fedora Workstation iso, which is 1.5gb, to test with.
1.5gb file
real: 0m3.458s
user: 0m2.793s
sys: 0m0.184s
It still doesn't take very long for a 1.5gb file to be hashed, so this program is pretty fast.

by Nina ( at March 30, 2016 12:56 AM

Berwout de Vries Robles

Project Selection part 1

We've started our projects, and I selected slim, a compression program that mainly deals with very large sets of integers of 1-, 2- or 4-byte word size. One of the things I noticed immediately was that it had not been updated for a while: it was written in 2008 and some additions were made in 2013. Right now it is a one-man project, as all the contributions were made by a single person.

I knew it was not going to be easy to find documentation, because there is only one developer and the last time he worked on it was 3 years ago. Undeterred, I installed the package on both aarchie and my own Fedora installation. The package can be found here:
The installation was very straightforward: just a tar -xvf, a ./configure with some parameters found in the included README, and a make install. I could now slim and unslim files; I tried it on a few text files and it worked. However, the program clearly states it is made for large integer groups with repetitive patterns, so I set out to create a program that would make such a file for me. After much experimentation I finished with this:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

int main() {
    //ints in a GB
    int64_t size = 1073741824 / 4;
    int repetition = 20000;
    int32_t* data = (int32_t*) malloc(sizeof(int32_t) * size);
    int i = 0;
    while (i < size) {
        int32_t number = rand();
        //repeat the number up to "repetition" times
        for (int j = 0; j < repetition; j++) {
            if (i < size) {
                data[i] = number;
                i++;
            }
        }
    }
    FILE *fp = fopen("data", "w");
    fwrite(data, 4, size, fp);
    //for(int x = 0; x < size; x++){
    //    fprintf(fp, "%d", data[x]);
    //}
    fclose(fp);
    free(data);
    return 0;
}
I have tried using fwrite to write the binary values of the integers to a file, and fprintf to write the text values of the integers instead. I have tried random repetition amounts and swapping around some integers to create more or less diverse files. I have tried smaller and larger files. The text files would not even slim; the program would literally leave the original file in an unrecoverable state. The integer files would slim, but the maximum compression gain was 1.5% of the file, or about 20MB on a 1GB file.
I think I can safely assume that this is not the intended gain on a file of this or any size, as gzip would shave off about 50% of the file size in about the same amount of time.

It saddens me to say that I think that I will have to admit defeat here and pick a different project. I will soon update on that.

by Berwout de Vries Robles ( at March 30, 2016 12:30 AM

Giuseppe Ranieri

Identifying a Possible Optimization for the Zopfli Library

The Zopfli Compression Algorithm is a compression library written in C that achieves very good, but slow, deflate or zlib compression. The library can only compress, not decompress.

I chose this library because its community was the largest and most active on GitHub. There are pros and cons to this. On one hand, any problems that occur are reported and fixed quickly, since posting an issue on GitHub is quick and easy. On the other hand, because the community is so active, many of the issues on how to improve the library are already solved and closed.


The following benchmark was done by creating random text files on the ARM64 platform with the command

dd if=/dev/urandom of=10mb.txt count=1M bs=10

and transferring them over to the x86_64 platform so that the tests ran on the same data. I tested both a 1MB and a 10MB file on each platform. Here are the results, run with time ./zopfli 10mb.txt:

x86_64: real 0m3.309s (1MB), real 0m33.970s (10MB)
ARMv8: real 0m27.127s (1MB), real 4m26.325s (10MB)

They weren't kidding when they said it was "slow". On the upside, any changes I make, good or bad, will be more easily noticeable.


Optimizations Choices:

There are three avenues I believe I can follow, in priority order, as possible ways to optimize the library. In deflate.c there is a TODO:

static void AddBits(unsigned symbol, unsigned length,
                    unsigned char* bp, unsigned char** out, size_t* outsize) {
  /* TODO(lode): make more efficient (add more bits at once). */
  unsigned i;
  for (i = 0; i < length; i++) {
    unsigned bit = (symbol >> i) & 1;
    if (*bp == 0) ZOPFLI_APPEND_DATA(0, out, outsize);
    (*out)[*outsize - 1] |= bit << *bp;
    *bp = (*bp + 1) & 7;
  }
}
in the AddBits function. Perhaps there's something I can change there to help the code along.

My second choice afterwards will be trying different flags for the build:

gcc src/zopfli/*.c -O2 -W -Wall -Wextra -ansi -pedantic -lm -o zopfli
As of right now, there may be flags enabled at -O3 that help some edge cases but could make other areas slower or unstable. This will require patience and a lot of compiling and test running.

Finally, the least enjoyable avenue would be platform-specific optimization for ARM64. No issues about it were filed on GitHub, so this might be a platform that has been overlooked.

by JoeyRanieri ( at March 30, 2016 12:04 AM

March 29, 2016

Vishnu Santhosh Kumar

Presentation – Compiler Options

Today I did my presentation about two compiler options: “-ftree-loop-distribution” and “-frename-registers”.


In a loop, the data accessed by the loop body is kept in the cache, which speeds the loop up by serving frequently accessed values and nearby storage locations quickly. Since the cache size is limited, a large loop can be broken into multiple loops over the same index range, each taking only a part of the original loop's body. This reduces the size of large loop bodies so they can make better use of the cache. This optimization is called loop fission.

The ‘-ftree-loop-distribution’ flag performs loop fission. It improves cache performance on large loop bodies and also enables further optimizations like vectorization and parallelization (converting sequential code into multi-threaded or vectorized (or even both) code in order to utilize multiple processors simultaneously).


An example with placeholder loop bodies:

for(int i=0;i<100;++i) {
    a[i] = 1;
    b[i] = 2;
}

becomes, after fission:

for(int i=0;i<100;++i)
    a[i] = 1;

for(int i=0;i<100;++i)
    b[i] = 2;




GCC Compiler Options


by vishnuhacks at March 29, 2016 08:13 PM

Yunhao Wang

bzip2 optimization - phase one

Benchmark for Bzip2



linux-kernel tarball 628MB(658,053,120 bytes) for Huffman coding

500MB random data (for the worst-case situation and the Burrows–Wheeler transform)

generated by dd if=/dev/urandom of=test.500mb bs=1M count=500


AArch64 and x86_64


Modify the Makefile compiler flags to add -pg, then compile the source so we can profile it.

On AArch64

500MB random data with best flag(-9)
real    16m50.237s
user    16m48.650s
sys     0m1.520s

On x86_64

linux-kernel tarball with best flag(-9)

real    1m39.944s
user    1m39.464s
sys     0m0.388s


500MB random data with best flag(-9)

real    1m58.764s
user    1m57.768s
sys     0m0.889s

It looks like we can find possible optimizations in mainSort in blocksort.c (27.32% of the profile) and BZ2_compressBlock in compress.c (38.61%).

There may also be a tiny optimization to find in huffman.c.



Possible Optimization

Remove a for loop to reduce branches:

1. mainSort function in blocksort.c, line 769

/*– set up the 2-byte frequency table –*/

for (i = 65536; i >= 0; i--) ftab[i] = 0;


memset(&ftab[0], 0, 65537*sizeof(int));     // note: assumes int is 4 bytes, matching ftab's element size


2. Loop unrolling

Add the -funroll-loops flag as a compiler option (it will increase the size of the binary).

by yunhaowang at March 29, 2016 05:56 PM

March 24, 2016

Andrei Topala

Project – Improving lrzip; Phase One

So now frost slowly thaws and the winds of blooming spring carry to us tidings of final projects. Our project for the SPO600 course centers around the improvement of a selected checksum or compression open-source package. I chose lrzip, and I hope that I will be able to alter the code in such a way that we can gain a non-negligible improvement in execution time. That, at least, is what I will be aiming for.

So, what even is lrzip?

lrzip – Long Range ZIP or LZMA RZIP

lrzip is a compression program. It’s optimized for the compression of large files and for systems with large amounts of RAM.

lrzip operates in two basic steps, and on a single file.

First, it uses an extended version of rzip to perform a long-distance redundancy reduction (i.e. it encodes duplicate or redundant chunks of data even when they are far apart). This is done with an LZ77-based algorithm: a sliding-window technique is used to keep data in memory in order to find matches, and larger amounts of RAM allow the use of a larger window and thus lead to better compression. Where rzip is limited to a maximum window size of 900MB, lrzip's window scales with available RAM. Duplicate data is replaced with a reference denoting the distance to the original data and the size of the original chunk.

A visual representation of the LZ77 algorithm:

Second, after the initial reduction, lrzip compresses the file further by running one of LZMA (default), ZPAQ, LZO, GZIP, and BZIP2 on the reduced data file.

So, I built the package on my x86_64 computer (which required installing a few dependencies, but otherwise went okay) and on our AArch64 server aarchie (which, again, needed some dependencies, though some were already installed). The program ran fine on both systems.

Before we begin tinkering with the code, let’s run some benchmarks on aarchie. We’ll use three files. The first is a text file containing 500MB of text downloaded from Project Gutenberg. The second is a 500MB file composed entirely of random data. The third is a 500MB file composed entirely of zero data.

[atopala@aarchie lrzip-0.621]$ time ./lrzip text
Output filename is: text.lrz
text - Compression Ratio: 3.918. Average Compression Speed:  1.217MB/s.
Total time: 00:06:50.91

real	6m50.924s
user	49m2.940s
sys		0m9.290s
[atopala@aarchie lrzip-0.621]$ time ./lrzip random 
Output filename is: random.lrz
random - Compression Ratio: 1.000. Average Compression Speed:  6.329MB/s.
Total time: 00:01:18.78

real	1m18.796s
user	1m34.520s
sys		0m2.960s
[atopala@aarchie lrzip-0.621]$ time ./lrzip zero
Output filename is: zero.lrz
zero - Compression Ratio: 7003.299. Average Compression Speed:  6.024MB/s.
Total time: 00:01:22.92

real	1m22.933s
user	5m17.620s
sys		0m2.490s

The text file took much longer than the others and used 23.9% of the memory on aarchie. Compressing the random file required only about 7% of the memory, while the zero file, which has much more duplicate data to compress, used much more, peaking at about 34.5%. The text file also took far more user time relative to real time than the other files, so it was taking the greatest advantage of parallel processing; the zero file was also compressed using multiple threads. The vast majority of the time was spent in user mode; time spent in the kernel is negligible.

So, the goal is to beat these benchmarks. Maybe a way to recognize incompressible data could speed up the processing of the random file; or maybe a method to detect extremely large runs of identical data quickly could help with the zero file. The important file, though, is the text file, and ideally we would improve the runtime of its compression. Setting the lzma level to 9 (the default is 7) brings the compression ratio up to 4.029 (a minor but not insignificant improvement) but takes twice as long; maybe this could be sped up.

So, apart from improving the algorithm itself (which I think is very unlikely, since lrzip has received regular updates as recently as last year, so the algorithm should be close to optimal barring potential oversights) and changing build options (which might not prove fruitful, since the autogen script already sets the default optimization level to 2 and further optimization might have a negative impact; still, this is probably a good area to work in, since I don't think lrzip has been built with AArch64 in mind), these are potential areas of improvement:

- lrzip uses x86 and x86_64 assembly language files that optimize CRC computation. Maybe we could rewrite these for AArch64. As of now, "if you don't have the nasm assembler or have a ppc or other non-x86 system, the standard C CRC routines will be compiled and linked in." The TODO file suggests that the x86_64 version of the assembly code doesn't work, so that might need fixing; however, the TODO file also marks this as low priority.
- the program crashes with the -z flag (ZPAQ optimization) on aarchie, claiming "Illegal instruction" after starting the first thread to compress a piece of data. This occurs on betty as well, but not on x86_64. I am not yet sure of the cause, or whether it is something that happens on AArch64 systems in general, but if it is a bug on AArch64 then I think this should become my main focus.

I’ll begin with the ZPAQ bug (or what I think might be a bug, at least), and if that doesn’t pan out then I’ll try to edit some of the build options and code to optimize it for AArch64 (since I figure that there may be other people working on lrzip, but most might not have access to an AArch64 machine). If that doesn’t work out either, I’ll try to optimize the C code, and if I can’t do that I’ll work on the (low-priority) assembly language files.

by andrei600 at March 24, 2016 05:49 PM

March 22, 2016

David Wesley Au

Lab 1

I have selected two open source packages (with different licenses) and I will be discussing the review process and how contributions are made toward each of the packages.

1. Blender (3D Animation)

First of all, Blender is a free, open-source 3D computer graphics program used widely for creating animations, art, visual effects, 3D applications, and video games. It has many features, such as 3D modelling, animation, and texturing, and it also has an integrated game engine. Blender is written in C, C++ and Python (I will be talking about this later).

To develop for Blender, the majority of development communication happens through a mailing list, IRC chat, and the bug/patch trackers. The development process for Blender is as follows (steps for getting a patch into Blender):

i) Find the current "owner" of the code and check if that person agrees to you working on their project or adding to their project (assuming you have an area in Blender you want to improve on)

ii) Post a design document on the Blender wiki

iii) Code

iv) Check your code, and submit a patch

You can submit a patch by making a patch file through git; the Blender documentation describes all the information needed to submit patches.

2. Python (Programming Language)

In order to develop for Python, you may get involved through their issue tracker and irc channels. You may use their PyChecker, which is a static analysis tool that finds bugs in Python source code and warns about code complexity and style. To submit a patch, follow these steps:

i) Get the source code, using this line: hg clone

ii) Build python, using this line: ./configure --with-pydebug && make -j2
If you are using Windows, use this line: PCbuild\build.bat -e -d

iii) Run the test, using this line: ./python -m test.regrtest -j3
MacOSX: ./python.exe -m test.regrtest -j3
Windows: ./python.bat -m test.regrtest -j3

iv) Make the patch.

v) Submit to issue tracker.

by David Au ( at March 22, 2016 01:11 PM

Berwout de Vries Robles

SPO600 Compiler Optimizations

As we are starting projects shortly, this week there is no lab post. Instead we take a look at some gcc optimization flags. Everyone in the course was tasked with selecting two gcc compiler optimization flags and figuring out what they do.

My first choice is -floop-strip-mine.

We cannot run the -floop-strip-mine flag right now because it requires a specific library to be built into gcc at build time. This library is not present in most readily available gcc builds (including the one you probably have installed and the one I have installed). I have tried building gcc myself, but the process is fairly complicated and I could not get it to work. This means that in this post I will show a manual example of what some sample code looks like before and after the transformation, but there will be no assembly demonstrating the power of the actual flag.

The f prefix on these options stands for flag; you can turn one on by passing it as an argument to the compiler, for example:

gcc -floop-strip-mine program.c

Some optimizations are always on, and some options, such as -O2 and -O3, turn on entire groups of optimization flags. To turn off a specific flag, use the -fno- prefix, for example:

gcc -fno-loop-strip-mine program.c
The reason I chose floop-strip-mine is that our projects could potentially benefit a lot from multi-threading if they do not take advantage of that yet. floop-strip-mine is an optimization that is especially beneficial in a multi-threaded situation.

floop-strip-mine is a flag that turns on the strip-mining of loops. To see how that works, here is a pseudocode example:

Lets say we have two two-dimensional arrays, array a and b, that both have 8 rows and 8 columns.
Let us assume they are filled with values, what those values are doesn't really matter right now. We are going to loop through the arrays and do a multiplication, storing the result in a two-dimensional array c.
int a[8][8], b[8][8], c[8][8];

for (int i = 0; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
        c[i][j] = a[i][j] * b[j][i];
    }
}

The values in a and b are stored in main memory; let's see what happens when we pull them into the cache. Assume our cache lines can store 4 integer values at a time.

The first iteration:
i = 0;
j = 0;
elements we need: a[0][0], b[0][0].
lines going from main memory into cache:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

The second iteration:
i = 0;
j = 1;
elements we need: a[0][1], b[1][0]
Hey that is nice! a[0][1] is already in cache so we can reuse that cache line!
lines going from main memory into cache:
line 3: b[1][0], b[1][1], b[1][2], b[1][3]
Sadly we had to pull in a new line for b, as the value was not in cache yet.

The third iteration:
i = 0;
j = 2;
elements we need: a[0][2], b[2][0]
As in the second iteration the a-line is already in cache, so we can reuse it, the b line however we have to pull in:
line 4: b[2][0], b[2][1], b[2][2], b[2][3]

The fourth iteration:
elements we need: a[0][3], b[3][0]
As in previous iterations, a is in cache and we pull in:
line 5: b[3][0], b[3][1], b[3][2], b[3][3]

The fifth iteration:
elements we need: a[0][4], b[4][0]
Now we need to also bring in a new line for a, note that we have been using the same line in all the previous iterations though:
line 6: a[0][4], a[0][5], a[0][6], a[0][7]
line 7: b[4][0], b[4][1], b[4][2], b[4][3]

We can see a pattern here: a lines up nicely in cache; even after the 4th iteration it still takes up only 2 cache lines. The b-array, however, is traversed against its storage order: every new iteration we need to bring a new cache line in from main memory! Note that this is still not a problem in a realistic situation where we have more than enough cache lines to store all the values in b.

To illustrate what strip mining is and why it works let us assume we only have 8 cache lines to store the values of a and b in. Whenever a new cache line is brought in that does not fit in the cache, the oldest unused cache line gets thrown out.

In the next few iterations I am hoping to make clear what goes wrong here:
The sixth iteration:
elements we need a[0][5], b[5][0]
cache lines to pull in:
line 8: b[5][0], b[5][1], b[5][2], b[5][3]
Note that our cache is now full! We will start replacing the oldest unused line in the next iterations!

The seventh iteration:
elements we need a[0][6], b[6][0]
We find the oldest unused cache line, which is line 2, since we used line 1 up to the fourth iteration, cache lines to pull in:
replace line 2: b[6][0], b[6][1], b[6][2], b[6][3]

The eighth iteration:
elements we need: a[0][7], b[7][0]
We find the oldest unused cache line, which is line 3 and replace it:
replace line 3: b[7][0], b[7][1], b[7][2], b[7][3]

This was the final iteration where i remained constant, so far we have not encountered any real problems, besides having to replace some cache lines, which is not a big deal yet. Now we look at the first few iterations where i=1;

The ninth iteration:
elements we need: a[1][0], b[0][1]
First we find the oldest unused cache line, which is line 4, replace that and then we find the second oldest unused cache line, which is line 1 and replace it:
replace line 4: a[1][0], a[1][1], a[1][2], a[1][3]
replace line 1: b[0][1], b[0][2], b[0][3], b[0][4]
Note here that we used to have the b-value b[0][1] in cache already, but we had to replace it with another value because our cache was full! This means we had to pull the same value from memory into cache twice!

The tenth iteration:
elements we need: a[1][1], b[1][1]
Again, we see the recurring pattern, we already have a[1][1] in the cache in line 4, but we have to replace a cache line to get the b-value in cache. We already had the b-value b[1][1] in cache in the second iteration, but we were forced to replace that line because our cache was full!

The pattern we are seeing here is that because the access of the b-array is structured as b[j][i] instead of b[i][j], we are running into trouble fitting the values into cache properly and creating cache-misses by replacing lines with values we still need later on!

Normally what you would do in a situation like this where the reverse access b[j][i] is causing the cache-misses is called loop-interchange. You can reverse the order of the loops so b[j][i] essentially becomes b[i][j]. However we can not do that in this case, because a[i][j] is also in this loop! Reversing the loops would create the exact same problem, only on the other side, after swapping the loops around we would have a lot of cache-misses on the a-array!

This is where strip mining comes in!
We take the code we had written for our original program and turn it into this:
#define min(a, b) ((a) < (b) ? (a) : (b))

int a[8][8], b[8][8], c[8][8];

for (int x = 0; x < 8; x += 4) {
    for (int y = 0; y < 8; y += 4) {
        for (int i = x; i < min(x + 4, 8); i++) {
            for (int j = y; j < min(y + 4, 8); j++) {
                c[i][j] = a[i][j] * b[j][i];
            }
        }
    }
}
Now I admit that if you look at this the first time, programmatically it does not make a lot of sense. You are introducing two extra for loops and the loops look strange! However from a cache-viewpoint, this makes a lot of sense.

Here is what the iterations in this loop look like:
x=0, y=0, i=0; j=0
cache lines:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

x=0, y=0, i=0; j=1
cache lines:
line 3: b[1][0],b[1][1],b[1][2],b[1][3]

x=0, y=0, i=0; j=2
cache lines:
line 4: b[2][0],b[2][1],b[2][2],b[2][3]

x=0, y=0, i=0; j=3
cache lines:
line 5: b[3][0],b[3][1],b[3][2],b[3][3]

j is no longer smaller than 4 (y+4) so instead we go up one loop and increase i!
x=0, y=0, i=1, j=0
cache lines:
line 6: a[1][0], a[1][1], a[1][2], a[1][3]
Note that here we still have the required b cache line!

x=0, y=0, i=1,j=1
cache lines:
No cache lines required! We have line 6 which contains a[1][1] and line 3 which contains b[1][1]!

x=0, y=0, i=1,j=2
cache lines:
No cache lines required! We have line 6 which contains a[1][2] and line 4 which contains b[2][1]!

Actually if you keep on iterating the loop, you can see that for the first 16 iterations, all we need is 8 cache lines! The lines we are missing now are a[2][0], a[2][1], a[2][2], a[2][3] and a[3][0], a[3][1], a[3][2], a[3][3], but that is it for the first 16 iterations! In fact, all groups of 16 iterations in the cycle only require 8 cache lines to store that particular block!

If we look at the arrays in a visual display, we are actually structuring the loops so it does these blocks of 4 by 4 first, before moving on to the next block! The reason for this is that these blocks happen to fit perfectly in our limited cache size.

In the real world obviously we have larger cache lines and more of them, but this concept applies in exactly the same way to arrays that are much larger than this one. This means that if you are working with very large two-dimensional arrays and traversing those arrays in a manner that is sub-optimal, strip-mining may be the solution for you!

Another time when strip-mining can be very valuable is if you are dealing with a multi-threading situation. Often when multi-threading, you will divide the data up into chunks that will be given to a specific processor. All the processors then run the same program, but on different pieces of data. This is a common way of managing workloads between multiple processors.

A problem that arises from this is cache coherency between processors. If one processor holds a cache line containing a value that is also in another processor's cache, and it updates a value on that line, the whole line has to be updated in the second processor, even though the second processor may not even need the specific value that was updated.

So if processor 1 was working on a[0][1], a[0][2] and processor 2 was working on a[0][3], a[0][4],
the cache-line that processor 1 is working on accidentally also contains a[0][3] and a[0][4] because we can only pull in entire cache-lines at a time. This means that if processor 2 updates a[0][3], processor 1 will require an update, even though it is not even actively working on a[0][3]! It just happened to be on the same cache-line. There is a lot of unnecessary communication between processors this way.

To prevent this unwarranted communication we strip-mine the loop and instead deal out the blocks of 16 values to different processors. This way there is no overlap in the cache-lines and each processor stays out of the other processors way. If you wish to know more about this problem, which is a very common problem in multi-processing, google for false-sharing.

So to recap:
We strip-mine because:
a. If our cache size is small, it prevents a lot of cache-misses.
b. If we are multi-threading, it prevents a lot of communication overhead caused by false sharing.

I am a little bit sad that I could not present you with a view of what gcc does with our code in terms of assembly, I will look into obtaining a gcc build with the necessary libraries and post some more on that if I find one.

by Berwout de Vries Robles ( at March 22, 2016 04:20 AM

March 16, 2016

Vishnu Santhosh Kumar

Final Project – Phase:1.1

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve performance of the software on AArch64 and x86_64 systems.

Phase 1 – Aim: Identifying a Possible Place for Optimization and making a project plan. 


  • Introduction
  • Community and Contributions
  • Building & Analyzing The Software
  • Benchmarking
  • Optimization
  • Project Plan
  • References



LZ4 is a lossless compression algorithm of the LZ77 type. An LZ77 compressor replaces repeated occurrences of data with a reference to a single earlier copy of that data in the uncompressed stream. A match is encoded using a length-distance pair, meaning "each of the next length characters is equal to the characters exactly distance characters behind it in the uncompressed stream". An LZ4 compressed block has the following pattern:

The 4 high bits of the token give the literal length. They can hold values from 0-15; a value of 15 signals that additional bytes follow, and those optional bytes (any number of them) are added to reach the full literal length. Following the token and the optional literal-length bytes come the literals themselves. After that, a 2-byte little-endian value gives the offset, which is the position to be copied from; it ranges from 0 to 65535, though the values 0 and 65535 cannot be coded. The match length is obtained from the lower 4 bits of the token, to which the optional match-length bytes are added when the field equals 15, in the same way the literal length is calculated.

With the offset and the match length, the decoder can proceed to copy the data from the already decoded buffer. After decoding the match length we have reached the end of the compressed sequence, and another one begins.

LZ4 also has another format called the 'frame format'. The format discussed above is called the 'block format'; it is the default and the more efficient of the two. In this project I am dealing with the block-format compression of the LZ4 compressor.

     Community And Contributions

The source code of the software is available on GitHub. The patches are committed to the dev branch, any commit directly to the master branch is not allowed. The owner of the repository will analyze the changes and merges it with the master branch if the patches are relevant. There is a public discussion forum for the discussions on LZ4 compression algorithm & open source implementations. The best way to get in contact with the community is by using discussion forum or the issue tracker in GitHub.

Building  & Analyzing  The Software

The source code of the software was downloaded from GitHub onto the aarchie (AArch64) and xerxes (x86_64) systems. On both systems I used 'unzip' to decompress the package, and the 'make' instruction was enough to build the software. To test it, I compressed many files of varying sizes and decompressed them, then compressed the files with the different command-line arguments given in the documentation to check the performance and speed of the software. Everything went well. The next step is benchmarking.

Before benchmarking, I decided to analyze the code to understand the program well. I was excited to do this, but got rather disappointed once I started: most of the program uses macros, which really confused me. I spent almost two days just looking at the code, trying to understand the flow of the program. The hardest part was working out the purpose of the different directories and their files; the 'Readme' files don't provide much information about that. The two main source directories are 'lib' and 'programs', and I couldn't decide where to start reading, so I contacted the LZ4 community through their Google discussion forum and asked my questions. I got a reply within 3 hours and then started analyzing the software.
Now I have a better idea of the directories, and I am familiar with the algorithm (LZ77) used in the software for compression. From my interaction with the community I understood that the "lib" directory contains LZ4's library (core) part, while the "programs" directory contains the application programs that use the LZ4 library: the LZ4 command-line utility, the I/O submodule, and test tools (e.g. the benchmark).

Benchmarking

The benchmarking was done on both the AArch64 (aarchie) and x86_64 (xerxes) systems. The sample files are taken from the Silesia compression corpus, whose intention is to provide a data set of files covering the typical data types used nowadays. The sizes of the files are between 6 MB and 51 MB; they are of different types and come from several sources. The file I chose for benchmarking is the collected works of Charles Dickens.

(The benchmarking is done with the help of benchmarking tools provided with lz4)



AArch64 compression:

File          Size         Compression rate   Compression speed
dickens       9.72027 MB   -                  -
dickens.lz4   4.23717 MB   56.41%             1.4 MB/s - 153.2 MB/s

*Compression/decompression rate = (change in size / initial size) * 100  {initial size = 9.72027 MB}


AArch64 decompression:

File          Size         Decompression rate   Decompression speed
dickens.lz4   4.23717 MB   -                    -
dickens       9.72027 MB   100%                 2.8 MB/s - 144.6 MB/s




x86_64 compression:

File          Size         Compression rate   Compression speed
dickens       9.72027 MB   -                  -
dickens.lz4   4.23717 MB   56.41%             8.6 MB/s - 685.6 MB/s

x86_64 decompression:

File          Size         Decompression rate   Decompression speed
dickens.lz4   4.23717 MB   -                    -
dickens       9.72027 MB   100%                 21.3 MB/s - 1000.9 MB/s

♣ Final Results

Compression:

System    Size before   Size after   Rate     Speed (average)
AArch64   9.72027 MB    4.23717 MB   56.41%   1.4 MB/s - 153.2 MB/s (77.35 MB/s)
x86_64    9.72027 MB    4.23717 MB   56.41%   8.6 MB/s - 685.6 MB/s (338.5 MB/s)

Decompression:

System    Size before   Size after   Rate   Speed (average)
AArch64   4.23717 MB    9.72027 MB   100%   2.8 MB/s - 144.6 MB/s
x86_64    4.23717 MB    9.72027 MB   100%   21.3 MB/s - 1000.9 MB/s (489.8 MB/s)



There are two possible ways to optimize and improve the performance of the code:

1. Improving the compression rate
2. Improving both the compression rate and speed

The compression rate on both platforms for the 9.72 MB file is 56.41%, which can be improved by changing the makefile, optimizing the existing code, or providing platform-specific code for AArch64 or x86_64 systems.

Project Plan

The project is planned to achieve a performance boost in the LZ4 compression through various stages of optimizations.

Basic Optimizations:

1. Eliminating goto statements: in the '/programs/lz4cli.c' file there are many goto statements. Goto is generally considered poor practice: it reduces the readability of a program and makes its control flow difficult to trace, and any program written in C can be written without it. Here, each goto statement is replaced with the code required after the goto destination.

FileName: /programs/lz4cli.c
Line Number : 360


Before:

case 'V': DISPLAY(WELCOME_MESSAGE); goto _cleanup;   /* Version */
case 'h': usage_advanced(); goto _cleanup;
case 'H': usage_longhelp(); goto _cleanup;

After:

case 'V': DISPLAY(WELCOME_MESSAGE);                  /* Version */
    if (main_pause) waitEnter();
    return operationResult;
case 'h': usage_advanced();
    if (main_pause) waitEnter();
    return operationResult;
case 'H': usage_longhelp();
    if (main_pause) waitEnter();
    return operationResult;

2.Short-Circuit Evaluation : When evaluating a condition with a logical AND (&&), both halves of the condition must be true for the entire condition to be true. Therefore, it is not necessary to evaluate the second half of the condition if the first half is false.


File: /programs/lz4cli.c
Line: 499

Before:

if (!strcmp(input_filename, stdinmark) && IS_CONSOLE(stdin)) badusage();

After:

if (!strcmp(input_filename, stdinmark))
    if (IS_CONSOLE(stdin)) badusage();

This short-circuit evaluation can be done on various ‘if’ conditions in files: “/programs/lz4cli.c”, “programs/bench.c”, “lib/lz4.c”, “lib/xxhash.c”

Higher  Optimizations

  1. The code uses a single thread both to encode and to write the encoded data to the file, so I am implementing multithreading in "programs/lz4io.c" to write the encoded data using a separate thread.
  2. The program uses a hash-map search to find matching literals. I am going to try different search structures, such as a binary search tree or a doubly-linked list, to improve the compression rate.
  3. Using the compression algorithm in "/lib/lz4.c", I compressed a file containing only 39 unique characters and found that its size increased to 148.5% of the original, i.e. roughly a 50% increase caused by the non-matching characters. This happens because every character is encoded irrespective of whether a match is found. I can alter the encoding part of the algorithm in "lz4.c" to encode the literals only when a match is found.



by vishnuhacks at March 16, 2016 11:26 AM

March 15, 2016

Nina Wip

Presentation Review

Today I presented the topics discussed in my previous blog: -ftree-tail-merge and -floop-interchange. I used a PowerPoint to get the information across; the slides are at the end of this post.

Overall I thought the presentation went alright. I was able to show some code examples and explain how the options work. Sadly, I wasn't able to compile my code, for different reasons. First of all, the -floop-interchange option can only be used when GCC is configured with the --with-ppl and --with-cloog options. Apparently the GCC compiler we're using was configured only with --with-cloog, which explains why I wasn't able to spot differences in my assembler code.

For -ftree-tail-merge I do not know what I did wrong while compiling; I just could not get results in the assembler code. A tip I got from a classmate was to add a limit flag that sets the number of lines that have to be identical. I have yet to try this.

I think most of the class understood what I was saying because there weren't many questions and I don't think the options I chose were very difficult.

One thing I couldn't figure out was the difference between -ftree-tail-merge and -fcrossjumping, since they basically produce the same results. Apparently the difference is that -ftree-tail-merge runs on the intermediate language GIMPLE, while -fcrossjumping runs in a different compilation step. In the source code you wouldn't see any difference between these options.

Powerpoint slides

by Nina ( at March 15, 2016 09:16 PM

Presentation Topics

The topics I chose for my presentation are the following optimization options. I'll explain them with code examples rather than assembler examples, since I couldn't compile my code with the right compiler flags.

-floop-interchange

GCC Manual Description:
Perform loop interchange transformations on loops. Interchanging two nested loops switches the inner and outer loops. For example, given a loop like:
DO J = 1, M
  DO I = 1, N
    A(J, I) = A(J, I) * C
  ENDDO
ENDDO

loop interchange transforms the loop as if it were written:

DO I = 1, N
  DO J = 1, M
    A(J, I) = A(J, I) * C
  ENDDO
ENDDO

which can be beneficial when N is larger than the caches, because in Fortran, the elements of an array are stored in memory contiguously by column, and the original loop iterates over rows, potentially creating at each access a cache miss. This optimization applies to all the languages supported by GCC and is not limited to Fortran. To use this code transformation, GCC has to be configured with --with-ppl and --with-cloog to enable the Graphite loop transformation infrastructure.

So after reading the manual and the wiki page, I went right into writing code, which developed into the following:

int main() {
    int array[1000][1000];
    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < 1000; j++) {
            array[j][i] = i * j;
        }
    }
}

When I tried to compile this with the -floop-interchange option, I saw no differences in the assembler code. Later I figured out that the compiler we're using is not configured with the --with-ppl option, so it made sense that there was no assembler to show for it.

After this I decided to just show in code what would change when the option is enabled, which resulted in the following:

int main() {
    int array[1000][1000];
    for (int j = 0; j < 1000; j++) {
        for (int i = 0; i < 1000; i++) {
            array[j][i] = i * j;
        }
    }
}

The only difference is that the loops have been switched, so the code now iterates over j first instead of i.
This matters because C stores two-dimensional arrays in row-major order, while in the original loop the first index to increase is the row index: the access goes from [0][0] to [1][0], landing in a different row each time, so each cache line fetched supplies only a single value before the next row is needed. It is more effective to consume all the values from one row at once, which is why the loops are switched: the access then counts from [0][0] to [0][1], and one cache read supplies 4 values instead of one value per read.

-ftree-tail-merge

GCC Manual Description:
Look for identical code sequences. When found, replace one with a jump to the other. This optimization is known as tail merging or cross jumping. This flag is enabled by default at -O2 and higher. The compilation time in this pass can be limited using max-tail-merge-comparisons parameter and max-tail-merge-iterations parameter.

After researching this topic a lot I ended up writing the following code:
int main(int& a, int& b) {
    int c = 0;
    if (a > 100) {
        a += 100;
        b += 200;
        c = a + b;
    } else {
        a += 100;
        b += 200;
        c = a + b;
    }
    return c;
}

As you can see, there's duplicate code in the if and else. In this simple example the developer should see that this can be changed, but to show you what -ftree-tail-merge does I kept it simple.
Sadly I couldn't get this to compile the right way either, because I probably missed some necessary flags.
After researching some more I found out that -ftree-tail-merge and -fcrossjumping have the same output when it comes to the code, but -ftree-tail-merge is activated in a different intermediate step during compiling, on the GIMPLE representation.
To show you what the difference in assembly code is, I used the -fcrossjumping flag because it gives the same results. I colour aligned the code and assembler to let you know where everything happens and what the difference is between the assembler code.

gcc -O
main(int&, int&):
        movl    (%rdi), %eax
        cmpl    $100, %eax
        jle     .L2

        addl    $100, %eax
        movl    %eax, (%rdi)

        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)

        addl    (%rdi), %eax
        ret

.L2:
        addl    $300, %eax
        movl    %eax, (%rdi)

        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)

        addl    (%rdi), %eax
        ret

gcc -O -fcrossjumping
main(int&, int&):
        movl    (%rdi), %eax
        cmpl    $100, %eax
        jle     .L2

        addl    $100, %eax
        jmp     .L4

.L2:
        addl    $300, %eax
.L4:
        movl    %eax, (%rdi)
        movl    (%rsi), %eax
        addl    $200, %eax
        movl    %eax, (%rsi)

        addl    (%rdi), %eax
        ret

As you can see, the if branch skips over the duplicated b and c operations and jumps straight to .L4, where those operations are done once for both paths. In code it would look like the following:
int main(int& a, int& b) {
    int c = 0;
    if (a > 100) {
        a += 100;
    } else {
        a += 300;
    }
    b += 200;
    c = a + b;

    return c;
}


by Nina at March 15, 2016 08:49 PM

Andrei Topala

GCC Compiler Options: -funsafe-math-optimizations and -ftracer

GCC contains several flags that can be set in order to guide the optimization of a file during compilation.

Let’s look at two of them:

-funsafe-math-optimizations
The gcc manual says that this option “allows optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards.”

Essentially, what this means is that the compiler will take certain shortcuts in calculating the results of floating-point operations, which may result in rounding errors or potentially the program’s malfunctioning (hence the unsafe part of the name).

-funsafe-math-optimizations enables the compiler options -fno-signed-zeros (which ignores the sign of zeroes), -fno-trapping-math (which assumes that the code won’t generate user-visible traps like division by zero, etc), -fassociative-math (which allows re-association of operands) and -freciprocal-math (which allows for the replacing of division with multiplication by reciprocals).

To test this, we’ll use a simple program that adds some fractions together lots of times and prints the result:

#include <stdio.h>
#include <stdlib.h>

int main() {
        double sum = 0.0;
        double d = 0.0;

        int i;

        for (i=1; i<=10000000; i++) {
                d = (float)i/3 + (float)i/7;
                sum += d;
        }

        printf("%.4f\n", sum);

        return 0;
}

We’ll compile it with the -funsafe-math-optimizations flag and without:

[atopala@aarchie presentation]$ gcc -O0 test.c -o test
[atopala@aarchie presentation]$ gcc -O0 -funsafe-math-optimizations test.c -o test-math

I used the time command to compare the execution times of each compiled binary. Here is a screenshot of the results:


The binary that had been compiled with the -funsafe-math-optimizations flag was much faster, but the result was off by quite a bit.

Let’s use objdump -d to examine the disassembly of each binary file.

[atopala@aarchie presentation]$ objdump -d test
00000000004005d0 <main>:
  4005d0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!
  4005d4:       910003fd        mov     x29, sp
  4005d8:       580006c0        ldr     x0, 4006b0 <main+0xe0>
  4005dc:       f90017a0        str     x0, [x29,#40]
  4005e0:       58000680        ldr     x0, 4006b0 <main+0xe0>
  4005e4:       f9000fa0        str     x0, [x29,#24]
  4005e8:       52800020        mov     w0, #0x1                        // #1
  4005ec:       b90027a0        str     w0, [x29,#36]
  4005f0:       14000023        b       40067c <main+0xac>
  4005f4:       b94027a0        ldr     w0, [x29,#36]
  4005f8:       1e220000        scvtf   s0, w0
  4005fc:       1e260001        fmov    w1, s0
  400600:       180005c0        ldr     w0, 4006b8 <main+0xe8>
  400604:       1e270021        fmov    s1, w1
  400608:       1e270000        fmov    s0, w0
  40060c:       1e201821        fdiv    s1, s1, s0
  400610:       1e260021        fmov    w1, s1
  400614:       b94027a0        ldr     w0, [x29,#36]
  400618:       1e220001        scvtf   s1, w0
  40061c:       1e260022        fmov    w2, s1
  400620:       180004e0        ldr     w0, 4006bc <main+0xec>
  400624:       1e270040        fmov    s0, w2
  400628:       1e270001        fmov    s1, w0
  40062c:       1e211800        fdiv    s0, s0, s1
  400630:       1e260000        fmov    w0, s0
  400634:       1e270020        fmov    s0, w1
  400638:       1e270001        fmov    s1, w0
  40063c:       1e212800        fadd    s0, s0, s1
  400640:       1e260000        fmov    w0, s0
  400644:       1e270000        fmov    s0, w0
  400648:       1e22c000        fcvt    d0, s0
  40064c:       9e660000        fmov    x0, d0
  400650:       f9000fa0        str     x0, [x29,#24]
  400654:       f94017a1        ldr     x1, [x29,#40]
  400658:       f9400fa0        ldr     x0, [x29,#24]
  40065c:       9e670021        fmov    d1, x1
  400660:       9e670000        fmov    d0, x0
  400664:       1e602821        fadd    d1, d1, d0
  400668:       9e660020        fmov    x0, d1
  40066c:       f90017a0        str     x0, [x29,#40]
  400670:       b94027a0        ldr     w0, [x29,#36]
  400674:       11000400        add     w0, w0, #0x1
  400678:       b90027a0        str     w0, [x29,#36]
  40067c:       b94027a1        ldr     w1, [x29,#36]
  400680:       5292d000        mov     w0, #0x9680                     // #38528
  400684:       72a01300        movk    w0, #0x98, lsl #16
  400688:       6b00003f        cmp     w1, w0
  40068c:       54fffb4d        b.le    4005f4 <main+0x24>
  400690:       90000000        adrp    x0, 400000 <_init-0x3f0>
  400694:       911d8000        add     x0, x0, #0x760
  400698:       fd4017a0        ldr     d0, [x29,#40]
  40069c:       97ffff71        bl      400460 <printf@plt>
  4006a0:       52800000        mov     w0, #0x0                        // #0
  4006a4:       a8c37bfd        ldp     x29, x30, [sp],#48
  4006a8:       d65f03c0        ret
  4006ac:       d503201f        nop
  4006b8:       40400000        .word   0x40400000
  4006bc:       40e00000        .word   0x40e00000
[atopala@aarchie presentation]$ objdump -d test-math
00000000004005d0 <main>:
  4005d0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!
  4005d4:       910003fd        mov     x29, sp
  4005d8:       58000540        ldr     x0, 400680 <main+0xb0>
  4005dc:       f90017a0        str     x0, [x29,#40]
  4005e0:       58000500        ldr     x0, 400680 <main+0xb0>
  4005e4:       f9000fa0        str     x0, [x29,#24]
  4005e8:       52800020        mov     w0, #0x1                        // #1
  4005ec:       b90027a0        str     w0, [x29,#36]
  4005f0:       14000017        b       40064c <main+0x7c>
  4005f4:       b94027a0        ldr     w0, [x29,#36]
  4005f8:       1e220000        scvtf   s0, w0
  4005fc:       1e260001        fmov    w1, s0
  400600:       180003e0        ldr     w0, 40067c <main+0xac>
  400604:       1e270021        fmov    s1, w1
  400608:       1e270000        fmov    s0, w0
  40060c:       1e200821        fmul    s1, s1, s0
  400610:       1e260020        fmov    w0, s1
  400614:       1e270001        fmov    s1, w0
  400618:       1e22c021        fcvt    d1, s1
  40061c:       9e660020        fmov    x0, d1
  400620:       f9000fa0        str     x0, [x29,#24]
  400624:       f94017a1        ldr     x1, [x29,#40]
  400628:       f9400fa0        ldr     x0, [x29,#24]
  40062c:       9e670020        fmov    d0, x1
  400630:       9e670001        fmov    d1, x0
  400634:       1e612800        fadd    d0, d0, d1
  400638:       9e660000        fmov    x0, d0
  40063c:       f90017a0        str     x0, [x29,#40]
  400640:       b94027a0        ldr     w0, [x29,#36]
  400644:       11000400        add     w0, w0, #0x1
  400648:       b90027a0        str     w0, [x29,#36]
  40064c:       b94027a1        ldr     w1, [x29,#36]
  400650:       5292d000        mov     w0, #0x9680                     // #38528
  400654:       72a01300        movk    w0, #0x98, lsl #16
  400658:       6b00003f        cmp     w1, w0
  40065c:       54fffccd        b.le    4005f4 <main+0x24>
  400660:       90000000        adrp    x0, 400000 <_init-0x3f0>
  400664:       911ca000        add     x0, x0, #0x728
  400668:       fd4017a0        ldr     d0, [x29,#40]
  40066c:       97ffff7d        bl      400460 <printf@plt>
  400670:       52800000        mov     w0, #0x0                        // #0
  400674:       a8c37bfd        ldp     x29, x30, [sp],#48
  400678:       d65f03c0        ret
  40067c:       3ef3cf3d        .word   0x3ef3cf3d

The only thing that’s changed is that the original equation, which contained two division operations, has been replaced with an equation containing one multiplication operation.
Due to the nature of floating-point arithmetic, the multiplier in the second binary (the word 0x3ef3cf3d, the single-precision approximation of 1/3 + 1/7 = 10/21) is only an approximation of the exact algebraic value; there is a tiny difference between the result of the first equation and the result of the second, and, as evidenced by our program, this tiny difference can add up.

The gcc manual says of -funsafe-math-optimizations that “this option is not turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.”

We’ve seen with our program that the code produced with this compiler option can be significantly faster. If precision is not an issue and the data being processed is either small enough or distributed in such a way that large variations in the result don’t happen (as was the case in our program, where we just continually added to a number 10 million times), it might be okay to use -funsafe-math-optimizations for that extra speed boost–but, again, remember that it is unsafe, and might give unexpected results.


-ftracer, according to the gcc manual, “performs tail duplication to enlarge superblock size. This transformation simplifies the control flow of the function allowing other optimizations to do better job.”
Superblocks are used during trace scheduling, which is an optimization technique that predicts frequent execution paths in a program and tries to make it so that the program takes advantage of parallel processing by queuing up blocks of instructions that don’t depend on each other (ie, one block can begin to be executed without waiting for a branch or result from another). A superblock is a set of instructions in which control can enter only at the top–you can’t begin executing instructions in the middle of a superblock. The larger the superblock, the less time the processor has to wait for the completion of other instructions before proceeding.
Tail duplication is the process of copying common parts of code in order to create larger superblocks. For example, consider the following program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
        int x = rand() % 100;
        int y = rand() % 100;

        if (x == 10) {
        }
        else {
        }
        x += 10;
        y += 10;
        printf("%d, %d\n", x, y);

        return 0;
}

What tail duplication should do is copy the printf and x+=10 and y+=10 statements that are outside the conditional statement so that they can be placed inside both conditional branches. This creates larger blocks of code that can, ideally, be executed in parallel.

As the gcc manual says, this is especially useful when performing additional optimizations because it allows the compiler to move code blocks around more freely.

Compiling the program with only the -ftracer option doesn’t change it–the binaries produced by gcc test.c and gcc -ftracer test.c are identical. Compiling with -O2 and -ftracer, however, does. In the objdump listing, the file that had been compiled with -ftracer has an extra call to printf in its main section. Moreover, the program has extra instructions inside what would be the section of code that executes when x == 10. This seems to imply that tail duplication has taken place.

I did try to build longer, more complicated programs and see what -ftracer does in their compilation, but the results were inconclusive to say the least. -ftracer seemed not to kick in except at the higher levels of optimization, and those mangled the machine code beyond my ability to analyze it well enough to say anything important about it 😦 But the theory, I think, makes sense, and I imagine that -ftracer could produce faster code (certainly not smaller, on account of the tail duplication) when dealing with much larger and more complicated source files.

Update 18/03/16: So I presented my findings to the class. The professor said that -ftracer is one of the optimization flags that actually make the program worse, but do so in such a way that the intermediate representation is more compliant with other optimizations; so, by itself, it is not very useful, but with further optimizations it may make a significant difference.
It was also suggested that tail merging, which occurs by default at certain optimization levels, might have been thwarting my attempt to obtain a useful result from -ftracer. This, I think, is why it is so difficult to analyze optimized machine code: you’ve not really a clear idea of which optimizations have been applied, and where, and why (also when and how). Anyway, I think (hope) most people understood if not the exact intricacies of the flag’s effects then at least the basic idea behind it.

Here is a link to the presentation slides:
Presentation Slides

Further Reading

GCC documentation – optimizations
Semantics of Floating Point Math in GCC
Floating Point Representation
Superblock Scheduling and Tail Duplication
The Superblock: An Effective Technique for VLIW and Superscalar Computation
VLIW Processors and Trace Scheduling: Super Block Scheduling

by andrei600 at March 15, 2016 11:18 AM

Giuseppe Ranieri

-fipa-pure-const & -fdse Compiler Options


The -fipa-pure-const option allows the compiler to "Discover which functions are pure or constant. Enabled by default at -O and higher."

To understand pure functions there are two requirements:
  1. The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information or state that may change while program execution proceeds or between different executions of the program, nor can it depend on any external input from I/O devices.
  2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices.
Examples of pure functions are:

  • sin(x), returning the sine of a number x
  • length(s), returning the size of a string s
Impure functions, by contrast, may return different values even when called under the same conditions: for example, a function returning the current day of the week, or random().


For example, take the loop

for(int i = 0; i < fibonacci(100); ++i)

with -fipa-pure-const the loop effectively becomes

for (int i = 0, end = fibonacci(100); i < end; ++i)

since fibonacci() is pure and therefore doesn't need to be calculated more than once.


The -fdse option allows the compiler to "Perform dead store elimination (DSE) on RTL. Enabled by default at -O and higher."

A dead store is a value assigned to a variable that is never read by any subsequent instruction. Dead stores waste processor time and memory.


function func(a, b) {
    var x;
    var i = 300;
    while (i--) {
        x = a + b; // dead store
    }
}

Removing dead stores can have unintended consequences, however: a store whose whole purpose is to overwrite data, such as zeroing out a buffer that held a password, may itself look dead to the compiler and be eliminated.


by JoeyRanieri at March 15, 2016 01:15 AM

Nina Wip

Lab 4 - Compiled C

This lab was about compiling code with different arguments. We'll be discussing 6 changes, each of which was investigated by a different group. My group did change 5, so that one is more extensive.

Starting code
We used the simple hello world code for this lab:
#include <stdio.h>

int main() {
    printf("Hello World!\n");
}

We compiled this code using the GCC compiler with the following arguments.
-g               # enable debugging information
-O0 # do not optimize (that's a capital letter and then the digit zero)
-fno-builtin # do not use builtin function optimizations

In the objdump you could see that <main> calls to <printf> to print Hello World.

1. Add the compiler option -static. Note and explain the change in size, section headers, and the function call.
When you add the option -static, the compiler links the entire library into the executable instead of loading it dynamically at run time, even though it only needs one function from it. This causes the file size to be much larger.

2. Remove the compiler option -fno-builtin. Note and explain the change in the function call.
The call to <printf> is replaced with a call to <puts>: since the format string contains no conversion specifiers and ends in a newline, the two calls are equivalent, and puts is cheaper.

3. Remove the compiler option -g. Note and explain the change in size, section headers, and disassembly output.
When you enter the command objdump --source you normally get the written code together with the assembler code. This is handy when you want to see which assembler code belongs to which C code. When you remove the option -g you remove all the debug information. This also means when you enter the command objdump --source, you do not get the C code in the file because this is seen as debug information.

4. Add additional arguments to the printf() function in your program. Note which register each argument is placed in. 
In the AArch64 architecture, the format string is passed in x0 and the next 7 arguments are passed in registers; the overflow gets pushed on the stack.
In the Intel (x86_64) architecture, only 5 arguments after the format string get stored in registers, and the overflow also gets pushed on the stack.

5. Move the printf() call to a separate function named output(), and call that function from main(). Explain the changes in the object code.
Our task was to move the printf() call to a separate function named output().
#include <stdio.h>

void output(){
    printf("Hello World!\n");
}

int main() {
    output();
}

We compiled this with the same arguments stated above. The objdump now shows:
<main> calls <output> and that calls <printf>
<output> is the same as <main> without our output() function.

We now compiled the following code to inspect differences when you add parameters.
#include <stdio.h>

void output(char a[]){
    printf("%s", a);
}

int main() {
    output("Hello World!\n");
}

In the objdump you see that the given parameter is passed in the x0 register.
In <main>, a pointer to the string is loaded into x0 before calling <output>.
<output> then moves that pointer from x0 into x1 so that it becomes argument 2 of printf, with the format string as argument 1 in x0.

File size increased slightly with each change.

6. Remove -O0 and add -O3 to the gcc options. Note and explain the difference in the compiled code.
These options have to do with optimizations. The compiler gets 'smarter' and deletes lines of code it doesn't need. For example, with -O0 it sets up the stack frame but never uses the stack after that; with -O3 these lines are deleted. Another example is that after <printf> is done, it doesn't return to <main>: the call becomes a tail call, so execution goes back to whatever called <main>.

by Nina at March 15, 2016 12:12 AM

February 29, 2016

Tom Ng

Lab 5: Lookups

Lab 5 was about figuring out how to scale sound samples. Actual sound samples were not used in the lab, but the idea is the same in that an array of values is much like the sound samples in a sound file. Calculating a scaled sample involves taking the base volume and multiplying it by the scale factor (such as 0.6). There were two approaches for scaling all the values: calculating each value on the fly, or generating a lookup table that the program can refer to for each sample.

Calculating the value on the fly is the most natural approach. Whenever the program wants the volume adjusted by a certain amount, it just takes the value and multiplies it by the scale. Given 44100 samples it ran almost instantly, at approximately 0.30 seconds, and took approximately 30 seconds with 50000000 samples. It makes you appreciate that a codec for whatever sound player you might be using can do all this and your device likely still runs smoothly. Well actually, maybe it can do it faster, if the idea of a lookup table works out.

On paper, I did not see how a lookup table would make scaling a volume faster without some kind of heavy memory overhead. The algorithm called for rebuilding the lookup table only when a different volume factor is observed. I’m not sure whether that means the base volume changed or the scale itself changed, but either of those changes would require recalculating the lookup table. The volume samples are limited to -32768 to 32767 for a total of 65536 possible samples, and the scale could vary from 0.000 to 1.000. This meant that every time the volume factor changed, the program had to calculate 65536 values for the lookup table and would end up using only one of those values.

For my implementation, I used a constant scale for the entire test. This meant that the table only had to be built once and could be reused. At 50000000 samples the lookup implementation took only 20 seconds, which was a 10-second improvement over multiplying by the scale each time. At smaller sample sizes, the lookup table algorithm was only negligibly slower, by hundredths of a second. The lookup table algorithm is heavier on memory use, so it’s better suited to situations where memory is less of an issue. Another important thing to note is that the lookup table test assumed a constant scale factor. If the scale factor actually changed very often, it’s possible that this algorithm would be inferior to simply multiplying by the scale every time, since generating the table is costly. Otherwise the lookup table algorithm should probably be your go-to method if you’re going to scale volumes.

by tng23spo600 at February 29, 2016 04:53 AM

Andrei Topala

Vectorization Lab

Edit: This should be all okay now. (Sorry for posting this blog post incomplete–I had to watch Leo win his Oscar).

Here we’ll look at single instruction/multiple data (SIMD) vectorization.

A vector is a series of equal-length data elements on which operations can be performed. SIMD instructions, as the name suggests, are used to perform an operation in parallel on multiple operands. Vectorization, then, is the process of converting parts of a program to a vector implementation so that SIMD instructions can be used in order to minimize processing time. A compiler can perform auto-vectorization during a source program’s compilation in order to optimize the resultant machine code so that it uses SIMD instructions, which can, depending on the size of the data being processed, appreciably affect the running time of the program.

Consider, for instance, the following piece of code:

for (i=0; i<1000; i++) {
        arrSum[i] = arr1[i] + arr2[i];
}
That loop has potential for vectorization. Instead of performing the addition on each pair of elements one by one, the compiler can use a vector implementation to process multiple pairs of elements at once so that the loop runs fewer times.

To invoke auto-vectorization, we need to give gcc the -O3 option, which tells it to go all out with optimization.

gcc -g -O3 lab6.c -o lab6

Here is the source code itself:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
	int arr1[1000];
	int arr2[1000];
	int arrSum[1000];
	long int sum = 0;

	int i;

	srand(time(NULL));

	for (i=0; i<1000; i++) {
		// The maximum value of rand() is the same as the capacity of int, so we halve the random number to avoid overflowing during addition
		arr1[i] = rand() / 2;
		arr2[i] = rand() / 2;
	}

	for (i=0; i<1000; i++) {
		arrSum[i] = arr1[i] + arr2[i];
	}

	for (i=0; i<1000; i++) {
		sum += arrSum[i];
	}
	printf("%ld\n", sum);

	return 0;
}

Now let’s look at the machine code disassembly. I’ve extracted the main section of the code and annotated it to explain what is happening as I best could:

0000000000400550 <main>:
// Here we reserve space on the stack for local variables
  400550:	d1400bff 	sub	sp, sp, #0x2, lsl #12
  400554:	d13b83ff 	sub	sp, sp, #0xee0
// Here we are storing a pair of double words to a location on the stack; these two registers are to be used as the frame and link registers
  400558:	a9bd7bfd 	stp	x29, x30, [sp,#-48]!
// Here we move the current stack pointer to the frame register
  40055c:	910003fd 	mov	x29, sp
  400560:	d2800000 	mov	x0, #0x0                   	// #0
// Again storing a pair of double words
  400564:	a90153f3 	stp	x19, x20, [sp,#16]
// Store now a single register to a location on the stack
  400568:	f90013f5 	str	x21, [sp,#32]
// Here the program branches to the time and srand functions to initialize the random seed
  40056c:	97ffffdd 	bl	4004e0 <time@plt>
  400570:	97fffff0 	bl	400530 <srand@plt>
// Here the first loop begins, in which every element of the two arrays is set to a random number
  400574:	d281fa01 	mov	x1, #0xfd0                 	// #4048
  400578:	d2800013 	mov	x19, #0x0                   	// #0
  40057c:	9100c3b5 	add	x21, x29, #0x30
  400580:	8b1d0034 	add	x20, x1, x29
  400584:	97ffffdf 	bl	400500 <rand@plt>
  400588:	0b407c00 	add	w0, w0, w0, lsr #31
  40058c:	13017c00 	asr	w0, w0, #1
  400590:	b8356a60 	str	w0, [x19,x21]
  400594:	97ffffdb 	bl	400500 <rand@plt>
  400598:	0b407c00 	add	w0, w0, w0, lsr #31
  40059c:	13017c00 	asr	w0, w0, #1
  4005a0:	b8346a60 	str	w0, [x19,x20]
// In the first loop, the counter is incremented by 4 and compared to 4000, so the loop runs 1000 times
  4005a4:	91001273 	add	x19, x19, #0x4
  4005a8:	f13e827f 	cmp	x19, #0xfa0
  4005ac:	54fffec1 	b.ne	400584 <main+0x34>
  4005b0:	d2800000 	mov	x0, #0x0                   	// #0
// This is the entry point for the second loop, after the counter is set to 0
// We add 48 to r29 (which contains the address of the stack pointer where our arrays begin) and store that in a register to be used for arguments and return values
  4005b4:	9100c3a3 	add	x3, x29, #0x30
// Then we add the loop counter to the register and store that in another register
  4005b8:	8b000062 	add	x2, x3, x0
// Now we add 4048 to the frame register and store it in r3
// We are effectively jumping over the first 4000 bytes (ie the first 1000-element int array) in the stack
  4005bc:	913f43a3 	add	x3, x29, #0xfd0
// And again we add the loop counter to r3 and store it this time in r1
  4005c0:	8b000061 	add	x1, x3, x0
// r1 and r2, I think, are used to move around the memory space to read from and write to the three arrays we are using; essentially we're using r0, which is the loop counter, as the offset from the two starting locations of the two arrays we are using
// Here we load 4 single words from the memory addressed by r2
  4005c4:	4c407841 	ld1	{v1.4s}, [x2]
// Now we're moving 8048 to r2, which we will use to jump ahead and find the third array
  4005c8:	d283ee02 	mov	x2, #0x1f70                	// #8048
// Now we load 4 single words from the memory addressed by r1
  4005cc:	4c407820 	ld1	{v0.4s}, [x1]
// Then add the stack register's value to r2 (effectively setting r2 to the memory address of the third array)...
  4005d0:	8b1d0042 	add	x2, x2, x29
// ...and the loop counter to r2 and store it in r1 (so, we store in r1 the address of the nth element in the third array, where n is the loop counter)
  4005d4:	8b000041 	add	x1, x2, x0
// This is what it's all for: vector addition
  4005d8:	4ea08420 	add	v0.4s, v1.4s, v0.4s
// In the second loop, the counter is incremented by 16 and compared to 4000, so the loop runs only 250 times (and since we are processing 4 numbers each iteration, that's 1000 numbers in total)
  4005dc:	91004000 	add	x0, x0, #0x10
// Now we store the 4 sums to the memory addressed by r1 (that is, to the four consecutive elements starting from the nth in the third array, where n is the loop counter)
  4005e0:	4c007820 	st1	{v0.4s}, [x1]
// This is where the loop counter is compared to 4000
  4005e4:	f13e801f 	cmp	x0, #0xfa0
  4005e8:	54fffe61 	b.ne	4005b4 <main+0x64>
// We will move ahead another 4000 bytes, so we pass over the third array
  4005ec:	d285e201 	mov	x1, #0x2f10                	// #12048
// r2, at this point, should, I think, hold the memory address of the start of the third array (ie the array whose elements we are now summing)
// We move that value to r0
  4005f0:	aa0203e0 	mov	x0, x2
// We add the frame register's value to r1, which will now point to the empty memory space after the three arrays
  4005f4:	8b1d0021 	add	x1, x1, x29
// Move immediate: each of the four single words here are set to 0
  4005f8:	4f000401 	movi	v1.4s, #0x0
// r0 points to the third array, so we load four single words of it; then r0 is incremented by 16 bytes (which is the total number of bytes we're working with in this instruction (4 single words x 32 bits per word = 128 bits = 16 bytes))
// This is also the point to which the loop will return on successive iterations
  4005fc:	4cdf7800 	ld1	{v0.4s}, [x0], #16
// Signed integer lengthen
// Two of the single words in vector 0 are made double words and put into vector 2
  400600:	0f20a402 	sxtl	v2.2d, v0.2s
// Then the two double words in vector 2 are added to the two double words in vector 1
  400604:	4ee18441 	add	v1.2d, v2.2d, v1.2d
// Now the second part of vector 0 is made into double words
  400608:	4f20a400 	sxtl2	v0.2d, v0.4s
// Here we check to see if r0 points to the same memory address as r1 (that is, if we've gone through all the elements of the third array and reached the empty space beyond)
  40060c:	eb01001f 	cmp	x0, x1
// And now vector 0 (which contains what were the latter two single words of vector 0 and are now two double words) is added to vector 1
// At the end of the loop, vector 1 contains two numbers which, when added together, will equal the total sum of all the numbers in the third array
  400610:	4ee18401 	add	v1.2d, v0.2d, v1.2d
  400614:	54ffff41 	b.ne	4005fc <main+0xac>
// Scalar reduce; we add the two numbers in vector 1 together into a double-precision register, which is the same register and contains the sum of the two doublewords it contained before
  400618:	5ef1b821 	addp	d1, v1.2d
  40061c:	90000000 	adrp	x0, 400000 <_init-0x4a0>
// Move the first and second 64-bit elements from vector 1 to two separate registers
// This might be so that they can be used as arguments for printf?
  400620:	4e083c21 	mov	x1, v1.d[0]
  400624:	4e183c22 	mov	x2, v1.d[1]
  400628:	91214000 	add	x0, x0, #0x850
  40062c:	97ffffc5 	bl	400540 <printf@plt>
  400630:	a94153f3 	ldp	x19, x20, [sp,#16]
  400634:	f94013f5 	ldr	x21, [sp,#32]
  400638:	a8c37bfd 	ldp	x29, x30, [sp],#48
  40063c:	52800000 	mov	w0, #0x0                   	// #0
  400640:	913b83ff 	add	sp, sp, #0xee0
  400644:	91400bff 	add	sp, sp, #0x2, lsl #12
  400648:	d65f03c0 	ret
  40064c:	00000000 	.inst	0x00000000 ; undefined

The important instructions for vector addition are the following:

LD1 {Vt.<T>}, [base]
ADD Vd.<T>, Vn.<T>, Vm.<T>
ST1 {Vt.<T>}, [base]

These instructions load into the vector register multiple 1-element structures from the memory addressed by the base register; add the corresponding elements of the two vectors and store the sum in another vector; and store multiple 1-element elements from the vector to the memory addressed by the base register.

Then, to add together all the numbers in an array to a single long int, further instructions are used:

SXTL Vd.<Td>, Vn.<Ts>
SXTL2 Vd.<Td>, Vn.<Ts>
ADDP Dd, Vn.2D

These take the signed integers in the first half of the source vector, lengthen them (eg converts two single words into two double words), and store them in the destination vector; do the same to the second half of the source vector; and add together the two double-word elements in the source vector and store them in the destination register.

If our Algorithm Selection Lab were to use SIMD instructions, it would probably use the MUL instruction, which can multiply the integers in two vectors. However, the second operand we need there is a floating-point number (that is, the volume scaling factor); vector multiplication can’t be performed on vectors with elements of different types, so we would need to convert the floating-point number to an integer in order to use it in a vector multiplication. This, I think, would be done with the FCVTxS instruction, which converts a floating-point vector to a signed integer vector of the same size. Then we could use the SXTL instructions to lengthen the half-word (16-bit) integer vector so that the elements are the same size as the integer vector containing the word (32-bit) integers converted from the floating point. Then we could finally use the MUL instruction to multiply them together. (I am not sure how the 32-bit integer converted from the floating point of value ranging from 0.000 to 1.000 would look, so I don’t know if there is the potential for overflowing past the 32-bit limit when we multiply, but since the other operand was initially only 16 bits it should be okay, I think).

Analyzing the disassembly listing was pretty difficult, mainly because I am still at that stage where the assembler instructions don’t really mean anything to me in and of themselves, so, as with a foreign language of which you have only a tenuous grasp, I have to translate each instruction in my head, and sometimes reach for a dictionary (the ARMv8 Instruction Set Overview seems esoteric and scary at first glance, but it is very helpful). Reading a disassembly listing is like being lost in a strange land and getting directions from someone who speaks only that language, and this person insists on giving very, very detailed directions.
Another challenge was keeping track of what is stored in each register. Since most of the registers are general-purpose, and all you have to go on is the register’s number and the width of what’s inside, it’s easy to get them all mixed up. Kind of makes you appreciate the ability to give variables really long and descriptive names in higher-level programming languages.

Anyway, auto-vectorization is neat and it’s easy to see how SIMD could be very useful when processing operations.

by andrei600 at February 29, 2016 04:44 AM