Planet CDOT

May 27, 2016

Laily Ajellu

Creating Github Markdown pages - BBB Documentation

My newest challenge was to create markdown pages detailing HTML5 Coding Practices and HTML5 Project Structure. Most of the content for the documentation was provided by Daniel Perrone (really smart guy! Click his name to check out his GitHub).


My first challenge was to install Ruby 2.2.4p230 on my existing VM (using VMware). The problem was that certain components of BigBlueButton are dependent on Ruby, and updating Ruby meant uninstalling using:
sudo apt-get purge ruby

(Aside: if you know how to update to Ruby 2.2.4 on Ubuntu 14.04 without uninstalling, please post a comment below!)

When I ran the command, I was prompted to remove the dependent BBB components, which I didn't want to do. Solving this problem was probably the most time-consuming part of the whole task.


So after looking for ways to update vs. uninstall (and failing), I decided to create a VM from a backup. This backup had Ubuntu but no BBB.

Once Ruby was installed I used
bundle install
and finally
jekyll serve -H
to have it running on something other than my localhost.

At this point I was able to preview my changes, so all I needed to do was convert the Google Doc to Github Markdown.
First I looked for automatic converters to speed up the process and found 2 that work together:
1. Writage
2. Pandoc

The syntax wasn't quite right, but it did format some things nicely for me like bullets, spaces, and special characters.

Committing, Pushing and Merging

Once the syntax was ready, I created symbolic links to both new pages and followed these steps to update them on the official BBB GitHub:
1. git add ...
2. git config --global
3. git config --global
4. git commit
5. git push

Now that the changes were pushed to my own fork, it was time to submit a pull request (Compare and Pull Request Button) and finally merge the changes.

And voila! My first commit. What an amazing day!

by Laily Ajellu at May 27, 2016 09:03 PM

May 26, 2016

Kyle Klerks

ABCs of Express.js APIs

Now that the pieces of the puzzle are starting to be finished, it’s time to start putting some parts together. I hadn’t previously had any experience working with express.js; that portion of the project was mostly left to my colleague. That said, it didn’t take long to wrap my head around the concept and get to work gluing the code pieces together under the express framework.

An example of what an express function might look like is this:

app.get('/api/', function(req, res) {
  // route handler body goes here
});

So essentially what’s going on here is that a client trying to access the server by going to the URL “some.random.url/api/?id=6” would be routed to this piece of code, which then executes the callback function(req, res).

app.get('/api/', function(req, res) {
  res.send('The ID you entered: ' +;
});
The above code will return “The ID you entered: 6” to the client. Let’s look at how that happens, piece by piece. The “req” variable means “request”. This represents any data coming from the user. In this case, “req” is holding any additional information tacked onto the URL after ‘/api/’. It works much the same as $_GET in PHP does, in that you can reference the variable directly and get the value the client included in the request: in this case, ?id=6. The “res” variable stands for “response”. This is what you return to the browser. By calling res.send, you set the body of the response to the client to “The ID you entered: 6”.

Alternatively, the function might take the form of:

app.get('/api/:id/', function(req, res) {
  res.send('The ID you entered: ' +;
});
In this case, we get the same result, except instead of the client saying “some.random.url/api/?id=6”, the client says “some.random.url/api/6”. The only difference server-side in the actual implementation is saying instead of

by kyleklerks at May 26, 2016 08:51 PM

May 25, 2016

Laily Ajellu

Differences between EcmaScript5 and EcmaScript6

I have been reading the text "Exploring ES6" by Dr. Axel Rauschmayer and have found a lot of useful new syntax; here is a sampling:
Source: Exploring ES6

Scoping of 'this'

In ES5:
When you have nested functions, 'this' inside the inner function refers to the inner function's own receiver (often the global object), not the object you were working with in the outer scope.
But sometimes we want 'this' to refer to the outer scope, so we assign it to a variable first: var _this = this.


In ES6:
Arrow functions don't shadow 'this' in their scopes, so you don't have to assign const _this = this.
You can simply use 'this' on its own.

Source: Exploring ES6 - Ch. 4.4

Concat using spread Operator

In ES5:
We have to use the concat function on the first array and pass the rest as arguments.

But in ES6, we can use the spread operator:

Source: Exploring ES6 - Ch. 4.11
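For example (a small sketch of both styles):

```javascript
var first = [1, 2];
var rest = [3, 4];

// ES5: call concat on the first array and pass the rest as an argument
var combinedES5 = first.concat(rest);

// ES6: the spread operator expands each array in place
const combinedES6 = [...first, ...rest];

console.log(combinedES5); // [ 1, 2, 3, 4 ]
console.log(combinedES6); // [ 1, 2, 3, 4 ]
```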

C++ style methods

In ES5:
Methods are just properties, but their value is a function.

In ES6:
Functions defined within an object literal can be written with a shorthand method syntax.

Source: Exploring ES6 - Ch. 4.12

Accessing a Base Class

In ES5:
You have to use the keyword 'prototype' on the Base class in order to access its methods, and the method 'call' on the Base class's constructor to set its member variables:

In ES6:
All you need to do is use the keyword 'super' for both cases:

Reasons for Reading this Text

I'm currently working on a tutorial by Meteor which uses the React framework and ES6.

You can find it here: Meteor using React Tutorial

And although it's great, the learning curve is very steep for those who only know Javascript, not just because of the concepts, but because of the cool new syntax that's been added in the new release.
My hope is to bridge the gap using the text.

by Laily Ajellu at May 25, 2016 07:55 PM

Matthew Marangoni

CDOT update 05/25/2016

During my first week at CDOT, I completed various setup tasks. Initially, I had to salvage computer parts from other unused machines and assemble my own. This was fairly easy; the only difficulty was finding an adequate graphics card, as most machines were already missing them. Once I had my machine built, I had to format the HDDs and install Windows 7 (this took a considerable amount of time, as there seemed to be endless Windows updates, driver updates, and reboots), but it did eventually finish the following morning.

Once complete, I was able to begin the initial setup of my VM environment, which involved tweaking a few NAT configurations and then installing Ubuntu 14.04 in my VM. All of this seemed to work just fine, so I proceeded with the BigBlueButton install. The install went reasonably smoothly but did take a day or so to complete. I followed the instructions from the BBB docs step by step, but there were a few slightly outdated steps which gave me unexpected results that were known to my colleagues but not yet documented. Some portions of the BBB install took longer than others, so I used this time to review other documents (BBB HTML5 design, development, etc.) to better familiarize myself with the BigBlueButton project. Following this, I proceeded to set up the BBB developer environment with no issues.

Currently I am working on a few tutorials to get up to speed with the BBB project and working environments. I just completed the Meteor tutorial and am about to begin reading over a bit of the Meteor documentation. Later I will move onto the REACT, ES6, and Mongo tutorials and documentation.

by Matthew Marangoni at May 25, 2016 03:25 PM

May 23, 2016

Andrew Smith

Test/Exam clock

Sometimes I don’t understand the world. One of the things I couldn’t understand is why it is so difficult to find an online clock that is simple and can be resized. For years I would google “clock” and variations of that, and all I would get back was crappy small digital clocks on pages 80%-full of ads.

Why is this so complicated? I don’t know. Mozilla had a Canvas demo of how to make a clock, but even that one wasn’t very good: it was not resizeable. So I made my own, and our webmaster added a bunch of JavaScript library garbage, because I guess that’s what web programmers do these days by default.

Anyway, it still works and here it is: Seneca ICT Exam Clock

by Andrew Smith at May 23, 2016 05:11 PM

May 19, 2016

Kyle Klerks

Node.js & MySQL – Prepared Statements

What an exciting first few weeks this has been so far. I’m already learning so much about JavaScript and node.js through working with my project team.

The task I was most recently tackling was writing an SQL query generator that receives a JSON object filled with tokens and then builds a query around those parameters. Since our code is meant to run on a node.js server, we’re using the mysql module located here.

Initially, when I wrote this function, all it did was parse the data and spit out a string meant to be run as a query. This was an oversight, because the proper way to do it when security is a concern is to use prepared statements. The short version of prepared statements is that you use a ? in place of what would normally be filled in by a parameter. So, instead of:

"SELECT " + field + " FROM " + table + " WHERE id = " + id;

You prepare it as:

"SELECT ?? FROM ?? WHERE id = ?"

Essentially, the function was returning something like:

"SELECT column1 FROM exampleTable WHERE id = 1"

It SHOULD be formatted like this:

"SELECT ? FROM ? WHERE id = ?"

So, after some tweaking, I got my function to return an array with the query in the first index, followed by a list of parameters to fill in the gaps. I ended up getting objects like this:

["SELECT ? FROM ? WHERE id = ?", "column1", "exampleTable", 1]

Once you’ve got an object like this, you can:

connection.query("SELECT ? FROM ? WHERE id = ?", ["column1", "exampleTable", 1], function(error, results, fields) { ….. });

The query function in this case has 3 parameters: the first being the statement to be prepared, the second an array of parameters to fill in your ?’s, and the third a callback function for handling the results of the query.

I tested the above statement and… oh fun, an error. “‘exampleTable’ is invalid syntax”, the log tells me. What? How is that possible? I log into the database directly and test the statement. No problems.

After a moment of disbelief and some more tests, I decided to give the documentation for the mysql module another once-over. It didn’t take long to realize my mistake. Glancing over the prepared statements again, I noticed this:

var sql = 'SELECT * FROM ?? WHERE ?? = ?';

Notice that some of the parameters are represented as ?? and not ?. This is because ?? is used when referencing a table or field, and ? is used when referencing a value. Oops.

So after that blunder, I tested it again and magically I’ve got a big list of results being spit out by the query function. Awesome! Mission Accomplished. Time to go get some coffee.

by kyleklerks at May 19, 2016 07:46 PM

May 18, 2016

Laily Ajellu

Day 2 at CDOT

Installing Windows - Day 2 - BBB


Error 1: No drives were found
Error 2: Bootmgr is missing Press Ctrl+Alt+Del to restart


The fixes (details below):
1. For my motherboard, I needed to use the AHCI SATA port type
2. The SATA ports my hard drives were physically connected to weren't enabled in the BIOS settings, so the hard drives weren't being recognized

My Process:

Windows installation started (though there was no option to choose to start from USB/disk). I chose English as the language.

Gave me the error: No drives were found


Following the instructions, I cleaned Disk 0

But then I found that it was actually the USB drive that I had erased! (It contained the .iso image for Windows 7.) So I had to burn the .iso onto the USB again using Rufus, this time with FAT32, because you need to use FAT32 rather than NTFS when using a UEFI BIOS.

This time it gave me the error:
Bootmgr is missing Press Ctrl+Alt+Del to restart

So I went into BIOS settings by mashing F2, F8, F10, F12 as soon as I turned on the computer.
In BIOS settings:

1. Made sure that USB was the primary boot device
2. Checked the RAM using the BIOS check (returned okay)
3. Quick-checked the hard drive using the BIOS check (returned okay)

Finally I found this link that gave me the solution:

After I enabled all the SATA drives in the BIOS and changed the SATA port type to AHCI, it worked! The next time I booted up the computer, it asked me to press any key to start from USB, and once the Windows installation started it was able to find the hard drives. The reasons for the fix were:
1. For my motherboard, I needed to use the AHCI SATA port type
2. The SATA ports my hard drives were physically connected to weren't enabled in the BIOS settings, so the hard drives weren't being recognized

by Laily Ajellu at May 18, 2016 01:15 PM

My first day at CDOT

When I first came in to CDOT today I wasn't expecting to start working immediately, but I was thrilled nevertheless.

First thing on the agenda, make my computer work.

The computer I was initially assigned would turn off periodically, for reasons unknown (at the time), so CDOT decided to give me another machine. It turned out to be one that used to be used as a server, and so it didn't have a graphics card (it's an HP Z440). So we took the graphics card out of the faulty computer, seated it in the server, hooked everything up and booted it up.

Lo and behold, the monitors didn't connect, and after a few seconds of booting the computer started sounding 6 BIOS beeps and showing 6 red flashes on the motherboard. I got on Google to try to figure out what the problem could be.

First I found this link:

HP support

Which was a nice first attempt, but no cigar.

Next, I hit the jackpot:

BIOS Beep Codes

Using CTRL-F to look for "6"...

And fixed!

Such a great day...

by Laily Ajellu at May 18, 2016 01:11 PM

April 22, 2016

Berwout de Vries Robles

Update on phase 3

Quick update:

I got a reply to my GitHub pull request. Sadly, it was not what I had hoped for. The package maintainer replied that they want to give users full control of their compiler optimizations and not force anything on them.

As there is theoretically no downside to applying at least a base level of optimization, this argument seems a little strange to me.

The way I would do it is give the user a base level of optimization, so that users who have no knowledge of compiler optimizations get the benefit of the significant speedup without having to do anything. Advanced users could then opt out of the optimization if they want to. I realize that this is more work for advanced users. However, I think the situation where a user opts out of optimizations would come up very rarely, whereas the situation where a user should have applied some degree of optimization, but did not because he didn't know he should have, is more prevalent.

by Berwout de Vries Robles at April 22, 2016 05:58 AM

Andrei Topala

SPO600 Project: Phase Three and Conclusion

There is a saying attributed to Edison that goes something along the lines of “I have not failed–I have just found 10,000 ways that don’t work.”

The goal of the project was to optimize a selected open-source compression or checksum package and produce a meaningful improvement in performance. I chose lrzip, partly because it seemed interesting and partly because its creator is a physician, a career I had considered at length back in high school before deciding that hospitals and life-and-death situations are scary. This might not have been the best choice (nor, I guess, the best reason for a choice) because, well, lrzip looks pretty good. I could find no area of optimization that would amount to a substantive improvement. Whenever I thought I had come across an opportunity, it turned out not to be the case at all.

Optimization, I’ve come to realize, is pretty hard. Conversely, making a program run worse, even unintentionally, is very easy.

In phases one and two of this project I noted that lrzip crashed on AArch64 platforms when compressing a file with ZPAQ due to ZPAQ’s use of x86-specific JIT code. I thought to translate the x86 instructions into ARM instructions, so that JIT code could be used on both architectures, but that was too tall an order. I chose instead to focus on the C++ code that is used when JIT is unavailable, to try to optimize it and bring its runtime closer to that of the JIT version of the program. Also, I examined the LZMA code in the package (as LZMA is the default compression algorithm lrzip uses) and tried to optimize that as well.

Now, let me count the 10,000 ways.

The first order of business was to profile the program in order to see which functions took up the majority of the execution time. gprof was the tool for this. First I built the package with the -pg flag, which generates extra code to write profile information. This I did by setting it as one of the C, C++, and linker flags when calling the configure script:

[andrei@localhost lrzip-master]$ ./configure CFLAGS="-pg -O2" CXXFLAGS="-pg -O2" LDFLAGS="-pg"

Then I built the package using make (made the package?) and then ran lrzip on a 500MB text file.

[andrei@localhost lrzip-master]$ ./lrzip text
Output filename is: text.lrz
text - Compression Ratio: 3.937. Average Compression Speed:  2.041MB/s.
Total time: 00:04:04.53

This generated a gmon.out file in the directory, which lets us examine lrzip using the gprof tool. I used the following command to create an image from the profile information:

[andrei@localhost lrzip-master]$ gprof lrzip | gprof2dot | dot -Tpng > image.png


We can see from the graph that the program spent the vast majority of its time in the function GetMatchesSpec1, which is defined in the file lzma/C/LzFind.c. The graph tells us that the function was called 200 million times, so I figured that any marginal improvement I could make to the function would result in a significant gain in performance.
The function uses two Byte pointers to move along the stream of data read from the text file, and two UInt32 pointers (typedef’d as CLzRef) to keep track of matches from the data in a circular buffer.

CLzRef *ptr0 = son + (_cyclicBufferPos << 1) + 1;
CLzRef *ptr1 = son + (_cyclicBufferPos << 1);

Since the two pointers point to adjacent data locations I tried to define one in terms of the other (removing the “+1” part of ptr0’s definition then setting ptr1=ptr0++), but it had no effect, and I think the machine code generated would be the same regardless.

if (++len != lenLimit && pb[len] == cur[len])
          while (++len != lenLimit)
            if (pb[len] != cur[len])
              break;

This I rewrote as a single while loop, but as I said in phase two it had no (or a negative) effect on the executable. The objdump revealed that both this code and my edit resulted in identical instructions with only a slight change in layout.
I also rewrote this section of the code entirely to check multiple bytes at a time by XORing chunks of data from pb and cur, but even before fine-tuning the code to set len to the proper value after the first different chunk of data was found, I realized that the overhead of doing all this made the program significantly slower, so I abandoned that course of action. (For the text file, the loop only ran a few times with every function call, so checking very large chunks of data would be counter-productive).

After that, I tried to see if I could find something to optimize in LzmaEnc_CodeOneBlock, which is in lzma/C/LzmaEnc.c. I couldn’t, at least not beyond that failed experiment with memmove that I wrote about in phase two.

So, LZMA I couldn’t make better. I then went back to ZPAQ. Here’s the profile information for lrzip when it’s invoked with the -z flag after having been compiled with -DNOJIT to turn off the JIT code:


Both update0 and predict0 are structured as a for loop that uses a switch statement to check the value of an element of an array, which gets its value from an element in the block header array of a ZPAQL object, which represents ZPAQL instructions. I couldn’t find any way to optimize the functions without breaking the logic of the algorithm. The only thing I could do was replace some multiplication operations with bitshifts, but that didn’t affect the runtime (I think the compiler does that as well, if it can).
The execute function, which consists of a switch statement of over 200 cases, only takes up 3% of the execution time, and as much as I tried to optimize it I couldn’t get it to run measurably faster, so that didn’t work out either.

My attempts to edit the source code having been unsuccessful, I tried to see if any compiler options would make the program go faster, and, indeed, -Ofast (which enables -ffast-math) created an executable whose LZMA compression consistently outperformed the default -O2 executable by about 1% without introducing any errors into the compressed file. -ffast-math, though, is probably not something one would want in a compression tool, and there might be potential for mathematical errors somewhere in the code. Besides, the 1% difference, though I tested it quite a few times, might just have been a coincidence.

In the end, all I could do was add a small preprocessor condition in the libzpaq/libzpaq.h file to check for an x86 architecture and disable the JIT code if none is found. I created a patch file, but I am not sure it’s worth submitting since 1) it is such an insignificant change and 2) it only checks for GCC and Visual Studio macros, so if the program is compiled on an x86 processor with a different compiler the JIT code will still be deactivated, which would have a significant negative impact on performance. Here is the patch file for, at least, posterity:

From 3976cdc2640adbe593a09bba010130fcf74ef809 Mon Sep 17 00:00:00 2001
From: Andrei Topala <>
Date: Thu, 21 Apr 2016 17:56:34 -0400
Subject: [PATCH] Added a preprocessor check for an x86 architecture in
 libzpaq.h as a requisite for enabling JIT code.

 libzpaq/libzpaq.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/libzpaq/libzpaq.h b/libzpaq/libzpaq.h
index 93387da..e2b5058 100644
--- a/libzpaq/libzpaq.h
+++ b/libzpaq/libzpaq.h
@@ -25,6 +25,11 @@ comprises the reference decoder for the ZPAQ level 2 standard.
 #ifndef LIBZPAQ_H
 #define LIBZPAQ_H

+// If we're not on an x86 processor, disable JIT
+#if !defined(i386) && !defined(_M_IX86) && !defined(__x86_64__) && !defined(_M_X64)
+#define NOJIT
+#endif
 #ifndef DEBUG
 #define NDEBUG 1

That concludes this chapter in our journey through the wide waters of software portability and optimization. SPO600 was certainly the most challenging and most interesting course I’ve taken at Seneca, and I am sure I learned very much. I do wish I’d had more time to blog about it, though. That is my only regret, and also the 10,000 failures.

by andrei600 at April 22, 2016 03:58 AM

Tom Ng

The End of an SPO600 Semester

Finally done with SPO600, and what a ride it was.

SPO600 was not what I expected in the beginning, but as I progressed through the semester the course developed its own kind of charm. The course really demystifies a lot of what happens in the background. It explains what is normally abstracted away from, or taken for granted by, the programmer and, really, most people.

The topics covered in SPO600 are very specialized. It really is hard to accurately say what this course is about; “Software Portability and Optimization” is only the tip of the iceberg, and is almost misleading about what the course has to teach you. The course covers a broad spectrum of topics, from compilers to architectures to optimizations. Sometimes the course even covers things that may not have been course topics in the first place, such as useful Linux commands and programs.

This course is also unlike any other I’ve taken before. I’ve never taken a course where blogs and the zenit wiki are used to learn and to communicate. There was also an emphasis on doing things and reporting the results, as well as a bit more leeway in how to approach a given problem. Other courses are more direct in what they have to teach and offer, but the approach used in SPO600 required you to think a bit more.

Unfortunately, I was unable to take in everything that the course tried to teach, but I still feel I learned very much from it. I was initially curious only about a few things here and there relating to the topics of the course, and the course gave me a lot more than I bargained for. It opens your eyes just enough that you realize you actually know absolutely nothing. Computing turned out to be a pretty complicated subject, and for the various topics SPO600 covered, it probably only scratched the surface. Scratching that surface, though, felt like gaining some deep knowledge of the subject, even if it was only the bare-bones basics. Regretfully, I was unable to find an optimization for my project, but I was still able to take away further knowledge about the make system and other compiler options from doing it. There really is no other course like SPO600, and even if I bit off a bit more than I could chew, I’m really glad I took it.

by tng23spo600 at April 22, 2016 03:54 AM

Lab6: Auto-Vectorization of Loops

In this lab, I got to explore the auto-vectorization features of gcc. I started by creating a small program that created two 1000-element int arrays filled with random numbers and a third array that stored the sums of the other two; the program would print the sum of all numbers in the third array before exiting. The code is shown below:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {

int array1[1000], array2[1000], array3[1000];
long int sum = 0;
int i;

/* Fill array 1 with random numbers */
for (i = 0; i < 1000; i++) {
array1[i] = rand();
}

/* Fill array 2 with random numbers */
for (i = 0; i < 1000; i++) {
array2[i] = rand();
}

/* Sum contents of arrays 1 and 2 into 3 */
for (i = 0; i < 1000; i++) {
array3[i] = array1[i] + array2[i];
}

/* Sum contents of array 3 into long int */
for (i = 0; i < 1000; i++) {
sum += array3[i];
}

printf("Sum is: %ld\n", sum);
return 0;
}

I made two object dumps of this program to compare, by compiling with two different sets of compiler options. The first version was made with only -g to turn on debugging symbols: gcc -g arraysum.c -o arraysum.out. The second version was made with: gcc -g -O3 -ftree-vectorizer-verbose=1 arraysum.c -o arraysum.out. The -O3 turns on more optimizations, including vectorization, while -ftree-vectorizer-verbose=1 prints out what was and wasn't vectorized. The messages printed out when attempting to vectorize were:

Analyzing loop at arraysum.c:29
arraysum.c:29: note: vect_recog_widen_sum_pattern: detected: patt_7 = _31 w+ sum_50;
Vectorizing loop at arraysum.c:29
arraysum.c:29: note: LOOP VECTORIZED.
Analyzing loop at arraysum.c:24
Vectorizing loop at arraysum.c:24
arraysum.c:24: note: LOOP VECTORIZED.
Analyzing loop at arraysum.c:19
Analyzing loop at arraysum.c:14
arraysum.c:5: note: vectorized 2 loops in function.

Of the four loops in the program, gcc was able to vectorize two of them: the one where the third array was being filled with the sums, and the one where the sum of all values in the third array was being calculated. I initially used -ftree-vectorizer-verbose=2, but the messages were quite long, so I used verbose=1 instead. The reason the first two loops were not vectorized was shown in verbose=2 but not verbose=1. The issue was likely the calls to the rand() function, which the vectorizer cannot analyze. A snippet of the messages printed from verbose=2 concerning the first two loops is below:

arraysum.c:19: note: not vectorized: loop contains function calls or data references that cannot be analyzed
arraysum.c:19: note: bad data references.
Analyzing loop at arraysum.c:14

arraysum.c:14: note: not vectorized: loop contains function calls or data references that cannot be analyzed
arraysum.c:14: note: bad data references.
arraysum.c:5: note: vectorized 2 loops in function.

The assembly for the vectorized loop that fills the third array is:

4005a8: 910083a3 add x3, x29, #0x20
4005ac: 8b000062 add x2, x3, x0
4005b0: 913f03a3 add x3, x29, #0xfc0
4005b4: 8b000061 add x1, x3, x0
4005b8: 4c407841 ld1 {v1.4s}, [x2]
4005bc: d283ec02 mov x2, #0x1f60 // #8032
4005c0: 4c407820 ld1 {v0.4s}, [x1]
4005c4: 8b1d0042 add x2, x2, x29
4005c8: 8b000041 add x1, x2, x0
4005cc: 4ea08420 add v0.4s, v1.4s, v0.4s
4005d0: 91004000 add x0, x0, #0x10
4005d4: 4c007820 st1 {v0.4s}, [x1]
4005d8: f13e801f cmp x0, #0xfa0
4005dc: 54fffe61 b.ne 4005a8 <main+0x58>

The assembly for the vectorized loop that sums up the elements of the third array is:

4005f0: 4cdf7800 ld1 {v0.4s}, [x0], #16
4005f4: 0f20a402 sxtl v2.2d, v0.2s
4005f8: 4ee18441 add v1.2d, v2.2d, v1.2d
4005fc: 4f20a400 sxtl2 v0.2d, v0.4s
400600: eb01001f cmp x0, x1
400604: 4ee18401 add v1.2d, v0.2d, v1.2d
400608: 54ffff41 b.ne 4005f0 <main+0xa0>

When comparing the object dump of the vectorized build with the non-vectorized one, the vectorized loops contained fewer instructions inside the body. There were also instructions relating specifically to vectorization, such as the ld1 instruction, which loads elements from memory into the lanes of a vector register. For example, in the first dump above, the instruction at 4005b8 is: “ld1 {v1.4s}, [x2]”. v1 is the vector register, 4s signifies four 32-bit lanes, and x2 holds the address to load from.

Vectorization could also be applied in the volume sampling lab, as it also contains many loops. The loops are used to fill up the lookup table, and vectorization should be able to speed up the process. The scale function, which is called in a loop, might be the only issue, as it may be too complicated for gcc to analyze when determining whether the loop can be vectorized. Perhaps this problem can be alleviated by inlining it inside the loop or by rewriting it more appropriately (I believe my implementation of the scale function could be a bit cleaner than the way I wrote it).

by tng23spo600 at April 22, 2016 03:50 AM

Nina Wip

Trying to get my changes upstream

As noted before, my optimization was to change the compiler flag from -O2 to -O3. This increased the speed by 11% on x86_64. During the test phase of this optimization I changed the compiler flags in the Makefile itself. If I wanted this to get committed upstream, I'd have to find the place where the makefiles are created.

This would normally be the configure file. However, I could not find where the flags were set for the makefiles, which I thought was very strange, because I am quite sure that when you configure and make the program, it compiles with the -O2 flag.

I used grep to find files where -O2 would be used, and the only file it found was an instructions file describing how you can manually add -O2 while configuring, not as a standard.

Then I tried using grep to find CFLAGS and where they were defined. What I discovered is that they use a pthread macro which helps find the right flags for compilation. I quoted this from the md5deep documentation:
#   This macro figures out how to build C programs using POSIX threads. It
#   sets the PTHREAD_LIBS output variable to the threads library and linker
#   flags, and the PTHREAD_CFLAGS output variable to any special C compiler
#   flags that are needed.
I did not know how to manipulate the pthread configuration so it would always use -O3. I did read in their instructions that the README file contained an email address for questions or remarks on the compiler options. It was, however, not in the file, so I could not contact them personally either.

On that note, I'm sad to share that I could not get the right flags in at the configure step. This means I could not commit anything to the GitHub project, because I could not contact them to ask for help or an explanation.

by Nina at April 22, 2016 03:36 AM

Vishnu Santhosh Kumar

Lab 6 – Vectorization

Vectorization is the process of rewriting a loop so that instead of processing a single element, it can process multiple elements simultaneously.

To show this in action I used the following code:

#include <stdio.h>
#include <stdlib.h>
#define NUM 1000
int main() {

int a1[NUM], a2[NUM];
int i;

// fill both arrays with random values in [0, 99]
for (i = 0; i < NUM; i++) {
a1[i] = rand() % 100;
}
for (i = 0; i < NUM; i++) {
a2[i] = rand() % 100;
}

long int total = 0;
int a3[NUM];

// sum the two arrays element-wise into a third array
for (i = 0; i < NUM; i++) {
a3[i] = a1[i] + a2[i];
}

// sum the third array into a long int and print the result
for (i = 0; i < NUM; i++) {
total += a3[i];
}

printf("total = %ld\n", total);
return 0;
}


The code above creates two 1000-element integer arrays and fills them with random numbers, then sums those two arrays element-wise into a third array, and finally sums the third array into a long int and prints the result.

The code is compiled with the -O3 option, which enables the vectorization optimizations.

402388: 91420222  add x2, x0, #0x80, lsl #12
40238c: 3c407c00  ld1 {v0.2d}, [x0]
402390: 3cdf7c21  ld1 {v1.2d}, [x1], #16
402394: 3e61d400  fadd v0.2d, v0.2d, v1.2d
402398: 4c9f2c00  st1 {v0.2d}, [x0], #16
40239c: eb02001f  cmp x0, x2

Here, ld1 loads element structures from memory into a vector register, and st1 stores the contents of a vector register back to memory, while the fadd operates on two lanes at once. These instructions show that the compiler has vectorized the loop: each pass now processes multiple elements.


by Vishnu at April 22, 2016 02:51 AM

Berwout de Vries Robles

Phase 3 Pushing changes upstream

To get my changes accepted upstream, the first thing I needed to figure out was where in the make/build process to insert my change, since in the phase 2 optimization testing I had only edited the Makefile directly to see the effects.

It turns out ncompress uses a Makefile.def, which contains the definitions used by the make call. The package is not built with ./configure like most; instead, after cloning you can run make install directly. This meant I had to add the following 2 lines to the Makefile.def:
#Compiler optimization flags
I tested whether the package still built correctly with my change to Makefile.def, and it built as it should with the -O3 optimization turned on.
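The exact lines aren't reproduced above, so as an illustration only (the variable name is my assumption; ncompress's Makefile.def may use a different one), the edit would look something like:

```makefile
# Compiler optimization flags
CFLAGS = -O3
```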

Next I had to figure out how to contribute to the package. Luckily, since the package is available on GitHub, I could fork it. So that is what I did: I forked the package, made the change, and committed it to my branch. I then created the following pull request:

I added a link to my blog and a short description in the pull request, since there is no other means of communication described on the project page. Now all we can do is wait for a response!

by Berwout de Vries Robles ( at April 22, 2016 01:46 AM

Vishnu Santhosh Kumar

Final Project : Phase 3.4 (last)

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve the performance of the software on AArch64 and x86_64 systems.

Phase 3.4 –Final Patches Submitted and Project Conclusion

Date : April 6, 2016

Submitted the final patches to the LZ4 community

Details :

The final patches of the code have been submitted to the LZ4 community, and I am waiting for any further response. All the changes made in the previous phases of the project are included in the final code. The community is very active, so I'm expecting a reply soon.

April 7, 2016

Response from LZ4 community.

Details  :

The community has responded to my pull requests on GitHub. They analysed my code and replied with the following.
The community replied that one of the optimizations I made for decompression, in a function called “LZ4IO_compressFilename_extRess()”, was not in the right place; making similar changes in a function named “LZ4IO_GetBlockSize_FromBlockId()” would have improved the program more. They were satisfied with the optimizations I made to improve performance on x86 machines, but expressed concern about the code's performance on some other platforms. They also told me that the compiler options I added behind a platform-specific flag won't work under certain conditions or on new platforms. One of the maintainers also addressed a question I raised in the pull request regarding the cache size: when I examined older versions of the LZ4 code, I noticed an option to set the cache size of the compression block, and increasing it did improve the compression speed, as I had assumed. However, that option was not present in the latest version of the code. I mentioned this in the pull request, and the community told me it is still available, but the code has been moved to a different file. The project owner also encouraged my attempt at optimizing LZ4 and mentioned that he had read through my blog.


The SPO course and this final project have greatly improved my coding and other skills. This course helped me explore the world of open source and many new things, one of which is the habit of blogging. I never had a blog before, and now I'm really interested in blogging and sharing my research. The professor was very helpful and inspiring: he made the labs fun, helped us improve our teamwork by having us work in groups for all labs, explained things very well, and gave us a lot of references to follow. My classmates were enthusiastic and energetic. This was one of the best courses I have attended, and it has given me many new threads of research to continue. I thank my professor and my classmates once again for being with me and supporting me throughout this semester.


by vishnuhacks at April 22, 2016 01:34 AM

April 21, 2016

Berwout de Vries Robles

Project Phase 2

After last week, when we ended our project selection with a somewhat hasty switch to another package, I am happy to finally dig into it a little. Sadly I have been sick for half of this week (and still am), so progress is slower than I had hoped.

In my project selection blog I mentioned that I wanted to look at the vectorization the compiler can do for us if we make some small changes to the code. However, before looking into the code I went to take a look at the Makefile to see whether any vectorization flags are currently turned on.

This is what the compiler options line looks like on both betty and my own x86 machine:
What I found out is that CFLAGS, CPPFLAGS and LDFLAGS are all undefined in this case. What I also found strange is that the options that are set define a fixed amount of memory for the program to use and a number of registers the program can use.

The most glaring omission here is that not even a base level of optimization is applied, so I set out to test that first. To verify that the optimized build does not change the program's behaviour, we use an md5 checksum to confirm that our compressed and uncompressed files remain the same. After slowly ramping up the optimization level, it turns out that the program works at -O3 without causing any problems.

During benchmarking I first encountered a couple of runs where uncompressing the files seemed slower than before applying the optimization; however, after running some more benchmarks, it seems to me that it was probably background noise from other processes running on my system.

After the -O3 optimization level had been applied we see the following results in the benchmarks:
aarch64 testserver Betty:

compress:
2545934560(~2.4GB) textfile:
real:    64.467s
user:   63.280s
sys:     1.180s

1073741824(1GB) integerstream:
real:    9.413s
user:   8.980s
sys:     0.430s

uncompress:
2545934560(~2.4GB) textfile:
real:    10.882s
user:   9.510s
sys:     1.370s

1073741824(1GB) integerstream:
real:    4.734s
user:   3.920s
sys:     0.700s

my own x86_64 Fedora 23 installation:

compress:
2545934560(~2.4GB) textfile:
real:    34.812s
user:   18.821s
sys:     4.715s

1073741824(1GB) integerstream:
real:    13.274s
user:   3.789s
sys:     1.651s

uncompress:
2545934560(~2.4GB) textfile:
real:    45.669s
user:   5.779s
sys:     2.339s

1073741824(1GB) integerstream: 
real:    17.713s
user:   2.814s
sys:     1.107s

If we compare that with last week's results, we find that in all cases compression was faster with the optimization turned on. Oddly enough, however, on my local machine the uncompress was notably slower! This could be explained by disturbances from background processes on my machine, since I am running a virtual machine to test with. However, I have tested extensively and have yet to see the optimized version of uncompress not do worse.

To implement this optimization in the actual program I will have to add it to the Makefile recipe; for the testing I have just been editing the Makefile itself. I have also been looking for a reason why compression on the aarch64 server betty is so terribly slow while uncompress is very fast, but I have yet to find the answer.

by Berwout de Vries Robles ( at April 21, 2016 10:09 PM

Tom Ng

The results of Optimizing xzutils with -O3

Ever since I wrote my last post on my attempt to optimize xzutils, I have been compiling different builds of xzutils with different compiler flags. The results of my first two benchmarks showed that -O3 ran slower than the default compilation, which used -O2. These results surprised me, so I wanted to figure out how a program could run slower with -O3 than with -O2.
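Assuming the builds were done with xzutils' standard autoconf setup (my assumption; the exact invocation may differ), each per-flag build follows the same pattern:

```shell
# one build per extra flag; -fgcse-after-reload shown as the example
./configure CFLAGS="-g -O2 -fgcse-after-reload"
make clean && make
```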

-O3 is simply -O2 with 9 additional optimizations turned on. The 9 optimizations are: -fgcse-after-reload, -finline-functions, -fipa-cp-clone, -fpredictive-commoning, -ftree-loop-distribute-patterns, -ftree-partial-pre, -ftree-vectorize, -funswitch-loops and -fvect-cost-model. This led me to create 9 builds of xzutils, each with one of those optimizations passed into the configure script. The results of these builds are shown in the picture below:

results (click picture to enlarge)

The picture shows the times recorded by the time command when compressing and decompressing the test files. Highlighted in green are times where an improvement relative to the executable with the default optimizations (-g -O2) was found.

Overall, the results show that compiling with -O2 plus one individual -O3 optimization has a negative impact on the running speed of the program, since the times to complete operations mostly increased. Savings, if any, tended to happen during the decompression operations rather than the compression operations. All of the individual optimizations were also slower than the -O3 version of the executable, which was itself slower than the original executable to begin with.

Furthermore, any reductions in time that a build may have achieved for a section were mostly insignificant and potentially false. For example if there was a time savings in the user section, it may not have actually been a savings since the time “saved” was actually passed and added onto the kernel time resulting in no net gain of time savings.

Of the 9 optimizations tested, -finline-functions seems to be a bit of an outlier. The executable compiled with this optimization is still slower than -O2 and -O3, but its timings are notably better than those of the other executables with individual -O3 optimizations. In fact, its times are similar to those of the -O3 executable, but it still stands that adding -finline-functions to -O2 produces a slower-running executable. In other words, -finline-functions adds considerably less runtime than the other optimizations, but the executable is still better off without it.

It was surprising to me that each individual -O3 optimization made the program slower, but combined (-O3) the times were better (though still slower than just -O2). This suggests that combining optimizations may yield a speed boost. It's hard to determine which optimizations could work with each other, but going by the names alone, I tried -ftree-loop-distribute-patterns, -ftree-partial-pre and -ftree-vectorize together. That build, though, did not yield favorable results; much like the other tests with individual optimizations, it was slower than both the -O2 and -O3 versions.

In the end, I was unable to optimize xzutils by adding the -O3 compiler optimizations. None of the individual -O3 optimizations resulted in any meaningful speed improvement, and overall they slowed the program down. It seems -O3 isn't used all the time because it may actually slow programs down, not just because it can break something, as I had initially thought. The xzutils devs are right not to use -O3, as -O2 without any -O3 optimizations produces the faster-running xzutils.

by tng23spo600 at April 21, 2016 09:39 PM

Lab7: Looking at Inline ASM in libmlx4-1.0

In this lab, I chose a package whose source includes inline assembly and attempted to locate it and find out what it does. In the source, I found only three lines of assembly. The assembly is used exclusively inside a series of C preprocessor if-else directives, meaning only one (or none) of the assembly statements will be compiled in. The assembly specifically targets the i386, x86_64 and ia64 platforms.

The file mlx4.h contains the 3 asm statements found in the source code. The assembly appears in a preprocessor if-else block that defines wc_wmb(). Depending on the platform, wc_wmb() is substituted with a single line of assembly, or with wmb() instead. The input and output operands are empty. Declaring the asm volatile and putting “memory” in the clobber list ensures the statement doesn't get optimized away or moved by the compiler. The assembly in all three cases appears to implement a memory barrier, which prevents the compiler from compromising the order of execution through optimization. Throughout the qc.c file, wmb() is called in various places.

A single line of assembly is used if the target architecture is i386, x86_64 or ia64. If it is none of these platforms, the else section defines wc_wmb() as wmb() without any assembly. I could not find wmb() defined or redefined anywhere else in the program, so it is possible that the call to wmb() actually does nothing except let the program compile. If it does do nothing, the memory barrier for that section won't be implemented, and the program may suffer because of this once the compiler optimizes.

Only a single line of assembler is executed if one of the three platforms is detected, and none at all if the platform isn't recognized. The increase in code complexity is insignificant and serves only to benefit the platforms that are supported. No portability is lost; unsupported platforms just may not run as well. For such a small set of checks, it is probably worth having the assembler in the source code.

by tng23spo600 at April 21, 2016 08:59 PM

Giuseppe Ranieri

Providing for the Upstream

I have submitted my work to the upstream through github.

 I thought it would be best practice to first show my findings to the community so they could comment on the benefits and drawbacks I don't know about. Unfortunately, even 4 days after posting, no one has responded yet. I assume that even if I had made a pull request, the same thing would have happened at this point in time, as no new pull requests have been answered either.

For the sake of time, though, if the work had been accepted, I would have had to go through their guidelines for contributing, which follow the Google Individual Contributor License Agreement (CLA). The following is stated:

Before you contribute

Before we can use your code, you must sign the Google Individual Contributor License Agreement(CLA), which you can do online. The CLA is necessary mainly because you own the copyright to your changes, even after your contribution becomes part of our codebase, so we need your permission to use and distribute your code. We also need to be sure of various other things—for instance that you'll tell us if you know that your code infringes on other people's patents. You don't have to sign the CLA until after you've submitted your code for review and a member has approved it, but you must do it before we can put your code into our codebase. Before you start working on a larger contribution, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you. Coordinating up front makes it much easier to avoid frustration later on.

Code reviews

All submissions, including submissions by project members, require review. We use Github pull requests for this purpose.

The small print

Contributions made by corporations are covered by a different agreement than the one above, the Software Grant and Corporate Contributor License Agreement.

 I followed their suggestion of:

Before you start working on a larger contribution, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you. Coordinating up front makes it much easier to avoid frustration later on.

But since they have not answered me yet, I was not able to continue on to signing the CLA.

by JoeyRanieri ( at April 21, 2016 05:55 PM

Vishnu Santhosh Kumar


My Tips for Future Data Scientists

  • Be flexible and adaptable – There is no single tool or technique that always works best.
  • Cleaning data is most of the work – Knowing where to find the right data, how to access the data, and how to properly format/standardize the data is a huge task. It usually takes more time than the actual analysis.
  • It's not all model building – Like the previous tip, you must have skills beyond just model building.
  • Know the fundamentals of structuring data – Gain an understanding of relational databases. Also, learn how to collect and store good data. Not all data is useful.
  • Document what you do – This is important for others and your future self. A related skill: learn version control.
  • Know the business – Every business has different goals. It is not enough to do analysis just because you love data and numbers. Know how your analysis can make more money, positively impact more customers, or save more lives. This is very important when getting others to support your work.
  • Practice explaining your work – Presentation is essential for data scientists. Even if you think you are an excellent presenter, it always helps to practice. You don’t have to be comfortable in front of an audience, but you must be capable in front of an audience. Take every opportunity you can get to be in front of a crowd. Plus, it helps to build your reputation as an expert.
  • Spreadsheets are useful – Although they lack some of the computational power of other tools, spreadsheets are still widely used and understood by the business world. Don’t be afraid to use a spreadsheet if it can get the job done.
  • Don’t assume the audience understands – Many (non-data science) audiences will not have a solid understanding of math. Most will have lost their basic college and high school mathematics skills. Explain concepts such as correlation and avoid equations. Audiences understand visuals, so use them to explain concepts.
  • Be ready to continually learn – I do not know a single data scientist who has stopped learning. The field is large and expanding daily.
  • Learn the basics – Once you have a firm understanding of the basics of mathematics, statistics, and computer programming, it is much simpler to continue learning new data science techniques.
  • Be a polymath – It helps to be a person with a wide range of knowledge.

by vishnuhacks at April 21, 2016 02:33 PM

April 20, 2016

David Wesley Au

[Lab 3] Assembly Lab



 mov x3,start        // current number (loop counter)
 mov x10,10          // divisor used to split off digits
 adr x11,msg
loop:
 udiv x4,x3,x10      // quotient: tens digit
 msub x5,x10,x4,x3   // remainder: ones digit
 cmp x4,0            // (result unused; a branch here could suppress a leading zero)
 add x4,x4,48        // convert both digits to ASCII
 add x5,x5,48
 strb w4,[x11,5]     // store the digits into the message
 strb w5,[x11,6]
 mov x0,1            // write(1, msg, len)
 adr x1,msg
 mov x2,len
 mov x8,64
 svc 0
 add x3,x3,1
 cmp x3,max
 b.ne loop           // repeat until the counter reaches max
 mov x0,0            // exit(0)
 mov x8,93
 svc 0


 mov $start,%r15     /* current number (loop counter) */
 mov $10,%r10        /* divisor used to split off digits */
loop:
 mov %r15,%rax
 mov $0,%rdx
 div %r10            /* quotient in %rax, remainder in %rdx */
 cmp $0,%rax         /* (result unused; a branch here could suppress a leading zero) */
 mov %rax,%r14
 add $48,%r14        /* convert both digits to ASCII */
 mov %rdx,%r13
 add $48,%r13
 movb %r14b,msg+5    /* store the digits into the message */
 movb %r13b,msg+6
 mov $len,%rdx       /* write(1, msg, len) */
 mov $msg,%rsi
 mov $1,%rdi
 mov $1,%rax
 syscall
 inc %r15
 cmp $max,%r15
 jne loop
 mov $0,%rdi         /* exit(0) */
 mov $60,%rax
 syscall

There is not much of a difference between the two; the mechanics differ somewhat, but the logic is ultimately the same.
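For reference, listings like these are assembled and linked with the GNU binutils on each machine (the filenames here are placeholders):

```shell
as -g -o loop.o loop.s    # assemble the listing for the host architecture
ld -o loop loop.o         # link the standalone executable
./loop
```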

by David Au ( at April 20, 2016 07:43 PM

Project Phase 3

Since the attempt in Phase II, I have tried other possible optimizations, but have not had any luck. As I have said before, after searching through the pxz.c file multiple times on different occasions, I cannot find another possible shortcut/optimization that could be made to the application. Between the two files, with a 1 MB difference, there is only a negligible time difference. From my understanding of PXZ up to now, even though the latest patch to the software is almost a year old, it already seems solid in terms of code structure and speed. In the tests I have done, there was almost no difference in time/speed between the different SPO600 servers.

Since PXZ was created (2009), a number of patches have been applied to the compression tool over four years or so. Given the benchmarks the PXZ community has attempted, PXZ is meant for multiple cores and processors, which increases its efficiency in compression time. The compression time using single-threaded PXZ is equivalent to ordinary usage of the XZ compression tool.

Although my optimization attempts ended in failure, I learned a lot during this project: how compression works in code, and how everything fits together at the assembly level. In the future, I will attempt to contribute to other parts of the PXZ utility, be it the assembly, Makefile, manpage or documentation.

by David Au ( at April 20, 2016 07:16 PM

[Lab 7] Inline Assembly

The Open Source Package I have chosen to blog about is called "Mosh". Mosh is a mobile shell: a remote terminal application that allows roaming, supports intermittent connectivity, and provides local echo and line editing of user keystrokes. Mosh can be used as a replacement for SSH; it is free to use and available across many platforms.

1. Mosh has 14 different Makefiles.

2. Mosh is available on Windows, OS X, Android, Chrome OS, Linux, BSD, and Solaris.

3. Mosh reduces the median keystroke response time to nearly instantaneous time. This makes remote servers feel like they are "local" machines.


5. When I tried to configure the package, I ran into this problem: "configure: error: cannot find protoc, the Protocol Buffers compiler", which stopped me from attempting to build the package.
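For what it's worth, that error usually just means the Protocol Buffers compiler isn't installed. On Debian/Ubuntu it typically comes from the protobuf-compiler package (package names vary by distribution):

```shell
sudo apt-get install protobuf-compiler
protoc --version    # confirm the compiler is now on the PATH
```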

by David Au ( at April 20, 2016 03:16 PM

[Lab 2] Code Building

1. hello-2.10

I used these commands to build the software:


I had no problems with either, and after each step there were more files available in the hello-2.10 directory.
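The commands themselves didn't survive the formatting above, but GNU hello follows the standard autotools sequence, presumably something like:

```shell
tar -xvf hello-2.10.tar.gz
cd hello-2.10
./configure
make
```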

These are the files present after make:

@aarchie hello-2.10]$ ls
ABOUT-NLS  COPYING    ChangeLog.O  INSTALL  NEWS    README-dev      THANKS  aclocal.m4  config.h   config.log     configure     contrib  hello    lib  po   stamp-h1
AUTHORS    ChangeLog  GNUmakefile  Makefile  README  README-release  TODO    build-aux  config.status  doc      hello.1  m4   man       src  tests

2. ghostscript

tar -xvf gnu-ghostscript-9.14.0.tar.xz


Since there is no Makefile, and it is suggested that I not install on the SPO600 servers, there is nothing else I can do after configure. These are the files available after ./configure:

@aarchie gnu-ghostscript-9.14.0]$ ls
AUTHORS    DroidSansFallback.NOTICE  Resource    config.log  devices   expat                    ghostscript_rt.vcxproj  install-sh  libtool    missing   toolbin
COPYING    LICENSE                   NEWS         aclocal.m4  base          config.sub  contrib       doc       ghostscript-ufst.vcproj  iccprofiles             jbig2dec  openjpeg  trio
ChangeLog  LICENSE.Unicode           README       arch        config.guess  configure   cups          examples  ghostscript.vcproj       ijs                     lib         man        psi

by David Au ( at April 20, 2016 02:58 PM

April 19, 2016

David Wesley Au

[Lab 5] - Algorithm Selection Lab

The first one should scale a signed 16-bit integer by multiplying it by a volume scaling factor expressed as a floating point number in the range of 0.000-1.000. This should be implemented as a function that accepts the sample (int16) and scaling factor (float) and returns the scaled sample (int16).
The second version of the function should do the same thing, using a lookup table (a pre-computed array of all 65536 possible values). The lookup table should be initialized every time a different volume factor is observed. This should be implemented as a drop-in replacement for the function above (same parameters and return value).


#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#define MAX 250000000
#define FACTOR 0.500
int16_t method1(int16_t, float);
int16_t method2(int16_t, float);
int16_t* sample;
int main(){
int16_t* meth1;
int16_t* meth2;
meth1 = (int16_t*)malloc(MAX * sizeof(int16_t));
meth2 = (int16_t*)malloc(MAX * sizeof(int16_t));
int i;
struct timeval stop, start;
sample = (int16_t*)malloc(32769 * sizeof(int16_t)); // lookup table for 0..32768

srand(time(NULL)); // generate sound sample by using random number generator
int16_t* sound;
sound = (int16_t*)malloc(MAX * sizeof(int16_t));
for (i = 0; i < MAX; i++) {
sound[i] = rand() % 65536 - 32768;
}
for (i = 0; i < MAX; i++){
printf("Value of sound sample: %d\n", sound[i]);
}

// Calculating time taken for method1 (multiply each sample)
gettimeofday(&start, NULL);
for (i = 0; i < MAX; i++){
meth1[i] = method1(sound[i], FACTOR);
}
gettimeofday(&stop, NULL);
printf("took %ld\n", (long)(stop.tv_sec - start.tv_sec));

// Calculating time taken for method2 (precomputed lookup table)
gettimeofday(&start, NULL);
for (i = 0; i < 32769; i++){
sample[i] = i * FACTOR;
}
for (i = 0; i < MAX; i++){
meth2[i] = method2(sound[i], FACTOR);
}
gettimeofday(&stop, NULL);
printf("took %ld\n", (long)(stop.tv_sec - start.tv_sec));
return 0;
}

int16_t method1(int16_t s, float f){
return s*f;
}

int16_t method2(int16_t s, float f){
int16_t result;
if (s >= 0){
result = sample[s];
} else {
result = -sample[-s];
}
return result;
}

aarchie: Method1 took 215,498 ms || Method2 took 168,443 ms
x86_64: Method1 took 20,429 ms || Method2 took 23,564 ms

by David Au ( at April 19, 2016 08:33 PM

[Lab 4]

#include <stdio.h>
int main(){
      printf("Hello World!\n");
      return 0;
}

Compiler Options:
-g               # enable debugging information
-O0              # do not optimize (that's a capital letter and then the digit zero)
-fno-builtin     # do not use builtin function optimizations

-static option:

-It statically links in the C library (all of it, not just the functions used in the .c file), which makes the file much larger than the original.

Without "-fno-builtin" option:

-Total file size is reduced.

Without "-g" option:

-Total file size is reduced.

Addition printf statements:

-The extra arguments are moved into multiple registers and pushed onto the stack.

Move printf to a separate function (output()):

-It will call output() from main, and move $0x0 to %eax to empty the register before each function call.

by David Au ( at April 19, 2016 08:32 PM

Andrei Topala

Inline Assembler Lab

Let’s look at the use of inline assembler in a C program.

Assembly-language code can be embedded into a C (or other high-level languages) source file in order to optimize a certain part of the program by using processor-specific instructions that are faster than what the compiler would create out of functionally-equivalent C code. Moreover, inline assembly code can be used for hardware-specific tasks that would be impossible with high-level code.

An example of inline assembly code can be found in K9Copy, a DVD backup tool. In the file src/dvdread/md5.h, there is the following piece of code:

#if defined __GNUC__ && defined __i386__
static md5_uint32
rol(md5_uint32 x, int n)
{
  __asm__("roll %%cl,%0"
          :"=r" (x)
          :"0" (x),"c" (n));
  return x;
}
#else
# define rol(x,n) ( ((x) << (n)) | ((x) >> (32-(n))) )
#endif

If the program is compiled with a GNU compiler and for an i386 processor, the function rol(md5_uint32 x, int n) is defined. It contains the assembly instruction roll (rotate left long), which rotates the 32-bit integer x to the left by n places, with bits that fall off the end wrapping around to the beginning. If the program is compiled with a different compiler or for a different processor, the function is instead replaced with a macro that achieves the same result through bit shifts; the macro uses multiple instructions whereas the inline-assembly version uses only one.

This, I think, is a good use of inline assembly: it’s there to be used if it can make the executable faster, and if it can’t be used then there is an alternative. Unless there are machines that have the __i386__ macro defined but no rol function available (which I don’t think would be the case?), there are no problems with portability. The concept is sound. However, this particular source file was written in 1999. Modern compilers would almost certainly recognize the second function (that is, the macro function using bit shifts) and replace it with a rol instruction if possible. Using inline assembly for optimization is not so (relatively) straightforward a job as one might think, or as it would have been 20 years ago. Most attempts at optimization with inline assembly would probably be surpassed by the compiler’s optimizations; at worst, inline assembly might even impede the compiler and result in an executable that performs worse than it would have had the compiler been left to its own devices. There are, I’m sure, exceptions, and in some situations it could be the best course of action to use inline assembly in an optimizing capacity, but it probably shouldn’t be the first resort.

Inline assembly code should, of course, still be used for certain platform-specific operations or for tasks that cannot be done with high-level languages–those things, at least for the moment, lie beyond the powers of the compiler.

by andrei600 at April 19, 2016 01:51 AM

April 18, 2016

Tom Ng

Lab4: Compiler Options and their Impact

Lab 4 was about observing the impacts of compiler options on a simple program. This simple program was nothing more than a call to printf saying “Hello World”. In this lab, I compiled 7 different versions of the same program, each with different compiler options or with a slight change to the source code that accomplishes (mostly) the same result.

Before experimenting with compiler options, it is necessary to see how the compiler normally compiles a program without any options. The compile command would simply be: gcc hello.c. The compiler compiled the source to a binary that was 8190 bytes.

The first variation of the program was compiled with the options -g -O0 -fno-builtin, resulting in the command: gcc -g -O0 -fno-builtin hello.c. The -g option enables debugging symbols, -O0 sets the optimization level to 0 (no optimization), and -fno-builtin disables the builtin function optimizations. The resulting file was 9184 bytes, which is larger. When examining the object dump (objdump -d) of this executable and comparing it with the executable built without options, there was only one change: in the default build, the call to printf had been replaced with a call to puts.

The second variation of the program was compiled with the -static option added in. This option makes the compiler link statically instead of dynamically. The executable produced with this option shot up to 681699 bytes, roughly 74 times larger. The object dump likewise gained a massive number of instructions. While main had the same number of lines, the start_main section, which was only 4 lines in the dynamically linked version, had many more; possibly too many for objdump to show, as the end of the header showed “…”, suggesting that further lines had been truncated.

The third variation was compiled without -fno-builtin. The filesize became slightly smaller at 7149 bytes. The object dump reveals the same change as the first variation where the printf calls were replaced by puts instead.

The fourth variation was compiled without -g. Compared to the first variant, this resulted in no change to the file size or the assembly code. However, doing a hash check on both executables produced different values, so there is some difference here that is not visible in the file size or in objdump -d.

The fifth experiment is different in that it is a source code change rather than a compiler option change. The original program simply calls printf and passes it one argument: the string containing “Hello world”. The change here is to add additional arguments: integers to be printed along with the string. I made 10 sub-variations of this program, each passing a different number of parameters to printf (1 to 10). When comparing the object dumps of the sub-variations, there were more mov instructions in the main header as the number of parameters increased. There were also more instructions for pushing values onto the stack in that header. There were other changes throughout the object dumps, but they were only address changes.

The sixth experiment was much like the fifth in that the source code was changed rather than the compiler options. The original program simply calls printf inside main and then exits. Here, a call to a function output() was made in main, and output() contains the call to printf. In the disassembly, this was reflected by a new output header containing many of the instructions that used to be in the main header. The main header was also reduced by one instruction.

The seventh and final experiment was setting the optimization level to the highest, which meant changing -O0 to -O3. Here the filesize went up from 9184 bytes to 10608 bytes. There were many changes to the assembly, as revealed in the object dump. Some headers, like main, were moved to other parts of the section. The main header also contained fewer lines of assembly to execute.

by tng23spo600 at April 18, 2016 04:22 AM

April 17, 2016

Tom Ng

On two Optimizations: -fmerge-constants and -fpeel-loops

Recently, I had the opportunity to spend time to examine two optimizations available in gcc: -fmerge-constants and -fpeel-loops. For each optimization, I examined what it did and tried to write code that would show the effect of the optimization.

-fmerge-constants is an optimization where, as its name suggests, the compiler attempts to merge all constants that have the same value. The idea is that if variables have the same value and are guaranteed not to change (constants), a single copy can be referenced instead of keeping multiple copies. This can help create smaller executables since fewer values are stored. This optimization even spans compilation units, meaning it’s not limited to the variables of one specific source file.

The simple program I made to highlight the optimization prints constant variables with the same values in main, and includes the header of another function that does the exact same thing with the exact same values. I compiled two versions of the program: one with the optimization and one without. When comparing the disassembly of the two executables, the only things that changed were the addresses in the mov instructions. There was no increase or decrease in instructions, yet the file size of the optimized executable was smaller. The savings came from the .rodata section, where, in the optimized executable, each value was stored only once instead of multiple times. -fmerge-constants is enabled at optimization levels -O, -O2, -O3 and -Os, and rightfully so, because it should almost always be used. There are essentially no downsides to this optimization unless your code calls for violating the constness of one of the variables you declared.

-fpeel-loops is an optimization where either the first or last iteration of a loop is hoisted out, which can potentially eliminate a variable or an operation inside the loop. It is possible for the first or last iteration of a loop to be a special case, different from the rest of the iterations. Handling this case tends to require extra code that ends up inside the loop. With this optimization enabled, the compiler can move the special case out of the loop by handling it before or after the loop, and the loop can then be adjusted to fit the remaining cases while eliminating a variable or operation. This can enable further loop optimizations; for example, if the compiler determines the number of remaining iterations is small enough, it may unroll the loop completely, eliminating it.

Unfortunately I don’t have any working code that triggers this optimization, but here is a trivial example of what could happen with it enabled. Given:

int x = 2000;
int y;
for (y = 0; y < 10; y++) {
     printf("%d\n", x * x);
     x = y;
}

The loop prints the square of x and then sets x to the loop counter. Notice, however, that the way this code is written makes the first iteration an anomaly: x starts at a completely different number (2000) before settling into what would be normal for the loop. What the compiler can do is hoist the first iteration out and put it before the loop:

int x = 2000;
printf("%d\n", x * x);   /* peeled first iteration */

Following this, the loop can be adjusted so that x itself drives the iteration:

for (x = 0; x < 9; x++) {
     printf("%d\n", x * x);
}

The end result looks like:

int x = 2000;
printf("%d\n", x * x);
for (x = 0; x < 9; x++) {
     printf("%d\n", x * x);
}

Notice the entire elimination of the y variable in the snippet. Furthermore, an entire operation inside the loop was eliminated (x = y). In this particular example the number of lines of code was reduced, but that does not always happen: it depends on the operation inside the loop, and additional logic would be needed to handle the special case if the exit condition of the loop is not constant.

by tng23spo600 at April 17, 2016 01:24 AM

April 16, 2016

Tom Ng

Lab3: Programming in Assembly

Lab 3 was a fun lab about learning assembly. Assembly is certainly very different from higher level languages but it has its charms. Programming in assembly felt like doing a time consuming puzzle that you know you can eventually figure out and when you do, it feels great. Assembly brings back the joy of printing your first variable or making your first working loop which is probably taken for granted when you’ve programmed enough. The fun part or challenge of assembly comes from its restrictions: there is a limited set of instructions which must be combined in order to do something that may have taken only one line of code in another language. Assembly also uses a limited number of registers which are used to store values and must be used sparingly.

I was tasked with coding a loop that would simply print the numbers from 0-9. Sounds simple enough; that would be two or three lines of code in another language, but not in assembly. To start, I copied the “Hello World!” example and tried to compile it. It worked, but when I tried to execute it I got an exec format error. It took a while before I figured out what was wrong, because I thought it was something like a syntax error, but the message did not contain any line numbers for where an error could have been. It turned out I had been trying to execute the object file all this time, and that the assembler (as) doesn’t link by default. I thought that was weird, because when using gcc to compile C or C++, it links by default unless the -c flag is used. I then had to actually call the linker ld, whereas with C or C++ I would use gcc and it would invoke ld for me. I guess I should have realized sooner, since I had to keep typing chmod +x before attempting to execute what I thought was the compiled executable. Well, now I could start making loops.

It turned out the loop was actually already given to me, and what I had to do instead was code what the loop was supposed to do. The requirement was to print “Loop: ” and then the current iteration of the loop on one line. I thought of printing “Loop: ” and then printing the number, but it turns out you can actually modify the string in place by writing to the string’s address, offset by the desired amount. This required declaring the string with extra spaces at the end: “Loop:   ”. It was also necessary to add 48 to the number in order to print its ASCII equivalent. The next part of the lab expanded on the previous part by printing the numbers up to 30. This meant having to print numbers up to two digits long, but only a single digit could be printed at a time. Printing each digit meant breaking up the number with the help of division and remainders, which the div instruction conveniently provides. Leading zeroes also had to be suppressed, which was simple enough since individual digits were being printed anyway. To suppress the zero, just before the code prints the tens digit, a check is made to see whether it is zero; the digit is printed only if it is not, and otherwise the code jumps past the printing. The code I wrote for the lab only allows numbers with two digits to be printed, but the logic could be extended to larger numbers, probably at the cost of using more registers.

Finally I had to take the x86_64 assembly and port it to AArch64 assembly. Another deceptively hard task. I started by copying the code from the x86_64 system to the AArch64 system and recompiling it. There were enough errors that I decided to start from scratch, and upon further reading on the wiki about AArch64 assembly, it was different enough that I had to rewrite the program from scratch anyway. The first interesting part I had to change was the code to divide the number the loop was on. The x86_64 assembly had a lot of mov instructions because some instructions, like div, required certain values to be in certain registers. AArch64 turned what was 6 lines of code in x86_64 into just 2, since its udiv instruction does not require the values it works with to be in specific registers. The block of code used to print the message was also shorter by 3 lines, because x86_64 assembly again required certain data to be in certain registers. Most of the code removed from the AArch64 version was, in fact, mov instructions. But AArch64 assembly also has some downsides compared to x86_64. Some instructions such as udiv accept only registers as their arguments, whereas some x86_64 instructions can also accept immediate values, so a few mov instructions were still required to use some AArch64 instructions. Overall, though, AArch64 required fewer lines of code than x86_64 assembly.

Doing lab 3 has taught me quite a bit about assembly, but I’m definitely still in the realm of a beginner. Probably the most important thing I’ve learned is that there are different kinds of assembly for different architectures. I always thought assembly was one language, but someone who knows x86_64 assembly may not necessarily know AArch64 assembly. Assembly is some hard stuff, and it’s kind of amazing knowing there are people who work with it every day.

by tng23spo600 at April 16, 2016 07:45 PM

April 14, 2016

Vishnu Santhosh Kumar

Final Project : Phase 3.3

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve the performance of the software on AArch64 and x86_64 systems.

Phase 3.3 –  Interactions with the lz4  community in GitHub.

Date : March 29,2016 – March 30,2016

March 29,2016

Opened an issue to improve the documentation for the LZ4 compression program.

Details :

I opened an issue on GitHub to improve the project’s documentation. The information provided in the present documentation is not enough to understand how the program works or its flow. No documentation describes the purpose of the individual directories and the files in them. I believe that better documentation would really help beginners understand the program and contribute to its development, so I opened this issue on GitHub.

I got the reply that the point of a format specification is to allow any kind of implementation to exist and still remain compatible with the others. This is a good thing, as different environments come with different restrictions and strengths, so a single algorithm wouldn’t fit them all. He also encouraged me to understand the format and then create my own program to read it, which would give better insight and, as a result, a better chance of understanding the reference source code. I found a similar issue opened on GitHub, and the response from the owner was that there is constantly a need to balance between a “simple usage for a beginner” and a “hardcore total-control experience” where little details matter a lot. Also, the project has been worked on for a very long time (17 years) and many patches were merged without proper documentation. He said that he will try to convert the comments to Doxygen, a documentation generator, a tool for writing software reference documentation.


March 30,2016

Opened an issue regarding the increase in compressed file size

Details : 

I compressed a file containing only unique characters to check the performance of the code. The size of the file increased by nearly 150%, which I believe happened because of the encoding of the non-matching literals. So I opened an issue to make some changes in the algorithm: encode a literal only if a match has been found, and only if the encoded literal size is smaller than the uncompressed data. That way, the uncompressed literal would keep the same size as before, and the worst-case file size could be reduced.
The response from the community was that the default minimum size of the header is 15 bytes. Even though my suggestion could improve the worst-case situation, the command line utility cannot do it, because it must generate a result decodable on many different systems and for different use cases, so it needs the frame headers to ensure decoding success. In order to compress a file without the header, a function called ‘LZ4_compress()’ can be used directly at the API level.



by vishnuhacks at April 14, 2016 05:52 PM

Final Project : Phase 3.2

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve the performance of the software on AArch64 and x86_64 systems.

Phase 3.2 – Interactions with the lz4 community in the Google forum.

Date : March 28,2016


March 28,2016

Post : Which is the default format used, Frame format or Block format?

Details : 

While I was working on Phase 2.1, I came across two different formats used by the compression code: the frame format and the block format, which have different structures. By format I mean the structure of the compressed block of literals, which I explained in Phase 1.1. The default format actually used by the code was not mentioned in any documentation, so I asked my question in the community forum.

I got the reply that the “block format” is used for compression in LZ4, and that there is no default compression format; the frame format is used for testing and debugging. I also understood that the frame format is easy to use, but the block format is meant for writing advanced applications with specific limitations (e.g. a tight working memory/bandwidth/size/speed budget). I decided to make my changes in the block format to improve the optimization.

In summary: the block format is used in compression, the frame format is used in testing and debugging, and there is no default format used for compression.

by vishnuhacks at April 14, 2016 04:27 PM

Final Project : Phase 3.1

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve the performance of the software on AArch64 and x86_64 systems.

Phase 3.1 – Interactions with the lz4 maintainers in the Google community blog.

Date : March 22,2016  –  March 26,2016

March 22,2016

Post:  Difference between ‘lib’ directory and ‘programs’ directory?

Details :

During Phase 1.1 of my final project, I found it really hard to understand the purpose of the various files in the different directories. One of the most confusing pairs was the ‘lib’ directory and the ‘programs’ directory. The files in each had similar names, and the code was confusing (e.g. a lot of macros), so I was unable to understand the purpose of these files. I posted my question in the Google community forum.

The first reply I got was that the “lib” directory contains LZ4’s core library, while the “programs” directory contains application programs which use that library: the LZ4 command line utility, the I/O submodule and test tools (e.g. the benchmark). I understood that optimizations for the I/O module, like increasing buffer sizes or multi-threading the input/output, can be made in the “programs” directory. The “lib” directory contains the code for compression and decompression, the block formats of the compression, etc., so any optimizations on the compression/decompression themselves should be done on the “lib” directory files.

The ‘lib’ directory maintains the core program, and the ‘programs’ directory holds the supporting programs for the user interface, input/output operations, buffer management, etc.


March 26,2016

Post:  How to determine maxOriginalSize and correctly allocate a buffer for decompression?

Details :

While I was analyzing the code for decompression, I noticed that the buffer size used to decompress the file is fixed (128 KB). I thought it would be better to include the original size of the file in the header of the compressed file; that way a better decompression buffer size could be derived from the header, which could help improve the speed of the decompressing code. I posted my question on the community blog.

I got my first reply the same day. The owner of the blog, a profile named “Cyan”, replied that a compression block uses a maximum of only 64 KB, so the 128 KB buffer in the decompression code is enough to decompress any file. He also said that future extensions of the compression format will allow embedding the file size in the header. Two other users posted their opinions after this, supporting Cyan’s post.

Version 6 of the LZ4 software is planned to add the file size in the header. That can be used by the decompression code to achieve the maximum possible decompression speed.


by vishnuhacks at April 14, 2016 03:59 PM

April 11, 2016

Andrei Topala

Project: Improving lrzip; Phase Two

In my last post, I mentioned that lrzip, when given the -z option, complains of an illegal instruction and crashes on AArch64. I did some debugging with GDB, hoping to find the cause of this, but the situation seemed to be that the program just sort of lost itself during execution:

Program received signal SIGILL, Illegal instruction.
0x000003ff7b000000 in ?? ()

Even though the package had been compiled with -g (to produce debugging information) and we’d downloaded all the separate debuginfos, the program still lost its way, and it couldn’t tell us where it was.

The reason for this turned out to be very simple. I took a look at the files in the libzpaq folder and found, in libzpaq.h, the following comment:

“By default, LIBZPAQ uses JIT (just in time) acceleration. This only
works on x86-32 and x86-64 processors that support the SSE2 instruction
set. To disable JIT, compile with -DNOJIT. To enable run time checks,
compile with -DDEBUG. Both options will decrease speed.”

JIT code is created during execution time and then executed. libzpaq.cpp contained several functions that filled various arrays with hex numbers representing machine instructions. Those arrays were then written to memory and executed as code. The problem is that those machine instructions are x86 instructions. So, on our AArch64 machines, the program was trying to execute data it had no business executing, and that’s why it crashed.

libzpaq.cpp has a check to make sure that the platform it’s being compiled on is x86, but it isn’t a very good check:

// x86? (not foolproof)
  const int S=sizeof(char*);      // 4 = x86, 8 = x86-64
  U32 t=0x12345678;
  if (*(char*)&t!=0x78 || (S!=4 && S!=8))
    error("JIT supported only for x86-32 and x86-64");

That snippet of code checks that the system uses little-endian format (which x86 does) and that a pointer is either 4 or 8 bytes long. AArch64 also uses little-endian format (AArch64 is bi-endian in regards to reading data, but by default it is little-endian) and has same-sized pointers, so it passes the test, which is something that shouldn’t happen.

It seems that this is a known problem. An issue was submitted to lrzip’s GitHub page a little over a month ago, but no solution was proposed beyond that mentioned in the libzpaq.h file (i.e. passing the -DNOJIT flag to the C++ compiler when compiling the code).

I first thought to translate the encoded x86 instructions into the equivalent AArch64 instructions . . . but there are a lot of instructions, and I don’t think I could get it done in time (and I am sure I would run into trouble concerning the handling of things like x86’s variable-length vs AArch64’s 32-bit instructions, etc.)

I wanted to see how big of a difference the JIT code makes, so I tried running the ZPAQ algorithm on x86_64 with and without the -DNOJIT flag.

Here is the result of running lrzip -z normally:

[andrei@localhost lrzip-0.621]$ ./lrzip -z text 
Output filename is: text.lrz
text - Compression Ratio: 4.810. Average Compression Speed:  1.355MB/s. 
Total time: 00:06:08.42

To compile the program with the -DNOJIT flag, we just need to add it to the C++ flags when we run the configure script:

[andrei@localhost lrzip-master]$ ./configure CXXFLAGS="-g -O2 -DNOJIT"

The -g and -O2 flags are set by automake (which lrzip uses to create its configure script and makefiles) if CXXFLAGS isn’t defined by the user, so if we are defining CXXFLAGS we need to set them as well, for consistency’s sake.

Now, here’s the same operation without the JIT code:

[andrei@localhost lrzip-0.621]$ ./lrzip -z text 
Output filename is: text.lrz
text - Compression Ratio: 4.810. Average Compression Speed:  0.919MB/s. 
Total time: 00:09:04.74

Without the JIT code, the program takes almost 3 extra minutes, or nearly 50% more time (equivalently, the average compression speed drops by about 32%).

Chris suggested that such a large discrepancy between the JIT code and the compiled C code might indicate that the C code isn’t quite optimized as well as it could be. I looked at the C code and, while there are some small pieces I could try to optimize, I do not really understand the algorithm at all, and the code is difficult to read. I’ll try to keep going but I don’t have anything to show for my efforts here.

Anyway, at least I edited libzpaq.cpp. I added the following preprocessor directives near the top of the file:

// GNU
#ifdef __arm__
#define NOJIT 1
#endif
#ifdef __aarch64__
#define NOJIT 1
#endif

NOJIT, which is supposed to be set with the -DNOJIT flag, is checked in every function that uses JIT code, and if it is defined the program uses regular C code instead. So, with this change, if the preprocessor detects that we’re on an ARM machine, it just sets NOJIT and the other conditionals take care of the rest. It’s an inelegant solution (and I suppose it would make more sense to check instead that the program is running on an x86 architecture and enable NOJIT otherwise) but it works on aarchie and betty, and the ZPAQ algorithm defaults to using the C code. I’ve only tested this on gcc, though; other compilers have different (or no?) macros to check for ARM architectures, so this is not the best solution. I’ll try to refine it but I don’t know if it’ll be worth submitting to the community, since it is such a simple thing and could be achieved just as easily by compiling the code with the -DNOJIT flag.
Modifying the script to check the architecture and add -DNOJIT to CXXFLAGS accordingly also works, but I think it’s better just to have a safeguard in the libzpaq.cpp source file itself, because adding an extra C++ flag by default on non-x86 machines is not really expected behavior.

I took a break from the ZPAQ algorithm and tried to find something to optimize in the regular LZMA code (which is the path taken by default when lrzip is called without specifying an algorithm), but this, too, proved fruitless (although the code is much easier to read). Two functions take up 90% of the execution time, and I couldn’t improve either of them.

There were two pieces of code in particular (the first from LzmaEnc.c and the second from LzFind.c) that I tried to wrest some performance improvement from.

p->reps[3] = p->reps[2];
p->reps[2] = p->reps[1];
p->reps[1] = p->reps[0];
p->reps[0] = pos;

Here I tried to use memmove to shift the array (it’s an UInt32 array) forward all at once, but it didn’t have any effect on performance.

if (++len != lenLimit && pb[len] == cur[len])
          while (++len != lenLimit)
            if (pb[len] != cur[len])

This I tried to condense to a single while loop (or at least one while loop and a single if statement) in order to reduce the number of branches and operations, but that actually made the code slower. It might be the case here that the compiler optimizes this particular logic layout best.

So, that’s where things stand at the moment. It still looks like the ZPAQ C code might be the most promising venture, if only because I can’t seem to optimize the LZMA code at all. I’ll keep trying both options, though. I also haven’t looked into other compiler options yet, so that’s also still a possibility.

by andrei600 at April 11, 2016 06:46 AM

Giuseppe Ranieri

The Trials and Tribulations of Optimizing Zopfli

What Didn't Work

When I first went about doing this, my methods of testing were incorrect, and there was research I should have done beforehand.

For example, reading more about zopfli compression, the reason it is slow is that that was the entire point of its design: it is meant to be about 80x slower than gzip while providing only up to an 8% improvement in compression. So it became a tall order to try to better it.

Here's comparing the size differences between the two.


As you can see, in this case for the 10mb file made with the command:

base64 /dev/urandom | head -c 10000000 > 10mb.txt

it can only provide a 0.6% improvement. With less random text it can probably do better, but still, we're dealing with razor-thin margins.

What I first tried to do was follow this guide, but what ended up happening was that I got even worse performance than by using the makefile. And this wasn't a permanent solution either, as I wouldn't be able to port the exe file to Linux.

Next I tried to change the code anywhere I could to see even a bit of a boost, changing any data types where I thought it could help. The only problem was that it wouldn't compile correctly. The code was advanced enough that it was difficult to pinpoint where and what to change. So I thought it was a bust to even attempt this.

What Did Work 

So I thought my only chance was to change the compiler options around and see what sticks. In the makefile and the documentation, it's suggested that the program be compiled with -O2 (along with some other options that allow it to compile correctly and such). I thought this was odd: why not go for the full -O3?

Here are the results comparing -O2 and -O3 on the xerxes server.

-O2: real    0m29.852s
-O3: real    0m29.847s

As you can see, there is no difference that can really be attributed to the higher optimization level. I ran it many times, with -O2 sometimes being faster than -O3 and vice versa. I can only assume the reason the documentation suggests -O2 over -O3 is that it compiles quicker, and maybe uses less memory, although I did not test for this.

So again I was stumped, as that means the optimization levels won't matter, until I remembered that I should try it on the aarchie server! And the results were terrific.

-O2: real    3m56.909s
-O3: real    3m50.971s

I tested multiple times at both levels to make sure it wasn't just a fluke, but the results were always within +/- 1 second of their respective levels. This means that changing between the levels can provide as much as a ~2% difference in speed! When something takes this long to compress, that is noticeable when you're working with jobs that could take hours.
Testing with larger files on the xerxes server with different optimization levels showed the same speed across all of them, unlike the difference seen on ARM64. This means the change is an ARM64-only improvement, which is probably why it was not noticed or accounted for in the makefile.

What Now?

Before submitting these findings to the upstream, I'll have to gather more research into why this happens. There must be one or two flags enabled at optimization level 3 that help only on the ARM64 platform. Once I find out which, I could push an ARM64-specific makefile change so others could see the same performance boost.

by JoeyRanieri ( at April 11, 2016 03:28 AM

Berwout de Vries Robles

Phase 2 Process

So in my previous blog I found an optimization, but I was not too happy about it because it seemed like such a simple step to take. I guess that is sort of the point with compiler optimizations though. The more efficiently you can get things done, the better. The compiler assists with this very handily.

I tried to get the compiler to tell me some possible optimizations as well, using -fopt-info-vec-missed to see what vectorization possibilities it missed and why. The compiler came up with about 500 lines of missed vectorization, so I wasn't sure where to start with that. I then used -fopt-info-vec to see that some things did actually get vectorized, but that got me no further either.

I tried profiling the program using gprof; it turned out that the program spends its entire time in a single large function, whether compressing or decompressing. This brought me no closer to finding another optimization.

I have fiddled with the DUSERMEM and DREGISTERS to see if it makes a difference, but I think there is a set maximum to these values somewhere, although I do not know why. To me they are sort of magic numbers right now.

All in all, the search and constant waiting for benchmarks was frustrating to me, especially because I found next to nothing besides the obvious compiler flag. If I were to attempt a project like this again, I would add an extra phase: my phase 2 would be writing an automatic benchmarking suite where you press a button and a reliable benchmark rolls out the other end.

I also wouldn't use my own laptop to benchmark on, because for some reason the virtual machine gives very inconsistent performance, and I can never tell whether a measurement is an accurate representation of a change or not. Luckily betty had my sanity covered there.

by Berwout de Vries Robles ( at April 11, 2016 02:13 AM

Nina Wip

Compiling Optimization Betty

After benchmarking x86_64 with the -O3 optimization flag, it's time to test this on AArch64.
Since I can only work from the command line on the server, I needed a command to replace -O2 with -O3 in all the Makefiles. The easiest command I found was the following:
find -name Makefile | xargs sed -i 's/-O2/-O3/' 
The following benchmarks were done on AArch64, on the server Betty.

With -O2 flag
         10.5 MB file    105 MB file    1.5 GB file
real:    0m0.037s        0m0.345s       0m4.792s
user:    0m0.028s        0m0.323s       0m4.551s
sys:     0m0.001s        0m0.027s       0m0.400s

With -O3 flag
         10.5 MB file    105 MB file    1.5 GB file
real:    0m0.036s        0m0.343s       0m4.768s
user:    0m0.028s        0m0.323s       0m4.499s
sys:     0m0.001s        0m0.027s       0m0.426s

As you can see, -O3 did not do anything on AArch64. I thought this was very strange and checked the executable file to see if it had changed at all. The file did change: it got larger, as expected. Yet there is no gain in speed; comparing the real times for the 1.5 GB file, it wasn't even 1% faster. So for AArch64 I recommend using -O2, because -O3 barely changes the run time and the executable file stays smaller.

For Betty I'll have to find another optimization possibility, although I don't yet know what the next one would be. Probably making something specifically for AArch64, but this would cost far more time.

Another way to go is to add the compiler flag -fopt-info-missed to find missed optimizations and see if I can do something about them. Source:

by Nina ( at April 11, 2016 01:20 AM

Compiling Optimization x86_64

As mentioned in the previous post I am going to take a look at the compiler flags and how I might be able to optimize them.

The current flags used for the C compiler are:
-pthread -g -pg -O2 -MD -D_FORTIFY_SOURCE=2 -Wpointer-arith -Wmissing-declarations -Wmissing-prototypes -Wshadow -Wwrite-strings -Wcast-align -Waggregate-return -Wbad-function-cast -Wcast-qual -Wundef -Wredundant-decls -Wdisabled-optimization -Wfloat-equal -Wmissing-format-attribute -Wmultichar -Wc++-compat -Wmissing-noreturn -funit-at-a-time -Wall -Wstrict-prototypes -Weffc++
As you can see, there are a lot of flags used to compile this program with the C compiler. What caught my eye was that they're using the -O2 flag and not -O3, so I saw this as an optimization possibility and went straight in to change it. Once I had changed all the flags, I did another benchmark of the program to see if it had helped at all.

To check whether it did anything, I looked at the executable file: the -O3 optimization made it larger, from 2.7 MB to 3.5 MB.

Note: Don't forget to take out the -pg flag as well, to get a correct benchmark. -pg makes your program slower, so you wouldn't have a fair comparison.

The following benchmarks are for x86_64:

With -O2 flag
         10.5 MB file    105 MB file    1.5 GB file
real:    0m0.053s        0m0.287s       0m3.458s
user:    0m0.021s        0m0.200s       0m2.793s
sys:     0m0.003s        0m0.012s       0m0.184s

With -O3 flag
         10.5 MB file    105 MB file    1.5 GB file
real:    0m0.045s        0m0.258s       0m3.071s
user:    0m0.020s        0m0.197s       0m2.756s
sys:     0m0.002s        0m0.014s       0m0.177s

By comparing the real times you can see the -O3 flag makes the program 11.2% faster, which is a pretty nice optimization. So I recommend they replace the -O2 flag with -O3.

Now it's time to benchmark the software on Betty and hope to see an optimization there as well. 

by Nina ( at April 11, 2016 01:12 AM

April 10, 2016

David Wesley Au

pxz.c Phase II - Optimizations

So I tried the optimizations I posted in my last blog post, using these two tar.xz files.

1. linux-4.6-rc2.tar.xz ( 85.3 MB

2. linux-4.5.tar.xz ( 84.3 MB

I used the decompression flags to test the speed and the time each takes, with the non-edited version and the "optimized" version. These are my results: 

Note: MM:SS:MS

Note: These were tested on an ARMv8 AArch64 system known as aarchie.

Default Version:
./pxz -d linux-4.6-rc2.tar.xz  ===== 1:11:20

./pxz -dcv linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06 

./pxz -dcv -T4 linux-4.6-rc2.tar.xz > /dev/null ===== 9.5 MiB/s 1:06

./pxz -dcv -T8 linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06


./pxz -d linux-4.5.tar.xz ===== 1:08

./ pxz -dcv linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05

./pxz -dcv -T4 linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05

./pxz -dcv -T8 linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05


"Optimized" Version:
./pxz -d linux-4.6-rc2.tar.xz ===== 1:11:19

./pxz -dcv linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06

./pxz -dcv -T4 linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06

./pxz -dcv -T8 linux-4.6-rc2.tar.xz > /dev/null ===== 9.6 MiB/s 1:06


./pxz -d linux-4.5.tar.xz ===== 1:09

./ pxz -dcv linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05

./pxz -dcv -T4 linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05

./pxz -dcv -T8 linux-4.5.tar.xz > /dev/null ===== 9.6 MiB/s 1:05


Given these results, I can see the small changes I made to the compression tool did almost nothing to the overall result, with the exception of the 1 ms difference in the default decompression. After searching through the pxz.c file multiple times on different occasions, I cannot find another possible shortcut or optimization to apply to the application. Between the two files, which differ by about 1 MB, there is only a negligible time difference. 

by David Au ( at April 10, 2016 10:38 PM

Nina Wip


To enable gprof profiling you need to add the -pg flag to the compiler options. These options can be found in the Makefile; just look for the terms "CFLAGS", "CPPFLAGS", and "LDFLAGS". After each of these options add the -pg flag, and profiling should be enabled.

At first I couldn't figure out why my software wasn't producing the gmon.out file, which an executable compiled with -pg generates for you. I had added -pg to all the flags in the top-level Makefile, but apparently there were more Makefiles where I needed to change the flags. There were 6 Makefiles in total, so I added -pg to the compiler flags in all of them.

Luckily this did generate the gmon.out file so I could take a look at the profiling by running the following command:
gprof md5deep > gprof.txt
This generates a readable file containing the profile. The strange thing was that my gprof output said every function took 0 seconds, even though it showed a function being called 8 times. gprof's timings come from sampling (typically every 10 ms), so functions that finish faster than the sampling interval can show up as taking no time at all.

The function that gets called 8 times is hash_final_sha1 which looks like this:
void hash_final_sha1(void *ctx, unsigned char *sum) {
    sha1_finish((sha1_context *)ctx, sum);
}
Since it's a one liner there isn't much I can optimize here. But it does call other functions which I can take a look at. I went through all the functions that call each other until I found a function that actually did things by itself instead of just forwarding to another function. I ended up at the following function:
void sha1_update( sha1_context *ctx, const unsigned char *input, size_t ilen )
{
    size_t fill;
    unsigned long left;

    if( ilen <= 0 )
        return;

    left = ctx->total[0] & 0x3F;
    fill = 64 - left;

    ctx->total[0] += (unsigned long) ilen;
    ctx->total[0] &= 0xFFFFFFFF;

    if( ctx->total[0] < (unsigned long) ilen )
        ctx->total[1]++;

    if( left && ilen >= fill )
    {
        memcpy( (void *) (ctx->buffer + left),
                (const void *) input, fill );
        sha1_process( ctx, ctx->buffer );
        input += fill;
        ilen  -= fill;
        left = 0;
    }

    while( ilen >= 64 )
    {
        sha1_process( ctx, input );
        input += 64;
        ilen  -= 64;
    }

    if( ilen > 0 )
        memcpy( (void *) (ctx->buffer + left),
                (const void *) input, ilen );
}
Looking at it, I could not find a possible way to optimize this function. Since it took a while to find a suitable function, and I couldn't find an optimization, I'm going to look into the compiler flags next, since they use the -O2 flag and this should be possible to change to -O3.

by Nina ( at April 10, 2016 08:45 PM

Yunhao Wang

bzip2 optimization-phase two

There are two ways to optimize the bzip2 compression time.

In the mainSort function of blocksort.c:

  1. Using the memset() function instead of a for loop to set each element of an array to zero. This removes the branches of the for loop and leans on the shared library's optimized implementation.

-for (i = 65536; i >= 0; i--) ftab[i] = 0;

2. Manually unrolling the loop. This will slightly increase the run-time memory usage, and the author notes that a manually unrolled version produced wrong code when compiled with gcc 2.7.x.

It reduces the number of branches in the program, which reduces the compression time.

-   //for (i = 1; i <= 65536; i++) ftab[i] += ftab[i-1];

+for (i = 1; i<= 65536; i+=8){
+       ftab[i] += ftab[i-1];
+       ftab[i+1] += ftab[i];
+       ftab[i+2] += ftab[i+1];
+       ftab[i+3] += ftab[i+2];
+       ftab[i+4] += ftab[i+3];
+       ftab[i+5] += ftab[i+4];
+       ftab[i+6] += ftab[i+5];
+       ftab[i+7] += ftab[i+6];
+       }

After compiling the program, it passed the 6 tests from its own test suite on both the x86_64 and AArch64 machines.
Doing 6 tests (3 compress, 3 uncompress) …
If there’s a problem, things might stop at this point.

./bzip2 -1  < sample1.ref > sample1.rb2
./bzip2 -2  < sample2.ref > sample2.rb2
./bzip2 -3  < sample3.ref > sample3.rb2
./bzip2 -d  < sample1.bz2 > sample1.tst
./bzip2 -d  < sample2.bz2 > sample2.tst
./bzip2 -ds < sample3.bz2 > sample3.tst
cmp sample1.bz2 sample1.rb2
cmp sample2.bz2 sample2.rb2
cmp sample3.bz2 sample3.rb2
cmp sample1.tst sample1.ref
cmp sample2.tst sample2.ref
cmp sample3.tst sample3.ref

If you got this far and the ‘cmp’s didn’t complain, it looks
like you’re in business.

On the x86_64 machine, compressing the Linux kernel tarball (version 4.5), 628 MB (658,053,120 bytes):

The compression time before

real    1m39.944s
user    1m39.464s
sys     0m0.388s

The compression time after

real    1m27.951s
user    1m27.428s
sys     0m0.443s

Running cmp on the files compressed before and after the change confirmed they are identical.

It reduced the compression time by nearly 12% from the previous version.

On the AArch64 machine, compressing the Linux kernel tarball (version 4.5), 628 MB (658,053,120 bytes):

The compression time before
real    13m32.371s
user    13m29.980s
sys     0m1.890s

The compression time after

real    11m59.926s
user    11m58.790s
sys     0m0.960s

Running cmp on the files compressed before and after the change confirmed they are identical.

It reduced the compression time by nearly 11.5% from the previous version.

by yunhaowang at April 10, 2016 11:55 AM

Tom Ng

xzutils – The Search for an Optimization

A while ago, I announced for the SPO600 project that I would be looking into xzutils for an optimization. I particularly wanted to see if there were any optimizations to be had by simply changing the flags used in the compilation process.


Since I wanted to change the flags, the make process was where I had to look first. After extracting a fresh tarball, the only things that had to be done were running the configure script and then running the resulting makefile the script created. My first approach to changing the compilation process was to modify the root makefile generated by the configure script. The only changes I intended to try so far were modifying the flags included in the makefile variables CFLAGS, CPPFLAGS and LDFLAGS. Running the modified makefile, however, produced the usual output without the modifications I had made. Since that didn’t work, I then tried to modify all the makefiles created by the configure script with a sed command. Unfortunately there was still no change to the binary, either because sed didn’t work or because the changes sed made didn’t affect anything. I also attempted to hardcode the changes into the file without success.


I did some poking around reading random documents that came with the tarball and chanced to read a doc intended for packagers which included an example that showed how to pass flags to the configure script. The flags I wanted to add were -g, -pg and -O3. I chose these flags because according to the output of the unmodified makefile, -O2 was used and I wanted to investigate why -O3 wasn’t. -g was already present but I wanted to add it just to make sure it was used everywhere. Finally -pg was used so that the program could be profiled. Instead of running just ./configure, I was able to pass the flags I wanted to test with ./configure CFLAGS=’-g -O3 -pg’. When running the resulting makefile, I was able to see the new flags I added in the output and the program compiled successfully with these flags.


I created three files in a test directory: a 1 GB file filled with zeroes, a 1 GB file filled with random data, and a relatively smaller file (only 743,223 bytes) consisting of Lucy Maud Montgomery’s Anne of Green Gables and Shakespeare’s The Tempest. The test process for both the unmodified and modified xzutils involved timing how long it took to compress a file, checking the resulting file’s size and checking the hash using sha256sum; then timing again how long it took to decompress the resulting .xz file, checking the size and hash again, and finally comparing them to the original.


In all three tests, the modified version was able to compress a file and decompress it without any data loss. This came as a surprise to me because I would imagine the reason why the xz developers would use -O2 instead of -O3 would be because one or more of the -O3 optimizations would break something in the (de)compression process or may produce a corrupted file. If the (de)compression process works fine with -O3, and using -O3 is supposed to make the program run even faster, then surely changing -O2 to -O3 would be an easy change for a speed boost. That turned out to be not the case; -O3 actually made the timings for the tests worse. More accurately though, -O3 created almost no significant changes in time and the negligible differences tended to lean on the slower side for -O3 xzutils. Well I can see now why they used -O2 instead of -O3.


It seems odd to me that -O3, which is supposed to optimize for performance, would produce a worse-performing binary than one compiled with -O2. As far as the tests showed, using -O3 instead of -O2 did not break anything but just made the program run slower. In searching for an optimization, I plan to examine which flags -O3 enables that -O2 doesn’t and see which ones positively affect the program and which ones negatively affect it. One or more of the flags certainly affect the program in a negative way, but there might also be a flag or more that can speed the program up. It is also possible that one optimization conflicts with another, slowing down the program. When querying gcc for which flags are turned on and off at -O2 and -O3 and running diff on the output, there are roughly 10 more flags enabled at -O3, so narrowing down the ‘bad’ flags shouldn’t be too hard. I look forward to investigating each flag with its own xz build, and hopefully at least one of them can contribute positively to the performance of the program.

by tng23spo600 at April 10, 2016 12:22 AM

April 07, 2016

Vishnu Santhosh Kumar

Final Project:Phase 2.1

Software Package : LZ4

Official Website :

Git Hub :

Final Aim: Optimize and improve performance of the software on AArch64 and x86_64 systems.

Phase 2 – Aim: Optimize the performance of the software

The basic optimizations discussed in Phase 1.1 have been done, and there is no performance difference to show in the new phase. Some higher level of optimization is required.

♣  A Surprising Result !!
The very oldest commit of the LZ4 software on GitHub was in April 2014. I downloaded that version (#r116) and tested it to see the performance difference against the latest release of LZ4 (#r131), which I’m working with. To my surprise, no optimization had happened in the software since that time: the older version showed the same benchmarking results I got from the latest release. The older version also looks easier to understand and is well documented. The “lz4.c” file of that version has an option to set the memory flag, i.e. the amount of cache memory the compressor uses. More cache gives higher compression and a smaller output, and I found that the increase in time is negligible compared with the compression-ratio gain a larger cache size can provide. I included some preprocessor code to check the architecture of the system and give the highest possible cache size to the compressor. On the “matrix” system, the processor is a “Dual-Core AMD Opteron(TM) Processor 8220” with a 64 KB level-1 data cache (I did this performance comparison only on x86). By providing this maximum cache size instead of the 16 KB default value, the compressor shows a 4.24% increase in compression over the latest release. Then came the disappointing part: I quickly looked into the latest release to make the same change there too, but unfortunately there was no such piece of code to manage the memory cache size, nor anything similar. “Old is Gold !!!!!”

Hit- 1  

Objective : Increase the probability of finding similar literals in a compression block.

Link to GitHub

Performance Improvement :
      0.22% increase in Compression.

File Changed : /programs/lz4io.c

The compression uses a block size to read the (uncompressed) data from the source file and find matching literals. Before my changes, a fixed buffer size was read from the source file. I found that varying the size of the input buffer read from the source file helps find more matching patterns in the text; i.e. the probability of finding similar literals within a single fixed-size block is lower than the probability of finding the same literals across varying block sizes.

Hit- 2

Objective :  Replace the most expensive operations in code to improve Compression Speed.

Link To GitHub

Performance Improvement :
      1.1 MB/s increase in Compression Speed.
      No change otherwise.

File Changed : /programs/lz4io.c

In the compression loop of the program, compression progress is displayed on each iteration. It used an expensive multiplication to compute the progressing compression ratio; I replaced the multiplication with a cheaper bitwise operation, which increased the compression speed by 1.1 MB/s.

Hit- 3

Objective: Optimize the code using CFLAGS in Makefile.

Link To GitHub

Performance Improvement:
          24.9 MB/s increase in Compression Speed.
          313.4 MB/s Increase In De-compression Speed.

File Changed : /programs/Makefile

The ‘-march’ flag is added to CFLAGS in the makefile to optimize the software for the x86 architecture. The ‘-march’ option generates instructions for a specific machine type. I included an ‘if’ condition in the Makefile to check whether the machine is x86 and, if so, add the -march option to CFLAGS. This increased the average compression and decompression speed of the software. This optimization is not possible on AArch64.

Final Result

Link To GitHub♦

The compression ratio has slightly increased, by 0.22%, and the compression and decompression speeds have increased greatly: compression went from 77.35 MB/s to 102.25 MB/s and decompression from 489.8 MB/s to 805.3 MB/s on the x86 machine. There is no noticeable optimization on AArch64; however, a slight increase of 1.03 MB/s – 2.1 MB/s was achieved. The final patch with all the above changes has been submitted to GitHub.

Link To The Final Submitted Patch.

by vishnuhacks at April 07, 2016 02:19 AM

David Wesley Au


Today, I took a look at the pxz.c file, and I found two possible optimizations that I can try.

They are:
1. (Pre-Calculation)
m = mmap(NULL, s.st_size, PROT_READ, MAP_SHARED|MAP_POPULATE, fileno(f), 0);
if (m == MAP_FAILED) {
    perror("mmap failed");
}

changed to:

if ((m = mmap(NULL, s.st_size, PROT_READ, MAP_SHARED|MAP_POPULATE, fileno(f), 0)) == MAP_FAILED) {
    perror("mmap failed");
}

2. (Dead Store Elimination)
for (pt = 0; pt < len; pt += BUFFSIZE) {
    strm.next_in = &m[off+pt];
    strm.avail_in = len-pt >= BUFFSIZE ? BUFFSIZE : len-pt;
    strm.next_out = mo;
    strm.avail_out = BUFFSIZE;
    do {
        ret = lzma_code(&strm, LZMA_RUN);
        if (ret != LZMA_OK)
            fprintf(stderr, "error in LZMA_RUN\n");
        if (BUFFSIZE - strm.avail_out > 0) {
            if (!fwrite(mo, 1, BUFFSIZE - strm.avail_out, ftemp[p]))
                perror("writing to temp file failed");
            strm.next_out = mo;
            strm.avail_out = BUFFSIZE;
        }
    } while (strm.avail_in);
}

changed to:

strm.next_out = mo;
strm.avail_out = BUFFSIZE;
for (pt = 0; pt < len; pt += BUFFSIZE) {
    strm.next_in = &m[off+pt];
    strm.avail_in = len-pt >= BUFFSIZE ? BUFFSIZE : len-pt;
    do {
        ret = lzma_code(&strm, LZMA_RUN);
        if (ret != LZMA_OK)
            fprintf(stderr, "error in LZMA_RUN\n");
        if (BUFFSIZE - strm.avail_out > 0) {
            if (!fwrite(mo, 1, BUFFSIZE - strm.avail_out, ftemp[p]))
                perror("writing to temp file failed");
            strm.next_out = mo;
            strm.avail_out = BUFFSIZE;
        }
    } while (strm.avail_in);
}

There might be more small optimizations I can try. I will take a look at pxz.c for more possible opportunities at a later date.

by David Au ( at April 07, 2016 12:58 AM

April 05, 2016

David Wesley Au

Project: pxz Compression Overview

The project I have chosen to work on and optimize in some minor way is pxz compression. This is a parallelized LZMA compression tool that takes advantage of running the XZ compression tool simultaneously on different parts of a file on multiple cores and processors. This supposedly speeds up compression time significantly.

pxz has different settings you can use when compressing file(s). They are:

pxz file1 file2
- This command compresses the two files in default settings

pxz -T4 file1
- This command compresses file 1 with default settings and with a maximum of 4 threads

pxz -k file1
- This command uses maximum compression and keeps the original file. It also uses the default number of threads

pxz -f -T4 file1
- This command compresses the file with a maximum of 4 threads

pxz -kvf -3 -T8 file1
- This command uses a -3 compression , keeps the original file, with a maximum of 8 threads and displays progress

tar Oc dir1 | pxz -D 12 -cv - > dir1.tar.xz
- This command moves dir1 into tar, then compresses it into 12 sizes of dictionary per thread, displays progress and writes the result into dir1.tar.xz

by David Au ( at April 05, 2016 12:32 PM

March 30, 2016

Tom Ng

SPO600 Project – xz

The package I chose for the SPO600 project was xz. It comprises the xz file format, which uses LZMA2 compression, and xz utils, which work with the .xz container. XZ was first used to fit the Tukaani distribution (a Slackware fork) under a 700 MB image. XZ was intended to replace the .lzma1 format, which had the same extension, but it wasn’t finished before that format rose in popularity, so it was renamed to XZ.

I chose the package because I liked the name and the project didn’t seem either too active or inactive. It turns out according to the docs that came with the source code, that the name ‘XZ’ has no meaning and it was chosen simply because the name wasn’t being used by anything else. The community communicates on a variety of platforms including a forum, mailing list, IRC and regular old e-mail. The preferred way to submit bugs or patches is via e-mail or IRC.

I downloaded the most recent tarball for XZ which was version 5.2.2 and tried compiling it on archie and betty. Both compiled and worked right out the box. I haven’t been able to do any detailed profiling yet so I did some quick benchmarks with 3 different files instead.

The first thing I tried to compress was the history.txt file that came with the source. It consists of 1168 words according to wc -w history.txt and is 7427 bytes uncompressed. Just for fun, I tested XZ’s claim that it could typically compress files 30% smaller than gzip but gzip actually compressed it to 3010 bytes; 10 bytes less than XZ’s 3020. The second thing I tried to compress was a 128mb file filled with 0s made from the dd command. Here XZ took 18.9 seconds to compress it down to 19.6mb. The final file was also made by dd and was a 128mb file consisting of random data from urandom. This time XZ took 1 minute 27 seconds but failed to compress the file. In fact, the resulting XZ file was slightly larger than the file itself.

In terms of looking for potential optimizations, I am investigating the makefile templates for any potential areas that may include new optimization flags that can be added. It’s quite confusing since makefile templates and recursive makefiles are a first for me but I look forward to it.

by tng23spo600 at March 30, 2016 03:55 AM

Berwout de Vries Robles

Compiler optimizations part 2

Last week we each chose 2 compiler optimizations and in my previous blog I have explained my first, my second compiler optimization flag is -free. In some 64 bit architectures when you load a word into the lower half of a register, the upper half of that register implicitly gets zeroed out. This means that if you need to lengthen an unsigned integer from 32 to 64 bits you can do so just by loading the word.

From the gcc manual we find that -free is a flag that tries to remove redundant extension instructions. This is also one of the optimizations that specifically works well on AArch64, which is why I wanted to take a closer look at it. It is enabled at -O2, -O3 and -Os.

Let me start by saying that even though I've tried multiple programs lengthening int16's and int32's to int64's, by either multiplication or addition, I have yet to find a case where -free kicks in. So far there has been no difference in assembly. I have even tried to get 32 bit floats converted to 64 bit integers to see if it would do lengthening, but it never did. Instead gcc did do something interesting though, which I would like to highlight for a second. This was the program I used to attempt to generate a lengthening instruction.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <inttypes.h>

uint64_t sumArray(uint32_t *arr, uint32_t *arr2, int length){
    uint64_t result = 0;
    for(int i=0; i<length; i++){
        result += arr[i] + arr2[i];
    }
    return result;
}

int main(){
    const int length=256;
    uint32_t arr[length];
    uint32_t arr2[length];
    for(int i=0; i<length; i++){
        arr[i] = rand();
        arr2[i] = rand();
    }
    printf("%" PRIu64, sumArray(arr, arr2, length));
    return 0;
}
The funny thing is that this program actually did not create a lengthening instruction, but instead contained the following instruction:
mov %eax, %eax
On the surface this looks very strange, because we are moving a 32-bit word into the same register. But if you look into it slightly deeper, the mov instruction actually zeroes out the top half of the register as well, so it implicitly lengthens the 32-bit unsigned integer to a 64-bit unsigned integer.
I've tried several versions of this program, that multiply or add 32 bit to 64 bit unsigned integers or with 32 and 64 bit unsigned integers, but I have yet to generate a single lengthening instruction. This means that -free sadly does not do anything for the programs that I have written.

I can tell you that if there were a mov instruction present followed by a lengthening instruction (for an unsigned 32-bit int), this would be one of the cases where -free would kick in and remove the lengthening instruction. Sadly, I was not able to generate one of those cases.

by Berwout de Vries Robles ( at March 30, 2016 03:34 AM

Project selection part 2

Alright, after the disappointing failure I experienced with slimdata I am ready for a new try! I tried to install several of the leftover projects (most of them have been selected already). Some of them had fairly complicated build instructions to get the latest version to work, and for some of the packages listed I was not sure if they were even still around. Notably, the latest version of coreutils has a bunch of prerequisites, and even though I followed the readme instructions to the letter, the prerequisites would not install correctly. Also, sha2 and sha3 were sort of ambiguous; it was hard to find the actual source code those listings meant. After exploring these packages I settled on ncompress.

ncompress is a compression tool that uses the Lempel-Ziv-Welch approach, a well-known compression algorithm. The basic idea is that it builds a lookup table of common sequences in the file, to reduce the number of bits required to represent each sequence. This means there has to be some sort of repetition in the file for there to be an improvement, so we cannot use a randomly generated file. Luckily, I wrote a piece of code in my Project selection 1 blog that writes a repeating int pattern either directly to a file or as text to a file. We can use the output of that program as input for the ncompress program.

The ncompress package is fairly old, with its last release being in 2006; as such I think there should be an optimization available through a multi-processing or SIMD approach, which is what I will focus my attention on. This is also because I am most interested in these types of optimizations.

ncompress is easy to install by just running the git command:
git clone
It will clone the repo into a directory called ncompress; you can then navigate in there and make install the application.

I faced some difficulties copying my files to our AArch64 test servers aarchie and betty; for some reason my scp and sftp transfers kept getting closed by the remote host at a certain percentage. After trying over 20 times and never reaching more than 20% of a 1 GB file, I gave up and looked for an alternative. It turns out rsync is a very flexible command that lets you sync directories across servers. For aarchie and betty you do have to specify a port number with the -e option, but some googling on the rsync command gets you there very quickly.

I then set about obtaining benchmarks for the package, which are to be viewed below:

Some notes about the benchmarks:
- They were taken when cache was "hot" (I disregarded first run).
- They are averages of 3 runs.
- All runs used copies of the same files.
- I used time to measure time spent for the entire command.
- There was a minimal amount of interference caused by other processes running at the same time, so keep that in mind. I did check that there were no large CPU-consuming applications running.

aarch64 testserver Betty:
2545934560(~2.4GB) textfile:
real: 83.652s
user: 82.590s
sys: 1.060s
resulting size: 49091023(~1,93%)

1073741824(1GB) integerstream:
real: 15.811s
user: 15.323s
sys: 0.480s
resulting size:  21505241(~2,00%)

Decompression, 2545934560(~2.4GB) textfile:
user: 12.530s
sys: 1.570s

Decompression, 1073741824(1GB) integerstream:
real: 5.920s
user: 5.403s
sys: 0.510s

my own x86_64 Fedora 23 installation:
2545934560(~2.4GB) textfile:
real: 39.341s
user: 28.147s
sys: 3.484s
resulting size: 49091023(~1,93%)

1073741824(1GB) integerstream:
real: 15.626s
user: 7.479s
sys: 2.328s
resulting size: 21505241(~2,00%)

Decompression, 2545934560(~2.4GB) textfile:
real: 38.687s
user: 6.326s
sys: 2.160s

Decompression, 1073741824(1GB) integerstream:
real: 14.575s
user: 2.500s
sys: 0.856s

Something interesting to look into immediately is that on Betty the uncompress algorithm seemed to work much better than on my own Fedora installation, where it was pretty much the same speed as the compression algorithm. I would like to do some more benchmarking on actual written text files and maybe some installers later, but for a basic benchmark I think this does alright. I was in some stress for time because I had to select a new project after attempting to benchmark the previous package for too long.

by Berwout de Vries Robles ( at March 30, 2016 03:29 AM

Nina Wip

MD5Deep Installation

The project I chose to optimize is called md5deep, a program that computes hashes for a number of files. md5deep doesn't only hash files with MD5; it can also use the SHA variants, Tiger and Whirlpool. It is similar to the md5sum program, but md5deep has some additional features, which I quote from the project's description:
  • Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
  • Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match. Hash sets can be drawn from Encase, the National Software Reference Library, iLook Investigator, Hashkeeper, md5sum, BSD md5, and other generic hash generating programs. Users are welcome to add functionality to read other formats too!
  • Time estimation - md5deep can produce a time estimate when it's processing very large files.
  • Piecewise hashing - Hash input files in arbitrary sized blocks.
  • File type mode - md5deep can process only files of a certain type, such as regular files, block devices, etc.
I'll be taking a look at the source code and attempting to find optimization possibilities. The possibilities are:
  • Improving a hashing algorithm
  • Adding code so it works specifically on AArch64, without influencing the other systems
  • Editing the makefile so the resulting program is faster
I downloaded the source code and ran the following command to install it correctly:
sudo make install
To test that it worked I hashed a few files and saved the hashes in a file. By saving the hashes you can later find matching files using the comparison feature. The next step will be benchmarking.
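The save-then-compare workflow can be sketched as below. The md5deep flags (-r to recurse, -m to match against a list of known hashes) are real, but the file names are made up; and since md5deep may not be installed everywhere, the runnable part of the sketch does the same round trip with coreutils md5sum:

```shell
# Create a tiny directory to hash (names are illustrative).
mkdir -p demo
printf 'hello\n' > demo/a.txt

# With md5deep the workflow would be:
#   md5deep -r demo > known.hashes       # hash every file under demo/
#   md5deep -r -m known.hashes demo      # list files that match the known set
# The equivalent round trip with md5sum:
(cd demo && md5sum a.txt) > known.hashes
(cd demo && md5sum -c ../known.hashes)   # verifies files against saved hashes
```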

by Nina ( at March 30, 2016 01:23 AM

MD5deep focus area

So my first thought for the focus area was implementing correct installation support for aarch64, since I ran into trouble with the architecture detection. But the latest configure files I had to download had already fixed that issue and were dated 2016-03-24, so that fix was pretty recent.

Since I'm more familiar with code than all the configuration and make files, I decided to take a better look at the source code.

by Nina ( at March 30, 2016 01:22 AM

MD5deep benchmarking betty

Now that I'm done testing and benchmarking on my own x86_64 fedora system, I need to test md5deep on an AARCH64 system. I transferred the files I used in the x86_64 benchmark to the betty server using scp and sftp. Once I got the test files transferred I transferred the tar.gz file of md5deep so I could try to install it on betty. Here's where I ran into some issues...

The normal way to install md5deep is to unpack it and run the following commands:
sudo ./configure
make install
Sadly it stopped at the configure step: it could not recognize which build to use, since it did not know aarch64. Luckily the error pointed me to a website where I could download the most recent configure files, which did contain aarch64! So after some more use of sftp the configure step finally worked and I could run the make install command.

On to the benchmarking. Just like on x86_64, I ran the md5deep program 100 times with 3 different file sizes. The results are as follows:
10.5 MB file
real: 0m0.037s
user: 0m0.028s
sys: 0m0.001s
105 MB file
real: 0m0.345s
user: 0m0.323s
sys: 0m0.027s
1.5 GB file
real: 0m4.792s
user: 0m4.551s
sys: 0m0.400s
If you only have one file md5deep can hash it pretty quickly. Even if you hash entire directory structures you wouldn't have to wait too long.

by Nina ( at March 30, 2016 12:58 AM

MD5deep benchmarking x86_64

I currently only have md5deep installed on my own Fedora workstation, where I tested that I could get it working. Now that it works, I can benchmark it on my laptop. Once I have that done I will do the same on betty.

My fedora specifications are:
Memory: 2.9 GiB
Processor: Intel Core i7-3630QM CPU @ 2.40GHz
Machine architecture: x86_64

To benchmark this program I need a few files of different sizes. With the following command I can create a file of completely random data with a size of approximately 100 MB.
dd if=/dev/random of=test100m bs=1M count=100 iflag=fullblock
After running this command with different file sizes, I ended up with a 105 MB test file and a 10.5 MB file.
By running the following command you can run the md5deep program and see how long it takes to complete.
time md5deep test100m
To get a presentable benchmark you have to run this command many times and take the average. Instead of manually running it 100 times I wrote the following command to do it for me:
time for i in {1..100}; do md5deep test100m; done
The time you get is the total for all 100 runs, so the result has to be divided by 100 to get the average time of a single run. I got the following results:
10.5 MB file
real: 0m0.053s
user: 0m0.021s
sys: 0m0.003s
105 MB file
real: 0m0.287s
user: 0m0.200s
sys: 0m0.012s

After seeing that these files still get hashed pretty quickly, I decided to download a bigger file to see how long that would take. I downloaded the Fedora Workstation ISO, which is 1.5 GB, to test with.
1.5 GB file
real: 0m3.458s
user: 0m2.793s
sys: 0m0.184s
It still doesn't take very long to hash a 1.5 GB file, so this program is pretty fast.
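The divide-the-total-by-100 arithmetic can also be scripted. This is my own sketch, not part of the post's workflow; the command and file passed to it (e.g. md5deep test100m) are whatever you want to benchmark:

```shell
# Average the wall-clock time of N runs of a command, using date +%s.%N
# for timestamps and awk for the division.
avg_time() {
    runs=$1; shift
    start=$(date +%s.%N)
    for _ in $(seq "$runs"); do "$@" > /dev/null 2>&1; done
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" -v n="$runs" \
        'BEGIN { printf "%.4f\n", (e - s) / n }'
}
# e.g.: avg_time 100 md5deep test100m
```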

by Nina ( at March 30, 2016 12:56 AM

Berwout de Vries Robles

Project Selection part 1

We've started our projects and I have selected slim, a compression program that mainly deals with very large volumes of integers of 1-, 2- or 4-byte word size. One of the things I noticed immediately was that it had not been updated for a while: it was written in 2008 and some additions were made in 2013. It is effectively a one-man project, as all the contributions were made by a single person.

I knew it was not going to be easy to find documentation, because there was only one developer and he last worked on it 3 years ago. Undeterred, I installed the package on both aarchie and my own Fedora installation. The package can be found here:
The installation was very straightforward: just a tar -xvf, a ./configure with some parameters found in the included README, and a make install. I could now slim and unslim files; I tried it on a few text files and it worked. However, the program clearly states it was made for large integer groups with repetitive patterns, so I set out to write a program that would generate such a file for me. After much experimentation I finished with this:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
int main() {
    //32-bit ints in a GB
    int64_t size = 1073741824/4;
    int repetition = 20000;
    int32_t* data = (int32_t*) malloc(sizeof(int32_t)*size);
    int64_t i = 0;
    while (i < size) {
        int32_t number = rand();
        //store the same number up to `repetition` times in a row
        for (int j = 0; j < repetition; j++) {
            if (i < size) {
                data[i] = number;
                i++;
            }
        }
    }
    FILE *fp = fopen("data", "w");
    fwrite(data, 4, size, fp);
    //alternative: write the integers as text instead of binary:
    //for(int x = 0; x < size; x++){
    //    fprintf(fp, "%d", data[x]);
    //}
    fclose(fp);
    free(data);
    return 0;
}
I have tried using fwrite to write the binary values of the integers to a file. I have tried fprintf to write the text values of the integers to a file. I have tried random repetition amounts and swapping some integers around to create a more or less diverse file. I have tried smaller and larger files. The text files would not even slim; the program would leave the original file broken in an unrecoverable state. The integer files would slim, but the maximum gain was 1.5% of the file size, or about 20MB on a 1GB file.
I think I can safely assume that this is not the intended gain on a file of this or any size, as gzip would save about 50% of the file size in about the same amount of time.

It saddens me to say that I think that I will have to admit defeat here and pick a different project. I will soon update on that.

by Berwout de Vries Robles ( at March 30, 2016 12:30 AM

Giuseppe Ranieri

Identifying a Possible Optimization for the Zopfli Library

The Zopfli Compression Algorithm is a compression library written in C that compresses very well but slowly, producing deflate- and zlib-compatible output. The library can only compress, not decompress.

I chose this library because its community was the largest and most active on Github. There are some pros and cons to this. On one hand, any problems that occur are reported and fixed more quickly, since Github makes it quick and easy to post issues. On the other hand, because the community is so active, many of the issues about improving the library have already been solved and closed.


The following benchmark was done by creating a 10 MB file of random data on the ARM64 platform with the command

dd if=/dev/urandom of=10mb.txt count=1M bs=10
and transferring it over to the x86_64 platform so that the tests ran on the same data. I tested both a 1 MB and a 10 MB file on each platform. Here are the results, run with time ./zopfli 10mb.txt:

          1 MB file        10 MB file
x86_64    real 0m3.309s    real 0m33.970s
ARMv8     real 0m27.127s   real 4m26.325s

They weren't kidding when they said it was "slow". On the upside, any changes I make, good or bad, will be more easily noticeable.


Optimizations Choices:

There are three avenues I believe I can follow, in priority order, as possible ways to optimize the library. In deflate.c there is a TODO of

static void AddBits(unsigned symbol, unsigned length,
                    unsigned char* bp, unsigned char** out, size_t* outsize) {
  /* TODO(lode): make more efficient (add more bits at once). */
  unsigned i;
  for (i = 0; i < length; i++) {
    unsigned bit = (symbol >> i) & 1;
    if (*bp == 0) ZOPFLI_APPEND_DATA(0, out, outsize);
    (*out)[*outsize - 1] |= bit << *bp;
    *bp = (*bp + 1) & 7;
  }
}

for the AddBits function call. Perhaps there's something I can change to help the code along.

My second choice afterwards will be trying different flags for the build:

gcc src/zopfli/*.c -O2 -W -Wall -Wextra -ansi -pedantic -lm -o zopfli
as of right now. There may be some other flags in the O3 optimization level that help some edge cases but could make things unstable in other areas. This will require patience and a lot of compiling and test running.

Finally, the least enjoyable avenue would be making it platform-optimized for ARM64. No ARM64-related issues have been filed on Github, so this might be a platform that has been overlooked.

by JoeyRanieri ( at March 30, 2016 12:04 AM

March 29, 2016

Vishnu Santhosh Kumar

Presentation – Compiler Options

Today I did my presentation about two compiler options: "-ftree-loop-distribution" and "-frename-registers".


In a loop, the data accessed by the loop body is kept in the cache in order to speed up repeated accesses to the same values or related storage locations. Since the cache size is limited, a large loop can be broken into multiple loops over the same index range, each taking only a part of the original loop's body. This reduces large loops to smaller ones that can use the cache more effectively. This optimization is called loop fission.

The '-ftree-loop-distribution' option performs loop fission. It improves cache performance on large loop bodies and also enables further optimizations like vectorization and parallelization (converting sequential code into multi-threaded or vectorized code, or even both, in order to utilize multiple processors simultaneously).


For example, fission splits one loop whose body contains two independent statements (the bodies here are illustrative):

for(int i=0;i<100;++i){
    a[i] = f(i);
    b[i] = g(i);
}

into two smaller loops over the same index range:

for(int i=0;i<100;++i)
    a[i] = f(i);

for(int i=0;i<100;++i)
    b[i] = g(i);

GCC Compiler Options


by vishnuhacks at March 29, 2016 08:13 PM

Yunhao Wang

bzip2 optimization - phase one

Benchmark for Bzip2



linux-kernel tarball 628MB (658,053,120 bytes) for Huffman coding

500MB random data (for the worst case and the Burrows–Wheeler transform)

generated by dd if=/dev/urandom of=test.500mb bs=1M count=500


AArch64 and x86_64


Modify the Makefile compiler flags to add -pg

Compile the source so we can profile it

On AArch64

500MB random data with best flag(-9)
real    16m50.237s
user    16m48.650s
sys     0m1.520s

On x86_64

linux-kernel tarball with best flag(-9)

real    1m39.944s
user    1m39.464s
sys     0m0.388s


500MB random data with best flag(-9)

real    1m58.764s
user    1m57.768s
sys     0m0.889s

It looks like we can find possible optimizations in mainSort in blocksort.c (27.32% of time) and BZ2_compressBlock in compress.c (38.61%).

It may also be possible to find a tiny optimization in huffman.c



Possible Optimization

Remove the for loop to reduce branches

1. mainSort function in blocksort.c, line 769

/*-- set up the 2-byte frequency table --*/

for (i = 65536; i >= 0; i--) ftab[i] = 0;

could be replaced with:

memset(&ftab[0], 0, 65537*sizeof(int));     // assumes the elements are 4-byte ints


2. Loop unrolling

Add the -funroll-loops flag as a compiler option (it will increase the size of the binary)

by yunhaowang at March 29, 2016 05:56 PM

March 24, 2016

Andrei Topala

Project – Improving lrzip; Phase One

So now frost slowly thaws and the winds of blooming spring carry to us tidings of final projects. Our project for the SPO600 course centers around the improvement of a selected checksum or compression open-source package. I chose lrzip, and I hope that I will be able to alter the code in such a way that we can gain a non-negligible improvement in execution time. That, at least, is what I will be aiming for.

So, what even is lrzip?

lrzip – Long Range ZIP or LZMA RZIP

lrzip is a compression program. It’s optimized for the compression of large files and on systems with great amounts of RAM.

lrzip operates in two basic steps, and on a single file.

First, it uses an extended version of rzip to perform a long distance redundancy reduction (ie it encodes duplicate or redundant chunks of data over a large distance). This is done with an LZ77-based algorithm: a sliding window technique is used to keep data in memory in order to find matches, and larger amounts of RAM allow the use of a larger window and thus lead to better compression. Where rzip is limited to a maximum window size of 900MB, lrzip’s use of it scales with available RAM. Duplicate data is replaced with a reference denoting the distance to the original data and the size of the original chunk of data.

A visual representation of the LZ77 algorithm:

Second, after the initial reduction, lrzip compresses the file further by running one of LZMA (default), ZPAQ, LZO, GZIP, and BZIP2 on the reduced data file.

So, I built the package on my x86_64 computer (which required the installation of a few dependencies, but other than that it went okay) and on our AArch64 aarchie server (which, again, needed some dependencies (though some were already installed)). The program ran fine on both systems.

Before we begin tinkering with the code, let’s run some benchmarks on aarchie. We’ll use three files. The first is a text file containing 500MB of text downloaded from Project Gutenberg. The second is a 500MB file composed entirely of random data. The third is a 500MB file composed entirely of zero data.
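Scaled-down stand-ins for these three inputs can be generated as below. This is only a sketch: 1 MB rather than 500 MB so it runs quickly, and a base64 stream standing in for the Project Gutenberg text, which can't be reproduced here:

```shell
# Three 1 MiB test files: all zeros, random bytes, and ASCII "text".
dd if=/dev/zero    of=zero.bin   bs=1M count=1 2>/dev/null
dd if=/dev/urandom of=random.bin bs=1M count=1 2>/dev/null
base64 /dev/urandom | head -c 1048576 > text.txt
ls -l zero.bin random.bin text.txt
```

Even at this scale a compressor should show the same ordering: the zero file compresses almost completely, the random file barely at all.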

[atopala@aarchie lrzip-0.621]$ time ./lrzip text
Output filename is: text.lrz
text - Compression Ratio: 3.918. Average Compression Speed:  1.217MB/s.
Total time: 00:06:50.91

real	6m50.924s
user	49m2.940s
sys		0m9.290s
[atopala@aarchie lrzip-0.621]$ time ./lrzip random 
Output filename is: random.lrz
random - Compression Ratio: 1.000. Average Compression Speed:  6.329MB/s.
Total time: 00:01:18.78

real	1m18.796s
user	1m34.520s
sys		0m2.960s
[atopala@aarchie lrzip-0.621]$ time ./lrzip zero
Output filename is: zero.lrz
zero - Compression Ratio: 7003.299. Average Compression Speed:  6.024MB/s.
Total time: 00:01:22.92

real	1m22.933s
user	5m17.620s
sys		0m2.490s

The text file took much longer than the others, and it used 23.9% of the memory on aarchie. Compressing the random file required the use of only about 7% of the memory on aarchie. Compressing the zero file, which has a lot more duplicate data to compress, used much more memory, peaking at about 34.5% on aarchie. The text file also took much more user time to compress in comparison to the other files and as compared to the real time, so it was taking greater advantage of parallel processing; the zero file also was compressed using multiple threads. The vast majority of the time was spent in user mode; time spent in the kernel is negligible.

So, the goal is to beat these benchmarks. Maybe a way to recognize incompressible data could speed up the processing of the random file; or maybe a method to detect extremely large runs of identical data quickly could help with the zero file. The important file, though, is the text file, and ideally we would improve the runtime of its compression. Setting the lzma level to 9 (the default is 7) brings the compression ratio up to 4.029 (a minor but not insignificant improvement) but takes twice as long; maybe this could be sped up.

So, apart from improving the algorithm (which, I think, is very unlikely, since lrzip has been receiving regular updates as recently as last year so the algorithm should be pretty close to perfect barring potential oversights) and changing build options (which might not prove fruitful, since the autogen sets the default optimization level to 2, and further optimizations might have a negative impact; but this is probably a good area to work in since I don’t think lrzip has been built with AArch64 in mind) these are potential areas of improvement:

-lrzip uses x86 and x86_64 assembly language files that optimize CRC computation. Maybe we could rewrite the files for AArch64. As of now, “if you don’t have the nasm assembler or have a ppc or other non-x86 system, the standard C CRC routines will be compiled and linked in.” The TODO file suggests that the x86_64 version of the assembly code doesn’t work, so that might need fixing. However, the TODO file also suggests that this is of a low priority.
-the program crashes with the -z flag (ZPAQ compression) on aarchie, claiming "Illegal instruction" after starting the first thread to compress a piece of data. This occurs on betty as well, but it doesn't happen on x86_64. I am not yet sure of the cause, or whether it is something that happens on AArch64 systems in general, but if it is a bug on AArch64 then I think this should become my main focus.

I’ll begin with the ZPAQ bug (or what I think might be a bug, at least), and if that doesn’t pan out then I’ll try to edit some of the build options and code to optimize it for AArch64 (since I figure that there may be other people working on lrzip, but most might not have access to an AArch64 machine). If that doesn’t work out either, I’ll try to optimize the C code, and if I can’t do that I’ll work on the (low-priority) assembly language files.

by andrei600 at March 24, 2016 05:49 PM

March 22, 2016

David Wesley Au

Lab 1

I have selected two open source packages (with different licenses) and I will be discussing the review process and how contributions are made toward each of the packages.

1. Blender (3D Animation)

First of all, Blender is a free, open-source 3D computer graphics software product used widely for creating animations, art, visual effects, 3D applications, and video games. It has many features such as 3D modelling, animation, and texturing, and it also has an integrated game engine. Blender is written in C, C++ and Python (more on this later).

In order to develop for Blender, a majority of the communication for development will be through a mailing list, irc chat and through the bug/patch trackers. The development process for Blender is as follows (Steps for getting a patch into Blender):

i) Find the current "owner" of the code and check if that person agrees to you working on their project or adding to their project (assuming you have an area in Blender you want to improve on)

ii) Post a design document on the Blender wiki

iii) Code

iv) Check your code, and submit a patch

You can submit a patch by making a patch file through git, as stated here along with all the information needed to submit patches.
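The "patch file through git" step can be sketched as below; the repository, file, and commit message are invented for illustration:

```shell
# Create a throwaway repo, make a change, and export it as a patch file.
rm -rf patch-demo
git init -q patch-demo && cd patch-demo
git config user.email "dev@example.com"
git config user.name  "Dev"
printf 'base\n' > file.c && git add file.c && git commit -qm "base"
printf 'fix\n' >> file.c && git commit -qam "example fix"
git format-patch -1 HEAD      # writes 0001-example-fix.patch
ls *.patch
```

The resulting .patch file is what gets attached to the tracker.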

2. Python (Programming Language)

In order to develop for Python, you may get involved through their issue tracker and irc channels. You may use their PyChecker, which is a static analysis tool that finds bugs in Python source code and warns about code complexity and style. To submit a patch, follow these steps:

i) Get the source code, using this line: hg clone

ii) Build python, using this line: ./configure --with-pydebug && make -j2
If you are using Windows, use this line: PCbuild\build.bat -e -d

iii) Run the test, using this line: ./python -m test.regrtest -j3
MacOSX: ./python.exe -m test.regrtest -j3
Windows: ./python.bat -m test.regrtest -j3

iv) Make the patch.

v) Submit to issue tracker.

by David Au ( at March 22, 2016 01:11 PM

Berwout de Vries Robles

SPO600 Compiler Optimizations

As we are starting projects shortly, this week there is no lab post. Instead we take a look at some gcc optimization flags. Everyone in the course was tasked with selecting two gcc compiler optimization flags and figuring out what they do.

My first choice is -floop-strip-mine.

We cannot run the -floop-strip-mine flag right now because it requires a specific loop-optimization library (the Graphite framework) to be built into gcc at build time. This library is not present in most readily available gcc builds (including the one you probably have installed and the one I have installed). I tried building gcc myself, but the process is fairly complicated and I could not get it to work. This means that in this blog post I will show a manual example of what some sample code looks like before and after the transformation, but there will be no assembly demonstrating the power of the actual flag.

The f before the option name stands for flag; you can turn a flag on by adding it as an argument to a compilation, for example:
gcc -floop-strip-mine program.c
Some optimizations are always on, and some options, such as -O2 and -O3, turn on entire groups of optimization flags. To turn off a specific flag, use the -fno- prefix, for example:

gcc -fno-floop-strip-mine program.c
The reason I chose floop-strip-mine is that our projects could potentially benefit a lot from multi-threading if they do not take advantage of that yet. floop-strip-mine is an optimization that is especially beneficial in a multi-threaded situation.

floop-strip-mine is a flag that turns on the strip-mining of loops. To see how that works, here is a pseudocode example:

Let's say we have two two-dimensional arrays, a and b, that both have 8 rows and 8 columns.
Let us assume they are filled with values; what those values are doesn't really matter right now. We are going to loop through the arrays and do a multiplication, storing the result in a two-dimensional array c.
int a[8][8], b[8][8], c[8][8];

for(int i=0; i<8; i++){
    for(int j=0; j<8; j++){
        c[i][j] = a[i][j] * b[j][i];
    }
}

The values in a and b are stored in main memory, if we pull them out of main memory into the cache, let's see what happens. Let's assume our cache-lines can store 4 integer values at a time.

The first iteration:
i = 0;
j = 0;
elements we need: a[0][0], b[0][0].
lines going from main memory into cache:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

The second iteration:
i = 0;
j = 1;
elements we need: a[0][1], b[1][0]
Hey that is nice! a[0][1] is already in cache so we can reuse that cache line!
lines going from main memory into cache:
line 3: b[1][0], b[1][1], b[1][2], b[1][3]
Sadly we had to pull in a new line for b, as the value was not in cache yet.

The third iteration:
i = 0;
j = 2;
elements we need: a[0][2], b[2][0]
As in the second iteration the a-line is already in cache, so we can reuse it, the b line however we have to pull in:
line 4: b[2][0], b[2][1], b[2][2], b[2][3]

The fourth iteration:
elements we need: a[0][3], b[3][0]
As in previous iterations, a is in cache and we pull in:
line 5: b[3][0], b[3][1], b[3][2], b[3][3]

The fifth iteration:
elements we need: a[0][4], b[4][0]
Now we need to also bring in a new line for a, note that we have been using the same line in all the previous iterations though:
line 6: a[0][4], a[0][5], a[0][6], a[0][7]
line 7: b[4][0], b[4][1], b[4][2], b[4][3]

We can sort of see a pattern here: a is nicely lined up in cache, if we proceed after the 4th iteration it will still only take up 2 lines of cache space. The b-array however is misaligned, every new iteration, we need to bring a new cache-line in from main memory! Note that this is still not a problem in a realistic situation where we have more than enough cache-lines to store all the values in b.

To illustrate what strip mining is and why it works let us assume we only have 8 cache lines to store the values of a and b in. Whenever a new cache line is brought in that does not fit in the cache, the oldest unused cache line gets thrown out.

In the next few iterations I am hoping to make clear what goes wrong here:
The sixth iteration:
elements we need a[0][5], b[5][0]
cache lines to pull in:
line 8: b[5][0], b[5][1], b[5][2], b[5][3]
Note that our cache is now full! We will start replacing the oldest unused line in the next iterations!

The seventh iteration:
elements we need a[0][6], b[6][0]
We find the oldest unused cache line, which is line 2, since we used line 1 up to the fourth iteration, cache lines to pull in:
replace line 2: b[6][0], b[6][1], b[6][2], b[6][3]

The eighth iteration:
elements we need: a[0][7], b[7][0]
We find the oldest unused cache line, which is line 3 and replace it:
replace line 3: b[7][0], b[7][1], b[7][2], b[7][3]

This was the final iteration where i remained constant, so far we have not encountered any real problems, besides having to replace some cache lines, which is not a big deal yet. Now we look at the first few iterations where i=1;

The ninth iteration:
elements we need: a[1][0], b[0][1]
First we find the oldest unused cache line, which is line 4, and replace that; then we find the second oldest unused cache line, which is line 1, and replace it:
replace line 4: a[1][0], a[1][1], a[1][2], a[1][3]
replace line 1: b[0][1], b[0][2], b[0][3], b[0][4]
Note here that we used to have the b-value b[0][1] in cache already, but we had to replace it with another value because our cache was full! This means we had to pull the same value from memory into cache twice!

The tenth iteration:
elements we need: a[1][1], b[1][1]
Again, we see the recurring pattern, we already have a[1][1] in the cache in line 4, but we have to replace a cache line to get the b-value in cache. We already had the b-value b[1][1] in cache in the second iteration, but we were forced to replace that line because our cache was full!

The pattern we are seeing here is that because the access of the b-array is structured as b[j][i] instead of b[i][j], we are running into trouble fitting the values into cache properly and creating cache-misses by replacing lines with values we still need later on!

Normally what you would do in a situation like this where the reverse access b[j][i] is causing the cache-misses is called loop-interchange. You can reverse the order of the loops so b[j][i] essentially becomes b[i][j]. However we can not do that in this case, because a[i][j] is also in this loop! Reversing the loops would create the exact same problem, only on the other side, after swapping the loops around we would have a lot of cache-misses on the a-array!

This is where strip mining comes in!
We take the code we had written for our original program and turn it into this:
int a[8][8], b[8][8], c[8][8];

for(int x=0; x<8; x+=4){
    for(int y=0; y<8; y+=4){
        for(int i=x; i<min(x+4, 8); i++){
            for(int j=y; j<min(y+4, 8); j++){
                c[i][j] = a[i][j]*b[j][i];
            }
        }
    }
}
Now I admit that if you look at this the first time, programmatically it does not make a lot of sense. You are introducing two extra for loops and the loops look strange! However from a cache-viewpoint, this makes a lot of sense.

Here is what the iterations in this loop look like:
x=0, y=0, i=0; j=0
cache lines:
line 1: a[0][0], a[0][1], a[0][2], a[0][3]
line 2: b[0][0], b[0][1], b[0][2], b[0][3]

x=0, y=0, i=0; j=1
cache lines:
line 3: b[1][0],b[1][1],b[1][2],b[1][3]

x=0, y=0, i=0; j=2
cache lines:
line 4: b[2][0],b[2][1],b[2][2],b[2][3]

x=0, y=0, i=0; j=3
cache lines:
line 5: b[3][0],b[3][1],b[3][2],b[3][3]

j is no longer smaller than 4 (y+4) so instead we go up one loop and increase i!
x=0, y=0, i=1, j=0
cache lines:
line 6: a[1][0], a[1][1], a[1][2], a[1][3]
Note that here we still have the required b cache line!

x=0, y=0, i=1,j=1
cache lines:
No cache lines required! We have line 6 which contains a[1][1] and line 3 which contains b[1][1]!

x=0, y=0, i=1,j=2
cache lines:
No cache lines required! We have line 6 which contains a[1][2] and line 4 which contains b[2][1]!

Actually if you keep on iterating the loop, you can see that for the first 16 iterations, all we need is 8 cache lines! The lines we are missing now are a[2][0], a[2][1], a[2][2], a[2][3] and a[3][0], a[3][1], a[3][2], a[3][3], but that is it for the first 16 iterations! In fact, all groups of 16 iterations in the cycle only require 8 cache lines to store that particular block!

If we look at the arrays in a visual display, we are actually structuring the loops so it does these blocks of 4 by 4 first, before moving on to the next block! The reason for this is that these blocks happen to fit perfectly in our limited cache size.

In the real world obviously we have larger cache lines and more of them, but this concept applies in exactly the same way to arrays that are much larger than this one. This means that if you are working with very large two-dimensional arrays and traversing those arrays in a manner that is sub-optimal, strip-mining may be the solution for you!

Another time when strip-mining can be very valuable is if you are dealing with a multi-threading situation. Often when multi-threading, you will divide the data up into chunks that will be given to a specific processor. All the processors then run the same program, but on different pieces of data. This is a common way of managing workloads between multiple processors.

A problem that arises from this is cache coherency between processors. If one processor has a cache line containing a value that another processor also has inside its cache, and it updates a value on that same line, the whole line will need to be updated in the second processor, even though the second processor may not even need the specific value that was updated.

So if processor 1 was working on a[0][1], a[0][2] and processor 2 was working on a[0][3], a[0][4],
the cache-line that processor 1 is working on accidentally also contains a[0][3] and a[0][4] because we can only pull in entire cache-lines at a time. This means that if processor 2 updates a[0][3], processor 1 will require an update, even though it is not even actively working on a[0][3]! It just happened to be on the same cache-line. There is a lot of unnecessary communication between processors this way.

To prevent this unwarranted communication we strip-mine the loop and instead deal out the blocks of 16 values to different processors. This way there is no overlap in the cache lines and each processor stays out of the other processors' way. If you wish to know more about this problem, which is very common in multi-processing, search for false sharing.

So to recap:
We strip-mine because:
a. If our cache size is small, it prevents a lot of cache-misses.
b. If we are multi-threading, it prevents a lot of communication overhead caused by false sharing.

I am a little bit sad that I could not present you with a view of what gcc does with our code in terms of assembly, I will look into obtaining a gcc build with the necessary libraries and post some more on that if I find one.

by Berwout de Vries Robles ( at March 22, 2016 04:20 AM