Planet CDOT

May 28, 2015

Anna Fatsevych

MySQL Tests Continued

The Wikimedia Commons image dumps include files in .svg, .pdf, .ogg, and .djvu formats. SVG is an image format, whereas the .pdf files were mostly books, the .ogg files were video/audio, and the .djvu files were scanned books and texts.

Hashing these with pHash and Blockhash was a challenge because the libraries do not always throw an error (e.g. when trying to hash a PDF), so some issues took longer to discover; the other formats (SVG, OGG, and DjVu) cannot be hashed at all.

When dealing with file names containing characters from various languages, some exceptions arose; the PHP function addcslashes($string, "list of characters to escape") comes in handy, as do the functions mentioned in the previous post.

I ran my PHP program on 6,953 files, of which 6,308 were processed (the rest failed, some due to format and some due to file-naming errors). It took 2.41 hours to pHash and Blockhash each of the 6,308 images and store the values in the database. Hashing took most of the time, as the earlier dummy-data runs averaged 1,000 INSERTs per minute.

I ran SELECT tests on my 6,008 records and found that 'SELECT *' and 'SELECT ... WHERE' queries on the hash columns were quite fast, with the longest taking 3.2 seconds for a select-all. Granted, this database contained many duplicates (the same hash was assigned to erroneous files), which will not be the case in practice.

I will continue running the performance tests and post further results.

by anna at May 28, 2015 07:17 PM

Hosung Hwang

Performance test of pHash

I performed a performance test of a perceptual hash algorithm: pHash.

The following is the specification of the test machine.

OS : Ubuntu 14.04 LTS
Processor : Intel Core i7-3517U CPU @ 1.90GHz x 4
OS type : 64-bit
Memory : 7.7 GiB
Disk : 117.6 GB SSD

Tests were performed from the internal SSD (mSATA) drive and from an external USB hard drive.

Read/Write benchmarking

Before the actual hashing test, I performed a simple read/write benchmark using the dd command. It writes and then reads an 8.2 GB file:

time sh -c "dd if=/dev/zero of=/media/hosung/BACKUP/test.tmp bs=4k count=2000000 && sync"
time sh -c "dd if=/media/hosung/BACKUP/test.tmp of=/dev/null bs=4k"
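As a sanity check, the dd flags above imply the quoted 8.2 GB transfer, and throughput is just bytes over elapsed wall-clock time. A rough sketch in Python; the 33.4 s elapsed time is a made-up example chosen to reproduce the SSD write figure below, not a measured value:

```python
# Size implied by the dd flags above.
bs = 4 * 1024        # bs=4k
count = 2_000_000    # count=2000000
total_bytes = bs * count
print(total_bytes / 1e9)                     # 8.192, i.e. the "8.2 GB" file

# Turning a `time` result into MB/s (elapsed_s is hypothetical).
elapsed_s = 33.4
print(round(total_bytes / 1e6 / elapsed_s))  # 245 MB/s
```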

Each job was performed 5 times. The following are the average values.

Condition                        Speed
Internal SSD write               245 MB/s
Internal SSD read                440 MB/s
External HDD write via USB 3.0   109 MB/s
External HDD read via USB 3.0    122 MB/s
External HDD write via USB 2.0   109 MB/s
External HDD read via USB 2.0    129 MB/s

The USB 3.0 and USB 2.0 speeds were nearly identical, and the internal SSD read roughly four times faster than the external drive.

pHash performance test

The sample set was 900 JPEG files, ranging in size from 1.3 KB to 50.4 MB and totalling 805.4 MB. For the test, I wrote C++ code that extracts hash values with the ph_dct_imagehash() function in pHash for every JPEG image in a directory. The reason for writing a dedicated program was to avoid the process-startup overhead incurred when a shell script launches a new process per file. Each test was performed 8 times after a reboot.

Internal SSD

hosung@hosung-Spectre:~/cdot/ccl/PerHash/pHash/pHash-0.9.6/phash-test$ for F in {0..7}; do ./phashbench /home/hosung/cdot/ccl/hashtest/all-images; done
Elapsed secs to hashing 900 files : 292.419326
Elapsed secs to hashing 900 files : 290.789127
Elapsed secs to hashing 900 files : 291.163042
Elapsed secs to hashing 900 files : 290.769897
Elapsed secs to hashing 900 files : 290.710176
Elapsed secs to hashing 900 files : 290.940988
Elapsed secs to hashing 900 files : 290.880126
Elapsed secs to hashing 900 files : 290.766687

External HDD through USB 3.0 port

hosung@hosung-Spectre:~/cdot/ccl/PerHash/pHash/pHash-0.9.6/phash-test$ for F in {0..7}; do ./phashbench /media/hosung/BACKUP/all-images; done
Elapsed secs to hashing 900 files : 293.422019
Elapsed secs to hashing 900 files : 293.145768
Elapsed secs to hashing 900 files : 292.828859
Elapsed secs to hashing 900 files : 292.591345
Elapsed secs to hashing 900 files : 292.631436
Elapsed secs to hashing 900 files : 292.811508
Elapsed secs to hashing 900 files : 292.898119
Elapsed secs to hashing 900 files : 292.607773

External HDD through USB 2.0 port

hosung@hosung-Spectre:~/cdot/ccl/PerHash/pHash/pHash-0.9.6/phash-test$ for F in {0..7}; do ./phashbench /media/hosung/BACKUP/all-images; done
Elapsed secs to hashing 900 files : 294.008601
Elapsed secs to hashing 900 files : 292.954135
Elapsed secs to hashing 900 files : 292.275561
Elapsed secs to hashing 900 files : 292.255697
Elapsed secs to hashing 900 files : 292.459464
Elapsed secs to hashing 900 files : 292.737186
Elapsed secs to hashing 900 files : 292.803859
Elapsed secs to hashing 900 files : 292.605617


  • The USB 3.0 and USB 2.0 ports seem to make no difference.
  • Even from the internal SSD, hashing was only slightly faster.
  • Despite the SSD's roughly four-times-faster read speed, hashing from it was less than 1% faster than from USB.
  • Therefore, for hashing, CPU performance seems to matter more than I/O performance.
  • The other method in pHash, ph_mh_imagehash(), should be tested later.
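The CPU-bound conclusion can be read straight off the numbers: averaging the eight runs per device (rough means computed from the listings above) gives a per-file cost of about a third of a second and well under a 1% gap between SSD and USB. A small Python check:

```python
files = 900
ssd_avg = 291.054921    # rough mean of the 8 internal-SSD runs above, seconds
usb3_avg = 292.867103   # rough mean of the 8 USB 3.0 runs above, seconds

print(round(ssd_avg / files, 3))                 # 0.323 s per file
print(round((usb3_avg / ssd_avg - 1) * 100, 2))  # 0.62 % slower via USB 3.0
```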

by Hosung at May 28, 2015 02:34 PM

May 27, 2015

Justin Flowers

Working with Host-Only VBox Networks in CentOS6

In order to communicate between VMs, a simple alternative to fancy port forwarding is to set up a host-only network joining them. This is my go-to solution for testing machines that need to talk to other machines. In CentOS 6 this can be quite difficult to figure out on your own, so here I'll walk through setting up your machines in simple terms.

Step 1: Create the network in VirtualBox preferences

Before we can begin configuring boxes to be on this host-only network, we’ll need to make it first. This is relatively easy, thankfully. Simply go to VBox’s network preferences through File->Preferences->Network and hit the + button at the top right of the page to add a new host-only network.

Step 2: Connect VMs to host-only network

Note: to do this part, your VMs must be powered down.
Next we need to give the VMs access to this network. Go to a VM's network settings by right-clicking the machine and choosing Settings->Network. Once there, add a new adapter by clicking one of the tabs at the top and checking Enable Network Adapter. Then pick Host-only Adapter; it should automatically select the first network in the list. Do this for every machine you want communicating over the host-only network.

Step 3: Configure adapters on VMs

This is the hardest step and took me the longest to figure out. Begin by doing:

ifconfig -a

This will show you a list of all adapters present on your machine. The one you're looking for will match the hardware address created for each network adapter in step 2, although it's usually the last Ethernet one in the list. Once you have the name of your adapter (likely eth1 if you only had a NAT adapter before), you can begin configuring it with:

sudo vi /etc/sysconfig/network-scripts/ifcfg-eth1

Substituting eth1 with the name of the adapter you found with ifconfig. In this new file, copy and paste:

DEVICE=eth1
HWADDR=adapter_mac
TYPE=Ethernet
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.56.101
NETMASK=255.255.255.0

Again, substitute eth1 with the name of your host-only adapter, adapter_mac with the MAC address of your host-only adapter (which can be found with ifconfig -a or from the VBox machine network settings page), and IPADDR with whichever address you want the machine to have (192.168.56.101 is only an example).

Save that file and then run:

ifup eth1

Alternatively, if you know you will be destroying the machine soon and wish to configure this quickly, simply run:

sudo ifconfig eth1 <ip-address> netmask <netmask> up

However, the above command will not persist through reboots and will not keep your IP static, meaning your computer could be leased a new address if you're working for a longer period of time.

If you’ve followed all the steps correctly you should now see your connection to the host-only network! Test it out with a ping to see if it works.

by justin at May 27, 2015 07:51 PM

Anna Fatsevych

MySQL Stress Test

I ran various commands against the MySQL database, testing for speed and size variations.

To check the size of the database I used this query:

SELECT table_schema "Data Base Name",
       sum( data_length + index_length ) / 1024 / 1024 "Data Base Size in MB",
       sum( data_free ) / 1024 / 1024 "Free Space in MB"
FROM information_schema.TABLES
GROUP BY table_schema;

I am running a PHP file with the MySQL commands in a loop. Here is the code snippet showing how the processing time is calculated.

       // Start timing
       $time_pre = microtime(true);

       for ($x = 0; $x <= 50000; $x++) {
           $DB->query("INSERT INTO IMG(phash, bhash, author, license, directory) VALUES($hash, '{$bhash}', 'author', 'license', '{$directory}')");
       }

       // Stop timing
       $time_post = microtime(true);
       $exec_time = $time_post - $time_pre;
       echo "\n\n TIME: " . $exec_time . "\n";

The results:


The difference in INSERT time between BLOB and VARCHAR(1024) was relatively small: 52.54 seconds per 1,000 records for VARCHAR versus 58.22 for BLOB. The most significant difference was in size:
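In per-record terms, the figures above work out to roughly 52.5 ms versus 58.2 ms per INSERT, about an 11% time overhead for BLOB. A small Python aside:

```python
# Seconds per 1,000 INSERTs, which numerically equals ms per single INSERT.
varchar_ms_per_insert = 52.54   # VARCHAR(1024) column
blob_ms_per_insert = 58.22      # BLOB column

overhead_pct = (blob_ms_per_insert / varchar_ms_per_insert - 1) * 100
print(round(overhead_pct, 1))   # 10.8
```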


by anna at May 27, 2015 07:16 PM

MySQL and Data Types

After attempting to store pHash results as BIGINT in a MySQL database, some hashes were correct and some were erroneous, showing up consistently as 9223372036854775807 (the largest possible signed BIGINT in MySQL).

The solution to this problem was using UNSIGNED BIGINT, the type that stores the largest possible integer in a MySQL database. More on it here.
If that is still too small, the next option is VARCHAR.
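The clamped value is exactly the signed 64-bit ceiling, which is the giveaway: pHash's DCT hash is an unsigned 64-bit number, so any hash with its top bit set overflows a signed BIGINT. A quick Python illustration (the example hash value is made up):

```python
SIGNED_BIGINT_MAX = 2**63 - 1     # what all the bad rows contained
UNSIGNED_BIGINT_MAX = 2**64 - 1

print(SIGNED_BIGINT_MAX)                    # 9223372036854775807

example_hash = 0xF123456789ABCDEF           # top bit set: overflows signed
print(example_hash > SIGNED_BIGINT_MAX)     # True: clipped by signed BIGINT
print(example_hash <= UNSIGNED_BIGINT_MAX)  # True: fits UNSIGNED BIGINT
```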

Block Hash is stored as a binary string – and the data type for it is BINARY(64).

Images are stored as a BLOB.

Also, when working with pHash, special characters in file names such as brackets and parentheses might raise an error; use addslashes() or quotemeta(), or escape the special characters with a regex. In this example I am executing pHash and using the functions mentioned above to escape the string.

$hash = exec('./phash '.$directory.quotemeta($photo));

More MySQL and PHP goodness to come.


by anna at May 27, 2015 03:56 PM

May 26, 2015

Hosung Hwang

Content Based Image Retrieval : Concept and Projects

Concept of CBIR

Content-Based Image Retrieval (CBIR) is the technology for searching images in an image database without other information such as title or author.


It is basically divided into two parts: indexing and searching. Indexing is performed based on colour, texture, and shape. For fast and accurate searching, developing good indexing algorithms has been an important issue in the computer vision field.
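As a toy illustration of the two phases (not taken from any of the projects below), here is a sketch where the "index" maps image ids to 64-bit perceptual hashes and the search is a Hamming-distance scan; all names and hash values are made up:

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")

def search(index: dict, query_hash: int, max_distance: int = 10) -> list:
    """Searching phase: ids of indexed images within max_distance bits."""
    return [img_id for img_id, h in index.items()
            if hamming(h, query_hash) <= max_distance]

# Indexing phase: a real engine would derive these values from colour,
# texture, or shape features; here they are illustrative constants.
index = {"a.jpg": 0x0123456789ABCDEF, "b.jpg": 0x0123456789ABCDEE}
print(search(index, 0x0123456789ABCDEF))  # ['a.jpg', 'b.jpg']
```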

Open Source CBIR Engines

There are commercial CBIR engines such as Google Image Search and TinEye. The following are open source CBIR engines.


  • GPL, developed until 2006
  • written in C# .NET 2.0 using Visual Studio .NET 2005


  • Open license, developed until 2009
  • Written in C++, Supports python

The GNU Image-Finding Tool (GIFT)

  • full server system written in C, developed until 2004
  • divided into a kernel part and a plugin part
  • very complex, with many dependencies on other GNU projects

imgSeek WebSite, Desktop

  • developed until 2012
  • server system written in Python/C++/Java; uses SOAP


  • recently developed, GPL
  • written in Java; uses OpenCV and LSH
  • business model: source code customization consulting


  • image recognition platform for mobile apps
  • recently developed, LGPL
  • written in C++, consisting of 10 .cpp files; uses OpenCV 2.4
  • includes an HTTP server
  • uses ORB instead of SIFT or SURF (which are patent-protected)
  • business model: image database server hosting, source code customization


  • Pastec could be a good reference for building a content-based image retrieval system because it is simple and has the major features. It is also free of license and patent issues.
  • However, it is odd that they promote it specifically for mobile apps. Intensive testing is needed.

by Hosung at May 26, 2015 10:18 PM

Peter McIntyre

Responsive top-of-page logo image with Bootstrap

We’re working on a new version of the School of ICT web site recently.

It uses Drupal 7, and we have configured it to use the Bootstrap framework.

One of the things we wanted to do was to have a full-width logo image at the top of the page. However, we did not want it to scale, so I looked for a way to show an image that matched the Bootstrap style for the viewport, without using JavaScript.

Here’s what we settled on.


Four logo images

We created four logo images:

  1. logo-1100-plus.png
  2. logo-910-plus.png
  3. logo-690-plus.png
  4. logo-690-minus.png


Wrap the images

Wrap the image in a suitable container, if necessary. Then use the Bootstrap "hidden-xs/sm/md/lg" classes.

Here’s a code sample. Replace the <a href and <img src values with your own. Enjoy.

<p>Responsive images, using Bootstrap grid breakpoints</p>

<p>Width 1100 or more</p>
<div class="hidden-md hidden-sm hidden-xs">
  <a href="">
  <img src="~/Content/Images/logo-1100-plus-v1.png" /></a>
</div>

<p>Width between 992 and 1100</p>
<div class="hidden-lg hidden-sm hidden-xs">
  <a href="">
  <img src="~/Content/Images/logo-910-plus-v1.png" /></a>
</div>

<p>Width between 768 and 992</p>
<div class="hidden-lg hidden-md hidden-xs">
  <a href="">
  <img src="~/Content/Images/logo-690-plus-v1.png" /></a>
</div>

<p>Width less than 768</p>
<div class="hidden-lg hidden-md hidden-sm">
  <a href="">
  <img class="img-responsive" src="~/Content/Images/logo-690-minus-v1.png" /></a>
</div>
by petermcintyre at May 26, 2015 08:07 PM

Kieran Sedgwick

Lessons from the battlefield for an intermediate programmer

My biggest challenge as an intermediate-level programmer is almost poetic, is desperately tense and is more and more obvious the further I develop in the field. I’ve reached the point where walking through code is almost second nature. No more cold sweats at what looks like a magical jump in logic, or raised blood pressure at the sight of yet another third-party API, or reinforcing the fetal-shaped imprint on my bed when a bug defies all attempts to understand it.

At this point it’s not about solving problems. It’s about efficiency. My latest piece of work was relatively meaty, with sizeable problems that needed carefully considered solutions. It also took me a week longer to solve than I would have liked, so I decided to analyze my performance and draw lessons from the experience. Here’s what I observed:

Lesson No. 1: Problems in isolation are easier to solve


The web app I’m developing this summer needed to be refactored to work with a brand new authentication server, using a completely different authentication protocol. The codebase itself is messy, test-free and highly coupled. A complete refactor (my definite preference) was out of the question since there’s simply too much real work to do to worry about technical debt.

And so I fell into my first trap. My attention was torn between implementing the new authentication protocol, and not breaking the mess of a codebase in the process. Jumping back and forth between these two goals left me confused about the causes of my errors, and mentally exhausted from the task switching.

Solution: Next time, separate the problems, solving the most basic first

  • Identify which parts of the problem are easiest to isolate
  • Solve those problems in isolation, even if it’s contrived
  • Use perfected solutions of simple problems as a basis for the more complicated ones


  • Attention isn’t split between different domains
  • Problems don’t “cross pollinate”, confusing the source of an issue
  • Saves time

Lesson No. 2: If you have to really know a new technology, slow down and build it out as you learn it


Oauth2 is pretty cool, and I know that now. Learning it could have gone a little faster, if I used more haste and less rush. Relying on my instinct by skipping over learning fundamental terms and concepts led me down the wrong path. I struggled to find a good example implementation, so I tried to cobble together the concepts enough to empower me to implement it all at once. Not a good idea!

Solution: Make an effort to learn terms and concepts without losing patience, implementing something as soon as possible

  • Small confusions snowball as the pieces come together, so be thorough in research
  • Find or create examples that further or test your understanding of the piece you don’t understand
  • Solidify the learning by implementing something that works as soon as possible, even if it’s incomplete or contrived.


  • Would cut down the amount of “back and forth” referencing of material since you’re more familiar with the ideas
  • Surfaces misunderstandings through broken implementations, helping to solidify the trouble parts faster

To the future

In his book Talent Is Overrated: What Really Separates World-Class Performers from Everybody Else, Geoff Colvin points out that masters of their craft approach every task in a predictable way.

First, they have specific goals aimed at besting their previous efforts as they go into a task. Second, they check in with those goals as they perform the task, making sure they’re staying focused on improvement. Finally, they reflect on the experience to build the goals for their next attempt. Adopting this was my motivation for this post, and seemed like the best response to my disappointment over my performance.

To success!

by ksedgwick at May 26, 2015 04:52 PM

May 23, 2015

Anna Fatsevych

The wiki way

After some research on Wikimedia Commons, I have found out some information, along with the links for further references.

There is the Wikipedia API Sandbox, which allows you to test API calls.

There is also MediaWiki, which is a software package (written in PHP) with multiple tools and extensions available to add functionality and support. I have installed MediaWiki locally and will be able to create my own extensions and tools. To do so, I have signed up for a developer account with WikiLabs. Here are some helpful links:

API Source

Wikimedia Commons is a free file repository that is dedicated to promoting sharing of media. Read more about the Commons here.

There are also database dumps available with all the media on the Commons.

Here are the image dumps:

And mirrors here



by anna at May 23, 2015 01:39 AM

May 22, 2015

Barbara deGraaf


This blog will focus on my involvement in my current project, which is to make the cameras in three.js act like real cameras.

Continue reading to learn how cameras work, the various settings cinematographers can change on their cameras, and how those settings affect the image.

by barbaradegraafsoftware at May 22, 2015 01:15 PM

May 21, 2015

Justin Flowers

Configuring and using Logstash

Logstash has some incredibly well-defined installation guides that can easily be found through a Google search like "install logstash on [your Linux distro]". However, understanding how its configuration and permissions work by default can be daunting.

To start, we'll need a way to tell whether Logstash is working correctly. Unfortunately, starting the service by the standard means (something along the lines of sudo service logstash start) gives no information about whether the launch succeeded. To check, you'll have to look at the log file at /var/log/logstash/logstash.log. If you see an error like:

{:timestamp=>"2015-05-21T14:20:51.434000-0400", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}

That is simply a log entry telling you that Logstash shut down successfully. If the file is blank after starting the service, Logstash started with no errors. Otherwise you likely have a firewall or permissions issue, or a problem in your Logstash configuration file (the error will usually give more detail). So make sure to cat that file and verify that Logstash started successfully before injecting logs into your system. Permissions are a recurring issue when configuring Logstash, from permission to write to your output folders to permission to read the log files you need to intercept.

Configuring Logstash can be quite frustrating, as there are quite a few arbitrary rules in the various filtering methods. For example, the grok filter allows you to add fields to a JSON log before sending it to the outputs. Unfortunately, grok will not add these fields unless there is a successful match block. This means that to add a field, whether or not you care about the message field, you must use a grok with at least one match block (in this case I matched message against .*).

The only issues I had setting up input was getting the permissions correctly configured for the log files coming in, so I’ll skip that component except to show what mine looked like:

input {
  file {
    path => "/var/log/httpd/access_log"
    type => "apache"
  }
  file {
    path => "/var/lib/mysql/mysqld_general.log"
    type => "mysql"
  }
  file {
    path => "/var/log/messages"
    type => "linuxsys"
  }
}

The filter section, however, was much more frustrating. This is where I encountered the "grok needs a match to run anything else" issue. The following is a relatively simple filter that checks the log's type and gives it a different entry for its operatingsystem field.

filter {
  if [type] == "apache" {
    grok {
      patterns_dir => "/etc/logstash/conf.d/patterns"
      match => {
        "message" => "%{COMMONAPACHELOG}"
      }
    }
  }
  else {
    grok {
      match => { "message" => ".*" }
      add_field => { "operatingsystem" => "Host OS (CentOS)" }
    }
  }
}

The patterns_dir field tells Logstash where to find your user-defined regex terms. More about that over here. As you can see, if the log's type is "apache", it runs my custom Apache log regex on the message string (which identifies and stores the OS, among other things). Otherwise it adds the operatingsystem field with an entry describing what should go there. As mentioned earlier, without the match => { "message" => ".*" } line, the grok in the else branch will never fire.

Finally, for output I had only one issue: getting Logstash to send data to Elasticsearch. The key ended up being to add a protocol line to the elasticsearch block, explicitly telling it to use http for connecting. Again, for reference, here is mine:

output {
  file {
    path => "/tmp/logstash_out.log"
    codec => json_lines
  }
  elasticsearch {
    host => ""
    port => "9275"
    protocol => "http"
  }
}

Except for some nasty regex with the Apache logs, that's about all the issues I ran into making the Logstash configuration file work the way I needed it to. I hope this post has helped you work your way around the Logstash system a little better!

by justin at May 21, 2015 09:04 PM

Hong Zhan Huang

OSTEP – The city of ARMs – Building a simple network

After receiving the opportunity to join the Seneca Center for the Development of Open Technology (CDOT), I have begun a journey into the world of ARM architecture, the focus of the OSTEP team that I am a part of. Following the training for new additions to the team, we were each to assign ourselves an initial task. Mine would be to work with an ARM machine which I'll lovingly dub Sunny.

The main points of my task are to do the initial set-up of Sunny and then to test it with a particular brew of Linux for aarch64 systems, to see how stable Sunny is with this operating system.

The Set Up

Initial Stage

The first step towards setting up was to understand what connections I could use to work with Sunny from my workstation. Sunny doesn't have any video ports that can be hooked up to a display; the main ports of interest are serial, Ethernet, and an SD-card slot. The serial port is how we'll be interacting with it: with a serial-to-USB adapter, one can use a terminal emulation tool such as screen, minicom, or PuTTY to connect. Directions for using these tools to connect over serial can be found here. After that, I began to familiarize myself with Sunny and to think about which boot method would most effectively place another operating system onto it.

As stated prior, aside from serial there are Ethernet ports and an SD-card slot. The SD slot is seemingly meant for firmware updates (and perhaps as a last resort for moving files to and from the machine). That leaves Ethernet, which means a network installation (PXE boot) would be the best choice. Thankfully, here at CDOT we have a lovely cabinet of goodies known as the EHL (Enterprise Hyperscale Lab), so instead of going through the process of setting up a new PXE boot server and all the things that entails, I can leverage our existing infrastructure to make this task much easier.

Secondary Stage

With an inkling of what I need to do, now is the time to figure out how to do it. Certainly I want to use our existing network, but how? Sticking Sunny into the EHL is a little difficult at this time, as the cabinet is undergoing some clean-up and is generally quite full already. As luck would have it, we in the OSTEP team intend to build a second cabinet, which might be called the EHL2, but at the time of this writing the physical cabinet to hold it all together has yet to arrive. However, all the other parts that would become the EHL2 are already present. Thus my task becomes setting up a temporary "cabinet 2" that will connect to Sunny as well as to the original EHL, to make use of what's already here. I think a visual will better show what is being put together: a pile of gear.

This set up is much like a very simple and stripped down version of the actual EHL. Following physically connecting up all the hardware, it was time to configure some of the pieces. The steps to do so were as follows:

  1. Edit the DHCPD config to add new records for the Power Distribution Unit (PDU), Terminal Server (TS) and Sunny. Sunny’s record will also contain the information regarding what it will attempt to PXE boot from.
  2. Power up the TS and PDU and confirm their connection.
  3. Configure the TS and PDU as needed. The TS is what allows us to remotely connect to Sunny's serial port. The PDU allows remote power management of our "cabinet 2" network; for example, we can power cycle Sunny through the terminal or through the PDU's web interface. Power cycling is the equivalent of unplugging the power cord and plugging it back in.

Now that we’ve put everything together and configured it, with some luck our system will be up and fully running.

Things aren’t quite working yet stage

Here comes the trouble: for some reason, everything was working except Sunny's Ethernet ports. Running an ifconfig command showed that the Ethernet devices were detected and enabled but weren't actually up and running. Before I was assigned this machine there had been a myriad of magical and mysterious issues with the Ethernet, which at some point had resolved themselves; once again they appeared under my ownership. At the suggestion of my peers, I tested the power supply in the case, and sure enough it read as faulty on a power supply tester. After swapping it out, everything now works. Hurrah.

Some of the things I’ve learned this time:

  • The things that can make up a server cabinet, how to put one together physically, and the configuration needed to make it work.
  • Things to look out for when troubleshooting hardware, such as faulty power supplies.
  • Sometimes things will just be a mystery that requires testing one thing at a time.

We’re finally ready to PXE Boot but we’ll continue that in another post.

by hzhuang3 at May 21, 2015 02:58 PM

Anna Fatsevych

Download Speed, Size, and Bandwidth

While optimizing and expanding on my findings from the previous post, I ran some tests on my home wireless network, which has an average speed of 67 Mbps with a 23 ms ping according to a speed test.

Running two Flickr queries (API calls) per image, with up to three tries per query, I successfully downloaded 784 images within 57 minutes. They averaged 5 MB in size (705 images were 8 MB or less; the largest was 40.1 MB). After that came constant download failures, which led me to believe the maximum allowed quota of 3,600 queries per hour had been reached. I believe this happened because intermittency in my wireless network and on the server side caused multiple tries per unsuccessful API call; I programmed it that way for continuity.

Going by my statistics and this Mbps-to-MB/s calculator, my transfer speed was about 8.3 MB/s, so each of the 705 smaller images would take under a second to download, whereas images averaging 25 MB (I had 23 of them) would take about 3 seconds, and images averaging 35 MB (9 in my case) about 5 seconds; the larger images thus eat up the download time.

Ideally, a maximum of 1,797 images can be downloaded per hour: from the 3,600 total queries, subtract 2 initial search queries and 1 query per page of 500 images (4 pages maximum), then spend 2 queries per image with the current Python downloader program, images-flickr.

Efficient download speeds of at least 250 Mbps (about 31 MB/s), preferably higher, would narrow the time variance across download sizes, equalizing it to roughly one image per second or less. From one Flickr account (API key), a maximum of 43,128 images can be downloaded per day, with bandwidth averaging 237 GB per day.
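The per-hour and per-day figures follow from simple arithmetic on the 3,600-query quota. A sketch; the 5.5 MB average size is inferred from the 237 GB/day figure rather than stated explicitly:

```python
QUOTA_PER_HOUR = 3600
search_queries = 2       # initial search queries
page_queries = 4         # one query per page of 500 images, 4 pages max
queries_per_image = 2    # two API calls per image

images_per_hour = (QUOTA_PER_HOUR - search_queries - page_queries) // queries_per_image
print(images_per_hour)                              # 1797

images_per_day = images_per_hour * 24
print(images_per_day)                               # 43128

avg_image_mb = 5.5                                  # assumed average size
print(round(images_per_day * avg_image_mb / 1000))  # 237 GB per day
```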

I will continue posting my findings.


by anna at May 21, 2015 06:42 AM

May 19, 2015

Hosung Hwang

Research about OpenSIFT 2

In the previous post, I wrote about what OpenSIFT is and how it works. In this post, I will discuss performance, size, and quality.

Extracting keypoints

The utility program from OpenSIFT extracts keypoints and compares them at the same time. In our project, hash extraction and comparison happen separately, so I changed the sample code and wrote a bash script to extract keypoints for all image files.

The sample images were 896 JPEG files, 804.2 MB in total. Generating keypoints for all 896 images took, on my Core i7 machine:

real    93m27.545s
user    84m19.643s
sys 2m34.055s

Keypoint extraction time and keypoint file size vary significantly by image, depending on how complex the image is.

The least complex image: Metaball4.jpg

The following shows the image with its features.
Only 1 feature was found.
The original file size was 2.5 MB.
The keypoint file size was 375 bytes.
This is the entire keypoint file:

1 128
1256.797741 1136.853164 75.205367 3.072790
 6 0 0 0 0 0 0 11 3 0 0 0 0 1 28 46 5 2 0 0
 0 24 46 11 130 80 0 0 0 4 1 1 39 1 0 0 0 0 0 20
 109 16 3 3 3 11 53 130 8 4 4 23 93 123 76 31 130 9 0 2
 29 19 0 19 43 14 0 0 0 0 0 2 105 129 65 16 2 1 1 15
 11 22 77 130 73 10 2 8 130 0 0 16 21 1 0 130 10 6 0 0
 0 0 0 0 5 31 25 1 0 0 0 4 17 4 29 16 0 0 2 104
 130 0 0 1 0 0 0 130

Extracting time was :

real    0m6.994s
user    0m7.082s
sys 0m0.277s

The result seems weird: only one feature was detected even though there is a big round shape, and extraction still took 7 seconds.
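The keypoint file layout above (a header line with the feature count and descriptor length, then per feature one line of four floats followed by the descriptor integers wrapped across lines) is easy to parse. A hypothetical Python reader; the meaning and order of the four floats (position, scale, orientation) is assumed from Lowe's keypoint format, not confirmed from the OpenSIFT source:

```python
def parse_keypoints(text: str):
    """Parse the plain-text keypoint format shown above: a header
    "<n_features> <descriptor_len>", then per feature four floats
    plus <descriptor_len> integers wrapped across lines."""
    tokens = text.split()
    n, dim = int(tokens[0]), int(tokens[1])
    pos = 2
    features = []
    for _ in range(n):
        header = tuple(float(t) for t in tokens[pos:pos + 4])
        pos += 4
        descriptor = [int(t) for t in tokens[pos:pos + dim]]
        pos += dim
        features.append((header, descriptor))
    return features

# A tiny made-up file with 1 feature and a 4-element descriptor:
feats = parse_keypoints("1 4\n1256.8 1136.9 75.2 3.07\n 6 0 0 11\n")
print(len(feats), feats[0][1])  # 1 [6, 0, 0, 11]
```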

The most complex image: VidGajsek – Razgled 505.jpg

The following shows the image with its features.
265,116 features were found.
The original file size was 9.9 MB.
The keypoint file size was 101.7 MB.
Extraction time was:
Extracting time was :

real    3m1.761s
user    2m33.824s
sys 0m9.152s

The original image was an HDR-like image with many separate dots, and all those small pixel clusters seem to have features. Surprisingly, the keypoint file was 101.7 MB and extraction took more than 3 minutes.

I looked through the APIs for an option to write the keypoint file in a binary format, which would reduce the size. However, there is no such option.

Matching keypoints of two images

One problem with OpenSIFT is that there is no standard way to decide whether two images are similar or whether one image is part of the other. According to the code, matching works relative to one base image.

Testing character images

In the case of simple character images, the number of features is small. Therefore, when those images are compared with others, there are many false positives. For example, “I” has 2 features, and both match with 1, 2, 5, 7, b, d, E, f, h, i, j, L, n, P, r, T, V, W, and y. Likewise, ‘i’ (4), ‘j’ (3), ‘l’ (4), ‘o’ (2), and ‘O’ (2) have many matching results. Beyond those, the following pairs were false positives: 2 and d, 3 and I, 7 and V, 7 and y, D and 5, D and P, h and n, h and u, M and h, M and I, M and L, M and P, u and h, w and I.
In short, when images are simple, OpenSIFT gives many false positive results.

Testing complex images

When I compared the keypoint files of two completely different images, with 188,507 and 216,297 features respectively, the comparison took 1 minute and 224 features matched.

When I compared 4 images against all the other images, there did not seem to be any false positives. I couldn't do this for all images; it takes too long, especially for complex images.

In the case of an image cropped from another image:
Screenshot from 2015-05-19 18:58:39
Matching image looks like this :
Comparing took 10.430 seconds, and the result was :

Features of cropped image : 35915
Features of original image : 45669
Matched features : 23109

In this case, 64% of the cropped image's features were matched (23,109 of 35,915).

Although more testing is needed, it appears that when the images are complex, if more than 50% of the features match, the two images are similar or one is part of the other.
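
Expressed as a sketch (the 50% threshold is tentative, as noted, and the names are mine), using the smaller image's feature count as the base, since a crop can only match features it actually has:

```python
def likely_same(features_a, features_b, matched, threshold=0.5):
    """Decide similarity from OpenSIFT match counts: the ratio of matched
    features to the smaller image's feature count must reach the threshold."""
    base = min(features_a, features_b)
    return base > 0 and matched / base >= threshold

likely_same(35915, 45669, 23109)   # the cropped-image case above: True
likely_same(188507, 216297, 224)   # the unrelated-images case: False
```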

The remaining problems are the significant keypoint file size, keypoint generation time, and comparison time. Keypoint file size seems to be the biggest problem: the total size of the sample images is 808MB, while the total size of the generated keypoint files is 2.9GB. Although reducing the size seems possible by changing to a binary or compressed format, 30% of 2.9GB is still too big: bigger than the original image size.

by Hosung at May 19, 2015 11:33 PM

Anna Fatsevych

Flickr Method Calls and Python

Flickr has an extensive API to perform all sorts of operations. The method calls I found most useful were:

It takes “photo_id” as an argument (and of course an API key is required to use any of the Flickr API methods) and returns detailed information about the photo. The response, in XML (REST), JSON, or PHP serial format, includes such valuable information about the image as: user ID (nsid), user real name, username, date uploaded, date taken, license, and more.

Flickr also provides an instant Request/Response tool where you can test out your calls, seeing both the parameters in the request and the information you will get in the response. You can try it out for yourself here:
Here is a sample of a search result response:

Screenshot from 2015-05-19 14_55_00

In my previous post I mentioned I am using the Python API (here). It defines the methods and the arguments, which are easy to use:


def photos_search(user_id='', auth=False,  tags='', tag_mode='', text='',\
               min_upload_date='', max_upload_date='',\
               min_taken_date='', max_taken_date='', \
               license='', per_page='', page='', sort='',\
               safe_search='', content_type='', **kwargs)

Here is an example of me calling that method in my program:

f = flickr.photos_search(min_upload_date="2015-03-20", license="1,2,3", per_page="500") 

A list of photos will be returned matching the specified parameters. per_page specifies how many results to return per page (100 is the default, 500 the maximum). You may, however, query specific pages:

f = flickr.photos_search(min_upload_date="2015-03-20", license="1,2,3", page="2", per_page="500")

I have my program available on GitHub
I am using two method calls initially to get the total amount of images and pages:

#get the photos
f = flickr.photos_search(min_upload_date=date, max_upload_date=date, license=lic, per_page="500")

#get the total pages
fn = flickr.photos_search_pages(min_upload_date=date, max_upload_date=date, license=lic, per_page="500")

flickr.photos_search_pages is a method defined in the Flickr Python API that returns the number of pages; then it is as simple as iterating through all the pages if you are out to download all the images from a specific search result.
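
Putting the two calls together, walking every result page might look like this (a sketch; the flickr object is the API module from above, passed in explicitly so the function is self-contained):

```python
def all_photos(flickr, date, lic):
    """Yield every photo for a search by iterating over all result pages,
    using the photos_search / photos_search_pages methods shown above."""
    pages = flickr.photos_search_pages(min_upload_date=date,
                                       max_upload_date=date,
                                       license=lic, per_page="500")
    for page in range(1, int(pages) + 1):
        for photo in flickr.photos_search(min_upload_date=date,
                                          max_upload_date=date,
                                          license=lic, per_page="500",
                                          page=str(page)):
            yield photo
```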

While researching the Flickr API for Python, I found that the official Flickr API documentation links to three different Python APIs. Some are kept a little more rigorously updated than others, and some need a little tweaking; in my case, one had to be slightly changed (Flickr no longer responds with the “isadmin” property for users, so “isadmin” had to be commented out in order not to fault). Here are the links to the Flickr Python APIs:


Beej’s Python Flickr API (one I am using)

Hope this is helpful.



One more thing: when working with networks, I learned that they are extremely “touchy”, and various “socket” errors are bound to interrupt your flow. Thus, wrapping the code in try/except blocks proved to be crucial in my case. And if you make a mistake and wrap everything so tightly that the good old Ctrl+C fails to stop your for loop, just add:

except KeyboardInterrupt:
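
A sketch of how that wrapping can look (the function name, retry count, and delay are my own): retry on socket errors, but re-raise KeyboardInterrupt so Ctrl+C can still stop the loop:

```python
import socket
import time

def with_retries(fn, attempts=3, delay=5):
    """Call fn, retrying on network errors but never swallowing Ctrl+C."""
    for attempt in range(attempts):
        try:
            return fn()
        except KeyboardInterrupt:
            raise                      # let Ctrl+C through
        except (socket.error, IOError):
            if attempt == attempts - 1:
                raise                  # out of retries
            time.sleep(delay)
```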

by anna at May 19, 2015 08:02 PM

Hosung Hwang

Research about OpenSIFT


As a better solution for comparing images, there may be approaches other than perceptual hashing. What I looked at was SIFT (Scale-Invariant Feature Transform), an algorithm to detect features in images. For example, if one image is a cropped part of a bigger image, this algorithm detects it even if the cropped image is rotated.

Building OpenSIFT

OpenSIFT is an implementation of the SIFT algorithm; it uses OpenCV.
On my Ubuntu 14.04 machine, I installed OpenCV in the following order:

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install build-essential libgtk2.0-dev libjpeg-dev libtiff4-dev libjasper-dev libopenexr-dev cmake python-dev python-numpy python-tk libtbb-dev libeigen3-dev yasm libfaac-dev libopencore-amrnb-dev libopencore-amrwb-dev libtheora-dev libvorbis-dev libxvidcore-dev libx264-dev libqt4-dev libqt4-opengl-dev sphinx-common texlive-latex-extra libv4l-dev libdc1394-22-dev libavcodec-dev libavformat-dev libswscale-dev default-jdk ant libvtk5-qt4-dev
$ wget
$ unzip
$ cd opencv-2.4.9
$ mkdir build
$ cd build
$ cmake ..
$ make
$ sudo make install

After installing OpenCV, simply running make builds libopensift.a and two utilities: match and siftfeat.

How it works

The siftfeat utility extracts keypoints. From this original image:
It detects keypoints like this:
When the keypoint information is written into a text file, it looks like this:

116 128
103.219669 142.087615 42.452571 0.869987
 0 0 0 0 2 0 0 0 2 21 34 14 20 2 0 0 4 21 15 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 80 15 0 0
 74 89 51 61 191 59 1 7 191 159 25 13 29 5 1 26 17 0 0 0
 0 0 0 2 0 0 0 0 27 1 0 0 91 14 1 6 191 102 12 47
 191 37 1 1 30 26 9 159 33 0 0 0 0 0 0 8 0 0 0 0
 0 0 0 0 13 6 0 0 3 1 0 1 44 12 0 0 1 0 0 4
 0 0 0 0 0 0 0 0
103.219669 142.087615 42.452571 -2.213366
 0 0 0 0 0 0 0 0 1 0 0 6 46 8 0 0 3 1 0 2
 17 6 0 0 0 0 0 0 0 0 0 0 0 0 0 10 29 0 0 0
 29 21 9 190 190 25 1 2 190 82 13 58 95 12 1 13 23 0 0 0
 0 0 0 1 0 0 0 2 14 0 0 0 32 4 1 33 190 140 25 16
 190 39 1 9 73 79 44 79 77 9 0 0 0 0 0 3 0 0 0 0
 0 0 0 0 0 0 0 0 5 23 17 2 26 2 0 0 2 21 30 14
 2 0 0 0 0 0 0 0
130.321665 70.660335 15.892657 -2.046253
 1 2 1 59 56 15 1 1 130 96 1 3 2 10 5 42 133 122 0 0
 0 0 0 23 9 4 0 0 0 0 0 0 20 1 1 9 100 25 1 6
 133 5 0 1 0 1 4 133 133 14 0 0 0 0 7 133 15 1 0 0
 0 0 0 10 21 0 0 0 131 43 8 20 133 43 0 0 0 0 3 50
 133 45 0 0 0 0 1 34 2 1 0 0 0 0 0 2 1 0 0 0
 38 110 76 23 104 14 0 0 0 1 58 69 82 21 0 0 0 0 0 6
 1 0 0 0 0 0 0 0
<the last part omitted>

It writes 116 features into a text file. The full keypoint file is 41.8KB; the original PNG image file was 29.9KB.

When there is another image like this:
The extracted features look like this:
In this case, 95 features were found.

Using the match utility, the following output image is generated:
There are 34 feature matches in total.

Using the C APIs, we can:

  • detect features
  • generate an image annotated with features
  • write keypoint files
  • get the number of features found
  • match two images
  • match two keypoint files
  • generate a feature-matching image
  • get the number of features matched between two images

In the next post, more test results will be discussed.

by Hosung at May 19, 2015 04:49 PM

May 13, 2015

Hosung Hwang

Perceptual hash test for text image

In the last post, I made sets of test images that each contain a single character.

Today, I tested how pHash works for this kind of image, and how reliable the Hamming distance is under rotation, translation, and the addition of a dot.

Original Sample images

I generated basic images using this command:

for L in {A..Z} {a..z} {0..9} ; do convert -size 80x50 xc:white -font /usr/share/fonts/truetype/msttcorefonts/arial.ttf -pointsize 50 -fill black -gravity center -annotate 0x0+0+0 "$L" "$L.jpg" ; done

I generated a hash of every image using the phash program, and compared all the images using a perl script made by Andrew Smith.

The only image pair that gave a distance of 0 was l (lowercase L) and I (capital I). They are slightly different, but almost the same.

I l

Other than this case, all comparisons gave distances greater than 12. In the case of text: 0% false positives.
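
The distances reported here are Hamming distances between 64-bit pHash values; computing one is a single XOR and popcount:

```python
def hamming(h1, h2):
    """Hamming distance between two 64-bit hash values."""
    return bin(h1 ^ h2).count("1")

hamming(0b1011, 0b0010)  # bits differ in two positions
```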


I rotated the characters 2 degrees. The images look very similar.
Original :
Rotated :

In this case, I used pHash's sample program, test_image.
Result :

0 0.jpg 0.jpg dist = 12
1 1.jpg 1.jpg dist = 2
2 2.jpg 2.jpg dist = 4
3 3.jpg 3.jpg dist = 8
4 4.jpg 4.jpg dist = 8
5 5.jpg 5.jpg dist = 4
6 6.jpg 6.jpg dist = 4
7 7.jpg 7.jpg dist = 6
8 8.jpg 8.jpg dist = 4
9 9.jpg 9.jpg dist = 6

4 degrees Rotation :

Result :

0 0.jpg 0.jpg dist = 14
1 1.jpg 1.jpg dist = 8
2 2.jpg 2.jpg dist = 8
3 3.jpg 3.jpg dist = 16
4 4.jpg 4.jpg dist = 8
5 5.jpg 5.jpg dist = 10
6 6.jpg 6.jpg dist = 12
7 7.jpg 7.jpg dist = 6
8 8.jpg 8.jpg dist = 8
9 9.jpg 9.jpg dist = 6

45 degrees Rotation :
Result :

0 0.jpg 0.jpg dist = 24
1 1.jpg 1.jpg dist = 38
2 2.jpg 2.jpg dist = 20
3 3.jpg 3.jpg dist = 28
4 4.jpg 4.jpg dist = 30
5 5.jpg 5.jpg dist = 30
6 6.jpg 6.jpg dist = 22
7 7.jpg 7.jpg dist = 36
8 8.jpg 8.jpg dist = 26
9 9.jpg 9.jpg dist = 20


I slightly changed the position of characters in the image.
Original :
Moved :
Result :

0 0.jpg 0.jpg dist = 14
1 1.jpg 1.jpg dist = 10
2 2.jpg 2.jpg dist = 14
3 3.jpg 3.jpg dist = 18
4 4.jpg 4.jpg dist = 20
5 5.jpg 5.jpg dist = 20
6 6.jpg 6.jpg dist = 16
7 7.jpg 7.jpg dist = 14
8 8.jpg 8.jpg dist = 18
9 9.jpg 9.jpg dist = 14

When I moved them further, the distance was even bigger.

Adding a Dot

Then I added a dot at the same position in each image.
Original :
Added :
Result :

0 0.jpg 0.jpg dist = 18
1 1.jpg 1.jpg dist = 4
2 2.jpg 2.jpg dist = 0
3 3.jpg 3.jpg dist = 2
4 4.jpg 4.jpg dist = 8
5 5.jpg 5.jpg dist = 2
6 6.jpg 6.jpg dist = 2
7 7.jpg 7.jpg dist = 6
8 8.jpg 8.jpg dist = 4
9 9.jpg 9.jpg dist = 2
10 A.jpg A.jpg dist = 16
33 X.jpg X.jpg dist = 16

From the results, if the added dot overlaps the lines of the character, the distance is small; whereas if the dot does not overlap the lines, the distance is more than 10.

Next step

  • More tests with resizing, cropping, and different fonts are needed

ppt link

by Hosung at May 13, 2015 09:18 PM

May 12, 2015

Anna Fatsevych

Flickr API and ExifTool on CentOS 7

Flickr holds a ridiculous number (millions) of Creative Commons licensed images, and I have been learning to use their (REST) API to download images based on tag and license requirements.

I am using the Flickr Python API (here) and a modified version of the Python script available here. You do need to obtain your API key and secret from Flickr by registering, and also input those values in the file. A Flickr API method provides a list of all available licenses.


I am searching Flickr images by tag, and will also filter by license ID. To do so, I use the method that accepts search parameters with the call, such as user_id, tags, tag_mode, min_upload_date, max_upload_date, min_taken_date, and max_taken_date.

Here is the code for calling the method :
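
A sketch of what such a call looks like with Beej's Python Flickr API (the tag and license values are illustrative, and the flickr object is passed in explicitly here so the function is self-contained):

```python
def search_by_tag(flickr, tag, lic):
    """Search for photos carrying the given tag under the given license ID,
    using the photos_search method from the Python Flickr API."""
    # tag_mode="all" requires a photo to carry every listed tag.
    return flickr.photos_search(tags=tag, tag_mode="all",
                                license=lic, per_page="500")
```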


I run the program using:

$ python students 1

The program will then proceed to download all matching images under that tag with the specified license (1 in this case, which is CC BY-NC-SA 2.0), up to a maximum of 500 per query and within the 3600-API-queries-per-hour limit set by Flickr.

To check which XMP tags the downloaded images have, I downloaded the perl script that uses the ExifTool library. To install it on CentOS 7, the EPEL release RPM package has to be installed first by issuing this command:

$ sudo yum install epel-release

Then to install the ExifTool:

$ sudo yum install perl-Image-ExifTool

Then, to read the tags, you can just run exiftool imagename.jpg.




by anna at May 12, 2015 08:11 PM

Justin Flowers

Simulating Hard Drive Failures in Linux

For our project we're going to need to do various kinds of testing, including simulating failed drives. From our research, there are not many ways to do this without creating a copy of a working drive and injecting corrupt blocks into it. Here's a great answer on Stack Overflow with a list of options available for this.

Since we'll be handling large drives, making a corrupt copy of a drive seems unfeasible. That pretty much eliminates the device-mapping route. Additionally, our access to the hard drives will not be through POSIX commands alone, making libfiu less useful. Because of all this, we feel that the best route is to simulate our drives as a kind of linear RAID device with mdadm, setting them to faulty every now and then.

Unfortunately, this is easier said than done. The documentation for mdadm is quite confusing.  To do it, you’ll need to start with an unmounted drive and call:

sudo mdadm --create /dev/md1 --level=faulty --layout=clear --raid-devices=1 /dev/sdb

That command will create a mapping from your drive (replace “/dev/sdb” with your drive) to a newly created RAID device called “md1”. It will also set the drive to use the faulty personality with “--level=faulty” and set it to not send any faults yet with “--layout=clear”.

If your drive already had partitions with filesystems on them, then you can skip this step. Otherwise you’ll want to add a partition to md1 with fdisk:

sudo fdisk /dev/md1

To add a new partition use “n”, adding a primary partition in the first slot. You can use the defaults for the other options by just hitting enter. If you want to make the partition smaller than the drive, you can use “+500M” or “+1G”, for example, in the “Last sector” option. When you've finished adding the partition, make sure to send “w”, otherwise your changes won't be made!

Surprisingly though, your newly made partition (or existing partitions) will not show up in /dev. This is because you’ve technically made a logical device with mdadm, so you’ll need to register it with:

sudo kpartx -a /dev/md1

After doing that, run “lsblk” to see your block devices and their mappings. You can see your mappings to partitions in /dev/mapper (the one you just made will look like “md1p1”). By running:

sudo mkfs -t ext4 /dev/mapper/md1p1

You can format it to ext4. Finally, to mount it, create a directory to mount it to (I like /mnt/hdd) and call:

sudo mount /dev/mapper/md1p1 /mnt/hdd

Now that your system is mounted, at any time you can start injecting errors by calling:

sudo mdadm --grow /dev/md1 --layout=wp3

You can set different layout parameters to inject different faults with the “--layout” option. The first half specifies the type of fault to inject (“wp”) and the number after it specifies the number of accesses before sending the fault (the period). For more examples of how to use the layout parameter and what the different faults are, check out here and here.

To take down this drive, you’ll need to call in this order:

sudo umount /mnt/hdd
sudo kpartx -d /dev/md1
sudo mdadm --stop /dev/md1

And then your drive will be back to normal!

by justin at May 12, 2015 07:09 PM

Hosung Hwang

Making sample images using ‘convert’ utility

In the previous post, I made sample images using Java. The ‘convert’ utility provides powerful image-processing functionality.
‘convert’ is part of ImageMagick. Using ‘convert’ and a bash script, we can make sample images easily.

A-Z a-z 0-9 image

for L in {A..Z} {a..z} {0..9} ; do convert -size 80x50 xc:white -font /usr/share/fonts/truetype/msttcorefonts/arial.ttf -pointsize 50 -fill black -gravity center -annotate 0x0+0+0 "$L" "$L.jpg" ; done

Screenshot from 2015-05-12 13:20:47

The same set with italic style

for L in {A..Z} {a..z} {0..9} ; do convert -size 80x50 xc:white -font /usr/share/fonts/truetype/msttcorefonts/arial.ttf -pointsize 50 -fill black -gravity center -annotate 0x30+0+0 "$L" "$L.jpg" ; done

Screenshot from 2015-05-12 13:20:11

The same set with rotation

for L in {A..Z} {a..z} {0..9} ; do convert -size 80x50 xc:white -font /usr/share/fonts/truetype/msttcorefonts/arial.ttf -pointsize 50 -fill black -gravity center -annotate 45x45+0+0 "$L" "$L.jpg" ; done

Screenshot from 2015-05-12 13:21:32

by Hosung at May 12, 2015 05:26 PM

David Humphrey

Learning to git bisect

Yesterday one of my students hit a bug in Brackets. We're working on an extension for Thimble that adds an autocomplete option to take a selfie when you are typing a URL that might be an image (e.g., <img src="...">). It also needs to work in CSS, when you enter something like background-image: url(...). Except it doesn't work in the latter case. I advised him to file a bug in the Brackets repo, and the response was, "This used to work, must be a regression."

I told my student he should bisect and find the commit that caused the regression (i.e., code change that introduced the bug). This was a new idea for him, so I did it with him, and promised I'd write something later to walk through the process. Now that it's "later," I'm writing a short walkthrough on what I did.

As I write this, there are 16,064 commits in the adobe/brackets repo on github. That's a lot of coming and going, where a bug could easily hitch a ride and quietly find its way into the product. How does one hope to find what is likely a tiny needle in such a huge haystack? The answer is git bisect.

Having a large number of commits in which to locate a bad commit is only difficult if we try and manually deal with those commits. If I remembered that this code worked yesterday, I might quickly look through the diffs for everything that landed recently, and figure things out that way. But when we're talking about months, or maybe years since this last worked, that strategy won't be efficient.

Luckily git is designed to help us here. Because git knows about every commit, both what changed, and what change(s) came before, we can take a huge range of commits and slice and dice them in order to expose the first bad one.

In order for this process to work, you need a reproducible test case. In an ideal world, this is a unit test that's in your git history, and is stable across all the commits you'll be checking. In the real world you often have to write one yourself, or get a set of steps that quickly exposes the bug.

For the bug I was tracking, I had a simple way to test in the editor manually. When the code works, it should give us a list of filenames to use for the URL, and look like this:

Working Image

When it fails, it keeps giving the list of completions for background-image instead, and looks like this:

Broken Image

Now that I have a simple way to confirm whether or not the bug is present in a particular version of the code, I need to quickly eliminate commits and narrow down exactly where it came from.

You begin a bisect session with git by doing: git bisect start. In order to have git bisect for me, I need to create a range, and to do this I need two end points (i.e., two commits): one where the bug is not present (good commit); and one where I know it's broken (bad commit). Finding a bad commit is usually pretty easy, since you already know you have the bug--in my case I can use master. I tell git this is the bad commit by doing git bisect bad (note: I was sitting on the master branch when I typed this. You can also explicitly give a commit or branch/tag).

But for the last-known-good commit, I obviously don't know exactly where it happened. As a result, I'm going to need to overshoot and go back far enough to get to something that works.

Brackets is currently preparing for version 1.4 on master, and there are 80 previously released (i.e., tagged) versions. These tagged versions are useful for quickly jumping back in time, since they represent versions of the editor that are most likely to run (i.e., they didn't tag and release broken commits). So I start checking out old releases: 1.1 (broken), 1.0 (broken), release 0.44 (broken). It looks like this is an old problem, so I jump further back so as to not waste my time testing too many.

Eventually I checkout version 0.32 from Oct 4, 2013, and the bug isn't there. Now I can tell git about the other end of my bisect range by doing: git bisect good.

Now git can do its magic. It will take my good and bad commits, and checkout a commit half way between them. It looks like this:

Bisecting: 3123 revisions left to test after this (roughly 12 steps)  
[940fb105ecde14c7b5aab5191ec14e766e136648] Make the window the holder

A couple of things to notice. First, git has checked out commit 940fb105ecde14c7b5aab5191ec14e766e136648 automatically. It has let me know that there are 12 steps left before it can narrow down the problematic commit. That's not too bad, given that I'm trying to find one bad commit in thousands from the past two years!
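
That "roughly 12 steps" figure is just the binary-search bound: bisect halves the range each time, so isolating one commit out of N takes about ceil(log2(N)) tests:

```python
import math

# git's step estimates match ceil(log2(N)) for the remaining range:
steps = math.ceil(math.log2(3123))       # 12, matching git's first estimate
steps_next = math.ceil(math.log2(1558))  # 11, matching the next one
```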

At this point I need to run the editor for commit 940fb105ecde14c7b5aab5191ec14e766e136648 and test to see if it has the bug. If it does, I type git bisect bad. If it doesn't, I type git bisect good. In this case the bug is there, and I enter git bisect bad. Git responds:

Bisecting: 1558 revisions left to test after this (roughly 11 steps)  
[829d231440e7fa0399f8e12ef031ee3fbd268c79] Merge branch 'master' into PreferencesModel

A new commit has been checked out, eliminating half of the previous commits in the range (was 3123, now it's 1558), and there are 11 steps to go. This process continues; sometimes the bug is there, sometimes it isn't. After about 5 minutes I get to the first bad commit, and git shows me this:

6d638b2049d6e88cacbc7e0c4b2ba8fa3ca3c6f9 is the first bad commit  
commit 6d638b2049d6e88cacbc7e0c4b2ba8fa3ca3c6f9  
Author: <Name of Author>  
Date:   Mon Apr 7 15:23:47 2014 -0700

    fix css value hints

:040000 040000 547987939b1271697d186c73533e044209169f3b 499abf3233f1316f75a83bf00acbb2955b777262 M    src

Now I know which commit caused the bug to start happening. Neither git nor I know why this commit did what it did, but it doesn't matter. We also know what changed, which bug was being fixed, who did it, and when they did it. Armed with this info we can go talk to people on irc, and add notes to a few bugs: the bug where the bad commit was added (this will alert people who know the code, and could more easily fix it), and our bug where we've filed the issue. Often you don't need to solve the bug, just find it, and let the person who knows the code well help with a fix.

The last step is to tell git that we're done bisecting: git bisect reset. This takes us back to the commit we were on before we started our bisect.

Despite being quite skilled with git, I don't think that any of my students had used bisect before. Lots of people haven't. It's worth knowing how to do it for times like this when you need to quickly narrow down a regression.

by David Humphrey at May 12, 2015 03:01 PM

May 11, 2015

Hosung Hwang

Making images from alphabet characters

To test perceptual hashing, I wanted images that each contain a character: a.jpg, a.png, a.gif, b.jpg, etc.

There were no such images available, so I wrote a simple Java program using a code block from this page.

This program makes a jpg, png, and gif image for each of A-Z, a-z, and 0-9.

Full source code is:

import java.awt.Color;
import java.awt.Font;
import java.awt.FontMetrics;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

//modified from
public class FontImage {

    public static void drawText(Font font, String text){
        BufferedImage img = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        Graphics2D g2d = img.createGraphics();
        g2d.setFont(font);
        FontMetrics fm = g2d.getFontMetrics();
        int width = fm.stringWidth(text);
        int height = fm.getHeight();

        img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        g2d = img.createGraphics();
        g2d.setRenderingHint(RenderingHints.KEY_ALPHA_INTERPOLATION, RenderingHints.VALUE_ALPHA_INTERPOLATION_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_COLOR_RENDERING, RenderingHints.VALUE_COLOR_RENDER_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_DITHERING, RenderingHints.VALUE_DITHER_ENABLE);
        g2d.setRenderingHint(RenderingHints.KEY_FRACTIONALMETRICS, RenderingHints.VALUE_FRACTIONALMETRICS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g2d.setRenderingHint(RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_STROKE_CONTROL, RenderingHints.VALUE_STROKE_PURE);
        g2d.setFont(font);
        fm = g2d.getFontMetrics();
        g2d.setColor(Color.WHITE);
        g2d.fillRect(0, 0, width, height);
        g2d.setColor(Color.BLACK);
        g2d.drawString(text, 0, fm.getAscent());
        try {
            ImageIO.write(img, "gif", new File(text + ".gif"));
            ImageIO.write(img, "png", new File(text + ".png"));
            ImageIO.write(img, "jpg", new File(text + ".jpg"));
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    public static void main(String[] args) {
        String text = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
        Font font = new Font("Arial", Font.PLAIN, 48);

        for (char t : text.toCharArray())
            drawText(font, Character.toString(t));
    }
}

The output files are :
Screenshot from 2015-05-11 16:35:03

There is a better way to do this using the convert utility; it is described in the next posting.

by Hosung at May 11, 2015 08:44 PM

May 07, 2015

Kieran Sedgwick

Leveraging Travis CI & Heroku for rapid deployment of Thimble

As we got closer to a usable mashup of Brackets and Thimble we wanted to allow other people to play with what we’d done so far.

By leveraging Travis CI hooks I was able to automate deployment of our app to Heroku whenever we updated our main branch. This was easier than I'd anticipated (see the plethora of excellent documentation for more information on the process) and it also surfaced some issues revolving around heavily interdependent apps:

1. Local is a racetrack, deployed is a city block

The reality of local development of Webmaker is that most of the pain has been removed. The excellent webmaker-suite module clones, installs and configures the Webmaker application ecosystem based on which of their core and peripheral servers you need running.

Aside from new environment variables we introduced, we never had to touch the config. All of the features tied to the tight coupling with other parts of the ecosystem, like login and publishing makes, “just worked”. Not so in deployment.

We had to accept that, at least at first, there was only so much we could expose to others for testing, and that our application would look far more incomplete than it actually was.

2. When deployed, the pitstop is miles away

An automated deployment system also meant that we were held hostage by the length of the continuous integration process if something broke on Heroku. Reverting breaking changes had a time delay not found locally, and considering our workflow (imagine Thimble as a tree with a complicated root system of dependencies and submodules) things could get pretty messy as we tracked down a problem.

Add to that the time it took to redeploy and it became clear that we had to be more mindful of what we pushed and where.

3. If local is a… drawing of a donut, then… something something bagels

The main takeaway from this process was that Heroku wasn't at all ideal for deploying this particular application: a picture of a donut compared to a donut. What we really needed was a full deployment of the Webmaker ecosystem! So that became the next goal in our automagical deployment journey.

by ksedgwick at May 07, 2015 05:12 PM

May 04, 2015

Andrew Smith

Screen scraping timetable data from a PeopleSoft Faculty Center

Our school moved to PeopleSoft for.. I’m not going there.. but that’s where everyone’s timetables are now. I thought maybe this big fancy company has an API to let me access the data but no, it’s basically impossible to access the API directly.

So I was left with screen scraping, which I always wanted to try, why not. Go to the page I want to examine, open up Firebug, and drill down to the table elements I’m interested in: body>div>iframe>html>body>form>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div…

Er, wtf? I seemed to be caught in some Firebug infinite-loop bug. Surely they don't have that many tables inside each other? Then I discovered the “Click an element” button and found that there really are lots and lots of tables inside tables on this simple page:

faculty centre

This is with the text at its minimum size, you can see by the scrollbars what I’m talking about:

Firebug peoplesoft html

But after a while I managed to figure it out. I had to learn some XPath to find the cells I was interested in based on their IDs, but I couldn’t use XPath for everything – I tried but it ate all my RAM and was still working through the swap partition when I killed it in the morning.

Here's the script in case you're in the same boat. It prints the timetable data to the console. For myself, I intend to turn it into JSON for import into Everyone's Timetable.

// Firebug script to scrape timetable data from a PeopleSoft-backed website.
// Run it when you're on the page that shows the timetable. You get to that page
// like so:
// Faculty Center
//  Click the Search tab
//    Expand Additional Search Criteria
//      Set "Instructor Last Name" to the one you're looking for
//        Start Firebug, go to Console, paste in this script and run it
// Author: Andrew Smith

var frameDocument = document.getElementById('ptifrmtgtframe').contentWindow.document;

// DERIVED_CLSRCH_DESCR200$0, $1, etc. have the course title
var courseTitles = frameDocument.
  evaluate("//div[contains(@id,'DERIVED_CLSRCH_DESCR200')]",
  frameDocument.documentElement, null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
// For each course
for (var i = 0; i < courseTitles.snapshotLength; i++) {
  var courseTitle = courseTitles.snapshotItem(i);
  // Find the the next tr which has the timetable data for this course
  var timetableTableParentRow = courseTitle
  // There's some fucked up empty row after the first course title only
  if (i == 0)
    timetableTableParentRow = timetableTableParentRow
  // Now go down to the table in this tr, it's the only thing that has 
  // an id so I can use xpath to find its children (timetable rows).
  var timetableTableId = timetableTableParentRow
  // MTG_DAYTIME$0, $1, etc. have the day and time range in this format:
  // Mo 1:30PM - 3:15PM
  var times = frameDocument.
    evaluate("//div[@id='" + timetableTableId +"']//div[contains(@id,'MTG_DAYTIME')]", 
    frameDocument.documentElement, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var timesArray = new Array();
  for (var j = 0; j < times.snapshotLength; j++) {
    timesArray[j] = times.snapshotItem(j).textContent;
  // MTG_ROOM$0, $1, etc. have the room number in this format:
  // S@Y SEQ Bldg S3028
  var rooms = frameDocument.
    evaluate("//div[@id='" + timetableTableId +"']//div[contains(@id,'MTG_ROOM')]", 
    frameDocument.documentElement, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var roomsArray = new Array();
  for (var j = 0; j < rooms.snapshotLength; j++) {
    roomsArray[j] = rooms.snapshotItem(j).textContent;
  // MTG_INSTR$0, $1, etc. have the instructor names but I think I'll
  // ignore them. For shared courses it won't hurt too much I hope.
  // Dump all the timetable data into the console, will do something with it later.
  for (var j = 0; j < times.snapshotLength; j++) {
    console.log(timesArray[j] + roomsArray[j]);

by Andrew Smith at May 04, 2015 02:39 AM

April 28, 2015

Andrew Smith

Perceptual hash comparison: pHash vs Blockhash: false positives

I’ve been considering a project idea for Seneca’s partnership with Creative Commons. For that idea to work I would need a tool to create perceptual hashes from images that:

  • Give true positive results when comparing images that were resized, and/or their colours changed.
  • Give very few (near zero percent) false positive results. Too many false positives (even 0.1%) would kill this idea.

I’ve spent the weekend working on the second part, because it’s the biggest unknown. I’ve compared pHash 0.9.6 and Blockhash 40bfc84f03. The comparison isn’t entirely fair because Blockhash supposedly has a better algorithm in the works but that is not yet published.

Test Data (sample images)

I downloaded these from Wikimedia Commons via the Images category. Not manually, I used wimgs to do the actual downloading once I chose the categories I wanted.

The categories I chose semi-randomly, but slightly biased towards what I thought were likely candidates to show perceptual hashing limitations: 3D_computer_graphics, Computer_generated_images, Computer_generated_particles, Digital_forensic_facial_reconstructions, Fractals, and HDR_images.

From that I ended up with 900 very different and very similar JPG files.

pHash issues

The open source, published pHash hasn’t been updated since 2013. The project isn’t abandoned but they’re not spending a lot of time on it. There is one github repo (sdepold’s) that’s attempted to collect random patches, but that’s explicitly not maintained and I didn’t even manage to build it.

One definite bug I ran into is described here. The fix reduced the number of all-zero hashes I got for PNG images, but there were still some left. So I decided to sidestep this bug by using the convert command to convert all the downloaded images to JPG before hashing them.

pHash test implementation

I had to write a simple C++ program based on their example, because the example was too complicated to be useful to me out of the box. Mine basically consists of this line:

ph_dct_imagehash(filename, tmphash);

Then I wrote another C++ program just to compute the hamming distance, again basically one line:

ph_hamming_distance(ulong64 hash1, ulong64 hash2);

That worked. Later I replaced it with a Perl script so it could also handle the 256-bit hashes from Blockhash.
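For the record, the distance computation itself is trivial: XOR the two hashes and count the bits that differ. Here is a sketch of my own (not pHash's internal code, though for 64-bit hashes ph_hamming_distance should compute the same thing), written over bytes so the same function also handles 256-bit Blockhash values:

```c
#include <stddef.h>
#include <stdint.h>

/* Hamming distance between two hashes, each nbytes long:
   XOR the hashes and count the set bits in the result. */
int hamming(const uint8_t *h1, const uint8_t *h2, size_t nbytes) {
    int d = 0;
    for (size_t i = 0; i < nbytes; i++) {
        uint8_t x = h1[i] ^ h2[i];   /* bits that differ in this byte */
        while (x) {
            d += x & 1;
            x >>= 1;
        }
    }
    return d;
}
```

A 64-bit pHash value is nbytes = 8; a 256-bit Blockhash value is nbytes = 32.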

Those two combined with several shell scripts gave me an incredibly slow but good-enough-for-my-investigation process for running the tests on my 900 sample images.

pHash results

Comparing the hash of every image to that of every other image and looking for hamming distances <= 10, I ended up with 63 matches. 31 of those looked like real false positives to me; the other 32 were actually the same image with no (or irrelevant) variations. That means that out of 900 files, there were 31 files each of which had at least one false match in the set.
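That all-pairs check can be sketched in a few lines of C (the threshold of 10 is the one used above; the function name and in-memory layout are mine, just to show the shape of the test):

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many hashes have at least one *other* hash within
   `threshold` Hamming distance, i.e. candidate duplicate images. */
int count_matched(const uint64_t *hashes, size_t n, int threshold) {
    int matched = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (i == j)
                continue;
            uint64_t x = hashes[i] ^ hashes[j];
            int d = 0;
            while (x) {              /* popcount of the XOR */
                d += (int)(x & 1);
                x >>= 1;
            }
            if (d <= threshold) {
                matched++;           /* this image has a near-match */
                break;
            }
        }
    }
    return matched;
}
```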

That’s a 3.4% false positive rate, which is completely unacceptable for what I need. So I reduced the maximum distance from 10 to 4. That reduced the number of false positives to 6, 4 of which are debatable. Tenuously I could say that of the following images only two are actually different (a false positive rate of 0.2%), but more testing would be needed to confirm that:

Metaball7 Metaball8 Metaball9 Metaball10 Snapshot07 Snapshot10pointcloud

And yes, I was that picky when deciding whether there is a false positive hit.

Now to make sure that pHash works well for finding true positives I took this image:

01abc-resize128 and scaled it to sizes of 64, 128, 256, 512, 1024, 1536, and 2048.

I know it’s not a great example but it was the first in the list and I was just running a quick test.

The same algorithm matched all of them with a distance <=6, and all but the 64 pixel version with a distance <= 4. For me false positives are significantly worse than false negatives, I can live with a size limitation if that’s all it is.

In short: it appears that pHash may be able to do what I want. But I need to do more testing to confirm that.

Blockhash test implementation

I didn’t need to write any code for the hashing part; the included build/blockhash did exactly what I needed.

I still had to write some shell scripts to make it and the hamming distance calculator work on my image set.

I left the default hash size at 16^2 (256 bits), that’s already 4 times the pHash hash size.

Blockhash results

I ran essentially the same test on the same data with Blockhash as with pHash. With distance <= 10 I got 53 matches, of which 29 looked like real false positives.

I was about to test with distance <= 4 when I thought I should do the true positive test first, to make sure it’s similar to pHash. It wasn’t.

In fact, for my 7 resized versions of one image I got distances as large as 27, with the majority over 10. That means that with Blockhash, bringing the false positive rate down destroys the true positive rate, which makes it useless to me. I did not pursue further testing, but perhaps I’ll try again if the Blockhash algorithm is updated.


It appears to be possible to get an extremely low false positive rate with pHash while keeping a very good true positive rate.

Perceptual image hashes are a complex business. If you try to come up with an algorithm in your head – it will suck. Which makes me worried about whether it’s a solved problem or I would need to solve it myself.

If I want to pursue this idea further I will need to run more tests on more images. Testing like this takes time. My hacky test system takes literally hours to run through 900 images; it could be optimized a lot, but the optimization process would take days or weeks. I will post followup results if there are any.

by Andrew Smith at April 28, 2015 04:43 AM

How to get a tourist visa to travel to Russia from Canada

I’m going to Moscow in a couple of months, and we bought the tickets a while ago. More recently I’ve randomly discovered that I need a visa to travel to Russia. I’m glad I know people who know these things, it would have been very annoying to arrive there and be told we’re not allowed in. I’d expect Expedia to warn me, but no I guess they don’t do that.

Anyway, now that I’ve gone through the process I can say it’s very straightforward if you know what to do, and if you don’t – it feels like an impossible bureaucratic mess. So here’s a tutorial for getting a visa to go to Russia if you’re looking to do that.

There is some information here and here if you read Russian, but the guidance there is inadequate.

You start by filling out this application for the consulate. You won’t have everything you need yet, but it doesn’t hurt to start; you can resume the application later. If you manage to screw up massively – don’t worry: at the end of the application you have to print it out and take it to the office, so at this point a mistake is not something to worry about.

Here are parts of the application that may be confusing and how to deal with them:

If you had USSR or Russian nationality at some time

The first really annoying thing. For Canadian-only citizens this is not an issue. But if you were born in the USSR – you may run into a problem like I have. The problem is that quite likely you have no documents left whatsoever from that time, so the options they give you don’t work:

  • Prove that you’ve renounced your Russian citizenship: this proof may not exist if you left in the early 90s
  • Visa to Israel: only works if that’s where you went
  • Stamp in their passport saying… only works if you had a passport back then and still have it.

None of those worked for me. My parents went to Romania when I was 12; I had no passport, and the Romanians even replaced my birth certificate with a Romanian one. But my passport today still says I was born in Moldova, so I have to deal with this.

For me I chose “Yes” in answer to that question and for the details picked 01June1991: DISSOLUTION OF THE USSR. That’s the advice I was given at the Visa office. The only document I had to provide was the Landed Immigrant paper that was in my passport when I immigrated. Luckily I kept that even though I’m a citizen now.

Number of entries

You pick one. I don’t know what the point is in more than one but if that’s what you need – I can’t answer that question.

Date of entry into / exit from Russia

These have to match your actual dates. Mind that you may arrive on a different day than when you’re leaving. In the past visas were given out for a period of time unrelated to specific entrance/exit dates but that’s no longer the case.

Which institution you are going to visit?

This is the other annoying thing. The theory is that you’re going to use a Russian travel agency for your trip and they will provide you with these numbers. But we weren’t going to use a travel agency – I speak Russian and am perfectly able to find my way around – there’s plenty to see in Moscow without a pre-arranged tour.

But that’s not an option. Don’t worry, we’re not the only ones who ran into this. Basically what I did was look for hotels in the centre of Moscow, several of which had “Visa Support” buttons. These take you to websites such as this one, where you pay approx. $35 to get the paper you need for your visa. No other services are involved (hence the low price). After you pay for it (credit card via PayPal worked for me) you get the paper you need immediately in your email.

On that paper you have to find the reference number (it’s called reference number) and a confirmation number (top of the form on the left).

Medical insurance

I found a place online that said this is not needed for Canadian citizens, it’s mostly for people from the EU. Probably because they don’t have public health care systems in many of the EU countries.

Just in case though I put in one of my numbers from my Sunlife insurance from work.

Do you plan to stay anywhere (hotel, individual)

Doh, yes. You want to pick Hotel. If you pick Individual – I hear they have to go through hell on their end to get the paperwork done. Most likely you don’t need the hotel booked at this point. I didn’t, but they said that because we’re travelling with the whole family, the consulate may ask for a reservation confirmation.

Education and work experience

I think the point of this is to figure out whether you’re planning to go to Russia to work or participate in some armed conflict. Make sure that you don’t and fill in the form appropriately. The details (addresses/phone numbers) don’t need to be precise.

Have you ever visited other countries in the past ten years?

Who the hell hasn’t? I put in one entry with a made-up date for the USA and one for Moldova. I don’t know how they expect me to figure out where I went when in the last 10 years.

That’s all the advice I have. Note that the people at the Visa office are actually quite nice and helpful, but they don’t answer their phone so you have to go there and talk to them. Note also that the visa office is just a processing centre – they don’t make any decisions. So ask them your questions, they’ll help you as best they can, but once your application is ready – it’s up to the guys at the consulate to decide.

And bring payment as cash or a money order, they don’t accept anything else there.

by Andrew Smith at April 28, 2015 03:26 AM

April 23, 2015

Maxwell LeFevre

SIMD Vectorization

About Automatic Vectorization

This post is about Single Instruction Multiple Data (SIMD) and vectorization. In class we had a look at a simple C program and the assembly that gcc creates when it is compiled. The program is as follows:

#include <stdlib.h>
#include <stdio.h>

void main() {
  int a[256], i, t = 0;

  srand(1);			// initialize prng

  for (i=0; i<256; i++) {
    a[i]=rand();		// load array randomly
  }

  for (i=0; i<256; i++) {
    t+=a[i];			// sum the array
  }

  printf("%d\n",t);		// print the sum
}

It creates an array of 256 values, populates it with numbers generated by the rand function, then sums them and prints the result. When the compiler is set to -O3, optimization level 3, it will automatically vectorize this process. Vectorization can be turned on at -O2 but requires additional flags. Because this array has exactly 256 entries, the compiler knows that it can safely stride across it and stop at the end without overshooting or falling short. If it didn’t line up evenly, the compiler would not be able to vectorize it the same way; the loop would have to be broken up into two or more loops to allow vectorization. Another factor that makes the vectorization possible is that the rand call has been separated out of the loop where the values are summed. rand is a sequential function: the next value it produces is based on the last one. This means that the generation of the values cannot be vectorized, because each one depends on the one before it, so rand must be hoisted out of the summing loop and given its own loop.
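To make the loop-splitting point concrete, here is a hand-written sketch in plain C of the shape the compiler effectively generates when the trip count is not a multiple of the vector width (illustrative only, not actual gcc output):

```c
#include <stddef.h>

/* Sum an array of arbitrary length n the way the vectorizer would:
   a main loop striding 4 elements at a time (the part that maps onto
   SIMD instructions) plus a scalar remainder loop for the leftovers. */
int sum_split(const int *a, size_t n) {
    int t = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)       /* vectorizable main loop */
        t += a[i] + a[i+1] + a[i+2] + a[i+3];
    for (; i < n; i++)               /* scalar remainder ("epilogue") */
        t += a[i];
    return t;
}
```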

Another factor required for vectorization to work effectively is the alignment of the memory pointers. The array must start at the beginning of a word and end at the end of a word, using all the space in between. If an array is not aligned with the memory being read, the compiler cannot guarantee that, while striding across memory, it will be reading the right data. The words that make up the data must not be shared with any other data: if a particular word is only partially part of the value and the rest of it belongs to something else entirely, the value read will be incorrect. If the data is not properly aligned to the words, or overlaps other data, the compiler will not vectorize the loop. The last thing that will stop the compiler from vectorizing code is a variable whose origin it cannot predict (e.g. a function call inside a for loop). This blocks vectorization because the compiler doesn’t know whether the function is sequential or changes based on its own results.
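With gcc you can also take alignment into your own hands. The aligned attribute and __builtin_assume_aligned below are real gcc features; the array and function names are just for illustration. Note that if the alignment promise is false, the compiler may generate code that reads the wrong data:

```c
/* Force the array onto a 16-byte boundary so vector loads line up. */
int a[256] __attribute__((aligned(16)));

int sum_aligned(void) {
    /* Promise gcc that this pointer is 16-byte aligned, so it can
       emit aligned vector loads without a runtime check. */
    int *p = __builtin_assume_aligned(a, 16);
    int t = 0;
    for (int i = 0; i < 256; i++)
        t += p[i];
    return t;
}
```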

The Disassembled Example

To show what the vectorization looks like in assembly I have isolated the two loops from the above program:

//Loop 1
400528:       97ffffe2        bl      4004b0 
40052c:       b8004660        str     w0, [x19],#4
400530:       eb14027f        cmp     x19, x20
400534:       54ffffa1        b.ne    400528   //jump to 400528

//loop 2
40053c:       3cc106a1        ldr     q1, [x21],#16
400540:       eb15029f        cmp     x20, x21
400544:       4ea18400        add     v0.4s, v0.4s, v1.4s
400548:       54ffffa1        b.ne    40053c  //jump to 40053c

You can see in the first loop, where the array is populated, that it is a simple loop using the register names that might be expected. In the second loop there are a few noticeable differences. q1 is a quadword register: it is loaded from the address in x21, and x21 is incremented by 16 bytes (one quadword). In the third line of the second loop, the v prefix means those are vectors; the instruction sums four lanes of numbers at a time from the ldr. Once the loop is done, the results of summing each lane need to be merged. That is done with addv, which just sums the elements of the vector.
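A scalar model of what the second loop computes may make this clearer: add v0.4s, v0.4s, v1.4s keeps four independent running sums, one per lane, and addv is the final horizontal merge. In plain C (my illustration, not compiler output):

```c
/* Scalar model of the vectorized summing loop: four partial sums,
   one per SIMD lane, merged at the end (the addv step). */
int sum_lanes(const int *a, int n) {  /* n must be a multiple of 4 */
    int lane[4] = {0, 0, 0, 0};
    for (int i = 0; i < n; i += 4) {
        lane[0] += a[i];
        lane[1] += a[i+1];
        lane[2] += a[i+2];
        lane[3] += a[i+3];
    }
    /* addv: horizontal add of the vector elements */
    return lane[0] + lane[1] + lane[2] + lane[3];
}
```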

Vectors can be much more efficient, but you do have to make sure you set up your code correctly so that the compiler knows vectorization is possible.

by maxwelllefevre at April 23, 2015 09:08 AM

Danylo Medinski

Conclusion to SPO600

Since it is pretty much the end of the semester, I think it is time to reflect on my experiences in SPO600. The course was an interesting one, especially its curriculum. There were no tests or exams, which I am all for, but it was strange considering all my other classes have some kind of test or exam.

The project was also interesting, as there were no specific instructions provided. The instructions only stated the final objective; it was up to the students to decide how to work on the project. I actually did like this method, as we had more creativity and options since there were no restrictions on what we could do.

Throughout the semester I managed to learn a variety of topics. I learned how to properly test code by utilizing benchmarks such as high-resolution counters/timers, and learned about profiling. Along with performance testing, I can also apply some of the assembly knowledge that I gained during SPO600 to better optimize the code that I write.

I also managed to expand my Linux knowledge and skills during my time in SPO600. Since I was constantly working with Linux, I was usually researching new topics and trying new commands/features to accomplish a variety of tasks.

Contributing to an open-source project turned out to be a very interesting task. I had thought about contributing to an open-source package in the future, but I did not know where to begin. While working on the project, I learned how to use mailing lists, communicate with communities, work on existing code, and create and use pull requests. Even though I was not able to get a pull request accepted before the end of the semester, the learning experience is what mattered. If I want to contribute to another project, I now know where to start and what tasks to take on.

I would highly recommend SPO600, as it is a course that will teach you a vast variety of topics. Combined with a great professor who is clear, understandable, and possibly knows literally every topic related to computing, you will enjoy this class and learn a lot from it if you dedicate the time to it.

by ddmedinski at April 23, 2015 04:57 AM

Maxwell LeFevre

SPO600 Project Wrap-Up

This will be my last post for SPO600. I am using it to review the status of my final project. Hopefully I will be able to continue it at a later date, but right now that doesn’t seem likely with the Aarch64 machine being repurposed at the end of the week. I enjoyed working on this project even though it was challenging and, at times, frustrating. While I feel I learned more from the labs we did, and enjoyed them more, it was a good experience to try to put some of the concepts learned in the labs into practice on a real-world piece of software. You may have noticed a sudden glut of postings today. This is because I had a number of posts partially written and saved as drafts, and I have been finishing them off and publishing them.

Progress and Current Status

After having done a significant amount of profiling I found it very challenging to find an area to focus on in the code. I have managed to make a slight improvement with the addition of the -O3 flag but I don’t feel that this is enough so I am continuing to look for other things I can change.

Things Left Undone…

Currently I have not tried to create a patch or send it upstream because I had hoped for a more significant change. As of now this change has yet to make itself apparent.


The biggest challenge I found while working on the project was isolating a specific area of code to work with. I found it hard to isolate an obvious area where HTTPD ran more efficiently on an x86_64 machine than it did on the Aarch64 machine. The second challenge for me was the timeline and my ability to allocate a sufficient amount of time to this project. Due to unforeseen circumstances, much of the time that should have been spent on this was redistributed across my other responsibilities. This was compounded by the fact that our Aarch64 machine had noticeably more downtime than its x86_64 counterpart. Unfortunately, this downtime seemed to fall on the times I had set aside to work with it. This could have been avoided if I had worked on it more frequently but for shorter periods.

Missed Opportunities

Now, with a better understanding of what is involved in undertaking a project like this, I would have done a few things differently. If I decide to undertake a project like this in the future I will reach out to the community sooner to ask their advice instead of relying almost exclusively on my own research. I would also work on it more consistently over a longer period of time to limit the effect of system downtime. This would also afford me more opportunities to get feedback and assistance with the project.

by maxwelllefevre at April 23, 2015 03:41 AM

Searching for Alternative Answers

I have been having trouble generating useful profiling data so I decided to try another method of finding something to optimize. I started searching HTTPD for inline assembly in hopes of finding sections that had been optimized for x86_64 and not Aarch64. Using the command grep -rnw ./ -e ".*asm.*" I have determined that there is no inline assembler in the source for HTTPD. I had thought that if I could find some x86 assembly I would be able to write matching Aarch64 assembly and hopefully see an improvement but that avenue is closed to me now.

Continuing to search the source code for possible optimizations I decided to see where x86 and Aarch were mentioned. Using the same command from above but slightly modified, grep -rnw ./ -e ".*x86.*" and grep -rnw ./ -e ".*aarch.*", I found that x86 is mentioned extensively in .dsp files for checking compatibility with Windows builds and then again briefly in the build config files but not in very many other places. Aarch is mentioned in even fewer places, limited to just 3 files, config.status, config.guess, and config.sub. These files are all relating to configuring the build file. Config.status is the information about what the configuration for the current build is. Config.guess is the file that tries to guess the machine type and configuration when autoconf is run. Config.sub is generated by autoconf and converts aliases into full paths. None of these files are part of the code base itself, they all relate to configuring the build options for the system.

Compiler Optimization

With this distinct lack of any obviously platform-specific code, I decided to go through the configuration files to see if there were any compiler optimization flags set. I was unable to find any, so I started benchmarking my current installation of HTTPD with the intention of adding the -O3 flag to gcc and rebuilding HTTPD to see how that would affect its efficiency. Before adding the -O3 flag, downloading the sample webpage 1000 times locally took on average about 2m52.010s real time. After rebuilding HTTPD with the -O3 flag, the same test yields an average real time of 2m50.120s.

No Optimization Flag

Test   Real Time   User Time   Sys Time
1      2m51.525s   0m25.390s   1m57.300s
2      2m52.030s   0m25.200s   1m57.820s
3      2m52.322s   0m25.020s   1m58.320s
4      2m51.414s   0m24.830s   1m58.180s
5      2m52.941s   0m24.780s   1m58.240s
6      2m51.828s   0m24.760s   1m58.350s
Avg.   2m52.010s   0m24.997s   1m58.035s

With "-O3" Flag

Test   Real Time   User Time   Sys Time
1      2m50.579s   0m24.720s   1m58.950s
2      2m49.305s   0m25.370s   1m57.200s
3      2m50.643s   0m24.740s   1m58.610s
4      2m49.174s   0m24.320s   1m58.050s
5      2m50.518s   0m24.800s   1m58.590s
6      2m50.501s   0m24.750s   1m58.310s
Avg.   2m50.120s   0m24.783s   1m58.285s


With the -O3 flag it runs about 2 seconds quicker, but for a task that runs for just under 3 minutes, 2 seconds (a gain of roughly 1%) is very small, and I would have to do a lot more testing to make sure these results are not just a fluke. I would like to find a more concrete improvement to make, so I am going to continue with profiling and try to find some other improvement.

by maxwelllefevre at April 23, 2015 03:40 AM

Danylo Medinski

Quick look at two kinds of assembly languages

A thing I have wanted to do for some time now was to learn assembly. Recently I had the opportunity to do so through a lab that is part of the SPO600 curriculum.

Assembly is one of the hardest types of programming languages to learn. Being a low-level language, its syntax is only one level above machine language. When a high-level language such as C++ is compiled, the compiler translates it into assembly. There is no single standard for assembly languages (nothing like the way Java resembles C++), because each one is built for a specific architecture. Assembly is very memory oriented, which requires the programmer to carefully manage registers and addresses. The syntax, barely resembling natural language, consists of short lines that generally call, add, push, etc. with registers.

To better understand assembly languages, I shall analyse assembly implementations on Aarch64 and x86_64.

First I will write two C programs that perform virtually the same task. The first program is a printf statement that prints out “Hello World!”. The second program also prints “Hello World!” but passes an additional, unused argument of 13 to printf. We will compile and run both of these programs on an x86_64 platform using GCC as our compiler.



#include <stdio.h>

int main(){
  printf("Hello World!\n");
  return 0;
}


Hello World!



#include <stdio.h>

int main(){
  printf("Hello World!\n",13);  /* the extra argument 13 is unused */
  return 0;
}


Hello World!

On x86_64 systems, the assembly equivalent of this code, in GNU Assembler (GAS) syntax, would be:


.text
.globl	_start
_start:
	movq	$len,%rdx			/* message length */
	movq 	$msg,%rsi			/* message location */
	movq	$1,%rdi				/* file descriptor stdout */
	movq	$1,%rax				/* syscall sys_write */
	syscall

	movq	$0,%rdi				/* exit status */
	movq	$60,%rax			/* syscall sys_exit */
	syscall

.section .rodata

msg:	.ascii      "Hello, world!\n"
	len = . - msg

The Netwide Assembler (NASM) is also available for systems using x86-based architectures. Here is the NASM equivalent of the C programs.


section	.text
global	_start
_start:
	mov	rdx,len			; message length
	mov	rcx,msg			; message location
	mov	rbx,1			; file descriptor stdout
	mov	rax,4			; syscall sys_write
	int	0x80

	mov	rbx,0			; exit status
	mov	rax,1			; syscall sys_exit
	int	0x80

section	.rodata

msg	db	'Hello, world!',0xa
len	equ	$ - msg

There are some differences when comparing the two assembly dialects; most noticeable is how the instructions vary, such as mov versus movq. NASM, however, is not available on ARM-based architectures; Aarch64 uses GAS assembly. Some instructions also vary in GAS between ARM and x86 systems, such as how a syscall is invoked on Aarch64 with ‘svc 0’ while on x86 it is just ‘syscall’.


Even though I have not completed the assembly lab, I have already learned some things about assembly from it. Knowing the basics, I hope to soon be able to write my own small program in assembly. I will try to document more of my experience with assembly, and the development of that program, on this blog.


by ddmedinski at April 23, 2015 03:27 AM

James Boyer

Project Wrap up

Concluding the final project for spo600 at Seneca College


My last post gave an outline of what I was going to try to accomplish, but unfortunately I did not get very far. There are many things in this area that I believe would require a greater knowledge of the sljit compiler as a whole. Here is what I have done up to today. It's not much, but it's something.

The defines

For defining the number of float registers for ARM and x86 it should be pretty simple and similar to what he has currently. It could be implemented by someone who, unlike me, has a better understanding of saved float registers. Here is what it should look like:
#if (defined SLJIT_CONFIG_ARM_64 && SLJIT_CONFIG_ARM_64)
#define SLJIT_NUMBER_OF_SAVED_FLOAT_REGISTERS XX
#elif (defined SLJIT_CONFIG_X86_64 && SLJIT_CONFIG_X86_64)
#if (defined _WIN64)
#define SLJIT_NUMBER_OF_SAVED_FLOAT_REGISTERS XX
#else
#define SLJIT_NUMBER_OF_SAVED_FLOAT_REGISTERS XX
#endif
#endif
(XX should be replaced with correct number)
The ARM documentation (here) is where I found the information on the number of floating point registers. It states there are 32 registers for single-precision floating point, which can also be used as 16 registers for double-precision floating point, so I am not sure whether the number should be 16 or 32. For x86 I was confused about the number of floating point registers used. For example, this site has a picture showing 8 registers for floating point, but the title of that page is "legacy registers", which may mean they are no longer used. The Intel documentation, on page 183 (8.1.2), states there are 8 in the x87 FPU stack; this made me wonder whether these are the legacy registers the other site mentions, or whether they add to the legacy registers, making 16 in total. Also on x86 there are vector floating point registers, and from this page I am not sure how many there are (possibly 16) or whether they even come into play in sljit.

Function entry and exit points

Here is the current code in sljit for the entry:
SLJIT_API_FUNC_ATTRIBUTE sljit_si sljit_emit_enter(struct sljit_compiler *compiler,
	sljit_si options, sljit_si args, sljit_si scratches, sljit_si saveds,
	sljit_si fscratches, sljit_si fsaveds, sljit_si local_size)
{
	sljit_si i, tmp, offs, prev, saved_regs_size;

	CHECK(check_sljit_emit_enter(compiler, options, args, scratches, saveds, fscratches, fsaveds, local_size));
	set_emit_enter(compiler, options, args, scratches, saveds, fscratches, fsaveds, local_size);

	saved_regs_size = GET_SAVED_REGISTERS_SIZE(scratches, saveds, 0);
	local_size += saved_regs_size + SLJIT_LOCALS_OFFSET;
	local_size = (local_size + 15) & ~0xf;
	compiler->local_size = local_size;

	if (local_size <= (63 * sizeof(sljit_sw))) {
		FAIL_IF(push_inst(compiler, STP_PRE | 29 | RT2(TMP_LR)
			| RN(TMP_SP) | ((-(local_size >> 3) & 0x7f) << 15)));
		FAIL_IF(push_inst(compiler, ADDI | RD(SLJIT_SP) | RN(TMP_SP) | (0 << 10)));
		offs = (local_size - saved_regs_size) << (15 - 3);
	} else {
		compiler->local_size += 2 * sizeof(sljit_sw);
		local_size -= saved_regs_size;
		saved_regs_size += 2 * sizeof(sljit_sw);
		FAIL_IF(push_inst(compiler, STP_PRE | 29 | RT2(TMP_LR)
			| RN(TMP_SP) | ((-(saved_regs_size >> 3) & 0x7f) << 15)));
		offs = 2 << 15;
	}

	tmp = saveds < SLJIT_NUMBER_OF_SAVED_REGISTERS ? (SLJIT_S0 + 1 - saveds) : SLJIT_FIRST_SAVED_REG;

	prev = -1;
	for (i = SLJIT_S0; i >= tmp; i--) {
		if (prev == -1) {
			prev = i;
			continue;
		}
		FAIL_IF(push_inst(compiler, STP | RT(prev) | RT2(i) | RN(TMP_SP) | offs));
		offs += 2 << 15;
		prev = -1;
	}

	for (i = scratches; i >= SLJIT_FIRST_SAVED_REG; i--) {
		if (prev == -1) {
			prev = i;
			continue;
		}
		FAIL_IF(push_inst(compiler, STP | RT(prev) | RT2(i) | RN(TMP_SP) | offs));
		offs += 2 << 15;
		prev = -1;
	}

	if (prev != -1)
		FAIL_IF(push_inst(compiler, STRI | RT(prev) | RN(TMP_SP) | (offs >> 5)));

	if (compiler->local_size > (63 * sizeof(sljit_sw))) {
		/* The local_size is already adjusted by the saved registers. */
		if (local_size > 0xfff) {
			FAIL_IF(push_inst(compiler, SUBI | RD(TMP_SP) | RN(TMP_SP) | ((local_size >> 12) << 10) | (1 << 22)));
			local_size &= 0xfff;
		}
		if (local_size)
			FAIL_IF(push_inst(compiler, SUBI | RD(TMP_SP) | RN(TMP_SP) | (local_size << 10)));
		FAIL_IF(push_inst(compiler, ADDI | RD(SLJIT_SP) | RN(TMP_SP) | (0 << 10)));
	}

	if (args >= 1)
		FAIL_IF(push_inst(compiler, ORR | RD(SLJIT_S0) | RN(TMP_ZERO) | RM(SLJIT_R0)));
	if (args >= 2)
		FAIL_IF(push_inst(compiler, ORR | RD(SLJIT_S1) | RN(TMP_ZERO) | RM(SLJIT_R1)));
	if (args >= 3)
		FAIL_IF(push_inst(compiler, ORR | RD(SLJIT_S2) | RN(TMP_ZERO) | RM(SLJIT_R2)));

	return SLJIT_SUCCESS;
}

I will be honest: I have no idea how this code is doing what it is doing (saving registers). I could not figure out how to extend it to save floating point registers, but I can theorize that it has something to do with the fscratches and fsaveds parameters, which are currently unused within the function. Possibly it means creating code similar to what is there; for example this line
saved_regs_size = GET_SAVED_REGISTERS_SIZE(scratches, saveds, 0);
Could be replicated to function with floating point registers like this:
fsaved_regs_size = GET_SAVED_FLOAT_REGISTERS_SIZE(fscratches, fsaveds, 0);
But that is all in theory; I am afraid I do not know, or even know how to find out, how to do these things in practice.

Register mapping

I did not really get to this part. It seems simple, but I have trouble understanding how he chooses the map for the registers. For the integer registers of ARM64, this is his register map:
static SLJIT_CONST sljit_ub reg_map[SLJIT_NUMBER_OF_REGISTERS + 8] = {
	31, 0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 8, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 29, 9, 10, 11, 30, 31
};
I just don't know how he came up with these numbers and also why 8 is added to the size of the register map.


In this post there was definitely a phrase that came up a lot: "I don't know". After finally finding something to work on, I think it was simply out of my range to complete, even though I am sure it would be an easy task for someone experienced in this kind of environment. I should have kept my options open and perhaps chosen an easier project to work on. I am disappointed in myself for not producing anything worthy of a patch, but I am out of time to contribute anything more. In the summer I will try to complete it for fun and will post results if I get anywhere. Regardless of my results, I have learned a lot through this project and through this whole course, and I think it is always worth challenging yourself to expand your knowledge of computers, since there is so much to learn.
Thanks for reading, have a good summer.

by James Boyer at April 23, 2015 03:19 AM

Hong Zhan Huang

SPO600 Project Perl 2.X / 3 – Not so Hot Ops

A rather delayed continuation from last time, when I assumed I had made some good progress and had a route to follow. From there I repeated the hot ops shuffling tests on Australia, and it led to rather disappointing results. So disappointing that it had me return to Red and repeat the testing over a much larger number of iterations. All of that led me to discover that the results from my previous post were nothing more than an aberration.

It turns out that shuffling the code within pp_hot.c based on each function’s supposed ranking does virtually nothing for performance. I suppose I just got lucky (or rather unlucky) with the outputs, which led me to think there was a very minimal performance increase for my time-based variation of the file. This was certainly my mistake, and it has made me realize the need for more extensive testing in these environments.

The above results came from one of many seven-run sets of testing, and as can be seen there isn’t any tangible difference between the different variations of the pp_hot.c file. Having reached this point, it seems quite clear that the only ways to seek performance increases through hot ops are to find a widely used operation that has yet to be promoted into a hot op, or to optimize an existing hot op (and I wasn’t able to find an opportunity to do the latter).

In attempts to find operations to promote, I looked into some open source Perl projects on Perlmaven to build and profile. I had issues building some of the larger projects, and with the semester drawing to an end I lacked the time and focus necessary to work through all of them, so I ended up profiling some smaller applications.

  1. Ack – A grep-like application that is optimized for and intended to be used by programmers.
    The results of profiling this application are contained here.
  2. Perltidy – An application that can be used to clean up perl files by applying the desired amount of indents and other formatting to make things easier to read.
    The results of profiling this application are contained here.
  3. st – An application that performs statistical calculations on some file with the command line.
    The results of profiling this application are contained here.

Having gone through the profiling results of these three applications, the main standouts are yylex and yyparse, which have a prominent presence in all of the results, taking up a hefty percentage of each application’s run time. Although the data fed through these applications were trivial examples, if the size of the data were increased, the proportionate time consumption of those functions would remain the same.

The three applications above are biased towards parsing text, so it isn’t too surprising that functions like yylex and yyparse would take some of the top spots in the profiling data. In that sense the data perhaps isn’t the most representative of hot ops in a general sense, but parsing isn’t an uncommon action, so promoting these functions could still be reasonable. The code for yylex is located in toke.c and yyparse in perly.c.

Unfortunately this is about as far as I could pursue this route in the time that remained. Without a greater understanding of why yylex and yyparse sit in their respective current locations, as well as a much larger pool of profiling results, I am unsure whether truly making them into hot ops is a profitable notion.


To call this a conclusion is perhaps worth a chuckle, but I’ll take this portion to wrap up my thoughts on things overall:

  • I’m pretty disappointed that I wasn’t able to create a patch for this project, and that my foolish mistakes in the second phase unknowingly led me down a dead-end route when it could have been easily avoided. Had I avoided it, I wouldn’t have been so stuck. I really should have communicated with the community much earlier as well.
  • The choice of approaching hot ops was, in hindsight, perhaps not the best given the need for a rather large sample of results to even begin finding out where one should proceed. It certainly would have helped to have had the practice in building and profiling that I have now when I began. It may have been better to go with one of my other choices from Phase 1 for a more concentrated approach.
  • While I’ve been unsuccessful here, I still gained some good experience in working in an open source environment as well as doing builds and profiling. If I were to do this again from scratch I think I would have a good leg up over the me of 3 months ago.
  • I’ve learned a lot about how computers work and different ways to think about programming that I hadn’t been exposed to much prior in the duration of this course. It’s been a pretty good ride this semester and has ignited some new interests in me.
  • Even past the end of SPO600 I think I’ll continue to investigate this Perl hot ops debacle. Now that I’ve been introduced to it and have suffered a miserable defeat, I can’t simply let this stay as is. Beyond that, I want to start looking into other open source projects to get involved in.

All in all, and for better or worse, this is the end.

A complete non-ending. End log of an SPO600 player~

by hzhuang3 at April 23, 2015 01:52 AM

Maxwell LeFevre

Profiling Apache HTTPD

Picking a Tool

When I initially installed Apache’s HTTPD server on the x86_64 (Australia) and AArch64 (Red) machines available to us, I intended to use gprof to profile the installation and try to find problem spots to potentially improve. After working to get the installations to produce gmon.out files, I found that gprof is not capable of sampling fast enough to provide any meaningful data other than the number of function calls. I did some research and decided to use perf instead, because it doesn’t require any special preparation while compiling and is capable of sampling very quickly. I had never used perf before, so I spent a little bit of time learning how to use it and what it is capable of.


HTTPD is controlled with an application called apachectl, which is used to start and stop it. When HTTPD is started with sudo ./apachectl start, it actually launches 4 separate processes: one root process and 3 daemon processes. To get profiling information from HTTPD with perf I used perf record -F 100 -p PID, where PID is the process ID of the HTTPD process. The -F 100 tells perf to sample 100 times per second (100Hz); gprof samples at a fixed rate of 99Hz, which is too slow for HTTPD. I used ps -aux to list the currently running processes and took the PIDs from this list for perf. I needed to profile all four processes, so I made 5 separate connections to Red or Australia, depending on which one I was profiling, and used the fifth connection to run a bash script to load HTTPD. The script I used is a simple loop that repeatedly uses wget to download a webpage and all its content.

#! /bin/bash

mkdir temp
cd temp

COUNTER=0
while [ $COUNTER -lt 100 ]; do
  wget -nd -r -l1 -p -np localhost
  COUNTER=$((COUNTER+1))
done

cd ../
rm -r temp


To examine the results of the profiling I used a few other perf tools. The command sudo perf report --stdio provides results that look like the following.

# To display the header info, please use --header/--header-only options.
# Samples: 169  of event 'cycles'
# Event count (approx.): 33812829
# Overhead  Command  Shared Object          Symbol                           
# ........  .......  .....................  .................................
     9.55%  httpd           [.] _mcount@@GLIBC_2.18
     4.89%  httpd      [.] apr_pool_pre_cleanup_register
     4.81%  httpd    httpd                  [.] default_handler
     4.72%  httpd      [.] apr_fnmatch  
     4.54%  httpd           [.] alias_matches
     4.54%  httpd    httpd                  [.] ap_get_mime_headers_core 
     4.51%  httpd      [.] apr_poll 
     4.37%  httpd    [kernel.vmlinux]       [k] getname_flags  
     4.35%  httpd    [kernel.vmlinux]       [k] tcp_v4_rcv  
     4.29%  httpd           [.] memchr   
     4.18%  httpd    httpd                  [.] ap_queue_pop_something  
     4.10%  httpd    httpd                  [.] read_request_line   
     4.02%  httpd    [kernel.vmlinux]       [k] __audit_syscall_exit 
     3.90%  httpd    httpd                  [.] apr_proc_detach@plt 
     3.90%  httpd      [.] apr_array_make@plt 
     3.89%  httpd  [.] apr_brigade_split_line 
     3.13%  httpd    [kernel.vmlinux]       [k] can_vma_merge_before 
     2.99%  httpd      [.] apr_allocator_free 
     2.91%  httpd    httpd                  [.] ap_set_keepalive
     2.80%  httpd    httpd                  [.] ap_explode_recent_gmt 
     2.55%  httpd           [.] tolower 
     2.54%  httpd    [kernel.vmlinux]       [k] die
     1.68%  httpd    [kernel.vmlinux]       [k] __audit_syscall_entry 
     1.45%  httpd    [kernel.vmlinux]       [k] file_ra_state_init 
     1.45%  httpd  [.] apr_bucket_file_create 
     1.41%  httpd    httpd                  [.] listener_thread 
     1.37%  httpd      [.] 0x0000000000022f94  
     1.08%  httpd    [kernel.vmlinux]       [k] _raw_spin_unlock_irqrestore 
     0.07%  httpd    [kernel.vmlinux]       [k] finish_task_switch 

Using sudo perf report creates an interactive report where I can isolate sections that are part of httpd and not a shared library.

   4.81%  httpd  [.] default_handler 
   4.54%  httpd  [.] ap_get_mime_headers_core   
   4.18%  httpd  [.] ap_queue_pop_something   
   4.10%  httpd  [.] read_request_line  
   3.90%  httpd  [.] apr_proc_detach@plt  
   2.91%  httpd  [.] ap_set_keepalive 
   2.80%  httpd  [.] ap_explode_recent_gmt 
   1.41%  httpd  [.] listener_thread    

This will also allow you to navigate to the assembly of any function you choose. Perf seems to be a really powerful tool, but unfortunately any time I try to generate data on Red, httpd crashes with the following error and I have to kill its daemon processes with sudo kill -9 PID.

Message from syslogd@x2 at Apr 22 16:27:50 ...
kernel:[195999.516934] Internal error: Oops - bad mode: 0 [#4] SMP

Message from syslogd@x2 at Apr 22 16:27:50 ...
kernel:[195999.691511] Process httpd (pid: 29072, stack limit = 0xfffffe0013b4c058)

Conclusion (For Now)

I am still having trouble getting adequate profiling information from Red to compare to Australia. I will try to fix this and will hopefully have more data for my next post.

by maxwelllefevre at April 23, 2015 01:15 AM

Justin Grice

SPO600 – Finalizing the Project & Resolution

I left off my last post with a memory issue related to my changes to the zend_operators.h file. I finally discovered the issue was caused by my incorrect use of the mov instruction. In the original x86_64 code, a byte-sized value was being moved into register 3 as part of handling the overflow. My mistake when porting was that I had been trying to move a 32-bit value into the register as opposed to an 8-bit value. When I changed the code to do this correctly, the memory issue disappeared.

Unfortunately, at this point I started to realize that my code was not passing PHP’s build test list. Further inspection of my code led me to realize I had misunderstood the differences between the two architectures. I had mistakenly thought that the operations on the x86_64 architecture could be replicated 1:1 with simple syntax changes on AArch64. Once I realized that x86_64 is a register-memory architecture while AArch64 is a load-store architecture, I proceeded to rewrite a large chunk of my existing code for the fast_increment and fast_decrement functions.

"ldr x3, [%0, 0]\n\t"
"add x3, x3, #1\n\t"
"str x3, [%0, 0]\n\t"
"bvc 0f\n\t"
"mov x4, #0x43e00000\n\t"
"str x4, [%0, 0x4]\n\t"
"mov w5, %1\n\t"
"strb w5, [%0, %c2]\n\t"
: "r"(&op1->value),
: "cc",

The major change between this code and my previous code is that the values are now loaded into registers, have the operations performed on them, and are then stored back into memory. This implementation produced the correct outputs. After these changes were done, I built and ran PHP’s tests to ensure it was working as expected. There were no failed tests that were not also present in the original clean version of PHP.

With the working build, I then proceeded to test performance to see if my code resulted in any changes. I reused bench.php from my initial tests to see if performance had improved. Running the clean build over 20 iterations resulted in an average time of 3.086s to complete. I then ran my build for the same number of iterations, with an average time of 5.049s.

This fairly large reduction in performance surprised me at first, but led me to believe that the compiler was performing optimizations on the clean build that it wasn’t performing on mine, and that the current file could not be improved in this area through my approach. I have contacted the community through the PHP internals mailing list for additional suggestions on how this file might be improved, but have yet to hear back.

I have included my changes on my GitHub.

by jgrice18 at April 23, 2015 01:14 AM

Danylo Medinski


What is Vectorization

Vectorization, or automatic vectorization, is the process of unrolling a loop so that multiple elements are processed simultaneously, effectively converting a scalar operation (one operation at a time) into a vector operation. Vectorized code performs parallel operations by utilizing Single Instruction, Multiple Data (SIMD) instructions.

Vectorization, as powerful as it is, does carry its limitations.

  • The loop body to be vectorized must be a single, simple code block; if many lines are used or functions are called, the compiler may misinterpret the code or require too many vector registers.
  • If the would-be vectorized loop is part of a loop nest (multidimensional array processing), only the innermost loop may be vectorized unless the compiler is able to fully unroll the outer loops. To enable vectorization of a loop nest, higher compiler optimization levels may be required (-O3).
  • Since a loop can only be vectorized when its iteration count can be determined up front, while loops are generally excluded from vectorization.

Without these specific conditions, vectorization cannot be performed.

Examples of Vectorization

Here is an example of code that can be vectorized.

The code is fairly simple, containing two loops: the first assigns a random number to each element of the int array ‘a’, and the second totals the elements of ‘a’ into the int variable t.

#include <stdio.h>
#include <stdlib.h>

void main() {

typedef int aint __attribute__ ((__aligned__(4)));
aint a[256];
int i, t;

t = 0;
srand(1);

for (i=0; i<256; i++) {
    a[i] = rand();
}

for (i=0; i<256; i++) {
    t += a[i];
}

printf("%d\n", t);
}

Since the first loop calls the rand() function for each index, it cannot be vectorized. The second loop can be vectorized, since the calculation within it is fairly simple, using the += operator to sum up the elements.

To learn more about how the code is affected by vectorization, I used objdump to view the object code with and without vectorization enabled.

I first compiled the code on an AArch64 system with gcc, without vectorization enabled. Here is the result of the objdump.

 0000000000400658 :

void main() {
 400658: d11043ff sub sp, sp, #0x410
 40065c: a9bf7bfd stp x29, x30, [sp,#-16]!
 400660: 910003fd mov x29, sp

 typedef int aint __attribute__ ((__aligned__(4)));
 aint a[256];
 int i, t;

 400664: 52800020 mov w0, #0x1 // #1
 400668: 97ffff9e bl 4004e0 <srand@plt>

 for (i=0; i<256; i++) {
 40066c: b9041fbf str wzr, [x29,#1052]
 400670: 14000009 b 400694 <main+0x3c>
 400674: 97ffff8f bl 4004b0 <rand@plt>
 400678: 2a0003e2 mov w2, w0
 40067c: 910063a0 add x0, x29, #0x18
 400680: b9841fa1 ldrsw x1, [x29,#1052]
 400684: b8217802 str w2, [x0,x1,lsl #2]
 aint a[256];
 int i, t;


 for (i=0; i<256; i++) {
 400688: b9441fa0 ldr w0, [x29,#1052]
 40068c: 11000400 add w0, w0, #0x1
 400690: b9041fa0 str w0, [x29,#1052]
 400694: b9441fa0 ldr w0, [x29,#1052]
 400698: 7103fc1f cmp w0, #0xff
 40069c: 54fffecd b.le 400674 <main+0x1c>
 for (i=0; i<256; i++) {
 4006a0: b9041fbf str wzr, [x29,#1052]
 4006a4: 1400000a b 4006cc <main+0x74>
 4006a8: 910063a0 add x0, x29, #0x18
 4006ac: b9841fa1 ldrsw x1, [x29,#1052]
 4006b0: b8617800 ldr w0, [x0,x1,lsl #2]
 4006b4: b9441ba1 ldr w1, [x29,#1048]
 4006b8: 0b000020 add w0, w1, w0
 4006bc: b9041ba0 str w0, [x29,#1048]

 for (i=0; i<256; i++) {
 for (i=0; i<256; i++) {
 4006c0: b9441fa0 ldr w0, [x29,#1052]
 4006c4: 11000400 add w0, w0, #0x1
 4006c8: b9041fa0 str w0, [x29,#1052]
 4006cc: b9441fa0 ldr w0, [x29,#1052]
 4006d0: 7103fc1f cmp w0, #0xff
 4006d4: 54fffead b.le 4006a8 <main+0x50>
 4006d8: 90000000 adrp x0, 400000 
 4006dc: 911e4000 add x0, x0, #0x790
 4006e0: b9441ba1 ldr w1, [x29,#1048]
 4006e4: 97ffff83 bl 4004f0 <printf@plt>
 4006e8: a8c17bfd ldp x29, x30, [sp],#16
 4006ec: 911043ff add sp, sp, #0x410
 4006f0: d65f03c0 ret

Compiling the code again, this time with the -O3 flag to enable vectorization, produced the following object code.

0000000000400500 :

void main() {
  400500:	d11003ff 	sub	sp, sp, #0x400

  typedef int aint __attribute__ ((__aligned__(4)));
  aint a[256];
  int i, t;

  400504:	52800020 	mov	w0, #0x1                   	// #1

void main() {
  400508:	a9bd7bfd 	stp	x29, x30, [sp,#-48]!
  40050c:	910003fd 	mov	x29, sp
  400510:	f90013f5 	str	x21, [sp,#32]
  400514:	9100c3b5 	add	x21, x29, #0x30
  400518:	a90153f3 	stp	x19, x20, [sp,#16]
  40051c:	9110c3b4 	add	x20, x29, #0x430

  typedef int aint __attribute__ ((__aligned__(4)));
  aint a[256];
  int i, t;

  400520:	aa1503f3 	mov	x19, x21
  400524:	97ffffef 	bl	4004e0 <srand@plt>

  for (i=0; i<256; i++) {
  400528:	97ffffe2 	bl	4004b0 <rand@plt>
  40052c:	b8004660 	str	w0, [x19],#4
  aint a[256];
  int i, t;


  for (i=0; i<256; i++) {
  400530:	eb14027f 	cmp	x19, x20
  400534:	54ffffa1	400528 <main+0x28>
  400538:	4f000400 	movi	v0.4s, #0x0
  for (i=0; i<256; i++) {
  40053c:	3cc106a1 	ldr	q1, [x21],#16
  400540:	eb15029f 	cmp	x20, x21
  400544:	4ea18400 	add	v0.4s, v0.4s, v1.4s
  400548:	54ffffa1	40053c <main+0x3c>
  40054c:	4eb1b800 	addv	s0, v0.4s
  400550:	a94153f3 	ldp	x19, x20, [sp,#16]
  400554:	f94013f5 	ldr	x21, [sp,#32]
  for (i=0; i<256; i++) {
  400558:	90000000 	adrp	x0, 400000 
  40055c:	a8c37bfd 	ldp	x29, x30, [sp],#48
  for (i=0; i<256; i++) {
  400560:	911da000 	add	x0, x0, #0x768
  400564:	0e043c01 	mov	w1, v0.s[0]
  400568:	911003ff 	add	sp, sp, #0x400
  for (i=0; i<256; i++) {
  40056c:	17ffffe1 	b	4004f0 <printf@plt>

Comparing the two objdumps, there are some noticeable differences. In the object code where vectorization is enabled, NEON SIMD instructions are now used in the loops.


Seeing how SIMD is utilized when vectorization is applied, I can better grasp the concepts of SIMD and vectorization. Since vectorization performs much like loop unrolling, I can rely on the compiler’s vectorization instead of unrolling loops by hand; they are essentially the same, and a manually unrolled loop cannot be vectorized, as the compiler performs the unrolling on its own. Although at the moment I have no use for vectorization, I do see its potential and may use it in the future when attempting to speed up or optimize code.


by ddmedinski at April 23, 2015 01:02 AM

April 22, 2015

Danylo Medinski

Quick Project Update: The Finale

Well, the time to wrap up the project has come. The last outstanding task was creating the phpt test scripts for my function. The test scripts are necessary for a new function, as the command make test uses them to check that PHP functions correctly.

After an hour of writing new tests, I managed to produce a test script named ‘array_binary_search_variation1.phpt’. The script uses 5 different arrays (or tests) to check that all features of the function work (strict search, soft/normal search), thus covering variations in the function’s behavior. After testing it via make test, I pushed it to my PHP repo and waited until the Travis CI build succeeded.

Some feedback that I received on both the mailing list and the GitHub pull request conversation recommended migrating my binary search implementation from the procedural API in array.c to an array class such as ArrayObject, since that would let me keep the sorted state of an array within the object. As useful as that kind of implementation can be in certain situations, it is one I will have to pass on at the moment: it could require changing much of my existing code to work with an array class, and I do not have sufficient time to implement and test that change. I will not, however, ignore the idea, as it may be something I can add to my binary search in the future.

Since everything I could have done with this function is now completed, all that remains is to wait until the pull request is reviewed and merged into the PHP master branch. It now seems likely that my pull request will not be accepted before the SPO600 deadline, since it is still being reviewed by the community. I will not, however, abandon this implementation: even after the deadline I will continue to work on it until my pull request is accepted or denied. If my pull request is accepted, I will continue to support my function with bug fixes and new features.


Pull Request –

Mailing List Response –

by ddmedinski at April 22, 2015 11:54 PM

Liam Christopher Martin

SPO600- Project Phase 3

For this post I will be discussing Phase 3 of SPO600’s final project: submitting a patch.

Unfortunately, a lot of different problems arose for me in this phase. Initially I was designing a script around a text-generated list of combinations: a generated list of all combinations of a set of 0’s and 1’s. In my script each combination represented a set of option groups to be given to gcc, with a 0 in a specific location representing a group of options being turned off. Switching these on and off would ideally have an effect on the compile/execution times of the finished program. A few problems arose from this. First, reading each combination from a file would be quicker in Python, but it makes for highly specific code. Changing the size of the groups means regenerating and tweaking the file with text generators each time, which is cumbersome and not accurate enough to narrow down any sort of data. A combination set with 18 groups (each with 10 values and one with 7), for instance, has more than 200,000 combinations, and that is still a pretty broad testing system. I realized this would be terrifying to debug: a regular-expression inconsistency on a single line would mean a lot of lost work, and manually sorting through hundreds of thousands of values to find something amiss would take time, to say nothing of re-appropriating the code for different group sizes. All of this made me want to take a step back and try something else. So I started using Python’s itertools.product module, and by adding a repeat option to the function I was finally able to generate the same list I was using before (along with being able to choose your group size and have it generated for you). However, the time left in the course since making that change has not been enough for me to completely finish the script and grab the proper values.

Secondly, I had a problem with one of the main concepts of my project. I was initially attempting to read the output of a compiled testing suite that comes boxed with PHP. My flawed plan was to use the tests, after running my script, to narrow down a section of code that would be heavily affected by the optimizations I was implementing. However, the gmon.out after the test would only contain output from the final test initiated in the script, and its execution time was much too low to gain anything from it. This left me searching for a section of PHP code (or even a small algorithm to write) whose execution time would be long enough to register in the gmon.out. Admittedly a less-than-difficult task, but with the revelation that I had been going in the absolute wrong direction coming so late, I have been unsuccessful.

Thirdly, working with most of this in PHP seems to create cascading problems. As discussed in previous posts, producing a gmon.out is kind of a toughie. The only way I have been able to create one is to do a fresh configure, add -pg to the lines I specified previously, add a set_time_limit() call to any script that I would like to run, and add it to a few other files before running make. Changing these values and then re-issuing a make has no effect. This brings up a few things. Not being able to enable optimizations inline means writing to the makefile after every configure, and a regular expression for that is difficult and highly situational. Another problem is the question of reconfiguring every time: will the benchmark remain similar? Will this affect further operations?

I think if I were to restart this project knowing what I know now, I would first perfect the script on a meaningless benchmark script, making it easy to change group sizes; that would give a higher-level understanding of the effects of these compiler options, as well as help me understand what is happening if odd edge cases are present. It would also be neat to be able to add your own prefixes or option groups in case you were compiling another program. The next step would be to find a clear area to benchmark and get a few baselines before struggling uphill to affect it dynamically. Trying to tackle all three of these basic problems at once left me in confusion even after seeking clarification on them. Addressing a small problem on each front at a time while asking people with more experience only served to give me bigger questions.

by martinliam at April 22, 2015 06:58 PM

Artem Luzyanin

[SPO600] Final post

What a long and wonderful journey it’s been. I’ve learned so much about how the computer actually works with the code I’m writing. I’ve learned how complex and diverse real-world programs can be. I’ve learned that if you want to submit a patch to those programs, you need to be prepared to face tons of criticism and very diverse opinions from different members of the community.

After a week and a half of waiting for a review from the community members on my latest patch, I got another crushing reply that pretty much said “what that guy said you could do, and said would be useful, is actually useless”. I am going to stick with the opinion of the core developer (for whom I created patchV2), and hope that rank will take precedence when it comes to the decision making about my patch. I will continue monitoring the communication about it and try to push the patch through, even though I expect that won’t happen earlier than July. Finally, it will simply feel good to have such a thing done in my life!

My patch is located on my Git repository:  

My CommitFest post on PostgreSQL:

I tried to find my mail archive, but it’s not in the archive mode yet.

So the result is that I’ve created the patch that was requested by one of the core developers, but that developer didn’t comment on it later, and the other members of the community didn’t like the idea. I will see if I can do something about it, but at the moment, during the release cycle, they are not very responsive.

by lisyonok85 at April 22, 2015 03:02 PM

Liam Christopher Martin

SPO600 Lab 2 – Baseline Builds

The objective of this lab is to create a baseline build from a list of available programs, in a situation that is consistent and repeatable. For this lab I have selected to benchmark Perl.

I decided to download a tarball from Perl’s website using the command


I will construct an identical version on both Red and Australia and record output using different make -j options for comparison.

In order to install Perl the following commands were input

sh Configure -de


make test

DESTDIR=$HOME/bin make install

After adding -pg to the following lines, I was able to produce a gmon.out on both architectures.

CC = cc

LD = cc

Here is some more information about the system that may be useful for replication:

./perl --version

This is perl 5, version 20, subversion 2 (v5.20.2) built for aarch64-linux

Copyright 1987-2015, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the

GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on

this system using “man perl” or “perldoc perl”. If you have access to the

Internet, point your browser at, the Perl Home Page.

[lcmartin2@red perl-5.20.2]$ free

              total        used        free      shared  buff/cache   available
Mem:       16713728      346816     9890496        7936     6476416    16181120
Swap:             0           0           0

by martinliam at April 22, 2015 03:14 AM

April 21, 2015

Jan Ona

SPO600 – PROJECT PHASE 2.9: Patch submission…not yet

Previous Posts:

Phase 1:
Phase 2:
Phase 2.5:

Continuing on to the final phase of the SPO600 project, I’ve managed to send out a patch to MongoDB’s GitHub repo. The patch submission is accompanied by their issue, as shown here.


After what felt like a long time without any responses, I’ve finally received one.

The area that the patch affects seems to be part of a bigger piece of work being done within MongoDB’s codebase. Unfortunately, because of this, the patch that I’ve sent will take much longer to resolve.

by jangabrielona at April 21, 2015 11:19 PM

Danylo Medinski

The struggles for profiling PHP with gprof

Profiling is the process of collecting and analyzing data about where a program spends its execution time and which functions it calls. A programmer can use this information to help fix or optimize code. Since profiling collects information during the execution of a program, it can make very large programs easier to understand than reading the whole source code.

A while back I attempted to profile PHP using gprof. In order for gprof to profile a program, it must read a gmon.out file, which is produced by running a program compiled for profiling. To produce a gmon.out file, the -pg flag must be used when compiling with gcc to enable profile generation. When using a makefile, the -pg flag must be added to 'CFLAGS' and 'LDFLAGS'. After adding these flags to the PHP Makefile, I ran the make command. The time to build the source is quite long, and 10 minutes into the build I was presented with an error that interrupted the build process.

Fatal error: Maximum execution time of 0 seconds exceeded in /home/ddmedinski/php-5.6.6/ext/phar/phar.php on line 1093
Makefile:333: recipe for target 'ext/phar/phar.phar' failed
make: *** [ext/phar/phar.phar] Error 255

Running the make install command, I was presented with the same error. I attempted to build and install PHP with profiling enabled again, but was repeatedly met with the same error. Since I could not produce the gmon.out file without a profiling-enabled build, I was forced to abandon the task.

I recently decided to return to this task with the hope of actually being able to build the source with profiling enabled. I configured and modified the makefile again and ran the make command. After a suspenseful ten minutes of build time, I was yet again presented with the same error. After more frustrating hours spent researching and attempting to fix the problem with no luck, I decided I was not getting anywhere. Based on what I read from other people who tried to profile PHP with gprof, I assume this is some kind of bug. Either way I wanted to try out gprof, so I said screw it and decided to use the gmon.out file that was, for some reason, generated during the build.

I first generated the profiling image with this command within the php-5.6.6 directory,

gprof sapi/cli/php gmon.out -Tpng > gprofTest.png

Since there is a lot going on in this image, so much that it is very difficult to analyze, it is better off to output the profiling results to a text file. Here is the command used to generate one with gprof:

gprof ~/php-5.6.6/sapi/cli/php gmon.out > gprofTest.txt

Since there is too much data in the text file to show in this post, here is a small sample of the profiling result.

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.00      0.00     0.00     5500     0.00     0.00  _zend_mm_alloc_int
  0.00      0.00     0.00     5491     0.00     0.00  _zend_mm_free_int
  0.00      0.00     0.00     5445     0.00     0.00  _efree
  0.00      0.00     0.00     5314     0.00     0.00  _emalloc
  0.00      0.00     0.00     4466     0.00     0.00  lex_scan
  0.00      0.00     0.00     4397     0.00     0.00  _zend_hash_quick_add_or_update
  0.00      0.00     0.00     3080     0.00     0.00  destroy_zend_function
  0.00      0.00     0.00     3080     0.00     0.00  zend_function_dtor
  0.00      0.00     0.00     3069     0.00     0.00  zend_new_interned_string_int
  0.00      0.00     0.00     3044     0.00     0.00  zendlex
  0.00      0.00     0.00     2601     0.00     0.00  zend_str_tolower_copy
  0.00      0.00     0.00     2312     0.00     0.00  zend_str_tolower_dup
  0.00      0.00     0.00     2086     0.00     0.00  zend_hash_quick_find
  0.00      0.00     0.00     1492     0.00     0.00  zend_stack_top
  0.00      0.00     0.00     1413     0.00     0.00  do_inherit_method_check
  0.00      0.00     0.00     1311     0.00     0.00  zend_hash_func
  0.00      0.00     0.00     1253     0.00     0.00  get_next_op
  0.00      0.00     0.00     1137     0.00     0.00  _estrndup
  0.00      0.00     0.00     1123     0.00     0.00  function_add_ref
  0.00      0.00     0.00     1122     0.00     0.00  zend_stack_push
  0.00      0.00     0.00     1104     0.00     0.00  do_inherit_method
  0.00      0.00     0.00     1104     0.00     0.00  zend_stack_del_top
  0.00      0.00     0.00     1047     0.00     0.00  zend_vm_set_opcode_handler
  0.00      0.00     0.00     1000     0.00     0.00  _zend_hash_add_or_update
  0.00      0.00     0.00      880     0.00     0.00  zend_strndup
  0.00      0.00     0.00      866     0.00     0.00  zend_hash_del_key_or_index
  0.00      0.00     0.00      859     0.00     0.00  zend_remove_ini_entries
  0.00      0.00     0.00      807     0.00     0.00  free_zend_constant
  0.00      0.00     0.00      804     0.00     0.00  zend_register_constant
  0.00      0.00     0.00      741     0.00     0.00  zend_register_long_constant

Now we can actually analyse the details of our program right away instead of trying to determine them from a picture. Here the data is formatted as a table, with headings stating what each column means. This alone makes viewing the profiling output as text more practical than trying to decipher an image, especially for larger programs.


In conclusion, even though this task wasn't a complete success, I did learn the basics of profiling. Knowing how powerful and useful profiling can be, I will definitely use it in the future when testing code.

During the struggle to enable profiling in PHP, I learned through my research that there are profilers built specifically for PHP, such as Xdebug's built-in profiler. Based on my documented experience, I would not recommend profiling PHP with gprof until this bug is fixed; in the meantime, the PHP-specific profilers should be used.

by ddmedinski at April 21, 2015 08:49 PM


Coming near the end of the semester, I decided it was time to wrap up the optimization project. I was planning on having the strict search implemented before creating a pull request, but I ended up spending more time on it than expected, so I committed and pushed my code to GitHub and created a pull request so that my current code could be reviewed while I continued working on my function. A couple of hours after I created the pull request, I managed to implement the strict search into my function.

Strict Comparison implementation

The drawback of the strict implementation is that it performs two comparisons per element. The fast_is_identical_function() function performs a strict comparison (===), but its result only specifies whether the comparison succeeded or failed (a Boolean).

This meant I had to keep compare_function in order to get the ordering information the binary search needs. If the value of the element and the target match, fast_is_identical_function then performs the strict comparison of the target and element. If the result of fast_is_identical_function is a success, the search is a success; otherwise the search continues.

Here is the strict binary search implementation

             entry = zend_hash_index_find(arr_hash, mid);
             compare_function(&result, value, entry);
             if (Z_LVAL(result) < 0) {
                  high = mid - 1;
             } else if (Z_LVAL(result) > 0) {
                  low = mid + 1;
             } else {
                  fast_is_identical_function(&result, value, entry);
                  if (Z_TYPE(result) == IS_TRUE) {

Now that my function was complete, I was ready to push my code again. Before I pushed the new code, however, I received a comment on my pull request: I was asked to change the names of both my C implementation and the PHP-side function to include the array_ prefix. This helps users understand the purpose of the function from its name.

After applying these changes to my code, the C implementation php_binary_search became php_array_binary_search and the PHP side function binary_search became array_binary_search.

Here is the complete binary search implementation

/* void php_array_binary_search(INTERNAL_FUNCTION_PARAMETERS, int behavior)
 *   0 = return boolean
 *   1 = return key
 */
static inline void php_array_binary_search(INTERNAL_FUNCTION_PARAMETERS, int behavior) /* {{{ */
{
     zval *value,           /* value to check for */
          *array,           /* array to check in */
          *entry;           /* pointer to array entry */
     zend_bool strict = 0;  /* strict comparison or not */

#ifndef FAST_ZPP
     if (zend_parse_parameters(ZEND_NUM_ARGS(), "za|b", &value, &array, &strict) == FAILURE) {
          return;
     }
#endif

     HashTable *arr_hash;
     arr_hash = Z_ARRVAL_P(array);
     int low = 0;
     int high = zend_hash_num_elements(arr_hash) - 1;
     int mid;
     zval result;

     while (low <= high) {
          mid = (low + high) / 2;
          entry = zend_hash_index_find(arr_hash, mid);
          compare_function(&result, value, entry);
          if (Z_LVAL(result) < 0) {
               high = mid - 1;
          } else if (Z_LVAL(result) > 0) {
               low = mid + 1;
          } else {
               fast_is_identical_function(&result, value, entry);
               if (Z_TYPE(result) == IS_TRUE) {
                    /* match: return the key or true, depending on behavior */
               }
          }
     }
}
/* }}} */

/* {{{ proto bool array_binary_search(mixed needle, array haystack [, bool strict])
   Checks if the given value exists in the array using binary search */
PHP_FUNCTION(array_binary_search)
{
	php_array_binary_search(INTERNAL_FUNCTION_PARAM_PASSTHRU, 0);
}
/* }}} */

What Now

What remains for the pull request are the testing scripts. When 'make test' is called in the PHP directory, test scripts for each PHP function are run to ensure that all functions operate correctly. The code I have written is not yet exercised when 'make test' is run, as I have no 'phpt' scripts for my function. This task should hopefully not take long, since I will just migrate the PHP code I originally used to test the binary search, unless some bugs surface from testing.

Once these tasks are completed, I shall push the testing functions along with any new changes to my github repo. While I wait for my pull request to be accepted, I will continue to communicate with the community about my implementation.


Pull Request –

by ddmedinski at April 21, 2015 07:24 AM

April 20, 2015

Hosung Hwang

Emacs diary additional functionality

In another post of mine about an everyday text file and emacs LISP customization, I made a fast way to create a text file every day for a diary, along with a way to find information in the diaries. However, sometimes I want to see yesterday's diary, or the previous/next day's diary relative to the current diary file. To do this, I wrote a bash script and some LISP functions.

Bash script that opens previous/next file


#!/bin/bash
fname=$(basename $2)

if [ "$1" == "next" ]; then
    a=$(ls /diary | grep -C1 $fname | tail -1)
    echo "/diary/"$a
elif [ "$1" == "prev" ]; then
    b=$(ls /diary | grep -C1 $fname | head -1)
    echo "/diary/"$b
fi

This script takes a file name and prints the previous or next file name. In this case it returns the previous or next day's file when I run it like this : prev

When the file exists, it opens it. If not, it opens the closest previous file that does exist.

LISP functions for emacs

(defun prevfile ()
  (interactive)
  (find-file (substring (shell-command-to-string
                         (format "~/sh/ prev %s" (current-buffer))) 0 -1)))

(defun nextfile ()
  (interactive)
  (find-file (substring (shell-command-to-string
                         (format "~/sh/ next %s" (current-buffer))) 0 -1)))

(global-set-key (kbd "C-S-<prior>") 'prevfile)
(global-set-key (kbd "C-S-<next>") 'nextfile)

M-x prevfile takes the current buffer name, which is the file name, and opens the previous file in the diary directory. I set shortcuts using global-set-key.
Ctrl+Shift+PageUp opens previous day’s diary
Ctrl+Shift+PageDown opens next day’s diary
and in my case, Ctrl+PageUp/PageDown opens previous/next buffer in emacs.

by Hosung at April 20, 2015 05:47 PM

Mohamed Baig

Rotation Bug for PDF.js

So I have been working on this bug. Basically the problem was that pdf.js did not remember the page rotation of the PDF. This was somewhat more involved than my previous two bugs. I was still working in the web directory, but now I needed to figure out what code to write and where to place the […]

by mbbaig at April 20, 2015 12:41 AM

April 19, 2015

Gideon Thomas

Refine your data

Hello all,

For one of my two remaining releases for my open source class, I decided (based on my professor’s suggestion) to contribute to a project called OpenRefine. This is a brilliant project started by Google which allows you to clean up your data. It has some pretty neat features like fixing inconsistencies in data, extracting out unnecessary fields of data, augmenting your data set with data collected from online sources, etc. It has a pretty slick UI that will make interaction with data more user-friendly. Gone are the days of using MS Excel.

For this release I worked on a pretty simple optimization bug that dealt with reducing the number of events triggered for each keystroke in an input field. With the project combining languages like Java, JavaScript, and HTML, I was not sure what I was in for.

I did have a little trouble navigating the project itself (it is a very big project) and was struggling a bit to figure out a starting point. I thought I would give it a fair shot before knocking on the doors of their IRC channel. Simply put – Global Search FTW. In the bug itself, the place where the event was captured was mentioned, so I global searched for that. Once I found it, I experimented a little with what an appropriate solution to ignoring certain keystrokes would be. Within about half an hour, voila, I had a good tested solution, and a PR up.

Unfortunately, I am not sure how to fix the failing AppVeyor thing – will wait for some comments from the devs.

Stay tuned for my final release soon…

Apparently the AppVeyor thing was not relevant atm so the devs merged it in :)

by Gideon Thomas at April 19, 2015 08:50 PM

April 18, 2015

Artem Luzyanin

[SPO600] The last attempt

This morning I sent the last email asking the community for feedback on my patch. Over the last week I've been getting nearly a hundred emails a day from the developers' mailing list, so I can see they are pretty busy. I understand that during the release they will not have time to review non-critical patches aimed at a future release, but it surely doesn't help me!

by lisyonok85 at April 18, 2015 01:35 PM

April 17, 2015

Danylo Medinski

PROJECT UPDATE 3: An update to an earlier update

In my last project update, I stated that my php_binary_search function had a bug where an output/print statement would cause the function to execute quicker than a linear search, while the binary search would be slower without this statement.

After discussing this bug with my SPO600 professor, it turned out that I had forgotten some high school level math and had been reading the timer output wrong. In reality the php_binary_search function substantially outperformed the linear search function (php_search_array). The E at the end of the timer output denotes exponential notation: E-5 means the decimal point shifts five places to the left, so 1.0967254638672E-5 = 0.000010967254638672. This means that our algorithm is quite a bit faster than the linear search. I was also able to get the function to work with all data types, so there shouldn't be any errors about incompatible types.

Now that I can confirm the function works as intended, I can submit a pull request on GitHub. Before I do that, though, I would like to add a strict comparison to my binary search function, since the php_search_array function includes a flag for strict comparisons. This would add extra functionality to my algorithm by allowing it to check for the same data type along with the value (===).

Once this is added to my function, I will submit my pull request and post a blog detailing my experience of this.

by ddmedinski at April 17, 2015 03:07 PM

Single Instruction, Multiple Data (SIMD)

For this posting, I will briefly look into the x86_64 & AArch64 implementations of SIMD.

What is SIMD

SIMD, or Single Instruction, Multiple Data, is technology that enables processing of multiple data elements with a single instruction, instead of using scalar operations where one instruction processes one element. The advantage of SIMD is that large calculations over large data sets can complete more quickly than on systems that do not utilize it. SIMD cannot, however, be applied to all algorithms.

Implementations of SIMD


Most of the x86_64 implementations of SIMD extend the MMX instruction set.


MMX, designed and released by Intel in 1997, is an instruction set that enables SIMD on x86 systems and is capable of 8 operations at the same time.

MMX features

  • 8 64-bit registers defined as MM0 to MM7
  • Registers can be used as single 64-bit quantities, dual 32-bit quantities, four 16-bit quantities, or eight 8-bit quantities

Since the MMX registers overlap with the FPU stack registers, the disadvantage of MMX is that the FPU cannot be used in parallel with the MMX registers.


Streaming SIMD Extensions (SSE) technology enables single instruction, multiple data. Originally developed and released by Intel in 1999 for the x86 architecture, SSE extends the MMX SIMD instruction set and has been vastly improved and expanded upon since. The major versions are SSE and SSE2; SSE3, SSSE3 and SSE4 add smaller changes to the SSE2 instruction set. Beyond the SSE versions themselves, later extensions such as Advanced Vector Extensions (AVX) build on SSE.

SSE includes

  • 70 new instructions
  • eight 128-bit data registers called XMM registers
  • 128 bit single-precision floating-point instructions
  • 64 bit MMX integer instructions

SSE2 instruction set extended SSE by

  • 144 new instructions, as opposed to the 70 instructions SSE offers
  • 64 bit double-precision floating-point instructions
  • extends MMX integer instructions to operate on 128-bit XMM registers

Algorithms such as image processing and graphics, digital signal processing (DSP), digests/hashing/encoding, and simulation of physical systems can benefit from SSE technologies.


AArch64 uses the Advanced SIMD extension (NEON) to enable SIMD technologies.

Advanced SIMD extension

The NEON instruction set is a 64/128-bit instruction set for ARM architectures. NEON supports both SIMD and SISD (single instruction, single data) and is capable of 16 operations at the same time. NEON instructions support both integer and floating-point data.

Media and signal processing applications benefit from NEON, as it provides standardized acceleration for them:

  • sixteen 128-bit registers
  • 8, 16-32 and 64-bit integer data
  • signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit,single-precision floating-point data

The biggest benefit of NEON is that it can perform operations in parallel with standard ARM instructions, significantly reducing execution time.

by ddmedinski at April 17, 2015 03:19 AM

April 15, 2015

James Boyer

More project updates

Project Switch
Project Progress: Stack-less Just in time compiler
This is an update about my project progress in SPO600 and about problems with the new project.


After my previous post I found out what I have to do in sljit. Basically there are three areas that need changing in order to increase the number of floating point registers. First is the defines, which requires me to add architecture-specific sections for the number of floating point registers in each architecture. Second is the function entry and exit points, which require additional code to save and restore floating point registers, because currently they only save and restore integer registers. Finally, I would have to change the defines for the temporary registers and map the registers in order to get the real register index. This is a great area and I really want it to work, but I am struggling with it and not too confident that I can complete it on time.

The defines

There is a file called sljitConfigInternal.h which has many defines for integer registers that look something like this:
#elif (defined SLJIT_CONFIG_ARM_64 && SLJIT_CONFIG_ARM_64)
#define SLJIT_LOCALS_OFFSET_BASE (2 * sizeof(sljit_sw))
but when it comes to floating point registers, all that is there is this:
#if (defined SLJIT_CONFIG_X86_64 && SLJIT_CONFIG_X86_64) && (defined _WIN64)
This pretty much just assigns 6 float registers to any architecture no matter what; as you can imagine this is not ideal, because many processors can use more than 6. My task was to expand this so that more would be used on x86 or ARM systems.

Function entry and exit points

In each architecture-specific file (for ARM64 it would be sljitNativeARM_64) there are functions that deal with the function entry and exit points; these are called sljit_emit_enter() and sljit_emit_return(). What these functions currently do is save and restore the integer registers, but if we have more floating point registers they would have to be changed to save and restore those as well.

Register mapping

Right now in the ARM64 architecture-specific file, two temporary floating point registers are being used with no mapping at all. We can compare this with the machinery used for the integer registers to see what it means to have register mapping:
#define TMP_ZERO  (0)


#define TMP_FREG1 (0)

static SLJIT_CONST sljit_ub reg_map[SLJIT_NUMBER_OF_REGISTERS + 8] = {
31, 0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 8, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 29, 9, 10, 11, 30, 31
};

#define W_OP (1 << 31)
#define RD(rd) (reg_map[rd])
#define RT(rt) (reg_map[rt])
#define RN(rn) (reg_map[rn] << 5)
#define RT2(rt2) (reg_map[rt2] << 10)
#define RM(rm) (reg_map[rm] << 16)
#define VD(vd) (vd)
#define VT(vt) (vt)
#define VN(vn) ((vn) << 5)
#define VM(vm) ((vm) << 16)
As we can see, there are only two lines for the floating point registers (the TMP_FREG lines), while the integer registers are handled in a much more complex way. The reg_map array is used in the macros at the bottom to provide the correct machine register index for each register. I would have to do something similar for the floating point registers.


There are a few problems that have stopped me from completing these changes. First, I am wary of how many floating point registers there are in each architecture. For the integer registers the numbers are quite specific: for example, ARM64 is defined as having 25 registers and MIPS as having 22. I have found that ARM is supposed to have 32 floating point registers, but it seems strange that it would be such an even number; I will try it regardless. Then there is the other define, NUMBER_OF_SAVED_FLOAT_REGISTERS, and I am having trouble finding out where that value comes from. Chris Tyler, my professor, directed me to the procedure call standard for ARM, but I was unsuccessful in finding anything there. This problem has me stuck and confused about what to do. It would be easy if I could ask the maintainer where and how he determined the register numbers, but he has stopped responding to my emails. For about 5 days we were talking: I would send one email, he would send one back in the morning, and so forth. Then one day he just stopped responding, so either he got really busy or something happened to him; let's hope he is just busy. For now I will try to get it working with 32 floating point registers for ARM and do some trial and error to find the allowed number of saved registers.


Despite the problems, this project seems really interesting. I think if I had gotten an earlier start I would have been able to complete a patch, but switching projects slowed me down quite a lot. I will continue trying to make progress even after SPO600 is done; perhaps the maintainer will get some free time to help me with it. I will try to have something to show next week, but I don't know if it will be much.
Thanks for reading

by James Boyer ( at April 15, 2015 11:26 PM

Ali Al Dallal

Using JavaScript ES6 Arrow Function for callback and binding

With the new project that I'm working on to implement new login with OAuth2, I've gotten to work with the new ES6 JavaScript syntax and am starting to like it a lot because of its flexibility and how easy it makes a lot of things that I always had problems with.

One of the most common problems I have is forgetting to bind the this context in the scope I'm working in.

For instance if I'm doing something like:

handleBlur: function(fieldName, value) {
  var self = this;
  this.doSomething(fieldName, value, function(err, json) {
    self.doAnotherthing(function(err, response) {
      // ...
    });
  });
}

Or sometimes you will do something like this

  handleBlur: function(fieldName, value) {
    this.doSomething(fieldName, value, function(err, json) {
      // bug: inside this callback, `this` no longer refers to the outer object
      this.doAnotherthing(function(err, response) {
        // ...
      });
    });
  }

In ES6 I can simply turn it to be just like this:

  handleBlur: function(fieldName, value) {
    this.doSomething(fieldName, value, (err, json) => {
      this.doAnotherthing((err, response) => {
        // ...
      });
    });
  }

Now I have a shorthand version of my callback, plus this is bound to the scope I need to work with, which is a win-win for me :)

Don't forget to leave a comment if you have any suggestion.

by Ali Al Dallal at April 15, 2015 03:07 PM

Jan Ona

SPO600 – Project Phase 2.5: Change

To follow up on my project for SPO600: there have unfortunately been some changes in the direction of the project.

Patch cancelled:
In response to my question, Spencer Brody, the initial creator of the patch, revealed that MongoDB's development team is currently planning to rewrite a large amount of code, possibly overwriting and fixing the patch I was working with. In addition, since the patch has not been worked on for a long time (~3 years), he is also not quite sure where the files are.

Since there isn’t much time left until phase 3 of the project, I’ve decided to ask Spencer for places to look for other optimizations.

I was directed to a list of issues pulled from smaller projects which are mainly given to their newly-hired engineers to work with to get used to their code base. From this list, I’ve found what looks to be a good place to work with.

Optimization Option #3:  ISSUE-17651.
This issue focuses on canonical_query.h and its .cpp file. It asks to remove the generateCacheKey() call from init() and have it run on the first call to the class's getPlanCacheKey() instead, because generateCacheKey() is considered an expensive operation and some instances of CanonicalQuery never make use of the cache key.

Into Phase 3:
With only days until the last due date, I hope to post a patch soon. Look for the next post for updates.

by jangabrielona at April 15, 2015 05:14 AM

SPO600 – Vectorization

As an overview of what was covered in my SPO600 class, we were tasked with disassembling a piece of code and analyzing a specific form of optimization: vectorization.

For this post, the code that was used is as follows:

#include <stdio.h>
#include <stdlib.h>

int main() {
    int a[256], i, t = 0;
    srand(1);            // initialize prng
    for (i = 0; i < 256; i++) {
        a[i] = rand();   // load array randomly
    }
    for (i = 0; i < 256; i++) {
        t += a[i];       // sum the array
    }
    printf("%d\n", t);   // print the sum
}

Main as an assembly code:

0000000000400500 <main>:
400500: d11003ff sub sp, sp, #0x400
400504: 52800020 mov w0, #0x1
400508: a9bd7bfd stp x29, x30, [sp,#-48]!
40050c: 910003fd mov x29, sp
400510: f90013f5 str x21, [sp,#32]
400514: 9100c3b5 add x21, x29, #0x30
400518: a90153f3 stp x19, x20, [sp,#16]
40051c: 9110c3b4 add x20, x29, #0x430
400520: aa1503f3 mov x19, x21
400524: 97ffffef bl 4004e0 <srand@plt>    //Before Loops
400528: 97ffffe2 bl 4004b0 <rand@plt>
40052c: b8004660 str w0, [x19],#4
400530: eb14027f cmp x19, x20
400534: 54ffffa1 b.ne 400528 <main+0x28>  //Loop 1 end
400538: 4f000400 movi v0.4s, #0x0
40053c: 3cc106a1 ldr q1, [x21],#16        //Loop 2 start
400540: eb15029f cmp x20, x21
400544: 4ea18400 add v0.4s, v0.4s, v1.4s
400548: 54ffffa1 b.ne 40053c <main+0x3c>  //Loop 2 end
40054c: 4eb1b800 addv s0, v0.4s
400550: a94153f3 ldp x19, x20, [sp,#16]
400554: f94013f5 ldr x21, [sp,#32]
400558: 90000000 adrp x0, 400000 
40055c: a8c37bfd ldp x29, x30, [sp],#48
400560: 911da000 add x0, x0, #0x768
400564: 0e043c01 mov w1, v0.s[0]
400568: 911003ff add sp, sp, #0x400
40056c: 17ffffe1 b 4004f0 <printf@plt>

Since assembly code is fairly difficult to understand, we can divide it into sections based on the original C code. Since vectorization is only evident within the loops, we won't have to worry about anything above the call to <srand@plt> at line 400524.

The first loop can be identified by the backward branch to 400528:

400528: 97ffffe2 bl 4004b0 <rand@plt>
40052c: b8004660 str w0, [x19],#4
400530: eb14027f cmp x19, x20
400534: 54ffffa1 b.ne 400528 <main+0x28>

This loop, however, does not make use of vectorization, since <rand@plt> needs to be called for every index of the array.

Between the loops, we notice an extra line that was added by the compiler; we will come back to it later:

400538: 4f000400 movi v0.4s, #0x0

The second loop:

40053c: 3cc106a1 ldr q1, [x21],#16
400540: eb15029f cmp x20, x21
400544: 4ea18400 add v0.4s, v0.4s, v1.4s
400548: 54ffffa1 b.ne 40053c <main+0x3c>

Vectorization is an optimization that combines multiple element operations into single instructions (SIMD). Here the compiler starts by zeroing a vector accumulator before the loop (the movi instruction above). Within the loop, it loads the next four array values into a vector register (line 1), compares the current pointer against the end of the array (line 2), and adds the whole batch to the accumulator in one instruction (line 3). This change effectively cuts the loop to a quarter of its original iteration count.

Since the sum in this case was accumulated in 4 separate lanes, the lanes are added together right after the loop:

40054c: 4eb1b800 addv s0, v0.4s


This lab introduces an interesting topic that is pretty difficult to make sense of the first time around. It also shows that small changes which let the compiler apply its optimizations can drastically improve performance, something to take note of in the future.

by jangabrielona at April 15, 2015 02:59 AM

April 14, 2015

Jordan Theriault

R7 – React-Bootstrap: Updated Thumbnail, added divider option

As per the request by the project creator mtscout6 on the last pull request, I've added a div option to the thumbnail. In this spec, the user can specify an "href" prop in order to use the thumbnail as an anchor tag. If the user omits the "href" prop, the thumbnail will be a div with an image at the top and a caption div beneath.

I am currently waiting on comments for the latest addition to the thumbnail pull request.

For someone using the react-bootstrap framework the user would create the desired result below.

Anchor Thumbnail:

Div Thumbnail:

by JordanTheriault at April 14, 2015 08:12 PM

Artem Luzyanin

[SPO600] Project delays

After several days of waiting for a response to my patch, I decided to go on IRC and try to bug people directly. This brought me two pieces of information. First, even if a channel has around a thousand users, that doesn't mean anyone will answer your question. Second, there is nothing I can do at the moment to make the community review my patch. There is a release happening right now, so the people in charge of committing, or even reviewing, are busy with it. Here is part of the IRC chat:

01[09:14] <ArtemL> Greeting everyone! I was wondering… I have submitted a patch for 2015/06 commitfest, as well as for pgsql-hackers mailing list, but got a response… I was wondering if I am doing something wrong, or I shouldn’t get any response?
01[09:14] <ArtemL> sorry, got NO response
02[09:15] * todakure (c896474c@gateway/web/freenode/ip. Quit (Ping timeout: 246 seconds)
02[09:16] * cleme1mp ( Quit (Ping timeout: 250 seconds)
03[09:16] * cleme1mp ( has joined #postgresql
03[09:17] * dantti (~dantti@ has joined #postgresql
[09:19] <Snow-Man> sigh.
[09:19] <Snow-Man> If you did all that then I doubt you did anything wrong.
[09:20] <oicu> ArtemL, people are busy trying to wrap up 9.5
[09:20] <Snow-Man> Sadly, we’re quite busy and haven’t had a chance to review everything.
[09:20] <RhodiumToad> also -hackers has been pretty quiet over the past few days
[09:20] <Snow-Man> Ah, yeah, and if it’s in the 2015/06 commitfest then you’re not likely to hear much til June.
03[09:21] * mattarse (~mattarse@ has joined #postgresql
03[09:21] * davidfetter_fbn (~chatzilla@2601:9:a81:4000:c8e0:f09:3cef:b553) has joined #postgresql
[09:21] <Snow-Man> RhodiumToad: I noticed that also.. Though I’m at a loss to explain it.
03[09:21] * ancientcc (~ancientcc@ has joined #postgresql
[09:22] <RhodiumToad> it’s not a US holiday or anything, is it?
03[09:23] * crobbins (~crobbins@2602:306:3403:d520:3d65:4b89:e725:c9f7) has joined #postgresql
02[09:23] * ancientcc (~ancientcc@ Quit (Client Quit)
01[09:23] <ArtemL> Oh I see. That’s reassuring! I guess I will wait until they will get a bit more free. Is there any way to get a response “yes, your patch is likely to get implemented” or “nah, you gotta do it better?”
[09:23] <Snow-Man> Nope.

Snow-Man (Stephen Frost) is one of the top contributors and reviewers, and is likely to be the one to review my code later. So… I will patiently wait. I will try to send another email later this week just to see if someone is bored enough to actually have a look at it, but my hopes are low…

by lisyonok85 at April 14, 2015 01:50 PM

Ryan Dang

Release 0.7

For Release 0.7, I was originally going to work on the issue “My app title not saved after hit the check mark #1391”. However, after playing around with it and trying to reproduce the bug, I was not able to reproduce it after many attempts, so I commented in the issue that the bug is no longer valid. In the same issue, there was a comment saying the save app button should be enabled when the app title is changed. I noticed that this hadn’t been fixed, so I created an issue for it: “Enable Save App title on keydown instead of on blur”. It was an easy fix, but it did take me some time to look through the Vue.js documentation to find out what I could use instead of “onBlur”. After trying a few things out, I got it to work with the “onKeydown” syntax. The pull request created for this issue is “Enable save app when app title is changed”. It was merged shortly after, as there was not much change to it.

As I was testing out the app, I also noticed that the edit app features are not user friendly. I created an issue to suggest a change to help improve the user experience: “While editing the block, clicking on move up/move down button shouldn’t close the edit menu”. I was working on that issue, but it was closed. It was closed not because it was invalid, but because they are doing a major revamp of the UI.

by byebyebyezzz at April 14, 2015 02:24 AM

April 13, 2015

Hong Zhan Huang

SPO600: SIMD and Vectorization

During our last lab class we played around with the code below in order to see what happens to the object code when we compile with -O3. As can be seen, the code is simply two for loops: the first fills the array a with random numbers, and the second sums the array into the variable t.

#include <stdlib.h>
#include <stdio.h>

void main() {

  typedef int aint __attribute__ ((__aligned__(4)));
  aint a[256];
  int i, t = 0;

  srand(1); // initialize prng

  for (i=0; i<256; i++) {
    a[i] = rand(); // load array randomly
  }

  for (i=0; i<256; i++) {
    t += a[i]; // sum the array
  }

  printf("%d\n", t); // print the sum
}

With -O3 the compiler applies a number of optimizations, and to see what changed in our code we’ll use the objdump tool with -D to disassemble it.

0000000000400500 <main>:
  400500:    d11003ff     sub    sp, sp, #0x400
  400504:    52800020     mov    w0, #0x1                       // #1
  400508:    a9bd7bfd     stp    x29, x30, [sp,#-48]!
  40050c:    910003fd     mov    x29, sp
  400510:    f90013f5     str    x21, [sp,#32]
  400514:    9100c3b5     add    x21, x29, #0x30
  400518:    a90153f3     stp    x19, x20, [sp,#16]
  40051c:    9110c3b4     add    x20, x29, #0x430
  400520:    aa1503f3     mov    x19, x21
  400524:    97ffffef     bl    4004e0 <srand@plt>
  400528:    97ffffe2     bl    4004b0 <rand@plt>
  40052c:    b8004660     str    w0, [x19],#4
  400530:    eb14027f     cmp    x19, x20
  400534:    54ffffa1    b.ne    400528 <main+0x28>
  400538:    4f000400     movi    v0.4s, #0x0
  40053c:    3cc106a1     ldr    q1, [x21],#16
  400540:    eb15029f     cmp    x20, x21
  400544:    4ea18400     add    v0.4s, v0.4s, v1.4s
  400548:    54ffffa1    b.ne    40053c <main+0x3c>
  40054c:    4eb1b800     addv    s0, v0.4s
  400550:    a94153f3     ldp    x19, x20, [sp,#16]
  400554:    f94013f5     ldr    x21, [sp,#32]
  400558:    90000000     adrp    x0, 400000 <_init-0x460>
  40055c:    a8c37bfd     ldp    x29, x30, [sp],#48
  400560:    911da000     add    x0, x0, #0x768
  400564:    0e043c01     mov    w1, v0.s[0]
  400568:    911003ff     add    sp, sp, #0x400
  40056c:    17ffffe1     b    4004f0 <printf@plt>

The compiler has done some work to change how our main now operates. The first loop runs from line 400528 to 400534, where the branch-not-equal check happens. Normally we think of a for loop as having some integer i that increments up to an upper limit before the loop terminates, but here there is no trace of i or of any increment operation. Instead, the optimizer has turned the loop into a store operation: each result of rand() is stored at the address held in register x19, which is then advanced by 4 bytes. This post-incrementing store effectively becomes the loop counter, as values are written to consecutive addresses until the array is filled and x19 reaches x20. The compiler was able to analyze our code and optimize away both the increment and the integer i.

Following the first loop we have some SIMD work. At line 400538 the compiler sets up the vector register v0 by splitting the 128-bit register into four word-sized lanes, that is, four 32-bit lanes. We also load values from our array into register q1, advancing the pointer by 16 bytes each time. We then add four values of the array at a time, one in each of the four lanes. Once this tallying process is complete, the four lanes are summed into a single 32-bit value at line 40054c.


The results of this lab show how the compiler changes our code when we apply the optimizer flag -O3. We also see how the SIMD optimization works and some of the vector operations needed to facilitate it. There are opportunities for SIMD in places like the second loop, where the data can be divided into equal-sized pieces for processing. In cases like the first loop, however, where each iteration involves generating a random value, SIMD isn’t applicable. This process of single instruction, multiple data is akin to baking four loaves of bread in a single oven by using four bread pans at once: theoretically you’ve increased your throughput by 4x. In this sense the use of SIMD resembles loop unrolling.

The vector/SIMD instruction set for armv8 can be found here.

by hzhuang3 at April 13, 2015 09:37 PM

Thana Annis


Looking at the program we went over in class: if the compiler can accurately predict that a piece of code will loop a certain number of times, as in “for(int i = 0; i < 256; i++)”, and that the number of iterations is divisible by 4, then it might choose to use vectorization instructions, provided the optimization settings are high enough when it is compiled.

Even though the first loop’s count is divisible by 4 and is predictable, it is not a candidate for vectorization because rand() needs to be called for each index of a; the loop has to run iteration by iteration to populate the array.

In the second loop, however, all the code is doing is reading each value in a and adding it to t. This makes it a good candidate for vectorization because all the memory is already loaded and just needs to be processed. With vectorization, instead of reading and adding 32 bits at a time, the processor grabs a 128-bit chunk of a and puts each 32-bit piece into its own “lane”.

|__32bits__|__32bits__|__32bits__|__32bits__| = 128bits of a’s memory

The processor knows that the register holding these 128 bits is vectorized, so when an add is issued it adds each individual lane separately, keeping a running total per lane. It then advances the pointer 128 bits so it can grab the next chunk of memory to process. Once the end of a is reached, a single instruction adds the lane totals together to produce the final total stored in t.

The advantage is that you are essentially looping only 1/4 of the times you would without vectorization. Done the way it was written in the program, you would be grabbing a 32-bit chunk, adding it to t, moving the pointer over 32 bits, and repeating 256 times. Vectorization can be a powerful tool for decreasing the run time of your program.

by bwaffles91 at April 13, 2015 02:00 AM

April 11, 2015

Chris Tyler (ctyler)

Connecting a 96boards HiKey to a Breadboard

I've wanted to experiment with interfacing the 96boards HiKey (8-core, 64-bit ARM development computer) to some devices, but the 2mm 40-pin header is difficult to bring out to a breadboard (except by individual wires). I designed a simple adapter board and had some produced by the awesome OSH Park prototype PCB fabbing service. The boards are back, and I've been assembling a few of them for testing.

The adapter has a 40-pin 2mm male connector that plugs into the 96boards "low-speed" (LS) female header. It brings those signals to a 40-pin 0.1"/2.54mm header that can be used with an insulation displacement connector (IDC) on a ribbon cable. The other end of the ribbon cable can then be connected to a standard solderless breadboard using an adapter (most of the 40-pin IDC-to-breadboard adapters used with the Raspberry Pi A+/B+/2 should work -- for example, the AdaFruit Pi Cobbler Plus (straight) or T Cobbler Plus (T-shaped), KEYES SMP0047 from (U-shaped - not checked), or a 40-pin IDC Ribbon Breakout from Creatron in Toronto (straight) -- obviously, the pinouts from the 96boards connector will be different, so ignore or rewrite the pin labels on the adapter board). Update: Watch out for breadboard adapters that connect pins together -- e.g., that are designed for use with a Pi and connect together Ground or Supply pins, which will have different functions on 96boards LS connector. (I found a T-style adapter from that has this issue).

These first-try "Alpha 1" boards are not ideal -- they're bigger than they need to be, the board occludes one of the 96boards mounting holes, and the routing is wonky (I let the autorouter do its thing, and ended up with a couple of unnecessary vias and some wild traces). Nonetheless, the board seems to do what it was intended to do. I'm going to work on a second-generation design, but it may be a little while before I get around to it.

If you're interested in using this design, it's licensed under CC-BY-SA and can be downloaded or ordered from the OSH Park sharing site.

You can use this with the 2mm (96boards LS) and 2.5mm (IDC) connectors on the same side of the board, or on opposite sides of the board.

If you place them on the same side of the board (see the populated PCB with no cable attached in the photo -- though the IDC socket should be rotated 180 degrees), the pin numbering on the IDC cable will match the 96boards pin numbering. However, the IDC connector will be really tight to the edge of the board -- you may want to slightly angle the IDC header when you solder it on.

If you place them on the opposite sides of the board (the adapter plugged into the HiKey in the photo), the even and odd numbered pins will be swapped (1<->2, 3<->4).

One final note -- these should work fine with other 96boards devices that comply with the Consumer Edition specification -- and hopefully, it won't be long before the DragonBoard 410c is available so we can test this theory!

(The board in the photo is in one of the very basic acrylic cases I built for the HiKeys).

by Chris Tyler ( at April 11, 2015 05:22 PM

Artem Luzyanin

[SPO600] Disassembling lab

During our last lecture, we were given some code to disassemble and analyze. So I copied the tarball to my folder, ran the usual “tar -xvf name” on it, ran “make”, and was ready to go. This is the code I was analyzing:

#include <stdlib.h>
#include <stdio.h>

void main() {

int a[256], i, t = 0;

srand(1); // initialize prng

for (i=0; i<256; i++) {
a[i]=rand(); // load array randomly
}

for (i=0; i<256; i++) {
t+=a[i]; // sum the array
}

printf("%d\n",t); // print the sum
}
As you can see, it’s a pretty easy and straightforward program. After running “objdump -d name”, I got the following:

0000000000400500 <main>:
400500: d11003ff sub sp, sp, #0x400
400504: 52800020 mov w0, #0x1 // #1
400508: a9bd7bfd stp x29, x30, [sp,#-48]!
40050c: 910003fd mov x29, sp
400510: f90013f5 str x21, [sp,#32]
400514: 9100c3b5 add x21, x29, #0x30
400518: a90153f3 stp x19, x20, [sp,#16]
40051c: 9110c3b4 add x20, x29, #0x430
400520: aa1503f3 mov x19, x21
400524: 97ffffef bl 4004e0 <srand@plt>
400528: 97ffffe2 bl 4004b0 <rand@plt>
40052c: b8004660 str w0, [x19],#4
400530: eb14027f cmp x19, x20
400534: 54ffffa1 b.ne 400528 <main+0x28>
400538: 4f000400 movi v0.4s, #0x0
40053c: 3cc106a1 ldr q1, [x21],#16
400540: eb15029f cmp x20, x21
400544: 4ea18400 add v0.4s, v0.4s, v1.4s
400548: 54ffffa1 b.ne 40053c <main+0x3c>
40054c: 4eb1b800 addv s0, v0.4s
400550: a94153f3 ldp x19, x20, [sp,#16]
400554: f94013f5 ldr x21, [sp,#32]
400558: 90000000 adrp x0, 400000 <_init-0x460>
40055c: a8c37bfd ldp x29, x30, [sp],#48
400560: 911da000 add x0, x0, #0x768
400564: 0e043c01 mov w1, v0.s[0]
400568: 911003ff add sp, sp, #0x400
40056c: 17ffffe1 b 4004f0 <printf@plt>

Now this code took a bit of figuring out. Breaking the code into the sections helped:

400524: 97ffffef bl 4004e0 <srand@plt>

This line is our call to srand(); this is fairly clear.

400528: 97ffffe2 bl 4004b0 <rand@plt>
40052c: b8004660 str w0, [x19],#4
400530: eb14027f cmp x19, x20
400534: 54ffffa1 400528 <main+0x28>

This block acts in a pretty interesting way: it FIRST calls the rand() function and stores the result, and only then checks the exit condition. The line “str w0, [x19],#4” shows that instead of incrementing an index, the compiler decided to increment a pointer into the array; when the pointer reaches the projected end of the array, the loop stops.

400538: 4f000400 movi v0.4s, #0x0
40053c: 3cc106a1 ldr q1, [x21],#16
400540: eb15029f cmp x20, x21
400544: 4ea18400 add v0.4s, v0.4s, v1.4s
400548: 54ffffa1 40053c <main+0x3c>

This block is the whole reason for the exercise. Here we touch on the Single Instruction Multiple Data (SIMD) concept. “400538: 4f000400 movi v0.4s, #0x0” fills a vector register, divided into 4 single-word-sized (32-bit) lanes, with zeros, preparing it to be the sum register. “ldr q1, [x21],#16” loads another vector register with values from our initial array. Then the exit condition is checked again, the loaded values are added to the sum register in their respective lanes, and the loop ends or continues depending on the result of the condition check.

40054c: 4eb1b800 addv s0, v0.4s

This line adds up all the lanes of the sum vector register into the lowest position, zeroing the rest.

40056c: 17ffffe1 b 4004f0 <printf@plt>

Finally, we are calling printf for the output.

As we can see, by performing a couple of extra steps to work with the vector registers, we save enormous time on the calculations, which now happen four at a time. This is the beauty of vectorization: if you can condense your code like that, you should use it!

Unfortunately, there are problems with vectorization that we have to watch out for. First of all, we need to specify the “-O3” gcc option for it to work. This is what the vectorized section of my example turned into when I changed the option to “-O2”:

400538: b8404660 ldr w0, [x19],#4
40053c: eb1302df cmp x22, x19
400540: 0b0002b5 add w21, w21, w0
400544: 54ffffa1 400538 <main+0x38>

It reads much more easily, but it does the addition one value at a time; no more vectorization for us. Unfortunately, in some situations running the third level of optimization might be problematic for the rest of the code due to the drastic changes it makes, so one has to choose.

Another limitation appears when we try to add two arrays into a third one (which would vectorize to something like “add v0.4s, v1.4s, v2.4s“). Since the compiler doesn’t know whether the arrays overlap, it plays it safe and assumes that they do, and vectorization will not happen. To fix that, you need to qualify the pointers with “restrict” (or GCC’s “__restrict__”) to specify that the arrays do not overlap.

Another limitation is that vectorization can have problems doing calculations on misaligned values: imagine accessing the first element of the first array but the second element of the second array and going from there. There is a “__builtin_assume_aligned” function (documentation can be found here) to tell the compiler how the data is aligned. According to recent changes, this problem is being actively worked on, and it seems the compiler should figure alignment out on its own by now.

Another limitation is that you cannot vectorize values coming one at a time from a function. In our code, when we do “a[i]=rand();“, the compiler sees that rand() returns only one value at a time, so four separate calls are needed to obtain four values. Hence no vectorization.

And the last limitation I will mention is that no output, calculation, or conditional logic can be performed directly on the vectorized values.

This lab was very interesting, showing me that sometimes a small change in code or optimization level can have a significant impact on performance, if it allows vectorization to kick in!

by lisyonok85 at April 11, 2015 05:15 PM