Planet CDOT

April 23, 2018


Vimal Raghubir

Fixing Bugs in Kubernetes Website

So in this week’s open source adventures, I decided to tackle some bugs in the Kubernetes Website GitHub repository. This repository can be accessed here. Before I discuss the bug fixes I made, I would like to highlight why I chose this repository specifically. If you have read one of my previous blog posts, titled “Kubernetes”, you will already know that I am deeply fascinated by Kubernetes as well as other DevOps platforms. For that reason, I decided to tackle some bug fixes in Kubernetes regardless of what the bug is or what language it is written in.

After exploring several Kubernetes repositories, I noticed that many of the bugs, including the beginner ones, required intermediate to expert knowledge of the software or language to fix. I have started learning Go, the language behind the majority of Kubernetes’ architecture, and have been experimenting with the application on my own. What I came to realize is that it’s an ongoing learning experience, and I need to build a foundation before I REALLY start to tackle bugs in the application itself.

Until that time comes, I decided to tackle simpler bugs, such as those in the Kubernetes website, which can be seen here. The first fix I made was a documentation change that was necessary because the existing wording was causing confusion for other developers. My Pull Request can be accessed here. The documentation previously stated that you would need to use the NodePort value shown in your Service’s details, which is incorrect. As stated by Dick Murray, a contributor to this repository, the NodePort value does not work in conjunction with your external IP address; the Port value does.

So the recommended fix was to change the documentation to reflect this. Below is the change on GitHub.

And below is the change on the actual website.

Onto the second bug fix! For this one, there was some header text hiding behind the main header that was only visible in the Safari browser. Even though it could only be seen in Safari, the text could still be found via search on all browsers, as shown below on Safari.

If you cannot make out the header text clearly in this picture, it can be seen more clearly in the screenshot below.

Although it isn’t necessarily bothering anyone, it is still a bug that requires cleaning up. I first had to pinpoint the exact file that this code lives in, but thankfully this was done for me by Zachary Sarah. Zachary’s recommendation was to simply remove this navigation bar, since it isn’t useful anyway. I made my Pull Request and it has been successfully merged. The changes can be seen below.

This was a fix that was gladly accepted by the developers in this repository as indicated by Brad Topol.

In conclusion, I am absolutely excited to have finally dipped my toes into the waters of Kubernetes after several weeks of open source adventures. This is definitely only the beginning, and I am ecstatic to continue fixing bugs in this repository as well as the others Kubernetes has. Not only is fixing bugs rewarding, but I finally feel like I am making an impact with my development compared to the other school projects I’ve done. I am beyond thankful for all the lessons I have learned from my open source professor, David Humphrey, and will not put his teachings to waste! Open source has taught me to think bigger than ever before, and that is something everyone seeks in all aspects of life.

Once again I do look forward to many more open source adventures and this is the mark of many more to come! See you in my next adventure!

by Vimal Raghubir at April 23, 2018 12:35 AM


Jeffrey Espiritu

SPO600 Project – Stage 3 / Part 3

Inline Assembly Additions

I changed the for loop in the FLAC__fixed_compute_best_predictor function from this:

for(i = 0; i < data_len; i++) {
    error  = data[i]     ; total_error_0 += local_abs(error);                      save = error;
    error -= last_error_0; total_error_1 += local_abs(error); last_error_0 = save; save = error;
    error -= last_error_1; total_error_2 += local_abs(error); last_error_1 = save; save = error;
    error -= last_error_2; total_error_3 += local_abs(error); last_error_2 = save; save = error;
    error -= last_error_3; total_error_4 += local_abs(error); last_error_3 = save;
}

to this:

FLAC__int32 errors[4];
FLAC__uint32 total_error_0 = 0;
register FLAC__uint32 total_error_1 asm("r19") = 0;
register FLAC__uint32 total_error_2 asm("r20") = 0;
register FLAC__uint32 total_error_3 asm("r21") = 0;
register FLAC__uint32 total_error_4 asm("r22") = 0;

__asm__ ("movi v10.4s, #0" ::: "v10");

for (i = 0; i < data_len; i++) {
    error = data[i];
    errors[0] = error - last_error_0;
    errors[1] = errors[0] - last_error_1;
    errors[2] = errors[1] - last_error_2;
    errors[3] = errors[2] - last_error_3;

    last_error_0 = error;
    last_error_1 = errors[0];
    last_error_2 = errors[1];
    last_error_3 = errors[2];

    total_error_0 += local_abs(error);

    __asm__ volatile (
        "ldr q11, [%[src]]			\n\t"
        "abs v11.4s, v11.4s			\n\t"
        "add v10.4s, v10.4s, v11.4s	\n\t"
        :: [src] "r" (errors), "m" (errors[0])
        : "v10", "v11"
    );
}

__asm__ volatile (
    "mov w19, v10.s[3]	\n\t"
    "mov w20, v10.s[2]	\n\t"
    "mov w21, v10.s[1]	\n\t"
    "mov w22, v10.s[0]	\n\t"
    : "=r" (total_error_1),
      "=r" (total_error_2),
      "=r" (total_error_3),
      "=r" (total_error_4)
);

I re-organized the for loop so that I could perform the summation of errors at the same time using SIMD operations.

I benchmarked the inline assembly additions against the default implementation across the AArch64 servers: AArchie, BBetty, and CCharlie. As I’ve mentioned before, the FLAC project does not have any optimizations for ARM CPUs, so I modified the function to compile when the GNU C compiler is present and the machine is using an ARM CPU. You can find my changes here.

AArch64 Server Specifications

AArchie

Operating System: Fedora Release 27
CPU: Cortex-A57 octa-core CPU (2GHz)
Cache:
L1 80KiB
L2 1MiB
L3 8000KiB
Memory: 2x DDR3 800MHz 8GiB
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

BBetty/CCharlie

Operating System: Fedora Release 27
CPU: APM XGene-1 octa-core CPU (2.4GHz)
Cache: Unknown
Memory: 2x DDR3 1600MHz 8GiB
Flags: fp asimd evtstrm cpuid

Benchmark Setup

I created a random WAV file generator for my audio input, which you can find here. The random WAV file is seeded with the value 1 to produce the same audio output file for a given size. This way I could reuse the same dataset over multiple runs on the same machine for consistent results. It’s also worth mentioning that I didn’t use real audio because I’m broke and I don’t have CDs lying around to rip audio data from.
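My actual randwav source isn’t shown here, but the idea is small enough to sketch. Below is a minimal, hypothetical version (my own file name and variable names, assuming a little-endian host):

/* Hypothetical sketch of a seeded random WAV generator, not the actual
   randwav source. 44.1 kHz, 16-bit, 2 channels; length in seconds from argv. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static void write_u32(FILE *f, uint32_t v) { fwrite(&v, 4, 1, f); }  /* little-endian host assumed */
static void write_u16(FILE *f, uint16_t v) { fwrite(&v, 2, 1, f); }

int main(int argc, char **argv)
{
    uint32_t seconds  = (argc > 1) ? (uint32_t)atoi(argv[1]) : 600;
    uint32_t rate     = 44100, channels = 2, bits = 16;
    uint32_t data_bytes = seconds * rate * channels * (bits / 8);

    FILE *f = fopen("out.wav", "wb");
    if (!f)
        return 1;

    /* RIFF/WAVE header */
    fwrite("RIFF", 1, 4, f); write_u32(f, 36 + data_bytes); fwrite("WAVE", 1, 4, f);
    /* fmt chunk: PCM */
    fwrite("fmt ", 1, 4, f); write_u32(f, 16);
    write_u16(f, 1); write_u16(f, (uint16_t)channels);
    write_u32(f, rate); write_u32(f, rate * channels * (bits / 8));
    write_u16(f, (uint16_t)(channels * (bits / 8))); write_u16(f, (uint16_t)bits);
    /* data chunk: fixed seed, so the same size always produces the same file */
    fwrite("data", 1, 4, f); write_u32(f, data_bytes);
    srand(1);
    for (uint32_t i = 0; i < data_bytes / 2; i++) {
        int16_t sample = (int16_t)(rand() & 0xFFFF);
        fwrite(&sample, 2, 1, f);
    }
    fclose(f);
    return 0;
}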

The version of GCC: gcc version 7.3.1 20180130 (Red Hat 7.3.1-2) (GCC).

Flac has 9 different compression levels to choose from (0-8) and I decided to use compression level 5 (the default) which provides a decent balance between speed and compression.

Scripts

To make testing easier, I adapted the bash testing script I used for Lab 5 and slightly modified my Perl average.pl script to parse the timing data and compute a set of average times. You can find the bash script here and the Perl script here.

To run the script:

$ ./test.sh --file ~/bin/flac-132x-iasm --runs 10 -output data/release/iasm

Timing Data

To time the performance, I used the /usr/bin/time command (note the absolute path, to differentiate it from Bash’s built-in time command). I customized the output format so I can parse the data I need, and I use -a to append the results to an output file specified with the -o option.

$ /usr/bin/time -f "%Uuser %Ssystem %eelapsed" -a -o "$output_file" "$program" "${args}"
# example results
13.01user 0.63system 13.81elapsed
12.96user 0.72system 13.93elapsed
12.91user 0.74system 13.79elapsed

Test Files

I created four (4) randomly generated WAV files of various sizes. The audio data is stored at the standard rate of 44.1K samples per second, 16 bits per sample, 2 channels. To create a random WAV file that is 600 seconds long, I would run the following: $ ./randwav 600.

wav600.wav  101MB (10 mins)
wav1800.wav 303MB (30 mins)
wav3600.wav 606MB (1 hour)
wav7200.wav 1.2GB (2 hours)

My Dataset

data
├── profile
│   ├── default
│   │   ├── wav1800.txt
│   │   ├── wav3600.txt
│   │   ├── wav600.txt
│   │   └── wav7200.txt
│   └── iasm
│       ├── wav1800.txt
│       ├── wav3600.txt
│       ├── wav600.txt
│       └── wav7200.txt
└── release
    ├── default
    │   ├── wav1800.txt
    │   ├── wav3600.txt
    │   ├── wav600.txt
    │   └── wav7200.txt
    └── iasm
        ├── wav1800.txt
        ├── wav3600.txt
        ├── wav600.txt
        └── wav7200.txt

Benchmark Results on AArchie

Flac Encoding Speed on AArchie

Data Analysis

Okay, the results came out better than I expected. I was able to improve encoding speed on AArchie using the inline assembly changes by quite a bit.

  • I was able to shave off 100ms on the 101MB file
  • And I saved roughly 1 second on both the 303MB and 606MB files
  • Lastly, there’s slightly more than a 2 second improvement on the 1.2GB file

Benchmark Results on BBetty and CCharlie

Flac Encoding Speed on BBetty

Flac Encoding Speed on CCharlie

Wait, what just happened?

Hmmm. That just happened. So after the decent performance gains I got from testing on AArchie, I thought to myself, “This is great!” Even though I was unable to port over the X86 intrinsic optimizations to AArch64, I still managed to change the code in such a way to use SIMD operations to improve performance. So I was expecting the same results on the other servers but I was horribly mistaken.

As you can see from the charts, my inline assembly greatly degrades encoding performance for all file sizes, by several seconds or more, and in the worst case by more than 10 seconds. I was definitely not expecting these kinds of results. I ran the tests several more times, making sure the files were cached and accounting for other processes, but I kept getting the same results. I double-checked the source, build files, compiler flags, and even the disassembly to confirm my inline assembly was present, which it was. I can also rule out other processes on the BBetty machine, because I checked using the top command and found no other intensive processes running at the time.

The only thing that comes to mind is most likely slow memory access on BBetty and CCharlie. I used an array to store error values and then loaded those values into a vector register, so that I could perform parallel summation. A possible workaround is to fully change the entire for loop to use inline assembly in order to minimize reads and writes to memory. I will probably do it on my own time because there’s not enough time for me to make changes, test and debug them, and run performance tests on all machines and submit it for the project — this is the last day!
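If I do come back to it, one way to keep the intermediate errors in registers without hand-writing the whole loop in assembly would be NEON intrinsics. Here is a rough, untested sketch of that idea (my own variable names e1-e4 and acc, not code from FLAC):

#include <arm_neon.h>

int32x4_t acc = vdupq_n_s32(0);   /* running |error| sums for orders 1-4 */

for (i = 0; i < data_len; i++) {
    FLAC__int32 error = data[i];
    FLAC__int32 e1 = error - last_error_0;
    FLAC__int32 e2 = e1 - last_error_1;
    FLAC__int32 e3 = e2 - last_error_2;
    FLAC__int32 e4 = e3 - last_error_3;

    last_error_0 = error;
    last_error_1 = e1;
    last_error_2 = e2;
    last_error_3 = e3;

    total_error_0 += local_abs(error);

    /* pack the four scalar errors straight into a vector register,
       skipping the round trip through the errors[] array in memory */
    int32x4_t ev = { e1, e2, e3, e4 };
    acc = vaddq_s32(acc, vabsq_s32(ev));
}

total_error_1 = (FLAC__uint32)vgetq_lane_s32(acc, 0);
total_error_2 = (FLAC__uint32)vgetq_lane_s32(acc, 1);
total_error_3 = (FLAC__uint32)vgetq_lane_s32(acc, 2);
total_error_4 = (FLAC__uint32)vgetq_lane_s32(acc, 3);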

Oh well, at least I tried.

Final Thoughts

Stay tuned.

by jespiritutech at April 23, 2018 12:15 AM

April 22, 2018


Dan Epstein

Optimizing & Benchmarking SPO600 Project Stage 3

Recap on Stage 2

Previously, in stage 2, I performed multiple benchmark tests on SHA256deep with an altered build option of -O3 compared to the currently implemented -O2. I tested the time it takes to hash files of 10MB, 100MB, and 1GB. The tests took place on multiple servers with different hardware and configurations. The first server is AArchie, which uses the ARMv8 (aarch64) architecture. I then performed tests on the BBetty and CCharlie servers, which have the same architecture but more memory. I compared the benchmark results between AArchie and Xerxes (x86_64 architecture). Below are the benchmark results from stage 2. I have only included aarch64 and x86_64 for comparison because they are different architecture types.

10MB

Aarchie64 Server – O2

Process    Time
Real       0m0.092s
User       0m0.066s
Sys        0m0.016s

Xerxes x86_64 Server – O2

Process    Time
Real       0m0.094s
User       0m0.095s
Sys        0m0.006s

Aarchie64 Server – O3

Process    Time
Real       0m0.075s
User       0m0.067s
Sys        0m0.010s

Xerxes x86_64 Server – O3

Process    Time
Real       0m0.093s
User       0m0.094s
Sys        0m0.004s

100MB

Aarchie64 Server – O2

Process    Time
Real       0m0.759s
User       0m0.668s
Sys        0m0.099s

Xerxes x86_64 Server – O2

Process    Time
Real       0m0.861s
User       0m0.882s
Sys        0m0.048s

Aarchie64 Server – O3

Process    Time
Real       0m0.705s
User       0m0.662s
Sys        0m0.060s

Xerxes x86_64 Server – O3

Process    Time
Real       0m0.864s
User       0m0.870s
Sys        0m0.062s

1GB

Aarchie64 Server – O2

Process    Time
Real       0m7.690s
User       0m6.799s
Sys        0m1.013s

Xerxes x86_64 Server – O2

Process    Time
Real       0m8.762s
User       0m8.946s
Sys        0m0.490s

Aarchie64 Server – O3

Process    Time
Real       0m7.170s
User       0m6.698s
Sys        0m0.655s

Xerxes x86_64 Server – O3

Process    Time
Real       0m8.790s
User       0m8.938s
Sys        0m0.535s

Reflection

The fastest server based on the results is AArchie, with an improvement of 5.88% when hashing a 1GB file using the -O3 flag. There is almost no difference when hashing a small file. For Xerxes, the time decreased by about 0.38% for the 1GB file. I noticed that with larger files, -O3 seems to be the better option to use. Unfortunately, there is not much of an improvement for small files on any of the servers.

(Charts: percentage improvement for the Aarchie64 Server and for Xerxes x86_64.)

Then, in the next part of stage 2, I wanted to further optimize this project’s function sha256_update (sha256deep). Sadly I couldn’t, because the code already seems to be optimized. The reason I believe this is that, during my research, I found that memcpy is the fastest copy method in C. The other alternative to memcpy is inline assembler, which could be more efficient because you have more control over how the data is copied. The other sign that this code is mostly optimized is that it uses the right data types (uint8_t and uint32_t), which are best suited for storing small values.

Summary

As I mentioned in my previous blog, I think this could be better optimized by using inline assembler to replace the memcpy function, but since I don’t have much experience with assembly language, this couldn’t be implemented. The whole project experience was tough and challenging, but I do feel I learned a lot from it. The hardest part was finding a function that could potentially be optimized. I learned many techniques for evaluating a function and trying to optimize it using the optimizations we learned in class, different build options, and software profiling. However, I would need more experience and practice to fully optimize this project (that is, to use inline assembler).

I have decided not to submit a pull request to the hashdeep project repository because it seems it would take a very long time to get a response or for the changes to get accepted (there are pull requests still pending since February 2018). Therefore, there wouldn’t be enough time to get these changes accepted and report back. Overall, this was a great experience and hopefully these blogs can help any students who take SPO600 in the future.

by Dan at April 22, 2018 10:51 PM


Ruihui Yan

Project Initialization And Benchmarking

Because of problems encountered with FFmpeg, I have decided to shift the project to HandBrake. HandBrake is an open-source tool that converts videos from almost any format into formats compatible with modern devices.

First we update the list of packages available to install:

sudo apt-get update

 

Then we install the dependencies required for Ubuntu:

sudo apt-get install autoconf automake build-essential cmake git libass-dev libbz2-dev libfontconfig1-dev libfreetype6-dev libfribidi-dev libharfbuzz-dev libjansson-dev libmp3lame-dev libogg-dev libopus-dev libsamplerate-dev libtheora-dev libtool libvorbis-dev libx264-dev libxml2-dev m4 make patch pkg-config python tar yasm zlib1g-dev

 

Then we clone the HandBrake repository:

git clone https://github.com/HandBrake/HandBrake.git && cd HandBrake

 

Now, we build the package:

./configure --launch-jobs=$(nproc) --launch --disable-gtk

 

We added --disable-gtk to disable the graphical interface, since we won’t use it.

Once it’s built, we can find the HandBrakeCLI in ./build.

For our project, we are going to use the Deadpool 2 trailer:

General
Complete name               : \\XOXO\video\deadpool.mp4
Format                      : MPEG-4
Format profile              : Base Media / Version 2
Codec ID                    : mp42 (isom/iso2/avc1/mp41)
File size                   : 226 MiB
Duration                    : 2 min 30 s
Overall bit rate            : 12.6 Mb/s
Movie name                  : Deadpool 2 - thedigitaltheater.com
Encoded date                : UTC 2018-03-22 22:48:59
Tagged date                 : UTC 2018-03-22 22:48:59
Writing application         : HandBrake 1.0.7 2017040900

Video
ID                          : 1
Format                      : AVC
Format/Info                 : Advanced Video Codec
Format profile              : High@L4.1
Format settings             : CABAC / 4 Ref Frames
Format settings, CABAC      : Yes
Format settings, ReFrames   : 4 frames
Codec ID                    : avc1
Codec ID/Info               : Advanced Video Coding
Duration                    : 2 min 30 s
Bit rate                    : 12.0 Mb/s
Width                       : 1 920 pixels
Height                      : 802 pixels
Display aspect ratio        : 2.40:1
Frame rate mode             : Variable
Frame rate                  : 23.976 (24000/1001) FPS
Minimum frame rate          : 23.974 FPS
Maximum frame rate          : 23.981 FPS
Color space                 : YUV
Chroma subsampling          : 4:2:0
Bit depth                   : 8 bits
Scan type                   : Progressive
Bits/(Pixel*Frame)          : 0.325
Stream size                 : 215 MiB (95%)
Writing library             : x264 core 148 r2708 86b7198
Encoding settings           : cabac=1 / ref=2 / deblock=1:0:0 / analyse=0x3:0x113 / me=hex / subme=6 / psy=1 / psy_rd=1.00:0.00 / mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / 8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / chroma_qp_offset=-2 / threads=6 / lookahead_threads=1 / sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / bluray_compat=0 / constrained_intra=0 / bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / direct=1 / weightb=1 / open_gop=0 / weightp=1 / keyint=240 / keyint_min=24 / scenecut=40 / intra_refresh=0 / rc_lookahead=30 / rc=2pass / mbtree=1 / bitrate=12000 / ratetol=1.0 / qcomp=0.60 / qpmin=0 / qpmax=69 / qpstep=4 / cplxblur=20.0 / qblur=0.5 / vbv_maxrate=62500 / vbv_bufsize=78125 / nal_hrd=none / filler=0 / ip_ratio=1.40 / aq=1:1.00
Encoded date                : UTC 2018-03-22 22:48:59
Tagged date                 : UTC 2018-03-22 22:48:59
Color range                 : Limited
Color primaries             : BT.709
Transfer characteristics    : BT.709
Matrix coefficients         : BT.709

Audio
ID                          : 2
Format                      : AC-3
Format/Info                 : Audio Coding 3
Codec ID                    : ac-3
Duration                    : 2 min 30 s
Bit rate mode               : Constant
Bit rate                    : 640 kb/s
Channel(s)                  : 6 channels
Channel positions           : Front: L C R, Side: L R, LFE
Sampling rate               : 48.0 kHz
Frame rate                  : 31.250 FPS (1536 SPF)
Bit depth                   : 16 bits
Compression mode            : Lossy
Stream size                 : 11.5 MiB (5%)
Title                       : Surround / Surround
Language                    : English
Service kind                : Complete Main
Default                     : Yes
Alternate group             : 1
Encoded date                : UTC 2018-03-22 22:48:59
Tagged date                 : UTC 2018-03-22 22:48:59

The file can be downloaded here.

I will be converting the video using the x265 codec, which provides a better compression ratio without losing quality.

Here is the command I used for the conversion:

HandBrake/build/HandBrakeCLI -e x265  -i ~/deadpool.mp4 -o ~/output.mp4

Run 1:

Run 2:

We can see that the runtime for HandBrake to convert the video is around 14 minutes. My goal is to optimize it and improve that time.

by blacker at April 22, 2018 10:20 PM


Matteo Peruzzi

Observables and CORS

In some recent tinkering on one of my Angular apps, I came across a particular error I wasn’t expecting. I was in the process of building a service that would allow me to extract a summary of a Wikipedia article (an API request) when my requests kept getting denied.
The method I have been using to make API requests is to wrap the request in an Observable. While this has worked on APIs like the NHL’s, with Wikipedia there are some security restrictions holding my data back. In my console debugger I get the following message.

No 'Access-Control-Allow-Origin' header is present on the requested resource.

After some further digging, I found that this error comes about because my request didn’t have the proper headers required to access Wikipedia’s API, a restriction enforced by CORS.

What is CORS exactly? It stands for Cross-Origin Resource Sharing, and it acts as a barrier that restricts cross-origin HTTP requests made from scripts. CORS is generally used to help mitigate the security issues that arise from cross-origin HTTP requests. There are simple Chrome extensions that can be used to get around this, but from a developer’s perspective that doesn’t do us any good.

Properly getting around CORS with our Observables is unfortunately a little tricky. If an API supports it, one can instead make a JSONP (JSON with Padding) request in place of the plain GET. JSONP lets the code treat the API response as if it were a JavaScript script, and treating the request like a script sidesteps the CORS check.

So I converted my code to make use of JSONP:

getTeamWiki(name): Observable<any> {
  const urlRoute = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=&explaintext&titles=';
  return this.http.jsonp(urlRoute + name, 'JSONP_CALLBACK');
}

Unfortunately, while this did get rid of my Access-Control error, I’m now faced with a new error about MIME type checking being too strict, which is something I’ll have to leave for another blog post once I fix it.

It’s a little frustrating, since just pasting that URL into my search bar gets me the response in my browser. For example, this search for ‘Bee’ on Wikipedia gives exactly the result I want, but the server permissions are blocking my code from using it properly.

I’ll be posting this issue on various sites to see if I can get a community solution to my error.

by Peruzzi Blog at April 22, 2018 09:27 PM


Thomas Nolte

SPO600 Project Stage 3

Introduction

This blog is for wrapping up the final project and discussing the results and experience of the work accomplished. First I will show an overview of what I wanted to do with the project, then I will examine my results of the work I did and finally I will conclude on my experience with the project as a whole.

Read more on the project here.

Finding the Software

To start this project, I first wanted to find an open-source project that was CPU intensive. At first I tested a few text compression programs, but the nature of those projects is that the code is already fairly optimized to begin with. After a few failures I sought out different software to work on and found Kvazaar, an open-source video encoding project. Unlike the smaller compression libraries I had been looking through, Kvazaar is a large project with many source files. With tens of thousands of lines of code to go through, I decided to focus on this software, as the chances of finding places to optimize were high. With so many source files, I first had to benchmark the software to see where it might be slowing down. After a few tests with appropriate input files, the run time was long enough to give an accurate assessment of which functions I should focus on optimizing. After isolating six functions that took the large majority of the run time, I began to delve into the code.

Read more on the software and my strategy in my stage 1 blog.

Code Changes and Results

Upon further examination of the code, it was still clear that the code itself was written very well, and understanding the long-running functions would take a lot of time. The first thing I began to look for was whether any of the loops could be combined to avoid unneeded iterations. This ended in failure, as the code was already separated about as well as it could be, with the loops below reliant on results from the loops above. With these observations, I turned my attention to how some of the variables were being initialized and used. If variables were being initialized more often than needed, some simple repositioning of the code could impact the run time significantly. Likewise, if some variables only ever held constant values, simply inlining the value would remove the need to initialize them, with the added benefit of saving memory. These two approaches are where I found my strongest results, as the longest-running function had places to implement both changes.
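The snippet below is only a generic illustration of those two techniques; the function and variable names are made up and it is not Kvazaar's actual source.

/* Illustration only -- not Kvazaar code. */

/* Before: the loop-invariant value is recomputed on every iteration and the
   constant is read through a variable. */
void scale_costs_before(int *costs, int n)
{
    const int block_size = 8;            /* only ever holds 8 */
    for (int i = 0; i < n; i++) {
        int scale = block_size * 4;      /* invariant, recomputed each pass */
        costs[i] *= scale;
    }
}

/* After: the invariant is hoisted out of the loop and the constant is inlined,
   so the multiply folds away at compile time. */
void scale_costs_after(int *costs, int n)
{
    const int scale = 8 * 4;
    for (int i = 0; i < n; i++)
        costs[i] *= scale;
}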

The results of my changes looked like this:

TestTable.png

The changes I made improved the overall run time of the program by ~0.3% by making the longest-running function 6% faster. However, I believe the strongest result of my changes is the reduced variance between multiple runs. As video encoding software, I assume anyone using it will run it many times on multiple videos. With less variance in the run time, the estimated time to run the program many times becomes much more accurate. With these changes showing strong results, this is where I stopped my optimization work and decided to begin wrapping up the project.

To read more on my code changes read my stage 2 blog.

Committing to upstream

I’ve shown my results to the community and am currently waiting on a response. As the community does not run on the same schedule as the class, I have to blog about my results without completing this last piece of the project. Given how minor my changes to the code are, I believe my changes will be committed to the live version of the software.

Project Conclusion

This project really tested what I have learned in this course and in the last five semesters of college. Finding viable open-source software was the most time-consuming part of the project, as there are not a lot of ways to track down a CPU-intensive program that will build easily on AArch64. Once the benchmarking process was done, I thoroughly enjoyed going through the source files and analyzing which code was running the slowest. The code changes I made are easy to understand, and it is unsurprising that the original developer overlooked them, as they are simple readjustments that make the code less readable. This showed me that even after you write a program and get it to run, there is still good reason to go back and improve your code. I would have liked to look for further changes to show a stronger improvement in the overall run time, but I didn’t want to get in over my head just before the end of the semester. However, I am content with the results I got and look forward to getting my changes upstreamed.

by nolte96 at April 22, 2018 08:42 PM


Dennis Arul

Another Update

Alright, so after my previous attempt at optimizing with the -O3 flag was not successful, I started looking into other flags that could potentially speed up the program. In my search I found the -Ofast flag. -Ofast enables all of the -O3 optimizations, and it also enables optimizations that are not valid for all standards-compliant programs. Below are the changes I made to the makefile (screenshots: ofastchange2.PNG, ofastchange.PNG).

What I found interesting was that the runtimes seemed to fluctuate more than with the other two flags I had tried. On the first run I actually got a runtime very similar to the other two methods, so going forward I assumed I would get the same results as the test with the -O3 flag, but I found that was not the case (fast1.PNG).

The second run was faster than the first, but not by much; this was approximately the same timing I got for one of the -O3 runs. I understand that -Ofast does activate the -O3 optimizations, but it should not be almost identical, since a few more optimizations are activated by -Ofast.

fast2.PNG

Finally, for my 3rd run after the changes, I got a runtime in between the first and the second. As I mentioned previously, the timings were fluctuating, but I was unsure why. Even though it was slightly faster, the timing did not get much better.

fast3.PNG

Again my efforts at optimizing the program ended in failure. Going forward, I would like to find out what caused the fluctuations in the timings, to see if there is anything I can do to reduce them and maybe even get the program running faster.

by dparul at April 22, 2018 06:38 PM


Matthew Quan

Some thoughts on SPO600

In this blog I’m going to give my thoughts on SPO600.

SPO600 is a course on software portability and optimization.

The course is only available to CPA and CTY students, and the prerequisite for this course is IPC144.

Our prof, Chris Tyler, was very helpful in giving advice and guidance on our projects.

In the first half of the course we learned about the different ways of optimizing, and in the last half we spent our time trying to optimize a project. The course focused on the various types of optimization, like changing compiler options, doing vectorization/SIMD, or writing inline assembly.

Our prof talked about where the future of the industry was headed and that was one of the highlights of the course. Also, the pacing in the course was good considering it was slightly compressed due to the strike. I will say that this course is quite time consuming and not for people who want a quick and easy mark.

A suggestion for this course would be to have a Slack for discussing SPO600. I found that having a Slack in OSD600 was very helpful to me and allowed for better collaboration between classmates.

This course felt very different from my other programming courses in that it is very self-directed, and you need to be excited about whatever you are working on. I personally recommend this course to anyone who is looking to learn how to optimize programs.

by mattprogrammingblog at April 22, 2018 06:03 PM

SPO600 Stage 3: The End

In my last blog for SPO600, I’m going to be wrapping up my optimization journey with Brotli.

First, here is a better table of the results on AArchie:

  Run#1 Run#2 Total Relative to Benchmark
–O2 0m49.034s 0m49.065s 49.050s N/A
0m48.777s 0m48.691s 48.734s
-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone 0m49.201s 0m49.176s 49.189s 0.283384% increase
0m48.911s 0m48.888s 48.900s 0.340625% increase
-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre 0m49.536s 0m49.541s 49.539s 0.996942% increase
0m49.313s 0m49.363s 49.338s 1.23938% increase
-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops 0m49.271s 0m49.047s 49.159 0.222222% increase
0m49.054s 0m48.728s 48.891 0.322157% increase
-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops 0m49.062s 0m48.924s 48.993 0.116208% decrease
0m48.838s 0m48.677s 48.758 0.0492469% increase
-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone 0m48.967s 0m49.039s 49.003 0.0958206% decrease
0m48.729s 0m48.802s 48.766 0.0656626% increase
-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone 0m49.026s 0m48.989s 48.930 0.244648% decrease
0m48.833s 0m48.649s 48.741 0.0143637% increase

 

Bbetty:

  Run#1 Run#2 Total Relative to Benchmark
–O2 0m52.795s 0m52.760s 52.776 N/A
0m52.556s 0m52.476s 52.516
-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone 0m53.189s 0m53.224s 53.207 0.816659% increase
0m52.881s 0m52.196s 52.539s 0.0437962% increase
-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre 0m54.139s 0m54.198s 54.169s 2.63946% increase
0m53.926s 0m53.975s 53.951s 2.7325% increase
-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops 0m52.880s 0m52.774s 52.827s 0.0966348% increase
0m52.629s 0m52.559s 52.594s 0.148526% increase
-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops 0m52.853s 0m52.847s 52.850s 0.140215% increase
0m52.562s 0m52.537s 52.550s 0.0647422% increase
-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone 0m52.828s 0m52.848s 52.838s 0.117478% increase
0m52.534s 0m52.548s 52.541s 0.0476045% increase
-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone 0m52.780s 0m52.795s 52.788s 0.0227376% increase
0m52.578s 0m52.543s 52.561s 0.0856882% increase

 

Xerxes:

  Run#1 Run#2 Total Relative to Benchmark
–O2 0m41.442s 0m41.404s 0m41.423s N/A
0m41.200s 0m41.136s 0m41.168s
-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone 0m41.502s 0m41.402s 41.452s 0.0700094% increase
0m41.246s 0m41.143s 41.195s 0.0655849% increase
-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre 0m41.433s 0m41.433s 41.433s 0.0241412% increase
0m41.165s 0m41.169s 41.167s 0.00242907% decrease
-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops 0m41.379s 0m41.372s 41.376s 0.113464% decrease
0m41.109s 0m41.113s 41.111s 0.138457% decrease
-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops 0m41.984s 0m41.367s 41.676s 0.610772% increase
0m41.735s 0m41.123s 41.429s 0.633988% increase
-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone 0m41.407s 0m41.369s 41.388s 0.0844941% decrease
0m41.151s 0m41.136s 41.144s 0.0582977% decrease
-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone 0m41.337s 0m41.337s 41.337s 0.207614% decrease
0m41.065s 0m41.087s 41.076s 0.223475% decrease

 

The next thing I looked at was all the -O3 flags.

Here are all the flags that -O3 turns on:

-finline-functions
-funswitch-loops
-fpredictive-commoning
-fgcse-after-reload
-ftree-loop-vectorize
-ftree-loop-distribution
-ftree-loop-distribute-patterns
-floop-interchange
-fsplit-paths
-ftree-slp-vectorize
-fvect-cost-model
-ftree-partial-pre
-fpeel-loops
-fipa-cp-clone

 

I was curious about the individual impact of each of these compiler options (run on a 180 GB file on BBetty):

Compiler Options                 Run 1 (real)  Run 2 (real)  Avg      Relative to Benchmark
Benchmark                        52.790s       52.794s       52.792   N/A
-finline-functions               52.834s       52.834s       52.834   0.0795575% increase
-funswitch-loops                 52.895s       52.786s       52.842   0.0947113% increase
-fpredictive-commoning           52.591s       52.687s       52.639   0.289817% decrease
-fgcse-after-reload              52.792s       52.734s       52.763   0.0549326% decrease
-ftree-loop-vectorize            53.451s       53.661s       53.556   1.44719% increase
-ftree-loop-distribution         52.841s       52.732s       52.787   0.00947113% decrease
-ftree-loop-distribute-patterns  54.289s       54.123s       54.206   2.67844% increase
-floop-interchange               52.680s       52.800s       52.740   0.0984998% decrease
-fsplit-paths                    52.888s       52.793s       52.841   0.0928171% increase
-ftree-slp-vectorize             52.710s       52.836s       52.773   0.0359903% decrease
-fvect-cost-model                52.748s       52.820s       52.784   0.0151538% decrease
-ftree-partial-pre               52.902s       52.849s       52.876   0.159115% increase
-fpeel-loops                     52.828s       52.846s       52.837   0.0852402% increase
-fipa-cp-clone                   52.735s       52.867s       52.801   0.017048% increase

 

8 of the compiler options had a worse time than the benchmark. The two that stand out are -ftree-loop-distribute-patterns and -ftree-loop-vectorize; both produced a significant performance decrease. -ftree-loop-distribute-patterns performs loop distribution: it would split the ‘for loop’ into two separate loops. I believe splitting the loop was more expensive than just having it be one loop. The other option that results in performance degradation is -ftree-loop-vectorize, which tries to vectorize loops. After running with -fopt-info-vec-all, it reported that a method in hash_to_binary_tree_inc.h was not auto-vectorized. Having that code auto-vectorized would speed things up and possibly remove the performance hit of running this option.

This is the method:

On the other side, 6 of the compiler options had a better time than the benchmark.

The options are:

-fpredictive-commoning (predicts commonly used memory stores)

-fgcse-after-reload (performs a redundant load elimination pass after reload)

-ftree-loop-distribution (does loop distribution)

-floop-interchange (does loop interchange)

-ftree-slp-vectorize (performs basic vectorization on trees)

-fvect-cost-model (alter cost model for vectorization)

Reference:

https://linux.die.net/man/1/gcc

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

Most of the beneficial compiler options, though, gave only minor performance boosts.

Another interesting observation is that -ftree-loop-distribute-patterns is a massive performance hit while -ftree-loop-distribution is a minor performance increase. The difference is that -ftree-loop-distribute-patterns splits out pieces of a loop that match known patterns and replaces them with library calls (such as memset), while -ftree-loop-distribution performs loop distribution in general. It might be that the calls to the library take longer than normal loop distribution. The same thing occurs with -ftree-slp-vectorize vs -ftree-loop-vectorize: -ftree-loop-vectorize does loop vectorization while -ftree-slp-vectorize does basic-block vectorization.
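As a rough, generic illustration of the difference (my own toy loop, not Brotli's code), -ftree-loop-distribute-patterns conceptually rewrites something like this:

/* Original loop mixes an initialization pattern with other work. */
for (int i = 0; i < n; i++) {
    a[i] = 0;                 /* recognizable memset pattern */
    b[i] = c[i] + d[i];       /* remaining work */
}

/* After pattern distribution, the pattern becomes a library call and the
   rest stays in a smaller loop. */
memset(a, 0, n * sizeof a[0]);
for (int i = 0; i < n; i++)
    b[i] = c[i] + d[i];

Plain -ftree-loop-distribution, by contrast, would just split the original loop into two separate loops without introducing the memset call.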

I learned that some compiler options can result in worse performance times; I was not expecting that some options would decrease performance. I was also surprised that the thing that took the most time was collecting data for my project. When I was collecting data, it took a long time to build and test each configuration; even when I let it run for a couple of days it was only around 70-80% done. It also took longer than I expected to make a bash script that could automatically build and test every configuration. What worked for me was constantly working on this project and making small increments every day. Working on this project took a lot of time, and doing a bit every day helped me make progress.

In summary, there is a performance boost to be gained from some -O3 optimizations, but some of them have a very negative impact on time. Eight of the compiler options were negative and six were positive; the positive options only boost performance by a tiny percentage, while the negative ones can decrease performance by up to 3%. Brotli can be optimized, but even with the right options the performance improvement is negligible.

by mattprogrammingblog at April 22, 2018 05:53 PM


Arsalan Khalid

A braver release, 0.3, Getting through life and open source

It’s been tough; it’s a real grind right now. Not a test against effort, time, or desires, but a test against myself, mastering my will power. I set out on a journey: become a great coder. Without doubt, it’s something that is much more difficult to do than to say; it will always be easier to say it than to just do it. We can’t just do it though, we have distractions, which can really be anything that takes you away from what you enjoy, which in my case is just laying back, chilling, writing some code, and making some nice bread out of it. Up until recently I’ve been doing a lot of the management consulting style of work, you know: client discussions, wine and dine, requirements gathering. It’s a whole different side of experience when it comes to seeing a business. You get a shape for what a product is, what people, let alone clients, are really looking to buy. Not only when it comes to their own businesses, but selfishly it comes back to ourselves, and what benefit one gains through the development of ourselves. That’s why open source development is so powerful: it’s like you’re serving yourself by doing more software development, building your brand, but now you’re also building a project for anyone that uses software globally, AKA all humans.

For this time around, I’m trying to build on the development I’m doing on the Brave browser. It’s been a lot of CSS, which has made it easy to make an impact and write code, to hit the ground running (I’ve been jogging for a while). It’s been starting to get on my nerves a little bit, as I have the desire to do more intensive, impactful development. I suppose impactful isn’t exactly what I’m trying to say, because all code written for a project has its value; you don’t know when another developer will come along and pick it up, or even what a user may gain from it. But still, being stuck in this land:

Just imagine: I had to find something called enabledContent__grant_text and mess around with its attributes, and locating this specific style field was a trek on its own. All this styling is in JavaScript (the Aphrodite library) too, by the way; there are a lot of different libraries at play here, and even a preferences.properties file filled with all the different UI text fields, in many different languages. Absolutely profound, as all of this was contributed by the community supporting Brave! Just look at the number of files I had to mess around with to change a few text strings in their UI.

This can of course be tedious for development, but at its core I’ve learned that this is what a mature and vast UI looks like. I’ve really only seen smaller-scale projects, a bit of production-level services, but mostly servlets and back-end systems.

React react react, appropriately, that’s how you live life.

There are a lot of things going on in the above screenshot; what exactly is happening? It looks like there’s a function or set of logic being executed in this block to ‘render’ something. Usually that’s associated with synthesizing some form of picture, colour, or art into virtual existence out of thin air. Many people can likely recognize all of the JavaScript mixed in with a set of HTML tags here, especially when it starts with <div>, but what could throw many people off is the ‘<BrowserButton grouped item secondaryColor’ bit, the first thing one would see. First off, <BrowserButton> is a custom tag created within JSX, “a syntax extension to JavaScript. It is recommended to use it with React to describe what the UI should look like. JSX may remind you of a template language, but it comes with the full power of JavaScript”
 — JSX In Depth. Good reminder that the engineers at Facebook, and now the community, have developed this stateful, event-driven, composition web framework, which has humble beginnings in pure PHP, when Zuckerberg was writing code in his dorm room. The LAMP stack! You know, when he created it almost 15 years ago. Using JSX, a Component has been generated, with a corresponding state associated with it. Within this state various visual properties have been set, such as groupedItem and primaryColor. Now, saying this, acting like I’ve become a scholar in React: I’m still learning myself. Which is why it’s always best to be able to reach out to the community on Brave’s Discord (in late April on a Sunday afternoon) with something like:

That being said, when looking at this closely:

primaryColor sets the state’s colour property to the browser UI’s primary colour, which is their de facto lion orange. It also says that the ‘l10Id’ is set to recover, which translates to: get text from the browser’s text dictionary, where each of these IDs is associated with a few sentences in a selected language of choice. Since recover is set, this component will display the text associated with recover in the default GB-EN. Which makes me think of my eternal debate with my colleague, who says English will always be associated with the British, who invented English; that’s why it’s called English, not American or Canadian. Further, there is some logic with the recognizable onClick attribute: {this.recoverWallet}, which calls a method inside the broader ImmutableComponent, LedgerRecoveryFooter. If you’re picking up on the wording here, you’ll notice that this component is related to recovering something, which is, incidentally, restoring the browser’s ledger. As a feature of the browser, this is pretty cool, since it gives the user the ability to update any browser with a ledger of their choice, representing the user’s elapsed time on a Basic Attention Token publisher site. It’s all connected to the code we’re writing, to what these buttons in React are really doing. Finally, notice the last attribute being set in the Component, custom={styles.recover__button}, which gives this button a button-style view in Brave; it could also be replaced by a picture, icon, or text.

Looking at something as simple as just 4 lines of code, but then looking upwards at the real depth and reality of what this code actually means from a product sense, is profound. It’s what these simple buttons are doing, in combination with the development approach to making something as simple as a button: creating your own custom HTML tag, a state tied to it, setting how it will look, the colour, and the text, without touching the stylesheets or backend. Ladies and gentlemen, I’d say this is what scalable means! Check out more of my code for this pull request, which includes my third release, here: https://github.com/brave/browser-laptop/pull/13772

I’d say this sums up my short blab and learnings from web development within the Brave browser, and getting a closer look at what the real ‘ledger’ component, the famous ‘blockchain’ piece of the browser, is.

Thanks for tuning in!

by Arsalan Khalid at April 22, 2018 05:36 PM


Sanjit Pushpaseelan

SPO600- Failed assembly implementation

After spending the last two days trying to implement assembly language, I am sad to say I failed to produce any notable results with my implementation.

To start off, I re-benchmarked my results for MD5deep, since I was getting weird results when I ran some tests a few days ago (several of my tests were giving me 12s+ runtimes for some reason, which I found odd). I think the server was just busy or something, since my runtimes have gone back to normal. Anyway, I will post my new benchmarks so you have a reference for the upcoming results.

 

Luckily, my results seem to be back to normal. I am at an average runtime of 9.3 seconds which is basically what I got during my last testing.

 

Sadly, most of my attempts to implement assembly language ended in failure. Due to my inexperience with assembly language, I was unable to find the proper documentation for ARM assembly. Most of the assembly I found in rewolf I was unable to rewrite for ARM, since either I couldn’t find a way to convert it or I just couldn’t figure out what the code was doing. I’m not going to bother posting the code I attempted to convert into ARM assembly, but I will list what I was trying to do with each conversion.

 

  1. Convert the bit shifting functions (F1, F2, etc…) into assembly. I was never able to figure out how to properly move the registers to emulate what the bit shift functions were doing before.

  2. Write a function to process the message passed into MD5 by using state transformations.

 

I was unable to implement the functions above, but I was able to implement one other function.

 

#define DISPLAY(x,w) ( __asm__("ror %%cl,%0" : "=r" (x) : "0" (x), "c" (32 - n)); )

This code was interesting to write because I had no idea that ARM lacked some of the instructions that x86 has, so I had to get creative with this solution. Instead of rotating left like the original code, I rotate right all the way around to the other side, so the bits end up in the same place they would be if I had rotated left by the passed-in value. Below you will find my results after compiling with this function implemented.
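The equivalence being relied on here (AArch64 provides a ror instruction but no rol) can be sketched in plain C; this is just an illustration, not code from MD5deep:

#include <stdint.h>

/* Rotate a 32-bit value left or right by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* For any 0 < n < 32, rotl32(x, n) == rotr32(x, 32 - n),
   so a left rotation can always be expressed as a right rotation. */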

 

 

Sadly, as you can see, my runtime is around 9.3 seconds, the same as my initial run, which means my efforts were in vain.

 

Moving forward

 

Seeing as this project is due tomorrow, I will not have time to try a new plan of attack. I will be posting a blog post that summarizes the work I have done over the past two months and what I have taken away from it.

by sanjitps at April 22, 2018 02:13 PM


Justin Vuu

SPO600 – Stage 3 – Finale

Because I won’t be available for most of Sunday to write this blog, I’ll write it now.

This finale will be multi-part and summarize my progression through the project from start to finish.

Summary of Progression

I had difficulty finding a project to work on. I tried a variety of different projects on GitHub, but many of them had problems being installed on our servers, such as requiring immintrin.h. With stage 1’s deadline approaching, I hastily picked a project that I would end up abandoning, because the only way to benchmark it was infeasible. I later settled on HashRat after learning how to generate a large file of random data.

My approach to optimizing HashRat was akin to throwing things at a wall and seeing what sticks. I went through everything from changing the optimization level to trying inline assembler. My biggest breakthroughs, I think, were discovering the second Makefile that was responsible for compiling the algorithms, and learning how to enable the optimized version of the transform function.

 

Increasing the optimization level in the other Makefile had a small but noticeable improvement. It was enabling the optimized function that made the biggest improvement. This is because the function unrolled most of the loops. Unfortunately, my attempt at inline assembler turned out fruitless.

In the end, I created a branch that only enabled the optimized function by default. I didn’t want to use the O3 flag because I hadn’t tested it on the other algorithms used by HashRat. I submitted an issue and pull request, but so far haven’t seen a reply.

Analysis of Results

Running the original build on Xerxes, BBetty, and AArchie produced different run times. This was expected, especially for BBetty, because of their different specifications and architecture. Xerxes and AArchie hashed a 3GB file in about 36.9 seconds, while BBetty was the slowest taking 45.5 seconds.

With just the O3 flag in the proper Makefile, I saw a very small improvement. On AArchie, there was about a half second improvement. Xerxes saw a greater improvement of about 1 second. I didn’t test this approach on BBetty.

With the unrolled function being used in addition to -O3, all 3 servers saw a significant increase in performance. Interestingly, the improvements were significantly better on the AArch64 servers compared to Xerxes. Xerxes was only 11% faster in total; AArchie, on the other hand, shaved a quarter off its time.

So how is it unrolled? Let me link a gist of the functions so you can see the differences: https://gist.github.com/jevuu/837bf8f5e7cbbe94ecd52f5c991ee69c

In the slower version, the loops are iterated 16 and 64 times respectively. This will likely produce many jump instructions. Values of the variables are also being changed at the end of each iteration.

In the unrolled version, the loops are only iterated 2 and 8 times respectively. Preprocessor macros that take in values like a function are expanded 8 times per iteration. Because the preprocessor replaces every reference to the macro with its body, it’s essentially like 8 inlined functions. This also means the program doesn’t need to swap the values of variables; it simply changes the order they’re passed in. A generic sketch of this pattern follows the screenshots below.

(Screenshots: the unrolled loop iterating 2 times, as opposed to the original loop iterating 16 times.)
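As a generic sketch of the pattern (a toy example of my own, not HashRat's actual transform code):

#include <stdint.h>

/* Toy round macro standing in for one step of the real transform. */
#define ROUND(a, b, w) ((a) += ((b) ^ (w)))

/* Rolled: 16 iterations, with index math and a branch on every pass. */
void rolled(uint32_t state[4], const uint32_t block[16])
{
    for (int i = 0; i < 16; i++)
        ROUND(state[i % 4], state[(i + 1) % 4], block[i]);
}

/* Unrolled: 2 passes of 8 expanded macro "calls" each. Rotating the state is
   done by reordering the arguments instead of swapping values, and most of
   the branches and index math disappear. */
void unrolled(uint32_t state[4], const uint32_t block[16])
{
    for (int i = 0; i < 16; i += 8) {
        ROUND(state[0], state[1], block[i + 0]);
        ROUND(state[1], state[2], block[i + 1]);
        ROUND(state[2], state[3], block[i + 2]);
        ROUND(state[3], state[0], block[i + 3]);
        ROUND(state[0], state[1], block[i + 4]);
        ROUND(state[1], state[2], block[i + 5]);
        ROUND(state[2], state[3], block[i + 6]);
        ROUND(state[3], state[0], block[i + 7]);
    }
}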

Reflection

I think this is possibly the hardest project I’ve ever been given in my academic career.

I think I could have improved my process if I had a better understanding of how to optimize. I’ve honestly felt very lost throughout the course, and this project was very daunting. I think I only have myself to blame because the professor gave every opportunity to clarify topics covered in class and help with projects.

It was great that I managed to find something that helped improve performance, and it was right there in the code.

However, I wish I was able to figure out assembler. The reason I dared to try inline assembler was that I wanted to push myself. I was inspired by someone else’s attempt at rewriting the same function in assembler. Unfortunately, I couldn’t figure it out and didn’t make any progress on that side.

I also should have attempted to contact the project’s author from the beginning. I forked the project and worked on it without ever trying to communicate with them. I feel that this might have made my issue and pull request come off as rude.

I also should have looked for a more active project. Having an actual community or active authors would have helped me greatly in understanding their project, where it needs improvement, and how to improve it.

One takeaway from this course is that I should always review my code thoroughly to check for unnecessary calculations and unused variables, and to minimize the use of loops with many iterations.

 

 

by justosd at April 22, 2018 09:18 AM


Matt Rajevski

Part 2[update]

So I’ve gone through most of the major functions that the benchmark program covered and did some individual tests on them before and after I made some changes to the source code.

As I mentioned before, the functions had high efficiency ratings, and that really showed when looking at the source code. Everything was done in a very clean and efficient way. It is compression software, so the logic behind it is complex by nature, but I didn’t expect it to be this complex.

Here are some examples of highly complex code found in the major functions:

// LZMA2 decode to Buffer //
SRes Lzma2Dec_DecodeToBuf(CLzma2Dec *p, Byte *dest, SizeT *destLen, const Byte *src, SizeT *srcLen, ELzmaFinishMode finishMode, ELzmaStatus *status)
{
  SizeT outSize = *destLen, inSize = *srcLen;
  *srcLen = *destLen = 0;
  for (;;)
  {
    SizeT srcSizeCur = inSize, outSizeCur, dicPos;
    ELzmaFinishMode curFinishMode;
    SRes res;
    if (p->decoder.dicPos == p->decoder.dicBufSize)
      p->decoder.dicPos = 0;
    dicPos = p->decoder.dicPos;
    if (outSize > p->decoder.dicBufSize - dicPos)
    {
      outSizeCur = p->decoder.dicBufSize;
      curFinishMode = LZMA_FINISH_ANY;
    }
    else
    {
      outSizeCur = dicPos + outSize;
      curFinishMode = finishMode;
    }

    res = Lzma2Dec_DecodeToDic(p, outSizeCur, src, &srcSizeCur, curFinishMode, status);
    src += srcSizeCur;
    inSize -= srcSizeCur;
    *srcLen += srcSizeCur;
    outSizeCur = p->decoder.dicPos - dicPos;
    memcpy(dest, p->decoder.dic + dicPos, outSizeCur);
    dest += outSizeCur;
    outSize -= outSizeCur;
    *destLen += outSizeCur;
    if (res != 0)
      return res;
    if (outSizeCur == 0 || outSize == 0)
      return SZ_OK;
  }
}
// Part of a BZip2 encode function that is 285 lines long //
    {
      UInt32 remFreq = numSymbols;
      unsigned gs = 0;
      unsigned t = numTables;
      do
      {
        UInt32 tFreq = remFreq / t;
        unsigned ge = gs;
        UInt32 aFreq = 0;
        while (aFreq < tFreq)
          aFreq += symbolCounts[ge++];
        if (ge > gs + 1 && t != numTables && t != 1 && (((numTables - t) & 1) == 1))
          aFreq -= symbolCounts[--ge];
        
        Byte *lens = Lens[t - 1];
        unsigned i = 0;
        do
          lens[i] = (Byte)((i >= gs && i < ge) ? 0 : 1);
        while (++i < alphaSize);
        gs = ge;
        remFreq -= aFreq;
      }
      while (--t != 0);
    }
// Deflate bit reversal function //
NO_INLINE void Huffman_ReverseBits(UInt32 *codes, const Byte *lens, UInt32 num)
{
  for (UInt32 i = 0; i < num; i++)
  {
    UInt32 x = codes[i];
    x = ((x & 0x5555) << 1) | ((x & 0xAAAA) >> 1);
    x = ((x & 0x3333) << 2) | ((x & 0xCCCC) >> 2);
    x = ((x & 0x0F0F) << 4) | ((x & 0xF0F0) >> 4);
    codes[i] = (((x & 0x00FF) << 8) | ((x & 0xFF00) >> 8)) >> (16 - lens[i]);
  }
}

Despite how well optimized the code already is, I was able to find a few very small optimizations of my own.

In 7zCrc.c I found a strength-reduction opportunity in CrcGenerateTable()

 
// Before //
  for (; i < 256 * CRC_NUM_TABLES; i++)
  {
     UInt32 r = g_CrcTable[i - 256];
     g_CrcTable[i] = g_CrcTable[r & 0xFF] ^ (r >> 8);
  }
// After //
  UInt32 j = 256 * CRC_NUM_TABLES;
  for (; i < j; i++)
  {
     UInt32 r = g_CrcTable[i - 256];
     g_CrcTable[i] = g_CrcTable[r & 0xFF] ^ (r >> 8);
  }

This hoists the multiplication out of the loop condition, which on paper saves one multiplication per iteration (at least 256, depending on the value of CRC_NUM_TABLES). Unfortunately, the results were the same as before the change; most likely the compiler already folds 256 * CRC_NUM_TABLES into a single constant at compile time, since CRC_NUM_TABLES is a fixed macro.

// CRC32 benchmark results post changes //
[mrrajevski@aarchie p7zip_16.02]$ 7za b "-mm=CRC32"

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs LE)

LE
CPU Freq:  1995  1999  1999  1998  1999  1998  1998  1999  1999

RAM size:   16000 MB,  # CPU hardware threads:   8

Size      1      2      3      4      6      8

10:     640   1280   1920   2558   3775   4274
11:     644   1289   1933   2577   3786   4139
12:     646   1293   1939   2586   3854   4413
13:     647   1295   1938   2584   3779   4199
14:     647   1283   1934   2579   3739   4373
15:     642   1292   1913   2555   3817   4205
16:     642   1285   1924   2564   3830   4171
17:     642   1282   1927   2561   3790   4266
18:     643   1287   1929   2566   3834   4244
19:     644   1288   1930   2519   3778   4147
20:     642   1266   1897   2522   3763   3644
21:     638   1265   1894   2513   3712   3756
22:     636   1262   1887   2487   3715   4266
23:     633   1258   1882   2509   3729   4211
24:     628   1256   1885   2512   3719   4285

Avg:    641   1279   1916   2546   3775   4173

In Lzma2Dec.c I found an if statement that compares against a product of fixed constants.

// Before //
if (b >= (9 * 5 * 5))
// After //
if (b >= (225))

While this is only an if statement, the function containing it is called within a while loop, so it has the potential to save a couple of calculations per call. Unfortunately, the results were the same as before the change; again, the compiler almost certainly evaluates the constant expression 9 * 5 * 5 at compile time, so the generated code is identical.

// LZMA2 benchmark results post change //
[mrrajevski@aarchie p7zip_16.02]$ 7za b "-mm=LZMA"

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs LE)

LE
CPU Freq:  1994  1993  1999  1999  1998  1998  1999  1999  1999

RAM size:   16000 MB,  # CPU hardware threads:   8
RAM usage:   1765 MB,  # Benchmark threads:      8

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       9646   511   1838   9384  |     130144   563   1973  11101
23:       9897   558   1807  10085  |     127691   560   1974  11050
24:       9272   551   1811   9969  |     124665   557   1966  10942
25:       9002   548   1875  10279  |     114883   526   1945  10224
----------------------------------  | ------------------------------
Avr:             542   1833   9929  |              551   1965  10829
Tot:             547   1899  10379

 

The last thing I managed to find was in BZip2Encoder.cpp

// Before //
  for (unsigned i = 0; i < 4; i++)
     WriteByte2(((Byte)(v >> (24 - i * 8))));
// After //
  for (unsigned i = 0; i < 32; i += 8)
     WriteByte2(((Byte)(v >> (24 - i))));

This removes the multiplication from inside the loop. This function is called within a recursive function, so there is potential to save four calculations per recursive call. But yet again the improvements, if any, were negligible.

// BZip2 benchmark results post change //
[mrrajevski@aarchie p7zip_16.02]$ 7za b "-mm=BZip2"

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs LE)

LE
CPU Freq:  1998  1998  1999  1999  1999  1999  1998  1999  1999

RAM size:   16000 MB,  # CPU hardware threads:   8
RAM usage:   1765 MB,  # Benchmark threads:      8

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:      16312   661   1491   9855  |      60175   661   1034   6837
23:      15738   654   1453   9509  |      57310   646   1024   6613
24:      15143   640   1429   9149  |      55011   643   1001   6441
25:      15673   675   1402   9469  |      53924   649    987   6403
----------------------------------  | ------------------------------
Avr:             658   1444   9496  |              650   1012   6574
Tot:             654   1228   8035

So far I have been unable to find any improvements to the program that are noticeable, but I will continue to search for more and update with what I find.

by mrrajevski at April 22, 2018 06:16 AM


Qichang Chen

SPO600 - Project - Stage Three

In this last stage of my SPO600 project, since I don't have results suitable for upstreaming, I am going to wrap up my project and do a more thorough technical analysis of my results.

First of all, I am going to summarize what I did for my project. (If you want to go over the details, you can see my previous posts.)
I picked a piece of software called SSDUP, a traffic-aware SSD burst buffer for HPC systems. I noticed that it uses three different MurmurHash3 hash functions: the first two are optimized for x86 platforms and the third is optimized for x64 platforms. I also noticed that it is compiled with 'gcc -std=gnu99'. To make these three hash functions easier to handle, I split them into three files and tested them separately on AArch64 and x86_64 systems.
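As a rough illustration of how each split-out file was exercised (this is a sketch, not the project's actual harness - the buffer size, seed, and output are assumptions; only the MurmurHash3_x86_32 prototype comes from the MurmurHash3 source), a benchmark driver could look like this:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

/* prototype from the split-out MurmurHash3 file */
void MurmurHash3_x86_32(const void *key, int len, uint32_t seed, void *out);

int main(void)
{
    int len = 100000000;                /* assumed input size */
    char *buf = malloc(len);
    for (int i = 0; i < len; i++)
        buf[i] = (char)(i & 0xFF);      /* fill with repeatable data */

    uint32_t hash;
    clock_t start = clock();
    MurmurHash3_x86_32(buf, len, 42, &hash);
    clock_t end = clock();

    printf("hash = %u, time = %f s\n", hash,
           (double)(end - start) / CLOCKS_PER_SEC);
    free(buf);
    return 0;
}

The same driver is then built once without optimization and once with -O3 to produce the two columns in the tables below.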

Since the professor said my results in stage two were hard to read, I am going to show my results again in table format.

First hash function (MurmurHash3_x86_32): the execution time with -O3 is about 802% faster than without it:

                             without -O3 option    with -O3 option
No code changes              14.117                1.572
Code changes: i+i and len    14.035                N/A

Second hash function (MurmurHash3_x86_128): the execution time with -O3 is about 891% faster than without it:

                             without -O3 option    with -O3 option
No code changes              13.332                1.338
Code changes: i+i and len    13.543                N/A

Third hash function (MurmurHash3_x64_128): the execution time with -O3 is about 523% faster than without it, and the code change is about 0.04% faster:

                             without -O3 option    with -O3 option
No code changes              8.179                 1.315
Code changes: i+i and len    8.137                 N/A

All of the tests were first completed on an AArch64 system. My first step in optimizing the hash functions was to compile my benchmark program with the -O3 compilation option. The first two hash functions, which are optimized for x86 platforms, showed a significant improvement in performance. The third hash function, which is optimized for x64 platforms, showed a smaller improvement than the first two after compiling with -O3. My second step in optimization was to change some code in the third function, which made it 0.04% faster than without the code changes.

Afterward, I ran the benchmark program on an x86_64 system, and it turned out that there was also a significant improvement in performance when compiling with the -O3 option. However, the improvement for the third function on the AArch64 system was not as large as on the x86_64 platform. As a result, compiling with the -O3 option produces the best performance and is the most optimized case.

by Frankie Chen (noreply@blogger.com) at April 22, 2018 05:44 AM


Yalong Li

OSD Release 0.3 final post

While trying to fix this issue in debugger.html, I found some other bugs. Both issues are related to the "Set directory root" menu button in the left side panel of the debugger. They are not in the issues tab.

The first issue occurs when there is a webpack folder. When trying to set a sub-directory as the root, the contents of the folder go missing, but when setting it directly, the contents are rendered. This happens because webpack sources have a different URL than ordinary ones.

Issue 1 - STR:
  1. Go to https://firefox-debugger-example-react-js.glitch.me/
  2. On the left panel, right click on "Webpack" folder and Click on "Set directory root"
  3. Then right click on "app" folder and Click on "Set directory root" ( notice the content is missing ).

Issue 2 - STR:
  1. Go to https://davidwalsh.name/
  2. On the left panel, right click on "davidwalsh.name" and Click on "Set directory root"
  3. Then expand the subfolders; right click on "libs" and Click on "Set directory root" ( notice the content is missing ).

(Screenshots: issue 1 and issue 2.)
I fixed both issues and added test coverage to the code. It was a learning process debugging these issues. The pull request can be found here.

by Yalong (noreply@blogger.com) at April 22, 2018 04:24 AM


Geoffrey Wu

SPO 600 Project - Stage 1

For the SPO600 course, I need to complete a project: identify a CPU-intensive function/method - let's say a checksum or hash algorithm - that is implemented in a language that compiles to machine code (e.g. C, C++, or Assembler) in an open-source project, improve its performance, and then get that optimization accepted upstream. For more information, click here.

Find an open source package

For this project, I chose Tengine, a fork of the web server nginx developed by Taobao. When I was reading its source code on GitHub, I noticed that it also uses MurmurHash, just like nginx. Since MurmurHash itself is a simple hash function, it would be difficult to improve (I still remember my professor once saying that a simple program is the most difficult one to optimize), but I decided to accept the challenge, because why not?

Benchmark the performance on AArch64

As the function is very simple, testing its performance was much simpler than I imagined: I just extracted the function from the file, and then wrote a small tester to exercise it.
I checked the Tengine repository, and it shows that the project is built with the -O2 compiler flag for optimization. Therefore, my test is also compiled with -O2. Also, to avoid any possible errors in my measurements (sometimes a program will run faster the second time it's executed thanks to the cache), I run the test 10 times.

Here is the source code:


#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <sys/types.h>   /* for u_char */

#define DATA_LENGTH 500000

uint32_t
ngx_murmur_hash2(u_char *data, size_t len)
{
    uint32_t h, k;

    h = 0 ^ len;

    while (len >= 4)
    {
        k = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len)
    {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}

int main()
{
    u_char tmp;
    u_char *data;
    srand(time(NULL));

    data = calloc(DATA_LENGTH, sizeof(u_char));
    for (int i = 0; i < DATA_LENGTH; i++)
    {
        tmp = (u_char)((rand() % 26) + 'a');
        data[i] = tmp;
    }

    for (int j = 0; j < 10; j++)
    {
        clock_t start = clock();
        uint32_t result = ngx_murmur_hash2(data, sizeof(u_char) * DATA_LENGTH);
        clock_t end = clock();
        printf("Run #%d\n", (j+1));
        clock_t diff = end - start;
        double msec = (double)diff / CLOCKS_PER_SEC;
        printf("Time taken: %lf seconds\n", msec);
    }

    free(data);
}

What this tester does is very simple: it creates an array of 500000 random characters, starts the timer, sends the array to the hash function and records the result, stops the timer, prints the total time, and repeats the above steps 10 times.

This is the final result:
Run #1
Time taken: 0.000002 seconds
Run #2
Time taken: 0.000001 seconds
Run #3
Time taken: 0.000001 seconds
Run #4
Time taken: 0.000002 seconds
Run #5
Time taken: 0.000001 seconds
Run #6
Time taken: 0.000001 seconds
Run #7
Time taken: 0.000001 seconds
Run #8
Time taken: 0.000001 seconds
Run #9
Time taken: 0.000002 seconds
Run #10
Time taken: 0.000002 seconds

Impressive, most impressive. The results look quite uniform.

Identifying the strategies

After benchmarking the performance of the hash function, I came up with some strategies:

1. Altered build options - since it is using -O2 optimization, maybe it would be faster if I try -O3 optimization instead. It is also the easiest option, so I don't see why I shouldn't give it a try.
2. Code changes to permit better optimization by the compiler - judging from the code itself, I think I can rewrite it in a way that the compiler would auto-vectorize (SIMD) a few operations. 
3. Algorithm improvements - as it is already quite optimized, I doubt if I can improve it further.
4. In-line assembler - personally I am very reluctant to use In-line assembler (since it kills portability), but if that is the only way I can make it faster, then so be it.

 

by Wu Geoffrey (noreply@blogger.com) at April 22, 2018 03:42 AM

SPO 600 Project - Stage 3

In this post, I will discuss my findings during Stage 3 of the project. For my previous stages, click here and here.

Before I start, I think I should mention how much the performance improved in percentage terms. Here is the result:
[GeoffWu@aarchie tester]$ ./tester
Run #1
original version took 33339704 nanoseconds
modified version took 33237179 nanoseconds
hash match: y
difference between two versions: 102525 nanoseconds
percentage difference: 0.307516
Run #2
original version took 33302201 nanoseconds
modified version took 33231595 nanoseconds
hash match: y
difference between two versions: 70606 nanoseconds
percentage difference: 0.212016
Run #3
original version took 33366103 nanoseconds
modified version took 33270939 nanoseconds
hash match: y
difference between two versions: 95164 nanoseconds
percentage difference: 0.285212
Run #4
original version took 33349119 nanoseconds
modified version took 33346432 nanoseconds
hash match: y
difference between two versions: 2687 nanoseconds
percentage difference: 0.008057
Run #5
original version took 33358631 nanoseconds
modified version took 33268834 nanoseconds
hash match: y
difference between two versions: 89797 nanoseconds
percentage difference: 0.269187
Run #6
original version took 33353199 nanoseconds
modified version took 33301913 nanoseconds
hash match: y
difference between two versions: 51286 nanoseconds
percentage difference: 0.153766
Run #7
original version took 33249315 nanoseconds
modified version took 33269754 nanoseconds
hash match: y
difference between two versions: -20439 nanoseconds
percentage difference: -0.061472
Run #8
original version took 33306561 nanoseconds
modified version took 33333888 nanoseconds
hash match: y
difference between two versions: -27327 nanoseconds
percentage difference: -0.082047
Run #9
original version took 33376007 nanoseconds
modified version took 33293441 nanoseconds
hash match: y
difference between two versions: 82566 nanoseconds
percentage difference: 0.247381
Run #10
original version took 33342424 nanoseconds
modified version took 33367455 nanoseconds
hash match: y
difference between two versions: -25031 nanoseconds
percentage difference: -0.075073

Although the percentage difference is very marginal (around 0.1 to 0.3%), considering that Tengine was originally designed for sites like Taobao, a Chinese online shopping website that is expected to handle hundreds of gigabytes of network traffic every single second (again, that page is written in Chinese), I believe even the smallest improvement can make a huge difference.

After I optimized the MurmurHash and passed all the unit tests, I opened a pull request, signed the CLA and submitted my code to Alibaba, though they still haven't let me know whether they accepted or rejected my patch.

Reflection

I first heard of Tengine around 2016, after I installed Wappalyzer on my Firefox basically just for my amusement. While spending my spare time watching videos on Bilibili (think of it as the Chinese equivalent of YouTube), I noticed a strange-looking T in Wappalyzer. Driven by curiosity, I clicked on it, and that's how I found this web server.

I have to admit that at the beginning of the project I was not very sure I could handle it, since I had no experience with open source before. But as it turns out, it is not as intimidating as it seems, and it was more enjoyable and fulfilling than I imagined (I still can't forget how happy I was when I finally managed to get the unit tests running).

Overall, Tengine is a decent program to work with - though I would appreciate it if they wrote more detailed documentation on the unit tests - in particular, what kind of dependencies they require. While Lua/LuaJIT isn't a dependency required by Tengine itself, the unit tests do require it, as they test the ngx_http_dyups_module, a module that is not compiled into Tengine by default. That module also requires the ngx_http_lua_module and either Lua or LuaJIT.

by Wu Geoffrey (noreply@blogger.com) at April 22, 2018 03:42 AM

SPO 600 Project - Stage 2

In this post, I will discuss my findings during Stage 2 of the project. For more details, click here.

Before I talk about how I optimized the MurmurHash2 algorithm, I would like to answer a question that I forgot to address in my stage 1 post - where is that algorithm used? After doing some research, I found out that Tengine, just like nginx, uses MurmurHash2 in the ngx_http_split_clients_module, which is often used for A/B testing.

Alright then, let's talk about how I optimized the algorithm. First of all, I think I should mention the optimization strategies I eliminated:
  • In-line assembler: Given that Tengine is meant to support a variety of platforms and CPU architectures, I honestly don't think it is a good option to use it - unless Jack Ma would give me a job afterwards :)
  • Altered build options: At first I considered this option, since it is the easiest one, and planned to change the compile options to -O3/-Ofast, but after checking the cc directory of Tengine on GitHub (which lists all the compile options for all the compilers it supports), I quickly noticed that it can still be compiled using some legacy compilers - the Compaq C compiler and the Open Watcom C compiler, to name a few - and those are things I don't want to mess with, as I don't want to break backward compatibility.
  • Code changes to permit better optimization by the compiler: I also briefly considered this option, i.e. changing some parts of the algorithm so it can take advantage of auto-vectorization and SIMD, though I had a concern: what if the software is running on a platform that doesn't support SIMD and/or auto-vectorization? Also, to really take advantage of auto-vectorization and SIMD, my best option would be to compile it with -O3, which could potentially wreck the entire software (I assume Igor Sysoev chose -O2 for the original nginx for the same reason).
So, the safest (and the only possible) option for me is Algorithm improvements.

Improve the algorithm 

Since MurmurHash itself is already a simple algorithm, at first I thought there was no way I could optimize it further (for this reason, I even briefly considered choosing the md5 algorithm instead, but I later dropped the idea). But after paying SMHasher a visit, I realized I was wrong. I scrolled to MurmurHash2 and did a quick comparison with the ngx_murmurhash.c file, and I noticed it is using the endian- and alignment-neutral version. An idea quickly popped into my mind: what if I combine the original and the endian- & alignment-neutral versions, so that the code can switch to the original when the platform is already little-endian?

Then another problem arose: is there a compiler-independent way to detect the endianness of a CPU? I know that in GCC, <sys/param.h> and <endian.h> provide macros for detecting endianness, but I think it is unfair to penalize users with a slow algorithm only because they aren't using GCC, aren't on a Unix-like OS, or don't have the GNU C library. That means I had to figure out another way to detect the endianness.
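For reference, the kind of compiler-specific check I decided against looks roughly like this (a sketch using predefined macros that GCC and Clang provide; the macro name PLATFORM_LITTLE_ENDIAN is just for illustration):

/* Sketch: compile-time endianness check using GCC/Clang predefined macros.
   Other compilers may not define __BYTE_ORDER__, which is exactly the
   portability problem described above. */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#  define PLATFORM_LITTLE_ENDIAN 1
#else
#  define PLATFORM_LITTLE_ENDIAN 0
#endif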

While I was browsing around Tengine's directory on GitHub (largely because I was bored), I noticed a file called endianness - a small script that checks the endianness of the CPU, and creates the NGX_HAVE_LITTLE_ENDIAN macro if the CPU is little-endian.

Thanks to that, I quickly wrote a proof-of-concept of my improved version and its tester, with the original version serving as the comparison. This time I used a 50000000-character string, as recommended by my professor, Chris Tyler:


#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <sys/types.h>   /* for u_char */

#define DATA_LENGTH 50000000
#define BILLION 1000000000L

uint32_t
ngx_murmur_hash2(u_char *data, size_t len)
{
    uint32_t h, k;

    h = 0 ^ len;

    while (len >= 4)
    {
        k = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len)
    {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}

int ngx_check_endianness()
{
    int i = 0x11223344;
    char *p;

    p = (char *) &i;
    if (*p == 0x44) return 0;
    return 1;
}

uint32_t
ngx_murmur_hash2_mod(u_char *data, size_t len)
{
    uint32_t h, k;

    h = 0 ^ len;

    while (len >= 4)
    {
        if (ngx_check_endianness() == 0)
        {
            k = *((uint32_t *)data);
        }
        else
        {
            k = data[0];
            k |= data[1] << 8;
            k |= data[2] << 16;
            k |= data[3] << 24;
        }

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len)
    {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}

int main()
{
    u_char tmp;
    u_char *data;
    struct timespec tval_before, tval_after;

    srand(time(NULL));

    data = calloc(DATA_LENGTH, sizeof(u_char));
    for (int i = 0; i < DATA_LENGTH; i++)
    {
        tmp = (u_char)((rand() % 26) + 'a');
        data[i] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &tval_before);
    uint32_t result = ngx_murmur_hash2(data, sizeof(u_char) * DATA_LENGTH);
    clock_gettime(CLOCK_MONOTONIC, &tval_after);
    uint64_t timedif = BILLION * (tval_after.tv_sec - tval_before.tv_sec) + tval_after.tv_nsec - tval_before.tv_nsec;
    printf("original version took %llu nanoseconds\n", (long long unsigned int) timedif);

    struct timespec tval_before2, tval_after2;
    clock_gettime(CLOCK_MONOTONIC, &tval_before2);
    uint32_t result2 = ngx_murmur_hash2_mod(data, sizeof(u_char) * DATA_LENGTH);
    clock_gettime(CLOCK_MONOTONIC, &tval_after2);
    uint64_t timedif2 = BILLION * (tval_after2.tv_sec - tval_before2.tv_sec) + tval_after2.tv_nsec - tval_before2.tv_nsec;
    printf("modified version took %llu nanoseconds\n", (long long unsigned int) timedif2);

    char match;

    if (result == result2)
    {
        match = 'y';
    }
    else
    {
        match = 'n';
    }

    printf("hash match: %c\ndifference between two versions: %lli nanoseconds\n", match, (long long int) timedif - timedif2);

    free(data);

    return 0;
}

(Note: Since it is a proof-of-concept, I didn't use the macro directly, though the ngx_check_endianness function detects the endianness of the CPU in exactly the same way Tengine does when it creates the macro. Of course, in the final version, I will use the macro.)
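As a rough illustration of what that looks like (a sketch only, not the exact code of the final patch; the helper name murmur_load_word is made up, and it assumes the NGX_HAVE_LITTLE_ENDIAN and NGX_HAVE_NONALIGNED macros from Tengine's configure step), the per-word load can be selected at compile time instead of calling a runtime check inside the loop:

#include <stdint.h>

/* Sketch: pick the fast word load at compile time using the macros created
   by Tengine's configure scripts, instead of a runtime endianness check. */
static uint32_t murmur_load_word(const unsigned char *data)
{
#if (NGX_HAVE_LITTLE_ENDIAN && NGX_HAVE_NONALIGNED)
    return *((const uint32_t *) data);      /* little-endian, unaligned reads OK */
#else
    return (uint32_t) data[0]                /* endian/alignment-neutral fallback */
         | ((uint32_t) data[1] << 8)
         | ((uint32_t) data[2] << 16)
         | ((uint32_t) data[3] << 24);
#endif
}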

I ran the tester on aarchie (AArch64), and the results were much better than I expected (my expectation was that the improved version would only be faster than the original by a couple of nanoseconds):

[GeoffWu@aarchie tester]$ ./tester
original version took 33387800 nanoseconds
modified version took 33328578 nanoseconds
hash match: y
difference between two versions: 59222 nanoseconds
[GeoffWu@aarchie tester]$ gprof ./tester
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 57.37      0.04     0.04                             ngx_murmur_hash2
 43.03      0.07     0.03                             ngx_murmur_hash2_mod

Since I am too lazy to type ./tester 10 times, I modified the tester to run itself 10 times, each time with a new string. And the result is:

[GeoffWu@aarchie tester]$ ./tester
original version took 33476581 nanoseconds
modified version took 33345553 nanoseconds
hash match: y
difference between two versions: 131028 nanoseconds
original version took 33399119 nanoseconds
modified version took 33353913 nanoseconds
hash match: y
difference between two versions: 45206 nanoseconds
original version took 33400424 nanoseconds
modified version took 33327394 nanoseconds
hash match: y
difference between two versions: 73030 nanoseconds
original version took 33355609 nanoseconds
modified version took 33320899 nanoseconds
hash match: y
difference between two versions: 34710 nanoseconds
original version took 33369537 nanoseconds
modified version took 33380704 nanoseconds
hash match: y
difference between two versions: -11167 nanoseconds
original version took 33400528 nanoseconds
modified version took 33388808 nanoseconds
hash match: y
difference between two versions: 11720 nanoseconds
original version took 33404792 nanoseconds
modified version took 33388615 nanoseconds
hash match: y
difference between two versions: 16177 nanoseconds
original version took 33364073 nanoseconds
modified version took 33322090 nanoseconds
hash match: y
difference between two versions: 41983 nanoseconds
original version took 33376240 nanoseconds
modified version took 33391640 nanoseconds
hash match: y
difference between two versions: -15400 nanoseconds
original version took 33544634 nanoseconds
modified version took 33392719 nanoseconds
hash match: y
difference between two versions: 151915 nanoseconds

I also ran this tester on xerxes (x86_64), and this is the result:

[GeoffWu@xerxes tester]$ ./tester
original version took 24486375 nanoseconds
modified version took 24449778 nanoseconds
hash match: y
difference between two versions: 36597 nanoseconds
original version took 24482464 nanoseconds
modified version took 24497271 nanoseconds
hash match: y
difference between two versions: -14807 nanoseconds
original version took 24500344 nanoseconds
modified version took 24479112 nanoseconds
hash match: y
difference between two versions: 21232 nanoseconds
original version took 24509004 nanoseconds
modified version took 24478274 nanoseconds
hash match: y
difference between two versions: 30730 nanoseconds
original version took 24473804 nanoseconds
modified version took 24488610 nanoseconds
hash match: y
difference between two versions: -14806 nanoseconds
original version took 24471569 nanoseconds
modified version took 24454527 nanoseconds
hash match: y
difference between two versions: 17042 nanoseconds
original version took 24476877 nanoseconds
modified version took 24455086 nanoseconds
hash match: y
difference between two versions: 21791 nanoseconds
original version took 24473525 nanoseconds
modified version took 24444749 nanoseconds
hash match: y
difference between two versions: 28776 nanoseconds
original version took 24496712 nanoseconds
modified version took 24437485 nanoseconds
hash match: y
difference between two versions: 59227 nanoseconds
original version took 24500064 nanoseconds
modified version took 24445308 nanoseconds
hash match: y
difference between two versions: 54756 nanoseconds

Overall, the improved version is faster than the original version (although there are several instances where the original version is faster; my speculation is that this is because each run uses a new string), and I consider that a success.

The final version of my improved MurmurHash can be found here. (Note: initially I only checked the CPU endianness, but after reading this conversation in the nginx mailing list, I added the NGX_HAVE_NONALIGNED check for alignment as well.)

A Quick side-note on Tengine's make test

This part is not necessarily relevant to the project, but I really think I should write this all down, so that when someone wants to contribute to Tengine and run the unit tests before opening a pull request, they won't have to spend a week figuring it out (like I did).

Tengine provides an (at least seemingly) convenient way to run all the unit tests it ships with. According to the How to test page (written in Chinese) on its wiki, after you configure it, you just have to type make test to run all the unit tests. Looks pretty easy, doesn't it?

But as it turns out, it is much more complicated than I expected.

At first, I simply typed ./configure --prefix=/home/GeoffWu/app to create the Makefile, then ran make test. Since I hadn't implemented my changes yet, I expected the tests to pass, but they failed spectacularly.

After spending some time analyzing the error messages, I realized it was because I didn't include a module that is not compiled by default: the ngx_http_dyups_module and its Lua API. Also, I needed to include the lua-nginx-module (I am using the one provided by Tengine), as well as install either LuaJIT 2.0/2.1 or Lua 5.1 when compiling it.

After installing LuaJIT and luajit-devel and making sure I included the modules it requires, I tried it again. But this time, it can't even compile:

modules/ngx_http_lua_module/src/ngx_http_lua_headers.c: In function ‘ngx_http_lua_ngx_header_set’:
modules/ngx_http_lua_module/src/ngx_http_lua_headers.c:709:13: error: implicit declaration of function ‘luaL_getn’; did you mean ‘luaL_gsub’? [-Werror=implicit-function-declaration]
         n = luaL_getn(L, 3);
             ^~~~~~~~~
             luaL_gsub
cc1: all warnings being treated as errors
make[1]: *** [objs/Makefile:1469: objs/addon/src/ngx_http_lua_headers.o] Error 1
make[1]: Leaving directory '/home/GeoffWu/tengine'

For those who can't understand what this means: the Lua module provided by Tengine is so old (in fact it was released in 2015) that it still uses luaL_getn, which is deprecated. I ended up getting the unit tests to run by going to the official repo of ngx_http_lua_module, cloning it to my directory, then manually adding it to Tengine when compiling it.

The exact command is: ./configure --prefix=/home/GeoffWu/app --with-http_dyups_module --with-http_dyups_lua_api --add-module=/home/GeoffWu/lua-nginx-module --with-luajit-inc=/usr/include/luajit-2.1 --with-luajit-lib=/usr/lib64














by Wu Geoffrey (noreply@blogger.com) at April 22, 2018 03:42 AM

SPO 600 - Lab 6

In this lab, I explore inline assembler and its use in open source software. For more information, click here.

Part A

For comparison, I tested the performance of vol_simd (the version with inline assembly and SIMD) and vol3 (the version with no inline assembly, which I built in lab 5) on aarch64 (well, the inline assembly version is written in aarch64 assembly), and I noticed vol_simd is a bit faster than its pure-C counterparts.

Inline assembly and SIMD:
[GeoffWu@aarchie spo600_20181_inline_assembler_lab]$ time ./vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: -454
Time spent: 0.000670

real    0m0.027s
user    0m0.027s
sys    0m0.000s

Pure C:
[GeoffWu@aarchie spo600_20181_vol_skel]$ time ./vol1
Result: -142
Time spent: 0.001029

real    0m0.028s
user    0m0.028s
sys    0m0.000s



I adjusted the number of samples to 500000000, and vol_simd is still faster.

Inline assembly and SIMD:
[GeoffWu@aarchie spo600_20181_inline_assembler_lab]$ make && time ./vol_simd
gcc -g -O3 vol_simd.c -o vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: -751
Time spent: 0.826959

real    0m25.909s
user    0m24.816s
sys    0m1.079s

Pure C:
[GeoffWu@aarchie spo600_20181_vol_skel]$ time ./vol1
Result: -606
Time spent: 1.068387

real    0m25.942s
user    0m24.828s
sys    0m1.098s

//Q: what is an alternate approach?
        register int16_t*       in_cursor       asm("r20");     // input cursor
        register int16_t*       out_cursor      asm("r21");     // output cursor
        register int16_t        vol_int         asm("r22");     // volume as int16_t
A: Let the compiler choose which registers to use.

// set vol_int to fixed-point representation of 0.75
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t) (0.75 * 32767.0);
A: 32767, because the range of int16_t is -32,768 to 32,767.
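To make the fixed-point idea concrete, here is a small sketch of the scaling the lab performs conceptually (not the lab's exact code): the sample is multiplied by the integer volume factor and shifted back down by 15 bits.

#include <stdint.h>

/* Sketch: scale one 16-bit sample by 0.75 using fixed-point arithmetic.
   vol_int holds 0.75 in Q15 form; the product is widened to 32 bits
   before shifting so the intermediate value cannot overflow. */
static int16_t scale_sample(int16_t sample)
{
    int16_t vol_int = (int16_t) (0.75 * 32767.0);
    return (int16_t) (((int32_t) sample * vol_int) >> 15);
}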

// Q: what happens if we remove the following
                        // two lines? Why?
                        : [in]"+r"(in_cursor)
                        : "0"(in_cursor),[out]"r"(out_cursor)
                        );
A: It can't even compile, because the compiler doesn't know where to get the input and where to output the result.

// Q: are the results usable? are they correct?
A: On the aarch64 platform, the results are usable, and they're correct.

Part B

The package I chose is cxxtools, a generic C++ library that is part of the C++ web application framework Tntnet.

In cxxtools, I couldn't find much assembly-language code, though I still managed to find some inline assembly code for the atomicity component. It chooses the proper inline assembly version based on the platform, e.g. arm, x86, x86_64, MIPS, etc.

Considering it also has a platform-independent version (written in pure C++), I assume it is done to improve its performance whenever possible.
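To give an idea of what such an atomicity component typically contains, here is a minimal sketch of an x86_64 atomic increment written with GCC extended inline assembly (illustrative only, not the actual cxxtools code):

#include <stdint.h>

/* Sketch (not cxxtools source): atomically increment *value on x86_64
   using a locked xadd, and return the new value. */
static inline int32_t atomic_increment(volatile int32_t *value)
{
    int32_t old = 1;
    __asm__ __volatile__ ("lock; xaddl %0, %1"
                          : "+r" (old), "+m" (*value)
                          :
                          : "memory");
    return old + 1;   /* xadd leaves the previous value in 'old' */
}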
My personal opinion on inline assembly is that even though it is very complicated (even more so than pure assembly), I do acknowledge that its existence is necessary - after all, compiler optimization doesn't always produce the best result, and it sometimes breaks the software (looking at you, -Ofast).

Reflection

Overall, inline assembly is an interesting (and sometimes necessary) piece of technology to work with, but personally I would rather compile with -O2 or -O3 instead.



by Wu Geoffrey (noreply@blogger.com) at April 22, 2018 03:13 AM


Zukhruf Khan

Release 0.3 – More bugs (?!) to fix..

Hello everyone. As this is my last assignment, this will also be my last blog for my open source class (but hopefully not my last blog for life!). This time, I'll be talking about Release 0.3. The instructions are essentially the same as Release 0.2 - except this time we were asked to work on more than one bug, so I've worked on two bugs.

First, I scoured the internet for open GitHub issues. I eventually found this issue in a random GitHub repo (R for Data Science Online Learning Community), and decided I'd try it. It essentially gave a GitHub documentation link, as well as a link to a code of conduct example, and asked for a code_of_conduct.md to be created. It wasn't necessarily a coding task, but documentation is important as well. So, I forked the repo and created a code_of_conduct.md file in the repo's .github folder, in which I copied and converted the code of conduct (to markdown) from the attached website. After including the group's contact info and such, I created a PR.

(Screenshot: the code_of_conduct.md file I created)

So, that was one issue down. Now I had to find at least one more. I was just about to do a google search for another issue, when I realized this group may have more issues to fix. And voila, they did. The next issue I tackled was also a documentation issue – create a pull request template. I didn’t even know you could set a template for users submitting pull requests to your repo, but hey, you live, you learn, right?

And so, using the link mentioned in the issue, I followed essentially the same steps as above – creating a new markdown file in the .github folder. I looked at other PR templates, as well as this group’s existing issue template, and put it together. After doing that, I created another PR (not using the template yet), and am thus done with two PR’s for the same repository, which is a new step for me! I may contribute to another issue once I get feedback on these 2 PR’s, but in the meantime, I’ve learned that PR’s can be fun, even if it’s not code you’re modifying! As for growth, I’d say I grew over the span of Release 0.2 and 0.3 by contributing to two issues, rather than one, and choosing documentation/writing issues rather than fixing code. One thing’s for sure, I’m glad to have been introduced to how easily you can contribute to the world of open source, by something as simple as fixing issues on Github.

(Screenshot: the pull_request_template.md file I created)

by zeddkayy at April 22, 2018 02:22 AM

April 21, 2018


Dennis Arul

SPO600 Lab 3

This was, in my opinion, the hardest lab that we had to go through. I felt that it was very difficult because I was not too familiar with working in assembly language. What we had to do for this lab was create a program that loops and produces output on both architectures, x86_64 and aarch64; the servers used were xerxes for x86_64 and aarchie for aarch64. We were provided with the following code to start with. We modified the hello-gas file so that it contained the bit of code that we received.

 

.text
.globl _start

start = 0
max = 10

_start:
mov $start,%r15
mov %r15,%r10
loop:
movq $len,%rdx
movq $msg,%rsi
movq $1,%rdi
movq $1,%rax
syscall
inc %r15
cmp $max,%r15
jne loop
movq $0,%rdi /* exit status */
movq $60,%rax /* syscall sys_exit */
syscall

.section .rodata

msg: .ascii "Hello, world!\n"
len = . - msg

The output that we got after adding the code is shown below.

(Screenshot: helloloop.PNG)

The next step that we were asked to do was modify the code so that a number follows the hello world. Honestly, this part confused me for a little while because when I made the changes and ran it, the number was appearing in the wrong places, so I had to play around with the code a little bit to make sure that it appeared at the end. What I did was use an exclamation point as a placeholder, because otherwise the number was replacing the newline and all of the hello worlds showed up on a single line.

 

.text
.globl _start

start = 0
max = 10

_start:
mov $start,%r15
loop:
mov %r15,%r14
add $48,%r14
mov %r14b,msg+14
movq $len,%rdx
movq $msg,%rsi
movq $1,%rdi
movq $1,%rax
syscall
inc %r15
cmp $max,%r15
jne loop
mov $0,%rdi /* exit status */
mov $60,%rax /* syscall sys_exit */
syscall

.section .data

msg: .ascii "Hello, world: !\n"
len = . - msg

The results for the code above are shown below.

(Screenshot: numberloop.PNG)

After this we were asked to change the code so that it loops 30 times rather than only 10.

.text
.globl _start

start = 00
max = 30

_start:
mov $start,%r15
mov $0x30,%r12
loop:
mov $'0',%r14
mov $10,%r13
mov $0,%rdx
mov %r15,%rax
div %r13
cmp $0,%rax
mov %rax,%r13
add %r14,%r13
mov %r13,msg+13
mov %rdx,%r12
add %r14,%r12
mov %r12,msg+14
movq $len,%rdx
movq $msg,%rsi
movq $1,%rdi
movq $1,%rax
syscall
inc %r15
cmp $max,%r15
jne loop
mov $0,%rdi /* exit status */
mov $60,%rax /* syscall sys_exit */
syscall

.section .data

msg: .ascii "Hello, world: !! \n"
len = . - msg

This took significantly longer than expected. Going in, I expected to just change the max to 30 and have it work, but I realised that I was very wrong. Not only will it not count up to 30, but it will not show the two-digit numbers for each loop. Something I learned from one of my classmates while doing this was that if we just changed max to 30, the number would not be shown; what would be shown is the ASCII character corresponding with the number 9. This was a very challenging lab. Compared to coding on other platforms this is significantly harder, maybe because I am a beginner when it comes to assembly code, but getting through this lab was a struggle.

 

by dparul at April 21, 2018 11:52 PM


Ruihui Yan

Installing Raspbian on Raspberry Pi 3

After so many problems with the Fedora and Ubuntu distributions on the Raspberry Pi, I decided to go back to Raspbian, which is the standard system that comes with the Pi. Raspbian Stretch, which is based on Debian Stretch, can be found here.

I decided to go with the Raspbian Stretch Lite.

The instructions suggest that we use Etcher to flash the SD card, which is available for Linux, macOS and Windows.

After boot up, we can use the command sudo raspi-config to connect the Raspberry Pi to WiFi.

 

And enable access through SSH:

 

After that, we are ready to access the Raspberry Pi 3 via SSH.

Moving on with the project…

by blacker at April 21, 2018 11:02 PM


Jeffrey Espiritu

SPO600 Project – Stage 3 / Part 2

Dissecting FLAC__fixed_compute_best_predictor

If I want any chance at optimizing the FLAC__fixed_compute_best_predictor function, I need to understand how the algorithm works from a strictly programming perspective (to be honest, I don't really know how the function helps in the audio encoding process).

Source file: fixed.c

The main part of the function is the for loop, shown below. The code seems straightforward. There are no advanced mathematical operations involved, so I should be able to understand it.

for(i = 0; i < data_len; i++) {
    error  = data[i]     ; total_error_0 += local_abs(error);                      save = error;
    error -= last_error_0; total_error_1 += local_abs(error); last_error_0 = save; save = error;
    error -= last_error_1; total_error_2 += local_abs(error); last_error_1 = save; save = error;
    error -= last_error_2; total_error_3 += local_abs(error); last_error_2 = save; save = error;
    error -= last_error_3; total_error_4 += local_abs(error); last_error_3 = save;
}

After writing out the code on paper and trying to find connections between variables, I was able to understand what was happening inside the for loop. Describing how it works in words would help a little, but a diagram would help a lot. Unfortunately, I'm not good at using image-editing software, nor do I have an electronic pad I can draw on.
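In lieu of a diagram, the same loop can be written out with one named variable per order (an equivalent sketch for clarity, not the FLAC source): err0 through err4 are the 0th- to 4th-order differences of the signal, and each total_error_N accumulates the sum of absolute N-th-order differences.

/* Equivalent scalar formulation (sketch): errN is the N-th-order difference
   of the signal at sample i, and total_error_N sums its absolute values. */
for (i = 0; i < data_len; i++) {
    FLAC__int32 err0 = data[i];
    FLAC__int32 err1 = err0 - last_error_0;
    FLAC__int32 err2 = err1 - last_error_1;
    FLAC__int32 err3 = err2 - last_error_2;
    FLAC__int32 err4 = err3 - last_error_3;

    total_error_0 += local_abs(err0);
    total_error_1 += local_abs(err1);
    total_error_2 += local_abs(err2);
    total_error_3 += local_abs(err3);
    total_error_4 += local_abs(err4);

    last_error_0 = err0;
    last_error_1 = err1;
    last_error_2 = err2;
    last_error_3 = err3;
}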

To be honest though, even after I was able to understand how it works, the more difficult task was figuring out how to optimize it. In fact, I couldn't figure out which variables I should store in a vector register and which calculations I could pick out to perform in parallel using SIMD operations. Well, to make it easier on myself, I decided to look at the SSE2 optimized version, which on its surface seems rather scary. However, with the very good documentation on Intel intrinsics, I was able to not only understand the optimized version, but also work towards writing a solution for ARM CPUs in the process.

Just to recall from the previous post, here is the SSE2 optimized version of the FLAC__fixed_compute_best_predictor function.

Source file: fixed_intrin_sse2.c

FLAC__SSE_TARGET("sse2")
uint32_t FLAC__fixed_compute_best_predictor_intrin_sse2(const FLAC__int32 data[], uint32_t data_len, float residual_bits_per_sample[FLAC__MAX_FIXED_ORDER + 1])
{
	FLAC__uint32 total_error_0, total_error_1, total_error_2, total_error_3, total_error_4;
	uint32_t i, order;

	__m128i total_err0, total_err1, total_err2;

	{
		FLAC__int32 itmp;
		__m128i last_error;

		last_error = _mm_cvtsi32_si128(data[-1]);							// 0   0   0   le0
		itmp = data[-2];
		last_error = _mm_shuffle_epi32(last_error, _MM_SHUFFLE(2,1,0,0));
		last_error = _mm_sub_epi32(last_error, _mm_cvtsi32_si128(itmp));	// 0   0   le0 le1
		itmp -= data[-3];
		last_error = _mm_shuffle_epi32(last_error, _MM_SHUFFLE(2,1,0,0));
		last_error = _mm_sub_epi32(last_error, _mm_cvtsi32_si128(itmp));	// 0   le0 le1 le2
		itmp -= data[-3] - data[-4];
		last_error = _mm_shuffle_epi32(last_error, _MM_SHUFFLE(2,1,0,0));
		last_error = _mm_sub_epi32(last_error, _mm_cvtsi32_si128(itmp));	// le0 le1 le2 le3

		total_err0 = total_err1 = _mm_setzero_si128();
		for(i = 0; i < data_len; i++) {
			__m128i err0, err1, tmp;
			err0 = _mm_cvtsi32_si128(data[i]);								// 0   0   0   e0
			err1 = _mm_shuffle_epi32(err0, _MM_SHUFFLE(0,0,0,0));			// e0  e0  e0  e0
#if 1 /* OPT_SSE */
			err1 = _mm_sub_epi32(err1, last_error);
			last_error = _mm_srli_si128(last_error, 4);						// 0   le0 le1 le2
			err1 = _mm_sub_epi32(err1, last_error);
			last_error = _mm_srli_si128(last_error, 4);						// 0   0   le0 le1
			err1 = _mm_sub_epi32(err1, last_error);
			last_error = _mm_srli_si128(last_error, 4);						// 0   0   0   le0
			err1 = _mm_sub_epi32(err1, last_error);							// e1  e2  e3  e4
#else
			last_error = _mm_add_epi32(last_error, _mm_srli_si128(last_error, 8));	// le0  le1  le2+le0  le3+le1
			last_error = _mm_add_epi32(last_error, _mm_srli_si128(last_error, 4));	// le0  le1+le0  le2+le0+le1  le3+le1+le2+le0
			err1 = _mm_sub_epi32(err1, last_error);							// e1  e2  e3  e4
#endif
			tmp = _mm_slli_si128(err0, 12);									// e0   0   0   0
			last_error = _mm_srli_si128(err1, 4);							//  0  e1  e2  e3
			last_error = _mm_or_si128(last_error, tmp);						// e0  e1  e2  e3

			tmp = _mm_srai_epi32(err0, 31);
			err0 = _mm_xor_si128(err0, tmp);
			err0 = _mm_sub_epi32(err0, tmp);
			tmp = _mm_srai_epi32(err1, 31);
			err1 = _mm_xor_si128(err1, tmp);
			err1 = _mm_sub_epi32(err1, tmp);

			total_err0 = _mm_add_epi32(total_err0, err0);					// 0   0   0   te0
			total_err1 = _mm_add_epi32(total_err1, err1);					// te1 te2 te3 te4
		}
	}

	total_error_0 = _mm_cvtsi128_si32(total_err0);
	total_err2 = total_err1;											// te1  te2  te3  te4
	total_err1 = _mm_srli_si128(total_err1, 8);							//  0    0   te1  te2
	total_error_4 = _mm_cvtsi128_si32(total_err2);
	total_error_2 = _mm_cvtsi128_si32(total_err1);
	total_err2 = _mm_srli_si128(total_err2,	4);							//  0   te1  te2  te3
	total_err1 = _mm_srli_si128(total_err1, 4);							//  0    0    0   te1
	total_error_3 = _mm_cvtsi128_si32(total_err2);
	total_error_1 = _mm_cvtsi128_si32(total_err1);
}

Intel Intrinsics Breakdown

I’m going to break down each intrinsic function, for personal reference, with examples to help clarify what is happening.

As a reference, here is the Intel Intrinsics Guide that was really helpful in trying to understand the SSE2 optimized version.

// create a 128 bit vector, store data[-1] in the lowest element, zero out the rest
last_error = _mm_cvtsi32_si128(data[-1]);
// Note: the 128 refers to the size of the vector register
// if data[-1] = 550, then last_error would be equal to |0 0 0 550|
// move the elements around in the vector to create a new one according to the shuffle bits in the second parameter
err1 = _mm_shuffle_epi32(last_error, _MM_SHUFFLE(2,1,0,0));
// if last_error = |1 2 3 4|, then err1 would be equal to |2 3 4 4|

Just for the sake of clarity, the _MM_SHUFFLE macro is defined as follows in xmmintrin.h.

 /*******************************************************/
 /* MACRO for shuffle parameter for _mm_shuffle_ps().   */
 /* Argument fp3 is a digit[0123] that represents the fp*/
 /* from argument "b" of mm_shuffle_ps that will be     */
 /* placed in fp3 of result. fp2 is the same for fp2 in */
 /* result. fp1 is a digit[0123] that represents the fp */
 /* from argument "a" of mm_shuffle_ps that will be     */
 /* places in fp1 of result. fp0 is the same for fp0 of */
 /* result                                              */
 /*******************************************************/
#define _MM_SHUFFLE(fp3,fp2,fp1,fp0) (((fp3) << 6) | ((fp2) << 4) | \
                                     ((fp1) << 2) | ((fp0)))
// pretty straightforward, subtract two vector registers
last_error = _mm_sub_epi32(last_error, _mm_cvtsi32_si128(itmp));
// logical shift right: shift the vector register to the right by 4 bytes
last_error = _mm_srli_si128(last_error, 4);
// ex. |1 2 3 4| => |0 1 2 3|
// logical shift left: shift the vector register to the left by 12 bytes
tmp = _mm_slli_si128(err0, 12);
// ex. |1 2 3 4| => |4 0 0 0|
// pretty straightforward, add two vector registers
total_err0 = _mm_add_epi32(total_err0, err0);

These 3 lines of code are used to get the absolute value of each element of a vector register.

tmp = _mm_srai_epi32(err0, 31);
err0 = _mm_xor_si128(err0, tmp);
err0 = _mm_sub_epi32(err0, tmp);
// in other words
// {x y z w} => {|x| |y| |z| |w|}

How it works:

err0 = 1000 0001 (-127)
tmp  = 1111 1111 (-127 >> 7, shift right arithmetic, value of -1)

err0 = err0 XOR tmp
       1000 0001
   XOR 1111 1111
       ---------
     = 0111 1110 (126)
       
err0 = err0 - tmp
     = (126 - (-1))
     = 127
     
As you can see, 127 is the absolute value of -127, computed in a 3-step process.
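Written as plain C on 32-bit values (matching the 32-bit lanes the SSE2 code works on), the same branchless absolute-value trick looks like this (a sketch for reference; like the vector version, it cannot represent the absolute value of INT_MIN):

#include <stdint.h>

/* Sketch: scalar equivalent of the three SSE2 instructions above.
   mask is 0 for non-negative x and -1 (all ones) for negative x,
   so (x ^ mask) - mask flips the bits and adds 1 only when x < 0. */
static int32_t abs32(int32_t x)
{
    int32_t mask = x >> 31;        /* arithmetic shift: 0 or -1 */
    return (x ^ mask) - mask;
}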

Porting the x86_64 optimizations to AArch64 (or not)

I ran into several problems while trying to port the x86_64 SSE2 optimizations of this function. There are several intrinsic instructions that do not have an equivalent or similar instruction in AArch64. I combed through the 6666-page documentation. If you want to take a look, you can find it here: ARMv8 Documentation.

  1. There is no AArch64 instruction to shift the bytes of a vector register to the left. LSL (Logical Shift Left) is only used for 32 or 64 bit registers. So if I had a vector register v0=|A B C D|, I would not be able to produce v0=|C D 0 0|.
  2. There is no AArch64 instruction to shift the bytes of a vector register to the right.
  3. There is no AArch64 instruction to shuffle the bytes of a vector register. So if I had a 128-bit vector register v0=|A B C D|, I would not be able to shuffle the elements to produce v0=|D C B A|, for example.

Is there a way I can work around these problems and use the instructions available on ARM CPUs to perform the same operations? Yes, but that would mean using 3 or 4 different instructions to do the work of one equivalent Intel intrinsic instruction.

Dead End, Think Again

Remember when I said I understood the main logic of the function's algorithm and that describing it in words would not help much, and how I said the SSE2 optimized version helped me understand how the function was actually optimized? Well, I present to you my hacked, (potentially?) optimized version of the function using inline assembly.

FLAC__int32 errors[4];
FLAC__uint32 total_error_0 = 0;
register FLAC__uint32 total_error_1 asm("r19") = 0;
register FLAC__uint32 total_error_2 asm("r20") = 0;
register FLAC__uint32 total_error_3 asm("r21") = 0;
register FLAC__uint32 total_error_4 asm("r22") = 0;

__asm__ ("movi v10.4s, #0" ::: "v10");

for (i = 0; i < data_len; i++) {
    error = data[i];
    errors[0] = error - last_error_0;
    errors[1] = errors[0] - last_error_1;
    errors[2] = errors[1] - last_error_2;
    errors[3] = errors[2] - last_error_3;

    last_error_0 = error;
    last_error_1 = errors[0];
    last_error_2 = errors[1];
    last_error_3 = errors[2];

    total_error_0 += local_abs(error);

    __asm__ volatile (
        "ldr q11, [%[src]]			\n\t"
        "abs v11.4s, v11.4s			\n\t"
        "add v10.4s, v10.4s, v11.4s	\n\t"
        :: [src] "r" (errors), "m" (errors[0])
        : "v10", "v11"
    );
}

__asm__ volatile (
    "mov w19, v10.s[3]	\n\t"
    "mov w20, v10.s[2]	\n\t"
    "mov w21, v10.s[1]	\n\t"
    "mov w22, v10.s[0]	\n\t"
    : "=r" (total_error_1),
      "=r" (total_error_2),
      "=r" (total_error_3),
      "=r" (total_error_4)
);

So in short, I moved the addition calculations for total_error_[1 to 4] into SIMD operations. It wasn't that hard to do, but I ran into several problems along the way getting it to work; here it is. I was going to go on and on about my journey with inline assembly, and all of the hours of trial and error, testing, and debugging, but I'll save that for another time.
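As a side note, the same accumulation could also be expressed with NEON intrinsics rather than hand-written assembly, which would let the compiler choose the vector registers. This is only a sketch of the idea, not the change I tested, and the helper name is made up; it assumes errors[0] (the 1st-order error) sits in lane 0:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch (not the tested patch): add |errors[0..3]| into four running
   totals kept in one NEON register, one lane per prediction order. */
static inline int32x4_t accumulate_abs_errors(int32x4_t totals,
                                              const int32_t errors[4])
{
    int32x4_t e = vld1q_s32(errors);          /* errors[0] lands in lane 0 */
    return vaddq_s32(totals, vabsq_s32(e));   /* totals += |errors|, per lane */
}

After the loop, each total can be read back with vgetq_lane_s32(totals, N).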

Next up Part 3

For the next (and hopefully last) part, I’m going to benchmark the code across ARM servers and wrap up Stage 3 for good.

by jespiritutech at April 21, 2018 10:45 PM


Ruihui Yan

Installing Ubuntu Core on Raspberry Pi 3

Because of some issues with the installation of the dependencies required for my package, I decided to install Ubuntu Core on my Raspberry Pi. Ubuntu Core is a tiny version of Ubuntu meant for small computers like IoT devices.

Following these instructions, I have installed the distribution on my SD card:

After a quick setup, the Raspberry Pi is ready to be accessed remotely:

 

The problem is, Ubuntu Core uses snaps, which are self-contained applications specifically for these types of devices. Therefore, it doesn’t support apt-get, which is essential to install any dependency my project needs.

 

For that reason, I decided to go back to the good old Raspbian.

by blacker at April 21, 2018 10:17 PM


Michael Pierre

Release 0.6 – The Finale but not the end

Over the past eight months, I discovered a community I never knew existed. A community full of friendship, challenge, and hardship. I learned how to put my mind to the test and collaborate with individuals all over the world to reach one common goal: improve and develop great software. I've contributed to a variety of organizations like Mozilla, MoePlayer, Chart.js, and many more. I feel my contributions to the open source community have helped me excel as a software developer, and tenfold as a problem solver. A lot of open source work is implementing efficient and unique problem-solving methodologies. You have to figure out how to tackle a problem not from one angle but from several. You need to know how to dissect an issue down to its barebones functionality, and have the ambition to never give up even when the issue you are tackling is tough and shows no sign of remorse. This is what makes an individual an excellent open source developer and problem solver.

One thing to realize with open source is that you don't know everything, and it is likely that you don't know nearly as much as you think you do about software or various coding languages. There will be projects that look like alien code even though you've been coding in that language in college or university for three or four years. This is when you discover the contrast between practical production code and the "simulated code" that you come across in coding classes that only aim to teach you a single concept or a few concepts. It is important as a software developer to have exposure to real-world code as much as possible, because this is what will prepare you for the workforce and future opportunities. Of course, this is not to say that the simulated learning code provided by professors is bad, and I do agree that it is crucial for helping students learn the core concepts and applications of different programming languages. I am rather stressing that some students tend to get into a cycle of being spoon-fed code and explicit instructions for how to complete a certain task. This is bad because in the workforce, and in my personal work experience, there is no project spec or sample code that dictates exactly how to accomplish the project. At my prior job I was simply told what was needed and had to design, develop, and test the software myself. So how does this all tie in with open source? Well, open source provides the means to work on real-world code with industry-leading developers and gain valuable experience that you can place on your resume or use to better yourself as a developer. Taking this all into account, I decided to tackle a hard issue for my last release.

The issue I decided to tackle for this release was to add a calendar widget to the date filter for dejavu. In plain text it seems very simple to implement; however, I have very little experience with ReactJS and hadn't dived into state management, passing data between parent and child components, using an external date API to return the correct data, or conforming to existing functionality in the project. Overall it seemed like a great way to challenge myself, and I'll walk through my development process. Luckily, to start, a maintainer of the project gave some good info on where to begin, as well as a good date picker API that fits perfectly with React called react-datetime. I began reading over the component that handles filter input, called "SingleMenuItem." Currently, in this component, we are using events to grab and validate text input instead of having the filter submit like an HTML form, for example. This was a bit confusing for me; from my understanding, the input data is stored in the component's state and then accessed from a parent class called "FilterDropdown", which contains several "SingleMenuItem" components. The problem with this approach for me was having the React date picker conform to the project's existing functionality, which was very event-focused. But as anyone who has used a date picker widget knows, most of the time you simply click on the date and it fills the field. Knowing this, I assumed that the onClick() event could be used to trigger our existing input event functionality and decided to go down that route initially.

Example of dejavu events:

valChange = (e) => {
    var filterValue = e.currentTarget.value;
    this.setState({
        filterValue: filterValue
    });
    this.props.getFilterVal(filterValue);
};
// Further down in render()
<input id="{keyInput}" type="text" placeholder="enter value" onKeyUp={this.valChange} />

Essentially, as shown in the example above, I had to implement something similar but have it work for the react-datetime component. The Datetime component has an attribute called inputProps which can be used to dictate custom input properties for the date picker. So my first attempt was to add the onClick() event to the inputProps attribute, but as a result this cancelled any onClick() event defined in the Datetime API, thus making it run valChange with an undefined event. I ended up trying almost every event that React offers and spent countless hours on half-baked implementations where the user has to perform some obscure event just to trigger the function that grabs the input value. I did consult one of the maintainers of the project about this issue and he was very helpful, but I think the main problem with all the hours of testing and coding was my lack of experience with React applications.
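To make that first attempt concrete, here is a rough sketch of what it looked like (Datetime and its inputProps prop come from react-datetime; the component name and everything else are illustrative, not the actual dejavu code):

import React, { Component } from 'react';
import Datetime from 'react-datetime';

class DateFilterAttempt extends Component {
    valChange = (e) => {
        // e arrives undefined here: overriding onClick through inputProps
        // replaces the click handling that Datetime relies on internally
        this.setState({ filterValue: e && e.currentTarget.value });
    };

    render() {
        return (
            <Datetime
                dateFormat="YYYY-MM-DD"
                inputProps={{ placeholder: 'enter value', onClick: this.valChange }}
            />
        );
    }
}

export default DateFilterAttempt;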

Fast forward to my current implementation: I needed to add three additional functions to SingleMenuItem.js because the Datetime tag handles events differently than a regular input (API reference) and returns date data in milliseconds (UNIX time) depending on the event, which in this case is onBlur. So once onBlur is triggered we attempt to parse the date and set the filterValue accordingly. This logic happens in setDate. Range values are set a bit differently. I created two functions called setRangeStart and setRangeEnd which handle this use case. Depending on which end is set first (starting date or ending date), the function will either store the date in the dateStore state, to be used by the contrasting function, or combine the two dates into the date range for filterValue.
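Here is a condensed sketch of that logic (the names setDate, setRangeStart, setRangeEnd and dateStore follow my description above, but the bodies, the getFilterVal prop and the wrapper component are simplified assumptions rather than the exact dejavu code; onBlur is assumed to hand back a valid moment object):

import React, { Component } from 'react';
import Datetime from 'react-datetime';

class DateFilterSketch extends Component {
    state = { filterValue: '', dateStore: null };

    // Single value: onBlur hands back the picked date (a moment object),
    // which is converted to UNIX milliseconds and stored as filterValue.
    setDate = (value) => {
        const filterValue = value.valueOf();
        this.setState({ filterValue });
        this.props.getFilterVal(filterValue);
    };

    // Range values: whichever end is picked first is parked in dateStore,
    // and the second pick combines the two into the final range.
    setRangeStart = (value) => {
        const start = value.valueOf();
        if (this.state.dateStore !== null) {
            this.props.getFilterVal(`${start}-${this.state.dateStore}`);
        } else {
            this.setState({ dateStore: start });
        }
    };

    setRangeEnd = (value) => {
        const end = value.valueOf();
        if (this.state.dateStore !== null) {
            this.props.getFilterVal(`${this.state.dateStore}-${end}`);
        } else {
            this.setState({ dateStore: end });
        }
    };

    render() {
        return (
            <div>
                <Datetime onBlur={this.setDate} />
                <Datetime onBlur={this.setRangeStart} />
                <Datetime onBlur={this.setRangeEnd} />
            </div>
        );
    }
}

export default DateFilterSketch;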

—Code reference—

reactDatetime.gif

This is the implementation I decided to create a pull request with, because I believed it would work for all use cases. To my surprise, a maintainer got back to me and indicated that the date picker has to adapt to the date format of the data, and thus be created dynamically, basing its format on the data. This created a HUGE hurdle for me because it involves accessing information dynamically across objects that don’t have any relation. I have since replied to the maintainer asking if there is a way to access the elasticsearch data within a component that has no relation to it. I figure if I can access just one date I can base the format off that and create the date picker; the challenge is more about accomplishing this in React with entry-level experience. I hope to perform a quick update with the fix, but overall I’ve enjoyed the challenge this issue has given me!

by michaelpierreblog at April 21, 2018 08:19 PM


Jeffrey Espiritu

SPO600 Project – Stage 3 / Part 1

Random Wavfile Generator Bug Fix

I wrote a program that generates a random WAV file. The program is simple: it generates a random signed 16-bit value in the range [-32768, 32767] to represent the audio data for both the left and right channels. You can find the source for the file here.

So I tried to optimize it by fitting the two signed 16-bit values inside a 32-bit variable so I could write them to the buffer in a single call.

Can you spot the problem with this code?

// Write sampled data
for (i=0; i<blocks; i++) {
    rnd = rand()%65536 - 32768;
    rnd = (rnd << 16) | rnd;
    memcpy((void*)(data+i*4), (void*)&rnd, sizeof(int32_t));
}

I couldn't spot it at first, so I modified the program to test whether the values were being stored correctly in rnd.

// Write sampled data
for (i=0; i<blocks; i++) {
    int16_t lhs, rhs, tmp;
    rnd = tmp = rand()%65536 - 32768;  // [-32768, 32767]
    rnd = (rnd << 16) | rnd;
    lhs = (rnd >> 16) & 0xffff;  // Get back the random value
    rhs = rnd & 0xffff;  // Get back the same random value
    if (lhs == rhs) {
        printf("GOOD %d == %d (%d)\n", lhs, rhs, tmp);
    }
    else {
        printf("BAD %d != %d (%d)\n", lhs, rhs, tmp);
    }
    memcpy((void*)(data+i*4), (void*)&rnd, sizeof(int32_t));
}

When I run the modified version, I get the following output:

BAD -1 != -21643 (-21643)
GOOD 3116 == 3116 (3116)
BAD -1 != -6479 (-6479)
BAD -1 != -25195 (-25195)
GOOD 27774 == 27774 (27774)
GOOD 3354 == 3354 (3354)
GOOD 5564 == 5564 (5564)
GOOD 6074 == 6074 (6074)
BAD -1 != -7804 (-7804)
GOOD 20843 == 20843 (20843)
BAD -1 != -19348 (-19348)
BAD -1 != -11034 (-11034)
GOOD 27811 == 27811 (27811)
BAD -1 != -13046 (-13046)
BAD -1 != -24755 (-24755)
GOOD 111 == 111 (111)
BAD -1 != -9260 (-9260)
BAD -1 != -32033 (-32033)
BAD -1 != -13309 (-13309)
GOOD 29352 == 29352 (29352)

The first thing that struck me right away is that all of the BAD results involved negative numbers. So of course, the problem had to do with the fact that I was trying to shift a negative number to fit it into the upper 16 bits of a 32-bit variable.

I was trying to store a value in the range [-32768, 32767] into both the upper and lower half words of a 32-bit integer. Just as an example, let’s say I wanted to duplicate the value of -127 into the upper and lower 8 bits of a 16-bit integer.

int16_t x = (-127 << 8) | -127;  // expecting 1000 0001 1000 0001
// Seems right, right? Nope.
// The values (-127 << 8) and -127 are treated as (at least) 16-bit values,
// so -127 is sign-extended before the OR.
// Let's assume the values are treated as 16-bit values:
//     1111 1111 1000 0001 (-127, sign-extended)
// <<8 1000 0001 0000 0000 (-127 << 8)
//  OR 1111 1111 1000 0001 (OR with -127)
//     -------------------
//     1111 1111 1000 0001
// The sign-extension bits of the second -127 clobber the upper byte,
// which is completely wrong. The expected value should be:
//     1000 0001 1000 0001

Fix

The fix was to not try to optimize and just write the values to the array.

// Write sampled data
for (i=0; i<blocks; i++) {
    rnd = rand()%65536 - 32768;
    data[i*2] = (int16_t)rnd;
    data[i*2+1] = (int16_t)rnd;
}
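For reference, the packing idea can still be made to work if the sign-extension bits are masked off before the OR. This is a small sketch of that alternative (my assumption, not the fix I actually committed):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int32_t rnd = rand() % 65536 - 32768;        // [-32768, 32767]
    uint32_t half = (uint32_t)rnd & 0xffffu;     // keep only the low 16 bits
    uint32_t packed = (half << 16) | half;       // duplicate into both half words

    int16_t lhs = (int16_t)((packed >> 16) & 0xffff);
    int16_t rhs = (int16_t)(packed & 0xffff);
    printf("%d %d\n", lhs, rhs);                 // both print the same sample value
    return 0;
}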

Back to Profiling Flac with Gprof on Aarchie

$ gprof ~/bin/flac-132-d | gprof2dot | dot -Tps | convert - flac-132-d-gprof.png

Wait a minute, I’m still getting the same two-function graph that I got in Stage 2.

Flac Gprof Data

Google’s Gperftools

I searched the web for other performance analysis tools I could use on Linux distributions besides gprof, and I found one: gperftools, provided by Google for analyzing the performance of programs. For information about gperftools, you can visit the gperftools GitHub repository.

To install gperftools I ran the following command:
$ sudo dnf install gperftools.x86_64

As per the instructions, I had to link the Flac project with the gperftools library by passing the LIBS='-lprofiler' option to configure.

$ ./configure --enable-debug=info CFLAGS='-O2' CXXFLAGS='-O2' LIBS='-lprofiler' --prefix=/home/jespiritu/myflac-1.3.2x-gperf

Run the Flac program and specify the profile output file:
$ CPUPROFILE=~/prof.out flac-132x-gp -f -5 wav/out.wav

Like gprof, I could generate a visual profiling graph to get a better sense of performance analysis of the Flac program.

$ pprof --dot /home/jespiritu/bin/flac-132x-gp prof.out | dot -Tps | convert - flac-132x-gp-r7200.png

As the visual performance graph below indicates, I was able to successfully profile the Flac program with Google’s performance analysis tool.

Flac Pprof Xerxes Random 7200s Wav

Gprof and Shared Libraries

According to the blog post “gprof, Valgrind and gperftools – an evaluation of some tools for application level CPU profiling on Linux”, gprof is unable “to profile shared libraries.”

So I had to pass the --disable-shared option to configure.
$ ./configure --disable-shared --enable-debug=profile CFLAGS='-O2' CXXFLAGS='-O2' --prefix=/home/jespiritu/myflac-1.3.2x-profile

$ flac-132x-p -f -5 wav/out.wav

flac 1.3.2
Copyright (C) 2000-2009  Josh Coalson, 2011-2016  Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

out.wav: wrote 636291200 bytes, ratio=0.501

# Generate profiling data
$ gprof ~/bin/flac-132x-p | gprof2dot | dot -Tps | convert - flac-132x-p-r7200.png

Finally.
Flac Gprof Random 7200s Wav

Flac CPU Optimizations (or lack thereof)

The function FLAC__cpu_info detects the CPU of the computer and populates the FLAC__CPUInfo structure with details about the supported instruction sets available.

void FLAC__cpu_info (FLAC__CPUInfo *info)
{
	memset(info, 0, sizeof(*info));

#ifdef FLAC__CPU_IA32
	info->type = FLAC__CPUINFO_TYPE_IA32;
#elif defined FLAC__CPU_X86_64
	info->type = FLAC__CPUINFO_TYPE_X86_64;
#else
	info->type = FLAC__CPUINFO_TYPE_UNKNOWN;
#endif

	switch (info->type) {
	case FLAC__CPUINFO_TYPE_IA32: /* fallthrough */
	case FLAC__CPUINFO_TYPE_X86_64:
		x86_cpu_info (info);
		break;
	default:
		info->use_asm = false;
		break;
	}
}

It appears that Flac only has optimizations for IA32 and X86_64 CPU architectures. It does not have any optimizations for AArch32 or AArch64.

typedef enum {
	FLAC__CPUINFO_TYPE_IA32,
	FLAC__CPUINFO_TYPE_X86_64,
	FLAC__CPUINFO_TYPE_UNKNOWN  // No AARCH32 or AARCH64, hmmm...
} FLAC__CPUInfo_Type;

typedef struct {
	FLAC__bool intel;

	FLAC__bool cmov;
	FLAC__bool mmx;
	FLAC__bool sse;
	FLAC__bool sse2;

	FLAC__bool sse3;
	FLAC__bool ssse3;
	FLAC__bool sse41;
	FLAC__bool sse42;
	FLAC__bool avx;
	FLAC__bool avx2;
	FLAC__bool fma;
} FLAC__CPUInfo_x86;


typedef struct {
	FLAC__bool use_asm;
	FLAC__CPUInfo_Type type;
	FLAC__CPUInfo_x86 x86;
} FLAC__CPUInfo;

The Default Encoding Functions

The function init_stream_internal_ in stream_encoder.c sets up the optimizations for the target CPU.

static FLAC__StreamEncoderInitStatus init_stream_internal_(
	FLAC__StreamEncoder *encoder,
	FLAC__StreamEncoderReadCallback read_callback,
	FLAC__StreamEncoderWriteCallback write_callback,
	FLAC__StreamEncoderSeekCallback seek_callback,
	FLAC__StreamEncoderTellCallback tell_callback,
	FLAC__StreamEncoderMetadataCallback metadata_callback,
	void *client_data,
	FLAC__bool is_ogg
)
{
[...]
	/*
	 * get the CPU info and set the function pointers
	 */
	FLAC__cpu_info(&encoder->private_->cpuinfo);
	/* first default to the non-asm routines */
#ifndef FLAC__INTEGER_ONLY_LIBRARY
	encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation;
#endif
	encoder->private_->local_precompute_partition_info_sums = precompute_partition_info_sums_;
	encoder->private_->local_fixed_compute_best_predictor = FLAC__fixed_compute_best_predictor;
	encoder->private_->local_fixed_compute_best_predictor_wide = FLAC__fixed_compute_best_predictor_wide;
#ifndef FLAC__INTEGER_ONLY_LIBRARY
	encoder->private_->local_lpc_compute_residual_from_qlp_coefficients = FLAC__lpc_compute_residual_from_qlp_coefficients;
	encoder->private_->local_lpc_compute_residual_from_qlp_coefficients_64bit = FLAC__lpc_compute_residual_from_qlp_coefficients_wide;
	encoder->private_->local_lpc_compute_residual_from_qlp_coefficients_16bit = FLAC__lpc_compute_residual_from_qlp_coefficients;
#endif

Confirmation on Aarchie

I debugged the program using gdb and set a breakpoint where the program detects the CPU.

$ gdb --args flac-132x-p -f -5 wav/out.wav
(gdb) break stream_encoder.c:872
(gdb) run
(gdb) print encoder->private_->cpuinfo
$1 = {use_asm = 0, type = FLAC__CPUINFO_TYPE_IA32, x86 = {intel = 0, cmov = 0, mmx = 0, sse = 0, sse2 = 0,
    sse3 = 0, ssse3 = 0, sse41 = 0, sse42 = 0, avx = 0, avx2 = 0, fma = 0}}
(gdb)

Flac does not have any AArch64 optimizations and uses the default implementations of the core encoding functions.
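As an aside, if AArch64 detection were ever added, it could not reuse the x86 CPUID path; on an AArch64 Linux box like Aarchie, feature bits come from the auxiliary vector instead. A minimal sketch of that (my assumption, not Flac code; HWCAP_ASIMD is only available from asm/hwcap.h when compiling on AArch64):

#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main(void) {
    unsigned long hwcaps = getauxval(AT_HWCAP);   // feature bits from the kernel
    printf("ASIMD (NEON): %s\n", (hwcaps & HWCAP_ASIMD) ? "yes" : "no");
    return 0;
}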

Examining Possible Strategies for Optimization

From examining the profiling graph generated by gprof, there are three encoding functions that I could look into, to try to port their x86_64 optimizations over to the AArch64 CPU architecture:

FLAC__lpc_compute_autocorrelation 33.84% (33.84%) 232560x
FLAC__fixed_compute_best_predictor 16.99% (16.99%) 310060x
FLAC__lpc_compute_residual_from_qlp_coefficients 3.79% (3.79%) 232560x

The professor recommended I look at the default implementations of the encoding functions before trying to examine the various optimizations for them. I decided to look at the FLAC__fixed_compute_best_predictor function to see if I can make sense of it.

Source file: fixed.c

#ifndef FLAC__INTEGER_ONLY_LIBRARY
uint32_t FLAC__fixed_compute_best_predictor(const FLAC__int32 data[], uint32_t data_len, float residual_bits_per_sample[FLAC__MAX_FIXED_ORDER+1])
#else
uint32_t FLAC__fixed_compute_best_predictor(const FLAC__int32 data[], uint32_t data_len, FLAC__fixedpoint residual_bits_per_sample[FLAC__MAX_FIXED_ORDER+1])
#endif
{
	FLAC__int32 last_error_0 = data[-1];
	FLAC__int32 last_error_1 = data[-1] - data[-2];
	FLAC__int32 last_error_2 = last_error_1 - (data[-2] - data[-3]);
	FLAC__int32 last_error_3 = last_error_2 - (data[-2] - 2*data[-3] + data[-4]);
	FLAC__int32 error, save;
	FLAC__uint32 total_error_0 = 0, total_error_1 = 0, total_error_2 = 0, total_error_3 = 0, total_error_4 = 0;
	uint32_t i, order;

	for(i = 0; i < data_len; i++) {
		error  = data[i]     ; total_error_0 += local_abs(error);                      save = error;
		error -= last_error_0; total_error_1 += local_abs(error); last_error_0 = save; save = error;
		error -= last_error_1; total_error_2 += local_abs(error); last_error_1 = save; save = error;
		error -= last_error_2; total_error_3 += local_abs(error); last_error_2 = save; save = error;
		error -= last_error_3; total_error_4 += local_abs(error); last_error_3 = save;
	}

	if(total_error_0 < flac_min(flac_min(flac_min(total_error_1, total_error_2), total_error_3), total_error_4))
		order = 0;
	else if(total_error_1 < flac_min(flac_min(total_error_2, total_error_3), total_error_4))
		order = 1;
	else if(total_error_2 < flac_min(total_error_3, total_error_4))
		order = 2;
	else if(total_error_3 < total_error_4)
		order = 3;
	else
		order = 4;

Here is the description of the function found in the fixed.h header file.

/*
 *	FLAC__fixed_compute_best_predictor()
 *	--------------------------------------------------------------------
 *	Compute the best fixed predictor and the expected bits-per-sample
 *  of the residual signal for each order.  The _wide() version uses
 *  64-bit integers which is statistically necessary when bits-per-
 *  sample + log2(blocksize) > 30
 *
 *	IN data[0,data_len-1]
 *	IN data_len
 *	OUT residual_bits_per_sample[0,FLAC__MAX_FIXED_ORDER]
 */

Here is the SSE2 optimization of the same function, which is only available if the target platform has support for x86 intrinsics.

Source file: fixed_intrin_sse2.c

FLAC__SSE_TARGET("sse2")
uint32_t FLAC__fixed_compute_best_predictor_intrin_sse2(const FLAC__int32 data[], uint32_t data_len, float residual_bits_per_sample[FLAC__MAX_FIXED_ORDER + 1])
{
	FLAC__uint32 total_error_0, total_error_1, total_error_2, total_error_3, total_error_4;
	uint32_t i, order;

	__m128i total_err0, total_err1, total_err2;

	{
		FLAC__int32 itmp;
		__m128i last_error;

		last_error = _mm_cvtsi32_si128(data[-1]);							// 0   0   0   le0
		itmp = data[-2];
		last_error = _mm_shuffle_epi32(last_error, _MM_SHUFFLE(2,1,0,0));
		last_error = _mm_sub_epi32(last_error, _mm_cvtsi32_si128(itmp));	// 0   0   le0 le1
		itmp -= data[-3];
		last_error = _mm_shuffle_epi32(last_error, _MM_SHUFFLE(2,1,0,0));
		last_error = _mm_sub_epi32(last_error, _mm_cvtsi32_si128(itmp));	// 0   le0 le1 le2
		itmp -= data[-3] - data[-4];
		last_error = _mm_shuffle_epi32(last_error, _MM_SHUFFLE(2,1,0,0));
		last_error = _mm_sub_epi32(last_error, _mm_cvtsi32_si128(itmp));	// le0 le1 le2 le3

		total_err0 = total_err1 = _mm_setzero_si128();
		for(i = 0; i < data_len; i++) {
			__m128i err0, err1, tmp;
			err0 = _mm_cvtsi32_si128(data[i]);								// 0   0   0   e0
			err1 = _mm_shuffle_epi32(err0, _MM_SHUFFLE(0,0,0,0));			// e0  e0  e0  e0
#if 1 /* OPT_SSE */
			err1 = _mm_sub_epi32(err1, last_error);
			last_error = _mm_srli_si128(last_error, 4);						// 0   le0 le1 le2
			err1 = _mm_sub_epi32(err1, last_error);
			last_error = _mm_srli_si128(last_error, 4);						// 0   0   le0 le1
			err1 = _mm_sub_epi32(err1, last_error);
			last_error = _mm_srli_si128(last_error, 4);						// 0   0   0   le0
			err1 = _mm_sub_epi32(err1, last_error);							// e1  e2  e3  e4
#else
			last_error = _mm_add_epi32(last_error, _mm_srli_si128(last_error, 8));	// le0  le1  le2+le0  le3+le1
			last_error = _mm_add_epi32(last_error, _mm_srli_si128(last_error, 4));	// le0  le1+le0  le2+le0+le1  le3+le1+le2+le0
			err1 = _mm_sub_epi32(err1, last_error);							// e1  e2  e3  e4
#endif
			tmp = _mm_slli_si128(err0, 12);									// e0   0   0   0
			last_error = _mm_srli_si128(err1, 4);							//  0  e1  e2  e3
			last_error = _mm_or_si128(last_error, tmp);						// e0  e1  e2  e3

			tmp = _mm_srai_epi32(err0, 31);
			err0 = _mm_xor_si128(err0, tmp);
			err0 = _mm_sub_epi32(err0, tmp);
			tmp = _mm_srai_epi32(err1, 31);
			err1 = _mm_xor_si128(err1, tmp);
			err1 = _mm_sub_epi32(err1, tmp);

			total_err0 = _mm_add_epi32(total_err0, err0);					// 0   0   0   te0
			total_err1 = _mm_add_epi32(total_err1, err1);					// te1 te2 te3 te4
		}
	}

	total_error_0 = _mm_cvtsi128_si32(total_err0);
	total_err2 = total_err1;											// te1  te2  te3  te4
	total_err1 = _mm_srli_si128(total_err1, 8);							//  0    0   te1  te2
	total_error_4 = _mm_cvtsi128_si32(total_err2);
	total_error_2 = _mm_cvtsi128_si32(total_err1);
	total_err2 = _mm_srli_si128(total_err2,	4);							//  0   te1  te2  te3
	total_err1 = _mm_srli_si128(total_err1, 4);							//  0    0    0   te1
	total_error_3 = _mm_cvtsi128_si32(total_err2);
	total_error_1 = _mm_cvtsi128_si32(total_err1);

	/* prefer higher order */
	if(total_error_0 < flac_min(flac_min(flac_min(total_error_1, total_error_2), total_error_3), total_error_4))
		order = 0;
	else if(total_error_1 < flac_min(flac_min(total_error_2, total_error_3), total_error_4))
		order = 1;
	else if(total_error_2 < flac_min(total_error_3, total_error_4))
		order = 2;
	else if(total_error_3 < total_error_4)
		order = 3;
	else
		order = 4;
}

According to the Intel Intrinsics Guide, Intel intrinsics instructions “are C style functions that provide access to many Intel instructions – including Intel® SSE, AVX, AVX-512, and more – without the need to write assembly code.”
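To make that idea concrete, here is a tiny standalone example (not Flac code) showing that an SSE2 intrinsic such as _mm_add_epi32 has a direct NEON counterpart, vaddq_s32, on AArch64, which is what makes porting this style of optimization plausible:

#include <stdio.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

int main(void) {
    int32_t a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_add_epi32(va, vb));   // four adds at once
#elif defined(__aarch64__)
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    vst1q_s32(c, vaddq_s32(va, vb));                         // four adds at once
#else
    for (int i = 0; i < 4; i++) c[i] = a[i] + b[i];          // scalar fallback
#endif
    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}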

Next up Part 2

In the next part, I will break down the algorithm of the FLAC__fixed_compute_best_predictor function and go over how I used the SSE2 optimized version of the function as a guide to figure out an optimization strategy for the AArch64 architecture.

by jespiritutech at April 21, 2018 08:03 PM


Adam Kolodko

O2 vs O3, not what was expected

Due to some time constraints, the following will be tested on my local x86 machine.

Using ccmake I built two versions of Blender, one with the -O2 flag and the other with the -O3 flag. On average, the more optimized version is 1% slower.

These tests are performed on a laptop, plugged into a power source without any additional applications running in the background.

The image in question is of a cat looking at some fish; the grain is due to the low resolution of this render.

outOpt1

Basic, total time 3245.36 seconds:

 % time   name
 44.31   ccl::QBVH_bvh_intersect_hair(ccl::KernelGlobals*, ccl::Ray const*, ccl::Intersection*, unsigned int, unsigned int*, float, float)
 12.43   ccl::perlin(float, float, float)
  7.76   ccl::QBVH_bvh_intersect_shadow_all_hair(ccl::KernelGlobals*, ccl::Ray const*, ccl::Intersection*, unsigned int, unsigned int, unsigned int*)
  7.40   GaussianYBlurOperation::executePixel(float*, int, int, void*)
  3.51   ccl::svm_eval_nodes(ccl::KernelGlobals*, ccl::ShaderData*, ccl::PathState*, ccl::ShaderType, int)
  3.03   ccl::kernel_path_trace(ccl::KernelGlobals*, float*, int, int, int, int, int)
  2.38   ccl::svm_math(ccl::NodeMath, float, float)
  2.06   ccl::shader_setup_from_ray(ccl::KernelGlobals*, ccl::ShaderData*, ccl::Intersection const*, ccl::Ray const*)
  1.87   ccl::light_sample(ccl::KernelGlobals*, float, float, float, ccl::float3, int, ccl::LightSample*)
  1.74   ccl::kernel_path_surface_bounce(ccl::KernelGlobals*, ccl::ShaderData*, ccl::float3*, ccl::PathState*, ccl::PathRadianceState*, ccl::Ray*)
  1.45   GaussianXBlurOperation::executePixel(float*, int, int, void*)
  1.09   ccl::primitive_tangent(ccl::KernelGlobals*, ccl::ShaderData*)

Optimized, total time 3350.43 seconds:

 % time   name
 44.06   ccl::QBVH_bvh_intersect_hair(ccl::KernelGlobals*, ccl::Ray const*, ccl::Intersection*, unsigned int, unsigned int*, float, float)
 14.27   ccl::bsdf_microfacet_beckmann_sample(ccl::KernelGlobals*, ccl::ShaderClosure const*, ccl::float3, ccl::float3, ccl::float3, ccl::float3, float, float, ccl::float3*, ccl::float3*, ccl::float3*, ccl::float3*, float*)
  8.08   GaussianYBlurOperation::executePixel(float*, int, int, void*)
  7.57   ccl::QBVH_bvh_intersect_shadow_all_hair(ccl::KernelGlobals*, ccl::Ray const*, ccl::Intersection*, unsigned int, unsigned int, unsigned int*)
  3.14   ccl::svm_eval_nodes(ccl::KernelGlobals*, ccl::ShaderData*, ccl::PathState*, ccl::ShaderType, int)
  2.98   ccl::kernel_path_trace(ccl::KernelGlobals*, float*, int, int, int, int, int)
  2.00   ccl::shader_setup_from_ray(ccl::KernelGlobals*, ccl::ShaderData*, ccl::Intersection const*, ccl::Ray const*)
  1.80   ccl::light_sample(ccl::KernelGlobals*, float, float, float, ccl::float3, int, ccl::LightSample*)
  1.65   ccl::kernel_path_surface_bounce(ccl::KernelGlobals*, ccl::ShaderData*, ccl::float3*, ccl::PathState*, ccl::PathRadianceState*, ccl::Ray*)
  1.43   GaussianXBlurOperation::executePixel(float*, int, int, void*)
  1.06   ccl::svm_bevel(ccl::KernelGlobals*, ccl::ShaderData*, ccl::PathState*, float, int)
  1.06   ccl::background_map_pdf(ccl::KernelGlobals*, ccl::float3)
  1.00   ccl::perlin(float, float, float)

This info dump is the gprof output of every function that took 1% of the time or more during a one-hour render. Looking at this, we can find what effect the change has on the individual functions.

On average, the functions took an additional second to run. The most striking change is that the function ‘perlin()’ went from 12% of the runtime to a mere 1%; this is balanced out by the function ‘bsdf_microfacet_beckmann_sample()’ going from under 1% to 14%.

The strategy should be this: find out what specific optimization affects ‘perlin()’, and find out whether ‘perlin()’ can be cut down without forcing ‘bsdf_microfacet_beckmann_sample()’ to slow down.

by ahkol at April 21, 2018 05:48 PM

Change the build flags in spec

Now that we are able to build the basic version of the Blender app in the RPM working directory, we will move it to the side with mv ~/rpmbuild ~/blender_basic

The goal now is to add the ‘-pg’ compile flag to the build; this was a problem that was harder to solve than expected.

Within the spec file we see a collection of flags that looks like so

...
%cmake .. \
%ifnarch %{ix86} x86_64
-DWITH_RAYOPTIMIZATION=OFF \
%endif
-DBOOST_ROOT=%{_prefix} \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_SKIP_RPATH=ON \
-DPYTHON_VERSION=$(%{__python3} -c "import sys ; print(sys.version[:3])") \
-DWITH_ALEMBIC=ON \
-DWITH_BUILDINFO=ON \
%{?_with_ffmpeg:-DWITH_CODEC_FFMPEG=ON} \
-DWITH_CODEC_SNDFILE=ON \
-DWITH_CXX_GUARDEDALLOC=OFF \
-DWITH_CYCLES=%{cyclesflag} \
-DWITH_DOC_MANPAGE=ON \
-DWITH_FFTW3=ON \
-DWITH_GAMEENGINE=ON \
-DWITH_IMAGE_OPENJPEG=ON \
-DWITH_INPUT_NDOF=ON \
-DWITH_INSTALL_PORTABLE=OFF \
-DWITH_JACK=ON \
-DWITH_MEM_JEMALLOC=ON \
-DWITH_MOD_OCEANSIM=ON \
-DWITH_OPENCOLLADA=ON \
-DWITH_OPENCOLORIO=ON \
-DWITH_OPENVDB=ON \
-DWITH_OPENVDB_BLOSC=ON \
-DWITH_PLAYER=ON \
-DWITH_PYTHON=ON \
-DWITH_PYTHON_INSTALL=OFF \
-DWITH_PYTHON_INSTALL_REQUESTS=OFF \
-DWITH_PYTHON_SAFETY=ON \
-DWITH_SDL=ON \
-DWITH_SYSTEM_LZO=ON \
-DWITH_SYSTEM_OPENJPEG=ON
...

My first instinct was to just add the following flags to the list

...
%cmake .. \
%ifnarch %{ix86} x86_64
-DWITH_RAYOPTIMIZATION=OFF \
%endif
-CXX_FLAGS=-pg \
-C_FLAGS=-pg \
-DBOOST_ROOT=%{_prefix} \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_SKIP_RPATH=ON \
-DPYTHON_VERSION=$(%{__python3} -c "import sys ; print(sys.version[:3])") \
-DWITH_ALEMBIC=ON \
-DWITH_BUILDINFO=ON \
%{?_with_ffmpeg:-DWITH_CODEC_FFMPEG=ON} \
...

This failed, as rpmbuild was interpreting the new flags as directories.

It was suggested that I take advantage of the fact that I can build on x86 and use diff or git to compare the differences. This failed because ccmake, for some reason, used different names for files in the two builds.

I was able to find that CFLAGS and CXXFLAGS were being set with export.

To successfully change the build flags the following lines were added.

...
export CXXFLAGS="$CXXFLAGS -pg"
export CFLAGS="$CFLAGS -pg"
%cmake .. \
%ifnarch %{ix86} x86_64
-DWITH_RAYOPTIMIZATION=OFF \
%endif
-DBOOST_ROOT=%{_prefix} \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_SKIP_RPATH=ON \
-DPYTHON_VERSION=$(%{__python3} -c "import sys ; print(sys.version[:3])") \
-DWITH_ALEMBIC=ON \
-DWITH_BUILDINFO=ON \
%{?_with_ffmpeg:-DWITH_CODEC_FFMPEG=ON} \
-DWITH_CODEC_SNDFILE=ON \
-DWITH_CXX_GUARDEDALLOC=OFF \
-DWITH_CYCLES=%{cyclesflag} \
-DWITH_DOC_MANPAGE=ON \
-DWITH_FFTW3=ON \
...

By building and then moving each build to a different directory, run times under different compile flags can be compared.

by ahkol at April 21, 2018 05:47 PM


Dennis Arul

SPO600 Lab 5

For this post I will be going through Lab 5 and documenting the process and the results that I get. For starters, we were told to unpack this archive: /public/spo600-20181-algorithm-selection-lab.tgz

After unpacking the archive and changing into the directory, we see three files: the makefile; vol.h, which contains the definition of SAMPLES (the base number provided was 500000); and vol1.c, which contains the source code.

The first thing we were asked to do in Lab 5 was to make and run the program to see if we got the same results each time, so I did just that. I ran it three times and did in fact get the same results each time.

Test running lab 5.PNG

Alright, now for the next part we are asked to adjust the number of samples and run it again. What I did was double the sample amount, run make clean, then make, and finally run it with time ./vol1 again to test the results. The printed result changed due to the increase in the sample amount, but the timing of the three runs after doubling was again consistent. The difference I noticed was that the time it took to run also doubled.
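For reference, the rebuild-and-time cycle I repeated after each change to SAMPLES in vol.h looks roughly like this:

$ make clean
$ make
$ time ./vol1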

Double run 1.PNG

After this step we were asked to modify the provided code to use a pre-calculated lookup table in which all the values are multiplied by the volume factor, which in this case is 0.75. Below is the modified code.

 

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
        return (int16_t) (volume_factor * (float) sample);
}

int main() {

        // Lookup table of pre-scaled sample values
        int16_t* lookup;

        lookup = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

        int x;
        int ttl = 0;

        // Seed the pseudo-random number generator
        srand(-1);

        // Fill the array with data pre-scaled by the 0.75 volume factor
        for (x = 0; x < SAMPLES; x++) {
                lookup[x] = (int16_t) (((x % 65536) - 32768) * 0.75);
        }

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl + lookup[x]) % 1000;
        }

        // Print the sum
        printf("Result: %d\n", ttl);

        return 0;
}

Here are the results of the run after the changes were made. This version of the code is significantly faster than what we had originally.

Vol2.PNG

The next thing we had to do was modify the code to convert the volume factor to a fixed-point integer by multiplying it by a binary number representing a fixed value of 1. Below is the code showing the changes made.

VOL3.C

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
        return (int16_t) (volume_factor * (float) sample);
}

int main() {

        // Allocate memory for large in and out arrays
        int16_t* in;
        int16_t* out;
        in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

        int x;
        int ttl = 0;

        // Seed the pseudo-random number generator
        srand(-1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand() % 65536) - 32768;
        }

        // Volume factor as a fixed-point integer: 256 * 0.75 = 192
        int shift = 0b100000000 * 0.75;

        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        for (x = 0; x < SAMPLES; x++) {
                out[x] = in[x] * shift >> 8;
        }
        // ######################################

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl + out[x]) % 1000;
        }

        // Print the sum
        printf("Result: %d\n", ttl);

        return 0;
}

These are the results I got for building and running after the changes made to vol3.c.

Vol3.PNG

What I saw was that this version’s runtime and the original version’s runtime are very similar; the only difference between the two was that this version had a tendency to fluctuate in run time. In conclusion, I found that the build with the lookup table was by far the fastest of the three. I am, however, curious as to why the original build and version 3 have similar run times.

 

by dparul at April 21, 2018 05:41 AM


Stephen Ward

Project Wrap-Up & Summary

It’s been about a month since I started this journey, documenting my progress through a world of learning GNU/Linux tools, traversing a project belonging to a community of people, and understanding optimization. For reference, my initial failure to get the project off the ground (on time) can be viewed in this post. The steps I took to retry, and finally succeed at benchmarking the program, can be viewed here. My attempts to optimize some code, and the results of those tests, are found here as well.

The summary of these accounts has led me to this post. Ultimately, I was hoping to take a section of code and optimize it. Did I accomplish this? No, but I took another approach to optimizing Audacity. I had seen success, quite significant success at that, by simply raising the optimization level from 2 to 3. This isn’t enough, however, to say that I had done anything. I really didn’t do anything apart from telling the compiler to build the code differently than before. What needs to be done is to figure out how exactly the compiler worked its magic. What did an optimization level of 3 do to the code behind the scenes?

The start of this investigation has to be comparing what no optimization does vs level 3. On a simplistic level, it took an initial compilation time of ~5 minutes up to ~15 minutes. Is this extra time a concern to the community? Potentially. It also resulted in two different sizes of executable files in the end. The standard Audacity ELF compiles into a 14MB file, while the new, more optimized version adds just a few more megabytes at 16MB. Is this significant when it comes to releasing to the public? I don’t think so, and it probably won’t be a concern to the developers.

Beyond the fact that it takes longer to compile and is slightly bigger, -O3 has much deeper-rooted differences than -O0, specifically the flags that are turned on during compilation. The GNU Compiler docs point to a specific set of flags which are automatically turned on in addition to all the previous flags set at the other levels of optimization:

-finline-functions 
-funswitch-loops 
-fpredictive-commoning 
-fgcse-after-reload 
-ftree-loop-vectorize 
-ftree-loop-distribution 
-ftree-loop-distribute-patterns 
-floop-interchange 
-floop-unroll-and-jam 
-fsplit-paths 
-ftree-slp-vectorize 
-fvect-cost-model 
-ftree-partial-pre 
-fpeel-loops 
-fipa-cp-clone

The first flag I will look at, given the weight it carries in the SPO course, will be the loop vectorization flag. I have chosen to compile with CFLAGS set to “-O2 -ftree-loop-vectorize” in order to narrow down just this flag’s performance.
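Roughly, the rebuild for this test looked like the following (the exact configure invocation I used isn’t reproduced here; this just shows where the flags go):

$ ./configure CFLAGS="-O2 -ftree-loop-vectorize" CXXFLAGS="-O2 -ftree-loop-vectorize"
$ make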

The benchmarking looked as such:

Time to perform 200 edits: 4805 ms
Doing correctness check...
Passed correctness check!
Time to check all data: 5091 ms 
Reading data again...
Time to check all data (2): 5113 ms
At 44100 Hz, 16-bits per sample, the estimated number of
 simultaneous tracks that could be played at once: 232.5
Benchmark completed successfully.

About 200ms slower than the -O3 flag. Re-running this another five times dropped this to ~150ms slower, which really amounts to only about a ~3% difference in time compared to using all of the default flags -O3 has to offer.

I wanted to test that this wasn’t a fluke, and that vectorization was likely the cause of the increased performance. To do so, I removed the vectorize flag and replaced it with a flag I considered would not have much of an effect. A flag like -fipa-cp-clone, which is enabled by default with -O3, is reported to give a slight increase in performance over its -fipa-cp counterpart at the -O2 level. I could hypothesize then that by adding this flag and removing the vectorization flag, the times for -O2 would be greater than seen above.

What’s strange to me right now is that adding this flag and omitting the vectorize flag resulted in about the same times. Do I conclude that vectorization is not the main reason for the increase with -O3?

I went back to the drawing board, because this result is confusing. If I want to show the community progress in optimizing, I must figure out how their program is compiled by default. I had assumed that the -O2 flag was added by default, because in one version that I had built, I saw the flag in the config.status file.

I’ve realized that my initial benchmark scores of ~12000ms came from a version of Audacity that must have been compiled without any optimization. How did this occur? I do not know. I have re-built Audacity from source, without adding anything but -pg. Benchmark times are now consistent with just the -O2 flag, i.e. ~5300-5800ms.

Summary

What I have to show from this project, and what I can present the Audacity community is this:

chart

There is a slight performance increase going with -O3, but you sacrifice stability by using this flag. The alternative would be to add flags on top of the -O2 level. There is about a 3% difference in time to check data. As I had hypothesized earlier in part II, BlockFile::CalcSummaryFromBuffer (which is what I believe is used to check the data) is not all that vectorizable, and adding the flag shows this. Adding additional flags did not significantly change the time to perform edits or check data. What this data does show, however, is that by adding the -O3 flag, the GCC compiler will optimize the software with roughly a 50% increase in performance. -O2, while being the safer choice, can be complemented with a few extra flags to approach the -O3 optimization values.

by steaward at April 21, 2018 02:06 AM


Sanjit Pushpaseelan

SPO600 – Plan of attack

So with exams and final projects, the last week and a half has been very hectic. I haven’t been able to blog nearly as much as I would have liked, but I was able to get some research done. Like I said before, I planned to use assembly language to improve the performance of md5deep. However, to be honest, I had absolutely no idea how I would even go about doing that. I’m a complete beginner when it comes to assembly, and there are times where I feel like I don’t even know what I’m doing when I code in it (the assembly labs proved to be quite tough for me; even when I revisit them I have some trouble understanding what’s going on).

Things initially looked grim but through excessive googling I was able to find two sources that might be able to help me. The first is a program called K9Copy, which is a DVD backup tool. I actually found this while scouring the inline assembler lab projects!

(can be found here)

(github code for K9copy)

The md5.h file contains a piece of code that says the following:

#if defined __GNUC__ && defined __i386__
static md5_uint32
rol(md5_uint32 x, int n)
{
  __asm__("roll %%cl,%0"
          :"=r"(x)
          :"0"(x),"c"(n));
  return x;
}
#endif
Looking at this code, it isn’t super hard to figure out what’s going on. The #if makes sure the code is only compiled with a GNU compiler targeting an i386 processor. The roll instruction rotates the 32-bit value in x left by n places.
I believe I can implement this code in MD5deep with ARM-specific instructions instead of its current iteration.
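For instance, AArch64 has no rotate-left instruction, only rotate-right (ROR), so rotating left by n becomes rotating right by 32 - n. A minimal sketch of what an ARM64 equivalent could look like (my own illustration, not md5deep code; the portable fallback is typically turned into a ror by GCC on its own):

#include <stdint.h>

static inline uint32_t rol32(uint32_t x, unsigned int n)
{
#if defined __GNUC__ && defined __aarch64__
    uint32_t r = (32 - n) & 31;  /* rotate-left by n == rotate-right by 32 - n */
    __asm__("ror %w0, %w1, %w2" : "=r"(x) : "r"(x), "r"(r));
    return x;
#else
    return (x << n) | (x >> ((32 - n) & 31));  /* portable fallback */
#endif
}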

Back-up plan

I think I have either fallen into some beginner’s trap or struck gold with this find. I found a project that rewrote all of MD5 in x86 assembly! The project is called rewolf-md5. There’s even a heavily commented version for beginners! It has the following code:

.macro FF a,b,c,d,k,s,i
    mov \b, %edi
    mov \b, %ebp
    and \c, %edi
    not %ebp
    and \d, %ebp
    or %ebp, %edi
    leal \i(\a, %edi,), \a
    addl \k*4(%esi), \a
    rol $\s, \a
    add \b, \a
.endm

This code seems to be pretty similar to the first bit-rotation function in md5.c. If my first plan doesn’t work, perhaps I can adapt this code into MD5deep.

The plan moving forward

 

This project is due this Sunday, so I need to move fast. I still have one more project due this weekend as well, so I will be juggling both, and I’m unsure if I will be able to complete this one in time. I will be posting the results of trying the methods stated in this blog tomorrow, and the day after that I will summarize the work I did over this project and analyze why I got the results I did.

by sanjitps at April 21, 2018 01:49 AM

April 19, 2018


Adam Kolodko

Build on aarm

The roadblock: the last time I tried to build this on aarchie, I got stuck at the script installing dependencies, and that doesn’t even account for the fact that the makefile doesn’t take ARM vs x86 into account.

Because DNF has its own build of Blender that is compatible with AArch64, the strategy is to take that source code and use it to modify and build my version.

The first step is to get that source, to do this I used the command

dnf download --source blender

This gave me a file called “blender-2.79-1.fc27.src.rpm”, which includes specialized build files that install on Fedora. The .src.rpm file can also be used to build into a local workspace called ‘rpmbuild’.

The next step is to unpack the file using the command

rpm -i blender-2.79-1.fc27.src.rpm

This creates the directories ~/rpmbuild/SPECS and ~/rpmbuild/SOURCES.

Inside the SPECS folder is a file called blender.spec, which is like a makefile of sorts. The next command is

dnf builddep blender.spec

This is used to install the build dependencies needed for our Blender app.

Once that is complete we want to build Blender itself, to do this use

rpmbuild -ba blender.spec

Several more directories are created, but the important one is ‘~/rpmbuild/BUILD’. Within this directory are the files, ready and set up for a simple make and then make install.

Finally, when that completes in about half an hour, test the main executable, conveniently found at ~/rpmbuild/BUILD/blender_aarch64.2.4/cmake-make/bin/blender

by ahkol at April 19, 2018 11:32 PM


Shawn Pang

SPO600 Project Stage 3 | Wrap-up

Looking back at stage 2, I realised that I didn’t go into much detail about the compiler options, so I’ll discuss that now. Using the -ftree-vectorize -fopt-info-vec-missed options within the Makefile’s CFLAGS variable, I checked to see which parts of the code were not being vectorized. Most of the code that was not being vectorized was because there were not enough data-references in the basic blocks and no grouped stores in the basic block, which I couldn’t think of any possible way to optimize. I tried some compiler options such as -fgcse and -fstore-merging, but I saw no effect on the overall speed.

I was also curious about how the program would be affected when using the -O2 flag instead of -O3, so I ran some tests switching between -O1, -O2 and -O3. Just as a reminder from stage 2, I am changing the CFLAGS ?= -O3 value within the Makefile in the tests directory, then running make clean, make, and then ./fullbench test-lz4-speed.py. After testing, these are the results I found:

Testing with -O3

*** LZ4 speed analyzer v1.8.2 64-bits, by Yann Collet ***
 test-lz4-speed.py :
Compression functions :
 1-LZ4_compress_default         :    16792 ->     7276 (43.33%),  253.5 MB/s
 2-LZ4_compress_default(small d :    16792 ->     7276 (43.33%),  252.9 MB/s
 3-LZ4_compress_fast(0)         :    16792 ->     7276 (43.33%),  253.4 MB/s
 4-LZ4_compress_fast(1)         :    16792 ->     7276 (43.33%),  253.3 MB/s
 5-LZ4_compress_fast(2)         :    16792 ->     7609 (45.31%),  266.1 MB/s
 6-LZ4_compress_fast(17)        :    16792 ->    11285 (67.20%),  425.3 MB/s
 7-LZ4_compress_fast_extState(0 :    16792 ->     7276 (43.33%),  245.4 MB/s
 8-LZ4_compress_fast_continue(0 :    16792 ->     7212 (42.95%),  207.1 MB/s
10-LZ4_compress_HC              :    16792 ->     6121 (36.45%),   23.9 MB/s
12-LZ4_compress_HC_extStateHC   :    16792 ->     6121 (36.45%),   24.2 MB/s
14-LZ4_compress_HC_continue     :    16792 ->     6121 (36.45%),   24.2 MB/s
20-LZ4_compress_forceDict       :    16792 ->     7212 (42.95%),  215.1 MB/s
30-LZ4F_compressFrame           :    16792 ->     7291 (43.42%),  248.1 MB/s
40-LZ4_saveDict                 :    16792 ->    16792 (100.0%),153481.4 MB/s
41-LZ4_saveDictHC               :    16792 ->    16792 (100.0%),153017.4 MB/s
Decompression functions :
 1-LZ4_decompress_fast           :     16792 ->   604.5 MB/s
 3-LZ4_decompress_fast_usingDict :     16792 ->   605.2 MB/s
 4-LZ4_decompress_safe           :     16792 ->   846.4 MB/s
 6-LZ4_decompress_safe_usingDict :     16792 ->   841.8 MB/s
 7-LZ4_decompress_safe_partial   :     16792 ->   852.1 MB/s
 8-LZ4_decompress_safe_forceExtD :     16792 ->   868.6 MB/s
 9-LZ4F_decompress               :     16792 ->   796.3 MB/s

Testing with -O2

*** LZ4 speed analyzer v1.8.2 64-bits, by Yann Collet ***
 test-lz4-speed.py :
Compression functions :
 1-LZ4_compress_default         :    16792 ->     7276 (43.33%),  261.5 MB/s
 2-LZ4_compress_default(small d :    16792 ->     7276 (43.33%),  256.7 MB/s
 3-LZ4_compress_fast(0)         :    16792 ->     7276 (43.33%),  261.4 MB/s
 4-LZ4_compress_fast(1)         :    16792 ->     7276 (43.33%),  261.3 MB/s
 5-LZ4_compress_fast(2)         :    16792 ->     7609 (45.31%),  273.4 MB/s
 6-LZ4_compress_fast(17)        :    16792 ->    11285 (67.20%),  439.9 MB/s
 7-LZ4_compress_fast_extState(0 :    16792 ->     7276 (43.33%),  263.1 MB/s
 8-LZ4_compress_fast_continue(0 :    16792 ->     7212 (42.95%),  219.2 MB/s
10-LZ4_compress_HC              :    16792 ->     6121 (36.45%),   23.5 MB/s
12-LZ4_compress_HC_extStateHC   :    16792 ->     6121 (36.45%),   23.3 MB/s
14-LZ4_compress_HC_continue     :    16792 ->     6121 (36.45%),   23.3 MB/s
20-LZ4_compress_forceDict       :    16792 ->     7212 (42.95%),  226.4 MB/s
30-LZ4F_compressFrame           :    16792 ->     7291 (43.42%),  256.9 MB/s
40-LZ4_saveDict                 :    16792 ->    16792 (100.0%),153932.7 MB/s
41-LZ4_saveDictHC               :    16792 ->    16792 (100.0%),152511.7 MB/s
Decompression functions :
 1-LZ4_decompress_fast           :     16792 ->   805.2 MB/s
 3-LZ4_decompress_fast_usingDict :     16792 ->   804.5 MB/s
 4-LZ4_decompress_safe           :     16792 ->   932.3 MB/s
 6-LZ4_decompress_safe_usingDict :     16792 ->   907.5 MB/s
 7-LZ4_decompress_safe_partial   :     16792 ->   917.7 MB/s
 8-LZ4_decompress_safe_forceExtD :     16792 ->   910.9 MB/s
 9-LZ4F_decompress               :     16792 ->   866.5 MB/s

Testing with -O1

*** LZ4 speed analyzer v1.8.2 64-bits, by Yann Collet ***
 test-lz4-speed.py :
Compression functions :
 1-LZ4_compress_default         :    16792 ->     7276 (43.33%),  240.6 MB/s
 2-LZ4_compress_default(small d :    16792 ->     7276 (43.33%),  226.4 MB/s
 3-LZ4_compress_fast(0)         :    16792 ->     7276 (43.33%),  239.6 MB/s
 4-LZ4_compress_fast(1)         :    16792 ->     7276 (43.33%),  239.6 MB/s
 5-LZ4_compress_fast(2)         :    16792 ->     7609 (45.31%),  254.4 MB/s
 6-LZ4_compress_fast(17)        :    16792 ->    11285 (67.20%),  417.6 MB/s
 7-LZ4_compress_fast_extState(0 :    16792 ->     7276 (43.33%),  244.9 MB/s
 8-LZ4_compress_fast_continue(0 :    16792 ->     7212 (42.95%),  174.1 MB/s
10-LZ4_compress_HC              :    16792 ->     6121 (36.45%),   21.1 MB/s
12-LZ4_compress_HC_extStateHC   :    16792 ->     6121 (36.45%),   20.4 MB/s
14-LZ4_compress_HC_continue     :    16792 ->     6121 (36.45%),   20.4 MB/s
20-LZ4_compress_forceDict       :    16792 ->     7212 (42.95%),  179.4 MB/s
30-LZ4F_compressFrame           :    16792 ->     7291 (43.42%),  224.7 MB/s
40-LZ4_saveDict                 :    16792 ->    16792 (100.0%),150710.6 MB/s
41-LZ4_saveDictHC               :    16792 ->    16792 (100.0%),150057.3 MB/s
Decompression functions :
 1-LZ4_decompress_fast           :     16792 ->   667.7 MB/s
 3-LZ4_decompress_fast_usingDict :     16792 ->   667.8 MB/s
 4-LZ4_decompress_safe           :     16792 ->   732.9 MB/s
 6-LZ4_decompress_safe_usingDict :     16792 ->   730.8 MB/s
 7-LZ4_decompress_safe_partial   :     16792 ->   734.5 MB/s
 8-LZ4_decompress_safe_forceExtD :     16792 ->   746.8 MB/s
 9-LZ4F_decompress               :     16792 ->   694.4 MB/s

After testing, it seems that the -O2 is the fastest. This seemed a bit strange to me, so I also tried running the fullbench program with the test-lz4-versions.py with different optimization flags. Here are my results:

Testing with -O3

*** LZ4 speed analyzer v1.8.2 64-bits, by Yann Collet ***
 test-lz4-versions.py :
Compression functions :
 1-LZ4_compress_default         :     5376 ->     2644 (49.18%),  225.7 MB/s
 2-LZ4_compress_default(small d :     5376 ->     2644 (49.18%),  225.4 MB/s
 3-LZ4_compress_fast(0)         :     5376 ->     2644 (49.18%),  225.9 MB/s
 4-LZ4_compress_fast(1)         :     5376 ->     2644 (49.18%),  225.7 MB/s
 5-LZ4_compress_fast(2)         :     5376 ->     2827 (52.59%),  244.6 MB/s
 6-LZ4_compress_fast(17)        :     5376 ->     4377 (81.42%),  519.5 MB/s
 7-LZ4_compress_fast_extState(0 :     5376 ->     2644 (49.18%),  223.3 MB/s
 8-LZ4_compress_fast_continue(0 :     5376 ->     2639 (49.09%),  194.5 MB/s
10-LZ4_compress_HC              :     5376 ->     2335 (43.43%),   30.7 MB/s
12-LZ4_compress_HC_extStateHC   :     5376 ->     2335 (43.43%),   30.5 MB/s
14-LZ4_compress_HC_continue     :     5376 ->     2335 (43.43%),   30.5 MB/s
20-LZ4_compress_forceDict       :     5376 ->     2639 (49.09%),  203.9 MB/s
30-LZ4F_compressFrame           :     5376 ->     2659 (49.46%),  220.6 MB/s
40-LZ4_saveDict                 :     5376 ->     5376 (100.0%),49115.7 MB/ss
41-LZ4_saveDictHC               :     5376 ->     5376 (100.0%),49046.6 MB/ss
Decompression functions :
 1-LZ4_decompress_fast           :      5376 ->   569.6 MB/s
 3-LZ4_decompress_fast_usingDict :      5376 ->   578.4 MB/s
 4-LZ4_decompress_safe           :      5376 ->   778.2 MB/s
 6-LZ4_decompress_safe_usingDict :      5376 ->   764.9 MB/s
 7-LZ4_decompress_safe_partial   :      5376 ->   787.5 MB/s
 8-LZ4_decompress_safe_forceExtD :      5376 ->   779.0 MB/s
 9-LZ4F_decompress               :      5376 ->   734.6 MB/s

Testing with -O2

*** LZ4 speed analyzer v1.8.2 64-bits, by Yann Collet ***
 test-lz4-versions.py :
Compression functions :
 1-LZ4_compress_default         :     5376 ->     2644 (49.18%),  234.7 MB/s
 2-LZ4_compress_default(small d :     5376 ->     2644 (49.18%),  233.4 MB/s
 3-LZ4_compress_fast(0)         :     5376 ->     2644 (49.18%),  234.6 MB/s
 4-LZ4_compress_fast(1)         :     5376 ->     2644 (49.18%),  234.7 MB/s
 5-LZ4_compress_fast(2)         :     5376 ->     2827 (52.59%),  262.6 MB/s
 6-LZ4_compress_fast(17)        :     5376 ->     4377 (81.42%),  538.9 MB/s
 7-LZ4_compress_fast_extState(0 :     5376 ->     2644 (49.18%),  237.8 MB/s
 8-LZ4_compress_fast_continue(0 :     5376 ->     2639 (49.09%),  200.6 MB/s
10-LZ4_compress_HC              :     5376 ->     2335 (43.43%),   30.4 MB/s
12-LZ4_compress_HC_extStateHC   :     5376 ->     2335 (43.43%),   30.5 MB/s
14-LZ4_compress_HC_continue     :     5376 ->     2335 (43.43%),   30.5 MB/s
20-LZ4_compress_forceDict       :     5376 ->     2639 (49.09%),  209.1 MB/s
30-LZ4F_compressFrame           :     5376 ->     2659 (49.46%),  230.7 MB/s
40-LZ4_saveDict                 :     5376 ->     5376 (100.0%),49408.4 MB/ss
41-LZ4_saveDictHC               :     5376 ->     5376 (100.0%),49183.9 MB/ss
Decompression functions :
 1-LZ4_decompress_fast           :      5376 ->   729.0 MB/s
 3-LZ4_decompress_fast_usingDict :      5376 ->   735.1 MB/s
 4-LZ4_decompress_safe           :      5376 ->   832.4 MB/s
 6-LZ4_decompress_safe_usingDict :      5376 ->   818.1 MB/s
 7-LZ4_decompress_safe_partial   :      5376 ->   830.5 MB/s
 8-LZ4_decompress_safe_forceExtD :      5376 ->   833.8 MB/s
 9-LZ4F_decompress               :      5376 ->   783.0 MB/s

Testing with -O1

*** LZ4 speed analyzer v1.8.2 64-bits, by Yann Collet ***
 test-lz4-versions.py :
Compression functions :
 1-LZ4_compress_default         :     5376 ->     2644 (49.18%),  211.9 MB/s
 2-LZ4_compress_default(small d :     5376 ->     2644 (49.18%),  201.9 MB/s
 3-LZ4_compress_fast(0)         :     5376 ->     2644 (49.18%),  211.9 MB/s
 4-LZ4_compress_fast(1)         :     5376 ->     2644 (49.18%),  211.8 MB/s
 5-LZ4_compress_fast(2)         :     5376 ->     2827 (52.59%),  230.2 MB/s
 6-LZ4_compress_fast(17)        :     5376 ->     4377 (81.42%),  484.2 MB/s
 7-LZ4_compress_fast_extState(0 :     5376 ->     2644 (49.18%),  206.3 MB/s
 8-LZ4_compress_fast_continue(0 :     5376 ->     2639 (49.09%),  147.7 MB/s
10-LZ4_compress_HC              :     5376 ->     2335 (43.43%),   26.5 MB/s
12-LZ4_compress_HC_extStateHC   :     5376 ->     2335 (43.43%),   26.4 MB/s
14-LZ4_compress_HC_continue     :     5376 ->     2335 (43.43%),   26.4 MB/s
20-LZ4_compress_forceDict       :     5376 ->     2639 (49.09%),  152.0 MB/s
30-LZ4F_compressFrame           :     5376 ->     2659 (49.46%),  198.5 MB/s
40-LZ4_saveDict                 :     5376 ->     5376 (100.0%),49052.3 MB/ss
41-LZ4_saveDictHC               :     5376 ->     5376 (100.0%),48857.4 MB/ss
Decompression functions :
 1-LZ4_decompress_fast           :      5376 ->   619.0 MB/s
 3-LZ4_decompress_fast_usingDict :      5376 ->   618.2 MB/s
 4-LZ4_decompress_safe           :      5376 ->   671.8 MB/s
 6-LZ4_decompress_safe_usingDict :      5376 ->   678.1 MB/s
 7-LZ4_decompress_safe_partial   :      5376 ->   677.4 MB/s
 8-LZ4_decompress_safe_forceExtD :      5376 ->   687.1 MB/s
 9-LZ4F_decompress               :      5376 ->   634.0 MB/s

Again, it seems that the -O2 optimization flag is the fastest. Just to be sure, I ran the tests with the same flags around 5 times each for both the test-lz4-speed.py and test-lz4-versions.py files, and every time the -O2 flag seemed to be faster than the -O3 flag by around 5%. I asked one of the developers why they were using the -O3 flag instead of -O2, and they said that the -O3 flag used to produce significantly faster binaries than -O2, but this changed a few years ago, both as a consequence of source code modifications, which are friendlier to the compiler’s optimizations, and as a consequence of compiler improvements. The difference may vary depending on the compiler type and version. He said that he may reconsider using the -O2 flag in the Makefile in the future.

On the subject of implementing SIMD into the lz4.c file, I wasn’t able to make the program use multithreading. My previous idea was to use the pthread library, extract parts of the main loop, encapsulate them within a function, and replace the current part within the loop with that function. Unfortunately, I wasn’t able to make the function execute simultaneously with other parts of the loop, because each part was so heavily dependent on previous parts. I heard from the developer that Intel did something similar, located here, but I couldn’t locate the patch that the Intel team used anywhere, and I haven’t been able to contact the developers that created the patch, so I wasn’t able to test it to see how it is done. So while it may be possible to use threads, it would be incredibly difficult for me to implement with the resources that I have.

So, to wrap this project up: while I wasn’t able to increase the program’s speed using SIMD, the -O2 flag did in fact increase the speed of the programs, so this project could be counted as a success. My previous discussions with the developers told me that their earlier attempts to implement SIMD into the program have failed, so I believe that my inability to implement SIMD is not hugely a failure, especially within such a short time period. Determining whether or not to use the -O2 optimization flag would require some more testing on different platforms, so whether or not it is changed will depend on future testing results using different compiler types and versions.

I enjoyed working on this project quite a bit. I learned quite a few things about SIMD, compression algorithms, and optimizing code. My experience working with the developers of LZ4 has been rather pleasant, as they were very accepting and polite when answering my questions about the project. They also responded quickly, within a few hours of my asking a question, and gave some good advice on the subject at hand. Working on this program has been a good learning experience. Testing each idea that I thought up as a possible optimization was interesting, and it helped me learn how important planning and testing programs can be. I hope that I can continue to work on optimizing programs in my spare time.

by sppang at April 19, 2018 11:08 PM


Ray Gervais

The Cost of Aesthetic in Flat Design

An OSD700 Contribution Post

For the final release, one of the issues I wanted to focus on was this one, which I figured would be an easy contribution toward the project and a check off my final release requirements. After reviewing the comments on the issue, I was under the impression that I had to learn a new accessibility standard titled aXe. aXe was going to be the driving force behind this post, but to my fortune it’s more of a testing engine than a standard, testing web applications and pages against the WCAG 2.0 AA rulesets.

Evaluating a page’s issues relating to WCAG AA compliance is made easy with the aXe engine (https://axe-core.org/), which even displays in its results how to improve or fix rule violations such as contrast and sizing. So, I was on familiar ground. A ground which many never bother to consider, since they avoid the cracks and spots of mud as they walk along. I decided to use the engine’s results as a guide towards patching the cracks and cleaning up the mud. One has to wonder, what is the consequence of such patches?

I first looked into the Material Design stepper specification and accessibility pages, where items such as contrast and sizing were addressed in a stark-yet-not-half-assed manner. The rules made sense, but they still did not comply with WCAG AA requirements and, better yet, disregarded many of the colour rules to forward a flat aesthetic. The website the documentation runs on fails multiple guidelines, meaning that this correction would come from ideas, discussion, and, if accepted, deviation from the very guideline which established the design of the project. Damn.

Before

After

I left a comment which described the most transparent way of fixing the A11y issues with the stepper, opting to darken the text to meet the bare minimum of the compliance guidelines. It was as I was typing the comment and proposed changes, that I realized just how little people would care for such a change, or how quickly I’d be thrown off the boat for attempting to go against the design specification.

The change that I proposed is very small: bringing up the alpha layer of the RGBA font / background colours from 35% to 54%, which brings us to the compliant 4.5:1 contrast ratio. I figured this was better than changing the colours themselves, which, good luck doing so since we are playing with #FFF and #000 through and through. Kids, this isn’t your Dad’s CSS.
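In CSS terms, the proposal boils down to something like this (a hypothetical illustration; the selector name is made up and the real change lives in the Material stepper styles):

/* before: roughly a 2.4:1 contrast ratio on white, which fails WCAG AA */
.step-label {
  color: rgba(0, 0, 0, 0.35);
}

/* after: 54% black on white reaches the 4.5:1 AA ratio */
.step-label {
  color: rgba(0, 0, 0, 0.54);
}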

In the past few weeks, I’ve been horrendous when it came to OSD700’s work, often going dark for a week at a time, my work for the course at a standstill in that time. Three days after posting the comment which I hoped would stir discussion, still not a single response. Perhaps I need to give them a week as well, moving onto a different issue as my secondary while waiting for the pitchforks or textbooks to fly in with fury once maintainers and developers alike stumble on it.

Regardless, one can only throw his paper plane into the sky and wait for the wind to determine its direction.

by RayGervais at April 19, 2018 10:02 PM


Patrick Godbout

Fixing a Bug, Adding Tests

In this lab, we were asked to work on the Brave browser, but more specifically to see how it compares to other browsers in certain niche tests. To do this, we studied the testing code for the Brave browser to try to understand it at its base, compared to FireFox and Chrome as examples. Following this analysis, we would find some bugs or odd behaviour that Brave did but other browsers didn't. So our task was to fix those behaviours, and add tests to it before submitting a pull request to our local branches. Here's some insight on how it went;

P.S.: I worked with a classmate on this lab, his name is Owen Mak

1- Which URL string cases did you have trouble with in Brave vs. other browsers?

" https://www.google.ca/search?q=dog cat " and " file:///Users/humphd/repos/browser-laptop/dog cat.txt " mainly. We used both these cases to test since they both resulted in different behaviour when comparing Brave to Chrome or FireFox.

In the first case, Chrome/FireFox would redirect to a "dog cat" google search, and Brave would redirect to a "https://www.google.ca/search?q=dog cat" google search.

In the second case, Chrome/FireFox would open the file from your user path and display the contents on the page, and Brave would google search "file:///Users/humphd/repos/browser-laptop/dog cat.txt".

2- Did you notice any other interesting differences between browsers and how they handle strings in the URL bar?

The aspect that stuck out to me was that most of their tests and error handling were similar, but the order wasn't. There were a lot of things that needed to pass through checks on any browser to make sure we handle the input properly, but sometimes the order in which those checks were implemented resulted in unexpected behaviour.

3- How did you write tests for these cases?

We wrote a test for both test cases, where we did this;

      it('is a url which contains a space in the query', function () {
        assert.equal(urlUtil.isNotURL('https://www.google.ca/search?q=dog cat'), false)
      })
      it('is an absolute file path with a space in the files', function () {
        assert.equal(urlUtil.isNotURL('/home/omak/Documents/dog cat.txt'), false)
      })

We added an if check so that this kind of input wouldn't fall through the existing checks and end up with the entire URL being treated as a Google search term.
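As a rough illustration only (this is not the actual browser-laptop code; the helper name and the exact conditions are hypothetical), the shape of such a check might be: if the trimmed input still looks like a URL or file path but contains inner spaces, encode the spaces instead of letting the whole string fall through to a search query.

function normalizeInput (input) {
  const trimmed = input.trim()
  // Looks like a URL or absolute file path but still has whitespace inside:
  // encode the spaces rather than treating the whole thing as a search term.
  if (/^(https?:|file:|\/)/.test(trimmed) && /\s/.test(trimmed)) {
    return trimmed.replace(/\s+/g, '%20')
  }
  return trimmed
}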


4- What did you do in order to get Brave to work the same as other browsers?

One of the first things we did was compare the behaviour between Brave and Chrome/Firefox, as mentioned above. We wanted to see what the other browsers did that Brave wasn't doing, and then make Brave match that behaviour.

5- What did you need to learn in order to complete the work?

Surprisingly, I didn't need to learn how to write tests, since I had already done that as an assignment for my Project Release 0.1. What I needed to understand was what was being tested in this scenario. Beyond that, we needed to know why the checks were failing in Brave's case. We did this by putting console.logs everywhere the checks were being made, and we narrowed it down from there until we found out that every space in the URL was being stripped out at once, instead of trimming the leading and trailing spaces first and encoding the spaces in the middle. That's how we found the bug, and we were able to fix it from there.

by Patrick G (noreply@blogger.com) at April 19, 2018 04:30 PM

April 18, 2018


Justin Vuu

SPO600 – Lab 3 – Assembler Lab

I noticed I haven’t submitted this lab even though I’ve done a good portion of it. Huh.

In this lab, we experimented with assembly and how it can affect the program.

We were given a set of code in C which all produce the same output, but written differently. The code and their filenames are as follows:

hello.c
hello1

 

hello2.c
hello2

 

 

hello3.c
hello3

After compiling them in both AArch64 and x86_64, we took a look at their object code and compared them:

hello
obj-hello1obj-hello1-x86

hello2obj-hello2
obj-hello2-x86

hello3obj-hello3
obj-hello3-x86

Afterward, we compiled the assembler versions for the two architectures. We noticed that all the extra sections that were in the C object code aren’t in the assembler object code.

AArch64:
obj-helloasm-aarch

x86_64:
obj-helloasm-x86

Then we tried building a program in assembler that loops and prints the iteration number. First we did it in x86_64, and this is what it looks like and what it outputs:

loop10-x86loop10-x86-output

We then modified the code so that it loops 30 times and prints the iteration number properly. This was also done in x86_64:loop30-x86loop30-x86-output

Summary

Assembler is hard. This lab has made me appreciate high-level code more than ever! Figuring out how to code the looping programs in assembler was very satisfying, maybe because I can finally stretch after being hunched over staring at jargon for minutes on end.

Even after completing this lab I’m still confused about it.

by justosd at April 18, 2018 09:10 PM

Building Trust

GitHub pull requests not being looked at? If so, then ask yourself…

Did you update your GitHub profile with your name and bio? Is your profile picture actually a picture of you?

If you answered “No” to any of these, that’s probably why. One thing I learned from the Technician As Entrepreneur course (TEC702) is that you build trust by providing a picture of yourself and providing a bio – who you are and why you’re doing what you’re doing. This is reinforced by this comment here.

So, add a picture! Put your name on your profile! Provide a bit of info on yourself! And you’ll improve your chances of getting a response.

By not providing a picture, name, and/or bio, you’re a completely anonymous person trying to offer assistance. Sound familiar? It’s like a complete stranger approaching you, or showing up at your front door, trying to sell you something. It looks shady.

Even in social sites like Facebook or Twitter or Reddit, if you see an account that’s sending you messages or trying to friend or follow you, but their account provides no information on them, you probably wouldn’t reply or accept their request.

To the authors of the project, this is what they think when they see your freshly made Github account with a blank profile and a few small projects that you did for class.

by justosd at April 18, 2018 04:06 PM

SPO600 – Stage 3 – Pull Request Submitted!

Aaaah my nerves!

I created an issue and pull request for Hashrat that includes the optimization I made in Stage 2: https://github.com/ColumPaget/Hashrat/pull/11

I hope the author is nice. In retrospect I really should have tried reaching out to them at the beginning of the project. By filing the issue and pull request now, I feel like I’m coming off a little rude.

by justosd at April 18, 2018 03:51 PM

April 16, 2018


Vimal Raghubir

Brave Browser Bug Fix

In this week’s episode of open source adventures, I was tasked with fixing bugs in the URL parsing of Brave Browser. First of all what is Brave Browser? It is an open source web browser that focuses on blocking web trackers and intrusive advertisements. It also encourages privacy by sharing less of your personal data with advertising companies. Now that we know a little about Brave Browser, what does that mean for us?

Brave Browser is open source, meaning anyone can access the code base as well as contribute. Since it is a relatively new browser compared to Firefox and Chrome, for example, there are still a few bugs, most notably in the URL parsing. So my professor David Humphrey gave my class and me a few test URLs to try in Brave Browser and compare the results against Chrome and Firefox. Most of the URLs given worked except for 2.

  • "https://www.google.ca/search?q=dog cat"
  • " https://www.google.ca/search?q=dog cat "

Both of these URLs were not being parsed correctly by Brave. The result of these URLs can be seen below.

As you can see, it’s taking the entire URL and passing it into the search bar, which is definitely not what we want. We want it to just pass “dog cat” into the search bar and get results for that. The reason behind this bad URL parsing is that Brave did not account for spaces within the query parameters of the URL. If there are leading or trailing spaces around the URL, Brave already handles this by calling the trim() function.

Now the most interesting part about this bug fix is that a new approach was taken to achieve it. What is this new method? It’s called Test Driven Development, or TDD for short. This method focuses on writing the tests BEFORE fixing the bug, so we focus on making our failing tests pass instead of the other way around.

Based on this principle, we need to write the tests first. As you can see below, these are the two tests written for this use case.

Now time to actually see if these tests are failing as expected. After installing the mocha library using npm install --global mocha and building, I ran npm run test -- --grep="dog" to only show me the tests that have the word dog in them. This then showed me the 2 tests that were failing.

As you can see, the expected values were false, however the returned values were true, which means that the isNotUrl function does not consider these 2 URLs to be valid because of the space between dog and cat. Now that we know the tests are failing, it’s time to go and fix them.

After searching the code base, it was clear that as soon as the URL is passed into the isNotUrl function, it gets trimmed. Therefore this portion of the code is where we need to handle any spaces before more complex URL manipulation takes place. I realized that a regex is the best course of action to handle an indefinite number of whitespace characters while ensuring they aren’t located at the front or back of the URL.

As you can see, the regex above simply says “look for all whitespace that isn’t located at the front or the back of the URL and replace it with %20”. Now does this actually work? Yes it does! The unit tests are passing now.
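Since the screenshot of the change isn’t reproduced here, below is a minimal sketch of the kind of replacement being described, written as a standalone helper (the function name is mine, not Brave’s): trim the ends first, then encode whatever whitespace remains in the middle.

// Hedged sketch: trim the outer spaces, then encode the inner ones.
const encodeInnerSpaces = (input) => {
  const trimmed = input.trim()
  // After trim(), any remaining run of whitespace is internal to the URL.
  return trimmed.replace(/\s+/g, '%20')
}

console.log(encodeInnerSpaces(' https://www.google.ca/search?q=dog cat '))
// -> 'https://www.google.ca/search?q=dog%20cat'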

Time to check if it fixes the URL when searching it in Brave Browser.

Success!!! In conclusion, this bug fix was very meaningful since it exposed me to a different approach to bug fixing. Rather than debugging code and writing my tests after the fact to ensure everything works, I can write tests that I know will break and then go and fix them. All in all, an important lesson was learned from this bug fix, and I’m looking forward to applying this approach in future bug fixes! For now that’s a wrap. See you in the next blog post!

by Vimal Raghubir at April 16, 2018 04:28 AM


Zukhruf Khan

Fixing a Bug, Adding Tests

Hello all, for lab 6 we were asked to fix a minor bug that existed in the Brave browser-laptop code. The bug itself was that Brave reacts differently to spaces inside search queries or URLs. For this case, we tested some strings inside Brave versus Chrome and Firefox. Upon entering this URL in the URL bar, while Chrome and Firefox returned a Google search page for “dog cat”, Brave returned a Google search for the entire URL itself. The instructions were to fix that bug and write new tests to verify that the bug was indeed fixed.

So, I started off by forking the Brave repo, and cloning it to my machine. After building and running the code, I also ran the tests and made sure everything was working smoothly. I noticed that when I entered: https://www.google.ca/search?q=dog cat into my Firefox or Chrome browsers, the space between dog and cat was replaced with %20. In Brave, nothing was replaced and so the whole URL was placed in a Google search rather than the two separate terms. So first, as per the instructions, I familiarized myself with the tests in urlUtilTest.js, and wrote some more tests to test the fix I was going to make later.

Screen Shot 2018-04-15 at 10.18.43 PM
I added the last 2 tests (and the final commented one)

Thus, I made my way through the urlUtil parsing code in this file. I ended up simply adding a replace(‘ ‘, ‘%20’) to one particular line of code, and voila, I was done!

Screen Shot 2018-04-15 at 9.47.04 PM
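The screenshot above shows the actual one-line change; as a general JavaScript note (not a quote of the Brave code), keep in mind that String.prototype.replace with a plain string pattern only replaces the first occurrence, so a global regex is what handles inputs with more than one inner space.

'dog cat bird.txt'.replace(' ', '%20')    // -> 'dog%20cat bird.txt' (first space only)
'dog cat bird.txt'.replace(/ /g, '%20')   // -> 'dog%20cat%20bird.txt' (every space)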

Afterwards, I tested the browser, and sure enough, Brave replaced the spaces with ‘%20’ and thus treated the URLs as Chrome and Firefox do.

Screen Shot 2018-04-15 at 9.49.02 PM
Testing Brave with a file called test brave.txt – the spaces were replaced and the file shown

 

 

Screen Shot 2018-04-15 at 9.47.46 PM
Successful tests!

My commit can be found here. I learned that sometimes, it only takes a line (or less) to fix a bug (although the tests were another case).

by zeddkayy at April 16, 2018 02:23 AM

April 15, 2018


Ray Gervais

A Second Semester of Open Source Contributions Completed

It’s hard to believe how quickly this semester has come to a close. Some of us, including me, even had countdown calendars, and yet the days escaped even quicker than we could count. It feels like just last week I started my second dedicated foray into Open Source technologies, and yet in two weeks it’ll be the end of that adventure (for now, that is). Similar to what I did when I completed OSD600, I thought I’d recap and share my thoughts as I complete OSD700, and perhaps also allude to the progression and experiences between the two, which were only possible through fantastic instructors such as David.

From 600 to 700

The Rise of JavaScript

In David’s OSD600, I learned quite a bit about modern-day JavaScript practices and how they applied to current Open Source trends. Looking back now, I gather a few reasons why JavaScript completely swept over and took the FOSS realm by storm:

  • Thanks to Electron and HTML, making cross platform desktop applications is now as simple as writing a web application. I believe this is central to the popularity of JavaScript applications, since working with GTK, Qt, WPF, and Cocoa (just to name a few) can be disjointed and utterly mind-jarring at times. If you know HTML and CSS, your application can share unified styling and components on all major platforms.
  • JavaScript has grown in the past decade to be one of the most flexible languages I’ve ever seen. Learning advanced JavaScript development means learning new patterns, new paradigms, new programming concepts such as callback / closure centric logic wrappers, and with the addition of Node for the backend, a semi-robust alternative to the languages of yesterday.
  • I’ve observed a radical shift, both at SOTI and from developers I’ve talked to, in perspective on dynamically typed, interpreted languages such as Python and JavaScript versus compiled languages such as C#, C++, and Java. Many who admitted disdain for JavaScript were now advocating its usefulness for prototyping and rapid application development without the need to compile or to provision grand environments. Of course, you have individuals in both camps, some of whom claim that Node and JavaScript are still too immature to be taken so seriously in enterprise, and I do see some of their points as being incredibly realistic. Tried and true > bleeding edge.

From Student to Intern

Likewise, it was through learning JavaScript in OSD600 that I had the confidence to learn Angular and its primary language, TypeScript. From there, the entire MEAN (MongoDB, Express, Angular, Node) JavaScript-centric stack and all of its toil and glory. Flash forward three months, and this new experience actually landed me my first enterprise internship with SOTI Inc, where I was a front-end Angular developer. Using David’s knowledge and lessons, I learned quickly how to excel at and push forward tasks much bigger and more complex than I thought myself capable of, and became the lead front-end developer (still an intern!) of the SOTI Insight team.

I don’t think a single day goes by where OSD600 hasn’t had an impact on my current work in the best way. Looking back now, without that class I can say that I would not be in the position I came to, nor would I have the experience and drive which came from that class.

Transitioning to 700: Thoughts Post-600

The same can be said for many who took David’s OSD600, and those in OSD700 are also finding their callings. With 700, instead of being shown around the nest and how various communities work, we were thrown directly into the sky, told to choose a place to land, and from there to build our own nests alongside the communities we chose. Here, some chose Visual Studio Code, Brave, Chart.JS, Angular.Material, ngx-bootstrap, even Python!

The experiences differ per person, but I don’t *think* any of us walked out with less than when we walked in. Instead, we walk into the last few classes with contributions and pull requests to our name, a steady stream of work relating to us showing up on most search engines at the very top (talk about good publicity for a developer!), and a confidence and skill set which isn’t easily obtained that will push our careers further than ever before.

Lessons and Thoughts Which Stood Out This Semester

Debugging through Different Directions

I’ve written about some of the debugging methods I’ve painstakingly learned over the past four years, a post which was directly inspired by David’s lessons and articles on the topic. Being the ever-learning, humbly experienced developer that he is, David shared with us his strategies for debugging applications built on top of the Electron framework; a lesson which the very next day even affected the nature of my tasks at SOTI in the best possible way.

Whereas in my post I discussed a lot of the common debugging issues and missed areas that younger students I’ve tutored, or I myself, often struggle with, David went straight into explaining how to bend Chrome and modern technologies to our will. He explained dogfooding, dependency injection concepts, and navigating your way around a huge code base looking for a single event listener using modern-day tools. Never before had I looked at the Chrome DevTools with such admiration for what is possible through a web browser. It’s amazing how much effort and work is put into such tools and applications that the everyday person will never think about nor discover.

I took some of the tricks that David had displayed and applied them the next day while debugging an item at SOTI. To my disbelief, no one else on the development team (which at that time was made up of 4 senior JavaScript developers and 6 software developers) had heard of the debugging-on-DOM-node-event trick, or even of conditional breakpoints accessible through Chrome’s DevTools. Yet, it was exactly these two tricks (plus a few others) which finally allowed me to discover the flaw in the code; the line which broke the business logic.

Becoming a Small Community in The Class

Throughout most of our courses, we’re always hoping to make friends in the classes we frequent and build relationships or networking potential with our peers. This means that when we see each other in the halls while rushing towards the next test, we’ll nod, ask how it’s going, strike up a conversation, or perhaps press forward since the prof isn’t going to wait for you. I found this scenario throughout my few years at Seneca to be very predictable, to be the standard way of meeting individuals who are in the same field as you.

In OSD600, and even more so in 700, I found David guided us towards something much bigger and more concrete. As per his intention, the class of OSD700 became a community where thoughts, stories, events and coding struggles were shared and enjoyed. Interesting developments or thoughts were shared on the Slack channel weekly, often by Sean who somehow managed to always have a new link or article to share! We attended a Rangle.IO event as a portion of the class, and even got to tour the Mozilla Toronto Office (MoTO) with Michael Hoye. The twitter tag #SenecaSocksCrew was created initially as a chance to get awesome looking VueJS socks, but later kept alive to become a symbol. An anchor for all of us to come back and relate to, to keep in touch and also to plan out new events together after the semester ends.

David got what he wanted, which was to turn a class of extraordinary people into our own open source community. The community at the moment consists of the following incredible developers, whom I’d recommend looking into for their work, their blogs, and their continued plans to change the world:

Presenting your Best and Worst Work


This is an interesting one, because as the title suggests some days aren’t meant for presenting the goldmine of work you produced. Instead, we were encouraged to present any and all of it. What this meant was explaining our thinking, our trials and where we want to go next with the task at hand.

This was one topic which took me away from the comforts of a polished end result. Never before have I had to talk about failures and work in progress to such a degree, admitting at the time of the presentation that there is still work to do, things to fix, and a long road ahead. It took a lot of time for me to adjust, to admit that alongside my pampered and finished tasks some were the polar opposite: coal in a cave full of unknown code waiting to break or be found. It was incredibly stress-inducing for me to go up in front of my peers and explain why an item isn’t working, similar to a progress report. I’ve always been a perfectionist, so the introduction of this style of presenting was one which pulled me way out into left field, but it also gave me the gradually earned chance to learn from the presentation style and own up to the tasks in their unfinished entirety.

Contributing To the Everyday Person's Workflow

This title seems odd, even to me, who wrote it. What I really mean by this is that our contributions should be aimed at making the biggest splash they can for others. This doesn’t mean sending thousands of lines of code changes in a single Pull Request; instead it means making the project one bug less, one language more translated, one requested feature added, etc. David really tried to emphasize this, as many of us looked at lines of code as the metric between an ‘A’ and ‘B’ graded release, instead of how a ‘very small’ change would improve the workflow of developers all over the world using said project, or how a bug fix would help clients and developers alike push the project forward by removing technical debt, for example.

It took a while for me to learn this, since previous courses and egos always considered the better programmer to be the one who writes quality code in bountiful amounts. My first few fixes were mere line changes, which, though they qualified as a ‘release’, felt to me like the shortest fragment of a ‘patch’. Over time this changed, as David stressed how these fixes were improving the lives of both users and developers, be it bug fixes, new features, or even accessibility improvements (my niche, I suppose). I saw that contributing didn’t have to be verbose, but instead helpful.

Where Do I Want To Go From Here

This isn’t my last blog post, nor is it my last blog post relating to OSD700. But, I figured this would be a nice place to put my ambitions and thoughts towards how I want to steer 2018. Not in any order of priority or execution:

– Learn VueJS / React
– Learn Python 3, Rust, GoLang
– Write a Full Stack Web Application from the Ground Up
– Write a Mobile Application from the Ground Up (iOS? Android?)
– Become a Mozillian!
– Become a Node Certified Developer
– Become a Linux Certified Administrator (maybe?!)
– Continue contributing to FOSS communities
– Continue working on Musical Adventures, release some of it!
– Continue being a member of the SenecaSocksCrew community

Going forward, I’m hoping to learn more lessons and also expose myself to newer technologies which contrast or conflict with my own current experiences and vices, my logic being that it will round me out as a programmer better than falling into a specific niche. I imagine my new career title will play well with this concept, going from Front-end Developer to Cloud Platform Engineer. 2018 is only a quarter of the way through, and there is still much that is possible before we see the end.

by RayGervais at April 15, 2018 09:14 PM


Hongcheng Zhang

lab6

We need to fix a URL bug in the Brave browser using TDD.

Here are some cases.

屏幕快照 2018-04-15 下午3.27.54

屏幕快照 2018-04-15 下午3.31.35

Most of these cases behave the same, but the results differ for these two:

  • "https://www.google.ca/search?q=dog cat"
  • "file:///Users/humphd/repos/browser-laptop/dog cat.txt"

 

So, I tried to fix this bug and found urlutil.js and urlutilTest.js.

urlutil.js checks the URL. We can see that it trims the spaces at the beginning and end, but the space inside the URL still causes the problem. I changed the code that checks the validity of the URL:

屏幕快照 2018-04-15 下午3.42.25

Next, we need to add some unit tests to test it.

屏幕快照 2018-04-15 下午3.44.49.png

Then, the result is:

result-unit-test

by hongcheng1993 at April 15, 2018 07:47 PM


Abdul Kabia

Show me dog/cat pictures!

So very recently, our professor tasked us with working on Brave. It's a lightweight browser with ad blocking as its main focus. A pretty cool idea, if you ask me. For this piece of work, he wanted us to see what Brave would do when you gave it certain queries in its URL bar, to see …

Continue reading Show me dog/cat pictures!

by akkabia at April 15, 2018 06:05 PM

April 14, 2018


Connor Ngo

Conversion is done - small sample

 

%oGroup = $LBC::Ports::Group[%port];
if(%group == -1)
{
    //%group = $LBC::Groups::NumGroups;
    //$LBC::Groups::NumGroups++;

    wires.add(%obj);
    %group = wires.getCount();
    //$LBC::Groups::Port[%group, 0] = %port;
    wires[%group].port[0] = %port;
    //$LBC::Groups::PortIDX[%group, %port] = 0;
    wires[%group].portIDX[%port] = 0;
    //$LBC::Groups::PortCount[%group] = 1;
    wires[%group].portCount = 1;
    //$LBC::Ports::Group[%port] = %group;


    //$LBC::Groups::Wire[%group, 0] = %obj;
    wire[%group].wire[0] = %obj;
    //$LBC::Groups::WireIDX[%group, %obj] = 0;
    wire[%group].wireIDX[%obj] = 0;
    //$LBC::Groups::WireCount[%group] = 1;
    wire[%group].wireCount = 1;
    //$LBC::Wires::Group[%obj] = %group;
}
else if(%group == %oGroup)
    continue;
else

{
    //%tpSize = $LBC::Groups::PortCount[%group];
    %tpSize = wire[%group].portCount;
    //$LBC::Ports::Group[$LBC::Groups::Port[%group, %tpSize] = %port] = %group;
    wire[%group].port[%tpSize] = %port;
    //$LBC::Groups::PortIDX[%group, %port] = %tpSize;
    wire[%group].portIDX[%port] = %tpSize;
    //$LBC::Groups::PortCount[%group]++;
    wire[%group].portCount++;
}

The conversion of the source code took a bit longer than expected, and it was rather tedious to change the majority of the variables from global variables over to SimSets. After several tests with my friends to make sure everything is working properly, it's finally ready to be shipped out. My fellow peers are eager to get this published.

In my next post I plan to upload two videos to show the optimization - The before and after videos.

April 14, 2018 07:34 PM


Ruihui Yan

Installing Fedora 27 on Raspberry Pi 3

Since the servers at Seneca are shared among the students and are never backed up, the chances of someone using too many resources on a specific machine, or of a server going down, are quite high. For that reason, I decided to install a Linux distribution on a Raspberry Pi 3 Model B, more specifically Fedora 27. I decided to install Fedora Workstation, which includes a graphical interface, and the installation involves downloading the Fedora image and flashing my SD card.

In this link there are instructions on how to write the image onto the SD card. It uses fedora-arm-installer for that. Since I am using a Macbook, I used an alternative solution, a program called Fedora Media Writer:

After flashing the SD card, I booted up the Raspberry Pi:

 

After a few minutes, the GUI boots up with the setup process:

 

Once the setup is done, you get access to Fedora desktop:

Apparently, there is a file that Fedora can’t redistribute and we need that file in order to make Wifi work:

To retrieve the file, we use the following command:

 

sudo curl https://fedora.roving-it.com/brcmfmac43430-sdio.txt -o /lib/firmware/brcm/brcmfmac43430-sdio.txt

After reboot, the Wifi should be available:

After using the Fedora Workstation for a while, I realized the GUI is not suitable for a computer as small as the Raspberry Pi, because the slowdowns and lags are really noticeable. Therefore, I decided to install a simpler version of Fedora, Fedora Minimal. It doesn’t have a GUI and is accessible only through the command line interface.

Again, the process is really similar, but the setup is done via CLI:

After the installation, I need to enable Wifi again:

Wifi connection interface:

 

Lastly, here is a screenshot of me accessing the Raspberry Pi from the Terminal on a Mac:

 

Now, everything is set to proceed with the project!

 

by blacker at April 14, 2018 04:36 PM

April 13, 2018


Qiliang Chen

OSD600 - Lab 6: Fixing URL bug in Brave browser

In this lab, we are going to fix a URL searching bug in the Brave browser. Although each browser has its own method of handling URL parsing, we hope the Brave browser can get the same search results as the popular browsers do. For me, I like the Chrome browser best, so I'm going to fix Brave's URL bug to make it behave the same way Chrome does.

Here's the link of our lab: https://wiki.cdot.senecacollege.ca/wiki/OSD600_and_DPS909_Winter_2018_Lab_6

After a few tries, we get results from Chrome and Brave. They differ in some ways.


The results differ for URLs which have a space inside them. Then I compared the resulting URLs from each browser.
Chrome:

Brave:

So I think Brave doesn't replace the space in the URL like Chrome does. Brave separates the URL into 2 parts and searches for them through the search engine. Our main goal is to replace the space with "%20" so the URL works in the Brave browser.

First, we know Brave already handles the trim step. When Brave gets input from the address bar, it removes the spaces at the front and back. The next step should be to replace spaces inside the input with "%20", but it doesn't do that. That is what we need to fix.

Let's go through the code.

I think the code for handling URLs is in the "urlutil.js" module. When I read the code, I found some functions that might be worth modifying, like:
 - getUrlFromInput();
I tried adding "input = input.replace(" ", "%20")" right after the trim() call in this function, but it doesn't work. I guess the input has already been handled by the time it reaches this function. I put a "console.log" after the trim to see what the result is.

I tried "https://www.google.ca/search?q=dog cat" as input to the address bar, and I got this result:


The program considers the URL to be separated into two parts: 1. https://www.google.ca/search?q=dog and 2. cat. Then it uses Google as the search engine to produce the result. That's why we see the Google search bar with content like "https://www.google.ca/search?q=dog cat":


So I need to find the step before this, before the URL gets split at the space!

I noticed the isNotUrl function is long. I think this function judges whether the input is a URL or not.

I put a "console.log" after the trim call in "isNotUrl", and sent the following URL to the Brave address bar:
"www.google.ca"
In the terminal, I got this result:

I think somewhere the program loops through the original input. It keeps concatenating the input string and checking whether it is a URL. When it has concatenated up to "www.google.ca", it checks the string and finds that this is a possible URL. Then it calls "getUrlFromInput", which I mentioned before, and "getUrlFromInput" formats it as a URL.

Back to our case. I tried "https://www.google.ca/search?q=dog cat" as input, and I got this:

I guess the "isNotUrl" function has already judged it to be a URL because it has the "https" scheme. That's why it doesn't start looping character by character. It returns the URL somewhere else to continue the process. Since the space is never replaced with "%20", the browser cannot search a URL with a space in it. The final result is that it splits this URL into two parts and searches them in the Google engine. That's why we see the terms searched separately in the Google search bar.
So I tried replacing the space here:

Result:
That works!

I also tried the other three inputs:
" https://www.google.ca/search?q=dog cat "   -----------------------Fixed!
"/Users/kignorc/brave-browser/browser-laptop/dog cat.txt"-------Fixed!
" /Users/kignorc/brave-browser/browser-laptop/dog cat.txt "------Fixed!

When I pushed the project to GitHub, it returned some errors! I think I made another mistake by just changing the space to "%20". I referred to another student's method, which replaces the space like this (keeping the captured group and using the global flag so the text after each space isn't lost):
str.replace(/\s+([^\s]+)/g, '%20$1')

Now I can pass the test!

I wrote a test specifically for the 'isURL' function. This function is the opposite of the 'isNotURL' function.
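Since the screenshot of the test isn't reproduced here, below is a hedged sketch of what such a test could look like in the existing mocha/assert style of urlUtilTest.js (the describe/it wording is mine, and the isURL helper is as described in this post rather than copied from the repository).

  describe('isURL', function () {
    it('treats a url with a space in the query as a url', function () {
      assert.equal(urlUtil.isURL('https://www.google.ca/search?q=dog cat'), true)
    })
    it('treats a file path with a space as a url', function () {
      assert.equal(urlUtil.isURL('file:///Users/humphd/repos/browser-laptop/dog cat.txt'), true)
    })
  })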

These test strings should all be URLs, so they should pass all the tests. I used 'mocha' to run the tests and got a result like this:


Lab done!

by Chan Kignor (noreply@blogger.com) at April 13, 2018 02:09 AM

April 12, 2018


Justin Vuu

OSD600 – Release 0.3 – Starting Up

Whew with a little over a week to go, I’m going to dive into Release 0.3. I think for this release I’m going to tackle multiple issues, and maybe scale up in difficulty as I go through them.

About:Passwords – Brave Browser

My first issue will be, again, Brave browser. Here’s the issue in question: https://github.com/brave/browser-laptop/issues/13793.

The instructions aren’t exactly wrong, because on MacOS, the menu option reads “Preferences”. This issue is specific to Windows and Linux which display “Settings” instead.

This bit of code from a different part of the browser certainly suggests that to be the case:

passwords01

So let’s go straight to the relevant parts of this issue. After a bit of digging around, I discovered two files are responsible for the about:passwords page: passwords.js, and passwords.properties. The first file is responsible for the design of the page, and the second contains the localization strings for the US English locale.

In passwords.js, this is the block of code I’ll need to modify:

passwords02

We can see on line 232 that the contents of the div are defined by the l10n (localization) ID, with the ID being the name of the variable in the relevant locale file.

One way to deal with this issue is to add an inline conditional statement for data-l10n-id at line 232. So, if the OS is Mac (a.k.a. Darwin), then it displays the string with the original instructions, but if it's another OS, it displays a different string.
passwords03
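Roughly, the shape of that change looks like the sketch below. This is an illustration rather than the actual passwords.js diff; the string IDs and class name are hypothetical, and the platform check uses the standard Electron/Node process.platform value.

  // Hypothetical sketch of an inline conditional l10n ID (not the real diff).
  <div className='passwordsInstructions'
    data-l10n-id={process.platform === 'darwin'
      ? 'passwordsInstructionsDarwin'   // "...go to Preferences > Security"
      : 'passwordsInstructions'         // "...go to Settings > Security"
    } />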

This somewhat fixes the first part of the issue. This fix opens a new can of worms for localization, however, because there is now a new string to localize in all the languages Brave currently supports. I brought this up in a comment on the issue:

passwords04.PNG

I’ll have to wait and see their reply.

In the meantime…

I continued to see how I can insert a link. Currently, the text is a <div>, which to my knowledge can’t be a link. And personally, I’d prefer if only “Settings(Preferences) > Security” was a link, and not the whole line.

After a bit of fiddling, I came to a solution. I split the string so that the static part – “To change how passwords are managed, go to ” – is its own string with the original name, and “Settings(Preferences) > Security”  are two strings called “passwordInstructionsLink(Darwin)”.

passwords05

I moved the localization ID out of the <div> and into a <span> tag nested inside. Then, the “passwordInstructionsLink” strings were placed in a <label> tag, with an additional onClick event property.

passwords06

And here’s the result!

passwords07

New Challenges

However nice this looks, it does make localization much more difficult because it separates one sentence into 3 strings. On top of that, some locales have additional words after the directions, which should probably stay static.

passwords08

If I want that to be regular text, that would probably mean splitting the sentence into 4 strings for localization. I hope that won’t be the case.

Link to branch: https://github.com/jevuu/browser-laptop/tree/passwords

by justosd at April 12, 2018 11:12 PM


Abdul Kabia

Playing with the big bois

Every couple of weeks, I'm tasked with making a 'release'. Now these releases aren't necessarily a release of a product, instead, they're a release of code to a Github repo of your choice. In short, I'm tasked with finding a repo I would like to contribute to and quite literally, contribute. Initially in my one …

Continue reading Playing with the big bois

by akkabia at April 12, 2018 10:49 PM


Stephen Ward

Project Hacking: Optimizing the Code.

This blog is in reference to SPO600’s Project: Part II.

My current situation is as follows. My first plan is to look at the way BlockFiles are handled by Audacity.

Within the BlockFile object there are a lot of repeatable operations. Specifically, multiplication. My first attempt to speed up code was to lower the number of these operations throughout the entire implementation of the object. This type of optimization is called hoisting.


My attempt to hoist variables looked as follows:

int t = i * 256 was able to replace 5 multiplication operations, 
happening in a loop 256 times.

= 5 * 256 
= 1280 operations - 1 (the initial creation of t)

int k = i * 256 + j was able to replace 3 operations, 
happening over a span of 3 lines, in a loop occurring 256 times.
The code swap looked like this:

if (summary256[3 * (i * 256 + j)] < min)
   min = summary256[3 * (i * 256 + j)];
if (summary256[3 * (i * 256 + j) + 1] > max)
   max = summary256[3 * (i * 256 + j) + 1];
float r1 = summary256[3 * (i * 256 + j) + 2];

************************************************

if (summary256[3 * k] < min)
   min = summary256[3 * k];
if (summary256[3 * k + 1] > max)
   max = summary256[3 * k + 1];
float r1 = summary256[3 * k + 2];


= 3 * 3 * 256
= 2304 operations - 1 (the initial creation of k).


I continued making these changes at every point in the function. An additional 1024 multiplication operations were also replaced, preventing a total of roughly 4605 operations from being performed.

Did this speed up the program?

Unfortunately, this did not speed up the benchmark significantly. It did not slow it down either though.

Using 178937 chunks of 293 samples each, for a total of 100.0 MB.
Preparing…
Performing 200 edits…
Time to perform 200 edits: 9447 ms, comparable to 9337ms from before.
Doing correctness check…
Passed correctness check!
Time to check all data: 11329 ms
Reading data again…
Time to check all data (2): 11333 ms
At 44100 Hz, 16-bits per sample, the estimated number of
simultaneous tracks that could be played at once: 104.9
Benchmark completed successfully.

It is still consistent with my initial profiling. There could be a couple of reasons why. For one, the compiler might already recognize and eliminate the redundant multiplication during compilation, which would make my change redundant because the compiler has already done the same thing. One way to check would be to traverse the objdump output (it’s over 150MB and I can’t upload it to GitHub) and try to figure out how CalcSummaryBuffer was handled.

Secondly, perhaps I need to increase the number of edits to see a noticeable difference. The code itself, especially for dithering, which is what is called upon most during the benchmark, has been written using only macros. Does the -O2 optimization level affect code produced by macro expansion?

The question remains whether Audacity is already optimized beyond what I am capable of improving. I have been poking my head into a lot of the functions found in the previous gprof output. Anything that seems to remotely influence the run time links back to the ‘gDitherAlgorithm’ class, which points to optimized macro code. My next goal before the weekend is to continue trying builds with different optimization levels on AArch64 and x86_64 machines.

The low hanging fruit: Optimizing with a new level.

Since hoisting seemed to fail at optimizing, I also plan to look at simply changing the level of optimization for the program.

The new configure command was as such:

./configure CFLAGS="-O3 -pg" CPPFLAGS="-O3 -pg" --enable-debug --with-portaudio=local

The new benchmarking results were quite surprising. Keeping the same variables as the level 2 optimization run (200 edits on a 100 MB file), level 3 optimization looked like this:

Using 178937 chunks of 293 samples each, for a total of 100.0 MB.
Preparing...
Performing 200 edits...
Time to perform 200 edits: 4717 ms
Doing correctness check...
Passed correctness check!
Time to check all data: 4974 ms
Reading data again...
Time to check all data (2): 4929 ms
At 44100 Hz, 16-bits per sample, the estimated number of
 simultaneous tracks that could be played at once: 241.2
Benchmark completed successfully.

I ran it again, making sure this number wasn’t a fluke, and the average of 10 more runs came out to be: 5023 ms.

Looking again at the gprof output:
Screenshot from 2018-04-18 18-21-33

CalcSummaryFromBuffer was still called the same number of times (3549), but its share of the run time dropped from 6.89% in the initial benchmarking to 3.65%.

Summary:

5023 ms / 9334 ms ≈ 0.54, so the run time dropped to about 54% of the original (roughly a 46% decrease).

3.65% / 6.89% ≈ 0.53, so the function's share of CPU time was roughly cut in half.

Another neat thing to note is that with -O3, the number of tracks that Audacity could play simultaneously more than doubled (about 2.3x). Tackling why this occurred will be the topic for Stage 3. If I can confidently conclude that the new flag makes a positive impact, I will inform the community.

 

by steaward at April 12, 2018 07:12 PM


Aleksey Glazkov

Open Standards and JavaScript tests.

In one of our labs in the Open Source development course, we discussed open standards (a set of requirements for a programming language) and dug into the testing of the JavaScript language, which is defined by the ECMAScript Language Specification.

First of all, I downloaded the test suite from GitHub and ran all the tests on my local computer. To run the tests, follow these instructions. Most of the tests passed successfully; however, 4 of them failed. Specifically: default-locale-is-supported.js (default), default-locale-is-supported.js (strict mode), supportedLocalesOf-default-locale-and-zxx-locale.js (default), and supportedLocalesOf-default-locale-and-zxx-locale.js (strict mode).

js_tests

After reading other students’ blog posts, I noticed that this is a common problem with these particular tests.

After reading about the tests and what they do, I had to create a set of my own tests for the Array.prototype.reverse() function. I carefully studied the specification of this function on ECMA’s web site. In order to write the tests, I had to study how such tests are written and formatted. Finally, I practised by writing a set of my own tests for JavaScript’s Array.prototype.reverse() function. You can check out my results here. It was not very complicated, as I had a lot of examples of already-written tests to help me.
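To give a flavour of the format, here is a hedged, minimal sketch of what a test262-style test for Array.prototype.reverse() can look like (the frontmatter is abbreviated, and the wording is mine rather than copied from my submitted tests); the suite's harness provides assert.sameValue by default and assert.compareArray through the compareArray.js include.

/*---
description: reverse() reverses the elements in place and returns the same array
includes: [compareArray.js]
---*/

var sample = [1, 2, 3];
var result = sample.reverse();

assert.sameValue(result, sample, 'reverse() returns the object it was called on');
assert.compareArray(sample, [3, 2, 1], 'elements are reversed in place');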

by alexglazkov at April 12, 2018 06:49 PM


Qichang Chen

SPO600 - Project - Stage Two

In the second stage of my SPO600 project, I need to implement my optimizations to the function in the open source software that I have chosen, which is SSDUP. I also need to prove that the optimized code produces the same results as the original code, and I will compare the performance of the optimized functions between AArch64 and non-AArch64 platforms.

As I mentioned in stage one, SSDUP uses 3 different Murmur3 hash functions, which are optimized for x86 and x64 platforms. The first function is for 32-bit machines and produces a 32-bit output. The second function is also for 32-bit machines but produces a 128-bit output. The third function is for 64-bit machines and produces a 128-bit output. Each hash function produces a different hash value.

Here is the source code of my file:
-----------------------------------------------------------------
#include "murmur3.h"
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#ifdef __GNUC__
#define FORCE_INLINE __attribute__((always_inline)) inline
#else
#define FORCE_INLINE
#endif
#define SIZE 100000

static FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r )
{
  return (x << r) | (x >> (32 - r));
}

static FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
{
  return (x << r) | (x >> (64 - r));
}

#define ROTL32(x,y) rotl32(x,y)
#define ROTL64(x,y) rotl64(x,y)

#define BIG_CONSTANT(x) (x##LLU)

//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here

#define getblock(p, i) (p[i])

//-----------------------------------------------------------------------------
// Finalization mix - force all bits of a hash block to avalanche


static FORCE_INLINE uint32_t fmix32 ( uint32_t h )
{
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;

  return h;
}

//----------

static FORCE_INLINE uint64_t fmix64 ( uint64_t k )
{
  k ^= k >> 33;
  k *= BIG_CONSTANT(0xff51afd7ed558ccd);
  k ^= k >> 33;
  k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
  k ^= k >> 33;

  return k;
}

//-----------------------------------------------------------------------------
int main() {
  uint32_t hash[4];                /* Output for the hash */
  uint32_t seed = 42;              /* Seed value for hash */

  clock_t startTime1, startTime2, startTime3;
  clock_t endTime1, endTime2, endTime3;
  double totalTime1, totalTime2, totalTime3;

  char *samples = (char*)calloc(SIZE, sizeof(char));

  srand(1);

  //generate random characters

  // Fill all but the last byte so the calloc'd terminating '\0' survives
  // and strlen(samples) stays within the buffer.
  for (int i = 0; i < SIZE - 1; i++) {
    samples[i] = (random() % 26) + 'A';
  }

  /*if (argc != 2) {
    printf("usage: %s \"string to hash\"\n", argv[0]);
    exit(1);
  }

  printf("Input: \"%s\"\n", argv[1]);*/

  startTime1 = clock();  // Start time
  //MurmurHash3_x86_32(argv[1], strlen(argv[1]), seed, hash);
  for (int i = 0; i < SIZE; i++) {

  MurmurHash3_x86_32(samples, strlen(samples), seed, hash);

  }
  endTime1 = clock();  // End time
  totalTime1 = (double)(endTime1 - startTime1) / CLOCKS_PER_SEC;

  printf("x86_32:  %08x\n", hash[0]);
  printf("Total time: %lf seconds\n", totalTime1);

  startTime2 = clock(); // Start time
  //MurmurHash3_x86_128(argv[1], strlen(argv[1]), seed, hash);
  for (int i = 0; i < SIZE; i++) {

  MurmurHash3_x86_128(samples, strlen(samples), seed, hash);

  }
  endTime2 = clock();  // End time
  totalTime2 = (double)(endTime2 - startTime2) / CLOCKS_PER_SEC;

  printf("x86_128: %08x %08x %08x %08x\n",
         hash[0], hash[1], hash[2], hash[3]);
  printf("Total time: %lf seconds\n", totalTime2);

  startTime3 = clock(); // Start time
  //MurmurHash3_x64_128(argv[1], strlen(argv[1]), seed, hash);
  for (int i = 0; i < SIZE; i++) {

  MurmurHash3_x64_128(samples, strlen(samples), seed, hash);

  }
  endTime3 = clock();  // End time
  totalTime3 = (double)(endTime3 - startTime3) / CLOCKS_PER_SEC;

  printf("x64_128: %08x %08x %08x %08x\n",
         hash[0], hash[1], hash[2], hash[3]);
  printf("Total time: %lf seconds\n", totalTime3);

  return 0;
}


void MurmurHash3_x86_32 ( const void * key, int len,
                          uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 4;
  int i;

  uint32_t h1 = seed;

  uint32_t c1 = 0xcc9e2d51;
  uint32_t c2 = 0x1b873593;

  //----------
  // body

  const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);

  for(i = -nblocks; i; i++)
  {
    uint32_t k1 = getblock(blocks,i);

    k1 *= c1;
    k1 = ROTL32(k1,15);
    k1 *= c2;
 
    h1 ^= k1;
    h1 = ROTL32(h1,13);
    h1 = h1*5+0xe6546b64;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*4);

  uint32_t k1 = 0;

  switch(len & 3)
  {
  case 3: k1 ^= tail[2] << 16;
  case 2: k1 ^= tail[1] << 8;
  case 1: k1 ^= tail[0];
          k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len;

  h1 = fmix32(h1);

  *(uint32_t*)out = h1;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x86_128 ( const void * key, const int len,
                           uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 16;
  int i;

  uint32_t h1 = seed;
  uint32_t h2 = seed;
  uint32_t h3 = seed;
  uint32_t h4 = seed;

  uint32_t c1 = 0x239b961b;
  uint32_t c2 = 0xab0e9789;
  uint32_t c3 = 0x38b34ae5;
  uint32_t c4 = 0xa1e38b93;

  //----------
  // body

  const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);

  for(i = -nblocks; i; i++)
  {
    uint32_t k1 = getblock(blocks,i*4+0);
    uint32_t k2 = getblock(blocks,i*4+1);
    uint32_t k3 = getblock(blocks,i*4+2);
    uint32_t k4 = getblock(blocks,i*4+3);

    k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;

    h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;

    k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;

    h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;

    k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;

    h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;

    k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;

    h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

  uint32_t k1 = 0;
  uint32_t k2 = 0;
  uint32_t k3 = 0;
  uint32_t k4 = 0;

  switch(len & 15)
  {
  case 15: k4 ^= tail[14] << 16;
  case 14: k4 ^= tail[13] << 8;
  case 13: k4 ^= tail[12] << 0;
           k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;

  case 12: k3 ^= tail[11] << 24;
  case 11: k3 ^= tail[10] << 16;
  case 10: k3 ^= tail[ 9] << 8;
  case  9: k3 ^= tail[ 8] << 0;
           k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;

  case  8: k2 ^= tail[ 7] << 24;
  case  7: k2 ^= tail[ 6] << 16;
  case  6: k2 ^= tail[ 5] << 8;
  case  5: k2 ^= tail[ 4] << 0;
           k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;

  case  4: k1 ^= tail[ 3] << 24;
  case  3: k1 ^= tail[ 2] << 16;
  case  2: k1 ^= tail[ 1] << 8;
  case  1: k1 ^= tail[ 0] << 0;
           k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;

  h1 += h2; h1 += h3; h1 += h4;
  h2 += h1; h3 += h1; h4 += h1;

  h1 = fmix32(h1);
  h2 = fmix32(h2);
  h3 = fmix32(h3);
  h4 = fmix32(h4);

  h1 += h2; h1 += h3; h1 += h4;
  h2 += h1; h3 += h1; h4 += h1;

  ((uint32_t*)out)[0] = h1;
  ((uint32_t*)out)[1] = h2;
  ((uint32_t*)out)[2] = h3;
  ((uint32_t*)out)[3] = h4;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x64_128 ( const void * key, const int len,
                           const uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 16;
  int i;

  uint64_t h1 = seed;
  uint64_t h2 = seed;

  uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
  uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);

  //----------
  // body

  const uint64_t * blocks = (const uint64_t *)(data);

  for(i = 0; i < nblocks; i++)
  {
    uint64_t k1 = getblock(blocks,i*2+0);
    uint64_t k2 = getblock(blocks,i*2+1);

    k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;

    h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;

    k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;

    h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

  uint64_t k1 = 0;
  uint64_t k2 = 0;

  switch(len & 15)
  {
  case 15: k2 ^= (uint64_t)(tail[14]) << 48;
  case 14: k2 ^= (uint64_t)(tail[13]) << 40;
  case 13: k2 ^= (uint64_t)(tail[12]) << 32;
  case 12: k2 ^= (uint64_t)(tail[11]) << 24;
  case 11: k2 ^= (uint64_t)(tail[10]) << 16;
  case 10: k2 ^= (uint64_t)(tail[ 9]) << 8;
  case  9: k2 ^= (uint64_t)(tail[ 8]) << 0;
           k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;

  case  8: k1 ^= (uint64_t)(tail[ 7]) << 56;
  case  7: k1 ^= (uint64_t)(tail[ 6]) << 48;
  case  6: k1 ^= (uint64_t)(tail[ 5]) << 40;
  case  5: k1 ^= (uint64_t)(tail[ 4]) << 32;
  case  4: k1 ^= (uint64_t)(tail[ 3]) << 24;
  case  3: k1 ^= (uint64_t)(tail[ 2]) << 16;
  case  2: k1 ^= (uint64_t)(tail[ 1]) << 8;
  case  1: k1 ^= (uint64_t)(tail[ 0]) << 0;
           k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len; h2 ^= len;

  h1 += h2;
  h2 += h1;

  h1 = fmix64(h1);
  h2 = fmix64(h2);

  h1 += h2;
  h2 += h1;

  ((uint64_t*)out)[0] = h1;
  ((uint64_t*)out)[1] = h2;
}


-----------------------------------------------------------------

In stage one, I have benchmarked all 3 hash function. Here is the result:
-----------------------------------------------------------------
[qichang@aarchie hash]$ g++ -o benchmark benchmark.c
[qichang@aarchie hash]$ ./benchmark
x86_32:  48f2284b
Total time: 14.185739 seconds
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 13.185568 seconds
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.157431 seconds
-----------------------------------------------------------------
The first function spent: 14.185739 seconds
The second function spent: 13.185568 seconds
The third function spent: 8.157431 seconds
-----------------------------------------------------------------

In order to compare the results for each hash function with and without optimization, I decided to split these 3 hash functions, I will only execute one hash function at a time.
-rw-rw-r--. benchmark64128.c
-rw-rw-r--. benchmark86128.c
-rw-rw-r--. benchmark8632.c

As for the optimization strategies I mentioned in stage one, I will first go with the easiest approach, which is to use altered build options to compile these hash functions. My second strategy was to use inline assembly to optimize the code, but after consideration it seems too difficult, and I don't think there is much chance of improving the performance that way.

In the Makefile of SSDUP, I noticed that the software uses 'gcc -std=gnu99' to compile, which means no optimization option is used to compile and build SSDUP, so the results should be the same as my stage one benchmarks. But since I split the functions into different files, I will benchmark each file again. Then I will compile each benchmark program with the -O3 option instead of compiling without any optimization options. I am going to run each file 5 times in order to produce an accurate result.
-----------------------------------------------------------------
First hash function (MurmurHash3_x86_32):
[qichang@aarchie hash]$ gcc -std=gnu99 -o benchmark8632 benchmark8632.c
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 14.248357 seconds
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 14.177813 seconds
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 14.190903 seconds
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 14.117388 seconds
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 14.223588 seconds

[qichang@aarchie hash]$ gcc -std=gnu99 -O3 -o benchmark8632 benchmark8632.c
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.574737 seconds
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.572717 seconds
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.573482 seconds
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.574128 seconds
[qichang@aarchie hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.573317 seconds
-----------------------------------------------------------------
Second hash function (MurmurHash3_x86_128):
[qichang@aarchie hash]$ gcc -std=gnu99 -o benchmark86128 benchmark86128.c
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 13.310612 seconds
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 13.332081 seconds
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 13.294588 seconds
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 13.240154 seconds
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 13.247286 seconds

[qichang@aarchie hash]$ gcc -std=gnu99 -O3 -o benchmark86128 benchmark86128.c
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 1.337854 seconds
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 1.338757 seconds
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 1.341055 seconds
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 1.339572 seconds
[qichang@aarchie hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 1.340968 seconds
-----------------------------------------------------------------
Third hash function (MurmurHash3_x64_128):
[qichang@aarchie hash]$ gcc -std=gnu99 -o benchmark64128 benchmark64128.c
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.195865 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.179205 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.190692 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.193841 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.234855 seconds

[qichang@aarchie hash]$ gcc -std=gnu99 -O3 -o benchmark64128 benchmark64128.c
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.304771 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.308439 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.315429 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.316185 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.314772 seconds
-----------------------------------------------------------------
For the first hash function (MurmurHash3_x86_32), the execution time with -O3 is about 802% faster than without an optimization option.
For the second hash function (MurmurHash3_x86_128), the execution time with -O3 is about 891% faster than without an optimization option.
For the third hash function (MurmurHash3_x64_128), the execution time with -O3 is about 523% faster than without an optimization option.
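
To be explicit about what "X% faster" means here: it is the ratio of the unoptimized time to the optimized time, minus one. A small worked example, using rounded averages from the x86_32 runs above:

#include <stdio.h>

int main(void)
{
  double t_plain = 14.19;   // average seconds without an optimization option
  double t_o3    = 1.574;   // average seconds with -O3

  // "X% faster" = (old time / new time - 1) * 100
  double percent_faster = (t_plain / t_o3 - 1.0) * 100.0;
  printf("about %.0f%% faster\n", percent_faster);   // prints: about 802% faster
  return 0;
}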

The first two hash functions, which are optimized for x86 platforms, show a huge improvement in performance after compiling with -O3. The third hash function, which is optimized for x64 platforms, also shows a large improvement compared with not using -O3, but not as much as the x86 versions.

After using the compilation build option, I am going to improve the existing algorithm to optimize the functions. For this part, I will only try to optimize the third hash function, because it has been optimized for x64 platforms. Within the hash function, there is a function called getblock whose second argument is i*2+0 or i*2+1; I will use an addition operation instead of the multiplication to test if there is any improvement in performance. Secondly, at the beginning of the x64 hash function there is a line nblocks = len / 16, and later in the function there is an expression nblocks*16. Neither nblocks nor len changes within the function; they are both integer constants there, so nblocks*16 equals len for this data, and I can eliminate one operation by replacing the multiplication with the variable len to test if there is any improvement in performance.
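
For context, in the reference MurmurHash3 implementation getblock is just an inlined array access, so the only work besides the load is computing the index; the exact name and signature in SSDUP's copy may differ slightly:

  // Reference-style getblock for the x64 variant: a plain indexed load.
  // The cost being targeted is the index arithmetic (i*2+0 / i*2+1).
  static inline uint64_t getblock ( const uint64_t * p, int i )
  {
    return p[i];
  }

It is worth noting that an optimizing compiler will usually perform this kind of strength reduction (turning i*2 into i+i or a shift) on its own, so the hand-written change mostly matters for the un-optimized build being measured here.
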
Here is what I changed:
-----------------------------------------------------------------
Before:
uint64_t k1 = getblock(blocks,i*2+0);
uint64_t k2 = getblock(blocks,i*2+1);

After:
uint64_t k1 = getblock(blocks,i+i+0);
uint64_t k2 = getblock(blocks,i+i+1);
-----------------------------------------------------------------
Before:
const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

After:
const uint8_t * tail = (const uint8_t*)(data + len);
-----------------------------------------------------------------

Here is the result:
-----------------------------------------------------------------
[qichang@aarchie hash]$ gcc -std=gnu99 -o benchmark64128 benchmark64128.c
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.145001 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.143322 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.137052 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.195385 seconds
[qichang@aarchie hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.178763 seconds
-----------------------------------------------------------------
The average time of this hash function without the code changes is 8.1984 seconds.
The average time of this hash function with the code changes is 8.1596 seconds.
That is about 0.5% faster with the code changes.
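
For the record, the averages and the relative difference come from a straightforward calculation over the five runs; a sketch of that calculation (the quoted averages may have come from a slightly different set of runs):

#include <stdio.h>

int main(void)
{
  // The five x64_128 timings without and with the code change, from the runs above
  double before[5] = { 8.195865, 8.179205, 8.190692, 8.193841, 8.234855 };
  double after[5]  = { 8.145001, 8.143322, 8.137052, 8.195385, 8.178763 };

  double sum_before = 0.0, sum_after = 0.0;
  for (int i = 0; i < 5; i++) {
    sum_before += before[i];
    sum_after  += after[i];
  }

  double avg_before = sum_before / 5.0;
  double avg_after  = sum_after / 5.0;
  printf("average before: %.4f s, after: %.4f s\n", avg_before, avg_after);
  printf("improvement: %.2f%%\n", (avg_before - avg_after) / avg_before * 100.0);  // about 0.5%
  return 0;
}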

In the next step, I am going to benchmark the hash functions on an x86_64 system. Same as before, I will first compile my benchmark program without the -O3 option, and then compile again with -O3, to compare the results and see whether there is any improvement in the hash functions. Here are the results:
-----------------------------------------------------------------
First hash function (MurmurHash3_x86_32):
[qichang@xerxes hash]$ gcc -std=gnu99 -o benchmark8632 benchmark8632.c
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.321532 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.729160 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.325567 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.336685 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.720984 seconds

[qichang@xerxes hash]$ gcc -std=gnu99 -O3 -o benchmark8632 benchmark8632.c
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.144952 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.145310 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.144949 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.147177 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 1.144718 seconds
-----------------------------------------------------------------
Second hash function (MurmurHash3_x86_128):
[qichang@xerxes hash]$ gcc -std=gnu99 -o benchmark86128 benchmark86128.c
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.496733 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.512100 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.519267 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.516495 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.517119 seconds

[qichang@xerxes hash]$ gcc -std=gnu99 -O3 -o benchmark86128 benchmark86128.c
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 0.983106 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 0.982542 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 0.982342 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 0.982931 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 0.982379 seconds
-----------------------------------------------------------------
Third hash function (MurmurHash3_x64_128):
[qichang@xerxes hash]$ gcc -std=gnu99 -o benchmark64128 benchmark64128.c
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 4.855595 seconds
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 4.855948 seconds
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 4.849469 seconds
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 4.855682 seconds
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 4.848947 seconds

[qichang@xerxes hash]$ gcc -std=gnu99 -O3 -o benchmark64128 benchmark64128.c
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.293736 seconds
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.298314 seconds
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.296543 seconds
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.296466 seconds
[qichang@xerxes hash]$ ./benchmark64128
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 1.294762 seconds
-----------------------------------------------------------------

For the first hash function (MurmurHash3_x86_32), the average execution time without -O3 is about 9.49 seconds, and with -O3 about 1.15 seconds. The execution time with -O3 is about 725% faster than without an optimization option.
For the second hash function (MurmurHash3_x86_128), the average execution time without -O3 is about 8.51 seconds, and with -O3 about 0.98 seconds. The execution time with -O3 is about 766% faster than without an optimization option.
For the third hash function (MurmurHash3_x64_128), the average execution time without -O3 is about 4.85 seconds, and with -O3 about 1.29 seconds. The execution time with -O3 is about 275% faster than without an optimization option.

The first two hash functions, which have been optimized for x86 platforms, improve significantly in performance after compiling with the -O3 option. The third hash function, which has been optimized for x64 platforms, also improves, though by a smaller factor.

In the next step of optimization, I will try to optimize the first two hash functions, because they are the ones already optimized for x86 platforms. Like the change I made in the third function, in the first function nblocks*4 equals len, so I will remove one operation by replacing the multiplication with the variable len to see if there is any improvement. Here are the code changes:
First function:
-----------------------------------------------------------------
Before:
const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);
const uint8_t * tail = (const uint8_t*)(data + nblocks*4);

After:
const uint32_t * blocks = (const uint32_t *)(data + len);
const uint8_t * tail = (const uint8_t*)(data + len);
-----------------------------------------------------------------
Second function:
-----------------------------------------------------------------
Before:
const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

After:
const uint32_t * blocks = (const uint32_t *)(data + len);
const uint8_t * tail = (const uint8_t*)(data + len);
-----------------------------------------------------------------

Here are the results:
-----------------------------------------------------------------
First hash function with code changes (MurmurHash3_x86_32):
[qichang@xerxes hash]$ gcc -std=gnu99 -o benchmark8632 benchmark8632.c
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.726852 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.724399 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.724434 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.726194 seconds
[qichang@xerxes hash]$ ./benchmark8632
x86_32:  48f2284b
Total time: 9.360804 seconds
-----------------------------------------------------------------
Second hash function (MurmurHash3_x86_128):
[qichang@xerxes hash]$ gcc -std=gnu99 -o benchmark86128 benchmark86128.c
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.516874 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.523627 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.537537 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.528274 seconds
[qichang@xerxes hash]$ ./benchmark86128
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 8.525721 seconds
-----------------------------------------------------------------

For the first hash function (MurmurHash3_x86_32), the average execution time without the code changes is about 9.49 seconds; with the code changes it is about 9.65 seconds.
For the second hash function (MurmurHash3_x86_128), the average execution time without the code changes is about 8.51 seconds; with the code changes it is about 8.52 seconds.

All of the tests were first completed on an AArch64 system. My first step in optimizing the hash functions was to compile my benchmark program with the -O3 compilation option. The first two hash functions, which have been optimized for x86 platforms, showed a significant improvement in performance. The third hash function, which has been optimized for x64 platforms, also improved with -O3, though by a smaller factor. My second step was to change some code in the third function, which made it about 0.5% faster than without the code change.

Afterward, I ran the benchmark program on an x86_64 system, and the results show that it also has a significant improvement in performance when compiled with the -O3 option, although the relative improvement for the third function differs between the AArch64 and x86_64 platforms. I then also changed some code in the first two hash functions hoping to improve their performance. What disappointed me is that after I changed the code, expecting some improvement, the results actually show it running slower than without the code changes. In the end, compiling with the -O3 option is the best way to improve the performance of these hash functions.

by Frankie Chen (noreply@blogger.com) at April 12, 2018 03:14 AM


Owen Mak

Brave URL Testing

On my latest lab for open source development, I was tasked with fixing an error in the Brave laptop/desktop web browser involving the parsing of URLs from the address bar.  To fix this bug, I had to build the browser from the source code on GitHub. Afterwards, I input several URL test cases into the address bar and observed the behaviour of the Brave browser.  Finally, I repeated the test cases on the Firefox and Chrome browsers to see if there were any differences in how the URL is handled.

From the lab, there were many test cases offered, but ones that were of concern are as follows:

For these test cases, the Brave browser will not treat the URL as valid, and instead runs a Google search based on the URL text.  For the screenshots below, I used the URL https://www.google.ca/search?q=dog cat and tested the responses of Brave and Firefox.  In Brave, the browser performs a Google search with the URL input as the keywords.  This is in contrast to Firefox, where the browser parses the URL and deciphers the search terms to be just “dog cat”.


To get the class started on fixing the bug, our professor introduced us to the code involved in URL parsing for Brave and Firefox.  For Firefox, it was C++ code that goes as far back as the Netscape days, whereas for Brave, it was JavaScript code.

Since this was a test driven development style of programming, I was supposed to write a set of test cases prior to changing the code in urlutil.js.  The first function I studied closely was the getUrlFromInput() function, because I felt that if I could change the way the string is retrieved from the address bar, I could control what is passed to the subsequent function calls.  To begin, I wrote a new test case for getUrlFromInput as per the gist below:

Strangely, when I ran the new test case with the expectation of failure, the following happened:

getUrlFromInputTestCase: My new test case with a file path to “dog cat.txt” passed.

Since this test case passed, my professor indicated to me that it meant getUrlFromInput() is not the function causing the erroneous behaviour.

I moved on to looking at another function of particular interest in urlutil.js.  This time, it was the isNotURL() function.  By manipulating which inputs get flagged as valid URLs, I can control which strings are processed as a URL and which strings are processed as keyword searches. To start things off again, I wrote two test cases for the isNotURL() function, located at the bottom of the test cases:

The test cases failed as expected, and I knew from this failure that I had to modify some of the code in the isNotUrl() function.

To fix the behaviour in Brave, I had to work with 3 cases.  The first case is to check if there is a scheme in the URL, and then look for a space in the URL text.  If there is a scheme, then the URL is definitely not a file path, and what I needed to do was replace the whitespace ONLY in the query section of the URL string with ‘%20’.  At first, I simply replaced all whitespace in the entire URL string and triggered a new error in one of the existing test cases:

assert.equal(urlUtil.isNotURL('https://www.bra ve.com/test space.jpg'), true)

My fix would mistakenly trigger an error on this case because it would replace the whitespace in ‘bra ve’ and change it to ‘bra%20ve’.  Afterwards, the browser would attempt to go to the URL https://www.bra%20ve.com/test space.jpg and fail with an error.

BraveFaultyUrl: The function mistakenly adds the ‘%20’ when there is a space in the domain name.

To prevent this error, I had to check the case of an existing scheme (http or https), and then check for a space in the url string.  Afterwards, I would search for a ‘?’, such that I will only replace the whitespace in the query part of the url.  Here’s a short snippet of the code I used to perform this replacement:

After the small addition, passing a URL such as https://www.google.ca/search?q=dog cat hamster would replace only the spaces in the query string, and leave the string in the domain name unchanged.  The following slides demonstrate this behaviour.


The last two cases of ‘dog cat’ and ‘/home/omak/Documents/dog cat.txt’ would be fixed under the same section of the isNotUrl() function.

This code will only replace the whitespace if the URL is a file path.  This means only file paths such as ‘/home/omak/Documents/dog cat.txt’ will be converted to ‘/home/omak/Documents/dog%20cat.txt’.  Strings such as ‘dog cat’ will remain as Google searches and be left untouched.


With the two fixes, I managed to get Brave to work for the following urls:

This is a final screenshot to show my test cases are working at the end.

Brave-TestCaseSuccess: Take particular note of ‘is an absolute file path with a space in the files’ and ‘is a url which contains a space in the query’.

The commit to this fix is located here.

Learning experience:

From this exercise, the biggest learning experience came from comparing the way Firefox wrote their code in C++ to the way it was written in JavaScript for Brave.  By comparing the two codebases, one can tell a lot has changed in software development from the days of Netscape to current coding practices.  In particular, in the Firefox code base, the convention of naming variables based on facts about the variable, such as putting an ‘m’ in front of member variable names, was quite interesting to me.  Another interesting note was the canParseURL() function in urlutil.js, where the professor explained the case of window being undefined, highlighting the possibility of the function being used outside a browser.

This exercise was also my first time trying Test Driven Development, and it was quite different for me to write out my test cases first, and then modify the code base.  As mentioned above, I was quite surprised to have my test case on getUrlFromInput() pass, when it should have failed.  This scenario demonstrated the merits of writing out the test case first, as I would have wasted a lot of time fiddling with the getUrlFromInput() method otherwise.

 

by Owen Mak at April 12, 2018 01:40 AM


Aleksey Glazkov

Animate.CSS


In this blog post I want to focus your attention on an open-source project called Animate.css.

Animate.css is just a collection of “cool, fun, and cross-browser animations for you to use in your projects”, as it is described on the project’s GitHub page. I find it particularly useful when you want to make your web pages look more “alive”. With its help, your headings will not just rest on top of the page; they will slowly fade in or fade out, pulse or flash, shake or bounce. How cool is that?

box

The most important thing is that it can be done with a single line of markup. All you have to do is add the class animated plus an effect name to the element that you want to animate on your web page.

You don’t need to install any dependencies because all the animations are CSS based. All you have to do is include the remote version of the stylesheet in your document’s <head>.

You can do even more. With the help of jQuery and some custom CSS rules you can change the animation duration, add a delay, change the number of times it plays, dynamically add animations, and detect when an animation ends so you can do something in response.

The initial commit on GitHub was on October 12, 2011. Since that time the project has gathered more than 400 commits, 50,000 stars, 10,000 forks and 83 contributors. These figures show that this project is quite popular.

Next time you create a web page, try out some of these animations. They will definitely make it look cool!

flash

by alexglazkov at April 12, 2018 12:49 AM

April 11, 2018


Hao Chen

TDD in Brave Browser

This week, I will try TDD to fix a small issue in the Brave Browser. To set up the browser, follow this guide.

Entering

https://www.google.ca/search?q=dog cat

into the browser will produce this :

No idea why someone would do this, but let’s guard against it! The expected result should simply be a Google search for ‘dog cat’. So I wrote a test in Mocha for this expected behaviour.

Breaks as expected at first.

This is my initial fix: I attempt to use a regex to find and replace the spaces within a given URL with ‘%20’.

Although my test passed… something else broke. Now it also fills in the spaces within a given domain name, something we do not want!

I added additional arguments to my Regex to ignore the occurrences of space within the domain name.

Voila~


TDD in Brave Browser was originally published in Haorc on Medium, where people are continuing the conversation by highlighting and responding to this story.

by Hao Chen at April 11, 2018 11:57 AM


Aliaksandr Ushakou

Release 0.2

The goal of this release is to contribute to a real open source project.

So, the first task is to find an open source project. My first thought was to work on Visual Studio Code because I’ve already tried it and it was a very interesting experience. My second thought was to try something new. I found out that Brave browser is developing very actively and there are tons of open issues of varying complexity. My third thought was to work on both projects and I decided to chase ‘two hares’.

Part one: VSCode
 

My first attempt to fix a bug in VSCode started here. I found an issue related to the shadow element under the tab bar. The problem was that the width of the shadow was smaller than the width of the editor. The most interesting aspect of this issue is that it shows up only when the minimap is on the left side. I think the root of this problem is that the minimap was originally planned to be only on the right side; therefore, there are a bunch of code blocks where the possibility of the minimap being on the left side is not taken into account. Here, for example, is another issue related to the minimap on the left side. I ran into this issue and was going to file it, but found out that someone had already done so. I also tried to fix it, but so far I have not been able to find the cause.

So, that post ended with the problem fixed (fixed in my opinion) and a pull request submitted. Sounds like the story is over. But actually, it was only the beginning.

Usually, when a student submits a work or assignment, his job is done and he can forget about it. Open source development doesn’t work that way. When you submit a pull request, there is a great chance that you will be asked to make a correction or adjustment. I think this chance is directly proportional to the amount of code changed or added.

Thus, ten days later (after I submitted the PR) I got a notification that someone commented on my pull request.

comment

My first reaction was confusion. How come it does not fix the issue if it actually does?! So, I decided to ask for clarification.

comment2

The next reply was also confusing, but I figured out that we were looking at different things. So I decided to write a detailed comment with screenshots where I would explain my point of view.

The point of confusion was here:

scr-vscode

(“issue 1” is about the broken top shadow that shows there is something beyond the screen at the top. “issue 2” is about the vertical scrollbar that doesn’t show that there is something beyond the screen on the right; the vertical scrollbar should be transparent.)

The purpose of the shadow is not only decorative, but also to show that there are lines of code beyond the screen. So, the original issue described the problem shown in the picture above as “issue 1”. However, the person who reviewed my pull request was focusing on the problem shown in the picture as “issue 2”.

Fortunately, after a short dialogue, we understood each other and my task became clear.

comment3new

Although he said “sorry”, I think that he wasn’t really wrong, because he saw the problem from a different angle, which allowed us to correct two bugs at once.

Once I realized what I needed to do, I began to fix the bug. The fix itself was pretty simple. The coding part took a few hours, whereas the discussion took about five days. One of the reasons the discussion took more time than expected is different time zones; I have a suspicion that the person who reviewed my PR is from Switzerland.

2106102

Fortunately, my correction was accepted and I’ve gotten my first merged pull request!

merged

It was a really nice experience and I really appreciate the feedback and guidance from the VSCode team member.

Part two: Brave
 

First of all I want to mention that, in my opinion, contributing to Open Source should be fun. And in order to be fun, the project itself should be interesting or useful. For example, I use VSCode almost every day, so contributing to VSCode is contributing to a project that I’m interested in.

Brave turned out to be a very cool project. Although there are lots of bugs, because the project is still under active development, the Brave browser is really fast. I also think that the design of this browser is pretty neat. More than 2500 open issues are also an advantage for potential contributors. These are the reasons why I decided to work on this project.

Couldn’t make it
 

The first issue that I’ve tried to tackle is #13250: “Trying to save page offline always shows Downloading…”.

So, if you open a web page while there is an internet connection, and then try to save this page while there is no internet connection, this page will not be downloaded. Brave browser will not show any error or warning and downloading will continue forever.

My first idea was to put a condition somewhere in the code that cancels the downloading process if there is no internet connection. However, after a little research I found out that Google Chrome is able to save a page from the cache even if there is no internet connection. So, this issue turned out to be more complicated than it seemed at first glance. I spent a few days trying to figure out how the downloading process works in Brave, but I couldn’t manage it.

Thus, I decided to move on and found another issue that I might tackle.

Could make it
 

And that issue is #8633: “URL bar is not centre aligned when UI element scale is set to supersize”.

After reproducing this issue I found out that the problem is not really the alignment of the URL bar, but a weird behavior: the URL bar “jumps” when the window is being scaled.

1

As we can see, the width of the URL bar element is changing, so my first step was to find the code block responsible for this behavior.

I started to check CSS files in the project and found out that there are only six CSS files and none of them are relevant to the URL bar.

css

I continued my search and found an interesting file type that I’ve never seen before. And this file type is LESS.

*.less files are related to Less.js. Less (which stands for Leaner Style Sheets) is a backwards-compatible language extension for CSS. In other words, it is a CSS preprocessor that allows you to write styles in the Less language, which Less.js then compiles into pure CSS. Less looks just like CSS, but comes with a lot of powerful features such as variables, mixins, nested rules, operations, imports and more.

If we check LESS files in the project, we can find 20 LESS files and one of them is what I need.

less

After a little searching, I found the  navigationBar.less  file, which includes a bunch of CSS rules for the navigation bar. The URL bar is part of the navigation bar, so this is what I need.

I looked at this file and found the code block that is responsible for the “jumping” behavior.

code1

I added  background-color: skyblue !important;  in order to see what this code block is responsible for.

blue_h

As we can see, the problem is not in the URL bar element itself, but in the element that contains the buttons to the right of the URL bar.

The code block that is responsible for this behavior is the following:

code2

It means that if the width of the display area is at most  @breakpointWideViewport , then the width of the element holding the buttons to the right of the URL bar is set to  @navbarBraveButtonMarginLeft .

From the  variables.less  file:
 @breakpointWideViewport: 1000px; 
 @navbarBraveButtonMarginLeft: 80px; 

Therefore, if the width of the display area is at most 1000px, then the width of the buttons to the right of the URL bar is set to 80px.

I think the solution is to remove the line of code that sets the width of the buttons to  @navbarBraveButtonMarginLeft , or to remove the whole  @media rule.

So, the updated code block might look like this:

code3

Let’s check if it actually works:

2

And it works! The URL bar does not jump anymore.

I’ve already submitted the pull request for this issue; however, I’m still not sure whether the whole @media rule should be removed or just one line that sets the width.

Fixing issues while fixing issues
 

While working on the previous issue, I found out that the URL bar elements are layered on top of each other when the window width is reduced and “Toolbar and UI elements scale” is set to “Larger” or “Supersize”.

It looks like this:

sh1

I searched for an open issue related to this bug, but I couldn’t find anything, so I filed the issue myself.

I also fixed it and submitted the pull request; however, I’m not sure if my solution is perfect, but it works.

All I did was set the min-width of the URL bar element to 30px.

The result looks like this:

sh2

In conclusion
 

Contributing to Open Source can be hard and exhausting and fun at the same time. Every time I work on an open source project I learn something new. I think the most valuable lesson I’ve learned while working on Release 0.2 is that sometimes it is worthwhile to stop digging into an issue and move on if you feel like you are totally stuck or the issue is too complicated. And my point is not to quit when it is difficult, but to quit when it is not interesting anymore.

by aushakou at April 11, 2018 08:33 AM


Matt Rajevski

SPO600 Project – Part 1[update] & 2

After looking into the problem some more I was unable to figure out how to get gprof profiling working, therefore I will be using the time program and the program's built-in benchmark option. The time program won't help me determine the efficiency of the functions within the program, but it will give me a rough time to complete the task. As for the program's built-in benchmark option, it provides information on certain functions and their performance.

[mrrajevski@aarchie p7zip_16.02]$ 7za b "-mm=*"

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs LE)

LE
CPU Freq: 1998 1999 1999 1999 1999 1999 1999 1999 1999

RAM size: 16000 MB, # CPU hardware threads: 8
RAM usage: 1802 MB, # Benchmark threads: 8




Method       Speed Usage   R/U  Rating  E/U Effec
             KiB/s     %  MIPS    MIPS    %     %

CPU                  661  1999   13223
CPU                  624  1999   12478
CPU                  666  1999   13321  120   800

LZMA:x1      47086   651  2646  17213   159  1034
            149903   657  1859  12209   112   733
LZMA:x5:mt1  10144   659  1924  12673   116   761
            145518   658  1864  12272   112   737
LZMA:x5:mt2  10718   682  1963  13390   118   804
            142188   640  1873  11991   112   720
Deflate:x1  110402   642  2184  14018   131   842
            445994   621  2232  13858   134   832
Deflate:x5   40482   636  2449  15587   147   936
            448804   623  2235  13934   134   837
Deflate:x7   15434   662  2582  17101   155  1027
            475964   657  2247  14771   135   887
Deflate64:x5 38920   672  2503  16819   150  1010
            499315   697  2242  15621   135   938
BZip2:x1     23042   651  2140  13922   129   836
            109903   640  1862  11914   112   716
BZip2:x5     17371   660  2198  14498   132   871
             63509   661  1887  12466   113   749
BZip2:x5:mt2 17328   682  2119  14462   127   868
             62237   688  1776  12216   107   734
BZip2:x7      6172   678  2359  15991   142   960
             63919   653  1920  12535   115   753
PPMD:x1      15871   663  2475  16415   149   986
             12652   646  2308  14899   139   895
PPMD:x5       9558   665  2438  16200   146   973
              8063   652  2319  15111   139   907
Delta:4    2386459   642  2286  14662   137   881
           2062576   634  1998  12672   120   761
BCJ        3844213   673  2338  15746   140   946
           3679123   644  2341  15070   141   905
AES256CBC:1 478672   630  1867  11764   112   706
            488557   649  1850  12007   111   721
AES256CBC:2

CRC32:1    1590921   653  1774  11582   107   696
CRC32:4    4273755   651  1466   9539    88   573
CRC32:8    6023965   646  1265   8168    76   491
CRC64      4182943   657  1303   8567    78   514
SHA256      829640   622  2720  16925   163  1016
SHA1       1175588   622  1770  11004   106   661
BLAKE2sp

CPU                  596  1999  11913
------------------------------------------------------
Tot:                 656  2037  13350   122   802

The first thing I noticed with these stats is that the values that represent a percentage go into the hundreds, so I believe they are on a scale of 0-1000; for example, if the total efficiency is 649, the value it represents would be 64.9%. Another thing I noticed is that there were two sets of stats per method test, which is probably either the read/write speed or the compress/decompress rate. I will understand those values better if I manage to see an increase after making a change. The last thing I noticed was that most functions had an efficiency greater than 75%, which means it might be tricky to find an optimization in the source code as it may already be fairly well optimized.

The first thing I looked into for optimization was the flags set during compilation. The default build of the program used the -O flag.

Using the same 193 MiB folder, I tested the different optimization flags.

Setting the -O flag to -O2 resulted in an average runtime of 26.1s

Setting the -O flag to -O3 resulted in an average runtime of 26.1s

Considering that both flags resulted in the same average runtime, I suggest that -O2 be used, because the compile time was much shorter than with -O3 yet it had the same result.

I will look into the source code to see if I can find anything to improve, and will post an update with what I find.

by mrrajevski at April 11, 2018 04:01 AM


Ilkyu Song

SPO600 Project - Stage 2

In this Stage 2, I am going to take a deeper look at Redis, which I selected in Stage 1. Redis (Remote Dictionary Server) is an in-memory key-value store. Its performance is faster than disk-based databases, as it processes data directly in memory. While many key-value stores provide only primitive types, Redis supports data types such as String, Set, Hash, and List, and provides basic functions such as searching, adding, and deleting data.
First, I chose a Redis client library to test the Redis database: Hiredis. Hiredis is a C client library that provides the APIs needed to manipulate Redis. I will send 1,000,000 commands to the server to benchmark it using Hiredis. For this, I picked a C language source from GitHub and modified it. The GitHub URLs are below.

Hiredis Client Library Git: https://github.com/redis/hiredis
Benchmark Git: https://github.com/stefanwille/redis-client-benchmarks

This is the source for the benchmark.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>
#include <hiredis/hiredis.h>   /* or "hiredis.h", depending on how Hiredis is installed */

const int N = 1000000;

int main() {
    printf("Connecting...\n");
    redisContext *redis = redisConnect("localhost", 6379);

    clock_t start, end;
    float ftime;

    if (redis->err) {
        puts(redis->errstr);
        char *p = redis->errstr;
        while (*p) {
            printf("%x ", (int)(*p++));
        }
        printf("\n");
    }
    else {
        start = clock();

        char *cmd;
        int len;

        /* Pipeline N commands into the context's output buffer. */
        for (int i = 0; i < N; i++) {
            len = redisFormatCommand(&cmd, "HSET myset:__rand_int__ element:__rand_int__");
            redisAppendFormattedCommand(redis, cmd, len);
            free(cmd);   /* redisFormatCommand allocates the command buffer */
        }

        /* Read back exactly one reply per command. */
        for (int i = 0; i < N; i++) {
            redisReply *reply;
            assert(redisGetReply(redis, (void**)&reply) == REDIS_OK);
            freeReplyObject(reply);
        }

        end = clock();

        ftime = (float)(end - start) / CLOCKS_PER_SEC;

        printf("Running Time: %f sec. \n", ftime);
    }

    return 0;
}


I analyzed the Hiredis source to implement an optimization for a function. Unfortunately, it was not easy to find a place to optimize, so I decided to apply an optimization technique that I learned in this lecture. The benchmark source above invokes the redisFormatCommand function of the Hiredis library, so I looked for a part of that function that I could optimize. redisFormatCommand invokes the redisvFormatCommand function, so I figured out where in that function to optimize and implemented the change. Of course, this may not be the right way to optimize; however, I wanted to see whether performance improvements can be made by optimizing small parts.

The source below is a modified source of Hiredis.

    while(*c != '\0') {
        if (*c != '%' || c[1] == '\0') {
            if (*c == ' ') {
                if (touched) {

                    //Change Source for optimization
                    int result;
                    __asm__ __volatile__("add %0,%1,%2 \n\t" : "=r" (result) : "r"(argc), "r"(1));
                    newargv = realloc(curargv, sizeof(char*)*(result));

                    //newargv = realloc(curargv,sizeof(char*)*(argc+1)); - Original Source
                    if (newargv == NULL) goto memory_err;
                    curargv = newargv;
                    curargv[argc++] = curarg;
                    totlen += bulklen(sdslen(curarg));

                    /* curarg is put in argv so it can be overwritten. */
                    curarg = sdsempty();
                    if (curarg == NULL) goto memory_err;
                    touched = 0;
Now let's benchmark using the Hiredis library. The first picture shows the run prior to optimization, and the second shows the results after optimization. The results did not change dramatically because I modified a very small part; however, you can see that performance has improved a bit, since 1,000,000 commands were executed and the optimization sits inside the while loop. I benchmarked the HSET data type, compiled it with different compile options, and checked the speed. The compile option -O3 gives the most optimized result; however, there is not much difference between the options.
  • The result of original
  • The result of optimization.
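
As a standalone illustration of the inline-assembly technique used above (this is my own minimal example, not part of Hiredis, and it only builds on AArch64 because of the add instruction):

#include <stdio.h>

int main(void)
{
  int argc = 4;
  int result;

  // GCC extended inline assembly on AArch64: result = argc + 1.
  // %0 is the output operand (result); %1 and %2 are the inputs.
  __asm__ __volatile__("add %0, %1, %2"
                       : "=r"(result)
                       : "r"(argc), "r"(1));

  printf("argc + 1 = %d\n", result);   // prints 5
  return 0;
}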

I modified the source code to run on the x86 platform.

    while(*c != '\0') {
        if (*c != '%' || c[1] == '\0') {
            if (*c == ' ') {
                if (touched) {

                    //Change Source for optimization
                    int result;
                    __asm__("addl %%ebx, %%eax;" : "=a" (result) : "a" (argc), "b" (1));
                    newargv = realloc(curargv, sizeof(char*)*(result));

                    //newargv = realloc(curargv,sizeof(char*)*(argc+1)); - Original Source
                    if (newargv == NULL) goto memory_err;
                    curargv = newargv;
                    curargv[argc++] = curarg;
                    totlen += bulklen(sdslen(curarg));

                    /* curarg is put in argv so it can be overwritten. */
                    curarg = sdsempty();
                    if (curarg == NULL) goto memory_err;
                    touched = 0;

I have tried hard to run the Hiredis library and this code on an x86 platform, but I have not found a way. I then tried to run it on the Windows platform, but the Redis libraries were not suitable for running on Windows. The following picture shows the errors when building the Hiredis library on the x86 platform; even the example sources provided by Redis would not compile.



I learned a few important things through this stage. First, hash-based data is very fast and useful, and it can be used on any platform; however, that is only possible if the library is configured well. In addition, I came to understand the process of working together on GitHub, which is a very useful and essential skill for programmers. Best of all, I found that even a small amount of code optimization can improve the performance of a system.

by Ilkyu Song (noreply@blogger.com) at April 11, 2018 03:36 AM


Thomas Nolte

SPO600 Final Project Stage 2

Introduction

In this stage of the project I will show the changes I have made to the software. I will also show the improvement these changes have made for the individual function and the software as a whole. Finally I will show the performance on the x86_64 CPU architecture.

Read more on the project

Read previous blog post

Changes

In the previous blog I pointed out the six most taxing functions that the program uses but did not go into details. My changes so far have only been to one function, so let's delve into the code:

void kvz_filter_hpel_blocks_full_luma_generic(const encoder_control_t * const encoder, kvz_pixel *src, int16_t src_stride, int width, int height, frac_search_block *filtered)
{
 int x, y;
 int16_t shift1 = KVZ_BIT_DEPTH - 8; //a simple operation's result being put into memory
 int32_t shift2 = 6; //constant initialized every function call
 int32_t shift3 = 14 - KVZ_BIT_DEPTH; //a simple operation's result being put into memory
 int32_t offset23 = 1 << (shift2 + shift3 - 1);

 int8_t *fir0 = kvz_g_luma_filter[0];
 int8_t *fir2 = kvz_g_luma_filter[2];

 int16_t flipped0[(LCU_WIDTH + 1) * (KVZ_EXT_BLOCK_W + 1)];
 int16_t flipped2[(LCU_WIDTH + 1) * (KVZ_EXT_BLOCK_W + 1)];

 int16_t temp_stride = height + KVZ_EXT_PADDING + 1;
 int16_t dst_stride = (LCU_WIDTH + 1);

// Horizontal positions
 for (x = 0; x < width + 1; ++x) {
   for (y = 0; y < height + KVZ_EXT_PADDING + 1; ++y) {
   int ypos = y - FILTER_OFFSET; //initialized every loop iteration
   int xpos = x - FILTER_OFFSET; //initialized every loop iteration
   flipped0[x * temp_stride + y] = kvz_eight_tap_filter_hor_generic(fir0, &src[src_stride*ypos + xpos]) >> shift1;
   flipped2[x * temp_stride + y] = kvz_eight_tap_filter_hor_generic(fir2, &src[src_stride*ypos + xpos]) >> shift1;
   }
  }

// Filter vertically and flip x and y
 for (x = 0; x < width + 1; ++x) {
  for (y = 0; y < height + 1; ++y) {
  filtered[HPEL_POS_HOR][y * dst_stride + x] = kvz_fast_clip_32bit_to_pixel(((kvz_eight_tap_filter_hor_16bit_generic(fir0, &flipped2[x * temp_stride + y]) + offset23) >> shift2) >> shift3);
  filtered[HPEL_POS_VER][y * dst_stride + x] = kvz_fast_clip_32bit_to_pixel(((kvz_eight_tap_filter_hor_16bit_generic(fir2, &flipped0[x * temp_stride + y]) + offset23) >> shift2) >> shift3);
  filtered[HPEL_POS_DIA][y * dst_stride + x] = kvz_fast_clip_32bit_to_pixel(((kvz_eight_tap_filter_hor_16bit_generic(fir2, &flipped2[x * temp_stride + y]) + offset23) >> shift2) >> shift3);
  }
 }
}
KVZ_Filter_qpel_blocks_full_luma_generic_no_changes.png //gprof output

This function alone accounts for 37% of run time, being called 34377 times. It's unsurprising that it takes a lot of time to run due to its nested loops. However, because it's called so many times, we should look at how often it is initializing values. As seen in our lab 5, memory lookups can take more time than simple operations, so replacing some variables with the operations that produce their values should show some improvement. There are also variables being initialized within the loops, which adds extra work on every iteration.

After the changes the function now looks like this:

void kvz_filter_hpel_blocks_full_luma_generic(const encoder_control_t * const encoder, kvz_pixel *src, int16_t src_stride, int width, int height, frac_search_block *filtered)
{
 int x, y;
 //removed the variables shift1-3 and instead inlined the operations that compute their values
 int32_t offset23 = 1 << (19 - KVZ_BIT_DEPTH); //simplified the operation

 int8_t *fir0 = kvz_g_luma_filter[0];
 int8_t *fir2 = kvz_g_luma_filter[2];

 int16_t flipped0[(LCU_WIDTH + 1) * (KVZ_EXT_BLOCK_W + 1)];
 int16_t flipped2[(LCU_WIDTH + 1) * (KVZ_EXT_BLOCK_W + 1)];

 int16_t temp_stride = height + KVZ_EXT_PADDING + 1;
 int16_t dst_stride = (LCU_WIDTH + 1);

 int ypos = 0; //moved to outside the loop
 int xpos = 0; //moved to outside the loop
 
 // Horizontal positions
 for (x = 0; x < width + 1; ++x) {
  for (y = 0; y < height + KVZ_EXT_PADDING + 1; ++y) {
  ypos = y - FILTER_OFFSET;
  xpos = x - FILTER_OFFSET;
  flipped0[x * temp_stride + y] = kvz_eight_tap_filter_hor_generic(fir0, &src[src_stride*ypos + xpos]) >> (KVZ_BIT_DEPTH - 8);
  flipped2[x * temp_stride + y] = kvz_eight_tap_filter_hor_generic(fir2, &src[src_stride*ypos + xpos]) >> (KVZ_BIT_DEPTH - 8);
  }
 }

// Filter vertically and flip x and y
 for (x = 0; x < width + 1; ++x) {
  for (y = 0; y < height + 1; ++y) {
  filtered[HPEL_POS_HOR][y * dst_stride + x] = kvz_fast_clip_32bit_to_pixel(((kvz_eight_tap_filter_hor_16bit_generic(fir0, &flipped2[x * temp_stride + y]) + offset23) >> 6) >> (14 - KVZ_BIT_DEPTH));
  filtered[HPEL_POS_VER][y * dst_stride + x] = kvz_fast_clip_32bit_to_pixel(((kvz_eight_tap_filter_hor_16bit_generic(fir2, &flipped0[x * temp_stride + y]) + offset23) >> 6) >> (14 - KVZ_BIT_DEPTH));
  filtered[HPEL_POS_DIA][y * dst_stride + x] = kvz_fast_clip_32bit_to_pixel(((kvz_eight_tap_filter_hor_16bit_generic(fir2, &flipped2[x * temp_stride + y]) + offset23) >> 6) >> (14 - KVZ_BIT_DEPTH));
  }
 }
}
KVZ_Filter_qpel_blocks_full_luma_generic_with_changes.png //gprof output

The changes are simple code rearrangement and the removal of some not-so-needed variables. Removing the variables can make the code less readable, but nothing some simple commenting in the main code can't help with. After these changes, running the original tests again, this function now accounts for just 31.5% of run time.

Benchmarking AArch64

With the changes made, let's look at the output of the original test run 10 times.

Processed 300 frames, 633888 bits AVG PSNR Y 42.1265 U 44.9784 V 45.8845
 Total CPU time: 13.628-13.824 s. // ~1.45% variance 
 Encoding time: 13.626-13.822 s. // ~1.45% variance 
 Encoding wall time: 8.934-9.008 s. // ~0.8% variance
 Encoding CPU usage: 152.52%
 FPS: 33.58

real 0m8.941-9.015s //~0.8% variance
user 0m13.583-13.800s // ~1.45% variance 
sys 0m0.051s
//The input file size was 11404800 and the output file size was 79236.

The results are interesting: when looking at the individual performance of the function there is a ~6.5% increase in speed, but the software as a whole doesn't show a remarkable improvement in run time. However, after the changes the min/max run times are much closer together, implying that the changes made the run time more consistent. The input and output files are the same as before the changes, so the changes did not affect the program's output.

Benchmarking x86_64

without changes.

Processed 300 frames, 633888 bits AVG PSNR Y 42.1265 U 44.9784 V 45.8845
 Total CPU time: 35.035-35.440 s.
 Encoding time: 33.033-35.438 s.
 Encoding wall time: 21.046-21.263 s.
 Encoding CPU usage: 166.66%
 FPS: 14.11

real 0m21.071-21.286s
user 0m34.995-35.396s
sys 0m0.071s

With changes.

Processed 300 frames, 633888 bits AVG PSNR Y 42.1265 U 44.9784 V 45.8845
 Total CPU time: 35.208-35.450 s.
 Encoding time: 35.207-35.448s.
 Encoding wall time: 21.140-21.261 s.
 Encoding CPU usage: 166.76%
 FPS: 14.11

real 0m21.164-21.286s
user 0m35.165-35.392s
sys 0m0.070s

//The input file size was 11404800 and the output file size was 79236.

Similar to the results on AArch64, the changes have not made as noticeable an impact on run time as I would like. However, across both platforms the variance seems to be much smaller with the changes I made.

Conclusion

The changes I have made have no doubt improved the individual function's run time, but they have not made a noticeable improvement in the overall run time. I would like to continue modifying other functions in the hope of making an overall improvement to the software. For now I will contact the community to see if they think my changes are worth committing to the live version of the program.

by nolte96 at April 11, 2018 01:59 AM


Matteo Peruzzi

Project Stage 2: SPO600

This blog continues part of the documentation of progress for my SPO600 class project. For Stage 2 of this assignment I've been dealing with a lot of dependency issues regarding my project (OpenRCT2); specifically, it has been compilation issues on startup that are bugging me. Over the past couple of weeks I've had a lot of trouble with a few of the factors I'll list below, and today I've finally broken through towards getting some proper testing done. I'll give a quick TLDR list of the issues I ran into.


  1. Make sure your dependencies are covered.
  2. Retarget your code to fit these dependencies.
  3. Make sure your software is up to date.

The first bug I encountered was regarding Windows SDKs. For compiling OpenRCT2 on Visual Studio 2017, Windows SDK version 10.0.14 is necessary, something which I didn't have installed. After some downloads, restarts (don't forget those) and property changes in VS, I re-targeted the solution and that got rid of my SDK errors. Only to now have new ones, yay! /s (Autocorrect wants me to say 'nay' there and I'm inclined to agree.)

These new errors caused me a huge bit of grief. On compilation I was constantly being hit by some “cannot open file” errors and I had a great bit of trouble debugging this particular error. Upon completely following the instructions described on the wiki once again to double check for any simple mistakes, I was left with the same errors. So my only choice left was to ask the community before I drove myself to insanity.


While waiting for some responses from the community I had been exploring the code, looking for any specific improvements that I might be able to make to the existing code. Hopefully, with the original game having been coded in assembly, there should be some sections which I can play around with after their initial conversion to C++.

One potential small change I saw was in the HardwareDisplayDrawingEngine code, specifically the RenderDirtyVisuals function.

 

void RenderDirtyVisuals()
{
    float scaleX = gConfigGeneral.window_scale;
    float scaleY = gConfigGeneral.window_scale;
    SDL_SetRenderDrawBlendMode(_sdlRenderer, SDL_BLENDMODE_BLEND);
    for (uint32 y = 0; y < _dirtyGrid.BlockRows; y++)
    {
        for (uint32 x = 0; x < _dirtyGrid.BlockColumns; x++)
        {
            auto timeLeft = GetDirtyVisualTime(x, y);
            if (timeLeft > 0)
            {
                uint8 alpha = (uint8)(timeLeft * 5 / 2);
                SDL_Rect ddRect;
                ddRect.x = (sint32)(x * _dirtyGrid.BlockWidth * scaleX);
                ddRect.y = (sint32)(y * _dirtyGrid.BlockHeight * scaleY);
                ddRect.w = (sint32)(_dirtyGrid.BlockWidth * scaleX);
                ddRect.h = (sint32)(_dirtyGrid.BlockHeight * scaleY);
                SDL_SetRenderDrawColor(_sdlRenderer, 255, 255, 255, alpha);
                SDL_RenderFillRect(_sdlRenderer, &ddRect);
            }
        }
    }
}

The if (timeLeft > 0) section can possibly have a few calculations removed, since _dirtyGrid.BlockHeight * scaleY is used twice there; the same goes for the width/scaleX counterpart. I don't think this will have a large impact at all, but it's something small. Once I get the environment working properly I can look at it some more.

From reading some issue notices on the Github page I have also seen a few sections of potential interest. One is a section of code that deals with guest (peep) navigation and it involves a lot of decision making and seems to be a source of slowdown at higher guest counts.


Once the community got back to me I realized my problem was as simple as upgrading Visual Studio 2017 to the latest version, and voilà, it builds!

Quick Build Summary:

  1. Install dependencies for project.
    • OpenRCT2 requires Microsoft’s 10.0.14 SDK.
    • 7-zip for part of msbuild.
    • NSIS for deployment purposes.
    • Nice little organization tip I use now involves making sure any dependencies get installed into a utilities folder to organize some of the madness.
  2. Clone repository to controlled folder. In my case Matteo\Github\OpenRCT2
  3. Navigated to the directory on Visual Studio Powershell 2017
  4. Upon reaching the project location enter the following command. (It breaks here if you don’t update properly)
msbuild openrct2.proj /t:build /p:platform=x64

5. If it builds you can now run it with bin\openrct2

Once the msbuild command is run and there are no errors, to get the actual testing environment you have to set the right start-up project. By opening the sln (solution) file in my cloned OpenRCT2 directory I can open the project in Visual Studio. In my case the correct start-up is openrct-win since I am on Windows and then set the debugger to x64 mode. All that’s left is to start the debugger and to begin observing the code while the game runs!

Now, deciding which code I want to make improvements on is a little tricky. I feel that attempting to improve the first code I noticed (HardwareDisplayDrawingEngine) will be an easier endeavor for a few reasons.

  1. The code is easy to read.
  2. The file size is a lot more manageable compared to the Guest (peep) file. 420 lines of code vs 14400…
  3. This size difference should help the debugging process.
  4. 400 lines of code is something more my size at this time of the project, I’m not too familiar with all of the code on hand.

The whole file (HardwareDisplayDrawingEngine) is here, and the specific bit I was looking at is the following.

for (uint32 x = 0; x < _dirtyGrid.BlockColumns; x++)
{
   auto timeLeft = GetDirtyVisualTime(x, y);
   if (timeLeft > 0)
   {
   uint8 alpha = (uint8)(timeLeft * 5 / 2);

   SDL_Rect ddRect;
   ddRect.x = (sint32)(x * _dirtyGrid.BlockWidth * scaleX);
   ddRect.y = (sint32)(y * _dirtyGrid.BlockHeight * scaleY);
   ddRect.w = (sint32)(_dirtyGrid.BlockWidth * scaleX);
   ddRect.h = (sint32)(_dirtyGrid.BlockHeight * scaleY);

   SDL_SetRenderDrawColor(_sdlRenderer, 255, 255, 255, alpha);
   SDL_RenderFillRect(_sdlRenderer, &ddRect);
   }
}

 

 

The change I want is nothing complex, just something to get used to benchmarking in Visual Studio. I was curious whether switching the order of the calculations for the x, y, w, h values might lead to some changes. In particular, if the width and height values are calculated first, they can be substituted into the x and y lines instead of repeating the multiplications. For example:

 ddRect.w = (sint32)(_dirtyGrid.BlockWidth * scaleX);
 ddRect.h = (sint32)(_dirtyGrid.BlockHeight * scaleY);
 ddRect.x = (sint32)(x * ddRect.w);
 ddRect.y = (sint32)(y * ddRect.h);

 

Again, no large changes; I just want to get used to the process. The problem with this small change, however, is finding out whether it has had any effect. It compiled fine, so that is great, but finding the trigger point for this function has been a lost cause so far. This has shown me that even if I see a possible change, it doesn't really help much if I don't know the full context. As Stage 2 of this project comes to a close I haven't been able to see much of a result from my attempts, but my confidence is growing, so hopefully I can find an optimization soon. I should be targeting functions I know will be used frequently, something like the rendering of a ride, music, or guests. This particular mistake has helped me narrow my focus properly.

 

by Peruzzi Blog at April 11, 2018 01:56 AM

April 10, 2018


Dennis Arul

Update and changes so far

I previously mentioned that I found CFLAGS and CXXFLAGS using -O2. What I did not notice was that there was another variable called CCASFLAGS that was also using -O2, so I did a little bit of research and found out that it is another set of compiler flags, used when the assembly sources are built. So what I did was change all three of them to use -O3 rather than -O2, and after the changes I ran the program again.

The image below shows the changes made to CCASFLAGS and CFLAGS. To see the values prior to the changes, please view my previous blog.

Change1.PNG

This image shows the change to CXXFLAGS.
Change2.PNG

This was the first run, and there was no difference at all; it was quite similar to what I got when I ran it with -O2.

ChangeB1.PNG

The second run showed a difference, but honestly the change was not all that big: it was faster in real time, but the user time was actually 1 second slower.

ChangeB2.PNG

The third run was similar to the second in the sense that real time was slightly faster, but the difference I saw between the 2nd and 3rd runs was that the user time was similar to the original runs I had with -O2.

ChangeB3.PNG

Overall I would say that these changes were a failure. Going forward, I am going to look for other methods that could allow me to optimize the program successfully.

 

by dparul at April 10, 2018 11:53 PM