Planet CDOT

July 03, 2015

Hosung Hwang

CC Image License Search Engine API Implementation

Big Picture

CC - New Page


Previously, my co-worker Anna made a page that searches for similar images, either by uploading a file or by providing a link. This UI page can live either inside or outside the server. It uses only the PHP API, without accessing the database directly.


This is an open API that provides functions for adding, deleting, and matching images. It can be accessed by anyone who wants this functionality. The UI page, or client implementations such as browser extensions, use this API. The matching result is in JSON format.
This API page performs Add/Delete/Match by asking the "C++ Daemon" rather than changing the database itself.
Only read-only access to the database will be permitted.

C++ Daemon

All adding/deleting operations will be done in this daemon. By doing so, we remove the problem of synchronization between the database and the matching index, because the daemon keeps the content index in memory at all times for fast matching.
Because this daemon is active all the time, it works as a domain socket server to receive requests from and return results to the "PHP API". The PHP API sends its requests over the domain socket.
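The request/response flow between the API layer and the daemon can be sketched like this (in Python for brevity; the socket path and the JSON wire format here are my assumptions, not the project's actual protocol):

```python
import json
import os
import socket
import tempfile
import threading

def roundtrip(action, image_id):
    """Sketch of the API-to-daemon exchange over a UNIX domain socket."""
    sock_path = os.path.join(tempfile.mkdtemp(), "daemon.sock")
    ready = threading.Event()

    def daemon_stub():
        # stands in for the C++ daemon: accept one request, answer it
        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        srv.bind(sock_path)
        srv.listen(1)
        ready.set()
        conn, _ = srv.accept()
        request = json.loads(conn.recv(4096).decode())
        reply = {"status": "ok", "action": request["action"], "matches": []}
        conn.sendall(json.dumps(reply).encode())
        conn.close()
        srv.close()

    t = threading.Thread(target=daemon_stub)
    t.start()
    ready.wait()
    # what the PHP layer would do: connect, send the request, read the JSON reply
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.connect(sock_path)
    cli.sendall(json.dumps({"action": action, "id": image_id}).encode())
    reply = json.loads(cli.recv(4096).decode())
    cli.close()
    t.join()
    return reply

print(roundtrip("match", 42))  # {'status': 'ok', 'action': 'match', 'matches': []}
```

The same pattern works from PHP with stream_socket_client() against the daemon's socket path.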


The database contains all the metadata about CC-licensed images, plus the thumbnail paths that are used to show previews in the matching results.

by Hosung at July 03, 2015 03:16 PM

July 02, 2015

Chris Tyler (ctyler)

The OSTEP Applied Research Team

I haven't introduced my research team for quite a while, and it has changed and grown considerably. Here is the current Open Source Technology for Emerging Platforms team working with me at Seneca's Centre for Development of Open Technology. From left to right:

  • (me!)
  • Michael (Hong) Huang (front)
  • Edwin Lum (rear)
  • Glaser Lo
  • Artem Luzyanin (front)
  • Justin Flowers (rear)
  • Reinildo Souza da Silva
  • Andrew Oatley-Willis

Edwin and Justin work with me on the DevOps project, which is applying the techniques we've learned and developed to the software development
processes of a local applied research partner.

Michael, Glaser, Artem, Reinildo, and Andrew work with me on the LEAP Project. Recently (since this photo was taken), Reinildo returned to Brazil, and has been replaced by Christopher Markieta (who has previously worked with this project).

I'm dying to tell you the details of the LEAP project, so stay tuned for an announcement in the next week!

by Chris Tyler (ctyler) at July 02, 2015 08:51 PM

Anna Fatsevych

Flickr API Woes

My genius Flickr downloader was chugging along and downloading images with all the required licensing and author information, and everything seemed fine, until yesterday when I ran into an interesting issue. The images kept duplicating themselves after the folder reached 4,497 files. I ran the program again and (after a few hours, mind you) the issue reappeared. After I had exhausted all the possibilities of errors on my end (code, maximum directory size, etc.), I began an investigation of the Flickr API that yielded no results. Today I ran the program a few more times, on various dates, and alas, it capped out at 4,500 images on the dot each time.

The only limit ever mentioned in the official Flickr API documentation is the throttling cap of 3,600 API calls per hour; nothing is documented about the maximum number of results returned by a search. I dug out a StackOverflow article that mirrors my issue, the only difference being that it states the cap to be 4,000 search results, whereas I found it to be 4,500.

I am now testing the new downloader with more frequent time increments, so that each search yields fewer results than the allowable maximum.
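One way to implement those time increments is to split a window in half whenever its result count exceeds the cap. A sketch in Python (the counting function is a stand-in for an actual flickr.photos.search total-count call, and the cap constant is an assumption):

```python
from datetime import datetime, timedelta

RESULT_CAP = 4000  # conservative figure; my own tests suggest the real cap is 4,500

def windows_under_cap(start, end, count_results):
    """Split [start, end) in half until every window's result count fits under
    the cap. count_results(start, end) stands in for a search total-count call."""
    if count_results(start, end) <= RESULT_CAP or end - start <= timedelta(minutes=1):
        return [(start, end)]
    mid = start + (end - start) / 2
    return (windows_under_cap(start, mid, count_results) +
            windows_under_cap(mid, end, count_results))

# demo with a fake counter: pretend uploads arrive evenly at 500 per hour
fake_count = lambda s, e: int((e - s).total_seconds() / 3600 * 500)
day = datetime(2015, 3, 20)
wins = windows_under_cap(day, day + timedelta(days=1), fake_count)
print(len(wins))  # 12,000 fake results split into 4 six-hour windows
```

Each returned window can then be fed to the downloader as a min/max upload-date pair.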

by anna at July 02, 2015 01:02 AM

June 29, 2015

Hosung Hwang

Pastec Test for Performance

So far, I have tested Pastec in terms of the quality of image matching. In this posting, I test the speed of adding and searching.

Adding images to index

First I added 100 images; adding them took 48.339 seconds. Then I added all the directories from 22 to 31. Those images were uploaded to Wikimedia Commons from 2013.12.22 to 2013.12.31.

Directory Start End Duration Count Average
22 17:32:42 18:43:50 01:11:08 8785 00:00.49
23 18:43:50 19:42:03 00:58:13 7314 00:00.48
24 19:42:03 20:28:56 00:46:53 6001 00:00.47
25 20:28:57 21:28:02 00:59:05 7783 00:00.46
26 21:28:02 22:41:12 01:13:10 9300 00:00.47
27 22:41:19 23:54:28 01:13:09 9699 00:00.45
28 00:54:28 01:53:23 00:58:55 7912 00:00.45
29 00:53:23 02:27:42 01:34:19 11839 00:00.48
30 02:27:42 03:31:48 01:04:06 8827 00:00.44
31 03:31:48 04:23:15 00:51:27 6880 00:00.45

The average time for adding an image was around 0.46 seconds, and it didn't increase as the index grew. Most of the time for adding an image is spent extracting features.
I saved the index files for 100 images, for directories 22 to 26, and for directories 22 to 31. The sizes were 8.7 MB, 444.1 MB, and 935.8 MB respectively.


Searching images

I loaded the index file for the 100 images and searched for all 100 images that had been added.

Directory Duration Count Average
22 00:01:14 100 00:00.74

Searching took 1m14.781s. Since there are 100 images, the average time to search for one image was 0.74 seconds.

Then I loaded the index file that contains the index for the 39,183 images in directories 22 to 26.

Directory Start End Duration Count Average
22 09:00:05 11:21:06 02:21:01 8785 00:00.96
23 11:21:06 13:13:52 01:52:46 7314 00:00.93
24 13:13:52 14:48:26 01:34:34 6001 00:00.95
25 14:48:26 16:48:44 02:00:18 7783 00:00.93
26 16:48:44 19:13:11 02:24:27 9300 00:00.93

This time, the average time for searching for one image was 0.95 seconds.

Then I loaded the index file that contains the index for the 84,340 images in directories 22 to 31.

Directory Start End Duration Count Average
22 19:32:54 22:44:09 03:11:15 8785 00:01.31
23 20:44:09 23:16:59 02:32:50 7314 00:01.25
24 01:16:59 03:24:52 02:07:53 6001 00:01.28
25 03:24:52 06:11:33 02:46:41 7783 00:01.28
26 06:11:33 09:30:53 03:19:20 9300 00:01.29

Searching was performed on the same images from directories 22 to 26. The average search time was 1.3 seconds.


  • Adding an image took 0.47 seconds.
  • Adding time didn't vary with index size.
  • Searching time varied with index size.
  • When the index size was 100, 39,183, and 84,340 images, searching time was 0.74, 0.95, and 1.3 seconds, respectively.
    Screenshot from 2015-06-28 23:14:15
    In the chart, the y-axis is time in milliseconds. Around 0.6 seconds is likely spent reading an image and extracting features, and searching time will increase in proportion to the size of the index.
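The hypothesis of a fixed feature-extraction cost plus a component proportional to index size can be sanity-checked with a least-squares line through the three measurements above (a rough check of my own, not part of the original tests):

```python
# (index size, seconds per search) from the tables above
measurements = [(100, 0.74), (39183, 0.95), (84340, 1.30)]

# least-squares fit of t = a + b * n
n_mean = sum(n for n, _ in measurements) / len(measurements)
t_mean = sum(t for _, t in measurements) / len(measurements)
b = (sum((n - n_mean) * (t - t_mean) for n, t in measurements)
     / sum((n - n_mean) ** 2 for n, _ in measurements))
a = t_mean - b * n_mean
print(f"fixed cost ~{a:.2f}s, ~{b * 1e6:.1f}us per indexed image")
```

The fitted fixed cost comes out around 0.7 seconds, consistent with the estimate above, with a few microseconds of search time per indexed image on top.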

by Hosung at June 29, 2015 03:28 AM

June 26, 2015

Barbara deGraaf

The thrilling saga on shaders continues

In my last post I detailed some basics of creating a shader and in this post I will be focusing on how to create a depth of field shader.

There are a couple of files that need changing, including the shader file and the main JS file. I am going to start with the shader file and mention the JS file later.

As I stated in the last post, the depth of field shader only changes the fragment shader, so the vertex shader will be the same as the one I posted last time.

So this post will mainly focus on the fragment shader. I was going to walk through the code in the shader, but that made the post too long, so instead I will talk about the main concept of creating depth of field, which is as follows: create a texture containing the depth map. Then grab the value from the depth texture to figure out how far away from the camera the pixel is. Using the inputs from the camera, find out where the near and far depth of field boundaries are. We can then compare the depth of the pixel to the near and far boundaries to find out how blurry it should be. Finally, we do something called image convolution: this process grabs the colours of the pixels around a given pixel and adds them together, so that the final pixel is a mix of all the colours around it.
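The convolution step can be illustrated outside GLSL. This toy Python box blur averages each pixel with its neighbours, which is the same idea the fragment shader applies per pixel (with the kernel radius driven by the pixel's distance from the focal plane):

```python
def box_blur(img, radius=1):
    """Toy image convolution: each output pixel is the average of the pixels
    around it (a box kernel). img is a 2D list of grayscale values."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, n = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        n += 1
            out[y][x] = total / n
    return out

sharp = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
print(box_blur(sharp))  # the bright centre pixel bleeds into its neighbours
```

A real depth-of-field shader uses a weighted kernel rather than a plain box, but the mixing of neighbouring colours is the same.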

To work with your shaders, Three.js has something called the effect composer, together with shader passes. This is done in rough form as follows:

composer = new THREE.EffectComposer( renderer );
composer.addPass( new THREE.RenderPass( scene, camera ) );

var Effect1 = new THREE.ShaderPass( shadername, textureID );
Effect1.uniforms[ 'value1' ].value = 2.0 ;
Effect1.renderToScreen = true; //the last shader pass you make needs to say rendertoscreen = true
composer.addPass( Effect1 );

Then to get this to work you need to call composer.render() in the render loop instead of the normal renderer.render().

I will end here for this post. If need be, I will wrap up some minor things about shaders in the next post. As well, once the scene is nicely set up and the GUI works with real-world cameras/lenses, I will put up a post with a survey to see which shader produces the best results and where it can be improved.



by barbaradegraafsoftware at June 26, 2015 02:04 PM

Hosung Hwang

scp/sftp through an ssh tunnel

SSH Tunneling

The machine CC can be reached from another machine called zenit.
To scp to CC through zenit, the following command establishes an ssh tunnel to CC.

ssh -L 9999:[address of CC known to zenit]:22 [user at zenit]@[address of zenit]
in my case,
ssh -L 9999:

Now, port 9999 on localhost is the tunnel to CC through zenit.
This session needs to stay alive for everything that follows.


SCP through the SSH Tunnel

These commands then scp the local file test.png to CC:~/tmp, and copy CC:/tmp/test.png to the current directory.

scp -P 9999 test.png ccuser@
scp -P 9999 ccuser@ .


Making it easy

Typing those long commands is not a good idea, so I added an alias to .bashrc.

alias ccturnnel='ssh -L 9999:'

Then I wrote two simple bash scripts.

This is cpfromcc.

remote=$(echo "$1" | sed 's/\/home\/hosung/~/g')
scp -P 9999 ccuser@$remote "$2"

This is cptocc.

i=0
for var in "$@"; do
    i=$((i + 1))
    if [ $i -ne $# ]; then
        values="$values $var"
    else
        remote=$(echo "$var" | sed 's/\/home\/hosung/~/g')
    fi
done
scp -P 9999 $values ccuser@$remote

The reason I use sed on the remote path is that bash expands ~ to my local home directory.
Now I can establish the ssh tunnel by typing ccturnnel.
Then I can do scp from my machine to CC using:

cptocc test.jpg test2.jpg ~

And I can do scp from CC to my machine using:

cpfromcc ~/remotefile.txt .


Making it convenient using sftp

Once the tunnel is established, sftp works the same way.

$ sftp ccuser@


Making it more convenient using Krusader

By typing sftp://ccuser@ into the URL bar of Krusader, and then adding the location as a bookmark, the remote machine's file system is easily accessed.

Screenshot from 2015-06-26 10:23:39

Mounting it with sshfs should also be possible.

by Hosung at June 26, 2015 03:54 AM

June 24, 2015

Anna Fatsevych

Flickr API – Date Time

The Flickr API has a funny way with dates; I am in the middle of discovering how it really works. Before, I was sending the date as a string "YYYY-MM-DD" with a difference of one day, i.e. "2015-03-20 2015-03-21", and I was getting only about 1,000 images per day (on average).

I dug deeper into the API and realized that it accepts both UNIX timestamps and MySQL datetimes. In my PHP code I set the default timezone to Greenwich and then set the dates in MySQL datetime format like this:

min_upload_date: "2015-03-20 00:00:00"
max_upload_date: "2015-03-20 23:59:59"

And now I get on average 200,000 results per day (licenses 1 through 7).
This is great news. There are still some grey areas I need to research further: how exactly Flickr compares dates, and with what precision (round-off or truncation).
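A sketch of the date handling in Python: it builds the min/max datetime pair above, plus the equivalent GMT UNIX timestamps (the post says the API accepts either form):

```python
from datetime import datetime, timezone

def day_bounds(day_str):
    """Build the min/max upload-date strings for one whole day in GMT,
    plus the equivalent UNIX timestamps."""
    day = datetime.strptime(day_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    start = day
    end = day.replace(hour=23, minute=59, second=59)
    fmt = "%Y-%m-%d %H:%M:%S"
    return (start.strftime(fmt), end.strftime(fmt),
            int(start.timestamp()), int(end.timestamp()))

print(day_bounds("2015-03-20"))
# ('2015-03-20 00:00:00', '2015-03-20 23:59:59', 1426809600, 1426895999)
```

Using a 23:59:59 upper bound (rather than midnight of the next day) avoids double-counting images uploaded exactly on the day boundary, assuming Flickr treats the bounds as inclusive.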

More to come as I am still researching and running tests.



by anna at June 24, 2015 09:47 PM

Hosung Hwang

Pastec Test for real image data

In the previous test of Pastec, I used 900 JPEG images that were mainly computer-generated. This time, I tested images from the WikiMedia Commons archive of CC-licensed images uploaded from 2013-12-25 to 2013-12-30. They come as zip files of 17 GB to 41 GB, each containing around 10,000 files, including jpg, gif, png, tiff, ogg, pdf, djvu, svg, and webm. Before testing, I deleted the xml, pdf, djvu, and webm files, leaving 55,643 images.


Indexing the 55,643 images took around 12 hours, and the index file was 622 MB. At first, I made separate index files for each day. However, Pastec can load only one index file, so I added all six days' images and saved them to a single index file.

While indexing, there were some errors:

  1. Pastec uses OpenCV, and OpenCV doesn't support gif and svg, so files in those two formats could not be opened.
  2. Pastec only adds images that are bigger than 150×150 pixels.
  3. There were zero-byte images: 153 files out of the 55,643. On the Wikimedia web pages the corresponding images are valid; in any case, they cause an error.
  4. One tiff image caused a crash inside Pastec. It needs debugging.


After loading the 622 MB index file, images can be searched. Searching for all 55,643 images took around 15 hours. Each search extracts the query image's features before searching, so searching takes more time than adding.

Search result

Among the 55,643 images, 751 (1.43%) were smaller than 150×150 pixels, so they were not added. 51,479 images were of a proper size and format for OpenCV; they were indexed and can be searched.

  • 42,931 (83%) images matched only themselves (exactly the same image)
  • 8,459 (15%) images matched more than one image
  • 90 (0.17%) images did not match any image, even themselves

Images that didn’t match any image

These 90 images were properly indexed but didn't match even themselves.

  • 55 images were png images that include transparency; apart from this case, the unmatched images were jpgs
  • 14 images were long panorama images like the following


  • 6 images were simple images like the following

__Amore_2013-12-30_14-18 __Bokeh_(9775121436) __Bokeh_(9775185973) __Hmm_2013-12-30_16-54 __Moon_early_morning_Shot

  • 8 vague images: lines that are not clear, or photographs that are out of focus

__20131229141153!Adrien_Ricorsse SONY DSC __Llyn_Alwen,_Conwy,_Cymru_Wales_21 __Minokaya_junior_high_school __Moseskogbunn __Nella_nebbia._Franco_Nero_e_Valeria_Vaiano_in_Mineurs_-_Minatori_e_minori SONY DSC SONY DSC

  • Other cases
    __Brännblåsa_på_fingret_2013-12-26_13-40 __Pottery_-_Sonkh_-_Showcase_6-15_-_Prehistory_and_Terracotta_Gallery_-_Government_Museum_-_Mathura_201d247a1ec8535aec4f9bf86066bd10dd
    These two images are a bit out of focus.

__Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake26 __Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake41 __Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake42

The original size of this image is 150×150 pixels. Maybe it is too small and simple.

Images matched with more than one image

8,459 images matched with more than one image. To compare the results, I generated an HTML file that shows all match results, like the following:
Screenshot from 2015-06-24 16:29:49

I converted all images to 250×250 pixels using the convert -resize 250x250 filename command to show them on one page. The HTML file size was 6.8 MB and it shows 64,630 images.

As I mentioned in my previous blog post, Pastec is good at detecting rotated/cropped images.
Almost all matches were reasonable (similar). The following are notable matches:
20131225102452!Petro_canada Petro_canada_info

20131225193901!Bitlendik Bitlendik-avatar

In these two cases, the logo was matched.

20131225212947!June_Allyson_Dick_Powell_1962 June_Allyson_Dick_Powell_1962 Aveiro_342

This match looks like a false positive.

Buddhapanditsabhacandapuri Aveiro_144

This match is also a false positive.

NZ-SH45_map NZ-SH30_map NZ-SH38_map

In this case, the map is shifted.

PK-LJH_(Boeing_737-9GP)_from_Lion_Airlines_(9722507259) Blick über die Saalenbergkapelle in Sölden zum Kohlernkopf

This is an obvious false positive; maybe the sharp part of the airplane matched the roof.

From my observation, there were fewer than 50 obvious false positive matches that don't share any object, which is 0.08%. Wrong matches usually occurred when the image contained graphs or documents. When the image was a normal photograph, the result was very reliable.

by Hosung at June 24, 2015 09:41 PM

Anna Fatsevych

Curl and wget

When downloading images using PHP (via curl or file_put_contents), I ran into issues with download sizes, possible interruptions, and memory usage, all of which can and have to be changed in your php.ini file.

Then I came across a comparative article about wget and curl (curl vs. wget) and decided to give wget a try, as it does not seem to have those limitations and has the ability to continue downloading even after an interruption, making the case for it as the preferred download method in our case.

Curl relies heavily on php.ini settings and is incorporated into my PHP program, whereas wget is executed as a command-line tool and downloads independently of the PHP settings, so it might be more suitable for a portable downloader requiring minimal configuration changes.

I did not have to install the wget package on Linux Mint Cinnamon and can just run the executable from within my PHP code like this:

exec("wget http://your/url");
exec("wget " . $urlToDownload);

or you can choose to specify the download target with wget:

exec("wget https://your/url -O /your/dir/filename.jpg");


I have run more tests, as at times wget would give me a 100% downloaded message but the file was 0 bytes in size. This alarmed me, as the error was not caught; it was caused by a redirect, which curl handles automatically. I am currently looking more into this issue, but in the meantime I have run some tests, and these are my results:

280 images – CURL: 422 seconds, No Errors
WGET: 703 seconds, No Errors

350 images – CURL: 475 seconds, No Errors
WGET: 821 seconds, No Errors

450 images – CURL: 541 seconds, No Errors
WGET: 1008 seconds, 3 Errors – Images size 0
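Until the redirect issue is solved, the zero-byte failures can at least be detected after the fact. A sketch (in Python; the directory contents here are a stand-in for a real download folder):

```python
import os
import tempfile

def find_empty_downloads(directory):
    """wget reported 100% but left 0-byte files; scan a download
    directory for such files so they can be retried."""
    return [name for name in sorted(os.listdir(directory))
            if os.path.getsize(os.path.join(directory, name)) == 0]

# demo with a stand-in download directory: one good file, one empty one
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "good.jpg"), "wb") as f:
        f.write(b"\xff\xd8fake-jpeg-bytes")
    open(os.path.join(d, "empty.jpg"), "wb").close()
    print(find_empty_downloads(d))  # ['empty.jpg']
```

The flagged names can then be re-fetched, or fetched with curl, which follows the redirects.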

As regards file storage, I have found that an NTFS directory can store a sufficient number of image files for our purposes, so one directory would be enough to store the images, as opposed to storing them as blobs in the MySQL database.

More to come on this topic,


by anna at June 24, 2015 02:57 PM

June 23, 2015

Andrew Smith

Using ImageMagick without running out of RAM

For our research project we needed to use pHash to do some operations on a lot (tens of thousands) of image files. pHash uses ImageMagick internally, probably for simple operations such as resizing and changing the colour scheme.

I am pretty familiar with errors such as these coming from convert or mogrify:

convert.im6: no decode delegate for this image format `Ru-ей.ogg' @ error/constitute.c/ReadImage/544.
convert.im6: no images defined `pnm:-' @ error/convert.c/ConvertImageCommand/3044.
sh: 1: gm: not found

[CImg] *** CImgIOException *** [instance(0,0,0,0,(nil),non-shared)] CImg<unsigned char>::load(): Failed to recognize format of file 'Ru-ей.ogg'

What I wasn’t expecting was to get such errors in one of my own applications that uses a library (pHash) that uses another library (ImageMagick). What moron prints error messages to stdout from inside a library? Seriously!!??

But it gets worse. As I put this code in a loop, it quickly found a reason (the first was a .djvu file) to eat up all my RAM and then start on the swap. Crappy code, but it's a complex codebase, I can forgive them. I figured I'd just set my ulimit to not allow any program to use over half a gig of RAM with "ulimit -Sv 500000" and ran my program again:

[CImg] *** CImgInstanceException *** [instance(0,0,0,0,(nil),non-shared)] CImg<float>::CImg(): Failed to allocate memory (245.7 Mio) for image (6856,9394,1,1).
terminate called after throwing an instance of 'cimg_library::CImgInstanceException'
  what():  [instance(0,0,0,0,(nil),non-shared)] CImg<float>::CImg(): Failed to allocate memory (245.7 Mio) for image (6856,9394,1,1).

Aborted? What sort of garbage were these people smoking? You don’t bloody abort from a library just because you ran out of memory, especially in a library that routinely runs out of memory! Bah. Anyway, I found a way to make sure it doesn’t abort. Set ulimit back to unlimited and instead created a global imagemagick configuration file /usr/share/ImageMagick-6.7.7/policy.xml:

  <policy domain="resource" name="memory" value="256MiB"/>
  <policy domain="resource" name="map" value="512MiB"/>

Now no more aborts and no more running out of memory. Good. Until I got to about file number 31000 and my machine ground to a halt again, as if out of RAM and swapping. What this time? Out of disk space of course, why not!

I’ve already set ImageMagick in my program to use a specific temporary directory (export MAGICK_TMPDIR=/tmp/magick1 && mkdir -p $MAGICK_TMPDIR) so that my program, after indirectly using the imagemagick library can run “system(“rm -f /tmp/magick?/*”);” because, you know, it’s too much to ask ImageMagick to clean up after itself. Barf… But it even got around that. For a single PDF file it used over 65GB of disk space in /tmp.
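An alternative to a global ulimit or policy file is capping only the child process: apply the limit between fork and exec so the parent is untouched. A sketch (in Python, POSIX-only; the child here just allocates memory as a stand-in for a runaway convert):

```python
import resource
import subprocess
import sys

MEM_LIMIT = 512 * 1024 * 1024  # 512 MB address-space cap, like `ulimit -Sv`

def limit_memory():
    # runs in the child between fork and exec, so the parent is unaffected
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT, MEM_LIMIT))

# stand-in for a `convert`/`mogrify` call: a child that tries to allocate 1 GB
child = [sys.executable, "-c", "x = bytearray(1024 * 1024 * 1024)"]
proc = subprocess.run(child, preexec_fn=limit_memory, stderr=subprocess.DEVNULL)
print("child failed as expected" if proc.returncode != 0 else "child succeeded")
```

The child dies with an allocation failure instead of dragging the whole machine into swap, and the parent can log the file name and move on.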

And if at least they said they’re using other people’s libraries it’s not their fault and so on and so forth maybe I wouldn’t be so pissed, but instead they give me bullshit like “oh what’s a lot of resources to you is nothing to someone else, we have 1TB of RAM, bla bla”.

Piss off, I’m going to find another solution that doesn’t involve using this garbage.

by Andrew Smith at June 23, 2015 02:47 AM

June 19, 2015

Barbara deGraaf

An Introduction to shaders

For our project we are using shaders to replicate the camera's depth of field. The shaders available online certainly work, but I was not happy with the lack of explanation or with the procedure within those shaders, so I decided to make my own to replicate depth of field.

In this post I am just going to explain some introductory concepts about using shaders in Three.js and lead up to the final shader results in later posts.

Before going into details about the shaders, I am going to talk a bit about the rendering pipeline and then jump back. The rendering pipeline is the series of steps that OpenGL (the API that renders 2D and 3D vector graphics) takes when rendering objects to the screen.


This image was taken from the OpenGL rendering pipeline page here.

Glossing over some things a bit, there are basically two stages. First the pipeline deals with the vertex data: the vertex shader is responsible for turning those 3D vertices into 2D coordinate positions for your screen (i.e. for where objects get located on the screen). After some other steps, rasterization occurs, which makes fragments (triangles) from these vertices. After this, the fragment shader runs; it is responsible for what colour the fragment/pixel on screen has.

This whole pipeline runs on the GPU and the only two parts of this pipeline that are programmable by a user are the vertex shader and the fragment shader. Using these two shaders we can greatly alter the output on the screen.

For Three.js/WebGL the shaders are written in GLSL (with Three.js simplifying things for us a little bit), which is similar to C. The shader file can be separated into three main parts: the uniforms, the vertex shader, and the fragment shader.

The first part, the uniforms, holds all the values passed in from the main JS file. I'll talk about passing in values in a later post. A basic example is:

uniforms: {
"tDiffuse": { type: "t", value: null },
"value1": { type: "f", value: 1.2 }
},

tDiffuse is the texture that was passed from the previous shader, and this name is always the same for three.js. Many types can occur in the uniforms; some of the basic ones are i = integer, f = float, c = colour, t = texture, v2 = vector2 (v3 and v4 also exist), m4 = matrix4, etc.

The next part is the vertex shader. Because of what I want to do (change the colour of pixels to create a blurring effect), I don't need to change anything in here, but it is still required in the shader file: if you write one shader you must write the other as well.

vertexShader: [

  "varying vec2 vUv;",
  "void main() {",
    "vUv = uv;",
    "gl_Position = projectionMatrix * modelViewMatrix * vec4( position, 1.0 );",
  "}"
].join("\n"),


Varying means that the value changes for each pixel being processed. Here we have vUv, a vector that holds the UV (screen coordinates) of the pixel; it is automatically passed in by three.js. The next line just takes the 3D coordinates and projects them onto the 2D coordinates of your screen. I am going to skip the explanation of why this works, as it is not important; just look it up or ask me if you really want to know.

Now for the important one, the fragment shader:

fragmentShader: [

"uniform sampler2D tDiffuse;",
"varying vec2 vUv;",

"void main() {",
  "vec4 color = texture2D(tDiffuse, vUv);",
  "gl_FragColor = color;",
"}"
].join("\n")


Here vUv is the same as in the vertex shader, and tDiffuse is the texture that was passed in (declared as a sampler2D). In the main function we grab the RGBA value from the passed-in texture at coordinate vUv and assign it to the output pixel.

This is the shader I will build the depth of field from, and for the rest of the posts I will be working with this shader only.

That’s it for the introduction, next post I will start to get into the fragment shader and image convolution.



by barbaradegraafsoftware at June 19, 2015 07:48 PM

Dmitry Yastremskiy

Hello Data!

I'm working on a 3D data visualization project. Data visualization is an emerging industry that is getting popular these days, especially where lots of data gets generated and needs interpretation so that humans can read it and learn something from it. The goals of this project are to be able to grab pretty much any data and visualize it, taking advantage of the third dimension Z where the two dimensions X and Y are just not enough. To make the app extensible and give it a long, happy life, we are structuring it so that people can add their own templates and sources of data; it is not wired to particular data sources or visualizations. On the technical side, the tools we are using are Three.js for WebGL, Backbone.js for the MVC pattern, Require.js for dynamic script loading, and pure vanilla JavaScript for the rest. You can see our first steps here: We will be happy for any feedback or advice. Feel free.

by hamabama at June 19, 2015 07:03 PM

June 18, 2015

Hosung Hwang

Pastec analysis

Pastec works in the following order:

  1. Load the visual words: the visualWordsORB.dat file contains them; its size is 32,000,000 bytes. Loading the file takes around 1 second.
  2. Build the word index: using the visual words, Pastec builds the word index; this takes around 13 seconds.
  3. Now a previously saved index file can be loaded, or images can be added to the index.
  4. Given an image file, similar images that contain similar word indexes can be searched for.
  5. The index in memory can be written to a file.

Adding a new image to the index works in the following order:

  1. ORB features are extracted using OpenCV.
  2. Matching visual words are searched for.
  3. Matching visual words are indexed in memory.

When I added 900 images, the size of the index file was 16,967,440 bytes.

By changing the source code, I saved the matching visual word list to a text file for each image. Each word match is stored using this struct:

struct HitForward
{
    u_int32_t i_wordId;
    u_int32_t i_imageId;
    u_int16_t i_angle;
    u_int16_t x;
    u_int16_t y;
};

Each word match has a word id, image id, angle, and x/y coordinates. The saved file looks like this (in the order ImageID, Angle, x, y, WordId):


It contains 1,593 lines, which means it has 1,593 matching words. Image id 469 was Jánské.jpg, and the image looks like this:
The size of this image is 12.8 MB. Like other HDR images, it contains lots of features, and it has the biggest number of matching words among the 900 images. When the data was written to the text file, the size was 39,173 bytes; this is the worst case. When an image is simple, only a few words match. The full size of the matching word text files for all 900 images was 20.9 MB.

To reduce this, I made a simple binary format. Since the image id is the same for every word in an image, I write it once, followed by a 4-byte count. Then every word is written as a 4-byte word id, a 2-byte angle, a 2-byte x, and a 2-byte y.

4 bytes - id
4 bytes - count
4,2,2,2 (10 bytes) *  count

In the case of the image with id 469, the size is 11,238 bytes, and the file looks like this:

00000000: d501 0000 3906 0000 e282 0100 dcd9 a101  ....9...........
00000010: 6f00 a2fc 0300 10b4 a801 c501 889c 0000  o...............
00000020: 9610 6203 0901 f2b1 0900 00ad 5703 2701  ..b.........W.'.
00000030: 9b70 0000 0ee7 df02 0c01 4d20 0200 ee30  .p........M ...0
00000040: 1102 7000 9ba0 0200 e130 f401 2700 3b68  ..p......0..'.;h
00000050: 0400 a2bd 6702 3b00 b094 0800 c64c 5f02  ....g.;......L_.

0x1d5 is 469 and 0x639 is 1593.
In this case, the size was 15,938 bytes, around 15 KB, which is about 34% of the text format (39 KB).
Since this image is the worst case, storing the binary index in the database for every image record is realistic.
The full size for all 900 images was 8.5 MB (the text files were 20.9 MB).
Interestingly, this is smaller than the index file for the 900 images (16.2 MB).
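The hex dump confirms the fields are little-endian (d501 0000 is 469, 3906 0000 is 1593), so the layout can be sketched with Python's struct module (the sample hit values here are made up):

```python
import struct

def pack_hits(image_id, hits):
    """Pack one image's word matches in the described layout:
    4-byte id, 4-byte count, then one (4-byte wordId, 2-byte angle,
    2-byte x, 2-byte y) record per match, all little-endian."""
    out = struct.pack("<II", image_id, len(hits))
    for word_id, angle, x, y in hits:
        out += struct.pack("<IHHH", word_id, angle, x, y)
    return out

blob = pack_hits(469, [(99042, 1234, 10, 20), (7, 0, 640, 480)])
print(len(blob))  # 8-byte header + 10 bytes per match = 28
```

With this layout, the worst-case image above works out to 8 + 1,593 × 10 = 15,938 bytes, matching the measured size.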


I was thinking of saving the index file. However, saving the word list for each image will be the better solution: in binary format it consumes less storage, and adding it to the index is very fast. Also, when it is stored as a database field, synchronization between the index and the database is not a problem.

by Hosung at June 18, 2015 09:58 PM

June 17, 2015

Hosung Hwang

How to import CMake project in Eclipse CDT4

Currently I am analysing Pastec, which uses CMake as its build system. To dig into it, I wanted to analyse it using the functionality of Eclipse.

Pastec can be built with the following commands:

$ git clone
$ mkdir build
$ cd build
$ cmake ../
$ make

To build Pastec in Eclipse CDT, run the following instead of "cmake ../" (for a Debug build):

$ cd build
$ cmake -G"Eclipse CDT4 - Unix Makefiles" -D CMAKE_BUILD_TYPE=Debug ..

Then, it can be imported into Eclipse:

  1. Import the project using File->Import.
  2. Select General->Existing Projects into Workspace.
  3. Browse to where your build tree is and select the root build tree directory (pastec/build). Keep "Copy projects into workspace" unchecked.
  4. You get a fully functional Eclipse project.


by Hosung at June 17, 2015 03:53 PM

June 16, 2015

Anna Fatsevych

Wiki Commons API

I have been working on downloading metadata for the images found in the Wiki image dumps. I am using the Commons Tools API to gather licensing data and author information.

The fact that anybody can edit information on the wiki is great for many reasons, but it can produce unexpected, and sometimes totally unreadable, results when trying to parse the XML returned from the call.

Here is a snippet of the result; while the image name is unique and stays unchanged, the author name, license, description, and even the template itself can be changed and edited by the user.

 [file] => SimpleXMLElement Object
            [name] => QuezonNVjf181.JPG
            [title] => File:QuezonNVjf181.JPG
            [urls] => SimpleXMLElement Object
                    [file] =>
                    [description] =>

            [size] => 6480788
            [width] => 4608
            [height] => 3456
            [uploader] => Ramon FVelasquez
            [upload_date] => 2013-12-29T09:28:24Z
            [sha1] => 8646ca2be96f423faa2c33da1f2bbddbeee454c8
            [date] => 
            [author] => a href="" title="User:Ramon FVelasquez">Ramon FVelasquez SimpleXMLElement Object

As you can see, the author field contains an HTML tag, though sometimes it can be just plain text; I am parsing the "title" attribute and storing its contents, which prove to be erroneous at times. As far as licensing is concerned, things are usually much clearer, as the pre-set Creative Commons licenses are mostly used and thus provide more easily parseable fields:

    [licenses] => SimpleXMLElement Object
            [@attributes] => Array
                    [selfmade] => 1

            [license] => SimpleXMLElement Object
                    [name] => CC-BY-SA-3.0
                    [full_name] => Creative Commons Attribution Share-Alike V3.0
                    [attach_full_license_text] => 0
                    [attribute_author] => 1
                    [keep_under_same_license] => 0
                    [keep_under_similar_license] => 1
                    [license_logo_url] =>
                    [license_info_url] =>
                    [license_text_url] =>


I am using this Commons Tool to get the information for already-downloaded images. I had also been checking first whether the complete information exists in the XML file dumps, but have now decided to bypass that check and just use the API, as I think it will provide the most recently updated information, with less possibility of an outdated or corrupt XML file.



by anna at June 16, 2015 08:58 PM

June 15, 2015

Hosung Hwang

Pastec test method and result


Pastec was mentioned in my previous post about Content Based Image Retrieval (CBIR). It extracts features using ORB and visual words.

Pastec provides a visual word data file, visualWordsORB.dat, which is 10.5MB. The Pastec program loads the visual word data initially and then loads the index data file. After that the index can be searched. Today, I am going to write about the test results for the same 900 images I used before. Performance and source code analysis will be done later.

Test Method

The full API is on this page.
Pastec runs as an HTTP server with a RESTful API. It can be started with the following command:

./pastec visualWordsORB.dat

I added all JPEG images in the directory to the index with this script:

i=0
for F in /home/hosung/cdot/ccl/hashtest/all-images/*.jpg; do
    curl -X PUT --data-binary @"${F}" http://localhost:4212/index/images/$i
    i=$((i+1))
done

Then each image is searched with this script:

i=0
for F in /home/hosung/cdot/ccl/hashtest/all-images/*.jpg; do
    echo $i,"${F}"
    curl -X POST --data-binary @"${F}" http://localhost:4212/index/searcher
    i=$((i+1))
done

These scripts generate output like the following:

2,/home/hosung/cdot/ccl/hashtest/all-images/05 0751 DOE NamUs UP 345 Reconstruction 001a.jpg
3,/home/hosung/cdot/ccl/hashtest/all-images/0514-80 Reconstruction 002b.jpg
70,/home/hosung/cdot/ccl/hashtest/all-images/A 3D Object design using FreeCad Software.jpg

Since the response is JSON data, I had to parse it again, so I wrote a simple Python script, because JSON parsing is easy in Python.

import json

id = 0
file = "nofile"
error = 0
notfound = 0
onlyOne = 0
moreThanOne = 0

with open("search2.txt", "r") as f:
    for line in f:
        if line[0] != '{':
            # An "id,filename" line written by the search script.
            line1 = line.split(',')
            id = int(line1[0])
            file = line1[1]
        else:
            # A JSON response line from Pastec.
            j = json.loads(line)
            if j["type"] == "SEARCH_RESULTS":
                ids = j["image_ids"]
                if len(ids) == 0:
                    notfound += 1
                if len(ids) == 1:
                    onlyOne += 1
                if len(ids) > 1:
                    moreThanOne += 1
                    print str(id) + " : ",
                    print ids,
                    print file
            else:
                # Error responses such as IMAGE_SIZE_TOO_SMALL.
                print str(id) + " : " + j["type"],
                print " : " + file
                error += 1

print "Error : " + str(error)
print "NotFound : " + str(notfound)
print "Match Only One : " + str(onlyOne)
print "Match More Than One : " + str(moreThanOne)

I printed only the results that include more than one match. The following is the output of the Python script:

22 : [22, 835] /home/hosung/cdot/ccl/hashtest/all-images/1992-06560 Reconstruction 002.jpg
23 : [23, 835] /home/hosung/cdot/ccl/hashtest/all-images/1992-06614 Reconstruction 002.jpg
28 : [28, 29, 30] /home/hosung/cdot/ccl/hashtest/all-images/20131017 111028 green spiral ornament with Purple background.jpg
29 : [29, 30, 28] /home/hosung/cdot/ccl/hashtest/all-images/20131017 111122 Fairest wheel ornament with wall as background.jpg
30 : [30, 29] /home/hosung/cdot/ccl/hashtest/all-images/20131017 111143 - White Feerest wheel ornament with plywood background.jpg
70 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/A 3D Object design using FreeCad Software.jpg
77 : [77, 78] /home/hosung/cdot/ccl/hashtest/all-images/Alaska Hitchhiker Skull (Moustache Hair Eyepatch).jpg
78 : [78, 77] /home/hosung/cdot/ccl/hashtest/all-images/Alaska Hitchhiker Skull (Moustache Hair).jpg
90 : [90, 91] /home/hosung/cdot/ccl/hashtest/all-images/Anisotropic filtering en.jpg
91 : [91, 90] /home/hosung/cdot/ccl/hashtest/all-images/Anisotropic filtering pl.jpg
175 : [175, 180] /home/hosung/cdot/ccl/hashtest/all-images/Ch Light10.jpg
176 : [176, 177] /home/hosung/cdot/ccl/hashtest/all-images/Ch Light2.jpg
177 : [177, 176] /home/hosung/cdot/ccl/hashtest/all-images/Ch Light3.jpg
178 : [178, 181] /home/hosung/cdot/ccl/hashtest/all-images/Ch Light4.jpg
180 : [180, 175] /home/hosung/cdot/ccl/hashtest/all-images/Ch Light6.jpg
193 : [193, 195] /home/hosung/cdot/ccl/hashtest/all-images/Circle reflect wikipedia 2.jpg
195 : [195, 193] /home/hosung/cdot/ccl/hashtest/all-images/Circle reflect wikipedia sky.jpg
204 : [204, 205] /home/hosung/cdot/ccl/hashtest/all-images/Computer generated image of the M챈rsk Triple E Class (1).jpg
205 : [205, 204] /home/hosung/cdot/ccl/hashtest/all-images/Computer generated image of the M챈rsk Triple E Class (cropped).jpg
207 : [207, 367, 772] /home/hosung/cdot/ccl/hashtest/all-images/Copper question mark 3d.jpg
211 : [211, 210] /home/hosung/cdot/ccl/hashtest/all-images/Cro-Magnon man - steps of forensic facial reconstruction.jpg
216 : [216, 217] /home/hosung/cdot/ccl/hashtest/all-images/CTSkullImage - cropped.jpg
217 : [217, 216] /home/hosung/cdot/ccl/hashtest/all-images/CTSkullImage.jpg
220 : [220, 222] /home/hosung/cdot/ccl/hashtest/all-images/Cubic Structure.jpg
222 : [222, 220] /home/hosung/cdot/ccl/hashtest/all-images/Cubic Structure with Shallow Depth of Field.jpg
237 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/Dimens찾o Fractal.jpg
251 : [251, 252] /home/hosung/cdot/ccl/hashtest/all-images/Earthrelief.jpg
252 : [252, 251] /home/hosung/cdot/ccl/hashtest/all-images/Earthrelief mono.jpg
266 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/ENIGMA Logo.jpg
281 : [281, 282] /home/hosung/cdot/ccl/hashtest/all-images/Flower And Vase (Graphic).jpg
282 : [282, 281] /home/hosung/cdot/ccl/hashtest/all-images/Flower And Vase Ver.02.jpg
337 : [337, 338] /home/hosung/cdot/ccl/hashtest/all-images/Frankfurt Skyline I - HDR (14196217399).jpg
338 : [338, 337] /home/hosung/cdot/ccl/hashtest/all-images/Frankfurt Skyline II - HDR (14391360542).jpg
350 : [350, 352, 351] /home/hosung/cdot/ccl/hashtest/all-images/Glass ochem dof2.jpg
351 : [351, 350, 352] /home/hosung/cdot/ccl/hashtest/all-images/Glass ochem dof.jpg
352 : [352, 350, 351] /home/hosung/cdot/ccl/hashtest/all-images/Glass ochem.jpg
356 : [356, 357] /home/hosung/cdot/ccl/hashtest/all-images/GML-Cave-Designer (1).jpg
357 : [357, 356] /home/hosung/cdot/ccl/hashtest/all-images/GML-Cave-Designer.jpg
358 : [358, 359] /home/hosung/cdot/ccl/hashtest/all-images/GML-Gothic-Cathedral (1).jpg
359 : [359, 358] /home/hosung/cdot/ccl/hashtest/all-images/GML-Gothic-Cathedral.jpg
360 : [360, 361] /home/hosung/cdot/ccl/hashtest/all-images/GML-Gothic-Window-Thickness (1).jpg
361 : [361, 360] /home/hosung/cdot/ccl/hashtest/all-images/GML-Gothic-Window-Thickness.jpg
362 : [362, 363] /home/hosung/cdot/ccl/hashtest/all-images/GML-Stuhl-Template (1).jpg
363 : [363, 362] /home/hosung/cdot/ccl/hashtest/all-images/GML-Stuhl-Template.jpg
364 : [364, 365] /home/hosung/cdot/ccl/hashtest/all-images/GML-Voronoi-Diagram (1).jpg
365 : [365, 364] /home/hosung/cdot/ccl/hashtest/all-images/GML-Voronoi-Diagram.jpg
367 : [367, 207, 772] /home/hosung/cdot/ccl/hashtest/all-images/Gold question mark 3d.jpg
377 : [377, 378] /home/hosung/cdot/ccl/hashtest/all-images/Griffith Park Jane Doe Reconstruction 9b.jpg
378 : [378, 377] /home/hosung/cdot/ccl/hashtest/all-images/Griffith Park Jane Doe Reconstruction 9d.jpg
423 : [423, 424] /home/hosung/cdot/ccl/hashtest/all-images/Hall effect A.jpg
424 : [424, 423] /home/hosung/cdot/ccl/hashtest/all-images/Hall effect.jpg
435 : [435, 815, 814] /home/hosung/cdot/ccl/hashtest/all-images/HDR The sound of silence (The road to Kamakhya).jpg
436 : [436, 837] /home/hosung/cdot/ccl/hashtest/all-images/HEAD inline.jpg
448 : [448, 449] /home/hosung/cdot/ccl/hashtest/all-images/Homo erectus pekinensis
449 : [449, 448] /home/hosung/cdot/ccl/hashtest/all-images/Homo erectus pekinensis.jpg
453 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/HrdiBloomExample.jpg
457 : [457, 458] /home/hosung/cdot/ccl/hashtest/all-images/Ilame In Tengwar Ver.01-2.jpg
458 : [458, 457] /home/hosung/cdot/ccl/hashtest/all-images/Ilam챕 (Name) In Tengwar.jpg
487 : [487, 488] /home/hosung/cdot/ccl/hashtest/all-images/King's Cross railway station MMB C1.jpg
488 : [488, 487] /home/hosung/cdot/ccl/hashtest/all-images/King's Cross railway station MMB C2.jpg
489 : [489, 490] /home/hosung/cdot/ccl/hashtest/all-images/King's Cross railway station MMB C3.jpg
490 : [490, 489] /home/hosung/cdot/ccl/hashtest/all-images/King's Cross railway station MMB C4.jpg
494 : [494, 495] /home/hosung/cdot/ccl/hashtest/all-images/KrakowHDR pics.jpg
495 : [495, 494] /home/hosung/cdot/ccl/hashtest/all-images/KrakowHDR slides.jpg
512 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/LOD Example.jpg
521 : [521, 524, 523] /home/hosung/cdot/ccl/hashtest/all-images/Lync02.jpg
523 : [523, 524, 521] /home/hosung/cdot/ccl/hashtest/all-images/Lync04.jpg
524 : [524, 523, 521] /home/hosung/cdot/ccl/hashtest/all-images/Lync05.jpg
586 : [586, 593] /home/hosung/cdot/ccl/hashtest/all-images/Mount Vernon
610 : [610, 611] /home/hosung/cdot/ccl/hashtest/all-images/Obsidian Soul 1.jpg
611 : [611, 610] /home/hosung/cdot/ccl/hashtest/all-images/Obsidian Soul 2.jpg
617 : [617, 618] /home/hosung/cdot/ccl/hashtest/all-images/Oren-nayar-vase1.jpg
618 : [618, 617] /home/hosung/cdot/ccl/hashtest/all-images/Oren-nayar-vase2.jpg
667 : [667, 668] /home/hosung/cdot/ccl/hashtest/all-images/Radiosity Comparison.jpg
668 : [668, 667] /home/hosung/cdot/ccl/hashtest/all-images/Radiosity scene.jpg
676 : [676, 677, 678] /home/hosung/cdot/ccl/hashtest/all-images/Rauzy2.jpg
677 : [677, 678, 676] /home/hosung/cdot/ccl/hashtest/all-images/Rauzy3.jpg
678 : [678, 677, 676] /home/hosung/cdot/ccl/hashtest/all-images/Rauzy4.jpg
721 : [721, 724, 722, 723] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.00.12 PM Meshlab.jpg
722 : [722, 721, 723, 724] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.00.26 PM meshlab.jpg
723 : [723, 722, 721, 724] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.00.37 PM meshlab.jpg
724 : [724, 721, 722, 723] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.00.49 PM meshlab.jpg
725 : [725, 726, 731, 730, 727, 729, 728] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.09.42 PM blender.jpg
726 : [726, 725, 731, 730, 727, 729, 728] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.11.32 PM blender.jpg
727 : [727, 725, 726, 731, 730, 729, 728] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.11.42 PM blender.jpg
728 : [728, 729, 727, 726, 725, 731, 730] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.13.32 PM blender.jpg
729 : [729, 726, 731, 727, 725, 730, 728] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.14.07 PM blender.jpg
730 : [730, 731, 726, 725, 727, 729, 728] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.14.11 PM blender.jpg
731 : [731, 730, 726, 725, 727, 729, 728] /home/hosung/cdot/ccl/hashtest/all-images/Screen Shot 2013-10-27 at 2.14.15 PM blender.jpg
734 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/Scupltris logo.jpg
763 : [763, 764] /home/hosung/cdot/ccl/hashtest/all-images/Snapshot12.jpg
764 : [764, 763] /home/hosung/cdot/ccl/hashtest/all-images/Snapshot13.jpg
772 : [772, 207, 367] /home/hosung/cdot/ccl/hashtest/all-images/Spanish Question mark 3d.jpg
790 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/Sterling2 icon SterlingW2589.jpg
799 : [799, 800] /home/hosung/cdot/ccl/hashtest/all-images/Synagoge Weikersheim innen 01.jpg
800 : [800, 799] /home/hosung/cdot/ccl/hashtest/all-images/Synagoge Weikersheim innen 02.jpg
814 : [814, 435, 815] /home/hosung/cdot/ccl/hashtest/all-images/The Sound of Silence -2EV.jpg
815 : [815, 435] /home/hosung/cdot/ccl/hashtest/all-images/The Sound of Silence Resulting HDR.jpg
835 : [835, 22, 23] /home/hosung/cdot/ccl/hashtest/all-images/UP 3773 and UP 3774 (1400UMCA and 1397UMCA) Reconstruction 001.jpg
837 : [837, 436] /home/hosung/cdot/ccl/hashtest/all-images/UPPER inline.jpg
844 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/Valentine Doe 1993 Scaled.jpg
852 : [852, 854] /home/hosung/cdot/ccl/hashtest/all-images/ViewFrustum.jpg
854 : [854, 852] /home/hosung/cdot/ccl/hashtest/all-images/ViewWindow2.jpg
876 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/Woman in bra staring.jpg
882 : [882, 883] /home/hosung/cdot/ccl/hashtest/all-images/WP VS 1 rel(dachris).jpg
883 : [883, 882] /home/hosung/cdot/ccl/hashtest/all-images/WP VS 2 rel(dachris).jpg
898 : IMAGE_SIZE_TOO_SMALL : /home/hosung/cdot/ccl/hashtest/all-images/Zoomin.jpg

Error : 10
NotFound : 10
Match Only One : 783
Match More Than One : 97

Test Result

The results show that 10 images were not added. The reason was ‘IMAGE_SIZE_TOO_SMALL’. According to the source code, when an image’s width or height is smaller than 150 px, it is not added to the index. Since those 10 images weren’t added to the index, there were 10 images that were not found in the search.
783 images matched only themselves.
And 97 images matched more than one image.
Therefore, there was no false negative result.
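
As a quick sanity check, the four counts reported by the parsing script account for all 900 test images:

```python
# Counts reported above: search errors (too-small images), not found,
# matched only themselves, and matched more than one image.
error, notfound, only_one, more_than_one = 10, 10, 783, 97
print(error + notfound + only_one + more_than_one)  # prints: 900
```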

The following are some meaningful matches.

Cropped image

This: 1992-06560 Reconstruction 002, and this: 1992-06614 Reconstruction 002, both match this: UP 3773 and UP 3774 (1400UMCA and 1397UMCA) Reconstruction 001

It means this algorithm detects when an image is part of another image. The following are similar results:

Computer generated image of the Mærsk Triple E Class (1) Computer generated image of the Mærsk Triple E Class (cropped)

Cro-Magnon man rendered Cro-Magnon man - steps of forensic facial reconstruction

Frankfurt Skyline I - HDR (14196217399) Frankfurt Skyline II - HDR (14391360542)

Hall effect Hall effect A

Homo erectus pekinensis Homo erectus pekinensis, forensic facial reconstruction

Oren-nayar-vase1 Oren-nayar-vase2

moving and similar images

20131017 111028 green spiral ornament with Purple background 20131017 111122 Fairest wheel ornament with wall as background 20131017 111143 - White Feerest wheel ornament with plywood background

Mount Vernon, NYJane Doe facial reconstruction NamUs 3123 Reconstruction 001

This result is a bit strange. The faces of the two people resemble each other; however, this seems to be a false positive result.

Synagoge Weikersheim innen 01 Synagoge Weikersheim innen 02

changing colours and rotation

Copper question mark 3d Gold question mark 3d Spanish Question mark 3d

The other cases

KrakowHDR pics KrakowHDR slides

The positions of the three images were changed.

Obsidian Soul 1 Obsidian Soul 2

Rauzy2 Rauzy3 Rauzy4

This result is a bit strange; another false positive.

Snapshot12 Snapshot13

This result is interesting because a rotated object was detected. On the other hand, the similar images (Snapshot00, 01, 02 ~ 14.jpg) that gave a lot of false positive results with pHash didn’t match each other.


  • Pastec ignores images whose width or height is smaller than 150px. This should be considered.
  • Rotated and cropped images can be detected.
  • Compared to the DCT/MH hashes in pHash, there were far fewer false positive results.
  • All in all, the results for 900 images were more reliable than pHash’s.
  • Hashing/indexing and searching seem to be quite fast. However, a performance test should be performed.
  • The hash size and the indexing/searching mechanism should be analysed so they can be customized for our server system.


by Hosung at June 15, 2015 09:53 PM

Ali Al Dallal

Simple React Webpack and Babel Starter Kit

At the Mozilla Foundation, we're starting to use React, mainly to create our web applications, and I can say that most of the time writing React without Webpack and Babel can be a bit annoying, or really hard.

When you look for an example of creating a React app with Webpack and Babel, you often get tons of stuff that you don't want or don't care about, and in removing it yourself you'll either create bugs or find yourself spending more time fixing things you broke than actually coding. So I created this simple repo with just the simple stuff you need to get started.

React Webpack and Babel
Simple React Webpack Babel Starter Kit

This is a simple React, Webpack and Babel application with nothing else in it.

What's in it?

Just a simple index.jsx, webpack.config.js and index.html file.

To run

You can simply run a webpack build using this command:

> $ npm run build

If you want to run with webpack-dev-server simply run this command:

> $ npm run dev

Please contribute to the project if you think this can be done better in any way, even for the README :)

by Ali Al Dallal at June 15, 2015 02:37 PM

June 12, 2015

Anna Fatsevych

Flickr API in PHP

In one of my previous posts, I wrote a Python program to download images using Flickr API.

Now I have written it in PHP using the phpFlickr API, which is quite easy to use and understand. For our purposes, my program downloads all the images uploaded on a specific date. It makes one API call per image, hashes the images, and stores them in a MySQL database.

Here is a code snippet that shows how easy it is to make an API call and set the required parameters:

$f = new phpFlickr("YOUR API KEY HERE");
$photos = $f->photos_search(array("tags"=>"car","per_page"=>"500",
          "license"=>"3", "extras"=>"url_o,owner_name, license"));

More details on Flickr API queries and limitations are in my previous post here. The PHP program is available on GitHub.



by anna at June 12, 2015 07:24 PM

June 11, 2015

Hong Zhan Huang

OSTEP – The City of ARMs – Tools of the trade 2: tmux

The tool of the trade featured in this post is the terminal multiplexer known as tmux. A terminal multiplexer is a tool which allows a user to create, access and manage a number of terminals all within the confines of one screen. tmux can also be detached from the screen and continue running in the background, then be reattached later when one wishes to continue the work from where the session left off. The tmux manual offers encompassing literature on the workings of the program for those interested.

In this post I’ll be expounding upon my experience in setting up and using tmux.

The work that I’m doing at CDOT for the OSTEP team involves ssh’ing into a variety of machines (mainly housed in our EHL server cabinet) on a daily basis. After a certain point it becomes difficult to manage each connection with just a regular terminal. There’s also no way to pick up from where you left off the next time you return to work. After seeing my coworkers make use of tmux in their work processes, I endeavored to do the same.

tmux vs screen

Before we get into the basics of tmux, we should perhaps compare it with another terminal multiplexer: GNU Screen. I’m no expert on Screen, but the gist of the comparison seems to be that tmux is a more modern, better version of Screen that is still actively supported. The reasons why can be read in this FAQ. As a new user of tmux who has only dabbled a little with Screen, tmux does seem to be the better tool so far.


After installing tmux onto your system, to use it you’ll need to start a new session of tmux. This can be done through this command:

tmux new -s Demo

This will create a new session named Demo that has a single window and display it on the screen. You’ll also notice that in this window there is a status line at the bottom of the screen that will show information about the current session as well as being the location to input tmux commands.

A basic tmux session with one window

From here we can begin using tmux’s features and functionality to modify our terminal work space to suit our liking.

tmux prefix

The prefix or escape function is the key combination that allows the user to exit normal input and enter tmux commands or shortcuts. The prefix in tmux is ctrl-b or in other words ctrl plus b together. Following this input you may press any key that has a bound functionality to it (ctrl-b c will create a new window for example) or press the colon key to enter the tmux command prompt where you can type out the command you wish to execute manually. You can find a list of all the currently assigned bindings with ctrl-b then question mark (ctrl-b ?). Now with the knowledge of the prefix let’s go and play around.

We’ll start by creating three more windows in our session:

ctrl-b c x3 or new-window in the tmux command prompt

In our first window we’ll split the window into two panes by splitting it in half with a vertical divider:

ctrl-b % or split-window -h (-h splits into side-by-side panes; -v splits into stacked panes)

Lastly we’ll rename the current window to “A Distant Place” (tmux can search window names, so you can easily find a window among many if you have named it):

ctrl-b , or command-prompt -I #W "rename-window '%%'"

Now our session looks like this:

We have four windows as shown in the status line, and our first window, now named A Distant Place, has a two-pane split. These are just some of the basic options for creating a work-space to your liking.


One of the pros of using terminal multiplexers like tmux is the ability to start a task, walk away and come back to it later. The process to do this is to detach the session:

ctrl-b d or detach-client

and then when you wish to return to your session:

tmux attach -t Demo

Sessions are ended when all windows of a session are exited. My typical usage of tmux so far is to have my workstation start the session and thus become the tmux server. I can then remotely access my workstation via a laptop when I’m not on site and continue using my session for as long as it exists. By using tmux I can easily maintain a constant terminal environment with all the ssh or serial connections.


I had said earlier that tmux is quite easily customizable. You can change how the key bindings are for tmux commands or create new ones for your own preferences. You may also change the visual aspects of tmux such as the colours of the status bar items. You can add items of choice to the status bar such as up-time, number of users currently using the session or battery-life of your laptop. Mouse support also exists for tmux should you want it. Suffice to say there is a lot of customization you can do with tmux. I’ll share the .tmux.conf file that has all the configurations I’ve been using so far (comments are prefixed with the # sign):

#Start numbering at 1
set -g base-index 1
set -g pane-base-index 1

#Set status bar for cleaner look
set -g status-bg black
set -g status-fg white
set -g status-left '#[fg=green]#H'

#Highlight active window
set-window-option -g window-status-current-bg red
set-window-option -g window-status-activity-style "fg=yellow"

#Show number of current users logged in and average loads for the computer
set -g status-right '#[fg=yellow]#(uptime | cut -d "," -f2-)'

#Set window notifications
setw -g monitor-activity on
set -g visual-activity on

#Automatically set window title
setw -g automatic-rename

#Rebind split window commands
unbind % #Remove the default binding for split-window -h
bind | split-window -h
bind - split-window -v

#Less input delay in command sequences ie C-a n
set -s escape-time 0

#Mouse support
set -g mode-mouse on
set -g mouse-resize-pane on
set -g mouse-select-pane on
set -g mouse-select-window on

#Allow for aggressive resizing of windows (not constrained by smallest window)
setw -g aggressive-resize on

#pane traversal bindings
bind h select-pane -L
bind j select-pane -D
bind k select-pane -U
bind l select-pane -R

# reload config
bind r source-file ~/.tmux.conf \; display-message "Config reloaded..."

#COLOUR (Solarized 256)

#default statusbar colors
set-option -g status-bg colour235 #base02
set-option -g status-fg colour136 #yellow
set-option -g status-attr default

#default window title colors
set-window-option -g window-status-fg colour244 #base0
set-window-option -g window-status-bg default
set-window-option -g window-status-attr dim

#active window title colors
set-window-option -g window-status-current-fg colour166 #orange
set-window-option -g window-status-current-bg default
set-window-option -g window-status-current-attr bright

#pane border
set-option -g pane-border-fg colour235 #base02
set-option -g pane-active-border-fg colour136 #base01

#message text
set-option -g message-bg colour235 #base02
set-option -g message-fg colour166 #orange

#pane number display
set-option -g display-panes-active-colour colour33 #blue
set-option -g display-panes-colour colour166 #orange

set-window-option -g clock-mode-colour colour64 #green

# status bar
set-option -g status-utf8 on

So that about wraps up an introductory bit on tmux’s utility and a brief look at how you can go about using it. I think it is a really useful tool for those who regularly use remote machines through ssh, and I’ll likely be using it all the time from here on out. There are many features I didn’t touch on, such as tmux’s copy mode, multi-user sessions and more. If you’re interested in learning more about tmux, please refer to the official manual.

by hzhuang3 at June 11, 2015 05:36 PM

Hosung Hwang

MH Hash, MVP-Tree indexer/searcher for MySQL/PHP

The current development server runs on the LAMP stack. Anna is working on the Creative Commons image crawler and user interface using PHP/MySQL. For the prototype that works with the PHP UI code and MySQL database, I made an indexer and a searcher.


The database contains lots of records containing image URL, license, and hash values, and it is populated by the crawler written in PHP.


Source code :

Description :

$ ./mhindexer
Usage :
     mhindexer hostName userName password schema table key value treeFilename
     hostName : mysql hostname
     userName : mysql username
     password : mysql password
     schema : db name
     table : table name
     key : image id field name in the table
     value : hash field name in the table
     treeFilename : mvp tree file name
Output :

The program takes the MySQL connection information (hostname, username, password) and the database information (schema, table, key, value). After connecting using this information, it reads all ‘key’ and ‘value’ fields from the ‘table’. ‘key’ is a unique key that points to the DB record containing the image information: filename, URL, hash value, etc. ‘value’ is a hash value that is used to calculate the Hamming distance.

After connecting to the database, the program reads all records that contain hash values and adds them to the MVP-tree. When the tree is built, it is written to the ‘treeFilename’ file.

I made a simple bash script that runs mhindexer with parameters. The output is:

$ ./,784,0.035845

From the hashes in the database, the tree was written to the file; there are 784 nodes and it took 0.035845 seconds.


Source code :

Description :

Usage :
    mhsearcher treeFilename imageFilename radius
    eg : mhsearcher ./test.jpg 0.0005
output : 0-success, 1-failed
    success : 0,count,id,id,id,...
      eg : 0,2,101,9801 
    failed : 1,error string
      eg : 1,MVP Error

For now, the searcher reads the tree file (treeFilename) to generate the tree structure, extracts the MH hash from the input file (imageFilename), then searches for the hash value in the tree using ‘radius’.

The output is consumed by a PHP script. When the first comma-separated field is 0, there is no error and the result is meaningful. The second field is the count of detected hashes, and the following fields are the ids of the hashes. Using the ids, the PHP script can get the image information from the database.
When the first field is 1, the following field is the error message.
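
The ‘radius’ is a threshold on the distance between hashes. Assuming it is the Hamming distance between the MH hashes normalized by the total number of bits (an assumption for illustration, not confirmed against the pHash source here), the comparison can be sketched in Python:

```python
def hamming_distance(hex_a, hex_b):
    # Bitwise Hamming distance between two equal-length hex hash strings.
    assert len(hex_a) == len(hex_b)
    xor = int(hex_a, 16) ^ int(hex_b, 16)
    return bin(xor).count("1")

def normalized_distance(hex_a, hex_b):
    # Scale to [0, 1] by dividing by the total number of bits (4 per hex digit).
    return hamming_distance(hex_a, hex_b) / float(len(hex_a) * 4)

# Two 1-byte example hashes: every bit differs, so the distance is 1.0.
print(normalized_distance("ff", "00"))  # prints: 1.0
```

A search with radius r then returns every indexed hash whose distance from the query hash is at most r.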

To test it, I randomly chose an image that is in the database.
Example output is :

$ ./mhsearcher WTW_Nov_2013_Tumanako_023.JPG 0.001
$ ./mhsearcher WTW_Nov_2013_Tumanako_023.JPG 0.1
$ ./mhsearcher WTW_Nov_2013_Tumanako_023.JPG 0.2
$ ./mhsearcher WTW_Nov_2013_Tumanako_023.JPG 0.3
$ ./mhsearcher WTW_Nov_2013_Tumanako_023.JPG 0.44

For performance statistics purposes, I added the radius, calculation count, and extraction time at the end of the result.
In this image’s case, when the radius was 0.2, the matching image was found. And when the radius was 0.44, there were 5 results.


  • These utilities work well with MySQL and PHP.
  • Because of the characteristics of the tree search algorithm, a repeated search from a radius of 0.001 up to 0.5 can be done inside the searcher to get a fast and reliable result.
  • Later, the indexer and searcher can be changed into a Linux daemon process that maintains the tree in memory for fast searching.
  • When the number of database records is enormous (millions ~ billions), the tree can be divided into several sections in the database.
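A minimal sketch of the repeated-search idea from the second bullet, as an expanding-radius loop (the search_fn callback stands in for a call to mhsearcher; the names and radius schedule are illustrative):

```python
def expanding_search(search_fn, radii=(0.001, 0.01, 0.1, 0.2, 0.3, 0.5)):
    # Try progressively larger radii, stopping at the first non-empty result.
    # search_fn(radius) should return a list of matching ids, as mhsearcher does.
    for r in radii:
        ids = search_fn(r)
        if ids:
            return r, ids
    return None, []

# Example with a fake backend that only matches at radius >= 0.2.
fake = lambda r: [101, 9801] if r >= 0.2 else []
print(expanding_search(fake))  # prints: (0.2, [101, 9801])
```

Starting small keeps the common case (a near-exact duplicate) fast, while still finding more distant matches when nothing close exists.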

by Hosung at June 11, 2015 04:30 AM

June 10, 2015

Koji Miyauchi

Heroku with node.js & mongoDB


The goal of our project over the last two weeks was to put our application onto GitHub Pages. In order to do that, we had to host our server-side APIs somewhere accessible.
After some discussions with our clients, we decided to host the server-side code on Heroku.

Heroku is one of the popular cloud application platforms (such as AWS, DigitalOcean and Engine Yard) that can host your web application. A good thing about Heroku is that the initial cost is free.

This service is very easy to use.
All you basically need to do is this:

  1. Have your git repository for the app.
  2. Proper configuration in your project.
    In our case, we use Node.js, so we configure the application’s dependencies and start-up file in package.json
  3. Push the repository to Heroku
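
For step 2, a minimal package.json might look like the following (the dependencies, versions, and start file are illustrative assumptions, not taken from our actual project):

```json
{
  "name": "example-app",
  "version": "1.0.0",
  "dependencies": {
    "express": "^4.0.0",
    "mongodb": "^2.0.0"
  },
  "scripts": {
    "start": "node server.js"
  }
}
```

Heroku installs the listed dependencies and, in the absence of a Procfile, uses the "start" script to launch the app.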

After you deploy your application to Heroku’s master repository, Heroku will automatically install all the dependencies your app needs and run it.

Deploy your application to Heroku

Here are good instructions on how to deploy your Node.js application onto Heroku. Setting it up is very straightforward.

Install mongoDB Add-on

In order to use MongoDB on Heroku after setting up your application, you need to install an add-on called mongoLab or Compose MongoDB. I used mongoLab this time.

Installing an add-on is also quite easy to do. Just type

heroku addons:create mongolab -a <application>

and it will install the add-on onto your application.
All the configuration of your DB is available from Heroku’s web console.
mongoLab provides 500MB of storage for free.


Heroku accepts many types of applications, such as Node.js, Ruby on Rails, PHP, Java, Python and so on.
It allows users to deploy an application very quickly, and it automatically sets up the infrastructure for you, so you can save time as well.

by koji miyauchi at June 10, 2015 09:11 PM

Anna Fatsevych

Wiki Parser and User Interface

As I mentioned in the last post, I was writing a “parser” of sorts to get through the XML files located in the Wiki image grab, along with the corresponding images.

I now have a PHP program that gets the image name from the list file and then uses the Wiki API to get the latest data (author, license, and its existence status). The program is available on GitHub.

I have also written a user interface in PHP that allows comparison of images, either downloaded or via URL. Here is a preview of it.


Here is the link to the code on GitHub. This is a quick demo for now, using jQuery and Bootstrap; the PHP code will be refactored and cleaned up.

by anna at June 10, 2015 09:02 PM

Hosung Hwang

MVP Tree with MH Hash for Image Search

The MH image hash in the pHash project generates 72-byte hash values. Despite its weakness of false positives on simple images, it has the benefit that it can be used with the MVP tree implementation.

Sample program

I wrote a sample utility in C++ to test real samples.
The source code is here (it may change later) :

The program’s usage is as follows:

Usage :
    MHHashTree directory filename radius
      directory : a directory that contains .hashmh files that will be in the MVP-tree
      filename : a .hashmh file to search from the tree
      radius : radius to search e.g. 0.0001, 0.1, 1.0, 4.0
    MHHashTree directory filename radius BranchFactor PathLength LeafCap
      BranchFactor : tree branch factor - default 2
      PathLength : path length to use for each data point - default 5
      LeafCap : leaf capacity of each leaf node - maximum number of datapoints - default 25

Test Result 1

The sample directory contains 900 image hashes extracted from images. I picked an image that has 1 similar image:
Ch Light6

$ ./MHHashTree /home/hosung/cdot/ccl/hashtest/all-images "Ch Light6.jpg.hashmh" 0.001
(*) Ch Light6.jpg.hashmh   : ff43e93178c77400008696922ecc3100efe2b2a5493b6fa72524409aac816330204898fcb2fc300bc9f0fc7e392436c7e3f1ffb40c04e07030fc7e3f038fc7000000000000000000
------------------Results 1 (9 calcs) (0.000011 secs)---------
(0) Ch Light6.jpg.hashmh   : ff43e93178c77400008696922ecc3100efe2b2a5493b6fa72524409aac816330204898fcb2fc300bc9f0fc7e392436c7e3f1ffb40c04e07030fc7e3f038fc7000000000000000000
$ ./MHHashTree /home/hosung/cdot/ccl/hashtest/all-images "Ch Light6.jpg.hashmh" 0.1
(*) Ch Light6.jpg.hashmh   : ff43e93178c77400008696922ecc3100efe2b2a5493b6fa72524409aac816330204898fcb2fc300bc9f0fc7e392436c7e3f1ffb40c04e07030fc7e3f038fc7000000000000000000
------------------Results 2 (738 calcs) (0.002161 secs)---------
(0) Ch Light10.jpg.hashmh   : ff43e93158c7740000949690aecc3100e7e0b2a5493b6fa5263444bad9c16930224891f9b2fc300bc1f0fc7e392436c7e7f1ffb40c04e07030fc7e3f038fc7000000000000000000
(1) Ch Light6.jpg.hashmh   : ff43e93178c77400008696922ecc3100efe2b2a5493b6fa72524409aac816330204898fcb2fc300bc9f0fc7e392436c7e3f1ffb40c04e07030fc7e3f038fc7000000000000000000
$ ./MHHashTree /home/hosung/cdot/ccl/hashtest/all-images "Ch Light6.jpg.hashmh" 0.4
(*) Ch Light6.jpg.hashmh   : ff43e93178c77400008696922ecc3100efe2b2a5493b6fa72524409aac816330204898fcb2fc300bc9f0fc7e392436c7e3f1ffb40c04e07030fc7e3f038fc7000000000000000000
------------------Results 11 (897 calcs) (0.000733 secs)---------
(0) Ch Light10.jpg.hashmh   : ff43e93158c7740000949690aecc3100e7e0b2a5493b6fa5263444bad9c16930224891f9b2fc300bc1f0fc7e392436c7e7f1ffb40c04e07030fc7e3f038fc7000000000000000000
(1) Metaball3.jpg.hashmh   : 0000000000000000000002b9c0400fc4620000f6e4a77c7b877e242496ec45b978d848db24b5254f99b97cdcdb2076ccdfefcd6de42400447e2a0203381e00000000000000000000
(2) Ch Light6.jpg.hashmh   : ff43e93178c77400008696922ecc3100efe2b2a5493b6fa72524409aac816330204898fcb2fc300bc9f0fc7e392436c7e3f1ffb40c04e07030fc7e3f038fc7000000000000000000
(3) Orx-logo.jpg.hashmh   : 000000000000000000000000000000000000000063f1b9ef2fb200006b897da4b194020000226c5098e17dea00000037fbf6f92dfe00000000000604000000000000000000000000
(4) Snapshot10pointcloud.jpg.hashmh   : 0000000000000000000000001fcfc000000000000012cdb0000000000000124db0000000000000161db00000000000001228d8000000000000027028000000000000000000000000
(5) Snapshot05.jpg.hashmh   : 00000000000000000000000000afc80000000000001a4db0000000000000122da800000000000016cb680000000000001263200000000000001b72e0000000000000000000000000
(6) Snapshot01.jpg.hashmh   : 00000000000000000000000000a1980000000000001b4308000000000000112928000000000000136ba80000000000000922400000000000000c0378000000000000000000000000
(7) Alaska Hitchhiker Skull (Moustache Hair).jpg.hashmh   : 000080003007bc0000f702c50c4d773389448fd50e3fcf81399c0400d2c483b1f88f96d220b4a4ea6ba81e4b2223d300e1e8a81f2883000000000000000000000000000000000000
(8) Alaska Hitchhiker Skull (Moustache Hair Eyepatch).jpg.hashmh   : 00008000700fbc0000ff00439c4c4704683c8fd51e7f4781399d8c0095dce391f84f079220eda6eb69a9e64b2623d300e1e8a81d2a83000000000000000000000000000000000000
(9) Snapshot04.jpg.hashmh   : 00000000000000000000000000a1880000000000001a424800000000000012292800000000000016cfb0000000000000092bf00000000000001b7078000000000000000000000000
(10) Snapshot07.jpg.hashmh   : 00000000000000000000000000a1c80000000000001a4df0000000000000126db8000000000000124db00000000000001244d8000000000000137270000000000000000000000000

When the radius was 0.001 or 0.01, the calculation count was 9 and the result was only the 1 image that is exactly the same. The time was 0.000011 secs.
When the radius was 0.1, the calculation count was 738 and the result was 2 images. This took more time than the 9-calculation case. The newly added image (Ch Light10.jpg.hashmh) was this:
Ch Light10
When the radius was 0.3, the result was the same as for 0.1.
When the radius was 0.4, the calculation count was 897 and there were 11 results. The result images are:
Snapshot07, Snapshot04, Alaska Hitchhiker Skull (Moustache Hair Eyepatch), Alaska Hitchhiker Skull (Moustache Hair), Snapshot01, Snapshot05, Snapshot10pointcloud, Orx-logo, Metaball3

Test Result 2

This time I picked an image that has a white background and more similar images: Snapshot01.jpg.

$ ./MHHashTree /home/hosung/cdot/ccl/hashtest/all-images "Snapshot01.jpg.hashmh" 0.01
(*) Snapshot01.jpg.hashmh   : 00000000000000000000000000a1980000000000001b4308000000000000112928000000000000136ba80000000000000922400000000000000c0378000000000000000000000000
------------------Results 1 (21 calcs) (0.000073 secs)---------
(0) Snapshot01.jpg.hashmh   : 00000000000000000000000000a1980000000000001b4308000000000000112928000000000000136ba80000000000000922400000000000000c0378000000000000000000000000

$ ./MHHashTree /home/hosung/cdot/ccl/hashtest/all-images "Snapshot01.jpg.hashmh" 0.1
(*) Snapshot01.jpg.hashmh   : 00000000000000000000000000a1980000000000001b4308000000000000112928000000000000136ba80000000000000922400000000000000c0378000000000000000000000000
------------------Results 10 (152 calcs) (0.000435 secs)---------
(0) Snapshot06.jpg.hashmh   : 00000000000000000000000000000000000000000000afc00000000000001244d0900000000000124ff4000000000000040200000000000000000000000000000000000000000000
(1) Snapshot09.jpg.hashmh   : 00000000000000000000000000aee01000000000001642e37c00000000001244940000000000000b6db80000000000000d92d0000000000000088100000000000000000000000000
(2) Snapshot02.jpg.hashmh   : 00000000000000000000000000a1980000000000000929f80000000000001b63780000000000001b6b280000000000000922580000000000000882f0000000000000000000000000
(3) Snapshot05.jpg.hashmh   : 00000000000000000000000000afc80000000000001a4db0000000000000122da800000000000016cb680000000000001263200000000000001b72e0000000000000000000000000
(4) Snapshot03.jpg.hashmh   : 0000000000000000000000000020200000000000001253b00000000000001262500000000000001a62480000000000001b6b08000000000000040c00000000000000000000000000
(5) Snapshot01.jpg.hashmh   : 00000000000000000000000000a1980000000000001b4308000000000000112928000000000000136ba80000000000000922400000000000000c0378000000000000000000000000
(6) K-3D logo.jpg.hashmh   : 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
(7) Snapshot04.jpg.hashmh   : 00000000000000000000000000a1880000000000001a424800000000000012292800000000000016cfb0000000000000092bf00000000000001b7078000000000000000000000000
(8) Snapshot07.jpg.hashmh   : 00000000000000000000000000a1c80000000000001a4df0000000000000126db8000000000000124db00000000000001244d8000000000000137270000000000000000000000000
(9) Snapshot00.jpg.hashmh   : 00000000000000000000000000000000000000000000000000000000000016d998000000000000126fb0000000000000040380000000000000000000000000000000000000000000

When the radius was 0.01, the result was only the 1 exact match, after 21 calculations.
When the radius was 0.1, after 152 calculations, there were 10 similar results:
Snapshot07, Snapshot06, Snapshot05, Snapshot04, Snapshot03, Snapshot02, Snapshot01, Snapshot00, K-3D logo


  • When the radius was smaller than 0.01, the calculation count in the tree was only a few, and the result was exactly the same image.
  • When the radius was 0.1, the calculation count was greater, and the results were similar images.
  • When the radius was 0.4, the calculation count was almost the same as the number of samples.
  • The radius is the distance in the tree that expresses similarity; it is the same as the hamming distance.
  • The MH hash generates lots of 0s when the image contains a solid background colour. Therefore, the hash value of an image that is entirely black is all 0.
  • As for the BranchFactor, PathLength, and LeafCap parameters used to build the MVP tree, I used the default values: 2, 5, and 25 respectively. Tests with various values still need to be done.

by Hosung at June 10, 2015 08:22 PM

MVP Tree for similarity search

For several days, I analysed and implemented a C++ utility that interacts with perceptual hashes from the database. In this posting, I will give a general analysis of the MVP tree.

MVP Tree

The following two papers give details about the VP tree and MVP tree for similarity search:

“In vp-trees, at every node of the tree, a vantage point is chosen among the data points, and the
distances of this vantage point from all other points (the points that will be indexed below that node) are computed. Then, these points are sorted into an ordered list with respect to their distances from the vantage point. Next, the list is partitioned to create sublists of equal cardinality. The order of the tree corresponds to the number of partitions made. Each of these partitions keep the data points that fall into a spherical cut with inner and outer radii being the minimum and the maximum distances of these points from the vantage point. The mvp-tree behaves more cleverly in making use of the vantage-points by employing more than one at each level of the tree to increase the fanout of each node of the tree.” [Bozkaya & Ozsoyoglu 2]

Screenshot from 2015-06-10 10:39:47
[Bozkaya & Ozsoyoglu 9]

Screenshot from 2015-06-10 10:40:18
[Bozkaya & Ozsoyoglu 10]

MVP Tree implementation

The source code that I used is the implementation introduced on this page.

The major APIs are as follows.

MVPTree* mvptree_alloc(MVPTree *tree,CmpFunc distance, unsigned int bf,unsigned int p,unsigned int k);
typedef float (*CmpFunc)(MVPDP *pointA, MVPDP *pointB);

mvptree_alloc allocates memory to store the MVP-tree structure. CmpFunc is the comparison function used to calculate the hamming distance between the two hash values inside MVPDP structs, both when a new data point is added and when a search is performed.

MVPError mvptree_add(MVPTree *tree, MVPDP **points, unsigned int nbpoints);

This function adds data points to the tree; it accepts either an array of data points or a single one. While nodes are added, the tree is formed by comparisons using CmpFunc.

MVPError mvptree_write(MVPTree *tree, const char *filename, int mode);
MVPTree* mvptree_read(const char *filename, CmpFunc fnc, int branchfactor, int pathlength, int leafcapacity, MVPError *error);

Using these functions, the tree structure can be written to a file and loaded later without building the tree again.

MVPDP** mvptree_retrieve(MVPTree *tree, MVPDP *target, unsigned int knearest, float radius, unsigned int *nbresults, MVPError *error);

This function retrieves similar hashes within the given radius. When the radius is big, the comparison is performed more times.

Sample program results

Using 100 samples of 10-byte random binaries, with the radius changed from 0.01 to 3.0 and then 5.0, the results were:

radius : 0.01
------------------Results 1 (7 calcs)---------
(0) point101

------------------Results 3 (18 calcs)---------
(0) point108
(1) point101
(2) point104

------------------Results 10 (24 calcs)---------
(0) point102
(1) point103
(2) point105
(3) point107
(4) point108
(5) point101
(6) point104
(7) point106
(8) point109
(9) point110

When the radius was 0.01, there were 7 calculations while going through the tree. When the radius was 5.0, there were 24 calculations. When I changed the size of the samples from 10 bytes to 72 bytes (the size of an MH hash), the comparison count was more than the number of samples.


The sample program generates random values instead of using real image hashes. Since random values hardly have any similarity between them, when the radius was less than 0.1 the only result was the exactly matching value. To get more results, the radius had to be at least 3; in that case the calculation count was almost the same as the number of values.
When I used real image hash values, the search results were quite impressive. That will be covered in the next posting.

by Hosung at June 10, 2015 03:36 PM

Barbara deGraaf

What’s in a GUI

In this post I am going to talk about adding a GUI to a test scene so that a user can change values. I meant to put this up earlier but got sidetracked watching Hannibal’s season 3 premiere, which has some of the most breathtaking cinematography I have seen; if you want to see what sort of results a cinematographer can create, that is the show to watch.

So back to the GUI: for THREE.js there is a library called dat.GUI, which you can grab from its Google Code page. Within your JavaScript file you can start making the GUI with:

  var gui = new dat.GUI();

I also recommend creating an object to hold all the values that will be used in the GUI, in this case:

var params = {
  focallen: 100,
  // ...(all other camera and lens properties)
};

If, after you have made the GUI, you want to add folders, you can add them with:

var camfolder = gui.addFolder('Camera');

var lenfolder = …

After you make all the folders you want, you can start adding variables to a folder with:

var foc = lenfolder.add(params, 'focallen');

The dat.GUI library will add a text box or a slider depending on whether the value in params was text or a number. For number values we can give the user a lower and upper limit, and change the slider’s increment, by using this line instead:

var foc = lenfolder.add(params, 'focallen', 10, 200).step(4).name('focal length');

The other type of input was a select menu for the camera/lens. To build it, the first step is to store the information about the camera/lens in a JSON file. After having the file we can use jQuery:


// The inner workings may change depending on how the JSON file was set up,
// but you are going to use $.each to loop through the JSON file, getting each
// entity and grabbing the value you want. In this example I looped and grabbed
// the format value and then added it to an array of cameras (listcams).

After looping with $.each we can use this list of camera formats as options for the menu with:

var cam = camfolder.add(params, 'format', listcams);

After getting the GUI working, we want it to do something when we change values, so we can use dat.GUI’s onChange handler:

foc.onChange(function(value) {
  params.focallen = value;
});

We can do this for all values to continuously update params. If you are running into issues with storing the values gathered from the JSON file, just remember that jQuery is async, so do the onChange work within the $.getJSON callback above.

If you want to add a button to the GUI, the best way to do that is:

var obj = {submit: function() {
  // logic that occurs when pressed goes here
  // I did calculations of hyperfocal distance and depth of field here
}};

gui.add(obj, 'submit');

So this is basically all we need in terms of making and changing the GUI. The next step my partner and I worked on was depth of field using shaders, so in the next blog post I will talk about shaders before going into depth of field in detail.

Have a good night everyone.



by barbaradegraafsoftware at June 10, 2015 01:58 AM

June 09, 2015

Hong Zhan Huang

OSTEP – The City of ARMs – Tools of the trade 1: iperf

In the short time that I’ve been working on the OSTEP Team at CDOT there’s been much to take in and learn. In these Tools of the trade series of posts I’ll be describing a tool I have been making use of in my work.


iperf is a network performance measuring tool that I have been using to test the Ethernet ports of some of our ARM machines. I required a tool that could measure the maximum performance of these ports while bypassing intermediate media that could obfuscate the results (such as the write speed of a hard drive). iperf seemed to meet all my needs and more.

To quote the features of iperf from their official site:

  • TCP
    • Measure bandwidth
    • Report MSS/MTU size and observed read sizes.
    • Support for TCP window size via socket buffers.
    • Multi-threaded if pthreads or Win32 threads are available. Client and server can have multiple simultaneous connections.
  • UDP
    • Client can create UDP streams of specified bandwidth.
    • Measure packet loss
    • Measure delay jitter
    • Multicast capable
    • Multi-threaded if pthreads are available. Client and server can have multiple simultaneous connections. (This doesn’t work in Windows.)

There’s quite a bit that iperf is able to do, but for my purposes the TCP functionality with one client and one server suits me fine.

Using iperf

As alluded to earlier, iperf operates in a client/server model where the server artificially serves data to the client, and from that interaction iperf measures the transfer performance between the two machines.

The steps to start up a basic testing process are as follows:

  1. Start iperf on the machine that will act as the server with: iperf -s
  2. On the other machine, start it up as the client with: iperf -c {IP of the Server}

And that’s it for basic operation! When that run completes you will see the test results on both the server and client machines, looking something like:

Server listening on TCP port 5001
TCP window size: 8.00 KByte (default)
[852] local port 5001 connected with port 33453
[ ID]   Interval          Transfer       Bandwidth
[852]   0.0-10.6 sec   1.26 MBytes   1.03 Mbits/sec

Again, this is the most basic usage of iperf, which uses the default window size, port, protocol (TCP), unit of measurement (Mbits/sec), and other options. For my use I only needed the -f option, which lets the user choose the unit the results are formatted in (in my case -f g, which gives results in Gbits/sec). If you’d like to access iperf’s other features, this guide is what I read to get an understanding of how to operate the tool.

To make my life a little easier I wrote one bash script to automate the process of doing the iperf tests and recording their results as well as another to more easily parse the resulting logs.

test script:


#!/bin/bash

echo "Beginning tests"

if [ "$1" = "" ] || [ "$2" = "" ]; then
  echo "Requires IP of the iperf server and output file name."
  exit 1
fi

touch ./$2
for i in `seq 1 10`; do
  iperf -c "$1" -f g >> $2
done

echo "Finished the tests"

The test script is meant to be used on the client machine as follows: test {IP of Server} {Filename of log}

parse script


#!/bin/bash

echo "The file being parsed is $1:"

grep "Gbits/sec" $1

AVG=$(grep -o "[0-9].[0-9][0-9] Gbits/sec" $1 |
  tr -d "Gbits/sec" |
  awk '{ SUM += $1 } END { print SUM / 10 }')  # 10 runs, matching the test script

MAX=$(grep -o "[0-9].[0-9][0-9] Gbits/sec" $1 |
  tr -d "Gbits/sec" | sort -n | tail -1)

MIN=$(grep -o "[0-9].[0-9][0-9] Gbits/sec" $1 |
  tr -d "Gbits/sec" | sort -n | head -1)

echo "The average rate of transfer was:  $AVG"
echo "The max rate was: $MAX"
echo "The min rate was: $MIN"

echo "Finished parsing."

The parse script is again used on the client in the following manner: parse {Filename of log}
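As a quick sanity check of the extraction pipeline the parse script uses, here it is run against two fabricated iperf-style log lines (the numbers are made up), averaging over the sample count rather than a hardcoded 10:

```shell
# Fabricated iperf-style output for demonstration only
printf '%s\n' \
  "[852]   0.0-10.0 sec  10.9 GBytes  9.41 Gbits/sec" \
  "[852]   0.0-10.0 sec  10.9 GBytes  9.39 Gbits/sec" > /tmp/iperf_demo.log

# Extract the rates and average them; prints 9.4
grep -o "[0-9].[0-9][0-9] Gbits/sec" /tmp/iperf_demo.log |
  tr -d "Gbits/sec" |
  awk '{ SUM += $1; N++ } END { print SUM / N }'
```

Note that tr -d deletes the characters G, b, i, t, s, /, e, and c individually, which happens to leave the numeric rate untouched.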

And that about wraps up iperf in brief. The only other thing to note is that you may need to open the relevant firewall ports for iperf to work.

by hzhuang3 at June 09, 2015 06:00 PM

June 08, 2015

Barbara deGraaf

First up a test scene

Most feedback I got from the last post was that it was too mathy, and I promise this one will have 100% less math.

The first thing done in the project was to make a test scene to work with. This lets us try different techniques and see if the outcome is as expected.

The first part of making the test scene was to make walls and a ground. Using the box geometry or plane geometry in THREE.js, it is very easy to make a wall or ground of the size wanted. Adding all the walls and the ground to a single Object3D lets us move the whole scene around if we want the walls and ground in a different place.

To help measure units in the scene, a black-and-white checkerboard pattern was added to the walls and ground. The best way to do this is to take a small checkerboard texture, set texture.wrapS and texture.wrapT to THREE.RepeatWrapping, and then use texture.repeat.set(x, x), where x is half the length/width of the geometry used above. These three lines cause the small checkerboard texture to tile across the whole wall/ground.

After setting up the basic walls and ground, the next part is to add some detailed objects to the scene. Instead of boxes and spheres we need something with more definition, and I decided to use humanoid models. There are a couple of different ways to add external models to the scene. The way I did it was to use the MakeHuman software, which allows you to easily make models and use them under the CC0 license. Exporting the created model to OBJ/MTL files allows easy use in THREE.js. You can also use other modelling tools to make an object and export it to the file type you want.

To load the model, THREE.js has an OBJ/MTL loader. The THREE.js website has excellent documentation on how to use it, so check that if you need to. After the model is loaded you can make as many meshes of it as you want to put in the scene. The models can easily be scaled for accurate dimensions: by defining 1 THREE.js unit as 1 foot, we can resize the models. Using a box of dimensions 6x2x1, I can resize the human model to fit inside the box and therefore be accurate. I also added all the humans to a single Object3D so that they can all be moved at once. For my scene I ended up putting 5 human models in, spaced evenly apart.

With these elements we have a scene that can be customized for any dimensions or distances we may want to test depth of field or field of view.

I was going to talk about adding the GUI here but I think instead I will make a separate post talking about the GUI so I can mention some specific points in creating it. So look forward to that next.


by barbaradegraafsoftware at June 08, 2015 01:57 AM

June 04, 2015

Hosung Hwang

Eclipse CDT Cross GCC Linker Option


I am testing an algorithm called the MVP tree. The source code is written in C and uses a Makefile. To analyse it I wanted to see the actual values in the nodes and the memory state while the tree is being formed. Debugging with gdb in the console was painful, so I moved it to Eclipse CDT. However, in the linking process, Eclipse showed the following errors:

MHMvcTree.c:50: undefined reference to `exp'
MHMvcTree.c:106: undefined reference to `sqrt'


The fix: add the -lm linker option under
Project -> Properties -> Cross GCC linker -> Miscellaneous -> Other objects

Screenshot from 2015-06-04 16:04:33

Now I am happy.

by Hosung at June 04, 2015 08:15 PM

June 02, 2015

Anna Fatsevych

Wiki Media Commons

Wikimedia Commons is a media file repository containing millions of images. I have been working with their Wikimedia image grabs to get the author, title, and licensing information.

The files gathered from the Wiki Commons grabs come with an XML file that provides the author information in a wiki template ({{Information}}, to be exact), along with the {{Credit Line}} template.

I have been looking into a few Wikimedia template parsers; many of them are not updated, and many just parse the text into HTML or wikitext, ignoring the various available templates, which unfortunately are exactly what I need. My goal is to get the information I need from the XML files I already have without calling the Wikimedia API on each image, i.e. without using the network. Here is the list of alternate parsers.

I am currently writing one in PHP. So far I have used wiki_parser.php to try to parse the Information template into key/value pairs, but it looks like it only succeeds in parsing categories, so I will have to write a parser myself. Here is my code so far:


$xmlfile = 'test.xml';
$fp = fopen($xmlfile, 'r');
$xmldata = fread($fp, filesize($xmlfile));

$xml = simplexml_load_string($xmldata);

// count number of revisions, get the latest one;
$numrevisions = count($xml->page->revision)-1;

// get the text part of the Information - isolate text to parse;
$text = $xml->page->revision[$numrevisions]->text;
$try = (string)$text[0];

// so far successful at parsing "categories"
$wikipedia_syntax_parser = new Jungle_WikiSyntax_Parser($try);

I use SimpleXML to parse the XML file into elements, and to isolate the “Information” part, which is in a text tag.

Here is the output:

|Description=[[:en:Ford Galaxie|1967 Ford Galaxie GT]] photographed in [[:en:Laval, Quebec|Laval]], [[:en:Quebec|Quebec]], [[:en:Canada|Canada]] at Auto classique VACM Laval.[[Category:Auto classique VACM Laval 2013]]
[[Category:Ford Galaxie]]
[[Category:1967 Ford automobiles]]
|Permission=All Rights Released.
== {{int:license-header}} ==

To be continued with the parser code,


by anna at June 02, 2015 08:24 PM

Hosung Hwang

Performance Test of pHash : MH Image Hash

In another posting, I did a performance test of the DCT image hash in pHash. Today, I did the same test for the MH image hash.

The test machine is the same one I used before, and this time the test was performed only on the internal SSD. In a benchmark using the dd command, its read speed was 440 MB/s.

The samples are 900 JPG files with sizes varying from 1.3 KB to 50.4 MB; the total size of the sample files was 805.4 MB. The function for the MH image hash is ph_mh_imagehash(). This function allocates a memory block for the output hash, which is 72 bytes, much bigger than the DCT hash result (8 bytes). I wrote another C++ program that hashes all images in a given directory. To measure only hashing time, the results were neither printed nor stored.

Test results are :

$ for F in {0..7}; do ./phashbenchmh /home/hosung/cdot/ccl/hashtest/all-images; done
Elapsed secs to hashing 900 files : 246.779554
Elapsed secs to hashing 900 files : 242.660379
Elapsed secs to hashing 900 files : 242.693598
Elapsed secs to hashing 900 files : 242.494878
Elapsed secs to hashing 900 files : 243.201334
Elapsed secs to hashing 900 files : 242.948810
Elapsed secs to hashing 900 files : 243.554532
Elapsed secs to hashing 900 files : 243.004734

Interestingly, MH hashing was faster than DCT hashing: DCT spent around 291 seconds on the internal SSD, whereas MH spent around 243 seconds.

by Hosung at June 02, 2015 08:20 PM

MH Image Hash in pHash 2 : test result

900 Sample image test

I used the same samples that were used in Andrew Smith’s test. Although the benchmark from the pHash team says that images with a hamming distance smaller than 0.3 are similar, in the range of 0.2 ~ 0.3 there are many false positive results. There were 175 pairs of matches below 0.2, and 63 below 0.1; however, many of those look genuinely similar. There were also some strange results in the range up to 0.2.

Firstly, an image filled with black matches some simple images.

K-3D logo

The hamming distance between this black image and the following 10 images was less than 0.1.

Snapshot07 Snapshot10pointcloud Snapshot09 Snapshot06

Snapshot05 Snapshot04 Snapshot03 Snapshot02

Snapshot01 Snapshot00

I have no idea why the black image matches so many images. In the 0.2 range, there are even more matches.

The following image matches seem to be true positives, although that could be debated.

Selection_046 Selection_047 Selection_048 Selection_049 Selection_050 Selection_051 Selection_053 Selection_055 Selection_060

In terms of false positives in the range 0.0 ~ 0.1, apart from the black image, all of the matches look like the following.


The following image matches seem to be false positives in the range of 0.1 ~ 0.2.

Selection_062Selection_059 Selection_058 Selection_057 Selection_056 Selection_054 Selection_052

What these matches have in common is that the images contain a large area of solid background colour.

Font image test

I used the same image set as in the DCT hash test in my previous posting.

Rotation 2 degrees

Intra Distance
Inter Distance

Rotation 4 degrees

Intra Distance
Inter Distance

Rotation 45 degrees

Intra Distance
Inter Distance

In terms of rotation, when the angle is up to 4 degrees, the images show hamming distances from 0.1 to 0.35. The inter-distance comparison results are around 0.2 ~ 0.5. When the angle was 45 degrees, the intra-distance range was the same as the inter-distance range.

Adding a dot

Intra Distance

Interestingly, when the image had an additional dot, all of the hamming distances were less than 0.1.

Moving 2%

Intra Distance

Moving 5%

Intra Distance

The result shows that the MH hash also cannot match images whose content has been shifted.

Arial Bold Font

Intra Distance

Arial Italic Font

Intra Distance

Georgia Font

Intra Distance

Times New Roman Font

Intra Distance

In terms of font change, the distances were in the range of 0.1 ~ 0.5.


  • The sample image test shows some false positives for simple images.
  • When a sample image is complex or has no solid background colour, there were no false positives.
  • As for the font images, the MH hash seems to tolerate rotation better than the DCT hash.
  • Adding a little dot does not cause a big hamming distance, in contrast to the DCT hash.

by Hosung at June 02, 2015 06:06 PM

June 01, 2015

Hosung Hwang

MH Image Hash in pHash

So far, everything we have tested with pHash has used the DCT (Discrete Cosine Transform) image hash algorithm. According to the pHash design website, the MH image hash method was recently added to pHash and gives better results.

Difference between MH and DCT

  • The MH hash is 72 bytes; the DCT hash is 8 bytes
  • MH takes more time on hashing and on computing the hamming distance than DCT
    -> A speed test will be performed later
  • MH is stronger against attacks
  • MH can be used with the MVP tree indexing structure for fast retrieval
  • The hamming distance for MH is calculated by binary quantization, whereas DCT uses XORing; binary quantization must be slower than XORing

This chart shows the benchmark result for the MH hash. According to it, when the hamming distance is bigger than 0.3 the images are different, and when it is smaller than 0.3 the images are similar.

This chart shows the result of DCT hash.

MH Hash implementation

I wrote two simple C++ programs: one to hash and one to calculate the hamming distance.

phashmp.cpp source code

int alpha = 2;
int level = 1;
const char *filename = argv[1];
int hashlen = 0;
uint8_t* hash1 = ph_mh_imagehash(filename, hashlen, alpha, level);
for (int i = 0; i < hashlen; i++)
    printf("%02x", hash1[i]);

The default values of the alpha and level parameters are 2 and 1 respectively, and I used the defaults. The hashlen parameter is a reference that receives the size of the generated hash; on success it is always 72 (bytes). The return value is a byte array (uint8_t is a byte) holding the 72-byte hash.

hammingmh.cpp source code

#include <cstdio>
#include <pHash.h>
#define HASHLEN 72

int main(int argc, char *argv[])
{
    uint8_t hash1[HASHLEN], hash2[HASHLEN];
    unsigned int b;  // sscanf %02x wants an unsigned int, not a uint8_t
    for (int i = 0; i < HASHLEN; i++) {
        sscanf(argv[1] + 2*i, "%02x", &b); hash1[i] = (uint8_t)b;
        sscanf(argv[2] + 2*i, "%02x", &b); hash2[i] = (uint8_t)b;
    }
    double dist = ph_hammingdistance2(hash1, HASHLEN, hash2, HASHLEN);
    printf("%lf\n", dist);
    return 0;
}

In the case of MH hash, the hamming distance is calculated by binary quantization and returned as a double value.

Test method

I wrote some bash scripts to generate hashes for all jpg files and to gather all hamming distances into one file, then sorted them by hamming distance. To compare the results, I wrote two different bash scripts.

while read line; do
    num=$(echo $line | cut -d , -f 1)
    filename=$(echo $line | cut -d , -f 2)
    filename2=$(echo $line | cut -d , -f 3)
    eog "$filename" "$filename2"
done < $1

This script shows the two images of each pair in the GNOME image viewer, line by line.

while read line; do
    num=$(echo $line | cut -d , -f 1)
    filename=$(echo $line | cut -d , -f 2)
    filename2=$(echo $line | cut -d , -f 3)
    base1=$(basename "$filename")
    base2=$(basename "$filename2")
    r=$(( $RANDOM % 1000000 ))
    # $linkpath1/$linkpath2: destination names built from $num, $r, and $base1/$base2
    ln -s "$filename" "$linkpath1"
    ln -s "$filename2" "$linkpath2"
done < $1

This script makes soft links of all images in another directory, named by hamming distance, a random value, and the file name, so I can see the image pairs side by side in an image viewer like XnView.

Test result will be posted soon.

by Hosung at June 01, 2015 09:45 PM

May 29, 2015

David Humphrey

Messing with MessageChannel

We're getting close to being able to ship a beta release of our work porting Brackets to the browser. I'll spend a bunch of time blogging about it when we do, and detail some of the interesting problems we solved along the way. Today I wanted to talk about a patch I wrote this week and what I learned in the process, specifically, using MessageChannel for cross-origin data sharing.

Brackets needs a POSIX filesystem, which is why we spent so much time on filer.js, which is exactly that. Filer stores filesystem nodes and data blocks in IndexedDB (or WebSQL on older browsers). Since filesystem data is stored per-origin, and shared across tabs/windows, we have to be careful when building an app that lets a user write arbitrary HTML, CSS, and JavaScript that is then run in the page (did I mention we've built a complete web server and browser on top of filer.js? Because it's awesome!).

Our situation isn't that unique: we want to allow potentially dangerous script from the user to get published using our web app; but we need isolation between the web app and the code editor and "browser" that's rendering the content in the editor and filesystem. We do this by isolating the hosting web app from the editor/browser portion using an iframe and separate origins.

Which leads me back to the problem of cross-origin data sharing and MessageChannel. We need access to the filesystem data in the hosting app, so that a logged in user can publish their code to a server. Since the hosted app and the editor iframe run on different origins, we have to somehow allow one to access the data in the other.

Our current solution (we're still testing, but so far it looks good) is to put the filesystem (i.e., IndexedDB database) in the hosting app, and use a MessageChannel to proxy calls to the filesystem from the editor iframe. This is fairly straightforward, since all filesystem operations were already async.

Before this week, I'd only read about MessageChannel, but never really played with it. I found it mostly easy to use, but with a few gotchas. At first glance it looks a lot like postMessage between windows. What's different is that you don't have to validate origins on every call. Instead, a MessageChannel exposes two MessagePort objects: one is held onto by the initiating script; the other is transferred to the remote script.

I think this initial "handshake" is one of the harder things to get your head around when you begin using this approach. To start using a MessageChannel, you first have to do a regular postMessage in order to get the second MessagePort over to the remote script. Furthermore, you need to do it using the often overlooked third argument to postMessage, which lets you include Transferable objects. These objects get transferred (i.e., their ownership switches to the remote execution context).

In code you're doing something like this:

// In the hosting app's js
var channel = new MessageChannel();
var port = channel.port1;

// Wait until the iframe is loaded, via event or some postMessage
// setup, then post to the iframe, indicating that you're
// passing (i.e., transferring) the second port over which
// future communication will happen.
iframe.contentWindow.postMessage("here's your port...", "*", [channel.port2]);

// Now wire the "local" port so we can get events from the iframe
function onMessage(e) {
  var data = e.data;
  // do something with data passed by remote
}
port.addEventListener("message", onMessage, false);

// And, since we used addEventListener vs. onmessage, call start()
// (see the MessagePort.start() docs)
port.start();

// Send some data to the remote end.
var data = { /* ... */ };
port.postMessage(data);

I'm using a window and iframe, but you could also use a worker (or your iframe could pass along to its worker, etc). On the other end, you do something like this:

// In the remote iframe's js

var port;

// Let the remote side know we're ready to receive the port
parent.postMessage("send me the port, please", "*");

// Wait for a response, then wire the port for `message` events
function receivePort(e) {
  removeEventListener("message", receivePort, false);

  if (e.data === "here's your port...") {
    port = e.ports[0];

    function onMessage(e) {
      var data = e.data;
      // do something with data passed by remote
    }

    port.addEventListener("message", onMessage, false);
    // Make sure you call start() if you use addEventListener
    port.start();
  }
}
addEventListener("message", receivePort, false);

// Send some data to the other end
var data = { /* ... */ };
port.postMessage(data);

Simple, right? It's mostly that easy, but here's the fine print:

  • It works today in every modern browser except IE 9 and Firefox, where it's awaiting final review and behind a feature pref. I ended up using a slightly modified version of MessageChannel.js as a polyfill. (We need this to land in Mozilla!)
  • You have to be careful with event handling on the ports, since using addEventListener requires an explicit call to start(), which onmessage doesn't. It's documented, but I know I wasted too much time on that one, so be warned.
  • You can safely pass all manner of data across the channel, except for things like functions, and you can use Transferables once again, for things that you want to ship wholesale across to the remote side.
  • Trying to transfer an ArrayBuffer via postMessage doesn't work right now in Blink.

I was extremely pleased to find that I could adapt our filesystem in roughly a day to work across origins, without losing a ton of performance. I'd highly recommend looking at MessageChannels when you have a similar problem to solve.

by David Humphrey at May 29, 2015 08:02 PM

Justin Flowers

Starting out with Ansible in CentOS6

Ansible is an incredibly powerful automation tool. It allows you to connect to a VM and control configuration and installation of programs simply. Here's what I did to get it working for the first time:

Step 1: Get client machines installed with basic requirements

Important steps here are to make sure that:

  • You have configured your SSH RSA keys for the accounts which will be connecting (check out here for a great tutorial)
  • Your client machines for Ansible have Python installed
  • Your client machines for Ansible have libselinux-python installed
  • You have a supported OS installed for the Ansible control machine

If you can SSH to your machine without using a password then you should be fine with the RSA keys here.

Step 2: Install Ansible on control machine

If you’re on Fedora, you likely have the ansible package in the repositories already. Otherwise, you can install Ansible after installing epel-release, with these commands:

sudo yum install epel-release
sudo yum install ansible

Step 3: Configure Ansible hosts

If you open up /etc/ansible/hosts you can add and modify groups of hosts. There are many options for configuration in this file, but suffice it to say you can declare a group with square brackets and then write either hostnames or IPs below the square brackets to add machines to it. For example, I defined an IP on my local host-only vbox network to be in the logstash-host group with:
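The resulting entry has this shape (the address below is an illustrative host-only IP, not necessarily the one from my setup):

```ini
[logstash-host]
192.168.56.101
```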


Step 4: Write a playbook and test

This is the hard part. There are many examples on the internet of how to write this kind of file, but essentially you can see it as defining a group of hosts to work on, the user to connect as remotely, and then a list of the tasks the playbook should perform.

Each task is made up of a name and command. The name is essentially what will be shown to you when it attempts to perform the given command. The command is a specific action to be performed by Ansible. For example, one of the tasks I used in my playbook was this:

 - name: send logstash input configuration
   copy: src=~/elk/logstash-input.conf dest=/etc/logstash/conf.d/

This command copied the file logstash-input.conf on the control machine to /etc/logstash/conf.d/ on the client machine. If you need help finding which command to use or how to use it, googling your issue followed by "ansible" is usually enough to get a StackOverflow answer or take you right to the Ansible documentation for what you need.
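Putting the pieces together, a minimal playbook for this task might look like the following sketch (the hosts group and file paths reuse the ones above; the remote user is an assumption):

```yaml
---
- hosts: logstash-host
  remote_user: root
  tasks:
    - name: send logstash input configuration
      copy: src=~/elk/logstash-input.conf dest=/etc/logstash/conf.d/
```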

Finally, to test, simply run:

ansible-playbook logstash_playbook.yml

Substituting logstash_playbook.yml for the name of the playbook you made.

by justin at May 29, 2015 06:57 PM

Mohamed Baig

How to setup Hypothesis locally for development on Arch Linux

What is Hypothesis Hypothesis is a web annotator. It allows users to create notes on any web page and PDF. As it says on the project website: Our team is building an open platform for discussion on the web. It leverages annotation to enable sentence-level critique or note-taking on top of news, blogs, scientific articles, […]

by mbbaig at May 29, 2015 05:33 AM

Barbara deGraaf

Everything you wanted to know about cameras

In this post I will detail the main points related to the image that is produced in a camera.

Without going into too much detail on how cameras and lenses work there are three main things that the final image can differ in:

1) Field of view

The field of view is how much of the area in front of you will be in the final image taken by the camera. While field of view and angle of view tend to be used interchangeably, they are different: field of view refers to the distances in real life that are captured in the final image, while angle of view refers to the angle from top to bottom that extends out from the camera.

To find the angle of view you need to know the focal length of the lens and the size of the film or sensor used in the camera. The following image by Moxfyre at English Wikipedia (under CC BY-SA 3.0) illustrates this concept best:

Wiki page

Optics of a Camera

In this, S1 is the distance from the lens to the object, S2 is the distance from the lens to the sensor or film, and F is the focal length of the lens. You can see that as you increase the focal length while keeping the film size fixed, the angle gets smaller. If you keep the focal length the same but increase the film size, the angle gets bigger. The equation for the angle of view is easy enough to derive with trigonometry from the above (with the assumption that S2 = F, which is not valid for macro situations but is valid for distant objects), and is

α = 2*arctan(d/2f)                                                                                                  (1)

The above can be looked at as if top down or from the side, in fact the angle of view tends to be different from horizontal and vertical as the film size is different for these dimensions.
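As a quick sanity check of equation (1), with illustrative numbers — a 50 mm lens on a sensor 36 mm wide (a "full-frame" width):

```python
import math

# Worked example for equation (1): alpha = 2 * arctan(d / 2f).
d = 36.0  # film/sensor width in mm (illustrative)
f = 50.0  # focal length in mm (illustrative)

alpha = 2 * math.atan(d / (2 * f))
print(round(math.degrees(alpha), 1))  # 39.6 (degrees, horizontal)
```

Doubling the focal length to 100 mm roughly halves the angle (about 20.4 degrees), matching the behaviour described above.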

Therefore field of view depends on the film/sensor size, a property of the camera chosen, and the focal length, which is a property of the lens chosen.

2) Depth of field

Depth of field may be a little harder to understand; it refers to the area that will be sharp, or acceptably sharp, in the final image. The first thing to do is find the hyperfocal distance. When focused at this distance, objects will be in focus from half the hyperfocal distance up to infinity (past the half distance, everything is in focus).

For example, if the hyperfocal distance is 20 m and you focus on an object 25 m away, the image will be in focus from 10 m to infinity. If you focus on something 15 m away (< H), you have a finite depth of field, which you will have to calculate.

First, the equation for the hyperfocal distance; at the risk of being too mathy, I will leave out the derivation (which can be found with geometry):

H = (F^2)/(N*C)                                                                                                      (2)

Where F is the focal length, N is the f-stop, and C is the circle of confusion. The f-stop is the aperture setting: the ratio of the focal length to the diameter of the entrance pupil. The circle of confusion is a property of the lens; it is where light will not come to perfect focus.

After finding the hyperfocal distance the near and far depth limits can be found after knowing the focus distance (which is something the cinematographer picks.)

DNear = H*S / (H+(S-F))                                                                             (3)

DFar = H*S / (H-(S-F))                                                                              (4)

Where H is the hyperfocal distance and S is the focus distance.
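A worked example of equations (2)–(4), with illustrative values (50 mm lens at f/8, a 0.03 mm circle of confusion, focused at 5 m; everything in mm):

```python
# Hyperfocal distance and depth-of-field limits, equations (2)-(4).
F = 50.0    # focal length, mm (illustrative)
N = 8.0     # f-stop (illustrative)
C = 0.03    # circle of confusion, mm (illustrative)
S = 5000.0  # focus distance, mm (5 m, so S < H here)

H = F**2 / (N * C)              # eq. (2): ~10417 mm, about 10.4 m
d_near = H * S / (H + (S - F))  # eq. (3)
d_far = H * S / (H - (S - F))   # eq. (4)
print(round(d_near), round(d_far))  # 3389 9527 -- sharp from ~3.4 m to ~9.5 m
```

Since S < H, the depth of field is finite, as described above; focusing at or beyond H would push the far limit to infinity.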

For a good explanation of how depth of focus works go to this page.

Therefore the depth of field depends on the focal length and circle of confusion (both properties of the lens), the aperture (chosen by the user), and the focus/subject distance (also chosen by the user).

3) Exposure

This is the last thing I will mention, but I will not go into detail. Exposure refers to the amount of light entering the camera and how bright the picture will be. It depends on many things, such as aperture, shutter speed, and lights placed in the scene.

Therefore the main things a user should be able to pick are the type of camera, type of lens, aperture setting, focus distance, and perhaps the focal length of the lens if it is a zoom lens.

Stay tuned for the adventure of making a test scene to use and verify our cameras in.

by barbaradegraafsoftware at May 29, 2015 03:37 AM

May 28, 2015

Anna Fatsevych

MySQL Tests Continued

The Wikimedia Commons image dumps include files in .svg, .pdf, .ogg, and .djvu formats. SVG is an image format, whereas the .pdf's were mostly books, the .ogg's were video/audio, and the .djvu's were scanned books/texts.

Hashing these with pHash and BlockHash was a challenge, because they do not always throw an error (i.e. when trying to hash a pdf), so some issues took longer to discover; the others (svg, ogg, and djvu) cannot be hashed at all.

Dealing with file names containing characters from various languages raised some exceptions – the php function addcslashes($string, "list of characters to escape") comes in handy, as well as the functions mentioned in the previous post.

I ran my php program on 6,953 files, of which 6,308 were processed (the rest failed due to format or file-naming convention errors). It took 2.41 hours to pHash and blockHash each of the 6,308 images and store the values in the database. Hashing took most of the time, as the dummy-data results averaged 1,000 INSERTs per minute.

I ran SELECT tests on my 6,008 records and discovered that 'SELECT *' and 'SELECT ... WHERE' based on hashes were quite fast, with the longest query taking 3.2 seconds for select all. Granted, there were many duplicates in this database (the same hash was applied to erroneous files), which will not be the case later.

Part Two (May 29, 2015):

I have run more tests on MySQL. Here is an overview:


To time the ‘SELECT *’ statement I am running this command on the shell:

time mysql --user=root --password='password' myDB < selectall.sql > ~/temp

In “selectall.sql” I have the following statement:

select * from MyTable;

And for the 105,000 entries here is the time

real    0m1.069s
user    0m0.887s
sys     0m0.084s

When timed in my PHP code, SELECT * and SELECT on phash or bhash took 0.2 and 0.1 seconds, respectively.

Here is a snippet of my PHP code showing how I timed the queries:

$time_pre = microtime(true);

$sql = "SELECT * FROM IMG where phash=15980123629327812403";
$result = $conn->query($sql);

if ($result->num_rows > 0) {
    // you can output data of each row
    while ($row = $result->fetch_assoc()) {
    }
} else {
    echo "0 results";
}

$time_post = microtime(true);
$exec_time = $time_post - $time_pre;

To conclude here are the results of my queries, without printing to screen:

INSERT times were constant at approximately 16 inserts per second;
SELECT * was timed at 0.01 seconds (system) for 105,000 records, and at worst 3.02 seconds for 6,308 records (with BLOBs);
SELECT WHERE (search on hash value) averaged 0.04 seconds in general. There were many duplicate hashes, as generating unique values proved very time consuming; I am planning to run a SELECT test on unique values in the near future.
VARCHAR(73) and CHAR(73) were both tested for efficiency, and there was no difference in the 5,000-record tests.

More to come on this topic,


by anna at May 28, 2015 07:17 PM

Hosung Hwang

Performance test of pHash

I performed a performance test of a perceptual hash algorithm: pHash.

Following is the specification of the test machine.

OS : Ubuntu 14.04 LTS
Processor : Intel Core i7-3517U CPU @ 1.90GHz x 4
OS type 64-bit
Memory 7.7GiB
Disk : 117.6GB SSD

Tests were performed from the internal SSD (mSATA) drive and an external USB hard drive.

Read/Write benchmarking

Before doing the actual hashing test, I performed a simple read/write benchmark using the dd command. It writes and reads an 8.2 GB file:

time sh -c "dd if=/dev/zero of=/media/hosung/BACKUP/test.tmp bs=4k count=2000000 && sync"
time sh -c "dd if=/media/hosung/BACKUP/test.tmp of=/dev/null bs=4k"

Each job was performed 5 times. The following are the average values.

Condition : Speed
Internal SSD Write : 245 MB/s
Internal SSD Read : 440 MB/s
External HDD Write through USB 3.0 : 109 MB/s
External HDD Read through USB 3.0 : 122 MB/s
External HDD Write through USB 2.0 : 109 MB/s
External HDD Read through USB 2.0 : 129 MB/s

USB 3.0 reading speed was slightly faster than USB 2.0, and the internal SSD was 4 times faster than the USB drives.

pHash performance test

The sample images are 900 jpg files, varying in size from 1.3 KB to 50.4 MB; the full set of sample files is 805.4 MB. For the test, I wrote a C++ program that extracts hash values using the ph_dct_imagehash() function in pHash from all jpg images in a directory. The reason for writing a dedicated program is to avoid the process-startup time incurred when a shell script is used. Every test was performed 8 times after rebooting.

Internal SSD

hosung@hosung-Spectre:~/cdot/ccl/PerHash/pHash/pHash-0.9.6/phash-test$ for F in {0..7}; do ./phashbench /home/hosung/cdot/ccl/hashtest/all-images; done
Elapsed secs to hashing 900 files : 292.419326
Elapsed secs to hashing 900 files : 290.789127
Elapsed secs to hashing 900 files : 291.163042
Elapsed secs to hashing 900 files : 290.769897
Elapsed secs to hashing 900 files : 290.710176
Elapsed secs to hashing 900 files : 290.940988
Elapsed secs to hashing 900 files : 290.880126
Elapsed secs to hashing 900 files : 290.766687

External HDD through USB 3.0 port

hosung@hosung-Spectre:~/cdot/ccl/PerHash/pHash/pHash-0.9.6/phash-test$ for F in {0..7}; do ./phashbench /media/hosung/BACKUP/all-images; done
Elapsed secs to hashing 900 files : 293.422019
Elapsed secs to hashing 900 files : 293.145768
Elapsed secs to hashing 900 files : 292.828859
Elapsed secs to hashing 900 files : 292.591345
Elapsed secs to hashing 900 files : 292.631436
Elapsed secs to hashing 900 files : 292.811508
Elapsed secs to hashing 900 files : 292.898119
Elapsed secs to hashing 900 files : 292.607773

External HDD through USB 2.0 port

hosung@hosung-Spectre:~/cdot/ccl/PerHash/pHash/pHash-0.9.6/phash-test$ for F in {0..7}; do ./phashbench /media/hosung/BACKUP/all-images; done
Elapsed secs to hashing 900 files : 294.008601
Elapsed secs to hashing 900 files : 292.954135
Elapsed secs to hashing 900 files : 292.275561
Elapsed secs to hashing 900 files : 292.255697
Elapsed secs to hashing 900 files : 292.459464
Elapsed secs to hashing 900 files : 292.737186
Elapsed secs to hashing 900 files : 292.803859
Elapsed secs to hashing 900 files : 292.605617


  • The USB 3.0 and USB 2.0 ports seem to make no difference.
  • Even when hashing from the internal SSD, the speed was only slightly faster.
  • In spite of a 4-times-faster reading speed, hashing from the internal SSD was less than 1% faster than from USB.
  • Therefore, in terms of hashing, CPU performance seems to be more important than IO performance.
  • The other method in pHash, ph_mh_imagehash, should be tested later.

by Hosung at May 28, 2015 02:34 PM

May 27, 2015

Justin Flowers

Working with Host-Only VBox Networks in CentOS6

In order to communicate between VMs, a simple alternative to fancy port forwarding is to set up a host-only network joining them. This is my go-to solution for testing machines that need to talk to other machines. In CentOS6 this can be quite difficult to figure out on your own. Here I'll discuss how to set up your machines fully, in simple terms.

Step 1: Create the network in VirtualBox preferences

Before we can begin configuring boxes to be on this host-only network, we’ll need to make it first. This is relatively easy, thankfully. Simply go to VBox’s network preferences through File->Preferences->Network and hit the + button at the top right of the page to add a new host-only network.

Step 2: Connect VMs to host-only network

Note: to do this part your VMs must be powered down
Next we need to give the VMs access to this network. Go to a VM's network settings by right-clicking on the machine and choosing Settings->Network. Once there, add a new adapter by clicking on one of the tabs at the top and checking Enable Network Adapter. Then simply pick Host-only Adapter and it should automatically pick the first network in the list. Do this for all machines you want to communicate via the host-only network.

Step 3: Configure adapters on VMs

This is the hardest step and took me the longest to figure out. Begin by doing:

ifconfig -a

This will show you a list of all adapters present on your machine. The one you're looking for will match the hardware address created for each network adapter in step 2, although usually it's the last Ethernet one in the list. Once you have the name of your adapter (likely eth1 if you only had a NAT adapter before) you can begin configuring it with:

sudo vi /etc/sysconfig/network-scripts/ifcfg-eth1

Substituting eth1 with the name of the adapter you found with ifconfig. In this new file, copy and paste:

DEVICE=eth1
HWADDR=adapter_mac
IPADDR=192.168.56.101
NETMASK=255.255.255.0
BOOTPROTO=static
ONBOOT=yes
Again, substituting eth1 for whatever the name of your host-only adapter was, adapter_mac with the MAC address for your host-only adapter (which can be found with ifconfig -a or from the VBox machine network settings page), and the IP address for whichever one you want the machine to have.

Save that file and then run:

ifup eth1

Alternatively, if you know you will be destroying the machine soon and wish to configure this quickly, simply run:

sudo ifconfig eth1 192.168.56.101 netmask 255.255.255.0 up

However the above command will not persist through reboots and will not keep your IP static, meaning your computer could be leased a new one if you’re working for a longer period of time.

If you’ve followed all the steps correctly you should now see your connection to the host-only network! Test it out with a ping to see if it works.

by justin at May 27, 2015 07:51 PM

Anna Fatsevych

MySQL Stress Test

I ran various commands through MySQL database testing for speed and size variations.

To check for size of the database I used this code:

SELECT table_schema "Data Base Name",
    sum( data_length + index_length ) / 1024 / 1024 "Data Base Size in MB",
    sum( data_free ) / 1024 / 1024 "Free Space in MB"
FROM information_schema.TABLES
GROUP BY table_schema;

I am running a PHP file, and the MySQL commands are in a loop. Here is a code snippet showing how the processing time is calculated.

        // Start timing
        $time_pre = microtime(true);

        for ($x = 0; $x <= 50000; $x++) {
            $DB->query("INSERT INTO IMG(phash, bhash,author,license,directory) VALUES($hash,'{$bhash}','author','license','{$directory}')");
        }

        // Stop timing
        $time_post = microtime(true);
        $exec_time = $time_post - $time_pre;
        echo "\n\n TIME: ".$exec_time."\n";

The Results:


The difference in INSERT time between BLOB and VARCHAR(1024) was relatively small (52.54 seconds per 1000 records versus 58.22 for the BLOB-formatted ones). The most significant difference was in size:


by anna at May 27, 2015 07:16 PM

MySQL and Data Types

After attempting to store pHash results as BIGINT in a MySQL database, some hashes were correct and some were erroneous, constantly showing up as 9223372036854775807 (the largest possible BIGINT in MySQL).

The solution to this problem was using UNSIGNED BIGINT – the type to store the largest possible integer in a MySQL database. More on it here.
If it is still too small, the next option is VARCHAR.

Block Hash is stored as a binary string – and the data type for it is BINARY(64).

Images are stored as a BLOB.
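Putting these type choices together, the table used here might be declared roughly like this (a sketch; the column names follow the queries in this post, other columns omitted):

```sql
CREATE TABLE IMG (
    phash BIGINT UNSIGNED,  -- 64-bit pHash value
    bhash BINARY(64),       -- Block Hash binary string
    img   BLOB              -- the image itself
);
```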

Also, when working with pHash, special characters such as brackets, parentheses, etc. might raise an error; use addslashes() or quotemeta(), or you can also escape special characters with a regex. In this example I am executing pHash and using the functions mentioned above to escape strings.

$hash = exec('./phash '.$directory.quotemeta($photo));

More MySQL and PHP goodness to come.


by anna at May 27, 2015 03:56 PM

May 26, 2015

Hosung Hwang

Content Based Image Retrieval : Concept and Projects

Concept of CBIR

Content Based Image Retrieval (CBIR) is the technology for searching images in an image database without other information such as title and author.


It is basically divided into two parts: indexing and searching. Indexing is performed based on colour, texture, and shape. For fast and accurate searching, developing indexing algorithms has been an important issue in the computer vision field.

Open Source CBIR Engines

There are some commercial CBIR engines such as Google Image Search and TinEye. The following are open source CBIR engines.


  • GPL, developed until 2006
  • written in C# .NET 2.0 using Visual Studio .NET 2005


  • Open license, developed until 2009
  • Written in C++, Supports python

The GNU Image-Finding Tool (GIFT)

  • full server system written in C, developed until 2004
  • divided into a kernel part and a plugin part
  • very complex, with many dependencies on other GNU projects

imgSeek WebSite, Desktop

  • developed until 2012
  • server system written in python/C++/Java, used SOAP


  • Recently developed, GPL
  • written in Java, use OpenCV, LSH
  • business model : source code customization consulting


  • an image recognition platform for your mobile apps
  • Recently developed, LGPL
  • Written in C++; consists of 10 cpp files; uses OpenCV 2.4
  • Includes an HTTP server
  • Uses ORB instead of SIFT or SURF (which are patent protected)
  • business model : image database server hosting, source code customization


  • Pastec can be a good reference for building a content based image retrieval system because it is simple and has the major features. It is also free from license/patent concerns.
  • However, it is odd that they promote it specifically for mobile apps. Intensive testing is needed.

by Hosung at May 26, 2015 10:18 PM

Peter McIntyre

Responsive top-of-page logo image with Bootstrap

We’ve been working on a new version of the School of ICT web site recently.

It uses Drupal 7, and we have configured it to use the Bootstrap framework.

One of the things we wanted to do was to have a full-width logo image at the top of the page. However, we did not want it to scale, so I looked for a way to show an image that matched the Bootstrap style for the viewport, without using JavaScript.

Here’s what we settled on.


Four logo images

We created four logo images:

  1. logo-1100-plus.png
  2. logo-910-plus.png
  3. logo-690-plus.png
  4. logo-690-minus.png


Wrap the images

Wrap the image in a suitable container, if necessary. Then, use the Bootstrap "hidden-??" classes.

Here’s a code sample. Replace the <a href and <img src values with your own. Enjoy.

<p>Responsive images, using Bootstrap grid breakpoints</p>

<p>Width 1100 or more</p>
<div class="hidden-md hidden-sm hidden-xs">
  <a href="">
  <img src="~/Content/Images/logo-1100-plus-v1.png" /></a>
</div>

<p>Width between 992 and 1100</p>
<div class="hidden-lg hidden-sm hidden-xs">
  <a href="">
  <img src="~/Content/Images/logo-910-plus-v1.png" /></a>
</div>

<p>Width between 768 and 992</p>
<div class="hidden-lg hidden-md hidden-xs">
  <a href="">
  <img src="~/Content/Images/logo-690-plus-v1.png" /></a>
</div>

<p>Width less than 768</p>
<div class="hidden-lg hidden-md hidden-sm">
  <a href="">
  <img class="img-responsive" src="~/Content/Images/logo-690-minus-v1.png" /></a>
</div>

by petermcintyre at May 26, 2015 08:07 PM

Kieran Sedgwick

Lessons from the battlefield for an intermediate programmer

My biggest challenge as an intermediate-level programmer is almost poetic, is desperately tense and is more and more obvious the further I develop in the field. I’ve reached the point where walking through code is almost second nature. No more cold sweats at what looks like a magical jump in logic, or raised blood pressure at the sight of yet another third-party API, or reinforcing the fetal-shaped imprint on my bed when a bug defies all attempts to understand it.

At this point it’s not about solving problems. It’s about efficiency. My latest piece of work was relatively meaty, with sizeable problems that needed carefully considered solutions. It also took me a week longer to solve than I would have liked, so I decided to analyze my performance and draw lessons from the experience. Here’s what I observed:

Lesson No. 1: Problems in isolation are easier to solve


The web app I’m developing this summer needed to be refactored to work with a brand new authentication server, using a completely different authentication protocol. The codebase itself is messy, test-free and highly coupled. A complete refactor (my definite preference) was out of the question since there’s simply too much real work to do to worry about technical debt.

And so I fell into my first trap. My attention was torn between implementing the new authentication protocol, and not breaking the mess of a codebase in the process. Jumping back and forth between these two goals left me confused about the causes of my errors, and mentally exhausted from the task switching.

Solution: Next time, separate the problems, solving the most basic first

  • Identify which parts of the problem are easiest to isolate
  • Solve those problems in isolation, even if it’s contrived
  • Use perfected solutions of simple problems as a basis for the more complicated ones


  • Attention isn’t split between different domains
  • Problems don’t “cross pollinate”, confusing the source of an issue
  • Saves time

Lesson No. 2: If you have to really know a new technology, slow down and build it out as you learn it


OAuth2 is pretty cool, and I know that now. Learning it could have gone a little faster if I had used more haste and less rush. Relying on instinct and skipping over fundamental terms and concepts led me down the wrong path. I struggled to find a good example implementation, so I tried to cobble together enough of the concepts to implement it all at once. Not a good idea!

Solution: Make an effort to learn terms and concepts without losing patience, implementing something as soon as possible

  • Small confusions snowball as the pieces come together, so be thorough in research
  • Find or create examples that further or test your understanding of the piece you don’t understand
  • Solidify the learning by implementing something that works as soon as possible, even if it’s incomplete or contrived.


  • Would cut down the amount of “back and forth” referencing of material, since you’re more familiar with the ideas
  • Surfaces misunderstandings through broken implementations, helping to solidify the trouble spots faster

To the future

In his book Talent Is Overrated: What Really Separates World-Class Performers from Everybody Else, Geoff Colvin points out that masters of their craft approach every task in a predictable way.

First, they have specific goals aimed at besting their previous efforts as they go into a task. Second, they check in with those goals as they perform the task, making sure they’re staying focused on improvement. Finally, they reflect on the experience to build the goals for their next attempt. Adopting this was my motivation for this post, and seemed like the best response to my disappointment over my performance.

To success!

by ksedgwick at May 26, 2015 04:52 PM

May 23, 2015

Anna Fatsevych

The Wiki Way

After some research on Wikimedia Commons, I have found out some information, along with the links for further references.

There is the Wikipedia API Sandbox that allows you to test API calls.

There is also MediaWiki, which is a software package (written in PHP) with multiple tools and extensions available to add functionality and support. I have installed MediaWiki locally and will be able to create my own extensions and tools. To do so, I have signed up for a developer account with WikiLabs. Here are some helpful links:

API Source

Wikimedia Commons is a free file repository that is dedicated to promoting sharing of media. Read more about the Commons here.

There are also database dumps available with all the media on the Commons.

Here are the image dumps:

And mirrors here



by anna at May 23, 2015 01:39 AM

May 22, 2015

Barbara deGraaf


This blog will focus on my involvement in my current project, which is to make the cameras in three.js act like real cameras.

Continue reading to learn how cameras work, the various settings cinematographers can change on their cameras, and how those settings affect the image.

by barbaradegraafsoftware at May 22, 2015 01:15 PM

May 21, 2015

Justin Flowers

Configuring and using Logstash

Logstash has some incredibly well defined installation guides that can easily be found through a Google search like “install logstash on <your Linux distro>”. However, understanding how its configuration and permissions work by default can be daunting.

To start, we’ll need a way to figure out if Logstash is working correctly. Unfortunately, starting the service with the standard means (something along the lines of sudo service logstash start) does not give any information about the success of its launch. To check if it’s working correctly, you’ll have to check the log file at /var/log/logstash/logstash.log. If you see an error like:

{:timestamp=>"2015-05-21T14:20:51.434000-0400", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}

That is simply a log line notifying you that Logstash successfully shut down. If the file is blank after starting the service, then you know it started up with no errors. Otherwise you likely have firewall permissions issues or issues with your Logstash configuration file (the error will likely give you more information). So, make sure to cat out that file and verify that Logstash successfully started up before injecting logs into your system. Permissions are a continuous issue when configuring Logstash, from permission to write to the folders where you want to keep your output, to permission to read the log files you need to intercept.

Configuring Logstash can be quite frustrating, as there are quite a few arbitrary rules for the various filtering methods. For example, the grok filter allows you to add fields to a JSON log before sending it to outputs. Unfortunately, grok will not add these fields unless there is a successful match block. This means that to add a field, whether or not you care about the message field, you must use a grok with at least one match block (in this case, I matched message against .*).

The only issue I had setting up input was getting the permissions correctly configured for the incoming log files, so I’ll skip that component except to show what mine looked like:

input {
  file {
    path => "/var/log/httpd/access_log"
    type => "apache"
  }
  file {
    path => "/var/lib/mysql/mysqld_general.log"
    type => "mysql"
  }
  file {
    path => "/var/log/messages"
    type => "linuxsys"
  }
}
The filter section, however, was much more frustrating. This is where I encountered the “grok needs a match to run anything else” issue. It is a relatively simple filter which checks the type of the log and then gives it a different entry for its operatingsystem field.

filter {
  if [type] == "apache" {
    grok {
      patterns_dir => "/etc/logstash/conf.d/patterns"
      match => {
        "message" => "%{COMMONAPACHELOG}"
      }
    }
  } else {
    grok {
      match => { "message" => ".*" }
      add_field => { "operatingsystem" => "Host OS (CentOS)" }
    }
  }
}
The patterns_dir field tells Logstash where to find your user-defined regex terms. More about that over here. As you can see, if the log’s type is “apache” it will run my custom Apache log regex on the message string (which identifies and stores the OS, among other things). Otherwise it adds the operatingsystem field with an entry about what should be added there. As mentioned earlier, without the match => { "message" => ".*" } line, the grok in the else will never go off.

Finally, for output I had only one issue: getting Logstash to send data to Elasticsearch. The key ended up being to add a protocol line to the elasticsearch block, telling it explicitly to use http for connecting. Again, I’ll post mine as an example:

output {
  file {
    path => "/tmp/logstash_out.log"
    codec => json_lines
  }
  elasticsearch {
    host => ""
    port => "9275"
    protocol => "http"
  }
}
Except for some nasty regex with Apache logs, that’s about all the issues I ran into making the Logstash configuration file work the way I needed it to. I hope this post has helped you work your way around the Logstash system a little better!

by justin at May 21, 2015 09:04 PM

Hong Zhan Huang

OSTEP – The city of ARMs – Building a simple network

After having received the opportunity of joining the Seneca Centre for Development of Open Technology (CDOT), I have begun a journey into the world of ARM architecture, which is the focus of the OSTEP team that I am a part of. Following the training for new additions to the team like myself, we were to assign ourselves an initial task. For me that task would be to work with an ARM machine which I’ll lovingly dub Sunny.

The main points of my task are to do the initial set-up of Sunny and then to test it with a particular brew of Linux for aarch64 systems, to see how stable Sunny is with this operating system.

The Set Up

Initial Stage

The first step towards setting up was to understand what manner of connections I could use to work with Sunny from my workstation. Sunny doesn’t have any sort of video port that one can hook up to a display. The main ports of interest are serial, Ethernet, and an SD-card slot. The serial port is how we’ll be interacting with it: with a serial-to-USB adapter, one can use a terminal emulation tool such as Screen, Minicom or PuTTY to connect via the serial port. Directions for using these tools to connect to serial can be found here. From here I began to familiarize myself with Sunny and to think about what method of booting I could implement to most effectively place another operating system onto it.

As stated earlier, aside from serial there are Ethernet ports and an SD-card slot. The SD slot is seemingly meant for firmware updates (and perhaps, as a last resort, for moving files to and from the machine). That leaves Ethernet, which means a network installation (PXE boot) would be the best choice. Thankfully, here at CDOT we have a lovely cabinet of goodies known as the EHL (Enterprise Hyperscale Lab). So instead of going through the process of setting up a new PXE boot server and all that it entails, I can leverage our existing infrastructure to make this task much easier.

Secondary Stage

With an inkling of what I need to do, now is the time to figure out how to do it. Certainly I want to be able to use our existing network, but how? Sticking Sunny into the EHL is a little difficult at this time, as it’s undergoing some clean-up and is generally quite full already. As luck would have it, we in the OSTEP team are intending to build a second cabinet, which might be called the EHL2, but at the time of this writing the physical cabinet to hold it all together has yet to arrive. However, all the other parts that would become the EHL2 are already present. Thus my task becomes setting up a temporary “cabinet 2″ which will connect to Sunny as well as to the original EHL, to make use of what’s already here. I think a visual will better show what is being put together: A pile of gear

This set up is much like a very simple and stripped down version of the actual EHL. Following physically connecting up all the hardware, it was time to configure some of the pieces. The steps to do so were as follows:

  1. Edit the DHCPD config to add new records for the Power Distribution Unit (PDU), Terminal Server (TS) and Sunny. Sunny’s record will also contain the information regarding what it will attempt to PXE boot from.
  2. Power up the TS and PDU and confirm their connection.
  3. Configure TS and PDU as needed. The TS is what allows us to remotely connect to Sunny’s serial port. The PDU will allow us remote power management of our “cabinet 2″ network. For example we can remotely power cycle Sunny through the terminal or a web interface for the PDU. Power cycling is the equivalent of unplugging the power cord and plugging it back in again.
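As an illustration of step 1, a dhcpd.conf host record might look roughly like the fragment below. The host name, MAC address, IP addresses, and boot file are all hypothetical placeholders, not the actual EHL values:

```
# Hypothetical dhcpd.conf entry for the ARM machine (all values are placeholders)
host sunny {
    hardware ethernet 00:11:22:33:44:55;   # hypothetical MAC address
    fixed-address 192.168.1.50;            # hypothetical static IP
    next-server 192.168.1.2;               # hypothetical PXE/TFTP server
    filename "grubaa64.efi";               # hypothetical aarch64 boot loader
}
```

Similar records (without the PXE lines) would cover the PDU and terminal server, so each device comes up at a known address.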

Now that we’ve put everything together and configured it, with some luck our system will be up and fully running.

Things aren’t quite working yet stage

Here comes the trouble: for some reason, everything was working except Sunny’s Ethernet ports. Running an ifconfig command showed that the Ethernet devices were detected and enabled, but they weren’t actually up and running. Before I was assigned to this machine there had been a myriad of magical and mysterious issues pertaining to the Ethernet, which at some point had resolved themselves; once again they appeared as soon as the machine was in my hands. Upon the suggestion of my peers, I tested the power supply in the case, and sure enough it read as faulty with a power supply tester. After swapping it out, everything now works. Hurrah.

Some of the things I’ve learned this time:

  • The things that can make up a server cabinet, how to put one together physically, and the configuration needed to make it work.
  • Things to look out for when troubleshooting hardware, such as faulty power supplies.
  • Sometimes things will just be a mystery that requires testing things out one by one.

We’re finally ready to PXE Boot but we’ll continue that in another post.

by hzhuang3 at May 21, 2015 02:58 PM

Anna Fatsevych

Download Speed, Size, and Bandwidth

While optimizing and expanding on my findings from previous post,

I ran some tests on my home wireless network with an average speed of 67 Mbps, with 23 ms PING according to

Running two Flickr queries (API calls) per image (up to three tries per query), I was able to successfully download 784 images within 57 minutes, averaging 5 MB in size (705 images were 8 MB or less; the largest was 40.1 MB). After that came constant download failures, which led me to believe the maximum allowed quota of 3600 queries per hour had been reached. I believe this happened due to intermittency in my (wireless) network and on the server side, causing multiple tries per unsuccessful API call; I programmed it that way for continuity.

Going by my statistics and referring to this Mbps to MB/s calculator, my transfer speed was about 8.3 MB/s. At that rate, the 705 smaller images would each take under a second to download, images averaging 25 MB (of which I had 23) would take about 3 seconds, and images averaging 35 MB (there were 9 in my case) would take approximately 5 seconds; thus the larger images eat up the download time.

Ideally a maximum of 1797 images can be downloaded per hour: 3600 queries total, minus 2 initial search queries and 1 query per page of 500 images (4 pages maximum), followed by 2 queries per image with the current Python downloader program images-flickr.

Efficient download speeds of at least 250 Mbps, which is about 31 MB/s (preferably higher), would narrow the time variance across download sizes, equalizing it to about an image in under a second. From one Flickr account (API key), a maximum of 43,128 images can be downloaded per day, with bandwidth averaging 237 GB per day.
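The quota arithmetic above can be sketched in a few lines of Python. This is a back-of-the-envelope check using the figures from this post; the ~5.5 MB average image size is an assumption chosen to match the quoted bandwidth, not a measured constant:

```python
# Flickr API quota budget, using the figures quoted in this post.
QUOTA_PER_HOUR = 3600      # API queries allowed per key per hour
SEARCH_QUERIES = 2         # initial search queries
PAGE_QUERIES = 4           # one query per page of 500 results, 4 pages max
QUERIES_PER_IMAGE = 2      # two API calls per downloaded image

images_per_hour = (QUOTA_PER_HOUR - SEARCH_QUERIES - PAGE_QUERIES) // QUERIES_PER_IMAGE
images_per_day = images_per_hour * 24

AVG_IMAGE_MB = 5.5         # assumed average image size
bandwidth_gb_per_day = images_per_day * AVG_IMAGE_MB / 1000

print(images_per_hour)               # 1797
print(images_per_day)                # 43128
print(round(bandwidth_gb_per_day))   # 237
```

The query budget, not the network speed, is the binding constraint here: even a modest connection can keep up with ~1800 images per hour.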

Will continue posting my findings,


by anna at May 21, 2015 06:42 AM

May 19, 2015

Hosung Hwang

Research about OpenSIFT 2

In the previous posting, I wrote about what OpenSIFT is and how it works. In this posting, I will discuss performance, size, and quality.

Extracting keypoint

The utility program from OpenSIFT extracts keypoints and compares them at the same time. In our project, extracting hashes and comparing happen separately. So I changed the sample code and wrote a bash script to extract keypoints for all image files.

The sample set was 896 JPEG images, 804.2 MB in total. Generating keypoints for all 896 images took, on my Core i7 machine:

real    93m27.545s
user    84m19.643s
sys 2m34.055s

Keypoint extraction time and keypoint file size vary significantly by image, depending on how complex the image is.

The least complex image : Metaball4.jpg

The following image shows the detected features.
Only 1 feature was found.
The original file size was 2.5 MB.
The keypoint file size was 375 bytes.
This is the entire keypoint file:

1 128
1256.797741 1136.853164 75.205367 3.072790
 6 0 0 0 0 0 0 11 3 0 0 0 0 1 28 46 5 2 0 0
 0 24 46 11 130 80 0 0 0 4 1 1 39 1 0 0 0 0 0 20
 109 16 3 3 3 11 53 130 8 4 4 23 93 123 76 31 130 9 0 2
 29 19 0 19 43 14 0 0 0 0 0 2 105 129 65 16 2 1 1 15
 11 22 77 130 73 10 2 8 130 0 0 16 21 1 0 130 10 6 0 0
 0 0 0 0 5 31 25 1 0 0 0 4 17 4 29 16 0 0 2 104
 130 0 0 1 0 0 0 130

Extracting time was :

real    0m6.994s
user    0m7.082s
sys 0m0.277s

The result seems weird: only one feature was detected even though there is a big round shape, and extraction took about 7 seconds.

The most complex image : VidGajsek – Razgled 505.jpg

The following image shows the detected features.
VidGajsek - Razgled 505.jpg_screenshot_19.05.2015
265116 features were found.
Original file size was 9.9MB.
Keypoint file size was 101.7MB.
Extracting time was :

real    3m1.761s
user    2m33.824s
sys 0m9.152s

The original image was a kind of HDR image. There were many separate dots, and all those small pixels seem to have features. Surprisingly, the keypoint file size was 101.7 MB and extraction took more than 3 minutes.

I looked through the API for an option to write the keypoint file in a binary format, which would reduce the size. However, there is no such option.
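To get a feel for what a binary format could save, here is a rough estimate. It assumes each feature would be stored as four float32 header values (x, y, scale, orientation) plus 128 one-byte descriptor entries, measured against the ~375 bytes per feature observed in the single-feature text file above; the storage layout is an assumption, not OpenSIFT's actual format:

```python
# Rough text-vs-binary size estimate for one SIFT feature.
TEXT_BYTES_PER_FEATURE = 375           # observed: the single-feature file above
HEADER_FLOATS, FLOAT32_BYTES = 4, 4    # x, y, scale, orientation
DESCRIPTOR_LEN, UINT8_BYTES = 128, 1   # descriptor values fit in one byte each

binary_bytes = HEADER_FLOATS * FLOAT32_BYTES + DESCRIPTOR_LEN * UINT8_BYTES
ratio = binary_bytes / TEXT_BYTES_PER_FEATURE

print(binary_bytes)      # 144
print(round(ratio, 2))   # 0.38
```

That is in the same ballpark as the ~30% figure estimated in the closing paragraph of this post.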

Matching keypoints of two images

One of the problems with OpenSIFT is that there is no standard way to decide whether two images are similar, or whether one image is part of the other. According to the code, matching works relative to one base image.

Testing character images

In the case of simple character images, there are only a few features. Therefore, when those images are compared with other images, there are many false positives. For example, “I” has 2 features, and both features match with 1, 2, 5, 7, b, d, E, f, h, i, j, L, n, P, r, T, V, W, and y. Likewise, ‘i’ (4), ‘j’ (3), ‘l’ (4), ‘o’ (2), ‘O’ (2) have many matching results. Beyond those, the following pairs were false positives: 2 and d, 3 and I, 7 and V, 7 and y, D and 5, D and P, h and n, h and u, M and h, M and I, M and L, M and P, u and h, w and I.
When images are simple, OpenSIFT gives many false positive results.

Testing complex images

When I compared the keypoint files of two completely different images that have 188507 and 216297 features, the comparison took 1 minute, and 224 features matched.

When I compared 4 images against all the others, there didn’t seem to be any false positives. I couldn’t do this for all images; it takes too long, especially for complex images.

In the case of an image cropped from another image:
Screenshot from 2015-05-19 18:58:39
Matching image looks like this :
The comparison took 10.430 seconds, and the result was:

Features of cropped image : 35915
Features of original image : 45669
Matched features : 23109

In this case, 64% of the base image’s features were matched.

Although more testing is needed, for complex images, if more than 50% of the features match, the two images are similar or one is part of the other.
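That tentative rule can be written down as a tiny helper. This is a sketch only; the 50% threshold is the untested heuristic above, not a validated value, and the function name is made up for illustration:

```python
def likely_same_or_contained(matched, base_features, threshold=0.5):
    """Heuristic: treat two complex images as similar (or one as part of
    the other) when more than `threshold` of the base features match."""
    if base_features == 0:
        return False
    return matched / base_features > threshold

# The cropped-image example above: 23109 of 35915 features matched (~64%).
print(likely_same_or_contained(23109, 35915))   # True
# The two unrelated images: only 224 of 188507 features matched.
print(likely_same_or_contained(224, 188507))    # False
```

As noted above, this heuristic only makes sense for complex images; simple character images produce too few features for the ratio to be meaningful.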

The problems are the significant keypoint file size, keypoint generation time, and comparison time. Keypoint file size seems to be the biggest problem: the total size of the sample images is 808 MB, while the total size of the generated keypoint files is 2.9 GB. Although reducing the size seems possible by switching to a binary or compressed format, 30% of 2.9 GB is still too big: bigger than the original image size.

by Hosung at May 19, 2015 11:33 PM

Anna Fatsevych

Flickr Method Calls and Python

Flickr has an extensive API to perform all sorts of operations. The method calls I found most useful were:

It takes “photo_id” as an argument (and of course the API key is required to use any of the Flickr API methods) and returns detailed information about the photo.
The response, in XML (REST), JSON, or serialized PHP, includes such valuable information about the image as user information (nsid/user ID, real name, username), date uploaded, date taken, license, and more.

Flickr also provides an instant Request/Response explorer where you can test out your calls with the parameters in the request and see the information you will get in the response. You can try it out for yourself here:
Here is a sample of a search result response:

Screenshot from 2015-05-19 14_55_00

In my previous post I mentioned I am using a Python API (here). It defines the methods and arguments, which are easy to use:


def photos_search(user_id='', auth=False,  tags='', tag_mode='', text='',\
               min_upload_date='', max_upload_date='',\
               min_taken_date='', max_taken_date='', \
               license='', per_page='', page='', sort='',\
               safe_search='', content_type='', **kwargs)

Here is an example of me calling that method in my program:

f = flickr.photos_search(min_upload_date="2015-03-20", license="1,2,3", per_page="500") 

A list of photos will be returned matching the specified parameters. per_page specifies how many results to return per page (100 is the default, 500 is the max). You may, however, query specific pages:

f = flickr.photos_search(min_upload_date="2015-03-20", license="1,2,3", page="2", per_page="500")

I have my program available on GitHub
I am using two method calls initially, to get the total number of images and pages:

#get the photos
f = flickr.photos_search(min_upload_date=date, max_upload_date=date, license=lic, per_page="500")

#get the total pages
fn = flickr.photos_search_pages(min_upload_date=date, max_upload_date=date, license=lic, per_page="500")

flickr.photos_search_pages is a method defined in the Flickr Python API that returns the number of pages; from there it is as simple as iterating through all the pages if you want to download all the images from a specific search result.

While researching more about the Flickr API for Python, I found that the official Flickr API documentation provides three different links to Python APIs. Some are updated a little more rigorously, and some need a little tweaking; in my case, one had to be slightly changed (i.e., Flickr no longer responds with the “isadmin” property for users, so “isadmin” had to be commented out in order not to fault). Here are the links to the Flickr Python API:


Beej’s Python Flickr API (one I am using)

Hope this is helpful.



One more thing: when working with networks, I learned that they are extremely “touchy”, and various socket errors are bound to interrupt your flow. Thus, wrapping the code in try/except proved to be crucial in my case. And if you make a mistake and wrap everything so tightly that the good old Ctrl+C fails to stop your for loop, just add

except KeyboardInterrupt:
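A minimal sketch of that pattern (the function name, retry count, and the exact exceptions caught are illustrative assumptions, not the code from this project):

```python
import socket
import time

def with_retries(fn, tries=3, delay=1.0):
    """Call fn(), retrying on transient socket errors.

    KeyboardInterrupt is re-raised immediately, so Ctrl+C can still
    stop a long-running download loop."""
    for attempt in range(1, tries + 1):
        try:
            return fn()
        except KeyboardInterrupt:
            raise                      # never swallow Ctrl+C
        except (socket.error, OSError):
            if attempt == tries:
                raise                  # out of retries: surface the error
            time.sleep(delay)
```

In Python 3, socket.error is simply an alias of OSError; catching both keeps the sketch portable across versions.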

by anna at May 19, 2015 08:02 PM

Hosung Hwang

Research about OpenSIFT


As a better solution for image comparison, there may be approaches other than perceptual hashing. What I looked at was SIFT (Scale-Invariant Feature Transform), an algorithm to detect features in images. For example, if an image is part of a bigger image, this algorithm detects it even if the cropped image is rotated.

Building OpenSIFT

OpenSIFT is an implementation of the SIFT algorithm. It uses OpenCV.
On my Ubuntu 14.04 machine, I installed OpenCV using the following steps.

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install build-essential libgtk2.0-dev libjpeg-dev libtiff4-dev libjasper-dev libopenexr-dev cmake python-dev python-numpy python-tk libtbb-dev libeigen3-dev yasm libfaac-dev libopencore-amrnb-dev libopencore-amrwb-dev libtheora-dev libvorbis-dev libxvidcore-dev libx264-dev libqt4-dev libqt4-opengl-dev sphinx-common texlive-latex-extra libv4l-dev libdc1394-22-dev libavcodec-dev libavformat-dev libswscale-dev default-jdk ant libvtk5-qt4-dev
$ wget
$ unzip
$ cd opencv-2.4.9
$ mkdir build
$ cd build
$ cmake ..
$ make
$ sudo make install

After installing OpenCV, simply running make builds libopensift.a and the utilities match and siftfeat.

How it works

The siftfeat utility extracts keypoints. From this original image :
It detects keypoints like this :
When I write keypoint information into a text file, it looks like this :

116 128
103.219669 142.087615 42.452571 0.869987
 0 0 0 0 2 0 0 0 2 21 34 14 20 2 0 0 4 21 15 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 80 15 0 0
 74 89 51 61 191 59 1 7 191 159 25 13 29 5 1 26 17 0 0 0
 0 0 0 2 0 0 0 0 27 1 0 0 91 14 1 6 191 102 12 47
 191 37 1 1 30 26 9 159 33 0 0 0 0 0 0 8 0 0 0 0
 0 0 0 0 13 6 0 0 3 1 0 1 44 12 0 0 1 0 0 4
 0 0 0 0 0 0 0 0
103.219669 142.087615 42.452571 -2.213366
 0 0 0 0 0 0 0 0 1 0 0 6 46 8 0 0 3 1 0 2
 17 6 0 0 0 0 0 0 0 0 0 0 0 0 0 10 29 0 0 0
 29 21 9 190 190 25 1 2 190 82 13 58 95 12 1 13 23 0 0 0
 0 0 0 1 0 0 0 2 14 0 0 0 32 4 1 33 190 140 25 16
 190 39 1 9 73 79 44 79 77 9 0 0 0 0 0 3 0 0 0 0
 0 0 0 0 0 0 0 0 5 23 17 2 26 2 0 0 2 21 30 14
 2 0 0 0 0 0 0 0
130.321665 70.660335 15.892657 -2.046253
 1 2 1 59 56 15 1 1 130 96 1 3 2 10 5 42 133 122 0 0
 0 0 0 23 9 4 0 0 0 0 0 0 20 1 1 9 100 25 1 6
 133 5 0 1 0 1 4 133 133 14 0 0 0 0 7 133 15 1 0 0
 0 0 0 10 21 0 0 0 131 43 8 20 133 43 0 0 0 0 3 50
 133 45 0 0 0 0 1 34 2 1 0 0 0 0 0 2 1 0 0 0
 38 110 76 23 104 14 0 0 0 1 58 69 82 21 0 0 0 0 0 6
 1 0 0 0 0 0 0 0
<the last part omitted>

It writes 116 features into a text file. The full size of the keypoint file is 41.8 KB; the original PNG image file was 29.9 KB.

When there is another image like this :
The extracted features look like this :
In this case, 95 features were found.

Using the match utility, the following output image is generated :
There are 34 total feature matches.

Using the C API, we can:

  • detect features
  • generate an image with the features drawn
  • write a keypoint file
  • know how many features are found
  • match two images
  • match two keypoint files
  • generate a feature-matching image
  • know how many features are matched between two images

In the next posting, more test results will be discussed.

by Hosung at May 19, 2015 04:49 PM

May 13, 2015

Hosung Hwang

Perceptual hash test for text image

In the last posting, I made sets of test images that each contain a character.

Today, I tested how pHash works for this kind of image, and how reliable the Hamming distance is under rotation, translation, and adding a dot.

Original Sample images

I generated basic images using this command:

for L in {A..Z} {a..z} {0..9} ; do convert -size 80x50 xc:white -font /usr/share/fonts/truetype/msttcorefonts/arial.ttf -pointsize 50 -fill black -gravity center -annotate 0x0+0+0 "$L" "$L.jpg" ; done

I generated a hash for every image using the pHash program, and compared all of the images using a Perl script made by Andrew Smith.

The only image pair that gave a distance of 0 was l (lowercase L) and I (capital I). They are slightly different but almost the same.

I l

Other than this case, all comparisons gave a distance of more than 12. In the case of text: 0% false positives.
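The distances in these tables are Hamming distances between the hashes produced by the phash program (64-bit values for its DCT image hash), i.e. the number of differing bits. A minimal sketch of the comparison:

```python
def hamming_distance(h1, h2):
    """Number of differing bits between two (64-bit) perceptual hashes."""
    return bin(h1 ^ h2).count("1")

# Hashes differing in exactly two bit positions:
print(hamming_distance(0b1010, 0b0011))  # 2
# Identical hashes, like l.jpg vs I.jpg above:
print(hamming_distance(0xF00D, 0xF00D))  # 0
```

The XOR keeps only the bits where the two hashes disagree, so counting its set bits gives the distance directly.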


I rotated the images 2 degrees. They look very similar.
Original :
Rotated :

In this case, I used pHash’s sample program, test_image.
Result :

0 0.jpg 0.jpg dist = 12
1 1.jpg 1.jpg dist = 2
2 2.jpg 2.jpg dist = 4
3 3.jpg 3.jpg dist = 8
4 4.jpg 4.jpg dist = 8
5 5.jpg 5.jpg dist = 4
6 6.jpg 6.jpg dist = 4
7 7.jpg 7.jpg dist = 6
8 8.jpg 8.jpg dist = 4
9 9.jpg 9.jpg dist = 6

4 degrees Rotation :

Result :

0 0.jpg 0.jpg dist = 14
1 1.jpg 1.jpg dist = 8
2 2.jpg 2.jpg dist = 8
3 3.jpg 3.jpg dist = 16
4 4.jpg 4.jpg dist = 8
5 5.jpg 5.jpg dist = 10
6 6.jpg 6.jpg dist = 12
7 7.jpg 7.jpg dist = 6
8 8.jpg 8.jpg dist = 8
9 9.jpg 9.jpg dist = 6

45 degrees Rotation :
Result :

0 0.jpg 0.jpg dist = 24
1 1.jpg 1.jpg dist = 38
2 2.jpg 2.jpg dist = 20
3 3.jpg 3.jpg dist = 28
4 4.jpg 4.jpg dist = 30
5 5.jpg 5.jpg dist = 30
6 6.jpg 6.jpg dist = 22
7 7.jpg 7.jpg dist = 36
8 8.jpg 8.jpg dist = 26
9 9.jpg 9.jpg dist = 20


I slightly changed the position of characters in the image.
Original :
Moved :
Result :

0 0.jpg 0.jpg dist = 14
1 1.jpg 1.jpg dist = 10
2 2.jpg 2.jpg dist = 14
3 3.jpg 3.jpg dist = 18
4 4.jpg 4.jpg dist = 20
5 5.jpg 5.jpg dist = 20
6 6.jpg 6.jpg dist = 16
7 7.jpg 7.jpg dist = 14
8 8.jpg 8.jpg dist = 18
9 9.jpg 9.jpg dist = 14

When I moved the characters more, the distance was even bigger.

Adding a Dot

Then I added a dot at the same position in each image.
Original :
Added :
Result :

0 0.jpg 0.jpg dist = 18
1 1.jpg 1.jpg dist = 4
2 2.jpg 2.jpg dist = 0
3 3.jpg 3.jpg dist = 2
4 4.jpg 4.jpg dist = 8
5 5.jpg 5.jpg dist = 2
6 6.jpg 6.jpg dist = 2
7 7.jpg 7.jpg dist = 6
8 8.jpg 8.jpg dist = 4
9 9.jpg 9.jpg dist = 2
10 A.jpg A.jpg dist = 16
33 X.jpg X.jpg dist = 16

From the result, when the added dot overlapped a line of the character, the distance was small; whereas when the dot did not overlap a line, the distance was more than 10.

Next step

  • More tests for resizing, cropping, and different fonts are needed

ppt link

by Hosung at May 13, 2015 09:18 PM

May 12, 2015

Anna Fatsevych

Flickr API and ExifTool on CentOS 7

Flickr holds a ridiculous number (millions) of Creative Commons licensed images, and I have been learning to use their (REST) API to download images based on tag and license requirements.

I am using the Flickr Python API (here) and a modified version of the Python script available here. You do need to obtain your API key and secret from Flickr by registering, and input those values in the file. A Flickr API method provides a list of all available licenses.


I am doing a search on Flickr images by tag, and will also sort by license ID. To do so, I use the method that allows the search parameters to be sent in with the method call, such as: user_id, tags, tag_mode, min_upload_date, max_upload_date, min_taken_date, max_taken_date,

Here is the code for calling the method :


I run the program using:

$ python students 1

The program will then proceed to download all matching images under that tag with the specified license (1 in this case, which is CC BY-NC-SA 2.0), up to a maximum of 500 per query and within the 3600-API-queries-per-hour limit set by Flickr.

To check which XMP tags the downloaded images have, I downloaded a Perl script that uses the ExifTool library. To install it on CentOS 7, the EPEL release RPM package has to be installed first by issuing this command:

$ sudo yum install epel-release

Then to install the ExifTool:

$ sudo yum install perl-Image-ExifTool

Then, to read the tags, you can just run exiftool imagename.jpg.




by anna at May 12, 2015 08:11 PM

Justin Flowers

Simulating Hard Drive Failures in Linux

For our project we’re going to need to do various kinds of testing, including simulating failed drives. From our research, there are not many ways to do this short of creating a copy of a working drive and injecting corrupt blocks. Here’s a great answer on Stack Overflow with a list of options available for this.

Since we’ll be handling large drives, making a corrupt copy of a drive seems unfeasible. That pretty much eliminates the device mapping route. Additionally, our access to the hard drives will not be only through POSIX commands, making libfiu less useful. Because of all this, we feel that the best route is to simulate our drives as a kind of linear RAID device with mdadm, setting them to faulty every now and then.

Unfortunately, this is easier said than done; the documentation for mdadm is quite confusing. To do it, you’ll need to start with an unmounted drive and call:

sudo mdadm --create /dev/md1 --level=faulty --layout=clear --raid-devices=1 /dev/sdb

That command will create a mapping from your drive (replace “/dev/sdb” with your drive) to a newly created RAID device called “md1”. It will also set the drive to use the faulty personality with “--level=faulty” and set it to not send any faults yet with “--layout=clear”.

If your drive already had partitions with filesystems on them, then you can skip this step. Otherwise you’ll want to add a partition to md1 with fdisk:

sudo fdisk /dev/md1

To add a new partition use “n”, adding a primary partition in the first slot. You can use the defaults for the other options by just hitting enter. If you want to make the partition smaller than the drive, you can use “+500M” or “+1G”, for example, in the “Last sector” option. When you’ve finished adding the partition, make sure to send “w”, otherwise your changes won’t be made!

Surprisingly though, your newly made partition (or existing partitions) will not show up in /dev. This is because you’ve technically made a logical device with mdadm, so you’ll need to register it with:

sudo kpartx -a /dev/md1

After doing that, run “lsblk” to see your block devices and their mappings. You can see your mappings to partitions in /dev/mapper (the one you just made will look like “md1p1”). By running:

sudo mkfs -t ext4 /dev/mapper/md1p1

You can format it to ext4. Finally, to mount it, create a directory to mount it to (I like /mnt/hdd) and call:

sudo mount /dev/mapper/md1p1 /mnt/hdd

Now that your system is mounted, at any time you can start injecting errors by calling:

sudo mdadm --grow /dev/md1 --layout=wp3

You can set different layout parameters to inject different faults with the “--layout” portion. The first half specifies the type of fault to inject (“wp”) and the number after it specifies the number of accesses before sending the fault (the period). For more examples of how to use the layout parameter and what the different faults are, check out here and here.

To take down this drive, you’ll need to call in this order:

sudo umount /mnt/hdd
sudo kpartx -d /dev/md1
sudo mdadm --stop /dev/md1

And then your drive will be back to normal!

by justin at May 12, 2015 07:09 PM

Hosung Hwang

Making sample images using ‘convert’ utility

In the previous posting, I made sample images using Java. The ‘convert’ utility, which is part of ImageMagick, provides powerful image-processing functionality. Using ‘convert’ and a bash script, we can make sample images easily.

A-Z a-z 0-9 image

for L in {A..Z} {a..z} {0..9} ; do convert -size 80x50 xc:white -font /usr/share/fonts/truetype/msttcorefonts/arial.ttf -pointsize 50 -fill black -gravity center -annotate 0x0+0+0 "$L" "$L.jpg" ; done

Screenshot from 2015-05-12 13:20:47

The same set with italic style

for L in {A..Z} {a..z} {0..9} ; do convert -size 80x50 xc:white -font /usr/share/fonts/truetype/msttcorefonts/arial.ttf -pointsize 50 -fill black -gravity center -annotate 0x30+0+0 "$L" "$L.jpg" ; done

Screenshot from 2015-05-12 13:20:11

The same set with rotation

for L in {A..Z} {a..z} {0..9} ; do convert -size 80x50 xc:white -font /usr/share/fonts/truetype/msttcorefonts/arial.ttf -pointsize 50 -fill black -gravity center -annotate 45x45+0+0 "$L" "$L.jpg" ; done

Screenshot from 2015-05-12 13:21:32

by Hosung at May 12, 2015 05:26 PM

David Humphrey

Learning to git bisect

Yesterday one of my students hit a bug in Brackets. We're working on an extension for Thimble that adds an autocomplete option to take a selfie when you are typing a URL that might be an image (e.g., <img src="...">). It also needs to work in CSS, when you enter something like background-image: url(...). Except it doesn't work in the latter case. I advised him to file a bug in the Brackets repo, and the response was, "This used to work, must be a regression."

I told my student he should bisect and find the commit that caused the regression (i.e., code change that introduced the bug). This was a new idea for him, so I did it with him, and promised I'd write something later to walk through the process. Now that it's "later," I'm writing a short walkthrough on what I did.

As I write this, there are 16,064 commits in the adobe/brackets repo on github. That's a lot of coming and going, where a bug could easily hitch a ride and quietly find its way into the product. How does one hope to find what is likely a tiny needle in such a huge haystack? The answer is git bisect.

Having a large number of commits in which to locate a bad commit is only difficult if we try and manually deal with those commits. If I remembered that this code worked yesterday, I might quickly look through the diffs for everything that landed recently, and figure things out that way. But when we're talking about months, or maybe years since this last worked, that strategy won't be efficient.

Luckily git is designed to help us here. Because git knows about every commit, both what changed, and what change(s) came before, we can take a huge range of commits and slice and dice them in order to expose the first bad one.

In order for this process to work, you need a reproducible test case. In an ideal world, this is a unit test that's in your git history, and is stable across all the commits you'll be checking. In the real world you often have to write one yourself, or get a set of steps that quickly exposes the bug.

For the bug I was tracking, I had a simple way to test in the editor manually. When the code works, it should give us a list of filenames to use for the URL, and look like this:

Working Image

When it fails, it keeps giving the list of completions for background-image instead, and looks like this:

Broken Image

Now that I have a simple way to confirm whether or not the bug is present in a particular version of the code, I need to quickly eliminate commits and narrow down exactly where it came from.

You begin a bisect session with git by doing: git bisect start. In order to have git bisect for me, I need to create a range, and to do this I need two end points (i.e., two commits): one where the bug is not present (good commit); and one where I know it's broken (bad commit). Finding a bad commit is usually pretty easy, since you already know you have the bug--in my case I can use master. I tell git this is the bad commit by doing git bisect bad (note: I was sitting on the master branch when I typed this. You can also explicitly give a commit or branch/tag).

But for the last-known-good commit, I obviously don't know exactly where it happened. As a result, I'm going to need to overshoot and go back far enough to get to something that works.

Brackets is currently preparing for version 1.4 on master, and there are 80 previously released (i.e., tagged) versions. These tagged versions are useful for quickly jumping back in time, since they represent versions of the editor that are most likely to run (i.e., they didn't tag and release broken commits). So I start checking out old releases: 1.1 (broken), 1.0 (broken), release 0.44 (broken). It looks like this is an old problem, so I jump further back so as to not waste my time testing too many.

Eventually I checkout version 0.32 from Oct 4, 2013, and the bug isn't there. Now I can tell git about the other end of my bisect range by doing: git bisect good.

Now git can do its magic. It will take my good and bad commits, and checkout a commit half way between them. It looks like this:

Bisecting: 3123 revisions left to test after this (roughly 12 steps)  
[940fb105ecde14c7b5aab5191ec14e766e136648] Make the window the holder

A couple of things to notice. First, git has checked out commit 940fb105ecde14c7b5aab5191ec14e766e136648 automatically. It has let me know that there are 12 steps left before it can narrow down the problematic commit. That's not too bad, given that I'm trying to find one bad commit in thousands from the past two years!

At this point I need to run the editor for commit 940fb105ecde14c7b5aab5191ec14e766e136648 and test to see if it has the bug. If it does, I type git bisect bad. If it doesn't, I type git bisect good. In this case the bug is there, and I enter git bisect bad. Git responds:

Bisecting: 1558 revisions left to test after this (roughly 11 steps)  
[829d231440e7fa0399f8e12ef031ee3fbd268c79] Merge branch 'master' into PreferencesModel

A new commit has been checked out, eliminating half of the previous commits in the range (it was 3123, now it's 1558), and there are 11 steps to go. This process continues; sometimes the bug is there, sometimes it isn't. After about 5 minutes I get to the first bad commit, and git shows me this:

6d638b2049d6e88cacbc7e0c4b2ba8fa3ca3c6f9 is the first bad commit  
commit 6d638b2049d6e88cacbc7e0c4b2ba8fa3ca3c6f9  
Author: <Name of Author>  
Date:   Mon Apr 7 15:23:47 2014 -0700

    fix css value hints

:040000 040000 547987939b1271697d186c73533e044209169f3b 499abf3233f1316f75a83bf00acbb2955b777262 M    src

Now I know which commit caused the bug to start happening. Neither git nor I know why this commit did what it did, but it doesn't matter. We also know what changed, which bug was being fixed, who did it, and when they did it. Armed with this info we can go talk to people on irc, and add notes to a few bugs: the bug where the bad commit was added (this will alert people who know the code, and could more easily fix it), and our bug where we've filed the issue. Often you don't need to solve the bug, just find it, and let the person who knows the code well help with a fix.

The last step is to tell git that we're done bisecting: git bisect reset. This takes us back to the commit we were on before we started our bisect.

Despite being quite skilled with git, I don't think that any of my students had used bisect before. Lots of people haven't. It's worth knowing how to do it for times like this when you need to quickly narrow down a regression.

by David Humphrey at May 12, 2015 03:01 PM

May 11, 2015

Hosung Hwang

Making images from alphabet characters

To test Perceptual Hash, I wanted images that each contain a character: a.jpg, a.png, a.gif, b.jpg, etc.

I couldn't find images like this, so I wrote a simple Java program using a code block from this page.

This program makes a jpg, png, and gif image for each of A-Z, a-z, and 0-9.

Full source code is:

import java.awt.Color;
import java.awt.Font;
import java.awt.FontMetrics;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

//modified from
public class FontImage {

    public static void drawText(Font font, String text) {
        // First pass: use a throwaway 1x1 image to measure the text.
        BufferedImage img = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        Graphics2D g2d = img.createGraphics();
        g2d.setFont(font);
        FontMetrics fm = g2d.getFontMetrics();
        int width = fm.stringWidth(text);
        int height = fm.getHeight();
        g2d.dispose();

        // Second pass: draw the text into a correctly sized image.
        img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        g2d = img.createGraphics();
        g2d.setRenderingHint(RenderingHints.KEY_ALPHA_INTERPOLATION, RenderingHints.VALUE_ALPHA_INTERPOLATION_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_COLOR_RENDERING, RenderingHints.VALUE_COLOR_RENDER_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_DITHERING, RenderingHints.VALUE_DITHER_ENABLE);
        g2d.setRenderingHint(RenderingHints.KEY_FRACTIONALMETRICS, RenderingHints.VALUE_FRACTIONALMETRICS_ON);
        g2d.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g2d.setRenderingHint(RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_QUALITY);
        g2d.setRenderingHint(RenderingHints.KEY_STROKE_CONTROL, RenderingHints.VALUE_STROKE_PURE);
        g2d.setFont(font);
        fm = g2d.getFontMetrics();
        // Black text on a white background.
        g2d.setColor(Color.WHITE);
        g2d.fillRect(0, 0, width, height);
        g2d.setColor(Color.BLACK);
        g2d.drawString(text, 0, fm.getAscent());
        g2d.dispose();
        try {
            ImageIO.write(img, "gif", new File(text + ".gif"));
            ImageIO.write(img, "png", new File(text + ".png"));
            ImageIO.write(img, "jpg", new File(text + ".jpg"));
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    public static void main(String[] args) {
        String text = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
        Font font = new Font("Arial", Font.PLAIN, 48);

        for (char t : text.toCharArray())
            drawText(font, Character.toString(t));
    }
}

The output files are:
Screenshot from 2015-05-11 16:35:03

There is a better way to do this using the convert utility. It is described in the next posting.

by Hosung at May 11, 2015 08:44 PM

May 07, 2015

Kieran Sedgwick

Leveraging Travis CI & Heroku for rapid deployment of Thimble

As we got closer to a usable mashup of Brackets and Thimble we wanted to allow other people to play with what we’d done so far.

By leveraging Travis CI hooks I was able to automate a deployment of our app to Heroku whenever we updated our main branch. This was easier than I’d anticipated (see the plethora of excellent documentation for more information on the process), and it also surfaced some issues revolving around heavily interdependent apps:

1. Local is a racetrack, deployed is a city block

The reality of local development of Webmaker is that most of the pain has been removed. The excellent webmaker-suite module clones, installs and configures the Webmaker application ecosystem based on which of their core and peripheral servers you need running.

Aside from new environment variables we introduced, we never had to touch the config. All of the features tied to the tight coupling with other parts of the ecosystem, like login and publishing makes, “just worked”. Not so in deployment.

We had to accept that, at least at first, there was only so much we could expose to others for testing, and that our application would look far more incomplete than it actually was.

2. When deployed, the pitstop is miles away

An automated deployment system also meant that we were held hostage by the length of the continuous integration process if something broke on Heroku. Reverting breaking changes had a time delay not found locally, and considering our workflow (imagine Thimble as a tree with a complicated root system of dependencies and submodules) things could get pretty messy as we tracked down a problem.

Add to that the time it took to redeploy and it became clear that we had to be more mindful of what we pushed and where.

3. If local is a… drawing of a donut, then… something something bagels

The main takeaway from this process was that Heroku wasn’t at all ideal for deploying this particular application. It was a picture of a donut next to a real donut. What we really needed was a full deployment of the Webmaker ecosystem! So that became the next goal in our automagical deployment journey.
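For reference, the Travis-to-Heroku hook mentioned at the top amounts to a small deploy stanza in .travis.yml. Here's a sketch with placeholder values (the travis gem's `travis setup heroku` command can generate the real thing, including the encrypted API key):

```yaml
deploy:
  provider: heroku
  api_key:
    secure: "PLACEHOLDER-ENCRYPTED-KEY"   # generated by `travis encrypt`
  app: thimble-staging                    # hypothetical Heroku app name
  on:
    branch: master                        # only deploy when master updates
```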

by ksedgwick at May 07, 2015 05:12 PM

May 04, 2015

Andrew Smith

Screen scraping timetable data from a PeopleSoft Faculty Center

Our school moved to PeopleSoft for.. I’m not going there.. but that’s where everyone’s timetables are now. I thought maybe this big fancy company has an API to let me access the data but no, it’s basically impossible to access the API directly.

So I was left with screen scraping, which I always wanted to try, why not. Go to the page I want to examine, open up Firebug, and drill down to the table elements I’m interested in: body>div>iframe>html>body>form>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div…

Er, wtf? I seemed to be stuck in some Firebug infinite loop. Surely they don’t have that many tables inside each other? Then I discovered the “Click an element” button and found that there really are lots and lots of tables inside tables on this simple page:

faculty centre

This is with the text at its minimum size; you can see from the scrollbars what I’m talking about:

Firebug peoplesoft html

But after a while I managed to figure it out. I had to learn some XPath to find the cells I was interested in based on their IDs, but I couldn’t use XPath for everything – I tried but it ate all my RAM and was still working through the swap partition when I killed it in the morning.

Here’s the script in case you’re in the same boat. It prints the timetable data to the console. For myself, I intend to make some JSON out of it for import into Everyone’s Timetable.

// Firebug script to scrape timetable data from a PeopleSoft-backed website.
// Run it when you're on the page that shows the timetable. You get to that page
// like so:
// Faculty Center
//  Click the Search tab
//    Expand Additional Search Criteria
//      Set "Instructor Last Name" to the one you're looking for
//        Start Firebug, go to Console, paste in this script and run it
// Author: Andrew Smith

var frameDocument = document.getElementById('ptifrmtgtframe').contentWindow.document;

// DERIVED_CLSRCH_DESCR200$0, $1, etc. have the course title
var courseTitles = frameDocument.evaluate(
  "//div[contains(@id,'DERIVED_CLSRCH_DESCR200')]",
  frameDocument.documentElement, null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);

// For each course
for (var i = 0; i < courseTitles.snapshotLength; i++) {
  var courseTitle = courseTitles.snapshotItem(i);
  // Walk up to the tr holding this title, then over to the next tr,
  // which has the timetable data for this course
  var titleRow = courseTitle;
  while (titleRow.nodeName != 'TR')
    titleRow = titleRow.parentNode;
  var timetableTableParentRow = titleRow.nextElementSibling;
  // There's some fucked up empty row after the first course title only
  if (i == 0)
    timetableTableParentRow = timetableTableParentRow.nextElementSibling;
  // Now go down to the table in this tr, it's the only thing that has
  // an id so I can use xpath to find its children (timetable rows).
  var timetableTableId = timetableTableParentRow.getElementsByTagName('table')[0].id;
  // MTG_DAYTIME$0, $1, etc. have the day and time range in this format:
  // Mo 1:30PM - 3:15PM
  var times = frameDocument.evaluate(
    "//div[@id='" + timetableTableId + "']//div[contains(@id,'MTG_DAYTIME')]",
    frameDocument.documentElement, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var timesArray = new Array();
  for (var j = 0; j < times.snapshotLength; j++) {
    timesArray[j] = times.snapshotItem(j).textContent;
  }
  // MTG_ROOM$0, $1, etc. have the room number in this format:
  // S@Y SEQ Bldg S3028
  var rooms = frameDocument.evaluate(
    "//div[@id='" + timetableTableId + "']//div[contains(@id,'MTG_ROOM')]",
    frameDocument.documentElement, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var roomsArray = new Array();
  for (var j = 0; j < rooms.snapshotLength; j++) {
    roomsArray[j] = rooms.snapshotItem(j).textContent;
  }
  // MTG_INSTR$0, $1, etc. have the instructor names but I think I'll
  // ignore them. For shared courses it won't hurt too much I hope.
  // Dump all the timetable data into the console, will do something with it later.
  for (var j = 0; j < times.snapshotLength; j++) {
    console.log(timesArray[j] + ' ' + roomsArray[j]);
  }
}

by Andrew Smith at May 04, 2015 02:39 AM