Thursday, 01. October 2009
NoSQL: MongoDB performance testing (part 2: counting)… 
After my insert tests last time I decided to look at some count queries, as we count a lot at twimpact.com. As a first result I can say that without an index, counting makes no sense with a database of this size.
I used the database left over from my last insert test and added a few indexes, which takes around 30-40 minutes per index. I did not measure the index creation time in more detail, as we tend to create indexes while working on the database anyway.
Now for today's results. The queries are quite simple, but practical in our case. I get a cursor for 1.000.000 documents as the result of a simple query and, for each document, count the number of documents that share its value for one property:
def cursor = db.find().limit(1000000)
// alternative: query one of the indexed properties
// def cursor = db.find(new BasicDBObject("property", new BasicDBObject("\$ne", null))).limit(1000000)
cursor.each { doc ->
    def value = doc.get("property")
    def count = db.getCount(new BasicDBObject("property", value))
}
Each "db.getCount()" call was timed, and it turns out that around 40-50% of all queries have a negligible query time (< 1ms), which is the smallest time frame I can measure right now. This needs to be taken into account when evaluating the graphs, as they only show the queries with a duration of at least 1ms (log scale plot).
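The 1ms floor suggests a millisecond-resolution timer (an assumption; the post does not say how the time was taken). Here is a minimal sketch in plain Java of how per-call timing with System.nanoTime() could resolve the fast queries as well. The names CountTiming, timeNanos and shareBelowOneMs are hypothetical, and the Runnable merely stands in for one getCount() call:

```java
public class CountTiming {
    /** Runs the task once and returns the elapsed time in nanoseconds. */
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    /** Share of measurements (given in nanoseconds) that took less than one millisecond. */
    static double shareBelowOneMs(long[] nanos) {
        if (nanos.length == 0) {
            return 0.0;
        }
        int fast = 0;
        for (long n : nanos) {
            if (n < 1_000_000L) {
                fast++;
            }
        }
        return (double) fast / nanos.length;
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        for (int i = 0; i < samples.length; i++) {
            // placeholder workload; in the real test this would be one getCount() call
            samples[i] = timeNanos(() -> Math.sqrt(12345.678));
        }
        System.out.println("share below 1ms: " + shareBelowOneMs(samples));
    }
}
```

With nanosecond timestamps the sub-millisecond queries could be plotted too instead of being dropped from the log scale plot.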
In the plot you see query time versus the result of getCount(). As expected, higher counts may take longer.
Some explanation is necessary for the plots.
random means that I get some documents and count one of the properties (the same for all documents). I do not know the order in which the documents arrive, so they are unrelated to the property I am counting.
correlated means that I query the documents using an index and then count the property that was indexed. The assumption here was that it might be easier for the database to count all documents having a certain property value if I previously queried all documents having a non-null property value.
This holds true for the long index but not for the string index. The latter behaves about the same as my random counts.
The results show that count queries are very fast, but only if indexed.
What we also need for twimpact.com are some more advanced queries. I assume that the results for those also depend on how we design our documents to fit our needs. The design will take some time and I will get back with results of design and advanced queries at a later date.

Friday, 25. September 2009
NoSQL: MongoDB performance testing (part 1: insert)… 
The twimpact.com project currently uses a PostgreSQL database. This works well, except that it does not scale too well in our environment. Removing some indexes actually improved the performance, but I can foresee that the amount of data coming in will slow the application down again.
That is one reason I am looking at non-SQL alternatives. The list includes redis, the Cassandra Project and MongoDB.
I admit I only looked briefly at redis, but this is because it is a very simple key/value store and we do need some query functionality. Playing with Cassandra and its Java driver was awkward, and in the end I had MongoDB up and running in no time.
The setup is as follows:
- 4GB MacBook, 2.4GHz Intel Core 2 Duo, slow disk
- MongoDB: mongodb-osx-x86_64-2009-09-19
- (I had to work in parallel, so there might be some swapping)
Currently the database on a remote server has about 38.000.000 tweets stored. At the start of my testing it contained about 35.000.000. The procedure for the insert test was to copy over batches of 10.000 tweets, as the following pseudo code shows:
// initialize MongoDB (started with a completely new one for each test)
def db = new Mongo("twimpact")
DBCollection coll = db.getCollection("twimpact")
// coll.createIndex(new BasicDBObject("retweet_id", 1)) // long index
// coll.createIndex(new BasicDBObject("from_user", 1)) // short string index
def offset = 0
def limit = 10000
// total number of rows to copy (sql is assumed to be a groovy.sql.Sql instance)
def rowCount = sql.firstRow("SELECT count(*) AS n FROM tweets").n
while (offset < rowCount) {
    // get batch of tweets from PostgreSQL server
    def data = sql.rows("SELECT * FROM tweets OFFSET ${offset} LIMIT ${limit}")
    // convert each row into a document and insert
    data.each { row ->
        BasicDBObject info = new BasicDBObject()
        row.each { key, value ->
            info.put(key, value)
        }
        coll.insert(info)
    }
    offset += data.size()
}
The time was measured for requesting the data from the SQL database (not shown in the graphs) and for the row loop. In the bulk insert test, the row loop first stored 5000 new documents in a pre-allocated array and then inserted them:
…
DBObject[] bulk = new DBObject[5000]
… loop …
// two times as 10000 was too big for the driver
coll.insert(bulk)
...
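The splitting of each 10.000-tweet batch into two bulks can be sketched independently of the driver. This is a minimal plain-Java sketch, not the code used in the test; BulkChunks and chunks are hypothetical names, and the println stands in for the coll.insert(bulk) call:

```java
import java.util.ArrayList;
import java.util.List;

public class BulkChunks {
    /** Splits a batch into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> chunks(List<T> batch, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < batch.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, batch.size());
            // copy the view so each chunk is independent of the original batch
            result.add(new ArrayList<>(batch.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> batch = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            batch.add(i); // placeholder for one converted tweet document
        }
        for (List<Integer> bulk : chunks(batch, 5000)) {
            // in the real test this is where the bulk would be inserted
            System.out.println("bulk of " + bulk.size() + " documents");
        }
    }
}
```

A generic chunk size keeps the loop unchanged if the driver limit turns out to allow larger or smaller bulks.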
The documents we created were not that big, but their structure has some real-world importance to us. It might change as we adapt to the schema-less world, though. Here is a good example:
{
  "id": 3551935825,
  "user_id": 1657468,
  "retweet_id": 15965974,
  "from_user": "thinkberg",
  "from_user_id": 6190551,
  "to_user": null,
  "to_user_id": null,
  "text": "RT @Neurotechnology interesting post, RT @chris23 Augmented Reality Meets Brain-Computer Interface http://bit.ly/3fg9OG",
  "iso_language_code": "en",
  "source": "<a href=\"http://adium.im\" rel=\"nofollow\">Adium</a>",
  "created_at": "Wed Aug 26 2009 06:49:09 GMT+0200 (CEST)",
  "updated_at": "Wed Aug 26 2009 06:50:11 GMT+0200 (CEST)",
  "version": 0,
  "retweet_user_id": null
}
And now for the results. As expected, performance degrades as soon as the database reaches a certain size. MongoDB took about 2.8GB of my RAM and had to create new data files during the process.

The first insert test did not create or update any index, so performance is sustained over the whole run. There are noticeable dips, which probably happened whenever I unlocked the laptop or switched from one application to another.
Looking at the insert with a number (long) index, it appears that the performance degrades slightly and stabilizes shortly after about 20.000.000 inserts. I guess this might be the point where the RAM shortage comes into play, as you can see similar behavior in the string and bulk/string index tests.
Bulk inserting gave a dramatic performance boost. Unfortunately I had to insert each batch in two bulks of 5.000 tweets each, as the driver reported that the "object was too big" when using an array of 10.000 tweets. While single inserts stabilize around 1000 tweets/s at the end, the bulk insert still reached about 1500-2000 tweets/s.
Looking at where the insert performance started and where it ended might lead you to conclude that this is going to be slow, but given my experience with a much smaller PostgreSQL database (~4.000.000 tweets) on this laptop I am impressed. Being able to insert around 1000 tweets/s is way faster than what we experience with the current system at twimpact.com, where we accumulate an analyzer backlog. Given that this test was performed on my laptop and not on a production system, it is to be expected that reality looks much better :-)
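To put the rates into perspective, here is a rough back-of-the-envelope calculation in plain Java. It assumes a constant insert rate, which the test did not have (the rates quoted are the end-of-run values, and earlier rates were higher), so the real import was probably faster. InsertEstimate and hoursFor are hypothetical names:

```java
public class InsertEstimate {
    /** Hours needed to insert the given number of documents at a constant rate. */
    static double hoursFor(long documents, double docsPerSecond) {
        if (docsPerSecond <= 0) {
            throw new IllegalArgumentException("docsPerSecond must be positive");
        }
        return documents / docsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        long tweets = 35_000_000L; // database size at the start of the test
        double[] rates = {1000.0, 1500.0, 2000.0}; // tweets/s, end-of-run values from the text
        for (double rate : rates) {
            System.out.printf("%.0f tweets/s: about %.1f hours%n", rate, hoursFor(tweets, rate));
        }
    }
}
```

At 1000 tweets/s the 35.000.000 tweets take a little under ten hours, and at 2000 tweets/s a little under five.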
But inserting is not all, even though this is what we do a lot. Next I am going to take the database and do some query testing to see whether it fits our needs.

Wednesday, 29. July 2009
twimpact.com - trends by citation 
It feels good to code a little again. Again, social software but this time from the analysis point of view. Check out twimpact.com to see the trends of the last hour bubble up.
All done in grails, which I love.

Friday, 01. May 2009
Re-use replaced backup harddisks 
Now you have a RAID system. It runs perfectly, but it also fills up, as all storage does over time. You buy new 1.5TB hard disks, replacing the old 500GB ones. Now what do you do with those old ones? They are still perfectly healthy disks.
Well, you buy an external SATA dock!
Then you can do off-RAID backups to the disks. Those disks will probably last longer than your DVD backups.

Saturday, 25. April 2009
The next Backup iteration 
Finally I have a backup strategy for my server too. Not actually perfect, but it works for me. I even added a backup of some data from my home RAID system to the server and vice versa. The data is backed up to two different locations (rsync.net and Amazon S3) and additionally to the RAID. Some data, like photos, is transferred from the RAID to the server and from there to Amazon S3. All laptops back up to the RAID. That is too much data to be stored at either offsite location price-wise.
All data transfer is encrypted. The data files are encrypted at both offsite locations, but not on the RAID, for easy access.

Friday, 27. March 2009
Twitter - what? 
In contrast to my last post, I am using twitter now. More for telling the world what we do, than what I personally do for my leisure. It is the only valid way for me. Giving an idea of what's happening in research.

Monday, 23. March 2009
New Job - Industry Liaison Manager 
I have changed jobs and moved away from the Fraunhofer Society to take a post as Industry Liaison Manager for a Machine Learning and Neurotechnology Research group at the Berlin Institute of Technology.
My main focus now will be to manage our industry relations, organize talks and seminars and work on technology transfer. The research project works on non-invasive neurotechnology to improve sensors and data analysis and to apply the results in neuro-usability and other applications related to man-machine interaction.
This is going to be a challenging and most interesting job.

Friday, 23. January 2009
Amazon S3 / WebDAV proxy updated 
I took the liberty to check out my old code and work on it to finally fix some of the problems. It now correctly uses the last-modified time, and the cache handling as well as lazy download from S3 is implemented. To really work with the server it will need better cache handling. After many tests the basic and copymove tests finally run through repeatedly without failure.
Still a long way to go.
Update: (2009-01-28) In the meantime I implemented the property handling which only fails for some strange UTF-8 property values. Now the litmus test runs 99% through. Using MacOS X Finder to test looks promising.

Sunday, 18. January 2009
twitter: the public chat 
I have been following a few friends' twitter messages via Google Reader and I get the impression that it works much like a group chat system. The conversations are similar to cross-linked comments in weblogs and have a similar publicity.
Unlike these friends I never really started to use twitter and even deleted my account there, as well as in a few other social networking systems. I give away so much already, so I don't want to make the harvesting too easy. What strikes me though is why a service like twitter has taken the public chat room away from classic instant messaging systems. It works much like IRC (Internet Relay Chat), where you can just join an open chat. However, it looks crude that you have to read the others' chat to actually communicate.
I guess the real advantage of twitter is the simple user interfaces on loads of different systems, which heavyweight instant messaging systems have failed to provide until now.

Thursday, 08. January 2009
Amazon EU 
Well, it is one big continent (plus the little island). I ordered at amazon.co.uk and when my package arrived by "Deutsche Post" it turned out it was even sent from Bad Hersfeld, Germany. Actually, Amazon should get real and drop the delivery fee from UK to Germany. Seems to be a penalty for ordering in their UK shop.

Saturday, 20. December 2008
Nothing to read … 
I don't know why. There are about a thousand books in my little library, but I cannot find one to read.

Wednesday, 29. October 2008
Logitech Harmony Support: Excellent 
I have had the best telephone support experience ever. To make it easier for my family to operate all the gadgets crammed underneath our TV I have a Logitech Harmony Universal Remote (model 885). This device works quite well, unless ..., unless you leave it uncharged for about one year. Then the battery is gone and reviving it is almost impossible.
Anyway, when I came back from Korea I called the free hotline and, even though I did not expect it, they immediately tested the device online and filed an exchange for me. That was the first good thing.
Now, last week I bought a Dreambox, which is a nice little TV receiver with a built-in disk. It took me a few hours to get my smartcard running, as German cable TV is encrypted. The Harmony remote works okay if you use the default device settings found in the database. However, some keys react slowly and some seem to emit double signals, so it always skips. Reading in some forums I found that I should get a copy of a special Dreambox profile into my user account.
Calling the hotline again, I had a helpful and very friendly person on the phone in less than 30 seconds and, after presenting serial numbers, was escalated to second level support in Canada, where another friendly technician copied the profile into my account in no time.
That is service! No fuss, friendliness and last but not least speed.

Sunday, 28. September 2008
The End and The Beginning 
In three days my assignment to Korea ends. It was a good time, a stressful time, we did meet new friends. I am grateful.
In three days I will be back in Germany. It will be a good time, great changes are showing on the horizon. I am looking forward to it.

Monday, 15. September 2008
The quiet Tokyo 
This is a map of all Buddhist temples I visited during my short holiday trip to Tokyo. Out of the 22 temples I visited I can prove 16 through my pilgrimage book, stamped and nicely signed by the priests. At first I chose the temples randomly but then decided to read a bit more. Here is a nice explanation of Buddhism in Japan. From this site I then selected temples by pilgrimage to give my walks more of a purpose. However, since the Six Amida Pilgrimage (see bottom) can take a while on foot, I decided on a bicycle. You can rent one just outside Kamakura station.
The best trip though was my trip to the Izu peninsula. Here you can either just visit the Shuzenji onsen or visit the similarly named temple and then hike along the Hiragana path to Okunoin temple. To get there I took the Shinkansen to Mishima and then the local railroad to Shuzenji. It then only takes ten minutes to the temple by bus.
It is an excellent way to experience this city and its surroundings.


Saturday, 19. July 2008
New Technology, Old Technology 
Modern technology is a strange thing somehow. It promises to make complicated things simpler. But lately I feel like an engineer a hundred years ago. When something did not work: a bit more oil, a proper kick, and then it usually ran.
I have been through this a few times now. For one, Outlook flatly refused to work with my IMAP server and to this day I do not know why. A few virtual kicks, some more grease in the configuration, and suddenly it worked. I just had the same with a WLAN configuration. That setup, however, felt more like a crank start. A few turns and off it went. Here again the same: in theory I know how it works, and in practice I have no idea why it only worked after reconfiguring and resetting x times.
Well, don't panic, a little nudge and everything works :-)

Saturday, 05. July 2008
Secure Online Banking? 
I think I wrote about this before, but it annoys me every week. Internet banking here in Korea is only possible using a PC with Microsoft Windows. Not only that, it is only possible using Internet Explorer. And that is still not all: I can only do it by installing ActiveX plugins that employ trojan horse like technology to protect me from other trojans and key loggers. Actually, I had to install about 3-4 different plugins before I even see the login, which in itself is a plugin that manages the certificates.
Fortunately I have a Parallels virtual machine to protect me, but it does not protect my online account. I wonder why on earth only here in Korea I have to do it. I have had accounts in different parts of the world and I was always able to use standard web browsers of different kinds to do the online banking.

Sunday, 08. June 2008
i600 = M6200
Looks like the Samsung Blackjack sold in Korea is identical in most parts to the European version i600. When I tried to update the phone using Samsung's MITs Upgrade Wizard the process stopped at 89% for some reason and I was left with a non-functional phone. Fortunately there are lots of adventurous people around who write about their experience flashing phones. While trying to get it back I decided to give it a try and flash WM6.
First I had to find the USB flash mode, which can be enabled by pressing and holding the green "Receiver" button and the power button. This is quite different from what you find on the net elsewhere. But then it all works as advertised.
Important, though, is to run the MITs wizard first in emergency mode to get the original flash files, just in case. Now the phone has a working WM6 with all its pros and cons. One drawback, however, is that the phone's buttons change. In the European version, the number buttons are located in the middle and not on the left side of the keyboard. However, that is something I can live with.

Monday, 31. March 2008
The Random Pick 
How do I get music like Teranoid? Whenever I am in Japan, I enter one of the big book, video and music stores and look randomly through the shelves. I usually end up in the "Hardcore" section, where you find lots of fun stuff to listen to. If you cannot read what you're about to buy this is the way to go for me :-)

Rauschkapsel (noise capsule) 
After months without a car I started driving in Seoul at the end of last year. The traffic is terrible, just like most drivers experience. When I enter the Gangbyeon Expressway near Hannam-dong I usually put myself into a sound capsule. Right now this is Teranoid Overground Edition. Some Japanese techno stuff that simply drives you through.
I tried The Prodigy, but it does not work as well.

Saturday, 08. March 2008
Pyongyang 
I am back from a trip to North Korea. It was a very interesting trip I now have to reflect on. The following pictures are a night and a morning view from the Yanggakdo Hotel. They make Pyongyang look much better than it actually is ...

I have seen many things in North Korea and I need time to think about many things. This is just a tiny little surreal impression.