Thursday, 06. September 2012
Blogging? 
I wonder if I should discontinue the blog as I do not regularly write much. Usually it's just some random thing that comes to mind. Most of the interesting stuff happens on other networks, even though I do value having my ramblings under control. That is not so true on
Facebook or
Twitter. The integration into the blog is rather difficult with ever-changing rules and API's …

Friday, 20. April 2012
DataScience to the rescue 
My colleague Mikio Braun just
gave a talk on our history doing real-time analysis on the
Twitter data stream. It's quite interesting how we started with
jruby and
rails and a normal
RDBMS and worked it until it broke, then moved on to use scala (actually, I was pressing Mikio to get away from ruby and start using a JVM based solution) and the shiny new
Cassandra NOSQL database. Again after some fine tuning and clustering it broke again for our use case.
That's how we ended up, still using scala, to create our own storage and management platform that does exactly what we need and is fast enough to process the stream in real-time.
Of course such a history of a technology fight is always interesting to other developers as we can tell them what to avoid or how to tackle certain problems, be it simple programming stuff or mathematical modeling. We ended up with an idea to bring the experience to the public by creating
datascience-berlin.de, an offer of a number of seminars from basic DataScience to advanced issues and in future even more focussed seminars for specific industries.
http://datascience-berlin.de/ | Twitter:
@datasciencebln 
Saturday, 25. February 2012
Twitter Crowd Attention 
It is an easy task to map where users are, but what are the places people talk about? I have set myself the task to find out and a nice example is now visible in
this youtube video shown mentioned locations during March 2011 when the earthquake hit japan :(

Tuesday, 29. November 2011
TWIMPACT NIPS preparations 
Finally,
as Mikio wrote in our dev blog the analysis is running smoothly. Next we are going to put the non-serial analyzable stuff on the cluster to roll the whole of 2011 into our snapshots. It will be fun to see what are the hotspots. And as a side effect we are looking at a comparison of our own little language detector to the chrome language detector.
What Mikio doesn't know, i am already planning on using more memory for even more fun stuff to analyze.

Sunday, 09. October 2011
Warum der Bundestrojaner so schlecht ist … 
Siehe Google+, wo ich in Ansätzen versuche zu beleuchten wieso Software des Bundes so schlecht ist ...
https://plus.google.com/117122816629542437147/posts/f5WpLAYGqFj 
Monday, 27. June 2011
The curse of the bit dump … 
I wonder where it will end. Everyone is speaking of big data and even
I take part in it.

At first I thought we might end up something like in "Muell", a comic book by Juan Gimenez picturing a world where only a very small part is still habitable and all the rest of it looks like a big dump. Funnily, that is not going to happen. At least I think that we are going in a different direction that has to do with our fascination for big data, which is closely related to "messyism": Stupidly collecting things we no longer need.

A not so long time ago everything would at some point simply disintegrate because the material would not last forever. With the industrial revolution we invented more and more durable materials and now we have this volatile bit which is going to haunt us even longer, because we make it stick in our data centers. Already most of Europe and parts of the rest of our world are covered with data centers. We will need even more if we don't look back at how our ancestors handled information.
Partly due to lack of storage, but also for the sake of simplicity and understanding information was boiled down, filtered and then came part where it was stored (drawn or written down mostly). That's the point where I think we need to think again and focus on the part where understanding and filtering reduces the amount of information.
Still, there is this wish to keep up with the speed of our time and having it all available. At the speed we produce data, the analytics tools that evolve will lose in the end if we try to catch up the bit dumps already there.
We have a chance though to keep the pace by shifting away from big data storage analytics to real-time data analytics that simply keeps our level of knowledge, the insights from our continuous data streams, up-to-date at all times.The resulting data will be much less and much more informative than all the bit muell it came from and it will in the end protect us from ending up to look like Coruscant, a planet of data centers.

Sunday, 06. March 2011
updated arab revolt trends with location detection 
IMPORTANT: To really update the application, give it a few reloads in the browser until you see a map below the keywords and the trending graph.
I have updated our trend detection for the arabic world in two ways:
- I have separated
libya and
iran into their own trackers.
- The trends now also contain retweet information.
- An experimental features shows an approximate coordinate or what is being talked about in the tweet
More to come, the data stream already transmits mention trends, retweet trends, hashtag trends and link trends in addition to what is already there. I just did not have enough time to put it into the interface.

You can now select the tracker in the menu top right:

Enjoy!

Thursday, 17. February 2011
How to use egypt.twimpact.com 

Monday, 31. January 2011
Spike on arrest of Al-Jazeera Journalists 
Here is the spike our trender collected after Al-Jazeera journalists have been taken into custody.

Sunday, 30. January 2011
Following #egypt on twitter 
Trying to follow all the keywords relevant to the egypt crisis can be very difficult. The updates sometimes come in so quickly that it is impossible to read. I've tried it first, tracking all keywords directly. The opportunity to actually provide a useful service is great, so I have installed a special egypt events tracker on
http://egypt.twimpact.com/It tracks retweets that contain the keywords and ranks the keywords over time. All this is done using
Mikios new "squid farm" trending system. This is not even starting to cratch the surface of its possibilities; but who cares. We will install it to track a few million topics on
twimpact.com later on.
It's actually streaming the latest tweets directly to all subscribers. The latest updates are there, a tag cloud showing the relevance of the keywords over time and the trending rate within the last hour.
What's next? I am going to install a link trender to track all links and if I get it working a picture stream from the twitter data.

Saturday, 02. October 2010
Moving a cluster at 200GB/hour 
I just moved the
twimpact cluster from Berlin to Amsterdam. It's about 650km and the current payload on the cluster is about 800GB. We moved at an average rate of 200GB/hour but in the end, with the last traffic jam right before entering Amsterdam we dropped back to an effective rate of 114GB/hour. Still okay, especially since a DELL PowerEdge M1000e Blade Enclosure fits very nicely into the trunk of my car. The only indication that we have loaded something is the weight, pressing down the back of the car.
Today it will be put into the rack and then the holiday part of the trip begins.

Saturday, 18. September 2010
What I've learned in the past few month … 
This is not counting that it keeps a lot of other knowledge alive. And all that after I thought I would never really code again. For those who don't know, my day-to-day job is
knowledge transfer from university to industry in the neuroscience area. A lot of fun too.
Do you love your job? ;-)

Thursday, 15. July 2010
pool size, threads and TIME_WAIT 
We are now in a stage where we put everything together for the new TWIMPACT. Our main analysis, done my
Mikio now runs on
Scala, the database is based on
Cassandra and we have created a messaging infrastructure to distribute further analysis and aggregation tasks.
Now was the time to do some testing on real hardware to see whether it would give us what we want. In a first test using a single analysis thread did run fast at first, but came down to very low numbers storing the data. So he decided to increas the number of threads doing the job and the performance increase is immense. However, at some point our little test program started spitting out strange socket errors about not being able to "
assign the requested address".
It turned out that the host system was accumulating open sockets up to the maximum number around 30k and from then on it did not allow any more new sockets to be opened. However, most of these sockets were in state TIME_WAIT, indicating a closed socket, but not finalized yet.
This did look very strange, I had wrapped
commons-pool in a very straightforward way and the socket factory also looks simple enough. Enabling extensive debug output only revealed normal action until the errors started. Also, the actual pool size never really went above 10. And that's where that idea, lurking in the back of my mind, came rushing forward: The pools maximum amount of idle elements is set to 8 by default, leading to the problem.
- 16 threads take out a socket from the pool (not all 16 are actually active at the same time for some reason).
- The pool creates new sockets on demand.
- After doing a single task the threads put back the socket.
- The pool sometimes counts the amount of idle sockets and closes some to get back to the maximum idle count.
- The closed sockets sit in the system with TIME_WAIT.
As all this happens very fast the pool closes and creates new sockets quickly and the system starts accumulating closed sockets until it breaks.
Setting the pool size to 16 with 16 worker threads works beautifully. Also, we have 32 (1 server, 1 client) open sockets for 16 TSocket connections using the
Thrift API, which did speed up the accumulation of open sockets.
Conclusion:
Set the pool size according to your worker threads. 
Thursday, 13. May 2010
counting lines - a correction for the cassandra test 
First of all, the tests described in my previous post are correct in terms of timing and the amount actually stored in
Cassandra. However, I made a mistake, assuming the data file contained ~42.000.000 data points. After recounting it turned out that there are empty lines between the data points, thus reducing the data count to about 21.000.000. Still cassandra crashed on the laptop after that amount of data for lack of heap space.
Using one of our cluster nodes, giving Cassandra a bit more heap space and using a hard disk that can actually hold the data files the rate for storing batches rose to about 37k per second. A single put test will also follow.
To make my tests somewhat comparable I will copy another ten days into the data file, raising the data count to ~40.000.000. This makes about ~75GB raw json data. With the addition of UUID keys as well as cassandra indexing and some data fill this will blow up the raw space Cassandra needs to an estimated amount ~140GB.
Did I tell, these numbers are just 20 days?

Wednesday, 12. May 2010
NoSQL revisited: Cassandra / Scala 
First, I have to admit, we have not moved on to MongoDB. Sorry. It was nice to play with it, but after a crash on my laptop which rendered the database unusable there were some doubts.
Anyway, we are in dire need of a scalable database solution for
twimpact.com as you can check out for yourself there. The PostgreSQL database is killing the not-so-small server with its IO load right now. The setup may not be optimal but I don't think that the complicated way to SQL database clustering/sharding etc. will help much.
This is where
Cassandra enters the playground. I did have my problems with wording and paradigms but in the end I treated it all as a direct addressable sparse matrix. I got the idea when writing a data file converter for
GNU Octave, writing loads of sparse matrix files.
Ah, and we would not be geeks if we didn't change the programming language in the process too.
Mikio started it by playing around with
Scala for the new retweet matcher. As I was looking for a typed alternative to
Groovy I ended up learning Scala. I can't say I have yet understood all of the concepts but what I have found useful works.
First, I looked up all the different implementations of Cassandra APIs for Java and Scala. The syntactic sugar added by the Scala versions is nice but it also involves a lot of under-the-hood processing sometimes. Before entering a field where I have to fight off issues not under my control I decided to use
hector which is probably the best available access solution in Java right now. However, wrapping hectors results in Scala is not funny. I do not blame this on Hector, but rather Scala's over-use of its own types.
After assessing the needs for twimpact we came up with a simple API necessary to support our analysis and storage use. Mikio already had a working analyser using just a simple key/value store, so I decided to wrap Cassandras key/column system into a just-as-simple put/get. Here's the trait:
trait CassandraStore {
def put(table: String, key: String, name: Array[Byte], value: Array[Byte])
def put(table: String, key: String, columns: List[Column])
def put(table: String, mutations: Map[String, List[Column]])
def get(table: String, key: String, name: Array[Byte]): Array[Byte]
def get(table: String, key: String): List[Column]
def get(table: String, start: String, end: String, count: Int): Map[String, List[Column]]
def remove(table: String, key: String)
def remove(table: String, key: String, name: Array[Byte])
}Simple, simple. I then implemented it two times, one just wrapped hector's api and then I decided to do my own direct
thrift implementation. Which is, er, well, … a bit more work.
So, the hector implementation played the role of the base to test against. At first, just doing the most simple implementations worked quite well until I started to store more than 1000 keys sequentially. It broke, because I created a complete new transport and client for each put/get request. No wonder you might say, but hey I do like to start simple. To overcome this problem I wanted a connection pool, which is useful anyway and there are existing implementations already out there in the form of
apache commons-pool. You will find Scala wrappers for it in some of the Cassandra projects, i.e.
Cassidy. This is by the way the most beautiful API design I have found for Cassandra. For the purpose of learning and because I am always trying to improve (some called me a perfectionist for this) I opted for my own pool wapping. I am going to release it some time after this post, as soon as I have found out how to host a maven repo for it.
Now we have a simple API, a connection pool and some code to drive it (a ThriftSocketFactory and a corresponding keyed pool). What do we need next? Yes, some data. I appended 9 days in April of raw retweet data and gzipped it into one singel 9.7GB file.
Time to start testing the cassandra. Like before with MongoDB I did it on my laptop that has the occasional 4GB of RAM and about 50GB free space on the harddisk. The test program takes the raw json data from the file as stdin and, parses the json, creates data entries in the form of key:column and stores them using the aforementioned self-made API.
I always read in 10.000 retweet lines and then store them. Test 1 just loops through the 10.000 entries and does a single put(key, name, value) and the second test just does a batch insert.
Here is the result:

All worked well for the single inserts until around 21.000.000 inserts. This was the point where the Cassandra process found it did not have enough heap space anymore and additionally the hard disk was full. The same happened much earlier with the bulk insert. It looks like Cassandra needs much more space on disk for bulk inserts.
I have to admit that Cassandra was running with the standard setup of only 1GB heap space and disk full is usually a point where the fan and excrement issue happens. A tragic incident of this has been given to protocol
here by reddit, which is a good read for all who plan to go a similar web noSQL way.
Right now I am testing cassandra on one of our cluster nodes to see how far we can get with a more relaxed disk/RAM environment as we do not just have to take care of 9 days in April (~43.000.000 data points), but rather 500.000.000 data points. So, adding more than just one node is mandatory.

Thursday, 04. March 2010
Japanese TWIMPACT beta 

We are now running a beta site for TWIMPACT for Japan only. It only works with japanese tweets and works quite well. What is interesting is the battle between a former politician
555hamako and
masason the president of
Softbank, a large telecommunications company in Japan.
First,
masason started out december 2009 with a quick rise to the top. Then
555hamako followed beginning of 2010 (the year of elections) with an even steeper rise to take the crown. Also, it looks like masason has not managed to attract the same size of an audience as before, his TWIMPACT stalls a little at the end. Maybe he was just watching the winter Olympics.
I guess we will be starting to adapt the
TWIMPACT rating to degrade over time to provide a better view of the current impact a user has. Even though it is hard to keep on rising one keeps its TWIMPACT at the moment. This effect is much more visible on the
global site where you can find a lot of not-so-spammy-spam twitterers that rise quickly and should fall down over time again after they hit the ceiling.

Thursday, 01. October 2009
NoSQL: MongoDB performance testing (part 2: counting)… 
After my insert tests last time I decided to look at some count queries as we do count a lot at
twimpact.com. As a first result I can say that without any index count makes no sense with a database of this size.
I have used the database left over from my last insert test and added a few indexes which takes around 30-40 minutes per index. I did not check in more detail about the time it takes as we tend to create the index while working on the database anyway.
Now for todays results. The queries are quite simple, but in our case practical. I get a cursor for 1.000.000 documents as a result of a simple query and count the amount of documents that have the value of one of the documents properties:
def cursor = db.find().limit(1000000)
// alternative: query one of the indexed properties
// def cursor = db.find(new BasicDBObject("property", new BasicDBObject("\$ne", null))).limit(1000000)cursor.each { doc ->
def value = doc.get("property")
def count = db.getCount(new BasicDBObject("property", value))
}
The time was taken for each of the
"db.getCount()" calls and it turns out that around 40-50% of all queries result in negligible query time (< 1ms) which is the smallest time frame I can measure right now. This needs to be taking into account when evaluating the graphs as they only show the queries with at least 1ms duration (log scale plot).
In the plot you see query time versus the result of getCount(). As expected higher counts may take longer,
Some explanation is necessary for the plots.
random means that I get some documents and count one of the properties (the same for all documents). I do not know the order in which the documents come, so they are unrelated to the property I am counting.
correlated is the counting if I query the documents using an index and the count the property that was indexed. The assumption here was that it might be easier for the database to count all documents having a certain property value if I previously queried all documents having a non-null property value.
This holds true for the
long index but not for the
string index. The latter behaves about the same as my random counts.
The results show that count queries are very fast, but only if indexed.
What we also need for
twimpact.com are some more advanced queries. I assume that the results for those also depend on how we design our documents to fit our needs. The design will take some time and I will get back with results of design and advanced queries at a later date.

Friday, 25. September 2009
NoSQL : MongoDB performance testing (part 1: insert)… 
The
twimpact.com project currently uses a
PostgreSQL. This is all well, except that it does not scale too well in our environment. Removing some indexes actually improved the performance but I can foresee that the amount of data coming in will slow the application down again.
That is a reason I am looking at non-SQL alternatives. The list includes
redis, the
Cassandra Project and
MongoDB.
I do admit, I only looked shortly at redis, but this is due to the fact that it is a very simple key/value store and we do need some query functionality. Some playing with Cassandra and the Java driver was awkward and in the end I had MongoDB up and running in no time.
The setup is as follows:
- 4GB MacBook, 2.4Ghz Intel Core 2 Duo, slow disk
- MongoDB: mongodb-osx-x86_64-2009-09-19
- (i had to work in parallel, so there might be some swapping)
Currently the database on a remote server has about 38.000.000 tweets stored. At the start of my testing it contained about 35.000.000. The procedure to do the
insert test was to copy over batches of 10.000 tweets like the following pseudo code shows:
// initialize MongoDB (started with a complete new one for each test)
def db = new Mongo("twimpact")
DBCollection coll = db.getCollection("twimpact");
// coll.createIndex(new BasicDBObject("retweet_id", 1)) // long index
// coll.createIndex(new BasicDBObject("from_user", 1)) // short string indexdef offset = 0
def limit = 10000
def rowCount = sql.count("tweets")while(offset < rowCount) {
// get batch of tweets form PostgreSQL server
def data = sql.rows("SELECT * FROM tweets OFFSET ${offset} LIMIT ${limit}")
// convert each row into a document and insert
data.each { row ->
BasicDBObject info = new BasicDBObject();
row.each { key, value ->
info.put(key, value);
}
coll.insert(info);
}
offset += data.size()
}The time was taken for requesting the data from the SQL data (not shown in the graphs) and for the row loop. In case of the bulk insert test the row loop first stored 5000 new documents in a pre-allocated array and then inserted them:
…
DBObject[] bulk = DBObject[5000]
… loop …
// two times as 10000 was too big for the driver
coll.insert(bulk)
...
The documents we created were not that big, but have some real-world importance to use with their structure. They might be changed to adapt to the non-schema world though. Here is a good example:
{
"id": 3551935825,
"user_id": 1657468,
"retweet_id": 15965974 ,
"from_user": "thinkberg",
"from_user_id": 6190551,
"to_user": null ,
"to_user_id": null,
"text": "RT @Neurotechnology interesting post, RT @chris23 Augmented Reality Meets Brain-Computer Interface
http://bit.ly/3fg9OG",
"iso_language_code": "en",
"source": "<a href="
http://adium.im" rel="nofollow">Adium</a>",
"created_at": "Wed Aug 26 2009 06:49:09 GMT+0200 (CEST)",
"updated_at": "Wed Aug 26 2009 06:50:11 GMT+0200 (CEST)",
"version": 0,
"retweet_user_id": null
}And now for the results. Just like expected there is a downgrade in performance as soon as a certain size of the database is reached. MongoDB took about 2.8GB of my RAM and had to create new data files during the process.

The first insert test did not create or update any index so there is a sustained performance over the whole time. There are remarkable dips which probably happened whenever I unlocked the laptop or switched from one application to another.
Looking at the insert with a number (long) index it appears that the performance degrades slightly and stabilizes shortly after about 20.000.000 inserts. I guess this might be the point where RAM shortness comes into play as you can see similar behavior in the string and bulk/string index tests.
A dramatic performance boost had the bulk inserting. Unfortunately I had to insert each batch in two bulks of 5.000 tweets each as the driver reported that the object was too big" when using an array of 10.000 tweets. While single inserts stabilize around 1000 tweets/s at the end, the bulk insert still reached about 1500-2000 tweets/s.
Looking at where the insert performance started and where it ended might let you conclude that this is going to be slow, but from my experience with a much smaller PostgreSQL database (~4.000.000 tweets) on this laptop I am impressed. Being able to insert around 1000 tweets/s is way faster than what we experience with the current system at
twimpact.com where we accumulate an analyzer backlog. Given the fact that this test was performed on my laptop and not a production system it is to be expected that the reality looks much better :-)
But inserting is not all, even though this is what we do a lot. Next I am going to take the database and do some query testing to see whether it fits our needs.

Wednesday, 29. July 2009
twimpact.com - trends by citation 
It feels good to code a little again. Again, social software but this time from the analysis point of view. Check out
twimpact.com to see the trends of the last hour bubble up.
All done in
grails, which I love.

Friday, 01. May 2009
Re-use replaced backup harddisks 
Now you have RAID system. It runs perfectly, but it also runs full as all storages do over time. You buy new 1.5TB harddisks, replacing the old 500GB ones. Now what do you do with those old ones? They are still perfectly healthy disks.
Well, you buy an
external SATA dock!
Then you can do off-RAID backup to the disks. Those disks probably last longer than your DVD backups.