    Friday, 31. January 2014

    DEPRECATED

    This is just the archive of my old blog. See >>http://thinkberg.com/ for the latest posts. PermaLink

    Thursday, 06. September 2012

    Blogging?

    I wonder if I should discontinue the blog, as I do not write much regularly. Usually it's just some random thing that comes to mind. Most of the interesting stuff happens on other networks, even though I do value having my ramblings under my own control; that is not the case on >>Facebook or >>Twitter. Integrating them into the blog is rather difficult with their ever-changing rules and APIs … PermaLink

    Friday, 20. April 2012

    DataScience to the rescue

    My colleague Mikio Braun just >>gave a talk on our history of doing real-time analysis on the >>Twitter data stream. It's quite interesting how we started with >>jruby, >>rails and a normal >>RDBMS and worked it until it broke, then moved on to Scala (actually, I was pressing Mikio to get away from Ruby and use a JVM-based solution) and the shiny new >>Cassandra NoSQL database. After some fine-tuning and clustering it broke again for our use case.

    That's how we ended up, still using Scala, creating our own storage and management platform that does exactly what we need and is fast enough to process the stream in real time.

    Of course such a history of a technology fight is always interesting to other developers, as we can tell them what to avoid and how to tackle certain problems, be it simple programming issues or mathematical modeling. We ended up with the idea of bringing this experience to the public by creating >>datascience-berlin.de, an offering of seminars ranging from basic DataScience to advanced topics and, in the future, even more focused seminars for specific industries.

    >>http://datascience-berlin.de/ | Twitter: >>@datasciencebln PermaLink

    Saturday, 25. February 2012

    Twitter Crowd Attention

    It is an easy task to map where users are, but what are the places people talk about? I have set myself the task of finding out, and a nice example is now visible in >>this youtube video, showing the locations mentioned during March 2011, when the earthquake hit Japan :(

    2011-03-10@23:10 PermaLink

    Tuesday, 29. November 2011

    TWIMPACT NIPS preparations

    Finally, >>as Mikio wrote in our dev blog, the analysis is running smoothly. Next we are going to put the non-serially analyzable stuff on the cluster to roll the whole of 2011 into our snapshots. It will be fun to see what the hotspots are. And as a side effect, we are looking at a comparison of our own little language detector with the Chrome language detector.

    What Mikio doesn't know yet: I am already planning to use more memory for even more fun stuff to analyze. PermaLink

    Sunday, 09. October 2011

    Why the Bundestrojaner is so bad …

    See Google+, where I make a first attempt at shedding light on why the federal government's software is so bad ...

    >>https://plus.google.com/117122816629542437147/posts/f5WpLAYGqFj PermaLink

    Monday, 27. June 2011

    The curse of the bit dump …

    I wonder where it will end. Everyone is speaking of big data and even >>I take part in it.

    At first I thought we might end up with something like the world in "Muell", a comic book by Juan Gimenez picturing a world where only a very small part is still habitable and all the rest looks like one big dump. Funnily enough, that is not going to happen. At least I think we are going in a different direction, one that has to do with our fascination for big data and is closely related to "messyism": stupidly collecting things we no longer need.

    Not so long ago, everything would at some point simply disintegrate because the material would not last forever. With the industrial revolution we invented more and more durable materials, and now we have the volatile bit, which is going to haunt us even longer because we make it stick in our data centers. Large parts of Europe and of the rest of the world are already covered with data centers, and we will need even more if we don't look back at how our ancestors handled information.

    Partly due to a lack of storage, but also for the sake of simplicity and understanding, information used to be boiled down and filtered, and only then came the part where it was stored (mostly drawn or written down). That is the point where I think we need to think again and focus on the part where understanding and filtering reduce the amount of information.

    Still, there is this wish to keep up with the speed of our time and to have it all available. At the speed we produce data, the evolving analytics tools will lose in the end if we try to catch up with the bit dumps already there.

    We have a chance to keep the pace, though, by shifting away from big-data storage analytics to real-time data analytics that simply keeps our level of knowledge, the insights from our continuous data streams, up to date at all times.

    The resulting data will be much smaller and much more informative than all the bit muell it came from, and it will in the end protect us from ending up looking like Coruscant, a planet covered in data centers.

    PermaLink

    Sunday, 06. March 2011

    updated arab revolt trends with location detection

    IMPORTANT: To really get the updated application, reload the page a few times in the browser until you see a map below the keywords and the trending graph.

    I have updated our trend detection for the Arabic world in three ways:

    1. I have separated >>libya and >>iran into their own trackers.
    2. The trends now also contain retweet information.
    3. An experimental feature shows an approximate coordinate for what is being talked about in the tweet.

    More to come: the data stream already transmits mention trends, retweet trends, hashtag trends and link trends in addition to what is already there. I just did not have enough time to put it all into the interface yet.

    You can now select the tracker in the menu top right:

    (screenshot: the tracker selection menu)

    Enjoy! PermaLink

    Thursday, 17. February 2011

    How to use egypt.twimpact.com

    >>http://egypt.twimpact.com PermaLink

    Monday, 31. January 2011

    Spike on arrest of Al-Jazeera Journalists

    Here is the spike our trender collected after Al-Jazeera journalists were taken into custody.

    (image: space/start/2011-01-31/1/wike.png) PermaLink

    Sunday, 30. January 2011

    Following #egypt on twitter

    Trying to follow all the keywords relevant to the Egypt crisis can be very difficult. The updates sometimes come in so quickly that it is impossible to read them. I tried it first by tracking all keywords directly. The opportunity to actually provide a useful service is great, so I have installed a special Egypt events tracker at

    >>http://egypt.twimpact.com/

    It tracks retweets that contain the keywords and ranks the keywords over time. All this is done using >>Mikio's new "squid farm" trending system. This is not even starting to scratch the surface of its possibilities, but who cares; we will install it to track a few million topics on >>twimpact.com later on.

    The site also streams the latest tweets directly to all subscribers. The latest updates are there, along with a tag cloud showing the relevance of the keywords over time and the trending rate within the last hour.

    What's next? I am going to install a link trender to track all links and, if I get it working, a picture stream from the twitter data. PermaLink

    Saturday, 02. October 2010

    Moving a cluster at 200GB/hour

    I just moved the >>twimpact cluster from Berlin to Amsterdam. It's about 650km, and the current payload on the cluster is about 800GB. We moved at an average rate of 200GB/hour, but in the end, with the last traffic jam right before entering Amsterdam, we dropped back to an effective rate of 114GB/hour. Still okay, especially since a DELL PowerEdge M1000e Blade Enclosure fits very nicely into the trunk of my car. The only indication that we had loaded something was the weight pressing down the back of the car.
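    Just for fun, the effective rate in network terms, as a back-of-the-envelope sketch (the seven hours are my assumption, implied by 800GB at 114GB/hour):

    val payloadGB = 800.0
    val tripHours = 7.0                           // assumed: 800GB / 114GB per hour
    val gbPerHour = payloadGB / tripHours         // ≈ 114 GB/hour
    val mbitPerSec = gbPerHour * 8 * 1000 / 3600  // ≈ 254 Mbit/s sustained
    println(gbPerHour + " GB/hour is about " + mbitPerSec + " Mbit/s")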

    Today it will be put into the rack and then the holiday part of the trip begins. PermaLink

    Saturday, 18. September 2010

    What I've learned in the past few months …

    And this is not counting the fact that it keeps a lot of other knowledge alive. All that after I thought I would never really code again. For those who don't know: my day-to-day job is >>knowledge transfer from university to industry in the neuroscience area. A lot of fun, too.

    Do you love your job? ;-) PermaLink

    Thursday, 15. July 2010

    pool size, threads and TIME_WAIT

    We are now at the stage where we put everything together for the new TWIMPACT. Our main analysis, done by >>Mikio, now runs on >>Scala, the database is based on >>Cassandra, and we have created a messaging infrastructure to distribute further analysis and aggregation tasks.

    Now was the time to do some testing on real hardware to see whether it would give us what we want. In a first test, a single analysis thread ran fast at first but came down to very low numbers when storing the data. So Mikio decided to increase the number of threads doing the job, and the performance increase was immense. However, at some point our little test program started spitting out strange socket errors about not being able to "assign the requested address".

    It turned out that the host system was accumulating open sockets up to the maximum of around 30k, and from then on it did not allow any new sockets to be opened. Most of these sockets, however, were in state TIME_WAIT, which indicates a socket that has been closed but not finalized yet.

    This looked very strange: I had wrapped >>commons-pool in a very straightforward way, and the socket factory also looks simple enough. Enabling extensive debug output only revealed normal activity until the errors started. Also, the actual pool size never really went above 10. And that's where the idea lurking in the back of my mind came rushing forward: the pool's maximum number of idle elements is set to 8 by default, and that leads to the problem.

    • 16 threads take a socket out of the pool (for some reason not all 16 are actually active at the same time).
    • The pool creates new sockets on demand.
    • After completing a single task, each thread puts its socket back.
    • The pool occasionally counts the number of idle sockets and closes some to get back down to the maximum idle count.
    • The closed sockets sit in the system in TIME_WAIT.

    As all this happens very fast, the pool closes and creates new sockets quickly, and the system accumulates closed sockets until it breaks.

    Setting the pool size to 16 with 16 worker threads works beautifully. Also, with the >>Thrift API we have 32 open sockets (1 on the server, 1 on the client) for 16 TSocket connections, which sped up the accumulation of open sockets even more.

    Conclusion: set the pool size, including the maximum number of idle elements, according to your number of worker threads.
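    As an illustration, here is roughly what that configuration looks like with commons-pool 1.x and Thrift's TSocket. This is only a sketch: the ThriftSocketFactory below is a stand-in for our actual factory, and host and port are placeholders.

    import org.apache.commons.pool.BasePoolableObjectFactory
    import org.apache.commons.pool.impl.GenericObjectPool
    import org.apache.thrift.transport.TSocket

    // stand-in for the real socket factory: creates and destroys pooled Thrift sockets
    class ThriftSocketFactory(host: String, port: Int) extends BasePoolableObjectFactory {
      override def makeObject(): AnyRef = {
        val socket = new TSocket(host, port)
        socket.open()
        socket
      }
      override def destroyObject(obj: AnyRef) {
        obj.asInstanceOf[TSocket].close()
      }
    }

    val workerThreads = 16
    val pool = new GenericObjectPool(new ThriftSocketFactory("localhost", 9160))
    pool.setMaxActive(workerThreads) // at most one socket per worker
    pool.setMaxIdle(workerThreads)   // the default of 8 closes returned sockets above that,
                                     // which is exactly what piled up the TIME_WAIT sockets

    // workers borrow and return sockets instead of opening new ones for every task
    val socket = pool.borrowObject().asInstanceOf[TSocket]
    try { /* do the Thrift calls */ } finally { pool.returnObject(socket) }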

    PermaLink

    Thursday, 13. May 2010

    counting lines - a correction for the cassandra test

    First of all, the tests described in my previous post are correct in terms of timing and the amount actually stored in >>Cassandra. However, I made a mistake in assuming the data file contained ~42.000.000 data points. After recounting, it turned out that there are empty lines between the data points, which reduces the data count to about 21.000.000. Still, Cassandra crashed on the laptop after that amount of data for lack of heap space.
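    The recount itself is a one-liner; something along these lines (just a sketch, reading the raw file from stdin) already shows the difference:

    // count only non-blank lines; the empty separator lines are skipped
    val dataPoints = scala.io.Source.stdin.getLines().count(_.trim.nonEmpty)
    println(dataPoints)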

    Using one of our cluster nodes, giving Cassandra a bit more heap space, and using a hard disk that can actually hold the data files, the rate for storing batches rose to about 37k per second. A single-put test will follow as well.

    To make my tests somewhat comparable, I will copy another ten days into the data file, raising the data count to ~40.000.000. That makes about ~75GB of raw JSON data. With the addition of UUID keys, Cassandra's indexing and some data fill, this will blow up the raw space Cassandra needs to an estimated ~140GB.

    Did I mention that these numbers cover just 20 days? PermaLink

    Wednesday, 12. May 2010

    NoSQL revisited: Cassandra / Scala

    First, I have to admit that we have not moved on to MongoDB. Sorry. It was nice to play with it, but after a crash on my laptop that rendered the database unusable, there were some doubts.

    Anyway, we are in dire need of a scalable database solution for >>twimpact.com, as you can check out for yourself there. The PostgreSQL database is killing the not-so-small server with its IO load right now. The setup may not be optimal, but I don't think that the complicated route of SQL database clustering/sharding etc. would help much.

    This is where >>Cassandra enters the playground. I did have my problems with its wording and paradigms, but in the end I treated it all as a directly addressable sparse matrix. I got the idea when writing a data file converter for >>GNU Octave, writing loads of sparse matrix files.
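    Just to illustrate the mental model (this is only an illustration of the idea, not how the data is actually stored; the key prefix and the values are made up):

    // a sparse matrix addressed directly by (row key, column name)
    val sparse = scala.collection.mutable.Map[(String, String), Array[Byte]]()
    sparse(("tweet:3551935825", "from_user")) = "thinkberg".getBytes("UTF-8")

    sparse.get(("tweet:3551935825", "from_user"))  // Some(...): direct addressing
    sparse.get(("tweet:3551935825", "to_user"))    // None: empty cells cost nothing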

    Ah, and we would not be geeks if we didn't change the programming language in the process, too. >>Mikio started it by playing around with >>Scala for the new retweet matcher. As I was looking for a typed alternative to >>Groovy, I ended up learning Scala. I can't say I have understood all of its concepts yet, but what I have found useful works.

    First, I looked at all the different implementations of Cassandra APIs for Java and Scala. The syntactic sugar added by the Scala versions is nice, but it sometimes involves a lot of under-the-hood processing. Before entering a field where I have to fight off issues that are not under my control, I decided to use >>hector, which is probably the best available access solution in Java right now. However, wrapping hector's results in Scala is not fun. I do not blame this on hector, but rather on Scala's over-use of its own types.

    After assessing the needs of twimpact, we came up with a simple API necessary to support our analysis and storage use. Mikio already had a working analyser using just a simple key/value store, so I decided to wrap Cassandra's key/column system into a just-as-simple put/get. Here's the trait:

    // a thin facade over Cassandra's data model: "table" is the column family,
    // "key" the row key, and "name"/"value" a single column within that row
    trait CassandraStore {
      def put(table: String, key: String, name: Array[Byte], value: Array[Byte])  // single column
      def put(table: String, key: String, columns: List[Column])                  // several columns of one row
      def put(table: String, mutations: Map[String, List[Column]])                // batch across many rows
      def get(table: String, key: String, name: Array[Byte]): Array[Byte]
      def get(table: String, key: String): List[Column]
      def get(table: String, start: String, end: String, count: Int): Map[String, List[Column]]  // key range scan
      def remove(table: String, key: String)
      def remove(table: String, key: String, name: Array[Byte])
    }

    Simple, simple. I then implemented it twice: one implementation just wraps hector's API, and then I decided to do my own direct >>thrift implementation. Which is, er, well, … a bit more work.

    So, the hector implementation played the role of the baseline to test against. At first, the most simple implementation worked quite well, until I started to store more than 1000 keys sequentially. It broke because I created a completely new transport and client for each put/get request. No wonder, you might say, but hey, I do like to start simple. To overcome this problem I wanted a connection pool, which is useful anyway, and there are existing implementations out there in the form of >>apache commons-pool. You will find Scala wrappers for it in some of the Cassandra projects, e.g. >>Cassidy, which is by the way the most beautiful API design I have found for Cassandra. For the purpose of learning, and because I am always trying to improve (some have called me a perfectionist for this), I opted for my own pool wrapping. I am going to release it some time after this post, as soon as I have figured out how to host a maven repo for it.

    Now we have a simple API, a connection pool and some code to drive it (a ThriftSocketFactory and a corresponding keyed pool). What do we need next? Yes, some data. I appended 9 days of raw retweet data from April and gzipped it into one single 9.7GB file.

    Time to start testing Cassandra. Like before with MongoDB, I did it on my laptop, which has the occasional 4GB of RAM and about 50GB of free space on the hard disk. The test program takes the raw JSON data from the file on stdin, parses the JSON, creates data entries in the form of key:column and stores them using the aforementioned self-made API.

    I always read in 10.000 retweet lines and then store them. Test 1 just loops through the 10.000 entries and does a single put(key, name, value) for each, while the second test does one batch insert.
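    Against the CassandraStore trait from above, the two test variants look roughly like this (a sketch: the "retweets" table name is made up and Column is assumed to be the Thrift-generated type):

    import org.apache.cassandra.thrift.Column

    // test 1: one put per column, i.e. one round-trip to Cassandra for every column
    def singlePuts(store: CassandraStore, batch: Map[String, List[Column]]) {
      for ((key, columns) <- batch; column <- columns)
        store.put("retweets", key, column.getName, column.getValue)
    }

    // test 2: hand the whole batch to the store in a single call
    def batchPut(store: CassandraStore, batch: Map[String, List[Column]]) {
      store.put("retweets", batch)
    }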

    Here is the result (chart: Cassandra insert test).

    All went well for the single inserts until around 21.000.000 inserts. That was the point where the Cassandra process found it did not have enough heap space anymore, and additionally the hard disk was full. The same happened much earlier with the bulk insert; it looks like Cassandra needs much more space on disk for bulk inserts.

    I have to admit that Cassandra was running with the standard setup of only 1GB of heap space, and a full disk is usually the point where the fan and the excrement meet. A tragic incident of this kind has been put on record >>here by reddit, which is a good read for everyone who plans to go down a similar web NoSQL path.

    Right now I am testing Cassandra on one of our cluster nodes to see how far we can get with a more relaxed disk/RAM environment, as we do not just have to take care of 9 days in April (~43.000.000 data points), but rather of 500.000.000 data points. So adding more than just one node is mandatory. PermaLink

    Thursday, 04. March 2010

    Japanese TWIMPACT beta

    (chart: masason vs. 555hamako impact)

    We are now running a beta site of TWIMPACT for Japan. It only handles Japanese tweets and works quite well. What is interesting is the battle between a former politician, >>555hamako, and >>masason, the president of >>Softbank, a large telecommunications company in Japan.

    First, >>masason started out in December 2009 with a quick rise to the top. Then >>555hamako followed at the beginning of 2010 (the year of elections) with an even steeper rise to take the crown. Also, it looks like masason has not managed to attract the same size of audience as before; his TWIMPACT stalls a little at the end. Maybe he was just watching the Winter Olympics.

    I guess we will start adapting the >>TWIMPACT rating to degrade over time, to provide a better view of the current impact a user has (one way to do that is sketched below). Even though it is hard to keep rising, right now one keeps one's TWIMPACT once earned. This effect is much more visible on the >>global site, where you can find a lot of not-so-spammy-spam twitterers that rise quickly and should fall back down over time after they hit the ceiling. PermaLink
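    One hypothetical way to let the rating degrade, assuming nothing about the actual TWIMPACT formula: exponential decay with a configurable half-life (the 30 days below are made up):

    // the contribution to the rating halves every halfLifeDays after it was earned
    def decayedRating(rating: Double, ageInDays: Double, halfLifeDays: Double = 30.0): Double =
      rating * math.pow(0.5, ageInDays / halfLifeDays)

    decayedRating(1000.0, 0.0)   // 1000.0: fresh impact counts fully
    decayedRating(1000.0, 60.0)  // 250.0: two half-lives later only a quarter remains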

    Thursday, 01. October 2009

    NoSQL: MongoDB performance testing (part 2: counting)…

    After my insert tests last time, I decided to look at some count queries, as we count a lot at >>twimpact.com. As a first result I can say that without any index, counting makes no sense on a database of this size.

    I used the database left over from my last insert test and added a few indexes, which takes around 30-40 minutes per index. I did not look at the index build time in more detail, as we tend to create the indexes while working on the database anyway.

    Now for today's results. The queries are quite simple, but practical in our case. I get a cursor for 1.000.000 documents as the result of a simple query and, for each document, count the documents that share the value of one of its properties:

    def cursor = db.find().limit(1000000)
    // alternative: query one of the indexed properties instead
    // def cursor = db.find(new BasicDBObject("property", new BasicDBObject("\$ne", null))).limit(1000000)

    // "db" is the tweet collection from the insert test (find() and getCount() are DBCollection methods)
    cursor.each { doc ->
      def value = doc.get("property")
      def count = db.getCount(new BasicDBObject("property", value))
    }

    (chart: MongoDB query test)

    The time was taken for each of the db.getCount() calls, and it turns out that around 40-50% of all queries finish in negligible time (< 1ms), which is the smallest time frame I can measure right now. This needs to be taken into account when evaluating the graphs, as they only show the queries with at least 1ms duration (log-scale plot).

    In the plot you see the query time versus the result of getCount(). As expected, higher counts may take longer.

    Some explanation is necessary for the plots. random means that I fetch some documents and count on one of their properties (the same property for all documents). I do not know the order in which the documents come, so they are unrelated to the property I am counting. correlated is the counting when I query the documents using an index and then count on the very property that was indexed. The assumption here was that it might be easier for the database to count all documents having a certain property value if I previously queried all documents having a non-null value for that property.

    This holds true for the long index but not for the string index. The latter behaves about the same as my random counts.

    The results show that count queries are very fast, but only if indexed.

    What we also need for >>twimpact.com are some more advanced queries. I assume that the results for those will also depend on how we design our documents to fit our needs. The design will take some time, and I will get back with the results of the design and the advanced queries at a later date. PermaLink

    Friday, 25. September 2009

    NoSQL: MongoDB performance testing (part 1: insert)…

    The >>twimpact.com project currently uses >>PostgreSQL. This is all well, except that it does not scale too well in our environment. Removing some indexes actually improved the performance, but I can foresee that the amount of data coming in will slow the application down again.

    That is a reason I am looking at non-SQL alternatives. The list includes >>redis, the >>Cassandra Project and >>MongoDB.

    I admit I only looked briefly at redis, but that is due to the fact that it is a very simple key/value store and we do need some query functionality. Playing with Cassandra and its Java driver felt awkward, while I had MongoDB up and running in no time.

    The setup is as follows:

    • 4GB MacBook, 2.4GHz Intel Core 2 Duo, slow disk
    • MongoDB: mongodb-osx-x86_64-2009-09-19
    • (I had to work in parallel, so there might be some swapping)

    Currently the database on a remote server has about 38.000.000 tweets stored. At the start of my testing it contained about 35.000.000. The procedure for the insert test was to copy over batches of 10.000 tweets, as the following pseudo code shows:

    // initialize MongoDB (started with a completely new database for each test)
    def db = new Mongo().getDB("twimpact")
    DBCollection coll = db.getCollection("twimpact")
    // coll.createIndex(new BasicDBObject("retweet_id", 1)) // long index
    // coll.createIndex(new BasicDBObject("from_user", 1))  // short string index

    def offset = 0
    def limit = 10000
    // "sql" is the groovy.sql.Sql connection to the PostgreSQL source database
    def rowCount = sql.firstRow("SELECT count(*) AS n FROM tweets").n

    while (offset < rowCount) {
      // get a batch of tweets from the PostgreSQL server
      def data = sql.rows("SELECT * FROM tweets OFFSET ${offset} LIMIT ${limit}")
      // convert each row into a document and insert it
      data.each { row ->
        BasicDBObject info = new BasicDBObject()
        row.each { key, value -> info.put(key, value) }
        coll.insert(info)
      }
      offset += data.size()
    }

    The time was measured for requesting the data from the SQL database (not shown in the graphs) and for the row loop. In the case of the bulk insert test, the row loop first stored 5000 new documents in a pre-allocated array and then inserted them:

    …
      DBObject[] bulk = new DBObject[5000]
      … loop …
      // insert two bulks per batch, as 10.000 documents at once were too big for the driver
      coll.insert(bulk)
    …

    The documents we created are not that big, but their structure has some real-world relevance for us. They might be changed to adapt to the schema-less world, though. Here is a typical example:

    {
      "id": 3551935825,
      "user_id": 1657468,
      "retweet_id": 15965974,
      "from_user": "thinkberg",
      "from_user_id": 6190551,
      "to_user": null,
      "to_user_id": null,
      "text": "RT @Neurotechnology interesting post, RT @chris23 Augmented Reality Meets Brain-Computer Interface http://bit.ly/3fg9OG",
      "iso_language_code": "en",
      "source": "<a href=\"http://adium.im\" rel=\"nofollow\">Adium</a>",
      "created_at": "Wed Aug 26 2009 06:49:09 GMT+0200 (CEST)",
      "updated_at": "Wed Aug 26 2009 06:50:11 GMT+0200 (CEST)",
      "version": 0,
      "retweet_user_id": null
    }

    And now for the results. Just as expected, there is a degradation in performance as soon as the database reaches a certain size. MongoDB took about 2.8GB of my RAM and had to create new data files during the process.

    (chart: MongoDB insert statistics)

    The first insert test did not create or update any index, so there is sustained performance over the whole run. There are remarkable dips, which probably happened whenever I unlocked the laptop or switched from one application to another.

    Looking at the insert with a number (long) index, it appears that the performance degrades slightly and stabilizes shortly after about 20.000.000 inserts. I guess this might be the point where the RAM shortage comes into play, as you can see similar behavior in the string and bulk/string index tests.

    The bulk inserts brought a dramatic performance boost. Unfortunately I had to insert each batch in two bulks of 5.000 tweets each, as the driver reported that the object was "too big" when using an array of 10.000 tweets. While the single inserts stabilize around 1000 tweets/s at the end, the bulk insert still reached about 1500-2000 tweets/s.

    Looking at where the insert performance started and where it ended might let you conclude that this is going to be slow, but from my experience with a much smaller PostgreSQL database (~4.000.000 tweets) on this laptop, I am impressed. Being able to insert around 1000 tweets/s is way faster than what we experience with the current system at >>twimpact.com, where we accumulate an analyzer backlog. Given that this test was performed on my laptop and not on a production system, it is to be expected that reality will look much better :-)

    But inserting is not all, even though this is what we do a lot. Next I am going to take the database and do some query testing to see whether it fits our needs. PermaLink

    Wednesday, 29. July 2009

    twimpact.com - trends by citation

    It feels good to code a little again. Again, social software but this time from the analysis point of view. Check out >>twimpact.com to see the trends of the last hour bubble up.

    All done in >>grails, which I love. PermaLink
