
2009-09-25 #1


NoSQL : MongoDB performance testing (part 1: insert)…

The >>twimpact.com project currently uses >>PostgreSQL. This is all well, except that it does not scale too well in our environment. Removing some indexes actually improved the performance, but I can foresee that the amount of data coming in will slow the application down again.

That is the reason I am looking at non-SQL alternatives. The list includes >>redis, the >>Cassandra Project and >>MongoDB.

I admit I only looked briefly at redis, but that is because it is a very simple key/value store and we do need some query functionality. Playing with Cassandra and its Java driver was awkward, and in the end I had MongoDB up and running in no time.

The setup is as follows:

  • 4GB MacBook, 2.4Ghz Intel Core 2 Duo, slow disk
  • MongoDB: mongodb-osx-x86_64-2009-09-19
  • (I had to work in parallel, so there might have been some swapping)
Currently the database on a remote server has about 38.000.000 tweets stored. At the start of my testing it contained about 35.000.000. The procedure for the insert test was to copy over batches of 10.000 tweets, as the following pseudo code shows:

// initialize MongoDB (started with a complete new one for each test)
def db = new Mongo("twimpact")
DBCollection coll = db.getCollection("twimpact");
// coll.createIndex(new BasicDBObject("retweet_id", 1)) // long index
// coll.createIndex(new BasicDBObject("from_user", 1))  // short string index

def offset = 0
def limit = 10000
def rowCount = sql.count("tweets")

while (offset < rowCount) {
  // get a batch of tweets from the PostgreSQL server
  def data = sql.rows("SELECT * FROM tweets OFFSET ${offset} LIMIT ${limit}")
  // convert each row into a document and insert it
  data.each { row ->
    BasicDBObject info = new BasicDBObject();
    row.each { key, value -> info.put(key, value); }
    coll.insert(info);
  }
  offset += data.size()
}

The time was measured for requesting the data from the SQL database (not shown in the graphs) and for the row loop. In the case of the bulk insert test, the row loop first stored 5.000 new documents in a pre-allocated array and then inserted them:

…
  DBObject[] bulk = new DBObject[5000]
  … loop …
  // inserted twice per batch, as 10.000 was too big for the driver
  coll.insert(bulk)
...
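
For the curious, here is roughly how the complete bulk loop could look (a sketch, not verbatim from my test code), using the same sql and coll objects as in the pseudo code above; the array handling and the timing output are reconstructed, and the split into two inserts of 5.000 follows the description below:

  def offset = 0
  def limit = 10000
  def rowCount = sql.count("tweets")

  while (offset < rowCount) {
    def data = sql.rows("SELECT * FROM tweets OFFSET ${offset} LIMIT ${limit}")

    def start = System.currentTimeMillis()     // time only the insert loop
    DBObject[] bulk = new DBObject[5000]
    data.eachWithIndex { row, i ->
      BasicDBObject info = new BasicDBObject()
      row.each { key, value -> info.put(key, value) }
      bulk[i % 5000] = info
      // flush every 5.000 documents, i.e. twice per batch of 10.000 tweets
      if ((i + 1) % 5000 == 0) coll.insert(bulk)
    }
    // (a last, smaller batch would simply be skipped in this sketch)
    def elapsed = System.currentTimeMillis() - start
    println "inserted ${offset + data.size()} tweets, last batch took ${elapsed} ms"

    offset += data.size()
  }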

The documents we created are not that big, but their structure has some real-world relevance for us. They might be changed later to adapt to the schema-less world though. Here is a typical example:

{
  "id": 3551935825,
  "user_id": 1657468,
  "retweet_id": 15965974 ,
  "from_user": "thinkberg", 
  "from_user_id": 6190551, 
  "to_user": null , 
  "to_user_id": null, 
  "text": "RT @Neurotechnology interesting post, RT @chris23 Augmented Reality Meets Brain-Computer Interface >>http://bit.ly/3fg9OG", 
  "iso_language_code": "en", 
  "source": "<a href=">>http://adium.im" rel="nofollow">Adium</a>", 
  "created_at": "Wed Aug 26 2009 06:49:09 GMT+0200 (CEST)",
  "updated_at": "Wed Aug 26 2009 06:50:11 GMT+0200 (CEST)",
  "version": 0,
  "retweet_user_id": null
}

And now for the results. Just as expected, performance degrades as soon as the database reaches a certain size. MongoDB took about 2.8GB of my RAM and had to create new data files during the process.

[Figure: mongo.stat.small.png — insert performance graphs for the four test runs]

The first insert test did not create or update any index, so there is sustained performance over the whole run. There are noticeable dips, which probably happened whenever I unlocked the laptop or switched from one application to another.

Looking at the insert with a number (long) index, it appears that performance degrades slightly and stabilizes shortly after about 20.000.000 inserts. I guess this might be the point where the RAM shortage comes into play, as you can see similar behavior in the string and bulk/string index tests.

Bulk inserting gave a dramatic performance boost. Unfortunately I had to insert each batch as two bulks of 5.000 tweets each, as the driver reported that the object was "too big" when using an array of 10.000 tweets. While single inserts stabilized around 1000 tweets/s at the end, the bulk insert still reached about 1500-2000 tweets/s.

Looking at where the insert performance started and where it ended might lead you to conclude that this is going to be slow, but from my experience with a much smaller PostgreSQL database (~4.000.000 tweets) on this laptop, I am impressed. Being able to insert around 1000 tweets/s is way faster than what we experience with the current system at >>twimpact.com, where we accumulate an analyzer backlog. Given that this test was performed on my laptop and not on a production system, it is to be expected that reality will look much better :-)

But inserting is not everything, even though it is what we do a lot. Next I am going to take this database and do some query testing to see whether it fits our needs.
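
To give an idea of the kind of queries I have in mind, something along these lines (a sketch only, using the same driver API and values taken from the example document above; the real test queries may differ):

  // find tweets by the indexed from_user field
  def cursor = coll.find(new BasicDBObject("from_user", "thinkberg"))
  while (cursor.hasNext()) {
    println cursor.next()
  }

  // count the retweets of a single tweet via the long index on retweet_id
  def retweets = coll.find(new BasicDBObject("retweet_id", 15965974L)).count()
  println "retweets: ${retweets}"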
