
2010-05-12 #1

Created by arte.

NoSQL revisited: Cassandra / Scala

First, I have to admit, we have not moved on to MongoDB. Sorry. It was nice to play with, but after a crash on my laptop rendered the database unusable, some doubts arose.

Anyway, we are in dire need of a scalable database solution for >>twimpact.com, as you can check out for yourself there. The PostgreSQL database is killing the not-so-small server with its IO load right now. The setup may not be optimal, but I don't think the complicated route of SQL database clustering/sharding will help much.

This is where >>Cassandra enters the playground. I did have my problems with its wording and paradigms, but in the end I treated it all as a directly addressable sparse matrix. I got the idea when writing a data file converter for >>GNU Octave, producing loads of sparse matrix files.
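To make the analogy concrete, here is a minimal sketch (all names are illustrative, not from the actual twimpact code): rows of the sparse matrix correspond to keys, columns to column names, and absent cells simply do not exist.

```scala
import scala.collection.mutable

// Hypothetical sketch: a key/column store viewed as a directly
// addressable sparse matrix. Only cells that were written exist.
class SparseMatrix[V] {
  private val cells = mutable.Map.empty[(String, String), V]

  // Enables the m(row, col) = value assignment syntax.
  def update(row: String, col: String, value: V): Unit =
    cells((row, col)) = value

  def get(row: String, col: String): Option[V] =
    cells.get((row, col))

  // All columns present for one row -- analogous to fetching
  // all columns stored under a single key.
  def row(key: String): Map[String, V] =
    cells.collect { case ((r, c), v) if r == key => c -> v }.toMap
}
```

The point of the analogy is that you never iterate over "empty" cells: a key/column pair is either present or it costs nothing.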

Ah, and we would not be geeks if we didn't change the programming language in the process too. >>Mikio started it by playing around with >>Scala for the new retweet matcher. As I was looking for a typed alternative to >>Groovy, I ended up learning Scala. I can't say I have understood all of the concepts yet, but what I have found useful works.

First, I looked at all the different implementations of Cassandra APIs for Java and Scala. The syntactic sugar added by the Scala versions is nice, but it sometimes involves a lot of under-the-hood processing. Before entering a field where I have to fight off issues not under my control, I decided to use >>hector, which is probably the best available access solution in Java right now. However, wrapping hector's results in Scala is no fun. I do not blame this on hector, but rather on Scala's over-use of its own types.

After assessing the needs for twimpact we came up with a simple API necessary to support our analysis and storage use. Mikio already had a working analyser using just a simple key/value store, so I decided to wrap Cassandra's key/column system into a just-as-simple put/get. Here's the trait:

trait CassandraStore {
  def put(table: String, key: String, name: Array[Byte], value: Array[Byte])
  def put(table: String, key: String, columns: List[Column])
  def put(table: String, mutations: Map[String, List[Column]])
  def get(table: String, key: String, name: Array[Byte]): Array[Byte]
  def get(table: String, key: String): List[Column]
  def get(table: String, start: String, end: String, count: Int): Map[String, List[Column]]
  def remove(table: String, key: String)
  def remove(table: String, key: String, name: Array[Byte])
}

Simple, simple. I then implemented it twice: once just wrapping hector's API, and once as my own direct >>thrift implementation. Which is, er, well, … a bit more work.
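A third, trivial implementation is also handy: an in-memory stand-in for testing analysis code against. This is my own sketch, not the twimpact code; `Column` here is a simple case class, not the thrift-generated one, and it only covers part of the trait.

```scala
import scala.collection.mutable

// Stand-in for the thrift-generated column type (assumption).
case class Column(name: Array[Byte], value: Array[Byte])

// Hypothetical in-memory reference implementation of the put/get API.
class InMemoryStore {
  // table -> key -> column name (as String) -> value
  private val tables =
    mutable.Map.empty[String, mutable.Map[String, mutable.Map[String, Array[Byte]]]]

  def put(table: String, key: String, name: Array[Byte], value: Array[Byte]): Unit = {
    val rows = tables.getOrElseUpdate(table, mutable.Map.empty)
    val row  = rows.getOrElseUpdate(key, mutable.Map.empty)
    row(new String(name, "UTF-8")) = value
  }

  def get(table: String, key: String, name: Array[Byte]): Array[Byte] =
    tables(table)(key)(new String(name, "UTF-8"))

  def get(table: String, key: String): List[Column] =
    tables(table)(key).toList.map { case (n, v) => Column(n.getBytes("UTF-8"), v) }

  def remove(table: String, key: String): Unit =
    tables.get(table).foreach(_.remove(key))
}
```

Having such a reference implementation makes it easy to verify that both the hector and the thrift variants behave identically.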

So, the hector implementation played the role of the base to test against. At first, just doing the most simple implementation worked quite well, until I started to store more than 1,000 keys sequentially. It broke, because I created a completely new transport and client for each put/get request. No wonder, you might say, but hey, I do like to start simple. To overcome this problem I wanted a connection pool, which is useful anyway, and there are existing implementations out there in the form of >>apache commons-pool. You will find Scala wrappers for it in some of the Cassandra projects, e.g. >>Cassidy, which is by the way the most beautiful API design I have found for Cassandra. For the purpose of learning, and because I am always trying to improve (some have called me a perfectionist for this), I opted for my own pool wrapping. I am going to release it some time after this post, as soon as I have found out how to host a maven repo for it.

Now we have a simple API, a connection pool and some code to drive it (a ThriftSocketFactory and a corresponding keyed pool). What do we need next? Yes, some data. I concatenated 9 days of raw retweet data from April and gzipped it into one single 9.7GB file.

Time to start testing Cassandra. Like before with MongoDB, I did it on my laptop, which has 4GB of RAM and about 50GB of free space on the hard disk. The test program reads the raw JSON data from the file on stdin, parses it, creates data entries in the form key:column, and stores them using the aforementioned self-made API.

I always read in 10,000 retweet lines and then store them. Test 1 loops through the 10,000 entries doing a single put(key, name, value) for each; the second test does one batch insert instead.
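The driver's chunking can be sketched like this (an illustrative reconstruction, not the actual twimpact code; the `storeBatch` callback stands in for either N single puts or one batch mutation):

```scala
// Read lines, group them into chunks of 10,000 and hand each chunk
// to a storage function. Returns the total number of entries seen.
def driveInserts(lines: Iterator[String],
                 storeBatch: Seq[String] => Unit): Int = {
  var total = 0
  lines.grouped(10000).foreach { chunk =>
    storeBatch(chunk) // Test 1: loop with single put()s; Test 2: one batch insert
    total += chunk.size
  }
  total
}
```

Grouping first means both tests see identical input chunks, so any difference in the curves comes from the storage path alone.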

Here is the result (figure: Cassandra Insert Test).

All worked well for the single inserts until around 21,000,000 inserts. This was the point where the Cassandra process found it did not have enough heap space anymore, and additionally the hard disk was full. The same happened much earlier with the bulk insert. It looks like Cassandra needs much more space on disk for bulk inserts.

I have to admit that Cassandra was running with the standard setup of only 1GB heap space, and a full disk is usually the point where the fan and excrement issue happens. A tragic incident of this kind has been documented >>here by reddit, which is a good read for everyone who plans to go a similar web noSQL way.

Right now I am testing Cassandra on one of our cluster nodes to see how far we can get with a more relaxed disk/RAM environment, as we do not just have to take care of 9 days in April (~43,000,000 data points), but rather 500,000,000 data points. So adding more than just one node is mandatory.


Copyright © 2005-2008 Matthias L. Jugel | SnipSnap 1.0b3-uttoxeter