I love using technology; there’s a lovely thrill in getting something to work. However, technology should not be brittle in the face of real life, which is sadly where I am right now with CouchDB. I don’t develop on CouchDB; I’m a user of it as part of Chef. So, like most users, I installed the technology and got on with the thing I was really interested in. Sadly I got bitten by one of the oldest of sysadmin mistakes: ignore the stack and you can guarantee that the bit you know least will take a large chunk out of you.
Now this is not a post about NoSQL databases, but it did occur to me that with Oracle and PostgreSQL (at least; I bet MySQL and others do too) it is perfectly possible to recover from this kind of issue. That is because PostgreSQL and friends use write-ahead logging, so you can get the database consistent again. What’s even more galling is that if there’s one set of folks who should be kicking goals here it’s the Apache Software Foundation. They make quality software, and by and large I’d say their recommendation is a good start in selecting an open source package.
Now let me take you back to the point: what happened and what to do about it. The setup is simple: you fail to compact the CouchDB database. CouchDB is built on the assumption that disk is cheap, so it munches along treating your disk like it’s free beer in the pub. The assumption is that you’ll run CouchDB’s compaction process, which happens automatically if you use the Chef recipe for CouchDB; I found that out too late, sadly. If you don’t run compaction? Well, instead of someone just having a free drink at that bar, it’s more like someone drinking from one of those beer hats that seem so beloved of American college students. The consumption goes right on until eventually CouchDB is sick and runs out of disk space. Now, when CouchDB runs out of disk is critical. If CouchDB runs out during a write (as our instance did) then you’re going to be using your backups. Oh, and as a side note, I love Amazon S3; just don’t talk to me about dragging down a 70GB file from S3 here in Australia.
Now I bet you’ve just sshed onto your boxen and feverishly run df to check you’re not reading this just five minutes too late. Of course, if you are reading this after the fact… I feel your pain, I really do. However, commiserations get us nowhere, so what are we to do? Well, the answer is simple: start by working out a CouchDB setup that’s resilient. So how do you make a resilient CouchDB setup? That’s a much tougher question to answer, but my initial thoughts are:
- Get a cron job for compaction going early and keep it going
- Copy your CouchDB files to back them up
- Test recovering your CouchDB; you’ll be glad you did
- Replicate; we didn’t do this and in hindsight it was a major mistake
- Make sure you’re monitoring your boxen; CouchDB isn’t going to let you know when it’s hoovered up your disk space
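That last point is worth automating. As a minimal sketch (the path and the 90% threshold are just assumptions; adjust them for wherever your CouchDB data files live, and hook the result into whatever monitoring you already run from cron):

```python
import shutil

def disk_usage_percent(path="/"):
    """Return the percentage of disk space used on the
    filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check_disk(path="/", threshold=90.0):
    """Return a warning string if usage on `path` is at or
    above `threshold` percent, otherwise None."""
    pct = disk_usage_percent(path)
    if pct >= threshold:
        return f"WARNING: {path} is {pct:.1f}% full"
    return None

if __name__ == "__main__":
    warning = check_disk("/", threshold=90.0)
    if warning:
        print(warning)
```

It’s deliberately dumb; the point is that *something* yells before the disk fills, not after.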
Best of luck out there :)
UPDATED: removed dodgy recommendation of the dump utility that Mikeal rightly points out does not exist.