Replication FAIL

We’ve had our clusters running for a few months without significant issues. Or at least so we thought.
I’m not sure of the why and how yet, but it seems that even rack topology scripts running, replication factor of 3 and nightly rebalancing we had some 600 blocks failing to be replicated across racks. After digging thru documentation, consulting vendors and generally feeling frustrated, I discovered that this is somewhat know. Apparently the “fix” is something I found back in December of last year. See here.

So my new nightly routine — automated of course — is to fsck the cluster looking for ‘Replica placement policy” violations, alter the replication factor +1, then set it back after it’s replicated. I am somewhat irritated by this need.

About Grease Monkey

30+ Years of IT Geekiness, Linux Fanboy and Open Source patriot.
This entry was posted in Administration, Deployment, Tuning. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>