We’ve had our clusters running for a few months without significant issues. Or at least so we thought.
I’m not sure of the why and how yet, but it seems that even rack topology scripts running, replication factor of 3 and nightly rebalancing we had some 600 blocks failing to be replicated across racks. After digging thru documentation, consulting vendors and generally feeling frustrated, I discovered that this is somewhat know. Apparently the “fix” is something I found back in December of last year. See here.
So my new nightly routine — automated of course — is to fsck the cluster looking for ‘Replica placement policy” violations, alter the replication factor +1, then set it back after it’s replicated. I am somewhat irritated by this need.