If you’re serious about using Hadoop, you should subscribe to the user mailing lists. They are a great source of insight into how things are performing, new features, and common problems.
I’m currently working on a JIRA to clarify the documentation around the DataNode write process. Does a DataNode write to disk before it requests a copy on the next node, or does it write and send that request at the same time? When does it checksum? What must be fully complete before the DataNodes report back to the client that the block has been written? Which DataNode tells the NameNode that blocks are complete, and when?
Finally, does the client need to ask the NameNode where to write the blocks, or do the DataNodes stream data according to previous relationships?
The documentation and the source may agree, but that’s not clear at this point. I’ll post an update when the JIRA is submitted and resolved.
Why does any of this matter? Because when you’re on the leading edge of Hadoop use, you need to understand exactly how and when data is written to the cluster file system. Otherwise, recovery, performance tuning, and general administration are little more than a SWAG.