Life on the edge of data node writes

If you’re serious about using Hadoop, you should subscribe to the user mailing lists. They are a great source of insight into how things are performing, new features, and common problems.

I’m currently working on a JIRA to clarify the documentation around the data node write process. Does a data node write to disk before it requests a copy on the next node, or does it write and forward the request at the same time? When does it checksum? What must be fully complete before the data nodes report back to the client that the block has been written? Which data node tells the Name Node that blocks are complete, and when?

Finally, does the client need to ask the Name Node where to write the blocks, or do the data nodes stream data according to previous relationships?

The documentation and the source may agree, but that’s not clear at this point. I’ll post an update when the JIRA is submitted and resolved.

Why does any of this matter? Because when you’re on the leading edge of Hadoop use, you need to understand exactly how and when data is written to the cluster file system. Otherwise, recovery, performance tuning, and general administration are a SWAG.

About Grease Monkey

30+ Years of IT Geekiness, Linux Fanboy and Open Source patriot.
This entry was posted in Administration, Development.

2 Responses to Life on the edge of data node writes

  1. jbattisti says:

    I subscribed to the mail list. BTW – you are good looking, and people like you…Darn it! (at least professionals)

  2. She was not a professional. No money was exchanged and she said Thank you!
