You Paid for Support?! Bwah-ha-ha

We’re using Open Source Software extensively in our Big Enterprise. It really irritates me that we pay millions of dollars for “Support” from our vendors and we get endless circles of “try this,” “that should work” and “oh, that’s an upstream bug, we’ll file a bug report.” Seriously? For 10% of what we’re paying these guys, I’ll do it myself.

I currently have 3 bugs open with 3 vendors; 2 of those are open source. Let’s talk about them.

1) OS won’t PXE boot across a LACP bond. The documentation says it should. Everything “looks right,” but after 3 business days of the vendor telling me to try things I’d already tried, I finally solved this myself. I can boot my DataNode image on my servers, but I wanted to install an OS on some of the control nodes. As soon as the install agent starts up, it loses network connectivity. I told it how to configure the bond on the kernel boot line, but it fails to see and use it. Trying a single interface doesn’t work either, because the switch expects to distribute the packets (per the LACP 802.3ad spec) across 4 NICs. It turns out I can tell the kernel to use eth0 and NOT probe the other network devices, which solves 99% of my problem. It’s not perfect, but it’s a helluva lot better than a hand install. Here’s a hint if you have this problem: nonet.
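For anyone chasing the same thing, here’s roughly what my per-host PXE plumbing boils down to. This is a sketch, not our exact tooling: the TFTP paths, kickstart URL and MAC are made up, and the installer options (ksdevice=, nonet) are what worked on our older RHEL-style installer, so check your own installer’s docs.

#!/usr/bin/env python
# Sketch: stamp out a per-host pxelinux entry that pins the installer to eth0
# and tells it not to probe the other (LACP-bonded) NICs. Paths, the ks URL
# and the MAC below are illustrative; ksdevice= and nonet are installer
# options from our older RHEL-style environment and may differ on yours.
import os

TFTP_ROOT = "/tftpboot/pxelinux.cfg"
APPEND = ("initrd=initrd.img ks=http://installhost/ks.cfg "
          "ksdevice=eth0 nonet")  # use eth0 only, skip probing the rest

def write_entry(mac):
    # pxelinux looks for a file named 01-<mac, lowercase, dash-separated>
    name = "01-" + mac.lower().replace(":", "-")
    entry = "DEFAULT install\nLABEL install\n  KERNEL vmlinuz\n  APPEND %s\n" % APPEND
    with open(os.path.join(TFTP_ROOT, name), "w") as f:
        f.write(entry)

if __name__ == "__main__":
    write_entry("00:25:90:ab:cd:ef")  # hypothetical control node MAC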

2) Proprietary software vendor can’t pull the Avro schema from HDFS. This seems to be squarely in their court to resolve; however, they claim it’s a bug in Hive and opened a bug report. Come on, kids: if you’re finding hdfs:// and expecting file://, something is wrong on your side.
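To illustrate the point (a toy sketch of the failure mode, not the vendor’s code): anything that reads a schema location has to dispatch on the URI scheme instead of assuming a local path, otherwise an hdfs:// location simply “doesn’t exist.”

# Toy illustration of the hdfs:// vs. file:// complaint -- not the vendor's code.
# If every location is treated as a local path, a schema stored at
# hdfs://namenode:8020/schemas/foo.avsc will look like a missing file.
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2 of the era

def open_schema(location):
    parsed = urlparse(location)
    scheme = parsed.scheme or "file"
    if scheme == "file":
        return open(parsed.path)  # plain local read
    if scheme == "hdfs":
        # Hand the URI to the Hadoop client (e.g. shell out to
        # `hadoop fs -cat <location>`) instead of the local filesystem.
        raise NotImplementedError("read %s via HDFS, not open()" % location)
    raise ValueError("unhandled scheme: " + scheme)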

3) Open source Hadoop vendor opened a bug report because Pig doesn’t correctly support Avro in our version. We supplied the existing Apache bug report and its fix, but they made us chase our tails for 10 days before they agreed and opened a new bug report.

After losing some 600 blocks of data in our Dev cluster, we found out there is a “fix” for under-replicated blocks coming in HDFS 0.20, but 0.1x doesn’t have this “feature.” Support DID help us find that issue, but ONLY after they ran us thru hoops looking for non-existent configuration problems.

My advice: eschew paid support and dig into the details on your own. You’ll learn more, be more valuable, and solve your own problems faster.


Replication FAIL

We’ve had our clusters running for a few months without significant issues. Or at least so we thought.
I’m not sure of the why and how yet, but it seems that even with rack topology scripts running, a replication factor of 3, and nightly rebalancing, we had some 600 blocks that failed to be replicated across racks. After digging thru documentation, consulting vendors, and generally feeling frustrated, I discovered that this is a somewhat known issue. Apparently the “fix” is something I found back in December of last year. See here.

So my new nightly routine — automated, of course — is to fsck the cluster looking for “Replica placement policy” violations, bump the replication factor by one, then set it back once the blocks have re-replicated. I am somewhat irritated by this need.
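For the curious, the job is conceptually this simple. A stripped-down sketch, not the production script (the fsck output parsing in particular matches what our version prints and is worth double-checking on yours):

#!/usr/bin/env python
# Stripped-down sketch of the nightly "fix badly placed replicas" job.
# Assumes the `hadoop` CLI is on PATH and that fsck flags violations with a
# "Replica placement policy is violated" line; the parsing below is naive and
# tied to our version's output format.
import subprocess

REPLICATION = 3  # our normal replication factor

def files_with_placement_violations():
    out = subprocess.check_output(["hadoop", "fsck", "/", "-files", "-blocks"]).decode("utf-8")
    bad = set()
    for line in out.splitlines():
        if "Replica placement policy is violated" in line:
            # lines look roughly like "/path/to/file:  Replica placement policy is violated ..."
            bad.add(line.split(":", 1)[0].strip())
    return sorted(bad)

def setrep(path, factor):
    # -w waits until the requested replication is actually reached
    subprocess.check_call(["hadoop", "fs", "-setrep", "-w", str(factor), path])

if __name__ == "__main__":
    for path in files_with_placement_violations():
        setrep(path, REPLICATION + 1)  # the extra replica has to land on another rack
        setrep(path, REPLICATION)      # then trim back to normal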


Intelligent Design – A hindsight lesson.

Our Boot from Network DataNode design was conceived in ignorance of real-world application. Serial Number vs. MAC Address debates ensued in Ivory Tower minds, and a schema was built. I’m currently in between designs. I’ve consulted our resident data genius, and he devised a superior schema that I have not yet been able to implement.

Today was spent trying to figure out why I built a view to contain my base information and how in the Hell to add new DataNodes so everything works as expected. Knowing the future and working with the past can be painful, especially when multiplied by the 0.0.2 changes to the existing schema that “seemed like a good idea” at the time. The bonus multiplier for getting this done is that I have 3 clusters to add in the next 60 days, a new admin to train, and 6 developers to support. Did I mention that I get to figure out how to re-process failed files in production? The developer who created the process is no longer available and didn’t have time for a knowledge transfer. That’s why we pay $200/hour for Professional Services! So they can go away and leave us trying to understand their mistakes! (Just a little bitter.)

What’s the lesson here? Just because you have incrementally better designs doesn’t mean you should half-ass implement them. Live with what you have until you can fully move to the better design. Full-on releases are much better than temporary fixes that get forgotten when you’re distracted for 3 days. I’m planning to create a new database to house all of the changes coming in release 1.1 of Indostan. Otherwise we’ll be stuck in hybrid-upgrade Hell forever. Yes, lazy admin, that means you have to do something the “hard/wrong” way EVEN WHEN you know there is a better/easier one. Until the new stuff is fully baked, use the old stuff.

I know this flies in the face of “continuous deployment” models, but it works much better for foundation-level applications. It allows me to compartmentalize change, and when I have 3-5 hair-on-fire events per day, that is essential. Today seemed slow, and I still think there were 4 crises that had to be solved “right now.”

Being able to strike a balance between growth and stability is a valuable skill for any administrator. It’s sometimes harder for Hadoop Admins because there are new and better options weekly, if not daily, which makes it that much more of a requirement. Choose an intelligent design and run it for a while: 3 weeks, 3 months, or maybe 6 months, but take the time to let it evolve in your mind and within your environment. Then re-evaluate and make changes. 6 months is a lifetime in Hadoop add-ons.


Insights for Articles from the Hadoop Summit 2013

I just left Hadoop Summit 2013, so my next series of articles is going to cover some insights I learned. For this post I’m going to just post a long list of future topics – let me know which ones are the most interesting and I’ll prioritize:

1. The biggest complaint around Hadoop is that it is pretty immature and needs more Enterprise capabilities
2. Only 30% of companies are doing Hadoop today
3. Definition of big data – high volume, velocity, and variety of data
4. Definition of Hadoop – a collection of things in a framework to process data across a distributed network
5. Amazon (AWS) started 5½ million Hadoop clusters last year
6. Traditional IT vs. Big Data style
7. Tractor vendors are becoming analytics vendors
8. It takes a community to raise an elephant
9. Yahoo is massive in the Hadoop space (365 PB and 40k nodes) – here is what I learned from them
10. APM, GPU, Atom – not ready for big data
11. Solid State Storage, In-Memory, SATA, SAS – when to use which in Hadoop clusters
12. YARN – Yet Another Resource Negotiator, or a necessary component for solidifying Hadoop and moving it to the next level
13. Top 10 Things to Make Your Cluster Run Better (this one is my favorite)
14. Why LinkedIn and Yahoo are going to kick butt with big data
15. How to create good performance tests for a cluster
16. Hadoop cluster security – authorization, authentication, encryption, etc.
17. Automating Hadoop clusters
18. Hadoop and OpenStack, and why you need to consider using them together
19. Email archiving with Hadoop – the perfect use case…maybe…maybe not
20. File ingesting
21. Lustre
22. Apache Falcon and data life cycle management
23. Storm
24. In-Memory DB as a component of Hadoop
25. Tez – game changer – I think so…
26. Knox Security Gateway
27. NFS into Hadoop directly
28. HDFS Snapshots
29. Why we don’t use Ambari and why we should use the metrics
30. HBase and all of the things you need to consider before deploying (it’s different)
31. Excel Data Explorer and GeoFlow, and how they might displace more expensive data mining solutions
32. Hadoop scaling to internal cloud
33. YARN – is this just virtualization on top of Hadoop?
34. Hadoop infrastructure rethought
35. Cluster segmentation – when one big cluster just won’t do…

Let me know which topics are most intriguing and I’ll post those first.



Hadoop and the honeycomb

I love the kind of honey where they leave a piece of the honeycomb in the jar. It’s great to chew on when you’ve used up all the honey. Reminds me of this big old oak tree we used to pull honey out of in the woods as a kid. Where was I?

Oh yeah. I was showing my daughter how the honeycomb is made up of perfectly shaped hexagons, and then it hit me: honeycombs are a great illustration of the 3x block replication factor in Hadoop, which is what inspired us to create this logo.


Hadoop Hindsight #1 Start Small

I thought we would start a weekly series on some lessons we’ve learned. Many of these we learned the hard way, so we thought it might be helpful for those a few steps behind us. YMMV, but we wish this ideology had been firmly ensconced when we started.

Identify a business problem that Hadoop is uniquely suited for.
Just because you found this cool new hammer doesn’t mean everything is a nail. Find challenges that your existing tech can’t answer easily. One of our first projects involved moving 300 gigs of EDI transaction files. A business unit had BAs grepping for customer strings across 26,000 files to find 4 or 5 of them, then FTP’ing those to their desktops for manual parsing and review. They might spend a few HOURS doing this for each request. It was a natural and simple use of Hadoop. We learned a lot about design patterns, scheduling, and data cleanup.
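To give a feel for how small the moving parts can be, here is a sketch of the approach (not the project’s actual code): a Hadoop Streaming mapper that emits only the lines containing the customer string, run over the whole EDI directory in one pass. The file names, paths, and environment variable are illustrative.

#!/usr/bin/env python
# grep_mapper.py -- sketch of a Hadoop Streaming "grep" over the EDI files.
# The customer string is passed in with -cmdenv; the env var holding the
# current input file name varies by Hadoop version (map_input_file on older
# releases, mapreduce_map_input_file on newer ones).
import os
import sys

NEEDLE = os.environ.get("CUSTOMER_STRING", "")

def main():
    for line in sys.stdin:
        if NEEDLE and NEEDLE in line:
            src = (os.environ.get("mapreduce_map_input_file")
                   or os.environ.get("map_input_file", "unknown"))
            sys.stdout.write("%s\t%s" % (src, line))

if __name__ == "__main__":
    main()

Submitted with something along the lines of hadoop jar .../hadoop-streaming.jar -D mapred.reduce.tasks=0 -input /data/edi -output /tmp/grep-out -mapper grep_mapper.py -file grep_mapper.py -cmdenv CUSTOMER_STRING=ACME0001 (check the streaming flags against your distro’s docs).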

Solve this one business challenge well.
Notice I didn’t say nail it perfectly. There are many aspects of Big Data that will challenge the way you’ve looked at things for the last 20 years. The solution should be good, but not necessarily perfect. Accepting this gives you time to establish a PM strategy and basic design patterns.

Put together a small team that has worked well together in the past.
This is critical to your success! Please, please, please take note! Inter-team communication is the foundation upon which your Hadoop practice will grow. In The Mythical Man-Month, my man Frederick Brooks said:

To avoid disaster, all the teams working on a project should remain in contact with each other in as many ways as possible…

Ideally a team should consist of the following:

1 Salesman (aka VPs)
1 Agile-trained PM
1 Architect
2 Former DBAs
1-3 skilled java developers
1 Cluster Admin

Obviously this is very simplified and some roles can overlap.  My point is you should have no more than 10 people max starting out!

Support your solution.
This very same team should also live thru at least 3 months of supporting the solution they’ve created. Valuable insight is gained once you have to fix a few production problems. Let the solution mature in production a bit to understand support considerations. This gives you time to adjust your design patterns. Trust me, you’ll want time to reflect on your work and correct flaws.

Smash your solution and rebuild (Optional – If time permits)
Good luck getting the time, but if you’re serious about a sustainable Enterprise Hadoop solution, this deserves real consideration.

Go forth and multiply.
By this time your patterns and procedures should form the DNA of your new Hadoop cell. Your team should naturally develop into the evangelists and leaders upon which the mitosis of a new project occurs, carrying with it the newly replicated chromosomes. As your project cells divide and multiply, you’ll be able to take on more formidable challenges.

That’s all I have to say about that.


Interview Questions for Hadoop Developers


(via Dice News in Tech)

Hadoop is an open distributed software framework that enables programmers to run an enormous number of nodes handling terabytes of data. One of its most significant abilities is allowing a system to continue to operate even if a significant number of nodes fail. Since Hadoop is continuing to mature…



Cinderella has left the Hadoop Cluster

It’s Friday evening before our Hadoop Administrator leaves for a week of vacation in New Hampshire, and about an hour before he goes he says, “it’s turning into a pumpkin in an hour.” Of course we wanted to go live with a new project on Friday afternoon before he leaves. About a week before he left we reminded everyone he was leaving and made it clear he didn’t have a “real” backup.

So now I – the “Ugly Stepsister” – am filling in for Cinderella, and of course things don’t go as planned…they never do. I’m fond of saying that if things “just worked” we wouldn’t need high-priced engineers/architects, so I suppose that’s a good thing. Now, after a week of filling in for Cinderella, we’ve had a number of last-minute things that needed fixing, and I’ve figured out that I have been living in an ivory tower a bit and have forgotten how to sweep floors.

When I moved into architecture ten years ago, my biggest fear was that my IT capabilities would atrophy and I would be less employable. I know that I’m actually very good at strategy, and I’m far more valuable strategically, driving millions of dollars in savings and large strategic shifts in IT, but boy, this week was a humbling experience. Now that I’m through the week and have overcome a number of obstacles (some of them slowly), here is what I have learned:

First – I need to get back to my roots. Every architect needs a sandbox to get hands on keyboard, and needs to allocate time to get their hands dirty. I’m creating a sandbox for myself to tinker in, and I’m going to spend some time doing some Hadoop automation work – I really like getting my hands dirty.

Second – We need more automation; many of the things we do in Hadoop are manual today. Right now the Hadoop Engineering team (Grease Monkey) is so busy fighting fires…or hair-on-fire developers…that he isn’t automating much. While he was out I put together a rough user-add script: it adds users, integrates with Active Directory, creates Hadoop environments, adjusts permissions, rolls changes from access to name nodes, creates the pub, sub, and bus directories, creates Hive tables, and tests…and tests some more. (Grease Monkey took it to the next level when he returned.)
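A trimmed-down sketch of the shape of that script (the real one also handles the Active Directory integration, Hive tables, and testing; the group name and paths here are illustrative):

#!/usr/bin/env python
# Trimmed-down sketch of the user-add automation: just the HDFS plumbing.
# Assumes the `hadoop` CLI is on PATH; the group name and directory layout
# (pub/sub/bus) follow our convention and are otherwise illustrative.
import subprocess
import sys

def hadoop_fs(*args):
    subprocess.check_call(["hadoop", "fs"] + list(args))

def add_user(user, group="hadoopusers"):
    home = "/user/%s" % user
    hadoop_fs("-mkdir", home)                      # create the user's home dir
    for sub in ("pub", "sub", "bus"):              # our standard layout
        hadoop_fs("-mkdir", "%s/%s" % (home, sub))
    hadoop_fs("-chown", "-R", "%s:%s" % (user, group), home)
    hadoop_fs("-chmod", "-R", "750", home)         # owner + group only

if __name__ == "__main__":
    add_user(sys.argv[1])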

Third – Developers are crazy…hair-on-fire crazy, and giving them a deadline drives crazy, blame-everyone behavior. We need some queuing mechanism that gives us the ability to track, plan, and prioritize work. We put a preliminary, easy-to-use one in place, but nobody is using it. We need to go back and address this in a non-corporate, fast-moving-startup sort of fashion…but we NEED IT!

Fourth – While we are doing some amazing, innovative, game-changing things in the admin space, we are also very early in the operationalization of the platform, and we have a ton of solidification work to do before we can breathe easy. As fast as Hadoop is evolving, I’m not sure we will ever breathe easy, though.


The 3 Pillars of Data Democracy

In order to promote the use of data within the enterprise, we need to provide a collaborative environment which gives people the freedom and incentive to try new things.  This gives everyone the chance to prove great ideas, or at worst to fail quickly.  We may all understand the benefits of democratizing data, yet without an environment to foster that exploration it will remain just a great idea.  It is much like a unicorn.  We all know what it looks like, but no one has actually seen one.

Therefore I propose 3 pillars that are essential to encouraging this environment:

Searchable Metadata

We can provide a powerful framework for mining all sorts of data, but if users cannot inspect data elements easily from a trusted source, we will be perceived as unintuitive and difficult to use.

“Simplicity is the ultimate sophistication.” – Leonardo da Vinci

Gallery

The gallery showcases work that others have done and springboards innovation by providing rapid provisioning of data. Infographics and supporting worksheets can be linked into the gallery and tagged so that others may search for them. Did someone already perform margin-rate analysis? Let’s go to the gallery first to find out.

If a person finds a relevant workbook, they can easily provision that insight for their own use. This is a powerful feature that allows others to build upon and extend new insights without reinventing the wheel.

Gamification

We are wired to appreciate feedback loops. We post on Facebook and Twitter not because we like to type, but because we enjoy the recognition in the form of comments and followers. As of 2011, the global video game market was valued at $65 billion. People spend hours in front of a screen because games leverage their natural desire for competition, accomplishment, and status. Adopting this mindset in our analytic platform not only encourages participation, it gives us yet another set of metrics for self-assessment. Which infographics are most accessed? How many times is a worksheet copied for further analysis? The ‘Like’ and ‘Friend’ buttons allow Facebook to catalog over a billion active user profiles – all to sell you stuff.


Why Enterprise Hadoop jobs will not require Java skills in 3-5 years.

In late 1979, RSI’s Oracle version 2 ran on Digital’s VAX minicomputers (32-bit AND virtual memory!). If you were proficient with the first commercial RDBMS, you had to possess mad Macro-11 or PL-11 (the high-level version) skills to actually make many of the functions work that we take for granted now. Many basic tools that DBAs and developers use today simply didn’t exist. You had to roll your own. Even the data dictionary was a new concept and often in flux.

Hello World, Macro-11 style:

        .TITLE  HELLO WORLD
        .MCALL  .TTYOUT,.EXIT
HELLO:: MOV     #MSG,R1 ;STARTING ADDRESS OF STRING
1$:     MOVB    (R1)+,R0 ;FETCH NEXT CHARACTER
        BEQ     DONE    ;IF ZERO, EXIT LOOP
        .TTYOUT         ;OTHERWISE PRINT IT
        BR      1$      ;REPEAT LOOP
DONE:   .EXIT

MSG:    .ASCIZ /Hello, world!/
        .END    HELLO

Don’t forget the RT-11 commands to assemble, link, and run!

.MACRO HELLO
ERRORS DETECTED:  0

.LINK HELLO

.R HELLO
Hello, world!
.

It was an immature but revolutionary way to store and recall information. Bell Labs saw the business benefits of the Oracle RDBMS and thus much hype and exuberance flowed in the land:

“They could take this data out of the database in interesting ways, make it available to nontechnical people, but then look at the data in the database in completely ad hoc ways.” – Ed Oates

During these early days you would need a room full of advanced computer science academics just to keep the system functioning – at each and every business. There were no safety nets, and everyone had their own perspective on how to do a multi-join query WITH an aggregate function (and on the 4th day the RBO was created, and it was good). Read consistency was still 5 years away! As time went on, the best brains from the IT collective pioneered the standards and best practices we all use today. As the tech matured, the need for low-level Macro-11 developers diminished; they were replaced by a more mature product that would appeal to large non-tech companies. As the need for low-level tech skills went away, patterns were established, and the need for highly skilled programmers to keep the data store functioning went away too. Interestingly, the data and the patterns of its flow remained. That is why enterprises have DBAs maintaining modern relational databases, not developers.

Inevitably, there are times when advances dictate new low-level programming skills on a large scale. When RSI released Version 3 in C, there was high demand for developers who could read and speak the prose of Mr. Ritchie. This was necessary for recompiling and testing a consistent code base across everything from minis and mainframes to PCs. While C was quite portable, there was much work to be done in the storage subsystems. Again, as the need for low-level tech skills went away, the data remained.

When we look at the new world of Hadoop, we must understand that this type of tech revolution has occurred before. Right now there is much work afoot to solve the primitive questions. This undoubtedly requires a new breed of low-level Java developers… for a while. We see the results of these efforts in tools like Pig, Hive, Impala, and Stinger, glued together via HCat. Once the dust settles, I wouldn’t stake my professional future on mastering MapReduce, but rather focus on mastering the higher-level tools. That will give enterprises quicker access to business insight. As Hadoop’s primitive issues are solved into standards and patterns over the next 3-5 years, the need for Java developers will diminish substantially. Just look at how many PL-11 or C++ programmers your enterprise has on its DBA teams; the low-level tech comes and goes, but the data remains.
