10 of the Biggest Platform Development Mistakes

Just like with golf, technology is as much about ensuring that your bad hits are recoverable as it is ensuring that you make great ones. We’re all going to have failures in our careers but avoiding the really big pitfalls will help you keep your company on the right growth path. Here are 10 common mistakes we at AKF Consulting see made during platform development — and the ones we believe are the most important to avoid. 

1) Failing to design for rollback: If you’re developing a SaaS platform and you can only make one tweak to your current process, make it so that you can always roll back any code changes. We know that it takes additional engineering work and testing but in our experience, such effort yields the greatest ROI.

2) Confusing product release with product success: Do you have “release” parties? Don’t — you are sending your team the wrong message. A release has little to do with creating shareholder value. Align your celebrations with achieving specific business objectives, such as increasing sign-ups by 10 percent.

3) Assuming a new Product Development Lifecycle (PDLC) will fix issues with missing delivery dates: Too often CTOs see repeated problems in their development life cycles, such as missing release dates, and wrongly blame the development methodology. Make sure you’re fixing the right thing: lack of ownership of, lack of involvement in, or an incomplete understanding of the current PDLC is among the most common root causes of late dates.

4) Allowing history to repeat itself: Organizations don’t spend enough time looking at past failures. The best and easiest way to improve your future performance is to track your past failures, group them by causation and treat the root cause rather than the symptoms. Keep incident logs and review them monthly to identify recurring problems.

5) Scaling through third parties: If you’re a hyper-growth SaaS site, you don’t want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it’s built into your architecture. Define how your platform scales through your efforts, not through the systems that a third-party vendor provides.

6) Relying on QA to find your mistakes: You cannot test quality into a system and it’s mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem, and that’s where the problem should be treated.

7) Relying on “revolutionary” or “big bang” fixes: The degree of success of complete rewrites or re-architecture efforts typically ranges somewhere between not returning the expected ROI and complete failure. The best projects — and the ones with the greatest returns — are not revolutionary but evolutionary. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small steps.

8) Not taking into account the multiplicative effect of failure: Every time you have one service call another service in a synchronous fashion, you are lowering your theoretical availability. If each of your services is designed to be 99.999 percent available, then the product of the availabilities of all of the service calls is your theoretical availability. Five calls is (.99999)^5, or about 99.995 percent availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.

9) Failing to create and incent a culture of excellence: Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Be a leader.

10) Not having a business continuity/disaster recovery plan: No one expects a disaster, but they happen, and if you can’t maintain normal business operations you will lose both revenue and customers. A solid business continuity plan explains to everyone how to operate in the event of an emergency. Even worse is not having a disaster recovery plan, which outlines how you will restore your site in the event a disaster shuts down a critical piece of your infrastructure, such as your collocation facility or connectivity provider. Our preference is to provide your own disaster recovery through multiple collocation facilities.
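The arithmetic behind mistake 8 can be checked with a short sketch. It assumes, as the text does, that a request succeeds only if every synchronous call in the chain succeeds and that the calls fail independently. The function names are illustrative and do not come from any library.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def chained_availability(availabilities):
    """Theoretical availability when every call in the chain must succeed."""
    return prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected minutes of unavailability per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


single_service = 0.99999
five_calls = chained_availability([single_service] * 5)

print(f"{five_calls * 100:.4f}%")  # 99.9950%
print(downtime_minutes_per_year(single_service))
print(downtime_minutes_per_year(five_calls))
```

Under these assumptions, one service at 99.999 percent is down for about 5 minutes a year, while a chain of five such calls is down for about 26 minutes a year.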

Marty Abbott and Michael Fisher are partners with AKF Consulting.

18 Comments

Chirag Mehta

This could be a long comment and non-HTML format would strip the links out. I have captured my comments in a detailed post on my blog at:

http://cloudcomputing.blogspot.com/2008/07/saas-platform-design-and-architecture.html

I also took an opportunity to capture my strategic recommendations to SaaS vendors on some of the important topics that are typically excluded from the overall platform strategy:

http://cloudcomputing.blogspot.com/2008/07/saas-platform-pitfalls-and-strategy.html

My post copied as a comment below (without HTML):

I would argue that many of these mistakes are not specific to a SaaS platform but apply to any platform. I agree with most of the mistakes and recommendations; however, I have quite the opposite thoughts about the rest.

1) Failing to design for rollback

“…you can only make one tweak to your current process, make it so that you can always roll back any code changes…”

This is a universal truth for any design decision for a platform irrespective of the delivery model, SaaS or on-premise. eBay makes a good case study for understanding a code change management process called “trains” that can track down the code in a production system for a specific defect and roll back only those changes. A philosophical mantra for architects and developers would be not to make any decisions that are irreversible. Framing it positively: prototype as fast as you can, fail early and often, and don’t go for a big bang design that you cannot reverse. Eventually the cumulative efforts will lead you to a sound and sustainable design.

2) Confusing product release with product success

“…Do you have “release” parties? Don’t — you are sending your team the wrong message. A release has little to do with creating shareholder value…”

I would not go to the extreme of celebrating only customer success and not release milestones. Product development folks do work hard towards a release, and a celebration provides a sense of accomplishment and is a motivational factor that has indirect shareholder value. I would instead suggest a cross-functional celebration. Invite the sales and marketing people to the release party. This helps create empathy for the people in the field that developers and architects never or rarely meet, and it could also be an opportunity for the people in the field to mingle, discuss, and channel the customer’s perspective. Similarly, include non-field people while celebrating field success. This helps developers, architects, and product managers understand their impact on the business and gives them an opportunity to get to know who actually bought and started using their products.

5) Scaling through third parties

“….If you’re a hyper-growth SaaS site, you don’t want to be locked into a vendor for your future business viability…”

I would argue otherwise. A SaaS vendor or any other platform vendor should really focus on their core competencies and rely on third parties for everything that is non-core.

“Define how your platform scales through your efforts, not through the systems that a third-party vendor provides.”

This is partially true. SaaS vendors do want to use Linux, Apache, or JBoss and still be able to describe the scalability of a platform in the context of these external components (that are open source in this case). The partial truth is you still can use the right components the wrong way and not scale. My recommendation to a platform vendor would be to be open and tell their customers why and how they are using the third party components and how it helps them (the vendor) to focus on their core and hence helps customers get the best out of their platform. A platform vendor should share the best practices and gather feedback from customers and peers to improve their own processes and platform and pass it on to third parties to improve their components.

6) Relying on QA to find your mistakes:

“QA is a risk mitigation function and it should be treated as such”

The QA function has always been underrated and misunderstood. QA’s role extends way beyond risk mitigation. You can only fix defects that you can find, and yes, I agree that mathematically it is impossible to find all the defects. That’s exactly why we need QA people. Smart and well-trained QA people think differently and find defects that developers would never have imagined. QA people don’t have any code affinity or selection bias, and hence they can test for all kinds of conditions that otherwise would have been missed. I do agree, though, that developers should put themselves in the shoes of the QA people and make sure that they rigorously test their code, run automated unit tests and code coverage tools, and not just rely on QA people to find defects.

8) Not taking into account the multiplicative effect of failure:

“Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.”

No synchronous calls and swimlane architecture are great concepts, but a vendor should really focus on automated recovery and self-healing and not just failure detection. Failure detection could help a vendor isolate a problem and mitigate the overall impact of that failure on the system, but for a competitive SaaS vendor that’s not good enough. Raising MTBF (mean time between failures) is certainly important, but lowering MDT (mean down time) is even more important. A vendor should design a platform based on some of the autonomic computing fundamentals.
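To make the relationship between these two measures concrete, here is a minimal sketch using the common steady-state formula availability = MTBF / (MTBF + MDT). It assumes MTBF means the average uptime between failures, excluding the downtime itself; some definitions differ. The figures are hypothetical, chosen only for illustration.

```python
def availability(mtbf_hours, mdt_hours):
    """Steady-state availability: the fraction of time the system is up.

    Assumes mtbf_hours is the mean uptime between failures and
    mdt_hours is the mean down time per failure.
    """
    return mtbf_hours / (mtbf_hours + mdt_hours)


# Hypothetical platform that fails, on average, once every 1,000 hours.
manual_recovery = availability(mtbf_hours=1000, mdt_hours=1.0)
automated_recovery = availability(mtbf_hours=1000, mdt_hours=0.1)

print(f"manual recovery:    {manual_recovery:.5%}")
print(f"automated recovery: {automated_recovery:.5%}")
```

In this sketch the failure rate is the same in both cases. Only the recovery time changes, which is the lever that automated recovery and self-healing act on.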

10) Not having a business continuity/disaster recovery plan:

“Even worse is not having a disaster recovery plan, which outlines how you will restore your site in the event a disaster shuts down a critical piece of your infrastructure, such as your collocation facility or connectivity provider.”

Having a disaster plan is like posting a sign by an elevator instructing people not to use it when there is a fire. Any disaster recovery plan is, well, just a plan unless it is regularly tested, evaluated, and refined. Fire drills and post-drill debriefs are a must-have.

Gordon Rivers

I agree with a couple of the other comments.

This is consultant-speak designed to hook you in, but it is not really saying much that’s not regular common sense regardless of industry. I have nothing to do with the business, but these issues look like what I help my clients with all the time.

A launch party is important but make sure individual contribution is recognized also.

Hire a consultant that has more depth. All these things are easy to say, BUT implementation in the real world is the issue… and a hired helper WILL NOT solve your problems if culture and individual contributors are not in line with company goals. Get the right team and let ’em go.

Bob Ngu

Absolutely agree with the other people that it is just as important to have release parties for the aforementioned reasons. An unmotivated or low-morale engineering team will be the death of the company.

Jeremy Campbell

Loved the list, I think also it’s important to keep your development platform as dynamic as possible because as we all know there are always going to be changes, and tweaks desired in the future.

Very important: Build a core platform around a problem you’re trying to solve, and through user feedback build most of the future developments according to what the community wants. This is the direction my company is taking on our next project.

It’s too bad Twitter doesn’t listen to their community to have a more valuable service, and one that doesn’t go down all the time. FriendFeed certainly is, and they are doing a great job allowing people to really connect through meaningful conversations.

Greg Isaacs

I would also include another point (#11) that platforms should be built with external developers in mind (ie make your platform open). It is very easy to only think of how a platform will solve your current problems and not consider how a thriving developer community can help shape success in the future.

Chris Ammerman

Have to agree with the couple others who are supporting release parties. If you tell your developers that release is not an important milestone for the business and that “signups”, which are sales-driven, are all that matters, you are going to absolutely destroy your developers’ morale.

My response to such talk, as a developer, would be one of three things:
1) If I am ambitious and self-confident, find a new job ASAP.
2) If I am passive and self-confident, throw it back in management’s face next time they are demanding unpaid overtime to meet a release date, pointing out how “release is not an important milestone for the business”.
3) If I am passive and not self-confident, reduce the effort I put into my work on a daily basis, to make up for the unrewarded, unrecognized stress and effort that goes into a product near release.

A release party doesn’t need to be sanctioned by the CEO, and shared by the whole company. But it is important for the developers to blow off some stress and get recognition, at least by their immediate managers, for the blood, sweat, and tears it took to get the product shipped.

Conversely, if the product ships according to spec, and it still flops, that is almost NEVER the DEVELOPERS’ fault. They followed a plan, a bad plan. And the creators and custodians of the bad plan are the ones who should be flogged.

Dennis

I’m glad others have pointed out the importance of release parties. The time approaching the release is the hardest time on your developer and verification teams, and they have likely worked ungodly hours to deliver the goods. Release parties are unquestionably an excellent idea, and a great way to reward that work.

If you want to exclude the marketeers and sales guys, then do, but I doubt they will thank you for that!

Mukul Kumar

About “Failing to design for rollback: If you’re developing a SaaS platform and you can only make one tweak to your current process, make it so that you can always roll back any code changes.”

This is easier said than done. Designing rollback and an ‘idempotent state machine’ is one of the difficult problems that any designer can face. The amount of code that is written with rollback enabled would probably be at least 2x the size of code that you would write without rollback.

I am not saying we should not do rollback, it’s just that you need to keep the extra time and effort in mind while designing rollback.

Thanks,
Mukul.

Glenn Rogers

::[ Allowing history to repeat itself ]::

This one needs a lot more attention in the industry. In all the companies I’ve worked for over the years as a programming contractor, I’ve never been in a planning meeting that discussed past project mistakes and how to avoid them, or successes and how to repeat them.

Nick

Here at Zynaptics we have one more to add:

Always make sure that an error message can be traced back to the offending code and data. You have to include this functionality in your error handling and user messaging.

How many times have you received a bug report that had some vague description: I was importing the file and got an “Object cannot be null”?

When you look at the import routine you have 5000 lines of code and the user was importing 250,000 records.

Much better to get this error: “App Error: Object cannot be null. PriceCalcFunction on record #45514”. Imagine how much time you’ll save finding and fixing the issue. This benefits the users and the development team.
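A minimal sketch of this idea follows. It assumes an import loop that wraps each record in an exception handler and re-raises with the step name and record number attached. The class and function names (`RecordProcessingError`, `calc_price`, `import_records`) are hypothetical, chosen only for illustration.

```python
class RecordProcessingError(Exception):
    """Wraps a low-level error with the step and record that triggered it."""

    def __init__(self, step, record_number, cause):
        super().__init__(f"App Error: {cause}. {step} on record #{record_number}")
        self.step = step
        self.record_number = record_number


def calc_price(record):
    """Hypothetical pricing step that fails on a missing price."""
    if record.get("price") is None:
        raise ValueError("Object cannot be null")
    return record["price"] * record.get("quantity", 1)


def import_records(records):
    """Process records one by one, tagging any failure with its context."""
    totals = []
    for number, record in enumerate(records, start=1):
        try:
            totals.append(calc_price(record))
        except Exception as exc:
            # Chain the original exception so its traceback is preserved.
            raise RecordProcessingError("calc_price", number, exc) from exc
    return totals


try:
    import_records([{"price": 2.0, "quantity": 3}, {"price": None}])
except RecordProcessingError as err:
    print(err)  # App Error: Object cannot be null. calc_price on record #2
```

Raising with `from exc` keeps the original exception attached, so the development team still sees the underlying traceback alongside the user-facing message.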

Finally, when you take this approach, it tells people that we take bugs seriously, so test your code before you deliver it.

Paul

Looks like a pretty good list to me, and specific enough to help make both tactical and strategic decisions (which undermines charges of ‘consultant-speak’).

As for separating the engineering mission from the company mission, well, I’d say this is paving the road to ruin. There is only one mission.

Can’t disagree with the argument that you have to solve some problems completely with 1.0 or it ain’t worth dreaming about 3.0.

Jeffrey

Definitely seeing the consultant-speak here as well.

Launch parties are a way to reward the engineering organization for a job well done. Success in the marketplace is something that happens after a launch and is largely out of the control of the engineering organization. If you want to make your hard-working software developers feel even more beat down, throwing parties to commemorate the successes of the sales and marketing organizations sounds like a terrific way to do it.

Raanan Avidor

* You should do “release” parties AND “product success” parties. Reaching a release is a big step for RnD, QA and the whole company. You can’t just say releases are not important; they ARE important milestones that should be celebrated.

* Relying on QA to find your mistakes is a big mistake. QA is there only to find the bugs that the development cycle missed, and maybe to take a more holistic approach to the product, which development often loses sight of because of the development process.

Mark Sigal

Three comments. One, this feels very much like consultant-speak and not real business strategy, as these items are mostly catch-all platitudes and not super specific or clearly actionable.

Two, the post speaks about ‘platform development,’ but again, the feedback only loosely anchors to the fundamental question of what makes a platform. You could have changed the title and sub-text to ‘How to Create Really Scaleable Widgets’ and most of the text would have held.

Three, the biggest mistake I have seen with platform development is confusing 1.0 and 3.0 requirements. Platforms need to be designed with some governing lifecycle set of goals and objectives in mind but need to solve 1.0 problems for anyone to care. Too often, you see 3.0 vision that doesn’t solve 1.0 problems, or 1.0 tactical goodness that doesn’t scale. That is the biggest mistake I have seen repeatedly, a construct I call the 1.0/3.0 Paradox.

Here’s a recent post that speaks to the 1.0/3.0 Paradox:

Threading the Needle: Essential Truths for Early-Stage Entrepreneurs
http://thenetworkgarden.com/weblog/2008/06/threading-the-n.html

Cheers,

Mark
