Flying Blind within the Clouds with PaaS Applications & “Managed” Services
“PaaS, is fraught with pitfalls and dangers that could cause your application to stop running at any point. Moreover, should this occur, the ability to identify and correct the problem may be so far out of your hands that only by spending an inordinate amount of time with your PaaS provider’s support personnel could the problem be corrected.” - JP Morgenthal, Cloud Business Director, EMC
There are many reasons why IT organizations are looking at outsourced managed services and Platform as a Service (PaaS): cost control and reduction, increased efficiency and scalability, a refocusing on core business, and so on. But there are pitfalls and risks here, some of which can only truly be mitigated with even greater observability and (automated) control over applications and services than has ever been needed, or achieved, within the private domain.
Today companies need to be increasingly agile in perceiving, reacting and adapting to changes in their environment, which leaves a company and its systems doused in a continuous flow of change. But change is the enemy of control, especially cost control, when systems themselves are not reactive and adaptive, which is why managed services vendors can at times appear to put procedures and practices in place that deliberately restrict this important flow (the lifeblood of the company in some cases). The problem here is that both parties are oblivious to what the other is doing and, more importantly, to the underlying behavioral, state and quality aspects of the application or service being managed. The infrastructure appears as a black box to the customer – not necessarily a problem until there is a problem. The application appears as a black box to the managed services vendor – this is always a problem unless the vendor has sufficient application visibility, expertise, insight and change control, which generally resides with the customer, especially for custom(ized) applications and services.
“Let me put your black box into this other bigger black box. Trust me, I know what I’m doing.”
Once performance and reliability problems start to surface you very quickly begin to see how disconnected (or painfully slow) the communication and process flow is. It becomes a game of “your IT”, with each party blaming (tapping) the other for an imagined change. This is compounded by the fact that in such environments deep diagnostics and flow control are absent. Control has been taken away from the customer’s IT department and given to another organization that is not sufficiently confident in its own understanding of the application, or of possible remedies in the event of problems. The service provider is reluctant to take action when it can’t be sure it is the application that is at fault. The safe bet (from a legal perspective) is to sit and wait it out until the long, drawn-out process of problem resolution begins across strained communication lines. Each party feels the other does not fully understand, or need to know, how its black box operates.
There is a solution to the problems each party suffers: for both the infrastructure and the application (or service) to be self-adaptive, stimulated by sensed changes in the environment, both inner and outer. Unfortunately this capability is only starting to be recognized as mission critical. We are a very long way from any degree of standardization that would allow each system to reason and predict more accurately about the other’s behaviors and intents. Even if many applications, services, platforms and libraries were rewritten to have a little more awareness and adaptability, there is still the question of tooling and analysis at a completely new level of complexity that will force us (humans) to take a back seat and instead guide good system behavior by way of policies and self-regulation routines founded on a model of the true system dynamics. We don’t appear ready to take that medicine in any form soon.
That said, all is not lost if instead we create an illusion that nothing has in fact changed. The new way looks like the old way. We can create this illusion by mirroring and simulating the application and its activities. The customer can still see and sense the outsourced service as if it were still within its own private environment, and the service provider still maintains some degree of protection and isolation from inspection activities performed by the customer’s service management team. An added benefit is that, within this simulated environment, the customer can greatly assist in problem resolution as well as augment the application simulation in ways that are unique to its needs and beyond the scope of the service contract.
Note: The simulated application mirrors the behavior of the application in production but is not the application. It does not have access to code, memory state or storage resources (database, filesystem) and it does not consume the same degree of computing resources. It mimics the software execution call sequencing across threads irrespective of each thread’s actual process location. It is a “chameleon” monitoring solution that becomes the application it is monitoring. It does not require service management beyond its intended use as an observation point. It is a dynamic but immutable mirror image of reality within the service provider’s black box.
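To make the mirroring idea concrete, here is a minimal sketch of how a simulator could rebuild per-thread call sequencing from a metering feed. Every type and name below is invented for illustration – this is not the Simz API – and it assumes a feed where each event carries only a thread id, a probe (method) name, and an enter/exit flag; note that no application code, memory or storage is ever touched.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical metering event: (thread, probe, enter-or-exit). Invented type.
record MeterEvent(String thread, String probe, boolean enter) {}

// The "mirror": rebuilds each thread's call stack purely from the event
// stream. It holds no code, memory state or storage -- only call sequencing.
class SimulatedApplication {
    private final Map<String, Deque<String>> stacks = new HashMap<>();

    void replay(MeterEvent e) {
        Deque<String> stack = stacks.computeIfAbsent(e.thread(), t -> new ArrayDeque<>());
        if (e.enter()) {
            stack.push(e.probe());   // simulated call
        } else {
            stack.pop();             // simulated return
        }
    }

    // Current simulated call stack for a thread, outermost frame first.
    List<String> stackOf(String thread) {
        List<String> frames = new ArrayList<>();
        stacks.getOrDefault(thread, new ArrayDeque<>())
              .descendingIterator()
              .forEachRemaining(frames::add);
        return frames;
    }
}
```

An observer attached to such a mirror sees live call stacks that look like the real application’s, which is the “chameleon” effect described above.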
And because the simulation is built on a recording mechanism, both operations teams can learn from experience by reliving incidents again and again, even on a new operations staff member’s first day. This capability also helps with testing, because a recording in the test environment can be compared with a similar recording, online or offline, in production. Detecting behavioral changes, and the side effects of other changes, across releases as well as across environments is made so much easier.
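One simple way such a cross-environment comparison could work – the recording format and tolerance threshold here are assumptions, not how any particular product does it – is to reduce each recording to a per-probe call-count profile and flag probes that drifted between the two:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Compares two recordings reduced to probe -> call-count profiles; probes
// present in only one recording, or whose counts moved beyond a relative
// tolerance, are flagged as behavioral changes.
class RecordingDiff {
    static Map<String, Long> profile(List<String> probeFirings) {
        Map<String, Long> counts = new HashMap<>();
        for (String p : probeFirings) counts.merge(p, 1L, Long::sum);
        return counts;
    }

    // tolerance is relative, e.g. 0.1 flags anything that moved by more than 10%.
    static Set<String> changed(Map<String, Long> a, Map<String, Long> b, double tolerance) {
        Set<String> drifted = new HashSet<>();
        Set<String> all = new HashSet<>(a.keySet());
        all.addAll(b.keySet());
        for (String probe : all) {
            long ca = a.getOrDefault(probe, 0L);
            long cb = b.getOrDefault(probe, 0L);
            long max = Math.max(ca, cb);
            if (max > 0 && Math.abs(ca - cb) / (double) max > tolerance) {
                drifted.add(probe);
            }
        }
        return drifted;
    }
}
```

Run against a test-environment recording and a production recording of the same release, the flagged set points straight at the behavior that differs between the two black boxes.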
Finally, because the simulation can combine the metering (mirroring) feeds from multiple application processes, the customer does not necessarily need to concern itself with the (auto-)scaling mechanism the service provider utilizes to meet some service level agreement. From the customer’s viewpoint it is the application and its activities that matter, not managed infrastructure elements such as nodes and processes. It is a win-win for all.
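The feed-combining step can be sketched as a tiny mapping layer – again a hypothetical illustration, with an invented naming scheme – that collapses (process, thread) pairs arriving from elastic nodes into one flat thread namespace, so the mirrored application appears as a single process no matter how the provider scales it:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Collapses (process, thread) pairs from many elastic nodes into one flat
// simulated-thread namespace, hiding the provider's topology and scaling.
class FeedMerger {
    private final Map<String, String> mapping = new HashMap<>();
    private final AtomicInteger next = new AtomicInteger(1);

    // Stable simulated-thread name for a given source process/thread pair.
    String simulatedThread(String processId, String threadId) {
        return mapping.computeIfAbsent(processId + "/" + threadId,
                key -> "sim-thread-" + next.getAndIncrement());
    }
}
```

Two identically named threads on different nodes stay distinct in the mirror, yet neither node nor process identity leaks through to the customer’s view.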
Update – 18th Oct 2013
We’ve had lots of positive feedback and questions on the article, and I’ve already made some changes above to call out some important points further. One question that did come up, and really deserves its own section if not article, is whether the simulation needs to reside external to the managed environment, which might initially appear to fly in the face of outsourcing. I would expect that most providers would look to offer this as a value-added service and host the simulation within the environment, though not exactly collocated with the application. Naturally you might question why the simulation is needed at all. Well, the provider does not want the application management team at a customer site inspecting and impacting directly on production. The provider does not want the team to be made aware of the topology and elasticity of the environment – the simulation makes it look like there is only one process with many, many threads. The simulation is like a portal, and unlike management dashboards it appears real and plays nicely to our trained senses and existing tools. A simulation is to a dashboard what a hologram is to a badly drawn sketch. Dashboards lack realism and invariably assume that people knew what they were doing when they started writing the code for an application, before they had any experience of what production would look like on a given day.
One of the hardest problems in migrating a custom application or service to a managed environment is the dependency on existing integrations with systems still in-house and not part of the outsourcing strategy. Again, simulation can help alleviate such problems: the customer can augment the mirrored application runtime with metering plugins that hook into the simulated replay of what is occurring live in production (wherever that may be and whatever it may look like) and then call out to external service facades. This code would be unchanged whether it was executing live in production or within the simulated application, allowing the customer to do some preparation work and testing before the actual managed service migration.
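A plugin of that kind might look something like the following sketch. The callback contract, probe-naming convention and the `Billing.` filter are all assumptions made up for this example; the point is only that the same plugin code can be driven by the live runtime or by the simulated replay without change.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin contract: the runtime (live or simulated) calls back
// on each probe firing. Not an actual product API.
interface MeteringPlugin {
    void onProbe(String thread, String probe);
}

// Example plugin that forwards selected probe firings toward an in-house
// system that stays outside the outsourced environment.
class FacadeBridge implements MeteringPlugin {
    private final List<String> forwarded = new ArrayList<>();

    @Override
    public void onProbe(String thread, String probe) {
        if (probe.startsWith("Billing.")) {
            // In a real deployment this is where the call to the external
            // service facade would happen; here we just record the intent.
            forwarded.add(probe);
        }
    }

    List<String> forwarded() { return List.copyOf(forwarded); }
}
```

Because the plugin only sees (thread, probe) callbacks, registering it against the simulated replay exercises exactly the integration path that will later run against production.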
Mirroring, Mindreading and Simulation of the (Java) Virtual Machine
Changing Space and Time in Software Execution – The Future is Simulated
Rewriting Application Performance Monitoring Histories with Record & Playback
Simulation and Time Synchronization of Application Memories
Simz – The Promise of Near Real-Time Application Simulation in the Cloud
The SaaS APM Racket – Collect, Relay, Inform, Mine and Extort (CRIME)
Visions of Cloud Computing – PaaS Everywhere