
The Complexity and Challenges of IT Management within the Cloud

Following on from "Feedback & Control signals its arrival in the Cloud & Enterprise", which touched on complex adaptive systems and feedback loops (or adaptive control), there are two dimensions that should be considered in the management of applications in the cloud: time and space. In addition, consideration needs to be given to characteristics such as diversity and dynamism, which compound the problems created by these dimensions in the context of computing and its management.

Time contributes to the complexity and risk in managing systems, services and applications when there are significant differences (time delays) between when something is observed, when it is judged and when it is reacted upon. Operations is being tasked with managing behavior (cause) and resulting events (effects) that have a time resolution as low as a few microseconds, whilst it collects, processes, sorts, filters and analyses measurement data in seconds, even minutes, and then reacts in minutes, even hours. By the time an engineer acts, the basis for the action has probably already lapsed and been invalidated.

It’s like sending an unmanned rocket on a space mission to a star in another galaxy which we can observe from Earth today but which has already died and collapsed.

Engineers are in effect always in prediction mode, which usually means they wait for the more catastrophic events to occur before acting (by which point there is simply no chance that stability will be regained). The protracted delay in reacting to events also means that the prediction (more often trending & simply guessing) needed for such things as policing, protecting and possibly provisioning is performed at a much coarser granularity, increasing risk levels further. This is compounded again by the dynamism of the cloud (or environment) itself, which is not under the direct supervision or control of engineering.

Real-time should be defined in terms of calling & computing speeds, not (human) thinking & reacting speeds.

As with time, there is a disconnect in the space dimension between what can be held within the brain’s short-term memory (or displayed on a dashboard) and the number of elements that can now be provisioned (then monitored & managed) at a time, ignoring what happens beyond their sphere of control. Amazingly, operations are still using (or attempting to use) the same techniques and processes used to manage a single mainframe or a small cluster of homogeneous (or known) servers. Not only is the management (problem) space increasing in size (both numbers and capacity), but its rate of change is accelerating, along with increasing (element) diversity (whether it’s explicitly indicated or indirectly observed) and wider dispersion and distribution (transparently or not).

Bigger and faster growth & change rates sound like a disaster. How can we reason about the problem space when we are being forced to see light years into the future (in terms of computing speeds) and in a space that is vast (in terms of our mental capabilities) whilst expanding and contracting on demand for causes we still have very little insight into and control over? How can we even reason about the present when the past seems so distant and different in this regard? Remember, BIG data is not a model; a model is something we have to mine from it.

The answer lies in emergent computing and adaptive control that is local and immediate. Local in that observation, judgement and reaction are collocated with the normal processing via embedded controllers and sensors woven into applications (at runtime). Immediate in that the time interval between measuring, sensing and signaling (possibly to a remote station) and the actuating is at the same resolution as the underlying task/transaction processing that is being monitored, managed and controlled.
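To make "local and immediate" a little more concrete, here is a minimal sketch in Python of a controller embedded in the application itself. Everything in it is an assumption made for illustration: the LocalController name, the latency target, and the halve-or-increment rule (additive increase, multiplicative decrease) are hypothetical choices, not a description of any particular product or of how such a controller must be built. The point is only that sensing, judging and reacting happen in the same process and on the same call that is being managed.

```python
import time
from contextlib import contextmanager


class LocalController:
    """Hypothetical in-process controller: senses every call, reacts at once."""

    def __init__(self, target_latency, limit=8, min_limit=1, max_limit=64,
                 clock=time.perf_counter):
        self.target_latency = target_latency  # policy set by people (seconds)
        self.limit = limit                    # current admission limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.in_flight = 0                    # calls currently executing
        self.clock = clock                    # injectable for testing

    def observe(self, latency):
        """Judge one measurement and adjust the limit immediately."""
        if latency > self.target_latency:
            # Too slow: back off sharply (multiplicative decrease).
            self.limit = max(self.min_limit, self.limit // 2)
        else:
            # Within policy: probe for more capacity (additive increase).
            self.limit = min(self.max_limit, self.limit + 1)

    @contextmanager
    def metered(self):
        """Wrap a unit of work: admit or reject, time it, feed it back."""
        if self.in_flight >= self.limit:
            raise RuntimeError("rejected: over local limit")
        self.in_flight += 1
        start = self.clock()
        try:
            yield
        finally:
            self.in_flight -= 1
            self.observe(self.clock() - start)


# Usage: people set the policy; the software regulates itself call by call.
controller = LocalController(target_latency=0.010)
with controller.metered():
    pass  # the normal task/transaction processing would run here
```

The sketch is single-threaded and keeps no history; a real controller would need synchronization and probably some smoothing of the signal. What it does show is the feedback loop closing within the call itself rather than minutes later on a dashboard.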

For this to happen we need IT to change, starting with how it (or its systems) observes. Moving from logging to signaling. Moving from monitoring to metering. Moving from correlation to causation. Moving from process to code then context. Moving from state to behavior then traits. Moving from delayed to immediate. Moving from past to present. Moving from central to local. Moving from collecting to sensing. When that has occurred we can then begin to control via built-in controllers and supervisors.

It’s time to let the machines (at least the software) self-regulate in line with policies and priorities we set.