100% system availability in January…even Google can’t promise that!


To support the many functions and services of the University, Information Services must offer and maintain a diverse array of systems.  We have 12 core systems including SITS, SAP, email, telephones, and the websites.  Keeping all of these systems running smoothly is an ongoing challenge, and issues inevitably arise with these services.

Despite these challenges, in January of 2015, Information Services was able to report 100% uptime on all core systems. This means that, throughout the entire month, none of our core IT systems failed or were the cause of disruption at any time.

Why is this significant?

This is the first time we have achieved 100% uptime across all core systems for an entire month. Achieving this is a challenge not just for the University, but for any organisation that relies on technology. Google have gone on record as saying that guaranteed 100% uptime on their services is “not attainable” (zdnet.com), while Amazon commit to 99.5%. For us to have achieved 100% uptime demonstrates the significant progress we have made as we continually work to improve and refine the services we offer.
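
To put these percentages in context, here is a rough Python sketch of how much downtime a given uptime figure allows over a 31-day month such as January. The function name and the month-long measurement window are our own illustrative assumptions; providers define and measure their commitments in their own ways.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 31) -> float:
    """Minutes of downtime permitted in a period at a given uptime percentage.

    Illustrative helper only: assumes uptime is measured over whole days.
    """
    total_minutes = days * 24 * 60  # 31 days = 44,640 minutes
    return total_minutes * (100 - uptime_percent) / 100


for target in (99.5, 99.9, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime over 31 days allows {minutes:.1f} minutes of downtime")
```

On this basis, a 99.5% figure still leaves room for a little under four hours of downtime in a month, whereas 100% leaves none at all.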

Graph

How did we achieve this?

We make every effort to ensure that our services are constantly improving, and that issues do not recur.

Whenever an issue affects a large number of people or has a significant impact on the University, we call it a major incident. A report on this issue is written and a major incident review is convened.  This meeting is attended by the relevant IS staff, and is open to affected staff outside IS. The aim of the report and review is to find the cause of the incident, and to identify ways we can prevent a similar incident occurring in the future.

It isn’t just major incidents that attract this type of scrutiny; we also look for patterns and trends in everyday incidents. This enables us to see if there could be a common cause, like a malfunctioning server or a piece of software that no longer works as it should. Once a cause has been identified, we can take steps to prevent future incidents by repairing the fault or replacing a system with something more appropriate.
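
As a simple illustration of this kind of trend-spotting, the Python sketch below counts how often each suspected cause appears in an incident log and flags any that recur. The log entries, the cause labels, and the `recurring_causes` function are all made up for this example; they do not describe our actual tooling or real incidents.

```python
from collections import Counter

# Hypothetical incident log of (affected system, suspected cause) pairs.
# This is made-up example data, not a record of real incidents.
incidents = [
    ("email", "server-A"),
    ("websites", "server-A"),
    ("telephones", "config change"),
    ("email", "server-A"),
]


def recurring_causes(log, threshold=2):
    """Return (cause, count) pairs seen at least `threshold` times, most frequent first."""
    counts = Counter(cause for _system, cause in log)
    return [(cause, n) for cause, n in counts.most_common() if n >= threshold]


print(recurring_causes(incidents))
```

A cause that shows up repeatedly across otherwise unrelated incidents is a natural candidate for closer investigation.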

What next?

We aim to continue this trend of rising average uptime, and hope to achieve 100% on a more regular basis in the future. As we constantly work to bring new features and services to the University, we will also work to make our systems as reliable and effective as possible.