NASA, SETI, RAID, WiFi Solutions to the rescue of Reliability and Quality-of-Service
The Business Software market is expecting improvements in Reliability, Availability and in general Quality-of-Service (QoS) before Web Services (WS) can gain widespread acceptance. R.a.ï.s. (Redundant Array of Independent Services) achieves this by making Client Applications subscribe to several independent Services performing the same functionality. Inspired by solutions implemented at NASA, by the SETI@Home initiative and in RAID technology, R.a.ï.s. provides advantages to WS Suppliers as well, by making the market intrinsically larger and by providing a potent selling tool: proof that their WS is on par with, if not better than, the competition.
1. Advantages of Web Services
For an introduction see IBM's "Web Services 101". Briefly, a WS is a Web-based application that communicates with other applications using the Web Services open standards: IP as backbone, XML to tag the data, SOAP to transfer it, WSDL to describe the available services and UDDI to list them. By using those standards WS can become a very powerful tool, allowing Companies to communicate their data internally and externally, independently of their underlying IT systems. WS can increase the possibility for business integration while reducing the need for duplication of data between applications. Furthermore, Developers can use the best technology available for each task without worrying about custom integration coding.
2. Open questions: Reliability and Quality-of-Service
Pervasive usage of WS will depend on the ability of Clients to trust and manage performance. In particular good Reliability and Quality-of-Service (QoS) are important, and especially so when a WS is to be provided by an external Company.
In fact, a parallel can be drawn with the Application Service Provider (ASP) model, where Clients had to rely on another company’s ability to provide reliable applications and proper data management, often without basic tools for checking errors, performance, resilience and availability.
Solutions proposed in the area of WS security include SAML and XKMS, while the rest is still not clearly defined. Furthermore, unscrupulous WS Suppliers could “milk the Client”, as neither suspending the service nor replacing it with a different supplier’s is a feasible solution. The criteria for choosing one WS over another are also still vague. As the number of available, competing WS increases, more and more systems will perform similar functions, leaving the Client with the task of picking a supplier from a tangle of possibilities.
3. Solutions from history: NASA, SETI, RAID…and WiFi
Technology has confronted issues of Reliability and “QoS” in the past.
a) NASA: The Space Shuttle’s onboard main computing system has to guarantee extremely high reliability. Alongside highly dependable components, the designers chose a set of five identical computers performing the same processing and submitting the results to each other. Failure of one computer is easily spotted, as its data differs from the others’ (Majority Rule). The faulty systems exclude themselves from further control of the spaceship, whose safety is still guaranteed by the presence of 3 or 4 fully functioning computers.
b) SETI@Home: In one of the first examples of grid computing, each node is sent packets containing a fraction of the data collected by a radio telescope. Results are checked by a centralised system looking for extraterrestrial signals. SETI@Home has to cope with bogus data sent back by some users. This issue was overcome by sending the same packets to several nodes: again, results that differ from the majority are discarded, thus nullifying attempts at distorting the data.
c) RAID: A well-known technology developed for increased reliability is RAID (Redundant Array of Independent (or Inexpensive) Disks). This is the solution of choice when both data access speed and fault tolerance are of particular relevance. The simplest redundant RAID implementation is Level 1 (mirroring), where the same information is written to two or more drives; Level 0 (data striping) spreads the data across drives for speed only and offers no redundancy. At higher levels, errors are corrected using control codes. Once more the solution is to use multiple data copies.
d) WiFi: Wireless Fidelity is the term describing any technology enabling wireless LAN connectivity under the IEEE 802.11 family of specifications. WiFi promises a revolution simply by making Internet access available everywhere at negligible cost. On top of that, WiFi is one of the first examples of Open Spectrum technology. Instead of strictly going through a single channel, transmission uses several frequencies at once. This, alongside error correction and packet re-send capabilities, largely removes the need for protection against interference.
What these solutions have in common is the idea of eliminating reliance on single items in the system: computers at NASA, processing nodes on SETI@Home, disks for RAID, transmission channels on WiFi.
4. Application to WS: R.a.ï.s.
This concept has inspired a solution to WS Reliability and QoS with R.a.ï.s. (Redundant Array of Independent Services).
Under R.a.ï.s., Client Applications subscribe to an array of WS performing the same processing (redundant), provided by different suppliers and independent from each other. Results are compared, and data discordant from the majority discarded. The result is a highly reliable WS architecture with an intrinsically high QoS.
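The compare-and-discard step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reconcile`, the supplier names and the bank-holiday payloads are all hypothetical, and datasets are assumed to be JSON-serialisable so that they can be compared in a canonical form.

```python
from collections import Counter
import json


def reconcile(responses):
    """Return (agreed_dataset, faulty_services) using a strict majority rule.

    `responses` maps a (hypothetical) service name to the dataset it returned.
    """
    # Serialise each dataset canonically so that equal datasets compare equal.
    keys = {name: json.dumps(data, sort_keys=True) for name, data in responses.items()}
    winner, votes = Counter(keys.values()).most_common(1)[0]
    if votes * 2 <= len(responses):
        raise ValueError("no strict majority among the services")
    # Every service that disagrees with the majority is marked as faulty.
    faulty = sorted(name for name, key in keys.items() if key != winner)
    return json.loads(winner), faulty


# Five hypothetical suppliers of the same function; one of them disagrees.
responses = {
    "supplier_1": {"2003-12-25": "Christmas Day"},
    "supplier_2": {"2003-12-25": "Christmas Day"},
    "supplier_3": {"2003-12-25": "Christmas Day"},
    "supplier_4": {"2003-12-26": "Christmas Day"},
    "supplier_5": {"2003-12-25": "Christmas Day"},
}
agreed, faulty = reconcile(responses)
print(agreed, faulty)
```

The Client Application carries on with `agreed`, while the names in `faulty` can be passed on to whatever failure handling is in place.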
As an example, let’s imagine application “A” subscribing to 5 WS, performing the same function but each developed, maintained and hosted by different Suppliers.
The datasets coming back from each WS are reconciled. If identical, "A" will proceed using any one of them. If any dataset differs from the others, it is discarded and its Service marked as faulty.
"A" can still proceed regardless, using any other result.
5. QoS and Reliability improvements
R.a.ï.s. improves QoS as:
a) With a large dose of competition built into the architecture, Clients no longer rely on the successes, failures and whims of a single external Company.
b) It fully utilises the nature of the Internet, as simultaneous attacks on independent suppliers (including DoS) are less likely than an attack on a single one. Also, the risk arising from delays in the network is contained, as late or unreachable Services can simply be discarded.
c) The risk of handling incorrect data due to bugs in the WS Producer’s code is greatly reduced by choosing independent Suppliers. Software bugs in each WS are in fact much easier to spot, and the Supplier can deal with them without compromising the WS Consumer’s functionality.
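Point b) above, discarding late or unreachable Services, can be illustrated with a minimal sketch. The three services below are local stand-ins (plain Python callables), not real Web Service calls, and the function name `call_all` and the 0.3-second deadline are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def call_all(services, timeout):
    """Call every service in parallel; keep only answers that arrive in time.

    `services` maps a name to a zero-argument callable standing in for a
    real Web Service call. Returns (answers, discarded_names).
    """
    pool = ThreadPoolExecutor(max_workers=len(services))
    futures = {pool.submit(fn): name for name, fn in services.items()}
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on stragglers
    answers = {}
    discarded = [futures[f] for f in not_done]  # late services
    for f in done:
        if f.exception() is None:
            answers[futures[f]] = f.result()
        else:
            discarded.append(futures[f])  # unreachable or crashed services
    return answers, sorted(discarded)


def fast():
    return 42


def slow():
    time.sleep(1.0)  # simulates network delay beyond the deadline
    return 42


def broken():
    raise ConnectionError("service unreachable")


answers, discarded = call_all({"fast": fast, "slow": slow, "broken": broken}, timeout=0.3)
print(answers, discarded)
```

Only the answers collected within the deadline go forward to reconciliation; the discarded Services neither delay nor block the Consumer.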
6. Example Web Services
Some examples obviously suited for R.a.ï.s., as several Suppliers can easily provide the same functionality:
a) Global Bank Holiday Calendar
b) Event Management System
c) Market Data
7. Types of Reconciliation
Apart from when different values are returned, datasets can be considered discordant under other circumstances, for example when a WS appears consistently late in its responses.
The above can be performed using a majority rule or extending it to include a weighted-voting system, reducing the reliance on one or more low-quality Services.
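A weighted-voting variant might look like the sketch below. The weights, service names and the rule "the winning answer must carry more than half of the total weight" are assumptions chosen for illustration; the text does not prescribe how weights are assigned.

```python
from collections import defaultdict


def weighted_vote(responses, weights):
    """Pick the answer backed by more than half of the total weight.

    `responses` maps service name -> hashable answer;
    `weights` maps service name -> trust weight (hypothetical values).
    Returns None when no answer reaches a weighted majority.
    """
    totals = defaultdict(float)
    for name, answer in responses.items():
        totals[answer] += weights[name]
    answer, score = max(totals.items(), key=lambda item: item[1])
    if score * 2 <= sum(weights[name] for name in responses):
        return None
    return answer


# Two trusted services outweigh three low-quality ones.
responses = {"A": 101.5, "B": 101.5, "C": 99.0, "D": 99.0, "E": 99.0}
weights = {"A": 3.0, "B": 3.0, "C": 1.0, "D": 1.0, "E": 1.0}
print(weighted_vote(responses, weights))
```

With all weights set to 1 the function falls back to the plain majority rule.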
The extent of the data checking can also be varied. With large datasets, reconciliation speed can be increased by using subsets of the data, or statistical checking, or comparing control “parity” data.
This can be used to limit the additional processing power required by large or complex reconciliations, for example when there are demands to handle faults of one or more Services.
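One way to compare control data instead of full datasets is to reconcile fixed-size digests. The sketch below uses SHA-256 over a canonical JSON rendering as a stand-in for the “parity” data mentioned above; that choice, and the names `digest` and `majority_by_digest`, are assumptions of this example rather than part of the framework.

```python
import hashlib
import json
from collections import Counter


def digest(dataset):
    """SHA-256 of a canonical JSON rendering of the dataset."""
    canonical = json.dumps(dataset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def majority_by_digest(digests):
    """Given service name -> digest, return the names agreeing with the majority.

    Returns an empty list when no strict majority exists.
    """
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        return []
    return sorted(name for name, d in digests.items() if d == winner)


# Key order differs between suppliers, but the canonical digests still match.
digests = {
    "supplier_1": digest({"EUR/USD": 1.08, "GBP/USD": 1.60}),
    "supplier_2": digest({"GBP/USD": 1.60, "EUR/USD": 1.08}),
    "supplier_3": digest({"EUR/USD": 1.09, "GBP/USD": 1.60}),
}
print(majority_by_digest(digests))
```

Only short strings are compared, so the cost of the reconciliation itself no longer grows with the size of the datasets; the full data needs to be kept from just one of the agreeing Services.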
8. Performing the Reconciliation
The WS Consumer itself may conduct the data reconciliation, allowing its managers to keep full control of its behaviour.
At the Client’s side, another choice is delegating to a separate Reconciliation Machine the task of comparing the data received from all the Producers on behalf of all the local Consumers, making possible:
a) The running of a Company-wide R.a.ï.s. policy
b) Easier addition of processing power as system upgrades make the data larger and more complex.
9. Number of Web Services
The number of WS to use for the R.a.ï.s. framework will in general be odd, for the obvious reason of avoiding the deadlock caused by the same number of WS reporting different datasets (e.g. a 1vs.1 or a less likely 2vs.2 situation). For similar reasons, it is risky to use a 3-WS configuration, as the failure of one would increase the chance of a 1vs.1 deadlock.
On the other hand, the number of WS should be kept as low as possible, to minimise the reconciliation processing power. A sensible solution is therefore to use 5 Services.
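The trade-off can be made concrete with a small calculation. The sketch below rests on simplifying assumptions that are not in the text: every Service fails independently with the same probability `p`, and a reconciliation succeeds only when a strict majority of the Services returns the correct dataset. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def majority_ok(n, p):
    """Probability that a strict majority of n services answers correctly,
    assuming each one fails independently with probability p."""
    need = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(need, n + 1))


# Hypothetical failure probability of 1% per service.
for n in (1, 3, 4, 5):
    print(n, majority_ok(n, 0.01))
```

In this simple model an even-sized array does no better than the odd-sized array one Service smaller, which matches the preference for odd numbers, and moving from 3 to 5 Services buys a further reduction in the chance of a failed reconciliation.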
10. Handling of and Recovery from failure of one or more WS
Apart from discarding its data, managing the failure of one WS can involve several optional stages, for example:
a) Dropping the faulty Service and subscribing to another one automatically, for example from a directory listing. This can be either temporary or permanent according to appropriate Contracts for “Emergency Supply”.
b) Re-interrogating the faulty WS in the future, up to a certain number of times
c) Reporting the fault and the relevant data automatically to the Supplier
d) Communicating the failure of one WS to other Companies that are using it, either publicly or privately, in an open or anonymous fashion.
As with the Reconciliation, this process can be performed by the Application itself, by the Reconciliation Machine or by a purpose-built system.
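Stages a) to c) can be combined in a small bookkeeping component, sketched below. Everything here is hypothetical: the class name `ServicePool`, the "strike" counter used to limit re-interrogations, and the plain list standing in for a directory listing. Stage d), sharing the failure with other Companies, is left out.

```python
class ServicePool:
    """Tracks the active array of services and replaces persistent failures."""

    def __init__(self, active, standby, max_strikes=3):
        self.active = list(active)
        self.standby = list(standby)  # stand-in for a directory listing
        self.max_strikes = max_strikes
        self.strikes = {name: 0 for name in self.active}
        self.reports = []  # fault reports queued for the Suppliers (stage c)

    def record(self, faulty):
        """Update the pool after one reconciliation round."""
        for name in list(self.active):
            if name not in faulty:
                self.strikes[name] = 0  # a good answer clears past strikes
                continue
            self.strikes[name] += 1  # will be re-interrogated next round (stage b)
            self.reports.append((name, self.strikes[name]))
            if self.strikes[name] >= self.max_strikes:
                self._replace(name)  # stage a

    def _replace(self, name):
        """Drop a service and subscribe to the next standby one, if any."""
        self.active.remove(name)
        del self.strikes[name]
        if self.standby:
            new = self.standby.pop(0)
            self.active.append(new)
            self.strikes[new] = 0


pool = ServicePool(["a", "b", "c", "d", "e"], standby=["f"], max_strikes=2)
pool.record({"c"})  # first fault: "c" stays in the array
pool.record({"c"})  # second fault: "c" is replaced by "f"
print(pool.active, pool.reports)
```

The same object could live inside the Application, the Reconciliation Machine or a dedicated system, in line with the options listed above.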
Careful selection of the Suppliers will likely make the simultaneous failure of several WS an extremely rare occurrence. Client Companies needing to handle it anyway can implement tools for the automatic selection of additional Services (see 10.a).
11. Suppliers: Availability, Selection and Advantages
The R.a.ï.s. framework relies on the availability of several WS performing similar functions, and on Due Diligence in selecting which WS to use, including verifying their independence.
The first point is virtually guaranteed by the already large number of suppliers. Concerning selection, R.a.ï.s. itself could help. In fact, a WS used by a R.a.ï.s. Client has the selling point of being at least on par with the competition.
Listings containing the number of Customers using each Producer could then become the basis for WS League Tables, to be used during the Due Diligence phase, instead of sifting through anonymous, huge WS directories.
This is also an obvious advantage for the best Suppliers, as Services near the top of the League Table are evidently easier to sell. Moreover, QA is continuously conducted on live data, making even internal quality controls easier to perform and verify. Finally, Clients have the incentive of supporting a large number of independent Suppliers, without which most of the advantages of R.a.ï.s. would be lost.
12. Costs: Independent or also Inexpensive?
Under a rather simplistic analysis R.a.ï.s. might appear as an increase of costs, with WS Consumers having to pay multiple Service fees. In practice there are several factors mitigating that situation:
a) R.a.ï.s. dramatically decreases the risks of not being able to run an Application and of handling incorrect data due to problems at the WS Supplier’s side and/or communication failures. The costs of those occurrences need to be factored in when opting for WS, and they may be one of the most important reasons holding off the WS market.
b) The framework guarantees a larger supply, as the market is larger. Also, the need for independent Services makes monopoly conditions less likely to form, keeping the Service fees low.
c) Suppliers can slash subscription costs under particular circumstances, for example when subsidising their presence in order to make an impact in the League Table.
d) Contracts can accommodate Client discounts for the service of providing live-data continuous QA.
e) Similarly, Client dues can be calculated on the actual amount of processing power used, excluding the occasions when the WS was considered faulty.
f) Clients can loosely federate themselves when using a specific WS to share information on its reliability and performance, decreasing the risk of failure even further.
The careful exploitation of these factors, combined with all the other advantages in terms of Reliability and QoS, can compensate the Client for additional costs (if any) due to R.a.ï.s.’s multiple subscriptions, thus making the I of R.a.ï.s. mean Inexpensive in addition to Independent, just as in the RAID acronym.
1. "Web Services 101" by Bob Sutor, Director of Web Services Strategy, IBM http://e-serv.ebizq.net/wbs/sutor_1.html
2. From the Webopedia: http://www.webopedia.com/TERM/W/Web_services.html
3. Presentation by Simon Walkden, Global Head of IT Architecture, UBS Warburg available at http://www.finexpo.com/finexpo-images/presentations/Simon_Walkden-UBSWarburg.pdf
4. QoS includes reliability, management, monitoring and security
5. See "Thinking about Implementing a Web Services Strategy?" by Brian Buehling http://webservices.xml.com/pub/a/ws/2003/03/04/strategy.html
6. Security Assertion Markup Language, XML-based
7. XML Key Management Specification, allowing use of PKI
8. Both "customer milking" and vague choice criteria were also problems already present in the ASP model
9. Or even the simultaneous failure of two returning identical faulty results
10. The simultaneous identical failure of 3 computers can be considered too improbable an event to be granted consideration.
11. This process is also speeding up the signal processing, as the system is not relying on the availability of idle time on a single PC.
12. At RAID Level 3, the control data is written to a dedicated disk. At Level 5, it is itself striped across the available disks.
13. As long as it is certified and approved by the “WiFi Alliance” organisation
14. A total of 14 channels on the 2.4-GHz band in the case of 802.11
15. Also known as WS "Consumers"
16. Also known as WS "Producers". Note that the function performed by the Producer could be any within a standard WS Architecture, including data processing, billing, orchestration, and so on.
17. It could be different external Companies, or different Departments within the same Company
18. See below for types and location of the reconciliation
19. This rule can be applied even if 2 WS do not agree with the others, again by taking as “true” any dataset from the remaining 3 Services.
20. Therefore shielding the WS Consumer also from interruptions for example for software upgrades at the Suppliers’ side.
21. Just as for any data transfer
22. Less likely as it would involve at least 2 Web Services failing identically, despite being independent from each other
23. Just as the Space Shuttle uses 5 onboard computers. Note that if one WS is marked as faulty, deadlock could be prevented by temporarily discarding another one, still leaving 3 Services available to the WS Consumer.
24. For example, a small number of suppliers would increase rates and decrease the quality assured by independent verification of results.
25. Larger because each Consumer needs more than one Producer
26. The other members could drop a Service that had become untrustworthy in the eyes of one member of the Federation, or the Federation as a whole could lobby the Supplier for improvements, debugging, etc.