Testing, testing, testing
Karl Reed, Information Age
19/04/2002 17:13:07
Software testing increasingly engages Australian IT specialists. Now a major regional survey seeks to guide best practice, research and education.

You may have heard the story of passengers being offered the choice of flying on one of two aircraft. One has thoroughly tested fly-by-wire software; the other's systems have been formally verified and proven correct, but they have not been tested. The latter is on its maiden flight.

If you think this is ridiculous, then you may have missed some of the most important philosophical and practical arguments in our field. Pick any modest-sized program with a realistic input specification, and try to construct a set of tests that will guarantee that the program meets the specification. For even simple programs, many hundreds of test cases will be produced. But if the program passes all your tests, does this mean it is "correct"? The answer is, you can't be sure. However, if you find an error (or fault), then it's possible to say something definite: "We found a fault and (hopefully) corrected it."

This has led to the oft-quoted aphorism, due to Edsger Dijkstra, "Testing cannot prove the absence of errors". By implication, it can only prove their presence. This follows, it is argued, because test sets are finite, limited both in number and by the known limitations of test generation techniques, while the set of all possible inputs for any real program could be extremely large.

However, test we must. Testing has become increasingly visible as a specialist activity in software development. A major question for us all is: how much testing are we doing, and how much should we do? Further, we need to answer the question: are we getting an appropriate benefit?

Despite the problems, we persist because, first, thorough testing does locate faults, whose removal increases a system's quality. Second, testing need not be limited to the detection of faults - it should also address usability issues.
Third, the test sets that a system passes, if properly constructed, can define an operational "envelope". It would be possible to restrict the system's operation to this envelope of valid operations.

The alternative to testing is program proving, an approach which has come a long way in the last 35 years, with significant claims of success. It is now often applied to specifications of programs, where it can be used to prove that they are correct, something generally easier than proving the program itself. These formal specifications can be used to generate tests as part of "specification-based testing".

Program proving suffers from the same problems that programming itself does. First, the proofs can contain errors. Second, the methods require special training, skill and some mathematical aptitude, and hence are not widely used. In addition, it is difficult to formally verify or prove code containing many of today's favourite programming devices. For example, the critical parts of the Boeing 777's digital flight control systems are reputed to be written in a variant of Ada which has no dynamic storage allocation and no generic packages. A good discussion can be found in Shari Pfleeger and Les Hatton's paper of 1997 [Pfle1997].

Testing approaches are divided into "white-box" and "black-box", according to whether they involve detailed analysis of the code or allow testers to work from the specification. However, the distinction is not clear. Several equivalence-class and boundary-value techniques require the tester to "partition the input into equivalence classes that have the property that any test chosen from such a class is equivalent to any other test in the class". This means only one test is needed per class instead of many. A little thought leads one to conclude that for the tests to be equivalent in this way, they must all execute the same code. This suggests that the only way of being sure that the partitions are correct is by looking at the code.
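The point about partitions can be illustrated with a small sketch. Everything in it is hypothetical: the shipping-fee rule, the function name `shipping_fee` and the figures are invented for illustration and do not come from any system discussed here. The tests are chosen from the two equivalence classes the specification suggests, plus their boundary values, yet the code contains a third path that the specification never mentions.

```python
def shipping_fee(order_total):
    """Hypothetical spec: orders under 100 pay a fee of 10; all others ship free."""
    if order_total < 100:
        return 10
    if order_total >= 1000:
        # Extra code path that the specification does not mention.
        # Seeded fault for illustration: the spec says the fee should be 0.
        return 5
    return 0


# Inputs chosen from the two specification-derived classes
# (under 100, and 100 or more), including the boundary values 99 and 100.
spec_based_tests = {0: 10, 50: 10, 99: 10, 100: 0, 500: 0}

failures = [
    total
    for total, expected in spec_based_tests.items()
    if shipping_fee(total) != expected
]

# Every specification-based test passes, yet 1500 belongs to the same
# specification class as 500 and executes different (faulty) code.
hidden_path_result = shipping_fee(1500)
print("failing spec-based tests:", failures)
print("fee for 1500:", hidden_path_result)
```

Under these assumptions the inputs 500 and 1500 are "equivalent" according to the specification but not according to the code, which is the sense in which the partitions can only be confirmed by reading the code.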
T Y Chen's recent work [Chen2000] provides an excellent discussion of this problem, which was part of the early proposals for partition-based testing.

In terms of "coverage", the measure of success for white-box testing, we could report that a particular test set caused 90 per cent of the statements to be exercised. The candidate features for test coverage are language dependent. Ada and the OO languages have procedures and classes that can be invoked with a wide variety of parameter types, making test coverage more complex. In most older languages, it is possible to call a procedure only with specific parameter types. Typically, the common features "covered" are statements, paths, procedures, procedure invocations, branches, etc. Less common (but equally important) are exceptions raised and exceptions "caught".

Sadly, however, it is known in practice that it is difficult and unusual to achieve 100 per cent statement coverage or path coverage, no matter how carefully we generate input cases [Basi1987]. In addition, that the set of "features" that need to be "covered" is wider than one would expect can be seen from a very readable paper by Weiser and company [Weis1985], which describes a number of difficulties with which all testers should be familiar.

The debate on "why test software" was given a nasty twist in 1987 when Vic Basili [Basi1987] (now Visiting Professor at UNSW) published a study showing that experienced programmers who were restricted to reading code were significantly more effective in finding errors than those who could execute the code, those generating test cases from specifications, and those who used test-coverage analysis. This may not be surprising in terms of what we now know about inspections, but it was an unexpected outcome at the time.

In the past, system-based test harnesses were used to verify program behaviour against a specification.
Ernie Zimmer's test harness for telephone exchange systems at Ericsson in the 1970s was a language for describing expected behaviour at the system level in terms of the values expected in a series of data items, at a sequence of points traversed during execution. The test harness would check that the values and points reached were correct, and could reset incorrect values to allow execution to continue. Abramson and Sosic's GUARD system [Sosi97] compares the internal data generated by a working version of a program with that generated by a modified version, or one ported to another platform. Both programs are run simultaneously, either on one machine or on two, and, if necessary, communicate over a network to make the comparisons, a feature of great use when debugging distributed systems.

Test quality relates to having confidence that tests would indeed find errors if they were present. The "white-box" aspect of this deals with the process of seeding: adding bugs to the code, and running the tests to see if they produce a detectable fault. Knight looks at the capacity of tests to detect syntactic errors [Knig1985]. The most extreme approach is known as "code-based mutation testing" [King1991] (to differentiate it from the more recent approach of "specification-based mutation testing", a black-box approach being investigated by my student, Tafline Murnane [Murn01]).

When to stop testing

In practice, knowing that we have either found all the faults in a system, or that it is no longer economic to continue trying to find them, is important. Since the initial number of faults is unknown, a means of estimating them is needed. Historical data which recorded the fault rates for particular teams producing similar software can be used to provide an indication of the number of faults which we might expect to find during testing.
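Seeding can also feed such an estimate. As a rough sketch, and assuming that seeded faults are about as hard to detect as the real ones (an assumption that rarely holds exactly), the proportion of seeded faults that the tests catch can be used to scale up the number of real faults found so far. The function name `estimate_total_faults` and all the figures below are hypothetical, chosen only to show the arithmetic.

```python
def estimate_total_faults(seeded_total, seeded_found, real_found):
    """Estimate how many real faults were present before testing began.

    Assumes the tests detect real faults at the same rate as seeded ones,
    so that real_found / real_total is roughly seeded_found / seeded_total.
    """
    if seeded_found <= 0:
        raise ValueError("no seeded faults detected; the estimate is undefined")
    if seeded_found > seeded_total:
        raise ValueError("cannot detect more seeded faults than were seeded")
    return real_found * seeded_total / seeded_found


# Hypothetical figures: 20 faults seeded, 15 of them detected by the tests,
# and 30 genuine faults found during the same test run.
estimated_total = estimate_total_faults(seeded_total=20, seeded_found=15, real_found=30)
estimated_remaining = estimated_total - 30

print("estimated real faults at start:", estimated_total)
print("estimated real faults remaining:", estimated_remaining)
```

With these invented figures, the tests caught three quarters of the seeded faults, so the 30 real faults found are taken to be three quarters of the real total. The estimate is only as good as the assumption that seeded and real faults behave alike.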
Shooman explains how a statistical device called the "fish pond test" can be used to estimate the number of errors in a program [Shoo1983]. Musa's two models [Musa1990], the "basic" and the "logarithmic", provide tractable estimates using initial testing results, although I have not seen much recent work validating them.

From a pragmatic point of view, one can suggest a cyclic process which should be followed as the number of faults detected (cumulatively) begins to plateau as testing progresses. Check the test sets carefully to ensure that they seem to have been thoroughly constructed. In particular, check that any prescriptive methods are being followed, and that the test cases are reviewed to ensure that obviously erroneous input has been included. If necessary, introduce new test cases, and test some more. Otherwise, stop.

Of course, a company may simply decide to release some software even though there are known errors. This violates our precepts of good software engineering. Sadly, my daily experience of such software shows that some suppliers consider their buggy products "good enough". Companies are clearly making judgments here: balancing legal liabilities against product development cost against reputation and market penetration. The inevitable result will be that products will contain known errors.

Ultimately, it means understanding that, no matter what, reducing the inherent error rates in the programming process is at least as important as simply testing. As Whittaker points out in his excellent summary of testing [Whit2001], customers seem to report bugs, no matter how hard you test.

The regional survey

The real question is: exactly how do those in industry actually perform testing, and how effective is it? Anecdotal evidence is that more and more effort is being committed. How much more? What kind of test techniques are used? Well, the purpose of this article is to encourage you to participate in a survey of software testing.
The survey is being conducted by Swinburne University of Technology and La Trobe University, and is being financially supported by the ACS.

We do know that in recent years, software testing has been receiving increasing attention from the Australian IT industry. The level of attendance at the first AsiaSTAR Conference, held in Sydney last July, and the popular demand for training courses suggest that testing is emerging as a specialised profession. Our aim is to provide a clear indication of industry best practice, of future trends and demands for education and training, and of research directions. A parallel survey is being conducted in Southeast Asian countries. One outcome should be the identification of best practice, enabling the Australian software industry to strengthen its competitive advantage.

The survey is directed to software testing practitioners and management in the areas of software testing techniques, automated tools, training and education, standards and external consultancy. For the survey to be effective, we need responses from a large number of software developers, whether they are heavily engaged in testing or not.

The survey team is: Doug Grant, Swinburne University of Technology, principal investigator; T Y Chen and S Ng, Swinburne University of Technology, principal investigator; T Murnane and K Reed, La Trobe University.

Companies interested in participating should contact the survey coordinator, Dr Sebastian Ng, School of Information Technology, Swinburne University of Technology. E-mail sng@it.swin.edu.au, phone (03) 9214 8666.

Conclusion and acknowledgements

It has not been possible to deal with all the important current topics in testing in the space available. The author gratefully acknowledges input and assistance from Phil Stokes (Bond University) and Tafline Murnane, who proofed the drafts and provided some key references, Dave Abramson (Monash), and Paul Strooper (University of Queensland).
In the end, however, any errors of fact and sins of omission are the responsibility of the author.

References

[Basi1987] Basili, V R and Selby, R, "Comparing the Effectiveness of Software Testing Strategies", IEEE Transactions on Software Engineering, December 1987, pp 1278-1296.
[Chen2000] Chen, T Y, Tang, S F, Poon, P L and Yu, Y T, "White on Black: A White-Box Approach to Selecting Black-Box-Generated Test Cases", Proceedings of the First Asian Pacific Conference on Quality Software, IEEE Computer Society, Hong Kong, 2000.
[King1991] King, K N and Offutt, A Jefferson, "A Fortran Language System for Mutation-based Software Testing", Software - Practice and Experience, Vol 21(7), July 1991, pp 685-718.
[Knig1985] Knight, J C and Ammann, P E, "An Experimental Evaluation of Simple Methods for Seeding Program Errors", Proceedings of the 8th International Conference on Software Engineering, London, 1985, pp 337-342.
[Murn01] Murnane, T and Reed, K, "On the Effectiveness of Mutation Analysis as a Black Box Testing Technique", Proceedings of the Australian Software Engineering Conference, Canberra, IEEE Press, August 2001.
[Musa1990] Musa, J D, Iannino, A and Okumoto, Software Reliability, Professional Edition, McGraw-Hill Software Engineering Series, 1990.
[Pfle1997] Pfleeger, S L and Hatton, L, "Investigating the Influence of Formal Methods", IEEE-CS Software, Vol 13, February 1997, pp 33-43.
[Shoo1983] Shooman, M, Software Engineering: Design, Reliability and Management, McGraw-Hill, 1983.
[Sosi97] Sosic, R and Abramson, D A, "Guard: A Relative Debugger", Software - Practice and Experience, Vol 27(2), February 1997, pp 185-206.
[Weis1985] Weiser, M D, Gannon, J D and McMullin, P R, "Comparison of Structural Test Coverage Metrics", IEEE Software, March 1985, pp 80-85.
[Whit2001] Whittaker, J, "What Is Software Testing? And Why Is It So Hard?", IEEE Software, January/February 2001, pp 70-79.

Karl Reed is Director, Computer Systems and Software Engineering Board and Visiting Professor, School of Information Technology, Bond University. He is on leave from the Department of Computer Science and Computer Engineering at La Trobe University.