How to assess the quality of a test - Reliability

It can be difficult to figure out whether the test tools available on the market are of high enough quality to create real value within your company. Fortunately, local agencies that evaluate test quality do exist, for example the British Psychological Society (BPS) in the UK or DNV in Norway. Accreditation by one of these agencies signifies that a test meets several stringent quality requirements assessed by independent experts, and this is a very good benchmark. But what do you do in countries without such agencies?


What do the reliability numbers say about the precision and quality of a test?

To fully assess the quality of a test in terms of reliability, it is recommended to look at more than one factor. It would be easier if we could measure the quality of a test by looking at a single number. Unfortunately, this is not possible, due to the complex nature of occupational tests. Condensing a test's reliability to just one number does not really make sense. It would be equivalent to evaluating a car solely based on its miles per gallon (MPG). First, many other elements determine whether it is a good car, such as safety, horsepower or size. Second, there are several different standards for assessing a car's MPG. Likewise, you can’t evaluate a test's reliability from one number alone. The most important step in judging whether the car is good or not is to understand what you need the car for: you need a context.

To do this, it is important to understand what reliability really is. Essentially, reliability describes precision over both time and place. However, to better explain the concept, I’ll instead describe how we study reliability – this gives a good picture of what we are dealing with.

Reliability in tests can be investigated in three different ways:

  • How well do the elements within the test fit together? (Internal consistency)
  • Do test takers get the same result if they take the test several times? (Test-retest reliability)
  • Do different people or versions of the test come up with the same results? (Inter-rater reliability or parallel versions)


Internal consistency.  Internal consistency is investigated by quantifying how well each item in a scale is connected to the other items within the same scale. A frequently used statistic is the Alpha coefficient, also called Cronbach's Alpha. Alpha captures the degree to which a scale is consistent in measuring the underlying concept of interest. What we want here is that the items on a scale share as much variance as possible. Alpha captures this shared variance. It ranges between 0 and 1, with 0.65 considered a minimum and values around 0.9 generally considered optimal. When people ask for "a test's reliability", they often mean the Alpha coefficient, as it is one of the most used measures of the level of precision of a test.
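To make the idea of shared variance concrete, here is a minimal sketch of how Alpha could be computed with the standard formula: the number of items k divided by k - 1, multiplied by one minus the ratio of the summed item variances to the variance of the total scores. The function name, the data layout (one row of item scores per respondent) and the example scores are assumptions made up for this illustration; they do not come from any particular test.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's Alpha for one scale.

    rows: one list of item scores per respondent (hypothetical layout).
    Uses the sample variance for both the items and the total scores.
    """
    if len(rows) < 2 or len(rows[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(rows[0])                      # number of items in the scale
    items = list(zip(*rows))              # regroup scores item by item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Made-up answers from five respondents on a three-item scale.
scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]

print(round(cronbach_alpha(scores), 2))  # about 0.98 for this toy data
```

The toy data gives a very high Alpha because the three items rise and fall together for every respondent. Real scales rarely behave this neatly, and the sketch does not handle the case where all total scores are identical (the total variance would be zero).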


Test-retest reliability.  It may, in some cases, be more relevant to know if the test is stable over time. In other words, do people get the same result if they retake the test after e.g. three months? For example, this is important information if you’re re-testing employees after a period. Using a test with poor test-retest reliability, you won’t know if differences in results are due to test inaccuracy or whether the employee has in fact changed behavior.


Test-retest reliability is typically stated as a correlation coefficient between the results from the different times the test was taken, and in practice ranges between 0 and 1. The higher the association between outcomes on a test at time x and time y, the better (i.e. coefficients closer to 1 are better). Practically speaking, values around 0.7 are good.
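As a rough illustration, the correlation between two test sessions could be computed like this. The sketch assumes the Pearson correlation is the coefficient being reported, which is common but not the only option, and the function name and scores are invented for the example.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_first = sum(first) / len(first)
    mean_second = sum(second) / len(second)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_first) * (b - mean_second) for a, b in zip(first, second))
    ss_first = sum((a - mean_first) ** 2 for a in first)
    ss_second = sum((b - mean_second) ** 2 for b in second)
    return cross / sqrt(ss_first * ss_second)


# Made-up scores for the same five people, tested three months apart.
time_x = [10, 12, 15, 18, 20]
time_y = [11, 12, 14, 19, 19]

print(round(pearson_r(time_x, time_y), 2))  # about 0.97 for this toy data
```

The same calculation applies when comparing two raters or two parallel versions of a test: only the meaning of the two lists changes.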


Inter-rater reliability.  Inter-rater reliability is investigated by having different individuals assess the same person. An example could be a 360-degree survey, asking several different people to assess the same person. The similarity of their scores indicates the accuracy of the tool used. Obviously, personal evaluations confound the results and affect precision. But if the tool used is reliable, you should see a solid correlation between outcomes on a test, independent of who administered the test. Preferably, we want to see associations above 0.6.

Parallel versions of the same test are not often used in business settings, as this requires developing the “same” test twice and then investigating how similar the results from both tests are when they are filled out by the same person. Here you will typically want to see correlation values higher than 0.8.


Understand the numbers in their right context

The reliability of a test is by nature a complex topic. In the end, you want to know whether a test is “good” or not. However, that cannot be established based on one number alone. As a minimum you’ll need the context. An Alpha of 0.7 is certainly not "good", but if we are talking about test-retest reliability, then that number would actually indicate that the test is stable over time.
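The point about context can be sketched in a few lines. The benchmark values below are the rules of thumb mentioned in this article, and treating each one as a hard cut-off is a simplification made purely for illustration; the names used are hypothetical.

```python
# Rules of thumb from this article, simplified to one benchmark per type.
BENCHMARKS = {
    "internal_consistency": 0.9,  # Alpha: values around 0.9 considered optimal
    "test_retest": 0.7,           # values around 0.7 are good
    "inter_rater": 0.6,           # associations above 0.6 preferred
    "parallel_versions": 0.8,     # correlations higher than 0.8 wanted
}


def reaches_benchmark(reliability_type, value):
    """Return True if the value reaches the benchmark for its type."""
    if reliability_type not in BENCHMARKS:
        raise ValueError(f"unknown reliability type: {reliability_type}")
    return value >= BENCHMARKS[reliability_type]


# The same number, read in two different contexts.
print(reaches_benchmark("internal_consistency", 0.7))  # False
print(reaches_benchmark("test_retest", 0.7))           # True
```

A coefficient only becomes meaningful once you know which kind of reliability it describes.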

To make matters even more complicated, knowledge about reliability is not enough to assess the overall quality of a test. You also need to look at validity, which we will take a closer look at in an upcoming article.

Date: 21.01.2020