How to assess the quality of a test - Validity

As described in the previous article on reliability, it can be a difficult task to figure out whether the test tools available on the market are of high enough quality to create real value for the company.


If reliability is the tracks, then validity is the train. Both must be working correctly for passengers to get where they need to go. But while you can lay down tracks without a train, the train cannot run without the tracks. Similarly, reliability is a prerequisite for validity, but it says nothing about the actual validity of a test.


It requires expert knowledge to do an in-depth assessment of the validity of a test. This is one of the reasons why many international companies require their testing tools to be certified/registered by a major agency such as the BPS or DNV. If a test is certified in even a single language, it usually means you can trust that it is of a high quality. Be aware, however, that a certification only means that the test has been evaluated psychometrically by a trustworthy organisation.


If you are in a situation where you have to evaluate whether a test is good enough, then I would recommend that you look for these three elements: (1) the test must be based on a solid theory, (2) theory and practice should be linked, and (3) the test must be able to "predict" relevant outcomes.


1) Select a theoretically based test

Today, test construction is almost always based on a (psychological) theory. For example, the vast majority of personality tests for business are based on Trait Theory (the so-called "Big Five" traits), since a large body of research shows associations with a variety of relevant outcomes such as teamwork, performance, retention, engagement, etc. Without going further into a discussion of psychological theories, a good piece of advice is to choose a test with a theoretical foundation in a recognised psychological theory. Generally, this can serve as a good first benchmark for high quality.


2) Check whether theory and practice are linked

Any test publisher can claim that their tests are based on theory, but how do you examine if that is true?


Here you will most likely have to put in some work, because there is no single proper way to validate a test. As a minimum, you should look at the test's Fact Sheet, and preferably go a bit further and read the test documentation, which the publisher should be able to provide. Here you are searching for Construct Validity, or in other words: does the test measure what it claims to measure? This should be outlined in the Fact Sheet and explained in detail in the Documentation Manual.


Based on the European Federation of Psychologists' Associations (EFPA) framework for the evaluation of tests, a test provider should be able to present at least one, and preferably more, of the following types of studies on their tests:

  • Item-Test correlations
  • Correlations with other similar tests
  • Test bias
  • Group differences
  • Factor analysis
  • Multi-Method Design


Item-Test correlations. For a scale in a test to be valid, all the questions within that scale should be strongly connected to each other. That is, if a person "strongly agrees" with one question, he or she should typically also agree with all other questions within that scale. This can be analysed using Item-Test (or item-total) correlations, where the rule of thumb is that the correlation between each item and the scale total should be higher than 0.3, with an average of at least 0.5.
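As a rough illustration of what such an analysis looks like, the sketch below computes corrected item-total correlations (each item against the sum of the remaining items) for a small, entirely made-up set of Likert responses; the data and the four-item scale are invented for the example:

```python
import numpy as np

# Hypothetical responses: 8 people answering 4 items of one scale (1-5 Likert).
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
])

# Corrected item-total correlation: correlate each item with the sum of the
# *other* items, so the item does not inflate the total by correlating with itself.
n_items = responses.shape[1]
for i in range(n_items):
    rest = responses.sum(axis=1) - responses[:, i]
    r = np.corrcoef(responses[:, i], rest)[0, 1]
    print(f"Item {i + 1}: r = {r:.2f}")
```

With real data you would then check each item against the 0.3 rule of thumb and the average against 0.5.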


Correlations with other similar tests. If a test is based on the Big Five and claims to measure the Big Five traits, then it should show strong consistency with other established tests that measure the Big Five traits. The same goes for any other theoretical foundation. The method is to test the same people with different tests measuring the same construct, and then analyse how well their scores correlate. Typically, correlations higher than 0.55 are accepted here.
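A minimal sketch of such a convergent validity check, using invented scores for the same ten people on two hypothetical tests that both claim to measure the same trait:

```python
import numpy as np

# Hypothetical: 10 people take two different tests that both claim to
# measure the same Big Five trait (scores on an arbitrary scale).
test_a = np.array([12, 18, 25, 31, 22, 15, 28, 19, 34, 21])
test_b = np.array([14, 20, 24, 30, 25, 13, 27, 22, 33, 19])

# Pearson correlation between the two sets of scores.
r = np.corrcoef(test_a, test_b)[0, 1]
print(f"Convergent validity: r = {r:.2f}")
```

In this made-up example the two score sets track each other closely, so the correlation lands well above the 0.55 rule of thumb.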


Test bias. Test bias is a very broad area. Put briefly, different groups of people should have an equal opportunity to answer the test. For example, if a question in a cognitive test required knowledge of a city in Denmark, then all Danes would have a clear advantage. Of course, competent test developers never design tests this way. What typically happens is that, if one specific translation is poor, then those who take the test in that language will be worse off. Test providers must be able to show, either through an extremely thorough translation process or through analysis, that their tests are free of bias.


Group differences. If what you are trying to measure has "natural" differences across groups, then the test must also show those differences. For example, within personality testing, we often find that older people have lower scores on the scale called "Neuroticism" (one of the Big Five dimensions). Thus, if the test you are using measures the Big Five, it should also be able to demonstrate this difference. Of course, such differences will inevitably vary, but the test provider should be able to show some sensitivity to established group differences within the test.


Factor analysis and Multi-Method Design. I won't describe Factor Analysis or Multi-Method designs here, as they are considerably more complex. There are plenty of good descriptions online if you are interested, and they will typically not be the first types of studies that test publishers provide anyway, so it is less likely that you need to know about them in order to evaluate a test.


3) Find out if the test works

From a user perspective, the most important aspect of validity to consider is perhaps Criterion Validity. Criterion Validity shows you how well the test predicts an external outcome, typically an important Key Performance Indicator (KPI) for the company. If the test is capable of repeatedly predicting performance, then you know that it works for the purpose. An example of a measure of criterion validity is the association between the outcome on the test (or scale) and sales as a KPI at a company. In general, correlations above 0.2 are considered acceptable, and those over 0.35 good. If you are looking for a single good measure of validity, then look for criterion validity, and choose tests that can document correlations of at least 0.35 with a KPI.
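A sketch of what a simple criterion validity check might look like, using made-up test scores for ten salespeople and invented yearly sales figures as the KPI:

```python
import numpy as np

# Hypothetical data: test scores for 10 salespeople and their yearly
# sales (the KPI), in thousands. All numbers are invented for illustration.
scores = np.array([55, 62, 48, 71, 59, 66, 51, 74, 58, 69])
sales = np.array([310, 350, 260, 420, 300, 380, 330, 410, 280, 360])

# Criterion validity: correlation between test score and the external KPI.
r = np.corrcoef(scores, sales)[0, 1]
print(f"Criterion validity: r = {r:.2f}")
if r >= 0.35:
    print("Above the 0.35 threshold for a good result")
```

In practice such a study needs far more than ten people, and the test provider should report it in the documentation rather than leave the analysis to you.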


As should be clear by now, it is quite an extensive process to properly assess validity. I can strongly recommend selecting tests with an accreditation. But if that is not possible, then hopefully this article has given you a slightly better idea of how to assess the validity yourself.

Category: Talent Acquisition, Recruitment, Leadership, Development

Date: 18.11.2020