At many serious games conferences I attend, people talk about the pressing need for more serious games to be validated. People talk about the handful of examples of serious games that have been validated. I assume this means that scientific trials were conducted that validated the use of these serious games to impact outcomes.
But when I listen more closely, sometimes I hear people say that they have “validated” their serious game at various steps of the development process. Humph. How do you validate an incomplete game for effectiveness? Then it turns out they never conducted a trial to evaluate the efficacy of their game to impact outcomes. But they still say they “validated” their game. How can that be?
I was confused by this for a long time until I dug a bit deeper into the use of the term “validation.” I think I found the root of my confusion.
The confusion surrounding the use of the term “validity” with serious games often arises because serious games can be used as interventions to impact outcomes or as measures to assess performance. Someone could say, “My Kinect exercise program has been validated.” This could mean that the Kinect exercise program is a valid intervention shown to increase physical fitness outcomes. It could also mean that their Kinect exercise program has been determined to be a good way to measure how much a person exercises. The game can be a good measure of physical exercise regardless of whether or not people who play it are exercising more as a result of interacting with the game.
It’s basically the difference between a pedometer that has been shown to lead to increases in physical activity versus a pedometer shown to give accurate readings of physical activity in the people who use it, whether or not these people get off the couch and start exercising more.
Both types of validity are important but obviously indicate very different things. Experimental validity refers to treatments or interventions that have gone through the process of an experimental study or studies that provided adequate evidence that they “work” as intended. Test validity refers to assessment tools or measures that are evaluated and developed with a combination of subjective assessments and correlational studies that demonstrated that they do a good job of “measuring” what they are intended to measure.
I am going to explain the details of experimental validity and test validity below. I realize that for many of you this discussion will be very academic and boring. I will try to make it as interesting as possible. I am doing this because I hope you can slog through it and gain some insights that at the very least change the way you think about things. Ultimately, I hope they might change your behavior. Specifically, I hope that it will inspire you to be more specific about what you mean by validation when you use the word. I also hope that if you hear someone say that their serious game has been validated, you will be confident enough to ask if that means it has been validated as an intervention or as an assessment tool. So if you can bear with me through the following academic discussion, I will follow it up with my thoughts on why I think this distinction between experimental and test validity is important to clarify in the area of serious games.
Experimental validity refers to whether or not an experimental condition or treatment has an effect for its intended purpose. We can talk about experimental validity in terms of having differing levels of internal and external validity.
Internal validity refers to whether or not an experimental treatment (in our case, a serious game) makes a difference on outcomes AND whether or not there is sufficient evidence to support the claim that it makes a difference. People often say that Re-Mission, a game for young people with cancer, has been validated as an adherence intervention because we did a very large randomized trial with objective measures. The results showed that patients who played the game increased their adherence to oral chemotherapy and antibiotics. The randomized trial design nailed the evidence for drawing causal conclusions about the game with a slam dunk of scientific evidence (yes, criticize randomized trials all you want but they do a great job of testing causal claims). There was therefore good scientific evidence in this one trial to support claims that Re-Mission actually “worked” as intended. We can therefore say that Re-Mission is a serious game that has internal validity as an intervention for treatment adherence among young people with cancer.
External validity refers to whether or not the claimed effects of an experimental treatment or serious game will transfer to people, environments and situations outside the original scientific study setting. In the case of Re-Mission, the game’s effects probably have high external validity for young patients with cancer in the US, Australia and Canada where the study was conducted. It has questionable external validity for older patients with cancer or other diseases in other countries and cultures where the effects were not evaluated.
When I give talks at conferences where I try to argue that we need more quality research on serious games, I ask audience members to raise their hands if they cited the Re-Mission study in their proposals to get funding to develop their own serious game. I then ask these people what type of serious game they were proposing to develop. They have told me that they were proposals to develop a classroom game to teach math, a game for cystic fibrosis, and even a game for driver education (among many topics). I then ask them if they really think that the findings from a treatment adherence game for young cancer patients will generalize to their groups. There is nervous laughter and some people look at the floor. The findings may or may not generalize to their game. This is an issue of external validity. By the way, this doesn’t mean these people shouldn’t cite the Re-Mission study. It does mean that anyone developing a new serious game should plan to scientifically evaluate whether or not it actually has an effect on intended outcomes. It certainly is not a given that one study proved that all serious games work.
Taken together, internal and external validity as components of experimental validity allow one to make claims that their serious game works in light of the evidence they have to support that claim while acknowledging the limits of how much that claim might generalize beyond the existing evidence.
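For readers who like to see the logic spelled out, here is a minimal sketch of how a randomized comparison supports a causal claim. It is not the analysis used in the Re-Mission trial. The adherence scores are made-up numbers, and the function name `permutation_p_value` is just an illustrative choice. The idea is simple: if the game made no difference, shuffling the group labels should produce differences as large as the observed one fairly often.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Return (observed difference in means, two-sided permutation p-value).

    Group labels are shuffled repeatedly to estimate how often a difference
    at least as large as the observed one would arise by chance alone.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treat]) - mean(pooled[n_treat:])
        if abs(diff) >= abs(observed):
            at_least_as_large += 1
    return observed, at_least_as_large / n_permutations


# Hypothetical adherence scores (percent of doses taken); NOT real trial data.
game_group = [78, 85, 90, 72, 88, 81, 93, 79]
control_group = [70, 65, 80, 60, 75, 68, 72, 66]

observed_diff, p_value = permutation_p_value(game_group, control_group)
print(f"Difference in mean adherence: {observed_diff:.2f} points")
print(f"Permutation p-value: {p_value:.4f}")
```

A small p-value here speaks only to internal validity: it says the difference in this sample is unlikely to be due to chance. It says nothing about whether the effect would show up in other players, games, or settings, which is the external validity question.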
Where the Confusion Sets In
Serious Games as Assessment Tools
There is confusion about what a “validated” serious game means because serious games are not only being used as interventions to impact outcomes but they are also used as measures to assess performance. It is confusing for me personally because my initial assumption is that serious games are made to impact outcomes. BUT, people do make and use serious games as assessments to measure something about the player.
If we look at the field of medical education, we find many examples of serious games used to train aspiring doctors. Most of these serious games are actually simulation training tools, such as a surgical simulator (note: we can argue about whether or not these are really “serious games” elsewhere). Surgical simulators are digital interactive tools that can be used to help doctors in training learn surgical skills. They are educational tools designed to train and shape target behaviors. Confusion arises around issues of validity because these same training simulators can also be used to assess student performance. The validity of a surgical simulator as an assessment tool involves efforts to determine what is known as “test validity.”
If we use a surgical simulation as an example, the simulator would in general be a valid measure of surgical skill if performance on the simulation was a good indicator of surgical performance on real patients in a real operating room (or “theater” as the Brits like to say). There are several components of test validity that one would evaluate to determine the validity of a surgical simulation as a test of surgical skill.
Construct validity in testing
A measure has construct validity if it is actually measuring what it theoretically set out to measure. For example, a surgical simulator could set out to measure the technical aspects of surgical skill. It could provide situations that allow for the measurement of hand-eye coordination, fine and gross motor skills, mechanical knowledge (how to tie a knot or make a cut), and efficient performance with economy of movement and minimal errors. Surgical skill as a construct might also include some non-technical skills shown to be related to surgical success in the real world such as communication skills, stress management, or teamwork.
Content validity indicates the extent to which an assessment tool contains content that relates to the knowledge or skills that are required in the area of assessment focus. A surgical simulator would have content validity if a group of experts agreed that its content allowed for the evaluation of critical technical and perhaps even non-technical skills that would be evident in someone skilled in surgery. If the simulator left out opportunities to evaluate a critical skill such as fine motor skills, then content validity would be lower.
Criterion Validity (Predictive and Concurrent)
Criterion validity refers to the extent that performance on a test relates to criteria in the real world. If a surgical simulator has high criterion-related validity, then performance scores on the surgical simulator would be highly correlated with surgical performance in the real world on real patients in a real operating room.
Criterion validity can be broken down into concurrent and predictive validity. The simulator would have concurrent validity if scores on it were highly correlated with scores on other assessments given at the same time that have themselves been shown to be highly correlated with surgical performance in the real world. Furthermore, the surgical simulator would have high predictive validity if performance on the simulator was determined to be highly correlated with surgical performance months or even years later.
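Since criterion validity comes down to correlations, a small sketch may help make it concrete. The scores below are invented for six hypothetical trainees, and the variable names are my own. Real validation studies use far larger samples and more careful statistics, so treat this only as an illustration of the idea.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical, made-up data for six trainees (NOT from any real study).
simulator_scores = [62, 70, 75, 81, 88, 93]
# Ratings from an established assessment given at the same time as the simulator.
concurrent_ratings = [3.1, 3.4, 3.9, 4.0, 4.6, 4.8]
# Expert ratings of the same trainees in a real operating room a year later.
later_or_ratings = [2.9, 3.6, 3.5, 4.2, 4.4, 4.9]

concurrent_r = pearson_r(simulator_scores, concurrent_ratings)
predictive_r = pearson_r(simulator_scores, later_or_ratings)
print(f"Concurrent validity (r with same-time assessment): {concurrent_r:.2f}")
print(f"Predictive validity (r with later performance): {predictive_r:.2f}")
```

Notice that nothing in this calculation tells you whether practicing on the simulator makes anyone a better surgeon. High correlations support the simulator as a measure, not as an intervention.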
Face Validity (consensus) is basically whether or not the assessment tool, when you take a subjective look at it, seems to be “getting at” what you want it to get at. Many times, face validity involves experts agreeing on whether or not the assessment contains critical items that will get at the core construct in a way that makes sense. Face validity can be important for people taking the test. It influences whether or not they think the test is fair and therefore plays a part in their motivation to take the test.
Overall, different aspects of test validity “get at” whether or not an assessment is measuring what it is intended to measure. Evaluating test validity involves both qualitative and empirical approaches. These approaches share some similarities with the processes used to evaluate experimental validity, but they differ from them in many ways.
The importance of distinguishing experimental validity from test validity
OK, here is why I think it is REALLY important to be clear about whether a serious game is a valid treatment or a valid measure.
Important point #1
When we say we need validation studies of serious games, I think most people mean that we need good evidence that our serious games work. Thus, most people are talking about experimental validity. While it is important to establish test validity for serious games used as assessments, at this point in time, the future of the field depends more critically on validating their use as tools to train. There are overwhelmingly more serious games that have been developed as intervention tools than as performance measures. The fact that most of these have not been validated for the impact they have on intended outcomes weakens the basic premise of serious games that they provide engaging and entertaining ways to train and educate. While it is important to the field that serious games used as performance measures are validated, I think that efforts to validate serious games as interventions provide more powerful support to the strength of our endeavor as a whole.
When I hear someone say that they validated their game as an assessment tool, I can almost audibly hear people breathe a sigh of relief that someone else has done the hard work of validating a serious game. But watch out, they validated it as an assessment tool and not as an intervention. So you can breathe but not a sigh of relief. There is still work to be done in evaluating whether or not most of our serious games can be considered valid intervention tools that impact outcomes. As stated above, I believe this is a critical issue for the future of serious games.
Important point #2
We should be clear about what we are validating because valid measures can creep into becoming interventions through the magic of Testing Effects. I explained the magic of Testing Effects in more detail in last week’s post on the efficacy of Brain Games. Brain Games basically take assessment tools that psychologists have used to measure cognitive functioning, put them in a game format, get players to practice them and get better on that test. They then claim these tests-turned-interventions made players better in the area of cognitive functioning that the test is supposed to reflect. That ain’t necessarily so, my friends. The DMV (Department of Motor Vehicles) may fall for that type of reasoning when people take their driving test multiple times and improve on it, but hopefully we’re smarter than the DMV. We know a bit more about the difference between a test effect and an intervention effect. A validated measure is NOT the same thing as a validated treatment. In fact, a well-validated measure can be used to show that a treatment is NOT valid. These beasts are related but they are not the same beast.
Thank you for sticking with me through this very academic discussion of validity. Moving forward, the field is wide open to have serious games that have been validated as interventions AND/OR as assessment tools. We just need to use the word “validity” more carefully when we talk about our serious games. And let’s not be afraid to get other people to clarify what they mean when they say their serious game has been validated. We can simply ask, “Are you referring to experimental or test validity when you say your serious game has been validated?” or “Are you saying that your serious game was validated as an intervention or a measure?” Or as Dr. Lyle A. Brenner, Jack Faricy Professor of Marketing at the University of Florida, likes to say, “‘Validated’ always needs a modifier. For example ‘validated as a measure / predictor of X’ or ‘validated as affecting Y.’”
By asking these questions and providing these reminders, you’ll probably clear up a lot of confusion not only for yourself but for other people in the room as well. And finally, let’s try to do more validation studies on serious games as effective tools to make people smarter, healthier, stronger, and maybe even kinder. The future of serious games depends on it.
(Note: Please read my next post for Part 2 on validating serious games http://wp.me/p299Wi-dW)
Campbell, D. T., Stanley, J. C., & Gage, N. L. (1963). Experimental and quasi-experimental designs for research (pp. 171-246). Boston: Houghton Mifflin.
Carter, F. J., Schijven, M. P., Aggarwal, R., Grantcharov, T., Francis, N. K., Hanna, G. B., & Jakimowicz, J. J. (2005). Consensus guidelines for validation of virtual reality surgical simulators. Surgical Endoscopy and Other Interventional Techniques, 19(12), 1523-1532.
Kato, P. M. (2012). Evaluating efficacy and validating games for health. GAMES FOR HEALTH: Research, Development, and Clinical Applications, 1(1), 74-76.
Kato, P. M. (2010). Video games in health care: Closing the gap. Review of General Psychology, 14(2), 113-121.
Kato, P. M., Cole, S. W., Bradlyn, A. S., & Pollock, B. H. (2008). A video game improves behavioral outcomes in adolescents and young adults with cancer: A randomized trial. Pediatrics, 122(2), e305-e317.