Wednesday, June 3, 2015

Guide to Developing High-Quality, Reliable, and Valid Multiple-Choice Assessments Marcy H. Towns

Guide to Developing High-Quality, Reliable, and Valid Multiple-Choice Assessments
Marcy H. Towns
Journal of Chemical Education 2014 91 (9), 1426-1431
DOI: 10.1021/ed500076x

Dr. Marcy Towns is a member of the Department of Chemistry at Purdue University. She has authored other Journal of Chemical Education articles, including one I have read, "Developing Learning Objectives and Assessment Plans at a Variety of Institutions: Examples and Case Studies," published in 2010.

In this article, Towns aims to provide faculty with guidelines for creating reliable and valid multiple-choice assessment questions. She views this as an important part of chemistry programs' efforts to enhance their assessment of student learning. Much of what she summarizes and synthesizes as a guide in this article comes from other reference works, whose content she has adapted for use by chemistry faculty.

This article has many useful tips for creating and improving multiple-choice items so that a learning outcome is assessed with a valid and reliable item. Many of the multiple-choice questions I have used are either exact or revised versions of publisher test-bank questions. I have created original ones as well, and I must admit that I have committed some of the practices the author, drawing on the reference works she consulted, recommends against.

Part of the evaluation of assessment tasks and question items involves looking at their reliability and validity. 
·         Reliability pertains to reproducibility and precision of the question item so that regardless of who reads and responds to the item, the question is interpreted the same way and the same answer is selected.
·         Validity pertains to the accuracy of the assessment item: that it truly is assessing the aligned learning objective and not some other skill or knowledge.
Thus, reliable and valid assessments allow faculty to measure precisely and accurately students' ability to meet specific learning objectives.

Multiple-choice questions are a common mode of assessment, including the ACS standardized national exams, which are purely multiple choice. She defines the following terms for multiple-choice questions: a multiple-choice item starts with a stem (the question) followed by response options, one of which is the correct response; the rest are distractors.
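To make the terminology concrete, here is one simple way to represent such an item in code (my own representation, not anything from the article):

# A multiple-choice item in the article's terms: a stem, a set of response
# options, one correct response (the key), and the remaining distractors.
item = {
    "stem": "Which element is an alkaline earth metal?",
    "options": ["Na", "Mg", "Al", "Cl"],  # one correct response + three distractors
    "key": "Mg",
}
distractors = [option for option in item["options"] if option != item["key"]]
print(distractors)  # ['Na', 'Al', 'Cl']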

Content Guidelines
Learning outcomes and assessment should be closely aligned; the multiple-choice question should be driven by a well-defined, specific learning objective. The author recommends that multiple-choice questions assess understanding and require students to apply knowledge rather than retrieve "trivia".

General Item-Writing Guidelines
·         If learning objectives use Bloom’s taxonomy, then the MC item can be easily classified: “based upon the type of knowledge required (factual, conceptual, and/or procedural knowledge) and the level at which the question is asked (remembering, understanding, applying, or analyzing, etc.)”.
·         The author recommends not using complex multiple-choice items that require elimination and selection, as these test a different form of analytical skill that has more to do with test-taking ability and process of elimination.

Stem Construction
·         Stems should be clear and brief.
·         Stems should avoid negative phrasing such as "not" and "except". If such words are used, they should be highlighted, underlined, or bolded.
·         Ideally, stems should be written so that the student can come up with an answer without looking at the response set.
·         The stem should contain as much information as needed while keeping the response set brief, clear, and to-the-point. (See article for an example)

Response Set Construction
The author recommends the following, based on research and practice literature on writing multiple-choice assessment questions.
·         Numerical answers should be arranged in ascending or descending order.
·         Verbal answers should be about the same length and should be arranged in a logical manner. An overly long response statement may invoke “look for the longest response” bias; attempting to make them similar in length removes this bias and improves validity.
·         Response sets should avoid "all of the above" and "none of the above".  These have the potential to be answered correctly based on the student's analytic test-taking skills, thus nullifying the question's validity and reliability.  Using "none of the above" turns the question into a true-or-false exercise, because the student is then encouraged to evaluate each statement as true or false to test the correctness of "none of the above" rather than connecting it to the concept targeted by the question.
·         Avoid using absolute terms such as “never” or “always” as students normally gravitate away from these (thus again reducing validity).
·         Write response statements in the affirmative to make them less confusing (precision).
·         Distractors for questions requiring calculations can be created by applying common calculation errors (whether mathematical or conceptual) that students make, e.g., in a stoichiometry calculation, forgetting to use the correct mole ratio or not converting to the correct units. Distractors generated from common errors are more plausible answers and increase the validity of the question (see the worked sketch after this list).
·         Simplify! Overly complex questions that require multiple connections to determine the correct answer are not ideal, as they may compel students to simply guess.  They also likely end up assessing more than one learning outcome.  See example 6 in the article.
·         Write questions that require the application of knowledge and not just recall.  For example, asking students to identify a halogen or an alkaline earth metal from a list of elements is a recall question. Asking students to identify an ionic compound composed of an alkaline earth metal and a halogen requires application of knowledge.
·         Write multiple-choice questions that are independent of each other.  Linking multiple-choice questions can doom students if the questions are constructed such that when the first answer is incorrect, the probability that the rest are incorrect is high.  Here is what the author has to say further about this: "The entire test is an instrument to measure student understanding of learning objectives and proficiency in a content domain. Each item needs to independently contribute to that measurement. Moreover, measures of test reliability and validity assume that the items are independent from one another."
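As a worked sketch of the distractors-from-common-errors idea (my own example and numbers, not taken from the article): for the reaction N2 + 3H2 → 2NH3, ask how many grams of NH3 form from 6.0 g of H2; the key and two plausible distractors fall out of the correct calculation and two common errors.

# Hypothetical stoichiometry item: key plus distractors built from common errors.
# Assumed molar masses: H2 = 2.016 g/mol, NH3 = 17.03 g/mol.
g_H2, M_H2, M_NH3 = 6.0, 2.016, 17.03

correct = g_H2 / M_H2 * (2 / 3) * M_NH3      # grams -> moles H2 -> moles NH3 -> grams
no_mole_ratio = g_H2 / M_H2 * M_NH3          # error: forgot the 2:3 mole ratio
no_molar_mass = g_H2 * (2 / 3) * M_NH3       # error: treated grams of H2 as moles

print(round(correct, 1), round(no_mole_ratio, 1), round(no_molar_mass, 1))
# 33.8 (key), 50.7 and 68.1 (plausible distractors)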

Optimum Number of Test Responses in a Set
·         The author recommends that the number of responses in a set be driven by the number of plausible answers.  Using implausible or non-functioning response statements reduces the reliability and validity of the test question. (Her exact words: "The guiding principle in response set construction is that every option must be plausible.")
·         One study (Rodriguez; see the article's citation list) suggests an optimal number of three: one correct answer and two plausible distractors.  Rodriguez's research shows that using more than three "does little to improve item and test score statistics", and other research agrees. One practical advantage is that students need less time to evaluate three response statements, so more multiple-choice items can be included.

Issues Related to Test-Wiseness
·         Some cues that test-savvy students may use to determine the correct answer through test-taking skill rather than applied knowledge: responses that do not follow grammatically from the stem, overly specific or absolute statements with words like "never" or "always", and repetition of a phrase from the stem signaling the correct response.

Correct Response Issues and Key Balancing
·         Double-check that each multiple-choice item has only one correct answer.
·         Do a frequency count of the correct answer letters to achieve key balancing (an even distribution of correct-answer positions).
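A minimal sketch of such a frequency count, assuming the answer key is just a string of letters (my own code, not the author's):

# Key balancing: count how often each letter is the correct answer on the exam.
from collections import Counter

answer_key = "BCADBACDBCADAB"   # hypothetical answer key for a 14-item exam
print(Counter(answer_key))      # Counter({'B': 4, 'A': 4, 'C': 3, 'D': 3}) -- roughly balanced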

Item Order and Response Order Effects
·         Research by Schroeder et al. analyzing item order and response order effects in the ACS exams indicates that when students have to answer a series of 4-5 challenging questions (e.g., equilibrium and stoichiometry problems that require multiple calculations and analysis), the likelihood of getting the last answer correct is lower. This suggests that, when ordering items, clustering of challenging questions should be avoided.
·         Students are more likely to do better on a question if it is similar to the previous one.  This kind of priming effect should be avoided to increase reliability and validity.  The author describes similar questions as those that require similar calculations or applications, use similar cognitive processes, or apply similar conceptual knowledge.

Using Item Analysis to Improve Item Writing
On item difficulty, here is how the author recommends evaluating the percentage of students who answered correctly, in the context of achieving a learning objective: "If the learning objective is fundamental in nature, then faculty may expect between 90 and 100% of the students to score correctly. If it is more challenging, then faculty may be pleased with an item difficulty of less than 30%. A rule of thumb to interpret values is that above 75% is easy, between 25% and 75% is average, and below 25% is difficult. Items that are either too easy or too difficult provide little information about item discrimination."
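A small sketch of computing item difficulty and applying that rule of thumb (my own code and hypothetical data, not the author's):

# Item difficulty = fraction of students who answered the item correctly,
# labeled with the rule of thumb quoted above (>75% easy, 25-75% average, <25% difficult).
def item_difficulty(responses, key):
    return sum(1 for r in responses if r == key) / len(responses)

def difficulty_label(p):
    if p > 0.75:
        return "easy"
    if p >= 0.25:
        return "average"
    return "difficult"

responses = "ABAAACAADA"                 # hypothetical answers to one item, key = 'A'
p = item_difficulty(responses, key="A")
print(p, difficulty_label(p))            # 0.7 average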

Item discrimination analysis is used to determine "how well an item discriminated between students whose overall test performance suggests they were proficient in the content domain being evaluated and those whose performance suggests they were not". Two methods for calculating this are described briefly by the author (see the article for details). In one method, a discrimination index of 0.40 indicates that higher scorers are more likely to answer an item correctly, meaning the item is effective at discriminating high-achieving students from low-achieving students. The author recommends doing a discrimination analysis on the distractors as well; these should receive negative index values. A distractor with a positive index value indicates low validity.
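Here is a minimal sketch of one common way to compute a discrimination index, the upper/lower group comparison (this may or may not match the exact methods described in the article): sort students by total test score, take the top and bottom groups, and subtract the lower group's proportion correct on the item from the upper group's.

# Discrimination index via upper/lower groups: D = p_upper - p_lower.
def discrimination_index(students, group_fraction=1/3):
    # students: list of (total_test_score, 1 if the item was answered correctly else 0)
    ranked = sorted(students, key=lambda s: s[0], reverse=True)
    n = max(1, int(len(ranked) * group_fraction))
    p_upper = sum(correct for _, correct in ranked[:n]) / n
    p_lower = sum(correct for _, correct in ranked[-n:]) / n
    return p_upper - p_lower

# Hypothetical data for one item
students = [(95, 1), (90, 1), (88, 1), (75, 1), (70, 0), (65, 1),
            (60, 0), (55, 0), (50, 1), (40, 0), (35, 0), (30, 0)]
print(discrimination_index(students))   # 1.0 - 0.25 = 0.75: the item discriminates well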

Distribution of Responses
The author notes that the proof of plausibility of distractors lies in the distribution of answers among the correct response and the distractors. One interpretation is that if fewer than 5% of students choose a distractor, then it may not be plausible enough (or the question is simply too easy).
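A quick sketch of that distribution check, using the 5% interpretation above (hypothetical code and data):

# Flag distractors chosen by fewer than 5% of students as possibly implausible.
from collections import Counter

key = "A"
responses = "A" * 20 + "BBBCD"          # 25 hypothetical answers to one item
counts, n = Counter(responses), len(responses)
for option in "ABCD":
    share = counts.get(option, 0) / n
    note = "  <- may not be plausible enough" if option != key and share < 0.05 else ""
    print(f"{option}: {share:.0%}{note}")
# A: 80%, B: 12%, C: 4% (flagged), D: 4% (flagged)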


