Assessing the risk of bias in individual studies in a systematic review can be done using individual components or by summarizing the study quality in an overall score.
We examined the instructions to authors of the 50 Cochrane Review Groups that focus on clinical interventions for recommendations on methodological quality assessment of studies.
We found that recommendations by some groups were not based on empirical evidence and many groups had no recommendations on how to use the quality assessment in reviews. We suggest that all Cochrane Review Groups refer to the Cochrane Handbook for Systematic Reviews of Interventions, which is evidence-based, in their instructions to authors and that their own guidelines are kept to a minimum and describe only how methodological topics that are specific to their fields should be handled.
The strength of systematic reviews of randomized trials and observational studies, as opposed to narrative reviews and expert opinion, is the application of systematic strategies to reduce bias. Since the conclusion may become unreliable if the data are flawed, this involves an assessment of the internal validity of the included studies [1]. The term methodological quality is often used instead of internal validity, but as quality may address issues that are not related to bias, it would be preferable to speak about an assessment of the risk of bias.
There are four main areas of bias in controlled clinical studies: selection bias (differences in baseline characteristics between the groups of prognostic importance), performance bias (unequal provision of care apart from the treatment under evaluation), detection bias (biased outcome assessment) and attrition bias (biased occurrence and handling of deviations from the protocol and loss to follow-up) [2–8].
The outcome of the risk-of-bias assessment can be reported separately for the different methodological areas (component approach) or summarized in an overall quality score (scale approach). The assessment can then be used in the review in several ways: as a threshold for inclusion of studies; as a possible explanation for differences in results between subgroups of studies; in sensitivity analyses where only some of the studies are included; or as a weight in a meta-analysis of the results.
Using a scale can be tempting but is not well supported by empirical research [9–11]. A major problem with scales is that they often incorporate items that are more related to the quality of reporting, ethical issues or statistical issues than to bias [11].
The biggest producer of systematic reviews, the Cochrane Collaboration, advises against the use of scales [12]. After peer review, the reviews are edited by one of the 51 Cochrane Review Groups related to different fields of healthcare. Most review groups have their own set of instructions to authors, based on the Cochrane Handbook for Systematic Reviews of Interventions [12], and these guidelines are published in the Cochrane Library under the description of the Cochrane Collaboration [13].
There are currently more than 3,000 Cochrane reviews and they have been shown to be of higher methodological quality, on average, than other systematic reviews [14, 15]. However, a previous study of 809 Cochrane reviews published from 1995 to 2002 reported that 36% of the review authors had used scales [16]. We examined how the different review groups currently recommend assessment and handling of the risk of bias in the studies, with a focus on the use of scales, and suggest possible improvements.
We reviewed the guidelines for assessment of methodological quality of the primary studies included in Cochrane reviews. In March 2007, one author (A.L.) extracted the relevant data from the descriptions of the Cochrane Review Groups in the Cochrane Library, supplemented with information from websites when reference was made to such sites, and by contacting the review groups to clarify any uncertainties. The other author (P.C.G.) checked the extracted data, and any disagreements were resolved by discussion. Of the 51 review groups, we excluded the Methodology Review Group, as its reviews do not address clinical interventions.
A standardised data sheet was used and data were extracted on:
1) The type of methodological quality assessment recommended for individual studies, i.e. a component or a scale approach.
2) Areas of methodological quality and other areas recommended to be assessed.
3) Recommendations for using methodological quality assessments of individual studies in reviews, e.g. for inclusion of studies or for analytic purposes.
4) Recommendations to grade the level of evidence for the review as a whole.
Six review groups were asked for clarifications and all replied. The Upper Gastrointestinal and Pancreatic Diseases Group was unable to address our questions because it was being reorganized.
Groups that did not provide any recommendations, but referred to the Cochrane Handbook, were classified as recommending a component approach, since the Handbook advises that quality scores should not be used (as this approach is not supported by empirical research, can be time-consuming, and is potentially misleading) [12]. These groups were regarded as having addressed the main areas of bias mentioned in the Handbook: generation of allocation sequence; concealment of allocation; blinding of patients, caregivers and outcome assessors; and follow-up. They were also classified as giving no specific advice for using methodological quality assessments of individual studies in reviews, as the Handbook has no specific recommendations on this. Groups that offered no information and no reference to the Handbook were treated similarly, as we regarded referral to the Handbook as implicit in these cases. Groups that recommended both scales and components as optional were classified as recommending scales (there were only two such groups). Groups that recommended checklists of individual items were classified as recommending components, unless an overall score was calculated. Finally, for groups that recommended specific items in their guidelines but also referred to the Handbook, we assessed what they recommended in their guidelines.
We report the number of groups that recommended scales or components, areas of methodological quality assessed, specific recommendations for using the assessments of individual studies in the reviews, and type of analytical approach recommended, e.g. subgroup or sensitivity analyses, or meta-regression. We used Fisher's exact test to compare proportions [17].
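As an illustration of the comparison of proportions, the following is a minimal sketch of a two-sided Fisher's exact test for a 2×2 table, written in plain Python. The function name and the counts in the usage example are hypothetical and are not taken from our data; the two-sided p-value is computed with the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two sets of groups being compared; columns are
    'recommends the item' and 'does not recommend the item'.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over all tables at least as extreme (no more probable) than
    # the observed one; the small factor guards against rounding error.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts, for illustration only: 18 of 23 groups in one
# category and 9 of 18 in another recommend a given item.
p = fisher_exact_two_sided(18, 5, 9, 9)
print(f"two-sided p = {p:.3f}")
```

Statistical packages may use slightly different conventions for the two-sided p-value, so small discrepancies with other software are possible.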
Forty-one of the 50 review groups (82%) recommended a component approach, 34 of these explicitly, including 16 which also had reservations about scales (Table 1). Twenty-three of these 41 groups had their own checklists, ranging from 4 to 23 items.
Despite advising against scales, the Cochrane Handbook actually recommends a ranking scale [12]. The scale distinguishes between low risk of bias (all criteria met), moderate risk of bias (one or more criteria partly met) and high risk of bias (one or more criteria not met). The types of criteria are not specified, other than they should be few and address substantive threats to the validity of the study results. This scale was recommended explicitly by two groups and implicitly by seven others. In one place, the Handbook states that authors or review groups can use a scale, but that it must be with caution. This is in contrast to the general advice against scales, and this ambiguity can perhaps explain why some groups recommend scales.
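The three-level ranking described above can be written out as a simple rule, sketched here in Python. The function name, the rating labels and the criterion names are our own illustrative choices. The description above does not say how a trial with both partly met and unmet criteria should be ranked; the sketch assumes that an unmet criterion takes precedence.

```python
def handbook_rank(criteria):
    """Rank a trial's risk of bias from per-criterion ratings.

    `criteria` maps a criterion name to one of
    'met', 'partly met' or 'not met'.
    """
    allowed = {"met", "partly met", "not met"}
    ratings = list(criteria.values())
    if not ratings or any(r not in allowed for r in ratings):
        raise ValueError("each criterion must be rated met/partly met/not met")
    # Assumption: 'not met' outranks 'partly met' when both occur.
    if "not met" in ratings:
        return "high"
    if "partly met" in ratings:
        return "moderate"
    return "low"


# Hypothetical trial with three illustrative criteria.
trial = {
    "concealment of allocation": "met",
    "blinding of outcome assessors": "partly met",
    "follow-up": "met",
}
print(handbook_rank(trial))
```

Although this rule avoids a numerical score, it still collapses several criteria into one ordinal category, which is why we regard it as a scale.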
The weights and the direction of the bias for the individual items are a substantial problem with scales. As pointed out by Greenland, a true association with two or more components may be overlooked if the associations cancel out in the total score, or if these components have so little weight that this variation is lost in the total score [22]. Usually, all items are given the same weight although it is clear that they do not contribute equally to avoiding bias. For example, the Back Group uses a scale with 11 items, and trials of acceptable quality are defined as those meeting 50% of the criteria (i.e. a minimum of six) [23]. Thus, items on compliance, distribution of co-interventions and timing of outcome assessment are given the same weight as concealment of allocation, which, along with blinding, has been documented as the most important safeguard against bias [3, 4]. With this scale, trials that have no concealment of allocation and no blinding can be judged to be of acceptable quality.
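The consequence of equal weighting can be shown with a short sketch in Python. Only the number of items (11) and the threshold (six) are taken from the description above; the item labels are illustrative placeholders rather than the Back Group's exact wording, and the trial is hypothetical.

```python
# Illustrative item labels for an 11-item, equally weighted scale.
ITEMS = [
    "adequate randomization",
    "concealment of allocation",
    "patient blinded",
    "care provider blinded",
    "outcome assessor blinded",
    "similar groups at baseline",
    "co-interventions similar",
    "acceptable compliance",
    "acceptable drop-out rate",
    "similar timing of outcome assessment",
    "intention-to-treat analysis",
]
THRESHOLD = 6  # 'acceptable quality' = at least six of the 11 items met


def scale_score(items_met):
    """Count the items met; every item carries the same weight."""
    return sum(1 for item in ITEMS if item in items_met)


# Hypothetical trial: no concealment of allocation and no blinding at all,
# but every other item is met.
unprotected_trial = set(ITEMS) - {
    "concealment of allocation",
    "patient blinded",
    "care provider blinded",
    "outcome assessor blinded",
}

score = scale_score(unprotected_trial)
acceptable = score >= THRESHOLD
print(score, acceptable)
```

The hypothetical trial meets seven of the 11 items and therefore passes the threshold, although it lacks the two safeguards that matter most.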
The many problems with scales are illustrated in a study by Jüni et al. [11]. These authors used 25 existing scales to identify high-quality trials, and found that the effect estimates and conclusions of the same meta-analysis varied substantially with the scale used.
About two-thirds of groups recommended assessing sequence generation. This could be an improvement from the 26% reported previously for Cochrane Reviews [16]. Adequate concealment of allocation may not protect against selection bias if the allocation sequence is deciphered by the persons enrolling patients [4, 5, 7, 24].
Per-protocol analyses will often lead to substantial overestimation of treatment effects [25–27]. The Cochrane Handbook recommends analyzing all data according to the intention-to-treat principle using different analytical methods such as imputation. Currently it has no recommendations for assessing intention-to-treat analysis as a methodological item or on how to assess attrition bias (i.e. loss to follow-up). This is in contrast with the 21 groups that recommend assessing intention-to-treat as a separate item using different criteria. While large losses to follow-up have been associated with bias [6], the use of arbitrarily defined cut-points from 10–30% for assessing attrition bias is not based on empirical results and should therefore not be part of instructions to authors. These findings suggest that the Handbook should give clearer recommendations to ensure a more homogeneous methodology.
Several groups recommended assessment of items in their scales or checklists that are hardly related to the risk of bias in clinical studies. For example, the Back Group and the checklists of four other groups recommended assessing whether the groups were similar at baseline, but it is not clear how this should be done or for what purpose. Proper randomisation ensures that there is no selection bias, but it also means that 5% of baseline characteristics will be expected to differ between the groups at the 5% significance level, and 1% at the 1% level, etc. Furthermore, significant differences in some characteristics may have no effect on the outcome while non-significant differences in others may. Statistical hypothesis testing of the distribution of baseline characteristics should therefore usually only be performed if fraud is suspected [28, 29]. It can also be problematic to assess the use of co-interventions and the level of compliance, as both of these may merely reflect the differential effects of the studied interventions.
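The point about chance differences at baseline can be made concrete with a small calculation, sketched here in Python. The number of characteristics (20) is a hypothetical example, and the probability of at least one significant difference additionally assumes that the characteristics are independent, which will rarely hold exactly in practice.

```python
def expected_chance_differences(n_characteristics, alpha):
    """Expected number of baseline characteristics that differ
    'significantly' by chance alone in a properly randomised trial."""
    return n_characteristics * alpha


def prob_at_least_one(n_characteristics, alpha):
    """Probability of at least one chance finding.

    Assumption: the characteristics are statistically independent.
    """
    return 1.0 - (1.0 - alpha) ** n_characteristics


# Hypothetical baseline table with 20 characteristics, tested at the 5% level.
expected = expected_chance_differences(20, 0.05)
at_least_one = prob_at_least_one(20, 0.05)
print(f"expected: {expected:.1f}, P(at least one): {at_least_one:.2f}")
```

Under these assumptions, one significant difference is expected and the chance of seeing at least one is roughly 64%, so a significant baseline difference in itself says little about the quality of the randomisation.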
Another example is the Moncrieff scale that is used by the Depression, Anxiety and Neurosis Group as a checklist, without assigning an overall score as was originally the intention [30]. This scale has 23 items and some relate to external validity and appropriateness and reporting of statistical analysis, which are not associated with bias in the study. As chance findings can be misinterpreted as bias, such items can be problematic not only in a scale approach but also in a component approach, if they are used as a threshold for inclusion of studies in the review or in a sensitivity analysis.
The Peripheral Vascular Diseases Group referred to the "Schulz scale", but their reference includes no such scale [4], and Schulz has never constructed one; in fact, he advises against the use of scales for assessment of methodological quality. The Drugs and Alcohol Group recommended against assessing detection bias because of low interobserver agreement, but did not document this statement. The Incontinence Group and the Heart Group described attrition bias as selection bias occurring after randomization, which, although not formally incorrect, is confusing, as it is well understood that selection bias is avoided by proper randomization.
The Musculoskeletal Group recommended a scale for quality assessment of non-randomized studies [31]. The problems with scales are likely much greater for non-randomized studies than for randomized trials, as there is not much empirical evidence for the degree of bias, on average, that is introduced if different criteria are not met.
Only a little more than half of the groups had recommendations for using the quality assessment in reviews. The analytical method most often endorsed was sensitivity analysis to test if including only trials of higher methodological quality changes the effect estimates. As explained above, such analyses should not be based on an overall score. Rather than accepting the different combinations of criteria that are possible using scales, one should use one criterion, or only a few important ones simultaneously. For example, in a Cochrane review where a main outcome was number of blood transfusions [32], which is vulnerable to bias if the trial is not blinded, high-quality trials were defined as those that had adequate concealment of allocation and double blinding. Furthermore, high- and low-quality trials were grouped separately in the meta-analyses for easy comparisons.
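A sensitivity analysis of the kind described above, based on a few prespecified criteria rather than an overall score, might be sketched as follows in Python. The trial data are hypothetical, the field and function names are our own, and the fixed-effect inverse-variance pooling of log risk ratios is an assumption made for illustration; it is not necessarily the method used in the cited review.

```python
from math import sqrt


def pooled_fixed_effect(trials):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / t["se"] ** 2 for t in trials]
    total = sum(weights)
    estimate = sum(w * t["log_rr"] for w, t in zip(weights, trials)) / total
    return estimate, sqrt(1.0 / total)


def is_high_quality(trial):
    # High quality is defined by two prespecified criteria only:
    # adequate concealment of allocation and double blinding.
    return trial["concealment_adequate"] and trial["double_blind"]


# Hypothetical trials (log risk ratio and its standard error).
trials = [
    {"log_rr": -0.1, "se": 0.1, "concealment_adequate": True, "double_blind": True},
    {"log_rr": -0.2, "se": 0.1, "concealment_adequate": True, "double_blind": True},
    {"log_rr": -0.5, "se": 0.2, "concealment_adequate": False, "double_blind": True},
    {"log_rr": -0.7, "se": 0.2, "concealment_adequate": False, "double_blind": False},
]

high = [t for t in trials if is_high_quality(t)]
low = [t for t in trials if not is_high_quality(t)]

high_est, high_se = pooled_fixed_effect(high)
low_est, low_se = pooled_fixed_effect(low)
print(f"high quality: {high_est:.2f} (SE {high_se:.3f})")
print(f"low quality:  {low_est:.2f} (SE {low_se:.3f})")
```

Presenting the two pooled estimates side by side makes it easy to see whether the apparent effect depends on the trials that lack the key safeguards.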
It is also questionable to exclude trials entirely from the review if they fall below a certain quality cut-point on a scale [24], whereas it can be entirely reasonable to include only trials that are adequately randomized and blinded, e.g. if the main outcome is subjective, such as pain.
Grading of the evidence can help guide the decisions of clinicians and patients [33], provided the grading system is logically consistent and is in accordance with results from empirical studies. The grading system recommended by the Back Group has five levels of evidence and was developed using a consensus method [23]. Consistent findings among multiple, low-quality non-randomized studies are considered to be the same level of evidence as one high-quality randomized trial, which is not in accordance with findings from empirical studies [34, 35], or with the Cochrane Handbook [12]. Consistent results from non-randomized studies may merely reflect that they are all biased to a similar degree. This was the case, for example, for hormone replacement therapy, where a meta-analysis of observational studies [36] as well as a large cohort study [37] showed that hormones decreased the incidence of coronary heart disease by about 50%, whereas a high-quality randomized trial showed that hormones cause heart disease [38]. The Back Group intends to remove this scale from its guidelines [39] and will use the GRADE system for grading evidence [40, 41].
The four-level grading system used by the Musculoskeletal Group is also based on consensus [42] and is also highly problematic. The system relies on arbitrary cut-points, such as a sample size above 50 and more than 80% follow-up, which are not based on empirical evidence. The only difference between platinum and gold evidence is that there need to be two randomized trials for platinum and one for gold, which is not reasonable, as, for example, the platinum trials could involve 60 patients each and the gold trial 500 patients. Silver level can be either a randomized trial with a 'head-to-head' comparison of agents or a high-quality case-control study, which is hard to accept, and bronze level can be a high-quality case series without controls or expert opinion.
The Cochrane Handbook is produced by experts in methodology, is evidence-based, and is regularly updated in accordance with new evidence. The long guidelines of some review groups therefore seem to be superfluous, and in some cases they are not in accordance with the Handbook, or with the empirical evidence on bias. As the guidelines are probably followed by many review authors, they could potentially threaten the credibility of the reviews. We suggest that all Cochrane Review Groups refer to the Cochrane Handbook in their instructions to authors and that their own guidelines are kept to a minimum and describe only how methodological topics that are specific to their fields should be handled.
The Cochrane Handbook is currently being updated to ensure a more homogeneous methodology in its reviews [43]. This revision is based on the acknowledgement of the discrepancies in assessment of methodological quality between the review groups [44], and it will involve introduction of a detailed risk-of-bias tool to be used in all reviews. The tool will also address bias in selective outcome reporting [45, 46]. Finally, we suggest that the revision should improve recommendations for assessing attrition bias and for using the risk-of-bias assessments, as the current recommendations are not clear about this.
The views expressed in this article represent those of the authors and are not necessarily the views or the official policy of the Cochrane Collaboration.
We thank the following persons for providing additional information about the groups' guidelines: Jane Cracknell, Lindsey Shaw, Sharon Parker, Henning Keinke Andersen, Ian Roberts and Cathy Benett. Both authors are funded by Rigshospitalet, Copenhagen, and a grant from Inge og Jørgen Larsens Mindelegat (a non-profit foundation) supported part of the study. The funding organizations had no role in any aspect of the study, the manuscript or the decision to publish.