Assessing certainty of evidence

How valid and reliable are your conclusions?

Watch Associate Professor Zachary Munn from the Adelaide GRADE Centre take us through the GRADE approach to guideline development.

Once you have synthesised the evidence relevant to your guideline’s questions and drawn conclusions about the size and direction of the effects, you must also understand how valid and reliable that estimate is. This is essential information that will underpin your decisions to recommend, or not, different courses of action based on this evidence. It will also ensure that you avoid relying too strongly on uncertain results, which can lead to inappropriate recommendations.

The Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach to assessing the certainty of a body of evidence is now considered best practice by many international organisations that develop clinical guidelines, such as the World Health Organization (WHO), the National Institute for Health and Care Excellence (NICE) and the Canadian Task Force on Preventive Health Care. GRADE assessment provides a structured way to consider key factors that may increase or decrease our confidence in the synthesised findings of a body of evidence (Guyatt, Oxman et al. 2011; Schünemann, Oxman et al. 2017). These factors include:

  • the risk of bias
  • the precision of the effect estimates
  • the consistency of the individual study results
  • how directly the evidence answers the question of interest
  • the risk of publication or reporting biases.

GRADE was initially developed to address questions about the effectiveness of interventions based on randomised and observational studies. It can also be used to assess either narrative or statistical syntheses (Murad, Mustafa et al. 2017). Assessing the certainty of evidence is undertaken for each outcome separately. Two variations of GRADE are now available, for qualitative synthesis and for network meta-analysis, and discussion is ongoing about the application of GRADE to a broader range of evidence types (see Table 1).

Table 1: Applications and variations of GRADE
Tool: GRADE and its published variations
Examples of evidence or synthesis variations:
  • Interventions (Guyatt, Oxman et al. 2011; Murad, Mustafa et al. 2017; Schünemann, Oxman et al. 2017)
  • Network meta-analysis (Salanti, Del Giovane et al. 2014)
  • Diagnostic tests (Schünemann, Oxman et al. 2008)
  • Prognosis (Iorio, Spencer et al. 2015)
  • Environmental exposures (Morgan, Thayer et al. 2016)
  • Animal studies (Hooijmans, de Vries et al. 2018)
  • Patient preferences and values (Zhang, Alonso-Coello et al. 2018)
  • Economic evidence (Brunetti, Shemilt et al. 2013)
  • Overviews of reviews (Brennan, McKenzie et al. 2017)
  • Qualitative evidence (GRADE-CERqual)


Where possible, it is advised to use GRADE or a published variation of GRADE without introducing new ad hoc variations, to ensure your approach is consistent with best practice. If GRADE is not appropriate for the kind of evidence supporting your guideline and no variation is available, you may need to take a different systematic approach; this is discussed elsewhere.

In some research topics such as public health, randomised trials are not always the best study design to use for decision-making (Harder et al. 2015). For example, the Community Preventive Services Task Force (CPSTF) in the United States considered several types of study design as ‘strong’ evidence in public health, including randomised trials, non-randomised trials, prospective cohort studies and other designs with concurrent comparisons, such as interrupted time-series with comparison (see the CPSTF Methodology Guide). The use of well-designed observational studies in the modification of the GRADE approach for public health is still the subject of ongoing discussion and research (Burford et al. 2012; Rehfuess and Akl 2013). If you suspect there will be a scarcity of high-quality evidence, such as for public health interventions, consider using a framework such as the PRECEPT framework to structure your review.

In addition to using GRADE to assess the main effects of your intervention, exposure or test of interest, other aspects of the evidence used in your guideline can also be assessed. For example, the guideline development process may also consider evidence relating to additional ‘secondary’ questions, such as evaluating the importance of outcomes or the values and preferences used to establish meaningful thresholds. This evidence may or may not come from systematic reviews, but guidance is available if you wish to apply GRADE to these bodies of evidence (Zhang, Alonso-Coello et al. 2018; Zhang, Alonso Coello et al. In Press).

GRADE categorises the certainty of the evidence as high, moderate, low or very low (see Table 2). The level of certainty can be downgraded or, in some circumstances, upgraded (see Section 8 of this module) as each factor is assessed. In the context of a question about the effects of an intervention, randomised trials begin with a high default rating and observational studies begin with a low default rating; non-randomised studies assessed with the ROBINS-I tool also begin with a high rating (Schünemann, Cuello et al. 2018). For each outcome a decision is made whether to downgrade or upgrade the certainty of the evidence by one or two levels for each factor, leading to a final rating. Note that regardless of how many reasons there are to downgrade, the certainty of the evidence cannot fall below very low (Balshem, Helfand et al. 2011).
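
The mechanics of this rating process can be illustrated in a few lines of code. The following is a minimal sketch, assuming a simple numeric mapping of the four levels; the function and examples are illustrative only, not part of GRADE, and the judgements behind each downgrade or upgrade remain qualitative.

```python
# Minimal sketch of GRADE rating arithmetic (illustrative only).
LEVELS = ["very low", "low", "moderate", "high"]

def grade_rating(start, downgrades=0, upgrades=0):
    """Apply level changes to a starting rating, clamped to the scale."""
    i = LEVELS.index(start) - downgrades + upgrades
    return LEVELS[max(0, min(i, len(LEVELS) - 1))]

# Randomised trials start high: serious risk of bias (-1) plus serious
# imprecision (-1) gives "low".
print(grade_rating("high", downgrades=2))   # low
# Observational studies start low: a very large effect may rate up (+1).
print(grade_rating("low", upgrades=1))      # moderate
# However many concerns there are, the rating cannot fall below very low.
print(grade_rating("high", downgrades=5))   # very low
```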

Table 2: Interpretation of the four levels of evidence used in the GRADE profile (GRADE Working Group 2013)
High: We are very confident that the true effect lies close to that of the estimate of the effect.
Moderate: We are moderately confident in the effect estimate: the true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different.
Low: Our confidence in the effect estimate is limited: the true effect may be substantially different from the estimate of the effect.
Very low: We have very little confidence in the effect estimate: the true effect is likely to be substantially different from the estimate of effect.


From the perspective of guideline developers, the measure of certainty is ‘the extent of our confidence that the estimates of the effect are adequate to support a particular decision or recommendation’ (Hultcrantz, Rind et al. 2017). The question is not only whether the effect is greater than zero, but whether it is above or below the threshold at which the guideline development group would make a recommendation for or against a course of action. When two alternatives are similar in their effect, other considerations, such as patient preferences, may be used in developing recommendations.

This module focuses on the assessment of certainty of evidence using GRADE. The following sections outline the practical requirements of planning and reporting an assessment of certainty. The key factors used to assess evidence on the effects of interventions are also outlined, although many of these will be relevant for other types of questions or evidence. Guidance on related steps in the guideline development process is provided in other modules (see the Forming the questions, Assessing risk of bias, Synthesising evidence and Evidence to decision modules).

What to do

1. Plan your approach to assessing certainty

The decisions you make when forming the questions and deciding what evidence to include may influence the body of evidence you uncover. This body of evidence may also have a number of limitations. For example, when the certainty of the body of evidence is low, it may be difficult to make recommendations. The WHO Handbook for guideline developers (WHO 2014) and an article by Balshem et al. (Balshem, Helfand et al. 2011) provide schematic representations of this issue. The method that you intend to use for assessment of certainty should be planned in advance as part of a research protocol (NICE 2014; WHO 2014). The protocol should include plans for all the steps included in this module.

The assessment of certainty may be completed by the guideline development group if members have the appropriate expertise, or it may be commissioned as part of an external evidence review. In either case, it is important that the team performing the assessments, and at least some members of the guideline development group, have experience in the application and interpretation of GRADE (WHO 2014). Norris et al. outline possible levels and sources of experience with GRADE that may be useful for different roles in the guideline development process (Norris, Meerpohl et al. 2016). The guideline development group may need to collaborate with the team responsible for evidence synthesis to set the context for GRADE assessments, and will need some experience with GRADE to effectively interpret the assessments when formulating recommendations (see the Evidence to decision module).

As mentioned in the Overview section, you should plan to perform a separate GRADE assessment for each individual outcome within each PI/ECO question of interest (see Chapter 3.1 of the GRADE Handbook). This is because different groups of studies may contribute to each outcome and the certainty of different outcomes may vary even within the same groups of studies (Balshem, Helfand et al. 2011).

A good practice approach is for two people to conduct an independent GRADE rating of each outcome and then discuss their findings to reach consensus on the final rating. As GRADE assessment is a subjective process and requires a nuanced consideration of the details of the body of evidence and the context, this discussion between two assessors can be very productive in improving the quality of the assessment (Guyatt, Oxman et al. 2011; Schünemann, Oxman et al. 2017).

Once you have completed GRADE assessments on each outcome, you then make an overall judgement about the body of evidence and comment on its certainty. As this process is subjective, two assessors should make this determination independently and then discuss to reach consensus. You should plan for this approach, including any staffing or other resources that may be required.

2. Consider the importance of outcomes

Selecting and rating the relative importance of outcomes is recommended when forming the questions (see the Forming the questions module and Chapter 3.1 of the GRADE Handbook) and again following your evidence synthesis. Outcomes can be classified as ‘critical’, ‘important’ or ‘of limited importance’ for decision making by guideline development groups.

To facilitate ranking of outcomes according to their importance, guideline developers may choose to rate outcomes numerically on a 1 to 9 scale (7 to 9: critical; 4 to 6: important; 1 to 3: of limited importance) to distinguish between importance categories. Critical and important outcomes should be presented in evidence tables and summary of findings tables.

Only outcomes considered critical (rated 7 to 9) are the primary factors influencing a recommendation and will be used to determine the overall quality of evidence supporting a recommendation.
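
As a minimal illustration of this 1 to 9 scale, the sketch below maps a numeric rating to its category; the function is a hypothetical helper, not part of GRADE.

```python
def importance_category(rating):
    """Map a 1-9 outcome importance rating to its GRADE category."""
    if not 1 <= rating <= 9:
        raise ValueError("rating must be between 1 and 9")
    if rating >= 7:
        return "critical"
    if rating >= 4:
        return "important"
    return "of limited importance"

# Only critical outcomes (7 to 9) determine the overall certainty rating.
print(importance_category(8))  # critical
print(importance_category(5))  # important
```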

3. Assess risk of bias (or study limitations)

A risk of bias assessment is conducted on the studies included in the evidence review (see the Assessing risk of bias module for a more detailed discussion).

Consider whether there are sufficiently serious concerns about risk of bias to reduce your confidence in the overall effect estimate. Remember that the presence of one or two studies with high risk of bias may not automatically lead to an overall judgement that the synthesised findings have high risk. Consider a sensitivity analysis to identify whether removing studies with high risk would significantly change the estimate of effect — if not, then the risk of bias may not be important to the interpretation of the result.
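
To make this concrete, one simple form of sensitivity analysis is to pool the effect estimates with and without the high risk of bias studies and compare the results. The sketch below uses fixed-effect inverse-variance pooling on invented log risk ratios; the data and structure are illustrative assumptions only.

```python
# Invented data: (log risk ratio, standard error, high risk of bias?)
studies = [
    (-0.35, 0.10, False),
    (-0.28, 0.15, False),
    (-0.60, 0.25, True),
    (-0.30, 0.12, False),
]

def pooled_effect(data):
    """Fixed-effect inverse-variance pooled estimate."""
    weights = [1 / se**2 for _, se, _ in data]
    return sum(w * est for w, (est, _, _) in zip(weights, data)) / sum(weights)

with_all = pooled_effect(studies)
low_risk = pooled_effect([s for s in studies if not s[2]])
print(f"all studies: {with_all:.3f}; excluding high risk of bias: {low_risk:.3f}")
# If the two pooled estimates are similar, the high risk of bias studies
# may not be important to the interpretation of the result.
```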

The level of certainty can be downgraded by one level for serious concerns about risk of bias or two levels for very serious concerns (Guyatt, Oxman et al. 2011; Schünemann, Oxman et al. 2017).

4. Assess inconsistency or heterogeneity

Any group of studies brought together in a systematic review will differ in their results due to random variation, no matter how similar their design and conduct. If differences are small and likely only due to chance then the results are considered homogeneous and you can be more confident in the estimate of effect. If the results are inconsistent across studies — more than you would expect to see by chance alone — then the results are considered ‘heterogeneous’. There may be important factors modifying the effect from study to study and caution is needed in interpreting the results unless these factors are understood.

Variation in results can arise from differences in study characteristics such as the population, intervention/exposure, comparators or outcome measures. It can also arise from methodological differences such as bias or study design. While some evidence reviews will include more than one study design, the synthesis is reported separately for each design. For example, the synthesis of observational evidence is undertaken separately from that of randomised controlled trials.

The term ‘statistical heterogeneity’ is used to describe variation that is greater than would be expected to arise by chance alone. There are several different ways statistical heterogeneity can be assessed in a meta-analysis. Visual examination of forest plots or other figures and tables can identify inconsistencies in the results. If the confidence intervals around the studies’ effect estimates do not overlap it is very unlikely that the studies are estimating the same underlying effect — that is, one or more effect-modifying factors may be at work. Investigating these factors can be one of the most interesting opportunities for understanding the effects of the intervention or exposure of interest provided by meta-analysis. It can provide an opportunity to uncover important differences in how an intervention or exposure works that cannot be assessed with a single study (Deeks, Higgins et al. 2017). There are statistical tools to assist in measuring heterogeneity in the context of meta-analysis, including:

  • the chi-squared test (or Cochran Q test), which provides a P value indicating the likelihood that the null hypothesis of homogeneity (that is, that the studies are all estimating the same underlying effect) is true, given the results of the individual studies
  • the I² statistic, which estimates the proportion of the observed variation that is due to heterogeneity rather than chance, from 0% to 100% (Deeks, Higgins et al. 2017); a worked sketch of both statistics follows this list.
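
Both statistics can be computed directly from the study effect estimates and their standard errors. Below is a minimal sketch under a fixed-effect inverse-variance model with invented data: Cochran's Q sums the weighted squared deviations of each study's estimate from the pooled estimate, and I² = max(0, (Q - df)/Q) × 100%. SciPy is assumed to be available for the P value.

```python
from scipy.stats import chi2  # assumes SciPy is available

# Invented data: study effect estimates and their standard errors.
effects = [0.42, 0.50, 0.10, 0.45]
ses = [0.12, 0.15, 0.11, 0.20]

weights = [1 / se**2 for se in ses]  # inverse-variance weights
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Cochran's Q: weighted squared deviations from the pooled estimate.
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(effects) - 1
p_value = chi2.sf(q, df)  # P value for the null hypothesis of homogeneity

# I-squared: proportion of observed variation due to heterogeneity.
i_squared = max(0.0, (q - df) / q) * 100

print(f"Q = {q:.2f} (df = {df}, P = {p_value:.3f}), I^2 = {i_squared:.1f}%")
```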

The reasons for heterogeneity should be explored within your analysis to identify any significantly different effects for specific populations or intervention/exposure categories within the review (see the Synthesising evidence module) (Popay, Roberts et al. 2006; Deeks, Higgins et al. 2017).

If any observed heterogeneity has been explained through these investigations, then there is no need to consider the effect uncertain. Clear reasons for heterogeneity will likely lead to different recommendations for different circumstances. However, if considerable unexplained heterogeneity remains then our certainty in the evidence may decrease. The GRADE rating may be downgraded by one level for serious concerns or two levels for very serious concerns (Guyatt, Oxman et al. 2011; Schünemann, Oxman et al. 2017).

5. Assess indirectness

The concept of indirectness relates to whether the available evidence, including the population, comparisons and outcomes measured, directly and completely answers the questions posed by the guideline (see the Forming the questions module). Certainty in the evidence decreases when there are significant differences between the population, comparisons or outcomes you are interested in and those measured in the available evidence. (Note that certainty of evidence may also be upgraded for observational studies; Section 8 covers the reasons for upgrading.)

Examples of important indirectness might include:

  • randomised trials of narrow segments of the population, such as only participants with relatively mild illness or only adults
  • studies conducted in high-income, urban settings rather than including rural or low-income settings
  • studies comparing options that do not reflect current practice questions, such as comparisons against placebos or outdated options
  • studies that do not reflect the setting or behavioural context of interest, such as studies of nutritional supplements in a clinical setting for a guideline interested in real-world dietary habits
  • studies that use surrogate outcomes or short-term effects rather than directly measuring the outcome identified as a high priority for the guideline.

A systematic review is likely to find studies that vary in many different ways. Not every difference reduces our confidence in the effect estimate. There is no need to downgrade unless there is a compelling reason to think that the results will differ importantly from the effect of interest to the guideline, for example, based on key factors identified through the theoretical framework or logic model for the review, or tested by subgroup analysis (Guyatt, Oxman et al. 2011; Schünemann, Oxman et al. 2017). Where important differences are identified, certainty in the evidence may be downgraded by one level for serious concerns, or two levels for very serious concerns. Alternatively, you may decide to proceed with two target situations in your subsequent recommendations: one for the population reflected in the systematic review and another for different groups.

6. Assess imprecision

Traditionally, P values have been used to determine whether an effect exists, but they only tell us whether a result is likely to be due to chance. Current best practice is to avoid relying on P values alone and instead to consider measures such as confidence intervals (Guyatt, Oxman et al. 2011; Schünemann, Oxman et al. 2017). These give a more useful estimate of the range of effects within which the ‘true’ result is likely to lie.

Results are less precise, and confidence intervals wider, when the analyses in the review or synthesis include small numbers of patients; when there are few observed events (for dichotomous outcomes); or when there is considerable variability in the effects among patients (for continuous outcomes). When a confidence interval excludes the possibility of no effect you can be reasonably confident that an effect is present, either increasing or decreasing the outcome by an amount within the range of the confidence interval (Guyatt, Oxman et al. 2011).

A result is considered imprecise if the confidence interval includes both a meaningful result in one direction, that is, enough to reach the threshold to recommend for or against the option under consideration (Hultcrantz, Rind et al. 2017), and a negligible effect, no effect or an effect in the opposite direction. If the confidence interval covers effects under which you would make a particular recommendation and effects under which you would not, then the certainty in the direction of the effect is reduced. Note that a precise finding of ‘no effect’ requires more than a confidence interval that includes the possibility of no effect: it requires a confidence interval narrow enough to exclude any meaningful effect in either direction (Guyatt, Oxman et al. 2011).
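
The sketch below expresses this logic directly: a confidence interval is flagged as imprecise when its two ends fall in different decision regions. The threshold values are invented, purely for illustration; in practice they come from the deliberations described below.

```python
# Hypothetical decision thresholds on a risk difference scale (invented):
# at least 2 fewer events per 100 to recommend for the intervention, and
# at least 1 more event per 100 to recommend against it.
RECOMMEND_FOR = -0.02
RECOMMEND_AGAINST = 0.01

def decision_region(x):
    if x <= RECOMMEND_FOR:
        return "recommend for"
    if x >= RECOMMEND_AGAINST:
        return "recommend against"
    return "trivial or no effect"

def is_imprecise(ci_lower, ci_upper):
    """Imprecise if the interval spans more than one decision region."""
    return decision_region(ci_lower) != decision_region(ci_upper)

print(is_imprecise(-0.05, -0.03))  # False: clearly beyond the "for" threshold
print(is_imprecise(-0.04, 0.02))   # True: consistent with benefit, harm or neither
```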

Identifying the right threshold — a positive or negative effect large enough to support a particular recommendation — depends on the specific context of the guideline and the preferences and values of the people who will be affected by the decision. Guideline development groups should carefully consider these thresholds and identify these values and preferences in the relevant communities. Remember these may vary for different groups and defining these thresholds needs to be done in partnership with community representatives (see the Consumer involvement and Engaging stakeholders modules).

Importantly, the GRADE approach recommends that you consider precision in relation to the balance of all the positive and negative outcomes measured. In the trade-off between all the likely outcomes, including costs, harms and benefits foregone, the threshold beyond which you would make a recommendation might change. That is, if an intervention is expensive and has a large opportunity cost, it might take a larger effect to make the recommendation worthwhile; for example, a new drug that improves symptoms in some people but whose cost could be used elsewhere in the health system. Similarly, acting to avoid an exposure might lead to other negative outcomes, such as increasing the treatment of drinking water to remove pathogens while increasing exposure to harmful chemicals. Hultcrantz and colleagues provide a more detailed discussion of this kind of consideration with examples (Hultcrantz, Rind et al. 2017).

Guideline developers should work closely with the group performing the evidence synthesis to identify the likely values, preferences and thresholds. It is possible that after an initial GRADE assessment a full discussion of the trade-offs between the different outcomes may lead to changes to the ratings for imprecision (see the Evidence to decision module).

An outcome may be downgraded by one level for serious concerns about imprecision or two levels for very serious concerns.

7. Assess publication biases

Reporting biases can arise when studies are not fully reported in the published literature, making them less available for inclusion in a systematic review or guideline. Because of a traditional reluctance to publish studies that are negative or that have shown no effect, the studies that are reported are more likely to present positive or statistically significant findings, which can in turn introduce bias. When some studies are not published, or not published in readily accessible locations, this is referred to as publication bias. An additional concern is outcome reporting bias, where published studies selectively report individual outcomes with positive or statistically significant results. This is addressed with the Cochrane RoB (risk of bias) tool and is therefore considered under the risk of bias/study limitations factor for GRADE. Obtaining and including data from unpublished studies or outcomes can reduce the problem (see the Identifying the evidence module), but the results of unpublished studies can be difficult to obtain (Sterne, Egger et al. 2017).

Identifying whether or not your findings are affected by reporting biases can be challenging, and the GRADE approach recommends a relatively high threshold for downgrading the certainty of evidence: you either downgrade the certainty by one level where reporting biases are strongly suspected, or otherwise conclude that reporting bias is undetected (Guyatt, Oxman et al. 2011; Schünemann, Oxman et al. 2017).

Examples of situations in which reporting bias may be strongly suspected include:

  • a body of evidence consisting of small studies
  • industry sponsored studies, including pharmaceutical and other manufacturers — there is evidence that industry-sponsored trials are particularly vulnerable to reporting bias
  • an evidence review that has not made sufficient efforts to identify unpublished studies or outcome data, or studies published outside the major journals, including searching trials registries and data from drug regulators such as the FDA where relevant
  • evidence for a relatively recent intervention or question where early, positive findings may be available but there has not been sufficient time for the publication of negative or equivocal results or replication (Guyatt, Oxman et al. 2011).

Some guidance and tools are available to assist in identifying the presence of reporting biases (Page, McKenzie et al. 2018) although this is an area of ongoing methodological development.

Statistical tools are also available. Where a meta-analysis has been conducted and includes enough studies (at least ten), one option is to consider using funnel plots for key outcomes (Sterne, Sutton et al. 2011; Sterne, Egger et al. 2017). Funnel plots graph the effect size in each study against a measure of precision such as the inverse of the standard error (1/SE). In the funnel plot, studies will be scattered around the true effect estimate, with large studies at the top of the funnel and smaller studies further down. Smaller studies are expected to scatter more widely as they are more likely to be affected by random variation. A symmetrical plot looks similar to an inverted funnel or triangle. An asymmetrical plot can identify small study effects, in which the results of the smaller studies, which are more vulnerable to publication bias, are skewed in one direction rather than being randomly scattered around the true effect estimate (Sterne, Egger et al. 2017).
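
A basic funnel plot is straightforward to draw with a plotting library. The sketch below uses matplotlib with invented data, plotting each study's effect estimate against its standard error, with the vertical axis inverted so that larger, more precise studies sit at the top as described above.

```python
import matplotlib.pyplot as plt  # assumes matplotlib is available

# Invented data: study effect estimates (log scale) and standard errors.
effects = [0.10, 0.25, -0.05, 0.40, 0.15, 0.55, 0.30, -0.10, 0.20, 0.45]
ses = [0.05, 0.15, 0.08, 0.30, 0.10, 0.35, 0.20, 0.12, 0.18, 0.28]

plt.scatter(effects, ses)
plt.gca().invert_yaxis()          # small SE (large studies) at the top
plt.axvline(0.0, linestyle="--")  # line of no effect, for reference
plt.xlabel("Effect estimate (log risk ratio)")
plt.ylabel("Standard error")
plt.title("Funnel plot (invented data)")
plt.show()
# Visual asymmetry, for example a gap where small studies with negative
# results should be, suggests small study effects worth exploring.
```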

One possible explanation for small study effects is that studies with negative or equivocal findings are missing from your analysis, either because the study as a whole is not available or because the outcome was not included in the study report. However, funnel plots should not be used as direct measures of publication bias, and in some cases, such as prevalence studies or those estimating proportions, funnel plots may even be misleading (Hunter, Saratzis et al. 2014). Small study effects can also arise through chance, or through other sources of heterogeneity such as clinical or methodological differences between studies. If funnel plot asymmetry is detected, further exploration is required to identify the reason for it (Sterne, Egger et al. 2017).

Additional statistical tests are available to test or adjust for funnel plot asymmetry but these methods are variable in their reliability. You should seek statistical advice before using or interpreting these tests (Sterne, Egger et al. 2017).

8. Consider reasons to upgrade the certainty of the evidence

Recognising that observational evidence can in certain cases provide very strong evidence, the GRADE approach includes consideration of circumstances where the certainty of the evidence can be upgraded for observational studies. These include where:

  • a very large effect is observed, for example, at least a twofold increase or decrease for one group compared to another
  • a dose-response relationship is observed
  • the plausible confounders affecting the result actually serve to increase the certainty of the effect, for example, a significant result is found even though plausible confounders would work to reduce the result, or where no effect is observed, even though plausible confounders would work to exaggerate the effect.

These factors should only be considered after first addressing the reasons for downgrading described above (Sections 3 to 7). In most cases, the evidence should not be upgraded if there are serious concerns about any of those factors (Guyatt, Oxman et al. 2011; Schünemann, Oxman et al. 2017).

9. Report the certainty of the evidence

The next critical step is to summarise and present the assessments of certainty alongside the results of the evidence synthesis, to support the deliberation and recommendations of the guideline development group. The GRADE approach recommends standard format ‘Evidence profiles’ for this purpose, which are related to the ‘Summary of findings’ tables used in systematic reviews. Evidence profiles should be tested with users to ensure clarity of communication (Guyatt, Oxman et al. 2011; GRADE Working Group 2013). Even where you are not using the standard formats, summarising comparable information will ensure that the guideline development group has all the information it needs to inform its recommendations (see the Evidence to decision module).

A separate evidence profile should be developed for each question used to develop the guideline. Each profile should include:

  • Some brief information on the question in a PI/ECO or similar format.
  • A list of the critical and important outcomes relevant to the question, including information on the time points at which outcomes are measured and the tools/definitions used to measure each outcome (see the Forming the questions module). This list should include adverse effects and outcomes for which no evidence was found.
  • Summary statistics for the point estimates from the meta-analyses, including, where possible, both absolute and relative effects and confidence intervals. More than one absolute effect may be reported for populations with higher or lower levels of risk, which can enable the guideline development group to assess the real-world impact of its recommendations on different populations (Guyatt, Oxman et al. 2013; Guyatt, Thorlund et al. 2013); a worked sketch of this calculation follows this list.
  • Summary statements for any narrative or qualitative syntheses.
  • Details of the GRADE assessment for each individual consideration and the overall GRADE rating.
  • The number of studies and participants contributing data to this outcome.
  • Any additional important comments (Guyatt, Oxman et al. 2011).
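
A common piece of arithmetic behind these tables is deriving absolute effects from the pooled relative effect and an assumed baseline risk, so that the same relative effect can be presented for higher- and lower-risk populations. Below is a minimal sketch with invented numbers; the function is a hypothetical helper.

```python
def absolute_effect_per_1000(risk_ratio, baseline_risk):
    """Risk difference per 1000 people implied by applying a pooled
    relative effect to an assumed baseline (comparator) risk."""
    return (risk_ratio * baseline_risk - baseline_risk) * 1000

rr = 0.80  # invented pooled risk ratio
for baseline in (0.05, 0.20):  # lower- and higher-risk populations
    print(f"baseline {baseline:.0%}: "
          f"{absolute_effect_per_1000(rr, baseline):+.0f} per 1000")
# baseline 5%: -10 per 1000; baseline 20%: -40 per 1000. The same relative
# effect implies very different absolute impacts in different populations.
```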

Some variations on the evidence profile include a plain language statement interpreting the result in the context of the GRADE rating. Such statements can be useful in both technical reports and the guideline itself. Standardised language has been suggested for this purpose, for example, ‘There is moderate certainty evidence that the exposure probably increases the risk of the outcome’. Testing of this language has indicated that it improves understanding of the review’s findings. A more detailed resource on the use and interpretation of standardised language is provided by Cochrane Consumers and Communication (Ryan, Synnot et al. 2016).

In the final guideline you should present the following information in summary form within the guideline itself and in detail in the attached technical reports:

  • All the methods used for assessment of certainty, including those pre-specified in the protocol and any subsequent amendments or additional methods. This should include a transparent rationale for all changes.
  • GRADE ratings for each synthesised outcome, either as complete evidence profiles or ‘Summary of findings’ tables. These incorporate the reasons for upgrading and downgrading evidence in summary footnotes rather than detailed columns in the table.

You should report the methods and assessments in enough detail to allow others to appraise the guideline. This will allow future guideline developers to understand your approach when deciding whether to update, adopt or adapt your guideline.

Updating, adapting, adopting

If you are considering updating, adapting or adopting an existing guideline, review whether the original guideline incorporated a GRADE assessment. If so, you may also wish to review how well it was performed and whether judgements about indirectness and imprecision were made in a way that is applicable to your context and the purpose of your guideline. You may need to conduct or commission a new GRADE assessment even where the existing body of evidence is applicable (see the Adopt, adapt or start from scratch module).

Online tools

Two main online tools are available specifically to support the development of guidelines: GRADEpro GDT and MAGICapp. Both can assist you in presenting structured evidence profiles and working through assessments of each outcome using GRADE. MAGICapp also supports the online publication of guidelines with layered presentation allowing readers to access evidence tables from the recommendations; for example, see the Australian Clinical guidelines for stroke management. A standalone app has also been developed to support the use of the CINeMA tool to assess certainty in network meta-analysis.

Developers should be aware that costs may be associated with the use of these tools.

NHMRC Standards

The following standards apply to the Assessing certainty of evidence module:

2. To be transparent guidelines will make publicly available:

2.1 The details of all processes and procedures used to develop the guideline.


6. To be evidence informed guidelines will:

6.1 Consider the body of evidence for each outcome (including the quality of that evidence) and other factors that influence the process of making recommendations including benefits and harms, values and preferences, resource use and acceptability.


7. To make actionable recommendations guidelines will:

7.5 Grade the strength of each recommendation.


Guidelines approved by NHMRC must meet the requirements outlined in the Procedures and requirements for meeting the NHMRC standard.

Useful resources

Cochrane Handbook for Systematic reviews of interventions

Developing NICE Guidelines: The manual

GRADE Handbook

GRADE Tutorial: Online software

McMaster University GRADE online learning modules

WHO Handbook for Guideline development

References

Balshem, H., Helfand, M., et al. (2011). "GRADE guidelines: 3. Rating the quality of evidence." Journal of Clinical Epidemiology 64(4): 401-406.

Brennan, S., McKenzie, J., et al. (2017). Developing GRADE guidance for overviews of systematic reviews. Global Evidence Summit. Cape Town, South Africa, Cochrane Database of Systematic Reviews. 9 Suppl 1. https://doi.org/10.1002/14651858.CD201702.

Brunetti, M., Shemilt, I., et al. (2013). "GRADE guidelines: 10. Considering resource use and rating the quality of economic evidence." Journal of Clinical Epidemiology 66(2): 140-150.

Deeks, J. J., Higgins, J. P. T., et al. (2017). Chapter 9: Analysing data and undertaking meta-analyses. Cochrane Handbook for Systematic Reviews of Interventions. J. Higgins, R. Churchill, J. Chandler and M. Cumpston, version 5.2.0 (updated June 2017), Cochrane. Available from www.training.cochrane.org/handbook.

GRADE Working Group (2013). GRADE Handbook. Handbook for grading the quality of evidence and the strength of recommendations using the GRADE approach. H. Schünemann, J. Brożek, G. Guyatt and A. Oxman.

Guyatt, G., Oxman, A. D., et al. (2011). "GRADE guidelines: 1. Introduction - GRADE evidence profiles and summary of findings tables." Journal of Clinical Epidemiology 64(4): 383-394.

Guyatt, G. H., Oxman, A. D., et al. (2011). "GRADE guidelines 6. Rating the quality of evidence - imprecision." Journal of Clinical Epidemiology 64(12): 1283-1293.

Guyatt, G. H., Oxman, A. D., et al. (2011). "GRADE guidelines: 8. Rating the quality of evidence - indirectness." Journal of Clinical Epidemiology 64(12): 1303-1310.

Guyatt, G. H., Oxman, A. D., et al. (2011). "GRADE guidelines: 7. Rating the quality of evidence - inconsistency." Journal of Clinical Epidemiology 64(12): 1294-1302.

Guyatt, G. H., Oxman, A. D., et al. (2011). "GRADE guidelines: 5. Rating the quality of evidence - publication bias." Journal of Clinical Epidemiology 64(12): 1277-1282.

Guyatt, G. H., Oxman, A. D., et al. (2013). "GRADE guidelines: 12. Preparing Summary of Findings tables - binary outcomes." Journal of Clinical Epidemiology 66(2): 158-172.

Guyatt, G. H., Oxman, A. D., et al. (2011). "GRADE guidelines: 9. Rating up the quality of evidence." Journal of Clinical Epidemiology 64(12): 1311-1316.

Guyatt, G. H., Oxman, A. D., et al. (2011). "GRADE guidelines: 4. Rating the quality of evidence - study limitations (risk of bias)." Journal of Clinical Epidemiology 64(4): 407-415.

Guyatt, G. H., Thorlund, K., et al. (2013). "GRADE guidelines: 13. Preparing Summary of Findings tables and evidence profiles - continuous outcomes." Journal of Clinical Epidemiology 66(2): 173-183.

Hooijmans, C. R., de Vries, R. B. M., et al. (2018). "Facilitating healthcare decisions by assessing the certainty in the evidence from preclinical animal studies." PLoS ONE 13(1): e0187271.

Hultcrantz, M., Rind, D., et al. (2017). "The GRADE Working Group clarifies the construct of certainty of evidence." Journal of Clinical Epidemiology 87: 4-13.

Hunter, J. P., Saratzis, A., et al. (2014). "In meta-analyses of proportion studies, funnel plots were found to be an inaccurate method of assessing publication bias." Journal of Clinical Epidemiology 67(8): 897-903.

Iorio, A., Spencer, F. A., et al. (2015). "Use of GRADE for assessment of evidence about prognosis: rating confidence in estimates of event rates in broad categories of patients." BMJ: British Medical Journal 350.

Morgan, R. L., Thayer, K. A., et al. (2016). "GRADE: Assessing the quality of evidence in environmental and occupational health." Environment International 92-93: 611-616.

Murad, M. H., Mustafa, R. A., et al. (2017). "Rating the certainty in evidence in the absence of a single estimate of effect." Evidence Based Medicine 22(3): 85-87.

NICE (2014). 6: Reviewing the research evidence. Developing NICE guidelines: the manual. Manchester, UK, National Institute for Health and Care Excellence.

Norris, S. L., Meerpohl, J. J., et al. (2016). "The skills and experience of GRADE methodologists can be assessed with a simple tool." Journal of Clinical Epidemiology 79: 150-158.e151.

Page, M. J., McKenzie, J. E., et al. (2018). "Tools for assessing risk of reporting biases in studies and syntheses of studies: a systematic review." BMJ Open 8(3).

Popay, J., Roberts, H., et al. (2006). Guidance on the conduct of narrative synthesis in systematic reviews: a product of the ESRC methods programme (Version I). Lancaster, UK, University of Lancaster.

Ryan, R., Synnot, A., et al. (2016). Describing results, Cochrane Consumers and Communication Group, available at: http://cccrg.cochrane.org/author-resources. Version 2.0 December 2016.

Schünemann, H., Oxman, A., et al. (2017). Chapter 11: Completing ‘Summary of findings’ tables and grading the confidence in or quality of the evidence. Cochrane Handbook for Systematic Reviews of Interventions. J. Higgins, R. Churchill, J. Chandler and M. Cumpston, version 5.2.0 (updated June 2017). Cochrane. Available from www.training.cochrane.org/handbook.

Schünemann, H., Oxman, A., et al. (2017). Chapter 12: Interpreting results and drawing conclusions. Cochrane Handbook for Systematic Reviews of Interventions version 5.2.0 (updated June 2017). J. Higgins, R. Churchill, J. Chandler and M. Cumpston, Cochrane. Available from www.training.cochrane.org/handbook.

Schünemann, H. J., Cuello, C., et al. (2018). "GRADE guidelines: 18. How ROBINS-I and other tools to assess risk of bias in nonrandomized studies should be used to rate the certainty of a body of evidence." Journal of Clinical Epidemiology.

Schünemann, H. J., Oxman, A. D., et al. (2008). "Grading quality of evidence and strength of recommendations for diagnostic tests and strategies." British Medical Journal 336(7653): 1106-1110.

Sterne, J., Egger, M., et al. (2017). Chapter 10: Addressing reporting biases. Cochrane Handbook for Systematic Reviews of Interventions. J. Higgins, R. Churchill, J. Chandler and M. Cumpston, version 5.2.0 (updated June 2017), Cochrane. Available from http://handbook.cochrane.org.

Sterne, J. A. C., Sutton, A. J., et al. (2011). "Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials." BMJ 343.

World Health Organization (WHO) (2014). WHO handbook for guideline development, World Health Organization.

Zhang, Y., Alonso-Coello, P., et al. (2018). "GRADE Guidelines: 19. Assessing the certainty of evidence in the importance of outcomes or values and preferences - Risk of bias and indirectness." Journal of Clinical Epidemiology (in press).

Zhang, Y., Alonso Coello, P., et al. (In Press). "GRADE Guidelines: 20. Assessing the certainty of evidence in the importance of outcomes or values and preferences - Inconsistency, imprecision, and other domains." Journal of Clinical Epidemiology (in press).

Acknowledgements

NHMRC would like to acknowledge and thank Professor Brigid Gillespie (author) and Professor Lukman Thalib (author) from Griffith University, Miranda Cumpston (author) and Professor Wendy Chaboyer (editor) from Griffith University for their contributions to this module.

Version 5.0. Last updated 6 September 2019.

Suggested citation: NHMRC. Guidelines for Guidelines: Assessing certainty of evidence. https://nhmrc.gov.au/guidelinesforguidelines/develop/assessing-certainty-evidence. Last published 6 September 2019.

ISBN: 978-1-86496-024-2