Statistical Concepts and Skills

Absolute Risk Difference

Absolute Risk Difference (or Absolute Risk Reduction or Absolute Risk Increase - depending on the direction of change...)

The actual reduction in risk compared with the baseline (control group) risk in a study.

EER - Experimental Event Rate

CER - Control Event Rate

ARD - Absolute Risk Difference (ARR - Absolute Risk Reduction)

ARD = CER-EER

Allocation Concealment

Allocation concealment can be thought of as a type of blinding/masking that is done at the time of recruitment. The rule for this is that the recruiter (the investigator who applies the inclusion/exclusion criteria to potential study subjects and enrolls them into the trial) cannot know the allocation assignment of the subjects he/she is recruiting. This prevents selection bias. The best way this is done in a study is to enroll patients into the study by adhering to the inclusion/exclusion criteria, then randomize them later or have the randomization done by an off-site center. Either way, the person recruiting has no way of knowing which group the next subject will go into.

Explained another way (with examples) by the CONSORT statement.

A wonderful description of allocation concealment that doesn't involve mysterious consultations with statistical services:

Randomisation procedure: "A clerk (who then took no further part in the study) prepared 600 sequentially numbered opaque sealed envelopes, each containing a card with allocation group determined by computer generated random number (odd = intervention). If the participant met the inclusion criteria and gave consent, he or she was entered into the study and underwent baseline spirometry. The next numbered envelope in the series was then opened to determine allocation group."

- from Parkes G, Greenhalgh T, Griffin M, Dent R. Effect on smoking quit rate of telling patients their lung age: the Step2quit randomised controlled trial. doi:10.1136/bmj.39503.582396.25

Biases

  • Selection bias (random sequence generation and allocation concealment)

  • Performance bias (blinding of participants and personnel and other threats to validity)

  • Detection bias (blinding of outcome assessment and other potential threats to validity)

  • Attrition bias (incomplete outcome data)

  • Reporting bias (selective outcome reporting, assessed by comparing outcomes reported in the protocol to those in the published study, or by comparing outcomes reported in the results to those in the methods of the published study)

Clinical Significance

As opposed to statistical significance, clinical significance is not demonstrated by a particular number. Instead, it's a judgment call made by a physician to determine if the result of a study causes the physician to make a different decision than they would otherwise.

Clinical significance can be understood in several ways:

  • number needed to treat - this statistic gets at the magnitude of benefit for the population of patients like those in the study. If a treatment has a statistically significant result, but would require an inordinately large number of people to be treated before benefiting one, then that treatment's clinical significance is less.

  • gestalt - a medication that results in the improvement in blood pressure of only 1-2 mmHg may indeed be a statistically significant finding (especially with a large sample size), but one must wonder about the ability of this medication to actually achieve any blood pressure reduction goals...

  • confidence intervals - I know...we're supposed to talk about those in the context of statistical significance, but consider this: while p-values only give you statistical significance information, confidence intervals can provide you with a "what if" scenario - if your decision would be different between a point estimate at the lower end of the confidence interval and one at the upper end, then the result is probably not clinically significant.

    • example - a moderately toxic chemotherapy drug helps reduce mortality for a given cancer with an NNT of 10 with a 95% CI of 3-50. Now, if the number needed to harm (from the toxicity of the drug) is only 20, then the NNT CI includes the possibility of great benefit as well as a questionable risk-benefit decision involving likely toxicity. (Thanks to Rod Jackson for this example.)

Composite Outcomes

A maneuver used in studies to lower the sample size needed to see a significant difference between groups. The composite or combined outcome usually combines disease-oriented or at least proxy outcomes. A person is considered to have met the outcome by fulfilling at least one of the outcomes included in the composite. Unfortunately, the composite result alone doesn't tell us important details (like which component outcome each person actually had).

Confidence Intervals

Confidence Intervals are an expression of the error surrounding a point estimate derived from a sample of a larger population. Confidence intervals are calculated independently from the p-value - they represent different philosophical approaches to biostatistics.

A confidence interval can tell you the precision of a measured parameter in a population (mean blood pressure for instance). In this case, confidence intervals don't tell us anything about significance, only an estimate of error. If the study were done over and over again, about 95% of the intervals calculated this way would contain the true population value.

A confidence interval can surround means (where spread is usually described with standard deviations), sensitivity/specificity, etc. In all these cases, asking about "significance" is irrelevant - it's just an "error bounds" for the estimate.

It is only with COMPARISONS of rates/proportions that confidence intervals tell us about significance in terms of "significant differences or comparisons".

More thoughts...

The explication of a 95% confidence interval is: if you repeat the study over and over on different samples from the population, then 95% of the confidence intervals calculated from those samples will contain the true population value. This holds for any confidence interval you calculate around any type of data. It's the interpretation of the confidence interval that's sometimes tricky.

What confidence intervals MEAN is how confident we are that the point estimate (whether it's relative risk, NNT, mean difference, ARR or sensitivity, etc) represents the real "truth" for the population. We assess this statistically, and it's largely based on the size of the sample we have.

Confidence intervals can tell us about "significance", but mainly when we're discussing "significant differences" when comparing groups. Remember that confidence intervals can also tell us about the "precision" of a point estimate. If we measure something in a sample of the population, a wide confidence interval means that the point estimate (measurement) we found in the sample has a lot of error associated with it. If we sample more of the population, our confidence interval will generally narrow around the true point estimate.

So, let's talk about some examples of different types of data and measures.

Mean blood pressure in one group

  • continuous data

  • Here the interval is used to tell us how good this average derived from the sample is at describing the population.

Mean difference in blood pressure between groups

  • compares continuous data between groups

  • Here it's the difference that matters - and so, if the CI for the difference crosses zero (a negative to a positive number), then the interval includes the possibility of no difference as well as either higher or lower mean blood pressure in one group compared to the other.

Mean difference in blood pressure in one group from baseline

  • compares continuous data

  • Here, again, it's the difference that matters, but not between a treatment or control, but within ONE group from baseline to the completion of the study. A non-significant CI would again include zero - encompassing the possibility of an increase in mean BP, a decrease in mean BP and no change in mean BP - all from baseline to the end of study. If you then wanted to compare between groups, you would use significance testing of some sort to compare (with p-values, most likely) the mean differences from baseline between the two groups.

Sensitivity

  • continuous

  • Here the CI is used to let us know how well this estimate reflects the rest of the population (as in the first example above)

Likelihood ratio

  • continuous

  • same as Sensitivity above

Relative risk

  • compares dichotomous data

  • Here the CI is one for a ratio of two things...therefore the line of no effect is ONE. A non-significant CI would include the possibility of reduced risk - an RR<1, no risk - an RR=1 - and increased risk - an RR>1.

Relative risk reduction

  • compares dichotomous data

  • I know it says "relative" but it's talking about a difference - from baseline relative risk (1) to the relative risk after the intervention (RR): RRR = 1-RR. A non-significant CI for RRR would include ZERO, since an RR of 1 (same risk as baseline) gives RRR = 1-1 = 0. A non-significant CI would therefore include negative numbers (meaning relative risk increase) and positive numbers (meaning relative risk reduction).

Absolute risk reduction

  • compares dichotomous data

  • a non-significant confidence interval for ARR would include ZERO, since the calculation for ARR involves a difference. For the absolute risk of side effects for an intervention - be careful to pay attention to the "directionality" of the numbers - since the intervention group will likely be higher than the control group (most active medications cause more side effects than placebo), the difference will be "negative" - just make sure you understand what this means, then for calculations - drop the sign.

Number needed to treat

  • compares dichotomous data

  • Remember here - don't bother trying to calculate or look for NNT confidence intervals on data (ARR) that is not significant - since a non-significant ARR would include the possibility of zero - and we all are taught to fear the instance of zero showing up in the denominator of a fraction (NNT=1/ARR)...
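To make the last two points concrete, here is a minimal Python sketch using the simple normal (Wald) approximation for the ARR confidence interval - the event rates, group sizes and function name are made up for illustration:

```python
import math

def arr_confidence_interval(cer, eer, n_control, n_experimental, z=1.96):
    """Approximate 95% CI for an absolute risk reduction (simple Wald method)."""
    arr = cer - eer
    se = math.sqrt(cer * (1 - cer) / n_control + eer * (1 - eer) / n_experimental)
    return arr - z * se, arr + z * se

# Hypothetical trial: 30% events in control, 10% in treatment, 100 patients per group
low, high = arr_confidence_interval(0.30, 0.10, 100, 100)
print(low, high)          # roughly 0.09 to 0.31 - the interval does not cross zero
print(1 / high, 1 / low)  # so an NNT interval makes sense: roughly 3 to 11
```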

Links:

Graphical Presentation of Data

Inclusion and Exclusion Criteria

For cohort studies and randomized controlled trials:

Inclusion and Exclusion Criteria define the population included in the study.

There is a trade-off between broad and narrow inclusion/exclusion criteria:

  • Too narrow/specific criteria result in a small, homogeneous group which guards against confounding of results, but may not be very generalizable

  • Too broad/non-specific criteria result in a wide, heterogeneous group - more reflective of practice, but likely confounded by co-morbidities, etc.

Remember also to check the baseline demographics of the included patients in a study (usually Table 1) to see what sorts of patients were successfully recruited with these criteria.

For systematic reviews:

Inclusion and Exclusion Criteria define which STUDIES are included in the review. These criteria come mainly from the focused question (in PICO format) of the review: the patients/population, the intervention, the comparison (if applicable) and the outcome. One or more of these may be left open depending on what is known about the subject at baseline.

Reviewers may choose to use study design as inclusion criteria (e.g. only randomized controlled trials) if there has been a lot of research on the topic and lesser-quality studies are likely not needed to arrive at a conclusion.

Independent, blind comparison

For diagnostic test studies - the major validity criteria can be summed up in the following sentence:

There should be an independent, blind comparison to a recognized reference standard for the condition in an appropriate spectrum of patients with the disease.

  • Independent - the decision to apply the reference standard should NOT depend on the results of the test - everyone should ideally get both the reference standard and the test of interest

  • Blind - Those investigators performing the test of interest should not know the results of the reference standard and vice versa.

  • Recognized reference standard - the test whose positive result DEFINES disease.

  • Appropriate spectrum of patients - both in terms of patient demographics (an external validity component) but also in terms of severity of disease (internal validity).

These rules are easier to apply in cross-sectional or cohort designs rather than in a case-control study.

Intention to treat

Intention to treat analysis is an analysis of data from a randomized controlled trial that compares data from the groups according to their initial, randomized allocation. The groups are analyzed this way regardless of whether all participants in the group actually received the intervention.

For instance, in a trial comparing surgery and medical therapy for a given disease, there may be some in the surgery group who do not or cannot get surgery, and some in the medication group who end up undergoing surgery. The intention to treat principle states that they must be analyzed in the groups to which they were randomized, regardless of whether they underwent the planned intervention.

Why? First, ITT preserves the protection against selection bias that randomization confers. Failure to preserve this results in essentially a poorly organized cohort study. Second, ITT is more like how we really treat patients - it's an effectiveness analysis rather than an efficacy analysis. In practice, physicians write prescriptions, hoping the patient will fill the medication, take it as directed, etc. ITT is a measurement of whether, on average, the choice of the intervention is better for the patient.

More reading:

Likelihood ratios

Likelihood ratios tell us how the pretest probability of a disease can be changed as a result of the test.

These are calculated based on sensitivity and specificity, and so do not change with the prevalence of the disease. However, they are used in a similar fashion to predictive values, which makes them more clinically useful.

Likelihood Ratio for a POS test = sensitivity/(1-specificity)

Likelihood Ratio for a NEG test = (1-sensitivity)/specificity
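As a quick illustration, here is a minimal Python sketch of these two formulas (the test characteristics are made up):

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

# Hypothetical test: 90% sensitive, 80% specific
print(likelihood_ratios(0.90, 0.80))  # LR+ 4.5, LR- 0.125
```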

Common guides for how good a likelihood ratio is:

  • POS - 1-2 poor, 2-5 ok, 5-10 good, >10 excellent.

  • NEG - 0.5-1.0 poor, 0.2-0.5 ok, 0.1-0.2 good, <0.1 excellent

Further information and a nomogram to help assess pre- and post-test probabilities:

Masking and Blinding

Masking (or Blinding), in its traditional definition, is keeping the allocation status (intervention or control group) secret from the patient and/or the investigators.

  • Single-masking is used to harness the placebo effect by not allowing the patient to know to which group they have been randomized. This is done by offering a placebo form of the study medication that looks, tastes, smells, etc. just like the active medication. In this way, both groups believe they have an equal chance at getting the active medication, further reducing any confounding differences between the groups.

  • The use of a "dummy" is required especially when comparing two different modes of intervention - i.e. a pill and an injection - one group gets a dummy pill + an active injection, the other group an active pill + a dummy injection.

  • Double-blinding is when the investigators are also unaware of the treatment assignment. This prevents them from treating the groups differently, which would confound the study results.

BMJ Study Design Question

Multiple Comparisons

In a big study with lots of data, researchers are often able to test multiple hypotheses - different demographic or clinical details that may contribute to the overall outcome. Usually, these will not be specified as the primary outcome, and usually the study will be very big - plenty of sample size exists to find potentially significant associations. However, the reader needs to keep in mind that our significance threshold (the alpha level) is usually set at 5%. Meaning (roughly) that we are allowing ourselves a 1 in 20 chance of being wrong about a given association's significance. If we do multiple tests like this in a study (say 20 or 30 or 50), then we worry that some of them will look significant just by chance.

To account for this, the researchers should reduce their alpha threshold so that only the most convincing comparisons are highlighted - i.e. look for p-values below 0.01 or 0.001. One of the statistical maneuvers for this is called the Bonferroni method. There are others, including just arbitrarily decreasing the alpha level to something lower than 0.05.
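A minimal sketch of the Bonferroni idea, with made-up p-values:

```python
p_values = [0.04, 0.012, 0.20, 0.001, 0.03]  # hypothetical results from 5 separate comparisons
alpha = 0.05
adjusted_alpha = alpha / len(p_values)        # Bonferroni: divide alpha by the number of tests

for p in p_values:
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"p = {p:.3f} -> {verdict} at adjusted alpha = {adjusted_alpha:.3f}")
```

Only the p = 0.001 comparison survives the correction in this made-up example.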

A page from the BMJ that talks about this: Multiple significance tests: the Bonferroni method

Number Needed to Treat

Number Needed to Treat (NNT) is a way to understand the magnitude of potential benefit you'll see in practice from an intervention given to your patients. We measure this benefit by calculating the Absolute Risk Difference (ARD), but since there are those of us who don't like to think in percentages, there is the NNT.

NNT = 1/ARD

In a study of Treatment T given for 10 days for Disease D:

  • Percentage of people in control group still with disease D at 10 days: 30% (CER)

  • Percentage of people in experimental group still with disease D at 10 days: 10% (EER)

  • Absolute Risk Difference = CER-EER = 20%

  • NNT = 1/ARD = 1/0.2 = 5. In other words, 5 people need to be treated for 10 days with Treatment T to cure 1 additional disease D.

The benefit of ARD and NNT is that they automatically take into account the baseline rate of disease when considering the magnitude of benefit.

In Another study of Treatment T for Disease D:

  • Percentage of people in control group still with disease D at 10 days: 3% (CER)

  • Percentage of people in experimental group still with disease D at 10 days: 1% (EER)

  • Absolute Risk Difference = CER-EER = 2%

  • NNT = 1/ARD = 1/0.02 = 50. In other words, now 50 people need to be treated for 10 days with Treatment T to cure 1 additional disease D. That's a lot of people!

You can use the same calculation for adverse event rates - and call it "Number Needed to Harm" - just use the mathematical "absolute value" of the ARD.

  • Percentage of people in the control group with nausea: 13%

  • Percentage of people in the treatment T group with nausea: 39%

  • Absolute Risk Difference = |CER-EER| = 26%

  • NNH = 1/ARD = 1/0.26 = 3.8 ~ 4. In other words, for every 4 people you treat with treatment T, you will make another one nauseated.

You can compare the NNT and NNH to get a sense of the overall risk/benefit calculation to your treatment.

  • In the first treatment T example, the NNT was 5. And we'll use the NNH of 4.

  • Find the least common multiple to make the easiest statement out of it: For every 20 patients treated, I'll cure 4 cases of Disease D (20/NNT), but make 5 people (20/NNH) nauseated.

  • Or put it in a ratio comparing NNT to NNH: (using the second treatment T numbers) NNT = 50, NNH = 4 - for every 12.5 (13) patients I make nauseated, I will cure one Disease D.

Rounding: NNTs and NNHs should get reported as whole numbers, not decimals, since we don't treat fractions of patients. Rounding here is usually done UPWARD at the point of calculating the actual NNT or NNH. When you're comparing NNT and NNH, then use conventional rounding rules. To figure out when and how to round here, use common sense and get a general idea of the magnitude and direction of the numbers.
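The arithmetic above fits in a short, illustrative Python sketch - the round() guard simply keeps floating-point noise from pushing an exact NNT up to the next whole number:

```python
import math

def nnt(cer, eer):
    """Number needed to treat (or harm): round 1/ARD up to a whole patient."""
    ard = abs(cer - eer)
    return math.ceil(round(1 / ard, 6))

print(nnt(0.30, 0.10))  # first Treatment T example: ARD = 20%, NNT = 5
print(nnt(0.03, 0.01))  # second example: ARD = 2%, NNT = 50
print(nnt(0.13, 0.39))  # nausea example: ARD = 26%, NNH = 4
```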

Additional Reading:

Number Needed to Treat for "person-year" data

With all the long-term studies going on, survival analyses are common. These produce hazard ratios and numbers like "rate per 100 person-years."

How do we make an NNT out of that?

You can't do too much if you're only given hazard ratios, but if you're given the rate per X person-years, you can figure it out.

According to a nice BMJ statistical letter:

Given rates for each group (or a difference in rates between groups) expressed in so much per X person-years: for instance, 1.25 cases per 100 patient-years

Select a time value over which you'd like the NNT to apply (remember the simple NNT calculation for a study should always include the length of the study - see NNT): for instance, 5 years (this can be a little arbitrary, but if you project it beyond the actual maximum time of the study, it's less valid)

First convert the rate to a per-patient figure: 1.25 cases per 100 patient-years is 0.0125 cases per patient-year.

Then the NNT is 1 divided by (rate difference per patient-year × time). For a one-year NNT: NNT = 1/0.0125 = 80. For the 5-year horizon chosen above: NNT = 1/(0.0125 × 5) = 16.
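A minimal sketch of that calculation (the rate and time horizon are just the figures from the example above, and the function name is illustrative):

```python
def nnt_from_person_year_rate(rate_difference_per_100_py, years):
    """Approximate NNT from a difference in event rates given per 100 patient-years.

    Assumes the event rate stays roughly constant over the chosen time horizon.
    """
    rate_per_patient_year = rate_difference_per_100_py / 100
    return 1 / (rate_per_patient_year * years)

print(nnt_from_person_year_rate(1.25, 1))  # about 80 over one year
print(nnt_from_person_year_rate(1.25, 5))  # about 16 over five years
```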

Odds Ratios

An odds ratio (OR) compares the odds of an event in one group with the odds of the same event in another group.

The usual way OR is used is in a case-control study. In that type of study, we don't have real population rates to compare, because we're starting with knowing those with an outcome (usually a rare one) and matching them with people who look similar, but don't have the outcome. Therefore it's not as if the numbers we get are the true rate in a population, it's just they are the ones we have found.

OR is used to compare the odds of the EXPOSURE, given the OUTCOME.

So, in a 2x2 table:

           + OUT    - OUT
+ EXP        a        b
- EXP        c        d

OR = (a/c)/(b/d) = ad/bc

Now, you can technically also calculate an odds ratio the other way 'round: the odds of the OUTCOME, given exposure:

OR = (a/b)/(c/d) = ad/bc (it's the same calculation...)
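A minimal sketch of the calculation, with made-up case-control counts:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from the 2x2 table above: a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)

# Hypothetical counts
print(odds_ratio(a=30, b=70, c=10, d=90))  # (30*90)/(70*10) = about 3.9
```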

Links for further reading:

Outcome Assessment

The assessment of outcomes by the investigators should be either very objective (death, incontrovertible morbidity, standardized lab test, etc.) and/or blinded. This is to ensure that the assessment of whether a subject has the outcome of interest is not influenced by knowledge of the treatment assignment. This is especially important for outcomes that are clinical syndromes (heart failure) or subjective (degree of wound healing/cosmesis).

Outcome - Patient-Oriented, Disease-Oriented, Proxy


Outcome - Primary and Secondary

The primary outcome should definitely be described in the study. It is the basis for the primary research question and the sample size determination. Unfortunately, not all study reports disclose the intended primary outcome - likely because the study did not show a difference in that outcome. Clinicaltrials.gov is a website that registers the important details of a trial before it starts so that the primary outcome is posted. Often, you can check there to ensure the trial has disclosed all the outcome data it studied.

Secondary outcomes are additional outcomes that are important to understanding the research and the clinical problem. You should proceed with caution when analyzing these secondary outcomes - studies usually are powered only for the primary outcome, so just because there was no difference seen between groups in the secondary outcomes does not mean that no difference exists - it means more research should be performed.

P-values

From Evidence-Based-Healthcare Listserve:

1. The P-value is a measure of how unlikely it would be to get a value of the test-statistic (say T) as extreme as the value obtained from the data *if the Null Hypothesis is true*.

If data are simulated from a model which satisfies the Null Hypothesis, then the P-value will be random and have a uniform distribution over the interval (0,1). Thus the P-values will certainly "jump all over the place" -- a value anywhere in (0,1) is just as likely to occur as a value anywhere else.

2. However, the chance of getting a P-value as small as, or smaller than, (say) 0.05 when the Null Hypothesis is true is, by the same argument, 0.05; this is a fairly small probability.

So if you apply a test T using a critical significance level of 0.05 you have a 1 in 20 chance of rejecting a true Null Hypothesis; this is the "Error of the First Kind".

Similarly, if you use a critical significance level of 0.01 then you only have a 1 in 100 chance of falsely rejecting a true Null Hypothesis. And so on.

3. The test statistic T will have been chosen to express a measure of discrepancy between data and hypothesis: the larger the value of T, the greater the degree of discrepancy in the sense of "discrepancy" encapsulated in the choice of T.

Therefore the smaller the P-value, the larger the value of T, hence the greater the discrepancy.

4. At this point one applies what George Barnard used to call the "Principle of Disbelief in Tall Stories". The analogy is with someone who has been arrested as a suspect for a crime.

The suspect attempts to explain that he is really innocent (Null Hypothesis) and that the circumstances which led to his arrest (the data) arose in a completely innocent way.

E.g. "I had an urgent need to urinate while walking along the street, saw a house with a broken window, entered the house through the window and used the toilet". And then the

Police arrived responding to a report of burglary and found him inside, and arrested him. And when he gave his explanation, the officer said "That's a pretty tall story mate, we're not going to believe that". On the grounds, of course, that it is a very unlikely thing to happen (though indeed possible).

Therefore is the discrepancy between data (being found in a house which has just been burgled) and hypothesis ("I only went in for a pee") is so large as to be deemd very unlikely,

and the Police *decide* to not believe it. However, once in a while that decision will be incorrect ... since it could happen.

Similarly, a value of T such that so large a value would be very unlikely to be observed when generated by a true Null Hypothesis will give rise to a decision to reject the Null Hypothesis. The P-value for so large a T will be so small that asserting that it could arise from a true Null Hypothesis amounts to a "tall story".

5. Next one must turn to what happens to P-values when the Null Hypothesis is false (and therefore *should* be rejected). With an appropriate choice of test statistic, departures from the Null Hypothesis will be reflected in a change of the distribution of the (still random) T such that large values are now more likely than they were under the Null. Correspondingly, small P-values now become more likely than they were under the Null (i.e. then uniformly distributed). Instead, the distribution of P-values will become more concentrated towards small values of P, the degree of concentration increasing the greater the difference between the Null Hypothesis and the model which really is generating the data.

Thus one has a situation in which:

a) If the Null Hypothesis is true, then the chance of rejecting it is held at a low level (e.g. 0.05, 0.01, ... );

b) If the Null Hypothesis is false, the chance of rejecting it is greater than that low level, and can rise to near certainty for large discrepancies between the Null and the true model.

This relationship between probability of rejecting the Null and degree of divergence of true model from Null is called the Power Function: "Power" meaning "the probability of rejecting the Null when it is false" -- i.e. the power to detect a departure from the Null.

6. The notion of calculating "a confidence interval for a P-value" is not particularly meaningful. "Confidence interval" is something which refers to uncertainty about some parameter of the model generating the data. The P-value is something calculated from the data with respect to a specific Null Hypothesis.

However, it is certainly possible to discuss the distribution of possible P-values when the data are generated according to a variety of possible models, within the framework outlined above. This has certainly been done! Indeed, every statistical test of a Null Hypothesis has a theory which relates its Power (probability of a small P-value) to different kinds and degrees of departure from the Null, and the literature is full of accounts of such things.

Hoping this helps,

Ted.

- Ted Harding, 08 February 2010

From XKCD.com:

http://xkcd.com/1478/
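A small simulation (just a sketch, with simulated data) shows points 1 and 5 from the explanation above in action: under a true null hypothesis, p-values below 0.05 turn up about 5% of the time, while under a real effect they become much more common:

```python
import random
from scipy.stats import ttest_ind

random.seed(1)

def fraction_significant(true_difference, runs=2000, n=30):
    """Fraction of two-sample t-tests with p < 0.05 when the groups truly differ by true_difference SDs."""
    hits = 0
    for _ in range(runs):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(true_difference, 1) for _ in range(n)]
        if ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / runs

print(fraction_significant(0.0))  # about 0.05 - the Type I error rate under a true null
print(fraction_significant(0.8))  # much higher - the test's power to detect a real 0.8 SD difference
```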

Power and Sample Size

Power calculations are used to determine how many people need to be in a study to find a difference in the study that is reflective of a true difference in the population.

  • Effect seen in the sample, and a true effect in the population: true positive - the probability of this is the power (1-beta)

  • Effect seen in the sample, but no effect in the population: false positive - Type I error, probability alpha

  • No effect seen in the sample, despite a true effect in the population: false negative - Type II error, probability beta

  • No effect seen in the sample, and no effect in the population: true negative

Power calculations have three basic inputs and you get out a sample size at the other end:

  1. alpha level - usually 0.05 (as in the p-value at which we call things significant)

  2. power level - usually 0.8-0.9 (80-90%) - how confident are you that you can see the specified difference at the current alpha level.

  3. effect size - the size/magnitude of the difference that you want to be seen. (based on prior research, educated guess, etc.)

Just make sure that if the study showed no difference, the authors actually recruited enough patients to satisfy the power calculation requirements. Also watch for subjects lost to followup, as they can reduce power.
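Here is a minimal sketch of how those three inputs turn into a sample size, using a simple normal approximation for comparing two proportions (the expected event rates are made up, and real studies may use different formulas or dedicated software):

```python
import math
from scipy.stats import norm

def n_per_group(p_control, p_experimental, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two proportions (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided alpha
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_experimental * (1 - p_experimental)
    effect = (p_control - p_experimental) ** 2
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect)

# Hypothetical effect size: expect 30% events in the control group, hope to see 10% with treatment
print(n_per_group(0.30, 0.10))  # roughly 59 per group at alpha 0.05 and 80% power
```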

Pre-test Probability

Pre-test probability is the chance that a patient has a disease prior to doing a diagnostic test.

This can be based on population-based prevalence of disease, your specific population's prevalence, a general estimate (educated guess) or a clinical decision rule.

General estimates of pre-test probability use very broad categories to just get a sense of the potential change in post-test probability:

  • low risk - about 30%

  • intermediate risk - about 50%

  • high risk - about 80%

Clinical decision rules are validated rules of patient information that can help define a pre-test probability for disease.

Prevalence

Prevalence is the proportion of a population with the disease, measured at a point in time or over a time period - usually a year.

Prevalence estimates sometimes form the basis of pre-test probability estimates.

Proxy Outcome

An outcome that is measured in place of a true patient-oriented outcome. Often too many assumptions need to be made with proxy outcomes to assume that they automatically lead to the valuable patient outcomes.

Randomization

Randomization is a procedure to reduce Selection Bias by allocating subjects by chance to the intervention and control (or other intervention) groups.

The best way to do this is to use random number generators or tables, and have a pre-defined algorithm so that each subject brought into the study is truly placed at random.

Quasi-Random - the use of methods which seem random, but are truly not - coin toss (this is argued), alternate-day allocation, allocation by social security number, etc.

Random sampling is different - this is a way of gathering subjects from a population based on a pre-defined algorithm (every 4th person in the phone book). There's certainly no law against randomization of a randomly selected sample.

BMJ Statistics Notes

Info on Random Numbers and Randomization at Random.org

Randomization Protocols at SealedEnvelope.com

Receiver Operating Characteristic Curves

ROCs (and Area Under the ROC) are methods for finding the best cutoff point for "dichotomizing" a continuous variable in a diagnostic or prediction study.

So, if you wanted to find the best cutoff for a PHQ-9 score, you calculate sensitivity and specificity for each possible PHQ-9 cutoff and then graph them (using sensitivity and 1-specificity...the slope at any point on the graph is the likelihood ratio for that cutoff point, btw). The point on the graph that is closest to the upper left corner (x=0, y=1.0) is the best choice for a cutoff - the best balance between sensitivity and specificity.
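A minimal sketch of the "closest to the upper left corner" idea - the scores below are made up, and the disease labels are assumed to come from a reference standard:

```python
import math

def best_cutoff(scores_with_disease, scores_without_disease, candidate_cutoffs):
    """Pick the cutoff whose (1 - specificity, sensitivity) point lies closest to the corner (0, 1).

    Assumes higher scores suggest disease: a score at or above the cutoff counts as test-positive.
    """
    best = None
    for cut in candidate_cutoffs:
        sens = sum(s >= cut for s in scores_with_disease) / len(scores_with_disease)
        spec = sum(s < cut for s in scores_without_disease) / len(scores_without_disease)
        distance = math.hypot(1 - spec, 1 - sens)  # distance from the upper left corner
        if best is None or distance < best[0]:
            best = (distance, cut, sens, spec)
    return best

# Hypothetical scores for people with and without the condition
diseased = [12, 15, 9, 18, 21, 11, 14]
healthy = [3, 5, 8, 2, 10, 6, 4]
print(best_cutoff(diseased, healthy, candidate_cutoffs=range(0, 28)))  # (distance, cutoff, sens, spec)
```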

Relative Risk (and Reduction)

Relative Risk is a ratio used to compare rates in the exposed/intervention group and the non-exposed/control group. Relative Risk Reduction is the change in relative risk from baseline (baseline relative risk being 1).

EER - Experimental Event Rate

CER - Control Event Rate

RR = Relative Risk

RRR = Relative Risk Reduction

RR = EER/CER

RRR = 1-RR
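In a quick sketch, using the event rates from the NNT example above:

```python
def relative_risk(eer, cer):
    """Relative risk and relative risk reduction from the two event rates."""
    rr = eer / cer
    rrr = 1 - rr
    return rr, rrr

# EER 10%, CER 30%
print(relative_risk(0.10, 0.30))  # RR about 0.33, RRR about 0.67
```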

Sensitivity and Specificity

Sensitivity and Specificity are properties of a diagnostic test that reflect its utility at identifying disease.

These are presumed not to change with the prevalence of the condition being studied.

2X2 table:

              Disease POS            Disease NEG
Test POS      a (True Positive)      b (False Positive)
Test NEG      c (False Negative)     d (True Negative)

Sensitivity is calculated as:

  • a/(a+c)

  • True Positives/All Disease Positives

Specificity is calculated as:

  • d/(b+d)

  • True Negatives/All Disease Negatives

Rule of thumb for Sensitivity and Specificity:

SpPIn - In a test with a high SPecificity, a Positive test rules In disease.

SnNOut - In a test with a high SeNsitivity, a Negative test rules OUT disease.

(these however, do not work well at the extremes of prevalence)
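A minimal sketch of the 2x2 arithmetic, with made-up counts:

```python
def sensitivity_specificity(a, b, c, d):
    """Sensitivity and specificity from 2x2 counts: a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return sensitivity, specificity

# Hypothetical counts
print(sensitivity_specificity(a=90, b=20, c=10, d=80))  # sensitivity 0.90, specificity 0.80
```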

Standardized Mean Difference (Effect Size)

A method, used a lot in systematic reviews, for combining results from different scales/survey instruments.

Per the Cochrane Collaboration, the term "standardized mean difference" is preferred, as opposed to "effect size", since in EBM we use effect size more generically to mean "magnitude of effect".

You will see two statistics performed - a Cohen's d and a Hedges' g. Both are evaluated on the scale of standard deviations, though there is a wide variety of opinions on ranges for small, medium and large effects, because these numbers can change depending on what's being compared. (commonly, 0.2 (small), 0.5 (medium), and 0.8 (large) are used, but one reference suggests 0.41 (small), 1.15 (medium), and 2.70 (strong)).

Distinguish "standardized mean difference" from "mean difference" (sometimes called "weighted mean difference") - which is the difference between computed means of groups when they all use the same scale

References:

http://stats.stackexchange.com/questions/1850/difference-between-cohens-d-and-hedges-g-for-effect-size-metrics

http://stats.stackexchange.com/questions/66956/whats-the-difference-between-hedges-g-and-cohens-d

http://handbook.cochrane.org/chapter_9/9_2_3_2_the_standardized_mean_difference.htm

http://handbook.cochrane.org/chapter_9/9_2_3_1_the_mean_difference_or_difference_in_means.htm

Another way to discuss effect size is similar to the way we talk about risk reduction: absolute and standardized (relative).

Absolute effect size might refer to the absolute risk reduction for dichotomous outcomes, but for a comparison of mean scale scores, etc., it refers to the change between groups on a specific scale. It's important, then, to be aware of the clinical significance of that change. For instance, a clinically significant change in the PHQ-9 scale (a 9-item, 27 point scale used for depression) might be a reduction in the previous score by 50% (as discussed here) or a change in a certain number of points - and would result from a study that looked at the outcomes associated with this change.

The standardized effect size (such as standardized mean difference) uses the parameters discussed above.

Reference:

http://www.jgme.org/doi/abs/10.4300/JGME-D-12-00156.1

Statistical Significance

Statistical significance is a way of judging the results of a study in a sample of the population as truly descriptive of the larger population the sample is taken from. This concept is to be used primarily with comparisons - of rates - between two or more groups.

The primary ways of determining statistical significance are through P-values and confidence intervals.

For instance, if a study finds a relative risk of exposure to be 2.4, then a confidence interval may be 1.5-3.3. That's a fairly precise estimate, but most importantly the comparison is statistically significant because the interval does not cross the line of no effect (RR = 1).

P-values also show statistical significance, but do not reveal information about the precision of the estimate.

Table 1

The demographics of the patient population included in a study are usually found in the first part of the results section and/or in Table 1. The demographics may be listed by group, or may be bundled into a single column.

For cohort studies, p-values may be calculated to determine significant differences between groups. Although this concept doesn't make sense for randomized controlled trials (any baseline differences between randomized groups are, by definition, due to chance), sometimes p-values are included in the trial reports.

Look over this data and then make sure the authors adjusted for any notable differences between groups in the analysis.

Test and Treat Thresholds

Every year we have the opportunity to learn the concepts of diagnostic tests characteristics, predictive values and prevalence just a little bit better.

It's called "flu season."

The setup:

  • every year, we have a common clinical condition that starts as low prevalence, increases to high prevalence, and then goes back to low.

  • we have readily accessible data about that prevalence week-to-week, courtesy of the CDC and the state health departments (Virginia, for example).

  • we have a rapid test for it, enabling on-the-spot decision making.

  • we can treat it, though you are excused if you question the overall benefits of oseltamivir.

This is the ideal setup to think about the age-old question, "What will you do differently with the results of that test?".

In addition, we can get more comfortable with the idea of test and treat thresholds.

A few concepts:

  • since the rapid flu tests are not perfect, we must interpret their results keeping in mind pre-test probability

  • we should decide how certain we would have to be in order to 1) treat empirically (and not test) and 2) not test or treat.

    • we express our certainty in terms of pre-test probability. for example, "if someone's 90% likely to have the diagnosis, I wouldn't bother testing, I'd just treat."

    • or, "if someone only has a 5% chance of having the disease, I wouldn't test or treat - just reassure them they probably don't have the disease"

There's a great web site that can draw some of this decision making out for us.

We can input the prevalence of the disease, the test characteristics (usually sensitivity and specificity), and even the "certainty levels" (see above), then we can play with these numbers to see how our decision making would change.

Let's say we have a rapid flu test
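Say, as a made-up example, it's 60% sensitive and 98% specific. Here is a minimal sketch (all numbers assumed for illustration) of how the post-test probability then shifts as flu prevalence changes:

```python
def post_test_probability(pre_test_prob, sensitivity, specificity, test_positive=True):
    """Convert a pre-test probability to a post-test probability via likelihood ratios."""
    lr = sensitivity / (1 - specificity) if test_positive else (1 - sensitivity) / specificity
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Assumed (made-up) rapid test characteristics: 60% sensitive, 98% specific
for prevalence in (0.05, 0.30, 0.60):  # out of season, shoulder season, peak season
    pos = post_test_probability(prevalence, 0.60, 0.98, test_positive=True)
    neg = post_test_probability(prevalence, 0.60, 0.98, test_positive=False)
    print(f"pre-test {prevalence:.0%}: positive test -> {pos:.0%}, negative test -> {neg:.0%}")
```

Comparing those post-test probabilities against your own treat and no-test/no-treat thresholds shows why the same test result can mean something quite different in and out of flu season.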

Withdrawals and Follow Up

It is important to know what happened to everyone enrolled in the trial at its conclusion.

  1. If the subjects who were "lost to followup" are lost because they are having bad outcomes (death, illness, etc., especially in the intervention group), then the study must reflect that, or risk overestimating the benefit of the intervention by not having those outcomes in the analysis.

  2. If there is too much loss to followup, then power is threatened - there may no longer be enough sample size to see a difference if one exists.

  3. Examine the power analysis - see what sample size the authors determined was necessary to see a difference in the primary outcome. Then look at the study flowchart or the Results section to see how many were analyzed at the end of the study and compare the numbers.

  4. Rule-of-thumb: the "5 and 20 rule" - if there's less than 5% loss, then the study's generally OK; if there's more than 20% loss, the study has lost significant validity. Between 5 and 20%, judgment must be applied - look for differential loss between groups and at the power analysis. Would the results change if you assumed everyone that was lost had the bad outcome?

For More Information

Another great site is the CEBM's Glossary of EBM Terms

A page of the common biases referred to in critical appraisal.

Self-Assessment Questions

Useful Statistical and Study Design Self-Assessment Questions

Topical vs Oral Ibuprofen

Practice Problem - Statistics for Therapy Studies