Irrelevant Yet Significant (part 2/2)
Win Ratio arrives in critical care as a P-value fishing strategy.
I reviewed DEFENDER's hypothesis in the first part of this essay. The study proposed that dapagliflozin would benefit nearly all patients admitted to critical care. The authors resorted to things like “metabolic efficiency” and other undefinable entities to argue that every patient with cardiovascular, kidney, or ventilatory failure benefits from the study drug. It is another half-baked hypothesis, of course.
Only random or systematic error could render a “positive” result for this study. We have seen it before: many small and fatally biased studies “proved” bad hypotheses right. The older reader will remember the tight glycemic control hysteria and the nebulous entity called relative adrenal insufficiency. However, in the last two decades, critical care RCTs have gained more power and better design, avoiding both random and systematic errors. The result is the scarcity of "positive" trials.
The field of critical care research suffers from a singular curse: the clinical effects of proposed therapies, if present at all, are usually so small that the signal goes undetected by the usual statistics, and P-values remain stubbornly high.
Small differences require large samples to reach statistical significance, and clinical trials may become incredibly large, expensive, and complicated. This leads to Delta Inflation, the tendency to inflate the predicted effect size (the delta) to fit a feasible sample size. I recommend reading the excellent essay I just linked. Delta inflation is everywhere in critical care. I recently posted about a delta-inflated study that expected a 12% difference in mortality just by switching sedatives. Is it the most optimistic hypothesis ever conceived, or were they trying to fit the estimates to their sample size?
The DEFENDER study adopted a more realistic 2% difference in mortality (30% to 28%) and, remarkably, planned to enroll only 500 patients. A plain, démodé sample size calculation to reach significance with a 30% to 28% mortality delta lands in the thousands of patients (say, more than 15,000). Even considering a composite endpoint including mortality, new hemodialysis, and ICU stay longer than 28 days, we would still be talking about thousands of patients.
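For the skeptical reader, here is a back-of-the-envelope check with the classic two-proportion formula. It is a minimal sketch of my own, not the trial's calculation, and it assumes a two-sided alpha of 0.05 and 80% power:

```python
import math
from scipy.stats import norm

def two_proportion_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to detect p1 vs p2 with a two-sided test,
    using the normal approximation for two independent proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n_per_arm = two_proportion_sample_size(0.30, 0.28)
print(n_per_arm, 2 * n_per_arm)  # roughly 8,000 per arm, >16,000 patients in total
```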
But what if one could boost the power to detect a minuscule effect?
Deus ex machina!
Win Ratio statistics appeared and solved the problem by introducing an effect metric and inferential statistics (P-values, confidence intervals) that do not describe the association between exposure and outcome incidence.
Instead, DEFENDER will offer you something like a structured “waiting room contest”, that weird conversation where patients brag about how sick they have been. Someone starts by sharing an ED admission, then another guy says he had cancer, and the winner takes the prize describing how she survived three cardiac arrests. Got the idea? Researchers will count how many wins the intervention arm had over the control arm, and how many wins the control arm had over the intervention arm, divide one value by the other, and present the win ratio as a measure of treatment effect.
But how do researchers measure the wins? First, let us define what a win is. Take the component outcomes and rank them from the worst to the… not-so-bad.¹ In the example provided by the DEFENDER study, the outcome ranking is (1) Death, (2) Dialysis, and (3) Longer ICU stay. Then you make pairs of patients, one from each arm, to decide who loses, i.e., who met the worse outcome, and declare the winner of each duel. You can use a propensity score to match the patients, or you can compare every possible pair, matching each patient against all patients in the other group. If both patients in a pair met the outcome, or neither did, it is a tie and no one wins the duel. The tied pair then duels on the next outcome down the ranking, until there is a winner or the pair ties all the way to the bottom of the outcome hierarchy (see the sketch below).
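If you prefer code over waiting-room analogies, here is a minimal sketch of the all-possible-pairs version. The patient records, field names, and toy numbers are mine, purely for illustration; this shows the general hierarchical-comparison idea, not DEFENDER's analysis code:

```python
import itertools

def compare_pair(treated, control):
    """Duel one treated patient against one control patient.
    Outcomes are checked in hierarchical order: death, dialysis,
    then ICU length of stay (shorter is better). Returns 'win' if the
    treated patient wins, 'loss' if the control patient wins, else 'tie'."""
    if treated["death"] != control["death"]:        # 1) death: whoever dies loses
        return "loss" if treated["death"] else "win"
    if treated["dialysis"] != control["dialysis"]:  # 2) dialysis: whoever needs it loses
        return "loss" if treated["dialysis"] else "win"
    if treated["icu_days"] != control["icu_days"]:  # 3) ICU stay: shorter stay wins
        return "win" if treated["icu_days"] < control["icu_days"] else "loss"
    return "tie"                                    # tied on every outcome

def win_ratio(treatment_arm, control_arm):
    """All-possible-pairs win ratio: every treated patient duels every control."""
    wins = losses = ties = 0
    for t, c in itertools.product(treatment_arm, control_arm):
        result = compare_pair(t, c)
        wins += result == "win"
        losses += result == "loss"
        ties += result == "tie"
    ratio = wins / losses if losses else float("inf")
    return ratio, wins, losses, ties

# Toy data with hypothetical field names; three patients per arm
treatment = [{"death": False, "dialysis": False, "icu_days": 5},
             {"death": False, "dialysis": True,  "icu_days": 12},
             {"death": True,  "dialysis": False, "icu_days": 20}]
control = [{"death": False, "dialysis": False, "icu_days": 9},
           {"death": True,  "dialysis": False, "icu_days": 15},
           {"death": False, "dialysis": False, "icu_days": 5}]
print(win_ratio(treatment, control))  # (0.6, 3, 5, 1) for this toy data
```

Note that the unit of analysis is the duel, not the patient: with 250 patients per arm, the statistics run on 62,500 pairwise comparisons.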
The choice of component outcomes poses an additional problem in DEFENDER. Discrete outcomes, like death and initiation of hemodialysis, produce more ties: with only two possible values, many pairs will simply match. The third outcome is a discrete numeric variable expected to range from 3 to 28.² The fact that most wins will come from the third outcome limits interpretation for two reasons: (1) despite coming mostly from the third outcome, results will be presented as a difference in the composite outcome, and (2) because only the tied pairs advance to the next outcome, the second and third outcomes will not be assessed in all pairs, precluding any further interpretation of the results. Moreover, the win ratio is the same whether the win count is 300/200 or 3/2, and it is the same if those 3 wins come only from the third outcome.
Finally, the inferential test may be done by bootstrapping, i.e., resampling from the study sample, among other methods. DEFENDER will run 10,000 bootstrap resamples to calculate the 95% confidence interval of the win ratio. If the interval does not include unity, the result will be declared significant (see the sketch below).
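Roughly, a percentile bootstrap could look like the sketch below, reusing win_ratio and the toy data from the previous snippet. The structure and defaults are my assumptions, not the trial's statistical analysis plan:

```python
import numpy as np

def bootstrap_win_ratio_ci(treatment_arm, control_arm, n_boot=10_000,
                           alpha=0.05, seed=42):
    """Percentile bootstrap CI for the win ratio: resample patients with
    replacement within each arm, recompute the win ratio each time, and take
    the 2.5th and 97.5th percentiles of the resulting distribution."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        t_idx = rng.integers(0, len(treatment_arm), len(treatment_arm))
        c_idx = rng.integers(0, len(control_arm), len(control_arm))
        wr, *_ = win_ratio([treatment_arm[i] for i in t_idx],
                           [control_arm[i] for i in c_idx])  # from the sketch above
        estimates.append(wr)
    lower, upper = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper  # "significant" if the interval excludes 1.0

print(bootstrap_win_ratio_ci(treatment, control, n_boot=1000))
```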
But how did DEFENDER arrive at the 500-patient RCT? Forget old-school sample size calculations with discouraging results! In this Brave New World, you run a Monte Carlo simulation to find the sweet spot between study power and your budget.
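For illustration, a Monte Carlo power check could be structured like the sketch below. Every event rate in it is a number I made up, the defaults are deliberately small so the pure-Python toy finishes in reasonable time, and it reuses the functions from the previous sketches:

```python
import numpy as np

def simulate_patient(rng, mortality, dialysis_rate, mean_icu_days):
    """One synthetic patient under assumed (entirely hypothetical) event rates."""
    return {
        "death": rng.random() < mortality,
        "dialysis": rng.random() < dialysis_rate,
        "icu_days": int(np.clip(rng.exponential(mean_icu_days), 3, 28)),
    }

def simulated_power(n_per_arm=50, n_trials=100, n_boot=200, seed=1):
    """Fraction of simulated trials whose bootstrap CI for the win ratio excludes 1.
    Small defaults keep this toy tractable; a real power simulation would use
    larger arms, more replicates, and vectorized code."""
    rng = np.random.default_rng(seed)
    significant = 0
    for _ in range(n_trials):
        # Hypothetical effect: slightly lower event rates in the intervention arm
        treated = [simulate_patient(rng, 0.28, 0.10, 8.5) for _ in range(n_per_arm)]
        control = [simulate_patient(rng, 0.30, 0.12, 9.0) for _ in range(n_per_arm)]
        lower, upper = bootstrap_win_ratio_ci(treated, control, n_boot=n_boot)
        if lower > 1.0 or upper < 1.0:
            significant += 1
    return significant / n_trials
```

You then nudge the sample size (and your assumed effect) until the simulated power looks acceptable for the budget.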
You have just had a glimpse of the dark future of critical care research: small studies with terrible hypotheses displaying the desired p<0.05 in medical journal headlines (article titles are already read as headlines) and in journal clubs worldwide, to the new generation of intensivists.
The reader may think I am overstating the perils of this new idea. However, it is easy to get a low P-value. Imagine you make all possible pairwise comparisons with 250 patients in each group, but the treatment "effect" (the win ratio) does not reach significance. If you need more power to detect a smaller difference, you may add 10 more patients, five in each group, providing a few thousand additional comparisons (250 × 250 = 62,500 duels becomes 255 × 255 = 65,025). You may also add another outcome to generate more duels with winners. The number of observations (pairwise duels) faces no physical constraint. That's how you guarantee you will fish a nice P-value by misusing Win Ratio statistics. It all gets so complicated and detached from empirical reality that, as in Hamlet's cloud scene, people will be convinced of anything researchers tell them.
Remember the Thoughtful Intensivist Axiom:
"The worse the hypothesis, the more complicated the statistical analysis"
Finally, it will be no surprise if the DEFENDER study finds significantly more wins in the dapagliflozin arm despite clinically irrelevant effects. Whatever the result, I think it is a concerning move. Instead of working on a sound research hypothesis, researchers will go for irrelevant yet significant, thus publishable, effects. I am appalled by the prospect that our dear specialty will pivot into studies like this. In my Substack posts, I try to sound light-hearted, mocking, provocative, etc., but this case is genuinely problematic. I hope to be wrong!
¹ That's how you spot a non-native speaker of English.
² Remember, DEFENDER is an open-label study: physicians in charge of deciding ICU discharge are aware of the patient's allocation. The stage is set for performance bias.