By CHARLIE STRAUSS
Editor’s note: This article was received Thursday, July 23.
Because deaths from Covid-19 typically occur about three weeks after the disease’s symptoms first appear, the number of deaths lags behind the number of cases. For this reason, one can expect to see more deaths than have been reported so far.
Forecasting is a risky business: a wrong forecast can cause problems either by being too alarming or by downplaying things too much, and the soothsayer risks embarrassing ridicule besides. This article, then, is less about a particular forecast than about how to think about constructing one.
It therefore behooves us to keep this analysis overly simple, as a way to get the idea across rather than to strive for perfection in the forecast itself. Be forewarned: the following is a deliberately simple forecasting approach. In fact it’s boneheaded, but after showing it to some wiser people I was told it adds to the current narrative, because more sophisticated forecasting models are so complicated that few people have any intuition about when to trust them and when not to. This one is very easy to compute, and it is also easy to see what might go wrong with its forecasts.
The first figure shows the total Covid-19 cases and total deaths in the US, plotted against days since Jan. 1. The bottom plot, the “fatality fraction”, is the number of deaths-to-date divided by the number of cases-to-date.
So now we see the fatality rate is dropping after day 120, right? Well, no: it’s not really dropping any more than it was really rising after day 60. What’s actually going on is that the number of people with positive tests has risen exponentially while the inevitable deaths among those cases haven’t happened yet, so the denominator (the large case count) drags the ratio down. Likewise, we should ignore the spike in the fatality rate before day 60, when testing and death reporting weren’t ubiquitous and the medical system was under stress.
Conceptually, a better baseline assumption is that the fatality rate is a constant at all times. How can we estimate that?
A quick estimate can be made by looking for a stretch of the case curve where the number of new cases is roughly constant for a long time. When that’s true, the computed fatality ratio converges toward the underlying true fatality rate, since the system is close to equilibrium. We therefore extract our best guess from the peak of the observed fatality rate in that constant-new-case regime; in the plot above, that is 0.06, or 6%. This estimate can be customized for each state.
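As a sketch of that idea, here is one way to hunt for an equilibrium stretch and read off the peak fatality fraction there. All names, the window length, and the constancy tolerance are illustrative choices, not taken from the original analysis:

```python
import numpy as np

def equilibrium_fatality_rate(cum_cases, cum_deaths, window=14, tol=0.15):
    """Estimate the underlying fatality rate from a stretch where daily
    new cases are roughly constant (an equilibrium regime).

    cum_cases, cum_deaths: cumulative daily counts (1-D arrays).
    window: length (in days) of the stretch checked for constancy.
    tol: allowed relative spread of new cases within the window.
    """
    new_cases = np.diff(cum_cases)
    # naive fatality fraction: deaths-to-date / cases-to-date
    ratio = cum_deaths / np.maximum(cum_cases, 1)
    best = 0.0
    for start in range(len(new_cases) - window + 1):
        chunk = new_cases[start:start + window]
        mean = chunk.mean()
        if mean > 0 and (chunk.max() - chunk.min()) / mean < tol:
            # near-constant new cases: the ratio is near its asymptote
            # here, so keep the peak of the observed fatality fraction
            best = max(best, ratio[start + 1:start + window + 1].max())
    return best
```

On synthetic data with 1,000 new cases and 60 new deaths every day, this returns the underlying 6% rate, as expected from the equilibrium argument.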
To get a feel for this, let’s apply this rule of thumb, then later we’ll discuss why this works or doesn’t.
1. Find the peak fatality rate in the curve after about March (90 days past Jan. 1).
2. Adjust this for the changing testing rate: multiply by the number of tests on that day divided by the number of tests on the current day (this adjustment can be 30% or more).
3. Multiply this by the current number of positive cases.
4. Subtract off the current number of deaths to arrive at the forecast for additional deaths ahead.
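The four steps above can be sketched as a single function. The argument names are hypothetical; steps 2 through 4 map one-to-one onto the list:

```python
def forecast_additional_deaths(peak_fatality_rate, tests_at_peak,
                               tests_now, cases_now, deaths_now):
    """Rule-of-thumb forecast of additional deaths over the next ~21 days."""
    # Step 2: adjust the peak rate for the changed testing rate
    adjusted_rate = peak_fatality_rate * tests_at_peak / tests_now
    # Step 3: expected eventual deaths among the currently known cases
    expected_deaths = adjusted_rate * cases_now
    # Step 4: subtract deaths already recorded
    return expected_deaths - deaths_now
```

For example, with a 6% peak rate measured when 100,000 tests per day were being run, 200,000 tests per day now, 500,000 current cases, and 10,000 deaths to date, the forecast is 5,000 additional deaths (the numbers are made up for illustration).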
This is a prediction for the number of deaths that will happen in the next 3 weeks (I’ll explain why 3 weeks below). Applying this hypothesis to some US states with recent surges in cases, one gets:
If this predictor is accurate, then it forecasts more deaths in just the next 21 days than in all prior days combined in Florida and Arizona, and almost as many in Texas. This is grim, and if it is even close to accurate it calls for preparation. However, please remember this is not meant to be a great model forecast; it’s just an article on how to make forecasts simply enough that we can also understand how to second-guess them.
Okay, why should this work and why 21 days?
Now let’s discuss the notion underlying this rule of thumb; later we can examine the reasons the model might be wrong. Going back to that original, erroneous fatality-rate computation (current deaths divided by current cases), we can try correcting it in two ways. Notionally, most of the people who die on a given day were first reported as cases weeks before. So first, let’s see what happens if we instead divide the current number of deaths by the number of cases 21 days ago (or some other lag of days). The dashed blue line in the following set of curves shows the fatality fraction for lags of 0, 7, 21, and 35 days, with an enlarged portion on the right. A lag of 0 days is, of course, the original naive fatality-rate curve above.
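A minimal sketch of this lagged fraction, assuming cumulative counts held in NumPy arrays (the function and argument names are mine, not from the original analysis):

```python
import numpy as np

def lagged_fatality_fraction(cum_deaths, cum_cases, lag):
    """Cumulative deaths today divided by cumulative cases `lag` days ago.

    A lag of 0 reproduces the naive fatality fraction; a positive lag
    compares each day's death total to the case total `lag` days earlier.
    """
    if lag == 0:
        return cum_deaths / np.maximum(cum_cases, 1)
    # align deaths on day t with cases on day t - lag
    return cum_deaths[lag:] / np.maximum(cum_cases[:-lag], 1)
```

The returned series is `lag` days shorter than the inputs, since the first `lag` days have no earlier case count to divide by.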
The second correction we can apply is for the escalating number of tests. The number of tests per day has increased close to linearly. As a crude correction we shall assume we can simply divide the reported case rate by the number of tests that day, and then renormalize this back to the total number of all tests. This will convert the case rate to what it would be if the number of tests had been constant all along. (See caveats about non-random testing below.)
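That crude correction can be sketched as follows: redistribute the observed total case count according to each day’s positivity (cases per test), which approximates the case curve under a constant testing rate. The names are illustrative, and this is only the simple renormalization the text describes, with all the non-random-testing caveats attached:

```python
import numpy as np

def renormalize_for_testing(daily_cases, daily_tests):
    """Approximate the daily case counts that a constant testing rate
    would have produced: weight each day by its positivity, then scale
    so the total matches the observed total case count."""
    positivity = daily_cases / np.maximum(daily_tests, 1)
    return positivity / positivity.sum() * daily_cases.sum()
```

For instance, two days with identical 10% positivity but very different testing volumes come out with equal corrected case counts, which is the intended effect of the renormalization.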
Incorporating this correction we get the orange line on the plots. As can be seen, the most extended flat range occurs in the 21-day lag plot. We can write off the non-flat behaviour before day 100 as a combination of poor early testing and the fact that the crude renormalization for testing rates likely distorts the relative weight of those early days.
One can plot this for other lags, but somewhere around 20 to 24 days one obtains the flattest, most extended period of constant fatality rate; thus the 21-day plot is shown.
The intuitive interpretation of this is that the fatality rate is basically a constant. And many people would expect that to be true. It would be more surprising if it weren’t!
What this means is that, given the underlying constant fatality ratio, one should simply be able to multiply the current number of cases by this fatality factor to get the expected number of deaths to accrue from those cases. And that forecast appears to be most accurate 21 days into the future from the present day.
Putting this into action:
One could try refitting the data set for any given state, but daily testing rates are not available for every state, there are fewer counts, and there are lots of day-to-day fluctuations at the state level. Thus, instead, our rule of thumb is to use those convenient equilibrium regions where the new-case rate is roughly constant and extract the fatality rate there for each state.
What might be wrong about this analysis:
Before discussing that, a word about what this “fatality rate” actually means and doesn’t mean. First of all, it is not the actual fatality rate. Sometime in the future we may know the real number of latent cases, and we will also be able to compensate for mis-attributed deaths. The “fatality rate” discussed here isn’t trying to estimate that future “true” number. It is just a factor that, when multiplied by the currently known case count (not the actual number of infections, just the number testing gives us), predicts the number of deaths in 21 days quite well. We recognize that test results are likely posted a week or so late, but that effect is already folded into this factor. The only reason to call it a “fatality rate” is that it has the dimensions of deaths-per-positive-case, and it equals the “apparent” fatality rate in use right now given only the data we have now (not corrected by future information).
Additionally, the “21 days” should not be interpreted as the “life expectancy” of those who eventually die. It is, again, just the point in the future at which the above factor is most accurate in predicting the expected deaths, and it works as an indicator of the lag between surges in cases and surges in deaths. So this analysis should not be faulted either for latent cases masking the true fatality rate or for the fact that individual cases progress at different rates.
Where this analysis is likely to contain errors:
1. The fatality factor would not be constant if something changed in the demographics in the past 21 days. For example, perhaps a policy of strictly isolating the elderly was implemented, or schools started, 21 days ago. In that case the cases shift down in median age, the fatality rate falls, and the forecast will be wrong. (This can be corrected only once the new fatality rate can be estimated.)
2. Likewise, if some new treatment is introduced in that 21-day period, such as the new steroid treatments, then the lag time stretches out and the fatality rate falls, distorting a prediction based on the prior 21 days’ numbers.
3. Hospital saturation, heat waves, or other events might drive up the fatality rate or shorten the optimal time lag.
4. The testing is not random, and how subjects are selected may be changing, and even the methods of confirming diagnosis may change and vary from state to state. This is a blind spot I cannot correct for. It’s a big blind spot, too, and it impairs most models.
5. Large swings in the testing rate could invalidate the simple, crude, correction used.
6. There could exist some completely different model that explains the data for which this is merely an accidental temporary agreement that won’t persist in the future. That is, it’s just nonsense!
Now remember, those forecasts will have big errors! Moreover, this whole analysis makes the grave error of not including any error bars or uncertainty quantification; it is just a qualitative discussion, faults and all. There certainly are more sophisticated models one could concoct. However, since this one has been so accurate over the last 75 days (residual error near zero), no other model could have done much better over that particular period. Only the future will tell whether this holds up, and more sophisticated models may be needed.
The take-home message here is that the flat line obtained for the 21-day, test-rate corrected fatality rate means that this was an accurate predictor of the deaths for all of that recent 75 day period in the US totals (where the line was flat). That is, it predicted the deaths 21 days in advance on each of those 75 days. So the answer to the question posed in the title seems to be yes. But will it hold in the future? And does it degrade for individual states? We shall see.
Data and methods:
- Daily case rates from the NYTimes GitHub page: https://github.com/nytimes/covid-19-data
- Estimates of the daily testing rate obtained by fitting the plots found at https://covid19.healthdata.org/united-states-of-america
- Plots made in a Jupyter Notebook using the Pylab (Matplotlib) plotting interface.
Acknowledgments: Thank you to L. Hanson for critical reading and suggestions. Disclaimer: This work is not connected in any way to my employer. I did not use any computers, software, methods, ideas or data owned by or related to my employer. The predictions made are inexact and likely inaccurate and are not made with the intention of precision but just made as an example for teaching. Nothing here should be construed as endorsed by me or my employer as being accurate.