Event Study and Difference-in-Differences
From naive two-way fixed effects to modern heterogeneity-robust estimators. This lab walks you through the evolution of DiD methodology using simulated firm-level data on the 2022 Russia sanctions shock across European economies.
Staggered treatment across four EU countries. Pre-computed TWFE and Callaway-Sant'Anna estimates for comparison.
Estimated duration: 45–60 min
Context & Theory
The Evolution of Difference-in-Differences
The difference-in-differences (DiD) estimator compares outcomes before and after treatment between treated and control units. For decades, the classical 2×2 design (two groups, two periods) was the workhorse of policy evaluation. Applied researchers extended this framework to panel settings via two-way fixed effects (TWFE) regressions with unit and time fixed effects.
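As a minimal sketch of the classical 2×2 design, the estimator is the change in the treated group minus the change in the control group. The four group means below are hypothetical illustration values, not lab data, and the function name `did_2x2` is an assumption introduced here.

```python
# Hypothetical group means of the outcome (illustrative only, not lab data).
treated_pre, treated_post = 100.0, 92.0
control_pre, control_post = 100.0, 99.0


def did_2x2(t_pre, t_post, c_pre, c_post):
    """Change in the treated group minus change in the control group."""
    return (t_post - t_pre) - (c_post - c_pre)


att_hat = did_2x2(treated_pre, treated_post, control_pre, control_post)
print(att_hat)  # -7.0
```

The control group's change (−1) stands in for what would have happened to the treated group without treatment, which is exactly the parallel trends assumption.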
However, a revolution in econometric methodology since 2020 has revealed that TWFE can produce severely biased estimates when treatment is staggered across groups and treatment effects are heterogeneous over time or across groups. The core problem: TWFE implicitly uses already-treated units as controls, and with heterogeneous effects, this contaminates the estimates with "negative weights."
— Goodman-Bacon (2021), Journal of Econometrics
The Modern Toolkit
Several estimators now address these problems. This lab focuses on:
- Callaway & Sant'Anna (2021): estimates group-time average treatment effects on the treated (ATT(g,t)), comparing each treated cohort to never-treated or not-yet-treated units.
- Sun & Abraham (2021): interaction-weighted estimator that decomposes TWFE into cohort-specific effects and reweights them.
- de Chaisemartin & D'Haultfoeuille (2020): demonstrate that TWFE weights can be negative and propose the DID_M estimator.
The Key Equation
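$$Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k D^k_{it} + \varepsilon_{it}$$
where $Y_{it}$ is the outcome for unit $i$ in period $t$, $\alpha_i$ are unit fixed effects, $\lambda_t$ are time fixed effects, and $D^k_{it}$ are relative-time indicators ($k$ periods from treatment), with $k = -1$ omitted as the reference period. The coefficients $\beta_k$ trace out the dynamic treatment effects, but only if the estimator handles staggered timing correctly.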
Parallel Trends and Pre-Trends Testing
The identifying assumption is parallel trends: absent treatment, the treated and control groups would have followed the same trajectory. We test this indirectly by examining the pre-treatment coefficients βk for k<0. If they are jointly indistinguishable from zero, we gain confidence in the assumption. But as Roth (2022) warns, pre-trends tests have low power: failure to reject does not prove parallel trends.
Application Context
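This lab uses simulated firm-level data modeled on European firms exposed to the 2022 Russia sanctions regime. Treatment is staggered by country due to different exposure channels: Poland is treated first (Q1 2022), Germany follows through the energy channel (Q2 2022), and France (export channel) and Italy (trade channel) are treated in Q3 2022.
References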
- Roth, J., Sant'Anna, P.H.C., Bilinski, A. & Poe, J. (2023). "What's Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature." Journal of Econometrics, 235(2), 2218-2244.
- Callaway, B. & Sant'Anna, P.H.C. (2021). "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, 225(2), 200-230.
- Sun, L. & Abraham, S. (2021). "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." Journal of Econometrics, 225(2), 175-199.
- de Chaisemartin, C. & D'Haultfoeuille, X. (2020). "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects." American Economic Review, 110(9), 2964-2996.
- Goodman-Bacon, A. (2021). "Difference-in-Differences with Variation in Treatment Timing." Journal of Econometrics, 225(2), 254-277.
- Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." Review of Economic Studies, 90(5), 2555-2591.
Data Exploration
The dataset contains 200 treated firms across 4 EU countries (50 per country) plus 40 control firms (low Russia exposure, never treated), for 240 firms in total. Outcome: revenue growth index (base = 100 at Q4 2021). Time: Q1 2020 to Q4 2023 (16 quarters).
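To make the panel layout concrete, here is a rough sketch of how such a firm-by-quarter structure could be built. The treatment start dates are the ones this lab gives for each country; the tuple layout and the names `rows`, `first_treated`, and `quarters` are assumptions for illustration, and no outcome values are simulated.

```python
# Quarter labels Q1 2020 .. Q4 2023 (16 quarters).
quarters = [f"Q{q} {y}" for y in range(2020, 2024) for q in range(1, 5)]

# First treated quarter per country, as described in this lab.
first_treated = {
    "Poland": "Q1 2022",
    "Germany": "Q2 2022",
    "France": "Q3 2022",
    "Italy": "Q3 2022",
}

rows = []  # (firm_id, country, period, cohort, relative_time, treated_now)
firm_id = 0
for country, start in first_treated.items():
    g = quarters.index(start)  # cohort = index of first treated quarter
    for _ in range(50):
        for t in range(len(quarters)):
            rows.append((firm_id, country, t, g, t - g, int(t >= g)))
        firm_id += 1

# 40 never-treated control firms: no cohort, no relative time.
for _ in range(40):
    for t in range(len(quarters)):
        rows.append((firm_id, "Control", t, None, None, 0))
    firm_id += 1

print(len(rows))  # 3840 firm-quarter observations (240 firms x 16 quarters)
```

The `relative_time` column (period minus cohort) is the quantity that event-study indicators are built from.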
Observation Guide
What to look for:
- In the "By Treatment Group" view, treated and control groups track each other closely before 2022. The divergence begins at staggered dates.
- In the "By Country" view, notice that Poland drops earlier (Q1 2022) while France and Italy drop later (Q3 2022). Germany is in between (Q2 2022).
- In the "Individual Firms" view, observe the heterogeneity. Some treated firms decline sharply; others are barely affected. Heterogeneous effects, combined with staggered timing, are what cause TWFE to fail.
Estimation Exercise
We begin with the standard approach: regress the revenue index on unit fixed effects, time fixed effects, and a set of relative-time dummies. The reference period is k=−1 (one quarter before treatment).
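The regression just described can be sketched with plain NumPy. This is a minimal illustration rather than the lab's pre-computed implementation: it assumes a balanced panel, absorbs the fixed effects with a two-way within transformation, and uses a hypothetical noiseless simulation with a homogeneous effect of −5, a case in which TWFE behaves well. The function name `twfe_event_study` and the simulation parameters are assumptions.

```python
import numpy as np


def twfe_event_study(y, cohort, ref=-1):
    """TWFE event-study coefficients for a balanced (N, T) panel.

    cohort[i] is the first treated period of unit i, or -1 if never treated.
    Returns {k: beta_k} for every relative time k except the reference.
    """
    n_units, n_periods = y.shape
    treated = cohort >= 0
    rel = np.arange(n_periods)[None, :] - cohort[:, None]  # relative time
    ks = sorted(set(rel[treated].ravel().tolist()) - {ref})

    def demean(a):
        # Two-way within transformation (valid for a balanced panel).
        return (a - a.mean(axis=1, keepdims=True)
                  - a.mean(axis=0, keepdims=True) + a.mean())

    # One demeaned indicator column per relative period k.
    X = np.column_stack([
        demean(((rel == k) & treated[:, None]).astype(float)).ravel()
        for k in ks
    ])
    beta, *_ = np.linalg.lstsq(X, demean(y).ravel(), rcond=None)
    return dict(zip(ks, beta))


# Hypothetical noiseless panel: three cohorts plus never-treated units.
rng = np.random.default_rng(0)
cohort = np.array([8] * 5 + [9] * 5 + [10] * 5 + [-1] * 5)
T = 16
alpha = rng.normal(size=(len(cohort), 1))   # unit effects
lam = rng.normal(size=(1, T))               # time effects
post = (cohort[:, None] >= 0) & (np.arange(T)[None, :] >= cohort[:, None])
y = 100 + alpha + lam - 5.0 * post          # homogeneous effect of -5

betas = twfe_event_study(y, cohort)
```

With a homogeneous, constant effect the post-treatment coefficients all equal −5 and the pre-treatment coefficients are zero; the contamination discussed in this lab only appears once effects differ across cohorts or evolve over time.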
The Problem
Notice the non-zero pre-trend coefficients at k=−3 and k=−2. This apparent "pre-trend violation" is not real — it is an artifact of TWFE using already-treated units (e.g., Poland) as controls when estimating effects for later-treated units (France, Italy). With heterogeneous effects, this contaminates the pre-period estimates.
The Callaway & Sant'Anna (2021) estimator computes group-time ATTs, comparing each cohort to the not-yet-treated or never-treated group. This avoids the contamination problem. Toggle between the two estimators to see the difference.
Why do the pre-trends differ?
The TWFE pre-trends are contaminated because the regression uses already-treated units as implicit controls. Once Poland is treated in Q1 2022, it enters the "control" comparison for Germany (treated Q2 2022) in subsequent periods. If Poland's treatment effect is still evolving over those periods, or differs from Germany's, this leaks into Germany's pre-trend estimates.
The CS estimator avoids this by only comparing each cohort to clean controls (never-treated or not-yet-treated), yielding pre-trend estimates that are essentially zero.
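The clean-control comparison can be illustrated with the simplest unconditional version of a group-time ATT: the change for cohort g between its last pre-treatment period and period t, minus the same change for never-treated units. This sketch assumes no covariates, never-treated controls only, and a fixed base period g−1; it is not the doubly robust procedure of the published estimator, and the simulated effect sizes are hypothetical.

```python
import numpy as np


def att_gt(y, cohort, g, t):
    """Unconditional ATT(g, t) with never-treated controls, base period g-1."""
    base = g - 1
    in_cohort = cohort == g
    never = cohort < 0
    change_treated = (y[in_cohort, t] - y[in_cohort, base]).mean()
    change_control = (y[never, t] - y[never, base]).mean()
    return change_treated - change_control


# Hypothetical noiseless panel with cohort-specific effects.
rng = np.random.default_rng(1)
cohort = np.array([8] * 4 + [9] * 4 + [10] * 4 + [-1] * 4)
T = 16
periods = np.arange(T)
alpha = rng.normal(size=(len(cohort), 1))
lam = rng.normal(size=(1, T))
effect = np.zeros((len(cohort), T))
for g, tau in {8: -12.0, 9: -9.0, 10: -6.0}.items():
    effect[cohort == g] = tau * (periods >= g)  # effect switches on at g
y = 100 + alpha + lam + effect

post_att = att_gt(y, cohort, g=9, t=12)    # cohort 9, three periods after
placebo_att = att_gt(y, cohort, g=10, t=5)  # pre-treatment placebo
```

Because already-treated units never enter the control group, each cohort's own effect is recovered and the pre-treatment placebo is zero, whatever the other cohorts' effects are.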
Which estimator would you trust?
The CS estimator. Its clean pre-trends are consistent with the parallel trends assumption holding, and its treatment effect estimates are not contaminated by already-treated comparisons. The TWFE estimator creates a false impression of a pre-trend violation and underestimates the treatment effect due to negative weighting of already-treated comparisons.
The group-time ATTs can be aggregated in different ways. Each scheme answers a different question. Select an aggregation to see how results change.
Understanding the aggregation schemes
Simple: equal weight on each group-time ATT. Answers: "What is the average effect across all treated units and post-treatment periods?"
By group: averages within each cohort first, then across cohorts. Answers: "What is the average effect for each treatment cohort?"
Calendar time: averages across groups within each calendar period. Answers: "What was the average effect at each point in time?"
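The three schemes can be sketched on a small table of post-treatment group-time ATTs. The values below are hypothetical, and the equal weighting follows the descriptions above; software implementations often weight by cohort size instead, so treat this as an illustration of the logic rather than of any package's defaults.

```python
import numpy as np

# Hypothetical post-treatment ATT(g, t), keyed by (cohort g, period t).
att = {
    (8, 8): -4.0, (8, 9): -8.0, (8, 10): -12.0,
    (9, 9): -3.0, (9, 10): -6.0,
    (10, 10): -2.0,
}


def aggregate_simple(att):
    """Equal weight on every group-time ATT."""
    return float(np.mean(list(att.values())))


def aggregate_by_group(att):
    """Average over post-treatment periods within each cohort."""
    groups = sorted({g for g, _ in att})
    return {g: float(np.mean([v for (gg, _), v in att.items() if gg == g]))
            for g in groups}


def aggregate_by_calendar(att):
    """Average over treated cohorts within each calendar period."""
    periods = sorted({t for _, t in att})
    return {t: float(np.mean([v for (_, tt), v in att.items() if tt == t]))
            for t in periods}


simple = aggregate_simple(att)
by_group = aggregate_by_group(att)        # {8: -8.0, 9: -4.5, 10: -2.0}
overall_group = float(np.mean(list(by_group.values())))
by_calendar = aggregate_by_calendar(att)
```

The simple average and the average of cohort averages differ here because the earliest cohort contributes more group-time cells to the simple scheme.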
Diagnostics
A formal Wald test of the joint null H0: βk = 0 for all k < 0. Under TWFE, the test rejects (artificially). Under CS, it fails to reject, consistent with parallel trends holding.
TWFE: p-value = 0.006, rejects H0 at 5%
CS: p-value = 0.738, fails to reject H0 at 5%
The TWFE rejection is a false alarm driven by contamination, not by actual pre-existing differential trends. This is one of the most important lessons of the modern DiD literature.
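For reference, a joint Wald statistic on the pre-treatment coefficients has the form W = b' V⁻¹ b and is compared with a chi-square critical value. The sketch below uses hypothetical coefficient vectors and a hypothetical covariance matrix with three pre-periods; it does not reproduce the lab's test statistics.

```python
import numpy as np


def wald_stat(beta_pre, vcov_pre):
    """Joint Wald statistic b' V^{-1} b for H0: all pre-period betas are 0."""
    b = np.asarray(beta_pre, dtype=float)
    return float(b @ np.linalg.solve(np.asarray(vcov_pre, dtype=float), b))


# 95th percentile of the chi-square distribution with 3 degrees of freedom.
CHI2_CRIT_5PCT_DF3 = 7.815

# Hypothetical estimates for three pre-periods (not the lab's numbers).
vcov = np.diag([0.25, 0.25, 0.25])
w_drifting = wald_stat([1.5, 1.2, 0.9], vcov)   # visible pre-trend
w_flat = wald_stat([0.2, -0.1, 0.1], vcov)      # flat pre-trend

reject_drifting = w_drifting > CHI2_CRIT_5PCT_DF3
reject_flat = w_flat > CHI2_CRIT_5PCT_DF3
```

The first set of coefficients gives a statistic of about 18 and rejects at the 5% level; the second gives about 0.24 and does not.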
Even if pre-trends tests pass, parallel trends could still be violated. Rambachan & Roth (2023) propose "honest" confidence intervals that allow for deviations from parallel trends. The key parameter M bounds the maximum slope change in the pre-trend violation between consecutive periods. As M increases, the confidence set widens.
At what value of M does the result lose significance?
The average treatment effect on the treated loses significance (its confidence interval includes zero) at approximately M = 0.030. This means that if we believe the slope of the pre-trend violation could change by as much as 3 index points from one quarter to the next, we can no longer rule out a zero effect. Since our CS pre-trends are essentially zero, this is a fairly large violation to entertain, so the result is reasonably robust.
Goodman-Bacon (2021) shows that the TWFE DiD estimator is a weighted average of all possible 2×2 DiD comparisons. Some comparisons are "clean" (treated vs. never-treated, or earlier-treated vs. not-yet-treated), while others are "bad" (later-treated vs. already-treated units serving as controls). The chart below shows the decomposition.
The "bad" comparisons (using already-treated as controls) carry 28% of the total weight but produce attenuated or wrong-sign estimates. This is why the TWFE estimate is biased toward zero relative to the CS estimate.
Why do "bad" comparisons matter?
When an already-treated unit is used as a control, its post-treatment trajectory already includes the treatment effect. If the treatment effect grows over time (dynamic effects), the "control" is itself changing, leading to an underestimate of the true effect. If treatment effects are heterogeneous across cohorts, the bias can go in either direction — including producing negative weights on some group-time ATTs.
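A small worked example shows the attenuation. All numbers are hypothetical: an early cohort is treated from period 0 with an effect that grows by 2 each period, and a late cohort is treated at period 2 with a true effect of −4.

```python
# Common trend shared by all units (hypothetical).
common_trend = [0.0, 1.0, 2.0]


def early_effect(t):
    """Effect on the early cohort, treated from t = 0 and growing over time."""
    return -2.0 * (t + 1)


late_effect_at_2 = -4.0  # true effect on the late cohort in its first period

y_never = common_trend[:]
y_early = [common_trend[t] + early_effect(t) for t in range(3)]
y_late = [common_trend[0], common_trend[1], common_trend[2] + late_effect_at_2]

# Clean 2x2: late cohort vs. never-treated, periods 1 -> 2.
did_clean = (y_late[2] - y_late[1]) - (y_never[2] - y_never[1])

# Bad 2x2: late cohort vs. already-treated early cohort, periods 1 -> 2.
did_bad = (y_late[2] - y_late[1]) - (y_early[2] - y_early[1])

print(did_clean, did_bad)  # -4.0 -2.0
```

The bad comparison subtracts the early cohort's own effect growth (−2) along with the common trend, so it recovers only half of the true effect.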
Results & Export
| Estimator | ATT | SE | 95% CI | Pre-Trends F | Verdict / Note |
|---|---|---|---|---|---|
| TWFE | −5.83 | 1.41 | [−8.60, −3.06] | 4.28 (p=0.006) | Contaminated |
| CS | −8.72 | 1.28 | [−11.23, −6.21] | 0.42 (p=0.738) | Preferred |
| CS by group: | | | | | |
| Poland Q1 2022 | −12.34 | 2.15 | [−16.55, −8.13] | | Largest effect |
| Germany Q2 2022 | −9.61 | 1.87 | [−13.28, −5.94] | | Energy channel |
| France Q3 2022 | −6.28 | 1.52 | [−9.26, −3.30] | | Export channel |
| Italy Q3 2022 | −7.15 | 1.63 | [−10.35, −3.95] | | Trade channel |
Key Takeaway
TWFE understates the sanctions effect by approximately 33% relative to the CS estimate (ATT = −5.83 vs. −8.72) due to contamination from using already-treated units as controls. The modern CS estimator reveals clean pre-trends and a larger, more precisely estimated effect. With staggered treatment and heterogeneous effects, always use a heterogeneity-robust estimator.
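The 33% figure and the CS confidence interval follow directly from the reported estimates. The short check below assumes a normal-approximation interval (estimate ± 1.96 × SE); the variable names are illustrative.

```python
att_twfe, att_cs = -5.83, -8.72   # estimates reported in the results table
se_cs = 1.28

# Share of the CS estimate that TWFE fails to capture.
relative_shortfall = (att_cs - att_twfe) / att_cs

# Normal-approximation 95% confidence interval for the CS estimate.
ci_cs = (att_cs - 1.96 * se_cs, att_cs + 1.96 * se_cs)

print(round(relative_shortfall, 2))           # 0.33
print(tuple(round(x, 2) for x in ci_cs))      # (-11.23, -6.21)
```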