GeoRisk Data Lab 3

Event Study and Difference-in-Differences

From naive two-way fixed effects to modern heterogeneity-robust estimators. This lab walks you through the evolution of DiD methodology using simulated firm-level data on the 2022 Russia sanctions shock across European economies.

Phase 1 Theory & evolution of DiD

Phase 2 Data exploration

Phase 3 Estimation exercise

Phase 4 Diagnostics

Phase 5 Results & export

Firms 200

Quarters 16

Treatment Groups 4 + ctrl

Method DiD

Staggered treatment across four EU countries. Pre-computed TWFE and Callaway-Sant'Anna estimates for comparison.

Estimated duration: 45–60 min

Context & Theory

The Evolution of Difference-in-Differences

The difference-in-differences (DiD) estimator compares outcomes before and after treatment between treated and control units. For decades, the classical 2×2 design (two groups, two periods) was the workhorse of policy evaluation. Applied researchers extended this framework to panel settings via two-way fixed effects (TWFE) regressions with unit and time fixed effects.
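As a minimal sketch of the classical 2×2 design, the estimate is the change in the treated group minus the change in the control group. The four group means below are hypothetical and are not taken from the lab data.

```python
# Hypothetical group means of the outcome; none of these numbers come from the lab data.
means = {
    ("treated", "pre"): 100.0,
    ("treated", "post"): 92.0,
    ("control", "pre"): 100.0,
    ("control", "post"): 99.0,
}

def did_2x2(m):
    """Change in the treated group minus change in the control group."""
    treated_change = m[("treated", "post")] - m[("treated", "pre")]
    control_change = m[("control", "post")] - m[("control", "pre")]
    return treated_change - control_change

estimate = did_2x2(means)
print(estimate)  # -7.0
```

The control group's change (−1) stands in for what would have happened to the treated group without treatment, so the −8 raw change is adjusted to −7.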

However, a wave of econometric research since 2020 has shown that TWFE can produce severely biased estimates when treatment is staggered across groups and treatment effects vary over time or across groups. The core problem: TWFE implicitly uses already-treated units as controls. With heterogeneous effects, those comparisons can enter the estimate with negative weights, so the TWFE coefficient need not be a sensible average of the underlying treatment effects.

"The key insight is that TWFE regressions implicitly compute many different 2×2 DD comparisons, some of which use already-treated units as controls."
— Goodman-Bacon (2021), Journal of Econometrics

The Modern Toolkit

Several estimators now address these problems. This lab focuses on:

Callaway & Sant'Anna (2021): estimates group-time average treatment effects on the treated (ATT(g,t)), comparing each treated cohort to never-treated or not-yet-treated units.
Sun & Abraham (2021): interaction-weighted estimator that estimates cohort-specific effects for each relative period and averages them using cohort shares as weights.
de Chaisemartin & D'Haultfoeuille (2020): demonstrates that TWFE weights can be negative, proposes the DID_M estimator.

The Key Equation

Y_it = α_i + λ_t + Σ_k β_k · D^k_it + ε_it

where α_i are unit fixed effects, λ_t are time fixed effects, and D^k_it are relative-time indicators equal to 1 when unit i is k periods from its treatment date. One relative period is omitted as the reference (k = −1 in this lab). The coefficients β_k trace out the dynamic treatment effects, but only if the estimator handles staggered timing correctly.

Parallel Trends and Pre-Trends Testing

The identifying assumption is parallel trends: absent treatment, the treated and control groups would have followed the same trajectory. We test this indirectly by examining the pre-treatment coefficients β_k for k<0. If they are jointly indistinguishable from zero, we gain some confidence in the assumption. But as Roth (2022) warns, pre-trends tests often have low power: failure to reject does not prove parallel trends.

Application Context

This lab uses simulated firm-level data modeled on European firms exposed to the 2022 Russia sanctions regime. Treatment is staggered by country due to different exposure channels:

Poland Q1 2022 — border & security

Germany Q2 2022 — energy dependence

France Q3 2022 — luxury export ban

Italy Q3 2022 — trade disruption
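The staggered dates above determine each firm's event time k, which in turn defines the relative-time indicators in the key equation. The sketch below shows one way to compute them. The (year, quarter) encoding, the helper names, and the ±4 quarter window are illustrative assumptions, not part of the lab's code.

```python
# Treatment quarters as listed above; encoding and helper names are illustrative.
TREATMENT_START = {
    "Poland": (2022, 1),
    "Germany": (2022, 2),
    "France": (2022, 3),
    "Italy": (2022, 3),
}

def quarter_index(year, quarter):
    """Map (year, quarter) to a running integer so quarters can be subtracted."""
    return year * 4 + (quarter - 1)

def event_time(country, year, quarter):
    """Quarters since treatment onset (negative before treatment).

    Returns None for units that are never treated.
    """
    start = TREATMENT_START.get(country)
    if start is None:
        return None
    return quarter_index(year, quarter) - quarter_index(*start)

def relative_time_dummies(k, window=range(-4, 5), reference=-1):
    """One indicator per relative period in the window, omitting the reference.

    The window is an assumption; never-treated units (k is None) get all zeros.
    """
    return {j: int(k == j) for j in window if j != reference}

# The 16 quarters of the panel, Q1 2020 through Q4 2023.
quarters = [(y, q) for y in range(2020, 2024) for q in range(1, 5)]
poland_k = [event_time("Poland", y, q) for y, q in quarters]
print(poland_k[0], poland_k[-1])  # -8 7
```

Observations outside the chosen window get all-zero indicators in this sketch, which pools them with the reference period; applied work often bins the endpoints instead.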

Key References

Roth, J., Sant'Anna, P.H.C., Bilinski, A. & Poe, J. (2023). "What's Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature." Journal of Econometrics, 235(2), 2218-2244.
Callaway, B. & Sant'Anna, P.H.C. (2021). "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, 225(2), 200-230.
Sun, L. & Abraham, S. (2021). "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." Journal of Econometrics, 225(2), 175-199.
de Chaisemartin, C. & D'Haultfoeuille, X. (2020). "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects." American Economic Review, 110(9), 2964-2996.
Goodman-Bacon, A. (2021). "Difference-in-Differences with Variation in Treatment Timing." Journal of Econometrics, 225(2), 254-277.
Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." Review of Economic Studies, 90(5), 2555-2591.

Data Exploration

The dataset contains 200 firms across 4 EU countries (50 per country) plus 40 control firms (low Russia exposure, not treated). Outcome: revenue growth index (base = 100 at Q4 2021). Time: Q1 2020 to Q4 2023 (16 quarters).

Observation Guide

What to look for:

In the "By Treatment Group" view, treated and control groups track each other closely before 2022. The divergence begins at staggered dates.
In the "By Country" view, notice that Poland drops earlier (Q1 2022) while France and Italy drop later (Q3 2022). Germany is in between (Q2 2022).
In the "Individual Firms" view, observe the heterogeneity. Some treated firms decline sharply; others are barely affected. Heterogeneous effects, combined with staggered timing, are what cause TWFE to fail.

Estimation Exercise

Step 3a — Naive TWFE Event Study

We begin with the standard approach: regress the revenue index on unit fixed effects, time fixed effects, and a set of relative-time dummies. The reference period is k=−1 (one quarter before treatment).

The Problem

Notice the non-zero pre-trend coefficients at k=−3 and k=−2. This apparent "pre-trend violation" is not real — it is an artifact of TWFE using already-treated units (e.g., Poland) as controls when estimating effects for later-treated units (France, Italy). With heterogeneous effects, this contaminates the pre-period estimates.

Step 3b — Callaway & Sant'Anna Correction

The Callaway & Sant'Anna (2021) estimator computes group-time ATTs, comparing each cohort to the not-yet-treated or never-treated group. This avoids the contamination problem. Toggle between the two estimators to see the difference.

Why do the pre-trends differ?

The TWFE pre-trends are contaminated because the regression uses already-treated units as implicit controls. Once Poland is treated in Q1 2022, it continues to serve as a "control" for Germany (treated Q2 2022) and for France and Italy (treated Q3 2022). If Poland's treatment effect evolves over time or differs from the other cohorts' effects, that variation leaks into their relative-time coefficients, including the pre-treatment ones.

The CS estimator avoids this by only comparing each cohort to clean controls (never-treated or not-yet-treated), yielding pre-trend estimates that are essentially zero.
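To make the clean-control comparison concrete, here is a minimal sketch of a group-time ATT using never-treated units only. It compares each cohort's outcome change since its last pre-treatment period with the same change among never-treated units. The toy panel and function names are made up for illustration. The sketch uses a fixed base period g − 1 and omits covariates and inference, so it is a simplification rather than the full Callaway & Sant'Anna procedure.

```python
from statistics import mean

# Toy panel: unit -> (cohort, outcomes by period).
# Cohort is the first treated period; None means never treated.
# All values are made up for illustration.
panel = {
    "a1": (2, {1: 100.0, 2: 95.0, 3: 93.0}),
    "a2": (2, {1: 102.0, 2: 96.0, 3: 95.0}),
    "b1": (3, {1: 101.0, 2: 101.5, 3: 97.0}),
    "c1": (None, {1: 99.0, 2: 100.0, 3: 100.5}),
    "c2": (None, {1: 100.0, 2: 100.0, 3: 101.5}),
}

def mean_change(units, t, base):
    """Average outcome change between the base period and period t."""
    return mean(y[t] - y[base] for _, y in units)

def att_gt(panel, g, t):
    """Group-time ATT for cohort g at period t, never-treated comparison group."""
    base = g - 1  # last period before cohort g is treated
    cohort = [v for v in panel.values() if v[0] == g]
    never = [v for v in panel.values() if v[0] is None]
    return mean_change(cohort, t, base) - mean_change(never, t, base)

print(att_gt(panel, 2, 2))  # -6.0
print(att_gt(panel, 3, 3))  # -5.5
print(att_gt(panel, 3, 1))  # 0.0  (pre-treatment placebo for the late cohort)
```

No already-treated unit ever appears on the control side, which is why the placebo estimate for the late cohort is not disturbed by the early cohort's treatment effect.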

Which estimator would you trust?

The CS estimator. Its pre-trend estimates are close to zero, which is consistent with parallel trends, and its treatment effect estimates are not contaminated by comparisons with already-treated units. The TWFE estimator creates a false impression of a pre-trend violation and understates the magnitude of the treatment effect, because comparisons that use already-treated units as controls enter with negative weights.

Step 3c — Aggregation Schemes

The group-time ATTs can be aggregated in different ways. Each scheme answers a different question. Select an aggregation to see how results change.

Understanding the aggregation schemes

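Simple: a weighted average of all post-treatment group-time ATTs, with weights proportional to cohort size (equal weights when cohorts are the same size). Answers: "What is the average effect across all treated units and post-treatment periods?"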

By group: averages within each cohort first, then across cohorts. Answers: "What is the average effect for each treatment cohort?"

Calendar time: averages across groups within each calendar period. Answers: "What was the average effect at each point in time?"
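The three schemes can be sketched as different averages over the same set of group-time ATTs. The values and cohort sizes below are hypothetical. Weighting cells by cohort size is an assumption of this sketch; with equal-sized cohorts it reduces to equal weights.

```python
# Hypothetical group-time ATTs keyed by (cohort, period), and hypothetical cohort sizes.
att = {(2, 2): -6.0, (2, 3): -8.0, (3, 3): -4.0}
n_units = {2: 50, 3: 100}

def simple_aggregate(att, n_units):
    """Average over all post-treatment (g, t) cells, weighting each cell by cohort size."""
    total = sum(n_units[g] for g, _ in att)
    return sum(n_units[g] * v for (g, _), v in att.items()) / total

def by_group(att):
    """Average effect for each cohort across its post-treatment periods."""
    out = {}
    for g in sorted({g for g, _ in att}):
        vals = [v for (gg, _), v in att.items() if gg == g]
        out[g] = sum(vals) / len(vals)
    return out

def by_calendar_time(att, n_units):
    """Average effect in each period across cohorts treated by then, weighted by cohort size."""
    out = {}
    for t in sorted({t for _, t in att}):
        cells = [(g, v) for (g, tt), v in att.items() if tt == t]
        total = sum(n_units[g] for g, _ in cells)
        out[t] = sum(n_units[g] * v for g, v in cells) / total
    return out

print(simple_aggregate(att, n_units))   # -5.5
print(by_group(att))                    # {2: -7.0, 3: -4.0}
print(by_calendar_time(att, n_units))
```

The same three cells give −5.5 overall, −7.0 and −4.0 by cohort, and a different profile by calendar period, which is why the choice of aggregation should follow from the question being asked.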

Diagnostics

Pre-Trends Test

A formal joint test of the null H₀: β_k = 0 for all k < 0, reported below as an F-statistic. Under TWFE, the test rejects, but the rejection is an artifact of contamination. Under CS, it fails to reject, which is consistent with parallel trends holding.

TWFE Pre-Trends Test

F-stat = 4.28
p-value = 0.006
REJECTS H₀ at 5%

CS Pre-Trends Test

F-stat = 0.42
p-value = 0.738
FAILS TO REJECT H₀ at 5%

The TWFE rejection is a false alarm driven by contamination, not by actual pre-existing differential trends. This is one of the most important lessons of the modern DiD literature.
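For readers who want to see the mechanics of the joint test, the sketch below computes a Wald statistic W = β′V⁻¹β from a vector of pre-period coefficients and their covariance matrix, and divides by the number of restrictions to obtain an F-type statistic. The coefficients and covariance matrix are hypothetical and do not reproduce the F-statistics reported above. The critical value 7.815 is the standard 95th percentile of a chi-square distribution with 3 degrees of freedom.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting (small systems only)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def wald_statistic(beta_pre, cov_pre):
    """W = beta' V^{-1} beta for the pre-treatment coefficients."""
    v_inv_beta = solve(cov_pre, beta_pre)
    return sum(b * x for b, x in zip(beta_pre, v_inv_beta))

# Hypothetical pre-period coefficients (k = -4, -3, -2) and their covariance matrix.
beta_pre = [0.3, -0.2, 0.1]
cov_pre = [
    [0.25, 0.05, 0.00],
    [0.05, 0.25, 0.05],
    [0.00, 0.05, 0.25],
]
CHI2_CRIT_3DF_5PCT = 7.815  # 95th percentile of chi-square with 3 degrees of freedom

w = wald_statistic(beta_pre, cov_pre)
f_stat = w / len(beta_pre)          # F-type statistic: Wald divided by number of restrictions
reject = w > CHI2_CRIT_3DF_5PCT     # False for these hypothetical inputs
print(round(w, 3), round(f_stat, 3), reject)
```

Using the full covariance matrix matters: adjacent event-study coefficients are typically correlated, so checking each coefficient separately is not equivalent to the joint test.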

Sensitivity Analysis — Rambachan & Roth (2023)

Even if pre-trends tests pass, parallel trends could still be violated. Rambachan & Roth (2023) propose "honest" confidence intervals that allow for deviations from parallel trends. The key parameter M bounds the maximum slope change in the pre-trend violation between consecutive periods. As M increases, the confidence set widens.

M (maximum pre-trend violation slope): 0.00

At what value of M does the result lose significance?

The average treatment effect on the treated loses significance (its confidence interval crosses zero) at approximately M = 0.030. In other words, once we allow the slope of the differential trend to change by as much as 0.03 between consecutive quarters (3 index points per quarter if M is read as a fraction of the base index of 100), we can no longer rule out a zero effect. Since our CS pre-trends are essentially zero, this is a fairly large violation to entertain, so the result is reasonably robust.
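The way M translates into wider intervals can be illustrated with a deliberately crude bound. This is not the Rambachan & Roth procedure, which also accounts for sampling uncertainty in the pre-period estimates. Assume the differential trend is exactly flat before treatment and is normalized to zero at k = −1. If its slope can change by at most M per period, the deviation at post-treatment horizon k is at most M(k+1)(k+2)/2: M at k = 0, 3M at k = 1, 6M at k = 2. The sketch widens a conventional interval by that worst-case bias. The estimate, standard error, and function names are hypothetical.

```python
def max_bias(horizon, m):
    """Worst-case |deviation| at post-treatment horizon k >= 0.

    Assumes a flat differential trend before treatment whose slope
    can change by at most m per period afterwards.
    """
    return m * (horizon + 1) * (horizon + 2) / 2

def crude_interval(estimate, se, horizon, m, z=1.96):
    """Conventional interval widened on both sides by the worst-case bias."""
    slack = z * se + max_bias(horizon, m)
    return estimate - slack, estimate + slack

def breakdown_m(estimate, se, horizon, z=1.96):
    """Smallest m at which the crude interval reaches zero."""
    margin = abs(estimate) - z * se
    return max(margin, 0.0) / ((horizon + 1) * (horizon + 2) / 2)

# Hypothetical event-study coefficient at horizon k = 2; not taken from the lab output.
est, se, k = -4.0, 1.0, 2
m_star = breakdown_m(est, se, k)
low, high = crude_interval(est, se, k, m_star)
print(round(m_star, 2), round(high, 6))
```

Because the bound grows quadratically in the horizon, even a small M can matter for effects estimated several quarters after treatment, which helps explain why breakdown values of M tend to look small.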

Bacon Decomposition

Goodman-Bacon (2021) shows that the TWFE DiD estimator is a weighted average of all possible 2×2 DiD comparisons. Some comparisons are "clean": treated vs. never-treated, and earlier-treated vs. later-treated while the later group is still untreated. Others are "bad": later-treated units compared against earlier-treated units that are already under treatment. The chart below shows the decomposition.

The "bad" comparisons (using already-treated as controls) carry 28% of the total weight but produce attenuated or wrong-sign estimates. This is why the TWFE estimate is biased toward zero relative to the CS estimate.

Why do "bad" comparisons matter?

When an already-treated unit is used as a control, its post-treatment trajectory already includes the treatment effect. If the treatment effect grows over time (dynamic effects), the "control" is itself changing, leading to an underestimate of the true effect. If treatment effects are heterogeneous across cohorts, the bias can go in either direction — including producing negative weights on some group-time ATTs.
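A noise-free toy example makes the attenuation visible. The early cohort's effect grows from −4 to −8, the late cohort's true effect is −5, and all units share the same untreated trend. Every number here is illustrative.

```python
# Noise-free toy outcomes over three periods; all numbers are illustrative.
# Early cohort: treated from period 2, effect grows from -4 (period 2) to -8 (period 3).
# Late cohort: treated in period 3 with a true effect of -5.
untreated_path = {1: 100.0, 2: 101.0, 3: 102.0}  # common trend absent treatment
early = {1: 100.0, 2: 101.0 - 4.0, 3: 102.0 - 8.0}
late = {1: 100.0, 2: 101.0, 3: 102.0 - 5.0}
never = dict(untreated_path)

def did(treated, control, pre, post):
    """2x2 DiD: change in the treated unit minus change in the control unit."""
    return (treated[post] - treated[pre]) - (control[post] - control[pre])

clean = did(late, never, pre=2, post=3)      # late vs. never-treated
forbidden = did(late, early, pre=2, post=3)  # late vs. already-treated
print(clean, forbidden)  # -5.0 -1.0
```

The clean comparison recovers −5. The comparison against the already-treated cohort returns −1, because the "control" itself fell by a further 4 points as its own treatment effect deepened.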

Results & Export

Summary Table

Estimator	ATT	SE	95% CI	Pre-Trends F	Verdict
TWFE	−5.83	1.41	[−8.60, −3.06]	4.28 (p=0.006)	Contaminated
CS	−8.72	1.28	[−11.23, −6.21]	0.42 (p=0.738)	Preferred

CS by group:

Cohort	ATT	SE	95% CI	Note
Poland Q1 2022	−12.34	2.15	[−16.55, −8.13]	Largest effect
Germany Q2 2022	−9.61	1.87	[−13.28, −5.94]	Energy channel
France Q3 2022	−6.28	1.52	[−9.26, −3.30]	Export channel
Italy Q3 2022	−7.15	1.63	[−10.35, −3.95]	Trade channel

Key Takeaway

TWFE understates the magnitude of the sanctions effect by approximately 33% relative to the CS estimate (ATT = −5.83 vs. −8.72), due to contamination from using already-treated units as controls. The CS estimator shows clean pre-trends and a larger, more precisely estimated effect (SE 1.28 vs. 1.41). With staggered treatment and heterogeneous effects, always use a heterogeneity-robust estimator.
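The headline figures can be checked with a few lines of arithmetic. The sketch below recomputes the TWFE shortfall relative to the CS estimate and the CS 95% interval from the point estimates and standard error in the summary table, assuming a normal critical value of 1.96.

```python
# Point estimates and standard error from the summary table.
att_twfe = -5.83
att_cs = -8.72
se_cs = 1.28

# Share of the CS estimate that TWFE fails to recover.
shortfall = (att_cs - att_twfe) / att_cs
print(f"{shortfall:.1%}")  # 33.1%

# Normal-approximation 95% confidence interval for the CS estimate.
ci_cs = (round(att_cs - 1.96 * se_cs, 2), round(att_cs + 1.96 * se_cs, 2))
print(ci_cs)  # (-11.23, -6.21)
```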