The plan looked bulletproof. Three rounds of review with our GPT advisors. Detailed migration mapping. A checklist covering every config section, animation class, and color token. The official upgrade tool handled 73 files automatically.
Then we ran Lighthouse.
The promise
Tailwind CSS v4 ships with some wild benchmarks:
| Metric | v3 | v4 | Improvement |
|---|---|---|---|
| Full builds | ~378ms | ~100ms | 3.5x faster |
| Incremental builds | 44ms | 5ms | 8.8x faster |
| No-change builds | 35ms | 192μs | 182x faster |
The architecture changed fundamentally. Configuration moved from JavaScript to CSS. The tailwind.config.ts file we'd maintained for months got deleted entirely. Everything now lives in src/index.css using @theme, @utility, and @plugin directives.
```css
@import 'tailwindcss';

@plugin 'tailwindcss-animate';
@plugin '@tailwindcss/container-queries';

@custom-variant dark (&:is(.dark *));

@theme {
  --color-background: hsl(var(--background));
  --color-foreground: hsl(var(--foreground));
  /* ... 50 more color tokens */
}
```
The developer experience improved. Vite's hot reload feels snappier. The CSS-first configuration is more predictable than the JavaScript version. Container queries are now built-in. Autoprefixer is bundled.
The reality
Our build times improved modestly: 5.67s down to 5.17s (-9%). Not the 3.5x advertised, but our site has Mermaid diagrams, Monaco editor, and other heavy dependencies that dwarf Tailwind's contribution.
But the CSS bundle grew.
| Metric | v3 | v4 | Change |
|---|---|---|---|
| CSS size | 102KB | 140KB | +37% |
That 38KB increase triggered our CI budget check. We'd set the threshold at 110KB with 15% headroom. v4 blew past it. We bumped the budget to 150KB and merged.
Then came the Lighthouse audit.
| Page | Before | After | Delta |
|---|---|---|---|
| Home | 95 | 79 | -16 |
| Blog | 93 | 73 | -20 |
| Projects | 87 | 76 | -11 |
Twenty points off the blog page. That's not a rounding error.
Why the regression?
The larger CSS bundle hits performance in three places. There's 38KB more to transfer, even with compression. The browser's style engine has to parse more CSS. And since CSS is render-blocking, first paint waits until the entire stylesheet is processed.
Tailwind v4's new architecture generates more complete CSS. It includes utility classes we might use, rather than only those it can statically detect. The tradeoff: convenience at build time, paid for in runtime performance.
The planning process
This upgrade was the first real test of our new AI planning workflow: Claude Code orchestrating GPT experts through Codex MCP for plan validation and specialized analysis.
How the delegation works
We set up a pattern where Claude Code (Anthropic's CLI tool) delegates specific tasks to GPT experts via the Codex MCP server. Each expert has a specialized prompt that shapes its analysis. The Plan Reviewer evaluates plans for completeness and gaps, returning APPROVE/REJECT with specific feedback. The Architect analyzes system design decisions and creates migration mappings. The Scope Analyst catches ambiguities before work starts and surfaces hidden requirements.
What clicked for us was letting each model do its one job well. Claude Code keeps the thread of the conversation. GPT experts dive deep on the narrow questions we throw at them.
Round 1: First rejection
The initial plan documented what needed to change conceptually but lacked specifics. We delegated to the Plan Reviewer:
```text
TASK: Review the Tailwind CSS v4 upgrade plan for completeness.

CONTEXT:
- Plan document at docs/plans/22-tailwind-v4-upgrade.md
- Current config: 130 lines of JavaScript in tailwind.config.ts
- Target: CSS-first configuration with @theme/@utility directives

MUST DO:
- Evaluate clarity, verifiability, completeness, big picture
- Simulate actually doing the work to find gaps
```
The verdict came back: REJECTED.
Missing file-level mapping of tailwind.config.ts to CSS @theme/@utility blocks. Animation plugin class replacement details incomplete. No inventory of which components use tailwindcss-animate utilities.
Fair points. The plan said "replace tailwindcss-animate" but didn't specify which animation classes existed in our codebase or where they were used.
Round 2: Architect analysis
Rather than guessing, we delegated to the Architect expert to build the missing inventory:
```text
TASK: Create detailed migration mapping for Tailwind v4 upgrade.

CONTEXT:
- tailwind.config.ts contains custom colors, container config, keyframes
- tailwindcss-animate plugin is used across shadcn/ui components
- Need file-level mapping of old config → new CSS syntax

MUST DO:
- Audit tailwind.config.ts line by line
- Find all tailwindcss-animate class usages
- Map each config section to equivalent @theme/@utility syntax
```
The Architect found 18 components using animation utilities -- Accordion, Alert, AlertDialog, Carousel, Collapsible, among others. Each was mapped to specific classes like animate-accordion-down, animate-in, animate-out, fade-in, slide-in-from-top.
This created a concrete checklist. We knew exactly which animations needed to survive the migration.
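To make the migration concrete, here is a minimal sketch of what one inventory entry looks like after conversion, using v4's `--animate-*` theme namespace. The class name comes from our inventory; the duration and easing values are illustrative, not our actual settings:

```css
/* Sketch: migrating one tailwindcss-animate class to a v4 theme token. */
@theme {
  /* This token makes v4 generate the `animate-accordion-down` utility. */
  --animate-accordion-down: accordion-down 0.2s ease-out;

  /* Keyframes can live inside @theme in v4, next to the token that uses them. */
  @keyframes accordion-down {
    from {
      height: 0;
    }
    to {
      height: var(--radix-accordion-content-height);
    }
  }
}
```

Multiply this by 18 components and you can see why the inventory mattered: every class that isn't mapped simply stops animating, with no build error.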
We updated the plan document with:
- File structure showing how each config section maps to CSS
- Animation class inventory with component locations
- Color token mapping from JS to @theme syntax
Ran it through Plan Reviewer again. REJECTED.
Color token strategy conflicts: the plan mentions both "hsl in variable" and "wrap with hsl() in @theme" without clarifying which approach wins. tw-animate-css integration unclear: some sections say @plugin, others say @import.
Round 3: Hard decisions
The second rejection exposed actual ambiguity. We'd documented options without picking one. That's fine for exploration, dangerous for execution.
We made decisions. Everything goes in a single entry file, src/index.css, with no separate config files. For HSL handling, we'd keep raw HSL values in :root (for shadcn/ui compatibility) and wrap with hsl() in @theme (for Tailwind consumption). For the animation plugin, we'd use @plugin 'tailwindcss-animate' consistently rather than @import.
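The two-layer HSL decision can be sketched like this. The token names match the earlier config snippet; the white background value is illustrative:

```css
/* Layer 1: raw HSL triples stay in :root, where shadcn/ui expects them. */
:root {
  --background: 0 0% 100%; /* illustrative value, not our actual palette */
}

/* Layer 2: @theme wraps each triple in hsl() so Tailwind utilities
   like bg-background resolve to a real color. */
@theme {
  --color-background: hsl(var(--background));
}
```

The point of keeping both layers is that neither consumer has to change: shadcn/ui components keep reading the raw triples, while Tailwind reads the wrapped tokens.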
Updated the plan to reflect these choices unambiguously. Third review: APPROVED.
Why this matters
The three-round review process took maybe 30 minutes. What did it catch?
The animation inventory was the biggest catch. The official upgrade tool missed three animations: animate-collapsible-down, animate-collapsible-up, and animate-caret-blink. Without the inventory, we'd have found these broken one at a time in production. The collapsible animation powers the mobile nav menu. The caret blink is used in the CLI playground.
The reviewers also caught an HSL ambiguity -- two valid approaches exist for color tokens in v4, and picking the wrong one would have required re-migrating 50+ color references. And they flagged the @plugin vs @import distinction for animation libraries, where getting it wrong means animations silently fail.
Each of these would have cost debugging time. Catching them in planning cost nothing but a few prompts.
The pattern
Yeah, I know "AI review pipeline" sounds like something from a LinkedIn post. But it works. The workflow:
- Write initial plan (Claude Code or human)
- Validate with Plan Reviewer (GPT expert)
- Address gaps with Architect analysis (GPT expert)
- Re-validate until approved
- Execute with confidence
The experts don't write code. They just poke holes in your plan until you've actually thought it through.
The small bugs
A few things broke that weren't on any checklist.
The dark mode toggle stopped showing a pointer cursor on hover. Tailwind v4 changed some default behaviors. Fixed with a cursor-pointer class we hadn't needed before.
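The per-element `cursor-pointer` fix works, but v4's upgrade guide also suggests a global alternative. This is a hedged sketch of that base-layer rule, not what we shipped:

```css
/* v4 changed the default button cursor from pointer to default.
   This base-layer rule restores the old behavior for all enabled
   buttons, instead of adding cursor-pointer element by element. */
@layer base {
  button:not(:disabled),
  [role='button']:not(:disabled) {
    cursor: pointer;
  }
}
```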
The CSS budget check failed in CI. We'd set it at 110KB with 15% headroom, and a framework upgrade adding 37% to the CSS size wasn't something we'd planned for. Bumped the limit, added a comment explaining the v4 tradeoff.
An npm version mismatch after stashing changes left node_modules on v4 while package-lock.json still referenced v3. Nuking node_modules and reinstalling resolved it.
The decision
We had three options: revert to v3 and preserve performance scores, spend time purging unused CSS with custom build configs, or ship v4 and document the tradeoff.
We chose the third.
Here's the reasoning: Lighthouse scores measure synthetic performance under throttled conditions. Real users on modern networks and devices won't notice 38KB. The build-time improvements compound across every code change. The CSS-first configuration will save debugging time for months.
And honestly? A 79 sits just below Lighthouse's green threshold of 90, in the "needs improvement" band rather than the red zone. We're trading perfect green numbers for a better development experience.
What we'd do differently
Set realistic expectations. The 182x improvement benchmarks are for no-change incremental builds in isolation. Real projects have other bottlenecks.
Test bundle size early. We should have built v4 in isolation before merging and compared output sizes. The CI check caught it, but we'd have had more options if we'd known earlier.
Budget for regressions. Framework upgrades aren't free. Even "drop-in" upgrades can have measurable performance costs. Plan for investigation time.
The takeaway
Performance optimization isn't about hitting numbers. It's about making informed tradeoffs.
We traded ~16 Lighthouse points for faster builds, simpler configuration, and a modernized CSS architecture. That math works for a personal portfolio site. It might not work for an e-commerce checkout page where every millisecond matters.
The key is measuring before and after, understanding what changed, and choosing based on the data rather than assuming "newer is better."
We're shipping v4. We know the cost. We documented it here so future-me remembers why the Lighthouse scores look different than they did last week.
The Analytics page at dylanbochman.com/projects/analytics tracks Lighthouse metrics over time. You can see the v4 regression in the January 21 data point.