Why Lab Scores Lie While Users Bounce

Your Lighthouse run just finished. The score is 97. The team celebrates, the Slack message goes out, and someone adds it to the quarterly review deck. Then you pull up your RUM dashboard and see that p75 LCP on mobile is 4.2 seconds. One in four real visitors is waiting over four seconds to see the main content of your page. They are not bouncing because your code is bad. They are bouncing because the metric you optimized for has almost nothing to do with what they actually experience.

This is the lab-field gap, and it is one of the most consistently underappreciated problems in frontend performance engineering.

What the Lab Actually Measures

Lighthouse and similar lab-based tools run your page in a controlled environment: a single network throttle profile, a single device emulation, zero browser cache, no extensions, no concurrent tabs, no user-specific factors. They use a deterministic set of conditions designed to produce reproducible scores. That reproducibility is the whole point — you need a stable baseline to catch regressions in CI.

The problem is that reproducibility requires stripping away the exact variables that govern real-user experience. Your users are not an emulated Moto G4 on a throttled 3G connection running a clean Chrome profile. They are on mid-range Android devices with a dozen apps consuming RAM. They are on commuter Wi-Fi that fluctuates between fast and unusable. They have your CDN-cached assets from three days ago mixed with uncached dynamic content. Some of them, mostly on desktop, have installed browser extensions that inject scripts into your page.

None of these variables appear in a Lighthouse score.

How LCP Diverges Between Lab and Field

Largest Contentful Paint is the Core Web Vital most sensitive to the lab-field gap. Consider what actually drives LCP in the field:

Network variability. A throttled lab test uses a fixed RTT and bandwidth. Real 4G connections vary from 20ms to 200ms RTT within a single session, and throughput drops as signal degrades. Your hero image that loads in 1.1 seconds at lab-throttled "Fast 3G" might take 2.8 seconds at real-world 4G with a congested carrier.
Device CPU. Mobile CPU throttling in Lighthouse uses a fixed multiplier against your test machine's performance. Real devices have thermal throttling — a mid-range Android phone that has been running a video for ten minutes may be 40-60% slower at JavaScript execution than it was cold.
Cache state. Lighthouse runs cold by default. Your actual returning visitors have cached sub-resources, which usually makes their LCP significantly faster. Your first-time visitors on uncached critical resources face the full load path. A single p75 that aggregates the two is not an average of them. It is one cut point through a blend of two fundamentally different experiences, and it moves whenever the mix of new and returning visitors moves.
Third-party scripts. Lab runs typically don't pick up the full third-party script load because consent platforms, tag managers, and analytics may behave differently without cookies or real user context. In the field, these scripts execute in full and compete for the main thread during LCP candidate rendering.

The Percentile Problem in Lab Reporting

Beyond all the environmental differences, there is a structural issue in how lab scores get interpreted. A single Lighthouse run produces one data point. Teams report it as a pass/fail threshold: "we are above 90." But performance is a distribution, not a single number. Even if you ran Lighthouse a hundred times, you would get a distribution of scores — and that distribution still would not represent your user population because the sample is still drawn from the same controlled environment.

Real-user data gives you the actual distribution: p50 (median), p75, p95, sometimes p99 for SLA-sensitive applications. The Core Web Vitals guidance itself uses the 75th percentile for the "good" classification: a page is considered good at LCP if at least 75% of its real-user loads complete within 2.5 seconds. Not the median. Not the average. The 75th percentile.
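To make the difference between the median and p75 concrete, here is a minimal sketch that summarizes a handful of LCP samples. The sample values are invented for illustration, the function names are hypothetical, and the nearest-rank percentile used here is only one of several common definitions, so a RUM vendor's numbers may differ slightly.

```typescript
type LcpRating = "good" | "needs improvement" | "poor";

// Nearest-rank percentile: the smallest sample with at least p% of all
// samples at or below it. Other definitions interpolate between samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("cannot take a percentile of zero samples");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// LCP thresholds: "good" up to 2.5s, "poor" above 4s.
function rateLcp(valueMs: number): LcpRating {
  if (valueMs <= 2500) return "good";
  if (valueMs <= 4000) return "needs improvement";
  return "poor";
}

// Hypothetical LCP samples in milliseconds (illustrative, not real data).
const lcpSamplesMs = [
  900, 1100, 1200, 1300, 1500, 1800, 2100, 2600, 3400, 4200, 5100, 6800,
];

const p50 = percentile(lcpSamplesMs, 50); // 1800
const p75 = percentile(lcpSamplesMs, 75); // 3400
const p95 = percentile(lcpSamplesMs, 95); // 6800

const medianRating = rateLcp(p50); // "good"
const p75Rating = rateLcp(p75); // "needs improvement"
```

In this made-up sample the median would pass comfortably while p75 would not, which is exactly the case a median-only report hides.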

A team that sees a 97 Lighthouse score and stops there is optimizing for a single sample from a controlled environment. A team that watches p75 in the field is protecting the experience of users who actually have trouble with their page.

A Concrete Example of the Gap

Consider a content publishing platform — a news or editorial site serving a mixed desktop-and-mobile audience globally. Their engineering team has automated Lighthouse CI runs on every PR merge to main. Scores consistently come back 92-96. Their deployment pipeline passes performance gates. The engineering culture believes performance is in good shape.

When they instrument with field data collection, they see a different picture: p75 LCP for mobile users in Southeast Asia is 5.3 seconds. Their hero images are large JPEGs served from a US-east CDN edge. That edge responds in under 100ms to the lab's US-based test environment. For users in Manila or Jakarta, reaching the nearest PoP that serves those images adds 180-220ms of RTT, and the image sizes mean even modest throughput constraints push total image load time past 2 seconds before any JavaScript has executed.
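A back-of-the-envelope model shows how the extra RTT and lower throughput compound. This is a rough sketch under stated assumptions, not a measurement: it assumes a fresh HTTPS connection costing three round trips before the first byte (TCP handshake, TLS 1.3 handshake, HTTP request), a steady throughput, and it ignores TCP slow start, which would make the distant case worse. The function name, image size, RTTs, and throughput figures are all hypothetical.

```typescript
interface FetchEstimateInput {
  rttMs: number; // round-trip time to the serving edge
  setupRoundTrips: number; // round trips before the first response byte
  payloadBytes: number; // size of the hero image
  throughputBitsPerSec: number; // assumed steady throughput
}

// Connection setup cost plus transfer time, in milliseconds.
function estimateFetchMs(input: FetchEstimateInput): number {
  const setupMs = input.rttMs * input.setupRoundTrips;
  const transferMs =
    (input.payloadBytes * 8 * 1000) / input.throughputBitsPerSec;
  return Math.round(setupMs + transferMs);
}

// Hypothetical 500 KB hero image fetched near the edge (lab-like conditions).
const nearEdgeMs = estimateFetchMs({
  rttMs: 50,
  setupRoundTrips: 3,
  payloadBytes: 500_000,
  throughputBitsPerSec: 10_000_000,
}); // 550

// Same image with 200ms of added RTT and a quarter of the throughput.
const farFromEdgeMs = estimateFetchMs({
  rttMs: 250,
  setupRoundTrips: 3,
  payloadBytes: 500_000,
  throughputBitsPerSec: 2_500_000,
}); // 2350
```

Even this optimistic model puts the distant fetch above two seconds for the image alone, before the HTML, CSS, and any render-blocking work are counted.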

The lab score reflected US infrastructure performance. The field data reflected where real users actually were. Neither was "wrong" — but only one of them predicted actual bounce behavior.

When Lab Scores Are Still Useful

We are not saying Lighthouse is useless. The lab environment is exactly right for one job: catching regressions you introduced. If your Lighthouse score drops from 94 to 71 after a PR, that is a valid signal that something changed. CI-gated lab scores are a reliable regression fence because they test the same conditions before and after a code change.
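A regression fence of this kind can be sketched in a few lines. This is not how Lighthouse CI implements its assertions. It is a hypothetical gate that assumes you already have several performance scores (on the 0-100 scale) for the base branch and for the PR, takes the median of each to damp run-to-run noise, and fails when the drop exceeds a tolerance you choose. The names `checkRegression` and `maxDrop` are made up for the example.

```typescript
interface GateResult {
  baseline: number;
  candidate: number;
  drop: number;
  passed: boolean;
}

// Median of a non-empty list of scores.
function median(values: number[]): number {
  if (values.length === 0) {
    throw new Error("cannot take the median of zero values");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compare median scores and fail if the candidate dropped by more than maxDrop.
function checkRegression(
  baselineScores: number[],
  candidateScores: number[],
  maxDrop: number
): GateResult {
  const baseline = median(baselineScores);
  const candidate = median(candidateScores);
  const drop = baseline - candidate;
  return { baseline, candidate, drop, passed: drop <= maxDrop };
}

// Hypothetical scores from three runs on main and three runs on each PR.
const regressed = checkRegression([94, 93, 95], [71, 74, 70], 5);
// { baseline: 94, candidate: 71, drop: 23, passed: false }

const stable = checkRegression([94, 93, 95], [92, 95, 93], 5);
// { baseline: 94, candidate: 93, drop: 1, passed: true }
```

The gate compares like with like, which is why it works as a fence. It says nothing about whether 94 was a good experience for anyone in the field.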

The mistake is treating a high lab score as evidence of good real-user performance. It is evidence of good performance under those specific controlled conditions. That evidence has narrow validity. It does not substitute for watching what your actual users experience.

The two metrics answer different questions. Lab scores answer: "Did this deploy make things worse in our controlled baseline?" Field p75 answers: "Are the people using our product right now experiencing acceptable load times?" Use both. Trust the field data when they conflict.

What to Watch Instead

The shift from lab-first to field-first performance monitoring is fundamentally a question of what signal drives your optimization priority queue.

Field data should segment LCP by, at minimum, device category (mobile vs. desktop), connection type (4G, Wi-Fi, 3G), and geography. A single unsegmented p75 across all users is better than a lab score but still hides important patterns. An unsegmented p75 of 2.9s might look close enough to the 2.5s target, until you split it and discover your mobile-on-4G users are at 4.8s, your desktop-on-Wi-Fi users are at 1.4s, and 68% of your sessions are mobile-on-4G.
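The split can be sketched as a group-by followed by a per-group percentile. The record shape, field names, and sample values below are assumptions for illustration. The twelve samples are invented so that the output echoes the figures above (with a two-thirds mobile share rather than 68%), and the percentile is the nearest-rank variant.

```typescript
interface LcpSample {
  lcpMs: number;
  device: string; // e.g. "mobile" or "desktop"
  connection: string; // e.g. "4g" or "wifi"
  country: string;
}

interface SegmentSummary {
  count: number;
  share: number; // fraction of all samples in this segment
  p75Ms: number;
}

// Nearest-rank percentile over a non-empty list.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("cannot take a percentile of zero samples");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Group samples by an arbitrary key and compute p75 for each group.
function p75BySegment(
  samples: LcpSample[],
  keyOf: (sample: LcpSample) => string
): Record<string, SegmentSummary> {
  const groups: Record<string, number[]> = {};
  for (const sample of samples) {
    const key = keyOf(sample);
    if (!groups[key]) groups[key] = [];
    groups[key].push(sample.lcpMs);
  }
  const result: Record<string, SegmentSummary> = {};
  for (const key of Object.keys(groups)) {
    const values = groups[key];
    result[key] = {
      count: values.length,
      share: values.length / samples.length,
      p75Ms: percentile(values, 75),
    };
  }
  return result;
}

// Hypothetical field samples (illustrative, not real data).
const mobileSamples = [1800, 2200, 2500, 2700, 2900, 4800, 5200, 6000].map(
  (ms): LcpSample => ({ lcpMs: ms, device: "mobile", connection: "4g", country: "ID" })
);
const desktopSamples = [900, 1100, 1400, 1700].map(
  (ms): LcpSample => ({ lcpMs: ms, device: "desktop", connection: "wifi", country: "US" })
);
const allSamples = mobileSamples.concat(desktopSamples);

const overallP75 = percentile(allSamples.map((s) => s.lcpMs), 75); // 2900
const bySegment = p75BySegment(allSamples, (s) => s.device + "/" + s.connection);
// bySegment["mobile/4g"].p75Ms === 4800, bySegment["desktop/wifi"].p75Ms === 1400
```

Passing the key as a function keeps the same helper usable for geography or any combination of dimensions, for example `(s) => s.country + "/" + s.device`.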

Deploy correlation matters too. Field p75 needs to be tracked against deploy timestamps so that regressions are attributed to the code change that caused them, not discovered days later when the team has moved on. A performance regression that goes uncorrelated for 48 hours is hard to attribute without bisecting every deploy that shipped in the meantime.
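One simple way to do this correlation is to attribute each field sample to the deploy that was live when it was recorded, then compare p75 from one deploy to the next. The sketch below assumes timestamped samples and a list of deploy times. All names, timestamps, and values are hypothetical. Attribution by timestamp is itself an approximation, since clients holding cached assets may still be running an older build. Tagging each beacon with a build identifier would be more precise.

```typescript
interface Deploy {
  id: string;
  deployedAt: number; // epoch milliseconds
}

interface TimedSample {
  timestamp: number; // epoch milliseconds
  lcpMs: number;
}

interface DeploySummary {
  id: string;
  count: number;
  p75Ms: number | null; // null when no samples landed in this window
}

// Nearest-rank percentile over a non-empty list.
function percentile(samples: number[], p: number): number {
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Attribute each sample to the latest deploy at or before its timestamp.
function p75ByDeploy(deploys: Deploy[], samples: TimedSample[]): DeploySummary[] {
  const ordered = deploys.slice().sort((a, b) => a.deployedAt - b.deployedAt);
  const buckets: number[][] = ordered.map(() => []);
  for (const sample of samples) {
    let live = -1;
    for (let i = 0; i < ordered.length; i++) {
      if (ordered[i].deployedAt <= sample.timestamp) live = i;
    }
    if (live >= 0) buckets[live].push(sample.lcpMs); // skip pre-history samples
  }
  return ordered.map((deploy, i) => ({
    id: deploy.id,
    count: buckets[i].length,
    p75Ms: buckets[i].length > 0 ? percentile(buckets[i], 75) : null,
  }));
}

// Flag deploys whose p75 rose by more than thresholdMs over the previous one.
function findRegressions(summaries: DeploySummary[], thresholdMs: number): string[] {
  const flagged: string[] = [];
  let previous: number | null = null;
  for (const summary of summaries) {
    if (summary.p75Ms === null) continue;
    if (previous !== null && summary.p75Ms - previous > thresholdMs) {
      flagged.push(summary.id);
    }
    previous = summary.p75Ms;
  }
  return flagged;
}

// Hypothetical deploys and samples (small timestamps for readability).
const deploys: Deploy[] = [
  { id: "deploy-a", deployedAt: 0 },
  { id: "deploy-b", deployedAt: 1000 },
  { id: "deploy-c", deployedAt: 2000 },
];
const timedSamples: TimedSample[] = [
  { timestamp: 100, lcpMs: 2000 }, { timestamp: 300, lcpMs: 2200 },
  { timestamp: 500, lcpMs: 2400 }, { timestamp: 700, lcpMs: 2600 },
  { timestamp: 1100, lcpMs: 2100 }, { timestamp: 1300, lcpMs: 2300 },
  { timestamp: 1500, lcpMs: 2500 }, { timestamp: 1700, lcpMs: 2700 },
  { timestamp: 2100, lcpMs: 2800 }, { timestamp: 2300, lcpMs: 3300 },
  { timestamp: 2500, lcpMs: 3900 }, { timestamp: 2700, lcpMs: 4400 },
];

const summaries = p75ByDeploy(deploys, timedSamples);
// p75 per deploy: deploy-a 2400, deploy-b 2500, deploy-c 3900
const regressions = findRegressions(summaries, 500); // ["deploy-c"]
```

In practice each window needs enough samples for p75 to be stable, so a real check would probably also require a minimum count before flagging anything.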

The transition from "Lighthouse passed CI" to "field p75 within budget across all primary segments" is the transition from optimizing for the controlled test environment to optimizing for the people actually using your product. Both require discipline. Only one of them predicts bounce.