Overnight

At 3am Pacific, my production site is serving more requests than at any point during the business day. Not to users. To bots.

Googlebot is crawling 21,000 pages. Bingbot is crawling 10,000. My comprehensive nightcheck is grinding through 15,000 blog posts and company pages. A warm pass is priming the Cloudflare edge cache for the next day’s traffic. Together, the overnight processes touch more pages than all human visitors combined.

The site I build during the day is not the site that matters most. The site that matters most is the one the crawlers see at 3am.

The Crawl Census

Every day I run a crawl census that counts what the bots saw in the last 24 hours. The census uses Cloudflare’s analytics API filtered by user agent. The numbers tell a story about what search engines value:

Google:     21,463  (67%)
Bing:       10,620  (33%)
Combined:   32,083

Jobs:       16,111  (50% of all crawls)
Blog:       298
Locale:     1,233
Programmatic: 257
Companies:  14
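The buckets are nothing more than URL path prefixes. A minimal sketch of the tally, with hypothetical prefixes (the real crawl_census.py may slice paths differently):

# Hypothetical sketch of the census bucketing; the prefixes are
# assumptions, not the actual crawl_census.py categories.
from collections import Counter

PREFIXES = [
    ("/jobs/", "Jobs"),
    ("/blog/", "Blog"),
    ("/companies/", "Companies"),
]

def categorize(path: str) -> str:
    """Map a crawled URL path to a census category by prefix."""
    for prefix, category in PREFIXES:
        if path.startswith(prefix):
            return category
    return "Other"

def tally(rows: list[dict]) -> Counter:
    """rows: [{'path': '/jobs/123', 'requests': 42}, ...] -> counts by category."""
    totals: Counter = Counter()
    for row in rows:
        totals[categorize(row["path"])] += row["requests"]
    return totals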

Jobs consume half the crawl budget. Googlebot crawls 10,654 job pages per day. The job sitemap has no cap. Every eligible job listing is included. The crawl budget allocation tells me what Google considers the highest-value content on the site.

Blog posts get 298 crawls per day despite being the highest-quality content. The ratio of crawl investment (jobs 50x more than blog) does not match the content investment (blog requires 100x more effort per page than jobs). Search engines crawl what they can index at scale, not what took the most effort to produce.

Companies get 14 crawls per day despite having 7,000+ pages in the sitemap. This is a crawl budget starvation problem: the job pages consume so much budget that company pages barely get discovered. The overnight data revealed this problem. Without the census, I would not have known that 7,000 company pages are essentially invisible to crawlers.

What 410 Tells You

The census tracks HTTP status codes. The most interesting status is 410: Gone.

Google 410s:  7,614  (35.5% of crawls)
Bing 410s:    4,494  (42.3% of crawls)

Over a third of all crawler requests hit expired job pages that return 410. These are jobs that existed when the crawler first discovered them, were indexed, and have since been removed. The 410 tells the crawler “this page existed but is permanently gone, stop requesting it.”
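Serving that signal is a single branch in the job route. A minimal sketch, assuming Flask and an in-memory stand-in for the job store (the article does not show the site's actual stack):

# Minimal Flask sketch of the 404/410 split. The store and route are
# hypothetical illustrations, not the site's real code.
from flask import Flask, abort

app = Flask(__name__)

# Stand-in for the job database; expired=True marks removed listings.
JOBS = {
    101: {"title": "Data Engineer", "expired": False},
    102: {"title": "Filled Role", "expired": True},
}

@app.route("/jobs/<int:job_id>")
def job_page(job_id: int):
    job = JOBS.get(job_id)
    if job is None:
        abort(404)  # never existed: plain Not Found
    if job["expired"]:
        abort(410)  # existed, permanently gone: tells the crawler to stop
    return job["title"]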

The 410 rate is declining. Last week it was 8,858 for Google. This week it is 7,614. The crawlers are learning. Each day, the number of ghost requests drops as the search engines update their indexes. But the learning is slow. Bing’s 410 rate (42.3%) is higher than Google’s (35.5%) because Bing is slower to process removal signals.

The 410 trend is the clearest overnight signal. It tells me the rate at which the search engines are converging on the current state of the site. A rising 410 rate means I am removing content faster than crawlers can adapt. A falling 410 rate means the index is catching up. Equilibrium is zero 410s, which means every page the crawler requests still exists.

The 524 Problem

Cloudflare returns 524 when the origin server does not respond within the timeout window. On a heavy deploy day (87 commits), the census showed:

Google 524s:  68  (0.3%)
Bing 524s:    0

Sixty-eight origin timeouts in 24 hours. Each one means Googlebot requested a page, Cloudflare forwarded the request to Railway, and Railway did not respond in time. The most likely cause was Railway worker restarts during frequent deploys. Each deploy restarts the application, creating a brief window where requests time out.

For 0.3% of crawls, Google saw a broken site. The 524 errors did not appear in any application log because the application was not running when they occurred. The error existed only in the space between Cloudflare and Railway, visible only through the crawl census.
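Since these errors never reach application logs, the only place to catch them is the edge data itself. A small sketch filtering census rows for Cloudflare's origin-error codes (the row shape here is an assumption):

# Sketch: surface edge-only failures from census rows. Row shape is an
# assumption based on what the census aggregates (status code + count).
CF_ORIGIN_ERRORS = {520, 521, 522, 523, 524}  # Cloudflare origin-side codes

def edge_failures(rows: list[dict]) -> list[dict]:
    """rows: [{'status': 524, 'requests': 68}, ...] -> origin-error rows only."""
    return [r for r in rows if r["status"] in CF_ORIGIN_ERRORS]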

The next morning, the 524 count dropped to zero. The deploys had stopped. The workers were stable. The overnight data confirmed that the problem was transient deploy churn, not a structural issue.

The Warm Pass

Before the crawlers arrive, the warm pass runs. It fetches every blog post, every locale variant, and 50 company pages through Cloudflare’s edge cache. The goal is to ensure that when Googlebot hits a page, it gets a cached response instead of waiting for an origin render.

The difference matters. A cached blog post returns in 80ms. An uncached blog post takes 1.5 seconds from origin. Googlebot has a crawl rate budget measured in requests per second. Faster responses mean more pages crawled in the same window. A warm cache doubles the effective crawl coverage.
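The mechanics are plain HTTP. A minimal sketch of a warm pass, checking Cloudflare's cf-cache-status response header to confirm each page is primed (the URLs here are placeholders):

# Hedged sketch of a warm pass: request each URL through the edge and log
# Cloudflare's cf-cache-status header. URLs and timeout are illustrative.
import requests

URLS = [
    "https://example.com/blog/overnight",
    "https://example.com/blog/the-handoff-document",
]

def warm(urls: list[str]) -> None:
    with requests.Session() as session:
        for url in urls:
            resp = session.get(url, timeout=30)
            # HIT: served from the edge cache. MISS: this request primed it
            # from origin, so the next crawler hit should be a HIT.
            cache = resp.headers.get("cf-cache-status", "UNKNOWN")
            print(f"{resp.status_code} {cache:<8} {url}")

if __name__ == "__main__":
    warm(URLS)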

The warm pass is invisible to users. No human visitor benefits from a blog post cached at 2am. But the warm pass determines whether Googlebot discovers 300 blog posts or 600 in its overnight window. The SEO impact is real even though no human sees the mechanism.

What the Night Reveals

Every morning I read the overnight logs. The pattern is the same: mostly green, a few anomalies, one or two things worth investigating. The rhythm is boring. The value is in the boring rhythm.

A boring overnight means the deploys did not break anything, the crawlers found what they expected, the cache is working, and the site is ready for the next day’s traffic. An interesting overnight means something changed: a new error pattern, a cache rule that stopped working, a crawl budget shift that indicates a ranking signal change.

The crawl census showed me that 7,000 company pages are essentially invisible to Google. No daytime metric would have revealed this. User analytics show zero company page traffic, which I attributed to low demand. The census showed 14 company page crawls per day across 7,000+ pages, which means Google has not even evaluated most of them. The problem is not demand. The problem is discovery.

The 524 analysis showed me that Railway deploys create origin timeout windows that Googlebot hits. No application monitoring would have revealed this because the application is not running during the timeout. The problem exists in the infrastructure gap between deployment and availability.

The 410 trend showed me that Bing lags Google in processing removal signals: its 410 rate runs roughly 20% higher (42.3% vs 35.5%). This matters for SEO: expired job pages remain in Bing's index longer, potentially serving stale results to users who search on Bing-powered surfaces (DuckDuckGo, Yahoo).

Each of these insights came from the overnight data. The daytime tells you what users do. The night tells you what the infrastructure does when you are not watching. Both matter. The night matters more for SEO.


FAQ

How do you run the crawl census?

The census uses Cloudflare’s GraphQL analytics API (httpRequestsAdaptiveGroups) filtered by user agent patterns (%Googlebot% and %bingbot%). It categorizes pages by URL path prefix and aggregates status codes. The script runs in 30 seconds and produces a side-by-side comparison of Google and Bing crawl behavior.
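For reference, a hedged sketch of that query. The dataset name and user-agent patterns come from the census described above; the exact field names follow Cloudflare's published GraphQL schema as best I recall it, so verify them against the current schema before relying on this:

# Sketch of the census query against Cloudflare's GraphQL analytics API.
# Field names (clientRequestPath, edgeResponseStatus, userAgent_like) are
# my recollection of the schema, not confirmed by the article: verify them.
import os
import requests

QUERY = """
query Census($zone: String!, $start: Time!, $end: Time!, $ua: String!) {
  viewer {
    zones(filter: { zoneTag: $zone }) {
      httpRequestsAdaptiveGroups(
        limit: 10000
        filter: { datetime_geq: $start, datetime_lt: $end, userAgent_like: $ua }
      ) {
        count
        dimensions { clientRequestPath edgeResponseStatus }
      }
    }
  }
}
"""

def fetch_crawls(ua_pattern: str, start: str, end: str) -> list[dict]:
    """ua_pattern: '%Googlebot%' or '%bingbot%'; start/end: ISO 8601 timestamps."""
    resp = requests.post(
        "https://api.cloudflare.com/client/v4/graphql",
        headers={"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"},
        json={"query": QUERY, "variables": {
            "zone": os.environ["CF_ZONE_TAG"],
            "start": start, "end": end, "ua": ua_pattern,
        }},
        timeout=30,
    )
    resp.raise_for_status()
    zone = resp.json()["data"]["viewer"]["zones"][0]
    return zone["httpRequestsAdaptiveGroups"]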

Why not use Google Search Console for crawl data?

Google Search Console reports crawl statistics with a 2-3 day delay and limited granularity. The Cloudflare census is real-time (last 24 hours) and includes status codes, content categories, and cache status that GSC does not report. GSC is useful for trends. The census is useful for operational decisions.

Does the warm pass increase Cloudflare costs?

No. Cloudflare caches are populated by any request, regardless of source. The warm pass uses standard HTTP requests that count against the normal request quota. On the free plan, there is no request limit for cached responses. The origin requests during the warm pass count against Railway’s bandwidth, but at 15,000 pages averaging 50KB each, the total is approximately 750MB per warm pass.

What if crawlers change their behavior?

The census captures whatever the crawlers do, regardless of changes. A shift in crawl pattern (more job pages, fewer blog pages) appears immediately in the next census. The trend data across days reveals whether the shift is a one-time anomaly or a sustained change.


Sources

This article draws on daily crawl census data collected via Cloudflare GraphQL API since March 2026. Census tool: ~/Projects/Utility/crawl_census.py. Nightcheck tool: ~/.claude/skills/nightcheck/.
