A Google Search Console export revealed a problem: over 400 URLs were being crawled that no longer exist. Old blog posts, legacy Swedish pages, dashboard routes that were never meant to be public, and some truly bizarre malformed paths like `/$` and `/&`.
The fix involves three layers: redirects for pages with logical successors, explicit "gone" signals for dead content, and crawler directives to prevent future indexing attempts.
The Problem: Crawl Budget Waste
Two CSV exports from Search Console told the story. The first file contained 330 rows, mostly from the jobs subdomain with Swedish municipality pages. The second file had 89 rows from the main site covering old blog posts, services, and authentication pages.
The key insight: 404 and 410 are not the same to Google.
A 404 Not Found says "This page doesn't exist right now." Google keeps checking back periodically, thinking it might return. A 410 Gone says "This page is permanently removed and won't come back." Google de-indexes faster and stops wasting crawl budget.
For content that's truly dead, 410 is the correct signal.
Implementation: Three Layers
Layer 1: Redirects in the Next.js Configuration
For pages with logical equivalents, permanent 301 redirects preserve link equity and guide users to the right place. The redirect configuration in the Next.js config file uses an async redirects function that returns an array of redirect rules.
The first category handles the canonical domain redirect from www to non-www. Any path on the www subdomain gets permanently redirected to the same path on the main domain.
The second category maps old service pages to the homepage. URLs like `/mvp-services`, `/full-stack-solutions`, and the Swedish version `/fullstack-losningar` all redirect permanently to the root path.
The third category handles old blog post migrations to the knowledge base. Specific URLs like `/blog/top-saas-tools-every-solopreneur-needs-to-scale` redirect to their knowledge base equivalents such as `/knowledge-base/top-saas-tools-freelancers-scale-operations`.
The fourth category uses path patterns to handle localized blog paths. Both the English and Swedish blog URL patterns, starting with `/en/blog/` or `/sv/blog/`, redirect to the knowledge base using a slug wildcard to preserve the article name.
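A condensed sketch of how that configuration might look, assuming a recent Next.js version with a TypeScript config. The domain and exact slugs are placeholders drawn from the examples above, not the production list:

```typescript
// next.config.ts — sketch of the four redirect categories (illustrative values)
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  async redirects() {
    return [
      // 1. Canonical domain: www → non-www, preserving the path.
      {
        source: "/:path*",
        has: [{ type: "host", value: "www.example.com" }],
        destination: "https://example.com/:path*",
        permanent: true, // Next.js emits a 308, which search engines treat like a 301
      },
      // 2. Old service pages → homepage.
      { source: "/mvp-services", destination: "/", permanent: true },
      { source: "/full-stack-solutions", destination: "/", permanent: true },
      { source: "/fullstack-losningar", destination: "/", permanent: true },
      // 3. Specific old blog posts → knowledge base equivalents.
      {
        source: "/blog/top-saas-tools-every-solopreneur-needs-to-scale",
        destination: "/knowledge-base/top-saas-tools-freelancers-scale-operations",
        permanent: true,
      },
      // 4. Localized blog paths → knowledge base, keeping the article slug.
      { source: "/en/blog/:slug", destination: "/knowledge-base/:slug", permanent: true },
      { source: "/sv/blog/:slug", destination: "/knowledge-base/:slug", permanent: true },
    ];
  },
};

export default nextConfig;
```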
Layer 2: Middleware for 410 Gone
For URLs that should never exist again, middleware intercepts requests before they hit the file system. The middleware file imports NextResponse and the NextRequest type, then defines an array of regex patterns for URLs that should return 410 Gone status.
The gone patterns array covers several categories. The first patterns match malformed garbage URLs with literal dollar sign and ampersand characters in the path. The next pattern catches any remaining blog paths that weren't handled by specific 301 redirects. Additional patterns cover old localized paths starting with /en/ or /sv/ except for blog routes which get redirected. Dashboard and authentication pages like /dashboard, /admin, /auth, /register, /signup, and password reset URLs all return 410. Old portfolio and project pages, pricing pages, team pages, and feed or RSS URLs are all marked as permanently gone.
The middleware function itself loops through each pattern and tests it against the incoming request pathname. When a match is found, it returns a NextResponse with a simple HTML page displaying "410 Gone" and a link back to the homepage, with the status code set to 410 and the content type header set to text/html. If no pattern matches, the function calls NextResponse.next() to let the request continue to the route handlers.
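A minimal sketch of that middleware, with the pattern list trimmed to a few illustrative entries; the exact patterns and the HTML markup are assumptions, not the production values:

```typescript
// middleware.ts — return 410 Gone for permanently removed URL categories
import { NextResponse, type NextRequest } from "next/server";

// Illustrative subset of the dead-URL categories described above.
const GONE_PATTERNS: RegExp[] = [
  /^\/\$$/,                    // malformed "/$" URL (note the escaped dollar sign)
  /^\/&$/,                     // malformed "/&" URL
  /^\/blog(\/.*)?$/,           // blog paths not already handled by a specific 301
  /^\/(en|sv)\/(?!blog\/).*/,  // old localized paths, except blog routes (redirected)
  /^\/(dashboard|admin|auth|register|signup)(\/.*)?$/,
  /^\/(portfolio|projects|pricing|team|feed|rss)(\/.*)?$/,
];

export function middleware(request: NextRequest) {
  const { pathname } = request.nextUrl;

  // Config redirects run before middleware, so URLs with a specific 301
  // never reach this loop.
  for (const pattern of GONE_PATTERNS) {
    if (pattern.test(pathname)) {
      return new NextResponse(
        '<html><body><h1>410 Gone</h1><a href="/">Back to the homepage</a></body></html>',
        { status: 410, headers: { "Content-Type": "text/html" } }
      );
    }
  }

  // No dead-category match: let the request continue to the route handlers.
  return NextResponse.next();
}
```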
The regex patterns are intentionally broad. Since redirects run first in Next.js, the middleware only sees URLs that weren't matched by a specific 301 redirect but do match a dead category pattern.
Layer 3: robots.txt Disallow
The final layer prevents future crawl attempts on paths that should never be indexed. The robots.ts file in the app directory exports a function that returns a MetadataRoute.Robots object.
The function first determines the base URL from an environment variable with a fallback to the default domain. It then returns a rules array with a single rule object targeting all user agents. The rule allows the root path but disallows a comprehensive list of paths: the API routes, Next.js internal paths, admin and dashboard pages, all authentication-related routes, the old localized paths, blog, portfolio, projects, case studies, work, pricing, plans, team, and feed directories. A crawl delay of one second is specified, and the sitemap URL is constructed from the base URL.
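A sketch of what that file might look like; the environment variable name and the placeholder domain are assumptions, and the disallow list is abbreviated to the categories named above:

```typescript
// app/robots.ts — generate robots.txt via the Metadata API
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  // Base URL from an environment variable, with a fallback domain (placeholder).
  const baseUrl = process.env.NEXT_PUBLIC_SITE_URL ?? "https://example.com";

  return {
    rules: [
      {
        userAgent: "*",
        allow: "/",
        disallow: [
          "/api/", "/_next/",
          "/admin/", "/dashboard/",
          "/auth/", "/login", "/register", "/signup",
          "/en/", "/sv/",
          "/blog/", "/portfolio/", "/projects/", "/case-studies/", "/work/",
          "/pricing/", "/plans/", "/team/", "/feed/",
        ],
        crawlDelay: 1,
      },
    ],
    sitemap: `${baseUrl}/sitemap.xml`,
  };
}
```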
This generates a standard robots.txt file that tells crawlers which paths they're allowed to access and which they should avoid, along with the sitemap location for discovering valid content.
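The generated output looks roughly like this (abbreviated):

```text
User-Agent: *
Allow: /
Disallow: /api/
Disallow: /dashboard/
Disallow: /blog/
Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml
```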
The Three Layers Working Together
Taken together: the config redirects run first and handle the migrations, the middleware returns 410 for anything that matches a dead category, and robots.txt tells crawlers not to request those paths again in the first place.
A Regex Gotcha
The first attempt at the middleware had a bug. The pattern was written as `/$$`, intended to match the literal URL `/$`, but in a regex `$` means "end of string", so the doubled dollar sign adds nothing: the pattern actually matches just the homepage `/` followed by end of string, not the malformed URL we were trying to catch.
The fix was to escape the dollar sign with a backslash, so the pattern becomes `/\$$`. The `\$` matches the literal dollar character, and the final `$` anchors to the end of the string.
This difference matters because without the fix, we would have returned 410 Gone for the homepage itself, effectively breaking the whole site.
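A quick check in a Node console makes the difference concrete, using the anchored form from the middleware sketch above:

```typescript
// Buggy pattern: both dollar signs are end-of-string anchors.
const buggy = /^\/$$/;
// Fixed pattern: "\$" matches a literal "$", the final "$" anchors the end.
const fixed = /^\/\$$/;

console.log(buggy.test("/"));   // true  — the homepage would have returned 410
console.log(buggy.test("/$"));  // false — the malformed URL slips through
console.log(fixed.test("/"));   // false
console.log(fixed.test("/$"));  // true  — only the malformed URL is caught
```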
Results
After deployment, the three signal types affected different URL counts. The 301 redirects handled about 25 URLs, preserving their link equity and guiding users to the new locations. The 410 Gone responses covered approximately 350 URLs, enabling faster removal from the index. The robots.txt disallow rules prevent any new crawl attempts on those path patterns going forward.
Google typically processes 410 signals within a few days to a couple weeks. The Search Console Pages report should show the indexed count dropping as Googlebot encounters the new status codes.
Monitoring
To track progress, there are three key places to watch. First, the Google Search Console Pages report shows the "Not indexed" count, which should decrease over time. Second, the Crawl stats section in Search Console reveals the response code distribution, where you should see 410 responses appearing. Third, server logs confirm that the 301 and 410 responses are actually being served to requests.
For Vercel deployments, you can filter logs to show only 410 responses from the past day to verify the middleware is working as expected.
Key Takeaways
First, the difference between 404 and 410 matters. A 404 means "not found right now" while 410 means "gone forever". Google treats them differently for indexing purposes.
Second, layering your signals provides defense in depth. The robots.txt prevents future crawls, redirects handle migrations to preserve link equity, and middleware catches everything else. Each layer serves a distinct purpose.
Third, understanding Next.js execution order is essential. Requests flow through redirects in the config file first, then rewrites, then middleware, and finally route handlers. This ordering lets you handle specific cases in configuration before broader patterns in middleware.
Fourth, regex patterns need testing. The dollar sign versus escaped dollar sign bug would have returned 410 for the entire site. Always test patterns against edge cases before deployment.
Fifth, crawl budget is finite. Google won't crawl everything forever. Dead URLs waste budget that should go to discovering and indexing actual content.
The technical SEO debt is now paid off. The site tells Google exactly what exists, what moved, and what's gone.