SEO · AEO · GEO · ~12 min · CW-2

Sitemap.xml, robots.txt, and llms.txt — programmatic, AI-friendly, complete

Generate sitemap.xml and robots.txt in Next.js, explicitly allow the major AI crawlers, and add the new llms.txt convention. Ready-to-paste code.

Three small files punch above their weight: sitemap.xml tells search engines what to index, robots.txt tells them what to crawl, llms.txt tells AI assistants what your site is. Next.js generates the first two from TypeScript exports; llms.txt is a manual markdown file. All three together are the indexing layer that compounds for years. This guide is the complete setup with the latest AI-crawler list.

Sitemap.ts — programmatic generation

The Next.js App Router treats `src/app/sitemap.ts` (or `app/sitemap.ts` if the project has no `src` directory) as a special file that exports a default function. The function returns an array of URL entries, and Next.js serializes it to XML at build time, so the sitemap refreshes on every build.

1. The full pattern with static + dynamic routes

Copy this into src/app/sitemap.ts. Customise the BASE URL and dynamic data sources.

```
import type { MetadataRoute } from "next";
import { getAllPosts } from "@/lib/posts";
import { getAllServices } from "@/lib/services";

const BASE = "https://yourdomain.com";

export default function sitemap(): MetadataRoute.Sitemap {
  // Build time. Pages without their own modification date fall back to this.
  const now = new Date();

  // Hand-maintained routes.
  const staticPages: MetadataRoute.Sitemap = [
    { url: BASE, lastModified: now, changeFrequency: "monthly", priority: 1.0 },
    { url: `${BASE}/about`, lastModified: now, changeFrequency: "monthly", priority: 0.8 },
    { url: `${BASE}/services`, lastModified: now, changeFrequency: "monthly", priority: 0.8 },
    { url: `${BASE}/contact`, lastModified: now, changeFrequency: "yearly", priority: 0.6 },
    { url: `${BASE}/blog`, lastModified: now, changeFrequency: "weekly", priority: 0.7 },
  ];

  // One entry per blog post, dated by last update (or publish date if never updated).
  const posts = getAllPosts();
  const postPages: MetadataRoute.Sitemap = posts.map((post) => ({
    url: `${BASE}/blog/${post.slug}`,
    lastModified: new Date(post.updatedAt || post.publishedAt),
    changeFrequency: "monthly" as const,
    priority: 0.6,
  }));

  // One entry per service page.
  const services = getAllServices();
  const servicePages: MetadataRoute.Sitemap = services.map((service) => ({
    url: `${BASE}/services/${service.slug}`,
    lastModified: now,
    changeFrequency: "monthly" as const,
    priority: 0.7,
  }));

  return [...staticPages, ...postPages, ...servicePages];
}
```

What to exclude

Leave out /admin, password-protected routes, and any page marked noindex. Leave out search results and filtered views too.
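One way to enforce these exclusions is to filter the combined list before returning it. This is a minimal sketch, not part of Next.js: the prefixes in `EXCLUDED_PREFIXES` and the helper names `isIndexable` and `filterSitemap` are assumptions to adapt to your own routes.

```typescript
// Hypothetical path prefixes that should never appear in the sitemap.
const EXCLUDED_PREFIXES = ["/admin", "/search", "/members"];

// True when the URL's path is outside every excluded prefix and the URL
// carries no query string (filtered views usually do).
function isIndexable(url: string): boolean {
  const { pathname, search } = new URL(url);
  if (search !== "") return false;
  return !EXCLUDED_PREFIXES.some(
    (prefix) => pathname === prefix || pathname.startsWith(`${prefix}/`)
  );
}

// Works on any entry shape that has a `url`, including sitemap entries.
function filterSitemap<T extends { url: string }>(entries: T[]): T[] {
  return entries.filter((entry) => isIndexable(entry.url));
}
```

In sitemap.ts, the final line would then become `return filterSitemap([...staticPages, ...postPages, ...servicePages]);`.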

2. Verify after deploy
Visit https://yourdomain.com/sitemap.xml. You should see an XML document listing all your URLs. A 404 usually means the file is misnamed or misplaced: it must be exactly src/app/sitemap.ts (or app/sitemap.ts if the project has no src directory).
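For sites with many URLs, counting entries is more reliable than eyeballing the XML. The sketch below assumes a flat sitemap with `<loc>` tags; `extractLocs` and `findForeignUrls` are illustrative helpers, and the regex is a quick check rather than a full XML parser.

```typescript
// Pulls every <loc> value out of a sitemap XML string.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    const value = match[1];
    if (value !== undefined) locs.push(value);
  }
  return locs;
}

// Returns the URLs that do not start with the expected base,
// e.g. localhost or staging URLs that leaked into production.
function findForeignUrls(locs: string[], base: string): string[] {
  return locs.filter((loc) => !loc.startsWith(base));
}
```

To run it against a deployed site, fetch the sitemap as text, pass it to `extractLocs`, and compare the count with the number of pages you expect.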

Robots.ts — allow AI crawlers explicitly

In 2026, the AI crawler list grows monthly. Generating robots.txt programmatically means you update the TypeScript file once and Next.js handles the regeneration.

1. The full AI-crawler-friendly robots.ts

Copy into src/app/robots.ts.

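```
import type { MetadataRoute } from "next";

const BASE = "https://yourdomain.com";

// Paths no crawler should fetch. A crawler that matches a named rule ignores
// the "*" rule, so every rule below repeats this list.
const PRIVATE_PATHS = ["/admin/", "/api/"];

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // General web crawlers
      { userAgent: "*", allow: "/", disallow: PRIVATE_PATHS },

      // OpenAI / ChatGPT
      { userAgent: "GPTBot", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "ChatGPT-User", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "OAI-SearchBot", allow: "/", disallow: PRIVATE_PATHS },

      // Anthropic / Claude
      { userAgent: "ClaudeBot", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "Claude-Web", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "anthropic-ai", allow: "/", disallow: PRIVATE_PATHS },

      // Perplexity
      { userAgent: "PerplexityBot", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "Perplexity-User", allow: "/", disallow: PRIVATE_PATHS },

      // Google AI
      { userAgent: "Google-Extended", allow: "/", disallow: PRIVATE_PATHS },

      // Apple Intelligence
      { userAgent: "Applebot-Extended", allow: "/", disallow: PRIVATE_PATHS },

      // Meta AI
      { userAgent: "Meta-ExternalAgent", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "Meta-ExternalFetcher", allow: "/", disallow: PRIVATE_PATHS },

      // Other major AI / search
      { userAgent: "Bytespider", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "Amazonbot", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "DuckAssistBot", allow: "/", disallow: PRIVATE_PATHS },
      { userAgent: "Diffbot", allow: "/", disallow: PRIVATE_PATHS },
    ],
    sitemap: `${BASE}/sitemap.xml`,
    host: BASE,
  };
}
```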

Why allow AI crawlers?

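A wildcard rule (User-agent: *) already covers AI crawlers, so listing them by name is mainly a statement of intent that survives later edits to the wildcard rule. Two caveats apply. First, a crawler that finds a rule with its own name follows only that rule and ignores the * rule, so repeat any disallow paths there. Second, robots.txt cannot override blocking done at the network level: some CDNs and hosting platforms block AI crawlers by default, and that has to be switched off in the CDN or host settings.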
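Under the Robots Exclusion Protocol (RFC 9309), a crawler obeys the most specific group that matches its user agent and falls back to `*` only when no named group exists. The sketch below models that selection to show why disallow paths do not carry over from the wildcard rule. The `Group` type and `selectGroup` function are illustrative names, and real crawlers may match product tokens more loosely than this exact-name comparison.

```typescript
type Group = { userAgent: string; allow: string[]; disallow: string[] };

// Picks the group a crawler would follow: a case-insensitive name match wins,
// and the "*" group is only a fallback.
function selectGroup(crawler: string, groups: Group[]): Group | undefined {
  const name = crawler.toLowerCase();
  const named = groups.find((g) => g.userAgent.toLowerCase() === name);
  return named ?? groups.find((g) => g.userAgent === "*");
}

const groups: Group[] = [
  { userAgent: "*", allow: ["/"], disallow: ["/admin/", "/api/"] },
  { userAgent: "GPTBot", allow: ["/"], disallow: [] },
];

const forGptBot = selectGroup("GPTBot", groups);
const forOther = selectGroup("SomeOtherBot", groups);

console.log(forGptBot?.disallow); // [] (the named group has no disallow paths)
console.log(forOther?.disallow); // [ '/admin/', '/api/' ] (falls back to "*")
```

In this model GPTBot is free to fetch /admin/ and /api/, because its named group never mentions them.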

2. When to disallow AI crawlers
If you have paid content, gated articles, or proprietary research, disallow those specific paths for the AI crawlers. Avoid blocking a crawler outright with a site-wide disallow, because that also removes your public pages from that assistant.
```
// In your rules:
{ userAgent: "GPTBot", allow: "/", disallow: ["/paid/", "/members/"] },
{ userAgent: "ClaudeBot", allow: "/", disallow: ["/paid/", "/members/"] },
```
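Repeating the same disallow list for every crawler gets tedious as the list grows. A minimal sketch of a helper that builds one rule per crawler from a shared list follows. `RobotsRule` is a local stand-in for the rule shape used in `MetadataRoute.Robots`, `rulesFor` is a hypothetical helper, and the crawler names and paths are placeholders.

```typescript
type RobotsRule = { userAgent: string; allow: string; disallow: string[] };

// Builds one allow-everything-except rule per crawler name.
function rulesFor(agents: string[], disallow: string[]): RobotsRule[] {
  return agents.map((userAgent) => ({
    userAgent,
    allow: "/",
    disallow: [...disallow], // copy so the rules do not share one array
  }));
}

const gatedPaths = ["/paid/", "/members/"];
const aiRules = rulesFor(["GPTBot", "ClaudeBot", "PerplexityBot"], gatedPaths);
```

In robots.ts, `...aiRules` can be spread into the `rules` array next to the wildcard rule.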

llms.txt — the new convention

llms.txt is an emerging convention: a markdown file at your site root that summarises the site for AI engines. No major LLM provider requires it or guarantees reading it yet, but it's cheap to ship and forward-looking.

1. Create public/llms.txt

Save the file at public/llms.txt. Next.js serves it at /llms.txt automatically.

```
# Your Site

> One sentence about what this is. Plain English, no buzzwords.

## Pages

- [Home](https://yourdomain.com/): one-sentence summary of what visitors find here.
- [About](https://yourdomain.com/about): who runs this, background, why.
- [Services](https://yourdomain.com/services): what you sell, with indicative pricing.
- [Contact](https://yourdomain.com/contact): three channels (email, WhatsApp, calendar).
- [Blog](https://yourdomain.com/blog): writing on [your topics].

## Brand

- Voice: direct, specific, dry.
- Tone: professional but human.
- Anti-voice: no buzzwords ("leverage", "synergy", "in today's fast-paced world").

## Audience

- [Specific audience 1]
- [Specific audience 2]

## Contact

- Email: you@yourdomain.com
- WhatsApp: +65-1234-5678
```

Update llms.txt when you launch new things

New service? New blog post category? Update llms.txt. Treat it as a brief AI assistants read to understand your site at a glance.
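If manual updates keep slipping, the Pages section can be rendered from the same data that feeds the sitemap. This is an optional sketch, not something Next.js provides: `LlmsSite`, `LlmsPage`, and `renderLlmsTxt` are hypothetical names, and the output covers only the title, tagline, and page list.

```typescript
type LlmsPage = { title: string; url: string; summary: string };
type LlmsSite = { name: string; tagline: string; pages: LlmsPage[] };

// Renders the heading, blockquote summary, and "## Pages" list as markdown.
function renderLlmsTxt(site: LlmsSite): string {
  const pageLines = site.pages.map(
    (page) => `- [${page.title}](${page.url}): ${page.summary}`
  );
  return [
    `# ${site.name}`,
    "",
    `> ${site.tagline}`,
    "",
    "## Pages",
    "",
    ...pageLines,
    "",
  ].join("\n");
}

const llmsTxt = renderLlmsTxt({
  name: "Your Site",
  tagline: "One sentence about what this is.",
  pages: [
    { title: "Home", url: "https://yourdomain.com/", summary: "what visitors find here." },
    { title: "Blog", url: "https://yourdomain.com/blog", summary: "writing on your topics." },
  ],
});
```

A prebuild script could write `llmsTxt` to public/llms.txt, with the Brand, Audience, and Contact sections appended by hand.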

Verify all three

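• curl -i https://yourdomain.com/sitemap.xml | head -20
• curl -i https://yourdomain.com/robots.txt
• curl -i https://yourdomain.com/llms.txt
• Each response should start with a 200 status line, followed by valid XML for the sitemap, your User-Agent rules for robots.txt, and your markdown for llms.txt.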
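The same check can run as a script after each deploy. The sketch below takes the fetch function as a parameter so it can be tested without a network; `checkFiles` and `FetchLike` are illustrative names, and only the status code is inspected.

```typescript
// Minimal shape of the fetch function this check needs.
type FetchLike = (url: string) => Promise<{ status: number }>;

// Requests the three files in turn and records each HTTP status by path.
async function checkFiles(
  base: string,
  fetchImpl: FetchLike
): Promise<Record<string, number>> {
  const paths = ["/sitemap.xml", "/robots.txt", "/llms.txt"];
  const results: Record<string, number> = {};
  for (const path of paths) {
    const response = await fetchImpl(`${base}${path}`);
    results[path] = response.status;
  }
  return results;
}
```

On Node 18 or later, the built-in `fetch` can be passed directly: `checkFiles("https://yourdomain.com", fetch).then(console.log)`. Any value other than 200 points at the file to investigate.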

Troubleshooting

sitemap.xml is empty.

Your getAllPosts() or similar dynamic source returns an empty array. Check the data source is correctly imported and returning items. Run npm run build locally and check the .next output.

robots.txt missing a crawler I want to allow.

Add it to the rules array. The list above covers the major crawlers as of mid-2026, but new ones keep appearing, so check each vendor's crawler documentation (OpenAI, Anthropic, and others) periodically.

Google Search Console says 'Sitemap could not be read'.

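Three common causes: (1) The submitted URL is wrong. Confirm that what you submitted resolves to https://yourdomain.com/sitemap.xml; where Search Console pre-fills the domain, enter only sitemap.xml rather than the full URL. (2) robots.txt disallows /sitemap.xml; check robots.ts. (3) The sitemap is malformed; open it in a browser and verify it's valid XML.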
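Cause (2) can be checked quickly by testing the sitemap path against your disallow list. The sketch below uses simplified prefix matching: it ignores wildcards (`*`), end anchors (`$`), and allow rules, so treat `isDisallowed` as a rough first check rather than a full robots.txt evaluator.

```typescript
// True when any non-empty disallow rule is a prefix of the path.
function isDisallowed(path: string, disallow: string[]): boolean {
  return disallow.some((rule) => rule !== "" && path.startsWith(rule));
}

const sitemapBlocked = isDisallowed("/sitemap.xml", ["/admin/", "/api/"]);
const everythingBlocked = isDisallowed("/sitemap.xml", ["/"]);

console.log(sitemapBlocked); // false
console.log(everythingBlocked); // true
```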
