CrawlHawk.com

Crawl up to 500 unique links for FREE. Upgrade for more credits and features.

What data do you need?

Crawl Any Website. Generate Sitemaps. Extract Product Data.

Six workflows. One credit pack. Zero subscription.

CrawlHawk is a credit-based web data tool that combines six workflows in one account: a web crawler for site audits and link discovery, a Google-ready XML sitemap generator, a website content scraper that turns any site into clean AI-ready Markdown, an AI-powered product data scraper for e-commerce pages, a contact scraper for publicly listed contact details, and a lead classifier that scores your domain lists against your own criteria. There is no subscription, credits never expire, and every account includes all features. Start free with 500 URLs and upgrade only when you need more. Built and operated in the EU under GDPR.

Six Workflows in One Tool

XML Sitemap Generator

Google-ready sitemaps from any domain — in one crawl.

CrawlHawk crawls your domain (or a selected scope), identifies every indexable URL, and generates a search-engine-ready XML sitemap following the sitemaps.org protocol. The output can be submitted to Google Search Console, Bing Webmaster Tools, Yandex Webmaster, and other major search engines without modification.

Use it for new site launches (so search engines discover your pages quickly), post-migration sitemap rebuilds when URL structures change, e-commerce catalog updates, and ongoing SEO maintenance. CrawlHawk respects noindex meta tags, canonical URLs, and robots.txt directives. When a sitemap would exceed the protocol's limit of 50,000 URLs per file, it is automatically split into multiple sitemap files that are referenced from a single sitemap index file. No installation, no plugin, no recurring fee: pay only for the URLs you crawl, with credits that never expire.
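To make the 50,000-URL split concrete, here is a minimal sketch of how a URL list can be turned into sitemap files plus an index. It is an illustration only, not CrawlHawk's implementation; the function name build_sitemaps and the file names sitemap.xml and sitemap-N.xml are assumptions.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit in the sitemaps.org protocol


def _serialize(root):
    # ElementTree escapes characters such as "&" in URLs automatically.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def _urlset(urls):
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return _serialize(root)


def build_sitemaps(urls, base_url, max_urls=MAX_URLS_PER_FILE):
    """Return {filename: xml}. File names are hypothetical."""
    if not urls:
        raise ValueError("at least one URL is required")
    chunks = [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]
    if len(chunks) == 1:
        # Small sites need a single file and no index.
        return {"sitemap.xml": _urlset(chunks[0])}

    files = {}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for number, chunk in enumerate(chunks, start=1):
        name = f"sitemap-{number}.xml"
        files[name] = _urlset(chunk)
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url.rstrip('/')}/{name}"
    files["sitemap.xml"] = _serialize(index)
    return files
```

With the default limit, a 120,000-URL crawl would yield three sitemap files and one index that points at them.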

Custom Link Crawler

Granular link audits, configured your way.

Choose the scope (full domain, main domain only, single subdomain, URL path, or single page) and the link types you want (internal, external, broken, orphan, file, image). For each link discovered, CrawlHawk returns the source URL, target URL, anchor text, HTTP status code, link type, and rel attributes (nofollow, sponsored, ugc, noopener).

Export to CSV (for spreadsheets), JSON (for programmatic use), Excel XLSX (for reports), HTML (for visual review), PDF (for client deliverables), or XML sitemap format. Common uses include SEO site audits, broken-link maintenance, internal linking reviews, outbound link inventories, and competitor structure analysis. JavaScript rendering for dynamically loaded content is supported at an additional credit cost per URL.
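For programmatic use, it can help to picture one exported link as a small record. The sketch below models the fields listed above and writes them as JSON and CSV. The field names and the LinkRecord class are assumptions for illustration; the actual export column names may differ.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field, fields


@dataclass
class LinkRecord:
    # Hypothetical field names mirroring the attributes described above.
    source_url: str
    target_url: str
    anchor_text: str
    status_code: int
    link_type: str
    rel: list = field(default_factory=list)


def to_json(records):
    return json.dumps([asdict(r) for r in records], indent=2)


def to_csv(records):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(LinkRecord)])
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["rel"] = " ".join(row["rel"])  # flatten the list for spreadsheets
        writer.writerow(row)
    return buffer.getvalue()


records = [
    LinkRecord("https://example.com/", "https://partner.example/", "Our partner",
               200, "external", ["nofollow", "sponsored"]),
]
```

JSON keeps rel as a list, while the CSV variant joins the values into one cell so the file opens cleanly in a spreadsheet.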

Website Content Scraper

Any website's content as clean, AI-ready text.

Paste a URL and CrawlHawk extracts every page's main content as clean Markdown — navigation, footers and cookie banners stripped away. The download is a ZIP with one file per page plus a single combined file of the whole site, ready to upload to NotebookLM, ChatGPT Projects, Claude Projects or any RAG pipeline.
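The download layout described above (one file per page plus one combined file) can be sketched as follows. This is an assumption-laden illustration: the pages/ folder, the combined.md name, and the separator between pages are hypothetical, and the Markdown extraction itself is not shown.

```python
import io
import zipfile


def build_markdown_zip(pages):
    """pages maps a page slug to its already-extracted Markdown text."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        sections = []
        for slug, markdown in pages.items():
            archive.writestr(f"pages/{slug}.md", markdown)
            sections.append(f"<!-- source: {slug} -->\n\n{markdown}")
        # One combined file is convenient for tools that accept a single upload.
        archive.writestr("combined.md", "\n\n---\n\n".join(sections))
    return buffer.getvalue()


data = build_markdown_zip({"index": "# Home\n\nWelcome.", "about": "# About\n\nWho we are."})
```

Keeping a source marker above each section of the combined file makes it possible to trace an answer from an AI assistant back to the page it came from.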

Use it to build AI assistants and knowledge bases from your documentation, to migrate content between platforms, to prepare complete sites for translation, or to audit what your pages actually say. Content extraction is standard crawling — 1 credit per page, and the free tier covers 500 pages per crawl with no signup, which makes most blogs and documentation sites extractable entirely free.


AI Product Scraper

Structured product data from any e-commerce page — no recipe needed.

Point CrawlHawk at a product page, category, or entire catalog. An AI model reads the page like a human would and extracts titles, full descriptions, prices (including currency and sale prices), SKUs, brands, variant data (sizes, colors, configurations), image galleries, dimensions, weights, materials, specifications, and stock status — without manual XPath rules, per-site recipes, or template setup.

The same scraper works across Shopify stores, WooCommerce shops, Magento, BigCommerce, custom-built sites, and most large marketplaces. Output is delivered as XLSX, JSON, or CSV — every product is a row with consistent attribute columns, ready for direct import into Shopify, WooCommerce, your ERP, or any PIM tool. JavaScript-rendered product pages are supported at an additional credit cost per URL.
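"Every product is a row with consistent attribute columns" means that products with different attribute sets still share one header. A minimal sketch of that normalization, assuming products arrive as dictionaries with hypothetical keys such as title, price, and sku:

```python
import csv
import io


def products_to_csv(products):
    # Collect the union of attribute names, preserving first-seen order.
    columns = []
    for product in products:
        for key in product:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # restval fills attributes a product lacks with an empty cell.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(products)
    return buffer.getvalue()


products = [
    {"title": "Mug", "price": "9.90", "currency": "EUR"},
    {"title": "Lamp", "price": "39.00", "currency": "EUR", "sku": "L-1"},
]
```

Here the first product has no SKU, so its row ends with an empty cell instead of shifting the columns.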

Contact Scraper

Public contact details, collected responsibly.

CrawlHawk crawls the websites you specify and collects the contact details they publish — email addresses, phone numbers, physical addresses and social profile links — with the source URL recorded for every entry. Results arrive as a structured spreadsheet: one row per contact point, deduplicated, ready for your CRM.
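The "one row per contact point, deduplicated" shape can be illustrated with a short sketch. The row keys (type, value, source_url) and the normalization rules are assumptions; the example values are made up.

```python
import re


def dedupe_contacts(rows):
    """Keep the first occurrence of each contact point and its source URL."""
    unique = {}
    for row in rows:
        value = row["value"].strip()
        if row["type"] == "email":
            key_value = value.lower()
        elif row["type"] == "phone":
            key_value = re.sub(r"[^\d+]", "", value)  # ignore spaces and dashes
        else:
            key_value = value
        unique.setdefault((row["type"], key_value), {**row, "value": value})
    return list(unique.values())


rows = [
    {"type": "email", "value": "Sales@Example.com", "source_url": "https://example.com/contact"},
    {"type": "email", "value": "sales@example.com ", "source_url": "https://example.com/about"},
    {"type": "phone", "value": "+00 00 000 000", "source_url": "https://example.com/contact"},
    {"type": "phone", "value": "+00-00-000-000", "source_url": "https://example.com/imprint"},
]
```

Because the first occurrence wins, each surviving row still points at a public page where the detail was found.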

Built for legitimate outreach research: supplier and partner discovery, B2B prospect list building from companies' own websites, and keeping your existing records current. You confirm a lawful basis for each crawl, and every export is traceable back to the public page it came from. Contact extraction uses credits per processed URL, shown before the crawl starts — credits never expire.


Lead Classifier

Your domain list, scored by your criteria.

Paste a list of domains — or upload a CSV — and describe what a good lead looks like for you, in plain language. CrawlHawk crawls each site and scores it against your criteria: industry fit, product range, company signals, whatever you defined. The output is your list back, ranked, with a score and a short justification per domain.

Built for sales and market-research teams that have lists but no time: distributor shortlists, exhibitor and directory exports, scraped prospect pools. Instead of opening five hundred websites by hand, you read one ranked spreadsheet and start with the top of it. Classification uses credits per domain, shown before the run starts.

Coming soon

Crawling Options

CrawlHawk supports five scope modes, so you can target exactly the part of the web you care about — from an entire domain down to a single page. All five modes work with every link type and every output format; scope only controls how widely the crawl spreads.
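One way to picture the five scope modes is as a filter that decides whether a discovered URL should be followed. The sketch below is an interpretation, not CrawlHawk's code: the mode names are hypothetical, it assumes Full Domain includes subdomains (inferred from the existence of Main Domain Only), and it does not treat www as equivalent to the bare domain.

```python
from urllib.parse import urlsplit


def in_scope(candidate, start, mode):
    """Return True if `candidate` should be crawled for a crawl begun at `start`."""
    c, s = urlsplit(candidate), urlsplit(start)
    c_host = (c.hostname or "").lower()
    s_host = (s.hostname or "").lower()

    if mode == "full_domain":
        # Assumes `start` is on the main domain, e.g. example.com.
        return c_host == s_host or c_host.endswith("." + s_host)
    if mode in ("main_domain_only", "subdomain_only"):
        # Both stay on exactly one host; they differ in which host you start from.
        return c_host == s_host
    if mode == "single_path":
        return c_host == s_host and c.path.startswith(s.path)
    if mode == "single_page":
        # Only the start URL is fetched; its links are reported, not followed.
        return (c_host, c.path, c.query) == (s_host, s.path, s.query)
    raise ValueError(f"unknown scope mode: {mode}")
```

For example, in_scope("https://blog.example.com/post", "https://example.com/", "main_domain_only") is False, while the same call with "full_domain" is True.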

Crawl Full Domain

Every internal page reachable from one starting URL.

Submit a URL and CrawlHawk discovers every internal page within the same domain — across all paths, sections, and reachable URLs. The full-domain crawl is ideal for complete site audits, XML sitemap generation for new launches, and comprehensive internal-link mapping. It returns a full inventory of your site's crawlable structure with every URL, the page it was discovered from, and its crawl depth.

For very large platforms with millions of URLs (eBay, Amazon, Wikipedia, YouTube), use a tighter scope or restrict to specific URL paths instead — full-domain crawling at that scale consumes large amounts of credits and may be limited by the target site's protections. For mid-sized sites under 50,000 pages, full-domain mode is usually the right choice for thorough SEO work, sitemap generation, and link audits.

Main Domain Only

Crawl one domain — skip the subdomains.

When a site uses subdomains for separate properties (blog.example.com, shop.example.com, docs.example.com, support.example.com), the Main Domain Only mode crawls just example.com and excludes all subdomains. The crawler follows internal links within the main domain but ignores any link that would cross into a subdomain.

This is useful when your main site, blog, store, and support center operate as logically separate properties — each with its own content strategy, sitemap, and SEO priorities. Crawl them independently in distinct passes to keep results focused. Main Domain Only is also the right choice when you need a sitemap that includes only your core site URLs, without subdomain URLs cluttering the index.

Subdomain Only

Crawl one subdomain in isolation.

Target a single subdomain — your blog (blog.example.com), your store (shop.example.com), your help center (support.example.com), your developer portal (developers.example.com), or any other subdomain — without including the main domain or any sibling subdomains. The crawler follows internal links only within the specified subdomain.

Useful for property-specific audits where each subdomain has its own SEO strategy and needs separate treatment, for generating focused sitemaps per property, or for auditing one subdomain you recently launched or migrated. Pair Subdomain Only with broken-link detection to find post-migration link issues, or with image link extraction to inventory the visual assets on a specific subdomain.

Single Path

Crawl a specific section of a site.

Restrict the crawl to a single URL path — for example, example.com/products/, example.com/blog/, or example.com/docs/. CrawlHawk discovers links only within that path and ignores everything outside it. The crawler follows internal links that begin with the specified path prefix; links to other sections of the same domain are not followed.

Ideal for category-level product audits (crawling just the /products/ section of a webshop), documentation crawls (just the /docs/ tree), blog section reviews, or analyzing one part of a very large site without processing the whole thing. Path-based crawls keep credit consumption focused and results scoped to what you actually need.

Single Page

Extract all links from one URL.

Crawls a single URL and captures every link contained in that one page — internal, external, file, image, broken — including links rendered dynamically by JavaScript. Returns a precise outbound-link list from one URL with full anchor text, target URL, HTTP status, and link attributes.

Perfect for one-off URL inspections, competitor landing-page audits (what does your competitor link to from their homepage?), individual product-page analysis, broken-link checks on a specific high-value page, or extracting outbound references from a long-form article. Single-page mode is the most credit-efficient option for narrow, targeted analysis — one URL, one credit.

Link Types to Collect

Choose which link categories CrawlHawk reports. Combine as many as you need within a single crawl — every link type is included in every plan, and one crawl can return any combination of internal, external, broken, orphan, file, and image links in a single consolidated dataset.

Internal Links

Map the link structure inside your domain.

Internal links connect pages within the same domain. CrawlHawk reports every internal link found during the crawl, with source URL, target URL, anchor text, HTTP status code, and link attributes (rel=nofollow, sponsored, ugc). Internal link data is the foundation of technical SEO.

Use the data to identify under-linked important pages (pages with too few inbound internal links), to plan internal linking improvements (which anchor texts point to which pages), to verify that key landing pages are reachable within two or three clicks from the homepage, to distribute link equity efficiently across your site, and to plan content silo architectures that group related pages topically.
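Two of these checks, click depth from the homepage and inbound link counts, can be computed directly from an exported list of (source, target) pairs. A minimal sketch, assuming the export has already been reduced to such pairs; the function names are hypothetical.

```python
from collections import Counter, deque


def click_depths(links, homepage):
    """Breadth-first search: the fewest clicks needed to reach each page."""
    graph = {}
    for source, target in links:
        graph.setdefault(source, []).append(target)

    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def inbound_counts(links):
    """Count distinct pages linking to each target, ignoring self-links."""
    return Counter(target for source, target in set(links) if source != target)


links = [("/", "/a"), ("/", "/b"), ("/a", "/c"), ("/b", "/c"), ("/c", "/d")]
depths = click_depths(links, "/")
too_deep = [page for page, depth in depths.items() if depth > 2]
```

Pages with a low inbound count or a depth above your chosen threshold are the natural candidates for new internal links.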

Broken Links

Detect 404s, 5xx errors, and other broken targets.

Reports every link returning a 4xx or 5xx HTTP status code across the crawled scope. Each broken link is logged with its source page (where the link appears), target URL (what it tries to link to), exact HTTP status code (404, 500, 503, etc.), and link context (anchor text and surrounding text) — so you can fix issues quickly.

Crucial for technical SEO maintenance (broken links damage rankings and user experience), post-migration cleanups (when URL structures change, old internal links break), affiliate link verification (commission revenue depends on working outbound links), and regular site health checks. CrawlHawk distinguishes between client errors (4xx: the target is missing, gone, or access is refused) and server errors (5xx: server-side failures that are often temporary and may resolve on retry).
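The client/server distinction follows the standard HTTP status ranges. A small sketch of how exported rows could be grouped, assuming each row is a dictionary with a status_code key (a hypothetical column name):

```python
def classify_status(code):
    if 400 <= code <= 499:
        return "client_error"   # e.g. 404 Not Found, 410 Gone, 403 Forbidden
    if 500 <= code <= 599:
        return "server_error"   # e.g. 500, 503; worth re-checking later
    return "not_broken"


def broken_links(records):
    report = {"client_error": [], "server_error": []}
    for record in records:
        kind = classify_status(record["status_code"])
        if kind in report:
            report[kind].append(record)
    return report


records = [
    {"target_url": "https://example.com/old", "status_code": 404},
    {"target_url": "https://example.com/api", "status_code": 503},
    {"target_url": "https://example.com/", "status_code": 200},
]
report = broken_links(records)
```

A practical workflow is to fix the client errors first and re-crawl the server errors before treating them as permanently broken.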

External Links

List every outbound link from your site.

Identifies every link pointing to a different domain. Reports source page, target URL, full anchor text, and rel attributes (nofollow, sponsored, ugc, noopener). The full external link inventory helps you understand what your site links to and how those links are attributed.

Useful for outbound link audits (what domains is your site sending traffic to?), brand-safety reviews (are you linking to sites you'd rather not?), identifying unintended outbound traffic, verifying link-attribute compliance for sponsored or affiliate content (the FTC requires disclosure of paid relationships; Google asks that paid links be qualified with rel=sponsored or rel=nofollow), and detecting compromised pages with injected external links from malware or hacked content.
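A link-attribute check of this kind can be sketched against the exported external links. The example assumes you maintain your own set of paid or affiliate domains (paid_domains is hypothetical) and treats either sponsored or nofollow as an acceptable qualifier, which is a policy assumption you may want to tighten.

```python
from urllib.parse import urlsplit

ACCEPTED_QUALIFIERS = {"sponsored", "nofollow"}


def rel_tokens(rel_value):
    # rel is a space-separated, case-insensitive token list.
    return set((rel_value or "").lower().split())


def unqualified_paid_links(links, paid_domains):
    flagged = []
    for link in links:
        host = (urlsplit(link["target_url"]).hostname or "").lower()
        if host in paid_domains and not (rel_tokens(link.get("rel")) & ACCEPTED_QUALIFIERS):
            flagged.append(link)
    return flagged


links = [
    {"target_url": "https://shop.example/item", "rel": "noopener"},
    {"target_url": "https://shop.example/other", "rel": "Sponsored noopener"},
    {"target_url": "https://news.example/story", "rel": None},
]
flagged = unqualified_paid_links(links, {"shop.example"})
```

Only the first link is flagged: it points at a paid domain and carries neither qualifier.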

Orphan Pages

Find pages with no inbound internal links.

Orphan pages exist on your site but have no internal links pointing to them — making them invisible to most crawlers and hard for users and search engines to discover. CrawlHawk cross-references your sitemap or known URL set against the crawled internal link graph to surface pages that exist but aren't linked from anywhere within the site.

Critical for site coverage audits (is every important page actually reachable?), SEO completeness (Google may not discover or rank orphan pages effectively), and post-migration verification (migrations frequently create orphans when redirects fail). Common orphan causes include old landing pages, deprecated category pages, A/B test variants left in production, and content that was published without being linked from navigation or hubs.
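The cross-reference described above is essentially a set difference: known URLs minus every URL that is the target of at least one internal link. A minimal sketch, with find_orphans as a hypothetical helper name:

```python
def find_orphans(known_urls, internal_links, start_url):
    """known_urls: from a sitemap or URL list; internal_links: (source, target) pairs."""
    linked = {target for source, target in internal_links if source != target}
    # The start URL has no inbound link requirement, so it is never an orphan.
    return sorted(set(known_urls) - linked - {start_url})


known = ["https://example.com/", "https://example.com/a", "https://example.com/old-landing"]
links = [("https://example.com/", "https://example.com/a")]
orphans = find_orphans(known, links, "https://example.com/")
```

Self-links are ignored on purpose, since a page that links only to itself is still unreachable from the rest of the site.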

File Links

Extract links to PDFs, documents, and downloadable resources.

Captures every link pointing to a downloadable file — PDF, DOCX, XLSX, ZIP, MP4, MP3, and any other non-HTML resource. Reports the source page (where the link appears), the file URL, the file type and extension, and the link context (anchor text and surrounding text).

Useful for resource library audits (what PDFs and whitepapers does your site offer, and from where are they linked?), document accessibility reviews, broken-file detection (file links that 404), GDPR or compliance reviews (knowing exactly what documents are publicly linked), and SEO of file-heavy sites with downloads, datasheets, brochures, or media archives. Often paired with the broken-link checker to verify that all linked files are still available.
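File type and extension can usually be read from the URL path alone. The sketch below shows one way to do that; the mapping is a small illustrative subset, and a real crawler might also consult the Content-Type response header, which is not shown here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Illustrative subset of extensions; extend as needed.
FILE_TYPES = {
    ".pdf": "PDF", ".docx": "DOCX", ".xlsx": "XLSX",
    ".zip": "ZIP", ".mp4": "MP4", ".mp3": "MP3",
}


def file_type(url):
    """Return the file type for a link, or None if it does not look like a file."""
    # urlsplit drops the query string, so "?v=2" does not hide the extension.
    extension = PurePosixPath(urlsplit(url).path).suffix.lower()
    return FILE_TYPES.get(extension)


inventory = [u for u in (
    "https://example.com/docs/guide.pdf?v=2",
    "https://example.com/about/",
    "https://example.com/media/Intro.MP4",
) if file_type(u)]
```

Combining this with the HTTP status of each file URL gives the broken-file report mentioned above.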

Image Links

Find every image URL on a site.

Extracts every image source URL across the crawled scope, with the source page where the image is embedded, the alt text (or its absence), the dimensions where available, and the file format (JPG, PNG, WebP, AVIF, SVG, GIF). Returns a complete image inventory of the crawled scope.

Critical for image SEO audits (are images optimized, properly sized, and using alt text for accessibility and search?), broken image detection (images that fail to load damage user experience), alt-text coverage reviews (which images lack alt attributes, required for accessibility compliance), CDN migrations (you need a full image inventory before moving assets), and visual content audits across your site or competitor sites.
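An alt-text coverage review boils down to separating images with a missing alt attribute from those with an empty one (which is valid for decorative images). A minimal sketch using only the Python standard library; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            # alt is None when the attribute is absent, "" when it is empty.
            self.images.append({"src": attributes.get("src"), "alt": attributes.get("alt")})


def missing_alt(html):
    collector = ImageCollector()
    collector.feed(html)
    return [image["src"] for image in collector.images if image["alt"] is None]


page = '<img src="a.png" alt="Logo"><img src="b.png"><img src="c.png" alt="">'
```

For the sample page above, only b.png is reported, because c.png has an explicit empty alt.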

What CrawlHawk Is

CrawlHawk is a credit-based web data extraction service operated by New Horizon Platforms Kft. in Szeged, Hungary. The Service combines a web crawler, an XML sitemap generator, a website content scraper, an AI-powered product data scraper, a contact scraper and a lead classifier in a single interface. There is no monthly subscription — Users purchase credit packs that never expire and unlock every feature from day one. The free tier includes 500 URLs without a credit card. Hosting and operations are based in the European Union, with GDPR-compliant processing and transparent vendor disclosure.

Built For

SEO professionals running site audits, broken-link checks, and internal linking reviews
E-commerce teams importing product catalogs and monitoring competitor data
Webshop owners enriching product data with AI extraction across supplier sites
Data analysts gathering structured market and product research
Content creators and affiliate marketers compiling product information at scale
Technical SEO consultants generating sitemaps and identifying orphan pages
Sales and lead-gen teams qualifying domain lists and building outreach-ready contact records
