✓ 7-Phase Pipeline ✓ 450+ Shops Tested ✓ Fully Automated ✓ No API Key

How the Algorithm Works

From URL input to 5 finished output files: 7 sequential phases, fully automated, with no manual configuration. Each phase is explained below.


2,000 URLs / Analysis

Classif. Phases

20× Parallel Fetches

450+ Shops Tested

26+ Intent Routings

5 Output Formats

Live Analysis Log — solar-autark.com

$ analyse https://www.solar-autark.com

Analysis Pipeline

7 Phases in Detail

Each phase builds on the previous one. Higher recognition layers only override lower ones when signals are more definitive — keeping classifications stable and accurate.

🔍P1

Phase 1 · Discovery

robots.txt → Sitemap → Fallback Crawl

Discovery starts with robots.txt: disallow rules are recorded and sitemap paths are extracted. XML and gzip-compressed sitemaps (.xml.gz) are loaded, and sitemap index files are resolved recursively. If no sitemap is found, the tool falls back to a structured crawl of the homepage.

✓ gzip · ✓ index recursion · ✓ disallow filter · ✓ up to 2,000 URLs
Gambio · WooCommerce · Shopware · Shopify · JTL · TYPO3 · Joomla

robots.txt · XML Sitemap · gzip · Recursion
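
The discovery step can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: the function names parse_robots and collect_urls are hypothetical, the fetch callable is injected so the example runs without network access, and wildcard handling in disallow rules is left out.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def parse_robots(text: str) -> tuple[list[str], list[str]]:
    """Return (disallow_rules, sitemap_urls) found in a robots.txt body."""
    disallow: list[str] = []
    sitemaps: list[str] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if not value:
            continue
        if key.lower() == "disallow":
            disallow.append(value)
        elif key.lower() == "sitemap":
            sitemaps.append(value)
    return disallow, sitemaps


def collect_urls(sitemap_url: str, fetch: Callable[[str], bytes],
                 limit: int = 2000, seen: Optional[set[str]] = None) -> list[str]:
    """Resolve a sitemap or sitemap index recursively into page URLs."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against index files that reference each other
        return []
    seen.add(sitemap_url)
    body = fetch(sitemap_url)
    if body[:2] == b"\x1f\x8b":  # gzip magic bytes, as in .xml.gz files
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    is_index = root.tag.endswith("sitemapindex")
    urls: list[str] = []
    for entry in root:  # <url> or <sitemap> elements
        for child in entry:
            if not child.tag.endswith("loc") or not child.text:
                continue
            if len(urls) >= limit:
                return urls
            loc = child.text.strip()
            if is_index:
                # An index entry points at another sitemap: recurse into it.
                urls.extend(collect_urls(loc, fetch, limit - len(urls), seen))
            else:
                urls.append(loc)
    return urls
```

The limit parameter mirrors the 2,000-URL cap mentioned above; how the real tool enforces that cap is not stated, so the early return here is an assumption.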

🗂️P2

Phase 2 · Inventory

URL Normalization & Deduplication

All collected URLs are normalized, restricted to the base domain, and deduplicated. System URLs — admin, API, checkout, login, cart — are automatically filtered out.

Example: 685 sitemap URLs → 612 products + 61 categories + 12 info pages
Filtered: /admin/ · /checkout/ · /api/ · /login/ · /?action=

Normalization · Dedup · System Filter
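
A minimal sketch of the inventory step, assuming normalization means lowercasing scheme and host and dropping the fragment. The names normalize, is_system_url and build_inventory are hypothetical, and the /cart/ marker is inferred from the prose rather than from the filter list.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

# Path markers taken from the filter list above; "/cart/" is an assumption.
SYSTEM_PATHS = ("/admin/", "/checkout/", "/api/", "/login/", "/cart/")


def normalize(url: str) -> str:
    """Lowercase scheme and host, default the path to '/', drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def is_system_url(url: str) -> bool:
    """True for admin, checkout, API, login and ?action= URLs."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + "/"  # so "/checkout" matches "/checkout/"
    if any(marker in path for marker in SYSTEM_PATHS):
        return True
    return "action" in parse_qs(parts.query, keep_blank_values=True)


def build_inventory(urls: list[str], base_host: str) -> list[str]:
    """Normalize, restrict to the base domain, filter system URLs, deduplicate."""
    base = base_host.lower().removeprefix("www.")
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        norm = normalize(url)
        host = urlsplit(norm).netloc.removeprefix("www.")
        if host != base or is_system_url(norm) or norm in seen:
            continue
        seen.add(norm)
        kept.append(norm)
    return kept
```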

🧠P3

Phase 3 · Classification

5-Layer Classification (autoCat)

The algorithmic core. Each URL passes through five recognition layers. Navigation-confirmed categories (catUrls) are permanently immunized against reclassification.

1. Schema.org: Product · CollectionPage · Brand
2. URL Pattern: Gambio .html · WooCommerce /product/ · path depth
3. Manufacturer: /brands/ crawled → brands set
4. BRAND_WORDS: 40+ patterns
5. Structural: depth × extension × navigation

Schema.org · 5 Layers · Nav Immunity · 40+ Patterns
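
The layering idea can be illustrated with a short sketch. Everything here is an assumption made for illustration: the layer functions, the three-entry BRAND_WORDS tuple (the real list is said to hold 40+ patterns), the structural rule, and the simplification that the first layer with a definite answer wins. Only the immunity of navigation-confirmed categories (catUrls) is taken directly from the description.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical subset; the real BRAND_WORDS list is not published here.
BRAND_WORDS = ("brand", "manufacturer", "hersteller")


def by_schema(url: str, ctx: dict) -> Optional[str]:
    schema_type = ctx.get("schema_types", {}).get(url)
    return {"Product": "product", "CollectionPage": "category",
            "Brand": "brand"}.get(schema_type)


def by_url_pattern(url: str, ctx: dict) -> Optional[str]:
    # WooCommerce-style product path.
    return "product" if "/product/" in urlsplit(url).path else None


def by_manufacturer(url: str, ctx: dict) -> Optional[str]:
    return "brand" if url in ctx.get("brand_urls", set()) else None


def by_brand_words(url: str, ctx: dict) -> Optional[str]:
    path = urlsplit(url).path.lower()
    return "brand" if any(word in path for word in BRAND_WORDS) else None


def by_structure(url: str, ctx: dict) -> Optional[str]:
    # Assumed rule: a file extension suggests a product, otherwise a category.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return None
    return "product" if "." in segments[-1] else "category"


LAYERS = (by_schema, by_url_pattern, by_manufacturer, by_brand_words, by_structure)


def classify(url: str, ctx: dict) -> str:
    """Return 'product', 'category', 'brand' or 'info' for one URL."""
    if url in ctx.get("cat_urls", set()):  # navigation-confirmed: immune
        return "category"
    for layer in LAYERS:
        label = layer(url, ctx)
        if label is not None:
            return label
    return "info"
```

Keeping each layer as a small function that returns either a label or nothing makes it easy to reorder layers or add CMS-specific ones without touching the others.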

⚡P4

Phase 4 · Enrichment

Batch Fetch: 20 URLs in Parallel

All classified URLs are fetched through the server proxy in parallel batches of 20. The following fields are extracted: title, description, OG tags, Schema.org JSON-LD, prices, currency, brand, and H1–H3 headings. Soft-404s are detected via HTML analysis.

Batch: 20 URLs/call · avg 0.38s/batch
Extracted: title · desc · schema · price · currency · brand · headings
Soft-404: HTML body analysis · hStr() heading normalization

Parallel Fetch · JSON-LD · Price Data · Soft-404
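
Batching of this kind might look like the sketch below. It uses a thread pool rather than the server proxy described above, and fetch_meta is a hypothetical callable standing in for whatever extracts metadata from one URL.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator

BATCH_SIZE = 20  # batch size stated above


def batches(items: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def _safe(fetch_meta: Callable[[str], dict], url: str) -> dict:
    """Run one fetch and turn failures into an error record."""
    try:
        return fetch_meta(url)
    except Exception as exc:  # one bad URL should not abort the batch
        return {"error": str(exc)}


def enrich(urls: list[str], fetch_meta: Callable[[str], dict]) -> dict[str, dict]:
    """Fetch metadata batch by batch, each batch fully in parallel."""
    results: dict[str, dict] = {}
    for batch in batches(urls):
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            metas = pool.map(lambda u: _safe(fetch_meta, u), batch)
            for url, meta in zip(batch, metas):  # map() preserves input order
                results[url] = meta
    return results
```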

🔄P5

Phase 5 · Reclassification

Metadata Refinement (reClassify)

With enriched metadata, classifications are refined. Price patterns in titles identify additional products. Brands and nav-categories remain permanently protected.

Patterns: price · buy · add to cart · in stock · available
Protection: catUrls + confirmedBrandUrls block misclassification
WooCommerce: depth≥2 without /product/ = always category

Price Detection · Nav Protection · Brand Immunity
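
A rough sketch of the refinement step, assuming the listed patterns are matched as whole words in title and description. The function name reclassify and the meta dictionary keys are hypothetical; the two protected sets correspond to catUrls and confirmedBrandUrls.

```python
import re

# Word patterns from the list above, matched case-insensitively.
PRODUCT_HINTS = re.compile(r"\b(price|buy|add to cart|in stock|available)\b",
                           re.IGNORECASE)


def reclassify(url: str, current: str, meta: dict,
               cat_urls: set[str], confirmed_brand_urls: set[str]) -> str:
    """Refine a label using enriched metadata; protected URLs never change."""
    if url in cat_urls:
        return "category"
    if url in confirmed_brand_urls:
        return "brand"
    text = " ".join((meta.get("title", ""), meta.get("description", "")))
    if meta.get("price") is not None or PRODUCT_HINTS.search(text):
        return "product"
    return current
```

Checking the protected sets first means a brand page that happens to carry product schema or a price still keeps its brand label.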

⭐P6

Phase 6 · Featured

Sale Detection & Priority Scoring

Sale and new product pages (specials.php, /sale/) are crawled automatically. Linked products receive featured=true + priority 0.95 and appear first in the "⭐ Recommended" preset.

Sources: specials.php · products_new.php · /sale/ · /offers/
Scoring: 0.95 sale · 0.90 new products · 0.85 regular

Sale Items · New Products · Priority Scoring
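
The scoring table translates into a few lines of code. This sketch assumes that products linked from new-product pages are also flagged as featured, which the description does not state explicitly; score and recommended are hypothetical names.

```python
# Priorities from the scoring line above.
PRIORITY = {"sale": 0.95, "new": 0.90, "regular": 0.85}


def score(url: str, sale_links: set[str], new_links: set[str]) -> dict:
    """Attach featured flag and priority to one product URL."""
    if url in sale_links:
        kind = "sale"
    elif url in new_links:
        kind = "new"
    else:
        kind = "regular"
    return {"url": url, "featured": kind != "regular", "priority": PRIORITY[kind]}


def recommended(urls: list[str], sale_links: set[str], new_links: set[str]) -> list[dict]:
    """Order products so that the highest priority comes first."""
    scored = [score(u, sale_links, new_links) for u in urls]
    return sorted(scored, key=lambda item: item["priority"], reverse=True)
```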

📄P7

Phase 7 · Generation

5 Output Formats Simultaneously

All data flows into five output formats at once. Each file includes Smart Titles, 26+ industry-specific intent routings, and a date header. The result is delivered as a ZIP archive, ready for download.

Smart Titles: Schema.org → URL slug → path fallback
26+ Intents: automatically inserted by detected industry

llms.txt · JSON · YAML · robots.txt · ZIP
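
The Smart Title fallback chain (Schema.org name, then URL slug, then path) could be sketched like this. Both function names are hypothetical, and the link-list layout in render_section follows the public llms.txt proposal rather than any confirmed output of this generator.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit


def smart_title(url: str, schema_name: Optional[str] = None) -> str:
    """Pick a title: Schema.org name, else readable URL slug, else raw path."""
    if schema_name and schema_name.strip():
        return schema_name.strip()
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if segments:
        slug = unquote(segments[-1]).rsplit(".", 1)[0]  # drop ".html" and similar
        words = slug.replace("-", " ").replace("_", " ").split()
        if words:
            return " ".join(word.capitalize() for word in words)
    return path or "/"


def render_section(heading: str, entries: list[tuple[str, Optional[str]]]) -> str:
    """Render one llms.txt-style section as a Markdown link list."""
    lines = [f"## {heading}", ""]
    for url, schema_name in entries:
        lines.append(f"- [{smart_title(url, schema_name)}]({url})")
    return "\n".join(lines)
```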

Output Formats

5 Finished Files in One ZIP

All formats are generated in a single run — ready to upload immediately.

📋

llms.txt

Standard for ChatGPT, Claude & Perplexity. Compact, directly citable.

📄

llms-full.txt

Extended format with full texts and structured product data.

🗃️

llms-data.json

Machine-readable JSON for APIs, cron jobs and further processing.

📝

llms-meta.yaml

YAML for developers, CI/CD pipelines and technical documentation.

🤖

robots-llms.txt

Snippet for direct integration into your existing robots.txt.

Algorithm Details

Technical Recognition Logic

Every mechanism was developed and refined through tests with real shops.

🏭 CMS Detection

Gambio, WooCommerce, Shopware, JTL, PrestaShop: each CMS has its own URL patterns. Gambio, for example, uses .html suffixes, while WooCommerce uses /product/ paths. Classification rules adapt automatically to the detected CMS.

💰 Price & Currency Extraction

Prices are extracted from the Schema.org fields offers.price and offers.priceCurrency (ISO 4217 currency codes). Price ranges and multi-level offer structures are handled correctly.
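
As an illustration, price extraction from a JSON-LD block might be sketched as below. The function extract_prices is hypothetical; lowPrice and highPrice are the Schema.org AggregateOffer properties and are assumed to be what "price ranges" refers to. Prices in localized formats such as "1.299,00" would need extra parsing that is omitted here.

```python
import json
from typing import Optional


def extract_prices(json_ld: str) -> list[tuple[float, Optional[str]]]:
    """Return (amount, currency) pairs for every Product offer in a JSON-LD block."""
    data = json.loads(json_ld)
    if isinstance(data, dict):
        nodes = data.get("@graph", [data])  # some shops wrap nodes in @graph
    else:
        nodes = data
    found: list[tuple[float, Optional[str]]] = []
    for node in nodes:
        if not isinstance(node, dict):
            continue
        types = node.get("@type", [])
        types = types if isinstance(types, list) else [types]
        if "Product" not in types:
            continue
        offers = node.get("offers", [])
        for offer in offers if isinstance(offers, list) else [offers]:
            if not isinstance(offer, dict):
                continue
            currency = offer.get("priceCurrency")
            for key in ("price", "lowPrice", "highPrice"):
                if key in offer:
                    found.append((float(offer[key]), currency))
    return found
```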

🏷️ Brand Immunization

Detected brand URLs (confirmedBrandUrls) are immune to all later reclassifications — even when product schema is present (common CMS behavior).

🔒 Soft-404 Detection

HTTP 200 responses on non-existent pages are detected via HTML body analysis — not via Content-Type, which can mislead on misconfigured servers.
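
A simple form of body-based soft-404 detection is sketched below. The phrase list and the choice to inspect only the title and H1 are assumptions for illustration; the actual heuristics are not documented here.

```python
import re

# Hypothetical phrase list; a production list would likely be longer and multilingual.
NOT_FOUND_PHRASES = re.compile(
    r"\b404\b|not found|no longer available|does not exist", re.IGNORECASE)
HEADLINE = re.compile(r"<(title|h1)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
TAGS = re.compile(r"<[^>]+>")


def is_soft_404(status: int, html: str) -> bool:
    """True when a 200 response looks like an error page judging by its HTML."""
    if status != 200:
        return False  # a real error status is not a *soft* 404
    headlines = " ".join(TAGS.sub(" ", match.group(2))
                         for match in HEADLINE.finditer(html))
    return bool(NOT_FOUND_PHRASES.search(headlines))
```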

📊 Heading Normalization

Proxy responses return headings as objects {tag, text}. hStr() normalizes to strings and prevents crashes on mobile devices.
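
A Python equivalent of this normalization could look like the sketch below. The name h_str mirrors hStr(), but the body is an assumption about its behavior based only on the {tag, text} shape described above.

```python
from typing import Any


def h_str(heading: Any) -> str:
    """Normalize a heading that may be a string or a {tag, text} object."""
    if isinstance(heading, str):
        return heading.strip()
    if isinstance(heading, dict):
        return str(heading.get("text") or "").strip()
    return ""  # None or unexpected types become an empty string


def heading_texts(headings: list) -> list[str]:
    """Normalize a mixed list and drop empty entries."""
    return [text for text in map(h_str, headings) if text]
```

Returning an empty string for unexpected input, instead of raising, is what keeps downstream string operations from failing on malformed proxy responses.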

🧭 26+ Intent Routings

Product recommendations, delivery times, price comparisons, installation guides — automatically inserted as routings, matched to the detected industry.

Experience the Pipeline Live

Enter your shop URL — all 7 phases run in under 60 seconds.
