✓ 7-Phase Pipeline ✓ 450+ Shops Tested ✓ Fully Automated ✓ No API Key

How the Algorithm Works

From URL input to 5 finished output files: 7 sequential phases, fully automated, with no manual configuration. Each phase is explained below.

Start Generator → All Features
2,000
URLs / Analysis
5
Classif. Layers
20×
Parallel Fetches
450+
Shops Tested
26+
Intent Routings
5
Output Formats
Live Analysis Log — solar-autark.com
$ analyse https://www.solar-autark.com

7 Phases in Detail

Each phase builds on the previous one. Higher recognition layers only override lower ones when signals are more definitive — keeping classifications stable and accurate.

🔍P1
Phase 1 · Discovery
robots.txt → Sitemap → Fallback Crawl

Discovery starts with robots.txt: disallow rules are recorded and sitemap paths are extracted. XML and gzip-compressed sitemaps (.xml.gz) are loaded, and sitemap indexes are resolved recursively. If no sitemap exists, the pipeline falls back to a structured crawl of the homepage.

✓ gzip · ✓ index recursion · ✓ disallow filter · ✓ up to 2,000 URLs
Gambio · WooCommerce · Shopware · Shopify · JTL · TYPO3 · Joomla
robots.txt · XML Sitemap · gzip · Recursion
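
The discovery steps above can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: parse_robots, allowed, collect_urls and the injected fetch callable are hypothetical names, all user-agent groups are merged, and disallow rules are treated as plain path prefixes.

```python
import gzip
import xml.etree.ElementTree as ET


def parse_robots(text):
    """Return (sitemap_urls, disallow_paths) found in a robots.txt body."""
    sitemaps, disallows = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key, value = key.strip().lower(), value.strip()
        if key == "sitemap" and value:
            sitemaps.append(value)
        elif key == "disallow" and value:
            disallows.append(value)
    return sitemaps, disallows


def allowed(path, disallows):
    """Simplified disallow filter: plain prefix match, no wildcards."""
    return not any(path.startswith(rule) for rule in disallows)


def _local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def collect_urls(sitemap_url, fetch, limit=2000, seen=None):
    """Load a sitemap via fetch(url) -> bytes, resolving index sitemaps recursively."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return []
    seen.add(sitemap_url)
    data = fetch(sitemap_url)
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes (.xml.gz)
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    locs = [child.text.strip()
            for entry in root
            for child in entry
            if _local(child.tag) == "loc" and child.text]
    if _local(root.tag) != "sitemapindex":
        return locs[:limit]
    urls = []
    for child_sitemap in locs:
        if len(urls) >= limit:
            break
        urls.extend(collect_urls(child_sitemap, fetch, limit - len(urls), seen))
    return urls[:limit]
```

Passing fetch in as a parameter keeps the sketch independent of any HTTP library; the seen set guards against index sitemaps that reference each other.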
🗂️P2
Phase 2 · Inventory
URL Normalization & Deduplication

All collected URLs are normalized, restricted to the base domain, and deduplicated. System URLs — admin, API, checkout, login, cart — are automatically filtered out.

Example: 685 sitemap URLs → 612 products + 61 categories + 12 info pages
Filtered: /admin/ · /checkout/ · /api/ · /login/ · /?action=
Normalization · Dedup · System Filter
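
A rough sketch of the inventory step, under stated assumptions: the exact normalization rules are not documented here, so forcing https, dropping "www.", fragments and trailing slashes are illustrative choices, and normalize, build_inventory and SYSTEM_MARKERS are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit

# System paths named in the text above; the matching logic is an assumption.
SYSTEM_MARKERS = ("/admin/", "/checkout/", "/api/", "/login/", "/cart/")


def normalize(url):
    """Lowercase the host, drop 'www.', the fragment and a trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    return urlunsplit(("https", host, path, parts.query, ""))


def build_inventory(urls, base_host):
    """Normalize, restrict to base_host, drop system URLs, deduplicate in order."""
    seen, result = set(), []
    for url in urls:
        norm = normalize(url)
        parts = urlsplit(norm)
        if parts.hostname != base_host:
            continue  # outside the base domain
        if any(marker in parts.path + "/" for marker in SYSTEM_MARKERS):
            continue  # admin, checkout, API, login, cart
        if "action=" in parts.query:
            continue  # /?action= style system URLs
        if norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result
```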
🧠P3
Phase 3 · Classification
5-Layer Classification (autoCat)

The algorithmic core. Each URL passes through five recognition layers. Navigation-confirmed categories (catUrls) are permanently immunized against reclassification.

1. Schema.org: Product · CollectionPage · Brand
2. URL Pattern: Gambio .html · WooCommerce /product/ · path depth
3. Manufacturer: /brands/ crawled → brands set
4. BRAND_WORDS: 40+ patterns
5. Structural: depth × extension × navigation
Schema.org · 5 Layers · Nav Immunity · 40+ Patterns
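
The layered idea can be illustrated with a small sketch. It is not the actual autoCat implementation: only three of the five layers are shown, the confidence values are invented for illustration, and auto_cat, the layer functions and the page dictionary shape are assumptions.

```python
from urllib.parse import urlsplit

SCHEMA_MAP = {"Product": "product", "CollectionPage": "category", "Brand": "brand"}


def layer_schema(page):
    """Layer 1: Schema.org type, treated here as the most definitive signal."""
    label = SCHEMA_MAP.get(page.get("schema_type"))
    return (label, 0.9) if label else None


def layer_url_pattern(page):
    """Layer 2: CMS-style URL patterns."""
    path = urlsplit(page["url"]).path
    if "/product/" in path:
        return ("product", 0.8)
    if "/brands/" in path:
        return ("brand", 0.8)
    return None


def layer_structure(page):
    """Layer 5: weak structural guess from path depth."""
    depth = len([s for s in urlsplit(page["url"]).path.split("/") if s])
    return ("category", 0.4) if depth <= 1 else ("product", 0.4)


LAYERS = (layer_schema, layer_url_pattern, layer_structure)


def auto_cat(page, cat_urls=frozenset()):
    """Return a label; a layer only wins if its signal is more definitive."""
    if page["url"] in cat_urls:
        return "category"  # navigation-confirmed categories are immune
    best = ("info", 0.0)
    for layer in LAYERS:
        verdict = layer(page)
        if verdict and verdict[1] > best[1]:
            best = verdict
    return best[0]
```

Checking cat_urls before any layer runs is what makes the immunity unconditional: no later signal, however strong, gets a chance to vote.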
P4
Phase 4 · Enrichment
Batch Fetch: 20 URLs in Parallel

All classified URLs are fetched in batches of 20 simultaneously via the server proxy. Extracted: title, description, OG tags, Schema.org JSON-LD, prices, currency, brand, H1–H3 headings. Soft-404s detected via HTML analysis.

Batch: 20 URLs/call · avg 0.38s/batch
Extracted: title · desc · schema · price · currency · brand · headings
Soft-404: HTML body analysis · hStr() heading normalization
Parallel Fetch · JSON-LD · Price Data · Soft-404
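
Batching of this kind might look like the sketch below. The real pipeline goes through a server proxy that is not shown here; enrich_all, chunks and the injected fetch_one coroutine are hypothetical, and fake_fetch only stands in for a real HTTP call.

```python
import asyncio


def chunks(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def enrich_all(urls, fetch_one, batch_size=20):
    """Fetch URLs batch by batch; each batch runs concurrently."""
    results = {}
    for batch in chunks(urls, batch_size):
        pages = await asyncio.gather(
            *(fetch_one(url) for url in batch),
            return_exceptions=True,  # one failing URL must not sink the batch
        )
        for url, page in zip(batch, pages):
            results[url] = None if isinstance(page, Exception) else page
    return results


async def fake_fetch(url):
    """Stand-in fetcher for demonstration only."""
    if url.endswith("/7"):
        raise ValueError("simulated fetch error")
    return {"title": "Page " + url.rsplit("/", 1)[-1]}


if __name__ == "__main__":
    demo_urls = ["https://shop.example/%d" % i for i in range(45)]
    enriched = asyncio.run(enrich_all(demo_urls, fake_fetch))
    print(len(enriched), "URLs processed")
```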
🔄P5
Phase 5 · Reclassification
Metadata Refinement (reClassify)

With enriched metadata, classifications are refined. Price patterns in titles identify additional products. Brands and nav-categories remain permanently protected.

Patterns: price · buy · add to cart · in stock · available
Protection: catUrls + confirmedBrandUrls block misclassification
WooCommerce: depth≥2 without /product/ = always category
Price Detection · Nav Protection · Brand Immunity
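
A simplified sketch of the refinement step, assuming the protected sets are checked before any pattern is applied. The regular expressions and the re_classify signature are illustrative guesses, not the actual reClassify rules, and the WooCommerce depth rule is left out.

```python
import re

# Illustrative patterns only; the real pattern list is not documented here.
PRICE_RE = re.compile(r"[€$£]\s?\d|\d[\d.,]*\s?(?:€|(?:EUR|USD)(?![A-Za-z]))")
SIGNAL_RE = re.compile(r"\b(?:buy|add to cart|in stock|available)\b", re.IGNORECASE)


def re_classify(url, label, title, cat_urls, confirmed_brand_urls):
    """Refine a label from enriched metadata without touching protected URLs."""
    if url in cat_urls:
        return "category"  # nav-confirmed categories stay categories
    if url in confirmed_brand_urls:
        return "brand"     # brand pages stay brands, even with product signals
    if PRICE_RE.search(title) or SIGNAL_RE.search(title):
        return "product"
    return label
```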
P6
Phase 6 · Featured
Sale Detection & Priority Scoring

Sale and new product pages (specials.php, /sale/) are crawled automatically. Linked products receive featured=true + priority 0.95 and appear first in the "⭐ Recommended" preset.

Sources: specials.php · products_new.php · /sale/ · /offers/
Scoring: 0.95 sale · 0.90 new products · 0.85 regular
Sale Items · New Products · Priority Scoring
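
The scoring table translates into a few lines of code. In this sketch, score_products and its arguments are hypothetical, and marking new products as featured (not only sale items) is an assumption.

```python
# Priorities taken from the scoring line above.
PRIORITY = {"sale": 0.95, "new": 0.90, "regular": 0.85}


def score_products(product_urls, sale_links, new_links):
    """Attach featured flag and priority, then sort highest priority first."""
    scored = []
    for url in product_urls:
        if url in sale_links:
            kind = "sale"
        elif url in new_links:
            kind = "new"
        else:
            kind = "regular"
        scored.append({
            "url": url,
            "featured": kind != "regular",  # assumption: new items count as featured too
            "priority": PRIORITY[kind],
        })
    # sorted() is stable, so equal priorities keep their original order
    return sorted(scored, key=lambda item: item["priority"], reverse=True)
```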
📄P7
Phase 7 · Generation
5 Output Formats Simultaneously

All data flows into five output formats at once, with Smart Titles, 26+ industry-specific intent routings, and a date header. The result is packaged as a ZIP archive for download.

Smart Titles: Schema.org → URL slug → path fallback
26+ Intents: automatically inserted by detected industry
llms.txt · JSON · YAML · robots.txt · ZIP
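
The Smart Title fallback chain could be sketched like this. How the real generator cleans slugs is not specified, so smart_title, the extension stripping and the reading of "path fallback" as "use the whole path when the slug has no letters" are all assumptions.

```python
import re
from urllib.parse import unquote, urlsplit


def smart_title(url, schema_name=None):
    """Pick a title: Schema.org name, else URL slug, else the path itself."""
    if schema_name and schema_name.strip():
        return schema_name.strip()
    parts = urlsplit(url)
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if not segments:
        return parts.hostname or url
    slug = re.sub(r"\.(?:html?|php)$", "", segments[-1])
    if re.search(r"[^\W\d_]", slug):  # slug contains at least one letter
        words = [w for w in re.split(r"[-_]+", slug) if w]
        return " ".join(w.capitalize() for w in words)
    return " / ".join(segments)  # e.g. numeric IDs: fall back to the path
```

For example, a URL ending in /solar/pv-modul-400w.html with no Schema.org name would yield "Pv Modul 400w" in this sketch.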

5 Finished Files in One ZIP

All formats are generated in a single run — ready to upload immediately.

📋
llms.txt
Standard for ChatGPT, Claude & Perplexity. Compact, directly citable.
📄
llms-full.txt
Extended format with full texts and structured product data.
🗃️
llms-data.json
Machine-readable JSON for APIs, cron jobs and further processing.
📝
llms-meta.yaml
YAML for developers, CI/CD pipelines and technical documentation.
🤖
robots-llms.txt
Snippet for direct integration into your existing robots.txt.

Technical Recognition Logic

Every mechanism was developed and refined through tests with real shops.

🏭 CMS Detection

Gambio, WooCommerce, Shopware, JTL, PrestaShop — each CMS has unique URL patterns. Gambio: .html, WooCommerce: /product/. Classification rules adapt automatically.

💰 Price & Currency Extraction

Prices are extracted from offers.price and offers.priceCurrency (ISO 4217). Price ranges and multi-level offer structures are also handled.
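
As an illustration, price extraction from a JSON-LD block might look like the sketch below. extract_price and its return shape are hypothetical; reading lowPrice and highPrice from a Schema.org AggregateOffer is one plausible way to cover price ranges, not necessarily the generator's.

```python
import json


def _types(node):
    """Return the node's @type as a list (it may be a string or a list)."""
    value = node.get("@type", [])
    return [value] if isinstance(value, str) else list(value)


def extract_price(json_ld_text):
    """Return {'price', 'price_high', 'currency'} for the first Product, else None."""
    try:
        data = json.loads(json_ld_text)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict):
        data = data.get("@graph", [data])  # unwrap @graph containers
    if not isinstance(data, list):
        return None
    for node in data:
        if not isinstance(node, dict) or "Product" not in _types(node):
            continue
        offers = node.get("offers")
        if isinstance(offers, list):  # several offers: take the first
            offers = offers[0] if offers else None
        if not isinstance(offers, dict):
            continue
        try:
            price = float(offers.get("price", offers.get("lowPrice")))
            high = offers.get("highPrice")
            high = float(high) if high is not None else None
        except (TypeError, ValueError):
            continue  # missing or non-numeric price
        return {"price": price, "price_high": high,
                "currency": offers.get("priceCurrency")}
    return None
```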

🏷️ Brand Immunization

Detected brand URLs (confirmedBrandUrls) are immune to all later reclassifications — even when product schema is present (common CMS behavior).

🔒 Soft-404 Detection

HTTP 200 responses on non-existent pages are detected via HTML body analysis — not via Content-Type, which can mislead on misconfigured servers.
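
One simple form of body analysis is to look for "not found" wording in the title and H1. The actual heuristics are not documented here, so the phrases below and the names is_soft_404 and _TitleAndH1 are illustrative assumptions.

```python
import re
from html.parser import HTMLParser

# Illustrative phrases; a production list would likely be longer.
NOT_FOUND_RE = re.compile(
    r"\b(?:error 404|404 error|(?:page )?not found|nicht gefunden)\b",
    re.IGNORECASE,
)


class _TitleAndH1(HTMLParser):
    """Collect the text inside <title> and <h1> elements."""

    def __init__(self):
        super().__init__()
        self._current = None
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.texts.append(data)


def is_soft_404(status, html):
    """True if an HTTP 200 page looks like an error page judging by its body."""
    if status != 200:
        return False  # real error codes are not *soft* 404s
    parser = _TitleAndH1()
    parser.feed(html)
    return any(NOT_FOUND_RE.search(text) for text in parser.texts)
```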

📊 Heading Normalization

Proxy responses return headings as objects {tag, text}. hStr() normalizes to strings and prevents crashes on mobile devices.
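
In Python terms, the normalization amounts to something like the following. The original hStr() is presumably JavaScript; h_str is a hypothetical equivalent, and returning an empty string for unexpected input is an assumption.

```python
def h_str(heading):
    """Normalize a heading that may be a string or a {tag, text} object."""
    if isinstance(heading, str):
        return heading.strip()
    if isinstance(heading, dict):
        return str(heading.get("text") or "").strip()
    return ""  # None or unexpected types: fail soft instead of crashing


def normalize_headings(headings):
    """Apply h_str to a mixed list and drop empty results."""
    return [text for text in (h_str(h) for h in headings or []) if text]
```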

🧭 26+ Intent Routings

Product recommendations, delivery times, price comparisons, installation guides — automatically inserted as routings, matched to the detected industry.

Experience the Pipeline Live

Enter your shop URL — all 7 phases run in under 60 seconds.

Start Generator → 📄 Product Overview (PDF)