How the Algorithm Works
From URL input to 5 finished output files: 7 sequential phases, fully automated, with no manual configuration, explained in full here.
7 Phases in Detail
Each phase builds on the previous one. Higher recognition layers only override lower ones when signals are more definitive — keeping classifications stable and accurate.
The run starts with robots.txt: disallow rules are recorded and sitemap paths are extracted. XML and gzip-compressed sitemaps (.xml.gz) are loaded, and index sitemaps are resolved recursively. If no sitemap is found, a structured crawl of the homepage takes its place.
Gambio · WooCommerce · Shopware · Shopify · JTL · TYPO3 · Joomla
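The discovery step can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: the function names (`parse_robots`, `sitemap_locs`, `collect_urls`) and the injected `fetch` callable are assumptions made for the example.

```python
import gzip
import xml.etree.ElementTree as ET


def parse_robots(text: str) -> tuple[list[str], list[str]]:
    """Return (disallow_rules, sitemap_urls) from robots.txt content."""
    disallow, sitemaps = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key, value = key.strip().lower(), value.strip()
        if key == "disallow" and value:
            disallow.append(value)
        elif key == "sitemap" and value:
            sitemaps.append(value)
    return disallow, sitemaps


def sitemap_locs(payload: bytes) -> tuple[str, list[str]]:
    """Return (root element name, <loc> values); unpacks gzip payloads."""
    if payload[:2] == b"\x1f\x8b":  # gzip magic bytes
        payload = gzip.decompress(payload)
    root = ET.fromstring(payload)
    local = lambda tag: tag.rsplit("}", 1)[-1]  # strip the XML namespace
    locs = [el.text.strip() for entry in root for el in entry
            if local(el.tag) == "loc" and el.text]
    return local(root.tag), locs


def collect_urls(start: str, fetch, seen=None) -> list[str]:
    """Resolve index sitemaps recursively; `fetch` maps a URL to bytes."""
    seen = set() if seen is None else seen
    if start in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(start)
    kind, locs = sitemap_locs(fetch(start))
    if kind == "sitemapindex":
        urls = []
        for child in locs:
            urls.extend(collect_urls(child, fetch, seen))
        return urls
    return locs
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client or proxy.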
All collected URLs are normalized, restricted to the base domain, and deduplicated. System URLs — admin, API, checkout, login, cart — are automatically filtered out.
Filtered: /admin/ · /checkout/ · /api/ · /login/ · /?action=
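A rough sketch of this cleanup step, assuming simple substring matching against the filter list shown above. The helper names and the exact normalization rules (lowercased host, dropped fragment) are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

# Markers taken from the filter list above; substring matching is an assumption.
SYSTEM_MARKERS = ("/admin/", "/checkout/", "/api/", "/login/", "/?action=")


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def clean_urls(urls, base_domain: str) -> list[str]:
    """Normalize, keep only the base domain, drop system URLs, deduplicate."""
    seen, result = set(), []
    for url in urls:
        norm = normalize(url)
        host = urlsplit(norm).hostname or ""
        if host != base_domain and not host.endswith("." + base_domain):
            continue  # foreign domain
        if any(marker in norm for marker in SYSTEM_MARKERS):
            continue  # admin, checkout, API, login, action URLs
        if norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result
```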
The algorithmic core. Each URL passes through five recognition layers. Navigation-confirmed categories (catUrls) are permanently immunized against reclassification.
2. URL Pattern: Gambio .html · WooCommerce /product/ · path depth
3. Manufacturer: /brands/ crawled → brands set
4. BRAND_WORDS: 40+ patterns
5. Structural: depth × extension × navigation
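One way to picture the layering is a chain of small functions, each returning a label with a confidence, where a later verdict replaces the current one only if it is more definitive. The sketch below shows three hypothetical layers. The confidence values, labels, and function names are invented for illustration and are not taken from the generator.

```python
from typing import Callable, Optional
from urllib.parse import urlsplit

Verdict = tuple[str, float]  # (label, confidence between 0 and 1)
Layer = Callable[[str], Optional[Verdict]]


def url_pattern_layer(url: str) -> Optional[Verdict]:
    path = urlsplit(url).path
    if "/product/" in path:        # WooCommerce-style product path
        return ("product", 0.9)
    if path.endswith(".html"):     # Gambio-style product page
        return ("product", 0.6)
    return None


def make_brand_layer(brand_urls: set[str]) -> Layer:
    def layer(url: str) -> Optional[Verdict]:
        return ("brand", 0.95) if url in brand_urls else None
    return layer


def structural_layer(url: str) -> Optional[Verdict]:
    depth = len([s for s in urlsplit(url).path.split("/") if s])
    return ("category", 0.4) if depth <= 1 else ("product", 0.3)


def classify(url: str, layers: list[Layer], cat_urls: set[str]) -> Verdict:
    if url in cat_urls:            # navigation-confirmed: never reclassified
        return ("category", 1.0)
    best: Verdict = ("unknown", 0.0)
    for layer in layers:
        verdict = layer(url)
        if verdict and verdict[1] > best[1]:  # override only when more definitive
            best = verdict
    return best
```

Because the `catUrls` check runs before any layer, no later signal can change a navigation-confirmed category.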
All classified URLs are fetched concurrently in batches of 20 via the server proxy. Extracted: title, description, OG tags, Schema.org JSON-LD, prices, currency, brand, H1–H3 headings. Soft-404s are detected via HTML analysis.
Extracted: title · desc · schema · price · currency · brand · headings
Soft-404: HTML body analysis · hStr() heading normalization
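Batched fetching of this kind can be sketched with asyncio. The `fetch_one` coroutine stands in for the call to the server proxy and is an assumption, as is the choice to record failed fetches as `None`.

```python
import asyncio

BATCH_SIZE = 20  # batch size named in the text above


async def fetch_in_batches(urls, fetch_one, batch_size=BATCH_SIZE):
    """Fetch URLs batch by batch; each batch runs concurrently.

    `fetch_one` is a hypothetical coroutine returning extracted metadata.
    """
    results = {}
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        pages = await asyncio.gather(
            *(fetch_one(url) for url in batch), return_exceptions=True
        )
        for url, page in zip(batch, pages):
            # A failed fetch is stored as None instead of aborting the run.
            results[url] = None if isinstance(page, Exception) else page
    return results
```

Waiting for each batch to finish before starting the next caps the number of open requests at the batch size.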
With enriched metadata, classifications are refined. Price patterns in titles identify additional products. Brands and nav-categories remain permanently protected.
Protection: catUrls + confirmedBrandUrls block misclassification
WooCommerce: depth≥2 without /product/ = always category
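The refinement rules above might look like this in code. The order of the checks, the price regex, and the `refine` signature are assumptions made for the sketch. Only the protection sets (`catUrls`, `confirmedBrandUrls`) and the WooCommerce depth rule come from the description.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: a number followed by a currency marker, e.g. "49,90 €".
PRICE_IN_TITLE = re.compile(r"\d+(?:[.,]\d{2})?\s?(?:€|EUR|USD|\$)")


def refine(url, label, meta, cat_urls, confirmed_brand_urls, cms=None):
    """Return the refined label for one URL after metadata enrichment."""
    if url in cat_urls:               # protected: navigation-confirmed category
        return "category"
    if url in confirmed_brand_urls:   # protected: confirmed brand page
        return "brand"
    path = urlsplit(url).path
    depth = len([s for s in path.split("/") if s])
    if cms == "woocommerce" and depth >= 2 and "/product/" not in path:
        return "category"             # WooCommerce rule from the text above
    if PRICE_IN_TITLE.search(meta.get("title", "")):
        return "product"              # a price in the title suggests a product
    return label
```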
Sale and new product pages (specials.php, /sale/) are crawled automatically. Linked products receive featured=true plus a raised priority (up to 0.95) and appear first in the "⭐ Recommended" preset.
Scoring: 0.95 sale · 0.90 new products · 0.85 regular
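The scoring table translates into a few lines of code. In this sketch the function names are hypothetical, and setting `featured` for products linked from either a sale page or a new-products page is an assumption.

```python
def priority(url: str, sale_urls: set[str], new_urls: set[str]) -> float:
    """Priority values from the scoring line above."""
    if url in sale_urls:
        return 0.95
    if url in new_urls:
        return 0.90
    return 0.85


def rank(product_urls, sale_urls, new_urls):
    """Attach priority and featured flag, highest priority first."""
    scored = [
        {
            "url": url,
            "priority": priority(url, sale_urls, new_urls),
            "featured": url in sale_urls or url in new_urls,  # assumption
        }
        for url in product_urls
    ]
    # sorted() is stable, so products with equal priority keep their order.
    return sorted(scored, key=lambda p: p["priority"], reverse=True)
```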
All data flows into five output formats at once. Smart Titles, 26+ industry-specific intent routings, date header. Ready as a ZIP archive for download.
26+ Intents: automatically inserted by detected industry
5 Finished Files in One ZIP
All formats are generated in a single run — ready to upload immediately.
Technical Recognition Logic
Every mechanism was developed and refined through tests with real shops.
🏭 CMS Detection
Gambio, WooCommerce, Shopware, JTL, PrestaShop — each CMS has unique URL patterns. Gambio: .html, WooCommerce: /product/. Classification rules adapt automatically.
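As a rough illustration of pattern-based detection, the sketch below counts how many URLs match each of the two patterns named above. The `min_share` threshold and the function name are assumptions, and a .html suffix alone is a weak signal, so a real detector would need more evidence than this.

```python
from collections import Counter
from urllib.parse import urlsplit

# Only the two patterns named in the text; other CMSs are omitted here.
CMS_PATTERNS = {
    "gambio": lambda path: path.endswith(".html"),
    "woocommerce": lambda path: "/product/" in path,
}


def guess_cms(urls, min_share: float = 0.3):
    """Return the CMS whose pattern matches the most URLs, or None."""
    counts = Counter()
    for url in urls:
        path = urlsplit(url).path
        for cms, matches in CMS_PATTERNS.items():
            if matches(path):
                counts[cms] += 1
    if not counts:
        return None
    cms, hits = counts.most_common(1)[0]
    return cms if hits / len(urls) >= min_share else None
```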
💰 Price & Currency Extraction
Prices extracted from offers.price + offers.priceCurrency (ISO-4217). Price ranges and multi-level offer structures are handled correctly.
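A minimal sketch of reading prices from Schema.org JSON-LD. It handles a single offer, a list of offers, and an AggregateOffer with lowPrice/highPrice. The function name and the (low, high, currency) return shape are assumptions, and prices written with a decimal comma would need extra handling.

```python
import json


def extract_offers(jsonld_text: str) -> list[tuple[float, float, str]]:
    """Return (low, high, currency) tuples for every Product offer found."""
    data = json.loads(jsonld_text)
    nodes = data if isinstance(data, list) else data.get("@graph", [data])
    results = []
    for node in nodes:
        if not isinstance(node, dict):
            continue
        types = node.get("@type")
        types = types if isinstance(types, list) else [types]
        if "Product" not in types:
            continue
        offers = node.get("offers", [])
        for offer in offers if isinstance(offers, list) else [offers]:
            currency = offer.get("priceCurrency")  # ISO 4217 code, e.g. "EUR"
            if "price" in offer:
                price = float(offer["price"])
                results.append((price, price, currency))
            elif "lowPrice" in offer:               # AggregateOffer price range
                low = float(offer["lowPrice"])
                high = float(offer.get("highPrice", low))
                results.append((low, high, currency))
    return results
```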
🏷️ Brand Immunization
Detected brand URLs (confirmedBrandUrls) are immune to all later reclassifications — even when product schema is present (common CMS behavior).
🔒 Soft-404 Detection
HTTP 200 responses on non-existent pages are detected via HTML body analysis — not via Content-Type, which can mislead on misconfigured servers.
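Body-based detection could work along these lines. The phrase list, the `min_chars` threshold, and the function names are assumptions chosen for the example, and a heuristic like this will need tuning against real shops.

```python
import re

# Hypothetical phrase list; a real one would cover more languages and themes.
NOT_FOUND_PHRASES = ("page not found", "seite nicht gefunden", "error 404")


def visible_text(html: str) -> str:
    """Strip scripts, styles, and tags; collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    return re.sub(r"\s+", " ", re.sub(r"(?s)<[^>]+>", " ", html)).strip()


def is_soft_404(status: int, html: str, min_chars: int = 200) -> bool:
    """True if a 200 response looks like an error page judged by its body."""
    if status != 200:
        return False  # a real error status is not a *soft* 404
    match = re.search(r"(?is)<title[^>]*>(.*?)</title>", html)
    title = match.group(1).lower() if match else ""
    if any(phrase in title for phrase in NOT_FOUND_PHRASES):
        return True
    text = visible_text(html).lower()
    # Short body plus an error phrase: likely a placeholder page.
    return len(text) < min_chars and any(p in text for p in NOT_FOUND_PHRASES)
```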
📊 Heading Normalization
Proxy responses return headings as objects {tag, text}. hStr() normalizes to strings and prevents crashes on mobile devices.
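The idea behind hStr() can be shown with a small Python analogue. The name `h_str` and the handling of missing values are assumptions, and the original helper presumably runs in the browser.

```python
def h_str(heading) -> str:
    """Normalize a heading to a plain string.

    Accepts a string, a {"tag": ..., "text": ...} object, or None.
    """
    if heading is None:
        return ""
    if isinstance(heading, str):
        return heading.strip()
    if isinstance(heading, dict):
        return str(heading.get("text") or "").strip()
    return str(heading).strip()  # last resort for unexpected types


headings = [{"tag": "h1", "text": " Running Shoes "}, "Sale", None]
print([h_str(h) for h in headings])
```

Routing every heading through one function means later code can call string methods without checking the type first.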
🧭 26+ Intent Routings
Product recommendations, delivery times, price comparisons, installation guides — automatically inserted as routings, matched to the detected industry.
Experience the Pipeline Live
Enter your shop URL — all 7 phases run in under 60 seconds.