Knows.nl

Technical website intelligence, screenshots, and crawl history.

About This Project

This project is built for fun and learning. The main goal is to learn how to crawl, store, and analyze large datasets in a practical setup with MySQL and S3-compatible storage. Parts of it are "vibe-coded". Years ago there was a handcrafted PHP version; it worked, but it was much less structured and both harder and slower to maintain than this rebuild. I know this contributes to the noise on the internet. That’s fine, and I’m not sorry, but you can always remove your site.

Why It Exists

  • Learn scalable crawling patterns.
  • Practice data modeling and deduplication.
  • Measure storage growth over time.
  • Keep architecture stateless for container environments.
  • Test ZFS compression and deduplication.
  • Do all of this on slow homelab hardware, so it has to be very efficient.

Current Dataset

URLs           1635570
Fetches        680918
Screenshots    453777
Site details   622982

Storage Footprint

MySQL total size      4.0 GiB (4306386944 bytes)
S3 total size         7.7 GiB (8260651399 bytes)
S3 total files        357272
S3 snapshot updated   2026-05-15 20:10:48
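
The sizes above pair a raw byte count with a readable value in binary units (1 GiB = 1024^3 bytes). As a rough sketch of that conversion, and not necessarily how this site computes it, the helper below (the name `format_bytes` is made up for illustration) divides by 1024 until the value fits the unit and rounds to one decimal place:

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count with binary (1024-based) units, one decimal place.

    Hypothetical helper for illustration; the site's own formatter may differ.
    """
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    # Step up one unit at a time until the value drops below 1024.
    for unit in units[:-1]:
        if value < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    # Anything that large is reported in the biggest unit we know.
    return f"{value:.1f} {units[-1]}"


if __name__ == "__main__":
    # Byte counts taken from the Storage Footprint figures.
    print(format_bytes(4306386944))
    print(format_bytes(8260651399))
```

Values just below a unit boundary can round up to "1024.0" of the smaller unit; a stricter formatter would promote those to the next unit.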

S3 Breakdown

Prefix         Files    Bytes        Readable
screenshots/   357272   8260651399   7.7 GiB

Top MySQL Tables By Size

Table                Rows (est.)   Size
site_enrichments     523521        1.7 GiB
urls                 1524441       548.4 MiB
domain_scores        1523354       334.8 MiB
fetches              655841        276.4 MiB
jobs                 999086        247.5 MiB
links                1031202       209.1 MiB
fetch_observations   648635        175.1 MiB
site_profiles        628929        149.7 MiB
render_snapshots     420461        136.7 MiB
screenshots          448133        131.6 MiB

`Rows (est.)` comes from MySQL metadata and is approximate.