Knows.nl

Technical website intelligence, screenshots, and crawl history.

About This Project

This project is built for fun and learning. The main goal is to learn how to crawl, store, and analyze large datasets in a practical setup with MySQL and S3-compatible storage. Parts of it are "vibe-coded". Years ago there was a handcrafted PHP version; it worked, but it was much less structured, and harder and slower to maintain, than this rebuild. I know this contributes to the noise on the internet. That’s fine, and I’m not sorry, but you can always have your site removed.

Why It Exists

  • Learn scalable crawling patterns.
  • Practice data modeling and deduplication.
  • Measure storage growth over time.
  • Keep the architecture stateless for container environments.
  • Test ZFS compression and dedup.
  • Run everything on slow homelab hardware, so it has to be very efficient.

Current Dataset

URLs: 1804297
Fetches: 1119863
Screenshots: 685473
Site details: 1012076

Storage Footprint

MySQL total size: 5.5 GiB (5901303808 bytes)
S3 total size: 2.6 GiB (2814687521 bytes)
S3 total files: 121850
S3 snapshot updated: 2026-05-28 17:05:09
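
The readable sizes above appear to use binary units (1 GiB = 1024^3 bytes) rounded to one decimal. A minimal sketch of that conversion follows; the function name `readable_size` is hypothetical and not taken from the project, and the rounding rule is an assumption inferred from the figures shown.

```python
def readable_size(num_bytes: int) -> str:
    """Format a byte count with binary (1024-based) units and one decimal.

    Assumption: one-decimal rounding, inferred from the figures on this page.
    """
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} {units[-1]}"  # not reached; keeps type checkers happy


if __name__ == "__main__":
    print(readable_size(5901303808))  # MySQL total size in bytes
    print(readable_size(2814687521))  # S3 total size in bytes
```

Applied to the two byte counts listed above, this yields 5.5 GiB and 2.6 GiB, matching the reported values.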

S3 Breakdown

Prefix        Files   Bytes       Readable
screenshots/  121850  2814687521  2.6 GiB

Top MySQL Tables By Size

Table               Rows (est.)  Size
site_enrichments    977267       2.7 GiB
urls                1707922      625.6 MiB
fetches             1029849      408.6 MiB
domain_scores       1668731      369.9 MiB
links               1467341      284.2 MiB
jobs                759755       224.5 MiB
fetch_observations  774751       221.1 MiB
render_snapshots    635056       209.8 MiB
site_profiles       951142       201.7 MiB
screenshots         664997       190.7 MiB

`Rows (est.)` comes from MySQL metadata and is approximate.
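
For readers curious where such numbers can come from: MySQL exposes per-table metadata in `information_schema.tables`, where `table_rows` is only an estimate for InnoDB tables. The sketch below shows one way to list the largest tables. It is not the project's actual code: the helper name `top_tables`, the schema name passed in, and the choice to count size as `data_length + index_length` are all assumptions. It accepts any DB-API cursor that uses the `%s` parameter style, as the common MySQL drivers do.

```python
from typing import Any, List, Tuple

# Assumption: "size" means data plus indexes; the page does not say which it uses.
TOP_TABLES_SQL = """
SELECT table_name, table_rows, data_length + index_length AS total_bytes
FROM information_schema.tables
WHERE table_schema = %s
ORDER BY total_bytes DESC
LIMIT %s
"""


def top_tables(cursor: Any, schema: str, limit: int = 10) -> List[Tuple[str, int, int]]:
    """Return (table name, estimated rows, total bytes) for the largest tables.

    `cursor` is any DB-API cursor connected to MySQL. NULL metadata values
    (possible for views, for example) are mapped to 0.
    """
    cursor.execute(TOP_TABLES_SQL, (schema, limit))
    return [
        (name, int(rows or 0), int(size or 0))
        for name, rows, size in cursor.fetchall()
    ]
```

Because the row counts come from storage engine statistics rather than `COUNT(*)`, they can differ from the exact totals shown under Current Dataset.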