Knows.nl

Technical website intelligence, screenshots, and crawl history.

About This Project

This project is built for fun and learning. The main goal is to learn how to crawl, store, and analyze large datasets in a practical setup with MySQL and S3-compatible storage. Parts of it are "vibe-coded". Years ago there was a handcrafted PHP version; it worked, but it was much less structured and both harder and slower to maintain than this rebuild. I know this contributes to the noise on the internet. That’s fine, and I’m not sorry, but you can always remove your site.

Why It Exists

  • Learn scalable crawling patterns.
  • Practice data modeling and deduplication.
  • Measure storage growth over time.
  • Keep the architecture stateless for container environments.
  • Test ZFS compression and deduplication.
  • Run everything on slow homelab hardware, so it has to be very efficient.

Current Dataset

URLs: 1695
Fetches: 945
Screenshots: 769
Site details: 603

Storage Footprint

MySQL total size: 5.4 MiB (5619712 bytes)
S3 total size: 10.2 MiB (10723519 bytes)
S3 total files: 475
S3 snapshot updated: 2026-04-13 00:00:14
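
The readable sizes above follow binary (1024-based) units with one decimal place. A minimal sketch of that conversion, assuming a hypothetical helper named `format_bytes` (not taken from the project's code):

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count in binary units (1 KiB = 1024 B), one decimal place.

    Hypothetical helper for illustration; the project's own formatter may differ.
    """
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


if __name__ == "__main__":
    print(format_bytes(5619712))   # MySQL total size from the report
    print(format_bytes(10723519))  # S3 total size from the report
```

Dividing by 1024 twice turns 5619712 bytes into about 5.36, which rounds to the 5.4 MiB shown above.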

S3 Breakdown

Prefix        Files  Bytes     Readable
screenshots/  475    10723519  10.2 MiB

Top MySQL Tables By Size

Table               Rows (est.)  Size
site_enrichments    492          2.5 MiB
urls                1695         576.0 KiB
fetches             925          384.0 KiB
fetch_observations  945          320.0 KiB
screenshots         769          288.0 KiB
domain_scores       1289         272.0 KiB
render_snapshots    769          256.0 KiB
change_events       925          208.0 KiB
jobs                1107         192.0 KiB
links               764          176.0 KiB

`Rows (est.)` comes from MySQL metadata and is approximate.
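
A table like this can be read from MySQL's `information_schema.TABLES`, where `TABLE_ROWS` is only an estimate for InnoDB tables. The sketch below is one possible way to do it, not necessarily how this site does: the function name `top_tables`, the choice of `DATA_LENGTH + INDEX_LENGTH` as "size", and the `%s` placeholder style (as used by drivers such as PyMySQL) are all assumptions.

```python
# Assumption: "Size" means data plus index bytes; the site may compute it differently.
TABLE_SIZE_QUERY = """
SELECT TABLE_NAME,
       TABLE_ROWS AS rows_est,
       DATA_LENGTH + INDEX_LENGTH AS size_bytes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = %s
ORDER BY size_bytes DESC
LIMIT %s
"""


def top_tables(cursor, schema: str, limit: int = 10):
    """Return (table, estimated rows, size in bytes) for the largest tables.

    `cursor` is any DB-API cursor connected to MySQL; `schema` is the
    database name. Both are supplied by the caller.
    """
    cursor.execute(TABLE_SIZE_QUERY, (schema, limit))
    return list(cursor.fetchall())
```

Because the row counts come from storage engine statistics rather than `COUNT(*)`, they can drift from the exact totals listed under Current Dataset.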