About This Project
This project is built for fun and learning. The main goal is to learn how to crawl, store,
and analyze large datasets in a practical setup with MySQL and S3-compatible storage.
Parts of it are "vibe-coded". Years ago there was a handcrafted PHP version; it worked, but it was far less structured and slower to maintain than this rebuild. I know this contributes to the noise on the internet. That's fine, and I'm not sorry, but you can always have your site removed.
Why It Exists
- Learn scalable crawling patterns.
- Practice data modeling and deduplication.
- Measure storage growth over time.
- Keep architecture stateless for container environments.
- Test ZFS compression and deduplication.
- Run everything on slow homelab hardware, so it has to be very efficient.
Current Dataset
| Metric | Count |
| --- | --- |
| URLs | 1695 |
| Fetches | 945 |
| Screenshots | 769 |
| Site details | 603 |
Storage Footprint
| Metric | Value |
| --- | --- |
| MySQL total size | 5.4 MiB (5619712 bytes) |
| S3 total size | 10.2 MiB (10723519 bytes) |
| S3 total files | 475 |
| S3 snapshot updated | 2026-04-13 00:00:14 |
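As a quick sanity check, the human-readable sizes above follow directly from the raw byte counts (1 MiB = 1048576 bytes); a minimal sketch:

```python
MIB = 1024 ** 2  # 1 MiB = 1,048,576 bytes

mysql_bytes = 5_619_712
s3_bytes = 10_723_519

# Both match the table: 5.4 MiB and 10.2 MiB.
print(f"MySQL: {mysql_bytes / MIB:.1f} MiB")
print(f"S3:    {s3_bytes / MIB:.1f} MiB")
```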
S3 Breakdown
| Prefix | Files | Bytes | Readable |
| --- | --- | --- | --- |
| screenshots/ | 475 | 10723519 | 10.2 MiB |
Top MySQL Tables By Size
| Table | Rows (est.) | Size |
| --- | --- | --- |
| site_enrichments | 492 | 2.5 MiB |
| urls | 1695 | 576.0 KiB |
| fetches | 925 | 384.0 KiB |
| fetch_observations | 945 | 320.0 KiB |
| screenshots | 769 | 288.0 KiB |
| domain_scores | 1289 | 272.0 KiB |
| render_snapshots | 769 | 256.0 KiB |
| change_events | 925 | 208.0 KiB |
| jobs | 1107 | 192.0 KiB |
| links | 764 | 176.0 KiB |
`Rows (est.)` comes from MySQL metadata and is approximate.
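A table like the one above can be generated from `information_schema.TABLES`, which is where MySQL exposes the row estimates and per-table `DATA_LENGTH`/`INDEX_LENGTH`. This is a hedged sketch, not the project's actual reporting code; `SIZE_QUERY`, `humanize`, and `render_rows` are hypothetical names:

```python
# Query sketch: estimated rows and total size per table in the current schema.
# TABLE_ROWS is an InnoDB estimate, which is why the doc labels it "Rows (est.)".
SIZE_QUERY = """
SELECT TABLE_NAME, TABLE_ROWS, DATA_LENGTH + INDEX_LENGTH AS total_bytes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
ORDER BY total_bytes DESC
LIMIT 10;
"""

def humanize(n: int) -> str:
    """Format a byte count as KiB or MiB, matching the tables above."""
    if n >= 1024 ** 2:
        return f"{n / 1024 ** 2:.1f} MiB"
    return f"{n / 1024:.1f} KiB"

def render_rows(rows):
    """Render (name, est_rows, total_bytes) tuples as markdown table rows.

    `rows` would typically be cursor.fetchall() after executing SIZE_QUERY.
    """
    return [f"| {name} | {est} | {humanize(size)} |" for name, est, size in rows]
```

Feeding it a sample tuple such as `("urls", 1695, 589824)` yields `| urls | 1695 | 576.0 KiB |`, the same shape as the table above.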