Project - Knows.nl

About This Project

This project is built for fun and learning. The main goal is to learn how to crawl, store, and analyze large datasets in a practical setup with MySQL and S3-compatible storage. Parts of it are "vibe-coded". Years ago there was a handcrafted PHP version; it worked, but it was much less structured, and harder and slower to maintain, than this rebuild. I know this contributes to the noise on the internet. That’s fine, and I’m not sorry, but you can always remove your site.

Why It Exists

Learn scalable crawling patterns.
Practice data modeling and deduplication.
Measure storage growth over time.
Keep architecture stateless for container environments.
Test ZFS compression and dedup.
All on slow homelab hardware, so it has to be very efficient.
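
As an illustration of the deduplication goal above, here is a minimal sketch of content-hash deduplication. It is a generic pattern, not necessarily how this project does it: the `ContentStore` class and its in-memory dict are hypothetical stand-ins for a database table or an object store keyed by hash.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory stand-in for storage keyed by content hash."""

    def __init__(self) -> None:
        self._by_hash: dict[str, bytes] = {}

    def put(self, body: bytes) -> tuple[str, bool]:
        """Store a body once; return (sha256 hex digest, True if newly stored)."""
        digest = hashlib.sha256(body).hexdigest()
        is_new = digest not in self._by_hash
        if is_new:
            self._by_hash[digest] = body
        return digest, is_new

    def __len__(self) -> int:
        return len(self._by_hash)


store = ContentStore()
_, first = store.put(b"<html>same page</html>")
_, second = store.put(b"<html>same page</html>")   # identical bytes, not stored again
_, third = store.put(b"<html>different page</html>")
print(first, second, third, len(store))  # True False True 2
```

Keying on a hash of the bytes means identical fetches cost one stored copy plus a small reference per fetch, which matters on slow hardware with limited disk.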

Current Dataset

URLs	2230498
Fetches	2459171
Screenshots	1092382
Site details	2157331

Storage Footprint

MySQL total size	9.1 GiB (9783689216 bytes)
S3 total size	2.4 GiB (2583086480 bytes)
S3 total files	111496
S3 snapshot updated	2026-07-15 02:24:57

S3 Breakdown

Prefix	Files	Bytes	Readable
screenshots/	111496	2583086480	2.4 GiB
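
The human-readable sizes above follow from the raw byte counts using binary units (1 GiB = 2^30 bytes). A small sketch of that conversion, assuming one decimal place as shown on this page; the function name `human_size` is made up for the example.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count with binary (IEC) units and one decimal place."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(value)} B"
            return f"{value:.1f} {unit}"
        value /= 1024


print(human_size(9783689216))  # MySQL total size -> 9.1 GiB
print(human_size(2583086480))  # S3 total size    -> 2.4 GiB
```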

Top MySQL Tables By Size

Table	Rows (est.)	Size
site_enrichments	2088463	4.9 GiB
fetches	2382050	940.8 MiB
urls	2308801	783.4 MiB
links	2489761	470.5 MiB
site_profiles	1971292	465.9 MiB
domain_scores	1983221	413.1 MiB
render_snapshots	926913	286.9 MiB
change_events	1761985	280.6 MiB
fetch_observations	980789	260.9 MiB
screenshots	988866	260.8 MiB

`Rows (est.)` comes from MySQL metadata and is approximate.
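
A table like the one above can be produced from MySQL's `information_schema.TABLES` view, where `TABLE_ROWS` is only an estimate for InnoDB tables. The sketch below is one way to do it, not necessarily the query this site runs: it assumes "Size" means data plus index bytes, and it takes any DB-API cursor that uses `%s` placeholders (as PyMySQL and MySQL Connector/Python do). The function name `top_tables` and the schema name in the usage note are hypothetical.

```python
# Reads table metadata only; it never scans the tables themselves,
# which is why the row counts are approximate but cheap to obtain.
TABLE_SIZE_QUERY = """
SELECT TABLE_NAME,
       TABLE_ROWS,
       DATA_LENGTH + INDEX_LENGTH AS total_bytes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = %s
ORDER BY total_bytes DESC
LIMIT %s
"""


def top_tables(cursor, schema: str, limit: int = 10) -> list[tuple[str, int, int]]:
    """Return (table name, estimated rows, total bytes) for the largest tables."""
    cursor.execute(TABLE_SIZE_QUERY, (schema, limit))
    # Metadata columns can be NULL (for example for views), so default to 0.
    return [
        (name, int(rows or 0), int(size or 0))
        for name, rows, size in cursor.fetchall()
    ]
```

With an open connection, `top_tables(conn.cursor(), "crawler")` would return the ten largest tables of a schema named `crawler`. An exact count needs `SELECT COUNT(*)` per table, which is much slower on tables of this size.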