About This Project
This project is built for fun and learning. The main goal is to learn how to crawl, store,
and analyze large datasets in a practical setup with MySQL and S3-compatible storage.
Parts of it are "vibe-coded". Years ago there was a handcrafted PHP version; it worked, but it was far less structured and slower to maintain than this rebuild. I know this contributes to the noise on the internet. That's fine, and I'm not sorry, but you can always have your site removed.
Why It Exists
- Learn scalable crawling patterns.
- Practice data modeling and deduplication.
- Measure storage growth over time.
- Keep architecture stateless for container environments.
- Test ZFS compression and deduplication.
- Run everything on slow homelab hardware, so it has to be very efficient.
Current Dataset
| Metric | Count |
| --- | --- |
| URLs | 1695 |
| Fetches | 945 |
| Screenshots | 769 |
| Site details | 603 |
Storage Footprint
| Metric | Value |
| --- | --- |
| MySQL total size | 5.4 MiB (5619712 bytes) |
| S3 total size | 10.2 MiB (10723519 bytes) |
| S3 total files | 475 |
| S3 snapshot updated | 2026-04-13 00:00:14 |
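As a quick sanity check, the human-readable sizes above follow directly from the raw byte counts (1 MiB = 1048576 bytes); a minimal sketch:

```python
MIB = 1024 ** 2  # 1 MiB = 1,048,576 bytes

mysql_bytes = 5_619_712
s3_bytes = 10_723_519

# Both match the table: 5.4 MiB and 10.2 MiB.
print(f"MySQL: {mysql_bytes / MIB:.1f} MiB")
print(f"S3:    {s3_bytes / MIB:.1f} MiB")
```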
S3 Breakdown
| Prefix | Files | Bytes | Readable |
| --- | --- | --- | --- |
| screenshots/ | 475 | 10723519 | 10.2 MiB |
Top MySQL Tables By Size
| Table | Rows (est.) | Size |
| --- | --- | --- |
| site_enrichments | 492 | 2.5 MiB |
| urls | 1695 | 576.0 KiB |
| fetches | 925 | 384.0 KiB |
| fetch_observations | 945 | 320.0 KiB |
| screenshots | 769 | 288.0 KiB |
| domain_scores | 1289 | 272.0 KiB |
| render_snapshots | 769 | 256.0 KiB |
| change_events | 925 | 208.0 KiB |
| jobs | 1107 | 192.0 KiB |
| links | 764 | 176.0 KiB |
`Rows (est.)` comes from MySQL metadata and is approximate.
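A table like the one above can be generated from `information_schema.TABLES`, which is where MySQL exposes the row estimates and per-table `DATA_LENGTH`/`INDEX_LENGTH`. This is a hedged sketch, not the project's actual reporting code; `SIZE_QUERY`, `humanize`, and `render_rows` are hypothetical names:

```python
# Query sketch: estimated rows and total size per table in the current schema.
# TABLE_ROWS is an InnoDB estimate, which is why the doc labels it "Rows (est.)".
SIZE_QUERY = """
SELECT TABLE_NAME, TABLE_ROWS, DATA_LENGTH + INDEX_LENGTH AS total_bytes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
ORDER BY total_bytes DESC
LIMIT 10;
"""

def humanize(n: int) -> str:
    """Format a byte count as KiB or MiB, matching the tables above."""
    if n >= 1024 ** 2:
        return f"{n / 1024 ** 2:.1f} MiB"
    return f"{n / 1024:.1f} KiB"

def render_rows(rows):
    """Render (name, est_rows, total_bytes) tuples as markdown table rows.

    `rows` would typically be cursor.fetchall() after executing SIZE_QUERY.
    """
    return [f"| {name} | {est} | {humanize(size)} |" for name, est, size in rows]
```

Feeding it a sample tuple such as `("urls", 1695, 589824)` yields `| urls | 1695 | 576.0 KiB |`, the same shape as the table above.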