About This Project
This project is built for fun and learning. The main goal is to learn how to crawl, store,
and analyze large datasets in a practical setup with MySQL and S3-compatible storage.
Parts of it are "vibe-coded". Years ago there was a handcrafted PHP version; it worked,
but it was much less structured, and harder and slower to maintain, than this rebuild.
I know this contributes to the noise on the internet. That’s fine, and I’m not sorry, but you can always remove your site.
Why It Exists
- Learn scalable crawling patterns.
- Practice data modeling and deduplication.
- Measure storage growth over time.
- Keep architecture stateless for container environments.
- Test ZFS compression and dedup.
- Run all of it on slow homelab hardware, so it has to be very efficient.
Current Dataset
| URLs | 1635570 |
| Fetches | 680918 |
| Screenshots | 453777 |
| Site details | 622982 |
Storage Footprint
| MySQL total size | 4.0 GiB (4306386944 bytes) |
| S3 total size | 7.7 GiB (8260651399 bytes) |
| S3 total files | 357272 |
| S3 snapshot updated | 2026-05-15 20:10:48 |
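The readable sizes above follow from the raw byte counts using binary units (1 GiB = 1024^3 bytes). A minimal sketch of that conversion, assuming one decimal place of rounding; the helper name `format_size` is hypothetical and not taken from the project:

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count with binary units (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(value)} B"
            return f"{value:.1f} {unit}"
        value /= 1024
    raise AssertionError("unreachable: the loop always returns")


if __name__ == "__main__":
    print(format_size(4306386944))  # MySQL total size
    print(format_size(8260651399))  # S3 total size
```

With the two totals from the table, this yields `4.0 GiB` and `7.7 GiB`, matching the readable values shown.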
S3 Breakdown
| Prefix | Files | Bytes | Readable |
| screenshots/ | 357272 | 8260651399 | 7.7 GiB |
Top MySQL Tables By Size
| Table | Rows (est.) | Size |
| site_enrichments | 523521 | 1.7 GiB |
| urls | 1524441 | 548.4 MiB |
| domain_scores | 1523354 | 334.8 MiB |
| fetches | 655841 | 276.4 MiB |
| jobs | 999086 | 247.5 MiB |
| links | 1031202 | 209.1 MiB |
| fetch_observations | 648635 | 175.1 MiB |
| site_profiles | 628929 | 149.7 MiB |
| render_snapshots | 420461 | 136.7 MiB |
| screenshots | 448133 | 131.6 MiB |
`Rows (est.)` comes from MySQL metadata and is approximate.
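One way such per-table estimates can be read from MySQL metadata is the `information_schema.TABLES` view. The sketch below is an assumption about how a report like this could be produced, not the project's actual query: the function name `top_tables`, the choice of `DATA_LENGTH + INDEX_LENGTH` as "size", and the `%s` parameter style (as used by common Python MySQL drivers) are all illustrative.

```python
# Hypothetical query: size is assumed to be data plus index bytes.
TABLE_SIZE_QUERY = """
SELECT TABLE_NAME, TABLE_ROWS, DATA_LENGTH + INDEX_LENGTH AS total_bytes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
ORDER BY total_bytes DESC
LIMIT %s
"""


def top_tables(cursor, limit=10):
    """Return (table, estimated_rows, total_bytes) tuples, largest first.

    `cursor` is any DB-API cursor connected to the target schema.
    TABLE_ROWS may be NULL for some tables, so missing values become 0.
    """
    cursor.execute(TABLE_SIZE_QUERY, (limit,))
    return [
        (name, int(rows or 0), int(size or 0))
        for name, rows, size in cursor.fetchall()
    ]
```

For InnoDB tables, `TABLE_ROWS` is an estimate maintained by the storage engine rather than an exact count, which is why the row numbers here differ from the totals under Current Dataset.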