About This Project
This project is built for fun and learning. The main goal is to learn how to crawl, store,
and analyze large datasets in a practical setup with MySQL and S3-compatible storage.
Parts of it are "vibe-coded". Years ago there was a handcrafted PHP version; it worked,
but it was much less structured, and slower and harder to maintain, than this rebuild.
I know this crawler adds to the noise on the internet. That’s fine, and I’m not sorry, but you can always remove your site.
Why It Exists
- Learn scalable crawling patterns.
- Practice data modeling and deduplication.
- Measure storage growth over time.
- Keep the architecture stateless for container environments.
- Test ZFS compression and dedup.
- Run all of it on slow homelab hardware, so it has to be very efficient.
Current Dataset
| URLs | 1804297 |
| Fetches | 1119863 |
| Screenshots | 685473 |
| Site details | 1012076 |
Storage Footprint
| MySQL total size | 5.5 GiB (5901303808 bytes) |
| S3 total size | 2.6 GiB (2814687521 bytes) |
| S3 total files | 121850 |
| S3 snapshot updated | 2026-05-28 17:05:09 |
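The readable sizes above can be reproduced from the raw byte counts. The sketch below assumes the readable values use binary (1024-based) units rounded to one decimal place, which the GiB suffix suggests; `format_bytes` is a hypothetical helper name, not something taken from the project.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count with binary (1024-based) units and one decimal place.

    Assumption: this mirrors how the readable sizes in the tables are derived.
    """
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    # Step up one unit at a time until the value fits below 1024.
    for unit in units[:-1]:
        if value < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    # Anything larger than the last step is reported in the biggest unit.
    return f"{value:.1f} {units[-1]}"


if __name__ == "__main__":
    print(format_bytes(5901303808))  # MySQL total size from the table above
    print(format_bytes(2814687521))  # S3 total size from the table above
```

With the two byte counts from the table, this yields 5.5 GiB and 2.6 GiB, matching the readable values shown.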
S3 Breakdown
| Prefix | Files | Bytes | Readable |
| screenshots/ | 121850 | 2814687521 | 2.6 GiB |
Top MySQL Tables By Size
| Table | Rows (est.) | Size |
| site_enrichments | 977267 | 2.7 GiB |
| urls | 1707922 | 625.6 MiB |
| fetches | 1029849 | 408.6 MiB |
| domain_scores | 1668731 | 369.9 MiB |
| links | 1467341 | 284.2 MiB |
| jobs | 759755 | 224.5 MiB |
| fetch_observations | 774751 | 221.1 MiB |
| render_snapshots | 635056 | 209.8 MiB |
| site_profiles | 951142 | 201.7 MiB |
| screenshots | 664997 | 190.7 MiB |
`Rows (est.)` comes from MySQL metadata and is approximate.
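One common way to get such figures is to query `information_schema.tables`, where `table_rows` is only an estimate for InnoDB tables. The sketch below is an assumption about how a table like this could be produced, not the project's actual query; `TABLE_SIZE_SQL` and `top_tables` are hypothetical names, and the cursor is assumed to be a DB-API cursor from a MySQL driver that uses `%s` placeholders (for example PyMySQL).

```python
# Hypothetical query: the largest tables in one schema, by data plus index size.
TABLE_SIZE_SQL = """
SELECT table_name,
       table_rows AS rows_est,                   -- estimate only for InnoDB
       data_length + index_length AS size_bytes
FROM information_schema.tables
WHERE table_schema = %s
ORDER BY size_bytes DESC
LIMIT %s
"""


def top_tables(cursor, schema: str, limit: int = 10):
    """Return (table_name, rows_est, size_bytes) tuples for the largest tables.

    `cursor` is any DB-API cursor that accepts %s placeholders; the schema
    name is supplied by the caller.
    """
    cursor.execute(TABLE_SIZE_SQL, (schema, limit))
    return cursor.fetchall()
```

An exact row count would need `SELECT COUNT(*)` per table, which is much slower on large tables, so an estimate is a reasonable trade-off on slow hardware.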