I'm working on a hobby project, similar to web archiving.
With a crawler, I'm downloading web pages into memory. One crawler batch crawls 50 domains at the same time.
Ideally, I'd like to have the WARC files (text files containing some HTTP header info as well as the HTML content) split into:
1 WARC file per day per domain
Usually, I crawl rather slowly, at ca. 1 request per minute - but that can still add up to ca. 1 GB per day per domain.
If I have a box with 8 GB of RAM working on this, with 1 GB for system overhead etc., I might end up with 7 GB of free RAM.
Then I'd start the crawl with 50 domains * 140 MB each (which might be ca. 2-3 hours' worth of data), and the RAM would be full.
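To make sure my estimates hang together, here's a quick back-of-the-envelope check (all values are my estimates from above, not measurements):

```python
# Rough check of the numbers: 1 req/min at ~1 GB/day/domain,
# 50 domains sharing ~7 GB of free RAM.
GB = 1024  # MB per GB

requests_per_min = 1
gb_per_day_per_domain = 1
mb_per_min = gb_per_day_per_domain * GB / (24 * 60)  # ~0.7 MB per request

domains = 50
free_ram_mb = 7 * GB

buffer_per_domain_mb = free_ram_mb / domains             # ~140 MB per domain
hours_until_full = buffer_per_domain_mb / mb_per_min / 60

print(f"{buffer_per_domain_mb:.0f} MB per domain, RAM full after ~{hours_until_full:.1f} h")
# → 143 MB per domain, RAM full after ~3.4 h
```

So with these assumptions the RAM fills up after roughly 3 hours of crawling.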
- How does Debian behave from this point on?
- My objects are held in memory, I assume?
- The system would start swapping?
- With an HDD as the swap partition, would the HDD spin up and write for every page that gets swapped out?
There are several options that I can see (are there others / better ones?):
- Buying more memory
- Using an SSD for swapping
- Closing the WARC file after ca. 100 MB worth of data and using a WARC file merger after the crawl has finished
As I scale this up, I could reduce the WARC file size to 20 MB and merge - but that feels even more like a workaround.
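Taking that last option one step further, I'm wondering whether I even need in-memory buffering or a merge step at all: since a `.warc.gz` file may consist of independently gzipped records concatenated together, I could append each record to the current day's file as soon as it is fetched. A minimal sketch of what I mean (`warc_path`, `append_record`, and the record bytes are hypothetical placeholders, not my actual code):

```python
import gzip
import os
from datetime import datetime, timezone

def warc_path(out_dir, domain):
    """One file per domain per UTC day, e.g. example.org-2024-05-01.warc.gz."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return os.path.join(out_dir, f"{domain}-{day}.warc.gz")

def append_record(out_dir, domain, record_bytes):
    """Append one record as its own gzip member; RAM holds only one record."""
    os.makedirs(out_dir, exist_ok=True)
    with open(warc_path(out_dir, domain), "ab") as f:
        f.write(gzip.compress(record_bytes))
```

With this, memory use stays at roughly one record per in-flight request, regardless of how long the crawl runs - but maybe I'm missing a downside of appending to 50 open files on an HDD?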
Any ideas / thoughts on how to improve the concept / architecture?