Buffering and Swapping / Storing and data from web / memory

Chris8087
Posts: 17
Joined: 2014-11-13 10:35

#1 Post by Chris8087 »

Hello,

I'm working on a hobby project, similar to web archiving.
A crawler downloads web pages into memory; one crawler batch crawls 50 domains at the same time.
Ideally, I'd like the output as WARC files (text files containing some HTTP header info as well as the HTML content), split into:
1 WARC file per day per domain

Usually, I crawl rather slowly, at ca. 1 request per minute, but that can still add up to ca. 1 GB per day per domain.

If I have a box with 8 GB RAM working on this and 1 GB goes to system overhead etc., I end up with about 7 GB of free RAM.
Then I'd start the crawl with 50 domains * 140 MB (which might be ca. 3 hours' worth of data at that rate) and the RAM will be full.
  • How does Debian behave from that point on?
  • I assume my objects are still held in memory?
  • Would the system start swapping?
  • With an HDD for the swap partition, would the HDD spin and write for every write to memory?
This is exactly what I want to avoid: it would be an I/O bottleneck, and in tests it seems to stress the HDD a lot.
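
As a rough sanity check on the numbers above, here is a minimal Python sketch of the memory budget. The function name and its parameters are made up for illustration, and it assumes every domain produces data at the same steady rate, which a real crawl will not do.

```python
def hours_until_full(free_ram_mb: float, domains: int,
                     mb_per_domain_per_day: float) -> float:
    """Hours of crawling before per-domain in-memory buffers use up the free RAM.

    Assumes (hypothetically) an even, constant data rate for every domain.
    """
    per_domain_budget_mb = free_ram_mb / domains        # e.g. 7000 / 50 = 140 MB
    mb_per_domain_per_hour = mb_per_domain_per_day / 24
    return per_domain_budget_mb / mb_per_domain_per_hour


if __name__ == "__main__":
    # Figures from the post: ca. 7 GB free, 50 domains, ca. 1 GB per day per domain.
    print(f"{hours_until_full(7000, 50, 1000):.1f} hours")
```

With the figures from the post this comes out at a bit over three hours, so holding a full day of data per domain in RAM is out of reach by roughly a factor of seven.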

I can see several options (are there others, or better ones?):
  • Buying more memory
  • Using an SSD for swapping
  • Closing each WARC file after ca. 100 MB of data and using a WARC file merger once the crawl has finished
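
To make the last option concrete, here is a minimal Python sketch of a writer that appends each record to a part file on disk as soon as it arrives and starts a new part once a size cap is reached, so nothing accumulates in RAM. The class name, the file naming scheme and the idea that a record is already a serialized byte string are all assumptions for illustration; this is not taken from any WARC library.

```python
import os


class RotatingPartWriter:
    """Appends records to numbered part files and rotates at a size cap.

    Hypothetical helper: `prefix` could be something like "example.org-2014-11-13".
    """

    def __init__(self, directory: str, prefix: str, max_bytes: int):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.part = 0
        self.written = 0
        self._fh = None

    def _path(self) -> str:
        return os.path.join(self.directory, f"{self.prefix}-{self.part:05d}.warc")

    def write(self, record: bytes) -> None:
        # Rotate first if this record would push the current part over the cap.
        if self._fh is not None and self.written + len(record) > self.max_bytes:
            self.close()
            self.part += 1
        if self._fh is None:
            path = self._path()
            # Continue counting from the existing size when appending to an old part.
            self.written = os.path.getsize(path) if os.path.exists(path) else 0
            self._fh = open(path, "ab")
        self._fh.write(record)
        self.written += len(record)

    def close(self) -> None:
        if self._fh is not None:
            self._fh.close()
            self._fh = None
```

One writer per domain would be created at the start of a batch and closed at the end. A record larger than the cap still gets written whole into its own part, so the cap is a soft limit.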
With just the crawling, the current system is able to do 4 - 5 crawls at the same time, so 200 - 250 domains in parallel.
As I scale this, I could reduce the WARC file size to 20 MB and merge afterwards, but that feels even more like a workaround.
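
For the merge step, as far as I understand the format, a WARC file is just a sequence of records, so part files can be joined byte for byte. Here is a minimal Python sketch under that assumption; the function name and the glob pattern are hypothetical, and it does no validation of the records.

```python
import glob
import shutil


def merge_parts(pattern: str, destination: str) -> int:
    """Concatenate all files matching `pattern` into `destination`, in name order.

    Returns the number of parts merged. `destination` must not match `pattern`,
    otherwise the output file would be read as one of its own inputs.
    """
    parts = sorted(glob.glob(pattern))
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in 1 MiB chunks so memory use stays flat.
                shutil.copyfileobj(src, out, 1024 * 1024)
    return len(parts)
```

A call could look like `merge_parts("parts/example.org-2014-11-13-*.warc", "example.org-2014-11-13.warc")`. Because the copy is streamed, the merge needs almost no RAM regardless of whether the parts are 100 MB or 20 MB.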
Any ideas / thoughts on how to improve the concept / architecture?
