Prevent JavaScript function from running out of memory because of too many objects
I'm building a web scraper in Node.js that uses request to fetch pages and
cheerio to parse the DOM. While I am using Node, I believe this is more of
a general JavaScript question.
tl;dr: creating ~60,000-100,000 objects uses up all my computer's RAM, and
I get an out-of-memory error in Node.
Here's how the scraper works. It's loops within loops. I've never designed
anything this complex before, so there might be much better ways to do this.
Loop 1: Creates 10 objects in an array called 'sitesArr'. Each object
represents one website to scrape.
var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens',
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description'
    },
    // ... x10
]
Loop 2: Loops through 'sitesArr'. For each site it requests the homepage
via 'request' and gets a list of category links, usually 30-70 URLs. It
appends these URLs to the 'sitesArr' object they belong to, in an array
property named 'categories'.
var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens',
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description',
        categories: [
            {
                name: 'shoes',
                url: 'www.basedomain.com/shoes'
            },{
                name: 'socks',
                url: 'www.basedomain.com/socks'
            } // x 50
        ]
    },
    // ... x10
]
Loop 3: Loops through each 'category'. For each URL it gets a list of
product links and puts them in an array. Usually ~300-1000 products per
category.
var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens',
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description',
        categories: [
            {
                name: 'shoes',
                url: 'www.basedomain.com/shoes',
                products: [
                    'www.basedomain.com/shoes/product1.html',
                    'www.basedomain.com/shoes/product2.html',
                    'www.basedomain.com/shoes/product3.html',
                    // x 300
                ]
            },// x 50
        ]
    },
    // ... x10
]
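For illustration, Loops 2 and 3 could be sketched roughly as below. This is a minimal sketch, not the actual scraper: `fetchCategories`, `fetchProductLinks` and `collectLinks` are hypothetical names, and the two fetch functions are stubs returning fixed data in place of the real request + cheerio calls, so the example runs without a network.

```javascript
// Hypothetical stand-in for scraping a site's homepage for category links.
function fetchCategories(site) {
  return Promise.resolve([
    { name: 'shoes', url: site.baseURL + '/shoes' },
    { name: 'socks', url: site.baseURL + '/socks' }
  ]);
}

// Hypothetical stand-in for scraping one category page for product links.
function fetchProductLinks(category) {
  return Promise.resolve([
    category.url + '/product1.html',
    category.url + '/product2.html'
  ]);
}

// Loop 2, then Loop 3, for a single site: attach the categories to the
// site object, then attach the product links to each category.
async function collectLinks(site) {
  site.categories = await fetchCategories(site);
  for (const category of site.categories) {
    category.products = await fetchProductLinks(category);
  }
  return site;
}

collectLinks({ name: 'store name', baseURL: 'www.basedomain.com' })
  .then(function (site) {
    console.log(site.categories[0].products[0]);
  });
```

Note that at this stage the structure only holds URL strings, which are small compared with the full product objects built in the next step.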
Loop 4: Loops through each 'products' array, goes to each URL, and
creates an object for each product.
var product = {
    infoLink: "www.basedomain.com/shoes/product1.html",
    description: "This is a description for the object",
    title: "Product 1",
    Category: "Shoes",
    imgs: ['http://foo.com/img.jpg','http://foo.com/img2.jpg','http://foo.com/img3.jpg'],
    price: 60,
    currency: 'USD'
}
Then I ship each product object off to a MongoDB function, which does an
upsert into my database.
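A minimal sketch of Loop 4 plus the upsert step, assuming products are handled one at a time. `fetchProduct`, `upsertProduct` and `processCategory` are hypothetical names; the first two are stubs standing in for the real page scrape and the MongoDB upsert, so nothing here touches a network or database. The point it illustrates is that each product object is local to one loop iteration, so nothing holds a reference to it once it has been saved.

```javascript
// Hypothetical stand-in for scraping one product page into an object.
function fetchProduct(url, categoryName) {
  return Promise.resolve({
    infoLink: url,
    title: 'Product at ' + url,
    Category: categoryName,
    currency: 'USD'
  });
}

// Hypothetical stand-in for the MongoDB upsert; it does nothing here.
function upsertProduct(product) {
  return Promise.resolve();
}

// Scrape and save the products of one category sequentially.
// Resolves to the number of products saved in this call.
async function processCategory(category) {
  let saved = 0;
  for (const url of category.products) {
    const product = await fetchProduct(url, category.name);
    await upsertProduct(product);
    saved += 1;
    // `product` goes out of scope here and becomes eligible for
    // garbage collection, since no array or closure retains it.
  }
  return saved;
}

var demoCategory = {
  name: 'shoes',
  url: 'www.basedomain.com/shoes',
  products: [
    'www.basedomain.com/shoes/product1.html',
    'www.basedomain.com/shoes/product2.html',
    'www.basedomain.com/shoes/product3.html'
  ]
};

processCategory(demoCategory).then(function (count) {
  console.log('saved ' + count + ' products');
});
```

If instead every request is started at once and each callback keeps its page body and product object alive until it finishes, all of those objects can be in memory at the same time, which would match the symptom described below.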
THE ISSUE
This all worked just fine, until the process got large. I'm creating about
60,000 product objects every time this script runs, and after a little
while all of my computer's RAM is being used up. What's more, after
getting about halfway through my process I get the following error in
Node:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
I'm very much of the mind that this is a code design issue. Should I be
"deleting" the objects once I'm done with them? What's the best way to
tackle this?