How to use a loop to request a URL, scrape data, grab a new URL from that page and move on to the next page X times

May 07, 2019, at 01:30 AM

I am looking to:

  1. Open a known URL (www.source.com/1 below)
  2. Scrape all URLs on that page (e.g. www.urllookingfor.com/1 to .../10) and log them to the console
  3. Scrape the URL of the next page (e.g. www.source.com/2) from that page
  4. Load the next page and repeat the process X times

Imagine a list of 50 URLs divided across 5 pages, where you need to click the "next" button to move to the following page.

The first two steps work fine, but I think the issue is that nextLink isn't updated before the loop runs again. Essentially, step four gets repeated with the original URL rather than the new URL. The steps above run inside a for loop.

I've tried using setTimeout and async...await, since I suspect the new URL hasn't had time to load before the next function call completes, but neither worked.

If I add console.log(URL) inside the loop, it prints the original URL. But when I add console.log outside the loop, it prints the updated URL, which makes me think nextLink isn't updated until after the loop has finished.

I've also tried calling the functions over and over in sequence instead of looping, but this does not seem to update nextLink before the next function runs either, which contradicts the above.

const request = require('request');
const cheerio = require('cheerio');

let nextLink = 'www.source.com/1';

// Pulls the source page and logs the required URLs.
// Note: request() is asynchronous, so the callback runs only once the response arrives.
const getDatafromPage = () => {
    request(nextLink, (error, response, html) => {
        if (!error && response.statusCode === 200) {
            const $ = cheerio.load(html);
            $('.class1').each((i, el) => {
                const link = $(el).find('.class2').attr('href');
                console.log(link);
            });
        }
    });
};

// Reads the URL of the next page and stores it in nextLink.
const getNextLink = () => {
    request(nextLink, (error, response, html) => {
        if (!error && response.statusCode === 200) {
            const $ = cheerio.load(html);
            nextLink = $('.class3').attr('href');
        }
    });
};

for (let i = 0; i <= 4; i++) {
    getDatafromPage();
    getNextLink();
}

console.log(nextLink);
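
The timing in the loop above can be reproduced without any network access. The following is only a minimal sketch: fakeRequest is a hypothetical stand-in for request() that calls back after a short delay, and the 'page/1' and 'page/2' values are made up for illustration. It suggests why every iteration may see the original nextLink: the for loop finishes synchronously before any callback has run.

```javascript
// Hypothetical stand-in for request(): calls back asynchronously, like a network call.
const fakeRequest = (url, callback) => {
    setTimeout(() => callback(null, url), 10);
};

let nextLink = 'page/1';
const seen = [];

for (let i = 0; i < 3; i++) {
    // Each call reads nextLink immediately, while it still holds the starting value.
    fakeRequest(nextLink, (error, url) => {
        seen.push(url);
        nextLink = 'page/2'; // assigned only after the loop has already finished
    });
}

// The loop is done, but no callback has run yet.
const afterLoop = nextLink;
console.log(afterLoop); // prints "page/1"

// Once the callbacks have fired, all three requests turn out to have used 'page/1'.
setTimeout(() => console.log(seen, nextLink), 50);
```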

Expected results (all 50 URLs from the five pages, followed by the last source URL):

 www.urllookingfor.com/1
 ...
 www.urllookingfor.com/50
 www.source.com/5

Actual results (repeats the first page, but then logs the next page at the end):

 www.urllookingfor.com/1
 ...
 www.urllookingfor.com/10
 www.urllookingfor.com/1
 ...
 www.urllookingfor.com/10
 www.source.com/2
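
One possible way to get the expected ordering is to fetch each page once, wait for it, and only then read the next URL from that same response. The sketch below is an illustration under stated assumptions, not a drop-in fix: fakeSite and fetchPage are hypothetical stand-ins for the real site and for request() plus cheerio, and the 'source/N' and 'item/N' names are invented.

```javascript
// Hypothetical in-memory "site": each page lists its links and the next page URL.
const fakeSite = {
    'source/1': { links: ['item/1', 'item/2'], next: 'source/2' },
    'source/2': { links: ['item/3', 'item/4'], next: 'source/3' },
    'source/3': { links: ['item/5', 'item/6'], next: null },
};

// Hypothetical stand-in for request() + cheerio: resolves with the parsed page.
const fetchPage = (url) =>
    new Promise((resolve, reject) => {
        setTimeout(() => {
            const page = fakeSite[url];
            if (page) resolve(page);
            else reject(new Error(`No such page: ${url}`));
        }, 10);
    });

// Visits up to maxPages pages strictly one after another.
const crawl = async (startUrl, maxPages) => {
    const collected = [];
    let nextLink = startUrl;
    let lastVisited = null;
    for (let i = 0; i < maxPages && nextLink; i++) {
        // await pauses this iteration until the page has actually arrived.
        const page = await fetchPage(nextLink);
        collected.push(...page.links);
        lastVisited = nextLink;
        nextLink = page.next; // updated before the next iteration starts
    }
    return { collected, lastVisited };
};

const crawlResult = crawl('source/1', 3);
crawlResult.then(({ collected, lastVisited }) => {
    collected.forEach((link) => console.log(link));
    console.log(lastVisited);
});
```

With the real libraries, the same shape would presumably mean wrapping request() in a Promise and extracting both the '.class2' links and the '.class3' href from a single response per page, rather than requesting each page twice.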