Parsing HTML from a JavaScript-rendered URL with a Python object

November 26, 2017, at 10:31 PM

I would like to extract the market information from the following url and all of its subsequent pages:

https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1

I have successfully parsed the data that I want from the first page using some code from the following url:

https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages

I have also been able to parse out the URL of the next page and feed it into a loop in order to grab the data from that page. The problem is that it crashes before the next page loads, for a reason I don't fully understand.

I have a hunch that the class I borrowed from 'impythonist' may be causing the problem, but I don't know enough object-oriented programming to work out what is wrong. Here is my code, much of which is borrowed from the URL above:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html
import re
from bs4 import BeautifulSoup
class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  
  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

base_url='https://uk.reuters.com'
complete_next_page='https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1'
# LOOP TO RENDER PAGES AND GRAB DATA
while complete_next_page != '':
    print('NEXT PAGE: ', complete_next_page, '\n')
    r = Render(complete_next_page)  # USE THE CLASS TO RENDER JAVASCRIPT FROM PAGE
    result = r.frame.toHtml()       # ERROR IS THROWN HERE ON 2nd PAGE
    # PARSE THE HTML
    soup = BeautifulSoup(result, 'lxml')
    row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
    print(len(row_data))
    # PARSE ALL ROW DATA
    stripe_rows = row_data.findAll('tr', attrs={'class': 'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class': ''})
    print(len(stripe_rows))
    print(len(non_stripe_rows))
    # PARSE SPECIFIC ROW DATA FROM INDEX COMPONENTS
    # non_stripe_rows: from 4 to 18 (inclusive) contain data
    # stripe_rows: from 2 to 16 (inclusive) contain data
    i = 2
    while i < len(stripe_rows):
        print('CURRENT LINE IS: ', str(i))
        print(stripe_rows[i])
        print('###############################################')
        print(non_stripe_rows[i+2])
        print('\n')
        i += 1
    # GETS LINK TO NEXT PAGE (THIS PART WORKS)
    next_page = str(soup.find('div', attrs={'class': 'pageNavigation'}).find('li', attrs={'class': 'next'}).find('a')['href'])
    complete_next_page = base_url + next_page

I have annotated the bits of code that I have written and understand, but I don't understand the 'Render' class well enough to diagnose the error. Or is it something else?

Here is the error:

result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'

I don't need to keep the information in the class once I have parsed it, so I was thinking that perhaps it could be cleared or reset somehow and then updated to hold the new URL information for pages 2 to n, but I have no idea how to do this.

Alternatively, if anyone knows another way to grab this specific data from this page and the following ones, that would be equally helpful.

Many thanks in advance.

Answer 1

How about using Selenium and PhantomJS instead of PyQt? You can install Selenium by running "pip install selenium". On a Mac, you can install PhantomJS by running "brew install phantomjs". On Windows, use choco instead of brew; on Ubuntu, use apt-get.

from selenium import webdriver
from bs4 import BeautifulSoup
base_url = "https://uk.reuters.com"
first_page = "/business/markets/index/.FTSE?sortBy=&sortDir=&pn=1"
browser = webdriver.PhantomJS()
# PARSE THE HTML
browser.get(base_url + first_page)
soup = BeautifulSoup(browser.page_source, "lxml")
row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})
# PARSE ALL ROW DATA
stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
print(len(stripe_rows), len(non_stripe_rows))
# GO TO THE NEXT PAGE
next_button = soup.find("li", attrs={"class":"next"})
while next_button:
  next_page = next_button.find("a")["href"]
  browser.get(base_url + next_page)
  soup = BeautifulSoup(browser.page_source, "lxml")
  row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})
  stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
  non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
  print(len(stripe_rows), len(non_stripe_rows))
  next_button = soup.find("li", attrs={"class":"next"})
# DONT FORGET THIS!!
browser.quit()

I know the code above is not efficient (it feels too slow to me), but I think it will give you the results you want. In addition, if the web page you want to scrape does not use JavaScript, even PhantomJS and Selenium are unnecessary; you can use the requests module. However, since I wanted to show the contrast with PyQt, I used PhantomJS and Selenium in this answer.
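
The two copies of the fetch-and-parse code in the loop above could be folded into one generator that follows the "next" link until there is none. The following is only a minimal sketch, not the answerer's code: NextLinkFinder, find_next_url, iter_pages and the fetch parameter are hypothetical names, the example.test URLs are made up, and it uses the standard library's html.parser instead of BeautifulSoup purely so that it runs without extra installs. It assumes, as the answer does, that the next-page link is the first <a> inside an <li class="next">.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> found inside an <li class="next">."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0       # > 0 while we are inside <li class="next">
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            if self.li_depth or "next" in classes:
                self.li_depth += 1
        elif tag == "a" and self.li_depth and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth:
            self.li_depth -= 1


def find_next_url(page_html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(page_html)
    if not finder.next_href:
        return None
    # urljoin copes with both relative ("/path?pn=2") and absolute hrefs.
    return urljoin(current_url, finder.next_href)


def iter_pages(start_url, fetch, max_pages=50):
    """Yield (url, html) for each page, following "next" links.

    fetch is any callable that takes a URL and returns the rendered HTML.
    The seen set and max_pages guard against pagination loops.
    """
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_html = fetch(url)
        yield url, page_html
        url = find_next_url(page_html, url)


if __name__ == "__main__":
    # Made-up two-page site standing in for a real browser or HTTP client.
    fake_site = {
        "https://example.test/list?pn=1":
            '<ul><li class="next"><a href="/list?pn=2">Next</a></li></ul>',
        "https://example.test/list?pn=2":
            '<ul><li class="previous"><a href="/list?pn=1">Prev</a></li></ul>',
    }
    for url, page_html in iter_pages("https://example.test/list?pn=1",
                                     fake_site.__getitem__):
        print(url, len(page_html))
```

With Selenium, fetch could be a small function that calls browser.get(url) and returns browser.page_source; for a page that needs no JavaScript it could return requests.get(url).text instead. The row parsing from the answer would then go inside the for loop, once, rather than before and inside the while loop.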
