Spider Error Processing - Python Web Scraping Error

January 01, 2020, at 8:50 PM

I am new to Python and very new to web scraping, but I am trying to build a web scraper for this site: https://www.fortune.com/2019/09/23/term-sheet-monday-september-23/

However, my scraper runs into an error before I am able to get any data from the website. It kicks back: "2019-12-31 09:37:16 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.fortune.com/2019/09/23/term-sheet-monday-september-23/> (referer: None)".

I previously created a very similar scraper that worked well, and I cannot figure out why this one fails on this website.

Any help or suggestions are appreciated!

My spider looks like this:

# Import necessary packages
import scrapy
import numpy as np
import pandas as pd
from scrapy.crawler import CrawlerProcess

# Define Spider
class Term_Sheet_Spider(scrapy.Spider):
    name = "Single_Page_Scraper"

    def start_requests(self):
        url = "https://www.fortune.com/2019/09/23/term-sheet-monday-september-23/"
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        vc_deals = response.xpath('//*[contains(@target, "_blank")]/text()')
        vc_deals_ext = vc_deals.extract()
        companies = response.xpath('//a[target = "_blank" > strong::text')
        companies_ext = companies.extract_first()
        dict_vc_2[companies_ext] = vc_deals_ext
        vc_printouts_2.append(companies_ext)

# Define empty containers
vc_printouts_2 = []
links_list_2 = []
dict_vc_2 = {}

# Run Spider
process = CrawlerProcess()
process.crawl(Term_Sheet_Spider)
process.start()

This is the output from running the code:

2019-12-31 09:37:15 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2019-12-31 09:37:15 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-12-31 09:37:15 [scrapy.crawler] INFO: Overridden settings: {}
2019-12-31 09:37:15 [scrapy.extensions.telnet] INFO: Telnet Password: aab47f83ef3b5653
2019-12-31 09:37:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-12-31 09:37:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-12-31 09:37:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-12-31 09:37:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-12-31 09:37:15 [scrapy.core.engine] INFO: Spider opened
2019-12-31 09:37:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-12-31 09:37:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-12-31 09:37:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fortune.com/2019/09/23/term-sheet-monday-september-23/> (referer: None)
2019-12-31 09:37:16 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.fortune.com/2019/09/23/term-sheet-monday-september-23/> (referer: None)
Traceback (most recent call last):
  File "C:\Users\kyles\Anaconda3\lib\site-packages\parsel\selector.py", line 238, in xpath
    **kwargs)
  File "src/lxml/etree.pyx", line 1581, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\kyles\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "<ipython-input-2-440c0105f261>", line 11, in parse
    companies = response.xpath('//a[target = "_blank" > strong::text')
  File "C:\Users\kyles\Anaconda3\lib\site-packages\scrapy\http\response\text.py", line 119, in xpath
    return self.selector.xpath(query, **kwargs)
  File "C:\Users\kyles\Anaconda3\lib\site-packages\parsel\selector.py", line 242, in xpath
    six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])
  File "C:\Users\kyles\Anaconda3\lib\site-packages\six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\kyles\Anaconda3\lib\site-packages\parsel\selector.py", line 238, in xpath
    **kwargs)
  File "src/lxml/etree.pyx", line 1581, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
ValueError: XPath error: Invalid expression in //a[target = "_blank" > strong::text
2019-12-31 09:37:16 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-31 09:37:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 256,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 60503,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.962091,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 31, 16, 37, 16, 682486),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2019, 12, 31, 16, 37, 15, 720395)}
2019-12-31 09:37:16 [scrapy.core.engine] INFO: Spider closed (finished)