Parsing a script tag with dicts in BeautifulSoup

July 07, 2017, at 1:32 PM

Working on a partial answer to this question, I came across a bs4.element.Tag that is a mess of nested dicts and lists (s, below).

Is there a way to return a list of the URLs contained in s without using re.findall? Other comments on the structure of this tag would be helpful too.

from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
## the first bit of s:
# s
# Out[116]: 
# <script type="application/ld+json">
# {"@context":"http://schema.org","@type":"ItemList","numberOfItems":50,

What I've tried:

  • Browsing the methods available on s with tab completion.
  • Picking through the docs.

My problem is that s only has 1 attribute (type) and doesn't seem to have any child tags.

Answer 1

You can use s.text to get the content of the script tag (s.string also works here, since the tag holds a single string rather than child tags). The content is JSON, so you can parse it with json.loads. From there, it's plain dictionary and list access:

import json
from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(urls)
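
The one-liner above raises a KeyError if any entry lacks a 'url' key or if the page layout changes. A more defensive version might look like the sketch below. It takes the JSON text (for example, s.text from the code above) and needs only the standard library. The function name extract_urls and the sample data are hypothetical, and the fallback to a nested "item" object is an assumption about how schema.org ItemList entries may be shaped, not something confirmed by the page in the question.

```python
import json


def extract_urls(json_ld_text):
    """Return the URLs found in a schema.org ItemList given as JSON text."""
    data = json.loads(json_ld_text)
    # Some pages wrap several JSON-LD objects in a top-level list.
    blocks = data if isinstance(data, list) else [data]
    urls = []
    for block in blocks:
        if not isinstance(block, dict):
            continue
        for el in block.get("itemListElement", []):
            if not isinstance(el, dict):
                continue
            # Assumption: the URL sits on the entry itself or on a nested "item".
            url = el.get("url")
            if url is None and isinstance(el.get("item"), dict):
                url = el["item"].get("url")
            if url is not None:
                urls.append(url)
    return urls


# Hypothetical sample data, shaped like the start of the tag in the question.
sample = '''
{"@context": "http://schema.org", "@type": "ItemList", "numberOfItems": 2,
 "itemListElement": [
   {"@type": "ListItem", "position": 1, "url": "https://example.com/jobs/1"},
   {"@type": "ListItem", "position": 2, "item": {"url": "https://example.com/jobs/2"}}
 ]}
'''
print(extract_urls(sample))
```

With the real page you would call extract_urls(s.text) instead of passing the sample string. Entries without a recognizable URL are skipped rather than raising an error.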