Working on a partial answer to this question, I came across a bs4.element.Tag that is a mess of nested dicts and lists (s, below).
Is there a way to return a list of URLs contained in s without using re.findall? Other comments regarding the structure of this tag are helpful too.
from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
## the first bit of s:
# s
# Out[116]:
# <script type="application/ld+json">
# {"@context":"http://schema.org","@type":"ItemList","numberOfItems":50,
What I've tried:
My problem is that s only has one attribute (type) and doesn't seem to have any child tags.
You can use s.text to get the content of the script. It's JSON, so you can then just parse it with json.loads. From there, it's simple dictionary access:
import json
from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(urls)
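The dictionary access can also be tried offline, without fetching the page. The sketch below assumes the script content has the shape shown in the question (an ItemList whose itemListElement entries each carry a url key). The sample payload, its example.com URLs, and the extract_urls helper are made up for illustration and are not part of the real page.

```python
import json

# Made-up payload mimicking the start of the script content shown in the
# question; the URLs are placeholders, not real job listings.
sample = '''
{"@context": "http://schema.org", "@type": "ItemList", "numberOfItems": 2,
 "itemListElement": [
   {"@type": "ListItem", "position": 1, "url": "https://example.com/jobs/1"},
   {"@type": "ListItem", "position": 2, "url": "https://example.com/jobs/2"}
 ]}
'''


def extract_urls(raw_json):
    """Return the 'url' of every 'itemListElement' entry that has one."""
    data = json.loads(raw_json)
    # .get() with a default keeps this from raising KeyError if the page
    # ever omits 'itemListElement'.
    return [el["url"] for el in data.get("itemListElement", []) if "url" in el]


urls = extract_urls(sample)
print(urls)
```

With the real page, the same helper could be called as extract_urls(s.text). Skipping entries without a url key is a defensive choice; the one-liner above raises KeyError instead, which may be preferable if a missing URL should be treated as an error.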