BeautifulSoup freaking out when extracting movie script

February 08, 2019, at 08:00 AM

I am trying to get a movie script as text from this website. It works well up to a certain point, after which the text looks like this:

5   .   
   /   b   >   

                   T   H   E       W   A   L   L   S       C   O   M   E       A   L   I   V   E   !       A       s   e   e   m   i   n   g   l   y       i   n   f   i   n   i   t   e       s   w   a   r   m       o   f       F   I   R   E   
                   D   E   M   O   N   S       r   a   l   l   y       t   o       S   u   r   t   u   r   '   s       a   i   d   .   

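Here is my code:

import requests
from bs4 import BeautifulSoup
# Note: despite its name, website_url holds the downloaded HTML text, not a URL
website_url = requests.get("https://www.imsdb.com/scripts/Thor-Ragnarok.html").text
soup = BeautifulSoup(website_url, "lxml")
# The script body is inside the first <pre> tag on the page
text = soup.pre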

When I print text, it shows the expected output up to section 5, and then I get the garbled text shown above.

Any ideas on why this is happening, and how to fix it?

Answer 1

I used 'html.parser' instead of 'lxml' and was able to display the entire script with proper formatting:

import requests
from bs4 import BeautifulSoup
website_url = requests.get("https://www.imsdb.com/scripts/Thor-Ragnarok.html").text
soup = BeautifulSoup(website_url, 'html.parser')
text = soup.pre

For example, the beginning of section 5 was displayed as:

<b>                           BLUE DRAFT 05/20/16                   5.
</b>
    THE WALLS COME ALIVE! A seemingly infinite swarm of FIRE
    DEMONS rally to Surtur's aid.
<b>                         THOR
</b>               I make grave mistakes all the time.
               Everything seems to work out.
    In the shadows, a massive FIRE DRAGON ROARS.
    The fire demons SURGE FORWARD. Thor backs up, HAMMERING
    AWAY. He then leaps back, SPRINGBOARDS off the wall, and-
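
To check whether the two parsers really disagree on a given page, it can help to run the same HTML through both and compare the results. The following is only a sketch: the helper name pre_text_by_parser and the inline sample string are made up for illustration, and it does not explain why lxml produced the spaced-out text in the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def pre_text_by_parser(html, parsers=("html.parser", "lxml")):
    """Return {parser_name: text of the first <pre>} for each installed parser."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The parser library (e.g. lxml) is not installed; skip it.
            continue
        pre = soup.pre
        results[name] = pre.get_text() if pre is not None else None
    return results


# Hypothetical stand-in for the downloaded page; in practice pass
# requests.get(...).text here instead.
sample = "<html><body><pre><b>  5.\n</b>THE WALLS COME ALIVE!\n</pre></body></html>"

for name, text in pre_text_by_parser(sample).items():
    print(name, repr(text))
```

If the outputs differ for the real page, the parser choice is the likely culprit; if they match, the difference is more likely in the installed versions or the environment.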

Answer 2

Odd... I tried your original code on my machine and could not reproduce the spacing problem you describe. I have lxml 4.3.0, bs4 4.7.1, and Python 3.7.1. What versions do you have?
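
One way to collect those version numbers in a single step is sketched below. It assumes Python 3.8 or later (for importlib.metadata), and the function name report_versions is just an illustrative choice. Note that the bs4 module is distributed under the package name beautifulsoup4.

```python
import sys
from importlib import metadata


def report_versions(packages=("beautifulsoup4", "lxml")):
    """Return a dict of version strings for Python and the given packages."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = "not installed"
    return versions


for name, version in report_versions().items():
    print(f"{name}: {version}")
```

Posting this output alongside the question makes it easier for others to tell whether the problem is tied to a particular lxml or bs4 release.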
