Parse an HTML string with JS

35
June 09, 2021, at 11:20 AM

I searched for a solution but nothing was relevant, so here is my problem:

I want to parse a string which contains HTML text. I want to do it in JavaScript.

I tried this library but it seems that it parses the HTML of my current page, not from a string. Because when I try the code below, it changes the title of my page:

var parser = new HTMLtoDOM("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>", document);

My goal is to extract links from an HTML external page that I read just like a string.

Do you know an API to do it?

Answer 1

Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
el.getElementsByTagName( 'a' ); // Live NodeList of your anchor elements

Edit: adding a jQuery answer to please the fans!

var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");
$('a', el) // All the anchor elements
Answer 2

It's quite simple:

var parser = new DOMParser();
var htmlDoc = parser.parseFromString(txt, 'text/html');
// do whatever you want with htmlDoc.getElementsByTagName('a');

According to MDN, to do this in chrome you need to parse as XML like so:

var parser = new DOMParser();
var htmlDoc = parser.parseFromString(txt, 'text/xml');
// do whatever you want with htmlDoc.getElementsByTagName('a');

It is currently unsupported by webkit and you'd have to follow Florian's answer, and it is unknown to work in most cases on mobile browsers.

Edit: Now widely supported

Answer 3

EDIT: The solution below is only for HTML "fragments" since html,head and body are removed. I guess the solution for this question is DOMParser's parseFromString() method.

For HTML fragments, the solutions listed here works for most HTML, however for certain cases it won't work.

For example try parsing <td>Test</td>. This one won't work on the div.innerHTML solution nor DOMParser.prototype.parseFromString nor range.createContextualFragment solution. The td tag goes missing and only the text remains.

Only jQuery handles that case well.

So the future solution (MS Edge 13+) is to use template tag:

function parseHTML(html) {
    var t = document.createElement('template');
    t.innerHTML = html;
    return t.content;
}
var documentFragment = parseHTML('<td>Test</td>');

For older browsers I have extracted jQuery's parseHTML() method into an independent gist - https://gist.github.com/Munawwar/6e6362dbdf77c7865a99

Answer 4
var doc = new DOMParser().parseFromString(html, "text/html");
var links = doc.querySelectorAll("a");
Answer 5

The following function parseHTML will return either :

  • a Document when your file starts with a doctype.

  • a DocumentFragment when your file doesn't start with a doctype.

The code :

function parseHTML(markup) {
    if (markup.toLowerCase().trim().indexOf('<!doctype') === 0) {
        var doc = document.implementation.createHTMLDocument("");
        doc.documentElement.innerHTML = markup;
        return doc;
    } else if ('content' in document.createElement('template')) {
       // Template tag exists!
       var el = document.createElement('template');
       el.innerHTML = markup;
       return el.content;
    } else {
       // Template tag doesn't exist!
       var docfrag = document.createDocumentFragment();
       var el = document.createElement('body');
       el.innerHTML = markup;
       for (i = 0; 0 < el.childNodes.length;) {
           docfrag.appendChild(el.childNodes[i]);
       }
       return docfrag;
    }
}

How to use :

var links = parseHTML('<!doctype html><html><head></head><body><a>Link 1</a><a>Link 2</a></body></html>').getElementsByTagName('a');
Answer 6

The fastest way to parse HTML in Chrome and Firefox is Range#createContextualFragment:

var range = document.createRange();
range.selectNode(document.body); // required in Safari
var fragment = range.createContextualFragment('<h1>html...</h1>');
var firstNode = fragment.firstChild;

I would recommend to create a helper function which uses createContextualFragment if available and falls back to innerHTML otherwise.

Benchmark: http://jsperf.com/domparser-vs-createelement-innerhtml/3

Answer 7

If you're open to using jQuery, it has some nice facilities for creating detached DOM elements from strings of HTML. These can then be queried through the usual means, E.g.:

var html = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
var anchors = $('<div/>').append(html).find('a').get();

Edit - just saw @Florian's answer which is correct. This is basically exactly what he said, but with jQuery.

Answer 8
const parse = Range.prototype.createContextualFragment.bind(document.createRange());
document.body.appendChild( parse('<p><strong>Today is:</strong></p>') ),
document.body.appendChild( parse(`<p style="background: #eee">${new Date()}</p>`) );

Only valid child Nodes within the parent Node (start of the Range) will be parsed. Otherwise, unexpected results may occur:

// <body> is "parent" Node, start of Range
const parseRange = document.createRange();
const parse = Range.prototype.createContextualFragment.bind(parseRange);
// Returns Text "1 2" because td, tr, tbody are not valid children of <body>
parse('<td>1</td> <td>2</td>');
parse('<tr><td>1</td> <td>2</td></tr>');
parse('<tbody><tr><td>1</td> <td>2</td></tr></tbody>');
// Returns <table>, which is a valid child of <body>
parse('<table> <td>1</td> <td>2</td> </table>');
parse('<table> <tr> <td>1</td> <td>2</td> </tr> </table>');
parse('<table> <tbody> <td>1</td> <td>2</td> </tbody> </table>');
// <tr> is parent Node, start of Range
parseRange.setStart(document.createElement('tr'), 0);
// Returns [<td>, <td>] element array
parse('<td>1</td> <td>2</td>');
parse('<tr> <td>1</td> <td>2</td> </tr>');
parse('<tbody> <td>1</td> <td>2</td> </tbody>');
parse('<table> <td>1</td> <td>2</td> </table>');
Answer 9

1 Way

Use document.cloneNode()

Performance is:

Call to document.cloneNode() took ~0.22499999977299012 milliseconds.

and maybe will be more.

var t0, t1, html;
t0 = performance.now();
   html = document.cloneNode(true);
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<!DOCTYPE html><html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));

2 Way

Use document.implementation.createHTMLDocument()

Performance is:

Call to document.implementation.createHTMLDocument() took ~0.14000000010128133 milliseconds.

var t0, t1, html;
t0 = performance.now();
html = document.implementation.createHTMLDocument("test");
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<!DOCTYPE html><html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));

3 Way

Use document.implementation.createDocument()

Performance is:

Call to document.implementation.createHTMLDocument() took ~0.14000000010128133 milliseconds.

var t0 = performance.now();
  html = document.implementation.createDocument('', 'html', 
             document.implementation.createDocumentType('html', '', '')
         );
var t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<html><head><title>Test</title></head><body><div id="test1">test</div></body></html>';
console.log(html.getElementById("test1"));

4 Way

Use new Document()

Performance is:

Call to document.implementation.createHTMLDocument() took ~0.13499999840860255 milliseconds.

  • Note

ParentNode.append is experimental technology in 2020 year.

var t0, t1, html;
t0 = performance.now();
//---------------
html = new Document();
html.append(
  html.implementation.createDocumentType('html', '', '')
);
    
html.append(
  html.createElement('html')
);
//---------------
t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")
html.documentElement.innerHTML = '<html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';
console.log(html.getElementById("test1"));
Answer 10

for me. i had to use innerhtml of an element parsed in popover of angular ngx bootstrap popover this is the solution which worked for me

public htmlContainer = document.createElement( 'html' );

in constructor

this.htmlContainer.innerHTML = ''; setTimeout(() => { this.convertToArray(); });

 convertToArray() {
    const shapesHC = document.getElementsByClassName('weekPopUpDummy');
    const shapesArrHCSpread = [...(shapesHC as any)];
    this.htmlContainer = shapesArrHCSpread[0];
    this.htmlContainer.innerHTML = shapesArrHCSpread[0].textContent;
  }

in html

<div class="weekPopUpDummy" [popover]="htmlContainer.innerHTML" [adaptivePosition]="false" placement="top" [outsideClick]="true" #popOverHide="bs-popover" [delay]="150" (onHidden)="onHidden(weekEvent)" (onShown)="onShown()">
Answer 11
let content = "<center><h1>404 Not Found</h1></center>"
let result = $("<div/>").html(content).text()

content: <center><h1>404 Not Found</h1></center>,
result: "404 Not Found"

READ ALSO
Splitting the text and writing to combo box

Splitting the text and writing to combo box

It's only writes to comboBox1What's the problem? Why doesn't it write to other combo boxes?

31
Complex RegEx pattern Extractor Logic [closed]

Complex RegEx pattern Extractor Logic [closed]

Want to improve this question? Update the question so it focuses on one problem only by editing this post

63
Problem with getting JSON to android app, propably problem with url

Problem with getting JSON to android app, propably problem with url

I think i did something wrong with string named "url", but im not sure what should i change, what am i doing wrong?

47
JavaScript heap out of memory? Now I can&#39;t even use npm?

JavaScript heap out of memory? Now I can't even use npm?

UPDATE: so, tried chancing memory size, now I can't even use "npm" commands at all to download new filesEverything throws a memory issue

54