How do I parse multiple dynamic HTML tables using Symfony DomCrawler?

April 19, 2018, at 2:34 PM

I need to be able to scrape data from around 300 different tables on the same page. I can scrape the page and get the entire HTML for the page, but I am having trouble separating it into useful information.

I have read the DomCrawler documentation and I am still having issues.

Each table represents a different company and lists the changes a specific user made to that company's locations.

<table cellpadding="0" cellspacing="0" class="result" style="margin-top: 4px;">
    <tr>
        <td class="prj_sect_heading" colspan="99">CompanyName1</td>
    </tr>
    <tr>
        <td class="rslt_break" colspan="99">Search Results (0 found)</td>
    </tr>
    <tr>
        <td class="rslt_body1" colspan="99">Zero comments found matching the criteria provided.</td>
    </tr>
</table>
<table cellpadding="0" cellspacing="0" class="result" style="margin-top: 4px;">
    <tr>
        <td class="prj_sect_heading" colspan="99">CompanyName2</td>
    </tr>
    <tr>
        <td class="rslt_break" colspan="99">Search Results (0 found)</td>
    </tr>
    <tr>
        <td class="rslt_body1" colspan="99">Zero comments found matching the criteria provided.</td>
    </tr>
</table>
<table cellpadding="0" cellspacing="0" class="result" style="margin-top: 4px;">
    <tr>
        <td class="prj_sect_heading" colspan="99">CompanyName3</td>
    </tr>
    <tr>
        <td class="rslt_break" colspan="99">Search Results (3 found)</td>
    </tr>
    <tr>
        <td class="rslt_head">ID</td>
        <td class="rslt_head">Login /<br />
        Login Group</td>
        <td class="rslt_head">Record Number</td>
        <td class="rslt_head">Comment</td>
        <td class="rslt_head">Time Stamp</td>
    </tr>
    <tr>
        <td class="rslt_body1">1234</td>
        <td class="rslt_body1">examplelogin /<br />
        admin</td>
        <td class="rslt_body1"><a href="https://blahblahblah.com/XXXXXXXXXXXXX" target="_blank">XXXXXXXXXXXXX</a></td>
        <td class="rslt_body1">Status changed from "Ready" to "Complete"</td>
        <td class="rslt_body1">2017-11-01 08:53:05</td>
    </tr>
    <tr>
        <td class="rslt_body2">1234</td>
        <td class="rslt_body2">examplelogin/<br />
        admin</td>
        <td class="rslt_body2"><a href="https://blahblahblah.com/XXXXXXXXXXXXX" target="_blank">XXXXXXXXXXXXX</a></td>
        <td class="rslt_body2">ORDER COMPLETE: Order has completed, changing status.</td>
        <td class="rslt_body2">2017-11-01 08:52:52</td>
    </tr>
    <tr>
        <td class="rslt_body1">1234</td>
        <td class="rslt_body1">examplelogin /<br />
        admin</td>
        <td class="rslt_body1"><a href="https://blahblahblah.com/XXXXXXXXXXXXX" target="_blank">XXXXXXXXXXXXX</a></td>
        <td class="rslt_body1">Order Requested</td>
        <td class="rslt_body1">2017-11-01 07:53:05</td>
    </tr>
    <tr>
        <td class="rslt_body1">4321</td>
        <td class="rslt_body1">examplelogin /<br />
        admin</td>
        <td class="rslt_body1"><a href="https://blahblahblah.com/YYYYYYYYYYYYY" target="_blank">YYYYYYYYYYYYY</a></td>
        <td class="rslt_body1">Order Request for location.</td>
        <td class="rslt_body1">2017-11-01 07:22:05</td>
    </tr>
</table>

Here is the working code that I have so far.

    use Symfony\Component\DomCrawler\Crawler;

    // Get all company tables.
    $tables = $crawler->filter('.result');
    $results = [];

    foreach ($tables as $domTable) {
        $table = new Crawler($domTable);
        // Extract the result count from text such as "Search Results (3 found)".
        preg_match('/\((\d+) found\)/', $table->filter('.rslt_break')->text(), $m);
        $rslt = (int) ($m[1] ?? 0);
        // Skip tables with no results.
        if ($rslt === 0) {
            continue;
        }
        // Get the company name.
        $comp = $table->filter('.prj_sect_heading')->text();
        $results[$comp] = [$rslt];
    }
    print_r($results);

I need to be able to get the following data for each result from all of the tables:

  • Company Name
  • Record Number
  • URL for record number
  • Comment
  • Time Stamp

For example the data for the first result from this HTML would be:

CompanyName3; XXXXXXXXXXXXX; https://blahblahblah.com/XXXXXXXXXXXXX; Status changed from "Ready" to "Complete"; 2017-11-01 08:53:05

So far I have been able to get the Company Name, and the number of results in that company's table. I've spent hours trying to get the remaining data points without any success.

Because the result rows reuse the same classes and have no id attributes, I'm not sure how to iterate through them and pull out only the data I need.

I have been stuck on this for some time, so any help would be greatly appreciated. Thank you!
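
One possible approach, offered as a minimal sketch rather than a confirmed solution: walk every row of every table and keep only the rows that contain five data cells with the `rslt_body1` or `rslt_body2` class, which skips the heading, header, and "zero comments" rows. The helper name `extractResults`, the trimmed `$sampleHtml`, and the column positions (record number in the third cell, comment in the fourth, time stamp in the fifth) are assumptions based on the HTML shown in the question. It also assumes `symfony/dom-crawler` and `symfony/css-selector` are installed through Composer.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: returns one associative array per result row,
 * across all tables with class "result".
 */
function extractResults(Crawler $crawler): array
{
    $results = [];

    $crawler->filter('table.result')->each(function (Crawler $table) use (&$results) {
        $company = trim($table->filter('.prj_sect_heading')->text());

        $table->filter('tr')->each(function (Crawler $row) use (&$results, $company) {
            // Only real result rows contain five rslt_body1/rslt_body2 cells.
            $cells = $row->filter('td.rslt_body1, td.rslt_body2');
            if ($cells->count() < 5) {
                return;
            }

            // Assumed column order: ID, login, record number, comment, time stamp.
            $link = $cells->eq(2)->filter('a');
            $results[] = [
                'company' => $company,
                'record'  => trim($cells->eq(2)->text()),
                'url'     => $link->count() ? $link->attr('href') : null,
                'comment' => trim($cells->eq(3)->text()),
                'time'    => trim($cells->eq(4)->text()),
            ];
        });
    });

    return $results;
}

// Trimmed-down sample modelled on the HTML in the question (illustrative only).
$sampleHtml = <<<'HTML'
<table class="result">
<tr><td class="prj_sect_heading" colspan="99">CompanyName1</td></tr>
<tr><td class="rslt_break" colspan="99">Search Results (0 found)</td></tr>
<tr><td class="rslt_body1" colspan="99">Zero comments found matching the criteria provided.</td></tr>
</table>
<table class="result">
<tr><td class="prj_sect_heading" colspan="99">CompanyName3</td></tr>
<tr><td class="rslt_break" colspan="99">Search Results (1 found)</td></tr>
<tr><td class="rslt_head">ID</td><td class="rslt_head">Login</td><td class="rslt_head">Record Number</td><td class="rslt_head">Comment</td><td class="rslt_head">Time Stamp</td></tr>
<tr><td class="rslt_body1">1234</td><td class="rslt_body1">examplelogin / admin</td><td class="rslt_body1"><a href="https://blahblahblah.com/XXXXXXXXXXXXX" target="_blank">XXXXXXXXXXXXX</a></td><td class="rslt_body1">Order Requested</td><td class="rslt_body1">2017-11-01 07:53:05</td></tr>
</table>
HTML;

// Print each result as a semicolon-separated line.
foreach (extractResults(new Crawler($sampleHtml)) as $row) {
    echo implode('; ', $row), PHP_EOL;
}
```

Filtering on the cell count and class, instead of on row position, means the same loop should work whether a table has zero results or many, since the rows themselves carry no ids to select on.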
