Crawling A Website With PHP, But The Website Runs JS To Generate Markup


I have been doing web crawling for the last couple of weeks. Using a PHP library (PHP Simple DOM), I'm running a PHP script from the terminal to fetch some URLs and export some of their data as JSON. This has been working very nicely so far.

Recently I wanted to expand the crawling to a specific site and encountered the following problem:

Unlike any other site so far, this one only outputs barebones markup server-side and instead relies on a single JS script to build the relevant markup on load.

Obviously my PHP script can't handle that (it does not execute the JS, so the page stays mostly blank from what I can tell), and so I can't crawl the site, since the content has not been created yet.

I'm unsure how to proceed. Is it actually possible to make my current PHP script "compatible" with that site, or do I need to change gears and incorporate a browser, i.e. pick a completely different route?

I'm currently thinking I would need to create an HTML/JS page that opens the URL in an iframe, so that I could run a JS function manually via the console to extract the data. However, I'm hoping there is a more feasible way.




When I need to scrape a website I normally:

1 - Navigate the target website in a normal browser (Firefox, Chrome, etc.) while monitoring/logging any POST/GET requests containing pertinent info via Developer Tools -> Network tab.
Pay special attention to XHR requests, as they normally contain JSON-encoded data.
Here's a small video I've made exemplifying this:

You can mimic the request headers the browser sent (explained in the video) and use them in a curl request, e.g.:

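<?php
// Request headers copied from the browser's Network tab.
$headers = [
    "Connection: keep-alive",
    "Accept: application/json, text/javascript, */*; q=0.01",
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "DNT: 1",
    "Accept-Language: pt,en-US;q=0.9,en;q=0.8,pt-PT;q=0.7,pt-BR;q=0.6",
];

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, ""); // put the XHR URL found in the Network tab here
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$server_output = curl_exec($ch);
if ($server_output === false) {
    // curl_exec() returns false on failure; report the reason.
    fwrite(STDERR, curl_error($ch) . PHP_EOL);
}
curl_close($ch);
print $server_output;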
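
For comparison, here is a minimal sketch of the same idea in Python using only the standard library. The function names (build_request, fetch_json) and the shortened header set are my own choices for illustration, not something taken from the site in question, and the endpoint URL is whatever you find in the Network tab:

```python
import json
import urllib.request

# Illustrative subset of the headers copied from the browser's Network tab.
HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url, headers=HEADERS):
    """Create a GET request that carries the copied browser headers."""
    return urllib.request.Request(url, headers=dict(headers))


def fetch_json(url, timeout=10):
    """Call the endpoint and decode its JSON body into Python objects."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))
```

If the endpoint really returns JSON, fetch_json("<the XHR URL>") hands you dicts and lists directly, so no HTML parsing is needed at all.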

2 - In some cases, it's impossible to crawl certain URLs without a JavaScript-enabled client. When this happens, I normally use Selenium with Chrome or Firefox. You can also use PhantomJS, a headless browser, although its development has been suspended since 2018. Recent versions of Chrome and Firefox (the latter driven through GeckoDriver) can also run headless under Selenium.

I'm aware the question is about PHP, but if the OP needs to use Selenium, Python is way more intuitive I'd say. Based on that, here's a Selenium example in Python:

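from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
# URL and search term assumed from the Selenium docs example this is based on.
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element(By.NAME, "q")  # find_element_by_name() was removed in Selenium 4
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.quit()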

Example Src
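
For a page that builds its markup with JS, you usually also have to wait until the script has finished before reading the DOM. Below is a rough sketch of how that could look with headless Firefox, assuming Selenium 4 and GeckoDriver are installed. The function name fetch_rendered_html and the css_selector parameter are hypothetical: pass a selector for an element that only exists after the site's script has run.

```python
def fetch_rendered_html(url, css_selector, timeout=15):
    """Load url in headless Firefox and return the DOM once css_selector appears."""
    # Imported here so that merely importing this module does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Block until the page's script has inserted the element we care about.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always shut the browser down, even on timeout
```

The returned string is the fully rendered markup, so it can be handed to whatever HTML parser you already use for the other sites.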