Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags
Since the substr() method seemed to be too much for some to swallow, here is a DOM parser method:
$d = new DOMDocument;
$mock = new DOMDocument;
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
// Copy each child of <body> into an empty document, so only the body's contents remain
foreach ($body->childNodes as $child) {
    $mock->appendChild($mock->importNode($child, true));
}
echo $mock->saveHTML();
http://codepad.org/MQVQ3XQP
Anyone who wishes to see the "other one" can check the revisions.
Use DOMDocument to keep what you need rather than strip what you don't need (requires PHP >= 5.3.6, which added the node argument to saveHTML()):
$d = new DOMDocument;
$d->loadHTMLFile($fileLocation);
$body = $d->getElementsByTagName('body')->item(0);
// perform innerhtml on $body by enumerating child nodes
// and saving them individually
foreach ($body->childNodes as $childNode) {
    echo $d->saveHTML($childNode);
}
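The loop above can be wrapped into a reusable function. What follows is only a minimal sketch: the name bodyInnerHtml is hypothetical, the string type hints assume PHP 7 or later, and the libxml_use_internal_errors() calls are an addition (not part of the original answer) that keeps malformed markup from flooding the output with parser warnings.

```php
<?php
// Hypothetical helper (not from the original answer): returns the inner HTML of <body>.
function bodyInnerHtml(string $html): string
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $d = new DOMDocument;
    $d->loadHTML($html);

    // Discard the collected errors and restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $d->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // no <body> element was found
    }

    // Serialize each child of <body> individually (needs PHP >= 5.3.6).
    $out = '';
    foreach ($body->childNodes as $childNode) {
        $out .= $d->saveHTML($childNode);
    }
    return $out;
}

// Usage with an assumed sample document:
$out = bodyInnerHtml('<html><head><title>t</title></head><body><p>Hi</p></body></html>');
echo $out;
```

Returning a string instead of echoing makes the result easier to post-process or cache.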
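$site = file_get_contents("http://www.google.com/");
// Capture everything between the opening and closing body tags (case-insensitive, dot matches newlines)
if (preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches)) {
    echo $matches[1];
}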
You may want to use the PHP tidy extension, which can repair invalid XHTML structures (on which loading into DOMDocument can fail) and also extract only the body:
$tidy = new tidy();
$htmlBody = $tidy->repairString($html, array(
    'output-xhtml' => true,
    'show-body-only' => true,
), 'utf8');
Then load the extracted body into a DOMDocument:
$xml = new DOMDocument();
$xml->loadHTML($htmlBody);
Then traverse, extract, or move XML nodes around as needed, and save:
$output = $xml->saveXML();
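As one illustration of the "traverse" step, the sketch below removes every script element before saving. It is an assumption-laden example rather than part of the original answer: the literal $htmlBody string stands in for the tidy output so the snippet runs without the tidy extension, and the XPath expression is just one possible query.

```php
<?php
// Assumed sample input standing in for the tidy-repaired body ($htmlBody above).
$htmlBody = '<div id="main"><p>Keep me</p><script>alert(1)</script></div>';

$xml = new DOMDocument();
$xml->loadHTML($htmlBody);

// Find all <script> elements anywhere in the document.
$xpath = new DOMXPath($xml);
$scripts = iterator_to_array($xpath->query('//script'));

// Copying the result into an array first keeps the loop independent of tree changes.
foreach ($scripts as $script) {
    $script->parentNode->removeChild($script);
}

$output = $xml->saveXML();
echo $output;
```

The same pattern works for other edits: query the nodes of interest, then call DOM methods such as removeChild() or appendChild() on them before saving.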