Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags

Since the substr() method seemed to be too much for some to swallow, here is a DOM parser method:

$d = new DOMDocument;
$mock = new DOMDocument;
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
// Deep-copy each child of <body> into the empty $mock document
foreach ($body->childNodes as $child){
    $mock->appendChild($mock->importNode($child, true));
}

// $mock now holds only what was inside <body>
echo $mock->saveHTML();

http://codepad.org/MQVQ3XQP

Anyone who wishes to see that "other one" can find it in the revisions.


Use DOMDocument to keep what you need rather than strip what you don't need (PHP >= 5.3.6)

$d = new DOMDocument;
$d->loadHTMLFile($fileLocation);
$body = $d->getElementsByTagName('body')->item(0);
// Emulate innerHTML on $body by enumerating its child nodes
// and serializing each one individually
foreach ($body->childNodes as $childNode) {
  echo $d->saveHTML($childNode);
}
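
The same idea can be wrapped in a reusable function. This is only a minimal sketch: the name bodyInnerHtml() and the sample markup are made up for illustration, and it assumes you already have the HTML as a string. It also switches libxml to internal error handling, on the assumption that you would rather not have parser warnings printed for sloppy markup.

```php
<?php
// Hypothetical helper: returns the markup inside <body>, without the tag itself.
function bodyInnerHtml(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // the parser produced no <body> element
    }

    // Serialize each child of <body> and concatenate the results.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Illustrative input only.
$sample = '<html><head><title>t</title></head><body><p>Hi</p><div>x</div></body></html>';
$inner = bodyInnerHtml($sample);
echo $inner;
```

Returning a string instead of echoing inside the loop makes it easier to post-process or test the result.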

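$site = file_get_contents("http://www.google.com/");

// Capture everything between <body ...> and </body>
// (i = case-insensitive, s = dot also matches newlines)
if (preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches)) {
    echo $matches[1];
}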

You may want to use the PHP tidy extension, which can repair invalid XHTML structures (the kind that make DOMDocument's load fail) and can also extract only the body:

$tidy = new tidy();
$htmlBody = $tidy->repairString($html, array(
    'output-xhtml' => true,
    'show-body-only' => true,
), 'utf8');

Then load extracted body into DOMDocument:

$xml = new DOMDocument();
$xml->loadHTML($htmlBody);

Then traverse, extract, or move around the XML nodes as needed, and save:

$output = $xml->saveXML();
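
As a rough illustration of the "traverse, move around, save" step, here is a small sketch. The $htmlBody value is a made-up fragment standing in for the tidy output, so the example runs without the tidy extension; the element names and the reordering are purely illustrative.

```php
<?php
// Hypothetical fragment standing in for the $htmlBody produced by tidy.
$htmlBody = '<div id="main"><p>First</p><p>Second</p></div>';

$xml = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$xml->loadHTML($htmlBody);
libxml_clear_errors();

// Traverse: collect the text of every <p>.
$texts = [];
foreach ($xml->getElementsByTagName('p') as $p) {
    $texts[] = $p->textContent;
}

// Move: make the last <p> the first child of its parent.
$paragraphs = $xml->getElementsByTagName('p');
$last = $paragraphs->item($paragraphs->length - 1);
$last->parentNode->insertBefore($last, $last->parentNode->firstChild);

// Save the modified tree.
$output = $xml->saveXML();
echo $output;
```

Note that loadHTML() wraps a fragment in html and body elements again, so the saved output includes them; if you need only the inner markup, serialize the body's child nodes individually instead.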
