dom - Parse text and link pairs from HTML into a PHP array in the same order
Consider this HTML, littered with whitespace and irrelevant tags such as div and span:
<div> <span><a href="#1">title 1</a></span> <p>paragraph 2</p> <p>outside 3 <a href="#4">title 4</a> </p> </div>
How can I convert this into a PHP array of (link, text) pairs, in the same order as they appear in the HTML?
{"#1", "title 1"}, {null, "paragraph 2"}, {null, "outside 3"}, {"#4", "title 4"}
The problem is that a DOM search such as $html->find("a, p") captures 4 twice: once on its own, and once inside the paragraph that contains 3.
I'm wondering whether the solution is to traverse the document "linearly", the way a human would read it: element by element, left to right, and, whenever a node has text, pick up the parent node's href, if any.
If that is viable, how do I go through the DOM like this? I'd like a solution preferably using Simple HTML DOM Parser or a simple regexp, or alternatively a built-in PHP framework.
I'd look at https://github.com/salathe/spl-examples/wiki/recursivedomiterator to recursively traverse the DOM structure.
    // RecursiveDOMIterator is not built into PHP; it is the class from the
    // page linked above and must be included before this code runs.
    $dom = new DOMDocument();
    // Wrap the initial HTML in <html></html>, since it has to be well-formed.
    $dom->loadHTML('<html>' . $htmlstring . '</html>');
    $dit = new RecursiveIteratorIterator(new RecursiveDOMIterator($dom));
    $result = array();
    foreach ($dit as $node) {
        // Keep only non-empty, last-level (leaf) nodes.
        if (trim($node->nodeValue) == "" || $node->hasChildNodes()) {
            continue;
        }
        // Default the link to null so every entry is a (href, text) pair.
        $r = array(null, trim($node->nodeValue));
        $parent = $node->parentNode;
        if ($parent->nodeName == 'a') {
            $r[0] = $parent->getAttribute('href');
        }
        $result[] = $r;
    }
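For anyone who would rather not pull in an extra class, here is a minimal sketch of the same idea using only PHP's bundled DOM extension. It is a swapped-in technique (an XPath query for text nodes), not the iterator from the linked page, and it assumes the XPath results come back in document order, which is how libxml behaves for a path like this. The function name `extractLinkTextPairs` is made up for illustration.

```php
<?php
// Hypothetical helper: returns array(href-or-null, text) pairs in document order.
function extractLinkTextPairs(string $htmlString): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<html><body>' . $htmlString . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $pairs = array();

    // Select every text node that is not whitespace-only.
    foreach ($xpath->query('//text()[normalize-space()]') as $text) {
        $href = null;
        $parent = $text->parentNode;

        // Only the direct parent is checked, as in the answer above.
        if ($parent instanceof DOMElement
            && strtolower($parent->nodeName) === 'a'
            && $parent->hasAttribute('href')) {
            $href = $parent->getAttribute('href');
        }

        $pairs[] = array($href, trim($text->nodeValue));
    }

    return $pairs;
}

$html = '<div> <span><a href="#1">title 1</a></span> <p>paragraph 2</p> '
      . '<p>outside 3 <a href="#4">title 4</a> </p> </div>';
$pairs = extractLinkTextPairs($html);
```

Because each text node is visited exactly once, "title 4" is not reported a second time as part of its enclosing paragraph. Text nested deeper inside a link (for example `<a><b>x</b></a>`) would get a null href here; walking up the ancestors instead of checking only the direct parent would cover that case.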