Technology Answer: Extract everything between <object></object>

I am using cURL to download a page. Now I want to extract this from the page:

<object classid="clsid:67DABFBF-D0AB-41fa-9C46-CC0F21721616" width="640"
        height="303.33333333333"
        codebase="http://go.divx.com/plugin/DivXBrowserPlugin.cab"
        id="object701207571">
    <param name="autoPlay" value="false" />
    <param name="custommode" value="Stage6" />
    <param name="src" value="" />
    <param name="movieTitle" value="Titanic" />
    <param name="bannerEnabled" value="false" />
    <param name="previewImage" 
           value="http://stagevu.com/img/thumbnail/oripmqeqzrccbig.jpg" />
    <embed type="video/divx" src="" width="640" height="303.33333333333"
           autoPlay="false" custommode="Stage6" movieTitle="Titanic"
           bannerEnabled="false"
           previewImage="http://stagevu.com/img/thumbnail/oripmqeqzrccbig.jpg"
           pluginspage="http://go.divx.com/plugin/download/"
           id="embed701207571">
    </embed>
</object>

Please help!

From stackoverflow

See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why this is probably the wrong thing to do.

That said, you might be able to get away with something like /(<object>.*?<\/object>)/s. This matches the string "<object>", followed by as few characters as possible, up to the first "</object>". The s on the end tells . to match newlines (it normally doesn't).

strager : +1 for the first paragraph.
This is partially in response to Chas. Owens (because I can't put code in a comment very well). That regex will not match the object tag in the question, because the opening <object> tag has attributes in it. Try this one instead:
```
/(<object[^>]*>)(.*?)(<\/object>)/si
```
It's case-insensitive and split into three groups (opening tag, contents, closing tag) for easy reference. It's not 100% perfect, but it should help.
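As a rough sketch of how that pattern might be applied, assuming PHP (which the DOM-based answers further down also use): the markup below is a shortened, hypothetical stand-in for the downloaded page, and the variable names are only illustrative.

```php
<?php
// Hypothetical sample input, shortened from the markup in the question.
$html = <<<HTML
<p>before</p>
<object classid="clsid:67DABFBF-D0AB-41fa-9C46-CC0F21721616" width="640"
        id="object701207571">
    <param name="movieTitle" value="Titanic" />
</object>
<p>after</p>
HTML;

// s: let . match newlines; i: match tag names case-insensitively.
$pattern = '/(<object[^>]*>)(.*?)(<\/object>)/si';

$openTag = $inner = $closeTag = null;
if (preg_match($pattern, $html, $m)) {
    // $m[0] is the whole match; $m[1]..$m[3] are the three groups.
    [, $openTag, $inner, $closeTag] = $m;
    echo $openTag . $inner . $closeTag, "\n";
}
```

Keeping the opening tag in its own group makes it easy to inspect the attributes separately from the element's contents.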
strager : > is legal in an attribute value, IIRC.
strager : Also, this does not handle nesting.
St. John Johnson : Which is why it is hard to parse HTML with a Regex. But this will work for his attempt.

Chas. Owens : Yeah, these are the dangers of trying to use a regex, which is why I used a half-hearted match-what-he-showed approach. Any time spent attempting to bulletproof the regex is time that should have been spent learning how to use a parser.
This regex will match all the line breaks between the opening and closing tags and capture the entire thing in one group:

/(<object[^>]*?>(?:[\s\S]*?)<\/object>)/gi
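One caveat if this pattern is used from PHP (an assumption here, since the question does not name the language): PHP's PCRE functions do not accept a g modifier, and preg_match_all() takes over that role. A minimal sketch with a hypothetical two-object fragment:

```php
<?php
// Hypothetical page fragment with two separate (non-nested) objects.
$html = <<<HTML
<object id="first">
    <param name="autoPlay" value="false" />
</object>
<p>text in between</p>
<OBJECT id="second">
    <param name="autoPlay" value="true" />
</OBJECT>
HTML;

// Same pattern without the "g": preg_match_all() already finds every match.
$pattern = '/(<object[^>]*?>(?:[\s\S]*?)<\/object>)/i';

$count = preg_match_all($pattern, $html, $matches);
$objects = $matches[1];   // one full <object>...</object> string per match

echo $count, " objects found\n";
foreach ($objects as $object) {
    echo $object, "\n---\n";
}
```

Because [\s\S] matches any character including newlines, the s modifier is not needed with this variant.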

porneL : This will fail if objects are nested.

Scott Evernden : Right, but I don't think I've ever seen objects nested inside objects.

strager : It's completely legal. I've seen it. You can have an image object inside a video object inside a flash object, for example.
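The nesting problem raised in these comments can be reproduced in a few lines. This is only an illustration in PHP with made-up markup, not a fix:

```php
<?php
// Hypothetical nested markup: an inner object sits inside an outer one.
$html = '<object id="outer"><object id="inner"></object><p>fallback</p></object>';

$pattern = '/(<object[^>]*>)(.*?)(<\/object>)/si';
preg_match($pattern, $html, $m);

$matched = $m[0];
// The lazy .*? stops at the FIRST </object>, which closes the inner object,
// so the match is cut short and the rest of the outer element is lost.
echo $matched, "\n";   // prints: <object id="outer"><object id="inner"></object>

$truncated = ($matched !== $html);
```

Making the quantifier greedy does not help either, since it would then run to the last </object> on the page and swallow unrelated objects in between.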
Using SimpleXML:

    $sxe = new SimpleXMLElement($xml);
    $objects = $sxe->xpath('//object[@id="object701207571"]');
    $object = $objects[0];
    $params = $object->xpath('param');
    foreach ($params as $param) {
        $attrs = $param->attributes();
        echo $attrs['name'] . ' = ' . $attrs['value'] . "\n";
    }

    // Get plain XML:
    echo $object->asXML();
Using DOMDocument:

    // loadHTML() is an instance method, so create the document first.
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('object') as $object) {
        echo $doc->saveXML($object);
    }
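The DOMDocument approach can be combined with the XPath query from the SimpleXML answer, which may be useful when the page is not well-formed XML. The following is a sketch under assumptions: the input is a hypothetical, shortened copy of the markup in the question, and in practice $html would be the response body returned by cURL.

```php
<?php
// Hypothetical input standing in for the downloaded page.
$html = <<<HTML
<html><body>
<object id="object701207571" width="640">
    <param name="movieTitle" value="Titanic" />
    <param name="autoPlay" value="false" />
</object>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$params = [];
foreach ($xpath->query('//object[@id="object701207571"]') as $object) {
    // Read each <param> child as a name => value pair.
    foreach ($xpath->query('param', $object) as $param) {
        $params[$param->getAttribute('name')] = $param->getAttribute('value');
    }
    echo $doc->saveHTML($object), "\n";   // the element serialized as HTML
}
print_r($params);
```

Since the parser builds a tree, nested objects and a > inside attribute values are handled without any special cases.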
@St. John Johnson Good sir, it's working ... thanks

Posted by Ku XI at 8:01 PM

Thursday, April 28, 2011
