Extracting Information From A Webpage With PHP & CURL
Posted 1 month, 3 weeks ago at 5:17 pm. 12 comments
So it’s been a little while since my strangely popular GTA 4 controversy post and I thought I would journey back to some coding. This time I am going to show you how to use PHP to replicate a little part of something I have been working on for my other site, Celeb ‘O Rama.
It sounds like quite a simple concept, but it’s actually quite hard to do. What, you ask? Well, I wanted a way to show a little information about a particular celebrity in a small CSS or JavaScript based pop-up, and the easiest way to do this is to extract a small amount of info from the excellent Wikipedia.
The pop-up was easy thanks to Cody Lindley and his excellent jTip javascript tool tip. But extracting information from Wikipedia, that’s the hard part. So let’s get started.
I used this code as a Wordpress plugin and therefore used WP 2.5’s new shortcode system to make my own shortcode for this. I will include that part but the code can be used as a standalone by using only the getInfo.php page. You’ll see what I mean.
So let’s start with the shortcode code… You know what I mean.
wp-wiki.php
This is the page that lets Wordpress know what to do when it receives a [wiki] shortcode tag.
So here we go. First we need to tell WP that this is a plugin file. To do that, add this comment to the top of your page:
/*
Plugin Name: wp-wiki
Plugin URI: http://returntrue.natural-tys.com/
Description: Uses WP 2.5 Shortcode to show wikipedia information for that word in a jTip tooltip
Author: Veneficus Unus
Version: 1.0
Author URI: http://celeb-o-rama.natural-tys.com/
*/
You can of course change that, but it would be nice if you would leave my URI and name in.
Now that Wordpress knows this file is a plugin we can get on with the code. We will go through it a little bit at a time, so let’s go:
add_shortcode('wiki', 'wiki_shortcode');

function wiki_shortcode($attr, $word = NULL) {
    extract(shortcode_atts(array(
        'title' => "{$word}",
        'no' => 2,
        'width' => '',
        'height' => '',
    ), $attr));
First we add a new shortcode to Wordpress and tell it to run wiki_shortcode() whenever it finds that shortcode in a post.
Next we define that function. $attr holds the attributes given in the shortcode, if any. For example, [wiki title="Charlotte Hatherley" width="400" height="400"] in the finished plugin would tell it to give the tooltip a title of ‘Charlotte Hatherley‘ and a width & height of 400px. $word holds the text between the tags, because we will be using the shortcode in the enclosing [wiki]test[/wiki] format instead of the bare [wiki] format. I hope that makes sense.
Then we use extract() on the array we get back. extract() takes the keys and makes them proper variables, with the array values as their values. shortcode_atts() starts from the defaults given and replaces any of them with the value specified in the shortcode.
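If the combination of shortcode_atts() and extract() is hard to picture, here is a small standalone sketch. demo_shortcode_atts() is a hypothetical stand-in I have written to mimic the merging behaviour so the example runs outside Wordpress; inside a plugin you would call the real shortcode_atts().

```php
<?php
// Hypothetical stand-in for WordPress's shortcode_atts(), so this runs
// outside WordPress: keep only the known keys, preferring the value from
// the shortcode over the default when one was supplied.
function demo_shortcode_atts(array $defaults, $attr)
{
    $attr = (array) $attr; // the attributes may not be an array when none are given
    $out = array();
    foreach ($defaults as $name => $default) {
        $out[$name] = array_key_exists($name, $attr) ? $attr[$name] : $default;
    }
    return $out;
}

// Simulates [wiki width="400"]Charlotte Hatherley[/wiki]
$word = 'Charlotte Hatherley';
$attr = array('width' => '400');

$merged = demo_shortcode_atts(array(
    'title'  => $word,
    'no'     => 2,
    'width'  => '',
    'height' => '',
), $attr);

// extract() turns each key into a variable: $title, $no, $width, $height
extract($merged);

echo "$title / $no / $width / [$height]\n";
// prints: Charlotte Hatherley / 2 / 400 / []
```

Only width was supplied, so the other three variables fall back to their defaults.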
Ok, next up:
    srand((double) microtime() * 1000000);
    $rid = rand(0, 2000);

    if ($word == NULL)
        return false;
    else
        $newWord = convertForWiki($word);

    $confirm = get_web_page('http://en.wikipedia.org/wiki/' . $newWord);
Ok, we make a random number between 0 and 2000 and assign it to $rid; this is for later. Then we check to see if there was a $word. Remember, $word is the word given between the shortcode tags. If there isn’t one ($word is NULL) then we can’t continue, since we have no word to look for at Wiki, so we return false to cancel the whole process and the [wiki] tag is removed. Otherwise we use a function called convertForWiki() to convert the word into a format recognisable by Wikipedia. I’ll get to that function shortly.
Now we confirm that a page exists at Wikipedia for that word by using CURL. But I will get to that function a little later too.
    if ($confirm['errno'] != 0 || $confirm['http_code'] != 200) :
        $output = $word;
    else :
        if (preg_match("/Wikipedia does not have an article with this exact name\./i", $confirm['content'])) :
            return $word;
        endif;
        // Every parameter after the first is joined with '&' (a second '?' would break the query string)
        $output = '<a href="'.get_bloginfo('wpurl').'/wp-content/plugins/wp-cwiki/getInfo.php?word='.$newWord.'&no='.$no;
        if ($width) $output .= '&width='.$width;
        if ($height) $output .= '&height='.$height;
        $output .= '" name="'.$title.'" title="'.$word.'" class="jTip" id="jTip-'.$newWord.'-'.$rid.'">'.$word.'</a>';
    endif;
    return $output;
}
This is the last part of the wiki_shortcode() function. The CURL function returns an array with the error code, HTTP code & content of the page. First to confirm the page existed we need to check the error & HTTP codes. If they are 0 & 200 respectively then we can go ahead. If not we assume no page exists and we just output the original word without any link.
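To make that check easier to reuse, it could be pulled into a tiny helper. This is only a sketch: page_is_usable() is a hypothetical name, and the sample arrays are hand-made so the example runs without touching the network. It assumes the result has the same 'errno' and 'http_code' keys that the CURL function returns.

```php
<?php
// Hypothetical helper: decides whether a get_web_page()-style result
// (keys 'errno', 'http_code', 'content') is worth parsing.
function page_is_usable(array $result)
{
    // errno 0 means cURL itself reported no transport error
    $transport_ok = isset($result['errno']) && (int) $result['errno'] === 0;
    // 200 means the server answered with a normal page
    $status_ok = isset($result['http_code']) && (int) $result['http_code'] === 200;
    return $transport_ok && $status_ok;
}

// Hand-made sample results, so no network access is needed.
$ok      = array('errno' => 0,  'http_code' => 200, 'content' => '<html></html>');
$missing = array('errno' => 0,  'http_code' => 404, 'content' => '');
$timeout = array('errno' => 28, 'http_code' => 0,   'content' => false); // 28 is cURL's timeout error

var_dump(page_is_usable($ok));      // bool(true)
var_dump(page_is_usable($missing)); // bool(false)
var_dump(page_is_usable($timeout)); // bool(false)
```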
If we get past that check, a page exists, but since Wikipedia has a custom 404 page we can’t assume it is the right one. To check it is the right one we make sure we don’t have Wikipedia’s 404 page by looking for a prominent sentence such as this one:
Wikipedia does not have an article with this exact name.
I use preg_match() since I find it more reliable than stristr(), although either would work for a fixed sentence like this one. If we are on Wikipedia’s 404 page we can just return early with the original word, since we can’t give a link.
Otherwise we can continue on and make the link for Cody Lindley’s jTip. This requires a little bit of complex concatenation. It can get confusing, but I’m sure you can figure it out from looking at the code, since an explanation would take too long.
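If the concatenation is hard to follow, the same kind of link can be built with PHP’s http_build_query(), which joins the parameters for you. This is a sketch of an alternative, not the code the plugin ships with: build_tip_link() and the example.com base URL are hypothetical, and it escapes the values with htmlspecialchars(), which the original does not do.

```php
<?php
// Hypothetical helper: builds the jTip link without hand-written '&' joins.
// $base is wherever your getInfo.php lives.
function build_tip_link($base, array $tip)
{
    $params = array('word' => $tip['newWord'], 'no' => $tip['no']);
    // Only add the optional sizes when they were actually given
    if ($tip['width'] !== '') {
        $params['width'] = $tip['width'];
    }
    if ($tip['height'] !== '') {
        $params['height'] = $tip['height'];
    }

    $href = $base . '?' . http_build_query($params, '', '&');

    // htmlspecialchars() keeps quotes and ampersands from breaking the attributes
    return '<a href="' . htmlspecialchars($href) . '"'
        . ' name="' . htmlspecialchars($tip['title']) . '"'
        . ' title="' . htmlspecialchars($tip['word']) . '"'
        . ' class="jTip"'
        . ' id="jTip-' . htmlspecialchars($tip['newWord']) . '-' . (int) $tip['rid'] . '">'
        . htmlspecialchars($tip['word']) . '</a>';
}

$link = build_tip_link('http://example.com/getInfo.php', array(
    'word'    => 'Kate Tunstall',
    'newWord' => 'Kate_Tunstall',
    'title'   => 'Kate Tunstall',
    'no'      => 2,
    'width'   => '400',
    'height'  => '',
    'rid'     => 7,
));

echo $link, "\n";
```

Because height is empty here it is left out of the query string altogether, so there is no dangling separator to worry about.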
Finally we return the $output whatever it may be.
function convertForWiki($c) {
    $c = trim($c);
    $c = ucwords($c);
    if (preg_match("/ /", $c))
        $c = str_replace(" ", "_", $c);
    return $c;
}
Next we have the convertForWiki() function I mentioned earlier. It basically turns this ‘Kate Tunstall’ into this ‘Kate_Tunstall’ as that is the format that Wikipedia’s URL takes. If you want a more detailed explanation of the function just ask.
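Here is a quick standalone run of that function so you can see what it does to a few inputs. The function body is repeated (slightly shortened) so the example runs on its own. The rawurlencode() step at the end is my own suggestion rather than part of the plugin: it assumes you might meet titles containing characters such as & that would otherwise break the word= parameter.

```php
<?php
// Same logic as convertForWiki(), repeated so this example runs alone.
function convertForWiki($c)
{
    $c = trim($c);
    $c = ucwords($c); // upper-cases the first letter of each word
    return str_replace(' ', '_', $c);
}

echo convertForWiki('  kate tunstall '), "\n"; // Kate_Tunstall
echo convertForWiki('GTA 4'), "\n";            // GTA_4

// Assumption: encoding the result before putting it in a URL is a sensible
// extra step for titles with characters that are not URL-safe.
echo rawurlencode(convertForWiki('rock & roll')), "\n"; // Rock_%26_Roll
```

Note that ucwords() only raises the first letter of each word; it does not lower-case the rest, so ‘GTA 4’ stays as it is.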
Finally:
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "spider", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    );

    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err = curl_errno( $ch );
    $errmsg = curl_error( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}
This is a CURL function I got from a website a little while ago now and boy, am I glad I kept it. I can’t remember what site but thank you whoever you are.
That’s it for this file. Just put it in a folder called ‘wp-wiki’ or something and hold onto it until we finish the next page. Whatever you call the folder, make sure it matches the path used in the link code above, which points at ‘wp-cwiki’.
getInfo.php
Ok, last page.
$word = urldecode($_GET['word']);
$no = $_GET['no'];
$url = "http://en.wikipedia.org/wiki/".$word;
$content = get_web_page($url);
$content = extractContent($content['content'], $no);
echo $content;
This bit is quite simple. We get the word which was handed over in the URL earlier and assign it to $word. PHP already decodes $_GET values for us (and underscores are never encoded in a URL anyway), so the extra urldecode() call is not strictly needed. We also get no, which I forgot to mention before is the number of paragraphs you want to retrieve from the Wiki article; 2 is the default. Finally we build the URL, which is the standard Wikipedia URL with the word put on the end.
Then we use the CURL function from before again, yes that means you’ll need to include the get_web_page() function from above on this page too, to get the content and then we run the content through another function called extractContent(). Once that’s done we echo out the result.
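Since getInfo.php can be called directly by anyone, it may be worth tidying up the no value before using it. The following is a sketch under my own assumptions: paragraph_count() is a hypothetical helper, the default of 2 matches the shortcode default, and the cap of 10 is an arbitrary limit you can change.

```php
<?php
// Hypothetical helper: turn the raw ?no= value into a sane paragraph count.
function paragraph_count($raw, $default = 2, $max = 10)
{
    // Anything that is not a plain run of digits falls back to the default
    if (!is_scalar($raw) || !ctype_digit((string) $raw)) {
        return $default;
    }
    $n = (int) $raw;
    if ($n < 1) {
        return $default;
    }
    return min($n, $max); // never return more than the cap
}

// In getInfo.php this might be used as:
$no = paragraph_count(isset($_GET['no']) ? $_GET['no'] : null);

var_dump(paragraph_count('3'));   // int(3)
var_dump(paragraph_count('abc')); // int(2)
var_dump(paragraph_count('500')); // int(10)
```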
Here is that extractContent() function:
function extractContent($c, $no) {
    global $url, $word;
    $output = ''; // start with an empty string so .= has something to append to
    $tree = new DOMDocument();
    @$tree->loadHTML($c);
    $count = 1;
    foreach($tree->getElementsByTagName('div') as $div) :
        if($div->getAttribute('id') == "bodyContent") :
            foreach($div->getElementsByTagName('p') as $p) :
                if($count <= $no) :
                    $output .= "<p>".$p->nodeValue."</p>";
                endif;
                $count++;
            endforeach;
        endif;
    endforeach;
    //Clean up wikipedia stuff;
    $output = preg_replace("/\[(\d+)\]/", "", $output);
    //Add excerpt taken from...
    $output = $output.'<strong style="float:right; display:block; font-size:9px;">Excerpt From Wikipedia</strong>';
    return $output;
}
Ok, so we pass along the content and the number of paragraphs we want. Then we make the url and the word global for later. We make a new DOMDocument so we can traverse the DOM, which makes it quite easy to get the info we are after. We set up a count. Then we loop through all of the divs on the page. If we get one with the ID of ‘bodyContent’, which is Wikipedia’s content div, then we want to be inside there, so we run a loop on that div for all of the paragraphs inside it.
Lucky for us the first paragraph on all of Wikipedia’s pages is the content we want, so we just add the nodeValue, which is the text contents of the p, to a variable called $output. We have to add the p tags back too, as we retrieved the contents of the p’s, not the p’s themselves. We do that for as long as the count has not gone past the maximum number of paragraphs set. We add one to the count on every pass so the counter works, and then exit each loop.
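For comparison, the same lookup can be written with DOMXPath, which lets you ask for the paragraphs inside the content div in a single query. This is a sketch rather than a drop-in replacement: first_paragraphs() is a hypothetical name, the sample HTML is made up, and it assumes the content div still has the id ‘bodyContent’.

```php
<?php
// Sketch: pull the first $limit paragraphs out of div#bodyContent with XPath.
function first_paragraphs($html, $limit)
{
    // Collect parser warnings quietly instead of using the @ operator
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $paragraphs = array();
    foreach ($xpath->query('//div[@id="bodyContent"]//p') as $p) {
        if (count($paragraphs) >= $limit) {
            break; // we have enough, stop looping
        }
        $paragraphs[] = trim($p->nodeValue); // text only, inner tags are dropped
    }
    return $paragraphs;
}

// Made-up sample page so the example runs without fetching anything.
$sample = '<html><body>'
    . '<div id="sidebar"><p>Navigation</p></div>'
    . '<div id="bodyContent"><p>First <b>paragraph</b>.</p><p>Second.</p><p>Third.</p></div>'
    . '</body></html>';

$result = first_paragraphs($sample, 2);
print_r($result);
```

With the sample above, the sidebar paragraph is ignored and only the first two paragraphs inside the content div come back.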
Wikipedia has little reference icons which reference links at the bottom of the page that look like this [1]. We don’t want to include these so we use a quick preg_replace() to get rid of them. Then I want to credit Wiki so I add a little text to say it is from Wikipedia. Finally we return it.
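Here is that preg_replace() on its own with a made-up sentence, so you can see exactly what gets removed. The second pattern, which also strips a ‘[citation needed]’ marker, is my own assumption about something you might want and is not part of the plugin.

```php
<?php
// Made-up sample text with numeric reference markers.
$text = 'She was born in 1979.[1][2] Her first album followed.[13]';

// \[ and \] match literal brackets, \d+ matches one or more digits.
$clean = preg_replace('/\[\d+\]/', '', $text);
echo $clean, "\n"; // She was born in 1979. Her first album followed.

// Assumption: a wider pattern if you also want to drop "[citation needed]".
$text2  = 'A bold claim.[citation needed]';
$clean2 = preg_replace('/\[(?:\d+|citation needed)\]/i', '', $text2);
echo $clean2, "\n"; // A bold claim.
```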
That’s it. If you have the jTip code on your page, all you need to do is put this file in the folder from before, upload it into your Wordpress plugins directory and enable it. It has been tested in WP 2.5 and runs on PHP 5; I am unsure about PHP 4. You must also have the CURL PHP extension enabled. If you don’t know whether you do, either ask your host or make a blank PHP page with phpinfo() written on it. View that page and look for CURL. If it’s not there then you won’t be able to run this code without installing it.
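If you would rather not read through the whole phpinfo() page, a couple of built-in PHP functions can answer the question directly. A minimal sketch, where the messages are just examples:

```php
<?php
// extension_loaded() and function_exists() are both built into PHP.
if (extension_loaded('curl') && function_exists('curl_init')) {
    $status = 'CURL is available';
} else {
    $status = 'CURL is missing, ask your host to enable it';
}
echo $status, "\n";
```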
I hope you all enjoyed this tutorial. Download is now available below, it includes the wiki plugin & jTip:
WP-Wiki (2.5 KB, 0 hits)
If you have any problems at all just give me a shout.
Although I love to code & I love to give all of my code out for free it takes a lot of time & hard work to make these so if you would like to help me or show your gratitude please consider donating via the link at the top of the page. Thank you.






Hi,
I tried to install the two wordpress plugins. I activated them but when we call the [wiki]test[wiki], it does not work. It displays nothing and it breaks the page.
I use WP 2.5.1, but I don’t think this is the reason why it does not work. Any idea why ?
I am willing to donate if I can fix this problem.
Herve
Hey there Herve,
I think I might have just figured out the problem and if you download the files again it should be sorted.
Sorry about that, I accidentally uploaded a version I use for testing new features, thanks for making me realise what I had done.
If that doesn’t help then please feel free to give me another call.
Thank you for the offer to donate, although I don’t require one to give out help it is still nice to receive a donation. So if you do decide to leave one, thank you for your kindness.
Hi. Thanks for the mini tutorial. I was looking for a better way to do content-extracting than what I have in mind (fopen :p).
Cheers.
.tre.
No problem.
Can’t extract information from a webpage in JavaScript with PHP & CURL
I can’t extract information from a webpage in JavaScript with PHP & CURL. Who can help me?
I can’t extract information from a webpage loaded by an ActiveXObject in a JavaScript file with PHP & CURL. Please help me.
This script, as far as I am aware, doesn’t use an activeXobject. It just uses some basic javascript and CURL. The most probable reason for this script not working is that you do not have CURL enabled in your PHP installation. To check just use phpinfo(). Other than that I can’t help you without a more detailed description of your problem.
Example: I want to get the prices from the site http://hose.eps.com.vn/
I think it is loaded by AJAX. Can you help me get all of the content of that webpage?
Please help me, thanks
You cannot crawl that site without using a more advanced spider to crawl the website and wait for the ajax info to load. The reason you aren’t getting info is because the spider is going to the site and getting the information that is on the page before the AJAX has been loaded.
The code I developed here was built only to crawl sites such as Wikipedia as the information is open source and allowed to be used on other sites.
If the site does not explicitly say you are allowed to use its information on other websites I cannot help you. Sorry.
Ok, No problem.