Today I am writing about a website parser that I wrote. In this post, I will show you how to parse the www.cricbuzz.com website, but the logic is nearly the same for other websites, and you can adapt it to your needs. I have used Python in my code, so you need to know Python to follow this post.
So let's begin parsing the cricbuzz site. First go to the website and open the page that you want to parse. Suppose I want to parse the ongoing England vs South Africa Test match: go to the Full Scorecard page for that match.
If you are using the Google Chrome browser, press Ctrl + Shift + J to open the developer tools. A new split pane with several tabs appears.
Then click on the Network tab. If the list of requests is empty, reload the page.
Then click on scorecard.json in the list of requests.
We see that this site uses JSON, a lightweight data-interchange format, to send the data. JSON is easy for machines to parse and generate, and its syntax is based on a subset of JavaScript. Now you can use your own logic to parse the site. I will be using Python's json module to parse the JSON content. Let's start with the code. First, import the json module. We need the URL of the JSON resource to begin parsing, so get it by right-clicking on scorecard.json and copying its URL. You can check the URL by pasting it into your web browser; you should see the raw JSON data.
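To get a feel for what the json module does with text like this, here is a minimal sketch in Python 3. The payload is made up for illustration; the field names (match, innings, team, runs, wickets) are assumptions and do not reflect the real structure of cricbuzz's scorecard.json.

```python
import json

# Hypothetical payload; the real scorecard.json has its own field names.
sample = '{"match": "ENG vs SA", "innings": [{"team": "ENG", "runs": 385, "wickets": 10}]}'

# json.loads turns JSON text into Python objects:
# a JSON object becomes a dict, a JSON array becomes a list.
data = json.loads(sample)

first_innings = data["innings"][0]
summary = "%s %d/%d" % (first_innings["team"],
                        first_innings["runs"],
                        first_innings["wickets"])
print(summary)
```

Once the data is a dict, the parsing logic is ordinary indexing and looping, which is why the same approach carries over to other sites that serve JSON.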
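We need to get the data from this URL to begin parsing. We can use the urllib2 module for this task (in Python 3 it became urllib.request). The following statement fetches the data and parses it into result, which is a Python object (typically a dictionary) rather than a string. Here URL is the copied URL: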
result = json.load(urllib2.urlopen(URL))
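For readers on Python 3, where urllib2 no longer exists, the same step might look like the sketch below. The function name fetch_json and its opener parameter are hypothetical helpers introduced here, not part of the original code, and decoding the response as UTF-8 is an assumption.

```python
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """Download url and return the decoded JSON as Python objects.

    opener is any callable that takes a URL and returns an object with
    read() and close(); it defaults to urllib.request.urlopen and can be
    swapped for a stub when no network is available.
    """
    response = opener(url)
    try:
        # Read the raw bytes and decode them; UTF-8 is assumed here.
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
```

With this in place, result = fetch_json(URL) plays the same role as the one-line statement above.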
The logic I have used is this: if the score has changed after 20 minutes, the script sends an email to the person. To handle the email part, we use Python's smtplib module.
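The logic just described can be sketched roughly as follows in Python 3. This is not the original script: the function names (build_alert, send_alert, watch), the subject line, and the assumption that the check repeats every 20 minutes against an SMTP server supporting STARTTLS are all illustrative choices.

```python
import smtplib
import time
from email.message import EmailMessage

CHECK_INTERVAL = 20 * 60  # seconds between checks (20 minutes)


def build_alert(old_score, new_score, sender, recipient):
    """Return an email message describing the score change."""
    msg = EmailMessage()
    msg["Subject"] = "Score update"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Score changed from %s to %s" % (old_score, new_score))
    return msg


def send_alert(msg, host, port, user, password):
    """Send msg through an SMTP server (STARTTLS support is assumed)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)


def watch(get_score, on_change, interval=CHECK_INTERVAL,
          sleep=time.sleep, rounds=None):
    """Poll get_score() every interval seconds.

    Calls on_change(old, new) whenever the value differs from the
    previous check. rounds limits the number of checks (None = forever).
    """
    last = get_score()
    done = 0
    while rounds is None or done < rounds:
        sleep(interval)
        current = get_score()
        if current != last:
            on_change(last, current)
            last = current
        done += 1
```

Here get_score would be a small function that fetches the JSON and pulls out the score, and on_change would call build_alert and send_alert with your own mail server details. Passing the fetch and notify steps in as functions keeps the polling loop easy to try out without a network connection.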
So, here is the complete code:
The code I have used has little practical use, but the idea was to make the concept clear. If the idea is clear, you can play around with the logic. :)
Very Good post Ankit.....
It made parsing a site look very easy. After reading this post, I tried parsing a few other sites which seemed a big deal before.