Monday, October 21, 2013

Retrieve Google Search results in JSON format

Let's say you want to...

Write a Python script which searches a specific site for a query in Google and returns the results [in JSON format].

Sounds easy, huh? Use requests to, err, make the HTTPS requests, then BeautifulSoup to parse the links out. (And what would be the easiest way to grab those links? By getting the XPath from the browser...)

Well, that's what I thought. Wrong.

Google doesn't want you to do that. And when Google doesn't want you to do it, you just can't. You'll find yourself blocked from making automatic requests pretty darn fast. You'll be required to solve CAPTCHAs in order to search. Not ideal for an automatic crawler...

The impolite solution would be to write a very sophisticated crawler which uses multiple proxies to cover the search space without seeming too suspicious. The advantage is, you'll get it done. The obvious disadvantage: it's complicated!

The polite solution would be to use Google's own tool for this job: Custom Search. You get 100 queries per day for free, and can pay for more (or use more accounts, which is surely easier than juggling more proxies...).

I wrote this guide today while performing these actions, so I can guarantee it works, for now. Doing this will take you to far, strange places in Google that I had never seen before. Their documentation is quite lousy, which is why I wrote this guide. I hope it won't get outdated too soon. The official documentation can be found here: https://developers.google.com/custom-search/docs/overview

The short explanation

Go to Google Cloud Console and log in.
https://cloud.google.com/console
Create a new project.
Under APIs and Auth > APIs, enable the Custom Search API.
Under APIs and Auth > Registered Apps, register a new web application.
Keep the API KEY listed under Browser Key.

Go to Google Custom Search and log in.
https://www.google.com/cse/
Add a site you would like to search.
Under Edit search engine > Setup, click on the Search engine ID button. Keep it as well.

Construct the URL as follows:
https://www.googleapis.com/customsearch/v1?q=<query>&cx=<Search engine ID>&key=<API KEY>

For instance:
https://www.googleapis.com/customsearch/v1?q=ooad&cx=011540632070411059229:cvop8g3tl2q&key=AIzaSyBNND2nT0vVb4M0875cxkhcUwzepeIkXng
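If you'd rather not glue the URL together by hand, here is a minimal sketch that builds it from the three parameters shown above (q, cx, key). The function name build_search_url is made up for this example, and it only handles these three parameters. urlencode takes care of escaping spaces in the query and the colon in the Search engine ID.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, engine_id, api_key):
    """Return the Custom Search URL for a query, Search engine ID and API KEY.

    Hypothetical helper for illustration; only q, cx and key are filled in.
    """
    params = {"q": query, "cx": engine_id, "key": api_key}
    # urlencode percent-escapes reserved characters (e.g. ':' becomes %3A)
    # and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder values; substitute your own Search engine ID and API KEY.
    print(build_search_url("ooad exam", "<Search engine ID>", "<API KEY>"))
```

The escaped form is equivalent to the hand-written URL, so either one should work in the browser or in requests.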

Apparently, this works from code as well (requests is required):

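import json
import requests

# The example URL from above: query, Search engine ID and API KEY in the query string.
url = "https://www.googleapis.com/customsearch/v1?q=ooad&cx=011540632070411059229:cvop8g3tl2q&key=AIzaSyBNND2nT0vVb4M0875cxkhcUwzepeIkXng"
r = requests.get(url)
r.raise_for_status()  # fail loudly on HTTP errors (bad key, quota exceeded, ...)
data = r.json()
# Open in text mode: json.dump writes str, so 'wb' raises a TypeError on Python 3.
with open('data.json', 'w') as f:
    json.dump(data, f)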
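One thing worth knowing when saving: the sample results further down contain Hebrew titles, and json.dump escapes non-ASCII characters as \uXXXX by default. That's still valid JSON, but not very readable. A small sketch, assuming Python 3, that keeps the text as-is (save_results and load_results are names I made up for the example):

```python
import json


def save_results(data, path):
    """Write the response dict to path as UTF-8 JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes the characters themselves instead of \uXXXX escapes.
        json.dump(data, f, ensure_ascii=False, indent=1)


def load_results(path):
    """Read a JSON file written by save_results back into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Either way the data round-trips unchanged; the only difference is how the file looks in a text editor.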

The results will look like this:

{
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "request": [
   {
    "title": "Google Custom Search - ooad",
    "totalResults": "3",
    "searchTerms": "ooad",
    "count": 3,
    "startIndex": 1,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "011540632070411059229:cvop8g3tl2q"
   }
  ]
 },
 "context": {
  "title": "Cs.bgu.ac.il"
 },
 "searchInformation": {
  "searchTime": 0.247919,
  "formattedSearchTime": "0.25",
  "totalResults": "3",
  "formattedTotalResults": "3"
 },
 "items": [
  {
   "kind": "customsearch#result",
   "title": "ADSS122 (Spring 2012)",
   "htmlTitle": "ADSS122 (Spring 2012)",
   "link": "http://www.cs.bgu.ac.il/~guyrap/adss122/",
   "displayLink": "www.cs.bgu.ac.il",
   "snippet": "OOAD Winter 2012's Moed Aleph Solution: PDF, pptx. (10.07.2012) New OCL   Examples: PDF, docx. Preparation for Moed Bet, ADSS Spring 2012: PDF, pptx.",
   "htmlSnippet": "\u003cb\u003eOOAD\u003c/b\u003e Winter 2012's Moed Aleph Solution: PDF, pptx. (10.07.2012) New OCL \u003cbr\u003e  Examples: PDF, docx. Preparation for Moed Bet, ADSS Spring 2012: PDF, pptx.",
   "cacheId": "WC8B134UUl0J",
   "formattedUrl": "www.cs.bgu.ac.il/~guyrap/adss122/",
   "htmlFormattedUrl": "www.cs.bgu.ac.il/~guyrap/adss122/"
  },
  {
   "kind": "customsearch#result",
   "title": "Guy Rapaport - CS Webpage",
   "htmlTitle": "Guy Rapaport - CS Webpage",
   "link": "http://www.cs.bgu.ac.il/~guyrap/",
   "displayLink": "www.cs.bgu.ac.il",
   "snippet": "Mar 21, 2013 ... [HighLearn]; Teach-assist the OOAD and ADSS courses (under Dr. Arnon Sturm).   (HighLearn , additional resources) (2011-2012, 4 semesters).",
   "htmlSnippet": "Mar 21, 2013 \u003cb\u003e...\u003c/b\u003e [HighLearn]; Teach-assist the \u003cb\u003eOOAD\u003c/b\u003e and ADSS courses (under Dr. Arnon Sturm). \u003cbr\u003e  (HighLearn , additional resources) (2011-2012, 4 semesters).",
   "cacheId": "YXe3atmPI-8J",
   "formattedUrl": "www.cs.bgu.ac.il/~guyrap/",
   "htmlFormattedUrl": "www.cs.bgu.ac.il/~guyrap/"
  },
  {
   "kind": "customsearch#result",
   "title": "הכנה למועד ב'",
   "htmlTitle": "הכנה למועד ב'",
   "link": "http://www.cs.bgu.ac.il/~guyrap/adss122/2012WinterPreMoedBet.pptx",
   "displayLink": "www.cs.bgu.ac.il",
   "snippet": "OOAD – סתיו 2012. במערכת ניהול הסטודנטים לתואר שני (תקרא מעתה המערכת) נדרש   לספק תמיכה לכלל הנוגעים לטיפול בסטודנטים הלומדים לתואר שני באוניברסיטה. אלו כוללים   את ...",
   "htmlSnippet": "\u003cb\u003eOOAD\u003c/b\u003e – סתיו 2012. במערכת ניהול הסטודנטים לתואר שני (תקרא מעתה המערכת) נדרש \u003cbr\u003e  לספק תמיכה לכלל הנוגעים לטיפול בסטודנטים הלומדים לתואר שני באוניברסיטה. אלו כוללים \u003cbr\u003e  את \u003cb\u003e...\u003c/b\u003e",
   "cacheId": "rcFEb9K_Sq4J",
   "mime": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
   "fileFormat": "Microsoft Powerpoint",
   "formattedUrl": "www.cs.bgu.ac.il/~guyrap/adss122/2012WinterPreMoedBet.pptx",
   "htmlFormattedUrl": "www.cs.bgu.ac.il/~guyrap/adss122/2012WinterPreMoedBet.pptx"
  }
 ]
}
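Since the original plan was to get the links out, here is a minimal sketch that pulls them from a response shaped like the one above. It only relies on the "items", "title", "link" and "snippet" fields visible in the sample; extract_results is a hypothetical name, and .get is used so a response without an "items" list simply yields an empty result.

```python
def extract_results(data):
    """Return a list of {"title", "link", "snippet"} dicts from a response dict."""
    results = []
    # Fall back to an empty list if the response carries no "items" key.
    for item in data.get("items", []):
        results.append({
            "title": item.get("title"),
            "link": item.get("link"),
            "snippet": item.get("snippet"),
        })
    return results


if __name__ == "__main__":
    # A trimmed-down copy of the sample response above.
    sample = {
        "items": [
            {"title": "ADSS122 (Spring 2012)",
             "link": "http://www.cs.bgu.ac.il/~guyrap/adss122/"},
            {"title": "Guy Rapaport - CS Webpage",
             "link": "http://www.cs.bgu.ac.il/~guyrap/"},
        ]
    }
    for result in extract_results(sample):
        print(result["title"], result["link"])
```

No HTML parsing, no XPath: the links are already sitting in the JSON.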


The long explanation (with pictures)
1. Go to Google Cloud Console and log in. https://cloud.google.com/console
2. Create Project. I used the default settings.
3. This is the newly created project overview page. Click on APIs & auth > APIs on the left-hand sidebar.
4. Go through the list of APIs and turn on the Custom Search API.
5. Go to Registered apps. Click on the "REGISTER APP" button.
6. Register your new app. I used "Web Application".
7. This is the newly created app's overview page. Open up the Browser Key tab and keep the API KEY.
The first part, obtaining an API KEY, is done.
8. Go to the Google Custom Search page. Click "Create a custom search engine". You should already be logged in after your activity in the Google Cloud Console.

9. In the CSE creation page, enter the sites you would like to custom-search, then click "Create".
10. For example:
(this is my page on BGU's CS dept. web server.)
11. "You've successfully created your Custom search engine". Why the capital C on custom??
Click on "Edit search engine" in the left side-bar.
12. Under details, click on the "Search engine ID" button. Keep the Search engine ID.

As I said in the short explanation (see above): construct your search query URL and enjoy.
