A Smarter Wikipedia Search using Google

Posted on March 26, 2007 by admin

I’m always looking for ways to speed up common tasks. This post describes a faster and more accurate way to reach a desired Wikipedia article.

There’s an incredibly useful (and free) plugin for Safari from David Watanabe (the man also responsible for Acquisition, NewsFire, and Xtorrent) called Inquisitor, which gives you search suggestions as you type in the search bar (described as “Spotlight for the web”, essentially Google Suggest). In addition to the search suggestions, Inquisitor allows you to assign key combinations to different search engines, which you can define yourself.

I usually just use Google to get to a Wikipedia article, since it’s so convenient. For example, type “safari wiki” and the first item will probably be the Wikipedia article for the Safari web browser. Taking this a step further, if you use Google’s “I’m Feeling Lucky” search with “site:en.wikipedia.org” appended to the search term, most of the time you will be redirected straight to the Wikipedia page you’re looking for.

I don’t find Wikipedia’s built-in search to be particularly useful. Since this method uses Google’s smartness, it will find partial matches as well as (usually) the more relevant topic. For example, type “apple site:en.wikipedia.org” and it will take you to the computer company (actually “Apple Inc.”) rather than the fruit (the real “Apple”). I’m usually more interested in the computer company.

Of course typing all that out at google.com and pressing the “I’m Feeling Lucky” button would offset any convenience it adds, so combine it with Inquisitor or Firefox/IE’s custom searches.

Here’s the search URL for Inquisitor (install in “Safari” application menu : “Preferences” : “Search” tab : “Edit Sites…”):
http://www.google.com/search?hl=en&q=%@+site%3Aen.wikipedia.org&btnI=I%27m+Feeling+Lucky

If you just want to use Google as your Wikipedia search engine, use this one without the “I’m Feeling Lucky” option:
http://www.google.com/search?hl=en&q=%@+site%3Aen.wikipedia.org

All Inquisitor does is substitute the search terms you enter into a custom URL wherever it finds a “%@”, then take you to that URL. This particular URL uses the “I’m Feeling Lucky” feature of Google to redirect you to the top resulting page. Assign a key combo (I use command-option-W) to it and you’ve got a really simple and fast way to get to the most relevant Wikipedia article.
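
As a rough sketch of that substitution step in Ruby: the template is the URL from above, while the helper name `fill_template` and the choice to URL-encode the terms with `CGI.escape` are assumptions made for illustration, not a description of Inquisitor's internals.

```ruby
require "cgi"

# Template copied from the post; "%@" marks where the search terms go.
LUCKY_TEMPLATE = "http://www.google.com/search?hl=en&q=%@+site%3Aen.wikipedia.org&btnI=I%27m+Feeling+Lucky"

# Hypothetical helper (not part of Inquisitor): URL-encode the terms and
# drop them into the template in place of the "%@" placeholder.
def fill_template(template, terms)
  template.gsub("%@", CGI.escape(terms))
end

puts fill_template(LUCKY_TEMPLATE, "safari browser")
```

Spaces in the search terms become `+`, so a multi-word query still produces a single valid URL.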

If you’re a Firefox 2 or [gasp] IE 7 user, I threw together two OpenSearch plugins. Unfortunately neither browser supports hot keys for different search engines, so some of the convenience is lost. The plugins should be auto-detected; if not, click on one of the following links (JavaScript required):

Install Smarter Wikipedia (auto redirect to most relevant page)
Install Smarter Wikipedia Search (Google search results)

Note that this technique could be applied to *any* website, even ones without built-in search engines: just replace “site:en.wikipedia.org” with the desired domain name. You could also remove the “site” clause entirely to get an “I’m Feeling Lucky” search of the entire web.
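
To make the generalization concrete, here is a small Ruby sketch that builds the same kind of URL for an arbitrary domain, or for the whole web when no domain is given. The helper `google_url` and its keyword arguments are hypothetical; only the `hl`, `q`, and `btnI` parameters come from the URLs above.

```ruby
require "uri"

# Hypothetical helper: build a Google search URL, optionally restricted to
# one domain and optionally using "I'm Feeling Lucky" (the btnI parameter).
def google_url(terms, site: nil, lucky: true)
  query = site ? "#{terms} site:#{site}" : terms
  params = { "hl" => "en", "q" => query }
  params["btnI"] = "I'm Feeling Lucky" if lucky
  "http://www.google.com/search?" + URI.encode_www_form(params)
end

puts google_url("apple", site: "en.wikipedia.org")  # lucky search on one site
puts google_url("apple", lucky: false)              # plain search, whole web
```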

Update: John Gruber over at Daring Fireball also has a good use for Google’s “I’m Feeling Lucky” search.

Stealing LAPD's crime data

Posted on March 18, 2007 by admin

This post explains how to get data from LAPD’s [crime maps](http://www.lapdcrimemaps.org/) website. See my [previous post](http://tlrobinson.net/blog/?p=6) on scraping DPS’s incident logs for background on TOOBS.

After completing the TOOBS project for the UPE P.24 programming contest I was checking out LAPD’s [crime maps](http://www.lapdcrimemaps.org/) website, which is similar to TOOBS (but not as cool!), and I realized I could integrate their data with DPS’s data for the ultimate Los Angeles / USC crime map. There is very little overlap between the LAPD and DPS data since the two are separate entities. Murders and some other incidents may show up in both, but hopefully these are rare…

The LAPD system also uses JavaScript and XMLHttpRequest to fetch the data from a server-side script. Additionally, there is no security to check that the requests are coming from the LAPD web app. This means we can easily, and (as far as I know) legally, access their data.

Because the same-origin policy restricts JavaScript to making requests only to the server the page came from, you cannot simply call their PHP script from your own JavaScript; you need some sort of proxy on your own server. While this policy can be annoying, it is necessary to limit what malicious JavaScript could do.

To obtain the crime data from LAPD’s servers, we begin by forming the request URL, which contains parameters such as the start date, the interval length, lat/lon coordinates, radius, and crime types. An HTTP request is made to their server, and the response is stored.
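
A minimal Ruby sketch of forming such a request URL follows. The endpoint and every parameter name here are placeholders invented for illustration; the real script's URL and parameter names are not given in this post.

```ruby
require "uri"

# Placeholder endpoint, NOT the real LAPD script.
ENDPOINT = "http://example.com/crime_search.php"

# Hypothetical parameter names ("start", "interval", ...) chosen to mirror
# the list in the text; the real interface may differ.
def build_request_uri(start_date:, interval_days:, lat:, lon:, radius:, crime_types:)
  params = {
    "start"    => start_date,
    "interval" => interval_days,
    "lat"      => lat,
    "lon"      => lon,
    "radius"   => radius,
    "types"    => crime_types.join(",")
  }
  uri = URI(ENDPOINT)
  uri.query = URI.encode_www_form(params)
  uri
end

uri = build_request_uri(start_date: "02-04-2007", interval_days: 7,
                        lat: 34.0224, lon: -118.2851, radius: 1,
                        crime_types: [1, 2, 3])
puts uri
# Fetching would then be one call, e.g. Net::HTTP.get(uri) after require "net/http".
```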

We notice the response is simply JavaScript that gets eval’d on the client:

searchPoints = new Array ();
searchPoints[0] = new searchPoint ('0', '#070307306', 'lightblue', '17', '-118.301638', '34.022812', '14XX W 36th St', '0.74', '6', '02-04-2007 10:45:00 PM', 'Southwest Division: 213-485-6571');
searchPoints[1] = new searchPoint ('1', '#070307280', 'violet', '17', '-118.284008', '34.033212', '25XX S Hoover St', '0.52', '3', '02-04-2007 10:00:00 PM', 'Southwest Division: 213-485-6571');
searchPoints[2] = new searchPoint ('2', '#070307224', 'cyan', '17', '-118.304108', '34.032481', '26XX Dalton Av', '0.83', '4', '02-04-2007 12:15:00 AM', 'Southwest Division: 213-485-6571');
searchPoints[3] = new searchPoint ('3', '#070307222', 'blue', '17', '-118.2903', '34.0284', 'Menlo Av and 29th Av', '0.02', '2', '02-03-2007 11:00:00 PM', 'Southwest Division: 213-485-6571');

…

We could simply redirect this code to our own app and do the processing on the client side with JavaScript, but we also notice that JavaScript syntax is very similar to PHP syntax. By creating a compatible PHP object called searchPoint and prepending a “$” to each variable name, we have valid PHP code that we can simply eval. The result is an array of searchPoint objects that we can easily add to our response, or insert into a database, or whatever we want!

Note that this is extremely insecure since we’re eval’ing text that we got from somewhere else. By changing the response, the provider of the data could execute any PHP they wanted on my server.

A more secure method would be to actually parse the data rather than letting PHP’s eval do the work.
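
One way to parse instead of eval'ing, sketched in Ruby rather than PHP: treat the response as plain text and pull the quoted arguments out of each `new searchPoint (...)` line with regular expressions. The helper name `parse_search_points` is made up, and since the post doesn't document what each field means, the sketch just returns an array of strings per point. It assumes values never contain a single quote.

```ruby
# Matches one "new searchPoint (...);" statement and captures the argument list.
POINT_RE = /new\s+searchPoint\s*\((.*)\)\s*;/
# Matches one single-quoted argument (assumes no embedded single quotes).
ARG_RE   = /'([^']*)'/

# Returns one array of strings per searchPoint line; nothing is evaluated.
def parse_search_points(js)
  js.each_line.map do |line|
    m = POINT_RE.match(line)
    m && m[1].scan(ARG_RE).flatten
  end.compact
end

sample = <<~JS
  searchPoints = new Array ();
  searchPoints[0] = new searchPoint ('0', '#070307306', 'lightblue', '17', '-118.301638', '34.022812', '14XX W 36th St', '0.74', '6', '02-04-2007 10:45:00 PM', 'Southwest Division: 213-485-6571');
JS

points = parse_search_points(sample)
puts points.first.inspect
```

Lines that don't match the pattern (such as the array initialization) are simply skipped, so a tampered response can at worst produce odd strings, not executed code.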

Scraping USC DPS's incident logs

Posted on March 12, 2007 by admin

This post describes how to extract incident summaries and metadata from the USC Department of Public Safety’s daily incident logs, which are used extensively in [TOOBS](http://tlrobinson.net/projects/toobs).

### Background ###

A couple of years ago when the Google Maps API was first introduced I wanted to make something useful using it. The USC Department of Public Safety sends out these “Crime Alert” emails whenever a robbery or assault or other violent crime against a student occurs, so I decided to plot each of the locations along with the summary of the crime on a map of USC and the surrounding area.

This was fine for a small number of crimes, but unfortunately the Crime Alert emails were completely unstructured and never formatted consistently, so automating the process was out of the question. I ended up creating each entry by hand, and hard coding the data. The result wasn’t pretty, but it’s still available on my USC page here: [http://www-scf.usc.edu/~tlrobins/crimealert/](http://www-scf.usc.edu/~tlrobins/crimealert/)

For UPE’s P.24 programming contest I decided to rewrite the whole thing to make it far more automated and flexible. Since my first attempt, I discovered that DPS publishes every single incident they respond to as [daily PDF logs](http://capsnet.usc.edu/DPS/CrimeSummary.cfm). Obviously I would have preferred XML or some other structured format, but the PDFs will have to do for now.

### Method ###

My language of choice was Ruby since I originally planned on using the project as an excuse to learn Ruby on Rails. Due to some ridiculously strange bugs I gave up on Rails for the project, but not before writing the incident logs parser.

The main script can either take a list of URLs as arguments, or if no arguments are specified it will try to download the previous day’s log (good for running it as a cron job). An HTTP request to the URL is made, and if successful the PDF is downloaded into memory. To convert the PDF to text I used [rpdf2txt](http://raa.ruby-lang.org/project/rpdf2txt/).
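
The "previous day's log" default boils down to a date-based file name. A compact way to express the MMDDYY naming that the import script uses is `strftime`; the helper name `log_url_for` is hypothetical, while the base URL is the one from the script.

```ruby
require "date"

# Base URL taken from the import script.
LOG_BASE = "http://capsnet.usc.edu/DPS/webpdf/"

# Hypothetical helper: logs are named MMDDYY.pdf, so format the date that way.
def log_url_for(date)
  LOG_BASE + date.strftime("%m%d%y") + ".pdf"
end

puts log_url_for(Date.today - 1)  # yesterday's log
```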

Once in text form, a variety of regular expressions and string manipulation functions are used to extract each field from the entry. When an entry is complete, it is inserted into a database. Spurious lines are discarded.
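
To illustrate the extraction step in isolation, here is a small sketch that splits the first line of an entry into category, subcategory, time, and stamp, using the same two patterns as the import script. The helper `parse_header` is hypothetical, and the sample line is invented to fit those patterns; it is not copied from a real log.

```ruby
# Same two patterns as the import script.
DATETIME_RE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
STAMP_RE    = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# Returns a hash of the four header fields, or nil if the line does not
# look like the start of an entry.
def parse_header(line)
  return nil unless line =~ STAMP_RE && line =~ DATETIME_RE
  # The category is the leading run with no lowercase letters; the
  # subcategory starts at the first capitalized word after it.
  split_at = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
  {
    category:    line[0...split_at].strip,
    subcategory: line[split_at...line.index(DATETIME_RE)].strip,
    time:        line.slice(DATETIME_RE),
    stamp:       line.slice(STAMP_RE)
  }
end

# Invented sample line, shaped to fit the patterns above.
sample = "THEFT Bicycle Theft Mar 10, 2007-Saturday at 14:30 07-03-10-012345"
p parse_header(sample)
```

The time is left as a raw string here; the import script goes one step further and hands it to `DateTime.parse`.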

### Code ###

The import script is available here: [http://tlrobinson.net/projects/toobs/import.rb](http://tlrobinson.net/projects/toobs/import.rb)

require "net/http"
require "rpdf2txt/parser"
require "date"
require "rubygems"
require_gem "activerecord"

# misc regular expression constants
# (constants rather than top-level locals, so they are visible inside import_url)
DATETIME_RE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
STAMP_RE = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# connect to the database
ActiveRecord::Base.establish_connection(
  :adapter  => "mysql",
  :host     => "host",
  :database => "database",
  :username => "username",
  :password => "password"
)

class Incident < ActiveRecord::Base
  set_table_name "crimes"
end

def import_url(url)
  puts "================== Processing: " + url + " =================="
  resp = Net::HTTP.get_response(URI.parse(url))
  if resp.is_a? Net::HTTPSuccess
    # parse the pdf, extract the text, split into lines
    parser = Rpdf2txt::Parser.new(resp.body)
    text = parser.extract_text
    lines = text.split("\n")

    incidents = Array.new # array containing each incident
    summary = false       # for multiple line summaries
    disp = false          # for cases when the "disp" data is on the line after the "Disp:" header

    # try to match each line to a regular expression or other condition
    # then extract the data from the line
    lines.each do |line|
      # first line
      if (line =~ STAMP_RE)
        # special case for missing identifier of previous incident
        if (incidents.size > 0 && incidents.last.identifier == nil)
          puts "+++ Last identifier is empty, searching for identifier in summary…"
          tempRE = /DR\#[\d]+/;
          tempId = incidents.last.summary[tempRE];
          if (tempId != nil)
            puts "+++ Found! {" + tempId[3..tempId.length-1] + "}"
            incidents.last.identifier = tempId[3..tempId.length-1];
          end
        end

        # create new incident
        incidents << Incident.new
        summary = false
        disp = false

        # extract category, subcategory, time, and stamp
        cat_subcat_index = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
        incidents.last.category = line[0..cat_subcat_index-1].strip
        incidents.last.subcategory = line[cat_subcat_index..line.index(DATETIME_RE)-1].strip
        incidents.last.time = DateTime.parse(line.slice(DATETIME_RE))
        incidents.last.stamp = line.slice(STAMP_RE)
      # identifier
      elsif (line =~ /^[0-9]+$/)
        incidents.last.identifier = line.slice(/^[0-9]+$/).to_i
      # location
      elsif (line =~ /Location:/)
        incidents.last.location = line.sub(/Location:/, "").strip
      # cc
      elsif (line =~ /cc:/)
        incidents.last.cc = line.sub(/cc:/, "").strip
        summary = false
      # disposition
      elsif (disp)
        incidents.last.disp = line.sub(/Disp:/, "").strip
        disp = false
      # summary
      elsif (line =~ /Summary:/ || summary)
        if (incidents.last.summary.nil?)
          incidents.last.summary = line.sub(/Summary:/, "").strip
        else
          incidents.last.summary << (" " + line.sub(/Summary:/, "").strip)
        end
        if (incidents.last.summary =~ /Disp:/)
          # find the "Disp:" header and data, remove from summary
          disp = incidents.last.summary.slice!(/\s*Disp:.*/)
          incidents.last.disp = disp.sub(/Disp:/, "").strip
          disp = (incidents.last.disp == "") # check that we actually got the "disp" data
          summary = false
        else
          summary = true
        end
      # no match
      else
        puts "discarding line: {" + line + "}"
      end
    end

    # at the end save each incident and print a list
    incidents.each do |incident|
      begin
        puts( ("%8d" % incident.identifier) + " " +
              ("%25s" % ("{" + incident.category    + "}")) + " " +
              ("%45s" % ("{" + incident.subcategory + "}")) + " " +
              ("%60s" % ("{" + incident.location    + "}")));
        incident.save
      rescue Exception => exp
        puts exp
      end
    end
  end
end

if (ARGV.length > 0)
  # import each argument
  ARGV.each do |arg|
    import_url(arg)
  end
else
  # no arguments: import yesterday's log, which is named MMDDYY.pdf
  yesterday = Date.today - 1;
  urlToImport = "http://capsnet.usc.edu/DPS/webpdf/"+
    ("%02d" % yesterday.mon) + ("%02d" % yesterday.mday) + yesterday.year.to_s[2..3] + ".pdf"
  import_url(urlToImport)
end

### Conclusion ###

This system works fairly well with a few exceptions. While the PDFs are far more consistent than the emails, occasionally a PDF that can’t be parsed by rpdf2txt shows up. So far I haven’t found a solution (perhaps using a different PDF to text converter). Also, sometimes entries are missing an identifier, or it shows up in a different location. Some special rules are used to try to find it, but it’s not always successful.

Overall it was a success, as demonstrated by the 4000+ incidents currently in the TOOBS database.

tlrobinson.net blog

Monthly Archives: March 2007
