This post describes how to extract incident summaries and metadata from the USC Department of Public Safety’s daily incident logs, which are used extensively in [TOOBS](http://tlrobinson.net/projects/toobs).
### Background ###
A couple of years ago, when the Google Maps API was first introduced, I wanted to make something useful with it. The USC Department of Public Safety sends out “Crime Alert” emails whenever a robbery, assault, or other violent crime against a student occurs, so I decided to plot each location, along with a summary of the crime, on a map of USC and the surrounding area.
This worked fine for a small number of crimes, but unfortunately the Crime Alert emails were completely unstructured and never formatted consistently, so automating the process was out of the question. I ended up creating each entry by hand and hard-coding the data. The result wasn’t pretty, but it’s still available on my USC page here: [http://www-scf.usc.edu/~tlrobins/crimealert/](http://www-scf.usc.edu/~tlrobins/crimealert/)
For UPE’s P.24 programming contest I decided to rewrite the whole thing to make it far more automated and flexible. Since my first attempt, I had discovered that DPS publishes every single incident it responds to as [daily PDF logs](http://capsnet.usc.edu/DPS/CrimeSummary.cfm). Obviously I would have preferred XML or some other structured format, but the PDFs will have to do for now.
### Method ###
My language of choice was Ruby, since I originally planned to use the project as an excuse to learn Ruby on Rails. Due to some ridiculously strange bugs I gave up on Rails for this project, but not before writing the incident log parser.
The main script either takes a list of URLs as arguments or, if none are given, tries to download the previous day’s log (handy for running it as a cron job). An HTTP request is made to the URL, and if it succeeds the PDF is downloaded into memory. To convert the PDF to text I used [rpdf2txt](http://raa.ruby-lang.org/project/rpdf2txt/).
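The heart of the download-and-convert step looks something like this, condensed from the full script below (the URL is just an example of the mmddyy.pdf naming the site uses):

```ruby
require "net/http"
require "rpdf2txt/parser"

url = "http://capsnet.usc.edu/DPS/webpdf/010507.pdf" # example: Jan 5, 2007
resp = Net::HTTP.get_response(URI.parse(url))
if resp.is_a? Net::HTTPSuccess
  text = Rpdf2txt::Parser.new(resp.body).extract_text
end
```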
Once the log is in text form, a variety of regular expressions and string-manipulation functions extract each field of each entry. Spurious lines are discarded, and when an entry is complete it is inserted into a database.
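For example, the first line of every entry packs the category, subcategory, date/time, and a stamp into a single line. A synthetic line in that shape (made up to match the patterns, not real log text) shows how the script’s regular expressions pick it apart:

```ruby
# A made-up first line in the shape the parser expects: category in caps,
# mixed-case subcategory, the date/time, then the incident stamp.
line = "THEFT-MOTOR VEHICLE Theft Petty Jan 05, 2007-Friday at 13:45 07-01-05-023417"

datetime_re = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
stamp_re    = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

line =~ stamp_re          # matches, so this line starts a new incident
line.slice(datetime_re)   # => "Jan 05, 2007-Friday at 13:45"
line.slice(stamp_re)      # => "07-01-05-023417"

# the category is the leading all-caps run, ending where mixed case begins
cat_end = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
line[0...cat_end].strip   # => "THEFT-MOTOR VEHICLE"
```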
### Code ###
The import script is available here: [http://tlrobinson.net/projects/toobs/import.rb](http://tlrobinson.net/projects/toobs/import.rb)
require "net/http"
require "rpdf2txt/parser"
require "date"
require "rubygems"
require_gem "activerecord"
# misc regular expressions constants
datetimeRE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
stampRE = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/
# connect to the database
ActiveRecord::Base.establish_connection(
  :adapter => "mysql",
  :host => "host",
  :database => "database",
  :username => "username",
  :password => "password"
)

class Incident < ActiveRecord::Base
  set_table_name "crimes"
end

def import_url(url)
  puts "================== Processing: " + url + " =================="
  resp = Net::HTTP.get_response(URI.parse(url))
  if resp.is_a? Net::HTTPSuccess
    # parse the PDF, extract the text, and split it into lines
    parser = Rpdf2txt::Parser.new(resp.body)
    text = parser.extract_text
    lines = text.split("\n")

    incidents = Array.new # array containing each incident
    summary = false       # true while accumulating a multi-line summary
    disp = false          # true when the "disp" data is on the line after the "Disp:" header

    # try to match each line to a regular expression or other condition,
    # then extract the data from the line
    lines.each do |line|
      # first line of an entry
      if (line =~ StampRE)
        # special case for a missing identifier on the previous incident
        if (incidents.size > 0 && incidents.last.identifier == nil)
          puts "+++ Last identifier is empty, searching for identifier in summary..."
          tempRE = /DR\#[\d]+/
          tempId = incidents.last.summary[tempRE]
          if (tempId != nil)
            puts "+++ Found! {" + tempId[3..tempId.length-1] + "}"
            incidents.last.identifier = tempId[3..tempId.length-1]
          end
        end
        # create a new incident
        incidents << Incident.new
        summary = false
        disp = false
        # extract category, subcategory, time, and stamp; the category is the
        # leading run of non-lowercase (all-caps) characters, and the
        # subcategory is the mixed-case text before the date
        cat_subcat_index = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
        incidents.last.category = line[0..cat_subcat_index-1].strip
        incidents.last.subcategory = line[cat_subcat_index..line.index(DatetimeRE)-1].strip
        incidents.last.time = DateTime.parse(line.slice(DatetimeRE))
        incidents.last.stamp = line.slice(StampRE)
      # identifier
      elsif (line =~ /^[0-9]+$/)
        incidents.last.identifier = line.slice(/^[0-9]+$/).to_i
      # location
      elsif (line =~ /Location:/)
        incidents.last.location = line.sub(/Location:/, "").strip
      # cc
      elsif (line =~ /cc:/)
        incidents.last.cc = line.sub(/cc:/, "").strip
        summary = false
      # disposition
      elsif (disp)
        incidents.last.disp = line.sub(/Disp:/, "").strip
        disp = false
      # summary
      elsif (line =~ /Summary:/ || summary)
        if (incidents.last.summary.nil?)
          incidents.last.summary = line.sub(/Summary:/, "").strip
        else
          incidents.last.summary << (" " + line.sub(/Summary:/, "").strip)
        end
        if (incidents.last.summary =~ /Disp:/)
          # find the "Disp:" header and data, and remove them from the summary
          disp = incidents.last.summary.slice!(/\s*Disp:.*/)
          incidents.last.disp = disp.sub(/Disp:/, "").strip
          disp = (incidents.last.disp == "") # check that we actually got the "disp" data
          summary = false
        else
          summary = true
        end
      # no match
      else
        puts "discarding line: {" + line + "}"
      end
    end

    # at the end, save each incident and print a summary line
    incidents.each do |incident|
      begin
        puts( ("%8d" % incident.identifier) + " " +
              ("%25s" % ("{" + incident.category + "}")) + " " +
              ("%45s" % ("{" + incident.subcategory + "}")) + " " +
              ("%60s" % ("{" + incident.location + "}")))
        incident.save
      rescue Exception => exp
        puts exp
      end
    end
  end
end

if (ARGV.length > 0)
  # import each argument
  ARGV.each do |arg|
    import_url(arg)
  end
else
  # no arguments: construct the URL for yesterday's log (named mmddyy.pdf)
  yesterday = Date.today - 1
  urlToImport = "http://capsnet.usc.edu/DPS/webpdf/" +
    ("%02d" % yesterday.mon) + ("%02d" % yesterday.mday) + yesterday.year.to_s[2..3] + ".pdf"
  import_url(urlToImport)
end
```
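Importing a specific day’s log by hand, or scheduling the no-argument form, looks something like this (the date and paths are examples):

```
# import a specific day's log
ruby import.rb http://capsnet.usc.edu/DPS/webpdf/010507.pdf

# crontab entry: fetch yesterday's log every morning at 6:00
0 6 * * * ruby /path/to/import.rb
```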
### Conclusion ###
This system works fairly well, with a few exceptions. While the PDFs are far more consistent than the emails, occasionally a PDF shows up that rpdf2txt can’t parse. So far I haven’t found a solution, other than perhaps switching to a different PDF-to-text converter. Also, sometimes entries are missing an identifier, or it shows up in a different location; some special rules try to recover it, but they aren’t always successful.
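If I do try another converter, one possible shape for the fallback is to shell out to pdftotext (from Xpdf) whenever rpdf2txt fails. This is only a sketch, assuming pdftotext is installed and on the PATH; `extract_text_fallback` is a hypothetical helper, not part of the current script:

```ruby
require "tempfile"

# Hypothetical fallback: write the in-memory PDF to a temp file and shell
# out to pdftotext, returning whatever text it can recover.
def extract_text_fallback(pdf_data)
  Tempfile.open("incident_log") do |f|
    f.write(pdf_data)
    f.flush
    return `pdftotext -layout #{f.path} -` # "-" writes the text to stdout
  end
end
```

The import script could then rescue rpdf2txt’s parse errors and retry with this before giving up on a day’s log.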
Overall it was a success, as demonstrated by the 4000+ incidents currently in the TOOBS database.