Scraping USC DPS's incident logs

This post describes how to extract incident summaries and metadata from the USC Department of Public Safety’s daily incident logs; the extracted data is used extensively in [TOOBS](http://tlrobinson.net/projects/toobs).

### Background ###

A couple of years ago, when the Google Maps API was first introduced, I wanted to make something useful with it. The USC Department of Public Safety sends out “Crime Alert” emails whenever a robbery, assault, or other violent crime against a student occurs, so I decided to plot each crime’s location, along with a summary, on a map of USC and the surrounding area.

This was fine for a small number of crimes, but unfortunately the Crime Alert emails were completely unstructured and never formatted consistently, so automating the process was out of the question. I ended up creating each entry by hand and hard-coding the data. The result wasn’t pretty, but it’s still available on my USC page here: [http://www-scf.usc.edu/~tlrobins/crimealert/](http://www-scf.usc.edu/~tlrobins/crimealert/)

For UPE’s P.24 programming contest I decided to rewrite the whole thing to make it far more automated and flexible. Since my first attempt, I discovered that DPS publishes every single incident they respond to as [daily PDF logs](http://capsnet.usc.edu/DPS/CrimeSummary.cfm). Obviously I would have preferred XML or some other structured format, but the PDFs will have to do for now.

### Method ###

My language of choice was Ruby, since I originally planned on using the project as an excuse to learn Ruby on Rails. Due to some ridiculously strange bugs I gave up on Rails for this project, but not before writing the incident log parser.

The main script either takes a list of URLs as arguments or, if none are given, tries to download the previous day’s log (convenient for running it as a cron job). An HTTP request is made to each URL, and if it succeeds the PDF is downloaded into memory. To convert the PDF to text I used [rpdf2txt](http://raa.ruby-lang.org/project/rpdf2txt/).
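
For example, a couple of hypothetical invocations (the URL follows the MMDDYY.pdf naming scheme used in the code below, and the paths are placeholders):

```
# import a specific day's log
$ ruby import.rb http://capsnet.usc.edu/DPS/webpdf/013107.pdf

# crontab entry to import yesterday's log every morning at 6:00
0 6 * * * ruby /path/to/import.rb
```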

Once the log is in text form, a variety of regular expressions and string manipulation functions extract each field of an entry, and spurious lines are discarded. When an entry is complete, it is inserted into a database.
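
To illustrate, here’s how a typical first line of an entry breaks down, using the same regular expressions the script defines (shown in full below). The sample line is invented, but matches the layout the script expects:

```ruby
# an invented first line: CATEGORY (all caps), Subcategory (mixed case),
# date/time, and the stamp that marks the start of a new entry
line = "THEFT-PETTY Theft Petty-Plain Jan 30, 2007-Tuesday at 21:15 07-01-30-023417"

datetimeRE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
stampRE    = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# the category is all caps, so the category/subcategory boundary is
# the first uppercase letter followed by a lowercase one
boundary = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length

line[0...boundary].strip                       # => "THEFT-PETTY"
line[boundary...line.index(datetimeRE)].strip  # => "Theft Petty-Plain"
line.slice(datetimeRE)                         # => "Jan 30, 2007-Tuesday at 21:15"
line.slice(stampRE)                            # => "07-01-30-023417"
```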

### Code ###

The import script is available here: [http://tlrobinson.net/projects/toobs/import.rb](http://tlrobinson.net/projects/toobs/import.rb)

require "net/http"
require "rpdf2txt/parser"
require "date"

require "rubygems"
require_gem "activerecord"

# misc regular expression constants
# matches the date/time field, e.g. "Jan 30, 2007-Tuesday at 21:15"
datetimeRE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
# matches the per-entry stamp, e.g. "07-01-30-023417"
stampRE = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# connect to the database
ActiveRecord::Base.establish_connection(
  :adapter  => "mysql",
  :host     => "host",
  :database => "database",
  :username => "username",
  :password => "password"
)

# ActiveRecord model mapped to the existing "crimes" table
class Incident < ActiveRecord::Base
  set_table_name "crimes"
end
  
def import_url(url)
  puts "================== Processing: " + url + " =================="
  
  resp = Net::HTTP.get_response(URI.parse(url))
  if resp.is_a? Net::HTTPSuccess
    # parse the pdf, extract the text, split into lines
    parser = Rpdf2txt::Parser.new(resp.body)
    text = parser.extract_text
    lines = text.split("\n")
    
    incidents = Array.new # array containing each incident
    summary = false       # for multiple line summaries
    disp = false          # for cases when the "disp" data is on the line after the "Disp:" header
    
    # try to match each line to a regular expression or other condition
    # then extract the data from the line
    lines.each do |line|
    
      # first line
      if (line =~ stampRE)

        # special case for missing identifier of previous incident
        if (incidents.size > 0 && incidents.last.identifier == nil)
          puts "+++ Last identifier is empty, searching for identifier in summary..."
          tempRE = /DR\#[\d]+/
          tempId = incidents.last.summary[tempRE]
          if (tempId != nil)
            puts "+++ Found! {" + tempId[3..-1] + "}"
            incidents.last.identifier = tempId[3..-1]
          end
        end
    
        # create new incident
        incidents << Incident.new
        summary = false
        disp = false
  
        # extract category, subcategory, time, and stamp:
        # the category is all caps, so the category/subcategory boundary is
        # the first uppercase letter followed by a lowercase one
        cat_subcat_index = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
        incidents.last.category = line[0..cat_subcat_index-1].strip
        incidents.last.subcategory = line[cat_subcat_index..line.index(datetimeRE)-1].strip
        incidents.last.time = DateTime.parse(line.slice(datetimeRE))
        incidents.last.stamp = line.slice(stampRE)
        
      # identifier
      elsif (line =~ /^[0-9]+$/)
        incidents.last.identifier = line.slice(/^[0-9]+$/).to_i
        
      # location
      elsif (line =~ /Location:/)
        incidents.last.location = line.sub(/Location:/, "").strip
        
      # cc
      elsif (line =~ /cc:/)
        incidents.last.cc = line.sub(/cc:/, "").strip
        summary = false
      
      # disposition
      elsif (disp)
        incidents.last.disp = line.sub(/Disp:/, "").strip
        disp = false
      
      # summary
      elsif (line =~ /Summary:/ || summary)
        if (incidents.last.summary.nil?)
          incidents.last.summary = line.sub(/Summary:/, "").strip
        else
          incidents.last.summary << (" " + line.sub(/Summary:/, "").strip)
        end
    
        if (incidents.last.summary =~ /Disp:/)
          # find the "Disp:" header and data, remove from summary
          disp = incidents.last.summary.slice!(/\s*Disp:.*/)
          incidents.last.disp = disp.sub(/Disp:/, "").strip
          
          disp = (incidents.last.disp == "") # check that we actually got the "disp" data
          summary = false
        else
          summary = true
        end
      
      # no match
      else
        puts "discarding line: {" + line + "}"
      end
    end
    
    # at the end save each incident and print a list
    incidents.each do |incident|
      begin
        puts( ("%8d" % incident.identifier) + " " +
              ("%25s" % ("{" + incident.category    + "}")) + " " +
              ("%45s" % ("{" + incident.subcategory + "}")) + " " +
              ("%60s" % ("{" + incident.location    + "}")));
        incident.save
      rescue Exception => exp
        puts exp
      end
    end
    
  end
end

if (ARGV.length > 0)
  # import each argument
  ARGV.each do |arg|
    import_url(arg)
  end
else
  # daily logs are named MMDDYY.pdf after the previous day's date
  yesterday = Date.today - 1
  url_to_import = "http://capsnet.usc.edu/DPS/webpdf/" + yesterday.strftime("%m%d%y") + ".pdf"
  import_url(url_to_import)
end
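
For reference, the script expects an existing crimes table with columns matching the attributes assigned above. A minimal ActiveRecord migration along these lines would create a compatible table (the column types are my guesses, inferred from how the script uses each field):

```ruby
# hypothetical migration matching the columns import.rb expects;
# types are inferred from how the script uses each field
class CreateCrimes < ActiveRecord::Migration
  def self.up
    create_table :crimes do |t|
      t.column :identifier,  :integer   # DPS report number
      t.column :category,    :string    # all-caps incident category
      t.column :subcategory, :string
      t.column :time,        :datetime  # parsed date/time of the incident
      t.column :stamp,       :string    # the XX-XX-XX-XXXXXX entry stamp
      t.column :location,    :string
      t.column :cc,          :string
      t.column :disp,        :string    # disposition
      t.column :summary,     :text
    end
  end

  def self.down
    drop_table :crimes
  end
end
```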

### Conclusion ###

This system works fairly well, with a few exceptions. While the PDFs are far more consistent than the emails, occasionally one shows up that rpdf2txt can’t parse; so far I haven’t found a solution, though a different PDF-to-text converter might work. Also, entries are sometimes missing their identifier, or have it in an unexpected location; some special rules try to recover it, but they aren’t always successful.
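
One workaround I haven’t tried yet: fall back to an external converter such as pdftotext (from Xpdf) whenever rpdf2txt fails. A rough sketch, assuming pdftotext is on the PATH (its text layout differs from rpdf2txt’s, so the regular expressions might need adjusting):

```ruby
require "tempfile"

# try rpdf2txt first, and shell out to pdftotext if it raises an error
def pdf_to_text(pdf_data)
  Rpdf2txt::Parser.new(pdf_data).extract_text
rescue Exception
  file = Tempfile.new("incident_log")
  file.write(pdf_data)
  file.close
  `pdftotext -layout #{file.path} -`  # "-" sends the text to stdout
end
```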

Overall it was a success, as demonstrated by the 4000+ incidents currently in the TOOBS database.