Scraping USC DPS’s incident logs

This post describes how to extract incident summaries and metadata from the USC Department of Public Safety’s daily incident logs, which are used extensively in TOOBS.

Background

A couple of years ago, when the Google Maps API was first introduced, I wanted to make something useful with it. The USC Department of Public Safety sends out “Crime Alert” emails whenever a robbery, assault, or other violent crime against a student occurs, so I decided to plot each location, along with a summary of the crime, on a map of USC and the surrounding area.

This was fine for a small number of crimes, but unfortunately the Crime Alert emails were completely unstructured and never formatted consistently, so automating the process was out of the question. I ended up creating each entry by hand, and hard coding the data. The result wasn’t pretty, but it’s still available on my USC page here: http://www-scf.usc.edu/~tlrobins/crimealert/

For UPE’s P.24 programming contest I decided to rewrite the whole thing to make it far more automated and flexible. Since my first attempt, I discovered that DPS publishes every single incident they respond to as daily PDF logs. Obviously I would have preferred XML or some other structured format, but the PDFs will have to do for now.

Method

My language of choice was Ruby since I originally planned on using the project as an excuse to learn Ruby on Rails. Due to some ridiculously strange bugs I gave up on Rails for the project, but not before writing the incident logs parser.

The main script can either take a list of URLs as arguments or, if no arguments are specified, try to download the previous day’s log (good for running it as a cron job). An HTTP request is made to the URL, and if it succeeds the PDF is downloaded into memory. To convert the PDF to text I used rpdf2txt.
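The default URL follows an MMDDYY file-name pattern, as the script later in this post shows. Here is a minimal sketch of just that step, using `strftime` instead of manual string formatting. The helper name `log_url_for` is made up for illustration and is not part of the original script.

```ruby
require "date"

# Base URL taken from the import script.
BASE_URL = "http://capsnet.usc.edu/DPS/webpdf/"

# Build the log URL for a given date using the MMDDYY file-name pattern.
# log_url_for is a hypothetical helper name.
def log_url_for(date)
  BASE_URL + date.strftime("%m%d%y") + ".pdf"
end

# With no arguments, the script targets the previous day's log.
puts log_url_for(Date.today - 1)
```

Keeping the date as a parameter makes it easy to re-run the import for a day the cron job missed.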

Once in text form, a variety of regular expressions and string manipulation functions are used to extract each field from the entry. When an entry is complete, it is inserted into a database. Spurious lines are discarded.
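To give a feel for the extraction step, here is a rough sketch of how the first line of an entry can be split into its fields. The two patterns are copied from the import script; the helper name `parse_header` and the sample line are assumptions made up for illustration, and real log lines may look different.

```ruby
# Patterns copied from the import script.
DATETIME_RE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
STAMP_RE    = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# Split an incident header line into its four fields.
# parse_header is a hypothetical helper, not part of the original script.
def parse_header(line)
  # The category is the leading all-caps run; the subcategory starts at the
  # first capitalized word (an uppercase letter followed by a lowercase one).
  split_at = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
  {
    category:    line[0...split_at].strip,
    subcategory: line[split_at...line.index(DATETIME_RE)].strip,
    time_text:   line.slice(DATETIME_RE),
    stamp:       line.slice(STAMP_RE)
  }
end

# Made-up sample line shaped to fit the patterns above.
sample = "THEFT-PETTY Theft Petty-plain Mar 05, 2007-Monday at 14:30 07-03-05-012345"
fields = parse_header(sample)
puts fields.inspect
```

The sketch keeps the timestamp as text; the script itself goes one step further and hands that text to `DateTime.parse`.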

Code

The import script is available here: http://tlrobinson.net/projects/toobs/import.rb

require "net/http"
require "rpdf2txt/parser"
require "date"

require "rubygems"
require "active_record" # require_gem has been removed from RubyGems; a plain require works

# misc regular expressions constants
datetimeRE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
stampRE = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# connect to the database
ActiveRecord::Base.establish_connection(
  :adapter  => "mysql",
  :host     => "host",
  :database => "database",
  :username => "username",
  :password => "password"
)

class Incident < ActiveRecord::Base
  self.table_name = "crimes" # set_table_name is gone from newer ActiveRecord versions
end
  
def import_url(url)
  puts "================== Processing: " + url + " =================="
  
  resp = Net::HTTP.get_response(URI.parse(url))
  if resp.is_a? Net::HTTPSuccess
    # parse the pdf, extract the text, split into lines
    parser = Rpdf2txt::Parser.new(resp.body)
    text = parser.extract_text
    lines = text.split("\n")
    
    incidents = Array.new # array containing each incident
    summary = false       # for multiple line summaries
    disp = false          # for cases when the "disp" data is on the line after the "Disp:" header
    
    # try to match each line to a regular expression or other condition
    # then extract the data from the line
    lines.each do |line|
    
      # first line
      if (line =~ stampRE)

        # special case for missing identifier of previous incident
        # (skipped when there is no summary to search, which would otherwise raise)
        if (incidents.size > 0 && incidents.last.identifier.nil? && incidents.last.summary)
          puts "+++ Last identifier is empty, searching for identifier in summary..."
          # look for a "DR#<digits>" reference and keep only the digits
          tempId = incidents.last.summary[/DR#(\d+)/, 1]
          if (tempId)
            puts "+++ Found! {" + tempId + "}"
            incidents.last.identifier = tempId
          end
        end
    
        # create new incident
        incidents << Incident.new
        summary = false
        disp = false
  
        # extract category, subcategory, time, and stamp
        cat_subcat_index = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
        incidents.last.category = line[0..cat_subcat_index-1].strip
        incidents.last.subcategory = line[cat_subcat_index..line.index(datetimeRE)-1].strip
        incidents.last.time = DateTime.parse(line.slice(datetimeRE))
        incidents.last.stamp = line.slice(stampRE)
        
      # identifier
      elsif (line =~ /^[0-9]+$/)
        incidents.last.identifier = line.slice(/^[0-9]+$/).to_i
        
      # location
      elsif (line =~ /Location:/)
        incidents.last.location = line.sub(/Location:/, "").strip
        
      # cc
      elsif (line =~ /cc:/)
        incidents.last.cc = line.sub(/cc:/, "").strip
        summary = false
      
      # disposition
      elsif (disp)
        incidents.last.disp = line.sub(/Disp:/, "").strip
        disp = false
      
      # summary
      elsif (line =~ /Summary:/ || summary)
        if (incidents.last.summary.nil?)
          incidents.last.summary = line.sub(/Summary:/, "").strip
        else
          incidents.last.summary << (" " + line.sub(/Summary:/, "").strip)
        end
    
        if (incidents.last.summary =~ /Disp:/)
          # find the "Disp:" header and data, remove from summary
          disp = incidents.last.summary.slice!(/\s*Disp:.*/)
          incidents.last.disp = disp.sub(/Disp:/, "").strip
          
          disp = (incidents.last.disp == "") # check that we actually got the "disp" data
          summary = false
        else
          summary = true
        end
      
      # no match
      else
        puts "discarding line: {" + line + "}"
      end
    end
    
    # at the end save each incident and print a list
    incidents.each do |incident|
      begin
        puts( ("%8d" % incident.identifier) + " " +
              ("%25s" % ("{" + incident.category    + "}")) + " " +
              ("%45s" % ("{" + incident.subcategory + "}")) + " " +
              ("%60s" % ("{" + incident.location    + "}")));
        incident.save
      rescue StandardError => exp # avoid swallowing interrupts and other fatal errors
        puts exp
      end
    end
    
  end
end

if (ARGV.length > 0)
  # import each argument
  ARGV.each do |arg|
    import_url(arg)
  end
else
  yesterday = Date.today - 1
  urlToImport = "http://capsnet.usc.edu/DPS/webpdf/"+
    ("%02d" % yesterday.mon) + ("%02d" % yesterday.mday) + yesterday.year.to_s[2..3] + ".pdf"
  import_url(urlToImport)
end

Conclusion

This system works fairly well, with a few exceptions. While the PDFs are far more consistent than the emails, occasionally a PDF shows up that rpdf2txt can’t parse. So far I haven’t found a solution (perhaps a different PDF-to-text converter would help). Also, entries are sometimes missing an identifier, or it shows up in a different location. Some special rules are used to try to find it, but they don’t always succeed.
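One of those special rules searches the summary for a “DR#” reference, as the script does when it starts a new entry. A minimal standalone sketch of that idea follows. The helper name `identifier_from_summary` and the sample summary are made up for illustration, and unlike the script this version converts the digits to an integer.

```ruby
# Look for a "DR#<digits>" reference in a summary and return the digits as an
# integer, or nil when there is no summary or no match.
# identifier_from_summary is a hypothetical helper name.
def identifier_from_summary(summary)
  return nil if summary.nil?
  digits = summary[/DR#(\d+)/, 1]
  digits && digits.to_i
end

# Made-up sample summary; real summaries may be worded differently.
puts identifier_from_summary("Report taken, see DR#123456 for details.").inspect
puts identifier_from_summary("No report number mentioned.").inspect
```

Returning nil instead of raising lets the caller decide what to do with entries that still have no identifier.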

Overall it was a success, as demonstrated by the 4000+ incidents currently in the TOOBS database.
