Scraping USC DPS’s incident logs
This post describes how to extract incident summaries and metadata from the USC Department of Public Safety’s daily incident logs, which are used extensively in TOOBS.
Background
A couple of years ago, when the Google Maps API was first introduced, I wanted to make something useful with it. The USC Department of Public Safety sends out “Crime Alert” emails whenever a robbery, assault, or other violent crime against a student occurs, so I decided to plot each location, along with a summary of the crime, on a map of USC and the surrounding area.
This was fine for a small number of crimes, but unfortunately the Crime Alert emails were completely unstructured and never formatted consistently, so automating the process was out of the question. I ended up creating each entry by hand and hard-coding the data. The result wasn’t pretty, but it’s still available on my USC page here: http://www-scf.usc.edu/~tlrobins/crimealert/
For UPE’s P.24 programming contest I decided to rewrite the whole thing to make it far more automated and flexible. Since my first attempt, I had discovered that DPS publishes every single incident they respond to in daily PDF logs. Obviously I would have preferred XML or some other structured format, but the PDFs will have to do for now.
Method
My language of choice was Ruby, since I originally planned to use the project as an excuse to learn Ruby on Rails. Due to some ridiculously strange bugs, I gave up on Rails for the project, but not before writing the incident log parser.
The main script can either take a list of URLs as arguments or, if no arguments are specified, try to download the previous day’s log (good for running it as a cron job). An HTTP request to the URL is made, and if it succeeds the PDF is downloaded into memory. To convert the PDF to text I used rpdf2txt.
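As a rough sketch of the URL-selection step, the default file name can be derived from a date with strftime. The names LOG_BASE_URL, log_url_for and urls_to_import are made up for illustration; only the base URL and the MMDDYY.pdf naming come from the import script in the Code section.

```ruby
require "date"

# Hypothetical constant: the base URL used by the import script.
LOG_BASE_URL = "http://capsnet.usc.edu/DPS/webpdf/"

# Hypothetical helper: build the log URL for a given date (file name is MMDDYY.pdf).
def log_url_for(date)
  LOG_BASE_URL + date.strftime("%m%d%y") + ".pdf"
end

# Hypothetical helper: URLs given on the command line win;
# with no arguments, fall back to the previous day's log.
def urls_to_import(args, today = Date.today)
  args.empty? ? [log_url_for(today - 1)] : args
end
```

Calling urls_to_import(ARGV) from a cron job with no arguments would then yield a single URL for yesterday’s log.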
Once the log is in text form, a variety of regular expressions and string manipulation functions extract each field from an entry. Completed entries are then inserted into a database. Lines that match none of the patterns are discarded.
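To make the field extraction concrete, here is a minimal sketch that applies the two patterns from the import script to the first line of an entry. The helper name parse_header and the constant names are hypothetical, and the sample line is invented to fit the patterns; it is not copied from a real DPS log.

```ruby
# Patterns copied from the import script (constant names are hypothetical).
DATETIME_PATTERN = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
STAMP_PATTERN = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# Hypothetical helper: split the first line of an entry into its fields.
# Returns nil when the line has no stamp or no date, i.e. it is not a first line.
def parse_header(line)
  return nil unless line =~ STAMP_PATTERN
  time_start = line.index(DATETIME_PATTERN)
  return nil if time_start.nil?
  # Everything before the first capital letter that is followed by a
  # lower-case letter is treated as the category; the rest, up to the
  # date, is the subcategory.
  split_at = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
  {
    :category => line[0...split_at].strip,
    :subcategory => line[split_at...time_start].strip,
    :time_text => line.slice(DATETIME_PATTERN),
    :stamp => line.slice(STAMP_PATTERN)
  }
end

# Made-up sample line, NOT taken from a real log.
sample = "THEFT-PETTY Theft Petty-Plain Oct 12, 2006-Thursday at 14:30 06-10-12-123456"
fields = parse_header(sample)
```

Returning nil for non-matching lines lets the caller decide whether a line belongs to another field or should be discarded.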
Code
The import script is available here: http://tlrobinson.net/projects/toobs/import.rb
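require "net/http" # Net::HTTP is used by import_url
require "uri"      # URI.parse is used by import_url
require "rpdf2txt/parser"
require "date"
require "rubygems"
require "active_record" # replaces require_gem, which newer RubyGems releases no longer provide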
# misc regular expression constants
# (real constants, so they are visible inside import_url; top-level
# local variables are not in scope inside a method body)
DATETIME_RE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
STAMP_RE = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# connect to the database
ActiveRecord::Base.establish_connection(
  :adapter => "mysql",
  :host => "host",
  :database => "database",
  :username => "username",
  :password => "password"
)

class Incident < ActiveRecord::Base
  self.table_name = "crimes"
end

def import_url(url)
  puts "================== Processing: " + url + " =================="
  resp = Net::HTTP.get_response(URI.parse(url))
  if resp.is_a? Net::HTTPSuccess
    # parse the pdf, extract the text, split into lines
    parser = Rpdf2txt::Parser.new(resp.body)
    text = parser.extract_text
    lines = text.split("\n")
    incidents = Array.new # array containing each incident
    summary = false # for multiple line summaries
    disp = false # for cases when the "disp" data is on the line after the "Disp:" header

    # try to match each line to a regular expression or other condition
    # then extract the data from the line
    lines.each do |line|
      # first line
      if (line =~ STAMP_RE)
        # special case for missing identifier of previous incident
        if (incidents.size > 0 && incidents.last.identifier == nil)
          puts "+++ Last identifier is empty, searching for identifier in summary…"
          tempRE = /DR\#[\d]+/;
          tempId = incidents.last.summary[tempRE];
          if (tempId != nil)
            # drop the leading "DR#" and keep the digits
            puts "+++ Found! {" + tempId[3..tempId.length-1] + "}"
            incidents.last.identifier = tempId[3..tempId.length-1];
          end
        end
        # create new incident
        incidents << Incident.new
        summary = false
        disp = false
        # extract category, subcategory, time, and stamp
        cat_subcat_index = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
        incidents.last.category = line[0..cat_subcat_index-1].strip
        incidents.last.subcategory = line[cat_subcat_index..line.index(DATETIME_RE)-1].strip
        incidents.last.time = DateTime.parse(line.slice(DATETIME_RE))
        incidents.last.stamp = line.slice(STAMP_RE)
      # identifier
      elsif (line =~ /^[0-9]+$/)
        incidents.last.identifier = line.slice(/^[0-9]+$/).to_i
      # location
      elsif (line =~ /Location:/)
        incidents.last.location = line.sub(/Location:/, "").strip
      # cc
      elsif (line =~ /cc:/)
        incidents.last.cc = line.sub(/cc:/, "").strip
        summary = false
      # disposition
      elsif (disp)
        incidents.last.disp = line.sub(/Disp:/, "").strip
        disp = false
      # summary
      elsif (line =~ /Summary:/ || summary)
        if (incidents.last.summary.nil?)
          incidents.last.summary = line.sub(/Summary:/, "").strip
        else
          incidents.last.summary << (" " + line.sub(/Summary:/, "").strip)
        end
        if (incidents.last.summary =~ /Disp:/)
          # find the "Disp:" header and data, remove from summary
          disp_text = incidents.last.summary.slice!(/\s*Disp:.*/)
          incidents.last.disp = disp_text.sub(/Disp:/, "").strip
          disp = (incidents.last.disp == "") # true if the "disp" data is still to come on the next line
          summary = false
        else
          summary = true
        end
      # no match
      else
        puts "discarding line: {" + line + "}"
      end
    end

    # at the end save each incident and print a list
    incidents.each do |incident|
      begin
        # a nil field raises here, so an incomplete incident is reported and not saved
        puts( ("%8d" % incident.identifier) + " " +
          ("%25s" % ("{" + incident.category + "}")) + " " +
          ("%45s" % ("{" + incident.subcategory + "}")) + " " +
          ("%60s" % ("{" + incident.location + "}")));
        incident.save
      rescue Exception => exp
        puts exp
      end
    end
  end
end
if (ARGV.length > 0)
  # import each argument
  ARGV.each do |arg|
    import_url(arg)
  end
else
  # no arguments: import yesterday's log (file name is MMDDYY.pdf)
  yesterday = Date.today - 1
  urlToImport = "http://capsnet.usc.edu/DPS/webpdf/" +
    ("%02d" % yesterday.mon) + ("%02d" % yesterday.mday) + yesterday.year.to_s[2..3] + ".pdf"
  import_url(urlToImport)
end
Conclusion
This system works fairly well, with a few exceptions. While the PDFs are far more consistent than the emails, occasionally a PDF shows up that rpdf2txt can’t parse. So far I haven’t found a solution (perhaps a different PDF-to-text converter would help). Also, sometimes entries are missing an identifier, or it shows up in a different location. Some special rules are used to try to find it, but they’re not always successful.
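One of those special rules looks for a “DR#” reference inside the summary text. A minimal sketch of that idea follows; the helper name identifier_from_summary is hypothetical, and the sample summaries in the usage lines are invented.

```ruby
# Hypothetical helper: pull an identifier out of a summary that mentions "DR#<digits>".
# Returns the digits as a string, or nil when the summary has no such reference.
def identifier_from_summary(summary)
  return nil if summary.nil?
  # String#[] with a capture index returns only the digits after "DR#".
  summary[/DR\#(\d+)/, 1]
end

# Invented sample summaries, not taken from a real log.
found = identifier_from_summary("Suspect removed a bicycle. See DR#0612345 for details.")
missing = identifier_from_summary("No report number here.")
```

When the helper returns nil, the entry simply stays without an identifier, which matches the cases where the rule is not successful.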
Overall it was a success, as demonstrated by the 4000+ incidents currently in the TOOBS database.