Stealing LAPD's crime data

This post explains how to get data from LAPD’s [crime maps](http://www.lapdcrimemaps.org/) website. See my [previous post](http://tlrobinson.net/blog/?p=6) on scraping DPS’s incident logs for background on TOOBS.

After completing the TOOBS project for the UPE P.24 programming contest I was checking out LAPD’s [crime maps](http://www.lapdcrimemaps.org/) website, which is similar to TOOBS (but not as cool!), and I realized I could integrate their data with DPS’s data for the ultimate Los Angeles / USC crime map. There is very little overlap between the LAPD and DPS data since the two are separate entities. Murders and some other incidents may show up in both, but hopefully these are rare…

The LAPD system also uses JavaScript and XMLHttpRequest to fetch the data from a server-side script. Additionally, there is no security check to verify that the requests are coming from the LAPD web app. This means we can easily, and (as far as I know) legally, access their data.

Due to the same-origin policy, which restricts JavaScript to making requests only to the originating server, you cannot simply use their PHP script from your own JavaScript; you must use some sort of proxy. While this policy can be annoying, it is necessary to limit what malicious JavaScript could do.

To obtain the crime data from LAPD’s servers, we begin by forming the request URL, which contains parameters such as the start date, the interval length, lat/lon coordinates, radius, and crime types. An HTTP request is made to their server, and the response is stored.
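A rough sketch of that proxy fetch in PHP looks like this. The script name and parameter names below are placeholders, not LAPD’s actual ones; the real ones come from watching the XMLHttpRequests the web app makes.

&lt;?php
// Hypothetical parameter names, for illustration only.
$params = array(
  'start_date' => '02-04-2007',
  'interval'   => '7',    // days
  'lat'        => '34.0224',
  'lon'        => '-118.2851',
  'radius'     => '1.0',  // miles
  'types'      => 'all',
);

// "getcrimes.php" is a placeholder for their actual server-side script.
$url = 'http://www.lapdcrimemaps.org/getcrimes.php?' . http_build_query($params);

// Fetch from our own server, where the browser's same-origin policy doesn't apply.
$response = file_get_contents($url);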

We notice the response is simply JavaScript that gets eval’d on the client:

searchPoints = new Array ();

searchPoints[0] = new searchPoint ('0', '#070307306', 'lightblue', '17', '-118.301638', '34.022812', '14XX W 36th St', '0.74', '6', '02-04-2007 10:45:00 PM', 'Southwest Division: 213-485-6571');
searchPoints[1] = new searchPoint ('1', '#070307280', 'violet', '17', '-118.284008', '34.033212', '25XX S Hoover St', '0.52', '3', '02-04-2007 10:00:00 PM', 'Southwest Division: 213-485-6571');
searchPoints[2] = new searchPoint ('2', '#070307224', 'cyan', '17', '-118.304108', '34.032481', '26XX Dalton Av', '0.83', '4', '02-04-2007 12:15:00 AM', 'Southwest Division: 213-485-6571');
searchPoints[3] = new searchPoint ('3', '#070307222', 'blue', '17', '-118.2903', '34.0284', 'Menlo Av and 29th Av', '0.02', '2', '02-03-2007 11:00:00 PM', 'Southwest Division: 213-485-6571');

We could simply redirect this code to our own app and do the processing on the client side with JavaScript, but we also notice that JavaScript syntax is very similar to PHP syntax. By creating a compatible PHP class called searchPoint and prepending a “$” to each variable name, we end up with valid PHP code that we can simply eval. The result is an array of searchPoint objects that we can add to our response, insert into a database, or whatever we want!
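Here is a minimal sketch of that trick. Since the post doesn’t show the original class, this version just stores all eleven positional arguments with a variadic constructor (PHP 5.6+); the field meanings in the comment are guesses from the data above.

&lt;?php
// Compatible stand-in for the JavaScript searchPoint constructor.
// The eleven arguments appear to be: index, report number, marker color,
// type code, lon, lat, address, distance, count, date/time, and division.
class searchPoint
{
    public $fields;

    public function __construct(...$fields)
    {
        $this->fields = $fields;
    }
}

// $response holds the JavaScript shown above.
$code = str_replace('searchPoints', '$searchPoints', $response);
$code = str_replace('new Array ()', 'array()', $code); // "new Array" isn't PHP

// DANGER: eval'ing someone else's text -- see the warning below.
eval($code);

// $searchPoints is now an array of searchPoint objects.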

Note that this is extremely insecure since we’re eval’ing text that we got from somewhere else. By changing the response, the provider of the data could execute any PHP they wanted on my server.

A more secure method would be to actually parse the data rather than letting PHP’s eval do the work.
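For instance, a sketch like the following pulls the fields out with regular expressions instead. It assumes the field values never contain single quotes or parentheses, which holds for the responses shown above.

&lt;?php
// Pull out the argument list of every searchPoint(...) call.
preg_match_all('/new searchPoint \(([^)]*)\)/', $response, $calls);

$points = array();
foreach ($calls[1] as $argList) {
    // Each argument is a single-quoted string; capture the contents.
    preg_match_all("/'([^']*)'/", $argList, $args);
    $points[] = $args[1]; // plain array of the eleven field values
}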

Scraping USC DPS's incident logs

This post describes how to extract incident summaries and metadata from the USC Department of Public Safety’s daily incident logs, a process used extensively in [TOOBS](http://tlrobinson.net/projects/toobs).

### Background ###

A couple of years ago, when the Google Maps API was first introduced, I wanted to make something useful with it. The USC Department of Public Safety sends out “Crime Alert” emails whenever a robbery, assault, or other violent crime against a student occurs, so I decided to plot each of the locations, along with a summary of the crime, on a map of USC and the surrounding area.

This was fine for a small number of crimes, but unfortunately the Crime Alert emails were completely unstructured and never formatted consistently, so automating the process was out of the question. I ended up creating each entry by hand and hard-coding the data. The result wasn’t pretty, but it’s still available on my USC page here: [http://www-scf.usc.edu/~tlrobins/crimealert/](http://www-scf.usc.edu/~tlrobins/crimealert/)

For UPE’s P.24 programming contest I decided to rewrite the whole thing to make it far more automated and flexible. Since my first attempt, I discovered that DPS publishes every single incident they respond to as [daily PDF logs](http://capsnet.usc.edu/DPS/CrimeSummary.cfm). Obviously I would have preferred XML or some other structured format, but the PDFs will have to do for now.

### Method ###

My language of choice was Ruby since I originally planned on using the project as an excuse to learn Ruby on Rails. Due to some ridiculously strange bugs I gave up on Rails for the project, but not before writing the incident logs parser.

The main script can either take a list of URLs as arguments, or if no arguments are specified it will try to download the previous day’s log (good for running it as a cron job). An HTTP request is made to the URL, and if successful the PDF is downloaded into memory. To convert the PDF to text I used [rpdf2txt](http://raa.ruby-lang.org/project/rpdf2txt/).

Once in text form, a variety of regular expressions and string manipulation functions are used to extract each field from the entry. When an entry is complete, it is inserted into a database. Spurious lines are discarded.

### Code ###

The import script is available here: [http://tlrobinson.net/projects/toobs/import.rb](http://tlrobinson.net/projects/toobs/import.rb)

require "net/http"
require "rpdf2txt/parser"
require "date"

require "rubygems"
require_gem "activerecord"

# misc regular expressions constants
datetimeRE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
stampRE = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# connect to the database
ActiveRecord::Base.establish_connection(
  :adapter  => "mysql",
  :host     => "host",
  :database => "database",
  :username => "username",
  :password => "password"
)

class Incident < ActiveRecord::Base
  set_table_name "crimes"
end
  
def import_url(url)
  puts "================== Processing: " + url + " =================="
  
  resp = Net::HTTP.get_response(URI.parse(url))
  if resp.is_a? Net::HTTPSuccess
    # parse the pdf, extract the text, split into lines
    parser = Rpdf2txt::Parser.new(resp.body)
    text = parser.extract_text
    lines = text.split("\n")
    
    incidents = Array.new # array containing each incident
    summary = false       # for multiple line summaries
    disp = false          # for cases when the "disp" data is on the line after the "Disp:" header
    
    # try to match each line to a regular expression or other condition
    # then extract the data from the line
    lines.each do |line|
    
      # first line
      if (line =~ stampRE)

        # special case for missing identifier of previous incident
        if (incidents.size > 0 && incidents.last.identifier == nil)
          puts "+++ Last identifier is empty, searching for identifier in summary…"
          tempRE = /DR\#[\d]+/
          tempId = incidents.last.summary[tempRE]
          if (tempId != nil)
            puts "+++ Found! {" + tempId[3..-1] + "}"
            incidents.last.identifier = tempId[3..-1]
          end
        end
    
        # create new incident
        incidents << Incident.new
        summary = false
        disp = false
  
        # extract category, subcategory, time, and stamp
        cat_subcat_index = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
        incidents.last.category = line[0..cat_subcat_index-1].strip
        incidents.last.subcategory = line[cat_subcat_index..line.index(datetimeRE)-1].strip
        incidents.last.time = DateTime.parse(line.slice(datetimeRE))
        incidents.last.stamp = line.slice(stampRE)
        
      # identifier
      elsif (line =~ /^[0-9]+$/)
        incidents.last.identifier = line.slice(/^[0-9]+$/).to_i
        
      # location
      elsif (line =~ /Location:/)
        incidents.last.location = line.sub(/Location:/, "").strip
        
      # cc
      elsif (line =~ /cc:/)
        incidents.last.cc = line.sub(/cc:/, "").strip
        summary = false
      
      # disposition
      elsif (disp)
        incidents.last.disp = line.sub(/Disp:/, "").strip
        disp = false
      
      # summary
      elsif (line =~ /Summary:/ || summary)
        if (incidents.last.summary.nil?)
          incidents.last.summary = line.sub(/Summary:/, "").strip
        else
          incidents.last.summary << (" " + line.sub(/Summary:/, "").strip)
        end
    
        if (incidents.last.summary =~ /Disp:/)
          # find the "Disp:" header and data, remove from summary
          disp = incidents.last.summary.slice!(/\s*Disp:.*/)
          incidents.last.disp = disp.sub(/Disp:/, "").strip
          
          disp = (incidents.last.disp == "") # check that we actually got the "disp" data
          summary = false
        else
          summary = true
        end
      
      # no match
      else
        puts "discarding line: {" + line + "}"
      end
    end
    
    # at the end save each incident and print a list
    incidents.each do |incident|
      begin
        puts( ("%8d" % incident.identifier) + " " +
              ("%25s" % ("{" + incident.category    + "}")) + " " +
              ("%45s" % ("{" + incident.subcategory + "}")) + " " +
              ("%60s" % ("{" + incident.location    + "}")));
        incident.save
      rescue Exception => exp
        puts exp
      end
    end
    
  end
end

if (ARGV.length > 0)
  # import each argument
  ARGV.each do |arg|
    import_url(arg)
  end
else
  yesterday = Date.today - 1
  urlToImport = "http://capsnet.usc.edu/DPS/webpdf/"+
    ("%02d" % yesterday.mon) + ("%02d" % yesterday.mday) + yesterday.year.to_s[2..3] + ".pdf"
  import_url(urlToImport)
end

### Conclusion ###

This system works fairly well, with a few exceptions. While the PDFs are far more consistent than the emails, occasionally one shows up that rpdf2txt can’t parse; so far I haven’t found a solution, short of trying a different PDF-to-text converter. Also, entries are sometimes missing an identifier, or it shows up in a different location; a few special rules try to recover it, but they’re not always successful.

Overall it was a success, as demonstrated by the 4000+ incidents currently in the TOOBS database.