<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>tlrobinson.net blog &#187; TOOBS</title>
	<atom:link href="http://tlrobinson.net/blog/category/toobs/feed/" rel="self" type="application/rss+xml" />
	<link>http://tlrobinson.net/blog</link>
	<description></description>
	<lastBuildDate>Mon, 06 Apr 2009 08:37:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Stealing LAPD&#039;s crime data</title>
		<link>http://tlrobinson.net/blog/2007/03/stealing-lapds-crime-data/</link>
		<comments>http://tlrobinson.net/blog/2007/03/stealing-lapds-crime-data/#comments</comments>
		<pubDate>Sun, 18 Mar 2007 11:21:04 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[TOOBS]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://tlrobinson.net/blog/?p=7</guid>
		<description><![CDATA[This post explains how to get data from LAPD&#8217;s [crime maps](http://www.lapdcrimemaps.org/) website. See my [previous post](http://tlrobinson.net/blog/?p=6) on scraping DPS&#8217;s incident logs for background on TOOBS. After completing the TOOBS project for the UPE P.24 programming contest I was checking out &#8230; <a href="http://tlrobinson.net/blog/2007/03/stealing-lapds-crime-data/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This post explains how to get data from LAPD&#8217;s [crime maps](http://www.lapdcrimemaps.org/) website. See my [previous post](http://tlrobinson.net/blog/?p=6) on scraping DPS&#8217;s incident logs for background on TOOBS.</p>
<p>After completing the TOOBS project for the UPE P.24 programming contest, I was checking out LAPD&#8217;s [crime maps](http://www.lapdcrimemaps.org/) website, which is similar to TOOBS (but not as cool!), and I realized I could integrate their data with DPS&#8217;s data for the ultimate Los Angeles / USC crime map. There is very little overlap between the LAPD and DPS data since the two are separate entities. Murders and some other incidents may show up in both, but hopefully these are rare&#8230;</p>
<p>The LAPD system also uses JavaScript and XMLHttpRequest to fetch the data from a server-side script. Additionally, nothing checks that the requests are coming from the LAPD web app. This means we can easily, and (as far as I know) legally, access their data.</p>
<p>Because the same-origin policy restricts JavaScript to making requests only to the originating server, you cannot simply call their PHP script from your own JavaScript; you must route the request through a proxy on your own server. While this policy can be annoying, it is necessary to limit what malicious JavaScript could do.</p>
<p>To obtain the crime data from LAPD&#8217;s servers, we begin by forming the request URL, which contains parameters such as the start date, the interval length, lat/lon coordinates, radius, and crime types. An HTTP request is made to their server, and the response is stored.</p>
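<p>As a rough sketch of the URL-forming step, here it is in Ruby (the language of the DPS import script) rather than PHP. The endpoint and the query key names below are made up for illustration, since the real ones are not listed in this post; only the kinds of parameters come from the description above.</p>

```ruby
require "uri"

# Hypothetical endpoint and query keys, for illustration only.
# The real script's URL and parameter names are not given in this post.
BASE_URL = "http://example.com/crime_search.php"

# Build the request URL from the search parameters described above.
def build_request_url(start_date:, interval_days:, lat:, lon:, radius_miles:, crime_types:)
  query = URI.encode_www_form(
    startdate: start_date,
    interval: interval_days,
    lat: lat,
    lon: lon,
    radius: radius_miles,
    crimes: crime_types.join(",") # crime types sent as one comma-separated value
  )
  "#{BASE_URL}?#{query}"
end

url = build_request_url(
  start_date: "02-04-2007",
  interval_days: 7,
  lat: 34.0224,
  lon: -118.2851,
  radius_miles: 1,
  crime_types: [1, 2, 3]
)
puts url
```

<p>Letting URI.encode_www_form do the escaping avoids hand-building the query string, which matters once values contain spaces or commas.</p>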
<p>We notice the response is simply JavaScript that gets eval&#8217;d on the client:</p>
<div style="text-align:left;color:#000000; background-color:#ffffff; border:solid black 1px; padding:0.5em 1em 0.5em 1em; overflow:auto;font-size:small; font-family:monospace; ">searchPoints = <span style="color:#881350;">new</span> <span style="color:#003369;">Array </span>();<br />
searchPoints[<span style="color:#0000ff;">0</span>] = <span style="color:#881350;">new</span> <span style="color:#003369;">searchPoint </span>(<span style="color:#760f15;">'0'</span>, <span style="color:#760f15;">'#070307306'</span>, <span style="color:#760f15;">'lightblue'</span>, <span style="color:#760f15;">'17'</span>, <span style="color:#760f15;">'-118.301638'</span>, <span style="color:#760f15;">'34.022812'</span>, <span style="color:#760f15;">'14XX W 36th St'</span>, <span style="color:#760f15;">'0.74'</span>, <span style="color:#760f15;">'6'</span>, <span style="color:#760f15;">'02-04-2007 10:45:00 PM'</span>, <span style="color:#760f15;">'Southwest Division: 213-485-6571'</span>);<br />
searchPoints[<span style="color:#0000ff;">1</span>] = <span style="color:#881350;">new</span> <span style="color:#003369;">searchPoint </span>(<span style="color:#760f15;">'1'</span>, <span style="color:#760f15;">'#070307280'</span>, <span style="color:#760f15;">'violet'</span>, <span style="color:#760f15;">'17'</span>, <span style="color:#760f15;">'-118.284008'</span>, <span style="color:#760f15;">'34.033212'</span>, <span style="color:#760f15;">'25XX S Hoover St'</span>, <span style="color:#760f15;">'0.52'</span>, <span style="color:#760f15;">'3'</span>, <span style="color:#760f15;">'02-04-2007 10:00:00 PM'</span>, <span style="color:#760f15;">'Southwest Division: 213-485-6571'</span>);<br />
searchPoints[<span style="color:#0000ff;">2</span>] = <span style="color:#881350;">new</span> <span style="color:#003369;">searchPoint </span>(<span style="color:#760f15;">'2'</span>, <span style="color:#760f15;">'#070307224'</span>, <span style="color:#760f15;">'cyan'</span>, <span style="color:#760f15;">'17'</span>, <span style="color:#760f15;">'-118.304108'</span>, <span style="color:#760f15;">'34.032481'</span>, <span style="color:#760f15;">'26XX Dalton Av'</span>, <span style="color:#760f15;">'0.83'</span>, <span style="color:#760f15;">'4'</span>, <span style="color:#760f15;">'02-04-2007 12:15:00 AM'</span>, <span style="color:#760f15;">'Southwest Division: 213-485-6571'</span>);<br />
searchPoints[<span style="color:#0000ff;">3</span>] = <span style="color:#881350;">new</span> <span style="color:#003369;">searchPoint </span>(<span style="color:#760f15;">'3'</span>, <span style="color:#760f15;">'#070307222'</span>, <span style="color:#760f15;">'blue'</span>, <span style="color:#760f15;">'17'</span>, <span style="color:#760f15;">'-118.2903'</span>, <span style="color:#760f15;">'34.0284'</span>, <span style="color:#760f15;">'Menlo Av and 29th Av'</span>, <span style="color:#760f15;">'0.02'</span>, <span style="color:#760f15;">'2'</span>, <span style="color:#760f15;">'02-03-2007 11:00:00 PM'</span>, <span style="color:#760f15;">'Southwest Division: 213-485-6571'</span>);<br />
&#8230;</div>
<p>We could simply redirect this code to our own app and do the processing on the client side with JavaScript, but we also notice that JavaScript syntax is very similar to PHP syntax. By creating a compatible PHP object called searchPoint and prepending a &#8220;$&#8221; to each variable name, we have valid PHP code that we can simply eval. The result is an array of searchPoint objects that we can easily add to our response, or insert into a database, or whatever we want!</p>
<p>Note that this is <em>extremely</em> insecure since we&#8217;re eval&#8217;ing text that we got from somewhere else. By changing the response, the provider of the data could execute any PHP they wanted on my server.</p>
<p>A more secure method would be to actually parse the data rather than letting PHP&#8217;s eval do the work.</p>
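<p>Here is a minimal sketch of that safer approach, written in Ruby (like the DPS import script) rather than PHP. It assumes every statement looks like the sample response above, with plain single-quoted arguments and no escaped quotes inside values. The function and constant names are my own, and the fields are returned as plain strings because the meaning of each position is not documented here.</p>

```ruby
# Matches one "searchPoints[n] = new searchPoint (...);" statement and
# captures the raw argument list. Assumes one statement per line.
STATEMENT_RE = /searchPoints\[\d+\]\s*=\s*new\s+searchPoint\s*\((.*?)\);/

# Matches one single-quoted argument (assumes no escaped quotes in values).
ARGUMENT_RE = /'([^']*)'/

# Returns an array with one entry per searchPoint; each entry is an array
# of that statement's string arguments. Nothing is ever eval'd.
def parse_search_points(js)
  js.scan(STATEMENT_RE).map do |(args)|
    args.scan(ARGUMENT_RE).flatten
  end
end

sample = "searchPoints = new Array ();\n" \
         "searchPoints[0] = new searchPoint ('0', '#070307306', 'lightblue', '17', " \
         "'-118.301638', '34.022812', '14XX W 36th St', '0.74', '6', " \
         "'02-04-2007 10:45:00 PM', 'Southwest Division: 213-485-6571');\n"

points = parse_search_points(sample)
puts points.length
puts points.first[6]
```

<p>Anything in the response that does not match the expected statement shape is simply ignored, so a tampered response can at worst produce bad data, not run code on the server.</p>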
]]></content:encoded>
			<wfw:commentRss>http://tlrobinson.net/blog/2007/03/stealing-lapds-crime-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Scraping USC DPS&#039;s incident logs</title>
		<link>http://tlrobinson.net/blog/2007/03/scraping-usc-dpss-incident-logs/</link>
		<comments>http://tlrobinson.net/blog/2007/03/scraping-usc-dpss-incident-logs/#comments</comments>
		<pubDate>Mon, 12 Mar 2007 22:41:19 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Ruby]]></category>
		<category><![CDATA[TOOBS]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://tlrobinson.net/blog/?p=6</guid>
		<description><![CDATA[This post describes how to extract incident summaries and metadata from the USC Department of Public Safety&#8217;s daily incident logs, which are used extensively in [TOOBS](http://tlrobinson.net/projects/toobs) ### Background ### A couple of years ago when the Google Maps API was &#8230; <a href="http://tlrobinson.net/blog/2007/03/scraping-usc-dpss-incident-logs/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This post describes how to extract incident summaries and metadata from the USC Department of Public Safety&#8217;s daily incident logs, which are used extensively in [TOOBS](http://tlrobinson.net/projects/toobs).</p>
<p>### Background ###</p>
<p>A couple of years ago, when the Google Maps API was first introduced, I wanted to make something useful with it. The USC Department of Public Safety sends out &#8220;Crime Alert&#8221; emails whenever a robbery, assault, or other violent crime against a student occurs, so I decided to plot each of the locations along with the summary of the crime on a map of USC and the surrounding area.</p>
<p>This was fine for a small number of crimes, but unfortunately the Crime Alert emails were completely unstructured and never formatted consistently, so automating the process was out of the question. I ended up creating each entry by hand and hard-coding the data. The result wasn&#8217;t pretty, but it&#8217;s still available on my USC page here: [http://www-scf.usc.edu/~tlrobins/crimealert/](http://www-scf.usc.edu/~tlrobins/crimealert/)</p>
<p>For UPE&#8217;s P.24 programming contest I decided to rewrite the whole thing to make it far more automated and flexible. Since my first attempt, I had discovered that DPS publishes every single incident they respond to as [daily PDF logs](http://capsnet.usc.edu/DPS/CrimeSummary.cfm). Obviously I would have preferred XML or some other structured format, but the PDFs will have to do for now.</p>
<p>### Method ###</p>
<p>My language of choice was Ruby since I originally planned on using the project as an excuse to learn Ruby on Rails. Due to some ridiculously strange bugs I gave up on Rails for the project, but not before writing the incident logs parser.</p>
<p>The main script can either take a list of URLs as arguments or, if no arguments are specified, try to download the previous day&#8217;s log (good for running it as a cron job). An HTTP request to the URL is made, and if successful the PDF is downloaded into memory. To convert the PDF to text I used [rpdf2txt](http://raa.ruby-lang.org/project/rpdf2txt/).</p>
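<p>The argument handling can be sketched on its own, roughly like this. The base URL and the MMDDYY file name come from the import script in the Code section; the helper names log_url_for and urls_to_import are my own and do not appear in that script, which builds the same string with format specifiers instead of strftime.</p>

```ruby
require "date"

# Base URL used by the import script for the daily PDF logs.
LOG_BASE_URL = "http://capsnet.usc.edu/DPS/webpdf/"

# Build the log URL for a given date; the file name is MMDDYY.pdf.
def log_url_for(date)
  LOG_BASE_URL + date.strftime("%m%d%y") + ".pdf"
end

# Use the URLs given on the command line, or fall back to yesterday's log.
# "today" is a parameter only so the fallback can be checked with a fixed date.
def urls_to_import(args, today = Date.today)
  args.empty? ? [log_url_for(today - 1)] : args
end

puts urls_to_import([], Date.new(2007, 3, 13))
```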
<p>Once in text form, a variety of regular expressions and string manipulation functions are used to extract each field from the entry. When an entry is complete, it is inserted into a database. Spurious lines are discarded.</p>
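<p>To make the regular-expression step concrete, here is a small sketch of how the first line of an entry is split into fields. The two patterns and the category/subcategory split are taken from the import script in the Code section; the helper name parse_first_line and the sample line are invented for illustration, and the sample is only my guess at what a line of extracted text looks like.</p>

```ruby
# Regular expressions copied from the import script.
DATETIME_RE = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/
STAMP_RE = /[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/

# Split the first line of an entry into its fields.
def parse_first_line(line)
  # The category is the leading run with no lowercase letters; the
  # subcategory starts at the first capitalized word that follows it.
  split_at = line.slice(/[^a-z]*(?=[A-Z][a-z])/).length
  {
    category: line[0...split_at].strip,
    subcategory: line[split_at...line.index(DATETIME_RE)].strip,
    time_text: line.slice(DATETIME_RE), # left as text; the script parses it into a DateTime
    stamp: line.slice(STAMP_RE)
  }
end

# Hypothetical line, invented for this example (not copied from a real log).
line = "THEFT-PETTY Theft Petty-Plain Mar 10, 2007-Saturday at 14:30 07-03-10-123456"
fields = parse_first_line(line)
puts fields[:category]
puts fields[:subcategory]
puts fields[:stamp]
```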
<p>### Code ###</p>
<p>The import script is available here: [http://tlrobinson.net/projects/toobs/import.rb](http://tlrobinson.net/projects/toobs/import.rb)</p>
<div style="text-align:left;color:#000000; background-color:#ffffff; border:solid black 1px; padding:0.5em 1em 0.5em 1em; overflow:auto;font-size:small; font-family:monospace; "><span style="color:#881350;">require</span> <span style="color:#760f15;">&quot;net/http&quot;</span><br />
<span style="color:#881350;">require</span> <span style="color:#760f15;">&quot;rpdf2txt/parser&quot;</span><br />
<span style="color:#881350;">require</span> <span style="color:#760f15;">&quot;date&quot;</span></p>
<p><span style="color:#881350;">require</span> <span style="color:#760f15;">&quot;rubygems&quot;</span><br />
require_gem <span style="color:#760f15;">&quot;activerecord&quot;</span></p>
<p><span style="color:#236e25;"># misc regular expressions constants<br />
</span>datetimeRE = <span style="color:#c700c2;">/[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}-[A-Z][a-z]+ at [0-9]{2}:[0-9]{2}/</span><br />
stampRE = <span style="color:#c700c2;">/[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]+/</span></p>
<p><span style="color:#236e25;"># connect to the database<br />
</span>ActiveRecord::Base.establish_connection(<br />
&nbsp;&nbsp;<span style="color:#d6771c;">:adapter</span> &nbsp;=&gt; <span style="color:#760f15;">&quot;mysql&quot;</span>,<br />
&nbsp;&nbsp;<span style="color:#d6771c;">:host</span> &nbsp;&nbsp;&nbsp;&nbsp;=&gt; <span style="color:#760f15;">&quot;host&quot;</span>,<br />
&nbsp;&nbsp;<span style="color:#d6771c;">:database</span> =&gt; <span style="color:#760f15;">&quot;database&quot;</span>,<br />
&nbsp;&nbsp;<span style="color:#d6771c;">:username</span> =&gt; <span style="color:#760f15;">&quot;username&quot;</span>,<br />
&nbsp;&nbsp;<span style="color:#d6771c;">:password</span> =&gt; <span style="color:#760f15;">&quot;password&quot;</span> <br />
)</p>
<p><span style="color:#881350;">class</span> Incident &lt; ActiveRecord::Base<br />
&nbsp;&nbsp;set_table_name <span style="color:#760f15;">&quot;crimes&quot;</span><br />
<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;<br />
<span style="color:#881350;">def</span> import_url(url)<br />
&nbsp;&nbsp;<span style="color:#881350;">puts</span> <span style="color:#760f15;">&quot;================== Processing: &quot;</span> + url + <span style="color:#760f15;">&quot; ==================&quot;</span><br />
&nbsp;&nbsp;<br />
&nbsp;&nbsp;resp = Net::HTTP.get_response(URI.parse(url))<br />
&nbsp;&nbsp;<span style="color:#881350;">if</span> resp.is_a? Net::HTTPSuccess<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># parse the pdf, extract the text, split into lines<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;parser = Rpdf2txt::Parser.new(resp.body)<br />
&nbsp;&nbsp;&nbsp;&nbsp;text = parser.extract_text<br />
&nbsp;&nbsp;&nbsp;&nbsp;lines = text.<span style="color:#881350;">split</span>(<span style="color:#760f15;">&quot;\n&quot;</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;incidents = <span style="color:#881350;">Array</span>.new <span style="color:#236e25;"># array containing each incident<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;summary = <span style="color:#0000cc;">false</span> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># for multiple line summaries<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;disp = <span style="color:#0000cc;">false</span> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># for cases when the &quot;disp&quot; data is on the line after the &quot;Disp:&quot; header<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># try to match each line to a regular expression or other condition<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># then extract the data from the line<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;lines.each <span style="color:#881350;">do</span> |line|<br />
&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># first line<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">if</span> (line =~ stampRE)</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># special case for missing identifier of previous incident<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">if</span> (incidents.size &gt; <span style="color:#0000ff;">0</span> &amp;&amp; incidents.last.identifier == <span style="color:#0000cc;">nil</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">puts</span> <span style="color:#760f15;">&quot;+++ Last identifier is empty, searching for identifier in summary...&quot;</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tempRE = <span style="color:#c700c2;">/DR\#[\d]+/</span>;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tempId = incidents.last.summary[tempRE];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">if</span> (tempId != <span style="color:#0000cc;">nil</span>) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">puts</span> <span style="color:#760f15;">&quot;+++ Found! {&quot;</span> + tempId[<span style="color:#0000ff;">3.</span>.tempId.length-<span style="color:#0000ff;">1</span>] + <span style="color:#760f15;">&quot;}&quot;</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.identifier = tempId[<span style="color:#0000ff;">3.</span>.tempId.length-<span style="color:#0000ff;">1</span>];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># create new incident<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents &lt;&lt; Incident.new<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;summary = <span style="color:#0000cc;">false</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;disp = <span style="color:#0000cc;">false</span><br />
&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># extract category, subcategory, time, and stamp<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cat_subcat_index = line.slice(<span style="color:#c700c2;">/[^a-z]*(?=[A-Z][a-z])/</span>).length<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.category = line[<span style="color:#0000ff;">0.</span>.cat_subcat_index-<span style="color:#0000ff;">1</span>].strip<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.subcategory = line[cat_subcat_index..line.index(datetimeRE)<span style="color:#0000ff;">-1</span>].strip<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.time = <span style="color:#881350;">DateTime</span>.parse(line.slice(datetimeRE))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.stamp = line.slice(stampRE)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># identifier<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">elsif</span> (line =~ <span style="color:#c700c2;">/^[0-9]+$/</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.identifier = line.slice(<span style="color:#c700c2;">/^[0-9]+$/</span>).to_i<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># location<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">elsif</span> (line =~ <span style="color:#c700c2;">/Location:/</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.location = line.<span style="color:#881350;">sub</span>(<span style="color:#c700c2;">/Location:/</span>, <span style="color:#760f15;">&quot;&quot;</span>).strip<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># cc<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">elsif</span> (line =~ <span style="color:#c700c2;">/cc:/</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.cc = line.<span style="color:#881350;">sub</span>(<span style="color:#c700c2;">/cc:/</span>, <span style="color:#760f15;">&quot;&quot;</span>).strip<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;summary = <span style="color:#0000cc;">false</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># disposition<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">elsif</span> (disp) <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.disp = line.<span style="color:#881350;">sub</span>(<span style="color:#c700c2;">/Disp:/</span>, <span style="color:#760f15;">&quot;&quot;</span>).strip<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;disp = <span style="color:#0000cc;">false</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># summary<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">elsif</span> (line =~ <span style="color:#c700c2;">/Summary:/</span> || summary)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">if</span> (incidents.last.summary.nil?)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.summary = line.<span style="color:#881350;">sub</span>(<span style="color:#c700c2;">/Summary:/</span>, <span style="color:#760f15;">&quot;&quot;</span>).strip<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">else</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.summary &lt;&lt; (<span style="color:#760f15;">&quot; &quot;</span> + line.<span style="color:#881350;">sub</span>(<span style="color:#c700c2;">/Summary:/</span>, <span style="color:#760f15;">&quot;&quot;</span>).strip)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">if</span> (incidents.last.summary =~ <span style="color:#c700c2;">/Disp:/</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># find the &quot;Disp:&quot; header and data, remove from summary<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;disp = incidents.last.summary.slice!(<span style="color:#c700c2;">/\s*Disp:.*/</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incidents.last.disp = disp.<span style="color:#881350;">sub</span>(<span style="color:#c700c2;">/Disp:/</span>, <span style="color:#760f15;">&quot;&quot;</span>).strip<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;disp = (incidents.last.disp == <span style="color:#760f15;">&quot;&quot;</span>) <span style="color:#236e25;"># check that we actually got the &quot;disp&quot; data<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;summary = <span style="color:#0000cc;">false</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">else</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;summary = <span style="color:#0000cc;">true</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># no match<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">else</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">puts</span> <span style="color:#760f15;">&quot;discarding line: {&quot;</span> + line + <span style="color:#760f15;">&quot;}&quot;</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#236e25;"># at the end save each incident and print a list<br />
</span>&nbsp;&nbsp;&nbsp;&nbsp;incidents.each <span style="color:#881350;">do</span> |incident|<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">begin</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">puts</span>( (<span style="color:#760f15;">&quot;%8d&quot;</span> % incident.identifier) + <span style="color:#760f15;">&quot; &quot;</span> +<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(<span style="color:#760f15;">&quot;%25s&quot;</span> % (<span style="color:#760f15;">&quot;{&quot;</span> + incident.category &nbsp;&nbsp;&nbsp;+ <span style="color:#760f15;">&quot;}&quot;</span>)) + <span style="color:#760f15;">&quot; &quot;</span> +<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(<span style="color:#760f15;">&quot;%45s&quot;</span> % (<span style="color:#760f15;">&quot;{&quot;</span> + incident.subcategory + <span style="color:#760f15;">&quot;}&quot;</span>)) + <span style="color:#760f15;">&quot; &quot;</span> +<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(<span style="color:#760f15;">&quot;%60s&quot;</span> % (<span style="color:#760f15;">&quot;{&quot;</span> + incident.location &nbsp;&nbsp;&nbsp;+ <span style="color:#760f15;">&quot;}&quot;</span>)));<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;incident.save<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">rescue</span> <span style="color:#881350;">Exception</span> =&gt; exp<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">puts</span> exp<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
&nbsp;&nbsp;&nbsp;&nbsp;<br />
&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
<span style="color:#881350;">end</span></p>
<p><span style="color:#881350;">if</span> (<span style="color:#a1617a;">ARGV</span>.length &gt; <span style="color:#0000ff;">0</span>)<br />
&nbsp;&nbsp;<span style="color:#236e25;"># import each argument<br />
</span>&nbsp;&nbsp;<span style="color:#a1617a;">ARGV</span>.each <span style="color:#881350;">do</span> |arg|<br />
&nbsp;&nbsp;&nbsp;&nbsp;import_url(arg)<br />
&nbsp;&nbsp;<span style="color:#881350;">end</span><br />
<span style="color:#881350;">else</span><br />
&nbsp;&nbsp;yesterday = <span style="color:#881350;">Date</span>.today - <span style="color:#0000ff;">1</span>;<br />
&nbsp;&nbsp;urlToImport = <span style="color:#760f15;">&quot;http://capsnet.usc.edu/DPS/webpdf/&quot;</span>+<br />
&nbsp;&nbsp;&nbsp;&nbsp;(<span style="color:#760f15;">&quot;%02d&quot;</span> % yesterday.mon) + (<span style="color:#760f15;">&quot;%02d&quot;</span> % yesterday.mday) + yesterday.year.to_s[<span style="color:#0000ff;">2..3</span>] + <span style="color:#760f15;">&quot;.pdf&quot;</span><br />
&nbsp;&nbsp;import_url(urlToImport)<br />
<span style="color:#881350;">end</span></div>
<p>### Conclusion ###</p>
<p>This system works fairly well, with a few exceptions. While the PDFs are far more consistent than the emails, occasionally a PDF shows up that rpdf2txt can&#8217;t parse. So far I haven&#8217;t found a solution (perhaps a different PDF-to-text converter would help). Also, sometimes entries are missing an identifier, or it shows up in a different location. Some special rules are used to try to find it, but they&#8217;re not always successful.</p>
<p>Overall it was a success, as demonstrated by the 4000+ incidents currently in the TOOBS database.</p>
]]></content:encoded>
			<wfw:commentRss>http://tlrobinson.net/blog/2007/03/scraping-usc-dpss-incident-logs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

