Jon's Perl Projects Page

This page is an overview of and index to my Perl work, and is meant as an adjunct to my resume.

HTML Preprocessor 1 | HTML Preprocessor 2 | Spider & Index | Search Engine | Embedded Perl | Quiz Maker

HTML Preprocessor 1

On my poker pages, I wanted a link followed in one frame to move the 'you are here' arrow in the Table Of Contents in the other frame. This could easily be done with href="javascript:..." links, but only at the cost of making the pages inaccessible to the 18% of the (click-weighted) population that turns JavaScript off. I could easily handle both groups with HTML like
<script language="JavaScript"><!--
document.write('<a href="javascript:Load(\'stud.html#Five_Card_Stud\')">Five Card Stud</a>');
// --></script><noscript><a href="stud.html#Five_Card_Stud">Five Card Stud</a></noscript>
that uses JavaScript to write the links that rely on JavaScript and that isolates their static equivalents within <noscript> tags - but that posed a real maintenance nightmare. (Just look at how much messier the code above is than a simple <a href=> tag, and at how I've had to write the link target and link text twice.)

This seemed a perfect time to put down Programming Perl in the middle of chapter four, and actually write my first Perl script. The script scans its working directory for .psx files that are newer than the corresponding .html files and:

  1. Makes sure that all <a> names and hrefs are quoted: i.e., turns anything like <a name=foo href=foo.html> into <a name="foo" href="foo.html">.
  2. Converts any spaces in <img> or <a> names to _: i.e., turns anything like <a name="IE sucks"> into <a name="IE_sucks">.
  3. Converts the pseudotag <x> into the document.write / <noscript> pairs shown above.
  4. Escapes any ' in document.write()s: i.e., turns document.write('it's') into document.write('it\'s').
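
Steps 1, 2 and 4 above, plus the 'newer than' check, can be sketched roughly as follows. This is an illustration under stated assumptions, not the original script: the sub names (is_stale, quote_attrs, underscore_names, escape_writes, preprocess) are hypothetical, the regexes are deliberately naive, and step 3 is left out because the exact <x> syntax isn't given here.

```perl
use strict;
use warnings;

# True when the .psx source is newer than its .html output, or no output exists.
sub is_stale {
    my ($psx) = @_;
    (my $html = $psx) =~ s/\.psx\z/.html/;
    return 1 unless -e $html;
    my $src_mtime = (stat($psx))[9];
    my $out_mtime = (stat($html))[9];
    return $src_mtime > $out_mtime;
}

# Step 1: quote bare name= and href= values inside a single <a ...> tag.
sub quote_attrs {
    my ($tag) = @_;
    $tag =~ s/\b(name|href)=([^\s"'>]+)/$1="$2"/gi;
    return $tag;
}

# Step 2: turn spaces into underscores inside a quoted name="..." value.
sub underscore_names {
    my ($tag) = @_;
    $tag =~ s{\b(name=")([^"]*)"}{
        my ($pre, $val) = ($1, $2);
        $val =~ tr/ /_/;
        $pre . $val . '"';
    }gie;
    return $tag;
}

# Step 4: backslash-escape any unescaped ' inside document.write('...').
# Naive assumption: the string ends at the first unescaped ') sequence.
sub escape_writes {
    my ($text) = @_;
    $text =~ s{(document\.write\(')(.*?)((?<!\\)'\))}{
        my ($open, $body, $close) = ($1, $2, $3);
        $body =~ s/(?<!\\)'/\\'/g;
        $open . $body . $close;
    }ges;
    return $text;
}

# Apply the three rewrites to a whole page.
sub preprocess {
    my ($text) = @_;
    $text =~ s/(<a\b[^>]*>)/quote_attrs($1)/gie;
    $text =~ s/(<(?:a|img)\b[^>]*>)/underscore_names($1)/gie;
    return escape_writes($text);
}
```

A driver would presumably loop over glob('*.psx'), skip anything for which is_stale() is false, and write preprocess() of the file's contents to the matching .html file.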

HTML Preprocessor 2

My second Perl script was slightly more sophisticated - it even used subroutines. ;-)

It was meant to solve the problem my site had with boilerplate code. A consistent look meant lots of identical code all over the site. This meant that the information unique to a page (like the title and the links to on-page anchors) was easy to miss under all the code that didn't vary from page to page - and that changing the look meant changing all that code.

The answer was to replace all the boilerplate with macros that looked like :This() and that could have the unique parameters in the parentheses. Macros are defined in a package that exports a symbol table. Each entry consists of template text, with %1% and %2% substitution points, and a Perl function that takes an argument list, massages it in any way necessary, and then spits out a result list to be substituted into the macro template. This makes my source files much smaller and easier to read, and makes it easier to change standard elements. For example, I've added tags so that my search engine will only index text after the standard header and before the standard footer; once I added these to the macro templates, running the preprocessor with the -all switch put them in all my macroed pages.
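
A minimal sketch of this kind of macro expansion, assuming a table that pairs each template with an argument-massaging function. The macro names (Title, Link), the layout of %macros, and the comma-separated argument syntax are all hypothetical; only the :Name() form and the %1% / %2% substitution points come from the description above.

```perl
use strict;
use warnings;

# Hypothetical macro table: template text plus a function that massages
# the raw argument list into the list that gets substituted.
my %macros = (
    Title => {
        template => '<title>%1%</title><h1>%1%</h1>',
        args     => sub { return @_ },
    },
    Link => {
        template => '<a href="%1%">%2%</a>',
        # Default the link text to the URL when only one argument is given.
        args     => sub {
            my ($url, $text) = @_;
            return ($url, defined $text ? $text : $url);
        },
    },
);

# Expand one macro call; unknown macros are left untouched.
sub expand_one {
    my ($name, $raw) = @_;
    my $macro = $macros{$name} or return ":$name($raw)";
    my @args = $macro->{args}->(split /\s*,\s*/, $raw);
    (my $text = $macro->{template})
        =~ s/%(\d+)%/defined $args[$1 - 1] ? $args[$1 - 1] : ''/ge;
    return $text;
}

# Replace every :Name(args) in the source with its expansion.
sub expand_macros {
    my ($source) = @_;
    $source =~ s/:(\w+)\(([^)]*)\)/expand_one($1, $2)/ge;
    return $source;
}
```

Because every page goes through the same table, changing a template once changes every page the next time the preprocessor runs.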

Initially, I placed a shortcut to this script on my Win98 desktop. I'd edit a source file, click on the shortcut's icon on the 'desktop toolbar', and reload the page in a browser. Recently, though, I wrote a simple 'daemon' that waits on a change notification on my HTML source directory and then calls the preprocessor script via PerlEz. This means that the preprocessor is invoked every time I save a change to a source file, so that I just edit and reload, almost as if the preprocessor weren't even involved.

Spider & Index

I've wanted a search engine for my web site for a long time, but my ISP won't let me run any CGI on their server. For a while it looked like I'd be able to run Perl scripts on a friend's Solaris box with a DSL line, and I thought a search engine would make a great learning project. (The engine is 'done enough', but the friend just doesn't have the time to do anything with her toy, so the engine's not up, for now. Perhaps I'll look into changing ISPs.)

The engine consists of two different scripts: one to build a concordance and the other to use it. The indexer uses HTML::Parser to spider a local copy of my site and extract the plaintext. It lowercases all words and does some other canonicalizing (like splitting hyphenated words into a pair of words), then builds the concordance: a hash by word of a hash by filename of lists of word positions. This data structure makes it easy to do simple searches (pages containing this word) as well as sequence searches (pages containing this word followed by that word) and 'near' searches. It then prunes the most common words, and writes the concordance to disk as a Perl package that can be use-d by the search engine.
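
The concordance structure can be sketched as below. This is a simplified illustration, not the indexer itself: it starts from already-extracted plain text rather than using HTML::Parser, and the sub names, the count-based pruning threshold, and the Concordance package name are assumptions.

```perl
use strict;
use warnings;
use Data::Dumper;

# Add one file's plain text to the concordance:
#   word => { filename => [ word positions ] }
sub index_text {
    my ($concordance, $file, $text) = @_;
    my $plain = lc $text;
    $plain =~ tr/-/ /;    # split hyphenated words into a pair of words
    my $pos = 0;
    for my $word ($plain =~ /[a-z0-9']+/g) {
        push @{ $concordance->{$word}{$file} }, $pos++;
    }
}

# Drop words that occur more than $max_count times across the whole site.
sub prune_common {
    my ($concordance, $max_count) = @_;
    for my $word (keys %$concordance) {
        my $total = 0;
        $total += @$_ for values %{ $concordance->{$word} };
        delete $concordance->{$word} if $total > $max_count;
    }
}

# Render the concordance as Perl source that a search script could load.
sub as_package {
    my ($concordance) = @_;
    return "package Concordance;\nour "
         . Data::Dumper->Dump([$concordance], ['*index'])
         . "1;\n";
}
```

Positions are counted before pruning, so a pruned word leaves a gap in the position sequence rather than shifting the words after it.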

This script is the one that convinced me that Perl is fun: all the map and foreach operations started to feel like symbolic processing in Lisp.

Search Engine

Indexing was easy; searching turned out to be surprisingly complex. Doing simple lookups was pretty easy; doing embedded lookups ('bed' matches 'embedded') wasn't much harder; even sequence matches were pretty straightforward. What turned out to be difficult was scoring all the different possible combinations of full and partial matches!
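
The three kinds of lookup might look something like this against a word => { file => [positions] } concordance. It is a sketch only: the sub names are hypothetical, and it deliberately omits scoring, which is the part described above as difficult.

```perl
use strict;
use warnings;

# Simple lookup: files containing exactly this word.
sub find_word {
    my ($index, $word) = @_;
    my @files = sort keys %{ $index->{ lc($word) } || {} };
    return @files;
}

# Embedded lookup: files containing any indexed word that contains the fragment.
sub find_embedded {
    my ($index, $fragment) = @_;
    my $needle = lc $fragment;
    my %seen;
    for my $word (grep { index($_, $needle) >= 0 } keys %$index) {
        $seen{$_} = 1 for keys %{ $index->{$word} };
    }
    my @files = sort keys %seen;
    return @files;
}

# Sequence lookup: files where $first is immediately followed by $second.
sub find_sequence {
    my ($index, $first, $second) = @_;
    my $first_files  = $index->{ lc($first) }  || {};
    my $second_files = $index->{ lc($second) } || {};
    my @hits;
    for my $file (sort keys %$first_files) {
        my %wanted;
        $wanted{ $_ + 1 } = 1 for @{ $first_files->{$file} };
        push @hits, $file
            if grep { $wanted{$_} } @{ $second_files->{$file} || [] };
    }
    return @hits;
}
```

A 'near' search would follow the same shape as find_sequence, accepting any position within some window instead of exactly one word later.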

Equally non-obvious at the outset was the need to keep the original text, so the user doesn't just have the title to tell them what a page contains. Currently, I 'invert the concordance' to generate something like the original text. While the code for this is pretty simple, it's rather expensive (it takes ten to one hundred times as long to format 25 results as to generate 100 hits), and the result is more than a bit weird-looking, since the index ignores common words and is lower-cased. It's good enough for a first pass, though, and I'd put the engine up, if I had a host for it ...
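
Inverting the concordance for one file can be sketched as follows, assuming the word => { file => [positions] } layout. The sub name and the '...' placeholder for pruned words are assumptions. The sketch also suggests why this is expensive: every word in the index has to be visited to rebuild a single page.

```perl
use strict;
use warnings;

# Rebuild an approximation of a file's text from the concordance.
sub reconstruct {
    my ($index, $file) = @_;
    my @words;
    for my $word (keys %$index) {
        my $positions = $index->{$word}{$file} or next;
        $words[$_] = $word for @$positions;
    }
    # Positions of pruned (common) words were never stored, so mark the gaps.
    return join ' ', map { defined $_ ? $_ : '...' } @words;
}
```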

Embedded Perl

Lately, I've been playing with using ActiveState's PerlEz library from Delphi. I wrote 'flat' and object-oriented bindings, and started a ONElist mailing list about calling Perl from Delphi.

I've used these bindings to

  1. Do Perl regex searches in my Delphi job newsgroup scanning program.
  2. Automatically expand macros in the HTML source for my homeschooling site.
  3. Add a Delphi UI to a Perl Markov chain program. This started as an exercise - how many lines did I need to do a simple first-order Markov chain program (ten, not counting white space and comments) - and evolved into a slightly more complex exercise in generating higher-order Markov chains.
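
For readers who haven't met the idea, a first-order Markov chain over words can be sketched in about ten lines of Perl. This is an illustration, not the program mentioned above; working on words rather than characters, and the sub names, are assumptions.

```perl
use strict;
use warnings;

# Build a first-order table: word => [ words seen immediately after it ].
sub build_chain {
    my @words = split ' ', shift;
    my %next;
    push @{ $next{ $words[$_] } }, $words[ $_ + 1 ] for 0 .. $#words - 1;
    return \%next;
}

# Walk the table from $word, picking a random follower at each step.
sub generate {
    my ($chain, $word, $max) = @_;
    my @out = ($word);
    while (scalar(@out) < $max and my $followers = $chain->{$word}) {
        $word = $followers->[ rand @$followers ];
        push @out, $word;
    }
    return "@out";
}
```

A higher-order chain would key the table on the last two or more words instead of just one.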

Quiz Maker

QuizMaker reads a relatively free-form text file, paragraph by paragraph, and builds a self-contained HTML/JavaScript multiple-choice quiz of the type common in newspapers and magazines. Paragraphs that start with a ?, *, or + are questions, score ranges, or quiz titles, respectively; all other paragraphs are passed untouched, as raw HTML text. Each response may have a score value (the default is 0) and a Score button pops up a score box scrolled to the appropriate range.
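
The paragraph-by-paragraph reading might be sketched like this. It covers only the classification step. The sub name, the blank-line paragraph separator, and the record layout are assumptions, and how responses and their scores are marked is not described here, so that part is left out.

```perl
use strict;
use warnings;

# Split the input into paragraphs and tag each by its leading marker.
sub classify_paragraphs {
    my ($text) = @_;
    my %kind = ('?' => 'question', '*' => 'score_range', '+' => 'title');
    my @items;
    for my $para (grep { /\S/ } split /\n\s*\n/, $text) {
        if ($para =~ /\A\s*([?*+])\s*(.*?)\s*\z/s) {
            push @items, { type => $kind{$1}, text => $2 };
        }
        else {
            # Anything else is passed through as raw HTML.
            push @items, { type => 'html', text => $para };
        }
    }
    return @items;
}
```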

Copyright © 1999, Jon Shemitz