Patti: Compiling Unification-Based Finite-State Automata into Machine Instructions for a Superscalar Pipelined RISC Processor

by Sascha Brawer

Diplomarbeit (Master's Thesis), January 1998
Available as PDF document

See also slides of a talk on the Patti compiler. That presentation was given in September 1999 at the IBM T.J. Watson Research Center (Yorktown Heights, NY), at Lucent Bell Labs (Murray Hill, NJ), and at Xerox PARC (Palo Alto, CA).

Abstract

This thesis is about a method for speeding up natural-language analysis using a novel compilation technique. As its input, the compiler takes a unification-based linguistic formalism (non-deterministic finite-state automata, where transitions are labeled by attribute-value matrices according to a finite type logic with a simple-inheritance type hierarchy). As its output, the compiler generates machine instructions for the PowerPC chip, a pipelined RISC processor with superscalar instruction dispatch.

Because of its fine-grained knowledge about the task, the compiler is able to perform optimizations that would be very difficult to achieve using traditional techniques. Examples include heuristics for Static Branch Prediction, data cache control and scheduling the machine instructions to benefit from superscalarity, so that certain unifications are executed in parallel.

The system is evaluated by measuring the time it takes to extract noun groups in texts of some thousand words in length. On Apple PowerMacintosh machines, this task could be accomplished in fractions of a millisecond, theoretically corresponding to a speed of up to 21 million tokens per second. Hence, the generated code is so efficient that unification and pattern matching become neglectible factors in the overall performance of a natural-language system.

Due to the achieved speed, the presented techniques could form the foundation technology of new, real-time NLP applications.