Applied Math & Computer Science Lab
Data Analysis, Optimization & Mathematical Modeling, Artificial Intelligence, Neural Net For Everyday Life Applications
Links Online Free Courses Bookstore Forum Submit Link New Additions Archive
Search the Web:    

Automatic Processing Text with Finite State Machine





One of the task in automatic processing text is to extract some sequence of characters in already known format. For example extract email from web page, social security from text, phone number, order number. Finite state machine is one of the way to do this. And here we will consider how it can be used on the example with ISBN number. Perl script will be provided also.
So we need to extract from some text ISBN number. For simplicity we assume that it can be only ISBN-10 number which has format d-ddd-ddddd-d where d is for digit and '-' is for separator. However separator can be not only '-' but some other characters. In this example it can be space, '-' or can be nothing. The diagram of ISBN number as a finite state machine is shown below. State 1 has all transitions showed on the diagram.


The first thing is to build transition table. The transition table indicates for each current state and character what will be the next action and state. The transition table for some states is shown below.
Curr state Char type Next state Action
1 separator (s) 2 -
1 digit(d) 3 update ISBN number with this char
2 digit (d) 3 update ISBN number with this char

For example if in state 1 we get digit then we skip state 2 and move to state 3 (continue reading ISBN number). But if we get non-digit separator then we move to state 2. If we get not separator and not digit then we should move to state 0 aand start from the beginning.




In addition to the transition table we need script that is reading our text character by character and can code each character as digit (d), separator or garbage and based on this code and transition table take appropriate action (update if needed ISBN number and move to next state). The transition table is described in function get_action using if statements. It could be also described in another file in the table format.
The perl script for reading ISBN number is provided here. If the different format is needed to be extracted only transition table (if rules) should be changed.

References

Automatic Processing Text with Finite State Machine: Perl Source Code

eCance
Copyright ? 2009 by LZ.