Automatic Processing Text with Finite State Machine
One of the task in automatic processing text is to extract some
sequence of characters in already known format. For example extract email from web page,
social security from text, phone number, order number. Finite state machine is one of the way to do this.
And here we will consider how it can be used on the example with ISBN number. Perl script will be
provided also.
So we need to extract from some text ISBN number. For simplicity we assume that it
can be only ISBN-10 number which has format d-ddd-ddddd-d where d is for
digit and '-' is for separator. However separator can be not only '-' but some other characters.
In this example it can be space, '-' or can be nothing. The diagram of ISBN number as a finite state machine is shown below.
State 1 has all transitions showed on the diagram.
The first thing is to build transition table. The transition table indicates
for each current state and character what will be the next action and state.
The transition table for some states is shown below.
Curr state
Char type
Next state
Action
1
separator (s)
2
-
1
digit(d)
3
update ISBN number with this char
2
digit (d)
3
update ISBN number with this char
For example if in state 1 we get digit then we skip state 2 and move to state 3 (continue reading ISBN number).
But if we get non-digit separator then we move to state 2.
If we get not separator and not digit then we should move to state 0 aand start from the
beginning.
In addition to the transition table we need script that is reading our text character by character
and can code each character as digit (d), separator or garbage and
based on this code and transition table take appropriate action (update if needed ISBN number and move to next state).
The transition table is described in function get_action using if statements. It could be also described in another file
in the table format.
The perl script for reading ISBN number is provided here. If the different format is needed to be extracted only transition
table (if rules) should be changed.