|
Table of Content | Chapter Sixteen (Part 7) |
| CHAPTER SIXTEEN: PATTERN MATCHING (Part 6) |
| 16.4 -
Designing Your Own Pattern Matching Routines 16.5 - Extracting Substrings from Matched Patterns |
| 16.4 Designing Your Own Pattern Matching Routines |
Although the UCR Standard Library provides a wide variety of matching functions, there is no way to anticipate the needs of all applications. Therefore, you will probably discover that the library does not support some particular pattern matching function you need. Fortunately, it is very easy for you to create your own pattern matching functions to augment those available in the UCR Standard Library. When you specify a matching function name in the pattern data structure, the match routine calls the specified address using a far call and passing the following parameters:
es:di- Points at the next character in the
input string. You should not look at any characters before this address. Furthermore, you
should never look beyond the end of the string (see cx below).
ds:si- Contains the four byte parameter found
in the matchparm field.
cx- Contains the last position, plus one, in
the input string you're allowed to look at. Note that your pattern matching routine should
not look beyond location es:cx or the zero terminating byte; whichever comes
first in the input string.
On return from the function, ax must contain
the offset into the string (di's value) of the last character matched plus
one, if your matching function is successful. It must also set the carry flag to denote
success. After your pattern matches, the match routine might call another matching
function (the one specified by the next pattern field) and that function begins matching
at location es:ax.
If the pattern match fails, then you must return the
original di value in the ax register and return with the carry
flag clear. Note that your matching function must preserve all other registers.
There is one very important detail you must never forget
with writing your own pattern matching routines - ds does not point at your
data segment, it contains the H.O. word of the matchparm parameter.
Therefore, if you are going to access global variables in your data segment you will need
to push ds, load it with the address of dseg, and pop ds
before leaving. Several examples throughout this chapter demonstrate how to do this.
There are some obvious omissions from (the current version
of) the UCR Standard Library's repertoire. For example, there should probably be matchtoistr,
matchichar, and matchtoichar pattern functions. The following
example code demonstrates how to add a matchtoistr (match up to a string,
doing a case insensitive comparison) routine.
.xlist
include stdlib.a
includelib stdlib.lib
matchfuncs
.list
dseg segment para public 'data'
TestString byte "This is the string 'xyz' in it",cr,lf,0
TestPat pattern {matchtoistr,xyz}
xyz byte "XYZ",0
dseg ends
cseg segment para public 'code'
assume cs:cseg, ds:dseg
; MatchToiStr- Matches all characters in a string up to, and including, the
; specified parameter string. The parameter string must be
; all upper case characters. This guy matches string using
; a case insensitive comparison.
;
; inputs:
; es:di- Source string
; ds:si- String to match
; cx- Maximum match position
;
; outputs:
; ax- Points at first character beyond the end of the
; matched string if success, contains the initial DI
; value if failure occurs.
; carry- 0 if failure, 1 if success.
MatchToiStr proc far
pushf
push di
push si
cld
; Check to see if we're already past the point were we're allowed
; to scan in the input string.
cmp di, cx
jae MTiSFailure
; If the pattern string is the empty string, always match.
cmp byte ptr ds:[si], 0
je MTSsuccess
; The following loop scans through the input string looking for
; the first character in the pattern string.
ScanLoop: push si
lodsb ;Get first char of string
dec di
FindFirst: inc di ;Move on to next (or 1st) char.
cmp di, cx ;If at cx, then we've got to
jae CantFind1st ; fail.
mov ah, es:[di] ;Get input character.
cmp ah, 'a' ;Convert input character to
jb DoCmp ; upper case if it's a lower
cmp ah, 'z' ; case character.
ja DoCmp
and ah, 5fh
DoCmp: cmp al, ah ;Compare input character against
jne FindFirst ; pattern string.
; At this point, we've located the first character in the input string
; that matches the first character of the pattern string. See if the
; strings are equal.
push di ;Save restart point.
CmpLoop: cmp di, cx ;See if we've gone beyond the
jae StrNotThere ; last position allowable.
lodsb ;Get next input character.
cmp al, 0 ;At the end of the parameter
je MTSsuccess2 ; string? If so, succeed.
inc di
mov ah, es:[di] ;Get the next input character.
cmp ah, 'a' ;Convert input character to
jb DoCmp2 ; upper case if it's a lower
cmp ah, 'z' ; case character.
ja DoCmp2
and ah, 5fh
DoCmp2: cmp al, ah ;Compare input character against
je CmpLoop
pop di
pop si
jmp ScanLoop
StrNotThere: add sp, 2 ;Remove di from stack.
CantFind1st: add sp, 2 ;Remove si from stack.
MTiSFailure: pop si
pop di
mov ax, di ;Return failure position in AX.
popf
clc ;Return failure.
ret
MTSSuccess2: add sp, 2 ;Remove DI value from stack.
MTSSuccess: add sp, 2 ;Remove SI value from stack.
mov ax, di ;Return next position in AX.
pop si
pop di
popf
stc ;Return success.
ret
MatchToiStr endp
Main proc
mov ax, dseg
mov ds, ax
mov es, ax
meminit
lesi TestString
ldxi TestPat
xor cx, cx
match
jnc NoMatch
print
byte "Matched",cr,lf,0
jmp Quit
NoMatch: print
byte "Did not match",cr,lf,0
Quit: ExitPgm
Main endp
cseg ends
sseg segment para stack 'stack'
stk db 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes db 16 dup (?)
zzzzzzseg ends
end Main
Often, simply determining that a string matches a given pattern is insufficient. You may want to perform various operations that depend upon the actual information in that string. However, the pattern matching facilities described thus far do not provide a mechanism for testing individual components of the input string. In this section, you will see how to extract portions of a pattern for further processing.
Perhaps an example may help clarify the need to extract portions of a string. Suppose you are writing a stock buy/sell program and you want it to process commands described by the following regular expression:
(buy | sell) [0-9]+ shares of (ibm | apple | hp | dec)
While it is easy to devise a Standard Library pattern that
recognizes strings of this form, calling the match routine would only tell
you that you have a legal buy or sell command. It does not tell you if you are to buy or
sell, who to buy or sell, or how many shares to buy or sell. Of course, you could take the
cross product of (buy | sell) with (ibm | apple | hp | dec) and generate eight different
regular expressions that uniquely determine whether you're buying or selling and whose
stock you're trading, but you can't process the integer values this way (unless you
willing to have millions of regular expressions). A better solution would be to extract
substrings from the legal pattern and process these substrings after you verify that you
have a legal buy or sell command. For example, you could extract buy or sell into one
string, the digits into another, and the company name into a third. After verifying the
syntax of the command, you could process the individual strings you've extracted. The UCR
Standard Library patgrab routine provides this capability for you.
You normally call patgrab after calling match
and verifying that it matches the input string. Patgrab expects a single
parameter - a pointer to a pattern recently processed by match. Patgrab
creates a string on the heap consisting of the characters matched by the given pattern and
returns a pointer to this string in es:di. Note that patgrab
only returns a string associated with a single pattern data structure, not a chain of
pattern data structures. Consider the following pattern:
PatToGrab pattern {matchstr, str1, 0, Pat2}
Pat2 pattern {matchstr, str2}
str1 byte "Hello",0
str2 byte " there",0
Calling match on PatToGrab will
match the string "Hello there". However, if after calling match you
call patgrab and pass it the address of PatToGrab, patgrab
will return a pointer to the string "Hello".
Of course, you might want to collect a string that is the
concatenation of several strings matched within your pattern (i.e., a portion of the
pattern list). This is where calling the sl_match2 pattern matching function
comes in handy. Consider the following pattern:
Numbers pattern {sl_match2, FirstNumber}
FirstNumber pattern {anycset, digits, 0, OtherDigs}
OtherDigs pattern {spancset, digits}
This pattern matches the same strings as
Numbers pattern {anycset, digits, 0, OtherDigs}
OtherDigs pattern {spancset, digits}
So why bother with the extra pattern that calls sl_match2?
Well, as it turns out the sl_match2 matching function lets you create
parenthetical patterns. A parenthetical pattern is a pattern list that the pattern
matching routines (especially patgrab) treat as a single pattern. Although
the match routine will match the same strings regardless of which version of Numbers
you use, patgrab will produce two entirely different strings depending upon
your choice of the above patterns. If you use the latter version, patgrab
will only return the first digit of the number. If you use the former version (with the
call to sl_match2), then patgrab returns the entire string
matched by sl_match2, and that turns out to be the entire string of digits.
The following sample program demonstrates how to use parenthetical patterns to extract the pertinent information from the stock command presented earlier. It uses parenthetical patterns for the buy/sell command, the number of shares, and the company name.
.xlist
include stdlib.a
includelib stdlib.lib
matchfuncs
.list
dseg segment para public 'data'
; Variables used to hold the number of shares bought/sold, a pointer to
; a string containing the buy/sell command, and a pointer to a string
; containing the company name.
Count word 0
CmdPtr dword ?
CompPtr dword ?
; Some test strings to try out:
Cmd1 byte "Buy 25 shares of apple stock",0
Cmd2 byte "Sell 50 shares of hp stock",0
Cmd3 byte "Buy 123 shares of dec stock",0
Cmd4 byte "Sell 15 shares of ibm stock",0
BadCmd0 byte "This is not a buy/sell command",0
; Patterns for the stock buy/sell command:
;
; StkCmd matches buy or sell and creates a parenthetical pattern
; that contains the string "buy" or "sell".
StkCmd pattern {sl_match2, buyPat, 0, skipspcs1}
buyPat pattern {matchistr,buystr,sellpat}
buystr byte "BUY",0
sellpat pattern {matchistr,sellstr}
sellstr byte "SELL",0
; Skip zero or more white space characters after the buy command.
skipspcs1 pattern {spancset, whitespace, 0, CountPat}
; CountPat is a parenthetical pattern that matches one or more
; digits.
CountPat pattern {sl_match2, Numbers, 0, skipspcs2}
Numbers pattern {anycset, digits, 0, RestOfNum}
RestOfNum pattern {spancset, digits}
; The following patterns match " shares of " allowing any amount
; of white space between the words.
skipspcs2 pattern {spancset, whitespace, 0, sharesPat}
sharesPat pattern {matchistr, sharesStr, 0, skipspcs3}
sharesStr byte "SHARES",0
skipspcs3 pattern {spancset, whitespace, 0, ofPat}
ofPat pattern {matchistr, ofStr, 0, skipspcs4}
ofStr byte "OF",0
skipspcs4 pattern {spancset, whitespace, 0, CompanyPat}
; The following parenthetical pattern matches a company name.
; The patgrab-available string will contain the corporate name.
CompanyPat pattern {sl_match2, ibmpat}
ibmpat pattern {matchistr, ibm, applePat}
ibm byte "IBM",0
applePat pattern {matchistr, apple, hpPat}
apple byte "APPLE",0
hpPat pattern {matchistr, hp, decPat}
hp byte "HP",0
decPat pattern {matchistr, decstr}
decstr byte "DEC",0
include stdsets.a
dseg ends
cseg segment para public 'code'
assume cs:cseg, ds:dseg
; DoBuySell- This routine processes a stock buy/sell command.
; After matching the command, it grabs the components
; of the command and outputs them as appropriate.
; This routine demonstrates how to use patgrab to
; extract substrings from a pattern string.
;
; On entry, es:di must point at the buy/sell command
; you want to process.
DoBuySell proc near
ldxi StkCmd
xor cx, cx
match
jnc NoMatch
lesi StkCmd
patgrab
mov word ptr CmdPtr, di
mov word ptr CmdPtr+2, es
lesi CountPat
patgrab
atoi ;Convert digits to integer
mov Count, ax
free ;Return storage to heap.
lesi CompanyPat
patgrab
mov word ptr CompPtr, di
mov word ptr CompPtr+2, es
printf
byte "Stock command: %^s\n"
byte "Number of shares: %d\n"
byte "Company to trade: %^s\n\n",0
dword CmdPtr, Count, CompPtr
les di, CmdPtr
free
les di, CompPtr
free
ret
NoMatch: print
byte "Illegal buy/sell command",cr,lf,0
ret
DoBuySell endp
Main proc
mov ax, dseg
mov ds, ax
mov es, ax
meminit
lesi Cmd1
call DoBuySell
lesi Cmd2
call DoBuySell
lesi Cmd3
call DoBuySell
lesi Cmd4
call DoBuySell
lesi BadCmd0
call DoBuySell
Quit: ExitPgm
Main endp
cseg ends
sseg segment para stack 'stack'
stk db 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes db 16 dup (?)
zzzzzzseg ends
end Main
Sample program output:
Stock command: Buy
Number of shares: 25
Company to trade: apple
Stock command: Sell
Number of shares: 50
Company to trade: hp
Stock command: Buy
Number of shares: 123
Company to trade: dec
Stock command: Sell
Number of shares: 15
Company to trade: ibm
Illegal buy/sell command
|
Table of Content | Chapter Sixteen (Part 7) |
Chapter Sixteen: Pattern Matching
(Part 6)
29 SEP 1996