https://doi.org/10.71352/ac.37.005
Address standardization
Abstract.
This paper is about matching non-standard, mistyped addresses against a reference address database. The input addresses are collected from several different sources of non-standardized, non-verified, error-prone human input. The objective of the process is to "clean" the input address and find it in a standardized address database, or to find the most probable corresponding addresses together with their credibilities. This process is known in the literature as "address cleansing".
We developed an algorithm that searches and matches the input address fields against the standard address database stepwise, using a rule-based system. The rule-based system uses tokens obtained from a specialized tokenization and generates intermediate-format addresses by identifying the missing address fields and their values. The rules are grouped into rule sets and applied according to a recognition order. The tokens are matched against the address reference database using a modified Levenshtein distance measure. Our uncertainty calculus uses the matching value of the modified Levenshtein distance measure and the rule strength to update the credibility of the intermediate-format address with each rule application. Our rule base is specialized for Hungarian addresses, using the standard reference database.
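To make the token-matching step concrete, the following sketch shows a classic Levenshtein edit distance normalized into a matching value in [0, 1], and a multiplicative credibility update combining that value with a rule strength. The paper's specific modifications to the distance measure and its exact uncertainty calculus are not detailed in the abstract, so `matching_value` and `update_credibility` are illustrative assumptions, not the authors' formulas.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance computed by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[len(b)]

def matching_value(token: str, reference: str) -> float:
    """Assumed normalization: similarity = 1 - distance / max length."""
    if not token and not reference:
        return 1.0
    return 1.0 - levenshtein(token, reference) / max(len(token), len(reference))

def update_credibility(credibility: float, rule_strength: float,
                       token: str, reference: str) -> float:
    """Hypothetical multiplicative update of an intermediate-format
    address credibility after one rule application."""
    return credibility * rule_strength * matching_value(token, reference)
```

For example, a mistyped street-type token such as "utva" against the reference "utca" yields a matching value of 0.75, which then scales the address credibility together with the strength of the applied rule.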
