https://doi.org/10.71352/ac.37.005
Address standardization
Abstract.
This paper is about matching non-standard, mistyped addresses against a reference address database. The input addresses are collected from several different sources of non-standardized, non-verified, error-prone human input. The objective of the process is to "clean" the input address and find it in a standardized address database, or to find the most probable corresponding addresses together with their credibilities. This process is known in the literature as "address cleansing".
We developed an algorithm that searches and matches the input address fields against the standard address database stepwise, using a rule-based system. The rule-based system uses tokens obtained from a specialized tokenization and generates intermediate-format addresses by identifying the missing address fields and their values. The rules are grouped into rule sets and applied according to a recognition order. The tokens are matched against the address reference database using a modified Levenshtein distance measure. Our uncertainty calculus uses the matching value of the modified Levenshtein distance measure and the rule strength to update the credibility of the intermediate-format address with each rule application. Our rule base is specialized for Hungarian addresses, using the standard reference database.
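To make the token-matching step concrete, the following sketch shows a classic Levenshtein edit distance normalized into a matching value in [0, 1], and a multiplicative credibility update combining that value with a rule strength. The paper's specific modifications to the distance measure and its exact uncertainty calculus are not detailed in the abstract, so `matching_value` and `update_credibility` are illustrative assumptions, not the authors' formulas.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance computed by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[len(b)]

def matching_value(token: str, reference: str) -> float:
    """Assumed normalization: similarity = 1 - distance / max length."""
    if not token and not reference:
        return 1.0
    return 1.0 - levenshtein(token, reference) / max(len(token), len(reference))

def update_credibility(credibility: float, rule_strength: float,
                       token: str, reference: str) -> float:
    """Hypothetical multiplicative update of an intermediate-format
    address credibility after one rule application."""
    return credibility * rule_strength * matching_value(token, reference)
```

For example, a mistyped street-type token such as "utva" against the reference "utca" yields a matching value of 0.75, which then scales the address credibility together with the strength of the applied rule.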
