Data Preprocessing - Text Normalization for Tabular Data
Text normalization is a technique widely used by text editing tools. Social media posts, academic papers, and other unedited documents may contain spelling and grammar errors. Editing tools either pop up suggested corrections as you type, predicting what you intend to write, or give correction feedback after the whole document is finished by highlighting the words that need fixing.
Text normalization is also a handy preprocessing step when applied to short texts in tabular data. Most businesses have large amounts of tabular data stored in databases, and even extracting meaningful information from raw tables is a challenging task. Databases store all of their information in a small number of data formats: most database systems provide a few main datatypes, such as character, number, date, and binary [1]. These types can be classified into two main classes: numeric and categorical.
Categorical variables are easy to handle if they take values from a limited set. For example, user gender, customer income group, and product color can each be mapped to a limited-sized set. In some cases, however, categorical variables cannot be restricted to a fixed set of values, especially when the variable consists of raw text. Firm names, product titles, and customer addresses fall into this type.
In some cases, categorical information is created by a data entry specialist. For example, official records such as handwritten paper forms are copied into databases by hand by civil servants. In online forms, information is collected directly from the user, and whatever the user writes in the textbox is copied into the database. Since users do not care how addresses are stored in the database, they may fill the address box with missing or inconsistent information. For that reason, to apply a location-based analysis, the first thing to do is to normalize the address field and map it into a limited-sized set.
For a large amount of data, it is essential to find an efficient method for text normalization. The main idea is to find the source text most similar to a given target text, so text similarity is the key factor when determining the normalized form. The most popular similarity measure is edit distance, which can be computed with the Levenshtein algorithm [2]. The Levenshtein algorithm uses dynamic programming and is efficient for comparing two words, but computing the distance for every pair in a long list of texts is not time-efficient. A widely used text representation technique from NLP helps here: character-based tf-idf vectors can be used to find the most similar texts. The code below shows how such representations can be extracted using character n-grams.
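Here is a minimal sketch of this approach using scikit-learn's TfidfVectorizer. The sample address strings, the (2, 3) n-gram range, and the use of cosine similarity for matching are illustrative assumptions, not fixed choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical raw address strings as they might appear in a database.
raw_texts = [
    "123 Main St, Springfeld",
    "45 Oak Ave Shelbyville",
]

# Canonical forms we want to map each raw string onto.
canonical_texts = [
    "123 Main Street, Springfield",
    "45 Oak Avenue, Shelbyville",
]

# Character n-gram tf-idf vectors; 'char_wb' builds n-grams only
# inside word boundaries, which works well for short noisy strings.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(canonical_texts + raw_texts)

raw_vecs = vectorizer.transform(raw_texts)
canonical_vecs = vectorizer.transform(canonical_texts)

# Match each raw string to the canonical string with the highest
# cosine similarity between their character n-gram vectors.
similarities = cosine_similarity(raw_vecs, canonical_vecs)
best = similarities.argmax(axis=1)

for text, idx in zip(raw_texts, best):
    print(f"{text!r} -> {canonical_texts[idx]!r}")
```

With only a handful of strings, this brute-force similarity matrix is fine; at scale, the same tf-idf vectors can be fed into an approximate nearest-neighbour search instead of comparing every pair.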