Text mining, also called text data mining, is the process of deriving high-quality information from written natural language. High-quality information refers to information that is new, relevant, and of interest for the project at hand. All of the data that we generate via e-mails, Word documents, PDF files, and text messages is written in natural language, but this data isn't typically stored in a structured format. Text mining is the process we use to draw insights and patterns from that unstructured data.
For example, a simple text mining task might begin with scanning a set of documents written in natural language. You would then either model the documents for predictive classification purposes, or populate a clean database with the extracted information.
What is the difference between text mining and text analytics?
Text mining is roughly synonymous with text analytics, and many people use the two terms interchangeably. Strictly speaking, however, text mining is a step that comes before text analytics in the larger workflow of a machine learning project.
Text mining is primarily a data-cleansing process. The overarching goal of text mining is to convert text data into a standard format, using natural language processing and analytical methods for information retrieval. You should end up with a clean, organized dataset, most likely in an Excel or CSV file.
Once your data has gone through text mining, it's ready for text analytics, which is the process of applying statistical and machine learning algorithms. The goal of text analytics is to detect patterns in the data and use them to predict or infer new insights.
What data preprocessing techniques are used in text mining?
A few of the most common preprocessing techniques used in text mining are tokenization, term frequency, stemming and lemmatization.
1.) Tokenization
Tokenization is the process of breaking text up into separate tokens, which can be individual words, phrases, or whole sentences. In some cases, punctuation and special characters (symbols like %, &, $) are discarded in the process.
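As a minimal sketch, a word-level tokenizer can be built with a regular expression that splits on any run of non-alphanumeric characters, discarding punctuation and special characters along the way (production tokenizers, such as those in NLTK or spaCy, handle many more edge cases):

```python
import re

def tokenize(text):
    # Lowercase the text, then split on any run of characters that is not
    # a letter or digit; punctuation and symbols like %, &, $ are discarded.
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text.lower()) if tok]

tokens = tokenize("Text mining costs 100% effort, & it's worth it!")
# → ['text', 'mining', 'costs', '100', 'effort', 'it', 's', 'worth', 'it']
```

Note that this simple approach also splits contractions like "it's" into two tokens, which may or may not be what you want for a given project.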
2.) Term Frequency
Term frequency tells you how often a term occurs in a document. Terms can be either individual words or phrases containing multiple words. Since documents differ in length, a term is likely to appear more often in longer documents than in shorter ones. Thus, you can normalize term frequency by dividing the number of times the term appears by the total number of terms in the document.
Term Frequency = [Number of times the term appears in the document] / [Total number of terms in the document]
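The formula above maps directly onto a few lines of Python; this sketch assumes the document has already been tokenized into a list of terms:

```python
from collections import Counter

def term_frequency(tokens):
    # TF(term) = (number of times the term appears in the document)
    #          / (total number of terms in the document)
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

doc = ["the", "cat", "sat", "on", "the", "mat"]
tf = term_frequency(doc)
# tf["the"] → 2 / 6 ≈ 0.333
```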
3.) Stemming
Stemming is the process of reducing words to their root form. For example, we would reduce the word robotics to the stem robot. The stem is usually a full word, but does not need to be. For example, the Porter stemmer, a widely used algorithm for removing common suffixes from English words, reduces the words universal, university, and universe to the stem univers.
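The idea can be sketched with a few suffix-stripping rules. This is a deliberately simplified illustration, not the full Porter algorithm (which applies several ordered rule phases with conditions on the stem); the suffix list here is an assumption chosen to reproduce the examples above:

```python
# Simplified suffix-stripping stemmer; NOT the full Porter algorithm.
# Suffixes are tried longest-first, and a suffix is only stripped when
# at least three characters of stem would remain.
SUFFIXES = ["ational", "ization", "ity", "ics", "al", "es", "s", "e"]

def simple_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("robotics"))    # → robot
print(simple_stem("universal"))   # → univers
print(simple_stem("university"))  # → univers
print(simple_stem("universe"))    # → univers
```

Note how universal, university, and universe all collapse to the same (non-word) stem univers, just as with the Porter stemmer.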
4.) Lemmatization
As we saw with the Porter stemmer example, the simple suffix rules commonly used for stemming can produce stems that are not valid words. Lemmatization is a more sophisticated approach to determining word stems, which addresses this potential problem. In lemmatization, we apply different normalization rules depending on a word's lexical category (part of speech). This way, the lemmatizer has more information about the word being normalized, and can use it to group similar words more accurately.
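A toy lemmatizer makes the part-of-speech dependence concrete. The lookup table and fallback rules below are illustrative assumptions, not a real lexicon; real lemmatizers (such as NLTK's WordNetLemmatizer) consult a full dictionary:

```python
# Toy lemmatizer: the normalization rule depends on the part of speech.
# The irregular-form table is an illustrative assumption, not a real lexicon.
IRREGULAR = {
    ("better", "ADJ"): "good",
    ("was", "VERB"): "be",
    ("meeting", "NOUN"): "meeting",  # as a noun, "meeting" is already the lemma
    ("meeting", "VERB"): "meet",     # as a verb, strip the -ing
}

def lemmatize(word, pos):
    word = word.lower()
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    # Fallback rules, chosen by part of speech
    if pos == "VERB" and word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if pos == "NOUN" and word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

print(lemmatize("meeting", "NOUN"))  # → meeting
print(lemmatize("meeting", "VERB"))  # → meet
```

The word "meeting" is a good example of why the part of speech matters: as a noun it is already in its base form, but as a verb it should reduce to "meet".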
What are the practical applications of text mining?
Perhaps the most common end use case of text mining is text categorization. Text mining would be the first step for building a model that can categorize text into specific domains, such as spam versus non-spam emails, or detecting explicit content. Document classification is another common type of text categorization, especially for sorting news articles into categories such as domestic, international, sports, and lifestyle.
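A text categorization model like the spam filter described above can be sketched with a multinomial Naive Bayes classifier over bags of words. The tiny training set below is an illustrative assumption; a real project would use a large labeled corpus and likely a library such as scikit-learn:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Count word and label frequencies from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(tokens, word_counts, label_counts, vocab):
    """Pick the label with the highest log-probability score."""
    total_docs = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihoods with add-one (Laplace) smoothing
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Illustrative, hand-made training data
training = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "agenda", "today"], "ham"),
    (["project", "status", "meeting"], "ham"),
]
model = train(training)
label = classify(["free", "money"], *model)
# → "spam"
```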
Other applications of text mining include document summarization, and entity extraction for identifying people, places, organizations, and other entities. You can also use text mining for sentiment analysis, to identify and extract subjective information from written natural language. Sentiment analysis is especially useful for businesses that want to know what their customers are saying on internet forums and social media.
Lionbridge AI is a provider of training data for machine learning. With 20 years of experience in the translation and localization industry, we specialize in natural language processing tasks. We provide data collection, data cleansing, and data annotation services for text, image, audio, and video files. No matter the current status of your project, Lionbridge AI can step in and help you build more efficient models.