All You Need to Know about Data Mining Pipelines

Data Mining Pipeline

Data mining is the process of collecting, cleaning, processing, and analyzing data to gain useful insights from it.
Today, almost all automated processes, such as financial modelling, IoT, and recommendations in retail marketing, use data. Most data is available only in unstructured form, which means insights cannot be drawn from it directly. That is why data mining is needed to extract useful information. Through data mining, data is converted into a structured format so that it can be used more efficiently.

Pipeline for the data mining process:
1. Data collection: Data collection is highly domain- and application-specific, but it plays a critical role in the data mining process. When the volume of data is large, databases are used to hold the collected data.
2. Feature extraction and data cleaning: Once collected, the data may be in any form, such as web-scraped data, free-form documents, or log files. Feature extraction pulls out the relevant features, while data cleaning removes or corrects missing values. The end result is data in a well-structured format that a computer program can process further. This whole step is known as data pre-processing.
3. Analytical processing: This is considered the final part of data mining, where analytical algorithms are designed using the processed data. The use case or problem statement determines how the data should be analyzed, for example how it should be clustered.
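
The three stages above can be sketched as three small functions chained together. This is only a minimal illustration, not a production design: the sample records, the field layout (name, age, amount), and the function names collect, clean_and_extract, and analyze are all assumptions made for the example, and the "analysis" is just an average.

```python
from statistics import mean

# Hypothetical raw records, standing in for scraped or logged data.
RAW_RECORDS = [
    "alice, 34, 120.50",
    "bob, , 80.00",               # missing age
    "carol, 29, not_a_number",    # malformed amount
    "dave, 41, 200.00",
]


def collect():
    """Stage 1: data collection (here just an in-memory list)."""
    return list(RAW_RECORDS)


def clean_and_extract(raw_lines):
    """Stage 2: turn free-form lines into structured rows, dropping bad ones."""
    rows = []
    for line in raw_lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # wrong number of fields
        name, age, amount = parts
        try:
            rows.append({"name": name, "age": int(age), "amount": float(amount)})
        except ValueError:
            # Missing or malformed value: this sketch simply drops the record.
            continue
    return rows


def analyze(rows):
    """Stage 3: a trivial analytical step, the average amount."""
    return mean(row["amount"] for row in rows)


def run_pipeline():
    """Chain the three stages in order."""
    return analyze(clean_and_extract(collect()))
```

Dropping incomplete records is only one cleaning policy; depending on the application, missing values might instead be corrected or filled in, as the pre-processing step describes.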

Case Study
Consider a case from the retail industry in which a company wants to recommend products to its customers according to their preferences. The company has each customer’s log data (that is, which web pages the customer visited most) and profile information. Based on buying behavior and demographic data, the company wants to recommend specific products to each customer. How should a solution to this problem be designed?

Solution architecture
Data collection: For this step, the analyst has to collect two types of data: first, log data from the company’s website; second, users’ profile information from the company’s database.
Data cleaning and feature extraction: The log information contains multiple data types, such as numeric values (IP addresses), text, dates and times, and product information. The analyst has to sort through all of the available information and extract the relevant parts in structured form. During this processing, the analyst records the data as attributes for each customer and integrates them with the customer’s demographic information.
Analytical processing: After pre-processing, the analyst has to decide how to use the cleaned data for the recommendation engine, for example whether customers should be clustered by preference, by demographic information, or by any other pattern observed in the data.
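
A rough sketch of this architecture in Python might look like the following. Everything concrete here is assumed for illustration: the log line format (timestamp, IP address, customer id, path), the /products/<category>/<id> URL layout, the profile fields, and the age bands are hypothetical. The grouping step is a simple rule-based segmentation, swapped in for a real clustering algorithm to keep the example short.

```python
from collections import Counter, defaultdict

# Hypothetical log lines: timestamp, IP address, customer id, requested path.
LOG_LINES = [
    "2024-01-05T10:15:00 192.0.2.10 u1 /products/shoes/123",
    "2024-01-05T10:16:30 192.0.2.10 u1 /products/shoes/456",
    "2024-01-05T10:20:00 192.0.2.11 u2 /products/books/789",
    "2024-01-05T10:21:00 192.0.2.10 u1 /products/books/789",
    "2024-01-05T10:25:00 192.0.2.12 u3 /products/shoes/123",
    "corrupted line",
]

# Hypothetical profile table, standing in for the company database.
PROFILES = {
    "u1": {"age": 27},
    "u2": {"age": 45},
    "u3": {"age": 24},
}


def extract_visits(lines):
    """Count visits per customer and product category from raw log lines."""
    visits = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed lines
        _timestamp, _ip, customer_id, path = fields
        parts = path.strip("/").split("/")
        if len(parts) == 3 and parts[0] == "products":
            visits[customer_id][parts[1]] += 1
    return visits


def build_features(visits, profiles):
    """Join browsing attributes with demographic data, one row per customer."""
    rows = {}
    for customer_id, counts in visits.items():
        profile = profiles.get(customer_id)
        if profile is None:
            continue  # no demographic data for this customer
        rows[customer_id] = {
            "age": profile["age"],
            "top_category": counts.most_common(1)[0][0],
        }
    return rows


def segment(rows):
    """Group customers by age band and most-visited category."""
    groups = defaultdict(list)
    for customer_id, row in rows.items():
        age_band = "under_30" if row["age"] < 30 else "30_plus"
        groups[(age_band, row["top_category"])].append(customer_id)
    return {key: sorted(ids) for key, ids in groups.items()}
```

Calling segment(build_features(extract_visits(LOG_LINES), PROFILES)) yields groups of customers who share an age band and a favorite category; a recommendation engine could then suggest products that are popular within each group.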

This is a guest post by Arpita Gupta. Arpita Gupta works as a Data Scientist at Accenture. She has research and development experience in Deep Learning, Machine Learning, and Data Mining techniques. Arpita likes to share her knowledge of Machine Learning through her website, Let the Data Confess. She completed her postgraduate studies at BITS Pilani with an M.E. degree in Embedded Systems.