Databases are rich sources of information that yield discrete patterns depending on the mining technique used. Knowledge Discovery in Databases (KDD) is a method for discovering knowledge in databases from patterns that are not obvious at first glance. This knowledge is uncovered through KDD's seven-step procedure.
Why do we need the KDD process?
KDD is used primarily by researchers in domains such as artificial intelligence, pattern recognition, machine learning, knowledge acquisition for expert systems, data visualization, databases, and statistics. A spin-off is knowledge that supports business decision-making as well.
Knowledge discovery in databases is essential for the following key activities:
- Automated summarizing
- Identifying the patterns for models
- Extracting the essence from the facts and figures presented
The seven steps of the KDD process help identify the right model for extracting knowledge from data…
Simple 7 Steps to KDD
KDD establishes a procedure for recognizing valid, useful, and understandable patterns within huge and complex data sets. The seven steps are cleansing, integration, selection, transformation, mining, measuring, and visualization.
- Data cleansing
Data cleansing is the process of detecting and then correcting or removing corrupted, incorrect, incomplete, or duplicate data from a dataset, table, or record set. Coarse or dirty data is either replaced or deleted. It involves cleaning up data that has been compiled in one place.
Data is typically cleaned in the following cases:
- When values are missing from the chosen data set
- When data is noisy, meaning it contains random error or variance
- When discrepancies are found, which can be handled with data discrepancy detection and data transformation tools
Inconsistent data is removed through the following processes:
- Normalization
- Validation
- De-duplication
- Standardization
- Anomaly detection
- Transformation
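A few of the processes above can be sketched in a handful of lines. The following is a minimal illustration, not a definitive implementation: the record layout (`name`, `age`), the helper name `clean_records`, and the choice of median imputation for missing values are all assumptions made for the example.

```python
from statistics import median

def clean_records(records):
    """Standardize names, drop duplicates, and fill missing ages (hypothetical helper)."""
    seen = set()
    deduped = []
    for rec in records:
        # Standardization: trim whitespace and normalize capitalization.
        name = rec["name"].strip().title()
        # De-duplication: skip records whose standardized name was already seen.
        if name in seen:
            continue
        seen.add(name)
        deduped.append({"name": name, "age": rec.get("age")})

    # Missing values: impute with the median of the known ages (an assumption;
    # other strategies such as deletion or a default value are equally valid).
    known = [r["age"] for r in deduped if r["age"] is not None]
    fill = median(known) if known else None
    for r in deduped:
        if r["age"] is None:
            r["age"] = fill
    return deduped

raw = [
    {"name": " alice ", "age": 30},
    {"name": "Alice", "age": 30},    # duplicate after standardization
    {"name": "bob", "age": None},    # missing value
    {"name": "Carol", "age": 40},
]
cleaned = clean_records(raw)
```

Here the duplicate "Alice" record is dropped and the missing age is replaced by the median of the remaining known ages.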
- Data integration
Data integration combines heterogeneous data from multiple diverse sources into a target system with a defined architecture. Extracts from primary and secondary sources are merged in the process.
Data integration involves:
- Data migration tools
- Data synchronization tools
- ETL tools for the extraction, transformation, and loading process
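As a rough sketch of the extract, transform, and load idea, the snippet below merges two in-memory sources on a shared key. The two sources (`customers`, `orders`), their field names, and the function `integrate` are hypothetical; real ETL tools work against databases and files rather than Python lists.

```python
def integrate(customers, orders):
    """Join per-customer order totals onto customer records (illustrative only)."""
    totals = {}
    for order in orders:
        # Extract: read each order from the secondary source.
        cid = order["customer_id"]
        # Transform: amounts arrive as strings, so convert before summing.
        totals[cid] = totals.get(cid, 0.0) + float(order["amount"])
    # Load: build one unified record per customer for the target system.
    return [
        {"id": c["id"], "name": c["name"], "total_spent": totals.get(c["id"], 0.0)}
        for c in customers
    ]

customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [
    {"customer_id": 1, "amount": "19.50"},
    {"customer_id": 1, "amount": "5.50"},
]
unified = integrate(customers, orders)
```

Customers with no orders still appear in the result with a total of zero, so no record from the primary source is lost during integration.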
- Data selection
Here, the data is refined further for relevance: only the data relevant to the analysis is selected and retrieved from the data collection. Techniques used for data selection include:
- Neural networks
- Decision trees
- Naïve Bayes
- Clustering
- Regression
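The techniques listed above are model-based. As a much simpler illustration of what selection does, the sketch below keeps only the rows and attributes relevant to an analysis using a hand-written rule. The field names and the helper `select_relevant` are assumptions for the example, and no learning algorithm is involved.

```python
def select_relevant(records, columns, predicate):
    """Keep rows that satisfy `predicate`, reduced to the given `columns`."""
    return [
        {col: rec[col] for col in columns}
        for rec in records
        if predicate(rec)
    ]

records = [
    {"id": 1, "region": "EU", "age": 34, "spend": 120.0, "notes": "vip"},
    {"id": 2, "region": "US", "age": 51, "spend": 80.0, "notes": ""},
    {"id": 3, "region": "EU", "age": 27, "spend": 45.0, "notes": ""},
]

# Hypothetical analysis: spending by age for EU customers only.
selected = select_relevant(records, ["age", "spend"], lambda r: r["region"] == "EU")
```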
- Data transformation
Data transformation is the process of converting raw data into the form required by the mining procedure.
This is completed in two steps:
(1) Data mapping: assigning elements from the source base to the destination in order to capture the transformations
(2) Code generation: creating the actual transformation program
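The two steps can be illustrated with a small sketch. The mapping table `FIELD_MAP`, the source field names, and the use of min-max scaling as the transformation are assumptions chosen for the example rather than anything prescribed by KDD.

```python
# Step 1, data mapping: source field name -> destination field name (hypothetical).
FIELD_MAP = {"cust_nm": "name", "amt": "amount"}

def apply_mapping(record, field_map):
    """Rename the mapped source fields to their destination names."""
    return {dest: record[src] for src, dest in field_map.items()}

# Step 2, the transformation program: here, min-max scaling to the range [0, 1].
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

source = [
    {"cust_nm": "Alice", "amt": 10},
    {"cust_nm": "Bob", "amt": 20},
    {"cust_nm": "Carol", "amt": 30},
]
mapped = [apply_mapping(rec, FIELD_MAP) for rec in source]
scaled = min_max_scale([rec["amount"] for rec in mapped])
```

Keeping the mapping as data and the transformation as code means the mapping can be reviewed or changed without touching the program itself.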
- Data mining
In this step, techniques are applied to extract patterns that are potentially useful to the business.
- It transforms the task-relevant data into patterns
- It decides the purpose of the model, such as classification or characterization
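As one concrete example of turning task-relevant data into patterns, the sketch below counts item pairs that occur together in transactions and keeps those above a minimum support. Frequent-pair counting is just one simple mining technique picked for illustration; the sample transactions and the function name `frequent_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs found in at least min_support of transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair has one canonical ordering; set() ignores repeats.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
patterns = frequent_pairs(transactions, min_support=0.5)
```

With this sample data each of the three possible pairs appears in two of the four transactions, so all of them meet the 0.5 support threshold.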
- Measuring or pattern evaluation
The patterns obtained are evaluated by:
- Categorizing each pattern
- Summarizing and visualizing the patterns so that their purpose is highlighted
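One common way to score a pattern is with support, confidence, and lift for a rule of the form "if A then B". The sketch below computes these for a single rule. The choice of measures, the sample data, and the function name `rule_metrics` are assumptions for illustration; the evaluation step itself does not mandate any particular measure.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    sets = [set(t) for t in transactions]
    a = sum(1 for s in sets if antecedent in s)
    c = sum(1 for s in sets if consequent in s)
    both = sum(1 for s in sets if antecedent in s and consequent in s)

    support = both / n                       # how often A and B occur together
    confidence = both / a if a else 0.0      # how often B appears when A does
    # Lift above 1 suggests A and B co-occur more often than chance alone.
    lift = confidence / (c / n) if c else 0.0
    return support, confidence, lift

transactions = [
    ["bread", "milk"],
    ["bread", "milk"],
    ["milk"],
    ["eggs"],
]
support, confidence, lift = rule_metrics(transactions, "bread", "milk")
```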
- Knowledge representation or Visualization
Knowledge representation is the use of visual tools to present data mining results. An interestingness score is needed for each pattern.
Summarization and visualization are also needed to make the data understandable to the user. The results are then used for the following:
- Generating reports and tables
- Generating discrimination and classification rules
- Generating characterization rules
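Report and table generation can be as simple as formatting the scored patterns for a reader. The following sketch renders a plain-text table with a small bar for each pattern. The pattern names, scores, and the function `format_report` are made up for the example, and a real project would more likely use a charting or reporting tool.

```python
def format_report(patterns, width=20):
    """Render {pattern_name: support} as a text table with a bar per row."""
    lines = [f"{'pattern':<20}{'support':>8}"]
    # Highest support first, so the most frequent patterns lead the report.
    for name, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
        bar = "#" * int(support * width)
        lines.append(f"{name:<20}{support:>8.2f} {bar}")
    return "\n".join(lines)

# Hypothetical scores for two patterns.
report = format_report({"butter+milk": 0.25, "bread+milk": 0.5})
print(report)
```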
Takeaway
KDD is an iterative process: data quality measures are evaluated and can be refined further. Based on the results, the mining step is refined, and new data can be integrated and transformed so that another, more appropriate level of results is obtained.
The pre-processing of databases is essential and includes data cleaning and data integration.
The abundance of data available today makes knowledge discovery increasingly valuable, and data mining is a significant part of it. Businesses currently use it to improve their decision-making and strategy.