Glossary

Category

See Class

Causal data mining

Causal data mining is a method that uses causal machine learning algorithms to detect previously unknown patterns among variables, and their interrelationships in a causal network, from data in order to predict outcomes.

Centroids

Centroids are the mean values of the attributes for the objects in a cluster. They are used to represent the cluster and are often used as the basis for assigning new objects to a cluster.
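
The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `centroid` and the sample points are assumptions, not part of the glossary.

```python
def centroid(points):
    """Return the attribute-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

# Three objects in one cluster, each described by two attributes (made-up values).
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
center = centroid(cluster)  # mean of each attribute across the cluster
```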

Champion model

The champion model is the predictive model with the best performance among the models compared in a data mining project. It is the best only relative to those candidates, not in any absolute sense.

Chi-squared Automatic Interaction Detection (CHAID)

CHAID is a non-parametric method used for creating decision trees by recursively splitting the data into homogeneous subgroups based on the categorical predictor variables, using the chi-squared test to assess the independence between variables and find significant splits.
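
As a rough illustration of the test that drives the splits, the sketch below computes the chi-squared statistic for a contingency table of predictor category against target class. It shows only the statistic, not a full CHAID implementation, which involves further steps such as merging categories and deriving p-values. The function name `chi_squared` and the counts are assumptions made for this example.

```python
def chi_squared(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and target were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: two categories of a predictor. Columns: two target classes (made-up counts).
table = [[30, 10], [10, 30]]
stat = chi_squared(table)
```

A larger statistic indicates stronger evidence against independence, which makes the predictor a more promising candidate for a split.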

Class

A class refers to a distinct category or label that represents a specific group or type of data instances in a dataset, which is often the target.

Classification and Regression Tree (CART)

CART builds a binary tree: its algorithm repeatedly splits a parent node into two child nodes. CART can be used for both classification and regression models.
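
A single binary split might be chosen as in the sketch below, for a classification target. It assumes Gini impurity as the split criterion, which is a common choice but is not named in this glossary. The function names `gini` and `best_threshold` and the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Find the threshold t that splits a node into two children
    (value <= t and value > t) with the lowest weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:  # skip the largest value: it leaves an empty child
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_threshold(values, labels)
```

Applying the same search recursively to each child node is what grows the tree.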

Classification model

A classification model is a predictive model with a binary or nominal target variable.

Cluster assignment

Cluster assignment is a process of assigning each object to a specific cluster based on its similarity to the cluster centroids. The assignment is usually done by comparing the distance or similarity of the object to the centroids.
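
A minimal sketch of distance-based assignment follows. It assumes Euclidean distance, which is one of several possible measures, and the function name `assign` and the coordinates are illustrative only.

```python
import math

def assign(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

centroids = [(0.0, 0.0), (10.0, 10.0)]
objects = [(1.0, 1.0), (9.0, 8.0), (4.0, 4.0)]
labels = [assign(p, centroids) for p in objects]
```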

Cluster or segment

A cluster, also known as a segment, refers to a group of data points or objects that are similar to each other based on certain characteristics or attributes.

Cluster profiling

Cluster profiling involves analyzing the characteristics of each cluster, such as the mean value of the attributes or the distribution of the objects within the cluster.

Cluster sampling

Cluster sampling is a sampling method where the population is divided into clusters or groups, and a random selection of entire clusters is chosen, followed by data collection from all the members within the selected clusters.
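
The procedure can be sketched as follows, assuming the population is already grouped into named clusters. The function name `cluster_sample`, the `seed` parameter, and the school data are hypothetical.

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Randomly choose k whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)
    return {name: clusters[name] for name in chosen}

# Population divided into four clusters (made-up names).
schools = {
    "north": ["ann", "bob"],
    "south": ["cy", "dee", "eli"],
    "east": ["flo"],
    "west": ["gus", "hal"],
}
sample = cluster_sample(schools, k=2, seed=0)
```

Note that the random selection happens at the cluster level only; no sampling takes place inside a selected cluster.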

Cluster validation

Cluster validation is a process of assessing the quality and reliability of the clusters obtained. It involves evaluating the internal cohesion and separation of the clusters, as well as their stability and meaningfulness.

Clustering

Clustering is the process of grouping objects or observations into clusters (segments) based on their similarity or proximity. The goal is to create clusters that are internally homogeneous (objects within the same cluster are similar) and externally heterogeneous (objects from different clusters are dissimilar).

Clustering algorithms

A clustering algorithm assigns observations to clusters and, depending on the method, may also determine the number of clusters. Clustering algorithms can be categorized into hierarchical clustering and partitioning clustering.

Component rotation

Component rotation transforms the principal components so that they align more closely with the original variables, which yields a more interpretable and meaningful representation of the data.

Conditional Probability Tables (CPTs)

Each node in a Bayesian network has an associated CPT, which defines the conditional probability distribution of the node given its parents’ states. The CPT quantifies the probabilistic relationship between variables.
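
One simple way to picture a CPT is as a lookup table keyed by the parents' states. The sketch below is an assumption-laden illustration: the node (wet grass), its parents (rain, sprinkler), and all probabilities are made up, and the helper name `prob` is hypothetical.

```python
# CPT for a hypothetical node WetGrass with parents (Rain, Sprinkler).
# Each key is one combination of parent states; each value is the
# distribution of WetGrass given that combination (illustrative numbers).
cpt_wet_grass = {
    (True, True): {True: 0.99, False: 0.01},
    (True, False): {True: 0.90, False: 0.10},
    (False, True): {True: 0.80, False: 0.20},
    (False, False): {True: 0.05, False: 0.95},
}

def prob(cpt, parent_states, value):
    """Look up P(node = value | parents = parent_states)."""
    return cpt[parent_states][value]

p_wet = prob(cpt_wet_grass, (True, False), True)
```

Each row of the table is a complete probability distribution, so its entries sum to 1.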

Confidence level

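The confidence level represents the degree of certainty we have that the estimates obtained from the sample capture the true population value. For example, a 95% confidence level means that, over repeated sampling, about 95% of the resulting confidence intervals would contain the true value.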

Confusion matrix

A confusion matrix, also called a classification matrix, is a table that summarizes the performance of a classification algorithm by comparing the predicted classes with the actual classes.
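
A small sketch of how such a table can be built from actual and predicted labels is shown below. The layout (rows for actual classes, columns for predicted classes) is one common convention and is assumed here; the function name and the labels are illustrative.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] counts the cases whose
    actual class is labels[i] and whose predicted class is labels[j]."""
    counts = Counter(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix

actual = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]
labels, matrix = confusion_matrix(actual, predicted)
# labels -> ['ham', 'spam'], matrix -> [[2, 1], [1, 2]]
```

The diagonal cells hold the correct predictions; every off-diagonal cell is a misclassification.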

Connection (edge) (in neural networks)

A connection links two neurons. Each connection has an associated weight that determines how strongly one neuron influences the other.
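
The role of the weights can be illustrated with the weighted sum a neuron receives from its incoming connections. This sketch assumes the usual weighted-sum formulation; the function name `neuron_input`, the `bias` parameter, and the numbers are hypothetical.

```python
def neuron_input(activations, weights, bias=0.0):
    """Weighted sum arriving at a neuron: each incoming activation is
    scaled by the weight of the connection that carries it."""
    return sum(a * w for a, w in zip(activations, weights)) + bias

# Three incoming connections: a positive weight, a negative weight,
# and a zero weight (that connection has no influence).
total = neuron_input([1.0, 2.0, 3.0], [0.5, -1.0, 0.0])
```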

Constraint-based algorithm

A constraint-based algorithm is an approach used to learn the structure of a network, such as a Bayesian network, from data by discovering the probabilistic dependencies between variables without directly optimizing a scoring metric. It iteratively tests conditional independence relationships among variables to determine the network's structure.

Convenience sampling

Convenience sampling is a non-probabilistic sampling method where researchers select the most readily available individuals to be part of the sample, leading to a lack of randomization and potential bias.

Cost sensitive learning

Cost-sensitive learning is a method of evaluating and selecting a model based on the costs of misclassification.
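
The idea can be sketched by weighting each cell of a confusion matrix with a misclassification cost and comparing totals across models. All numbers below are made up, and the names `total_cost`, `model_a`, and `model_b` are hypothetical.

```python
def total_cost(confusion, costs):
    """Sum of count * cost over every cell (rows: actual, columns: predicted)."""
    return sum(
        confusion[i][j] * costs[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )

# Assumed cost matrix: a false positive costs 1, a false negative costs 10.
costs = [[0, 1],
         [10, 0]]

model_a = [[90, 10], [5, 15]]  # 105 of 120 cases correct
model_b = [[80, 20], [2, 18]]  # 98 of 120 cases correct

cost_a = total_cost(model_a, costs)
cost_b = total_cost(model_b, costs)
```

In this example the second model is less accurate overall but incurs a lower total cost, because it makes fewer of the expensive false-negative errors.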

Covariance

Covariance between two variables measures how much they vary together. A positive covariance means the variables tend to increase together, a negative covariance means one tends to increase as the other decreases, and a covariance near zero indicates no linear relationship. In PCA, the principal components are constructed so that they are uncorrelated with one another.
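
For reference, the sample covariance can be computed as in the sketch below, which assumes the n - 1 (sample) denominator. The function name `covariance` and the data are illustrative.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0]
up = covariance(xs, [2.0, 4.0, 6.0, 8.0])    # ys rise with xs: positive covariance
down = covariance(xs, [8.0, 6.0, 4.0, 2.0])  # ys fall as xs rise: negative covariance
```

The sign gives the direction of the relationship; the magnitude depends on the units of the variables, which is why correlation is often used when a scale-free measure is needed.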

CRISP-DM process

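CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It organizes a data mining project into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.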