Glossary


There are currently 231 names in this directory.

Activation function

An activation function is a function that determines the output of a neuron based on the weighted sum of its inputs. Common activation functions include sigmoid, ReLU (Rectified Linear Unit), and tanh (hyperbolic tangent).
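
As a rough illustration of this definition, the sketch below computes a neuron's output from the weighted sum of its inputs. The function names (sigmoid, relu, neuron_output) and the sample numbers are illustrative assumptions, not part of the glossary.

```python
import math

def sigmoid(z: float) -> float:
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    """Pass positive values through and clip negative values to zero."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias, activation):
    """Apply an activation function to the weighted sum of the inputs plus a bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# tanh is available in the standard library as math.tanh
out = neuron_output([1.0, 2.0], [0.5, -0.25], 0.1, math.tanh)
```

Swapping the last argument changes only the final non-linear step; the weighted sum stays the same.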

Actual by predicted chart

The actual by predicted chart compares the actual values with the predicted values, showing us how well the trained model fits the data.

Adaptive stepwise

Adaptive stepwise is a forward stepwise method that also allows a variable already in the model to be removed once it is no longer significant. It is adaptive in the sense that the set of selected variables is re-evaluated at each step.

Aggregation methods

Ensemble models combine the predictions of base learners using aggregation methods. Common aggregation methods include averaging, voting, and weighted voting. The aggregated result is often more accurate and reliable than the predictions of individual models.
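
The three aggregation methods named above can be sketched in a few lines. This is a minimal illustration with hypothetical function names; it assumes numeric predictions for averaging and class labels for voting.

```python
from collections import Counter, defaultdict

def average(predictions):
    """Averaging: mean of numeric predictions from the base learners."""
    return sum(predictions) / len(predictions)

def majority_vote(labels):
    """Voting: the class predicted by the most base learners."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    """Weighted voting: each learner's vote counts in proportion to its weight."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

With weighted voting, a single strong learner can outvote several weak ones, which plain majority voting does not allow.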

Artificial Intelligence (AI)

AI is a system that simulates human intelligence for solving problems or making decisions. AI can learn, predict, improve, and solve. In practice, AI can be an application or solution that operates and acts automatically with minimal human intervention.

Association model

A predictive model with an interval target variable

Backward propagation (or backpropagation)

Backward propagation is the process of adjusting the weights and biases in the network based on the prediction error. It involves calculating the gradients of the loss function and adjusting the parameters to minimize the error.

Backward stepwise

Backward stepwise is the method in which the model starts with all predictor variables, and at each step, the variable that contributes the least to the model’s performance is removed until no further improvement is achieved or a predefined stopping criterion is met.

Bagging

Bagging is an ensemble method that involves training multiple base learners independently on different random subsets of the training data (sampling with replacement). The final prediction is typically an average or voting over the predictions of all the base learners.

Bar chart

A bar chart is a graphical representation of categorical data that uses rectangular bars of varying lengths to visualize the frequency or count of each category.

Base learners (individual models)

Base learners are the individual models that constitute the ensemble. Base learners can be any machine learning algorithm, such as decision trees, support vector machines, neural networks, etc.

Bayes' Theorem

Bayes’ theorem is a fundamental equation used in Bayesian networks to update probabilities based on new evidence. It calculates the posterior probability of an event given prior knowledge and observed evidence.
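
A small worked example may help here. The sketch below applies Bayes' theorem to a binary event; the function name, parameter names, and the numbers (a 1% prior, a 90% likelihood, a 5% false positive rate) are made up for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(event | evidence) for a binary event via Bayes' theorem.

    prior               = P(event)
    likelihood          = P(evidence | event)
    false_positive_rate = P(evidence | no event)
    """
    # Total probability of seeing the evidence at all
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: rare event, fairly reliable evidence
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
```

With these assumed inputs the posterior is roughly 0.154: the evidence raises the probability well above the 1% prior, but the event remains unlikely because it is rare to begin with.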

Bayesian network

Bayesian network, also known as a Bayesian belief network or probabilistic graphical model, is a powerful and popular method for representing and reasoning about uncertainty and probabilistic relationships between variables. It is a graphical model that represents a set of variables and their conditional dependencies through a directed acyclic graph.

Bias

A bias term is added to the weighted sum of inputs in each neuron. It shifts the input to the activation function, so the neuron can produce a non-zero output even when all inputs are zero, which gives the model more flexibility to fit the data.

Boosting

Boosting is an iterative ensemble method where base learners are trained sequentially, and each learner focuses on correcting the mistakes made by its predecessors. Boosting assigns higher weights to misclassified instances, effectively giving more attention to challenging samples.

Box plot

A box plot (box-and-whisker plot) is a graphical summary of the distribution of numerical data that displays the median, quartiles, and potential outliers, providing a concise view of the data’s central tendency and spread.

Branch or sub-tree

A branch or a sub-tree is a sub-section of a decision tree.

Category

See Class

Causal data mining

Causal data mining is a method of using causal machine learning algorithms to detect unknown patterns involving variables and their interrelationships in a causal network from data to predict outcomes.

Centroids

Centroids are the mean values of the attributes for the objects in a cluster. They are used to represent the cluster and are often used as the basis for assigning new objects to a cluster.

Champion model

Champion model is the predictive model with the best model performance among the compared models in a data mining project. It is a relative best.

Chi-squared Automatic Interaction Detection (CHAID)

CHAID is a non-parametric method used for creating decision trees by recursively splitting the data into homogeneous subgroups based on the categorical predictor variables, using the chi-squared test to assess the independence between variables and find significant splits.

Class

A class refers to a distinct category or label that represents a specific group or type of data instances in a dataset, which is often the target.

Classification and Regression Tree (CART)

CART is a binary tree, which uses an algorithm to split a parent node into two child nodes repeatedly. CART can be used for both classification and association models.

Classification model

A predictive model with a binary or nominal target variable

Cluster assignment

Cluster assignment is a process of assigning each object to a specific cluster based on its similarity to the cluster centroids. The assignment is usually done by comparing the distance or similarity of the object to the centroids.

Cluster or segment

A cluster, also known as a segment, refers to a group of data points or objects that are similar to each other based on certain characteristics or attributes.

Cluster profiling

Cluster profiling involves analyzing the characteristics of each cluster, such as the mean value of the attributes or the distribution of the objects within the cluster.

Cluster sampling

Cluster sampling is a sampling method where the population is divided into clusters or groups, and a random selection of entire clusters is chosen, followed by data collection from all the members within the selected clusters.

Cluster validation

Cluster validation is a process of assessing the quality and reliability of the clusters obtained. It involves evaluating the internal cohesion and separation of the clusters, as well as their stability and meaningfulness.

Clustering

A process of grouping objects or observations into clusters (segments) based on their similarity or proximity. The goal is to create clusters that are internally homogeneous (objects within the same cluster are similar) and externally heterogeneous (objects from different clusters are dissimilar).

Clustering algorithms

A clustering algorithm determines the number of clusters and assigns observations to clusters. Clustering algorithms can be categorized into hierarchical clustering and partitioning clustering.

Component rotation

Component rotation is used to transform the principal components to align those components with the original variables, which achieves a more interpretable and meaningful representation of the data.

Conditional Probability Tables (CPTs)

Each node in a Bayesian network has an associated CPT, which defines the conditional probability distribution of the node given its parents’ states. The CPT quantifies the probabilistic relationship between variables.

Confidence level

The confidence level represents the degree of certainty or confidence we have in the estimates obtained from the sample.

Confusion matrix

Confusion matrix, also called the classification matrix, is the table that summarizes the performance of a classification algorithm.
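
For a binary target, the four cells of the table can be counted directly. The following is a minimal sketch; the function name and the choice of 1 as the positive class are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted the event: correct if it actually occurred
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted no event: correct if it actually did not occur
            counts["TN" if a != positive else "FN"] += 1
    return counts

cells = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Metrics such as prediction accuracy and misclassification rate are simple ratios of these four counts.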

Connection (edge) (in neural networks)

A connection connects neurons. Each connection has an associated weight that determines the level of the influence between connected neurons.

Constraint-based algorithm

A constraint-based algorithm is an approach used to learn the network structure from data by discovering the probabilistic dependencies between variables without directly optimizing a scoring metric. It iteratively tests and evaluates conditional independence relationships among variables to determine the network’s structure.

Convenience sampling

Convenience sampling is a non-probabilistic sampling method where researchers select the most readily available individuals to be part of the sample, leading to a lack of randomization and potential bias.

Cost sensitive learning

Cost-sensitive learning is a method to evaluate and select a model based on the costs of misclassification.

Covariance

Covariance between two variables measures how much they vary together. A large positive covariance means the variables tend to increase together, a large negative covariance means one tends to decrease as the other increases, and a covariance near zero indicates no linear relationship. In PCA, we choose the principal components with the expectation that they are uncorrelated.
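
To make the sign of the covariance concrete, here is a minimal sketch of the sample covariance (dividing by n - 1). The function name is illustrative, and the inputs are assumed to be equal-length lists with at least two values.

```python
def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from each mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - 1)

together = sample_covariance([1, 2, 3], [2, 4, 6])   # variables rise together
opposite = sample_covariance([1, 2, 3], [6, 4, 2])   # one rises as the other falls
```

The first call gives a positive value and the second a negative one, matching the direction of the relationship.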

CRISP-DM process

Cross-Industry Standard Process for Data Mining

Data de-identification

Data de-identification is the process of removing or altering identifying information from a dataset to protect the privacy and anonymity of individuals.

Data mining

Data mining is a method of using machine learning algorithms to detect unknown patterns involving relationships among variables within large datasets to predict outcomes of interest, which can lead to informed business decisions.

Data mining model

See Predictive model

Data mining process

The data mining process refers to the steps involved in discovering meaningful patterns, relationships, and insights from large datasets using various techniques and algorithms.

Data modification

Data modification involves preparing and transforming the raw data to make it suitable for training our predictive models.

Data science

Data science is a larger and multidisciplinary field, focusing on capturing and extracting knowledge from data and communicating the outcomes. Data science consists of data mining as an essential part of analyzing data and other components such as data collection and management, data treatment, data visualization, computer programming, and artificial intelligence applications.

Data scientist

A data scientist is a professional who performs a task or a combination of tasks involving analytics, data collection and treatment, data mining, machine learning, and programming.

Data visualization

Data visualization is the graphical representation of data and information through charts, graphs, and other visual elements to help understand patterns, trends, and insights in a more intuitive and accessible way.

Decision matrix

Decision matrix is the table that presents the costs of misclassifications, including costs of false positives and costs of false negatives.

Decision node

When a sub-node is divided into additional sub-nodes, it is referred to as a decision node.

Decision tree

Decision tree is a logical rule-based method that presents a hierarchical structure of variables, including the root node, parent nodes, and child nodes.

Deep learning

Deep learning is a complex neural network with many hidden layers. Breakthroughs in deep learning have led to the current AI boom.

Dendrograms

A dendrogram is a tree-like diagram that shows the hierarchical relationship between clusters in hierarchical clustering. The height of each branch represents the distance between clusters at that level.

Density plot

A density plot is a graphical representation of the distribution of continuous data, providing an estimate of the underlying probability density function, often using smoothed curves.

Dependent variable

See Target variable

Descriptive statistics

Descriptive statistics is a branch of statistics that involves the summarization and presentation of data to provide a clear and concise understanding of its main characteristics, such as measures of central tendency, dispersion, and distributions.

Dimensionality reduction

Once the principal components have been identified, PCA can be used to transform the high-dimensional data into lower-dimensional data, while still retaining as much of the original variation as possible. The number of principal components retained determines the dimensionality of the new data.

Directed Acyclic Graph (DAG)

The graphical structure of a Bayesian network is represented by a DAG, which is a graph without cycles (no path that starts and ends at the same node). The absence of cycles ensures that the network does not contain feedback loops and allows for efficient probabilistic inference.

Distance or similarity measures

A distance or similarity measure quantifies the similarity or dissimilarity between pairs of objects. These measures are used to determine the proximity between objects and form the basis for clustering algorithms. Common measures include Euclidean distance, Manhattan distance, and cosine similarity.

Diversity

The primary strength of ensemble models lies in the diversity of their individual models. By using different learning algorithms or training on different subsets of the data, the ensemble captures various patterns and reduces the risk of overfitting.

Edges (in Bayesian networks)

Edges represent probabilistic dependencies between nodes. Directed edges indicate causal or direct influences between variables.

Eigenvectors and eigenvalues

PCA seeks to identify the directions of maximum variability in the data. Eigenvectors are the directions that a linear transformation only stretches or shrinks without rotating, while eigenvalues are the corresponding scaling factors, which in PCA indicate how much variance is captured by each eigenvector. In PCA, the principal components are the eigenvectors of the covariance matrix, sorted by their corresponding eigenvalues.

Ensemble

Ensemble is a machine learning method that involves combining the predictions of multiple models to produce a more accurate and robust prediction.

Ensemble size

The number of base learners in an ensemble is an important consideration. Increasing the ensemble size can lead to better performance up to a point, after which the returns diminish, and the model may become computationally expensive.

Error term

Error term (also known as the residual term) represents the discrepancy between the observed values of the target variable and the predicted values obtained from the linear regression model. It captures the part of the target variable that cannot be explained by the linear relationship with the predictors.

Ethical consideration

Ethical considerations refer to the principles, guidelines, and moral values that govern the responsible and respectful use of data throughout the entire data mining process. They ensure that we comply with any requirements and restrictions regarding data access to protect human subjects and avoid violating privacy.

Evaluation criteria

See Model fit metrics

Evaluation metrics

See Model fit metrics

Evidence

Evidence refers to the observed values or states of certain variables in the Bayesian network. It is used to update the probabilities of other variables in the network.

Explained variance

The proportion of variance explained by each principal component is crucial for understanding the importance of each component in the data. It helps us determine how many principal components to retain for dimensionality reduction while preserving significant information.

Explanatory Power (EP)

EP refers to the ability of a predictive model, especially an association model, to explain or predict the variability observed in the data. It measures how well the model captures the underlying relationships between the input variables (predictors) and the target variable.

F-test

F-test is a test of the significance of the overall model. In other words, a significant F-test indicates that our regression model fits the data better than the model without any predictors.

F1 score

F1 score is the harmonic mean of the sensitivity (recall) and precision (positive predictive value); a high F1 score means low false positives (FP) and low false negatives (FN).
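
The harmonic mean in this definition can be written out as a short sketch. The function name and the sample counts are illustrative, and the code assumes at least one true positive so that no division by zero occurs.

```python
def f1_score(tp, fp, fn):
    """F1 score from true positive, false positive, and false negative counts.

    Assumes tp > 0 so that precision and recall are both defined and non-zero.
    """
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

score = f1_score(tp=6, fp=2, fn=6)   # precision 0.75, recall 0.5
```

Because the harmonic mean is pulled toward the smaller of the two values, the score here (about 0.6) sits below the simple average of precision and recall.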

Factor analysis

See Principal component analysis

False Negative (FN)

FN is the prediction that wrongly indicates the absence of the event.

False Positive (FP)

FP is the prediction that wrongly indicates the presence of the event.

Feature importance

See Variable importance

Features

See Predictors

Forward propagation

Forward propagation is the process of transmitting input signals through the network, layer by layer, to produce an output. Each neuron in a layer receives input from the previous layer, performs computations using its activation function, and passes the output to the next layer.

Forward stepwise

Forward stepwise is the method in which the model starts with no predictor variables and iteratively adds one variable at a time, and at each step, the variable that improves the model’s performance the most is added to the model until no further improvement is achieved or a predefined stopping criterion is met.

Frequentist statistics

Frequentist statistics, also known as frequentist inference, is a statistical framework and approach to data analysis that focuses on the concept of repeated sampling and long-run frequencies of events.

Goodness of fit

See Model fit

Goodness-of-fit indices

See Model fit metrics

Gradient boosting

Gradient boosting is a machine learning ensemble method that sequentially combines weak learners to create a powerful predictive model by minimizing errors using gradient descent.

Gradient Boosting Additive Model

Gradient Boosting Additive Models (GBAM) is a variant of gradient boosting that extends the traditional algorithm to handle additive models. An additive model is a model that consists of a sum of functions of individual input variables, where each function depends only on a single input variable.

Gradient descent

Gradient boosting minimizes a specified loss function by using gradient descent optimization. It computes the gradients of the loss function and adjusts the subsequent weak learner to correct the errors made by previous learners. The learning process aims to iteratively reduce the loss and improve the overall model performance.

Hierarchical clustering

Hierarchical clustering builds a hierarchy of clusters, forming a tree-like structure called a dendrogram.

Histogram

A histogram is a graphical representation of continuous data that uses adjacent rectangles to display the distribution and frequency of data within specified intervals (bins).

Imbalanced dataset

An imbalanced dataset refers to a dataset where the number of samples in different classes significantly varies, resulting in unequal representation of the classes. In an imbalanced dataset, one class (the minority class) typically has significantly fewer instances than the other classes (majority classes).

Imputation

Imputation is a process of estimating or filling in missing values in a dataset with estimated values.
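
One simple form of imputation replaces each missing value with the mean of the observed values. The sketch below assumes missing values are stored as None and that at least one value is observed; the function name is illustrative, and other strategies (median, model-based) are equally valid.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

filled = impute_mean([1.0, None, 3.0])
```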

Independent variables

See Predictors

Input variables

See Predictors

Institutional Review Board (IRB)

An Institutional Review Board (IRB) is a committee responsible for upholding research ethics by reviewing proposed research methods to ensure they meet ethical standards.

Intercept

Intercept is the constant term in the regression equation, representing the predicted value of Y (target variable) when all predictors are zero.

K-means clustering

K-means is a partitioning clustering algorithm that partitions the data into K distinct clusters by iteratively updating the cluster centroids and assigning data points to the nearest centroid. The objective is to minimize the sum of squared distances between data points and their respective cluster centroids.
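
The two alternating steps described above (assign points to the nearest centroid, then update the centroids) can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names are hypothetical, the initial centroids are supplied by the caller, and an empty cluster simply keeps its previous centroid.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Run K-means from the given initial centroids; return (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

data = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centroids, final_labels = kmeans(data, [(0, 0), (10, 10)])
```

In practice the result depends on the initial centroids, so the algorithm is often run several times with different starting points.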

Label

See Class

Layers

Neurons are organized into layers in a neural network. The input layer receives the input data, the output layer produces the final output, and there can be one or more hidden layers in between.

Leaf node

Nodes that do not split are called leaf nodes. They represent the output or the predicted value/class for the specific subgroup of data that reaches that node.

Learning (in Bayesian networks)

Learning in Bayesian networks involves estimating the network structure and parameters from data. This can be done using approaches such as constraint-based methods or score-based methods.

Learning rate

The learning rate is a hyperparameter that controls the step size in gradient descent. It determines how much each weak learner’s contribution affects the final ensemble. A lower learning rate makes the algorithm converge slowly but can improve generalizability, while a higher learning rate leads to faster convergence but may increase the risk of overfitting.

Level

See Class

Level of measurement

See Variable scale

Lift chart

Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model. The Lift chart, or Lift curve, presents the graphic representation between the lift values and response rates.

Likelihood

The likelihood in Bayesian networks represents the probability of observing specific evidence given the values of the variables in the network. It is derived from the conditional probability tables.

Line chart

A line chart is a graph that displays data points connected by straight lines, commonly used to visualize trends or changes over time or ordered categories.

Linear regression

Linear regression is a regression method in which the target variable is interval. Linear regression is an association model.

Logistic regression

Logistic regression is a regression method in which the target variable is categorical (binary or nominal). It is a classification model.

Loss function

A loss function measures the discrepancy between the predicted output of the neural network and the actual target values. It quantifies the error and guides the adjustment of the parameters during training.

Machine learning

Machine learning refers to data analytic methods that detect unknown patterns in large datasets to predict outcomes of interest.

Machine learning model

See Predictive model

Machine learning process

See Predictive modeling

Margin of error

The margin of error is a measure of the uncertainty or variability in the estimates obtained from the sample.

Mean Absolute Percentage Error (MAPE)

MAPE is a common metric for association models. It measures the average absolute percentage difference between the predicted values and the actual values in a dataset.
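
As a sketch of the calculation, MAPE averages the absolute errors expressed as a percentage of the actual values. The function name and the sample values are illustrative; the code assumes no actual value is zero, since the metric is undefined in that case.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

error = mape([100.0, 200.0], [110.0, 180.0])   # each prediction is off by 10%
```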

Measurement of variable

See Variable scale

Measurement scale

See Variable scale

Misclassification Rate (MR)

MR is a measure of how often a model misclassifies (makes incorrect predictions) the target variable (including actual positives and actual negatives).

Model fit

Model fit measures how well the model fits the data.

Model fit metrics

Criteria used to evaluate model fit

Model performance

See Model fit

Model reliability

Model reliability refers to the internal consistency of a model. It means the model produces consistent results with different runs using different data sets.

Model selection and tuning

Ensemble models require careful selection and tuning of base learners and configurations. Different combinations of models and configurations can significantly impact the model performance.

Model validity

Model validity refers to the accuracy of the model. It means the model consistently produces an accurate prediction of the target variable.

Neuron (node) (in neural networks)

A neuron, also called a node or unit, is a basic building block of a neural network. It receives inputs, processes them using an activation function, and produces outputs.

Nodes (in Bayesian networks)

Nodes represent random variables or events in the domain being modeled. Each node corresponds to a specific quantity or attribute of interest.

Non-probability sampling

Non-probability sampling is the method of selecting individuals based on non-random criteria, and not every individual in the sampling frame has a chance of being included.

Normalization

Data normalization, also known as variable scaling, is a preprocessing step in machine learning that converts the variables of a dataset to a similar range.
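
One common way to bring a variable into a fixed range is min-max scaling to [0, 1]. The sketch below is only one of several normalization schemes; the function name is illustrative, and mapping a constant column to zeros is an assumption made to avoid division by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # Constant column: no spread to rescale, so return zeros (a design choice)
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

scaled = min_max_scale([2, 4, 6])
```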

Oblimin

Oblimin is an oblique rotation method that allows for correlation or obliqueness between the rotated components. The main objective of Oblimin rotation is to achieve a simpler structure by minimizing the number of variables with high loadings on a component.

Oblique rotation

Oblique rotation is the rotation method, in which the rotated components are allowed to be correlated with each other.

Odds ratio

Odds ratio represents the ratio of the odds of the event occurring for one group compared to another group. In the context of logistic regression, it measures how the odds of the binary outcome change with respect to a one-unit change in the predictor, while holding all other predictors constant.
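
In logistic regression the odds ratio for a predictor is the exponential of its coefficient. The sketch below shows that relationship along with the conversion from probability to odds; the function names are illustrative.

```python
import math

def odds(probability):
    """Convert a probability (strictly below 1) into odds."""
    return probability / (1.0 - probability)

def odds_ratio_from_coefficient(beta):
    """Odds ratio for a one-unit increase in a predictor with logistic coefficient beta."""
    return math.exp(beta)

doubling = odds_ratio_from_coefficient(math.log(2.0))
```

A coefficient of zero gives an odds ratio of 1 (no change in the odds), and a coefficient of ln 2 means the odds double for each one-unit increase in the predictor.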

Optimization algorithm

The optimization algorithm is used during training to adjust the network’s parameters based on the computed gradients. Gradient descent is a common optimization algorithm used for this purpose.
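
To show the idea on the simplest possible case, the sketch below runs gradient descent on a single parameter. The function name, the default learning rate and step count, and the toy loss f(x) = (x - 3)^2 are all assumptions chosen for illustration; a neural network applies the same update to every weight and bias.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to reduce the loss."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)   # move opposite to the slope
    return x

# Toy loss f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
```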

Orthogonal rotation

Orthogonal rotation is the rotation method, in which the rotated components are constrained to be orthogonal to each other, meaning they are independent and uncorrelated.

Out-of-bag evaluation

Random forest utilizes an out-of-bag (OOB) evaluation technique. Since each decision tree is trained on a different subset of the training data, the samples not included in a tree’s training subset can be used for evaluation. This provides an unbiased estimate of the model’s performance without the need for a separate validation set.
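
The split between in-bag and out-of-bag rows can be sketched as follows. This minimal example draws one bootstrap sample of row indices; the function name is illustrative, and a seeded random.Random is used only to make the sketch repeatable.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw a bootstrap sample of row indices and list the rows left out of it."""
    # Sample n_rows indices with replacement: some rows repeat, some are never drawn
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

in_bag, out_of_bag = bootstrap_split(10, random.Random(0))
```

Each tree would be trained on its in-bag rows and evaluated on its out-of-bag rows, so every row is scored only by trees that never saw it.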

Output

See Target variable

Overfitting

Overfitting is the situation in which the model is overtrained to the training sample and not generalized to other datasets. Hence, the model is invalid and unusable.

Oversampling

Oversampling is a technique that balances the data by increasing the size of the minority class (rare events) to match the size of the majority class.

Parent node and child node (in decision trees)

These terms are relative in nature. Any node situated below another node is typically referred to as a child node or sub-node, while the node preceding these child nodes is commonly known as the parent node.

Parents and children (in Bayesian networks)

In a Bayesian network, the parents of a node are the nodes that directly influence it. The children of a node are the nodes that are directly influenced by it.

Partitioning clustering

Partitioning clustering divides the data into non-overlapping partitions or clusters. It directly assigns each data point to a specific cluster.

Pie chart

A pie chart is a circular chart that represents the proportions of different categories in a whole, with each category represented by a slice of the pie.

Population

The population is the entire group about which we want to examine patterns in the data and draw conclusions.

Posterior probability

The posterior probability represents the updated probability of a variable given observed evidence or data. It is calculated by combining the prior probability with the likelihood of the observed data using Bayes’ theorem.

Prediction

A general term for a predicting act

Prediction accuracy

Prediction accuracy is a measure of how accurately a model predicts the target variable (including actual positives and actual negatives).

Prediction errors

Errors measure the extent to which the trained model misfits the data. The lower the error, the more accurate the model.

Prediction model

See Predictive model

Predictive model

A model predicting either a continuous or categorical target variable

Predictive modeling

A process of using machine learning algorithms to develop a model that predicts future or unseen events or outcomes based on data.

Predictive power

Predictive power is the ability of a model to accurately capture and represent the underlying patterns, relationships, or trends present in the data and generalize the results to other datasets.

Predictive research

Predictive research looks into the future. In other words, organizations analyze a large amount of data to predict new potential safety problems in the future. By mining big data, they can develop a predictive model that predicts an incident or accident before it happens.

Predictors

Inputs being used in predicting the output

Principal components

Principal components are the eigenvectors that correspond to the highest eigenvalues of the covariance matrix. They are ordered in terms of their importance, where the first principal component explains the maximum variance in the dataset, followed by the second principal component, and so on.

Prior probability

The prior probability represents the initial belief or knowledge about a variable before any evidence or data is observed. It is typically specified as part of the Bayesian network modeling process.

Proactive research

Proactive research studies the present. In other words, organizations examine contributing factors to an incident/accident from various aspects of hazardous conditions and organizational processes and see how they are related to the incident or accident.

Probabilistic inference

Bayesian networks allow for probabilistic inference, which means computing the probability of specific events or variables given evidence (observed data or values) from other variables in the network. Inference is based on Bayes’ theorem.

Probabilistic relationship

A probabilistic relationship refers to the existence of a connection or association between two or more variables, where the relationship is characterized by uncertainty and is described using probabilities.

Probability sampling

Probability sampling is the method of selecting individuals randomly in such a way that each individual in the sampling frame has an equal probability of being chosen.

Promax

Promax is an oblique rotation method that extends the advantages of Varimax rotation while also accounting for possible correlations between the rotated components. It is considered a compromise between the simplicity of orthogonal rotation and the flexibility of oblique rotation.

Pruning

Removing the sub-nodes of a parent node is called pruning. It is a technique used to simplify the decision tree by removing nodes that do not significantly improve its performance on the validation set. A tree is grown through splitting and shrunk through pruning.

Pseudo R-squared

Pseudo R-squared is a metric used in logistic regression to assess the model fit. Unlike the traditional R-squared used in linear regression, which measures the proportion of variance explained by the model, pseudo R-squared measures the proportion of the deviance explained by the model.

Purposive Sampling

Purposive sampling is a non-probabilistic sampling method where researchers deliberately select specific individuals or units based on pre-defined criteria to best represent certain characteristics or traits of interest.

R-squared or coefficient of determination

R-squared is a measure of the percentage of the variance in the target variable that is explained by variance in predictors, collectively.

Random forest

Random forest is a machine learning algorithm that combines the predictions of multiple decision trees to make more accurate and robust predictions. It falls under the category of ensemble learning, where multiple models are combined to form a more powerful predictive model.

Random subset selection

Each decision tree in the random forest is built using a random subset of the training data. This process, known as bootstrap aggregating or “bagging,” involves randomly sampling data points with replacement to create diverse subsets for training each tree.

Random variable selection

The random forest algorithm selects a random subset of input variables for each decision tree. This technique, known as variable bagging or variable subsampling, introduces additional randomness and helps to reduce the dominance of any single variable in the ensemble.

Reactive research

Reactive research focuses on examining events that have already happened and identifying the root causes of a specific incident or accident. Based on the identified root causes, the organizations can form appropriate strategies to mitigate the risk.

Receiver Operating Characteristic (ROC) chart

The ROC chart displays the true positive rate (sensitivity) on the vertical axis and the false positive rate (1 - specificity) on the horizontal axis, illustrating the trade-off between sensitivity and specificity.
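
The points of an ROC chart can be sketched by sweeping a classification threshold over the predicted scores and recording the false positive rate (1 - specificity) and the true positive rate at each step. The labels and scores below are invented, and `roc_points` is a hypothetical helper.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
        points.append((fp / negatives, tp / positives))
    return points

labels = [1, 1, 0, 1, 0, 0]               # 1 = event, 0 = non-event
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # predicted event probabilities
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```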

Reconstruction

PCA allows the inverse transformation of the lower-dimensional data back into the original high-dimensional space. Although some information is lost during dimensionality reduction, the reconstruction can still be useful for interpretation or other downstream tasks.

Recursive partitioning

Recursive partitioning denotes an iterative procedure involving the division of data into partitions, followed by further splitting along each branch.

Regression coefficients

Regression coefficients are the coefficients assigned to each predictor (Xi), indicating the magnitude and direction of their influence on the target variable. A positive coefficient means an increase in the predictor leads to an increase in the target variable, and vice versa for a negative coefficient.

Regression method

Regression is a popular method that estimates the relationship between one dependent variable (target variable) and one or more independent variables (predictors).

Regression model

See Association model

Relative Squared Error (RSE)

RSE is a common metric for association models. It measures the relative improvement of the model’s predictions compared to a simple baseline model that always predicts the mean of the actual values.

Replacement

Replacement is a data modification technique to correct errors, reassign values, or remove incorrect information in the data.

Representative sample

Representative sample is a sample that accurately reflects the characteristics and diversity of the population.

Residual by predicted chart

The residual by predicted chart presents not only how close the actual values are to the predicted values but also indicates any patterns in the residuals.

Residuals

Residuals measure the difference between actual values and predicted values in a dataset.

Robustness

The random forest algorithm is known for its robustness against overfitting, noisy data, and outliers. By aggregating multiple decision trees, it can capture complex relationships and handle noisy datasets more effectively.

Root Mean Squared Error (RMSE)

RMSE is a common metric for association models. It measures the average difference between the predicted values and the actual values in a dataset, expressed in the same units as the predicted variable.
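
A small sketch of the calculation, using invented values: the residuals are squared, averaged, and the square root is taken so the result is in the units of the target variable. The function name `rmse` is an illustrative choice.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

actual = [10.0, 12.0, 15.0, 20.0]      # illustrative values only
predicted = [12.0, 12.0, 13.0, 20.0]
error = rmse(actual, predicted)
print(round(error, 4))  # 1.4142
```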

Root node

A root node is at the beginning of a tree. It represents the initial decision based on a specific feature that has the most significant impact on the target variable. Starting from the root node, the dataset is partitioned based on different features, and subsequently, these subgroups are further divided at each decision node located beneath the root node.

Sample

Sample is a subset of the population. We use the sample for data analysis and make inferences about the population.

Sample size

Sample size refers to the number of individuals or subjects included in the sample. It should be determined based on statistical considerations.

Sampling bias

Sampling bias occurs when the selected sample does not accurately represent the population, leading to skewed or inaccurate results.

Sampling frame

Sampling frame is the actual list of units or sources from which a sample is drawn.

Sampling method

Sampling method is the procedure that we use to select participants and collect the data.

Sampling with replacement

Sampling with replacement occurs when a randomly chosen unit from the population is put back into the population, and subsequently, another element is chosen at random. In this process, the population consistently retains all its original units, allowing for the possibility of selecting the same unit multiple times.
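
The contrast with sampling without replacement can be sketched with Python's standard `random` module. The population and seed here are arbitrary.

```python
import random

rng = random.Random(0)
population = ["A", "B", "C", "D", "E"]

# With replacement: every draw sees the full population, so repeats are possible.
with_replacement = rng.choices(population, k=5)

# Without replacement: a drawn unit is not put back, so no unit can repeat.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```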

Scalability

Random forest can be parallelized, as each decision tree in the ensemble can be built independently. This allows for efficient computation and scalability to large datasets.

Scale of measurement

See Variable scale

Scatter plot

A scatter plot is a graphical representation of paired data points on a two-dimensional plane, used to visualize the relationship and correlation between two variables.

Score-based algorithm

A score-based algorithm is a Bayesian network structure learning approach that maximizes a scoring metric to identify the most probable network structure given the data. The algorithm aims to find the network structure that best fits the data according to a specific scoring criterion.

Scree plot

A scree plot is a graphical tool used in PCA to visualize the eigenvalues associated with each principal component. It is a simple line segment plot that shows the eigenvalues for each individual principal component. The scree plot helps determine the number of principal components based on the amount of variance they explain.
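
The quantities behind a scree plot can be sketched as follows. The eigenvalues are hypothetical, and the 85% cumulative-variance cutoff is only one possible rule of thumb, used here as an assumption.

```python
from itertools import accumulate

eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]   # hypothetical, sorted descending
total_variance = sum(eigenvalues)

# Proportion of variance explained by each principal component
explained = [value / total_variance for value in eigenvalues]
cumulative = list(accumulate(explained))

# Smallest number of components reaching the assumed 85% cutoff
components_needed = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.85)
print(explained)
print(components_needed)  # 3
```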

SEMMA process

Sample-Explore-Modify-Model-Assess

Sensitivity, Recall, or True positive rate (TPR)

Sensitivity is the probability of a positive result (event) or how well a model can predict true positives (events).
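
In terms of confusion matrix counts, sensitivity is the number of true positives divided by all actual events. A minimal sketch with invented counts:

```python
def sensitivity(true_positives, false_negatives):
    """Share of actual events that the model correctly flags as events."""
    return true_positives / (true_positives + false_negatives)

# Illustrative counts: 50 actual events, 40 of them predicted correctly
tpr = sensitivity(true_positives=40, false_negatives=10)
print(tpr)  # 0.8
```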

Significance level (alpha)

Significance level (alpha) is the threshold used in hypothesis testing to decide whether an estimated coefficient is statistically significant. A coefficient is considered significant when its p-value is below alpha.

Significance value (p-value)

The significance value (p-value) is a statistical measure for assessing the significance of the estimated coefficients (also known as the regression weights or beta values) of the predictors in the regression. It determines whether the relationship between each predictor and the target variable is statistically significant or if it could have occurred by chance.

Simple random sampling

Simple random sampling is a sampling method that allows every member of the population to have an equal chance of being selected.

Snowball sampling

Snowball sampling is a non-probabilistic sampling method where initial participants are chosen based on specific criteria and then asked to refer other participants, creating a chain of referrals, commonly used for hard-to-reach or hidden populations.

Specificity, Selectivity, or True negative rate (TNR)

Specificity is the probability of a negative result (non-event) or how well a model can predict true negatives (non-events).
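
In terms of confusion matrix counts, specificity is the number of true negatives divided by all actual non-events. A minimal sketch with invented counts:

```python
def specificity(true_negatives, false_positives):
    """Share of actual non-events that the model correctly flags as non-events."""
    return true_negatives / (true_negatives + false_positives)

# Illustrative counts: 100 actual non-events, 90 of them predicted correctly
tnr = specificity(true_negatives=90, false_positives=10)
print(tnr)  # 0.9
```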

Splitting

Splitting is a process of dividing a node into two or more sub-nodes.

Splitting criterion

Splitting criterion is a measure used to evaluate and select the best predictor to split the data at each node of the tree. It is a measure of impurity or variance reduction.
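
As one possible example, the sketch below uses the Gini index as the impurity measure. This is an assumption for illustration, since the definition above does not name a specific criterion. It scores a candidate split by how much it lowers the weighted impurity of the child nodes.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4   # evenly mixed node
left = ["yes"] * 4                  # candidate split separates the classes
right = ["no"] * 4
gain = impurity_reduction(parent, left, right)
print(gain)  # 0.5
```

The predictor and cut point with the largest reduction would be chosen for the split under this criterion.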

Stacking

Stacking (stacked generalization) is a more complex ensemble technique that combines the predictions of multiple base learners by training a meta-model on their outputs. The meta-model learns to weigh the predictions of individual models effectively.

Standardized regression coefficient (beta)

Standardized regression coefficient is a measure used in linear regression to quantify the relationship between a predictor and the target variable after standardizing both variables to a common scale. Standardized coefficients typically range between -1 and 1 and are expressed in standard deviation units, which makes them comparable across predictors measured on different scales.
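
One common way to obtain a standardized coefficient is to rescale the unstandardized coefficient by the ratio of the predictor's standard deviation to the target's standard deviation. The sketch below assumes that conversion and uses invented data in which the target is exactly twice the predictor.

```python
import statistics

def standardized_coefficient(b, x, y):
    """Rescale an unstandardized coefficient b by sd(x) / sd(y)."""
    return b * statistics.stdev(x) / statistics.stdev(y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]     # predictor values (illustrative)
y = [2.0, 4.0, 6.0, 8.0, 10.0]    # target values, here y = 2 * x
beta = standardized_coefficient(2.0, x, y)  # 2.0 is the slope in original units
print(round(beta, 6))  # 1.0
```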

Stepwise method

Stepwise method is an iterative variable selection method used in data mining to identify the most relevant subset of variables for a predictive model.

Stochastic Gradient Boosting

Stochastic Gradient Boosting (SGB) is a variant of the traditional gradient boosting algorithm that introduces randomness during the training process. It combines the concepts of gradient boosting and stochastic learning to enhance the model’s generalizability and reduce the risk of overfitting.

Stratified sampling

Stratified sampling is a sampling method where the population is divided into distinct subgroups or strata, and a random sample is drawn from each stratum to ensure representation of different characteristics within the overall population.
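
A minimal sketch of proportionate stratified sampling, assuming each unit carries a stratum label and the same sampling fraction is applied to every stratum. The helper name, the strata, and the 10% fraction are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(1)
units = [("urban", i) for i in range(60)] + [("rural", i) for i in range(40)]
sample = stratified_sample(units, lambda u: u[0], 0.1, rng)
print(len(sample))  # 10
```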

Sum of Squared Total (SST)

SST measures the total variation in the data, calculated as the sum of squared differences between each actual value and the mean.

Supervised learning

A machine learning approach for structured data, in which a model is trained from labeled data to make predictions of a pre-defined target variable.

Systematic random sampling

Systematic random sampling is a sampling method where every nth member is selected from the population after randomly choosing a starting point, ensuring equal and consistent spacing between selected samples.
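
A short sketch of the procedure, assuming the population is an ordered list and the interval n is chosen in advance. The helper name and the interval of 10 are illustrative.

```python
import random

def systematic_sample(population, step, rng):
    """Pick a random start within the first interval, then every step-th unit."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(7)
population = list(range(100))   # stand-in for an ordered sampling frame
sample = systematic_sample(population, 10, rng)
print(sample)
```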

Target

See Target variable

Target variable

The output being predicted

Test sample

A sample of data used to perform the final test of the model

Training

Training a neural network involves iteratively adjusting the weights and biases using a training dataset. The goal is to minimize the error or loss function and improve the network’s prediction accuracy.

Training sample

A sample of data used to train the model

Tree depth and number of trees

These are hyperparameters that control the complexity and size of the individual decision trees. The tree depth affects the model’s capacity to capture complex relationships in the data, while the number of trees determines the boosting iterations and overall model complexity.

True Negative (TN)

TN is the prediction that correctly indicates the absence of the event.

True Positive (TP)

TP is the prediction that correctly indicates the presence of the event.

Undersampling

Undersampling is a technique that balances the data by reducing the size of the majority class to the same size as the minority class.
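
A minimal sketch of random undersampling, assuming each row carries a class label and the majority class is reduced by random selection. The helper name and the 5/95 class split are invented for illustration.

```python
import random

def undersample(rows, label_of, rng):
    """Randomly shrink every class to the size of the smallest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    minority_size = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, minority_size))
    return balanced

rng = random.Random(3)
rows = [("event", i) for i in range(5)] + [("non-event", i) for i in range(95)]
balanced = undersample(rows, lambda r: r[0], rng)
print(len(balanced))  # 10
```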

Unstandardized regression coefficient (B)

Unstandardized regression coefficient is a measure used in linear regression to quantify the relationship between a predictor and the target variable in its original units of measurement.

Unsupervised learning

A machine learning approach for unstructured data, in which a model is trained from unlabeled data without a specific target variable.

USELEI process

Understanding-Sampling-Exploring-Learning-Evaluating-Inferring

Validation sample

A sample of data used to validate the model

Variable discretization

Variable discretization is a process of converting a variable with continuous scale (interval variable) into discrete categories (a categorical variable).
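
As a simple illustration, an interval variable such as age can be mapped to categories using fixed cut points. The cut points and category labels below are arbitrary assumptions.

```python
import bisect

def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the bin it falls into."""
    return labels[bisect.bisect_right(cut_points, value)]

cut_points = [18, 65]                    # bin boundaries (assumed)
labels = ["minor", "adult", "senior"]    # one label per bin
ages = [12, 18, 40, 65, 80]
categories = [discretize(age, cut_points, labels) for age in ages]
print(categories)  # ['minor', 'adult', 'adult', 'senior', 'senior']
```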

Variable importance

Variable importance is a measure indicating the relative contribution of each input variable in making predictions. It allows users to identify the most influential predictors and gain insights into the underlying data patterns.

Variable scale

A characteristic that describes the nature and properties of a variable; it indicates the type of values that a variable can take and the operations that can be performed on it

Variance

Variance of a variable measures how much the values of that variable vary from their mean. In PCA, the goal is to find the directions in the data with the highest variance, as these directions contain the most information about the data.

Varimax

Varimax is an orthogonal rotation method that aims to maximize the variance of the squared loadings on each component, leading to a simple and sparse structure where each variable has a high loading on one component and close to zero loadings on other components.

Voluntary response sampling

Voluntary response sampling is a non-probabilistic sampling method where participants self-select to be part of the sample, often in response to a survey or study invitation, which can lead to biased results due to the voluntary nature of participation.

Voting/averaging

During prediction, each decision tree independently generates a prediction, and the final prediction of the random forest is determined through voting (for classification models) or averaging (for association models) the individual tree predictions. This ensemble approach helps reduce bias and variance in the modeling and improves the overall accuracy and robustness of the model.
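
The two aggregation rules can be sketched as follows, using invented predictions from five trees. The helper name `majority_vote` is illustrative.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Classification model: each tree votes for a class
class_votes = ["churn", "stay", "churn", "churn", "stay"]
final_class = majority_vote(class_votes)

# Association model: each tree predicts a number, and the forest averages them
value_predictions = [10.0, 12.0, 11.0, 13.0, 14.0]
final_value = mean(value_predictions)

print(final_class)  # churn
print(final_value)  # 12.0
```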

Weak learners

In the context of gradient boosting, weak learners are individual models that perform slightly better than random guessing but are not highly accurate on their own. Decision trees are commonly used as weak learners in gradient boosting, but other models can also be used.

Weight

Each connection between neurons is assigned a weight, which represents the strength or importance of that connection. These weights determine how much influence the input of one neuron has on the output of another.
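
A minimal sketch of how weights enter a single neuron's computation: each input is multiplied by its connection weight, and the results are summed together with a bias before any activation function is applied. The input values, weights, and bias are invented.

```python
def weighted_sum(inputs, weights, bias):
    """Sum of input * weight over all connections, plus the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

inputs = [1.0, 2.0]
weights = [0.5, -0.25]   # the second connection dampens the output
z = weighted_sum(inputs, weights, bias=0.1)
print(round(z, 2))  # 0.1
```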