Glossary

Sample

A sample is a subset of the population. We analyze the sample and use the results to make inferences about the population.

Sample size

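Sample size refers to the number of individuals or subjects included in the sample. It should be determined based on statistical considerations, such as the desired statistical power, the acceptable margin of error, and the expected effect size.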

Sampling bias

Sampling bias occurs when the selected sample does not accurately represent the population, leading to skewed or inaccurate results.

Sampling frame

Sampling frame is the actual list of units or sources from which a sample is drawn.

Sampling method

Sampling method is the procedure that we use to select participants and collect the data.

Sampling with replacement

Sampling with replacement occurs when each randomly chosen unit is put back into the population before the next unit is drawn at random. Because the population always retains all its original units, the same unit can be selected multiple times.
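The idea can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `sample_with_replacement` and the toy population are assumptions made for the example.

```python
import random

def sample_with_replacement(population, k, seed=None):
    """Draw k units; every draw sees the full population, so repeats can occur."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    return [rng.choice(population) for _ in range(k)]

# Hypothetical population of three units
population = ["a", "b", "c"]
draws = sample_with_replacement(population, k=5, seed=0)
```

Because five draws are taken from only three units, at least one unit must appear more than once, which is not possible when sampling without replacement.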

Scalability

Scalability is the ability of an algorithm to handle growing amounts of data efficiently. Random forest scales well to large datasets because each decision tree in the ensemble can be built independently, so training can be parallelized.

Scale of measurement

See Variable scale

Scatter plot

A scatter plot is a graphical representation of paired data points on a two-dimensional plane, used to visualize the relationship and correlation between two variables.

Score-based algorithm

A score-based algorithm is a Bayesian network structure learning approach that maximizes a scoring metric to identify the most probable network structure given the data. The algorithm aims to find the network structure that best fits the data according to a specific scoring criterion.

Scree plot

A scree plot is a graphical tool used in PCA to visualize the eigenvalue associated with each principal component. It is a simple line plot of the eigenvalues, ordered from largest to smallest. The scree plot helps determine the number of principal components to retain based on the amount of variance they explain.
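As a rough sketch of how the eigenvalues translate into explained variance, the snippet below counts how many components are needed to reach a target share of the total variance. The eigenvalues and the 80% target are hypothetical values chosen for illustration, not results from any dataset, and the cumulative-variance rule is only one of several ways to read a scree plot.

```python
def components_for_variance(eigenvalues, target=0.8):
    """Return how many leading components reach the target share of variance."""
    ordered = sorted(eigenvalues, reverse=True)  # scree plots show largest first
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value / total  # share of variance explained so far
        if cumulative >= target:
            return count
    return len(ordered)

# Hypothetical eigenvalues for four principal components
eigenvalues = [2.5, 1.0, 0.3, 0.2]
n_components = components_for_variance(eigenvalues, target=0.8)  # 2
```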

SEMMA process

SEMMA stands for Sample-Explore-Modify-Model-Assess, the five steps of a data mining process developed by SAS.

Sensitivity, Recall, or True positive rate (TPR)

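Sensitivity is the probability that an actual positive (event) is predicted as positive, or how well a model can identify true positives (events). It is calculated as true positives divided by the sum of true positives and false negatives.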
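A minimal sketch of the calculation from confusion-matrix counts is shown below. The function name and the counts are hypothetical and serve only to illustrate the formula.

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: share of actual positives predicted as positive."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined without actual positives")
    return true_positives / actual_positives

# Hypothetical counts: 80 events detected, 20 events missed
tpr = sensitivity(true_positives=80, false_negatives=20)  # 0.8
```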

Significance level (alpha)

Significance level (alpha) is the threshold used in hypothesis testing to decide whether the estimated coefficients are statistically significant. A coefficient is considered significant when its p-value is below alpha.

Significance value (p-value)

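The significance value (p-value) is a statistical measure for assessing the significance of the estimated coefficients (also known as the regression weights or beta values) of the predictors in the regression. It is the probability of observing a relationship at least as strong as the one estimated if the predictor truly had no effect on the target variable. A p-value below the significance level (alpha) indicates that the relationship between the predictor and the target variable is statistically significant; otherwise, the relationship could have occurred by chance.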

Simple random sampling

Simple random sampling is a sampling method that allows every member of the population to have an equal chance of being selected.
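A short sketch using Python's standard library is given below. `random.sample` draws without replacement, so each member has the same chance of selection; the population of 100 numbered members is a hypothetical example.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Select k distinct members, each with an equal chance of being chosen."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    return rng.sample(population, k)

# Hypothetical population of members numbered 1 to 100
population = list(range(1, 101))
chosen = simple_random_sample(population, k=10, seed=42)
```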

Snowball sampling

Snowball sampling is a non-probabilistic sampling method where initial participants are chosen based on specific criteria and then asked to refer other participants, creating a chain of referrals. It is commonly used for hard-to-reach or hidden populations.

Specificity, Selectivity, or True negative rate (TNR)

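Specificity is the probability that an actual negative (non-event) is predicted as negative, or how well a model can identify true negatives (non-events). It is calculated as true negatives divided by the sum of true negatives and false positives.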
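The sketch below computes specificity directly from paired lists of actual and predicted labels. The label coding (0 for non-event, 1 for event) and the data are assumptions made for illustration.

```python
def specificity(actual, predicted, negative=0):
    """True negative rate: share of actual negatives predicted as negative."""
    pairs = list(zip(actual, predicted))
    true_negatives = sum(1 for a, p in pairs if a == negative and p == negative)
    false_positives = sum(1 for a, p in pairs if a == negative and p != negative)
    if true_negatives + false_positives == 0:
        raise ValueError("specificity is undefined without actual negatives")
    return true_negatives / (true_negatives + false_positives)

# Hypothetical labels: 0 = non-event, 1 = event
actual = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 1, 0]
tnr = specificity(actual, predicted)  # 3 of 4 actual negatives -> 0.75
```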

Splitting

Splitting is a process of dividing a node into two or more sub-nodes.

Splitting criterion

Splitting criterion is a measure used to evaluate and select the best predictor to split the data at each node of the tree. It is a measure of impurity or variance reduction.
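As one possible illustration, the sketch below uses Gini impurity, a common impurity measure for classification trees, to compare two candidate splits. The choice of Gini, the function names, and the toy labels are assumptions; other criteria, such as entropy or variance reduction, follow the same pattern.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, children):
    """Parent impurity minus the size-weighted impurity of the sub-nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

# Hypothetical node with 4 "yes" and 4 "no" cases
parent = ["yes"] * 4 + ["no"] * 4
split_a = [["yes"] * 4, ["no"] * 4]  # separates the classes perfectly
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]  # no gain

best = max([split_a, split_b], key=lambda s: impurity_reduction(parent, s))
```

Under this criterion, the split with the largest impurity reduction (here `split_a`) would be selected for the node.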

Stacking

Stacking (stacked generalization) is a more complex ensemble technique that combines the predictions of multiple base learners by training a meta-model on their outputs. The meta-model learns to weigh the predictions of individual models effectively.

Standardized regression coefficient (beta)

Standardized regression coefficient is a measure used in linear regression to quantify the relationship between a predictor and the target variable after standardizing both variables to a common scale. Standardized coefficients typically range between -1 and 1 and are expressed in standard deviation units, so they can be compared across predictors measured on different scales.
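For a regression with a single predictor, the standardized coefficient can be obtained by rescaling the unstandardized slope by the ratio of the standard deviations of the predictor and the target. The sketch below illustrates this under that single-predictor assumption; the function name and data are hypothetical.

```python
import statistics

def standardized_beta(x, y):
    """Standardized slope for simple linear regression of y on x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx  # unstandardized coefficient, in original units
    # Rescale to standard deviation units
    return slope * statistics.stdev(x) / statistics.stdev(y)

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
beta = standardized_beta(x, y)

# Changing the unit of x changes the raw slope but not the standardized one
beta_rescaled = standardized_beta([10 * xi for xi in x], y)
```

With one predictor, this value coincides with the Pearson correlation between the predictor and the target.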

Stepwise method

Stepwise method is an iterative variable selection method used in data mining to identify the most relevant subset of variables for a predictive model.

Stochastic Gradient Boosting

Stochastic Gradient Boosting (SGB) is a variant of the traditional gradient boosting algorithm that introduces randomness during the training process. It combines the concepts of gradient boosting and stochastic learning to enhance the model’s generalizability and reduce the risk of overfitting.

Stratified sampling

Stratified sampling is a sampling method where the population is divided into distinct subgroups or strata, and a random sample is drawn from each stratum to ensure representation of different characteristics within the overall population.
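A minimal sketch of the procedure is shown below. It assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several allocation schemes; the function name, the `stratum_of` parameter, and the urban/rural data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # group units by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 urban and 40 rural units
units = [("urban", i) for i in range(60)] + [("rural", i) for i in range(40)]
picked = stratified_sample(units, stratum_of=lambda u: u[0], fraction=0.1, seed=1)
```

With a 10% fraction, the sample contains 6 urban and 4 rural units, mirroring the composition of the population.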

Sum of Squared Total (SST)

SST measures the total variation in the data: the sum of the squared differences between each observed value and the mean.
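The calculation can be sketched as follows, with hypothetical target values used only for illustration.

```python
def total_sum_of_squares(values):
    """Sum of squared deviations of each value from the mean."""
    mean_value = sum(values) / len(values)
    return sum((v - mean_value) ** 2 for v in values)

# Hypothetical target values with mean 4
y = [2, 4, 5, 4, 5]
sst = total_sum_of_squares(y)  # (4 + 0 + 1 + 0 + 1) = 6.0
```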

Supervised learning

A machine learning approach for structured data, in which a model is trained from labeled data to make predictions of a pre-defined target variable.

Systematic random sampling

Systematic random sampling is a sampling method where every nth member is selected from the population after randomly choosing a starting point, ensuring equal and consistent spacing between selected members.
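A short sketch of the procedure is given below. The function name, the step of 10, and the ordered population of 100 members are assumptions made for the example.

```python
import random

def systematic_sample(population, step, seed=None):
    """Pick a random start within the first interval, then every step-th member."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random starting point in [0, step)
    return population[start::step]

# Hypothetical ordered population of 100 members
population = list(range(100))
picked = systematic_sample(population, step=10, seed=3)
```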