In this article, we will delve into a comprehensive approach to analyzing the genetic diversity of potato using various statistical and multivariate analysis methods in R. The analysis encompasses techniques such as k-means clustering, hierarchical clustering, DBSCAN, and Principal Component Analysis (PCA), supported by clear visualizations to facilitate the interpretation of results. This guide aims to provide a robust framework for researchers and practitioners working in the field of plant genetics and diversity analysis.
Data Loading and Preparation
The first step in any analysis is to load and prepare the data. Here, we use R to read the data from an Excel file, select relevant columns, and scale the data. Scaling ensures that each variable contributes equally to the analysis, which is particularly important in clustering and PCA.
library(tidyverse)
library(readxl)

# Load the dataset from an Excel file
df <- read_excel("/Users/franklin/Documents/R/myblog/posts/conglomerados/DATOS_R_JUAN JOSE.xlsx")
head(df)
# Column selection for the analysis (excluding TDS due to lack of variation)
db <- df %>%
  select(2, 4, 15, 16, 19:21, 23:47)

# Data preparation: remove non-numeric columns and scale the data
sdf <- db %>%
  select(-2) %>%
  column_to_rownames("codigo_colecta")
sdf <- scale(sdf)
head(sdf)
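As a quick sanity check, scale() guarantees that every column ends up with mean 0 and standard deviation 1. A minimal base-R illustration on a random stand-in matrix (the same check applies to sdf):

```r
set.seed(1)

# Stand-in matrix; substitute sdf to check the real data
m <- scale(matrix(rnorm(40), nrow = 10))

round(colMeans(m), 10)  # column means are numerically zero
apply(m, 2, sd)         # column standard deviations are all 1
```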
K-means Clustering
K-means clustering is a partitioning method that divides data into a predetermined number of clusters, minimizing the within-cluster variance. The following steps include determining the optimal number of clusters and performing the clustering analysis.
Determining the Optimal Number of Clusters
Determining the optimal number of clusters is crucial for meaningful clustering. We employ several methods, including the within-cluster sum of squares (WSS), silhouette analysis, and Gap Statistic to identify the most appropriate number of clusters for the dataset.
library(factoextra)

# Compute distance matrix and visualize
distance <- get_dist(sdf)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# Elbow method for WSS
fviz_nbclust(sdf, kmeans, method = "wss")
The WSS method, often referred to as the “elbow method,” helps to identify the point where the marginal gain in explained variance diminishes sharply, indicating the optimal number of clusters. The silhouette method measures how similar an object is to its own cluster compared to other clusters. The gap statistic compares the total within-cluster variation for different numbers of clusters with its expected value under a null reference distribution of the data.
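Only the elbow plot is shown above; the silhouette and gap-statistic diagnostics use the same fviz_nbclust() helper from factoextra. A sketch, run here on scale(USArrests) as a self-contained stand-in for sdf:

```r
library(factoextra)

x <- scale(USArrests)  # stand-in for the scaled dataset sdf

# Average silhouette width across a range of k; the peak suggests the best k
fviz_nbclust(x, kmeans, method = "silhouette")

# Gap statistic; nboot sets the number of bootstrap reference sets
set.seed(123)
fviz_nbclust(x, kmeans, method = "gap_stat", nboot = 50)
```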
Applying the K-means Algorithm
After identifying the optimal number of clusters, we apply the k-means algorithm and visualize the clusters.
set.seed(123)
kmeans_resultados <- kmeans(sdf, centers = 4)  # Replace 4 with the optimal number of clusters

# Visualization of clusters
fviz_cluster(kmeans_resultados, data = sdf)
The results from k-means clustering provide insight into how the samples are grouped based on their genetic characteristics. Each cluster represents a group of samples that share similar traits, and the visualizations help in understanding the distribution of these clusters within the dataset.
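One way to attach meaning to the groups is to inspect the per-cluster means of the scaled variables, which takes only base R. A sketch on a built-in stand-in dataset (substitute sdf and kmeans_resultados for x and km):

```r
set.seed(123)
x <- scale(USArrests)         # stand-in for sdf
km <- kmeans(x, centers = 4)  # stand-in for kmeans_resultados

# Mean of each scaled variable within each cluster; values far from 0
# indicate traits that distinguish that cluster from the overall average
aggregate(x, by = list(cluster = km$cluster), FUN = mean)
```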
Hierarchical Clustering
Hierarchical clustering is another clustering technique that creates a hierarchy of clusters, which can be visualized as a tree or dendrogram. This method is particularly useful for visualizing the relationships between clusters and understanding the structure of the data.
Generating the Dendrogram
We begin by calculating the distance matrix using the Euclidean distance and applying the hierarchical clustering algorithm using Ward’s method, which minimizes the variance within clusters.
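The snippet for this step does not appear in the post; a minimal sketch of what it would look like, assuming the object name res.hc used in the code below:

```r
# sdf is the scaled data matrix prepared in the Data Loading section

# Euclidean distance matrix on the scaled data
d <- dist(sdf, method = "euclidean")

# Ward's minimum-variance hierarchical clustering
res.hc <- hclust(d, method = "ward.D2")
```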
# Cut the dendrogram to form clusters
grupos <- cutree(res.hc, k = 4)  # Replace 4 with the desired number of clusters
The dendrogram allows us to visualize how the clusters are formed and how individual data points or groups of points merge as the level of clustering increases. By cutting the dendrogram at a particular height, we can extract a specific number of clusters that represent the underlying structure of the data.
Assessment and Visualization
Different methods of hierarchical clustering can be compared to evaluate which method best captures the structure of the data. We then visualize the dendrogram with colored clusters for better interpretability.
# Comparison of different linkage methods
library(cluster)  # provides agnes()

m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")

# Agglomerative coefficient for a given linkage method
ac <- function(x) {
  agnes(sdf, method = x)$ac
}

map_dbl(m, ac)
average single complete ward
0.5463403 0.4630005 0.6197599 0.7498075
# Visualize dendrogram with colored clusters
fviz_dend(res.hc, cex = 0.5, k = 4, rect = TRUE,
          k_colors = "jco", rect_border = "jco", rect_fill = TRUE)
Here, we assess different linkage methods: average, single, complete, and Ward’s. The agglomerative coefficient (AC) is computed for each method, with values closer to 1 indicating a stronger clustering structure; in this dataset, Ward’s method scores highest (0.75). The final dendrogram is color-coded to represent the different clusters, providing a clear visual summary of the hierarchical structure.
Density-Based Clustering: DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a powerful clustering technique that identifies clusters based on density, allowing for the detection of outliers. Unlike k-means, DBSCAN does not require the number of clusters to be specified a priori, making it particularly useful for datasets with irregular cluster shapes.
Finding the Optimal Epsilon and Applying DBSCAN
The epsilon parameter controls the neighborhood radius for density estimation. We begin by identifying an appropriate epsilon value through a k-NN distance plot.
library(dbscan)

# Identify a candidate epsilon using the k-NN distance plot
kNNdistplot(sdf, k = 5)
abline(h = 0.5, col = "red")  # Adjust according to the plot
The k-NN distance plot helps to identify a natural choice for epsilon by observing where the plot exhibits the most pronounced change in slope. The DBSCAN algorithm is then applied, and the results are visualized. Clusters are identified based on dense regions of points, with outliers being points that do not belong to any cluster.
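The application step itself is not shown above; a sketch, where the eps value mirrors the red line in the k-NN plot and is an assumption to be tuned to your own data:

```r
library(dbscan)
library(factoextra)

# sdf is the scaled data matrix; eps and minPts are assumptions to tune.
# Points assigned cluster 0 are treated as noise/outliers.
res.db <- dbscan(sdf, eps = 0.5, minPts = 5)
table(res.db$cluster)

# Cluster visualization; factoextra accepts dbscan results directly
fviz_cluster(res.db, data = sdf, geom = "point")
```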
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the data into a set of orthogonal components, explaining the maximum variance with the fewest components. PCA is especially useful for visualizing high-dimensional data in two or three dimensions.
Performing PCA and Visualizing Results
We perform PCA on the scaled data and visualize the results, focusing on the variance explained by each component and the relationship between variables and samples.
# Perform PCA
acp_resultados <- prcomp(sdf, center = TRUE, scale. = TRUE)

# Summary of PCA results
summary(acp_resultados)

# Scree plot showing the variance explained by each component
fviz_eig(acp_resultados)

# Plot of individuals in the space of the first two principal components
fviz_pca_ind(acp_resultados, geom.ind = "point", pointshape = 21, pointsize = 2,
             fill.ind = "blue", col.ind = "black", repel = TRUE)

# Plot of variables in the space of the first two principal components
fviz_pca_var(acp_resultados, col.var = "contrib",
             gradient.cols = c("blue", "yellow", "red"), repel = TRUE)

# Biplot showing both individuals and variables
fviz_pca_biplot(acp_resultados, repel = TRUE)
The scree plot shows the proportion of variance explained by each principal component, aiding in the selection of the number of components to retain. The PCA biplots allow us to interpret the contribution of variables to the principal components and observe the distribution of samples along these components. This visualization is key to understanding the underlying structure of the data and the relationships between different genetic traits.
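The proportions behind the scree plot can also be computed directly from the prcomp object in base R; a sketch on a built-in stand-in dataset (substitute acp_resultados for acp):

```r
acp <- prcomp(scale(USArrests))  # stand-in for acp_resultados

# Proportion of total variance carried by each principal component:
# each squared standard deviation is an eigenvalue of the correlation matrix
var_explained <- acp$sdev^2 / sum(acp$sdev^2)

round(cumsum(var_explained), 3)  # cumulative proportion, reaching 1 at the last PC
```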
Factor Analysis of Mixed Data (FAMD)
For datasets containing both quantitative and qualitative variables, Factor Analysis of Mixed Data (FAMD) is applied. FAMD blends PCA (for the quantitative variables) with multiple correspondence analysis (for the qualitative ones), and is particularly useful in genetics studies where both types of data are often present.
Performing FAMD and Interpreting Results
We perform FAMD on the mixed dataset and visualize the results, focusing on the contributions of quantitative variables and the distribution of individuals across groups.
# Convert character variables to factors
df2 <- db %>%
  column_to_rownames("codigo_colecta") %>%
  mutate_if(is.character, as.factor)

library(FactoMineR)

# Perform FAMD
res.famd <- FAMD(df2, graph = FALSE)
print(res.famd)
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues and inertia"
2 "$var" "Results for the variables"
3 "$ind" "results for the individuals"
4 "$quali.var" "Results for the qualitative variables"
5 "$quanti.var" "Results for the quantitative variables"
# Scree plot showing the percentage of variance explained by each component
fviz_screeplot(res.famd)

# Plot showing the contribution of quantitative variables
quanti.var <- get_famd_var(res.famd, "quanti.var")
fviz_famd_var(res.famd, "quanti.var", repel = TRUE, col.var = "black")

# Visualization of individuals colored by group
# (set addEllipses = TRUE to draw the confidence ellipses)
fviz_famd_ind(res.famd, habillage = "Departamento",
              palette = c("#00AFBB", "#E7B800", "#FC4E07", "darkgreen"),
              addEllipses = FALSE, ellipse.type = "confidence",
              repel = TRUE, label = "none")
The FAMD decomposes the data into factors that explain the most variation while accounting for both variable types. The scree plot helps determine how many factors to retain, while the visualizations provide insight into the contributions of individual variables and the grouping of samples based on both qualitative and quantitative characteristics.
Conclusion
This workflow provides a comprehensive and detailed framework for conducting genetic diversity analysis in potato using various clustering and multivariate analysis techniques in R. By employing different methods, researchers can gain a deeper understanding of the genetic structure within their dataset, identify meaningful patterns of variation, and ultimately make informed decisions in breeding programs and conservation efforts. The choice of method depends on the specific nature of the data and the research objectives, whether it is to identify distinct genetic clusters, reduce data dimensionality, or both.