Exploratory Analysis

Loading dataset files

setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #set to current directory

labels=read.csv("labels.csv")
features=read.csv("tpm_expression.csv")
covariates=read.csv("tcga_pancancer_lung_cov.csv")

Formatting

labels[,2]=factor(labels[,2])
colnames(labels)=c("ID","Label")

row.names(features)=features[,1]
features=features[,-1]

row.names(covariates)=covariates[,1]
covariates=as.data.frame(t(covariates[,-1]))
covariates[covariates == "[Not Applicable]" | covariates == "[Not Available]"] <- NA

head(labels)

##                             ID Label
## 1 TCGA-49-6742-11A-01R-1858-07  NORM
## 2 TCGA-77-8008-11A-01R-2187-07  NORM
## 3 TCGA-91-6836-11A-01R-1858-07  NORM
## 4 TCGA-91-6847-11A-01R-1949-07  NORM
## 5 TCGA-43-5670-11A-01R-2125-07  NORM
## 6 TCGA-44-6777-11A-01R-1858-07  NORM

dim(labels) #1091 observation

## [1] 1091    2

head(features[,1:2])

##             TCGA.49.6742.11A.01R.1858.07 TCGA.77.8008.11A.01R.2187.07
## 1/2-SBSRNA4                 1.398475e+00                 1.542257e+00
## A1BG                        4.073375e+00                 6.889316e+00
## A1BG-AS1                    8.639521e-01                 1.404587e+00
## A1CF                        5.852050e-03                 8.372394e-03
## A2LD1                       3.400021e+00                 3.847494e+00
## A2M                         1.339155e+03                 3.683596e+03

dim(features) #23368 genes as variables

## [1] 23368  1091

head(covariates[,1:2])

##              Diagnosis Age
## TCGA.49.6742          70.0
## TCGA.77.8008          68.0
## TCGA.91.6836          52.0
## TCGA.91.6847          62.0
## TCGA.43.5670          70.0
## TCGA.44.6777          85.0
##              Neoplasm Disease Stage American Joint Committee on Cancer Code
## TCGA.49.6742                                                      STAGE IIA
## TCGA.77.8008                                                       STAGE IB
## TCGA.91.6836                                                       STAGE IB
## TCGA.91.6847                                                       STAGE IB
## TCGA.43.5670                                                      STAGE IIA
## TCGA.44.6777                                                       STAGE IB

dim(covariates)

## [1] 1091   58

The dataset has 1091 observations, 23368 gens as variables, and 61 covariates available.

Classes

library(plotly)

## Warning: package 'plotly' was built under R version 4.2.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.2.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

class_counts <- table(labels[,2])
class_colors <- c("LUAD" = "red", "LUSC" = "green", "NORM" = "blue")
plot_ly(x = names(class_counts), y = class_counts, type = "bar", marker = list(color = class_colors)) %>%
  layout(title = "Distribution of classes", xaxis = list(title = "Class"), yaxis = list(title = "Count"))

We have a total of 1091 samples, the distribution of the classes is imbalanced. We have an higher number of samples of LUAD and LUSC cancer type, we have fewer normal/control samples

Features

Genes Distribution

set.seed(1)
random_genes <- sample(nrow(features), 30)
set.seed(2)
random_genes2 <- sample(nrow(features), 30)
p1 <- plot_ly()
p2 <- plot_ly()
for (i in random_genes) {
  dens= density(as.numeric(features[i,]))
  p1 <- add_lines(p1, x = dens$x, y = dens$y, name = rownames(features[i,]), 
                  line = list(color = i))
}
for (i in random_genes2) {
  dens= density(as.numeric(features[i,]))
  p2 <- add_lines(p2, x = dens$x, y = dens$y, name = rownames(features[i,]), 
                  line = list(color = i))
}

p1 <- layout(p1, xaxis = list(range = c(0, 7)), yaxis = list(range = c(0, 4)))
p2 <- layout(p2, xaxis = list(range = c(0, 7)), yaxis = list(range = c(0, 4)))

p1

p2

The distribution of the expression levels (TPM normalization here considered) across different genes is typically positively skewed. Gene expression values can only take positive values.

Can try to use log2(1+x) normalization (used in context of gene expression data), to make distribution of genes more normal, mitigate outliers and stabilise the variance (also within classes). In RNA-seq BARRA:curda dataset they use Deseq2 variance stabilising transformation (something that in results in similar to a log2 transformation), but stabilising around the mean of each classes, but such preprocessing could not be replicated in test set because we would have to use the class membership. As we want to mantain spirit of keeping test set just for the test we would not be able to replicate same preprocessing on new data and we should not train the model on such preprocessing. Using deseq2 stabilising transformation to stabilize around the overall mean probably would not be optimal given that NSC works around trying to differentiate among means (centroid) of different classes.

Logarithmic Transformation (Log2 or Log10): This is probably the most common transformation for gene expression data. It helps to stabilize variance, de-emphasize high counts, and bring data closer to a normal distribution, which is often assumed by subsequent statistical analysis methods.

Genes Correlation

library(gplots)

## Warning: package 'gplots' was built under R version 4.2.3

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

library(RColorBrewer)
library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.2.3

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:plotly':
## 
##     subplot

## The following objects are masked from 'package:base':
## 
##     format.pval, units

set.seed(1)
random_genes <- sample(nrow(features), 1000)
correlation_matrix <- rcorr(t(features[random_genes,])) #because for some gene sd of zero, base corr does not work
correlation_matrix$r[is.na(correlation_matrix$r)] <- 0

heatmap.2(correlation_matrix$r, 
          trace = "none", 
          margins = c(3,3), 
          col=colorRampPalette(brewer.pal(9,"GnBu"))(25),
          key = TRUE, 
          density.info = "histogram")

rm(correlation_matrix) #delete this heavy object

Pamr Classification algorithm model hypothesize (interpreted as DA particular case)that expression values across genes are uncorrelated. While in reality some clusters of genes might be naturally correlated, it seems a pretty reasonable semplification for the approach used and the dimensionality of the dataset. The heatmap for a sample of 1000 genes at a time do not show extremely strongly correlated clusters.