setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #set to current directory
labels=read.csv("labels.csv")
features=read.csv("tpm_expression.csv")
covariates=read.csv("tcga_pancancer_lung_cov.csv")
labels[,2]=factor(labels[,2])
colnames(labels)=c("ID","Label")
row.names(features)=features[,1]
features=features[,-1]
row.names(covariates)=covariates[,1]
covariates=as.data.frame(t(covariates[,-1]))
covariates[covariates == "[Not Applicable]" | covariates == "[Not Available]"] <- NA
head(labels)
## ID Label
## 1 TCGA-49-6742-11A-01R-1858-07 NORM
## 2 TCGA-77-8008-11A-01R-2187-07 NORM
## 3 TCGA-91-6836-11A-01R-1858-07 NORM
## 4 TCGA-91-6847-11A-01R-1949-07 NORM
## 5 TCGA-43-5670-11A-01R-2125-07 NORM
## 6 TCGA-44-6777-11A-01R-1858-07 NORM
dim(labels) #1091 observation
## [1] 1091 2
head(features[,1:2])
## TCGA.49.6742.11A.01R.1858.07 TCGA.77.8008.11A.01R.2187.07
## 1/2-SBSRNA4 1.398475e+00 1.542257e+00
## A1BG 4.073375e+00 6.889316e+00
## A1BG-AS1 8.639521e-01 1.404587e+00
## A1CF 5.852050e-03 8.372394e-03
## A2LD1 3.400021e+00 3.847494e+00
## A2M 1.339155e+03 3.683596e+03
dim(features) #23368 genes as variables
## [1] 23368 1091
head(covariates[,1:2])
## Diagnosis Age
## TCGA.49.6742 70.0
## TCGA.77.8008 68.0
## TCGA.91.6836 52.0
## TCGA.91.6847 62.0
## TCGA.43.5670 70.0
## TCGA.44.6777 85.0
## Neoplasm Disease Stage American Joint Committee on Cancer Code
## TCGA.49.6742 STAGE IIA
## TCGA.77.8008 STAGE IB
## TCGA.91.6836 STAGE IB
## TCGA.91.6847 STAGE IB
## TCGA.43.5670 STAGE IIA
## TCGA.44.6777 STAGE IB
dim(covariates)
## [1] 1091 58
The dataset has 1091 observations, 23368 gens as variables, and 61 covariates available.
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
class_counts <- table(labels[,2])
class_colors <- c("LUAD" = "red", "LUSC" = "green", "NORM" = "blue")
plot_ly(x = names(class_counts), y = class_counts, type = "bar", marker = list(color = class_colors)) %>%
layout(title = "Distribution of classes", xaxis = list(title = "Class"), yaxis = list(title = "Count"))
We have a total of 1091 samples, the distribution of the classes is imbalanced. We have an higher number of samples of LUAD and LUSC cancer type, we have fewer normal/control samples
set.seed(1)
random_genes <- sample(nrow(features), 30)
set.seed(2)
random_genes2 <- sample(nrow(features), 30)
p1 <- plot_ly()
p2 <- plot_ly()
for (i in random_genes) {
dens= density(as.numeric(features[i,]))
p1 <- add_lines(p1, x = dens$x, y = dens$y, name = rownames(features[i,]),
line = list(color = i))
}
for (i in random_genes2) {
dens= density(as.numeric(features[i,]))
p2 <- add_lines(p2, x = dens$x, y = dens$y, name = rownames(features[i,]),
line = list(color = i))
}
p1 <- layout(p1, xaxis = list(range = c(0, 7)), yaxis = list(range = c(0, 4)))
p2 <- layout(p2, xaxis = list(range = c(0, 7)), yaxis = list(range = c(0, 4)))
p1
p2
The distribution of the expression levels (TPM normalization here considered) across different genes is typically positively skewed. Gene expression values can only take positive values.
Can try to use log2(1+x) normalization (used in context of gene expression data), to make distribution of genes more normal, mitigate outliers and stabilise the variance (also within classes). In RNA-seq BARRA:curda dataset they use Deseq2 variance stabilising transformation (something that in results in similar to a log2 transformation), but stabilising around the mean of each classes, but such preprocessing could not be replicated in test set because we would have to use the class membership. As we want to mantain spirit of keeping test set just for the test we would not be able to replicate same preprocessing on new data and we should not train the model on such preprocessing. Using deseq2 stabilising transformation to stabilize around the overall mean probably would not be optimal given that NSC works around trying to differentiate among means (centroid) of different classes.
Logarithmic Transformation (Log2 or Log10): This is probably the most common transformation for gene expression data. It helps to stabilize variance, de-emphasize high counts, and bring data closer to a normal distribution, which is often assumed by subsequent statistical analysis methods.
library(gplots)
## Warning: package 'gplots' was built under R version 4.2.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(RColorBrewer)
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.2.3
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:plotly':
##
## subplot
## The following objects are masked from 'package:base':
##
## format.pval, units
set.seed(1)
random_genes <- sample(nrow(features), 1000)
correlation_matrix <- rcorr(t(features[random_genes,])) #because for some gene sd of zero, base corr does not work
correlation_matrix$r[is.na(correlation_matrix$r)] <- 0
heatmap.2(correlation_matrix$r,
trace = "none",
margins = c(3,3),
col=colorRampPalette(brewer.pal(9,"GnBu"))(25),
key = TRUE,
density.info = "histogram")
rm(correlation_matrix) #delete this heavy object
Pamr Classification algorithm model hypothesize (interpreted as DA particular case)that expression values across genes are uncorrelated. While in reality some clusters of genes might be naturally correlated, it seems a pretty reasonable semplification for the approach used and the dimensionality of the dataset. The heatmap for a sample of 1000 genes at a time do not show extremely strongly correlated clusters.