DESeq2 tutorial website: (https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#how-to-get-help-for-deseq2)
Harvard’s workshop for DESeq2 analysis: (https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html)
We first need to load all the libraries we need to use in the script
library(DESeq2)
library(airway)
For this example we will use the dataset that comes from the airway package:
data("airway")
dds <- DESeqDataSetFromMatrix(countData = assay(airway),
colData = colData(airway),
design = ~ cell + dex)
Hmmmm… What is assay(airway)
or
colData(airway)
or what does ~cell+dex
mean?
Let’s try to figure out.
In R, anything that has round parenthesis is a function, so we can explore each function alone and see what objects it produce. Be careful though… Bioinformatics data might be very large! Explore the object before you try to print it!
x <- assay(airway)
First see what class of an object it is:
class(x)
[1] "matrix" "array"
Great! This is a matrix, so we should be able to find its dimensions:
dim(x)
[1] 64102 8
Since there are many rows but only 8 columns we can take a look at
the beginning of this matrix. head()
function works for
data frames, matrices and many other R objects.
head(x)
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 679 448 873 408 1138 1047 770 572
ENSG00000000005 0 0 0 0 0 0 0 0
ENSG00000000419 467 515 621 365 587 799 417 508
ENSG00000000457 260 211 263 164 245 331 233 229
ENSG00000000460 60 55 40 35 78 63 76 60
ENSG00000000938 0 0 2 0 1 0 0 0
Great! By looking at the first few rows, we now know that this matrix has row names and row columns.
Let’s now move to the next function - colData(airway)
.
Let’s see what it does:
x <- colData(airway)
Again, let’s see the “class” of this object:
class(x)
[1] "DFrame"
attr(,"package")
[1] "S4Vectors"
The output says “DFrame”. It is not exactly a data frame
object we saw before, but maybe something similar? Let’s try
head()
function:
head(x)
DataFrame with 6 rows and 9 columns
SampleName cell dex albut Run avgLength Experiment Sample BioSample
<factor> <factor> <factor> <factor> <factor> <integer> <factor> <factor> <factor>
SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568 SAMN02422669
SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567 SAMN02422675
SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571 SAMN02422678
SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572 SAMN02422670
SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575 SAMN02422682
SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576 SAMN02422673
Ah! It does look like a data frame! How many rows and how many columns does it have?
dim(x)
[1] 8 9
The last argument that was passed to the
`function was
design = ~ cell +
dex. The expression that follows the tilde
~` sign is a
formula. Usually it is a list of column names from a data frame
that we want to include in our analysis as independent variables.
So now, let’s execute that function:
dds <- DESeqDataSetFromMatrix(countData = assay(airway),
colData = colData(airway),
design = ~ cell + dex)
The output is stored in a variable dds
. What is in it?
What can be done with it. Again, try using functions we used before:
class(dds)
[1] "DESeqDataSet"
attr(,"package")
[1] "DESeq2"
Well, it is some “DESeq” Data Set. What can I do with it. Usually you will be given directions, but you also can look for help:
help(DESeqDataSet)
Let’s use other funtions to explore what is hidden in this object:
head(dds)
class: DESeqDataSet
dim: 6 8
metadata(1): version
assays(1): counts
rownames(6): ENSG00000000003 ENSG00000000005 ... ENSG00000000460 ENSG00000000938
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
Function str()
is very helpful, but for very complex
large objects, the output could be a little too big….
str(dds)
Now we are ready to perform differential expression analysis
dds_a <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
And again, it might be a good idea to understand what did we get as a
result. So we use the same functions - class
,
head
, str
to “see” what is hidden in this
object
class(dds_a)
[1] "DESeqDataSet"
attr(,"package")
[1] "DESeq2"
head(dds_a)
class: DESeqDataSet
dim: 6 8
metadata(1): version
assays(4): counts mu H cooks
rownames(6): ENSG00000000003 ENSG00000000005 ... ENSG00000000460 ENSG00000000938
rowData names(34): baseMean baseVar ... deviance maxCooks
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample sizeFactor
And the tutorial (or lecture directions) tells us to use
results()
function to “view” the results.
# Extract results
res <- results(dds_a)
# View the results
head(res)
log2 fold change (MLE): dex untrt vs trt
Wald test p-value: dex untrt vs trt
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 708.602170 0.3812540 0.100654 3.787752 0.000152016 0.00128363
ENSG00000000005 0.000000 NA NA NA NA NA
ENSG00000000419 520.297901 -0.2068126 0.112219 -1.842943 0.065337292 0.19654609
ENSG00000000457 237.163037 -0.0379204 0.143445 -0.264356 0.791505742 0.91145948
ENSG00000000460 57.932633 0.0881682 0.287142 0.307054 0.758801924 0.89503278
ENSG00000000938 0.318098 1.3782270 3.499873 0.393793 0.693733530 NA
# Plot results
plotMA(res, main="DESeq2", ylim=c(-2,2))
# Save results to a CSV file
write.csv(as.data.frame(res), file="DESeq2_results.csv")