RMA normalization for microarray data

| category Bioinformatics  | tag GEO 

Why does microarray data need to be normalized?

account for technical variation between the arrays.

Methods

Robust Multichip Average (RMA) normalization.

Demo (GSE1297)

Step 1 Install packages

# install the core bioconductor packages, if not already installed
source("http://bioconductor.org/biocLite.R")
biocLite()

# install additional bioconductor libraries, if not already installed
biocLite("GEOquery")  # Get data from NCBI Gene Expression Omnibus (GEO)
biocLite("affy")  # Methods for Affymetrix Oligonucleotide Arrays
biocLite("hgu133a.db", type = "source")  # GSE1297: Platform_title = [HG-U133A]
biocLite("hgu133acdf")

Download the CEL file

library(GEOquery)

# Set working directory for download
setwd("D:/GEO/AD/GSE/GSE1297")

# Download the CEL file (by GSE - Geo series id), may take very long time
getGEOSuppFiles("GSE1297")
# this command does not work for this chip, so I need to download it
# directly Platform_title = [HG-U133A] Affymetrix Human Genome U133A Array
# Platform information can be found in SOFT file

# Unpack the CEL files
setwd("D:/GEO/AD/GSE/GSE1297/GSE1297")
untar("GSE1297_RAW.tar", exdir = "data")
cels = list.files("data/", pattern = "cel")
# sometiles, it is 'CEL', you need to check it first
sapply(paste("data", cels, sep = "/"), gunzip)
cels = list.files("data/", pattern = "cel")
# sometiles, it is 'CEL', you need to check it first

Perform RMA normalization

library(affy)
library(hgu133a.db)
library(hgu133acdf)

# Set working directory for normalization
setwd("D:/GEO/AD/GSE/GSE1297/GSE1297/data")
raw.data = ReadAffy(verbose = FALSE, filenames = cels, cdfname = "hgu133acdf")

# perform RMA normalization (log2)
data.rma.norm = rma(raw.data)
Background correcting
Normalizing
Calculating Expression

# Get the expression estimates for each array
rma = exprs(data.rma.norm)

# Take a look at the result (first 5 rows and 5 columes)
rma[1:5, 1:5]
GSM21203.cel GSM21204.cel GSM21205.cel GSM21206.cel GSM21207.cel
1007_s_at       10.606       10.339       10.390       10.523       10.791
1053_at          4.586        4.463        4.572        4.531        4.546
117_at           5.937        6.017        6.234       10.114        6.253
121_at           8.763        8.951        9.026        8.804        9.130
1255_g_at        4.361        4.439        4.598        4.019        4.354

# Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t")

Annotation

tt = cbind(row.names(rma), rma)
colnames(tt) = c("ProbID", sub(".cel", "", colnames(rma), ignore.case = TRUE))
rownames(tt) = NULL
tt[1:5, 1:5]
ProbID      GSM21203           GSM21204           GSM21205
[1,] "1007_s_at" "10.6061947450577" "10.3389916272064" "10.390181163724"
[2,] "1053_at"   "4.58630979543013" "4.46270777736086" "4.57178078332995"
[3,] "117_at"    "5.93684658044002" "6.01691505298677" "6.23421446338666"
[4,] "121_at"    "8.76288033305696" "8.9513454849938"  "9.02618022186855"
[5,] "1255_g_at" "4.3610859654913"  "4.43906808970735" "4.59752976632791"
GSM21206
[1,] "10.5230672101482"
[2,] "4.53054651020166"
[3,] "10.1138768429987"
[4,] "8.80361343714862"
[5,] "4.01915640899688"

require(RCurl)
myURL <- getURL("https://dl.dropboxusercontent.com/u/8272421/geo/HGU133A.na33.txt",
ssl.verifypeer = FALSE)
annot <- read.table(textConnection(myURL), header = TRUE, sep = "\t")
head(annot)
ProbeSetID EntrezGene
1  1007_s_at  100616237
2  1007_s_at        780
3    1053_at       5982
4     117_at       3310
5     121_at       7849
6  1255_g_at       2978

# probe sets were mapped to Entrez Gene IDs.
# comb=merge(annot,tt,by.x='ProbeSetID',by.y='ProbID',all.y=TRUE)
comb = merge(annot, tt, by.x = "ProbeSetID", by.y = "ProbID")
comb[1:5, 1:5]
ProbeSetID EntrezGene         GSM21203         GSM21204         GSM21205
1  1007_s_at  100616237 10.6061947450577 10.3389916272064  10.390181163724
2  1007_s_at        780 10.6061947450577 10.3389916272064  10.390181163724
3    1053_at       5982 4.58630979543013 4.46270777736086 4.57178078332995
4     117_at       3310 5.93684658044002 6.01691505298677 6.23421446338666
5     121_at       7849 8.76288033305696  8.9513454849938 9.02618022186855
write.table(comb, file = "comb2.txt", quote = FALSE, sep = "\t", row.names = FALSE)

# If multiple probe sets corresponded to the same gene, then the expression
# values of these probe sets were averaged.
comb2 <- subset(comb, select = -c(ProbeSetID))
comb2 <- data.frame(lapply(comb2, as.character), stringsAsFactors = FALSE)
comb2 <- data.frame(lapply(comb2, as.numeric), stringsAsFactors = FALSE)
out <- aggregate(. ~ EntrezGene, data = comb2, mean)

# Format values to 5 decimal places
out = format(out, digits = 5)
out[1:5, 1:5]
EntrezGene GSM21203 GSM21204 GSM21205 GSM21206
1          2   9.0928   9.7931   9.0962   9.8969
2          9   4.5077   4.3635   4.5445   4.3781
3         10   6.2881   6.2111   7.0938   5.7314
4         12  11.5590   8.0948   8.2142  12.3144
5         13   4.5742   4.6783   5.0740   4.5521
write.table(out, file = "GSE1297.RMA.txt", quote = FALSE, sep = "\t", row.names = FALSE)

Session information

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] RCurl_1.95-4.1       bitops_1.0-6         hgu133acdf_2.12.0
[4] hgu133a.db_2.9.0     org.Hs.eg.db_2.9.0   RSQLite_0.11.4
[7] DBI_0.2-7            AnnotationDbi_1.22.6 affy_1.38.1
[10] Biobase_2.20.1       BiocGenerics_0.6.0   knitr_1.5

loaded via a namespace (and not attached):
[1] affyio_1.28.0         BiocInstaller_1.10.4  evaluate_0.5.1
[4] formatR_0.10          IRanges_1.18.4        preprocessCore_1.22.0
[7] stats4_3.0.2          stringr_0.6.2         tools_3.0.2
[10] zlibbioc_1.6.0

Further Reading

Tutorial: Analysing microarray data in BioConductor
Using Bioconductor for Microarray Analysis
Methods of RMA Normalization for Affymetrix GeneChip Arrays
A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2):185-193


Previous     Next