Motif Discovery and Enrichment
Guanjing Hu
10-20-2021
Objectives
Questions
- How did the promoters differentiate between cotton (sub)genomes?
- What motifs are differentially enriched between cotton (sub)genomes?
Input Data
1. Promoter accessible chromatin regions (ACRs) identified
A2_ACRs_promoter1k.fasta
D5_ACRs_promoter1k.fasta
F1a_ACRs_promoter1k.fasta
F1d_ACRs_promoter1k.fasta
AD1a_ACRs_promoter1k.fasta
AD1d_ACRs_promoter1k.fasta
df[df$Feature=="Promoter (<=1kb)",]
# Feature Number RegionSize NumberPerc RegionSizePerc Sample ploidy
# 1 Promoter (<=1kb) 1046436 17894 0.3782327 0.3597869 A2 Diploid
# 12 Promoter (<=1kb) 849325 17826 0.3649802 0.3628998 D5 Diploid
# 23 Promoter (<=1kb) 863736 15950 0.3410552 0.3335355 F1a F1
# 34 Promoter (<=1kb) 907249 17345 0.3687499 0.3653656 F1d F1
# 45 Promoter (<=1kb) 1830576 24044 0.4619651 0.4185423 AD1a AD1
# 56 Promoter (<=1kb) 1529891 22074 0.4615450 0.4205853 AD1d AD1
2. Promter sequences as control
A2_promoter1k.fasta
- A2 WHU refD5_promoter1k.fasta
- D5 JGI refAD1a_promoter1k.fasta
- AD1 TM-1 UTX refAD1d_promoter1k.fasta
- AD1 TM-1 UTX ref
3. Motif database
Ath_TF_binding_motifs.meme
downloaded from http://planttfdb.gao-lab.org/download/motif/Ath_TF_binding_motifs.meme.gz , 619 motifs;Ath_TF_binding_motifs_information.txt
downloaded from http://planttfdb.gao-lab.org/download/motif/Ath_TF_binding_motifs_information.txt
## load motifi info
setwd("~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3")
x<-read.table("0.input/Ath_TF_binding_motifs_information.txt",header=T,sep="\t")
head(x)
# Gene_id Family Matrix_id Species Method Datasource Datasource_ID
# 1 AT1G01060 MYB_related MP00119 Arabidopsis thaliana DAP PMID:27203113 AT1G01060
# 2 AT1G01250 ERF MP00120 Arabidopsis thaliana DAP PMID:27203113 AT1G01250
# 3 AT1G01260 bHLH MP00100 Arabidopsis thaliana PBM PMID:26531826 MA0958.1
table(x$Family)
# AP2 ARF ARR-B B3 BBR-BPC
# 8 6 5 10 3
# BES1 bHLH bZIP C2H2 C3H
# 6 32 35 27 4
# CAMTA CPP Dof E2F/DP EIL
# 4 5 22 5 2
# ERF FAR1 G2-like GATA GeBP
# 80 2 21 15 3
# GRAS GRF HD-ZIP HSF LBD
# 1 1 22 10 8
# LFY M-type_MADS MIKC_MADS MYB MYB_related
# 1 2 18 67 23
# NAC NF-YB Nin-like RAV S1Fa-like
# 59 2 3 4 1
# SBP SRS TCP Trihelix VOZ
# 12 2 18 14 1
# WOX WRKY YABBY ZF-HD
# 3 43 2 7
-
JASPAR2018_CORE_plants_non-redundant.meme
downloaded from https://meme-suite.org/meme/db/motifs -
JASPAR2022_CORE_plants_non-redundant_pfms_transfac.txt
andJASPAR2022_CORE_non-redundant_pfms_meme.txt
downloaded from https://jaspar.genereg.net/downloads/, contains 656 motifs with TF family information## load motifi info y<-as.data.frame(matrix(x,ncol=3,byrow=T)) names(y)<-c("id","family","class") head(y) # id family class # 1 MA0020.1 DOF Other C4 zinc finger-type factors # 2 MA0021.1 DOF Other C4 zinc finger-type factors # 3 MA0034.1 Myb Tryptophan cluster factors # 4 MA0053.1 DOF Other C4 zinc finger-type factors motifs<-y
MEME Installation
Download meme-5.4.1.tar.gz
from https://meme-suite.org/meme/doc/download.html and follow the online instruction
cd /Users/guanjinghu/tools/meme-5.4.1
./configure --prefix=$HOME/meme --enable-build-libxml2 --enable-build-libxslt
make
make test
make install
export PATH=/Users/guanjinghu/meme/bin:/Users/guanjinghu/meme/libexec/meme-5.4.1:$PATH
Analysis
1. Compare promoters
FIMO scanning motifs in 1kb promoters
## scan JASPAR 2018 nr
# promter
fimo --o ~/Desktop/jA2.p1k 0.input/JASPAR2018_CORE_plants_non-redundant.meme 0.input/A2_promoter1k.fasta
fimo --o ~/Desktop/jD5.p1k 0.input/JASPAR2018_CORE_plants_non-redundant.meme 0.input/D5_promoter1k.fasta
fimo --o ~/Desktop/jAD1a.p1k 0.input/JASPAR2018_CORE_plants_non-redundant.meme 0.input/AD1a_promoter1k.fasta
fimo --o ~/Desktop/jAD1d.p1k 0.input/JASPAR2018_CORE_plants_non-redundant.meme 0.input/AD1d_promoter1k.fasta
## scan JASPAR 2022 nr
# promter
fimo --o ~/Desktop/jA2.p1k 0.input/JASPAR2022_CORE_non-redundant_pfms_meme.txt 0.input/A2_promoter1k.fasta
fimo --o ~/Desktop/jD5.p1k 0.input/JASPAR2022_CORE_non-redundant_pfms_meme.txt 0.input/D5_promoter1k.fasta
fimo --o ~/Desktop/jAD1a.p1k 0.input/JASPAR2022_CORE_non-redundant_pfms_meme.txt 0.input/AD1a_promoter1k.fasta
fimo --o ~/Desktop/jAD1d.p1k 0.input/JASPAR2022_CORE_non-redundant_pfms_meme.txt 0.input/AD1d_promoter1k.fasta
# failed, error in meme
#The PSPM of motif MA0207.1 has probabilities which don't sum to 1 on row 6.
#The PSPM of motif MA0207.1 has unparsable characters "URL #http://jaspar.genereg.net/matrix/MA0207.1" on the end of row 6. Only spaces or tabs are allowed on the same line as the numbers.
#FATAL: An error occurred reading the motif file.
## scan At TFDB
# promoter
fimo --o ~/Desktop/jA2.p1k 0.input/Ath_TF_binding_motifs.meme 0.input/A2_promoter1k.fasta
fimo --o ~/Desktop/jD5.p1k 0.input/Ath_TF_binding_motifs.meme 0.input/D5_promoter1k.fasta
fimo --o ~/Desktop/jAD1a.p1k 0.input/Ath_TF_binding_motifs.meme 0.input/AD1a_promoter1k.fasta
fimo --o ~/Desktop/jAD1d.p1k 0.input/Ath_TF_binding_motifs.meme 0.input/AD1d_promoter1k.fasta
Compare motif occurance between genomes
## load scanning results
setwd("~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/1.PromoterComparison")
# loop input
fl<-paste0(list.files(path="~/Desktop",pattern="p1k",full.names=T),"/fimo.tsv"); fl
# "jA2.p1k/fimo.tsv" "jAD1a.p1k/fimo.tsv" "jAD1d.p1k/fimo.tsv" "jD5.p1k/fimo.tsv"
for(i in fl)
{
x<-fread(i,select=1)
y<-as.data.frame(table(x$motif_id))
names(y)=c("MOTIF",gsub(".acr/.*","",i))
if(exists("res")){res=merge(res,y,all.x=T,all.y=T)}else{res=y}
}
head(res)
# MOTIF jA2.p1k/fimo.tsv jAD1a.p1k/fimo.tsv jAD1d.p1k/fimo.tsv jD5.p1k/fimo.tsv
# 1 MA0001.2 16660 14521 15870 15278
# 2 MA0005.2 15275 13314 14691 14240
# 3 MA0008.2 24341 25420 25707 24890
## compare promoters
x<-res[,-1]
names(x)<-gsub(".p1k/fimo.tsv|j","",names(x))
rownames(x)<-res$MOTIF
write.table(x,"comparePromoter.jaspar2018.txt",sep="\t")
# repeat for "comparePromoter.athMotif619.txt"
Examine promoter results
library("pheatmap")
for(fl in c("comparePromoter.jaspar2018.txt","comparePromoter.athMotif619.txt"))
{
print(fl)
x<-read.table(fl,header=T, sep="\t")
xx=log2(x)
# colSums
print(colSums(x))
print(colSums(xx))
# correlations
print(cor(x))
print(cor(xx))
# plot
pdf(gsub(".txt",".pdf",fl))
boxplot(x,main="Freq")
boxplot(xx, main="log2 Freq")
plot(x,main="Freq")
pheatmap(cor(x),main="Correlation matrix")
plot(xx, main="log2 Freq")
pheatmap(cor(xx),main="Correlation matrix, log2")
pheatmap(x, scale = "none", fontsize_row=3, main="Euclidean, unscaled") # default euclidean distance
pheatmap(xx, scale = "none", fontsize_row=3, main="log2, Euclidean, unscaled") # default euclidean distance
pheatmap(xx, scale = "row", fontsize_row=3, main="log2, Euclidean, row scaled") # default euclidean distance
pheatmap(xx, scale = "column", fontsize_row=3, main="log2, Euclidean, column scaled") # default euclidean distance
dev.off()
}
Print Output
[1] "comparePromoter.jaspar2018.txt"
A2 AD1a AD1d D5
6661743 5984436 6226172 6205415
A2 AD1a AD1d D5
6215.696 6089.921 6169.104 6135.315
A2 AD1a AD1d D5
A2 1.0000000 0.9399910 0.9433270 0.9609195
AD1a 0.9399910 1.0000000 0.9859531 0.9740298
AD1d 0.9433270 0.9859531 1.0000000 0.9594149
D5 0.9609195 0.9740298 0.9594149 1.0000000
A2 AD1a AD1d D5
A2 1.0000000 0.9853390 0.9839880 0.9853099
AD1a 0.9853390 1.0000000 0.9970224 0.9955319
AD1d 0.9839880 0.9970224 1.0000000 0.9968052
D5 0.9853099 0.9955319 0.9968052 1.0000000
[1] "comparePromoter.athMotif619.txt"
A2 AD1a AD1d D5
8793873 8027567 8324093 8131779
A2 AD1a AD1d D5
8029.477 7860.305 7949.761 7902.994
A2 AD1a AD1d D5
A2 1.0000000 0.9579867 0.9653300 0.9600881
AD1a 0.9579867 1.0000000 0.9828834 0.9993615
AD1d 0.9653300 0.9828834 1.0000000 0.9820444
D5 0.9600881 0.9993615 0.9820444 1.0000000
A2 AD1a AD1d D5
A2 1.0000000 0.9872215 0.9867716 0.9854858
AD1a 0.9872215 1.0000000 0.9975637 0.9983553
AD1d 0.9867716 0.9975637 1.0000000 0.9986782
D5 0.9854858 0.9983553 0.9986782 1.0000000
Above are based on motif occurance, let us check TF types
info<-read.table("~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/0.input/Ath_TF_binding_motifs_information.txt",header=T, sep="\t")
#
fl="comparePromoter.athMotif619.txt"
x<-read.table(fl,header=T, sep="\t")
f<-info$Family[match(rownames(x),info$Gene_id)]
tf<-aggregate(x,by=list(f),sum)
x<-tf[,-1]
rownames(x)<-tf$Group.1
xx<-log2(x)
#
pdf(gsub(".txt",".TF.pdf",fl))
boxplot(x,main="Freq")
boxplot(xx, main="log2 Freq")
plot(x,main="Freq")
pheatmap(cor(x),main="Correlation matrix")
plot(xx, main="log2 Freq")
pheatmap(cor(xx),main="Correlation matrix, log2")
pheatmap(x, scale = "none", fontsize_row=8, main="Euclidean, unscaled") # default euclidean distance
pheatmap(xx, scale = "none", fontsize_row=8, main="log2, Euclidean, unscaled") # default euclidean distance
pheatmap(xx, scale = "row", fontsize_row=8, main="log2, Euclidean, row scaled") # default euclidean distance
pheatmap(xx, scale = "column", fontsize_row=8, main="log2, Euclidean, column scaled") # default euclidean distance
dev.off()
Conclusions: A2 promoter is different from the other three, so the background should be considered for comparing motif enrichment.
2. Obtain enriched motifs for each ACR list
2.1 Run AME for each ACR list, using the corresponding (sub)genome promoter sequences as control to obtain enriched motifs from Ath_TF_binding_motifs.meme
info<-read.table("~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/0.input/Ath_TF_binding_motifs_information.txt",header=T, sep="\t")
#
setwd("~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/2.MotifEnrich")
# loop input
fl<-list.files(pattern="tsv"); fl
# "ame_A2.tsv" "ame_D5.tsv" "ame_F1a.tsv" "ame_F1d.tsv"
for(i in fl)
{
x<-read.table(i,header=T,sep="\t")
# x<-x[,c("motif_ID","adj_p.value")]
x<-x[,c("motif_ID","rank")]
names(x)[2]<-gsub("ame_|.tsv","",i)
if(exists("res")){res=merge(res,x,all.x=T,all.y=T)}else{res=x}
}
x<-res[,-1]
rownames(x)<-res$motif_ID
head(x)
# rank
z<-x
z[is.na(z)]<-600 # NA as insignaificant
# binary
y<-x[,c("A2","D5","F1a","F1d","AD1a","AD1d")]
y[!is.na(y)]<-1
y[is.na(y)]<-0
colSums(y)
# A2 D5 F1a F1d AD1a AD1d
# 329 307 274 313 380 388
dim(y)
# 423 6
family= info$Family[match(rownames(y),info$Gene_id)]
pattern=apply(y[,1:6],1, function(x){paste0(x,collapse="")})
table(pattern)
# 000001 000010 000011 000100 000101 000110 000111 001000
# 15 13 33 2 2 1 2 1
# 001011 001111 010000 010010 010011 010101 010111 011000
# 1 2 1 2 4 3 8 1
# 011111 100000 100001 100010 100011 100100 100101 100111
# 3 7 3 2 9 1 2 6
# 101001 101010 101011 101111 110000 110011 110101 110111
# 1 1 5 7 1 7 2 23
# 111011 111100 111110 111111
# 3 1 1 247
# cross tabulation of motif family and grouping
mp<-as.matrix(table(family,pattern))
head(mp)
## plot
library(pheatmap)
library(RColorBrewer)
library(UpSetR)
pdf("compareMotifs.pdf")
# color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(100)
pheatmap(z, scale = "none", fontsize_row=3, main="Rank: Euclidean, unscaled",colorRampPalette(rev(brewer.pal(n = 11, name =
"Spectral")))(100))
pheatmap(cor(z), scale = "none", fontsize_row=8, main="Correlation",colorRampPalette(brewer.pal(n = 8, name = "Oranges"))(50))
upset(y,nset=ncol(y),order.by="freq",decreasing=TRUE)
upset(y,nset=ncol(y))
pie(table(family),main=paste("Enriched motifs"))
for(i in unique(pattern))
{
use=family[pattern==i]
pie(table(use),main=paste(i,length(use)))
}
pheatmap(mp[,colnames(mp)!="111111"],main="Cross tabulation of motif TF family and grouping")
pheatmap(mp,scale="column",main="Cross tabulation of motif TF family and grouping, column scaled")
dev.off()
## extract motifs not common to all
mm<-as.data.frame(pattern[pattern!="111111"])
write.table(mm,"diffEnrichMotifs.txt",sep="\t")
write.table(cbind(x,pattern,family),"EnrichMotifs.txt",sep="\t")
2.2 Run XSTREME for each ACR list, using the corresponding (sub)genome promoter sequences as control to obtain enriched motifs
cd /Users/guanjinghu/Nutstore\ Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/2.MotifEnrich
# examine output file
ls XSTREME/*N/xstreme.txt
# 1.XSTREME/A2/xstreme.txt 1.XSTREME/D5/xstreme.txt
# 1.XSTREME/AD1a/xstreme.txt 1.XSTREME/F1a/xstreme.txt
# 1.XSTREME/AD1d/xstreme.txt 1.XSTREME/F1d/xstreme.txt
# check command to make sure input and parameters correct
tail -2 XSTREME/*N/xstreme.tsv
# check number of motif find
for j in $(ls XSTREME/*N/xstreme.tsv);do;
echo $j
tail -5 $j |head -1|cut -f1
done
# 1.XSTREME/A2N/xstreme.tsv 117
# 1.XSTREME/AD1aN/xstreme.tsv 367
# 1.XSTREME/AD1dN/xstreme.tsv 352
# 1.XSTREME/D5N/xstreme.tsv 133
# 1.XSTREME/F1a/xstreme.tsv 3 ??????? what is wrong
# 1.XSTREME/F1d/xstreme.tsv 143
# check number of seed motif find
for j in $(ls XSTREME/*N/xstreme.txt);do;
echo $j
grep -c "^MOTIF" $j
done
# XSTREME/A2N/xstreme.txt 25
# 1.XSTREME/AD1aN/xstreme.txt 47
# 1.XSTREME/AD1dN/xstreme.txt 45
# 1.XSTREME/D5N/xstreme.txt 36
# 1.XSTREME/F1aN/xstreme.txt 3
# 1.XSTREME/F1dN/xstreme.txt 28
# check Background frequency
for j in $(ls XSTREME/*N/xstreme.txt);do;
echo $j
sed -n 27p $j
done
# Background letter frequencies (from file `./background'):
#1.XSTREME/A2N/xstreme.txt A 0.34720 C 0.15280 G 0.15280 T 0.34720 #1.XSTREME/AD1aN/xstreme.txt A 0.35960 C 0.14040 G 0.14040 T 0.35960
#1.XSTREME/D5N/xstreme.txt A 0.35640 C 0.14360 G 0.14360 T 0.35640
#1.XSTREME/AD1dN/xstreme.txt A 0.35510 C 0.14490 G 0.14490 T 0.35510
#. see the drop of GC. content in AD1a promoters?? What does it mean by openness?
# merge meme
ls XSTREME/*N/xstreme.txt
cat XSTREME/*N/xstreme.txt |grep -c "^MOTIF"
# 184
cat XSTREME/*N/xstreme.txt |grep "^MOTIF" |sort |uniq |wc -l
# 121
meme2meme XSTREME/*N/xstreme.txt >1.XSTREME/merged.txt
grep -c "^MOTIF" XSTREME/merged.txt #184
# manually remove reduancy
grep -c "^MOTIF" XSTREME/merged_nr.txt #121
3. Cluster enriched motif
3.1 Try STAMP http://www.benoslab.pitt.edu/stamp/ with default
3.2 Run RSAT Matrix-Clustering for XTREME results to cluster all seed motifs obtained
pwd
# /Users/guanjinghu/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/2.MotifEnrich
# prep xtreme results for RSAT, must contain the Background line
cat XSTREME/A2N/xstreme.txt | sed s/'^MOTIF '/'MOTIF A2-'/g >../3.Clustering/XTREMEseedFull/xstremeA2.txt
cat XSTREME/D5N/xstreme.txt | sed s/'^MOTIF '/'MOTIF D5-'/g >../3.Clustering/XTREMEseedFull/xstremeD5.txt
cat XSTREME/F1aN/xstreme.txt | sed s/'^MOTIF '/'MOTIF F1a-'/g >../3.Clustering/XTREMEseedFull/xstremeF1a.txt
cat XSTREME/F1dN/xstreme.txt | sed s/'^MOTIF '/'MOTIF F1d-'/g >../3.Clustering/XTREMEseedFull/xstremeF1d.txt
cat XSTREME/AD1aN/xstreme.txt | sed s/'^MOTIF '/'MOTIF AD1a-'/g >../3.Clustering/XTREMEseedFull/xstremeAD1a.txt
cat XSTREME/AD1dN/xstreme.txt | sed s/'^MOTIF '/'MOTIF AD1d-'/g >../3.Clustering/XTREMEseedFull/xstremeAD1d.txt
# meme combined
meme2meme ../3.Clustering/XTREMEseedFull/xstreme*txt >../3.Clustering/XTREMEseedFull/combined.txt
grep -c "MOTIF" ../3.Clustering/XTREMEseedFull/combined.txt
# 184 motifs
3.3 Run RSAT Matrix-Clustering for XTREME seed motifs and AME motifs
cd /Users/guanjinghu/Nutstore\ Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/3.Clustering
cp ../0.input/Ath_TF_binding_motifs.meme .
m<-read.table("~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/2.MotifEnrich/diffEnrichMotifs.txt", header=T, sep="\t",colClasses = 'character')
x<-readLines("~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/3.Clustering/Ath_TF_binding_motifs.meme")
y<-x
ss<-grep("MOTIF",x)
for(i in ss){
#print(x[i])
tt<-unlist(strsplit(x[i]," "))
tt[2]<- paste0(tt[2],"-",m[tt[2],])
y[i]<-paste(tt,collapse=" ")
}
writeLines(y,"~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/3.Clustering/Ath_TF_binding_motifs.rename.meme")
Result:http://rsat.eead.csic.es/plants/tmp/apache/2021/11/23/matrix-clustering_2021-11-23.141735_M81GmO/
4. Scan enriched motifs for comparison
## Scan 121 enriched motifs
#acr
fimo --o 3.FIMO/A2.acr 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/A2_ACRs_promoter1k.fasta
fimo --o 3.FIMO/D5.acr 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/D5_ACRs_promoter1k.fasta
fimo --o 3.FIMO/F1a.acr 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/F1a_ACRs_promoter1k.fasta
fimo --o 3.FIMO/F1d.acr 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/F1d_ACRs_promoter1k.fasta
fimo --o 3.FIMO/AD1a.acr 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/AD1a_ACRs_promoter1k.fasta
fimo --o 3.FIMO/AD1d.acr 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/AD1d_ACRs_promoter1k.fasta
#promoter
fimo --o 3.FIMO/A2.p1k 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/A2_promoter1k.fasta
fimo --o 3.FIMO/D5.p1k 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/D5_promoter1k.fasta
fimo --o 3.FIMO/AD1a.p1k 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/AD1a_promoter1k.fasta
fimo --o 3.FIMO/AD1d.p1k 1.enrich\ with\ XSTREME/merged_nr.txt 0.input/AD1d_promoter1k.fasta
The motif occurance in ACRs were normalized by by total ACR sizes, and then divided by occurance in 1k promoter regions.
setwd("~/Nutstore Files/GHuLab-Me/2.Projects/Project_MNase-seq/7.Questions/Qregulation3/4.FIMO")
library(data.table)
## loop input FIMO scanning results and sum into table
fl<-paste0(list.files(pattern="acr"),"/fimo.tsv"); fl
# "A2.acr/fimo.tsv" "AD1a.acr/fimo.tsv" "AD1d.acr/fimo.tsv" "D5.acr/fimo.tsv" "F1a.acr/fimo.tsv" "F1d.acr/fimo.tsv"
for(i in fl)
{
x<-fread(i,select=1)
y<-as.data.frame(table(x$motif_id))
names(y)=c("MOTIF",gsub(".acr/.*","",i))
if(exists("res")){res=merge(res,y,all.x=T,all.y=T)}else{res=y}
}
fl<-paste0(list.files(pattern="p1k"),"/fimo.tsv"); fl
#"A2.p1k/fimo.tsv" "AD1a.p1k/fimo.tsv" "AD1d.p1k/fimo.tsv" "D5.p1k/fimo.tsv"
for(i in fl)
{
x<-fread(i,select=1)
y<-as.data.frame(table(x$motif_id))
names(y)=c("MOTIF",gsub("/.*","",i))
res=merge(res,y,all.x=T,all.y=T)
}
write.table(res,file="rawFreq.txt",sep="\t",row.names=FALSE)
## compare freq
x<-res[,-1]
rownames(x)<-res$MOTIF
colSums(x)
# A2 AD1a AD1d D5 F1a F1d A2.p1k AD1a.p1k AD1d.p1k D5.p1k
# 31525 67609 56027 21045 22621 24704 1213512 949836 1037081 986457
boxplot(log2(x[,1:6]),main="Freq in ACR, log2")
boxplot(log2(x[,7:10]),main="Freq in Promoter, log2")
# normalize by total acr length
freq_acr<-x[,1:6]
# acr size
lz<-c(17894,17826,15950,17345,24044,22074)
names(lz)<-c("A2","D5","F1a","F1d","AD1a","AD1d")
l<-lz[names(freq_acr)]
sizeFactor<-l/mean(l)
freq_acr_norm<-sweep(freq_acr, 2 ,sizeFactor, "/")
## then check enrichment in1kb promoter
y=freq_acr_norm
y$A2 = y$A2/x$A2.p1k
y$F1a = y$F1a/x$A2.p1k
y$D5 = y$D5/x$D5.p1k
y$F1d = y$F1d/x$D5.p1k
y$AD1a = y$AD1a/x$AD1a.p1k
y$AD1d = y$AD1d/x$AD1d.p1k
freqRatio_acr_norm= y
## then check enrichment in1kb promoter
y=freq_acr
y$A2 = y$A2/x$A2.p1k
y$F1a = y$F1a/x$A2.p1k
y$D5 = y$D5/x$D5.p1k
y$F1d = y$F1d/x$D5.p1k
y$AD1a = y$AD1a/x$AD1a.p1k
y$AD1d = y$AD1d/x$AD1d.p1k
freqRatio_acr= y
pdf("plotMotifEnich.pdf")
boxplot(log2(x[,1:6]),main="Freq in ACR, log2")
boxplot(log2(x[,7:10]),main="Freq in Promoter")
#
boxplot(log2(freq_acr),main="Freq in ACR, log2")
heatmap(as.matrix(log2(freq_acr)),scale="none", main="Freq in ACR, log2",cexRow=0.2)
#
boxplot(log2(freq_acr_norm),main="Freq in ACR normalized by size, log2")
heatmap(as.matrix(log2(freq_acr_norm)),scale="none", main="Freq in ACR normalized by size, log2", cexRow=0.2)
#
boxplot(freqRatio_acr,main="Freq ratio of acr/promoter")
heatmap(as.matrix(freqRatio_acr),scale="none", main="Freq ratio of acr/promoter", cexRow=0.2)
#
boxplot(freqRatio_acr_norm,main="Freq ratio of acr_norm/promoter")
heatmap(as.matrix(freqRatio_acr_norm),scale="none", main="Freq ratio of acr_norm/promoter", cexRow=0.2)
dev.off()
Tools
Main tasks include:
- De novo discovery
- Enrichment of known motifs
- Sanning to derive frequency
- Comparison and visualization
Popular suites
MEME http://meme-suite.org/index.html
RSAT http://rsat.eead.csic.es/plants/
Other tools
- DiNAMO (2018) highly sensitive DNA motif discovery in high-throughput sequencing data https://github.com/bonsai-team/DiNAMO
- DMINDA 2.0 (2017) integrated and systematic views of regulatory DNA motif identification and analyses http://bmbl.sdstate.edu/DMINDA2
- ….
RSAT Matrix-clustering Usage
Input
choose meme format
Output files
matrix-clustering_radial_tree.html
matrix-clustering_portable_logo_tree.html
matrix-clustering_aligned_logos/
pairwise alignment
matrix-clustering_html/pairwise_compa.html
summary of pairwise alignment
XSTREME Usage
Output files
xstreme.html
report page
xstreme.txt
contains the ‘seed’ motif from each cluster of similar (redundant) motifs. This column is 1 if this motif is the seed motif of its cluster; otherwise the column is 0. The seed motif in a cluster is either the most enriched discovered motif, or, if the cluster contains no discovered motifs, the most enriched known motif in the cluster. Enrichment is determined by the SEA motif enrichment analysis program. The motifs are in Minimal MEME Motif format.
xstreme.tsv
contains one line for each motif found to be significantly enriched, sorted in order of decreasing statistical significance. The first line in the file contains the (tab-separated) names of the fields. Your command line is given at the end of the TSV file in a comment line starting with the character ‘#’. The names and meanings of each of the fields are described in the table below.
field name contents
- RANK The rank of the enrichment of the motif in the primary sequences relative to the control sequences according to the SEA motif enrichment analysis program. The rank is relative to all of the ab initio motifs discovered by STREME and MEME, as well as all the known motifs you provided to XSTREME. The rank is based on the SEA enrichment p-values of the motifs. Motifs with tied SEA enrichment p-values are given the same rank. Click on the rank to see the motif in the SEA output. (There will be no link to the SEA output if the SEA E-value of the motif is larger than the E-value threshold.) Caution:The SEA statistical significance statistics (p-, E- and q-values) are NOT accurate for discovered motifs because they are based on the same sequences as input to STREME and MEME.
- SEED_MOTIF This column is 1 if this motif is the seed mo tif of its cluster; otherwise the column is 0. The seed motif in a cluster is either the most enriched discovered motif, or, if the cluster contains no discovered motifs, the most enriched known motif in the cluster. Enrichment is determined by the SEA motif enrichment analysis program.
- CLUSTER This column contains the cluster number of the cluster of similar motifs that the motif belongs to. Clusters are numbered in order of the enrichment of their seed motif.
- SOURCE The name of of the program that found the de novo motif, or the name of a file of motifs (“motif database file”) that contains the motif.
- ID The name of the motif, which is unique in the motif discovery program output or in the motif database file.
- ALT_ID An alternate name for the motif that may be provided in the motif discovery program output or in the motif database file.
- CONSENSUS A consensus sequence computed from the motif (as described below).
- WIDTH The width of the motif.
- SITES The number of primary sequences matching the motif according to the motif discovery program (discovered motifs), or according to the SEA enrichment analysis program (database motifs).
- SEA_PVALUE This is the p-value of the motif according to the SEA motif enrichment analysis program. The RANK of the motif is based on this value. This is NOT an accurate measure of the statistical significance of the motif because it is not adjusted for multiple tests. SEA estimates the p-value of motifs using either the Fisher exact test or the Binomial test, as is described here.
- EVALUE The E-value of the motif according to the motif discovery program (ab initio motifs), or according to the SEA motif enrichment analysis program (known motifs). It is an accurate measure of the statistical significance of the motif unless the value of EVALUE_ACC is “0”. The different programs estimate E-values as follows:
- SEA estimates the E-value of the motif using either the Fisher exact test or the Binomial test, as is described here.
- STREME also estimates the E-value of the motif using either the Fisher exact test or the Binomial test, as is described here.
- MEME estimates the E-value of the motif based on its log likelihood ratio, as is described here. Note: MEME E-values are based on a different null model than STREME and SEA E-values and are therefore not directly comparable to them.
- EVALUE_ACC The E-value is an accurate measure of motif significance if this is “1”; otherwise the EVALUE is not an accurate measure of motif significance.
- SIM_SOURCE Motifs discovered by a motif discovery program (STREME or MEME) are compared with known motifs in the motif database(s) you specified. This column and the next two columns identify the most similar known motif to the discovered motif. This column gives the similar motif’s database name, and the next two columns give its ID and its URL, respectively. Only known motifs with Tomtom similarity E-values of less than 1.0 to the discovered motif will be shown here. For known motifs, this column and the next two give the database name, ID and URL of the known motif.
- SIM_MOTIF For discovered motifs this column gives the ID (with any alternate ID in parentheses) of the most similar known motif in the motif database(s) that you provided to XSTREME. For known motifs it gives the known motif’s ID.
- MOTIF_URL For discovered motifs this columns gives the URL of the most similar known motif in the motif database(s) that you provided to XSTREME. For known motifs it gives the known motif’s URL in its online motif database.
A consensus sequence is constructed from each column in a motif’s frequency matrix using the “50% rule” as follows:
- The letter frequencies in the column are sorted in decreasing order.
- Letters with frequency less 50% of the maximum are discarded.
- The letter used in this position in the consensus sequence is determined by the first rule below that applies:
- If there is only one letter left, or if the remaining letters exactly match an ambiguous symbol in the alphabet, the letter or ambiguous symbol, respectively, is used.
- Otherwise, if the remaining set contains at least 50% of the core symbols in the alphabet, the alphabet’s wildcard (e.g., “N” for DNA or RNA, and “X” for protein) is used.
- Otherwise, the letter with the maximum frequency is used.