Question-driven Integrative Analysis

Regulatory landscape questions

How are various chromatin features compared with each other and in associated with gene expression?
Do different approaches - DNS, SPO, and ATAC provide a conserved view of chromatin assessiblity?

Polyploidy Questions

How does allopolyploidy alter chromatin architecture, in terms of nucleosome origanization and chromatin accessibility?
How does the alternation of chromatin structure drive gene expression evolution?
How to demonstrate the mechanistics of expression regulatory control using actual examples?

1. Inspect various chromatin features

Use Qregulation1.r to conduct following analyses:

Assemble various chromatin feature profiles that were preprocessed from differential MNase-seq (heavy and light nucleosome occupancy, DNS, SPO, nucleosome positioning) and ATAC-seq
Plot Pearson/Spearman correlations and plot PCA
Aggregate profiles on gene regions (scaled gene body, TSS, TES) relative to expression levels
Aggregate profiles on Gypsy regions
TO-DO: Aggregate NP over 3 ACR categories for DNS and SPO each.

2. Compare different profiles of chromatin accessibility

Use Qregulation2.r to conduct following analyses in D5 genome:

Compare characteristics (structure and annotation) of accesible chromatin regions (ACRs) resulted from DNS, SPO, and ATAC.
Aggregate profiles on each other ACR centers

3. Compare chromatin profiles between diploid and polyploid cotton

Examine hybridization and allopolyploidization effects: diploid A2/D5 vs F1 vs AD1.

TE composition TE.r: compare TE compositions, TE insertion relative to genes, Gypsy relative to more detailed genic structure。What percentages of the genome are covered by gene and TEs? What are the frequency of different TE categories in gene vicinity? What are the frequency of genes in TE vicinity?
Nucleosome organization nuc.post.r: constrast the genome coverages, nucleosome repeat link lengths for observed and predicted nucleosome position (observed Well/Fuzzy categorization)
Accessible chromatin regions ACR.r: marginal comparison of ACR categorization (DNS, SPO, DNase-seq); ChIPseeker genomic annotation for pACR and gACR (DNS); Peak overlaps for A2 vs F1.At and D5 vs F1.Dt.
Differential accessibility analysis diffACRs_csaw_AD1ref.r and diffACRs_csaw_F1ref.r : diffbind-like vs csaw methods
TE-ACR.r examines TE-derived ACRs，
3D genome organization hic.r:

4. Associate chromatin profiles with duplicated gene expression patterns

Examine promoter DNS and DA profiles in association with B, Bp, Hr, Wr, and Pr

Use Qregulation4.sh to prepare MNase-seq mapping results to BigWig files for aggregation plots: normalization RPGC vs RPKM; ref genome A2+D5 vs AD1
Use Qregulation4.r to analyze DNS and DA patterns by expression categories.

Motif discovery and enrichment analysis of 1KB promoters motif_analysis.md

5. Identify cases to demonstrate mechanistics

Use actual expression to visualize the regulatory mechanistics involving TE composition, ACR, motif, and duplicated expression patterns

Identify cases by putting evidence together Qregulation5.r
IGV visualization

Aggregate profiles on gene regions

Use aggreNvisual.prep.r to address the following question:
- What is the genomic composition of annotated gene and TE regions?
  - What percentages of the genome are covered by gene and TEs? What are the frequency of different TE categories in gene vicinity? What are the frequency of genes in TE vicinity?
- What are the patterns of nucleosome organization, nuclease sensitivity and chromatin openness associated with different genomic features?
  - H1: higher level of chromatin openess in gene promoter than other genomic regions.
  - H2: higher openess in gene promters depleted of TEs.
- How do above patterns differ between A and D diploids, and between the subgenomes of A2xD5 F1 and the tetraploid AD?

0. Prep and Tools

0.1 Prepare chromatin structure profile data

Use aggreNvisual.prep.r to prepare

Chromatin profiles that were preprocessed from differential MNase-seq, ATAC-seq , including:

nucleosome occupancy profiles from replicated heavy and light digestion, full-ranged fragment size
DNS profiles from normalized light minus heavy nucleosome occupancy profiles, full-ranged fragment size
nucleosome positioning profiles from heavy digestion only, <200 bp fragments trimmed to 50 bp around dyad
Size partitioned nucleosome occupancy profiles of heavy, light and DNS, range 0-130bp as sub-nucleosomal fragments and 130-260 bp.
Frenters of MOA-seq (MNase Open and Accessible chromatin mapping) by ~50-130 bp fragments from light digestion,

The RPM (reads per million) and quantile normalization in original DNS pipeline bothered me, given the 3-fold genome size difference between species surveyed. For comparative aggragation over genomic features, I generated new BigWig files from BAM using Deeptools bamCoverage first using RPGC (reads per 1x sequenced genomic content) normalization, and then switched to RPKM for simplicity, as explained here.

RPGC = reads per genomic content (1x normalization); scaling factor = (total number of mapped reads * fragment length) / effective genome size.

Effective genomic size is the part of genome that is uniquely mappable, according to General-deepTools-FAQs. I first tried faCount to report non-‘N’ size, but little Ns was found in A2Du, not right.

~/kent/faCount /work/LAS/jfw-lab/hugj2006/cottonLeaf/refGenomes/A2Du_13.fasta >/work/LAS/jfw-lab/hugj2006/cottonLeaf/refGenomes/A2Du_13.facount.txt
~/kent/faCount /work/LAS/jfw-lab/hugj2006/cottonLeaf/refGenomes/Dgenome2_13.fasta >/work/LAS/jfw-lab/hugj2006/cottonLeaf/refGenomes/Dgenome2_13.facount.txt
~/kent/faCount /work/LAS/jfw-lab/hugj2006/cottonLeaf/refGenomes/TM1new_26.fasta >/work/LAS/jfw-lab/hugj2006/cottonLeaf/refGenomes/TM1new_26.facount.txt

The BEDtool2 genomecov -d can be used to calculate the coverage per site in the genome, and regions with mapped reads have depth >0.

module load bedtools2
bedtools genomecov -ibam D1H_q20.bam -d |awk '$3>0 {print $3}' |wc -l
# 704408202

Using the obtained effective size to convert BAM to BigWig

bamCoverage -b D1H_q20.bam -o F2L_q20_142-200.bw --binSize 20 --normalizeTo1x 704408202
bamCoverage -b D1H_q20.bam -o F2L_q20_142-200.bw --binSize 20 --minFragmentLength 130 --maxFragmentLength 260 --normalizeTo1x 1878118078bamCoverage -b D1H_q20.bam -o F2L_q20_142-200.bw --binSize 20 --minFragmentLength 0 --maxFragmentLength 130 --normalizeTo1x 704408202
bamCoverage -b D1H_q20.bam -o F2L_q20_142-200.bw --binSize 20 --minFragmentLength 130 --maxFragmentLength 260 --normalizeTo1x 704408202

0.3 Tools

Many tools are available and reviewed in a biostars thread here:

Genomation - R package used here.
Deeptools - generally faster than R.
SeqPlots - corresponding genomic packages are required from bioconductor, useless for cotton, shame
ngsplot - unix stripts calling R and python usage; pretty figure but need to build genome

1. Aggregation and visualization of chromatin structure profiles

Use aggreNvisual.prep.r to address the following question:

What is the genomic composition of annotated gene and TE regions?
- What percentages of the genome are covered by gene and TEs? What are the frequency of different TE categories in gene vicinity? What are the frequency of genes in TE vicinity?
What are the patterns of nucleosome organization, nuclease sensitivity and chromatin openness associated with different genomic features?
- H1: higher level of chromatin openess in gene promoter than other genomic regions.
- H2: higher openess in gene promters depleted of TEs.
How do above patterns differ between A and D diploids, and between the subgenomes of A2xD5 F1 and the tetraploid AD?

2. Comparison of chromatin profiles in association with duplicated gene expression patterns

expNuc.r

Make plots for visualization. Generate heatmap of read coverages, for visualizing the read coverages for genomic regions. The default setting plots profile on top of heatmaps.

computeMatrix reference-point -S *q20_unified.bw -R D5.gene.bed -o TSS.gz --referencePoint TSS -b 1500 -a 1500 --skipZeros 
plotHeatmap -m TSS.gz -out plotHeatmap_TSS.png

computeMatrix reference-point -S *q20_unified.bw -R D5.gene.bed -o TES.gz --referencePoint TES -b 1500 -a 1500 --skipZeros 
plotHeatmap -m TES.gz -out plotHeatmap_TES.png

computeMatrix scale-regions -S *q20_unified.bw -R D5.gene.bed -o scaleGeneBody.gz –regionBodyLength 3000 -b 1500 -a 1500 –skipZeros plotHeatmap -m scaleGeneBody.gz -out plotHeatmap_scaleGeneBody.png

In addition, generate summary plot of “meta-profile”, for visualizing the average read coverages over a group of genomic regions, and output .tab file for R analysis.

plotProfile -m scaleGeneBody.gz --perGroup -out plotProfileGroup.png --outFileNameData plotProfile.tab