Data analysis of other datasets
TE annotation
The EDTA pipeline was used to perform high-quality and consistentl TE annotation for three cotton genome reference genomes: D5 ~800Mb, the smallest and highest quality assembly lss ftp. A2 ~1600Mb lss ftp. AD1 ~2200 Mb lss ftp
Two previous versions (v091919 and v111519) were generated, and the latest one is v010621 which conducted TE and panTE annotation for a total of 15 available Gossypium reference genomes.
Genomic annotation
Visualization and exploration of sequencing tracks (BigWig
) around genomic features (e.g. TSS) is the key of downstream analysis. makeTxDb.r prepared txdb.<refGenome>.sqlite
for handling gene anonation, prepared TEannotation.rdata
for TE annotation, and generated corresponding BED files. Alternatively, BEDOPS tool gff2bed can be used to convert gff to bed for deeptools, e.g. makeGenicBed.sh .
ATAC-seq, D5 only
Sep 27, 2018 - Josh transferred two ATAC-seq datasets (G.raimondii and G. longicalyx with 2 reps each) from BYU. Those were part of an aborted ATAC-seq project aiming to sample D5, F, A, B and E genomes, and libraries were made by Bob Schmitz et al.
ls /work/LAS/jfw-lab/ATAC/G.raimondii/ATAC
The D5 dataset was analyzed in comparison with dns-MNase-seq profiles, in order to better understand different types of open chromatin profiles by ATAC-seq and lightly digested MNase-seq 0-130bp fragments.
The analytic pipeline atac_callPeaks.r conducted quality QC and trimming of raw fastq reads, mapping against reference genome, and peak calling.
Regarding peak calling:
- Deal lab protocol used HOMER findPeaks in “region” mode:
findpeaks <tag.directory> -o <output> -gsize <effective.mappable.genome.size_7.1e8> minDist 150 -region
- Harvard FAS Informatics ATAC-seq Guidelines uses their own program Genrich, and also previously MACS2 in the order version.
DNase-seq
Hi-C
The young leaves Hi-C datasets were downloaded from the NCBI Sequence Read Archive database. Analyses were conducted to identify valid chromatin conformation interactions, A/B compartment, TADs, and loops using HOMER with hic.sh hic.r
- G. raimondii - SRX3051289
- G. arboreum - SRX3051297
- G. hirsutum - SRX2330709
bash hic.sh
######## Basic Statistics
cd /work/LAS/jfw-lab/hugj2006/cottonLeaf/HiC
#total read pairs
grep 'Total reads processed' */*report.txt
#trimmed and filtered read pairs
grep 'Reads written' */*report.txt
# mapping input
grep 'reads; of these' */*log
# uniquely mapped read pairs for each end
grep ' aligned exactly 1 time' */*log
# Total Tags before and after filter - paired ends counted twice
grep '^genome' */*/tagInfo.txt
# interactions from short
wc -l */*/short.sorted.txt
# count intra contacts
for j in $( ls */*/short.sorted.txt ); do
echo $j
awk '$2==$6{c++} END{print c+0}' $j
done
Rscript hic.r
Archived
Conserved Noncoding Sequence (CNSs): The COGE CNS Discovery Pipeline was used to find CNSs between A2 vs D5, At vs Dt, A2 vs At, D5 vs Dt genomes.The resulting common CNSs were used to inquiry whether or not chromatin accessible regions are conversed across genomes. cns.sh