Skip to content
Snippets Groups Projects
Commit 134f566c authored by Long Wang's avatar Long Wang
Browse files

Update demo06

parent 46bc9b8d
No related branches found
No related tags found
No related merge requests found
......@@ -121,7 +121,20 @@ Please refer to the source links for up-to-date version and detailed description
conda install -c bioconda iqtree
```
- #### pggb
- [ ] [Official](https://github.com/pangenome/pggb)
- #### panacus
- [ ] [Official](https://github.com/marschall-lab/panacus)
- #### Also install pggb and panacus via bioconda
```bash
conda install -c bioconda -c conda-forge pggb
conda install -c bioconda -c conda-forge panacus
```
</details>
## Disclaimer
......
# Example codes for demo06
- ### Activating conda enviroment
```bash
cd workshop3/
conda activate ws3
```
- ### Step1: Prepare input fasta file for PGGB
pggb runs on each chromosome and require a single fasta, here we use chromosome chrIX as an example
### 1) Extract chrIX sequences for 18 genomes
<details><summary>(Click to show possible answers)</summary>
```bash
mkdir chroms
for genome_file in `ls ../data/ScRAP_data/nuclear/*.nuclear_genome.tidy.fa.gz`;
do
filename=`basename ${genome_file}`
out_prefix=${filename/.asm01.HP0.nuclear_genome.tidy.fa.gz}
echo ${out_prefix}
zcat ${genome_file} > ./chroms/${out_prefix}.genome.fa
seqkit faidx ./chroms/${out_prefix}.genome.fa chrIX \
> ./chroms/${out_prefix}.chrIX.fa && \
rm ./chroms/${out_prefix}.genome.fa ./chroms/${out_prefix}.genome.fa.fai
done
```
</details>
### 2) Cat all 18 chrIX sequences into one fasta file, and index the fasta file as pggb need it
Before merge all files, we need to change the IDs as they are all identical.
We rename the sequence ID following PanSN-spec naming pattern (https://github.com/pangenome/PanSN-spec),the naming scheme for PanSN-spec is [sample_name][delim][haplotype_id][delim][contig_or_scaffold_name], so we will rename each ID as "Strain_ID#1#chrIX" where "1" represents haplotype id, and "#" is used as the delimiter
<details><summary>(Click to show possible answers)</summary>
```bash
> ./chroms/ScRAP_chrIX.fa
for chrom_file in `ls ./chroms/*.chrIX.fa`;
do
filename=`basename ${chrom_file}`
new_id=${filename/.chrIX.fa/#1#chrIX}
echo ${new_id}
sed "s/>chrIX/>${new_id}/" ${chrom_file} \
>> ./chroms/ScRAP_chrIX.fa
done
# pggb need the fasta file to be indexed by samtools faidx or seqkit faidx
seqkit faidx ./chroms/ScRAP_chrIX.fa
```
</details>
- ### Step2: It's now ready to run pggb
<details><summary>(Click to show possible answers)</summary>
```bash
mkdir pggb
# here we set -n 18, which is equal to the number of haplotypes
pggb -i ./chroms/ScRAP_chrIX.fa -o pggb -n 18 -t 4
```
</details>
- ### Step3: Interpret the results from pggb
<details><summary>(Click to show possible answers)</summary>
```bash
# first let's visualize the wfmash alignments
cd ./pggb
../wfmash/scripts/paf2dotplot png large ScRAP_chrIX.fa.92da240.alignments.wfmash.paf
cd ..
# next we generate some basic information on the graph using odgi stats
odgi stats -i ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.og -S \
> ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.og.stats
```
- ### Step4: Compare pggb results across different nucleotide identity threshold
The default minimum average nucleotide identity for segments of pggb is 90, which is suitable for genomes with a divergence <= 10%, let's try different threshold to see how this would affect pangraph
<details><summary>(Click to show possible answers)</summary>
```bash
# use a lower percent identity of 80 to allow more divergent alignment
pggb -i ./chroms/ScRAP_chrIX.fa -o pggb_p80 -n 18 -t 4 -p 80
odgi stats -i ./pggb_p80/ScRAP_chrIX.fa.925a0a8.417fcdf.0e1b05c.smooth.final.og -S \
> ./pggb_p80/ScRAP_chrIX.fa.925a0a8.417fcdf.0e1b05c.smooth.final.og.stats
# use a higher percent identity of 95 to consider more conserved alignment
pggb -i ./chroms/ScRAP_chrIX.fa -o pggb_p95 -n 18 -t 4 -p 95
odgi stats -i ./pggb_p95/ScRAP_chrIX.fa.83ec06e.417fcdf.4cea6f5.smooth.final.og -S \
> ./pggb_p95/ScRAP_chrIX.fa.83ec06e.417fcdf.4cea6f5.smooth.final.og.stats
```
</details>
- ### Step5: Visualize pggb pangenome graph
PGGB by default generates ODGI 1D (./pggb*/*.og.viz_*.png) and 2D graphs (./pggb*/*.og.lay.*.png), let's try to view the graphs with vg view
<details><summary>(Click to show possible answers)</summary>
```bash
# first we convert gfa format to vg format as gfa seems to be not so well supported by vg in some cases
vg convert -t 4 -v -g ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.gfa \
> ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.vg
# then we visualize a small region with vg view
vg chunk -x ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.vg \
-p UWOPS052272#1#chrIX:450-500 -c 1 | vg view -pd - | dot -Tpng \
> ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.UWOPS052272_chrIX_450-500.png
```
</details>
- ### Step6: Calculate pangenome openness with panacus (https://github.com/marschall-lab/panacus)
panacus is a tool for calculating statistics for GFA files, it supports the following calculations:
- coverage histogram
- pangenome growth statistics
- path-/group-resolved coverage table
here we mainly use it to calculate the pangenome growth statistics
<details><summary>(Click to show possible answers)</summary>
```bash
# 1) run panacus histgrowth to calculate coverage and pangenome growth for nodes (default) with coverage/quorum thresholds 1/0, 2/0, 1/1, 1/0.5, and 1/0.1
RUST_LOG=info panacus histgrowth -t4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -a \
./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.gfa \
> ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.histgrowth.node.tsv
2) visualize the growth statistics
panacus-visualize \
-e ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.histgrowth.node.tsv \
> ./pggb/ScRAP_chrIX.fa.92da240.417fcdf.b17acf2.smooth.final.histgrowth.node.pdf
```
</details>
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment