App-Anchr
view release on metacpan or search on metacpan
doc/bacteria_2_3.md view on Meta::CPAN
- [Ngon: merge anchors](#ngon-merge-anchors)
- [Ngon: 3GS](#ngon-3gs)
- [Ngon: expand anchors](#ngon-expand-anchors)
- [Neisseria meningitidis FDAARGOS_209, èèçå¥çæ°è](#neisseria-meningitidis-fdaargos-209-èèçå¥çæ°è)
- [Nmen: download](#nmen-download)
- [Nmen: combinations of different quality values and read lengths](#nmen-combinations-of-different-quality-values-and-read-lengths)
- [Nmen: quorum](#nmen-quorum)
- [Nmen: down sampling](#nmen-down-sampling)
- [Nmen: k-unitigs and anchors (sampled)](#nmen-k-unitigs-and-anchors-sampled)
- [Nmen: merge anchors](#nmen-merge-anchors)
- [Nmen: 3GS](#nmen-3gs)
- [Nmen: expand anchors](#nmen-expand-anchors)
- [Bordetella pertussis FDAARGOS_195, ç¾æ¥å³åå¾·ç¹æ°æè](#bordetella-pertussis-fdaargos-195-ç¾æ¥å³åå¾·ç¹æ°æè)
- [Bper: download](#bper-download)
- [Bper: combinations of different quality values and read lengths](#bper-combinations-of-different-quality-values-and-read-lengths)
- [Bper: down sampling](#bper-down-sampling)
- [Bper: generate super-reads](#bper-generate-super-reads)
- [Bper: create anchors](#bper-create-anchors)
- [Bper: results](#bper-results)
- [Bper: merge anchors](#bper-merge-anchors)
- [Corynebacterium diphtheriae FDAARGOS_197, ç½åæè](#corynebacterium-diphtheriae-fdaargos-197-ç½åæè)
- [Cdip: download](#cdip-download)
- [Cdip: combinations of different quality values and read lengths](#cdip-combinations-of-different-quality-values-and-read-lengths)
- [Cdip: quorum](#cdip-quorum)
- [Cdip: down sampling](#cdip-down-sampling)
- [Cdip: k-unitigs and anchors (sampled)](#cdip-k-unitigs-and-anchors-sampled)
- [Cdip: merge anchors](#cdip-merge-anchors)
- [Cdip: 3GS](#cdip-3gs)
- [Cdip: expand anchors](#cdip-expand-anchors)
- [Francisella tularensis FDAARGOS_247, åæçå¼æè¥¿æ¯æ°è](#francisella-tularensis-fdaargos-247-åæçå¼æè¥¿æ¯æ°è)
- [Ftul: download](#ftul-download)
- [Ftul: combinations of different quality values and read lengths](#ftul-combinations-of-different-quality-values-and-read-lengths)
- [Ftul: quorum](#ftul-quorum)
- [Ftul: down sampling](#ftul-down-sampling)
- [Ftul: k-unitigs and anchors (sampled)](#ftul-k-unitigs-and-anchors-sampled)
- [Ftul: merge anchors](#ftul-merge-anchors)
- [Ftul: 3GS](#ftul-3gs)
- [Ftul: expand anchors](#ftul-expand-anchors)
- [Haemophilus influenzae FDAARGOS_199, æµæåè¡æè](#haemophilus-influenzae-fdaargos-199-æµæåè¡æè)
- [Hinf: download](#hinf-download)
- [Listeria monocytogenes FDAARGOS_351, åæ ¸ç»èå¢çææ¯ç¹æ°è](#listeria-monocytogenes-fdaargos-351-åæ ¸ç»èå¢çææ¯ç¹æ°è)
- [Lmon: download](#lmon-download)
- [Clostridioides difficile 630](#clostridioides-difficile-630)
- [Cdif: download](#cdif-download)
- [Campylobacter jejuni subsp. jejuni ATCC 700819, 空è å¼¯æ²æè](#campylobacter-jejuni-subsp-jejuni-atcc-700819-空è å¼¯æ²æè)
- [Cjej: download](#cjej-download)
# Escherichia virus Lambda
Project
[SRP055199](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP055199)
## lambda: download
* Reference genome
* Strain: Escherichia virus Lambda (viruses)
* Taxid: [10710](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10710&lvl=3&lin=f&keep=1&srchmode=1&unlock)
* RefSeq assembly accession:
[GCF_000840245.1](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/840/245/GCF_000840245.1_ViralProj14204/GCF_000840245.1_ViralProj14204_assembly_report.txt)
* Proportion of paralogs (> 1000 bp): 0.0
```bash
mkdir -p ~/data/anchr/lambda/1_genome
cd ~/data/anchr/lambda/1_genome
aria2c -x 9 -s 3 -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/840/245/GCF_000840245.1_ViralProj14204/GCF_000840245.1_ViralProj14204_genomic.fna.gz
TAB=$'\t'
cat <<EOF > replace.tsv
NC_001416.1${TAB}1
EOF
faops replace GCF_000840245.1_ViralProj14204_genomic.fna.gz replace.tsv genome.fa
#cp ~/data/anchr/paralogs/otherbac/Results/lambda/lambda.multi.fas paralogs.fas
```
* PacBio
```bash
mkdir -p ~/data/anchr/lambda/3_pacbio
cd ~/data/anchr/lambda/3_pacbio
cat << EOF > sra_ftp.txt
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR179/005/SRR1796325/SRR1796325.fastq.gz
EOF
aria2c -x 9 -s 3 -c -i sra_ftp.txt
cat << EOF > sra_md5.txt
2c663d7ea426eea0aaba9017e1a9168c SRR1796325.fastq.gz
EOF
md5sum --check sra_md5.txt
cd ~/data/anchr/lambda
faops filter -l 0 3_pacbio/SRR1796325.fastq.gz 3_pacbio/pacbio.fasta
```
## lambda: preprocess PacBio reads
```bash
BASE_NAME=lambda
cd ${HOME}/data/anchr/${BASE_NAME}
head -n 3000 3_pacbio/pacbio.fasta > 3_pacbio/pacbio.40x.fasta
anchr trimlong --parallel 16 -v \
3_pacbio/pacbio.40x.fasta \
-o 3_pacbio/pacbio.40x.trim.fasta
```
## lambda: reads stats
```bash
BASE_NAME=lambda
cd ${HOME}/data/anchr/${BASE_NAME}
printf "| %s | %s | %s | %s |\n" \
"Name" "N50" "Sum" "#" \
> stat.md
printf "|:--|--:|--:|--:|\n" >> stat.md
doc/bacteria_2_3.md view on Meta::CPAN
for my $n (
qw{
Q20L60
Q25L60
Q30L60
}
)
{
for my $i ( 1 .. 5 ) {
printf qq{%s_%d\n}, $n, ( 1000000 * $i );
}
}
' \
| parallel -k --no-run-if-empty -j 4 "
if [ ! -d ${BASE_DIR}/{} ]; then
exit;
fi
bash ~/Scripts/cpan/App-Anchr/share/sr_stat.sh 1 ${BASE_DIR}/{} ${REAL_G}
" >> ${BASE_DIR}/stat1.md
cat stat1.md
```
* Stats of anchors
```bash
BASE_DIR=$HOME/data/anchr/Lpne
cd ${BASE_DIR}
bash ~/Scripts/cpan/App-Anchr/share/sr_stat.sh 2 header \
> ${BASE_DIR}/stat2.md
perl -e '
for my $n (
qw{
Q20L60
Q25L60
Q30L60
}
)
{
for my $i ( 1 .. 5 ) {
printf qq{%s_%d\n}, $n, ( 1000000 * $i );
}
}
' \
| parallel -k --no-run-if-empty -j 8 "
if [ ! -e ${BASE_DIR}/{}/anchor/pe.anchor.fa ]; then
exit;
fi
bash ~/Scripts/cpan/App-Anchr/share/sr_stat.sh 2 ${BASE_DIR}/{}
" >> ${BASE_DIR}/stat2.md
cat stat2.md
```
| Name | SumFq | CovFq | AvgRead | Kmer | SumFa | Discard% | RealG | EstG | Est/Real | SumKU | SumSR | RunTime |
|:---------------|--------:|------:|--------:|-----------:|--------:|---------:|------:|------:|---------:|------:|------:|----------:|
| Q20L60_1000000 | 200.24M | 58.9 | 100 | "41,61,81" | 181.17M | 9.524% | 3.4M | 3.4M | 1.00 | 3.42M | 0 | 0:05'07'' |
| Q20L60_2000000 | 400.49M | 117.9 | 100 | "41,61,81" | 362.45M | 9.500% | 3.4M | 3.41M | 1.00 | 3.43M | 0 | 0:08'38'' |
| Q20L60_3000000 | 600.74M | 176.8 | 99 | "41,61,81" | 543.81M | 9.476% | 3.4M | 3.43M | 1.01 | 3.43M | 0 | 0:12'31'' |
| Q20L60_4000000 | 801M | 235.7 | 99 | "41,61,81" | 725.8M | 9.388% | 3.4M | 3.47M | 1.02 | 3.43M | 0 | 0:15'59'' |
| Q25L60_1000000 | 199.18M | 58.6 | 99 | "41,61,81" | 183.62M | 7.812% | 3.4M | 3.4M | 1.00 | 3.48M | 0 | 0:05'12'' |
| Q25L60_2000000 | 398.32M | 117.2 | 99 | "41,61,81" | 367.28M | 7.792% | 3.4M | 3.41M | 1.00 | 3.43M | 0 | 0:08'33'' |
| Q25L60_3000000 | 597.52M | 175.9 | 99 | "41,61,81" | 551.1M | 7.769% | 3.4M | 3.41M | 1.00 | 3.42M | 0 | 0:12'32'' |
| Q25L60_4000000 | 796.67M | 234.5 | 99 | "41,61,81" | 734.9M | 7.753% | 3.4M | 3.42M | 1.01 | 3.43M | 0 | 0:14'31'' |
| Q30L60_1000000 | 195.64M | 57.6 | 98 | "41,61,81" | 183.5M | 6.208% | 3.4M | 3.4M | 1.00 | 3.41M | 0 | 0:05'13'' |
| Q30L60_2000000 | 391.27M | 115.2 | 98 | "41,61,81" | 367.04M | 6.193% | 3.4M | 3.4M | 1.00 | 3.42M | 0 | 0:07'40'' |
| Q30L60_3000000 | 586.89M | 172.7 | 97 | "41,61,81" | 550.6M | 6.183% | 3.4M | 3.41M | 1.00 | 3.42M | 0 | 0:10'04'' |
| Q30L60_4000000 | 774.12M | 227.8 | 97 | "41,61,81" | 726.3M | 6.177% | 3.4M | 3.41M | 1.00 | 3.42M | 0 | 0:08'35'' |
| Name | N50SR | Sum | # | N50Anchor | Sum | # | N50Others | Sum | # | RunTime |
|:---------------|-------:|------:|----:|----------:|------:|----:|----------:|--------:|----:|----------:|
| Q20L60_1000000 | 59861 | 3.42M | 134 | 60611 | 3.36M | 110 | 43054 | 59.5K | 24 | 0:02'01'' |
| Q20L60_2000000 | 27709 | 3.43M | 222 | 28368 | 3.37M | 190 | 21578 | 65.35K | 32 | 0:03'10'' |
| Q20L60_3000000 | 14498 | 3.43M | 396 | 14852 | 3.38M | 353 | 955 | 55.98K | 43 | 0:04'01'' |
| Q20L60_4000000 | 7243 | 3.43M | 772 | 7331 | 3.34M | 648 | 749 | 90.87K | 124 | 0:04'34'' |
| Q25L60_1000000 | 120520 | 3.48M | 101 | 120520 | 3.37M | 66 | 88397 | 117.88K | 35 | 0:01'53'' |
| Q25L60_2000000 | 62424 | 3.43M | 124 | 62424 | 3.36M | 95 | 43054 | 62.52K | 29 | 0:03'02'' |
| Q25L60_3000000 | 30976 | 3.42M | 200 | 31319 | 3.39M | 175 | 897 | 32.64K | 25 | 0:03'32'' |
| Q25L60_4000000 | 20322 | 3.43M | 304 | 20440 | 3.38M | 273 | 14979 | 50.13K | 31 | 0:04'18'' |
| Q30L60_1000000 | 219482 | 3.41M | 83 | 219482 | 3.4M | 56 | 734 | 18.18K | 27 | 0:01'41'' |
| Q30L60_2000000 | 128845 | 3.42M | 93 | 128845 | 3.36M | 68 | 21578 | 59.28K | 25 | 0:02'42'' |
| Q30L60_3000000 | 68045 | 3.42M | 123 | 68045 | 3.37M | 97 | 15040 | 46.9K | 26 | 0:03'30'' |
| Q30L60_4000000 | 49793 | 3.42M | 146 | 49793 | 3.38M | 126 | 15040 | 42.66K | 20 | 0:03'54'' |
## Lpne: merge anchors
```bash
BASE_DIR=$HOME/data/anchr/Lpne
cd ${BASE_DIR}
# merge anchors
mkdir -p merge
anchr contained \
Q20L60_1000000/anchor/pe.anchor.fa \
Q20L60_2000000/anchor/pe.anchor.fa \
Q20L60_3000000/anchor/pe.anchor.fa \
Q20L60_4000000/anchor/pe.anchor.fa \
Q25L60_1000000/anchor/pe.anchor.fa \
Q25L60_2000000/anchor/pe.anchor.fa \
Q25L60_3000000/anchor/pe.anchor.fa \
Q25L60_4000000/anchor/pe.anchor.fa \
Q30L60_1000000/anchor/pe.anchor.fa \
Q30L60_2000000/anchor/pe.anchor.fa \
Q30L60_3000000/anchor/pe.anchor.fa \
Q30L60_4000000/anchor/pe.anchor.fa \
--len 1000 --idt 0.98 --proportion 0.99999 --parallel 16 \
-o stdout \
| faops filter -a 1000 -l 0 stdin merge/anchor.contained.fasta
anchr orient merge/anchor.contained.fasta --len 1000 --idt 0.98 -o merge/anchor.orient.fasta
anchr merge merge/anchor.orient.fasta --len 1000 --idt 0.999 -o stdout \
| faops filter -a 1000 -l 0 stdin merge/anchor.merge.fasta
# merge others
anchr contained \
Q20L60_2000000/anchor/pe.others.fa \
Q25L60_2000000/anchor/pe.others.fa \
Q30L60_2000000/anchor/pe.others.fa \
doc/bacteria_2_3.md view on Meta::CPAN
echo ' R1.fq.gz already presents'
exit;
fi
anchr trim \
--noscythe \
-q {1} -l {2} \
../R1.scythe.fq.gz ../R2.scythe.fq.gz \
-o stdout \
| bash
" ::: 20 25 30 ::: 60
```
* Stats
```bash
BASE_DIR=$HOME/data/anchr/Ngon
cd ${BASE_DIR}
printf "| %s | %s | %s | %s |\n" \
"Name" "N50" "Sum" "#" \
> stat.md
printf "|:--|--:|--:|--:|\n" >> stat.md
printf "| %s | %s | %s | %s |\n" \
$(echo "Genome"; faops n50 -H -S -C 1_genome/genome.fa;) >> stat.md
printf "| %s | %s | %s | %s |\n" \
$(echo "Paralogs"; faops n50 -H -S -C 1_genome/paralogs.fas;) >> stat.md
printf "| %s | %s | %s | %s |\n" \
$(echo "Illumina"; faops n50 -H -S -C 2_illumina/R1.fq.gz 2_illumina/R2.fq.gz;) >> stat.md
printf "| %s | %s | %s | %s |\n" \
$(echo "PacBio"; faops n50 -H -S -C 3_pacbio/pacbio.fasta;) >> stat.md
printf "| %s | %s | %s | %s |\n" \
$(echo "uniq"; faops n50 -H -S -C 2_illumina/R1.uniq.fq.gz 2_illumina/R2.uniq.fq.gz;) >> stat.md
printf "| %s | %s | %s | %s |\n" \
$(echo "scythe"; faops n50 -H -S -C 2_illumina/R1.scythe.fq.gz 2_illumina/R2.scythe.fq.gz;) >> stat.md
for qual in 20 25 30; do
for len in 60; do
DIR_COUNT="${BASE_DIR}/2_illumina/Q${qual}L${len}"
printf "| %s | %s | %s | %s |\n" \
$(echo "Q${qual}L${len}"; faops n50 -H -S -C ${DIR_COUNT}/R1.fq.gz ${DIR_COUNT}/R2.fq.gz;) \
>> stat.md
done
done
cat stat.md
```
| Name | N50 | Sum | # |
|:---------|--------:|-----------:|---------:|
| Genome | 2153922 | 2153922 | 1 |
| Paralogs | 4318 | 142093 | 53 |
| Illumina | 101 | 1491583958 | 14768158 |
| PacBio | 11808 | 1187845820 | 137516 |
| uniq | 101 | 1485449016 | 14707416 |
| scythe | 101 | 1460356291 | 14707416 |
| Q20L60 | 101 | 1239834586 | 12544518 |
| Q25L60 | 101 | 1062429395 | 10873960 |
| Q30L60 | 101 | 734805677 | 7775198 |
## Ngon: down sampling
```bash
BASE_DIR=$HOME/data/anchr/Ngon
cd ${BASE_DIR}
ARRAY=(
"2_illumina/Q20L60:Q20L60:4000000"
"2_illumina/Q25L60:Q25L60:4000000"
"2_illumina/Q30L60:Q30L60:4000000"
)
for group in "${ARRAY[@]}" ; do
GROUP_DIR=$(group=${group} perl -e '@p = split q{:}, $ENV{group}; print $p[0];')
GROUP_ID=$( group=${group} perl -e '@p = split q{:}, $ENV{group}; print $p[1];')
GROUP_MAX=$(group=${group} perl -e '@p = split q{:}, $ENV{group}; print $p[2];')
printf "==> %s \t %s \t %s\n" "$GROUP_DIR" "$GROUP_ID" "$GROUP_MAX"
perl -e 'print 1000000 * $_, qq{\n} for 1 .. 5' \
| parallel --no-run-if-empty -j 4 "
if [[ {} -gt '$GROUP_MAX' ]]; then
exit;
fi
echo ' ${GROUP_ID}_{}'
mkdir -p ${BASE_DIR}/${GROUP_ID}_{}
if [ -e ${BASE_DIR}/${GROUP_ID}_{}/R1.fq.gz ]; then
exit;
fi
seqtk sample -s{} \
${BASE_DIR}/${GROUP_DIR}/R1.fq.gz {} \
| pigz -p 4 -c > ${BASE_DIR}/${GROUP_ID}_{}/R1.fq.gz
seqtk sample -s{} \
${BASE_DIR}/${GROUP_DIR}/R2.fq.gz {} \
| pigz -p 4 -c > ${BASE_DIR}/${GROUP_ID}_{}/R2.fq.gz
"
done
```
```bash
BASE_DIR=$HOME/data/anchr/Ngon
cd ${BASE_DIR}
head -n 25000 3_pacbio/pacbio.fasta > 3_pacbio/pacbio.40x.fasta
faops n50 -S -C 3_pacbio/pacbio.40x.fasta
head -n 50000 3_pacbio/pacbio.fasta > 3_pacbio/pacbio.80x.fasta
faops n50 -S -C 3_pacbio/pacbio.80x.fasta
```
## Ngon: generate super-reads
doc/bacteria_2_3.md view on Meta::CPAN
* Stats of processed reads
```bash
BASE_NAME=Cdip
cd ${HOME}/data/anchr/${BASE_NAME}
REAL_G=2488635
bash ~/Scripts/cpan/App-Anchr/share/sr_stat.sh 1 header \
> stat1.md
parallel -k --no-run-if-empty -j 3 "
if [ ! -d 2_illumina/Q{1}L{2} ]; then
exit;
fi
bash ~/Scripts/cpan/App-Anchr/share/sr_stat.sh 1 2_illumina/Q{1}L{2} ${REAL_G}
" ::: 20 25 30 ::: 60 \
>> stat1.md
cat stat1.md
```
| Name | SumIn | CovIn | SumOut | CovOut | Discard% | AvgRead | Kmer | RealG | EstG | Est/Real | RunTime |
|:-------|--------:|------:|--------:|-------:|---------:|--------:|-----:|------:|------:|---------:|----------:|
| Q20L60 | 942.37M | 378.7 | 839.01M | 337.1 | 10.969% | 98 | "51" | 2.49M | 2.92M | 1.17 | 0:03'14'' |
| Q25L60 | 811.86M | 326.2 | 742.27M | 298.3 | 8.571% | 97 | "51" | 2.49M | 2.58M | 1.04 | 0:02'48'' |
| Q30L60 | 675.61M | 271.5 | 631.39M | 253.7 | 6.545% | 93 | "43" | 2.49M | 2.48M | 0.99 | 0:02'21'' |
* kmergenie
```bash
BASE_NAME=Cdip
cd ${HOME}/data/anchr/${BASE_NAME}
mkdir -p 2_illumina/kmergenie
cd 2_illumina/kmergenie
kmergenie -l 21 -k 91 -s 10 -t 8 ../R1.fq.gz -o oriR1
kmergenie -l 21 -k 91 -s 10 -t 8 ../R2.fq.gz -o oriR2
kmergenie -l 21 -k 91 -s 10 -t 8 ../Q30L60/pe.cor.fa -o Q30L60
```
## Cdip: down sampling
```bash
BASE_NAME=Cdip
cd ${HOME}/data/anchr/${BASE_NAME}
REAL_G=2488635
for QxxLxx in $( parallel "echo 'Q{1}L{2}'" ::: 25 30 ::: 60 ); do
echo "==> ${QxxLxx}"
if [ ! -e 2_illumina/${QxxLxx}/pe.cor.fa ]; then
echo "2_illumina/${QxxLxx}/pe.cor.fa not exists"
continue;
fi
for X in 40 80 120 160 240; do
printf "==> Coverage: %s\n" ${X}
rm -fr 2_illumina/${QxxLxx}X${X}*
faops split-about -l 0 \
2_illumina/${QxxLxx}/pe.cor.fa \
$(( ${REAL_G} * ${X} )) \
"2_illumina/${QxxLxx}X${X}"
MAX_SERIAL=$(
cat 2_illumina/${QxxLxx}/environment.json \
| jq ".SUM_OUT | tonumber | . / ${REAL_G} / ${X} | floor | . - 1"
)
for i in $( seq 0 1 ${MAX_SERIAL} ); do
P=$( printf "%03d" ${i})
printf " * Part: %s\n" ${P}
mkdir -p "2_illumina/${QxxLxx}X${X}P${P}"
mv "2_illumina/${QxxLxx}X${X}/${P}.fa" \
"2_illumina/${QxxLxx}X${X}P${P}/pe.cor.fa"
cp 2_illumina/${QxxLxx}/environment.json "2_illumina/${QxxLxx}X${X}P${P}"
done
done
done
```
```bash
BASE_DIR=$HOME/data/anchr/Cdip
cd ${BASE_DIR}
head -n 35000 3_pacbio/pacbio.fasta > 3_pacbio/pacbio.40x.fasta
faops n50 -S -C 3_pacbio/pacbio.40x.fasta
head -n 70000 3_pacbio/pacbio.fasta > 3_pacbio/pacbio.80x.fasta
faops n50 -S -C 3_pacbio/pacbio.80x.fasta
```
## Cdip: k-unitigs and anchors (sampled)
```bash
BASE_NAME=Cdip
cd ${HOME}/data/anchr/${BASE_NAME}
# k-unitigs (sampled)
parallel --no-run-if-empty -j 3 "
echo >&2 '==> Group Q{1}L{2}X{3}P{4}'
if [ ! -e 2_illumina/Q{1}L{2}X{3}P{4}/pe.cor.fa ]; then
echo >&2 ' 2_illumina/Q{1}L{2}X{3}P{4}/pe.cor.fa not exists'
exit;
fi
if [ -e Q{1}L{2}X{3}P{4}/k_unitigs.fasta ]; then
echo >&2 ' k_unitigs.fasta already presents'
exit;
fi
mkdir -p Q{1}L{2}X{3}P{4}
cd Q{1}L{2}X{3}P{4}
anchr kunitigs \
../2_illumina/Q{1}L{2}X{3}P{4}/pe.cor.fa \
../2_illumina/Q{1}L{2}X{3}P{4}/environment.json \
-p 8 \
--kmer 31,41,51,61,71,81 \
-o kunitigs.sh
bash kunitigs.sh
echo >&2
" ::: 25 30 ::: 60 ::: 40 80 120 160 240 ::: 000 001 002 003 004 005 006
# anchors (sampled)
parallel --no-run-if-empty -j 3 "
echo >&2 '==> Group Q{1}L{2}X{3}P{4}'
if [ -e Q{1}L{2}X{3}P{4}/anchor/pe.anchor.fa ]; then
exit;
fi
rm -fr Q{1}L{2}X{3}P{4}/anchor
bash ~/Scripts/cpan/App-Anchr/share/anchor.sh Q{1}L{2}X{3}P{4} 8 false
echo >&2
" ::: 25 30 ::: 60 ::: 40 80 120 160 240 ::: 000 001 002 003 004 005 006
# Stats of anchors
REAL_G=2488635
bash ~/Scripts/cpan/App-Anchr/share/sr_stat.sh 2 header \
> stat2.md
parallel -k --no-run-if-empty -j 6 "
if [ ! -e Q{1}L{2}X{3}P{4}/anchor/pe.anchor.fa ]; then
exit;
fi
bash ~/Scripts/cpan/App-Anchr/share/sr_stat.sh 2 Q{1}L{2}X{3}P{4} ${REAL_G}
" ::: 25 30 ::: 60 ::: 40 80 120 160 240 ::: 000 001 002 003 004 005 006 \
>> stat2.md
cat stat2.md
```
| Name | SumCor | CovCor | N50SR | Sum | # | N50Anchor | Sum | # | N50Others | Sum | # | Kmer | RunTimeKU | RunTimeAN |
|:---------------|:--------|-------:|------:|------:|----:|----------:|------:|----:|----------:|-------:|----:|--------------------:|----------:|----------:|
| Q25L60X40P000 | 99.55M | 40.0 | 34190 | 2.46M | 140 | 34190 | 2.45M | 131 | 844 | 7.28K | 9 | "31,41,51,61,71,81" | 0:02'48'' | 0:01'25'' |
| Q25L60X40P001 | 99.55M | 40.0 | 30045 | 2.46M | 148 | 30045 | 2.45M | 132 | 844 | 13.48K | 16 | "31,41,51,61,71,81" | 0:02'49'' | 0:01'35'' |
| Q25L60X40P002 | 99.55M | 40.0 | 27638 | 2.47M | 162 | 27680 | 2.45M | 145 | 742 | 13.08K | 17 | "31,41,51,61,71,81" | 0:02'46'' | 0:01'35'' |
| Q25L60X40P003 | 99.55M | 40.0 | 33236 | 2.46M | 131 | 33236 | 2.45M | 117 | 684 | 9.59K | 14 | "31,41,51,61,71,81" | 0:02'53'' | 0:01'23'' |
| Q25L60X40P004 | 99.55M | 40.0 | 49674 | 2.45M | 99 | 49674 | 2.45M | 91 | 748 | 6.37K | 8 | "31,41,51,61,71,81" | 0:02'58'' | 0:01'31'' |
| Q25L60X40P005 | 99.55M | 40.0 | 46364 | 2.46M | 108 | 46364 | 2.45M | 97 | 727 | 7.86K | 11 | "31,41,51,61,71,81" | 0:02'55'' | 0:01'33'' |
| Q25L60X40P006 | 99.55M | 40.0 | 47421 | 2.46M | 117 | 47421 | 2.45M | 105 | 783 | 9.12K | 12 | "31,41,51,61,71,81" | 0:02'45'' | 0:01'35'' |
| Q25L60X80P000 | 199.09M | 80.0 | 19434 | 2.46M | 238 | 19434 | 2.44M | 213 | 822 | 19.67K | 25 | "31,41,51,61,71,81" | 0:03'58'' | 0:02'06'' |
| Q25L60X80P001 | 199.09M | 80.0 | 15365 | 2.46M | 246 | 15447 | 2.45M | 227 | 727 | 13.98K | 19 | "31,41,51,61,71,81" | 0:03'53'' | 0:02'11'' |
| Q25L60X80P002 | 199.09M | 80.0 | 27534 | 2.46M | 163 | 27534 | 2.45M | 151 | 707 | 8.48K | 12 | "31,41,51,61,71,81" | 0:03'57'' | 0:01'56'' |
| Q25L60X120P000 | 298.64M | 120.0 | 9278 | 2.47M | 391 | 9498 | 2.44M | 351 | 770 | 29.14K | 40 | "31,41,51,61,71,81" | 0:05'22'' | 0:02'46'' |
| Q25L60X120P001 | 298.64M | 120.0 | 13839 | 2.46M | 290 | 13936 | 2.44M | 261 | 727 | 20.61K | 29 | "31,41,51,61,71,81" | 0:05'23'' | 0:02'34'' |
| Q25L60X160P000 | 398.18M | 160.0 | 6698 | 2.47M | 550 | 6848 | 2.42M | 479 | 727 | 50.73K | 71 | "31,41,51,61,71,81" | 0:07'01'' | 0:03'12'' |
| Q25L60X240P000 | 597.27M | 240.0 | 4746 | 2.47M | 759 | 4908 | 2.38M | 627 | 778 | 95.9K | 132 | "31,41,51,61,71,81" | 0:09'39'' | 0:03'43'' |
| Q30L60X40P000 | 99.55M | 40.0 | 55218 | 2.46M | 91 | 55218 | 2.44M | 81 | 10398 | 17.13K | 10 | "31,41,51,61,71,81" | 0:03'05'' | 0:01'35'' |
| Q30L60X40P001 | 99.55M | 40.0 | 55749 | 2.45M | 93 | 55749 | 2.45M | 85 | 844 | 6.42K | 8 | "31,41,51,61,71,81" | 0:02'52'' | 0:01'35'' |
| Q30L60X40P002 | 99.55M | 40.0 | 65454 | 2.46M | 75 | 65454 | 2.44M | 62 | 1126 | 13.91K | 13 | "31,41,51,61,71,81" | 0:03'01'' | 0:01'37'' |
| Q30L60X40P003 | 99.55M | 40.0 | 97954 | 2.45M | 68 | 97954 | 2.45M | 62 | 834 | 4.77K | 6 | "31,41,51,61,71,81" | 0:02'53'' | 0:01'25'' |
| Q30L60X40P004 | 99.55M | 40.0 | 71924 | 2.45M | 76 | 71924 | 2.45M | 67 | 727 | 6.18K | 9 | "31,41,51,61,71,81" | 0:02'54'' | 0:01'26'' |
| Q30L60X40P005 | 99.55M | 40.0 | 63766 | 2.45M | 88 | 63766 | 2.44M | 76 | 727 | 8.73K | 12 | "31,41,51,61,71,81" | 0:02'37'' | 0:01'26'' |
| Q30L60X80P000 | 199.09M | 80.0 | 60425 | 2.45M | 76 | 60425 | 2.45M | 70 | 753 | 4.59K | 6 | "31,41,51,61,71,81" | 0:04'00'' | 0:02'03'' |
| Q30L60X80P001 | 199.09M | 80.0 | 68973 | 2.45M | 64 | 68973 | 2.45M | 57 | 844 | 5.27K | 7 | "31,41,51,61,71,81" | 0:04'02'' | 0:02'00'' |
| Q30L60X80P002 | 199.09M | 80.0 | 89791 | 2.45M | 65 | 89791 | 2.45M | 58 | 809 | 5.53K | 7 | "31,41,51,61,71,81" | 0:04'01'' | 0:02'08'' |
| Q30L60X120P000 | 298.64M | 120.0 | 60425 | 2.45M | 74 | 60427 | 2.45M | 67 | 727 | 5.37K | 7 | "31,41,51,61,71,81" | 0:05'23'' | 0:02'27'' |
| Q30L60X120P001 | 298.64M | 120.0 | 71924 | 2.45M | 62 | 71924 | 2.44M | 56 | 844 | 4.9K | 6 | "31,41,51,61,71,81" | 0:05'18'' | 0:02'36'' |
| Q30L60X160P000 | 398.18M | 160.0 | 60427 | 2.45M | 75 | 60427 | 2.45M | 68 | 727 | 5.37K | 7 | "31,41,51,61,71,81" | 0:06'26'' | 0:03'04'' |
| Q30L60X240P000 | 597.27M | 240.0 | 57594 | 2.45M | 86 | 59198 | 2.44M | 74 | 844 | 9.05K | 12 | "31,41,51,61,71,81" | 0:06'47'' | 0:03'28'' |
## Cdip: merge anchors
```bash
BASE_NAME=Cdip
cd ${HOME}/data/anchr/${BASE_NAME}
# merge anchors
mkdir -p merge
anchr contained \
$(
parallel -k --no-run-if-empty -j 6 "
if [ -e Q{1}L{2}X{3}P{4}/anchor/pe.anchor.fa ]; then
echo Q{1}L{2}X{3}P{4}/anchor/pe.anchor.fa
fi
" ::: 25 30 ::: 60 ::: 40 80 120 160 240 ::: 000 001 002 003 004 005 006
) \
--len 1000 --idt 0.98 --proportion 0.99999 --parallel 16 \
-o stdout \
| faops filter -a 1000 -l 0 stdin merge/anchor.contained.fasta
anchr orient merge/anchor.contained.fasta --len 1000 --idt 0.98 -o merge/anchor.orient.fasta
anchr merge merge/anchor.orient.fasta --len 1000 --idt 0.999 -o stdout \
| faops filter -a 1000 -l 0 stdin merge/anchor.merge.fasta
# merge others
mkdir -p merge
anchr contained \
$(
parallel -k --no-run-if-empty -j 6 "
if [ -e Q{1}L{2}X{3}P{4}/anchor/pe.others.fa ]; then
echo Q{1}L{2}X{3}P{4}/anchor/pe.others.fa
fi
" ::: 25 30 ::: 60 ::: 40 80 120 160 240 ::: 000 001 002 003 004 005 006
) \
--len 1000 --idt 0.98 --proportion 0.99999 --parallel 16 \
-o stdout \
| faops filter -a 1000 -l 0 stdin merge/others.contained.fasta
anchr orient merge/others.contained.fasta --len 1000 --idt 0.98 -o merge/others.orient.fasta
anchr merge merge/others.orient.fasta --len 1000 --idt 0.999 -o stdout \
| faops filter -a 1000 -l 0 stdin merge/others.merge.fasta
# anchors sorted on ref
bash ~/Scripts/cpan/App-Anchr/share/sort_on_ref.sh merge/anchor.merge.fasta 1_genome/genome.fa merge/anchor.sort
nucmer -l 200 1_genome/genome.fa merge/anchor.sort.fa
mummerplot -png out.delta -p anchor.sort --large
# mummerplot files
rm *.[fr]plot
rm out.delta
rm *.gp
mv anchor.sort.png merge/
# quast
rm -fr 9_qa
quast --no-check --threads 16 \
-R 1_genome/genome.fa \
merge/anchor.merge.fasta \
merge/others.merge.fasta \
1_genome/paralogs.fas \
--label "merge,others,paralogs" \
-o 9_qa
```
## Cdip: 3GS
```bash
BASE_NAME=Cdip
REAL_G=2488635
cd $HOME/data/anchr/${BASE_NAME}
canu \
-p ${BASE_NAME} -d canu-raw-40x \
gnuplot=$(brew --prefix)/Cellar/$(brew list --versions gnuplot | sed 's/ /\//')/bin/gnuplot \
genomeSize=${REAL_G} \
-pacbio-raw 3_pacbio/pacbio.40x.fasta
canu \
-p ${BASE_NAME} -d canu-raw-80x \
gnuplot=$(brew --prefix)/Cellar/$(brew list --versions gnuplot | sed 's/ /\//')/bin/gnuplot \
genomeSize=${REAL_G} \
-pacbio-raw 3_pacbio/pacbio.80x.fasta
faops n50 -S -C canu-raw-40x/${BASE_NAME}.trimmedReads.fasta.gz
faops n50 -S -C canu-raw-80x/${BASE_NAME}.trimmedReads.fasta.gz
```
## Cdip: expand anchors
* anchorLong
```bash
doc/bacteria_2_3.md view on Meta::CPAN
printf "| %s | %s | %s | %s |\n" \
"Name" "N50" "Sum" "#" \
> stat3.md
printf "|:--|--:|--:|--:|\n" >> stat3.md
printf "| %s | %s | %s | %s |\n" \
$(echo "Genome"; faops n50 -H -S -C 1_genome/genome.fa;) >> stat3.md
printf "| %s | %s | %s | %s |\n" \
$(echo "Paralogs"; faops n50 -H -S -C 1_genome/paralogs.fas;) >> stat3.md
printf "| %s | %s | %s | %s |\n" \
$(echo "anchor.merge"; faops n50 -H -S -C merge/anchor.merge.fasta;) >> stat3.md
printf "| %s | %s | %s | %s |\n" \
$(echo "others.merge"; faops n50 -H -S -C merge/others.merge.fasta;) >> stat3.md
printf "| %s | %s | %s | %s |\n" \
$(echo "anchor.cover"; faops n50 -H -S -C merge/anchor.cover.fasta;) >> stat3.md
printf "| %s | %s | %s | %s |\n" \
$(echo "anchorLong"; faops n50 -H -S -C anchorLong/contig.fasta;) >> stat3.md
printf "| %s | %s | %s | %s |\n" \
$(echo "contigTrim"; faops n50 -H -S -C contigTrim/contig.fasta;) >> stat3.md
cat stat3.md
```
| Name | N50 | Sum | # |
|:-------------|--------:|--------:|---:|
| Genome | 1892775 | 1892775 | 1 |
| Paralogs | 33912 | 93531 | 10 |
| anchor.merge | 32813 | 1801122 | 73 |
| others.merge | 32404 | 64274 | 3 |
| anchor.cover | 32813 | 1796007 | 71 |
| anchorLong | 35248 | 1795927 | 70 |
| contigTrim | 1027458 | 1856949 | 4 |
* Clear QxxLxxXxx.
```bash
BASE_NAME=Ftul
cd ${HOME}/data/anchr/${BASE_NAME}
rm -fr 2_illumina/Q{20,25,30}L{1,60,90,120}X*
rm -fr Q{20,25,30}L{1,60,90,120}X*
```
# Haemophilus influenzae FDAARGOS_199, æµæåè¡æè
* Project
[SRP040661](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP040661)
* Other name: ATCC 51907D; Rd KW20
* BioSample: [SAMN04875536](https://www.ncbi.nlm.nih.gov/biosample/SAMN04875536)
## Hinf: download
* Reference genome
* Strain: Haemophilus influenzae Rd KW20 (g-proteobacteria)
* Taxid: [71421](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=71421)
* RefSeq assembly accession:
[GCF_000027305.1](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/027/305/GCF_000027305.1_ASM2730v1/GCF_000027305.1_ASM2730v1_assembly_report.txt)
* Proportion of paralogs (> 1000 bp): 0.0324
```bash
mkdir -p ~/data/anchr/Hinf/1_genome
cd ~/data/anchr/Hinf/1_genome
aria2c -x 9 -s 3 -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/027/305/GCF_000027305.1_ASM2730v1/GCF_000027305.1_ASM2730v1_genomic.fna.gz
TAB=$'\t'
cat <<EOF > replace.tsv
NC_000907.1${TAB}1
EOF
faops replace GCF_000027305.1_ASM2730v1_genomic.fna.gz replace.tsv genome.fa
cp ~/data/anchr/paralogs/otherbac/Results/Hinf/Hinf.multi.fas paralogs.fas
```
* Illumina
* [SRX2104758](https://www.ncbi.nlm.nih.gov/sra/SRX2104758) SRR4123928
```bash
mkdir -p ~/data/anchr/Hinf/2_illumina
cd ~/data/anchr/Hinf/2_illumina
cat << EOF > sra_ftp.txt
EOF
aria2c -x 9 -s 3 -c -i sra_ftp.txt
cat << EOF > sra_md5.txt
EOF
md5sum --check sra_md5.txt
ln -s SRR4124773_1.fastq.gz R1.fq.gz
ln -s SRR4124773_2.fastq.gz R2.fq.gz
```
# Listeria monocytogenes FDAARGOS_351, åæ ¸ç»èå¢çææ¯ç¹æ°è
Project
[SRP040661](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP040661)
## Lmon: download
* Reference genome
* Strain: Listeria monocytogenes EGD-e
* Taxid: [169963](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=169963)
* RefSeq assembly accession:
[GCF_000196035.1](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/196/035/GCF_000196035.1_ASM19603v1/GCF_000196035.1_ASM19603v1_assembly_report.txt)
* Proportion of paralogs (> 1000 bp): 0.0133
```bash
mkdir -p ~/data/anchr/Lmon/1_genome
cd ~/data/anchr/Lmon/1_genome
( run in 1.922 second using v1.01-cache-2.11-cpan-75ffa21a3d4 )