App-Anchr
doc/e_coli.md view on Meta::CPAN
brew cleanup --force # only keep the latest version
```
* Compiling with `pitchfork`
```bash
mkdir -p ~/share/pitchfork
git clone https://github.com/PacificBiosciences/pitchfork ~/share/pitchfork
cd ~/share/pitchfork
cat <<EOF > settings.mk
HAVE_ZLIB = $(brew --prefix)/Cellar/$(brew list --versions zlib | sed 's/ /\//')
HAVE_BOOST = $(brew --prefix)/Cellar/$(brew list --versions boost | sed 's/ /\//')
HAVE_OPENBLAS = $(brew --prefix)/Cellar/$(brew list --versions openblas | sed 's/ /\//')
HAVE_PYTHON = $(brew --prefix)/bin/python
HAVE_CMAKE = $(brew --prefix)/bin/cmake
HAVE_CCACHE = $(brew --prefix)/Cellar/$(brew list --versions ccache | sed 's/ /\//')/bin/ccache
HAVE_HDF5 = $(brew --prefix)/Cellar/$(brew list --versions hdf5 | sed 's/ /\//')
EOF
# fix several Makefiles
sed -i".bak" "/rsync/d" ~/share/pitchfork/ports/python/virtualenv/Makefile
sed -i".bak" "s/-- third-party\/cpp-optparse/--remote/" ~/share/pitchfork/ports/pacbio/bam2fastx/Makefile
sed -i".bak" "/third-party\/gtest/d" ~/share/pitchfork/ports/pacbio/bam2fastx/Makefile
sed -i".bak" "/ccache /d" ~/share/pitchfork/ports/pacbio/bam2fastx/Makefile
cd ~/share/pitchfork
make pip
deployment/bin/pip install --upgrade pip setuptools wheel virtualenv
make bax2bam
```
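The `HAVE_*` lines above build Cellar paths by replacing the first space in the output of `brew list --versions` with a slash. A minimal sketch of that transformation, using a made-up version string instead of real `brew` output (`cellar_subdir` is a hypothetical helper name):

```bash
# Hypothetical helper: turn one "name version" line, as printed by
# `brew list --versions <formula>`, into the "name/version" Cellar subdirectory.
cellar_subdir() {
    sed 's/ /\//'
}

# Sample input standing in for `brew list --versions zlib`;
# the version number is illustrative only.
echo "zlib 1.2.11" | cellar_subdir
```

Only the first space is replaced, so this presumably relies on a single installed version per formula, which is what the earlier `brew cleanup` step leaves behind.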
* Compiled binaries are in `~/share/pitchfork/deployment`. Running
`source ~/share/pitchfork/deployment/setup-env.sh` adds this
path to your `$PATH`. It also changes other parts of your bash
environment; if anything goes wrong, restart your terminal.
```bash
source ~/share/pitchfork/deployment/setup-env.sh
bax2bam --help
```
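One way to try `setup-env.sh` without touching the current session is to source it inside a subshell. The sketch below uses a temporary stand-in script, since the real contents of `setup-env.sh` are not shown here; `path_after_sourcing` and the `/tmp/demo-deployment/bin` path are hypothetical.

```bash
# Hypothetical helper: source an environment script inside a subshell and
# print the PATH it produces, leaving the calling shell untouched.
# (`.` is the portable spelling of bash's `source`.)
path_after_sourcing() {
    ( . "$1" >/dev/null && printf '%s\n' "$PATH" )
}

# Stand-in for setup-env.sh; it only prepends a made-up directory to PATH.
demo_env=$(mktemp)
printf 'export PATH="/tmp/demo-deployment/bin:$PATH"\n' > "$demo_env"

path_before=$PATH
path_after_sourcing "$demo_env" | cut -d: -f1   # first PATH entry seen inside the subshell
[ "$PATH" = "$path_before" ] && echo "parent PATH unchanged"
rm -f "$demo_env"
```

The same pattern should work for a one-off command, e.g. `( source ~/share/pitchfork/deployment/setup-env.sh; bax2bam --help )`.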
* Data from P4C2 and older chemistries are not supported by the current
version of PacBio software (SMRTAnalysis), so install SMRTAnalysis_2.3.0.
```bash
mkdir -p ~/share/SMRTAnalysis_2.3.0
cd ~/share/SMRTAnalysis_2.3.0
aria2c -x 9 -s 3 -c http://files.pacb.com/software/smrtanalysis/2.3.0/smrtanalysis_2.3.0.140936.run
aria2c -x 9 -s 3 -c http://files.pacb.com/software/smrtanalysis/2.3.0/smrtanalysis-patch_2.3.0.140936.p5.run
aria2c -x 9 -s 3 -c https://atlas.hashicorp.com/ubuntu/boxes/trusty64/versions/20170313.0.7/providers/virtualbox.box
vagrant box add ubuntu/trusty64 trusty-server-cloudimg-amd64-vagrant-disk1.box --force
curl -O https://raw.githubusercontent.com/mhsieh/SMRTAnalysis_2.3.0_install/master/vagrant-u1404/Vagrantfile
vagrant destroy -f
rm -fr .vagrant/
vagrant up --provider virtualbox
```
# *Escherichia coli* str. K-12 substr. MG1655
* Genome: INSDC
[U00096.3](https://www.ncbi.nlm.nih.gov/nuccore/U00096.3)
* Taxonomy ID:
[511145](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145)
* Proportion of paralogs (> 1000 bp): 0.0323
## Download
* Reference genome
```bash
mkdir -p ~/data/anchr/e_coli/1_genome
cd ~/data/anchr/e_coli/1_genome
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=U00096.3&rettype=fasta&retmode=txt" \
> U00096.fa
# simplify header, remove .3
cat U00096.fa \
| perl -nl -e '
/^>(\w+)/ and print qq{>$1} and next;
print;
' \
> genome.fa
cp ~/data/anchr/paralogs/model/Results/e_coli/e_coli.multi.fas paralogs.fas
```
* Illumina
```bash
mkdir -p ~/data/anchr/e_coli/2_illumina
cd ~/data/anchr/e_coli/2_illumina
aria2c -x 9 -s 3 -c ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF_R1.fastq.gz
aria2c -x 9 -s 3 -c ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF_R2.fastq.gz
ln -s MiSeq_Ecoli_MG1655_110721_PF_R1.fastq.gz R1.fq.gz
ln -s MiSeq_Ecoli_MG1655_110721_PF_R2.fastq.gz R2.fq.gz
```
* PacBio
PacBio
[provides](https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-Bacterial-Assembly)
a 7 GB file for *E. coli* (20 kb library), which was generated on the
RS II with the P6C4 reagent.
```bash
mkdir -p ~/data/anchr/e_coli/3_pacbio
cd ~/data/anchr/e_coli/3_pacbio
aria2c -x 9 -s 3 -c https://s3.amazonaws.com/files.pacb.com/datasets/secondary-analysis/e-coli-k12-P6C4/p6c4_ecoli_RSII_DDR2_with_15kb_cut_E01_1.tar.gz
tar xvfz p6c4_ecoli_RSII_DDR2_with_15kb_cut_E01_1.tar.gz
printf "| %s | %s | %s | %s |\n" \
$(echo "shuffle"; faops n50 -H -S -C 2_illumina/R1.shuffle.fq.gz 2_illumina/R2.shuffle.fq.gz;) >> stat.md
parallel -k --no-run-if-empty -j 3 "
printf \"| %s | %s | %s | %s |\n\" \
\$(
echo Q{1}L{2};
if [[ {1} -ge '30' ]]; then
faops n50 -H -S -C \
2_illumina/Q{1}L{2}/R1.fq.gz \
2_illumina/Q{1}L{2}/R2.fq.gz \
2_illumina/Q{1}L{2}/Rs.fq.gz;
else
faops n50 -H -S -C \
2_illumina/Q{1}L{2}/R1.fq.gz \
2_illumina/Q{1}L{2}/R2.fq.gz;
fi
)
" ::: 20 25 30 35 ::: 30 60 90 120 \
>> stat.md
printf "| %s | %s | %s | %s |\n" \
$(echo "PacBio"; faops n50 -H -S -C 3_pacbio/pacbio.fasta;) >> stat.md
parallel -k --no-run-if-empty -j 3 "
printf \"| %s | %s | %s | %s |\n\" \
\$(
echo PacBio.{};
faops n50 -H -S -C \
3_pacbio/pacbio.{}.fasta;
)
" ::: trim 20x 20x.trim 40x 40x.trim 80x 80x.trim \
>> stat.md
cat stat.md
```
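Each row of `stat.md` is produced by letting the shell split the output of a command substitution into words and feeding them to a four-slot `printf` format. A minimal sketch of that pattern with canned numbers in place of `faops n50` output (`stat_row` is a hypothetical helper):

```bash
# Hypothetical helper: print one Markdown table row from a row name ($1)
# plus three whitespace-separated values read from stdin.
stat_row() {
    # The command substitution is deliberately unquoted so that the shell
    # splits it into four separate arguments for printf.
    printf "| %s | %s | %s | %s |\n" $(echo "$1"; cat)
}

# Canned values standing in for the N50, sum and count of the genome.
printf '4641652\n4641652\n1\n' | stat_row Genome
```

A row name containing spaces would be split into extra words, which is presumably why the names above are single tokens such as `Q25L60` and `PacBio.20x.trim`.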
| Name | N50 | Sum | # |
|:----------------|--------:|-----------:|---------:|
| Genome | 4641652 | 4641652 | 1 |
| Paralogs | 1934 | 195673 | 106 |
| Illumina | 151 | 1730299940 | 11458940 |
| uniq | 151 | 1727289000 | 11439000 |
| scythe | 151 | 1722450607 | 11439000 |
| shuffle | 151 | 1722450607 | 11439000 |
| Q20L30 | 151 | 1514584050 | 11126596 |
| Q20L60 | 151 | 1468709458 | 10572422 |
| Q20L90 | 151 | 1370119196 | 9617554 |
| Q20L120 | 151 | 1135307713 | 7723784 |
| Q25L30 | 151 | 1382782641 | 10841386 |
| Q25L60 | 151 | 1317617346 | 9994728 |
| Q25L90 | 151 | 1177142378 | 8586574 |
| Q25L120 | 151 | 837111446 | 5805874 |
| Q30L30 | 125 | 1192536117 | 10716954 |
| Q30L60 | 127 | 1149107745 | 9783292 |
| Q30L90 | 130 | 1021609911 | 8105773 |
| Q30L120 | 139 | 693661043 | 5002158 |
| Q35L30 | 64 | 588252718 | 9588363 |
| Q35L60 | 72 | 366922898 | 5062192 |
| Q35L90 | 95 | 35259773 | 364046 |
| Q35L120 | 124 | 647353 | 5169 |
| PacBio | 13982 | 748508361 | 87225 |
| PacBio.trim | 13630 | 688575670 | 77687 |
| PacBio.20x | 13962 | 99252919 | 11500 |
| PacBio.20x.trim | 13541 | 88697009 | 9980 |
| PacBio.40x | 13948 | 198650072 | 23000 |
| PacBio.40x.trim | 13565 | 179462005 | 20137 |
| PacBio.80x | 13996 | 395094712 | 46000 |
| PacBio.80x.trim | 13608 | 360190363 | 40682 |
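
The `Sum` column can be turned into an approximate depth of coverage by dividing by the genome length (4641652 bp, the `Genome` row). This is a back-of-the-envelope sketch rather than part of the original pipeline; `coverage` is a hypothetical helper.

```bash
genome_size=4641652   # "Sum" of the Genome row in the table above

# Hypothetical helper: total bases / genome size, to one decimal place.
coverage() {
    awk -v sum="$1" -v size="$genome_size" 'BEGIN { printf "%.1f\n", sum / size }'
}

coverage 1730299940   # Illumina   -> 372.8
coverage 748508361    # PacBio     -> 161.3
coverage 99252919     # PacBio.20x -> 21.4
```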
## SPAdes
```bash
BASE_NAME=e_coli
cd ${HOME}/data/anchr/${BASE_NAME}
spades.py \
-t 16 \
-k 21,33,55,77 --careful \
-1 2_illumina/Q25L60/R1.fq.gz \
-2 2_illumina/Q25L60/R2.fq.gz \
-s 2_illumina/Q25L60/Rs.fq.gz \
-o 8_spades
spades.py \
-t 16 \
-k 21,33,55,77 --careful \
-1 2_illumina/Q30L60/R1.fq.gz \
-2 2_illumina/Q30L60/R2.fq.gz \
-s 2_illumina/Q30L60/Rs.fq.gz \
-o 8_spades_Q30L60
```
## Platanus
```bash
BASE_NAME=e_coli
cd ${HOME}/data/anchr/${BASE_NAME}
mkdir -p 8_platanus
cd 8_platanus
if [ ! -e pe.fa ]; then
faops interleave \
-p pe \
../2_illumina/Q25L60/R1.fq.gz \
../2_illumina/Q25L60/R2.fq.gz \
> pe.fa
faops interleave \
-p se \
../2_illumina/Q25L60/Rs.fq.gz \
> se.fa
fi
platanus assemble -t 16 -m 100 \
-f pe.fa se.fa \
2>&1 | tee ass_log.txt
platanus scaffold -t 16 \