App-Dazz

 view release on metacpan or  search on metacpan

doc/dazzler.md  view on Meta::CPAN


```shell script
# Dmel iso_1 reference genome
mkdir -p ~/data/dazz/ref
cd ~/data/dazz/ref

rsync -avP \
    ftp.ncbi.nlm.nih.gov::genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/ \
    iso_1/

# rename
mkdir -p ~/data/dazz/dazzler
cd ~/data/dazz/dazzler

gzip -dcf ../ref/iso_1/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz |
    dazz dazzname stdin -o stdout |
    faops filter -l 0 stdin renamed.fasta

```

### Create and split DB

`myDB.db` and its hidden companions.

```shell script
cd ~/data/dazz/dazzler

echo "Make the dazzler DB"
DBrm myDB
fasta2DB myDB renamed.fasta
DBdust myDB
# each block is of size 50 MB
DBsplit -s50 myDB

BLOCK_NUMBER=$(cat myDB.db | perl -nl -e '/^blocks\s+=\s+(\d+)/ and print $1')
echo ${BLOCK_NUMBER}

```

### Retrieve some records from DB

* If the `-n` option is set then the DNA sequence is **not** displayed

```shell script
cd ~/data/dazz/dazzler

# headers
DBshow -n myDB 5-10 102 100-101

# sequences from the original file
faops some -l 0 renamed.fasta <(DBshow -n myDB 5-10 102 100-101 | sed 's/^>//') stdout

```

## daligner

`HPC.daligner`

* local alignments involving at least `-l` base pairs (default 1000)
* An average correlation rate of `-e` (default 70%) set to 80%
* The default number of threads is 4, set by `-T` option (power of 2)
* Set the `-t` parameter which suppresses the use of any *k*-mer that occurs more than *t* times in
  either the subject or target block.
* Let the program automatically select a value of *t* that meets a given memory usage limit
  specified (in Gb) by the `-M` parameter
* one or more interval tracks specified with the `-m` option (m for mask)

### Create jobs by `HPC.daligner` and execute it

Three .las (`myDB.[1-3].las`) files are generated then concatenated to `myDB.las`.

```shell script
cd ~/data/dazz/dazzler

if [[ -e myDB.las || -e myDB.1.las ]]; then
    rm myDB*.las
fi
HPC.daligner -v -M16 -e.96 -l500 -s500 -mdust myDB > job.sh
bash job.sh

LAcat -v myDB.@.las > myDB.las

```

Contents of `job.sh`

```shell script
# Daligner jobs (3)
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.1 myDB.@1-1
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.2 myDB.@1-2
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.3 myDB.@1-3
# Check initial .las files jobs (3) (optional but recommended)
LAcheck -vS myDB myDB.1.myDB.@
LAcheck -vS myDB myDB.2.myDB.@
LAcheck -vS myDB myDB.3.myDB.@
# Merge jobs (3)
LAmerge -v myDB.1 myDB.1.myDB.@ && LAcheck -vS myDB myDB.1
LAmerge -v myDB.2 myDB.2.myDB.@ && LAcheck -vS myDB myDB.2
LAmerge -v myDB.3 myDB.3.myDB.@ && LAcheck -vS myDB myDB.3
# Remove block .las files (optional)
rm myDB.1.myDB.*.las
rm myDB.2.myDB.*.las
rm myDB.3.myDB.*.las

```

The 3 lines of daligner are equivalent to the following:

```shell script
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.1 myDB.1
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.1 myDB.2
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.1 myDB.3
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.2 myDB.2
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.2 myDB.3
daligner -v -e0.96 -l500 -s500 -M16 -mdust myDB.3 myDB.3

```

Results.

```shell script



( run in 1.008 second using v1.01-cache-2.11-cpan-0bb4e1dffa6 )