
ALFPY OVERVIEW

alfpy provides command-line Python scripts that let you compare unaligned FASTA sequences (DNA/RNA/protein) with more than 40 distance methods. By default, all distance calculations will output a distance matrix in Phylip format, which can be directly used as an input to popular tree-building software such as PHYLIP or EMBOSS fneighbor.

alfpy is also a Python package that provides an alignment-free framework for comparing biological sequences and inferring their phylogenetic relationships. All its functionality is wrapped in logically separated Python modules that let you write your own analysis scripts with just basic Python knowledge.

SOFTWARE REQUIREMENTS

Python version 2.7 or >= 3.3
NumPy

INSTALLATION

Option 1: Get the latest official version (recommended)

To download, build, and install the latest stable version, use pip:

pip install alfpy

If you are on Linux or macOS, you may need to run the installation with the sudo command:

sudo pip install alfpy

If you are not allowed to use sudo, install alfpy for the current user only:

pip install --user alfpy

Option 2: Get the latest development version

The latest development version is the one that’s in our Git repository. Get it using this shell command, which requires git:

git clone https://github.com/aziele/alfpy.git

If you don't feel like using git, just download the package manually as a zip archive.

Unpack the zip package (or use the cloned repository), go to the directory, and run the installation system-wide:

sudo python setup.py install

or, without sudo, for the current user only:

python setup.py install --user

USAGE OVERVIEW

alfpy provides 12 command-line Python scripts for computing 44 dissimilarity measures among sequences provided in a FASTA file.

Script name	Sequence type	Approach	Distance measures
calc_bbc.py	DNA/protein	Base-Base Correlation (BBC)	1
calc_graphdna.py	DNA	Two-dimensional (2D) graphical DNA curve	3
calc_fcgr.py	DNA	Frequency Chaos Game Representation (FCGR)	1
calc_lempelziv.py	DNA/protein	Lempel-Ziv complexity	5
calc_ncd.py	DNA/protein	Normalized Compression Distance (NCD)	1
calc_wmetric.py	protein	W-metric	1
calc_word.py	DNA/protein	Word-based: standard	19
calc_word_bool.py	DNA/protein	Word-based: boolean 1-D vectors of word occurrences	9
calc_word_cv.py	DNA/protein	Word-based: Composition vector (3rd order Markov model)	1
calc_word_d2.py	DNA/protein	Word-based: d2 metric (different word resolutions)	1
calc_word_ffp.py	DNA/protein	Word-based: Feature Frequency Profiles (FFPs)	1
calc_word_rtd.py	DNA/protein	Word-based: Return Time Distribution (RTD)	1

USAGE EXAMPLES

All the programs require at least a set of (unaligned) sequences in FASTA format (e.g. hiv_origin.pep.fa).

All the programs output distances in PHYLIP (by default) or a pairwise format.
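
For readers unfamiliar with the input format, the sketch below shows one way a multi-record FASTA file can be read into (identifier, sequence) pairs. It is an illustration only, not alfpy's own parser: the function name parse_fasta and the two-record sample text are hypothetical.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            if seq_id is not None:
                yield seq_id, ''.join(chunks)
            # Keep only the first word of the header line as the identifier.
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ''), []
        elif seq_id is not None:
            chunks.append(line)  # a sequence may span several lines
    if seq_id is not None:
        yield seq_id, ''.join(chunks)


# Hypothetical sample input; a real file would be opened with open(...).
example = io.StringIO(u'>seq1 first peptide\nMKST\nGWHF\n>seq2\nMKSSSSTGWGWG\n')
records = list(parse_fasta(example))
print(records)
```

In practice, alfpy's seqrecords.read_fasta (shown in the PYTHON PACKAGE section) should be preferred over a hand-written reader.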

EXAMPLE #1 - Basic usage

To calculate the W-metric distance, type on the command line:

calc_wmetric.py --fasta hiv_origin.pep.fa

By default, you get a distance matrix in PHYLIP format.

    6
HIV1_MN    0.0000000 0.0028074 0.0072904 0.0247550 0.0209014 0.0198626
HIV1_KB    0.0028074 0.0000000 0.0052245 0.0268161 0.0216594 0.0208300
SIV_CZ     0.0072904 0.0052245 0.0000000 0.0310483 0.0233556 0.0225691
HIV2_RO    0.0247550 0.0268161 0.0310483 0.0000000 0.0037506 0.0072975
HIV2_CA    0.0209014 0.0216594 0.0233556 0.0037506 0.0000000 0.0045389
SIV_MK     0.0198626 0.0208300 0.0225691 0.0072975 0.0045389 0.0000000
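
The layout above can be reproduced with a few lines of Python. The sketch below is an assumption about the formatting rules inferred from the printed matrix (a count line, then a name padded to 10 characters followed by values with 7 decimals); the function name format_phylip is hypothetical and this is not alfpy's own writer.

```python
def format_phylip(ids, matrix, decimals=7):
    """Return a square distance matrix as PHYLIP-style text."""
    lines = ['   {0}'.format(len(ids))]  # header: number of sequences
    for seq_id, row in zip(ids, matrix):
        values = ' '.join('{0:.{1}f}'.format(v, decimals) for v in row)
        # Names are truncated/padded to 10 characters, as in strict PHYLIP.
        lines.append('{0:<10s} {1}'.format(seq_id[:10], values))
    return '\n'.join(lines)


# First three sequences of the W-metric example above.
ids = ['HIV1_MN', 'HIV1_KB', 'SIV_CZ']
matrix = [
    [0.0000000, 0.0028074, 0.0072904],
    [0.0028074, 0.0000000, 0.0052245],
    [0.0072904, 0.0052245, 0.0000000],
]
text = format_phylip(ids, matrix)
print(text)
```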

EXAMPLE #2 - Changing output format

You can change the output format to pairwise:

calc_wmetric.py --fasta hiv_origin.pep.fa --outfmt pairwise

HIV1_MN HIV1_KB 0.00280744374972
HIV1_MN SIV_CZ  0.00729040755063
HIV1_MN HIV2_RO 0.0247549689229
HIV1_MN HIV2_CA 0.0209013681508
HIV1_MN SIV_MK  0.0198626384014
HIV1_KB SIV_CZ  0.00522451291058
HIV1_KB HIV2_RO 0.026816057513
HIV1_KB HIV2_CA 0.0216594066907
HIV1_KB SIV_MK  0.020829977091
SIV_CZ  HIV2_RO 0.0310483221507
SIV_CZ  HIV2_CA 0.0233556481886
SIV_CZ  SIV_MK  0.0225690537356
HIV2_RO HIV2_CA 0.00375055665432
HIV2_RO SIV_MK  0.00729748162273
HIV2_CA SIV_MK  0.00453893527008
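
The pairwise format lists each unordered pair once, i.e. the upper triangle of the distance matrix read row by row. A minimal sketch of that conversion follows; the helper name iter_pairs is hypothetical and the tab separator is an assumption.

```python
from itertools import combinations


def iter_pairs(ids, matrix):
    """Yield (id1, id2, distance) for every unordered pair, upper triangle only."""
    for i, j in combinations(range(len(ids)), 2):
        yield ids[i], ids[j], matrix[i][j]


# First three sequences of the pairwise listing above.
ids = ['HIV1_MN', 'HIV1_KB', 'SIV_CZ']
matrix = [
    [0.0, 0.00280744374972, 0.00729040755063],
    [0.00280744374972, 0.0, 0.00522451291058],
    [0.00729040755063, 0.00522451291058, 0.0],
]
pairs = list(iter_pairs(ids, matrix))
for id1, id2, d in pairs:
    print('{0}\t{1}\t{2}'.format(id1, id2, d))
```

For n sequences this produces n(n-1)/2 lines, which is why six sequences give the 15 lines shown above.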

EXAMPLE #3 - Saving results to a file

You can write the output to a file, say results.txt, by adding the --out option:

calc_wmetric.py --fasta hiv_origin.pep.fa --outfmt pairwise --out results.txt

EXAMPLE #4 - Using help

Every alfpy script provides useful help messages.

For example, to see available options, you type: calc_word.py -h.

usage: calc_word.py --fasta FILE [--word_size N | --word_pattern FILE]
                    [--distance] [--vector] [--char_weights FILE]
                    [--char_freqs FILE] [--alphabet_size N] [--out FILE]
                    [--outfmt {phylip,pairwise}] [-h]

Calculate distances between DNA/protein sequences based on subsequence (words)
occurrences.

REQUIRED ARGUMENTS:
  --fasta FILE, -f FILE
                        input FASTA sequence filename

  Choose between the two options:
  --word_size N, -s N   word size for creating word patterns
  --word_pattern FILE, -w FILE
                        input filename w/ pre-computed word patterns

OPTIONAL ARGUMENTS:
  --distance , -d       choose from: angle_cos_diss, angle_cos_evol,
                        braycurtis, canberra, chebyshev, diff_abs_add,
                        diff_abs_mult, diff_abs_mult1, diff_abs_mult2,
                        euclid_norm, euclid_seqlen1, euclid_seqlen2,
                        euclid_squared, google, jsd, kld, lcc, manhattan,
                        minkowski [DEFAULT: google]
  --vector , -v         choose from: counts, freqs, freqs_std [DEFAULT: freqs]
  --char_weights FILE, -W FILE
                        file w/ weights of background sequence characters
                        (nt/aa)

FREQUENCY MODEL ARGUMENTS:
  Required for vector 'freqs_std'. Specify one of the two options:

  --char_freqs FILE, -F FILE
                        file w/ frequencies of background sequence characters
                        (nt/aa)
  --alphabet_size N, -a N
                        alphabet size

OUTPUT ARGUMENTS:
  --out FILE, -o FILE   output filename
  --outfmt {phylip,pairwise}
                        distances output format [DEFAULT: phylip]

OTHER OPTIONS:
  -h, --help            show this help message and exit

EXAMPLE #5 - Using optional arguments

To calculate the Normalized Google Distance between count vectors of words of size 2, type:

calc_word.py --fasta hiv_origin.pep.fasta --word_size 2 --vector counts --distance google

   6
HIV1_MN    0.0000000 0.2008282 0.3084886 0.3892340 0.3933747 0.4146825
HIV1_KB    0.2008282 0.0000000 0.3367983 0.4202899 0.4220374 0.4325397
SIV_CZ     0.3084886 0.3367983 0.0000000 0.4140787 0.3991684 0.4345238
HIV2_RO    0.3892340 0.4202899 0.4140787 0.0000000 0.2111801 0.2757937
HIV2_CA    0.3933747 0.4220374 0.3991684 0.2111801 0.0000000 0.3035714
SIV_MK     0.4146825 0.4325397 0.4345238 0.2757937 0.3035714 0.0000000

EXAMPLE #6 - Other optional arguments

If you often use calc_word, it is convenient to have a file with already computed word occurrences for a given set of sequences. You can create this file by typing:

create_wordpattern.py --fasta hiv_origin.pep.fasta --word_size 1 --out 1mer.txt

This file can then be used directly for calculating distances:

calc_word.py --fasta hiv_origin.pep.fasta --word_pattern 1mer.txt --vector counts --distance google

EXAMPLE #7 - Abbreviate long argument names

Most arguments have a short form consisting of a single dash and a letter. For example, to calculate Euclidean distance between frequency vectors of words of size 1 and write results to a file, type:

calc_word.py -f hiv_origin.pep.fasta -s 1 -v freqs -d euclid_norm -o results.txt
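
Long and short option pairs like these are typically declared with Python's argparse module. The sketch below is a hypothetical, cut-down parser modelled on the help text in EXAMPLE #4; it is not alfpy's actual source, and it only parses the arguments without computing anything.

```python
import argparse


def build_parser():
    """Build a toy parser with long/short option pairs (illustrative only)."""
    parser = argparse.ArgumentParser(description='Toy word-distance CLI.')
    parser.add_argument('--fasta', '-f', metavar='FILE', required=True)
    # Exactly one of --word_size / --word_pattern must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--word_size', '-s', metavar='N', type=int)
    group.add_argument('--word_pattern', '-w', metavar='FILE')
    parser.add_argument('--vector', '-v', default='freqs',
                        choices=['counts', 'freqs', 'freqs_std'])
    parser.add_argument('--distance', '-d', default='google')
    parser.add_argument('--out', '-o', metavar='FILE')
    return parser


# Parse the abbreviated command line from the example above.
args = build_parser().parse_args(
    ['-f', 'hiv_origin.pep.fasta', '-s', '1', '-v', 'freqs',
     '-d', 'euclid_norm', '-o', 'results.txt'])
print(args.fasta, args.word_size, args.vector, args.distance, args.out)
```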

PYTHON PACKAGE

If you want to use alfpy as a Python package, here are some examples that should give you an idea of how to use particular modules. To see more examples, please have a look at the docstrings of these modules.

EXAMPLE 1 - Read sequences

Read sequences from a FASTA file.

>>> from alfpy.utils import seqrecords

>>> with open('sample.pep.fasta') as fh:
...     seq_records = seqrecords.read_fasta(fh)
>>> print(seq_records)
SeqRecords (noseqs: 3)

Read sequences from Python strings

>>> from alfpy.utils import seqrecords

>>> seq_records = seqrecords.SeqRecords()
>>> seq_records.add(id='seq1', seq='MKSTGWHF')
>>> seq_records.add(id='seq2', seq='MKSSSSTGWGWG')
>>> seq_records.add(id='seq3', seq='MKSTLKNGTEQ')

>>> print(seq_records)
SeqRecords (noseqs: 3)

Read sequences from Python lists

>>> from alfpy.utils import seqrecords
>>> ids = ['seq1', 'seq2', 'seq3']
>>> seqs = ['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
>>> seq_records = seqrecords.SeqRecords(id_list=ids, seq_list=seqs)
>>> print(seq_records)
SeqRecords (noseqs: 3)

SeqRecords() attributes & methods

>>> print(seq_records.id_list)
['seq1', 'seq2', 'seq3']
>>> print(seq_records.seq_list)      
['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
>>> seq_records.length_list
[8, 12, 11]
>>> seq_records.count
3

>>> print(seq_records.fasta())
>seq1
MKSTGWHF
>seq2
MKSSSSTGWGWG
>seq3
MKSTLKNGTEQ

EXAMPLE 2 - Calculating Normalized Compression Distance (NCD)

>>> from alfpy.utils import distmatrix
>>> from alfpy.utils import seqrecords
>>> from alfpy import ncd

>>> seq_records = seqrecords.SeqRecords()
>>> seq_records.add('seq1', 'MKSTGWHF')
>>> seq_records.add('seq2', 'MKSSSSTGWGWG')
>>> seq_records.add('seq3', 'MKSTLKNGTEQ')  

>>> dist = ncd.Distance(seq_records)
>>> matrix = distmatrix.create(seq_records.id_list, dist)
>>> matrix.display()
   3
seq1       0.0000000 0.4117647 0.4736842
seq2       0.4117647 0.0000000 0.5789474
seq3       0.4736842 0.5789474 0.0000000

>>> matrix.display('pairwise')
seq1    seq2    0.4117647058823529
seq1    seq3    0.47368421052631576
seq2    seq3    0.5789473684210527

>>> # Calculate distance between first and second sequence.
>>> dist.pairwise_distance(0, 1)
0.4117647058823529
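
To show the idea behind NCD, here is a self-contained sketch using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The choice of zlib as the compressor is an assumption made for illustration, so the values it gives are not expected to match the alfpy output above.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed ASCII text (assumed compressor)."""
    return len(zlib.compress(text.encode('ascii')))


def ncd(x, y):
    """Normalized Compression Distance between two strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / float(max(cx, cy))


d = ncd('MKSTGWHF', 'MKSSSSTGWGWG')
print(d)
```

On sequences this short, compressor overhead dominates the result; NCD is more informative on longer inputs.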

EXAMPLE 3 - Calculating W-metric

>>> from alfpy.utils.data import subsmat
>>> from alfpy import wmetric

>>> matrix = subsmat.get('blosum62')
>>> dist = wmetric.Distance(seq_records, matrix)

>>> # Calculate the distance of the first sequence to itself
>>> dist.pairwise_distance(0, 0)
0.0

>>> # Calculate distances between every pair of sequences
>>> matrix = distmatrix.create(seq_records.id_list, dist)
>>> matrix.display()
   3
seq1       0.0000000 0.7291667 1.0330579
seq2       0.7291667 0.0000000 1.2925275
seq3       1.0330579 1.2925275 0.0000000


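>>> # Iterate over a distance matrix
>>> # (the values printed below match the NCD matrix from EXAMPLE 2,
>>> # not the W-metric matrix displayed above)
>>> for i, j, seq1, seq2, d in matrix:
...     print(i, j, seq1, seq2, d)
0 1 seq1 seq2 0.411764705882
0 2 seq1 seq3 0.473684210526
1 2 seq2 seq3 0.578947368421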

EXAMPLE 4 - Calculating k-mer patterns

>>> from alfpy.utils import seqrecords
>>> from alfpy import word_pattern

>>> seq_records = seqrecords.SeqRecords()
>>> seq_records.add('seq1', 'MKSTGWHF')
>>> seq_records.add('seq2', 'MKSSSSTGWGWG')
>>> seq_records.add('seq3', 'MKSTLKNGTEQ') 

>>> pattern = word_pattern.create(seq_records.seq_list, word_size=2)
>>> print(pattern)
# col1 - number of word occurrences in input sequences
# col2 - number of input sequences the word is present
# col3 - word/pattern
# col4 - pairs of integer numbers (seq number: number of times word appears)
3   3   ST 0:1 1:1 2:1
3   3   MK 0:1 1:1 2:1
3   3   KS 0:1 1:1 2:1
3   2   GW 0:1 1:2
3   1   SS 1:3
2   2   TG 0:1 1:1
2   1   WG 1:2
1   1   WH 0:1
1   1   TL 2:1
1   1   TE 2:1
1   1   NG 2:1
1   1   LK 2:1
1   1   KN 2:1
1   1   HF 0:1
1   1   GT 2:1
1   1   EQ 2:1

>>> print(pattern.pat_list)
['HF', 'WH', 'ST', 'MK', 'TE', 'WG', 'KN', 'GW', 'GT', 'KS', 'TL', 'EQ', 'TG', 'LK', 'NG', 'SS']
>>> print(pattern.occr_list)
[{0: 1}, {0: 1}, {0: 1, 1: 1, 2: 1}, {0: 1, 1: 1, 2: 1}, {2: 1}, {1: 2}, {2: 1}, {0: 1, 1: 2}, {2: 1}, {0: 1, 1: 1, 2: 1}, {2: 1}, {2: 1}, {0: 1, 1: 1}, {2: 1}, {2: 1}, {1: 3}]
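
The word counts in the table above can be checked by hand with overlapping k-mer counting. The sketch below uses only the standard library; count_words is a hypothetical helper, not part of alfpy.

```python
from collections import Counter


def count_words(seq, word_size):
    """Count overlapping words (k-mers) of the given size in a sequence."""
    return Counter(seq[i:i + word_size]
                   for i in range(len(seq) - word_size + 1))


seqs = ['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
per_seq = [count_words(s, 2) for s in seqs]

# Total occurrences of each word across all sequences (col1 of the table).
total = Counter()
for counts in per_seq:
    total.update(counts)

# Number of sequences in which each word is present (col2 of the table).
present_in = Counter(word for counts in per_seq for word in counts)

print(total['GW'], present_in['GW'], per_seq[1]['SS'])
```

Because windows overlap, the run SSSS in seq2 contributes three occurrences of SS.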

EXAMPLE 5 - Calculating word vectors (e.g. counts, freqs)

>>> from alfpy.utils import seqrecords
>>> from alfpy import word_pattern
>>> from alfpy import word_vector

>>> seq_records = seqrecords.SeqRecords()
>>> seq_records.add('seq1', 'MKSTGWHF')
>>> seq_records.add('seq2', 'MKSSSSTGWGWG')
>>> seq_records.add('seq3', 'MKSTLKNGTEQ')
>>> pattern = word_pattern.create(seq_records.seq_list, word_size=2)

>>> counts = word_vector.Counts(seq_records.length_list, pattern)
>>> print(counts)
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
[2.0, 2.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
[0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0]

>>> freqs = word_vector.Freqs(seq_records.length_list, pattern)
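
As a rough illustration of the difference between counts and frequencies, the sketch below divides each word count by the total number of words in the sequence. This normalisation is an assumption made for illustration; consult the word_vector docstrings for the exact definition alfpy uses. The helper word_freqs is hypothetical.

```python
from collections import Counter


def word_freqs(seq, word_size):
    """Relative frequencies of overlapping words (assumed: count / total words)."""
    counts = Counter(seq[i:i + word_size]
                     for i in range(len(seq) - word_size + 1))
    total = float(sum(counts.values()))
    if total == 0:
        return {}  # sequence shorter than the word size
    return dict((word, n / total) for word, n in counts.items())


freqs_seq1 = word_freqs('MKSTGWHF', 2)
print(sorted(freqs_seq1.items()))
```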

EXAMPLE 6 - Calculating Euclidean distance between word counts

>>> from alfpy.utils import seqrecords
>>> from alfpy import word_distance
>>> from alfpy import word_pattern
>>> from alfpy import word_vector
>>> from alfpy.utils import distmatrix

>>> with open('sample.pep.fasta') as fh:
...     seq_records = seqrecords.read_fasta(fh)

>>> p = word_pattern.create(seq_records.seq_list, word_size=2)
>>> counts = word_vector.Counts(seq_records.length_list, p)
>>> dist = word_distance.Distance(counts, 'euclid_norm')
>>> matrix = distmatrix.create(seq_records.id_list, dist)
>>> matrix.display()
   3
seq1       0.0000000 4.0000000 3.3166248
seq2       4.0000000 0.0000000 5.0000000
seq3       3.3166248 5.0000000 0.0000000
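
The matrix above can be verified without alfpy. The sketch below assumes that sample.pep.fasta holds the same three peptides used in the earlier examples (the printed values are consistent with that) and computes the plain Euclidean distance between 2-mer count vectors; the helper names are hypothetical.

```python
import math
from collections import Counter


def count_words(seq, word_size):
    """Count overlapping words of the given size."""
    return Counter(seq[i:i + word_size]
                   for i in range(len(seq) - word_size + 1))


def euclidean(c1, c2):
    """Euclidean distance between two word-count vectors."""
    words = set(c1) | set(c2)  # a Counter returns 0 for missing words
    return math.sqrt(sum((c1[w] - c2[w]) ** 2 for w in words))


seqs = ['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
counts = [count_words(s, 2) for s in seqs]
d01 = euclidean(counts[0], counts[1])
d02 = euclidean(counts[0], counts[2])
d12 = euclidean(counts[1], counts[2])
print(d01, d02, d12)
```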

EXAMPLE 7 - Calculating Google distance between word counts

>>> # Continues from EXAMPLE 6; the distance is computed from the `counts` vectors.
>>> dist = word_distance.Distance(counts, 'google')
>>> dist.pairwise_distance(0, 1)
0.5454545454545454
>>> matrix = distmatrix.create(seq_records.id_list, dist)
>>> matrix.data
array([[ 0.        ,  0.54545455,  0.7       ],
       [ 0.54545455,  0.        ,  0.72727273],
       [ 0.7       ,  0.72727273,  0.        ]])
>>> matrix.display()
   3
seq1       0.0000000 0.5454545 0.7000000
seq2       0.5454545 0.0000000 0.7272727
seq3       0.7000000 0.7272727 0.0000000
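
The values above can be reproduced from 2-mer counts with the formula (max(x, y) - s) / ((x + y) - min(x, y)), where x and y are the total word counts of the two sequences and s is the sum of element-wise minima. Treat the formula as an assumption that happens to match the printed numbers for the three peptides used throughout; the helper names are hypothetical.

```python
from collections import Counter


def count_words(seq, word_size):
    """Count overlapping words of the given size."""
    return Counter(seq[i:i + word_size]
                   for i in range(len(seq) - word_size + 1))


def google_distance(c1, c2):
    """Normalized Google Distance between two word-count vectors (assumed formula)."""
    x = sum(c1.values())
    y = sum(c2.values())
    shared = sum(min(c1[w], c2[w]) for w in set(c1) & set(c2))
    return (max(x, y) - shared) / float((x + y) - min(x, y))


seqs = ['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
counts = [count_words(s, 2) for s in seqs]
g01 = google_distance(counts[0], counts[1])
g02 = google_distance(counts[0], counts[2])
g12 = google_distance(counts[1], counts[2])
print(g01, g02, g12)
```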