alfpy provides command-line Python scripts that let you compare unaligned FASTA sequences (DNA/RNA/protein) with more than 40 distance methods. By default, all distance calculations will output a distance matrix in Phylip format, which can be directly used as an input to popular tree-building software such as PHYLIP or EMBOSS fneighbor.
alfpy is also a Python package that provides an alignment-free framework to compare biological sequences and infer their phylogenetic relationships. All its functionality is wrapped in logically separated Python modules that let you write your own analysis scripts with just basic Python knowledge.
To download, build, and install the latest stable version, use pip:
pip install alfpy
If you are on Linux or macOS, you may need to run the installation with the sudo command:
sudo pip install alfpy
If you are not allowed to use sudo, install alfpy as a user:
pip install --user alfpy
The latest development version is the one that’s in our Git repository. Get it using this shell command, which requires git:
git clone https://github.com/aziele/alfpy.git
If you don't feel like using git, just download the package manually as a zip archive.
Unpack the zip package, go to the directory and run the installation:
sudo python setup.py install
or
python setup.py install --user
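Whichever installation route you take, a quick way to confirm that the package is visible to your interpreter is to ask Python whether it can locate it. The sketch below uses only the standard library; the helper name is_installed is made up for this example and is not part of alfpy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be located."""
    # find_spec returns None when the package is not on the import path.
    return importlib.util.find_spec(package_name) is not None


# Prints True after a successful installation, False otherwise.
print("alfpy installed:", is_installed("alfpy"))
```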
alfpy provides 12 command-line Python scripts for computing 44 dissimilarity measures among sequences provided in a FASTA file.
Script name | Sequence type | Approach | Distance measures |
calc_bbc.py | DNA/protein | Base-Base Correlation (BBC) | 1 |
calc_graphdna.py | DNA | Two-dimensional (2D) graphical DNA curve | 3 |
calc_fcgr.py | DNA | Frequency Chaos Game Representation (FCGR) | 1 |
calc_lempelziv.py | DNA/protein | Lempel-Ziv complexity | 5 |
calc_ncd.py | DNA/protein | Normalized Compression Distance (NCD) | 1 |
calc_wmetric.py | protein | W-metric | 1 |
calc_word.py | DNA/protein | Word-based: standard | 19 |
calc_word_bool.py | DNA/protein | Word-based: boolean 1-D vectors of word counting occurrences | 9 |
calc_word_cv.py | DNA/protein | Word-based: Composition vector (3rd order Markov model) | 1 |
calc_word_d2.py | DNA/protein | Word-based: d2 metric (different word resolutions) | 1 |
calc_word_ffp.py | DNA/protein | Word-based: Feature Frequency Profiles (FFPs) | 1 |
calc_word_rtd.py | DNA/protein | Word-based: Return Time Distribution (RTD) | 1 |
All the programs require at least a set of (unaligned) sequences in FASTA format (e.g. hiv_origin.pep.fa).
All the programs output distances in PHYLIP (by default) or a pairwise format.
To calculate W-metric distances, type on the command line:
calc_wmetric.py --fasta hiv_origin.pep.fa
By default, you get a distance matrix in PHYLIP format.
6
HIV1_MN 0.0000000 0.0028074 0.0072904 0.0247550 0.0209014 0.0198626
HIV1_KB 0.0028074 0.0000000 0.0052245 0.0268161 0.0216594 0.0208300
SIV_CZ 0.0072904 0.0052245 0.0000000 0.0310483 0.0233556 0.0225691
HIV2_RO 0.0247550 0.0268161 0.0310483 0.0000000 0.0037506 0.0072975
HIV2_CA 0.0209014 0.0216594 0.0233556 0.0037506 0.0000000 0.0045389
SIV_MK 0.0198626 0.0208300 0.0225691 0.0072975 0.0045389 0.0000000
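If you want to post-process this matrix in your own script rather than pass it to a tree-building program, it can be read back with a few lines of Python. The following is a minimal sketch, assuming a square matrix whose sequence identifiers contain no whitespace; parse_phylip_matrix is a hypothetical helper, not an alfpy function, and the sample text is the top-left 3x3 corner of the matrix above.

```python
def parse_phylip_matrix(text):
    """Parse a square PHYLIP distance matrix into (ids, rows)."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0])  # the first line holds the number of sequences
    ids, rows = [], []
    for line in lines[1:n + 1]:
        fields = line.split()
        ids.append(fields[0])  # assumes identifiers without spaces
        rows.append([float(x) for x in fields[1:]])
    return ids, rows


# Top-left corner of the example matrix shown above.
sample = """\
3
HIV1_MN 0.0000000 0.0028074 0.0072904
HIV1_KB 0.0028074 0.0000000 0.0052245
SIV_CZ 0.0072904 0.0052245 0.0000000
"""

ids, rows = parse_phylip_matrix(sample)
print(ids)
print(rows[0][1])  # distance between HIV1_MN and HIV1_KB
```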
You can change the output format to pairwise:
calc_wmetric.py --fasta hiv_origin.pep.fa --outfmt pairwise
HIV1_MN HIV1_KB 0.00280744374972
HIV1_MN SIV_CZ 0.00729040755063
HIV1_MN HIV2_RO 0.0247549689229
HIV1_MN HIV2_CA 0.0209013681508
HIV1_MN SIV_MK 0.0198626384014
HIV1_KB SIV_CZ 0.00522451291058
HIV1_KB HIV2_RO 0.026816057513
HIV1_KB HIV2_CA 0.0216594066907
HIV1_KB SIV_MK 0.020829977091
SIV_CZ HIV2_RO 0.0310483221507
SIV_CZ HIV2_CA 0.0233556481886
SIV_CZ SIV_MK 0.0225690537356
HIV2_RO HIV2_CA 0.00375055665432
HIV2_RO SIV_MK 0.00729748162273
HIV2_CA SIV_MK 0.00453893527008
You can tell it to write the output to a file, say results.txt, by typing:
calc_wmetric.py --fasta hiv_origin.pep.fa --outfmt pairwise --out results.txt
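The pairwise format is convenient to load into a lookup table. As a rough sketch, assuming each line holds two identifiers and one distance separated by whitespace, the hypothetical helper parse_pairwise below (not part of alfpy) builds a dictionary keyed by unordered pairs of identifiers. The sample lines are copied from the pairwise output above.

```python
def parse_pairwise(lines):
    """Map each unordered pair of sequence ids to its distance."""
    distances = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        id1, id2, value = fields
        # frozenset makes the lookup independent of the order of the ids
        distances[frozenset((id1, id2))] = float(value)
    return distances


sample_lines = [
    "HIV1_MN HIV1_KB 0.00280744374972",
    "HIV1_MN SIV_CZ 0.00729040755063",
    "HIV1_KB SIV_CZ 0.00522451291058",
]
pairs = parse_pairwise(sample_lines)
print(pairs[frozenset(("SIV_CZ", "HIV1_MN"))])

# To read a file written with --out instead of the sample lines:
#     with open("results.txt") as fh:
#         pairs = parse_pairwise(fh)
```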
Every alfpy script provides useful help messages.
For example, to see available options, type: calc_word.py -h
usage: calc_word.py --fasta FILE [--word_size N | --word_pattern FILE]
[--distance] [--vector] [--char_weights FILE]
[--char_freqs FILE] [--alphabet_size N] [--out FILE]
[--outfmt {phylip,pairwise}] [-h]
Calculate distances between DNA/protein sequences based on subsequence (words)
occurrences.
REQUIRED ARGUMENTS:
--fasta FILE, -f FILE
input FASTA sequence filename
Choose between the two options:
--word_size N, -s N word size for creating word patterns
--word_pattern FILE, -w FILE
input filename w/ pre-computed word patterns
OPTIONAL ARGUMENTS:
--distance , -d choose from: angle_cos_diss, angle_cos_evol,
braycurtis, canberra, chebyshev, diff_abs_add,
diff_abs_mult, diff_abs_mult1, diff_abs_mult2,
euclid_norm, euclid_seqlen1, euclid_seqlen2,
euclid_squared, google, jsd, kld, lcc, manhattan,
minkowski [DEFAULT: google]
--vector , -v choose from: counts, freqs, freqs_std [DEFAULT: freqs]
--char_weights FILE, -W FILE
file w/ weights of background sequence characters
(nt/aa)
FREQUENCY MODEL ARGUMENTS:
Required for vector 'freqs_std'. Specify one of the two options:
--char_freqs FILE, -F FILE
file w/ frequencies of background sequence characters
(nt/aa)
--alphabet_size N, -a N
alphabet size
OUTPUT ARGUMENTS:
--out FILE, -o FILE output filename
--outfmt {phylip,pairwise}
distances output format [DEFAULT: phylip]
OTHER OPTIONS:
-h, --help show this help message and exit
To calculate Normalized Google Distance between count occurrences of words of size 2, type:
calc_word.py --fasta hiv_origin.pep.fasta --word_size 2 --vector counts --distance google
6
HIV1_MN 0.0000000 0.2008282 0.3084886 0.3892340 0.3933747 0.4146825
HIV1_KB 0.2008282 0.0000000 0.3367983 0.4202899 0.4220374 0.4325397
SIV_CZ 0.3084886 0.3367983 0.0000000 0.4140787 0.3991684 0.4345238
HIV2_RO 0.3892340 0.4202899 0.4140787 0.0000000 0.2111801 0.2757937
HIV2_CA 0.3933747 0.4220374 0.3991684 0.2111801 0.0000000 0.3035714
SIV_MK 0.4146825 0.4325397 0.4345238 0.2757937 0.3035714 0.0000000
If you often use calc_word.py, it is convenient to have a file with precomputed word occurrences for a given set of sequences. You can create this file by typing:
create_wordpattern.py --fasta hiv_origin.pep.fasta --word_size 1 --out 1mer.txt
This file can then be used directly for calculating distances:
calc_word.py --fasta hiv_origin.pep.fasta --word_pattern 1mer.txt --vector counts --distance google
Most arguments can be abbreviated by a single dash and letter. For example, to calculate Euclidean distance between frequency vectors of words of size 1 and write results to a file, type:
calc_word.py -f hiv_origin.pep.fasta -s 1 -v freqs -d euclid_norm -o results.txt
If you want to use alfpy as a Python package, here are some examples that should give you an idea of how to use particular modules. To see more examples, please have a look at the docstrings of these modules.
Read sequences from a FASTA file.
>>> from alfpy.utils import seqrecords
>>> fh = open('sample.pep.fasta')
>>> seq_records = seqrecords.read_fasta(fh)
>>> fh.close()
>>> print(seq_records)
SeqRecords (noseqs: 3)
Read sequences from Python strings
>>> from alfpy.utils import seqrecords
>>> seq_records = seqrecords.SeqRecords()
>>> seq_records.add(id='seq1', seq='MKSTGWHF')
>>> seq_records.add(id='seq2', seq='MKSSSSTGWGWG')
>>> seq_records.add(id='seq3', seq='MKSTLKNGTEQ')
>>> print(seq_records)
SeqRecords (noseqs: 3)
Read sequences from Python lists
>>> from alfpy.utils import seqrecords
>>> ids = ['seq1', 'seq2', 'seq3']
>>> seqs = ['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
>>> seq_records = seqrecords.SeqRecords(id_list=ids, seq_list=seqs)
>>> print(seq_records)
SeqRecords (noseqs: 3)
SeqRecords() attributes & methods
>>> print(seq_records.id_list)
['seq1', 'seq2', 'seq3']
>>> print(seq_records.seq_list)
['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
>>> seq_records.length_list
[8, 12, 11]
>>> seq_records.count
3
>>> print(seq_records.fasta())
>seq1
MKSTGWHF
>seq2
MKSSSSTGWGWG
>seq3
MKSTLKNGTEQ
>>> from alfpy.utils import distmatrix
>>> from alfpy.utils import seqrecords
>>> from alfpy import ncd
>>> seq_records = seqrecords.SeqRecords()
>>> seq_records.add('seq1', 'MKSTGWHF')
>>> seq_records.add('seq2', 'MKSSSSTGWGWG')
>>> seq_records.add('seq3', 'MKSTLKNGTEQ')
>>> dist = ncd.Distance(seq_records)
>>> matrix = distmatrix.create(seq_records.id_list, dist)
>>> matrix.display()
3
seq1 0.0000000 0.4117647 0.4736842
seq2 0.4117647 0.0000000 0.5789474
seq3 0.4736842 0.5789474 0.0000000
>>> matrix.display('pairwise')
seq1 seq2 0.4117647058823529
seq1 seq3 0.47368421052631576
seq2 seq3 0.5789473684210527
>>> # Calculate distance between first and second sequence.
>>> dist.pairwise_distance(0, 1)
0.4117647058823529
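To give an idea of what a Normalized Compression Distance measures, here is a standalone sketch of the usual definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Using zlib as the compressor is an assumption made for this example; alfpy's own implementation may use a different compressor or settings, so the numbers will not necessarily match those above. The test sequences are repeated copies of seq1 and seq3, because very short inputs are dominated by compressor overhead.

```python
import zlib


def compressed_size(data):
    """Size in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Repeated copies of seq1 and seq3 from the examples above.
a = b"MKSTGWHF" * 20
b = b"MKSTLKNGTEQ" * 20

print(ncd(a, a))  # close to 0: a sequence compared with itself
print(ncd(a, b))  # larger: two different sequences
```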
>>> from alfpy.utils.data import subsmat
>>> from alfpy import wmetric
>>> matrix = subsmat.get('blosum62')
>>> dist = wmetric.Distance(seq_records, matrix)
>>> # Calculate distance between the first sequence and itself
>>> dist.pairwise_distance(0, 0)
0.0
>>> # Calculate distances between every pair of sequences
>>> matrix = distmatrix.create(seq_records.id_list, dist)
>>> matrix.display()
3
seq1 0.0000000 0.7291667 1.0330579
seq2 0.7291667 0.0000000 1.2925275
seq3 1.0330579 1.2925275 0.0000000
>>> # Iterate over a distance matrix
>>> for i, j, seq1, seq2, d in matrix:
...     print(i, j, seq1, seq2, d)
0 1 seq1 seq2 0.411764705882
0 2 seq1 seq3 0.473684210526
1 2 seq2 seq3 0.578947368421
>>> from alfpy.utils import seqrecords
>>> from alfpy import word_pattern
>>> seq_records = seqrecords.SeqRecords()
>>> seq_records.add('seq1', 'MKSTGWHF')
>>> seq_records.add('seq2', 'MKSSSSTGWGWG')
>>> seq_records.add('seq3', 'MKSTLKNGTEQ')
>>> pattern = word_pattern.create(seq_records.seq_list, word_size=2)
>>> print(pattern)
# col1 - number of word occurrences in input sequences
# col2 - number of input sequences the word is present
# col3 - word/pattern
# col4 - pairs of integer numbers (seq number: number of times word appears)
3 3 ST 0:1 1:1 2:1
3 3 MK 0:1 1:1 2:1
3 3 KS 0:1 1:1 2:1
3 2 GW 0:1 1:2
3 1 SS 1:3
2 2 TG 0:1 1:1
2 1 WG 1:2
1 1 WH 0:1
1 1 TL 2:1
1 1 TE 2:1
1 1 NG 2:1
1 1 LK 2:1
1 1 KN 2:1
1 1 HF 0:1
1 1 GT 2:1
1 1 EQ 2:1
>>> print(pattern.pat_list)
['HF', 'WH', 'ST', 'MK', 'TE', 'WG', 'KN', 'GW', 'GT', 'KS', 'TL', 'EQ', 'TG', 'LK', 'NG', 'SS']
>>> print(pattern.occr_list)
[{0: 1}, {0: 1}, {0: 1, 1: 1, 2: 1}, {0: 1, 1: 1, 2: 1}, {2: 1}, {1: 2}, {2: 1}, {0: 1, 1: 2}, {2: 1}, {0: 1, 1: 1, 2: 1}, {2: 1}, {2: 1}, {0: 1, 1: 1}, {2: 1}, {2: 1}, {1: 3}]
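The word counts listed above can be reproduced with plain Python, which may help to see what a word pattern contains. This is only an illustrative sketch using overlapping words and collections.Counter; count_words is a hypothetical helper and does not reflect how alfpy stores patterns internally.

```python
from collections import Counter


def count_words(seq, word_size):
    """Count overlapping words (k-mers) of the given size in a sequence."""
    return Counter(seq[i:i + word_size]
                   for i in range(len(seq) - word_size + 1))


seqs = ['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
per_seq = [count_words(s, 2) for s in seqs]

# Total occurrences of each word over all sequences (col1 in the listing).
total = sum(per_seq, Counter())

print(per_seq[1]['SS'])  # occurrences of SS in seq2
print(total['GW'])       # occurrences of GW in all sequences
print(len(total))        # number of distinct words
```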
>>> from alfpy.utils import seqrecords
>>> from alfpy import word_pattern
>>> from alfpy import word_vector
>>> seq_records = seqrecords.SeqRecords()
>>> seq_records.add('seq1', 'MKSTGWHF')
>>> seq_records.add('seq2', 'MKSSSSTGWGWG')
>>> seq_records.add('seq3', 'MKSTLKNGTEQ')
>>> pattern = word_pattern.create(seq_records.seq_list, word_size=2)
>>> counts = word_vector.Counts(seq_records.length_list, pattern)
>>> print(counts)
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
[2.0, 2.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
[0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
>>> freqs = word_vector.Freqs(seq_records.length_list, pattern)
>>> from alfpy.utils import seqrecords
>>> from alfpy import word_distance
>>> from alfpy import word_pattern
>>> from alfpy import word_vector
>>> from alfpy.utils import distmatrix
>>> fh = open('sample.pep.fasta')
>>> seq_records = seqrecords.read_fasta(fh)
>>> fh.close()
>>> p = word_pattern.create(seq_records.seq_list, word_size=2)
>>> counts = word_vector.Counts(seq_records.length_list, p)
>>> dist = word_distance.Distance(counts, 'euclid_norm')
>>> matrix = distmatrix.create(seq_records.id_list, dist)
>>> matrix.display()
3
seq1 0.0000000 4.0000000 3.3166248
seq2 4.0000000 0.0000000 5.0000000
seq3 3.3166248 5.0000000 0.0000000
>>> freqs = word_vector.Freqs(seq_records.length_list, p)
>>> dist = word_distance.Distance(counts, 'google')
>>> dist.pairwise_distance(0, 1)
0.5454545454545454
>>> matrix = distmatrix.create(seq_records.id_list, dist)
>>> matrix.data
array([[ 0. , 0.54545455, 0.7 ],
[ 0.54545455, 0. , 0.72727273],
[ 0.7 , 0.72727273, 0. ]])
>>> matrix.display()
3
seq1 0.0000000 0.5454545 0.7000000
seq2 0.5454545 0.0000000 0.7272727
seq3 0.7000000 0.7272727 0.0000000
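For readers who want to check the numbers above by hand, the sketch below recomputes both distances from raw word counts in plain Python. It assumes that sample.pep.fasta holds the same three sequences as the earlier examples. The formula used for the 'google' distance is inferred from the values shown above rather than taken from alfpy's source code, so treat it as an assumption; the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_words(seq, word_size):
    """Count overlapping words of the given size in a sequence."""
    return Counter(seq[i:i + word_size]
                   for i in range(len(seq) - word_size + 1))


def euclidean(c1, c2):
    """Euclidean distance between two word-count vectors."""
    words = set(c1) | set(c2)
    return sqrt(sum((c1[w] - c2[w]) ** 2 for w in words))


def google(c1, c2):
    """Assumed form: (larger total - shared counts) / larger total."""
    largest = max(sum(c1.values()), sum(c2.values()))
    shared = sum(min(c1[w], c2[w]) for w in set(c1) & set(c2))
    return (largest - shared) / largest


# Assumed content of sample.pep.fasta (same as the earlier examples).
seqs = ['MKSTGWHF', 'MKSSSSTGWGWG', 'MKSTLKNGTEQ']
c = [count_words(s, 2) for s in seqs]

print(euclidean(c[0], c[1]))  # seq1 vs seq2
print(google(c[0], c[1]))     # seq1 vs seq2
```

Under these assumptions the two printed values agree with the seq1/seq2 entries of the two matrices above.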