Use python to make a seqlogo from a multiple sequence alignment

Sun, 05 Sep 2021 00:00:00 +0000

A sequence logo, otherwise referred to as a seqlogo, is a common graphical representation technique used to visualize patterns of sequence conservation in nucleotide or protein sequences. A key component of a seqlogo is a multiple sequence alignment. A multiple sequence alignment has every character in the sequence a fixed sized matrix such that the states of each character in a given column are assumed to be homologous for each sequence in the row of the matrix. Homology in this regard means that each character shares a common ancestor. The particular details of multiple sequence alignments and homology will be discussed in greater detail later, but suffice to say they are a critical component for making seqlogos. We can use biopython to parse multiple sequence alignment data, for this example I will be using the T7-like virus holin protein family.

If you click on the following you can get the sequences I retrieved. https://www.uniprot.org/uniref/?query=uniprot:(family%3A%22t7likevirus+holin+family%22)+identity:1.0 Simply press all, then press align to generate the multiple sequence alignment, an you can download it as a plain text file. Alternatively, you can just use the alignment file that I have provided here.

Install software dependencies

There are three main software packages we willl need to install

Biopython: parses alignment data
pandas: stores amino acid freqeuncy of alignment
seqlogo: plots seqlogo

!pip install biopython pandas seqlogo

Additionally, we will need to install ghostscript and pdf2svg, which can be installed easily on colab using apt-get

!apt-get install ghostscript pdf2svg

Load libraries

from Bio import AlignIO
import pandas as pd
import seqlogo

Load alignment data

t7_alignmentFile = "T7.family.aln"
t7_alignment = AlignIO.read(t7_alignmentFile, "clustal")
t7_alignment

<<class 'Bio.Align.MultipleSeqAlignment'> instance (89 records of length 78) at 7fc999c73690>

So the alignment has 89 sequences and 78 columns.

Calculate amino acid frequency

For each of the 78 sites in the alignment we need to calculate the frequency of each of the 20 amino acids

The most straightforward way to do this is to just tally up the residues and store them in a dataframe. We will do this using the function alnCompositionDF()

def alnSiteCompositionDF(aln, characters="ACDEFGHIKLMNPQRSTVWY"):
alnRows = aln.get_alignment_length()
compDict = {char:[0]*alnRows for char in characters}
for record in aln:
header = record.id
seq = record.seq
for aaPos in range(len(seq)):
aa = seq[aaPos]
if aa in characters:
compDict[aa][aaPos] += 1
return pd.DataFrame.from_dict(compDict)

t7_alignmentSiteCompDF = alnSiteCompositionDF(t7_alignment)
t7_alignmentSiteCompDF

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0
1	2	0	0	0	0	0	0	0	0	0	5	0	0	0	0	0	0	0	0	0
2	5	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	2	0	0	0
3	0	0	0	0	0	4	0	0	0	0	79	0	0	0	0	6	0	0	0	0
4	0	0	0	0	0	0	0	12	0	77	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
73	0	0	1	2	0	1	0	0	0	0	0	0	1	0	0	0	0	1	0	0
74	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0
75	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
76	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
77	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0

78 rows × 20 columns

So for each site in the alignment we have a tally for all the amino acids, now we just need to calculate the proportion of each residue per site so that all the values in each row add to one.

t7_alignmentSiteFreqDF = t7_alignmentSiteCompDF.div(t7_alignmentSiteCompDF.sum(axis=1), axis=0)
t7_alignmentSiteFreqDF

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.000000	1.000000	0.0	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
1	0.285714	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.714286	0.0	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
2	0.500000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.300000	0.0	0.000000	0.0	0.0	0.000000	0.2	0.000000	0.0	0.0
3	0.000000	0.0	0.000000	0.000000	0.0	0.044944	0.0	0.000000	0.0	0.000000	0.887640	0.0	0.000000	0.0	0.0	0.067416	0.0	0.000000	0.0	0.0
4	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.134831	0.0	0.865169	0.000000	0.0	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
73	0.000000	0.0	0.166667	0.333333	0.0	0.166667	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.166667	0.0	0.0	0.000000	0.0	0.166667	0.0	0.0
74	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.5	0.500000	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
75	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.000000	1.0	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
76	0.000000	0.0	0.000000	0.000000	0.0	1.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
77	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	1.0	0.0	0.000000	0.0	0.000000	0.0	0.0

78 rows × 20 columns

Now we can take this and make our seqlogo!

t7_alignmentSiteFreqSeqLogo= seqlogo.Ppm(t7_alignmentSiteFreqDF,alphabet_type="AA")
seqlogo.seqlogo(t7_alignmentSiteFreqSeqLogo, ic_scale = False, format = 'svg', size = 'large')

There you have it! A sequence logo from a multiple sequnece alignment!

Wrapping up and things to consider

So this is just a slight introduction to a technique that is useful for visualizing multiple sequence alignments. It is far from an exhaustive approach, but will hopefully serve as a helpful template.

There are always improvements to be made, so on your own you can think about and consider what improvements you would make to the function, and think about possible problems. For instance, think about the following:

How could this be expanded to deal with nonstandard amino acids?

What happens if we use the function I provided with data that has nonstandard amino acids

Are there ways to modify the color of the residues?
Can residues belonging to certain domains be highlighted somehow?

Chi-squared analysis on multiple sequence alignments with python

Sun, 21 Feb 2021 00:00:00 +0000

Overview

For a general chi-square analysis for an alignment it is calulated $$chi2 = sum[i from 1 to k] (O_i - E_i)^2 / E_i$$

where k is the size of the alphabet (e.g. 4 for DNA, 20 for amino acids) and the values 1 to k correspond uniquely to one of the nucleotides or amino acids. O_i is the nucleotide or amino acid frequency in the sequence tested. E_i is the nucleotide or amino acid frequency expected from the ‘master’ distribution (e.g. the overall frequencies - depends on what one is using).

Whether the nucleotide (or amino acid) composition deviates significantly for the ‘master’ distribution is done by testing the chi2 value using the chi2-distribution with k-1 degrees of freedom (df=3 for DNA or df=19 for amino acids).

Python functions

Loading multiple sequence alignment

Using the biopython AlignIO function, we can load in a multiple sequence alignment file of a variety of formats.

alignment_file = "path/to/alignment.fasta"
alignment = AlignIO.read(open(alignment_file), "fasta")

Create composition matrix

import pandas as pd
def compositionMatrix(aln):
compDict = {}
fixedCharacters = ["-","A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y"]
for record in aln:
header = record.id
seq = record.seq
currentSeqMat = [0]*21
for i in range(len(seq)):
aa = seq[i]
try:
characterPos = fixedCharacters.index(aa)
currentSeqMat[characterPos]+= 1
except:
print("ERROR:", header, "contains character ("+aa+") not in the list:",fixedCharacters)
compDict[header] = currentSeqMat
compDF = pd.DataFrame.from_dict(compDict, orient='index',
columns=fixedCharacters)
return compDF

run chi-squared analysis

from scipy import stats
import numpy as np
import pandas as pd
def chi2test(compDF):
seqTotals = compDF.sum(axis=1)
gaps = compDF["-"]
gapsPerSeq = gaps/seqTotals
nonGap = compDF.loc[:, 'A':'Y']
nonGapTotals = nonGap.sum().to_frame()
nonGapSeqTotals = nonGap.sum(axis=1).to_frame()
numCharacters = nonGapTotals.sum()
expectedFreq = nonGapTotals / numCharacters
expectedCountArray = np.dot(nonGapSeqTotals,expectedFreq.transpose())
expectedCountDF = pd.DataFrame(expectedCountArray,columns =nonGap.columns, index =nonGap.index.values )
chi2DF = ((expectedCountDF - nonGap)**2)/expectedCountDF
chi2Sum = chi2DF.sum(axis=1)
pValueDf = 1 - stats.chi2.cdf(chi2Sum, 19)
outDF = pd.DataFrame({"Gap/Ambiguity":gapsPerSeq,"p-value":pValueDf})
outDF.index.name='header'
return outDF

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0
1	2	0	0	0	0	0	0	0	0	0	5	0	0	0	0	0	0	0	0	0
2	5	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	2	0	0	0
3	0	0	0	0	0	4	0	0	0	0	79	0	0	0	0	6	0	0	0	0
4	0	0	0	0	0	0	0	12	0	77	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
73	0	0	1	2	0	1	0	0	0	0	0	0	1	0	0	0	0	1	0	0
74	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0
75	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
76	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
77	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0
1	2	0	0	0	0	0	0	0	0	0	5	0	0	0	0	0	0	0	0	0
2	5	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	2	0	0	0
3	0	0	0	0	0	4	0	0	0	0	79	0	0	0	0	6	0	0	0	0
4	0	0	0	0	0	0	0	12	0	77	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
73	0	0	1	2	0	1	0	0	0	0	0	0	1	0	0	0	0	1	0	0
74	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0
75	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
76	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
77	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0

bioinformatics | tijeco