<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>bioinformatics | tijeco</title><link>https://www.tije.co/tag/bioinformatics/</link><atom:link href="https://www.tije.co/tag/bioinformatics/index.xml" rel="self" type="application/rss+xml"/><description>bioinformatics</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sun, 05 Sep 2021 00:00:00 +0000</lastBuildDate><image><url>https://www.tije.co/images/icon_hu14a0fda3d8b38bed6e82bba2624913c9_74337_512x512_fill_lanczos_center_2.png</url><title>bioinformatics</title><link>https://www.tije.co/tag/bioinformatics/</link></image><item><title>Use python to make a seqlogo from a multiple sequence alignment</title><link>https://www.tije.co/post/seqlogo_from_multiple_sequence_alignment/</link><pubDate>Sun, 05 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.tije.co/post/seqlogo_from_multiple_sequence_alignment/</guid><description>&lt;p>&lt;a href="https://colab.research.google.com/github/tijeco/personal_website/blob/master/content/post/seqlogo_from_multiple_sequence_alignment/index.ipynb" target="_blank" rel="noopener">&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">&lt;/a>&lt;/p>
&lt;p>A sequence logo, also called a seqlogo, is a common graphical representation used to visualize patterns of sequence conservation in nucleotide or protein sequences. A key ingredient of a seqlogo is a multiple sequence alignment, which arranges the sequences in a fixed-size matrix: each row is a sequence, and the characters in a given column are assumed to be homologous across all rows. Homology here means that the characters in a column descend from a common ancestor. The details of multiple sequence alignments and homology will be discussed later, but suffice it to say they are a critical component for making seqlogos. We can use Biopython to parse multiple sequence alignment data. For this example I will be using the T7-like virus holin protein family.&lt;/p>
&lt;p>If you click on the following you can get the sequences I retrieved.
&lt;a href="https://www.uniprot.org/uniref/?query=uniprot:(family%3A%22t7likevirus+holin+family%22)+identity:1.0">https://www.uniprot.org/uniref/?query=uniprot:(family%3A%22t7likevirus+holin+family%22)+identity:1.0&lt;/a>
Simply press all, then press align to generate the multiple sequence alignment, and you can download it as a plain text file. Alternatively, you can just use the alignment file that I have provided &lt;a href="">here&lt;/a>.&lt;/p>
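&lt;p>To make the matrix picture concrete, here is a minimal sketch on a made-up toy alignment. The sequences and variable names are illustrative assumptions and are not part of the T7 holin data. It only shows that every row has the same length and that transposing the rows with &lt;code>zip(*rows)&lt;/code> gives one string per column (site).&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy alignment (made-up sequences, not the T7 holin data); '-' marks a gap
toy_alignment = [
    'MKL-A',
    'MRLSA',
    'M-LSG',
]

# Every row has the same length, so the alignment is a fixed-size matrix
n_rows = len(toy_alignment)
n_cols = len(toy_alignment[0])
assert all(len(row) == n_cols for row in toy_alignment)

# zip(*rows) transposes the matrix, giving one string per column (site)
toy_columns = [''.join(col) for col in zip(*toy_alignment)]
print(n_rows, n_cols)
print(toy_columns)
&lt;/code>&lt;/pre>
&lt;p>Each string in &lt;code>toy_columns&lt;/code> holds the characters that the alignment treats as homologous, which is exactly the unit a seqlogo summarizes.&lt;/p>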
&lt;h2 id="install-software-dependencies">Install software dependencies&lt;/h2>
&lt;p>There are three main software packages we will need to install:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Biopython:&lt;/strong> parses alignment data&lt;/li>
&lt;li>&lt;strong>pandas:&lt;/strong> stores amino acid frequency of alignment&lt;/li>
&lt;li>&lt;strong>seqlogo:&lt;/strong> plots seqlogo&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-python">!pip install biopython pandas seqlogo
&lt;/code>&lt;/pre>
&lt;p>Additionally, we will need to install ghostscript and pdf2svg, which can be installed easily on Colab using apt-get.&lt;/p>
&lt;pre>&lt;code class="language-python">!apt-get install ghostscript pdf2svg
&lt;/code>&lt;/pre>
&lt;h2 id="load-libraries">Load libraries&lt;/h2>
&lt;pre>&lt;code class="language-python">from Bio import AlignIO
import pandas as pd
import seqlogo
&lt;/code>&lt;/pre>
&lt;h2 id="load-alignment-data">Load alignment data&lt;/h2>
&lt;pre>&lt;code class="language-python">t7_alignmentFile = &amp;quot;T7.family.aln&amp;quot;
t7_alignment = AlignIO.read(t7_alignmentFile, &amp;quot;clustal&amp;quot;)
t7_alignment
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>&amp;lt;&amp;lt;class 'Bio.Align.MultipleSeqAlignment'&amp;gt; instance (89 records of length 78) at 7fc999c73690&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>So the alignment has 89 sequences and 78 columns.&lt;/p>
&lt;h2 id="calculate-amino-acid-frequency">Calculate amino acid frequency&lt;/h2>
&lt;p>For each of the 78 sites in the alignment, we need to calculate the frequency of each of the 20 amino acids.&lt;/p>
&lt;p>The most straightforward way to do this is to tally up the residues and store them in a dataframe. We will do this using the function &lt;code>alnSiteCompositionDF()&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">def alnSiteCompositionDF(aln, characters=&amp;quot;ACDEFGHIKLMNPQRSTVWY&amp;quot;):
    # One row per alignment column (site), one DataFrame column per residue
    alnRows = aln.get_alignment_length()
    compDict = {char: [0] * alnRows for char in characters}
    for record in aln:
        seq = record.seq
        for aaPos in range(len(seq)):
            aa = seq[aaPos]
            # Gaps and nonstandard characters are skipped, not counted
            if aa in characters:
                compDict[aa][aaPos] += 1
    return pd.DataFrame.from_dict(compDict)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-python">t7_alignmentSiteCompDF = alnSiteCompositionDF(t7_alignment)
t7_alignmentSiteCompDF
&lt;/code>&lt;/pre>
&lt;div>
&lt;table border="1" class="dataframe">
&lt;thead>
&lt;tr style="text-align: right;">
&lt;th>&lt;/th>
&lt;th>A&lt;/th>
&lt;th>C&lt;/th>
&lt;th>D&lt;/th>
&lt;th>E&lt;/th>
&lt;th>F&lt;/th>
&lt;th>G&lt;/th>
&lt;th>H&lt;/th>
&lt;th>I&lt;/th>
&lt;th>K&lt;/th>
&lt;th>L&lt;/th>
&lt;th>M&lt;/th>
&lt;th>N&lt;/th>
&lt;th>P&lt;/th>
&lt;th>Q&lt;/th>
&lt;th>R&lt;/th>
&lt;th>S&lt;/th>
&lt;th>T&lt;/th>
&lt;th>V&lt;/th>
&lt;th>W&lt;/th>
&lt;th>Y&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;th>0&lt;/th>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>3&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>1&lt;/th>
&lt;td>2&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>5&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>2&lt;/th>
&lt;td>5&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>3&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>3&lt;/th>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>4&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>79&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>6&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>4&lt;/th>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>12&lt;/td>
&lt;td>0&lt;/td>
&lt;td>77&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>...&lt;/th>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>73&lt;/th>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>2&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>74&lt;/th>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>75&lt;/th>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>76&lt;/th>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>77&lt;/th>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>78 rows × 20 columns&lt;/p>
&lt;/div>
&lt;p>So for each site in the alignment we have a tally of all the amino acids. Now we just need to calculate the proportion of each residue per site, so that the values in each row sum to one.&lt;/p>
&lt;pre>&lt;code class="language-python">t7_alignmentSiteFreqDF = t7_alignmentSiteCompDF.div(t7_alignmentSiteCompDF.sum(axis=1), axis=0)
t7_alignmentSiteFreqDF
&lt;/code>&lt;/pre>
&lt;div>
&lt;table border="1" class="dataframe">
&lt;thead>
&lt;tr style="text-align: right;">
&lt;th>&lt;/th>
&lt;th>A&lt;/th>
&lt;th>C&lt;/th>
&lt;th>D&lt;/th>
&lt;th>E&lt;/th>
&lt;th>F&lt;/th>
&lt;th>G&lt;/th>
&lt;th>H&lt;/th>
&lt;th>I&lt;/th>
&lt;th>K&lt;/th>
&lt;th>L&lt;/th>
&lt;th>M&lt;/th>
&lt;th>N&lt;/th>
&lt;th>P&lt;/th>
&lt;th>Q&lt;/th>
&lt;th>R&lt;/th>
&lt;th>S&lt;/th>
&lt;th>T&lt;/th>
&lt;th>V&lt;/th>
&lt;th>W&lt;/th>
&lt;th>Y&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;th>0&lt;/th>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>1.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>1&lt;/th>
&lt;td>0.285714&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.714286&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>2&lt;/th>
&lt;td>0.500000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.300000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.2&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>3&lt;/th>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.044944&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.887640&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.067416&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>4&lt;/th>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.134831&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.865169&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>...&lt;/th>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;td>...&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>73&lt;/th>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.166667&lt;/td>
&lt;td>0.333333&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.166667&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.166667&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.166667&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>74&lt;/th>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.5&lt;/td>
&lt;td>0.500000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>75&lt;/th>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>1.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>76&lt;/th>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>1.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>77&lt;/th>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>1.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.000000&lt;/td>
&lt;td>0.0&lt;/td>
&lt;td>0.0&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>78 rows × 20 columns&lt;/p>
&lt;/div>
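&lt;p>To see what the tally and the division are doing without needing Biopython or pandas, here is a minimal pure-Python sketch on a made-up toy alignment. The sequences and names such as &lt;code>toy_freqs&lt;/code> are assumptions for illustration, and the sketch is not a replacement for the function above.&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import Counter

# Made-up alignment for illustration; '-' marks a gap
toy_alignment = ['MKL-A', 'MRLSA', 'M-LSG']
standard = 'ACDEFGHIKLMNPQRSTVWY'

toy_freqs = []
for column in zip(*toy_alignment):
    # Tally only standard residues, mirroring the 'if aa in characters' check
    counts = Counter(aa for aa in column if aa in standard)
    total = sum(counts.values())
    # Divide by the non-gap total so each site's proportions sum to one
    toy_freqs.append({aa: n / total for aa, n in counts.items()})

print(toy_freqs[0])
print(toy_freqs[1])
&lt;/code>&lt;/pre>
&lt;p>Note that the denominator is the number of non-gap residues at a site, not the number of sequences. This is why site 0 in the table above has a frequency of 1.0 for M even though only 3 of the 89 sequences have a residue there. A column made up entirely of gaps would have a total of zero, so this sketch would raise a &lt;code>ZeroDivisionError&lt;/code> there, and it is worth checking how the pandas version behaves on such columns.&lt;/p>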
&lt;p>Now we can take this and make our seqlogo!&lt;/p>
&lt;pre>&lt;code class="language-python">t7_alignmentSiteFreqSeqLogo = seqlogo.Ppm(t7_alignmentSiteFreqDF, alphabet_type=&amp;quot;AA&amp;quot;)
seqlogo.seqlogo(t7_alignmentSiteFreqSeqLogo, ic_scale=False, format='svg', size='large')
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="output_14_0.svg" alt="svg">&lt;/p>
&lt;p>There you have it! A sequence logo from a multiple sequence alignment!&lt;/p>
&lt;h1 id="wrapping-up-and-things-to-consider">Wrapping up and things to consider&lt;/h1>
&lt;p>This is just a brief introduction to a technique that is useful for visualizing multiple sequence alignments. It is far from exhaustive, but will hopefully serve as a helpful template.&lt;/p>
&lt;p>There are always improvements to be made, so consider on your own what you would change about the function and what problems it might run into. For instance, think about the following:&lt;/p>
&lt;ol>
&lt;li>How could this be expanded to deal with nonstandard amino acids?
&lt;ul>
&lt;li>What happens if we use the function I provided with data that has nonstandard amino acids?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Are there ways to modify the color of the residues?&lt;/li>
&lt;li>Can residues belonging to certain domains be highlighted somehow?&lt;/li>
&lt;/ol></description></item><item><title>Chi-squared analysis on multiple sequence alignments with python</title><link>https://www.tije.co/post/chi_squared_analysis_on_multiple_sequence_alignment_with_python/</link><pubDate>Sun, 21 Feb 2021 00:00:00 +0000</pubDate><guid>https://www.tije.co/post/chi_squared_analysis_on_multiple_sequence_alignment_with_python/</guid><description>&lt;p>&lt;a href="https://colab.research.google.com/github/tijeco/personal_website/blob/master/content/post/chi_squared_analysis_on_multiple_sequence_alignment_with_python/chi_squared_analysis_on_multiple_sequence_alignment_with_python.ipynb" target="_blank" rel="noopener">&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">&lt;/a>&lt;/p>
&lt;h1 id="overview">Overview&lt;/h1>
&lt;p>For a general chi-square analysis of an alignment, the statistic is calculated as
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$&lt;/p>
&lt;p>where k is the size of the alphabet (e.g. 4 for DNA, 20 for amino acids) and each value from 1 to k corresponds to exactly one nucleotide or amino acid.
O_i is the observed count of nucleotide or amino acid i in the sequence tested.
E_i is the count of nucleotide or amino acid i expected from the ‘master’ distribution (e.g. the overall frequencies scaled to the length of the sequence; the choice depends on what one is using).&lt;/p>
&lt;p>Whether the nucleotide (or amino acid) composition deviates significantly from the ‘master’ distribution is assessed by comparing the chi2 value against the chi2 distribution with k-1 degrees of freedom (df=3 for DNA or df=19 for amino acids).&lt;/p>
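&lt;p>As a small worked example of the formula, the sketch below computes the chi2 statistic for three made-up DNA sequences using only the standard library. The sequences, the helper &lt;code>chi2_statistic&lt;/code>, and the choice of the pooled composition as the ‘master’ distribution are assumptions for illustration.&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import Counter

# Made-up DNA sequences for illustration (not from a real alignment)
toy_seqs = {
    'seq1': 'ACGTACGTACGT',
    'seq2': 'ACGTACGTACGT',
    'seq3': 'A' * 10 + 'GT',
}
alphabet = 'ACGT'

# 'Master' distribution: overall proportion of each base across all sequences
overall = Counter(''.join(toy_seqs.values()))
n_total = sum(overall[b] for b in alphabet)
master = {b: overall[b] / n_total for b in alphabet}

def chi2_statistic(seq):
    observed = Counter(seq)
    length = sum(observed[b] for b in alphabet)
    # Expected count E_i = sequence length * master proportion of base i
    return sum((observed[b] - length * master[b]) ** 2 / (length * master[b])
               for b in alphabet)

toy_chi2 = {name: chi2_statistic(seq) for name, seq in toy_seqs.items()}
print(toy_chi2)
&lt;/code>&lt;/pre>
&lt;p>Each statistic would then be compared against the chi2 distribution with k-1 = 3 degrees of freedom, for which the 5% critical value is roughly 7.81. Note that a base absent from every sequence would give an expected count of zero and a division by zero in this sketch.&lt;/p>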
&lt;h1 id="python-functions">Python functions&lt;/h1>
&lt;h2 id="loading-multiple-sequence-alignment">Loading multiple sequence alignment&lt;/h2>
&lt;p>Using the Biopython AlignIO module, we can load a multiple sequence alignment file in a variety of formats.&lt;/p>
&lt;pre>&lt;code class="language-python">from Bio import AlignIO
alignment_file = &amp;quot;path/to/alignment.fasta&amp;quot;
alignment = AlignIO.read(alignment_file, &amp;quot;fasta&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h2 id="create-composition-matrix">Create composition matrix&lt;/h2>
&lt;pre>&lt;code class="language-python">import pandas as pd

def compositionMatrix(aln):
    # Count gaps and the 20 standard amino acids in each sequence
    compDict = {}
    fixedCharacters = [&amp;quot;-&amp;quot;,&amp;quot;A&amp;quot;,&amp;quot;C&amp;quot;,&amp;quot;D&amp;quot;,&amp;quot;E&amp;quot;,&amp;quot;F&amp;quot;,&amp;quot;G&amp;quot;,&amp;quot;H&amp;quot;,&amp;quot;I&amp;quot;,&amp;quot;K&amp;quot;,&amp;quot;L&amp;quot;,&amp;quot;M&amp;quot;,&amp;quot;N&amp;quot;,&amp;quot;P&amp;quot;,&amp;quot;Q&amp;quot;,&amp;quot;R&amp;quot;,&amp;quot;S&amp;quot;,&amp;quot;T&amp;quot;,&amp;quot;V&amp;quot;,&amp;quot;W&amp;quot;,&amp;quot;Y&amp;quot;]
    for record in aln:
        header = record.id
        seq = record.seq
        currentSeqMat = [0] * len(fixedCharacters)
        for i in range(len(seq)):
            aa = seq[i]
            try:
                characterPos = fixedCharacters.index(aa)
                currentSeqMat[characterPos] += 1
            except ValueError:
                # list.index raises ValueError for characters outside fixedCharacters
                print(&amp;quot;ERROR:&amp;quot;, header, &amp;quot;contains character (&amp;quot;+aa+&amp;quot;) not in the list:&amp;quot;, fixedCharacters)
        compDict[header] = currentSeqMat
    compDF = pd.DataFrame.from_dict(compDict, orient='index',
                                    columns=fixedCharacters)
    return compDF
&lt;/code>&lt;/pre>
&lt;h2 id="run-chi-squared-analysis">Run chi-squared analysis&lt;/h2>
&lt;pre>&lt;code class="language-python">from scipy import stats
import numpy as np
import pandas as pd

def chi2test(compDF):
    # Fraction of gap characters in each sequence
    seqTotals = compDF.sum(axis=1)
    gaps = compDF[&amp;quot;-&amp;quot;]
    gapsPerSeq = gaps/seqTotals
    # Observed counts of the 20 amino acids, gaps excluded
    nonGap = compDF.loc[:, 'A':'Y']
    nonGapTotals = nonGap.sum().to_frame()
    nonGapSeqTotals = nonGap.sum(axis=1).to_frame()
    numCharacters = nonGapTotals.sum()
    # 'Master' distribution: overall amino acid proportions in the alignment
    expectedFreq = nonGapTotals / numCharacters
    # Expected counts: each sequence's non-gap length times the overall proportions
    expectedCountArray = np.dot(nonGapSeqTotals, expectedFreq.transpose())
    expectedCountDF = pd.DataFrame(expectedCountArray, columns=nonGap.columns, index=nonGap.index.values)
    chi2DF = ((expectedCountDF - nonGap)**2)/expectedCountDF
    chi2Sum = chi2DF.sum(axis=1)
    # Upper-tail probability of the chi2 distribution with 19 degrees of freedom
    pValueDf = 1 - stats.chi2.cdf(chi2Sum, 19)
    outDF = pd.DataFrame({&amp;quot;Gap/Ambiguity&amp;quot;:gapsPerSeq,&amp;quot;p-value&amp;quot;:pValueDf})
    outDF.index.name='header'
    return outDF
&lt;/code>&lt;/pre></description></item></channel></rss>