1. Length differences for gene, transcript, and CDS

    The lengths of a gene, its transcripts, and its CDS can be quite different.

    First we need to understand these three terms:

    • gene Of the three, the gene is the broadest concept; it contains both the transcripts and the CDS.

    • transcript A gene can have multiple transcripts (a.k.a. isoforms), which of course vary in length.

    • CDS The coding region, made up of the exon sequence that is translated; it does not include the 5' or 3' UTRs, which are also part of the exons. A complete CDS starts with an AUG start codon and ends with a stop codon (UAG, UAA, or UGA).
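
    To make the difference concrete, here is a minimal Python sketch that computes the three lengths for a single transcript. The coordinates are made up for illustration (they are not a real Arabidopsis gene), and the helper names are my own; real values should come from an annotation source.

```python
# Hypothetical coordinates (1-based, inclusive) on the forward strand.
gene_start, gene_end = 1000, 5000
exons = [(1000, 1500), (2000, 2600), (4000, 5000)]  # exons of one transcript (isoform)
cds_start, cds_end = 1200, 4300  # first base of the start codon to last base of the stop codon


def span_length(start, end):
    """Length of an inclusive interval."""
    return end - start + 1


def transcript_length(exons):
    """Sum of exon lengths: introns are spliced out, UTRs are kept."""
    return sum(span_length(s, e) for s, e in exons)


def cds_length(exons, cds_start, cds_end):
    """Sum of the exon parts that fall inside the coding interval (UTRs removed)."""
    total = 0
    for s, e in exons:
        lo, hi = max(s, cds_start), min(e, cds_end)
        if lo <= hi:
            total += span_length(lo, hi)
    return total


gene_len = span_length(gene_start, gene_end)
tx_len = transcript_length(exons)
cds_len = cds_length(exons, cds_start, cds_end)
print("gene:", gene_len, "transcript:", tx_len, "CDS:", cds_len)
```

    In this toy example the gene is longer than the transcript because of the introns, and the transcript is longer than the CDS because of the UTRs.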

    My task is to add a gene function annotation (brief description) and the CDS length to a list of genes in Arabidopsis thaliana. I found the biomaRt package from Bioconductor quite helpful.

    Below is my script to retrieve this information from BioMart within the R environment.

    mart <- biomaRt::useMart(biomart = "plants_mart", dataset …

  2. Learn git and github


    • What is git? Why do we use it?
    • Common git commands
    • Remote git repository

    Install git and set up the environment

    git setup

    # tell git who you are
    git config --global user.name "Zenith Nobel"
    git config --global user.email "ZN@msu.edu"
    git config --global color.ui true
    git config --global core.editor vim

    Start git

    git init # to start a git repository
    # Another way to start a repo
    git clone https://github.com/lh3/seqtk.git

    Track files in git

    git add and git status

    echo 'Hello world!' >> README
    echo 'unstaged' >> unstage.txt
    git status
    git add README
    git status

    git commit - Take a snapshot of your project

    git commit -m 'initial commit'

    git diff - See the differences

    # append information to README
    echo '##Project Fusarium' >> README
    git diff
    # see more history
    git log
    git log --pretty=oneline --abbrev-commit
    git commit -a -m 'add project title to the …

  3. Making maps in R


    I have been working on my population genetics project. One of the main tasks is to plot the data in the context of the sampling locations. I searched through many different solutions, and the best one for my application was to use the packages maps and mapplots.

    Below is my R code for plotting partial maps of the US and Argentina.

    #! r code
    par(mar = (c(4, 1, 4, 1) + 0.1), mfrow = c(1,2))
    # prepare colors from STRUCTURE and set their alpha to 0.8
    # color order: C1=Blue, C2=Purple
    structureCol <- c("#0099e6", "#800080","#ff004d","#339933")
    structureCol <- adjustcolor(structureCol, alpha.f = 0.8)
    # read the STRUCTURE CLUMPP data along with GPS coordinates
    struGpsData <- read.table("data/raw/struGPS.txt", header=TRUE, sep="\t")
    # set radius size
    for (i in struGpsData$Pop){
      subData <- subset(struGpsData, Pop==i)
      if (subData$PopSize < 5){
        struGpsData$popRadius[i …

  4. Compile Bio++ and install egglib on Ubuntu 14.04 LTS

    I didn't find it difficult to install or compile the egglib module on Linux when I first started using egglib in Python. However, I changed my mind when I tried to use the internal function Align.polymorphismBPP(), which depends on Bio++ library support. Somehow I didn't install the C++ module correctly in the beginning, so now I have to pay the price.

    Before we get into the installation, we need to check the prerequisites required by egglib and my local platform configuration.

    This tutorial was tested on my platform:

    • Ubuntu Linux 14.04 LTS
    • Bio-Linux installed
    • Python 2.7.6
    • GCC 4.8.2


    • cmake: sudo apt-get install cmake
    • doxygen: sudo apt-get install doxygen
    • Bio++: the installation is described below

    Install dependencies

    For the egglib-cpp module, you will only need three libraries from Bio++: bpp-core, bpp-seq, and bpp-popgen. Keep this in mind …

  5. Good and useful bioinformatics tools

    Below is a list of software I have found to be very handy for applied bioinformatics. Most of the tools listed here are open-source programs.

    • PRANK - Codon sequence alignment
    • MAFFT - Multiple sequence alignment, similar performance to MUSCLE
    • egglib - powerful tools for population genetics and genomics data analysis
    • Roary - tool for pan-core genome studies (primarily bacteria)
    • Prodigal - a gene prediction tool for bacterial genomes; Prokka, which runs Prodigal internally, produces the GFF3 files that Roary takes as input
    • ete - a nice phylogenomics toolkit with a Python API
    • Bioconda - package manager and environment manager for bioinformatics software
    • d-maps - high-quality map images, multiple formats supported
    • R-Bioconductor
    • To be continued

  6. Population genetics simulation in Python

    A Python script written in OOP style to simulate population genetic drift and spontaneous mutation. Microsatellite markers were used as a surrogate to calculate genetic diversity across time/generations.
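
    As a rough illustration of the idea (this is not the original script), here is a minimal sketch assuming a haploid Wright-Fisher population, a single microsatellite locus, and a stepwise mutation model. The class name, parameter names, and default values are all my own assumptions.

```python
import random


class Population:
    """Haploid Wright-Fisher population tracked at one microsatellite locus."""

    def __init__(self, size, repeats=20, mutation_rate=1e-3, seed=None):
        self.rng = random.Random(seed)
        self.mutation_rate = mutation_rate
        # every individual starts with the same allele (number of repeat units)
        self.alleles = [repeats] * size

    def next_generation(self):
        # drift: each offspring copies a parent drawn at random with replacement
        offspring = self.rng.choices(self.alleles, k=len(self.alleles))
        # mutation: stepwise model, an allele gains or loses one repeat unit
        for i, allele in enumerate(offspring):
            if self.rng.random() < self.mutation_rate:
                offspring[i] = max(1, allele + self.rng.choice((-1, 1)))
        self.alleles = offspring

    def gene_diversity(self):
        # expected heterozygosity: 1 - sum of squared allele frequencies
        n = len(self.alleles)
        counts = {}
        for allele in self.alleles:
            counts[allele] = counts.get(allele, 0) + 1
        return 1.0 - sum((c / n) ** 2 for c in counts.values())


def simulate(size=100, generations=200, seed=42):
    """Return the gene diversity at generation 0 and after each generation."""
    pop = Population(size, seed=seed)
    history = [pop.gene_diversity()]
    for _ in range(generations):
        pop.next_generation()
        history.append(pop.gene_diversity())
    return history


history = simulate()
print("start:", history[0], "end:", round(history[-1], 3))
```

    Diversity starts at zero because all individuals share one allele; mutation adds new alleles while drift removes them, so the trajectory fluctuates from run to run unless the seed is fixed.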

  7. Pan-Core Genome Pipeline (Part I)

    - How to use Parallel (video)
    - Tutorial Document

    FASTQC

    #!/bin/bash
    # this file is to run fastqc

    # run in parallel
    # -j+0 runs one job per CPU core (the core count plus 0 extra jobs)
    ls *.gz | time parallel -j+0 --eta 'fastqc {}'



    FASTx trimmer

    fastx_trimmer -Q33 -l 70 -i combined.fq | fastq_quality_filter -Q33 -q 30 -p 50 > combined-trim.fq
    fastx_trimmer -Q33 -l 70 -i s1_se | fastq_quality_filter -Q33 -q 30 -p …
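
    To show what the two steps do to a single read, here is a small Python sketch of the logic as I understand it from the options: -l 70 keeps bases 1 to 70, and -q 30 -p 50 keeps a read only if at least 50% of its bases have a quality of 30 or higher (Phred+33 encoding). This is not the FASTX code; the function names and the example read are made up, and the exact comparison at the threshold is an assumption.

```python
def trim_read(seq, qual, last=70):
    """Keep bases 1..last, mimicking fastx_trimmer -l."""
    return seq[:last], qual[:last]


def passes_quality(qual, min_q=30, min_percent=50, offset=33):
    """True if at least min_percent of the bases have quality >= min_q."""
    if not qual:
        return False
    good = sum(1 for ch in qual if ord(ch) - offset >= min_q)
    return 100.0 * good / len(qual) >= min_percent


# Made-up 100-base read: 40 high-quality bases ('I' = Q40) then 60 poor ones ('#' = Q2)
seq = "ACGT" * 25
qual = "I" * 40 + "#" * 60

trimmed_seq, trimmed_qual = trim_read(seq, qual)
print(len(trimmed_seq), passes_quality(qual), passes_quality(trimmed_qual))
```

    The example read fails the filter at full length (40 of 100 bases are good) but passes after trimming (40 of 70), which is why trimming comes before filtering in the pipe.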

  8. Build a bioinformatics lab server with a small budget in 2014 (Part1)

    A friend of mine is going to set up a plant pathology lab soon, and he is planning to do some bioinformatics work there. I am helping him build a lab server at a reasonable price, which in this case was USD 1.5k. I plan to use this opportunity to write a blog post documenting the whole process and share it with the community.

    Nowadays, next generation sequencing (NGS) is becoming affordable and accessible to most labs. Many options are available for people who need computing resources for their data, but the cheapest route for a lab with limited funding could be a lab-owned server. Now let’s see what we can get at this price, and what the performance is going to be.

    1. Consult on the scope of their research.

    At this step, I had a couple of conversations with my friend …
