HSRA
[NEW] 2019/01/23: HSRA v1.1 released! Check out the News section
Hadoop Spliced Read Aligner (HSRA) [1,2] is a MapReduce-based parallel tool for mapping reads from RNA sequencing (RNA-seq) experiments. It supports single-end and paired-end read alignments from FASTQ/FASTA datasets. RNA-seq analyses typically begin by mapping reads to a reference genome to determine where each read originated, which is a very time-consuming step in bioinformatics pipelines. HSRA allows bioinformatics researchers to distribute their mapping tasks efficiently over the nodes of a computer cluster by combining a fast spliced aligner with a Big Data processing framework.
More specifically, HSRA takes advantage of the MapReduce programming model originally developed by Google [3] to extend the multithreading capabilities of the spliced HISAT2 aligner [4,5] to large-scale distributed systems (e.g., cloud-based infrastructures). HSRA is built upon the open-source Apache Hadoop project [6], one of the most popular distributed computing frameworks for scalable Big Data processing. HSRA currently supports all major 64-bit Linux distributions. Moreover, our tool uses the Hadoop Sequence Parser (HSP) library [7] to efficiently read the input datasets stored in the Hadoop Distributed File System (HDFS) [8], and it can process datasets compressed with the Gzip and BZip2 codecs.
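The general idea of distributing read mapping with a MapReduce-style job can be illustrated with a small, self-contained Python sketch. This is not HSRA's code and makes several assumptions: the function names (`parse_fastq`, `make_splits`, `toy_aligner`, `run_map_job`) are hypothetical, a thread pool stands in for the cluster nodes, and a toy function that only reports the read length stands in for a real aligner such as HISAT2. It only shows how independent input splits of FASTQ records can be processed in parallel and their outputs concatenated.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Tuple

# A FASTQ record is four text lines: header, sequence, separator, quality.
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Group FASTQ text lines into 4-line records."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        yield (lines[i], lines[i + 1], lines[i + 2], lines[i + 3])


def make_splits(records: List[FastqRecord], split_size: int) -> List[List[FastqRecord]]:
    """Partition the records into fixed-size input splits, one per map task."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def toy_aligner(record: FastqRecord) -> Tuple[str, int]:
    """Hypothetical stand-in for a real aligner: returns (read id, read length)."""
    header, sequence, _, _ = record
    return (header[1:].split()[0], len(sequence))


def map_task(split: List[FastqRecord]) -> List[Tuple[str, int]]:
    """One map task: process every read of a single split independently."""
    return [toy_aligner(record) for record in split]


def run_map_job(records: List[FastqRecord], split_size: int, workers: int = 2) -> List[Tuple[str, int]]:
    """Run one map task per split in parallel and concatenate the outputs."""
    splits = make_splits(records, split_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(map_task, splits)  # keeps split order
        return [hit for part in partial_results for hit in part]


if __name__ == "__main__":
    sample = [
        "@read1", "ACGTACGT", "+", "IIIIIIII",
        "@read2", "ACGTA", "+", "IIIII",
        "@read3", "ACG", "+", "III",
    ]
    for read_id, length in run_map_job(list(parse_fastq(sample)), split_size=2):
        print(read_id, length)
```

Because each split is processed without reference to any other, adding more workers (or, in a real cluster, more nodes) increases throughput without changing the per-read logic.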
This tool is distributed as free software under the GPLv3 license [9] and is publicly available from the Downloads section.
Citation
If you have used HSRA in your research, please cite our work using the following reference:
References
- [2] HSRA SourceForge webpage
- [3] Jeffrey Dean, Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1): 107-113 (2008)
- [4] Daehwan Kim, Ben Langmead, Steven L. Salzberg. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12(4): 357-360 (2015)
- [5] HISAT2 webpage
- [6] Apache Hadoop project
- [7] Hadoop Sequence Parser (HSP) library for FASTQ/FASTA datasets
- [8] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler. The Hadoop Distributed File System. Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST'2010), Incline Village, NV, USA, pp. 1-10 (2010)
- [9] GNU General Public License version 3 (GPLv3)