Saturday, 17 March 2018

Big Data: Not Only a Fancy/Catchy Name

The field of biomedical research has a new trend of using fancy terms in the titles of papers and grants to attract the attention of reviewers, journals and grant agencies. Among them are: large-scale, complete map, draft, landscape, deep, full, and Big Data. Figure 1 shows the exponential growth in the use of these words in PubMed articles.


Figure 1: Number of mentions of specific terms in PubMed by year.

I will stop here to discuss the term Big Data.

Sunday, 11 March 2018

Data Visualization: Plots You Should be Using More

Inspired by this blog post

1- Parallel Coordinates - A parallel coordinates graph arrays multiple variables alongside one another, each scaled from its highest to its lowest value (highest at the top, lowest at the bottom), with lines connecting each entity’s position on each variable horizontally across the graph. Because a large number of cases is usually represented, it is often presented as an interactive view where individual lines can be selected and highlighted.
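As a rough illustration of the scaling step described above, the sketch below rescales every variable to the range 0 to 1 and returns, for each entity, the vertices of the line that would be drawn across the axes. The function name parallel_coordinates and the sample and gene names are made up for this example; actual drawing would be left to a plotting library.

```python
from typing import Dict, List, Tuple


def parallel_coordinates(
    records: Dict[str, Dict[str, float]],
    variables: List[str],
) -> Dict[str, List[Tuple[int, float]]]:
    """Return, per entity, its polyline vertices as (axis index, scaled height)."""
    # Find the lowest and highest value of each variable so that every
    # axis can be scaled independently to the range [0, 1].
    bounds = {}
    for var in variables:
        values = [rec[var] for rec in records.values()]
        bounds[var] = (min(values), max(values))

    lines = {}
    for name, rec in records.items():
        vertices = []
        for axis, var in enumerate(variables):
            low, high = bounds[var]
            # A constant variable has no spread; place it mid-axis.
            height = 0.5 if high == low else (rec[var] - low) / (high - low)
            vertices.append((axis, height))
        lines[name] = vertices
    return lines


# Hypothetical expression values for three samples.
samples = {
    "sample_A": {"gene1": 2.0, "gene2": 10.0, "gene3": 0.5},
    "sample_B": {"gene1": 4.0, "gene2": 30.0, "gene3": 0.1},
    "sample_C": {"gene1": 3.0, "gene2": 20.0, "gene3": 0.3},
}
lines = parallel_coordinates(samples, ["gene1", "gene2", "gene3"])
```

Each entry of lines is one line of the plot: height 1.0 sits at the top of an axis and 0.0 at the bottom.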


Some great examples in life sciences

http://www.statisticsviews.com/details/feature/6314441/Visualising-Statistics-The-importance-of-seeing-not-just-describing-data.html

https://www.computer.org/csdl/trans/tg/2009/06/ttg2009061001-abs.html


Reference to a nice article: http://www.mdpi.com/2227-9709/4/3/21


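2- Horizon Charts - Horizon charts show time-series data with both negative and positive values on the vertical scale, using coloring or shading to mark negative values while mirroring them above the baseline “horizon”. The values are usually also sliced into color-graded bands that are layered on top of each other, which saves vertical space.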
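Horizon charts are commonly drawn by slicing the vertical range into bands and stacking those bands in the same strip, with darker shades for higher bands. The sketch below shows one way to compute that transformation; horizon_bands, n_bands and band_height are names chosen for this example, not part of any charting library.

```python
from typing import Dict, List, Union


def horizon_bands(
    values: List[float], n_bands: int, band_height: float
) -> List[Dict[str, Union[bool, List[float]]]]:
    """For each value, report its sign and how much of each band it fills (0 to 1)."""
    result = []
    for value in values:
        # Negative values are mirrored above the baseline; the sign is kept
        # separately so they can be drawn in a different color.
        magnitude = abs(value)
        fills = []
        for band in range(n_bands):
            lower = band * band_height
            # Portion of this band covered by the value, clamped to [0, 1].
            fill = min(max((magnitude - lower) / band_height, 0.0), 1.0)
            fills.append(fill)
        result.append({"negative": value < 0, "fills": fills})
    return result


series = [0.5, 1.5, -2.5, -0.25]
bands = horizon_bands(series, n_bands=3, band_height=1.0)
```

Values larger than n_bands * band_height simply saturate the top band, so the band height should be chosen with the range of the data in mind.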



Some great examples in life sciences:


http://homer.ucsd.edu/homer/ngs/ucsc.html




Nice References:

http://www.canoo.com/blog/2017/11/17/deutsch-friday-fun-li-horizon-charts/
http://www.perceptualedge.com/blog/?p=390
http://www.perceptualedge.com/articles/visual_business_intelligence/time_on_the_horizon.pdf


3- Box plots - Box plots are drawn for groups of scale values (e.g. expression profiles). They let us study the distributional characteristics of a group of values as well as the level of the values. To begin with, the values are sorted. Then four equal-sized groups are made from the ordered values, so that 25% of all values fall in each group. The lines dividing the groups are called quartiles, and the groups are referred to as quartile groups. Usually, these groups are labelled 1 to 4 starting at the bottom (https://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots).
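The quartile computation described above can be sketched with Python's standard statistics module. Several conventions exist for interpolating quartiles; the "inclusive" method used here is an assumption made for this example, and five_number_summary is a made-up helper name.

```python
import statistics
from typing import Dict, List


def five_number_summary(values: List[float]) -> Dict[str, float]:
    """Return the numbers a box plot is drawn from: min, Q1, median, Q3, max."""
    ordered = sorted(values)
    # Three cut points divide the ordered values into four quartile groups.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Hypothetical expression values for one group.
expression = [7, 1, 9, 3, 5, 2, 8, 4, 6]
summary = five_number_summary(expression)
# summary["q1"], summary["median"], summary["q3"] are 3.0, 5.0, 7.0
```

The box spans q1 to q3 with a line at the median. Whisker and outlier rules differ between tools, so they are left out of this sketch.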


Some nice examples in life sciences: 

https://www.nature.com/articles/srep42460
http://biochemres.com/beautiful-minimalist-boxplots-with-r-and-ggplot2

4- Scatter Plot / PCA Classification -  A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are color-coded, one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. The scatterplot is one of the most used plots in life sciences. 


This is a well-known plot, so it needs no further introduction.

5- Genome Circles (Circos) - Circos uses a circular composition to show connections between objects or between positions, which are difficult to organize visually when the underlying layout is linear (or a graph, which can quickly become a hairball). In many cases, a linear layout makes it impossible to keep the relationship lines from crossing other structures, which deteriorates the effectiveness of the graphic.
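To make the circular composition concrete, the sketch below places segments (for example chromosomes) around a circle in proportion to their length and converts a position on a segment into x/y coordinates, so that a connection becomes a pair of points on the circle. This is only an illustration of the geometry, not how Circos itself is implemented; the function names and the toy segment lengths are invented for the example.

```python
import math
from typing import Dict, Tuple


def angular_spans(lengths: Dict[str, int]) -> Dict[str, Tuple[float, float]]:
    """Give each segment a (start, end) angle proportional to its length."""
    total = sum(lengths.values())
    spans = {}
    start = 0.0
    for name, length in lengths.items():
        end = start + 2 * math.pi * length / total
        spans[name] = (start, end)
        start = end
    return spans


def to_xy(
    lengths: Dict[str, int], name: str, position: int, radius: float = 1.0
) -> Tuple[float, float]:
    """Map a linear position on a segment to a point on the circle."""
    start, end = angular_spans(lengths)[name]
    angle = start + (end - start) * position / lengths[name]
    return (radius * math.cos(angle), radius * math.sin(angle))


# Toy genome with two segments; lengths are arbitrary units.
genome = {"chrA": 100, "chrB": 300}
# A connection between position 50 on chrA and position 150 on chrB
# becomes a chord between two points on the circle.
link = (to_xy(genome, "chrA", 50), to_xy(genome, "chrB", 150))
```

Because every endpoint lies on the circle, the connecting lines pass through the empty interior instead of crossing the segments themselves.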



Reference explaining what Circos is: http://mkweb.bcgsc.ca/template/circos/$url_root/tableviewer/



Wednesday, 24 January 2018

Monit: Monitoring your Services

Bioinformatics applications are moving towards "microservices" architectures, where services are fine-grained and protocols are lightweight. A microservices architecture decomposes the application into smaller services, which improves modularity and makes the application easier to develop, deploy and maintain. It also parallelizes development by enabling small autonomous teams to develop, deploy and scale their respective services independently.



With more services (databases, APIs, web applications, pipelines), more components must be traced and monitored to know the health of your application. Different services may play different roles (on different physical or logical machines) and may even be geographically isolated from each other. Taken together, these services and servers may provide a combined service to the end application. An issue on any one server should not affect the behavior of the final application, and it must be found and fixed before an outage happens.

Many applications allow developers, DevOps engineers and sysadmins to monitor all the services in a microservices application; two of the most popular are Nagios and Monit.
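To give a feel for the most basic kind of check a monitoring tool performs, here is a minimal sketch that tests whether each service accepts TCP connections. It is not how Monit or Nagios are implemented or configured; the function names are invented for this example, and real tools add scheduling, alerting and restart actions on top of checks like this.

```python
import socket
from typing import Dict, Tuple


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(services: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Check every named service and report which ones are reachable."""
    return {
        name: is_port_open(host, port)
        for name, (host, port) in services.items()
    }


# Example call with hypothetical service names, hosts and ports:
# status = check_services({
#     "database": ("db.example.org", 5432),
#     "api": ("api.example.org", 8080),
# })
```

A reachable port only shows that something is listening; checking that the service answers correctly would need a protocol-level test such as an HTTP request.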