//RESOURCES

SPRING 2015

Microsoft's R, Hadoop's Hives, GigaBeats, & Rapha

Microsoft's R

In late January of this year, Microsoft acquired Revolution Analytics. The company launched in 2007 and commercialized the use of the R Project for Statistical Computing. Clint Boulton wrote a great piece in the WSJ outlining the key drivers for this transaction.

  • R is the language of choice for many in enterprise and academia (2 million daily users).
  • Revolution Analytics will be instrumental in the development of the Microsoft Azure cloud.
  • Revolution Analytics is already integrated with Hortonworks' Hadoop.

Get all the details in Clint's article here.

Hadoop's Hives

Of late, Hadoop has become an industry buzzword, often mentioned by organizations, IT teams, and data people alike.

For those who are not aware, Hadoop is a data repository with multiple interfaces (e.g., Hive and Pig, among others) that works on a different premise than the relational database management system (RDBMS). Its architecture is based on the Hadoop Distributed File System (HDFS), which spreads data across multiple machines. A job is then split into smaller tasks that run on those machines using a paradigm called MapReduce. For an introduction to Hadoop, click here.
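To make the MapReduce idea a little more concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are our own assumptions, chosen only to illustrate the shape of the paradigm with the classic word count.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, which a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for a piece of a file held on a different machine.
    # Here the chunks are processed one after another; a cluster would run
    # the map calls in parallel.
    emitted = []
    for chunk in chunks:
        emitted.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


# Hypothetical sample data, split into two chunks.
chunks = ["hadoop hive pig", "hive impala hive"]
counts = word_count(chunks)
print(counts)
```

The point of the split is that each map call only needs its own chunk, so the work can be sent to wherever the data already lives.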

Hadoop, originally an open source system from the Apache Software Foundation, has now been commercialized mainly by two companies: Hortonworks and Cloudera.

Interestingly, different forms of Hive-like SQL interfaces have been springing up in the market. Cloudera recently developed Impala to improve on Hive, while Hortonworks has “doubled down” on its effort to keep Hive as a part of the Hadoop ecosystem. Hence, Cloudera and Hortonworks are in dispute over the ideal architecture for Hive.

To get a macro view of the industry and how this dispute affects distributed computing and the MapReduce paradigm, we suggest you take a look at Gil Press's article in Forbes.

Interesting times are ahead.

GigaBeats

Business understanding and data preparation are a necessity for statistical analysis. It's well known that they account for about 80% of an overall project timeline. Moreover, different statistical methods or algorithms each call for a specific data format.

It is with this notion that the GigaBeats project was conceptualized at MIT. It's ingenious because it allows for different modeling methods to be used simultaneously.

Practically speaking, it's akin to having a specific ETL and environment for each statistical method intended to be used. While it is common to have a data warehouse, it is not very common to see a modeling data warehouse, even less so multiple ones.
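As a rough illustration of "a specific ETL for each statistical method", the Python sketch below prepares the same raw records in two different ways. This is not the GigaBeats architecture; the records, the two preparation functions, and the `PIPELINES` registry are all hypothetical and only show why one shared data format rarely suits every method.

```python
# Hypothetical raw records, as they might sit in a shared data warehouse.
records = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "premium"},
    {"age": 27, "plan": "basic"},
]


def prepare_for_linear_model(rows):
    """Numeric matrix: age scaled to (0, 1], plan expanded to one-hot columns."""
    plans = sorted({r["plan"] for r in rows})
    max_age = max(r["age"] for r in rows)
    return [
        [r["age"] / max_age] + [1.0 if r["plan"] == p else 0.0 for p in plans]
        for r in rows
    ]


def prepare_for_tree_model(rows):
    """Trees can split on raw values: keep age unscaled, plan as an integer code."""
    plans = sorted({r["plan"] for r in rows})
    return [[r["age"], plans.index(r["plan"])] for r in rows]


# One preparation step per modeling method, applied to the same source data.
PIPELINES = {
    "linear": prepare_for_linear_model,
    "tree": prepare_for_tree_model,
}
prepared = {name: prep(records) for name, prep in PIPELINES.items()}
print(prepared["tree"])
```

Keeping each method's preparation separate is what lets several methods run side by side on the same source data.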

Get some historical background on GigaBeats, and view the detailed architecture.

Rapha

For those of you who are avid cyclists, this brand needs no introduction. The company recently introduced a data print: a race wear collection made from actual riding data. Take a look at the beautiful data video, but before doing so, prepare yourself and crank up the volume a little.

The design firm that spearheaded the effort, Accept & Proceed, is also known for its light calendar, among other things. Well worth a look.



Happy Spring!



 
Copyright © 2015 Figurs* Inc. All rights reserved