Data Hygiene, RSQL 2016, Relaimpo, & The Browser.

Data Hygiene

It has been said that much of the work of getting data ready for analysis is janitorial in nature. With the appetite for data collection on the rise, the problem is only growing.

A recent study (The State of Enterprise Data Quality: 2016) commissioned by Blazent suggests that there is still a lot of work to do.

Fewer than half of the study respondents (40%) were very confident in their organization's data quality management (DQM) practices or the quality of data within their company.

94% of respondents believe that business value is lost as a result of poor data quality – 65% of respondents believe that 10-49% of business value can be lost due to poor data quality, while 29% of respondents said 50% or more of business value can be lost.

We live in interesting times: many organizations minimize this problem, and even take on large-scale initiatives without addressing it.

Despite these findings, it's encouraging to see that more companies are making data hygiene their business. In fact, some argue that the "self-serve data prep" industry is on the rise.

R SQL 2016

Last year, we reported the acquisition of Revolution Analytics by Microsoft. Revolution Analytics was a valuable asset to Microsoft because it pushed the limits of R. As some of you know, R has a few limitations: a) it is memory-based (100GB max); b) it is single-threaded (no parallel processing); and c) the data needs to be moved to the application where you run R.

Revolution extended R's functionality by adding parallel processing on a single node, and then across multiple nodes. In addition, Revolution Analytics engineered a way to stream the data so that it does not all need to be stored in memory.
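
To make the streaming idea concrete, here is a minimal Python sketch of fitting a linear regression one chunk at a time. It is an illustration of the general technique only, not Revolution Analytics' implementation: the names chunked_ols and make_chunks are hypothetical, and the data is synthetic. Only two small running sums are kept in memory, however many rows pass through.

```python
import numpy as np


def chunked_ols(chunks):
    """Fit ordinary least squares from an iterable of (X, y) chunks.

    Only the running sums X'X and X'y are held in memory, so the full
    data set never has to be loaded at once.
    """
    xtx = None
    xty = None
    for X, y in chunks:
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        if xtx is None:
            p = X.shape[1]
            xtx = np.zeros((p, p))
            xty = np.zeros(p)
        xtx += X.T @ X  # accumulate the p x p cross-product matrix
        xty += X.T @ y  # accumulate the length-p cross-product vector
    if xtx is None:
        raise ValueError("no chunks were provided")
    # Solve the normal equations (X'X) beta = X'y.
    return np.linalg.solve(xtx, xty)


def make_chunks(n_chunks=10, rows_per_chunk=1000, seed=0):
    """Hypothetical data source: yields synthetic chunks one at a time."""
    rng = np.random.default_rng(seed)
    true_beta = np.array([2.0, -1.0, 0.5])  # intercept and two slopes
    for _ in range(n_chunks):
        features = rng.normal(size=(rows_per_chunk, 2))
        X = np.column_stack([np.ones(rows_per_chunk), features])
        y = X @ true_beta + rng.normal(scale=0.1, size=rows_per_chunk)
        yield X, y


beta = chunked_ols(make_chunks())
print(np.round(beta, 2))
```

Because the two sums are simply added up, chunks could also be processed on separate cores or nodes and combined at the end, which is the same property that makes this kind of computation easy to parallelize.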

Now, with the integration of Revolution Analytics into SQL Server 2016, analysts will be able to run models from open source R packages within the SQL Server environment. This is a major development, as it allows in-database processing and eliminates the movement of data from one environment to another.

For more details, have a look here and also here.

Relaimpo

Relaimpo is an R package developed by Ulrike Groemping (2007) for assessing the relative importance of regressors in a linear model. It's a way to quantify the share that each variable contributes in a linear regression. The idea is that the analyst can understand the impact of each variable in relation to the target and identify the most "important" ones.
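
For readers who want to see the mechanics, below is a rough Python sketch of one such measure: averaging each regressor's R-squared gain over every possible order of entry into the model (the metric relaimpo calls "lmg"). This is not the relaimpo API. The functions r_squared and lmg_importance are hypothetical names, the data is synthetic, and the brute-force loop over all orderings is only practical for a handful of regressors.

```python
from itertools import permutations

import numpy as np


def r_squared(X, y, cols):
    """R^2 of an OLS fit of y on an intercept plus the given columns of X."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)


def lmg_importance(X, y):
    """Average each regressor's sequential R^2 gain over all orderings."""
    p = X.shape[1]
    shares = np.zeros(p)
    orderings = list(permutations(range(p)))
    for order in orderings:
        included = []
        previous = 0.0
        for j in order:
            included.append(j)
            current = r_squared(X, y, included)
            shares[j] += current - previous  # gain from adding regressor j
            previous = current
    return shares / len(orderings)


# Synthetic example: x0 matters a lot, x1 a little, x2 not at all.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=500)

shares = lmg_importance(X, y)
full_r2 = r_squared(X, y, [0, 1, 2])
print(np.round(shares, 3), round(full_r2, 3))
```

The shares add up to the R-squared of the full model, which is what makes them readable as each variable's share of the explained variance.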

What's interesting is that, while researching this, we came across an article by Joe Caldwell, who not only explained the mechanics but also worked through an example that addresses a methodological issue: what happens when the data collection tool is not aligned with the business objective of the actual intervention. Have a read; it's quite interesting.

The Browser

Robert Cottrell is the editor of The Browser. The site recommends five daily articles that are curated using human judgement (not AI).

In a piece for the Financial Times he wrote: "1% of the writing online is of value to the intelligent reader, 4% is entertaining rubbish, and the remainder has no redeeming feature".

Interestingly, Robert considers the advertising business model in publishing unsustainable. He suggests further uncoupling, with the notion of single-article sales for $0.99. He recently started 1pass, which enables this vision to become a reality.

For more details, listen to his recent interview on Monocle.

"Our data are observations, but our goal is intervention."
- Judea Pearl

Copyright © 2015 Figurs* Inc. All rights reserved