Saturday 1 February 2014

Big data is just a euphemism for lazy and cheap

Maybe I'm getting cantankerous, but I'm really over all of the talk about big data and how it is going to revolutionize the world: businesses are going to be so efficient they will only need a CEO and a lowly marketing guy. Governments will be so efficient that taxes will be almost unnecessary.

Enough! The reality is that big data isn't new and most organizations are not mature enough or focused enough to take advantage of the new technology. 

Learn the lessons of the past.
I was (am) a scientist. I did my Ph.D. in neuroscience and genetics back when sequencing a single gene took months. For reference, the bleeding-edge technologies can now deliver a whole genome (about 20 thousand genes) in 15 minutes.

I have already complained on this blog about the challenges of knowledge management in science and the parallels in businesses today. I'll summarize: businesses suck at getting the right information to workers because they are cheap and lazy.

No one wants to pay to do it right; everyone thinks that the app should be cheap and reduce labor costs by reducing the need to hire smart people.

Well folks, organizing and analyzing data/information is hard, and it takes a deep understanding of the difference between junk and INFORMATION.

The original Big data problem
Scientists have always generated large, complex data sets that are almost too difficult to comprehend.
As we enter the genomics era in science it has gotten worse, because most scientists have not taken the time to do quality control on the information they submit to public databases. The public data is spotty at best; how many scientists can honestly say that they trust the Gene Ontology notes?

N.B. For non-scientists: the Gene Ontology database is a repository of notes, data, and published papers capturing our combined knowledge of each gene's function, interactions, and chemical inhibitors. It contains links across species and across several databases.

The problem is that it is incomplete. NLM/NIH does not have the money to maintain it, nor do any of the primary owners. The pace of growth is too much for the curators to keep up with. The number of different sources has also grown: you now have images, gene expression studies, drug testing, and protein interaction maps.
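
If you want to see for yourself how thin the evidence behind that "combined knowledge" can be, tally the evidence codes in a GO annotation (GAF) file. This is a minimal sketch in Python, assuming you have downloaded a GAF 2.x file locally (the goa_human.gaf path is a placeholder); in that format, column 7 of each record is the evidence code, and IEA means the annotation was inferred electronically with no experiment behind it.

```python
from collections import Counter

# Evidence codes that GO classes as backed by direct experiments.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def tally_evidence(gaf_path: str) -> Counter:
    """Count GO annotations per evidence code in a GAF 2.x file."""
    counts = Counter()
    with open(gaf_path) as handle:
        for line in handle:
            if line.startswith("!"):          # skip header/comment lines
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 6:
                counts[fields[6]] += 1        # column 7 = evidence code
    return counts

if __name__ == "__main__":
    # "goa_human.gaf" is a placeholder for whatever annotation file you pulled down.
    counts = tally_evidence("goa_human.gaf")
    total = sum(counts.values())
    experimental = sum(n for code, n in counts.items() if code in EXPERIMENTAL)
    print(f"{total} annotations, {experimental} experimental, {counts.get('IEA', 0)} electronic (IEA)")
```

The split tells you at a glance how much of that "knowledge" was curated from experiments and how much a machine inferred without anyone checking it.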

Science has had a big data problem since before computers. How has the scientific community moved forward and had success even in the face of such poor data stewardship?

People.

Anyone who gets through a Ph.D. has a great analytical mind. They can see through poor-quality data to those nuggets of truth. How do they do this? They focus on finding an answer to a question, and then they build out from that question until they have built a complex, multifaceted answer.

You want to know why science is becoming stagnant and riddled with serious ethical lapses and just plain stupid errors of reproducibility?

We do not train scientists to be critical and form questions. We teach them to get a whole lot of data and mold it into a beautiful story. The logic is that if you look at enough data the truth will come out. It never does; if you start out with biased data you will get a biased answer. The data sets are inherently flawed.

There is no big data, only poorly framed questions. If you have a big data problem, it is because you have been a poor data steward and you don't have a question, so you have no ability to start sifting through the information.

There has always been a lot of information; it is just that we used to train people to work with it, understand it, analyze it, and make decisions. More importantly, we understood that failure was a good thing: a chance to define the question and focus on things that will work.

A lesson not learned
There is no such thing as big data, just better storage of the vast amounts of information that life generates. Nothing has really changed; the problem is just more visible, and we downsized all of the keepers of the knowledge. Most organizations, healthcare and pharma being the key culprits, refuse to train people to think critically and scrutinize the veracity and quality of information/data.

You want to fix the big data problem? Train people to ask questions and let them answer those questions. Or hire someone who is already well trained, such as the overstocked "bioinformatics Ph.D." class of scientists. The bottom line is that the shiny new system is still going to give you crap data if the person asking the question can't ask good, insightful questions.

Realize that autocorrect is the state of the art in predictive analytics, right?... Let that sink in for a minute. Are you willing to leave your career or your company to that?

You don't need more data; you need the right data, and the time and confidence to fully vet its quality. We need people who understand the world today and can test how well that information fits with it. This is a key element of accurate prediction.
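
What does "fully vetting the data" look like in practice? Here is a minimal sketch in Python, assuming a tabular CSV; the file name and the checks are illustrative, not a recipe. Before any analysis, count what is missing, what is duplicated, and what falls outside the range that makes sense for your question, and then have a person decide what to do about it.

```python
import pandas as pd

def vet(df: pd.DataFrame) -> None:
    """Print a few basic data-quality checks before any analysis starts."""
    print(f"rows: {len(df)}")
    print("missing values per column:")
    print(df.isna().sum())
    print(f"duplicate rows: {df.duplicated().sum()}")
    # Crude range check on numeric columns: flag values far outside the bulk
    # of the distribution so a human can decide whether they are real.
    for col in df.select_dtypes("number").columns:
        low, high = df[col].quantile([0.01, 0.99])
        outliers = ((df[col] < low) | (df[col] > high)).sum()
        print(f"{col}: {outliers} values outside the 1st-99th percentile range")

if __name__ == "__main__":
    # "measurements.csv" is a placeholder for your own data set.
    vet(pd.read_csv("measurements.csv"))
```

None of this is clever; it just takes time, and time is exactly what the cheap and lazy refuse to pay for.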

In the biomedical sciences this really comes down to how we train graduate students: do we make them learn statistics, or just hope that Excel is good enough? Are we willing to mentor students, or are they just cheap labor for the gratification of the professor? Do we pay attention to how we store and manage information so that the next student can find it?

For most businesses it comes down to why. Is there a business question that we need to answer? What is the problem that we need to fix? Is there a new source of revenue that we can exploit? What are our past failures, and what can we learn from them?