Showing posts with label analytics. Show all posts

Saturday, 1 February 2014

Big data is just a euphemism for lazy and cheap

Maybe I'm getting cantankerous, but I'm really over all of the talk about big data and how it is going to revolutionize the world. Businesses are going to be so efficient they will only need a CEO and a lowly marketing guy. Governments will be so efficient that taxes will be almost unnecessary.

Enough! The reality is that big data isn't new and most organizations are not mature enough or focused enough to take advantage of the new technology. 

Learn the lessons of the past.
I was (am) a scientist. I did my Ph.D in neuroscience and genetics back when sequencing a single gene took months. For reference, the bleeding edge technologies can deliver a whole genome (about 20 thousand genes) in 15 minutes.

I have already complained on this blog about the challenges of knowledge management in science, and the parallels in businesses today. I'll summarize: businesses suck at getting the right information to workers because they are cheap and lazy.

No one wants to pay to do it right; everyone thinks that the app should be cheap and should cut labor costs by reducing the need to hire smart people.

Well folks, organizing and analyzing data/information is hard and takes a deep understanding of the difference between junk and INFORMATION.

The original Big data problem
Scientists have always generated large, complex data sets that are almost too difficult to comprehend.
As we enter the genomics era in science it has gotten worse because most scientists have not taken the time to do quality control on the information that they submit to public databases. The public data is very spotty at best; how many scientists can honestly say that they trust the gene ontology notes?

N.B. For non-scientists the Gene ontology database is a repository of notes, data, or published papers about our combined knowledge of each gene's function, interactions and chemical inhibitors. It contains links across species and across several databases.

The problem is that it is incomplete. NLM/NIH does not have the money to maintain it - nor do any of the primary owners. The pace of growth is too much for the curators to keep up with. The number of different sources has also grown: you have images, gene expression studies, drug testing, protein interaction maps.

Science has had a big data problem since before computers. How has the scientific community moved forward and had success even in the face of such poor data stewardship?

People.

Anyone who gets through a Ph.D has a great analytical mind. They can see through poor quality data to those nuggets of truth. How do they do this? They focus on finding an answer to a question, and then they build out from that question until they have built a complex multifaceted answer.

You want to know why science is becoming stagnant and has serious ethical and just plain stupid errors of reproducibility?

We do not train scientists to be critical and form questions. We teach them to get a whole lot of data and mold it into a beautiful story. The logic being that if you look at enough data the truth will come out. It never does; if you start out with biased data you will get a biased answer. The data sets are inherently flawed.

There is no big data, only poorly framed questions. If you have a big data problem it is because you have been a poor data steward and you don't have a question, so you have no ability to start sifting through information.

There has always been a lot of information; it is just that we trained people to work with it, understand it, analyze it and make decisions. More importantly, we understood that failure was a good thing: it is a chance to define the question and focus on things that will work.

A lesson not learned
There is no such thing as big data, just better storage of the vast amounts of information that life generates. Nothing has really changed; it is just that the problem is more visible - and we downsized all of the keepers of the knowledge. Most organizations - healthcare and pharma being the key culprits - refuse to train people to think critically and scrutinize the veracity and quality of information/data.

You want to fix the big data problem? Train people to ask questions and let them answer the question. Or hire someone well trained already, such as the overstocked "bioinformatics Ph.D" class of scientists. The bottom line is that the new shiny system is still going to give you crap data if the person asking the question can't ask good and insightful questions.

You realize that autocorrect is the state of the art in predictive analytics, right? ...Let that sink in for a minute. Are you willing to leave your career or company to this?

You don't need more data; you need the right data, and the time and confidence to fully vet the quality of the data. We need people who understand today to test how well that information fits with the world today. This is a key element of accurate predictions.

In biomedical sciences this really comes down to how we train graduate students: do we make them learn statistics or just hope that Excel is good enough? Are we willing to mentor students or are they just cheap labor for the gratification of the professor? Do we pay attention to how we store and manage information so that the next student can find it?

For most businesses it comes down to why. Is there a business question that we need to answer? What is the problem that we need to fix? Is there a new source of revenue that we can exploit? What are our past failures and what can we learn from them?

Tuesday, 21 January 2014

Twenty skills that I - or any Ph.D - have that are in demand

A while ago Christopher Buddle posted a blog on SciLogs about what you needed to know before becoming a professor. Many of those skills are the ones in demand outside of academia. 

It got me thinking about what skills I have amassed over a Ph.D, post-doc and faculty position. For any other "recovering scientists" reading this, please feel free to steal this list, add to it or perfect it. Any comments or critique would be welcome.
  1. Project management- over my academic career I managed to publish several papers in top journals. Some required precise planning of tasks and experiments on short deadlines against competition. This requires ensuring that each set of experiments finishes with a high quality deliverable.
  2. Human resources- as a professor I had to hire, fire and develop staff. This included students and early career professionals, where you are balancing what they are capable of today with their career goals. I picked projects for them that matched their skills.
  3. Project planning- a PhD is a set of projects that need to be planned out, with a full timeline, deliverables and costs set out. In addition, a key part of a successful PhD or post-doc is knowing when to kill a project.
  4. Stakeholder relationships- each stage of a PhD requires you to set out goals with your faculty advisory committee. These people will provide guidance and advice for where you should spend your time. Part of success is ensuring that you cogently show progress toward each of the members' ideas of your success. The stakes get higher as you move to a post-doc, where you are expected to manage the project and manage the expectations of your boss.
  5. Budget building- as a professor I needed to build RFPs, prioritize purchases based on project needs - as well as the long term strategy of the lab - source infrastructure, manage vendors and raise funds.
  6. Publications- part of a scientist's job is to communicate results to the community. This includes typical writing skills but also graphic design, matching the presentation and visualizations to the message and audience.
  7. Data management- all aspects of data management, including ensuring high quality data, recording metadata, database design considerations, building database queries, and integrating public and owned data into a complete set.
  8. Analytics- a key part of my PhD was defining how to quantitate behavior and images. This requires a clear analytic method that allows reproducibility through a clear, logical rubric for scoring purposes.
  9. Web based research- not just the query but also the decision on which sources are good and which are bad.
  10. Public speaking- I have given hundreds of lectures to groups of all sizes, both lay and expert. This gives me a large set of tools to fall back on for presentation design.
  11. Individual drive- to do a PhD you need an internal drive to do what must be done.
  12. Intellectual flexibility- as part of my PhD I learned at least 12 different technical skills at a high enough level to use them in peer reviewed publications and teach them to others. I learned these through reading and just doing; I didn't need to be walked through them multiple times.
  13. Records management- my laboratory worked in a high demand, high competition environment. We needed to have all experiments documented in a way that would stand up to legal review and could be used as part of a patent process.
  14. Understanding of several healthcare related regulations- part of my work was related to drug discovery and some of it was in collaboration with clinicians, meaning that we ensured that all documents and protocols met the required standards.
  15. Graphic design- genetics is a hard area to explain without pictures. I designed many successful visualizations using Photoshop, PowerPoint and old matte photography techniques.
  16. Process design- my laboratory was at the bleeding edge of genetics. This meant that we were constantly building new processes and testing resources that would be best for that process.
  17. Process optimization- due to our unique methods we constantly needed to set production standards and build analytics that allowed us to evaluate and optimize processes and make changes that reduced cost and increased reproducibility and accuracy.
  18. Contract negotiations- as part of my job, I have negotiated service contracts and terms of employment.
  19. Fund raising- academic labs are always looking for new sources of funding and interacting with potential investors/funders.
  20. Strategic product planning- a key part of success is understanding where government priorities are now, and where they will be over the next five years, in order to develop a funding strategy. Successful scientists also have an understanding of the competitive landscape and position their employees and infrastructure to keep up.