Blowing Wynd-thoughts from a Anemoi: big data


Monday, 3 October 2016

Communicating science when we don't really understand it

(or how the heck do we explain what we actually know about epigenetics)

I was recently part of a Tweetchat, #EPNtalks (see the @EpigenomicsNet Storify of the chat).

I threw something out there that I have been thinking about for about five years.

Where is the Stephen Hawking of genetics/genomics/epigenetics, or, probably more accurately for today's kids, our Neil deGrasse Tyson?

N.B. While I love The Emperor of All Maladies, that is not the direction I envision, with all due respect to Dr. Mukherjee.

I find it really frustrating that with all the great communicators that I know are in the field, we do not have a book that is targeted to the general public.

It is a frustration that I think has stunted the field and allowed a bunch of illogical mumbo-jumbo to replace clear, concise analogy.

In my mind there are a lot of reasons that we have gotten here. Some of them are acceptable, like the fundamental changes in funding levels that have been seen worldwide in the last decade or so. There are also some that I find completely unacceptable.

Here is a short list of unacceptable reasons why "our" ability to communicate genetics/epigenetics sucks:

1. We (scientists) don't actually understand much beyond our own little patch of grass. I think we force a ridiculous level of specialization on trainees and students.
2. We don't have well-thought-out experiments. In the last five or so years we have gone with this notion that you can collect data and find clear, real, unbiased answers without a hypothesis.
3. We don't train Ph.D students to analyze (strongly related to #2). It is common for MolBio, Genetics or Biochem graduate students to NOT take a statistics course prior to starting their Ph.D.
4. We have fallen in love with jargon. Getting through a Science or Nature paper nowadays is a horrendous slog through acronyms and five-syllable words.
5. We have left public relations to the universities because we are too busy doing science to explain it.

I have put some thought into this and have outlined a potential book, or probably better, a video series. I have attached it below. As always, if you have any comments please add them below or email me directly at crwynder AT gmail DOT com

Script for a potential script and book on epigenetics

My script/book idea focuses on how the machinery of epigenetics relates to real-world examples of diversity and development.

Tuesday, 8 April 2014

Clinical data random information

I've become an information hoarder. As I spend more time thinking about Information Management and speeding the move to better technical systems, I am amazed how general the principles of design are across different industries.

Here is a noob's (i.e. my) "plain spoken" understanding of a key term in managing patient data across hospitals, and for predictive analytics and personal health decision making.

Level setting (i.e. the general definition of clinical data warehousing): a clinical data warehouse (CDW) is an integrated, historically archived collection of data organized by patient identifier.

For the most part, the purpose of a CDW is to serve as a database that hospitals and healthcare workers can analyze to make informed decisions, both on individual patient care and on forecasting where a hospital's patient population is going to need greater care (e.g. patients are showing up as obese, so specific hospital programs to fight diabetes are a good idea).
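A minimal sketch of that definition, using Python's built-in sqlite3 module. The table layout, column names, source labels and BMI rows are all made up for illustration and are not taken from any real CDW product. It shows the same patient-keyed, append-only table answering both an individual-care question and a population-level one.

```python
import sqlite3

def build_warehouse():
    """Create an in-memory table organized by patient identifier (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE observation (
            patient_id  TEXT NOT NULL,  -- the organizing key
            source      TEXT NOT NULL,  -- which feeding system the row came from
            observed_on TEXT NOT NULL,  -- ISO date; history is appended, not overwritten
            measure     TEXT NOT NULL,
            value       REAL NOT NULL
        )
        """
    )
    return conn

# Made-up rows from two hypothetical source systems.
ROWS = [
    ("p1", "emr_a", "2012-03-01", "bmi", 27.5),
    ("p1", "emr_b", "2013-03-01", "bmi", 31.0),
    ("p2", "emr_a", "2012-06-15", "bmi", 24.0),
    ("p2", "emr_a", "2013-06-15", "bmi", 24.5),
    ("p3", "emr_b", "2013-09-30", "bmi", 33.2),
]

def patient_history(conn, patient_id, measure):
    """Individual care: one patient's measurements in date order, across sources."""
    cur = conn.execute(
        "SELECT observed_on, value FROM observation "
        "WHERE patient_id = ? AND measure = ? ORDER BY observed_on",
        (patient_id, measure),
    )
    return cur.fetchall()

def obese_patients_by_year(conn, threshold=30.0):
    """Population view: distinct patients per year with BMI at or above the threshold."""
    cur = conn.execute(
        "SELECT substr(observed_on, 1, 4) AS year, COUNT(DISTINCT patient_id) "
        "FROM observation WHERE measure = 'bmi' AND value >= ? "
        "GROUP BY year ORDER BY year",
        (threshold,),
    )
    return cur.fetchall()

conn = build_warehouse()
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?)", ROWS)
```

Calling `patient_history(conn, "p1", "bmi")` returns that patient's two readings in date order, even though they came from different source systems. `obese_patients_by_year(conn)` gives the yearly counts a hospital might watch when deciding on a diabetes program.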

Data warehousing in healthcare is also useful in preparing for both full ICD-10 and meaningful use implementation. For example, McKesson, through its Enterprise Intelligence module, probably has plenty of CDW management capabilities for organizations interested only in meeting the upcoming ICD-10 and meaningful use deadlines. These kinds of worries apply only to US hospitals. However, since Canada requires ICD-10 compliance for all EMR systems, this does present a benefit to Canadian healthcare.

In principle, since data warehousing at its core is about building a relational database, it should be EMR-supplier agnostic. Since McKesson is an ICD-10 and meaningful use-ready supplier, the database itself should conform to standards that would allow general solutions to be used. This article goes through some of the potential benefits and pain points. It is tailored to clinical trials, but the underlying message, that building a CDW is an ongoing procedure, is the same for other uses.
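One way to picture "supplier agnostic" is a thin adapter per source system that maps each export into a single common record before anything is loaded. The sketch below is only an illustration: `vendor_a`, `vendor_b`, their field names and the `Encounter` record are hypothetical, and do not describe any real EMR export format or the HL7 model itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Encounter:
    """Common, vendor-neutral record that the warehouse stores (hypothetical)."""
    patient_id: str
    icd10_code: str
    encounter_date: str  # ISO 8601, YYYY-MM-DD

def from_vendor_a(rec: dict) -> Encounter:
    # Hypothetical export shape: {"mrn": ..., "dx": ..., "date": "YYYY-MM-DD"}
    return Encounter(rec["mrn"], rec["dx"].upper(), rec["date"])

def from_vendor_b(rec: dict) -> Encounter:
    # Hypothetical export shape with a nested diagnosis and a MM/DD/YYYY date.
    month, day, year = rec["VisitDate"].split("/")
    return Encounter(
        rec["PatientID"],
        rec["Diagnosis"]["code"].upper(),
        f"{year}-{month}-{day}",
    )

# Adding a new supplier means adding one adapter, not changing the warehouse.
ADAPTERS = {"vendor_a": from_vendor_a, "vendor_b": from_vendor_b}

def load(source: str, records: list) -> list:
    """Normalize a batch of source records into common Encounter records."""
    adapter = ADAPTERS[source]
    return [adapter(rec) for rec in records]

a_rows = load("vendor_a", [{"mrn": "001", "dx": "e11.9", "date": "2014-04-08"}])
b_rows = load(
    "vendor_b",
    [{"PatientID": "001", "Diagnosis": {"code": "I10"}, "VisitDate": "04/08/2014"}],
)
```

Both batches end up as the same `Encounter` type with the same date format and code casing, so everything downstream (queries, reports, analytics) only needs to know the common record.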

One example of how this may be done is Stanford's STRIDE; they used the HL7 Reference Information Model to combine their Cerner and Epic databases. This is part of a larger open-source project that may be an option if an organization has some development expertise.

https://clinicalinformatics.stanford.edu/research/stride.html

Since the main users of CDWs tend to be the people doing the analysis (current buzzwords to search for include BI, predictive analytics, enterprise planning, etc.), it is probably useful for Health IT professionals to understand WHO and WHAT the CDW is for within the organization, i.e. to have a full-blown Information Governance plan that places a value on information, not just a risk assessment.

Saturday, 1 February 2014

Big data is just a euphemism for lazy and cheap

Maybe I'm getting cantankerous, but I'm really over all of the talk about big data and how it is going to revolutionize the world. Businesses are going to be so efficient they will only need a CEO and a lowly marketing guy. Governments will be so efficient that taxes will be almost unnecessary.

Enough! The reality is that big data isn't new and most organizations are not mature enough or focused enough to take advantage of the new technology.

Learn the lessons of the past.

I was (am) a scientist. I did my Ph.D in neuroscience and genetics back when sequencing a single gene took months. For reference, today's bleeding-edge technologies can deliver a whole genome (about 20 thousand genes) in 15 minutes.

I have already complained on this blog about the challenges of knowledge management in science, and the parallels in businesses today. I'll summarize: businesses suck at getting the right information to workers because they are cheap and lazy.

No one wants to pay to do it right. Everyone thinks that the app should be cheap and should reduce labor costs by reducing the need to hire smart people.

Well, folks, organizing and analyzing data/information is hard and takes a deep understanding of the difference between junk and INFORMATION.

The original Big data problem

Scientists have always generated large, complex data sets that are almost too difficult to comprehend.

As we enter the genomics era in science, it has gotten worse because most scientists have not taken the time to do quality control on the information that they submit to public databases. The public data is very spotty at best; how many scientists can honestly say that they trust the Gene Ontology notes?

N.B. For non-scientists: the Gene Ontology database is a repository of notes, data, and published papers about our combined knowledge of each gene's function, interactions and chemical inhibitors. It contains links across species and across several databases.

The problem is that it is incomplete. NLM/NIH does not have the money to maintain it, nor do any of the primary owners. The pace of growth is too much for the curators to keep up with. The number of different sources has also grown: you have images, gene expression studies, drug testing, and protein interaction maps.

Science has had a big data problem since before computers. How has the scientific community moved forward and had success even in the face of such poor data stewardship?

People.

Anyone who gets through a Ph.D has a great analytical mind. They can see through poor quality data to those nuggets of truth. How do they do this? They focus on finding an answer to a question, and then they build out from that question until they have built a complex, multifaceted answer.

You want to know why science is becoming stagnant and has serious ethical and just plain stupid errors of reproducibility?

We do not train scientists to be critical and form questions. We teach them to get a whole lot of data and mold it into a beautiful story, the logic being that if you look at enough data the truth will come out. It never does; if you start out with biased data you will get a biased answer. The data sets are inherently flawed.

There is no big data, only poorly framed questions. If you have a big data problem, it is because you have been a poor data steward and you don't have a question, so you have no way to start sifting through the information.

There has always been a lot of information; it is just that we used to train people to work with it, understand it, analyze it and make decisions. More importantly, we understood that failure was a good thing: it is a chance to define the question and focus on things that will work.

A lesson not learned

There is no such thing as big data, just better storage of the vast amounts of information that life generates. Nothing has really changed; it is just that the problem is more visible, and we downsized all of the keepers of the knowledge. Most organizations, healthcare and pharma being the key culprits, refuse to train people to think critically and scrutinize the veracity and quality of information/data.

You want to fix the big data problem? Train people to ask questions and let them answer the question. Or hire someone who is already well trained, such as the overstocked "bioinformatics Ph.D" class of scientists. The bottom line is that the shiny new system is still going to give you crap data if the person using it can't ask good, insightful questions.

You realize that autocorrect is the state of the art in predictive analytics, right? Let that sink in for a minute. Are you willing to leave your career or company to this?

You don't need more data; you need the right data, and the time and confidence to fully vet its quality. We need people who understand today's world to test how well that information fits it. This is a key element of accurate predictions.

In biomedical sciences this really comes down to how we train graduate students: do we make them learn statistics or just hope that Excel is good enough? Are we willing to mentor students, or are they just cheap labor for the gratification of the professor? Do we pay attention to how we store and manage information so that the next student can find it?

For most businesses it comes down to why. Is there a business question that we need to answer? What is the problem that we need to fix? Is there a new source of revenue that we can exploit? What are our past failures and what can we learn from them?