
Is Big Data Missing the Big Picture?

27 Apr

[Image: Forest for the Trees]

When I was defending my graduate thesis a number of years ago, one of the faculty in attendance asked me to explain why I had done “x” rather than “y” with my data. I stumbled for a bit until I finally said, somewhat out of frustration at not knowing the right answer, “Because that’s not what I said I’d do.” My statistics professor, also in attendance, piped in as I quickly tried to backtrack: “That’s the right answer.”

As I’ve watched and listened to and read and been a part of so many discussions about data – data sharing, data citation, data management – over the past several years, I often find myself thinking back on that defense and my answer. More, I’ve thought of my professor’s comment: that data is collected, managed, and analyzed according to certain rules that a researcher, graduate student, or any data collector decides on from the outset. That’s best practice, anyway. And such an understanding always makes me wonder if, in our exuberance to claim the importance, the need, the mandates, and the “sky’s the limit” view of data sharing, we forget that.

I really enjoyed the panel that the Medical Library Association put together last week for their webinar, “The Diversity of Data Management: Practical Approaches for Health Sciences Librarianship.” The panelists included two data librarians and one research data specialist: Lisa Federer of the National Institutes of Health Library, Kevin Read from New York University’s Health Sciences Library, and Jacqueline Wirz of Oregon Health & Science University, respectively. As a disclosure, I know Lisa, Kevin, and Jackie personally and consider them great colleagues, so I could be a little biased, but putting that aside, I do feel that they each have a wealth of experience and knowledge in the topic, and it showed in their presentations and dialogue.

Listening to the kind of work and the projects that these data-centric professionals shared, it’s easy and exciting to see the many opportunities that exist for libraries, librarians, and others with an interest in data science. At the same time, I admit that I wince when I sense our “We can do this! Librarians can do anything!” enthusiasm bubble up – as occasionally occurs when we gather together and talk about this topic – because I don’t think it’s true. I do believe that individually, librarians can move into an almost limitless career field, given our basic skills in information collection, retrieval, management, preservation, etc. We are well-positioned in an information age.

That said, I also believe that (1) there IS a difference between information and data and (2) the skills librarians have as a foundation in information science don’t, in and of themselves, translate directly to the age of big data. (I’m not a fan of that descriptor, by the way. I tend to think it was created and is perpetuated by the tech industry and the media, both wishing us to believe things are simpler than they ever are.) Some librarians, with a desire and propensity toward the opportunities in data science, will find their way there. They’ll seek out the extra skills needed and they’ll identify new places and new roles that they can take on. I feel like I’ve done this myself, and I know a good handful of others who’ve done the same. But can we sell it as the next big thing that academic and research libraries need to do? Years later, I still find myself a little skeptical.

Moving beyond the individual, though, I wonder if libraries and other entities within information science, as a whole, don’t have a word of caution to share in the midst of our calls for openness of data. It’s certainly the belief of our profession(s) that access to information is vital for the health of a society on every level. In many ways, though, it seems that in our discussions of data, we’ve simply expanded our dedication to the principle of openness of information to include data as well. Have we really thought through all that we’re saying when we wave that banner? Can we have a more tempered response and/or approach to the big data bandwagon?

Arguably, there are MANY valid reasons for supporting access in this area: peer review, expanded and more efficient science, reproducibility, transparency, etc. Good things, all. But going back to that lesson I learned in grad school, it’s important to remember that data is collected, managed, and analyzed in certain ways for a reason – things decided by the original researcher. In other words, data has context. Just like information. And like information, I wonder about (and have concern for) what happens to data when it’s taken out of its original context. And I wonder if my profession could perhaps advocate this position, too, along with those of openness and sharing, if for nothing more than to raise the collective awareness and consciousness of everyone in this new world. To curb the exuberance just a tad.

I recently started getting my local paper delivered to my home. The real thing. The newsprint newspaper. The one that you spread out on the kitchen table and peruse, page by page. You know what I’ve realized in taking up this long-lost activity again? When you look at a front page with articles on an earthquake in Nepal, nearby horses attacked by a bear, the hiring practices of a local town’s police force, and gay marriage, you’re forced to think of the world in its bigger context. At the very least, you’re made aware of the fact that there’s a bigger picture to see.

When I think of how fragmented information is today, I can’t help but ask if there’s a lesson there that can be applied to data before we jump overboard into the “put it all out there” sea. We take research articles out of the context of journals. We take scientific findings out of the context of science. We take individual experiences out of the context of the very experience in which they occur. And of course, most obviously, we take any and every politician’s words out of context in order to support whatever position we either want or don’t want him/her to support. I don’t know about you, but each and every one of these examples seems a pretty clear reason to at least think about what can and will happen (already happens) to data if and when it suffers the same fate.

Are there reasons why librarians and information specialists are concerned with big data? Absolutely! I just hope that our concern also takes in the big picture.

 

Repeat After Me

22 Aug

[Image: Reproduction]

Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method and relies on ceteris paribus. (Wikipedia)

I was going to start this post with a similar statement in my own words, but couldn’t resist the chance to quote Latin. It always makes you sound so smart. But regardless of whether these are a Wikipedia author’s words or my own, the point is the same – one of the foundations of good science is the ability to reproduce the results.

My work for the neuroimaging project involves developing a process for researchers in this field to cite their data in such a way that makes their work more easily reproducible. The current practice of citing data sets alone doesn’t always make reproducibility possible. A researcher might take different images from a number of different data sets to create an entirely new data set, in which case citing the previous sets in whole doesn’t tell exactly which images were used. This gap can make the final research harder to replicate, as well as more difficult to review. We think that we may have a way to help fix this problem, and that’s what I’ve been working on for the past few months.
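
To make the gap concrete, here’s a purely illustrative sketch of what image-level citation for a derived data set might look like – a small manifest pointing from the new set back to the specific images in each source set. This is not our project’s actual method; every DOI, file name, and field name below is made up.

    # Illustrative sketch only: a minimal "derived data set" manifest that
    # cites individual images back to their source data sets, instead of
    # citing each source set as a whole. All DOIs, file names, and field
    # names are hypothetical.

    derived_dataset = {
        "name": "example-derived-set-v1",
        "sources": [
            {
                "dataset_doi": "doi:10.0000/example.neuro.setA",  # made-up DOI
                "images": ["sub-01_T1w.nii.gz", "sub-07_T1w.nii.gz"],
            },
            {
                "dataset_doi": "doi:10.0000/example.neuro.setB",  # made-up DOI
                "images": ["sub-12_T1w.nii.gz"],
            },
        ],
    }

    def image_level_citations(manifest):
        """Yield (source DOI, image) pairs so a reviewer can see exactly
        which images, from which data sets, went into the new set."""
        for source in manifest["sources"]:
            for image in source["images"]:
                yield source["dataset_doi"], image

    for doi, image in image_level_citations(derived_dataset):
        print(doi, "->", image)

Citing at the image level this way, rather than at the whole-set level, is what would let a reviewer reassemble exactly the data that was analyzed.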

At the same time, I’ve been working on a systematic review with the members of the mammography study team. This work has me locating and reading and discussing a whole slew of articles about the use of telephone call reminders to increase the rate of women receiving a mammogram within current clinical guidelines. It also has me wondering about the nature of clinical research and the concept of reproducible science, for in all of my work, I’ve yet to come across any two studies that are exactly alike. In other words, it doesn’t seem to be common practice for anyone to repeat anyone else’s study. And I can’t help but wonder why this is so.

I imagine it has something to do with funding. Will a funding agency award money to a proposal that seeks to repeat something, something unoriginal? Surely they are more apt to fund new ideas.

Maybe it has to do with scientific publishing. Like funding agencies, publishers probably much prefer to publish new ideas and new findings. Who wants to read an article that says the same thing as one they read last year?

Of course, it may also be that researchers look to improve on previous studies, rather than simply repeat them. This is what I see in all of the papers I’ve found for this particular systematic review. The methods are tweaked from study to study; the populations differ just a bit, the length of time varies, etc. It makes sense. The goal of this body of research is to determine what intervention works the best and in changing things slightly, you might just find the answer. What has me baffled about this process, though, is that as we continue to tweak this aspect or that aspect of a study’s methodology, when and/or how do we ever discover what aspect actually works and then put it into practice? 

Working on this particular review, I’ve collected easily 50+ relevant articles, yet as we pull them together – consolidate them to discover any conclusions – the task seems, at times, impossible. Too often, despite the relevancy of the articles to the question asked, what you really end up comparing is apples to oranges. How does this get to the heart of scientific discovery? How does it influence or generate “best practice”? I can’t help but wonder.

Yesterday, during my library’s monthly journal club, we discussed an article that had been recommended to me by one of the principal investigators on the mammography study. “How to Read a Systematic Review and Meta-analysis and Apply the Results to Patient Care” is the latest Users’ Guide on the subject from the Journal of the American Medical Association (JAMA). It prompted a lively session about everything from how research is done, to how medical students are taught to read the literature, to how the media portrays medical news. I recommend it.

Of course, there are many answers to my question and many factors at play. My wondering and our journal club discussion don’t afford any concrete solution, but I still feel it’s a worthwhile topic for medical librarians to think about. If you have any thoughts, please keep the discussion going in the comments section below.

All of the Data that’s Fit to Collect

28 Jul

My graduate thesis in exercise physiology involved answering a research question that required collecting an awful lot of data before I had enough for analysis. I was comparing muscle fatigue in males and females, and in order to do this I had to find enough male-female pairs that matched for muscle volume. I took skinfold measurements and calculated the muscle volume of about 150 thighs belonging to men and women on the crew teams of Ithaca College. Out of all of that, I found 8 pairs that matched. It was hardly enough for grand findings, but it was enough to do the analysis, write my thesis, successfully defend it, and earn my degree. After all, that’s what research at this level is all about: learning how to put together a study and carry it all the way through to completion.
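
For what it’s worth, here’s a toy sketch of that kind of matching problem: greedy pairing within a tolerance. The 5% tolerance, the greedy strategy, and the volumes are all made up for illustration; this isn’t my actual procedure or data.

    # Toy illustration of volume-matched pairing. The greedy strategy,
    # the 5% tolerance, and all volumes (in cm^3) are made-up assumptions,
    # not the thesis's actual method or data.

    def match_pairs(male_volumes, female_volumes, tolerance=0.05):
        """Pair each female thigh volume with the closest unused male
        volume, keeping the pair only if it falls within the relative
        tolerance."""
        pairs = []
        available = sorted(male_volumes)
        for f in sorted(female_volumes):
            best = min(available, key=lambda m: abs(m - f), default=None)
            if best is not None and abs(best - f) / f <= tolerance:
                pairs.append((best, f))
                available.remove(best)
        return pairs

    males = [2150.0, 2310.5, 1980.2, 2600.8]
    females = [1995.0, 2148.0, 1550.3]
    print(match_pairs(males, females))  # [(1980.2, 1995.0), (2150.0, 2148.0)]

Even in this toy version, most of the collected measurements go unmatched – which is roughly how 150 thighs became 8 usable pairs.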

During my defense, one of my advisers asked, “With all of that data, you could have answered ___, too. Why didn’t you?” I hemmed and hawed for a bit before finally answering, “Because that’s not what I said that I was going to do,” an answer that my statistics professor, also in attendance, confirmed was the right one. Was my adviser trying to trick me? I’m not sure, but it’s an experience I remember often today as I read and talk and work in a field obsessed with the “data deluge.”

The temptation to do more than what you set out to do is ever present, maybe more today than ever before. We have years’ worth of data – a lot of data – for the mammography study. When the grant proposal was written and funded, it laid out specifics regarding what analysis would be done and what questions would be answered. Five years down the road, it’s easy to see lots of other questions that could be answered with the same data. A common statement in team meetings is, “I think people want to know Y” or “Z is really important to find out.” The problem, however, is that we set out to answer X. While Y and Z may well be valuable, X is what the study was designed to answer.

[Image: “LOD Cloud Diagram as of September 2011” by Anja Jentzsch (own work), licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons]

I see a couple of issues with this scenario. First, grant money is a finite resource. In a time when practically all research operates under this funding model, people have a certain amount of time dedicated to – i.e., paid for by – a grant. If that time gets used up answering peripheral questions or going down interesting but unplanned rabbit holes, the chances of completing the initial work on time are jeopardized. As one who has seen original funded aims change over time, I can say this is frustrating. But don’t hear me saying that it’s all frustrating. On the contrary, along with the frustration can come some pretty cool work. The mini-symposium on data management that I described in earlier posts was a HUGE success for my work, but it’s not what we originally set out to do. The ends justified the means in that case, but that isn’t always what happens.

The second issue I see is one that I hear many researchers express when the topics of data sharing and data reuse are raised: data is collected a certain way to answer a certain question, and it’s managed under the same auspices. Being concerned about what another researcher will do with data that was collected for a different reason is legitimate. It’s not a concern that can’t be addressed, but it’s certainly worth noting. When I was finished with my thesis data, a couple of faculty members offered to take it and do some further research with it. There were some different questions that could be answered using the larger data set, but not without taking into account the original research question and the methods I used to collect it all. Anonymous data sharing and reuse doesn’t always preserve that context, at least not in the current climate where data citation and identification are still evolving. (All the more reason to keep working in this area.)

We have so many tools today that allow faster and more efficient data collection. We have grant projects that go on for years, making it difficult to say “no” to the new questions that come up along the way. We are inundated with data and information and resources that make it virtually impossible to focus on any one thing for any length of time.

The possibilities of science in a data-driven environment seem limitless. It’s easy to forget that some limits do, in fact, exist.