Tag Archives: data citation

Is Big Data Missing the Big Picture?

27 Apr

[Image: Forest for the Trees]

When I was defending my graduate thesis a number of years ago, I was asked by one of the faculty in attendance to explain why I had done “x” rather than “y” with my data. I stumbled for a bit until I finally said, somewhat out of frustration at not knowing the right answer, “Because that’s not what I said I’d do.” My statistics professor was also in attendance and, as I quickly tried to backtrack from my response, piped in, “That’s the right answer.”

As I’ve watched and listened to and read and been a part of so many discussions about data – data sharing, data citation, data management – over the past several years, I often find myself thinking back on that defense and my answer. More, I’ve thought of my professor’s comment: that data is collected, managed, and analyzed according to certain rules that a researcher or graduate student or any data collector decides from the outset. That’s best practice, anyway. And such an understanding always makes me wonder whether, in our exuberance over the importance, the need, the mandates, and the “sky’s the limit” views of data sharing, we forget that.

I really enjoyed the panel that the Medical Library Association put together last week for their webinar, “The Diversity of Data Management: Practical Approaches for Health Sciences Librarianship.” The panelists included two data librarians and one research data specialist: Lisa Federer of the National Institutes of Health Library, Kevin Read from New York University’s Health Sciences Library, and Jacqueline Wirz of Oregon Health & Science University, respectively. As a disclosure, I know Lisa, Kevin, and Jackie each personally and consider them great colleagues, so I guess I could be a little biased in my opinion, but putting that aside, I do feel that they each have a wealth of experience and knowledge of the topic, and it showed in their presentations and dialogue.

Listening to the kind of work and the projects that these data-centric professionals shared, it’s easy and exciting to see the many opportunities that exist for libraries, librarians, and others with an interest in data science. At the same time, I admit that I wince when I sense our “We can do this! Librarians can do anything!” enthusiasm bubble up – as occasionally occurs when we gather together and talk about this topic – because I don’t think it’s true. I do believe that individually, librarians can move into an almost limitless career field, given our basic skills in information collection, retrieval, management, preservation, etc. We are well-positioned in an information age. That said, though, I also believe that (1) there IS a difference between information and data and (2) the skills librarians have as a foundation in terms of information science don’t, in and of themselves, translate directly to the age of big data. (I’m not a fan of that descriptor, by the way. I tend to think it was created and is perpetuated by the tech industry and the media, both wishing us to believe things are simpler than they ever are.) Some librarians, with a desire and propensity towards the opportunities in data science, will find their way there. They’ll seek out the extra skills needed and they’ll identify new places and new roles that they can take on. I feel like I’ve done this myself and I know a good handful of others who’ve done the same. But can we sell it as the next big thing that academic and research libraries need to do? Years later, I still find myself a little skeptical.

Moving beyond the individual, though, I wonder if libraries and other entities within information science, as a whole, don’t have a word of caution to share in the midst of our calls for openness of data. It’s certainly the belief of our profession(s) that access to information is vital for the health of a society on every level. However, in many ways it seems that in our discussions of data, we’ve simply expanded our dedication to the principle of openness of information to include data, as well. Have we really thought through all that we’re saying when we wave that banner? Can we have a more tempered response and/or approach to the big data bandwagon?

Arguably, there are MANY valid reasons for supporting access in this area: peer review, expanded and more efficient science, reproducibility, transparency, etc. Good things, all. But going back to that lesson that I learned in grad school, it’s important to remember that data is collected, managed, and analyzed in certain ways for a reason: ways decided by the original researcher. In other words, data has context. Just like information. And like information, I wonder (and have concern for) what happens to data when it’s taken out of its original context. And I wonder if my profession could perhaps advocate this position, too, along with those of openness and sharing, if nothing more than to raise the collective awareness and consciousness of everyone in this new world. To curb the exuberance just a tad.

I recently started getting my local paper delivered to my home. The real thing. The newsprint newspaper. The one that you spread out on the kitchen table and peruse, page by page. You know what I’ve realized in taking up this long-lost activity again? When you look at a front page with articles about an earthquake in Nepal, nearby horses attacked by a bear, the hiring practices of a local town’s police force, and gay marriage, you’re forced to think of the world in its bigger context. At the very least, you’re made aware of the fact that there’s a bigger picture to see.

When I think of how information is so bifurcated today, I can’t help but ask if there’s a lesson there that can be applied to data before we jump overboard into the “put it all out there” sea. We take research articles out of the context of journals. We take scientific findings out of the context of science. We take individual experiences out of the context of the very experience in which they occur. And of course, the most obvious, we take any and every politician’s words out of context in order to support whatever position we either want or don’t want him/her to support. I don’t know about you, but each and every one of these examples seems a pretty clear reason to at least think about what can and will happen (already happens) to data if and when it suffers the same fate.

Are there reasons why librarians and information specialists are concerned with big data? Absolutely! I just hope that our concern also takes in the big picture.

 

Repeat After Me

22 Aug

[Image: Reproduction]

“Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method and relies on ceteris paribus.” (Wikipedia)

I was going to start this post with a similar statement in my own words, but couldn’t resist the chance to quote Latin. It always makes you sound so smart. But regardless of whether these are a Wikipedia author’s words or my own, the point is the same – one of the foundations of good science is the ability to reproduce the results.

My work for the neuroimaging project involves developing a process for researchers in this field to cite their data in such a way that makes their work more easily reproducible. The current practice of citing whole data sets doesn’t always make reproducibility possible. A researcher might take different images from a number of different data sets to create an entirely new data set, in which case citing the previous sets in their entirety doesn’t tell exactly which images were used. That gap can make the final research harder to replicate, as well as more difficult to review. We think that we may have a way to help fix this problem, and that’s what I’ve been working on for the past few months.
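To make that gap concrete, here is a minimal, entirely hypothetical sketch (in Python) of what image-level provenance for a derived data set might look like. The DOIs, image IDs, and field names are invented for illustration and are not drawn from the project or from any actual neuroimaging repository; the point is simply that each image in the new set carries a pointer back to its source set and a fingerprint of its own content.

```python
# Hypothetical sketch only: the DOIs, image IDs, and field names below are
# invented for illustration; they do not come from any real repository.
import hashlib
import json

def image_record(source_dataset_doi, image_id, image_bytes):
    """Describe one image pulled out of a source data set."""
    return {
        "source_dataset": source_dataset_doi,               # citation for the parent set
        "source_image_id": image_id,                        # which image within that set
        "sha256": hashlib.sha256(image_bytes).hexdigest(),  # fingerprint of the image itself
    }

# A derived data set is then just a manifest of such records.
derived_manifest = {
    "title": "Example derived set (illustration only)",
    "images": [
        image_record("doi:10.0000/exampleA", "sub-01_T1w", b"...image bytes..."),
        image_record("doi:10.0000/exampleB", "sub-07_T1w", b"...image bytes..."),
    ],
}

print(json.dumps(derived_manifest, indent=2))
```

With a manifest along these lines, a citation of the derived set could, at least in principle, tell a reviewer exactly which images it contains and where each one came from.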

At the same time, I’ve been working on a systematic review with the members of the mammography study team. This work has me locating and reading and discussing a whole slew of articles about the use of telephone call reminders to increase the rate of women receiving a mammogram within current clinical guidelines. It also has me wondering about the nature of clinical research and the concept of reproducible science, for in all of my work, I’ve yet to come across any two studies that are exactly alike. In other words, it doesn’t seem to be common practice for anyone to repeat anyone else’s study. And I can’t help but wonder why this is so.

I imagine it has something to do with funding. Will a funding agency award money to a proposal that seeks to repeat something – something unoriginal? Surely they are more apt to fund new ideas.

Maybe it has to do with scientific publishing. Like funding agencies, publishers probably much prefer to publish new ideas and new findings. Who wants to read an article that says the same thing as one they read last year?

Of course, it may also be that researchers look to improve on previous studies, rather than simply repeat them. This is what I see in all of the papers I’ve found for this particular systematic review. The methods are tweaked from study to study; the populations differ just a bit, the length of time varies, etc. It makes sense. The goal of this body of research is to determine what intervention works the best and in changing things slightly, you might just find the answer. What has me baffled about this process, though, is that as we continue to tweak this aspect or that aspect of a study’s methodology, when and/or how do we ever discover what aspect actually works and then put it into practice? 

Working on this particular review, I’ve collected easily 50+ relevant articles, yet as we pull them together – consolidate them to discover any conclusions – the task seems, at times, impossible. Too often, despite the relevancy of the articles to the question asked, what you really end up comparing is apples to oranges. How does this get to the heart of scientific discovery? How does it influence or generate “best practice”? I can’t help but wonder.

Yesterday, during my library’s monthly journal club, we discussed an article that had been recommended to me by one of the principal investigators on the mammography study. “How to Read a Systematic Review and Meta-analysis and Apply the Results to Patient Care” is the latest Users’ Guide on the subject from the Journal of the American Medical Association (JAMA). It prompted a lively session about everything from how research is done, to how medical students are taught to read the literature, to how the media portrays medical news. I recommend it.

Of course, there are many explanations for my question and many factors at play. My wondering and our journal club discussion don’t afford any concrete solution or answer; still, I feel it’s a worthwhile topic for medical librarians to think about. If you have any thoughts, please keep the discussion going in the comments section below.

Back to the Starting Square

2 May

One of my favorite singer-songwriters is Lucy Wainwright Roche. Fans of folk music who don’t know Lucy may well know her familiar last names. The daughter of Suzzy Roche and Loudon Wainwright III, she comes by her musical gifts honestly. One of my favorites of her songs is “Starting Square.” It’s a song about seeing an old love again and taking note of the changes that happen after relationships end. That’s my take, anyway. And it’s summed up in the line,

I can tell you can tell it from there
That I may have been everywhere
But I’m back
Back to the starting square

Enjoy Lucy singing it.

I may not have been everywhere in the first round of informationist work, but as I met with the principal investigator of my latest grant-funded project this week, I did feel like I was back at square one. This latest project is really very different from the mammography study that I’ve worked on for the past couple of years. This supplemental grant is to provide informationist services to the larger grant entitled, “A Knowledge Environment for Neuroimaging in Child Psychiatry.” Our ultimate goal (and there are more than a few steps to take before we’ll get there) is “to establish best practices and standards around data sharing in the discipline of neuroinformatics so that it becomes possible to generate accurate, easy to obtain quantitative metrics that give credit to the original source of data.” In short, it’s a project that will hopefully deliver a means for researchers to cite their data, both to support data sharing and to make the science reproducible. I’ll work on determining the proper level of identification for neuroimages, the best identifier for the images (is it a DOI?), and the most efficient means of organizing and naming new data sets that are derived from bits and pieces of multiple other data sets.
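As a thought experiment only (and not the project’s actual approach), here is one way a stable name for a derived data set could be tied to exactly which images it contains: derive the identifier from the images’ content fingerprints, so that the same collection of images always yields the same name, whatever repository or DOI it ultimately lives under. Everything in the sketch, including the fingerprints themselves, is made up.

```python
# Thought experiment only: a content-based identifier for a derived data set.
# Nothing here reflects the project's actual method; the fingerprints are fake.
import hashlib

def derived_set_id(image_fingerprints):
    """Fold per-image fingerprints (e.g., SHA-256 hashes) into one stable ID.

    Sorting first means the ID depends only on which images are present,
    not on the order in which they were added.
    """
    digest = hashlib.sha256()
    for fingerprint in sorted(image_fingerprints):
        digest.update(fingerprint.encode("utf-8"))
    return "derived-" + digest.hexdigest()[:16]

# Made-up fingerprints; the same images in any order give the same identifier.
print(derived_set_id(["a3f1", "0c9d", "77be"]))
print(derived_set_id(["77be", "a3f1", "0c9d"]))
```

Whether something like this, a DOI at some level of granularity, or a different scheme altogether turns out to be the right answer is, of course, part of what the project is meant to work out.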

During our first meeting, the PI showed me a whole bunch of really interesting websites and told me of many interesting projects happening in this area (directly and tangentially). I came back to my desk and promptly created a new folder of bookmarks for this work. So now… I’m back to the starting square. I’ve got a mountain of stuff to read and watch and become familiar with. It’s like the first day of class. The first assignments. And I need a new notebook!

I include a few of the resources below, if you’re interested in the topic and want to play a little catch-up, too. Enjoy!