Tag Archives: data sharing

Is Big Data Missing the Big Picture?

27 Apr


When I was defending my graduate thesis a number of years ago, I was asked by one of the faculty in attendance to explain why I had done “x” rather than “y” with my data. I stumbled for a bit until I finally said, somewhat out of frustration at not knowing the right answer, “Because that’s not what I said I’d do.” My statistics professor, who was also in attendance, piped in as I quickly tried to backtrack from my response: “That’s the right answer.”

As I’ve watched and listened to and read and been a part of so many discussions about data – data sharing, data citation, data management – over the past several years, I often find myself thinking back on that defense and my answer. More, I’ve thought of my professor’s comment: that data is collected, managed, and analyzed according to certain rules that a researcher or graduate student or any data collector decides from the outset. That’s best practice, anyway. And such an understanding always makes me wonder if, in our exuberance to claim the importance, the need, the mandates, and the “sky’s the limit” views of data sharing, we don’t forget that.

I really enjoyed the panel that the Medical Library Association put together last week for their webinar, “The Diversity of Data Management: Practical Approaches for Health Sciences Librarianship.” The panelists included two data librarians and one research data specialist: Lisa Federer of the National Institutes of Health Library, Kevin Read from New York University’s Health Sciences Library, and Jacqueline Wirz of Oregon Health & Science University, respectively. As a disclosure, I know Lisa, Kevin, and Jackie personally and consider them great colleagues, so I guess I could be a little biased in my opinion, but putting that aside, I do feel that they each have a wealth of experience and knowledge in the topic, and it showed in their presentations and dialogue.

Listening to the kind of work and the projects that these data-centric professionals shared, it’s easy and exciting to see the many opportunities that exist for libraries, librarians, and others with an interest in data science. At the same time, I admit that I wince when I sense our “We can do this! Librarians can do anything!” enthusiasm bubble up – as occasionally occurs when we gather together and talk about this topic – because I don’t think it’s true. I do believe that individually, librarians can move into an almost limitless career field, given our basic skills in information collection, retrieval, management, preservation, etc. We are well-positioned in an information age. That said, though, I also believe that (1) there IS a difference between information and data and (2) the skills librarians have as a foundation in terms of information science don’t, in and of themselves, translate directly to the age of big data. (I’m not a fan of that descriptor, by the way. I tend to think it was created and is perpetuated by the tech industry and the media, both of which would have us believe things are simpler than they ever are.) Some librarians, with a desire and propensity towards the opportunities in data science, will find their way there. They’ll seek out the extra skills needed and they’ll identify new places and new roles that they can take on. I feel like I’ve done this myself and I know a good handful of others who’ve done the same. But can we sell it as the next big thing that academic and research libraries need to do? Years later, I still find myself a little skeptical.

Moving beyond the individual, though, I wonder if libraries and other entities within information science, as a whole, don’t have a word of caution to share in the midst of our calls for openness of data. It’s certainly the belief of our profession(s) that access to information is vital for the health of a society on every level. However, in many ways it seems that in our discussions of data, we’ve simply expanded our dedication to the principle of openness of information to include data, as well. Have we really thought through all that we’re saying when we wave that banner? Can we have a more tempered response and/or approach to the big data bandwagon?

Arguably, there are MANY valid reasons for supporting access in this area: peer review, expanded and more efficient science, reproducibility, transparency, etc. Good things, all. But going back to that lesson that I learned in grad school, it’s important to remember that data is collected, managed, and analyzed in certain ways for a reason: things decided by the original researcher. In other words, data has context. Just like information. And like information, I wonder (and have concern for) what happens to data when it’s taken out of its original context. And I wonder if my profession could perhaps advocate this position, too, along with those of openness and sharing, if for nothing more than to raise the collective awareness and consciousness of everyone in this new world. To curb the exuberance just a tad.

I recently started getting my local paper delivered to my home. The real thing. The newsprint newspaper. The one that you spread out on the kitchen table and peruse, page by page. You know what I’ve realized in taking up this long-lost activity again? When you look at a front page with articles about an earthquake in Nepal, nearby horses attacked by a bear, the hiring practices of a local town’s police force, and gay marriage, you’re forced to think of the world in its bigger context. At the very least, you’re made aware of the fact that there’s a bigger picture to see.

When I think of how information is so bifurcated today, I can’t help but ask if there’s a lesson there that can be applied to data before we jump overboard into the “put it all out there” sea. We take research articles out of the context of journals. We take scientific findings out of the context of science. We take individual experiences out of context of the very experience in which they occur. And of course, the most obvious, we take any and every politician’s words out of context in order to support whatever position we either want or don’t want him/her to support. I don’t know about you, but each and every one of these examples appears as a pretty clear reason to at least think about what can and will happen (already happens) to data if and when it suffers the same fate.

Are there reasons why librarians and information specialists are concerned with big data? Absolutely! I just hope that our concern also takes in the big picture.

 

Repeat After Me

22 Aug


Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method and relies on ceteris paribus. Wikipedia

I was going to start this post with a similar statement in my own words, but couldn’t resist the chance to quote Latin. It always makes you sound so smart. But regardless of whether these are a Wikipedia author’s words or my own, the point is the same – one of the foundations of good science is the ability to reproduce the results.

My work for the neuroimaging project involves developing a process for researchers in this field to cite their data in such a way that makes their work more easily reproducible. The current practice of citing data sets alone doesn’t always make reproducibility possible. A researcher might take different images from a number of different data sets to create an entirely new data set, in which case citing the previous sets in whole doesn’t tell exactly which images are being used. Thus, this gap can make the final research harder to replicate, as well as more difficult to review. We think that we may have a way to help fix this problem and that’s what I’ve been working on for the past few months.
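To make the idea concrete, here is a minimal sketch, in Python, of what a derived data set with image-level provenance might look like. Everything in it – the field names, the identifier format, the example DOIs and image names – is hypothetical and just for illustration; it isn’t the scheme we’re actually building, but it shows why image-level records make a derived data set citable (and reproducible) in a way that whole-data-set citations are not.

```python
# Hypothetical sketch: a "derived data set" that keeps a per-image record of where
# each image came from, so a citation can point to the exact images used rather than
# to entire source data sets. Names and identifiers are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ImageRecord:
    image_id: str          # identifier of the individual image
    source_dataset: str    # identifier (e.g., a DOI or accession) of the data set it came from


@dataclass
class DerivedDataset:
    name: str
    images: List[ImageRecord] = field(default_factory=list)

    def citation_manifest(self) -> Dict[str, List[str]]:
        """Group the borrowed images by their source data set -- the information
        a granular data citation would need to convey."""
        manifest: Dict[str, List[str]] = {}
        for img in self.images:
            manifest.setdefault(img.source_dataset, []).append(img.image_id)
        return manifest


# Example: a new data set built from pieces of two existing ones.
derived = DerivedDataset(
    name="study-42-derived",
    images=[
        ImageRecord("sub-001_T1w", "doi:10.xxxx/dataset-A"),
        ImageRecord("sub-017_T1w", "doi:10.xxxx/dataset-A"),
        ImageRecord("sub-103_T1w", "doi:10.xxxx/dataset-B"),
    ],
)
print(derived.citation_manifest())
# {'doi:10.xxxx/dataset-A': ['sub-001_T1w', 'sub-017_T1w'], 'doi:10.xxxx/dataset-B': ['sub-103_T1w']}
```

The point is simply that every borrowed image carries its source with it, so a reviewer or a replicating researcher can reconstruct exactly what went into the new set.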

At the same time, I’ve been working on a systematic review with the members of the mammography study team. This work has me locating and reading and discussing a whole slew of articles about the use of telephone call reminders to increase the rate of women receiving a mammogram within current clinical guidelines. It also has me wondering about the nature of clinical research and the concept of reproducible science, for in all of my work, I’ve yet to come across any two studies that are exactly alike. In other words, it doesn’t seem to be common practice for anyone to repeat anyone else’s study. And I can’t help but wonder why this is so.

I imagine it has something to do with funding. Will a funding agency award money to a proposal that seeks to repeat something, something unoriginal? Surely they are more apt to fund new ideas.

Maybe it has to do with scientific publishing. Like funding agencies, publishers probably much prefer to publish new ideas and new findings. Who wants to read an article that says the same thing as one they read last year?

Of course, it may also be that researchers look to improve on previous studies, rather than simply repeat them. This is what I see in all of the papers I’ve found for this particular systematic review. The methods are tweaked from study to study: the populations differ just a bit, the length of time varies, etc. It makes sense. The goal of this body of research is to determine which intervention works best, and in changing things slightly, you might just find the answer. What has me baffled about this process, though, is that as we continue to tweak this aspect or that aspect of a study’s methodology, when and/or how do we ever discover what aspect actually works and then put it into practice?

Working on this particular review, I’ve collected easily 50+ relevant articles, yet as we pull them together – consolidate them to discover any conclusions – the task seems, at times, impossible. Too often, despite the relevance of the articles to the question asked, what you really end up comparing is apples to oranges. How does this get to the heart of scientific discovery? How does it influence or generate “best practice”? I can’t help but wonder.

Yesterday, during my library’s monthly journal club, we discussed an article that had been recommended to me by one of the principal investigators on the mammography study. “How to Read a Systematic Review and Meta-analysis and Apply the Results to Patient Care” is the latest Users’ Guide on the subject from the Journal of the American Medical Association (JAMA). It prompted a lively session about everything from how research is done, to how medical students are taught to read the literature, to how the media portrays medical news. I recommend it.

Of course, there are many explanations for my question and many factors at play. My wondering and our journal club discussion don’t afford any concrete solution and/or answer, but I still feel it’s a worthwhile topic for medical librarians to think about. If you have any thoughts, please keep the discussion going in the comments section below.

All of the Data that’s Fit to Collect

28 Jul

My graduate thesis in exercise physiology involved answering a research question that required collecting an awful lot of data before I had enough for analysis. I was comparing muscle fatigue in males and females, and in order to do this I had to find enough male-female pairs that matched for muscle volume. I took skin fold measurements and calculated the muscle volume of about 150 thighs belonging to men and women on the crew teams of Ithaca College. Out of all of that, I found 8 pairs that matched. It was hardly enough for grand findings, but it was enough to do the analysis, write my thesis, successfully defend it, and earn my degree. After all, that’s what research at this level is all about, i.e. learning how to put together a study and carry it all the way through to completion.

During my defense, one of my advisers asked, “With all of that data, you could have answered ___, too. Why didn’t you?” I hemmed and hawed for a bit, before finally answering, “Because that’s not what I said that I was going to do,” an answer that my statistics professor, also in attendance, said was the right answer. Was my adviser trying to trick me? I’m not sure, but it’s an experience that I remember often today when I read and talk and work in a field obsessed with the “data deluge.”

The temptation to do more than what you set out to do is ever present, maybe more today than ever before. We have years’ worth of data – a lot of data – for the mammography study. When the grant proposal was written and funded, it laid out specifics regarding what analysis would be done and what questions would be answered. Five years down the road, it’s easy to see lots of other questions that can be answered with the same data. A common statement made in the team meetings is, “I think people want to know Y” or “Z is really important to find out.” The problem, however, is that we set out to answer X. While Y and Z may well be valuable, X is what the study was designed to answer.


“LOD Cloud Diagram as of September 2011” by Anja Jentzsch – Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons

I see a couple of issues with this scenario. First, grant money is a finite resource. In a time when practically all research operates under this funding model, people have a certain amount of time dedicated, i.e. paid for, by a grant. If that time gets used up answering peripheral questions or going down interesting, but unplanned, rabbit holes, the chances of completing the initial work on time are jeopardized. As one who has seen my original funded aims change over time, I know this can be frustrating. And don’t hear me saying that it’s all frustrating. On the contrary, along with the frustration can come some pretty cool work. The mini-symposium on data management that I described in earlier posts was a HUGE success for my work, but it’s not what we originally set out to do. The ends justified the means, in that case, but this isn’t always what happens.

The second issue I see is one that I hear many researchers express when the topics of data sharing and data reuse are raised, i.e. data is collected a certain way to answer a certain question. Likewise, it’s managed under the same auspices. Being concerned about what another researcher will do with data that was collected for a different reason is legitimate. It’s not a concern that can’t be addressed, but it’s certainly worth noting. When I was finished with my thesis data, a couple of faculty members offered to take it and do some further research with it. There were some different questions that could be answered using the larger data set, but not without taking into account the original research question and the methods I used to collect all of it. Anonymous data sharing and reuse doesn’t always afford that context, at least not in the current climate where data citation and identification are still evolving. (All the more reason to keep working in this area.)

We have so many tools today that allow faster and more efficient data collection. We have grant projects that go on for years, making it difficult to say “no” to asking new questions of the same project as they come up along the way. We are inundated with data and information and resources that make it virtually impossible to focus on any one thing for any length of time.

The possibilities of science in a data-driven environment seem limitless. It’s easy to forget that some limits do, in fact, exist.

Back to the Starting Square

2 May

One of my favorite singer-songwriters is Lucy Wainwright Roche. Fans of folk music who don’t know Lucy may well know her familiar last names. The daughter of Suzzy Roche and Loudon Wainwright III, she comes honestly to her musical gifts. One of my favorites of her songs is “Starting Square.” It’s a song about seeing an old love again and taking note of the changes that happen after relationships end. That’s my take, anyway. And it’s summed up in the line,

I can tell you can tell it from there
That I may have been everywhere
But I’m back
Back to the starting square

Enjoy Lucy singing it.

I may not have been everywhere in the first round of informationist work, but as I met with the principal investigator of my latest grant-funded project this week, I did feel like I’m back at square one. This latest project is really very different from the mammography study that I’ve worked on for the past couple of years. This supplemental grant is to provide informationist services to the larger grant entitled, “A Knowledge Environment for Neuroimaging in Child Psychiatry.” Our ultimate goal (and there are more than a few steps to take before we’ll get there) is “to establish best practices and standards around data sharing in the discipline of neuroinformatics so that it becomes possible to generate accurate, easy to obtain quantitative metrics that give credit to the original source of data.” In short, it’s a project that will hopefully deliver a means for researchers to cite their data for both the purpose of data sharing and to make the science reproducible. I’ll work on determining the proper level of identification for neuroimages, the best identifier for the images (is it a DOI?), and the most efficient means of organizing and naming new data sets that are derived from bits and pieces of multiple other data sets.
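As a way of thinking about that last piece – naming data sets assembled from bits and pieces of other data sets – here is one possible approach sketched in Python: derive a deterministic name from the sorted list of member image identifiers, so that the same collection of images always produces the same handle. This is only an illustration of the kind of scheme that might be considered, not the project’s actual design, and the identifiers in the example are made up.

```python
# Hypothetical sketch: name a derived data set by fingerprinting the sorted list of
# member image identifiers. Two researchers who assemble the same images get the same
# name, which helps with reproducibility and de-duplication. Identifiers are made up.
import hashlib
from typing import List


def derived_dataset_fingerprint(image_ids: List[str]) -> str:
    """Return a short, order-independent fingerprint for a collection of images."""
    canonical = "\n".join(sorted(image_ids))  # canonical ordering makes the hash stable
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:12]  # short handle; the full hash could sit behind a DOI or other identifier


images = ["datasetA/sub-001_T1w", "datasetA/sub-017_T1w", "datasetB/sub-103_T1w"]
print("derived-" + derived_dataset_fingerprint(images))
```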

During our first meeting, the PI showed me a whole bunch of really interesting websites and told me of many interesting projects happening in this area (directly and tangentially). I came back to my desk and promptly created a new folder of bookmarks for this work. So now… I’m back to the starting square. I’ve got a mountain of stuff to read and watch and become familiar with. It’s like the first day of class. The first assignments. And I need a new notebook!

I include a few of the resources below, if you’re interested in the topic and want to play a little catch up, too. Enjoy!

Two and Two and Two: Making Connections

24 Oct

Two meetings with two principal investigators about two grant proposals over two days led me to two observations and thoughts about the state of our profession and the work that we do:

1. Is the library a silo, too?

We speak a good bit in the profession about how often those that we serve, our patrons, live and work in silos. Scientists do research in specific areas. Departments treat diseases within a specialized field. Administrators make decisions within the context of the top level that they know best. It’s very common. And it makes us quite frustrated because the reality is that we rarely function in a world that doesn’t (or couldn’t) benefit from other areas, if only we knew about them. However, “Nobody knows what I do!” is a common cry not just from librarians, but across the board. Is this perhaps a glimpse that we, like our patrons, are living in a silo that we’ve created for ourselves?

Yesterday, I sat down with a researcher to do some work on the proposal that we’re submitting for the next round of informationist grants from the National Library of Medicine. It is an absolutely fantastic project and each time that I come away from talking with Dr. Kennedy, I can’t help but think how refreshing it is to speak with a researcher who knows as much, no, more than I do about the issues related to data sharing. Turns out that he’s internationally known as a proponent of data sharing in his field (neuroimaging), leading projects and initiatives and working groups and all sorts of attempts at advocating among his peers for the necessity of this practice. It is by chance – pure chance – that our paths crossed and that this crossing led us to work on the grant proposal together. You see, he knew of the RFP for the informationist supplement grants because of his connections to colleagues at the National Library of Medicine. I happened to give a talk at one of his lab’s meetings a while back on an unrelated topic and he noted that the title in my signature line includes “informationist.” Thus, he asked me what this meant, what I did, what I was doing related to the supplement awards, and if I’d be interested in helping him on a project idea that he had. This is how we came to yesterday.

What I want to point out, however, is that Dr. Kennedy came across this information with no connection to the library. He learned of it from a colleague at the National Library of Medicine, yet that colleague, evidently, didn’t think to point him to his library as a place to find an informationist. 

Are you following me?

There’s a chat happening on the MEDLIB-l listserv today (and other days and in other circles of our profession, too, of course) regarding our name, i.e. should we incorporate “knowledge” into our job titles, use it in some form instead of “library” to describe our workplace, etc. I’m not going to get into that discussion here, but I bring it up because a consistent thread in these discussions is that if our patrons don’t know the value of the library, then we are evidently doing something wrong in our work.

To this I say, “Yes and no.” 

Yes, sometimes we haven’t done the best job of getting out and letting people know how we can build partnerships, collaborate on research projects, embed ourselves in curricula, teach classes on a variety of relevant subjects, and much more. Our history is as a passive profession. For years and years and years we were able to meet our patrons here, in the library. They had to come to us to use our resources. Once here, they made the association that librarians were important because libraries had resources. But those days have been gone for decades now and we haven’t always been the best at getting out and helping people associate us less with the library and more with our skills. WE are the resource that we really need to save now, not the library or the journal collections or the subscription to UpToDate. We cost the administration more than those other resources, so we had best be able to prove that we are the resource worth keeping the next time the forced budget cuts come along.

But I also say no to the belief that if people don’t know our value, we’re doing something wrong. I’ve done a ton of right things over the past year and a half as an embedded informationist that have led me to all sorts of fantastic new opportunities, yet it was still only by chance that I discovered someone right here on my very campus who has been working on and advocating for many of the same things we’ve been talking about here in the Library. We work in different worlds, all of us, and despite the forward strides and promise of networked science, it remains so darned difficult to make all of the connections that could ultimately lead to better work, e.g. better science, medicine, and information management. Work that would prove our value.

To me, that realization really hit home yesterday when I thought about how someone who works for the National Library of Medicine, the funding agency behind these informationist grants (the National LIBRARY of Medicine), didn’t associate the library with those awards. I don’t say that out of any place of judgment, either. Well… maybe a little, but the truth is that there’s no point in judging and/or blaming and/or pointing fingers. It is simply our reality. We all live and work in some degree of a silo, but if we want to be associated with value, we need to be valuable. Visible and valuable. Both.

2. “You have a unique skill that only a handful of people on this campus have.”

I was told this today by another principal investigator as we discussed the rewriting of another grant proposal. The skill she refers to is my knowledge of how to use and leverage social media for all sorts of positive things. Her point was that when you have something that few others have, you’ve got to use it. Social media is trendy in medical research today, but few medical researchers actually use social media. They want the money to do the research, yet don’t have the expertise in the products to know how to use them effectively. Thus, when you do have the expertise, you have value. Research teams need you on their team. This is terrific!

Yet I felt myself hesitating at the thought that as a librarian, the skill I would bring to a research team lies in social media. Is that a librarian skill? As we talked though, the researcher described to me how knowing the social media tools and the social media landscape affords you the skill of knowing better how to collect and manage the data that’s generated from the use of these tools. Novices don’t have that. And data management… now THAT is a skill that the library is clamoring to get into. But even for me and my “out of the library box” thinking, making this connection took a few minutes. Even for me! 

It surprised me, but I wonder if, as we break out of our silos and work closely with others, one of the things that gets blurry is the answer to the question, “Who knows what?” What are librarians supposed to know? What are researchers supposed to know? What do doctors know? And who does what? I think that it’s this vagueness that makes us argue over (or more politely, discuss) what we call ourselves, what services we provide, and what our value really is. Silos and walls keep us separated, but they also keep us neat and orderly. We say that they need to go. Are we ready for the flood of uncertainty that all the mixing up to come will bring?

National Preparedness Month was last month, but you can still celebrate it today.
