Tag Archives: data sharing

Finding Hope in a Haystack

13 Sep

[Before I was a librarian, before I was an exercise physiologist, I was a minister. I was recently asked, after many years, to serve as a guest preacher one Sunday. I would usually share this on my non-library-related blog, but as the subject came from my day-to-day work in scholarly communications and data, I thought some followers of this blog might find it of interest.

The scripture referenced is Proverbs 1:20-33. The sermon was delivered on September 12, 2021 at First Baptist Church, Worcester, MA. A recording of the service can be found on the Church’s website and/or Facebook page.]

When Brent first asked me if I would consider giving a sermon here this morning, I thanked him, told him I was very touched and humbled that he’d consider me, but said I didn’t think it was a very good idea. I haven’t preached a sermon in a very long time. Eight years, to be exact. I’ve not been to church in a good while, either. I struggle with reconciling a lot of the things I once believed about God and faith and Christianity, with what I believe about the world today. I struggle with Church. Capital “C” church. The institution of it. Organized religion. I shared all of this with Brent and he, in his very pastoral way, assured me that all of this is okay. And he convinced me that perhaps, just perhaps, I might still have something to share from this place.

Once I said yes, of course, I was stuck. I’ve been a librarian for a very long time. Now I found myself struggling over different things – wondering what the heck I do in my day-to-day life that I could possibly translate into something you might find relatable, meaningful, or especially inspiring for a sermon. But then, lying in bed one night a couple weeks back, I got to looking at the books in my bedroom and I thought of something I’d read in a book called Living in Data: A Citizen’s Guide to a Better Information Future by the engineer, artist, computer programmer, National Geographic Explorer, and really wonderful storyteller, Jer Thorp. I highly recommend this book and I’ll reference it throughout these thoughts this morning. But lying in bed, I thought of the opening to the second chapter of the book. It reads:

Open the window and let the words in. Let them flow into the room in a stream, all of the words, hundreds of thousands of them, let them fill the space, let them hang in the air, tiny sparkling motes of language.

And the next morning I got up and counted the number of books in my room.

There are 318 books in my bedroom. It’s not an unruly number – really! – and while there are a few stacks of them by my bed and on a dresser, most are neatly arranged in a bookcase and on a couple of shelves on the wall. The shortest one is 13 pages long. A Prairie Dog’s Life.  The longest is a 2,000-plus page anthology of English literature. They probably average out though, as a whole, to around 300 pages. Estimating about 350 words to a page, that’s 33,390,000 words hanging out in my room. And this is just my bedroom. I could add to it all of the books in my home, in my office, in my Little Free Library outside of my house. And then multiply all of those out. Think about how many ideas these words generate; how many characters, real and imagined; how many interactions; how many emotions; how many more words they give rise to. They expand and expand and expand. Infinite. 

Did you know that the word “data” comes from Latin, where it meant “a thing given, a gift delivered or sent”? Early in its appearance in the English language, it was tied to the fields of mathematics and theology. All the way back in 1614, a clergyman named Thomas Tuke called the Sacraments Data. With a capital “D”. Divinely given.

Today, of course, we think of data as numbers, words, bits and bytes, stuff collected in notebooks or spreadsheets, crunched by computers, analyzed and visualized. We don’t think of it as divine. 

Or do we?

I receive a weekly newsletter from Educause, a nonprofit association whose mission is “to advance higher education through the use of information technology”. In an article last week, I read this sentence:

From improving student success to forming optimal strategies that can maximize corporate and foundational relationships, data analytics is now higher education’s divining rod.

Interesting description, wouldn’t you say? Divining, dowsing, doodlebugging – that pseudoscience where a stick – a divining rod – leads you to water, the biological requirement for life. Data analytics is now perceived as what will lead us to our life source.

That sure makes data sound divine to me. It sounds just like the thing that’s going to save us. And it’s hardly just slick marketing for a certain college major or field of study. There’s real evidence all around us of the downright amazing – some might say miraculous – outcomes of harnessing big data.

The National Center for Biotechnology Information, NCBI, is part of the National Library of Medicine at the U.S. National Institutes of Health. One of the many things that NCBI does, as part of the Library of Medicine, is build, host, and manage a series of biomedical databases, including PubMed, one of the world’s largest bibliographic databases – a free resource of citations and abstracts from the life sciences and biomedical literature. Since November 17, 2019, when the first case of the novel coronavirus disease we now know so well as COVID-19 was reported (just shy of 22 months ago), 175,593 peer-reviewed, published research articles on COVID-19 have been indexed in PubMed. The full-text versions of more than 200,000 articles are freely available via the public access site, PubMed Central. About 1.4 million nucleotide records have been uploaded and made available, along with a million sequence-related records. 317 articles on COVID-19 have been written by researchers just down the road at UMass Medical School, where I work. We house the full text of these in my library’s institutional repository and as of yesterday morning, those 317 papers had been downloaded more than 30,000 times by people all over the world.

It is this unprecedented open sharing of information and data that allowed us to watch science unfold over the past year at a pace hardly ever seen. The biomedical research community, worldwide, developed multiple vaccines to fight COVID with an effectiveness unheard of before. These vaccines were developed, tested, trialed, and delivered in a 12-month time period. Amazing. Miraculous. Divine?

There is no doubt that many see the hand of God at work in all of this. If you believe that God grants us gifts – skills and knowledge and wisdom – to create and use all of these towards the betterment of the world, then yes, divine. Data is divine. A gift from God.

But. That’s a bit easy, isn’t it? A bit simple. 

In his book, The Promise of Access: Technology, Inequality, and the Political Economy of Hope, Daniel Greene defines and traces what he calls “the access doctrine,” a belief born out of the technology boom of the 1990s, where it seemed almost common sense that all one really needed to enter into the new information economy was access – access to technology (think the “laptop for every child” programs), access to the Internet (think broadband expansion), access to tech education (think charter schools and diploma programs with a hard focus on students’ use of and proficiency in different types of software and technology). Public libraries in particular have played a big role in propagating this promise. Holding true to their own belief that they exist to freely provide access to information, they were one of the first institutions to make computers freely available to the public.

Unfortunately, as Greene describes in his book, this promise has fallen short. Technology is offered as a simple solution to the vastly complex problem of inequality. But like data, it’s a simple sell. And this simple selling of information and technology and data as some kind of commonsense cure to everything is powerful. It IS power. Much in the same way that the selling of a simplified idea of God or of faith or of religion is power.

Take Joel Osteen and Bill Gates: you may have vastly different opinions of these two powerful men, but they are strikingly similar in that each preaches a simple message with unwavering conviction. For Osteen, it’s that a belief in God will grant one every bit of peace and prosperity. For Gates, it’s the belief that every problem – from access to vaccines to climate change – can be solved via some form of technological invention – or intervention – almost always one that will generate the data, the information, the knowledge, and ultimately, the solution. It’s a great hope.

And data as a great hope starts to sound an awful lot like what we think of as religious faith. It holds in our minds and in our hearts this unfettered sense, this belief, that somehow, someway, somewhere within it is the key. The solution. The answer. To everything. If we can only write the right algorithm, if we can only spot the trends, the patterns, then what we once didn’t know, well, now we will. It harks right back to those Latin origins: data as something out there already – just like God – something given, something true. We just have to see it and recognize it. The truth that is already there.

Simple. Powerful. Comforting, even.

But what both ideas, data and religious faith, leave out is a central and crucial point: that they are humanly constructed. As an aside, I’m not positing that God is a human construction. That’s an entirely different argument. But faith – what people believe and, by extension, how they act on those beliefs – is certainly all tied up in the limits of what we can and do construct. Just like data.

If you return to NCBI’s SARS-CoV-2 resources web page, the site where I found many of those numbers on publications and genome sequence runs that I mentioned a few minutes ago, you’ll find a link to a resource called LitCovid, “a curated literature hub for tracking up-to-date scientific information about the 2019 novel Coronavirus.” There’s a chart that shows how many publications are added weekly to the database and there’s also a map of the world, shaded to show the countries mentioned in the abstracts of all of these publications. Darker shading means more mentions. No shading means none. The United States and China stand out as the darkest blue. Most of the rest of the world is a slightly lighter shade, but there are some noticeable blank spots – Central America, a few countries in South America, and a large swath of Central and Western Africa. Is it that no one has COVID in those places? No. Do they lack the expertise and resources for scientific research? In some cases, definitely yes. But why haven’t those with the expertise and resources focused their research on the people in these parts of the world? There are many answers to this question, of course. It’s complex. But it clearly highlights the flaw in that belief that data is some objective, unbiased entity that need only be collected and curated and analyzed to bring us the solutions to our problems.

There is a chapter in Living in Data called “Data’s Dark Matter” and in it, Jer Thorp tells another story that highlights the limitations of data, even when one is trying their very best to avoid them. In 2009, he wrote a pair of algorithms to determine the placement of the almost 3,000 names of those killed in the 9/11 attacks on the World Trade Center – what would be a significant part of the 9/11 Memorial. The designers were seeking what they called “meaningful adjacencies” – people related to one another, people who worked together. This is a mathematical problem that I cannot begin to fathom solving, let alone actually solve. But Thorp did – at least to some degree. He admits his own shortcomings – or better put, the shortcomings of any data-driven solution – in this story:

Even in the meaningful adjacencies that my algorithm dutifully satisfied, there is much missing. Mohammad Salman Hamdani was a Pakistani American scientist and NYPD cadet who, like so many other first responders, rushed to the scene on September 11 determined to help. Like so many other first responders, he was killed. Hamdani’s name is inscribed on a parapet on the south pool of the memorial, on the very last panel dedicated to the victims who were killed in World Trade Center South. The algorithm placed Hamdani there, in part because there were no meaningful adjacencies recorded, no other names indicated by the data set for his name to sit beside. Why was this man, a police officer in training, not placed alongside the other first responders? According to memorial officials, Hamdani was not included with the other police officers because he wasn’t on active duty, an explanation that sits at odds with the fact that he was given a police funeral with full honors by the NYPD. We can find a more likely answer in a headline from the “New York Post” on October 12, 2001: “Missing – or Hiding? – Mystery of NYPD Cadet from Pakistan.”

Thorp’s algorithm could only run on the data he was provided. Data constructed by humans – from human stories, human reports, human experiences, and human biases. Sadly, but truthfully, also from human hatred, human fear, and human denial.

The artist and data scientist Mimi Onuoha created a mixed-media installation in 2016 entitled The Library of Missing Datasets. It is a white file cabinet filled with labeled, yet empty, file folders. In her artist’s statement on her website (you can see pictures of the piece there, along with photographs of the 2018 installation, The Library of Missing Datasets, 2.0) she says: 

“The Library of Missing Datasets” is a physical repository of those things that have been excluded in a society where so much is collected. “Missing data sets” are the blank spots that exist in spaces that are otherwise data-saturated. Wherever large amounts of data are collected, there are often empty spaces where no data live. The word “missing” is inherently normative. It implies both a lack and an ought: something does not exist, but it should. That which should be somewhere is not in its expected place; an established system is disrupted by distinct absence. That which we ignore reveals more than what we give our attention to. It’s in these things that we find cultural and colloquial hints of what is deemed important. Spots that we’ve left blank reveal our hidden social biases and indifferences.

Some examples of missing datasets in Onuoha’s piece include:

  • People excluded from public housing because of criminal records
  • Trans people killed or injured in instances of hate crime
  • Poverty and employment statistics that include people who are behind bars
  • Muslim mosques/communities surveilled by the FBI/CIA
  • Mobility for older adults with physical disabilities or cognitive impairments
  • LGBT older adults discriminated against in housing
  • Undocumented immigrants currently incarcerated and/or underpaid
  • Firm statistics on how often police arrest women for making false rape reports
  • Master database that details if/which Americans are registered to vote in multiple states

There are many more. Fortunately, one formerly missing dataset, thanks to the efforts of multiple citizen-led data collection projects around the US, is “Civilians killed in encounters with police or law enforcement agencies”. In 2015, when she began collecting the missing datasets for her piece, this wasn’t the case. 2015. Just six short years ago. A small grace from the growth of the Black Lives Matter movement and the tragedy of George Floyd’s murder.

The scripture reading this morning from the Book of Proverbs speaks of the wisdom and knowledge of God; from God’s mouth comes knowledge and understanding. I believe that the real kernel of wisdom within those words is the reminder to keep searching. Keep seeking knowledge, keep searching for information, keep collecting the data not because we simply haven’t found the right answer yet, but because we don’t yet possess enough of whatever it is that may yield the right answer. It is missing, if it even exists at all.

To my understanding, this is hope. It’s the hope of data, of science, of art, of technology, of education, of human relations, of human society, of all of creation. Perhaps it is a hope for the Church, too. Paul wrote to the Church in Corinth that faith, hope, and love abide; and the greatest of these is love. Me, I’ll take hope, for we can always hope to have faith, even when we have none. We can hope for love, even when there is no love to be found. And we can always hope to be better, even when we are far from it. 


Is Big Data Missing the Big Picture?

27 Apr


When I was defending my graduate thesis a number of years ago, I was asked by one of the faculty in attendance to explain why I had done “x” rather than “y” with my data. I stumbled for a bit until I finally said, somewhat out of frustration at not knowing the right answer, “Because that’s not what I said I’d do.” My statistics professor was also in attendance and, as I quickly tried to backtrack from my response, piped in, “That’s the right answer.”

As I’ve watched and listened to and read and been a part of so many discussions about data – data sharing, data citation, data management – over the past several years, I often find myself thinking back on that defense and my answer. More, I’ve thought of my professor’s comment: that data is collected, managed, and analyzed according to certain rules that a researcher or graduate student or any data collector decides from the outset. That’s best practice, anyway. And such an understanding always makes me wonder whether, in our exuberance to claim the importance, the need, the mandates, and the “sky’s the limit” views over data sharing, we forget exactly that.

I really enjoyed the panel that the Medical Library Association put together last week for their webinar, “The Diversity of Data Management: Practical Approaches for Health Sciences Librarianship.” The panelists included two data librarians and one research data specialist: Lisa Federer of the National Institutes of Health Library, Kevin Read from New York University’s Health Sciences Library, and Jacqueline Wirz of Oregon Health & Science University, respectively. As a disclosure, I know Lisa, Kevin, and Jackie each personally and consider them great colleagues, so I guess I could be a little biased in my opinion, but putting that aside, I do feel that they each have a wealth of experience and knowledge in the topic and it showed in their presentations and dialogue.

Listening to the kind of work and the projects that these data-centric professionals shared, it’s easy and exciting to see the many opportunities that exist for libraries, librarians, and others with an interest in data science. At the same time, I admit that I wince when I sense our “We can do this! Librarians can do anything!” enthusiasm bubble up – as occasionally occurs when we gather together and talk about this topic – because I don’t think it’s true. I do believe that individually, librarians can move into an almost limitless career field, given our basic skills in information collection, retrieval, management, preservation, etc. We are well-positioned in an information age. That said, though, I also believe that (1) there IS a difference between information and data and (2) the skills librarians have as a foundation in terms of information science don’t, in and of themselves, translate directly to the age of big data. (I’m no fan of that descriptor, by the way. I tend to think it was created and is perpetuated by the tech industry and the media, both of which would have us believe things are simpler than they ever are.) Some librarians, with a desire and propensity for the opportunities in data science, will find their way there. They’ll seek out the extra skills needed and they’ll identify new places and new roles that they can take on. I feel like I’ve done this myself and I know a good handful of others who’ve done the same. But can we sell it as the next big thing that academic and research libraries need to do? Years later, I still find myself a little skeptical.

Moving beyond the individual, though, I wonder if libraries and other entities within information science, as a whole, don’t have a word of caution to share in the midst of our calls for openness of data. It’s certainly the belief of our profession(s) that access to information is vital for the health of a society on every level. However, in many ways it seems that in our discussions of data, we’ve simply expanded our dedication to the principle of openness of information to include data, as well. Have we really thought through all that we’re saying when we wave that banner? Can we have a more tempered response and/or approach to the big data bandwagon?

Arguably, there are MANY valid reasons for supporting access in this area: peer review, expanded and more efficient science, reproducibility, transparency, etc. Good things, all. But going back to that lesson that I learned in grad school, it’s important to remember that data is collected, managed, and analyzed in certain ways for a reason – things decided by the original researcher. In other words, data has context. Just like information. And like information, I wonder (and have concern for) what happens to data when it’s taken out of its original context. And I wonder if my profession could perhaps advocate this position, too, along with those of openness and sharing, if nothing more than to raise the collective awareness and consciousness of everyone in this new world. To curb the exuberance just a tad.

I recently started getting my local paper delivered to my home. The real thing. The newsprint newspaper. The one that you spread out on the kitchen table and peruse, page by page. You know what I’ve realized in taking up this long-lost activity again? When you look at a front page with articles on an earthquake in Nepal, nearby horses attacked by a bear, the hiring practices of a local town’s police force, and gay marriage, you’re forced to think of the world in its bigger context. At the very least, you’re made aware of the fact that there’s a bigger picture to see.

When I think of how fragmented information is today, I can’t help but ask if there’s a lesson there that can be applied to data before we jump overboard into the “put it all out there” sea. We take research articles out of the context of journals. We take scientific findings out of the context of science. We take individual experiences out of the context of the very experience in which they occur. And of course, the most obvious, we take any and every politician’s words out of context in order to support whatever position we either want or don’t want him/her to support. I don’t know about you, but each and every one of these examples appears as a pretty clear reason to at least think about what can and will happen (already happens) to data if and when it suffers the same fate.

Are there reasons why librarians and information specialists are concerned with big data? Absolutely! I just hope that our concern also takes in the big picture.


Repeat After Me

22 Aug


Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method and relies on ceteris paribus. Wikipedia

I was going to start this post with a similar statement in my own words, but couldn’t resist the chance to quote Latin. It always makes you sound so smart. But regardless of whether these are a Wikipedia author’s words or my own, the point is the same – one of the foundations of good science is the ability to reproduce the results.

My work for the neuroimaging project involves developing a process for researchers in this field to cite their data in such a way that makes their work more easily reproducible. The current practice of citing data sets alone doesn’t always make reproducibility possible. A researcher might take different images from a number of different data sets to create an entirely new data set, in which case citing the previous sets in whole doesn’t tell exactly which images are being used. Thus, this gap can make the final research harder to replicate, as well as more difficult to review. We think that we may have a way to help fix this problem and that’s what I’ve been working on for the past few months.
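One way to picture the idea (purely an illustrative sketch, not the project’s actual tooling – the manifest shape, file names, and DOIs below are all invented) is a small machine-readable manifest that travels with the derived data set and records exactly which files came from which parent sets, identified by content hash rather than by file name:

```python
import hashlib
import json

def file_fingerprint(data: bytes) -> str:
    """Content hash that identifies an exact image file, independent of its name."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(derived_name, sources):
    """Build a citation manifest for a derived data set.

    `sources` maps a parent data set identifier (e.g. a DOI) to the list of
    (file_name, file_bytes) pairs actually reused from it.
    """
    return {
        "derived_dataset": derived_name,
        "sources": [
            {
                "dataset_id": dataset_id,
                "files": [
                    {"name": name, "sha256": file_fingerprint(blob)}
                    for name, blob in files
                ],
            }
            for dataset_id, files in sources.items()
        ],
    }

# Hypothetical example: a new set drawing two scans from one parent data set
# and one scan from another.
manifest = build_manifest(
    "my-derived-set-v1",
    {
        "doi:10.0000/parent-set-A": [("scan_001.nii", b"..."), ("scan_007.nii", b"....")],
        "doi:10.0000/parent-set-B": [("scan_c42.nii", b".....")],
    },
)
print(json.dumps(manifest, indent=2))
```

The appeal of hashing the file contents, rather than just listing names, is that a reviewer or replicator can verify they hold exactly the images the original authors used, even if files have since been renamed or reorganized in the parent repositories.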

At the same time, I’ve been working on a systematic review with the members of the mammography study team. This work has me locating and reading and discussing a whole slew of articles about the use of telephone call reminders to increase the rate of women receiving a mammogram within current clinical guidelines. It also has me wondering about the nature of clinical research and the concept of reproducible science, for in all of my work, I’ve yet to come across any two studies that are exactly alike. In other words, it doesn’t seem to be common practice for anyone to repeat anyone else’s study. And I can’t help but wonder why this is so.

I imagine it has something to do with funding. Will a funding agency award money to a proposal that seeks to repeat something; something unoriginal? Surely they are more apt to look to fund new ideas.

Maybe it has to do with scientific publishing. Like funding agencies, publishers probably much prefer to publish new ideas and new findings. Who wants to read an article that says the same thing as one they read last year?

Of course, it may also be that researchers look to improve on previous studies, rather than simply repeat them. This is what I see in all of the papers I’ve found for this particular systematic review. The methods are tweaked from study to study; the populations differ just a bit, the length of time varies, etc. It makes sense. The goal of this body of research is to determine what intervention works the best and in changing things slightly, you might just find the answer. What has me baffled about this process, though, is that as we continue to tweak this aspect or that aspect of a study’s methodology, when and/or how do we ever discover what aspect actually works and then put it into practice? 

Working on this particular review, I’ve collected easily 50+ relevant articles, yet as we pull them together – consolidate them to discover any conclusions – the task seems, at times, impossible. Too often, despite the relevancy of the articles to the question asked, what you really end up comparing is apples to oranges. How does this get to the heart of scientific discovery? How does it influence or generate “best practice”? I can’t help but wonder.

Yesterday, during my library’s monthly journal club, we discussed an article that one of the principal investigators on the mammography study had recommended to me. “How to Read a Systematic Review and Meta-analysis and Apply the Results to Patient Care” is the latest Users’ Guide on the subject from the Journal of the American Medical Association (JAMA). It prompted a lively session about everything from how research is done, to how medical students are taught to read the literature, to how the media portrays medical news. I recommend it.

Of course, there are many possible answers to my question and many factors at play. My wondering and our journal club discussion didn’t yield any concrete solution or answer, but I still feel it’s a worthwhile topic for medical librarians to think about. If you have any thoughts, please keep the discussion going in the comments section below.