Tag Archives: data

Finding Hope in a Haystack

13 Sep

[Before I was a librarian, before I was an exercise physiologist, I was a minister. I was recently asked, after many years, to serve as a guest preacher one Sunday. I would usually share this on my non-library-related blog, but as the subject came from my day-to-day work in scholarly communications and data, I thought some followers of this blog might find it of interest.

The scripture referenced is Proverbs 1:20-33. The sermon was delivered on September 12, 2021 at First Baptist Church, Worcester, MA. A recording of the service can be found on the Church’s website and/or Facebook page.]

When Brent first asked me if I would consider giving a sermon here this morning, I thanked him, told him I was very touched and humbled that he’d consider me, but said I didn’t think it was a very good idea. I haven’t preached a sermon in a very long time. Eight years, to be exact. I’ve not been to church in a good while, either. I struggle with reconciling a lot of the things I once believed about God and faith and Christianity, with what I believe about the world today. I struggle with Church. Capital “C” church. The institution of it. Organized religion. I shared all of this with Brent and he, in his very pastoral way, assured me that all of this is okay. And he convinced me that perhaps, just perhaps, I might still have something to share from this place.

Once I said yes, of course, I was stuck. I’ve been a librarian for a very long time. Now I found myself struggling over different things – wondering what the heck I do in my day-to-day life that I could possibly translate into something you might find relatable, meaningful, or especially inspiring for a sermon. But then, lying in bed one night a couple weeks back, I got to looking at the books in my bedroom and I thought of something I’d read in a book called Living in Data: A Citizen’s Guide to a Better Information Future by the engineer, artist, computer programmer, National Geographic Explorer, and really wonderful storyteller, Jer Thorp. I highly recommend this book and I’ll reference it throughout these thoughts this morning. But lying in bed, I thought of the opening to the second chapter of the book. It reads:

Open the window and let the words in. Let them flow into the room in a stream, all of the words, hundreds of thousands of them, let them fill the space, let them hang in the air, tiny sparkling motes of language.

And the next morning I got up and counted the number of books in my room.

There are 318 books in my bedroom. It’s not an unruly number – really! – and while there are a few stacks of them by my bed and on a dresser, most are neatly arranged in a bookcase and on a couple of shelves on the wall. The shortest one is 13 pages long. A Prairie Dog’s Life.  The longest is a 2,000-plus page anthology of English literature. They probably average out though, as a whole, to around 300 pages. Estimating about 350 words to a page, that’s 33,390,000 words hanging out in my room. And this is just my bedroom. I could add to it all of the books in my home, in my office, in my Little Free Library outside of my house. And then multiply all of those out. Think about how many ideas these words generate; how many characters, real and imagined; how many interactions; how many emotions; how many more words they give rise to. They expand and expand and expand. Infinite. 

Did you know that the word “data” comes from Latin where it meant, “a thing given, a gift delivered or sent”? Early in its appearance in the English language, it is tied to the fields of mathematics and theology. All the way back in 1614, a clergyman named Thomas Tuke called the Sacraments Data. With a capital “D”. Divinely given.  

Today, of course, we think of data as numbers, words, bits and bytes, stuff collected in notebooks or spreadsheets, crunched by computers, analyzed and visualized. We don’t think of it as divine. 

Or do we?

I receive a weekly newsletter from Educause, a nonprofit association whose mission is “to advance higher education through the use of information technology”. In an article last week, I read this sentence:

From improving student success to forming optimal strategies that can maximize corporate and foundational relationships, data analytics is now higher education’s divining rod.

Interesting description, wouldn’t you say? Divining, dowsing, doodlebugging – that pseudoscience where a stick – a divining rod – leads you to water, the biological requirement for life. Data analytics is now perceived as what will lead us to our life source.

That sure makes data sound divine to me. It sounds just like the thing that’s going to save us. And it’s hardly slick marketing for a certain college major or field of study. There’s real evidence all around us of the downright amazing – some might say miraculous – outcomes of harnessing big data. 

The National Center for Biotechnology Information, NCBI, is part of the National Library of Medicine at the U.S. National Institutes of Health. One of the many things that NCBI does, as part of the Library of Medicine, is build, host, and manage a series of biomedical databases, including PubMed, one of the world’s largest bibliographic databases – a free resource for finding citations and abstracts in life sciences and biomedical literature. Since November 17, 2019, when the first case of the novel coronavirus we now know so well as COVID-19 was reported (just shy of 22 months ago), 175,593 peer-reviewed, published research articles on COVID-19 have been indexed in PubMed. The full-text versions of more than 200,000 articles are freely available via the public access site, PubMed Central. About 1.4 million nucleotide records have been uploaded and made available, along with a million sequence-related records. 317 articles on COVID-19 have been written by researchers just down the road at UMass Medical School, where I work. We house the full text of these in my library’s institutional repository and as of yesterday morning, those 317 papers had been downloaded more than 30,000 times by people all over the world.

It is this unprecedented open sharing of information and data that allowed us to watch science unfold over the past year at a pace hardly ever seen. The biomedical research community, worldwide, developed multiple vaccines to fight COVID with an effectiveness unheard of before. These vaccines were developed, tested, trialed, and delivered in a 12-month time period. Amazing. Miraculous. Divine?

There is no doubt that many see the hand of God at work in all of this. If you believe that God grants us gifts – skills and knowledge and wisdom – to create with and to use towards the betterment of the world, then yes, divine. Data is divine. A gift from God.

But. That’s a bit easy, isn’t it? A bit simple. 

In his book, The Promise of Access: Technology, Inequality, and the Political Economy of Hope, Daniel Greene defines and traces what he calls “the access doctrine,” a belief born out of the technology boom of the 1990s, where it seemed almost common sense that all one really needed to enter into the new information economy was access – access to technology (think the “laptop for every child” programs), access to the Internet (think broadband expansion), access to tech education (think charter schools and diploma programs with a hard focus on students’ use of and proficiency in different types of software and technology). Public libraries in particular have played a big role in propagating this promise. Holding true to their own belief that they exist to freely provide access to information, they were one of the first institutions to make computers freely available to the public.

Unfortunately, as Greene describes in his book, this promise has fallen short. Technology is a simple solution to the vastly complex problem of inequality. But like data, it’s a simple sell. And this simple selling of information and technology and data as some kind of commonsense cure to everything is powerful. It IS power. Much in the same way that the selling of a simplified idea of God or of faith or of religion is power.

Take Joel Osteen and Bill Gates. You may have vastly different opinions of these two powerful men, but they are strikingly similar in that they each preach a simple message with unwavering conviction. For Osteen it’s that a belief in God will grant one every bit of peace and prosperity. For Gates it’s the belief that every problem – from access to vaccines to climate change – can be solved via some form of technological invention – or intervention – almost always one that will generate the data, the information, the knowledge, and ultimately, the solution. It’s a great hope.

And data as a great hope starts to sound an awful lot like what we think of as religious faith. It holds in our minds and in our hearts this unfettered sense, this belief, that somehow, some way, somewhere within it is the key. The solution. The answer. To everything. If we can only write the right algorithm, if we can only spot the trends, the patterns, then what we once didn’t know, well, now we will. It harkens right back to its Latin origins, to the idea that data is something out there already – just like God – something given, something true. We just have to see it and recognize it. The truth that is already there.

Simple. Powerful. Comforting, even.

But the concept that both ideas, data and religious faith, leave out is a central and crucial one – that they are humanly constructed. As an aside, I’m not positing that God is a human construction. That’s an entirely different argument. But faith – what people believe and, by extension, how they act on those beliefs – is certainly all tied up in the limits of what we can and do construct. Just like data.

If you return to NCBI’s SARS-CoV-2 resources web page, the site where I found many of those numbers on publications and genome sequence runs that I mentioned a few minutes ago, you’ll find a link to a resource called LitCovid, “a curated literature hub for tracking up-to-date scientific information about the 2019 novel Coronavirus.” There’s a chart that shows how many publications are added weekly to the database and there’s also a map of the world, shaded to show the countries mentioned in the abstracts of all of these publications. Darker shading means more mentions. No shading means none. The United States and China stand out as the darkest blue. Most of the rest of the world is a slightly lighter shade, but there are some noticeable blank spots – Central America, a few countries in South America, and a large swath of Central and Western Africa. Does no one have COVID in those places? No. Do they lack the expertise and resources for scientific research? In some cases, definitely yes. But why haven’t those with the expertise and resources focused their research on the people in these parts of the world? There are many answers to this question, of course. It’s complex. But it clearly highlights the flaw in the belief that data is this objective, unbiased entity that need only be collected and curated and analyzed to bring us the solutions to our problems.

There is a chapter in Living in Data called “Data’s Dark Matter” and in it, Jer Thorp tells another story that highlights the limitations of data, even when one is trying their very best to avoid them. In 2009, he wrote a pair of algorithms to determine the placement of the almost 3,000 names of those killed in the 9/11 attacks on the World Trade Center – what would be a significant part of the 9/11 Memorial. The designers were seeking what they called “meaningful adjacencies” – people related to one another, people who worked together. This is a mathematical problem that I cannot begin to fathom, let alone solve. But Thorp did – at least to some degree. He admits his own shortcomings – or better put, the shortcomings of any data-driven solution – in this story:

Even in the meaningful adjacencies that my algorithm dutifully satisfied, there is much missing. Mohammad Salman Hamdani was a Pakistani American scientist and NYPD cadet who, like so many other first responders, rushed to the scene on September 11 determined to help. Like so many other first responders, he was killed. Hamdani’s name is inscribed on a parapet on the south pool of the memorial, on the very last panel dedicated to the victims who were killed in World Trade Center South. The algorithm placed Hamdani there, in part because there were no meaningful adjacencies recorded, no other names indicated by the data set for his name to sit beside. Why was this man, a police officer in training, not placed alongside the other first responders? According to memorial officials, Hamdani was not included with the other police officers because he wasn’t on active duty, an explanation that sits at odds with the fact that he was given a police funeral with full honors by the NYPD. We can find a more likely answer in a headline from the “New York Post” on October 12, 2001: “Missing – or Hiding? – Mystery of NYPD Cadet from Pakistan.”

Thorp’s algorithm could only run on the data he was provided. Data constructed by humans – from human stories, human reports, human experiences, and human biases. Sadly, but truthfully, also from human hatred, human fear, and human denial.

The artist and data scientist Mimi Onuoha created a mixed-media installation in 2016 entitled The Library of Missing Datasets. It is a white file cabinet filled with labeled, yet empty, file folders. In her artist’s statement on her website (you can see pictures of the piece there, along with photographs of the 2018 installation, The Library of Missing Datasets, 2.0) she says: 

“The Library of Missing Datasets” is a physical repository of those things that have been excluded in a society where so much is collected. “Missing data sets” are the blank spots that exist in spaces that are otherwise data-saturated. Wherever large amounts of data are collected, there are often empty spaces where no data live. The word “missing” is inherently normative. It implies both a lack and an ought: something does not exist, but it should. That which should be somewhere is not in its expected place; an established system is disrupted by distinct absence. That which we ignore reveals more than what we give our attention to. It’s in these things that we find cultural and colloquial hints of what is deemed important. Spots that we’ve left blank reveal our hidden social biases and indifferences.

Some examples of missing datasets in Onuoha’s piece include:

  • People excluded from public housing because of criminal records
  • Trans people killed or injured in instances of hate crime
  • Poverty and employment statistics that include people who are behind bars
  • Muslim mosques/communities surveilled by the FBI/CIA
  • Mobility for older adults with physical disabilities or cognitive impairments
  • LGBT older adults discriminated against in housing
  • Undocumented immigrants currently incarcerated and/or underpaid
  • Firm statistics on how often police arrest women for making false rape reports
  • Master database that details if/which Americans are registered to vote in multiple states

There are many more. Fortunately, one now-former missing dataset, thanks to the efforts of multiple citizen-led data collection projects around the US, is “Civilians killed in encounters with police or law enforcement agencies”. In 2015, when she began collecting the missing datasets for her piece, this wasn’t the case. 2015. Just 6 short years ago. A small grace from the growth of the Black Lives Matter movement and the tragedy of George Floyd’s murder.

The scripture reading this morning from the Book of Proverbs speaks of the wisdom and knowledge of God; from God’s mouth comes knowledge and understanding. I believe that the real kernel of wisdom within those words is the reminder to keep searching. Keep seeking knowledge, keep searching for information, keep collecting the data not because we simply haven’t found the right answer yet, but because we don’t yet possess enough of whatever it is that may yield the right answer. It is missing, if it even exists at all.

To my understanding, this is hope. It’s the hope of data, of science, of art, of technology, of education, of human relations, of human society, of all of creation. Perhaps it is a hope for the Church, too. Paul wrote to the Church in Corinth that faith, hope, and love abide; and the greatest of these is love. Me, I’ll take hope, for we can always hope to have faith, even when we have none. We can hope for love, even when there is no love to be found. And we can always hope to be better, even when we are far from it. 

Amen

Is Big Data Missing the Big Picture?

27 Apr

When I was defending my graduate thesis a number of years ago, I was asked by one of the faculty in attendance to explain why I had done “x” rather than “y” with my data. I stumbled for a bit until I finally said, somewhat out of frustration at not knowing the right answer, “Because that’s not what I said I’d do.” My statistics professor was also in attendance and, as I quickly tried to backtrack from my response, piped in, “That’s the right answer.”

As I’ve watched and listened to and read and been a part of so many discussions about data – data sharing, data citation, data management – over the past several years, I often find myself thinking back on that defense and my answer. More, I’ve thought of my professor’s comment; that data is collected, managed, and analyzed according to certain rules that a researcher or graduate student or any data collector decides from the outset. That’s best practice, anyway. And such an understanding always makes me wonder if in our exuberance to claim the importance, the need, the mandates, and the “sky’s the limit” views over data sharing, we don’t forget that.

I really enjoyed the panel that the Medical Library Association put together last week for their webinar, “The Diversity of Data Management: Practical Approaches for Health Sciences Librarianship.” The panelists included two data librarians and one research data specialist: Lisa Federer of the National Institutes of Health Library, Kevin Read from New York University’s Health Sciences Library, and Jacqueline Wirz of Oregon Health & Science University, respectively. As a disclosure, I know Lisa, Kevin and Jackie each personally and consider them great colleagues, so I guess I could be a little biased in my opinion, but putting that aside, I do feel that they each have a wealth of experience and knowledge in the topic and it showed in their presentations and dialogue.

Listening to the kind of work and the projects that these data-centric professionals shared, it’s easy and exciting to see the many opportunities that exist for libraries, librarians, and others with an interest in data science. At the same time, I admit that I wince when I sense our “We can do this! Librarians can do anything!” enthusiasm bubble up – as occasionally occurs when we gather together and talk about this topic – because I don’t think it’s true. I do believe that individually, librarians can move into an almost limitless career field, given our basic skills in information collection, retrieval, management, preservation, etc. We are well-positioned in an information age. That said, though, I also believe that (1) there IS a difference between information and data and (2) the skills librarians have as a foundation in terms of information science don’t, in and of themselves, translate directly to the age of big data. (I’m not a fan of that descriptor, by the way. I tend to think it was created and is perpetuated by the tech industry and the media, both wishing we’d believe things are simpler than they ever are.) Some librarians, with a desire and propensity towards the opportunities in data science, will find their way there. They’ll seek out the extra skills needed and they’ll identify new places and new roles that they can take on. I feel like I’ve done this myself and I know a good handful of others who’ve done the same. But can we sell it as the next big thing that academic and research libraries need to do? Years later, I still find myself a little skeptical.

Moving beyond the individual, though, I wonder if libraries and other entities within information science, as a whole, don’t have a word of caution to share in the midst of our calls for openness of data. It’s certainly the belief of our profession(s) that access to information is vital for the health of a society on every level. However, in many ways it seems that in our discussions of data, we’ve simply expanded our dedication to the principle of openness of information to include data, as well. Have we really thought through all that we’re saying when we wave that banner? Can we have a more tempered response and/or approach to the big data bandwagon?

Arguably, there are MANY valid reasons for supporting access in this area: peer review, expanded and more efficient science, reproducibility, transparency, etc. Good things, all. But going back to that lesson that I learned in grad school, it’s important to remember that data is collected, managed, and analyzed in certain ways for a reason; these are things decided by the original researcher. In other words, data has context. Just like information. And like information, I wonder (and have concern for) what happens to data when it’s taken out of its original context. And I wonder if my profession could perhaps advocate this position, too, along with those of openness and sharing, if for nothing more than to raise the collective awareness and consciousness of everyone in this new world. To curb the exuberance just a tad.

I recently started getting my local paper delivered to my home. The real thing. The newsprint newspaper. The one that you spread out on the kitchen table and peruse, page by page. You know what I’ve realized in taking up this long-lost activity again? When you look at a front page with articles on an earthquake in Nepal, nearby horses attacked by a bear, the hiring practices of a local town’s police force, and gay marriage, you’re forced to think of the world in its bigger context. At the very least, you’re made aware of the fact that there’s a bigger picture to see.

When I think of how information is so bifurcated today, I can’t help but ask if there’s a lesson there that can be applied to data before we jump overboard into the “put it all out there” sea. We take research articles out of the context of journals. We take scientific findings out of the context of science. We take individual experiences out of the context of the very experience in which they occur. And of course, the most obvious, we take any and every politician’s words out of context in order to support whatever position we either want or don’t want him/her to support. I don’t know about you, but each and every one of these examples appears as a pretty clear reason to at least think about what can and will happen (already happens) to data if and when it suffers the same fate.

Are there reasons why librarians and information specialists are concerned with big data? Absolutely! I just hope that our concern also takes in the big picture.

 

Do you REALLY want it all?

10 Apr

Feeling the Big Squeeze? Remember that even a squeeze box can make a pretty song.

There’s a billboard across the street from my office building, promoting the hospital that’s affiliated with the medical school where I work. It features a friendly looking young woman with the words above her head, “I want it all.” The implication, of course, is that the medical center can meet all of the health needs of this person, indeed of anyone who uses the hospital and its network of health care providers.

This isn’t a criticism of their advertising campaign, but more just a few thoughts that come to my mind every time that I drive past that sign. Wanting it all is pretty much the American dream, is it not? Maybe it’s the dream of all people, everywhere. We all want whatever it is that we want, whether we necessarily need it or not. You may not subscribe to this belief personally, but you have to admit that it’s an awfully loud societal message.

From the perspective of a provider, be one a provider of health care services or a provider of information services, we want it all, too. We want to say that we can provide anything and everything to anyone and everyone who comes through our doors. Libraries, especially, have this idea deeply ingrained in their DNA. They exist for everyone.

But as we have become such a specialized world, I think we’d do well to face the fact that our ability to meet that mission is dwindling, if not altogether extinct. I’ve been working on an evaluation of one of the research cores for the CCTS and in talking to those involved with it, I can’t help but notice they voice many of the same concerns that I long heard in my former home in the library: a handful of people simply cannot meet the needs and demands of everyone.

This imbalance causes us to rethink much of what we do, how we measure our success, and how we plan for the future. The reality of health care is that you really cannot have it all. A few weeks back, I was feeling really miserable and went to the walk-in clinic of the hospital next door only to learn that it’s really not a walk-in clinic, but rather a place for patients who see a certain group of doctors there. These patients can walk in for a last-minute appointment. If one is available. My doctor is a doctor within the same system, but while he has an office a few floors above the very clinic where I was seeking treatment, his clinical office is in another location, thus I wasn’t able to use the services provided there. Again, not a criticism of the provider network (though I am a big critic of the messed-up system that dictates these types of decisions), but I share the story as an example of how claiming all can be provided to everyone ought to be a statement with an asterisk after it. Some restrictions DO apply.

One of the reasons that I chose to leave the library and work for the CCTS is that I felt the expectations in this new role were somewhat more realistic. Here was a defined group of programs and research cores for me to evaluate. It’s a lot, but still seems a manageable number. It allows me the ability to focus more, to feel less scattered, to feel less pulled, to feel less like I’m always falling short of meeting my goals, not because I’m not trying hard or working hard, but because I am only one person and trying to give time to everyone feels like a losing proposition. To me.

Sustainability is a key issue as we continue to work in institutions and businesses and governments that are constantly under the pressure of too few resources to meet all of the required needs. We are limited in people, certainly. Positions are cut or people leave posts and are never replaced. Everyone feels overworked as we try to fill holes and do more.

But we’re also limited by our current service models. Yesterday, I was able to attend the annual eScience Symposium hosted by the NN/LM NER. The afternoon session featured two speakers from different universities who described their particular programs for data services. Regarding their data repositories, one school allows self-deposit while the other offers a mediated service, i.e. researchers send their data to the library and then staff deposit it on their behalf, adding all of the proper metadata, annotation, etc. necessary in order for people to search and find the data sets in the said repository. During the Q&A, I asked the speakers about the differences between their models. I asked them some of the same questions that are asked in the process of evaluating research cores and programs:

How did you decide which path to follow? How did you decide which aspect of your repository to sacrifice: the quality of the content (enhanced by the mediation) or the ability to be a bigger service (because you’re not limited by the time/efforts of staff in the library)?

As one speaker said, “It’s a balancing act.” Indeed. And it’s also a clear example of how believing we can be all for all is misguided. It’s just not possible. We have to set priorities and make choices.

For good and bad, though, these are the realities of academic institutions, health care providers, research centers, and libraries. The one thing that we all really do have is the challenge to face these limitations, all the while trying to come up with the solutions for providing the best of whatever we can offer to as many as possible. Whether it’s what we really want or not, THAT is the “all” that we have.