Choosing reference sets: good compared to what?

In the previous post we assumed we had a list of 100 papers to use as a baseline for our percentile calculations. But what papers should be on this list?

It matters: not to brag, but I’m probably a 90th-percentile chess player compared to a reference set of 3rd-graders. The news isn’t so good when I’m compared to a reference set of Grandmasters. This is a really important point about percentiles: they’re sensitive to the reference set we pick.

The best reference set depends on the situation and the story we’re trying to tell. Because of this, in the future we’d like to make the choice of total-impact reference sets very flexible, allowing users to define custom reference sets based on query terms, DOI lists, and so on.

For now, though, we’ll start simply, with just a few standard reference sets to get going.  Standard reference sets should be:

  • meaningful
  • easily interpreted
  • neither too high nor too low in impact, so gradations in impact are apparent
  • applicable to a wide variety of papers
  • amenable to large-scale collection
  • available as a random sample if large

For practical reasons we focus first on the last three points.  Total-impact needs to collect reference samples through automated queries.  This will be easy for the diverse products we track: for Dryad datasets we’ll use other Dryad datasets, for GitHub code repositories we’ll use other GitHub repos.  But what about for articles?  

Unfortunately, few open scholarly indexes allow queries by scholarly discipline or keywords… with one stellar exception: PubMed.  If only all of research had a PubMed!  PubMed’s eUtils API lets us query by MeSH indexing term, journal title, funder name, all sorts of things.  It returns a list of PMIDs that match our queries.  The API doesn’t return a random sample, but we can fix that (code).  We’ll build ourselves a random reference set for each publishing year, so a paper published in 2007 would be compared to other papers published in 2007.
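Roughly, the trick looks something like this. This is a simplified sketch rather than our production code, and the function names and query strings are just examples; the idea is that since eSearch always returns hits in the same order, we pick random offsets into the result set instead:

```python
# A rough sketch (not total-impact's actual code) of drawing a random PMID
# reference set from PubMed's eSearch API for one publication year.
# The query strings below are illustrative examples only.
import json
import random
import time
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch(term, retstart=0, retmax=0):
    """Run one eSearch query and return the parsed 'esearchresult' JSON."""
    params = urllib.parse.urlencode({
        "db": "pubmed", "term": term, "retmode": "json",
        "retstart": retstart, "retmax": retmax,
    })
    with urllib.request.urlopen(f"{ESEARCH}?{params}") as response:
        return json.load(response)["esearchresult"]

def random_pmid_sample(term, year, sample_size=100, seed=42):
    """eSearch returns hits in a fixed order, not a random sample, so we
    pick random offsets into the result set and fetch one PMID at each."""
    query = f"({term}) AND {year}[pdat]"
    total = int(esearch(query)["count"])
    # eSearch only exposes roughly the first 10,000 hits of a result set;
    # for bigger sets you would slice the query further (e.g. by month).
    reachable = min(total, 9999)
    random.seed(seed)
    offsets = random.sample(range(reachable), min(sample_size, reachable))
    pmids = []
    for offset in offsets:
        pmids.append(esearch(query, retstart=offset, retmax=1)["idlist"][0])
        time.sleep(0.4)  # stay politely under NCBI's request-rate limit
    return pmids

# e.g. a small 2007 reference set of Nature papers:
# pmids = random_pmid_sample('"Nature"[journal]', 2007, sample_size=50)
```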

What specific PubMed query should we use to derive our article reference set?  After thinking hard about the first three points above and doing some experimentation, we’ve got a few top choices:

  • any article in PubMed
  • articles resulting from NIH-funded research, or
  • articles published in Nature.

All of these are broad, so they are roughly applicable to a wide variety of papers.  Even more importantly, people have a good sense for what they represent — knowing that a metric is in the Xth percentile of NIH-funded research (or Nature, or PubMed) is a meaningful statistic.  

There is of course one huge downside to PubMed-inspired reference sets: they focus on a single domain.  Biomedicine is a huge and important domain, so that’s good, but leaving out other domains is unfortunate.  We’ll definitely be keeping an eye on other ways to derive easy reference sets (a PubMed for all of Science?  An open social science API?  Or hopefully Mendeley will include query by subdiscipline in its API soon?).

Similarly, a Nature reference set covers only a single publisher—and one that’s hardly representative of publishing as a whole. As such, it may feel a bit arbitrary.

Right now, we’re leaning toward using NIH-funded papers as our default reference set, but we’d love to hear your feedback. What do you think is the most meaningful baseline for altmetrics percentile calculations?

(This is part 5 of a series on how total-impact will give context to the altmetrics we report.)

Percentiles, a test-drive

Let’s take the definitions from our last post for a test drive on tweet percentiles for a hypothetical set of 100 papers, presented here in order of increasing tweet count with our assigned percentile ranges:

  • 10 papers have 0 tweets (0-9th percentile)
  • 40 papers have 1 tweet (10-49th)
  • 10 papers have 2 tweets (50-59th)
  • 20 papers have 5 tweets (60-79th)
  • 1 paper has 9 tweets (80th)
  • 18 papers have 10 tweets (81-98th)
  • 1 paper has 42 tweets (99th)

If someone came to us with a new paper that had 0 tweets, given the sample described above we would assign it to the 0-9th percentile (using a range rather than a single number because we roll like that).  A new paper with 1 tweet would be in the 10th-49th percentile.  A new paper with 9 tweets is easy: 80th percentile.

If we got a paper with 4 tweets, we’d see it falls between the datapoints in our reference sample — the 59th and 60th percentiles — so we’d round down and report it as the 59th percentile.  If someone arrives with a paper that has more tweets than anything in our collected reference sample, we’d give it the 100th percentile.
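To make those rules concrete, here’s a simplified sketch of the assignment logic, checked against the sample above. It’s an illustration rather than our production code:

```python
# An illustrative sketch of the percentile-range rules described above
# (not total-impact's production code).
from bisect import bisect_left, bisect_right

def percentile_range(reference, value):
    """Return (low, high) percentiles for `value` against a reference sample,
    where a percentile means 'percent of reference items with strictly lower
    scores', ties become ranges, and in-between values are rounded down."""
    ref = sorted(reference)
    n = len(ref)
    below = bisect_left(ref, value)           # reference items strictly below
    tied = bisect_right(ref, value) - below   # reference items with equal score
    if below == n:                            # beats everything in the sample
        return (100, 100)
    if tied:                                  # matches a group: report its full
        return (below * 100 // n,             # range of percentile slots
                (below + tied) * 100 // n - 1)
    slot = max(below * 100 // n - 1, 0)       # between datapoints: round down
    return (slot, slot)

# The hypothetical 100-paper reference sample from the list above:
sample = [0] * 10 + [1] * 40 + [2] * 10 + [5] * 20 + [9] + [10] * 18 + [42]

print(percentile_range(sample, 0))   # (0, 9)
print(percentile_range(sample, 1))   # (10, 49)
print(percentile_range(sample, 9))   # (80, 80)
print(percentile_range(sample, 4))   # (59, 59)   between datapoints, rounded down
print(percentile_range(sample, 50))  # (100, 100) more than anything in the sample
```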

Does this map to what you’d expect?  Our goal is to communicate accurate data as simply and intuitively as possible.  Let us know what you think!  @totalimpactorg on twitter, or team@total-impact.org.

(part 4 of a series on how total-impact plans to give context to the altmetrics it reports. see part 1, part 2, and part 3.)

Percentiles, the tricky bits

Normalizing altmetrics by percentiles seems so easy!  And it is… except when it’s not.

Our first clue that percentiles have tricky bits is that there is no standard definition for what percentile means.  When you get an 800/800 on your SAT test, the testing board announces you are in the 98th percentile (or whatever) because 2% of test-takers got an 800… their definition of percentile is the percentage of tests with scores less than yours.  A different choice would be to declare that 800/800 is the 100th percentile, representing the percentage of tests with scores less than or equal to yours.  Total-impact will use the first definition: when we say something is in the 50th percentile, we mean that 50% of reference items had strictly lower scores.

Another problem: how should we represent ties?  Imagine there were only ten SAT takers: one person got 400, eight got 600s, and one person scored 700.  What is the percentile for the eight people who scored 600?  Well…it depends.

  • They are right in the middle of the pack so by some definitions they are in the 50th percentile.
  • An optimist might argue they’re in the 90th percentile, since only 10% of test-takers did better.
  • And by our strict definition they’d be in the 10th percentile, since they only beat the bottom 10% outright.

The problem is that none of these are really wrong; they just don’t include enough information to fully understand the ties situation, and they break our intuitions in some ways.

What if we included the extra information about ties? The score for a tie could instead be represented by a range, in this case the 10th-89th percentile.  Altmetrics samples have a lot of ties: many papers receive only one tweet, for example, so representing ties accurately is important.  Total-impact will take this range approach, representing ties as percentile ranges.
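To see just how much the three readings disagree, here’s a quick illustration using SciPy’s existing percentileofscore helper (our own sketch, not total-impact code):

```python
# Illustrating the three competing "percentile of a tie" readings above,
# using SciPy's existing helper, plus the range representation we prefer.
from scipy.stats import percentileofscore

sat_scores = [400] + [600] * 8 + [700]   # one 400, eight 600s, one 700

print(percentileofscore(sat_scores, 600, kind="mean"))    # 50.0: middle of the pack
print(percentileofscore(sat_scores, 600, kind="weak"))    # 90.0: only 10% did better
print(percentileofscore(sat_scores, 600, kind="strict"))  # 10.0: our strict definition

# The range keeps all of that information at once: 10th-89th percentile.
low = percentileofscore(sat_scores, 600, kind="strict")       # 10.0
high = percentileofscore(sat_scores, 600, kind="weak") - 1    # 89.0
print(f"{low:.0f}th-{high:.0f}th percentile")                 # 10th-89th percentile
```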

Finally, what to do with zeros?  Impact metrics have many zeros: many papers have never been tweeted.  Here, the range solution also works well.  If your paper hasn’t been tweeted, but neither have 80% of papers in your field, then your percentile range for tweets would be 0-79th.  In the case of zeros, when we need to summarize as a single number, we’ll use 0.

We’ll take these definitions for a test-drive in the next post.

(part 3 of a series on how total-impact plans to give context to the altmetrics it reports. see part 1, part 2, and part 4.)

Percentiles

In the last post we talked about the need to give raw counts context on expected impact.  How should this background information be communicated?

Our favourite approach: percentiles.

Try it on for size: Your paper is in the 88th percentile of CiteULike bookmarks, relative to other papers like it.  That tells you something, doesn’t it?  The paper got a lot of bookmarks, but there are some papers with more.  Simple, succinct, intuitive, and applicable to any type of metric.

Percentiles were also the favoured approach for context in the “normalization” breakout group at altmetrics12, and have already popped up as a total-impact feature request. Percentiles have been explored in scientometrics for journal impact metrics, including in a recent paper by Leydesdorff and Bornmann (http://dx.doi.org/10.1002/asi.21609; free preprint PDF). The abstract says “total impact” in it; did you catch that?  🙂

As it turns out, actually implementing percentiles for altmetrics isn’t quite as simple as it sounds.  We have to make a few decisions about how to handle ties, and zeros, and sampling, and how to define “other papers like it”…. stay tuned.

(part 2 of a series on how total-impact plans to give context to the altmetrics it reports. see part 1, part 3, and part 4.)

What do we expect?

How many tweets is a lot?

Total-impact is getting pretty good at finding raw numbers of tweets, bookmarks, and other interactions. But these numbers are hard to interpret. Say I’ve got 5 tweets on a paper—am I doing well? To answer that, we must know how much activity we expect on a paper like this one.

But how do we know what to expect? To figure this out, we’ll need to account for a number of factors:

First, expected impact depends on the age of the paper.  Older papers have had longer to accumulate impact: an older paper is likely to have more citations than a younger paper.

Second, especially for some metrics, expected impact depends on the absolute year of publication.  Because papers often get a spike in social media attention at the time of publication, papers published in years when a social tool is very popular receive more attention on that tool than papers published before or after the tool was popular.  For example, papers published in years when twitter has been popular receive more tweets than papers published in the 1980s.

Third, expected impact depends on the size of the field.  The more people there are who read papers like this, the more people there are who might Like it.

Fourth, expected impact depends on the tool adoption patterns of the subdiscipline.  Papers in fields with a strong Mendeley community will have more Mendeley readers than papers published in fields that tend to use Zotero.

Finally, expected impact depends on what we mean by papers “like this.”  How do we define the relevant reference set?  Other papers in this journal?  Papers with the same indexing terms?  Funded under the same program?  By investigators I consider my competition?

There are other variables too.  For example, a paper published in a journal that tweets all its new publications will get a twitter boost, an Open Access paper might receive more Shares than a paper behind a paywall, and so on.

Establishing a clear and robust baseline won’t be easy, given all of this complexity!  That said, let’s start.  Stay tuned for our plans…

(part 1 of a series on how total-impact plans to give context to the altmetrics it reports. see part 2, part 3, and part 4.)

What’s your pain?

We want to build a product users want.  No, actually, we want to build a product users *need*.  A product that solves pain, that solves problems.  Best way to know what the problems are?  Get out of the building and ask.

So, dear potential-future-users: where are you currently feeling real pain about tracking the impact of your research?  

Here are three potential places:

  • You are desperate to learn more about your impact for your own curiosity.
  • You put all of this time into your research, you really want your circle to know about it.  You need to share info about your impact.
  • You want to be rewarded for your impact when evaluated for hiring, promotion, grants, and awards.

What’s the rank order of these pains for you?  Are there others?  Tell us all about it so we can build the tool that you need: team@total-impact.org or @totalimpactdev.

Tell the NIH that grant biosketches need impact info

The NIH wants to hear your thoughts on how it should modify its biosketch requirements. Feedback due Friday, June 29th, 2012, midnight EDT.

The request for information is wide open, but specifically requests feedback on the idea that a researcher’s biosketch could “include a short explanation of their most important scientific contributions.”  

Sounds like a chance for scientists to tell their impact story!  Good idea? And do you think impact stories should include impact metrics?  If so, tell the NIH!

Right now the NIH biosketch instructions only include impact signalling through journal titles.  

Some ideas for new biosketch instructions:

  1. explicitly encourage listing all types of research output as publications, including software and datasets
  2. explicitly welcome indications of impact, like citations, downloads and bookmarking counts
  3. consider identifying articles only by authors, title, and ID/url rather than journal

Add your voice:  here’s the form.  We understand that the group receiving these responses is empowered to make changes.

(ht to Rebecca Rosen.  More info at ResearchRemix.  CC0.)

Keeping metrics free

Sustainability is important for the kind of infrastructure we want to build with total-impact. The obvious way to achieve it is to pass our costs along to folks who want to use the metrics, and we’ve discussed ways of doing that.

However, over the last week, we’ve reached an important decision: in addition to keeping our source code and planning process open, we’ll keep our metrics free and open, too. We won’t charge for access or use.

This may seem quixotic, but it’s not motivated by blind “information wants to be free” fanpersonism. Rather, it’s motivated by our underlying goal for this project: not just a nifty new way to measure impact (although it’s that, too), but rather the base for a fundamentally transformed, web-native scholarly communication system.

The value in selling altmetrics is dwarfed by the value of what we can build using them. And we can only build these systems if the metrics themselves can flow like water between and among evaluators, readers, recommendation engines, authors, and all the other cogs of this scholarly communication system. 

We’re both believers in The Market. There’s lots of money to be made in the coming post-journal world; we support those folks trying to make it. But we see that the market is not going to provide the kind of infrastructure that the next generation of recommendation systems and tools will need.

So over the next few months, we’ll be forming a non-profit foundation, and continuing to pursue philanthropic funding through at least the next year (while still looking at innovative ways to develop additional revenue streams). The Sloan Foundation have seen the value in what we’re doing; we think that Sloan and others will be excited to continue supporting the vision of a comprehensive, timely, free, and open metrics infrastructure. 

We scholars have travelled the route of trusting our basic decision-making infrastructure to a for-profit before. Despite everyone’s best intentions, it’s not worked out so well. We’re excited about helping to start a new era of metrics along a different course.

Open impact metrics need #openaccess. Please sign.

Something exciting is going on.  A petition for increased access to the scientific literature is gathering steam.  If it gets 25k signatures in 30 days — and it looks like it will get many more — the proposal will go to Obama’s desk for integration into policy.

Total-Impact urges you to sign this petition and share it with others.  We have 🙂

Improved access to the research literature is *essential* if we want innovative systems to track the impact of scholarly research products within the scholarly ecosystem.  

As far as we know, there is only one cross-publisher open computer-accessible source for citations: PubMed Central. And the only cross-publisher search of full text that can be reused by computer programs? Comes from PubMed Central.  PubMed Central is awesome, but it only has NIH-funded biomedical literature. Scholarship needs these resources for all research literature.  This petition is an important step.

Please go sign the petition and spread the word.  #altmetrics #OAMonday #openaccess #theFutureIsComing

What are metrics good for?

We talk a lot about metrics. And when you do that, there’s always the risk that what you’re measuring, or why, will become unclear. So this is worth repeating, as we were reminded in a nice conversation with Anurag Acharya of Google Scholar (thanks Anurag!).

Metrics are no good by themselves. They are, however, quite useful when they inform filters. In fact, by definition filtering requires measurement or assessment of some sort. If we find new relevant things to measure, we can make new filters along those dimensions. That’s what we’re excited about, not measuring for its own sake.

These filters can mediate search for literature. They can also filter other things, like job applicants or grant applications. But they’re all based on some kind of measurement. And expanding our set of relevant features (and perhaps a machine-learning context is more useful here than the mechanical filter metaphor) is likely to improve the validity and responsiveness of all sorts of scholarly assessment.

The big question, of course, is whether altmetrics like tweets, Mendeley readership, and so on are actually relevant features. We can’t yet prove that one way or another, although we’re working on it. I do know that they’re relevant sometimes, and I have the suspicion that they will become more relevant as more scholars move their professional networks online (another assumption, but I think a safe one).

And of course, measuring and filtering are only half the game. You also have to aggregate, to pull the conversation together. Back when citation was the only visible edge in the network, we used ISI et al. to do this. Of course the underlying network was always richer than that, but the citation graph was the best trace we had. But now the underlying processes—conversations, reads, saves, etc.—are becoming visible as well, and there’s even more value in pulling together these latent, disconnected conversations. But that’s another post 🙂