Introducing a new browser extension to make the paywall great again

It’s pretty clear at this point that open access is winning. Of course, the percentage of papers available as OA has been climbing steadily for years. But now on top of this, bold new mandates like Plan S are poised to fast-track the transition to universal open access.

But–and this may seem weird coming from the makers of Unpaywall–are we going too far, too fast? Sure, OA will accelerate discovery, help democratize knowledge, and whatnot. It’s obvious what we have to gain.

Maybe what’s less obvious is what we’re going to lose. We’re going to lose the paywall. And with it, maybe we’re going to lose a little something…of ourselves.

Think about it: some of humankind’s greatest achievements have been walls. You’ve got the Great Wall of China (useful for being seen from space!), the Berlin Wall (useful for being a tourist attraction!), and American levees (useful for driving your Chevy to, when they don’t break!).

Now, are the paywalls around research articles really great cultural achievements? With all due respect: what a fantastically stupid question. Of course they are! Or not! Who knows! It doesn’t matter. What matters is that losing the paywall means change, and that means it’s scary and probably bad.

Why, just the other day we went to read a scholarly article, and we wanted to pay someone money, and THERE WAS NOWHERE TO DO IT. Open Access took that away from us. We were not consulted. This is “progress?”

You used to know where you stood. Specifically, you stood on the other side of a towering paywall that kept you from accessing the research literature. But now: who knows? Who knows?

Well, good news, friend: with our new browser extension, you know. That’s right, we are gonna make the paywall great again, with a new browser extension that magically erects a paywall to keep you from reading Open Access articles!

The extension is called Paywall (natch), and it’s elegantly simple: the next time you stumble upon one of those yucky open access articles, Paywall automatically hides it from you, and requires you pay $35 to read. That’s right, we’re gonna rebuild the paywall, and we’re gonna make you pay for it!

With Paywall, you’ll enjoy your reading so much more…after all, you paid $35 for that article so you better like it. And let’s be honest, you were probably gonna blow that money on something useless anyway. This way, at least you know you’re helping make the world a better place, particularly the part of the world that is our Cayman Islands bank account.

Paywalls are part of our heritage as researchers. They feel right. They are time-tested. They are, starting now, personally lucrative for the writers of this blog post. I mean, what more reasons do we need? BUILD. THE. WALL. Install Paywall. Now. Do it. Do it now.

Thanks so much for your continued support. Remember, we can’t stop the march of progress–but together, scratching and clawing and biting as one, maybe we can slow it down a little. At least long enough to make a few extra bucks.

⇨ Click here to install Paywall!

~~~~~~~~~

Unpaywall extension adds 200,000th active user

We’re thrilled to announce that we’re now supporting over 200,000 active users of the Unpaywall extension for Chrome and Firefox!

The extension, which debuted nearly two years ago, helps users find legal, open access copies of paywalled scholarly articles. Since its release, the extension has been used more than 45 million times, finding an open access copy about half the time. We’ve also been featured in The Chronicle of Higher Ed, TechCrunch, Lifehacker, Boing Boing, and Nature (twice).

However, although the extension gets the press, the database powering the extension is the real star. There are millions of people using the Unpaywall database every day:

  • We deliver nearly one million OA papers every day to users worldwide via our open API…that’s 10 papers every second!
  • Over 1,600 academic libraries use our SFX integration to automatically find and deliver OA copies of articles when they have no subscription access.
  • If you’re using an academic discovery tool, it probably includes Unpaywall data…we’re integrated into Web of Science, Europe PubMed Central, WorldCat, Scopus, Dimensions, and many others.
  • Our data is used to inform and monitor OA policy at organizations like the US NIH, UK Research and Innovation, the Swiss National Science Foundation, the Wellcome Trust, the European Open Science Monitor, and many others.

The Unpaywall database gets information from over 50,000 academic journals and 5,000 scholarly repositories and archives, tracking OA status for more than 100 million articles. You can access this data for free using our open API, or use our free web-based query tool. Or if you prefer, you can just download the whole database for free.
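
If you want to try the API yourself, here’s a minimal sketch in Python (the v2 endpoint and the is_oa / best_oa_location fields are from our public API docs; the DOI and email address below are placeholders):

```python
# Minimal sketch: look up one DOI in the Unpaywall v2 REST API.
# An email address is required as a query parameter; use your own.
import requests

def best_oa_url(doi, email="you@example.com"):
    resp = requests.get(f"https://api.unpaywall.org/v2/{doi}",
                        params={"email": email})
    resp.raise_for_status()
    record = resp.json()
    loc = record.get("best_oa_location")  # None if no OA copy is known
    return loc["url"] if record.get("is_oa") and loc else None

print(best_oa_url("10.1038/nature12373"))  # placeholder example DOI
```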

Unpaywall is supported via subscriptions to the Unpaywall Data Feed, a high-throughput pipeline providing weekly updates to our free database dump. Thanks to Data Feed subscribers, Unpaywall is completely self-sustaining and uses no grant funding. That makes us real optimistic about our ability to stick around and provide open infrastructure for lots of other cool projects.

Thanks to everyone who has supported this project, and even more, thanks to everyone who has fought for open access. Without y’all, Unpaywall wouldn’t matter. With you: we’re changing the world. Together. Next stop 300k!

We’re building a search engine for academic literature–for everyone


Huzzah! Today we’re announcing an $850k grant from the Arcadia Fund to build a new way for folks to find, read, and understand the scholarly literature.

Wait, another search engine? Really?

Yep. But this one’s a little different: there are already a lot of ways for academic researchers to find academic literature…we’re building one for everyone else.

We’re aiming to meet the information needs of citizen scientists, patients, K-12 teachers, medical practitioners, social workers, community college students, policy makers, and millions more. What they all have in common: they’re folks who’d benefit from access to the scholarly record, but they’ve historically been locked out. They’ve had no access to the content or the context of the scholarly conversation.

Problem: it’s hard to access the content

Traditionally, the scholarly literature was paywalled, cutting off access to the content. The Open Access movement is on the way to solving this: half of new articles are now free to read somewhere, and that number is growing. The catch is that there are more than 50,000 different “somewheres” on web servers around the world, so we need a central index to find it all. No one’s done a good job of this yet (Google Scholar gets close, but it’s aimed at specialists, not regular people. It’s also 100% proprietary, closed-source, closed-data, and subject to disappearing at Google’s whim.)

Problem: it’s hard to access the context

Context is the stuff that makes an article understandable for a specialist, but gobbledegook to the rest of us. That includes everything from field-specific jargon, to strategies for how to skim to the key findings, to knowledge of core concepts like p-values. Specialists have access to context. Regular folks don’t. This makes reading the scholarly literature like reading Shakespeare without notes: you get glimmers of beauty, but without some help it’s mostly just frustrating.

Solution: easy access to the content and context of research literature.

Our plan: provide access to both content and context, for free, in one place. To do that, we’re going to bring together an open database of OA papers with a suite of AI-powered support tools we’re calling an Explanation Engine.

We’ve already finished the database of OA papers. So that’s good. With the free Unpaywall database, we’ve now got 20 million OA articles from 50k sources, built on open source, available as open data, and with a working nonprofit sustainability model.

We’re building the “AI-powered support tools” now. What kind of tools? Well, let’s go back to the Shakespeare example…today, publishers solve the context problem for readers of Shakespeare by adding notes to the text that define and explain difficult words and phrases. We’re gonna do the same thing for 20 million scholarly articles. And that’s just the start…we’re also working on concept maps, automated plain-language translations (think automatic Simple Wikipedia), structured abstracts, topic guides, and more. Thanks to recent progress in AI, all this can be automated, so we can do it at scale. That’s new. And it’s big.
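
To make the notes idea concrete, here’s a toy sketch of glossary-style annotation (the glossary entries and the annotate function are invented for illustration; the real Explanation Engine will be a lot more sophisticated than a dictionary lookup):

```python
# Toy sketch: append a plain-language note after each glossary term,
# like a publisher's notes in the margins of Shakespeare.
import re

GLOSSARY = {  # hypothetical entries
    "p-value": "the probability of results at least this extreme "
               "arising if there were really no effect",
    "in vivo": "tested in a living organism, not in a dish",
}

def annotate(text):
    for term, note in GLOSSARY.items():
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        text = pattern.sub(lambda m: f"{m.group(0)} [note: {note}]", text)
    return text

print(annotate("The effect was significant (p-value < 0.05) in vivo."))
```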

The payoff

When Microsoft launched Altair BASIC for the new “personal computers,” there were already plenty of programming environments for experts. But here was one accessible to everyone else. That was new. And ultimately it launched the PC revolution, bringing computing into the lives of regular folks. We think it’s time that same kind of movement happened in the world of knowledge.

From a business perspective, you might call this a blue ocean strategy. From a social perspective (ours), this is a chance to finally cash the cheques written by the Open Access movement. It’s a chance to truly open up access to the frontiers of human knowledge to all humans.

If that sounds like your jam, we’d love your support: tell your friends, sign up for early access, and follow us for updates. It’s gonna be quite an adventure.

Here’s the press release.

Why the name “altmetrics” doesn’t imply replacement of citations (and other bicycling metaphors)


“Based on the name ‘alternative’ metrics, you clearly think altmetrics can replace citations. That’s dumb.”

I (Jason) have heard this critique more times than I care to count. And on one level, I get it. If you take an “alternate route,” you don’t take the original route, you take a different one. There’s a replacement. And completely replacing citation metrics with altmetrics is, I agree, dumb. That said, I actually believe altmetrics should complement citations, and I further think that the name “altmetrics” (for all its flaws) is compatible with this view. To explain, here’s an example:

I’m currently looking out the window at a street which includes both a lane for cars, and another lane for “alternate transportation,” a category that includes bicycles, skateboards, and scooters.

Although these “alternate” vehicles have many advantages over cars (cleaner, smaller, etc.), the goal of city planners is not, as I understand it, to replace automobiles with alternate transportation. Rather, the goal is to make it easy for commuters to use the most suitable vehicle for their particular trip. This in turn supports a more efficient infrastructure for the city as a whole. Making it easy for commuters to choose alternate transportation for a given trip is helpful, even though no one really expects bikes to completely replace cars in the city as a whole.

(As an aside: these “alternate” vehicles could probably have some other, more descriptive name…for instance, “smaller-more-efficient vehicles.” However, as a practical matter, cars are the default, so bikes etc. remain “alternatives” for now. This is also true of altmetrics, of course, which I often hear will someday be obsolete as a term, once it really catches on. To this I say: excellent. The sooner the better.)

Like bikes et al., altmetrics aren’t right for every use case, and never will be. Altmetrics can’t and shouldn’t replace citation metrics for every task. But they are much better tools than citation metrics for some tasks (for example, understanding the impact of research on populations that don’t write scholarly papers). Therefore, using altmetrics alongside citations will let us measure scholarly impact in a way that’s more efficient, nuanced, and comprehensive. Altmetrics are an alternative to the measurement gridlock that comes from over-reliance on citation metrics.

 

When will everything be Open Access?


OA continues to grow. But when will it be…done? When will everything be published as Open Access?

Using data from our recently-published PeerJ OA study, we took a crack at answering that question. The data we’re using comes from the Unpaywall database–now the largest open database of OA articles ever created, with comprehensive data on over 90 million articles. Check out the paper for lots more details on how we assembled the data, along with assessments of accuracy and other goodies. But without further ado, here’s our projection of OA growth:

[Figure: percentage of articles that are OA, by publication year, with a dotted-line projection of future growth]

In the study, we found that articles have been increasingly likely to be OA since around 1990. That’s the solid-line part of the graph, and it’s based on hard data.

But since the curve is so regular, it was tempting to extend it to see what would happen at the current rate of increase. That’s the dotted line in the figure above. Of course it’s a pretty facile projection, in that no effort has been made to model the underlying processes. #limitations #futurework 😀. Moreover, the 2040 number (the year the projection reaches universal OA) is clearly too conservative, since it doesn’t account for discontinuities–like the surge in OA we’ll see in 2020 when new European mandates take effect.
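
For the curious, the dotted line amounts to something like this back-of-envelope sketch (the yearly OA percentages below are placeholder values for illustration; the real numbers are in the paper’s dataset):

```python
# Back-of-envelope projection: fit a straight line to OA percentage by
# year, then extend it to see when it crosses 100%. Values are made up.
import numpy as np

years = np.array([1990, 2000, 2010, 2017])
oa_pct = np.array([10.0, 25.0, 40.0, 50.0])  # placeholder percentages

slope, intercept = np.polyfit(years, oa_pct, 1)  # simple linear trend
year_universal = (100.0 - intercept) / slope     # where the line hits 100%
print(f"At this rate, everything is OA around {year_universal:.0f}")
```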

But while the dates can’t be known for certain, what the data makes very clear is that we are headed for an era of universal OA. It’s not a question of if, but when. And that’s great news.

Open Access, coming to a workflow near you: welcome to the year of Ubiquitous OA


Thanks to 20 years of OA innovation and advocacy, today you can legally access around half the recent research literature for free. However, in practice, much of this free literature is not as open as we’d like it to be, because it’s hard for readers to find the OA version.

OA lives on repositories and publisher websites. But very few people visit these sources directly to find a given article. Instead, people rely on the search tools that are already part of their existing workflows. Historically, these haven’t done a great job surfacing OA resources. Google, for instance, often fails to index OA versions, in addition to indexing content of dubious provenance. OA aggregators like BASE, CORE, and OpenAIRE aim to solve this by emphasizing OA coverage, but they require researchers to add a second or third search step to their existing workflows–something researchers have been reluctant to do.

So in addition to the well-known access problem, we also have a discovery problem. On the one hand, there’s a healthy, efficient OA infrastructure in journals and repositories. On the other, we have millions of individual readers doing their own thing. We need to connect these. We need to cover this last mile between the infrastructure and the individual user, and we need to make that connection easy and seamless and ubiquitous. Until we do, OA is writing a check it can’t fully cash.

But the news is good: over the last year, several efforts have emerged to cover that last mile. Our contribution was Unpaywall: an extension that shows a green tab in your browser on articles where there’s an OA version available. Unpaywall has enjoyed lots of success, adding over 100,000 active users in under a year. Moreover, the backend database of Unpaywall (formerly called oaDOI) can be integrated into any number of existing tools, making it easier to spread OA content all over the place. For instance, we’re already seeing over a million uses every day from library link resolvers.

Our most recent integration takes this to a new level, and we’re so excited about it: thanks to a new partnership between Impactstory and Clarivate Analytics, data from Impactstory’s Unpaywall database is now live in the Web of Science, making it the first editorially-curated and publisher-neutral resource to implement this technology. Web of Science has been able to use Unpaywall data to discover and link to millions more OA records amongst their existing content.  This enables millions of Web of Science users around the world to link straight from their search results to a trusted, legal, peer-reviewed OA version—and they can also filter search results by the different versions of OA.

This is a big deal because abstracting and indexing (A&I) systems like Web of Science are currently the most important way researchers access literature. And though it’s by no means the only A&I system out there, Web of Science is the most respected and most prevalent. Every month, millions of users access literature through Web of Science—and now, each and every one of them will see more OA options for articles they might not otherwise discover, right alongside subscribed content. Every day. What a huge change from the days we had to convince folks that OA was legitimate at all! It’s a new era.

A new era: that’s not just a hyperbolic phrase. We think this year marks a turning point in the OA narrative. We’re moving out of the author-focused, advocacy-focused initial phase, and into a more mature era of ubiquitous Open Access, characterized by deep integration of OA into researcher workflows and value-add services built on top of the immense OA corpus. This is the era of user-focused OA.

As OA becomes the default state for published research, tools that centralize, mine, index, search, organize, and extract knowledge from papers suddenly become massively more powerful. Integrations between Unpaywall and commercial services aren’t generating this new era, but they are one of the hallmarks of it. We’re not making new OA, but rather starting to leverage the massive OA corpus now available. In the last year, many others have begun to do this as well. Many, many more will follow.

For years, we in the OA advocate community have been arguing that a critical mass of OA would not just improve scholarly communication, it would transform it. This is finally beginning to happen, and we think this partnership with Web of Science is an early part of that transformation. Now, a subscription to Web of Science—something most academic libraries globally already have—is also a subscription to a database of millions of free-to-read OA articles.

We’ve never been more excited about the future of OA–or more thankful for all the work the OA community as a whole has done to get here. And we can’t wait to keep working together with the community to help make the vision of ubiquitous open access a reality.

Green Open Access comes of age


This morning David Prosser, executive director of Research Libraries UK, tweeted, “So we have @unpaywall, @oaDOI_org, PubMed icons – is the green #OA infrastructure reaching maturity?” (link).

We love this observation, and not just because two of the three projects he mentioned are from us at Impactstory 😀. We love it because we agree: Green OA infrastructure is at a tipping point where two decades of investment, a slew of new tools, and a flurry of new government mandates are about to make Green OA a game-changer for scholarly publishing.

A lot of folks have suggested that Sci-Hub is scholarly publishing’s “Napster moment,” where the internet finally disrupts a very resilient, profitable niche market. That’s probably true. But just as the music industry shut down Napster, Elsevier will likely be able to shut down Sci-Hub. They’ve got both the money and the legal (though not moral) high ground, and that’s a tough combo to beat.

But the future is what comes after Napster. It’s in the iTunes and the Spotifys of scholarly communication. We’ve built something to help create this future. It’s Unpaywall, a browser extension that instantly finds free, legal Green OA copies of paywalled research papers as you browse–like a master key to the research literature. If you haven’t tried it yet, install Unpaywall for free and give it a try.

Unpaywall has reached 5,000 active users in our first ten days of pre-release.

But Unpaywall is far from the only indication that we’re reaching a Green OA inflection point. Today is a great day to appreciate this, as there’s amazing Green OA news everywhere you look:

  • Unpaywall reached the 5,000 active users milestone. We’re now delivering tens of thousands of OA articles to users in over 100 countries, and growing fast.
  • PubMed announced Institutional Repository LinkOut, which links every PubMed article to a free Green copy in institutional repositories where available. This is huge, since PubMed is one of the world’s most important portals to the research literature.
  • The Open Access Button announced a new integration with interlibrary loan that will make it even more useful for researchers looking for open content. Along with the interlibrary loan request, they send instructions to authors to help them self-archive closed publications.

Over the next few years, we’re going to see an explosion in the amount of research available openly, as government mandates in the US, UK, Europe, and beyond take force. As that happens, the raw material is there to build completely new ways of searching, sharing, and accessing the research literature.

We think Unpaywall is a really powerful example: when there’s a big Get It Free button next to the Pay Money button on publisher pages, it starts to look like the game is changing. And it is changing. Unpaywall is just the beginning of the amazing open-access future we’re going to see. We can’t wait!

How to smash an interstellar paywall


Last month, hundreds of news outlets covered an amazing story: seven earth-sized planets were discovered, orbiting a nearby star. It was awesome. Less awesome: the paper with the details, published in the journal Nature, was paywalled. People couldn’t read it.

That’s messed up. We’re working to fix it by releasing our new free Chrome extension, Unpaywall. Using Unpaywall, you can get access to the article, and millions like it, instantly and legally. Let’s learn more.

First, is this really a problem? Surely Google can find the article. I mean, there might be aliens out there. We need to read about this. Here we go, let’s Google for “seven terrestrial planets nature article.” Great, there it is, first result. Click, and…

What, thirty-two bucks to read!? Well that’s that, I quit.

Or maybe there are some ways around the paywall? Well, you can know someone with access. My pal Cindy Wu helped her journal club out this way, offering on Twitter to email them a copy of the paper. But you have to follow Cindy on Twitter for that to work.

Or you could know the right places to look for access. Astronomers generally post their papers on a free web server called arXiv, and sure enough, if you search there, you’ll find the Nature paper. But you have to know about arXiv for that to work. And check out those Google search results again: arXiv doesn’t appear.

Most people don’t know Cindy, or ArXiv. And no one’s paying $32 for an article. So the knowledge in this paper, and thousands of papers like it, is locked away from the taxpayers who funded it. Research becomes the private reserve of those privileged few with the money, experience, or connections to get access.

We’re helping to change that.

Install our new, free Unpaywall Chrome extension and browse to the Nature article. See that little green tab on the right of the page? It means Unpaywall found a free version, the one the authors posted to arXiv. Click the tab. Read for free. No special knowledge or searches or emails or anything else needed.

Today you’ll find Unpaywall’s green tab on ten million articles, and that number is growing quickly thanks to the hard work of the open-access movement. Governments in the US, UK, Europe, and beyond are increasingly requiring that taxpayer-funded research be publicly available, and as they do, Unpaywall will get more and more effective.

Eventually, the paywalls will all fall. Till then, we’ll be standing next to ‘em, handing out ladders. Together with millions of principled scientists, libraries, techies, and activists, we’re helping make scholarly knowledge free to all humans. And whoever else is out there 😀 👽.

How big does our text-mining training set need to be?


We got some great feedback from reviewers of our new Sloan grant, including a suggestion that we be more transparent about our process over the course of the grant. We love that idea, and you’re now reading part of our plan for how to do that: we’re going to be blogging a lot more about what we learn as we go.

A big part of the grant is using machine learning to automatically discover mentions of software use in the research literature. It’s going to be a really fun project because we’ll get to play around with some of the very latest in ML, which is currently The Hotness everywhere you look. And we’re learning a lot as we go. One of the first questions we’ve tackled (also in response to some good reviewer feedback) is: how big does our training set need to be? The machine learning system needs to be trained to recognize software mentions, and to do that we need to give it a set of annotated papers where we, as humans, have marked what a software mention looks like (and doesn’t look like). That training set is called the gold standard. It’s what the machine learning system learns from. What follows is copied from one of our reviewer responses:

We came up with the number of articles to annotate through a combination of theory, experience, and intuition.  As usual in machine learning tasks, we considered the following aspects of the task at hand:

  • prevalence: the number of software mentions we expect in each article
  • task complexity: how much do software-mention words look like other words we don’t want to detect
  • number of features: how many different clues will we give our algorithm to help it decide whether each word is a software mention (e.g., is it a noun, is it in the Acknowledgements section, is it a mix of uppercase and lowercase, etc.)

None of these aspects are clearly understood for this task at this point (one outcome of the proposed project is that we will understand them better once we are done, for future work), but we do have rough estimates.  Software mention prevalence will be different in each domain, but we expect roughly 3 mentions per paper, very roughly, based on previous work by Howison et al. and others.  Our estimate is that the task is moderately complex, based on the moderate f-measures achieved by Pan et al. and Duck et al. with hand-crafted rules.  Finally, we are planning to give our machine learning algorithm about 100 features (50 automatically discovered/generated by word2vec, plus 50 standard and rule-based features, as we discuss in the full proposal).

We then used these estimates.  As is common in machine learning sample size estimation, we started by applying a rule-of-thumb for the number of articles we’d have to annotate if we were to use the most simple algorithm, a multiple linear regression.  A standard rule of thumb (see https://en.wikiversity.org/wiki/Multiple_linear_regression#Sample_size) is 10-20 datapoints are needed for each feature used by the algorithm, which implies we’d need 100 features * 10 datapoints = 1000 datapoints.  At 3 datapoints (software mentions) per article, this rule of thumb suggests we’d need 333 articles per domain.  
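
Spelled out in code, that back-of-envelope arithmetic is just:

```python
# The sample-size rule of thumb from the paragraph above.
features = 100               # ~50 word2vec features + ~50 rule-based ones
datapoints_per_feature = 10  # low end of the 10-20 rule of thumb
mentions_per_article = 3     # rough prevalence estimate (Howison et al.)

datapoints_needed = features * datapoints_per_feature     # 1000
articles_per_domain = datapoints_needed / mentions_per_article
print(f"~{articles_per_domain:.0f} articles per domain")  # ~333
```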

From there we modified our estimate based on our specific machine learning circumstance.  Conditional Random Fields (our intended algorithm) is a more complex algorithm than multiple linear regression, which might suggest we’d need more than 333 articles.  On the other hand, our algorithm will also use “negative” datapoints inherent in the article (all the words in the article that are *not* software mentions, annotated implicitly as not software mentions) to help learn information about what is predictive of being vs not being a software mention — the inclusion of this kind of data for this task means our estimate of 333 articles is probably conservative and safe.
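
For concreteness, here’s a minimal sketch of what training a CRF tagger for software mentions can look like, using the sklearn-crfsuite package (the toy sentences, labels, and features here are invented for illustration; the real pipeline will use the ~100 features described above):

```python
# Minimal CRF sketch for software-mention tagging
# (pip install sklearn-crfsuite). Toy data, for illustration only.
import sklearn_crfsuite

def word_features(sent, i):
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_mixed_case": word != word.lower() and word != word.upper(),
        "prev": sent[i - 1].lower() if i > 0 else "<start>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<end>",
    }

# Toy training data: SOFT marks a software mention, O marks everything else.
sents = [["We", "analyzed", "the", "images", "in", "ImageJ", "."],
         ["All", "statistics", "were", "computed", "in", "SPSS", "."]]
labels = [["O", "O", "O", "O", "O", "SOFT", "O"],
          ["O", "O", "O", "O", "O", "SOFT", "O"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # the learned tag sequences for the toy sentences
```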

Based on this, as well as reviewing the literature for others who have done similar work (Pan et al. used a gold standard of 386 papers to learn their rules, Duck et al. used 1479 database and software mentions to train their rule weighting, etc.), we determined that 300-500 articles per domain was appropriate. We also plan to experiment with combining the domains into one general model — in this approach, the domain would be added as an additional feature, which may prove more powerful overall. This would bring all 1000-1500 articles into a single training set.

Finally, before proposing 300-500 articles per domain, we did a gut-check whether the proposed annotation burden was a reasonable amount of work and cost for the value of the task, and we felt it was.

References

Duck, G., Nenadic, G., Filannino, M., Brass, A., Robertson, D. L., & Stevens, R. (2016). A Survey of Bioinformatics Database and Software Usage through Mining the Literature. PLOS ONE, 11(6), e0157989. http://doi.org/10.1371/journal.pone.0157989

Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (JASIST), Article first published online: 13 MAY 2015. http://doi.org/10.1002/asi.23538

Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871. http://doi.org/10.1016/j.joi.2015.07.012

Comparing Sci-Hub and oaDOI


Nature writer Richard Van Noorden recently asked us for our thoughts about Sci-Hub, since in many ways it’s quite similar to our newest project, oaDOI. We love the idea of comparing the two, and thought he had (as usual) good questions. His recent piece on Sci-Hub founder Alexandra Elbakyan quotes some of our responses to him; we’re sharing the rest below:

Like many OA advocates, we see lots to admire in Sci-Hub.

First, of course, Sci-Hub is making actual science available to actual people who otherwise couldn’t read it. Whatever else you can say about it, that is a Good Thing.

Second, Sci-Hub helps illustrate the power of universal OA. Imagine a world where when you wanted to read science, you just…did? Sci-Hub gives us a glimpse of what that will look like, when universal, legal OA becomes a reality. And that glimpse is powerful, a picture that’s worth a thousand words.

Finally, we suspect and hope that Sci-Hub is currently filling toll-access publishers with roaring, existential panic. Because in many cases that’s the only thing that’s going to make them actually do the right thing and move to OA models.

All this said, Sci-Hub is not the future of scholarly communication, and I think you’d be hard pressed to find anyone who thinks it is. The future is universal open access.

And it’s not going to happen tomorrow. But it is going to happen. And we built oaDOI to be a step along that path. While we don’t have the same coverage as Sci-Hub, we are sustainable and built to grow, along with the growing percentage of articles that have open access versions. And as you point out, we offer a simple, straightforward way to get fulltext.

That interface was not exactly inspired by Sci-Hub, but rather is, I think, an example of convergent evolution. The current workflow for getting scholarly articles is, in many cases, absolutely insane. Of course this is the legacy of a publishing system that is built on preventing people from reading scholarship, rather than helping them read it. It doesn’t have to be this hard. Our goal at oaDOI is to make it less miserable to find and read science, and in that we’re quite similar to Sci-Hub. We just think we’re doing it in a way that’s more powerful and sustainable over the long term.