MAG replacement update: meet OpenAlex!

Last month, we announced that we’re launching a replacement for Microsoft Academic Graph (MAG) this December–just before MAG itself will be discontinued.  We’ve heard from a lot of current MAG users since then. All of them have offered their support and encouragement (which we really appreciate), and all have also been curious to learn more. So: here’s more! It’s a snapshot of what we know right now.  As the project progresses, we’ll have more details to share, keeping everyone as up-to-date as we can.

Name

We’ve now got a name for this project: OpenAlex. We like that it (a) emphasizes Open, and (b) is inspired by the ancient Library of Alexandria — like that fabled institution, OpenAlex will strive to create a comprehensive map of the global scholarly conversation. We’ll start with MAG data, and we’ll expand over time. Along with the name, we’ve got the beginnings of a webpage at openalex.org, and a Twitter account at @OpenAlex_org.

Mailing list

We’ve now got a mailing list where you can sign up for more announcements as they happen. You can sign up for the mailing list on the new OpenAlex homepage.

Funding

Our nonprofit OurResearch recently received a $4.5 million grant from the Arcadia Fund, a charitable fund of Lisbet Rausing and Peter Baldwin. This grant has been in the works for some time, and is a big part of why we felt confident in announcing OpenAlex when we did. In the proposal, there was already a plan for a project similar to OpenAlex, so we were able to quickly pivot the grant details to direct about a million dollars to the development of OpenAlex. 

It’s a three-year grant, which will give us plenty of time to develop and launch OpenAlex, as well as test and launch a long-term revenue model. This model will not be built on selling data (see Openness below), but rather on selling value-added services and service level agreements. We’ve got experience with this approach: we’re funding Unpaywall this way, and it’s been both open and fully self-sustaining for several years now.

Openness

We’re passionate about openness. It’s the “Our” in our name–we think research should belong to all of us, as humans.  Openness is the first of our core values, and it’s a big piece of our recent public commitment to the Principles of Open Scholarly Infrastructure (POSI). A lot of our excitement about OpenAlex comes from the chance to make this rich dataset unprecedentedly open.  Specifically:

  • The code will be open source under an MIT license, hosted on our GitHub account and backed up by Software Heritage.
  • The data will be as openly licensed as possible. Some of the data consists of facts, which have no copyright (see this Crossref post for more about that idea). Where copyright is applicable, and where we have the option, we’ll apply the CC0 waiver. Where other rightsholders are involved, we will encourage them to allow a similarly open license.
  • The data will be free, as in no cost. It will be available via a free API (more details below) with generous limits, as well as periodic data dumps available at no charge (we may require the downloader to cover the 3rd party data transfer fees if these get heavy).

Data we are losing (at least to start)

As mentioned in our initial announcement, OpenAlex will be missing some data that MAG currently has–particularly at our launch in 2021, due to the very tight timeline. More accurately, we’ll have this data, but won’t be keeping it up to date. Specifically we won’t have: 

  • Conference Series and Conference Instances. Importantly, we’ll continue to bring in the vast majority of conference papers. However, we won’t be keeping track of new conferences themselves (eg, The 34th Annual Conference of Foo), and so we’ll lose the ability to link conference papers to those conferences.
  • Citation Contexts (the full text of the paragraph where each citation originally appeared)
  • Most abstracts. However, we will probably have the (minority of) abstracts that publishers send to Crossref or PubMed for redistribution.
  • Full coverage of DOI-unassigned works. MAG is particularly good at finding scholarly papers without a DOI. We’ll be less good, especially at first. We will include many DOI-unassigned works…just not as many as MAG.

There is some other data that may or may not make it into OpenAlex by December 2021. We are still testing these for feasibility:

  • Patents
  • Paper recommendations

Data we are adding

Although we’ll be missing some data, we’ll also be bringing some new data to the party — stuff MAG doesn’t have right now. Specifically, we’ll include:

  • The Open Access status of papers (via the Unpaywall dataset, which has become the industry standard). We’ll be able to tell you whether a given paper is OA or not, its copyright license, and where to find it. 
  • A more comprehensive list of ISSNs associated with each journal, including the ISSN-L, which is helpful for deduplicating journals (see the sketch after this list).
  • ORCID for author clusters. To start with, this will just be in unambiguous cases, when assignment is clear via the Crossref and ORCID datasets. Over time we may apply fancier, more inferential assignments.
  • ROR IDs for institutions, in addition to GRIDs
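
To make the ISSN-L point concrete, here’s a minimal sketch (in Python) of how a downstream user might deduplicate journal records once each record carries its linking ISSN. The field names and sample values are hypothetical, not the actual OpenAlex schema.

```python
from collections import defaultdict

# Hypothetical journal records: a journal often has several ISSNs
# (print, electronic), but exactly one linking ISSN (ISSN-L).
records = [
    {"name": "PeerJ",           "issn": "2167-8359", "issn_l": "2167-8359"},
    {"name": "Example Journal", "issn": "1234-5678", "issn_l": "1234-5678"},  # print
    {"name": "Example Journal", "issn": "8765-4321", "issn_l": "1234-5678"},  # electronic
]

# Grouping by ISSN-L collapses the print and electronic records into one journal.
by_issn_l = defaultdict(list)
for record in records:
    by_issn_l[record["issn_l"]].append(record["issn"])

for issn_l, issns in by_issn_l.items():
    print(issn_l, "->", sorted(issns))
```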

Over the long term, our goal with OpenAlex is to create a truly comprehensive map of the global scholarly conversation, so we’ll be continually looking to expand and enhance the data it includes.

Data dumps

There will be (at least) two ways to get at the data: data dumps, and the API (below). 

Data dumps will be in the same table/column format as the MAG data, so that the downloads can be a drop-in replacement. There may be some additional tables and additional columns for new data we’re adding, and some data values will be missing (both of these are described above), but if you’re running code to ingest MAG dumps right now, you’ll be able to run pretty much the same code to ingest OpenAlex dumps in December. That’s a really important part of this project for us, because we know it will save a lot of folks a lot of time.

We will release new data dumps every 2 weeks, as either a full dump or an incremental update or both (we’re still looking into that). The data will likely be hosted on AWS S3 rather than Microsoft Azure.
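
For anyone sketching out an ingestion pipeline ahead of time, here’s a rough idea of what pulling one table might look like, assuming the dumps keep MAG’s headerless, tab-separated layout and end up in an S3 bucket. The bucket and file names are placeholders, not announced locations.

```python
import boto3
import pandas as pd

# Placeholder locations: the real bucket and file names haven't been announced.
BUCKET = "openalex-dumps-example"
KEY = "mag/Papers.txt"
LOCAL = "Papers.txt"

# Fetch one table of the dump from S3. (If downloads end up requester-pays,
# you'd add the appropriate RequestPayer argument here.)
boto3.client("s3").download_file(BUCKET, KEY, LOCAL)

# MAG tables today are headerless, tab-separated files; column meanings come
# from the published schema, so we just load the raw columns here.
papers = pd.read_csv(LOCAL, sep="\t", header=None)
print(papers.shape)
```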

API 

The other main way to get at the data will be via the API. Here we’ll be doing things quite differently from Microsoft. We will not be supporting the Microsoft Academic Knowledge API or Microsoft Academic Knowledge Exploration Service (MAKES). Instead, we will host a central, open REST API that anyone can query. This API will have two kinds of endpoints: entity endpoints, and slice-and-dice endpoints. Both will be read-only (GET), deliver data in JSON format, and be rate-limited but with high rate-limits.

  • Entity endpoints will let you quickly retrieve a specific scholarly entity (eg paper, person, journal, etc) by its ID. Signatures will look like /doi/:doi and /journal/:issn.
  • Slice-and-dice endpoints will let you query the data with filters to return either item lists, or aggregate group counts. An example call might look something like /query?filter=issn:2167-8359,license:cc-by&groupby=year (that would give you the annual counts of CC-BY-licensed articles from the journal PeerJ). You could also use the slice-and-dice endpoints to do things like build a faceted scholarly search engine, or an evaluation tool. There’s a sketch of both kinds of call just below.
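
Since the API isn’t live yet, here’s only a rough sketch of what those calls might look like from Python. The host name and response shapes are guesses on our part, and the DOI is just an example; only the endpoint signatures above are planned.

```python
import requests

# A guess at the eventual host; the real base URL hasn't been announced yet.
BASE = "https://api.openalex.org"

# Entity endpoint: fetch a single work by DOI (per the /doi/:doi signature above).
work = requests.get(f"{BASE}/doi/10.7717/peerj.4375").json()

# Slice-and-dice endpoint: annual counts of CC-BY articles in PeerJ,
# mirroring the example query in the bullet above.
params = {"filter": "issn:2167-8359,license:cc-by", "groupby": "year"}
counts = requests.get(f"{BASE}/query", params=params).json()

print(work, counts)
```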

Timeline

We appreciate that having a scheduled beta (or alpha!) release of the API and data dump would be very helpful. And we further realize that the sooner we can let you know that schedule, the better. Unfortunately, we don’t know the timeline for these releases yet. Our current best guess is early fall. We’ll certainly be doing our best to get something pushed out there as soon as possible. We encourage you to join the mailing list so we can keep you up to date.

Your comments

Finally, we welcome your comments and questions! We’ve gotten oodles of helpful feedback already, and we really appreciate that. We’re especially interested in hearing about your current use-cases for MAG, since we’re working to prioritize supporting those cases first and foremost. You can tell us via our community survey here, or drop us a line at team@ourresearch.org.

Open Science nonprofit OurResearch receives $4.5M grant from Arcadia Fund

OurResearch, a nonprofit seeking to speed the global adoption of Open Science, announced today that it had been awarded a new 3-year, $4.5M (USD) grant from the UK-based Arcadia, a charitable fund of Lisbet Rausing and Peter Baldwin.

The grant, which follows a 2018 award for $850,000, will help expand two existing open-source software projects, as well as support the launch of two new ones:

  • Unpaywall, launched in 2017, has become the world’s most-used index of Open Access (OA) scholarly papers. The free Unpaywall extension has 400,000 active users, and its underlying database powers OA-related features in dozens of other tools including Web of Science, Scopus, and the European Open Science Monitor. All Unpaywall data is free and open.
  • Unsub is an analytics dashboard that helps academic libraries cancel their large journal subscriptions, freeing up money for OA publishing. Launched in late 2019, Unsub is now used by over 500 major libraries in the US and worldwide, including the national library consortia of Canada, Australia, Greece, Hong Kong, and the UK. 
  • JournalsDB will be a free and open database of scholarly journals. This resource will gather a wide range of data on tens of thousands of journals, emphasizing coverage of emerging open venues. 
  • OpenAlex will be a free and open bibliographic database, cataloging papers, authors, affiliations, citations, and journals. Inspired by the ancient Library of Alexandria, OpenAlex will strive to create a comprehensive map of the global scholarly conversation.  In a recent blog post, the team announced that OpenAlex will be released in time to serve as a replacement for Microsoft Academic Graph, whose discontinuation was also recently announced.

OurResearch’s ongoing operations costs (about $1M annually) are currently covered by earned revenue from service-level agreements. The new funding will go toward accelerating development of new features and tools.

The new tools and features will be developed in keeping with OurResearch’s longstanding commitment to openness. OurResearch recently became one of the first to commit to the Principles of Open Scholarly Infrastructure (POSI), a set of guidelines encouraging openness, sustainability, and responsive governance. OurResearch has always fully shared its source code and datasets, and maintains a transparency webpage publishing salaries, tax filings, and other information. The proposal for this grant is itself shared on Open Grants.

“We are very grateful to the Arcadia Foundation for this grant, which will help us innovate more quickly than ever before. There is an urgent need for open scholarly infrastructure,” said Heather Piwowar, one of OurResearch’s two cofounders. 

“Since our beginning at a hackathon ten years ago, we’ve been working to build sustainable, open, community-oriented software tools to make research more open,” added her cofounder Jason Priem. “We’re so excited about the ways this grant will help us further that vision.” 

Work on the grant is expected to begin at once, with early versions of both JournalsDB and OpenAlex launching later this year.

———————————-

OurResearch is a nonprofit that builds tools to help accelerate the transition to universal Open Science. Started at a hackathon in 2011, it remains committed to creating open, sustainable research infrastructure that solves real-world problems.

Arcadia is a charitable fund of Lisbet Rausing and Peter Baldwin. It supports charities and scholarly institutions that preserve cultural heritage and the environment. Arcadia also supports projects that promote open access and all of its awards are granted on the condition that any materials produced are made available for free online. Since 2002, Arcadia has awarded more than $777 million to projects around the world.

Our Research is now OurResearch

We love our name, but in the last few years we’ve found that it’s a bit confusing. In a lot of contexts, it’s not totally clear whether you’re talking about “Our Research” (the enthusiastic Open Science nonprofit) or “our research” (some research that belongs to some people, some of whom are us). That’s bad.

So, we’re changing the name. Or more accurately, we’re changing the spelling, by getting rid of the space. Our Research is now OurResearch! We’ve updated the spelling in all the places we could think of; this includes modifying our logo (hi-res version here).

If you find outdated usages we missed, please let us know. Also, if you’re using the old spelling or logo anywhere, we’d be thankful if you could change it to the new one, when it’s convenient. 

Thanks for reading, and thanks for your support! We’ll let Drake handle the outro:

We’re building a replacement for Microsoft Academic Graph

Edit 15 June: Read our latest update on OpenAlex (this tool’s new name) here, and sign up for our mailing list here to get new updates as they happen.

This week Microsoft Research announced that their free bibliographic database–Microsoft Academic Graph, or MAG for short–is being discontinued. This is sad news, because MAG was a great source of open scholcomm metadata, including citation counts and author affiliations. MAG data is used in Unsub, as well as several other well-known open science tools.

Thankfully, we’ve got a contingency plan for this situation, which we’ve been working on for a while now. We’re building a successor to MAG. Like all our projects, it’ll be open-source and the data will be free to everyone via data dump and API. It will launch at the end of the year, when MAG is scheduled to disappear.

It’s important to note that this new service will not be a perfect replacement, especially right when it launches. MAG has excellent support for conference proceedings, for example; we won’t match that for a while, if ever.  Instead, we’ll be focusing on supporting the most important use-cases, and building out from there. If you use MAG today, we’d love to hear what your key use-cases are, so we can prioritize accordingly. Here’s where you can tell us.

We plan to have this launched by the time MAG disappears at year’s end. That’s an aggressive schedule, but we’ve built and launched other large projects (Unpaywall, Unsub) in less time. We’ve also got a good head start, since we’ve been working toward this as an internal project for a while now.

We’d love to hear ideas and feedback from the community…drop us a line on Twitter (@our_research) or via email (team@ourresearch.org)!

PS Many thanks to the team behind MAG, who built a really cool thing and made the data free and open. Respect.

Stop by for a demo of Unpaywall Journals at ALA midwinter

We are at ALA Midwinter this weekend! If you are interested in data to help you reassess the value of your Big Deal, stop by table 867 for a demo of the new product, Unpaywall Journals!

Alternatively, you can book a time to make sure you have our undivided attention, or stop us in the halls any time you see us. We’ll be wearing our green Unpaywall t-shirts so we are hard to miss 🙂

Jason and Heather (CNI 2019)

If you don’t do collections or acquisitions, but you are a fan of the Unpaywall link resolver, browser extension, API, or integrations — stop by anyway and grab an Unpaywall sticker. Come and get them before they are gone! From what we’ve heard, ALA Midwinter is a little low on swag, so it’ll be nice not to go home empty-handed…

Yes we have more than this, but not oodles, so come by early 🙂

Email us at team@ourresearch.org if email is better. Looking forward to a great conference! — Heather and Jason.

Update: In May 2020 we changed the name of Unpaywall Journals to Unsub.

Unpaywall Journals — helping librarians get more value from their serials budget

We’re thrilled to announce a new product:

Unpaywall Journals is a data dashboard that combines journal-level citations, downloads, Open Access statistics and more, to help librarians confidently manage their serials collections.

Learn more, join the announcement list, and help spread the word.

It’s going to be big.

Update: In May 2020 we changed the name of Unpaywall Journals to Unsub.


Introducing a new browser extension to make the paywall great again

It’s pretty clear at this point that open access is winning. Of course, the percentage of papers available as OA has been climbing steadily for years. But now on top of this, bold new mandates like Plan S are poised to fast-track the transition to universal open access.

But–and this may seem weird coming from the makers of Unpaywall–are we going too far, too fast? Sure, OA will accelerate discovery, help democratize knowledge, and whatnot. It’s obvious what we have to gain.

Maybe what’s less obvious is what we’re going to lose. We’re going to lose the paywall. And with it, maybe we’re going to lose a little something…of ourselves.

Think about it: some of humankind’s greatest achievements have been walls. You’ve got the Great Wall of China (useful for being seen from space!), the Berlin Wall (useful for being a tourist attraction!), and American levees (useful for driving your Chevy to, when they don’t break!)

Now, are the paywalls around research articles really great cultural achievements? With all due respect: what a fantastically stupid question. Of course they are! Or not! Who knows! It doesn’t matter. What matters is that losing the paywall means change, and that means it’s scary and probably bad.

Why, just the other day we went to read a scholarly article, and we wanted to pay someone money, and THERE WAS NOWHERE TO DO IT. Open Access took that away from us. We were not consulted. This is “progress?”

You used to know where you stood. Specifically, you stood on the other side of a towering paywall that kept you from accessing the research literature. But now: who knows? Who knows?

Well, good news friend: with our new browser extension, you know. That’s right, we are gonna make the paywall great again, with a new browser extension that magically erects a paywall to keep you from reading Open Access articles!

The extension is called Paywall (natch), and it’s elegantly simple: the next time you stumble upon one of those yucky open access articles, Paywall automatically hides it from you, and requires you pay $35 to read. That’s right, we’re gonna rebuild the paywall, and we’re gonna make you pay for it!

With Paywall, you’ll enjoy your reading so much more…after all, you paid $35 for that article so you better like it. And let’s be honest, you were probably gonna blow that money on something useless anyway. This way, at least you know you’re helping make the world a better place, particularly the part of the world that is our Cayman Islands bank account.

Paywalls are part of our heritage as researchers. They feel right. They are time-tested. They are, starting now, personally lucrative for the writers of this blog post. I mean, what more reasons do we need? BUILD. THE. WALL. Install Paywall. Now. Do it. Do it now.

Thanks so much for your continued support. Remember, we can’t stop the march of progress–but together, scratching and clawing and biting as one, maybe we can maybe slow it down a little. At least long enough to make a few extra bucks.

⇨ Click here to install Paywall!

~~~~~~~~~

Impactstory is hiring a full-time developer


We’re looking for a great software developer!  Help us spread the word!  Thanks 🙂

 

ABOUT US

We’re building tools to bring about an open science revolution.  

Impactstory began life as a hackathon project. As the hackathon ended, a few of us migrated into the hotel hallway to continue working, completing the prototype as the hotel started waking up for breakfast. Months of spare-time development followed, then funding. That was five years ago — we’ve got the same excitement for Impactstory today.

We’ve also got great momentum.  The scientific journal Nature recently profiled our main product:  “Unpaywall has become indispensable to many academics, and tie-ins with established scientific search engines could broaden its reach.”  We’re making solid revenue, and it’s time to expand our team.

We’re passionate about open science, and we run our non-profit company openly too.  All of our code is open source, we make our data as open as possible, and we post our grant proposals so that everyone can see both our successful and our unsuccessful ones.  We try to be the change we want to see 🙂

ABOUT THE POSITION

The position is lead dev for Unpaywall, our index of all the free-to-read scholarly papers in the world. Because Unpaywall is surfacing millions of formerly inaccessible open-access scientific papers, it’s growing very quickly, both in terms of usage and revenue. We think it’s a really transformative piece of infrastructure that will enable entire new classes of tools to improve science communication. As a nonprofit, that’s our aim.

We’re looking for someone to take the lead on the tech parts of Unpaywall.  You should know Python and SQL (we use PostgreSQL) and have 5+ years of experience programming, including managing a production software system.  But more importantly, we’re looking for someone who is smart, dedicated, and gets things done! As an early team member you will play a key role in the company as we grow.

The position is remote, with flexible working hours, and plenty of vacation time.  We are a small team so tell us what benefits are important to you and we’ll make them happen.

OUR TEAM

We’re at about a million dollars of revenue (grants and earned income) with just two employees: the two co-founders.  We value kindness, honesty, grit, and smarts. We’re taking our time on this hire, holding out for just the right person.

HOW TO APPLY

Sound like you? Email team@impactstory.org with (1) what appeals to you about this specific job (this part is important to us), (2) a brief summary of your experience directly maintaining and enhancing a production system, (3) a copy of your resume or LinkedIn profile, and (4) a link to your GitHub profile. Thanks!

 

Edited Sept 25, 2018 to add minimum experience and more details on how to apply.

New partnership with Clarivate to help oaDOI find even more Open Access


We’re excited to announce a new partnership with Clarivate Analytics! 

This partnership between Impactstory and Clarivate will help fund better coverage of Open Access in the oaDOI database. The improvements will grow our index of free-to-read fulltext copies, bringing the total number to more than 18 million, along with 86 million article records altogether. All this data will continue to be freely accessible to everyone via our open API.

The partnership with Clarivate Analytics will put oaDOI data in front of users at thousands of new institutions, by integrating our index into the popular Web of Science system.  The oaDOI API is already in use by more than 700 libraries via SFX, and delivers more than 500,000 fulltext articles to users worldwide every day.  It also powers the free Unpaywall browser extension, used by over seventy thousand people in 145 countries.  

You can read more about the partnership in Clarivate’s press release.  We’ll be sharing more details about improvements in the coming months.  Exciting!

How big does our text-mining training set need to be?


We got some great feedback from reviewers on our new Sloan grant, including a suggestion that we be more transparent about our process over the course of the grant. We love that idea, and you’re now reading part of our plan for how to do that: we’re going to be blogging a lot more about what we learn as we go.

A big part of the grant is using machine learning to automatically discover mentions of software use in the research literature. It’s going to be a really fun project because we’ll get to play around with some of the very latest in ML, which is currently The Hotness everywhere you look. And we’re learning a lot as we go. One of the first questions we’ve tackled (also in response to some good reviewer feedback) is: how big does our training set need to be? The machine learning system needs to be trained to recognize software mentions, and to do that we need to give it a set of annotated papers where we, as humans, have marked what a software mention looks like (and doesn’t look like). That training set is called the gold standard. It’s what the machine learning system learns from. The text below is copied from one of our reviewer responses:

We came up with the number of articles to annotate through a combination of theory, experience, and intuition.  As usual in machine learning tasks, we considered the following aspects of the task at hand:

  • prevalence: the number of software mentions we expect in each article
  • task complexity: how much do software-mention words look like other words we don’t want to detect
  • number of features: how many different clues will we give our algorithm to help it decide whether each word is a software mention (eg is it a noun, is it in the Acknowledgements section, is it a mix of uppercase and lowercase, etc)

None of these aspects are clearly understood for this task at this point (one outcome of the proposed project is that we will understand them better once we are done, for future work), but we do have rough estimates.  Software mention prevalence will be different in each domain, but we expect roughly 3 mentions per paper, very roughly, based on previous work by Howison et al. and others.  Our estimate is that the task is moderately complex, based on the moderate f-measures achieved by Pan et al. and Duck et al. with hand-crafted rules.  Finally, we are planning to give our machine learning algorithm about 100 features (50 automatically discovered/generated by word2vec, plus 50 standard and rule-based features, as we discuss in the full proposal).
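
To make the feature idea a bit more concrete, here’s a minimal sketch of what a per-token feature dictionary for a CRF-style tagger could look like. The feature names, section handling, and word2vec plumbing are illustrative guesses, not our final feature set.

```python
def token_features(word, pos_tag, section, w2v_vector):
    """Illustrative per-token features for spotting software mentions.
    Everything here (names, rules) is a hypothetical example."""
    features = {
        "is_noun": pos_tag in ("NOUN", "PROPN"),
        "in_acknowledgements": section.lower().startswith("acknowledg"),
        "mixed_case": word != word.lower() and word != word.upper(),
        "has_digit": any(ch.isdigit() for ch in word),
    }
    # The ~50 word2vec dimensions come in as additional numeric features.
    features.update({f"w2v_{i}": float(v) for i, v in enumerate(w2v_vector)})
    return features

# Example: a token that looks a lot like a software mention.
print(token_features("SPSS", "PROPN", "Methods", [0.12, -0.40, 0.03]))
```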

We then used these estimates.  As is common in machine learning sample size estimation, we started by applying a rule-of-thumb for the number of articles we’d have to annotate if we were to use the most simple algorithm, a multiple linear regression.  A standard rule of thumb (see https://en.wikiversity.org/wiki/Multiple_linear_regression#Sample_size) is 10-20 datapoints are needed for each feature used by the algorithm, which implies we’d need 100 features * 10 datapoints = 1000 datapoints.  At 3 datapoints (software mentions) per article, this rule of thumb suggests we’d need 333 articles per domain.  
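
Here’s the same back-of-the-envelope arithmetic as code, so the assumptions are easy to tweak; the numbers are the rough estimates discussed above, not measured values.

```python
# Rule-of-thumb sample size estimate for the annotation gold standard.
datapoints_per_feature = 10    # low end of the 10-20-per-feature rule of thumb
n_features = 100               # ~50 word2vec + ~50 standard/rule-based features
mentions_per_article = 3       # rough prevalence estimate from prior work

datapoints_needed = datapoints_per_feature * n_features           # 1000
articles_per_domain = datapoints_needed / mentions_per_article    # ~333

print(round(articles_per_domain), "articles per domain")
```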

From there we modified our estimate based on our specific machine learning circumstance.  Conditional Random Fields (our intended algorithm) is a more complex algorithm than multiple linear regression, which might suggest we’d need more than 333 articles.  On the other hand, our algorithm will also use “negative” datapoints inherent in the article (all the words in the article that are *not* software mentions, annotated implicitly as not software mentions) to help learn information about what is predictive of being vs not being a software mention — the inclusion of this kind of data for this task means our estimate of 333 articles is probably conservative and safe.

Based on this, as well as reviewing the literature for others who have done similar work (Pan et al. used a gold standard of 386 papers to learn their rules, Duck et al. used 1479 database and software mentions to train their rule weighting, etc), we determined that 300-500 articles per domain was appropriate. We also plan to experiment with combining the domains into one general model — in this approach, the domain would be added as an additional feature, which may prove more powerful overall. This would bring all 1000-1500 annotated articles into a single gold standard.

Finally, before proposing 300-500 articles per domain, we did a gut-check whether the proposed annotation burden was a reasonable amount of work and cost for the value of the task, and we felt it was.

References

Duck, G., Nenadic, G., Filannino, M., Brass, A., Robertson, D. L., & Stevens, R. (2016). A Survey of Bioinformatics Database and Software Usage through Mining the Literature. PLOS ONE, 11(6), e0157989. http://doi.org/10.1371/journal.pone.0157989

Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (JASIST), Article first published online: 13 MAY 2015. http://doi.org/10.1002/asi.23538

Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871. http://doi.org/10.1016/j.joi.2015.07.012