OurResearch receives $7.5M grant from Arcadia to establish OpenAlex, a milestone development for Open Science

OurResearch is proud to announce a $7.5M grant from Arcadia, to establish a sustainable and completely open index of the world’s research ecosystem. With this 5-year grant, OurResearch expands their open science ambitions to replace paywalled knowledge graphs with OpenAlex.

Researchers, funders, and organizations around the world rely on scientific knowledge graphs to find, perform, and manage their research. For decades, only paywalled proprietary systems have provided this information and they have become unaffordable (costing libraries $1B annually); uninclusive (systematically excluding works from some fields and geographies); and unavailable (even paid subscribers are limited in their use of the data).

OpenAlex indexes more than twice as many scholarly works as the leading proprietary products. The entire knowledge graph and its source code are openly licensed and freely available through data snapshots, an easy-to-use API, and a nascent user interface.

OurResearch has a decade of sustained experience developing tools that advance open science. Funds from Arcadia will fuel the development needed to establish OpenAlex as the go-to scientific knowledge graph for researchers and organizations around the world. Long-term sustainability of OpenAlex will be achieved through value-add premium services.

Development of OpenAlex started only two years ago, and it already serves 115M API calls per month, underlies a major university ranking, is displacing proprietary products at universities, and has established partnerships with national governments. We are excited by these early successes of OpenAlex and its promise to revolutionize scholarly communication and democratize the world’s research.

— — — — 

OurResearch is a nonprofit that builds tools to help accelerate the transition to universal Open Science. Started at a hackathon in 2011, they remain committed to creating open, sustainable research infrastructure that solves real-world problems, like Unpaywall and Unsub.

Arcadia is a charitable foundation that works to protect nature, preserve cultural heritage and promote open access to knowledge. Since 2002 Arcadia has awarded more than $1 billion to organizations around the world.

Coverage in the Financial Times of OpenAlex and the Sorbonne

The Financial Times recently published an article detailing Sorbonne University’s “radical decision” to switch to OpenAlex for its publication database and bibliometric analytics. The article (behind a paywall, unfortunately 😞) came out a little while ago, but we wanted to highlight it here in case you missed it.

The news comes in the context of “a wider pushback against the current model in academic publishing, where researchers publish and review papers for free but have to buy expensive subscriptions to the journals in which they are published to analyse data relating to their work.” It includes a quote from OurResearch/OpenAlex co-founder and CEO Jason Priem: “We felt there’s a mismatch between the values of the academy and the shareholder boardroom. Research is fundamentally about sharing, while for-profits are fundamentally about capturing and enclosing. We aim to create and sustain research infrastructure that’s truly aligned with . . . the values of the research community.”

Exciting times for OpenAlex and open science!

Jack, Andrew. “Sorbonne’s Embrace of Free Research Platform Shakes up Academic Publishing.” Financial Times, December 27, 2023. https://www.ft.com/content/89098b25-78af-4539-ba24-c770cf9ec7c3.

Sorbonne University announces switch to OpenAlex

We at OpenAlex are thrilled at Sorbonne University’s recent announcement that they will be switching to OpenAlex for their publication database and bibliometric analytics, abandoning the use of proprietary products! The Sorbonne, a leading French university, made their announcement in a recent post (available in both English and French). Starting in 2024, they will be ending their subscription to Web of Science and Clarivate’s bibliometric tools. They will instead be adopting “open, free and participatory tools, and [they are] now working on the consolidation of a sustainable and international alternative, relying in particular on the OpenAlex tool.”

OpenAlex has been working closely with the Sorbonne to make this switch possible, and as they note, “A partnership agreement will shortly be established between Sorbonne University and OpenAlex to formalize their contributions and mutual commitments … and to bring about developments that will meet the needs of its community.” This is an extremely exciting milestone for us and for open science! We invite you all to celebrate with us 🎉🎉🎉!

Assigning Institutions — New England Journal of Medicine Case Study

The New England Journal of Medicine uses a non-standard format when presenting authors and their institutional affiliations, which is a problem when we want to keep track of these links in our data. We developed a custom algorithm to solve this problem, preserving more than a hundred thousand author-institution links.

Linking works, authors, and institutions

Part of a diagram from the OpenAlex docs, showing how authors and institutions are linked to works through authorships.
OpenAlex data has links between works, authors, and institutions.

Works, authors, and institutions are three of the basic entities in the OpenAlex data. Keeping track of the relationships between these entities is one of the core things we do. It’s important that we identify these links correctly, so they can be used for downstream tasks like university research intelligence, ranking, etc. Often, this information comes to us via structured data which is not difficult to ingest. Many times, however, the data is messy, and using it is not so straightforward.

Affiliation data in the New England Journal of Medicine

Publications from the New England Journal of Medicine (NEJM) are an example of this messiness. Author affiliations in these papers are presented in a format that is human-readable, but not straightforward for a computer to parse automatically. In most other journals, authors are listed alongside their affiliated institutions, and so it is relatively easy for a program to link them together. NEJM does it a different way—as shown in the screenshot of a paper from the journal’s website, institutions are listed together with the initials of the authors, which in turn correspond to the full author names at the top of the paper.

Screenshot of the affiliations of a paper from the New England Journal of Medicine's website.
Author affiliations in NEJM come in a nonstandard format that is not easy for a computer to parse.

We might hope that the structured metadata we get from Crossref would have the data in a more standard format. But alas, this isn’t the case, as shown in the screenshot of data from the Crossref API.

Screenshot of JSON data from the Crossref API
Data about the paper from the Crossref API is also in the nonstandard format.

There are around 170,000 works from this journal. This is a relatively tiny proportion of the total number of works in OpenAlex. However, NEJM is a highly influential journal in medicine, so it’s a priority that we get this right.

Custom OpenAlex solution to assign institutions to NEJM authors

OpenAlex team member Nolan created a bespoke algorithm specifically for NEJM papers to parse the affiliation strings and assign authors to institutions. This rule-based algorithm identifies the author initials that might correspond to the full names, and uses those as a mapping to get the link from institution to author, as shown in the screenshot from the OpenAlex API of the example paper from above. The full data for this work can be found at https://api.openalex.org/works/W4386208393.
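To make the initials-matching idea concrete, here is a minimal Python sketch. It is not the algorithm OpenAlex actually runs: the shape of the affiliation string ("institution (A.B., C.D.)"), the function names, and the sample authors and institutions are all assumptions made up for illustration, and real NEJM strings have many more edge cases (hyphenated names, name particles, shared initials).

```python
import re
from collections import defaultdict


def initials(full_name):
    # First letter of each whitespace-separated name part,
    # e.g. "Jane Q. Public" -> "J.Q.P." (assumed convention).
    return "".join(part[0].upper() + "." for part in full_name.split())


# Assumed shape of one affiliation segment: "<institution text> (A.B., C.D.E.)"
SEGMENT = re.compile(r"(?P<inst>[^()]+?)\s*\((?P<inits>[A-Z][A-Z.,\s-]*)\)")


def link_authors(affiliation_string, author_names):
    """Map each author name to the institutions whose initials list names them."""
    by_initials = defaultdict(list)
    for name in author_names:
        by_initials[initials(name)].append(name)

    links = defaultdict(list)
    for match in SEGMENT.finditer(affiliation_string):
        # Drop separators and leading filler words around the institution text.
        institution = match.group("inst").strip(" ,;")
        institution = re.sub(r"^(?:From\s+the|From|and)\s+", "", institution)
        for token in match.group("inits").split(","):
            candidates = by_initials.get(token.strip(), [])
            # Only link when the initials identify exactly one author.
            if len(candidates) == 1:
                links[candidates[0]].append(institution)
    return dict(links)


# Hypothetical input, not taken from a real paper.
example_string = (
    "From the Department of Medicine, Example University, Boston (J.Q.P., A.B.); "
    "and Sample Hospital, London (A.B., R.S.T.)"
)
example_authors = ["Jane Q. Public", "Alex Brown", "Rosa S. Tanaka"]
example_links = link_authors(example_string, example_authors)
```

In this sketch, initials shared by two authors are left unlinked rather than guessed, which is one conservative way to handle ambiguity.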

We have been able to apply this to around 35,000 articles, amounting to 158,000 institutional affiliations. Additionally, we identified about ten thousand raw affiliation strings that we couldn’t match to an institution, but that can still prove useful to our users.

The NEJM case is an example of the attention to data and extra effort that is part of the value that OpenAlex hopes to provide. The data can be messy sometimes. It’s our mission to help make sense of it, so the world can have access to high-quality, free and open data.

Screenshot of JSON data from the OpenAlex API
OpenAlex data has institutional affiliations as structured, fully linked data.

New study shows OpenAlex is a good alternative to Scopus for demographic research

Highlights

  • New research from the Max Planck Institute for Demographic Research analyzes global migration of scholars, using bibliometric data. They do a side-by-side comparison of this analysis between Scopus and OpenAlex data.
  • Counts of scholars by country are highly correlated between Scopus and OpenAlex.
  • Migration events are less correlated between the two, but trends in migration between top pairs of countries are consistent between them. There is higher correlation with Western countries, and OpenAlex has more coverage of non-Western countries.
  • OpenAlex is open. Scopus is not. This puts limits on how researchers can perform and share this type of analysis.

A new working paper[1] from researchers at the Max Planck Institute for Demographic Research (MPIDR) uses bibliometric data to study the migration patterns of scholars between countries. Within the field of demography, there is a lack of high-quality data about human migration; so this use of scholarly publication data to infer global-scale migration of scholars is a welcome contribution. They compare the use of two sources of large-scale bibliometric data: “Elsevier’s proprietary Scopus and the openly available OpenAlex.”

The findings of the paper suggest that OpenAlex is a source of open data that shows promise as a replacement for the more established—but more restricted—Scopus data. Overall counts of scholars between countries over time have a high correlation between Scopus and OpenAlex, “with a median correlation close to 1.” The analysis of migration events between the two databases shows less correlation overall, but among the top pairs of countries, “the bilateral flows … are consistent in the two databases.” The authors go on to discuss the reason for the differences, noting that “[this] could signal a large difference in coverage of individual migration trajectories between these two databases and can also stem from the small net migration rates which fluctuate with small differences in measurement rather than population counts which are larger and small changes do not cause them to fluctuate.” In other words, while smaller scale trends may present differently between different data sources due to the nuances and idiosyncrasies of each one, the larger-scale trends are consistent.
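For readers who want a feel for what a "median correlation" across countries means, here is a small Python sketch. It is only an illustration under our own assumptions, not the authors' code: we assume a Pearson correlation computed per country over yearly scholar counts, followed by a median across countries, and the country codes and counts below are invented.

```python
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def median_country_correlation(counts_a, counts_b):
    """Correlate yearly counts per country, then take the median over countries."""
    shared = sorted(set(counts_a) & set(counts_b))
    return median(pearson(counts_a[c], counts_b[c]) for c in shared)


# Invented yearly scholar counts for two made-up countries, "AA" and "BB".
database_one = {"AA": [100, 120, 150, 170], "BB": [10, 12, 11, 15]}
database_two = {"AA": [200, 240, 300, 340], "BB": [30, 33, 31, 40]}
result = median_country_correlation(database_one, database_two)
```

Note that the two sources can disagree a lot on absolute counts and still correlate strongly, because correlation only measures whether the counts rise and fall together.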

The results also suggest that, in some cases, OpenAlex may be an even better resource than Scopus for this analysis. The authors note that the magnitude of migration flows is much larger in OpenAlex compared to Scopus, and that “this could indicate that the higher coverage of publications in OpenAlex might help discover some under-explored scholarly migration corridors worldwide.”

The paper does note some limitations of using OpenAlex as opposed to Scopus for their purposes, specifically, “the quality of the author name disambiguation and identifiers in OpenAlex needs further evaluation in future research.” Evaluating the job that OpenAlex has done assigning authors to all of their papers was outside the scope of this research, but they are able to refer to established research validating the Scopus data. We look forward to this validation on the OpenAlex data both from us and from other independent researchers. We’re also happy to say that we are continually making improvements in our author name disambiguation, so our data will be getting better and better!

Finally, there is the big difference between the two services: OpenAlex is open, while Scopus is not. The authors touch on this several times throughout the paper, both directly and indirectly. They mention that they must limit the years of their analysis, due to “our license terms for Scopus data”. In their Methods section, they describe the multiple steps they had to take to gain access to and acquire the Scopus data, while for OpenAlex, the process was much simpler: “we obtain the publicly available data and process it ourselves”. And in the Acknowledgements section, they explain that the Scopus license terms only permit sharing aggregated results, and no individual data is shared.

Overall, we are very proud that OpenAlex is being recognized as an emerging high-quality, completely open source of bibliometric data that can be used for demographic research. The lack of restrictions on our data is extremely important as it eliminates barriers that researchers face in doing their work. Please check out their paper to learn more about their work!


[1] Akbaritabar, A., Theile, T. & Zagheni, E. Global flows and rates of international migration of scholars. WP-2023-018 https://www.demogr.mpg.de/en/publications_databases_6118/publications_1904/mpidr_working_papers/global_flows_and_rates_of_international_migration_of_scholars_7729 (2023) doi:10.4054/MPIDR-WP-2023-018.

Introducing Jason Portenoy, newest full-time team member at OpenAlex

Photo of Jason Portenoy

Hi, I’m Jason Portenoy, and I’m very happy to be joining OurResearch as the newest full-time team member! As a data engineer, I will be focusing my efforts on user engagement and outreach for OpenAlex. It is my responsibility to understand the OpenAlex dataset—its strengths and limitations—and work with the user community to improve it and make it easier to use.

I completed my PhD in Information Science at the University of Washington, studying the use of the scholarly literature as data to curate, explore, and evaluate scientific research. This field—known by various terms including scientometrics, science of science, metascience, and Big Scholarly Data—captivated me from the moment I learned about it. As the scale of scientific output continues to increase well beyond the capacity of any individual to make sense of it, the need for new tools and techniques to help becomes more and more pronounced. Working with Dr. Jevin West at the UW Datalab, I developed these tools and techniques—analyzing and visualizing scholarly data, and building recommender systems to connect scientists to new research and ideas. I extended this work through projects with Semantic Scholar, the Chan-Zuckerberg Initiative, and JSTOR.

While working on these tools and analyses, I came to rely on several scholarly data sets, such as Web of Science and Microsoft Academic Graph. Through my experience, I became an advocate for having high-quality, open, and accessible data for researchers and builders to use. A solid foundation of quality data will strengthen all downstream applications, from simple counts and bibliometric statistics, to advanced natural language processing and complex systems approaches.

Joining the OpenAlex team is a fantastic opportunity for me to contribute to the future of scholarly data. When Microsoft decided to end its academic service, many others in the community and I wondered what would come next. It has become clear that OpenAlex will play a key role in the future of this field. I come to this position with technical training as a data engineer and data scientist, as well as experience with scholarly data. My goal is to work with the community of users to continually improve the OpenAlex data and experience. If there’s anything you think I might be able to help with, please let us know!

OpenAlex launch!

OpenAlex launched this week! (January 3rd 2022 for those reading from the future 🙂 )

As expected:

We’re now pulling in new content on our own. Until now, we’ve been getting new works, authors, and other entities from MAG. Now that MAG is gone, we’re gathering all of our own data from the big wide internet.

The new REST API is launched! This is a much faster and easier way to access the OpenAlex database than downloading and installing the snapshot. It’s completely open and free–you don’t even need a user account or token.

We’ve now got oodles of new documentation here: https://docs.openalex.org/

Slight change of plan:

The MAG Format snapshot is now hosted for free, thanks to the AWS Open Data program. This will cover the data transfer fees (which turned out to be $70!) so you don’t have to. Here are the new instructions on how to download the MAG format snapshot to your machine.


We are extending the beta period for OpenAlex; we’ll emerge from beta in February. This is mostly in response to discovering issues with the coverage and structure of existing data sources including MAG. Extending the beta reflects the fact that the data will improve significantly between now and February.

Huge exciting news:

OpenAlex was built to offer a drop-in replacement for MAG. We’re doing that. But today, we’re also unveiling some moves toward a more innovative future for OpenAlex:

We’ve now built around a simple new five-entity model: works, authors, venues (journals and repositories), institutions, and concepts. Everything in OpenAlex is one of these entities, or a connection between them. Each type of entity has its own API endpoint.
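As a rough illustration of the one-endpoint-per-entity idea, the Python sketch below builds API URLs for the five entity types. The host and the /works/<id> path match the work URL quoted elsewhere on this blog (https://api.openalex.org/works/W4386208393); that the other four entity types follow the same pattern under these exact names is an assumption, as are the helper names. Paths may have changed since this post, so please check the documentation before relying on them.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://api.openalex.org"

# The five entity types named in this post; endpoint names are assumed to match.
ENTITY_TYPES = ("works", "authors", "venues", "institutions", "concepts")


def entity_url(entity_type, entity_id=None):
    """Build the URL for an entity list, or for one entity when an ID is given."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    url = f"{BASE_URL}/{entity_type}"
    return f"{url}/{entity_id}" if entity_id else url


def fetch_entity(entity_type, entity_id):
    """Fetch one entity as a dict with a plain GET (no account or token)."""
    with urlopen(entity_url(entity_type, entity_id)) as response:
        return json.load(response)
```

For example, entity_url("works", "W4386208393") gives the work URL mentioned above, and fetch_entity("works", "W4386208393") would download that record as JSON.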

We’ve got a new Standard Format for the snapshot, one that’s closely tied to both the five-entity model and the API. In the future, this will become the only supported format. The MAG format is now deprecated and will go away on July 1, 2022.

In conclusion:

Thanks for your support, and please send us any feedback you have! In particular, let us know about bugs…it’s early days, and there will be plenty. We’re currently fixing these very quickly. Happy New Year, and happy OpenAlexing!

Best,
Jason and Heather

New perspective for OA: Date of Observation

We’d like to share one of the fun parts of our recent preprint. It’s fun because the concept of Date of Observation helps to untangle issues around embargoes — and also because we think we came up with a neat way to explain what is otherwise a fairly complicated concept, and hopefully make it accessible to everybody.

See what you think — here is our description of the Date of Observation, from section 3.3 of the preprint:

Let’s imagine two observers, Alice (blue) and Bob (red), shown by the two stick figures at the top of the figure:

Alice lives at the end of Year 1–that’s her “Date Of Observation.” Looking down, she can see all 8 articles (represented by solid colored dots) published in Year 1, along with their access status: Gold OA, Green OA, or Closed. The Year of Publication for all eight of these articles is Year 1.

Alice likes reading articles, so she decides to read all eight Year 1 articles, one by one.

She starts with Article A. This article started its life early in the year as Closed. Later that year, though–after an OA Lag of about six months–Article A became Green OA as its author deposited a manuscript (the green circle) in their institutional repository. Now, at Alice’s Date of Observation, it’s open! Excellent. Since Alice is inclined toward organization, she puts Article A in a stack of Green articles she’s keeping below.

Now let’s look at Bob. Bob lives in Alice’s future, in Year 3 (i.e., his “Date of Observation” is Year 3). Like Alice, he’s happy to discover that Article A is open. He puts it in his stack of Green OA articles, which he’s further organized by date of their publication (it goes in the Year 1 stack).

Next, Alice and Bob come to Article B, which is a tricky one. Alice is sad: she can’t read the article, and places it in her Closed stack. Unbeknownst to poor Alice, she is a victim of OA Lag, since Article B will become OA in Year 2. By contrast, Bob, from his comfortable perch in the future, is able to read the article. He places it in his Green Year 1 stack. He now has two articles in this stack, since he’s found two Green OA articles in Year 1.

Finally, Alice and Bob both find Article C is closed, and place it in the closed stack for Year 1. We can model this behavior for a hypothetical reader at each year of observation, giving us their view on the world–and that’s exactly the approach we take in this paper.

Now, let’s say that Bob has decided he’s going to figure out what OA will look like in Year 4. He starts with Gold. This is easy, since Gold articles are open immediately upon publication, and publication date is easy to find from article metadata. So, he figures out how many articles were Gold for Alice (1), how many in Year 2 (3), and how many in his own Year 3 (6). Then he computes percentages, and graphs them out using the stacked area chart at the bottom of the figure. From there, it’s easy to extrapolate forward a year.

For Green, he does the same thing–but he makes sure to account for OA Lag. Bob is trying to draw a picture of the world every year, as it appeared to the denizens of that world. He wants Alice’s world as it appeared to Alice, and the same for Year 2, and so on. So he includes OA Lag in his calculations for Green OA, in addition to publication year. Once he has a good picture from each Date Of Observation, and a good understanding of what the OA Lag looks like, he can once again extrapolate to find Year 4 numbers.
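The Alice-and-Bob story can also be written as a few lines of Python. This is only a toy sketch of the Date of Observation idea, not the model from the preprint: the class and function names are made up, and the exact dates for Articles A, B, and C are assumptions chosen to match the story (Year 1 runs from 0.0 to 1.0, so Alice observes at 1.0 and Bob at 3.0).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    name: str
    published: float            # time of publication, in years
    oa_type: str                # "gold", "green", or "closed"
    became_oa: Optional[float]  # time an open copy appeared; None if it never did


def status_at(article, date_of_observation):
    """OA status of an article as seen by a reader at the given date."""
    if article.became_oa is not None and article.became_oa <= date_of_observation:
        return article.oa_type
    return "closed"


def snapshot(articles, date_of_observation):
    """Count articles by status, using only what is visible at that date."""
    visible = [a for a in articles if a.published <= date_of_observation]
    return Counter(status_at(a, date_of_observation) for a in visible)


# Assumed dates: A goes Green after a six-month lag, B goes Green in Year 2.
year_one_articles = [
    Article("A", published=0.2, oa_type="green", became_oa=0.7),
    Article("B", published=0.5, oa_type="green", became_oa=1.5),
    Article("C", published=0.8, oa_type="closed", became_oa=None),
]

alice_view = snapshot(year_one_articles, date_of_observation=1.0)
bob_view = snapshot(year_one_articles, date_of_observation=3.0)
```

With these assumed dates, Alice sees one Green and two Closed articles among A, B, and C, while Bob sees two Green and one Closed for the same publication year. That gap is the OA Lag Bob has to account for.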

Bob is using the same approach we will use in this paper–although in practice, we will find it to be rather more complex, due to varying lengths of OA Lag, additional colors of OA, and a lack of stick figures.

The Future of OA: A large-scale analysis projecting Open Access publication and readership

We are excited to announce our most recent study has just been posted on bioRxiv:

Piwowar, Priem, Orr (2019) The Future of OA: A large-scale analysis projecting Open Access publication and readership. bioRxiv: https://doi.org/10.1101/795310

This is the largest, most comprehensive analysis ever to predict the future of Open Access. Importantly, we look not only at publication trends but also at *viewership* — what do people want to read, and how much of it is OA?

The abstract is included below, and we’ll be highlighting a few of the cool findings in subsequent blog posts. You can read the full paper here (DOI not resolving yet). All the raw data and code are available, as is our style: http://doi.org/10.5281/zenodo.3474007. Enjoy, and let us know what you think!


Understanding the growth of open access (OA) is important for deciding funder policy, subscription allocation, and infrastructure planning.

This study analyses the number of papers available as OA over time. The model includes both OA embargo data and the relative growth rates of different OA types over time, based on the OA status of 70 million journal articles published between 1950 and 2019.

The study also looks at article usage data, analyzing the proportion of views to OA articles vs views to articles which are closed access. Signal processing techniques are used to model how these viewership patterns change over time. Viewership data is based on 2.8 million uses of the Unpaywall browser extension in July 2019.

We found that Green, Gold, and Hybrid papers receive more views than their Closed or Bronze counterparts, particularly Green papers made available within a year of publication. We also found that the proportion of Green, Gold, and Hybrid articles is growing most quickly.

In 2019:

  • 31% of all journal articles are available as OA
  • 52% of all article views are to OA articles

Given existing trends, we estimate that by 2025:

  • 44% of all journal articles will be available as OA
  • 70% of all article views will be to OA articles

The declining relevance of closed access articles is likely to change the landscape of scholarly communication in the years to come.


Now, a better way to find and reward open access


There’s always been a wonderful connection between altmetrics and open science.

Altmetrics have helped to demonstrate the impact of open access publication. And since the beginning, altmetrics have excited and provoked ideas for new, open, and revolutionary science communication systems. In fact, the two communities have overlapped so much that altmetrics has been called a “school” of open science.

We’ve always seen it that way at Impactstory. We’re uninterested in bean-counting. We are interested in setting the stage for a second scientific revolution, one that will happen when two open networks intersect: a network of instantly-available diverse research products and a network of comprehensive, open, distributed significance indicators.

So along with promoting altmetrics, we’ve also been big on incentives for open access. And today we’re excited that we got a lot better at it.

We’re launching a new Open Access badge, backed by a really accurate new system for automatically detecting fulltext for online resources. It finds not just Gold OA, but also self-archived Green OA, hybrid OA, and born-open products like research datasets.

A lot of other projects have worked on this sticky problem before us, including the Open Article Gauge, OACensus, Dissemin, and the Open Access Button. Admirably, these have all been open-source projects, so we’ve been able to reuse lots of their great ideas.

Then we’ve added oodles of our own ideas and techniques, along with plenty of research and testing. The result? Impactstory is now the best, most accurate way to automatically assess openness of publications. We’re proud of that.

And we know this is just the beginning! Fork our code or send us a pull request if you want to make this even better. Here’s a list of where we check for OA to get you started:

  • The Directory of Open Access Journals, to see if it’s in their index of OA journals.
  • CrossRef’s license metadata field, to see if the publisher has uploaded an open license.
  • Our own custom list of DOI prefixes, to see if it’s in a known preprint repo.
  • DataCite, to see if it’s an open dataset.
  • The wonderful BASE OA search engine to see if there’s a Green OA copy of the article.
  • Repository pages directly, in cases where BASE was unable to determine openness.
  • Journal article pages directly, to see if there’s a free PDF link (this is great for detecting hybrid OA)
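If you want a mental model of how a list of checks like this can fit together, here is a small Python sketch. It is not Impactstory's code: the checks are stubs that read made-up fields from a dictionary instead of calling DOAJ, CrossRef, DataCite, or BASE, the DOI prefix is only a placeholder, and running the checks in list order and stopping at the first hit is our assumption for the sketch.

```python
# Placeholder prefix list for illustration only; not the real custom list.
PREPRINT_PREFIXES = ("10.1101/",)

# Each entry pairs a human-readable reason with a stub check on a record dict.
# The dictionary keys are hypothetical field names invented for this sketch.
CHECKS = [
    ("journal indexed in DOAJ", lambda r: r.get("journal_in_doaj", False)),
    ("open license in CrossRef metadata", lambda r: r.get("crossref_open_license", False)),
    ("known preprint DOI prefix", lambda r: r.get("doi", "").startswith(PREPRINT_PREFIXES)),
    ("open dataset in DataCite", lambda r: r.get("datacite_open_dataset", False)),
    ("Green OA copy found via BASE", lambda r: r.get("base_green_copy", False)),
    ("open copy on repository page", lambda r: r.get("repository_page_open", False)),
    ("free PDF link on article page", lambda r: r.get("free_pdf_link", False)),
]


def detect_open_access(record):
    """Return the first reason a record looks open, or None if no check matches."""
    for reason, check in CHECKS:
        if check(record):
            return reason
    return None
```

For example, detect_open_access({"doi": "10.1101/795310"}) reports the preprint-prefix reason, while a record that matches none of the stubs comes back as None (no badge).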

What’s it mean for you? Well, Impactstory is now a powerful tool for spreading the word about open access. We’ve found that seeing that openness badge–or OH NOES lack of a badge!–on their new profile is powerful for a researcher who might otherwise not think much about OA.

So, if you care about OA: challenge your colleagues to go make a free profile and see how open they really are. Or you can use our API to learn about the openness of groups of scholars (great for librarians, or for a presentation to your department). Just hit the endpoint http://impactstory.org/u/someones_orcid_id to find out the openness stats for anyone.

Hit us up with any thoughts or comments, and enjoy!