OurResearch receives $7.5M grant from Arcadia to establish OpenAlex, a milestone development for Open Science

OurResearch is proud to announce a $7.5M grant from Arcadia, to establish a sustainable and completely open index of the world’s research ecosystem. With this 5-year grant, OurResearch expands their open science ambitions to replace paywalled knowledge graphs with OpenAlex.

Researchers, funders, and organizations around the world rely on scientific knowledge graphs to find, perform, and manage their research. For decades, only paywalled proprietary systems have provided this information and they have become unaffordable (costing libraries $1B annually); uninclusive (systematically excluding works from some fields and geographies); and unavailable (even paid subscribers are limited in their use of the data).

OpenAlex indexes more than twice as many scholarly works as the leading proprietary products, and the entire knowledge graph and its source code are openly licensed and freely available through data snapshots, an easy-to-use API, and a nascent user interface.

OurResearch has a decade of sustained experience developing tools that advance open science. Funds from Arcadia will fuel the development needed to establish OpenAlex as the go-to scientific knowledge graph for researchers and organizations around the world. Long-term sustainability of OpenAlex will be achieved through value-add premium services.

Development of OpenAlex started only two years ago and it already serves 115M API calls per month; underlies a major university ranking; is displacing proprietary products at universities; and has established partnerships with national governments. We are excited by these early successes of OpenAlex and its promise to revolutionize scholarly communication and democratize the world’s research.

— — — — 

OurResearch is a nonprofit that builds tools to help accelerate the transition to universal Open Science. Started at a hackathon in 2011, they remain committed to creating open, sustainable research infrastructure that solves real-world problems, like Unpaywall and Unsub.

Arcadia is a charitable foundation that works to protect nature, preserve cultural heritage and promote open access to knowledge. Since 2002 Arcadia has awarded more than $1 billion to organizations around the world.

Coverage in the Financial Times of OpenAlex and the Sorbonne

The Financial Times recently published an article detailing Sorbonne University’s “radical decision” to switch to OpenAlex for its publication database and bibliometric analytics. The article (behind a paywall, unfortunately 😞) came out a little while ago, but we wanted to highlight it here in case you missed it.

The news comes in the context of “a wider pushback against the current model in academic publishing, where researchers publish and review papers for free but have to buy expensive subscriptions to the journals in which they are published to analyse data relating to their work.” It includes a quote from OurResearch/OpenAlex co-founder and CEO Jason Priem: “We felt there’s a mismatch between the values of the academy and the shareholder boardroom. Research is fundamentally about sharing, while for-profits are fundamentally about capturing and enclosing. We aim to create and sustain research infrastructure that’s truly aligned with . . . the values of the research community.”

Exciting times for OpenAlex and open science!

Jack, Andrew. “Sorbonne’s Embrace of Free Research Platform Shakes up Academic Publishing.” Financial Times, December 27, 2023. https://www.ft.com/content/89098b25-78af-4539-ba24-c770cf9ec7c3.

Sorbonne University announces switch to OpenAlex

We at OpenAlex are thrilled at Sorbonne University’s recent announcement that they will be switching to OpenAlex for their publication database and bibliometric analytics, abandoning the use of proprietary products! The Sorbonne, a leading French university, made their announcement in a recent post (click here for the English version; click here for the French version). Starting in 2024, they will be ending their subscription to Web of Science and Clarivate’s bibliometric tools. They will instead be adopting “open, free and participatory tools, and [they are] now working on the consolidation of a sustainable and international alternative, relying in particular on the OpenAlex tool.”

OpenAlex has been working closely with the Sorbonne to make this switch possible, and as they note, “A partnership agreement will shortly be established between Sorbonne University and OpenAlex to formalize their contributions and mutual commitments … and to bring about developments that will meet the needs of its community.” This is an extremely exciting milestone for us and for open science! We invite you all to celebrate with us 🎉🎉🎉!

Assigning Institutions — New England Journal of Medicine Case Study

The New England Journal of Medicine uses a non-standard format when presenting authors and their institutional affiliations, which is a problem when we want to keep track of these links in our data. We developed a custom algorithm to solve this problem, preserving more than a hundred thousand author-institution links.

Linking works, authors, and institutions

Part of a diagram from the OpenAlex docs, showing how authors and institutions are linked to works through authorships.
OpenAlex data has links between works, authors, and institutions.

Works, authors, and institutions are three of the basic entities in the OpenAlex data. Keeping track of the relationships between these entities is one of the core things we do. It’s important that we identify these links correctly, so they can be used for downstream tasks like university research intelligence, ranking, etc. Often, this information comes to us via structured data which is not difficult to ingest. Many times, however, the data is messy, and using it is not so straightforward.

Affiliation data in the New England Journal of Medicine

Publications from the New England Journal of Medicine (NEJM) are an example of this messiness. Author affiliations in these papers are presented in a format that is human-readable, but not straightforward for a computer to parse automatically. In most other journals, authors are listed alongside their affiliated institutions, and so it is relatively easy for a program to link them together. NEJM does it a different way—as shown in the screenshot of a paper from the journal’s website, institutions are listed together with the initials of the authors, which in turn correspond to the full author names at the top of the paper.

Screenshot of the affiliations of a paper from the New England Journal of Medicine's website.
Author affiliations in NEJM come in a nonstandard format that is not easy for a computer to parse.

We might hope that the structured metadata we get from Crossref would have the data in a more standard format. But alas, this isn’t the case, as shown in the screenshot of data from the Crossref API.

Screenshot of JSON data from the Crossref API
Data about the paper from the Crossref API is also in the nonstandard format.

There are around 170,000 works from this journal. This is a relatively tiny proportion of the total number of works in OpenAlex. However, NEJM is a highly influential journal in medicine, so it’s a priority that we get this right.

Custom OpenAlex solution to assign institutions to NEJM authors

OpenAlex team member Nolan created a bespoke algorithm specifically for NEJM papers to parse the affiliation strings and assign authors to institutions. This rule-based algorithm identifies the author initials that might correspond to the full names, and uses those as a mapping to get the link from institution to author, as shown in the screenshot from the OpenAlex API of the example paper from above. The full data for this work can be found at https://api.openalex.org/works/W4386208393.
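To make the initials-matching idea concrete, here is a minimal sketch. It is not the actual OpenAlex algorithm: the function names, the initials convention ("J.Q.S."), and the pre-split (institution, initials) pairs are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def initials(full_name):
    """Turn 'Jane Q. Smith' into 'J.Q.S.' (an assumed initials convention)."""
    parts = re.split(r"[\s\-]+", full_name.strip())
    return "".join(p[0].upper() + "." for p in parts if p)


def link_authors(authors, affiliation_groups):
    """Map each author to the institutions whose initials list names them.

    affiliation_groups is a list of (institution, "J.Q.S., A.B.") pairs,
    a hypothetical pre-split form of the raw affiliation string.
    """
    by_initials = defaultdict(list)
    for name in authors:
        by_initials[initials(name)].append(name)

    links = defaultdict(list)
    for institution, initials_list in affiliation_groups:
        for token in initials_list.split(","):
            matches = by_initials.get(token.strip(), [])
            # Only link when the initials identify exactly one author.
            if len(matches) == 1:
                links[matches[0]].append(institution)
    return dict(links)
```

A real implementation would also need to handle authors who share initials, which this sketch simply skips.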

We have been able to apply this to around 35,000 articles, amounting to 158,000 institutional affiliations. Additionally, we identified about ten thousand raw affiliation strings that we couldn’t match to an institution, but that can still prove useful to our users.

The NEJM case is an example of the attention to data and extra effort that is part of the value that OpenAlex hopes to provide. The data can be messy sometimes. It’s our mission to help make sense of it, so the world can have access to high-quality, free and open data.

Screenshot of JSON data from the OpenAlex API
OpenAlex data has institutional affiliations as structured, fully linked data.

New study shows OpenAlex is a good alternative to Scopus for demographic research

Highlights

  • New research from the Max Planck Institute for Demographic Research analyzes global migration of scholars, using bibliometric data. They do a side-by-side comparison of this analysis between Scopus and OpenAlex data.
  • Counts of scholars by country are highly correlated between Scopus and OpenAlex.
  • Migration events are less correlated between the two, but trends in migration between top pairs of countries are consistent between them. There is higher correlation with Western countries, and OpenAlex has more coverage of non-Western countries.
  • OpenAlex is open. Scopus is not. This puts limits on how researchers can perform and share this type of analysis.

A new working paper[1] from researchers at the Max Planck Institute for Demographic Research (MPIDR) uses bibliometric data to study the migration patterns of scholars between countries. Within the field of demography, there is a lack of high-quality data about human migration; so this use of scholarly publication data to infer global-scale migration of scholars is a welcome contribution. They compare the use of two sources of large-scale bibliometric data: “Elsevier’s proprietary Scopus and the openly available OpenAlex.”

The findings of the paper suggest that OpenAlex is a source of open data that shows promise as a replacement for the more established—but more restricted—Scopus data. Overall counts of scholars between countries over time have a high correlation between Scopus and OpenAlex, “with a median correlation close to 1.” The analysis of migration events between the two databases shows less correlation overall, but among the top pairs of countries, “the bilateral flows … are consistent in the two databases.” The authors go on to discuss the reason for the differences, noting that “[this] could signal a large difference in coverage of individual migration trajectories between these two databases and can also stem from the small net migration rates which fluctuate with small differences in measurement rather than population counts which are larger and small changes do not cause them to fluctuate.” In other words, while smaller scale trends may present differently between different data sources due to the nuances and idiosyncrasies of each one, the larger-scale trends are consistent.

The results also suggest that, in some cases, OpenAlex may be an even better resource than Scopus for this analysis. The authors note that the magnitude of migration flows is much larger in OpenAlex compared to Scopus, and that “this could indicate that the higher coverage of publications in OpenAlex might help discover some under-explored scholarly migration corridors worldwide.”

The paper does note some limitations of using OpenAlex as opposed to Scopus for their purposes, specifically, “the quality of the author name disambiguation and identifiers in OpenAlex needs further evaluation in future research.” Evaluating the job that OpenAlex has done assigning authors to all of their papers was outside the scope of this research, but they are able to refer to established research validating the Scopus data. We look forward to this validation on the OpenAlex data both from us and from other independent researchers. We’re also happy to say that we are continually making improvements in our author name disambiguation, so our data will be getting better and better!

Finally, there is the big difference between the two services: OpenAlex is open, while Scopus is not. The authors touch on this several times throughout the paper, both directly and indirectly. They mention that they must limit the years of their analysis, due to “our license terms for Scopus data”. In their Methods section, they describe the multiple steps they had to take to gain access to and acquire the Scopus data, while for OpenAlex, the process was much simpler: “we obtain the publicly available data and process it ourselves”. And in the Acknowledgements section, they explain that the Scopus license terms only permit sharing aggregated results, and no individual data is shared.

Overall, we are very proud that OpenAlex is being recognized as an emerging high-quality, completely open source of bibliometric data that can be used for demographic research. The lack of restrictions on our data is extremely important as it eliminates barriers that researchers face in doing their work. Please check out their paper to learn more about their work!


[1] Akbaritabar, A., Theile, T. & Zagheni, E. Global flows and rates of international migration of scholars. WP-2023-018 https://www.demogr.mpg.de/en/publications_databases_6118/publications_1904/mpidr_working_papers/global_flows_and_rates_of_international_migration_of_scholars_7729 (2023) doi:10.4054/MPIDR-WP-2023-018.

How you can help keep OpenAlex free

Next month, we’re submitting a renewal application for our main grant. This grant helps keep OpenAlex free to you. We need your help to get the renewal. There are two ways to help:

  1. Write a short testimonial 
  2. Subscribe to our Premium service

Details below:

1. Testimonial

We need to show our funder that we’re making a real and necessary impact. Testimonials are amazing for that. Could you write us a quick testimonial? We need 5-7 sentences that answer these questions:

  • Who are you?
  • What problem is OpenAlex solving for you?
  • How did you solve it before us?
  • Why is OpenAlex a better solution?
  • What’s a concrete good outcome of using OpenAlex?
  • Would you recommend OpenAlex to others?

Here’s an (imaginary) example:

CatCademia connects academic researchers to share research ideas and cat pictures. For this we need publication lists for all users. Originally, users had to curate their own lists, which was a big pain point. Now we use OpenAlex’s open API to auto-generate users’ publication lists at signup. Upon launching this feature, we saw an immediate increase in user retention and cat-picture sharing. We highly recommend OpenAlex to anyone who needs high-quality, open scholarly data.

You can submit your testimonial here. Thanks!

2. Premium

We recently launched a paid upgrade to our service called OpenAlex Premium. Premium offers:

  • Faster updates, so you can get fresher data,
  • Higher API limits, so you can use the API more, and
  • Priority support for faster and more detailed help.

Our funder (correctly, imho) wants to see we’re on the road to self-sustainability. So, we’re asking you to take a look at Premium and see if it’s something that would help you. 

If not, no worries–we’re delighted to make most of what we do free, and we want users to enjoy that. But if it looks useful, please get in touch ASAP! We’re offering hefty early-adopter discounts to folks that sign up this month.

Thanks very much for your time and support!

Best,

The OpenAlex Team

New OpenAlex API features – continents, regions, and more!

You can now use the OpenAlex API to filter and group by continents and large geographic regions, such as the Global South. The full documentation is here.

To see a list of institutions in Europe you can do:

https://api.openalex.org/institutions?filter=continent:europe

So simple! You can group by continent as well. This will return a count of works where an author is associated with the institution’s continent:

https://api.openalex.org/works?group-by=institutions.continent

{
  "key": "Q46",
  "key_display_name": "Europe",
  "count": 26968686
},
{
  "key": "Q49",
  "key_display_name": "North America",
  "count": 25175848
},
{
  "key": "Q48",
  "key_display_name": "Asia",
  "count": 24805214
}...

The key field is the Wikidata identifier for the continent, such as South America (Q18).
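As a rough illustration of working with these groups, the sketch below turns group objects into (name, count, Wikidata link) tuples. The helper names are made up for this example, and it assumes you have already pulled the list of group objects out of the API response.

```python
def wikidata_url(key):
    """Build the Wikidata page URL for a key such as 'Q46'."""
    return f"https://www.wikidata.org/wiki/{key}"


def summarize(groups):
    """Return (display name, count, Wikidata URL) for each group object."""
    return [
        (g["key_display_name"], g["count"], wikidata_url(g["key"]))
        for g in groups
    ]


# Group objects copied from the example response in this post.
sample = [
    {"key": "Q46", "key_display_name": "Europe", "count": 26968686},
    {"key": "Q49", "key_display_name": "North America", "count": 25175848},
    {"key": "Q48", "key_display_name": "Asia", "count": 24805214},
]

for name, count, url in summarize(sample):
    print(f"{name}: {count} works ({url})")
```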

Querying the Global South

The Global South is a term used to identify regions within Latin America, Asia, Africa, and Oceania. We used data from the United Nations to build a list of countries associated with the Global South. It’s available as a boolean filter like:

https://api.openalex.org/institutions?filter=is_global_south:true

This allows for some very cool groupings, such as “show me authors associated with the Global South, grouped by country”:

https://api.openalex.org/authors?filter=last_known_institution.is_global_south:true&group-by=last_known_institution.country_code
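If you build these URLs in code, a small helper like the one below can keep things tidy. This is a sketch only: `openalex_url` is a hypothetical name, and joining several filters with commas is an assumption about the filter syntax that you should check against the documentation.

```python
from urllib.parse import urlencode


def openalex_url(entity, filters=None, group_by=None):
    """Build an API URL from a dict of filters and an optional group-by field."""
    params = {}
    if filters:
        parts = []
        for key, value in filters.items():
            # The API examples use lowercase true/false for booleans.
            if isinstance(value, bool):
                value = "true" if value else "false"
            parts.append(f"{key}:{value}")
        # Assumption: multiple filters are joined with commas.
        params["filter"] = ",".join(parts)
    if group_by:
        params["group-by"] = group_by
    # Keep ':' and ',' unescaped so the URL matches the examples in this post.
    query = urlencode(params, safe=":,")
    url = f"https://api.openalex.org/{entity}"
    return f"{url}?{query}" if query else url


print(openalex_url(
    "authors",
    filters={"last_known_institution.is_global_south": True},
    group_by="last_known_institution.country_code",
))
```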

New API Filters

We’ve added new filters for works:

  • has_pmid – works that have a PubMed identifier
  • has_pmcid – works that have a PubMed Central identifier
  • repository – works that can be found at the given repository, based on venue ID
  • version – works where the given version is available, such as acceptedVersion

Concepts Improvements

As requested by OpenAlex users, we modified the concepts tree so that it is a true hierarchy. This means that when you search for works with the concept Computer Science, you also get works tagged with its sub-concepts, such as Artificial Intelligence.
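The effect of a true hierarchy can be pictured with a toy example. Everything here is invented for illustration (the tiny tree, the work IDs, and the function names); it only shows how a search on a parent concept picks up works tagged with its descendants.

```python
def descendants(tree, concept):
    """Return the concept plus every sub-concept beneath it."""
    found = [concept]
    for child in tree.get(concept, []):
        found.extend(descendants(tree, child))
    return found


def works_for_concept(tree, tagged_works, concept):
    """Find works tagged with the concept or any of its sub-concepts."""
    wanted = set(descendants(tree, concept))
    return sorted(w for w, tags in tagged_works.items() if wanted & set(tags))


# Toy data: parent concept -> list of child concepts (not the real tree).
tree = {
    "Computer Science": ["Artificial Intelligence"],
    "Artificial Intelligence": ["Machine Learning"],
}
tagged_works = {
    "W1": ["Machine Learning"],
    "W2": ["Biology"],
    "W3": ["Computer Science"],
}

print(works_for_concept(tree, tagged_works, "Computer Science"))
```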

Fulltext search in OpenAlex

We’re excited to announce that we’ve added fulltext search to 57 million articles in OpenAlex, based on data from the General Index. This feature moves OpenAlex’s search function beyond title and abstract, covering the full text of 57 million documents, resulting in ~30 times more search results for many keyword searches!

What is the General Index?

The General Index is a very large database of n-grams that were extracted from 107 million journal articles. It’s openly available without restrictions, and is supported by 100 prominent professors and researchers.

An n-gram is a sequence of n consecutive words that occurs in a document. For example, in the sentence “the quick brown fox jumped”, “quick brown fox” is a 3-gram and “brown fox” is a bigram.

The n-grams from the General Index look like this:

{
    ngram: "sheet of cellulose nitrate",
    ngram_tokens: 4,
    ngram_count: 4
},
{
    ngram: "high than the diameter",
    ngram_tokens: 4,
    ngram_count: 1
}

So we know that the phrase “sheet of cellulose nitrate” occurred in the document four times. The General Index used a tool called spaCy to extract n-grams from articles, capturing from 5-grams down to unigrams from each document.
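For a feel of how records like these come about, here is a simplified sketch. It splits on whitespace instead of using spaCy, so it will not reproduce the General Index output exactly; the function name is hypothetical and the field names simply mirror the example above.

```python
from collections import Counter


def ngram_records(text, max_n=5):
    """Count every n-gram from unigrams up to max_n-grams in a text."""
    tokens = text.lower().split()  # naive tokenizer, not spaCy
    counts = Counter(
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    return [
        {"ngram": gram, "ngram_tokens": len(gram.split()), "ngram_count": count}
        for gram, count in counts.items()
    ]


records = ngram_records("the quick brown fox jumped")
three_grams = [r["ngram"] for r in records if r["ngram_tokens"] == 3]
print(three_grams)
```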

You cannot recreate a document from these n-grams due to the way that the text was processed (we checked this carefully). However, with the n-grams we know the phrases that exist in each document and how many times each was mentioned… which is great for search!

Enabling fulltext search

We matched the n-grams that had metadata to records in OpenAlex, then loaded the n-grams into Elasticsearch. The result is fine-grained, fulltext search across many articles in OpenAlex. This allows you to find words and phrases deep within a document. This feature is ready to use today!

Fulltext search is integrated into the main search feature, with priority given to title, then abstract, then fulltext: https://api.openalex.org/works?search=dna.

You can filter records to see those that have fulltext available, and you can search fulltext only.

Can I see the n-grams?

Yes you can! Each Work object in OpenAlex now includes an ngrams_url field; the URL you find there points to a list of that work’s ngrams.

You can also access a work’s ngrams directly via DOI, by using this REST API endpoint:

/works/:doi/ngrams

So for example, to get the ngrams for the work with DOI 10.1016/s0022-2836(75)80083-0, you can call https://api.openalex.org/works/10.1016/s0022-2836(75)80083-0/ngrams.
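If you are starting from a list of DOIs, a tiny helper can build these URLs for you. This is just a sketch; `ngrams_url` is a made-up name, and stripping the "https://doi.org/" and "doi:" prefixes is a convenience we assume you might want.

```python
def ngrams_url(doi):
    """Build the ngrams endpoint URL for a DOI, tolerating common prefixes."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    return f"https://api.openalex.org/works/{doi}/ngrams"


print(ngrams_url("10.1016/s0022-2836(75)80083-0"))
```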

And the best part is, because these API queries are cached they can be served even more quickly than the rest of our REST API…so you can feel free to scroll through thousands or even millions of DOIs using this endpoint.

Exploring the data

Looking across the OpenAlex data set, about 32% of articles prior to 2000 have fulltext, and about 25% of articles between 2000 and 2020 have fulltext.

(Chart: count of works with fulltext, by year.)

The coverage increases above 50% when a record has a DOI.

Finally, for articles with more than 50 incoming citations, coverage rises above 70%, and up to 80% in some years.

We hope you enjoy this new feature! We’re thankful to The General Index project for making this incredible data set available, and we’re proud to be one of the first organizations to host it in an easy-to-use manner.

New OpenAlex API features!

We’ve got a ton of great API improvements to report! If you’re an API user, there’s a good chance there’s something in here you’re gonna love.

Search

You can now search both titles and abstracts. We’ve also implemented stemming, so a search for “frogs” now automatically gets you results mentioning “frog,” too. Thanks to these changes, searches for works now deliver around 10x more results. This can all be accessed using the new search query parameter.

New entity filters

We’ve added support for tons of new filters, which are documented here. You can now:

  • get all of a work’s outgoing citations (i.e., its references section) with a single query
  • search within each work’s raw affiliation data to find an arbitrary string (e.g., a specific department within an organization)
  • filter on whether or not an entity has a canonical external ID (works: has_doi, authors: has_orcid, etc.)

Request multiple records by ID at once

This has been our most-requested feature and we’re super excited to roll it out! By using the new OR operator, you can request up to 50 entities in a single API call. You can use any ID we support–DOI, ISSN, OpenAlex ID, etc.
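If you have more than 50 IDs, you will need to split them into batches. The sketch below does that for DOIs. It assumes the OR operator is written as a pipe (`|`) between values inside a filter, so please confirm the syntax in the docs; the function name and the DOIs are placeholders.

```python
def batched_filter_urls(id_field, ids, batch_size=50):
    """Split IDs into batches and build one works URL per batch."""
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # Assumption: values are OR-ed together with the pipe character.
        values = "|".join(batch)
        urls.append(f"https://api.openalex.org/works?filter={id_field}:{values}")
    return urls


# Placeholder DOIs, for illustration only.
dois = [f"10.1234/example{i}" for i in range(120)]
for url in batched_filter_urls("doi", dois):
    print(url[:80], "...")
```

With 120 IDs this yields three requests instead of 120.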

Deep paging

Using cursor-based paging, you can now retrieve an infinite number of results (it used to be just the top 10,000). But remember: if you want to download the entire dataset, please use the snapshot, not the API! The snapshot is the exact same data in the exact same format, but much much faster and cheaper for you and us.
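A cursor loop can be sketched as follows. We assume here that the first request uses `cursor=*` and that each response carries the next cursor under `meta.next_cursor`; check the paging docs for the exact details. The `fetch_page` callable and the fake pages are stand-ins so the example runs without network access.

```python
def iter_results(fetch_page):
    """Yield every result by following cursors until none remains.

    fetch_page(cursor) must return a dict shaped like
    {"meta": {"next_cursor": ...}, "results": [...]} (assumed shape).
    """
    cursor = "*"
    while cursor:
        page = fetch_page(cursor)
        results = page.get("results", [])
        if not results:
            break
        yield from results
        cursor = page.get("meta", {}).get("next_cursor")


# Offline stand-in for the API: cursor -> response.
FAKE_PAGES = {
    "*": {"meta": {"next_cursor": "abc"}, "results": [1, 2]},
    "abc": {"meta": {"next_cursor": "def"}, "results": [3]},
    "def": {"meta": {"next_cursor": None}, "results": []},
}

print(list(iter_results(FAKE_PAGES.__getitem__)))
```

In real use, `fetch_page` would make an HTTP request with the cursor as a query parameter and return the parsed JSON.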

More groups in group_by queries

We now return the top 200 groups (it used to be just the top 50).

New Autocomplete endpoint

Our new autocomplete endpoint makes it dead easy to use our data to power an autocomplete/typeahead widget in your own projects. It works for any of our five entity types (works, authors, venues, institutions, or concepts). If you’ve got users inputting the names of journals, institutions, or other entities, now you can easily let them choose an entity instead of entering free text–and then you can store the ID (ISSN, ROR, whatever) instead of passing strings around everywhere.
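Here is a rough sketch of wiring this into a widget. The URL pattern (`/autocomplete/<entity type>?q=...`) and the result fields used below (`display_name`, `external_id`, `id`) are our assumptions about the endpoint, and the sample result uses placeholder IDs.

```python
from urllib.parse import quote


def autocomplete_url(entity_type, text):
    """Build an autocomplete URL for what the user has typed so far."""
    return f"https://api.openalex.org/autocomplete/{entity_type}?q={quote(text)}"


def to_choices(results):
    """Reduce results to (label, ID) pairs, preferring the external ID."""
    return [
        (r["display_name"], r.get("external_id") or r["id"])
        for r in results
    ]


# Placeholder result, shaped the way we assume the endpoint responds.
sample_results = [
    {
        "id": "https://openalex.org/I0000000000",
        "display_name": "Example University",
        "external_id": "https://ror.org/000000000",
    }
]

print(autocomplete_url("institutions", "exam"))
print(to_choices(sample_results))
```

Storing the ID from the second element of each pair, rather than the label, is what lets you avoid passing strings around.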

Better docs

In addition to documenting the new features above, we’ve also added lots of new documentation for existing features, addressing our most frequent questions and requests.

Thanks to everyone who’s been in touch to ask for new features, report bugs, and tell us where we can improve (also where we’re doing well, we’re ok with that too).
We’ll continue improving the API and the docs. We’re also putting tons of work into improving the underlying dataset’s accuracy and coverage, and we’re happy to report that we’ve improved a lot on what we inherited from MAG, with more improvements to come. We’ve delayed the launch of the full web UI, but expect that in the summer…we are so excited about all the possibilities that’s going to open up.

OpenAlex Update: Jan 24 2022

The OpenAlex launch is going well!  Thanks for all of your feedback, comments, questions, and help spreading the word.  A few updates for you below.

Snapshot updates

There is a new native-format snapshot, with the following updates:

  • includes “abstract_inverted_index” in works
  • includes “raw_affiliation_string” in works.authorships (thanks for requesting this!)
  • “cited_by_api_url” in works is now a string, not a list (sorry! the list was a bug)
  • corrected the spelling of institution.associated_institutions
  • “ids” dict doesn’t include entries for empty ids anymore (simplifies the data)
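Since abstracts arrive as an inverted index rather than plain text, you may want to turn them back into a readable string. Here is a minimal sketch, assuming the index maps each word to the list of positions where it appears; the function name and the sample are ours.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    positions = {}
    for word, indexes in inverted_index.items():
        for i in indexes:
            positions[i] = word
    # Emit the words in position order.
    return " ".join(positions[i] for i in sorted(positions))


sample = {"the": [0, 3], "cat": [1], "chased": [2], "mouse": [4]}
print(abstract_from_inverted_index(sample))
```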

This new snapshot doesn’t have additional new works since the previous one, but we expect new works to be added in the next week, and approximately every 2 weeks after that. A new MAG-format snapshot including new works will also be released at that time. Each new snapshot will contain articles published up to just a few days before the snapshot release (rather than several weeks old as was the case with MAG).

API updates

The API has the same changes as described above for the snapshot. Importantly, “abstract_inverted_index” is now included in the list and filter endpoints.

Nature write-up

The OpenAlex launch was covered in Nature this week!  You can read about it here:  https://doi.org/10.1038/d41586-022-00138-y  
We are really happy to hear that people are finding it easy to use!

OpenAlex Tips of the Day

We have been posting tips for using OpenAlex on Twitter every weekday.  

You can see past tips at this search link (whether you have a Twitter account or not), and you can follow us on Twitter here: @openalex_org

Questions?

We’d love to hear from you: team@ourresearch.org