OpenAlex documentation improvements

It’s a new year and at OurResearch we’re starting off 2023 full steam ahead! We’ve revamped the OpenAlex documentation so that it’s easier to get started, and easier to find the fields and filters that are available in the OpenAlex API. It should take fewer “clicks” to find what you need.

Poised for growth

The major change we made was to highlight the core entities (works, authors, etc.) in OpenAlex, giving them their own up-front space. OpenAlex grew considerably in 2022, not only in the number of records, but also in the number of ways that you can filter, group, and search scholarly data. This new approach provides more room to add and document filters, and lets us better describe the unique search capabilities available in each entity. Overall, it sets us up to grow again in 2023.

Our goal is to maintain friendly and approachable documentation, so hopefully we’ve kept that up as well. If you find something broken, or have some suggested improvements, let us know!

Author search in OpenAlex: improved handling of diacritics within names

We’ve improved the author search feature within OpenAlex, so you get more results when searching for author names that may or may not include diacritics. For example, a search for the name “David Tarragó” will return the same number of results as a search for the version converted via Lucene’s ASCII folding filter, which in this case is “David Tarrago”.

When searching with diacritics, results with the queried diacritics are more likely to be ranked towards the top. So the two searches may have slightly different rankings. You can see the results of these two searches in the API:

Search for David Tarragó: https://api.openalex.org/authors?search=david%20Tarrag%C3%B3
Search for David Tarrago: https://api.openalex.org/authors?search=david%20Tarrago

These queries return the same number of results, with diacritic and non-diacritic names included. Keep in mind that results are weighted by the author’s works count, so that has an impact on relevance as well.
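To get a feel for what folding does to a name, here is a minimal Python sketch. It is only an approximation: it strips combining marks after Unicode decomposition, which is not the same implementation as Lucene’s ASCII folding filter and will differ for some characters. The helper names fold_to_ascii and author_search_url are made up for this example.

```python
import unicodedata
from urllib.parse import quote


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def author_search_url(name: str) -> str:
    """Build an author search URL, percent-encoding the name as UTF-8."""
    return f"https://api.openalex.org/authors?search={quote(name)}"


name = "David Tarragó"
folded = fold_to_ascii(name)

print(folded)  # David Tarrago
print(author_search_url(name))
print(author_search_url(folded))
```

The two printed URLs have the same shape as the two searches listed above, so you can paste them into a browser and compare the counts and rankings yourself.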

Why make this change?

When creating the OpenAlex author search capability, it was important for us to honor authors’ names by respecting diacritics. So searching with a diacritic returned only results with diacritics. However, this strict approach makes it harder to find some authors. We’re comfortable with the compromise of searching with and without diacritics at the same time, while giving priority to the intended search query. Hopefully this improved feature is helpful!

Fetch multiple DOIs in one OpenAlex API request

Did you know that you can request up to 50 DOIs in a single API call? That’s possible due to the OR query in the OpenAlex API and looks like this:

https://api.openalex.org/works?filter=doi:10.3322/caac.21660|https://doi.org/10.1136/bmj.n71|10.3322/caac.21654&mailto=support@openalex.org

We simply separate our DOIs with the pipe symbol ‘|’. That query will return the three works associated with the three DOIs we entered. As you can see in the query, both short-form DOIs and long-form DOIs (as URLs) are supported.

This will save time and resources when requesting many DOIs. This technique works with all IDs in OpenAlex, including OpenAlex IDs and PubMed IDs (PMID).

Example with Python requests

Let’s write an example Python script to show how we can request several DOIs in one call (up to 50 per request) using requests:

import requests

# Short-form DOIs and DOI URLs can be mixed in the same filter.
dois = ["10.3322/caac.21660", "https://doi.org/10.1136/bmj.n71", "10.3322/caac.21654"]
pipe_separated_dois = "|".join(dois)
r = requests.get(f"https://api.openalex.org/works?filter=doi:{pipe_separated_dois}&per-page=50&mailto=support@openalex.org")
r.raise_for_status()  # stop early on an HTTP error
works = r.json()["results"]

for work in works:
    print(work["doi"], work["display_name"])

# results
https://doi.org/10.3322/caac.21660 Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries
https://doi.org/10.1136/bmj.n71 The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
https://doi.org/10.3322/caac.21654 Cancer Statistics, 2021
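If your list is longer than 50 DOIs, you need to split it up first. Here is a small sketch of one way to do that. The helper names (batched, doi_batch_url) and the placeholder DOIs are invented for illustration, and the code only builds the request URLs so you can inspect them before sending anything.

```python
from typing import Iterator, List

BASE_URL = "https://api.openalex.org/works"
MAX_IDS_PER_REQUEST = 50  # the per-request limit described in this post


def batched(items: List[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def doi_batch_url(dois: List[str], mailto: str) -> str:
    """Build one works URL that filters on a pipe-separated batch of DOIs."""
    if len(dois) > MAX_IDS_PER_REQUEST:
        raise ValueError(f"at most {MAX_IDS_PER_REQUEST} DOIs per request")
    joined = "|".join(dois)
    return f"{BASE_URL}?filter=doi:{joined}&per-page={MAX_IDS_PER_REQUEST}&mailto={mailto}"


# Hypothetical placeholder DOIs, only to show how a long list is split.
many_dois = [f"10.1234/example.{i}" for i in range(120)]
urls = [doi_batch_url(batch, "you@example.org") for batch in batched(many_dois)]

print(len(urls))  # 3 requests: 50 + 50 + 20 DOIs
```

Each URL in urls can then be passed to requests.get() exactly as in the script above, collecting the "results" list from every response.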

Hope this is helpful!

New OpenAlex API features – continents, regions, and more!

You can now use the OpenAlex API to filter and group by continents and large geographic regions, such as the Global South. The full documentation is here.

To see a list of institutions in Europe you can do:

https://api.openalex.org/institutions?filter=continent:europe

So simple! You can group by continent as well. This will return a count of works where an author is associated with the institution’s continent:

https://api.openalex.org/works?group-by=institutions.continent

{
  "key": "Q46",
  "key_display_name": "Europe",
  "count": 26968686
},
{
  "key": "Q49",
  "key_display_name": "North America",
  "count": 25175848
},
{
  "key": "Q48",
  "key_display_name": "Asia",
  "count": 24805214
}...

The key field is the Wikidata identifier for the continent, such as South America (Q18).
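Once you have the groups, turning them into something readable takes a few lines of Python. This sketch works on the three entries copied from the excerpt above. It assumes you have already pulled the list of groups out of the JSON response (where exactly that list sits in the response is not shown here), and counts_by_name is a made-up helper name.

```python
from typing import Dict, List


def counts_by_name(groups: List[dict]) -> Dict[str, int]:
    """Map each group's display name to its works count."""
    return {group["key_display_name"]: group["count"] for group in groups}


# The three groups shown in the excerpt above (the real response has more).
sample_groups = [
    {"key": "Q46", "key_display_name": "Europe", "count": 26968686},
    {"key": "Q49", "key_display_name": "North America", "count": 25175848},
    {"key": "Q48", "key_display_name": "Asia", "count": 24805214},
]

counts = counts_by_name(sample_groups)
total = sum(counts.values())

# Largest group first, with its share of the groups included here.
for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {count:,} ({count / total:.1%} of the groups shown)")
```

Keep in mind that a work with authors on several continents is counted once per continent, so the shares describe group sizes rather than a strict partition of works.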

Querying the Global South

The Global South is a term used to identify regions within Latin America, Asia, Africa, and Oceania. We used data from the United Nations to build a list of countries associated with the Global South. It’s available as a boolean filter like:

https://api.openalex.org/institutions?filter=is_global_south:true

This allows for some very cool groupings, such as “show me authors associated with the Global South, grouped by country”:

https://api.openalex.org/authors?filter=last_known_institution.is_global_south:true&group-by=last_known_institution.country_code
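If you build these URLs in code, a tiny helper keeps the filter and group-by parts tidy. This is just a sketch: openalex_url is a hypothetical helper, not part of any OpenAlex client, and joining several filters with commas is an assumption on our part since the examples above only use one filter at a time.

```python
from typing import Dict, Optional

API_ROOT = "https://api.openalex.org"


def openalex_url(entity: str, filters: Dict[str, str], group_by: Optional[str] = None) -> str:
    """Build a filter (and optional group-by) URL for an entity endpoint."""
    # Assumption: multiple filters are comma-separated; only single filters appear above.
    filter_param = ",".join(f"{field}:{value}" for field, value in filters.items())
    url = f"{API_ROOT}/{entity}?filter={filter_param}"
    if group_by:
        url += f"&group-by={group_by}"
    return url


print(openalex_url("institutions", {"is_global_south": "true"}))
print(openalex_url(
    "authors",
    {"last_known_institution.is_global_south": "true"},
    group_by="last_known_institution.country_code",
))
```

The two printed URLs are the same as the Global South examples above, so the helper is easy to check against queries you already know work.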

New API Filters

We’ve added new filters for works:

has_pmid – works that have a PubMed identifier
has_pmcid – works that have a PubMed Central identifier
repository – works that can be found at the given repository, based on venue ID
version – works where the given version is available, such as acceptedVersion

Concepts Improvements

As requested by OpenAlex users, we modified the concepts tree so that it is a true hierarchy. This means that when you search for works with the concept Computer Science, you’re also getting works tagged with its sub-concepts, such as Artificial Intelligence.
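To picture what “true hierarchy” means for a query, here is a toy Python sketch. The four-concept tree is invented for illustration and gives every concept a single parent, which is a simplification. It is not how OpenAlex stores or tags concepts internally, only a way to see why a parent concept matches works tagged with a descendant.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical mini-tree: concept -> parent concept (None marks a root).
PARENT: Dict[str, Optional[str]] = {
    "Computer Science": None,
    "Artificial Intelligence": "Computer Science",
    "Machine Learning": "Artificial Intelligence",
    "Biology": None,
}


def with_ancestors(concept: str) -> Set[str]:
    """Return the concept together with every ancestor up to its root."""
    found: Set[str] = set()
    current: Optional[str] = concept
    while current is not None:
        found.add(current)
        current = PARENT[current]
    return found


def matches(work_concepts: Iterable[str], query_concept: str) -> bool:
    """True if any tagged concept is the query concept or one of its descendants."""
    return any(query_concept in with_ancestors(concept) for concept in work_concepts)


print(matches(["Machine Learning"], "Computer Science"))  # True
print(matches(["Biology"], "Computer Science"))           # False
```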

Meet Casey – Now full time with OurResearch

Hi, I’m Casey. I am excited to announce that I am now full time with OurResearch as a software engineer working on OpenAlex and Unpaywall!

My Journey

I freelanced for OurResearch prior to joining full time this summer. With Jason and Heather’s help I maintained Paperbuzz, Cite-As, and also built out a project to catalog academic journal pricing. With freelancing I was able to improve my Python and data management skills in order to tackle bigger projects.

Prior to freelancing I enjoyed a career in the US Air Force, which I am proud of. I’m fortunate to have hundreds of hours as aircrew on multiple aircraft, as well as a variety of technical and leadership assignments. So if you ever want to talk airplanes be ready because I might talk your ear off!

My academic experience comes from my time in university pursuing advanced education.

My Vision with OurResearch

In December I helped build the API and set up Elasticsearch for a project called OpenAlex. That project has continued to grow and I love to see how many people are using it. My core job with OpenAlex is to provide front-line customer support, as well as maintain and improve the API and search infrastructure. I’m also working on several parts of Unpaywall.

It’s incredible that OurResearch tools are freely open and available. I find that OurResearch shares core values with my time in the Air Force: small teams empowered to make decisions, humble and accepting of feedback in order to make things better. That’s why we believe our community of users is invaluable in keeping those tools free, open, and easy to use.

So we will listen to your feedback, fix bugs and implement features quickly, and continue to maintain our documentation so the dataset and APIs are as frictionless as they can be. We welcome and need your help with this mission! So do not hesitate to contact me or the team.

I look forward to improving OpenAlex and Unpaywall, and to meeting those of you using OurResearch products!

– Casey

Fulltext search in OpenAlex

We’re excited to announce that we’ve added fulltext search to 57 million articles in OpenAlex, based on data from the General Index. This feature moves OpenAlex’s search function beyond title and abstract, covering the full text of 57 million documents, resulting in ~30 times more search results for many keyword searches!

What is the General Index?

The General Index is a very large database of n-grams that were extracted from 107 million journal articles. It’s openly available without restrictions, and is supported by 100 prominent professors and researchers.

An n-gram is a sequence of n consecutive words that occurs in a document. For example, in the sentence “the quick brown fox jumped”, “quick brown fox” is a 3-gram and “brown fox” is a 2-gram (bigram).

The n-grams from the General Index look like this:

{
    ngram: "sheet of cellulose nitrate",
    ngram_tokens: 4,
    ngram_count: 4
},
{
    ngram: "high than the diameter",
    ngram_tokens: 4,
    ngram_count: 1
}

So we know that the phrase “sheet of cellulose nitrate” occurred in the document four times. The General Index used a tool called spaCy to extract n-grams from articles, capturing from 5-grams down to unigrams from each document.
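If you want to see how unigrams through 5-grams add up for a single sentence, here is a short Python sketch. It splits on whitespace, which is much simpler than the spaCy-based processing the General Index used, so treat it as an illustration of the idea rather than a reproduction of their pipeline. The function name extract_ngrams is our own, and the record fields mirror the sample above.

```python
from collections import Counter
from typing import List


def extract_ngrams(tokens: List[str], max_n: int = 5) -> Counter:
    """Count every run of 1..max_n consecutive tokens."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


tokens = "the quick brown fox jumped".split()
counts = extract_ngrams(tokens)

# Shape each entry like the records shown above.
records = [
    {"ngram": ngram, "ngram_tokens": len(ngram.split()), "ngram_count": count}
    for ngram, count in counts.items()
]

print(counts["quick brown fox"])  # 1
print(len(records))               # 15 distinct n-grams: 5 + 4 + 3 + 2 + 1
```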

You cannot recreate a document from these n-grams due to the way that the text was processed (we checked this carefully). However, with the n-grams we know the phrases that exist in each document and how many times each one appears… which is great for search!

Enabling fulltext search

We matched the n-grams that had metadata to records in OpenAlex, then loaded the n-grams into Elasticsearch. The result is fine-grained, fulltext search across many articles in OpenAlex. This allows you to find words and phrases deep within a document. This feature is ready to use today!

Fulltext search is integrated into the main search feature, with priority given to title, then abstract, then fulltext: https://api.openalex.org/works?search=dna.

You can filter records to see those that have fulltext available, and you can search fulltext only.

Can I see the n-grams?

Yes you can! Each Work object in OpenAlex now includes an ngrams_url field; the URL you find there points to a list of that work’s n-grams.

You can also access a work’s ngrams directly via DOI, by using this REST API endpoint:

/works/:doi/ngrams

So for example, to get the ngrams for the work with DOI 10.1016/s0022-2836(75)80083-0, you can call https://api.openalex.org/works/10.1016/s0022-2836(75)80083-0/ngrams.
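In Python, building that URL and fetching it could look something like the sketch below. The helper names are ours, the fetch uses only the standard library, and it returns the decoded JSON without assuming anything about which fields the response contains. Nothing is fetched until you call fetch_ngrams yourself.

```python
import json
from urllib.request import urlopen

DOI_PREFIX = "https://doi.org/"


def ngrams_url(doi: str) -> str:
    """Build the /works/:doi/ngrams URL from a short-form DOI or a DOI URL."""
    short_doi = doi[len(DOI_PREFIX):] if doi.startswith(DOI_PREFIX) else doi
    return f"https://api.openalex.org/works/{short_doi}/ngrams"


def fetch_ngrams(doi: str, timeout: float = 30.0):
    """Fetch and decode the n-grams response for one DOI (makes a network call)."""
    with urlopen(ngrams_url(doi), timeout=timeout) as response:
        return json.load(response)


print(ngrams_url("10.1016/s0022-2836(75)80083-0"))
```

Calling fetch_ngrams("10.1016/s0022-2836(75)80083-0") would then request the same URL as the example above.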

And the best part is, because these API queries are cached they can be served even more quickly than the rest of our REST API…so you can feel free to scroll through thousands or even millions of DOIs using this endpoint.

Exploring the data

Looking across the OpenAlex data set, about 32% of articles prior to 2000 have fulltext, and about 25% of articles between 2000 and 2020 have fulltext.

[Chart: count of articles with fulltext, by year]

The coverage increases above 50% when a record has a DOI.

Finally, coverage increases above 70% and even up to 80% in years when an article has more than 50 incoming citations.

We hope you enjoy this new feature! We’re thankful to The General Index project for making this incredible data set available, and we’re proud to be one of the first organizations to host it in an easy-to-use manner.

OurResearch blog