We’re excited to announce that we’ve added fulltext search to 57 million articles in OpenAlex, based on data from the General Index. This feature moves OpenAlex’s search function beyond title and abstract, covering the full text of 57 million documents, resulting in ~30 times more search results for many keyword searches!
What is the General Index?
The General Index is a very large database of n-grams that were extracted from 107 million journal articles. It’s openly available without restrictions, and is supported by 100 prominent professors and researchers.
An n-gram is a sequence of n consecutive words that occurs in a document. For example, in the sentence “the quick brown fox jumped”, “quick brown fox” is a 3-gram and “brown fox” is a bigram (2-gram).
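To make the definition concrete, here is a minimal sketch that lists the n-grams of a sentence. It assumes simple whitespace tokenization, which is only an illustration; the General Index used spaCy, and its tokenization rules will differ. The function name `ngrams` is ours.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive words in text, in order."""
    tokens = text.split()  # assumption: split on whitespace only
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the quick brown fox jumped"
trigrams = ngrams(sentence, 3)
bigrams = ngrams(sentence, 2)
print(trigrams)  # ['the quick brown', 'quick brown fox', 'brown fox jumped']
print(bigrams)   # ['the quick', 'quick brown', 'brown fox', 'fox jumped']
```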
The n-grams from the General Index look like this:
{
ngram: "sheet of cellulose nitrate",
ngram_tokens: 4,
ngram_count: 4
},
{
ngram: "high than the diameter",
ngram_tokens: 4,
ngram_count: 1
}
Here, ngram_tokens is the number of words in the phrase and ngram_count is the number of times the phrase occurs, so we know that “sheet of cellulose nitrate” occurred in the document four times. The General Index used a tool called spaCy to extract n-grams from articles, capturing everything from 5-grams down to unigrams in each document.
You cannot recreate a document from these n-grams due to the way that the text was processed (we checked this carefully). However, with the n-grams we know which phrases exist in each document and how many times each one was mentioned… which is great for search!
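As a rough illustration of why phrase counts are useful for search, the sketch below looks up how often a phrase occurs in one document's n-gram list. The records reuse the field names from the example above; the helper `phrase_count` is hypothetical and is not how OpenAlex's Elasticsearch index works internally.

```python
# Two n-gram records for a single document, copied from the example above.
records = [
    {"ngram": "sheet of cellulose nitrate", "ngram_tokens": 4, "ngram_count": 4},
    {"ngram": "high than the diameter", "ngram_tokens": 4, "ngram_count": 1},
]


def phrase_count(ngram_records: list[dict], phrase: str) -> int:
    """Return how many times phrase occurs in the document, or 0 if absent."""
    wanted = phrase.lower()
    for record in ngram_records:
        if record["ngram"].lower() == wanted:
            return record["ngram_count"]
    return 0


print(phrase_count(records, "sheet of cellulose nitrate"))  # 4
print(phrase_count(records, "quick brown fox"))             # 0
```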
Enabling fulltext search
We matched the n-grams that had metadata to records in OpenAlex, then loaded the n-grams into Elasticsearch. The result is fine-grained, fulltext search across many articles in OpenAlex. This allows you to find words and phrases deep within a document. This feature is ready to use today!
Fulltext search is integrated into the main search feature, with priority given to title, then abstract, then fulltext: https://api.openalex.org/works?search=dna.
You can filter records to see those that have fulltext available, and you can search fulltext only.
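The sketch below builds the request URLs for these three kinds of query. Only the `search` parameter appears in this post; the filter names `has_fulltext` and `fulltext.search` are assumptions on our part and should be checked against the OpenAlex API documentation before use. The code only constructs URLs and makes no network requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"


def search_url(query: str) -> str:
    """Main search: title, then abstract, then fulltext."""
    return f"{BASE_URL}?" + urlencode({"search": query})


def has_fulltext_url(query: str) -> str:
    """Main search restricted to works that have fulltext.

    Assumption: the filter is named `has_fulltext`.
    """
    params = {"search": query, "filter": "has_fulltext:true"}
    # Keep ':' unescaped so the filter stays readable.
    return f"{BASE_URL}?" + urlencode(params, safe=":")


def fulltext_only_url(query: str) -> str:
    """Search the fulltext only.

    Assumption: the filter is named `fulltext.search`.
    """
    params = {"filter": f"fulltext.search:{query}"}
    return f"{BASE_URL}?" + urlencode(params, safe=":")


print(search_url("dna"))  # https://api.openalex.org/works?search=dna
print(has_fulltext_url("dna"))
print(fulltext_only_url("dna"))
```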
Can I see the n-grams?
Yes, you can! Each Work object in OpenAlex now includes an ngrams_url field; the URL you find there points to a list of that work’s n-grams.
You can also access a work’s ngrams directly via DOI, by using this REST API endpoint:
/works/:doi/ngrams
So, for example, to get the n-grams for the work with DOI 10.1016/s0022-2836(75)80083-0, you can call https://api.openalex.org/works/10.1016/s0022-2836(75)80083-0/ngrams.
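If you want to call this endpoint for many DOIs, a small helper for building the URL may be handy. This is a minimal sketch: the function names are ours, and `fetch_ngrams` assumes the endpoint returns JSON, which this post does not spell out. Nothing is fetched unless you call `fetch_ngrams` yourself.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "https://api.openalex.org"


def ngrams_url(doi: str) -> str:
    """Build the /works/:doi/ngrams URL for a DOI."""
    # Leave '/', '(' and ')' unescaped so the URL matches the form shown above.
    return f"{API_ROOT}/works/{quote(doi, safe='/()')}/ngrams"


def fetch_ngrams(doi: str) -> dict:
    """Fetch the n-grams for a DOI (assumes a JSON response body)."""
    with urlopen(ngrams_url(doi)) as response:
        return json.load(response)


example = ngrams_url("10.1016/s0022-2836(75)80083-0")
print(example)
# https://api.openalex.org/works/10.1016/s0022-2836(75)80083-0/ngrams
```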
And the best part is that, because these API queries are cached, they can be served even more quickly than the rest of our REST API… so feel free to page through thousands or even millions of DOIs using this endpoint.
Exploring the data
Looking across the OpenAlex data set, about 32% of articles prior to 2000 have fulltext, and about 25% of articles between 2000 and 2020 have fulltext:
The count by year is:
The coverage increases above 50% when a record has a DOI:
Finally, coverage rises above 70%, and in some years up to 80%, for articles with more than 50 incoming citations:
We hope you enjoy this new feature! We’re thankful to The General Index project for making this incredible data set available, and we’re proud to be one of the first organizations to host it in an easy-to-use manner.