Add PubMed IDs #227

Merged
merged 27 commits into from
Apr 22, 2024

gaurav
Collaborator

@gaurav gaurav commented Jan 23, 2024

Adds 36,980,104 PubMed IDs (PMIDs) and their titles to NodeNorm (but not to NameRes, since I don't think putting the titles there matches our use case). Also includes DOI and PMCID mappings. Closes #204.

To implement this, I also added a recursive option to our FTP download code.
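
For readers wondering what a recursive FTP download can look like, here is a minimal sketch. It is not the code from this PR: `download_recursive` is a hypothetical helper, and it assumes the server supports the MLSD listing command, which Python's standard `ftplib` exposes as `FTP.mlsd()`. The host and path in the usage comment are illustrative assumptions too.

```python
import os
from ftplib import FTP


def download_recursive(ftp, remote_dir, local_dir):
    """Mirror remote_dir and all of its subdirectories into local_dir.

    Hypothetical helper for illustration only. Assumes the server supports
    MLSD so that each entry reports whether it is a file or a directory.
    Returns the list of local file paths that were written.
    """
    os.makedirs(local_dir, exist_ok=True)
    downloaded = []
    # Read the whole listing before issuing any further commands.
    entries = list(ftp.mlsd(remote_dir))
    for name, facts in entries:
        if name in (".", ".."):
            continue
        remote_path = f"{remote_dir.rstrip('/')}/{name}"
        local_path = os.path.join(local_dir, name)
        entry_type = facts.get("type")
        if entry_type == "dir":
            # Recurse into subdirectories.
            downloaded.extend(download_recursive(ftp, remote_path, local_path))
        elif entry_type == "file":
            with open(local_path, "wb") as f:
                ftp.retrbinary(f"RETR {remote_path}", f.write)
            downloaded.append(local_path)
    return downloaded


# Example usage (needs network access; host and path are assumptions):
# with FTP("ftp.ncbi.nlm.nih.gov") as ftp:
#     ftp.login()
#     download_recursive(ftp, "/pubmed/baseline", "downloads/pubmed/baseline")
```

Passing the connection in as an argument keeps the helper easy to exercise with a stand-in object instead of a live server.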

@gaurav gaurav changed the base branch from master to babel-1.3 January 23, 2024 16:06
Base automatically changed from babel-1.3 to master January 24, 2024 19:17
@gaurav gaurav changed the base branch from master to fix-conflated-preferred-name March 26, 2024 03:37
Base automatically changed from fix-conflated-preferred-name to master March 28, 2024 16:56
@gaurav gaurav marked this pull request as ready for review April 15, 2024 04:24
@gaurav gaurav requested a review from cbizon April 15, 2024 04:24
Contributor

@cbizon cbizon left a comment

What is the relationship between PMID and PMC? I'm assuming that the PubMed download doesn't include PMC or DOI?

@gaurav
Collaborator Author

gaurav commented Apr 22, 2024

> What is the relationship between PMID and PMC? I'm assuming that the PubMed download doesn't include PMC or DOI?

It does! The PubMed download consists of 1,000+ baseline XML files and hundreds of update XML files. Each file conforms to the PubMed DTD and contains a list of articles, each of which can have multiple ArticleId elements providing DOIs, PMCIDs, and other identifiers. We treat ArticleIds as additive: if a baseline XML file has a PMID with one DOI and an update XML file then has the same PMID with only one PMCID (i.e. the DOI has been removed for some reason), we write out the PMID with both the DOI and the PMCID. I can't think of any reason why it would be correct to remove an ArticleId in an update, so I think this is okay.
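
The additive behaviour described above can be sketched roughly as follows. This is an illustration rather than the code in this PR: `collect_article_ids` and the sample records are hypothetical, and the element paths (`MedlineCitation/PMID`, `PubmedData/ArticleIdList/ArticleId` with an `IdType` attribute) are my reading of the PubMed DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_article_ids(xml_documents):
    """Merge DOIs and PMCIDs per PMID across documents, never removing any.

    Hypothetical helper: documents are processed in order (baseline files
    first, then updates), and identifiers only ever accumulate.
    """
    ids_by_pmid = defaultdict(lambda: defaultdict(set))
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        for article in root.iter("PubmedArticle"):
            pmid = article.findtext("MedlineCitation/PMID")
            if pmid is None:
                continue
            # Only the article's own ID list, not IDs of cited references.
            for article_id in article.findall("PubmedData/ArticleIdList/ArticleId"):
                id_type = article_id.get("IdType")
                if id_type in ("doi", "pmc") and article_id.text:
                    ids_by_pmid[pmid][id_type].add(article_id.text.strip())
    return ids_by_pmid


# Made-up records: the baseline has a DOI, the update has only a PMCID.
BASELINE = """<PubmedArticleSet><PubmedArticle>
  <MedlineCitation><PMID>1</PMID></MedlineCitation>
  <PubmedData><ArticleIdList>
    <ArticleId IdType="pubmed">1</ArticleId>
    <ArticleId IdType="doi">10.1000/example</ArticleId>
  </ArticleIdList></PubmedData>
</PubmedArticle></PubmedArticleSet>"""

UPDATE = """<PubmedArticleSet><PubmedArticle>
  <MedlineCitation><PMID>1</PMID></MedlineCitation>
  <PubmedData><ArticleIdList>
    <ArticleId IdType="pubmed">1</ArticleId>
    <ArticleId IdType="pmc">PMC123</ArticleId>
  </ArticleIdList></PubmedData>
</PubmedArticle></PubmedArticleSet>"""

merged = collect_article_ids([BASELINE, UPDATE])
print({k: sorted(v) for k, v in merged["1"].items()})
# → {'doi': ['10.1000/example'], 'pmc': ['PMC123']}
```

Using sets means a repeated identifier in a later file is harmless, while an identifier missing from a later file is simply retained.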

@gaurav gaurav merged commit f755b8c into master Apr 22, 2024
@gaurav gaurav deleted the add-pubmed-ids branch April 22, 2024 16:47
@cbizon
Contributor

cbizon commented Apr 22, 2024

So does this now allow transformation between PMID and PMCID?

@cbizon
Contributor

cbizon commented Apr 22, 2024

Oh that's awesome! We should alert Translator to this as well.

Development

Successfully merging this pull request may close these issues.

Add PubMed IDs (PMIDs), titles and PMCIDs
2 participants