Skip to main content
U.S. flag

An official website of the United States government

Official websites use .gov
A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS
A lock ( ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

Skip to content

<i>Salmonella enterica </i>pangenome graph and variant call data for 539,283 genomes

Metadata Updated: July 11, 2025

Salmonella pangenome graph and variant call data for 539,283 genomesDescription:Salmonella enterica causes human disease and decreases agricultural production. The overall goals of this project is to generate a large database of S. enterica variants with 539,283 samples and 236,069 features for applications in machine learning and genomics. We transformed single nucleotide polymorphism (SNP) data into reduced dimensional representations which are tolerant of missing data based on disentangled variational autoencoders. TFRecord files were made with custom Python scripts that parsed the variant call formats (VCF) into sparse tensors and combined them with the Salmonella In Silico Typing Resource (SISTR) serotype data.The data directory contains:The tar file of TFRecords: tfrecords.tar (103 GB). The TFRecords are organized first by how they were genotyped. mpileup records were created with Mpileup, and the gvg records were created with graph variant calling. In each of these directories batches of ~10,000 sequence reads named Sra10k_XX.tfrecord.gz (00--54). File Sra10k_99.tfrecord.gz contains incomplete SRAs. Each TFRecord contains the shape of the tensor, the indices of non-zero variants, sample name, serotype, and sparse values. Value 99 was assigned to '.' records.The file output.tar (11.4 TB) contains the .vcf files used to create the TFRecords above. The data in here is contained more succinctly in the TTFrecord format. This data will not normally be used.A tar file of metadata files for the samples, metadata (95 MB). Sequence read archive (SRA) accessions were downloaded using edirect/eutilities and saved as SraAccList.txt.esearch -db sra -query "txid28901[Organism:exp] AND (cluster_public[prop] AND 'biomol dna'[Properties] AND 'library layout paired'[Properties] AND 'platform illumina'[Properties] AND 'strategy wgs'[Properties] OR 'strategy wga'[Properties] OR 'strategy wcs'[Properties] OR 'strategy clone'[Properties] OR 'strategy finishing'[Properties] OR 'strategy validation'[Properties])" | efetch -format runinfo -mode xml | xtract -pattern Row -element Run > SraAccList.txtGoogle BigQuery was used to download metadata for the SRA accessions from the National Institute of Health (NIH).SELECT * FROM nih-sra-datastore.sra.metadata as metadata INNER JOIN {table_id} as leiacc ON metadata.acc = leiacc.accID;Files were processed into batches of ~10,000 and named Sra_completed_XX.csv (00--53).A VCF document mapping the TFRecord data to the positions in the graph subjected to the Type strain LT2: mapping/DRR452337.gvg.vcf-with_TFRecord_in_1st_column.txtScripts for creating and reading TFRecord data: code.reading_and_parsing_fns.py defines functions for converting VCFs of variants called using gvg to sparse tensors and makes the TFRecord files.gvg_to_tfrecord.py creates TFRecords from from the sparse tensors.Tutorial for using the TFRecords: Example_logistic_regression.mdPangenome graph files and references used for variant calling and genotyping: pangenome.refPlus100.fasta.gz which contains the genomes of the 101 Salmonella strains without plasmids used for construction of the pangenome graph.salm.100.NC_003197_v2.d2_complete.gfa.gz The complete 101 Salmonella strain pangenome graph in Graphical Fragment Assembly (GFA2) Format 2.0 including alt nodes used for genotypingsalm.100.NC_003197_v2.full.gfa.gz the full graph including alt nodes.salm.100.NC_003197_v2.full.vcf.gz A VCF of the file abovegenotyped.gvg.vcf the genotype calls in vcf formatpaths.txt the paths of the graphSCINet users: The data folder can be accessed/retrieved with valid SCINet account at this ___location: /LTS/ADCdatastorage/NAL/published/node28083194/See the SCINet File Transfer guide for more information on moving large files: https://scinet.usda.gov/guides/data/datatransferGlobus users: The files can also be accessed through Globus by following this data link. The user will need to log in to Globus in order to access this data. User accounts are free of charge with several options for signing on. Instructions for creating an account are on the login page.

Access & Use Information

Public: This dataset is intended for public access and use. License: Creative Commons CCZero

Downloads & Resources

Dates

Metadata Created Date July 11, 2025
Metadata Updated Date July 11, 2025
Data Update Frequency irregular

Metadata Source

Harvested from USDA JSON

Additional Metadata

Resource Type Dataset
Metadata Created Date July 11, 2025
Metadata Updated Date July 11, 2025
Publisher Agricultural Research Service
Maintainer
Identifier 10.15482/USDA.ADC/28083194.v1
Data Last Modified 2025-06-25
Public Access Level public
Data Update Frequency irregular
Bureau Code 005:20, 005:18
Metadata Context https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
Schema Version https://project-open-data.cio.gov/v1.1/schema
Catalog Describedby https://project-open-data.cio.gov/v1.1/schema/catalog.json
Harvest Object Id cae15d89-4215-463b-a369-25589da5b85f
Harvest Source Id d3fafa34-0cb9-48f1-ab1d-5b5fdc783806
Harvest Source Title USDA JSON
License https://creativecommons.org/publicdomain/zero/1.0/
Program Code 005:040
Source Datajson Identifier True
Source Hash 1d7decf7d2b07d3d525a8276ac4feb16fb0e25972dec9804559eb610c28e11b4
Source Schema Version 1.1
Temporal 2000-01-01/2000-01-01

Didn't find what you're looking for? Suggest a dataset here.