jittor_geometric.datasets
Dataset loaders and utilities.
- class jittor_geometric.datasets.Planetoid(root, name, split='public', num_train_per_class=20, num_val=500, num_test=1000, transform=None, pre_transform=None)[source]
Bases: InMemoryDataset
The citation network datasets “Cora”, “CiteSeer” and “PubMed” from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper.
This class represents three widely-used citation network datasets: Cora, CiteSeer, and PubMed. Nodes correspond to documents, and edges represent citation links between them. The datasets are designed for semi-supervised learning tasks, where training, validation, and test splits are provided as binary masks.
Dataset Details:
Cora: A citation network where nodes represent machine learning papers, and edges represent citations. The task is to classify papers into one of seven classes.
CiteSeer: A citation network of research papers in computer and information science. The task is to classify papers into one of six classes.
PubMed: A citation network of biomedical papers on diabetes. The task is to classify papers into one of three classes.
Splitting Options:
public: The original fixed split from the paper “Revisiting Semi-Supervised Learning with Graph Embeddings”.
full: Uses all nodes except those in the validation and test sets for training, inspired by “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling”.
random: Generates random splits for train, validation, and test sets based on the specified parameters.
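The "random" option can be pictured with the minimal sketch below. It is plain Python for illustration, not the library's implementation: the helper `random_split` and its `seed` argument are hypothetical, and it assumes that `num_train_per_class` nodes are drawn from every class first and that validation and test nodes are then drawn from the nodes that remain.

```python
import random

def random_split(labels, num_train_per_class=20, num_val=500, num_test=1000, seed=0):
    """Return boolean train/val/test masks for a list of integer class labels."""
    rng = random.Random(seed)
    n = len(labels)
    train_mask = [False] * n

    # Draw a fixed number of training nodes from every class.
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        for i in idx[:num_train_per_class]:
            train_mask[i] = True

    # Shuffle the remaining nodes and carve out validation and test sets.
    remaining = [i for i in range(n) if not train_mask[i]]
    rng.shuffle(remaining)
    val_idx = set(remaining[:num_val])
    test_idx = set(remaining[num_val:num_val + num_test])
    val_mask = [i in val_idx for i in range(n)]
    test_mask = [i in test_idx for i in range(n)]
    return train_mask, val_mask, test_mask

# Toy graph: 30 nodes spread evenly over 3 classes.
labels = [i % 3 for i in range(30)]
train_mask, val_mask, test_mask = random_split(
    labels, num_train_per_class=2, num_val=5, num_test=10
)
print(sum(train_mask), sum(val_mask), sum(test_mask))
```

The three masks are disjoint by construction, which is the property a semi-supervised evaluation relies on.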
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset ("Cora", "CiteSeer", "PubMed").
split (str) – The type of dataset split ("public", "full", "random"). Default is "public".
num_train_per_class (int, optional) – Number of training samples per class for the "random" split. Default is 20.
num_val (int, optional) – Number of validation samples for the "random" split. Default is 500.
num_test (int, optional) – Number of test samples for the "random" split. Default is 1000.
transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. Default is None.
pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version before saving to disk. Default is None.
Example
>>> dataset = Planetoid(root='/path/to/dataset', name='Cora', split='random')
>>> data = dataset[0]  # Access the processed data object
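The difference between `transform` and `pre_transform` is mainly a matter of when each hook runs. The sketch below imitates that timing with a hypothetical `ToyInMemoryDataset` class operating on plain dictionaries. It does not touch the disk or use the library, so treat it as an illustration of the idea rather than of the actual code path.

```python
class ToyInMemoryDataset:
    """Hypothetical stand-in that mimics when the two hooks run."""

    def __init__(self, raw_items, transform=None, pre_transform=None):
        self.transform = transform
        # pre_transform runs once, while the data is being processed for storage.
        if pre_transform is not None:
            raw_items = [pre_transform(item) for item in raw_items]
        self._items = list(raw_items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        item = dict(self._items[idx])  # copy so the stored item stays untouched
        # transform runs again on every access.
        return self.transform(item) if self.transform is not None else item


calls = {"pre": 0, "access": 0}

def add_degree(item):
    """Example pre_transform: store a derived field once."""
    calls["pre"] += 1
    item = dict(item)
    item["degree"] = len(item["neighbors"])
    return item

def count_access(item):
    """Example transform: runs each time an item is read."""
    calls["access"] += 1
    return item

dataset = ToyInMemoryDataset(
    [{"neighbors": [1, 2]}], transform=count_access, pre_transform=add_degree
)
first = dataset[0]
second = dataset[0]
print(first["degree"], calls)
```

Expensive, deterministic preprocessing therefore fits `pre_transform`, while random augmentations that should differ between epochs fit `transform`.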
- url = 'https://github.com/kimiyoung/planetoid/raw/master/data'
- __init__(root, name, split='public', num_train_per_class=20, num_val=500, num_test=1000, transform=None, pre_transform=None)[source]
- property raw_dir
- property processed_dir
- property raw_file_names
The name of the files to find in the self.raw_dir folder in order to skip the download.
- property processed_file_names
The name of the files to find in the self.processed_dir folder in order to skip the processing.
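The two `*_file_names` properties support a "skip if already present" check. A minimal sketch of such a check follows. The helper name `files_exist` and the file names are illustrative assumptions, and the library's own logic may differ in detail.

```python
import os
import tempfile

def files_exist(folder, file_names):
    """Return True only if every expected file is already present in `folder`."""
    if isinstance(file_names, str):  # a property may return one name or a list
        file_names = [file_names]
    return len(file_names) > 0 and all(
        os.path.exists(os.path.join(folder, name)) for name in file_names
    )

with tempfile.TemporaryDirectory() as raw_dir:
    expected = ["ind.cora.x", "ind.cora.y"]  # illustrative file names
    before = files_exist(raw_dir, expected)  # nothing downloaded yet
    for name in expected:
        open(os.path.join(raw_dir, name), "w").close()
    after = files_exist(raw_dir, expected)   # the download could now be skipped

print(before, after)
```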
- class jittor_geometric.datasets.Amazon(root, name, transform=None, pre_transform=None)[source]
Bases: InMemoryDataset
The Amazon Computers and Amazon Photo datasets from the paper “Pitfalls of Graph Neural Network Evaluation” <https://arxiv.org/abs/1811.05868>.
This class represents the Amazon dataset used in the paper “Pitfalls of Graph Neural Network Evaluation”. In this dataset, nodes represent products, and edges indicate that two products are frequently bought together. The dataset provides product reviews represented as bag-of-words node features, and the task is to classify products into their respective categories.
Dataset Details:
Amazon Computers: This dataset contains products related to computers, where the task is to classify the products based on the reviews and co-purchase information.
Amazon Photo: This dataset contains products related to photography, with a similar task of classifying products based on reviews and co-purchase data.
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset, either "Computers" or "Photo".
transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed on each access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
Example
>>> dataset = Amazon(root='/path/to/dataset', name='Computers')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
- url = 'https://github.com/shchur/gnn-benchmark/raw/master/data/npz/'
- property raw_file_names: str
The name of the files to find in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.WikipediaNetwork(root, name, transform=None, pre_transform=None)[source]
Bases: InMemoryDataset
Heterophilic dataset from the paper ‘A critical look at the evaluation of GNNs under heterophily: Are we really making progress?’ <https://arxiv.org/abs/2302.11640>.
This class represents a collection of heterophilic graph datasets used to evaluate the performance of Graph Neural Networks (GNNs) in heterophilic settings. These datasets consist of graphs where nodes are connected based on certain relationships, and the task is to classify the nodes based on their features or labels. The datasets in this collection come from different domains, and each dataset has a unique structure and task.
Dataset Details:
Chameleon
Squirrel
Chameleon-Filtered
Squirrel-Filtered
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset to load. Options include: "chameleon", "squirrel", "chameleon_filtered", "squirrel_filtered".
transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
Example
>>> dataset = WikipediaNetwork(root='/path/to/dataset', name='chameleon')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
- url = 'https://github.com/yandex-research/heterophilous-graphs/raw/main/data'
- property raw_file_names: str
The name of the files to find in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.GeomGCN(root, name, transform=None, pre_transform=None)[source]
Bases: InMemoryDataset
The GeomGCN datasets used in the “Geom-GCN: Geometric Graph Convolutional Networks” <https://openreview.net/forum?id=S1e2agrFvS> paper.
This class represents the datasets used in the Geom-GCN paper, which focuses on geometric graph convolutional networks. The datasets consist of graphs where nodes represent various entities, and edges represent relationships between them. The goal is to apply graph convolutional networks (GCNs) in the context of geometric graphs to classify nodes based on their features.
Dataset Details:
Cornell, Texas, Wisconsin: These datasets represent web pages from the Cornell, Texas, and Wisconsin universities, where nodes are web pages, and edges represent hyperlinks between them. The task is to classify web pages into one of five categories: student, project, course, staff, and faculty.
Actor: In the Actor dataset, each node corresponds to an actor, and edges between nodes represent co-occurrence on the same Wikipedia page. The task is to classify the actors into one of five categories based on keywords extracted from their Wikipedia pages.
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset to load. Options include: "Cornell", "Texas", "Wisconsin", "Actor".
transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
Example
>>> dataset = GeomGCN(root='/path/to/dataset', name='Cornell')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
- url = 'https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master'
- property raw_file_names: List[str]
The name of the files to find in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.LINKXDataset(root, name, transform=None, pre_transform=None)[source]
Bases: InMemoryDataset
A variety of non-homophilous graph datasets from the paper “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” <https://arxiv.org/abs/2110.14446>.
Dataset Details:
Penn94: A friendship network of university students from the Facebook 100 dataset. Nodes represent students, with labels indicating gender. Node features include major, dorm, year, and high school.
Pokec: A friendship network from a Slovak online social network. Nodes represent users, connected by directed friendship relations. Node features include profile information like region, registration time, and age, with labels based on gender.
arXiv-year: Based on the ogbn-arXiv network, with nodes representing papers and edges representing citations. The classification task is set to predict the year a paper was posted, using word2vec features derived from the title and abstract.
snap-patents: A citation network of U.S. utility patents, where nodes represent patents and edges denote citations. The classification task is to predict the year a patent was granted, with node features derived from patent metadata.
genius: A social network from genius.com, where nodes are users connected by mutual follows. The task is to predict whether a user account is marked as “gone,” based on usage features like expertise score and contribution counts.
twitch-gamers: A network of Twitch accounts with edges between mutual followers. Node features include account statistics like views, creation date, and account status. The binary classification task is to predict whether a channel has explicit content.
wiki: A graph of Wikipedia articles, with nodes representing pages and edges representing links between them. Node features are GloVe embeddings from the title and abstract. Labels represent total page views, categorized into quintiles.
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset to load. Options include: "penn94", "pokec", "arxiv-year", "snap-patents", "genius", "twitch-gamers", "wiki".
transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on each access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
Example
>>> dataset = LINKXDataset(root='/path/to/dataset', name='pokec')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
- property raw_file_names: List[str]
The name of the files to find in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.OGBNodePropPredDataset(name, root='dataset', transform=None, pre_transform=None, meta_dict=None)[source]
Bases: InMemoryDataset
The Open Graph Benchmark (OGB) Node Property Prediction Datasets, provided by the OGB team. These datasets are designed to benchmark large-scale node property prediction tasks on real-world graphs.
This class provides access to various OGB datasets focused on node property prediction tasks. Each dataset contains nodes representing entities (e.g., papers, products) and edges representing relationships (e.g., citations, co-purchases). The goal is to predict specific node-level properties, such as categories or timestamps, based on the graph structure and node features.
Dataset Details:
ogbn-arxiv: A citation network where nodes represent arXiv papers and directed edges indicate citation relationships. The task is to predict the subject area of each paper based on word2vec features derived from the title and abstract.
ogbn-products: An Amazon product co-purchasing network where nodes represent products and edges indicate frequently co-purchased products. The task is to classify each product based on its category, with node features based on product descriptions.
ogbn-paper100M: A large-scale citation network where nodes represent research papers and edges indicate citation links. The node features are derived from word embeddings of the paper abstracts. The task is to predict the subject area of each paper.
These datasets are provided by the Open Graph Benchmark (OGB) team, which aims to facilitate machine learning research on graphs by offering diverse, large-scale datasets. For more details, visit the OGB website: https://ogb.stanford.edu/.
- Parameters:
name (str) – The name of the dataset to load. Options include: "ogbn-arxiv", "ogbn-products", "ogbn-paper100M".
root (str) – Root directory where the dataset folder will be stored.
transform (callable, optional) – A function/transform that takes in a graph object and returns a transformed version. The graph object will be transformed on each access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a graph object and returns a transformed version. The graph object will be transformed before being saved to disk. (default: None)
meta_dict (dict, optional) – A dictionary containing meta-information about the dataset. When provided, it overrides the default meta-information, which is useful for debugging or for contributions from external users.
Example
>>> dataset = OGBNodePropPredDataset(name="ogbn-arxiv", root="path/to/dataset")
>>> data = dataset[0]  # Access the first graph object
- Acknowledgment:
The OGBNodePropPredDataset is developed and maintained by the Open Graph Benchmark (OGB) team. We sincerely thank the OGB team for their significant contributions to the graph machine learning community.
- property num_classes
The number of classes in the dataset.
- property raw_file_names
The name of the files to find in the self.raw_dir folder in order to skip the download.
- property processed_file_names
The name of the files to find in the self.processed_dir folder in order to skip the processing.
- class jittor_geometric.datasets.HeteroDataset(root, name, transform=None, pre_transform=None)[source]
Bases: InMemoryDataset
Heterophilic dataset from the paper ‘A critical look at the evaluation of GNNs under heterophily: Are we really making progress?’ <https://arxiv.org/abs/2302.11640>.
This class represents a collection of heterophilic graph datasets used to evaluate the performance of Graph Neural Networks (GNNs) in heterophilic settings. These datasets consist of graphs where nodes are connected based on certain relationships, and the task is to classify the nodes based on their features or labels. The datasets in this collection come from different domains, and each dataset has a unique structure and task.
Dataset Details:
Roman Empire: A graph from the Wikipedia article on the Roman Empire. Nodes represent words, connected based on their order or syntactic dependencies, with the task to classify words by syntactic roles.
Amazon Ratings: Based on Amazon co-purchasing data, where nodes are products connected if frequently bought together. The task is to predict the product’s average rating.
Minesweeper: A synthetic graph based on Minesweeper. Nodes represent grid cells with edges to neighbors. The task is to predict which cells contain mines.
Tolokers: Represents workers from the Toloka platform, with edges indicating shared tasks. The goal is to predict if a worker was banned.
Questions: Built from Yandex Q data, with nodes representing users who answered each other’s questions on the topic of medicine. The task is to predict if a user remained active.
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset to load. Options include: "roman-empire", "amazon-ratings", "minesweeper", "tolokers", "questions".
transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
Example
>>> dataset = HeteroDataset(root='/path/to/dataset', name='amazon-ratings')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
- url = 'https://github.com/yandex-research/heterophilous-graphs/raw/main/data'
- property raw_file_names: str
The name of the files to find in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.JODIEDataset(root, name, transform=None, pre_transform=None)[source]
Bases: InMemoryDataset
The temporal graph datasets from the paper “JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks” <https://cs.stanford.edu/~srijan/pubs/jodie-kdd2019.pdf>.
This class handles loading and processing temporal graph datasets used in the JODIE paper. It is designed for graph-based machine learning tasks, such as dynamic embedding and link prediction. The dataset includes interactions between users and entities (e.g., subreddits, Wikipedia pages, songs, or MOOC course items), and the interactions are timestamped.
Dataset Details:
Reddit Post Dataset: This dataset consists of interactions between users and subreddits. We selected the 1,000 most active subreddits and the 10,000 most active users, resulting in 672,447 interactions. Each post’s text is represented as a feature vector using LIWC categories.
Wikipedia Edits: This dataset represents edits made by users on Wikipedia pages. We selected the 1,000 most edited pages and users with at least 5 edits, totaling 8,227 users and 157,474 interactions. Each edit is converted into a LIWC-feature vector.
LastFM Song Listens: This dataset records user-song interactions, with 1,000 users and the 1,000 most listened-to songs, resulting in 1,293,103 interactions. Unlike other datasets, interactions do not have features.
MOOC Student Drop-Out: This dataset captures student interactions (e.g., viewing videos, submitting answers) on a MOOC online course. There are 7,047 users interacting with 98 items (videos, answers, etc.), generating 411,749 interactions, including 4,066 drop-out events.
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset. Options include: "Reddit", "Wikipedia", "LastFM", "MOOC".
transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on each access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
Example
>>> dataset = JODIEDataset(root='/path/to/dataset', name='Reddit')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
- url = 'http://snap.stanford.edu/jodie/{}.csv'
- names = ['reddit', 'wikipedia', 'mooc', 'lastfm']
- property raw_file_names: str
The name of the files to find in the self.raw_dir folder in order to skip the download.
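The `url` attribute of JODIEDataset is a format template and `names` lists the accepted keys in lowercase. The sketch below shows how the two could combine into a download URL. The helper `raw_url` is hypothetical, and lowercasing the name before the lookup is an assumption made because the documented options are capitalized while `names` is not.

```python
URL_TEMPLATE = "http://snap.stanford.edu/jodie/{}.csv"
NAMES = ["reddit", "wikipedia", "mooc", "lastfm"]

def raw_url(name):
    """Build the download URL for a dataset name, ignoring letter case."""
    key = name.lower()  # assumption: names are matched case-insensitively
    if key not in NAMES:
        raise ValueError(f"Unknown dataset {name!r}; expected one of {NAMES}")
    return URL_TEMPLATE.format(key)

print(raw_url("Reddit"))
```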
- class jittor_geometric.datasets.Reddit(root, transform=None, pre_transform=None)[source]
Bases: InMemoryDataset
The Reddit dataset from the “Inductive Representation Learning on Large Graphs” paper, containing Reddit posts belonging to different communities.
This dataset is designed for large-scale graph representation learning. Nodes in the graph represent Reddit posts, and edges represent interactions (e.g., comments) between posts in the same community. The task is to classify posts into one of the 41 communities based on their content and connectivity.
Dataset Statistics:
Number of Nodes: 232,965
Number of Edges: 114,615,892
Number of Features: 602
Number of Classes: 41
The dataset is pre-split into training, validation, and test sets using node type masks.
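As an illustration of how a per-node type code can be turned into the three masks, consider the sketch below. The encoding (1 for training, 2 for validation, 3 for test) is an assumption about the raw data, and the helper `masks_from_node_types` is hypothetical.

```python
def masks_from_node_types(node_types, train_id=1, val_id=2, test_id=3):
    """Turn a per-node type code into three boolean masks."""
    train_mask = [t == train_id for t in node_types]
    val_mask = [t == val_id for t in node_types]
    test_mask = [t == test_id for t in node_types]
    return train_mask, val_mask, test_mask

node_types = [1, 1, 2, 3, 1, 3]  # toy example with six nodes
train_mask, val_mask, test_mask = masks_from_node_types(node_types)

# Select the labels of the training nodes only.
labels = [7, 3, 3, 0, 7, 5]
train_labels = [y for y, keep in zip(labels, train_mask) if keep]
print(train_labels)
```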
- Parameters:
root (str) – Root directory where the dataset should be saved.
transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
Example
>>> dataset = Reddit(root='/path/to/reddit')
>>> data = dataset[0]  # Access the first graph object
- url = 'https://data.dgl.ai/dataset/reddit.zip'
- property raw_file_names: List[str]
The name of the files to find in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.TemporalDataLoader(data, batch_size=1, neg_sampling_ratio=None, drop_last=False, num_neg_sample=None, neg_samples=None)[source]
Bases: object
- class jittor_geometric.datasets.QM9(root, transform=None, pre_transform=None, pre_filter=None)[source]
Bases: InMemoryDataset
If you encounter a network error, try running export HF_ENDPOINT=https://hf-mirror.com to use a Hugging Face mirror.
The QM9 dataset from the “MoleculeNet: A Benchmark for Molecular Machine Learning” paper, consisting of about 130,000 molecules with 19 regression targets. Each molecule includes complete spatial information for the single low energy conformation of the atoms in the molecule. In addition, we provide the atom features from the “Neural Message Passing for Quantum Chemistry” paper.
| Target | Property | Description | Unit |
|---|---|---|---|
| 0 | \(\mu\) | Dipole moment | \(\textrm{D}\) |
| 1 | \(\alpha\) | Isotropic polarizability | \({a_0}^3\) |
| 2 | \(\epsilon_{\textrm{HOMO}}\) | Highest occupied molecular orbital energy | \(\textrm{eV}\) |
| 3 | \(\epsilon_{\textrm{LUMO}}\) | Lowest unoccupied molecular orbital energy | \(\textrm{eV}\) |
| 4 | \(\Delta \epsilon\) | Gap between \(\epsilon_{\textrm{HOMO}}\) and \(\epsilon_{\textrm{LUMO}}\) | \(\textrm{eV}\) |
| 5 | \(\langle R^2 \rangle\) | Electronic spatial extent | \({a_0}^2\) |
| 6 | \(\textrm{ZPVE}\) | Zero point vibrational energy | \(\textrm{eV}\) |
| 7 | \(U_0\) | Internal energy at 0K | \(\textrm{eV}\) |
| 8 | \(U\) | Internal energy at 298.15K | \(\textrm{eV}\) |
| 9 | \(H\) | Enthalpy at 298.15K | \(\textrm{eV}\) |
| 10 | \(G\) | Free energy at 298.15K | \(\textrm{eV}\) |
| 11 | \(c_{\textrm{v}}\) | Heat capacity at 298.15K | \(\frac{\textrm{cal}}{\textrm{mol K}}\) |
| 12 | \(U_0^{\textrm{ATOM}}\) | Atomization energy at 0K | \(\textrm{eV}\) |
| 13 | \(U^{\textrm{ATOM}}\) | Atomization energy at 298.15K | \(\textrm{eV}\) |
| 14 | \(H^{\textrm{ATOM}}\) | Atomization enthalpy at 298.15K | \(\textrm{eV}\) |
| 15 | \(G^{\textrm{ATOM}}\) | Atomization free energy at 298.15K | \(\textrm{eV}\) |
| 16 | \(A\) | Rotational constant | \(\textrm{GHz}\) |
| 17 | \(B\) | Rotational constant | \(\textrm{GHz}\) |
| 18 | \(C\) | Rotational constant | \(\textrm{GHz}\) |
Note
We also provide a pre-processed version of the dataset in case rdkit is not installed. The pre-processed version matches the manually processed version as outlined in process().
- Parameters:
root (str) – Root directory where the dataset should be saved.
transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a jittor_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
STATS:
| #graphs | #nodes | #edges | #features | #tasks |
|---|---|---|---|---|
| 130,831 | ~18.0 | ~37.3 | 11 | 19 |
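Because every molecule carries 19 targets, a common use of the `transform` argument is to keep a single target column. The sketch below shows the shape of such a callable. `ToyData` and `SelectTarget` are hypothetical names, the targets are held in a plain list rather than a Jittor array, and the values are dummies rather than real QM9 data.

```python
class ToyData:
    """Hypothetical stand-in for a data object holding 19 regression targets."""

    def __init__(self, y):
        self.y = y


class SelectTarget:
    """Callable transform that keeps a single target column."""

    def __init__(self, target):
        if not 0 <= target < 19:
            raise ValueError("QM9 targets are indexed 0..18")
        self.target = target

    def __call__(self, data):
        # Return a new object so the stored data keeps all 19 targets.
        return ToyData([data.y[self.target]])


sample = ToyData([float(i) for i in range(19)])  # dummy values, not real QM9 data
gap_only = SelectTarget(4)(sample)  # index 4 is the HOMO-LUMO gap in the target table
print(gap_only.y)
```

Passing an instance such as `SelectTarget(4)` as `transform` would, under these assumptions, apply the selection on every access while leaving the processed data on disk unchanged.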
- raw_url = 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/molnet_publish/qm9.zip'
- raw_url2 = 'https://ndownloader.figshare.com/files/3195404'
- property raw_file_names: List[str]
The name of the files to find in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.MoleculeNet(root, name, transform=None, pre_transform=None, pre_filter=None, from_smiles=None)[source]
Bases: InMemoryDataset
The MoleculeNet benchmark collection from the “MoleculeNet: A Benchmark for Molecular Machine Learning” paper, containing datasets from physical chemistry, biophysics and physiology. All datasets come with the additional node and edge features introduced by the Open Graph Benchmark.
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset ("ESOL", "FreeSolv", "Lipo", "PCBA", "MUV", "HIV", "BACE", "BBBP", "Tox21", "ToxCast", "SIDER", "ClinTox").
transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a jittor_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
from_smiles (callable, optional) – A custom function that takes a SMILES string and outputs a Data object. If not set, defaults to from_smiles(). (default: None)
STATS:
| Name | #graphs | #nodes | #edges | #features | #classes |
|---|---|---|---|---|---|
| ESOL | 1,128 | ~13.3 | ~27.4 | 9 | 1 |
| FreeSolv | 642 | ~8.7 | ~16.8 | 9 | 1 |
| ClinTox | 1,484 | ~26.1 | ~55.5 | 9 | 2 |
- url = 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/{}'
- names: Dict[str, Tuple[str, str, str, int, Union[int, slice]]] = {'bace': ('BACE', 'bace.csv', 'bace', 0, 2), 'bbbp': ('BBBP', 'BBBP.csv', 'BBBP', -1, -2), 'clintox': ('ClinTox', 'clintox.csv.gz', 'clintox', 0, slice(1, 3, None)), 'esol': ('ESOL', 'delaney-processed.csv', 'delaney-processed', -1, -2), 'freesolv': ('FreeSolv', 'SAMPL.csv', 'SAMPL', 1, 2), 'hiv': ('HIV', 'HIV.csv', 'HIV', 0, -1), 'lipo': ('Lipophilicity', 'Lipophilicity.csv', 'Lipophilicity', 2, 1), 'muv': ('MUV', 'muv.csv.gz', 'muv', -1, slice(0, 17, None)), 'pcba': ('PCBA', 'pcba.csv.gz', 'pcba', -1, slice(0, 128, None)), 'sider': ('SIDER', 'sider.csv.gz', 'sider', 0, slice(1, 28, None)), 'tox21': ('Tox21', 'tox21.csv.gz', 'tox21', -1, slice(0, 12, None)), 'toxcast': ('ToxCast', 'toxcast_data.csv.gz', 'toxcast_data', 0, slice(1, 618, None))}
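Each value in `names` appears to be a tuple of display name, raw file name, file stem, SMILES column index, and target column index (or slice). That reading is an assumption, since the fields are not described here. Under it, one CSV row could be split as in the sketch below. The helper `parse_row` and the sample rows are made up for illustration, and the naive comma split would not handle quoted fields.

```python
def parse_row(row, smiles_col, target_cols):
    """Split one CSV row into a SMILES string and a list of float targets."""
    fields = row.split(",")
    smiles = fields[smiles_col]
    targets = fields[target_cols]
    if not isinstance(targets, list):  # an int index yields a single field
        targets = [targets]
    # Empty fields stand for missing labels and become NaN.
    return smiles, [float(t) if t else float("nan") for t in targets]

# Index pairs copied from the last two tuple entries for 'clintox' and 'esol'.
clintox_spec = (0, slice(1, 3))
esol_spec = (-1, -2)

clintox_row = parse_row("CCO,1,0", *clintox_spec)         # made-up row
esol_row = parse_row("ethanol,-0.77,CCO", *esol_spec)     # made-up row
print(clintox_row)
print(esol_row)
```

Using `Union[int, slice]` for the last entry lets the same indexing expression serve single-task and multi-task datasets alike.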
- __init__(root, name, transform=None, pre_transform=None, pre_filter=None, from_smiles=None)[source]
- property raw_file_names: str
The name of the files to find in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.MD17(root, name, train=None, transform=None, pre_transform=None, pre_filter=None)[source]
Bases: InMemoryDataset
A variety of ab-initio molecular dynamics trajectories from the authors of sGDML. This class provides access to the original MD17 datasets, their revised versions, and the CCSD(T) trajectories.
For every trajectory, the dataset contains the Cartesian positions of atoms (in Angstrom), their atomic numbers, as well as the total energy (in kcal/mol) and forces (kcal/mol/Angstrom) on each atom. The latter two are the regression targets for this collection.
Note
Data objects contain no edge indices as these are most commonly constructed via the jittor_geometric.transforms.RadiusGraph transform, with its cut-off being a hyperparameter.
The original MD17 dataset contains ten molecule trajectories. This version of the dataset was found to suffer from high numerical noise. The revised MD17 dataset contains the same molecules, but the energies and forces were recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and very dense DFT integration grid. The third version of the dataset contains fewer molecules, computed at the CCSD(T) level of theory. The benzene molecule at the DFT FHI-aims level of theory was released separately.
Check the table below for detailed information on the molecule, level of theory and number of data points contained in each dataset. Which trajectory is loaded is determined by the name argument. For the coupled cluster trajectories, the dataset comes with pre-defined training and testing splits which are loaded separately via the train argument.
| Molecule | Level of Theory | Name | #Examples |
|---|---|---|---|
| Benzene | DFT | benzene | 627,983 |
| Uracil | DFT | uracil | 133,770 |
| Naphthalene | DFT | naphthalene | 326,250 |
| Aspirin | DFT | aspirin | 211,762 |
| Salicylic acid | DFT | salicylic acid | 320,231 |
| Malonaldehyde | DFT | malonaldehyde | 993,237 |
| Ethanol | DFT | ethanol | 555,092 |
| Toluene | DFT | toluene | 442,790 |
| Paracetamol | DFT | paracetamol | 106,490 |
| Azobenzene | DFT | azobenzene | 99,999 |
| Benzene (R) | DFT (PBE/def2-SVP) | revised benzene | 100,000 |
| Uracil (R) | DFT (PBE/def2-SVP) | revised uracil | 100,000 |
| Naphthalene (R) | DFT (PBE/def2-SVP) | revised naphthalene | 100,000 |
| Aspirin (R) | DFT (PBE/def2-SVP) | revised aspirin | 100,000 |
| Salicylic acid (R) | DFT (PBE/def2-SVP) | revised salicylic acid | 100,000 |
| Malonaldehyde (R) | DFT (PBE/def2-SVP) | revised malonaldehyde | 100,000 |
| Ethanol (R) | DFT (PBE/def2-SVP) | revised ethanol | 100,000 |
| Toluene (R) | DFT (PBE/def2-SVP) | revised toluene | 100,000 |
| Paracetamol (R) | DFT (PBE/def2-SVP) | revised paracetamol | 100,000 |
| Azobenzene (R) | DFT (PBE/def2-SVP) | revised azobenzene | 99,988 |
| Benzene | CCSD(T) | benzene CCSD(T) | 1,500 |
| Aspirin | CCSD | aspirin CCSD | 1,500 |
| Malonaldehyde | CCSD(T) | malonaldehyde CCSD(T) | 1,500 |
| Ethanol | CCSD(T) | ethanol CCSD(T) | 2,000 |
| Toluene | CCSD(T) | toluene CCSD(T) | 1,501 |
| Benzene | DFT FHI-aims | benzene FHI-aims | 49,863 |
Warning
It is advised to not train a model on more than 1,000 samples from the original or revised MD17 dataset.
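Since the data objects ship without edge indices, edges are typically derived from atomic positions with a distance cut-off. The sketch below shows the idea in plain Python with a quadratic loop. It is not the RadiusGraph transform itself, the helper `radius_graph` is hypothetical, and the coordinates are made up.

```python
import math

def radius_graph(positions, cutoff):
    """Return directed edges (i, j) for all atom pairs closer than `cutoff`."""
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) < cutoff:
                edges.append((i, j))
    return edges

# Three toy atoms on a line, 1.0 Angstrom apart (made-up coordinates).
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
near_edges = radius_graph(positions, cutoff=1.5)
all_edges = radius_graph(positions, cutoff=2.5)
print(near_edges)
print(len(all_edges))
```

A larger cut-off yields a denser graph, which is why the cut-off is treated as a hyperparameter of the model rather than a property of the dataset.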
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – Keyword of the trajectory that should be loaded.
train (bool, optional) – Determines whether the train or test split gets loaded for the coupled cluster trajectories. (default: None)
transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a jittor_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
STATS:
| Name | #graphs | #nodes | #edges | #features | #tasks |
|---|---|---|---|---|---|
| Benzene | 627,983 | 12 | 0 | 1 | 2 |
| Uracil | 133,770 | 12 | 0 | 1 | 2 |
| Naphthalene | 326,250 | 10 | 0 | 1 | 2 |
| Aspirin | 211,762 | 21 | 0 | 1 | 2 |
| Salicylic acid | 320,231 | 16 | 0 | 1 | 2 |
| Malonaldehyde | 993,237 | 9 | 0 | 1 | 2 |
| Ethanol | 555,092 | 9 | 0 | 1 | 2 |
| Toluene | 442,790 | 15 | 0 | 1 | 2 |
| Paracetamol | 106,490 | 20 | 0 | 1 | 2 |
| Azobenzene | 99,999 | 24 | 0 | 1 | 2 |
| Benzene (R) | 100,000 | 12 | 0 | 1 | 2 |
| Uracil (R) | 100,000 | 12 | 0 | 1 | 2 |
| Naphthalene (R) | 100,000 | 10 | 0 | 1 | 2 |
| Aspirin (R) | 100,000 | 21 | 0 | 1 | 2 |
| Salicylic acid (R) | 100,000 | 16 | 0 | 1 | 2 |
| Malonaldehyde (R) | 100,000 | 9 | 0 | 1 | 2 |
| Ethanol (R) | 100,000 | 9 | 0 | 1 | 2 |
| Toluene (R) | 100,000 | 15 | 0 | 1 | 2 |
| Paracetamol (R) | 100,000 | 20 | 0 | 1 | 2 |
| Azobenzene (R) | 99,988 | 24 | 0 | 1 | 2 |
| Benzene CCSD-T | 1,500 | 12 | 0 | 1 | 2 |
| Aspirin CCSD-T | 1,500 | 21 | 0 | 1 | 2 |
| Malonaldehyde CCSD-T | 1,500 | 9 | 0 | 1 | 2 |
| Ethanol CCSD-T | 2,000 | 9 | 0 | 1 | 2 |
| Toluene CCSD-T | 1,501 | 15 | 0 | 1 | 2 |
| Benzene FHI-aims | 49,863 | 12 | 0 | 1 | 2 |
- gdml_url = 'http://quantum-machine.org/gdml/data/npz'
- revised_url = 'https://archive.materialscloud.org/record/file?filename=rmd17.tar.bz2&record_id=466'
- file_names = {'aspirin': 'md17_aspirin.npz', 'aspirin CCSD': 'aspirin_ccsd.zip', 'azobenzene': 'azobenzene_dft.npz', 'benzene': 'md17_benzene2017.npz', 'benzene CCSD(T)': 'benzene_ccsd_t.zip', 'benzene FHI-aims': 'benzene2018_dft.npz', 'ethanol': 'md17_ethanol.npz', 'ethanol CCSD(T)': 'ethanol_ccsd_t.zip', 'malonaldehyde': 'md17_malonaldehyde.npz', 'malonaldehyde CCSD(T)': 'malonaldehyde_ccsd_t.zip', 'naphthalene': 'md17_naphthalene.npz', 'paracetamol': 'paracetamol_dft.npz', 'revised aspirin': 'rmd17_aspirin.npz', 'revised azobenzene': 'rmd17_azobenzene.npz', 'revised benzene': 'rmd17_benzene.npz', 'revised ethanol': 'rmd17_ethanol.npz', 'revised malonaldehyde': 'rmd17_malonaldehyde.npz', 'revised naphthalene': 'rmd17_naphthalene.npz', 'revised paracetamol': 'rmd17_paracetamol.npz', 'revised salicylic acid': 'rmd17_salicylic.npz', 'revised toluene': 'rmd17_toluene.npz', 'revised uracil': 'rmd17_uracil.npz', 'salicylic acid': 'md17_salicylic.npz', 'toluene': 'md17_toluene.npz', 'toluene CCSD(T)': 'toluene_ccsd_t.zip', 'uracil': 'md17_uracil.npz'}
- property raw_file_names: str | List[str]
The names of the files that must be present in the self.raw_dir folder in order to skip the download.
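The gdml_url, revised_url and file_names attributes above together determine what gets downloaded for a given trajectory key. The sketch below shows one plausible way they combine. It assumes that all revised trajectories come from the single revised archive and that every other key is fetched as gdml_url plus the file name. The helper resolve_download is hypothetical, and only a subset of file_names is reproduced.

```python
GDML_URL = "http://quantum-machine.org/gdml/data/npz"
REVISED_URL = (
    "https://archive.materialscloud.org/record/file"
    "?filename=rmd17.tar.bz2&record_id=466"
)
# Subset of the file_names mapping listed above.
FILE_NAMES = {
    "aspirin": "md17_aspirin.npz",
    "aspirin CCSD": "aspirin_ccsd.zip",
    "revised aspirin": "rmd17_aspirin.npz",
    "benzene FHI-aims": "benzene2018_dft.npz",
}


def resolve_download(name):
    """Return (url, raw_file_name) for a trajectory key.

    Assumption: revised trajectories share one archive; all other keys
    are fetched as <gdml_url>/<file name>.
    """
    if name not in FILE_NAMES:
        raise KeyError(f"unknown trajectory key: {name!r}")
    file_name = FILE_NAMES[name]
    if name.startswith("revised "):
        return REVISED_URL, file_name
    return f"{GDML_URL}/{file_name}", file_name


url, raw_name = resolve_download("aspirin")
```

Keys are case-sensitive and must match file_names exactly, for example "benzene FHI-aims" rather than "Benzene FHI-aims".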
- class jittor_geometric.datasets.PCQM4Mv2(root, split='train', transform=None, from_smiles=None)[source]
Bases: InMemoryDataset
The PCQM4Mv2 dataset from the “OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs” paper.
PCQM4Mv2 is a quantum chemistry dataset originally curated under the PubChemQC project. The task is to predict the DFT-calculated HOMO-LUMO energy gap of molecules given their 2D molecular graphs.
- Parameters:
root (str) – Root directory where the dataset should be saved.
split (str, optional) – If "train", loads the training dataset. If "val", loads the validation dataset. If "test", loads the test dataset. If "holdout", loads the holdout dataset. (default: "train")
transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
from_smiles (callable, optional) – A custom function that takes a SMILES string and outputs a Data object. If not set, defaults to from_smiles(). (default: None)
- url = 'https://dgl-data.s3-accelerate.amazonaws.com/dataset/OGB-LSC/pcqm4m-v2.zip'
- split_mapping = {'holdout': 'test-challenge', 'test': 'test-dev', 'train': 'train', 'val': 'valid'}
- property raw_file_names: List[str]
The names of the files that must be present in the self.raw_dir folder in order to skip the download.
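The split argument uses short names, while split_mapping translates them into the names used by the OGB-LSC release. A minimal sketch of that lookup with input validation follows. The helper resolve_split and its error message are assumptions for illustration, and the mapping is copied from the attribute above.

```python
SPLIT_MAPPING = {
    "train": "train",
    "val": "valid",
    "test": "test-dev",
    "holdout": "test-challenge",
}


def resolve_split(split="train"):
    """Translate a user-facing split name into the key used by the split file."""
    try:
        return SPLIT_MAPPING[split]
    except KeyError:
        valid = ", ".join(sorted(SPLIT_MAPPING))
        raise ValueError(f"split must be one of: {valid}; got {split!r}") from None


resolved = [resolve_split(s) for s in ("train", "val", "test", "holdout")]
```

Note that "valid" is a mapped value, not an accepted argument: passing split="valid" is rejected in this sketch.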
- class jittor_geometric.datasets.MovieLens1M(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]
Bases: RecSysBase
MovieLens-1M (RecBole processed), downloaded automatically.
Downloads: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-1m.zip
Expected in raw_dir after extraction:
ml-1m.item
ml-1m.user
ml-1m.inter
Files are tab-separated; the header row is skipped when reading (skiprows=1).
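To make the file format concrete, the sketch below parses a small in-memory stand-in for an .inter file with the standard library, skipping the header row as skiprows=1 would. The column order (user, item, rating, timestamp), the header text, and the helper read_interactions are assumptions for illustration. The loader itself may read the files differently.

```python
import csv
import io


def read_interactions(handle):
    """Parse a tab-separated .inter file, skipping the single header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)                       # same effect as skiprows=1
    rows = []
    for fields in reader:
        if not fields:
            continue                         # tolerate blank lines
        user_id, item_id, rating, timestamp = fields[:4]
        rows.append((int(user_id), int(item_id), float(rating), int(timestamp)))
    return rows


# Hypothetical sample; the header follows RecBole's "name:type" style.
sample = io.StringIO(
    "user_id:token\titem_id:token\trating:float\ttimestamp:float\n"
    "1\t1193\t5\t978300760\n"
    "1\t661\t3\t978302109\n"
)
interactions = read_interactions(sample)
```

For a file on disk, open it with newline="" and pass the handle to the same function.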
- Parameters:
with_aux (bool)
- url = 'https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-1m.zip'
- __init__(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]
- Parameters:
with_aux (bool)
- property raw_file_names
The names of the files that must be present in the self.raw_dir folder in order to skip the download.
- class jittor_geometric.datasets.MovieLens100K(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]
Bases: RecSysBase
MovieLens-100K (RecBole processed).
Downloads: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-100k.zip
Expected in raw_dir after extraction:
ml-100k.item
ml-100k.user
ml-100k.inter
- Parameters:
with_aux (bool)
- url = 'https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-100k.zip'
- __init__(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]
- Parameters:
with_aux (bool)
- property raw_file_names
The names of the files that must be present in the self.raw_dir folder in order to skip the download.
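The train_ratio, val_ratio, test_ratio, seed and shuffle arguments in the signatures above suggest a seeded ratio split over interactions. How RecSysBase actually partitions the data (globally, per user, or by time) is not documented here, so the following is only a sketch of a plain global split with the same defaults. The function split_indices is hypothetical.

```python
import random


def split_indices(n, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1,
                  seed=42, shuffle=True):
    """Partition range(n) into train/val/test index lists by ratio."""
    total = train_ratio + val_ratio + test_ratio
    if abs(total - 1.0) > 1e-8:
        raise ValueError(f"ratios must sum to 1, got {total}")
    indices = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(indices)   # seeded, so splits are reproducible
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]           # remainder absorbs rounding
    return train, val, test


train, val, test = split_indices(100_000)
```

With shuffle=False the split is positional, which only makes sense if the interactions are already in a meaningful order such as by timestamp.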
- class jittor_geometric.datasets.Yelp2018(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]
Bases: RecSysBase
Yelp-2018 (RecBole processed).
Downloads: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Yelp/yelp2018.zip
Accepts either file naming variant inside the zip:
yelp2018.item/.user/.inter (common)
yelp-2018.item/.user/.inter (also supported)
After extraction, the file names are normalized to yelp-2018.* in raw_dir.
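The normalization step can be pictured as a rename of the three extracted files. The sketch below demonstrates the idea on a temporary directory with empty placeholder files. The helper normalize_yelp_names is an assumption for illustration and is not the library's code.

```python
import tempfile
from pathlib import Path


def normalize_yelp_names(raw_dir):
    """Rename yelp2018.{item,user,inter} to yelp-2018.* where needed."""
    raw_dir = Path(raw_dir)
    for suffix in (".item", ".user", ".inter"):
        source = raw_dir / f"yelp2018{suffix}"
        target = raw_dir / f"yelp-2018{suffix}"
        if source.exists() and not target.exists():
            source.rename(target)            # leave already-normalized files alone
    return sorted(p.name for p in raw_dir.iterdir())


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for suffix in (".item", ".user", ".inter"):
        (Path(tmp) / f"yelp2018{suffix}").touch()
    normalized = normalize_yelp_names(tmp)
```

Running the helper a second time is a no-op, so it is safe to call on a directory that already uses the yelp-2018.* names.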
- Parameters:
with_aux (bool)
- url = 'https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Yelp/yelp2018.zip'
- property raw_file_names
The names of the files that must be present in the self.raw_dir folder in order to skip the download.