jittor_geometric.datasets

Dataset loaders and utilities.

class jittor_geometric.datasets.Planetoid(root, name, split='public', num_train_per_class=20, num_val=500, num_test=1000, transform=None, pre_transform=None)[source]

Bases: InMemoryDataset

The citation network datasets “Cora”, “CiteSeer” and “PubMed” from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper.

This class represents three widely-used citation network datasets: Cora, CiteSeer, and PubMed. Nodes correspond to documents, and edges represent citation links between them. The datasets are designed for semi-supervised learning tasks, where training, validation, and test splits are provided as binary masks.

Dataset Details:

  • Cora: A citation network where nodes represent machine learning papers, and edges represent citations. The task is to classify papers into one of seven classes.

  • CiteSeer: A citation network of research papers in computer and information science. The task is to classify papers into one of six classes.

  • PubMed: A citation network of biomedical papers on diabetes. The task is to classify papers into one of three classes.

Splitting Options:

  • public: The original fixed split from the paper “Revisiting Semi-Supervised Learning with Graph Embeddings”.

  • full: Uses all nodes except those in the validation and test sets for training, inspired by “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling”.

  • random: Generates random splits for train, validation, and test sets based on the specified parameters.
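The "random" option can be pictured with a small, library-free sketch. It assumes the usual Planetoid convention (draw num_train_per_class nodes from every class, then take num_val and num_test nodes from the shuffled remainder); the helper name random_planetoid_split and the toy labels are hypothetical, and the real implementation may differ in details such as seeding.

```python
import random
from collections import defaultdict


def random_planetoid_split(labels, num_train_per_class=20, num_val=500,
                           num_test=1000, seed=0):
    """Return (train_mask, val_mask, test_mask) as lists of booleans.

    Hypothetical helper: mirrors the parameter names of Planetoid,
    but is not the library's own code.
    """
    rng = random.Random(seed)
    n = len(labels)

    # Group node indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Draw the same number of training nodes from every class.
    train_idx = []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        train_idx.extend(members[:num_train_per_class])

    # Validation and test nodes come from the shuffled remainder.
    train_set = set(train_idx)
    remaining = [i for i in range(n) if i not in train_set]
    rng.shuffle(remaining)
    val_idx = remaining[:num_val]
    test_idx = remaining[num_val:num_val + num_test]

    def to_mask(indices):
        mask = [False] * n
        for i in indices:
            mask[i] = True
        return mask

    return to_mask(train_idx), to_mask(val_idx), to_mask(test_idx)


# Toy graph: 30 nodes spread evenly over 3 classes.
labels = [i % 3 for i in range(30)]
train_mask, val_mask, test_mask = random_planetoid_split(
    labels, num_train_per_class=2, num_val=5, num_test=10)
```

The three masks are disjoint by construction, which is the property the binary masks of the dataset are expected to have.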

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset ("Cora", "CiteSeer", "PubMed").

  • split (str) – The type of dataset split ("public", "full", "random"). Default is "public".

  • num_train_per_class (int, optional) – Number of training samples per class for "random" split. Default is 20.

  • num_val (int, optional) – Number of validation samples for "random" split. Default is 500.

  • num_test (int, optional) – Number of test samples for "random" split. Default is 1000.

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. Default is None.

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version before saving to disk. Default is None.

Example

>>> dataset = Planetoid(root='/path/to/dataset', name='Cora', split='random')
>>> data = dataset[0]  # Access the processed data object
url = 'https://github.com/kimiyoung/planetoid/raw/master/data'
__init__(root, name, split='public', num_train_per_class=20, num_val=500, num_test=1000, transform=None, pre_transform=None)[source]
property raw_dir
property processed_dir
property raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

class jittor_geometric.datasets.Amazon(root, name, transform=None, pre_transform=None)[source]

Bases: InMemoryDataset

The Amazon Computers and Amazon Photo datasets from the paper “Pitfalls of Graph Neural Network Evaluation” <https://arxiv.org/abs/1811.05868>.

This class represents the Amazon dataset used in the paper “Pitfalls of Graph Neural Network Evaluation”. In this dataset, nodes represent products, and edges indicate that two products are frequently bought together. The dataset provides product reviews represented as bag-of-words node features, and the task is to classify products into their respective categories.

Dataset Details:

  • Amazon Computers: This dataset contains products related to computers, where the task is to classify the products based on the reviews and co-purchase information.

  • Amazon Photo: This dataset contains products related to photography, with a similar task of classifying products based on reviews and co-purchase data.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset, either "Computers" or "Photo".

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed on each access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = Amazon(root='/path/to/dataset', name='Computers')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
url = 'https://github.com/shchur/gnn-benchmark/raw/master/data/npz/'
__init__(root, name, transform=None, pre_transform=None)[source]
property raw_dir: str
property processed_dir: str
property raw_file_names: str

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

class jittor_geometric.datasets.WikipediaNetwork(root, name, transform=None, pre_transform=None)[source]

Bases: InMemoryDataset

Heterophilic dataset from the paper ‘A critical look at the evaluation of GNNs under heterophily: Are we really making progress?’ <https://arxiv.org/abs/2302.11640>.

This class represents a collection of heterophilic graph datasets used to evaluate the performance of Graph Neural Networks (GNNs) in heterophilic settings. These datasets consist of graphs where nodes are connected based on certain relationships, and the task is to classify the nodes based on their features or labels. The datasets in this collection come from different domains, and each dataset has a unique structure and task.

Dataset Details:

  • Chameleon

  • Squirrel

  • Chameleon-Filtered

  • Squirrel-Filtered

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset to load ("chameleon", "squirrel", "chameleon_filtered", "squirrel_filtered").

  • transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = WikipediaNetwork(root='/path/to/dataset', name='chameleon')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
url = 'https://github.com/yandex-research/heterophilous-graphs/raw/main/data'
__init__(root, name, transform=None, pre_transform=None)[source]
property raw_dir: str
property processed_dir: str
property raw_file_names: str

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

process(undirected=True)[source]

Processes the dataset to the self.processed_dir folder.
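The undirected flag is not described further here. A plausible reading, stated as an assumption, is that undirected=True symmetrizes the edge list during processing. The sketch below shows that idea on a plain list of (source, target) pairs; to_undirected is a hypothetical helper, not the library's function.

```python
def to_undirected(edges):
    """Add the reverse of every edge and drop duplicates.

    Assumption: this is what `undirected=True` is meant to achieve;
    the actual processing code may work on tensors instead.
    """
    symmetric = set()
    for src, dst in edges:
        symmetric.add((src, dst))
        symmetric.add((dst, src))
    # Sorting only makes the result deterministic for inspection.
    return sorted(symmetric)


directed_edges = [(0, 1), (1, 2), (2, 1)]
undirected_edges = to_undirected(directed_edges)
```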

class jittor_geometric.datasets.GeomGCN(root, name, transform=None, pre_transform=None)[source]

Bases: InMemoryDataset

The GeomGCN datasets used in the “Geom-GCN: Geometric Graph Convolutional Networks” <https://openreview.net/forum?id=S1e2agrFvS> paper.

This class represents the datasets used in the Geom-GCN paper, which focuses on geometric graph convolutional networks. The datasets consist of graphs where nodes represent various entities, and edges represent relationships between them. The goal is to apply graph convolutional networks (GCNs) in the context of geometric graphs to classify nodes based on their features.

Dataset Details:

  • Cornell, Texas, Wisconsin: These datasets represent web pages from the Cornell, Texas, and Wisconsin universities, where nodes are web pages, and edges represent hyperlinks between them. The task is to classify web pages into one of five categories: student, project, course, staff, and faculty.

  • Actor: In the Actor dataset, each node corresponds to an actor, and edges between nodes represent co-occurrence on the same Wikipedia page. The task is to classify the actors into one of five categories based on keywords extracted from their Wikipedia pages.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset to load ("Cornell", "Texas", "Wisconsin", "Actor").

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = GeomGCN(root='/path/to/dataset', name='Cornell')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
url = 'https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master'
__init__(root, name, transform=None, pre_transform=None)[source]
property raw_dir: str
property processed_dir: str
property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

class jittor_geometric.datasets.LINKXDataset(root, name, transform=None, pre_transform=None)[source]

Bases: InMemoryDataset

A variety of non-homophilous graph datasets from the paper “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” <https://arxiv.org/abs/2110.14446>.

Dataset Details:

  • Penn94: A friendship network of university students from the Facebook 100 dataset. Nodes represent students, with labels indicating gender. Node features include major, dorm, year, and high school.

  • Pokec: A friendship network from a Slovak online social network. Nodes represent users, connected by directed friendship relations. Node features include profile information like region, registration time, and age, with labels based on gender.

  • arXiv-year: Based on the ogbn-arXiv network, with nodes representing papers and edges representing citations. The classification task is set to predict the year a paper was posted, using word2vec features derived from the title and abstract.

  • snap-patents: A citation network of U.S. utility patents, where nodes represent patents and edges denote citations. The classification task is to predict the year a patent was granted, with node features derived from patent metadata.

  • genius: A social network from genius.com, where nodes are users connected by mutual follows. The task is to predict whether a user account is marked as “gone,” based on usage features like expertise score and contribution counts.

  • twitch-gamers: A network of Twitch accounts with edges between mutual followers. Node features include account statistics like views, creation date, and account status. The binary classification task is to predict whether a channel has explicit content.

  • wiki: A graph of Wikipedia articles, with nodes representing pages and edges representing links between them. Node features are GloVe embeddings from the title and abstract. Labels represent total page views, categorized into quintiles.
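For the wiki dataset, "categorized into quintiles" means the continuous page-view counts are turned into five classes of roughly equal size. A minimal rank-based sketch follows; the page-view numbers are invented for illustration, and the thresholds used to build the real labels are not given here.

```python
def quintile_labels(values):
    """Assign each value a class in {0, ..., 4} by its rank-based quintile."""
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        # Map ranks 0..n-1 onto 5 equally sized bins.
        labels[idx] = min(4, rank * 5 // n)
    return labels


# Invented page-view counts for ten pages.
page_views = [10, 250, 40, 9000, 700, 15, 3200, 120, 60, 1800]
labels = quintile_labels(page_views)
```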

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset to load ("penn94", "pokec", "arxiv-year", "snap-patents", "genius", "twitch-gamers", "wiki").

  • transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on each access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = LINKXDataset(root='/path/to/dataset', name='pokec')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
__init__(root, name, transform=None, pre_transform=None)[source]
property raw_dir: str
property processed_dir: str
property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

class jittor_geometric.datasets.OGBNodePropPredDataset(name, root='dataset', transform=None, pre_transform=None, meta_dict=None)[source]

Bases: InMemoryDataset

The Open Graph Benchmark (OGB) Node Property Prediction Datasets, provided by the OGB team. These datasets are designed to benchmark large-scale node property prediction tasks on real-world graphs.

This class provides access to various OGB datasets focused on node property prediction tasks. Each dataset contains nodes representing entities (e.g., papers, products) and edges representing relationships (e.g., citations, co-purchases). The goal is to predict specific node-level properties, such as categories or timestamps, based on the graph structure and node features.

Dataset Details:

  • ogbn-arxiv: A citation network where nodes represent arXiv papers and directed edges indicate citation relationships. The task is to predict the subject area of each paper based on word2vec features derived from the title and abstract.

  • ogbn-products: An Amazon product co-purchasing network where nodes represent products and edges indicate frequently co-purchased products. The task is to classify each product based on its category, with node features based on product descriptions.

  • ogbn-paper100M: A large-scale citation network where nodes represent research papers and edges indicate citation links. The node features are derived from word embeddings of the paper abstracts. The task is to predict the subject area of each paper.

These datasets are provided by the Open Graph Benchmark (OGB) team, which aims to facilitate machine learning research on graphs by offering diverse, large-scale datasets. For more details, visit the OGB website: https://ogb.stanford.edu/.

Parameters:
  • name (str) – The name of the dataset to load ("ogbn-arxiv", "ogbn-products", "ogbn-paper100M").

  • root (str) – Root directory where the dataset folder will be stored.

  • transform (callable, optional) – A function/transform that takes in a graph object and returns a transformed version. The graph object will be transformed on each access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a graph object and returns a transformed version. The graph object will be transformed before being saved to disk. (default: None)

  • meta_dict (dict, optional) – A dictionary containing meta-information about the dataset. When provided, it overrides default meta-information, useful for debugging or contributions from external users.

Example

>>> dataset = OGBNodePropPredDataset(name="ogbn-arxiv", root="path/to/dataset")
>>> data = dataset[0]  # Access the first graph object
Acknowledgment:

The OGBNodePropPredDataset is developed and maintained by the Open Graph Benchmark (OGB) team. We sincerely thank the OGB team for their significant contributions to the graph machine learning community.

__init__(name, root='dataset', transform=None, pre_transform=None, meta_dict=None)[source]
get_idx_split(split_type=None)[source]
property num_classes

The number of classes in the dataset.

property raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

class jittor_geometric.datasets.HeteroDataset(root, name, transform=None, pre_transform=None)[source]

Bases: InMemoryDataset

Heterophilic dataset from the paper ‘A critical look at the evaluation of GNNs under heterophily: Are we really making progress?’ <https://arxiv.org/abs/2302.11640>.

This class represents a collection of heterophilic graph datasets used to evaluate the performance of Graph Neural Networks (GNNs) in heterophilic settings. These datasets consist of graphs where nodes are connected based on certain relationships, and the task is to classify the nodes based on their features or labels. The datasets in this collection come from different domains, and each dataset has a unique structure and task.

Dataset Details:

  • Roman Empire: A graph from the Wikipedia article on the Roman Empire. Nodes represent words, connected based on their order or syntactic dependencies, with the task to classify words by syntactic roles.

  • Amazon Ratings: Based on Amazon co-purchasing data, where nodes are products connected if frequently bought together. The task is to predict the product’s average rating.

  • Minesweeper: A synthetic graph based on Minesweeper. Nodes represent grid cells with edges to neighbors. The task is to predict which cells contain mines.

  • Tolokers: Represents workers from the Toloka platform, with edges indicating shared tasks. The goal is to predict if a worker was banned.

  • Questions: Built from Yandex Q data, with nodes representing users who answered each other’s questions on the topic of medicine. The task is to predict if a user remained active.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset to load ("roman-empire", "amazon-ratings", "minesweeper", "tolokers", "questions").

  • transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = HeteroDataset(root='/path/to/dataset', name='amazon-ratings')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
url = 'https://github.com/yandex-research/heterophilous-graphs/raw/main/data'
__init__(root, name, transform=None, pre_transform=None)[source]
property raw_dir: str
property processed_dir: str
property raw_file_names: str

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

process(undirected=True)[source]

Processes the dataset to the self.processed_dir folder.

class jittor_geometric.datasets.JODIEDataset(root, name, transform=None, pre_transform=None)[source]

Bases: InMemoryDataset

The temporal graph datasets from the paper “JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks” <https://cs.stanford.edu/~srijan/pubs/jodie-kdd2019.pdf>.

This class handles loading and processing temporal graph datasets used in the JODIE paper. It is designed for graph-based machine learning tasks, such as dynamic embedding and link prediction. The dataset includes interactions between users and entities (e.g., subreddits, Wikipedia pages, songs, or MOOC course items), and the interactions are timestamped.

Dataset Details:

  • Reddit Post Dataset: This dataset consists of interactions between users and subreddits. It covers the 1,000 most active subreddits and the 10,000 most active users, resulting in 672,447 interactions. Each post’s text is represented as a feature vector using LIWC categories.

  • Wikipedia Edits: This dataset represents edits made by users on Wikipedia pages. It covers the 1,000 most edited pages and users with at least 5 edits, totaling 8,227 users and 157,474 interactions. Each edit is converted into a LIWC-feature vector.

  • LastFM Song Listens: This dataset records user-song interactions, with 1,000 users and the 1,000 most listened-to songs, resulting in 1,293,103 interactions. Unlike the other datasets, its interactions do not have features.

  • MOOC Student Drop-Out: This dataset captures student interactions (e.g., viewing videos, submitting answers) on a MOOC online course. There are 7,047 users interacting with 98 items (videos, answers, etc.), generating 411,749 interactions, including 4,066 drop-out events.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset ("Reddit", "Wikipedia", "LastFM", "MOOC").

  • transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on each access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = JODIEDataset(root='/path/to/dataset', name='Reddit')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
url = 'http://snap.stanford.edu/jodie/{}.csv'
names = ['reddit', 'wikipedia', 'mooc', 'lastfm']
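The url template and the names list above suggest how a download address is formed. The sketch below assumes that the name argument is lower-cased before it is checked against names, which would explain why the documented options are capitalized while names is not; raw_csv_url is a hypothetical helper.

```python
url = 'http://snap.stanford.edu/jodie/{}.csv'
names = ['reddit', 'wikipedia', 'mooc', 'lastfm']


def raw_csv_url(name):
    """Build the CSV address for a dataset name (case-insensitive).

    Assumption: names are normalized to lower case before lookup.
    """
    key = name.lower()
    if key not in names:
        raise ValueError(f'Unknown dataset {name!r}; expected one of {names}')
    return url.format(key)


reddit_url = raw_csv_url('Reddit')
```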
__init__(root, name, transform=None, pre_transform=None)[source]
property raw_dir: str
property processed_dir: str
property raw_file_names: str

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

class jittor_geometric.datasets.Reddit(root, transform=None, pre_transform=None)[source]

Bases: InMemoryDataset

The Reddit dataset from the “Inductive Representation Learning on Large Graphs” paper, containing Reddit posts belonging to different communities.

This dataset is designed for large-scale graph representation learning. Nodes in the graph represent Reddit posts, and edges represent interactions (e.g., comments) between posts in the same community. The task is to classify posts into one of the 41 communities based on their content and connectivity.

Dataset Statistics:

  • Number of Nodes: 232,965

  • Number of Edges: 114,615,892

  • Number of Features: 602

  • Number of Classes: 41

The dataset is pre-split into training, validation, and test sets using node type masks.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)

Example

>>> dataset = Reddit(root='/path/to/reddit')
>>> data = dataset[0]  # Access the first graph object
url = 'https://data.dgl.ai/dataset/reddit.zip'
__init__(root, transform=None, pre_transform=None)[source]
property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

class jittor_geometric.datasets.TemporalDataLoader(data, batch_size=1, neg_sampling_ratio=None, drop_last=False, num_neg_sample=None, neg_samples=None)[source]

Bases: object

__init__(data, batch_size=1, neg_sampling_ratio=None, drop_last=False, num_neg_sample=None, neg_samples=None)[source]
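TemporalDataLoader is documented only by its signature. As a rough mental model, and purely as an assumption, a temporal loader walks through the interactions in chronological order and yields consecutive slices of batch_size events, optionally dropping a short final batch. The sketch below illustrates that with plain tuples; temporal_batches is hypothetical and ignores the negative-sampling arguments.

```python
def temporal_batches(events, batch_size=1, drop_last=False):
    """Yield consecutive slices of chronologically ordered events.

    Each event is a (source, destination, timestamp) tuple in this sketch.
    """
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            break
        yield batch


# Invented interactions, sorted by timestamp before batching.
events = sorted(
    [(2, 5, 3.0), (0, 4, 1.0), (1, 4, 2.0), (0, 5, 5.0), (2, 4, 4.0)],
    key=lambda event: event[2],
)
batches = list(temporal_batches(events, batch_size=2))
full_batches = list(temporal_batches(events, batch_size=2, drop_last=True))
```

Keeping the order fixed matters for temporal graphs: shuffling would let a model see interactions from the future while predicting earlier ones.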
class jittor_geometric.datasets.QM9(root, transform=None, pre_transform=None, pre_filter=None)[source]

Bases: InMemoryDataset

Note: If you encounter a network error while downloading, try running export HF_ENDPOINT=https://hf-mirror.com to download through the hf-mirror.com mirror of Hugging Face.

The QM9 dataset from the “MoleculeNet: A Benchmark for Molecular Machine Learning” paper, consisting of about 130,000 molecules with 19 regression targets. Each molecule includes complete spatial information for the single low energy conformation of the atoms in the molecule. In addition, we provide the atom features from the “Neural Message Passing for Quantum Chemistry” paper.

| Target | Property | Description | Unit |
|---|---|---|---|
| 0 | \(\mu\) | Dipole moment | \(\textrm{D}\) |
| 1 | \(\alpha\) | Isotropic polarizability | \({a_0}^3\) |
| 2 | \(\epsilon_{\textrm{HOMO}}\) | Highest occupied molecular orbital energy | \(\textrm{eV}\) |
| 3 | \(\epsilon_{\textrm{LUMO}}\) | Lowest unoccupied molecular orbital energy | \(\textrm{eV}\) |
| 4 | \(\Delta \epsilon\) | Gap between \(\epsilon_{\textrm{HOMO}}\) and \(\epsilon_{\textrm{LUMO}}\) | \(\textrm{eV}\) |
| 5 | \(\langle R^2 \rangle\) | Electronic spatial extent | \({a_0}^2\) |
| 6 | \(\textrm{ZPVE}\) | Zero point vibrational energy | \(\textrm{eV}\) |
| 7 | \(U_0\) | Internal energy at 0K | \(\textrm{eV}\) |
| 8 | \(U\) | Internal energy at 298.15K | \(\textrm{eV}\) |
| 9 | \(H\) | Enthalpy at 298.15K | \(\textrm{eV}\) |
| 10 | \(G\) | Free energy at 298.15K | \(\textrm{eV}\) |
| 11 | \(c_{\textrm{v}}\) | Heat capacity at 298.15K | \(\frac{\textrm{cal}}{\textrm{mol K}}\) |
| 12 | \(U_0^{\textrm{ATOM}}\) | Atomization energy at 0K | \(\textrm{eV}\) |
| 13 | \(U^{\textrm{ATOM}}\) | Atomization energy at 298.15K | \(\textrm{eV}\) |
| 14 | \(H^{\textrm{ATOM}}\) | Atomization enthalpy at 298.15K | \(\textrm{eV}\) |
| 15 | \(G^{\textrm{ATOM}}\) | Atomization free energy at 298.15K | \(\textrm{eV}\) |
| 16 | \(A\) | Rotational constant | \(\textrm{GHz}\) |
| 17 | \(B\) | Rotational constant | \(\textrm{GHz}\) |
| 18 | \(C\) | Rotational constant | \(\textrm{GHz}\) |

Note

We also provide a pre-processed version of the dataset in case rdkit is not installed. The pre-processed version matches with the manually processed version as outlined in process().

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a jittor_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
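The three hooks act at different times: pre_filter and pre_transform run once, before the processed data is saved, while transform runs on every access. The sketch below models that with plain dictionaries instead of Data objects. The order "filter first, then pre-transform" is an assumption borrowed from common practice in similar libraries, and build_processed and access are hypothetical names.

```python
def build_processed(raw_items, pre_filter=None, pre_transform=None):
    """One-off processing step: filter, then pre-transform, then store."""
    processed = []
    for item in raw_items:
        if pre_filter is not None and not pre_filter(item):
            continue  # excluded from the final dataset
        if pre_transform is not None:
            item = pre_transform(item)
        processed.append(item)
    return processed


def access(processed, index, transform=None):
    """Per-access step: the stored item itself is left unchanged."""
    item = processed[index]
    return item if transform is None else transform(item)


# Invented stand-ins for molecules.
raw_molecules = [{'num_atoms': 3}, {'num_atoms': 12}, {'num_atoms': 7}]
processed = build_processed(
    raw_molecules,
    pre_filter=lambda m: m['num_atoms'] >= 5,
    pre_transform=lambda m: {**m, 'is_large': m['num_atoms'] > 10},
)
first = access(
    processed, 0,
    transform=lambda m: {**m, 'num_atoms_scaled': m['num_atoms'] / 10},
)
```

Expensive, deterministic steps fit pre_transform because their result is cached on disk; random augmentations fit transform because they should differ between accesses.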

STATS:

| #graphs | #nodes | #edges | #features | #tasks |
|---|---|---|---|---|
| 130,831 | ~18.0 | ~37.3 | 11 | 19 |

raw_url = 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/molnet_publish/qm9.zip'
raw_url2 = 'https://ndownloader.figshare.com/files/3195404'
__init__(root, transform=None, pre_transform=None, pre_filter=None)[source]
mean(target)[source]
Parameters:

target (int)

Return type:

float

std(target)[source]
Parameters:

target (int)

Return type:

float
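mean(target) and std(target) are typically used to standardize one regression target before training and to undo the scaling on predictions. The sketch below shows the arithmetic on a plain list; the sample values are invented, and whether the library computes a population or a sample standard deviation is not stated here, so the use of the population form is an assumption.

```python
import statistics


def standardize(values):
    """Return (z_scores, mean, std) for a list of target values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / std for v in values], mean, std


def destandardize(z_scores, mean, std):
    """Map standardized predictions back to the original unit."""
    return [z * std + mean for z in z_scores]


# Invented values standing in for one target column.
target_values = [1.0, 2.0, 3.0, 4.0]
z_scores, target_mean, target_std = standardize(target_values)
restored = destandardize(z_scores, target_mean, target_std)
```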

atomref(target)[source]
Parameters:

target (int)

Return type:

Optional[Var]

property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

get_idx_split(frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=42)[source]
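The signature of get_idx_split points to a seeded random split by fractions. The following sketch shows one way such a split can be produced; the dictionary keys, the rounding rule, and the helper name fractional_idx_split are assumptions, since the return value is not documented here.

```python
import random


def fractional_idx_split(num_items, frac_train=0.8, frac_valid=0.1,
                         frac_test=0.1, seed=42):
    """Shuffle indices with a fixed seed and cut them by fraction."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions must sum to 1')
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)

    n_train = int(frac_train * num_items)
    n_valid = int(frac_valid * num_items)
    return {
        'train': indices[:n_train],
        'valid': indices[n_train:n_train + n_valid],
        # The test set takes whatever is left, so no index is lost to rounding.
        'test': indices[n_train + n_valid:],
    }


split = fractional_idx_split(100)
```

Fixing the seed makes the split reproducible, so results from different runs stay comparable.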
class jittor_geometric.datasets.MoleculeNet(root, name, transform=None, pre_transform=None, pre_filter=None, from_smiles=None)[source]

Bases: InMemoryDataset

The MoleculeNet benchmark collection from the “MoleculeNet: A Benchmark for Molecular Machine Learning” paper, containing datasets from physical chemistry, biophysics and physiology. All datasets come with the additional node and edge features introduced by the Open Graph Benchmark.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset ("ESOL", "FreeSolv", "Lipo", "PCBA", "MUV", "HIV", "BACE", "BBBP", "Tox21", "ToxCast", "SIDER", "ClinTox").

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a jittor_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • from_smiles (callable, optional) – A custom function that takes a SMILES string and outputs a Data object. If not set, defaults to from_smiles(). (default: None)

STATS:

| Name | #graphs | #nodes | #edges | #features | #classes |
|---|---|---|---|---|---|
| ESOL | 1,128 | ~13.3 | ~27.4 | 9 | 1 |
| FreeSolv | 642 | ~8.7 | ~16.8 | 9 | 1 |
| ClinTox | 1,484 | ~26.1 | ~55.5 | 9 | 2 |

url = 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/{}'
names: Dict[str, Tuple[str, str, str, int, Union[int, slice]]] = {'bace': ('BACE', 'bace.csv', 'bace', 0, 2), 'bbbp': ('BBBP', 'BBBP.csv', 'BBBP', -1, -2), 'clintox': ('ClinTox', 'clintox.csv.gz', 'clintox', 0, slice(1, 3, None)), 'esol': ('ESOL', 'delaney-processed.csv', 'delaney-processed', -1, -2), 'freesolv': ('FreeSolv', 'SAMPL.csv', 'SAMPL', 1, 2), 'hiv': ('HIV', 'HIV.csv', 'HIV', 0, -1), 'lipo': ('Lipophilicity', 'Lipophilicity.csv', 'Lipophilicity', 2, 1), 'muv': ('MUV', 'muv.csv.gz', 'muv', -1, slice(0, 17, None)), 'pcba': ('PCBA', 'pcba.csv.gz', 'pcba', -1, slice(0, 128, None)), 'sider': ('SIDER', 'sider.csv.gz', 'sider', 0, slice(1, 28, None)), 'tox21': ('Tox21', 'tox21.csv.gz', 'tox21', -1, slice(0, 12, None)), 'toxcast': ('ToxCast', 'toxcast_data.csv.gz', 'toxcast_data', 0, slice(1, 618, None))}
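Each entry of names is a 5-tuple. Judging from the values, the fields appear to be a display name, the raw file name, the file stem, the column holding the SMILES string, and the column or slice holding the targets; this reading is an assumption, not something the documentation states. Under that reading, one CSV row could be decoded as sketched below (parse_row and the sample rows are invented for illustration).

```python
# Two entries copied from the `names` attribute above.
names = {
    'esol': ('ESOL', 'delaney-processed.csv', 'delaney-processed', -1, -2),
    'clintox': ('ClinTox', 'clintox.csv.gz', 'clintox', 0, slice(1, 3, None)),
}


def parse_row(dataset_key, row):
    """Split one CSV row (a list of strings) into SMILES and target values."""
    _, _, _, smiles_col, target_cols = names[dataset_key]
    smiles = row[smiles_col]
    targets = row[target_cols]
    # An int index yields a single string, a slice yields a list of strings.
    if not isinstance(targets, list):
        targets = [targets]
    # Empty cells become NaN so that missing labels can be masked later.
    values = [float(t) if t != '' else float('nan') for t in targets]
    return smiles, values


# Invented rows; the real files have more columns.
clintox_smiles, clintox_targets = parse_row('clintox', ['CCO', '1', '0'])
esol_smiles, esol_targets = parse_row('esol', ['ethanol', '-0.77', 'CCO'])
```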
__init__(root, name, transform=None, pre_transform=None, pre_filter=None, from_smiles=None)[source]
property raw_dir: str
property processed_dir: str
property raw_file_names: str

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

class jittor_geometric.datasets.MD17(root, name, train=None, transform=None, pre_transform=None, pre_filter=None)[source]

Bases: InMemoryDataset

A variety of ab-initio molecular dynamics trajectories from the authors of sGDML. This class provides access to the original MD17 datasets, their revised versions, and the CCSD(T) trajectories.

For every trajectory, the dataset contains the Cartesian positions of atoms (in Angstrom), their atomic numbers, as well as the total energy (in kcal/mol) and forces (kcal/mol/Angstrom) on each atom. The latter two are the regression targets for this collection.

Note

Data objects contain no edge indices as these are most commonly constructed via the jittor_geometric.transforms.RadiusGraph transform, with its cut-off being a hyperparameter.
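
To illustrate what a radius graph with a cut-off means, here is a minimal pure-Python sketch that connects every pair of atoms whose distance is within the cut-off. It is not the library's RadiusGraph transform; the function radius_edges and the sample positions are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def radius_edges(
    pos: Sequence[Sequence[float]], cutoff: float
) -> List[Tuple[int, int]]:
    """Return directed edges (i, j), i != j, for atoms within `cutoff`."""
    edges = []
    for i, p in enumerate(pos):
        for j, q in enumerate(pos):
            if i != j and math.dist(p, q) <= cutoff:
                edges.append((i, j))
    return edges


# Three made-up atom positions (in Angstrom) placed on a line.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
edges = radius_edges(positions, cutoff=1.5)
```

A larger cut-off adds more edges, which is why it is treated as a hyperparameter.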

The original MD17 dataset contains ten molecule trajectories. This version of the dataset was found to suffer from high numerical noise. The revised MD17 dataset contains the same molecules, but the energies and forces were recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and a very dense DFT integration grid. The third version of the dataset contains fewer molecules, computed at the CCSD(T) level of theory. The benzene molecule at the DFT FHI-aims level of theory was released separately.

Check the table below for detailed information on the molecule, level of theory and number of data points contained in each dataset. Which trajectory is loaded is determined by the name argument. For the coupled cluster trajectories, the dataset comes with pre-defined training and testing splits which are loaded separately via the train argument.

Molecule            Level of Theory     Name                    #Examples
Benzene             DFT                 benzene                 627,983
Uracil              DFT                 uracil                  133,770
Naphthalene         DFT                 naphthalene             326,250
Aspirin             DFT                 aspirin                 211,762
Salicylic acid      DFT                 salicylic acid          320,231
Malonaldehyde       DFT                 malonaldehyde           993,237
Ethanol             DFT                 ethanol                 555,092
Toluene             DFT                 toluene                 442,790
Paracetamol         DFT                 paracetamol             106,490
Azobenzene          DFT                 azobenzene              99,999
Benzene (R)         DFT (PBE/def2-SVP)  revised benzene         100,000
Uracil (R)          DFT (PBE/def2-SVP)  revised uracil          100,000
Naphthalene (R)     DFT (PBE/def2-SVP)  revised naphthalene     100,000
Aspirin (R)         DFT (PBE/def2-SVP)  revised aspirin         100,000
Salicylic acid (R)  DFT (PBE/def2-SVP)  revised salicylic acid  100,000
Malonaldehyde (R)   DFT (PBE/def2-SVP)  revised malonaldehyde   100,000
Ethanol (R)         DFT (PBE/def2-SVP)  revised ethanol         100,000
Toluene (R)         DFT (PBE/def2-SVP)  revised toluene         100,000
Paracetamol (R)     DFT (PBE/def2-SVP)  revised paracetamol     100,000
Azobenzene (R)      DFT (PBE/def2-SVP)  revised azobenzene      99,988
Benzene             CCSD(T)             benzene CCSD(T)         1,500
Aspirin             CCSD                aspirin CCSD            1,500
Malonaldehyde       CCSD(T)             malonaldehyde CCSD(T)   1,500
Ethanol             CCSD(T)             ethanol CCSD(T)         2,000
Toluene             CCSD(T)             toluene CCSD(T)         1,501
Benzene             DFT FHI-aims        benzene FHI-aims        49,863

Warning

It is advised not to train a model on more than 1,000 samples from the original or revised MD17 dataset.
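
One way to stay within this limit is to draw a fixed, reproducible subset of frame indices before training. The sketch below does this with the standard library only; the helper sample_indices and the seed value are assumptions, and the frame count 211,762 is the aspirin entry from the table above.

```python
import random
from typing import List


def sample_indices(
    num_examples: int, num_samples: int = 1000, seed: int = 0
) -> List[int]:
    """Draw `num_samples` distinct indices from range(num_examples)."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    return sorted(rng.sample(range(num_examples), num_samples))


# 211,762 frames in the original aspirin trajectory.
train_idx = sample_indices(211_762)
```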

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – Keyword of the trajectory that should be loaded.

  • train (bool, optional) – Determines whether the train or test split gets loaded for the coupled cluster trajectories. (default: None)

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a jittor_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

STATS:

Name                  #graphs  #nodes  #edges  #features  #tasks
Benzene               627,983  12      0       1          2
Uracil                133,770  12      0       1          2
Naphthalene           326,250  10      0       1          2
Aspirin               211,762  21      0       1          2
Salicylic acid        320,231  16      0       1          2
Malonaldehyde         993,237  9       0       1          2
Ethanol               555,092  9       0       1          2
Toluene               442,790  15      0       1          2
Paracetamol           106,490  20      0       1          2
Azobenzene            99,999   24      0       1          2
Benzene (R)           100,000  12      0       1          2
Uracil (R)            100,000  12      0       1          2
Naphthalene (R)       100,000  10      0       1          2
Aspirin (R)           100,000  21      0       1          2
Salicylic acid (R)    100,000  16      0       1          2
Malonaldehyde (R)     100,000  9       0       1          2
Ethanol (R)           100,000  9       0       1          2
Toluene (R)           100,000  15      0       1          2
Paracetamol (R)       100,000  20      0       1          2
Azobenzene (R)        99,988   24      0       1          2
Benzene CCSD-T        1,500    12      0       1          2
Aspirin CCSD-T        1,500    21      0       1          2
Malonaldehyde CCSD-T  1,500    9       0       1          2
Ethanol CCSD-T        2,000    9       0       1          2
Toluene CCSD-T        1,501    15      0       1          2
Benzene FHI-aims      49,863   12      0       1          2

gdml_url = 'http://quantum-machine.org/gdml/data/npz'
revised_url = 'https://archive.materialscloud.org/record/file?filename=rmd17.tar.bz2&record_id=466'
file_names = {'aspirin': 'md17_aspirin.npz', 'aspirin CCSD': 'aspirin_ccsd.zip', 'azobenzene': 'azobenzene_dft.npz', 'benzene': 'md17_benzene2017.npz', 'benzene CCSD(T)': 'benzene_ccsd_t.zip', 'benzene FHI-aims': 'benzene2018_dft.npz', 'ethanol': 'md17_ethanol.npz', 'ethanol CCSD(T)': 'ethanol_ccsd_t.zip', 'malonaldehyde': 'md17_malonaldehyde.npz', 'malonaldehyde CCSD(T)': 'malonaldehyde_ccsd_t.zip', 'naphthalene': 'md17_naphthalene.npz', 'paracetamol': 'paracetamol_dft.npz', 'revised aspirin': 'rmd17_aspirin.npz', 'revised azobenzene': 'rmd17_azobenzene.npz', 'revised benzene': 'rmd17_benzene.npz', 'revised ethanol': 'rmd17_ethanol.npz', 'revised malonaldehyde': 'rmd17_malonaldehyde.npz', 'revised naphthalene': 'rmd17_naphthalene.npz', 'revised paracetamol': 'rmd17_paracetamol.npz', 'revised salicylic acid': 'rmd17_salicylic.npz', 'revised toluene': 'rmd17_toluene.npz', 'revised uracil': 'rmd17_uracil.npz', 'salicylic acid': 'md17_salicylic.npz', 'toluene': 'md17_toluene.npz', 'toluene CCSD(T)': 'toluene_ccsd_t.zip', 'uracil': 'md17_uracil.npz'}
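
The keys of file_names are the accepted values for the name argument. As a rough sketch, the following helper looks up the raw file for a name and reports whether it belongs to the revised set or to the coupled cluster set, for which the train argument selects the split. The function describe is hypothetical, and how the library assembles download URLs from gdml_url and revised_url is not shown here.

```python
from typing import Dict

# A few entries copied from the `file_names` attribute.
FILE_NAMES: Dict[str, str] = {
    "aspirin": "md17_aspirin.npz",
    "revised aspirin": "rmd17_aspirin.npz",
    "aspirin CCSD": "aspirin_ccsd.zip",
    "benzene CCSD(T)": "benzene_ccsd_t.zip",
    "benzene FHI-aims": "benzene2018_dft.npz",
}


def describe(name: str) -> Dict[str, object]:
    """Summarize which variant of the dataset a `name` keyword refers to."""
    if name not in FILE_NAMES:
        raise ValueError(f"Unknown trajectory name: {name!r}")
    return {
        "file": FILE_NAMES[name],
        # Revised trajectories use the "revised " prefix in their keyword.
        "revised": name.startswith("revised "),
        # Coupled cluster trajectories ship with separate train/test splits.
        "uses_train_flag": "CCSD" in name,
    }


info = describe("benzene CCSD(T)")
```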
__init__(root, name, train=None, transform=None, pre_transform=None, pre_filter=None)[source]
mean()[source]
Return type:

float

property raw_dir: str
property processed_dir: str
property raw_file_names: str | List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: List[str]

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

class jittor_geometric.datasets.PCQM4Mv2(root, split='train', transform=None, from_smiles=None)[source]

Bases: InMemoryDataset

The PCQM4Mv2 dataset from the “OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs” paper. PCQM4Mv2 is a quantum chemistry dataset originally curated under the PubChemQC project. The task is to predict the DFT-calculated HOMO-LUMO energy gap of molecules given their 2D molecular graphs.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • split (str, optional) – If "train", loads the training dataset. If "val", loads the validation dataset. If "test", loads the test dataset. If "holdout", loads the holdout dataset. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • from_smiles (callable, optional) – A custom function that takes a SMILES string and outputs a Data object. If not set, defaults to from_smiles(). (default: None)

url = 'https://dgl-data.s3-accelerate.amazonaws.com/dataset/OGB-LSC/pcqm4m-v2.zip'
split_mapping = {'holdout': 'test-challenge', 'test': 'test-dev', 'train': 'train', 'val': 'valid'}
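
The split_mapping attribute translates the user-facing split argument into the split key used by the raw OGB-LSC data. A minimal sketch of that lookup, with a hypothetical helper resolve_split that rejects unknown values:

```python
# Copied from the `split_mapping` attribute.
SPLIT_MAPPING = {
    "holdout": "test-challenge",
    "test": "test-dev",
    "train": "train",
    "val": "valid",
}


def resolve_split(split: str = "train") -> str:
    """Map a user-facing split name to the raw split key."""
    try:
        return SPLIT_MAPPING[split]
    except KeyError:
        valid = ", ".join(sorted(SPLIT_MAPPING))
        raise ValueError(f"split must be one of: {valid}") from None


raw_key = resolve_split("holdout")
```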
__init__(root, split='train', transform=None, from_smiles=None)[source]
property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

class jittor_geometric.datasets.MovieLens1M(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]

Bases: RecSysBase

MovieLens-1M dataset with auto-download from RecBole:

https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-1m.zip

Expected (after extraction) in raw_dir:
  • ml-1m.item

  • ml-1m.user

  • ml-1m.inter

Files are tab-separated; the header row is skipped (skiprows=1).
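
As an illustration of this file layout, the sketch below reads a tab-separated interaction file and discards its header row using only the standard library. The sample content is made up, the "name:type" header style is an assumption based on RecBole's atomic file convention, and read_inter is a hypothetical helper rather than the loader used by this class.

```python
import csv
import io
from typing import List, TextIO

# Made-up sample in the assumed layout: one header row, then one row per rating.
SAMPLE = (
    "user_id:token\titem_id:token\trating:float\ttimestamp:float\n"
    "1\t1193\t5\t978300760\n"
    "1\t661\t3\t978302109\n"
)


def read_inter(handle: TextIO) -> List[List[str]]:
    """Read tab-separated rows, skipping the first (header) row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # drop the header; no error if the file is empty
    return [row for row in reader if row]


rows = read_inter(io.StringIO(SAMPLE))
```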

Parameters:

with_aux (bool)

url = 'https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-1m.zip'
__init__(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]
Parameters:

with_aux (bool)
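
The train_ratio, val_ratio, test_ratio, seed, and shuffle arguments suggest a seeded ratio split of the interactions. The sketch below shows one common way to do that; it is an assumption about the general technique, not the implementation in RecSysBase, and ratio_split is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ratio_split(
    items: Sequence[T],
    train_ratio: float = 0.8,
    val_ratio: float = 0.1,
    test_ratio: float = 0.1,
    seed: int = 42,
    shuffle: bool = True,
) -> Tuple[List[T], List[T], List[T]]:
    """Split `items` into train/val/test parts by ratio."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-8:
        raise ValueError("train_ratio + val_ratio + test_ratio must equal 1")
    order = list(items)
    if shuffle:
        random.Random(seed).shuffle(order)  # seeded, so the split is repeatable
    n_train = int(len(order) * train_ratio)
    n_val = int(len(order) * val_ratio)
    # Whatever remains after train and val goes to test.
    return (
        order[:n_train],
        order[n_train:n_train + n_val],
        order[n_train + n_val:],
    )


train, val, test = ratio_split(range(100))
```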

property raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

download()[source]

Download and extract ml-1m.zip into raw_dir (idempotent).

read_raw()[source]
Must return either:
  • interactions_df (if self.with_aux == False)

  • (interactions_df, {'items': df?, 'users': df?}) (if self.with_aux == True)
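
A caller can normalize the two return shapes into one. The sketch below assumes that "df?" means the items and users tables are optional and may be missing from the dictionary; unpack_raw is a hypothetical helper, and plain lists stand in for data frames.

```python
from typing import Any, Dict, Optional, Tuple


def unpack_raw(result: Any, with_aux: bool) -> Tuple[Any, Dict[str, Optional[Any]]]:
    """Return (interactions, aux) regardless of which shape read_raw produced."""
    if not with_aux:
        # Only the interactions table was returned.
        return result, {"items": None, "users": None}
    interactions, aux = result
    # Assumption: a missing key means that table is unavailable.
    return interactions, {"items": aux.get("items"), "users": aux.get("users")}


# Plain lists stand in for data frames in this illustration.
inter_only, no_aux = unpack_raw([("u1", "i1")], with_aux=False)
inter, aux = unpack_raw(([("u1", "i1")], {"items": ["i1"]}), with_aux=True)
```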

class jittor_geometric.datasets.MovieLens100K(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]

Bases: RecSysBase

MovieLens-100K (RecBole processed).

Downloads: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-100k.zip

Expected in raw_dir after extraction:

  • ml-100k.item

  • ml-100k.user

  • ml-100k.inter

Parameters:

with_aux (bool)

url = 'https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/MovieLens/ml-100k.zip'
__init__(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]
Parameters:

with_aux (bool)

property raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

download()[source]

Downloads the dataset to the self.raw_dir folder.

read_raw()[source]
Must return either:
  • interactions_df (if self.with_aux == False)

  • (interactions_df, {'items': df?, 'users': df?}) (if self.with_aux == True)

class jittor_geometric.datasets.Yelp2018(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]

Bases: RecSysBase

Yelp-2018 (RecBole processed).

Downloads: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Yelp/yelp2018.zip

Accepts either file naming variant inside the zip:

  • yelp2018.item/.user/.inter (common)

  • yelp-2018.item/.user/.inter (also supported)

After extraction, we normalize to yelp-2018.* in raw_dir.
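
The normalization step can be pictured as a rename of any yelp2018.* file to its yelp-2018.* counterpart. The following sketch shows that idea on a temporary directory; normalize_names is a hypothetical helper and may differ from what the class does internally.

```python
import tempfile
from pathlib import Path
from typing import List


def normalize_names(raw_dir: Path) -> List[str]:
    """Rename yelp2018.* files to yelp-2018.* and return the sorted file names."""
    for suffix in (".item", ".user", ".inter"):
        src = raw_dir / f"yelp2018{suffix}"
        dst = raw_dir / f"yelp-2018{suffix}"
        # Rename only when needed, so repeated calls are harmless.
        if src.exists() and not dst.exists():
            src.rename(dst)
    return sorted(p.name for p in raw_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "yelp2018.inter").write_text("")   # common naming variant
    (raw / "yelp-2018.item").write_text("")   # already normalized
    normalized = normalize_names(raw)
```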

Parameters:

with_aux (bool)

url = 'https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Yelp/yelp2018.zip'
property raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

__init__(root, transform=None, pre_transform=None, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=42, shuffle=True, with_aux=False)[source]
Parameters:

with_aux (bool)

download()[source]

Downloads the dataset to the self.raw_dir folder.

read_raw()[source]
Must return either:
  • interactions_df (if self.with_aux == False)

  • (interactions_df, {'items': df?, 'users': df?}) (if self.with_aux == True)