jittor_geometric.datasets

class jittor_geometric.datasets.Amazon(root, name, transform=None, pre_transform=None)[source]

The Amazon Computers and Amazon Photo datasets from the paper “Pitfalls of Graph Neural Network Evaluation” <https://arxiv.org/abs/1811.05868>.

Nodes represent products, and edges indicate that two products are frequently bought together. Node features are bag-of-words encodings of product reviews, and the task is to classify products into their respective categories.

Dataset Details:

  • Amazon Computers: This dataset contains products related to computers, where the task is to classify the products based on the reviews and co-purchase information.

  • Amazon Photo: This dataset contains products related to photography, with a similar task of classifying products based on reviews and co-purchase data.
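The bag-of-words node features described above can be illustrated with a small, self-contained sketch. The vocabulary, review texts, and the `bag_of_words` helper are hypothetical; whether the real features store word counts or binary indicators is not stated here, so counts are an assumption.

```python
from collections import Counter


def bag_of_words(reviews, vocabulary):
    """Return one count vector per review, ordered by `vocabulary`."""
    features = []
    for text in reviews:
        counts = Counter(text.lower().split())
        # Counter returns 0 for words that do not occur in the review.
        features.append([counts[word] for word in vocabulary])
    return features


# Hypothetical toy data: two products joined by one co-purchase link.
vocabulary = ["fast", "laptop", "lens", "sharp"]
reviews = ["Fast laptop fast shipping", "Sharp lens"]
x = bag_of_words(reviews, vocabulary)

# The link is stored in both directions, as (source row, target row).
edge_index = [[0, 1], [1, 0]]
```

Each row of `x` plays the role of one node's feature vector, and `edge_index` lists the co-purchase edges.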

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset, either "Computers" or "Photo".

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed on each access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
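The difference between transform (applied on each access) and pre_transform (applied once, before saving) can be sketched with a toy stand-in. `ToyDataset`, `add_num_edges`, and `count_access` are hypothetical names used only for illustration; this is not the library's implementation, and plain dictionaries stand in for Data objects.

```python
import copy


class ToyDataset:
    """Hypothetical stand-in for a dataset class (not jittor_geometric code)."""

    def __init__(self, raw_items, transform=None, pre_transform=None):
        self.transform = transform
        # pre_transform runs once, before the items are "saved" (here: kept in memory).
        if pre_transform is not None:
            raw_items = [pre_transform(copy.deepcopy(item)) for item in raw_items]
        self._stored = raw_items

    def __getitem__(self, idx):
        # transform runs on every access, on a copy, so the stored item is unchanged.
        item = copy.deepcopy(self._stored[idx])
        return item if self.transform is None else self.transform(item)


def add_num_edges(item):
    # Example pre_transform: compute a value once and store it with the item.
    item["num_edges"] = len(item["edges"])
    return item


def count_access(item):
    # Example transform: applied again each time the item is read.
    item["accessed"] = item.get("accessed", 0) + 1
    return item


dataset = ToyDataset(
    [{"edges": [(0, 1), (1, 2)]}],
    transform=count_access,
    pre_transform=add_num_edges,
)
first = dataset[0]
second = dataset[0]
```

Both reads see the cached `num_edges`, while `accessed` is recomputed from the stored item each time, so expensive one-off work fits pre_transform and per-access augmentation fits transform.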

Example

>>> dataset = Amazon(root='/path/to/dataset', name='Computers')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

property processed_dir: str
property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_dir: str
property raw_file_names: str

The name of the files to find in the self.raw_dir folder in order to skip the download.
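The "skip the download if the files are already there" behavior described for raw_file_names can be sketched as follows. `files_exist`, `maybe_download`, `fake_download`, and the file name `toy_dataset.npz` are hypothetical; the library's own check may differ.

```python
import os
import tempfile


def files_exist(folder, file_names):
    """Return True only if every expected file is present in `folder`."""
    if isinstance(file_names, str):
        file_names = [file_names]
    return all(os.path.exists(os.path.join(folder, name)) for name in file_names)


def maybe_download(raw_dir, raw_file_names, download_fn):
    """Call `download_fn` only when at least one raw file is missing."""
    if files_exist(raw_dir, raw_file_names):
        return False  # files already present, nothing to do
    os.makedirs(raw_dir, exist_ok=True)
    download_fn(raw_dir)
    return True


def fake_download(raw_dir):
    # Stand-in for a real download: writes an empty placeholder file.
    open(os.path.join(raw_dir, "toy_dataset.npz"), "wb").close()


with tempfile.TemporaryDirectory() as root:
    raw_dir = os.path.join(root, "raw")
    first_call = maybe_download(raw_dir, "toy_dataset.npz", fake_download)
    second_call = maybe_download(raw_dir, "toy_dataset.npz", fake_download)
```

The first call triggers the download; the second finds the file and returns without calling `download_fn`. The same pattern applies to processed_file_names and process().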

url = 'https://github.com/shchur/gnn-benchmark/raw/master/data/npz/'
class jittor_geometric.datasets.GeomGCN(root, name, transform=None, pre_transform=None)[source]

The datasets used in the paper “Geom-GCN: Geometric Graph Convolutional Networks” <https://openreview.net/forum?id=S1e2agrFvS>.

This class represents the datasets used in the Geom-GCN paper, which focuses on geometric graph convolutional networks. Each dataset is a single graph in which nodes represent entities and edges represent relationships between them. The task is to classify nodes based on their features and the graph structure.

Dataset Details:

  • Cornell, Texas, Wisconsin: These datasets represent web pages from the Cornell, Texas, and Wisconsin universities, where nodes are web pages, and edges represent hyperlinks between them. The task is to classify web pages into one of five categories: student, project, course, staff, and faculty.

  • Actor: In the Actor dataset, each node corresponds to an actor, and edges between nodes represent co-occurrence on the same Wikipedia page. The task is to classify the actors into one of five categories based on keywords extracted from their Wikipedia pages.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset to load. Options: "Cornell", "Texas", "Wisconsin", "Actor".

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = GeomGCN(root='/path/to/dataset', name='Cornell')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

property processed_dir: str
property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_dir: str
property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master'
class jittor_geometric.datasets.HeteroDataset(root, name, transform=None, pre_transform=None)[source]

Heterophilic datasets from the paper “A critical look at the evaluation of GNNs under heterophily: Are we really making progress?” <https://arxiv.org/abs/2302.11640>.

This class represents a collection of heterophilic graph datasets used to evaluate the performance of Graph Neural Networks (GNNs) in heterophilic settings. These datasets consist of graphs where nodes are connected based on certain relationships, and the task is to classify the nodes based on their features or labels. The datasets in this collection come from different domains, and each dataset has a unique structure and task.

Dataset Details:

  • Roman Empire: A graph from the Wikipedia article on the Roman Empire. Nodes represent words, connected based on their order or syntactic dependencies, with the task to classify words by syntactic roles.

  • Amazon Ratings: Based on Amazon co-purchasing data, where nodes are products connected if frequently bought together. The task is to predict the product’s average rating.

  • Minesweeper: A synthetic graph based on Minesweeper. Nodes represent grid cells with edges to neighbors. The task is to predict which cells contain mines.

  • Tolokers: Represents workers from the Toloka platform, with edges indicating shared tasks. The goal is to predict if a worker was banned.

  • Questions: Built from Yandex Q data, with nodes representing users who answered each other’s questions on the topic of medicine. The task is to predict if a user remained active.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset to load. Options: "roman-empire", "amazon-ratings", "minesweeper", "tolokers", "questions".

  • transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = HeteroDataset(root='/path/to/dataset', name='amazon-ratings')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
download()[source]

Downloads the dataset to the self.raw_dir folder.

process(undirected=True)[source]

Processes the dataset to the self.processed_dir folder.
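What an `undirected=True` option typically means for an edge list can be sketched as below. This is an assumption about the flag's intent, not a description of what process() actually does; `to_undirected` is a hypothetical helper.

```python
def to_undirected(edges):
    """Add the reverse of every edge and drop duplicates."""
    symmetric = set()
    for src, dst in edges:
        symmetric.add((src, dst))
        symmetric.add((dst, src))
    # Sorting gives a deterministic order for the result.
    return sorted(symmetric)


# (1, 2) and (2, 1) are already a symmetric pair; (0, 1) is one-directional.
edges = [(0, 1), (1, 2), (2, 1)]
undirected_edges = to_undirected(edges)
```

After symmetrization every edge appears in both directions exactly once, which is the form message-passing layers usually expect for undirected graphs.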

property processed_dir: str
property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_dir: str
property raw_file_names: str

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://github.com/yandex-research/heterophilous-graphs/raw/main/data'
class jittor_geometric.datasets.JODIEDataset(root, name, transform=None, pre_transform=None)[source]

The temporal graph datasets from the paper “JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks” <https://cs.stanford.edu/~srijan/pubs/jodie-kdd2019.pdf>.

This class handles loading and processing temporal graph datasets used in the JODIE paper. It is designed for graph-based machine learning tasks, such as dynamic embedding and link prediction. The dataset includes interactions between users and entities (e.g., subreddits, Wikipedia pages, songs, or MOOC course items), and the interactions are timestamped.

Dataset Details:

  • Reddit Post Dataset: This dataset consists of interactions between users and subreddits. The 1,000 most active subreddits and the 10,000 most active users were selected, resulting in 672,447 interactions. Each post’s text is represented as a feature vector using LIWC categories.

  • Wikipedia Edits: This dataset represents edits made by users on Wikipedia pages. The 1,000 most edited pages and the users with at least 5 edits were selected, totaling 8,227 users and 157,474 interactions. Each edit is converted into a LIWC feature vector.

  • LastFM Song Listens: This dataset records user-song interactions, with 1,000 users and the 1,000 most listened-to songs, resulting in 1,293,103 interactions. Unlike other datasets, interactions do not have features.

  • MOOC Student Drop-Out: This dataset captures student interactions (e.g., viewing videos, submitting answers) in an online MOOC course. There are 7,047 users interacting with 98 items (videos, answers, etc.), generating 411,749 interactions, including 4,066 drop-out events.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset. Options: "Reddit", "Wikipedia", "LastFM", "MOOC".

  • transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on each access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = JODIEDataset(root='/path/to/dataset', name='Reddit')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
download()[source]

Downloads the dataset to the self.raw_dir folder.

names = ['reddit', 'wikipedia', 'mooc', 'lastfm']
process()[source]

Processes the dataset to the self.processed_dir folder.

property processed_dir: str
property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_dir: str
property raw_file_names: str

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'http://snap.stanford.edu/jodie/{}.csv'
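The url attribute is a format string with one placeholder, and names lists the lowercase dataset identifiers. A minimal sketch of how the two could be combined is shown below; `resolve_url` is a hypothetical helper, and lowercasing the user-supplied name is an assumption based on the lowercase entries in names.

```python
url = 'http://snap.stanford.edu/jodie/{}.csv'
names = ['reddit', 'wikipedia', 'mooc', 'lastfm']


def resolve_url(name):
    """Map a dataset name such as "Reddit" to its CSV download URL."""
    key = name.lower()  # assumption: names are matched case-insensitively
    if key not in names:
        raise ValueError(f"Unknown dataset name: {name!r}; expected one of {names}")
    return url.format(key)


reddit_url = resolve_url("Reddit")
```

Validating against names before formatting the URL avoids requesting a file that does not exist on the server.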
class jittor_geometric.datasets.LINKXDataset(root, name, transform=None, pre_transform=None)[source]

A variety of non-homophilous graph datasets from the paper “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” <https://arxiv.org/abs/2110.14446>.

Dataset Details:

  • Penn94: A friendship network of university students from the Facebook 100 dataset. Nodes represent students, with labels indicating gender. Node features include major, dorm, year, and high school.

  • Pokec: A friendship network from a Slovak online social network. Nodes represent users, connected by directed friendship relations. Node features include profile information like region, registration time, and age, with labels based on gender.

  • arXiv-year: Based on the ogbn-arXiv network, with nodes representing papers and edges representing citations. The classification task is set to predict the year a paper was posted, using word2vec features derived from the title and abstract.

  • snap-patents: A citation network of U.S. utility patents, where nodes represent patents and edges denote citations. The classification task is to predict the year a patent was granted, with node features derived from patent metadata.

  • genius: A social network from genius.com, where nodes are users connected by mutual follows. The task is to predict whether a user account is marked as “gone,” based on usage features like expertise score and contribution counts.

  • twitch-gamers: A network of Twitch accounts with edges between mutual followers. Node features include account statistics like views, creation date, and account status. The binary classification task is to predict whether a channel has explicit content.

  • wiki: A graph of Wikipedia articles, with nodes representing pages and edges representing links between them. Node features are GloVe embeddings from the title and abstract. Labels represent total page views, categorized into quintiles.
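Labels "categorized into quintiles", as described for the wiki dataset, can be derived from a continuous quantity by rank-based binning. The sketch below is illustrative only: `quantile_labels` and the page-view numbers are hypothetical, and ties are broken by input order rather than by any rule taken from the paper.

```python
def quantile_labels(values, num_classes=5):
    """Assign each value a class in [0, num_classes) by rank, giving near-equal class sizes."""
    # Indices of `values`, ordered from the smallest value to the largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        # Integer arithmetic maps rank 0..n-1 onto class 0..num_classes-1.
        labels[idx] = rank * num_classes // len(values)
    return labels


# Hypothetical total page views for ten articles.
page_views = [120, 5, 9800, 430, 77, 15000, 2, 610, 3300, 51]
labels = quantile_labels(page_views)
```

With ten values and five classes, each class receives exactly two nodes, which turns a regression target into a balanced classification problem.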

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset to load. Options: "penn94", "pokec", "arxiv-year", "snap-patents", "genius", "twitch-gamers", "wiki".

  • transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed on each access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = LINKXDataset(root='/path/to/dataset', name='pokec')
>>> dataset.data
>>> dataset[0]  # Accessing the first data point
download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

property processed_dir: str
property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_dir: str
property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

class jittor_geometric.datasets.OGBNodePropPredDataset(name, root='dataset', transform=None, pre_transform=None, meta_dict=None)[source]

The Open Graph Benchmark (OGB) Node Property Prediction Datasets, provided by the OGB team. These datasets are designed to benchmark large-scale node property prediction tasks on real-world graphs.

This class provides access to various OGB datasets focused on node property prediction tasks. Each dataset contains nodes representing entities (e.g., papers, products) and edges representing relationships (e.g., citations, co-purchases). The goal is to predict specific node-level properties, such as categories or timestamps, based on the graph structure and node features.

Dataset Details:

  • ogbn-arxiv: A citation network where nodes represent arXiv papers and directed edges indicate citation relationships. The task is to predict the subject area of each paper based on word2vec features derived from the title and abstract.

  • ogbn-products: An Amazon product co-purchasing network where nodes represent products and edges indicate frequently co-purchased products. The task is to classify each product based on its category, with node features based on product descriptions.

  • ogbn-papers100M: A large-scale citation network where nodes represent research papers and edges indicate citation links. The node features are derived from word embeddings of the paper abstracts. The task is to predict the subject area of each paper.

These datasets are provided by the Open Graph Benchmark (OGB) team, which aims to facilitate machine learning research on graphs by offering diverse, large-scale datasets. For more details, visit the OGB website: https://ogb.stanford.edu/.

Parameters:
  • name (str) – The name of the dataset to load. Options: "ogbn-arxiv", "ogbn-products", "ogbn-papers100M".

  • root (str) – Root directory where the dataset folder will be stored.

  • transform (callable, optional) – A function/transform that takes in a graph object and returns a transformed version. The graph object will be transformed on each access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a graph object and returns a transformed version. The graph object will be transformed before being saved to disk. (default: None)

  • meta_dict (dict, optional) – A dictionary containing meta-information about the dataset. When provided, it overrides default meta-information, useful for debugging or contributions from external users.

Example

>>> dataset = OGBNodePropPredDataset(name="ogbn-arxiv", root="path/to/dataset")
>>> data = dataset[0]  # Access the first graph object
Acknowledgment:

The OGBNodePropPredDataset is developed and maintained by the Open Graph Benchmark (OGB) team. We sincerely thank the OGB team for their significant contributions to the graph machine learning community.

download()[source]

Downloads the dataset to the self.raw_dir folder.

get_idx_split(split_type=None)[source]
property num_classes

The number of classes in the dataset.

process()[source]

Processes the dataset to the self.processed_dir folder.

property processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

class jittor_geometric.datasets.Planetoid(root, name, split='public', num_train_per_class=20, num_val=500, num_test=1000, transform=None, pre_transform=None)[source]

The citation network datasets “Cora”, “CiteSeer” and “PubMed” from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper.

This class represents three widely-used citation network datasets: Cora, CiteSeer, and PubMed. Nodes correspond to documents, and edges represent citation links between them. The datasets are designed for semi-supervised learning tasks, where training, validation, and test splits are provided as binary masks.

Dataset Details:

  • Cora: A citation network where nodes represent machine learning papers, and edges represent citations. The task is to classify papers into one of seven classes.

  • CiteSeer: A citation network of research papers in computer and information science. The task is to classify papers into one of six classes.

  • PubMed: A citation network of biomedical papers on diabetes. The task is to classify papers into one of three classes.

Splitting Options:

  • public: The original fixed split from the paper “Revisiting Semi-Supervised Learning with Graph Embeddings”.

  • full: Uses all nodes except those in the validation and test sets for training, inspired by “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling”.

  • random: Generates random splits for train, validation, and test sets based on the specified parameters.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset ("Cora", "CiteSeer", "PubMed").

  • split (str) – The type of dataset split ("public", "full", "random"). Default is "public".

  • num_train_per_class (int, optional) – Number of training samples per class for "random" split. Default is 20.

  • num_val (int, optional) – Number of validation samples for "random" split. Default is 500.

  • num_test (int, optional) – Number of test samples for "random" split. Default is 1000.

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed on each access. Default is None.

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. Default is None.
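The "random" split can be read as: pick num_train_per_class training nodes from every class, then draw num_val validation nodes and num_test test nodes from the remainder. The sketch below follows that reading with plain Python lists as masks; `random_split` and its `seed` argument are hypothetical, and the library's own sampling procedure may differ.

```python
import random


def random_split(labels, num_train_per_class=20, num_val=500, num_test=1000, seed=0):
    """Return boolean train/val/test masks over the nodes described by `labels`."""
    rng = random.Random(seed)
    num_nodes = len(labels)

    # Training set: a fixed number of randomly chosen nodes from each class.
    train_mask = [False] * num_nodes
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for i in members[:num_train_per_class]:
            train_mask[i] = True

    # Validation and test sets: disjoint slices of the shuffled remaining nodes.
    remaining = [i for i in range(num_nodes) if not train_mask[i]]
    rng.shuffle(remaining)
    val_mask = [False] * num_nodes
    test_mask = [False] * num_nodes
    for i in remaining[:num_val]:
        val_mask[i] = True
    for i in remaining[num_val:num_val + num_test]:
        test_mask[i] = True
    return train_mask, val_mask, test_mask


# Toy graph: 60 nodes spread evenly over 3 classes.
labels = [i % 3 for i in range(60)]
train_mask, val_mask, test_mask = random_split(
    labels, num_train_per_class=5, num_val=10, num_test=20
)
```

Because the three masks are built from disjoint index sets, no node is used for more than one purpose.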

Example

>>> dataset = Planetoid(root='/path/to/dataset', name='Cora', split='random')
>>> data = dataset[0]  # Access the processed data object
download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

property processed_dir
property processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_dir
property raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://github.com/kimiyoung/planetoid/raw/master/data'
class jittor_geometric.datasets.QM9(root, transform=None, pre_transform=None, pre_filter=None)[source]

Note: If you encounter a network error while downloading, try running export HF_ENDPOINT=https://hf-mirror.com before loading the dataset, so that the download goes through the hf-mirror.com mirror of Hugging Face.

The QM9 dataset from the “MoleculeNet: A Benchmark for Molecular Machine Learning” paper, consisting of about 130,000 molecules with 19 regression targets. Each molecule includes complete spatial information for the single low energy conformation of the atoms in the molecule. In addition, we provide the atom features from the “Neural Message Passing for Quantum Chemistry” paper.

| Target | Property | Description | Unit |
|---|---|---|---|
| 0 | μ | Dipole moment | D |
| 1 | α | Isotropic polarizability | a_0^3 |
| 2 | ϵ_HOMO | Highest occupied molecular orbital energy | eV |
| 3 | ϵ_LUMO | Lowest unoccupied molecular orbital energy | eV |
| 4 | Δϵ | Gap between ϵ_HOMO and ϵ_LUMO | eV |
| 5 | R^2 | Electronic spatial extent | a_0^2 |
| 6 | ZPVE | Zero point vibrational energy | eV |
| 7 | U_0 | Internal energy at 0K | eV |
| 8 | U | Internal energy at 298.15K | eV |
| 9 | H | Enthalpy at 298.15K | eV |
| 10 | G | Free energy at 298.15K | eV |
| 11 | c_v | Heat capacity at 298.15K | cal/(mol K) |
| 12 | U_0^ATOM | Atomization energy at 0K | eV |
| 13 | U^ATOM | Atomization energy at 298.15K | eV |
| 14 | H^ATOM | Atomization enthalpy at 298.15K | eV |
| 15 | G^ATOM | Atomization free energy at 298.15K | eV |
| 16 | A | Rotational constant | GHz |
| 17 | B | Rotational constant | GHz |
| 18 | C | Rotational constant | GHz |

Note

We also provide a pre-processed version of the dataset in case rdkit is not installed. The pre-processed version matches with the manually processed version as outlined in process().

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a jittor_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

STATS:

| #graphs | #nodes | #edges | #features | #tasks |
|---|---|---|---|---|
| 130,831 | ~18.0 | ~37.3 | 11 | 19 |

atomref(target)[source]
Return type:

Optional[Var]

download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

get_idx_split(frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=42)[source]
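The signature suggests a seeded random split of molecule indices by fraction. A minimal sketch of that idea is given below; `fraction_split` and the dictionary keys "train", "valid", and "test" are assumptions made for illustration, not the method's documented return format.

```python
import random


def fraction_split(num_items, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=42):
    """Shuffle indices with a fixed seed and cut them into three parts."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-9:
        raise ValueError("The three fractions must sum to 1.")
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)  # same seed, same permutation
    n_train = int(frac_train * num_items)
    n_valid = int(frac_valid * num_items)
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        # The test part takes whatever is left, so no index is dropped by rounding.
        "test": indices[n_train + n_valid:],
    }


split = fraction_split(1000)
```

Fixing the seed makes the split reproducible across runs, which matters when comparing models on the same regression targets.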
mean(target)[source]
Return type:

float

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

std(target)[source]
Return type:

float
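Per-target mean and standard deviation are commonly used to standardize regression targets before training and to map predictions back afterwards. The sketch below shows that round trip on toy values; `standardize` and `destandardize` are hypothetical helpers, and the use of the population standard deviation is an assumption, since the definition used by std(target) is not stated here.

```python
import statistics


def standardize(values):
    """Return z-scored values together with the mean and std used."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    normalized = [(v - mean) / std for v in values]
    return normalized, mean, std


def destandardize(normalized, mean, std):
    """Invert `standardize`, e.g. to report predictions in the original unit."""
    return [v * std + mean for v in normalized]


# Toy target values, e.g. dipole moments in D (made-up numbers).
targets = [1.2, 0.4, 2.8, 1.9, 0.7]
normalized, target_mean, target_std = standardize(targets)
restored = destandardize(normalized, target_mean, target_std)
```

Standardizing puts targets with very different units and ranges, such as eV energies and GHz rotational constants, on a comparable scale.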

class jittor_geometric.datasets.Reddit(root, transform=None, pre_transform=None)[source]

The Reddit dataset from the “Inductive Representation Learning on Large Graphs” paper, containing Reddit posts belonging to different communities.

This dataset is designed for large-scale graph representation learning. Nodes in the graph represent Reddit posts, and edges represent interactions (e.g., comments) between posts in the same community. The task is to classify posts into one of the 41 communities based on their content and connectivity.

Dataset Statistics:

  • Number of Nodes: 232,965

  • Number of Edges: 114,615,892

  • Number of Features: 602

  • Number of Classes: 41

The dataset is pre-split into training, validation, and test sets using node type masks.
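Boolean node masks select which nodes contribute to training or evaluation. The sketch below computes accuracy over a masked subset using plain lists; `masked_accuracy` and the toy predictions are hypothetical, and the mask attribute names on the real data object are not specified here.

```python
def masked_accuracy(predictions, labels, mask):
    """Fraction of correct predictions among the nodes where `mask` is True."""
    selected = [(p, y) for p, y, keep in zip(predictions, labels, mask) if keep]
    if not selected:
        raise ValueError("The mask selects no nodes.")
    return sum(1 for p, y in selected if p == y) / len(selected)


# Toy graph with six nodes: three for training, one for validation, two for testing.
labels = [0, 1, 2, 2, 0, 1]
predictions = [0, 1, 1, 2, 0, 2]
train_mask = [True, True, True, False, False, False]
val_mask = [False, False, False, True, False, False]
test_mask = [False, False, False, False, True, True]

train_acc = masked_accuracy(predictions, labels, train_mask)
test_acc = masked_accuracy(predictions, labels, test_mask)
```

Keeping all nodes in one graph and switching masks lets the same forward pass serve training, validation, and testing.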

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)

Example

>>> dataset = Reddit(root='/path/to/reddit')
>>> data = dataset[0]  # Access the first graph object
download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

property raw_file_names: List[str]

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://data.dgl.ai/dataset/reddit.zip'
class jittor_geometric.datasets.TemporalDataLoader(data, batch_size=1, neg_sampling_ratio=None, drop_last=False, num_neg_sample=None, neg_samples=None)[source]
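The batch_size and drop_last arguments suggest that interactions are served in consecutive, time-ordered batches. The sketch below illustrates that idea only; `temporal_batches` is a hypothetical function, the events are assumed to be pre-sorted by timestamp, and negative sampling (neg_sampling_ratio, num_neg_sample, neg_samples) is left out entirely.

```python
def temporal_batches(events, batch_size=1, drop_last=False):
    """Yield consecutive slices of `events`, preserving their chronological order."""
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            return  # skip the final, incomplete batch
        yield batch


# Toy interactions as (source, destination, timestamp), already sorted by time.
events = [(0, 10, 1.0), (1, 11, 2.5), (0, 12, 3.0), (2, 10, 4.2), (1, 12, 5.9)]
all_batches = list(temporal_batches(events, batch_size=2))
full_batches = list(temporal_batches(events, batch_size=2, drop_last=True))
```

Unlike a shuffled loader, the batches are never reordered, so a model only sees interactions that happened before the ones it is asked to predict.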
class jittor_geometric.datasets.WikipediaNetwork(root, name, geom_gcn_preprocess=True, transform=None, pre_transform=None)[source]

The Wikipedia networks introduced in the “Multi-scale Attributed Node Embedding” paper.

This class represents Wikipedia networks where nodes correspond to web pages, and edges represent hyperlinks between them. The node features are derived from informative nouns on the Wikipedia pages, and the task is to predict the average daily traffic of each web page.

Dataset Details:

  • Chameleon: A Wikipedia page graph with node features representing nouns and the task of predicting traffic.

  • Squirrel: Similar to Chameleon but derived from a different subset of Wikipedia pages.

Geometric GCN Preprocessing:

  • If geom_gcn_preprocess is set to True, the dataset is preprocessed following the paper “Geom-GCN: Geometric Graph Convolutional Networks” <https://arxiv.org/abs/2002.05287>. In this case, the traffic prediction task is converted into a five-category classification problem.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset ("chameleon", "squirrel").

  • geom_gcn_preprocess (bool) – Whether to load the preprocessed data from the Geom-GCN paper. If set to True, preprocessed splits will also be available. (default: True)

  • transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a jittor_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

Example

>>> dataset = WikipediaNetwork(root='/path/to/dataset', name='chameleon', geom_gcn_preprocess=True)
>>> data = dataset[0]  # Access the processed data object
download()[source]

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process()[source]

Processes the dataset to the self.processed_dir folder.

Return type:

None

property processed_dir: str
property processed_file_names: str

The name of the files to find in the self.processed_dir folder in order to skip the processing.

processed_url = 'https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/f1fc0d14b3b019c562737240d06ec83b07d16a8f'
property raw_dir: str
property raw_file_names: List[str] | str

The name of the files to find in the self.raw_dir folder in order to skip the download.

raw_url = 'https://graphmining.ai/datasets/ptg/wiki'
