pydgn.data

data.dataset

class pydgn.data.dataset.ConcatFromListDataset(*args: Any, **kwargs: Any)

Bases: torch_geometric.data.InMemoryDataset

Create a dataset from a list of torch_geometric.data.Data objects. Inherits from torch_geometric.data.InMemoryDataset

Parameters

data_list (list) – List of graphs.

download()

Does nothing, the data list is already provided

process()

Does nothing, the data list is already provided

property processed_file_names: Union[str, List[str], Tuple]

Does nothing, the data list is already provided

property raw_file_names: Union[str, List[str], Tuple]

Does nothing, the data list is already provided

class pydgn.data.dataset.DatasetInterface(*args: Any, **kwargs: Any)

Bases: torch_geometric.data.dataset.Dataset

Class that defines a number of properties essential to all dataset implementations inside PyDGN. These properties are used by the training engine and forwarded to the model to be trained. For some datasets, e.g., torch_geometric.datasets.TUDataset, implementing this interface is straightforward.

Parameters
  • root (str) – root folder where to store the dataset

  • name (str) – name of the dataset

  • transform (Optional[Callable]) – transformations to apply to each Data object at run time

  • pre_transform (Optional[Callable]) – transformations to apply to each Data object at dataset creation time

  • pre_filter (Optional[Callable]) – sample filtering to apply to each Data object at dataset creation time

property dim_edge_features: int

Specifies the number of edge features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_node_features: int

Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_target: int

Specifies the dimension of each target vector.

download()

Downloads the dataset to the self.raw_dir folder.

get(idx: int) torch_geometric.data.Data

Gets the data object at index idx.

len() int

Returns the number of graphs stored in the dataset. Note: we need to implement both len and __len__ to comply with PyG interface

process()

Processes the dataset to the self.processed_dir folder.

property processed_file_names: Union[str, List[str], Tuple]

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

property raw_file_names: Union[str, List[str], Tuple]

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

class pydgn.data.dataset.IterableDatasetInterface(*args: Any, **kwargs: Any)

Bases: torch.utils.data.IterableDataset

Class that implements the Iterable-style dataset, including multi-process data loading (https://pytorch.org/docs/stable/data.html#iterable-style-datasets). Useful when the dataset is too big and split in chunks of files to be stored on disk. Each chunk can hold a single sample or a set of samples, and there is the chance to shuffle sample-wise or chunk-wise. To get a subset of this dataset, just provide an argument url_indices specifying which chunks you want to use. Must be combined with an appropriate pydgn.data.provider.IterableDataProvider.

NOTE 1: We assume the splitter will split the dataset with respect to the number of files stored on disk, so be sure that the length of your dataset reflects that number. Then, examples will be provided sequentially, so if each file holds more than one sample, we will still be able to create a batch of samples from one or multiple files.

NOTE 2: NEVER override the __len__() method, as it varies dynamically with the url_indices argument.

Parameters
  • root (str) – root folder where to store the dataset

  • name (str) – name of the dataset

  • transform (Optional[Callable]) – transformations to apply to each Data object at run time

  • pre_transform (Optional[Callable]) – transformations to apply to each Data object at dataset creation time

  • url_indices (Optional[List]) – list of indices used to extract a portion of the dataset
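
The effect of url_indices on the dataset length, and the reason NOTE 2 warns against overriding __len__(), can be pictured with a minimal pure-Python sketch. ChunkedDatasetSketch and its attributes are hypothetical names used only for illustration; they are not part of PyDGN, and the real class may store its chunks differently.

```python
from typing import List, Optional


class ChunkedDatasetSketch:
    """Hypothetical stand-in for a dataset whose samples live in chunk files."""

    def __init__(self, urls: List[str], url_indices: Optional[List[int]] = None):
        # With no url_indices the whole dataset is used; otherwise only the
        # selected chunk files are kept.
        if url_indices is None:
            self.urls = list(urls)
        else:
            self.urls = [urls[i] for i in url_indices]

    def __len__(self) -> int:
        # The length is the number of chunk files currently selected,
        # so it changes with url_indices.
        return len(self.urls)


all_urls = [f"chunk_{i}.pt" for i in range(5)]
full = ChunkedDatasetSketch(all_urls)
subset = ChunkedDatasetSketch(all_urls, url_indices=[0, 3])
```

Here the length follows the number of selected files, which is what a splitter working at file level would rely on.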

property dim_edge_features: int

Specifies the number of edge features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_node_features: int

Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_target: int

Specifies the dimension of each target vector.

download()

Downloads the dataset to the self.raw_dir folder.

get(idx: int) torch_geometric.data.Data

Gets the data object at index idx.

process()

Processes the dataset to the self.processed_dir folder.

property processed_dir: Union[str, pathlib.Path]

The folder where to store processed data files.

property processed_file_names: Union[str, List[str], Tuple]

The list of file names that must be present in the self.processed_dir folder in order to skip processing.

property processed_paths: List[Union[str, pathlib.Path]]

The absolute filepaths that must be present in order to skip processing.

property raw_dir: Union[str, pathlib.Path]

The path where the raw data should be downloaded.

Returns

a string

property raw_file_names: List[Union[str, pathlib.Path]]

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

property raw_paths: List[Union[str, pathlib.Path]]

The absolute filepaths that must be present in order to skip downloading.

shuffle_urls(value: bool)

Shuffles urls associated to individual files stored on disk

Parameters

value (bool) – whether or not to shuffle urls

shuffle_urls_elements(value: bool)

Shuffles elements contained in each file (associated with a url). Use this method when a single file stores multiple samples and you want to provide them in shuffled order. IMPORTANT: in this case we assume that each file contains a list of Data objects!

Parameters

value (bool) – whether or not to shuffle the elements contained in each file
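
The difference between chunk-wise shuffling (shuffle_urls) and sample-wise shuffling inside each file (shuffle_urls_elements) can be sketched in plain Python. The function iterate_chunks and the dictionary layout are assumptions made for illustration, not PyDGN code.

```python
import random
from typing import Dict, Iterator, List


def iterate_chunks(
    chunks: Dict[str, List[int]],
    shuffle_urls: bool,
    shuffle_elements: bool,
    seed: int,
) -> Iterator[int]:
    """Yield samples chunk by chunk, optionally shuffling at both levels."""
    rng = random.Random(seed)
    urls = list(chunks)
    if shuffle_urls:
        rng.shuffle(urls)  # chunk-wise shuffle: changes the order of the files
    for url in urls:
        samples = list(chunks[url])
        if shuffle_elements:
            rng.shuffle(samples)  # sample-wise shuffle inside one file
        yield from samples


chunks = {"chunk_0": [0, 1, 2], "chunk_1": [3, 4, 5]}
in_order = list(iterate_chunks(chunks, False, False, seed=0))
shuffled = list(iterate_chunks(chunks, True, True, seed=0))
```

In this sketch, samples from the same file always stay contiguous, even when both levels are shuffled.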

splice(start: int, end: int)

Use this method to assign portions of the dataset to load to different workers, otherwise they will load the same samples.

Parameters
  • start (int) – the index where to start

  • end (int) – the index where to stop
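
A minimal sketch of how a range of chunk indices could be divided among data-loading workers before each worker calls splice(start, end). The helper worker_bounds is hypothetical and follows the pattern commonly used with PyTorch iterable-style datasets; it is not taken from PyDGN.

```python
import math
from typing import Tuple


def worker_bounds(start: int, end: int, num_workers: int, worker_id: int) -> Tuple[int, int]:
    """Return the half-open [start, end) range of indices owned by one worker."""
    per_worker = math.ceil((end - start) / num_workers)
    w_start = min(start + worker_id * per_worker, end)
    w_end = min(w_start + per_worker, end)
    return w_start, w_end


# Ten chunks divided among three workers: the ranges are disjoint and cover everything.
bounds = [worker_bounds(0, 10, 3, w) for w in range(3)]
```

Each worker would then load only its own range, which avoids the duplicated samples mentioned above.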

class pydgn.data.dataset.OGBGDatasetInterface(*args: Any, **kwargs: Any)

Bases: ogb.graphproppred.PygGraphPropPredDataset

Class that wraps the ogb.graphproppred.PygGraphPropPredDataset class to provide aliases of some fields. It implements the interface DatasetInterface but does not extend it directly, to avoid clashes between __init__ methods.

property dim_edge_features: int

Specifies the number of edge features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_node_features: int

Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_target: int

Specifies the dimension of each target vector.

download()

Downloads the OGB dataset to the self.raw_dir folder.

process()

Processes the OGB dataset to the self.processed_dir folder.

property processed_file_names: Union[str, List[str], Tuple]

The list of file names that must be present in the self.processed_dir folder in order to skip processing.

class pydgn.data.dataset.PlanetoidDatasetInterface(*args: Any, **kwargs: Any)

Bases: torch_geometric.datasets.Planetoid

Class that wraps the torch_geometric.datasets.Planetoid class to provide aliases of some fields. It implements the interface DatasetInterface but does not extend it directly, to avoid clashes between __init__ methods.

property dim_edge_features: int

Specifies the number of edge features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_node_features: int

Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_target: int

Specifies the dimension of each target vector.

download()

Downloads the Planetoid dataset to the self.raw_dir folder.

process()

Processes the Planetoid dataset to the self.processed_dir folder

class pydgn.data.dataset.TUDatasetInterface(*args: Any, **kwargs: Any)

Bases: torch_geometric.datasets.TUDataset

Class that wraps the torch_geometric.datasets.TUDataset class to provide aliases of some fields. It implements the interface DatasetInterface but does not extend it directly, to avoid clashes between __init__ methods.

property dim_edge_features: int

Specifies the number of edge features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_node_features: int

Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_target: int

Specifies the dimension of each target vector.

download()

Downloads the TUDataset dataset to the self.raw_dir folder.

process()

Processes the TUDataset dataset to the self.processed_dir folder

class pydgn.data.dataset.TemporalDatasetInterface(*args: Any, **kwargs: Any)

Bases: pydgn.data.dataset.DatasetInterface

Extension of DatasetInterface to the temporal scenario.

get(idx: int) torch_geometric.data.Data

Gets element idx from object self.dataset

Parameters

idx (int) – the sample index

Returns

the Data object at index idx

get_mask(data: Union[torch_geometric.data.Batch, torch_geometric.data.Data]) torch.Tensor

Computes the mask of time steps for which we need to make a prediction.

Parameters

data – the data object

Returns

A tensor indicating the time-steps at which we expect predictions

class pydgn.data.dataset.ToyIterableDataset(*args: Any, **kwargs: Any)

Bases: pydgn.data.dataset.IterableDatasetInterface

Class that implements the Iterable-style dataset, including multi-process data loading (https://pytorch.org/docs/stable/data.html#iterable-style-datasets). Useful when the dataset is too big and split in chunks of files to be stored on disk. Each chunk can hold a single sample or a set of samples, and there is the chance to shuffle sample-wise or chunk-wise. To get a subset of this dataset, just provide an argument url_indices specifying which chunks you want to use. Must be combined with an appropriate pydgn.data.provider.IterableDataProvider.

Parameters
  • root (str) – root folder where to store the dataset

  • name (str) – name of the dataset

  • transform (Optional[Callable]) – transformations to apply to each Data object at run time

  • pre_transform (Optional[Callable]) – transformations to apply to each Data object at dataset creation time

  • url_indices (Optional[List]) – list of indices used to extract a portion of the dataset

property dim_edge_features: int

Specifies the number of edge features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_node_features: int

Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).

property dim_target: int

Specifies the dimension of each target vector.

download()

Does nothing, the data list is already provided

process()

Creates a fake dataset and stores it to the self.processed_dir folder. Each file will contain a list of 10 fake graphs.

property processed_file_names: Union[str, List[str], Tuple]

The list of file names that must be present in the self.processed_dir folder in order to skip processing.

property raw_file_names: Union[str, List[str], Tuple]

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

class pydgn.data.dataset.ZipDataset(*args: Any, **kwargs: Any)

Bases: torch.utils.data.Dataset

This Dataset takes n datasets and “zips” them. When asked for the i-th element, it returns the i-th element of all n datasets.

Parameters

datasets (List[torch.utils.data.Dataset]) – An iterable with PyTorch Datasets

Precondition:

The length of all datasets must be the same
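
The zipping behaviour and its precondition can be illustrated with a small pure-Python sketch. ZipSketch is a hypothetical class operating on plain sequences; whether the real ZipDataset returns a tuple or another container is an assumption here.

```python
from typing import Any, Sequence, Tuple


class ZipSketch:
    """Hypothetical illustration of zipping n indexable datasets."""

    def __init__(self, datasets: Sequence[Sequence[Any]]):
        # Enforce the precondition: all datasets must have the same length.
        if len({len(d) for d in datasets}) != 1:
            raise ValueError("all datasets must have the same length")
        self.datasets = list(datasets)

    def __getitem__(self, i: int) -> Tuple[Any, ...]:
        # The i-th element is made of the i-th element of every dataset.
        return tuple(d[i] for d in self.datasets)

    def __len__(self) -> int:
        return len(self.datasets[0])


zipped = ZipSketch([[10, 11, 12], ["a", "b", "c"]])
```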

data.provider

class pydgn.data.provider.DataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.DataLoader], Callable[[...], torch_geometric.loader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)

Bases: object

A DataProvider object retrieves the correct data according to the external and internal data splits. It can be additionally used to augment the data, or to create a specific type of data loader. The base class does nothing special, but here is where the i-th element of a dataset could be pre-processed before constructing the mini-batches.

IMPORTANT: if the dataset is to be shuffled, you MUST use a pydgn.data.sampler.RandomSampler object to determine the permutation.

Parameters
  • data_root (str) – the path of the root folder in which data is stored

  • splits_filepath (str) – the filepath of the splits, with additional metadata

  • dataset_class (Callable[…, pydgn.data.dataset.DatasetInterface]) – the class of the dataset

  • data_loader_class (Union[Callable[…, torch.utils.data.DataLoader], Callable[…, torch_geometric.loader.DataLoader]]) – the class of the data loader to use

  • data_loader_args (dict) – the arguments of the data loader

  • dataset_name (str) – the name of the dataset

  • outer_folds (int) – the number of outer folds for risk assessment. 1 means hold-out, >1 means k-fold

  • inner_folds (int) – the number of inner folds for model selection. 1 means hold-out, >1 means k-fold

_get_dataset(**kwargs: dict) pydgn.data.dataset.DatasetInterface

Instantiates the dataset. Relies on the parameters stored in the dataset_kwargs.pt file.

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset. Not used in the base version

Returns

a DatasetInterface object

_get_loader(indices: list, **kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Instantiates the data loader.

Parameters
  • indices (sequence) – Indices in the whole set selected for subset

  • kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

_get_splitter() pydgn.data.splitter.Splitter

Instantiates the splitter with the parameters stored in the file self.splits_filepath

Returns

a Splitter object

get_dim_edge_features() int

Returns the number of edge features of the dataset

Returns

the value of the property dim_edge_features in the dataset

get_dim_node_features() int

Returns the number of node features of the dataset

Returns

the value of the property dim_node_features in the dataset

get_dim_target() int

Returns the dimension of the target for the task

Returns

the value of the property dim_target in the dataset

get_inner_train(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the training set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_inner_val(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the validation set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_test(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the test set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_train(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the training set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_val(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the validation set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

set_exp_seed(seed: int)

Sets the experiment seed to give to the DataLoader. Helps with reproducibility.

Parameters

seed (int) – id of the seed

set_inner_k(k)

Sets the parameter k of the model selection procedure. Called by the evaluation modules to load the correct subset of the data.

Parameters

k (int) – the id of the fold, ranging from 0 to K-1.

set_outer_k(k: int)

Sets the parameter k of the risk assessment procedure. Called by the evaluation modules to load the correct data subset.

Parameters

k (int) – the id of the fold, ranging from 0 to K-1.
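
The way set_outer_k and set_inner_k select which subset of the data is loaded can be pictured with the following sketch. The nested layout of splits and the class FoldSelectorSketch are hypothetical; the real splits file may be organised differently.

```python
from typing import List, Optional

# Hypothetical layout: one entry per outer fold and, for each outer fold,
# a list of inner folds used for model selection.
splits = {
    "outer_folds": [
        {"train": [0, 1, 2, 3, 4, 5], "val": [6, 7], "test": [8, 9]},
    ],
    "inner_folds": [
        [
            {"train": [0, 1, 2, 3], "val": [4, 5]},
            {"train": [2, 3, 4, 5], "val": [0, 1]},
        ],
    ],
}


class FoldSelectorSketch:
    """Keeps track of the current outer/inner fold ids, each in 0..K-1."""

    def __init__(self, splits: dict):
        self.splits = splits
        self.outer_k: Optional[int] = None
        self.inner_k: Optional[int] = None

    def set_outer_k(self, k: int) -> None:
        self.outer_k = k

    def set_inner_k(self, k: int) -> None:
        self.inner_k = k

    def inner_indices(self, subset: str) -> List[int]:
        # Model selection needs both an outer and an inner fold id.
        if self.outer_k is None or self.inner_k is None:
            raise ValueError("call set_outer_k and set_inner_k first")
        return self.splits["inner_folds"][self.outer_k][self.inner_k][subset]

    def outer_indices(self, subset: str) -> List[int]:
        # Risk assessment only needs the outer fold id.
        if self.outer_k is None:
            raise ValueError("call set_outer_k first")
        return self.splits["outer_folds"][self.outer_k][subset]


selector = FoldSelectorSketch(splits)
selector.set_outer_k(0)
selector.set_inner_k(1)
```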

class pydgn.data.provider.IterableDataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.DataLoader], Callable[[...], torch_geometric.loader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)

Bases: pydgn.data.provider.DataProvider

A DataProvider object that fetches data from an Iterable-style Dataset (see pydgn.data.dataset.IterableDatasetInterface).

_get_loader(indices: list, **kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Instantiates the data loader, passing to the dataset an additional url_indices argument with the indices to fetch. This is because each time this method is called with different indices, a separate instance of the dataset is created.

Parameters
  • indices (sequence) – Indices in the whole set selected for subset

  • kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

class pydgn.data.provider.LinkPredictionSingleGraphDataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.DataLoader], Callable[[...], torch_geometric.loader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)

Bases: pydgn.data.provider.DataProvider

An extension of the DataProvider class to deal with link prediction on a single graph. Designed to work with LinkPredictionSingleGraphSplitter. We also assume the single-graph dataset can fit in memory. WARNING: this class modifies the dataset by creating copies. It may not work if a “shared dataset” feature is added to PyDGN.

_get_dataset(**kwargs)

Compared to superclass method, this always returns a new instance of the dataset, optionally passing extra arguments specified at runtime.

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset. Not used in the base version

Returns

a DatasetInterface object

_get_loader(indices: list, **kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

This method returns a data loader for a single graph augmented with additional fields. The y field becomes (y, positive EVAL edges, negative EVAL edges), where eval means these are the edges on which to evaluate losses and scores (in fact, eval could also mean training!). A list of different Data objects is created, where the evaluation edges are randomly permuted. This depends on the size of the batch that is specified.

Parameters
  • indices (sequence) – Indices in the whole set selected for subset

  • kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

_get_splitter()

Instantiates the splitter with the parameters stored in the file self.splits_filepath. Only works with pydgn.data.splitter.LinkPredictionSingleGraphSplitter.

Returns

a Splitter object

get_inner_train(**kwargs)

Returns the training set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_inner_val(**kwargs)

Returns the validation set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_test(**kwargs)

Returns the test set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_train(**kwargs)

Returns the training set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_val(**kwargs)

Returns the validation set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

class pydgn.data.provider.SingleGraphDataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.DataLoader], Callable[[...], torch_geometric.loader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)

Bases: pydgn.data.provider.DataProvider

A DataProvider subclass that only works with pydgn.data.splitter.SingleGraphSplitter.

Parameters
  • data_root (str) – the path of the root folder in which data is stored

  • splits_filepath (str) – the filepath of the splits, with additional metadata

  • dataset_class (Callable[…, pydgn.data.dataset.DatasetInterface]) – the class of the dataset

  • data_loader_class (Union[Callable[…, torch.utils.data.DataLoader], Callable[…, torch_geometric.loader.DataLoader]]) – the class of the data loader to use

  • data_loader_args (dict) – the arguments of the data loader

  • dataset_name (str) – the name of the dataset

  • outer_folds (int) – the number of outer folds for risk assessment. 1 means hold-out, >1 means k-fold

  • inner_folds (int) – the number of inner folds for model selection. 1 means hold-out, >1 means k-fold

_get_dataset(**kwargs: dict) pydgn.data.dataset.DatasetInterface

Compared to superclass method, this always returns a new instance of the dataset, optionally passing extra arguments specified at runtime.

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset. Not used in the base version

Returns

a DatasetInterface object

_get_loader(eval_indices: list, training_indices: list, **kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Compared to the superclass method, returns a data loader with the single graph augmented with additional fields: training_indices, the indices that refer to training nodes (usually available), and eval_indices, the indices on which to evaluate (validation or test).

Parameters
  • eval_indices (list) – indices of the nodes on which to evaluate (validation or test)

  • training_indices (list) – indices of the training nodes

  • kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

_get_splitter()

Instantiates the splitter with the parameters stored in the file self.splits_filepath. Only works with pydgn.data.splitter.SingleGraphSplitter.

Returns

a Splitter object

get_inner_train(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the training set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_inner_val(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the validation set for model selection associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_test(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the test set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_train(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the training set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

get_outer_val(**kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Returns the validation set for risk assessment associated with specific outer and inner folds

Parameters

kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

class pydgn.data.provider.SingleGraphSequenceDataProvider(data_root: str, splits_filepath: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], dataset_name: str, data_loader_class: Union[Callable[[...], torch.utils.data.DataLoader], Callable[[...], torch_geometric.loader.DataLoader]], data_loader_args: dict, outer_folds: int, inner_folds: int)

Bases: pydgn.data.provider.DataProvider

This class is responsible for building the dynamic dataset at runtime.

_get_loader(indices: list, **kwargs: dict) Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader]

Instantiates the data loader. Only works with torch.utils.data.DataLoader.

Parameters
  • indices (sequence) – Indices in the whole set selected for subset

  • kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version

Returns

a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object

classmethod collate_fn(samples_list: List[torch_geometric.data.Data]) List[torch_geometric.data.Batch]

Creates a Batch object for each sample in the data list.

Parameters

samples_list (List[Data]) – the list of graphs to batch

Returns

a list of Batch objects

pydgn.data.provider.seed_worker(exp_seed, worker_id)

Used to set a different, but reproducible, seed for all data-retriever workers. Without this, all workers will retrieve the data in the same order (important for Iterable-style datasets).

Parameters
  • exp_seed (int) – base seed to be used for reproducibility

  • worker_id (int) – id number of the worker
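
A minimal sketch of the idea behind per-worker seeding, assuming the worker seed is derived as exp_seed + worker_id and seeding only Python's random module. Both choices are assumptions made for illustration; the actual seed_worker may derive the seed differently and seed other libraries as well.

```python
import random
from typing import List


def seed_worker_sketch(exp_seed: int, worker_id: int) -> int:
    """Derive a distinct but reproducible seed for one worker (hypothetical rule)."""
    worker_seed = exp_seed + worker_id
    random.seed(worker_seed)
    return worker_seed


def first_draws(exp_seed: int, worker_id: int, n: int = 3) -> List[float]:
    """Seed a simulated worker and return its first n random draws."""
    seed_worker_sketch(exp_seed, worker_id)
    return [random.random() for _ in range(n)]


same_worker_twice = (first_draws(42, 0), first_draws(42, 0))
two_workers = (first_draws(42, 0), first_draws(42, 1))
```

Since PyTorch calls a DataLoader's worker_init_fn with the worker id only, one plausible way to pass a two-argument function like this is to bind the experiment seed first, for example with functools.partial.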

data.sampler

class pydgn.data.sampler.RandomSampler(*args: Any, **kwargs: Any)

Bases: torch.utils.data.sampler.RandomSampler

This sampler wraps the dataset and saves the random permutation applied to the samples, so that it will be available for further use (e.g. for saving graph embeddings in the original order). The permutation is saved in the ‘permutation’ attribute.

Parameters

data_source (pydgn.data.dataset.DatasetInterface) – the dataset object
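
Why saving the permutation is useful can be shown with a pure-Python sketch: outputs produced in shuffled order are put back in the original order. PermutationSamplerSketch is a hypothetical stand-in that does not depend on PyTorch; only the attribute name permutation comes from the description above.

```python
import random
from typing import Iterator, List, Optional


class PermutationSamplerSketch:
    """Hypothetical sampler that remembers the last permutation it produced."""

    def __init__(self, num_samples: int, seed: int):
        self.num_samples = num_samples
        self.rng = random.Random(seed)
        self.permutation: Optional[List[int]] = None

    def __iter__(self) -> Iterator[int]:
        # Draw a new random order and store it for later use.
        self.permutation = self.rng.sample(range(self.num_samples), self.num_samples)
        return iter(self.permutation)

    def __len__(self) -> int:
        return self.num_samples


sampler = PermutationSamplerSketch(5, seed=0)
# Pretend each sample yields one embedding, produced in shuffled order.
outputs_in_loader_order = [f"emb_{i}" for i in sampler]

# Use the stored permutation to restore the original sample order.
restored = [None] * len(sampler)
for position, original_index in enumerate(sampler.permutation):
    restored[original_index] = outputs_in_loader_order[position]
```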

data.splitter

class pydgn.data.splitter.Fold(train_idxs, val_idxs=None, test_idxs=None)

Bases: object

Simple class that stores training, validation, and test indices.

Parameters
  • train_idxs (Union[list, tuple]) – training indices

  • val_idxs (Union[list, tuple]) – validation indices. Default is None

  • test_idxs (Union[list, tuple]) – test indices. Default is None

class pydgn.data.splitter.InnerFold(train_idxs, val_idxs=None, test_idxs=None)

Bases: pydgn.data.splitter.Fold

Simple extension of the Fold class that returns a dictionary with training and validation indices (model selection).

todict() dict

Creates a dictionary with the training/validation indices.

Returns

a dict with keys ['train', 'val'] associated with the respective indices

class pydgn.data.splitter.LinkPredictionSingleGraphSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1, undirected: bool = False, avoid_opposite_negative_edges: bool = True)

Bases: pydgn.data.splitter.Splitter

Class that inherits from Splitter and computes link splits for link classification tasks. IMPORTANT: This class implements bootstrapping rather than k-fold cross-validation, so different outer test sets may have overlapping indices.

Does not support edge attributes at the moment.

Parameters
  • n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold

  • n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold

  • seed (int) – random seed for reproducibility (on the same machine)

  • stratify (bool) – whether to apply stratification or not (should be true for classification tasks)

  • shuffle (bool) – whether to apply shuffle or not

  • inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is 0.1

  • outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is 0.1

  • test_ratio (float) – percentage of test set for hold_out model assessment. Default is 0.1

  • undirected (bool) – whether or not the graph is undirected. Default is False

  • avoid_opposite_negative_edges (bool) – whether or not to avoid creating negative edges that are opposite to existing edges. Default is True

_splitter_args()

Compared to the superclass version, adds two boolean arguments undirected and avoid_opposite_negative_edges.

Returns

a dict containing all splitter’s arguments.

split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)

Computes the splits and stores them in the list fields self.outer_folds and self.inner_folds. Links are selected at random: this means outer test folds will almost surely overlap if test_ratio is 10% of the total samples. The recommended procedure here is to use the outer folds to do bootstrapping rather than k-fold cross-validation. Idea taken from: https://arxiv.org/pdf/1811.05868.pdf . IMPORTANT: calling split() sets the seed of numpy, torch, and random for reproducibility.

Parameters
  • dataset (DatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None
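The overlap between outer test folds can be illustrated with a minimal sketch. It does not use PyDGN: sample_test_edges, the seeds, and the edge count are hypothetical, and each outer fold is assumed to draw its test links independently at random.

```python
import random


def sample_test_edges(num_edges, test_ratio, seed):
    """Pick a random subset of edge positions to act as test links."""
    rng = random.Random(seed)
    n_test = int(num_edges * test_ratio)
    return set(rng.sample(range(num_edges), n_test))


num_edges = 1000
# One independent draw per outer fold (hypothetical seeds 0, 1, 2).
outer_test_sets = [sample_test_edges(num_edges, 0.1, seed) for seed in range(3)]

# Unlike k-fold cross-validation, nothing forces the draws to be disjoint.
overlap = outer_test_sets[0] & outer_test_sets[1]
print(len(overlap))
```

Because the draws are independent, the results across outer folds are better read as bootstrap estimates than as a partition of the links.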

train_val_test_edge_split(edge_index, edge_attr, val_ratio, test_ratio, num_nodes)

Sample training/validation/test edges at random.

class pydgn.data.splitter.NoShuffleTrainTestSplit(test_ratio)

Bases: object

Class that implements a very simple training/test split. Can be used to further split training data into training and validation.

Parameters

test_ratio – percentage of data to use for evaluation.

split(idxs, y=None)

Splits the data.

Parameters
  • idxs – the indices to split according to the test_ratio parameter

  • y – Unused argument

Returns

a list of a single tuple (train indices, test/eval indices)
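A minimal sketch of such a split, assuming the last test_ratio fraction of the indices is held out for evaluation and the order is left untouched. The function name no_shuffle_split is hypothetical and the code does not reproduce the PyDGN implementation.

```python
def no_shuffle_split(idxs, test_ratio):
    """Return [(train, eval)] where eval is the tail of idxs, in order."""
    n_test = int(len(idxs) * test_ratio)
    n_train = len(idxs) - n_test
    return [(list(idxs[:n_train]), list(idxs[n_train:]))]


splits = no_shuffle_split(list(range(10)), test_ratio=0.2)
train_idxs, eval_idxs = splits[0]
print(train_idxs, eval_idxs)
```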

class pydgn.data.splitter.OGBGSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)

Bases: pydgn.data.splitter.Splitter

Splitter specific to OGBG Datasets, reuses the already given splits (hence it works only in hold-out mode).

split(dataset: pydgn.data.dataset.OGBGDatasetInterface, targets=None)

Computes the OGBG splits according to those already provided by the authors of the datasets and stores them in the list fields self.outer_folds and self.inner_folds. IMPORTANT: calling split() sets the seed of numpy, torch, and random for reproducibility.

Parameters
  • dataset (OGBGDatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None

class pydgn.data.splitter.OuterFold(train_idxs, val_idxs=None, test_idxs=None)

Bases: pydgn.data.splitter.Fold

Simple extension of the Fold class that returns a dictionary with training and test indices (risk assessment)

todict() dict

Creates a dictionary with the training/validation/test indices.

Returns

a dict with keys ['train', 'val', 'test'] associated with the respective indices
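The shape of the returned dictionary can be sketched with a stand-in class. FoldSketch is hypothetical and only mirrors the constructor signature and the keys documented above.

```python
class FoldSketch:
    """Stand-in for a fold that stores three index lists."""

    def __init__(self, train_idxs, val_idxs=None, test_idxs=None):
        self.train_idxs = train_idxs
        self.val_idxs = val_idxs
        self.test_idxs = test_idxs

    def todict(self):
        # Same three keys as documented for OuterFold.todict().
        return {
            "train": self.train_idxs,
            "val": self.val_idxs,
            "test": self.test_idxs,
        }


fold = FoldSketch(train_idxs=[0, 1, 2], val_idxs=[3], test_idxs=[4, 5])
fold_dict = fold.todict()
print(sorted(fold_dict))
```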

class pydgn.data.splitter.SameInnerSplitSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)

Bases: pydgn.data.splitter.Splitter

Splitter subclass that can be used to have multiple training runs of the same configuration at model selection time. It is not meant to be combined with a double-nested CV, for which the different inner splits are already enough to gauge the training stability of each configuration.

split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)

Computes the splits and stores them in the list fields self.outer_folds and self.inner_folds. IMPORTANT: calling split() sets the seed of numpy, torch, and random for reproducibility.

Parameters
  • dataset (DatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None

class pydgn.data.splitter.SingleGraphSequenceSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)

Bases: pydgn.data.splitter.TemporalSplitter

Class for dynamic graphs that generates the splits at dataset creation time. It assumes that there is a single graph sequence, so the split happens on time steps. What is more, n_inner_folds here will create an inner K-fold CV split for model selection where, however, the training/validation split will not change (because there is no way to split time steps in a different way). This allows for different initializations of the same model, evaluating the avg performance on the VL set.

get_targets(dataset: pydgn.data.dataset.TemporalDatasetInterface) Tuple[bool, numpy.ndarray]

Reads the entire dataset and returns the targets.

Parameters

dataset (TemporalDatasetInterface) – the temporal dataset

Returns

a tuple of two elements. The first element is a boolean, which is True if target values exist or an exception has not been thrown. The second value holds the actual targets or None, depending on the first boolean value.

split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)

Computes the splits and stores them in the list fields self.outer_folds and self.inner_folds. IMPORTANT: calling split() sets the seed of numpy, torch, and random for reproducibility.

Parameters
  • dataset (DatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None

class pydgn.data.splitter.SingleGraphSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)

Bases: pydgn.data.splitter.Splitter

A splitter for a single graph dataset that randomly splits nodes into training/validation/test

Parameters
  • n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold

  • n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold

  • seed (int) – random seed for reproducibility (on the same machine)

  • stratify (bool) – whether to apply stratification or not (should be true for classification tasks)

  • shuffle (bool) – whether to apply shuffle or not

  • inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is 0.1

  • outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is 0.1

  • test_ratio (float) – percentage of test set for hold_out model assessment. Default is 0.1

split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)

Compared with the superclass version, the only difference is that the range of indices spans across the number of nodes of the single graph taken into consideration.

Parameters
  • dataset (DatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None

class pydgn.data.splitter.Splitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)

Bases: object

Class that generates and stores the data splits at dataset creation time.

Parameters
  • n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold

  • n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold

  • seed (int) – random seed for reproducibility (on the same machine)

  • stratify (bool) – whether to apply stratification or not (should be true for classification tasks)

  • shuffle (bool) – whether to apply shuffle or not

  • inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is 0.1

  • outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is 0.1

  • test_ratio (float) – percentage of test set for hold_out model assessment. Default is 0.1

_get_splitter(n_splits: int, stratified: bool, eval_ratio: float)

Instantiates the appropriate splitter to use depending on the situation

Parameters
  • n_splits (int) – the number of different splits to create

  • stratified (bool) – whether or not to perform stratification. Works with graph classification tasks only!

  • eval_ratio (float) – the amount of evaluation (validation/test) data to use in case n_splits==1 (i.e., hold-out data split)

Returns

a Splitter object

_splitter_args() dict

Returns a dict with all the splitter’s arguments for subsequent re-loading at experiment time.

Returns

a dict containing all splitter’s arguments.

get_targets(dataset: pydgn.data.dataset.DatasetInterface) Tuple[bool, numpy.ndarray]

Reads the entire dataset and returns the targets.

Parameters

dataset (DatasetInterface) – the dataset

Returns

a tuple of two elements. The first element is a boolean, which is True if target values exist or an exception has not been thrown. The second value holds the actual targets or None, depending on the first boolean value.

classmethod load(path: str)

Loads the data splits from disk.

Parameters

path (str) – the path of the yaml file with the splits

Returns

a Splitter object

save(path: str)

Saves the split as a dictionary into a torch file. The arguments of the dictionary are:

  • seed (int)

  • splitter_class (str)

  • splitter_args (dict)

  • outer_folds (list of dicts)

  • inner_folds (list of lists of dicts)

Parameters

path (str) – filepath where to save the object
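As a rough illustration of the saved layout, the sketch below builds a plain dictionary with the five documented keys. All values are made up, the key names inside each fold dict are assumptions, and nothing is written to disk.

```python
# Hypothetical record for 2 outer folds with 1 inner fold each.
split_record = {
    "seed": 42,
    "splitter_class": "pydgn.data.splitter.Splitter",
    "splitter_args": {"n_outer_folds": 2, "n_inner_folds": 1},
    # One dict per outer fold.
    "outer_folds": [
        {"train": [0, 1, 2, 3], "val": [4], "test": [5, 6]},
        {"train": [2, 3, 5, 6], "val": [4], "test": [0, 1]},
    ],
    # One list per outer fold, holding one dict per inner fold.
    "inner_folds": [
        [{"train": [0, 1, 2], "val": [3]}],
        [{"train": [2, 3, 5], "val": [6]}],
    ],
}

print(len(split_record["outer_folds"]), len(split_record["inner_folds"]))
```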

split(dataset: pydgn.data.dataset.DatasetInterface, targets: Optional[numpy.ndarray] = None)

Computes the splits and stores them in the list fields self.outer_folds and self.inner_folds. IMPORTANT: calling split() sets the seed of numpy, torch, and random for reproducibility.

Parameters
  • dataset (DatasetInterface) – the Dataset object

  • targets (np.ndarray) – targets used for stratification. Default is None
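The relationship between outer_folds and inner_folds can be sketched as a nested k-fold procedure. This is a simplified stand-in, not the PyDGN code: k_fold and nested_split are hypothetical helpers, stratification and the hold-out case (1 fold) are left out, and a local random generator is used instead of seeding numpy, torch, and random globally.

```python
import random


def k_fold(indices, k):
    """Split indices into k (train, eval) pairs with disjoint eval parts."""
    folds = [indices[i::k] for i in range(k)]
    return [
        ([idx for j, f in enumerate(folds) if j != i for idx in f], folds[i])
        for i in range(k)
    ]


def nested_split(n_samples, n_outer, n_inner, seed):
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    outer_folds, inner_folds = [], []
    for train, test in k_fold(indices, n_outer):
        # Outer level: risk assessment on the held-out test part.
        outer_folds.append({"train": train, "test": test})
        # Inner level: model selection on the outer training part only.
        inner_folds.append(
            [{"train": tr, "val": va} for tr, va in k_fold(train, n_inner)]
        )
    return outer_folds, inner_folds


outer_folds, inner_folds = nested_split(n_samples=20, n_outer=5, n_inner=3, seed=0)
print(len(outer_folds), len(inner_folds[0]))
```

Each entry of inner_folds only ever sees indices from the training part of the matching outer fold, so the outer test indices stay untouched during model selection.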

class pydgn.data.splitter.TemporalSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)

Bases: pydgn.data.splitter.Splitter

Reads the entire dataset and returns the targets. In this case, each sample in the dataset represents a temporal graph, so we get the classification value at the last time step. Use this method to stratify a dataset of multiple temporal graphs.

Parameters

dataset (DatasetInterface) – the dataset

Returns

a tuple of two elements. The first element is a boolean, which is True if target values exist or an exception has not been thrown. The second value holds the actual targets or None, depending on the first boolean value.

get_targets(dataset: pydgn.data.dataset.TemporalDatasetInterface) Tuple[bool, numpy.ndarray]

Reads the entire dataset and returns the targets.

Parameters

dataset (TemporalDatasetInterface) – the temporal dataset

Returns

a tuple of two elements. The first element is a boolean, which is True if target values exist or an exception has not been thrown. The second value holds the actual targets or None, depending on the first boolean value.

data.transform

class pydgn.data.transform.ConstantEdgeIfEmpty(value=1)

Bases: object

Adds a constant value to each edge feature only if edge_attr is None.

Parameters

value (int) – The value to add. Default is 1

class pydgn.data.transform.ConstantIfEmpty(value=1)

Bases: object

Adds a constant value to each node feature only if x is None.

Parameters

value (int) – The value to add. Default is 1

class pydgn.data.transform.Degree(in_degree: bool = False, cat: bool = True)

Bases: object

Adds the node degree to the node features.

Parameters
  • in_degree (bool) – If set to True, will compute the in-degree of nodes instead of the out-degree. Not relevant if the graph is undirected. (default: False)

  • cat (bool) – Concat node degrees to node features instead of replacing them. (default: True)
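The effect of in_degree and cat can be sketched on plain Python lists. node_degrees is a hypothetical helper, and the sketch assumes the usual PyG convention that the first row of the edge index holds source nodes and the second row holds target nodes.

```python
def node_degrees(edge_index, num_nodes, in_degree=False):
    """Count edges per node from a 2-row edge index (sources, targets)."""
    sources, targets = edge_index
    endpoints = targets if in_degree else sources
    degrees = [0] * num_nodes
    for node in endpoints:
        degrees[node] += 1
    return degrees


# Directed edges: 0->1, 0->2, 1->2
edge_index = [[0, 0, 1], [1, 2, 2]]
out_deg = node_degrees(edge_index, num_nodes=3)
in_deg = node_degrees(edge_index, num_nodes=3, in_degree=True)

# cat=True: append the degree to each node's existing features.
x = [[0.5], [0.1], [0.9]]
x_cat = [features + [float(d)] for features, d in zip(x, out_deg)]
# cat=False: the degree replaces the existing features.
x_replaced = [[float(d)] for d in out_deg]
print(out_deg, in_deg)
```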

data.util

pydgn.data.util.check_argument(cls: object, arg_name: str) bool

Checks whether arg_name is in the signature of a method or class.

Parameters
  • cls (object) – the class to inspect

  • arg_name (str) – the name to look for

Returns

True if the name was found, False otherwise
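One way to perform such a check is with the standard inspect module. The sketch below is an assumption about the approach, not the PyDGN source; has_argument and ExampleDataset are hypothetical names.

```python
import inspect


def has_argument(obj, arg_name):
    """Return True if arg_name appears in the signature of obj."""
    return arg_name in inspect.signature(obj).parameters


class ExampleDataset:
    def __init__(self, root, name, transform=None):
        self.root = root
        self.name = name
        self.transform = transform


found = has_argument(ExampleDataset, "transform")
missing = has_argument(ExampleDataset, "pre_filter")
print(found, missing)
```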

pydgn.data.util.filter_adj(edge_index: torch.Tensor, edge_attr: torch.Tensor, mask: torch.Tensor)

Adapted from pytorch-geometric. Does the same thing but with a different signature.

Parameters
  • edge_index (torch.Tensor) – the usual PyG matrix of edge indices

  • edge_attr (torch.Tensor) – the usual PyG matrix of edge attributes

  • mask (torch.Tensor) – boolean tensor with edges to filter

Returns

a tuple (filtered edge index, filtered edge attr or None if edge_attr is None)
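The filtering logic can be sketched with plain lists in place of tensors. filter_edges is a hypothetical stand-in, and it assumes that a True entry in mask means the edge is kept.

```python
def filter_edges(edge_index, edge_attr, mask):
    """Keep the edges (and their attributes) whose mask entry is True."""
    rows, cols = edge_index
    kept = [i for i, keep in enumerate(mask) if keep]
    filtered_index = [[rows[i] for i in kept], [cols[i] for i in kept]]
    filtered_attr = None if edge_attr is None else [edge_attr[i] for i in kept]
    return filtered_index, filtered_attr


edge_index = [[0, 1, 2], [1, 2, 0]]
edge_attr = [[0.1], [0.2], [0.3]]
mask = [True, False, True]

index_kept, attr_kept = filter_edges(edge_index, edge_attr, mask)
index_only, no_attr = filter_edges(edge_index, None, mask)
print(index_kept, attr_kept, no_attr)
```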

pydgn.data.util.get_or_create_dir(path: str) str

Creates the directories along the specified path if they are missing, then returns the path string.

Parameters

path (str) – the path

Returns

the same path as the given argument
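Equivalent behavior can be sketched with the standard library. ensure_dir is a hypothetical name, and the example runs inside a temporary directory so it leaves nothing behind.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents) and hand the path back."""
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "DATA", "splits")
    returned = ensure_dir(target)
    created = os.path.isdir(target)
    # Calling it again on an existing directory is harmless.
    returned_again = ensure_dir(target)

print(created, returned == target)
```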

pydgn.data.util.load_dataset(data_root: str, dataset_name: str, dataset_class: Callable[[...], pydgn.data.dataset.DatasetInterface], **kwargs: dict) pydgn.data.dataset.DatasetInterface

Loads the dataset using the dataset_kwargs.pt file created when parsing the data config file.

Parameters
  • data_root (str) – path of the folder that contains the dataset folder

  • dataset_name (str) – name of the dataset (same as the name of the dataset folder that has been already created)

  • dataset_class (Callable[…, DatasetInterface]) – the class of the dataset to instantiate with the parameters stored in the dataset_kwargs.pt file.

  • kwargs (dict) – additional arguments to be passed to the dataset (potentially provided by a DataProvider)

Returns

a DatasetInterface object

pydgn.data.util.preprocess_data(options: dict)

One of the main functions of the PyDGN library. Used to create the dataset and its associated files that ensure the correct functioning of the data loading steps.

Parameters

options (dict) – a dictionary of dataset/splitter arguments as defined in the data configuration file used.

pydgn.data.util.to_lower_triangular(edge_index: torch.Tensor)

Transforms a PyTorch Geometric undirected edge index into its “lower triangular counterpart”.
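A minimal sketch of the idea on plain lists: an undirected graph stores every edge in both directions, and keeping only the entries below the diagonal of the adjacency matrix leaves one copy of each. lower_triangular is a hypothetical helper, and the strict comparison (which also drops self-loops) is an assumption.

```python
def lower_triangular(edge_index):
    """Keep only the edges (row, col) with row > col."""
    rows, cols = edge_index
    kept = [(r, c) for r, c in zip(rows, cols) if r > c]
    return [[r for r, _ in kept], [c for _, c in kept]]


# Undirected edges 0-1 and 1-2, stored in both directions.
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
lower = lower_triangular(edge_index)
print(lower)
```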