CS224W - Colab 2

In Colab 2, we will work to construct our own graph neural network using PyTorch Geometric (PyG) and then apply that model on two Open Graph Benchmark (OGB) datasets. These two datasets will be used to benchmark your model’s performance on two different graph-based tasks: 1) node property prediction, predicting properties of single nodes and 2) graph property prediction, predicting properties of entire graphs or subgraphs.

First, we will learn how PyTorch Geometric stores graphs as PyTorch tensors.

Then, we will load and inspect one of the Open Graph Benchmark (OGB) datasets by using the ogb package. OGB is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. The ogb package not only provides data loaders for each dataset but also model evaluators.

Lastly, we will build our own graph neural network using PyTorch Geometric. We will then train and evaluate our model on the OGB node property prediction and graph property prediction tasks.

Note: Make sure to run all the cells in each section sequentially, so that intermediate variables and packages carry over to the next cell.

We recommend you save a copy of this colab in your drive so you don’t lose progress!

The expected time to finish this Colab is 2 hours. However, debugging training loops can easily take a while. So, don’t worry at all if it takes you longer! Have fun and good luck on Colab 2 :)

Device

You might need to use a GPU for this Colab to run quickly.

Please click Runtime and then Change runtime type. Then set the hardware accelerator to GPU.

Setup

As discussed in Colab 0, installing PyG on Colab can be a little tricky. First, let us check which version of PyTorch you are running.


import torch
import os
print("PyTorch has version {}".format(torch.__version__))

PyTorch has version 2.4.1

Download the necessary packages for PyG. Make sure that your version of torch matches the output from the cell above. In case of any issues, more information can be found on PyG’s installation page.

# Install torch geometric
if 'IS_GRADESCOPE_ENV' not in os.environ:
    torch_version = str(torch.__version__)
    scatter_src = f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"
    sparse_src = f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"
    !pip install torch-scatter -f $scatter_src
    !pip install torch-sparse -f $sparse_src
    !pip install torch-geometric
    !pip install ogb

1) PyTorch Geometric (Datasets and Data)

PyTorch Geometric has two modules for storing and/or transforming graphs into tensor format. One is torch_geometric.datasets, which contains a variety of common graph datasets. The other is torch_geometric.data, which handles graph data as PyTorch tensors.

In this section, we will learn how to use torch_geometric.datasets and torch_geometric.data together.

PyG Datasets

The torch_geometric.datasets module has many common graph datasets. Here we will explore its usage through one example dataset.

from torch_geometric.datasets import TUDataset

if 'IS_GRADESCOPE_ENV' not in os.environ:
    root = './enzymes'
    name = 'ENZYMES'

    # The ENZYMES dataset
    pyg_dataset= TUDataset(root, name)

    # You will find that there are 600 graphs in this dataset
    print(pyg_dataset)

ENZYMES(600)

Question 1: What is the number of classes and number of features in the ENZYMES dataset? (5 points)

def get_num_classes(pyg_dataset):
    '''Return the number of classes in the dataset.'''
    return pyg_dataset.num_classes

def get_num_features(pyg_dataset):
    '''Return the number of features in the dataset.'''
    return pyg_dataset.num_features

if 'IS_GRADESCOPE_ENV' not in os.environ:
    num_classes = get_num_classes(pyg_dataset)
    num_features = get_num_features(pyg_dataset)
    print("{} dataset has {} classes".format(name, num_classes))
    print("{} dataset has {} features".format(name, num_features))

ENZYMES dataset has 6 classes
ENZYMES dataset has 3 features

PyG Data

Each PyG dataset stores a list of torch_geometric.data.Data objects, where each torch_geometric.data.Data object represents a graph. We can easily get the Data object by indexing into the dataset.

For more information such as what is stored in the Data object, please refer to the documentation.

Question 2: What is the label of the graph with index 100 in the ENZYMES dataset? (5 points)

def get_graph_class(pyg_dataset, idx):
    '''Return the class label of the graph at index idx.'''
    label = pyg_dataset[idx].y.item()
    return label

# Here pyg_dataset is a dataset for graph classification
if 'IS_GRADESCOPE_ENV' not in os.environ:
    graph_0 = pyg_dataset[0]
    print(graph_0)
    idx = 100
    label = get_graph_class(pyg_dataset, idx)
    print('Graph with index {} has label {}'.format(idx, label))

Data(edge_index=[2, 168], x=[37, 3], y=[1])
Graph with index 100 has label 4

Question 3: How many edges does the graph with index 200 have? (5 points)

def get_graph_num_edges(pyg_dataset, idx):
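    '''Return the number of undirected edges in the graph at index idx.'''
    edge_index = pyg_dataset[idx].edge_index
    num_edges = edge_index.size(1) // 2   # undirected: each edge is stored in both directions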
    return num_edges

if 'IS_GRADESCOPE_ENV' not in os.environ:
    idx = 200
    num_edges = get_graph_num_edges(pyg_dataset, idx)
    print('Graph with index {} has {} edges'.format(idx, num_edges))

Graph with index 200 has 53 edges
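
The division by 2 in get_graph_num_edges relies on each undirected edge being stored as two directed entries in edge_index. Below is a minimal plain-Python sketch of that convention. The triangle graph is a made-up toy example, not taken from ENZYMES, and lists stand in for the PyTorch tensor.

```python
# Hypothetical toy graph: an undirected triangle 0-1-2.
# edge_index has shape [2, num_directed_edges]:
# row 0 holds source nodes, row 1 holds target nodes.
edge_index = [
    [0, 1, 1, 2, 2, 0],
    [1, 0, 2, 1, 0, 2],
]

num_directed = len(edge_index[0])

# Count each undirected edge once by sorting every (u, v) pair.
undirected = {tuple(sorted(pair)) for pair in zip(*edge_index)}

print(num_directed)     # 6
print(len(undirected))  # 3
```

Halving the column count only matches the set-based count when every edge appears in both directions and there are no self-loops, which is the assumption the answer above makes.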

2) Open Graph Benchmark (OGB)

The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. Its datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can then be evaluated by using the OGB Evaluator in a unified manner.

Dataset and Data

OGB also supports PyG dataset and data classes. Here we take a look at the ogbn-arxiv dataset.

import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

if 'IS_GRADESCOPE_ENV' not in os.environ:
    dataset_name = 'ogbn-arxiv'
    # Load the dataset and transform it to sparse tensor
    dataset = PygNodePropPredDataset(name=dataset_name, transform=T.ToSparseTensor())
    print('The {} dataset has {} graph'.format(dataset_name, len(dataset)))

    # Extract the graph
    data = dataset[0]
    print(data)

The ogbn-arxiv dataset has 1 graph
Data(num_nodes=169343, x=[169343, 128], node_year=[169343, 1], y=[169343, 1], adj_t=[169343, 169343, nnz=1166243])



Question 4: How many features are in the ogbn-arxiv graph? (5 points)

def graph_num_features(data):
    '''Return the number of node features in the graph.'''
    return data.num_features

if 'IS_GRADESCOPE_ENV' not in os.environ:
    num_features = graph_num_features(data)
    print('The graph has {} features'.format(num_features))

The graph has 128 features

3) GNN: Node Property Prediction

In this section we will build our first graph neural network using PyTorch Geometric. Then we will apply it to the task of node property prediction (node classification).

Specifically, we will use GCN as the foundation for your graph neural network (Kipf et al. (2017)). To do so, we will work with PyG’s built-in GCNConv layer.
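
To build some intuition before using GCNConv, here is a rough plain-Python sketch of the propagation step from Kipf and Welling's GCN: add self-loops, then scale each message from node j to node i by 1 / sqrt(deg_i * deg_j). The toy path graph and features are made up for illustration, and the learnable weight matrix and bias are left out, so this is not PyG's implementation.

```python
import math

# Hypothetical toy graph: path 0 - 1 - 2 with one scalar feature per node.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
x = [1.0, 2.0, 3.0]

# Add self-loops, then compute degrees of the augmented graph.
aug = {i: nbrs + [i] for i, nbrs in neighbors.items()}
deg = {i: len(nbrs) for i, nbrs in aug.items()}

# Symmetric normalization: the message j -> i is scaled by 1 / sqrt(deg_i * deg_j).
out = [
    sum(x[j] / math.sqrt(deg[i] * deg[j]) for j in aug[i])
    for i in sorted(aug)
]

print([round(v, 4) for v in out])
```

In the real layer, the features are also multiplied by a learned weight matrix, which is what lets GCNConv change the feature dimension from input_dim to hidden_dim.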

Setup

import torch
import pandas as pd
import torch.nn.functional as F
print(torch.__version__)

# The PyG built-in GCNConv
from torch_geometric.nn import GCNConv

import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

2.4.1

Load and Preprocess the Dataset

if 'IS_GRADESCOPE_ENV' not in os.environ:
    dataset_name = 'ogbn-arxiv'
    dataset = PygNodePropPredDataset(name=dataset_name, transform=T.ToSparseTensor())
    data = dataset[0]

    # Make the adjacency matrix symmetric
    data.adj_t = data.adj_t.to_symmetric()

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # If you use GPU, the device should be cuda
    print('Device: {}'.format(device))

    data = data.to(device)
    split_idx = dataset.get_idx_split()
    train_idx = split_idx['train'].to(device)

Device: cpu

GCN Model

Now we will implement our GCN model!

Please follow the figure below to implement the forward function.

[Figure: GCN model architecture]

References:

https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv

https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html

https://pytorch.org/docs/stable/nn.functional.html

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch.nn import BatchNorm1d, LogSoftmax, ModuleList


class GCN(torch.nn.Module):
    '''GCN model.'''
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout, return_embeds=False):
        '''Initialize the layers.'''
        super(GCN, self).__init__()

        # A list of GCNConv layers
        self.convs = ModuleList()
        self.convs.append(GCNConv(input_dim, hidden_dim))
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_dim, hidden_dim))
        self.convs.append(GCNConv(hidden_dim, output_dim))

        # A list of 1D batch normalization layers
        self.bns = ModuleList([BatchNorm1d(hidden_dim) for _ in range(num_layers - 1)])

        # The log softmax layer
        self.softmax = LogSoftmax(dim=1)

        # Probability of an element getting zeroed
        self.dropout = dropout

        # Skip classification layer and return node embeddings
        self.return_embeds = return_embeds

    def reset_parameters(self):
        '''Reset the parameters of all layers.'''
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, adj_t):
        '''Forward pass.'''
        # Forward pass through the network
        for i in range(len(self.convs) - 1):
            x = self.convs[i](x, adj_t)
            x = self.bns[i](x)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        
        # Last convolutional layer (no batch normalization, ReLU, or dropout)
        x = self.convs[-1](x, adj_t)
        
        # If return_embeds is True, skip the softmax and return embeddings
        if self.return_embeds:
            return x
        
        # Otherwise, apply log softmax for classification
        out = self.softmax(x)
        return out

def train(model, data, train_idx, optimizer, loss_fn):
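    '''Run one training step and return the loss.'''
    model.train()
    # Clear accumulated gradients
    optimizer.zero_grad()
    
    # Forward pass over the full graph
    out = model(data.x, data.adj_t)
    
    # Compute the loss on the training nodes only
    train_output = out[train_idx]
    train_label = data.y[train_idx]
    loss = loss_fn(train_output, train_label.squeeze())

    # Backpropagate and update the model parameters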
    loss.backward()
    optimizer.step()

    return loss.item()
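
Because the model ends in a log softmax layer, the matching loss is the negative log-likelihood. A minimal plain-Python sketch of that pairing follows. The logits and labels are made up, and the helper names log_softmax and nll_loss are local stand-ins for the PyTorch functions, not the library code.

```python
import math

def log_softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def nll_loss(log_probs, targets):
    # Mean negative log-probability assigned to the true class.
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

# Hypothetical logits for 2 nodes and 3 classes, with made-up labels.
logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 3.0]]
labels = [0, 2]

log_probs = [log_softmax(row) for row in logits]
loss = nll_loss(log_probs, labels)
print(round(loss, 4))
```

Applying log softmax and then the negative log-likelihood gives the same value as cross-entropy on the raw logits, which is why the model outputs log-probabilities rather than probabilities.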

# Test function here
@torch.no_grad()
def test(model, data, split_idx, evaluator, save_model_results=False):
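    '''Evaluate the model on the train, validation, and test splits.'''
    # Switch to evaluation mode
    model.eval()

    # Get the model output for all nodes
    out = model(data.x, data.adj_t)
    
    # Predict the class with the highest log-probability
    y_pred = out.argmax(dim=-1, keepdim=True)

    # Accuracy on each split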
    train_acc = evaluator.eval({
        'y_true': data.y[split_idx['train']],
        'y_pred': y_pred[split_idx['train']],
    })['acc']
    valid_acc = evaluator.eval({
        'y_true': data.y[split_idx['valid']],
        'y_pred': y_pred[split_idx['valid']],
    })['acc']
    test_acc = evaluator.eval({
        'y_true': data.y[split_idx['test']],
        'y_pred': y_pred[split_idx['test']],
    })['acc']

    # Save the predictions
    if save_model_results:
        print ("Saving Model Predictions")
        
        data = {}
        data['y_pred'] = y_pred.view(-1).cpu().detach().numpy()
        
        df = pd.DataFrame(data=data)
        df.to_csv('ogbn-arxiv_node.csv', sep=',', index=False)

    # Return the train, validation, and test accuracies
    return train_acc, valid_acc, test_acc

# Please do not change the args
if 'IS_GRADESCOPE_ENV' not in os.environ:
    args = {
      'device': device,
      'num_layers': 3,
      'hidden_dim': 256,
      'dropout': 0.5,
      'lr': 0.01,
      'epochs': 100,
    }
    args

if 'IS_GRADESCOPE_ENV' not in os.environ:
    model = GCN(data.num_features, args['hidden_dim'],
              dataset.num_classes, args['num_layers'],
              args['dropout']).to(device)
    evaluator = Evaluator(name='ogbn-arxiv')

# Please do not change these args
# Training should take <10min using GPU runtime
import copy

if 'IS_GRADESCOPE_ENV' not in os.environ:
    # reset the parameters to initial random value
    model.reset_parameters()

    optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
    loss_fn = F.nll_loss   # negative log-likelihood loss for multi-class classification

    best_model = None
    best_valid_acc = 0

    for epoch in range(1, 1 + args["epochs"]):
        loss = train(model, data, train_idx, optimizer, loss_fn)
        result = test(model, data, split_idx, evaluator)
        train_acc, valid_acc, test_acc = result
        if valid_acc > best_valid_acc:
            best_valid_acc = valid_acc
            best_model = copy.deepcopy(model)
        print(f'Epoch: {epoch:02d}, '
              f'Loss: {loss:.4f}, '
              f'Train: {100 * train_acc:.2f}%, '
              f'Valid: {100 * valid_acc:.2f}% '
              f'Test: {100 * test_acc:.2f}%')

Epoch: 01, Loss: 4.2264, Train: 19.30%, Valid: 25.86% Test: 23.30%
Epoch: 02, Loss: 2.3266, Train: 22.93%, Valid: 22.20% Test: 27.58%
Epoch: 03, Loss: 1.9266, Train: 29.55%, Valid: 26.12% Test: 31.58%
Epoch: 04, Loss: 1.7227, Train: 32.78%, Valid: 33.06% Test: 35.57%
Epoch: 05, Loss: 1.6185, Train: 35.70%, Valid: 32.67% Test: 33.84%
Epoch: 06, Loss: 1.5277, Train: 39.24%, Valid: 35.68% Test: 35.93%
Epoch: 07, Loss: 1.4655, Train: 43.58%, Valid: 43.85% Test: 45.12%
Epoch: 08, Loss: 1.4239, Train: 47.11%, Valid: 49.65% Test: 51.94%
Epoch: 09, Loss: 1.3798, Train: 47.55%, Valid: 49.59% Test: 52.15%
Epoch: 10, Loss: 1.3428, Train: 47.21%, Valid: 48.50% Test: 51.24%
...
Epoch: 95, Loss: 0.9132, Train: 73.62%, Valid: 71.05% Test: 69.49%
Epoch: 96, Loss: 0.9135, Train: 73.67%, Valid: 71.52% Test: 70.41%
Epoch: 97, Loss: 0.9091, Train: 73.74%, Valid: 71.47% Test: 69.89%
Epoch: 98, Loss: 0.9096, Train: 73.82%, Valid: 71.75% Test: 70.59%
Epoch: 99, Loss: 0.9087, Train: 73.90%, Valid: 71.91% Test: 71.12%
Epoch: 100, Loss: 0.9057, Train: 74.07%, Valid: 71.67% Test: 70.79%
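
The training loop keeps the best model with copy.deepcopy rather than a plain assignment. The sketch below shows why, using a dict as a hypothetical stand-in for a model and made-up validation scores: a deep copy is an independent snapshot, while a plain reference keeps changing as training continues.

```python
import copy

# Hypothetical stand-in for a model: a dict of mutable parameters.
model = {"w": [0.0]}
valid_scores = [0.25, 0.49, 0.43]  # made-up validation accuracies per epoch

best_valid, best_model, alias = 0.0, None, None
for epoch, score in enumerate(valid_scores, start=1):
    model["w"][0] = float(epoch)  # "training" mutates the parameters in place
    if score > best_valid:
        best_valid = score
        best_model = copy.deepcopy(model)  # independent snapshot
        alias = model                      # same object, keeps changing

print(best_model["w"])  # [2.0]
print(alias["w"])       # [3.0]
```

The snapshot still holds the parameters from the best epoch, while the alias reflects the last epoch.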

Question 5: What are your `best_model` validation and test accuracies? (20 points)

Run the cell below to see the results of your best model and save your model’s predictions to a file named ogbn-arxiv_node.csv. You can view this file by clicking on the Folder icon on the left side panel. Report the results on Gradescope.

if 'IS_GRADESCOPE_ENV' not in os.environ:
    best_result = test(best_model, data, split_idx, evaluator, save_model_results=True)
    train_acc, valid_acc, test_acc = best_result
    print(f'Best model: '
        f'Train: {100 * train_acc:.2f}%, '
        f'Valid: {100 * valid_acc:.2f}% '
        f'Test: {100 * test_acc:.2f}%')

Saving Model Predictions
Best model: Train: 73.90%, Valid: 71.91% Test: 71.12%

4) GNN: Graph Property Prediction

In this section we will create a graph neural network for graph property prediction (graph classification).

Load and preprocess the dataset

from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
from torch_geometric.loader import DataLoader
from tqdm.notebook import tqdm

if 'IS_GRADESCOPE_ENV' not in os.environ:
    # Load the dataset
    dataset = PygGraphPropPredDataset(name='ogbg-molhiv')

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print('Device: {}'.format(device))

    split_idx = dataset.get_idx_split()

    # Check task type
    print('Task type: {}'.format(dataset.task_type))

Device: cpu
Task type: binary classification

# Load the dataset splits into corresponding dataloaders
# We will train the graph classification task on a batch of 32 graphs
# Shuffle the order of graphs for training set
if 'IS_GRADESCOPE_ENV' not in os.environ:
    train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, num_workers=0)
    valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, num_workers=0)
    test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, num_workers=0)

if 'IS_GRADESCOPE_ENV' not in os.environ:
    # Please do not change the args
    args = {
      'device': device,
      'num_layers': 5,
      'hidden_dim': 256,
      'dropout': 0.5,
      'lr': 0.001,
      'epochs': 30,
    }
    args

Graph Prediction Model

Graph Mini-Batching

Before diving into the actual model, we introduce the concept of mini-batching with graphs. In order to parallelize the processing of a mini-batch of graphs, PyG combines the graphs into a single disconnected graph data object (torch_geometric.data.Batch). torch_geometric.data.Batch inherits from torch_geometric.data.Data (introduced earlier) and contains an additional attribute called batch.

The batch attribute is a vector mapping each node to the index of its corresponding graph within the mini-batch:

batch = [0, ..., 0, 1, ..., n - 2, n - 1, ..., n - 1]

This attribute is crucial for associating which graph each node belongs to and can be used to e.g. average the node embeddings for each graph individually to compute graph level embeddings.
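
To make the idea concrete, here is a plain-Python sketch of how two small graphs could be merged into one disconnected graph and how the batch vector then drives per-graph mean pooling. The graphs and features are made up, and the lists are stand-ins for the tensors PyG would use. This sketches the concept only, not PyG's collate code.

```python
# Two hypothetical toy graphs with one scalar feature per node.
graphs = [
    {"x": [1.0, 3.0], "edges": [(0, 1), (1, 0)]},                        # graph 0
    {"x": [2.0, 4.0, 6.0], "edges": [(0, 1), (1, 0), (1, 2), (2, 1)]},   # graph 1
]

x, edges, batch = [], [], []
offset = 0
for graph_id, g in enumerate(graphs):
    x.extend(g["x"])
    # Shift node indices so both graphs live in one disconnected graph.
    edges.extend((u + offset, v + offset) for u, v in g["edges"])
    # Record which graph each node came from.
    batch.extend([graph_id] * len(g["x"]))
    offset += len(g["x"])

# Mean pooling: average the node features of each graph using the batch vector.
sums = [0.0] * len(graphs)
counts = [0] * len(graphs)
for value, graph_id in zip(x, batch):
    sums[graph_id] += value
    counts[graph_id] += 1
graph_embeddings = [s / c for s, c in zip(sums, counts)]

print(batch)             # [0, 0, 1, 1, 1]
print(edges)             # [(0, 1), (1, 0), (2, 3), (3, 2), (3, 4), (4, 3)]
print(graph_embeddings)  # [2.0, 4.0]
```

Since no edge crosses between the two graphs, message passing on the merged graph gives the same node embeddings as running each graph separately.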

Implementation

Now, we have all of the tools to implement a GCN Graph Prediction model!

We will reuse the existing GCN model to generate node_embeddings and then use Global Pooling over the nodes to create graph-level embeddings that can be used to predict properties for each graph. Remember that the batch attribute will be essential for performing Global Pooling over our mini-batch of graphs.

References:

https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers

from ogb.graphproppred.mol_encoder import AtomEncoder
from torch_geometric.nn import global_add_pool, global_mean_pool

### GCN to predict graph property
class GCN_Graph(torch.nn.Module):
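    '''GCN model for graph property prediction.'''
    def __init__(self, hidden_dim, output_dim, num_layers, dropout):
        '''Initialize the encoder, GNN, pooling, and output layers.'''
        super(GCN_Graph, self).__init__()

        # Load the encoder for atoms in molecule graphs
        self.node_encoder = AtomEncoder(hidden_dim)

        # Node embedding model
        # Note that the input and output dimensions are both hidden_dim
        self.gnn_node = GCN(hidden_dim, hidden_dim, hidden_dim, num_layers, dropout, return_embeds=True)

        # Global pooling layer (mean pooling)
        self.pool = global_mean_pool

        # Output layer that predicts the property of each graph
        self.linear = torch.nn.Linear(hidden_dim, output_dim)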


    def reset_parameters(self):
        '''Reset the model parameters.'''
        self.gnn_node.reset_parameters()
        self.linear.reset_parameters()
        

    def forward(self, batched_data):
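        '''Forward pass: encode nodes, run the GCN, pool, and predict.'''
        # Extract the important attributes of the mini-batch
        x, edge_index, batch = batched_data.x, batched_data.edge_index, batched_data.batch
        
        # Encode the node features with the node encoder
        embed = self.node_encoder(x)
        
        # Generate node embeddings with the GCN model
        node_embeddings = self.gnn_node(embed, edge_index)
        
        # Aggregate node embeddings into graph embeddings with global mean pooling
        graph_embeddings = self.pool(node_embeddings, batch)
        
        # Predict the property of each graph with the linear layer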
        out = self.linear(graph_embeddings)

        return out

def train(model, device, data_loader, optimizer, loss_fn):
    '''Train for one epoch and return the average loss per batch.'''
    # Switch to training mode
    model.train()
    total_loss = 0

    # Iterate over the mini-batches from the data loader
    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
        batch = batch.to(device)
        
        # Skip invalid batches: a single node, or all nodes in one graph
        if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
            pass
        else:
            # 1. Clear the optimizer gradients
            optimizer.zero_grad()
            
            # 2. Forward pass
            out = model(batch)
            
            # 3. Compute the loss, using the is_labeled mask to ignore NaN labels
            is_labeled = batch.y == batch.y
            out = out[is_labeled]
            labels = batch.y[is_labeled].to(torch.float32)
            loss = loss_fn(out, labels)

            # 4. Backpropagate and update the parameters
            loss.backward()
            optimizer.step()
            
            # Accumulate the loss
            total_loss += loss.item()

    return total_loss / len(data_loader)
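
The mask batch.y == batch.y in the training function may look odd at first. It works because NaN is never equal to itself, so comparing a label with itself is True only for real labels. A plain-Python sketch with made-up labels and predictions:

```python
nan = float("nan")

# Hypothetical labels for one mini-batch; nan marks a missing label.
labels = [1.0, nan, 0.0, nan]
preds = [0.9, 0.4, 0.2, 0.7]

# nan == nan is False, so comparing a value with itself flags labeled entries.
is_labeled = [y == y for y in labels]

# Keep only the (prediction, label) pairs that have a real label.
kept = [(p, y) for p, y, keep in zip(preds, labels, is_labeled) if keep]

print(is_labeled)  # [True, False, True, False]
print(kept)        # [(0.9, 1.0), (0.2, 0.0)]
```

The tensor version in the training function does the same thing element-wise, so unlabeled entries contribute nothing to the loss.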

# The evaluation function
def eval(model, device, loader, evaluator, save_model_results=False, save_file=None):
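    '''Evaluate the model on a data loader and return the evaluator metrics.'''
    # Switch to evaluation mode
    model.eval()
    
    # Ground-truth labels and predictions
    y_true = []
    y_pred = []

    # Evaluate batch by batch
    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
        batch = batch.to(device)
        # Skip batches that contain only a single node
        if batch.x.shape[0] == 1:
            pass
        else:
            # Predict without tracking gradients
            with torch.no_grad():
                pred = model(batch)
            # Collect labels and predictions on the CPU
            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())
    
    # Concatenate across batches, convert to NumPy arrays, and store in a dict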
    y_true = torch.cat(y_true, dim = 0).numpy()
    y_pred = torch.cat(y_pred, dim = 0).numpy()
    input_dict = {"y_true": y_true, "y_pred": y_pred}

    # Save the predictions
    if save_model_results:
        print ("Saving Model Predictions")

        data = {}
        data['y_pred'] = y_pred.reshape(-1)
        data['y_true'] = y_true.reshape(-1)

        df = pd.DataFrame(data=data)
        df.to_csv('ogbg-molhiv_graph_' + save_file + '.csv', sep=',', index=False)

    # Call the evaluator and return its metrics
    return evaluator.eval(input_dict)

if 'IS_GRADESCOPE_ENV' not in os.environ:
    model = GCN_Graph(args['hidden_dim'],
                      dataset.num_tasks, args['num_layers'],
                      args['dropout']).to(device)
    evaluator = Evaluator(name='ogbg-molhiv')

# Please do not change these args
# Training should take <10min using GPU runtime
import copy

if 'IS_GRADESCOPE_ENV' not in os.environ:
    model.reset_parameters()

    optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
    loss_fn = torch.nn.BCEWithLogitsLoss()

    best_model = None
    best_valid_acc = 0

    for epoch in range(1, 1 + args["epochs"]):
        print('Training...')
        loss = train(model, device, train_loader, optimizer, loss_fn)

        print('Evaluating...')
        train_result = eval(model, device, train_loader, evaluator)
        val_result = eval(model, device, valid_loader, evaluator)
        test_result = eval(model, device, test_loader, evaluator)

        train_acc, valid_acc, test_acc = train_result[dataset.eval_metric], val_result[dataset.eval_metric], test_result[dataset.eval_metric]
        if valid_acc > best_valid_acc:
            best_valid_acc = valid_acc
            best_model = copy.deepcopy(model)
            
        print(f'Epoch: {epoch:02d}, '
              f'Loss: {loss:.4f}, '
              f'Train: {100 * train_acc:.2f}%, '
              f'Valid: {100 * valid_acc:.2f}% '
              f'Test: {100 * test_acc:.2f}%')

Training...
Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]
Evaluating...
Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]
Iteration:   0%|          | 0/129 [00:00<?, ?it/s]
Iteration:   0%|          | 0/129 [00:00<?, ?it/s]
Epoch: 01, Loss: 0.1584, Train: 68.88%, Valid: 62.36% Test: 62.30%

Training...
Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]
Evaluating...
Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]
Iteration:   0%|          | 0/129 [00:00<?, ?it/s]
Iteration:   0%|          | 0/129 [00:00<?, ?it/s]
Epoch: 02, Loss: 0.1501, Train: 72.18%, Valid: 74.67% Test: 70.18%

...
Epoch: 29, Loss: 0.1252, Train: 84.55%, Valid: 77.83% Test: 74.29%

Training...
Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]
Evaluating...
Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]
Iteration:   0%|          | 0/129 [00:00<?, ?it/s]
Iteration:   0%|          | 0/129 [00:00<?, ?it/s]
Epoch: 30, Loss: 0.1249, Train: 84.06%, Valid: 78.64% Test: 75.13%

Question 6: What are your `best_model` validation and test ROC-AUC scores? (20 points)

Run the cell below to see the results of your best model and save your model’s predictions in files named ogbg-molhiv_graph_[valid,test].csv. Again, you can view the files by clicking on the Folder icon on the left side panel. Report the results on Gradescope.

if 'IS_GRADESCOPE_ENV' not in os.environ:
    train_auroc = eval(best_model, device, train_loader, evaluator)[dataset.eval_metric]
    valid_auroc = eval(best_model, device, valid_loader, evaluator, save_model_results=True, save_file="valid")[dataset.eval_metric]
    test_auroc  = eval(best_model, device, test_loader, evaluator, save_model_results=True, save_file="test")[dataset.eval_metric]

    print(f'Best model: '
      f'Train: {100 * train_auroc:.2f}%, '
      f'Valid: {100 * valid_auroc:.2f}% '
      f'Test: {100 * test_auroc:.2f}%')

Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]
Iteration:   0%|          | 0/129 [00:00<?, ?it/s]
Saving Model Predictions
Iteration:   0%|          | 0/129 [00:00<?, ?it/s]
Saving Model Predictions
Best model: Train: 81.89%, Valid: 80.14% Test: 74.33%
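
For reference, ROC-AUC can be read as the probability that a randomly chosen positive graph gets a higher score than a randomly chosen negative one, with ties counted as one half. Below is a small plain-Python sketch of that pairwise definition. The labels and scores are made up, and roc_auc is a local illustrative helper, not the OGB Evaluator.

```python
def roc_auc(y_true, y_score):
    # Split scores by ground-truth class.
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    # Count positive-negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores for four graphs.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, y_score))  # 0.75
```

Only the ranking of the scores matters here, so the raw model outputs can be scored without applying a sigmoid first.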

Question 7 (Optional): Experiment with the two other global pooling layers in PyTorch Geometric.

Submission

To submit Colab 2, please submit to the following assignments on Gradescope:

“Colab 2”: submit your answers to the questions in this assignment
“Colab 2 Code”: submit your completed CS224W_Colab_2.ipynb. From the “File” menu select “Download .ipynb” to save a local copy of your completed Colab. PLEASE DO NOT CHANGE THE NAME! The autograder depends on the .ipynb file being called “CS224W_Colab_2.ipynb”.

Clarification:

In “Colab 2 Code”, we grade Q1-Q4 (non-training questions) using autograder.
In “Colab 2”, we grade Q5-Q6 (training questions), where Q1-Q4 are assigned 0 points.
