
My Graph Notes #1: Learning the Basics Before the Hard Stuff

Graphs offer a powerful framework for mapping interconnectivity across various domains, including human relationships, routing, and bioinformatics. While Graph Neural Networks (GNNs) have recently set a new standard for processing complex relational data and powering recommendation systems, truly mastering them requires looking back. In this post, we’ll explore the “traditional” framework of graph modeling to understand how to view data through a relational lens and build robust machine learning solutions.


Graph

A graph $\mathcal{G}$ consists of a set of nodes $\mathcal{N} = \{n_0, n_1, n_2, \dots, n_{N-1}\}$, where each node represents an object or entity (for example, a person, an image patch, or a word token). Nodes may be connected to one another through edges, forming a set of edges $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$ that describes the relationships or interactions between nodes. A common way to encode this structure numerically is an adjacency matrix $A$:

$$
A = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
$$

In this example, the graph has $N = 3$ nodes: $\{n_0, n_1, n_2\}$. A value of $1$ in $A_{ij}$ means there is an edge between node $n_i$ and node $n_j$, while a value of $0$ means no edge exists. For instance, $A_{01} = 1$ indicates that node $n_0$ is connected to node $n_1$, and $A_{02} = 1$ indicates that $n_0$ is also connected to $n_2$. Since the matrix is symmetric ($A_{ij} = A_{ji}$) and all diagonal entries are zero, this graph is undirected and contains no self-loops.
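To make this concrete, here is a minimal NumPy sketch that encodes the example matrix and checks the two properties we just read off it (symmetry and an all-zero diagonal):

```python
import numpy as np

# Adjacency matrix of the three-node example graph.
A = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
])

# Undirected graph: A equals its transpose.
assert (A == A.T).all()

# No self-loops: every diagonal entry is zero.
assert (np.diag(A) == 0).all()

# Recover the edge list from the upper triangle (each edge listed once).
edges = [(i, j) for i in range(len(A)) for j in range(i + 1, len(A)) if A[i, j]]
print(edges)  # [(0, 1), (0, 2), (1, 2)]
```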

Thanks to the kind people on the internet, you can explore graph types and the meaning of “self-loop” here.

In other cases, each edge may have a weight $w(e)$ that represents the strength, distance, or importance of the connection. For instance, in a routing problem, a weight could represent the physical distance between two cities.

$$
W = \begin{bmatrix} 0 & 2.5 & 0.8 \\ 2.5 & 0 & 0 \\ 0.8 & 0 & 0 \end{bmatrix}
$$

In this weighted matrix $W$, the connection between $n_0$ and $n_1$ has a strength of $2.5$, while the connection between $n_0$ and $n_2$ is much weaker at $0.8$. Note that $W_{12} = 0$, meaning there is no direct edge between $n_1$ and $n_2$.
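The same NumPy pattern extends to the weighted case. A small sketch, assuming the convention from the text that a zero off-diagonal entry means “no edge”:

```python
import numpy as np

# Weighted adjacency matrix of the example graph.
W = np.array([
    [0.0, 2.5, 0.8],
    [2.5, 0.0, 0.0],
    [0.8, 0.0, 0.0],
])

# Recover the unweighted structure by thresholding: a zero
# off-diagonal entry means no edge exists.
A = (W > 0).astype(int)

print(W[0, 1])  # 2.5 -> strong n0-n1 connection
print(A[1, 2])  # 0   -> no direct edge between n1 and n2
```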

Tasks

Okay, now that we understand the simplest definition of graphs, we can identify the core tasks for machine learning on graphs. These tasks are defined by the element of the graph we wish to predict.

Node Classification

Node classification is the task of predicting a label for each node in a graph. Formally, given a graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ with adjacency matrix $A$, and a set of known labels $\{y_i\}$ for a subset of nodes $\mathcal{N}_L \subset \mathcal{N}$, the goal is to predict the labels for the remaining unlabeled nodes $\mathcal{N}_U = \mathcal{N} \setminus \mathcal{N}_L$.

A model learns a function $f$ so that:

$$
\hat{y}_j = f(A, X)_j \quad \text{for} \quad n_j \in \mathcal{N}_U
$$

Here, $X$ represents any initial node features.

Using the example graph with three nodes and adjacency matrix $A$, if we know the labels for nodes $n_0$ and $n_1$, we can predict the label for $n_2$. This prediction leverages the homophily principle: connected nodes often share similar properties.
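As an illustration of homophily rather than of any particular model, here is a minimal sketch in which $f$ degenerates to a majority vote over a node’s labeled neighbors (the label names are made up):

```python
import numpy as np
from collections import Counter

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

# Hypothetical labels for the labeled subset N_L = {n0, n1}.
known = {0: "red", 1: "red"}

def predict_by_neighbors(A, known, j):
    """Majority vote over the labeled neighbors of node j."""
    votes = [known[i] for i in np.nonzero(A[j])[0] if i in known]
    return Counter(votes).most_common(1)[0][0] if votes else None

print(predict_by_neighbors(A, known, 2))  # 'red'
```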

Edge Classification

Edge classification is the task of predicting a label or property for each edge in a graph. Formally, given a graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, we aim to assign a label $y_{ij}$ to each edge $(n_i, n_j) \in \mathcal{E}$. In many applications, we may also predict the existence of a missing edge.

A model for edge classification learns a function $g$ that maps pairs of node features and the graph structure to edge labels:

$$
\hat{y}_{ij} = g(A, X, n_i, n_j; \theta)
$$

where $\theta$ are the model parameters.

In the example graph with adjacency matrix $A$, we might predict the type of relationship between connected nodes. For instance, in a weighted adjacency matrix $W$, the weight $w(e)$ could be the target label, or we might classify edges as “strong” or “weak” based on their weights.
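A minimal sketch of that last idea, with a made-up threshold standing in for a learned $g$:

```python
import numpy as np

W = np.array([[0.0, 2.5, 0.8],
              [2.5, 0.0, 0.0],
              [0.8, 0.0, 0.0]])

THRESHOLD = 1.0  # hypothetical cut-off between "weak" and "strong"

def classify_edges(W, threshold=THRESHOLD):
    """Label each existing edge as 'strong' or 'weak' by its weight."""
    labels = {}
    n = len(W)
    for i in range(n):
        for j in range(i + 1, n):
            if W[i, j] > 0:  # an edge exists between n_i and n_j
                labels[(i, j)] = "strong" if W[i, j] >= threshold else "weak"
    return labels

print(classify_edges(W))  # {(0, 1): 'strong', (0, 2): 'weak'}
```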

Community Detection

Community detection, or graph clustering, aims to partition the node set $\mathcal{N}$ into $k$ disjoint subsets $\{C_1, C_2, \dots, C_k\}$ such that nodes within the same community are densely connected, while nodes in different communities are sparsely connected.

Given the adjacency matrix $A$, the goal is to find a mapping $h: \mathcal{N} \rightarrow \{1, 2, \dots, k\}$ that assigns each node to a community. This is often an unsupervised task, but can be guided by known labels or constraints.

In the example graph with three nodes and symmetric adjacency matrix $A$, note that all nodes are mutually connected. Therefore, the entire graph might be considered a single community. However, if we remove the edge between $n_1$ and $n_2$ (setting $A_{12} = A_{21} = 0$), we might partition the graph into two communities: $\{n_0, n_1\}$ and $\{n_2\}$, or $\{n_0, n_2\}$ and $\{n_1\}$, depending on the algorithm.
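Our three-node example is too small for a clustering objective to say much, so here is a minimal sketch on a graph with an unambiguous two-community structure, using NetworkX’s greedy modularity heuristic (one of many possible algorithms):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two triangles joined by a single edge: a graph with a clear
# two-community structure.
G = nx.barbell_graph(3, 0)

communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])  # e.g. [[0, 1, 2], [3, 4, 5]]
```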

Graph Classification

Graph classification is the task of predicting a label for an entire graph $\mathcal{G}$. This is common in applications where each graph represents a distinct entity, such as a molecule in chemistry or a document in natural language processing.

Given a set of graphs $\{\mathcal{G}_1, \mathcal{G}_2, \dots, \mathcal{G}_M\}$ and their corresponding labels $\{y_1, y_2, \dots, y_M\}$, the goal is to learn a function $p$ that maps a graph to a label:

$$
\hat{y}_m = p(\mathcal{G}_m; \phi)
$$

where $\phi$ are the model parameters.

For example, consider a set of graphs, each representing a molecule. The adjacency matrix might indicate bonds between atoms (nodes), and node features could include atom types. The task could be to classify each molecule as toxic or non-toxic.

In the context of our example, if we consider the three-node graph as one entity (e.g., a small molecule), we would assign a single label to the entire structure, regardless of the individual node labels.
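In the pre-GNN spirit of this post, one simple way to realize $p$ is to summarize each graph with a few handcrafted statistics and feed them to an off-the-shelf classifier. A toy sketch (the four graphs and their binary labels are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def graph_features(A):
    """Handcrafted graph-level descriptor: node count, edge count,
    mean degree, and max degree."""
    A = np.asarray(A)
    degrees = A.sum(axis=1)
    return [len(A), A.sum() / 2, degrees.mean(), degrees.max()]

# Toy dataset of adjacency matrices with made-up binary labels
# (think "toxic" vs. "non-toxic" molecules).
graphs = [
    np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]]),                         # triangle
    np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]),                         # 3-node path
    np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]),  # star
    np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]),  # 4-node path
]
labels = [1, 0, 1, 0]

X = np.array([graph_features(A) for A in graphs])
clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict(X))  # [1 0 1 0] on the training set
```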

Embeddings

Now, how do we turn the graph into a numerical representation that a model can learn from?

The answer is embeddings.

Unlike in modern GNNs, where embeddings are learned and adaptive, embeddings used to be handcrafted: carefully designed by combining node/edge features with domain-specific knowledge.

There are many different strategies for this, and I haven’t dived deep into them yet. While some strategies, like degree centrality and betweenness centrality, seem relatively easy to understand (I think?), others require a deeper mathematical background (in my opinion, or is it a skill issue? Maybe).
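As a small preview of those two easier strategies, here is a sketch using NetworkX on the example graph with the edge between $n_1$ and $n_2$ removed (so the centralities actually differ between nodes):

```python
import networkx as nx

# The example graph with the n1-n2 edge removed,
# so that n0 becomes a clear "hub".
G = nx.Graph([(0, 1), (0, 2)])

# Degree centrality: fraction of other nodes each node touches.
print(nx.degree_centrality(G))       # {0: 1.0, 1: 0.5, 2: 0.5}

# Betweenness centrality: n0 lies on the only n1-n2 shortest path.
print(nx.betweenness_centrality(G))  # {0: 1.0, 1: 0.0, 2: 0.0}
```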

I will be back for the next post in this graph series once I understand them better.

See you. I hope this personal note is helpful.

