Tutorial

Introduction

This tutorial will guide you through the basic usage of the MMSBM library, showing how to work with nodes, metadata, and the bipartite network structure.

The `nodes_layer` Class

The nodes_layer class represents one type of nodes that forms the bipartite network. It can represent people, researchers, papers, metabolites, movies… That depends on your dataset.

The best way to initialize a nodes_layer is from a pandas DataFrame:

import pandas as pd
import numpy as np
from numba import jit
import sys, os
import BiMMSBM as sbm
from BiMMSBM.functions.utils import save_MMSBM_parameters,add_codes,load_EM_parameters

# Dataframe to use
df_politicians = pd.DataFrame({
    "legislator": ["Pedro", "Santiago", "Alberto", "Yolanda"],
    "Party": ["PSOE", "VOX", "PP", "Sumar"],
    "Movies_preferences": ["Action|Drama", "Belic", "Belic|Comedy", "Comedy|Drama"]
})

# Number of groups
K = 9
# You have to tell in which the name of the nodes will be as the second parameter
politicians = sbm.nodes_layer(K, "legislator", df_politicians)

Once the object is initialized, you can access the dataframe from the df attribute, but now it will contain a new column with an integer id that the library will use in the future. The name of the column is the same as the column of the names, but finished in _id.

display(politicians.df)

legislator	Party	Movies_preferences	legislator_id
Pedro	PSOE	Action\|Drama	1
Santiago	VOX	Belic	2
Alberto	PP	Belic\|Comedy	0
Yolanda	Sumar	Comedy\|Drama	3

The assignment of the ids with the names of the nodes is in the dict_codes attribute and the inverse in the dict_decodes attribute. This ids represents the array position that corresponds to each node for the theta and omega matrices.

You can modify whenever you want the number of groups from the K attribute:

print(f"Number of groups of politicians: {politicians.K}")
politicians.K = 2
print(f"Number of groups of politicians: {politicians.K}")

Number of groups of politicians: 9
Number of groups of politicians: 2

Adding Metadata

When in your dataframe you have extra information about the nodes, you have to tell which columns are metadata and which type of metadata. There are two types of metadata:

Exclusive metadata: These are metadata where each node can only have assigned one attribute. For example the age of a person. A person only has one age, not more than one.
Inclusive metadata: These are metadata where each node can have assigned more than one attribute. For example the genre of a movie, one movie can belong to different genres at the same time.

Exclusive Metadata

Once the nodes_layer is initialized, you can add the metadata using the add_exclusive_metadata method that will return an exclusive_metadata class:

# Importance of the metadata
lambda_party = 100
parties = politicians.add_exclusive_metadata(lambda_party, "Party")

Also, this object will be stored inside the nodes_layer object in the meta_exclusives attribute that is a dictionary whose keys are the column names of the metadata and the value the object.

The value of lambda_party is how important the metadata will be while the inference procedure is running and it can be accessed from the lambda_val attribute:

print(f"Importance of political parties: {parties.lambda_val}")
parties.lambda_val = 2.3
print(f"Importance of political parties: {parties.lambda_val}")

Importance of political parties: 100
Importance of political parties: 2.3

When the metadata has been added to the nodes_layer object, its dataframe will add a new column with the ids of the metadata with the same column name but finished in _id.

display(politicians.df)

legislator	Party	Movies_preferences	legislator_id	Party_id
Pedro	PSOE	Action\|Drama	1	1
Santiago	VOX	Belic	2	3
Alberto	PP	Belic\|Comedy	0	0
Yolanda	Sumar	Comedy\|Drama	3	2

Similarly to the nodes_layer, you can access the metadata ids through the dict_codes attribute.

print(parties.dict_codes)

{'PSOE': 1, 'VOX': 3, 'PP': 0, 'Sumar': 2}

Inclusive Metadata

Once the nodes_layer is initialized, you can add the metadata using the add_inclusive_metadata method that will return an inclusive_metadata class:

# Importance of the metadata
lambda_movies = 0.3
# Number of groups of genres
Tau_movies = 6
movies = politicians.add_inclusive_metadata(lambda_movies, "Movies_preferences", Tau_movies)

Also, this object will be stored inside the nodes_layer object in the meta_inclusives attribute that is a dictionary whose keys are the column names of the metadata and the value the object.

The value of lambda_movies is how important the metadata will be while the inference procedure is running and it can be accessed from the lambda_val attribute:

print(f"Importance of politicians movies preferences: {movies.lambda_val}")
movies.lambda_val = 20
print(f"Importance of politicians movies preferences: {movies.lambda_val}")

Importance of politicians movies preferences: 0.3
Importance of politicians movies preferences: 20

The value of Tau_movies is the number of groups which the metadata will be grouped in the inference and it can be accessed from the Tau attribute:

print(f"Number of groups of politicians: {movies.Tau}")
movies.Tau = 3
print(f"Number of groups of politicians: {movies.Tau}")

Number of groups of politicians: 6
Number of groups of politicians: 3

When the metadata has been added to the nodes_layer object, its dataframe will add a new column with the ids of the metadata with the same column name but finished in _id.

display(politicians.df)

legislator	Party	Movies_preferences	legislator_id	Party_id	Movies_preferences_id
Pedro	PSOE	Action\|Drama	1	1	2\|3
Santiago	VOX	Belic	2	3	0
Alberto	PP	Belic\|Comedy	0	0	0\|1
Yolanda	Sumar	Comedy\|Drama	3	2	1\|3

Similarly to the nodes_layer, you can access the metadata ids through the dict_codes attribute.

Accessing Metadata Objects by Name

You can access the metadata_layer objects without using the meta_inclusive and meta_exclusives dictionaries:

politicians[str(movies)] == movies
politicians[str(parties)] == parties

BiNet Class

The BiNet class contains the information about a bipartite network. It contains information about:

Each of the layers that forms the bipartite network
The observed links.

BiNet Class Without Nodes Metadata

To declare a BiNet object you need, at least, a dataframe with three columns:

One with the source node
One with the target node
The label of the link

links_df = pd.DataFrame({
    "source": [0,0,0,1,1,1,2,2,2],
    "target": ["A","B","C","A","B","C","A","B","C"],
    "labels": ["positive","negative","positive","positive","negative","positive","negative","negative","positive"]
})
BiNet = sbm.BiNet(links_df, "labels", nodes_a_name="source", Ka=1, nodes_b_name="target", Kb=2)

Notice that you need to specify which columns represent nodes and which is the column of the labels. Also, because the class only distinguishes undirected networks, the columns assignments of nodes_a and nodes_b are irrelevant. Only the indexing of the matrices of the MMSBM parameters will be affected.

Once the object is initialized, you can access the dataframe from the df attribute, but now it will contain three new columns, one for each node type and another for the labels, with an integer id that the library will use in the future. The name of the column is the same as the column of the names, but finished in _id.

display(BiNet.df)

source	target	labels	labels_id	source_id	target_id
0	A	positive	1	0	0
0	B	negative	0	0	1
0	C	positive	1	0	2
1	A	positive	1	1	0
1	B	negative	0	1	1
1	C	positive	1	1	2
2	A	negative	0	2	0
2	B	negative	0	2	1
2	C	positive	1	2	2

Accessing the `nodes_layer` Objects

Two attributes that contain the information of the nodes are the nodes_a and nodes_b attributes, which are nodes_layer objects.

print(BiNet.nodes_a, type(BiNet.nodes_a))
print(BiNet.nodes_b, type(BiNet.nodes_b))

source <class 'BiMMSBM.nodes_layer'>
target <class 'BiMMSBM.nodes_layer'>

An easier way to access these objects is by using the name of the layer:

print(BiNet["source"] == BiNet.nodes_a)
print(BiNet["target"] == BiNet.nodes_b)

True
True

As before, you can access a dataframe with the df method. Also, it will contain an extra column with the ids.

display(BiNet["source"].df)

source	source_id
0	0
1	1
2	2

target	target_id
A	0
B	1
C	2

Using `nodes_layer` Objects to Initialize a `BiNet` Object

The previous example only has a link list with labels. Sometimes you want to infer using nodes’ metadata. The best way to do that is by using nodes_layer objects.

First, let’s create the nodes_layer objects:

# Dataframe to use
df_politicians = pd.DataFrame({
    "legislator": ["Pedro", "Santiago", "Alberto", "Yolanda"],
    "Party": ["PSOE", "VOX", "PP", "Sumar"],
    "Movies_preferences": ["Action|Drama", "Belic", "Belic|Comedy", "Comedy|Drama"]
})
# Number of groups
K = 2
politicians = sbm.nodes_layer(K, "legislator", df_politicians)
politicians.add_exclusive_metadata(1, "Party")
politicians.add_inclusive_metadata(1, "Movies_preferences", 1)

# Dataframe to use
df_bills = pd.DataFrame({
    "bill": ["A", "B", "C", "D"],
    "Year": [2020, 2020, 2021, 2022]
})
K = 2
bills = sbm.nodes_layer(K, "bill", df_bills)

Now we can create the BiNet object, but with the difference that instead of specifying the name of the nodes layer, you have to use as a parameter the nodes_layer object using the nodes_a and nodes_b parameters.

# Dataframe to use
df_votes = pd.DataFrame({
    "legislator": ["Pedro","Pedro","Pedro","Santiago","Santiago","Santiago",
                   "Alberto", "Alberto", "Alberto", "Yolanda", "Yolanda", "Yolanda"],
    "bill": ["A", "B", "D",  "A","C", "D",
             "A", "B", "C",  "B","C", "D",],
    "votes": ["Yes","No","No",  "No","Yes","Yes",
              "No","No","Yes",  "Yes","No","No"]
})
# Creating the BiNet object
votes = sbm.BiNet(df_votes, "votes", nodes_a=bills, nodes_b=politicians)

Notice that you do not need to specify the number of the groups of each nodes_layer because it is contained in the corresponding nodes_layer.

Important

The name of the columns of the layer in both DataFrames (from the nodes_layer object and for the BiNet object) must coincide. Else, a KeyError will arise.

It is not mandatory to use two nodes_layer to create the BiNet object when you need metadata from only one of the layers. Remember to specify the number of groups.

# Example using only one nodes_layer object
votes = sbm.BiNet(df_votes, "votes", nodes_a_name="bill", Ka=2, nodes_b=politicians)

If you display the dataframe of the BiNet and the nodes_layer objects, the nodes ids from both layers will coincide.

display(votes.df[["legislator","legislator_id","bill","bill_id"]])
display(votes["legislator"].df[["legislator","legislator_id"]])
display(votes["bill"].df[["bill","bill_id"]])

legislator	legislator_id	bill	bill_id
Pedro	1	A	0
Pedro	1	B	1
Pedro	1	D	3
Santiago	2	A	0
Santiago	2	C	2
Santiago	2	D	3
Alberto	0	A	0
Alberto	0	B	1
Alberto	0	C	2
Yolanda	3	B	1
Yolanda	3	C	2
Yolanda	3	D	3

legislator	legislator_id
Pedro	1
Santiago	2
Alberto	0
Yolanda	3

bill	bill_id
A	0
B	1
D	3
C	2

The Expectation Maximization (EM) algorithm

To start to infer the parameters of the MMSBM, you have to initialize the parameters. It can be easily done with the init_EM method.

votes.init_EM()

Once the EM has been initialized, the parameters will be stored in attributes. For the membership parameters, each nodes_layer will have a theta attribute that is a matrix.

votes["legislator"].theta

array([[0.39067672, 0.60932328],
       [0.51318295, 0.48681705],
       [0.23656348, 0.76343652],
       [0.8699203 , 0.1300797 ]])

votes["bill"].theta

array([[0.33855864, 0.66144136],
       [0.10264972, 0.89735028],
       [0.33213194, 0.66786806],
       [0.43570408, 0.56429592]])

The first index corresponds to the id of the node, the second correspond to the group number.

For the BiNet object, the probabilities matrix and the expectation parameters will be stored in the pkl and omega attributes respectivly.

votes.pkl

array([[[0.73640347, 0.26359653],
        [0.66204141, 0.33795859]],

       [[0.61438835, 0.38561165],
        [0.7342769 , 0.2657231 ]]])

The first and second index corresponds to the groups from nodes_a and nodes_b respectively. The third correspond to the label id.

votes.omega

array([[[[0.14143346, 0.19831325],
        [0.23053494, 0.42971834]],

        [[0.14403937, 0.17518567],
        [0.41166991, 0.26910505]],

        [[0.08461626, 0.24549825],
        [0.13792355, 0.53196193]],

        [[0.        , 0.        ],
        [0.        , 0.        ]]],


      [[[0.04293584, 0.06020319],
        [0.31314921, 0.58371176]],

       [[0.05742163, 0.04897093],
        [0.4188002 , 0.47480724]],

       [[0.        , 0.        ],
        [0.        , 0.        ]],

       [[0.06536891, 0.01253214],
        [0.83596087, 0.08613808]]],


     [[[0.10985576, 0.21967309],
        [0.32315681, 0.34731435]],

      [[0.        , 0.        ],
        [0.        , 0.        ]],

      [[0.0683947 , 0.28299025],
        [0.20119303, 0.44742201]],

      [[0.32134512, 0.04319874],
        [0.53911187, 0.09634426]]],


     [[[0.        , 0.        ],
        [0.        , 0.        ]],

      [[0.24047583, 0.2050852 ],
        [0.2598447 , 0.29459428]],

      [[0.08892354, 0.36793049],
        [0.16847772, 0.37466824]],

      [[0.41526893, 0.05582501],
        [0.44871632, 0.08018974]]]])

The first and second index corresponds to the nodes id from nodes_a and nodes_b respectively. The second and third index corresponds to the groups from nodes_a and nodes_b respectively.

Running the EM Algorithm and Checking Convergence

To run the EM algorithm, you have to use the EM_step method. It will make an iteration of the algorithm by default. You can specify the number of iterations with the N_steps parameter. To check the convergence, you can use the converges method.

N_itt = 100
N_check = 5  # Number of iterations to measure the convergence

for itt in range(N_itt//N_check):
    votes.EM_step(N_check)
    converges = votes.converges()
    print(f"Iteration {itt*N_check}: {converges}")
    if converges:
        break

Iteration 0: False
Iteration 5: False
Iteration 10: False
Iteration 15: False
Iteration 20: True

Using Training Sets and Test Sets

You can select a training set instead of using all the links to infer the parameters. You can do that using the training parameter when you initialize the EM algorithm.

This parameter can be a list of the links ids that you want to use as a training set, or another dataframe with more links. If not specified, all the links will be used.

from sklearn.model_selection import train_test_split

# Defining the training and test sets
df_train, df_test = train_test_split(votes.df, test_size=0.2)

# Initializing the EM algorithm with the training set
votes.init_EM(training=df_train)

# Running the EM algorithm
N_itt = 100
N_check = 5  # Number of iterations to measure the convergence
for itt in range(N_itt//N_check):
    votes.EM_step(N_check)
    converges = votes.converges()
    print(f"Iteration {itt*N_check}: converges? {converges}")
    if converges:
        break

Iteration 0: converges? False
Iteration 5: converges? False
Iteration 10: converges? False
Iteration 15: converges? False
Iteration 20: converges? False
Iteration 25: converges? False
Iteration 30: converges? False
Iteration 35: converges? False
Iteration 40: converges? False
Iteration 45: converges? True

Checking the Accuracy and Getting Predictions

Once the EM algorithm has converged, you can get the predictions using the get_predicted_labels method. You can specify which links you want to infer its labels with the links parameter. If no links are specified, it will use the links used for training the model.

votes.get_predicted_labels()
votes.get_predicted_labels(links=df_test)

Checking the Accuracy

You can check the accuracy of the predictions using the get_accuracy method. By default, it will compute the accuracy of the training set. You can specify the test set with the links parameter, by using a list of the links ids or another dataframe with other links.

# Accuracy of the training set
print(f"Accuracy of the training set: {votes.get_accuracy()}")
print(f"Accuracy of the test set: {votes.get_accuracy(links=df_test)}")

Accuracy of the training set: 0.8888888888888888
Accuracy of the test set: 0.0

Saving and Loading the Parameters

For long runs or for using the parameters later, you can save the parameters. It is very important to notice that it is also important to save the ids of the nodes and labels, and some information of the nodes_layer and BiNet objects before initializing the EM algorithm. To save the parameters you can use the save_nodes_layer and save_BiNet methods.

The `save_nodes_layer` Method

This method is useful when you only want to save the information of a nodes_layer object. One example can be when you want to do a 5-fold cross-validation, instead of saving the nodes information for each fold, you can save it once and load it later once for all the folds.

The name of the JSON will be layer_{nodes_layer.name}_data.json.

Saving the Parameters with `save_MMSBM_parameters` Function

To save the parameters of the EM procedure, you can use the save_MMSBM_parameters function:

from MMSBM_library.functions.utils import save_MMSBM_parameters

from sklearn.model_selection import train_test_split

try:
    os.mkdir("tutorial_saves")
    os.mkdir("tutorial_saves/example_BiNet")
    os.mkdir("tutorial_saves/example_parameters")
except:
    pass

# Defining the training and test sets
df_train, df_test = train_test_split(votes.df, test_size=0.2)

votes.save_BiNet("./tutorial_saves/example_BiNet/")

# Initializing the EM algorithm with the training set
votes.init_EM(training=df_train)

# Running the EM algorithm
N_itt = 100
N_check = 5  # Number of iterations to measure the convergence
for itt in range(N_itt//N_check):
    votes.EM_step(N_check)
    converges = votes.converges()
    print(f"Iteration {itt*N_check}: converges? {converges}")
    if converges:
        save_MMSBM_parameters(votes, "./tutorial_saves/example_parameters")
        break

Now different .npy files have been created inside example_parameters folder:

theta_a.npy and theta_b.npy contain the parameters of the nodes_layer objects that form the BiNet object.
pkl.npy contains the membership probabilities.
For each exclusive metadata it will generate: - qka_{meta_name}.npy with the membership probability for each metadata.
For each inclusive metadata it will generate: - q_k_tau_{meta_name}.npy with the membership probability for each metadata. - zeta_{meta_name}.npy with the membership factors for each metadata.

The `load_BiNet_from_json` and the `init_EM_from_directory` methods

Also, you can load your saved BiNet class using the load_BiNet_from_json class method:

loaded_votes = sbm.BiNet.load_BiNet_from_json("./tutorial_saves/example_BiNet/BiNet_data.json",
                                              links=df_votes, links_label="votes",
                                              nodes_a=bills, nodes_b=politicians)

If you want to load the parameters obtained from an EM procedure to continue the procedure or to analyze the parameters, you have to use the init_EM_from_directory method.

loaded_votes.init_EM_from_directory(dir="./tutorial_saves/example_parameters", training=df_train)

From here you can continue the EM procedure using the EM_step method:

loaded_votes.df
loaded_votes.EM_step(10)

Or analyze the parameters and/or links and/or accuracies:

loaded_votes.df

Plotting the Membership Matrices

You can visualize the membership matrices of the politicians and the votes using matplotlib:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Plot theta parameters for both nodes as heatmaps
im1 = ax1.imshow(loaded_votes.nodes_a.theta, cmap='viridis', aspect='auto')
im2 = ax2.imshow(loaded_votes.nodes_b.theta, cmap='viridis', aspect='auto')

# Add colorbars
plt.colorbar(im1, ax=ax1)
plt.colorbar(im2, ax=ax2)

# Set titles
ax1.set_title('Legislators Theta Parameters')
ax2.set_title('Bills Theta Parameters')

# Label axes
ax1.set_xlabel('Group')
ax2.set_xlabel('Group')

# Set y-tick labels to node IDs
ax1.set_yticks(range(len(politicians)))
ax1.set_yticklabels([politicians.dict_decodes[i] for i in range(len(politicians))])
ax2.set_yticks(range(len(bills)))
ax2.set_yticklabels([bills.dict_decodes[i] for i in range(len(bills))])

ax1.set_xticks(range(politicians.K))
ax2.set_xticks(range(bills.K))