Instagram friend recommendation using Graph Mining

Introduction

We use many social networking apps in our day-to-day life, such as Facebook, Twitter, WhatsApp and Instagram, and if you are a CS graduate you have surely wondered how Facebook, Twitter and Instagram recommend friends to users. But mutual friends and Dijkstra's algorithm are not the only things these sites use to suggest users. In this blog we are going to do a case study on how Instagram recommends friends to users using graph mining.

Contents

  • Problem Definition
  • Dataset Overview
  • Mapping problem to Supervised Algorithms
  • Business Constraints and Objectives
  • Performance metric
  • Exploratory Data Analysis

— In-degree

— Out-degree

— Number of followers of a user

— Number of following of a user

— Weakly Connected Components

  • Posing a problem as classification problem
  • Training and Test data split
  • Featurization

Similarity Measures

Ranking Measures

Graph Features

Adar Index

Katz Centrality

HITS Score

Singular Value Decomposition

Weight Features

  • Modeling
  • Profile
  • References

Problem Statement

Given a directed social graph, we have to predict missing links in order to recommend users/followers/friends/connections (link prediction in a graph).

Data Overview

The data is taken from Facebook's recruiting challenge on Kaggle: https://www.kaggle.com/c/FacebookRecruiting

The data contains two columns, source and destination, one row per edge in the graph — Data columns (total 2 columns):

* source_node datatype — int64

* destination_node datatype — int64

  • It is an Instagram-type dataset rather than a Facebook one, because the given data is directed: an edge from vertex u to v does not imply an edge from vertex v to u. On Facebook, if user u is a friend of user v then user v is also a friend of user u, just like an undirected graph; on Instagram, if user u follows user v then user v may or may not follow user u back, hence the graph is directed.
  • The dataset has 1,862,220 users and 9,437,520 edges (connections).
  • We are not given any meta information such as place, city or age of the users. This makes it a purely graph-based problem, because we know nothing about the users themselves; if we knew, say, that two users share the same school or college, we would have approached the problem differently.

Mapping problem to Supervised Algorithms

  • Map this to a Binary Classification problem where 1 means the edge is present between two users and 0 implies an absence of an edge.
  • After mapping the problem to binary classification we have to do extensive feature engineering, as we are provided with only two columns. Some simple features could be the number of common vertices (mutual friends).

Business Constraints and Objectives

  • No low-latency requirement.
  • The probability of a prediction is useful, so that we can recommend the highest-probability links.

Performance Metric

  • Both precision and recall are important, so the F1 score is a good choice: we want to recommend friends to a user as precisely as possible, and we also do not want to miss out on any connection.
  • Confusion matrix

The detailed EDA code, with easy-to-understand comments, is available on my GitHub.

Exploratory Data Analysis

Let's start by importing the libraries.

Since the data is vast, we will create intermediate files so that everything runs smoothly on our system.

  • Let’s start by Making a Graph from the dataset
  • We will use networkx for the graph-related algorithms, as it has them all built in and we can load and use them wherever the need arises.
import pandas as pd

traincsv = pd.read_csv('MyDrive/Case/Facebook/data/train.csv')  # loading the data
traincsv.head()

Since the data is non-trivial, let's make a subgraph named "subgraph" so that we can use it whenever we need to understand a small piece of code, and later apply the same functionality to the whole graph, named "g".

Now we will extract the top 50 rows and store them in a sample so that we can make a subgraph and see how it looks.

The following subgraph contains 50 edges displayed in such a way that there is minimum crossing among the edges.
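A minimal sketch, assuming the same train.csv and column names as above, of how the full graph g and the 50-row subgraph could be built with networkx:

import networkx as nx
import pandas as pd

traincsv = pd.read_csv('MyDrive/Case/Facebook/data/train.csv')

# full directed graph: one edge per (source_node, destination_node) row
g = nx.from_pandas_edgelist(traincsv, source='source_node',
                            target='destination_node',
                            create_using=nx.DiGraph())

# small subgraph built from the first 50 rows, handy for sanity checks
sample = traincsv.head(50)
subgraph = nx.from_pandas_edgelist(sample, source='source_node',
                                   target='destination_node',
                                   create_using=nx.DiGraph())
print(subgraph.number_of_nodes(), subgraph.number_of_edges())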

Before moving on, let's understand two basic terms: the in-degree and out-degree of a vertex.

In-degree

The in-degree of a vertex U is the number of edges coming into the vertex U.

Out-degree —

Out-degree of a vertex U is the number of edges going out from the vertex U


Follower —

If a user U follows a user V then U is a follower of V.

  • Intuitively, the number of followers is nothing but the in-degree of a vertex.

Let's visualize how many followers each user has.

There is a steep curve after a very long steady line, which indicates that around 1.75M users have very few followers, but there are some users (celebrities) that have more than 500 followers.
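The percentile snippet below uses indegree_dist, which is not shown above; a minimal sketch of how it could be computed and plotted (matplotlib assumed):

import matplotlib.pyplot as plt

# sorted list of in-degrees (number of followers) for every user in g
indegree_dist = sorted([d for n, d in g.in_degree()])

plt.plot(indegree_dist)
plt.xlabel('Index of user (sorted by followers)')
plt.ylabel('Number of followers (in-degree)')
plt.show()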

### 99-100 percentile
for i in range(10, 110, 10):
    print(99 + (i/100), 'percentile value is', np.percentile(indegree_dist, 99 + (i/100)))
  • 99.9% of the users have fewer than 112 followers.

Followee —

If a user U follows a user V then V is a followee of U.

The number of followees is nothing but the out-degree of a node.

Even this curve becomes steep only after about 1.75M users, meaning most people do not follow many others, but there are some users who follow more than 500 users.

### 99-100 percentile
for i in range(10, 110, 10):
    print(99 + (i/100), 'percentile value is', np.percentile(outdegree_dist, 99 + (i/100)))
  • 99.9% of the people follow fewer than 123 users, which just shows that not many people like to have a lot of connections.

Let's do some basic analysis.

print('No of persons those are not following anyone are', sum(np.array(outdegree_dist) == 0),
      'and % is', sum(np.array(outdegree_dist) == 0) * 100 / len(outdegree_dist))
OUTPUT -- No of persons those are not following anyone are 274512 and % is 14.741115442858524

print('No of persons having zero followers are', sum(np.array(indegree_dist) == 0),
      'and % is', sum(np.array(indegree_dist) == 0) * 100 / len(indegree_dist))
OUTPUT -- No of persons having zero followers are 188043 and % is 10.097786512871734

You can do the same analysis on the number of connections = in-degree + out-degree.

Weakly connected component

If there is a path between vertices U and V once we ignore the direction of the edges (for example, a path from U to V but none from V to U), then the two vertices are weakly connected.

Now if a weakly connected component has size 2, it means those two users are connected only to each other, e.g. user U follows V but user V does not follow user U back.

print('No of weakly connected components', len(list(nx.weakly_connected_components(g))))

count = 0
for i in list(nx.weakly_connected_components(g)):
    if len(i) == 2:
        count += 1
print('weakly connected components with 2 nodes', count)

OUTPUT --
No of weakly connected components 45558
weakly connected components with 2 nodes 32195
  • So there are 32,195 user pairs who are weakly connected to each other.

Posing a problem as classification problem

  • In our dataset, only pairs that have an edge between them are present, so those can be labelled 1, but what about the edges that are not present?
  • Hence we have to generate some "bad" links: vertex pairs that are not edges in the graph and whose shortest path length is greater than 2, because a distance of 2 means user U and user V have a mutual friend.
  • We only take pairs whose shortest distance is greater than 2. The reason is that there is a very high chance that user U will follow user V if they have a mutual friend, whereas if the shortest distance is greater than 2 the probability of them following each other is quite low. A sketch of this sampling follows this list.
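A hedged sketch of one way to sample such bad links (the sample size and variable names are illustrative; g is the full directed graph built earlier):

import random
import networkx as nx

missing_edges = set()
nodes = list(g.nodes())
while len(missing_edges) < 10000:               # sample as many negatives as needed
    a, b = random.sample(nodes, 2)
    if g.has_edge(a, b) or (a, b) in missing_edges:
        continue
    try:
        # keep the pair only if the two users are more than 2 hops apart
        if nx.shortest_path_length(g, source=a, target=b) > 2:
            missing_edges.add((a, b))
    except nx.NetworkXNoPath:
        # no path at all: certainly not connected, so a valid negative sample
        missing_edges.add((a, b))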

Training and Test data split

Featurization

Similarity Measures

Ranking Measures

Graph Features

Adar Index

Katz Centrality

HITS Score

Singular Value Decomposition

Weight Features

  • Whenever we are given a graph-based problem, featurization is the most important task.
  • This is the most intuitive phase of any machine learning problem.
  • Remember, we cannot just apply machine learning models over a set of vertices; rather, we have to prepare our data in the right format to feed the models, or else we won't get the required performance. As they say, "garbage in, garbage out": if you give garbage as input to the model, it will definitely give garbage as output.

Hence the need for featurization. We will start with some basic featurization techniques like similarity measures and then increase the complexity little by little.

Similarity Measures

Jaccard Distance

It is a measure of similarity between two sets of data, ranging from 0% to 100%; the higher the percentage, the more similar the two sets. Hence, the larger the Jaccard index between two vertices, the higher the chance of a connection between them.

Jaccard Index = (the number in both sets) / (the number in either set)*100

Here the set can be the set of all followers of vertex U/V, or all followees of vertex U/V.
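A minimal sketch over follower sets, assuming train_graph is the directed networkx graph built from the training split (an analogous function over followee sets would use successors instead of predecessors):

def jaccard_for_followers(a, b):
    # followers of a node are its predecessors in the directed graph
    followers_a = set(train_graph.predecessors(a))
    followers_b = set(train_graph.predecessors(b))
    if len(followers_a) == 0 or len(followers_b) == 0:
        return 0
    return len(followers_a & followers_b) / len(followers_a | followers_b)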

Otsuka-Ochiai coefficient

The Otsuka-Ochiai coefficient is much like the Jaccard index, with a slight modification in the denominator, and since the formula is more or less the same, some properties also remain the same: the larger the Otsuka-Ochiai coefficient, the higher the chance of a connection between two vertices.

It is essentially the cosine similarity applied to sets: the numerator is the size of the intersection, and the denominator is the square root of the product of the two set sizes.
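A corresponding sketch for the Otsuka-Ochiai coefficient over the same follower sets; note the square root of the product of the set sizes in the denominator:

import math

def cosine_for_followers(a, b):
    followers_a = set(train_graph.predecessors(a))
    followers_b = set(train_graph.predecessors(b))
    if len(followers_a) == 0 or len(followers_b) == 0:
        return 0
    return len(followers_a & followers_b) / math.sqrt(len(followers_a) * len(followers_b))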

Ranking Measures

PageRank

PageRank computes a ranking of the nodes in the graph G based on the structure of the incoming links

  • PageRank is a way of measuring the importance of website pages.

According to Google:-

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

It is not the only algorithm used by Google to order search engine results, but it is the first algorithm that was used by the company, and it is the best-known.

https://en.wikipedia.org/wiki/PageRank
  • Page C has a higher PageRank than Page E, even though there are fewer links to C; the one link to C comes from an important page and hence is of high value
  • So it takes into account not just the number of links but also the importance of the pages those links come from.

But you might ask: how is the PageRank algorithm relevant to our case study?

Well, the answer is that PageRank can tell us about the relative importance of a user.

Let's consider a vertex U (say, Elon Musk).

  • If user U has a high PageRank score, then other significant users are linking to U, and vertex U must be a very important person, because other important users (people) like Cristiano Ronaldo, Mike Tyson and Mukesh Ambani are also following him.
  • Hence there is a very high possibility that Elon Musk is also followed by both you and me.

After computing PageRank we get the minimum, maximum and mean scores. The mean is very crucial here, as it can be used to impute a value for nodes that are not present in our training data.
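A minimal sketch, assuming train_graph as above, of computing PageRank with networkx and keeping the mean score around as an imputation value:

import networkx as nx

pr = nx.pagerank(train_graph, alpha=0.85)    # dict: node -> PageRank score

mean_pr = sum(pr.values()) / len(pr)
print('min', min(pr.values()))
print('max', max(pr.values()))
print('mean', mean_pr)

# mean_pr can later be used to impute the PageRank of nodes
# that never appeared in the training graph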

Graph Features

Shortest path

  • The shortest path serves as a type of similarity: if the path from vertex U to vertex V is short, there is a high probability that they know each other.
  • Here we will find the shortest path between the vertices. If the nodes are directly connected, i.e. vertex U has a direct edge to vertex V, we remove that edge first and then calculate the shortest path.
  • We remove the direct edge so that we can explore the next shortest route through which i and j can be connected. Keeping the direct edge between two connected nodes gives no insight about other important intermediate nodes; if it were not removed, every connected pair in the training data would have a shortest path of 1. A sketch of this is given right after this list.
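A hedged sketch of the feature described above (the default of -1 when no path exists is an illustrative choice):

import networkx as nx

def compute_shortest_path_length(a, b):
    p = -1                                   # default when no path exists
    had_edge = train_graph.has_edge(a, b)
    if had_edge:
        # remove the direct edge so it doesn't trivially give a length of 1
        train_graph.remove_edge(a, b)
    try:
        p = nx.shortest_path_length(train_graph, source=a, target=b)
    except nx.NetworkXNoPath:
        p = -1
    if had_edge:
        # put the original edge back
        train_graph.add_edge(a, b)
    return p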

Checking for same Community

Weakly connected components can be thought of as communities of people who share something among themselves, e.g. college friends or work friends. Thus two users who are in the same community have a very high chance of following each other.

  • Thus WCC helps us detect communities in the graph.

For an in-depth understanding of strongly and weakly connected components, check https://www.quora.com/What-are-strongly-and-weakly-connected-components
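A hedged sketch of a same-community feature: tag every node with an id for its weakly connected component and check whether two nodes share it (names are illustrative):

import networkx as nx

# map each node to the id of the weakly connected component it belongs to
wcc_id = {}
for idx, component in enumerate(nx.weakly_connected_components(train_graph)):
    for node in component:
        wcc_id[node] = idx

def belongs_to_same_wcc(a, b):
    # 1 if both nodes fall in the same weakly connected component, else 0
    return int(wcc_id.get(a, -1) == wcc_id.get(b, -2))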

Adar/Adamic Index

  • It is designed to predict links in a social network.

The Adamic/Adar measure is defined as the sum of the inverse logarithmic degrees of the common neighbours (any vertex connected directly to vertex x is a neighbour of x). The formula of the Adamic/Adar index for two vertices x and y is A(x, y) = sum over u in N(x) ∩ N(y) of 1 / log(|N(u)|), where N(v) denotes the set of neighbours of v.

  • First find the common neighbours and store them in a list, say n. Every vertex in n has its own neighbourhood; for any vertex u in n, if u has a very large neighbourhood then the chances that u is a celebrity are very high.
  • If x and y are following u, or are somehow connected to u, it is also possible that u follows either or both of them.
  • The chances that x and y will follow each other are then very low: if I am following Eminem and someone else is also following him, that does not mean I will follow Eminem's other followers.
  • However, if u's neighbourhood is small, the chances of x following y are much higher.
  • The definition is based on the concept that common elements with very large neighbourhoods are less significant when predicting a connection between two nodes compared with elements shared between a small number of nodes
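A hedged sketch of the Adamic/Adar index, treating a node's successors as its neighbourhood (predecessors, or the union of both, could be used instead):

import math

def calc_adar_index(a, b):
    # common neighbours of a and b
    common = set(train_graph.successors(a)) & set(train_graph.successors(b))
    score = 0
    for u in common:
        degree = len(set(train_graph.successors(u)))
        if degree > 1:
            # neighbours with huge neighbourhoods (celebrities) contribute little
            score += 1 / math.log(degree)
    return score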

Does the followee follow back?

  • If a person is following someone, there is a very high chance that the followed person will also follow them back.
def follows_back(a, b):
    if train_graph.has_edge(b, a):
        return 1
    return 0

Katz Centrality

  • It is used to measure the influence of the node.
  • The influence of a node depends on the influence of its neighbours, whose influence in turn depends on their neighbours; it is very similar to Google's PageRank algorithm.
  • Katz centrality computes the centrality for a node based on the centrality of its neighbors. It is a generalization of the eigenvector centrality

Output —

  • min 0.0007313532484065916
  • max 0.003394554981699122
  • mean 0.0007483800935562018
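A minimal sketch of computing Katz centrality with networkx; the alpha and beta values here are illustrative defaults, not necessarily the ones behind the numbers above:

import networkx as nx

katz = nx.katz_centrality(train_graph, alpha=0.005, beta=1)   # dict: node -> score

mean_katz = sum(katz.values()) / len(katz)
print('min', min(katz.values()))
print('max', max(katz.values()))
print('mean', mean_katz)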

HITS Score

HITS stands for Hyperlink-Induced Topic Search. It is a link analysis algorithm that rates web pages.

The HITS algorithm computes two numbers for a node: the authority score estimates the node's value based on its incoming links, and the hub score estimates its value based on its outgoing links. Thus a page with a large out-degree tends to be a hub.

The scheme therefore assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages

  • Thus, in our problem, an authority would be a person with many followers, which implies the person is a celebrity.

Output —

  • min 0.0
  • max 0.004868653378780953
  • mean 5.615699699344123e-07
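A minimal sketch of computing HITS with networkx; nx.hits returns one dictionary of hub scores and one of authority scores:

import networkx as nx

hubs, authorities = nx.hits(train_graph, max_iter=100)

print('min authority score', min(authorities.values()))
print('max authority score', max(authorities.values()))
print('mean authority score', sum(authorities.values()) / len(authorities))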

Singular Value Decomposition

Since we have already converted our data into a graph named g, let's now use some more of its properties.

An adjacency matrix is used to represent the graph in matrix form.

  • Let A be the adjacency matrix of our graph G. If the graph has n vertices, then A is an n*n matrix where each cell A(i,j) contains either 0 or 1: 1 means there is an edge from the ith to the jth vertex, and 0 means there is no such edge.
  • This is also known as a binary adjacency matrix, since only two values are possible.
  • Now, if we apply SVD to the matrix A, we get three matrices, say U, S and Vt (where U, S and V are all matrices, and t denotes that V is taken in transposed form).
  • Recall that our dataset has about 1.78M unique vertices, so the adjacency matrix A is a 1.78M * 1.78M binary sparse matrix. Using SVD, A is decomposed into three matrices with the following dimensions.
  • Let the number of components k = 6:

U: 1.78M * 6

S: 6 * 6

Vt: 6 * 1.78M

  • This means that for any vertex i in our graph:

the ith row of U (also called the left singular matrix) gives a 6-dimensional representation of vertex i,

and similarly the ith column of Vt (also called the right singular matrix) represents the ith vertex in 6 dimensions.

Thus we have two representations of vertex i, and we can use both of them as feature vectors for the ith vertex.

  • Remember, for given vertices a and b we have to predict whether there is an edge between them, so we store two 6-dimensional vectors for vertex a and similarly two 6-dimensional vectors for vertex b. Thus we get 24 features just from SVD.

Output —

  • Adjacency matrix Shape (1780722, 1780722)
  • U Shape (1780722, 6)
  • S Shape (6,)
  • V Shape (6, 1780722)
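A hedged sketch of how these shapes could be produced with scipy's truncated sparse SVD on the networkx adjacency matrix (k = 6 as above):

import networkx as nx
from scipy.sparse.linalg import svds

# sparse binary adjacency matrix of the training graph
A = nx.adjacency_matrix(train_graph).astype(float)

# truncated SVD with 6 components: A ~ U * diag(s) * Vt
U, s, Vt = svds(A, k=6)
print('Adjacency matrix shape', A.shape)
print('U shape', U.shape)     # (n_nodes, 6)
print('s shape', s.shape)     # (6,)
print('Vt shape', Vt.shape)   # (6, n_nodes)

# the ith row of U (and the ith column of Vt) gives the 6-dimensional
# representation of vertex i, usable directly as features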

Weight Features

Consider one million people following a celebrity on Instagram; chances are most of them have never met each other or the celebrity. On the other hand, if a user has fewer connections, say 50 or so, the chances are higher that many of them know each other.

Thus, in order to capture whether a user is a celebrity or not, we take the help of weight features, defined for any vertex in terms of X, the set of its incoming edges: the weight W of the vertex shrinks as the number of incoming edges grows. From the incoming and outgoing weights we derive:

  • weight of incoming edges
  • weight of outgoing edges
  • weight of incoming edges + weight of outgoing edges
  • weight of incoming edges * weight of outgoing edges
  • 2*weight of incoming edges + weight of outgoing edges
  • weight of incoming edges + 2*weight of outgoing edges
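A hedged sketch of these weight features, assuming the common choice W = 1 / sqrt(1 + |X|) for the incoming weight and the analogous formula over outgoing edges:

import math

def weight_in(node):
    # weight shrinks as the number of followers (in-degree) grows,
    # so celebrity nodes get small weights
    return 1.0 / math.sqrt(1 + len(set(train_graph.predecessors(node))))

def weight_out(node):
    return 1.0 / math.sqrt(1 + len(set(train_graph.successors(node))))

# the listed combinations are then simple arithmetic on the two values:
# w_in + w_out, w_in * w_out, 2*w_in + w_out, w_in + 2*w_out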

Modeling

I have used a Random Forest model and done hyperparameter tuning using RandomizedSearchCV. For the code, check my modeling file.
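A minimal sketch of that setup, assuming the engineered features are already in X_train and y_train (the parameter grid is illustrative, not the one actually tuned):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [50, 100, 200, 400],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
}

clf = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    scoring='f1',        # F1 was chosen as the performance metric above
    cv=3,
    random_state=42,
)
clf.fit(X_train, y_train)
print(clf.best_params_)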

Profile

Thank you so much for going through such a long blog post. A clap would be a nice way to appreciate my hard work; moreover, feel free to give valuable suggestions on how I can write better technical blogs and improve my writing skills. Here are some ways you can connect with me.

References

