Version: 2.2.1

Understanding music (with query modules)

This article is part of a series intended to show users how to use Memgraph on real-world data and, by doing so, retrieve some interesting and useful information.

We highly recommend checking out the other articles from this series which are listed in our tutorial overview section.

Introduction

Getting useful information from your data is always challenging. With Memgraph, you can manipulate and extract a huge amount of information by writing queries. Memgraph even offers a set of built-in graph algorithms. You can use those algorithms in your queries, extending your power over the data. But what if you wanted to do more?

At its core, Memgraph "simply" builds a graph from your data. Graphs were always interesting to scientists and engineers because of their interesting properties allowing you to represent a specific kind of data in an intuitive way which makes extracting useful information much easier. A field called graph theory emerged in mathematics, producing a great number of algorithms, metrics, and other functions that are applied to the graphs.

Memgraph allows you to use the underlying graph in your functions by using Query modules, and those functions are called procedures. In this tutorial, we'll see how easy it is to extend the capabilities of Memgraph using Python. It will also show you how to use one of the most popular Python libraries for graphs, called NetworkX, which contains an insane amount of functions and algorithms for the graphs.

Data model

Social graphs is a relatively recent term. Simply said, it's a representation of a social network. Social networks appear in various sites, e.g., Youtube, which is primarily a site for watching videos, can be classified as a social network. For this tutorial, we'll use data consisting of users of the music streaming platform called Deezer.

The data consists of around 50k Deezer users from Croatia, but we prepared a subset of the dataset only composed of 2k users. Each user is defined by id and a list of genres he loved. The edges represent the mutual friendship between the users. You can find a more detailed explanation of the dataset on the GitHub alongside many more similar datasets kindly provided by the same authors.

Importing the dataset

To import the dataset, download the Memgraph Lab desktop application and navigate to the Datasets tab in the sidebar. From there, choose the dataset Music genres social network and continue with the tutorial.

Defining a directory with query modules

We need to set up the directory from which Memgraph will search for custom query modules by changing the --query-modules-directory flag in the main configuration file(/etc/memgraph/memgraph.conf) or by supplying it as a command-line parameter using the directory containing our modules as the value. For a more detailed explanation take a look at Load and call query modules.

sudo -u memgraph /usr/lib/memgraph/memgraph --query-modules-directory=/modules

When using Memgraph installed from a DEB or RPM package, the currently running Memgraph server may need to be stopped. The user can do so using the following command:

systemctl stop memgraph

When using Docker, the query module directory can be mounted with the following command:

docker run -it -p 7687:7687 -p 7444:7444 -p 3000:3000 \
  -v $(pwd)/modules:/modules \
  -e MEMGRAPH="--query-modules-directory=/modules" \
  memgraph/memgraph-platform

Example queries and procedures

First, let's create a Python script in our modules directory which will contain our procedures. We'll name the script deezer_example.py.

After each change to the script, we need to tell Memgraph to reload all the modules. We can do that by calling the following command:

CALL mg.load_all();

Let's count how many times a specific genre occurs in our network!

We will define a procedure called genre_count:

import mgp


@mgp.read_proc
def genre_count(context: mgp.ProcCtx,
                genre: str) -> mgp.Record(genre=str, count=int):
    count = len(
        [v for v in context.graph.vertices if genre in v.properties['genres']])
    return mgp.Record(genre=genre, count=count)

Let's reload the modules and check the results:

CALL deezer_example.genre_count("Pop")
YIELD *;

We can notice multiple things:

import of the mgp module which contains helper functions and types for defining custom procedures
@mgp.read_proc decorator which marks the function as a procedure
every argument is annotated with a type
result is defined as an object of mgp.Record which also has annotated types of its members

This example is probably not that interesting to you because we can get the same result using the following query:

MATCH (n)
WITH n, "Pop" AS genre
WHERE genre IN n.genres
RETURN genre, count(n);

Let's try something more interesting. The genres are listed in the order the users have added them. If we assume that users picked the genres in order of preference, let's write a function that tells us in what percentage each genre appears in top n places.

from collections import defaultdict


@mgp.read_proc
def in_top_n_percentage(context: mgp.ProcCtx,
                        n: int) -> mgp.Record(genre=str,
                                              percentage=float,
                                              size=int):
    genre_count = defaultdict(lambda: {'total_count': 0, 'in_top_n_count': 0})

    for v in context.graph.vertices:
        for index, genre in enumerate(v.properties['genres']):
            genre_count[genre]['total_count'] += 1
            genre_count[genre]['in_top_n_count'] += index < n

    def get_record(genre, counts): return mgp.Record(
        genre=genre,
        percentage=counts['in_top_n_count'] / counts['total_count'],
        size=counts['total_count']
    )

    return [get_record(
        genre,
        counts) for genre,
        counts in genre_count.items()]

Let's see what we got:

CALL deezer_example.in_top_n_percentage(3)
YIELD *
WITH genre, percentage, size
WHERE size > 10
RETURN genre, percentage, size
ORDER BY percentage DESC;

As said in the introduction, we want to use the power of the graphs to extract some useful information from our data which would be otherwise impossible. Most of those functions are complex and writing them from scratch would be tedious. As every modern programmer, we'll just use a package that has everything we need and more. To be precise, we'll be using NetworkX that has a huge amount of utility functions and graph algorithms implemented.

To use NetworkX algorithms we need to transform our graph to a type NetworkX recognizes. In our case, we need to use an undirected graph networkX.Graph. To make our lives easier, let's write a helper function that transforms Memgraph graph to networkX.Graph.

import networkx as nx
import networkx.algorithms as nxa
import itertools


def _create_undirected_graph(context: mgp.ProcCtx) -> nx.Graph:
    g = nx.Graph()

    for v in context.graph.vertices:
        context.check_must_abort()
        g.add_node(v)

    for v in context.graph.vertices:
        context.check_must_abort()
        for e in v.out_edges:
            g.add_edge(e.from_vertex, e.to_vertex)

    return g

Now let's get some information about the graph. As our data represents social network we would like to know if it has bridges and we would like to calculate the average clustering.

@mgp.read_proc
def analyze_graph(
        context: mgp.ProcCtx) -> mgp.Record(
        average_clustering=float,
        has_bridges=bool):
    g = _create_undirected_graph(context)
    return mgp.Record(
        average_clustering=nxa.average_clustering(g),
        has_bridges=nxa.has_bridges(g))

And to get and display the data let's run the following command:

CALL deezer_example.analyze_graph()
YIELD *;

Another interesting property of a node in a graph is its centrality. Centrality tells us how important a node is for a graph. In our case, it would mean higher the centrality, the more popular the user is. Let's find out which user is the most popular in our network and take a peek at his/her music taste. We will use the betweenness centrality.

@mgp.read_proc
def betweenness_centrality(
        context: mgp.ProcCtx) -> mgp.Record(node=mgp.Vertex,
                                            centrality=mgp.Number):
    g = _create_undirected_graph(context)
    return [mgp.Record(node=node, centrality=centrality)
            for node,
            centrality in nxa.centrality.betweenness_centrality(g).items()]

Now let's take a look at the results:

CALL deezer_example.betweenness_centrality()
YIELD *
RETURN node.id, node.genres, centrality
ORDER BY centrality DESC
LIMIT 10;

NOTE

Calculating betweenness centrality for each node can take some time to finish. The issue of slower NetworkX implementations is something we at Memgraph would like to address in the future. An example of this can be seen in the next section of this tutorial.

For our last trick, let's try to locate communities inside our network. Communities are a set of nodes that are densely connected. The goal of the community detection algorithms can be nicely described with the next visualization:

As for centrality, there are multiple algorithms for finding communities in a graph. We will write a function that takes a method for calculating communities, uses it to find the communities, and, optionally, calculates some metrics specific to the graph partitioning so we can compare algorithms. To make things more interesting, let's find out which genre is the most popular in the community and return the percentage which tells us how many of the users have that genre on their list. In the end, music is something that connects us!

def _get_communities(
        context: mgp.ProcCtx,
        community_function,
        calculate_quality: bool):
    g = _create_undirected_graph(context)

    communities = list(community_function(g))

    if calculate_quality:
        quality = {
            'coverage': nxa.community.quality.coverage(g, communities),
            'modularity': nxa.community.quality.modularity(g, communities),
            'performance': nxa.community.quality.performance(g, communities)
        }
    else:
        quality = {}

    communities = [list(community) for community in communities]

    def get_community_info(community):
        info = {
            'size': len(community),
        }

        genre_count = defaultdict(lambda: 0)
        for genre in itertools.chain(
                *[user.properties["genres"] for user in community]):
            genre_count[genre] += 1

        if len(genre_count) != 0:
            mpg = max(
                genre_count.items(),
                key=lambda item: item[1])

            info['most_popular_genre'] = mpg[0]
            info['most_popular_genre_percentage'] = mpg[1] / info['size']

        return info

    return mgp.Record(communities=[get_community_info(c)
                                   for c in communities], quality=quality)

Now that we have our function in place let's test some algorithms. We will be checking out community detection using greedy modularity maximization by Clauset-Newman-Moore and label propagation.

@mgp.read_proc
def greedy_modularity_communities(
        context: mgp.ProcCtx,
        calculate_quality: bool = False) -> mgp.Record(
        communities=list,
        quality=mgp.Map):
    return _get_communities(
        context,
        nxa.community.greedy_modularity_communities,
        calculate_quality)


@mgp.read_proc
def label_propagation_communities(
        context: mgp.ProcCtx,
        calculate_quality: bool = False) -> mgp.Record(
        communities=list,
        quality=mgp.Map):
    return _get_communities(
        context,
        nxa.community.label_propagation_communities,
        calculate_quality)

In the above snippet, we can notice an optional argument calculate_quality and usage of the type mgp.Map which is provided by Memgraph.

Let's see the results with:

CALL deezer_example.greedy_modularity_communities()
YIELD communities
UNWIND communities as community
WITH community
WHERE community.size > 10
RETURN community.most_popular_genre, community.most_popular_genre_percentage, community.size
ORDER BY community.size DESC;

and

CALL deezer_example.label_propagation_communities()
YIELD communities
UNWIND communities AS community
WITH community
WHERE community.size > 10
RETURN community.most_popular_genre, community.most_popular_genre_percentage, community.size
ORDER BY community.size DESC;

Your results should look something like this:

Hmm, Pop sure is popular. Let's ignore that genre:

for genre in itertools.chain(
        *[user.properties["genres"] for user in community]):
    if genre != 'Pop':
        genre_count[genre] += 1

and call our procedures again for each algorithm. Well, people always liked to dance!

We saw the biggest communities in our network using two different methods. It's hard to say which of the methods did better so let's check a couple of metrics by calling the same procedure with calculate_quality set to true:

CALL deezer_example.greedy_modularity_communities(true)
YIELD communities, quality
RETURN quality;

and

CALL deezer_example.label_propagation_communities(true)
YIELD communities, quality
RETURN quality;

I think it should come as no surprise that an algorithm that maximizes modularity has higher modularity...

Optimized NetworkX integration

As noted before, we at Memgraph are aware of NetworkX's potential but the performance for some functions isn't that good. We decided to tackle this problem by writing a wrapper object for Memgraph's graph and with smarter usage of NetworkX algorithms. To make things even more convenient, we decided to implement procedures that call NetworkX functions with our graphs, so you have out-of-the-box access to the graph algorithms. NetworkX contains a huge amount of functions, and writing procedures for all of them require insane effort, so don't blame us if some of the algorithms aren't available. We are always open to any feedback, so if you think that an implementation for an algorithm is needed, we will surely take that into account.

To demonstrate performance improvement, we will calculate the betweenness centrality again, this time by using Memgraph's procedure.

To get access to the NetworkX procedures, start your Memgraph server without modifying the query modules directory path. That way, the path will be set to the default path, which contains nxalg module.

Now let's call the procedure:

CALL nxalg.betweenness_centrality()
YIELD *
RETURN node.id, node.genres, betweenness
ORDER BY betweenness DESC
LIMIT 10;

You should get the same results as with our previous procedure for the betweenness centrality but in a much lower time.

Introduction​

Data model​

Importing the dataset​

Defining a directory with query modules​

Example queries and procedures​

Optimized NetworkX integration​

Further reading​