# Mastodon usage - counting toots with DuckDB  

[Mastodon](https://joinmastodon.org/) is a _decentralized_ social networking platform. Mastodon users are members of a _specific_ Mastodon server, and servers are capable of joining other servers to form a global (or at least federated) social network.

I wanted to start exploring Mastodon usage and perform some exploratory data analysis of user activity, server popularity and language usage. You may want to jump straight to the [data analysis](#data-analysis).

Tools used
- [Mastodon.py](https://mastodonpy.readthedocs.io/) - Python library for interacting with the Mastodon API
- [Apache Kafka](https://kafka.apache.org/) - distributed event streaming platform
- [DuckDB](https://duckdb.org/) - in-process SQL OLAP database, and the [HTTPFS DuckDB extension](https://duckdb.org/docs/extensions/httpfs.html) for reading and writing remote files on object storage using the S3 API
- [MinIO](https://min.io/) - S3 compatible server
- [Seaborn](https://seaborn.pydata.org/) - visualization library 


![mastodon architecture](./docs/mastodon_arch.png)

# Data processing
We will use Kafka as a distributed stream processing platform to collect data from multiple instances. To run Kafka, Kafka Connect (with the S3 sink connector), the schema registry (to support Avro serialisation) and MinIO, set up the containers with this command:

```console
docker-compose up -d
```

# Data collection

## Setup virtual python environment
Create a [virtual python](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) environment to keep dependencies separate. The _venv_ module is the preferred way to create and manage virtual environments. 

```console
python3 -m venv env
```

Before you can start installing or using packages in your virtual environment you’ll need to activate it.

```console
source env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```


## Federated timeline
These are the most recent public posts from people on this and other servers of the decentralized network that this server knows about.

## Mastodon listener
The Python `mastodonlisten` application listens for public posts on the specified server and sends each toot to Kafka. You can run multiple Mastodon listeners, each listening to the activity of a different server.

```console
python mastodonlisten.py --baseURL https://mastodon.social --enableKafka
```
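
As a rough sketch of what a listener might send to Kafka, the function below flattens one status (toot) into a small record. The input fields (`id`, `created_at`, `language`, `account.username`, `account.bot`, `application.name`) follow the Mastodon status entity; the output layout and the name `toot_to_record` are assumptions for illustration and not necessarily what `mastodonlisten.py` does.

```python
from typing import Any


def toot_to_record(status: dict[str, Any], base_url: str) -> dict[str, Any]:
    """Flatten a Mastodon status dict into a flat record (hypothetical schema)."""
    account = status.get("account") or {}
    # `application` is frequently missing or None, so fall back to an empty dict.
    application = status.get("application") or {}
    created = status.get("created_at")
    return {
        "id": status.get("id"),
        # Accept either a datetime object or an ISO string for the timestamp.
        "created_at": created.isoformat() if hasattr(created, "isoformat") else created,
        "username": account.get("username"),
        "bot": bool(account.get("bot", False)),
        "language": status.get("language"),
        "app": application.get("name"),
        "server": base_url,  # which server this listener was attached to
    }
```

Keeping the record flat makes it easier to describe with an Avro schema and to query later as Parquet columns.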

## Testing producer (optional)
As an optional step, you can check that Avro messages are being written to Kafka:

```console
kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic mastodon-topic --from-beginning
```


# Kafka Connect
To load the Kafka Connect [config](./config/mastodon-sink-s3-minio.json) file, run the following:

```console
curl -X PUT -H  "Content-Type:application/json" localhost:8083/connectors/mastodon-sink-s3/config -d '@./config/mastodon-sink-s3-minio.json'
```
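
To confirm the connector started, one option is the Kafka Connect REST status endpoint (`GET /connectors/<name>/status`). The sketch below assumes the connector name and port from the `curl` command above; the helper names `summarize_status` and `fetch_status` are made up for this example.

```python
import json
from urllib.request import urlopen


def summarize_status(payload: dict) -> dict:
    """Reduce a Kafka Connect status payload to connector and task states."""
    return {
        "connector": payload.get("connector", {}).get("state"),
        "tasks": [task.get("state") for task in payload.get("tasks", [])],
    }


def fetch_status(name: str = "mastodon-sink-s3",
                 base_url: str = "http://localhost:8083") -> dict:
    """Call the status endpoint of a running Kafka Connect worker."""
    with urlopen(f"{base_url}/connectors/{name}/status", timeout=10) as response:
        return summarize_status(json.load(response))
```

With the containers running, calling `fetch_status()` returns the connector state and one state per task, which is a quick way to spot a failed sink task before looking for files in MinIO.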

# Open s3 browser
Go to the MinIO web console at http://localhost:9001/ and log in with:

- username `minio`
- password `minio123`


# Data analysis
Now that we have collected a week of Mastodon activity, let's have a look at some data. These steps are detailed in the [notebook](./notebooks/mastodon-analysis.ipynb).


## Query parquet files directly from s3 using DuckDB

Load the [HTTPFS DuckDB extension](https://duckdb.org/docs/extensions/httpfs.html) for reading and writing remote files on object storage using the S3 API:

```sql
INSTALL httpfs;
LOAD httpfs;

set s3_endpoint='localhost:9000';
set s3_access_key_id='minio';
set s3_secret_access_key='minio123';
set s3_use_ssl=false;
set s3_url_style='path';
```

You can now query the Parquet files directly from S3:

```sql
select *
from read_parquet('s3://mastodon/topics/mastodon-topic/partition=0/*');
```

![SQL](./docs/select_from_s3_result.png)

## Daily Mastodon usage

We can query the `mastodon_toot` table directly to see the number of _toots_ and _users_ each day by counting and grouping the activity by day.
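
The grouping logic can be sketched in plain Python, independent of DuckDB. The sample rows and the `(created_at, username)` layout are assumptions for illustration; the real `mastodon_toot` columns may differ.

```python
from collections import defaultdict

# Hypothetical sample rows: (created_at ISO timestamp, username)
toots = [
    ("2023-02-01T09:15:00", "alice"),
    ("2023-02-01T17:40:00", "bob"),
    ("2023-02-01T21:05:00", "alice"),
    ("2023-02-02T08:30:00", "carol"),
]


def daily_usage(rows):
    """Return {day: (toot_count, distinct_user_count)} for (timestamp, username) rows."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for created_at, username in rows:
        day = created_at[:10]  # the YYYY-MM-DD prefix of an ISO timestamp
        counts[day] += 1
        users[day].add(username)
    return {day: (counts[day], len(users[day])) for day in sorted(counts)}


for day, (num_toots, num_users) in daily_usage(toots).items():
    print(day, num_toots, num_users)
# prints:
# 2023-02-01 3 2
# 2023-02-02 1 1
```

In SQL the same idea is a `GROUP BY` on the day with a row count and a distinct count of users.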

We can use the [mode](https://duckdb.org/docs/sql/aggregates.html#statistical-aggregates) aggregate function to find the most frequent "bot" and "not-bot" users, that is, the most active Mastodon users in each group.
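
A mode aggregate returns the most frequent value in a group. As a minimal sketch of that idea in plain Python, assuming hypothetical `(username, is_bot)` rows rather than the real table layout:

```python
from collections import Counter

# Hypothetical sample rows: (username, is_bot)
toots = [
    ("newsbot", True),
    ("alice", False),
    ("newsbot", True),
    ("bob", False),
    ("alice", False),
    ("weatherbot", True),
    ("newsbot", True),
]


def most_active(rows):
    """Return the most frequent username among bots and among non-bots."""
    by_kind = {True: Counter(), False: Counter()}
    for username, is_bot in rows:
        by_kind[bool(is_bot)][username] += 1

    def top(counter):
        # most_common(1) is empty when the group has no rows.
        best = counter.most_common(1)
        return best[0][0] if best else None

    return {"bot": top(by_kind[True]), "not_bot": top(by_kind[False])}


print(most_active(toots))
# prints: {'bot': 'newsbot', 'not_bot': 'alice'}
```

Ties are broken here by first appearance in the input, which may not match how DuckDB resolves ties.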


## The Mastodon app landscape
What clients are used to access Mastodon instances? We query the `mastodon_toot` table, excluding "bots", and load the query results into the `mastodon_app_df` pandas DataFrame. [Seaborn](https://seaborn.pydata.org/) is a visualization library for statistical graphics in Python, built on top of [matplotlib](https://matplotlib.org/). It also works really well with pandas data structures.

![App usage](./docs/app_usage.png)


## Time of day Mastodon usage
Let's see when Mastodon is used throughout the day and night. I want to get a raw count of _toots_ for each hour of each day. We can load the results of this query into the `mastodon_usage_df` dataframe.

![hour of day](./docs/hr_of_day_usage.png)

## Language usage
A wildly inaccurate investigation of language tags

![language usage](./docs/language_usage.png)

## Language density
A wildly inaccurate investigation of language tags

![language density](./docs/language_density.png)


# Optional steps


## Cleanup of virtual environment
If you want to switch projects or otherwise leave your virtual environment, simply run:

```console
deactivate
```

If you want to re-enter the virtual environment just follow the same instructions above about activating a virtual environment. There’s no need to re-create the virtual environment.