# Mastodon usage - counting toots with DuckDB
## Set up a virtual Python environment

Optionally, you can use a virtual Python environment to keep this project's dependencies separate. The venv module is the preferred way to create and manage virtual environments.

```console
python3 -m venv env
```

Before you can start installing or using packages in your virtual environment, you'll need to activate it.

```console
source env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

## Federated timeline

These are the most recent public posts from people on this and other servers of the decentralized network that this server knows about: https://data-folks.masto.host/public
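For a quick feel of the data, you can hit the standard Mastodon REST endpoint for the public timeline directly. A minimal sketch (the `limit` value is arbitrary):

```python
# Fetch the most recent public posts from the federated timeline.
# /api/v1/timelines/public is the standard Mastodon REST endpoint.
import requests

resp = requests.get(
    "https://data-folks.masto.host/api/v1/timelines/public",
    params={"limit": 5},
)
for status in resp.json():
    print(status["created_at"], status["account"]["acct"])
```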

## Producer

Start the Mastodon listener, which reads the public timeline and publishes events to Kafka:

```console
python mastodonlisten.py --baseURL https://data-folks.masto.host/ --enableKafka
```
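Conceptually, `mastodonlisten.py` is a streaming listener that forwards each new public post to Kafka. A minimal sketch of that idea (the real script produces Avro messages; this sketch uses JSON for brevity, and the topic and field names are illustrative):

```python
# Sketch: stream the public timeline and forward each post to Kafka.
from mastodon import Mastodon, StreamListener
from kafka import KafkaProducer
import json

class TootListener(StreamListener):
    def __init__(self, producer):
        super().__init__()
        self.producer = producer

    def on_update(self, status):
        # Called for every new public post ("toot") the server sees
        self.producer.send('mastodon-topic', {
            'id': status['id'],
            'created_at': str(status['created_at']),
            'content': status['content'],
        })

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)
mastodon = Mastodon(api_base_url='https://data-folks.masto.host/')
mastodon.stream_public(TootListener(producer))
```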

## Testing the producer

You can check that Avro messages are being written to Kafka:

```console
kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic mastodon-topic --from-beginning
```

## Kafka Connect

Configure the S3 sink connector (writing to MinIO) via the Kafka Connect REST API:

```console
curl -X PUT -H "Content-Type:application/json" localhost:8083/connectors/mastodon-sink-s3/config -d '@./config/mastodon-sink-s3-minio.json'
```
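What that PUT does, roughly, sketched in Python. The config keys are standard Confluent S3 sink connector settings; the actual values live in `./config/mastodon-sink-s3-minio.json`, so the bucket name, endpoint, and flush size here are assumptions:

```python
# Sketch: PUT a connector config to the Kafka Connect REST API.
import requests

config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "topics": "mastodon-topic",
    "s3.bucket.name": "mastodon",      # assumed; matches the s3://mastodon path below
    "store.url": "http://minio:9000",  # assumed MinIO endpoint inside Docker
    "flush.size": "1000",              # assumed
}

resp = requests.put(
    "http://localhost:8083/connectors/mastodon-sink-s3/config",
    json=config,
)
print(resp.status_code, resp.json())
```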

## Open S3 browser

Open the MinIO console at http://localhost:9001/

## Kafka Connect OLD

Install the S3 sink connector plugin:

```console
confluent-hub install confluentinc/kafka-connect-s3:10.3.0
```

Configure the local and AWS S3 sink connectors:

```console
curl -X PUT -H "Content-Type:application/json" localhost:8083/connectors/mastodon-sink-s3/config -d '@./config/mastodon-sink-s3.json'

curl -X PUT -H "Content-Type:application/json" localhost:8083/connectors/mastodon-sink-s3-aws/config -d '@./config/mastodon-sink-s3-aws.json'
```

## DuckDB

Launch DuckDB with the project init script:

```console
duckdb --init duckdb/init.sql
```

Read the Parquet files written by the sink connector:

```sql
SELECT * FROM read_parquet('s3://mastodon/topics/mastodon-topic*');
```

Convert a Unix epoch value (in seconds) to a timestamp:

```sql
SELECT 'epoch'::TIMESTAMP + INTERVAL 1675325510 seconds;
```
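The same queries can be run from Python with the `duckdb` package; a sketch assuming the MinIO endpoint and credentials used elsewhere in this README:

```python
# Querying the Parquet files from Python instead of the DuckDB CLI.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint='localhost:9000'")   # assumed host-side MinIO port
con.execute("SET s3_access_key_id='minio'")
con.execute("SET s3_secret_access_key='minio123'")
con.execute("SET s3_use_ssl=false")
con.execute("SET s3_url_style='path'")

# Count the toots landed by the S3 sink connector
count = con.execute(
    "SELECT count(*) FROM read_parquet('s3://mastodon/topics/mastodon-topic*')"
).fetchone()[0]
print(f"{count} toots collected")
```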

## AWS

List and summarize the bucket contents:

```console
aws s3 ls s3://2023mastodon --recursive --human-readable --summarize | tail
```

Copy the Parquet files locally:

```console
aws s3 cp s3://2023mastodon . --recursive --exclude "*" --include "*.parquet"
```

## OLD Notes

### Docker Notes

```console
docker-compose up -d postgres datagen
```

Connect to Postgres (password `postgres`) and check the generated data:

```console
psql -h localhost -U postgres -d postgres
```

```sql
select * from public.user limit 3;
```

```console
docker-compose up -d redpanda redpanda-console connect
```

Redpanda Console at http://localhost:8080

```console
docker exec -it connect /bin/bash
```

```console
curl -X PUT -H "Content-Type:application/json" localhost:8083/connectors/pg-src/config -d '@/connectors/pg-src.json'

curl -X PUT -H "Content-Type:application/json" localhost:8083/connectors/s3-sink/config -d '@/connectors/s3-sink.json'

curl -X PUT -H "Content-Type:application/json" localhost:8083/connectors/s3-sink-m/config -d '@/connectors/s3-sink-m.json'
```

```console
docker-compose up -d minio mc
```

Open http://localhost:9000 and log in with `minio / minio123`.

```console
docker-compose up -d duckdb
```

```console
docker-compose exec duckdb bash
duckdb --init duckdb/init.sql
```

```sql
SELECT count(value.after.id) AS user_count FROM read_parquet('s3://user-payments/debezium.public.user-*');
```


## Kafka notes

```console
python avro-producer.py -b "localhost:9092" -s "http://localhost:8081" -t aubury.mytopic
```
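For reference, an Avro producer along these lines can be sketched with `confluent-kafka`; the schema and field names below are illustrative, not the repo's actual schema:

```python
# Sketch: serialize records against a Schema Registry schema and
# produce them to a Kafka topic.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Illustrative schema, not the one shipped in this repo's avro/ directory
schema_str = """
{
  "type": "record",
  "name": "Toot",
  "fields": [{"name": "content", "type": "string"}]
}
"""

sr_client = SchemaRegistryClient({'url': 'http://localhost:8081'})
producer = SerializingProducer({
    'bootstrap.servers': 'localhost:9092',
    'value.serializer': AvroSerializer(sr_client, schema_str),
})

producer.produce(topic='aubury.mytopic', value={'content': 'hello'})
producer.flush()
```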



## LakeFS

Run LakeFS against AWS S3 (substitute your own key values):

```console
docker run --pull always -p 8000:8000 \
   -e LAKEFS_BLOCKSTORE_TYPE='s3' \
   -e AWS_ACCESS_KEY_ID='YourAccessKeyValue' \
   -e AWS_SECRET_ACCESS_KEY='YourSecretKeyValue' \
   treeverse/lakefs run --local-settings
```

Or run LakeFS against the local MinIO:

```console
docker run --pull always -p 8000:8000 \
   -e LAKEFS_BLOCKSTORE_TYPE='s3' \
   -e LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE='true' \
   -e LAKEFS_BLOCKSTORE_S3_ENDPOINT='http://minio:9000' \
   -e LAKEFS_BLOCKSTORE_S3_DISCOVER_BUCKET_REGION='false' \
   -e LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID='minio' \
   -e LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY='minio123' \
   treeverse/lakefs run --local-settings
```

DuckDB settings for querying MinIO over the S3 API:

```sql
set s3_endpoint='minio:9000';
set s3_access_key_id='minio';
set s3_secret_access_key='minio123';
set s3_use_ssl=false;
set s3_region='us-east-1';
set s3_url_style='path';
```



### Installing packages

Now that you're in your virtual environment, you can install packages.

```console
python -m pip install --requirement requirements.txt
```

### JupyterLab

Once installed, launch JupyterLab with:

```console
jupyter-lab
```

### Cleanup of virtual environment

If you want to switch projects or otherwise leave your virtual environment, simply run:

```console
deactivate
```

If you want to re-enter the virtual environment, just follow the same instructions above about activating it. There's no need to re-create the virtual environment.