Navigating the Data Landscape: Exploring Key Elements of Superset and Kafka for a Real-Time Analytics Platform

I changed the nature of my personal sprint objective, moving slightly away from AI concerns to reconnect with more general considerations about data and its processing: from collection to visualization through “processing”. Ultimately, of course, AI or ML could be injected into this processing. As always, when you start thinking about such a vast subject, the first question is how to define the scope. In this case, I wondered how to approach the question of data.

So, after compiling several resources, here are the two things that seem most important to me for now:

  • How can I improve the visualisation of existing data, e.g. a CSV file?
  • What are the ways to improve data collection?

For this post, you can find all files for each project on my GitHub account. See https://github.com/bflaven/BlogArticlesExamples/tree/master/how_to_use_superset_kafka_agile2

Key ideas that should drive a POC

As always, I read some material that I found inspiring and noted everything down. Some of it offers consolation, and some advice acts like a mantra when you start a POC and want to avoid WIP hell.

Learning is an active process. We learn by doing. Only knowledge that is used sticks in your mind. – Dale Carnegie

For the record, some good stuff from Agile2

Apparently, Agile is dead; long live Agile2.

I just read this article: https://medium.com/developer-rants/agile-has-failed-officially-8136b0522c49

Funny, this reminds me of several things.
Like any belief or ideology, it is healthy to question and criticize any “dominant” model of thought. The article is quite provocative and indeed sensible, for two reasons:

  1. Agile without leadership is worthless; it amounts to submitting to stupid and pointless formalism 🙂
  2. Agile2 is also a clever way to sell even more; it’s a bit like Persil laundry detergent announcing a new formula: we sell you the same product but with extra soul.

A good reminder to meditate on:

At some point, a project must produce a final product.

Here are some other excerpts from the Agile2 core values, on specific aspects of project management.

(i) Planning, Transition & Transformation

  • Any initiative requires both a vision or goal, and a flexible, steerable, outcome-oriented plan.
  • Any significant transformation is mostly a learning journey – not merely a process change.
  • Product development is mostly a learning journey – not merely an “implementation.”

(ii) Product, Portfolio & Stakeholders

  • Obtain feedback from the market and stakeholders continuously.
  • Work iteratively in small batches.
  • The only proof of value is a business outcome.
  • Organizations need an “inception framework” tailored to their needs.
  • Create documentation to share and deepen understanding.

(iii) Continuous Improvement

  • Place limits on things that cause drag.
  • Integrate early and often.

Streamline the data analytics flow…

Anyway, let’s get back to the main course. I was about to run a complete benchmark of the market solutions: Superset, Redash and even ClickHouse. But in the end, Superset is enough for my product discovery.

I was looking for ideas on how to improve the data analytics flow. Indeed, it is always useful “to set up a system that allows you to get a deeper understanding of the behaviour of your customers”.

Source: https://xebia.com/blog/real-time-analytics-divolte-kafka-druid-superset/

Having an alternative pipeline, which could be called a “real-time analytics platform”, can help to better perform two kinds of analysis:

Descriptive analysis: analyzing the data and describing what it says at a given point in time;

Predictive analysis: predicting a potential result based on data extracted from past or current activities.

Source: https://betterprogramming.pub/building-an-order-delivery-analytics-application-with-fastapi-kafka-apache-pinot-and-dash-part-ca276a3ee631

Based on these two posts, I decided to go for:

  1. Improving the visualisation of existing data with Superset.
  2. Quickly exploring Kafka to improve data collection.

1. Superset with Docker

Superset is the most intuitive tool that I found to improve visualisation. Here is a quick way to install and manage Superset with Docker.

# go to path
cd /Users/brunoflaven/Documents/01_work/blog_articles/how_to_use_superset/


# command to open docker
open -a docker

# clone the dir
git clone https://github.com/apache/superset.git superset


# get into the dir
cd superset


# get the superset stuff 
docker compose -f docker-compose-non-dev.yml pull

# start the superset stuff 
docker compose -f docker-compose-non-dev.yml up

# start using Superset
# http://localhost:8088
# username: admin
# password: admin

2. Connecting Superset to Databases

You need to have one or more databases, e.g. MariaDB, MongoDB, MySQL or PostgreSQL, installed on your machine to leverage a database. Because Superset runs inside Docker, to connect it to a database running locally on your Mac you need to use the hostname docker.for.mac.host.internal instead of localhost.
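
Concretely, when adding the database in Superset, depending on the version you can either fill in the host/port/database fields or paste a SQLAlchemy URI directly. For the PostgreSQL database used later in this post, such a URI would look roughly like postgresql://postgres:password@docker.for.mac.host.internal:5432/mydatabase_try_postgresql (the user, password and database name being the ones created below).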

On a Mac, the best way to install these databases is to use Homebrew. Here are the commands.


# list services for database
brew services list


# classical commands before install
brew update
brew upgrade
brew doctor

To install Homebrew if you haven’t already (https://brew.sh/), run:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/main/install.sh)"

# Update the Homebrew formulae:
brew update

2.1 MARIADB
MariaDB is an open-source, community-developed relational database
management system (RDBMS) that serves as a drop-in replacement for MySQL.
It offers a robust and flexible SQL engine with features such as stored
procedures, views, subqueries, and triggers.

Advantages of MariaDB:

  1. Compatibility: MariaDB is highly compatible with MySQL, making it easy
    to migrate existing databases without having to rewrite code or modify
    applications.
  2. Performance: MariaDB offers improved performance compared to MySQL
    through optimizer enhancements and features such as asynchronous
    replication. This results in faster query processing and reduced latency
    in real-time environments.
  3. Scalability: MariaDB is designed for scalability, with features such as
    partitioning that can help handle large datasets efficiently while
    ensuring optimal performance.

# Managing the mariadb database

# mariadb
brew search mariadb
brew install mariadb
brew services start mariadb
brew services stop mariadb

# connect to mariadb (no password)
mysql
mysql -u brunoflaven 

# create a root user with all privileges
CREATE USER 'root'@'hostname' IDENTIFIED BY 'root';
# CREATE USER 'root'@'%' IDENTIFIED BY 'root';
SELECT USER,is_role,default_role FROM mysql.user;
GRANT ALL PRIVILEGES ON *.* TO 'root'@localhost IDENTIFIED BY 'root';
FLUSH PRIVILEGES;
SHOW GRANTS FOR 'root'@localhost;

# connection infos
# select mysql
# add port 3306
# add host docker.for.mac.host.internal (not localhost or 127.0.0.1)
# db_name : mydatabase_try_mariadb
# user: root
# pwd: root



# useful commands
# create databases
CREATE DATABASE mydatabase_try_mariadb;
USE mydatabase_try_mariadb;
CREATE TABLE testtable
(
 id int not null primary key,
 name varchar(20) not null,
 lastupdate timestamp not null
 );

# insert
INSERT INTO testtable
 (id, name, lastupdate)
 values (1,'Sample name','2022-09-22 18:53');

INSERT INTO testtable
 (id, name, lastupdate)
 values (2,'Sample name 2','2022-09-22 18:54');

# update
UPDATE testtable set name = 'updated name' where id=1;

# delete one record with the id equal to 4
DELETE FROM testtable where id = 4;

# select all content from the table testtable 
SELECT * FROM testtable;

# drop
DROP TABLE testtable;

# empty
TRUNCATE testtable;



2.2 POSTGRES

PostgreSQL (Postgres) is a powerful, open-source object-relational
database system with a strong emphasis on reliability, data integrity, and
correctness. It supports a wide range of data types, including
geographical data, large objects such as images, JSON, and XML documents,
and advanced features like stored procedures, triggers, and rules.

Advantages of PostgreSQL:

  1. Robustness: Postgres is known for its robustness in handling complex
    queries, concurrency, and reliability. It follows the ACID (Atomicity,
    Consistency, Isolation, Durability) principles to ensure data integrity.
  2. Extensibility: Postgres offers a rich ecosystem with built-in support
    for various languages and data types. Its plugin architecture allows for
    seamless integration of new features and functionality without altering
    the core system.
  3. Compatibility: PostgreSQL follows the SQL standard closely and is broadly
    compatible with other popular database systems such as SQL Server, Oracle,
    MySQL, and DB2. This makes it easier to migrate existing applications or
    reuse existing SQL skills.

# install and start postgresql with homebrew 
brew search postgresql
brew install postgresql
brew services start postgresql
brew services stop postgresql

# connect to postgres
psql postgres

# way_1 to connect to postgresql
# in the console
createuser -s postgres
# in the postgres client
ALTER USER postgres WITH PASSWORD 'password';

# way_2 to connect to postgresql

# in the postgres client
CREATE ROLE root WITH LOGIN PASSWORD 'root';
ALTER ROLE root CREATEDB;

# connect to postgres in a terminal
psql postgres

# your username should be listed
postgres=# \du

# quit, then validate by reconnecting as that user
postgres=# \q

# and then:
psql -U brunoflaven postgres;
psql -U root postgres;

# to quit
postgres=# \q

# list all databases
postgres=# \list
postgres=# \l

# connect to a given database
postgres=# \c <database_name>

# examples with real postgres databases
postgres=# \c postgres
postgres=# \c mydatabase_try_postgresql
postgres=# \c template1

# list all tables in the current database using your search_path
postgres=# \dt


# In the psql console, create the tables and insert data for a database named "mydatabase_try_postgresql"

# create the database first (from a terminal: createdb mydatabase_try_postgresql)
# then connect to the newly created database
\c mydatabase_try_postgresql

# Create 'users' table
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    username VARCHAR(50) NOT NULL,
    email VARCHAR(100) NOT NULL
);

# Create 'orders' table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    order_date DATE,
    total_amount DECIMAL(10, 2) NOT NULL
);
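
If you prefer to script the inserts instead of typing them into psql, here is a minimal Python sketch. It assumes the psycopg2 driver is installed (pip install psycopg2-binary) and reuses the postgres user, password and database created in this section; adapt the credentials to your own setup.

# insert_sample_rows.py - run with: python insert_sample_rows.py
import psycopg2

# connection parameters reuse the values defined in this section
conn = psycopg2.connect(
    dbname="mydatabase_try_postgresql",
    user="postgres",
    password="password",
    host="127.0.0.1",
    port=5432,
)
cur = conn.cursor()

# insert one user and fetch its generated id
cur.execute(
    "INSERT INTO users (username, email) VALUES (%s, %s) RETURNING user_id",
    ("bruno", "bruno@example.com"),
)
user_id = cur.fetchone()[0]

# insert one order linked to that user
cur.execute(
    "INSERT INTO orders (user_id, order_date, total_amount) VALUES (%s, %s, %s)",
    (user_id, "2024-02-01", 42.50),
)

conn.commit()
cur.close()
conn.close()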




# configure the postgresql db in superset
# Not working: localhost or 127.0.0.1 on Mac
# Working: docker.for.mac.host.internal on Mac


# select postgresql
# add port 5432
# add host docker.for.mac.host.internal
# db_name : mydatabase_try_postgresql
# user: postgres
# pwd: password


# createdb mydatabase_try_postgresql
# dropdb mydatabase_try_postgresql

This time, for this POC, I did not install MySQL or MongoDB, as I just wanted a “quick and dirty” CSV import into Superset.
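
For the record, the CSV route is done through the Superset UI: enable file uploads on one of the connected databases (there is an “Allow file uploads to database” option in the database settings), then use the dataset/CSV upload menu to turn the file into a table you can query and chart. The exact menu labels may vary between Superset versions.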

Customization with .env for Superset
Modify the .env files for environment-specific configurations and install additional Python packages by adding them to requirements-local.txt.
You can use a combination of a docker-compose.yml file and a .env file to install Superset with docker-compose.

# You can create some_random_base64_string using this command in shell
openssl rand -base64 42
# OUTPUT: uKqlflwJGdDH/+NpwuRhJh8mZrNsTGu45OMT7akZhGhaBlOqkkOR0xMP

# in the mac terminal define the SUPERSET_SECRET_KEY
export SUPERSET_SECRET_KEY="uKqlflwJGdDH/+NpwuRhJh8mZrNsTGu45OMT7akZhGhaBlOqkkOR0xMP"
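
Note that exporting SUPERSET_SECRET_KEY in the terminal only affects a Superset process started from that shell. For the Docker-based setup, the value is typically read from the .env files shipped in the docker/ directory of the Superset repository. The sketch below is illustrative only; the exact file name and variable names may differ between Superset versions.

# docker/.env (illustrative values, adapt to your Superset version)
SUPERSET_SECRET_KEY=uKqlflwJGdDH/+NpwuRhJh8mZrNsTGu45OMT7akZhGhaBlOqkkOR0xMP
SUPERSET_LOAD_EXAMPLES=yes

# extra Python packages (e.g. database drivers) go into
# ./docker/requirements-local.txt, one package name per line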

# Lists containers (and tells you which images they are spun from)
docker ps -a                

# Lists images 
docker images               

# Removes a stopped container
docker rm     

# Forces the removal of a running container (uses SIGKILL)
docker rm -f  

# Removes an image
# Will fail if there is a running instance of that image i.e. container
docker rmi        


# Forces removal of image even if it is referenced in multiple repositories, 
# i.e. same image id given multiple names/tags 
# Will still fail if there is a docker container referencing image
docker rmi -f     


# command for docker
docker info
docker container prune
docker image prune
docker volume prune
docker network prune
docker system prune
docker system prune -a

3. Collecting data: using Kafka, a cornerstone


Apache Kafka is a distributed streaming platform that is widely used for building real-time data pipelines and streaming applications. In the context of an analytics application, Kafka plays a crucial role in handling the flow of data between different components of the application. It provides a scalable, fault-tolerant, and high-throughput messaging system that allows seamless communication between various modules of the analytics application.

Here are some key purposes of Kafka within an analytics application:

  1. Data Ingestion: Kafka acts as a central hub for ingesting data from various sources such as databases, logs, sensors, and other systems. It enables the application to handle large volumes of incoming data in a scalable and efficient manner (a minimal producer sketch follows this list).
  2. Event Streaming: Kafka allows the streaming of events in real-time. This is beneficial for analytics applications that require continuous processing of data, enabling real-time insights and analytics.
  3. Decoupling of Components: Kafka helps in decoupling different components of the analytics application. Producers can publish data to Kafka topics without worrying about who will consume it, and consumers can subscribe to the topics they are interested in.
  4. Fault Tolerance and Durability: Kafka ensures fault tolerance by replicating data across multiple nodes. This makes it a reliable and durable solution, ensuring that data is not lost in case of failures.

Source: https://kafka.apache.org/
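
To make the ingestion and decoupling points concrete, here is a minimal Python producer sketch. It assumes the kafka-python package is installed (pip install kafka-python) and a broker listening on localhost:9092, as in the Homebrew setup below; the topic name brunotopic1 matches the one created later in this section.

# minimal_producer.py - publish a small JSON event to Kafka
import json
from kafka import KafkaProducer

# serialize Python dicts to JSON bytes before sending
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

# the producer does not need to know who will consume this event
producer.send("brunotopic1", {"event": "page_view", "page": "/pricing"})
producer.flush()  # make sure the message is actually sent before exiting
producer.close()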

We are going to take the most straightforward route on a Mac, meaning Homebrew, like we did previously for the databases. Since we are using Homebrew again, the commands will be almost the same.

Again, if you need to install Homebrew, type the following command in the console.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

To install Kafka with Homebrew:

brew install kafka

To list the services managed by Homebrew:

brew services list

If you’ve installed Kafka and Zookeeper using Homebrew on your macOS system, you can use the following commands to interact with Kafka and gain an understanding of its general principles.

1. Start Zookeeper:
Zookeeper is a prerequisite for Kafka, and it manages distributed configurations and synchronization between nodes. Open a terminal and start Zookeeper:

brew services start zookeeper

2. Start Kafka Server:

Now, start the Kafka server using Homebrew:

brew services start kafka

This command will start the Kafka server as a background service.

3. Create a Topic:

Kafka organizes data into topics. Create a Kafka topic to publish and subscribe to messages:

kafka-topics --create --topic brunotopic1 --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

Replace `brunotopic1` with the desired topic name.

4. List Topics:

List the existing Kafka topics:

kafka-topics --list --bootstrap-server localhost:9092

5. Produce Messages:

Produce some messages to the topic:

kafka-console-producer --topic brunotopic1 --bootstrap-server localhost:9092

This command opens a console where you can type messages. Press `Ctrl + D` to exit.

6. Consume Messages:

Open a new terminal and consume messages from the topic:

kafka-console-consumer --topic brunotopic1 --bootstrap-server localhost:9092 --from-beginning

This command subscribes to the topic and prints incoming messages.
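
The same consumption step can also be done programmatically; here is a minimal Python sketch, again assuming the kafka-python package is installed.

# minimal_consumer.py - print every message published on brunotopic1
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "brunotopic1",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning for a new consumer
    value_deserializer=lambda raw: raw.decode("utf-8"),
)

for message in consumer:
    # message.value is the decoded payload, message.offset its position in the partition
    print(message.offset, message.value)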

7. Describe a Topic:

Describe the properties of a Kafka topic:

kafka-topics --describe --topic brunotopic1 --bootstrap-server localhost:9092

8. Kafka Commands Documentation:

Explore additional Kafka commands and options by checking the official documentation:

kafka-topics --help
kafka-console-producer --help
kafka-console-consumer --help

9. Stop Kafka:

When you’re done, you can stop the Kafka server:

brew services stop kafka
brew services stop zookeeper

This will stop the Kafka server running as a background service.

These commands provide a basic overview of Kafka’s functionalities. You can experiment further and refer to the official documentation (https://kafka.apache.org/documentation/) for a more in-depth understanding and configuration options.

Using faststream

To go further, you can leverage FastStream, a kind of FastAPI for Kafka. FastStream simplifies the process of writing producers and consumers for message queues, handling all the parsing, networking and documentation generation automatically.

Source: https://faststream.airt.ai/latest/faststream/
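
As an illustration, here is a minimal FastStream sketch, assuming the package is installed with its Kafka extra (pip install 'faststream[kafka]') and the local broker from the previous section; it simply logs every message published on brunotopic1.

# faststream_app.py - run with: faststream run faststream_app:app
from faststream import FastStream
from faststream.kafka import KafkaBroker

broker = KafkaBroker("localhost:9092")
app = FastStream(broker)

@broker.subscriber("brunotopic1")
async def handle_message(msg: str) -> None:
    # called for every message consumed from the brunotopic1 topic
    print(f"received: {msg}")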

More info


Superset: https://superset.apache.org/

Kafka: https://kafka.apache.org/documentation/