Data Engineering Projects

Personalized News Aggregator

The Personalized News Aggregator is an end-to-end data engineering project that collects, processes, and delivers personalized news recommendations. It uses Dockerized Python scripts to collect articles from various sources, Google Cloud Pub/Sub for message brokering, Dagster for ETL orchestration, Trino and Google Cloud NLP for data processing, Databricks for advanced transformations and model training, Google BigQuery for storage, Looker for visualization, and the Slack API for personalized notifications. The entire infrastructure, including the Google Cloud resources and Databricks clusters, is provisioned with Terraform so it can be rebuilt and scaled reliably. The project suits media companies and content platforms that want to increase user engagement by tailoring news content to individual preferences and reading habits.
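As a rough sketch of the collection layer, the Dockerized script below pulls articles from an RSS feed and publishes them to a Pub/Sub topic. The project ID, topic name, and feed URL are placeholders rather than the project's actual configuration.

    # collector.py - a sketch of the Dockerized collection step: pull articles
    # from an RSS feed and publish them to a Pub/Sub topic. Project ID, topic
    # name, and feed URL below are placeholders.
    import json

    import feedparser
    from google.cloud import pubsub_v1

    PROJECT_ID = "my-gcp-project"         # placeholder
    TOPIC_ID = "raw-news-articles"        # placeholder
    FEED_URL = "https://example.com/rss"  # placeholder news source

    def publish_articles() -> None:
        publisher = pubsub_v1.PublisherClient()
        topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

        feed = feedparser.parse(FEED_URL)
        for entry in feed.entries:
            message = {
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
                "summary": entry.get("summary", ""),
            }
            # Pub/Sub payloads are bytes; the "source" attribute lets
            # downstream consumers filter without parsing the body.
            future = publisher.publish(
                topic_path,
                json.dumps(message).encode("utf-8"),
                source=FEED_URL,
            )
            future.result()  # block until the message is accepted

    if __name__ == "__main__":
        publish_articles()

Downstream, Dagster would schedule this collector alongside the Trino and Databricks transformation steps described above.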

Tools: GCP, Databricks, Dagster, Docker, Terraform, Looker. 

YouTube Ad Revenue Optimization

This project builds a comprehensive data engineering pipeline that collects, processes, and analyzes YouTube video data to support ad revenue optimization. Using the YouTube Data API, PostgreSQL, Apache Spark, Pandas, Matplotlib, and Scikit-learn, it extracts actionable insights that inform ad revenue strategies, and it integrates with Telegram to push real-time notifications about optimization results. The use case is to help content creators and marketers maximize ad revenue by understanding viewer engagement and predicting the most effective monetization strategies.
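A minimal sketch of the ingestion step might look like the following, assuming a YouTube Data API key and a PostgreSQL DSN supplied via environment variables; the video IDs, table, and columns are illustrative only.

    # ingest_stats.py - fetch view and engagement counts from the YouTube Data
    # API and stage them in PostgreSQL for Spark/scikit-learn downstream.
    # The API key, DSN, video IDs, and table name are placeholders.
    import os

    import psycopg2
    import requests

    API_KEY = os.environ["YOUTUBE_API_KEY"]  # placeholder credential
    VIDEO_IDS = ["dQw4w9WgXcQ"]              # placeholder video IDs

    def fetch_stats(video_ids):
        resp = requests.get(
            "https://www.googleapis.com/youtube/v3/videos",
            params={"part": "statistics", "id": ",".join(video_ids), "key": API_KEY},
            timeout=30,
        )
        resp.raise_for_status()
        for item in resp.json().get("items", []):
            stats = item["statistics"]
            yield (
                item["id"],
                int(stats.get("viewCount", 0)),
                int(stats.get("likeCount", 0)),
                int(stats.get("commentCount", 0)),
            )

    def load_to_postgres(rows):
        conn = psycopg2.connect(os.environ["POSTGRES_DSN"])  # placeholder DSN
        with conn, conn.cursor() as cur:
            cur.execute(
                """CREATE TABLE IF NOT EXISTS video_stats (
                       video_id TEXT, views BIGINT, likes BIGINT,
                       comments BIGINT, loaded_at TIMESTAMPTZ DEFAULT now())"""
            )
            cur.executemany(
                "INSERT INTO video_stats (video_id, views, likes, comments)"
                " VALUES (%s, %s, %s, %s)",
                rows,
            )
        conn.close()

    if __name__ == "__main__":
        load_to_postgres(list(fetch_stats(VIDEO_IDS)))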

Tools: YouTube Data API, PostgreSQL, Spark, Pandas, Matplotlib, Scikit-learn, Telegram.


Integrated Blockchain Data Pipeline for Tax Compliance

This is a robust system designed to simplify the management and reporting of blockchain transaction data for tax purposes. By bringing together Apache Beam, Google Cloud Dataflow, BigQuery, PostgreSQL, Neo4j, and Apache Airflow, the project automates everything from fetching transaction data to generating comprehensive tax reports. Google Cloud Functions handle event-driven, serverless integrations, while Pandas, Matplotlib, and Seaborn support data analysis and visualization. The system is especially valuable for financial institutions, cryptocurrency exchanges, and individual investors who need a reliable way to comply with tax regulations: by automating the entire workflow, it saves time, reduces errors, and streamlines the otherwise complex task of blockchain transaction tax reporting.
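For illustration, a pared-down Beam pipeline for the Dataflow stage could parse newline-delimited transaction records from Cloud Storage and load them into BigQuery; the bucket path, table, and field names below are assumptions, not the project's actual schema.

    # tax_pipeline.py - parse newline-delimited transaction JSON from Cloud
    # Storage and load it into BigQuery for downstream tax-report queries.
    # Bucket, table, and field names are illustrative only.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_transaction(line: str) -> dict:
        tx = json.loads(line)
        return {
            "tx_hash": tx["hash"],
            "wallet": tx["from"],
            "value": float(tx["value"]),
            "block_time": tx["timestamp"],
        }

    def run() -> None:
        # Pass --runner=DataflowRunner, --project, etc. on the command line.
        options = PipelineOptions()
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadRawTxs" >> beam.io.ReadFromText("gs://example-bucket/raw/*.json")
                | "ParseTxs" >> beam.Map(parse_transaction)
                | "WriteToBQ" >> beam.io.WriteToBigQuery(
                    "my-project:tax.transactions",
                    schema="tx_hash:STRING,wallet:STRING,value:FLOAT,block_time:TIMESTAMP",
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )

    if __name__ == "__main__":
        run()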

Tools: GCP, PostgreSQL, Neo4j, Airflow.


Customer 360 View

This project consolidates customer data from various sources into a single, searchable view. Using Databricks for data processing, Kafka for real-time streaming, and Elasticsearch for fast search, the pipeline ingests customer interactions, transforms and enriches them with machine learning models, and stores the results in a queryable format. This gives businesses a complete picture of customer behavior and supports better decision-making.

This project is perfect for companies looking to enhance their marketing by understanding customer behavior, improve customer support by analyzing interactions, predict future sales trends, and refine product development based on feedback and analytics. By implementing this end-to-end data pipeline, businesses can achieve a unified view of their customers, leading to smarter decisions and happier customers.
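A hedged sketch of the streaming ingestion described above, written as a PySpark Structured Streaming job: it reads customer events from Kafka, flattens the JSON, and indexes the records into Elasticsearch. It assumes the Kafka source and the elasticsearch-spark connector are attached to the Databricks cluster; the broker address, topic, schema, and index name are placeholders.

    # customer_stream.py - consume customer events from Kafka, flatten the
    # JSON, and index the records into Elasticsearch. Assumes the Kafka source
    # and the elasticsearch-spark connector are attached to the cluster; the
    # broker, topic, schema, and index name are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("customer-360-ingest").getOrCreate()

    event_schema = StructType([
        StructField("customer_id", StringType()),
        StructField("event_type", StringType()),
        StructField("channel", StringType()),
        StructField("event_time", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", "customer-events")            # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream.format("org.elasticsearch.spark.sql")
        .option("es.nodes", "elasticsearch")  # placeholder host
        .option("checkpointLocation", "/tmp/checkpoints/customer-events")
        .start("customer_events")             # placeholder index
    )
    query.awaitTermination()

With the events indexed, Kibana dashboards can sit directly on top of the Elasticsearch index for the search and analytics use cases above.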

Tools: Databricks, Kafka, Spark, Elasticsearch, Kibana.

Supply Chain Optimization and Analytics System

This project is an end-to-end supply chain optimization and analytics system designed to enhance logistics, decision-making, and overall supply chain performance. By integrating data from FedEx and other sources, and using tools like AWS for storage, Snowflake for data warehousing, dbt for data transformations, Terraform for infrastructure management, Airflow for workflow orchestration, and Docker for containerization, the system automates the entire data pipeline from ingestion to reporting. This setup provides real-time insights and analytics that help businesses streamline their supply chain operations.

For instance, a retail company can use this system to track and manage its logistics more efficiently. With real-time data from FedEx on shipment statuses, delivery times, and costs, the company can monitor inventory levels, predict delays, and optimize shipping routes. This leads to better on-time delivery rates, reduced transportation costs, and improved inventory management. Automated data transformations and comprehensive analytics offer valuable insights, supporting strategic decision-making and enhancing overall supply chain performance.
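As one possible orchestration sketch (a recent Airflow 2.x release assumed), a DAG could chain the daily shipment extract, the Snowflake load, and the dbt run. The connection details, bucket, script paths, and dbt project location below are placeholders, and the FedEx extract is stubbed out.

    # shipments_dag.py - chain the daily shipment extract, the Snowflake load,
    # and the dbt run. Connection details, bucket, script paths, and the dbt
    # project location are placeholders; the FedEx extract is stubbed out.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def extract_shipments(**context):
        # Placeholder: call the carrier tracking API and write the raw
        # response to s3://example-supply-chain/raw/shipments/.
        ...

    with DAG(
        dag_id="supply_chain_shipments",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(
            task_id="extract_shipments",
            python_callable=extract_shipments,
        )

        load_to_snowflake = BashOperator(
            task_id="load_to_snowflake",
            # Assumes a COPY INTO script wired to a Snowflake external stage.
            bash_command="python /opt/pipeline/copy_into_snowflake.py",
        )

        run_dbt = BashOperator(
            task_id="run_dbt_models",
            bash_command="cd /opt/dbt/supply_chain && dbt run --target prod",
        )

        extract >> load_to_snowflake >> run_dbt

The same DAG structure extends naturally to additional carriers or data sources by adding parallel extract tasks ahead of the Snowflake load.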

Tools: AWS, Snowflake, DBT, Terraform, Airflow, Docker. 

Social Media Analytics Pipeline

This project demonstrates an end-to-end data pipeline for collecting, transforming, and analyzing social media data using Snowflake, DBT, Airflow, and Docker, with Twitter as the data source. By leveraging these tools, the pipeline automates the process of fetching tweets, storing them in a Snowflake data warehouse, and transforming the data using DBT for insightful analytics. This system can be used by businesses and researchers to monitor social media trends, perform sentiment analysis, and gain insights into public opinion and engagement around specific topics or brands. 
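A minimal sketch of the collection step, assuming a bearer token for the X API v2 recent-search endpoint and Snowflake credentials supplied via environment variables; the search query and staging table are placeholders, and dbt models would transform the staged rows afterwards.

    # fetch_posts.py - pull recent posts matching a query from the X API v2
    # recent-search endpoint and stage them in Snowflake for dbt to transform.
    # The bearer token, credentials, query, and table name are placeholders.
    import os

    import requests
    import snowflake.connector

    BEARER_TOKEN = os.environ["X_BEARER_TOKEN"]  # placeholder credential
    QUERY = "data engineering -is:retweet"       # placeholder search query

    def fetch_recent_posts(query):
        resp = requests.get(
            "https://api.twitter.com/2/tweets/search/recent",
            headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
            params={
                "query": query,
                "tweet.fields": "created_at,public_metrics",
                "max_results": 100,
            },
            timeout=30,
        )
        resp.raise_for_status()
        for tweet in resp.json().get("data", []):
            yield (tweet["id"], tweet["text"], tweet["created_at"])

    def stage_in_snowflake(rows):
        conn = snowflake.connector.connect(
            account=os.environ["SNOWFLAKE_ACCOUNT"],  # placeholder connection
            user=os.environ["SNOWFLAKE_USER"],
            password=os.environ["SNOWFLAKE_PASSWORD"],
            database="SOCIAL", schema="RAW", warehouse="LOAD_WH",
        )
        try:
            conn.cursor().executemany(
                "INSERT INTO RAW_TWEETS (TWEET_ID, TWEET_TEXT, CREATED_AT)"
                " VALUES (%s, %s, %s)",
                list(rows),
            )
        finally:
            conn.close()

    if __name__ == "__main__":
        stage_in_snowflake(fetch_recent_posts(QUERY))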

Tools: X API, Snowflake, DBT, Airflow, Docker.