Project Ideas
This page serves as a reference of possible future projects I want to work on. The first set of projects will focus specifically on Data Engineering to supplement the course training I have completed in this area.
IIS Logs Kafka Producer
Build a Kafka producer similar to KafkaTailer, but use .NET Core and write it as a Windows Service so that a JVM does not need to be deployed on the server.
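The core of the service is a simple tail-and-forward loop: remember the last position read in the log file and publish any newly appended lines to Kafka. Below is a minimal sketch of that loop, written in Scala purely to illustrate the logic (the actual project would be a .NET Core Windows Service, likely watching the log directory with FileSystemWatcher instead of polling); the broker address, topic name, and polling interval are assumptions.

```scala
import java.io.RandomAccessFile
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogTailer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val logPath = args(0) // path to the IIS log file to tail
    var position = 0L

    while (true) {
      // Re-open the file, skip what was already sent, and forward new lines.
      val file = new RandomAccessFile(logPath, "r")
      file.seek(position)
      var line = file.readLine()
      while (line != null) {
        producer.send(new ProducerRecord[String, String]("iis-logs", line))
        line = file.readLine()
      }
      position = file.getFilePointer
      file.close()
      Thread.sleep(1000) // simple polling; log rotation is ignored in this sketch
    }
  }
}
```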
References
- JVM Implementation - https://github.com/johnmpage/KafkaTailer
- Creating Windows Services with .NET Core - https://www.pmichaels.net/2019/01/08/creating-a-windows-service-using-net-core-2-2/
- .NET Core FileSystemWatcher Class - https://docs.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher?view=netstandard-2.0
- Tail.NET - https://www.codeproject.com/Articles/7568/Tail-NET
- Tailf - https://github.com/kerryjiang/Tailf
#IKA
RPA Job Scraping
Use UiPath to scrape job postings from sites such as https://www.careerjunction.co.za and https://www.indeed.co.za/
The intention of acquiring this information is to do some basic analysis to determine what the job market is looking for and which of my skills I should focus on when looking to change roles or up-skill.
- Have a list of keywords to search for such as: C#, .NET Core, Cassandra, UiPath, Apache Spark, Hadoop, etc.
- For each keyword, extract the job title, salary, job description, etc. to be processed later.
- Instead of just writing a CSV, use a Kafka producer and then have a Scala Kafka streaming consumer that loads this data into Hive (see the sketch after this list).
- When batching the streams, also use the opportunity to make sure duplicates are not inserted into the data warehouse. Review the Lambda Architecture course on Pluralsight.
- Analyse the job descriptions and look for trends in what employers are asking for (a rough keyword-count sketch also follows this list).
- Do further analysis based on salary and the data above to decide which job to pursue (use Zeppelin to create the graphs).
- Have a look at the Scala vs. Other Technologies article referenced below to see the interesting analysis they have done in section 4.
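A minimal sketch of the streaming consumer described above, assuming the UiPath robot publishes each posting as a JSON message to a Kafka topic. The broker address, topic name, message fields, and Hive table name (`jobs.postings`, assumed to already exist) are all assumptions. Each micro-batch is appended to Hive, and an anti-join on the advert URL keeps duplicates out of the warehouse.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object JobPostingConsumer {

  // Append a micro-batch to the Hive table, skipping adverts whose URL
  // is already in the warehouse.
  def writeBatch(batch: DataFrame, batchId: Long): Unit = {
    val existing = batch.sparkSession.table("jobs.postings").select("url")
    batch.dropDuplicates("url")
      .join(existing, Seq("url"), "left_anti")
      .write.mode("append").saveAsTable("jobs.postings")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("job-posting-consumer")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Assumed layout of the JSON messages produced by the UiPath robot.
    val jobSchema = new StructType()
      .add("keyword", StringType)
      .add("title", StringType)
      .add("salary", StringType)
      .add("description", StringType)
      .add("url", StringType)
      .add("scraped_at", TimestampType)

    val postings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "job-postings")                 // assumed topic
      .load()
      .select(from_json($"value".cast("string"), jobSchema).as("job"))
      .select("job.*")

    // Explicit function value avoids the foreachBatch overload ambiguity in Scala 2.12.
    val sink: (DataFrame, Long) => Unit = writeBatch

    postings.writeStream
      .foreachBatch(sink)
      .option("checkpointLocation", "/tmp/checkpoints/job-postings")
      .start()
      .awaitTermination()
  }
}
```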
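For the trend analysis itself, a rough Zeppelin paragraph along these lines could count how many scraped postings mention each tracked keyword; the table and column names follow the assumed schema above, and `z.show` renders the result so Zeppelin can switch it to a bar chart.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val postings = spark.table("jobs.postings")

// How many job descriptions mention each tracked keyword?
val keywords = Seq("C#", ".NET Core", "Cassandra", "UiPath", "Apache Spark", "Hadoop")
val counts = keywords.map { kw =>
  (kw, postings.filter(lower($"description").contains(kw.toLowerCase)).count())
}.toDF("keyword", "postings")

// Render as a Zeppelin table so it can be charted.
z.show(counts)
```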
References
- Kafka to HDFS/S3 Batch Ingestion Through Spark - https://dzone.com/articles/kafka-gt-hdfss3-batch-ingestion-through-spark
- LinkedIn Profile Scraping - I found the following article by somebody with a similar idea, but he looked at LinkedIn to find profiles of people already employed in the field and then drew conclusions from that. I should definitely also look at this as part of the project. https://towardsdatascience.com/i-wasnt-getting-hired-as-a-data-scientist-so-i-sought-data-on-who-is-c59afd7d56f5
- Spark Text Analytics - https://community.hortonworks.com/articles/84781/spark-text-analytics-uncovering-data-driven-topics.html
- 10 Great Programming Projects to Improve Your Resume and Learn to Program - https://dev.to/seattledataguy/10-great-programming-projects-to-improve-your-resume-and-learn-to-program-1e2h
- Scala vs. Other Technologies - https://data-flair.training/blogs/scala-job-opportunities/
#RJS
Scaling Spark
This project idea is to test the different ways to scale out Apache Spark for remote processing.
- Connect the Apache Zeppelin Docker Image to the Local Hadoop Cluster.
- Deploy OpenShift on the ESXi server created for local-hadoop.
- Extend the Apache Spark Development Docker Image and use it to create a standalone Spark cluster (see the smoke-test sketch after this list).
- Create a Zeppelin Dockerfile based on `apache/zeppelin:0.8.1` that can connect to Databricks.
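Once the standalone cluster is up, a quick way to confirm that a client can actually reach it is to point a SparkSession at the master and run a trivial job; the master host name and port below are assumptions taken from a typical docker-compose service name.

```scala
import org.apache.spark.sql.SparkSession

object StandaloneSmokeTest {
  def main(args: Array[String]): Unit = {
    // Point at the standalone master started by docker-compose (assumed host/port).
    val spark = SparkSession.builder()
      .appName("standalone-smoke-test")
      .master("spark://spark-master:7077")
      .getOrCreate()

    // A trivial job that forces work onto the cluster's executors.
    val sum = spark.range(0, 1000000).selectExpr("sum(id)").first().getLong(0)
    println(s"sum of ids: $sum")

    spark.stop()
  }
}
```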
References
- Deploying OpenShift Container Platform 3 on VMware vSphere - https://access.redhat.com/articles/2745171
- Creating a Spark Standalone Cluster with Docker and docker-compose - https://medium.com/@marcovillarreal_40011/creating-a-spark-standalone-cluster-with-docker-and-docker-compose-ba9d743a157f