Project Ideas - JohnnyFoulds/firstrepo GitHub Wiki

This page is my reference list of possible future projects. The first set of projects focuses on Data Engineering, to supplement the course training I have completed in this area.

IIS Logs Kafka Producer

Build a Kafka producer similar to KafkaTailer but use .NET Core and write it as a Windows Service so that it does not need a JVM deployed on the server.
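As a rough sketch of the data flow only (in Python rather than the .NET Core target, and with no Kafka client wired in), the snippet below parses IIS W3C-format log lines into JSON payloads that a producer could send. The function names `parse_w3c_lines` and `to_messages` and the sample log lines are illustrative assumptions; the only thing relied on is the `#Fields:` directive that IIS writes at the top of W3C extended logs.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_w3c_lines(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only #Fields changes how entries are parsed.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        # Entries that do not match the current field list are skipped.
        if fields and len(values) == len(fields):
            yield dict(zip(fields, values))


def to_messages(lines: Iterable[str]) -> List[bytes]:
    """Serialise each entry as UTF-8 JSON, the payload a producer would send."""
    return [
        json.dumps(entry, sort_keys=True).encode("utf-8")
        for entry in parse_w3c_lines(lines)
    ]


# Made-up sample input, for illustration only.
SAMPLE_LOG = [
    "#Software: Microsoft Internet Information Services 10.0",
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2019-05-01 10:15:00 GET /index.html 200",
    "2019-05-01 10:15:02 GET /missing 404",
]

MESSAGES = to_messages(SAMPLE_LOG)
```

In the real service, each payload would be handed to a Kafka producer instead of being collected in a list, and the input would come from tailing the live log file.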

References

#IKA

RPA Job Scraping

Use UiPath to scrape job postings from sites such as https://www.careerjunction.co.za and https://www.indeed.co.za/

The intention of acquiring this information is to do some basic analysis and determine what the job market is looking for and which of my skills I should best focus on when looking to change roles or up-skill.

  1. Have a list of keywords to search for such as: C#, .NET Core, Cassandra, UiPath, Apache Spark, Hadoop, etc.
  2. For each keyword, extract the job title, salary, job description, etc. to be processed later.
  3. Instead of just writing a CSV, use a Kafka producer and then have a Scala Kafka streaming consumer that loads this data into Hive.
  4. When batching the streams, also use the opportunity to make sure duplicates are not inserted into the data warehouse. Review the Lambda Architecture Course on Pluralsight.
  5. Analyse the job descriptions to look for trends and for what employers are asking for.
  6. Do further analysis based on salary and the data above to decide what job to pursue (use Zeppelin to do graphs).
  7. Have a look at an article on Scala vs. Other Technologies to see the interesting analysis they have done in section 4.
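
Steps 4 and 5 above could be prototyped along these lines before any Kafka or Hive plumbing exists. This is a minimal sketch in Python: the record fields (`title`, `company`, `description`), the `posting_key` hashing scheme, and the sample postings are all assumptions made for illustration, not the output of a real scraper. The keyword list is taken from step 1.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, List, Set

KEYWORDS = ["C#", ".NET Core", "Cassandra", "UiPath", "Apache Spark", "Hadoop"]


def posting_key(posting: Dict[str, str]) -> str:
    """Stable fingerprint of a posting (hypothetical choice of fields)."""
    basis = "|".join(
        posting.get(field, "").strip().lower()
        for field in ("title", "company", "description")
    )
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()


def deduplicate(batch: Iterable[Dict[str, str]], seen: Set[str]) -> List[Dict[str, str]]:
    """Return postings whose key is not in `seen`, updating `seen` in place."""
    fresh = []
    for posting in batch:
        key = posting_key(posting)
        if key not in seen:
            seen.add(key)
            fresh.append(posting)
    return fresh


def keyword_counts(postings: Iterable[Dict[str, str]]) -> Counter:
    """Count how many postings mention each keyword (case-insensitive substring)."""
    counts: Counter = Counter()
    for posting in postings:
        text = posting.get("description", "").lower()
        for keyword in KEYWORDS:
            if keyword.lower() in text:
                counts[keyword] += 1
    return counts


# Made-up sample batch containing one exact duplicate.
SAMPLE = [
    {"title": "Data Engineer", "company": "Acme", "description": "Apache Spark and Hadoop"},
    {"title": "Data Engineer", "company": "Acme", "description": "Apache Spark and Hadoop"},
    {"title": "RPA Developer", "company": "Example Co", "description": "UiPath and C# skills"},
]

seen_keys: Set[str] = set()
UNIQUE = deduplicate(SAMPLE, seen_keys)
COUNTS = keyword_counts(UNIQUE)
```

In a streaming job the in-memory `seen_keys` set would presumably be replaced by a lookup against the warehouse, and plain substring matching is only a starting point, since short keywords can match inside unrelated words.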

References

#RJS

Scaling Spark

This project idea is to test the different ways to scale out Apache Spark for remote processing.

  1. Connect the Apache Zeppelin Docker Image to the Local Hadoop Cluster.
  2. Deploy OpenShift on the ESXi server created for local-hadoop.
  3. Extend the Apache Spark Development Docker Image and use it to create a Standalone Spark Cluster.
  4. Create a Zeppelin Dockerfile based on apache/zeppelin:0.8.1 that can connect to Databricks.

References