Project Ideas
This page serves as a reference of possible future projects I want to work on. The first set of projects will focus specifically on Data Engineering to supplement the course training I have completed in this area.
IIS Logs Kafka Producer
Build a Kafka producer similar to KafkaTailer, but use .NET Core and write it as a Windows Service so that a JVM does not need to be deployed on the server.
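The core of the service is a simple tail-and-forward loop: remember the last position read in the log file and publish any newly appended lines to Kafka. Below is a minimal sketch of that loop, written in Scala purely to illustrate the logic (the actual project would be a .NET Core Windows Service, likely watching the log directory with FileSystemWatcher instead of polling); the broker address, topic name, and polling interval are assumptions.

```scala
import java.io.RandomAccessFile
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogTailer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val logPath = args(0) // path to the IIS log file to tail
    var position = 0L

    while (true) {
      // Re-open the file, skip what was already sent, and forward new lines.
      val file = new RandomAccessFile(logPath, "r")
      file.seek(position)
      var line = file.readLine()
      while (line != null) {
        producer.send(new ProducerRecord[String, String]("iis-logs", line))
        line = file.readLine()
      }
      position = file.getFilePointer
      file.close()
      Thread.sleep(1000) // simple polling; log rotation is ignored in this sketch
    }
  }
}
```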
References
- JVM Implementation - https://github.com/johnmpage/KafkaTailer
- Creating Windows Services with .NET Core - https://www.pmichaels.net/2019/01/08/creating-a-windows-service-using-net-core-2-2/
- .NET Core FileSystemWatcher Class - https://docs.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher?view=netstandard-2.0
- Tail.NET - https://www.codeproject.com/Articles/7568/Tail-NET
- Tailf - https://github.com/kerryjiang/Tailf
#IKA
RPA Job Scraping
Use UiPath to scrape job postings from sites such as https://www.careerjunction.co.za and https://www.indeed.co.za/
The intention of acquiring this information is to do some basic analysis to determine what the job market is looking for and which of my skills I should focus on when looking to change roles or up-skill.
- Have a list of keywords to search for such as: C#, .NET Core, Cassandra, UiPath, Apache Spark, Hadoop, etc.
- For each keyword, extract the job title, salary, job description, etc. to be processed later.
- Instead of just writing a CSV, use a Kafka producer and then have a Scala Kafka streaming consumer that loads this data into Hive (see the sketch after this list).
- When batching the streams, also use the opportunity to make sure duplicates are not inserted into the data warehouse. Review the Lambda Architecture course on Pluralsight.
- Analyse the job descriptions and look for trends in what employers are asking for (a rough keyword-count sketch also follows this list).
- Do further analysis based on salary and the data above to decide which job to pursue (use Zeppelin to create the graphs).
- Have a look at the Scala vs. Other Technologies article referenced below to see the interesting analysis they have done in section 4.
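A minimal sketch of the streaming consumer described above, assuming the UiPath robot publishes each posting as a JSON message to a Kafka topic. The broker address, topic name, message fields, and Hive table name (`jobs.postings`, assumed to already exist) are all assumptions. Each micro-batch is appended to Hive, and an anti-join on the advert URL keeps duplicates out of the warehouse.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object JobPostingConsumer {

  // Append a micro-batch to the Hive table, skipping adverts whose URL
  // is already in the warehouse.
  def writeBatch(batch: DataFrame, batchId: Long): Unit = {
    val existing = batch.sparkSession.table("jobs.postings").select("url")
    batch.dropDuplicates("url")
      .join(existing, Seq("url"), "left_anti")
      .write.mode("append").saveAsTable("jobs.postings")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("job-posting-consumer")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Assumed layout of the JSON messages produced by the UiPath robot.
    val jobSchema = new StructType()
      .add("keyword", StringType)
      .add("title", StringType)
      .add("salary", StringType)
      .add("description", StringType)
      .add("url", StringType)
      .add("scraped_at", TimestampType)

    val postings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "job-postings")                 // assumed topic
      .load()
      .select(from_json($"value".cast("string"), jobSchema).as("job"))
      .select("job.*")

    // Explicit function value avoids the foreachBatch overload ambiguity in Scala 2.12.
    val sink: (DataFrame, Long) => Unit = writeBatch

    postings.writeStream
      .foreachBatch(sink)
      .option("checkpointLocation", "/tmp/checkpoints/job-postings")
      .start()
      .awaitTermination()
  }
}
```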
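For the trend analysis itself, a rough Zeppelin paragraph along these lines could count how many scraped postings mention each tracked keyword; the table and column names follow the assumed schema above, and `z.show` renders the result so Zeppelin can switch it to a bar chart.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val postings = spark.table("jobs.postings")

// How many job descriptions mention each tracked keyword?
val keywords = Seq("C#", ".NET Core", "Cassandra", "UiPath", "Apache Spark", "Hadoop")
val counts = keywords.map { kw =>
  (kw, postings.filter(lower($"description").contains(kw.toLowerCase)).count())
}.toDF("keyword", "postings")

// Render as a Zeppelin table so it can be charted.
z.show(counts)
```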
References
- Kafka to HDFS/S3 Batch Ingestion Through Spark - https://dzone.com/articles/kafka-gt-hdfss3-batch-ingestion-through-spark
- LinkedIn Profile Scraping - I found the following article by somebody with a similar idea, but he looked at LinkedIn to find profiles of people already employed in the field and then drew conclusions from that. I should definitely also look at this as part of the project. https://towardsdatascience.com/i-wasnt-getting-hired-as-a-data-scientist-so-i-sought-data-on-who-is-c59afd7d56f5
- Spark Text Analytics - https://community.hortonworks.com/articles/84781/spark-text-analytics-uncovering-data-driven-topics.html
- 10 Great Programming Projects to Improve Your Resume and Learn to Program - https://dev.to/seattledataguy/10-great-programming-projects-to-improve-your-resume-and-learn-to-program-1e2h
- Scala vs. Other Technologies - https://data-flair.training/blogs/scala-job-opportunities/
#RJS
Scaling Spark
This project idea is to test the different ways to scale out Apache Spark for remote processing.
- Connect the Apache Zeppelin Docker Image to the Local Hadoop Cluster.
- Deploy OpenShift on the ESXi server created for local-hadoop.
- Extend the Apache Spark Development Docker Image and use it to create a standalone Spark cluster (see the smoke-test sketch after this list).
- Create a Zeppelin Dockerfile based on `apache/zeppelin:0.8.1` that can connect to Databricks.
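Once the standalone cluster is up, a quick way to confirm that a client can actually reach it is to point a SparkSession at the master and run a trivial job; the master host name and port below are assumptions taken from a typical docker-compose service name.

```scala
import org.apache.spark.sql.SparkSession

object StandaloneSmokeTest {
  def main(args: Array[String]): Unit = {
    // Point at the standalone master started by docker-compose (assumed host/port).
    val spark = SparkSession.builder()
      .appName("standalone-smoke-test")
      .master("spark://spark-master:7077")
      .getOrCreate()

    // A trivial job that forces work onto the cluster's executors.
    val sum = spark.range(0, 1000000).selectExpr("sum(id)").first().getLong(0)
    println(s"sum of ids: $sum")

    spark.stop()
  }
}
```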
References
- Deploying OpenShift Container Platform 3 on VMware vSphere - https://access.redhat.com/articles/2745171
- Creating a Spark Standalone Cluster with Docker and docker-compose - https://medium.com/@marcovillarreal_40011/creating-a-spark-standalone-cluster-with-docker-and-docker-compose-ba9d743a157f