Challenge #5 - ajainPSU/ECE410-510-Codefests GitHub Wiki

Challenge No. 5

In this challenge we were to develop (or acquire) several Python workloads/scripts for a variety of applications and/or tasks. Each workload was then compiled to Python bytecode, disassembled so its instructions could be viewed and counted, and analyzed to measure the execution time and resource usage of the workload, as well as its algorithmic structure and data dependencies, to find potential parallelism. All Large Language Model (LLM) usage in this challenge was through GitHub Copilot, which is built on the GPT-4o architecture.
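The compile/disassemble/count portion of that workflow can be sketched with the standard `dis` module and `collections.Counter`; the `sample` function below is a toy stand-in, not one of the actual workloads.

```python
import dis
from collections import Counter

def sample(a, b):
    # Toy function standing in for one of the workloads.
    return a * b + a

# Disassemble the compiled bytecode and tally opcode frequencies,
# mirroring the compile -> disassemble -> count steps described above.
instructions = list(dis.get_instructions(sample))
counts = Counter(ins.opname for ins in instructions)

for opname, n in counts.most_common():
    print(opname, n)
```

The exact opcode names vary between Python versions, but the counting approach is the same one used to produce the instruction-count figures below.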

The workloads I tested/examined are:

  1. A Reverse Polish Notation Calculator.
  2. A Matrix Multiplication script.
  3. A Quicksort script.

Reverse Polish Notation Calculator

The first script I tested was a Reverse Polish Notation (RPN) calculator. Unlike how one would assume a calculator works, an RPN calculator has the user enter the operands first, then the operator: the calculation "3 4 +" corresponds to "3 + 4" in traditional (infix) notation. My RPN calculator example, RPNExample.py in the Challenge #5 folder, evaluates and prints a list of expressions, as shown in Figure 1 below.
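A minimal sketch of this style of evaluator (not the exact code in RPNExample.py) shows the operand-first flow: values are pushed onto a stack, and each operator pops the two most recent values.

```python
import operator

# Supported binary operators (illustrative subset).
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_rpn(expression: str) -> float:
    stack = []
    for token in expression.split():
        if token in OPS:
            # Operator: pop the two most recent operands, push the result.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))
    # A well-formed expression leaves exactly one value on the stack.
    return stack[0]

print(eval_rpn("3 4 +"))  # the "3 4 +" example from the text
```

Because every operation only touches the top of the stack, this maps naturally onto a stack-based execution model, which becomes relevant in the analysis section.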

Fig1

Figure 1: Execution of RPNExample.py

Figure 2 below shows the disassembly of the bytecode for RPNExample.py.

Fig2a Fig2b

Figure 2: Disassembly of RPNExample.py's bytecode.

Figure 3 shows the instruction count of RPNExample.py.

Fig3

Figure 3: Instruction count of RPNExample.py's disassembly.

Fig4a Fig4b

Figure 4: Profiling & Analysis of algorithms and data dependencies.

Figure 4 above shows the profiling and analysis of the RPNExample.py script, while Figure 5 below shows the snakeviz visualization of the profiling.

RPNSnakeviz1 RPNSnakeviz2 RPNSnakeviz3

Figure 5: Snakeviz profiling visualization of RPNExample.py

Matrix Multiplication

The Matrix Multiplication script is designed similarly to the RPN calculator script: several prebuilt examples are calculated and displayed.
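As a sketch of the kind of kernel such a script exercises (the function name and inputs are illustrative, not taken from the repository), a naive triple-loop multiply makes the data-dependency structure visible: each output element is an independent dot product.

```python
# Naive O(n^3) matrix multiply over lists of lists.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # result[i][j] is the dot product of row i of a and column j
            # of b -- independent of every other output element, which is
            # why matrix multiplication parallelizes so well.
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]
    return result

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```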

Figure 6 below showcases the output of the Matrix Multiplication example script. Fig6

Figure 6: Output of Matrix script.

Fig7a Fig7b

Figure 7: Disassembly of Python bytecode of Matrix Multiplication script.

Fig8

Figure 8: Instruction count for the disassembly of the Matrix Multiplication script.

Fig9a Fig9b

Figure 9: Execution/Resource usage analysis with algorithmic and dependency analysis for potential parallelism of the Matrix Multiplication Script.

MatrixSnakeviz1 MatrixSnakeviz2 MatrixSnakeviz3

Figure 10: Snakeviz visualization of dependencies of the Matrix Multiplication script.

Quicksort

The Quicksort script goes through three sections. First, it sorts an array with a functional implementation of the quicksort algorithm. It then demonstrates the partition step used by in-place quicksort, and lastly sorts an array in place using the in-place quicksort.
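The three pieces described above can be sketched as follows; this is an illustrative version of the technique, not the actual code in the Quicksort script.

```python
def quicksort(arr):
    # Functional version: builds new lists instead of mutating the input.
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    return (quicksort([x for x in arr[1:] if x < pivot])
            + [pivot]
            + quicksort([x for x in arr[1:] if x >= pivot]))

def partition(arr, lo, hi):
    # Lomuto partition: moves arr[hi] (the pivot) to its final position
    # and returns that index.
    pivot = arr[hi]
    i = lo
    for j in range(lo, hi):
        if arr[j] < pivot:
            arr[i], arr[j] = arr[j], arr[i]
            i += 1
    arr[i], arr[hi] = arr[hi], arr[i]
    return i

def quicksort_inplace(arr, lo=0, hi=None):
    # In-place version built on the partition helper.
    if hi is None:
        hi = len(arr) - 1
    if lo < hi:
        p = partition(arr, lo, hi)
        quicksort_inplace(arr, lo, p - 1)
        quicksort_inplace(arr, p + 1, hi)

data = [5, 3, 8, 1, 9, 2]
print(quicksort(data))   # new sorted list; data unchanged
quicksort_inplace(data)
print(data)              # data is now sorted in place
```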

Fig11

Figure 11: Output of the Quicksort example program.

Fig12A Fig12B Fig12C

Figure 12: Disassembly of Python Bytecode for the Quicksort program.

Fig13

Figure 13: Instruction count total for the disassembly of the Quicksort bytecode.

Fig14

Figure 14: Command Terminal analysis of algorithms and dependencies of the Quicksort example.

QSSnakeviz1 QSSnakeviz2 QSSnakeviz3

Figure 15: Snakeviz visualization of profiling for the Quicksort.

Analysis

The first thing of note is that the Python Virtual Machine (PVM) is a stack machine: bytecode instructions push operands onto an evaluation stack and pop them off to operate on them.
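This push/pop behavior is visible in the disassembly of even a trivial function; the exact opcode names vary by Python version, but the pattern of loads followed by an operation and a return is the same.

```python
import dis

def add(a, b):
    return a + b

# The operands are pushed with LOAD instructions, then a binary-op
# instruction pops both and pushes the sum, which RETURN then consumes.
for ins in dis.get_instructions(add):
    print(ins.opname)
```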

If we examine the snakeviz output for the RPN program, most of the time is spent in subprocess.run() called from inside the compilation file (RPN), which means the actual RPN computations are not the bottleneck; rather, it is spawning and waiting for a subprocess. The implication is that the profile is dominated by OS-level context switching and I/O wait. Essentially, the workload is lightweight, and I/O latency rather than CPU instructions dominates the bottleneck. Given the profile and the nature of the RPN calculation workload, an ideal instruction architecture would support efficient stack operations with minimal overhead -- a stack-based architecture would align with the operand flow of the program and give tight yet predictable execution.
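This wrapper effect can be reproduced with a small cProfile sketch: the profiler only sees the parent process, so the child's entire runtime is attributed to subprocess.run() (the inline `-c` workload here is illustrative, not a file from the repository).

```python
import cProfile
import pstats
import subprocess
import sys

def run_workload():
    # Launch a child Python process, as the challenge's wrapper scripts do.
    # All of the child's compute time appears to the profiler as time
    # spent waiting inside subprocess.run().
    subprocess.run([sys.executable, "-c", "print(sum(range(10**6)))"])

profiler = cProfile.Profile()
profiler.enable()
run_workload()
profiler.disable()

# The top cumulative entries will be subprocess.run() and its wait calls,
# matching the snakeviz picture described above.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```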

The quicksort example yields the same analysis. The workload itself is very small and negligible, with most of the execution time spent in the subprocess module rather than in the quicksort logic itself. More specifically, time in subprocess is mostly spent in subprocess.run() with communicate(), wait(), and Windows system calls, which indicates that the overhead of launching the child process to run QSTest.py significantly outweighs the actual execution time of the quicksort logic. The architecture suggested for this workload would be a multi-core CPU or a distributed system, depending on the size of the data set.

The Matrix example likewise spends little of its execution time in the program itself: in the snakeviz flame graph, 100% of the execution time is in subprocess.run(), dominated by the same calls seen above in the quicksort example. The actual matrix computations are thus offloaded to a child process and are not directly profiled. The dependency and algorithm analysis nevertheless shows predictable control flow with regular memory access patterns. Given this, and since matrix programs call for an instruction architecture optimized for high arithmetic intensity and regular control flow, a suitable choice would be an out-of-order RISC architecture with fused multiply-add support.

Learning Reflection

The main thing I learned is that "vibe coding" does have promise: you have the LLM generate code, which is then tested and refined. The caveat, as seen with the Silicon Brain work done with ChatGPT assistance, is that one needs to be specific about what one wants. Several prompts had to be made to remedy incorrect assumptions the LLM had made that strayed from the prompt, as well as to provide additional knowledge. Other things learned were examining the disassembly of Python bytecode, profiling and checking via snakeviz for dependencies and potential parallelism, and suggesting instruction architectures based on the workloads.