Execution times - derlin/bda-lsa-project GitHub Wiki

Launch parameters

To compare execution times, we launched all the jobs with the following configuration:

| Config | Value |
| --- | --- |
| master | yarn |
| deploy-mode | client |
| number of executors | 10 |
| driver memory | 20G |
| executor memory | 15G |

The command is thus:

```bash
spark-submit --master yarn --deploy-mode client \
     --num-executors 10 --driver-memory 20G --executor-memory 15G \
     --class <class> <jar> <arguments>
```

Configuration (config.properties)

```properties
path.base=/shared/wikipedia/docIds
path.wikidump=/shared/wikipedia/wikidump.xml
path.wikidump.parquet=/shared/wikipedia/wikidump-parquet
path.matrix=/shared/wikipedia/docIds/matrix-20000
```
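
For reference, a job can read these properties with the standard `java.util.Properties` API. The snippet below is only a minimal sketch, assuming the file sits in the working directory under the name `config.properties`; the project may load its configuration differently.

```scala
import java.io.FileInputStream
import java.util.Properties

// Minimal sketch: load config.properties and read the paths used by the jobs.
// The file name and location are assumptions, not the project's actual loading code.
val props = new Properties()
props.load(new FileInputStream("config.properties"))

val wikidumpParquet = props.getProperty("path.wikidump.parquet") // /shared/wikipedia/wikidump-parquet
val matrixPath      = props.getProperty("path.matrix")           // /shared/wikipedia/docIds/matrix-20000
```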

Execution times

Here, we used the whole Wikipedia dataset (> 50 GB). The jobs were launched on the daplab cluster.

  • Number of terms in the vocabulary: 20'000
  • Number of topics to infer (k): 1'000

The DocTermMatrixWriter processed an RDD of text files already computed by XmlToParquetWriter; the models then used the matrix created by DocTermMatrixWriter. A rough sketch of this pipeline is given after the table below.

| Jobs |  | Time |
| --- | --- | --- |
| DocTermMatrixWriter -- numTerms = 20'000 | 8 | 3h08 |
| SVD | 2511 | 48 min |
| mllib.LDA -- maxIter=20/maxIter=50 | 31/61 | 2h14/5h00 |
| ml.LDA -- maxIter=50 | ? | 1h27 |
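
To make the pipeline more concrete, here is a rough Scala sketch of the three steps (parquet dump → document-term matrix → SVD / LDA) written against the standard Spark 2.x APIs. It is not the project's actual code: the column names (`terms`, `tf`), the parquet schema, and the use of raw term frequencies are assumptions; only the paths, numTerms = 20'000, k = 1'000 and the maxIter values come from this page.

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object LsaPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lsa-pipeline-sketch").getOrCreate()

    // 1) Output of XmlToParquetWriter: one row per article, assumed to carry
    //    a "terms" column (array of tokens). The exact schema is an assumption.
    val docs = spark.read.parquet("/shared/wikipedia/wikidump-parquet")

    // 2) DocTermMatrixWriter step: build the document-term matrix, keeping
    //    only the 20'000 most frequent terms. In the real pipeline this matrix
    //    is persisted (path.matrix) and reloaded by the model jobs.
    val cv = new CountVectorizer()
      .setInputCol("terms")
      .setOutputCol("tf")
      .setVocabSize(20000)
    val matrix = cv.fit(docs).transform(docs)

    // Convert the ml vectors to mllib vectors for RowMatrix / mllib.LDA.
    val rows = matrix.select("tf").rdd
      .map(r => Vectors.fromML(r.getAs[org.apache.spark.ml.linalg.Vector](0)))
      .cache()

    // 3a) SVD job: k = 1'000 singular values.
    val svd = new RowMatrix(rows).computeSVD(1000, computeU = true)

    // 3b) mllib.LDA job: k = 1'000 topics, maxIter = 20 (or 50).
    val corpus = rows.zipWithIndex().map { case (v, id) => (id, v) }
    val ldaModel = new LDA().setK(1000).setMaxIterations(20).run(corpus)

    spark.stop()
  }
}
```

Note that `computeSVD` returns V as a local numTerms × k matrix on the driver, which is worth keeping in mind when sizing the 20G of driver memory used above.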