Execution times - derlin/bda-lsa-project GitHub Wiki
For comparing the execution times, we launched all the jobs with the following configuration:
Config | Value |
---|---|
master | yarn |
deploy-mode | client |
number of executors | 10 |
driver memory | 20G |
executor memory | 15G |
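As a rough sanity check (simple arithmetic, not stated in this wiki), this configuration reserves about 170 GB of cluster memory in total:

```shell
# Total memory requested from YARN (ignoring the per-container
# overhead that YARN adds on top of these figures).
DRIVER_MEM=20      # GB, --driver-memory
EXECUTOR_MEM=15    # GB, --executor-memory
NUM_EXECUTORS=10   # --num-executors

TOTAL=$((DRIVER_MEM + EXECUTOR_MEM * NUM_EXECUTORS))
echo "total reserved: ${TOTAL} GB"   # total reserved: 170 GB
```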
The command is thus:

```bash
spark-submit --master yarn --deploy-mode client \
    --num-executors 10 --driver-memory 20G --executor-memory 15G \
    --class <class> <jar> <arguments>
```
The jobs were configured with the following path settings:

```properties
path.base=/shared/wikipedia/docIds
path.wikidump=/shared/wikipedia/wikidump.xml
path.wikidump.parquet=/shared/wikipedia/wikidump-parquet
path.matrix=/shared/wikipedia/docIds/matrix-20000
```
Here, we used the whole Wikipedia dataset (> 50 GB). The jobs were launched on the daplab cluster, with:
- Number of terms in the vocabulary: 20'000
- Number of topics to infer (k): 1'000
The `DocTermMatrixWriter` processed an RDD of text files already computed using `XmlToParquetWriter`. The models then used the matrix created by `DocTermMatrixWriter`.
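The pipeline order described above can be sketched as successive `spark-submit` calls. Only the class names `XmlToParquetWriter` and `DocTermMatrixWriter` appear in this wiki; the fully-qualified package names, jar name, and argument passing below are illustrative assumptions:

```shell
# 1. Parse the raw XML dump into Parquet (package and jar names are illustrative).
spark-submit --master yarn --deploy-mode client \
    --num-executors 10 --driver-memory 20G --executor-memory 15G \
    --class bda.lsa.XmlToParquetWriter bda-lsa-project.jar

# 2. Build the document-term matrix (numTerms = 20'000) from the Parquet output.
spark-submit ... --class bda.lsa.DocTermMatrixWriter bda-lsa-project.jar

# 3. Train a model (SVD, mllib.LDA or ml.LDA) on the saved matrix.
spark-submit ... --class <model class> bda-lsa-project.jar
```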
Job | Spark jobs | Time
---|---|---
DocTermMatrixWriter -- numTerms = 20'000 | 8 | 3h08 |
SVD | 2511 | 48 min |
mllib.LDA -- maxIter=20/maxIter=50 | 31/61 | 2h14/5h00 |
ml.LDA -- maxIter=50 | ? | 1h27 |
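As a quick consistency check on the two mllib.LDA runs (my arithmetic, not from the wiki): the timings scale almost linearly with maxIter, at roughly 6 minutes per iteration in both runs:

```shell
# mllib.LDA: 2h14 over 20 iterations vs 5h00 over 50 iterations.
PER_ITER_20=$(( (2 * 60 + 14) / 20 ))   # whole minutes per iteration, maxIter=20
PER_ITER_50=$(( (5 * 60 +  0) / 50 ))   # whole minutes per iteration, maxIter=50
echo "maxIter=20: ~${PER_ITER_20} min/iter"   # maxIter=20: ~6 min/iter
echo "maxIter=50: ~${PER_ITER_50} min/iter"   # maxIter=50: ~6 min/iter
```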