Execution times - derlin/bda-lsa-project GitHub Wiki

Launch parameters

To compare execution times, we launched all the jobs with the following configuration:

| Config | Value |
| --- | --- |
| master | yarn |
| deploy-mode | client |
| number of executors | 10 |
| driver memory | 20G |
| executor memory | 15G |

The command is thus:

```bash
spark-submit --master yarn --deploy-mode client \
     --num-executors 10 --driver-memory 20G --executor-memory 15G \
     --class <class> <jar> <arguments>
```

Configuration (config.properties)

```properties
path.base=/shared/wikipedia/docIds
path.wikidump=/shared/wikipedia/wikidump.xml
path.wikidump.parquet=/shared/wikipedia/wikidump-parquet
path.matrix=/shared/wikipedia/docIds/matrix-20000
```
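
For reference, a job can read these properties with the standard `java.util.Properties` API. The snippet below is only a minimal sketch, assuming the file sits in the working directory under the name `config.properties`; the project may load its configuration differently.

```scala
import java.io.FileInputStream
import java.util.Properties

// Minimal sketch: load config.properties and read the paths used by the jobs.
// The file name and location are assumptions, not the project's actual loading code.
val props = new Properties()
props.load(new FileInputStream("config.properties"))

val wikidumpParquet = props.getProperty("path.wikidump.parquet") // /shared/wikipedia/wikidump-parquet
val matrixPath      = props.getProperty("path.matrix")           // /shared/wikipedia/docIds/matrix-20000
```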

Execution times

Here, we used the whole Wikipedia dataset (> 50 GB). The jobs were launched on the daplab cluster.

  • Number of terms in the vocabulary: 20'000
  • Number of topics to infer (k): 1'000

The DocTermMatrixWriter processed an RDD of text files already computed by XmlToParquetWriter; the models then used the matrix created by DocTermMatrixWriter. A rough sketch of this pipeline is given after the table below.

| Jobs |  | Time |
| --- | --- | --- |
| DocTermMatrixWriter -- numTerms = 20'000 | 8 | 3h08 |
| SVD | 2511 | 48 min |
| mllib.LDA -- maxIter=20/maxIter=50 | 31/61 | 2h14/5h00 |
| ml.LDA -- maxIter=50 | ? | 1h27 |
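
To make the pipeline more concrete, here is a rough Scala sketch of the three steps (parquet dump → document-term matrix → SVD / LDA) written against the standard Spark 2.x APIs. It is not the project's actual code: the column names (`terms`, `tf`), the parquet schema, and the use of raw term frequencies are assumptions; only the paths, numTerms = 20'000, k = 1'000 and the maxIter values come from this page.

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object LsaPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lsa-pipeline-sketch").getOrCreate()

    // 1) Output of XmlToParquetWriter: one row per article, assumed to carry
    //    a "terms" column (array of tokens). The exact schema is an assumption.
    val docs = spark.read.parquet("/shared/wikipedia/wikidump-parquet")

    // 2) DocTermMatrixWriter step: build the document-term matrix, keeping
    //    only the 20'000 most frequent terms. In the real pipeline this matrix
    //    is persisted (path.matrix) and reloaded by the model jobs.
    val cv = new CountVectorizer()
      .setInputCol("terms")
      .setOutputCol("tf")
      .setVocabSize(20000)
    val matrix = cv.fit(docs).transform(docs)

    // Convert the ml vectors to mllib vectors for RowMatrix / mllib.LDA.
    val rows = matrix.select("tf").rdd
      .map(r => Vectors.fromML(r.getAs[org.apache.spark.ml.linalg.Vector](0)))
      .cache()

    // 3a) SVD job: k = 1'000 singular values.
    val svd = new RowMatrix(rows).computeSVD(1000, computeU = true)

    // 3b) mllib.LDA job: k = 1'000 topics, maxIter = 20 (or 50).
    val corpus = rows.zipWithIndex().map { case (v, id) => (id, v) }
    val ldaModel = new LDA().setK(1000).setMaxIterations(20).run(corpus)

    spark.stop()
  }
}
```

Note that `computeSVD` returns V as a local numTerms × k matrix on the driver, which is worth keeping in mind when sizing the 20G of driver memory used above.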