Parser Wikipedia Run it - nielsenbe/Spark-Wiki-Parser GitHub Wiki
Get the dump file.
This application reads data from the MediaWiki XML dump files, which are BZ2-compressed XML. The Wikipedia dump is the largest (~15 GB compressed). The dump is also published split into 50 smaller parts, which are useful for testing.
Main site:
- enwiki = English Wikipedia
- enwiki-[date]-pages-articles.xml.bz2 = Full backup
- enwiki-[date]-pages-articles1.xml-p[id].bz2 = Full backup divided into 50 pieces
Use caution with this file format:
- pages-meta-history = Contains all revisions for a page. This greatly increases the size (into terabytes) and for most tasks is not needed.
Do not use this file format:
- pages-meta-current = Contains user pages and discussion pages. These are not supported by the parser.
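The naming rules above can be sketched as a small pre-flight check before a dump file is handed to the parser. This is only an illustration: `DumpFileCheck` and `classify` are hypothetical names and are not part of Spark-Wiki-Parser, and the check looks only at substrings of the file name.

```scala
// Hypothetical helper, not part of Spark-Wiki-Parser: classifies a dump
// file name using the naming rules listed above.
object DumpFileCheck {
  sealed trait DumpKind
  case object Supported extends DumpKind      // pages-articles (full or split)
  case object UseWithCaution extends DumpKind // pages-meta-history (all revisions)
  case object Unsupported extends DumpKind    // pages-meta-current (user/discussion pages)
  case object Unknown extends DumpKind        // anything else

  def classify(fileName: String): DumpKind = {
    val name = fileName.toLowerCase
    if (name.contains("pages-meta-history")) UseWithCaution
    else if (name.contains("pages-meta-current")) Unsupported
    else if (name.contains("pages-articles")) Supported
    else Unknown
  }
}
```

For example, `DumpFileCheck.classify("enwiki-20190101-pages-articles.xml.bz2")` yields `Supported`, while a name containing `pages-meta-current` yields `Unsupported`.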
The basic element of the dump files is the 'page'. All pages have roughly the same element layout. A page can be a redirect (no wikitext), a standard Wikipedia article, a template, or a variety of other constructs.
Cluster Recommendations
The parsing phase is much more compute and memory intensive than the raw file size (15 GB compressed) would indicate. The parser is compatible with Spark 2.0 and later.
With 20 cores and 100 GB of RAM, parsing takes 2-3 hours. Adding more cores reduces that time roughly linearly.
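As a rough planning aid, the figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes ideal linear scaling from the single reference point quoted here (20 cores, 2-3 hours); real clusters will deviate, and `ParseTimeEstimate` is a hypothetical name, not part of the parser.

```scala
// Hypothetical estimate, assuming ideal linear scaling with core count.
object ParseTimeEstimate {
  // Reference point quoted above: 20 cores -> 2 to 3 hours.
  val referenceCores: Int = 20
  val referenceHours: (Double, Double) = (2.0, 3.0)

  // Returns a (low, high) range of estimated parsing hours.
  def estimateHours(cores: Int): (Double, Double) = {
    require(cores > 0, "cores must be positive")
    val factor = referenceCores.toDouble / cores
    (referenceHours._1 * factor, referenceHours._2 * factor)
  }
}
```

Under that assumption, `ParseTimeEstimate.estimateHours(40)` gives a range of 1.0 to 1.5 hours, and `estimateHours(10)` gives 4.0 to 6.0 hours.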
Required Jars
The package versions depend on your Spark cluster version:
com.databricks:spark-xml_2.11:0.5.0
com.github.nielsenbe:spark-wiki-parser_2.11:1.0
Arguments
--dumpfile
Full path of the dump file. The path format will vary depending on the source system (HDFS, S3, DFS, etc.).
Examples:
s3://[bucketname]/enwiki-20190101-pages-articles-multistream.xml.bz2
adl://[account].azuredatalakestore.net/wiki/enwiki-20190101-pages-articles-multistream.xml.bz2
hdfs://wiki/enwiki-20190101-pages-articles-multistream.xml.bz2
/dbfs/wiki/enwiki-20190101-pages-articles-multistream.xml.bz2
--destloc
The path to where the parser should store the files.
--destformat (optional)
File format for DB tables. Default is parquet.
Examples: parquet json orc csv
--lowmemorymode (optional)
Caches the intermediate dataset on disk instead of in memory. The user will need to clean up the temp files when the run is complete. Files are stored in [--destformat]/stg.
Values: true/false
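The four arguments above can be assembled programmatically, for instance when a job is launched from a scheduler. The following is a minimal sketch: `WikiParserArgs` and `build` are hypothetical names, and it assumes each optional flag is followed by its value, as the true/false note for --lowmemorymode suggests.

```scala
// Hypothetical helper, not part of Spark-Wiki-Parser: builds the argument
// list described above. Optional flags are emitted only when a value is given.
object WikiParserArgs {
  def build(dumpFile: String,
            destLoc: String,
            destFormat: Option[String] = None,      // e.g. parquet, json, orc, csv
            lowMemoryMode: Option[Boolean] = None   // assumed to take true/false
           ): Seq[String] = {
    val required = Seq("--dumpfile", dumpFile, "--destloc", destLoc)
    val format = destFormat.toSeq.flatMap(f => Seq("--destformat", f))
    val lowMem = lowMemoryMode.toSeq.flatMap(m => Seq("--lowmemorymode", m.toString))
    required ++ format ++ lowMem
  }
}
```

Calling `WikiParserArgs.build(dump, dest)` with only the two required values leaves out both optional flags, so the parser's defaults (parquet output) apply.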
Spark submit
See the Spark Submit documentation. Example:
spark-submit \
[Cluster settings (--master local, --master yarn, etc.)] \
--class "com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain" \
--packages com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0 \
"" \
"--dumpfile" "[dump file location]" \
"--destloc" "[dest file location]"
This runs the job from the Maven package instead of a supplied jar, which is why the application jar argument is an empty string. This is a little nonstandard, but it works.
Launch in notebook
Load the Spark XML and Spark-Wiki-Parser Maven coordinates to the environment.
import com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild._
// Give the location of the dump file and the destination folder
val dumpFile = "s3://[bucket name]/[dump file name]"
val destinationFolder = "s3://[bucket]/wkp/"
val destinationFormat = "parquet"
val lowMemoryMode = false
val args = Arguments(dumpFile, destinationFolder, destinationFormat, lowMemoryMode)
val wf = new DatabaseBuildFull()
wf.parseFileAndCreateDatabase(spark, args)
Launch in notebook. Retrieve intermediate Dataset.
import com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild._
import com.github.nielsenbe.sparkwikiparser.wikipedia.WikipediaPage
val dumpFile = "s3://[bucket name]/[dump file name]"
val destinationFolder = "s3://[bucket]/wkp/"
val destinationFormat = "parquet"
val lowMemoryMode = false
val args = Arguments(dumpFile, destinationFolder, destinationFormat, lowMemoryMode)
val parser = new ParserFunctions()
val parsedItems = parser.getWikipediaAsDataSet(spark, args)
Helpful Links
Load external libraries for Apache Zeppelin Link Azure/Livy Specific
Load external libraries for Jupyter Link
Load external libraries for Databricks [Link]