Examples: Resource Usage and How Big Spark Excel Can Handle? - nightscape/spark-excel GitHub Wiki
Purpose: Find out the resource usage and limitations of spark-excel
Limitations
Under the hood, spark-excel relies on Apache POI for all Excel handling. Here are the limits of a single Excel file, copied from SpreadsheetVersion in the Apache POI documentation (see References)
EXCEL97 format aka BIFF8 (xls)
- The total number of available rows is 64k (2^16)
- The total number of available columns is 256 (2^8)
- The maximum number of arguments to a function is 30
- Number of conditional format conditions on a cell is 3
- Number of cell styles is 4000
- Length of text cell contents is 32767
Excel2007 (xlsx)
- The total number of available rows is 1M (2^20)
- The total number of available columns is 16K (2^14)
- The maximum number of arguments to a function is 255
- Number of conditional format conditions on a cell is unlimited (actually limited by available memory in Excel)
- Number of cell styles is 64000
- Length of text cell contents is 32767
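The limits above can be written down as a small lookup table, which makes it easy to check up front whether a data set fits in a single sheet. This is only a sketch: the names `SheetLimits`, `LIMITS` and `fits_in_one_sheet` are made up for this page and are not part of spark-excel or Apache POI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SheetLimits:
    """Per-file limits, as listed above from Apache POI's SpreadsheetVersion."""
    max_rows: int
    max_columns: int
    max_function_args: int
    max_cell_styles: int
    max_text_length: int


# Hypothetical lookup table keyed by file extension; values copied from the lists above.
LIMITS = {
    "xls": SheetLimits(2**16, 2**8, 30, 4000, 32767),
    "xlsx": SheetLimits(2**20, 2**14, 255, 64000, 32767),
}


def fits_in_one_sheet(fmt: str, rows: int, columns: int) -> bool:
    """Return True if a rows x columns table fits in one sheet of the given format."""
    limits = LIMITS[fmt]
    return rows <= limits.max_rows and columns <= limits.max_columns
```

For example, `fits_in_one_sheet("xls", 70000, 10)` is false because 70000 exceeds 2^16 rows, while the same shape fits in xlsx.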
spark-excel supports reading and writing multiple Excel files, so the total number of rows in a data frame, for both reading and writing, depends only on the available resources and, when writing, on how the data is partitioned.
Let's do some tests
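As a rough illustration of the partitioning point, the sketch below estimates the minimum number of output files needed for a given row count. It assumes one sheet per file, one header row per file, and rows spread perfectly evenly across files, none of which is guaranteed in practice. The function name `min_files_needed` is hypothetical.

```python
# Row limits per sheet, from the Apache POI SpreadsheetVersion values.
XLS_MAX_ROWS = 2**16
XLSX_MAX_ROWS = 2**20


def min_files_needed(total_rows: int, max_rows_per_sheet: int, header_rows: int = 1) -> int:
    """Lower bound on the number of single-sheet files needed to hold total_rows.

    Assumes each file holds one sheet, reserves header_rows rows for the header,
    and that rows are distributed perfectly evenly across files.
    """
    usable = max_rows_per_sheet - header_rows
    if usable <= 0:
        raise ValueError("header_rows leaves no room for data rows")
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    # Integer ceiling division; always write at least one file.
    return max(1, -(-total_rows // usable))
```

The result is only a lower bound. Real partitions are rarely perfectly balanced, so when passing a partition count to something like Spark's `df.repartition(n)`, it is safer to pick a larger `n` than this estimate.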
TBD
References
- Apache POI - HSSF and XSSF Limitations
- Enum SpreadsheetVersion
- #79 Writing a large Dataset into an Excel file causes java.lang.OutOfMemoryError: GC overhead limit exceeded
- #142 read quite big excel error, size=300M
- #322 [Read an Excel File]: GC overhead limit exceeded
- #388 Error Reading files in Excel Worksheet 97-2003 File - xls format