Introduction
Excel is a popular spreadsheet for different types of tasks related to business, research, analytics, and tabular tasks. The Apache Foundation released the POI library December 30, 2001. It is a pure Java library which supports Microsoft office formats. We can read small files easily with POI with some simple interfaces but we need some more advance support to read large files. The main goal of this article is to read large Excel files with the streaming API from the POI library.
Architecture of an Excel OOXML-Based File (.xlsx file)
Excel files actually combine some XML files in a zip format (for .xlsx format). As we can see, an Excel file looks like this.
We will rename the file extension to .zip from .xlsx. Now we can see that the extracted folder only contains some XML files which are organixed in a different folder.
Here we can see all the files are written in an XML structure which we will read and write with our POI library.
So, in summary, we can say that the whole Excel file as workbook. This workbook contains some worksheets. SharedStrings.xml contains shared strings of the whole workbook and the styles contain the styles of each cell of every sheet.
POI Library
The POI library in Java is used to read and write Excel files in Java. Actually excel file has two types, One is Binary format and another is OOXML based. Apache POI contains different types of components to read and write Excel files.
- HSSF (Horrible SpreadSheet Format) - Used to read and write Excel files from the 97-2003(.xls) format. All interfaces and implentetions ar eavailable in org.apache.poi.hssf.usermodel package in the POI library.
- XSSF (XML SpreadSheet Format) - Used to read and write Excel files from the 2007+ versions (.xlsx) of Excel files. All interfaces and implentetions are available in org.apache.poi.xssf.eventusermodel package in the POI library.
- SXSSF (Streaming extension of XSSF) - Used to read .xlsx format Excel files with the stream concept and popular for reading large files efficiently.
Basic Interfaces and Their Implementations
Workbook is an interface which is inherited from the Closeable interface and the Sheet Iterable. A Closeable gives support to the data stream so that it can be closed. The close method is invoked to release resources that the object is holding (such as open files).
Workbook has several implementations:
1. HSSFWorkbook
2. XSSFWorkbook
3. SXSSFWorkbook
Sheet is an interface which is used to write and read row, column, and cells properties.
Sheet has several implementations:
1. HSSFSheet
2. XSSFSheet
3. SXSSFSheet
4. XSSFDialogsheet
Row
Row has also several implementation like XSSFRow, HSSFRow, SXSSFRow
Cell
Cell also has several implementations, like XSSFCell, HSSFCell, SXSSFCell.
So, in summery, we can say that workbooks contain sevaral sheets, sheets contain many rows, and every row contains some cells.
Exploring File Writting and Reading With XSSF Interface
Here I am presenting some sample code to write and read Excel files using the XSSF interface. You can find the code for the Resident and ExcelFileHeaderConstants files here:
Yes this will work fine. But this not suitable for large files. Large files take up too much memory. So sometimes we also get an 'Out of Memory' error. Just update the number counter loops at line 12 in the previous code like this:
I am still getting the 'Out of Memory' error. My laptop configuration is Core i7 with 4GB of ram.
To overcome this error, we can use SXSSF which will flush rows after the defined row is created. So it keeps the memory clean.
Implementation of SXSSF-Based Interface to Read and Write Excel Files
For streaming purposes, we have to change the type of workbook when creating workbook.
So with this implementation we will never get an 'Out of Memory' error, because it always keeps 100 rows in memory. Others will be flushed imediately if the row number is greater than 100. Here low memory is being used. That is the main feature of SXSSF APIs.
If we update this code with SXSSF then writting tasks will be completed successfully; but carefully look at the reading task to see if it's still in XSSF format. So its still getting error.
So lets go implement it with SXSSF event based memory efficient readding with SAX parser. Let's modify the
read()
method. Here the OPCPackage
works as a container which will contain all parts of an Excel file like workbook, sheets, sharedStrings, etc.ParseExcel
will initalize all event properties. The process method will read OPCPackage
using XSSFReader. And then parse sharedStrings, styles and relationship data. Then it will iterate all sheet of workbook and call the processSheet
method to process a prticular sheet. Then the processSheet
method initializes some properties of SAXParser
to parse sheets. When XMLReader
calls the parse()
method, an event will be thrown to read the Excel sheet.
And
XSSFSheetEventHandler
is here, which will handle all events during the reading of a sheet. Here, if one element of a sheet is going to start a read it will go to the startelement
method and after completely reading it will go to the endelement
method. For example, if events start to read a row it will go to the startelement
method and after the last data cell of a row's reading is completed it will go the endelement
method. The same applies for cells, columns, etc.
I am using
SheetModel
here to get a populated result from an event based reading of an Excel file. SheetModel
's code is given here:
So with using SXSSF interfaces we can easily write and read large Excel files. This is a memory efficient procedure.
I will explain all properties of event properties in my next article. Then I will give the link here. All of my code from this this article (similar code) is available on GitHub at this link: PoiAdvanceExample.
0 Comments