PDFBox: do PDDocument and PDPage have references to one another?

January 11, 2019, at 09:20 AM

Does a PDPage object contain a reference to the PDDocument to which it belongs?
In other words, does a PDPage have knowledge of its PDDocument?
Somewhere in the application I have a list of PDDocuments.
These documents get merged into one new PDDocument:

PDFMergerUtility pdfMerger = new PDFMergerUtility();
PDDocument mergedPDDocument = new PDDocument();
for (PDDocument pdfDocument : documentList) {
    pdfMerger.appendDocument(mergedPDDocument, pdfDocument);
}

Then this PDDocument gets split into bundles of 10:

Splitter splitter = new Splitter();
splitter.setSplitAtPage(bundleSize);
List<PDDocument> bundleList = splitter.split(mergedPDDocument);

My question is now:
if I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?

Also, if you have a PDPage object, can you get information from it, such as its page number? Or can you get this another way?

Answer 1
  1. Does a PDPage object contain a reference to the PDDocument to which it belongs? In other words, does a PDPage have knowledge of its PDDocument?

Unfortunately, a PDPage does not contain a reference to its parent PDDocument. Its dictionary does have a Parent entry, though, which points to the page tree node that contains it. That node's Kids array lists the pages stored alongside it, so you can navigate between those pages without a reference to the parent PDDocument.

  2. If you have a PDPage object, can you get information from it like its page number, or can you get this via another way?

There is a workaround to get the position of a PDPage in the document without having the PDDocument available. Each PDPage wraps a dictionary with information about the page: its size, resources, fonts, content, etc. One of its entries is called Parent. It refers to a Pages node whose Kids array holds the page dictionaries, and each of those has everything needed to create a shallow clone of the PDPage using the constructor PDPage(COSDictionary). The kids are in page order, so the page number can be obtained from the position of the entry in the array. Note that this gives the document-wide page number only when all pages hang directly off a single Pages node, as in the example below; in a nested page tree the array holds just the sibling pages.
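To illustrate why the position in Kids is only a page number when the tree is flat, here is a minimal sketch of the general case. It does not use PDFBox at all: PageTreeNode is a hypothetical stand-in for the page dictionaries, with parent, kids and count() playing the roles of the Parent, Kids and Count entries. The page number is found by walking up the Parent chain and adding up the pages under every preceding sibling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a page tree node; PDFBox itself uses COSDictionary objects. */
final class PageTreeNode {
    final PageTreeNode parent;                          // models the Parent entry
    final List<PageTreeNode> kids = new ArrayList<>();  // models the Kids array
    final boolean isPage;                               // true for a Page leaf, false for a Pages node

    PageTreeNode(PageTreeNode parent, boolean isPage) {
        this.parent = parent;
        this.isPage = isPage;
        if (parent != null) {
            parent.kids.add(this);
        }
    }

    /** Number of leaf pages at or below this node (the role played by Count). */
    int count() {
        if (isPage) {
            return 1;
        }
        int total = 0;
        for (PageTreeNode kid : kids) {
            total += kid.count();
        }
        return total;
    }

    /** 0-based page number: how many leaf pages precede this page in document order. */
    int pageIndex() {
        int index = 0;
        for (PageTreeNode node = this; node.parent != null; node = node.parent) {
            for (PageTreeNode sibling : node.parent.kids) {
                if (sibling == node) {
                    break;
                }
                index += sibling.count();
            }
        }
        return index;
    }

    public static void main(String[] args) {
        PageTreeNode root = new PageTreeNode(null, false);
        PageTreeNode branchA = new PageTreeNode(root, false);
        new PageTreeNode(branchA, true);                 // page 0
        new PageTreeNode(branchA, true);                 // page 1
        new PageTreeNode(root, true);                    // page 2
        PageTreeNode branchB = new PageTreeNode(root, false);
        PageTreeNode last = new PageTreeNode(branchB, true);
        // Position in its own Kids array is 0, but it is the fourth page of the document.
        System.out.println(branchB.kids.indexOf(last) + " vs " + last.pageIndex());
    }
}
```

For a flat tree, like the one in this answer, both numbers are the same, which is why the simpler array lookup works there.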

  3. If I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?

Once you merge the document list into a single document, all references to the original documents are lost. You can confirm this by looking at the Parent object inside the PDPage: go to Parent > Kids > COSObject[n] > Parent and check whether the number for Parent is the same for all the elements in the array. In this example Parent is COSName {Parent} : 1781256139; for all pages.

COSName {Parent} : COSObject {
  COSDictionary {
    COSName {Kids} : COSArray {
      COSObject {
        COSDictionary {
          COSName {TrimBox} : COSArray {0; 0; 612; 792;};
          COSName {MediaBox} : COSArray {0; 0; 612; 792;};
          COSName {CropBox} : COSArray {0; 0; 612; 792;};
          COSName {Resources} : COSDictionary {
            ...
          };
          COSName {Contents} : COSObject {
            ...
          };
          COSName {Parent} : 1781256139;
          COSName {StructParents} : COSInt {68};
          COSName {ArtBox} : COSArray {0; 0; 612; 792; };
          COSName {BleedBox} : COSArray {0; 0; 612; 792; };
          COSName {Type} : COSName {Page};
        }
      }
      ...
    };
    COSName {Count} : COSInt {4};
    COSName {Type} : COSName {Pages};
  }
};
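Since the merged document keeps no link back to the originals, one possible workaround is to do the bookkeeping yourself before merging. The sketch below is plain Java with no PDFBox calls. PageOriginIndex is a hypothetical helper, and it rests on a few assumptions: the page counts are recorded in merge order (for example from PDDocument.getNumberOfPages() just before each appendDocument call), the merger appends pages in order, and the splitter cuts the merged document into consecutive bundles of bundleSize pages.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: maps merged page indexes back to source documents. */
final class PageOriginIndex {
    // startOffsets.get(d) is the merged index of the first page of source document d
    private final List<Integer> startOffsets = new ArrayList<>();
    private int totalPages = 0;

    /** Call once per source document, in merge order, with its page count. */
    void addDocument(int pageCount) {
        startOffsets.add(totalPages);
        totalPages += pageCount;
    }

    /** Returns the 0-based index of the source document that supplied the given merged page. */
    int sourceOfMergedPage(int mergedPageIndex) {
        if (mergedPageIndex < 0 || mergedPageIndex >= totalPages) {
            throw new IndexOutOfBoundsException("page " + mergedPageIndex);
        }
        int doc = 0;
        // Advance while the next document starts at or before the requested page.
        while (doc + 1 < startOffsets.size() && startOffsets.get(doc + 1) <= mergedPageIndex) {
            doc++;
        }
        return doc;
    }

    /** Same lookup for a page addressed by bundle number and position inside that bundle. */
    int sourceOfBundlePage(int bundleIndex, int pageInBundle, int bundleSize) {
        return sourceOfMergedPage(bundleIndex * bundleSize + pageInBundle);
    }

    public static void main(String[] args) {
        PageOriginIndex index = new PageOriginIndex();
        index.addDocument(4);  // document 0 supplies merged pages 0..3
        index.addDocument(7);  // document 1 supplies merged pages 4..10
        index.addDocument(2);  // document 2 supplies merged pages 11..12
        System.out.println(index.sourceOfBundlePage(1, 0, 10)); // merged page 10, prints 1
        System.out.println(index.sourceOfBundlePage(1, 1, 10)); // merged page 11, prints 2
    }
}
```

While looping over bundleList you would pass the bundle's position in the list and the page's position in the bundle. If the real split behaves differently from the assumptions above, the arithmetic needs to be adjusted accordingly.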

Source code

I wrote the following code to show how the information from the PDPage dictionary can be used to navigate the pages backward and forward and to get the page number from the position in the array. It uses the PDFBox 2.x API (PDDocument.load).

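import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

public class PDPageUtils {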
    public static void main(String[] args) throws InvalidPasswordException, IOException {
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
        PDDocument document = null;
        try {
            String filename = "src/main/resources/pdf/us-017.pdf";
            document = PDDocument.load(new File(filename));
            System.out.println("listIterator(PDPage)");
            ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
            while (pageIterator.hasNext()) {
                PDPage page = pageIterator.next();
                // After next(), previousIndex() is the index of the page just returned.
                System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }
    /**
     * Returns a <code>ListIterator</code> initialized with the list of pages from
     * the dictionary embedded in the specified <code>PDPage</code>. The current
     * position of this <code>ListIterator</code> is set to the position of the
     * specified <code>PDPage</code>.
     * 
     * @param page the specified <code>PDPage</code>
     * @return a <code>ListIterator</code> positioned at the specified page
     * 
     * @see java.util.ListIterator
     * @see org.apache.pdfbox.pdmodel.PDPage
     */
    public static ListIterator<PDPage> listIterator(PDPage page) {
        List<PDPage> pages = new LinkedList<PDPage>();
        COSDictionary pageDictionary = page.getCOSObject();
        COSDictionary parentDictionary = pageDictionary.getCOSDictionary(COSName.PARENT);
        COSArray kidsArray = parentDictionary.getCOSArray(COSName.KIDS);
        List<? extends COSBase> kidList = kidsArray.toList();
        for (COSBase kid : kidList) {
            if (kid instanceof COSObject) {
                COSObject kidObject = (COSObject) kid;
                COSBase type = kidObject.getDictionaryObject(COSName.TYPE);
                // Only leaf pages are collected; nested Pages nodes are skipped.
                if (COSName.PAGE.equals(type)) {
                    COSBase kidPageBase = kidObject.getObject();
                    if (kidPageBase instanceof COSDictionary) {
                        COSDictionary kidPageDictionary = (COSDictionary) kidPageBase;
                        pages.add(new PDPage(kidPageDictionary));
                    }
                }
            }
        }
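        // indexOf relies on PDPage.equals, which compares the wrapped COSDictionary.
        int index = pages.indexOf(page);
        if (index < 0) {
            throw new IllegalArgumentException("page not found among the kids of its parent");
        }
        return pages.listIterator(index);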
    }
}

Sample output

In this example the PDF document has 4 pages and the iterator was initialized with the first page. Notice that the page number is the previousIndex(), because next() has already moved the cursor past the page it returned.

System.out.println("listIterator(PDPage)");
ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
while (pageIterator.hasNext()) {
    PDPage page = pageIterator.next();
    System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
}

listIterator(PDPage)
page #: 0, Structural Parent Key: 68
page #: 1, Structural Parent Key: 69
page #: 2, Structural Parent Key: 70
page #: 3, Structural Parent Key: 71

You can also navigate backwards by starting from the last page. The initial next() call moves the cursor past that page so that previous() returns it first. Notice that the page number is now the nextIndex().

ListIterator<PDPage> pageIterator = listIterator(document.getPage(3));
pageIterator.next();
while (pageIterator.hasPrevious()) {
    PDPage page = pageIterator.previous();
    System.out.println("page #: " + pageIterator.nextIndex() + ", Structural Parent Key: " + page.getStructParents());
}

listIterator(PDPage)
page #: 3, Structural Parent Key: 71
page #: 2, Structural Parent Key: 70
page #: 1, Structural Parent Key: 69
page #: 0, Structural Parent Key: 68
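
The choice between previousIndex() and nextIndex() in the two loops above comes from how java.util.ListIterator tracks its cursor, and has nothing to do with PDFBox. Here is a small stand-alone sketch with plain strings in place of pages; the class and method names are made up for the illustration. Starting the backward walk at listIterator(size) is an alternative to calling next() once on the last element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

/** Illustration only: which index method names the element that was just returned. */
final class ListIteratorIndexDemo {
    /** Walks forward from start and records previousIndex() after each next(). */
    static List<Integer> forwardIndexes(List<String> items, int start) {
        List<Integer> seen = new ArrayList<>();
        ListIterator<String> it = items.listIterator(start);
        while (it.hasNext()) {
            it.next();
            seen.add(it.previousIndex()); // the cursor now sits after the returned element
        }
        return seen;
    }

    /** Walks backward from start and records nextIndex() after each previous(). */
    static List<Integer> backwardIndexes(List<String> items, int start) {
        List<Integer> seen = new ArrayList<>();
        ListIterator<String> it = items.listIterator(start);
        while (it.hasPrevious()) {
            it.previous();
            seen.add(it.nextIndex()); // the cursor now sits before the returned element
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("p0", "p1", "p2", "p3");
        System.out.println(forwardIndexes(pages, 0));             // [0, 1, 2, 3]
        System.out.println(backwardIndexes(pages, pages.size())); // [3, 2, 1, 0]
    }
}
```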