GLOBALISE - VOC Document Segmentation Dataset (hdl:10622/XMCZLZ)

Document Description

Citation

Title:

GLOBALISE - VOC Document Segmentation Dataset

Identification Number:

hdl:10622/XMCZLZ

Distributor:

IISH Data Collection

Date of Distribution:

2025-09-22

Version:

1

Bibliographic Citation:

Smit, Renate, 2025, "GLOBALISE - VOC Document Segmentation Dataset", https://hdl.handle.net/10622/XMCZLZ, IISH Data Collection, V1, UNF:6:6pNG8HfphDeciynT9YU+RA== [fileUNF]
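The UNF strings in this codebook (Universal Numerical Fingerprints) follow the pattern UNF:<version>:<base64 digest>. The sketch below splits one into its parts. It assumes the default form without extra parameters, which is the only form that appears here; the function name parse_unf is illustrative and not part of any Dataverse tool.

```python
import base64


def parse_unf(unf):
    """Split 'UNF:<version>:<base64 digest>' into (version, digest bytes).

    Assumption: no optional parameter block between version and digest.
    """
    prefix, version, digest = unf.split(":", 2)
    if prefix != "UNF":
        raise ValueError("not a UNF string: " + unf)
    # validate=True rejects characters outside the standard base64 alphabet
    return int(version), base64.b64decode(digest, validate=True)


# The dataset-level UNF from the bibliographic citation above.
version, digest = parse_unf("UNF:6:6pNG8HfphDeciynT9YU+RA==")
print(version, len(digest))
```

Parsing only checks that the string is well formed. Confirming that a downloaded file matches its UNF would require recomputing the fingerprint with a UNF implementation, which this sketch does not attempt.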

Study Description

Citation

Title:

GLOBALISE - VOC Document Segmentation Dataset

Identification Number:

hdl:10622/XMCZLZ

Authoring Entity:

Smit, Renate (Huygens Institute)

Distributor:

IISH Data Collection

Access Authority:

Pepping, Kay

Depositor:

Pepping, Kay

Date of Deposit:

2025-09-01

Holdings Information:

https://hdl.handle.net/10622/XMCZLZ

Study Scope

Keywords:

Arts and Humanities

Topic Classification:

archive, Dutch East India Company

Abstract:

This dataset contains detailed annotations of Dutch East India Company (VOC) archival documents based on the TANAP (Towards a New Age of Partnership) project. The dataset provides precise boundaries and classifications for documents within digitized archival volumes, serving as training data for machine learning approaches to historical document segmentation and classification. This work supports the broader goal of making VOC archives more accessible beyond traditional finding aids that often reflect colonial perspectives.

Other Study Description Materials

Related Publications

Citation

Bibliographic Citation:

Schnober, C., Smit, R., Kuruppath, M., Pepping, K., van Wissen, L., & Petram, L. (2024). Page Embeddings: Extracting and Classifying Historical Documents with Generic Vector Representations. In Proceedings of the Computational Humanities Research Conference 2024: Aarhus, Denmark, December 4-6, 2024 (Vol. 3834, pp. 999-1011). (CEUR Workshop Proceedings). https://ceur-ws.org/Vol-3834/paper73.pdf

File Description--f34885

File: 1120 - Document Segmentation.tab

  • Number of cases: 972

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:PaALHGHcxn8vZzdIoK338Q==

File Description--f34878

File: 1267 - Document Segmentation.tab

  • Number of cases: 1426

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:mbuRDVkEBgHKzZip9MFlbw==

File Description--f34896

File: 1274 - Document Segmentation.tab

  • Number of cases: 1835

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:0CRaDzC5pJN/To8jr+/5aA==

File Description--f34889

File: 1539 - Document Segmentation.tab

  • Number of cases: 1475

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:EfwH8NmoW4Itv6flCBnuow==

File Description--f34898

File: 1547 - Document Segmentation.tab

  • Number of cases: 690

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:HY7HTyDWzOhdpb0O4TYIPQ==

File Description--f34884

File: 1557 - Document Segmentation.tab

  • Number of cases: 1778

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:AtuvqqF04kp2zCeUsHK6Mg==

File Description--f34880

File: 2448 - Document Segmentation.tab

  • Number of cases: 2656

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:/jd92CDEedNui+xXLnJYDw==

File Description--f34886

File: 2548 - Document Segmentation.tab

  • Number of cases: 2419

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:fpxgSlWx2cLLT64IDyoP+Q==

File Description--f34881

File: 2555 - Document Segmentation.tab

  • Number of cases: 1105

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:r9a3wBxqJMfJx+cMI32S9g==

File Description--f34882

File: 2775 - Document Segmentation.tab

  • Number of cases: 707

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:um34/99KE/qyz4SrYcHGMg==

File Description--f34887

File: 3142 - Document Segmentation.tab

  • Number of cases: 754

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:XpiE2Tq7UgwCxZmu5aKeVQ==

File Description--f34888

File: 3891 - Document Segmentation.tab

  • Number of cases: 856

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:HiQtpXQp6H0YB6g8WYVHTQ==

File Description--f34895

File: 7923 - Document Segmentation.tab

  • Number of cases: 180

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:GOJcy4ef7RrxF7beQlWc5Q==

File Description--f34897

File: 8023 - Document Segmentation.tab

  • Number of cases: 156

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:BJ3ukW8BMVUg5oCWmk06Ig==

File Description--f34891

File: 8121 - Document Segmentation.tab

  • Number of cases: 735

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:pJbJM07imrdbYvheLKJl2Q==

File Description--f34879

File: 8237 - Document Segmentation.tab

  • Number of cases: 211

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:cIuvOf6jG28J3Gj4eZe/9Q==

File Description--f34893

File: 8276 - Document Segmentation.tab

  • Number of cases: 182

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:p0nklJvNDQsJZbEcYgTA7w==

File Description--f34883

File: 8284 - Document Segmentation.tab

  • Number of cases: 228

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:WulLRLNd1GHvC6YXJNxpWg==

File Description--f34892

File: 8697 - Document Segmentation.tab

  • Number of cases: 236

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:/P0ICj2jlWffC4BpSZTM3g==

File Description--f34890

File: 8834 - Document Segmentation.tab

  • Number of cases: 659

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:hERnkAlAKFo5YYdwWoj1DQ==
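As a quick cross-check of the file descriptions above, the following sketch totals the reported case counts. The counts are copied from this codebook. Reading the leading number of each file name as an archival inventory number is an assumption, and the helper leading_number is illustrative only.

```python
# Case counts copied from the file descriptions, keyed by the leading
# number of each "<number> - Document Segmentation.tab" file name.
CASES = {
    1120: 972, 1267: 1426, 1274: 1835, 1539: 1475, 1547: 690,
    1557: 1778, 2448: 2656, 2548: 2419, 2555: 1105, 2775: 707,
    3142: 754, 3891: 856, 7923: 180, 8023: 156, 8121: 735,
    8237: 211, 8276: 182, 8284: 228, 8697: 236, 8834: 659,
}


def leading_number(file_name):
    """Return the number before ' - ' in a file name (assumed inventory number)."""
    return int(file_name.split(" - ", 1)[0])


total_cases = sum(CASES.values())
largest = max(CASES, key=CASES.get)
smallest = min(CASES, key=CASES.get)

print(len(CASES), "files,", total_cases, "cases in total")
print("largest:", largest, "smallest:", smallest)
```

The codebook does not say whether a file's header line is counted as a case, so the total is best read as an upper bound on the number of annotated rows.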

Variable Description

List of Variables:

Variables

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34885 Location:

Variable Format: character

Notes: UNF:6:PaALHGHcxn8vZzdIoK338Q==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34878 Location:

Variable Format: character

Notes: UNF:6:mbuRDVkEBgHKzZip9MFlbw==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34896 Location:

Variable Format: character

Notes: UNF:6:0CRaDzC5pJN/To8jr+/5aA==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34889 Location:

Variable Format: character

Notes: UNF:6:EfwH8NmoW4Itv6flCBnuow==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34898 Location:

Variable Format: character

Notes: UNF:6:HY7HTyDWzOhdpb0O4TYIPQ==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34884 Location:

Variable Format: character

Notes: UNF:6:AtuvqqF04kp2zCeUsHK6Mg==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34880 Location:

Variable Format: character

Notes: UNF:6:/jd92CDEedNui+xXLnJYDw==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page;;

f34886 Location:

Variable Format: character

Notes: UNF:6:fpxgSlWx2cLLT64IDyoP+Q==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34881 Location:

Variable Format: character

Notes: UNF:6:r9a3wBxqJMfJx+cMI32S9g==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34882 Location:

Variable Format: character

Notes: UNF:6:um34/99KE/qyz4SrYcHGMg==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34887 Location:

Variable Format: character

Notes: UNF:6:XpiE2Tq7UgwCxZmu5aKeVQ==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34888 Location:

Variable Format: character

Notes: UNF:6:HiQtpXQp6H0YB6g8WYVHTQ==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34895 Location:

Variable Format: character

Notes: UNF:6:GOJcy4ef7RrxF7beQlWc5Q==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34897 Location:

Variable Format: character

Notes: UNF:6:BJ3ukW8BMVUg5oCWmk06Ig==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34891 Location:

Variable Format: character

Notes: UNF:6:pJbJM07imrdbYvheLKJl2Q==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34879 Location:

Variable Format: character

Notes: UNF:6:cIuvOf6jG28J3Gj4eZe/9Q==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34893 Location:

Variable Format: character

Notes: UNF:6:p0nklJvNDQsJZbEcYgTA7w==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34883 Location:

Variable Format: character

Notes: UNF:6:WulLRLNd1GHvC6YXJNxpWg==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34892 Location:

Variable Format: character

Notes: UNF:6:/P0ICj2jlWffC4BpSZTM3g==

Scan File_Name;TANAP Boundaries;TANAP ID;Subdocument boundaries;Type of non-document page

f34890 Location:

Variable Format: character

Notes: UNF:6:hERnkAlAKFo5YYdwWoj1DQ==
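Each file is described as tab-separated with a single character variable whose name contains five semicolon-separated labels. This suggests the rows are semicolon-delimited inside one tab column. The sketch below splits them under that assumption. The sample rows are made up for illustration and are not taken from the dataset, and split_records is a hypothetical helper name.

```python
import csv
import io

# Field names taken from the single variable name listed in this codebook.
HEADER = ("Scan File_Name;TANAP Boundaries;TANAP ID;"
          "Subdocument boundaries;Type of non-document page")
COLUMNS = HEADER.split(";")


def split_records(tab_text):
    """Split each one-column row on ';' and map it to the five named fields."""
    records = []
    for i, row in enumerate(csv.reader(io.StringIO(tab_text), delimiter="\t")):
        if not row:
            continue  # skip blank lines
        cells = row[0].split(";")
        if i == 0 and cells[:len(COLUMNS)] == COLUMNS:
            continue  # header line, with or without trailing ';;'
        # Pad short rows and drop surplus trailing fields.
        cells = (cells + [""] * len(COLUMNS))[:len(COLUMNS)]
        records.append(dict(zip(COLUMNS, cells)))
    return records


# Made-up sample in the assumed layout (values are placeholders).
SAMPLE = (
    '"' + HEADER + ';;"\n'
    '"scan_0001;START;example-id;;;;"\n'
    '"scan_0002;;;;blank"\n'
)

for record in split_records(SAMPLE):
    print(record)
```

Surplus trailing fields are dropped because the variable name for f34886 ends in two extra semicolons, which hints at empty trailing columns in at least one file. Values that themselves contain a semicolon would be split incorrectly, so it is worth inspecting a few rows before relying on this.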

Other Study-Related Materials

Label:

README - GLOBALISE - VOC Document Segmentation Dataset.pdf

Notes:

application/pdf