Defining Your Segmentation Approach
Your segmentation approach determines how data is stored in the E3C storage repository. E3C storage has the following logical and physical layers:
• Segment layer – a physical separation, or content holder, for one or more collections.
• Collection layer – a logical way to divide and load data into PTC Arbortext Content Delivery. A collection is an aggregation of the different contexts present in the bundle.
When adding data sources to PTC Arbortext Content Delivery, you must decide into which collection the sources will be loaded. A collection usually represents a basic unit that comes as a bundle from the authoring system. Because the collection level is a logical layer, it has no impact on how data is physically stored. The segment layer is the physical layer in which all of the sources are stored in the E3C storage.
When content is published to the Viewer servers, the content is divided into segments within the E3C Storage to maintain acceptable search performance and to minimize the impact of input and output operations. Developing a segmentation plan is highly dependent on the published authored content.
You should take several considerations into account for the segmentation plan. The following sections provide details on these considerations to help you decide how to split your data into segments.
The Number and Size of Sources Per Segment
The number of data sources is one of the major considerations that affects the number of segments created in the system. There is no limit on the total size of the E3C storage. However, there is a limit on the number of words and phrases, based on their occurrences, that can be stored inside a segment.
An occurrence is a number that is associated with every word (and every opening and closing element in an XML document) in the data. A core segment is limited to 2 GB of occurrences. Approaching the maximum occurrence capacity for a segment degrades both viewer performance (for example, when performing a search) and incremental update performance. In general, the recommended number of words (occurrences) in a segment is up to 500 million (0.5 GB).
To plan the number and size of sources per segment, you must analyze the data and identify the number of words. In addition to this number, you must allow a buffer for incremental data loading. Based on an analysis of a variety of data samples, the following approach is recommended for determining your segments.
Ideally, filling 25% to 50% of a segment's occurrence capacity (about 500 MB to 1 GB of occurrences) is the goal when deciding what data goes into a segment. This number should not be too low, as you might end up with too many segments and their associated overhead. You also do not want the segment to be too full, as that impacts performance and gets too close to the segment limit.
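The fill target above can be sketched as a simple check. This is a minimal illustration, assuming (as stated above) a 2 GB occurrence limit per core segment and treating 0.5 GB as 500 million occurrences; the function and sample counts are hypothetical.

```python
# A minimal occupancy check against the 25-50% fill target described above.
# Assumptions: 2 GB hard limit = 2 billion occurrences (the document equates
# 500 million occurrences with 0.5 GB).
SEGMENT_CAPACITY = 2_000_000_000       # hard limit: 2 GB of occurrences
TARGET_LOW, TARGET_HIGH = 0.25, 0.50   # ideal fill range for a segment

def segment_fill(occurrences):
    """Return the fraction of a segment's occurrence capacity that is used."""
    return occurrences / SEGMENT_CAPACITY

# Hypothetical measured occurrence counts for three candidate segments:
for occ in (100_000_000, 750_000_000, 1_500_000_000):
    fill = segment_fill(occ)
    verdict = ("below the target range" if fill < TARGET_LOW
               else "above the target range" if fill > TARGET_HIGH
               else "in the 25-50% target range")
    print(f"{occ:>13,} occurrences -> {fill:.0%} ({verdict})")
```

A count near the low end of the target still leaves room for incremental loads, which is why the buffer mentioned above matters.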
The following tables provide a rough estimate of how many occurrences each data type typically contributes. The percentages below are relative to 100% of a segment's capacity.
The results based on the number of files (with a granularity of around 1000 files where possible) are:
| Data Type | Number of Files | Occurrence Contribution | Occurrence Contribution (%) |
| --- | --- | --- | --- |
| PartsList | 1042 (2084 with XMD) | 749364 | 0.0375 |
| PDF | 906 | 41093041 | 2 |
| IEXML | 1000 | 2833986 | 0.14 |
The results based on the disk size (with a granularity of 10 MB where possible) are:

| Data Type | Size | Occurrence Contribution | Occurrence Contribution (%) |
| --- | --- | --- | --- |
| PartsList | 10 MB | 277542 | 0.0138 |
| PDF | 10 MB | 37020 | 0.002 |
| IEXML | 10 MB | 1190750 | 0.06 |
It is recommended that you calculate estimates for your data using both tables, and then, to be on the safe side, plan with either the average of the two results or the more conservative one.
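The two-table calculation can be sketched as follows. The per-file and per-10-MB contributions are taken directly from the sample measurements in the tables above; your own content will differ, so treat these constants as placeholders to be replaced with measured values. The `estimate` function and the example data set are illustrative.

```python
# Rough per-segment occurrence estimate using both tables above.
RECOMMENDED = 500_000_000  # recommended occurrence ceiling per segment

# Occurrences per single file, derived from the first table's sample data:
PER_FILE = {
    "PartsList": 749364 / 1042,
    "PDF": 41093041 / 906,
    "IEXML": 2833986 / 1000,
}
# Occurrences per 10 MB, from the second table's sample data:
PER_10MB = {"PartsList": 277542, "PDF": 37020, "IEXML": 1190750}

def estimate(file_counts, sizes_mb):
    """Return (by_file_count, by_disk_size) occurrence estimates."""
    by_files = sum(PER_FILE[t] * n for t, n in file_counts.items())
    by_size = sum(PER_10MB[t] * (mb / 10) for t, mb in sizes_mb.items())
    return by_files, by_size

# Hypothetical data set: 5000 IEXML files (50 MB) plus 2000 PDFs (20 GB).
by_files, by_size = estimate({"IEXML": 5000, "PDF": 2000},
                             {"IEXML": 50, "PDF": 20_000})
for label, est in (("by file count", by_files), ("by disk size", by_size)):
    print(f"{label}: {est:,.0f} occurrences "
          f"({est / RECOMMENDED:.0%} of the recommended budget)")
```

Comparing the two printed estimates (and their average) against the recommended budget gives a quick first-pass answer to whether a planned segment is over- or under-filled.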
Taking into account that different XML types have different indexing definitions, the following per-segment data sizes are recommended, based on the IEXML type:
| Data Type | Data Size |
| --- | --- |
| XML (PartsList, IEXML) | 5-7 GB |
| PDF | 150-200 GB |
If you want to mix data types, you can use a relative portion of files. For example, 3 GB of XML data and 80 GB of PDF data.
If the data size exceeds the limits in the table, your data should probably be split across several segments. For example, if you have 20 GB of XML and 500 GB of PDF, you will probably need six segments.
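The segment-count estimate in the example above can be sketched as a small calculation. This assumes the upper end of each recommended range from the table (7 GB for XML, 200 GB for PDF) and keeps XML and PDF in separate segments; both assumptions are inferred from the example, not stated as rules in this document.

```python
import math

# Upper ends of the recommended per-segment sizes from the table above;
# assumed values chosen so the calculation reproduces the document's example.
XML_PER_SEGMENT_GB = 7.0
PDF_PER_SEGMENT_GB = 200.0

def segments_needed(xml_gb, pdf_gb):
    """Estimate segment count, keeping XML and PDF in separate segments."""
    return (math.ceil(xml_gb / XML_PER_SEGMENT_GB)
            + math.ceil(pdf_gb / PDF_PER_SEGMENT_GB))

# The example above: 20 GB XML -> 3 segments, 500 GB PDF -> 3 segments.
print(segments_needed(20, 500))  # prints 6
```

Using the lower ends of the ranges instead would yield a more conservative (larger) segment count, which may be preferable if heavy incremental loading is expected.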
References Between Documents
PTC Arbortext Content Delivery enables you to link from one data source to another (document, image, and so forth) using predefined links between the sources. Links work only within the same segment; links to sources in other segments do not work in the viewer. Therefore, linked sources must be loaded into the same segment.
Types of Data
The type of data also has an impact on the size of the segment. For example, scanned PDF documents have a minor impact on the segment size. Each such document is accompanied by a properties file inside the storage, but very few occurrences are indexed in this case.
Therefore, you must analyze the data to understand the different data types in the segment.
Searching Across Multiple Segments
Search across segments is handled by the business logic layer. Searching across multiple segments is less efficient, as the search runs in each segment separately and the layer then unifies and sorts the separate results into one search results list. The fewer segments there are in the system, the more efficient searches will be.
When defining the segmentation, keep the number of segments as small as possible given the other considerations.
The Number of Shared Documents Between Collections
Shared documents are documents that are loaded in more than one collection. In Shared mode, PTC Arbortext Content Delivery stores only one copy of the shared document per segment no matter how many collections in the segment include this source.
If you have collections with many shared documents, it is recommended that you load them into a single segment to reduce the number of copies of these documents.
Offline Segments
Offline packages are created per segment. This means that full segments are distributed to an offline system along with all of their associated collections. Collections that should not be distributed in the same offline package must therefore be split into different segments.