Sanity Testing Baseline Updates using Endeca CAS

Introduction

When working with Endeca indexes, customers frequently ask whether it is possible to insert quality checks into the baseline update procedure, so that the update does not proceed if the checks fail. In this blog I will demonstrate a technique using the Endeca CAS API that allows customers to validate the data in a Record Store, ensuring both the quantity and quality of the data prior to performing a baseline update. This technique can be very useful for ATG customers, for example, who encounter incomplete or incorrect catalog data in their MDEX. With a few simple validation rules, you can verify the integrity of your source data before any changes are made, and avoid detecting issues late in the process.

Main Article

In Endeca, the Content Acquisition System (CAS) is responsible for collecting input data from a set of well-defined data sources, transforming the data if necessary, and merging it before passing it off to the indexer (Dgidx). The CAS output, by default, is persisted to a generational, flat file structure known as a Record Store. Each record in a RecordStore consists of an unordered collection of name/value pairs, and has no structural relationships to any other records. Consequently, Endeca RecordStores cannot be queried like relational databases. Fortunately, the CAS RecordStore API does provide methods for iteratively inspecting individual records of a RecordStore, and in this blog I will demonstrate how this API can be used to validate the integrity of the data before proceeding to index.

To start, let’s look at a simple example of how to connect to the CAS Service, locate a RecordStore, and ensure that it contains more than 100 records:

import com.endeca.itl.record.Record;
import com.endeca.itl.recordstore.*;

public class RecordStoreValidator {
    public static void main(String[] args) {
        int count = 0;
        if (args.length != 3) {
            System.out.println("usage: <cas host> <cas port> <rs name>");
            System.exit(-1);
        }
        String casHost = args[0];
        int casPort = Integer.parseInt(args[1]);
        String rsName = args[2];
        RecordStoreLocator locator = RecordStoreLocator.create(casHost, casPort, rsName);
        RecordStore recordStore = locator.getService();
        TransactionId tid = null;
        RecordStoreReader reader = null;
        try {
            tid = recordStore.startTransaction(TransactionType.READ);
            reader = RecordStoreReader.createBaselineReader(recordStore, tid);
            // Count every record in the baseline.
            while (reader.hasNext()) {
                Record record = reader.next();
                count++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (reader != null) {
                    reader.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            // Roll back so the client read state is left unchanged,
            // even if an exception occurred while reading.
            if (tid != null) {
                try {
                    recordStore.rollbackTransaction(tid);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        System.out.println((count > 100) ? "SUCCESS" : "FAILURE");
    }
}

In the above code, RecordStoreLocator is used to connect to the CAS Service, and locate the named RecordStore. Then RecordStoreReader is used to create a baseline reader and iterate over all records while maintaining a total count. To avoid any changes to the client read state, the transaction is rolled back. Finally, if the total record count is greater than 100, a message of SUCCESS is displayed.

In order to run the example, you’ll need to identify a RecordStore of interest first. If none are available, you can either configure a file system crawl, or deploy one of the sample apps. You can retrieve a list of available RecordStores using the component-manager-cmd, like so:

component-manager-cmd.bat list-components

To count the number of records in a RecordStore, you can use:

recordstore-cmd.bat read-baseline -a Discover-data -c

When you’ve identified a RecordStore of interest, pass its name as an argument along with the host name and port number of the CAS Service:

java RecordStoreValidator localhost 8500 Discover-data

If the record count is greater than 100, you should see a SUCCESS message. You could then amend the program to call System.exit() with a value of 0 for success or 1 for failure, invoke it from the baseline_update script, and only proceed with the BaselineUpdate if the exit status is 0.
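The exit-status wiring might look like the following sketch, with the threshold of 100 matching the earlier example (the count would in practice be the value accumulated while reading the RecordStore):

```java
// Sketch: translate the record-count check into a process exit code that
// the baseline_update script can test (0 = success, non-zero = failure).
public class ExitStatusDemo {
    static final int THRESHOLD = 100;

    // Returns the exit code for a given record count.
    static int exitCodeFor(int count) {
        return (count > THRESHOLD) ? 0 : 1;
    }

    public static void main(String[] args) {
        // In practice, count comes from iterating the RecordStore.
        int count = Integer.parseInt(args.length > 0 ? args[0] : "0");
        System.out.println(exitCodeFor(count) == 0 ? "SUCCESS" : "FAILURE");
        System.exit(exitCodeFor(count));
    }
}
```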

That would be a simple solution to the problem, but as the complexity of the validation logic grows, such a solution might become difficult to maintain. One way to improve on it is to create an interface for validation tasks. Such separation would allow for more complex validation logic, as well as the ability to selectively enable which validations you would like to perform. Attached below is the source code to such a solution. Rather than accumulate all validation logic into a single class, an abstract class for validation tasks is defined. By extending this class and implementing its abstract methods, you can focus on the validation tasks necessary for your application. Using this class, the above example can be rewritten as:
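The base class might look roughly like the sketch below, inferred from the methods used in this post (checkRecord, doRecordsPass, setFailureMessage); the actual class in the attached source may differ, and Object stands in for com.endeca.itl.record.Record so the sketch compiles standalone:

```java
// Sketch of a validation-task base class. Subclasses implement
// checkRecord() (called once per record) and doRecordsPass() (called
// after all records have been read).
public abstract class RecordStoreValidationTask {
    private String failureMessage;

    // Invoked once per record; return false to fail this task immediately.
    public abstract boolean checkRecord(Object record);

    // Invoked after all records are read; return false to fail this task.
    public abstract boolean doRecordsPass();

    protected void setFailureMessage(String message) {
        this.failureMessage = message;
    }

    public String getFailureMessage() {
        return failureMessage;
    }
}
```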

import com.endeca.itl.record.Record;

public class MinimumRecordThreshold extends RecordStoreValidationTask {
    private int count = 0;
    private int minRecordThreshold;

    public MinimumRecordThreshold(int minRecordThreshold) {
        this.minRecordThreshold = minRecordThreshold;
    }

    @Override
    public boolean checkRecord(final Record record) {
        count++;
        return true;
    }

    @Override
    public boolean doRecordsPass() {
        if (count < minRecordThreshold) {
            setFailureMessage("Insufficient records in RecordStore.");
            return false;
        }
        return true;
    }
}

The task is then added to the RecordStoreValidator like so:

RecordStoreValidator validator = new RecordStoreValidator("localhost", 8500, "Discover-data");
validator.addValidationTask(new MinimumRecordThreshold(100));

With this code, the validator will first call checkRecord() for each record in the specified RecordStore. After all records have been processed, the validator will call doRecordsPass() to determine whether the validation task was successful or not. Since record-level validation is not necessary for the minimum threshold test, the checkRecord() method just increments the count and returns true. If record-level validation were necessary, returning false would cause the validation task to fail.
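As an illustration of record-level validation, a hypothetical task that rejects records missing a required property could look like this (a java.util.Map stands in for Endeca's Record class so the sketch is self-contained; with the real API you would extend RecordStoreValidationTask and read the property from com.endeca.itl.record.Record):

```java
import java.util.Map;

// Hypothetical record-level check: fail as soon as a record is missing a
// required property. A Map stands in for Endeca's Record class here.
public class RequiredPropertyCheck {
    private final String propertyName;
    private String failureMessage;

    public RequiredPropertyCheck(String propertyName) {
        this.propertyName = propertyName;
    }

    // Mirrors checkRecord() in the validation-task pattern: returning
    // false fails the task at the offending record.
    public boolean checkRecord(Map<String, String> record) {
        String value = record.get(propertyName);
        if (value == null || value.isEmpty()) {
            failureMessage = "Record is missing required property: " + propertyName;
            return false;
        }
        return true;
    }

    public String getFailureMessage() {
        return failureMessage;
    }
}
```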

To integrate the RecordStoreValidator into the baseline update script, edit DataIngest.xml to include the following:

<script id="RecordStoreValidation">
 <bean-shell-script>
    <![CDATA[ 
      RecordStoreValidator validator = new RecordStoreValidator();
      validator.setRecordStoreName("Discover-data");
      validator.addValidationTask(new MinimumRecordThreshold(100));
      boolean success = validator.runAll();
      if (!success) {
        throw new Exception("RecordStore Validation Failed!");
      }
    ]]>
  </bean-shell-script>
</script>

Then, call it in the BaselineUpdate script, prior to acquiring a lock to begin the update:

<script id="BaselineUpdate">
  ...
  // run validations
  RecordStoreValidation.run();

  // obtain lock
  ...
</script>

If validation fails, an exception is thrown and the baseline update script terminates before the update process begins, preventing a baseline update when the RecordStore data does not pass validation.

To make RecordStoreValidator available to your BeanShell scripts, you will need to copy recordstore-validator-1.0.jar to the application directory config/lib/java, and modify the beanshell.imports file to include the following line:

import com.oracle.ateam.endeca.cas.validation.*;

You will also need to add the following line to runcommand.bat for runtime support:

set CLASSPATH=%CLASSPATH%;%ENDECA_ROOT%\..\..\CAS\11.1.0\lib\recordstore-api\*

And for logging you will need to copy slf4j-jdk14-1.5.2.jar to config/lib/java, and add the following line to logging.properties:

com.oracle.ateam.endeca.cas.validation=DEBUG

Then run the baseline_update script; if the conditions specified in your validation tasks are not met, you should see a message sequence similar to the following:

INFO: Starting baseline update script.
INFO: Opening CAS connection to record store 'Discover-data' on localhost:8500
SEVERE: Validation task 'MinimumRecordThreshold-1' failed with message "Insufficient records in RecordStore. Expected 8000, but found only 5684."
INFO: Completed 1 validation tasks in 3499ms
INFO: 1 out of 1 validations failed
INFO: Validation status: FAILURE
SEVERE: RecordStore Validation Failed!

Feel free to improve the code as you see fit. Keep in mind, however, that since RecordStores are flat file structures with limited query capabilities, iterating over all records in a record store to check a set of validation constraints can be time-consuming. Avoid overly complicated validation tasks, and filter out records that do not need to be validated.
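For example, a price check could apply a cheap filter first, so that only relevant records pay the cost of validation (the property names are hypothetical, and a Map again stands in for Endeca's Record class):

```java
import java.util.Map;

// Hypothetical price check that filters first: records that are not
// products are passed through untouched, so only product records incur
// the cost of the actual validation. Property names are illustrative.
public class ProductPriceCheck {
    private String failureMessage;

    public boolean checkRecord(Map<String, String> record) {
        // Cheap filter: skip anything that is not a product record.
        if (!"product".equals(record.get("record.type"))) {
            return true;
        }
        String price = record.get("product.price");
        if (price == null || Double.parseDouble(price) <= 0.0) {
            failureMessage = "Product record with missing or non-positive price";
            return false;
        }
        return true;
    }

    public String getFailureMessage() {
        return failureMessage;
    }
}
```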

Source Code

The attached source code requires Gradle, Maven, and Java 7 SDK to build. Once extracted, edit scripts/mvn_install.bat to point to your Endeca installation directory. Then run the script to install the dependent libraries into a local Maven repository. Finally, run “gradlew build” to build recordstore-validator-1.0.jar, and “gradlew javadoc” to build the javadocs.

RecordStoreValidatorSource

Special thanks to Greg Eschbacher for the validator idea and for providing the initial implementation of the RecordStoreValidator code.

Comments

  1. Subrata,

    The ‘get-last-read-generation’ task will return the last generation that was crawled. For example, in the Discover CAS example, if you start with a baseline, and then run ‘load_partial_test_data.bat’, you will see that even though there are two generations, the ‘get-last-read-generation’ task returns a value of 1. However, if you then run ‘partial_update.bat’ followed by ‘recordstore-cmd get-last-read-generation -a Discover-data -c Discover-last-mile-crawl’, you will see that the last read generation task now returns a value of 2. If you then run ‘recordstore-cmd.bat read-delta -a Discover-data -s 1 -e 2 -c’ you should see a difference of 7 records. So, you should be able to use ‘get-last-read-generation’ and ‘get-last-committed-generation’, as you described, to find the number of delta records.

    However, if your interpretation of get-last-read-generation is something different, then you can always use the ‘set-last-read-generation’ task to set a value of your own. The client id of set-last-read-generation is just a string of your own choosing, so just set one manually using ‘recordstore-cmd set-last-read-generation -a Discover-data -c <someString> -g <genId>’. To list existing client states, you can use ‘recordstore-cmd list-client-states -a Discover-data’.

    Lastly, I just wanted to point out that all of this can also be done programmatically through the RecordStore interface:
    https://docs.oracle.com/cd/E55325_01/CAS.110/apidoc/recordstore-javadoc/index.html

    Hope that helps!

  2. Subrata Ghosh says:

    Hi Jim,
    I was trying something similar a few days back. As a checkpoint for the baseline update, I used the command ‘recordstore-cmd.bat read-baseline -a Discover-data -c’ to get the number of records and then compared it with the number of records in my last backup copy. But I stumbled on implementing the same for the partial update. First I tried ‘recordstore-cmd read-delta -a Discover-data -c’, but it didn’t provide the actual delta I was looking for; rather, it counts all records from the available generations. It does work when I provide the start-generation and end-generation parameters. I can get the end-generation by running ‘get-last-committed-generation’, but I am stuck on determining the start-generation. I thought the ‘get-last-read-generation’ command would help find the start-generation, but that requires running ‘set-last-read-generation’ with the generation id.
    So, still looking for a way to find the number of delta records since the last read job (baseline or partial). Would like to hear if you have some suggestions!

Add Your Comment