Analyze CPQ Data Tables with Pandas DataFrames

March 29, 2024 | 5 minute read

Introduction

If you haven't used pandas before, get ready to use one of the most important Python packages available for data analysis. Pandas is the standard package for data manipulation and exploratory data analysis. Because pandas can read from and write to a large number of formats, it's a powerful tool for data scientists and the data science curious. 

Keep in mind that pandas is often used in conjunction with other packages and tools: machine learning packages like scikit-learn for producing predictions on your data, data visualization packages like Matplotlib and seaborn, and tools like Jupyter Notebooks for creating and sharing reports that combine code, visualizations, and text. Together these tools comprise the PyData ecosystem.

Let's get started unlocking the data analysis potential of your CPQ data tables.

CPQ Data Table Export

To get started, you need to understand the CPQ Data Table export format. CPQ exports data tables in CSV format, but with a few rows of metadata at the beginning of the file. While the extra rows complicate how you read the file, they contain useful information that we'll put to work.

Here's an example exported CPQ Data Table CSV.

_start meta data
_update_action,Field1,Field2,Field3,Field4
String,String,Integer,String,Float
,Field1,Field2,Field3,Field4
_end meta data
modify,1070001002,888116,DO244,1.0
modify,1070001003,988116,DO245,2.0

Our goal is to read this file into a DataFrame using the pandas.read_csv() method.
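Before writing any parsing code, it can help to peek at the first few metadata lines of an export in plain Python. This is just a quick sketch; the file name below is a placeholder for your own export.

# Print the five metadata lines at the top of an exported table (file name is hypothetical)
with open('cpq_table_export.csv', mode='r', encoding='utf-8') as f:
    for _ in range(5):
        print(f.readline().rstrip())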

Define Column Metadata

The metadata included in the exported file is very useful for setting column names and data types in our pandas DataFrame. To do so we need to parse the lines, translate the CPQ data types, and combine the result into a dictionary that pandas understands.

The pandas.read_csv() method contains a parameter, dtype, that enables you to specify the data types to apply to either the whole dataset or individual columns. You can use the following syntax to specify the dtype of each column in a DataFrame when importing a CSV file into pandas:

df = pd.read_csv('my_data.csv', dtype = {'Field1': str, 'Field2': int,  'Field3': str, 'Field4': float})

To start, we need to read the first metadata lines of the file and parse lists for column names and data types, which we zip together into a dictionary. In order to populate our dtype dictionary correctly, we also need to translate the CPQ data types to numpy data types.

The below sample code will handle these tasks.

def translate_data_type(data_types):
    # Translate CPQ data table types to numpy data types
    # CPQ data table type 'String' = numpy 'string'
    # CPQ data table type 'Float' = numpy 'float'
    # CPQ data table type 'Integer' = numpy 'int'
    return ['int' if val == 'Integer' else val.lower() for val in data_types]

def get_metadata(f):
    # Get column names and data types zipped together into a dictionary
    headers = []
    data_types = []
    for i, line in enumerate(f):
        if i == 0: # Row contains _start meta data
            continue
        elif i == 1:
            headers = line.replace('\n','').split(',')
        elif i == 2:
            data_types = line.replace('\n','').split(',')
        elif i > 2:
            break
    return dict(zip(headers, translate_data_type(data_types)))
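
To make this concrete, calling get_metadata() on the sample export shown earlier should produce a dictionary like the one below. Note that the '_update_action' column is included here; we'll drop it when reading the file in the next step.

# Expected dtype dictionary for the sample export above
{
    '_update_action': 'string',
    'Field1': 'string',
    'Field2': 'int',
    'Field3': 'string',
    'Field4': 'float'
}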

 

Read CSV File into DataFrame

Now that we've done the hard work of parsing the metadata to populate the dtype dictionary, the next step is to read the file into our DataFrame.

There are a couple of details to be aware of in the sample code below. First, we skip the five metadata rows with skiprows=5, so the import starts at the first data row. Also, the column names are specified via the names parameter, which accepts a list of column names that we already have handy as the keys of our dtype dictionary.

Finally, data tables exported from CPQ with natural keys enabled contain an additional column named '_update_action', which appears as the first column. To ignore that column, use the usecols parameter to specify the range of columns beginning after the first column.

import pandas as pd

with open(file_name, mode='r', encoding='utf-8') as f:
    col_metadata = get_metadata(f)
    # Rewind to the start of the file before handing it to read_csv
    f.seek(0)
    # Read CSV file, include column names and data types, skip the 5 metadata rows
    # Skip first column. For Data Tables with Natural Keys enabled, a column called '_update_action' is the first column created
    df = pd.read_csv(f, sep=',', names=list(col_metadata.keys()), dtype=col_metadata, skiprows=5, usecols=range(1, len(col_metadata)))
    df.info()
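
If you loaded the sample export from earlier, a quick sanity check confirms the metadata was applied as expected:

# Spot-check the imported DataFrame (assumes the sample export shown earlier)
print(df.head())    # the two 'modify' rows, minus the _update_action column
print(df.dtypes)    # Field2 should be an integer type and Field4 a float type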

Analyzing your Data

Now to do something interesting with your data. Pandas enables you to create reports and visualizations but we'll focus on a simpler task: determining the selectivity of your columns. Selectivity is the ratio of the number of distinct values in a column to the number of records in the table and is a great guide to whether your column and its data would benefit from an index.

An example of good selectivity: if a table has 100,000 records and one of its columns has 88,000 distinct values, the selectivity of an index on that column is 88,000 / 100,000 = 0.88.

An example of bad selectivity: if a table has 100,000 records and a column has only 500 distinct values, the index's selectivity is 500 / 100,000 = 0.005. In this case a full table scan is more efficient than using such an index, because repeatedly scanning both the index and the table requires far more I/O.

# Total number of rows
num_rows = len(df)
print(f'numRows: {num_rows}')

# Get unique values for each column
unique_values = pd.Series({c: df[c].unique() for c in df})

# For each column...
for col, values in unique_values.items():
    # Print column details
    print(f'name: {col}, nunique: {len(values)}, selectivity: {(len(values)/num_rows)}')

The above sample code uses a dictionary comprehension and the pandas.Series.unique() method to collect the unique values for each column. The selectivity of each column is then calculated by dividing the number of unique values by the total number of rows.
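
As an aside, pandas also provides a built-in DataFrame.nunique() method that computes the same per-column counts, so a more compact version of the calculation could look like the sketch below. Keep in mind that nunique() excludes missing values by default, so its counts can differ slightly from Series.unique() if a column contains NaN.

# Alternative: compute per-column selectivity in one step with DataFrame.nunique()
selectivity = df.nunique() / len(df)
print(selectivity)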

Summary

That's all you need to get started using pandas DataFrames and to begin your data analysis journey with all of your CPQ data tables. Happy hunting!

~sn

Shea Nolan

