If you haven't used pandas before, get ready to use one of the most important Python packages available for data analysis.
Let's get started unlocking the data analysis potential of your CPQ data tables.
In order to get started, you need to understand the CPQ Data Table export format. CPQ exports data tables in CSV format, but with a couple of rows of metadata at the beginning of the file. While these extra rows complicate how you read the file, they contain useful information that we'll put to work.
Here's an example exported CPQ Data Table CSV.
_start meta data
_update_action,Field1,Field2,Field3,Field4
String,String,Integer,String,Float
,Field1,Field2,Field3,Field4
_end meta data
modify,1070001002,888116,DO244,1.0
modify,1070001003,988116,DO245,2.0
Our goal is to read this file into a DataFrame using the pandas.read_csv() function.
The metadata included in the exported file is very useful for setting column names and data types in our pandas DataFrame. To do so we need to parse the lines, translate the CPQ data types, and combine the result into a dictionary that pandas understands.
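To make the target concrete, here is a sketch of the dictionary we want to end up with for the sample export above (the 'string'/'int'/'float' values are the dtype aliases pandas accepts; the field names come from the sample metadata):

```python
# Illustrative target: column name -> pandas dtype alias, built from the metadata rows
dtype_map = {
    '_update_action': 'string',
    'Field1': 'string',
    'Field2': 'int',
    'Field3': 'string',
    'Field4': 'float',
}
print(dtype_map['Field2'])  # int
```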
The pandas.read_csv() function provides a parameter, dtype, that enables you to specify the data types to apply to either the whole dataset or individual columns. You can use the following syntax to specify the dtype of each column in a DataFrame when importing a CSV file into pandas:
df = pd.read_csv('my_data.csv', dtype = {'Field1': str, 'Field2': int, 'Field3': str, 'Field4': float})
To start, we need to read the metadata lines at the top of the file and parse them into lists of column names and data types, which we zip together into a dictionary. In order to populate our dtype dictionary correctly, we also need to translate the CPQ data types to numpy data types.
The sample code below handles these tasks.
def translate_data_type(data_types):
    # Translate CPQ data table types to numpy data types
    # CPQ data table type 'String' = numpy 'string'
    # CPQ data table type 'Float' = numpy 'float'
    # CPQ data table type 'Integer' = numpy 'int'
    return ['int' if val == 'Integer' else val.lower() for val in data_types]

def get_metadata():
    # Get column names and data types zipped together into a dictionary.
    # Reads from the file handle f opened by the caller.
    headers = []
    data_types = []
    for i, line in enumerate(f):
        if i == 0:
            # Row contains _start meta data
            continue
        elif i == 1:
            headers = line.replace('\n', '').split(',')
        elif i == 2:
            data_types = line.replace('\n', '').split(',')
        elif i > 2:
            break
    return dict(zip(headers, translate_data_type(data_types)))
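As a quick sanity check, here is the translation logic run against the types from our sample export (the function body is repeated so the snippet runs standalone):

```python
def translate_data_type(data_types):
    # CPQ 'Integer' becomes 'int'; the other CPQ types are simply lowercased
    return ['int' if val == 'Integer' else val.lower() for val in data_types]

print(translate_data_type(['String', 'String', 'Integer', 'String', 'Float']))
# ['string', 'string', 'int', 'string', 'float']
```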
Now that we've done the hard work of parsing the metadata to populate the dtype dictionary, the next step is to read the file into our DataFrame.
There are a couple of details to be aware of in the sample code below. First, we start the import at row 5, skipping the metadata rows. Also, the column names are specified via the names parameter, which accepts a list of column names; we have that list handy as the keys of our dtype dictionary.
Finally, data tables exported from CPQ with natural keys enabled contain an additional column named '_update_action', which appears as the first column. In order to ignore that column, use the usecols parameter to specify the range of columns beginning after the first column.
with open(file_name, mode='r', encoding='utf-8') as f:
    col_metadata = get_metadata()
    # get_metadata() advanced the file position, so rewind before handing the handle to read_csv
    f.seek(0)
    # Read CSV file, include column names and data types, start import at row 5.
    # Skip first column. For Data Tables with Natural Keys enabled, a column
    # called '_update_action' is created as the first column.
    df = pd.read_csv(f, sep=',', names=list(col_metadata.keys()), dtype=col_metadata,
                     skiprows=5, usecols=range(1, len(col_metadata)))
    # df.info() prints its summary itself, so no surrounding print() is needed
    df.info()
Now to do something interesting with your data. Pandas enables you to create reports and visualizations but we'll focus on a simpler task: determining the selectivity of your columns. Selectivity is the ratio of the number of distinct values in a column to the number of records in the table and is a great guide to whether your column and its data would benefit from an index.
An example of good selectivity: if a table has 100,000 records and one column has 88,000 distinct values, then the selectivity of an index on that column is 88,000 / 100,000 = 0.88.
An example of bad selectivity: if a table has 100,000 records and one column has only 500 distinct values, then the index's selectivity is 500 / 100,000 = 0.005. In this case a full table scan is more efficient than using such an index, since repeatedly scanning both the index and the table requires far more I/O.
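Both cases boil down to one division, which we can capture in a small helper (the function name here is just illustrative):

```python
def selectivity(distinct_values, total_rows):
    # Ratio of distinct values to total rows; closer to 1.0 means a better index candidate
    return distinct_values / total_rows

print(selectivity(88_000, 100_000))  # 0.88  (good candidate)
print(selectivity(500, 100_000))     # 0.005 (poor candidate)
```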
# Total number of rows
num_rows = len(df)
print(f'numRows: {num_rows}')

# Get unique values for each column
unique_values = pd.Series({c: df[c].unique() for c in df})

# For each column...
for col, values in unique_values.items():
    # Print column details
    print(f'name: {col}, nunique: {len(values)}, selectivity: {(len(values)/num_rows)}')
The above sample code uses a dictionary comprehension and the pandas.Series.unique() method to return the unique values for each column. The selectivity of each column is calculated by dividing its number of unique values by the total number of rows.
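Building on that loop, one way to turn selectivity into an index shortlist is to filter columns against a cutoff. A minimal sketch, where the sample DataFrame and the threshold value are illustrative stand-ins for your own table and tuning:

```python
import pandas as pd

# Hypothetical stand-in for a CPQ data table export
df = pd.DataFrame({
    'Field1': ['a', 'b', 'c', 'd'],   # every value distinct: selectivity 1.0
    'Field3': ['x', 'x', 'x', 'y'],   # mostly repeated: selectivity 0.5
})

SELECTIVITY_THRESHOLD = 0.8  # illustrative cutoff; tune for your own tables
num_rows = len(df)

# Keep only columns whose selectivity meets the threshold
candidates = [c for c in df if df[c].nunique() / num_rows >= SELECTIVITY_THRESHOLD]
print(candidates)  # ['Field1']
```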
That's all you need to get started using pandas DataFrames and to begin your data analysis journey with all of your CPQ data tables. Happy hunting!
~sn