Best Practice Guide: Data Enrichment
Data enrichment is one of the key processes by which you can add more value to your data. It refines, improves, and enhances your dataset with the addition of new attributes. For example, using an address postcode/ZIP code field, you can take simple address data and enrich it by adding socio-economic demographic data, such as average income, household size, and population attributes. By enriching your data, you can gain a better understanding of your customer base and potential target customers.
Enrichment techniques
Enrichment techniques typically encompass six common tasks:
- Appending Data
- Segmentation
- Derived Attributes
- Imputation
- Entity Extraction
- Categorization
1. Appending Data
Appending data to your dataset involves integrating multiple data sources to create a more comprehensive, precise, and coherent dataset compared to any single source alone. For instance, consolidating customer data from your CRM, Financial System, and Marketing systems provides a richer understanding of your customers than relying on just one system.
In addition to internal data sources, appending data also entails incorporating third-party data, such as demographic or geographic data based on postcodes/ZIP codes, into your dataset. Other valuable examples include exchange rates, weather data, date/time hierarchies, and traffic information. Enriching location data is a particularly common technique, given its widespread availability for most countries.
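As a simple illustration, the pandas sketch below appends third-party demographic attributes to customer records via a shared postcode field. The DataFrames, column names, and values are hypothetical stand-ins for whatever your CRM and data provider actually supply.

```python
import pandas as pd

# Internal customer records from the CRM (hypothetical column names and values).
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Boyd", "Cho"],
    "postcode": ["3000", "3141", "2000"],
})

# Third-party demographic data keyed by postcode (illustrative figures).
demographics = pd.DataFrame({
    "postcode": ["3000", "3141", "2000"],
    "avg_income": [72000, 95000, 81000],
    "avg_household_size": [2.1, 2.6, 2.4],
})

# A left join keeps every customer and appends the demographic attributes.
enriched = crm.merge(demographics, on="postcode", how="left")
print(enriched)
```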
2. Data Segmentation
Data segmentation involves dividing a data object, such as a customer, product, or location, into distinct groups based on predefined variables; for customers, these might include age, gender, or income. These segments serve to categorize and describe entities more effectively.
Examples of common customer segmentation include:
- Demographic Segmentation: Based on factors like gender, age, occupation, marital status, and income.
- Geographic Segmentation: Divided by country, state, city, or even specific towns or counties for local businesses.
- Technographic Segmentation: Centered on preferred technologies, software, and mobile devices.
- Psychographic Segmentation: Focused on personal attitudes, values, interests, or personality traits.
- Behavioral Segmentation: Defined by actions or inactions, spending habits, feature use, session frequency, browsing history, average order value, etc.
These segments may lead to customer groups such as “Trend Setters” or “Tree Changers.”
You can create your own segmentation by generating calculated fields either in an ETL process or within a metadata layer, utilizing the available data attributes.
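For example, a calculated-field approach to segmentation might look like the following pandas sketch. The age and spend bands are illustrative assumptions, not recommended breakpoints.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [23, 37, 52, 68],
    "annual_spend": [300, 2400, 900, 150],
})

# Demographic segment: band customers by age (illustrative bands).
customers["age_segment"] = pd.cut(
    customers["age"],
    bins=[0, 24, 39, 59, 120],
    labels=["18-24", "25-39", "40-59", "60+"],
)

# Behavioural segment: band customers by annual spend (illustrative bands).
customers["spend_segment"] = pd.cut(
    customers["annual_spend"],
    bins=[0, 500, 1500, float("inf")],
    labels=["Low", "Medium", "High"],
)
print(customers)
```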
3. Derived Attributes
Derived attributes refer to fields that are not initially stored in the original dataset but can be computed from one or more existing fields. For instance, while ‘Age’ may not be directly stored, it can be derived from a ‘date of birth’ field. These attributes prove highly valuable as they often encapsulate logic frequently utilized for analysis. Creating derived attributes within an ETL process or at the metadata layer serves to expedite the creation of new analyses while ensuring consistency and accuracy in the measures employed.
Common examples of derived attributes include:
- Counter Field: Generated based on a unique ID within the dataset, facilitating straightforward aggregations.
- Date Time Conversions: Utilizing a date field to extract information such as the day of the week, month of the year, quarter, etc.
- Time Between: Calculating periods elapsed between two datetime fields, such as response times for tickets.
- Dimensional Counts: Counting values within a field to create new counter fields for specific categories, such as counts of narcotic offenses, weapons offenses, or petty crimes, enabling easier comparative analysis at the report level.
- Higher Order Classifications: Deriving attributes like product category from product or age band from age.
Advanced derived attributes may result from data science models applied to the dataset. For example, predicting customer churn risk or propensity to spend.
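The sketch below illustrates a few of these derived attributes in pandas: a counter field, date/time conversions, a time-between calculation, and an age derived from date of birth. The column names and the simple days-divided-by-365 age rule are illustrative assumptions.

```python
import pandas as pd

tickets = pd.DataFrame({
    "ticket_id": ["A1", "A2", "A3"],
    "date_of_birth": pd.to_datetime(["1990-05-01", "1985-11-20", "2001-02-14"]),
    "created_at": pd.to_datetime(["2024-03-01 09:00", "2024-03-02 14:30", "2024-03-05 08:15"]),
    "closed_at": pd.to_datetime(["2024-03-01 17:00", "2024-03-04 10:00", "2024-03-05 12:45"]),
})

# Counter field: a constant 1 per row makes simple aggregations straightforward.
tickets["record_count"] = 1

# Date/time conversions: day of week and month extracted from a date field.
tickets["day_of_week"] = tickets["created_at"].dt.day_name()
tickets["month"] = tickets["created_at"].dt.month

# Time between: elapsed hours between two datetime fields (e.g. ticket response time).
tickets["hours_to_close"] = (tickets["closed_at"] - tickets["created_at"]).dt.total_seconds() / 3600

# Higher-order classification: age derived from date of birth (approximate rule).
today = pd.Timestamp("2024-06-01")
tickets["age"] = ((today - tickets["date_of_birth"]).dt.days // 365).astype(int)
print(tickets)
```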
4. Data Imputation
Data imputation involves replacing missing or inconsistent values within fields, ensuring more accurate analysis rather than skewing aggregations by treating missing values as zeros.
For instance, if the value for an order is missing, estimation based on previous orders by the same customer or for similar bundles of goods can help to provide a more accurate representation.
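A minimal sketch of that idea, assuming a hypothetical orders table: missing order values are imputed with the same customer's average order value rather than being treated as zero.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_value": [100.0, 120.0, None, 80.0, None],
})

# Impute missing order values with that customer's mean order value,
# so aggregations are not skewed by missing entries counted as zero.
orders["order_value_imputed"] = (
    orders.groupby("customer_id")["order_value"]
    .transform(lambda s: s.fillna(s.mean()))
)
print(orders)
```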
5. Entity extraction
Entity extraction is the process of extracting structured data from unstructured or semi-structured sources, thereby deriving meaningful information.
This extraction method enables the identification of various entities, including individuals, locations, organizations, concepts, numerical expressions (such as currency amounts and phone numbers), and temporal expressions (such as dates, times, durations, and frequencies).
For example, through data parsing, one can extract a person’s name from an email address or determine the organization associated with a web domain. Similarly, entity extraction allows for the segmentation of names, addresses, and other data elements from an address in an envelope format into discrete components such as building name, unit, house number, street, postal code, city, state/province, and country.
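As a rough sketch of parsing-based extraction, the helper functions below pull a likely name from an email address and an organization from its domain. The naming conventions they assume (dot-, underscore-, or hyphen-separated local parts) are illustrative; real-world extraction usually needs more robust parsing.

```python
import re

def name_from_email(email: str) -> tuple[str, str]:
    """Extract a likely first/last name from an address such as 'jane.doe@example.com'."""
    local_part = email.split("@")[0]
    parts = re.split(r"[._-]", local_part)
    first = parts[0].capitalize()
    last = parts[-1].capitalize() if len(parts) > 1 else ""
    return first, last

def org_from_domain(email: str) -> str:
    """Infer an organization name from the email domain (e.g. 'acme' from 'acme.com')."""
    domain = email.split("@")[1]
    return domain.split(".")[0].capitalize()

print(name_from_email("jane.doe@acme.com"))   # ('Jane', 'Doe')
print(org_from_domain("jane.doe@acme.com"))   # 'Acme'
```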
6. Data Categorization
Data categorization involves the act of assigning labels to unstructured data, thereby rendering it structured and suitable for analysis. This process encompasses two primary categories:
- Sentiment Analysis: This involves extracting emotions and feelings from text. For instance, determining whether customer feedback reflects frustration, delight, positivity, or neutrality.
- Topic Analysis: This entails identifying the main subject or “topic” of the text. For example, discerning whether the text pertains to politics, sports, or house prices.
Both of these techniques enable the analysis of unstructured text, providing insights to better understand the underlying data.
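To make the idea concrete, here is a deliberately simple, keyword-based sketch of sentiment and topic labeling. The keyword lists are invented for illustration; production categorization typically relies on trained NLP models rather than hand-picked word lists.

```python
# Illustrative keyword lists only; not a substitute for a trained model.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"slow", "broken", "frustrated", "terrible"}
TOPICS = {
    "billing": {"invoice", "charge", "refund", "price"},
    "support": {"ticket", "agent", "response", "help"},
}

def label_feedback(text: str) -> dict:
    """Assign a crude sentiment and topic label to a piece of free text."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    topic = next((name for name, keywords in TOPICS.items() if words & keywords), "other")
    return {"sentiment": sentiment, "topic": topic}

print(label_feedback("The refund was slow and the agent response was terrible"))
# {'sentiment': 'negative', 'topic': 'billing'}
```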
Data Enrichment Best Practices
Data enrichment is not a one-time endeavor. In an analytics environment where new data continually enters the system, enrichment steps must be recurrent. Several best practices are essential to achieve desired outcomes and uphold data quality.
Reproducibility and consistency
Every data enrichment task should be reproducible, consistently yielding the expected results. It’s imperative to establish rules-driven processes, ensuring that each execution reliably produces the same outcome every time.
Clear Evaluation Criteria
Every data enrichment task should adhere to clear evaluation criteria. It’s crucial to be able to verify the success of each process. For instance, following execution, you should compare recent outcomes with previous runs to ensure consistency and expected results.
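One way to operationalize this is to capture a few summary metrics per run and compare them against the previous run. The metrics and tolerance below are illustrative assumptions; choose checks that reflect your own data.

```python
import pandas as pd

def run_metrics(df: pd.DataFrame) -> dict:
    """Summary metrics used to compare one enrichment run against the last."""
    return {
        "row_count": len(df),
        "null_rate": float(df.isna().mean().mean()),
    }

def within_tolerance(current: dict, previous: dict, tolerance: float = 0.05) -> bool:
    """Flag the run if row counts or null rates drift more than the tolerance."""
    rows_ok = abs(current["row_count"] - previous["row_count"]) <= previous["row_count"] * tolerance
    nulls_ok = abs(current["null_rate"] - previous["null_rate"]) <= tolerance
    return rows_ok and nulls_ok

# Compare the latest run against metrics stored from the previous run (sample data).
previous = {"row_count": 100, "null_rate": 0.02}
latest = run_metrics(pd.DataFrame({"value": [1.0, None] + [2.0] * 98}))
print(within_tolerance(latest, previous))  # True: same row count, similar null rate
```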
Scalability
Every data enrichment task should demonstrate scalability across resources, timing, and costs. It’s essential to anticipate the growth of your data over time and ensure that any process created can be sustained as data volume expands or as additional transformation workloads are introduced. For instance, manual processes can quickly become limiting in terms of processing capacity and cost-effectiveness. Therefore, prioritize automation using infrastructure that can seamlessly accommodate evolving requirements.
Completeness
Every data enrichment task should ensure completeness concerning the input data, generating results with consistent characteristics. This entails anticipating all potential outcomes, including scenarios where results are ‘unknown’. By maintaining comprehensiveness, particularly when new data is incorporated into the system, one can ensure a reliable outcome from the enrichment process at all times.
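A small example of building 'unknown' handling into an enrichment step, using a hypothetical postcode-to-region lookup: unmatched or missing postcodes still receive a value, so no record silently drops out of downstream analysis.

```python
import pandas as pd

customers = pd.DataFrame({"postcode": ["3000", "9999", None]})

# Lookup of known postcodes (illustrative); anything unmatched or missing
# falls back to 'Unknown' so every row still receives a region.
region_lookup = {"3000": "Metro", "3141": "Metro", "2480": "Regional"}
customers["region"] = customers["postcode"].map(region_lookup).fillna("Unknown")
print(customers)
```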
Generality
The data enrichment task should be applicable across various datasets. Ideally, the processes developed should be transferable, allowing for the reuse of logic across multiple tasks. For instance, operations such as day-of-week extraction should be universally applied to any date field. This fosters consistency in outcomes and aids in upholding the business rules associated with your data across diverse subject domains.
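For instance, a reusable date-part function can apply the same enrichment rules to any date field in any dataset; the function, field names, and prefixes below are illustrative.

```python
import pandas as pd

def add_date_parts(df: pd.DataFrame, date_col: str, prefix: str) -> pd.DataFrame:
    """Apply the same date-part logic to any date field in any dataset."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out[f"{prefix}_day_of_week"] = dates.dt.day_name()
    out[f"{prefix}_month"] = dates.dt.month
    out[f"{prefix}_quarter"] = dates.dt.quarter
    return out

orders = pd.DataFrame({"order_date": ["2024-03-01", "2024-06-15"]})
shipments = pd.DataFrame({"shipped_date": ["2024-03-03", "2024-06-18"]})

# The same rule set is reused across different datasets and date fields.
orders = add_date_parts(orders, "order_date", "order")
shipments = add_date_parts(shipments, "shipped_date", "shipped")
print(orders)
print(shipments)
```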