🛣️ Data Ingestion Workflow

Enhanced Data Ingestion and Transformation Workflow

Our data ingestion and transformation workflow differs from a standard ingestion process in two ways: it uses machine learning (ML) for deduplication, and it applies a detailed sequence of transformation steps to improve data quality. Here is a simplified description of the workflow:

  1. Ingestion:

    • Sources: Data is ingested from multiple sources, including open-source datasets and external providers.

  2. Loading:

    • After ingestion, the data is loaded so that it is available to the next (and most complex) step, transformation.

  3. Transformation:

    • The loaded data goes through the following transformation steps, which standardize, match, and enhance it:

      1. Deduplication by Pair: Each data source is deduplicated against a paired source using machine learning (ML) techniques, which identifies and removes duplicate records. The pairing order has been evaluated and tested.

      2. Assign Place ID: Assigns a unique ID to each place (point of interest, POI) so that every POI can be identified consistently and uniquely.

      3. Brand Matching: Matches POIs with corresponding brands to ensure brand consistency. This step combines our internal knowledge (collected in our Master Data Management system) with ML methodologies to assign a brand to a POI.

      4. Category Matching: Associates each POI with an appropriate category. Categories follow either our internal taxonomy (Echo category) or NAICS (the North American Industry Classification System, an industry standard).

      5. Shape Matching: Combines our Shapes and POI data based on our internal methodology to ensure geographical accuracy.

      6. Format Standardization: Standardizes the data format for uniformity.

      7. Cleaning: Removes irrelevant information from POIs to maintain privacy and consistency.

      8. Publish: Finalizes and publishes the transformed data, ready for use by our customers and for further analysis.

  4. Analytics:

    • The final stage involves in-depth analytics on the processed data:

      • Country Confidence Score: Calculates a confidence score for each country to assess data quality, based on our internal methodology.

      • Place Analytics: Performs detailed analytics specific to POIs for actionable insights.
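To make the Deduplication by Pair step more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the production method: the trained ML model is replaced by a simple rule-based score (name similarity plus a distance cutoff), and the record fields (`name`, `lat`, `lon`), the function names, and the thresholds are all hypothetical.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def match_score(a, b, max_dist_m=100.0):
    """Stand-in for a trained model (assumption): returns a score in [0, 1]."""
    if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) > max_dist_m:
        return 0.0  # too far apart to be the same place
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def dedupe_pair(primary, secondary, threshold=0.85):
    """Keep every primary record; add secondary records that match no primary record."""
    merged = list(primary)
    for cand in secondary:
        if all(match_score(cand, kept) < threshold for kept in primary):
            merged.append(cand)
    return merged


# Hypothetical sample data for two paired sources
source_a = [{"name": "Joe's Coffee", "lat": 52.5200, "lon": 13.4050}]
source_b = [
    {"name": "Joes Coffee", "lat": 52.5201, "lon": 13.4051},   # near-duplicate of source_a[0]
    {"name": "City Library", "lat": 52.5300, "lon": 13.4100},  # distinct place
]
merged = dedupe_pair(source_a, source_b)
print([p["name"] for p in merged])
```

In this sketch the first source always wins when a duplicate is found, which is one reason the order of pairing matters. A learned model would replace `match_score` while the surrounding loop stays the same.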

Key Differentiators

  • Machine Learning Deduplication: Unlike standard workflows, our deduplication process uses machine learning to identify and remove duplicates. This improves data reliability, and continuous model training lets us keep iterating on the results.

  • Detailed Transformation Steps: We implement comprehensive transformation steps, including brand, category, and shape matching, to ensure the highest level of data precision and consistency.

  • Multi-Source Integration: Our workflow integrates data from multiple sources into a single, unified dataset.

  • Industry Standard Compliance: Incorporates industry standards such as NAICS for categorization, ensuring our data is relevant and comparable across various industries.

  • High-Quality Analytics: Post-transformation, our data undergoes advanced analytics to provide valuable insights, improving decision-making processes.

This workflow ensures that data from multiple sources is ingested, deduplicated, matched, standardized, and cleaned before being published and analyzed, which results in more consistent and reliable data.
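The ordered nature of the transformation stage can be sketched as a list of functions applied one after another. This is only an illustration: the step implementations, the `poi-0001` ID format, and the field names are assumptions, and the deduplication and brand, category, and shape matching steps are omitted because their methodology is internal.

```python
def assign_place_id(places):
    # Hypothetical ID scheme (sequential); the real ID format is not described here.
    return [{**p, "place_id": f"poi-{i:04d}"} for i, p in enumerate(places, start=1)]


def standardize_format(places):
    # Example normalization: collapse repeated whitespace in names.
    return [{**p, "name": " ".join(p["name"].split())} for p in places]


def clean(places, keep=("place_id", "name")):
    # Drop every field that is not explicitly allowed.
    return [{k: p[k] for k in keep if k in p} for p in places]


# Steps run in the listed order; each takes and returns a list of records.
TRANSFORM_STEPS = [assign_place_id, standardize_format, clean]


def run_transformation(places, steps=TRANSFORM_STEPS):
    for step in steps:
        places = step(places)
    return places


raw = [{"name": "  Joe's   Coffee ", "source_row": 17}]
result = run_transformation(raw)
print(result)
```

Keeping each step as a function with the same input and output shape makes it straightforward to reorder, add, or test steps in isolation.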
