Data Quality

December 21st, 2016

In this blog we share our definition of Data Quality, examine best practices, and discuss the implications of improving it.

Data quality defined

At Energyworx, we believe data quality should go beyond integrity. It is not enough to ask: is my data accurate? That question is important, no doubt, but it fails to capture the full breadth of data quality.

Instead, ask the following and gain a much clearer picture of your data quality.

  1. Can the right people access the right data?
  2. Is the data findable and in the right format?
  3. Can we measure and improve accuracy, completeness and consistency?
  4. Can we correlate the timeseries with metadata to uncover new context?

If your organization does not have solid answers to these questions, have a look at the methods below to identify ways to improve your data quality.

User Control & Identity Access Management

The first question refers to the term Identity Access Management (IAM). The goal of datalakes, datahubs, etc. is to have all data available in one central repository so everyone can access the data they need for their specific roles. This is a good thing from an innovation standpoint but raises serious security concerns. Identity Access Management strategies have been developed for just this purpose and ensure all users have access to the data they need and nothing more.

Energyworx manages a multi-tenant environment where each tenant represents a unique customer. All data is stored and processed in that tenant and is inaccessible to any other tenant. Data control and permissions are further managed by the Energyworx Permission Model, which extends across tenants, groups and users. The Permission Model controls access to datasources (e.g. meters) and can even set the level of granularity seen by a given user (e.g. 15min values vs. monthly). Critical features that define run configurations, sequences and rule parameters can also be managed directly through the controls in the console.
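As a rough illustration of how a tenant-scoped permission model can govern both datasource access and granularity, consider the sketch below. The class names, granularity labels and default-deny policy are our own assumptions for the example, not the actual Energyworx Permission Model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Permission:
    tenant: str
    datasource: str   # e.g. a meter identifier
    granularity: str  # finest interval the holder may see, e.g. "15min"

# Coarser granularities appear later in this list.
GRANULARITY_ORDER = ["15min", "hourly", "daily", "monthly"]

def may_read(perms, tenant, datasource, requested):
    """Allow access only within the user's own tenant, for a permitted
    datasource, and at the permitted granularity or coarser."""
    for p in perms:
        if p.tenant == tenant and p.datasource == datasource:
            return (GRANULARITY_ORDER.index(requested)
                    >= GRANULARITY_ORDER.index(p.granularity))
    return False  # default deny: no matching permission

perms = [Permission("tenant-a", "meter-123", "hourly")]
print(may_read(perms, "tenant-a", "meter-123", "daily"))   # coarser than hourly: allowed
print(may_read(perms, "tenant-b", "meter-123", "hourly"))  # other tenant: denied
```

The default-deny branch reflects the principle described above: users get the data they need and nothing more.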

Screenshot of the IAM tab on the Energyworx Console

Multidimensional Queries

Next, we should be able to find the data we need without spending too much time searching. This process has two key components: searching for the data and retrieving it. Energyworx tags its timeseries data with additional metadata (e.g. account/customer information), so it is possible to do a “fuzzy search” on the desired dataset. The user can always query directly on the datasource identifier (e.g. meter #), but often only the parameters are known (e.g. accounts >1MW served by utility A). Without tagged data, this simple request becomes much more challenging. Click the link to see a short demo of our flexible search capabilities.
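To illustrate the idea of querying on parameters rather than identifiers, here is a minimal sketch of a metadata-driven lookup. The tag names (`utility`, `peak_kw`) are invented for this example and are not the actual Energyworx tag schema:

```python
# Each datasource carries metadata tags alongside its identifier.
datasources = [
    {"id": "meter-001", "utility": "A", "peak_kw": 1500},
    {"id": "meter-002", "utility": "A", "peak_kw": 400},
    {"id": "meter-003", "utility": "B", "peak_kw": 2200},
]

def find(tags, **criteria):
    """Return datasource ids whose metadata satisfies every criterion.
    Each criterion is either a value to match or a predicate to apply."""
    hits = []
    for ds in tags:
        ok = all(cond(ds[k]) if callable(cond) else ds[k] == cond
                 for k, cond in criteria.items())
        if ok:
            hits.append(ds["id"])
    return hits

# "Accounts >1MW served by utility A", without knowing any meter number:
print(find(datasources, utility="A", peak_kw=lambda kw: kw > 1000))  # ['meter-001']
```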

Now that we’ve found our dataset, we need to retrieve the timeseries and metadata so we can analyze it. Large requests from traditional databases can take several minutes to hours and are prone to failure. Energyworx uses the elasticity of the cloud to scale up instantly to meet demand and process the usage data in parallel. Thousands of interval meters with several years of historical data can be returned in seconds.
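The fan-out pattern behind that kind of parallel retrieval can be sketched as follows; `fetch_history` is a placeholder for a real backend call, and the pool size is arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_history(meter_id):
    # Placeholder: a real implementation would call the storage backend.
    return meter_id, [0.0] * 96  # one day of 15-minute values

meter_ids = [f"meter-{i:03d}" for i in range(1000)]

# Fetch many meters' histories concurrently instead of one by one.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = dict(pool.map(fetch_history, meter_ids))

print(len(results))  # 1000 meters retrieved
```

In a cloud backend the same fan-out happens across machines rather than threads, which is what keeps large requests from taking minutes to hours.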

Data Cleansing

With all the data where you want it, the next step is to upgrade the quality directly. Validation, Estimation and Editing (VEE) is the traditional term for this process and is an important function of any Meter Data Management System. Energyworx has expanded on the traditional VEE process by incorporating additional data cleansing rules to fix common mistakes such as formatting and timezone / daylight savings time handling. Each rule has the ability to annotate individual data points which can be informational or instruct the next sequence to provide a correction (i.e. estimation/editing) or calculation. This process is highly configurable, allowing users to create and automate separate flows for load forecasting and billing for example. Energyworx provides its customers with dozens of rules based on industry best practices and often collaborates with customers to develop proprietary rules and flows specific to their use case and customer base.
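Here is a minimal sketch of such a two-step sequence, in which a validation rule annotates suspect points and an estimation rule acts on those annotations. The rule structure and flag names are illustrative, not the Energyworx rule API:

```python
def validate_gaps(readings):
    """Validation rule: annotate missing interval values for later steps."""
    return [{"value": v, "flags": ["MISSING"] if v is None else []}
            for v in readings]

def estimate(annotated):
    """Estimation rule: fill MISSING points by averaging the nearest
    valid neighbours (assumes each gap has a valid value on both sides)."""
    values = [p["value"] for p in annotated]
    for i, p in enumerate(annotated):
        if "MISSING" in p["flags"]:
            prev = next(values[j] for j in range(i - 1, -1, -1) if values[j] is not None)
            nxt = next(values[j] for j in range(i + 1, len(values)) if values[j] is not None)
            p["value"] = (prev + nxt) / 2
            p["flags"].append("ESTIMATED")
    return annotated

cleaned = estimate(validate_gaps([10.0, None, 14.0]))
print([p["value"] for p in cleaned])  # [10.0, 12.0, 14.0]
```

Because the estimated point keeps both its `MISSING` and `ESTIMATED` annotations, a downstream flow (billing vs. forecasting, say) can decide for itself how to treat corrected values.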

Screenshot from the Energyworx Datalab

Data Quality is a process that never sleeps, making it difficult to examine how it’s working and propose innovations. Recognizing this, Energyworx developed an interactive environment called The Datalab which allows users to develop, tweak and interactively test new algorithms in a Notebook Environment. New VEE rules and flows can be benchmarked and scored against existing rules allowing you to actually see and measure the increase in data quality. Once you’re happy, the new rule can be published to the Energyworx backend to run (automatically) at scale on the production environment.
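Benchmarking a candidate rule against an existing one, as one might do interactively in a notebook, can be as simple as scoring both outputs against known-good reference values. The metric here (mean absolute error) is our choice for the example, not a prescribed Datalab scoring function:

```python
def mae(estimated, reference):
    """Mean absolute error between an estimated and a reference series."""
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(reference)

reference = [10.0, 12.0, 14.0, 16.0]   # known-good values for the benchmark

rule_a = [10.0, 11.0, 14.0, 16.0]      # existing rule's output
rule_b = [10.0, 12.5, 14.0, 16.0]      # candidate rule's output

scores = {"existing": mae(rule_a, reference),
          "candidate": mae(rule_b, reference)}
print(scores)  # lower is better
```

A score like this makes the "increase in data quality" a measurable number rather than an impression, which is the point of benchmarking before publishing a rule to production.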

Implications

With data at the highest quality possible, what can your organization gain?

Billing errors are one of the most common results of poor data quality, leading either to unhappy customers or to missed revenue. Energyworx cleans up faulty meter reads and missing values so billing determinants can be calculated correctly – leaving no surprises for you or the customer.

Load forecasting is a use case where data quality is absolutely critical. The most sophisticated forecasting models can be developed but if they are applied to data with errors, the forecast is highly unlikely to resemble actuals. For prospective customers, this means less competitive offerings. For existing customers, poor forecasts increase risk in your portfolio and hurt your ability to retain your most valuable customers.

Data quality for Load Forecasting must remove errors but also identify trends and events that are impacting usage patterns. By working with tagged data, Energyworx adds valuable context to the timeseries – helping explain why the usage behaves as it does. We call this process contextual awareness and it can play an important role in Load Forecasting and other predictive algorithms. By identifying influential events and adapting to changing profiles, the resulting forecast is much more likely to match actual usage.
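As a toy illustration of contextual awareness, tagged events can be joined onto the usage series so a forecasting model can treat them as features; the dates, event names and values below are invented:

```python
usage = {"2016-07-04": 820.0, "2016-07-05": 1450.0}   # daily kWh (made up)
events = {"2016-07-04": "public_holiday"}             # invented tagged event

# Attach the event context to each usage point as a model feature.
features = [{"date": d, "kwh": kwh, "event": events.get(d, "none")}
            for d, kwh in sorted(usage.items())]
print(features)
```

With the event attached, a model can learn that the low reading reflects a holiday rather than a genuine shift in the customer's profile.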

Our next blog posts will explore anomaly detection methods and how they impact forecasting and billing processes.

Are you taking the right steps to improve the quality of your data? Contact us to take control of the data that’s powering your most critical workflows!