As data challenges become more defined and difficult, purpose-built databases have emerged to solve specific problems, leading to a proliferation of specialized solutions. The result is complex, messy data infrastructure, patched together by third-party data pipelines and event-streaming products.
Companies rarely rely on a single primary data storage system, often deploying multiple databases to handle different needs, which creates an intricate web of interconnected data systems.
Yet a generalist, all-in-one database that is scalable, performant across contexts, and commercially appealing remains elusive.
Different database types, such as relational OLTP, non-relational document, and memory-based cache, are optimized for specific use cases and face unique challenges that prevent effective consolidation. While tools like Object-Relational Mappers (ORMs) attempt to simplify interactions across databases, they fall short in managing non-relational types, underscoring the complexity of creating a unified solution.
Postgres, with its extensibility through extensions, comes close but still falls short of being a one-stop shop. Ultimately, the technical and practical hurdles make an all-in-one database infeasible, leaving the modern data stack as a complex but necessary reality.
In theory, an all-in-one database would struggle with optimization, data model overhead, and latency. Each database type is built for specific use cases, making a universal solution inefficient.
Closest Solution Today: Postgres
While not a full one-stop shop, Postgres can be extended with extensions like pgvector for vector search. However, it isn't intended to solve every problem efficiently, as evidenced by the complex data stacks at companies that use Postgres.
Below are some of our tips on how to manage the data requests backlog. We believe that great data teams should be proactive: they should adopt tools and processes that ensure the data team never has to answer the same question twice. We hope this step-by-step list is helpful to all data teams that want to improve their efficiency and reduce their future workload:
1. Set expectations about the data requests workflow with reasonable timelines
The first step in setting expectations is telling stakeholders across the company how your team works. We suggest adopting an Agile workflow with weekly or bi-weekly sprints. Although your Scrum team may be small at first, using sprints to set expectations about when requests will be answered can be helpful. With Scrum, a product is built in a series of iterations called sprints that break big, complex projects into bite-sized pieces.
2. Define your requests workflow
Data requests are questions that employees have about data that exists in the organization or about new data that isn't being collected yet. Traditionally, data teams take data requests through an intake form or a Slack channel where employees can submit them. Some requests are unique, difficult questions that require the full attention of the data team. Others are repetitive and low priority and don't require much effort from the data team.
3. Create a requests template
We’ve created a data requests template for teams that are looking for a better way to manage inbound questions. Feel free to copy it and use it in your team's workflow. Here's our template:
What is the business question you are trying to answer?
What is the impact of this question and how will it help the company?
Who will be using this data?
What time frames are crucial here? (Example: Monthly, weekly, daily)
What is the visualization you are trying to create?
What interactions/drill-downs are required? (e.g., the type of use, revenue amount, etc.)
Are there any other details we should know about this data request?
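As a rough illustration, the template above could be captured as a small record so that incomplete requests are caught before they reach the backlog. This is only a sketch: the `DataRequest` class, its field names, and the choice to treat the last question as optional are our assumptions for the example, not part of any particular tool.

```python
from dataclasses import dataclass, fields


@dataclass
class DataRequest:
    """One intake-form submission. Field names are hypothetical."""
    business_question: str
    impact: str
    audience: str
    time_frame: str          # e.g. "monthly", "weekly", "daily"
    visualization: str
    drill_downs: str
    other_details: str = ""  # treated as the only optional answer in this sketch


def missing_answers(request: DataRequest) -> list[str]:
    """Return the names of required fields that were left blank."""
    optional = {"other_details"}
    return [
        f.name
        for f in fields(request)
        if f.name not in optional and not getattr(request, f.name).strip()
    ]


# Made-up example request with the impact question left unanswered.
request = DataRequest(
    business_question="Which channels drive repeat orders?",
    impact="",
    audience="Growth team",
    time_frame="weekly",
    visualization="Line chart",
    drill_downs="Channel, region",
)
print(missing_answers(request))  # ['impact']
```

A check like this could run when the form is submitted, so the requester fills in the gaps instead of the data team chasing them later.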
4. Automate repetitive data requests
Data teams that take the next step with their data requests process can start to think proactively about data requests. Customer support teams have been deflecting inbound questions for years using tools like Intercom's knowledge base and Ada's automated customer support chatbots. Smart data teams realize that they can do the same. Data teams can automate and deflect common questions with tools that let them document answers to data requests in the same place teammates look for them.
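To make the idea of deflection concrete, here is a minimal sketch that checks whether an incoming question closely matches one that has already been documented. The `DOCUMENTED_ANSWERS` entries, the `deflect` function, and the 0.6 similarity cutoff are all assumptions for illustration; a real setup would sit on top of whatever documentation tool the team uses and would likely need better matching than plain string similarity.

```python
from difflib import get_close_matches

# Hypothetical documented answers, keyed by the question as it was first asked.
DOCUMENTED_ANSWERS = {
    "how is monthly active users calculated": "See the 'MAU' definition in the metrics glossary.",
    "where is the revenue dashboard": "It lives in the Finance folder of the BI tool.",
}


def deflect(question: str, cutoff: float = 0.6):
    """Return a documented answer for a near-duplicate question, else None."""
    normalized = question.lower().strip(" ?")
    matches = get_close_matches(normalized, list(DOCUMENTED_ANSWERS), n=1, cutoff=cutoff)
    return DOCUMENTED_ANSWERS[matches[0]] if matches else None


answer = deflect("How is monthly active users calculated?")
print(answer or "No documented answer yet; route to the data team.")
```

Questions that return `None` are the ones worth the data team's attention, and once answered they can be added to the documented set.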
5. Measure the data requests workflow
Lastly, you can’t improve anything you don’t measure. Taking the time to measure what your users are asking, which tables are used the most, and who the most influential user in your organization is will show you which common questions to automate next.
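A simple way to start measuring is to count what shows up in a log of past requests. The sketch below assumes a hypothetical log of (requester, table, question) tuples, and it uses "most frequent requester" as a crude stand-in for the most influential user, which is our simplification.

```python
from collections import Counter

# Hypothetical request log: (requester, table asked about, normalized question).
REQUEST_LOG = [
    ("ana", "orders", "how is revenue calculated"),
    ("ben", "orders", "how is revenue calculated"),
    ("ana", "users", "what counts as an active user"),
    ("cai", "orders", "which orders are refunded"),
]


def summarize(log):
    """Count the most common requester, table, and question in a request log."""
    requesters = Counter(requester for requester, _, _ in log)
    tables = Counter(table for _, table, _ in log)
    questions = Counter(question for _, _, question in log)
    return {
        "top_requester": requesters.most_common(1)[0],
        "top_table": tables.most_common(1)[0],
        "top_question": questions.most_common(1)[0],
    }


print(summarize(REQUEST_LOG))
```

The question that comes out on top is usually the first candidate for documentation or automation.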
If you found this useful, you can find the full article here: https://www.secoda.co/blog/how-to-manage-and-prioritize-data-requests
Here is an example of the data catalog problems shared by one of the delivery companies we spoke with. At this company, it was difficult to get aligned on which tables were commonly used and joined, how they were used together, and what their columns meant. Similarly, it’s difficult to monitor the number of data assets that exist across different departments, especially when the number of resources grows at a faster rate than the number of people. Why is this the case?
Data is becoming more decentralized through concepts like the data mesh. As more teams outside of the data function start to use data in their day-to-day, different tables, dashboards and definitions are being created at an almost exponential rate. Data catalogs are important because they help you organize your data whether you are working with structured or unstructured data.
Below are the steps that teams need to take when creating a data catalog:
1. Gather sources from across the organization
The first step data teams need to take is to collect the different resources that are scattered across different tools in the organization. This may require multiple meetings in which stakeholders come together and figure out which resources need to be in the catalog. Today, this collection could be done in a spreadsheet with an ongoing list of all resources and how they connect.
2. Give each resource an owner
After data teams have identified all the resources from across the company that they would like to include in their data catalog, we recommend assigning ownership to each resource. Teams that we’ve worked with in the past have assigned ownership based on the source, schema or even domain. Teams that start assigning ownership should look for people who are familiar with the data they are responsible for managing and are willing to help others who want to learn how to use it.
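As a sketch of schema-based ownership, the snippet below attaches an owner to each resource in a small inventory and lists whatever is still unowned. The resource rows, the owner names, and the `assign_owners` helper are all hypothetical; the same approach would work if ownership were keyed by source or domain instead.

```python
# Hypothetical inventory rows, as they might be exported from the spreadsheet.
RESOURCES = [
    {"name": "orders", "schema": "sales", "source": "warehouse"},
    {"name": "refunds", "schema": "sales", "source": "warehouse"},
    {"name": "sessions", "schema": "product", "source": "warehouse"},
    {"name": "tickets", "schema": "support", "source": "helpdesk"},
]

# Hypothetical owners, assigned per schema.
SCHEMA_OWNERS = {"sales": "dana", "product": "eli"}


def assign_owners(resources, owners_by_schema):
    """Attach an owner to each resource and list the ones still unowned."""
    owned, unowned = [], []
    for resource in resources:
        owner = owners_by_schema.get(resource["schema"])
        if owner is None:
            unowned.append(resource["name"])
        else:
            owned.append({**resource, "owner": owner})
    return owned, unowned


owned, unowned = assign_owners(RESOURCES, SCHEMA_OWNERS)
print(unowned)  # ['tickets']
```

The unowned list is a useful agenda for the next meeting with stakeholders.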
3. Get support and sign off
Once these meetings conclude and owners are on the same page, have the owners sign off on their responsibilities. The owners should be in alignment with the documentation and feel like the data team worked collaboratively with them to come to this ownership structure. If the leadership team sees the value of a data catalog, this can move at a much faster pace.
4. Integrate the catalog into your workflow
After data teams have received support for their data documentation process, they should look for ways to integrate the catalog into their workflow. This step is critical for maintenance and upkeep. Without a tool that allows teammates to receive notifications on Slack, the catalog will likely be forgotten. By creating a process around the data catalog, teams can ensure that it is not left behind as the team grows.
5. Maintain the data catalog
Although the documentation should be stable, it may need to change over time. One instance that might require documentation to change is when a new revenue stream is introduced or when the pricing of an existing revenue line changes. These changes traditionally come from the business team and might require the data team to implement the changes into the data catalog.
Teams that invest the time to get alignment using a data catalog can see major benefits in the long term as they make faster decisions as a team. Creating a data catalog is not a small undertaking. You can read the full step-by-step guide here if you found this post useful: https://www.secoda.co/blog/how-to-create-a-data-catalog-a-step-by-step-guide
As a company grows, so does its data. Tables, metrics, queries, and dashboards often become isolated and are difficult to find. Even with great practices, organizations still struggle to get value out of their data - up to 73% of all enterprise data goes unused. One of the big contributors to this problem is that organizations create data silos by not documenting and centralizing their data knowledge in a single place where every employee can access information about data.
Andrew and I experienced this problem firsthand at the last company we worked at. Andrew led the Product team and I led the Operations team, and we both found that it was extremely difficult to find, understand and use data without looping in someone on the data team to help. The problem was that we only had one employee on the data team, who supported over 100 employees asking questions about how to find and use company data, which meant that it would take around 2 weeks to get an answer to any data request.
Other data management tools focus on listing all data resources, regardless of their relevance or accuracy - you generally just get a list of what's available, but not in a form that's very meaningful. We adopted some of these tools in our last jobs but found that they created an overwhelming index of too many tables, dashboards and queries that weren’t relevant to most employees. This meant that even after adopting a tool to solve the problem, most employees still couldn’t use them to find, understand and use data.
Our approach to solving this problem is to build Secoda as a tool that helps data teams curate metadata for less technical employees. Instead of listing every resource, data teams can use our tool to curate and document data for specific departments or roles. As a result, employees who are less familiar with data will not be overloaded by information that is irrelevant or too technical. Our goal is basically to be like Google search for in-company data. You enter what you need and you get back the relevant information. We integrate into databases, data warehouses, BI, and transformation tools and offer both an on-prem and cloud-hosted deployment.
Over the last six months, our team has been working closely with our early adopters to improve the product. Today, we’re excited to share the launch of our self-service product with the HN community. You can now sign up to Secoda, connect your database or data warehouse and start using Secoda without a sales call. We offer a free 14-day trial (no credit card required). After the free trial, we charge per editor, per month. If you’d like, you can also take a look at this video of us setting up our Secoda workspace: https://www.loom.com/share/f41b317441554a36930b9cfe4c91a45f.
We're also hiring for a number of roles, which you can find here: https://www.workatastartup.com/companies/secoda.
We’d love to hear about your experiences with data discovery and any ideas/feedback/questions you might have about what we’re building!
TLDR
Solving data discovery starts by getting on top of your team's data debt. Data debt is a type of technical debt that is created when teams don’t catalogue, clean, categorize or organize their data. It drags down productivity and adds to the organization's compute costs. It also costs teams time and effort, but too often, data debt is difficult to measure. Two primary drivers determine the cost of data debt and help you understand your need for a data discovery tool:
Discovery time: Data discovery time is the amount of time that it takes for your data engineering and analytics team to find the right data, understand what it means and use it to analyze the data request.
Organization time: This is the amount of time spent cleaning, documenting and organizing data to make it legible for other employees.
You can measure the financial impact of data debt by looking at how much it costs your team to discover and organize the data. We recommend calculating data debt as (hours spent discovering data + hours spent organizing data) × average cost per hour.
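The calculation can be sketched in a few lines. The function name and the example figures (40 hours of discovery, 25 hours of organization, $60 per hour) are made up for illustration; plug in your own team's numbers.

```python
def data_debt_cost(discovery_hours: float, organization_hours: float, hourly_cost: float) -> float:
    """Cost of data debt: (discovery hours + organization hours) * average cost per hour."""
    if min(discovery_hours, organization_hours, hourly_cost) < 0:
        raise ValueError("hours and hourly cost must be non-negative")
    return (discovery_hours + organization_hours) * hourly_cost


# Illustrative monthly figures, not real data.
monthly_cost = data_debt_cost(discovery_hours=40, organization_hours=25, hourly_cost=60)
print(monthly_cost)  # 3900
```

Note that the hours are summed before multiplying; without the parentheses, only the organization hours would be priced.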
By having an idea of the cost of data debt, teams can more easily calculate their return on investment for a data discovery tool. Without the existing baseline, it’s much more difficult to get buy-in from the managers controlling the budget for your team. We hope this helps.
Data teams work as fast as possible to make sure that the business team is using the right information to make decisions. This pace and pressure create data debt.
We've learned that most teams put off documenting and governing their data today not because they want to, but because they feel like the work done to document and manage data is never complete and always outdated.
The problem with putting data governance off is that it creates inefficiencies that compound over time.
Putting off data governance creates data debt. Data debt is when you have undocumented, unused, incomplete and inconsistent data and it becomes a problem for teams much earlier than most teams realize.
This problem costs organizations directly and indirectly. The direct costs are related to data storage and compute. Data storage costs have decreased and will continue to drop, but compute is still a highly valued and costly resource. Running jobs to update tables that are not used consumes compute resources with no benefit. The indirect cost is lost productivity: collecting unused, undocumented, muddled data makes it much more difficult to find the right information.
Today, most teams don’t evaluate their data debt. Instead, they continue to collect data and dashboards, regardless of their value to the organization. Decreasing data debt will decrease technology costs and significantly increase productivity.
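One rough way to start evaluating data debt is to flag tables that nobody has queried recently, since those are the ones most likely to be consuming compute without any benefit. The sketch below assumes you can export a last-queried date per table; the table names, dates, and 90-day threshold are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical last-queried dates per table; None means never queried.
LAST_QUERIED = {
    "orders": date(2024, 5, 28),
    "legacy_events": date(2023, 11, 2),
    "tmp_export": None,
}


def stale_tables(last_queried, today, max_idle_days=90):
    """Return tables never queried or not queried within max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, last in last_queried.items()
        if last is None or last < cutoff
    )


print(stale_tables(LAST_QUERIED, today=date(2024, 6, 1)))
# ['legacy_events', 'tmp_export']
```

A list like this is a starting point for review, not an automatic delete list; some rarely queried tables still matter for audits or yearly reporting.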
We've built a tool that gives teams a simple dashboard to manage their data. Teams can add documentation, remove tables and collaborate with other employees to stay ahead of data debt with www.secoda.co. We’re excited to share Secoda with everyone looking to benefit from data management at an early stage.