A variety of tools can be used to create and manage data catalogs. Here are the key features, capabilities, and components of 16 well-known data catalog tools.
Data sprawl across numerous databases and other repositories in on-premises systems, cloud services, and IoT infrastructure is a growing problem for many enterprises. If data scientists, other data analysts, and business users cannot locate relevant data and understand what it means, data management becomes more difficult and BI and data analytics programs become less effective. According to Priya Iragavarapu, vice president of the Center of Data Excellence at consultancy firm AArete, “organizations are awash with data yet famished for insights.”
Data catalogs can provide a single view of an organization’s data assets. The concept of a catalog has existed since the early days of relational databases, when IT teams wanted to keep track of how data sets were connected, joined, and transformed across SQL tables. Modern data catalog tools catalog data and gather metadata from a much wider range of sources, including data lakes, data warehouses, NoSQL databases, cloud object storage, and other types of data stores.
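To make the idea of gathering metadata from a data source concrete, here is a minimal sketch in Python. It assumes a SQLite database stands in for the source system, and the function names `harvest_metadata` and `search` are hypothetical; commercial catalog tools do far more than this and do not necessarily work this way.

```python
import sqlite3


def harvest_metadata(conn):
    """Collect table and column metadata from a SQLite connection.

    Returns {table_name: [(column_name, declared_type), ...]}.
    """
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


def search(catalog, term):
    """Return the tables whose name or column names contain the term."""
    term = term.lower()
    return sorted(
        table
        for table, columns in catalog.items()
        if term in table.lower()
        or any(term in name.lower() for name, _ in columns)
    )


# Illustrative source system: an in-memory database with two tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

catalog = harvest_metadata(conn)
print(search(catalog, "customer"))  # ['customers', 'orders']
```

The search matches `orders` as well as `customers` because the former has a `customer_id` column, which hints at how even basic column-level metadata helps users find related data sets.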
Data catalogs are frequently combined with data governance software to help organizations keep up with evolving regulatory compliance requirements and other aspects of governance programs. The technology is also evolving to take advantage of natural language search, machine learning, and other AI capabilities. In 2021, research firm Gartner replaced the term “data catalog” with “augmented data cataloging and metadata management” in its Hype Cycle reports on emerging technologies.
Details on 16 well-known data catalog tools are provided below in alphabetical order. These tools may be able to assist your company in overcoming metadata management obstacles and enhancing end users’ access to and comprehension of data.
1. Alation Data Catalog
Alation was founded in 2012 and released its first products in 2015. The company’s flagship data catalog software uses AI, machine learning, automation, and natural language processing techniques to simplify data discovery, automatically create business glossaries, and power its central Behavioral Analysis Engine, which analyzes data usage patterns to help streamline data stewardship, data governance, and query optimization. The engine indexes various data sources and uses pattern recognition to produce popularity rankings, usage recommendations, and other insights.
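The general idea of deriving popularity rankings from usage patterns can be sketched in a few lines of Python. This is an illustration only, not Alation's method: the `rank_tables` function and the sample query log are hypothetical, and the regular expression is a deliberate simplification rather than a real SQL parser.

```python
import re
from collections import Counter

# Matches the identifier that follows FROM or JOIN. This is a rough
# heuristic for illustration, not a substitute for parsing SQL.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def rank_tables(query_log):
    """Rank tables by the number of logged queries that reference them."""
    counts = Counter()
    for query in query_log:
        # Count each table at most once per query.
        tables = {name.lower() for name in TABLE_PATTERN.findall(query)}
        counts.update(tables)
    # Most-referenced first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical query log.
query_log = [
    "SELECT * FROM sales.orders o JOIN sales.customers c ON o.customer_id = c.id",
    "select total from sales.orders where total > 100",
    "SELECT email FROM sales.customers",
    "SELECT * FROM sales.orders",
]

ranking = rank_tables(query_log)
print(ranking)  # [('sales.orders', 3), ('sales.customers', 2)]
```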
2. Alex Augmented Data Catalog
A more recent data catalog and metadata management vendor, Alex Solutions was founded in 2016. The company designed its data catalog software from the start to take advantage of AI and machine learning techniques. Alex’s Augmented Data Catalog can automate the process of finding data assets and bringing them into a consolidated catalog. The tool supports various types of structured, semistructured, and unstructured data. The company has also developed a marketplace for metadata connectors that help capture and fine-tune metadata for particular business needs or industry requirements.
3. Ataccama Data Catalog
This data catalog tool is a core component of Ataccama One, a consolidated platform that uses AI to automate data governance and management operations. It is offered by Ataccama, which was founded in 2008. Ataccama Data Catalog connects to many well-known on-premises and cloud data platforms and can catalog data from databases, data lakes, file systems, and other sources. The data catalog also includes capabilities for automating data discovery and change detection.
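One simple way to think about change detection is as a comparison of two metadata snapshots taken at different times. The Python sketch below illustrates that idea under stated assumptions: the `diff_schemas` function and the snapshot layout (`{table: {column: type}}`) are hypothetical and are not drawn from Ataccama's product.

```python
def diff_schemas(old, new):
    """Compare two schema snapshots shaped as {table: {column: type}}.

    Returns a list of (change_kind, object_name) tuples.
    """
    changes = []
    for table in sorted(set(old) | set(new)):
        if table not in old:
            changes.append(("table_added", table))
            continue
        if table not in new:
            changes.append(("table_removed", table))
            continue
        # The table exists in both snapshots, so compare its columns.
        old_cols, new_cols = old[table], new[table]
        for column in sorted(set(old_cols) | set(new_cols)):
            name = f"{table}.{column}"
            if column not in old_cols:
                changes.append(("column_added", name))
            elif column not in new_cols:
                changes.append(("column_removed", name))
            elif old_cols[column] != new_cols[column]:
                changes.append(("type_changed", name))
    return changes


# Hypothetical snapshots from two catalog scans.
yesterday = {
    "orders": {"id": "INTEGER", "total": "REAL"},
    "legacy": {"id": "INTEGER"},
}
today = {
    "orders": {"id": "INTEGER", "total": "TEXT", "currency": "TEXT"},
    "customers": {"id": "INTEGER"},
}

changes = diff_schemas(yesterday, today)
for change in changes:
    print(change)
```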
4. Atlan Data Catalog
Atlan is one of the newest data catalog vendors, having brought its tool to market in 2018. It pitches the product as a third-generation data catalog that takes design cues from end-user platforms such as GitHub and Slack. In particular, Atlan Data Catalog is designed to make collaboration easy and to integrate smoothly with common data workflows. For instance, data teams can use the tool to flag issues that need to be addressed.
5. AWS Glue Data Catalog
AWS Glue Data Catalog is the persistent metadata store for AWS Glue, an extract, transform, and load (ETL) service provided by AWS. When building data warehouses or data lakes on the AWS cloud platform, data management teams can use the data catalog to store, annotate, and share metadata for use in ETL integration jobs. It offers features comparable to, and is compatible with, the metastore repository in Apache Hive, a popular open source data warehouse technology. In some cases, organizations can also use the AWS data catalog to incorporate an external metastore for Hive data.
6. Boomi Data Catalog and Preparation
Boomi Data Catalog and Preparation is part of the company’s AtomSphere Platform, a collection of products that also supports data integration, master data management, and other activities. It combines a data catalog with data preparation tools: organizations can use the catalog to build a consolidated business glossary of metadata for managing data sets, processing jobs, and workflow schedules. They can then use a data prep recommendation engine to automatically clean, enrich, normalize, and transform their data.
7. Collibra Data Catalog
Founded in 2008, Collibra provides a Data Intelligence Cloud platform that is built around Collibra Data Catalog. The data catalog supports a wide range of automated features, including data discovery and classification using a custom machine learning algorithm, machine learning-powered data curation, and data lineage. The tool also offers graph-based metadata management capabilities that help users understand the provenance and quality of data.
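Data lineage lends itself to a graph representation: each data set points to the sources it was derived from, and provenance questions become graph traversals. The following Python sketch shows the general technique with a hypothetical `upstream` function and made-up data set names; it does not represent Collibra's implementation.

```python
from collections import deque


def upstream(lineage, dataset):
    """Return every data set that the given one depends on, directly or not.

    `lineage` maps a data set name to the list of its direct sources.
    """
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        source = queue.popleft()
        if source in seen:
            continue  # Already visited; also guards against cycles.
        seen.add(source)
        queue.extend(lineage.get(source, []))
    return sorted(seen)


# Hypothetical lineage graph: a dashboard built on derived tables.
lineage = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders", "fx_rates"],
    "orders": ["raw_orders"],
}

print(upstream(lineage, "revenue_dashboard"))
# ['daily_revenue', 'fx_rates', 'orders', 'raw_orders']
```

A breadth-first traversal is used here, but any graph search would do; the `seen` set keeps the walk finite even if the lineage data contains a cycle.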
8. Data.world
Data.world provides a cloud-native data catalog tool as a SaaS application. Founded in 2015, the company touts its rapid pace of feature delivery, with more than 1,000 distinct product updates annually. It is well known for its knowledge graph approach, which gives users a semantically organized view of enterprise data assets and their associated metadata across many platforms. The goal is to make it easier for business and analytics users to find relevant data and understand its context.
9. Erwin Data Catalog
The original Erwin data modeling software was developed in 1983; over time, the product line changed hands through a number of acquisitions and is now owned by Quest Software. It has also expanded to support new capabilities, including this data catalog tool, which was created as a component of a larger platform introduced in 2017 to support various aspects of data governance. The tool, formally named Erwin Data Catalog by Quest, automatically harvests, catalogs, and curates metadata.
10. Google Cloud Data Catalog
Google Cloud Data Catalog is a fully managed data discovery and metadata management service that works with both cloud and on-premises data sources. Its user interface is designed to let both data professionals and business users tag data at scale and search the catalog using natural language queries. The tool has built-in integrations with Google’s BigQuery, Pub/Sub, and Cloud Storage data services. It is also integrated with the company’s Identity and Access Management and Cloud Data Loss Prevention services to support data security and compliance management as part of data governance initiatives.
11. IBM Watson Knowledge Catalog
IBM Watson Knowledge Catalog is a metadata repository designed specifically to support workflows for AI, machine learning, and other types of analytics. It integrates with the company’s underlying InfoSphere Information Governance Catalog to help enterprises discover and govern data across cloud and on-premises sources. The Watson tool can catalog a range of data and analytics assets, including structured, semi-structured, and unstructured data as well as machine learning models. It supports data discovery and intelligent cataloging, aided by automated search recommendations.
12. Informatica Enterprise Data Catalog
Since its founding in 1993 with a focus on data integration tools, Informatica has broadened the range of technologies it offers to include this data catalog tool. Using an engine powered by machine learning algorithms, Informatica Enterprise Data Catalog can automatically scan, ingest, and classify data from systems within an enterprise as well as from multi-cloud platforms, BI tools, ETL workflows, and external metadata catalogs.
13. Lumada DataOps Data Catalog
In 2017, Hitachi created Hitachi Vantara, a new entity, to house all of its data management, analytics, and storage capabilities. The company obtained this technology through its 2020 acquisition of data catalog vendor Waterline Data, and it is now part of Hitachi Vantara’s rebranded line of data management and analytics products, called Lumada DataOps. The data catalog software extends metadata management capabilities to cover well-known databases, emerging IoT data infrastructure, and other data sources.
14. Microsoft Purview Data Catalog
This tool is part of Microsoft Purview, a cloud service for data governance, compliance, and risk management that launched in April 2022, when the company rebranded and expanded an Azure Purview product line that had only recently become available. The data catalog software officially replaces Azure Data Catalog, an older tool that has been superseded by Purview. Microsoft Purview Data Catalog offers an enterprise-level business glossary, which removes the need for Excel-based data dictionaries.
15. Oracle Cloud Infrastructure Data Catalog
Oracle Cloud Infrastructure Data Catalog, also known as OCI Data Catalog, was created to support Oracle’s existing technology ecosystem. The metadata management cloud service creates an inventory of data assets and a business glossary for users. It can automatically harvest metadata, either on demand or on a schedule, from Oracle data stores and a number of other well-known data sources in both cloud and on-premises systems.
16. OvalEdge
Founded in 2013, OvalEdge offers a data catalog tool with integrated data governance features. The company emphasizes the price and usability of its eponymous software, saying that, on average, it has a lower total cost of ownership than comparable data catalog products. The OvalEdge tool indexes metadata by crawling numerous databases, data lake platforms, BI and analytics systems, and custom applications. It then uses AI and machine learning algorithms to automatically organize and catalog the data based on tags, usage patterns, and other indicators.
Open source data catalog software
Companies may also want to consider open source data catalog tools. Many of them were created by businesses that wanted a more effective and efficient technology to solve their own data cataloging problems. The following tools are some of the best-known open source options:
- Amundsen: Lyft developed this data discovery and metadata engine to help boost the productivity of data scientists and other users within its complex data architecture. The ride-hailing company made the technology available as open source in 2019.
- Apache Atlas: The Atlas software combines data governance, metadata management, and a data catalog. Hortonworks, a former big data platform vendor, created it primarily for use in Hadoop clusters, and in 2015 it was transferred to the Apache Software Foundation.
- DataHub: The data team at LinkedIn built this metadata search and discovery tool by redesigning and enhancing its earlier WhereHows tool, with the aim of helping internal users understand the context of data. DataHub was released as open source in 2020.
- Metacat: Netflix developed this federated metadata discovery and exploration tool to streamline data discovery, data preparation, and data science activities in its big data environment. The technology was released as open source in 2018.