Introduction to Graph Databases: Azure Cosmos DB for Apache Gremlin
Previously, in the blog post Getting started with Azure Cosmos Database (A Deep Dive), we gave an end-to-end introduction to Azure Cosmos DB. In this post we look at one of the Azure Cosmos DB APIs: the API for Apache Gremlin.
Graph databases are a powerful way to model and query complex relationships in your data. Let’s dive into working with Azure Cosmos DB for Apache Gremlin, which provides a graph database service.
What’s covered in this blog
Need for a Graph database
Challenge: Mapping Relationships and Insights
What is Apache Gremlin?
Azure Cosmos DB for Apache Gremlin
Benefits for Gremlin API
Common scenarios for Gremlin API
Introduction to graph databases
Provisioning Azure Cosmos DB for Apache Gremlin via Azure portal
Modeling a graph Database
When Do You Need a Graph Database?
Graph databases are a good fit when your data domain exhibits certain characteristics:
Entities are highly connected through descriptive relationships.
There are cyclic relationships or self-referenced entities (which can be challenging in relational or document databases).
Dynamically evolving relationships exist between entities (common in hierarchical or tree-structured data).
Many-to-many relationships between entities.
Both entities and relationships have read and write requirements.
If your data fits these criteria, a graph database approach can provide advantages in terms of query complexity, data model scalability, and query performance.
Introduction:
Scenario: Building a Social Network for Student Entrepreneurs
Your vision is to build a dynamic social network where aspiring entrepreneurs can connect, collaborate, and support each other’s ventures. As you embark on this journey, you encounter various challenges in structuring and querying your database to efficiently handle complex relationships and data connections.
The Challenge: Mapping Relationships and Insights
In the world of social networking platforms, understanding relationships between users, their interests, connections, and interactions is paramount. Traditional relational databases might fall short when it comes to representing these intricate networks. This is where Azure Cosmos DB for Apache Gremlin steps in as your trusted guide.
Apache Gremlin: Your Graph Database Companion
In your quest to build a thriving social network, you realize that a graph database is the perfect fit for capturing the interconnected nature of user relationships and interactions.
What is Apache Gremlin?
Apache Gremlin is the graph traversal language and virtual machine of Apache TinkerPop, an open-source project of the Apache Software Foundation. It’s specifically designed for querying graph databases and navigating complex networks of interconnected data. Gremlin provides a flexible and powerful framework for traversing graphs, enabling users to perform a wide range of operations such as pathfinding, filtering, and graph analytics.
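To make the idea of a traversal concrete, here is a minimal sketch in plain Python that imitates how a Gremlin query such as g.V().has('name', 'Ann').out('knows').values('name') moves through a graph one step at a time. The Traversal class, the sample people, and the 'knows' label are all illustrative assumptions. This is not how TinkerPop is implemented, only an in-memory imitation of three steps.

```python
# Toy in-memory graph: hypothetical sample data, not from a real database.
VERTICES = {
    "v1": {"label": "person", "name": "Ann"},
    "v2": {"label": "person", "name": "Bob"},
    "v3": {"label": "person", "name": "Cy"},
}
# Each edge is (source vertex id, edge label, target vertex id).
EDGES = [("v1", "knows", "v2"), ("v1", "knows", "v3"), ("v2", "knows", "v3")]


class Traversal:
    """Imitates a few Gremlin steps; each step returns a new Traversal."""

    def __init__(self, ids):
        self.ids = list(ids)

    def has(self, key, value):
        # Filter step: keep vertices whose property `key` equals `value`.
        return Traversal(i for i in self.ids if VERTICES[i].get(key) == value)

    def out(self, label):
        # Move step: follow outgoing edges with this label to their targets.
        return Traversal(
            dst
            for i in self.ids
            for src, lbl, dst in EDGES
            if src == i and lbl == label
        )

    def values(self, key):
        # Terminal step: read one property from every remaining vertex.
        return [VERTICES[i][key] for i in self.ids]


def V():
    """Start a traversal at every vertex, similar to g.V() in Gremlin."""
    return Traversal(VERTICES)


# Roughly: g.V().has('name', 'Ann').out('knows').values('name')
friends_of_ann = V().has("name", "Ann").out("knows").values("name")
print(friends_of_ann)
```

Each step narrows or moves the current set of vertices, which is the mental model to keep when reading real Gremlin queries.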
Azure Cosmos DB for Apache Gremlin
Azure Cosmos DB for Apache Gremlin is a graph database service that can be used to store massive graphs with billions of vertices and edges. You can query the graphs with millisecond latency and evolve the graph structure easily. This integration enables users to store and query graph data using Azure Cosmos DB’s globally distributed, highly available infrastructure.
The API for Gremlin combines the power of graph database algorithms with highly scalable, managed infrastructure. This approach provides a unique and flexible solution to common data problems associated with inflexible or relational constraints.
Benefits for Gremlin API
The API for Gremlin has added benefits of being built on Azure Cosmos DB:
Elastically Scalable Throughput and Storage: Cosmos DB supports horizontally scalable graph databases with unlimited storage and provisioned throughput. Data is automatically distributed using graph partitioning as the database scale grows.
Multi-region Replication: Graph data can be automatically replicated to any Azure region worldwide, enabling global access to data with minimal latency. Cosmos DB provides a service-managed regional failover mechanism to ensure application continuity in case of service interruptions.
Fast Queries and Traversals with Gremlin: Cosmos DB supports querying heterogeneous vertices and edges using the Gremlin query language, which is widely adopted in the graph database community. This allows for rich real-time queries and traversals without the need for schema hints, secondary indexes, or views.
Fully Managed Graph Database: Cosmos DB eliminates the need for managing database and machine resources, allowing developers to focus on delivering application value. It automatically handles tasks such as virtual machine management, software updates, sharding, replication, and backups, ensuring high availability and reliability.
Automatic Indexing: Gremlin API automatically indexes all properties within nodes and edges without requiring schema definition or creation of secondary indices.
Compatibility with Apache TinkerPop: The Gremlin API supports the open-source Apache TinkerPop standard, enabling integration with a vast ecosystem of applications and libraries.
Tunable Consistency Levels: Cosmos DB provides five well-defined consistency levels (strong, bounded-staleness, session, consistent prefix, and eventual) to balance consistency, availability, and latency based on application requirements. This flexibility allows developers to make informed tradeoffs to optimize performance.
Common scenarios for Gremlin API
Social networks/Customer 365: By combining data about your customers and their interactions with other people, you can develop personalized experiences, predict customer behavior, or connect people with others with similar interests. Azure Cosmos DB can be used to manage social networks and track customer preferences and data.
Recommendation engines: This scenario is commonly used in the retail industry. By combining information about products, users, and user interactions, like purchasing, browsing, or rating an item, you can build customized recommendations. The low latency, elastic scale, and native graph support of Azure Cosmos DB are ideal for these scenarios.
Geospatial: Many applications in telecommunications, logistics, and travel planning need to find a location of interest within an area or locate the shortest/optimal route between two locations. Azure Cosmos DB is a natural fit for these problems.
Internet of Things: With the network and connections between IoT devices modeled as a graph, you can build a better understanding of the state of your devices and assets. You also can learn how changes in one part of the network can potentially affect another part.
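The recommendation-engine scenario above can be sketched in a few lines of plain Python. The users, products, and purchases below are made-up sample data. The scoring rule (weight a candidate product by how many purchases its buyer shares with you) is one simple heuristic chosen for illustration, not something Azure Cosmos DB computes for you.

```python
from collections import Counter

# Hypothetical "purchased" edges: user -> set of products.
PURCHASES = {
    "user_a": {"keyboard", "mouse"},
    "user_b": {"keyboard", "monitor"},
    "user_c": {"keyboard", "mouse", "headset"},
}


def recommend(user, purchases):
    """Rank products bought by users who share a purchase with `user`."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(owned & items)  # number of products in common
        if overlap == 0:
            continue
        for item in items - owned:  # only products the user lacks
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))


print(recommend("user_a", PURCHASES))
```

In a graph database the same idea becomes a two-hop traversal: from a user to their products, then back out to other users and on to the products those users bought.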
Introduction to graph databases
Traditional data modeling defines entities separately and computes their relationships at query time. A graph database approach instead persists relationships in the storage layer, which leads to highly efficient graph retrieval operations. The API for Gremlin supports the property graph model.
Property graph objects
A property graph is a structure that’s composed of vertices and edges. Both objects can have an arbitrary number of key-value pairs as properties.
Vertices/nodes: Vertices denote discrete entities, such as a person, place, or an event.
ID: Each vertex has a unique ID enforced per partition. If no value is supplied upon insertion, an auto-generated GUID is stored.
Label: The vertex label defines the type of entity it represents. If no value is supplied, a default vertex label is used.
Properties: Vertices can have properties (stored as key-value pairs). These properties can be of type string, boolean, or numeric.
Partition Key: Vertices are partitioned, and the partition key determines their distribution across partitions.
Edges/relationships: Edges denote relationships between vertices. For example, a person might know another person, be involved in an event, or have recently been at a location.
Properties: Properties express information (or metadata) about the vertices and edges. There can be any number of properties in either vertices or edges, and they can be used to describe and filter the objects in a query. Example properties include a vertex that has name and age, or an edge, which can have a time stamp and/or a weight.
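The pieces of a property graph described above can be sketched as two small Python data classes. This is only a mental model, not the storage format used by Azure Cosmos DB. The class and field names are my own, the default label "vertex" is an assumption about the service default, and the sample age and weight values are invented.

```python
import uuid
from dataclasses import dataclass, field

ALLOWED_TYPES = (str, bool, int, float)  # string, boolean, or numeric


def check_properties(properties):
    """Reject property values that are not string, boolean, or numeric."""
    for key, value in properties.items():
        if not isinstance(value, ALLOWED_TYPES):
            raise TypeError(f"property {key!r} has an unsupported type")


@dataclass
class Vertex:
    partition_key: str
    label: str = "vertex"  # assumed default label when none is supplied
    # Auto-generated GUID when no ID is supplied on insertion.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        check_properties(self.properties)


@dataclass
class Edge:
    label: str
    source: Vertex
    target: Vertex
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        check_properties(self.properties)


# Sample values (age, weight) are made up for illustration.
robin = Vertex(partition_key="people", label="person",
               properties={"name": "Robin", "age": 30})
thomas = Vertex(partition_key="people", label="person",
                properties={"name": "Thomas"})
knows = Edge("knows", source=thomas, target=robin,
             properties={"weight": 0.5})
print(knows.source.properties["name"], knows.label,
      knows.target.properties["name"])
```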
Example of a graph database
This graph has the following vertex types. These types are also called labels in Gremlin:
People: The graph has three people: Robin, Thomas, and Ben.
Interests: Their interests, in this example, include the game of Football.
Devices: The devices that people use.
Operating Systems: The operating systems that the devices run on.
Place: The places where devices are accessed.
We represent the relationships between these entities via the following edge types:
Knows: Represents familiarity. For example, “Thomas knows Robin”.
Interested: Represents the interests of the people in our graph. For example, “Ben is interested in Football”.
RunsOS: Represents what OS a device runs. For example, “Laptop runs the Windows OS”.
Uses: Represents which device a person uses. For example, “Robin uses a Motorola phone with serial number 77”.
Located: Represents the location from which the devices are accessed.
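One way to check your understanding of this model is to write it down as data and verify that every edge connects the vertex types it is meant to. The sketch below does that in plain Python, using only the vertices and example relationships named above. The lowercase label names and the rule table are my own reading of the description. Azure Cosmos DB does not enforce a schema like this, so treat it as a client-side sanity check.

```python
from collections import Counter

# Vertices named in the example, mapped to their label (vertex type).
VERTEX_LABELS = {
    "Robin": "person", "Thomas": "person", "Ben": "person",
    "Football": "interest",
    "Laptop": "device", "Motorola phone": "device",
    "Windows": "os",
}

# Which labels each edge type is expected to connect: (source, target).
EDGE_RULES = {
    "knows": ("person", "person"),
    "interested": ("person", "interest"),
    "runsOS": ("device", "os"),
    "uses": ("person", "device"),
    "located": ("device", "place"),
}

# The example relationships quoted in the list above.
EDGES = [
    ("Thomas", "knows", "Robin"),
    ("Ben", "interested", "Football"),
    ("Laptop", "runsOS", "Windows"),
    ("Robin", "uses", "Motorola phone"),
]


def is_valid(edge):
    """True when the edge joins the vertex labels its type expects."""
    src, label, dst = edge
    actual = (VERTEX_LABELS.get(src), VERTEX_LABELS.get(dst))
    return EDGE_RULES.get(label) == actual


# Similar in spirit to g.V().groupCount().by(label) in Gremlin.
label_counts = Counter(VERTEX_LABELS.values())
print(dict(label_counts))
print(all(is_valid(edge) for edge in EDGES))
```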
Creating Azure Cosmos DB for Apache Gremlin in Azure portal
Prerequisites
Azure account: see Create Azure account.
Azure subscription: see more on subscriptions.
Steps for Creating Resources in the Azure Portal
Let’s Create an Account
Search for Azure Cosmos DB
Create Azure Cosmos DB Account
Choose Azure Cosmos DB for Apache Gremlin
On the Create Azure Cosmos DB Account page:
Choose your subscription.
Choose or create a resource group.
Create the account name (make it unique).
Enable availability zones if you want to improve your app’s availability and resiliency.
Choose the location of your DB according to the available data centers.
Capacity mode defines how throughput is allocated. The provisioned throughput option can also be combined with the free tier.
Selecting Geo-Redundancy makes your database available in the paired region, for example East US and West US, or South Africa North and South Africa West. For this demo, ‘South Africa West’ is not included in my subscription.
Multi-region writes capability allows you to take advantage of the provisioned throughput for your databases and containers across the globe.
Under Networking, you can connect to your Azure Cosmos DB account either publicly, via public IP addresses or service endpoints, or privately, using a private endpoint. Choose according to your use case.
Connection Security Settings: I will go with TLS 1.2.
Backup policy defines the way your backup will occur.
I will let Microsoft encrypt my account using service-managed keys.
I don’t need to create a tag for now, so I will review and create.
Once the resource has been provisioned successfully, click Go to resource to open the overview page of the resource.
Congratulations, you have created an Azure Cosmos DB for Apache Gremlin account.
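If you prefer scripting over clicking, the portal steps above correspond roughly to a single Azure CLI call. The sketch below only builds and prints that command from Python; it does not run it. The account name, resource group, and region are placeholder assumptions, and you should confirm the flags against az cosmosdb create --help for your CLI version before using them.

```python
import shlex


def build_create_account_command(account, resource_group, region,
                                 free_tier=False, multi_region_writes=False):
    """Return an `az cosmosdb create` call as an argument list (not run)."""
    args = [
        "az", "cosmosdb", "create",
        "--name", account,
        "--resource-group", resource_group,
        # This capability selects the API for Apache Gremlin.
        "--capabilities", "EnableGremlin",
        # One write region, without zone redundancy.
        "--locations", f"regionName={region}",
        "failoverPriority=0", "isZoneRedundant=False",
        "--enable-multiple-write-locations",
        str(multi_region_writes).lower(),
    ]
    if free_tier:
        args += ["--enable-free-tier", "true"]
    return args


# Placeholder names: replace them with your own values.
command = build_create_account_command(
    account="my-gremlin-account",
    resource_group="my-resource-group",
    region="southafricanorth",
    free_tier=True,
)
print(shlex.join(command))
```

Keeping the command as a list makes it easy to pass to subprocess.run later, once you are happy with the printed result.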
Creating a container in Azure Cosmos DB for Gremlin
This part explains how to create a container in Azure Cosmos DB for Gremlin using Data Explorer in the Azure portal. It covers creating the container, specifying the partition key, and provisioning throughput.
NOTE: When creating containers, make sure you don’t create two containers with the same name but different casing. That’s because some parts of the Azure platform are not case-sensitive, and this can result in confusion/collision of telemetry and actions on containers with such names.
Open the Data Explorer pane and select New Graph. Next, provide the following details:
Indicate whether you are creating a new database or using an existing one. Since this is a new DB account, I will create a new database.
Enter a Database ID.
Select the database throughput mode: autoscale or manual.
Enter the throughput to be provisioned (for example, 1000 RU/s).
Enter a graph ID.
Select indexing (automatic or off).
Enter a partition key for vertices.
Select OK.
Congratulations, you have created a container in Azure Cosmos DB for Apache Gremlin.
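Before filling in the New Graph form, it can help to sanity-check the values you plan to enter. The sketch below is a hypothetical helper, not part of any Azure SDK. It checks that the partition key is written as a path starting with "/", that the throughput is positive, and that the graph ID does not collide with an existing name when casing is ignored, which is the problem the NOTE above warns about. The IDs used are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSettings:
    database_id: str
    graph_id: str
    partition_key_path: str  # for example "/gameId"
    throughput: int          # request units per second (RU/s)
    autoscale: bool = False


def validate(settings, existing_graph_ids):
    """Return a list of problems; an empty list means the values look fine."""
    problems = []
    if not settings.partition_key_path.startswith("/"):
        problems.append("partition key path should start with '/'")
    if settings.throughput <= 0:
        problems.append("throughput must be a positive number of RU/s")
    # Compare names case-insensitively to catch 'Players' vs 'players'.
    taken = {name.casefold() for name in existing_graph_ids}
    if settings.graph_id.casefold() in taken:
        problems.append("graph ID differs from an existing one only by case")
    return problems


# Placeholder IDs for illustration.
settings = GraphSettings("gamesdb", "Players", "/gameId", 1000)
print(validate(settings, existing_graph_ids=["players"]))
print(validate(settings, existing_graph_ids=[]))
```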
Let’s Create Some Vertices and Edges
Begin by clicking the New Vertex button.
Start by giving the vertex a label (for example, a gamer tag).
Enter a gameId (e.g. 1111).
Add more properties, such as Team and preferredClass:
Team: red
preferredClass: Mage
Click OK.
Repeat steps 1, 2, and 3 twice to add two more vertices with the data below.
And now you have successfully added 3 vertices.
Let’s make some edges by connecting the vertices with a relationship.
We can add a relationship between John and Ben that they know each other.
This is done by clicking John’s ID, then clicking Target to add Ben’s ID, and setting the edge label to “know”.
We can also add the relationship between John and Jane that they know each other by repeating the step above. The final relationship graph will look like this.
Congratulations, you have created vertices and edges in a Gremlin graph.
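The same vertices and edges can also be created with Gremlin statements instead of the portal buttons. The sketch below only builds the statement text in Python so you can paste it into the Data Explorer query box. The helper functions, the 'gamer' label, and the IDs 'john' and 'ben' are illustrative assumptions, as is attaching the gameId, Team, and preferredClass values to John. In application code, prefer the parameter bindings offered by your Gremlin driver over string building.

```python
def gremlin_literal(value):
    """Render a Python value as a Gremlin literal (strings single-quoted)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes first, then single quotes.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def add_vertex(label, properties):
    """Build a g.addV(...) statement with one .property(...) per entry."""
    steps = [f"g.addV({gremlin_literal(label)})"]
    steps += [
        f".property({gremlin_literal(key)}, {gremlin_literal(value)})"
        for key, value in properties.items()
    ]
    return "".join(steps)


def add_edge(source_id, label, target_id):
    """Build a statement that links two existing vertices by ID."""
    return (
        f"g.V({gremlin_literal(source_id)})"
        f".addE({gremlin_literal(label)})"
        f".to(g.V({gremlin_literal(target_id)}))"
    )


# Hypothetical label and IDs; property values follow the steps above.
vertex_statement = add_vertex("gamer", {
    "id": "john",
    "gameId": "1111",
    "Team": "red",
    "preferredClass": "Mage",
})
edge_statement = add_edge("john", "know", "ben")
print(vertex_statement)
print(edge_statement)
```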
Read more:
Graph data modeling with Azure Cosmos DB for Apache Gremlin
Azure Cosmos DB for Gremlin graph support and compatibility with TinkerPop features
Visualize graph data stored in Azure Cosmos DB for Gremlin with data visualization solutions
Query Azure Cosmos DB for Gremlin by using Gremlin
Pricing model in Azure Cosmos DB
QuickStart SDK:
Azure Cosmos DB for Apache Gremlin library for Python
Azure Cosmos DB for Apache Gremlin library for Node.js
Azure Cosmos DB for Apache Gremlin library for .NET