Save money on your Sentinel ingestion costs with Data Collection Rules
This article is co-authored by Brian Delaney, Andrea Fisher, and Jon Shectman.
As digital environments continue to expand, Security Operations teams are often asked to optimize costs even as the amount of data they need to collect and store grows exponentially. Teams may feel forced to choose between collecting a particular data set or log source and staying within a limited security budget.
Today, we’ll outline a strategy you can use to reduce your data volume while also collecting and retaining the information that really matters. We’ll show you how to use Data Collection Rules (DCRs) to drop information from logs that are less valuable to you. Specifically, we’ll first discuss the thought process around deciding what’s important in a log to your organization. Then we’ll show you the process of using DCRs to “project-away” information you don’t want or need using two log source examples. This process saves direct ingress and long-term retention costs, and reduces analyst fatigue.
One word of caution. Only you can really decide what's important to your organization in a particular log or table. The transformations themselves can always be changed or removed later, but any data you elect to drop ("project-away") is never ingested and cannot be recovered. This is why we're spending time discussing the thought process of deciding what's really important.
A Word about DCRs (or What is a DCR and Why Should I Care?)
We won’t have space in this blog entry to go deep into DCRs, as they can quickly get complicated. For a thorough discussion, please visit Custom data ingestion and transformation in Microsoft Sentinel | Microsoft Learn.
There are two points we need to cover here. First, what exactly is a DCR, and why should you care? A DCR is one of the ways Sentinel and Log Analytics give you a high degree of control over the specific data that actually gets ingested into your workspace. Think of a DCR as a way to manipulate the ingestion pipeline. For our purposes, DCRs can be thought of as a set of basic KQL queries applied to incoming logs that let you do something with that data: filter out irrelevant data, enrich existing data, mask sensitive attributes, or even perform Advanced Security Information Model (ASIM) normalization. As you've probably guessed, it's this first capability (filtering out irrelevant data) we're concerned with here.
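To make that concrete, a transformation is simply a KQL statement applied to a virtual input table named source before the data lands in your workspace. The following sketch is purely illustrative, and the column names (Level, RawEventData) are hypothetical stand-ins rather than columns from any specific table:

source
| where Level != "Verbose"
| project-away RawEventData

The first line drops whole records you don't need; the second drops a column from the records you keep. Those two patterns are the ones we'll use throughout this post.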
Second, for our purposes, there are two kinds of DCRs: standard and workspace.
Standard DCRs are currently supported for AMA-based connectors and workflows using the new Logs Ingestion API. An example of a standard DCR is the one used for Windows Security Events collected through the Azure Monitor Agent (AMA).
Workspace transformation DCRs serve supported workflows in a workspace that aren't served by standard DCRs. A Sentinel workspace can have only one workspace transformation DCR, but that DCR contains separate transformations for each input stream. An example of a workspace DCR is the one used for AADNonInteractiveUserSignInLogs collected via diagnostic settings.
Workspace DCRs do not apply when a standard DCR is used to ingest the data.
Finding High-volume Sources
To optimize costs, it is important to understand where all the data is going before making difficult decisions about which logs to drop and which logs to keep. We recommend focusing on high-volume sources, as they will have the biggest return on your efforts.
Determining High-volume Tables
First, if you haven’t already, you’ll want to determine your high-volume billable tables (as not all tables are billable) to see where you can have the most impact when optimizing costs. You can get this with a simple KQL query:
Usage
| where TimeGenerated > ago(30d)
| where IsBillable
| summarize SizeInGB=sum(Quantity) / 1000 by DataType
| sort by SizeInGB desc
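If you also want to see how each table trends over time, for example to spot a recent spike that is driving up costs, the same query can be binned by day. This is just an illustrative extension of the query above:

// Daily billable volume per table over the last 30 days
Usage
| where TimeGenerated > ago(30d)
| where IsBillable
| summarize SizeInGB=sum(Quantity) / 1000 by DataType, bin(TimeGenerated, 1d)
| sort by TimeGenerated asc, SizeInGB desc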
Record Level Analysis
Once you have determined your high-volume billable tables, look at the volume per record type. You may need to experiment with different groupings to surface high-volume patterns that offer little security value. For example, with the SecurityEvent table, it is useful to know which Event IDs contribute the most volume so you can assess their security value. Keep in mind that the count of events is not directly related to cost, since some events are much larger than others. For this, we will use the _BilledSize column, which contains the billed size of each record in bytes:
SecurityEvent
| summarize SizeInMB=sum(_BilledSize) / 1000 / 1000 by EventID
| sort by SizeInMB desc
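If you'd rather not look up every Event ID manually, an optional variation also groups by the Activity column (which normally carries the event description in SecurityEvent) and adds an approximate share-of-table percentage. This is an illustrative sketch, assuming Activity is populated in your data:

// Volume per Event ID with its description and approximate share of the table
let TotalMB = toscalar(SecurityEvent | summarize sum(_BilledSize)) / 1000 / 1000;
SecurityEvent
| summarize SizeInMB=sum(_BilledSize) / 1000 / 1000 by EventID, Activity
| extend PercentOfTable = round(100.0 * SizeInMB / TotalMB, 1)
| sort by SizeInMB desc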
Column Level Analysis
In some cases, you may not be able to discard entire records, but there may be an opportunity to discard columns or parts of columns. When browsing a data source, you may find that some columns carry a significant share of the data. For example, in AADNonInteractiveUserSignInLogs, the ConditionalAccessPolicies column is a large array recording the status of every conditional access policy and whether it applied to the background token activity. For this we will use the estimate_data_size() function:
AADNonInteractiveUserSignInLogs
| extend ColumnSize = estimate_data_size(ConditionalAccessPolicies)
| summarize RecordSizeInMB=sum(_BilledSize) / 1000 / 1000, ColumnSizeInMB=sum(ColumnSize) / 1000 / 1000
| extend PercentOfTotal = round(100.0 * ColumnSizeInMB / RecordSizeInMB, 1)
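To translate that share into a monthly figure you can weigh against the column's investigative value, a query like the following estimates how many gigabytes the column contributed over the last 30 days. This is an illustrative estimate only, since estimate_data_size() approximates the column's in-record size rather than the exact billed bytes:

// Approximate 30-day volume contributed by the ConditionalAccessPolicies column alone
AADNonInteractiveUserSignInLogs
| where TimeGenerated > ago(30d)
| summarize PotentialSavingsInGB = sum(estimate_data_size(ConditionalAccessPolicies)) / 1000 / 1000 / 1000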
Examining the Process
Let’s look at this process of reducing ingestion using DCRs in two examples – one for workspace DCRs and one for standard.
AADNonInteractiveUserSignInLogs
SOC engineers and managers often worry about the cost of bringing in additional logs like AADNonInteractiveUserSignInLogs. Non-interactive user sign-ins are sign-ins performed by a client app or an OS component on behalf of a user. Unlike interactive user sign-ins, these sign-ins do not require the user to supply an authentication factor; instead, they use a token or code to authenticate on the user's behalf. You can see how bad actors might make use of this type of authentication, so there is good reason to ingest these logs.
There is a potentially significant optimization opportunity with the AADNonInteractiveUserSignInLogs table. One of its fields contains information about conditional access policy evaluation, and fifty to eighty percent of the log data is typically conditional access policy data. In many cases the non-interactive log will have the same conditional access outcome as the corresponding interactive log; however, the non-interactive volume is much higher. In the cases where the outcome is different, is it critical for you to know which specific conditional access policy allowed or blocked a session? Does knowing this add investigative value?
For this example, we'll use a workspace transformation DCR, since there is no standard DCR available for this data type (it's ingested via diagnostic settings).
If you already have a workspace transformation DCR, you'll edit it to add the new transformation. If you don't already have one, you'll need to create it for your workspace.
Once you have it, click Next, then open the </> Transformation editor at the top and use the following query if you want to remove the ConditionalAccessPolicies column from this table entirely:
source
| project-away ConditionalAccessPolicies
Alternatively, because this array is sorted so that applied policies (success/failure) appear at the top, you could keep only the first few policies with a transformation like this:
source
| extend CAP = todynamic(ConditionalAccessPolicies)
| extend CAPLen = array_length(CAP)
| extend ConditionalAccessPolicies = tostring(iff(CAPLen > 0, pack_array(CAP[0], CAP[1], CAP[2], CAP[3]), todynamic('[]')))
| project-away CAPLen, CAP
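Once the transformation has been saved and new data is flowing, you can sanity-check the result with a query like the one below. This is an illustrative check run in Log Analytics, not part of the DCR; the daily average size of the column should drop noticeably after the transformation takes effect:

// Average size of the ConditionalAccessPolicies column per day, before and after the change
AADNonInteractiveUserSignInLogs
| where TimeGenerated > ago(7d)
| summarize AvgColumnBytes = avg(estimate_data_size(ConditionalAccessPolicies)) by bin(TimeGenerated, 1d)
| sort by TimeGenerated asc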
SecurityEvent
The Windows security event log is another source that can be verbose. The easiest way to ingest the data is to use the standard collection tiers of "minimal," "common," or "all." But are those options the right ones for you? Some known noisy event IDs may have questionable security value. We recommend looking closely at what you currently collect in this table and appraising the noisiest events to see whether they truly add security value.
For example, you'll likely want event IDs like 4624 ("An account was successfully logged on") and 4688 ("A new process has been created"). But do you need to keep 4634 ("An account was logged off") and 4647 ("User initiated logoff")? These might be useful for auditing, but less so for breach detection. You could drop these events by switching the collection tier to "minimal," but you may then find you're missing other event IDs that you do find valuable.
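Before dropping them, it's worth quantifying what those candidate events actually cost you. A simple query like the following (an illustrative example using the two logoff events discussed above) puts a number on the potential savings:

// 30-day billable volume of the logoff events being considered for removal
SecurityEvent
| where TimeGenerated > ago(30d)
| where EventID in (4634, 4647)
| summarize SizeInGB = sum(_BilledSize) / 1000 / 1000 / 1000 by EventID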
If you are using the "all" collection tier, the XPath query does not explicitly list these events by number. To remove an event, you will need to replace the XPath in the DCR with a query that selects everything except the specific events, such as:
Security!*[System[(EventID!=4634) and (EventID!=4647)]]
If you are using the "common" or "minimal" collection tier, the event IDs will already be listed explicitly in the DCR's XPath queries, and you can simply remove them, along with the corresponding "or" clause, from a query like this one:
Security!*[System[(EventID=1102) or (EventID=1107) or (EventID=1108) or (EventID=4608) or (EventID=4610) or (EventID=4611) or (EventID=4614) or (EventID=4622) or (EventID=4624) or (EventID=4625) or (EventID=4634) or (EventID=4647) or (EventID=4648) or (EventID=4649) or (EventID=4657)]]
Alternatively, you can drop these events by adding a transformKql statement to the DCR. This is less efficient than using XPath, because the agent still collects and sends the events before the ingestion pipeline drops them:
source
| where EventID !in (toint(4634), toint(4647))
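After the DCR change is deployed, you can confirm that the events are no longer arriving with a quick check such as the following illustrative query, which should return no rows once the updated DCR takes effect:

// Should return no rows once the updated DCR is in place
SecurityEvent
| where TimeGenerated > ago(1d)
| where EventID in (4634, 4647)
| summarize EventCount = count() by EventID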
For more information on updating a standard DCR, review the Monitoring and Modifying DCRs section of https://techcommunity.microsoft.com/t5/microsoft-sentinel-blog/create-edit-and-monitor-data-collection-rules-with-the-data/ba-p/3810987
In Summary
As digital footprints grow exponentially, it is increasingly important that security teams remain deliberate and judicious about the data they collect and retain. By thoughtfully selecting data sources and refining data sets with DCRs, you can ensure that you are spending your security budget in the most efficient and effective manner.