Month: December 2023
Looking for expanded documentation on Exchange objects
Probably a simple answer, at least I hope it is, but I'm looking for expanded documentation on Exchange objects as they appear in PowerShell.
For just one example: mailbox objects.
Of course I can pipe mailbox objects into Get-Member to see what all the various properties of a mailbox object are, but I'd like to know more about what each of those properties means, what Exchange uses it for, and how it relates to other objects, so that I can infer why I might care.
For example, I've just discovered some mailboxes on my Exchange server (on-premises) whose ServerName property refers to a long-since-decommissioned server that I replaced with a differently named server (a DAG is in play). So I'm curious about how Exchange uses the ServerName property on mailboxes. All the mailboxes are working fine (at least as far as I can see), so I don't think I have a problem that needs to be solved; it just made me want to find detailed/expanded documentation about mailbox properties and methods, and about other object classes as well.
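For reference, here is roughly how I'm poking at a single mailbox object today (the identity is just a placeholder); it tells me the property names and values, but not what they mean:

    # Grab one mailbox and list its properties (identity is a placeholder)
    $mbx = Get-Mailbox -Identity "someuser@contoso.com"
    $mbx | Get-Member -MemberType Property | Sort-Object Name

    # The specific properties I'm wondering about
    $mbx | Format-List Name, ServerName, Database, ExchangeVersion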
TIA
Robert
msdb..backupfile table in SQL 2019
Hi,
I used to query the msdb..backupfile table to get database file-level backup sizes and backup page counts. But in SQL Server 2019 the backup_size and backup_page_count columns are either 0 or show very small values compared to the actual data. Please advise whether these tables changed in SQL 2019: in earlier versions we could use this table to check database file growth, but in SQL 2019 the data is no longer useful for estimating file growth.
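For reference, this is roughly the kind of query I run (the database name is just a placeholder, and the page count column appears as backed_up_page_count on my instance):

    SELECT bs.database_name,
           bs.backup_finish_date,
           bf.logical_name,
           bf.file_type,
           bf.file_size,
           bf.backup_size,
           bf.backed_up_page_count
    FROM msdb.dbo.backupfile AS bf
    JOIN msdb.dbo.backupset AS bs
        ON bs.backup_set_id = bf.backup_set_id
    WHERE bs.database_name = N'MyDatabase'
    ORDER BY bs.backup_finish_date DESC;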
SetSearchKey and SetEmbeddingKey return “does not contain a definition for …”
I'm currently using Azure.AI.OpenAI 1.0 beta 12 with the samples provided by Microsoft; however, when I compile the code I get the following errors:
‘AzureCognitiveSearchChatExtensionConfiguration’ does not contain a definition for ‘SetSearchKey’ and no accessible extension method ‘SetSearchKey’ accepting a first argument of type ‘AzureCognitiveSearchChatExtensionConfiguration’ could be found (are you missing a using directive or an assembly reference?)
and
‘AzureCognitiveSearchChatExtensionConfiguration’ does not contain a definition for ‘SetEmbeddingKey’ and no accessible extension method ‘SetEmbeddingKey’ accepting a first argument of type ‘AzureCognitiveSearchChatExtensionConfiguration’ could be found (are you missing a using directive or an assembly reference?).
The lines of code I'm using are:

    // Initialize the AzureCognitiveSearchChatExtensionConfiguration
    var search = new AzureCognitiveSearchChatExtensionConfiguration()
    {
        SearchEndpoint = new Uri(searchEndpoint),
        IndexName = searchIndexName
    };

    // Set the SearchKey
    search.SetSearchKey(searchKey);

    // Set the EmbeddingKey
    search.SetEmbeddingKey(embeddingKey);

I've checked the solution and it has the correct version of Azure.AI.OpenAI installed.
Any thoughts welcome.
Thanks
Rob Ireland
About Azure DevOps
I'm a fresher and I want to start my career in Azure. What are the things I should master to do that?
Please point me to some resources.
Update to Admin Center 1.5.23.12.09001
Hello everyone,
After the last update I get the message "AADSTS9002325: Proof Key for Code Exchange is required for cross-origin authorization code redemption."
The admin center is registered in Azure, and the redirect URI has been migrated to SPA as recommended by the notice. Windows domain login works, but Azure MFA fails after a successful login with the error above. When I switch the URI back to Web, the problem doesn't occur; but then, when I try to connect to the Azure account via configuration accounts, I get a different error message: "AADSTS9002326: Cross-origin token redemption is permitted only for the 'Single-Page Application' client-type".
Does anyone else have this problem?
Transcripts and setting an auto-deletion
Hello,
My business has disabled recordings and transcriptions because of legal concerns, but we are reviewing that decision as a result of requests we get periodically from end users (especially with Copilot using transcription in Teams meetings to produce useful meeting summaries).
We've identified that you can use the Teams Admin Center to set recordings (and their associated transcriptions) to delete automatically; those are stored in OneDrive.
But it appears that if only a transcript is run, the transcript is stored in the meeting organiser's Exchange Online mailbox under the FolderPath "/ApplicationDataRoot/93c8660e-1330-4e40-8fda-fd27f9eafe10/MeetingTranscriptCollection" – I've verified this by browsing to that location in MFCMAPI.
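For what it's worth, this is roughly how I've been confirming the folder outside of MFCMAPI (the mailbox address is a placeholder):

    # List the non-IPM folders and look for the transcript collection
    Get-MailboxFolderStatistics -Identity "organiser@contoso.com" -FolderScope NonIpmRoot |
        Where-Object { $_.FolderPath -like "*MeetingTranscriptCollection*" } |
        Select-Object FolderPath, ItemsInFolder, FolderSize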
I've been tasked with setting a policy to delete any transcripts in those locations that are older than 7 days, but I cannot for the life of me work out how to accomplish it. I know labels are going to be the way to go for this, but simulations fail to pick up any results (even though I know my mailbox has transcripts stored in it).
Has anyone else had this issue?
Excel Data Validation
Hi,
I am working on an Excel sheet where I want the input of a cell on a data entry sheet to be either a drop-down option or a 6-digit number, depending on the input in the cell above. The cell above is the item class and the cell below it is the item: if the item class is set to "Other", I would like the item cell to accept a 6-digit item code, and if it is a standard item class, the drop-down list should change accordingly.
I currently have the dynamic drop-down list working, but I do not know how to get the cell to accept a 6-digit number when the item class is set to "Other".
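For context, the sort of thing I've been experimenting with is a single Custom validation rule on the item cell, along these lines (assuming the item class is in A1, the item cell is A2, and the standard items live in a named range called ItemList; I realise a Custom rule doesn't show the drop-down arrow, which is part of my problem):

    =IF($A$1="Other", AND(ISNUMBER($A$2), LEN($A$2)=6), COUNTIF(ItemList, $A$2)>0)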
Thanks,
Jack
How to contact LATAM Airlines by phone in Spain - 34919016231
Imagine that you live in Spain and want to travel anywhere with LATAM Airlines, but before booking you want to know some information about your flight, such as how to add extra baggage, how to choose a seat, required travel documents and so on, and you don't know how to contact LATAM Airlines Spain. If that is your situation, you can continue on this page and find answers to these kinds of questions.
Several ways to get in touch with LATAM Spain:
Speak by phone: Calling is the fastest way to contact LATAM Spain's customer service team. You can dial the LATAM phone number and discuss any issue with their services. The LATAM Spain phone number is +800000304/34919016231. After dialling it, listen to the IVR options and choose the one you need. You can then speak with an adviser without any problem and get a solution immediately; it is advisable to keep a notepad handy to write down the answer.
Speak by live chat: You can also contact LATAM Spain's agents instantly via WhatsApp chat. The LATAM Spain phone line can often be busy or congested; in that situation you can try this option. The steps to chat with LATAM Spain are:
Go to the official LATAM Spain website. Then click the customer service button. Next, click the "chat on WhatsApp" button. After that, scan the LATAM Spain code with your smartphone. You can now chat with a LATAM Spain adviser at your convenience.
You can also send LATAM Spain an email at email address removed for privacy reasons and receive a solution.
Mails sitting in Poison queue Exchange 2019
Hi All,
In the past few days I have seen a few mails sitting in the Poison queues on our Exchange 2019 servers. Sometimes there are no errors on those mails; sometimes it says "LED=530 5.3.0 Too many related errors", and sometimes "LED=421 4.4.2 Connection dropped due to ConnectionReset".
The mail has been delivered to the other recipients, but for some recipients it is queued up in the Poison queue.
So I'm not sure why these emails are classified as poison messages, and I need some info on what we can do as admins in such situations.
I know that we can either export the messages manually or just reject these messages with/without NDR.
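For reference, these are roughly the commands I'd use for that (the server name and message ID are placeholders):

    # Inspect the poison queue and the messages in it
    Get-Queue -Identity "EXCH01\Poison" | Format-List
    Get-Message -Queue "EXCH01\Poison" | Format-List FromAddress, Subject, LastError

    # Export a copy of a message before removing it
    Export-Message -Identity "EXCH01\Poison\1234" | AssembleMessage -Path "C:\Export\1234.eml"

    # Remove it, with or without an NDR
    Remove-Message -Identity "EXCH01\Poison\1234" -WithNDR $false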
Pinterest Druid Holiday Load Testing
By Isabel Tallam | Senior Software Engineer; Jian Wang | Senior Software Engineer; Jiaqi Gu | Senior Software Engineer; Yi Yang | Senior Software Engineer; and Kapil Bajaj | Engineering Manager, Real-time Analytics team
Like many companies, Pinterest sees an increase in traffic in the last three months of the year. We need to make sure our systems are ready for this increase in traffic so we don’t run into any unexpected problems. This is especially important as Pinners come to Pinterest at this time for holiday planning and shopping. Therefore, we do a yearly exercise of testing our systems with additional load. During this time, we verify that our systems are able to handle the expected traffic increase. On Druid we look at several checks to verify:
Queries: We make sure the service is able to handle the expected increase in QPS while at the same time supporting the P99 Latency SLA our clients need.
Ingestion: We verify that the real-time ingestion is able to handle the increase in data.
Increase in Data size: We confirm that the storage system has sufficient capacity to handle the increased data volume.
In this post, we’ll provide details about how we run the holiday load test and verify Druid is able to handle the expected increases mentioned above.
Pinterest traffic increases as users look for inspiration for holidays.
How We Run Load Tests
As mentioned above, the areas our teams focus on are:
Can the system handle increased query traffic?
Can the system handle the increase in data ingestion?
Can the system handle the increase in data volume?
Can the System Handle Increased Query Traffic?
Testing query traffic and SLA is a main goal during holiday load testing. We have two different options for load testing in our Druid system. The first option generates queries based on the current data set in Druid and then runs these queries against Druid. The other option captures real production queries and re-runs them against Druid. Both of these options have their advantages and disadvantages.
Sample Versus Production Queries
The first option — using generated queries — is fairly simple to run anytime and does not require preparation like capturing queries. However, this type of testing may not accurately show how the system will behave in production scenarios. A real production query may look different and touch different data, query types, and timeframes than what is tested using generated queries. Additionally, any corner cases would be ignored in this type of testing.
The second option has the advantage of using real production queries that would be very similar to what we expect to see during any future traffic. The disadvantage, however, is that setting up the tests is more involved, as production queries need to be captured and potentially updated to match the new timeline when holiday testing is performed. In Druid, running the same query today versus one week from today may give different latency results, as data moves through different host stages: it is served by faster, high-memory hosts in its first days/weeks and by slower, disk-based stages as it ages.
We decided to move ahead with real production queries because one of our priorities was to replicate production use cases as closely as possible. We made use of a Druid native feature that automatically logs any query that is being sent to a Druid broker host (broker hosts handle all the query work in a Druid cluster).
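For illustration, enabling that logging typically comes down to a couple of broker runtime properties along these lines (the directory is a placeholder, and the exact setup at Pinterest may differ):

    druid.request.logging.type=file
    druid.request.logging.dir=/mnt/druid/request-logs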
Test Environment Setup
Holiday testing is not done in the production environment, as this could adversely impact the production traffic. However, the test needs an environment setup as similar to the production environment as possible. Therefore, we created a copy of the production environment that is short-lived and solely used for testing. To test query traffic, the only stages required are brokers, historical stages, and coordinators. We have several tiers of historical stages in the production environment and we replicated the same setup in the test environment as well. We also made sure to use the same host machine types, configurations, pool size, etc.
The data we used for testing was copied over from production. We used a simple MySQL dump to create a copy of all the segments stored in the production environment. Once the dump is added to the MySQL instance in the test environment, the coordinator will automatically trigger the data to be replicated in the historical stages of the test environment.
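As a rough illustration of that copy (hostnames are placeholders, and we assume the default druid_segments metadata table name):

    # Dump the segment metadata from the production metadata store...
    mysqldump -h prod-metadata-db -u druid -p druid druid_segments > druid_segments.sql
    # ...and load it into the test metadata store; the coordinator picks it up from there
    mysql -h test-metadata-db -u druid -p druid < druid_segments.sql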
Before initiating the copy, however, we needed to identify what data is required. This will depend on the client team and on the timeframe their queries request. In some cases, it may not be necessary to copy all data, but only the most recent days, weeks, or months.
Test environment is set up with the same configuration and hosts as Prod environment.
Our test system first connects to the broker hosts on the test environment, then loads the queries from the log file and sends them to the broker hosts. We use a multi-threaded implementation to increase the QPS being sent to the broker nodes. First, we run tests to identify how many threads are needed as a baseline that matches production traffic — for example, 300 QPS. Based on that, we can define how many threads we need to use for testing expected holiday traffic (two, three, or more times the standard traffic).
In our use case, we had loaded the data received up to a specific date (e.g. October 1st). At this point, we were re-running the captured log files on the same date or the day before, to match production behavior. Our test script also was able to update the time frame in a query to match either the current time or a predefined time to allow running any log file and translating it to the data available on the test environment.
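A minimal sketch of this kind of replay harness is shown below. It is illustrative rather than our actual tool: it assumes one native JSON query per line in the captured log, posts to the standard broker query endpoint, and omits the time-interval rewriting described above.

    import json
    from concurrent.futures import ThreadPoolExecutor
    from threading import Lock

    import requests

    BROKER_URL = "http://test-broker:8082/druid/v2"  # placeholder test-broker endpoint
    LOG_FILE = "broker-requests.log"                 # placeholder captured query log
    THREADS = 32                                     # tune until the baseline QPS is matched

    latencies = []
    lock = Lock()

    def replay(query):
        # Send one captured query to the test broker and record its latency.
        resp = requests.post(BROKER_URL, json=query, timeout=60)
        resp.raise_for_status()
        with lock:
            latencies.append(resp.elapsed.total_seconds())

    def main():
        with open(LOG_FILE) as f:
            queries = [json.loads(line) for line in f if line.strip()]
        with ThreadPoolExecutor(max_workers=THREADS) as pool:
            list(pool.map(replay, queries))
        if latencies:
            latencies.sort()
            p99 = latencies[max(0, int(0.99 * len(latencies)) - 1)]
            print(f"queries={len(latencies)} p99={p99:.3f}s")

    if __name__ == "__main__":
        main()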
Evaluating the Results
To determine the health of our system, we used our existing metrics to compare QPS and P99 latency on brokers and historical nodes, as well as determining system health via indicators like CPU usage of the brokers. These metrics help us identify any bottlenecks.
Query response time with normal traffic and 2x increase on basic system setup.
Typical bottlenecks can include the historical nodes or the broker nodes.
The historical nodes may show a higher latency for increased QPS, which will in turn increase the overall latency. To resolve this, we would add mirror hosts and increase the number of replicas of the data to support better latency under higher load. This step is something that will take time to implement, as hosts need to be added and data needs to be loaded, which can take several hours depending on the data size. Therefore, this is something that should be completed before traffic increases on the production system.
If the broker nodes are no longer able to handle the incoming query traffic, the size of the broker pool needs to be increased. If this is seen in the test environment, or even the production environment, it is much faster to increase the pool size and can potentially be done ad-hoc as well.
Testing with an increased data size on the test environment helps us determine which steps are needed to support the expected holiday traffic changes. We can make these configuration changes in advance, and we can make the support team aware of changes and of the maximum traffic the system is able to handle within the specified SLA (QPS and P99 latency requirements from the client teams).
Can the System Handle the Increase in Data Ingestion?
Testing the capacity for real-time data ingestion is similar to testing query performance. It is possible to start with making an estimate of the supported ingestion rate based on the dimensions/cardinality of the ingested data. However, this is only a guideline, and for some high-priority use cases it is a good idea to test early on.
We set up a test environment that has the same capacity, configuration, etc. as the production environment. However, in this step, some help from client teams may be required, as we also need to test with increased data from the ingestion source, such as a Kafka topic.
When reviewing the ingestion test, we focused on several key metrics. The ingestion lag should be low, and the numbers of both successful and rejected events (due to the rejection window being exceeded) should closely match the comparable values in the production environment. We also validate the ingested data and the general system health of the overlord and middle manager stages (the stages handling ingestion of real-time data).
Sample metrics for successfully ingested events, rejected events, and Kafka ingestion lag.
Can the System Handle the Increase in Data Volume?
Evaluating if the system can handle the increase in data volume is probably the simplest and quickest check, though just as important as the previous steps. For this, we take a look at the coordinator UI: here we can see all historical stages, the pool size, and at what capacity they are currently running. Once clients provide details on the expected increase in data volume, it is a fairly simple process to calculate the amount of additional data that needs to be stored over the holiday period and potentially some period after that.
The space is at a healthy percentage (~70%) allowing for some growth.
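The same capacity view is also available programmatically through Druid's SQL system tables, which is handy for tracking the numbers over time (standard Druid SQL, not a Pinterest-specific API):

    SELECT "server", "tier", "curr_size", "max_size",
           100.0 * "curr_size" / "max_size" AS pct_used
    FROM sys.servers
    WHERE server_type = 'historical'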
Results
In the tests we ran this year, we found that our historical stages are in a very good state and are able to handle the additional traffic expected during the holiday time. We did see, however, that the broker pool may need some additional hosts if traffic meets a certain threshold. We have been sure to keep this communication visible with the client teams and support teams so team members are aware and know that the pool size may need to be increased.
Learnings
Timing is very critical with holiday testing. This project has a fixed end date by which all changes need to be completed in the systems before any traffic increases, and the teams need to make sure to have all the pieces in place before results are due. As is true of many projects as well as this one, we need to leave additional buffer time for unexpected changes in timeline and requirements.
Druid is a backend service, which is not always top of mind for many client teams as long as it is performing well. Therefore, it is a good idea to reach out to client teams before testing starts to get their estimates of expected holiday traffic increases. Some of our clients reached out to us on their own; however, by then the due date for submitting capacity increase requests to governance teams would already have passed. In these cases, or where client teams are not yet sure, it is good practice to make a general estimate of the traffic increase and start testing with those numbers.
Keeping track of holiday planning and applied changes for each year is also a good practice. Having a history of changes every year and keeping track of the actual increase versus the original estimates made beforehand will help to make educated estimates on what traffic increases may be expected in the following year.
Knowing the details of broker and historical stage capacity before the holiday updates also makes it easier for teams to evaluate how much to scale the clusters back down after the holidays, while accounting for organic month-over-month growth.
Future Work
In this year’s use case, we chose the option of capturing broker logs to retrieve the queries we wanted to re-play back to Druid. This option worked for us at this time, though we are planning to look into other options for capturing queries going forward. The log files option works well for a one-off need, but it would be useful to have continuous logging of queries and storing these in Druid. This can help with debugging issues and identifying high-latency queries that may need some tweaking to get performance improvements.
Cost Reduction in Goku
By Monil Mukesh Sanghavi | Software Engineer, Real Time Analytics Team; Rui Zhang | Software Engineer, Real Time Analytics Team; Hao Jiang | Software Engineer, Real Time Analytics Team; Miao Wang | Software Engineer, Real Time Analytics Team;
In 2018, we launched Goku, a scalable and high-performance time series database system, which served as the storage and query serving engine for short term metrics (less than one day old). In early 2020, we launched GokuL (Goku long term), which extended Goku's capability by supporting long term metrics data (i.e. data older than a day and up to a year old). Both of these completely replaced OpenTSDB. For GokuL, we used 3 clusters of i3.4xlarge SSD-backed EC2 instances, which, over time, we realized were very costly. Reducing this cost was one of our primary aims going into 2021. This blog post will cover the approach we took to achieve that goal.
Background
We use a tiered approach to segregate the long term data and store it in the form of buckets.
Table 1: table of a tiered approach
Tiers 1–5 contain the data stored on the GokuL (long term) clusters. GokuL uses RocksDB to store its long term data, and the data is ingested in the form of SST files.
Query Analysis
We analyzed the queries going to the long term cluster and observed the following:
There are very few metrics (approximately 6K out of a total of 10B) for which data points older than three months were queried from GokuL.
More than half of the GokuL queries had specified rollup intervals of one day or more.
Tier 5 Data Analysis
We randomly selected a few shards in GokuL and analyzed the data. We observed the memory consumption of tier 5 data was much more than all the other tiers (1–4) combined. This was despite the fact that tier 5 contains only one hour of rolled up data, whereas the other tiers contained a mix of raw and 15 minute rolled up data.
Table 2: SST File size for each bucket in MiB
Solutions
It was inferred from the query and tier 5 analysis that tier 5 data (which holds six buckets of 64 days of data each) was the least queried as well as the most disk consuming. We planned our solutions to target this tier as it would give us the most benefits. Mentioned below are some of the solutions which were discussed.
Namespace
Implementing a feature called namespace would store configurations like TTL, rollup interval, and tier configuration for the set of metrics belonging to that namespace. Uber's M3 has a similar solution. This would help us set appropriate configurations for a select set of metrics (e.g. a lower TTL for metrics that do not require longer retention). The time to production for this project was longer, so we decided to make it a separate project in the future; it is now being actively worked on.
Rollup Interval Adjust for Tier 5 Data
We experimented with changing the rollup interval of tier 5 data from one hour to one day and observed the change in the final SST file(s) size for the tier 5 bucket.
Table 3
The savings that came out of this solution were not strong enough to support putting this into production.
On Demand Loading of Tier 5 Data
GokuL clusters would only store data from tiers 1–4 on startup and would load the tier 5 buckets as necessary (based on queries). The cons of this solution were:
Users would have to wait and retry the query once the corresponding tier 5 bucket from S3 had been ingested by the GokuL host.
Once ingested, the bucket would remain in GokuL unless thrown away by an eviction algorithm.
We decided not to go with this solution because it was not user friendly.
Tiered Storage
We decided to move tier 5 data to a separate HDD-based cluster. While there was some notable difference in query latency, it could be ignored because the number of queries hitting this tier was much lower. We calculated that tier 5 was consuming approximately 1 TB on each of the 650 hosts in the GokuL cluster. We decided to use d2.2xlarge instances to store and serve the tier 5 data in GokuL.
Table 4
The cost savings that came out of this solution were substantial: we replaced around 325 i3.4xlarge instances with 111 d2.2xlarge instances, reducing our costs by nearly 30–35%.
To support this, we had to design and implement tier-based routing in the Goku root cluster, which routes queries to the short term and long term leaf clusters.
In the future, we can evaluate whether we can reduce the number of replicas and compromise on availability, given the low number of queries.
RocksDB Tuning
As mentioned above, GokuL uses RocksDB to store the long term data. We observed that the RocksDB options we were using were not optimal for Goku’s data that has high volume and low QPS.
We experimented with using a stronger compression algorithm (ZSTD with level 5), and this reduced the disk usage by 40%. In addition to this, we enabled the partitioned index filter wherein only the top level index is loaded into memory. On top of this, we enabled caching with higher priority for filter and index blocks so that they use the same cache as the data blocks and also minimize the performance impact.
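A rough sketch of what those RocksDB options look like is below (illustrative values, not Goku's exact configuration):

    // Hedged sketch of the tuning described above; cache size and bloom bits are placeholders.
    #include <rocksdb/cache.h>
    #include <rocksdb/filter_policy.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    rocksdb::Options MakeTunedOptions() {
      rocksdb::Options options;

      // Stronger compression: ZSTD at level 5.
      options.compression = rocksdb::kZSTD;
      options.compression_opts.level = 5;

      rocksdb::BlockBasedTableOptions table_options;
      // Partitioned index/filters: only the top-level index needs to stay in memory.
      table_options.index_type = rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
      table_options.partition_filters = true;
      table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));

      // Keep index/filter blocks in the block cache, at higher priority than data blocks.
      table_options.cache_index_and_filter_blocks = true;
      table_options.cache_index_and_filter_blocks_with_high_priority = true;
      table_options.block_cache = rocksdb::NewLRUCache(512 << 20);

      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
      return options;
    }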
With both the above changes, we noticed that the latency difference was not large and the reduction in data space usage was approximately 50%. We immediately put this into production and shrunk the size and cost of our GokuL clusters by another half.
What’s Next
Namespace
As mentioned, we are actively working on the implementation of the namespace feature, which will help us reduce long term cluster costs even further by reducing the TTL for most of the current metrics that do not need long retention anyway.
Acknowledgments
Huge thanks to Brian Overstreet, Wei Zhu, and the observability team for providing and supporting the solutions discussed here.
Partner Blog | AI in retail: making the most of NRF 2024
Our guest contributor for today’s blog is Noel Pennington, Director of Partner Strategy, Industry Cloud.
National Retail Federation’s (NRF) Big Show is where the retail and consumer goods industries come together to hear from the biggest changemakers, experience the latest innovations, and make the relationships that matter most. Microsoft will again play a prominent role in this year’s event with new announcements and a showcase of industry innovation.
Our “RetAIl” lineup this year will center on four key pillars: maximizing the value of your data, elevating the shopper experience, building a real-time retail supply chain, and empowering frontline store associates.
Continue reading here