By late 2019, it was clear we had outgrown our existing monitoring and observability system. It was time to shop around for a new solution that could keep up with Grammarly’s rapid growth while offering faster performance and a better developer experience. Upgrading a product usually costs more. This time, however, upgrading cut our costs by an order of magnitude.
In this post, we’ll discuss why we needed a new monitoring platform, the qualities we were looking for in our search, and why VictoriaMetrics stood out as the best option. We’ll also cover how we migrated to the OpenMetrics standard and the benefits of this transition.
The status quo monitoring solution was unsustainable
We had been using the same systems since 2014, and after five years they were showing their age. We were storing our data (such as time-series health metrics, CPU %, disk space, etc.) in the Graphite format using a go-graphite implementation. Although more performant than vanilla Graphite, our approach organized metrics into excessively nested directories that were expensive to traverse at scale.
Between chronic performance issues and large operational costs for our platform engineers, our existing application monitoring system made for a poor developer experience. The system was unwieldy in many ways:
- Multiple users couldn’t run queries simultaneously without serious latency.
- It was hard to switch between clusters (we had clusters with different retention periods).
- Debugging and testing were cumbersome.
The problems went beyond the developer experience and had a direct business impact:
- Alerting was unreliable and inconsistent with our dashboards.
- The system couldn’t handle the ever-increasing data load we were putting on it—this was only going to get worse.
- The overwhelming computational cost ran up our AWS bill.
For these reasons, we were motivated to act quickly. Our first instinct was to purchase a third-party SaaS solution.
Trialing several SaaS providers convinced us to choose open-source
There were half a dozen vendors to choose from, so we methodically compared them in a spreadsheet. You can see a high-level, visual representation of the assessment below.
This exercise helped us narrow down the options to three that we chose to pilot. After the trial periods, we realized that any of the three contenders could theoretically meet our needs, but not without some significant drawbacks.
All of them would impose a high engineering cost for the migration, on top of significant financial costs. Moreover, we weren’t keen on vendor lock-in and the loss of optionality we’d face down the road.
We decided we’d give open-source a try instead. These SaaS explorations weren’t a wasted effort, though. They helped us refine the exact qualities we were looking for. For instance, we reaffirmed the importance of labels in adding context to metrics without having to restructure them.
Graphite metric:
platform.roman-test.prod.ip-1-2-3-4.http_requests.count
Metric with labels:
http_requests{service="roman-test", instance="1.2.3.4", env="prod", team="platform"}
Rather than a completely hierarchical layout, labels enable more flexibility in wrangling our metrics. They were a must-have for our next solution.
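To make the contrast concrete, here is a minimal sketch of how a Graphite path like the one above could be mapped onto a labeled metric. The position-to-label mapping (team, service, env, host, then metric name) and the helper names `graphite_to_labels` and `format_labeled` are assumptions made for this illustration, inferred from the single example path. They are not a description of our actual naming scheme or tooling.

```python
# Assumed layout of the dotted path, inferred from the example above:
# <team>.<service>.<env>.<host>.<metric name>.<aggregation>
PATH_FIELDS = ("team", "service", "env", "instance")


def graphite_to_labels(path: str) -> tuple[str, dict[str, str]]:
    """Split a dotted Graphite path into a metric name and a label set."""
    parts = path.split(".")
    if len(parts) <= len(PATH_FIELDS):
        raise ValueError(f"path has too few segments: {path!r}")
    # zip stops at the shorter input, so only the fixed prefix becomes labels.
    labels = dict(zip(PATH_FIELDS, parts))
    # Hosts such as "ip-1-2-3-4" encode the address with dashes, because dots
    # are reserved as the Graphite hierarchy separator.
    host = labels["instance"]
    if host.startswith("ip-"):
        labels["instance"] = host[len("ip-"):].replace("-", ".")
    # Everything after the fixed prefix is the metric name. The trailing
    # ".count" aggregation is dropped to match the labeled form above.
    name_parts = parts[len(PATH_FIELDS):]
    if len(name_parts) > 1 and name_parts[-1] == "count":
        name_parts = name_parts[:-1]
    return "_".join(name_parts), labels


def format_labeled(name: str, labels: dict[str, str]) -> str:
    """Render name{key="value", ...} with labels in sorted order."""
    body = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}}"


name, labels = graphite_to_labels(
    "platform.roman-test.prod.ip-1-2-3-4.http_requests.count"
)
print(format_labeled(name, labels))
# http_requests{env="prod", instance="1.2.3.4", service="roman-test", team="platform"}
```

The point of the sketch is that the Graphite path only carries meaning through segment position, so adding a new dimension means restructuring every path. With labels, a new dimension is just another key.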
VictoriaMetrics provided open-source software with enterprise-grade support
When it came to building a custom solution for our needs, it was hard to match the increased flexibility of open-source software. On the other hand, we could do without the more involved deployment and maintenance common to open-source tools.
Fortunately, with VictoriaMetrics, we were able to sign a contract to get the support we needed to confidently run a monitoring system at our scale. We have a dedicated support channel in Slack to consult engineers from VictoriaMetrics. On top of that, team members from VictoriaMetrics organized talks and workshops that were instrumental in helping Grammarly migrate to a new system.
VictoriaMetrics didn’t just talk the talk. Their solution met and exceeded the following requirements:
- Graphite compatibility: For historical reasons, we wanted to continue using metrics in the Graphite format, at least in the short term. Options where we’d have to migrate monitoring systems and metrics formats all at once were non-starters.
- Labels support: Labels enable a flexible way to work with metrics at scale without significantly impeding performance.
- Cost-effective: Compute and storage costs needed to be low enough that we could monitor and retain data on the order of months given our budget.
- Stable: The system had to be resilient to the substantial load imposed by metrics churn. We required high performance in our testing, which involved ingesting real data and serving that data for dashboards and alerts.
- Real-time: For effective monitoring, data sent to the system needed to be ready to query with as little delay as possible.
- Enterprise-grade security: Every piece of third-party software Grammarly uses must pass a thorough security review confirming it meets our enterprise standards.
The numbers spoke for themselves. Our proof-of-concept trial showed dramatically reduced compute and storage costs, translating into a 10x lower AWS bill:
| Metric | Existing Solution | VictoriaMetrics |
| --- | --- | --- |
| Load | ~600K dps | ~600K dps |
| Number of nodes | 30 nodes in 3 clusters | 1 node |
| IOPS | ~1,000 read ops/s, ~28,000 write ops/s | ~30 read ops/s, ~90 write ops/s |
| Disk space usage | ~14 TB for 2 weeks of data | ~20 TB for 13 months of data |
The above table is for a single-node installation of VictoriaMetrics we piloted, although in production we deployed a clustered version for redundancy and high availability.
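For a rough sense of scale, the table's figures can be normalized per day of retained data. The snippet below is back-of-the-envelope arithmetic only: it assumes 30-day months, and it ignores replication and downsampling, which the table does not break out.

```python
# Figures taken from the table above; 30-day months are an assumption.
old_disk_tb, old_retention_days = 14.0, 14        # ~14 TB for 2 weeks
new_disk_tb, new_retention_days = 20.0, 13 * 30   # ~20 TB for 13 months

old_tb_per_day = old_disk_tb / old_retention_days
new_tb_per_day = new_disk_tb / new_retention_days
disk_ratio = old_tb_per_day / new_tb_per_day

# Disk operations per second, also from the table.
read_iops_ratio = 1_000 / 30
write_iops_ratio = 28_000 / 90

print(f"existing solution: {old_tb_per_day:.2f} TB per retained day")
print(f"VictoriaMetrics:   {new_tb_per_day:.3f} TB per retained day")
print(f"disk per retained day: ~{disk_ratio:.1f}x lower")
print(f"read IOPS:  ~{read_iops_ratio:.0f}x lower")
print(f"write IOPS: ~{write_iops_ratio:.0f}x lower")
```

Under these assumptions, the pilot stored each retained day in roughly one-twentieth of the disk space, while write IOPS dropped by a factor of about 300.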
Also worth noting is that VictoriaMetrics’ retention far exceeded our previous capabilities. Of our three clusters, one held data for a few hours primarily for alerting purposes, another held data for 14 days for dashboards, and the third (somewhat unreliable) cluster held data for a few months. With VictoriaMetrics, we could retain 18 months’ worth of data, and because of downsampling for older data, we didn’t worry about exorbitant storage costs.
The only hiccup was that, at the time, VictoriaMetrics didn’t support GraphiteQL, the query language we use to build dashboards. This was only a minor obstacle. The beauty of open-source software is that we can engineer our own solution. We leveraged another open-source project, carbonapi, so that we could continue using GraphiteQL alongside VictoriaMetrics. Although carbonapi wasn’t immediately viable, we were very happy to give back to the open-source community by iterating on the project. We fixed several bugs and used debugging techniques like profiling and bottleneck analysis with flame graphs to further optimize performance.
As a testament to VictoriaMetrics’ commitment to serving their customers, they later added native GraphiteQL support per our initial request. Now, anyone who uses VictoriaMetrics can use this feature, too.
A seamless migration with big results
With everything ready to go, we set a deadline for the migration and gave the whole company a two-week heads-up. It’s always at least a bit nerve-racking to do these major infrastructure migrations, but this was honestly one of the smoothest transitions in Grammarly’s history.
We managed to decouple the write side of the migration from the read side. On the write side, we sent data simultaneously to both the older clusters and the new one until we were ready to completely remove the old infrastructure. On the read side, we migrated the dashboards we still needed and sunsetted the obsolete ones. We stuck to our schedule without issue: the new infrastructure went live on the deadline we had announced to the whole company, and we turned off the old systems the next day after confirming our data’s integrity.
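The dual-write idea can be sketched in a few lines. This is an illustrative outline rather than our production code: the `DualWriter` class, the `Sample` tuple shape, and the in-memory lists standing in for the old and new clusters are all hypothetical.

```python
from typing import Callable, Iterable

Sample = tuple[str, float, int]   # (metric name, value, unix timestamp)
Sink = Callable[[Sample], None]   # anything that accepts one sample


class DualWriter:
    """Fan each sample out to every configured backend.

    A failure in one backend is counted but does not block the others,
    so the old system keeps receiving data while the new one is validated.
    """

    def __init__(self, sinks: dict[str, Sink]) -> None:
        self.sinks = sinks
        self.errors: dict[str, int] = {name: 0 for name in sinks}

    def write(self, sample: Sample) -> None:
        for name, sink in self.sinks.items():
            try:
                sink(sample)
            except Exception:
                self.errors[name] += 1

    def write_all(self, samples: Iterable[Sample]) -> None:
        for sample in samples:
            self.write(sample)


# Stand-ins for real clients: in-memory lists instead of network calls.
old_cluster: list[Sample] = []
new_cluster: list[Sample] = []

writer = DualWriter({"old": old_cluster.append, "new": new_cluster.append})
writer.write_all([
    ("http_requests", 42.0, 1_600_000_000),
    ("http_requests", 43.0, 1_600_000_010),
])

# Before switching off the old system, check both sides saw the same data.
assert old_cluster == new_cluster
print(len(new_cluster), writer.errors)
```

Because reads are untouched while both sinks receive the same samples, dashboards can be moved over one at a time and compared against the old system before anything is decommissioned.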
Then, with the main migration done, we were in a position to rethink how we store our data. VictoriaMetrics made it easy to adopt the OpenMetrics protocol, the industry standard. Today, over 95 percent of Grammarly’s data points are in OpenMetrics, which has built-in support for labels as first-class entities, unlike Graphite, which emphasizes a hierarchical format.
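As an illustration of what labels as first-class entities look like on the wire, the sketch below renders a single gauge in the OpenMetrics text format. The metric name `cpu_usage_ratio`, its value, and the helper functions are made up for this example. The renderer covers only the small subset of the format needed here (a type line, labeled samples, and the closing `# EOF` marker).

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one gauge metric family as OpenMetrics-style text."""
    lines = [f"# TYPE {name} gauge"]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{pairs}}} {value}")
    lines.append("# EOF")  # the text exposition ends with this marker
    return "\n".join(lines) + "\n"


# Hypothetical metric and value, reusing the labels from the earlier example.
print(render_gauge(
    "cpu_usage_ratio",
    [({"service": "roman-test", "env": "prod", "team": "platform"}, 0.42)],
), end="")
```

Each sample carries its full context in the label set, so a query can filter or group by `service`, `env`, or `team` without knowing where those values would sit in a dotted path.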
Overall, the migration has been a resounding success because of the cost savings, the huge performance improvements, and the enhanced developer experience with our new system. Since OpenMetrics is so widely used, it also makes it easier for new team members to make a difference as soon as they join. There’s still plenty of impact left to make: we plan to introduce per-service budgets for metrics and to eliminate possible cardinality explosions, among other projects. If that sounds exciting to you, consider checking out open roles at Grammarly today.