Effortlessly query your big data infrastructure
By ZoomInfo Engineering, March 29, 2022
By Cody Carrell
Clickagy, a ZoomInfo company, is an innovative data provider in the AdTech industry, giving advertisers and website owners the data they need to be successful in their operations. Clickagy uses proprietary Natural Language Processing (NLP) to identify the intent of billions of website visits every month, allowing advertisers to target people who are actively in-market. Clickagy relies heavily on Amazon Web Services to reliably deliver this actionable data to a variety of platforms, including Centro DSP, Adobe Audience Manager, LiveRamp DataStore, and many more. In this blog post, we’ll take a look at how Amazon’s Athena offering has made this possible and given Clickagy greater insight into the data it collects.
First, it helps to understand how Athena fits into Clickagy’s stack. Clickagy sees hundreds of thousands of requests per second and uses a combination of Amazon Kinesis and proprietary software hosted on EC2 instances to store this volume of data as ORC files in Amazon S3. Athena lets Clickagy access this data whenever the need arises, without having to spin up a Presto or Hive cluster on demand or maintain a costly persistent one.
The diagram above shows the flow of events from one of Clickagy’s applications into a Kinesis stream, where a custom application stages them in S3 in a simple JSON format. A message is then pushed into an SQS queue, which triggers a second custom application on EC2. That application pulls the data from the S3 staging bucket, cleans it, and converts it to ORC files. Once the data is in the output bucket, users can query it directly using Athena.
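To make the queue-triggered step a little more concrete, here is a minimal sketch of how a worker might work out which staged objects a message refers to. It assumes the message body follows the standard S3 event notification JSON layout; the post does not say what format Clickagy's messages actually use, and the function name, bucket name, and key below are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def staged_objects(message_body):
    """Return the (bucket, key) pairs named in one queue message.

    Assumes the body is an S3 event notification document; a custom
    message format would need its own parser.
    """
    event = json.loads(message_body)
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((s3_info["bucket"]["name"], key))
    return pairs


# Hypothetical message, shaped like an S3 event notification.
example_body = json.dumps({
    "Records": [{
        "s3": {
            "bucket": {"name": "staging-bucket"},
            "object": {"key": "events/2022-03-29/part+1.json"},
        }
    }]
})
print(staged_objects(example_body))
```

A real worker would then download each object, clean the records, and write ORC files to the output bucket; those steps are left out here because they depend on the data itself.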
Since Athena is essentially a serverless Presto environment that is always on and always accessible, Clickagy can query the data a client needs faster than ever before, without the overhead of managing a cluster, which keeps costs reasonable. Athena bills each query by the amount of data it scans and does not charge for queries that fail, so in a properly set-up environment it is hard to find an easier or cheaper way to query your big data. This pricing structure means you pay only for queries that matter to you and to your customers. To achieve faster and cheaper queries, it is important to partition your data in a way that matches how you query it. Clickagy partitions most of its data by date and by the account the data belongs to, so a query that filters on those partitions reads only the subset of data it needs, which means less data scanned and cheaper, more reliable Athena queries.
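A rough back-of-the-envelope calculation shows why partitioning matters so much for cost. The sketch below assumes a price of $5 per TB scanned, billing rounded up to the nearest megabyte, and a 10 MB minimum per query. These figures are assumptions for illustration (check the current Athena pricing page for your region), and the table and partition sizes are made up.

```python
import math

BYTES_PER_MB = 1024 ** 2
BYTES_PER_TB = 1024 ** 4
MIN_BILLED_BYTES = 10 * BYTES_PER_MB  # assumed per-query minimum


def estimate_query_cost(bytes_scanned, usd_per_tb=5.0):
    """Estimate the USD cost of one query from the bytes it scans."""
    # Round up to a whole megabyte, then apply the per-query minimum.
    billed_mb = math.ceil(bytes_scanned / BYTES_PER_MB)
    billed_bytes = max(billed_mb * BYTES_PER_MB, MIN_BILLED_BYTES)
    return billed_bytes / BYTES_PER_TB * usd_per_tb


# Hypothetical sizes: a 2 TB table versus one 5 GB date/account partition.
full_scan = estimate_query_cost(2 * BYTES_PER_TB)
one_partition = estimate_query_cost(5 * 1024 ** 3)
print(f"full scan: ${full_scan:.2f}, one partition: ${one_partition:.4f}")
```

Under these assumptions, the same question costs a few cents when the query is restricted to one partition and several dollars when it scans the whole table.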
Thanks to the plethora of options available for connecting to Athena, Clickagy can use it throughout its entire stack. The JDBC drivers allow Clickagy’s backend servers to automatically query and deliver data to clients in a cost- and time-efficient manner, while the AWS SDK allows Clickagy’s front-end services to query Athena and display important real-time stats to DSP and Clickagy Insights customers alike. Clickagy data scientists also use Athena to access the data they need to build and train machine learning models. The Athena console is quite powerful as well, showing a history of queries along with each query’s status, run time, amount of data scanned, and a preview of the results.
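For readers who have not used the SDK route, the sketch below shows the general shape of running a query through the Athena API: start the query, poll until it reaches a terminal state, then read the statistics. It is an illustration rather than Clickagy's code. The `run_query` helper and its parameters are hypothetical, while the client methods it calls (`start_query_execution` and `get_query_execution`) are the ones exposed by the boto3 Athena client.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run one query and return (query_id, bytes_scanned).

    `client` is expected to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until the query finishes.
    while True:
        response = client.get_query_execution(QueryExecutionId=query_id)
        execution = response["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        reason = execution["Status"].get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {query_id} ended as {state}: {reason}")

    scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
    return query_id, scanned


# Usage (requires boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   client = boto3.client("athena")
#   run_query(client, "SELECT 1", "my_database", "s3://my-results-bucket/")
```

Passing the client in as an argument keeps the helper easy to test with a stand-in object, and the returned byte count is the same figure the console shows as data scanned.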
Thanks to the ease of use of the AWS Console and the read-only nature of the queries involved, Clickagy can give its account support representatives a login to the AWS Console, which lets them pull any raw data they need to answer a client’s questions. Because the query history of everyone in Clickagy’s Athena environment is visible, the tech team can monitor queries and give feedback on performance and the correct use of partitioning, which saves Clickagy money. The Saved Queries feature in the Athena UI lets developers write well-tuned Presto queries for any number of tasks and save them for other users to run later, saving Clickagy a lot of time and money.
The most important thing to remember about Athena is that, at the end of the day, it’s still Presto under the hood. This means that any knowledge of Presto can, in most cases, directly benefit your use of Athena. Some examples include:
- Using subqueries instead of joins where possible, to allow for quicker results.
- Making sure to always filter on any applicable partitions.
- Keeping the larger dataset to the left when joining.
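As an illustration of the last two tips, here is a small sketch of a helper that builds a query which always filters on the partition columns and keeps the larger table on the left side of the join. The table names (`events`, `accounts`) and column names (`dt`, `account_id`, `url`, `name`) are hypothetical and are not taken from Clickagy's schema.

```python
from datetime import date


def build_events_query(day, account_id, limit=100):
    """Build a Presto query that always filters on both partition columns."""
    # Validate inputs so only a real date and integers reach the SQL text.
    if not isinstance(day, date):
        raise TypeError("day must be a datetime.date")
    if not isinstance(account_id, int) or not isinstance(limit, int):
        raise TypeError("account_id and limit must be integers")

    return (
        "SELECT e.url, a.name\n"
        "FROM events e\n"        # larger table on the left
        "JOIN accounts a\n"      # smaller table on the right
        "  ON e.account_id = a.account_id\n"
        f"WHERE e.dt = '{day.isoformat()}'\n"      # partition filter: date
        f"  AND e.account_id = {account_id}\n"     # partition filter: account
        f"LIMIT {limit}"
    )


print(build_events_query(date(2022, 3, 29), 42))
```

Building queries through a helper like this makes it hard to forget a partition filter, which is the mistake most likely to turn a cheap query into an expensive full scan.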
To recap, if you’re looking for a way to effortlessly query your big data infrastructure, Amazon’s Athena should be one of the first options you consider. Athena’s headache-free, usage-based billing means you’re charged only for queries that run successfully. The serverless platform can be accessed via JDBC drivers and the AWS SDK, making it a breeze to integrate Athena into your stack. Athena’s UI is simple enough for account managers running saved queries, yet powerful enough for a developer to query anything their heart desires.