Recently I had one of those "Aha, I get serverless" moments that really extend your understanding of serverless beyond just Functions-as-a-Service. It happened while I was working on a simple service that checks the state of a resource and sends an event to PagerDuty to notify me when the resource goes into an unexpected state. The realization I had was:
Don't code if you don't have to.
When I first approached building this service I drew a simple set of Lambda functions and began coding. But the more I looked at the problem, the more I realized how little code I needed to build this service and solve my issue.
Initial Build
When I started I immediately pictured two functions in my head connected by SNS. Best of all, I could reuse my PagerDuty publisher for some other needs too.
- Function 1: Check resource state
- Function 2: Send event to PagerDuty
The StateCheck Lambda function is triggered on a schedule and if the state of the checked resource changes, a message is published to SNS which triggers PagerDutyEventPublisher to publish to PagerDuty. For the PagerDuty part, I immediately went to the API documentation and then looked at their Python module. I was all set to write some code for publishing to the PagerDuty web API.
This is what I initially diagrammed as my application to build.
And this was the initial code for the PagerDutyEventPublisher Lambda function which published to the PagerDuty Events v2 API.
handlers/pagerduty_event_publisher.py
'''Publish message from SNS to PagerDuty'''
import json
import logging
import os

import pypd

log_level = os.environ.get('LOG_LEVEL', 'INFO')
logging.root.setLevel(logging.getLevelName(log_level))  # type: ignore
_logger = logging.getLogger(__name__)

# Configuration comes from the function's environment variables.
pypd.api_key = os.environ.get('PD_API_KEY')
PD_INT_KEY = os.environ.get('PD_INT_KEY')
PD_SEVERITY = os.environ.get('PD_SEVERITY')
PD_SOURCE = os.environ.get('PD_SOURCE')


def _get_message_from_event(event: dict) -> str:
    '''Get the message from the event'''
    return event.get('Records')[0].get('Sns').get('Message')


def _publish_event_to_pagerduty(msg: str,
                                integration_key: str = PD_INT_KEY,
                                severity: str = PD_SEVERITY,
                                source: str = PD_SOURCE) -> dict:
    '''Publish a message to the PagerDuty API'''
    r = pypd.EventV2.create(
        data={
            'routing_key': integration_key,
            'event_action': 'trigger',
            'payload': {
                'summary': msg,
                'severity': severity,
                'source': source,
            }
        }
    )
    return r


def handler(event, context):
    '''Function entry'''
    _logger.debug('Event received: {}'.format(json.dumps(event)))
    msg = _get_message_from_event(event)
    pagerduty_response = _publish_event_to_pagerduty(msg)
    resp = {
        'pagerduty_response': pagerduty_response,
        'status': 'OK'
    }
    _logger.debug('Response: {}'.format(json.dumps(resp)))
    return resp
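For reference, the handler only reads one field from the incoming event. Here is a minimal sketch of the SNS-to-Lambda event shape, trimmed to the fields that matter here. The sample values are made up for illustration, and the helper is a standalone copy of the lookup in _get_message_from_event.

```python
# Trimmed SNS event as Lambda receives it; the values are illustrative only.
SAMPLE_SNS_EVENT = {
    'Records': [
        {
            'EventSource': 'aws:sns',
            'Sns': {
                'Subject': 'State change',
                'Message': 'Resource entered an unexpected state',
            },
        }
    ]
}


def get_message_from_event(event: dict) -> str:
    '''Same lookup as _get_message_from_event in the handler.'''
    return event['Records'][0]['Sns']['Message']


message = get_message_from_event(SAMPLE_SNS_EVENT)
```

Whatever text the upstream function publishes to the topic ends up as the PagerDuty event summary.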
The code for publishing to PagerDuty is pretty simple, but it’s only a first pass. The next pass involves figuring out how the code can break or fail to perform, resulting in me not being notified of an issue. As an ops person, one of the things I think we should be doing is reviewing code for failure points and improving the resilience of code.
There are two types of failure I’d expect. On top of that, some failures are potentially transient, so the action can be retried, while others are not.
- Human error via configuration
- Communication failure with the PagerDuty API
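The transient-versus-permanent split can be sketched as a small predicate. This is only an illustration, assuming the HTTP status code of the failed request is available. The function name is hypothetical, and the rule (429 or any 5xx is worth retrying) is the one my own handler ends up using.

```python
def is_retryable_status(status_code: int) -> bool:
    '''Treat rate limiting (429) and server errors (5xx) as transient.'''
    return status_code == 429 or status_code >= 500


# A bad request will fail the same way every time, so it is not retried.
retry_bad_request = is_retryable_status(400)
# Rate limiting and server-side errors may succeed on a later attempt.
retry_rate_limited = is_retryable_status(429)
retry_server_error = is_retryable_status(503)
```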
As a result of spending more time with the PagerDuty API documentation, this is what my code came to look like.
handlers/pagerduty_event_publisher.py
'''Publish message from SNS to PagerDuty'''
import json
import logging
import os
import sys

import pypd
from pypd.errors import Error as PdError

log_level = os.environ.get('LOG_LEVEL', 'INFO')
logging.root.setLevel(logging.getLevelName(log_level))  # type: ignore
_logger = logging.getLogger(__name__)

PD_INT_KEY = os.environ.get('PD_INT_KEY')
PD_SEVERITY = os.environ.get('PD_SEVERITY')
PD_SOURCE = os.environ.get('PD_SOURCE')


class HandlerBaseError(Exception):
    '''Base error class'''


class PagerDutyBaseError(HandlerBaseError):
    '''Base PagerDuty Error'''


class PagerDutyApiError(PagerDutyBaseError):
    '''PagerDuty Communication Error'''


class PagerDutyApiRetryableError(PagerDutyBaseError):
    '''PagerDuty Communication Error (Retryable)'''


class PagerDutyDataSeverityTypeError(PagerDutyBaseError):
    '''PagerDuty Severity Not Valid Type Error'''


class PagerDutyDataEventValidationError(PagerDutyBaseError):
    '''PagerDuty Event Data Validation Error'''


# Fail fast at import time if the configured severity is invalid.
PD_ALLOWED_SEVERITIES = pypd.EventV2.SEVERITY_TYPES
if PD_SEVERITY not in PD_ALLOWED_SEVERITIES:
    raise PagerDutyDataSeverityTypeError(
        'Event severity "{}" not in "{}"'.format(PD_SEVERITY, PD_ALLOWED_SEVERITIES)
    )


def _get_message_from_event(event: dict) -> str:
    '''Get the message from the event'''
    return event.get('Records')[0].get('Sns').get('Message')


def _publish_event_to_pagerduty(msg: str,
                                integration_key: str = PD_INT_KEY,
                                severity: str = PD_SEVERITY,
                                source: str = PD_SOURCE) -> dict:
    '''Publish a message to the PagerDuty API'''
    try:
        r = pypd.EventV2.create(
            data={
                'routing_key': integration_key,
                'event_action': 'trigger',
                'payload': {
                    'summary': msg,
                    'severity': severity,
                    'source': source,
                }
            }
        )
    except PdError as e:
        tb = sys.exc_info()[2]
        # Rate limiting (429) and server errors (5xx) are worth retrying.
        if hasattr(e, 'code'):
            if e.code == 429 or e.code >= 500:
                raise PagerDutyApiRetryableError(e).with_traceback(tb)
        raise PagerDutyApiError(e).with_traceback(tb)
    except AssertionError as e:
        # Re-raise failed assertions as a data validation error.
        tb = sys.exc_info()[2]
        raise PagerDutyDataEventValidationError(e).with_traceback(tb)
    return r


def handler(event, context):
    '''Function entry'''
    _logger.debug('Event received: {}'.format(json.dumps(event)))
    msg = _get_message_from_event(event)
    pagerduty_response = _publish_event_to_pagerduty(msg)
    resp = {
        'pagerduty_response': pagerduty_response,
        'status': 'OK'
    }
    _logger.debug('Response: {}'.format(json.dumps(resp)))
    return resp
I’m pretty knowledgeable about the PagerDuty API now and how to communicate with it. I’m also fairly happy at reusing this code elsewhere. I’m even happy having someone else use this code, feeling that if it does fail, someone can easily understand why with minimal effort. (This is why I’m a fan of catch and re-raise for handling errors, even though it makes my function very verbose.)
But that wasn’t the point of the serverless application I was building. The point of the application is to check a resource state and alert me when there’s a state change. Yet I focused most of my time and energy on how I got alerted, which should really just be a minor detail in my application.
Removing Code
When I went back and looked at the PagerDuty integration setup I saw I could just use AWS CloudWatch. After reading the documentation I realized I didn't need to write code to alert PagerDuty. All I needed was to create a CloudWatch alarm or CloudWatch event rule, and send those events to an SNS topic with an endpoint (given to me by PagerDuty) subscribed to that SNS topic.
This was what I ended up creating by pursuing this path.
The CheckState function checks the state of my resource and then writes a value to a CloudWatch custom metric. There’s an alarm on that metric and when the metric is undesirable for a configured period of time, a message is published to SNS and it’s picked up by a PagerDuty HTTP endpoint subscribed to the SNS topic. That endpoint is managed entirely by PagerDuty.
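To make the alarm-to-SNS-to-PagerDuty wiring concrete, here is a rough sketch of the resources involved. It is written as a Python function that builds a CloudFormation template as a dict so it can be dumped to JSON. It is not my actual template: the namespace, metric name, period, threshold, and endpoint URL are all hypothetical placeholders.

```python
import json


def build_alert_template(pagerduty_endpoint: str) -> dict:
    '''Return a CloudFormation template dict with an SNS topic and an alarm.'''
    return {
        'AWSTemplateFormatVersion': '2010-09-09',
        'Resources': {
            # Topic with the PagerDuty-provided HTTPS endpoint subscribed.
            'AlertTopic': {
                'Type': 'AWS::SNS::Topic',
                'Properties': {
                    'Subscription': [
                        {'Protocol': 'https', 'Endpoint': pagerduty_endpoint}
                    ]
                },
            },
            # Alarm on the custom metric; all values below are assumptions.
            'StateAlarm': {
                'Type': 'AWS::CloudWatch::Alarm',
                'Properties': {
                    'Namespace': 'Hypothetical/StateCheck',
                    'MetricName': 'UnexpectedState',
                    'Statistic': 'Maximum',
                    'Period': 60,
                    'EvaluationPeriods': 5,
                    'Threshold': 1,
                    'ComparisonOperator': 'GreaterThanOrEqualToThreshold',
                    # Ref on an SNS topic resolves to the topic ARN.
                    'AlarmActions': [{'Ref': 'AlertTopic'}],
                },
            },
        },
    }


# Placeholder URL; the real endpoint is the one PagerDuty gives you.
template = build_alert_template('https://example.invalid/pagerduty-integration')
template_json = json.dumps(template, indent=2)
```

With these example numbers, the alarm fires once the metric has been at 1 or above for five consecutive one-minute periods.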
My code is so simple now; it just records a state change to CloudWatch. The event generated by that travels through resources I set up in CloudFormation, all the way to PagerDuty where it alerts me. I’m also no longer responsible for delivering the event to PagerDuty. I’ve handed that responsibility over directly to AWS and PagerDuty. I’m actually okay with this. I think PagerDuty knows better than I do how to receive events and reliably alert me.
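A minimal sketch of what recording the state might look like, using the boto3 CloudWatch client's put_metric_data call. The namespace and metric name are hypothetical, the resource check itself is stubbed out, and the client is passed in so the helper can be exercised without AWS credentials.

```python
def record_state(cloudwatch_client, unexpected: bool,
                 namespace: str = 'Hypothetical/StateCheck',
                 metric_name: str = 'UnexpectedState') -> dict:
    '''Write 1 when the resource is in an unexpected state, else 0.'''
    return cloudwatch_client.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': 1.0 if unexpected else 0.0,
                'Unit': 'Count',
            }
        ],
    )


def handler(event, context):
    '''Scheduled entry point; the real resource check is left as a stub.'''
    import boto3  # imported lazily so record_state can be tested without it
    unexpected = False  # placeholder for the actual state check
    record_state(boto3.client('cloudwatch'), unexpected)
    return {'status': 'OK'}
```

Everything after the metric write (alarm evaluation, SNS delivery, the PagerDuty endpoint) happens outside the function.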
Simplicity Strikes
Staring at the simplicity of what I had just built was my aha moment. Why spend time with third-party API docs and handling web requests when a simple AWS CloudWatch API call and AWS services set up with CloudFormation would take care of all I needed? To solve my problem I didn’t need as much code as I thought. I only needed some simple code and a pipeline of cloud services configured to function together.
What I’m realizing today is I’ve still been in the regular mindset of building serverless apps similarly to how I would have built microservices. I map old patterns 1:1 onto new services. This service was the first time I did something actually different and relied more heavily on AWS services than code.
For serverless, less code is more.
This idea of writing less code can be an advantage for operations people who have minimal code experience. If you’re someone who’s comfortable with DSLs like CloudFormation and with understanding how third-party services work, then you can probably do well with serverless. Not having writing code as your first inclination can work to your advantage.
Going Serverless Is Multiple Leaps
The first leap to serverless is understanding how to deliver functional and reliable application code to an infrastructure without servers. The next leap is learning how to deliver less and simpler code. And finally, the third leap is understanding how to spend most of your time focused on what’s really valuable.
The next step for me will be sharing my PagerDuty publisher with others via AWS SAR. That's another blog post.
(By the way, I’m wondering how many companies should be offering not just a webhook URL but also an SNS topic subscription endpoint.)