tl;dr: Go is cool, and Opsgenie and Slack have nice APIs we can use to patch things together.

Intro

We are using Opsgenie for two primary purposes:

  • responding to/managing operational alerts
  • internal support system using on-call schedules

The first one is the very reason alert management systems exist and make money. The second is our own adaptation, meant to ease the internal support responsibility of our SRE team.

Previously, every member of the SRE team received every notification, and we couldn’t prevent it: any request or question notified the whole team. To solve this problem, we decided to dedicate one person as the first responder to support requests. Since we already had Opsgenie, we used its on-call schedules to determine who that person should be, leveraging overrides, forwarding rules, etc.

As an interruption hater, I like this system. It allows me to focus on my work outside of my on-call weeks.

With the concept settled, we needed a way to notify the on-call users defined in Opsgenie directly within Slack. I will explain three methods, with their pros and cons.

We will use the Opsgenie Go SDK and slack-go in the examples.


DISCLAIMER

The following code snippets are not production-ready. If you want to copy/paste code from this blog post, please read the TODOs carefully and add the necessary error handling.


Generic Slack app template

For Option 1 and Option 2, we need a Slack app implementation that handles DMs and mentions in Slack. The following code snippet does exactly that. It is adapted from the slack-go library examples.

go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"

	"github.com/slack-go/slack"
	"github.com/slack-go/slack/slackevents"
)

// You more than likely want your "Bot User OAuth Access Token" which starts with "xoxb-"
var slackApi = slack.New(os.Getenv("SLACK_BOT_TOKEN"))

func main() {
	signingSecret := os.Getenv("SLACK_SIGNING_SECRET")

	http.HandleFunc("/events-endpoint", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			w.WriteHeader(http.StatusBadRequest)
			return
		}

		// Verify that the request really comes from Slack, using the signing secret.
		sv, err := slack.NewSecretsVerifier(r.Header, signingSecret)
		if err != nil {
			w.WriteHeader(http.StatusBadRequest)
			return
		}
		if _, err := sv.Write(body); err != nil {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		if err := sv.Ensure(); err != nil {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}

		eventsAPIEvent, err := slackevents.ParseEvent(json.RawMessage(body), slackevents.OptionNoVerifyToken())
		if err != nil {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}

		// Slack sends this challenge when the endpoint URL is registered.
		if eventsAPIEvent.Type == slackevents.URLVerification {
			var res *slackevents.ChallengeResponse
			if err := json.Unmarshal(body, &res); err != nil {
				w.WriteHeader(http.StatusInternalServerError)
				return
			}
			w.Header().Set("Content-Type", "text")
			w.Write([]byte(res.Challenge))
		}
		if eventsAPIEvent.Type == slackevents.CallbackEvent {
			switch ev := eventsAPIEvent.InnerEvent.Data.(type) {
			case *slackevents.AppMentionEvent:
				slackApi.PostMessage(ev.Channel, slack.MsgOptionText("Let me find someone to help you...", false), slack.MsgOptionAsUser(true), slack.MsgOptionTS(ev.TimeStamp))
				if err := handleMention(ev); err != nil {
					// TODO - report the failure somewhere more useful than the log
					log.Printf("[ERROR] handling mention: %v", err)
				}
			}
		}
	})
	fmt.Println("[INFO] Server listening")
	log.Fatal(http.ListenAndServe(":3000", nil))
}

// handleMention takes a pointer because the type switch above yields *slackevents.AppMentionEvent.
func handleMention(ev *slackevents.AppMentionEvent) error {
	// other things...
	return nil
}

In Option 1 and Option 2, we will modify the body of the handleMention function.

Of course, we need to run this on a server, register a domain for it, and set the endpoint URL in the Slack app’s configuration; Slack has enough documentation for that. To test locally, you can use a tool like www.ngrok.com and paste the ngrok endpoint into your Slack app’s configuration. There are also serverless options, which would make this blog post 15 times longer.

Option 1: generate alerts for each mention

Simply don’t do this.

The modern way of doing integrations in Slack, as of June 2021, is using Slack apps. We created a Slack app named ops, which can be mentioned with @ops anytime, anywhere. Since our purpose is notifying the on-call, the most basic thing we can do is create an alert for each interaction, drop the Slack message link in the alert body, and let the on-call handle the rest.

I didn’t like the final idea here. Let’s go over the pros/cons:

Pros:

  • We find the on-call person of that moment.
  • We have a guaranteed way of contacting the on-call person - Opsgenie handles this well. It’s what they are good at.

Cons:

  • Another alert will be created when someone edits the Slack message containing the @ops mention. You need to set an alert alias, probably to the Slack message’s timestamp, so that duplicates are merged.
  • On-call will hate you. They receive support notifications through the same medium as outage alerts.
  • Not very interactive. On-call needs to open Opsgenie first, go to the alert, find the Slack message link, click it, wait for it to open, and respond.
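
The duplicate-alert problem from the first con can be handled through Opsgenie's alias-based deduplication. A minimal sketch, assuming the alias is keyed on the channel and the message timestamp: the alertAlias helper and the slack-support prefix are made up for illustration, and the result would go into the Alias field of alert.CreateAlertRequest.

```go
package main

import "fmt"

// alertAlias builds a stable alias for a Slack message, so that a repeated
// event for the same message maps to the same Opsgenie alert.
// The "slack-support" prefix is an arbitrary choice for this sketch.
func alertAlias(channel, messageTS string) string {
	return fmt.Sprintf("slack-support-%s-%s", channel, messageTS)
}

func main() {
	// Hypothetical channel ID and message timestamp.
	fmt.Println(alertAlias("C024BE91L", "1623456789.000200"))
}
```

Whether an edited message really arrives with the original timestamp is worth checking against your own events before relying on this.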

In Slack:


In Opsgenie:

Personally, I would hate this flow.

To handle this interaction, we need to use Opsgenie create alert API and Slack post message API. We need to modify handleMention as follows:

package main

import (
	// ... (imports from the template), plus:
	"context"
	"strings"

	"github.com/opsgenie/opsgenie-go-sdk-v2/alert"
	"github.com/opsgenie/opsgenie-go-sdk-v2/client"
)

...

// TODO - handle the error returned by NewClient instead of discarding it
var alertClient, _ = alert.NewClient(&client.Config{
	ApiKey: os.Getenv("OPSGENIE_API_KEY"), // TODO - modify for your environment
})

func handleMention(ev *slackevents.AppMentionEvent) error {
	// TODO - get actual name of the user from Slack API, ev.User is in Slack's user id format
	// Slack message links use the message timestamp with the dot removed.
	messageLink := fmt.Sprintf("https://myorg.slack.com/archives/%s/p%s", ev.Channel, strings.ReplaceAll(ev.TimeStamp, ".", ""))
	return createAlert(ev.User, messageLink)
}

func createAlert(user, messageLink string) error {
	// Create also returns an async request result; only the error matters here.
	_, err := alertClient.Create(context.Background(), &alert.CreateAlertRequest{
		Message:     fmt.Sprintf("Support request - %s needs help in Slack", user),
		Description: fmt.Sprintf("%s mentioned you in Slack. Click here for the message - %s", user, messageLink),
		Responders: []alert.Responder{
			{Type: alert.TeamResponder, Name: "SRE"},
		},
		Tags:     []string{"slack", "support"},
		Priority: alert.P4,
	})
	return err
}

We took ideas from this approach but never implemented it, to keep our sanity.

Option 2: find on-call person, respond to message’s thread in Slack

Using the same responsive bot app idea, we decided to find the on-call person at that moment and mention them in the thread.

Here we need Opsgenie get on-call users API and Slack post message API. We need to modify handleMention as follows:

package main

import (
	// ... (imports from the template), plus:
	"context"
	"strings"

	"github.com/opsgenie/opsgenie-go-sdk-v2/client"
	"github.com/opsgenie/opsgenie-go-sdk-v2/schedule"
)

// TODO - handle the error returned by NewClient instead of discarding it
var scheduleClient, _ = schedule.NewClient(&client.Config{
	ApiKey: os.Getenv("OPSGENIE_API_KEY"),
})

func handleMention(ev *slackevents.AppMentionEvent) error {
	// The "Let me find someone to help you..." acknowledgement is already
	// posted by the event handler in the template, so it is not repeated here.
	oncallUsers, err := findOncallSlackUsers()
	if err != nil {
		return err
	}

	// Mention the on-call users in the thread of the original message.
	_, _, err = slackApi.PostMessage(ev.Channel, slack.MsgOptionText(strings.Join(oncallUsers, " "), false), slack.MsgOptionAsUser(true), slack.MsgOptionTS(ev.TimeStamp))
	return err
}

func findOncallSlackUsers() ([]string, error) {
	flat := false
	scheduleResult, err := scheduleClient.GetOnCalls(context.Background(), &schedule.GetOnCallsRequest{
		Flat:                   &flat,
		ScheduleIdentifierType: schedule.Name,
		ScheduleIdentifier:     "SRE_schedule", // TODO Change schedule name to your team's schedule here
	})
	if err != nil {
		return nil, err
	}
	participants := scheduleResult.OnCallParticipants

	users := make([]string, len(participants))
	for i, p := range participants {
		// For user participants, the name is the Opsgenie username, which is an email address.
		user, err := slackApi.GetUserByEmail(p.Name)
		if err != nil {
			return nil, err
		}
		users[i] = fmt.Sprintf("<@%s>", user.ID)
	}
	return users, nil
}

Pros:

  • Keep on-call happy, do not create meaningless alerts
  • Allow interactivity on Slack app directly by mentioning their names
  • Requester knows who is on-call without visiting Opsgenie, can initiate DM with the on-call if needed

Cons:

  • The solution depends on the on-call’s Slack notification preferences, so it is not as robust as Opsgenie.
  • There is always a delay of a few seconds.
  • We may hit Opsgenie’s rate limits. We hit them a few times, mostly because of a nice bug we introduced ourselves with the code above: if the bot is triggered in a DM, Slack sends an event callback to our server for both your message and the bot’s response, so the bot tries to respond to itself in an endless loop. In the handleMention function, we need a check like:
if ev.User == "BOT_USER_ID" {
    return nil
}
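
The same guard can be written as a small predicate instead of a hard-coded ID comparison. The sketch below is only an illustration: isOwnEvent is a made-up helper, and it assumes the caller passes in the event's User and BotID values along with the bot's own user ID (which could be looked up once at startup, for example via Slack's auth.test method). Ignoring every bot-authored event is a design choice of this sketch, not something the original setup prescribes.

```go
package main

import "fmt"

// isOwnEvent reports whether an event should be ignored because it was posted
// by a bot (any bot, including us) or by our own bot user.
func isOwnEvent(eventUser, eventBotID, ownUserID string) bool {
	if eventBotID != "" {
		return true // the message was authored by a bot
	}
	return ownUserID != "" && eventUser == ownUserID
}

func main() {
	// Hypothetical IDs: U999 is our bot user, U123 is a human.
	fmt.Println(isOwnEvent("U999", "", "U999"))
	fmt.Println(isOwnEvent("U123", "", "U999"))
}
```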

One improvement here is caching the on-call participants returned by Opsgenie. In some cases, like urgent schedule overrides, a stale cache can create mayhem until its TTL expires. A way to overcome this is invalidating the cache when the app is mentioned or DMed with a command like @ops cache clear.

Option 3: create a user group in Slack, update it with on-call participants via cronjob every x minutes

This option does not require building a Slack app that handles interactions. Instead of a Slack app named ops, you create a user group named ops and update its members periodically.

For this purpose, we need the Slack update user group API and the Opsgenie get on-call users API. Sample code for it is as follows:

package main

import (
	"log"
	"os"
	"strings"

	"github.com/opsgenie/opsgenie-go-sdk-v2/client"
	"github.com/opsgenie/opsgenie-go-sdk-v2/schedule"
	"github.com/slack-go/slack"
)

var slackApi = slack.New(os.Getenv("SLACK_BOT_TOKEN"))

// TODO - handle the error returned by NewClient instead of discarding it
var scheduleClient, _ = schedule.NewClient(&client.Config{
	ApiKey: os.Getenv("OPSGENIE_API_KEY"),
})

func main() {
	// TODO - Slack expects the user group's ID here, not its handle
	userGroup := "opsoncall"

	oncallUsers, err := findOncallSlackUsers() // TODO - from the method in option 2, assume it returns user ID list only, not mentioned users in chat
	if err != nil {
		log.Fatal(err)
	}

	// The members are passed as a comma-separated list of user IDs.
	if _, err := slackApi.UpdateUserGroupMembers(userGroup, strings.Join(oncallUsers, ",")); err != nil {
		log.Fatal(err)
	}
}
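
The TODO above assumes the lookup returns plain Slack user IDs. One way to share that lookup between Option 2 and Option 3 is to keep it ID-only and do the formatting at the edges. A minimal sketch, where mentionText and groupMembers are hypothetical helper names and not part of slack-go:

```go
package main

import (
	"fmt"
	"strings"
)

// mentionText turns Slack user IDs into space-separated mentions,
// suitable for a thread reply (Option 2).
func mentionText(userIDs []string) string {
	mentions := make([]string, len(userIDs))
	for i, id := range userIDs {
		mentions[i] = fmt.Sprintf("<@%s>", id)
	}
	return strings.Join(mentions, " ")
}

// groupMembers builds the comma-separated ID list used when updating
// a user group's members (Option 3).
func groupMembers(userIDs []string) string {
	return strings.Join(userIDs, ",")
}

func main() {
	ids := []string{"U111", "U222"} // hypothetical user IDs
	fmt.Println(mentionText(ids))   // prints "<@U111> <@U222>"
	fmt.Println(groupMembers(ids))  // prints "U111,U222"
}
```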

Pros:

  • Interaction is blazing fast - no networking with your app.
  • You don’t deal with Slack app nonsense and Slack’s periodic deprecation of concepts and APIs. [1]

Cons:

  • If you want to do fancier things within your app, like auto-replying to certain messages, deciding priority based on content, or measuring the number of requests, you lose that option.
  • While Option 2 can be implemented in a serverless function, like Lambda, this one could create billing problems, since the job has to run frequently to avoid buzzing the wrong person at shift changes.

Conclusion

Thinking we would be fancy and implement some kind of interactive chatbot, we decided to go with Option 2. I think the last two options both have viable use cases. Even the first one makes sense for certain companies.


  1. We loved slash commands, but Slack killed them for us! Now slash commands are at 10% of the productivity they used to provide for a chatops tool like ours.