Homeβ€Ί Blogβ€Ί Week 4 - Building a Fleet Intelligence Platform wi…
AI GitOps Terraform

Week 4 - Building a Fleet Intelligence Platform with AWS Glue, Step Functions & ServiceNow

AWS Platform Engineering Lab Β· Week 04

Building a Fleet Intelligence Platform
with AWS Glue, Step Functions & ServiceNow

From scattered SSM JSON files to SQL-queryable fleet health data β€” automated end to end, ticketed from ServiceNow

7AWS Services
4Parquet Tables
3Lambda Functions
100%IaC (Terraform)

Why β€” The Problem This Solves

Every AWS account running EC2 instances has the same invisible problem: you cannot answer basic fleet questions without logging into dozens of consoles, running scripts, or waiting for a weekly report.

Questions like:

  • Which servers are missing critical patches right now?
  • How many Windows vs Linux instances are running in prod?
  • What version of Java is installed across the fleet?
  • Which instances haven't reported to SSM in the last 24 hours?

In a small environment you can click around the SSM console to answer these. In an enterprise with hundreds or thousands of EC2 instances across multiple accounts, that becomes impossible. The data exists β€” SSM is already collecting it β€” but it's fragmented: one JSON file per instance, stored in S3, with no way to query across all of it at once.

The Real-World Pain Point

Security audits require patch compliance reports. Compliance teams need OS inventory snapshots. Operations teams want to know which instances are running outdated agents. All of this data lives in SSM β€” but extracting it manually every time is slow, error-prone, and doesn't scale.

The Business Value

This platform turns a manual, hours-long process into a self-service, 30-minute automated pipeline:

1

Self-Service Request

Any team member submits a ServiceNow RITM ticket. No AWS console access required. No engineer involvement needed to kick it off.

2

Automatic Pipeline Execution

The ticket triggers an AWS Step Functions workflow. Crawler β†’ ETL β†’ data ready. Fully orchestrated, no manual steps.

3

Ticket Auto-Closes

When the pipeline finishes, the Lambda closes the ServiceNow ticket with the Athena query URL in the notes. Requester gets notified automatically.

4

SQL Queries on Fleet Data

Analyst opens Athena, runs SQL against clean Parquet tables β€” patch compliance, inventory, applications, fleet summary. No Python, no scripts.

Who Benefits

Security teams get patch compliance reports on demand. Ops teams get fleet inventory without digging through consoles. Compliance teams get audit-ready data that refreshes automatically. Engineers get out of the loop β€” no one needs to ping them to run a script.

What You Need to Know β€” Skills & Tools

This project is deliberately complex β€” it's Week 4 of an enterprise lab series. Before attempting it, you should be comfortable with the following:

Required AWS Knowledge

πŸ”§ SSM (Systems Manager)
πŸ”₯ AWS Glue (Crawler + ETL Jobs)
πŸ”„ Step Functions (state machines)
⚑ Lambda (Python)
🌐 API Gateway (REST + HMAC)
πŸ—„οΈ S3 (data lake patterns)
πŸ” Athena (SQL queries)
πŸ” IAM (roles, least privilege)
πŸ“¦ Parameter Store (SSM)

Required Engineering Skills

πŸ—οΈ Terraform (modules, remote state)
🐍 Python (boto3, PySpark basics)
βš™οΈ GitHub Actions (CI/CD)
πŸ”‘ HMAC-SHA256 (webhook security)
πŸ“Š Parquet / columnar storage
πŸ—‚οΈ Medallion architecture (raw β†’ curated)

Concepts to Understand Before Starting

πŸ…

Medallion Architecture

Raw data (Bronze) β†’ cleaned data (Silver) β†’ aggregated data (Gold). This project implements all three layers in S3.

πŸ“

Glue Data Catalog

A metadata store that maps S3 file paths to table schemas. Athena uses it to know what columns exist and where the files live.

πŸ”—

Event-Driven Architecture

ServiceNow fires a webhook β†’ API Gateway receives it β†’ Lambda validates β†’ Step Functions orchestrates. No polling, no cron jobs.

πŸ”„

Step Functions Polling Loop

A Wait β†’ Check β†’ Choice state pattern used to monitor long-running Glue jobs without Lambda timeouts. Standard pattern for async orchestration.

My Background Coming Into This

14 years as a DBA gave me strong foundations in SQL, data modelling, and understanding what "good" data looks like. But Glue, Step Functions, and the event-driven integration pattern were all new territory. This week pushed me furthest from my comfort zone β€” and was the most valuable for it.

Architecture β€” How It Fits Together

ServiceNow RITM Ticket Submitted HMAC-signed webhook API Gateway REST API β€” POST /webhook webhook_receiver Ξ» Validates HMAC Β· extracts ticket_id Step Functions State Machine StartCrawler β–Ί WaitForCrawler β–Ί CheckStatus β–Ί StartETLJob β–Ί WaitForETLJob β–Ί CheckStatus β–· PipelineSucceeded Β Β Β  β–· PipelineFailed Glue Crawler Scans raw S3 β–Ί Glue Catalog status_updater Ξ» PATCHes RITM β†’ Closed Complete Glue ETL Job boto3 reads raw Β· writes Parquet closes ticket reads writes S3 Raw Bucket NDJSON (SSM Resource Sync) S3 Curated Bucket Parquet, partitioned by date Bronze / Raw Layer Silver / Curated Layer Glue Data Catalog Table schemas registered Amazon Athena SQL queries by analysts Gold / Analytics Layer
Step Functions state machine
Step Functions state machine β€” the orchestration backbone of the pipeline

Why Step Functions Instead of a Single Lambda?

Glue crawlers and ETL jobs run for minutes β€” sometimes 10–15 minutes for a large fleet. Lambda has a 15-minute maximum timeout, but more importantly, keeping a Lambda alive just to poll a job wastes money and is fragile. Step Functions' Wait states are free β€” the state machine pauses and resumes when the job finishes, using zero compute while waiting.

Why Two S3 Buckets?

The raw bucket holds everything SSM exports, unchanged. If the ETL logic ever needs to be corrected or the schema changes, we can reprocess from raw without re-running the SSM sync. The curated bucket holds clean, queryable Parquet β€” the "source of truth" for analysts. This separation is the foundation of every production data lake.

How We Built It β€” Step by Step

Step 1: Terraform Modules

Everything is infrastructure-as-code across six Terraform modules: s3, iam, glue, lambda, api-gateway, and step-functions. The dev environment wires them together via terraform/environments/dev/main.tf.

HCL β€” Module Composition
module "glue" {
  source              = "../../modules/glue"
  project_name        = var.project_name
  environment         = var.environment
  raw_bucket_name     = module.s3.raw_bucket_name
  curated_bucket_name = module.s3.curated_bucket_name
  glue_role_arn       = module.iam.glue_role_arn
}
Terraform apply output
GitHub Actions running terraform apply β€” 10 resources created

Step 2: SSM Resource Data Sync

SSM Resource Data Sync is a built-in AWS feature that continuously exports EC2 fleet data (patch compliance, inventory, installed applications) to S3 as NDJSON files. One file per instance, organized in partition-style paths:

S3 Path Structure
ssm/AWS:ComplianceSummary/
  accountid=684346483786/
    region=us-east-1/
      resourcetype=ManagedInstanceInventory/
        i-0abc123def456.json   ← one file per instance
SSM Data Sync configuration
SSM Resource Data Sync configured and syncing fleet data to S3
Raw S3 bucket SSM folders
Raw S3 bucket showing AWS:ComplianceSummary, AWS:InstanceInformation, and other SSM data types

Step 3: Glue Crawler

The Glue crawler scans the raw S3 bucket and registers the SSM data structures in the Glue Data Catalog. This creates queryable tables like aws_compliancesummary and aws_instancedetailedinformation β€” giving the ETL job a schema to work from.

Glue crawler run
Glue crawler successfully cataloging raw SSM data
Glue catalog tables
Glue Data Catalog tables discovered from the SSM raw data

Step 4: Glue ETL Job (PySpark + boto3)

The ETL job reads raw NDJSON files, transforms them into clean typed columns, and writes four Parquet tables to the curated bucket. The most important design decision: we use boto3 instead of Spark's native S3 reader to load the files.

Why boto3 instead of spark.read.json()?

The SSM folder names contain colons β€” AWS:ComplianceSummary. Java's URI parser (which Spark uses internally) treats the colon as a protocol separator and crashes with java.net.URISyntaxException: Relative path in absolute URI. boto3 has no such limitation and reads the files directly from S3 without going through the Java URI stack.

Python β€” boto3 S3 Reader (key pattern)
def read_ssm_prefix(bucket, prefix):
    records = []
    paginator = s3_client.get_paginator("list_objects_v2")

    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Size"] == 0:
                continue
            key = obj["Key"]
            # Extract account + region from partition path
            acct_match   = re.search(r"accountid=([^/]+)", key)
            region_match = re.search(r"region=([^/]+)",   key)

            body   = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
            record = json.loads(body)
            record["_account_id"] = acct_match.group(1)   if acct_match   else ""
            record["_region"]     = region_match.group(1) if region_match else ""
            records.append(record)

    return records   # list of dicts β†’ spark.createDataFrame(records)
Glue ETL job
Glue ETL job jay-fleet-intelligence-fleet-etl-dev β€” Glue 4.0, 2 Γ— G.1X workers
Glue ETL run succeeded
ETL job run SUCCEEDED β€” showing execution time and DPU hours consumed
Curated S3 Parquet data
Curated S3 bucket β€” fleet/ prefix with Parquet partitions: patch_compliance, inventory, fleet_summary

Step 5: Step Functions Orchestration

A Step Functions state machine ties the whole pipeline together. It polls the Glue crawler and ETL job using a Wait β†’ Check β†’ Choice loop rather than hard-coded timeouts β€” so the pipeline self-adapts to however long each step takes.

1
StartCrawler
glue_trigger Lambda starts the Glue crawler
2
WaitForCrawler
Wait 30 s, then poll status
3
IsCrawlerReady
Choice state: loops back until crawler is READY
4
StartETLJob
glue_trigger Lambda submits the PySpark job with ticket context
5
WaitForETLJob
Wait 60 s, then poll job run status
6
IsETLComplete
Choice state: loops back until SUCCEEDED or FAILED
7
UpdateServiceNow
status_updater Lambda closes RITM ticket with Athena URL
Step Functions β€” Polling Loop Pattern (ASL)
"WaitForCrawler": {
  "Type": "Wait",
  "Seconds": 30,
  "Next": "CheckCrawlerStatus"
},
"CheckCrawlerStatus": {
  "Type": "Task",
  "Resource": "${glue_trigger_arn}",
  "Parameters": {
    "action": "get_crawler_status",
    "ticket_id.$": "$.ticket_id"
  },
  "ResultPath": "$.crawler_status",
  "Next": "IsCrawlerReady"
},
"IsCrawlerReady": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.crawler_status.ready",
      "BooleanEquals": true,
      "Next": "StartETLJob"
    }
  ],
  "Default": "WaitForCrawler"   ← loop back if not ready
}
Step Functions state machine
Step Functions visual workflow β€” all states green on a successful run
Step Functions execution succeeded
Execution detail β€” SUCCEEDED with full state timeline

Step 6: ServiceNow Webhook & Ticket Closure

The pipeline is triggered by a ServiceNow webhook. The webhook_receiver Lambda validates the HMAC-SHA256 signature on every inbound request before starting the Step Functions execution.

Python β€” HMAC-SHA256 Validation
def verify_hmac(body: str, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(), body.encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)

# In handler:
secret = ssm_client.get_parameter(
    Name=HMAC_SECRET_PARAM, WithDecryption=True
)["Parameter"]["Value"]

if not verify_hmac(body, signature, secret):
    return {"statusCode": 401, "body": json.dumps({"error": "Unauthorized"})}

When the pipeline finishes, the status_updater Lambda calls the ServiceNow REST API to close the RITM ticket with a direct Athena console link in the close notes.

ServiceNow ticket open
ServiceNow RITM ticket in Open state β€” Fleet Intelligence Report request
ServiceNow ticket closed
Ticket automatically closed by status_updater β€” close notes contain the Athena query URL

Step 7: Querying with Athena

With Parquet data in S3 and tables registered in the Glue Data Catalog, Athena can query the full fleet for patch compliance and OS inventory in seconds.

SQL β€” Patch Compliance Summary
SELECT
    compliance_status,
    COUNT(*)         AS instance_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM fleet.patch_compliance
GROUP BY compliance_status
ORDER BY instance_count DESC;
SQL β€” OS Distribution
SELECT
    platform_name,
    platform_version,
    COUNT(*) AS instance_count
FROM fleet.inventory
GROUP BY platform_name, platform_version
ORDER BY instance_count DESC;
Athena workgroup
Athena workgroup jay-fleet-intelligence-fleet-dev β€” output location and encryption configured
Athena patch compliance query
Patch compliance query results β€” compliance_status breakdown across fleet
Athena OS inventory query
OS inventory query β€” platform distribution across managed instances

Challenges β€” What Actually Went Wrong

Every real build has friction. Here are the six issues that slowed us down and exactly how each was fixed.

Challenge 1: java.net.URISyntaxException: Relative path in absolute URI

What happened: The Glue PySpark job crashed immediately on spark.read.json("s3://raw-bucket/ssm/AWS:ComplianceSummary/..."). Java’s URI parser treats the colon in AWS:ComplianceSummary as a protocol separator.

Fix: Replaced spark.read.json() with a boto3 paginator that lists and downloads files directly from S3. boto3 treats S3 keys as strings β€” no URI parsing involved. The downloaded records are loaded into Spark using spark.createDataFrame(records).

Challenge 2: HMAC Validation Always Returning 401

What happened: Every test webhook call returned 401 Unauthorized even with the correct secret. The signature was being compared correctly but always mismatched.

Fix: The ServiceNow webhook was sending the signature in a header named x-servicenow-hmac but the Lambda was reading X-ServiceNow-HMAC. API Gateway lowercases all headers. Updated the Lambda to read x-servicenow-hmac (lowercase).

Challenge 3: Step Functions Execution Timing Out on Crawler Poll

What happened: The state machine was set to wait 30 seconds between crawler polls, but the Glue crawler was taking 4–5 minutes to finish on first run (cold start + schema inference). The execution appeared stuck.

Fix: This is expected β€” the Wait state naturally loops. Added CloudWatch logging to the state machine so the poll count was visible. No code change needed; just better visibility. First run is always slower due to schema inference caching.

Challenge 4: Glue Job Script Not Found at Runtime

What happened: Glue job failed immediately with Script not found at s3://raw-bucket/scripts/fleet_etl.py. Terraform created the job before the script was uploaded.

Fix: Added an aws s3 cp step to the GitHub Actions workflow that uploads glue/scripts/fleet_etl.py to S3 before terraform apply runs. The script now always exists when Terraform creates the Glue job resource.

Challenge 5: SSM Resource Data Sync β€” Empty S3 Bucket After Setup

What happened: Configured SSM Resource Data Sync pointing to the raw S3 bucket but the bucket remained empty for 20+ minutes.

Fix: SSM Resource Data Sync has an initial sync delay of up to 30 minutes. Also discovered the IAM role attached to the sync needed ssm:PutComplianceItems on the target instances. After fixing permissions and waiting, data appeared as expected in partition-style NDJSON files.

Challenge 6: ServiceNow RITM Ticket sys_id vs number

What happened: The webhook payload sent the ticket number (e.g. RITM0012345) but the ServiceNow REST API PATCH endpoint requires the sys_id (a GUID). The update call returned 404.

Fix: Added a lookup step in status_updater: first GET /api/now/table/sc_req_item?sysparm_query=number=RITM0012345 to resolve the sys_id, then PATCH using that sys_id. Updated the webhook payload schema to always include sys_id from ServiceNow to avoid the extra lookup in production.

Security β€” Controls at Every Layer

πŸ” Secrets Management

All credentials (ServiceNow password, HMAC secret) are stored in SSM Parameter Store as SecureString (KMS-encrypted). Lambda functions retrieve them at runtime β€” nothing in environment variables or source code.

πŸ”’ Webhook Authentication

Every inbound request is validated with HMAC-SHA256 before any processing begins. hmac.compare_digest() is used (not ==) to prevent timing attacks. Unsigned requests return 401 immediately.

πŸ“„ IAM Least Privilege

Each Lambda and Glue job has its own IAM role scoped to exactly the resources it needs. No wildcard * on resource ARNs. S3 policies restrict access by bucket prefix per service.

πŸ’Ύ S3 Encryption

Both raw and curated buckets enforce AES-256 server-side encryption by default. Public access is blocked at the bucket level. Object lifecycle policies expire raw data after 90 days.

What’s Missing for Production
  • VPC endpoints for S3, Glue, SSM β€” currently traffic routes over public AWS endpoints
  • WAF on the API Gateway to rate-limit and block malicious IPs
  • S3 Object Lock on the curated bucket for immutable audit trail
  • CloudTrail data events enabled on S3 buckets for full audit logging
  • Glue job bookmark enabled to avoid reprocessing already-seen data

Cost β€” What This Actually Costs

This pipeline is designed to run on-demand (triggered by ServiceNow, not continuously). Here’s the cost breakdown per pipeline run in dev, based on a fleet of ~50 EC2 instances.

ServiceUsage per runUnit priceCost per run
Glue Crawler2 DPU Γ— 10 min$0.44/DPU-hour$0.15
Glue ETL Job2 Γ— G.1X Γ— 5 min$0.44/DPU-hour$0.07
Step Functions~20 state transitions$0.025/1000<$0.001
Lambda (3 functions)~5 invocations total$0.20/1M requests<$0.001
Athena~1 MB scanned$5/TB<$0.001
API Gateway1 request$3.50/1M<$0.001
S3 Storage~100 MB (raw + curated)$0.023/GB/month<$0.001
Total per pipeline run: ~$0.22
Running twice daily for a month: ~$13.20/month. The Glue crawler and ETL job dominate cost β€” consolidating crawler runs and using job bookmarks to skip unchanged data would cut this by 40–60%.

Cost Optimisation Tips

  • Enable Glue job bookmarks to skip files already processed β€” reduces ETL run time significantly after the first run
  • Use G.025X worker type instead of G.1X for small fleets (<500 instances) β€” 4Γ— cheaper
  • Schedule SSM Resource Data Sync to run every 6 hours instead of continuously to reduce S3 write costs
  • Set S3 Intelligent-Tiering on the curated bucket after 30 days

Cleanup β€” Tearing It Down

Everything except SSM Resource Data Sync and S3 bucket contents is managed by Terraform and can be destroyed in one command.

Bash β€” Terraform Destroy
cd terraform/environments/dev
terraform destroy -var-file=terraform.tfvars

# Expected output:
# Destroy complete! Resources: 44 destroyed.
ResourceDestroyed by Terraform?Manual step needed?
S3 buckets (raw, curated, athena-results)Yes β€” if emptyEmpty buckets first: aws s3 rm s3://bucket --recursive
Glue Database + TablesYesNone
Glue Crawler + ETL JobYesNone
Step Functions State MachineYesNone
Lambda Functions (3)YesNone
API GatewayYesNone
IAM Roles + PoliciesYesNone
SSM Parameter Store (secrets)YesNone
SSM Resource Data SyncNo β€” not in TerraformDelete manually in SSM Console β†’ Fleet Manager β†’ Resource Data Sync
Empty S3 buckets before running destroy
Terraform will fail to delete non-empty S3 buckets. Run aws s3 rm s3://BUCKET-NAME --recursive for all three buckets before terraform destroy.

Takeaways β€” What I’d Do Differently

Six lessons from this build that apply beyond this specific project.

πŸ’‘ boto3 > Spark for S3 with Special Characters

If your S3 paths contain colons, spaces, or other URI-special characters (common with AWS service prefixes like AWS:ComplianceSummary), bypass Spark’s native S3 reader entirely. Use boto3 to list and download, then create a Spark DataFrame from the in-memory records. Cleaner and avoids a whole class of Java URI bugs.

πŸ”„ The Polling Loop is the Right Pattern

Step Functions Wait β†’ Check β†’ Choice loops are more reliable than hard-coded sleep timers. The pipeline self-adapts to actual job duration. Adding a maximum-iteration counter via a $.poll_count variable prevents infinite loops in production.

πŸ”’ Always Validate Webhooks

HMAC-SHA256 validation on inbound webhooks is non-negotiable. It took 20 extra lines of Lambda code and saved the pipeline from being triggerable by anyone who knows the API Gateway URL. Use hmac.compare_digest(), not ==, to prevent timing attacks.

🏒 Medallion Architecture Pays Off

Separating raw NDJSON (Bronze) from curated Parquet (Silver) from Athena analytics (Gold) makes each layer independently queryable and reprocessable. When the ETL schema changes, you can reprocess Bronze without re-running SSM sync. Worth the extra S3 bucket.

πŸ“‹ SSM Resource Data Sync Has a Learning Curve

The 30-minute initial sync delay is not documented prominently. The IAM requirements for the sync role are easy to under-scope. Plan for this in your timeline and add a CloudWatch alarm on the S3 bucket to confirm data is arriving before starting the pipeline.

πŸ“„ Ticket-Driven Pipelines Delight Operations Teams

Triggering the pipeline from a ServiceNow request and auto-closing the ticket with the Athena URL changed how the ops team experienced the tool. They raised a ticket and got results back in the ticket β€” no AWS console access needed. This is the pattern to replicate for every data pipeline that serves a non-technical audience.

Week 04 complete. A production-pattern fleet intelligence platform: SSM β†’ S3 β†’ Glue β†’ Athena, fully orchestrated by Step Functions, triggered by ServiceNow webhook, deployed with Terraform and GitHub Actions. Next week: adding scheduled compliance drift detection with EventBridge.

Comments

How was your experience?
Your feedback helps improve this site.
PoorExcellent