AWS SSM Patch Failed – No IMDS credentials – S3 Access Denied – Solution

Another impossible-to-debug AWS issue:

I ran an AWS Systems Manager Patch Manager RunPatchBaseline association on a managed instance. The instance uses a custom IAM role/instance profile, but I had read the documentation and added the proper permissions.

I ran RunPatchBaseline and got an Association Failed status on the instance with the custom role. The Run Command Output showed the following error:

No IMDS credentials found on instance.failed to run commands: exit status 156

I connected to the instance and verified that IMDS works fine, following the AWS documentation here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html

Further details under the Run Command Output showed a completely different error:

ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

Further up in the logs I found that the SSM agent was trying to download baseline_overrides.json from a dedicated S3 bucket operated by Amazon, with a name starting with aws-quicksetup-patchpolicy-ACCOUNTID-QUICKSETUPCONFIGID. My IAM profile, even though it is set up to allow access to this bucket and can list the baseline_overrides.json object, is blocked from downloading it.

Solution: by comparing the custom role to the Amazon-generated Quick Setup role, I figured out that I have to manually add a tag to the custom IAM role for this to work. S3 checks whether the ROLE carries a special tag before allowing access to the object. It is documented by AWS here: https://docs.aws.amazon.com/systems-manager/latest/userguide/quick-setup-patch-manager.html

“You must tag your IAM instance profile or IAM service role with the following key-value pair.
Key: QSConfigId-quick-setup-configuration-id, Value: quick-setup-configuration-id”
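For completeness, here is a minimal sketch of adding that tag with boto3 (the role name and the Quick Setup configuration ID are placeholders – take the ID from the aws-quicksetup-patchpolicy bucket name; the same tag can also be added in the IAM console or with the AWS CLI):

import boto3

iam = boto3.client("iam")

# Placeholders - use your own role name and the Quick Setup configuration ID.
role_name = "my-custom-instance-role"
quick_setup_config_id = "abcde"

iam.tag_role(
    RoleName=role_name,
    Tags=[{
        "Key": f"QSConfigId-{quick_setup_config_id}",
        "Value": quick_setup_config_id,
    }],
)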

Hope This Helps,

imre Fitos

j j j

Solution to BOTO3 S3 CreateMultipartUpload Access Denied problem

We ran into a weird problem when we tried to stream to an S3 file using boto3, and all the posts on Stack Overflow had wildly inaccurate and generally non-working solutions, so I’m posting this hoping that maybe it will save someone some time.

The problem: You write an S3 upload in Python, and it gives you the following error:

ValueError: the bucket 'XXX' does not exist, or is forbidden for access (ClientError('An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: Access Denied'))

The error clearly spells out that this is a permission problem, so you spend some time trying to add the proper permissions. You learn that there is no such thing as an s3:CreateMultipartUpload permission – multipart uploads are covered by the normal s3:PutObject permission. So you google some more.

Then you think it’s an ACL permission – nope.

Then you think maybe your encrypted S3 bucket is the problem and you need to add the kms:GenerateDataKey permission? But no, you use encryption with Amazon S3 managed keys (SSE-S3), and that does not require extra KMS permissions. Another dead end. How did it ever work for other people?

Then you throw every permission that exists at the user and it’s still failing. What gives?

You enable boto3 debug logs with boto3.set_stream_logger('') but the log looks okay, except that it gets a 403 access denied from Amazon.

Then your brilliant colleague Fatih Elmali reads the code and says that regardless of all the examples Amazon has published, the following is not enough:

client = boto3.client('s3', aws_access_key_id=...)

The proper way to set up authentication for a boto3 S3 client is the following:

session = boto3.Session(aws_access_key_id=...)
client = session.client('s3')

This will set up the proper session authentication and streaming to an S3 file object will work.
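Put together, a minimal sketch of the working pattern (bucket name, key and credentials are placeholders; upload_fileobj is just one way to stream a file-like object, not necessarily what your code uses):

import io
import boto3

# Credentials shown inline only for illustration; normally they come from the
# environment, a profile, or an instance role.
session = boto3.Session(
    aws_access_key_id="AKIA...",         # placeholder
    aws_secret_access_key="...",         # placeholder
    region_name="us-east-1",             # placeholder
)
client = session.client("s3")

# upload_fileobj switches to multipart uploads for larger objects, which is
# where the CreateMultipartUpload call comes from.
data = io.BytesIO(b"some streamed content")
client.upload_fileobj(data, "my-bucket", "path/to/object")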

j j j

Resource-based project scheduling and Monday.com

Resource-based project scheduling, or resource-constrained project scheduling, has been around for decades, but it is more common in the construction industry and not that well known in software development.

Most everybody is familiar with Time-based project scheduling – the work breakdown and the deadline specification happen first, and then the PM tries to get resources to meet the deadlines. The manual process of adjusting assignments so the available resources can actually meet those deadlines is called resource leveling.

Resource-based project scheduling puts the availability of resources first. Each resource, e.g. a software developer, has a calendar of availability, with the already allocated dates blocked out. The PM assigns the task to the resource, and the system allocates the first available time period automatically.

If a PM rearranges the order of tasks or entire projects, the system rearranges the date ranges allocated to the tasks. The PM can easily see whether a proposed deadline can be met with the available resources, or whether other projects need to be re-prioritized to ensure success.

This system makes managing project portfolios much easier – no chance of double booking, and if a task takes less effort than originally planned, the delivery dates of the remaining tasks automatically move earlier.

These features also make rebaselining and reallocation much simpler and faster, and make resource leveling completely automatic.


Monday.com is a popular online project management platform. While it is mostly Time-based, it has a few features that help with resource management.

The major component is creating a Dashboard and a Timeline Widget on the Dashboard. Connect all the project boards that you want to manage together, and the Timeline will automatically be populated with swimlanes per Person, showing all the tasks assigned to them from all the connected projects.

At this point the PM can move the tasks up and down on the timeline to make sure that people are not double booked or underused, and the changes are immediately implemented on all the connected project boards.

This ensures that all projects are properly resourced, and makes it easy to respond to changes in resource availability.

Another component is creating a project board dedicated to Vacation/PTO. On this board people can reserve their time off, and then this will be plainly visible on the main Timeline Widget, so their time off won’t be double-booked with work and they can meet their deadlines.

This means that you can go from this:

[screenshot: project portfolio before leveling]

To this, only by using a Dashboard Timeline Widget:

[screenshot: project portfolio after leveling]

Monday.com is quite popular and I hope they will implement even more features to help with managing project portfolios. There are many other solutions out there, but if your organization is committed to Monday.com, you can make it work. Let’s see what they do next!

j j j

AWS Chatbot custom message – solution

Most DevOps people who set up AWS Chatbot integrations with other AWS services eventually start wondering how to send custom messages through Chatbot.

At this point I would like to remind you that your life will be much easier if you give up on the idea and instead send your message directly to Slack using a webhook.

But if you want to see this through to the end:

Tom Stroobants documented the general SNS message format that Chatbot expects and it looks like this:

{
  "version": "0",
  "time": "1970-01-01T00:00:00Z",
  "id": "00000000-0000-0000-0000-000000000000",
  "account": "[your real account id]",
  "region": "[a real region]",
  "source": "aws.[a service prefix e.g. ec2]",
  "detail-type": "[you can use this field for your message]",
  "resources": [],
  "detail": {}
}

As long as these fields are present in the message, AWS Chatbot will forward it to Slack, but it will not display anything beyond the text in the “detail-type” field, and it shows that text twice.

To make AWS Chatbot deliver a more detailed message, one has to format it according to one of the AWS events that Chatbot supports, which means our messages will have to use a predefined “detail-type” and “source”.

To see examples of all the message formats that Chatbot can display, and to find one that we could co-opt for our purposes:

  1. Open the EventBridge console at https://console.aws.amazon.com/events/.
  2. In the navigation pane, choose Rules.
  3. Choose Create rule.
  4. Enter a name and description for the rule.
  5. For Define pattern, choose Rule with an event pattern.
  6. Hit Next.
  7. For Event source, leave it on AWS events.
  8. Now you can browse all available events under Sample Event / AWS events.

You will quickly notice that the event names are quite specific, and you might not want to use “VoiceId Batch Fraudster Registration Action” for your custom message.

I found that the “AWS Health Event” is innocent enough to be reusable, and now I am able to send free-form paragraphs using the following:

{
    "version": "0",
    "id": "00000000-0000-0000-0000-000000000000",
    "account": "[my AWS account number]",
    "time": "1970-01-01T00:00:00Z",
    "region": "us-east-1",
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "resources": [],
    "detail": {
      "eventDescription": [{
        "language": "en_US",
        "latestDescription": "Long form message\nMore lines"
      }]
    }
}
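To actually deliver it, publish this JSON as the body of an SNS message on the topic that your AWS Chatbot channel configuration is subscribed to. A minimal sketch with boto3 (the topic ARN and account ID are placeholders):

import json
import boto3

sns = boto3.client("sns")

# Placeholder - use the SNS topic your AWS Chatbot channel is subscribed to.
topic_arn = "arn:aws:sns:us-east-1:123456789012:chatbot-notifications"

message = {
    "version": "0",
    "id": "00000000-0000-0000-0000-000000000000",
    "account": "123456789012",            # your real account id
    "time": "1970-01-01T00:00:00Z",
    "region": "us-east-1",
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "resources": [],
    "detail": {
        "eventDescription": [{
            "language": "en_US",
            "latestDescription": "Long form message\nMore lines",
        }]
    },
}

# Chatbot reads the raw SNS message body, so the event JSON goes into Message.
sns.publish(TopicArn=topic_arn, Message=json.dumps(message))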

I hope somebody with good enough connections to the AWS Chatbot team will get more details out of them; right now their official line is “AWS Chatbot only supports AWS Services”. Help?

HTH, imre

j j j

AWS Force MFA example policy doesn’t work on Administrators – Fix

There are several example policies, written by Amazon itself and also by other security vendors like Yubico, that claim to enforce MFA use but simply do not work on users who have the AdministratorAccess policy.

Here is an example policy, also written by Amazon, that actually works: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_aws_my-sec-creds-self-manage.html

The reason it works is that it uses an explicit Deny, conditioned on aws:MultiFactorAuthPresent, for everything except credential self-management. Allow-only example policies implicitly restrict an ordinary user, but they cannot take anything away from a user who already has AdministratorAccess – only an explicit Deny can do that.

j j j

AWS SAM and the case of missing directories – Solution

I think serverless is the future, so I have been going through a Node.js Lambda tutorial to learn more about it. Amazon recommends SAM for provisioning Lambda functions and it’s pretty straightforward at first sight.

Later you learn that AWS SAM is opinionated, and it keeps its opinions to itself. It doesn’t tell you what it does or how it does it, there are no options to change its behavior, and there is not much debug output to show what went wrong.

sam build did not copy a directory into my Node package. The documentation says nothing about this behavior. The debug output shows no information about what gets copied. Googling led nowhere. Eventually I simply guessed that it reads my .gitignore file and skips everything that I want Git to ignore, and I was right. And I was grumpy.

TL;DR: If you have missing files or directories in your sam build package, look at your .gitignore.

j j j

LambdaAccessDenied error in AWS Load Balancer – Solution

Permission handling in ELB and Lambda is somewhat magical: some of the tools auto-provision permissions behind the scenes, and some of them sometimes mess up.

I had a Lambda that I was invoking from a load balancer and it simply did not work. The only hint was “LambdaAccessDenied” in the ALB logs.

I had everything configured correctly. I had added a Lambda permission allowing the entire elasticloadbalancing.amazonaws.com service to invoke my function. I had the proper target groups. I had even let AWS SAM auto-provision the IAM roles. The Lambda function was firing correctly; I had logs to show that it was executing.

But I kept getting “502 Bad Gateway” from the load balancer and the logs kept showing LambdaAccessDenied.

I removed all the custom stuff I created. I removed the alias. I removed and re-provisioned the entire Lambda function. I removed and recreated the target group.

Eventually I removed the target group and the permission I had created, and provisioned an “Application Load Balancer” trigger from the Lambda console. This created a new target group and a new resource-based policy under Permissions, and suddenly everything started working, even though the new entries looked exactly the same as the ones I had created.
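For reference, the resource-based policy that the console trigger creates corresponds roughly to the boto3 call below – a sketch with placeholder names and ARNs, not my exact setup:

import boto3

lam = boto3.client("lambda")

# Placeholder function name, statement id and target group ARN.
lam.add_permission(
    FunctionName="my-function",
    StatementId="alb-invoke",
    Action="lambda:InvokeFunction",
    Principal="elasticloadbalancing.amazonaws.com",
    SourceArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/0123456789abcdef",
)

One note: if the load balancer points at a Lambda alias, the permission has to be attached to that alias (the Qualifier parameter of add_permission) rather than to the unqualified function.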

Since there are only five entries on Google that even mention this error message, I figured you might want to save some time and learn from my experience.

j j j

How to backup and restore an Easy-RSA certificate authority

Easy-RSA is great, but the documentation doesn’t cover much about backup and restore, so this is a quick write-up on the topic.

If you want to back up your entire CA, save your easyrsa3/pki directory. You can simply restore this pki directory in a new install of easy-rsa and you will be back in business.

If you don’t want to back up your issued certificates – for example because you are only using your CA for VPN authentication, in which case you only need the certificate serials for revocation, and those are in pki/index.txt – then you only need to save the following four files:

pki/ca.crt
pki/private/ca.key
pki/issued/server.crt
pki/private/server.key

These files don’t ever change, so you don’t need to back them up frequently.

When you want to restore your easy-rsa install, you first have to create a skeleton pki directory with the easy-rsa init-pki command, then put the four files from above back in their previous places.

easy-rsa will still complain about other missing files and directories, but it doesn’t expect any data in those, so we can simply create empty files and directories to fix this:

touch easy-rsa/easyrsa3/pki/serial
touch easy-rsa/easyrsa3/pki/index.txt
touch easy-rsa/easyrsa3/pki/index.txt.attr
mkdir easy-rsa/easyrsa3/pki/certs_by_serial

So if you see errors like:

Easy-RSA error:

Missing expected CA file: serial (perhaps you need to run build-ca?)

Then run the empty file creation commands above.

If you have any questions, your best bet is to reach me on twitter at https://twitter.com/imreFitos

j j j

ELTE stunnel setup for Mac in 2021

ELTE is a great university but they don’t support Apple products well. If you are an ELTE student, use a Mac, and are trying to access ELTE resources from home during the lockdown, this is the tutorial you need.

You have to have a Caesar or IIG username and password for this to work.

Step 1: install the Homebrew package manager from https://brew.sh/

  • Click on Applications -> Utilities -> Terminal
  • Copy the following line into the Terminal window (this is one single line):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  • When it asks you for your password, enter your computer’s password.

Please note: this can take 10-20 minutes to complete.

Step 2: install the stunnel package using Homebrew

  • in the same Terminal window, type the following line:
brew install stunnel

Step 3: put the ELTE stunnel.conf file in the stunnel directory

The following 7 lines are the configuration for stunnel. You need to save this into a file on your computer called /usr/local/etc/stunnel/stunnel.conf

foreground = yes
socket = l:TCP_NODELAY=1
socket = r:TCP_NODELAY=1
[proxys]
accept = 8080
connect = proxy.elte.hu:8443
client = yes

Step 4: Start up stunnel

brew services start stunnel

This will make sure that stunnel will always be running on your computer, even after rebooting.

Step 5: Configure your computer to go through ELTE for web browsing

  • Go to Apple Icon -> System Preferences -> Network
  • Click on the “Advanced” button in the bottom right corner
  • Click on the “Proxies” tab on the top row
  • Select “Web Proxy (HTTP)”
  • Add 127.0.0.1 under Web Proxy Server
  • Add 8080 next to the 127.0.0.1 after the colon symbol
  • Enable the “Proxy server requires password” option
  • Enter your Caesar/IIG username and password
  • ALSO repeat this under “Secure Web Proxy (HTTPS)”

This is it! Your web browsers will start going through ELTE with all their traffic.

To test, start up a browser, and google the following phrase “what is my ip address”. If you did everything right, the IP address Google will report back will start with 157.181.

Step 6: Turn off the ELTE browser redirect when you don’t need it

The setup above will send all your web browsing through ELTE, including YouTube and Netflix traffic, so it will be slow for you and problematic for them. It’s better to turn it off when you don’t need it.

  • Go to Apple Icon -> System Preferences -> Network
  • Click on the “Advanced” button in the bottom right corner
  • Click on the “Proxies” tab on the top row
  • UNselect “Web Proxy (HTTP)”
  • UNselect “Secure Web Proxy (HTTPS)”

That’s it, you are all set.

imre

j j j

How to monitor and alert on the Sidekiq Retry Queue

Sidekiq is the most popular queue processing service for Ruby on Rails. It has many brilliant features; one of them is its automatic retry when a queued job fails, to account for intermittent problems.

The retry system is automatic: by default Sidekiq retries a job 25 times before putting it on the Dead Job Queue. The retry delay grows steeply with each attempt – by the 25th retry a job will have spent roughly three weeks in the Retry Queue!
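A quick back-of-the-envelope check of that three-week figure, sketched in Python, assuming Sidekiq’s documented default backoff of roughly (retry_count ** 4) + 15 seconds and ignoring the random jitter:

# Approximate total time a job spends in the Retry Queue across 25 retries.
total_seconds = sum((count ** 4) + 15 for count in range(25))
print(total_seconds / 86400)  # ~20 days, i.e. roughly three weeks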

Of course, generally everybody has an alert system for when jobs fail. But the Sidekiq retry logic works well and most errors are transient, so people grow complacent and start ignoring the messages about failed jobs.

This works well until it doesn’t. This was the point when I started looking into ways to properly monitor the Sidekiq Retry Queue.

I had the following questions:

  • How to alert on jobs that have failed too many times for comfort?
  • How to alert if a deluge of jobs fail?
  • How to make sure the alerts we send are actionable?
  • How to check if the alerting system is operational?

I took some time during Christmas and wrote a single-file Ruby app called sidekiq_retry_alert: https://github.com/imreFitos/sidekiq_retry_alert. It queries a Sidekiq server’s Retry Queue and sends alerts to a Slack channel when a single job keeps failing repeatedly; if it finds a lot of failing jobs, it tallies them up into easily readable Slack messages.

This is how it looks in Slack:

PRODUCTION ALARM: 2 NameOfTheImportantJobs on the Important queue have failed X+ times

The app remembers the previous state of the queue, so you only get messages when the queue’s state changes.

To check if the alerting system works, I wrote a second script that simply sends a daily report to the Slack channel. If you don’t see the daily report, chances are your alert system has stopped working.

This is how the daily report looks in Slack:

Daily report on production sidekiq retries:
ImportantQueue: 2 NameOfTheImportantJobs are retried

I recommend running them from cron.

I hope this helps!

imre

j j j