Updates from October, 2021

  • kmitov 8:57 am on October 8, 2021
    Tags: amazon-s3

    Sometimes you need automated tests on production

    In this article I am making the case that sometimes you just need to run automated tests against the real production and the real systems with real data for real users.

    The case

    We have a feature on one of our platforms:

    1. User clicks on “Export” for a “record”
    2. A job is scheduled. It generates a CSV file with information about the record and uploads it to S3. Then a presigned URL valid for 72 hours is generated and an email is sent to the user with a link to download the file.

    The question is how do you test this?

    Confidence

    When it comes to specs I like to develop automated specs that give me the confidence that I deliver quality software. I am not particularly religious about what kind of spec it is, as long as it gives me confidence and does not stand in my way by being too fragile.

    Sometimes these specs are model/unit specs, many times they are system/feature/integration specs, but there are cases where you just need to run a test on production against the production db, production S3, production env, production user, production everything.

    Going with a system/integration spec

    A spec that would give me confidence here simulates the user behavior with Rails system specs.
    The user clicks on “Export” and I check that we’ve received an email and that this email contains a link.

      scenario "create an export, uploads it on s3 and send an email" do
        # Set up the record
        user = FactoryBot.create(:user)
        record = FactoryBot.create(:record)
        ... 
    
        # Start the spec
        login_as user
        visit "/records"
        click_on "Export"
        expect(page).to have_text "Export successfully scheduled. You will receive an email with a link soon."
    
        mail_html_content = ActionMailer::Base.deliveries.select{|email| email.subject == "Successful export"}.last.html_part.to_s
        expect(mail_html_content).to have_xpath "//a[text()='#{export_name}']"
        link_to_exported_zip = Nokogiri::HTML(mail_html_content).xpath("//a[text()='#{export_name}']").attribute("href").value
    
        csv_content = read_csv_in_zip_given_my_link link_to_exported_zip 
        expect(csv_content).not_to be_nil
        expect(csv_content).to include user.username
      end

    This spec does not work!

    First problem – AWS was stubbed

    We have a lot of other specs that use the S3 API, and in all of them S3 is stubbed. That is a good practice, as you don’t want all your specs to touch S3 for real. It is slow and it is too coupled. But for this spec the stub hid a problem. There was a file uploaded on S3, but the file was empty. The reason was that on one of the machines running the specs there was no ‘zip’ command. It was not installed, and we are using ‘zip’ to create a zip of the CSV files.

    Because of this I wanted to upload an actual file somehow and actually check what is in the file.

    I created a spec filter that would start a specific spec with real S3.

    # spec/rails_helper.rb
    RSpec.configure do |config|
      config.before(:each) do
        # Stub S3 for all specs
        Aws.config[:s3] = {
          stub_responses: true
        }
      end
    
      config.before(:each, s3_stub_responses: false) do
        # but for some specs, those that have "s3_stub_responses: false" tag do not stub s3 and call the real s3.
        Aws.config[:s3] = {
          stub_responses: false
        }
      end
    end

    This allows us to start the spec

      scenario "create an export, uploads it on s3 and send an email", s3_stub_responses: false do
        # Now in this spec S3 is not stubbed and we upload the file
      end

    Yes, we could create a local s3 server, but then the second problem comes.

    Second problem – the mailer was adding invalid params

    In the email we are sending a presigned URL to the S3 file, as the file is not public.
    But the mailer that we were using was adding “utm_campaign=…” to the URL params.
    This means that the S3 presigned URL in the email was no longer valid. Checking that there is a URL in the email was simply not enough. We had to actually download the file from S3 to make sure the link is correct.
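
    As far as I understand, the signature of a presigned URL covers its query string, which is why an appended tracking parameter breaks the link. Downloading the file remains the real check, but a cheaper assertion can catch this specific mistake earlier. The sketch below is only an illustration: the helper name params_added_to and the sample URL are assumptions, not code from our platform.

```ruby
require "uri"

# Hypothetical helper: returns the names of query parameters that are present
# in the URL found in the email but were not part of the original presigned URL.
def params_added_to(original_url, url_in_email)
  original = URI.decode_www_form(URI.parse(original_url).query.to_s).map(&:first)
  received = URI.decode_www_form(URI.parse(url_in_email).query.to_s).map(&:first)
  received - original
end

# Made-up presigned URL; 259200 seconds is 72 hours.
presigned = "https://the-bucket.s3.amazonaws.com/export.zip?X-Amz-Expires=259200&X-Amz-Signature=abc"
in_email  = presigned + "&utm_campaign=export"

params_added_to(presigned, in_email)  # => ["utm_campaign"]
params_added_to(presigned, presigned) # => []
```

    In a spec this could become an expectation that the list is empty, next to the download check.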

    This was still not enough.

    It is still not working on production

    All the tests were passing with real S3 and real mailer in test and development env, but when I went to production the feature was not working.

    The problem was with the configuration. In order to upload to S3 we need to know the bucket. The bucket was configured for test and development but was missing for production.

    config/environments/development.rb:  config.aws_bucket = 'the-bucket'
    config/environments/test.rb:  config.aws_bucket = 'the-bucket'
    config/environments/production.rb: # there was no config.aws_bucket

    The only way I could make sure that the configuration in production is correct and that the bucket is set up correctly is to run the spec on a real production.
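
    A complementary safety net is a fail-fast check that reports required settings that are missing. This is a minimal sketch in plain Ruby: only the setting name aws_bucket comes from the configuration above, while REQUIRED_SETTINGS, missing_settings and the hashes are hypothetical.

```ruby
# Hypothetical list of settings every environment must define.
REQUIRED_SETTINGS = [:aws_bucket].freeze

# Returns the required settings that are missing or blank in the given config.
def missing_settings(config, required = REQUIRED_SETTINGS)
  required.select { |name| config[name].to_s.strip.empty? }
end

# Illustrative stand-ins for the per-environment configuration.
development = { aws_bucket: "the-bucket" }
production  = {}

missing_settings(development) # => []
missing_settings(production)  # => [:aws_bucket]
```

    A check like this could run when the application boots and raise if the list is not empty. It would have caught the missing bucket name, but it says nothing about bucket permissions or about whether the bucket is reachable, so it does not replace the spec that runs against production.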

    Should we run all specs on a real production?

    Of course not. But there should be a few specs for a few features that test that the buckets have the right permissions, that they are accessible and that the configuration in production is right. This is what I’ve added. Once a day a spec runs against production and tests that everything works there with real S3, real db, real env and configuration, the same way that users will use the feature.

    How is this part of the CI/CD?

    It is not. We do not run this spec before deploy. We run all the other specs before deploy, and they give us 99% confidence that everything works. But for the remaining one percent we run a spec once every day (or after deploy) just to check a real, complex scenario involving the communication between different systems.

    It pays off.

     
  • kmitov 7:33 am on October 8, 2021
    Tags: hotwire, turbo

    [Rails, Hotwire] Migrate to Turbo from rails-ujs and turbolinks – how it went for us. 

    We recently decided to migrate one of our newest platforms to Turbo. The goal of this article is to help anyone who plans to do the same migration. I hope it gives you a perspective on the amount of work required. Generally it was easy and straightforward, but a few specs had to be changed because of URLs and controller results.

    Gemfile

    Remove turbolinks and add turbo-rails. The change was

    --- a/Gemfile.lock
    +++ b/Gemfile.lock
    @@ -227,9 +227,8 @@ GEM
         switch_user (1.5.4)
         thor (1.1.0)
         tilt (2.0.10)
    -    turbolinks (5.2.1)
    -      turbolinks-source (~> 5.2)
    -    turbolinks-source (5.2.0)
    +    turbo-rails (0.7.8)
    +      rails (>= 6.0.0)

    application.js and no more rails-ujs and Turbolinks

    Added “@hotwired/turbo-rails” and removed Rails.start() and Turbolinks.start()

    --- a/app/javascript/packs/application.js
    +++ b/app/javascript/packs/application.js
    @@ -3,8 +3,7 @@
     // a relevant structure within app/javascript and only use these pack files to reference
     // that code so it'll be compiled.
    
    -import Rails from "@rails/ujs"
    -import Turbolinks from "turbolinks"
    +import "@hotwired/turbo-rails"
     import * as ActiveStorage from "@rails/activestorage"
     import "channels"
    
    @@ -14,8 +13,6 @@ import "channels"
     // Collapse - needed for navbar
     import { Collapse } from 'bootstrap';
    
    -Rails.start()
    -Turbolinks.start()
     ActiveStorage.start()

    package.json

    The change was small

    --- a/package.json
    +++ b/package.json
    @@ -2,10 +2,10 @@
       "name": "platform",
       "private": true,
       "dependencies": {
    +    "@hotwired/turbo-rails": "^7.0.0-rc.3",
         "@popperjs/core": "^2.9.2",
         "@rails/actioncable": "^6.0.0",
         "@rails/activestorage": "^6.0.0",
    -    "@rails/ujs": "^6.0.0",
         "@rails/webpacker": "5.4.0",
         "bootstrap": "^5.0.2",
         "stimulus": "^2.0.0",

    Devise still does not work

    For the Devise forms you have to add data: {turbo: "false"} to disable Turbo for them

    +<%= form_for(resource, as: resource_name, url: password_path(resource_name), html: { method: :post }, data: {turbo: "false"}) do |f| %>

    We are waiting for a resolution on https://github.com/heartcombo/devise/pull/5340

    Controllers have to return an unprocessable_entity on form errors

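    If the record has validation errors, the controller must now render the form with status: :unprocessable_entity. Turbo expects a form submission to end in either a redirect or an error status, so a plain 200 render of the form is not displayed.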

    +++ b/app/controllers/records_controller.rb
    @@ -14,7 +14,7 @@ class RecordsController < ApplicationController
         if @record.save
           redirect_to edit_record_path(@record)
         else
    -      render :new
    +      render :new, status: :unprocessable_entity
         end
       end

    application.js was reduced significantly

    The old application.js – 932 KiB

      application (932 KiB)
          js/application-dce2ae8c3797246e3c4b.js

    The new application.js – 248 KiB

    remote:        Assets: 
    remote:          js/application-b52f4ecd1b3d48f2f393.js (248 KiB)

    Conclusion

    Overall a good experience. We are still facing some minor issues with third party chat widgets like tawk.to that do not work well with turbo, as they are sending 1 more request, refreshing the page and adding the widget to an iframe that is lost with turbo navigation. But we would probably move away from tawk.to.

     
  • kmitov 6:29 am on October 8, 2021

    [Rails] Warden.test_reset! does not always reset and the user is still logged in 

    We had this strange case of a spec that was randomly failing

      scenario "generate a subscribe link for not logged in users", js: true do
        visit "/page_url"
    
        expect(page).to have_xpath "//a[text()='Subscribe']"
        click_link "Subscribe"
        ...
      end 

    When a user is logged in we generate a button that subscribes them immediately. But when a user is not logged in we generate a link that will direct the users to the subscription page for them to learn more about the subscription.

    This works well, but the spec fails randomly.

    We expect there to be a link, e.g. “//a”, but on the page there is actually a button, e.g. “//button”.

    What this means is that when the spec started there was a logged in user. The user had not yet been logged out from the previous spec.
    This explains why the spec fails only sometimes: we are running all the specs in a random order, so it fails only when it runs right after a spec that logs a user in.

    $ RAILS_ENV=test rake spec SPEC='...' SPEC_OPTS='--order random'

    Warden.test_reset! is not always working

    There is a Warden.test_reset! that is supposed to reset the session, but it seems for js: true cases where we have a Selenium driver the user is not always reset before the next test starts.

    # spec/rails_helper.rb
    RSpec.configure do |config|
      ...
      config.after(:each, type: :system) do
        Warden.test_reset!
      end
    end

    Logout before each system spec that is js: true

    I decided to try to explicitly log out before each js: true spec that is ‘system’, so I extended the RSpec configuration

    RSpec.configure do |config|
      config.before(:each, type: :system, js: true) do
        logout # NOTE: Sometimes when we have a js spec the user is still logged in from the previous one.
        # Here I am logging the user out explicitly. For js specs it seems Warden.test_reset! is not enough.
        #
        # Running the specs without this logout:
        # Finished in 3 minutes 53.7 seconds (files took 28.79 seconds to load)
        #   383 examples, 0 failures, 2 pending
        #
        # Running the specs with this logout:
        #
        # Finished in 3 minutes 34.2 seconds (files took 21.15 seconds to load)
        #   383 examples, 0 failures, 2 pending
        #
        # Randomized with seed 26106
        #
        # So we should not be losing on performance
      end
    end

    Conclusion

    Warden.test_reset! does not always log out the user successfully before the next spec when the specs run with the Selenium driver, e.g. “js: true”. I don’t know why, but that is the observed behavior.
    I’ve added a call to “logout” before each system spec that is js: true to make sure the user is logged out.

     
  • kmitov 5:48 am on September 6, 2021

    Refresh while waiting with RSpec+Capybara in a Rails project 

    This is some serious advanced stuff here. You should share it.

    A colleague, looking at the git logs

    I recently had to create a spec with Capybara+RSpec where I refresh the page and wait for a value to appear on this page. In this particular scenario there is no need for WebSockets or any JS. We just need to refresh the page.

    But how do we test it?

    # Expect that the new records page will show the correct value of the record
    # We must do this in a loop as we are constantly refreshing the page.
    # We need to stay here and refresh the page
    # 
    # Use Timeout.timeout to stop the execution after Capybara.default_max_wait_time
    Timeout.timeout(Capybara.default_max_wait_time) do
      loop do
        # Visit the page. If you visit the same page a second time
        # it will refresh the page.
        visit "/records"
        # The smart thing here is the wait: 0 param
        # By default find_all keeps retrying for up to Capybara.default_max_wait_time until a matching
        # element appears, which is what you want when JS is still changing the page. But there is no JS
        # here and we want to check the page as is.
        #
        # We pass "wait: 0", which checks once and returns immediately
        break if find_all(:xpath, "//a[@href='/records/#{record.to_param}' and text()='Continue']", wait: 0).any?
    
        # If we could not find our record we sleep for 0.25 seconds and try again.
        sleep 0.25
      end
    end
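
    The same idea can be pulled out into a small reusable helper. This is a sketch, not what we have in the project: poll_until is a hypothetical name, and instead of Timeout.timeout it compares a monotonic clock against a deadline between attempts.

```ruby
# Hypothetical helper: calls the block repeatedly until it returns a truthy
# value or the timeout expires. Returns the block's value, raises on timeout.
def poll_until(timeout:, interval: 0.25)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end

    sleep interval
  end
end

# Possible use in the spec above (needs Capybara, so it is left as a comment):
#
#   poll_until(timeout: Capybara.default_max_wait_time) do
#     visit "/records"
#     find_all(:xpath, "//a[text()='Continue']", wait: 0).any?
#   end
```

    The trade-off compared to Timeout.timeout is that this version never interrupts a running attempt. It only decides between attempts whether to try again, so a single slow visit can push the total time past the timeout.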

    I hope it is helpful.

    Want to keep in touch? – find me on LinkedIn or Twitter.

     
  • kmitov 10:49 am on September 3, 2021
    Tags: aws, cloudflare, nginx

    When the policeman becomes the criminal – how Cloudflare attacks my machines. 

    On the Internet you are nobody until someone attacks you.

    It gets even more interesting when the attack comes from someone with practically unlimited resources and when these are the same people that are supposed to protect you.

    This article is the story of how Cloudflare started an “attack” on a machine at the FLLCasts platform. This increased the traffic of the machine about 10x and AWS started charging the account 10x more. I managed to stop them and I hope my experience is useful for all CTOs, sysadmins, devops and others that would like to understand more and look out for such cases.

    TL;DR

    The current, up to date status is this: after all the investigation it turns out that when a client makes a HEAD request for a file, the request hits the Cloudflare infrastructure. Cloudflare then sends a GET request to the account machine and caches the file. This changed on 28 August. Before 28 August, when clients were sending HEAD requests, Cloudflare was sending HEAD requests (that don’t generate traffic). After 28 August clients are still sending HEAD requests, but now Cloudflare is sending GET requests, generating terabytes of additional traffic that is not needed.

    Increase of the Bill

    On 28 of August 2021 I got a notification from AWS that the account is close to surpassing its budget for the month. This is not surprising as it was the end of the month, but nevertheless I decided to check. It seems that the traffic to one of the machines has increased 10x in a day. Nothing else has increased. No visits, no other resources, just the traffic to this one particular machine. 
    That was strange. This has been going on for 7 days now and this is the increase of the traffic.

    AWS increase of the bill

    Limit billing on AWS

    First thought was “How can I set a global limit to AWS spending for this account? I don’t want to wake up with $50K in traffic charges the next day.”

    The answer is “You can’t”. There is no way to set a global spending limit for an AWS account. This was something I already knew, but I decided to check again with support, and yes, you can’t set such a limit. This means that AWS is providing all the tools for a third party to bankrupt you and they are not willing to limit it.

    Limit billing on Digital Ocean

    I have some machines on Digital Ocean and I checked there. “Can I set a global spending limit for my account where I will no longer be charged and all my services will stop if my spending is above X amount of dollars?”.
    The answer was again – “No. Digital Ocean does not provide it”.

    Should there be a global limit on spending on cloud providers?

    My understanding is – yes. There is a break even point where users are coming to your service and generating revenue and you are delivering the service and this is costing you money. Once it costs you more to deliver the service than the revenue that the service is generating, I would personally prefer to stop the service. No need for it to be running. Otherwise you could wake up with a $50K bill.

    AWS monitoring

    I had the bill from AWS so I tried looking at the monitoring.
    There is a spike every day between 03:00 AM UTC and 05:00 AM UTC. This spike is increasing the traffic by hundreds of gigabytes. It could easily be terabytes next time.
    The conclusion is that the machine is heavily loaded during this time.

    AWS monitoring

    Nginx access.log

    Looking at the access log I see that there are a lot of requests by machines that are using a user agent called ‘curl’. ‘curl’ is a popular tool for accessing files over HTTP and is heavily used by different bots. But bots tend to identify themselves.

    This is what the access.log looks like:

    172.68.65.227 - - [30/Aug/2021:03:26:02 +0000] "GET /f9a13214d1d16a7fb2ebc0dce9ee496e/file1.webm HTTP/1.1" 200 27755976 "-" "curl/7.58.0"

    Parsing the log file

    I have years of bash experience, and a couple of commands later I have a list of all the IPs and how many requests we’ve received from each of them.

    grep curl access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -n

    The result is 547 machines. The full log file is available at – Full list of Cloudflare IPs attacking my machine. The top 20 are listed below (some of the IPs are not from Cloudflare). The first column is the number of requests, the second is the IP of the machine.

    NumberOfRequests IP
        113 172.69.63.18
        117 172.68.65.107
        150 172.70.42.135
        158 172.70.42.161
        164 172.69.63.82
        167 172.70.42.129
        169 172.69.63.116
        170 172.68.65.231
        173 172.68.65.101
        178 172.69.63.16
        178 172.70.42.143
        188 173.245.54.236
        264 172.70.134.69
        268 172.70.134.117
        269 172.70.134.45
        287 172.70.134.153
        844 172.70.34.131
        866 172.70.34.19
        904 172.70.34.61
        912 172.70.34.69
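
    For anyone more comfortable with Ruby than with bash, the counting done by the pipeline above can be sketched like this. The function name curl_requests_per_ip is hypothetical, the sample lines are illustrative only, and Array#tally needs Ruby 2.7 or newer.

```ruby
# Counts requests per client IP for log lines that mention curl and returns
# [ip, count] pairs sorted by ascending count, like `sort | uniq -c | sort -n`.
def curl_requests_per_ip(lines)
  lines
    .select { |line| line.include?("curl") }
    .map { |line| line.split(" ").first }
    .tally
    .sort_by { |_ip, count| count }
end

# Illustrative sample lines in the access.log format shown earlier.
sample = [
  '172.68.65.227 - - [30/Aug/2021:03:26:02 +0000] "GET /a/file1.webm HTTP/1.1" 200 27755976 "-" "curl/7.58.0"',
  '172.68.65.227 - - [30/Aug/2021:03:26:05 +0000] "GET /a/file2.webm HTTP/1.1" 200 1024 "-" "curl/7.58.0"',
  '172.70.34.69 - - [30/Aug/2021:03:26:07 +0000] "GET /a/file3.webm HTTP/1.1" 200 2048 "-" "curl/7.58.0"',
  '10.0.0.1 - - [30/Aug/2021:03:26:09 +0000] "GET /a/file1.webm HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
]

curl_requests_per_ip(sample)
# => [["172.70.34.69", 1], ["172.68.65.227", 2]]
```

    On the real file it could be called as curl_requests_per_ip(File.foreach("access.log")).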

    These are Cloudflare machines!

    Looking at the machines that are making the requests these are 547 different machines, most of which are Cloudflare machines. These are servers that Cloudflare seems to be running that are making the request.
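
    One way to check whether an address belongs to Cloudflare is to compare it against the IP ranges that Cloudflare publishes. The sketch below uses Ruby's IPAddr. The three ranges are, to my knowledge, part of that published list, but treat them as an assumption and take the current list from Cloudflare before relying on the result. The name cloudflare_ip? is hypothetical.

```ruby
require "ipaddr"

# Assumption: a subset of the IPv4 ranges published by Cloudflare.
# Verify against the current official list before using this for real.
CLOUDFLARE_RANGES = %w[
  162.158.0.0/15
  172.64.0.0/13
  173.245.48.0/20
].map { |cidr| IPAddr.new(cidr) }.freeze

# Returns true if the given IPv4 address falls inside one of the ranges.
def cloudflare_ip?(ip)
  address = IPAddr.new(ip)
  CLOUDFLARE_RANGES.any? { |range| range.include?(address) }
end

cloudflare_ip?("172.70.34.69")   # => true
cloudflare_ip?("173.245.54.236") # => true
cloudflare_ip?("8.8.8.8")        # => false
```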

    How does Cloudflare work?

    For this particular FLLCasts account with this particular machine, years ago I set up Cloudflare to sit in front of the machine to help protect the account from internet attacks.

    The way Cloudflare works is that only Cloudflare knows the IP address of our machine. This is the promise that Cloudflare is making: only they know which IP address stands behind a given domain. When a user points their browser at “http://domainname”, the internet directs this request to Cloudflare, Cloudflare checks if the request is ok, and then and only then forwards it to our machine. In the meantime Cloudflare is trying to help businesses like the platform by caching the content. This means that when Cloudflare receives a request for a file, they check in their own infrastructure if this file is cached and send a request to the account machine only if it is not.

    In a nutshell Cloudflare maintains a cache for the content the platform is delivering.

    Image is from Cloudflare support at https://support.cloudflare.com/hc/en-us/articles/205177068-How-does-Cloudflare-work-

    What is broken?

    Cloudflare maintains a cache of the platform resources. Every night between 03:00 AM UTC and 05:00 AM UTC some 547 Cloudflare machines decide to update their cache and they start sending requests to our server. These are 10x more requests than the machine generally receives from all users. The content on the server does not change. It’s been the same content for years. But for the last 7 days Cloudflare has been caching the same content every night on 547 machines.

    And AWS bills us for this.

    Can Cloudflare help?

    I created a ticket. The response was along the lines of “You are not subscribed for our support, you can get only community support”. Fine.
    I called them on the phone early in the morning.
    I called enterprise sales and I asked them.

    Me - "Hi, I am under attack. Can you help?"
    They - "Yes, we can help. Who is attacking you?"
    Me - "Well, you are. Is there an enterprise package I could buy so that you can protect me against your attack?"

    Luckily the guy on the phone caught my sense of humor and urgency and quickly organized a meeting with a product representative. Regrettably there were no solution engineers on this call.

    Both guys were very knowledgeable, but I had difficulties explaining that it was actually Cloudflare causing the traffic increase. I had all the data from AWS, from the access.log files, but the support agents still had some difficulty accepting it.

    To be clear – I don’t think that Cloudflare is maliciously causing this. There is no point. What I think has happened is some misconfiguration on their side that caused this for the last 7 days.

    What I think has happened?

    I tried to explain to the support agents that there are four scenarios, and that Cloudflare is responsible for the first three.

    1. Option 1 – “someone that has 547 machines is trying to attack the FLLCasts account and Cloudflare is failing to stop it”. First, this is very unlikely. Nobody will invest in starting 547 machines just to make the platform pay a few dollars more this month. And even if this is the case, this is what Cloudflare should actually prevent, right? Option 1: “Cloudflare is failing in preventing attacks” (unlikely)

    2. Option 2 – “only Cloudflare knows the IP of this domain name and they have been compromised”. The connection between the domain name and the IP address is something that only Cloudflare knows about. If a third party knows the domain name and is able to find the IP address, this means that Cloudflare has been compromised. Option 2: “Cloudflare is compromised” (possible, but again, unlikely)

    3. Option 3 – “there is a misconfiguration in some of the Cloudflare servers”. I don’t like looking for malicious activity where everything could be explained with simple ignorance or a mistake. Most likely there is a misconfiguration in the Cloudflare infrastructure that is causing these servers to behave in this way. Option 3: “There is a misconfiguration in Cloudflare infrastructure”

    4. Option 4 – “there is a mistake on our end”. As there basically is nothing on our end and this nothing has not changed in years, the possibility for this to be the case is minimal.

    On a support call we set a plan with the support agents to investigate it. I will change the public IP of the AWS machine and will reconfigure it on Cloudflare. In this way we hope to stop some of the requests. We have no plan for what to do after that.

    Can I block it at the Nginx level?

    Nginx is an HTTP server, serving files. There are a couple of options to explore there, but the most reasonable was to stop all curl requests to the Nginx server. This was the shortest path. There was no need to protect against other attacks, there was only the need to protect against Cloudflare attacks. The Cloudflare attack was using “curl” as a tool. I decided to stop ‘curl’.

      # Surely not the best, but the simplest and will get the job done for now.
      if ($http_user_agent ~ 'curl') {
          return 444; # 444 is a non-standard nginx code that closes the connection without sending a response.
      }

    Resolution

    I am now waiting to see if the change of the public IP of the AWS machine will have any impact and if not I am just rejecting all “curl” requests that seem to be what Cloudflare is using.

    Update 1

    The first solution that we decided to implement was to change the public IP of the AWS machine and update it in the DNS settings at Cloudflare. In this way we would make sure that only Cloudflare really knows this IP.

    Resolution is – It did not work!

    I knew it wouldn’t, because it was a way for support to have me do something without really looking into the issue, but I went along with it. Better to exhaust this option and be sure.

    The traffic of a Cloudflare attacked machine. Changing the IP address on 3 September had no effect.

    Update 2

    Adding CF-Connecting-IP header

    Cloudflare support was really helpful. They asked me to include CF-Connecting-IP in the logs. In this way we would know the real IP that is making the requests and whether these are in fact Cloudflare machines.

    The header is described at https://support.cloudflare.com/hc/en-us/articles/200170986-How-does-Cloudflare-handle-HTTP-Request-headers-

    I went on and updated the Nginx configuration

    log_format  cloudflare_debug     '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for" "$http_cf_connecting_ip"';
    
    access_log /var/log/nginx/access.log cloudflare_debug;
    

    Now the log file contained the original IP.
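
    With the extra field in place, each log line can be split into the Cloudflare edge IP, the request method and the IP of the original client. The regular expression below is a sketch written against the cloudflare_debug format above. LOG_LINE and parse_log_line are hypothetical names, and the sample line is the one quoted in the “Commands to reproduce” section.

```ruby
# Matches one line written with the cloudflare_debug log format.
LOG_LINE = /\A(?<remote_addr>\S+) - (?<remote_user>\S+) \[(?<time_local>[^\]]+)\] "(?<method>\S+) (?<path>\S+) (?<protocol>[^"]+)" (?<status>\d{3}) (?<bytes>\d+) "(?<referer>[^"]*)" "(?<user_agent>[^"]*)" "(?<forwarded_for>[^"]*)" "(?<connecting_ip>[^"]*)"\z/

# Returns a hash with the fields relevant to this investigation,
# or nil when the line does not match the format.
def parse_log_line(line)
  match = LOG_LINE.match(line.chomp)
  return nil unless match

  {
    edge_ip: match[:remote_addr],     # the machine that connected to nginx
    method: match[:method],
    path: match[:path],
    status: match[:status].to_i,
    bytes: match[:bytes].to_i,
    client_ip: match[:connecting_ip]  # value of the CF-Connecting-IP header
  }
end

line = '162.158.94.236 - - [04/Sep/2021:07:09:12 +0000] "GET /file1.webm HTTP/1.1" 200 2256504 "-" "curl/7.68.0" "188.254.161.195" "188.254.161.195"'

entry = parse_log_line(line)
entry[:method]    # => "GET"
entry[:bytes]     # => 2256504
entry[:client_ip] # => "188.254.161.195"
```

    Summing :bytes over the entries where :method is "GET" and the user agent is curl would give a rough figure for the traffic caused by these requests.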

    Cloudflare is making GET when client makes a HEAD request

    This is what I found out. The platform has a daily job that checks the machine and makes sure the files are ok. This integrity check was left there from the times when we had to do it, years ago. It is still running and starts every night, checking the machine with HEAD requests. But Cloudflare started making GET requests on 28 August 2021, and this increased the traffic to the machine.

    Steps to reproduce

    Here are the steps to reproduce:

    1. I am sending a HEAD request with ‘curl -I’

    2. Cloudflare has not cached the file so there is “cf-cache-status: MISS”

    3. Cloudflare sends a GET request and gets the whole file

    4. Cloudflare responds to the HEAD request.

    5. I send a HEAD request again with ‘curl -I’

    6. Cloudflare has the file cached and there is a “cf-cache-status: HIT”

    7. The account server is not hit.

    The problem here is that I am sending a HEAD request for my file and Cloudflare is sending a GET request for the whole file in order to cache it.

    Commands to reproduce

    This is a HEAD request:

    $ curl -I https://domain.com/file1.webm
    HTTP/2 200
    date: Sat, 04 Sep 2021 07:09:11 GMT
    content-type: video/webm
    content-length: 2256504
    last-modified: Sat, 04 Jan 2014 14:24:01 GMT
    etag: "52c81981-226e78"
    cache-control: max-age=14400
    cf-cache-status: MISS
    accept-ranges: bytes
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=Xg9TLgssa5Gm6j1fRlJZH8VahaoY21LdCE1W1JqVueu49mzdiTmh9MZp4pFZDsVeSmRg%2Bc%2FMryoN7tgmKUmdxhWzE7UZdVvgG%2FRxHSZ%2FYS6pDtxLwpXSD71jo5ADNyT4TSpKXtE%3D"}],"group":"cf-nel","max_age":604800}
    nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
    server: cloudflare
    cf-ray: 689564111e594ee0-FRA
    alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

    This is the log right after the HEAD request. Note that I am sending a HEAD request to domain.com and Cloudflare is sending a GET request for the file.

    162.158.94.236 - - [04/Sep/2021:07:09:12 +0000] "GET /file1.webm HTTP/1.1" 200 2256504 "-" "curl/7.68.0" "188.254.161.195" "188.254.161.195"

    Then I send a second HEAD request

    $ curl -I https://domain.com/file1.webm
    HTTP/2 200
    date: Sat, 04 Sep 2021 07:09:53 GMT
    content-type: video/webm
    content-length: 2256504
    last-modified: Sat, 04 Jan 2014 14:24:01 GMT
    etag: "52c81981-226e78"
    cache-control: max-age=14400
    cf-cache-status: HIT
    age: 42
    accept-ranges: bytes
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=CKSvpGGHoj5LfV6xXpPUK5kHJtdsX3fylgt%2F2%2B6G94oUsdAd8FnHmUgEUIgnj5dd2Vvsv%2BKQxxgsHdHA0RvpjTxATakFKFuirMeI%2FS3lAdDX5VA0tY74z0CRYEHM2rS%2Fld6K738%3D"}],"group":"cf-nel","max_age":604800}
    nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
    server: cloudflare
    cf-ray: 689565175dffc29f-FRA
    alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

    And then there is NOTHING in the log file

    Note that for the last HEAD request there is a “cf-cache-status: HIT”.

    Status and how it could be resolved?

    Yes, we are doing HEAD requests every day to the files in order to check that they are all working. Every day we send a HEAD request for every file to make sure all files are up to date. This has been going on for years and is a leftover of an integrity check we implemented in 2015.

    What changed on 28 August 2021 is that when Cloudflare receives a HEAD request for a file it now sends a GET request to our machine in order to cache the file. This is what has changed and this is generating all the traffic.

    We send the HEAD requests with ‘curl -I’.

    I have 30 weeks of log files that show that Cloudflare was sending HEAD requests like

    I have asked Cloudflare

    Could you please rollback this change in the infrastructure and do not send a GET request to our machine when you receive a HEAD request from a client?

    Let’s see how this will be resolved.

    Up to date conclusion

    Check your machines from time to time. I hope you don’t get in this situation.

    Want to keep in touch? – find me on LinkedIn or Twitter

     
  • kmitov 12:34 pm on August 17, 2021
    Tags: admin, microsoft   

    GoDaddy+Microsoft 365 and how an email was compromised for about a day 

    In two hours I have a C-suite meeting and one of the topics would be our internal stack and whether we stay with Microsoft+GoDaddy or we migrate.

    This article is my objective summary of:

    1. How Microsoft+GoDaddy kept an email account compromised for more than a day
    2. What is difficult with the stack of Microsoft+GoDaddy
    3. Why can’t we just migrate to Microsoft without GoDaddy
    4. Why would I like to stop using Microsoft

    I hope other companies that have found themselves in this situation will be able to make the right decision given my experience.

    Note: This article is as of 2021-08-17. Things may change. I hope they will.

    Why GoDaddy?

    When the project was initially formed the domain {ourdomain.com} was bought from GoDaddy. Nothing for and nothing against. Since then the emails were added at GoDaddy.

    Why GoDaddy+Microsoft 365?

    GoDaddy offers Microsoft 365. You can purchase an Email+Office that will give you the email that is a Microsoft 365 email.

    Why not migrate out of GoDaddy and use only Microsoft?

    As we onboard more people in the team we identified that keeping both GoDaddy and Microsoft would be difficult. I tried to migrate us to a Microsoft-only setup where the emails, the office tools and everything else would come from Microsoft and we wouldn’t be handling two services.

    After spending about a day on this it turned out it was not possible. I even have a ticket created from GoDaddy support that should have been resolved in 72 hours, but almost a month later I still don’t have any notification if it is resolved or not. The issue is that I as an admin can not redirect the emails to be received at onmicrosoft.com while we are migrating. This means there will be a moment of time where people will not receive emails. I also can not export the user’s emails. I have to log in with every user, but I don’t know their passwords, so they should reset their passwords and share them with me and I should export their mailboxes through a desktop outlook application and then import them again. Which would easily take days in communication and sync. Yes, there is no “export all emails” and “import all emails”. It should be done by hand, manually, for every user in sync with the user. There simply is no such tool available from Microsoft in the GoDaddy+Microsoft setup.

    When migrating from GoDaddy+Microsoft 365 to Microsoft you should manually log in with each user and manually export and import each and every mailbox and manually sync with the users to give you their password, because as admin you can not change the password. This could take days, if not weeks for a team.

    Because of this we’ve decided to postpone this migration.

    How did an email get compromised?

    During my regular security audit I found out that I don’t know who has access to admin@{ourdomain}.com. This is the admin email. I have access to it. A couple of colleagues have access to it. But I don’t know who else has access to it.

    Naturally I tried to reset the password for this account.

    The way I tried to reset the password is

    1. Go to GoDaddy.
    2. Log in with my account username@{ourdomain.com} and try to change the password for admin@{ourdomain.com}.
    3. The site returned that the password was successfully changed.
    4. Then I asked my colleague who has access to admin@{ourdomain.com} through the Desktop version of Outlook to see if he still has access.
    5. He still did. It did not matter that I changed the password.

    I have changed the password for admin@{ourdomain.com}, but users that do not know the new password still have access to the email through the desktop version of Microsoft Outlook

    The implications here are huge. This means that I don’t know who has access to admin@{ourdomain.com} and there is no way I could prevent them from accessing it.

    The only way would be for them to willingly sign out and try to sign in again. But this is not going to happen as I now consider the email to be compromised. Since the moment I am auditing this email to the moment I know who exactly has access to the email I consider this email to be compromised. Probably nobody else had access to it.

    But Microsoft and GoDaddy do not provide me with the tools to check who has access and to prevent people from accessing it, even after I changed the password.

    Can GoDaddy support help?

    It should be mentioned that GoDaddy documentation says that it might take up to 30 minutes for this password change to be reflected. I am ok with this. Not the best security, but I am ok.

    I have waited for 120 minutes before getting in touch with GoDaddy support.

    After spending a total of 4 hours with 3 different agents of GoDaddy we could not resolve the issue. What I found out is the following:

    1. The only solution GoDaddy support agents could advise me to is to ask my colleagues to sign out of admin@{ourdomain.com}. I could not explain to them that I don’t know who has access and I want to prevent any access to this email. They kept insisting I should ask people to sign out and they could not understand that I consider the email to be compromised and we should act like this. I am attaching the transcript of the communication as this was unbelievable.
    2. The second thing I found out is that after you spend more than 5-10 minutes with the “award winning support” of GoDaddy the agents start to ask you to restart your browser. One of the agents asked me to restart my computer in order for the change of password of admin@{ourdomain.com} to take effect. The reason I assume they are doing this is so that the chat session between me and them stops. In this way the next time I try to get in touch with support I am talking with a new agent.

    GoDaddy could not help. We’ve tried all kinds of things. Waiting for 7 hours, resetting the password of admin@{ourdomain.com} while I am logged in as admin@{ourdomain.com} and while logged in as username@{ourdomain.com}. None of this helped.

    7 hours in and the email admin@{ourdomain.com}, hosted on GoDaddy with Microsoft 365 software is still compromised.

    Can Microsoft help?

    7 hours in, I tried to get Microsoft support. I was reluctant until now because I knew what the outcome would be, but nevertheless I tried.

    10 minutes after calling Microsoft I got a response from an Agent. The Agent knew a lot of things and was actively trying to help me.

    First thing he asked me was to visit admin.microsoft.com. I did and this redirects to https://productivity.godaddy.com/settings#/mailbox/18071199

    The agent was a little surprised. I have a Microsoft 365 account but I did not have access to admin.microsoft.com and the tools that this portal is providing. I only had access to the GoDaddy admin interface, which we already found out was not working: the password could not be reset from it. It just did not work.

    What I had access to is “admin.exchange.microsoft.com”. This seems to be the admin interface for the Exchange server. I am familiar with the Exchange server and I tried to explain to the agent that there is no way to reset the password from the Exchange admin interface.

    We spent 20-30 minutes looking through all the options of the Exchange admin interface, but there are no tools there to manage the user admin@{ourdomain.com}.

    When you buy Microsoft 365 from GoDaddy you get access to admin.exchange.microsoft.com where you can manage the Exchange server, but you do not get access to admin.microsoft.com. You can not reset the password for a mailbox from the admin.exchange.microsoft.com, but only through admin.microsoft.com, but you don’t have access to admin.microsoft.com

    Can we workaround this in the Exchange admin interface?

    We tried. Me and the support. There are options to add additional roles to the Organization Management from the Exchange admin server. We tried it for about 20 minutes, but we could not.

    Can we workaround this from Azure?

    The Microsoft support agent asked me to go to portal.azure.com. I had a lot of hope. In the azure interface we could again see the users in the Active Directory. When we tried to change the password for admin@{ourdomain.com} from the portal.azure.com interface we got an error that we don’t have the license to change the password. I will later attach a screenshot here.

    How did we resolve it?

    More than 24 hours after the moment I made the audit and considered the admin@{ourdomain.com} compromised I got a response from Microsoft support. I had to go to https://www.godaddy.com/help/sign-out-of-all-devices-32032

    This is an article that specifically says “When working to secure a compromised Microsoft 365 account, sign out of all sessions and devices.”

    This article was sent to me from Microsoft support. This means that GoDaddy was there before, they even wrote an article. None of the 3 support agents knew about this article. I did not know about this article.

    The solution was to visit https://myaccount.microsoft.com/ and to click “Sign out everywhere”.

    Does this really resolve it?

    In a GoDaddy+Microsoft setup to reset the password of username_to_reset@{ourdomain.com} while we are logged in as username@{ourdomain.com} we must:

    1. Get access to username_to_reset@{ourdomain.com}
    2. Reset the password for the username_to_reset@{ourdomain.com} and receive a new email at username_to_reset@{ourdomain.com} and follow the instructions of how to reset the password. Note that this reset of password does not in any way prevent the users that have access to username_to_reset@{ourdomain.com} to continue to access it.
    3. Then sign in at GoDaddy with the new password for username_to_reset@{ourdomain.com} and go to https://myaccount.microsoft.com/. How do you get to https://myaccount.microsoft.com/ from the GoDaddy site? – I don’t know.
    4. After arriving at https://myaccount.microsoft.com/ you must click “Sign out everywhere”

    My conclusion

    I only need to change the password of a mailbox. The setup Microsoft+GoDaddy does not provide me with the tools to adequately manage users and mailboxes. I don’t know what else I would be missing down the road, but if it takes 24 hours and 4 support agents to find out how to reset a password, I guess other things will be even more difficult.

    I could live on any stack and tools. If my team was not using so many Microsoft tools I would close all Microsoft+GoDaddy inboxes and tools and move out of this stack, as it is not proving productive for the things I need to do to administer it. But it is a team effort. If the team is more productive with the tools Microsoft is providing then we just have to factor in the cost of having a compromised email for 24 hours as a cost of doing business.

    But there was support, wasn’t there

    Yes, I spent a total of 6 hours on the line with 4 different support agents. There was support, but support does not solve this.

    I don’t like AWS, but I’ve been a client of AWS for 7 years and I’ve managed some complex infrastructure. I have had 0 support requests with AWS in 7 years. This is what support should look like. 0 minutes. I spent ~6 hours in total with GoDaddy+Microsoft 365 support, with 3 agents from GoDaddy and 1 from Microsoft, to resolve my case. No wonder I am kind of reluctant to deploy anything on Microsoft in the future.

     
  • kmitov 4:19 pm on June 13, 2021 Permalink |

    Dependencies – one more variable adding to the “cost of the code” 

    One thing I have to explain a lot is what the costs of software development are. Why are things taking so long? Why is there any need for maintenance and support? Why are developers spending a significant amount of their time looking over the existing code base and why can we not just add the next and the next feature?

    Today I have an example of this – and these are “dependencies”.

    The goal of this article is to give people more understanding of how the “tech works”. I’ve seen that every line of code and every dependency that we add to a project will inevitably result in further costs down the road, so we should really keep free of unnecessary dependencies and features.

    Daily builds

    Many contemporary professional software projects have a daily build. This means that every day at least once the project is “built” from zero, all the tests are run and we automatically validate that the customers could use it.

    Weekly dependencies updates

    Every software project depends on libraries that implement common functionality and features. Having few dependencies is healthy for the project, but having no dependencies and implementing everything on your own is just not viable in today’s world.

    These libraries and frameworks that we depend on also regularly release new versions.

    My general rule that I follow in every project is that we check for new versions of the dependencies every Wednesday at around 08:00 in the morning. We check for new dependencies, we download them, we build the project and we run the specs/tests. If the tests fail this means that the new dependencies that we’ve downloaded have somehow changed the behavior of the project.

    Dependencies change

    Most of the time dependencies are changed in a way that does not break any of the functionality of your project. This week was not such a week. A new dependency came along and it broke a few of the projects.

    The problem came from a change in two dependencies:

    Fetching websocket-driver 0.7.5 (was 0.7.4)
    Fetching mustache-js-rails 4.2.0.1 (was 4.1.0)
    Installing mustache-js-rails 4.2.0.1 (was 4.1.0)
    Installing websocket-driver 0.7.5 (was 0.7.4) with native extensions
    

    We have installed new versions of two of the dependencies: “websocket-driver” and “mustache-js-rails”.

    These two dependencies broke the builds.

    Why should we keep up to date

    Now out of the blue we should resolve this problem. This takes time. Sometimes it is 5 minutes. Sometimes it could be an hour or two. If we don’t do it, it will probably result in more time at a later stage. As the change is new in ‘mustache-js-rails’ we have the chance to get in touch with the developers of the library and resolve the issue while it is fresh for them and they are still “in the context” of what they were doing.

    Given the large number of dependencies that each software project has there is a constant need to keep up to date with new recent versions of your dependencies.

    What if we don’t keep up to date?

    I have one such platform. We decided 6-7 years ago not to invest any further in it. It is still working but it is completely out of date. Any new development will cost the same as basically developing the platform as brand new. That’s the drawback of not keeping up to date. And it happens even with larger systems on a state level with the famous search for COBOL developers because a state did not invest in keeping their platform up to date for some 30+ years.

     
  • kmitov 6:41 am on June 5, 2021 Permalink |

    Yet another random failing spec 

    (Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

    This article is about a random failing spec. I spent more than 5 hours on this trying to track it down so I decided to share with our team what has happened and what the stupid mistake was.

    Random failing

    Random failing specs pass most of the time and fail sometimes. The context of their failure seems to be random.

    Context

    At FLLCasts.com we have categories. There was an error when people were visiting the categories. We receive each and every error by email, and we saw that some of the categories had stopped working because of a wrong SQL query. After the migration from Rails 6.0 to Rails 6.1 some of the queries started working differently, mostly because of eager loads, and we had to change them.

    The spec

    This is the code of the spec

     scenario "show category content" do
        category = FactoryBot.create(:category, slug: SecureRandom.hex(16))
        episode = FactoryBot.create(:episode, :published_with_thumbnail, title: SecureRandom.hex(16))
        material = FactoryBot.create(:material, :published_with_thumbnail, title: SecureRandom.hex(16))
        program = FactoryBot.create(:program, :published_with_thumbnail, title: SecureRandom.hex(16))
        course = FactoryBot.create(:course, :published_with_thumbnail, title: SecureRandom.hex(16))
    
        category.category_content_refs << FactoryBot.create(:category_content_ref, content: episode, category: category)
        category.category_content_refs << FactoryBot.create(:category_content_ref, content: material, category: category)
        category.category_content_refs << FactoryBot.create(:category_content_ref, content: program, category: category)
        category.category_content_refs << FactoryBot.create(:category_content_ref, content: course, category: category)
    
        expect(category.category_content_refs.count).to eq 4
        visit "/categories/#{category.to_param}"
    
        find_by_xpath_with_page_dump "//a[@href='/tutorials/#{episode.to_param}']"
        find_by_xpath_with_page_dump "//a[@href='/materials/#{material.to_param}']"
        find_by_xpath_with_page_dump "//a[@href='/programs/#{program.to_param}']"
        find_by_xpath_with_page_dump "//a[@href='/courses/#{course.to_param}']"
    
      end

    We add a few objects to the category and then we check that we see them when we visit the category.

    The problem

    Sometimes while running the spec only 1 of the objects in the category is shown. Sometimes none. Most of the time all of them are shown.

    The debug process

    The controller

    def show
      @category_content_refs ||= @category.category_content_refs.published
    end

    In the controller we just call published to get all the published content that is in this category. There are other things in the show action but they are not relevant. We were using apply_scopes, we were using other concerns.

    The model

      scope :published, lambda {
        include_contents.where(PUBLISHED_OR_COMING_WHERE_SQL)
      }

    The scope in the model makes a query for published or coming.

    And the query, I kid you not, that was committed in 2018 and that we’ve had for so long was

    class CategoryContentRef < ApplicationRecord
       
        PUBLISHED_OR_COMING_WHERE_SQL = [' (category_content_refs.content_type = \'Episode\' AND (episodes.published_at <= ? OR episodes.is_visible = true) ) OR
         (category_content_refs.content_type = \'Course\' AND courses.published_at <= ?) OR
         (category_content_refs.content_type = \'Material\' AND (materials.published_at <= ? OR materials.is_visible = true) ) OR
         category_content_refs.content_type=\'Playlist\'', *[Time.now.utc.strftime("%Y-%m-%d %H:%M:%S")]*4].freeze
    
    end
    

    I will give you a hint that the problem is with this query.

    You can take a moment and try to see where the problem is.

    The query problem

    The problem is with the .freeze and the constant in the class. The query is initialized when the class is loaded. Because of this it takes the time at the moment of loading the class and not the time of the query.

    Because the specs are fast, sometimes the class is loaded right before the spec runs and sometimes other specs are executed in between.
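    The difference between a value computed when the class is loaded and a value computed on every call can be shown in plain Ruby. This is a minimal sketch and not our model: the class names FrozenAtLoad and EvaluatedPerCall and the visible? helper are made up for illustration, and the comparison stands in for the SQL condition.

```ruby
# The condition is built once, when the class body is evaluated.
# Time.now is captured at load time and never changes afterwards.
class FrozenAtLoad
  PUBLISHED_WHERE = ["published_at <= ?", Time.now.utc].freeze

  def self.visible?(published_at)
    published_at <= PUBLISHED_WHERE.last
  end
end

# The condition is built on every call, so Time.now is always current.
class EvaluatedPerCall
  def self.published_where
    ["published_at <= ?", Time.now.utc]
  end

  def self.visible?(published_at)
    published_at <= published_where.last
  end
end

# Something published after the classes were loaded:
sleep 0.05
published_at = Time.now.utc

puts FrozenAtLoad.visible?(published_at)     # compares against the load time
puts EvaluatedPerCall.visible?(published_at) # compares against the current time
```

    In a Rails model the same idea would be to build the time arguments inside the scope lambda instead of inside a constant, so that they are evaluated when the query runs.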

    It seems simple once you see it, but these are the kind of things that you keep missing while debugging. They are right in front of your eyes and yet sometimes you just can’t see them, until you finally see them and then you cannot unsee them.

     
  • kmitov 3:19 pm on May 31, 2021 Permalink |

    When caching is bad and you should not cache. 

    (Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

    On Friday we did some refactoring at FLLCasts.com. We removed Refinery CMS, which is a topic for another article, but one issue popped up: on a specific page caching was used in a way that made the page very slow. This article is about how and why. It is mainly for our team as a way to share the knowledge among ourselves, but I think the whole community could benefit, especially the Ruby on Rails community.

    TL;DR;

    When you use a cache service, be it MemCachier, Redis or any other, every lookup is a request to that service. This will include a get(key) method call and, if the value is not stored in the cache, it will also include a set(key) method call. When the calculation you are doing is simple, it will take more time to cache the result of the calculation than to do the calculation again, especially if this calculation is a simple string concatenation.

    Processors (CPUs) are really good at string concatenation and can do it in well under a millisecond. So if you are about to cache something, make sure that you cache something worth caching. There is absolutely no reason to cache the result of:

    # Simple string concatenation. You calculate the value. No need to cache it.
    value = "<a href=#{link}>Text</a>"
    
    # The same result, but with caching.
    # There isn't a universe in which the code below will be faster than the code above.
    hash = calculate_hash(link)
    cached_value = cache.get(hash)
    if cached_value.nil?
      cached_value = "<a href=#{link}>Text</a>"
      cache.set(hash, cached_value)
    end
    
    value = cached_value

    Context for Rails

    Rails makes caching painfully easy. Any server side generated HTML could be cached and returned to the user.

    <% # The call below will render the partial "page" for every page and will cache the result %>
    <% # Pretty simple, and yet there is something wrong %>
    <%= render partial: "page", collection: @pages, cached: true %>

    What’s wrong is that we open the browser and it takes more than 15 seconds to load.

    Here is a profile result from New Relic.

    As you can see there are a lot of Memcached get calls – like 10 – and a lot of set calls. There are also a lot of Postgres find methods. All of this is because of how caching was set up in the platform. The whole “page” partial, after a decent amount of refactoring, turns out to be a simple string concatenation:

    <a href="<%= page.path%>"><%= page.title %></a>

    That’s it. We were caching the result of a simple string concatenation, which the CPU is quite fast at doing. Because there were a lot of pages and we were doing the call for all of the pages, when opening the browser for the first time it just took too long to call all the get(key) and set(key) methods and the page was returning a “Time out”.
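    To get a feeling for how the calls add up, here is a small sketch that only counts the round trips. It follows the get/set flow from the TL;DR above and not the internals of Rails collection caching. CountingCache, Page, link_for and cached_link_for are hypothetical names, and the in-memory hash stands in for a real cache service where every call would also pay for a network request.

```ruby
# An in-memory stand-in for a cache service that counts every call.
class CountingCache
  attr_reader :calls

  def initialize
    @store = {}
    @calls = 0
  end

  def get(key)
    @calls += 1
    @store[key]
  end

  def set(key, value)
    @calls += 1
    @store[key] = value
  end
end

Page = Struct.new(:path, :title)

# The whole "partial": a simple string concatenation.
def link_for(page)
  "<a href=\"#{page.path}\">#{page.title}</a>"
end

# The same result, but going through the cache first.
def cached_link_for(page, cache)
  key = "page-link/#{page.path}"
  cached = cache.get(key)
  return cached unless cached.nil?

  link_for(page).tap { |html| cache.set(key, html) }
end

pages = (1..100).map { |i| Page.new("/pages/#{i}", "Page #{i}") }
cache = CountingCache.new
pages.each { |page| cached_link_for(page, cache) }
puts cache.calls # one get and one set per page on a cold cache
```

    With a cold cache every page costs two calls to the cache service, while link_for alone costs none. Even with a warm cache there is still one get per page.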

    Conclusion

    You should absolutely use caching and cache the values of your calculations, but only if those calculations take more time than asking the cache for a value. Otherwise it is just not useful.

     
  • kmitov 9:14 am on May 7, 2021 Permalink |

    “[DOM] Input elements should have autocomplete attributes” 

    (Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

    This is one of the things that could make a platform better. Here is what the warning looks like in the browser console.

    More information at – https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/autocomplete

    The autocomplete attributes allow browsers, extensions and other agents to guess what the user should do on this page. This could make things easier for the user. For example, an extension could suggest a new password in the password field, or could know to fill in the name of the user in the “name” field.

    Additionally we don’t like warnings.

    To check out the behavior, if you have a password manager for example go to

    https://www.fllcasts.com/users/sign_in

    or

    https://www.buildin3d.com/users/sign_in

     