Updates from kmitov

  • kmitov 7:33 am on October 8, 2021 Permalink |
    Tags: hotwire, turbo

    [Rails, Hotwire] Migrate to Turbo from rails-ujs and turbolinks – how it went for us. 

    We recently decided to migrate one of our newest platforms to Turbo. The goal of this article is to help anyone who plans to do the same migration. I hope it gives you a perspective on the amount of work required. Generally it was easy and straightforward, but a few specs had to be changed because of URLs and controller results.

    Gemfile

    Remove turbolinks and add turbo-rails. The change was

    --- a/Gemfile.lock
    +++ b/Gemfile.lock
    @@ -227,9 +227,8 @@ GEM
         switch_user (1.5.4)
         thor (1.1.0)
         tilt (2.0.10)
    -    turbolinks (5.2.1)
    -      turbolinks-source (~> 5.2)
    -    turbolinks-source (5.2.0)
    +    turbo-rails (0.7.8)
    +      rails (>= 6.0.0)

    application.js and no more rails-ujs and Turbolinks

    Added “@hotwired/turbo-rails” and removed Rails.start() and Turbolinks.start()

    --- a/app/javascript/packs/application.js
    +++ b/app/javascript/packs/application.js
    @@ -3,8 +3,7 @@
     // a relevant structure within app/javascript and only use these pack files to reference
     // that code so it'll be compiled.
    
    -import Rails from "@rails/ujs"
    -import Turbolinks from "turbolinks"
    +import "@hotwired/turbo-rails"
     import * as ActiveStorage from "@rails/activestorage"
     import "channels"
    
    @@ -14,8 +13,6 @@ import "channels"
     // Collapse - needed for navbar
     import { Collapse } from 'bootstrap';
    
    -Rails.start()
    -Turbolinks.start()
     ActiveStorage.start()

    package.json

    The change was small

    --- a/package.json
    +++ b/package.json
    @@ -2,10 +2,10 @@
       "name": "platform",
       "private": true,
       "dependencies": {
    +    "@hotwired/turbo-rails": "^7.0.0-rc.3",
         "@popperjs/core": "^2.9.2",
         "@rails/actioncable": "^6.0.0",
         "@rails/activestorage": "^6.0.0",
    -    "@rails/ujs": "^6.0.0",
         "@rails/webpacker": "5.4.0",
         "bootstrap": "^5.0.2",
         "stimulus": "^2.0.0",

    Devise still does not work

    For the Devise forms you have to add “data: {turbo: ‘false’}” to disable Turbo for them

    +<%= form_for(resource, as: resource_name, url: password_path(resource_name), html: { method: :post }, data: {turbo: "false"}) do |f| %>

    We are waiting for a resolution of https://github.com/heartcombo/devise/pull/5340

    Controllers have to return an unprocessable_entity on form errors

    If there are ActiveRecord errors in the controller we must now render with status: :unprocessable_entity, because Turbo expects an unprocessable_entity (422) response in order to re-render the form with the validation errors.

    +++ b/app/controllers/records_controller.rb
    @@ -14,7 +14,7 @@ class RecordsController < ApplicationController
         if @record.save
           redirect_to edit_record_path(@record)
         else
    -      render :new
    +      render :new, status: :unprocessable_entity
         end
       end

    application.js was reduced significantly

    The old application.js – 932 KiB

      application (932 KiB)
          js/application-dce2ae8c3797246e3c4b.js

    The new application.js – 248 KiB

    remote:        Assets: 
    remote:          js/application-b52f4ecd1b3d48f2f393.js (248 KiB)

    Conclusion

    Overall a good experience. We are still facing some minor issues with third-party chat widgets like tawk.to that do not work well with Turbo: they send one more request, refresh the page and add the widget to an iframe that is lost on Turbo navigation. But we would probably move away from tawk.to anyway.

     
  • kmitov 6:29 am on October 8, 2021 Permalink |

    [Rails] Warden.test_reset! does not always reset and the user is still logged in 

    We had this strange case of a spec that was randomly failing

      scenario "generate a subscribe link for not logged in users", js: true do 
        visit "/page_url"
    
        expect(page).to have_xpath "//a[text()='Subscribe']"
        click_link "Subscribe"
        ...
      end 

    When a user is logged in we generate a button that subscribes them immediately. But when a user is not logged in we generate a link that will direct the users to the subscription page for them to learn more about the subscription.
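    As a rough illustration (the helper and path names below are hypothetical, not the actual platform code), the behavior described above could be expressed along these lines:

    # Hedged sketch: logged in users get a button that subscribes them immediately,
    # anonymous users get a link to the subscription page.
    module SubscriptionsHelper
      def subscribe_call_to_action(user, plan)
        if user.present?
          button_to "Subscribe", subscriptions_path(plan_id: plan.id), method: :post
        else
          link_to "Subscribe", plan_path(plan)
        end
      end
    end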

    This works well, but the spec is randomly failing sometimes.

    We expect there to be a link, e.g. “//a”, but on the page there is actually a button, e.g. “//button”.

    What this means is that when the spec started there was a logged in user. The user was still not logged out from the previous spec.
    This explains why the spec sometimes fails and sometimes passes – we are running all the specs in a random order:

    $ RAILS_ENV=test rake spec SPEC='...' SPEC_OPTS='--order random'

    Warden.test_reset! is not always working

    There is a Warden.test_reset! that is supposed to reset the session, but it seems for js: true cases where we have a Selenium driver the user is not always reset before the next test starts.

    # spec/rails_helper.rb
    RSpec.configure do |config|
      ...
      config.after(:each, type: :system) do
        Warden.test_reset!
      end
    end

    Logout before each system spec that is js: true

    I decided to try to explicitly log out before each js: true spec of type ‘system’, so I improved the RSpec configuration:

    RSpec.configure do |config|
      config.before(:each, type: :system, js: true) do
        logout # NOTE Sometimes when we have a js spec the user is still logged in from the previous one
        # Here I am logging it out explicitly. For js it seems Warden.test_reset! is not enough
        #
        # When I run specs without this logout, the times are
        # Finished in 3 minutes 53.7 seconds (files took 28.79 seconds to load)
        #   383 examples, 0 failures, 2 pending
        #
        # With the logout they are
        #
        # Finished in 3 minutes 34.2 seconds (files took 21.15 seconds to load)
        #   383 examples, 0 failures, 2 pending
        #
        # Randomized with seed 26106
        # 
        # So we should not be losing on performance
      end
    end

    Conclusion

    Warden.test_reset! does not always log out the user successfully before the next spec when specs are run with the Selenium driver – e.g. “js: true”. I don’t know why, but that is the observed behavior.
    I’ve added a call to “logout” before each system spec that is js: true to make sure the user is logged out.

     
  • kmitov 9:00 am on October 5, 2021 Permalink |

    A day of the life of a CTO – “what do you do, day to day?” 

    My brother asked me the other day:

    He: – So what do you do, day to day?

    Me: – I work in engineering (software, data and AI).

    It’s a little bit more than that. Not all the days are the same. There are a lot of decisions to be made, and generally with a little luck those decisions will keep the ship at least in the right direction.

    I decided to look deeper and to record it. There is a difference between what we think we are doing and what we are actually doing. I tried to summarize just one of my recent days that I spent in engineering. This was a day without any software development for me.

    My hope with this article is to be able to answer my brother – “What do you do, day to day?” and I hope this answer and examples could be interesting to people entering the world of Software engineering and to business and product people trying to learn more about how their engineers spend their day.

    Adding a JSONB column to a schema

    A colleague was facing the issue of storing an array of values in a DB. The values were the result of calling the API of an external service for our business.

    How do you store these values? There are many different ways. I supported his recommendation to store the data in JSON format. I only suggested changing the type of the column to JSONB, as this will later allow us to query the table in an easier way. At the same time I had to re-think part of the stack to see if there would be any implications for the whole platform when this new JSONB column is introduced. Luckily there were no implications.
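    To illustrate the difference (with a hypothetical migration – the table and column names are made up), a jsonb column can be indexed and queried directly in PostgreSQL:

    class AddApiResultsToRecords < ActiveRecord::Migration[6.1]
      def change
        add_column :records, :api_results, :jsonb, default: [], null: false
        add_index :records, :api_results, using: :gin # a GIN index speeds up jsonb containment queries
      end
    end

    # Later the column can be queried directly:
    # Record.where("api_results @> ?", [{ "status" => "ok" }].to_json)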

    Automated test that we create a table

    A colleague was working on a pipeline and the specs for this pipeline. The question was how do we build a spec for part of the logic. The logic creates a new table. How do we test this in an automated way? How do we test that this logic creates a table? We considered two different approaches and together we looked for a good API call to test this.

    We decided on how to spec this behavior.
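    One way such a spec could look (the class and table names are hypothetical, and it assumes the pipeline step runs against a database reachable from the spec):

    RSpec.describe CreateEventsTableStep do
      it "creates the events table" do
        expect(ActiveRecord::Base.connection.table_exists?("events")).to be false

        described_class.new.call

        expect(ActiveRecord::Base.connection.table_exists?("events")).to be true
      end
    end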

    DateTime field

    A colleague was facing an issue with a date field in the data platform that had invalid values. The issue was that we were storing both the date and the time for an event where we should have been storing only the time. The implications were huge. We now had to migrate all the values in all the records.

    In this case we looked at the product requirements on what should be included in the data platform in the near future. Turns out that there is a requirement for engineering to store not only the time, but also the date of the event. This means there was nothing to fix as we were ahead of time in engineering. We only have to migrate a couple of records.

    The decision here was whether we should spend a day migrating this data and what would be the issue if the data was not migrated.

    Choose a service $50/80GB

    A colleague had the task to look at different services that we could use. We had to decide whether to use this $50 service or that $50 service. The decision is important because once you decide on a service to add to your stack it is difficult to move away from it. You kind of stay with it at least for the near future.

    Sometimes when you look at two services on your own you can overlook a few of the aspects so it is a good practice to have a second look from someone else. Also at the end it is a team decision of what to include in the stack.

    Integration with an external API

    A colleague was working on integrating with an external API. The issue was that this API returns different formats for different calls. The question was how do we handle this. Should we hard-code the schema for this API, should we infer it, should we do something smarter? How does this impact the abstraction for the other Data Sources? We had to get on a call with the external API representatives to discuss if they could help us.

    Creating new repos

    A colleague was working on new features in the platform. These new features should be extracted into new repositories. We had to decide on the names of the repositories. In the world of software development there are two hard things – invalidating cache and naming. Naming is important because it gives you power over things. Once you name them you have power over them. If you name them badly, then they have power over you. Nevertheless, we had to make a decision on how to name two new code repositories.

    Abilities

    A colleague was working on the authorization part of the platform. We are adding new authorizations based on roles. He developed the code and was ready for a Code review. I was there and decided to jump on the Code Review. The issue with the implementation was that it was coupling the authorization with all the modules in a single class. Coupling is bad in the long run as it is not very agile and difficult to maintain. We spent time decoupling the implementation.
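    As a hedged sketch of this kind of decoupling (class names are hypothetical, assuming a CanCanCan-style Ability), each module contributes its own rules and the top-level class only merges them:

    class Ability
      include CanCan::Ability

      def initialize(user)
        # Each module defines its own ability class; the top-level one only merges them.
        [CoursesAbility, MaterialsAbility, ReportsAbility].each do |module_ability|
          merge(module_ability.new(user))
        end
      end
    end

    class CoursesAbility
      include CanCan::Ability

      def initialize(user)
        can :read, Course, published: true
        can :manage, Course if user&.admin?
      end
    end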

    System vs model specs

    A colleague was in the middle of developing an automated spec. There are generally two types of specs – integration and unit. In our case we use “system” and “model” specs. System specs test the behavior of the whole feature. Unit specs test the behavior of a specific unit (class, function). My general rule of thumb is – 10% system specs, 90% model specs, but start with the system spec. I’ve been in situations with too many system specs which made the system unmaintainable and required us to invest a lot of time in refactoring. Since then I am cautious about what kind of specs are developed, when and why. We revised our current assumptions and decided whether the current specs should be developed as system or model specs.
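    Roughly the distinction, with hypothetical examples (not specs from our code base):

    # System spec – drives the whole feature through the browser.
    RSpec.describe "Subscriptions", type: :system do
      scenario "a logged in user subscribes", js: true do
        login_as FactoryBot.create(:user)
        visit "/plans"
        click_button "Subscribe"
        expect(page).to have_content "You are subscribed"
      end
    end

    # Model spec – exercises a single unit in isolation.
    RSpec.describe Subscription, type: :model do
      it "is active until the end date" do
        subscription = FactoryBot.build(:subscription, ends_at: 1.day.from_now)
        expect(subscription).to be_active
      end
    end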

    Flash messages

    A colleague was working on some flash messages on the platform that are appearing at a specific moment. I took a look and confirmed the implementation and the behavior.

    Constructing new objects

    A colleague was working on refactoring part of the code. A general rule of thumb I try to follow is “always construct instances of a given type only at one specific place in the code”. We revised the implementation and saw that there are a few places where instances of a given type were constructed. There is an easy solution for this. We scheduled it for the following week to be implemented.
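    The rule in a minimal, hypothetical sketch – one factory method owns the construction, so a change in how records are built touches a single place:

    class RecordFactory
      # The only place in the code base that is allowed to call Record.new.
      def self.build(attributes)
        Record.new(attributes.merge(source: "api"))
      end
    end

    # Callers use RecordFactory.build(...) instead of Record.new(...)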

    Submit button change type to input

    A colleague was working on a feature on the web platform and noticed that a few of the forms had the wrong type of button. I was around and was the one who had previously committed this form, so he notified me about the change and we discussed the implications.

    Structure of blob storage

    A colleague was working on an integration with an API that will store information in our BigData Lake. We had to sync the structure of the lake and how it will accommodate the new API.

    Infrastructure from code

    A colleague was working on deploying on a cloud provider. We try to create our infrastructure from code. It is way too easy to set up an infrastructure, spend a week deploying it, and then be unable to reproduce it later on because of the gazillion different options and little configurations you have to do on the cloud providers. Just ask anyone who has configured AWS IAM permissions and resources.

    It is important to have a script that would create your infrastructure from code. We had to revise, review and think about the implications of the resources that the code creates.

    Conclusion

    No actual conclusion. This is just a diary of the day. I hope that my brother along with many others now understand more about our work.

     
  • kmitov 4:30 am on September 8, 2021 Permalink |
    Tags: airflow, apache, bigdata   

    Orchestration of BigData with Apache Airflow 

    It was a pleasure for me to do this presentation and discuss how we can orchestrate BigData with Apache Airflow at the 2021 OpenFest event.

    Video is in Bulgarian

     
  • kmitov 5:48 am on September 6, 2021 Permalink |

    Refresh while waiting with RSpec+Capybara in a Rails project 

    This is some serious advanced stuff here. You should share it.

    A colleague, looking at the git logs

    I recently had to create a spec with Capybara+RSpec where I refresh the page and wait for a value to appear on it. In this particular scenario there is no need for WebSockets or any JS. We just need to refresh the page.

    But how do we test it?

    # Expect that the new records page will show the correct value of the record
    # We must do this in a loop as we are constantly refreshing the page.
    # We need to stay here and refresh the page
    # 
    # Use Timeout.timeout to stop the execution after the default Capybara.default_max_wait_time
    Timeout.timeout(Capybara.default_max_wait_time) do
      loop do
        # Visit the page. If you visit the same page a second time
        # it will refresh the page.
        visit "/records"
        # The smart thing here is the wait: 0 param
        # By default find_all will wait for Capybara.default_max_wait_time as it is waiting for all JS methods 
        # to complete. But there is no JS to complete and we want to check the page as is, without waiting 
        # for any JS, because there is no JS. 
        # 
    # We pass a "wait: 0" which will check and return
        break if find_all(:xpath, "//a[@href='/records/#{record.to_param}' and text()='Continue']", wait: 0).any?
    
        # If we could not find our record we sleep for 0.25 seconds and try again.
        sleep 0.25
      end
    end

    I hope it is helpful.

    Want to keep in touch? – find me on LinkedIn or Twitter.

     
  • kmitov 10:49 am on September 3, 2021 Permalink |
    Tags: aws, cloudflare, nginx

    When the policeman becomes the criminal – how Cloudflare attacks my machines. 

    On the Internet you are nobody until someone attacks you.

    It gets even more interesting when the attack comes from someone with practically unlimited resources and when these are the same people that are supposed to protect you.

    This article is the story of how Cloudflare started an “attack” on a machine at the FLLCasts platform. This increased the traffic of the machine about 10x and AWS started charging the account 10x more. I managed to stop them and I hope my experience is useful for all CTOs, sysadmins, devops and others that would like to understand more and look out for such cases.

    TL;DR

    The current, up-to-date status is: after all the investigation it turns out that when a client makes a HEAD request for a file, the request hits the Cloudflare infrastructure. Cloudflare then sends a GET request to the account machine and caches the file. This changed on 28 August. Before 28 August, when clients were sending HEAD requests, Cloudflare was also sending HEAD requests (which don’t generate traffic). After 28 August clients are still sending HEAD requests, but Cloudflare is now sending GET requests, generating terabytes of additional traffic that is not needed.

    Increase of the Bill

    On 28 of August 2021 I got a notification from AWS that the account is close to surpassing its budget for the month. This is not surprising as it was the end of the month, but nevertheless I decided to check. It seems that the traffic to one of the machines has increased 10x in a day. Nothing else has increased. No visits, no other resources, just the traffic to this one particular machine. 
    That was strange. This has been going on for 7 days now and this is the increase of the traffic.

    AWS increase of the bill

    Limit billing on AWS

    First thought was “How can I set a global limit to AWS spending for this account? I don’t want to wake up with $50K in traffic charges the next day?”

    The answer is “You can’t”. There is no way to set a global spending limit for an AWS account. This was something I already knew, but I decided to check again with support and yes, you can’t set such a limit. This means that AWS provides all the tools for you to be bankrupted by a third party and they are not willing to limit it.

    Limit billing on Digital Ocean

    I have some machines on Digital Ocean and I checked there. “Can I set a global spending limit for my account where I will no longer be charged and all my services will stop if my spending is above X amount of dollars?”.
    The answer was again – “No. Digital Ocean does not provide it”.

    Should there be a global limit on spending on cloud providers?

    My understanding is – yes. There is a break-even point where users are coming to your service and generating revenue and you are delivering the service and this is costing you money. Once it costs you more to deliver the service than the revenue that the service is generating, I would personally prefer to stop the service. No need for it to be running. Otherwise you could wake up with a $50K bill.

    AWS monitoring

    I had the bill from AWS so I tried looking at the monitoring.
    There is a spike every day between 03:00 AM UTC and 05:00 AM UTC. This spike is increasing the traffic by hundreds of gigabytes. It could easily be terabytes next time.
    The conclusion is that the machine is heavily loaded during this time.

    AWS monitoring

    Nginx access.log

    Looking at the access log I see that there are a lot of requests by machines that are using a user agent called ‘curl’. ‘curl’ is a popular tool for accessing files over HTTP and is heavily used by different bots. But bots tend to identify themselves.

    This is what the access.log looks like:

    172.68.65.227 - - [30/Aug/2021:03:26:02 +0000] "GET /f9a13214d1d16a7fb2ebc0dce9ee496e/file1.webm HTTP/1.1" 200 27755976 "-" "curl/7.58.0"

    Parsing the log file

    I have my years of bash experience, and a couple of commands later I got a list of all the IPs and how many requests we’ve received from each of them.

    grep curl access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -n

    The result is 547 machines. The full log file is available at – Full list of Cloudflare IPs attacking my machine. The top 20 are below (some of the IPs are not from Cloudflare). The first column is the number of requests, the second is the IP of the machine.

    NumberOfRequest IP
        113 172.69.63.18
        117 172.68.65.107
        150 172.70.42.135
        158 172.70.42.161
        164 172.69.63.82
        167 172.70.42.129
        169 172.69.63.116
        170 172.68.65.231
        173 172.68.65.101
        178 172.69.63.16
        178 172.70.42.143
        188 173.245.54.236
        264 172.70.134.69
        268 172.70.134.117
        269 172.70.134.45
        287 172.70.134.153
        844 172.70.34.131
        866 172.70.34.19
        904 172.70.34.61
        912 172.70.34.69

    These are Cloudflare machines!

    Looking at the machines that are making the requests, these are 547 different machines, most of which are Cloudflare machines. These are servers that Cloudflare seems to be running.

    How does Cloudflare work?

    For this particular FLLCasts account and this particular machine, years ago I set up Cloudflare to sit in front of the machine to help protect the account from internet attacks.

    The way Cloudflare works is that only Cloudflare knows the IP address of our machine. This is the promise that Cloudflare makes. Because only they know the IP address of the machine, only they know which IP address a given domain resolves to. In this way, when a user points their browser at “http://domainname”, the internet directs this request to Cloudflare, then Cloudflare checks whether the request is OK, and only then forwards it to our machine. In the meantime Cloudflare tries to help businesses like the platform by caching the content. This means that when Cloudflare receives a request for a file, they check on their own infrastructure whether the file is cached and send a request to the account machine only if there is no cache.

    In a nutshell Cloudflare maintains a cache for the content the platform is delivering.

    Image is from Cloudflare support at https://support.cloudflare.com/hc/en-us/articles/205177068-How-does-Cloudflare-work-

    What is broken?

    Cloudflare maintains a cache of the platform resources. Every night between 03:00 AM UTC and 05:00 AM UTC some 547 Cloudflare machines decide to update their cache and start sending requests to our server. These are 10x more requests than the machine generally receives from all users. The content on the server does not change. It’s been the same content for years. But for the last 7 days Cloudflare has been caching the same content every night on 547 machines.

    And AWS bills us for this.

    Can Cloudflare help?

    I created a ticket. The response was along the lines of “You are not subscribed for our support, you can get only community support”. Fine.
    I called them on the phone early in the morning.
    I called enterprise sales and I asked them.

    Me - "Hi, I am under attack. Can you help?"
    They - "Yes, we can help. Who is attacking you?"
    Me - "Well, you are. Is there an enterprise package I could buy so that you can protect me against your attack?"

    Luckily the guy on the phone caught my sense of humor and urgency and quickly organized a meeting with a product representative. Regrettably there were no solution engineers on this call.

    Both guys were very knowledgeable, but I had difficulties explaining that it was actually Cloudflare causing the traffic increase. I had all the data from AWS, from the access.log files, but the support agents still had some difficulty accepting it.

    To be clear – I don’t think that Cloudflare is maliciously causing this. There is no point. What I think has happened is some misconfiguration on their side that caused this for the last 7 days.

    What I think has happened?

    I tried to explain to the support agents that there are three scenarios for which Cloudflare is responsible, plus a fourth, unlikely one, on our end.

    1. Option 1 – “someone that has 547 machines is trying to attack the FLLCasts account and Cloudflare is failing to stop it”. First this is very unlikely. Nobody will invest in starting 547 machines just to make the platform pay a few dollars more this month. And even if this is the case, this is what Cloudflare should actually prevent, right? Option 1: “Cloudflare is failing in preventing attacks” (unlikely)


    2. Option 2 – “only Cloudflare knows the IP of this domain name and they have been compromised.”. The connection between domain name and IP address is something that only Cloudflare knows about. If a third party knows the domain name and they are able to find the IP address, this means that they have compromised Cloudflare. Option 2: “Cloudflare is compromised” (possible, but again, unlikely)

    3. Option 3 – “there is a misconfiguration in some of the Cloudflare servers”. I don’t like looking for malicious activity where everything could be explained with simple ignorance or a mistake. Most likely there is a misconfiguration in the Cloudflare infrastructure that is causing these servers to behave in this way. Option 3: “There is a misconfiguration in Cloudflare infrastructure”

    4. Option 4 – “there is a mistake on our end”. As there basically is nothing on our end and this nothing has not changed in years, the possibility for this to be the case is minimal. 

    On a support call we set a plan with the support agents to investigate it. I will change the public IP of the AWS machine and will reconfigure it on Cloudflare. In this way we hope to stop some of the requests. We have no plan for what to do after that.

    Can I block it on the Nginx level?

    Nginx is an HTTP server, serving files. There are a couple of options to explore there, but the most reasonable was to stop all curl requests to the Nginx server. This was the shortest path. There was no need to protect against other attacks, there was only the need to protect against the Cloudflare attack. The Cloudflare attack was using “curl” as a tool, so I decided to stop ‘curl’:

      # Surely not the best, but the simplest and will get the job done for now.
      if ($http_user_agent ~ 'curl') {
          return 444; # Consider returning 444. It's a custom nginx code that drops the connection without responding.
      }

    Resolution

    I am now waiting to see if the change of the public IP of the AWS machine will have any impact and if not I am just rejecting all “curl” requests that seem to be what Cloudflare is using.

    Update 1

    The first solution that we decided to implement was to:

    Change the public IP of the AWS machine and change it in the DNS settings at Cloudflare. In this way we would make sure that only Cloudflare really knows this IP.

    Resolution is – It did not work!

    I knew it wouldn’t, because it was another way for support to get me to do something without really looking into the issue, but I went along with it. Better to exhaust these options and be sure.

    The traffic of a Cloudflare-attacked machine. Changing the IP address on 3 September had no effect.

    Update 2

    Adding the CF-Connecting-IP header

    Cloudflare support was really helpful. They asked me to include CF-Connecting-IP in the logs. In this way we would know what the real IP making the requests is and whether these are in fact Cloudflare machines.

    The header is described at https://support.cloudflare.com/hc/en-us/articles/200170986-How-does-Cloudflare-handle-HTTP-Request-headers-

    I went on and updated the Nginx configuration

    log_format  cloudflare_debug     '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for" "$http_cf_connecting_ip"';
    
    access_log /var/log/nginx/access.log cloudflare_debug;
    

    Now the log file contained the original IP.

    Cloudflare makes a GET request when the client makes a HEAD request

    This is what I found out. The platform has a daily job that checks the machine and makes sure the files are OK. This integrity check was left there from years ago, when we had to do it. It still runs every night, checking the machine with HEAD requests. But on 28 August 2021 Cloudflare started making GET requests in response, and this increased the traffic to the machine.

    Steps to reproduce

    Here are the steps to reproduce:

    1. I am sending a HEAD request with ‘curl -I’

    2. Cloudflare has not cached the file so there is “cf-cache-status: MISS”

    3. Cloudflare sends a GET request and gets the whole file

    4. Cloudflare responds to the HEAD request.

    5. I send a HEAD request again with ‘curl -I’

    6. Cloudflare has the file cached and there is a “cf-cache-status: HIT”

    7. The account server is not hit.

    The problem here is that I am sending a HEAD request for my file and Cloudflare is sending a GET request for the whole file in order to cache it.

    Commands to reproduce

    This is a HEAD request:

    $ curl -I https://domain.com/file1.webm
    HTTP/2 200
    date: Sat, 04 Sep 2021 07:09:11 GMT
    content-type: video/webm
    content-length: 2256504
    last-modified: Sat, 04 Jan 2014 14:24:01 GMT
    etag: "52c81981-226e78"
    cache-control: max-age=14400
    cf-cache-status: MISS
    accept-ranges: bytes
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=Xg9TLgssa5Gm6j1fRlJZH8VahaoY21LdCE1W1JqVueu49mzdiTmh9MZp4pFZDsVeSmRg%2Bc%2FMryoN7tgmKUmdxhWzE7UZdVvgG%2FRxHSZ%2FYS6pDtxLwpXSD71jo5ADNyT4TSpKXtE%3D"}],"group":"cf-nel","max_age":604800}
    nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
    server: cloudflare
    cf-ray: 689564111e594ee0-FRA
    alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

    This is the log right after the HEAD request. Note that I am sending a HEAD request to domain.com and Cloudflare is sending a GET request for the file.

    162.158.94.236 - - [04/Sep/2021:07:09:12 +0000] "GET /file1.webm HTTP/1.1" 200 2256504 "-" "curl/7.68.0" "188.254.161.195" "188.254.161.195"

    Then I send a second HEAD request:

    $ curl -I https://domain.com/file1.webm
    HTTP/2 200
    date: Sat, 04 Sep 2021 07:09:53 GMT
    content-type: video/webm
    content-length: 2256504
    last-modified: Sat, 04 Jan 2014 14:24:01 GMT
    etag: "52c81981-226e78"
    cache-control: max-age=14400
    cf-cache-status: HIT
    age: 42
    accept-ranges: bytes
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=CKSvpGGHoj5LfV6xXpPUK5kHJtdsX3fylgt%2F2%2B6G94oUsdAd8FnHmUgEUIgnj5dd2Vvsv%2BKQxxgsHdHA0RvpjTxATakFKFuirMeI%2FS3lAdDX5VA0tY74z0CRYEHM2rS%2Fld6K738%3D"}],"group":"cf-nel","max_age":604800}
    nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
    server: cloudflare
    cf-ray: 689565175dffc29f-FRA
    alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

    And then there is NOTHING in the log file

    Note that for the last HEAD request there is a “cf-cache-status: HIT”.

    Status and how it could be resolved?

    Yes, we are doing HEAD requests every day to the files in order to check that they are all working. Every day we send a HEAD request for every file to make sure all files are up to date. This has been going on for years and is a leftover of an integrity check we implemented in 2015.

    What changed on 28 August 2021 is that when Cloudflare receives a HEAD request for a file, it sends a GET request to our machine in order to cache the file. This is what is generating all the traffic.

    We send the HEAD requests with ‘curl -I’.
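    A rough Ruby equivalent of the idea (the file list name is made up; the real check uses ‘curl -I’) would be:

    require "net/http"
    require "uri"

    # Send a HEAD request for every file and report anything that is not a 200.
    File.readlines("files.txt", chomp: true).each do |url|
      uri = URI(url)
      response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
        http.head(uri.request_uri)
      end
      puts "#{response.code} #{url}" unless response.is_a?(Net::HTTPOK)
    end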

    I have 30 weeks of log files that show that, until this change, Cloudflare was forwarding these as HEAD requests.

    I have asked Cloudflare

    Could you please rollback this change in the infrastructure and do not send a GET request to our machine when you receive a HEAD request from a client?

    Let’s see how this will be resolved.

    Up to date conclusion

    Check your machines from time to time. I hope you don’t get in this situation.

    Want to keep in touch? – find me on LinkedIn or Twitter

     
  • kmitov 9:48 am on August 27, 2021 Permalink
    Tags: customer

    How we lost $1000 because we did not talk to the customer early enough 

    This content is password protected. To view it please enter your password below:

     
  • kmitov 12:34 pm on August 17, 2021 Permalink |
    Tags: admin, microsoft   

    GoDaddy+Microsoft 365 and how an email was compromised for about a day 

    In two hours I have a C-suite meeting and one of the topics will be our internal stack and whether we stay with Microsoft+GoDaddy or migrate.

    This article is my objective summary of:

    1. How Microsoft+GoDaddy keep an email account compromised for more than a day
    2. What is difficult with the stack of Microsoft+GoDaddy
    3. Why can’t we just migrate to Microsoft without GoDaddy
    4. Why would I like to stop using Microsoft

    I hope other companies that have found themselves in this situation will be able to make the right decision given my experience.

    Note: This article is as of 2021-08-17. Things may change. I hope they will.

    Why GoDaddy?

    When the project was initially formed the domain {ourdomain.com} was bought from GoDaddy. Nothing for it and nothing against it. Since then the emails have been added at GoDaddy.

    Why GoDaddy+Microsoft 365?

    GoDaddy offers Microsoft 365. You can purchase an Email+Office plan that gives you a Microsoft 365 email.

    Why not migrate out of GoDaddy and use only Microsoft?

    As we onboarded more people onto the team we identified that keeping both GoDaddy and Microsoft would be difficult. I tried to migrate us to a Microsoft-only setup where the emails, Office and everything would come from Microsoft and we wouldn’t be handling two services.

    After spending about a day on this it turned out it was not possible. I even have a ticket created with GoDaddy support that should have been resolved in 72 hours, but almost a month later I still don’t have any notification whether it is resolved or not. The issue is that I as an admin cannot redirect the emails to be received at onmicrosoft.com while we are migrating. This means there will be a moment in time when people will not receive emails. I also cannot export the users’ emails. I have to log in with every user, but I don’t know their passwords, so they would have to reset their passwords and share them with me, and I would have to export their mailboxes through a desktop Outlook application and then import them again. That would easily take days in communication and sync. Yes, there is no “export all emails” and “import all emails”. It has to be done by hand, manually, for every user, in sync with the user. There simply is no such tool available from Microsoft in the GoDaddy+Microsoft setup.

    When migrating from GoDaddy+Microsoft 365 to Microsoft you should manually log in with each user and manually export and import each and every mailbox and manually sync with the users to give you their password, because as admin you can not change the password. This could take days, if not weeks for a team.

    Because of this we’ve decided to postpone this migration.

    How did an email get compromised?

    During my regular security audit I found out that I don’t know who has access to admin@{ourdomain}.com. This is the admin email. I have access to it. A couple of colleagues have access to it. But I don’t know who has access to it.

    Naturally I tried to reset the password for this account.

    The way I tried to reset the password is

    1. Go to GoDaddy.
    2. Log in with my account username@{ourdomain.com} and try to change the password for admin@{ourdomain.com}.
    3. The site returned that the password was successfully changed.
    4. Then I asked my colleague who has access to admin@{ourdomain.com} through the Desktop version of Outlook to see if he still has access.
    5. He still did. It did not matter that I changed the password.

    I have changed the password for admin@{ourdomain.com}, but users that do not know the new password still have access to the email through the desktop version of Microsoft Outlook

    The implications here are huge. This means that I don’t know who has access to admin@{ourdomain.com} and there is no way I could prevent them from accessing it.

    The only way would be for them to willingly sign out and try to sign in again. But this is not going to happen as I now consider the email to be compromised. From the moment I started auditing this email until the moment I know exactly who has access to it, I consider it compromised. Probably nobody else had access to it.

    But Microsoft and GoDaddy do not provide me with the tools to check who has access and to prevent people from accessing it, even after I changed the password.

    Can GoDaddy support help?

    It should be mentioned that GoDaddy documentation says that it might take up to 30 minutes for this password change to be reflected. I am ok with this. Not the best security, but I am ok.

    I have waited for 120 minutes before getting in touch with GoDaddy support.

    After spending a total of 4 hours with 3 different agents of GoDaddy we could not resolve the issue. What I found out is the following:

    1. The only solution GoDaddy support agents could advise was to ask my colleagues to sign out of admin@{ourdomain.com}. I could not explain to them that I don’t know who has access and I want to prevent any access to this email. They kept insisting I should ask people to sign out, and they could not understand that I consider the email to be compromised and we should act accordingly. I am attaching the transcript of the communication as this was unbelievable.
    2. The second thing I found out is that after you spend more than 5-10 minutes with the “award winning support” of GoDaddy the agents start to ask you to restart your browser. One of the agents asked me to restart my computer in order for the change of password of admin@{ourdomain.com} to take effect. The reason I assume they are doing this is so that the chat session between me and them stops. In this way the next time I try to get in touch with support I am talking with a new agent.

    GoDaddy could not help. We’ve tried all kinds of things. Waiting for 7 hours, resetting the password of admin@{ourdomain.com} while logged in as admin@{ourdomain.com} and while logged in as username@{ourdomain.com}. None of this helped.

    7 hours in and the email admin@{ourdomain.com}, hosted on GoDaddy with Microsoft 365 software is still compromised.

    Can Microsoft help?

    7 hours in, I tried to get Microsoft support. I was reluctant until now because I knew what the outcome would be, but nevertheless I tried.

    10 minutes after calling Microsoft I got a response from an Agent. The Agent knew a lot of things and was actively trying to help me.

    The first thing he asked me to do was visit admin.microsoft.com. I did, and it redirects to https://productivity.godaddy.com/settings#/mailbox/18071199

    The agent was a little surprised. I have a Microsoft 365 account but I did not have access to admin.microsoft.com and the tools that this portal provides. I only had access to the GoDaddy admin interface, which we had already found out was not working – the password could not be reset from it. It just did not work.

    What I had access to is “admin.exchange.microsoft.com”. This seems to be the admin interface for the Exchange server. I am familiar with the Exchange server and I tried to explain to the agent that there is no way to reset the password from the Exchange admin interface.

    We spent 20-30 minutes looking through all the options of the Exchange admin interface, but there are no tools there to manage the user admin@{ourdomain.com}.

    When you buy Microsoft 365 from GoDaddy you get access to admin.exchange.microsoft.com where you can manage the Exchange server, but you do not get access to admin.microsoft.com. You cannot reset the password for a mailbox from admin.exchange.microsoft.com, only through admin.microsoft.com, but you don’t have access to admin.microsoft.com.

    Can we workaround this in the Exchange admin interface?

    We tried – me and the support agent. There are options to add additional roles to the Organization Management from the Exchange admin server. We tried it for about 20 minutes, but we could not make it work.

    Can we workaround this from Azure?

    The Microsoft support agent asked me to go to portal.azure.com. I had a lot of hope. In the Azure interface we could again see the users in the Active Directory. When we tried to change the password for admin@{ourdomain.com} from the portal.azure.com interface we got an error that we don’t have the license to change the password. I will later attach a screenshot here.

    How did we resolve it?

    More than 24 hours after the moment I made the audit and considered the admin@{ourdomain.com} compromised I got a response from Microsoft support. I had to go to https://www.godaddy.com/help/sign-out-of-all-devices-32032

    This is an article that specifically says “When working to secure a compromised Microsoft 365 account, sign out of all sessions and devices.”

    This article was sent to me by Microsoft support. This means that GoDaddy has been here before – they even wrote an article. None of the 3 support agents knew about this article. I did not know about this article.

    The solution was to visit https://myaccount.microsoft.com/ and to click “Sign out everywhere”.

    Does this really resolve it?

    In a GoDaddy+Microsoft setup to reset the password of username_to_reset@{ourdomain.com} while we are logged in as username@{ourdomain.com} we must:

    1. Get access to username_to_reset@{ourdomain.com}
    2. Reset the password for username_to_reset@{ourdomain.com}, receive a new email at username_to_reset@{ourdomain.com} and follow the instructions for how to reset the password. Note that this password reset does not in any way prevent the users that have access to username_to_reset@{ourdomain.com} from continuing to access it.
    3. Then sign in at GoDaddy with the new password for username_to_reset@{ourdomain.com} and go to https://myaccount.microsoft.com/. How do you get to https://myaccount.microsoft.com/ from the GoDaddy site? – I don’t know.
    4. After arriving at https://myaccount.microsoft.com/ you must click “Sign out everywhere”

    My conclusion

    I only need to change the password of a mailbox. The Microsoft+GoDaddy setup does not provide me with the tools to adequately manage users and mailboxes. I don’t know what else I would be missing down the road, but if it takes 24 hours and 4 support agents to find out how to reset a password, I guess other things will be even more difficult.

    I could live on any stack and tools. If my team was not using so many Microsoft tools I would close all Microsoft+GoDaddy inboxes and tools and move out of this stack, as it is not proving productive for the administration I need to do. But it is a team effort. If the team is more productive with the tools Microsoft is providing, then we just have to factor the cost of having a compromised email for 24 hours as a cost of business.

    But there was support, wasn’t there?

    Yes, I spent a total of 6 hours on the line with 4 different support agents. There was support, but support does not solve this.

    I don’t like AWS, but I’ve been a client of AWS for 7 years and I’ve managed some complex infrastructure. I have had 0 support requests with AWS in 7 years. This is what support should look like: 0 minutes. I spent ~6 hours in total on GoDaddy+Microsoft 365 support with 3 agents from GoDaddy and 1 from Microsoft to resolve my case. No wonder I am kind of reluctant to deploy anything on Microsoft in the future.

     
  • kmitov 4:19 pm on June 13, 2021 Permalink |

    Dependencies – one more variable adding to the “cost of the code” 

    One thing I have to explain a lot is what the costs of software development are. Why do things take so long? Why is there any need for maintenance and support? Why do developers spend a significant amount of their time looking over the existing code base, and why can we not just add the next feature and the next one?

    Today I have an example of this – dependencies.

    The goal of this article is to give people more understanding of how the tech works. I’ve seen that every line of code and every dependency that we add to a project will inevitably result in further costs down the road, so we should really stay free of unnecessary dependencies and features.

    Daily builds

    Many contemporary professional software projects have a daily build. This means that every day at least once the project is “built” from zero, all the tests are run and we automatically validate that the customers could use it.

    Weekly dependencies updates

    Every software project depends on libraries that implement common functionality and features. Having few dependencies is healthy for the project, but having no dependencies and implementing everything on your own is just not viable in today’s world.

    These libraries and frameworks that we depend on also regularly release new versions.

    My general rule that I follow in every project is that we check for new versions of the dependencies every Wednesday at around 08:00 in the morning. We check for new dependencies, we download them, we build the project and we run the specs/tests. If the tests fail this means that the new dependencies that we’ve downloaded have somehow changed the behavior of the project.

    Dependencies change

    Most of the time dependencies are changed in a way that does not break any of the functionality of your project. This week was not such a week. A new dependency came along and it broke a few of the projects.

    The problem came from a change in two dependencies:

    Fetching websocket-driver 0.7.5 (was 0.7.4)
    Fetching mustache-js-rails 4.2.0.1 (was 4.1.0)
    Installing mustache-js-rails 4.2.0.1 (was 4.1.0)
    Installing websocket-driver 0.7.5 (was 0.7.4) with native extensions
    

    We have installed new versions of two of the dependencies – “websocket-driver” and “mustache-js-rails”.

    These two dependencies broke the builds.

    Why should we keep up to date

    Now, out of the blue, we have to resolve this problem. This takes time. Sometimes it is 5 minutes. Sometimes it could be an hour or two. If we don’t do it, it will probably result in more time at a later stage. As the change in ‘mustache-js-rails’ is new, we have the chance to get in touch with the developers of the library and resolve the issue while it is fresh for them and they are still “in the context” of what they were doing.

    Given the large number of dependencies that each software project has, there is a constant need to keep up to date with the recent versions of your dependencies.

    What if we don’t keep up to date?

    I have one such platform. We decided 6-7 years ago not to invest any further in it. It is still working but it is completely out of date. Any new development would cost basically the same as developing the platform brand new. That’s the drawback of not keeping up to date. And it happens even with larger systems on a state level – see the famous search for COBOL developers, because a state did not invest in keeping their platform up to date for some 30+ years.

     
  • kmitov 6:41 am on June 5, 2021 Permalink |

    Yet another random failing spec 

    (Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

    This article is about a random failing spec. I spent more than 5 hours on this trying to track it down so I decided to share with our team what has happened and what the stupid mistake was.

    Random failing

    Random failing specs pass most of the time and fail sometimes. The context of their failure seems to be random.

    Context

    At FLLCasts.com we have categories. There was an error when people were visiting the categories. We receive each and every error by email, and some of the categories stopped working because of a wrong SQL query. After the migration from Rails 6.0 to Rails 6.1 some of the queries started working differently, mostly because of eager loads, and we had to change them.

    The spec

    This is the code of the spec

     scenario "show category content" do
        category = FactoryBot.create(:category, slug: SecureRandom.hex(16))
        episode = FactoryBot.create(:episode, :published_with_thumbnail, title: SecureRandom.hex(16))
        material = FactoryBot.create(:material, :published_with_thumbnail, title: SecureRandom.hex(16))
        program = FactoryBot.create(:program, :published_with_thumbnail, title: SecureRandom.hex(16))
        course = FactoryBot.create(:course, :published_with_thumbnail, title: SecureRandom.hex(16))
    
        category.category_content_refs << FactoryBot.create(:category_content_ref, content: episode, category: category)
        category.category_content_refs << FactoryBot.create(:category_content_ref, content: material, category: category)
        category.category_content_refs << FactoryBot.create(:category_content_ref, content: program, category: category)
        category.category_content_refs << FactoryBot.create(:category_content_ref, content: course, category: category)
    
        expect(category.category_content_refs.count).to eq 4
        visit "/categories/#{category.to_param}"
    
        find_by_xpath_with_page_dump "//a[@href='/tutorials/#{episode.to_param}']"
        find_by_xpath_with_page_dump "//a[@href='/materials/#{material.to_param}']"
        find_by_xpath_with_page_dump "//a[@href='/programs/#{program.to_param}']"
        find_by_xpath_with_page_dump "//a[@href='/courses/#{course.to_param}']"
    
      end

    We add a few objects to the category and then we check that we see them when we visit the category.

    The problem

    Sometimes when running the spec only one of the objects in the category is shown. Sometimes none, and most of the time all of them are shown.

    The debug process

    The controller

    def show
      @category_content_refs ||= @category.category_content_refs.published
    end

    In the category we just call published to get all the published content that is in this category. There are other things in the show action but they are not relevant. We were using apply_scopes, we were using other concerns.

    The model

      scope :published, lambda {
        include_contents.where(PUBLISHED_OR_COMING_WHERE_SQL)
      }

    The scope in the model makes a query for published or coming.

    And the query, I kid you not, that was committed in 2018 – we’ve had this query for that long – was:

    class CategoryContentRef < ApplicationRecord
       
        PUBLISHED_OR_COMING_WHERE_SQL = [' (category_content_refs.content_type = \'Episode\' AND (episodes.published_at <= ? OR episodes.is_visible = true) ) OR
         (category_content_refs.content_type = \'Course\' AND courses.published_at <= ?) OR
         (category_content_refs.content_type = \'Material\' AND (materials.published_at <= ? OR materials.is_visible = true) ) OR
         category_content_refs.content_type=\'Playlist\'', *[Time.now.utc.strftime("%Y-%m-%d %H:%M:%S")]*4].freeze
    
    end
    

    I will give you a hint that the problem is with this query.

    You can take a moment and try to see where the problem is.

    The query problem

    The problem is with the .freeze and the constant in the class. The query is initialized when the class is loaded. Because of this it takes the time at the moment of loading the class and not the time of the query.

    Because the specs are fast, sometimes the class is loaded right before the spec and sometimes there are specs executed in between.

    It seems simple once you see it, but these are the kind of things that you keep missing while debugging. They are right in front of your eyes and yet sometimes you just can’t see them, until you finally see them – and then you cannot unsee them.
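    One possible fix, as a sketch (not necessarily the exact change we shipped), is to build the condition inside the lambda so the timestamp is taken at query time instead of at class load time:

    class CategoryContentRef < ApplicationRecord
      scope :published_or_coming, lambda {
        now = Time.now.utc
        where("(category_content_refs.content_type = 'Episode' AND (episodes.published_at <= :now OR episodes.is_visible = true)) OR " \
              "(category_content_refs.content_type = 'Course' AND courses.published_at <= :now) OR " \
              "(category_content_refs.content_type = 'Material' AND (materials.published_at <= :now OR materials.is_visible = true)) OR " \
              "category_content_refs.content_type = 'Playlist'", now: now)
      }
    end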

     