
  • kmitov 4:30 am on September 8, 2021
    Tags: airflow, apache, bigdata   

    Orchestration of BigData with Apache Airflow 

    It was a pleasure for me to give this presentation and discuss how we can orchestrate BigData with Apache Airflow at the 2021 OpenFest event.

    The video is in Bulgarian.

     
  • kmitov 5:48 am on September 6, 2021

    Refresh while waiting with RSpec+Capybara in a Rails project 

    This is some serious advanced stuff here. You should share it.

    A colleague, looking at the git logs

    I recently had to create a spec with Capybara+RSpec where I refresh the page and wait for a value to appear on it. In this particular scenario there is no need for WebSockets or any JS. We just need to refresh the page.

    But how do we test it?

    # Expect that the records page will show the correct value of the record.
    # We must do this in a loop, because we keep refreshing the page
    # until the value appears.
    #
    # Timeout.timeout stops the execution after Capybara.default_max_wait_time.
    # Timeout is part of Ruby's standard library.
    require "timeout"

    Timeout.timeout(Capybara.default_max_wait_time) do
      loop do
        # Visit the page. Visiting the same page a second time
        # reloads it.
        visit "/records"
        # The smart thing here is the wait: 0 param.
        # By default find_all waits up to Capybara.default_max_wait_time for
        # matching elements to appear, to give JS time to render them.
        # There is no JS here, so we want to check the page as it is,
        # without waiting.
        #
        # Passing "wait: 0" checks once and returns immediately.
        break if find_all(:xpath, "//a[@href='/records/#{record.to_param}' and text()='Continue']", wait: 0).any?

        # If we could not find our record, sleep for 0.25 seconds and try again.
        sleep 0.25
      end
    end
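    As a variation on the loop above, here is a minimal sketch of a reusable helper that checks a deadline on the monotonic clock instead of wrapping the loop in Timeout.timeout. Timeout.timeout raises Timeout::Error into the block from a separate thread when time is up, so the error can land in the middle of a visit; the sketch below only checks the deadline between attempts. The name refresh_until and its keyword arguments are hypothetical, not part of Capybara.

```ruby
# Hypothetical helper: call the block repeatedly until it returns a truthy
# value, or raise once the deadline has passed. It uses the monotonic clock,
# so wall-clock changes do not affect it, and it never interrupts the block
# mid-call; it only checks the deadline between attempts.
def refresh_until(timeout:, interval: 0.25)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end

    sleep interval
  end
end

# Usage inside a spec (sketch; `record` comes from the spec's own setup):
#   refresh_until(timeout: Capybara.default_max_wait_time) do
#     visit "/records"
#     find_all(:xpath, "//a[@href='/records/#{record.to_param}' and text()='Continue']", wait: 0).any?
#   end
```

    The trade-off is that a block which itself hangs is not cut short, so this fits checks like the one above, where each attempt is a quick page load.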

    I hope it is helpful.

    Want to keep in touch? Find me on LinkedIn or Twitter.

     
  • kmitov 10:49 am on September 3, 2021
    Tags: aws, cloudflare, nginx

    When the policeman becomes the criminal – how Cloudflare attacks my machines. 

    On the Internet you are nobody until someone attacks you.

    It gets even more interesting when the attack comes from someone with practically unlimited resources and when these are the same people that are supposed to protect you.

    This article is the story of how Cloudflare started an “attack” on a machine at the FLLCasts platform. This increased the traffic to the machine by about 10x, and AWS started charging the account 10x more. I managed to stop them, and I hope my experience is useful for CTOs, sysadmins, devops engineers and anyone else who would like to understand more and look out for such cases.

    TL;DR

    Current status: after all the investigation, it turns out that when a client makes a HEAD request for a file, the request hits the Cloudflare infrastructure. Cloudflare then sends a GET request to the account machine and caches the file. This changed on 28 August. Before 28 August, when clients sent HEAD requests, Cloudflare sent HEAD requests to the machine (which generate practically no traffic). Since 28 August clients still send HEAD requests, but Cloudflare now sends GET requests, generating terabytes of additional traffic that is not needed.

    Increase of the Bill

    On 28 August 2021 I got a notification from AWS that the account was close to exceeding its budget for the month. This was not surprising, as it was the end of the month, but I decided to check anyway. It turned out that the traffic to one of the machines had increased 10x in a day. Nothing else had increased: no visits, no other resources, just the traffic to this one particular machine.
    That was strange. This has been going on for 7 days now, and this is the increase in traffic:

    AWS increase of the bill

    Limit billing on AWS

    My first thought was: “How can I set a global limit on AWS spending for this account? I don’t want to wake up with $50K in traffic charges the next day.”

    The answer is “You can’t”. There is no way to set a global spending limit for an AWS account. I already knew this, but I decided to check again with support, and yes, you can’t set such a limit. This means that AWS provides all the tools for you to be bankrupted by a third party, and they are not willing to limit it.

    Limit billing on Digital Ocean

    I have some machines on Digital Ocean, so I checked there: “Can I set a global spending limit for my account, so that I am no longer charged and all my services stop once my spending goes above X dollars?”
    The answer was again: “No. Digital Ocean does not provide it.”

    Should there be a global limit on spending on cloud providers?

    My understanding is: yes. There is a break-even point where users come to your service and generate revenue, while delivering the service costs you money. Once it costs you more to deliver the service than the revenue the service generates, I would personally prefer to stop the service. There is no need for it to keep running. Otherwise you could wake up with a $50K bill.

    AWS monitoring

    I had the bill from AWS, so I looked at the monitoring.
    There is a spike every day between 03:00 AM UTC and 05:00 AM UTC. This spike increases the traffic by hundreds of gigabytes. It could easily be terabytes next time.
    The conclusion is that the machine is heavily loaded during this time.

    AWS monitoring

    Nginx access.log

    Looking at the access log I see that there are a lot of requests by machines that are using a user agent called ‘curl’. ‘curl’ is a popular tool for accessing files over HTTP and is heavily used by different bots. But bots tend to identify themselves.

    This is what the access.log looks like:

    172.68.65.227 - - [30/Aug/2021:03:26:02 +0000] "GET /f9a13214d1d16a7fb2ebc0dce9ee496e/file1.webm HTTP/1.1" 200 27755976 "-" "curl/7.58.0"

    Parsing the log file

    I have years of bash experience, and a couple of commands later I have a list of all the IPs and how many requests we’ve received from each of them.

    grep curl access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -n
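    For readers who would rather do this in Ruby than in a shell pipeline, a minimal sketch of the same counting: keep the lines that mention curl, take the first space-separated field (the client address, which with Cloudflare in front is the Cloudflare edge server), and count occurrences, sorted ascending like sort -n. The method name count_curl_ips is hypothetical.

```ruby
# Count requests per client IP for log lines containing "curl", mirroring:
#   grep curl access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -n
def count_curl_ips(lines)
  lines
    .select { |line| line.include?("curl") }   # grep curl
    .map { |line| line.split(" ", 2).first }   # first field: the client IP
    .tally                                     # sort | uniq -c
    .sort_by { |ip, count| [count, ip] }       # sort -n (ties broken by IP)
end

# Run as: ruby count_curl_ips.rb /var/log/nginx/access.log
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  count_curl_ips(File.foreach(ARGV[0])).each do |ip, count|
    printf("%7d %s\n", count, ip)
  end
end
```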

    The result is 547 machines. The full log file is available at Full list of Cloudflare IPs attacking my machine. Some of the IPs in the full list are not from Cloudflare. The top 20 are listed below: the first column is the number of requests, the second is the IP of the machine.

    Requests  IP
        113 172.69.63.18
        117 172.68.65.107
        150 172.70.42.135
        158 172.70.42.161
        164 172.69.63.82
        167 172.70.42.129
        169 172.69.63.116
        170 172.68.65.231
        173 172.68.65.101
        178 172.69.63.16
        178 172.70.42.143
        188 173.245.54.236
        264 172.70.134.69
        268 172.70.134.117
        269 172.70.134.45
        287 172.70.134.153
        844 172.70.34.131
        866 172.70.34.19
        904 172.70.34.61
        912 172.70.34.69

    These are Cloudflare machines!

    The requests come from 547 different machines, most of which are Cloudflare machines. These seem to be servers that Cloudflare runs.
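    One way to confirm that an address belongs to Cloudflare is to check it against the IP ranges Cloudflare publishes at https://www.cloudflare.com/ips/. Below is a minimal sketch with Ruby's standard IPAddr. The three IPv4 ranges are only a subset of that published list, hard-coded here as an assumption for illustration; load the current list before relying on the result. The method name cloudflare_ip? is hypothetical.

```ruby
require "ipaddr"

# ASSUMPTION: a small subset of Cloudflare's published IPv4 ranges, copied
# for illustration only. Fetch the full, current list for real use.
CLOUDFLARE_RANGES = %w[
  172.64.0.0/13
  162.158.0.0/15
  173.245.48.0/20
].map { |cidr| IPAddr.new(cidr) }.freeze

# True when the address falls inside one of the ranges above.
# Strings that are not valid IP addresses return false.
def cloudflare_ip?(ip)
  addr = IPAddr.new(ip)
  CLOUDFLARE_RANGES.any? { |range| range.include?(addr) }
rescue IPAddr::InvalidAddressError
  false
end
```

    Fed with the IPs from the table above, every one of the top 20 falls inside these three ranges.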

    How does Cloudflare work?

    For this particular FLLCasts account and this particular machine, years ago I set up Cloudflare to sit in front of the machine to help protect the account from internet attacks.

    The way Cloudflare works is that only Cloudflare knows the IP address of our machine. This is the promise Cloudflare makes: because only they know the IP address of the machine, only they know which IP address a given domain points to. When a user points their browser at “http://domainname”, the internet directs the request to Cloudflare; Cloudflare checks whether the request is OK and then, and only then, forwards it to our machine. In the meantime Cloudflare tries to help businesses like the platform by caching the content. When Cloudflare receives a request for a file, it checks whether the file is already cached in the Cloudflare infrastructure and sends a request to the account machine only if there is no cached copy.

    In a nutshell Cloudflare maintains a cache for the content the platform is delivering.

    Image is from Cloudflare support at https://support.cloudflare.com/hc/en-us/articles/205177068-How-does-Cloudflare-work-

    What is broken?

    Cloudflare maintains a cache of the platform resources. Every night between 03:00 AM UTC and 05:00 AM UTC some 547 Cloudflare machines decide to update their cache and start sending requests to our server. That is 10x more requests than the machine generally receives from all users. The content on the server does not change; it has been the same content for years. But for the last 7 days Cloudflare has been caching the same content every night on 547 machines.

    And AWS bills us for this.

    Can Cloudflare help?

    I created a ticket. The response was along the lines of “You are not subscribed for our support, you can get only community support”. Fine.
    I called them on the phone early in the morning.
    I called enterprise sales and I asked them.

    Me - "Hi, I am under attack. Can you help?"
    They - "Yes, we can help. Who is attacking you?"
    Me - "Well, you are. Is there an enterprise package I could buy so that you can protect me against your attack?"

    Luckily the guy on the phone caught my sense of humor and urgency and quickly organized a meeting with a product representative. Regrettably there were no solution engineers on this call.

    Both guys were very knowledgeable, but I had difficulties explaining that it was actually Cloudflare causing the traffic increase. I had all the data from AWS, from the access.log files, but the support agents still had some difficulty accepting it.

    To be clear – I don’t think that Cloudflare is maliciously causing this. There is no point. What I think has happened is some misconfiguration on their side that caused this for the last 7 days.

    What do I think has happened?

    I tried to explain to the support agents that there are three scenarios, all of which Cloudflare is responsible for, plus a fourth, unlikely one on our side.

    1. Option 1 – “someone that has 547 machines is trying to attack the FLLCasts account and Cloudflare is failing to stop it”. First this is very unlikely. Nobody will invest in starting 547 machines just to make the platform pay a few dollars more this month. And even if this is the case, this is what Cloudflare should actually prevent, right? Option 1: “Cloudflare is failing in preventing attacks” (unlikely)


    2. Option 2 – “only Cloudflare knows the IP of this domain name and they have been compromised”. The connection between the domain name and the IP address is something that only Cloudflare knows about. If a third party knows the domain name and is able to find the IP address, this means that Cloudflare has been compromised. Option 2: “Cloudflare is compromised” (possible, but again, unlikely)

    3. Option 3 – “there is a misconfiguration in some of the Cloudflare servers”. I don’t like looking for malicious activity where everything could be explained with simple ignorance or a mistake. Most likely there is a misconfiguration in the Cloudflare infrastructure that is causing these servers to behave in this way. Option 3: “There is a misconfiguration in Cloudflare infrastructure”

    4. Option 4 – “there is a mistake on our end”. As there is basically nothing on our end, and that nothing has not changed in years, the probability of this being the case is minimal.

    On a support call we set a plan with the support agents to investigate it. I will change the public IP of the AWS machine and will reconfigure it on Cloudflare. In this way we hope to stop some of the requests. We have no plan for what to do after that.

    Can I block it at the Nginx level?

    Nginx is an HTTP server serving files. There are a couple of options to explore there, but the most reasonable one was to stop all curl requests to the Nginx server. This was the shortest path. There was no need to protect against other attacks, only against the Cloudflare “attack”. The Cloudflare “attack” was using “curl” as a tool, so I decided to stop ‘curl’.

      # Surely not the best, but the simplest, and it will get the job done for now.
      if ($http_user_agent ~ 'curl') {
          return 444; # 444 is a non-standard nginx code that closes the connection without sending a response.
      }

    Resolution

    I am now waiting to see if the change of the public IP of the AWS machine will have any impact and if not I am just rejecting all “curl” requests that seem to be what Cloudflare is using.

    Update 1

    The first solution we decided to implement was:

    Change the public IP of the AWS machine and update it in the DNS settings at Cloudflare. In this way we would make sure that only Cloudflare really knows this IP.

    Result: it did not work!

    I knew it wouldn’t, because it was another way for support to get me to do something without really looking into the issue, but I went along with it. Better to exhaust this option and be sure.

    The traffic of a Cloudflare-attacked machine. Changing the IP address on 3 September had no effect.

    Update 2

    Adding the CF-Connecting-IP header

    Cloudflare support was really helpful. They asked me to include CF-Connecting-IP in the logs. In this way we would know the real IP that is making the requests and whether these are in fact Cloudflare machines.

    The header is described at https://support.cloudflare.com/hc/en-us/articles/200170986-How-does-Cloudflare-handle-HTTP-Request-headers-

    I then updated the Nginx configuration:

    log_format  cloudflare_debug     '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for" "$http_cf_connecting_ip"';
    
    access_log /var/log/nginx/access.log cloudflare_debug;
    

    Now the log file contained the original IP.
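    With the extra field in place, the traffic can be attributed to the real client instead of the Cloudflare edge server. A minimal sketch, assuming lines are written exactly in the cloudflare_debug format above: it sums $body_bytes_sent per real client IP (the last quoted field, $http_cf_connecting_ip) and HTTP method. The constant and method names are hypothetical.

```ruby
# Matches one line written with the cloudflare_debug log_format above.
CLOUDFLARE_DEBUG_LINE = /
  \A(?<remote_addr>\S+)\s-\s(?<remote_user>\S+)\s
  \[(?<time>[^\]]+)\]\s
  "(?<method>[A-Z]+)\s(?<path>\S+)\s[^"]*"\s
  (?<status>\d{3})\s(?<bytes>\d+)\s
  "(?<referer>[^"]*)"\s"(?<user_agent>[^"]*)"\s
  "(?<forwarded_for>[^"]*)"\s"(?<cf_connecting_ip>[^"]*)"
/x

# Sum body bytes sent per [real client IP, HTTP method].
# Lines that do not match the format are skipped.
def bytes_by_client(lines)
  totals = Hash.new(0)
  lines.each do |line|
    next unless (m = CLOUDFLARE_DEBUG_LINE.match(line))

    totals[[m[:cf_connecting_ip], m[:method]]] += m[:bytes].to_i
  end
  totals
end

# Run as: ruby bytes_by_client.rb /var/log/nginx/access.log
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  bytes_by_client(File.foreach(ARGV[0])).sort_by { |_key, bytes| -bytes }.each do |(ip, method), bytes|
    puts "#{bytes}\t#{method}\t#{ip}"
  end
end
```

    A large GET total next to a client that, on its own side, only ever sends HEAD requests is exactly the pattern described in the next section.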

    Cloudflare is making GET when client makes a HEAD request

    This is what I found out. The platform has a daily job that checks the machine and makes sure the files are OK. This integrity check is left over from years ago, when we needed it. It is still running, and every night it checks the machine with HEAD requests. But on 28 August 2021 Cloudflare started making GET requests, and this increased the traffic to the machine.

    Steps to reproduce

    Here are the steps to reproduce:

    1. I am sending a HEAD request with ‘curl -I’

    2. Cloudflare has not cached the file so there is “cf-cache-status: MISS”

    3. Cloudflare sends a GET request and gets the whole file

    4. Cloudflare responds to the HEAD request.

    5. I send a HEAD request again with ‘curl -I’

    6. Cloudflare has the file cached and there is a “cf-cache-status: HIT”

    7. The account server is not hit.

    The problem here is that I am sending a HEAD request for my file and Cloudflare is sending a GET request for the whole file in order to cache it.
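    To make the change in behaviour concrete, here is a toy model of a pull-through cache that counts the bytes the origin sends. It is not Cloudflare's implementation; the class EdgeCache, its get_on_head_miss flag and the status strings are hypothetical. With the flag off, a HEAD miss is forwarded as a HEAD (headers only), as before 28 August; with the flag on, a HEAD miss is answered by fetching the whole file with GET and caching it, as in the steps above.

```ruby
# Toy pull-through cache that counts bytes sent by the origin.
class EdgeCache
  attr_reader :origin_bytes

  # origin: hash of path => file size in bytes
  # get_on_head_miss: true models the behaviour observed after 28 August
  def initialize(origin, get_on_head_miss:)
    @origin = origin
    @get_on_head_miss = get_on_head_miss
    @cache = {}
    @origin_bytes = 0
  end

  # Handles a client HEAD request and returns an illustrative cache status.
  def head(path)
    return "HIT" if @cache.key?(path)

    if @get_on_head_miss
      @origin_bytes += @origin.fetch(path) # the full GET body crosses the wire
      @cache[path] = true                  # and the file is now cached
    end
    # Otherwise the HEAD is forwarded as a HEAD: headers only, no body bytes.
    "MISS"
  end
end

files = { "/file1.webm" => 2_256_504 }
after = EdgeCache.new(files, get_on_head_miss: true)
p [after.head("/file1.webm"), after.head("/file1.webm")] # => ["MISS", "HIT"]
p after.origin_bytes                                      # => 2256504
```

    The responses above carry cache-control: max-age=14400, which is 4 hours. Assuming Cloudflare honours it, each nightly run would find the copies on every edge server expired and pay the full GET again.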

    Commands to reproduce

    This is a HEAD request:

    $ curl -I https://domain.com/file1.webm
    HTTP/2 200
    date: Sat, 04 Sep 2021 07:09:11 GMT
    content-type: video/webm
    content-length: 2256504
    last-modified: Sat, 04 Jan 2014 14:24:01 GMT
    etag: "52c81981-226e78"
    cache-control: max-age=14400
    cf-cache-status: MISS
    accept-ranges: bytes
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=Xg9TLgssa5Gm6j1fRlJZH8VahaoY21LdCE1W1JqVueu49mzdiTmh9MZp4pFZDsVeSmRg%2Bc%2FMryoN7tgmKUmdxhWzE7UZdVvgG%2FRxHSZ%2FYS6pDtxLwpXSD71jo5ADNyT4TSpKXtE%3D"}],"group":"cf-nel","max_age":604800}
    nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
    server: cloudflare
    cf-ray: 689564111e594ee0-FRA
    alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

    This is the log right after the HEAD request. Note that I am sending a HEAD request to domain.com and Cloudflare is sending a GET request for the file.

    162.158.94.236 - - [04/Sep/2021:07:09:12 +0000] "GET /file1.webm HTTP/1.1" 200 2256504 "-" "curl/7.68.0" "188.254.161.195" "188.254.161.195"

    Then I send a second HEAD request:

    $ curl -I https://domain.com/file1.webm
    HTTP/2 200
    date: Sat, 04 Sep 2021 07:09:53 GMT
    content-type: video/webm
    content-length: 2256504
    last-modified: Sat, 04 Jan 2014 14:24:01 GMT
    etag: "52c81981-226e78"
    cache-control: max-age=14400
    cf-cache-status: HIT
    age: 42
    accept-ranges: bytes
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=CKSvpGGHoj5LfV6xXpPUK5kHJtdsX3fylgt%2F2%2B6G94oUsdAd8FnHmUgEUIgnj5dd2Vvsv%2BKQxxgsHdHA0RvpjTxATakFKFuirMeI%2FS3lAdDX5VA0tY74z0CRYEHM2rS%2Fld6K738%3D"}],"group":"cf-nel","max_age":604800}
    nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
    server: cloudflare
    cf-ray: 689565175dffc29f-FRA
    alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

    And then there is NOTHING in the log file

    Note that for the last HEAD request there is a “cf-cache-status: HIT”.

    Status and how it could be resolved?

    Yes, we are doing HEAD requests every day to the files in order to check that they are all working. Every day we send a HEAD request for every file to make sure all files are up to date. This has been going on for years and is a left over of an integrity check we implemented in 2015.
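    The integrity check itself is a set of curl -I calls. As a hedged Ruby sketch of the same idea with the standard Net::HTTP: send a HEAD request and compare the reported Content-Length with the size we expect. The method name file_intact? and the choice of Content-Length as the definition of "ok" are assumptions for illustration. Note that, as described above, a HEAD sent through Cloudflare can still turn into a full GET at the origin, so the check is cheap for the client but not necessarily for the machine behind Cloudflare.

```ruby
require "net/http"
require "uri"

# Hypothetical integrity check: a HEAD request transfers only headers,
# so we compare the reported Content-Length with the expected size.
def file_intact?(url, expected_length)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: 5, read_timeout: 5) do |http|
    response = http.head(uri.request_uri)
    response.is_a?(Net::HTTPSuccess) &&
      response["content-length"].to_i == expected_length
  end
end

# Usage (sketch): file_intact?("https://domain.com/file1.webm", 2_256_504)
```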

    What changed on 28 August 2021 is that when Cloudflare receives a HEAD request for a file, it sends a GET request to our machine in order to cache the file. This is what changed, and this is what is generating all the traffic.

    We send HEAD requests with ‘curl -I’.

    I have 30 weeks of log files that show that Cloudflare was sending HEAD requests like

    I have asked Cloudflare

    Could you please rollback this change in the infrastructure and do not send a GET request to our machine when you receive a HEAD request from a client?

    Let’s see how this will be resolved.

    Up to date conclusion

    Check your machines from time to time. I hope you don’t get in this situation.

    Want to keep in touch? Find me on LinkedIn or Twitter.

     
  • kmitov 9:48 am on August 27, 2021
    Tags: customer

    How we lost $1000 because we did not talk to the customer early enough 

    This content is password protected. To view it please enter your password below:

     
  • kmitov 12:34 pm on August 17, 2021
    Tags: admin, microsoft   

    GoDaddy+Microsoft 365 and how an email was compromised for about a day 

    In two hours I have a C-suite meeting, and one of the topics will be our internal stack and whether we stay with Microsoft+GoDaddy or migrate.

    This article is my objective summary of:

    1. How Microsoft+GoDaddy keep an email account compromised for more than a day
    2. What is difficult with the stack of Microsoft+GoDaddy
    3. Why can’t we just migrate to Microsoft without GoDaddy
    4. Why would I like to stop using Microsoft

    I hope other companies that have found themselves in this situation will be able to make the right decision given my experience.

    Note: This article is as of 2021-08-17. Things may change. I hope they will.

    Why GoDaddy?

    When the project was initially formed the domain {ourdomain.com} was bought from GoDaddy. Nothing for and nothing against. Since then the emails were added at GoDaddy.

    Why GoDaddy+Microsoft 365?

    GoDaddy offers Microsoft 365. You can purchase an Email+Office plan that gives you a Microsoft 365 email.

    Why not migrate out of GoDaddy and use only Microsoft?

    As we onboarded more people onto the team, we saw that keeping both GoDaddy and Microsoft would be difficult. I tried to migrate us to Microsoft only, so that email, Office and everything else would come from Microsoft and we would not be handling two services.

    After spending about a day on this, it turned out not to be possible. I even have a ticket open with GoDaddy support that should have been resolved in 72 hours, but almost a month later I still have no notification of whether it is resolved. The issue is that I, as an admin, cannot redirect the emails to be received at onmicrosoft.com while we are migrating. This means there will be a period of time when people will not receive emails. I also cannot export the users’ emails. I have to log in as every user, but I don’t know their passwords, so they would have to reset their passwords and share them with me, and I would then export their mailboxes through the desktop Outlook application and import them again. This would easily take days of communication and coordination. Yes, there is no “export all emails” and “import all emails”. It has to be done by hand, manually, for every user, in sync with that user. There simply is no such tool available from Microsoft in the GoDaddy+Microsoft setup.

    When migrating from GoDaddy+Microsoft 365 to Microsoft you should manually log in with each user and manually export and import each and every mailbox and manually sync with the users to give you their password, because as admin you can not change the password. This could take days, if not weeks for a team.

    Because of this we’ve decided to postpone this migration.

    How did an email get compromised?

    During my regular security audit I found out that I don’t know who has access to admin@{ourdomain.com}. This is the admin email. I have access to it. A couple of colleagues have access to it. But I don’t know who else has access to it.

    Naturally I tried to reset the password for this account.

    The way I tried to reset the password is

    1. Go to GoDaddy.
    2. Log in with my account username@{ourdomain.com} and try to change the password for admin@{ourdomain.com}.
    3. The site returned that the password was successfully changed.
    4. Then I asked my colleague who has access to admin@{ourdomain.com} through the Desktop version of Outlook to see if he still has access.
    5. He still did. It did not matter that I changed the password.

    I have changed the password for admin@{ourdomain.com}, but users that do not know the new password still have access to the email through the desktop version of Microsoft Outlook

    The implications here are huge. This means that I don’t know who has access to admin@{ourdomain.com} and there is no way I could prevent them from accessing it.

    The only way would be for them to willingly sign out and try to sign in again. But I cannot rely on that, because I now consider the email to be compromised. From the moment I audited this email until the moment I know exactly who has access to it, I consider it compromised. Probably nobody else had access to it.

    But Microsoft and GoDaddy do not provide me with the tools to check who has access and to prevent people from accessing it, even after I changed the password.

    Can GoDaddy support help?

    It should be mentioned that GoDaddy documentation says that it might take up to 30 minutes for this password change to be reflected. I am ok with this. Not the best security, but I am ok.

    I waited for 120 minutes before getting in touch with GoDaddy support.

    After spending a total of 4 hours with 3 different agents of GoDaddy we could not resolve the issue. What I found out is the following:

    1. The only solution the GoDaddy support agents could suggest was to ask my colleagues to sign out of admin@{ourdomain.com}. I could not explain to them that I don’t know who has access and that I want to prevent any access to this email. They kept insisting that I should ask people to sign out, and they could not understand that I consider the email compromised and that we should act accordingly. I am attaching the transcript of the communication, as this was unbelievable.
    2. The second thing I found out is that after you spend more than 5-10 minutes with the “award winning support” of GoDaddy, the agents start asking you to restart your browser. One of the agents asked me to restart my computer for the password change of admin@{ourdomain.com} to take effect. I assume they do this so that the chat session between us ends, and the next time I get in touch with support I am talking to a new agent.

    GoDaddy could not help. We tried all kinds of things: waiting for 7 hours, resetting the password of admin@{ourdomain.com} while logged in as admin@{ourdomain.com} and while logged in as username@{ourdomain.com}. None of this helped.

    7 hours in and the email admin@{ourdomain.com}, hosted on GoDaddy with Microsoft 365 software is still compromised.

    Can Microsoft help?

    7 hours in, I tried to get Microsoft support. I was reluctant until now because I knew what the outcome would be, but nevertheless I tried.

    10 minutes after calling Microsoft I got a response from an agent. The agent knew a lot and was actively trying to help me.

    First thing he asked me was to visit admin.microsoft.com. I did and this redirects to https://productivity.godaddy.com/settings#/mailbox/18071199

    The agent was a little surprised. I have a Microsoft 365 account, but I did not have access to admin.microsoft.com and the tools that portal provides. I only had access to the GoDaddy admin interface, which we had already found out was not working: the password could not be reset from it. It just did not work.

    What I had access to is “admin.exchange.microsoft.com”. This seems to be the admin interface for the Exchange server. I am familiar with the Exchange server and I tried to explain to the agent that there is no way to reset the password from the Exchange admin interface.

    We spent 20-30 minutes looking through all the options of the Exchange admin interface, but there are no tools there to manage the user admin@{ourdomain.com}.

    When you buy Microsoft 365 from GoDaddy you get access to admin.exchange.microsoft.com where you can manage the Exchange server, but you do not get access to admin.microsoft.com. You can not reset the password for a mailbox from the admin.exchange.microsoft.com, but only through admin.microsoft.com, but you don’t have access to admin.microsoft.com

    Can we workaround this in the Exchange admin interface?

    We tried, the support agent and I. There are options to add additional roles to Organization Management from the Exchange admin interface. We tried for about 20 minutes, but we could not make it work.

    Can we workaround this from Azure?

    The Microsoft support agent asked me to go to portal.azure.com. I had a lot of hope. In the Azure interface we could again see the users in Active Directory. When we tried to change the password for admin@{ourdomain.com} from portal.azure.com, we got an error saying that we don’t have the license to change the password. I will later attach a screenshot here.

    How did we resolve it?

    More than 24 hours after the moment I made the audit and considered the admin@{ourdomain.com} compromised I got a response from Microsoft support. I had to go to https://www.godaddy.com/help/sign-out-of-all-devices-32032

    This is an article that specifically says “When working to secure a compromised Microsoft 365 account, sign out of all sessions and devices.”

    This article was sent to me by Microsoft support. This means GoDaddy had seen this problem before; they even wrote an article about it. None of the 3 GoDaddy support agents knew about this article. I did not know about it either.

    The solution was to visit https://myaccount.microsoft.com/ and click “Sign out everywhere”.

    Does this really resolve it?

    In a GoDaddy+Microsoft setup to reset the password of username_to_reset@{ourdomain.com} while we are logged in as username@{ourdomain.com} we must:

    1. Get access to username_to_reset@{ourdomain.com}
    2. Reset the password for username_to_reset@{ourdomain.com}, receive a new email at username_to_reset@{ourdomain.com} and follow the instructions for resetting the password. Note that this password reset does not in any way prevent the users who have access to username_to_reset@{ourdomain.com} from continuing to access it.
    3. Then sign in at GoDaddy with the new password for username_to_reset@{ourdomain.com} and go to https://myaccount.microsoft.com/. How do you get to https://myaccount.microsoft.com/ from the GoDaddy site? – I don’t know.
    4. After arriving at https://myaccount.microsoft.com/ you must click “Sign out everywhere”

    My conclusion

    I only need to change the password of a mailbox. The Microsoft+GoDaddy setup does not provide me with the tools to adequately manage users and mailboxes. I don’t know what else I would be missing down the road, but if it takes 24 hours and 4 support agents to find out how to reset a password, I guess other things will be even more difficult.

    I could live with any stack and tools. If my team were not using so many Microsoft tools, I would close all Microsoft+GoDaddy inboxes and tools and move off this stack, as it is not proving productive for what I need to do to administer it. But it is a team effort. If the team is more productive with the tools Microsoft provides, then we just have to factor the cost of having a compromised email for 24 hours into the cost of doing business.

    But there was support, wasn’t there?

    Yes, I spent a total of 6 hours on the line with 4 different support agents. There was support, but support does not solve this.

    I don’t like AWS, but I’ve been an AWS client for 7 years and I’ve managed some complex infrastructure there. I have had 0 support requests with AWS in 7 years. This is what support should look like: 0 minutes. I spent ~6 hours in total on GoDaddy+Microsoft 365 support with 3 agents from GoDaddy and 1 from Microsoft to resolve my case. No wonder I am somewhat reluctant to deploy anything on Microsoft in the future.

     
  • kmitov 4:19 pm on June 13, 2021

    Dependencies – one more variable adding to the “cost of the code” 

    One thing I have to explain a lot is what the costs of software development are. Why are things taking so long? Why is there any need for maintenance and support? Why do developers spend a significant amount of their time looking over the existing code base, and why can we not just add the next feature, and the next?

    Today I have an example of this: dependencies.

    The goal of this article is to give people more understanding of how the “tech works”. I’ve seen that every line of code and every dependency that we add to a project inevitably results in further costs down the road, so we should really stay free of unnecessary dependencies and features.

    Daily builds

    Many contemporary professional software projects have a daily build. This means that every day at least once the project is “built” from zero, all the tests are run and we automatically validate that the customers could use it.

    Weekly dependency updates

    Every software project depends on libraries that implement common functionality and features. Having few dependencies is healthy for the project, but having no dependencies and implementing everything on your own is just not viable in today’s world.

    These libraries and frameworks that we depend on also regularly release new versions.

    The general rule I follow in every project is that we check for new versions of the dependencies every Wednesday at around 08:00 in the morning. We check for new versions, download them, build the project and run the specs/tests. If the tests fail, this means that the new dependencies we’ve downloaded have somehow changed the behavior of the project.
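    A minimal sketch of that weekly routine, assuming a Bundler/RSpec project like the ones in this post. The method name weekly_dependency_check and its injectable command runner are hypothetical, and scheduling is assumed to happen elsewhere, for example through a cron entry such as `0 8 * * 3` (Wednesdays at 08:00).

```ruby
# Hypothetical weekly dependency check: update the gems, then run the specs.
# Scheduling is assumed to be done by cron, e.g. "0 8 * * 3" (Wednesdays, 08:00).
# `run` is injected so the steps can be exercised without touching a real project;
# by default it shells out with Kernel#system, which returns true on exit status 0.
def weekly_dependency_check(run = ->(command) { system(command) })
  # Fetch and install the newest versions allowed by the Gemfile.
  return :update_failed unless run.call("bundle update")

  # A red build here means a new dependency version changed the project's behavior.
  run.call("bundle exec rspec") ? :green : :specs_failed
end
```

    Called without an argument it runs the real commands; anything other than :green means the update needs attention.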

    Dependencies change

    Most of the time dependencies change in a way that does not break any functionality of your project. This week was not such a week. New versions of dependencies came along and broke a few of the projects.

    The problem came from a change in two dependencies:

    Fetching websocket-driver 0.7.5 (was 0.7.4)
    Fetching mustache-js-rails 4.2.0.1 (was 4.1.0)
    Installing mustache-js-rails 4.2.0.1 (was 4.1.0)
    Installing websocket-driver 0.7.5 (was 0.7.4) with native extensions
    

    We installed new versions of two of the dependencies: “websocket-driver” and “mustache-js-rails”.

    These two dependencies broke the builds.
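    To see at a glance which gems moved during such an update, one could scan Bundler's output for the "(was X)" marker shown above. This is a small sketch: changed_gems is a hypothetical helper, and it reads only the Installing lines so that each gem is counted once.

```ruby
# Matches Bundler lines such as
#   "Installing websocket-driver 0.7.5 (was 0.7.4) with native extensions"
UPDATE_LINE = /\AInstalling (?<gem>\S+) (?<new>\S+) \(was (?<old>[^)]+)\)/

# Returns [gem, old_version, new_version] for every gem Bundler upgraded.
def changed_gems(bundler_output)
  bundler_output.each_line.filter_map do |line|
    match = UPDATE_LINE.match(line.strip)
    match && [match[:gem], match[:old], match[:new]]
  end
end
```

    Feeding it the four lines above returns the two upgraded gems with their old and new versions, which is a good starting point for reading their changelogs.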

    Why should we keep up to date?

    Now, out of the blue, we have to resolve this problem. This takes time: sometimes 5 minutes, sometimes an hour or two. If we don’t do it now, it will probably take even more time at a later stage. As the change in “mustache-js-rails” is new, we have the chance to get in touch with the developers of the library and resolve the issue while it is fresh for them and they are still “in the context” of what they were doing.

    Given the large number of dependencies each software project has, there is a constant need to keep up to date with their new versions.

    What if we don’t keep up to date?

    I have one such platform. We decided 6-7 years ago not to invest any further in it. It still works, but it is completely out of date. Any new development would cost basically as much as building the platform from scratch. That’s the drawback of not keeping up to date. It happens even with large state-level systems, as in the famous search for COBOL developers after a state had not invested in keeping its platform up to date for some 30+ years.

     
  • kmitov 6:41 am on June 5, 2021 Permalink |
    Tags: , , , ,   

    Yet another random failing spec 

    (Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

    This article is about a random failing spec. I spent more than 5 hours trying to track it down, so I decided to share with our team what happened and what the stupid mistake was.

    Random failing

    Randomly failing specs pass most of the time and fail only sometimes. The conditions under which they fail seem random.

    Context

    At FLLCasts.com we have categories. There was an error when people visited them. We receive each and every error by email, and some of the categories had stopped working because of a wrong SQL query. After the migration from Rails 6.0 to Rails 6.1, some of the queries started working differently, mostly because of eager loads, and we had to change them.

    The spec

    This is the code of the spec

    scenario "show category content" do
      category = FactoryBot.create(:category, slug: SecureRandom.hex(16))
      episode = FactoryBot.create(:episode, :published_with_thumbnail, title: SecureRandom.hex(16))
      material = FactoryBot.create(:material, :published_with_thumbnail, title: SecureRandom.hex(16))
      program = FactoryBot.create(:program, :published_with_thumbnail, title: SecureRandom.hex(16))
      course = FactoryBot.create(:course, :published_with_thumbnail, title: SecureRandom.hex(16))

      category.category_content_refs << FactoryBot.create(:category_content_ref, content: episode, category: category)
      category.category_content_refs << FactoryBot.create(:category_content_ref, content: material, category: category)
      category.category_content_refs << FactoryBot.create(:category_content_ref, content: program, category: category)
      category.category_content_refs << FactoryBot.create(:category_content_ref, content: course, category: category)

      expect(category.category_content_refs.count).to eq 4
      visit "/categories/#{category.to_param}"

      find_by_xpath_with_page_dump "//a[@href='/tutorials/#{episode.to_param}']"
      find_by_xpath_with_page_dump "//a[@href='/materials/#{material.to_param}']"
      find_by_xpath_with_page_dump "//a[@href='/programs/#{program.to_param}']"
      find_by_xpath_with_page_dump "//a[@href='/courses/#{course.to_param}']"
    end

    We add a few objects to the category and then check that we see them when we visit the category.

    The problem

    Sometimes, while running the spec, only 1 of the objects in the category is shown. Sometimes none are shown. Most of the time all of them are shown.

    The debug process

    The controller

    def show
      @category_content_refs ||= @category.category_content_refs.published
    end

    In the show action we just call published to get all the published content in this category. There are other things in show, but they are not relevant here (we were using apply_scopes and other concerns).

    The model

      scope :published, lambda {
        include_contents.where(PUBLISHED_OR_COMING_WHERE_SQL)
      }

    The scope in the model makes a query for published or coming.

    And the query, I kid you not, which was committed in 2018 and which we have had for so long, was:

    class CategoryContentRef < ApplicationRecord

      PUBLISHED_OR_COMING_WHERE_SQL = [' (category_content_refs.content_type = \'Episode\' AND (episodes.published_at <= ? OR episodes.is_visible = true) ) OR
         (category_content_refs.content_type = \'Course\' AND courses.published_at <= ?) OR
         (category_content_refs.content_type = \'Material\' AND (materials.published_at <= ? OR materials.is_visible = true) ) OR
         category_content_refs.content_type=\'Playlist\'', *[Time.now.utc.strftime("%Y-%m-%d %H:%M:%S")]*4].freeze

    end
    

    I will give you a hint: the problem is in this query.

    Take a moment and try to see where the problem is.

    The query problem

    The problem is the frozen constant in the class. The constant is initialized when the class is loaded, so Time.now is evaluated once, at the moment the class is loaded, and not every time the query runs. Content created by a spec after that moment can look unpublished to the query.

    Because the specs are fast, sometimes the class is loaded right before this spec runs and sometimes other specs are executed in between, so the outcome looks random.

    It seems simple once you see it, but these are the kinds of things you keep missing while debugging. They are right in front of your eyes and yet sometimes you just can’t see them, until you finally do and then you cannot unsee them.
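    The effect can be reproduced in plain Ruby, without Rails. This sketch uses hypothetical names (SQL, FROZEN_CONDITIONS, FRESH_CONDITIONS): the constant array captures the time once, while the lambda builds the bind value on every call, which is what happens when Time.now.utc is written inside the body of a `scope ... lambda { }`. In the model, one possible fix along these lines is to keep only the SQL text in the constant and pass the current time as the bind values from inside the scope's lambda.

```ruby
# The SQL text itself is safe to keep in a constant.
SQL = "episodes.published_at <= ?"

# Buggy pattern: Time.now is evaluated once, when this line is loaded.
FROZEN_CONDITIONS = [SQL, Time.now.utc].freeze

# Fixed pattern: the time is evaluated every time the conditions are built,
# just like code inside a scope's lambda runs on every query.
FRESH_CONDITIONS = -> { [SQL, Time.now.utc] }

loaded_at = FROZEN_CONDITIONS.last
sleep 0.01
puts FROZEN_CONDITIONS.last == loaded_at     # prints true: the "current" time never moves
puts FRESH_CONDITIONS.call.last > loaded_at  # prints true: a fresh timestamp per call
```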

     
  • kmitov 3:19 pm on May 31, 2021 Permalink |
    Tags: , ,   

    When caching is bad and you should not cache. 

    (Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

    On Friday we did some refactoring at FLLCasts.com. We removed Refinery CMS, which is a topic for another article, but one issue popped up: on a specific page, caching was used in a way that made the page very slow. This article is about how and why. It is mainly for our team, as a way to share knowledge among ourselves, but I think the whole community could benefit, especially the Ruby on Rails community.

    TL;DR;

    When you use a cache service, be it MemCachier, Redis or any other, every lookup is a request to that service. It includes a get(key) call and, if the value is not stored in the cache, a set(key, value) call. When the calculation you are doing is simple, caching its result takes more time than doing the calculation again, especially if the calculation is a simple string concatenation.

    Processors (CPUs) are really good at string concatenation and can do a simple one in far less than a millisecond. So if you are about to cache something, make sure it is worth caching. There is absolutely no reason to cache the result of:

    # Simple string concatenation. You calculate the value. No need to cache it.
    value = "<a href=#{link}>Text</a>"

    # The same result, but with caching. `cache` stands for any cache client
    # (Memcached, Redis, ...) and `calculate_hash` for any cache-key function.
    # There isn't a universe in which the code below will be faster than the code above.
    hash = calculate_hash(link)
    cached_value = cache.get(hash)
    if cached_value.nil?
      cached_value = "<a href=#{link}>Text</a>"
      cache.set(hash, cached_value)
    end

    value = cached_value
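    To check the claim on your own machine, here is a rough sketch that times only the string building from the example above, using the monotonic clock from Ruby core. The link value and the iteration count are arbitrary. Compare the printed figure with the get/set latency your cache service actually shows (for example in New Relic); this sketch does not measure the cache.

```ruby
# Rough timing of the string building from the example above (Ruby core only).
# The link value and iteration count are arbitrary; results vary by machine.
link = "/pages/some-page"
iterations = 100_000

start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
iterations.times { "<a href=#{link}>Text</a>" }
elapsed_s = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start

per_call_ms = elapsed_s / iterations * 1000.0
puts format("%.6f ms per concatenation", per_call_ms)
```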

    Context for Rails

    Rails makes caching painfully easy. Any server side generated HTML could be cached and returned to the user.

    <%# The call below will render the partial "page" for every page and will cache the result %>
    <%# Pretty simple, and yet there is something wrong %>
    <%= render partial: "page", collection: @pages, cached: true %>

    What’s wrong is that we open the browser and it takes more than 15 seconds to load.

    Here is a profile result from New Relic.

    As you can see, there are a lot of Memcached calls (around 10) and a lot of set calls. There are also a lot of Postgres find calls. All of this is because of how caching was set up in the platform. After a decent amount of refactoring, the whole “page” partial turns out to be a simple string concatenation:

    <a href="<%= page.path%>"><%= page.title %></a>

    That’s it. We were caching the result of a simple string concatenation, which the CPU does very quickly. Because there were a lot of pages and we made the calls for every one of them, the first time the page was opened all the get(key) and set(key) calls simply took too long and the page returned a “Time out”.

    Conclusion

    You should absolutely use caching and cache the values of your calculations, but only if those calculations take more time than asking the cache for a value. Otherwise it is just not useful.
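    The rule in the conclusion can be written as a back-of-the-envelope comparison. This is a simplified model, not a measurement: every request pays one get round-trip, a miss (probability 1 - hit_ratio) additionally pays the computation and one set round-trip, and the function names are hypothetical.

```ruby
# Simplified cost model (an assumption, not a measurement), all times in seconds.
# Without a cache every request pays compute_s.
# With a cache every request pays one get; a miss also pays compute_s and one set.
def expected_cost_with_cache(compute_s:, roundtrip_s:, hit_ratio:)
  roundtrip_s + (1.0 - hit_ratio) * (compute_s + roundtrip_s)
end

# Caching only pays off when the expected cached path is cheaper than recomputing.
def worth_caching?(compute_s:, roundtrip_s:, hit_ratio:)
  cached = expected_cost_with_cache(compute_s: compute_s, roundtrip_s: roundtrip_s, hit_ratio: hit_ratio)
  cached < compute_s
end
```

    Even with a perfect hit ratio the cached path still costs one round-trip, so when the computation is cheaper than a round-trip, as with the string concatenation above, caching can never win. With illustrative numbers, worth_caching?(compute_s: 0.000001, roundtrip_s: 0.001, hit_ratio: 0.9) returns false.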

     
  • kmitov 9:14 am on May 7, 2021 Permalink |
    Tags: ,   

    “[DOM] Input elements should have autocomplete attributes” 

    (Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

    This is one of the things that could make a platform better. Here is what the warning looks like in the browser console.

    More information at – https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/autocomplete

    The autocomplete attributes allow browsers, extensions and other agents to guess what the user is expected to enter on the page, which makes it easier for the user. For example, an extension could suggest a new password in a password field, or fill in the user’s name in the “name” field.
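    A minimal sketch of a sign-in form whose inputs carry autocomplete attributes, rendered with Ruby's standard ERB library so it runs without Rails. The field names imitate a Devise-style user[email]/user[password] form and are an assumption. The tokens username and current-password are standard autofill values from the HTML specification (a sign-up form would use new-password instead). In a Rails form builder the same attribute can be passed as an option, e.g. `f.password_field :password, autocomplete: "current-password"`.

```ruby
require "erb"

# Hypothetical sign-in form; the field names mimic a Devise-style form (assumption).
# "username" marks the login identifier and "current-password" the existing
# password, so password managers know what to fill in.
SIGN_IN_FORM = ERB.new(<<~HTML)
  <form action="<%= ERB::Util.html_escape(action) %>" method="post">
    <input type="email" name="user[email]" autocomplete="username">
    <input type="password" name="user[password]" autocomplete="current-password">
    <button type="submit">Sign in</button>
  </form>
HTML

def render_sign_in_form(action)
  SIGN_IN_FORM.result_with_hash(action: action)
end

puts render_sign_in_form("/users/sign_in")
```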

    Additionally we don’t like warnings.

    To check out the behavior (if you have a password manager, for example), go to

    https://www.fllcasts.com/users/sign_in

    or

    https://www.buildin3d.com/users/sign_in

     
  • kmitov 8:39 am on May 7, 2021 Permalink |
    Tags: csrf, ,   

    [Rails] Implementing an API token for post requests 

    (Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

    At the BuildIn3D platform we provide clients with an API for sending certain HTTP POST requests. The question is: how do we authenticate them?

    Here is one of the authentication steps: we implemented our own build_token. When the CSRF authenticity_token is available, we use it as well. But it is not always available, because the authenticity_token depends on the session and the session cookie. In some cases there is no session and no cookie, and yet we still need some authentication. Here is how we do it.

    Generate a unique encrypted token on the server side

    The server generates a token based on the passed params. This could be a username, a password or other info.

        def to_build_token
          len   = ActiveSupport::MessageEncryptor.key_len
          salt  = Rails.application.secret_salt_for_build_token # must be the same salt as in build_id_from_token
          key   = ActiveSupport::KeyGenerator.new(Rails.application.secret_key_base).generate_key(salt, len)
          crypt = ActiveSupport::MessageEncryptor.new(key)
          encrypted_data = crypt.encrypt_and_sign(self.build_id)
          Base64.encode64(encrypted_data)
        end

    This returns a new token containing the encrypted build_id.

    encrypted_data = crypt.encrypt_and_sign(self.build_id)
    # We could easily add more things to encrypt, like user, or some params or anything you need to get back as information from the token when it is later submitted

    We can then pass this token to the client. The token could also expire after some time.

    From then on, we require the client to send us this token with every request. This way we know that the client has authenticated with our server.

    Decryption of the token

    What we are trying to extract from the token is the build_id. The token is encrypted, so the user cannot read the secret information it holds, i.e. the build_id.

    def self.build_id_from_token token
      len   = ActiveSupport::MessageEncryptor.key_len
      salt  = Rails.application.secret_salt_for_build_token
      key   = ActiveSupport::KeyGenerator.new(Rails.application.secret_key_base).generate_key(salt, len)
      crypt = ActiveSupport::MessageEncryptor.new(key)
      crypt.decrypt_and_verify(Base64.decode64(token))
    end
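    For readers outside Rails, here is the same idea sketched with Ruby's standard OpenSSL library only. This is not ActiveSupport::MessageEncryptor and not what runs at BuildIn3D: it swaps in AES-256-GCM directly, and the module name BuildToken, the SECRET/SALT placeholders, the hex token layout and the one-hour expiry are all assumptions. As in the Rails version, both sides must derive the key from the same secret and salt. (In Rails itself, encrypt_and_sign also accepts an expires_in: option since Rails 5.2, which covers the expiry mentioned earlier.)

```ruby
require "openssl"
require "json"

# Stdlib-only sketch of an encrypted, tamper-proof build token (AES-256-GCM).
# SECRET, SALT and the token layout (hex of iv + auth tag + ciphertext) are assumptions.
module BuildToken
  SECRET = "replace-with-your-secret_key_base"
  SALT = "build token"
  KEY = OpenSSL::PKCS5.pbkdf2_hmac(SECRET, SALT, 2**16, 32, OpenSSL::Digest.new("SHA256"))
  IV_LEN = 12   # GCM nonce length in bytes
  TAG_LEN = 16  # GCM authentication tag length in bytes

  # Encrypts the build_id together with an expiry timestamp.
  def self.encode(build_id, ttl: 3600, now: Time.now.to_i)
    cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
    cipher.key = KEY
    iv = cipher.random_iv
    cipher.auth_data = ""
    payload = JSON.generate("build_id" => build_id, "exp" => now + ttl)
    ciphertext = cipher.update(payload) + cipher.final
    (iv + cipher.auth_tag + ciphertext).unpack1("H*")
  end

  # Returns the build_id. Raises OpenSSL::Cipher::CipherError if the token was
  # altered or made with another key, and ArgumentError if it has expired.
  def self.decode(token, now: Time.now.to_i)
    raw = [token].pack("H*")
    cipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
    cipher.key = KEY
    cipher.iv = raw.byteslice(0, IV_LEN)
    cipher.auth_tag = raw.byteslice(IV_LEN, TAG_LEN)
    cipher.auth_data = ""
    payload = JSON.parse(cipher.update(raw.byteslice((IV_LEN + TAG_LEN)..)) + cipher.final)
    raise ArgumentError, "build token expired" if payload["exp"] < now
    payload["build_id"]
  end
end
```

    BuildToken.decode(BuildToken.encode(42)) returns 42, while a token whose bytes were altered fails authentication instead of yielding a different build_id.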

    Requiring the param in each post request

    When a POST request is made, we check that the token is present and that it was generated by our server:

    def create
      build_token = params.require("build_token")
      build_id_from_token = Record.build_id_from_token(build_token)
      # ... other logic that now has the build_id from the token
    end

    The build token is one of the things we use with the IS at BuildIn3D and FLLCasts.

    Polar bear approves of our security.

     