HTTP Protocol


Knowing the basics is important. A lot of the information exchanged on the Internet is carried over HTTP. So, if you want to build great web apps, isn’t it obvious that you need to understand what happens behind the scenes?

REST, a very important architectural style nowadays, relies completely on HTTP features, which makes HTTP even more important to understand. If you want to build great RESTful applications, you must understand HTTP first.

Overview of HTTP Basic Concepts

HTTP Definition

The HyperText Transfer Protocol is the protocol that applications use to communicate with each other. In essence, HTTP is in charge of delivering all of the Internet’s media files between clients and servers. That includes HTML, images, text files, movies, and everything in between. And it does this quickly and reliably.

HTTP is an application protocol, not a transport protocol, because we use it for communication at the application layer. This is what the network stack looks like.

Network Stack

Resources

Everything on the internet is a resource, and HTTP works with resources. That includes files, streams, services, and everything else. An HTML page is a resource, a YouTube video is a resource, your spreadsheet of daily tasks on a web application is a resource…

We differentiate one resource from another by giving them URLs (Uniform Resource Locators).

A URL points to the unique location where the resource is located.

Internet Resources

How To Exchange Messages Between a Web Client and a Web Server

Every piece of content, every resource lives on some web server (HTTP server). These servers are expecting HTTP requests for those resources.

But to request a resource from a web server, you need an HTTP client.

The most common type of client is the Web browser. Some of the most popular clients are Google’s Chrome, Mozilla’s Firefox, Opera, Apple’s Safari, and, unfortunately, still the infamous Internet Explorer.

Some Message Examples

Here is an example of one GET and one POST request.

GET request

GET /repos/CodeMazeBlog/ConsumeRestfulApisExamples HTTP/1.1
Host: api.github.com
Content-Type: application/json
Authorization: Basic dGhhbmtzIEhhcmFsZCBSb21iYXV0LCBtdWNoIGFwcHJlY2lhdGVk
Cache-Control: no-cache

POST request

POST /repos/CodeMazeBlog/ConsumeRestfulApisExamples/hooks?access_token=5643f4128a9cf974517346b2158d04c8aa7ad45f HTTP/1.1
Host: api.github.com
Content-Type: application/json
Cache-Control: no-cache

{
  "url": "http://www.example.com/example",
  "events": [
    "push"
  ],
  "name": "web",
  "active": true,
  "config": {
    "url": "http://www.example.com/example",
    "content_type": "json"
  }
}
  1. The first line of the request is reserved for the request line. It consists of the request method name, the request URI, and the HTTP version.
  2. The next few lines represent the request headers. Request headers provide additional information about the request, like the content types the client expects in the response, authorization information, etc.
  3. For a GET request, the story ends right there. A POST request, however, can also carry additional info in the form of a message body. In this case, it is a JSON message with additional info on how to create the GitHub webhook for the given repo specified in the URI.

The request line and each request header must be followed by <CR><LF> (a carriage return and a line feed, \r\n), and there is a single empty line between the message headers and the message body that contains only a CRLF.
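If you want to see how a client builds and sends a request like the GET above, here is a minimal sketch using Python’s standard http.client module. The repository path mirrors the example, and the User-Agent value is just a placeholder name (GitHub’s API expects some User-Agent to be present):

import http.client

# Open a TLS connection to the GitHub API and send a GET request
# similar to the one shown above. HTTPSConnection sets the Host header for us.
conn = http.client.HTTPSConnection("api.github.com")
conn.request(
    "GET",
    "/repos/CodeMazeBlog/ConsumeRestfulApisExamples",
    headers={
        "Accept": "application/json",
        "User-Agent": "http-basics-demo",  # placeholder client name
    },
)
response = conn.getresponse()
print(response.status, response.reason)    # e.g. 200 OK
print(response.getheader("Content-Type"))  # e.g. application/json; charset=utf-8
conn.close()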

Response Message

HTTP/1.1 200 OK
Server: GitHub.com
Date: Sun, 18 Jun 2017 13:10:41 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Status: 200 OK
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4996
X-RateLimit-Reset: 1497792723
Cache-Control: private, max-age=60, s-maxage=60

[
  {
    "type": "Repository",
    "id": 14437404,
    "name": "web",
    "active": true,
    "events": [
      "push"
    ],
    "config": {
      "content_type": "json",
      "insecure_ssl": "0",
      "url": "http://www.example.com/example"
    },
    "updated_at": "2017-06-18T12:17:15Z",
    "created_at": "2017-06-18T12:03:15Z",
    "url": "https://api.github.com/repos/CodeMazeBlog/ConsumeRestfulApisExamples/hooks/144374004",
    "test_url": "https://api.github.com/repos/CodeMazeBlog/ConsumeRestfulApisExamples/hooks/14437404/test",
    "ping_url": "https://api.github.com/repos/CodeMazeBlog/ConsumeRestfulApisExamples/hooks/14437404/pings",
    "last_response": {
      "code": 422,
      "status": "misconfigured",
      "message": "Invalid HTTP Response: 404"
    }
  }
]

The response message is structured pretty much the same as the request, except for the first line, called the status line, which, surprising as it is, carries information about the response status.

MIME types

MIME types represent a standardized way to describe file types on the internet. Your browser has a list of MIME types, and the same goes for web servers. That way we can transfer files in the same fashion regardless of the operating system.

MIME stands for Multipurpose Internet Mail Extensions, because MIME types were originally developed for multimedia email, but they have since been adapted for HTTP and several other protocols.

Every MIME type consists of a type, a subtype, and a list of optional parameters in the following format: type/subtype; optional parameters.

Content-Type: application/json
Content-Type: text/xml; charset=utf-8
Accept: image/gif
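As a quick illustration, Python’s standard mimetypes module maps file extensions to exactly these type/subtype pairs (the file names here are made up):

import mimetypes

# Guess the MIME type of a file from its extension.
print(mimetypes.guess_type("report.html"))  # ('text/html', None)
print(mimetypes.guess_type("photo.gif"))    # ('image/gif', None)
print(mimetypes.guess_type("data.json"))    # ('application/json', None)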

Request Methods

HTTP request methods (also referred to as “verbs”) define the action that will be performed on the resource. HTTP defines several request methods; the most commonly known and used are the GET and POST methods.

A request method can be idempotent[1] or non-idempotent. This is just a fancy term for saying that a method is safe or unsafe to call several times on the same resource[2].

Of the major HTTP verbs, GET, PUT, and DELETE should, according to the standard, be implemented in an idempotent manner, as calling them over and over should not produce a different result. POST, on the other hand, doesn’t need to be idempotent.
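To make the distinction concrete, here is a small, purely illustrative sketch (the host api.example.com and the /items paths are made up): repeating the same PUT should leave the server in the same state, while repeating a POST typically creates a new resource each time.

import http.client

conn = http.client.HTTPSConnection("api.example.com")  # hypothetical server

# Idempotent: sending the same PUT twice leaves the resource in the same state.
for _ in range(2):
    conn.request("PUT", "/items/42", body='{"name": "rocket"}',
                 headers={"Content-Type": "application/json"})
    conn.getresponse().read()  # the response must be read before reusing the connection

# Not idempotent: each POST typically creates a brand new resource.
for _ in range(2):
    conn.request("POST", "/items", body='{"name": "rocket"}',
                 headers={"Content-Type": "application/json"})
    conn.getresponse().read()

conn.close()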

You can find more about how each of these methods works in the HTTP Reference.

Headers

Header fields are colon-separated name-value fields you can find just after the first line of a request or response message. They provide more context to the HTTP messages and inform clients and servers about the nature of the request or response.

There are five types of headers:

  • General headers: These headers are useful to both the server and the client. One good example is the Date header field, which provides information about the time of the message’s creation.
  • Request headers: Specific to request messages. They provide the server with additional information. For example, the Accept: */* header field informs the server that the client is willing to receive any media type.
  • Response headers: Specific to response messages. They provide the client with additional information. For example, the Allow: GET, HEAD, PUT header field informs the client which methods are allowed for the requested resource.
  • Entity headers: These headers deal with the entity body. For example, the Content-Type: text/html header lets the application know that the data is an HTML document.
  • Extension headers: These are nonstandard headers that application developers can construct. Although they are not part of HTTP, it tolerates them.
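To see these name-value pairs in practice, here is a tiny sketch using Python’s standard library that prints the headers a server sends back (example.com is used only as a well-known test host):

import http.client

conn = http.client.HTTPSConnection("example.com")
# HEAD asks for the headers only, without the response body.
conn.request("HEAD", "/", headers={"Accept": "*/*"})
response = conn.getresponse()

for name, value in response.getheaders():
    print(f"{name}: {value}")  # e.g. Content-Type: text/html; charset=UTF-8

conn.close()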

A list of commonly used request and response headers is available in the HTTP Reference.

Status Codes

Status Codes

The status code is a three-digit number that denotes the result of a request. The reason phrase, a human-readable explanation of the status code, comes right after it.

Some examples include:

  • 200 OK
  • 404 Not Found
  • 500 Internal Server Error

The status codes are classified by range into five different groups.
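Since the first digit of a status code determines its group, classifying a code is just an integer division, as in this small sketch:

# Map a status code to its group using the first digit.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code: int) -> str:
    return STATUS_CLASSES[code // 100]

print(status_class(200))  # Success
print(status_class(404))  # Client Error
print(status_class(500))  # Server Error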

Both the status code classification and the entire list of status codes and their meaning can be found in the HTTP Reference.

Architectural Aspects

HTTP cannot function by itself as an application protocol. It needs infrastructure in the form of hardware and software solutions that provide different services and make communication over the World Wide Web possible and efficient.

These are an integral part of our internet life, and you will learn exactly what the purpose of each one of these is, and how it works. This knowledge will help you connect the dots from the first article, and understand the flow of the HTTP communication even better.

Web Servers

The primary function of a Web server is to store the resources and to serve them upon receiving requests. You access the Web server using a Web client (aka Web browser) and in return get the requested resource or change the state of existing ones. Web servers can be accessed automatically too, using Web crawlers.

Servers

Some of the most popular Web servers out there, and probably the ones you’ve heard of, are Apache HTTP Server, Nginx, IIS, GlassFish…

Web servers can vary from very simple and easy-to-use to sophisticated and complicated pieces of software. Modern Web servers are capable of performing a lot of different tasks. The basic tasks a Web server should be able to do are (a minimal sketch follows the list):

  • Set up connection - accept or close client connection.
  • Receive request - read an HTTP request message.
  • Process request - interpret the request message and take action.
  • Access resource - access the resource specified in the message.
  • Construct response - create the HTTP response message.
  • Send response - send the response back to the client.
  • Log transaction - write about the completed transaction in a log file.
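As a toy illustration of those tasks, Python’s standard http.server module bundles all of them into a few lines: it accepts connections, parses requests, maps request URIs to files under the directory it is started from (its docroot), builds and sends responses, and logs every transaction.

from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve files from the current working directory on port 8000.
# SimpleHTTPRequestHandler accepts the connection, parses the request,
# maps the URI to a file, constructs the response, sends it back,
# and logs the transaction to the console.
server = HTTPServer(("", 8000), SimpleHTTPRequestHandler)
server.serve_forever()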

Basic flow of the Web server

Phase 1: Setting up connection

When a web client wants to access the Web server, it tries to open a new TCP connection. On the other side, the server extracts the IP address of the client. After that, it is up to the server to decide whether to accept or close the TCP connection for that client.

If the server accepts the connection, it adds it to the list of existing connections and watches the data on that connection.

It can also close the connection if the client is not authorized or blacklisted (malicious).

The server can also try to identify the hostname of the client by using a “reverse DNS” lookup. This information can help when logging the messages, but hostname lookups can take a while, slowing down the transactions.

Phase 2: Receiving/Processing requests

When parsing the incoming request, the Web server parses the information from the message’s request line, headers, and body (if provided). One thing to note is that the connection can pause at any time, and in that case the server must store the information temporarily until it receives the rest of the data.

High-end Web servers should be able to open many simultaneous connections. This includes multiple simultaneous connections from the same client. A typical web page can request many different resources from the server.

Phase 3: Accessing the resource

Since Web servers are primarily the resource providers, they have multiple ways to map and access the resources.

The simplest way to map the resource is to use the request URI to find the file in the Web server’s filesystem. Typically, the resources are contained in a special folder on the server, called docroot.

For example, the docroot on a Windows server could be located at F:\WebResources. If a GET request asks for the file at /images/codemazeblog.txt, the server translates this to F:\WebResources\images\codemazeblog.txt and returns that file in the response message. When more than one website is hosted on a Web server, each one can have its own separate docroot.
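Here is a small sketch of that mapping (the docroot path is the one from the example above). A real server must also reject paths that try to escape the docroot:

from pathlib import Path

DOCROOT = Path(r"F:\WebResources")

def map_uri_to_file(uri: str) -> Path:
    # Join the request URI onto the docroot and normalize it.
    candidate = (DOCROOT / uri.lstrip("/")).resolve()
    # Refuse anything that escapes the docroot (e.g. /../../secrets.txt).
    if not candidate.is_relative_to(DOCROOT.resolve()):  # Python 3.9+
        raise PermissionError("path escapes the docroot")
    return candidate

print(map_uri_to_file("/images/codemazeblog.txt"))
# On a Windows host: F:\WebResources\images\codemazeblog.txt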

If a Web server receives a request for a directory instead of a file, it can resolve it in a few ways:

  • It can return an error message,
  • return default index file of the directory or
  • scan the directory and return an HTML file listing its contents.

The server may also map the request URI to a dynamic resource - a software application that generates some result. There is a whole class of servers called application servers whose purpose is to connect Web servers to complicated software solutions and serve dynamic content.

Phase 4: Generating and sending the response

Once the server has identified the resource it needs to serve, it forms the response message. The response message contains the status code, response headers, and a response body if one is needed.

If a body is present in the response, the message usually contains the Content-Length header describing the size of the body and the Content-Type header describing the MIME type of the returned resource.

After generating the response, the server chooses the client it needs to send the response to. For non-persistent connections, the server closes the connection once the entire response message is sent.

Phase 5: Logging

When the transaction is complete, the server logs all the transaction information to a file. Many servers provide logging customizations.

Proxy Servers

Proxy servers (proxies) are intermediary servers. They are often found between the Web server and the Web client. Due to their nature, proxy servers need to behave both like a Web client and a Web server.

But why do we need Proxy servers? Why don’t we just communicate directly between Web clients and Web servers? Isn’t that much simpler and faster? Well, simpler it may be, but faster, not really.

Before explaining how proxies are used, we need to get one thing out of the way: the difference between a forward proxy and a reverse proxy.

The forward proxy acts as a proxy for the client requesting the resource from a Web server. It protects the client by filtering requests through the firewall or hiding the information about the client.

Forward Proxy

The reverse proxy, on the other hand, works in exactly the opposite way. It is usually placed behind the firewall[3] and protects the Web servers. For all the clients know, they talk to the real Web server and remain unaware of the network behind the reverse proxy.

Reverse Proxy

Proxies are very useful and their application is pretty wide. Let’s go through some of the ways you can use proxy servers.

  • Compression - Compressing the content directly increases the communication speed. Simple as that.
  • Monitoring - Want to deny access to adult websites to children in the elementary school? The proxy is the right solution for you.
  • Security - Proxies can serve as a single entry point to the entire network. They can detect malicious applications and restrict application-level protocols.
  • Anonymity - Requests can be modified by the proxy to achieve greater anonymity. It can strip the sensitive information from the request and leave just the important stuff. Although sending less information to the server might degrade the user experience, anonymity is sometimes the more important factor.
  • Access control - Pretty straightforward, you can centralize the access control of the many servers on a single proxy server.
  • Caching - You can use the proxy server to cache popular content and thus greatly reduce loading times.
  • Load balancing - If you have a service that gets a lot of “peak traffic”, you can use a proxy to distribute the workload over more computing resources or Web servers. Load balancers route traffic to avoid overloading a single server when the peak happens.
  • Transcoding - Changing the contents of the message body can also be the proxy’s responsibility.

Caching

Web caches are devices that automatically make copies of requested data and save them in local storage.

By doing this, they can:

  • Reduce traffic flow
  • Eliminate network bottlenecks
  • Prevent server overload
  • Reduce the response delay over long distances

So you can clearly say that Web caches improve both user experience and Web server performance. And of course, potentially save a lot of money.

The fraction of requests served from the cache is called the hit rate. It can range from 0 to 1, where 0 means 0% and 1 means 100% of requests served from the cache. For example, a cache that serves 400 of every 1,000 requests from local storage has a hit rate of 0.4 (40%). The ideal goal is, of course, to achieve 100%, but the real number is usually closer to 40%.

Here is what the basic Web cache workflow looks like:

Cache Flow

Gateways, Tunnels, and Relays

In time, as HTTP matured, people found many different ways to use it. HTTP became useful as a framework for connecting different applications and protocols.

Let’s see how.

Gateways

Gateways are pieces of hardware that enable HTTP to communicate with different protocols and applications by abstracting a way to get a resource. They are also called protocol converters, and they are far more complex than routers or switches because they use multiple protocols.

You can, for example, use a gateway to get a file over FTP by sending an HTTP request. Or you can receive an encrypted message over SSL and convert it to HTTP (client-side security accelerator gateways), or convert HTTP to a more secure HTTPS message (server-side security gateways).

Tunnels

Tunnels make use of the CONNECT request method. They enable sending non-HTTP data over HTTP. The CONNECT method asks the tunnel to open a connection to the destination server and to relay the data between client and server.

CONNECT request:

CONNECT api.github.com:443 HTTP/1.0
User-Agent: Chrome/58.0.3029.110
Accept: text/html,application/xhtml+xml,application/xml

CONNECT response:

HTTP/1.0 200 Connection Established
Proxy-agent: Netscape-Proxy/1.1

Unlike a normal HTTP response, the CONNECT response doesn’t need to specify a Content-Type.

Once the connection is established, the data can be sent between client and server directly.

Relays

Relays are the outlaws of the HTTP world; they don’t need to abide by the HTTP laws. They are dumbed-down versions of proxies that relay any information they receive, as long as they can establish a connection using the minimal information from the request messages.

Their sole existence stems from the need to implement a proxy with as little effort as possible. That can potentially lead to trouble, but their use is very situational, and there is certainly a risk-to-benefit ratio to consider when implementing relays.

Web Crawlers

Web Crawler

Web crawlers, also popularly called spiders, are bots that crawl the World Wide Web and index its contents. The Web crawler is an essential tool for search engines and many other websites.

The web crawler is a fully automated piece of software, and it doesn’t need human interaction to work. The complexity of web crawlers can vary greatly, and some of them are pretty sophisticated pieces of software (like the ones search engines use).

Web crawlers consume the resources of the website they are visiting. For this reason, public websites have a mechanism to tell the crawlers which parts of the website to crawl, or to tell them not to crawl anything at all. You can do this by using the robots.txt (robots exclusion standard).

Of course, since it is just a standard, robots.txt cannot prevent uninvited web crawlers from crawling the website. Some of the malicious robots include email harvesters, spambots, and malware.

Some examples of the robots.txt file:

  • This one tells all the crawlers to stay out
    User-agent: *
    Disallow: /
    
  • And this one refers only to these two specific directories and a single file.
    User-agent: *
    Disallow: /somefolder/
    Disallow: /noninterestingstuff/
    Disallow: /directory/file.html
    
  • You can also disallow a specific crawler, like in this case
    User-agent: Googlebot
    Disallow: /private/
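
Well-behaved crawlers check robots.txt before fetching a page, and Python even ships a parser for it in the standard library. A minimal sketch (the URLs are illustrative):

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the robots.txt file

# Ask whether a given crawler may fetch a given URL.
print(parser.can_fetch("Googlebot", "https://www.example.com/private/page.html"))
print(parser.can_fetch("*", "https://www.example.com/index.html"))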
    

But given the vast nature of the World Wide Web, even the most powerful crawlers ever made cannot crawl and index all of it. That’s why they use a selection policy to crawl the most relevant parts of it. Also, the WWW changes frequently and dynamically, so crawlers must use a freshness policy to calculate whether to revisit websites or not. And since crawlers can easily overburden servers by requesting too much too fast, there is a politeness policy in place. Most of the known crawlers poll servers at intervals of 20 seconds to 3-4 minutes to avoid generating too much load on the server.

The mysterious and evil deep web (or dark web) is nothing more than the part of the web that is intentionally not indexed by search engines in order to hide information.

Client Identification

Why is client identification important, and how can Web servers identify Web clients? How is that information used and stored?

Client Identification and Why It’s Extremely Important

Every website, or at least those that care enough about you and your actions, include some form of content personalization.

That personalization includes suggested items if you visit an e-commerce website, “people you might know/want to connect with” on social networks, recommended videos, ads that almost spookily know what you need, news articles that are relevant to you, and so on.

This effect feels like a double-edged sword. On one hand, it’s pretty nifty having personalized, custom content delivered to you. On the other hand, it can lead to confirmation bias, which can result in all kinds of stereotypes and prejudices. Here is an excellent Dilbert comic about that.

Today, content personalization has become such a part of our daily lives that we can’t, and probably don’t even want to, do anything about it.

So, how can Web servers identify you to achieve this effect?

Different Ways to Identify the Client

There are several ways that a web server can identify you:

  • HTTP request headers that carry information about the client
  • The client’s IP address
  • Long (fat) URLs
  • Cookies
  • Login credentials (HTTP authentication)

HTTP authentication is described in more detail in the authentication mechanisms section; the rest are covered below.

HTTP Request Headers Used for Identification

Web servers have a few ways to extract information about you directly from the HTTP request headers.

Those headers are:

  • From - contains the user’s email address, if provided
  • User-Agent - contains information about the Web client
  • Referer - contains the source the user came from
  • Authorization - contains the username and password
  • Client-ip - contains the user’s IP address
  • X-Forwarded-For - contains the user’s IP address (when going through a proxy server)
  • Cookie - contains a server-generated ID label

In theory, the From header would be ideal for uniquely identifying the user, but in practice this header is rarely used due to security concerns about email collection.

The User-Agent header contains information like the browser version and operating system. While this is important for customizing content, it doesn’t identify the user in a more relevant way.

The Referer header tells the server where the user is coming from. This information is used to improve the understanding of user behavior, but less so to identify the user.

While these headers provide some useful information about the client, it is not enough to personalize content in a meaningful way.
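Here is a rough sketch of how a server-side handler could read these identification headers, using Python’s standard http.server module (the port is arbitrary, and the handler only echoes back what it finds):

from http.server import BaseHTTPRequestHandler, HTTPServer

class IdentificationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read identification-related request headers, if the client sent them.
        user_agent = self.headers.get("User-Agent", "unknown")
        referer = self.headers.get("Referer", "none")
        client_ip = self.headers.get("X-Forwarded-For", self.client_address[0])
        cookie = self.headers.get("Cookie", "no cookie")

        body = (f"User-Agent: {user_agent}\n"
                f"Referer: {referer}\n"
                f"Client IP: {client_ip}\n"
                f"Cookie: {cookie}\n").encode()

        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), IdentificationHandler).serve_forever()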

The remaining headers offer more precise mechanisms of identification.

IP Address

The method of client identification by IP address has been used more in the past when IP addresses weren’t so easily faked/swapped. Although it can be used as an additional security check, it just isn’t reliable enough to be used on its own.

Here are some of the reasons why:

  • It describes the machine, not the user
  • NAT firewalls - many ISPs (Internet Service Providers) use NAT firewalls to enhance security and deal with the IP address shortage
  • Dynamic IP addresses - users often get the dynamic IP address from the ISP
  • HTTP proxies and gateways - these can hide the original IP address. Some proxies use Client-IP or X-Forwarded-For to preserve the original IP address

Long (Fat) URLs

It is not that uncommon to see websites utilize URLs to improve the user experience. They add more information to them as the user browses the website, until the URLs become complicated and illegible.

You can see what a long URL looks like when browsing the Amazon store.

https://www.amazon.com/gp/product/1942788002/ref=s9u_psimh_gw_i2?ie=UTF8&fpl=fresh&pd_rd_i=1942788002&pd_rd_r=70BRSEN2K19345MWASF0&pd_rd_w=KpLza&pd_rd_wg=gTIeL&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=&pf_rd_r=RWRKQXA6PBHQG52JTRW2&pf_rd_t=36701&pf_rd_p=1cf9d009-399c-49e1-901a-7b8786e59436&pf_rd_i=desktop

There are several problems when using this approach.

  • It’s ugly
  • Not shareable
  • Breaks caching
  • It’s limited to that session
  • Increases the load on the server

Cookies

Cookies are the best client identification method to date, excluding authentication. They were developed by Netscape, but now every browser supports them.

There are two types of cookies: session cookies and persistent cookies. A session cookie is deleted when the browser is closed, while persistent cookies are saved to disk and last longer. For a session cookie to be treated as a persistent cookie, the Max-Age or Expires attribute needs to be set.

Modern browsers like Firefox and Chrome can keep background processes running when you shut them down so you can resume where you left off. This can result in the preservation of session cookies, so be careful.

So how do the cookies work?

Cookies contain a list of name=value pairs that the server sets using the Set-Cookie or Set-Cookie2 response header. Usually, the information stored in a cookie is some kind of client ID, but some websites store other information as well.

The browser stores this information in its cookie database and returns it when the user visits the page/website next time. The browser can handle thousands of different cookies and it knows when to serve each one.
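Python’s standard http.cookies module can build and parse these name=value pairs. A small sketch of both directions (the values mirror the flow below):

from http.cookies import SimpleCookie

# Server side: build a Set-Cookie header value.
cookie = SimpleCookie()
cookie["Customer"] = "WILE_E_COYOTE"
cookie["Customer"]["path"] = "/acme"
cookie["Customer"]["max-age"] = 3600  # makes the cookie persistent for an hour
print(cookie.output())
# e.g. Set-Cookie: Customer=WILE_E_COYOTE; Path=/acme; Max-Age=3600

# Client side: parse a Cookie header sent back by the browser.
incoming = SimpleCookie()
incoming.load("Customer=WILE_E_COYOTE; Part_Number=Rocket_Launcher_0001")
print(incoming["Part_Number"].value)  # Rocket_Launcher_0001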

Here is an example flow.

  1. User Agent -> Server
    POST /acme/login HTTP/1.1
    [form data]
    
  2. Server -> User Agent
    HTTP/1.1 200 OK
    Set-Cookie2: Customer="WILE_E_COYOTE"; Version="1"; Path="/acme"
    

    The server sends the Set-Cookie2 response header to instruct the User Agent (browser) to store information about the user in a cookie.

  3. User Agent -> Server
    POST /acme/pickitem HTTP/1.1
    Cookie: $Version="1"; Customer="WILE_E_COYOTE"; $Path="/acme"
    [form data]
    

    The user adds an item to the shopping basket.

  4. Server -> User Agent
    HTTP/1.1 200 OK
    Set-Cookie2: Part_Number="Rocket_Launcher_0001"; Version="1"; Path="/acme"
    

    Shopping basket contains an item.

  5. User Agent -> Server
    POST /acme/shipping HTTP/1.1
    Cookie: $Version="1"; Customer="WILE_E_COYOTE"; $Path="/acme";
         Part_Number="Rocket_Launcher_0001";
    [form data]
    

    The user selects the shipping method.

  6. Server -> User Agent
    HTTP/1.1 200 OK
    Set-Cookie2: Shipping="FedEx"; Version="1"; Path="/acme"
    

    New cookie reflects shipping method.

  7. User Agent -> Server
    POST /acme/process HTTP/1.1
    Cookie: $Version="1"; 
         Customer="WILE_E_COYOTE"; $Path="/acme";
         Part_Number="Rocket_Launcher_0001"; $Path="/acme";
         Shipping="FedEx"; $Path="/acme"
    [form data]
    

    That’s it.

There is one more thing: cookies are not perfect either. Besides security concerns, there is also the problem of cookies colliding with the REST architectural style (see the section about misusing cookies).

You can learn more about cookies in the RFC 2965.

Authentication Mechanisms

We already talked about the different ways that websites can use to identify the visiting user.

Identification itself, however, represents just a claim. When you identify yourself, you are claiming that you are someone, but there is no proof of that.

Authentication, on the other hand, is showing proof that you are who you claim to be, like showing your personal ID or typing in your password.

More often than not, the websites need that proof to serve you sensitive resources.

HTTP has its own authentication mechanisms that allow the servers to issue challenges and get the proof they need. We are going to learn about what they are and how they work. We’re also going to cover the pros and cons of each one and find out if they are really good enough to use on their own (spoiler: they are not).

Before venturing deeper into concrete HTTP authentication mechanisms, let’s explore what HTTP authentication is.

How Does HTTP Authentication Work?

Authentication is a way to identify yourself to the Web server. You need to show proof that you have the right to access the requested resources. Usually, this is done by using a combination of username and password (key and secret) which the server validates and then decides if you can access the resource.

HTTP offers two authentication protocols:

  • Basic Authentication
  • Digest Authentication

Before delving into each one, let’s go through some of the basic concepts.

Challenge/Response Authentication Framework

What does this mean?

It means that when someone sends a request, instead of responding to it immediately, the server sends an authentication challenge. It challenges the user to provide proof of identity by entering secret information (a username and password).

After that, the request is repeated using the provided credentials, and if they are correct, the user gets the expected response. In case the credentials are wrong, the server can reissue the challenge or just send the error message.

The server issues the challenge by utilizing the WWW-Authenticate response header. It contains the information about the authentication protocol and the security realm.

After the client inputs the credentials, the request is sent again. This time with the Authorization header containing the authentication algorithm and the username/password combination.

If the credentials are correct, the server returns the response and additional info in an optional Authentication-Info response header.

Security Realms

Security realms provide a way to associate different access rights with different resource groups on the server. They are also called protection spaces.

What this means effectively is that depending on the resource you want to access, you might need to enter different credentials.

The server can have multiple realms. For example, one would be for website statistics information that only website admins can access.

/admin/statistics/financials.txt -> Realm="Admin Statistics"

Another would be for website images that other users can access and upload images to.

/images/img1.png -> Realm="Images"

When you try to access financials.txt, the server will challenge you, and the response will look like this:

HTTP/1.0 401 Unauthorized
WWW-Authenticate: Basic realm="Admin Statistics"

More about security realms: https://tools.ietf.org/html/rfc7235#section-2.2

Simple HTTP authentication example

Now let’s connect the dots by looking at the simplest HTTP authentication example (Basic authentication, explained below):

  1. User Agent -> Server
    The user requests access to some image on the server.
    GET /gallery/personal/images/image1.jpg HTTP/1.1
    Host: www.somedomain.com
    
  2. Server -> User Agent
    The server sends the authentication challenge to the user.
    HTTP/1.1 401 Access Denied
    WWW-Authenticate: Basic realm="gallery"
    
  3. User Agent -> Server
    The user identifies itself via form input.
    GET /gallery/personal/images/image1.jpg HTTP/1.1
    Authorization: Basic Zm9vOmJhcg==
    
  4. Server -> User Agent
    The server checks the credentials and sends back the 200 OK status code and the image data.
    HTTP/1.1 200 OK
    Content-type: image/jpeg
    ...<image data>
    

    Not that complicated, right?

Now let’s drill down and look into basic authentication.

Basic Authentication

The most prevalent and supported authentication protocol out there. It has been around since HTTP/1.0 and every major client implements it.

The example above depicts how to authenticate by using Basic authentication. It’s rather simple to implement and use, but it has some security flaws.

Before going to the security issues, let’s see how the Basic authentication deals with username and password.

Basic authentication packs the username and password into one string and separates them using the colon (:). After that, it encodes them using the Base64 encoding. Despite what it looks like, the scrambled sequence of characters is not secure and you can decode it easily.

The purpose of the Base64 encoding is not to encrypt, but to make the username and password HTTP-compatible. The main reason for that is that you can’t use international characters in HTTP headers.

GET /gallery/personal/images/image1.jpg HTTP/1.1
Authorization: Basic Zm9vOmJhcg==

The Zm9vOmJhcg== from this example is nothing more than the Base64-encoded string “foo:bar”.
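You can verify this with a couple of lines of Python, since Base64 is a reversible encoding, not encryption:

import base64

credentials = "foo:bar"
encoded = base64.b64encode(credentials.encode()).decode()
print(encoded)                             # Zm9vOmJhcg==
print(base64.b64decode(encoded).decode())  # foo:bar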

So anyone listening to the requests can easily decode and use the credentials.

Even worse than that, encrypting the username and password wouldn’t help. A malicious third party could still send the scrambled sequence to achieve the same effect.

There is also no protection against proxies or any other type of attack that changes the request body and leaves the request headers intact.

So, as you can see, Basic authentication is a less-than-perfect authentication mechanism.

Still, despite that, you can use it to prevent accidental access to protected resources, and it offers a degree of personalization.

To make it more secure and usable, Basic authentication can be combined with HTTPS (HTTP over SSL/TLS).

Some would argue it’s only as secure as your transport mechanism.

Digest Authentication

Digest authentication is a more secure and reliable alternative to simple but insecure Basic authentication.

So, how does it work?

Digest authentication uses MD5 cryptographic hashing combined with the usage of nonces. That way it hides the password information to prevent different kinds of malicious attacks.

This might sound a bit complicated, but it will get clearer with a simple example.

Example

  1. User Agent -> Server
    GET /dir/index.html HTTP/1.0
    Host: localhost
    

    The client sends an unauthenticated request.

  2. Server -> User Agent
    HTTP/1.0 401 Unauthorized
    WWW-Authenticate: Digest realm="shire@middleearth.com", qop="auth,auth-int", nonce="cmFuZG9tbHlnZW5lcmF0ZWRub25jZQ", opaque="c29tZXJhbmRvbW9wYXF1ZXN0cmluZw"
    Content-Type: text/html
    Content-Length: 153
    
    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="UTF-8">
      <title>Error</title>
    </head>
    <body>
      <h1>401 Unauthorized</h1>
    </body>
    </html>
    

    The server challenges the client to authenticate using the Digest authentication and sends the required information to the client.

  3. User Agent -> Server
    GET /dir/index.html HTTP/1.0
    Host: localhost
    Authorization: Digest username="Gandalf",
                          realm="shire@middleearth.com",
                          nonce="cmFuZG9tbHlnZW5lcmF0ZWRub25jZQ",
                          uri="/dir/index.html",
                          qop=auth,
                          nc=00000001,
                          cnonce="0a4f113b",
                          response="5a1c3bb349cf6986abf985257d968d86",
                          opaque="c29tZXJhbmRvbW9wYXF1ZXN0cmluZw"
    

    The client calculates the response value and sends it together with username, realm, URI, nonce, opaque, qop, nc, and cnonce. A lot of stuff.

  4. Server -> User Agent
    HTTP/1.0 200 OK
    Content-Type: text/html
    Content-Length: 2345
    ... <content data>
    

    The server computes the hash on its own and compares the two. If they match it serves the client with the requested data.

Detailed Explanation

Let’s define these:

  • nonce and opaque - the server defined strings that client returns upon receiving them
  • qop (quality of protection) - one or more of the predefined values (“auth” | “auth-int” | token). These values affect the computation of the digest.
  • cnonce - the client nonce, which must be generated if qop is set. It is used to avoid chosen-plaintext attacks and to provide message integrity protection.
  • nc - the nonce count, which must be sent if qop is set. This directive allows the server to detect request replays by maintaining its own copy of this count - if the same nc value appears twice, the request is a replay.
  • The response attribute is calculated in the following way:
HA1 = MD5("Gandalf:shire@middleearth.com:Lord Of the Rings")
    = 681028410e804a5b60f69e894701d4b4

HA2 = MD5("GET:/dir/index.html")
    = 39aff3a2bab6126f332b942af96d3366

Response = MD5("681028410e804a5b60f69e894701d4b4:cmFuZG9tbHlnZW5lcmF0ZWRub25jZQ:00000001:0a4f113b:auth:39aff3a2bab6126f332b942af96d3366")
         = 5a1c3bb349cf6986abf985257d968d86
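Here is the same computation as a few lines of Python, assuming qop=auth and the values from the example above:

import hashlib

def md5_hex(data: str) -> str:
    return hashlib.md5(data.encode()).hexdigest()

nonce = "cmFuZG9tbHlnZW5lcmF0ZWRub25jZQ"
nc, cnonce, qop = "00000001", "0a4f113b", "auth"

ha1 = md5_hex("Gandalf:shire@middleearth.com:Lord Of the Rings")  # username:realm:password
ha2 = md5_hex("GET:/dir/index.html")                              # method:uri
response = md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
print(response)  # the value sent in the "response" directive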

If you are interested in learning how the response is computed depending on qop, you can find the details in RFC 2617.

Short Summary

As you can see, Digest authentication is more complicated to understand and implement.

It is also more secure than Basic authentication, but still vulnerable to man-in-the-middle attacks. RFC 2617 recommends that Digest authentication be used instead of Basic authentication since it remedies some of its weaknesses. It also doesn’t hide the fact that Digest authentication is still weak by modern cryptographic standards. Its strength largely depends on the implementation.

So, in summary, Digest authentication:

  • Does not send plain text passwords over the network
  • Prevents replay attacks
  • Guards against message tampering

Some of the weaknesses:

  • Vulnerability to the man-in-the-middle attack
  • Many of the security options are optional; if they are not set, Digest authentication functions in a less secure manner.
  • Prevents the use of strong password hashing algorithms when storing passwords.

Due to these facts, Digest authentication still hasn’t gained major traction. Basic authentication is much simpler and, combined with SSL/TLS, is still more secure than Digest authentication.

Security

Many companies have been victims of security breaches. To name just a few prominent ones: Dropbox, LinkedIn, MySpace, Adobe, Sony, and Forbes were on the receiving end of malicious attacks. Many accounts were compromised, and the chances are, at least one of those was yours.

You can actually check that on Have I Been Pwned.

There are many aspects of Web application security, too many to cover in one article, but let’s start right from the beginning. Let’s see how to secure our HTTP communication first.

Do You Really Need HTTPS?

You might be thinking: “Surely, not all websites need to be protected and secured.” If a website doesn’t serve sensitive data or doesn’t have any form submissions, it would be overkill to buy certificates and slow the website down just to get the little green mark in the URL bar that says “Secure”.

If you own a website, you know it is crucial that it loads as fast as possible, so you try not to burden it with unnecessary stuff.

Why would you willingly go through the painful process of migrating to HTTPS just to secure a website that doesn’t need to be protected in the first place? And on top of that, you even need to pay for the privilege.

Is it worth the trouble?

HTTPS Encrypts Your Messages and Solves the MITM Problem

The HTTP authentication mechanisms have their security flaws. The problem that both Basic and Digest authentication cannot solve is the Man in the Middle attack. Man in the middle represents any malicious party that inserts itself between you and the website you are communicating with. Its goal is to intercept the original messages both ways and hide its presence by forwarding the modified messages.

Man In The Middle

Original participants of the communication might not be aware that their messages are being listened to.

HTTPS solves the MITM attack problem by encrypting the communication. Now, that doesn’t mean that your traffic can no longer be listened to. It does mean that anyone who listens in and intercepts your messages won’t be able to see their content. To decrypt the message, you need the key.

HTTPS as a Ranking Signal

Not that long ago, Google made HTTPS a ranking signal.

What does that mean?

It means that if you are a webmaster and you care about your Google ranking, you should definitely implement HTTPS on your website. Although it’s not as potent as some other signals, like quality content and backlinks, it definitely counts.

By doing this, Google gives incentive to webmasters to move to HTTPS as soon as possible and improve the overall security of the internet.

It’s Completely Free

To enable HTTPS (SSL/TLS) for a website you need a certificate issued by a Certificate Authority. Until recently, certificates were costly and had to be renewed every year.

Thanks to the folks at Let’s Encrypt, you can now get certificates at a very affordable price ($0!). They are completely free.

Let’s Encrypt certificates are easy to install, have the support of major companies, and are backed by a great community of people.

Let’s Encrypt issues DV (domain validation) certificates only and has no plans to issue organization validation (OV) or extended validation (EV) certificates. The certificates last for 90 days and are easily renewed.

Like any other great technology, it has a downside too. Since certificates are so easily available now, they are being abused by phishing websites.

It’s All About the Speed

Many people associate HTTPS with additional speed overhead. Take a quick look at http://httpvshttps.com/.

Here are my results for HTTP and HTTPS: HTTP vs HTTPS

So what happened there? Why is HTTPS so much faster? How is that possible?

HTTPS is, in practice, a requirement for using the HTTP/2 protocol: browsers only support HTTP/2 over encrypted connections.

If we look at the network tab, we will see that in the HTTPS case the images were loaded over the HTTP/2 protocol. And the waterfall looks very different too.

HTTP/2 is the successor of the currently prevalent HTTP/1.1.

It has many advantages over HTTP/1.1:

  • It’s binary, instead of textual
  • It’s fully multiplexed, which means it can send multiple requests in parallel over a single TCP connection
  • Reduces overhead by using HPACK compression
  • It uses the new ALPN extension which allows for faster-encrypted connections
  • It reduces additional round-trip times (RTT), making your website load faster
  • Many others

You Will Be Frowned Upon by Browsers

If you are not convinced by now, you should probably know that some browsers have started waging war against unencrypted content. Google has published a blog post that clearly explains how Chrome will treat insecure websites.

Here is how it looked before Chrome version 56.

HTTP shaming 1

And here is how it looks now.

HTTP shaming 2

Moving to HTTPS Is Complicated

This is also a relic of past times. While moving to HTTPS might be harder for websites that have existed for a long time, because of the sheer amount of resources served over HTTP, hosting providers are generally trying to make this process easier.

Many hosting providers offer automatic migration to HTTPS. It can be as easy as clicking one button in the options panel.

If you plan to set up your website over HTTPS, first check whether the hosting provider offers HTTPS migration, or whether it offers shell access so you can do it yourself easily with Let’s Encrypt and a bit of server configuration.

So, these are the reasons to move to HTTPS. Hopefully, by now, you are convinced of the HTTPS value and understand how it works.

HTTPS Fundamental Concepts

HTTPS stands for Hypertext Transfer Protocol Secure. This effectively means that the client and server communicate through HTTP, but over a secure SSL/TLS connection.

You already know how HTTP communication works, but what does the SSL/TLS part stand for, and why do I use both SSL and TLS?

SSL vs TLS

The terms SSL (Secure Sockets Layer) and TLS (Transport Layer Security) are used interchangeably, but in fact, today, when someone mentions SSL they probably mean TLS.

SSL was originally developed by Netscape, but version 1.0 never saw the light of day. Version 2.0 was introduced in 1995 and version 3.0 in 1996, and they were used for a long time after that (at least what is considered long in IT), even though their successor, TLS, had already started gaining traction. Up until 2014, fallback from TLS to SSL was supported by servers, and that was the main reason the POODLE attack was so successful.

After that, the fallback to SSL was completely disabled.

If you check your own or any other website with the Qualys SSL Labs tool, you will probably see something like this:

Website Protocols

My website is hosted on GitHub Pages, and as you can see, it only supports TLS 1.2 (which is the recommended protocol version at the moment). TLS 1.3 support is experimental and thus disabled, SSL 2/3 are disabled for security reasons, and I’m not sure why TLS 1.0/1.1 are not supported.

But because SSL was so prevalent for so long, it became a term that most people are familiar with, and now it’s used for pretty much anything. So when you hear someone using SSL instead of TLS, it is just for historical reasons, not because they really mean SSL.

TLS Handshake

Before the real, encrypted communication between the client and server starts, they perform what is called the “TLS handshake”.

Here is how the TLS handshake works (very simplified).

TLS handshake

The encrypted communication starts after the connection is established.

The actual mechanism is much more complicated than this, but to implement the HTTPS, you don’t need to know all the actual details of the TLS handshake implementation.

What you need to know is that there is an initial handshake between the client and the server, in which they exchange keys and certificates. After that handshake, encrypted communication is ready to start.

If you want to know exactly how the TLS handshake works, you can look it up in RFC 2246.
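If you want to watch the result of a handshake yourself, Python’s ssl module performs one when wrapping a socket. This small sketch prints the negotiated protocol version and cipher for api.github.com (any HTTPS host would do):

import socket
import ssl

context = ssl.create_default_context()  # uses the system's trusted root certificates

with socket.create_connection(("api.github.com", 443)) as raw_socket:
    # wrap_socket performs the TLS handshake: key exchange, certificate
    # verification against the trusted roots, and cipher negotiation.
    with context.wrap_socket(raw_socket, server_hostname="api.github.com") as tls:
        print(tls.version())                 # e.g. TLSv1.3
        print(tls.cipher())                  # the negotiated cipher suite
        print(tls.getpeercert()["subject"])  # the certificate's subject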

In the TLS handshake image, certificates are being sent, so let’s see what a certificate represents and how it’s being issued.

Certificate and Certification Authorities (CAs)

Certificates are a crucial part of secure communication over HTTPS. They are issued by one of the trusted Certification Authorities.

A digital certificate allows the users of a website to communicate with it in a secure fashion.

For example, the certificate you are using when browsing through my blog looks like this:

SSL certificate

I would like to point out two things. The red box shows what the real purpose of the certificate is: it is a Domain Validation certificate. It just ensures that you are talking to the right website. If someone were, for example, to impersonate the website you think you are communicating with, you would certainly get notified by your browser.

That would not prevent malicious attackers from stealing your credentials if they have a legitimate domain with a legitimate certificate. So be careful. The green “Secure” label in the top left just means that you are communicating with the right website. It doesn’t say anything about the honesty of that website’s owner.

Extended validation certificates, on the other hand, prove that a legal entity controls the website. There is an ongoing debate about whether EV certificates are all that useful to the typical user of the internet. You can recognize them by the custom text to the left of your URL bar. For example, when you browse twitter.com you can see:

EV cert example

That means they are using an EV certificate to prove that their company stands behind that website.

Certificate Chains

So why would your browser trust the certificate that the server sends back? Any server can send a piece of digital documentation and claim it is whatever you want it to be.

That’s where root certificates come in. Typically, certificates are chained, and the root certificate is one your machine implicitly trusts.

For my blog, it looks like this:

SSL Root certificate

The lowest one is the GitHub certificate (where I host my blog), which is signed by the certificate above it, and so on, until the root certificate is reached.

But who signs the root certificate? Well, it signs itself.

SSL Root certificate signs itself

And your machine and your browsers have a list of trusted root certificates which they rely upon to ensure that the domain you are browsing can be trusted. If the certificate chain is broken for some reason (which can happen when enabling a CDN), the site will be displayed as insecure on some machines.

By exchanging certificates, the client and server know that they are talking to the right party and can begin the encrypted message transfer.

HTTPS Weaknesses

HTTPS can provide a false sense of security if the site’s backend is not properly implemented. There are a plethora of different ways to extract customer information, and many sites leak data even though they use HTTPS. There are many other mechanisms besides MITM attacks for getting sensitive information from a website.

Another potential problem is that websites can contain HTTP links even though they run over HTTPS. This can be an opening for a MITM attack. While migrating a website to HTTPS, this might go unnoticed.

And here is another one as a bonus: login forms accessed through an insecure page. It’s best to keep the entire website secure to avoid this one.


  1. Idempotence is the property of certain operations/calls whereby they can be applied multiple times without changing the result beyond the initial application, i.e., the system state remains the same after one or several calls.

  2. Idempotent operations are often used in the design of network protocols, where a request to perform an operation is guaranteed to happen at least once, but might also happen more than once, without causing unintended effects (it should not result in a different response). 

  3. A network security system that monitors and controls incoming and outgoing network traffic based on predetermined rules. Firewalls are often categorized as either network firewalls that filter traffic between two or more networks and run on network hardware, or host-based firewalls that run on host computers and control network traffic in and out of those machines.