Thursday, November 17, 2011

Understanding URLs - protocols, domains and ports

URLs are at the heart of the internet. So let us make sure we understand them clearly.

URLs are of the form:
protocol://domain:port/something/something/something/something/..
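
To see these parts pulled apart, here is a minimal sketch using Python's standard urllib.parse module. The URL itself (www.example.com, port 8080, the path) is made up for illustration.

```python
from urllib.parse import urlsplit

# A made-up URL, used only to show the parts.
parts = urlsplit("http://www.example.com:8080/dir1/dir2/file")

print(parts.scheme)    # the protocol
print(parts.hostname)  # the domain
print(parts.port)      # the port as an integer, or None if not given
print(parts.path)      # the /something/something part
```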

In browsers, the protocol is almost always http or https.
But the URL format also allows ftp, sftp, or even your_custom_protocol.

Protocol tells the browser how to communicate with the server.

If you are communicating with an HTTP server, you need to specify the protocol as http. If you specify a different protocol, say ftp, you will get an error - because the browser will be talking Greek to a French server.

Secondly, note that while the URL format does not stop you from specifying any protocol, even your_custom_protocol, it will work only if the browser knows how to communicate in that protocol. Browsers can do http and https; many have also supported ftp, while sftp usually needs a separate client program.

Domain is the DNS name of the server - like www.yahoo.com, www.google.com, or www.apache.org.

The browser first asks a DNS server (typically the one run by your network service provider) for the IP address that is registered for that domain. Once it knows the IP address, it can contact the server.
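
The lookup step can be tried out with a small sketch. It uses socket.getaddrinfo from Python's standard library, which hands the question to the operating system's resolver. The function name resolve is my own, and the answer you get depends on your network.

```python
import socket

def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses for a host name."""
    # getaddrinfo returns tuples whose last element is the socket address;
    # the first item of that address is the IP as a string.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})
```

Calling resolve("www.apache.org") on a connected machine would return whatever addresses your DNS server reports for that domain.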

Port is very often not specified, and defaults to 80 for http (21 for ftp, 22 for sftp, 443 for https).
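
The defaulting rule can be written down as a small sketch. The table holds only the four defaults listed above; the function name effective_port and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Default ports for the protocols mentioned above.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21, "sftp": 22}

def effective_port(url):
    """Return the port in the URL, or the protocol's default if none is given."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # Returns None for a protocol that is not in the table.
    return DEFAULT_PORTS.get(parts.scheme)
```

For example, effective_port("http://www.apache.org/") gives 80, while effective_port("http://www.apache.org:212/") gives 212.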

What exactly is port?
Port comes from TCP. Suppose you run two different network services on the same computer - say service1 and service2. When a network packet is received, how can TCP know which service to give the packet to? It uses a number called port. Port is just an integer. TCP stores it in 16 bits, so it can be anything from 1 to 65535. That said, several port numbers have been reserved and are called well-known ports - 80 for HTTP, 21 for FTP, etc.

When you start a TCP/IP service on a computer, you have to specify a port number.
When clients try to access this service, they need to specify the port number.
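
Here is a rough sketch of two services sharing one computer, told apart only by port. It is a toy, not a real server: the helper names start_service and ask are my own, both services run on the loopback address 127.0.0.1, and the operating system is asked to pick free ports so the example does not collide with anything already running.

```python
import socket
import threading

def start_service(reply):
    """Listen on a free loopback port and answer every client with `reply`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 asks the OS to choose a free port
    server.listen()
    port = server.getsockname()[1]

    def serve():
        while True:
            conn, _ = server.accept()
            with conn:
                conn.sendall(reply)  # connection closes when the block ends

    threading.Thread(target=serve, daemon=True).start()
    return port

def ask(port):
    """Connect to the given port and read until the service closes the connection."""
    chunks = []
    with socket.create_connection(("127.0.0.1", port)) as client:
        while True:
            data = client.recv(1024)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

port1 = start_service(b"service1")
port2 = start_service(b"service2")
print(port1, ask(port1))
print(port2, ask(port2))
```

Same computer, same IP address, and yet each client reaches the right service, because the port number tells TCP where to deliver the connection.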

If you create your_custom_service, and you start it on the server at port 998, then your URL will have to specify 998.

If you start an HTTP server, but use a non-default port, say 212 - then you will need to specify 212 in your URL. Otherwise, the browser will use 80 by default - and the connection will fail, because no HTTP service is listening on port 80.

The last part of the URL /something/something/something - is of no importance to the browser. It doesn't care to understand it. It just passes it as a string to the web server. It is for the server to interpret it.
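
To make "passes it as a string" concrete, here is a simplified sketch of the text an HTTP client sends once it has connected. Real browsers add many more headers, and this sketch ignores query strings. The function name build_request is hypothetical.

```python
from urllib.parse import urlsplit

def build_request(url):
    """Build a bare-bones HTTP/1.1 GET request for the given URL."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path is sent as "/"
    host = parts.hostname
    if parts.port is not None:
        host += ":" + str(parts.port)
    # The path goes into the request line untouched; only the server
    # decides what it means.
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"

print(build_request("http://www.apache.org:212/something/something/something"))
```

Notice that the protocol and port were used up in making the connection. What travels to the server is the path, plus the domain in the Host header.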

How does the server interpret it?

Depends on the server.

A simple HTTP server can interpret it as dir1/dir2/dir3/file
A Java web server can interpret it as context/servlet/args
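
The two readings can be sketched side by side. Both functions are illustrations only: the names, the /var/www document root and the example paths are assumptions of mine, and real servers have far more elaborate rules.

```python
from pathlib import PurePosixPath

def as_file_path(url_path, doc_root="/var/www"):
    """Read the path as dir1/dir2/dir3/file under a document root."""
    # Drop empty, "." and ".." segments so the result stays under doc_root.
    parts = [p for p in url_path.split("/") if p not in ("", ".", "..")]
    return str(PurePosixPath(doc_root, *parts))

def as_servlet_call(url_path):
    """Read the path as context/servlet/args."""
    parts = [p for p in url_path.split("/") if p]
    context = parts[0] if parts else ""
    servlet = parts[1] if len(parts) > 1 else ""
    args = parts[2:]
    return context, servlet, args

print(as_file_path("/dir1/dir2/dir3/file"))
print(as_servlet_call("/shop/cart/add/42"))
```

The same kind of string produces a file location in one case and a piece of code to run in the other.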

We will learn more about this (something/something/something) part of the URL later.
For now, know that the browser does not interpret it at all.
