First, give me a five-sentence summary of HTTP.
HyperText Transfer Protocol (HTTP) is the technology used to pass information between a client application (like Microsoft’s Internet Explorer browser) and a web server. HTTP provides mechanisms to request and deliver text, images, sound, and files of any kind, but not smells. If it’s in your browser, it probably got there via HTTP. HTTP is designed for the delivery of information from a web server to a client application (probably a browser), unlike, say, FTP, which is a two-way protocol for sending stuff in either direction.
HTTP is always used in conjunction with URLs (so the URL is prefixed with “http://”). URLs provide the common language to find what you want, and HTTP provides the common way to ask for the resource that the URL points to and then to interpret what gets passed back.
So what is a cookie really?
HTTP can’t save session state. It’s “stateless”. The server’s memory of the client is completely wiped clean once each HTTP transaction is made. HTTP clients (usually browsers) have to continually retransmit the same information to remind the server about who they are and where they were when they last spoke. That’s what cookies are for. Cookies store information about you on the browser side so it’s easy to retransmit this information. This stateless design was intentional since it makes for easy scaling and failover. Need your web site to go faster or support more users? Just add more web server machines. Those machines don’t have to talk to each other to manage session state since each HTTP transaction has everything needed inside of it.
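The cookie round trip described above can be sketched in a few lines of Python using the standard library’s http.cookies module. This is only an illustration under stated assumptions: the cookie name session_id, the value abc123, and the three helper functions are made up for this example, and a real browser tracks more than this (expiry dates, domains, paths).

```python
from http.cookies import SimpleCookie


def server_first_response(session_id):
    """Build the Set-Cookie header line a server might send on a first visit."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id  # hypothetical cookie name
    return cookie.output()


def browser_next_request(set_cookie_line):
    """Turn the stored Set-Cookie line into the Cookie header the browser resends."""
    stored = SimpleCookie()
    stored.load(set_cookie_line.split(":", 1)[1].strip())
    pairs = "; ".join(f"{name}={morsel.value}" for name, morsel in stored.items())
    return "Cookie: " + pairs


def server_identify(cookie_header_line):
    """The server remembers nothing, so it reads the identity back out of the request."""
    received = SimpleCookie()
    received.load(cookie_header_line.split(":", 1)[1].strip())
    return received["session_id"].value


set_cookie = server_first_response("abc123")
next_request_header = browser_next_request(set_cookie)
print(set_cookie)
print(next_request_header)
print(server_identify(next_request_header))
```

Note that server_identify needs nothing except the incoming header, which is why any machine in a farm of web servers could answer the request.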
Tell me the basics in 5 sentences or less.
As in most protocols, each HTTP message is divided into a header, where information about the message is stored (like the message length, or attributes that designate whether the message body is text or a picture), and a message body (where the actual message contents reside). The protocol also specifies how the message is encoded (for example, what standard should be used to interpret the message bits, or whether data compression is used), how message receipt acknowledgements will be handled, and how transmission errors will be reported. Certainly all excellent topics for a hot date.
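To make the header/body split concrete, here is a minimal sketch that pulls apart a hand-written HTTP response. The sample message and the helper split_message are invented for illustration; a production parser would also have to cope with things this one ignores, such as repeated headers and chunked bodies.

```python
# A hand-written sample response: start line, headers, blank line, then the body.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 20\r\n"
    "\r\n"
    "<p>Hello, world.</p>"
)


def split_message(raw):
    """Split a raw HTTP message into its start line, a header dict, and the body."""
    head, _, body = raw.partition("\r\n\r\n")  # the blank line ends the header
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


start_line, headers, body = split_message(RAW_RESPONSE)
print(start_line)
print(headers["content-type"])    # tells the receiver how to interpret the body
print(headers["content-length"])  # tells the receiver how many bytes to expect
print(body)
```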
What's all this I hear about "Get" and "Post"?
HTTP provides two key routines (formally called “methods”): “GET” and “POST”. The “GET” routine is used to request information from the server, and the parameters of the request are visible in the URL that is passed (as in http://www.PracticingSafeTechs.com?parameter=XYZ). The “POST” routine is normally used to submit information to the server for processing (you hit a “submit” button on a web page, for example), and the parameters of the request are carried in the body of the message rather than in the URL (so the user only sees a request to http://www.PracticingSafeTechs.com, without the parameter=XYZ part).
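The difference is easy to see by building both kinds of request with Python’s standard urllib, without actually sending anything over the network. Treat this as a sketch: the trailing slash on the URL is an assumption added for tidiness, and the parameter name and value are simply the ones from the example above.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://www.PracticingSafeTechs.com/"  # trailing slash added for this sketch
params = {"parameter": "XYZ"}

# GET: the parameters ride along in the URL, where anyone can see them.
get_request = Request(BASE_URL + "?" + urlencode(params), method="GET")

# POST: the same parameters travel in the message body; the URL stays bare.
post_request = Request(BASE_URL, data=urlencode(params).encode("ascii"), method="POST")

for request in (get_request, post_request):
    print(request.get_method(), request.full_url, request.data)
```

Nothing here is transmitted; the Request objects only describe what would be sent.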
Sidebar - So why does everyone use the POST routine to GET information?
The HTTP POST routine passes parameters in the body of the message rather than in the URL itself. So if you need to pass something like a password (even if it is encrypted), security is marginally increased if that password does not appear in the URL where it can be more easily seen. This is an example of one of many scenarios where the POST routine is preferable to the GET routine (where parameters are always visible in the URL), even if the end user is simply requesting information and not really trying to submit anything for processing. Consequently, the POST routine is used to simply GET information all the time. Pretty confusing, huh? HTTP also provides a DELETE operation, but POST is still the routine everyone calls to create, retrieve, update, or delete (CRUD) information on web servers. The software industry only ever uses GET for the simplest, most straightforward requests. In practice, the terms “GET” and “POST” have lost all meaning since POST is used for everything. This will become a big deal in the REST chapter since REST will advocate the return to the good ol’ days when words had meaning.
An HTTP “GET” or “POST” request returns:
- Simple HTML text that constitutes the writing on a web page.
- One or more compressed image files.
- Or information that has been dynamically generated (perhaps by a PHP application that retrieves information from a database). Usually that information is designed to look like an HTML text file, but it might be designed to look like something else too, like an XML file.
So is HTTP completely harmless? Does it only return information?
The last bullet point above is particularly important. HTTP is normally used to retrieve files of some sort, but an HTTP request can also be used to trigger an operation that returns information that looks like a file but is not. That’s why people say HTTP is used to retrieve “resources”. The term “resources” is broader than just “files”. Servlets and web services are also sources of dynamically generated information. They are really just programs that get called by the web server when it receives certain HTTP requests. The web server knows to package up the output of the program so that it looks as if an HTML file is being returned. For example, when the web server at Google receives a POST request to http://api.google.com/search/beta2, it knows to run a certain program and to return the output of that program so that it looks as if an XML file is being returned. That’s how you run Google’s web services.
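Here is a rough sketch of a “resource” that is really a program. It uses WSGI, Python’s standard calling convention between a web server and application code. The function names search_app and call_app, and the shape of the XML, are invented for this example and have nothing to do with how Google’s service is actually built.

```python
from xml.sax.saxutils import quoteattr


def search_app(environ, start_response):
    """A program the web server calls; its output is dressed up as an XML file."""
    query = environ.get("QUERY_STRING", "")
    body = f'<?xml version="1.0"?><results query={quoteattr(query)}/>'.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/xml"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, query_string):
    """Stand in for the web server: invoke the program and capture what it returns."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"REQUEST_METHOD": "POST", "QUERY_STRING": query_string}
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call_app(search_app, "q=http")
print(status, headers["Content-Type"])
print(body.decode("utf-8"))
```

No XML file exists anywhere on disk; the client just receives bytes labeled application/xml. To try it against a real browser, the same search_app could be handed to wsgiref.simple_server.make_server from the standard library.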
Since HTTP is generally regarded as a communication mechanism to deliver harmless text information to a browser, network administrators usually allow HTTP traffic to freely pass through their company firewalls. As a result, HTTP is now frequently used to wrap other protocols so that the pesky firewalls can be subverted. For example, SOAP messages, which are used to call routines running on other computers, are frequently encapsulated and sent inside of HTTP messages to reach computers on the web behind firewalls (see our SOAP page for the full story). Consequently, HTTP isn’t necessarily safe to let pass through firewalls uninspected anymore, and a new and improved firewall industry is emerging. Firewalls now need to be smart enough to look inside of HTTP messages.
In 15 seconds or less, tell me about CGI.
At this point it’s worth mentioning a related technology called Common Gateway Interface, or “CGI”. CGI provides another way of allowing a web page request to trigger the web server to call a custom host-side program. If you see a URL with the letters “CGI” embedded somewhere in it, then you are probably dealing with one of these old-style web applications. The browser transmits the HTTP “CGI URL” request to the server, and the server runs a program on the host. The result of the host-side program is packaged up by the web server and returned in HTML form to the browser. CGI provides a primitive but very quick-to-implement technique for dynamically assembling web page contents on the host.
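For a feel of how little a CGI program needs, here is a minimal sketch. Under CGI the web server passes the part of the URL after the “?” to the program in the QUERY_STRING environment variable, and whatever the program prints is sent back to the browser. The function name cgi_response and the name parameter are assumptions made for this example.

```python
import os
from html import escape
from urllib.parse import parse_qs


def cgi_response(environ):
    """Build the full CGI output: a header line, a blank line, then the HTML page."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    name = params.get("name", ["stranger"])[0]  # hypothetical request parameter
    body = f"<html><body><p>Hello, {escape(name)}!</p></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # When the web server runs this file as a CGI program, it captures
    # this printed output and returns it to the browser.
    print(cgi_response(os.environ), end="")
```

The web server starts this script fresh for each matching request, which is what makes the approach quick to write and costly to run.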
What's this I hear about JSPs and PHP then?
CGI suffers from a number of problems that have caused it to be almost completely superseded by technologies such as JSP or PHP. CGI does not scale well since a new process is kicked off on the server for every request (rather than reusing the same one). It’s slow. Moreover, the CGI strategy ultimately assumes the client is a browser, which is one of the reasons it is not an effective replacement for SOAP.
So what's "secure" HTTP (HTTPS)?
HTTPS is simply encrypted HTTP. HTTP is really implemented with operating system level “sockets”, and HTTPS uses secure sockets instead. Under the covers, HTTPS relies on another protocol called Secure Sockets Layer (SSL), whose modern successor is Transport Layer Security (TLS), to protect data from prying eyes. Browsers typically display some kind of a lock icon when secure communications are being used to talk to the remote web server. Sites that require HTTPS have URLs that begin with “https://...”. Web programmers typically don’t have to worry much about encrypting data for transmission since the browser and the web server carry the responsibility for properly using HTTPS. The exception to this rule is that HTTPS can introduce performance delays for large amounts of text-based data. So, in cases where some information in the message needs to be protected but the entire message can’t be encrypted because of performance problems, the web programmer has to do a great deal of worrying. But this situation is rare. Usually the entire message can be encrypted. Moreover, SSL accelerators (special hardware devices) can be used to offload the encryption and decryption of HTTPS messages that would otherwise be performed by the computer’s CPU.
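From the client programmer’s side, the switch from HTTP to HTTPS is mostly a matter of the URL prefix. The sketch below, using Python’s standard ssl module, shows roughly what a client decides before it connects. The helper connection_plan and the example.com URLs are assumptions for illustration, and no connection is actually opened.

```python
import ssl
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}  # the well-known default ports


def connection_plan(url):
    """Work out the host, port, and (for https) the encryption settings for a URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    # For https, the default context insists on a valid certificate for the right host.
    context = ssl.create_default_context() if parts.scheme == "https" else None
    return parts.hostname, port, context


host, port, context = connection_plan("https://example.com/login")
print(host, port, context.verify_mode, context.check_hostname)
print(connection_plan("http://example.com/"))
```

The returned context is the kind of object that could be passed to http.client.HTTPSConnection, after which the programmer reads and writes ordinary HTTP and the encryption happens underneath.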
My developers tell me not to worry about security since we'll just configure secure HTTP before deployment. Is that easy?
Even though web programmers don’t have to fret much about HTTPS, web administrators, on the other hand, have to set up a number of confusing configuration settings pertaining to “certificates” that allow the web server to support security. HTTPS really is a headache to them, which explains why they’re always in such a bad mood. HTTPS can be difficult to set up for several reasons: there are a lot of options pertaining to the degree of encryption desired, firewall settings have to be altered to allow HTTPS traffic to pass through, and a third party (a certificate authority) is generally required to sign the company’s “certificate”. So the administration for HTTPS is often significant and underestimated.
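To give a flavor of the server-side settings involved, here is a small sketch using Python’s ssl module. Real web servers have their own configuration files, so this is only an analogy: the function names are invented, the choice of TLS 1.2 as the minimum is an assumption, and the certificate and key paths would be whatever your administrator obtained from the certificate authority.

```python
import ssl


def base_server_context():
    """Choose the 'degree of encryption': here, refuse anything older than TLS 1.2."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy for this sketch
    return context


def load_certificate(context, cert_file, key_file):
    """Attach the signed certificate and its private key (file paths are hypothetical)."""
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


context = base_server_context()
print(context.minimum_version)
```

Calling load_certificate with files that do not exist or do not match each other fails immediately, which is a fair preview of the administrator’s day.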
Seems like everything's a protocol. TCP. SOAP. HTTP. Help me out here, but do it fast.
Discussions about protocols like HTTP can be somewhat confusing since one protocol is frequently nested within another protocol. For example, HTTP uses the underlying protocol TCP, and messages sent inside of HTTP might themselves be in the “SOAP protocol”. So protocols are wrapped in protocols and consulting fees are guaranteed well into the future. But each of these layers adds value of some sort (other than the fees). In this example, TCP is a low-level, peer-to-peer protocol used to support communications across a network. The HTTP client-server protocol adds a consistent way to request and modify resources (usually HTML files) that reside on the web server. SOAP is often used within HTTP to allow computers to call methods on remote computers.
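The nesting can be shown by literally wrapping one message in another. In this sketch a SOAP envelope becomes the body of an HTTP POST, and the resulting bytes are what would be handed to TCP. The GetQuote operation, the host example.com, and the path /quote-service are all made up for illustration.

```python
# Innermost layer: a SOAP 1.1 envelope (the GetQuote call is hypothetical).
SOAP_ENVELOPE = (
    '<?xml version="1.0"?>'
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body><GetQuote><Symbol>XYZ</Symbol></GetQuote></soap:Body>"
    "</soap:Envelope>"
)


def wrap_in_http(host, path, soap_xml):
    """Middle layer: put the SOAP message inside the body of an HTTP POST."""
    body = soap_xml.encode("utf-8")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: text/xml; charset=utf-8\r\n"
        'SOAPAction: ""\r\n'
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


# Outermost layer: these bytes are what a TCP connection would carry.
http_message = wrap_in_http("example.com", "/quote-service", SOAP_ENVELOPE)
print(http_message.decode("utf-8"))
```

To a firewall that only checks for HTTP, this looks like any other form submission.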
Sidebar - Protocols 101
A protocol is really just an agreed-upon communication format. Is the message encrypted or compressed, and if so, how? If the message receiver doesn’t get part of the message, how does it say “huh?” and then get a retransmission? A protocol is often used to chop up the information, encapsulate the pieces, and reassemble them on the other end. For example, TCP might divide a large file into smaller pieces for transmission, much the same way your airline chops up your luggage and reassembles it with duct tape on the baggage carousel.
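The chop-up-and-reassemble idea can be sketched in a few lines. This is a deliberately simplified model, not real TCP: each piece is labeled with its starting offset, and the receiver sorts by that label. Real TCP also handles acknowledgements, retransmission, and flow control, none of which appear here. The function names and the piece size of 8 bytes are arbitrary choices for the example.

```python
import random


def segment(data, size):
    """Chop the data into pieces, labeling each one with its starting offset."""
    return [(offset, data[offset:offset + size]) for offset in range(0, len(data), size)]


def reassemble(segments):
    """Put the pieces back in order by offset, whatever order they arrived in."""
    ordered = sorted(segments, key=lambda piece: piece[0])
    return b"".join(chunk for _, chunk in ordered)


luggage = b"socks, shirts, and one slightly dented souvenir"
pieces = segment(luggage, 8)
random.shuffle(pieces)  # the network may deliver pieces out of order
print(len(pieces), reassemble(pieces) == luggage)
```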
What are the common mistakes that projects continually make with HTTP?
Most people think of HTTP as a technology that happens under the covers that you don’t have to worry about. For the average Internet surfer, that’s true. However, it’s not true if you have a project with performance and security concerns. HTTP may also reemerge as a concern for you as your company adopts new web service and service-oriented architecture technologies (like REST). You probably now understand the basics of HTTP from the book excerpt above, but what are the primary concerns for this (and all) underlying transport layers? If you’re not very conscious of the common mistakes that get made with HTTP, you should spend a few bucks now to be certain you know what you don’t know, ya know?