CSCI 356 - Computer Networking

CSCI 356 / Fall 2024

Project 2 - Dynamic Web Server

UPDATE Sept 24 7pm: If you are using MacOS and are encountering a permission-denied error related to files or images you have downloaded from the internet (such as the index.html file you might have downloaded, or items in the associated index_files folder), this may be a MacOS security feature -- MacOS sometimes "quarantines" any files it considers "dangerous" that you download from the Internet, and prevents python code from accessing those files. Here's a fix. In a terminal, run "xattr -r -d com.apple.quarantine ~/Desktop/yourproject/web_root" (but replace the directory path, of course, to specify wherever your web_root folder is). This removes the "quarantine" flag, allowing python to access the files.

In this project I am giving you a simple yet functional web server, written from scratch in Python3. Tasks for you:

gain more experience with sockets programming,
explore the basics of the HTTP protocol without using HTTP python modules,
understand both static and dynamic web servers,
and (optionally) work in a team.

This project can be done individually or in teams of two students. If you need help finding a partner, let me know and I can help connect you with others. If working as a team, I expect both members to write code and contribute changes to the github repository, and both members should fully understand all of the project and all of the code and be prepared to answer related questions on a midterm exam.

Code for the project can be found on GitHub classroom, using this invitation: https://classroom.github.com/a/CeoS8jLt If working as a team, make sure to create a single team github repository, rather than two separate ones.

WARNING: Do not use http-related python modules or built-in python webserver features. The goal is to write your own HTTP code, not merely invoke someone else's HTTP code.

Reach Goals: I've marked a few parts of the project as reach goals. Aim to complete as many of these as you can, but it's okay if you can't complete all of them by the deadline.

Hint: Use GitHub for collaboration and include commit messages! Commit and push changes to github as you work. Include a short message each time to keep track of your progress (example: "hello page is now dynamic, tested and working"). Your coding partner can then "pull" down your changes, keeping everyone in sync, and GitHub will show a log of your progress. Also update the README file, checking off the items as you complete them.

About online sources and AI/LLM assistants: For help on the python language or sockets programming, or for understanding HTTP, or the code I have given you, feel free to collaborate with any other students or search for help online, or even use AI/LLM assistants. But you must cite your sources, as always (example: "asked chatgpt what the code using threading.Condition() in webserver.py does and what it is for"). For writing your actual web server code, however, be very careful about using online sources or using AI/LLM assistants. There are numerous examples online of web servers written in python (and other languages), with a variety of different styles and features. These will often mislead you, and they are not likely to match the specific requirements detailed below. And if you do use online sources for writing code, you must of course cite your sources, as always.

Background

We discussed in class and you saw in the readings: web browsers (i.e. clients), web servers, and the HTTP protocol they use to communicate. You may find you need to review that material as you work on this project.

I have provided you a fairly traditional (some would say antiquated) mostly-static, file-based, multi-threaded web server, written from scratch in Python3.

It is (mostly) a static server, which means for each URL that the server knows how to handle, there is some fixed response it sends back regardless of who made the request, when the request was made, or any other factors. If Alice requests some URL on Monday, and Bintu requests the same URL on Tuesday, they will get exactly the same response. The responses don't change over time. There is no "dynamic state" maintained by the server. The server can crash, be restarted afresh, and clients would not even notice, as the response for any given URL will be exactly the same as before.
It is file-based, meaning that to generate a response, it simply takes each URL, such as "http://myserver.org/some/page.html", and maps that to a file name, such as "./web_root/some/page.html". Here, "./web_root" is a directory the server can access containing all the html files, images, and other content that defines the web site. The server simply uses the contents of the matching file as the body of the response. This is a very traditional way to design a web server, and it is still very common. Server administrators add content to the web server simply by putting files into the "./web_root" directory, no other configuration or code changes needed.
It is multi-threaded. There is one "main" thread that waits for connections from clients (web browsers), but each connection that arrives is handled by a separate "worker" thread. As a rough approximation, you can think of each thread as its own "mini program", as if you had run the web-server multiple times in separate terminals. The main thread is responsible for initialization and accepting incoming connection requests using a "welcoming socket" or "listening socket". Each worker thread is responsible for handling just one single connection using a "connected socket". All the threads are running their code (at different lines of the program) concurrently. So while one worker thread might be reading in a request from one browser, another worker thread might be sending a response to a different browser, and meanwhile the main thread might be waiting to accept a new connection when it arrives from some browser. But unlike running multiple programs in separate terminal windows, all of the threads share all of their global variables with each other, they all run inside the same program, and when they print, they all print to the same terminal window (which does make a rather jumbled mess of the output).
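
To make the main-thread and worker-thread split concrete, here is a minimal sketch of the pattern. It is not the code in webserver.py; the names make_listener, accept_loop, and handle_client are made up for illustration, and the worker sends back one fixed response rather than doing any real request handling.

```python
import socket
import threading

def handle_client(conn, addr):
    # Worker thread: owns exactly one connected socket, then closes it.
    with conn:
        conn.recv(4096)  # read (and ignore) whatever the client sent
        body = b"hello from a worker thread\n"
        conn.sendall(b"HTTP/1.1 200 OK\r\n"
                     b"Content-Type: text/plain\r\n"
                     b"Content-Length: " + str(len(body)).encode() + b"\r\n"
                     b"Connection: close\r\n\r\n" + body)

def accept_loop(listener):
    # Main thread: only accepts connections and hands each one to a new worker.
    while True:
        try:
            conn, addr = listener.accept()
        except OSError:
            break  # the listening socket was closed
        threading.Thread(target=handle_client, args=(conn, addr), daemon=True).start()

def make_listener(host="127.0.0.1", port=0):
    # The "welcoming" socket. Port 0 asks the OS to pick any free port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    return listener
```

Calling accept_loop(make_listener("127.0.0.1", 8888)) would block forever accepting connections, which is roughly the shape of what the main thread in the provided server does.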

Running the server

Try out the server to get familiar with how it works, running either on your laptop or on arpa.kwalsh.org:

localhost: To run on your laptop, run:
python3 webserver.py localhost 8888
or:
python3 webserver.py 127.0.0.1 8888
Then you can access the server using a regular web browser (Chrome, Firefox, Safari, etc.) by going to the URL printed by the server.
arpa.kwalsh.org: Follow the instructions sent by email to configure ssh access. Then get your code and web_root directory onto the server. You can either log in using ssh and use git on the command line to clone your repository, or run "scp -r web_root arpa.kwalsh.org:" and "scp webserver.py arpa.kwalsh.org:" (be sure to include the trailing ":"). Then log in and run:
python3 webserver.py arpa.kwalsh.org 8888
or:
sudo python3 webserver.py 192.133.83.134 80
then you can access the server using a regular web browser by going to the URL printed by the server.
Note: for security reasons, ITS blocks connections to all non-standard ports on logos, even from HCWireless. So using logos won't work. Also note: You'll need to pick a different port number (any number between 1024 and 49151), since only one program can use port 8888 on logos at a time. If you use "sudo" you can also use lower port numbers, including the standard port 80, so long as no other student is using the port at the same time. Port number competition isn't usually an issue when running on your own laptop.

(In)Security WARNING: Your code will almost certainly have gaping security holes. The code I have given you takes several precautions, however, including only allowing connections from your own laptop (also known as "localhost", which has the special IP address 127.0.0.1) unless you are running on arpa.kwalsh.org. That way attackers can't try to break into your laptop over the internet through your web server. Still, it is probably best to be sure to kill your server when you are not actively working on the project.

Familiarize yourself with the code

Look carefully at the debugging output the server prints to the terminal. The first line of the request is known as the request-line... what does it look like? The remaining lines are the headers. Look for the "User-Agent" header... is it surprising? Do any other headers seem obvious?
Look carefully at the response. The first line is called the status line. How is this, and the headers that follow, similar or different from the request? There are a lot of similarities, but also slight differences, so look carefully.
Request and response headers use "\r\n" for line endings, just like POP3. This is a vestige of history.
When looking at the welcome page, you should see multiple requests and multiple connections. Why isn't it just one request and one response? Does your browser use multiple connections (e.g. one per request), or just a few (e.g. multiple requests per connection)?
There may be only a few headers, or many. How does the server or client know how many? Where should it stop processing headers and start processing the body? Hint: In POP3, the message boundary was a single "." on a line by itself. HTTP is different, but it's the same idea.
The body of the message—the payload—sent from the server back to the browser isn't always text. It could be binary image data, or video, or audio, or anything else. How does the browser know how much payload to process? Is there a "." ending like POP3, or something else? How does the client know what type of data it is and how it should be displayed (as an image, or video, or text, or html, etc.)? Hint: Look at the response headers the server sends back.
Try different URLs. Just pick any URL ending, see what happens. Or look through the welcome page for other URLs that might work on this server. Compare the request and response debugging info printed on the server console.
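
Once you have looked at the debugging output yourself, the sketch below shows one way the request-line and headers could be pulled apart in code. It assumes the request arrives as raw bytes and that the header section ends at the first blank line. The function name split_request is hypothetical, and the helper functions already in webserver.py may well work differently.

```python
def split_request(raw):
    """Return (request_line, headers) or None if the header section is incomplete."""
    head, sep, _rest = raw.partition(b"\r\n\r\n")
    if not sep:
        return None  # no blank line yet, so more data still needs to be read
    lines = head.decode("iso-8859-1").split("\r\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first ":" only, since values such as "localhost:8888" contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers
```

Storing header names in lowercase makes later lookups insensitive to how the browser capitalized them.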

Nearly all of the code I have provided you is (hopefully) correct, working code, and you can mostly leave it alone. Read the comments to make sense of it. Your task will be to add new features—while you will have to make a few changes to the existing code, for the most part you can leave the existing code alone and just add new code. The code I have provided already implements these features:

Opens a server "welcoming" socket and waits for connections from browsers.
Starts a thread for each connection.
Keeps track of some statistics, like the number of connections, the number of errors, etc.
If the request is to GET /hello, it sends back a simple greeting.
If the request is to GET /status, it sends back a nicely formatted printout of the server statistics.
If the request is to GET any other path, it tries to open a file with that name and send back the contents as an HTML file.

Your Tasks

See README.md for a list of all tasks to be completed. Additional hints and details are below.

HTML hello: Find the code that creates the response for "GET /hello" requests:
http://localhost:8888/hello. Currently the response is just plain text, with no HTML formatting. You can see the content-type (aka "mime-type") sent in the response is "text/plain". Change the code so it sends "text/html" for "GET /hello" responses, and change the body of the response so it contains valid HTML markup. No need to be fancy here. Just add "<html>" and "</html>" at the start and end, add the "head", "body", and "title" opening and closing tags, use "<p>" to start a new paragraph, and make the links clickable. Add an image or two if you like, or change the colors, fonts, etc.

NOTE: Depending on where and how you run the code, you'll need to fix the links on the response page. You will need to change "http://logos.holycross.edu:8888/..." to be "localhost" or a different port number, for example, or change it to "http://arpa.kwalsh.org:8888/". Better yet, change the code to leave off the protocol, host, and port part of the link, and just use a host-relative path, like "/hello" as the link. Relative links work no matter where the server is located, since the client web browser will fill in the rest automatically.

Mime-type handling: Find the code that creates a response for requests to GET a file. Currently, the code always returns a response with content-type "text/html", regardless of the actual type of the file being requested. So even if the browser does "GET /xkcd_online_communities_2.png", which is clearly asking for an image file, the server incorrectly marks the response as "text/html". This can cause images and other media types to not always display correctly. Example: clicking http://localhost:8888/xkcd_online_communities_2.png probably will not render correctly in your browser if the mime-type in the response is incorrect. Also, the style sheets might not work correctly. Once you have mime-types correct, then http://localhost:8888/welcome.html and http://localhost:8888/compsci.html should look nicer, with colored backgrounds, custom fonts, etc. Fix the code so the correct mime-type is used, depending on the file name ending: ".html", ".htm", ".jpg", ".jpeg", ".png", ".txt", ".css", and ".js" should all be supported. Any mix of uppercase and lowercase should be allowed. You can find the appropriate mime-types for each of these online. Hint: don't guess, the mime-types aren't always obvious.
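
One possible shape for the lookup is a small table keyed by lowercase file extension. This is only a sketch: the names MIME_TYPES and guess_mime_type are made up, and the fallback type for unknown extensions is an assumption rather than a requirement of the project.

```python
import os

# Extension -> mime-type for the file types listed above.
MIME_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".txt": "text/plain",
    ".css": "text/css",
    ".js": "text/javascript",
}

def guess_mime_type(path, default="application/octet-stream"):
    # splitext returns ("name", ".ext"); lowercasing handles ".PNG", ".JPeG", etc.
    _, ext = os.path.splitext(path)
    return MIME_TYPES.get(ext.lower(), default)
```

For example, guess_mime_type("/index_files/Style.CSS") returns "text/css".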

Add an index.html file: Add an "index.html" file and other content to your web_root directory to serve as a main landing page. As a stress-test of your mime-type handling code, here is one approach: Go to http://www.holycross.edu/ or some other relatively simple non-interactive webpage using Chrome or Firefox, then do "Save As ... (WebPage, Complete)" to save the page as "index.html" inside your "web_root" folder. NOTE: Make sure you choose the "WebPage, Complete" option when saving. This doesn't work in Safari. This should create a file named "index.html" within your web_root folder, and it should also create a sub-folder (probably called "index_files") containing many javascript, css, and image files. Now you should be able to go to http://localhost:8888/index.html and see a nearly-complete Holy Cross homepage. It's okay if a few images and some other parts of the page are not 100% working: the College homepage is a bit fragile and the "Save As ..." doesn't always fully capture the entire page. But it should look pretty close to normal. You can do the same thing with other pages, such as http://www.dolekemp96.org/main.htm, which is a bit simpler than the Holy Cross home page as it's nearly 30 years old!

Default and directory pages: Change the server so that if the browser request is to "GET /", then it actually gets "/index.html" instead. Be sure to use the right mime-type, and if the "index.html" file does not exist, return a 404 NOT FOUND error, as usual. Similarly, for any file request that ends in "/", change it to end in "/index.html". So "/subdir/" becomes "/subdir/index.html", and so on. This way, users can go to http://localhost:8888/ and get to your main landing page, without having to type "index.html" all the time. This is a widely followed convention among web servers. As a reach goal, do the same thing for any request path that corresponds to a directory instead of a file, even if it doesn't end in "/". For example, "GET /subdir" should work the same as "GET /subdir/" and "GET /subdir/index.html".
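
The path rewriting described here could look roughly like the sketch below. The function name resolve_path and its web_root parameter are hypothetical. The sketch does not check that the file exists (the 404 case is left to the caller), and it does nothing about ".." in paths.

```python
import os

def resolve_path(url_path, web_root="./web_root"):
    # "/" and "/subdir/" become "/index.html" and "/subdir/index.html".
    if url_path.endswith("/"):
        url_path += "index.html"
    file_path = os.path.join(web_root, url_path.lstrip("/"))
    # Reach goal: "/subdir" names a directory even without the trailing "/".
    if os.path.isdir(file_path):
        file_path = os.path.join(file_path, "index.html")
    return file_path
```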

[Reach Goal] Directory listings: For directories, if the "index.html" file does not exist, rather than sending a 404 NOT FOUND response, instead send back a nicely-formatted HTML response containing a listing of all items in the directory. The listed items should be clickable links. Most servers now disable this feature for security/privacy reasons. But see, for example, https://www.arngren.net/sitebuilder/ which is one of the directories behind this ancient Norwegian site.
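
A bare-bones listing page might be generated as in the following sketch. The function name directory_listing is made up. It leans on html.escape and urllib.parse.quote for escaping file names, on the assumption that those utility functions are acceptable under the no-HTTP-modules rule, so check before relying on them.

```python
import html
import os
import urllib.parse

def directory_listing(dir_path, url_path):
    """Return an HTML page linking to every item inside dir_path."""
    if not url_path.endswith("/"):
        url_path += "/"
    items = []
    for name in sorted(os.listdir(dir_path)):
        # Show sub-directories with a trailing "/" so they are easy to spot.
        label = name + "/" if os.path.isdir(os.path.join(dir_path, name)) else name
        href = urllib.parse.quote(url_path + label)  # e.g. spaces become %20
        items.append('<li><a href="%s">%s</a></li>' % (href, html.escape(label)))
    title = html.escape("Index of " + url_path)
    return ("<html><head><title>%s</title></head><body><h1>%s</h1><ul>\n%s\n</ul></body></html>"
            % (title, title, "\n".join(items)))
```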

Dynamic hello: Even though the "GET /hello" response is not a file, the server returns the exact same response every time. In other words, it is still static. Make it dynamic by including some content that varies in the response body. Maybe have it vary the background color at random? Or keep track of how many times the hello-page has been displayed so far (just use a global counter variable in python for this), and include that counter in the response? Whatever you choose to do, going to http://localhost:8888/hello and hitting refresh should allow the client to see a (slightly) different page each time.
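
As one illustration of the counter idea, the sketch below builds a body that changes on every call. The names hello_count, hello_lock, and hello_body are hypothetical, and the lock follows the pattern explained in the concurrency hints further down.

```python
import random
import threading

hello_count = 0                       # how many times the hello page was built
hello_lock = threading.Condition()    # protects hello_count

def hello_body():
    global hello_count
    with hello_lock:
        hello_count += 1
        count = hello_count           # copy the value while holding the lock
    color = random.choice(["lightblue", "lightyellow", "lavender"])
    return ('<html><head><title>Hello</title></head>'
            '<body style="background-color: %s">'
            '<p>Hello! Page views so far: %d</p>'
            '</body></html>' % (color, count))
```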

Interactive hello: make the hello page respond to user input, so the form at right works. The resulting page should have some sort of custom greeting for the chosen username. Hint: Look at the HTTP request to see how the server can get the user's name and favorite color. If you aren't using localhost with port 8888, make your own web form with the correct URL (or as a temporary fix, simply adjust the URL in your browser after clicking the "Greet Me!" button).

NOTE on weird characters: URLs can't use certain characters. If inputs have spaces or other special characters, you will notice the browser substitutes alternatives, an operation called "quoting" or "escaping". Fortunately, undoing URL quoting is easy in python: s = urllib.parse.unquote(...)
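
Here is a small sketch of how the path and the form inputs might be separated and unquoted. It assumes the form data arrives after a "?" in the request target as "key=value" pairs joined by "&". The function name parse_target and the field names "name" and "color" in the usage line are guesses, so check the actual request your browser sends.

```python
import urllib.parse

def parse_target(target):
    """Split '/path?key=value&...' into (path, params dict)."""
    path, _, query = target.partition("?")
    params = {}
    for pair in query.split("&"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Browsers send spaces in form fields as "+", so undo that before unquoting.
        key = urllib.parse.unquote(key.replace("+", " "))
        value = urllib.parse.unquote(value.replace("+", " "))
        params[key] = value
    return urllib.parse.unquote(path), params
```

With these assumptions, parse_target("/hello?name=Ada+Lovelace&color=%23ff00ff") gives the path "/hello" and the inputs "Ada Lovelace" and "#ff00ff".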

[Reach Goal] Keep-alive: In the original HTTP protocol, each socket connection carried a single request-response pair. Both the browser and server would close their sockets after the response was finished. Now, HTTP's keep-alive feature can improve performance by allowing either a browser or a server to request that the connection be held open instead, and re-used for subsequent requests. The rule is simple. If the browser so chooses, it includes a "Connection: keep-alive" header in its request. If the server so chooses, it will then include a "Connection: keep-alive" header in the response, and both sides will then leave the socket connection open so that the browser can send another request, and the server can respond, as many times as desired, using the same socket. Each of these subsequent requests and responses would similarly have a "Connection: keep-alive" header. Or, either side is allowed to instead include "Connection: close" with a request or response, indicating that it no longer wishes to keep the connection open. Upon noticing a "Connection: close" header from the other side, either side would close its socket.

Modify the server to support HTTP's keep-alive feature. Your server should keep the connection open, processing one request after another from the connection, so long as each new request contains the keep-alive header. Be sure to stop if the browser sends a "Connection: close". Either side can also, at any time, simply close the connection. Your server should initiate the close operation eventually if the browser does not, e.g. after a certain number of requests on the socket (on the 10th request?), or after the socket has been held open for a certain amount of time (after 1 minute?), or after the socket has been idle for a certain duration (e.g. after 10 seconds?). Be sure to send the appropriate keep-alive or close header in every response your server sends. Hint: Don't try to implement this all in one spot in the code. Instead, use a boolean variable inside the Connection object to keep track of whether it should be kept open, then use and modify this variable from various places in the code.
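
The decision itself can be kept in one small function that the rest of the code consults, as in this sketch. It assumes headers is a dict with lowercase header names and that requests_served counts the current request too. Both function names and the limit of 10 are illustrative. An idle timeout is not shown, but socket.settimeout() on the connected socket is one way to get one.

```python
def wants_keep_alive(headers, requests_served, max_requests=10):
    """Decide whether to leave the connection open after the current response."""
    value = headers.get("connection", "").strip().lower()
    if value != "keep-alive":
        return False                      # "close", or the header is missing
    return requests_served < max_requests # the server closes on the last allowed request

def connection_header(keep_open):
    # Every response should state the server's decision explicitly.
    return "Connection: keep-alive" if keep_open else "Connection: close"
```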

[Reach Goal] Who Am I? Servers can learn a bit about the clients connecting to them. Upon receiving a GET request for "/whoami", the server should respond with a special page that contains details about the web browser client, gleaned from the HTTP request itself or from the python socket connected to the client. Include details like: IP address and port of the client; "user-agent" string identifying the browser type and version; the languages the user prefers to accept; the value of any "Cookie" header, if present; and the value of the "DNT" (do-not-track) header or "Sec-GPC" (Global Privacy Control) header, if present.
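
A sketch of how such a page could be assembled is below. It assumes client_addr is the (ip, port) tuple that socket accept() returns alongside the connected socket, and that headers is a dict with lowercase names. The function name whoami_body is made up, and html.escape is used on the assumption that it is an acceptable utility here.

```python
import html

def whoami_body(client_addr, headers):
    ip, port = client_addr[0], client_addr[1]
    rows = [("IP address", ip), ("Port", str(port))]
    for label, name in [("User-Agent", "user-agent"),
                        ("Accept-Language", "accept-language"),
                        ("Cookie", "cookie"),
                        ("DNT", "dnt"),
                        ("Sec-GPC", "sec-gpc")]:
        rows.append((label, headers.get(name, "(not sent)")))
    # Escape the values: they come straight from the client and could contain "<" or "&".
    items = "\n".join("<li>%s: %s</li>" % (html.escape(k), html.escape(v)) for k, v in rows)
    return "<html><head><title>Who am I?</title></head><body><ul>\n%s\n</ul></body></html>" % items
```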

[Reach Goal] Ban a browser: Some websites ban certain web browsers. Many users find this annoying. Modify the server so that, for all requests coming from your least favorite browser (e.g. just Safari on MacOS), the response contains a polite "sorry, this browser is not supported" message rather than the intended content. Hint: the user-agent header in the request will be useful here. The value of that header isn't obvious or sensible, but it is still how this feature is commonly implemented.
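
To show why the user-agent value "isn't obvious or sensible", here is a sketch of a check for desktop Safari. Chrome's user-agent string also contains the word "Safari", so the check has to rule Chrome out. The function name is_desktop_safari is hypothetical, and the heuristic is only as good as the handful of user-agent strings you test it against.

```python
def is_desktop_safari(user_agent):
    ua = user_agent.lower()
    return ("macintosh" in ua
            and "safari/" in ua
            and "chrome/" not in ua      # Chrome, Edge, and Opera all include "Chrome/"
            and "chromium/" not in ua)
```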

[Reach Goal] Persistence with Cookies: Use HTTP cookie headers to remember something about a client, and use that information in later responses. For example, when the user submits their name and favorite color on the "/hello" page, send a cookie to the browser containing that info. When the user later visits the same page, the cookie will be present in the request headers, and you can customize the response based on that information. Or, you could put a counter value into a cookie to keep track of how many times this client has visited your site. Then, on the "/hello", "/status", or "/whoami" pages, you can include this information, e.g. "Welcome back, this is your 37th visit."
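
The visit-counter variant might be sketched like this. It assumes headers is a dict with lowercase names, that the browser sends cookies back as "name=value" pairs separated by ";", and that the server sets them with a "Set-Cookie" response header. The cookie name "visits" and both function names are made up for illustration.

```python
def parse_cookie_header(value):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in value.split(";"):
        name, sep, val = part.strip().partition("=")
        if sep and name:
            cookies[name] = val
    return cookies

def next_visit(headers):
    """Return (visit_number, header line to add to the response)."""
    cookies = parse_cookie_header(headers.get("cookie", ""))
    try:
        visits = int(cookies.get("visits", "0")) + 1
    except ValueError:
        visits = 1    # the cookie was garbled, so start over
    return visits, "Set-Cookie: visits=%d; Path=/" % visits
```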

HTTP Requirements and Simplifications

Whitespace: The actual rules for whitespace in HTTP requests are somewhat complex, but you can assume that HTTP requests and headers will always be formatted with whitespace shown exactly as above, i.e. using a single space (and no tabs) between each word. Chrome and most other browsers will only send reasonably-formatted requests... usually.
Capitalization: "HTTP", "GET", and similar keywords are nearly always capitalized in most places. Most other things in requests or responses can be upper or lowercase, and you need to account for this. If you need to look at a header sent by the client, e.g. the "Connection: keep-alive" header, you might want to first convert to lowercase before doing a string comparison, since the client could send any variation of "Connection", "connection", "Keep-alive", "kEeP-aLivE", etc.
Header ordering and defensive programming: Headers in HTTP requests are not always listed in the same order, so code defensively. Different browsers will send different headers in different orders, and even use slightly different spacing, punctuation, etc. Account for this in your code wherever possible. HTTP is a fairly permissive protocol, allowing wide flexibility in formatting of requests and responses. On the other hand, you should be strict when appropriate. For example, don't treat "GET /hello" and "GET /chello" as the same thing, which is what would happen if you just check if the word "hello" appears in the url.
Error checking: Include enough error-checking so that your web server is functional. I will test your code using obvious, simple inputs with one or two popular browsers (Chrome, Safari, or Firefox). I will not test your server under any highly unusual scenarios, and I will not penalize you if your code dies in rare corner-cases. It is a good idea to test with at least 2 browsers. To handle errors, a reasonable strategy is for your server to simply close the connection to the client if any type of unexpected situation is encountered (e.g. a malformed request). Python's "try... except..." mechanism is useful here.
HTTP/1.1: You only need to support HTTP/1.1. Don't bother with HTTP/1.0 or HTTP/2, and go ahead and assume that the client is using HTTP/1.1.
GET, PUT, and POST methods: You only need to support HTTP GET for this project. Don't bother with POST, PUT, HEAD, DELETE, or other methods.
What headers do I need to pay attention to? In each HTTP request, the browser will include many headers. You can ignore all of them, except the few that were mentioned above and that are needed to implement some feature. I already provided some helper functions to deal with headers.
HTTP status codes: Include an appropriate status code with each response. You will use at least status code 200 (OK) and probably 404 (NOT FOUND). Depending on how you write your server and how much error checking you do, you may also want to use other codes like 400 (BAD REQUEST), 403 (FORBIDDEN), 418 (I'M A TEAPOT), and others.
Logging/debugging: There is no python debugger. For your sake and mine, your server should print to the console a simple trace of activity. It's okay if the messages are a bit jumbled and out-of-order, due to the concurrent multi-threaded operation.
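
Putting the status code and header requirements together, a complete response could be assembled along the lines of this sketch. The names REASONS and build_response are hypothetical and the provided code has its own way of sending responses, so treat this as an illustration of the wire format rather than a replacement.

```python
REASONS = {200: "OK", 400: "Bad Request", 403: "Forbidden", 404: "Not Found"}

def build_response(status, mime_type, body, keep_alive=False):
    """Return the full response (status line, headers, blank line, body) as bytes."""
    if isinstance(body, str):
        body = body.encode("utf-8")
    head = [
        "HTTP/1.1 %d %s" % (status, REASONS.get(status, "Unknown")),
        "Content-Type: " + mime_type,
        "Content-Length: %d" % len(body),   # length in bytes, not characters
        "Connection: " + ("keep-alive" if keep_alive else "close"),
    ]
    # Each header line ends in "\r\n", and one empty line separates headers from body.
    return ("\r\n".join(head) + "\r\n\r\n").encode("iso-8859-1") + body
```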

Hints: Global Variables in Python

It's fine to use global variables for these small projects, especially if it's some variable that lots of functions need to use. Creating a global variable in python is easy, just make the variable outside of any function, e.g. at the top of the program:

  myvariable = 37
  mylist = [ "Swords", "Fenwick", "Dinand" ]

When you are inside a function and want to modify some global variable (myvariable, for example, or mylist), you might need to put "global myvariable" or "global mylist" INSIDE each such function. Like this:

  def somefunction(...):
     global myvariable
     global mylist
     ...
     myvariable += 1
     ...
     mylist.append("whatever")

When you are inside a function and want to use but not modify some global variable, you can probably just go ahead and use it without the "global" declaration, and it should work fine. But if you forget the "global" inside some function, and that function tries to assign to the variable, Python will treat the name as a local variable instead: a plain assignment silently creates a second, local variable, and an update like "myvariable += 1" fails with an UnboundLocalError.
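
The following short sketch shows the different cases side by side. All of the names are made up for the demonstration.

```python
counter = 37

def read_only():
    return counter + 1        # reading a global works without "global"

def broken_increment():
    counter += 1              # no "global": raises UnboundLocalError when called

def shadowing_assign():
    counter = 0               # creates a separate local; the global is untouched
    return counter

def working_increment():
    global counter
    counter += 1              # modifies the global, as intended
    return counter
```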

Hints: Concurrency and Synchronization in Python

With multiple client browsers, each opening multiple socket connections to your server, your server may end up processing multiple requests at the same time from one or more hosts. This is normal: most browsers will open 5-10 simultaneous connections, for performance reasons. Global variables can cause trouble in this situation. To avoid trouble, you must use Python's threading.Condition() objects. Here's how.

Suppose we have a few global variables that are all related to each other in some way.

    a = [ "Foo", "Bar" ]
    b = 17
    c = "Something else"

Make another global variable to accompany them, like this:

    import threading
    # the my_lock variable is intended to protect a, b, and c
    my_lock = threading.Condition()

Now, whenever any piece of code ever tries to access a, b, or c, it should always do so inside of a "with my_lock" block, like this:

    ... some code ...        # danger zone: don't ever use a, b, or c out here
    with my_lock:
           a.append("Hi mom")    # this line is within the safe zone
           b = b + 1             # this line is within the safe zone
           c = "Nevermind"       # this line is within the safe zone
    ... more code here ...   # danger zone: don't ever use a, b, or c out here

What the "with my_lock:" block accomplishes: Even though many different threads may be running, executing many different parts of your code at the same time, Python will ensure that only one thread at a time will ever be within any "with my_lock:" block. Essentially, it makes your code take turns executing the "with my_lock:" blocks.

There is already just such a variable in the server code, stats.lock, intended to protect all of the statistics variables. This is needed because every thread needs to access and update those variables, and could potentially try to do so at the same time.

NOTE: If a lock is used to protect some variable, then it is NEVER SAFE to access that variable in any way outside of the "with my_lock:" blocks, ALL accesses to the variable must use the lock. A good rule of thumb for concurrent programming is to use locks to protect all global variables except constants (i.e. variables that never change except when they are first initialized).
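
As a self-contained illustration of the rule, the sketch below has several threads update one shared counter, always inside the "with" block. The names hits, hits_lock, record_hits, and run_workers are hypothetical and are not part of the provided server.

```python
import threading

hits = 0
hits_lock = threading.Condition()   # protects hits

def record_hits(n):
    global hits
    for _ in range(n):
        with hits_lock:             # only one thread at a time gets past this line
            hits += 1

def run_workers(num_threads=8, per_thread=10000):
    threads = [threading.Thread(target=record_hits, args=(per_thread,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                    # wait for every worker to finish
    with hits_lock:                 # even reading the value uses the lock
        return hits
```

Because every update happens while holding the lock, no increments are lost, and the final count is exactly num_threads times per_thread on a fresh run.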

Submissions

Push your code to your github repository on the master branch. Once your code is pushed to GitHub, there is nothing further to submit. All code you commit to projects in the CSCI 356 GitHub classroom is shared with your instructor automatically, but otherwise is private by default.

Collaboration log: Please either add a collaboration.txt file to your project and push it to GitHub, or include your collaboration log as comments at the top of your python code or in the README file. Either is fine. If your code requires anything unusual to run it (e.g., command line arguments), then say so in a comment at the top of your program. Also indicate in the README file if you know of anything that is broken, incomplete, etc.

You don't need to submit your web_root folder, though you can if you like, or if you made any important changes (e.g. to welcome.html) that I would need to test your server.