CSCI 356 / Fall 2024

Computer Networking

Project 2 - Dynamic Web Server

UPDATE Sept 24 7pm: If you are using MacOS and are encountering a permission-denied error related to files or images you have downloaded from the internet (such as the index.html file you might have downloaded, or items in the associated index_files folder), this may be a MacOS security feature -- MacOS sometimes "quarantines" any files it considers "dangerous" that you download from the Internet, and prevents python code from accessing those files. Here's a fix. In a terminal, run "xattr -r -d com.apple.quarantine ~/Desktop/yourproject/web_root" (but replace the directory path, of course, to specify wherever your web_root folder is). This removes the "quarantine" flag, allowing python to access the files.

In this project I am giving you a simple yet functional web server, written from scratch in Python3. Tasks for you:

This project can be done individually or in teams of two students. If you need help finding a partner, let me know and I can help connect you with others. If working as a team, I expect both members to write code and contribute changes to the github repository, and both members should fully understand all of the project and all of the code and be prepared to answer related questions on a midterm exam.

Code for the project can be found on GitHub classroom, using this invitation: https://classroom.github.com/a/CeoS8jLt If working as a team, make sure to create a single team github repository, rather than two separate ones.

WARNING: Do not use http-related python modules or built-in python webserver features. The goal is to write your own HTTP code, not merely invoke someone else's HTTP code.

Reach Goals: I've marked a few parts of the project as reach goals. Aim to complete as many of these as you can, but it's okay if you can't complete all of them by the deadline.

Hint: Use GitHub for collaboration and include commit messages! Commit and push changes to github as you work. Include a short message each time to keep track of your progress (example: "hello page is now dynamic, tested and working"). Your coding partner can then "pull" down your changes, keeping everyone in sync, and GitHub will show a log of your progress. Also update the README file, checking off the items as you complete them.

About online sources and AI/LLM assistants: For help on the python language or sockets programming, or for understanding HTTP, or the code I have given you, feel free to collaborate with any other students or search for help online, or even use AI/LLM assistants. But you must cite your sources, as always (example: "asked chatgpt what the code using threading.Condition() in webserver.py does and what it is for"). For writing your actual web server code, however, be very careful about using online sources or using AI/LLM assistants. There are numerous examples online of web servers written in python (and other languages), with a variety of different styles and features. These will often mislead you, and they are not likely to match the specific requirements detailed below. And if you do use online sources for writing code, you must of course cite your sources, as always.

Background

We discussed in class and you saw in the readings: web browsers (i.e. clients), web servers, and the HTTP protocol they use to communicate. You may find you need to review that material as you work on this project.

I have provided you a fairly traditional (some would say antiquated) mostly-static, file-based, multi-threaded web server, written from scratch in Python3.

Running the server

Try out the server to get familiar with how it works, running either on your laptop or on arpa.kwalsh.org:

(In)Security WARNING: Your code will almost certainly have gaping security holes. The code I have given you takes several precautions, however, including only allowing connections from your own laptop (also known as "localhost", which has the special IP address 127.0.0.1) unless you are running on arpa.kwalsh.org. That way attackers can't try to break into your laptop over the internet through your web server. Still, it is probably best to be sure to kill your server when you are not actively working on the project.

Familiarize yourself with the code

Nearly all of the code I have provided you is (hopefully) correct, working code, and you can mostly leave it alone. Read the comments to make sense of it. Your task will be to add new features—while you will have to make a few changes to the existing code, for the most part you can leave the existing code alone and just add new code. The code I have provided already implements these features:

Your Tasks

See README.md for a list of all tasks to be completed. Additional hints and details are below.

HTML hello: Find the code that creates the response for "GET /hello" requests:
http://localhost:8888/hello. Currently the response is just plain text, with no HTML formatting. You can see the content-type (aka "mime-type") sent in the response is "text/plain". Change the code so it sends "text/html" for "GET /hello" responses, and change the body of the response so it contains valid HTML markup. No need to be fancy here. Just add "<html>" and "</html>" at the start and end, add the "head", "body", and "title" opening and closing tags, use "<p>" to start a new paragraph, and make the links clickable. Add an image or two if you like, or change the colors, fonts, etc.

NOTE: Depending on where and how you run the code, you'll need to fix the links on the response page. You will need to change "http://logos.holycross.edu:8888/..." to be "localhost" or a different port number, for example, or change it to "http://arpa.kwalsh.org:8888/". Better yet, change the code to leave off the protocol, host, and port part of the link, and just use a host-relative path, like "/hello" as the link. Relative links work no matter where the server is located, since the client web browser will fill in the rest automatically.
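As a rough illustration of the markup and the host-relative links described above, here is a minimal sketch. The function name build_hello_body and the page text are made up for this example; the provided server code builds its response differently, so treat this as a pattern and not as a drop-in replacement.

```python
def build_hello_body():
    """Return a small HTML page as a string, using host-relative links."""
    return (
        "<html>\n"
        "<head><title>Hello</title></head>\n"
        "<body>\n"
        "<p>Hello from the web server!</p>\n"
        # Host-relative links: the browser fills in protocol, host, and port.
        '<p><a href="/hello">Reload this page</a></p>\n'
        '<p><a href="/welcome.html">Welcome page</a></p>\n'
        "</body>\n"
        "</html>\n"
    )

body = build_hello_body()
mime_type = "text/html"   # sent instead of "text/plain"
```

Because no link mentions a host or port, the same body works on localhost and on arpa.kwalsh.org without changes.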

Mime-type handling: Find the code that creates a response for requests to GET a file. Currently, the code always returns a response with content-type "text/html", regardless of the actual type of the file being requested. So even if the browser does "GET /xkcd_online_communities_2.png", which is clearly asking for an image file, the server incorrectly marks the response as "text/html". This can cause images and other media types to not always display correctly. Example: clicking http://localhost:8888/xkcd_online_communities_2.png probably will not render correctly in your browser if the mime-type in the response is incorrect. Also, the style sheets might not work correctly. Once you have mime-types correct, then http://localhost:8888/welcome.html and http://localhost:8888/compsci.html should look nicer, with colored backgrounds, custom fonts, etc. Fix the code so the correct mime-type is used, depending on the file name ending: ".html", ".htm", ".jpg", ".jpeg", ".png", ".txt", ".css", and ".js" should all be supported. Any mix of uppercase and lowercase should be allowed. You can find the appropriate mime-types for each of these online. Hint: don't guess, the mime-types aren't always obvious.
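One possible shape for the lookup is a dictionary keyed by lowercase file extension. This is only a sketch: the names MIME_TYPES and guess_mime_type are invented here, and the fallback type for unknown extensions is an assumption, since the assignment does not say what to do in that case.

```python
import os

# Extension -> mime-type table for the file types listed above.
MIME_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".txt": "text/plain",
    ".css": "text/css",
    ".js": "text/javascript",
}

def guess_mime_type(path, default="application/octet-stream"):
    """Pick a mime-type from the file name ending, ignoring letter case."""
    _, ext = os.path.splitext(path)
    # The default for unknown extensions is an assumption of this sketch.
    return MIME_TYPES.get(ext.lower(), default)
```

Lowercasing the extension before the lookup is what lets "PHOTO.JPG" and "photo.jpg" be treated the same way.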

Add an index.html file: Add an "index.html" file and other content to your web_root directory to serve as a main landing page. As a stress-test of your mime-type handling code, here is one approach: Go to http://www.holycross.edu/ or some other relatively simple non-interactive webpage using Chrome or Firefox, then do "Save As ... (WebPage, Complete)" to save the page as "index.html" inside your "web_root" folder. NOTE: Make sure you choose the "WebPage, Complete" option when saving. This doesn't work in Safari. This should create a file named "index.html" within your web_root folder, and it should also create a sub-folder (probably called "index_files") containing many javascript, css, and image files. Now you should be able to go to http://localhost:8888/index.html and see a nearly-complete Holy Cross homepage. It's okay if a few images and some other parts of the page are not 100% working. The College homepage is a bit fragile and the "Save As ..." doesn't always fully capture the entire page. But it should look pretty close to normal. You can do the same thing with other pages, such as http://www.dolekemp96.org/main.htm, which is a bit simpler than the Holy Cross home page as it's nearly 30 years old!

Default and directory pages: Change the server so that if the browser's request is "GET /", then it actually gets "/index.html" instead. Be sure to use the right mime-type, and if the "index.html" file does not exist, return a 404 NOT FOUND error, as usual. Similarly, for any file request that ends in "/", change it to end in "/index.html". So "/subdir/" becomes "/subdir/index.html", and so on. This way, users can go to http://localhost:8888/ and get to your main landing page, without having to type "index.html" all the time. This is a widely followed convention among web servers. As a reach goal, do the same thing for any request path that corresponds to a directory instead of a file, even if it doesn't end in "/". For example, "GET /subdir" should work the same as "GET /subdir/" and "GET /subdir/index.html".
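The path rewriting can be sketched as a small helper that runs before the file lookup. The name resolve_request_path and the web_root parameter are hypothetical, and the sketch does not deal with 404 handling or with unsafe paths such as ones containing "..".

```python
import os

def resolve_request_path(url_path, web_root):
    """Map a requested URL path to the path of the file to serve."""
    # "/" and "/subdir/" become "/index.html" and "/subdir/index.html".
    if url_path.endswith("/"):
        return url_path + "index.html"
    # Reach goal: "/subdir" (no trailing slash) that names a directory.
    local_path = os.path.join(web_root, url_path.lstrip("/"))
    if os.path.isdir(local_path):
        return url_path + "/index.html"
    return url_path
```

The returned path still has to go through the normal file-serving code, so a missing index.html produces the usual 404 NOT FOUND.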

[Reach Goal] Directory listings: For directories, if the "index.html" file does not exist, rather than sending a 404 NOT FOUND response, instead send back a nicely-formatted HTML response containing a listing of all items in the directory. The listed items should be clickable links. Most servers now disable this feature for security/privacy reasons. But see, for example, https://www.arngren.net/sitebuilder/ which is one of the directories behind this ancient Norwegian site.
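A listing page could be generated along these lines. This is a sketch under assumptions: build_directory_listing is an invented name, url_path is the path the browser asked for, and local_dir is the matching folder on disk.

```python
import html
import os
from urllib.parse import quote

def build_directory_listing(url_path, local_dir):
    """Return an HTML page linking to every item inside local_dir."""
    if not url_path.endswith("/"):
        url_path += "/"
    items = []
    for name in sorted(os.listdir(local_dir)):
        is_dir = os.path.isdir(os.path.join(local_dir, name))
        label = name + "/" if is_dir else name
        # quote() escapes spaces and similar characters in the link target;
        # html.escape() keeps odd file names from breaking the markup.
        href = quote(url_path + label)
        items.append('<li><a href="%s">%s</a></li>' % (href, html.escape(label)))
    title = "Index of " + html.escape(url_path)
    return ("<html><head><title>%s</title></head>\n"
            "<body><h1>%s</h1>\n<ul>\n%s\n</ul></body></html>\n"
            % (title, title, "\n".join(items)))
```

Sub-directories are listed with a trailing "/" so that clicking one lands on a path the default-page rule already handles.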

Dynamic hello: Even though the "GET /hello" response is not a file, the server returns the exact same response every time. In other words, it is still static. Make it dynamic by including some content that varies in the response body. Maybe have it vary the background color at random? Or keep track of how many times the hello-page has been displayed so far (just use a global counter variable in python for this), and include that counter in the response? Whatever you choose to do, going to http://localhost:8888/hello and hitting refresh should allow the client to see a (slightly) different page each time.
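Both suggestions (a random background color and a global counter) fit in a few lines. The sketch below uses invented names (hello_count, hello_lock, build_dynamic_hello) and guards the counter with a threading.Condition() because the server is multi-threaded, as explained in the concurrency hints later in this handout.

```python
import random
import threading

hello_count = 0                        # how many times the page was built
hello_lock = threading.Condition()     # protects hello_count
COLORS = ["lightblue", "lightyellow", "lightgreen", "lavender"]

def build_dynamic_hello():
    """Return a hello page whose content changes on every call."""
    global hello_count
    with hello_lock:
        hello_count += 1
        count = hello_count
    color = random.choice(COLORS)
    return ("<html><head><title>Hello</title></head>\n"
            '<body style="background-color: %s">\n'
            "<p>This page has been displayed %d time(s).</p>\n"
            "</body></html>\n" % (color, count))
```

The counter guarantees a visible change on every refresh, while the random color may occasionally repeat.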

Enter your name:
Favorite color:

Interactive hello: Make the hello page respond to user input, so the form at right works. The resulting page should have some sort of custom greeting for the chosen username. Hint: Look at the HTTP request to see how the server can get the user's name and favorite color. If you aren't using localhost with port 8888, make your own web form with the correct URL (or as a temporary fix, simply adjust the URL in your browser after clicking the "Greet Me!" button).

NOTE on weird characters: URLs can't use certain characters. If inputs have spaces or other special characters, you will notice the browser substitutes alternatives, an operation called "quoting" or "escaping". Fortunately, undoing URL quoting is easy in python: s = urllib.parse.unquote(...)
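Here is one way the request target might be split into a path and a dictionary of form values. The helper name parse_query and the field names "name" and "color" are assumptions for illustration; check the actual request your browser sends to see the real field names. Browsers submitting forms usually encode a space as "+", which unquote() leaves alone, so the sketch converts "+" to a space first.

```python
from urllib.parse import unquote

def parse_query(target):
    """Split a request target like '/hello?a=1&b=2' into (path, params)."""
    path, _, query = target.partition("?")
    params = {}
    for pair in query.split("&"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Turn "+" into a space before undoing the %XX escapes, so that
        # an escaped plus sign (%2B) still comes out as a literal "+".
        params[unquote(key.replace("+", " "))] = unquote(value.replace("+", " "))
    return path, params
```

For example, parse_query("/hello?name=Grace+Hopper") yields the path "/hello" and the value "Grace Hopper" for the key "name".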

[Reach Goal] Keep-alive: In the original HTTP protocol, each socket connection carried a single request-response pair. Both the browser and server would close their sockets after the response was finished. Now, HTTP's keep-alive feature can improve performance by allowing either a browser or a server to request that the connection be held open instead, and re-used for subsequent requests. The rule is simple. If the browser so chooses, it includes a "Connection: keep-alive" header in its request. If the server so chooses, it will then include a "Connection: keep-alive" header in the response, and both sides will then leave the socket connection open so that the browser can send another request, and the server can respond, as many times as desired, using the same socket. Each of these subsequent requests and responses would similarly have a "Connection: keep-alive" header. Or, either side is allowed to instead include "Connection: close" with a request or response, indicating that it no longer wishes to keep the connection open. Upon noticing a "Connection: close" header from the other side, either side would close its socket.

Modify the server to support HTTP's keep-alive feature. Your server should keep the connection open, processing one request after another from the connection, so long as each new request contains the keep-alive header. Be sure to stop if the browser sends a "Connection: close". Either side can also, at any time, simply close the connection. Your server should initiate the close operation eventually if the browser does not, e.g. after a certain number of requests on the socket (on the 10th request?), or after the socket has been held open for a certain amount of time (after 1 minute?), or after the socket has been idle for a certain duration (e.g. after 10 seconds?). Be sure to send the appropriate keep-alive or close header in every response your server sends. Hint: Don't try to implement this all in one spot in the code. Instead, use a boolean variable inside the Connection object to keep track of whether it should be kept open, then use and modify this variable from various places in the code.
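The decision itself can be kept separate from the socket code. The sketch below assumes the request headers have already been parsed into a dictionary with lowercase names, which may not match how the provided code stores them; the function names and the limit of 10 requests are likewise just illustrative choices.

```python
MAX_REQUESTS_PER_CONNECTION = 10   # illustrative limit, not a requirement

def should_keep_alive(request_headers, requests_served):
    """Decide whether to hold the connection open after this response.

    requests_served counts the requests handled on this connection so far,
    including the current one.
    """
    wants = request_headers.get("connection", "").strip().lower()
    if wants != "keep-alive":
        return False               # "Connection: close" or header missing
    return requests_served < MAX_REQUESTS_PER_CONNECTION

def connection_header(keep_open):
    """Return the header line to include in the response."""
    return "Connection: keep-alive" if keep_open else "Connection: close"
```

The boolean result is the kind of value the hint suggests storing in the Connection object. An idle limit could be layered on top with the socket's settimeout() method.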

[Reach Goal] Who Am I? Servers can learn a bit about the clients connecting to them. Upon receiving a GET request for "/whoami", the server should respond with a special page that contains details about the web browser client, gleaned from the HTTP request itself or from the python socket connected to the client. Include details like: IP address and port of the client; "user-agent" string identifying the browser type and version; the languages the user prefers to accept; the value of any "Cookie" header, if present; and the value of the "DNT" (do-not-track) header or "Sec-GPC" (Global Privacy Control) header, if present.
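A page like this is mostly a matter of collecting values and escaping them. In the sketch below, build_whoami_body is an invented name, client_addr is assumed to be the (IP, port) pair that Python's socket accept() or getpeername() provides, and headers is assumed to be a dictionary with lowercase header names.

```python
import html

def build_whoami_body(client_addr, headers):
    """Return an HTML page describing the client that sent the request."""
    ip, port = client_addr[0], client_addr[1]
    rows = [
        ("Client IP address", ip),
        ("Client port", str(port)),
        ("User-Agent", headers.get("user-agent", "(not sent)")),
        ("Accept-Language", headers.get("accept-language", "(not sent)")),
        ("Cookie", headers.get("cookie", "(not sent)")),
        ("DNT", headers.get("dnt", "(not sent)")),
        ("Sec-GPC", headers.get("sec-gpc", "(not sent)")),
    ]
    # Header values come from the client, so escape them before display.
    items = "\n".join("<li>%s: %s</li>" % (html.escape(k), html.escape(v))
                      for k, v in rows)
    return ("<html><head><title>Who Am I?</title></head>\n"
            "<body><ul>\n%s\n</ul></body></html>\n" % items)
```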

[Reach Goal] Ban a browser: Some websites ban certain web browsers. Many users find this annoying. Modify the server so that, for all requests coming from your least favorite browser (e.g. just Safari on MacOS), the response contains a polite "sorry, this browser is not supported" message rather than the intended content. Hint: the user-agent header in the request will be useful here. The value of that header isn't obvious or sensible, but it is still how this feature is commonly implemented.
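User-agent matching is heuristic at best. The sketch below is one plausible test for Safari on MacOS, based on the observation that Chrome and several other browsers also include the word "Safari" in their user-agent strings. The function name and the exact tokens checked are assumptions; compare against what your own browsers actually send.

```python
def is_banned_browser(user_agent):
    """Heuristically detect Safari on MacOS from a User-Agent value."""
    ua = user_agent or ""
    looks_like_safari = "Safari/" in ua and "Version/" in ua
    # Other browsers that also mention "Safari" in their user-agent.
    other_browsers = ("Chrome/", "Chromium/", "Edg/", "OPR/")
    is_other = any(token in ua for token in other_browsers)
    return "Macintosh" in ua and looks_like_safari and not is_other
```

When the function returns True, the server would send the polite "not supported" page in place of the requested content.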

[Reach Goal] Persistence with Cookies: Use HTTP cookie headers to remember something about a client, and use that information in later responses. For example, when the user submits their name and favorite color on the "/hello" page, send a cookie to the browser containing that info. When the user later visits the same page, the cookie will be present in the request headers, and you can customize the response based on that information. Or, you could put a counter value into a cookie to keep track of how many times this client has visited your site. Then, on the "/hello", "/status", or "/whoami" pages, you can include this information, e.g. "Welcome back, this is your 37th visit."
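The visit-counter idea can be sketched with two helpers: one that parses the "Cookie" request header and one that produces the "Set-Cookie" response header. The function names, the cookie name "visits", and the lowercase-keyed headers dictionary are all assumptions made for this example.

```python
def parse_cookie_header(value):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in value.split(";"):
        name, sep, val = part.strip().partition("=")
        if sep and name:
            cookies[name] = val
    return cookies

def next_visit_cookie(request_headers):
    """Return (visit_number, header_line) for the response."""
    cookies = parse_cookie_header(request_headers.get("cookie", ""))
    try:
        visits = int(cookies.get("visits", "0")) + 1
    except ValueError:
        visits = 1     # the cookie is client-controlled, so it may be junk
    return visits, "Set-Cookie: visits=%d; Path=/" % visits
```

The returned number can go straight into the page text, and the header line goes into the response headers so the browser sends the updated value next time.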

HTTP Requirements and Simplifications

Hints: Global Variables in Python

It's fine to use global variables for these small projects, especially if it's some variable that lots of functions need to use. Creating a global variable in python is easy, just make the variable outside of any function, e.g. at the top of the program:

  myvariable = 37
  mylist = [ "Swords", "Fenwick", "Dinand" ]

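When you are inside a function and want to modify some global variable (myvariable, for example, or mylist), you might need to put "global myvariable" or "global mylist" INSIDE each such function. The declaration is required whenever the function assigns to the variable (as with "myvariable += 1"); calling a method such as mylist.append(...) works without it, but including the declaration anyway is harmless. Like this: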

  def somefunction(...):
     global myvariable
     global mylist
     ...
     myvariable += 1
     ...
     mylist.append("whatever")

When you are inside a function and want to use but not modify some global variable, you can probably just go ahead and use it without the "global" declaration, and it should work fine. But if you forget the "global" inside some function, and that function assigns a new value to the variable, it will accidentally create a second local variable instead (and an update such as "myvariable += 1" will fail with an UnboundLocalError).
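The difference is easy to see in a small experiment. All names below are made up for the demonstration.

```python
counter = 0

def increment_correctly():
    global counter          # refer to the module-level variable
    counter += 1

def assign_without_global():
    counter = 100           # creates a NEW local variable named counter
    return counter          # the global counter is left untouched

def increment_without_global():
    try:
        counter += 1        # counter is treated as local here, and it has
    except UnboundLocalError:   # no value yet, so Python raises an error
        return "UnboundLocalError"
    return "no error"

def read_only():
    return counter          # reading a global needs no declaration
```

Calling increment_correctly() changes what read_only() returns, while assign_without_global() does not.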

Hints: Concurrency and Synchronization in Python

With multiple client browsers, each opening multiple socket connections to your server, there is the potential that your server will be processing multiple requests at the same time from one or more hosts. This is normal: most browsers will open 5-10 simultaneous connections, for performance reasons. Global variables can cause trouble in this situation. To avoid trouble, you must use Python's threading.Condition() objects. Here's how.

Suppose we have a few global variables that are all related to each other in some way.

    a = [ "Foo", "Bar" ]
    b = 17
    c = "Something else"

Make another global variable to accompany them, like this:

    # the my_lock variable is intended to protect a, b, and c
    my_lock = threading.Condition()

Now, whenever any piece of code ever tries to access a, b, or c, it should always do so inside of a "with my_lock" block, like this:

    ... some code ...            # danger zone: don't ever use a, b, or c out here
    with my_lock:
        a.append("Hi mom")       # this line is within the safe zone
        b = b + 1                # this line is within the safe zone
        c = "Nevermind"          # this line is within the safe zone
    ... more code here ...       # danger zone: don't ever use a, b, or c out here

What the "with my_lock:" block accomplishes: Even though many different threads may be running, executing many different parts of your code at the same time, Python will ensure that only one thread at a time will ever be within any "with my_lock:" block. Essentially, it makes your code take turns executing the "with my_lock:" blocks.

There is already just such a variable in the server code, stats.lock, intended to protect all of the statistics variables. This is needed because every thread needs to access and update those variables, and could potentially try to do so at the same time.

NOTE: If a lock is used to protect some variable, then it is NEVER SAFE to access that variable in any way outside of the "with my_lock:" blocks. ALL accesses to the variable must use the lock. A good rule of thumb for concurrent programming is to use locks to protect all global variables except constants (i.e. variables that never change except when they are first initialized).
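To see the pattern with real threads, here is a small self-contained sketch. The names page_views, page_views_lock, and simulate are invented for the demonstration and do not come from the server code.

```python
import threading

page_views = 0
# page_views_lock is intended to protect page_views
page_views_lock = threading.Condition()

def record_page_view():
    global page_views
    with page_views_lock:            # safe zone: one thread at a time
        page_views += 1

def read_page_views():
    with page_views_lock:            # reads go through the lock too
        return page_views

def simulate(num_threads=8, per_thread=1000):
    """Start several threads that all update the shared counter."""
    def worker():
        for _ in range(per_thread):
            record_page_view()
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return read_page_views()
```

Because every update happens inside a "with page_views_lock:" block, the final count equals the number of threads times the updates per thread.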

Submissions

Push your code to your github repository on the master branch. Once your code is pushed to GitHub, there is nothing further to submit. All code you commit to projects in the CSCI 356 GitHub classroom is shared with your instructor automatically, but is otherwise private by default.

Collaboration log: Please either add a collaboration.txt file to your project and push it to GitHub, or include your collaboration log as comments at the top of your python code or in the README file. Either is fine. If your code requires anything unusual to run it (e.g., command line arguments), then say so in a comment at the top of your program. Also indicate in the README file if you know of anything that is broken, incomplete, etc.

You don't need to submit your web_root folder, though you can if you like, or if you made any important changes (e.g. to welcome.html) that I would need to test your server.