CSCI 307
    Data Mining

    College of the Holy Cross, Fall 2023




    Python Resources

    Anaconda is a great distribution for getting started with Python for data mining applications, as it comes pre-installed with a number of useful libraries and has a good package management system.

    For people more familiar with C++ or Java, the following can be useful.

    • Introduction to Python for C++ users (Archived lecture notes from a past ICL course)
    • Introduction to Python for Java users
    • A video on Python strategies I liked


    The first library we will study is NumPy, and its official documentation is quite good.

    • Beginner's guide
    • Quickstart guide
    Beyond that, I quite like this chapter from the book "Python for Data Analysis" by Wes McKinney.
    • NumPy Basics


    Next we start on Pandas, which is great for reading in structured tabular data, such as .csv files. It also has a lot of functionality for data analysis and works seamlessly with NumPy, Matplotlib (which we'll see later), etc.

    • There is a 10-minute intro in the official Pandas documentation, but it's somewhat messy.
    • You might find this book to be a better reference as you do more work in Python.
    • The previously mentioned "Python for Data Analysis" book also has a great Pandas chapter.
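
    As a quick illustration of reading tabular data and handing it to NumPy, here is a minimal sketch. The CSV contents and column names are made up for this example, and the text is read from an in-memory string rather than a real file, so treat it as a starting point rather than a recipe.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a .csv file on disk.
csv_text = """name,height_cm,weight_kg
Ann,160,55
Bob,175,72
Cy,182,80
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Column-wise statistics work directly on a DataFrame column.
mean_height = df["height_cm"].mean()

# Boolean indexing keeps only the rows that satisfy a condition.
heavy = df[df["weight_kg"] > 60]

# The numeric columns can be converted to a plain NumPy array.
values = df[["height_cm", "weight_kg"]].to_numpy()

print(df)
print("mean height:", mean_height)
print("array shape:", values.shape)
```

    With a real file, the same call would take a path instead, for example pd.read_csv("data.csv"), where the file name is again just a placeholder.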


    The primary plotting library we will be using is matplotlib.pyplot. The very short tutorial below gives you lots of examples. Feel free to find the scatter plots on that page and compare them to what you do in HW2.

    • PyPlot tutorial
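
    For orientation, a scatter plot in pyplot can look roughly like the sketch below. The data is randomly generated, and the labels and output file name are placeholders chosen for this example; it is not the HW2 solution.

```python
import matplotlib

# Non-interactive backend so the script also runs without a display.
# Remove this line and call plt.show() to open a window instead.
matplotlib.use("Agg")

import matplotlib.pyplot as plt
import numpy as np

# Made-up data: y depends linearly on x, plus some noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)

fig, ax = plt.subplots()
ax.scatter(x, y, s=20, alpha=0.7)  # s is marker size, alpha is transparency
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Example scatter plot")

# Placeholder file name for the saved figure.
fig.savefig("scatter_example.png")
```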


    Math Resources

    Matrix and Linear Algebra

    • The Essence of Linear Algebra series of videos from 3Blue1Brown is an excellent resource for gaining intuition about matrices, matrix operations and linear algebra.

    Principal Component Analysis

    • Stack Exchange, Stack Overflow, etc. are often great sources for learning. For example, the answer "Explaining PCA" is excellent.