lecture23.ipynb (32813B)
1 { 2 "cells": [ 3 { 4 "cell_type": "markdown", 5 "id": "8d63b077", 6 "metadata": {}, 7 "source": [ 8 "# Lecture 23\n", 9 "## Thursday, November 17th 2022\n", 10 "### Joins with SQLite and Pandas" 11 ] 12 }, 13 { 14 "cell_type": "markdown", 15 "id": "bbe7abc2", 16 "metadata": {}, 17 "source": [ 18 "## Starting Up\n", 19 "\n", 20 "You can connect to the saved database from Tuesday if you want.\n", 21 "Alternatively, for extra practice, you can just recreate it from the datasets\n", 22 "provided in the `.txt` files.\n", 23 "\n", 24 "# Exercise Deliverables\n", 25 "\n", 26 "> 1. Copy the Jupyter notebook into `lab/pp12` in your private Git repository and\n", 27 "> commit on your default branch. You should already have the `candidates.txt` and\n", 28 "> `contributors.txt` data files in this directory from last time.\n", 29 "> 2. For each exercise in the notebook, there are instructions labeled\n", 30 "> \"**Do the following:**\". Put all the code\n", 31 "> for those instructions in _code cell(s) immediately following the\n", 32 "> instructions_. The code in those cells should be regular Python code.\n", 33 "> You should place comments where appropriate that describe your intentions.\n", 34 "> **Note:** To get the\n", 35 "> Pandas tables to display in a cell, use `display()`.\n", 36 "> 3. Save and close your database. Be sure to upload your databases in\n", 37 "> `lab/pp12` as well. 
Please name your databases **`lecture23.sqlite`**\n", 38 "> and **`lecture23_pandas.sqlite`**.\n" 39 ] 40 }, 41 { 42 "cell_type": "code", 43 "execution_count": null, 44 "id": "ee27a7d0", 45 "metadata": {}, 46 "outputs": [], 47 "source": [ 48 "import sqlite3\n", 49 "import numpy as np\n", 50 "import pandas as pd\n", 51 "import time\n", 52 "\n", 53 "pd.set_option('display.width', 500)\n", 54 "pd.set_option('display.max_rows', None)\n", 55 "pd.set_option('display.max_columns', 100)\n", 56 "pd.set_option('display.notebook_repr_html', True)\n", 57 "\n", 58 "db = sqlite3.connect('lecture23.sqlite')\n", 59 "cursor = db.cursor()\n", 60 "cursor.execute(\"DROP TABLE IF EXISTS candidates\")\n", 61 "cursor.execute(\"DROP TABLE IF EXISTS contributors\")\n", 62 "cursor.execute(\"PRAGMA foreign_keys=1\")\n", 63 "\n", 64 "cursor.execute(\n", 65 " '''CREATE TABLE candidates (\n", 66 " id INTEGER PRIMARY KEY NOT NULL, \n", 67 " first_name TEXT, \n", 68 " last_name TEXT, \n", 69 " middle_name TEXT, \n", 70 " party TEXT NOT NULL)'''\n", 71 ")\n", 72 "\n", 73 "db.commit() # Commit changes to the database\n", 74 "\n", 75 "cursor.execute(\n", 76 " '''CREATE TABLE contributors (\n", 77 " id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, \n", 78 " last_name TEXT, \n", 79 " first_name TEXT, \n", 80 " middle_name TEXT, \n", 81 " street_1 TEXT, \n", 82 " street_2 TEXT, \n", 83 " city TEXT, \n", 84 " state TEXT, \n", 85 " zip TEXT, \n", 86 " amount REAL, \n", 87 " date DATETIME, \n", 88 " candidate_id INTEGER NOT NULL, \n", 89 " FOREIGN KEY(candidate_id) REFERENCES candidates(id))'''\n", 90 ")\n", 91 "\n", 92 "db.commit()\n", 93 "\n", 94 "with open(\"candidates.txt\") as candidates:\n", 95 " next(candidates) # jump over the header\n", 96 " for line in candidates.readlines():\n", 97 " vals_to_insert = line.strip().split('|')\n", 98 " cursor.execute(\n", 99 " '''INSERT INTO candidates \n", 100 " (id, first_name, last_name, middle_name, party)\n", 101 " VALUES (?, ?, ?, ?, ?)''', vals_to_insert\n", 
102 " )\n", 103 "\n", 104 "with open(\"contributors.txt\") as contributors:\n", 105 " next(contributors) # jump over the header\n", 106 " for line in contributors.readlines():\n", 107 " vals_to_insert = line.strip().split('|')[1:]\n", 108 " cursor.execute(\n", 109 " '''INSERT INTO contributors (last_name, first_name, middle_name, \n", 110 " street_1, street_2, city, state, zip, amount, date, candidate_id) \n", 111 " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''',\n", 112 " vals_to_insert\n", 113 " )\n", 114 "\n", 115 "\n", 116 "def viz_tables(query, *, database=db):\n", 117 " return pd.read_sql_query(query, database)\n", 118 "\n", 119 "from IPython.display import display" 120 ] 121 }, 122 { 123 "cell_type": "markdown", 124 "id": "4ab0105a", 125 "metadata": {}, 126 "source": [ 127 "## Recap\n", 128 "\n", 129 "Last time, you played with a number of SQLite commands to query and update the\n", 130 "tables in the database.\n", 131 "\n", 132 "One thing we didn't get to was how to query the contributors table based off\n", 133 "of a query in the candidates table (parent/child relationships). For example,\n", 134 "suppose you want to query which contributors donated to Obama. You could use a\n", 135 "nested `SELECT` statement to accomplish that." 136 ] 137 }, 138 { 139 "cell_type": "code", 140 "execution_count": null, 141 "id": "b2e61c76", 142 "metadata": {}, 143 "outputs": [], 144 "source": [ 145 "viz_tables(\n", 146 " '''SELECT * FROM contributors WHERE candidate_id = (SELECT id from candidates WHERE last_name = \"Obama\")'''\n", 147 ")" 148 ] 149 }, 150 { 151 "cell_type": "markdown", 152 "id": "2801fdb8", 153 "metadata": {}, 154 "source": [ 155 "# Joins\n", 156 "\n", 157 "The last example involved querying data from multiple tables.\n", 158 "\n", 159 "In particular, we combined columns from the two related tables (related through\n", 160 "the `FOREIGN KEY`).\n", 161 "\n", 162 "This leads to the idea of *joining* multiple tables together. 
SQL has a set\n", 163 "of commands to handle different types of joins. SQLite does not support the\n", 164 "full suite of join commands offered by SQL but you should still be able to get\n", 165 "the main ideas from the limited command set.\n", 166 "\n", 167 "We'll begin with the `INNER JOIN`." 168 ] 169 }, 170 { 171 "cell_type": "markdown", 172 "id": "c9695740", 173 "metadata": {}, 174 "source": [ 175 "## INNER JOIN\n", 176 "\n", 177 "The idea here is that you will combine the tables if the values of certain\n", 178 "columns are the same between the two tables. In our example, we will join the\n", 179 "two tables based on the candidate `id`. The result of the `INNER JOIN` will be\n", 180 "a new table consisting of the columns we requested and containing the common\n", 181 "data. Since every contributor row references a valid candidate `id`, this\n", 182 "particular join will not exclude any rows.\n", 183 "\n", 184 "### Example\n", 185 "\n", 186 "Here are two tables. Table A has the form:\n", 187 "\n", 188 "| nA | attr | id |\n", 189 "|:----:|:-----:|:---:|\n", 190 "| s1 | 23 | 0 |\n", 191 "| s2 | 7 | 2 |\n", 192 "\n", 193 "and table B has the form:\n", 194 "\n", 195 "| nB | attr | id |\n", 196 "|:----:|:-----:|:---:|\n", 197 "| t1 | 60 | 0 |\n", 198 "| t2 | 14 | 7 |\n", 199 "| t3 | 22 | 2 |\n", 200 "\n", 201 "Table A is associated with Table B through a foreign key on the id column.\n", 202 "\n", 203 "If we join the two tables by comparing the id columns and selecting the nA, nB,\n", 204 "and attr columns then we'll get\n", 205 "\n", 206 "| nA | A.attr | nB | B.attr |\n", 207 "|:----:|:-------:|:---:|:------:|\n", 208 "| s1 | 23 | t1 | 60 |\n", 209 "| s2 | 7 | t3 | 22 |\n", 210 "\n", 211 "\n", 212 "The SQLite code to do this join would be\n", 213 "\n", 214 "```SQL\n", 215 "SELECT nA, A.attr, nB, B.attr FROM A INNER JOIN B ON B.id = A.id\n", 216 "```\n", 217 "\n", 218 "Notice that the second row of table B is gone because its id value (7) has no\n", 219 "match in table A.\n", 220 "\n", 221 "### Visualization\n", 222 "\n", 223 "What is SQL doing with this operation? It may help to visualize this with a\n", 224 "Venn diagram. Table A and table B each have rows with values in their `id`\n", 225 "columns.\n", 226 "The `INNER JOIN` will combine the two tables such that rows with common entries\n", 227 "in the `id` fields are included. We essentially have the following Venn\n", 228 "diagram.\n", 229 "\n", 230 "" 231 ] 232 }, 233 { 234 "cell_type": "markdown", 235 "id": "1c12c276", 236 "metadata": {}, 237 "source": [ 238 "## Do the following:\n", 239 "\n", 240 "1. Using an `INNER JOIN`, join the `candidates` and `contributors` tables by\n", 241 " comparing the `id` column in the `candidates` table with the `candidate_id`\n", 242 " column in the `contributors` table. Display your joined table with the\n", 243 " columns `contributors.last_name`, `contributors.first_name`, `amount` and\n", 244 " `candidates.last_name`.\n", 245 "2. Do the same inner join as above, but this time append a `WHERE`\n", 246 " clause to select a specific candidate's last name.\n" 247 ] 248 }, 249 { 250 "cell_type": "markdown", 251 "id": "8e462543", 252 "metadata": {}, 253 "source": [ 254 "## `LEFT JOIN` or `LEFT OUTER JOIN`\n", 255 "\n", 256 "There are many ways to combine two tables. We just explored one possibility in\n", 257 "which we combined the tables based upon the intersection of the two tables (the\n", 258 "`INNER JOIN`).\n", 259 "\n", 260 "Now we'll look at the `LEFT JOIN` (or `LEFT OUTER JOIN` in some databases).\n", 261 "\n", 262 "In words, the `LEFT JOIN` is combining the tables based upon what is in the\n", 263 "intersection of the two tables *and* what is in the \"reference\" table (left\n", 264 "table in the SQL command).\n", 265 "\n", 266 "We can consider our toy example in two guises:\n", 267 "\n", 268 "#### Example A\n", 269 "\n", 270 "Let's do a `LEFT JOIN` of table B from table A.
That is, we'd like to make a\n", 271 "new table by putting table B into table A. In this case, we'll consider table A\n", 272 "our \"reference\" table. We're comparing by the `id` column again. We know that\n", 273 "these two tables share ids 0 and 2 and table A doesn't have anything else in it.\n", 274 "The resulting table is:\n", 275 "\n", 276 "| nA | A.attr | nB | B.attr |\n", 277 "|:----:|:-------:|:---:|:------:|\n", 278 "| s1 | 23 | t1 | 60 |\n", 279 "| s2 | 7 | t3 | 22 |\n", 280 "\n", 281 "That's not very exciting. It's the same result as from the `INNER JOIN`. We\n", 282 "can do another example that may be more enlightening.\n", 283 "\n", 284 "#### Example B\n", 285 "\n", 286 "Let's do a `LEFT JOIN` of table A from table B. That is, we'd like to make a\n", 287 "new table by putting table A into table B. In this case, we'll consider table B\n", 288 "our \"reference\" table. Again, we use the `id` column for comparison. We know\n", 289 "that these two tables share ids 0 and 2. This time, table B also contains the\n", 290 "id 7, which is not shared by table A. The resulting table is:\n", 291 "\n", 292 "| nA | A.attr | nB | B.attr |\n", 293 "|:----:|:-------:|:---:|:------:|\n", 294 "| s1 | 23 | t1 | 60 |\n", 295 "| NULL | NULL | t2 | 14 |\n", 296 "| s2 | 7 | t3 | 22 |\n", 297 "\n", 298 "Notice that SQLite filled in the missing entries with `NULL` for us.
This is necessary\n", 299 "for completion of the requested join.\n", 300 "\n", 301 "The SQLite commands to accomplish all of this are:\n", 302 "\n", 303 "```SQL\n", 304 "SELECT nA, A.attr, nB, B.attr FROM A LEFT JOIN B ON B.id = A.id\n", 305 "```\n", 306 "\n", 307 "and\n", 308 "\n", 309 "```SQL\n", 310 "SELECT nA, A.attr, nB, B.attr FROM B LEFT JOIN A ON A.id = B.id\n", 311 "```\n", 312 "\n", 313 "Here is a visualization using Venn diagrams of the `LEFT JOIN`.\n", 314 "\n", 315 "" 316 ] 317 }, 318 { 319 "cell_type": "markdown", 320 "id": "e2146d49", 321 "metadata": {}, 322 "source": [ 323 "## Do the following:\n", 324 "\n", 325 "Use the following two tables to do the first two exercises in this section.\n", 326 "Table A has the form:\n", 327 "\n", 328 "| nA | attr | id |\n", 329 "|:----:|:-----:|:---:|\n", 330 "| s1 | 23 | 0 |\n", 331 "| s2 | 7 | 2 |\n", 332 "| s3 | 15 | 3 |\n", 333 "| s4 | 31 | 7 |\n", 334 "\n", 335 "and table B has the form:\n", 336 "\n", 337 "| nB | attr | id |\n", 338 "|:----:|:-----:|:---:|\n", 339 "| t1 | 60 | 0 |\n", 340 "| t2 | 14 | 7 |\n", 341 "| t3 | 22 | 2 |\n", 342 "\n", 343 "1. Write the markdown table that would result from a `LEFT JOIN` using table A\n", 344 " as the reference and the `id` columns for comparison. Example with arbitrary\n", 345 " column names:\n", 346 " ```md\n", 347 " | Col1 | Col2 | Col3 | Col4 |\n", 348 " |:----:|:----:|:----:|:----:|\n", 349 " | val1 | val2 | val3 | val4 |\n", 350 " | val5 | val6 | val7 | val8 |\n", 351 " ```\n", 352 "2. Write the markdown table that would result from a `LEFT JOIN` using table B\n", 353 " as the reference and the `id` columns for comparison.\n", 354 "3. Now back to the candidates and their contributors. Create a new table with\n", 355 " the following form:\n", 356 "\n", 357 " | average contribution | candidate last name |\n", 358 " |:--------------------:|:-------------------:|\n", 359 " | ... | ... 
|\n", 360 "\n", 361 " The table should be created using the `LEFT JOIN` clause on the\n", 362 " `contributors` table by joining the `candidates` table using the `id`\n", 363 " column. The `average contribution` column should be obtained using the\n", 364 " `AVG()` SQL function. Use the `GROUP BY` clause on the `candidates` last\n", 365 " name.\n", 366 "\n", 367 "---" 368 ] 369 }, 370 { 371 "cell_type": "markdown", 372 "id": "7a608a69", 373 "metadata": {}, 374 "source": [ 375 "# Pandas" 376 ] 377 }, 378 { 379 "cell_type": "markdown", 380 "id": "e9962018", 381 "metadata": {}, 382 "source": [ 383 "We've been working with databases for the last few lectures and learning\n", 384 "SQLite commands to work with and manipulate the SQL databases. Pandas is\n", 385 "a powerful Python package that provides broad support for data structures. It\n", 386 "can be used to interact with relational databases through its own methods and\n", 387 "even through SQL commands.\n", 388 "\n", 389 "> In the last part of this lecture, you will get to redo a number of the\n", 390 "> previous database exercises using Pandas.\n", 391 "\n", 392 "We won't be able to cover Pandas from the ground up, but it's a\n", 393 "well-documented library and is fairly easy to get up and running. The website\n", 394 "can be found at the following link: [Pandas](http://pandas.pydata.org/). A\n", 395 "very good reference for Pandas is the book [\"Python for data\n", 396 "analysis\"](https://www.amazon.com/gp/product/1491957662/ref=as_li_tl?ie=UTF8&tag=quantpytho-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=1491957662&linkId=8c3bf87b221dbcd8f541f0db20d4da83)\n", 397 "by Wes McKinney, the creator of Pandas himself." 
398 ] 399 }, 400 { 401 "cell_type": "markdown", 402 "id": "c3b3f76d", 403 "metadata": {}, 404 "source": [ 405 "## Reading a datafile into Pandas" 406 ] 407 }, 408 { 409 "cell_type": "code", 410 "execution_count": null, 411 "id": "daf5676d", 412 "metadata": {}, 413 "outputs": [], 414 "source": [ 415 "# Using Pandas naming convention\n", 416 "dfcand = pd.read_csv(\"candidates.txt\", sep=\"|\")\n", 417 "dfcand" 418 ] 419 }, 420 { 421 "cell_type": "code", 422 "execution_count": null, 423 "id": "38244660", 424 "metadata": {}, 425 "outputs": [], 426 "source": [ 427 "dfcontr = pd.read_csv(\"contributors.txt\", sep=\"|\")\n", 428 "dfcontr" 429 ] 430 }, 431 { 432 "cell_type": "markdown", 433 "id": "1f989838", 434 "metadata": {}, 435 "source": [ 436 "Reading things in is quite easy with Pandas. Notice that Pandas populates\n", 437 "empty fields with `NaN` values. The `id` column in the `contributors` dataset is\n", 438 "superfluous. Let's delete it." 439 ] 440 }, 441 { 442 "cell_type": "code", 443 "execution_count": null, 444 "id": "c69cbd93", 445 "metadata": {}, 446 "outputs": [], 447 "source": [ 448 "del dfcontr['id']\n", 449 "dfcontr.head()" 450 ] 451 }, 452 { 453 "cell_type": "markdown", 454 "id": "1177054b", 455 "metadata": {}, 456 "source": [ 457 "Very nice! And we used the `head` method to print out the first five rows." 458 ] 459 }, 460 { 461 "cell_type": "markdown", 462 "id": "372c68d9", 463 "metadata": {}, 464 "source": [ 465 "## Creating a Table with Pandas\n", 466 "\n", 467 "We can use Pandas to create tables in a database. 
First, let's create a new\n", 468 "(empty) SQLite database:" 469 ] 470 }, 471 { 472 "cell_type": "code", 473 "execution_count": null, 474 "id": "4830fb87", 475 "metadata": {}, 476 "outputs": [], 477 "source": [ 478 "# commit the previous work\n", 479 "db.commit()\n", 480 "\n", 481 "# new database for Pandas operations\n", 482 "db = sqlite3.connect('lecture23_pandas.sqlite')\n", 483 "cursor = db.cursor()\n", 484 "cursor.execute(\"DROP TABLE IF EXISTS candidates\")\n", 485 "cursor.execute(\"DROP TABLE IF EXISTS contributors\")\n", 486 "cursor.execute(\"PRAGMA foreign_keys=1\")\n", 487 "\n", 488 "cursor.execute(\n", 489 " '''CREATE TABLE candidates (\n", 490 " id INTEGER PRIMARY KEY NOT NULL, \n", 491 " first_name TEXT, \n", 492 " last_name TEXT, \n", 493 " middle_name TEXT, \n", 494 " party TEXT NOT NULL)'''\n", 495 ")\n", 496 "\n", 497 "db.commit() # Commit changes to the database\n", 498 "\n", 499 "cursor.execute(\n", 500 " '''CREATE TABLE contributors (\n", 501 " id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, \n", 502 " last_name TEXT, \n", 503 " first_name TEXT, \n", 504 " middle_name TEXT, \n", 505 " street_1 TEXT, \n", 506 " street_2 TEXT, \n", 507 " city TEXT, \n", 508 " state TEXT, \n", 509 " zip TEXT, \n", 510 " amount REAL, \n", 511 " date DATETIME, \n", 512 " candidate_id INTEGER NOT NULL, \n", 513 " FOREIGN KEY(candidate_id) REFERENCES candidates(id))'''\n", 514 ")\n", 515 "\n", 516 "db.commit()" 517 ] 518 }, 519 { 520 "cell_type": "markdown", 521 "id": "3d335768", 522 "metadata": {}, 523 "source": [ 524 "Last time, we opened the data files with Python and then manually used\n", 525 "SQLite commands to populate the individual tables. 
We can use Pandas\n", 526 "instead like so:" 527 ] 528 }, 529 { 530 "cell_type": "code", 531 "execution_count": null, 532 "id": "2ef39d41", 533 "metadata": {}, 534 "outputs": [], 535 "source": [ 536 "dfcand.to_sql(\"candidates\", db, if_exists=\"append\", index=False)" 537 ] 538 }, 539 { 540 "cell_type": "markdown", 541 "id": "74e41652", 542 "metadata": {}, 543 "source": [ 544 "What is the size of our table?" 545 ] 546 }, 547 { 548 "cell_type": "code", 549 "execution_count": null, 550 "id": "dda8ab8a", 551 "metadata": {}, 552 "outputs": [], 553 "source": [ 554 "dfcand.shape" 555 ] 556 }, 557 { 558 "cell_type": "markdown", 559 "id": "bccda7d1", 560 "metadata": {}, 561 "source": [ 562 "We can visualize the data in our Pandas-populated SQL table. No surprises\n", 563 "here except that Pandas did everything for us (contrast this to our manual\n", 564 "file read last time using `for`-loops and list comprehensions)." 565 ] 566 }, 567 { 568 "cell_type": "code", 569 "execution_count": null, 570 "id": "7205d082", 571 "metadata": {}, 572 "outputs": [], 573 "source": [ 574 "display(viz_tables('''SELECT * FROM candidates'''))" 575 ] 576 }, 577 { 578 "cell_type": "markdown", 579 "id": "6463c34e", 580 "metadata": {}, 581 "source": [ 582 "## Querying a table with Pandas" 583 ] 584 }, 585 { 586 "cell_type": "markdown", 587 "id": "09479688", 588 "metadata": {}, 589 "source": [ 590 "### One Way\n", 591 "\n", 592 "Using the `query` method:" 593 ] 594 }, 595 { 596 "cell_type": "code", 597 "execution_count": null, 598 "id": "4dfdae96", 599 "metadata": {}, 600 "outputs": [], 601 "source": [ 602 "dfcand.query(\"first_name=='Mike' & party=='D'\")" 603 ] 604 }, 605 { 606 "cell_type": "markdown", 607 "id": "8d5ac2e6", 608 "metadata": {}, 609 "source": [ 610 "### Another Way\n", 611 "\n", 612 "Using the `__getitem__` special method:" 613 ] 614 }, 615 { 616 "cell_type": "code", 617 "execution_count": null, 618 "id": "a0ff20ba", 619 "metadata": {}, 620 "outputs": [], 621 "source": [ 622 
"dfcand[(dfcand.first_name==\"Mike\") & (dfcand.party==\"D\")]" 623 ] 624 }, 625 { 626 "cell_type": "markdown", 627 "id": "d0443eef", 628 "metadata": {}, 629 "source": [ 630 "### More Queries" 631 ] 632 }, 633 { 634 "cell_type": "code", 635 "execution_count": null, 636 "id": "6e9d78ac", 637 "metadata": {}, 638 "outputs": [], 639 "source": [ 640 "dfcand[dfcand.middle_name.notnull()]" 641 ] 642 }, 643 { 644 "cell_type": "code", 645 "execution_count": null, 646 "id": "dd900a67", 647 "metadata": {}, 648 "outputs": [], 649 "source": [ 650 "dfcand[dfcand.first_name.isin(['Mike', 'Hillary'])]" 651 ] 652 }, 653 { 654 "cell_type": "markdown", 655 "id": "55a51695", 656 "metadata": {}, 657 "source": [ 658 "## Do the following:\n", 659 "\n", 660 "1. Use Pandas to populate the contributors table in the SQLite database and\n", 661 " display the SQL table with the `viz_tables` helper.\n", 662 "2. Query the `contributors` Pandas `DataFrame` with the following constraints:\n", 663 " 1. List entries where the state is \"VA\" and the amount is less than\n", 664 " $\\$400.00$.\n", 665 " 2. List entries where the state is \"NULL\".\n", 666 " 3. List entries for the states of Texas and Pennsylvania.\n", 667 " 4. List entries where the amount contributed is between $\\$10.00$ and\n", 668 " $\\$50.00$." 
669 ] 670 }, 671 { 672 "cell_type": "markdown", 673 "id": "0b46cc6e", 674 "metadata": {}, 675 "source": [ 676 "## Sorting" 677 ] 678 }, 679 { 680 "cell_type": "code", 681 "execution_count": null, 682 "id": "8c5a40bf", 683 "metadata": {}, 684 "outputs": [], 685 "source": [ 686 "dfcand.sort_values(by='party')" 687 ] 688 }, 689 { 690 "cell_type": "code", 691 "execution_count": null, 692 "id": "499d7aa0", 693 "metadata": {}, 694 "outputs": [], 695 "source": [ 696 "dfcand.sort_values(by='party', ascending=False)" 697 ] 698 }, 699 { 700 "cell_type": "markdown", 701 "id": "4364b1bc", 702 "metadata": {}, 703 "source": [ 704 "## Selecting Columns" 705 ] 706 }, 707 { 708 "cell_type": "code", 709 "execution_count": null, 710 "id": "baf9374f", 711 "metadata": {}, 712 "outputs": [], 713 "source": [ 714 "dfcand[['last_name', 'party']]" 715 ] 716 }, 717 { 718 "cell_type": "code", 719 "execution_count": null, 720 "id": "e1232473", 721 "metadata": {}, 722 "outputs": [], 723 "source": [ 724 "dfcand[['last_name', 'party']].count()" 725 ] 726 }, 727 { 728 "cell_type": "code", 729 "execution_count": null, 730 "id": "cfd06fa6", 731 "metadata": {}, 732 "outputs": [], 733 "source": [ 734 "dfcand[['first_name']].drop_duplicates()" 735 ] 736 }, 737 { 738 "cell_type": "code", 739 "execution_count": null, 740 "id": "830863f8", 741 "metadata": {}, 742 "outputs": [], 743 "source": [ 744 "dfcand[['first_name']].drop_duplicates().count()" 745 ] 746 }, 747 { 748 "cell_type": "markdown", 749 "id": "1a5e16fa", 750 "metadata": {}, 751 "source": [ 752 "## Do the following:\n", 753 "\n", 754 "Use the `contributors` Pandas `DataFrame` and the `display` helper to\n", 755 "display the results.\n", 756 "\n", 757 "1. Sort the `DataFrame` by `amount` and order in *descending* order.\n", 758 "2. Select the `first_name` and `amount` columns.\n", 759 "3. Select the `last_name` and `first_name` columns and drop duplicates.\n", 760 "4. Count how many there are after the duplicates have been dropped." 
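The selection and sorting calls above chain naturally. A quick sketch on an invented three-row frame (the data is hypothetical; only the method calls mirror the examples above):

```python
import pandas as pd

# Invented data; same shape of operations as the exercise
df = pd.DataFrame({
    "last_name":  ["Smith", "Jones", "Smith"],
    "first_name": ["Ann",   "Bob",   "Ann"],
    "amount":     [50.0,    75.0,    20.0],
})

by_amount = df.sort_values(by="amount", ascending=False)  # largest first
name_cols = df[["last_name", "first_name"]]               # column selection
unique_pairs = name_cols.drop_duplicates()                # one row per name pair
n_unique = len(unique_pairs)
```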
761 ] 762 }, 763 { 764 "cell_type": "markdown", 765 "id": "89e753b0", 766 "metadata": {}, 767 "source": [ 768 "## Altering Tables" 769 ] 770 }, 771 { 772 "cell_type": "markdown", 773 "id": "dc563f80", 774 "metadata": {}, 775 "source": [ 776 "Creating a new column is quite easy with Pandas." 777 ] 778 }, 779 { 780 "cell_type": "code", 781 "execution_count": null, 782 "id": "2c565bc3", 783 "metadata": {}, 784 "outputs": [], 785 "source": [ 786 "dfcand['name'] = dfcand['last_name'] + \", \" + dfcand['first_name']\n", 787 "dfcand" 788 ] 789 }, 790 { 791 "cell_type": "markdown", 792 "id": "6933f2cd", 793 "metadata": {}, 794 "source": [ 795 "We can change an existing field as well (see the [`loc`\n", 796 "method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)):" 797 ] 798 }, 799 { 800 "cell_type": "code", 801 "execution_count": null, 802 "id": "beed2150", 803 "metadata": {}, 804 "outputs": [], 805 "source": [ 806 "dfcand.loc[dfcand.first_name == \"Mike\", \"name\"]" 807 ] 808 }, 809 { 810 "cell_type": "code", 811 "execution_count": null, 812 "id": "07415edc", 813 "metadata": {}, 814 "outputs": [], 815 "source": [ 816 "dfcand.loc[dfcand.first_name == \"Mike\", \"name\"] = \"Mikey\"\n", 817 "dfcand" 818 ] 819 }, 820 { 821 "cell_type": "code", 822 "execution_count": null, 823 "id": "c8f5ce9c", 824 "metadata": {}, 825 "outputs": [], 826 "source": [ 827 "display(dfcand.query(\"first_name == 'Mike'\"))  # confirm the change took effect\n", 828 "dfcand" 829 ] 830 }, 831 { 832 "cell_type": "markdown", 833 "id": "0c1ef2e3", 834 "metadata": {}, 835 "source": [ 836 "You may recall that SQLite (before version 3.35) doesn't have the functionality\n", 837 "to drop a column. It can be done in one line using Pandas:" 838 ] 839 }, 840 { 841 "cell_type": "code", 842 "execution_count": null, 843 "id": "5055a447", 844 "metadata": {}, 845 "outputs": [], 846 "source": [ 847 "del dfcand['name']\n", 848 "dfcand" 849 ] 850 }, 851 { 852 "cell_type": "markdown", 853 "id": "d8843e4a", 854 "metadata": {}, 855 "source": [
"## Do the following:\n", 857 "\n", 858 "Use the `contributors` Pandas `DataFrame` and the `display` helper to\n", 859 "display the results.\n", 860 "\n", 861 "1. Create a `name` column for the `contributors` table with field entries of the\n", 862 " form \"last name, first name\"\n", 863 "2. For contributors from the state of \"PA\", change `name` to \"X\".\n", 864 "3. Delete the newly created name column." 865 ] 866 }, 867 { 868 "cell_type": "markdown", 869 "id": "b0080e20", 870 "metadata": {}, 871 "source": [ 872 "## Aggregation\n", 873 "\n", 874 "We'd like to get information about the tables such as the maximum amount\n", 875 "contributed to the candidates. Basic statistics on a Pandas frame can be\n", 876 "obtained using the `describe()` method:" 877 ] 878 }, 879 { 880 "cell_type": "code", 881 "execution_count": null, 882 "id": "66937b4b", 883 "metadata": {}, 884 "outputs": [], 885 "source": [ 886 "dfcand.describe()" 887 ] 888 }, 889 { 890 "cell_type": "markdown", 891 "id": "4bfe4c7f", 892 "metadata": {}, 893 "source": [ 894 "It's not very interesting with the candidates table because the candidates table\n", 895 "only has one numeric column. 
Here are a few more data queries using the\n", 896 "contributors table:" 897 ] 898 }, 899 { 900 "cell_type": "code", 901 "execution_count": null, 902 "id": "dfecf3cf", 903 "metadata": {}, 904 "outputs": [], 905 "source": [ 906 "dfcontr.amount.max()" 907 ] 908 }, 909 { 910 "cell_type": "code", 911 "execution_count": null, 912 "id": "7b148b1d", 913 "metadata": {}, 914 "outputs": [], 915 "source": [ 916 "dfcontr[dfcontr.amount == dfcontr.amount.max()]" 917 ] 918 }, 919 { 920 "cell_type": "code", 921 "execution_count": null, 922 "id": "6515bac7", 923 "metadata": {}, 924 "outputs": [], 925 "source": [ 926 "dfcontr.groupby(\"state\").sum(numeric_only=True)  # numeric_only skips the text columns\n" 927 ] 928 }, 929 { 930 "cell_type": "code", 931 "execution_count": null, 932 "id": "3759b090", 933 "metadata": {}, 934 "outputs": [], 935 "source": [ 936 "dfcontr.groupby(\"state\")[\"amount\"].sum()" 937 ] 938 }, 939 { 940 "cell_type": "code", 941 "execution_count": null, 942 "id": "a6a0175d", 943 "metadata": {}, 944 "outputs": [], 945 "source": [ 946 "dfcontr.state.unique()" 947 ] 948 }, 949 { 950 "cell_type": "markdown", 951 "id": "c727e2fd", 952 "metadata": {}, 953 "source": [ 954 "Pandas also has an analogue of SQL's `LIMIT` clause. It's very intuitive\n", 955 "in Python:" 956 ] 957 }, 958 { 959 "cell_type": "code", 960 "execution_count": null, 961 "id": "3c2b94df", 962 "metadata": {}, 963 "outputs": [], 964 "source": [ 965 "dfcand[0:3]" 966 ] 967 }, 968 { 969 "cell_type": "markdown", 970 "id": "0b61964d", 971 "metadata": {}, 972 "source": [ 973 "The usual Python slicing works just fine!\n" 974 ] 975 }, 976 { 977 "cell_type": "markdown", 978 "id": "9b1c9e3d", 979 "metadata": {}, 980 "source": [ 981 "## Do the following:\n", 982 "\n", 983 "Use the `describe()` method on the `contributors` table."
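For comparison, here is how `describe()`, `max()`, and a grouped sum fit together on a small invented frame (hypothetical data; by default `describe()` summarizes only the numeric `amount` column here):

```python
import pandas as pd

# Invented data with contributors-style columns
df = pd.DataFrame({
    "state":  ["VA", "VA", "TX"],
    "amount": [10.0, 20.0, 5.0],
})

stats = df.describe()                            # count/mean/std/min/... of amount
top = df[df.amount == df.amount.max()]           # row(s) with the largest amount
per_state = df.groupby("state")["amount"].sum()  # total contributed per state
```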
984 ] 985 }, 986 { 987 "cell_type": "markdown", 988 "id": "cc537cb1", 989 "metadata": {}, 990 "source": [ 991 "## Joins with Pandas" 992 ] 993 }, 994 { 995 "cell_type": "markdown", 996 "id": "5d6a729a", 997 "metadata": {}, 998 "source": [ 999 "Pandas has some documentation on `joins`: [Merge, join, and\n", 1000 "concatenate](http://pandas.pydata.org/pandas-docs/stable/merging.html). If you\n", 1001 "want some more reinforcement on the concepts from earlier regarding `JOIN`, then\n", 1002 "the Pandas documentation may be a good place to get it.\n", 1003 "\n", 1004 "You may also be interested in [a comparison with\n", 1005 "SQL](http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#compare-with-sql-join).\n", 1006 "\n", 1007 "> To do joins with Pandas, we use the `merge` method.\n", 1008 "\n", 1009 "Here's an example of an explicit inner join:" 1010 ] 1011 }, 1012 { 1013 "cell_type": "code", 1014 "execution_count": null, 1015 "id": "7f956838", 1016 "metadata": {}, 1017 "outputs": [], 1018 "source": [ 1019 "cols_wanted = ['last_name_x', 'first_name_x', 'candidate_id', 'id', 'last_name_y', 'amount']\n", 1020 "dfcontr.merge(dfcand, left_on=\"candidate_id\", right_on=\"id\")[cols_wanted]" 1021 ] 1022 }, 1023 { 1024 "cell_type": "markdown", 1025 "id": "842f3710", 1026 "metadata": {}, 1027 "source": [ 1028 "Somewhat more organized with additional grouping and description of resulting\n", 1029 "data frame:" 1030 ] 1031 }, 1032 { 1033 "cell_type": "code", 1034 "execution_count": null, 1035 "id": "7322f44e", 1036 "metadata": {}, 1037 "outputs": [], 1038 "source": [ 1039 "dfcontr.merge(dfcand, left_on=\"candidate_id\", right_on=\"id\")[cols_wanted].groupby('last_name_y').describe()" 1040 ] 1041 }, 1042 { 1043 "cell_type": "markdown", 1044 "id": "79bcc62c", 1045 "metadata": {}, 1046 "source": [ 1047 "### Other Joins with Pandas\n", 1048 "\n", 1049 "We didn't cover all possible joins because, before version 3.39, SQLite could\n", 1050 "only handle the few that we did discuss.
As mentioned, there are workarounds for some things in\n", 1051 "SQLite, but not everything. Fortunately, Pandas can handle pretty much\n", 1052 "everything. Here are a few joins that Pandas can handle:\n", 1053 "\n", 1054 "* `LEFT OUTER`: discussed above\n", 1055 "* `RIGHT OUTER`: think of the \"opposite\" of a `LEFT OUTER` join (shade the\n", 1056 " intersection and *right* set in the Venn diagram).\n", 1057 "* `FULL OUTER`: combine everything from both tables (shade the entire Venn\n", 1058 " diagram)\n", 1059 "\n", 1060 "Let's build the tables from the join exercise above in Pandas:" 1061 ] 1062 }, 1063 { 1064 "cell_type": "code", 1065 "execution_count": null, 1066 "id": "b7eea35c", 1067 "metadata": { 1068 "lines_to_next_cell": 2 1069 }, 1070 "outputs": [], 1071 "source": [ 1072 "df_A = pd.DataFrame(\n", 1073 " {\n", 1074 " 'nA': ['s1', 's2', 's3', 's4'],\n", 1075 " 'attr': [23, 7, 15, 31],\n", 1076 " 'id': [0, 2, 3, 7]\n", 1077 " }\n", 1078 ")\n", 1079 "\n", 1080 "df_B = pd.DataFrame(\n", 1081 " {\n", 1082 " 'nB': ['t1', 't2', 't3'],\n", 1083 " 'attr': [60, 14, 22],\n", 1084 " 'id': [0, 7, 2]\n", 1085 " }\n", 1086 ")" 1087 ] 1088 }, 1089 { 1090 "cell_type": "markdown", 1091 "id": "98fcc8c4", 1092 "metadata": {}, 1093 "source": [ 1094 "#### Left Outer Join with Pandas" 1095 ] 1096 }, 1097 { 1098 "cell_type": "code", 1099 "execution_count": null, 1100 "id": "a0dd2148", 1101 "metadata": {}, 1102 "outputs": [], 1103 "source": [ 1104 "df_A.merge(df_B, left_on='id', right_on='id', how='left')" 1105 ] 1106 }, 1107 { 1108 "cell_type": "markdown", 1109 "id": "862e7b92", 1110 "metadata": {}, 1111 "source": [ 1112 "#### Right Outer Join with Pandas" 1113 ] 1114 }, 1115 { 1116 "cell_type": "code", 1117 "execution_count": null, 1118 "id": "c8dd4523", 1119 "metadata": {}, 1120 "outputs": [], 1121 "source": [ 1122 "df_A.merge(df_B, left_on='id', right_on='id', how='right')" 1123 ] 1124 }, 1125 { 1126 "cell_type": "markdown", 1127 "id": "53eceb9e", 1128 "metadata": {}, 1129
"source": [ 1130 "#### Full Outer Join with Pandas" 1131 ] 1132 }, 1133 { 1134 "cell_type": "code", 1135 "execution_count": null, 1136 "id": "531b99f2", 1137 "metadata": {}, 1138 "outputs": [], 1139 "source": [ 1140 "df_A.merge(df_B, left_on='id', right_on='id', how='outer')" 1141 ] 1142 }, 1143 { 1144 "cell_type": "markdown", 1145 "id": "2c8822c6", 1146 "metadata": {}, 1147 "source": [ 1148 "# Save our databases\n", 1149 "\n", 1150 "Commit the changes to the open SQL databases and close them. Well done!" 1151 ] 1152 }, 1153 { 1154 "cell_type": "code", 1155 "execution_count": null, 1156 "id": "0d994c14", 1157 "metadata": {}, 1158 "outputs": [], 1159 "source": [ 1160 "db.commit()\n", 1161 "db.close()" 1162 ] 1163 } 1164 ], 1165 "metadata": { 1166 "kernelspec": { 1167 "display_name": "Python 3", 1168 "language": "python", 1169 "name": "python3" 1170 } 1171 }, 1172 "nbformat": 4, 1173 "nbformat_minor": 5 1174 }