Recent Posts

  1. How to groupby in Pandas with a missing group

    Wed 25 April 2018

    This is a note intended for my future self. Here’s how to do it:

    1. Have a list of the values that you expect for each group.
    2. Iterate over that list, and look up the values using .loc.
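
    A minimal sketch of those two steps, assuming a small made-up DataFrame with `month` and `sales` columns (both names, the data, and the fallback value of 0 for an absent group are illustrative assumptions, not taken from the original note):

```python
import pandas as pd

# Hypothetical data: sales exist for months 1 and 3 only, so month 2 is a missing group.
df = pd.DataFrame({
    "month": [1, 1, 3],
    "sales": [10, 20, 5],
})

# Step 1: the list of values we expect to see as groups.
expected_months = [1, 2, 3]

# Ordinary groupby; month 2 will simply be absent from the result's index.
totals = df.groupby("month")["sales"].sum()

# Step 2: iterate over the expected values and look each one up with .loc,
# falling back to 0 (an assumed default) when the group is missing.
filled = {}
for month in expected_months:
    if month in totals.index:
        filled[month] = int(totals.loc[month])
    else:
        filled[month] = 0

print(filled)
```

    Checking membership in the index before calling `.loc` avoids the `KeyError` that a direct lookup of a missing label would raise.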

    Say I want to group by months, but not all of the …

  2. Boston 2018

    Mon 16 April 2018

    This was an incredible experience.

    The crowds were fantastic, and seeing my son Olson at mile 6 in the rain and cold was amazing. With the weather, Olson on my mind, and the huge crowds, parts of the race were very emotional.

    At about mile 11-13, I sped up …

  3. UVM Twitter data notes

    Sat 25 November 2017

    First, important dates:

    • 2008-09-11: We have the deca-hose from here, with a higher % of the tweets at the beginning and down to 10% now (with the total volume increasing greatly over that time). Geo is (was) roughly 1% of all tweets, first with a "coordinates" and then with the "places …

  4. Linking files from GitHub in CodePen

    Wed 08 November 2017

    In the course I teach at UC Berkeley in the MIDS program, we use CodePen to build interactive web graphics. There are a host of reasons to use CodePen, but setting that aside for now, let's talk about how to host data files for CodePen. CodePen lacks a way for …

  5. Boston-bound for 2018

    Sun 10 September 2017

    Well, I never thought it would happen. Today I qualified for the Boston Marathon with a 3:00:04 showing at the Presque Isle Marathon.

  6. Enabling Jupyter notebook dashboards

    Thu 04 May 2017

    If you perform EDA using Jupyter notebooks, it’s really easy to share those results with some moderate interaction via a Jupyter dashboard. Here are the basic steps:

    1. Build the analysis, etc., assuming this is done locally. Install the dashboard layout extension and lay out some sweet graphs. Optional: decorate …

  7. Should I set metadata manually in pyspark?

    Thu 04 May 2017

    Well, let’s do a simple test and find out if it speeds up the process of one-hot encoding a variable in our data. There are other reasons to set it, and we’ll get to those. Starting with the very helpful code snippet from spark-gotchas:

    import json
    
    from pyspark …
