kenpompy: College Basketball for Nerds

This Python package serves as a convenient web scraper for kenpom, which provides tons of great NCAA basketball statistics and metrics. It requires a subscription to Ken Pomeroy’s site; without one, only the home page is accessible. It’s a small fee for a year of access, and totally worth it in my opinion.

Objective

Ultimately, this package aims to let hobbyists and renowned sports analysts alike get data from kenpom in a format more suitable for visualization, transformation, and additional analysis. It’s meant to be simple, easy to use, and to yield information in a way that is immediately usable.

Responsible Use

As with many web scrapers, the responsibility to use this package in a reasonable manner falls upon the user. Don’t be a jerk and constantly scrape the site a thousand times a minute or you run the risk of potentially getting barred from it, which you’d likely deserve. I am in no way responsible for how you use (or abuse) this package. Be sensible.
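One way to stay polite is to put a minimum gap between requests. The following is only a sketch and is not part of kenpompy: the `throttled` helper and its two-second default are assumptions chosen for illustration.

```python
import time


def throttled(func, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so consecutive calls are at least min_interval seconds apart.

    Hypothetical helper, not part of kenpompy. clock and sleep are
    injectable so the behaviour can be checked without real waiting.
    """
    last_call = [None]  # mutable cell holding the time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_interval - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)  # pause until the minimum gap has passed
        last_call[0] = clock()
        return func(*args, **kwargs)

    return wrapper
```

Wrapping any scraping function once, for example `polite_get = throttled(some_scraping_function)`, and then calling only the wrapped version keeps a loop from hammering the site.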

But I Use R

Yeah, yeah, but have you heard of reticulate? It’s an R interface to Python that also supports passing objects (like dataframes!) between them.

Installation

kenpompy is easily installed via pip:

pip install kenpompy

What It Can (and Can’t) Do

This is a work in progress - it can currently scrape all of the summary, FanMatch, and miscellaneous tables, pretty much all of those under the Stats and Miscellany headings. Team and Player classes are planned, but they’re more complicated and will take some time.

Usage

kenpompy is simple to use. Generally, tables on each page are scraped into pandas dataframes with simple parameters to select different seasons or tables. As many tables have headers that don’t parse well, some are manually altered to a small degree to make the resulting dataframe easier to interpret and manipulate.

First, you must log in:

from kenpompy.utils import login

# Returns an authenticated browser that can then be used to scrape pages that require authorization.
browser = login(your_email, your_password)
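To keep credentials out of scripts, they could be read from environment variables before calling login. This is a sketch only: the variable names KENPOM_EMAIL and KENPOM_PASSWORD and the `read_credentials` helper are made up for this example and are not something kenpompy looks for.

```python
import os


def read_credentials(env=None):
    """Return (email, password) taken from environment variables.

    KENPOM_EMAIL and KENPOM_PASSWORD are hypothetical names chosen for
    this sketch; kenpompy itself does not read them.
    """
    env = os.environ if env is None else env
    names = ("KENPOM_EMAIL", "KENPOM_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["KENPOM_EMAIL"], env["KENPOM_PASSWORD"]
```

The returned pair can then be passed straight through, as in `browser = login(*read_credentials())`.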

Then you can request specific pages that will be parsed into convenient dataframes:

import kenpompy.summary as kp

# Returns a pandas dataframe containing the efficiency and tempo stats for the current season (https://kenpom.com/summary.php).
eff_stats = kp.get_efficiency(browser)
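Since most scraping functions accept a season argument, pulling several seasons is a short loop. The `collect_seasons` helper below is a hypothetical convenience, not part of kenpompy; it assumes only that the function passed in accepts `browser` and a `season` keyword, as the functions documented below do.

```python
def collect_seasons(fetch, browser, seasons):
    """Call fetch(browser, season=...) once per season.

    Hypothetical helper, not part of kenpompy. Returns a dict that maps
    each season (as a string) to whatever fetch returned for it.
    """
    results = {}
    for season in seasons:
        key = str(season)  # the documented season parameters are strings
        results[key] = fetch(browser, season=key)
    return results
```

For example, `collect_seasons(kp.get_efficiency, browser, ["2019", "2020"])` would return one dataframe per season, keyed by year. Keep the list short, or add a pause between calls, to stay within responsible use.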

Full API Reference

utils

The utils module provides utility functions, such as logging in.

kenpompy.utils.get_html(browser: CloudScraper, url: str)

Performs a get request on the specified url.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • url (str) – The url to perform the get request on.

Returns:

The return content.

Return type:

html (Bytes | Any)

Raises:

Exception if get request gets a non-200 response code.
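Because a non-200 response surfaces as an exception, a caller may want to retry a few times with a growing pause. A minimal sketch, assuming nothing about kenpompy beyond the documented exception; `get_with_retry` and its defaults are hypothetical.

```python
import time


def get_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the pause after each failure.

    Hypothetical helper, not part of kenpompy. fetch is any zero-argument
    callable; the last error is re-raised once the attempts run out.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A call such as `get_with_retry(lambda: get_html(browser, url))` would then try up to three times before giving up.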

kenpompy.utils.login(email: str, password: str)

Logs in to kenpom.com using user credentials.

Parameters:
  • email (str) – User e-mail for login to kenpom.com.

  • password (str) – User password for login to kenpom.com.

Returns:

Authenticated browser with full access to kenpom.com.

Return type:

browser (CloudScraper)

misc

This module provides functions for scraping the miscellaneous stats kenpom.com pages into more usable pandas dataframes.

kenpompy.misc.get_arenas(browser: CloudScraper, season: str | None = None)

Scrapes the arenas table (https://kenpom.com/arenas.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 2010 is the earliest available season. Most recent season is the default.

Returns:

Pandas dataframe containing the arenas table from kenpom.com.

Return type:

arenas_df (pandas dataframe)

Raises:

ValueError – If season is less than 2010.

kenpompy.misc.get_current_season(browser: CloudScraper)

Scrapes the KenPom homepage to get the latest season year that has data published.

Parameters:

browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

Returns:

Number corresponding to the last season year that has data published.

Return type:

current_season (int)
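The current season comes back as an int, while the season parameters elsewhere are documented as strings. A small sketch of bridging the two; `recent_seasons` is a hypothetical helper, not part of kenpompy.

```python
def recent_seasons(current_season, count):
    """Return the last `count` seasons, newest first, as strings.

    Hypothetical helper, not part of kenpompy. current_season is the int
    returned by get_current_season.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return [str(current_season - offset) for offset in range(count)]
```

With a current season of 2024, `recent_seasons(2024, 3)` gives `["2024", "2023", "2022"]`, ready to pass as season arguments.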

kenpompy.misc.get_gameattribs(browser: CloudScraper, season: str | None = None, metric: str = 'Excitement')

Scrapes the Game Attributes tables (https://kenpom.com/game_attrs.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 2010 is the earliest available season. Most recent season is the default.

  • metric (str, optional) – Used to get highest ranking games for different metrics. Available values are: ‘Excitement’, ‘Tension’, ‘Dominance’, ‘ComeBack’, ‘FanMatch’, ‘Upsets’, and ‘Busts’. Default is ‘Excitement’. ‘FanMatch’, ‘Upsets’, and ‘Busts’ are only valid for seasons after 2010.

Returns:

Pandas dataframe containing the Game Attributes table from kenpom.com for a given metric.

Return type:

ga_df (pandas dataframe)

Raises:
  • ValueError – If season is less than 2010.

  • KeyError – If metric is invalid.
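The metric and season rules above can be checked locally before any request is made. This is a sketch, not kenpompy's own validation: `check_gameattribs_args` is hypothetical, and it reads "seasons after 2010" as strictly later than 2010, which is an assumption.

```python
ALL_SEASON_METRICS = {"Excitement", "Tension", "Dominance", "ComeBack"}
LATER_METRICS = {"FanMatch", "Upsets", "Busts"}  # documented as valid only after 2010


def check_gameattribs_args(metric="Excitement", season=None):
    """Raise early if metric/season break the documented rules.

    Hypothetical pre-check, not part of kenpompy. Mirrors the documented
    KeyError and ValueError; the third check is this sketch's own reading.
    """
    if metric not in ALL_SEASON_METRICS | LATER_METRICS:
        raise KeyError(metric)
    if season is None:
        return  # most recent season: nothing more to check
    year = int(season)
    if year < 2010:
        raise ValueError("2010 is the earliest available season")
    if metric in LATER_METRICS and year <= 2010:
        # Assumption: "after 2010" means 2011 onwards.
        raise ValueError(metric + " is only valid for seasons after 2010")
```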

kenpompy.misc.get_hca(browser: CloudScraper)

Scrapes the home court advantage table (https://kenpom.com/hca.php) into a dataframe.

Parameters:

browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

Returns:

Pandas dataframe containing the home court advantage table from kenpom.com.

Return type:

hca_df (pandas dataframe)

kenpompy.misc.get_pomeroy_ratings(browser: CloudScraper, season: str | None = None)

Scrapes the Pomeroy College Basketball Ratings table (https://kenpom.com/index.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 1999 is the earliest available season. Most recent season is the default.

Returns:

Pandas dataframe containing the Pomeroy College Basketball Ratings table from kenpom.com.

Return type:

refs_df (pandas dataframe)

Raises:

ValueError – If season is less than 1999.

kenpompy.misc.get_program_ratings(browser: CloudScraper)

Scrapes the program ratings table (https://kenpom.com/programs.php) into a dataframe.

Parameters:

browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

Returns:

Pandas dataframe containing the program ratings table from kenpom.com.

Return type:

programs_df (pandas dataframe)

kenpompy.misc.get_refs(browser: CloudScraper, season: str | None = None)

Scrapes the officials rankings table (https://kenpom.com/officials.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 2016 is the earliest available season. Most recent season is the default.

Returns:

Pandas dataframe containing the officials rankings table from kenpom.com.

Return type:

refs_df (pandas dataframe)

Raises:

ValueError – If season is less than 2016.

kenpompy.misc.get_trends(browser: CloudScraper)

Scrapes the statistical trends table (https://kenpom.com/trends.php) into a dataframe.

Parameters:

browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

Returns:

Pandas dataframe containing the statistical trends table from kenpom.com.

Return type:

trends_df (pandas dataframe)

summary

This module provides functions for scraping the summary stats kenpom.com pages into more usable pandas dataframes.

kenpompy.summary.get_efficiency(browser: CloudScraper, season: str | None = None)

Scrapes the Efficiency stats table (https://kenpom.com/summary.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 1999 is the earliest available season but possession length data wasn’t available until 2010. Most recent season is the default.

Returns:

Pandas dataframe containing the summary efficiency/tempo table from kenpom.com.

Return type:

eff_df (pandas dataframe)

Raises:

ValueError – If season is less than 1999.

kenpompy.summary.get_fourfactors(browser: CloudScraper, season: str | None = None)

Scrapes the Four Factors table (https://kenpom.com/stats.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 1999 is the earliest available season. Most recent season is the default.

Returns:

Pandas dataframe containing the summary Four Factors table from kenpom.com.

Return type:

ff_df (pandas dataframe)

Raises:

ValueError – If season is less than 1999.

kenpompy.summary.get_height(browser: CloudScraper, season: str | None = None)

Scrapes the Height/Experience table (https://kenpom.com/height.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 2007 is the earliest available season but continuity data wasn’t available until 2008. Most recent season is the default.

Returns:

Pandas dataframe containing the Height/Experience table from kenpom.com.

Return type:

h_df (pandas dataframe)

Raises:

ValueError – If season is less than 2007.

kenpompy.summary.get_kpoy(browser: CloudScraper, season: str | None = None)

Scrapes the kenpom Player of the Year tables (https://kenpom.com/kpoy.php) into dataframes.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 2011 is the earliest available season. Most recent season is the default.

Returns:

List of pandas dataframes containing the kenpom Player of the Year and Game MVP leaders tables from kenpom.com. The Game MVP table is only available from the 2013 season onwards.

Return type:

kpoy_dfs (list of pandas dataframe)

Raises:

ValueError – If season is less than 2011.
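Since the Game MVP table may be absent for older seasons, unpacking the returned list defensively avoids an IndexError. A sketch under two assumptions that the documentation above does not confirm: the Player of the Year table comes first, and the list simply has one element when there is no Game MVP table. `split_kpoy` is a hypothetical helper.

```python
def split_kpoy(kpoy_dfs):
    """Return (player_of_year, game_mvp), with game_mvp None if absent.

    Hypothetical helper, not part of kenpompy. Assumes the Player of the
    Year table is first and the Game MVP table, when present, is second.
    """
    tables = list(kpoy_dfs)
    if not tables:
        raise ValueError("expected at least one table")
    player_of_year = tables[0]
    game_mvp = tables[1] if len(tables) > 1 else None
    return player_of_year, game_mvp
```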

kenpompy.summary.get_playerstats(browser: CloudScraper, season: str | None = None, metric: str = 'EFG', conf: str | None = None, conf_only: bool = False)

Scrapes the Player Leaders tables (https://kenpom.com/playerstats.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 2004 is the earliest available season. Most recent season is the default.

  • metric (str, optional) – Used to get leaders for different metrics. Available values are: ‘ORtg’, ‘Min’, ‘eFG’, ‘Poss’, ‘Shots’, ‘OR’, ‘DR’, ‘TO’, ‘ARate’, ‘Blk’, ‘FTRate’, ‘Stl’, ‘TS’, ‘FC40’, ‘FD40’, ‘2P’, ‘3P’, ‘FT’. Default is ‘eFG’. ‘ORtg’ returns a list of four dataframes, as there are four tables: players that used >28% of possessions, >24% of possessions, >20% of possessions, and with no possession restriction.

  • conf (str, optional) – Used to limit to players in a specific conference. Allowed values are: ‘A10’, ‘ACC’, ‘AE’, ‘AMER’, ‘ASUN’, ‘B10’, ‘B12’, ‘BE’, ‘BSKY’, ‘BSTH’, ‘BW’, ‘CAA’, ‘CUSA’, ‘HORZ’, ‘IND’, ‘IVY’, ‘MAAC’, ‘MAC’, ‘MEAC’, ‘MVC’, ‘MWC’, ‘NEC’, ‘OVC’, ‘P12’, ‘PAT’, ‘SB’, ‘SC’, ‘SEC’, ‘SLND’, ‘SUM’, ‘SWAC’, ‘WAC’, ‘WCC’. If you try to use a conference that doesn’t exist for a given season, like ‘IND’ and ‘2018’, you’ll get an empty table, as kenpom.com doesn’t serve 404 pages for invalid table queries like that. No filter applied by default.

  • conf_only (bool, optional) – Used to define whether stats should reflect conference games only. Only available if specific conference is defined. Only available for seasons after 2013. False by default.

Returns:

Pandas dataframe containing the Player Leaders table from kenpom.com.

Return type:

ps_df (pandas dataframe)

Raises:
  • ValueError – If season is less than 2004 or conf_only is used with an invalid season.

  • KeyError – If metric is invalid.
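Because ‘ORtg’ yields four tables while every other metric yields one, calling code may find it easier to work with a single shape. The sketch below labels the results in a dict. `label_playerstats` is hypothetical, and it assumes the four ORtg tables arrive in the order listed in the metric description above.

```python
ORTG_LABELS = (
    ">28% of possessions",
    ">24% of possessions",
    ">20% of possessions",
    "no possession restriction",
)


def label_playerstats(result, metric):
    """Return a dict of label -> table for any playerstats result.

    Hypothetical helper, not part of kenpompy. Assumes the ORtg list is
    ordered as in the documentation: >28%, >24%, >20%, unrestricted.
    """
    if metric.lower() == "ortg":
        tables = list(result)
        if len(tables) != len(ORTG_LABELS):
            raise ValueError("expected four ORtg tables")
        return dict(zip(ORTG_LABELS, tables))
    return {metric: result}  # single-table metrics keep their own name
```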

kenpompy.summary.get_pointdist(browser: CloudScraper, season: str | None = None)

Scrapes the Team Points Distribution table (https://kenpom.com/pointdist.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • season (str, optional) – Used to define different seasons. 1999 is the earliest available season. Most recent season is the default.

Returns:

Pandas dataframe containing the Team Points Distribution table from kenpom.com.

Return type:

dist_df (pandas dataframe)

Raises:

ValueError – If season is less than 1999.

kenpompy.summary.get_teamstats(browser: CloudScraper, defense: bool | None = False, season: str | None = None)

Scrapes the Miscellaneous Team Stats table (https://kenpom.com/teamstats.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • defense (bool, optional) – Used to flag whether the defensive teamstats table is wanted or not. False by default.

  • season (str, optional) – Used to define different seasons. 1999 is the earliest available season. Most recent season is the default.

Returns:

Pandas dataframe containing the Miscellaneous Team Stats table from kenpom.com.

Return type:

ts_df (pandas dataframe)

Raises:

ValueError – If season is less than 1999.
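The earliest seasons documented for the misc and summary functions can be gathered into one lookup, which makes it easy to see what is available for a given year before scraping. The table below copies the years from this reference; the `supported_functions` helper is a hypothetical addition, not part of kenpompy.

```python
# Earliest season per function, as documented in this reference.
EARLIEST_SEASON = {
    "get_arenas": 2010,
    "get_gameattribs": 2010,
    "get_pomeroy_ratings": 1999,
    "get_refs": 2016,
    "get_efficiency": 1999,
    "get_fourfactors": 1999,
    "get_height": 2007,
    "get_kpoy": 2011,
    "get_playerstats": 2004,
    "get_pointdist": 1999,
    "get_teamstats": 1999,
}


def supported_functions(season):
    """Return the sorted names of functions documented to cover `season`.

    Hypothetical helper, not part of kenpompy.
    """
    year = int(season)
    return sorted(name for name, earliest in EARLIEST_SEASON.items() if year >= earliest)
```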

FanMatch

This module contains the FanMatch class for scraping the FanMatch pages into more usable objects.

class kenpompy.FanMatch.FanMatch(browser: CloudScraper, date: str | None = None)

Object to hold FanMatch page scraping results.

This class scrapes the kenpom FanMatch page when a new instance is created.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • date (str) – Date to scrape, in format “YYYY-MM-DD”, such as “2020-01-29”.

url

Full url for the page to be scraped.

Type:

str

date

Date to be scraped.

Type:

str

lines_o_night

List containing lines of the night if games have taken place.

Type:

list

ppg

Average points per game for the day.

Type:

float

avg_eff

Average efficiency for the day.

Type:

float

pos_40

Possessions per 40 minutes for the day.

Type:

float

mean_abs_err_pred_total_score

The mean absolute error of predicted total score for the day.

Type:

float

bias_pred_total_score

The bias of predicted total score for the day.

Type:

float

mean_abs_err_pred_mov

The mean absolute error of the predicted margin of victory for the day.

Type:

float

record_favs

Record of favorites for the day.

Type:

str

expected_record_favs

Expected record of favorites for the day.

Type:

str

exact_mov

Number of games where margin of victory was accurately predicted out of total played.

Type:

str

fm_df

Pandas dataframe containing parsed FanMatch table. If there are no games that day, fm_df will be None.

Type:

pandas dataframe or None
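Since each FanMatch instance covers a single date, scraping a stretch of days starts with a list of date strings in the “YYYY-MM-DD” format. A standard-library sketch; `fanmatch_dates` is a hypothetical helper, not part of kenpompy.

```python
from datetime import date, timedelta


def fanmatch_dates(start, end):
    """Return every date from start to end inclusive as "YYYY-MM-DD" strings.

    Hypothetical helper, not part of kenpompy.
    """
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    if last < first:
        raise ValueError("end must not be before start")
    days = (last - first).days
    return [(first + timedelta(days=n)).isoformat() for n in range(days + 1)]
```

Each string can then be passed as the date argument when creating a FanMatch instance. Remember that fm_df is None on days without games, so check for that before using it.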

Team pages

This module contains functions for scraping the team page kenpom.com tables into pandas dataframes.

kenpompy.team.get_schedule(browser: CloudScraper, team: str | None = None, season: str | None = None)

Scrapes a team’s schedule from (https://kenpom.com/team.php) into a dataframe.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function

  • team (str, optional) – Used to determine which team to scrape for schedule.

  • season (str, optional) – Used to define different seasons. 1999 is the earliest available season.

Returns:

Dataframe containing a team’s schedule for the given season.

Return type:

team_df (pandas dataframe)

Raises:
  • ValueError – If season is less than 1999.

  • ValueError – If season is greater than the current year.

  • ValueError – If team is not in the valid team list.

kenpompy.team.get_scouting_report(browser: CloudScraper, team: str, season: int | None = None, conference_only: bool = False)

Retrieves and parses team scouting report data from (https://kenpom.com/team.php) into a dictionary.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function.

  • team (str) – Used to determine which team to scrape.

  • season (int, optional) – Used to define different seasons. 1999 is the earliest available season.

  • conference_only (bool, optional) – When True, only conference-related stats are retrieved; otherwise, all stats are fetched.

Returns:

A dictionary containing various team statistics.

Return type:

dict

Raises:
  • ValueError – If the provided season is earlier than 1999 or greater than the current year.

  • ValueError – If the team name is invalid or not found in the specified year.

kenpompy.team.get_valid_teams(browser: CloudScraper, season: str | None = None)

Scrapes the teams (https://kenpom.com) into a list.

Parameters:
  • browser (CloudScraper) – Authenticated browser with full access to kenpom.com generated by the login function

  • season (str, optional) – Used to define different seasons. 1999 is the earliest available season.

Returns:

List containing all valid teams for the given season on kenpom.com.

Return type:

team_list (list)
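Team names have to match the valid list exactly, so a typo ends in a ValueError. One option is to check a name against the list first and suggest close matches. A sketch using the standard library’s difflib; `resolve_team` is a hypothetical helper, not part of kenpompy, and the 0.6 cutoff is simply difflib’s default.

```python
import difflib


def resolve_team(name, valid_teams):
    """Return name if it is valid, else raise ValueError with suggestions.

    Hypothetical helper, not part of kenpompy. valid_teams is a list such
    as the one returned by get_valid_teams.
    """
    if name in valid_teams:
        return name
    suggestions = difflib.get_close_matches(name, valid_teams, n=3, cutoff=0.6)
    hint = " Did you mean: " + ", ".join(suggestions) + "?" if suggestions else ""
    raise ValueError(f"{name!r} is not a valid team.{hint}")
```

Fetching the list once per season and reusing it for every lookup avoids an extra request per team.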

Contributing

You can contribute by creating issues to highlight bugs and make suggestions for additional features. Pull requests are also welcome.

License

kenpompy is released under the GNU GPL v3.0 license. You are free to use, modify, or redistribute it in almost any way, provided you state changes to the code, disclose the source, and use the same license. It is released with zero warranty for any purpose and I retain no liability for its use. Read the full license for additional details.
