Localvore Menu Planner

Introduction

A few months ago, I had the idea of creating a menu planner that, first, builds menus around whatever is in season in your area and, second, creates as much overlap as possible between perishable ingredients. In other words, this is a menu planner that encourages people to eat locally and minimize food waste. Eating locally is great for both reducing your carbon footprint and supporting local businesses, and food waste is a major problem in the US. Unsurprisingly, this ended up being a hard problem, and I’d like to share with you some of the successes and failures of this project.

Stack

-Flask This is effectively an API endpoint. It may eventually be converted into a full webpage or Twitter bot, but for now it simply returns a JSON of the menu, which is still human-readable. (A minimal sketch of this endpoint is at the end of this section.)

-Spacy Self-proclaimed “industrial strength natural language processing,” Spacy provides an incredibly simple way to obtain vector representations of words, which are then used as inputs for other ML models.

-ScikitLearn The workhorse for most ML projects. Hopefully deep learning won’t be necessary outside of Spacy as a pre-processing step, so that the request time can stay low.

-MongoDB Recipes are effectively documents, and Mongo is a document-store DB. I found a few ways to store recipe data using an RDBMS, but most seemed overly complex for this application. More than that, the training data was aggregated by scraping multiple websites, which often have very different schemas. Thus the advantage of a schemaless DB. (That being said, I tried to introduce some structure to my backend.)

-requests-html Usually I use requests and BeautifulSoup, but I found out pretty quickly that I needed to do some JavaScript emulation on top of this. My first instinct was Selenium, but this is a very heavyweight library, and probably overkill here. That’s when I found requests-html, which was able to do the JS emulation and had a built-in html scraper that met my needs. 1 library >>> 3 libraries. Huge props to Kenneth Reitz, he’s effectively the deity of Python APIs.
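
To make the Flask item above concrete, here is a minimal sketch of what that endpoint might look like. The build_menu() helper is a hypothetical stand-in for the backend query and clustering pipeline described in the rest of this post.

from flask import Flask, jsonify

app = Flask(__name__)

def build_menu(n_recipes: int = 5) -> list:
    """Hypothetical stand-in for the query + clustering pipeline described below."""
    return ['placeholder recipe'] * n_recipes

@app.route('/menu')
def menu():
    #Return the menu as JSON, keeping it human-readable.
    return jsonify({'menu': build_menu()})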

Getting the data

The first collection of recipes I worked with was a dataset of recipes from Epicurious. This data was a large JSON file, making it easy to dump into MongoDB as a collection. Each recipe contained categories, a description, ingredients, a rating, and several macronutrient counts. The “categories” object conveniently enough has key ingredients, so seeing which recipes have local ingredients can be done with a single set intersection.
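
A minimal sketch of that first step, assuming a local MongoDB instance (the file name and collection name here are placeholders):

import json
from pymongo import MongoClient

def load_epicurious(path: str = 'epicurious_recipes.json') -> None:
    """Dump the Epicurious JSON (a list of recipe dicts) into a MongoDB collection."""
    with open(path) as f:
        recipes = json.load(f)
    MongoClient().RECIPES.epicurious.insert_many(recipes)

def has_local_ingredients(recipe: dict, local_veggies: set) -> bool:
    """A recipe is 'local' if its categories intersect the seasonal veggie list."""
    categories = {c.lower() for c in recipe.get('categories', [])}
    return bool(categories & local_veggies)

Getting seasonal, regional veggies involved a small amount of scraping on the Seasonal Food Guide website: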

from datetime import date
from typing import List

from pymongo import MongoClient
from requests_html import HTMLSession

#STATE and MONGO_PATH are globals defined in __init__.py during development.

def get_date() -> str:
    """Gets date in format compatible with seasonalfoodguide urls,
    i.e. early-january, late-may, etc."""
    month = date.today().strftime('%B').lower()
    period = 'early' if date.today().day <= 15 else 'late'

    return f'{period}-{month}'

def get_seasonal_veggies(state: str = STATE) -> List[str]:
    """TODO: State should be extracted from Google location services.
    Currently defined in __init__.py"""
    today = get_date()
    session = HTMLSession()
    r = session.get(f'http://www.seasonalfoodguide.org/{state}/{today}')
    r.html.render(wait=1, sleep=1)
    veggies = [card.text.lower().split('\n')[0]
               for card in r.html.find('#col-veg-detail-card')]
    #The CSS selector above returns the individual cards, and the first item
    #is the card title, aka vegetable.

    assert r.status_code == 200, 'Unsuccessful request. Check wait and sleep.'
    assert len(veggies) > 0, 'No vegetables returned. Check state spelling.'
    return veggies

def backend_query(collection: str, state=STATE, mongo_path=MONGO_PATH):
    """Pings seasonalfoodguide, scrapes the html, and finds the intersection
    of seasonal veggies and recipes in the collection by keyword."""
    client = MongoClient(mongo_path)
    recipes = client.RECIPES[collection]
    assert recipes.count_documents({}) > 0, "Invalid collection name entered"
    veggies = get_seasonal_veggies(state)
    result = recipes.find({'ingredients': {"$in": veggies}})

    return result

I often implement a “test-as-I-go” approach when coding, baking unit tests into intermediate variables within functions before later refactoring them into a Pytest suite to avoid the performance cost of the extra assertions. One of the strangest hurdles I encountered in this project was the wait parameter for r.html.render(). A one-second delay ended up working in all cases, but I’m not sure why a shorter delay had issues. Yes, the JavaScript needed time to load, but it certainly doesn’t take 1s to render in Chrome, and requests-html uses Chromium. Thankfully, this is only called when the database is being populated, so it shouldn’t end up being too much of a bottleneck for the system.
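
As an illustration of that refactoring step, an inline assert like the one in get_seasonal_veggies might end up as a test along these lines (the module path here is hypothetical):

# test_scraper.py -- a hypothetical example of the refactored assertions
from menu_planner.scraper import get_seasonal_veggies  # hypothetical module path

def test_seasonal_veggies_not_empty():
    """The scrape should return at least one vegetable for a valid state."""
    veggies = get_seasonal_veggies('mississippi')
    assert len(veggies) > 0

def test_seasonal_veggies_are_lowercase():
    """Card titles are lowercased before being used as keywords."""
    veggies = get_seasonal_veggies('mississippi')
    assert all(v == v.lower() for v in veggies)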

As outlined later in this post, I got some rather strange results from the Epicurious dataset in the form of highly impractical recipes, which couldn’t really be filtered out except by hand-labelling. My next step was to obtain additional recipes from my personal favorite recipe website, Budget Bytes. The scraper.py file in this project’s repository shows how I got all of the Budget Bytes links.
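
scraper.py is the source of truth here, but the general shape is something like the following; the archive URL and the link filter are illustrative placeholders rather than the exact ones used:

from requests_html import HTMLSession

def get_recipe_links(archive_url: str) -> list:
    """Collect recipe URLs from an archive/index page.
    The URL passed in and the filter below are illustrative placeholders."""
    session = HTMLSession()
    r = session.get(archive_url)
    r.html.render(wait=1, sleep=1)
    #absolute_links resolves relative hrefs against the page URL.
    return [link for link in r.html.absolute_links if 'budgetbytes.com' in link]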

Algorithmic Approach

The key idea is to create an embedding space where “recipe vectors” can be compared. Getting these recipe vectors is quite easy, thanks to Spacy’s implementation of Word2Vec. One of the assumptions made is that the training data for this implementation of Word2Vec is sufficiently large and diverse that the semantic differences between all of the ingredients can be expressed within these vectors. This implementation uses Common Crawl, which scrapes over 25 billion webpages, so sample size certainly isn’t an issue. To save computational resources at runtime, the vectors are pre-computed and written to each item in the recipe collection in one bulk operation:

import pickle

import spacy
from bson.binary import Binary
from pymongo import MongoClient
from tqdm import tqdm

def keyword_vectorization(col: str) -> None:
    """Large write operation to mongodb. Adds the average word vector of all ingredients
    to each recipe document."""
    collection = MongoClient(MONGO_PATH).RECIPES[col]
    #MONGO_PATH is currently a global variable defined in __init__.py for development.
    nlp = spacy.load('en_core_web_lg')
    for recipe in tqdm(collection.find()):
        ingredients = " ".join([item.lower() for item in recipe['ingredients']])
        tokens = nlp(ingredients)
        #Pickle the numpy vector so it can be stored as a BSON binary field.
        vector = Binary(pickle.dumps(tokens.vector))
        collection.update_one({'_id': recipe['_id']},
                              {'$set': {'vector': vector}})

Note that the resulting vector represents the mean value of all ingredient tokens. We can now do clustering using the ‘vector’ element of each recipe. After querying the mongo collection using backend_query, we have a geographically and temporally relevant subset of the data to do clustering on!

In full accordance with Occam’s razor, the first attempt simply uses SKLearn’s NearestNeighbors implementation. Some of the results from this simple an approach were quite decent. The menu below uses Budget Bytes recipes with vegetables that are local to Mississippi in early April:

{
"menu": [
    "Sriracha Chickpea Salad Wraps",
    "Roasted Cauliflower and Quinoa Salad",
    "Falafel Salad",
    "Blackened Shrimp Tacos",
    "Mediterranean Farro Salad with Spiced Chickpeas"
]
}
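
For reference, a minimal sketch of that nearest-neighbors step, assuming the documents returned by backend_query have a ‘title’ field and the pickled ‘vector’ field added earlier; the choice of seed recipe is a placeholder:

import pickle

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_menu(recipes: list, n_recipes: int = 5) -> list:
    """Return the titles of the recipes closest to a seed recipe in vector space."""
    vectors = np.array([pickle.loads(r['vector']) for r in recipes])
    nn = NearestNeighbors(n_neighbors=n_recipes).fit(vectors)
    #Use the first recipe as the seed; in practice this could be user-chosen or random.
    _, idx = nn.kneighbors(vectors[0].reshape(1, -1))
    return [recipes[i]['title'] for i in idx[0]]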

However, there were two primary modes of failure for this approach. The first, in the case above, is rather impractical recipes. This is an issue that is difficult to fix outside of hand-tuning the results. The second is that the recipes returned are in fact TOO similar. Yes, lasagna and veggie lasagna share a lot of ingredients, but you probably wouldn’t want both in the same week, unless you’re an orange cartoon cat.

Next Steps

The next logical step would be to get a visualization of recipe vectors. This requires reducing the data down to a 2D manifold. t-SNE is a good first attempt at this, as it puts similar items next to each other (as opposed to on top of each other with my second choice, PCA). Note that although individual clusters may not actually mean anything, this provides a good sanity-check for other clustering algorithms.
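
A sketch of what that visualization step might look like (the perplexity value here is an arbitrary default, not a tuned choice):

import pickle

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_recipe_vectors(recipes: list) -> None:
    """Project the 300-dimensional recipe vectors onto 2D with t-SNE and scatter-plot them."""
    vectors = np.array([pickle.loads(r['vector']) for r in recipes])
    embedded = TSNE(n_components=2, perplexity=30).fit_transform(vectors)
    plt.scatter(embedded[:, 0], embedded[:, 1], s=10)
    plt.title('t-SNE projection of recipe vectors')
    plt.show()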

In the interest of length, I’ll direct you to my Bitbucket repository for a work-in-progress view, and I’ll produce updated blog posts as this project proceeds. The next steps are as follows:

  1. As I noticed both in several of my menu results and in the t-SNE visualization above, many of the results are sauces, desserts, side items, etc. Since these aren’t labeled explicitly, another pre-processing step will be necessary to remove them.
  2. One proposed fix for the similarity issue is to cluster based solely on perishable ingredients, since one of the core purposes of this project is minimizing food waste. Thankfully, the USDA has an API which allows for food group extraction, which can be used to cluster based solely on how many perishable ingredients are shared between recipes.
  3. DBSCAN was attempted as an alternative clustering model, with samples that are very close together (ϵ < 0.3) being grouped, and a single datum being considered a cluster if there are no adjacent points (see the sketch after this list). This did a decent job of removing redundant recipes, but sampling from the resultant clusters was relatively meaningless, as the cluster number only corresponds to the order in the database, rather than any meaningful distance metric. Therefore, pt. 2 will have some exploration of combining this with other nearest-neighbors approaches, and comparing the results with those from approach 1.
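
As referenced in item 3, a sketch of the DBSCAN de-duplication step with the parameters described above; keeping the first recipe per cluster is a placeholder sampling strategy:

import pickle

import numpy as np
from sklearn.cluster import DBSCAN

def deduplicate_recipes(recipes: list) -> list:
    """Group near-identical recipes (eps < 0.3) and keep one representative per cluster.
    With min_samples=1, an isolated recipe simply forms its own cluster."""
    vectors = np.array([pickle.loads(r['vector']) for r in recipes])
    labels = DBSCAN(eps=0.3, min_samples=1).fit_predict(vectors)
    representatives = {}
    for recipe, label in zip(recipes, labels):
        #Keep the first recipe seen in each cluster; smarter sampling is future work.
        representatives.setdefault(label, recipe)
    return list(representatives.values())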

Please let me know via Twitter if you have any suggestions going forward, and I look forward to bringing you pt. 2!