Art Data

Web Scraping, Data, Analysis, Art Market, Python

Spring 2015 - Fall 2015

Housed within the immaculate white walls of galleries, the contemporary art market is opaque. Publicly available information is limited to the winning bid prices published by auction houses. Transactions in the primary market, i.e. sales of art that comes directly from artists' studios, are impenetrable.


Challenged by the lack of transparency, I sought my own answers. Artsy is an online platform that claims to house in excess of 300,000 images of art, many of them available for sale. By scraping Artsy’s website, I was able to obtain information on over 66,000 artworks.

    import requests

    def get_artworks_by_categories(self, categories=DEFAULT_CATEGORIES,
                                   max_results_per_category=10000):
        records = []
        pages = self.calculate_pages(max_results_per_category)
        for category in categories:
            for page in range(1, pages + 1):
                # Each request carries the API token in its headers.
                session = requests.session()
                session.headers['X-XAPP-TOKEN'] = self.token
                response = session.get(
                    CATEGORY_API_URL.format(page=page, category=category))
                self.check_response(response)
                # The status code compared here is missing from the source.
                if response.status_code == ...:
                    break
                records.extend(
                    response.json()[0:(max_results_per_category - len(records))])
        return records


My scraped dataset included the following features:

  1. Sale Status - indicating that only 7% of the artworks have sold

  2. Images of the artworks
  3. Artwork Category - Painting, Photography, Print, etc.

  4. Name of the partner institution

  5. Partner institution type (gallery, museum, etc)

  6. Artwork dimensions - as well as whether the work was three-dimensional (3D)

  7. Artwork price - prices were sparse, with only about 28% of the records having one; they ranged from 1 to 2,000,000 USD. To make the price feature more consistent, I took the median price for each Artwork Category

  8. Currency

  9. Published date

  10. Name of the artist - to give the name some context, I scraped Artnet’s top 300 list of artists and assigned a Boolean value of 1 if the artist’s name was among those listed on Artnet

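    import pandas as pd
    import requests

    url = ''  # URL left blank in the source
    page = requests.get(url)
    # First column of the first HTML table on the page.
    top_300 = pd.read_html(page.text)[0][0]

    1 Banksy
    2 Andy Warhol
    3 Nobuyoshi Araki
    4 Roy Lichtenstein
    5 Pablo Picasso
    6 Helmut Newton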
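
The median category price from item 7 can be sketched with the standard library alone. This is a minimal illustration, not the original code: the record layout (dicts with hypothetical `category` and `price` keys) and the function name are assumptions, and the sketch treats the median as an extra feature attached to every record rather than as a replacement for the price.

```python
from collections import defaultdict
from statistics import median


def median_price_by_category(records):
    """Return {category: median price}, ignoring records without a price."""
    prices = defaultdict(list)
    for record in records:
        if record.get("price") is not None:
            prices[record["category"]].append(record["price"])
    return {category: median(values) for category, values in prices.items()}


# Made-up sample records, for illustration only.
sample = [
    {"category": "Painting", "price": 1000},
    {"category": "Painting", "price": 3000},
    {"category": "Painting", "price": None},
    {"category": "Print", "price": 200},
]
medians = median_price_by_category(sample)

# Attach the category median to every record, priced or not.
for record in sample:
    record["median_category_price"] = medians.get(record["category"])
```

Because the median ignores how extreme the outliers are, a price range of 1 to 2,000,000 USD does not distort it the way it would distort a mean.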


The feature I was most excited by was the images. In his book, Collectible Investments for the High Net Worth Investor, Stephen Satchell puts forth the following: “Attractive subjects are more highly valued than unattractive subjects… Certain colors are more desirable than others – for example, red and blue will generally dominate yellow and green [in terms of artwork sales].”

Hypothesis 1

The color of the image will have an impact on whether an artwork sells.

The result set comprised 128,736 hexadecimal color values. I realized that this was more than my model could handle. Besides, what is color #92bbb7, anyway?

I set out to group the colors into primary categories: color names a human can relate to. Converting colors to their names using Euclidean distance was surprisingly difficult: deep reds were categorized as olive and browns were dubbed blue. Using a very unscientific approach, I created a custom dictionary that translated colors into a subset of CSS3 colors. My result set was cut down to 38 unique color names.
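
The nearest-name lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than the custom dictionary used in the project: the palette is a small hand-picked subset of CSS3 named colors, and `hex_to_rgb` and `nearest_color_name` are hypothetical helpers that pick the palette entry with the smallest squared Euclidean distance in RGB space.

```python
# A small, hand-picked subset of CSS3 named colors (name -> RGB).
PALETTE = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "gray": (128, 128, 128),
    "silver": (192, 192, 192),
    "red": (255, 0, 0),
    "indianred": (205, 92, 92),
    "orange": (255, 165, 0),
    "yellow": (255, 255, 0),
    "peachpuff": (255, 218, 185),
    "green": (0, 128, 0),
    "darkolivegreen": (85, 107, 47),
    "teal": (0, 128, 128),
    "blue": (0, 0, 255),
    "lightblue": (173, 216, 230),
    "sienna": (160, 82, 45),
}


def hex_to_rgb(hex_color):
    """Convert a '#rrggbb' string into an (R, G, B) tuple of ints."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))


def nearest_color_name(hex_color, palette=PALETTE):
    """Return the palette name closest to hex_color in RGB space."""
    r, g, b = hex_to_rgb(hex_color)

    def squared_distance(name):
        pr, pg, pb = palette[name]
        return (r - pr) ** 2 + (g - pg) ** 2 + (b - pb) ** 2

    return min(palette, key=squared_distance)


name = nearest_color_name("#92bbb7")  # resolves to "silver" with this palette
```

Plain RGB distance is not perceptually uniform, which is one reason a lookup like this can produce labels that look wrong to a human eye.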


Running a logistic regression confirmed that red artworks sell, particularly indianred. A color called peachpuff, along with yellow, was also positively correlated with sales. Confirming Satchell’s findings, green (the dark olive tone especially) had a negative correlation. However, drawing a conclusion about blue was more difficult: lightblue was negatively correlated, and a deeper blue had an unstable coefficient with little impact on artwork sales. Black had a negative impact while white was positively correlated, suggesting that lighter paintings have a tendency to sell over dark ones. A similar pattern was observed between silver and grey.

What about Brightness?

I extracted brightness from the original images.

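    from math import sqrt
    from PIL import Image

    def brightness(im_file):
        # Image.open is assumed; the source text is garbled at this step.
        im = Image.open(im_file)
        im = im.convert('RGB')
        # Sample a single pixel at the top-left corner of the image.
        X, Y = 0, 0
        pixelRGB = im.getpixel((X, Y))
        R, G, B = pixelRGB
        return sqrt(0.241*R**2 + 0.691*G**2 + 0.068*B**2)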

As a stand-alone feature, Brightness had a poor ROC AUC score of 54% and a correlation of 6.2%. Yet I was curious to know whether it was related to size. To find out, I first assigned a relative measure to the Brightness feature (255 being the maximum Brightness value). Second, I grouped the sizes into 4 categories: large (4), medium (3), small (2), and extra-small (1). Alas, there was no clear relationship between the two.

Note: -1 represents a missing data point
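
As a quick check on the scale, the weighted formula in `brightness` gives 255 for a pure white pixel, because the three weights sum to 1. The sketch below shows the relative measure under that reading; `perceived_brightness` and `relative_brightness` are hypothetical names, and reusing -1 as the marker for a missing pixel is an assumption based on the note above.

```python
from math import sqrt

MAX_BRIGHTNESS = 255.0  # value of the formula for pure white
MISSING = -1            # assumed sentinel for a missing data point


def perceived_brightness(rgb):
    """Weighted root-mean-square of the R, G and B channels."""
    r, g, b = rgb
    return sqrt(0.241 * r ** 2 + 0.691 * g ** 2 + 0.068 * b ** 2)


def relative_brightness(rgb):
    """Scale brightness to the 0..1 range, or return -1 if rgb is None."""
    if rgb is None:
        return MISSING
    return perceived_brightness(rgb) / MAX_BRIGHTNESS


white = relative_brightness((255, 255, 255))  # approximately 1.0
black = relative_brightness((0, 0, 0))        # 0.0
```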

Hypothesis 2

There is more to the sale of artwork than color alone.

Back to the original data set

In running logistic regression models, I discovered that my data set was imbalanced. With a ratio of 7:1 of unsold to sold artworks, my model scored above 90% just by predicting that an artwork will not sell.

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model = model.fit(X_train, y_train)
    model.score(X, y)
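
Why accuracy misleads here can be shown with a toy calculation. The labels below are made up to mirror the 7% sold rate quoted earlier; a "model" that always predicts unsold scores 93% without learning anything.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 7 sold (1) and 93 unsold (0) out of 100 artworks.
y_true = [1] * 7 + [0] * 93

# A trivial baseline that predicts "unsold" for every artwork.
always_unsold = [0] * len(y_true)

baseline = accuracy(y_true, always_unsold)  # 0.93
```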


To account for this imbalance, I computed the ROC AUC score, which yielded results more in line with my expectations.

    from sklearn import metrics

    probs = model.predict_proba(X_test)
    metrics.roc_auc_score(y_test, probs[:, 1])

roc auc = 0.5349322812587709
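
For intuition about what this number measures: ROC AUC equals the probability that a randomly chosen sold artwork receives a higher predicted score than a randomly chosen unsold one, with ties counted as one half. The pure-Python sketch below computes it by brute force over all pairs. It uses made-up scores and a hypothetical function name, and it is meant as an illustration, not a replacement for `metrics.roc_auc_score`.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ordered correctly.
example_auc = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75

# A model that gives every artwork the same score is no better than chance.
constant_auc = pairwise_roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])  # 0.5
```

Unlike accuracy, this measure gives no credit to a model that assigns every artwork the same score, which is why a value near 0.53 signals a model barely better than chance.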

Feature Selection

To improve the ROC AUC score of my model, I experimented with selecting different features. I settled on the following: Price Currency, Partner Type, Artwork Category, 3D, Top Artist, Artwork Price, Median Category Price of Artwork, Published Month, Size of Artwork, Brightness, and Color indicators for the colors that were significant in my color model (aqua, black, blue, darkolivegreen, gray, green, indianred, lightblue, lime, orange, peachpuff, red, sienna, silver, teal, white and yellow).

roc auc = 0.67316470636565873
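
The Color indicators in the feature list above amount to one 0/1 column per retained color. A minimal sketch, assuming each artwork has already been mapped to a set of color names; the `color_indicators` helper, the `color_` column prefix, and the shortened color list are illustrative assumptions, not the original code:

```python
# A shortened list of the retained colors, for illustration only.
SIGNIFICANT_COLORS = [
    "black", "darkolivegreen", "indianred",
    "lightblue", "peachpuff", "white",
]


def color_indicators(color_names, colors=SIGNIFICANT_COLORS):
    """Return one 0/1 indicator per retained color for a single artwork."""
    present = set(color_names)
    return {"color_" + c: int(c in present) for c in colors}


# An artwork tagged with one retained color and one color outside the list.
row = color_indicators(["indianred", "magenta"])
```

Colors outside the retained list simply produce no column, so an artwork with none of the significant colors gets a row of zeros.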

Although the improvement was significant, it was not dramatic. In its current state, the model is only slightly superior to a random guess.

Yet, despite the shortcomings of my model and data set, I thought the findings were sufficient to draw interesting conclusions. Aside from Color, the other strong indicators were Published Month, Artwork Category, and whether the artwork was 3D.

With regard to Published Month, pieces listed a couple of years ago (2013-06) did best, whereas newly published artworks (2015-01) had a negative coefficient. This implies that the longer an artwork has been published on Artsy, the higher its exposure, which can drive sales. The relationship is not straightforward, however: artworks published 3 years ago, for example, have a lower coefficient than those published 2 years ago. This might be a testament to whether or not the pieces in question appeal to popular taste.

Paintings are typically priced higher than prints, all other factors being equal. Prints frequently come in editions of a few to many and are thus deemed less valuable. Last year, a New York Times article reported on art market trends shifting toward prints and works on paper. Perhaps this trend hasn’t been embraced entirely: my results point to the Painting category being one of the strongest features, while works classified under Prints and Photography have a negative coefficient.

I was surprised to see that volume, as well as works in the Sculpture category, had a positive relationship to sales. As a New Yorker, I always thought smaller pieces were more accessible, but I suppose the main chunk of fine art buyers are working with a bit more than a shoebox apartment.

As I continue to explore and expand this data set, I hope to perform text classification on the artwork titles and artist names using Naïve Bayes. To improve the validity of my model, I may also experiment with upsampling as well as different cost functions and decision thresholds.
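
Random upsampling, one of the options mentioned above, can be sketched as follows. It resamples each smaller class with replacement until all classes are the same size. The function name, row layout, and fixed seed are assumptions for illustration, and it should only be applied to the training split so that duplicated rows do not leak into the test set.

```python
import random


def upsample_minority(rows, labels, seed=0):
    """Resample smaller classes with replacement to match the largest one."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows at random from the same class.
        extra = rng.choices(group, k=target - len(group))
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Made-up data: seven unsold artworks (0) and one sold artwork (1).
rows = ["a", "b", "c", "d", "e", "f", "g", "sold"]
labels = [0, 0, 0, 0, 0, 0, 0, 1]
balanced_rows, balanced_labels = upsample_minority(rows, labels)
```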