24 Best Ecommerce & Retail Datasets for Machine Learning

Article by Alex Nguyen | February 01, 2019

Machine learning represents a huge growth opportunity for online retailers. With machine learning, ecommerce companies can potentially boost sales, reduce waste, and increase overall efficiency while actively engaging with consumers.

Not only that, ecommerce companies have a lot of data at their fingertips. For example, according to Seshu Adunuthula, Senior Director of Analytics Infrastructure at eBay, “data is eBay’s most important asset.”

The problem for machine learning developers lies in the availability of that data. Retail datasets typically contain proprietary information and are consequently hard to find on publicly available databases.

Luckily for you, we at Lionbridge AI have scoured the internet to gather a list of publicly available ecommerce and retail datasets for machine learning projects. Enjoy!


Product Datasets for Machine Learning

Fashion-MNIST: A retail dataset consisting of 60,000 training images and 10,000 test images of fashion products across 10 classes.

Innerwear Data from Victoria’s Secret and Others: Data from 600,000+ innerwear products extracted from popular retail sites. It includes product description, price, category, rating and more.

Electronic Products and Pricing Data: A list of over 7,000 electronic products with 10 fields of pricing information.

Men’s Shoe Prices: A list of 10,000 men’s shoes and the various prices at which they are sold.

Women’s Shoe Prices: A list of 10,000 women’s shoes and the various prices at which they are sold.

eCommerce Item Data: 500 SKUs and their descriptions from an outdoor apparel brand’s product catalog.

Fashion Products on Amazon.com: A retail dataset of 22,000 fashion products on Amazon.

E-commerce Tagging for Clothing: This retail dataset contains images from ecommerce sites with bounding boxes drawn around shirts, jackets, sunglasses, and similar items. It has 907 items, of which 504 have been manually labeled.
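As a quick sanity check for the Fashion-MNIST entry above, here is a minimal sketch of reading the header of an IDX file, the binary format the dataset shares with the original MNIST. It assumes the standard IDX layout (two zero bytes, a data-type code, a dimension count, then big-endian 32-bit sizes) and 28x28 pixel images. The function name `parse_idx_header` and the synthetic header bytes are illustrative, not part of any library.

```python
import struct


def parse_idx_header(data: bytes):
    """Return (dtype_code, dims) read from the first bytes of an IDX file."""
    # Magic number: two zero bytes, one data-type byte, one dimension-count byte.
    zero, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0:
        raise ValueError("not an IDX file: first two bytes must be zero")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    dims = struct.unpack(f">{ndim}I", data[4:4 + 4 * ndim])
    return dtype_code, dims


# Synthetic header shaped like the training-image file (assumption:
# 60,000 images of 28x28 unsigned bytes, type code 0x08).
header = struct.pack(">HBB3I", 0, 0x08, 3, 60000, 28, 28)
dtype_code, dims = parse_idx_header(header)
print(dtype_code, dims)
```

Checking the dimensions before loading the pixel data is a cheap way to confirm that a download matches the 60,000/10,000 split described in the list.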


Retail Transaction Datasets for Machine Learning

Online Retail Dataset (UCI Machine Learning Repository): This is a transnational dataset that contains all the transactions between 01/12/2010 and 09/12/2011 (1 December 2010 to 9 December 2011, roughly one year) for a UK-based online retail company.

Brazilian E-Commerce Public Dataset: A Brazilian public retail dataset of about 100,000 anonymized orders placed through Olist at multiple marketplaces from 2016 to 2018.

Online Auctions Dataset: Retail dataset that contains eBay auction data on Cartier wristwatches, Xbox game consoles, Palm Pilot M515 PDAs, and Swarovski beads.

Retailrocket Recommender System Dataset: Collected from a real-world ecommerce website, this retail dataset contains information on visitor behavior including events like clicks, add to carts, and transactions.
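The dates quoted for the Online Retail Dataset are easy to misread. A minimal sketch, assuming they follow the day/month/year order that is conventional for a UK source, of parsing the two endpoints with the standard library and measuring the span between them:

```python
from datetime import datetime

# Assumption: the dates are written day-first (DD/MM/YYYY).
DAY_FIRST = "%d/%m/%Y"

start = datetime.strptime("01/12/2010", DAY_FIRST)
end = datetime.strptime("09/12/2011", DAY_FIRST)

# Number of whole days between the first and last transaction dates.
span_days = (end - start).days
print(start.date(), end.date(), span_days)
```

Passing an explicit format string avoids the month-first guess that some parsers make by default.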


Search Relevance Datasets for Machine Learning

ECommerce Search Relevance: This set contains image URLs, rank on page, a description for each product, the search query that led to each result, and more from five major English-language ecommerce sites.

Best Buy Search Queries NER Dataset: A retail dataset containing manually labeled search queries from bestbuy.com. Phrases in each query are labeled with entities such as Brand, Model Name, and Category Name.
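Phrase-level entity labels like those in the Best Buy dataset are often converted to per-token tags before training a sequence model. The sketch below uses the common BIO scheme; this is an assumption about preprocessing, not a description of how the dataset itself is stored. The query, the spans, and the helper `spans_to_bio` are hypothetical.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical search query and labels, for illustration only.
tokens = ["acme", "x100", "gaming", "laptop"]
spans = [(0, 1, "Brand"), (1, 2, "Model"), (2, 4, "Category")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```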


Customer Review Datasets for Machine Learning

Women’s E-Commerce Clothing Reviews: 23,000 customer reviews and ratings. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.

Amazon Commerce Reviews Set: This retail dataset is used for authorship identification based on online “writeprints”, a relatively new research area in pattern recognition.

Multidomain Sentiment Analysis Dataset: A slightly older retail dataset that contains product reviews data by product type and rating.

Amazon and Best Buy Electronics: A list of over 7,000 online reviews from 50 electronic products.

Grammar and Online Product Reviews: A list of 71,045 online reviews from 1,000 different products.


Ecommerce Datasets for Machine Learning

Annual Retail Trade Survey (ARTS): National estimates of total annual sales, e-commerce sales, end-of-year inventories, inventory-to-sales ratios, purchases, total operating expenses, and inventories held outside the United States.

Economic Census: Provides a detailed portrait of business activities in industries and communities once every five years, from the national to the local level.

E-Stats: The underlying surveys use different measures of economic activity, such as shipments for manufacturing, sales for wholesale and retail trade, and revenues for service industries.

EU External Trade Datasets: The value of imports, exports and trade surplus, volume indices, unadjusted and seasonally adjusted; price and terms of trade indices; imports and exports classified by commodity, and by country of origin or destination.

ECommerce Sales by Merchandise Category 1999-2015: Census data showing total ecommerce sales by merchandise line and the compound annual growth rate from 1999 to 2015.
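For readers who want to recompute the growth figure from the sales totals in the last entry, the compound annual growth rate is (end / start) raised to 1 / years, minus 1. A minimal sketch follows; the function name `cagr` and the sample values are made up for illustration and are not taken from the Census data.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Made-up example: sales growing from 100 to 121 over 2 years.
growth = cagr(100.0, 121.0, 2)
print(f"{growth:.2%}")

# For the 1999 to 2015 range, the number of compounding years would be:
years_1999_2015 = 2015 - 1999
print(years_1999_2015)
```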


Liked this article? You can find all our previous dataset compilations here.

Still can’t find the custom data you need to train your model? Lionbridge AI provides custom AI training data in 300 languages for your specific machine learning project needs.

Contact us to learn more about how Lionbridge AI can work for you.

The Author
Alex Nguyen

Alex manages content production for Lionbridge’s marketing team. Originally from San Francisco but based in Tokyo, she loves all things culture and design. When not at Lionbridge, she’s likely brushing up on her Japanese, letting loose at indie electronic shows or trying out new ice cream spots in the city.

