24 Best Retail, Sales, and Ecommerce Datasets for Machine Learning

Article by Alex Nguyen | May 07, 2020

Machine learning presents a huge growth opportunity for online retailers. With machine learning, smart ecommerce companies can boost sales, reduce waste, and increase overall efficiency while actively engaging with consumers. Not only that, companies have a lot of ecommerce data at their fingertips. For example, according to Seshu Adunuthula, Senior Director of Analytics Infrastructure at eBay, “data is eBay’s most important asset.” The problem for machine learning developers lies in the availability of that data. Retail datasets typically contain proprietary information and are consequently hard to find, as are sales datasets. To help you out, we have scoured the internet to gather a list of open data sources that may prove useful for your projects. 


Product Datasets for Machine Learning

Fashion-MNIST: Perfect for product categorization use cases, this dataset contains 60,000 training images and 10,000 test images of fashion products across 10 classes.

Innerwear Data from Victoria’s Secret and Others: Data from 600,000+ innerwear products extracted from popular retail sites. It includes product description, price, category, rating and more.

Electronic Products and Pricing Data: This dataset contains a list of over 7,000 electronic products with 10 fields of pricing information.

Men’s Shoe Prices: A list of 10,000 men’s shoes and the various prices at which they are sold.

Women’s Shoe Prices: A companion to the previous dataset, this list contains 10,000 women’s shoes and the various prices at which they are sold.

Item Data: Useful for recommendation systems, this dataset contains SKUs and their associated product descriptions from an outdoor apparel brand’s product catalog.

Fashion Products on Amazon.com: This is a pre-crawled dataset created by extracting data from Amazon. It consists of roughly 22,000 fashion products on Amazon.

E-commerce Tagging for Clothing: This retail dataset contains images from ecommerce sites with bounding boxes drawn around shirts, jackets, sunglasses, and similar items. It has 907 items, of which 504 have been manually labeled.


Retail Transaction Datasets for Machine Learning

Online Retail Dataset (UCI Machine Learning Repository): This dataset contains all the transactions made between 01/12/2010 and 09/12/2011 (roughly one year, with dates written day first) for a UK-based online retail company.

Brazilian E-Commerce Public Dataset: This dataset contains roughly 100,000 anonymized orders placed through Olist at multiple marketplaces in Brazil from 2016 to 2018. It covers multiple dimensions of each order, from order status, price, payment and freight performance to real reviews written by customers.

Online Auctions Dataset: Retail dataset that contains eBay auction data on Cartier wristwatches, Xbox game consoles, Palm Pilot M515 PDAs, and Swarovski beads.

Retailrocket Recommender System Dataset: This data was collected from a real-world ecommerce website over a period of 4.5 months. It contains information on visitor behavior, including events like clicks, add-to-cart actions, and transactions.


Ecommerce Data and Search Relevance Datasets for Machine Learning

ECommerce Search Relevance: This set contains image URLs, rank on page, a description for each product, the search query that led to each result, and more from five major English-language ecommerce sites.

Best Buy Search Queries NER Dataset: A retail dataset containing manually labeled search queries on bestbuy.com. Phrases in the search queries are labeled with entities such as Brand, Model Name, and Category Name.


Customer Review Datasets for Machine Learning


Women’s E-Commerce Clothing Reviews: Another great resource for ecommerce data, this Kaggle dataset contains 23,000 real customer reviews and ratings. Because it features real commercial data, all information has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.

Amazon Commerce Reviews Set: This retail dataset is used for authorship identification in online Writeprint, a new research field of pattern recognition. To examine the robustness of classification algorithms, the dataset’s creators identified 50 of the most active users who frequently posted reviews.

Multidomain Sentiment Analysis Dataset: A slightly older retail dataset that contains product review data organized by product type and rating. Reviews carry star ratings (1 to 5 stars) that can be converted into binary labels if needed.

Amazon and Best Buy Electronics: A list of over 7,000 online reviews for 50 electronic products. In addition to the review itself, the dataset includes the date, source, rating, title, reviewer metadata, and more.

Grammar and Online Product Reviews: This is a sample of a large dataset by Datafiniti. It contains a list of over 70,000 reviews, which can be used for a number of machine learning use cases. For example, you can assess how writing quality impacts positive and negative online product reviews.
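
For review datasets with star ratings, such as the Multidomain Sentiment Analysis Dataset above, here is a minimal sketch of one way to turn 1 to 5 stars into binary sentiment labels. The thresholds are an assumption based on a common convention (4 and 5 stars positive, 1 and 2 stars negative, 3 stars dropped as neutral), and the function name and sample ratings are made up for illustration.

```python
from typing import Optional


def stars_to_binary(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (neutral)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    if stars >= 4:
        return 1
    if stars <= 2:
        return 0
    return None  # 3 stars is treated as neutral (assumed convention)


# Hypothetical ratings, not taken from any of the datasets listed here.
ratings = [5, 1, 3, 4, 2]

# Keep only the reviews that received a positive or negative label.
labeled = [
    (stars, label)
    for stars in ratings
    if (label := stars_to_binary(stars)) is not None
]
print(labeled)
```

Dropping the 3-star reviews keeps the two classes cleanly separated. If you need every review labeled, you could instead fold 3 stars into one of the two classes, at the cost of noisier labels.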

Ecommerce Data for Machine Learning

Annual Retail Trade Survey (ARTS): This dataset provides national estimates of total annual sales, operating expenses and inventories held outside the United States.

Economic Census: The Economic Census provides a detailed portrait of business activities in industries and communities once every five years, from the national to the local level.

E-Stats: This dataset by the US government reports the value of goods and services sold online, including over open networks such as the Internet.

EU External Trade Datasets: Another government dataset, the EU External Trade datasets provide information on the value of imports, exports and trade surplus classified by commodity, and by country of origin or destination.

ECommerce Sales by Merchandise Category 1999-2015: This dataset contains real census data that shows total ecommerce sales by merchandise line, along with the compound annual growth rate, from 1999 to 2015.
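
If you want to recompute a compound annual growth rate yourself from sales figures like those in the last dataset, the standard formula is (end value / start value) raised to the power 1 / years, minus 1. Below is a minimal sketch. The function name and the sales figures are hypothetical and are not values from the census data.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Hypothetical sales figures for one merchandise line (not real census values).
sales_1999 = 100.0
sales_2015 = 200.0

# 1999 to 2015 spans 16 yearly growth periods.
growth = cagr(sales_1999, sales_2015, years=2015 - 1999)
print(f"CAGR: {growth:.2%}")
```

Note that the number of periods is the difference between the two years, not the count of yearly data points.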


Still can’t find the ecommerce data you need for your machine learning project? Don’t worry, Lionbridge can help. We provide custom AI training data in 300 languages for the world’s largest ecommerce brands. Contact us today for a free consultation.

The Author
Alex Nguyen

Alex manages content production for Lionbridge’s marketing team. Originally from San Francisco but based in Tokyo, she loves all things culture and design. When not at Lionbridge, she’s likely brushing up on her Japanese, letting loose at indie electronic shows or trying out new ice cream spots in the city.

