10 Best Legal Datasets for Machine Learning

Article by Alex Nguyen | June 10, 2019

AI technology is making headlines in a wide range of industries, including financial services and medicine, but legal AI may not immediately come to mind. However, AI is already transforming the legal sector, primarily by streamlining traditionally cumbersome processes and allowing professionals to focus on higher-level tasks.

For those interested in developing legal machine learning applications, we at Lionbridge AI have scoured the web to put together a collection of the best publicly available legal datasets.


Legal Datasets for Machine Learning

  1. Legal Case Reports: A textual corpus of 4,000 legal cases for automatic summarization and citation analysis. Each document comes with catchphrases, citation sentences, citation catchphrases, and citation classes.
  2. Department of Justice Open Data: The United States DOJ released a high-value data inventory in 2013, which includes raw datasets such as crime-related data, statistical reports, and more.
  3. The Supreme Court Database: The SCDB contains over two hundred pieces of information about each case decided by the Court between 1791 and 2017.
  4. Caselaw Access Project (CAP): Spanning 360 years of United States caselaw, the Caselaw Access Project (CAP) API and bulk data services include 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
  5. Bureau of Justice Statistics: Here, you can find data on law enforcement agencies, jails, parole and probation agencies, and courts.
  6. Carp-Manning U.S. District Court Database: This dataset contains decision-making data on 110,000+ decisions by federal district court judges handed down from 1927 to 2012.
  7. Patent Litigations: This dataset covers over 74,000 cases across 52 years and over 5 million relevant documents. Five different files detail the litigating parties, their attorneys, results, locations, and dates.
  8. Google Patents Public Data: The Google Patents Public Data contains a collection of publicly accessible, connected database tables for empirical analysis of the international patent system.
  9. California Crime and Law Enforcement: This dataset includes data on crime rates and law enforcement employment in the state of California.
  10. Credit card agreement database: The CFPB maintains a database of credit card agreements from hundreds of card issuers.
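
To give a rough feel for the kind of citation analysis a corpus like Legal Case Reports is meant to support, here is a minimal Python sketch. The records and field names (`name`, `catchphrases`, `citations`) are made up for illustration and are assumptions, not the real schema of any dataset above; you would need to adapt the loading step to whichever dataset you download.

```python
from collections import Counter

# Hypothetical records: the field names and values below are illustrative
# only and do not reflect the actual schema of any dataset in this list.
SAMPLE_CASES = [
    {"name": "Case A", "catchphrases": ["negligence", "duty of care"], "citations": ["Case C"]},
    {"name": "Case B", "catchphrases": ["negligence", "damages"], "citations": ["Case A", "Case C"]},
    {"name": "Case C", "catchphrases": ["contract", "damages"], "citations": []},
]


def citation_counts(cases):
    """Count how many times each case is cited by the other cases."""
    counts = Counter()
    for case in cases:
        counts.update(case["citations"])
    return counts


def catchphrase_counts(cases):
    """Count in how many cases each catchphrase appears."""
    counts = Counter()
    for case in cases:
        # set() ensures a phrase is counted once per case
        counts.update(set(case["catchphrases"]))
    return counts


if __name__ == "__main__":
    print(citation_counts(SAMPLE_CASES).most_common())
    print(catchphrase_counts(SAMPLE_CASES).most_common(2))
```

Counts like these are only a starting point; they could feed features for a citation classifier or help pick candidate sentences for summarization.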


In case you missed our previous dataset compilations, you can find them all here. Still can’t find the custom data you need to train your model? Lionbridge AI provides custom training data in over 300 languages for your specific machine learning project needs.

Contact us to learn more about how Lionbridge AI can work for you.

The Author
Alex Nguyen

Alex manages content production for Lionbridge’s marketing team. Originally from San Francisco but based in Tokyo, she loves all things culture and design. When not at Lionbridge, she’s likely brushing up on her Japanese, letting loose at indie electronic shows or trying out new ice cream spots in the city.

