AI technology is making headlines in a wide range of industries, including financial services and healthcare, but legal AI may not immediately come to mind for many. However, AI is already transforming the legal sector, primarily by streamlining traditionally cumbersome processes and allowing professionals to focus on higher-level tasks.
For those interested in developing legal machine learning applications, we at Lionbridge AI have scoured the web to put together a collection of the best publicly available legal datasets.
Legal Datasets for Machine Learning
- Legal Case Reports: A textual corpus of 4000 legal cases for automatic summarization and citation analysis. For each document, the corpus provides catchphrases, citation sentences, citation catchphrases, and citation classes.
- Department of Justice Open Data: The United States DOJ released a high-value data inventory in 2013, which includes raw datasets such as crime-related data, statistical reports, and more.
- The Supreme Court Database: The SCDB contains over two hundred pieces of information about each case decided by the Court between 1791 and 2017.
- Caselaw Access Project (CAP): Spanning 360 years of United States caselaw, the Caselaw Access Project (CAP) API and bulk data services include 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
- Bureau of Justice: Here, you can find data on law enforcement agencies, jails, parole and probation agencies, and courts.
- Carp-Manning U.S. District Court Database: This dataset contains decision-making data on 110,000+ decisions by federal district court judges handed down from 1927 to 2012.
- Patent Litigations: This dataset covers over 74k cases across 52 years and over 5 million relevant documents. Five separate files detail the litigating parties, their attorneys, results, locations, and dates.
- Google Patents Public Data: Google Patents Public Data is a collection of publicly accessible, connected database tables for empirical analysis of the international patent system.
- California Crime and Law Enforcement: This dataset includes data on crime rates and law enforcement employment in the state of California.
- Credit card agreement database: The CFPB maintains a database of credit card agreements from hundreds of card issuers.
In case you missed our previous dataset compilations, you can find them all here. Still can’t find the custom data you need to train your model? Lionbridge AI provides custom training data in over 300 languages for your specific machine learning project needs.
Contact us to learn more about how Lionbridge AI can work for you.