Bulk

Structure

Bulk data files are provided as zipped directories. Each directory is in BagIt format, with a layout like this:

.
├── bag-info.txt
├── bagit.txt
├── data/
│   └── data.jsonl.xz
└── manifest-sha512.txt
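
Because each download is a BagIt bag, manifest-sha512.txt can be used to check that the payload arrived intact. The following is a minimal sketch, assuming the usual BagIt manifest layout (one "checksum path" pair per line, with paths relative to the bag directory). verify_bag is a hypothetical helper name, not part of any official tooling.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Return the payload paths whose SHA-512 digest does not match the manifest.

    Hypothetical helper: assumes each manifest line is "<hex digest> <path>".
    """
    bag_dir = Path(bag_dir)
    mismatches = []
    manifest = bag_dir / "manifest-sha512.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.strip()
        digest = hashlib.sha512()
        # Hash the file in 1 MiB chunks so large payloads are not loaded at once.
        with open(bag_dir / rel_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

An empty list means every file listed in the manifest matched its recorded checksum.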

Data Format

Caselaw data is stored in the data/data.jsonl.xz file. The .jsonl.xz suffix indicates a JSON Lines text file, in which each line is one JSON object, compressed with xz. Each line is an object as returned by the API.
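
A file in this format can be streamed one record at a time without decompressing it to disk first. This is a minimal sketch using only the Python standard library; iter_cases is a hypothetical helper name, and the path is wherever you unzipped the bag.

```python
import json
import lzma


def iter_cases(path):
    """Yield one decoded JSON object per non-empty line of an xz-compressed JSONL file."""
    # lzma.open in text mode decompresses lazily, so memory use stays small.
    with lzma.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `for case in iter_cases("data/data.jsonl.xz"): print(case["name"])` would print the name of each case in the file.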

API

API queries always return JSON. Here's what they look like. For more details on queries, check out the API Reference.

Individual Records

If you specify an individual record (reachable through the "url" value present in most types of records) then you'll receive a single JSON object as formatted below.

Query Results

If you're not requesting a specific record by its primary key (usually an id), the response is structured to hold multiple objects, even if only one record matches your query.

{
    "count": (int),
    "next": (url with pagination cursor),
    "previous": (url with pagination cursor),
    "results": (array of json objects, as listed below)
}
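
To collect every match, a client can keep following "next" until it is null. The sketch below assumes exactly the wrapper shown above. iter_results and fetch_json are hypothetical helper names, and the fetch function is passed in so the loop can be exercised without network access.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_results(url, fetch=fetch_json):
    """Yield every object in "results", following "next" until it is null."""
    while url:
        page = fetch(url)
        yield from page["results"]
        url = page.get("next")
```

Because this is a generator, you can stop iterating early and no further pages will be requested.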

Individual Objects

Case

{
    "id": (int),
    "url": (API url to this case),
    "name": (string),
    "name_abbreviation": (string),
    "decision_date": (string),
    "docket_number": (string),
    "first_page": (string),
    "last_page": (string),
    "citations": [array of citation objects],
    "volume": {Volume Object},
    "reporter": {Reporter Object},
    "court": {Court Object},
    "jurisdiction": {Jurisdiction Object},
    "cites_to": [array of cases this case cites to],
    "frontend_url": (url of case on our website),
    "frontend_pdf_url": (url of case pdf),
    "preview": [array of snippets that contain search term],
    "analysis": {
        "cardinality": (int),
        "char_count": (int),
        "ocr_confidence": (float),
        "sha256": (str),
        "simhash": (str),
        "word_count": (int)
    },
    "last_updated": (datetime),
    "casebody": {
        "status": "ok"/(error),
        "data": (null if status is not ok) {
            "judges": [array of strings that contain judges names],
            "parties": [array of strings containing party names],
            "opinions": [
                {
                    "text": (case text),
                    "type": (string),
                    "author": (string)
                }
            ],
            "attorneys": [array of strings that contain attorneys names],
            "corrections": (string. May include formatting notes.),
            "head_matter": (elements before the case text)
        }
    }
}

Casebody

Unless the full_case=true parameter is set, query results do not include a case body. This is useful when you want to browse the metadata of many cases but only retrieve case text for specific ones, conserving your 500-case-per-day limit.

By default, casebody is a JSON field with structured plain text. You can change that to HTML or XML by setting the body_format query parameter to either html or xml.
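
As an illustration of how these parameters combine, here is a small sketch that builds a cases query URL. cases_url is a hypothetical helper name. The endpoint and the jurisdiction, full_case, and body_format parameters come from this page.

```python
from urllib.parse import urlencode

CASES_ENDPOINT = "https://api.case.law/v1/cases/"


def cases_url(jurisdiction, full_case=False, body_format=None):
    """Build a cases query URL; omit body_format to get the default text format."""
    params = {"jurisdiction": jurisdiction}
    if full_case:
        params["full_case"] = "true"
    if body_format:
        if body_format not in ("text", "html", "xml"):
            raise ValueError("body_format must be text, html, or xml")
        params["body_format"] = body_format
    return CASES_ENDPOINT + "?" + urlencode(params)
```

Leaving full_case at its default returns metadata only, which avoids spending the daily case limit while browsing.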

This is what you can expect from different format specifications using the body_format parameter:

Text Format (default)

https://api.case.law/v1/cases/?jurisdiction=ill&full_case=true

The default text format is best for natural language processing. Example response data:

"data": {
    "head_matter": "Fifth District\n(No. 70-17;\nThe People of the State of Illinois ...",
    "opinions": [
        {
            "author": "Mr. PRESIDING JUSTICE EBERSPACHER",
            "text": "Mr. PRESIDING JUSTICE EBERSPACHER\ndelivered the opinion of the court: ...",
            "type": "majority"
        }
    ],
    "judges": [],
    "attorneys": [
        "John D. Shulleriberger, Morton Zwick, ...",
        "Robert H. Rice, State’s Attorney, of Belleville, for the Peop ..."
    ]
}

In this example, "head_matter" is a string containing all text printed in the volume before the text prepared by judges. "opinions" is an array containing a dictionary for each opinion in the case. "judges" and "attorneys" are substrings from "head_matter" that we believe refer to entities involved with the case.
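
Since "data" is null whenever the casebody status is not ok, code that reads opinions should check the status first. This is a minimal sketch for the default text format. opinion_texts is a hypothetical helper name, and the field names are the ones listed in the Case object on this page.

```python
def opinion_texts(case):
    """Return (type, author, text) tuples for a case, or [] if no casebody data is present."""
    casebody = case.get("casebody") or {}
    data = casebody.get("data")
    # In the text format "data" is a dictionary; skip errors, nulls, and other formats.
    if casebody.get("status") != "ok" or not isinstance(data, dict):
        return []
    return [
        (op.get("type"), op.get("author"), op.get("text", ""))
        for op in data.get("opinions", [])
    ]
```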

XML Format

https://api.case.law/v1/cases/?jurisdiction=ill&full_case=true&body_format=xml

The XML format is best if your analysis requires more information about pagination, formatting, or page layout. It contains a superset of the information available from body_format=text, but requires parsing XML data. Example response data:

"data": "<?xml version='1.0' encoding='utf-8'?>\n<casebody ..."
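
The XML string can be parsed with any standard XML library. The sketch below uses Python's xml.etree.ElementTree and makes no assumption about specific element names, since the schema is not described on this page. It only counts elements by local tag name and extracts the text. summarize_casebody_xml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_casebody_xml(xml_string):
    """Return (Counter of local tag names, whitespace-normalized text) for a casebody XML string."""
    # Encode to bytes so the parser can honour the encoding in the XML declaration.
    root = ET.fromstring(xml_string.encode("utf-8"))
    # ElementTree writes namespaced tags as "{namespace}name"; keep only the name.
    tags = Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())
    # Joining with spaces keeps block elements apart but also splits inline markup.
    text = " ".join(" ".join(root.itertext()).split())
    return tags, text
```

Inspecting the tag counts for a few cases is a quick way to learn which elements carry the pagination and layout information you need.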

HTML Format

https://api.case.law/v1/cases/?jurisdiction=ill&full_case=true&body_format=html

The HTML format is best if you want to show readable, formatted caselaw to humans. It represents a best-effort attempt to transform our XML-formatted data to semantic HTML ready for CSS formatting of your choice. Example response data:

"data": "<section class=\"casebody\" data-firstpage=\"538\" data-lastpage=\"543\"> ..."
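
The page range is exposed as data attributes on the outer section element, so it can be read without rendering the HTML. This is a minimal sketch using Python's html.parser and assuming only the section element shown in the example above. CasebodyPages and casebody_pages are hypothetical names.

```python
from html.parser import HTMLParser


class CasebodyPages(HTMLParser):
    """Record data-firstpage and data-lastpage from the casebody section element."""

    def __init__(self):
        super().__init__()
        self.first_page = None
        self.last_page = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "section" and "casebody" in classes:
            self.first_page = attrs.get("data-firstpage")
            self.last_page = attrs.get("data-lastpage")


def casebody_pages(html):
    """Return (first_page, last_page) as strings, or (None, None) if not found."""
    parser = CasebodyPages()
    parser.feed(html)
    parser.close()
    return parser.first_page, parser.last_page
```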

Analysis Fields

Analysis fields are values calculated by processing the raw case text. They can be searched with filters.

Each case result in the API returns an analysis section, such as:

"analysis": {
    "word_count": 1110,
    "sha256": "0876189e8ac20dd03b7...",
    "ocr_confidence": 0.654,
    "char_count": 6890,
    "pagerank": {
        "percentile": 0.31980916105919643,
        "raw": 5.770123949632993e-08
    },
    "cardinality": 390,
    "simhash": "1:3459aad720da314e"
}

All analysis fields are optional, and may or may not appear for a given case.
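
Because any field may be missing, client code should avoid indexing into "analysis" directly. This is a minimal sketch of a tolerant accessor and a filter built on it. analysis_field and cases_with_min_ocr are hypothetical helper names.

```python
def analysis_field(case, name, default=None):
    """Read one analysis field, tolerating a missing section or a missing field."""
    return (case.get("analysis") or {}).get(name, default)


def cases_with_min_ocr(cases, threshold):
    """Keep cases whose ocr_confidence is present and at least `threshold`."""
    kept = []
    for case in cases:
        score = analysis_field(case, "ocr_confidence")
        if score is not None and score >= threshold:
            kept.append(case)
    return kept
```

Cases without an ocr_confidence value are dropped here rather than treated as zero. Whether that is the right choice depends on your analysis.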

Analysis fields have the following meanings:

Cardinality (cardinality)

The number of unique words in the full case text including head matter.

Character count (char_count)

The number of unicode characters in the full case text including head matter.

OCR Confidence (ocr_confidence)

A relative score of the predicted accuracy of optical character recognition in the case, from 0.0 to 1.0. ocr_confidence is generated by averaging the OCR engine's reported confidence for each word in the case. The score has no objective interpretation, other than that a case with a lower score is likely to have more typographical errors than a case with a higher score.

PageRank (pagerank)

Example: "pagerank": {"raw": 0.00278, "percentile": 0.997}

An estimate of the all-time significance of this case in the citation graph, from 0.0 to 1.0, calculated using the PageRank algorithm. Cases with no inbound citations will not have this field, and implicitly have a rank of 0.

The "raw" score can be interpreted as the probability of encountering that case if you start at a random case and follow random citations. The "percentile" score indicates the fraction of cases, between 0.0 and 1.0, that have a lower raw score than the given case.
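
To make the percentile definition concrete, the sketch below computes, for each raw score in a list, the fraction of scores that are strictly lower. It illustrates the definition only. How the published values treat ties, or cases that have no pagerank field, is not stated here, and percentiles is a hypothetical helper name.

```python
from bisect import bisect_left


def percentiles(raw_scores):
    """For each raw score, return the fraction of scores in the list that are strictly lower."""
    ordered = sorted(raw_scores)
    n = len(ordered)
    # bisect_left returns the number of elements strictly less than the score.
    return [bisect_left(ordered, score) / n for score in raw_scores]
```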

SHA-256 (sha256)

The hex-encoded SHA-256 hash of the full case text including head matter. This will match only if two cases have identical text, and will change if case text is edited (such as for OCR correction).
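
A hex-encoded SHA-256 digest of a string can be computed as sketched below. This assumes UTF-8 encoding. The exact byte sequence hashed by the service (for example, how head matter and opinions are joined) is not specified on this page, so locally computed values are not guaranteed to match the API. text_sha256 is a hypothetical helper name.

```python
import hashlib


def text_sha256(text):
    """Return the hex-encoded SHA-256 digest of a string, encoded as UTF-8 (an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

In practice the stored sha256 value is most useful as a change detector: if it differs between two downloads of the same case, the case text was edited in between.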

SimHash (simhash)

The hex-encoded, 64-bit SimHash of the full case text including head matter. The more similar the text of two cases, the lower the Hamming distance between their simhashes.

Simhashes are prefixed with a version number, such as "1:33e68120ecb2d7de", to allow for algorithmic improvements. Simhashes with different version numbers may have been calculated using different parameters (such as hash algorithm or tokenization) and may not be directly comparable.
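
Comparing two simhash values comes down to counting the bits in which they differ. This is a minimal sketch that splits off the version prefix and refuses to compare values with different versions. parse_simhash and simhash_distance are hypothetical helper names.

```python
def parse_simhash(value):
    """Split a value such as "1:3459aad720da314e" into (version, integer hash)."""
    version, _, hex_digits = value.partition(":")
    return version, int(hex_digits, 16)


def simhash_distance(a, b):
    """Return the Hamming distance between two simhash strings of the same version."""
    version_a, bits_a = parse_simhash(a)
    version_b, bits_b = parse_simhash(b)
    if version_a != version_b:
        raise ValueError("simhash versions differ; values may not be comparable")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(bits_a ^ bits_b).count("1")
```

A distance of 0 means the hashes are identical. What counts as "similar" above that is a threshold you would need to choose for your own data.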

Word count (word_count)

The number of words in the full case text including head matter.
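
The three count fields (char_count, word_count, and cardinality) can be approximated locally as sketched below. The tokenization used by the service is not described on this page. Splitting on whitespace and treating words as case-sensitive are assumptions made for illustration, so results may differ from the API values. text_stats is a hypothetical helper name.

```python
def text_stats(text):
    """Approximate char_count, word_count, and cardinality for a string."""
    words = text.split()  # assumption: words are whitespace-separated tokens
    return {
        "char_count": len(text),  # Python strings count Unicode code points
        "word_count": len(words),
        "cardinality": len(set(words)),
    }
```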

Jurisdiction

    {
        "url": (url),
        "id": (int),
        "slug": (string),
        "name": (string),
        "name_long": (string),
        "whitelisted": true/false
    }

Court

    {
        "id": (int),
        "url": (url),
        "name": (string),
        "name_abbreviation": (string),
        "jurisdiction": (string),
        "jurisdiction_url": (url),
        "slug": (string)
    }

Volume

    {
        "url": (url),
        "barcode": (string),
        "volume_number": (string),
        "title": (string),
        "publisher": (string),
        "publication_year": (int),
        "start_year": (int),
        "end_year": (int),
        "nominative_volume_number": (string),
        "nominative_name": (string),
        "series_volume_number": (string),
        "reporter": (string),
        "reporter_url": (url),
        "jurisdictions": [list of jurisdiction objects],
        "pdf_url": (url),
        "frontend_url": (url)
    }

Reporter

    {
        "id": (int),
        "url": (url),
        "full_name": (string),
        "short_name": (string),
        "start_year": (int),
        "end_year": (int),
        "jurisdictions": [list of jurisdiction objects],
        "frontend_url": (url)
    }

Citation

    {
        "id": (int),
        "cite": (string),
        "cited_by": (url)
    }

Ngrams

    (search term): {
        (string jurisdiction)/"total": [
            {
                "year": (string),
                "count": [
                    (int),
                    (int)
                ],
                "doc_count": [
                    (int),
                    (int)
                ]
            }
        ]
    }
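
The nested Ngrams structure is often easier to work with once flattened into rows. The sketch below assumes one object shaped exactly as shown above. The meaning of the two integers in "count" and "doc_count" is not described in this section, so the pairs are passed through unchanged. ngram_rows is a hypothetical helper name.

```python
def ngram_rows(ngrams):
    """Flatten an Ngrams object into (term, jurisdiction, year, count, doc_count) tuples."""
    rows = []
    for term, by_jurisdiction in ngrams.items():
        # Keys at this level are jurisdiction strings or "total".
        for jurisdiction, entries in by_jurisdiction.items():
            for entry in entries:
                rows.append((
                    term,
                    jurisdiction,
                    entry["year"],
                    tuple(entry["count"]),
                    tuple(entry["doc_count"]),
                ))
    return rows
```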
