Multi Page Document Classification using Machine Learning and NLP (2022)

An approach to classifying documents that vary in shape, text, and page size.


This article describes a novel Multi Page Document Classification solution that leverages advanced machine learning and textual analytics to solve one of the major challenges in the mortgage industry.

Even in today’s technological era, much of business is still done using documents, and the amount of paperwork involved varies from industry to industry. Many of these industries need to scan through scanned document images (which usually contain non-selectable text) to obtain the information for key index fields used in their daily tasks.

To achieve this, the first major task is to index the different types of documents, which later helps in extracting information and meta-data from a variety of complex documents. This blog post shows how advanced machine learning and NLP techniques can be leveraged to solve this major part of the puzzle, formally called Document Classification.

In the mortgage industry, companies perform mortgage loan audits for thousands of people.

Each individual audit is performed on an assortment of documents submitted as a bundle, called a Loan Package. A package is a collection of scanned pages and can range from roughly 100 to 400 pages. Within the package there are multiple sub-components, each consisting of roughly 1 to 30 pages. These sub-components are called Documents or document classes. The following figure represents this visually.

[Figure: a loan package composed of multiple documents, each spanning one or more pages]

Background and Problem Statement

Traditionally, Document Classification is one of the major manual efforts in evaluating loan audits. Mortgage companies mostly outsource this work to third-party Business Process Outsourcing (BPO) companies, which execute the task using manual or partially automated classification techniques, e.g. rule engines and template matching. The underlying problem with current implementations is that the BPO staff has to manually find and sort the documents present in the packages.

Although some degree of automation is achieved by a few third-party companies using keyword searches, regular expressions, etc., the accuracy and robustness of such solutions are questionable, and the reduction in manual workload is still not satisfactory. Keyword searches and regular expressions mean that these solutions must account for every new document or document variation that appears, and new rules must be added for each one. This in itself becomes a manual effort, and only partial automation is achieved. There also remains the chance that the system identifies a document as “Doc A” when it is in fact “Doc B”, because of rules common to both. Additionally, there is no degree of certainty attached to an identification. More often than not, manual verification is still required.

With several hundred document types, the BPO staff needs a knowledge base of “how a certain document looks, and what are the different variations of the same document?” in order to classify documents. On top of that, when the manual workload is too high, human error tends to increase.

Objective

The document classification solution should significantly reduce the manual human effort. It should achieve a higher level of accuracy and automation with minimal human intervention.

The solution approach we discuss in this series of blogs is not limited to the mortgage industry; it can be applied wherever there are scanned document images and sorting of such documents is required. A few of the possible industries are financial organizations, academia, research institutes, and retail stores.

In order to build a solution pipeline, the first step is to understand the data and its different characteristics. Since we have been working in the mortgage domain, we will define the characteristics of the data we process in the mortgage industry.

Within a package, there are many types of pages, but generally, these can be categorized into three types:

Structured | Consistent forms and templates

[Figure: example of a structured page]

Unstructured | Textual, with no formatting or tables

[Figure: example of an unstructured page]

Semi-Structured | Hybrid of the above two; may have partial structure

[Figure: example of a semi-structured page]

In terms of documents, the following characteristics are observed in the data.


  • The documents in a package are not in a consistent order. For example, in one package document “A” might come after document “B”, and in another it is the other way around.
  • There are many variations of the same document class. One document class can have different-looking variations; for example, a document class “A” page template/format might change across US states. Within the mortgage domain these represent the same information but differ in formatting and contents. In other words, if “cat” is a document, different breeds of cats would be the “variations”.
  • The document pages have different kinds of scan deformities, e.g. noise, 2D and 3D rotations, bad scan quality, and page orientation, which degrade the OCR output for those documents.

In this section, we will abstractly explain how our solution pipeline works and how each component or module comes together to produce an end-to-end pipeline. The following flow diagram shows the solution.

[Figure: flow diagram of the solution pipeline]

Since the goal is to identify the documents within the package, we had to identify which characteristics of a document make it different from another. In our case, we decided that the text present in the document is the key, because intuitively this is how we humans do it as well. The next challenge was to figure out the location of a document within the package. In the case of multi-page documents, the boundary pages (start, end) have the most significance, because using these pages a document's page range can be identified.

Machine Learning Classes

In terms of machine learning, we treated this problem as a classification problem, where we decided to identify the first and last pages of each document. We categorized our Machine Learning classes (ML classes) into three types:

  • First Page Classes: These classes are the first pages of each document class and are responsible for identifying the start of a document.
  • Last Page Classes: These classes are the last pages of each document class and are responsible for identifying the end of a document. These classes are created only for document classes that have samples with more than one page.
  • Other Class: This is a single class containing the middle pages of all the document classes combined. Having this class helps the pipeline in later stages: it reduces the instances where a middle page of a document is classified as the first or last page of the same document, which intuitively is possible because there can be similarities between all the pages, such as headers, footers, and templates. This allows the model to learn more robust features.

The following diagram represents how these different types of ML classes map onto a package and its documents.

[Figure: first page, last page, and Other classes within a package]

Machine Learning Engine

Once the ML classes are defined, the next step is to prepare the dataset for training the Machine Learning Engine (data preparation is discussed in detail in the next sections). The following diagram explains the inner workings of the Machine Learning Engine and gives a more technical view of the solution pipeline.

[Figure: inner workings of the Machine Learning Engine]

Let’s describe the different phases of the solution step by step.

Step 1

  • The package (a pdf file) is split into individual pages (images).
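The article does not name the tool used for this splitting step; purely as an illustration, a minimal sketch with the pdf2image library (an assumed choice) could look like this:

```python
# Minimal sketch: split a package PDF into per-page images.
# pdf2image is an assumption; the article does not name the library used.
from pdf2image import convert_from_path

def split_package(pdf_path, out_dir, dpi=300):
    """Render each page of a package PDF to a PNG file and return the paths."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    paths = []
    for i, page in enumerate(pages, start=1):
        path = f"{out_dir}/page_{i:04d}.png"
        page.save(path, "PNG")
        paths.append(path)
    return paths
```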

Step 2

  • The individual pages are processed through OCR (Optical Character Recognition), which extracts the text from each image and generates text files. We used a state-of-the-art OCR engine to produce the text in our case; there are also many free OCR offerings that can be used in this step.
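As a hedged stand-in for the proprietary engine we used, the open-source pytesseract wrapper can fill this role:

```python
# Sketch: extract text from a page image with Tesseract OCR.
# pytesseract is a stand-in; the article's actual OCR engine is not named.
from PIL import Image
import pytesseract

def ocr_page(image_path):
    """Return the raw OCR text for one page image."""
    return pytesseract.image_to_string(Image.open(image_path))
```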

Step 3

  • The text corresponding to each page is then passed to the Machine Learning Engine, where the Text Vectorizer (Doc2Vec) generates its feature vector representation, which is essentially a list of floats.
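A minimal sketch of this step, assuming a gensim Doc2Vec model trained as described later (the file name and sample text are illustrative):

```python
# Sketch: convert one page's OCR text into a fixed-length feature vector
# using a trained gensim Doc2Vec model (training is covered later).
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec.model")                        # illustrative path
page_text = "deed of trust this deed of trust is made ..."   # cleaned OCR text
vector = model.infer_vector(page_text.split())               # numpy array of floats
```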

Step 4

  • The feature vectors are then passed to the classifier (Logistic Regression), which predicts the class for each feature vector: one of the ML classes previously discussed (first, last, or other).
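Continuing the sketch above, classification and the per-class confidence scores might be obtained like so (scikit-learn and the file name are assumptions):

```python
# Sketch: classify the page vector and read out per-class confidence scores.
import joblib

clf = joblib.load("classifier.joblib")        # illustrative path
pred = clf.predict([vector])[0]               # e.g. "6853", "6853-last", "Other"
scores = clf.predict_proba([vector])[0]       # one confidence score per ML class
confidences = dict(zip(clf.classes_, scores))
```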

Additionally, the classifier returns the confidence scores for all the ML classes (the rightmost section of the diagram). For example, let (D1, D2, ...) be the ML classes; then for a single page the results may look like the following.

[Table: example per-class confidence scores for a single page]

Post Processing

Once the whole package is processed, we use the results/predictions to identify the boundaries of the documents. The results contain the predicted class and the confidence scores of the predictions for all the pages of the package, as shown in the following table.

[Table: per-page predictions and confidence scores for a whole package]

The following simple algorithm is used to identify the document boundaries from the output of the Machine Learning Engine.

[Figure: document boundary identification algorithm]
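The original algorithm figure is not reproduced here; the following is a plausible sketch under the assumptions stated in its comments, not necessarily the authors' exact logic:

```python
# Hedged sketch of boundary detection from per-page predictions.
# `predictions` is a list of (page_no, ml_class, confidence) tuples in page
# order; first-page classes are plain IDs ("6853"), last-page classes carry
# a "-last" suffix, and middle pages are labeled "Other".
def find_document_boundaries(predictions, threshold=0.8):
    documents, start, doc_class = [], None, None
    for page_no, ml_class, conf in predictions:
        if conf < threshold or ml_class == "Other":
            continue                          # middle/low-confidence pages are skipped
        if not ml_class.endswith("-last"):    # a first page opens a document
            if start is not None:             # ...and closes any document left open
                documents.append((doc_class, start, page_no - 1))
            start, doc_class = page_no, ml_class
        elif doc_class is not None and ml_class == f"{doc_class}-last":
            documents.append((doc_class, start, page_no))
            start, doc_class = None, None
    return documents                          # list of (class, first_page, last_page)
```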


Data Preparation

In developing an end-to-end document classification pipeline, the very first, and arguably the most important, step is data preparation, because the solution is only as good as the data it uses. The data we used for our experiments consisted of documents from the mortgage domain, but the strategies we adopted can be applied to any document dataset in a similar fashion. The following steps were performed.

Definition: A Document Sample is an instance of a particular document. Usually it is a (pdf) file containing only the pages of that document.

Step 1

  • The first step is to decide which documents within a package are to be recognized and classified. Ideally, all the documents present in packages should be selected. Once the document classes are decided, we move on to the extraction part. In our case, we decided to classify 44 document classes.

Step 2

  • To obtain the dataset, we collected pdfs of several hundred packages and manually extracted the selected documents from those packages. Once a document was identified in a package, its pages were separated and concatenated into a pdf file. For example, if we found “Doc A” from page 4 to page 9 in a package, we would extract the six pages (4–9) and merge them into a six-page pdf. This six-page pdf constitutes a document sample. All the samples extracted for a particular document class were put into a separate folder; the following shows the folder structure, with a code sketch after it. We collected 300+ document samples for each document class, and each document class was given a unique identifier which we called “DocumentIdentifierID”.

[Figure: folder structure, one folder of pdf samples per document class]
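A hedged sketch of the extraction step, using the pypdf library (an assumption; the article does not name a tool, and the paths and IDs are illustrative):

```python
# Sketch: cut pages 4-9 of a package into a six-page document sample.
from pypdf import PdfReader, PdfWriter

def extract_sample(package_path, first_page, last_page, sample_path):
    reader, writer = PdfReader(package_path), PdfWriter()
    for i in range(first_page - 1, last_page):   # pypdf pages are 0-indexed
        writer.add_page(reader.pages[i])
    with open(sample_path, "wb") as f:
        writer.write(f)

extract_sample("package_001.pdf", 4, 9, "6853/sample_001.pdf")
```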

Step 3

  • The next step is to apply OCR and extract text from all the pages present in the document samples. The OCR iterated over all the folders and generated excel files containing the extracted text and some meta-data. The following shows the format of the excel files; each row represents one page.

[Table: OCR dataset format, one row per page]

Dataset Table with sample rows.

Loan Number, File Name : These are unique sample (pdf) identifiers. There are two samples (green, yellow) present in the table.

Document Identifier ID, Document Name : Represent the document class to which these samples belong.

Page Count : Total number of pages in one particular sample (both samples have 2 pages).

Page Number : The ordered page number of each page within a sample.

IsLastPage : If 1, the page is the last page of that particular sample.

Page Text : The text returned by the OCR for that particular page.
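For illustration, such rows could be assembled with pandas (all values below are hypothetical, not real loan data):

```python
# Sketch: the OCR output assembled into the tabular format described above.
import pandas as pd

rows = [
    {"Loan Number": "LN-0001", "File Name": "sample_001.pdf",
     "Document Identifier ID": 6853, "Document Name": "Doc A",   # hypothetical
     "Page Count": 2, "Page Number": 1, "IsLastPage": 0,
     "Page Text": "FIRST PAGE OCR TEXT ..."},
    {"Loan Number": "LN-0001", "File Name": "sample_001.pdf",
     "Document Identifier ID": 6853, "Document Name": "Doc A",
     "Page Count": 2, "Page Number": 2, "IsLastPage": 1,
     "Page Text": "LAST PAGE OCR TEXT ..."},
]
df = pd.DataFrame(rows)
df.to_excel("6853.xlsx", index=False)   # one excel file per document class
```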

Data Transformations

Once the data is generated in the above format, the next step is to transform it. In the transformation phase, the data is converted into the format required for training a machine learning model. The following transformations are applied to the dataset.

Step 1 | Generating ML classes

  • The first step of the transformation is to generate the first page, last page, and other page classes. To do this, the Page Number and IsLastPage column values are used. The following shows a conditional representation of the logic used, with a code sketch after the table below.

[Figure: conditional logic for assigning ML classes]

  • Moreover, the table below shows the resulting columns. Notice the yellow column, where 6853 represents the first page class and 6853-last represents the last page class, while the mid-pages are assigned to the Other class.

[Table: dataset rows with the generated ML Class column]
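A minimal pandas sketch of this labeling logic (treating single-page samples as first pages is our assumption, consistent with last-page classes existing only for multi-page documents):

```python
# Sketch: derive the ML class from Page Number and IsLastPage.
def ml_class(row):
    if row["Page Number"] == 1:                         # first page class
        return str(row["Document Identifier ID"])
    if row["IsLastPage"] == 1:                          # last page class
        return f'{row["Document Identifier ID"]}-last'
    return "Other"                                      # middle pages

df["ML Class"] = df.apply(ml_class, axis=1)
```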

Step 2 | Data Split for Training and Testing the Pipeline

  • Once step 1 is complete, we only need two columns, “Page Text” and “ML Class”, to build the training pipeline. The other columns are used for testing evaluations.
  • The next step is to split the data for training and testing the pipeline. The data is split so that 80% is used for training and 20% for testing. The data is also randomly shuffled, in a stratified fashion for each class, as sketched below.
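With scikit-learn, such a stratified split can be sketched as:

```python
# Sketch: stratified 80/20 split with shuffling, preserving class proportions.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["Page Text"], df["ML Class"],
    test_size=0.20, shuffle=True,
    stratify=df["ML Class"], random_state=42)
```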

Step 3 | Data cleaning and transformation

  • The “Page Text” column, which contains the OCR text for each page, is cleaned; this process is applied to both the train and test splits. The following steps are performed (a code sketch follows the list):
  1. Case correction: All the text is converted to UPPER or lower case.
  2. Regex for non-alphanumeric characters: All characters which are not alphanumeric are removed.
  3. Word Tokenization: All the words are tokenized, meaning each Page Text string becomes a list of words.
  4. Stopwords Removal: Stopwords are words that are too common in the English language and may not be helpful in classifying individual documents, for example “the”, “is”, “a”. Stopwords can also be domain specific and can be used to remove redundant words that are common across many documents, e.g. in finance or mortgage the word “price” occurs in many documents.
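A hedged sketch of these four steps, assuming NLTK for tokenization and stopwords (the article does not name a library):

```python
# Sketch: clean one page's OCR text in the four steps listed above.
# Requires nltk.download("punkt") and nltk.download("stopwords").
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))   # extend with domain words, e.g. "price"

def clean_page_text(text):
    text = text.lower()                               # 1. case correction
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # 2. drop non-alphanumerics
    tokens = word_tokenize(text)                      # 3. word tokenization
    return [t for t in tokens if t not in STOPWORDS]  # 4. stopword removal
```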

The following tables show the text before and after these transformations.

[Table: Page Text before and after cleaning]

Training Pipeline

In the previous Machine Learning Engine section, we abstractly discussed the inner workings of the Machine Learning Engine. The two main components were:

  1. Text Vectorizer: In our case, we used Doc2Vec.
  2. Classifier Model: Logistic Regression is used for classification.

Text Vectorizer (Doc2Vec)

Since the beginning of Natural Language Processing (NLP), there has been a need to transform text into something a machine can understand; that is, to transform textual information into a meaningful representation, usually known as a vector (or array) of numbers. The research community has been developing different methods to perform this task. In our research and development we tried different techniques and found Doc2Vec to be the best among them.

Doc2Vec is based on the Word2Vec model, which is a Predictive Vector Space Model. To understand Word2Vec, let us begin with Vector Space Models.


Vector Space Models (VSMs): Embed words into a continuous vector space where semantically similar words are mapped to nearby points.

Two Approaches for VSM:

  1. Count-Based Methods: Compute statistics of how often a word co-occurs with its neighbor words in a large text corpus, then map these count statistics down to a small, dense vector for each word (e.g. TF-IDF).
  2. Predictive Methods: Predict a word from its neighbors in terms of learned small, dense embedding vectors (e.g. Skip-Gram, CBOW). Word2Vec and Doc2Vec belong to this category of models.

Word2Vec Model

It is a computationally efficient predictive model for learning word embeddings from raw text. Word2Vec can be created using one of the following two models:

  1. Skip-Gram: Creates a sliding window around the current word (the target word), then uses the current word to predict all surrounding words (the context words). (e.g. predicts ‘the cat sits on the’ from ‘mat’)
  2. Continuous Bag-of-Words (CBOW): Creates a sliding window around the current word (the target word), then predicts the current word from the surrounding words (the context words). (e.g. predicts ‘mat’ from ‘the cat sits on the’)

For more details, read this article; it explains the different aspects in detail.

Doc2Vec Model

This text vectorization technique was introduced in the research paper Distributed Representations of Sentences and Documents. Further technical details can be found here.

Definition | It is an unsupervised algorithm that learns fixed-length feature vector representations from variable-length pieces of text. These vectors can then be used in any machine learning classifier to predict class labels.

It is similar to the Word2Vec model, except that every text file contributes a unique column to a matrix (called the Paragraph Matrix). A single-layer neural network, like the one seen in the CBOW model, is then trained, where the inputs are all the surrounding words of the current word along with the current paragraph column, and the target is the current word. The rest is the same as in the Skip-Gram or CBOW models.

[Figure: Doc2Vec architecture with the Paragraph Matrix]
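With gensim, training such a model can be sketched as follows (hyperparameters are illustrative, not the authors' settings; train_token_lists is assumed to hold the cleaned token lists from the data transformation step):

```python
# Sketch: train a Doc2Vec model (dm=1 selects the paragraph-matrix variant).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=tokens, tags=[str(i)])
          for i, tokens in enumerate(train_token_lists)]
model = Doc2Vec(vector_size=300, window=5, min_count=2, dm=1,
                epochs=40, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```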

Advantages of the Doc2Vec model (as reported in the original paper):

  • On a sentiment analysis task, Doc2Vec achieves new state-of-the-art results, better than complex methods, yielding a relative improvement of more than 16% in terms of error rate.
  • On a text classification task, Doc2Vec convincingly beats bag-of-words models, giving a relative improvement of about 30%.

Classifier Model (Logistic Regression)

Once the text is converted to vector format, it is ready for a machine learning classifier to learn the patterns present in the vectors of different document types and identify the correct distinctions. Since there are many classification techniques that can be used here, we tried the best of the bunch, i.e. Random Forest, SVM, Multi-Layer Perceptron, and Logistic Regression, and evaluated their results. Many different parameters were tried for each classifier to obtain optimal results. Logistic Regression was found to be the best among these models.

Training Procedure

  • Once the data is transformed, we first train the Doc2Vec model on the training split (as discussed in the data transformation section).
  • After the Doc2Vec model is trained, the training data is passed through it again, but this time the model is not trained; rather, we infer the vectors for the training samples. The last step is to pass these vectors and the actual ML class labels to the classification model (Logistic Regression).
  • Once the models are trained on the training data, both models are saved to disk so that they can be loaded into memory for testing and, ultimately, production deployment. The following diagram shows the basic flow of this collaborative scheme, followed by a code sketch.

[Figure: training flow of the Doc2Vec model and the classifier]
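Under the same assumptions as the earlier sketches, the whole training procedure could look like this:

```python
# Sketch: infer vectors for the training pages with the trained Doc2Vec
# model, fit the Logistic Regression classifier, then persist both models.
import joblib
from sklearn.linear_model import LogisticRegression

train_vectors = [model.infer_vector(tokens) for tokens in train_token_lists]
clf = LogisticRegression(max_iter=1000)
clf.fit(train_vectors, y_train)        # y_train: ML class labels

model.save("doc2vec.model")            # gensim model to disk
joblib.dump(clf, "classifier.joblib")  # scikit-learn model to disk
```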

Testing & Evaluation Pipeline

Once the pipeline is trained (which includes both the Doc2Vec model and the classifier), the following flow diagram shows how it is used to predict the document classes for the testing data split.

[Figure: testing and evaluation flow]

The transformed testing data is passed through the trained Doc2Vec model, where the vector representations of all the pages present in the testing data are inferred. These vectors are then classified by the classifier, which returns the predicted class and the confidence scores for all the ML classes.

For a detailed evaluation of the Machine Learning Engine, we generate an excel file from the results. The following table shows the columns and the information generated in the testing phase.

[Table: evaluation output generated in the testing phase]

Page Text, File Name, Page Number : These are the same columns we had in the data preparation stage; they are taken as-is from the source dataset.

ground, pred : ground shows the actual ML class of that page, while pred shows the ML class predicted by the ML engine.

Trained classes columns : Columns in this section represent the ML classes the model was trained on and the confidence scores for those classes.

MaxProb, Range : MaxProb shows the maximum confidence score achieved by any of the columns in the Trained classes section (see the red colored text); Range shows the range in which the MaxProb falls.

Currently there are three levels of results evaluation:


  1. Cumulative Error Evaluation Metric
  2. Confusion Matrix
  3. Class level confidence scores analysis

Cumulative Error Evaluation Metric

This evaluation calculates two metrics, Accuracy and F1-Score (for more details check this blog). These provide an abstract insight into the overall quality of the pipeline. The scores lie between 0 and 100, where a higher number means the pipeline is better at classifying the documents. In our experiments, we got the following accuracy and f1-score; a sketch of the computation follows the table.

[Table: accuracy and F1-score achieved in our experiments]
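A sketch of how these two metrics can be computed with scikit-learn (the weighted averaging scheme for F1 is our assumption):

```python
# Sketch: cumulative evaluation on the test split, reported on a 0-100 scale.
from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(y_test, y_pred) * 100
f1 = f1_score(y_test, y_pred, average="weighted") * 100  # weighted over classes
```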

Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

Essentially, it makes it easier to understand:

  • Which classes are not performing well?
  • What is the accuracy score of an individual class?
  • Which classes are confused with each other?

The plot below represents the confusion matrix we generated after our testing.

Values on both the X-axis (True labels) and Y-axis (Predicted labels) represent the document classes we trained on. The numbers within the cells show the percentage of the testing dataset belonging to the class on the left and bottom.

The values on the diagonal represent the percentage of data where the predicted classes were correct; higher is better. For example, a value of 0.99 means 99% of the testing data for that particular class was predicted correctly. All the other cells show wrong predictions, and the percentage shows how often a certain class was confused with another class.

As can be seen, the model is able to correctly classify most of the ML classes with more than 90% accuracy. A minimal sketch of how such a matrix can be computed follows.
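```python
# Sketch: confusion matrix normalized per true class, so the diagonal holds
# the fraction of each class's test pages that were predicted correctly.
# The row-wise normalization is inferred from the description above.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
cm_norm = cm / cm.sum(axis=1, keepdims=True)   # each row sums to 1.0
```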

Class level confidence scores analysis

Although the confusion matrix gives details about the class confusions, it doesn't represent the confidence scores of the predictions; in other words:

  • “How confident is the model when making a prediction about a document class?”

What is the need?

In the ideal situation, the model should have high confidence when predicting a correct ML class and low confidence when predicting a wrong ML class. But this is not strict behavior; it depends on many factors, e.g. the performance of a particular class, actual domain similarities between document classes, etc. To evaluate whether this behavior exists and whether confidence scores can be a useful indication of true predictions, we devised an additional evaluation approach.

Approach

Since the task is to reduce manual work, it was decided that only predictions with high confidence will be accepted automatically. This way, wrong predictions will rarely slip through (because they usually do not have high confidence). The rest of the documents and pages will be verified manually by the BPO staff.

Threshold

In this step, the confidence scores of the classes are calculated and a threshold is defined. The threshold is a percentage, e.g. 80% or 75%, which is decided based on the following condition:

  • What is the confidence score value at which wrong predictions are insignificant in number while true predictions remain high? In other words, it is about finding the sweet spot.

The following line plot shows the true positives (blue line) and false positives (red line).

[Plot: per-class true positive and false positive coverage at the 90% threshold]

The X-axis shows the ML classes, and the Y-axis shows the percentage of the testing data for a particular class that is covered by true positives or false positives.

For example, in the case of the ML class 1330, true predictions cover almost 70% of the whole testing dataset for that class, which means the ML engine was able to predict 70% of the data correctly with a confidence score greater than 90%. Moreover, the false positives covered only 1% of the testing dataset, which means only 1% of the test data was predicted wrongly with a confidence score higher than 90%.

Because of the threshold, we sometimes lose true positives (when their confidence score is below the threshold), but that is not as bad as false positives with high confidence. Such pages/documents will be verified manually.

The previous plot is made with a threshold of 90% and above. In the following plot, the threshold is 80% and above. Notice that even when the threshold is dropped to 80%, the false positives do not increase, while the true positives increase significantly. This means that, between the 90% and 80% thresholds, 80% is optimal.

[Plot: per-class true positive and false positive coverage at the 80% threshold]

While doing this analysis, all the levels are checked, i.e. 50%, 60%, 70%, and the most optimal threshold is chosen using this evaluation metric. A sketch of this analysis follows.
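A hedged sketch of the threshold analysis (array names and the per-class normalization are assumptions based on the plots described above):

```python
# Sketch: per-class true/false positive coverage above a confidence threshold.
import numpy as np

def coverage_at_threshold(y_true, y_pred, max_prob, threshold=0.8):
    """y_true, y_pred, max_prob are aligned numpy arrays over all test pages."""
    stats = {}
    for cls in np.unique(y_true):
        confident = (y_pred == cls) & (max_prob >= threshold)
        n_cls = np.sum(y_true == cls)                        # test pages of this class
        tp = np.sum(confident & (y_true == cls)) / n_cls     # blue line
        fp = np.sum(confident & (y_true != cls)) / n_cls     # red line
        stats[cls] = (tp, fp)
    return stats
```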


To summarize the key benefits of the solution:

  • Fast Predictions | The classification time for one page is under ~300ms; including OCR time, one page can be classified well under 1 second. This time can be reduced further if multi-processing is adopted.
  • High Accuracy | The current solution pipeline is able to identify and classify documents with high accuracy and high confidence. For most of the classes we get more than 95% accuracy.
  • Labeled Data Requirements | Within our experiments, we observed that the pipeline works well with about 300 samples per document class (as in the experiment discussed in these blogs), although this depends on the variations and type of document class. Moreover, we see accuracy and confidence scores increase with higher sample counts.
  • Confidence Score Threshold | The pipeline provides prediction confidence scores, which enables a tuning approach and allows trading off between True Positives and False Positives.
  • Multi-Processing | The Doc2Vec implementation allows for multi-processing, and our data transformation scripts are highly parallelized.

Machine Learning and Natural Language Processing have been doing wonders in many fields, and we have seen first hand how they helped reduce the manual effort and automate the task of Document Classification. The solution is not only fast, but also very accurate.

Because of the sensitive nature of the data used in this process, the code base is not available. I will rework the codebase on some dummy data, which will allow me to upload it to my GitHub. Please follow me on GitHub for further updates. Also check out some of my other projects ;)
