Document Intelligence and Index Creation Using Azure ML with Parallel Processing (Part 1)
Besides the Azure portal, you can also perform document intelligence and index creation in Azure ML studio. The index creation process consists of several steps: crack_and_chunk, generate_embeddings, update_index, and register_index. In Azure ML studio you can create or reuse a component for each of these steps and stitch them together into a pipeline.
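To make that concrete, below is a minimal sketch (not the exact production pipeline) of stitching the four steps together with the Azure ML Python SDK v2. The component names, versions, input/output names, and compute target are illustrative placeholders; substitute the components you actually registered or the built-in ones from your registry.

```python
# Sketch of stitching the index-creation steps into an Azure ML pipeline
# (SDK v2). All component and parameter names below are placeholders.
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Hypothetical component handles -- substitute your registered components.
crack_and_chunk = ml_client.components.get("crack_and_chunk", version="1")
generate_embeddings = ml_client.components.get("generate_embeddings", version="1")
update_index = ml_client.components.get("update_index", version="1")
register_index = ml_client.components.get("register_index", version="1")

@pipeline(default_compute="cpu-cluster")  # compute name is an assumption
def index_creation_pipeline(input_data: Input):
    chunks = crack_and_chunk(input_data=input_data)
    embeddings = generate_embeddings(chunks=chunks.outputs.output_chunks)
    index = update_index(embeddings=embeddings.outputs.embeddings)
    register_index(index=index.outputs.index)

pipeline_job = index_creation_pipeline(
    input_data=Input(
        type="uri_folder",
        path="azureml://datastores/workspaceblobstore/paths/pdfs/",  # assumed path
    )
)
ml_client.jobs.create_or_update(pipeline_job, experiment_name="index-creation")
```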
Section 1. What is it?
Usually, an ML pipeline component does its job serially; for example, crack_and_chunk processes each input file (e.g., a PDF file) one by one. With a couple of thousand files, it can take several hours to finish crack_and_chunk and several more hours for generate_embeddings, adding up to a dozen hours for the entire index creation job. With hundreds of thousands or millions of files, the entire index creation process could take weeks.
Parallel processing is therefore extremely important for speeding up index creation; the two most time-consuming components are crack_and_chunk and generate_embeddings.
The figure below shows the two components that apply parallel processing to index creation: crack_and_chunk_with_doc_intel_parallel and generate_embeddings_parallel.
Section 2. How is the parallelism achieved?
Take the crack_and_chunk_with_doc_intel_parallel component as an example. The ML job runs on a compute cluster that consists of multiple nodes, each with multiple processors. All files in the input folder are distributed into mini_batches, and each processor handles some of the mini_batches, so all processors execute the crack_and_chunk work in parallel. Compared with a serial pipeline, this parallel processing significantly improves the processing speed.
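As a rough sketch, this mapping of files onto nodes and processors is controlled by a few settings when the parallel component is defined. The example below uses parallel_run_function from the Azure ML SDK v2; the code folder, environment, and sizes are assumed placeholders, not the exact values from the repo.

```python
# Sketch of configuring parallelism in an Azure ML SDK v2 parallel component:
# instance_count picks the number of nodes, max_concurrency_per_instance the
# number of worker processes per node, and mini_batch_size how many input
# files each run() call receives. With the values below, 4 x 2 = 8
# mini_batches are processed concurrently.
from azure.ai.ml import Input, Output
from azure.ai.ml.parallel import parallel_run_function, RunFunction

crack_and_chunk_parallel = parallel_run_function(
    name="crack_and_chunk_with_doc_intel_parallel",
    inputs={"input_data": Input(type="uri_folder")},
    outputs={"output_chunks": Output(type="uri_folder")},
    input_data="${{inputs.input_data}}",  # the input split into mini_batches
    instance_count=4,                     # nodes in the compute cluster
    max_concurrency_per_instance=2,       # worker processes per node
    mini_batch_size="10",                 # files per mini_batch (uri_folder input)
    task=RunFunction(
        code="./crack_and_chunk_with_doc_intel",     # assumed code folder
        entry_script="crack_and_chunk_parallel.py",
        environment="azureml:my-env:1",              # assumed environment
        program_arguments="--output_chunks ${{outputs.output_chunks}}",
    ),
)
```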
The experiment below creates an index over about 120 PDF files and compares the time spent on each step of index creation. Parallel processing improved the speed considerably, and running on a GPU cluster was even faster than on a CPU cluster. Note that parallel processing incurs overhead at the beginning of the job for scheduling tasks to each processor: with a small number of input files, the time saving over serial processing may not be significant, but the larger the number of input files, the more significant the saving becomes.
How is the parallelism implemented in Azure ML? Please see this article:
How to use parallel job in pipeline – Azure Machine Learning | Microsoft Learn
The parallelism is implemented through an entry script with several functions: init(), run(), and shutdown(); the shutdown() function is optional.
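For illustration, a minimal entry-script skeleton following that contract might look like the sketch below. The repo's actual crack_and_chunk_parallel.py is more involved, and the --output_chunks argument here is a hypothetical placeholder.

```python
# Minimal parallel-job entry script skeleton: init() runs once per worker
# process, run() once per mini_batch, and the optional shutdown() runs when
# the process exits.
import argparse

def init():
    # Parse arguments and set up expensive, reusable state once per process.
    global args
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_chunks", type=str)  # hypothetical argument
    args, _ = parser.parse_known_args()

def run(mini_batch):
    # For folder input, mini_batch is a list of file paths assigned to this
    # process for this call.
    results = []
    for file_path in mini_batch:
        # ... crack and chunk file_path, writing chunks to args.output_chunks ...
        results.append(f"{file_path} processed")
    return results  # one result per item; used for success/failure accounting

def shutdown():
    # Optional: release resources (clients, temp files) before the process exits.
    pass
```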
Section 3. Code example
Please see the code in the azure-example GitHub repo as an example:
This repo creates the parallel run component crack_and_chunk_doc_intel_component_parallel and stitches it together with other Azure built-in components into an ML pipeline. The file crack_and_chunk_with_doc_intel/crack_and_chunk_parallel.py implements the parallelism logic, and the .ipynb files in the repo demonstrate several ways of providing .pdf inputs.
This implementation supports some especially important features:
Error handling. During crack_and_chunk, errors may occur when processing certain files; without error handling, the whole job would halt. With this solution you can decide how many errors to tolerate before halting the job, or even ignore all errors, so crack_and_chunk can continue on the remaining input files when some of them fail.
Timeout. You can set a timeout value that allows enough time for large input files to be processed (crack_and_chunk) and for responses to be received.
Retry. If crack_and_chunk fails on a mini_batch, you can set the number of retries.
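As a sketch, these three features map onto the mini_batch_error_threshold and retry_settings parameters of the Azure ML SDK v2 parallel job definition; the values below are placeholders and would extend the parallel_run_function call shown in Section 2, e.g. parallel_run_function(..., **reliability_settings).

```python
# Sketch: error tolerance, timeout, and retry expressed as parallel job
# settings (Azure ML SDK v2 parameter names; values are placeholders).
reliability_settings = dict(
    # Error handling: tolerate up to 5 failed mini_batches before halting the
    # whole job; -1 ignores all failures so the remaining files keep processing.
    mini_batch_error_threshold=5,
    # Timeout and Retry: each run(mini_batch) call gets up to 600 seconds
    # (enough for big PDF files) and is retried up to 2 times on failure or timeout.
    retry_settings=dict(max_retries=2, timeout=600),
)
```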
Be sure to check out this article for guidance on setting optimal parameters for parallel processing:
ParallelRunStep Performance Tuning Guide · Azure ML-Ops (Accelerator) (microsoft.github.io)
Section 4. Benefits of using Azure ML
Although there are other ways to create AI search indexes, creating them in Azure ML has several benefits.
The jobs run in a fully managed Azure ML environment, including monitoring.
You can draw on the abundant compute resources of the Azure ML platform, including VMs, CPUs, GPUs, etc.
Security and authentication features are provided, such as system-assigned identity and managed identity.
There are a variety of logs related to ML job execution, which help with debugging and provide statistics for job analysis.
See the picture below; this log shows how much time is spent on each mini_batch.
There are other logs covering performance, errors, user code, system events, etc.
Azure ML provides version control for output chunks, embeddings, and indexes, giving users the flexibility to select the desired version of these assets when building applications.
The picture below shows that you can specify the index version when you ask questions.
Azure ML can connect the index to promptflow natively.
For this parallel processing feature, a header is added to the API calls to identify crack_and_chunk_parallel processing.
Some other capabilities can be built on top of this parallel processing ML pipeline:
Scheduling. Once you are satisfied with an ML job, you can set a recurring schedule to run it (see the sketch after this list).
Publishing the pipeline as a pipeline endpoint, so pipeline jobs can be submitted against it easily.
Use the index in a promptflow.
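For the scheduling capability, a recurring schedule can be attached to the finished pipeline with the Azure ML SDK v2 schedule entities. In this sketch, pipeline_job is assumed to be a pipeline job object built as in the Section 3 sketch, and the schedule name and cadence are placeholders.

```python
# Sketch: attaching a recurring schedule to the pipeline (Azure ML SDK v2).
from azure.ai.ml import MLClient
from azure.ai.ml.entities import JobSchedule, RecurrenceTrigger
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

schedule = JobSchedule(
    name="index-creation-nightly",                           # placeholder name
    trigger=RecurrenceTrigger(frequency="day", interval=1),  # run once a day
    create_job=pipeline_job,  # the pipeline built earlier (assumed in scope)
)
ml_client.schedules.begin_create_or_update(schedule).result()
```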
Section 5. Future enhancements:
Some future enhancements are under consideration, for example re-indexing: detecting changes in the input files and updating the index with only those changes. We will experiment with that and publish Part 2 of the solution in the future.
Acknowledgement:
Thanks to the reviewers for providing feedback, joining the discussion, reviewing the code, or sharing their experience with Azure ML parallel processing:
Alex Zeltov, Vincent Houdebine, Randy Thurman, Lu Zhang, Shu Peng, Jingyi Zhu, Long Chen, Alain Li, Yi Zhou.