Month: July 2024
zipping empty subdirectories
I try to zip a directory which contains empty sub-directories, but the zip() function ignores the empty ones. Is there a way to include the empty subdirectories as well? zip MATLAB Answers — New Questions
Solve integral in nested function
Hello there,
I am trying to build something like this:
if true
function x = first
x = 5*y
function y = nested
I = @(x) x^2
y = integral(I,0,1)
end
end
end
but an error occurs stating: ‘Error: Function definitions are not permitted in this context.’
Does anyone know an alternative, or can you spot the mistake I made?
thanks! integral, nested functions, passing results MATLAB Answers — New Questions
Azure Health bot scenarios building with code
I know that we can build custom scenarios with the Azure Health Bot UI.
But is there a way to build custom scenarios directly with code, through an API?
Read More
Responsible AI for All from Children to Parents
As mentioned in our previous blog post, Learn about Responsible AI with MVP Veronika Kolesnikova, it’s crucial to understand the principles of Responsible AI to ensure ethical use of AI in the future.
In this blog post, we focus on promoting an understanding of responsible AI across a broad range of audiences, from business users to high school students. We interviewed Komes Chandavimol from Thailand, who was awarded the Microsoft MVP for AI Platform in March of this year, to share his expert insights.
———-
Please tell us details about your recent community activities focusing on responsible AI.
In Thailand, the movement in responsible AI started with large enterprises that are exposed to the development of machine learning and attempt to apply basic responsible AI concepts such as explainable AI and reliable AI. Since then, it has become popular in two ways. First, at the community level, experts share their knowledge through sessions including meetups, podcasts, or blogs. Additionally, responsible AI is embedded into education since machine learning subjects include libraries such as SHAP or ML in operation.
As a data and AI practitioner and visiting professor, I try to encourage everyone in both ways. My Data Science Thailand page usually shares content about responsible AI concepts, toolkits, and use cases. Moreover, I have also conducted responsible AI workshops at several events from Microsoft and partner collaborations. Such events included Code; Without Barriers and SoundByte Digital Inclusion in Australia, which empower diversity efforts in our industry. On the other hand, I included responsible AI in my data science for business subject, where I teach responsible AI and focus on how to apply toolkits, ensuring each principle has an example that undergraduates understand and can apply to their use cases.
Recently, I have expanded responsible AI to a wider range of audiences such as high school students. I went to a volunteering event and taught the concept of responsible AI. This includes the “Responsible AI for Young” initiative that I teach in schools to make sure students are aware of the risks in AI and how to avoid them.
What do you suggest AI users be aware of when using generative AI in terms of responsible AI?
It depends on the audience. When I conduct a class for high school students, I focus on fun and engagement, where I flip the classroom to let everyone experiment first, and then follow up with the concepts. On the other hand, if I teach younger students, I may start with a text-to-music theme and bring them to the generative AI model’s capabilities and the risks of using it in public.
In one session, I conducted responsible AI training for parents who are preparing their kids for university. This group is tech-savvy and knowledgeable about AI. The point I tried to emphasize is that they should be the “human in the loop” with their kids. Parental involvement is key to their children’s responsible AI use, and their responsibility is very important.
All in all, I believe in the concept of the 4Ps in learning: Passion, Play, Peer, and Project. I bring them to the passion with peer review and mostly give them small projects to involve and ask them to reflect on their learning with AI.
As a community leader, how do you help community members who would like to learn more about Responsible AI?
I normally give them guidelines to follow and encourage them to learn based on their interests. Many of my materials are published on my Facebook page, and I also have a small YouTube channel where I post some of the videos I teach to my students. However, I believe that today we have rich information about responsible AI, and they can connect by themselves, so I just give them shortcuts to knowledge and encourage them to learn on their own.
In conclusion, my dedication to promoting responsible AI is driven by a passion for ensuring ethical and reliable AI practices. Whether through community activities, educational initiatives, or professional workshops, I aim to inspire and equip others to navigate the complexities of AI responsibly. Through this process, I have also learned a great deal about the diverse perspectives and innovative approaches within our community, which continuously enrich my understanding and practice of responsible AI.
I look forward to continuing this journey with you all, fostering a collaborative environment where we can learn, grow, and make a positive impact together.
Thank you for your attention and commitment to responsible AI.
———-
Now that AI benefits are accessible to everyone, not just a select few, Komes’s efforts to raise awareness about responsible AI among various age groups are incredibly valuable. We encourage everyone reading this to explore the resources provided below to learn more about responsible AI and share this knowledge within your own communities.
– Empowering responsible AI practices | Microsoft AI
– Responsible AI Principles and Approach | Microsoft AI
– Responsible AI Solutions | Microsoft Azure
– Skill up on Responsible AI Developer Hub | Responsible AI Developer Hub (azure.github.io)
– Train a model and debug it with Responsible AI dashboard – Training | Microsoft Learn
– Embrace responsible AI principles and practices – Training | Microsoft Learn
– Responsible AI – Cloud Adoption Framework | Microsoft Learn
Microsoft Tech Community – Latest Blogs –Read More
Generate Synthetic QnAs from Real-world Data on Azure
1. Background
In the rapidly evolving field of Generative AI, SLM/LLM fine-tuning and implementing RAG (Retrieval-Augmented Generation) techniques have become essential for achieving high-performance and domain-specific applications. Creating synthetic datasets for these purposes is crucial as it allows for the generation of tailored training data that addresses specific gaps and nuances in the target domain, which might not be adequately covered by existing datasets. This approach enhances the model’s ability to understand and generate relevant and accurate information, ultimately leading to more robust, reliable, and context-aware AI systems that can better serve users’ needs in specialized areas.
Generating high-quality datasets from diverse formats of raw data, such as PDFs, CSVs, and TXTs, especially those containing a mix of images, tables, and text, presents several significant challenges. This is mainly because the extraction process itself is complex, as each format requires different parsing techniques to accurately retrieve and interpret the content. PDFs, for instance, can have varied structures and may not follow a standardized layout, making it difficult to consistently extract text and images. Additionally, handling tables within PDFs is particularly challenging because they can span multiple pages and have complex cell structures.
If you want to improve the performance of your model using a seed dataset generated from raw data as a baseline, you may need data augmentation to generate high-quality synthetic data. But there is a risk of introducing biases or inconsistencies during the augmentation process. Augmented data needs to be representative of the diversity and variability present in the real-world data. If not carefully managed, augmentation can lead to overfitting, where the model performs well on augmented data but poorly on actual data, or it can amplify existing biases in the dataset. In this blog, we share in detail the methodology and code snippets that will help you solve these challenges.
2. Constructing a Seed Dataset
2.1. Overview
The task is to preprocess and convert this heterogeneous data into a structured format suitable for fine-tuning or RAG. This involves extracting and cleaning text from various file formats, and converting tables and images to text using Azure AI Services if necessary. This dataset is used as a seed dataset for fine-tuning or RAG and serves as a baseline to improve the performance of domain-specific use cases. Here’s an easy-to-follow hands-on based on a typical use case. All of this code is uploaded here.
make_qa_multimodal_pdf_docai.ipynb: (Recommended) Generate QnA synthetic dataset from a Complex PDF using Azure AI Document Intelligence.
make_qa_multimodal_pdf_oss.ipynb: Generate QnA synthetic dataset from a Complex PDF using Open source (Unstructured toolkit for this hands-on). To run this file, you first need to install the required packages with startup_unstructured.sh. The installation will take a few minutes.
make_qa_only_image_multiple_pdf.ipynb: Generate QnA synthetic dataset from multiple PDFs – Image-heavy PDF.
make_qa_only_image_pdf.ipynb: Generate QnA synthetic dataset from a PDF – Image-heavy PDF.
CSV
make_qa_csv.ipynb: This is the general case. It is not difficult to create a QnA dataset by reading and chunking with CSVLoader.
make_qa_image_url_csv.ipynb: This is another common case. If image URL information is included, replace the URL with a summary result for that image (a rough sketch follows below).
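As a rough illustration of that image-URL case, the sketch below replaces each URL in the column with a GPT-4o text summary before chunking. The file names, environment variable names, and prompt wording are illustrative assumptions, not taken from the repository:

import os
import pandas as pd
from openai import AzureOpenAI

# Assumptions: a GPT-4o deployment and an input CSV with an "image_url" column
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-15-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

def summarize_image_url(url):
    # Ask the multimodal model for a short, text-only description of the image
    response = client.chat.completions.create(
        model=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Summarize this image in one or two sentences."},
            {"type": "image_url", "image_url": {"url": url}},
        ]}],
        max_tokens=200,
    )
    return response.choices[0].message.content

df = pd.read_csv("products.csv")  # hypothetical input file
df["image_url"] = df["image_url"].apply(summarize_image_url)
df.to_csv("products_with_summaries.csv", index=False)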
Let’s take a look at the contents of the most representative make_qa_multimodal_pdf_docai.ipynb.
2.2. Separate the PDF pages
Separate the PDF pages into “Text”, “Image”, and “Mixed” (text, images, and tables) by using analyze_pdf_page_content(…):
import fitz  # PyMuPDF
from collections import defaultdict

def analyze_pdf_page_content(pdf_path, text_length_thres=600):
    document = fitz.open(pdf_path)
    page_analysis = defaultdict(list)

    for page_num in range(len(document)):
        page = document.load_page(page_num)
        text = page.get_text("text")
        image_list = page.get_images(full=True)

        text_length = len(text)
        num_images = len(image_list)

        if text_length > text_length_thres and num_images == 0:
            content_type = 'Text'
        elif text_length <= text_length_thres and num_images > 0:
            content_type = 'Image'
        else:
            content_type = 'Mixed'

        page_analysis[content_type].append(page_num)

    return dict(page_analysis)
Text-heavy pages can be processed with open-source tools (e.g., PyMuPDF) without the need for toolkits like Azure AI Document Intelligence or Unstructured.
Image-heavy pages can be converted to images in their entirety, letting a multimodal LLM like Azure OpenAI GPT-4o summarize each page (see the sketch after this list).
Mixed pages use Azure AI Document Intelligence to separate images, text, and tables. Azure Document Intelligence’s built-in models offer key features for document analysis, including:
Text Extraction: Identifies and extracts text from various document types.
Table Recognition: Detects and extracts table structures.
Selection Marks: Recognizes checkboxes and radio buttons.
Form Field Recognition: Extracts fields from forms.
Document Structure Understanding: Differentiates between titles, headers, footers, and other sections.
Multi-language Support: Handles documents in multiple languages.
SDK and REST API Integration: Provides tools for seamless integration into applications.
For more details, visit Microsoft’s AI Document Intelligence official page.
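For the image-heavy case, a minimal sketch of the page-to-image step with PyMuPDF might look like this; the zoom factor and file naming are illustrative assumptions, not taken from the repository:

import fitz  # PyMuPDF

def render_pages_to_png(pdf_path, page_numbers, out_dir=".", zoom=2.0):
    # Render selected PDF pages to PNGs so a multimodal LLM (e.g., GPT-4o)
    # can summarize each page as a whole image
    document = fitz.open(pdf_path)
    paths = []
    for page_num in page_numbers:
        page = document.load_page(page_num)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))  # zoom > 1 raises resolution
        out_path = f"{out_dir}/page_{page_num:03d}.png"
        pix.save(out_path)
        paths.append(out_path)
    return paths

# e.g., render the pages classified as "Image" by analyze_pdf_page_content(...):
# image_pages = analyze_pdf_page_content("sample.pdf").get("Image", [])
# render_pages_to_png("sample.pdf", image_pages)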
2.3. Mixed page processing
Now let’s look at how to extract information from mixed pages in detail.
Extract mixed pages with the prebuilt-layout model and convert them to markdown – document_intelligence_client.begin_analyze_document("prebuilt-layout", output_content_format=ContentFormat.MARKDOWN, …)
Extract images using bounding boxes (x, y, w, h) stored in figure tags – crop_image_from_file(…). If the bounding box size is too small (is_bounding_box_larger_than(…)) or the image is a simple pattern with no meaning (image_complexity(…)), the image is not extracted.
Summarize the extracted images with GPT-4o (understand_image_with_gpt(…)).
Here is a code snippet that summarizes this briefly. Note that the actual implementation is more complex than this.
import openai

# Import necessary functions for processing
# (extract_bounding_box, delete_folder_and_make_folder, and save_cropped_image
# are assumed to live in util.preprocess as well)
from util.preprocess import (
    image_complexity, is_bounding_box_larger_than, crop_image_from_file,
    understand_image_with_gpt, update_figure_description,
    extract_bounding_box, delete_folder_and_make_folder, save_cropped_image
)

if "Mixed" in analyzed_pdf_result:
    pdf_mixed_path = path_to_mixed_pdf

    # Open and analyze the PDF file with Azure Document Intelligence
    with open(pdf_mixed_path, "rb") as file:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=file, content_type="application/octet-stream",
            output_content_format=ContentFormat.MARKDOWN
        )

    result = poller.result()
    md_content = result.content

    output_folder = "pdf_mixed_tmp"
    delete_folder_and_make_folder(output_folder)
    input_file_path = pdf_mixed_path

    if result.figures:
        for idx, figure in enumerate(result.figures):
            figure_content = ""
            img_description = ""

            # Extract figure content from markdown
            for span in figure.spans:
                figure_content += md_content[span.offset:span.offset + span.length]

            for region in figure.bounding_regions:
                boundingbox = extract_bounding_box(region)
                # Extract images only when the bounding box size is greater than a certain size.
                if is_bounding_box_larger_than(boundingbox):
                    cropped_image = crop_image_from_file(input_file_path, region.page_number - 1, boundingbox)
                    # image_complexity returns a tuple; the first element is the label.
                    # Extract images only when the image complexity is high.
                    if image_complexity(cropped_image)[0] == "Complex":
                        cropped_image_filename = save_cropped_image(cropped_image, idx, input_file_path, output_folder)
                        try:
                            image_summarization = understand_image_with_gpt(client, aoai_deployment_name, cropped_image_filename, "", max_tokens, language)
                        except openai.BadRequestError:
                            image_summarization = ""
                        img_description += image_summarization

            # Update the figure tag in the extracted markdown document
            md_content = update_figure_description(md_content, img_description, idx)
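The body of crop_image_from_file(…) is not shown above. A minimal sketch, assuming the bounding box arrives as (x0, y0, x1, y1) in inches (Document Intelligence reports PDF coordinates in inches, while PyMuPDF works in points, i.e., 1/72 inch), could look like this:

import fitz  # PyMuPDF
from PIL import Image

def crop_image_from_pdf_page(pdf_path, page_number, bounding_box, zoom=2.0):
    # Crop a region from a PDF page and return it as a PIL image.
    # bounding_box is assumed to be (x0, y0, x1, y1) in inches;
    # multiply by 72 to convert to PyMuPDF points.
    document = fitz.open(pdf_path)
    page = document.load_page(page_number)
    rect = fitz.Rect([coord * 72 for coord in bounding_box])
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
    return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)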
image_complexity(…) assesses the complexity of an image by analyzing its histogram entropy, Laplacian variance, and edge count. It converts the image to a format suitable for OpenCV processing, calculates the entropy of the color histograms, the variance of the Laplacian to measure focus, and the number of edges detected by the Canny edge detector. Based on these metrics, it classifies the image as either “Complex” or “Simple” depending on predefined threshold values for each metric. The code snippet is below.
import cv2
import numpy as np
from PIL import Image

# Function to calculate complexity using variance of Laplacian and Canny edge detection
def image_complexity(img, laplacian_var_thres=500, edge_count_thres=10000, total_entropy_thres=5.0):
    if isinstance(img, Image.Image):
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)

    ##### Histogram entropy
    hist_b = cv2.calcHist([img], [0], None, [256], [0, 256])
    hist_g = cv2.calcHist([img], [1], None, [256], [0, 256])
    hist_r = cv2.calcHist([img], [2], None, [256], [0, 256])

    # Normalize the histograms
    hist_b /= hist_b.sum()
    hist_g /= hist_g.sum()
    hist_r /= hist_r.sum()

    # Calculate histogram entropy
    entropy_b = -np.sum(hist_b * np.log2(hist_b + 1e-7))
    entropy_g = -np.sum(hist_g * np.log2(hist_g + 1e-7))
    entropy_r = -np.sum(hist_r * np.log2(hist_r + 1e-7))

    # Total entropy
    total_entropy = entropy_b + entropy_g + entropy_r

    ### Laplacian variance
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    laplacian_var = cv2.Laplacian(gray_img, cv2.CV_64F).var()

    ### Canny edge detection
    edges = cv2.Canny(gray_img, 100, 200)
    edge_count = np.sum(edges > 0)

    if laplacian_var > laplacian_var_thres or edge_count > edge_count_thres or total_entropy > total_entropy_thres:
        return "Complex", laplacian_var, edge_count, total_entropy
    else:
        return "Simple", laplacian_var, edge_count, total_entropy
2.4. Construct QnA Pairs
We can leverage the azure-ai-generative package. The QADataGenerator class in this package makes it easy to generate QnA synthetic questions. However, using this class as is has the disadvantage of not being able to use custom prompts, so we inherited from it and created the CustomQADataGenerator class as follows.
import os
from azure.ai.generative.synthetic.qa import QADataGenerator, QAType
from typing import Dict, List, Tuple, Any, Union, Optional
from azure.ai.generative._telemetry import ActivityType, monitor_with_activity, ActivityLogger

activity_logger = ActivityLogger(__name__)
logger, module_logger = activity_logger.package_logger, activity_logger.module_logger

class CustomQADataGenerator(QADataGenerator):
    def __init__(self, templates_dir: str, **kwargs):
        self.templates_dir = templates_dir
        super().__init__(**kwargs)

    def _get_template(self, filename) -> str:
        logger.debug("Getting prompt template from %s file", filename)
        filepath = os.path.join(self.templates_dir, filename)
        with open(filepath, encoding="utf-8") as f:
            template = f.read()
        return template

    def _get_messages_for_qa_type(self, qa_type: QAType, text: str, num_questions: int) -> List:
        logger.debug("Getting prompt messages for %s QA type", qa_type)
        template_filename = {
            QAType.SHORT_ANSWER: "prompt_qa_short_answer.txt",
            QAType.LONG_ANSWER: "prompt_qa_long_answer.txt",
            QAType.BOOLEAN: "prompt_qa_boolean.txt",
            QAType.SUMMARY: "prompt_qa_summary.txt",
            QAType.CONVERSATION: "prompt_qa_conversation.txt",
        }
        filename = template_filename[qa_type]
        messages = self._get_messages_from_file(filename)

        input_variables: Dict[str, Any] = {"text": text}
        if qa_type == QAType.SUMMARY:
            input_variables["num_words"] = 100
        else:
            input_variables["num_questions"] = num_questions

        messages[-1]["content"] = messages[-1]["content"].format(**input_variables)
        return messages

    def _get_messages_for_modify_conversation(self, questions: List[str]) -> List:
        messages = self._get_messages_from_file("prompt_qa_conversation_modify.txt")
        questions_str = "\n".join([f"[Q]: {q}" for q in questions])
        messages[-1]["content"] = messages[-1]["content"].format(questions=questions_str)
        return messages
All you have to do is put your own prompts into a text file and you’re done. There are some prompt examples at this link.
Now, you can easily create a QnA dataset for fine-tuning using the code snippet below. Of course, you can also use it for RAG with just a little modification of the code.
import os
import asyncio
from collections import Counter
from typing import Dict
from azure.ai.generative.synthetic.qa import QAType
from util.qa import CustomQADataGenerator

model_config = {
    "deployment": os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    "model": "gpt-4o",
    "max_tokens": 2000,
}

qa_generator = CustomQADataGenerator(model_config=model_config, templates_dir=f"./prompt_template/{LANGUAGE_CODE}")

concurrency = 6  # number of concurrent calls
sem = asyncio.Semaphore(concurrency)

#qa_type = QAType.CONVERSATION
qa_type = QAType.LONG_ANSWER

async def generate_async(text: str) -> Dict:
    async with sem:
        return await qa_generator.generate_async(
            text=text,
            qa_type=qa_type,
            num_questions=3,  # Number of questions to generate per text
        )

input_batch = mixed_chunks + text_chunks + image_summaries
# Top-level await works in a notebook environment
results = await asyncio.gather(*[generate_async(text) for text in input_batch], return_exceptions=True)

question_answer_list = []
token_usage = Counter()

for result in results:
    if isinstance(result, Exception):
        raise result  # exception raised inside generate_async()
    question_answer_list.append(result["question_answers"])
    token_usage += result["token_usage"]

print("Successfully generated QAs")
The screenshot below is a Q&A result extracted from sample raw data; you can see sample results in this folder. All of the sample raw data is based on articles I have written or data I have generated, so there are no license issues. If you are doing a PoC/MVP, please prepare your own dataset.
Below is a comparison of the results before and after fine-tuning GPT-4o without RAG for a Korean customer PoC (GPT-4o fine-tuning is available to a small number of customers as a private preview as of July 2024). This is the result of creating a set of 16 questions and answers for the PoC and comparing three metrics, Similarity, Coherence, and Fluency, in Azure AI Studio. The metrics are on a scale of 1-5, with higher values being better.
3. Synthetic Data Generation
After fine-tuning with the generated dataset from the above section, a baseline was established, but the performance requires improvement due to a lack of data (e.g., there are only 1,000 samples in the dataset). In this case, a synthetic dataset must be created by applying data augmentation techniques to enhance performance. The data augmentation technique utilizes the representative techniques announced by Microsoft: Evol-Instruct, GLAN (Generalized Instruction Tuning), and Auto Evol-Instruct.
Evol-Instruct: Generate diverse instructional data to augment the dataset from the seed dataset.
GLAN: Apply generalized instruction tuning to expand the variety of Q&A pairs.
Auto Evol-Instruct: Automate the generation of synthetic data to scale the augmentation process.
3.1. Augment your dataset – Evol-Instruct
The Evol-Instruct concept developed by Microsoft aims to enhance the capabilities of LLMs by automatically evolving instructions to various complexity levels, instead of relying solely on manually created instructions. This method involves several key components and steps:
Instruction Evolution: Starting with an initial set of instructions, the model uses an LLM like GPT-4o to iteratively rewrite these instructions into more complex versions. The evolution process involves two types of instruction enhancement: in-depth evolving (adding constraints, deepening, increasing reasoning steps) and in-breadth evolving (generating new, diverse instructions). A rough sketch of the in-depth step follows this list.
Response Generation: After evolving the instructions, the LLM generates responses to these newly complex instructions, ensuring that they are reasonable and understandable by humans.
Instruction Elimination: The process includes a step to filter out any instructions that fail to evolve properly, ensuring only high-quality, challenging instructions remain.
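To make the in-depth evolving step concrete, here is a minimal sketch; the meta-prompt wording and the call_llm callable are illustrative assumptions, not the actual WizardLM prompts:

import random

# Illustrative meta-prompts for in-depth evolving; the real WizardLM
# prompts are longer and more carefully worded.
DEEPEN_TEMPLATES = [
    "Rewrite the instruction below so it adds one extra constraint:\n{instruction}",
    "Rewrite the instruction below so it requires multi-step reasoning:\n{instruction}",
    "Rewrite the instruction below so it also asks for a concrete example:\n{instruction}",
]

def evolve_in_depth(instruction, call_llm):
    # One in-depth evolution round: pick a mutation template and let the
    # LLM rewrite the instruction into a more complex version.
    # `call_llm` is a hypothetical callable(prompt) -> str, e.g., a thin
    # wrapper around an Azure OpenAI chat completion.
    template = random.choice(DEEPEN_TEMPLATES)
    return call_llm(template.format(instruction=instruction))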
This open-source implementation is based on the WizardLM paper and h2o-wizardlm. We added the following features to the original implementation:
Modified it to be able to call Azure OpenAI by adding the AzureGPTPipeline class.
The prompt has been refined and modified to support multiple languages. Use the -language argument for other languages (e.g., -language Korean).
Made it possible to create questions only when necessary. A better strategy is to create questions and answers separately. Use the -question_only argument (e.g., -question_only True).
Prevented infinite loops. mutate() in the original implementation checks the validity of the augmented statement and repeats the loop until it is valid. However, this process takes a very long time, and in certain situations the loop repeats infinitely.
You can easily convert your jsonl file from the previous section with convert.py and augment your dataset with evolve.py.
#!/bin/bash

INPUT_FILE="../seed/samples/advertising-oai.jsonl"
SEED_FILE="seed.jsonl"
COLUMN_NAMES="Instruction"
NUM_ROWS=10
MAX_LEN_CHARS=256

python convert.py --input_file "$INPUT_FILE" --output_file "$SEED_FILE"
python evolve.py --seed_file "$SEED_FILE" --column_names "$COLUMN_NAMES" --num_rows "$NUM_ROWS" --max_len_chars "$MAX_LEN_CHARS"
Seed instruction – after running convert.py
{"idx": 1, "Skill": "Distributed training on Cloud", "Difficulty": 5, "Instruction": "What version of TensorFlow was used in the evaluation?"}
{"idx": 2, "Skill": "Distributed training on Cloud", "Difficulty": 5, "Instruction": "What is the first step to prepare the validation set for ImageNet training?"}
{"idx": 3, "Skill": "Distributed training on Cloud", "Difficulty": 5, "Instruction": "What is the purpose of the script 'preprocess_imagenet.py' and how is it executed?"}
Synthetic instruction – after running evolve.py (shows only 10 samples)
[
    {
        "input": "Could you determine the specific learning rate value (LR) recorded at the 7000th iteration within the hvd_train_log?"
    },
    {
        "input": "Can you provide the accuracy and precision metrics recorded at epoch 37 in the model training logs?"
    },
    {
        "input": "Could you elaborate on the process and specific steps needed to resize the TFRecord training dataset for ImageNet utilizing the provided scripts, including any potential challenges and solutions?"
    },
    {
        "input": "Could you outline the sequential procedures to be followed post-data transformation to ensure both the reliability of data backup and its continuous availability, requiring detailed verification and contingency planning?"
    },
    {
        "input": "Could you explain the primary function and broader implications of using the 'nohup' command, especially in the context of manually downloading the extensive ImageNet dataset, along with potential advantages and disadvantages?"
    },
    {
        "input": "What is the purpose of using the 'nohup' command when manually downloading the ImageNet dataset?"
    },
    {
        "input": "Can you explain the rationale behind not performing the resizing to 224×224 for the RecordIO format, and discuss any potential consequences and alternatives?"
    },
    {
        "input": "What are the underlying reasons that make a supplementary EBS volume of 1.0TB advisable for efficiently handling the ImageNet dataset? Please elaborate on the factors involved."
    },
    {
        "input": "Taking into account both cost-efficiency and performance, what is the recommended EC2 instance type for storing the ImageNet dataset and why?"
    },
    {
        "input": "What is the purpose of the script 'preprocess_imagenet.py' and how is it executed?"
    }
]
Example datasets are placed in this folder. Please try the minimal example first and configure your dataset by referring to the tunable parameters.
3.2. Improve the generalizability of your model – GLAN
Catastrophic forgetting, also known as catastrophic interference, occurs during SLM/LLM fine-tuning when a model trained on new data overwrites the knowledge it previously acquired, leading to a significant drop in performance on earlier tasks. This issue is particularly prominent in scenarios where the model needs to adapt to diverse and potentially conflicting data distributions without losing its ability to perform well on initial tasks.
GLAN (Generalized Instruction Tuning) addresses catastrophic forgetting by leveraging a systematically curated taxonomy of human knowledge to generate synthetic, diverse, and comprehensive instruction datasets. Specifically, GLAN uses a pre-curated taxonomy of human knowledge and capabilities to generate large-scale, diverse instruction data across various disciplines. This taxonomy mirrors the systematic structure of human education, breaking down knowledge into fields, sub-fields, and distinct disciplines, which are then used to design syllabi for different subjects. These syllabi detail key concepts for each class session, enabling the generation of diverse and comprehensive instructions. Here are GLAN’s key features:
Taxonomy Creation: A detailed hierarchy of human knowledge is built using LLMs and human verification.
Subject and Syllabus Generation: LLMs generate a list of subjects and detailed syllabi, breaking down subjects into class sessions with key concepts.
Instruction Generation: Homework questions and their answers are generated based on the syllabi, ensuring broad coverage and diversity.
The author has implemented the concept of this paper from scratch and made it public. Let’s take a look at the core concepts of the code.
generate_taxonomy(…): Generate a taxonomy of human knowledge and capabilities.
generate_subjects(…): Generate a list of subjects for a given discipline. Please refer to section 2.2 of the paper.
generate_syllabus(…): Generate a syllabus for a given subject at a specific level. Please refer to section 2.3 of the paper.
sample_class_sessions_and_key_concepts(…): Sample class sessions and key concepts to generate questions of varying difficulty.
generate_questions(…): Generate questions based on class sessions and key concepts using LangChain pipeline. Please refer to section 2.4 of the paper.
generate_answers(…): Generate answers to the questions using LangChain pipeline. Please refer to section 2.4 of the paper.
glan_instruction_generation(…): GLAN pipeline
This pseudocode for glan_instruction_generation(…) outlines the steps of generating or loading disciplines, generating subjects and syllabi for each discipline, creating questions based on the syllabi, saving the questions, and optionally generating and saving answers to create a comprehensive instruction dataset.
def glan_instruction_generation(args):
    if args.generate_disciplines:
        taxonomy_json, disciplines = generate_taxonomy(
            max_number_of_fields=args.max_number_of_fields,
            model_name=args.model_name
        )
    else:
        disciplines = read_text_to_list(args.disciplines_filepath)

    all_questions = []
    for discipline in disciplines:
        subjects_json = generate_subjects(
            discipline,
            max_number_of_subjects=args.max_number_of_subjects,
            max_number_of_subtopics=args.max_number_of_subtopics,
            model_name=args.model_name
        )
        for subject_info in subjects_json["subjects"]:
            subject = subject_info['subject']
            level = subject_info['level']
            subtopics = ", ".join(subject_info['subtopics'])

            class_sessions, key_concepts = generate_syllabus(
                subject, level, subtopics,
                max_number_of_session_name=args.max_number_of_session_name,
                model_name=args.model_name
            )

            questions = generate_questions(
                class_sessions, key_concepts,
                subject, level, subtopics,
                model_name=args.model_name,
                num_iterations=args.num_iterations,
                num_questions_per_iteration=args.num_questions_per_iteration,
                max_tokens=args.question_max_tokens,
                batch_size=args.question_batch_size,
                language=args.language
            )
            all_questions.extend(questions)

    save_questions(all_questions, args.output_dir, args.language)

    if not args.generate_question_only:
        all_answers = generate_answers(
            all_questions,
            model_name=args.model_name_for_answer,
            max_tokens=args.answer_max_tokens,
            batch_size=args.answer_batch_size
        )
        save_instructions(all_questions, all_answers, args.output_dir, args.language)
This implementation supports all languages supported by the LLM, so you can easily create datasets in your own language. Below is an example created with this code.
English language
Korean language
Example datasets are placed in this folder. Please try the minimal example first and configure your dataset by referring to the tunable parameters.
The code snippet below is a shell script that generates a large synthetic dataset. If you are willing to pay enough PTUs, you can create your own datasets just by running this script.
# Initialize counter
counter=1

# Read disciplines.txt line by line
while IFS= read -r line || [[ -n "$line" ]]; do
    # Create the corresponding disciplines file
    discipline_file="disciplines_line${counter}.txt"
    echo Created "$discipline_file"
    echo "$line" > "$discipline_file"

    # Run the Python script with the current disciplines file
    python generate.py \
        --disciplines_filepath "$discipline_file" \
        --language Korean \
        --max_number_of_subjects 15 \
        --max_number_of_subtopics 30 \
        --max_number_of_session_name 30 \
        --num_iterations 15 \
        --num_questions_per_iteration 18 \
        --question_max_tokens 1024 \
        --question_batch_size 9 \
        --model_name_for_answer gpt-4o \
        --answer_max_tokens 2048 \
        --answer_batch_size 9

    # Increment counter
    ((counter++))

    # Delete the temporary disciplines file
    rm "$discipline_file"
done < disciplines_sample.txt
The author has uploaded a synthetic dataset in Korean to the Hugging Face Hub under a public license. Korean customers are welcome to use this dataset as a baseline!
https://huggingface.co/datasets/daekeun-ml/GLAN-Korean
4. Conclusion
In the realm of Generative AI, creating and augmenting datasets for fine-tuning SLM/LLM models is a crucial step to ensure robust, reliable, and context-aware AI systems. This blog post outlined the process of constructing a seed dataset from diverse raw data formats and then applying data augmentation techniques such as Evol-Instruct and GLAN to enhance the dataset’s quality and diversity. These techniques help mitigate the challenges of extracting high-quality data from complex formats like PDFs and CSVs and prevent issues such as catastrophic forgetting during the fine-tuning process.
By leveraging advanced methods like GLAN, we can systematically generate comprehensive instruction datasets across various disciplines, enhancing the model’s performance without compromising previously learned knowledge. This approach not only improves the generalizability of the model but also ensures that it can handle a wide range of complex tasks effectively.
For those interested in exploring these techniques further and implementing them in their projects, all the code and examples discussed in this blog are available on the Azure synthetic-qa-generation GitHub repository. This resource provides a practical guide and tools to generate high-quality QnA datasets, enabling you to fine-tune and optimize your AI models for specialized applications. By following the methodologies and utilizing the tools provided, developers and data scientists can create robust datasets that significantly enhance the capabilities of their AI models, paving the way for more advanced and reliable AI solutions in various domains.
References
End-to-end hands-on labs: https://github.com/Azure/synthetic-qa-generation
Korean GLAN dataset: https://huggingface.co/datasets/daekeun-ml/GLAN-Korean
Evolve-Instruct paper: https://arxiv.org/abs/2304.12244
GLAN paper: https://arxiv.org/abs/2402.13064
Auto Evolve-Instruct paper: https://arxiv.org/abs/2406.00770
Microsoft Tech Community – Latest Blogs –Read More
SIMULINK: Set a random seed in the Block Parameters: Random Number GUI
SIMULINK
How do I set a random seed in the user GUI for a random number block? (So I get a random number at each sim run)
Thanks simulink, random number generator MATLAB Answers — New Questions
How do I merge vectors?
I have a 10X1 vector A=[1;3;7;10;12;14;15;18;20;21] and a 10X1 vector B=[3;12;15;18;20;0;0;0;0;0]. Is there a way to come up with a 10X2 vector C such that the equal elements in each vector sit at the same row in vector C? I want to retain all elements of vector A in vector C and keep only the elements from vector B that are equal to elements in vector A. Ideally, my final vector C would look something like this:
C=[1 0;3 3;7 0;10 0;12 12;14 0;15 15;18 18;20 20;21 0]. Any thoughts? merging vectors MATLAB Answers — New Questions
Can you explain the difference between the two matrix operations in MATLAB?
I am trying to understand how MATLAB performs the following matrix operations:
Example 1:
clearvars; clc; close all;
Nx = 8;
Ny = 8;
Lx=2*pi;
dx = Lx/Nx;
Vec = fftshift(-Nx/2:Nx/2-1);
Vector1 = (sin( Vec * dx/2)/(dx/2)).^2 ;
[Matrix2,x] = cheb(Ny);
for m = 1:length(Vec)
Matrix1 = -1 * (Vector1(m))+ Matrix2;
end
Example 2:
clearvars; clc; close all;
Nx = 8;
Ny = 8;
Lx=2*pi;
dx = Lx/Nx;
Vec = fftshift(-Nx/2:Nx/2-1);
Vector1 = (sin( Vec * dx/2)/(dx/2)).^2 ;
Igl = speye(Ny+1);
[Matrix2,x] = cheb(Ny);
for m = 1:length(Vec)
Matrix1 = -Igl * (Vector1(m))+ Matrix2;
end
Why is Matrix1 different in Example 1 and Example 2? In particular, in Example 1, how is the scalar multiplication of the row vector (Vector1(m)) added to Matrix2? I am trying to understand the matrix operation done in Example 1 specifically so I can transfer it to C/C++. Thanks matrix MATLAB Answers — New Questions
Always On Availability Group — Disk resize on the secondary replica
Hi,
We currently have a 2-node Always On Availability Group with a cloud witness. Recently, we found the need to resize the data file disk. These 2 nodes are SQL Server on Azure VMs.
Find below our current situation
1st Resize on the secondary node (read-only)
Resizing first on the secondary node, we noticed that the disk type changed from Basic to Dynamic. Having seen some documentation on the internet (Aligning dynamic disks – BlackCat Reasearch Facility, lokna.no), we thought we should be cautious about having the disk in Dynamic mode.
As a result of the above, we did not proceed with resizing production until we are sure.
Question
Is it safe to proceed with the disk in Dynamic mode?
If we proceed while the primary is still in Basic mode, is it OK to have the primary node’s disk type as Basic and the secondary’s as Dynamic?
Thanks
Read More
How can I read and write an ntfs drive on Mac?
I recently encountered a problem when using a Mac computer, and I hope to get some help here. My problem is that I can’t read and write NTFS drives on Mac. I know that the default setting of the Mac system is that it can only read the contents of NTFS drives, but not write. This is very inconvenient for me because I often need to transfer files between Windows and Mac.
I have tried some methods and tools, but I still haven’t found a stable and safe solution. Does anyone know how to effectively read and write NTFS-formatted hard drives on Mac? If you have any good suggestions or experiences, it would be great if you could share them with me, thank you very much!
Read More
Sharepoint site page header web part has lost editing capability, now cannot edit background image.
While building a SharePoint page based on the newsletter template, we replaced the default header background image several days ago.
Now, when logged in, we cannot access the image edit overlay menu at all. It disappeared yesterday after reopening the page.
Screencapture inserted below for visual reference.
As can be seen on the standard page when it has been created the header element accesses a menu overlay with which to update or replace the background image asset. Screencapture inserted below for visual reference of this.
The question here is (a) what has happened to the page leading to the loss of editing functionality and (b) what can be done to restore this editing functionality. There is no way to copy web parts between two pages so we cannot simply create a new page and copy all other 44 web parts over to this new page.
Read More
AuthenticationManager Browser not supported error
I wrote myself a little .NET C# program to run CSOM commands on some SPO lists about a year ago. That worked fine until yesterday. The program opens the two-factor authentication process as always and goes through it as always, but after the last step, when it usually shows “Signing into Sharepoint”, that page appears for a millisecond and then shows:
Browser not supported
Use Microsoft Edge or Google Chrome or any other browsers mentioned in this documentation:
Now, did my SPO admins tinker with it, or was it Microsoft that changed something? How can I fix that? Can I force the AuthenticationManager to use another browser runtime?
Any help is greatly appreciated.
Read More
Rufus alternative for mac – Is there a program like Rufus for Mac?
I recently encountered a problem on Mac and hope to find help here. I used to use Rufus software for Windows to create bootable USB drives. Now that I have switched to Mac, I found that Rufus does not support Mac. Is there any Rufus alternative for Mac? I need a tool that can create a bootable disk, especially for installing Linux or Windows. If you know any good software recommendations or other solutions, please share with me, thank you very much!
Read More
Comparing columns with long decimal numbers
Hello. I have a bunch of long columns I need to compare to check that they look similar. I’m trying to use
Find & Select → Go to Special and pick Column differences, instead of looking at one row at a time. There are some differences in the last decimal digits (shown below), but the difference is so small that it doesn’t matter.
I have tried to remove the decimals by using “decrease decimal places” for all the columns, but it’s not working, as the whole decimal number is still shown in the formula bar. The ROUND function doesn’t work either.
How can I compare the columns in a fast and simple way?
Read More
Enhancing Student Resumes: An Innovative Approach Using Azure OpenAI ChatGPT-4o
Problem
LinkedIn offers a valuable feature that allows users to create and download resumes directly from their profiles, effectively eliminating the challenges associated with resume formatting. However, students, being inexperienced, often struggle to craft high-quality resumes. As one of the career mentors, I find myself reviewing over 200 student resumes in an iterative process. Unfortunately, due to the sheer volume, my colleagues often overlook the quality of these resumes, allowing students to indiscriminately send out subpar or error-ridden resumes to potential employers.
This practice has resulted in a decreased employment rate and has negatively impacted the reputation of our course.
Furthermore, career mentors need to review each resume, analyze the student’s profile, provide feedback to the student, and refer them to different types of job roles.
Solution
We request students to upload their LinkedIn Resume PDFs to our Learning Management System (LMS) – Moodle as a part of their assignment. We frequently review these resumes using Azure OpenAI ChatGPT-4o.
In this post, I won’t delve into the specifics of data preprocessing, but here are the key steps involved:
Unzip the submitted resumes.
Rename the folder to the respective student’s name, ensuring there are no duplicates.
Transform each page of the LinkedIn PDF resume into a PNG format.
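As a rough sketch of step 3, assuming pdf2image (with its poppler dependency), the output layout, and the 150 dpi setting, since the post does not specify the actual tooling:

from pathlib import Path
from pdf2image import convert_from_path  # requires the poppler utilities to be installed

def resume_pdf_to_pngs(pdf_path, out_dir):
    # Render every page of a LinkedIn resume PDF to PNG files in the student's folder
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=150)
    paths = []
    for i, page in enumerate(pages, start=1):
        out_path = f"{out_dir}/page_{i}.png"
        page.save(out_path, "PNG")
        paths.append(out_path)
    return paths

# e.g., resume_pdf_to_pngs("submissions/alice_wong.pdf", "data/alice_wong")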
AI Resume Reviewer
Below is the AI career mentor’s system prompt, which defines the behavior of the AI Resume Reviewer.
As a dedicated career guide, your responsibility is to meticulously examine student resumes and provide feedback in Markdown format. Here are the detailed instructions:
Identify and enumerate contact details, list actual value of the email address, mobile number, and LinkedIn Profile URL, in the initial section.
List out all URLs present in the resume.
List out all technologies mentioned.
List out all skills highlighted.
List out all certifications acquired.
List out all educational qualifications along with the duration.
List out all professional experiences along with the duration.
The resume **should** contain an email and phone number for communication. Issue an alert if these details are missing.
The profile section **should** contain the student’s name, course name, institution, and GitHub URL. Issue an alert if any of these elements are missing.
Students are anticipated to be enrolled in the **Higher Diploma in Cloud and Data Centre Administration** course in Hong Kong. Issue an alert if this information is missing or incorrect.
Be vigilant for any illogical content (excluding irrelevant/non-IT work experience) or spelling mistakes. Issue an alert and underline the errors if any are detected.
The summary section should be devoid of any pronouns.
Ensure the consistency of tenses throughout the resume.
Propose a suitable job title for the student based on the resume content.
Assign a “Resume Rating” on a scale of 1 to 10, where 10 signifies an outstanding resume.
If there are any alerts or missing information, the “Resume Rating” **should not** exceed 5.
If the phone number or email address is missing, the “Resume Rating” **should** be 0.
Assume the role of an IT interviewer and justify the “Resume Rating”, correlating it with the likelihood of securing a job.
Suggest the kind of job the student is likely to land, such as a Cloud Engineer, Data Centre Technician, or Network Engineer, based on the resume content.
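The reviewer code below references a system_prompt variable that is never defined in the snippets. A minimal sketch, assuming the instructions above are stored in a file named system_prompt.md (the file name is an assumption):

from pathlib import Path

# Load the career-mentor instructions above into the variable the
# reviewer code expects; the file name is hypothetical.
system_prompt = Path("system_prompt.md").read_text(encoding="utf-8")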
Group resume images by student
import os
from collections import defaultdict

# Define the path to the "data" folder
data_folder = "data"
cv_images = []

# Traverse every subfolder inside the "data" folder and collect PNG pages
for root, dirs, files in os.walk(data_folder):
    for file in files:
        if file.endswith(".png"):
            cv_images.append(os.path.join(root, file))

# Group cv_images by their parent folder (one folder per student)
cv_images_by_folder = defaultdict(list)
for image_path in cv_images:
    folder = os.path.dirname(image_path)
    cv_images_by_folder[folder].append(image_path)
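An optional sanity check before calling the model (not in the original post): print the number of pages collected for each student, which catches empty folders or failed conversions early.

# Print the page count per student folder
for folder, imgs in sorted(cv_images_by_folder.items()):
    print(f"{os.path.basename(folder)}: {len(imgs)} page(s)")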
Prepare the chat prompts
import base64

# Encode an image file as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Create the chat messages for the AI model: the system prompt, the task
# text, and one image_url entry per resume page
def create_messages(base64_images):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "text", "text": "Describe the images as alternative text, and provide feedback, warnings if any, and a rating on the resume."},
            *[
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}}
                for img in base64_images
            ]
        ]}
    ]
AI review: generate and save the result for each student
from tqdm import tqdm
import os
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    openai_api_version=os.getenv("AZURE_OPENAI_GPT4O_API_VERSION"),
    azure_deployment=os.getenv("AZURE_OPENAI_GPT4O_DEPLOYMENT_NAME"),
    temperature=0,
)

# Sort the folders so students are processed in a stable order
sorted_cv_images_by_folder = dict(sorted(cv_images_by_folder.items(), key=lambda x: x[0]))

for folder, images in tqdm(sorted_cv_images_by_folder.items(), desc="Processing folders"):
    save_path = os.path.join(folder, "chatgpt_result.md")
    # Skip students whose review has already been generated
    if os.path.exists(save_path):
        continue
    encoded_images = [encode_image(image) for image in images]
    messages = create_messages(encoded_images)
    ai_message = llm.invoke(messages)
    # Save the review to a Markdown file in the student's folder
    with open(save_path, "w") as file:
        file.write(ai_message.content)
Masked sample result 1
### Alternative Text Description
The image is a resume for XXXXXXXX. The resume is divided into three main sections: Contact, Experience, and Education.
**Contact Section:**
– Address: <deleted>
– Mobile: <deleted>
– Email: <deleted>
– LinkedIn: www.linkedin.com/in/<deleted>
**Experience Section:**
– DFI Retail Group
– Position: Casual Sales Assistant
– Duration: August 2023 – Present (11 months)
– Location: Hong Kong, Hong Kong SAR
**Education Section:**
– Hong Kong Institute of Information Technology (HKIIT) at IVE (Lee Wai Lee)
– Course: Higher Diploma in Cloud and Data Centre Administration, Cloud Computing
– Duration: 2023 – 2025
### Feedback
#### Contact Details
– **Email:** Present
– **Phone Number:** Present
– **LinkedIn Profile URL:** Present
#### URLs
– www.linkedin.com/in/<deleted>
#### Technologies Mentioned
– None
#### Skills Highlighted
– None
#### Certifications Acquired
– None
#### Educational Qualifications
– Higher Diploma in Cloud and Data Centre Administration, Cloud Computing (2023 – 2025)
#### Professional Experiences
– Casual Sales Assistant at DFI Retail Group (August 2023 – Present, 11 months)
### Alerts and Warnings
1. **Missing Technologies and Skills:** The resume does not mention any specific technologies or skills.
2. **Missing Certifications:** No certifications are listed.
3. **Profile Section:** The profile section is missing the GitHub URL.
4. **Course Information:** The course name and institution are correctly mentioned.
5. **Spelling and Grammar:** No spelling mistakes detected.
6. **Summary Section:** The summary section is devoid of pronouns.
7. **Tense Consistency:** The tenses are consistent throughout the resume.
### Suggested Job Title
– Entry-Level Cloud and Data Centre Technician
### Resume Rating
**Rating: 4/10**
### Justification
The resume contains the essential contact details and educational qualifications, which are crucial for any job application. However, it lacks specific technologies, skills, and certifications that are vital for a career in cloud and data centre administration. The absence of a GitHub URL in the profile section is also a significant omission. These missing elements reduce the likelihood of securing a job in the desired field.
### Suggested Job
Based on the current content of the resume, the student is likely to land an entry-level position such as a Cloud and Data Centre Technician. To improve the chances of securing a job, it is recommended to include relevant technologies, skills, and certifications.
Masked sample result 2
### Alternative Text Description
The image is a resume for Kelvin Yiu, an XYZ Cloud Club Captain from New Territories, Hong Kong SAR. The resume is divided into several sections: Contact, Top Skills, Languages, Certifications, Summary, Experience, and Education.
#### Contact
– **Mobile:** <deleted>
– **Email:** <deleted>
– **LinkedIn:** www.linkedin.com/in/<deleted>
– **GitHub:** github.com/<deleted>
#### Top Skills
– ______________ Services (XYZ)
– Terraform
– Kubernetes
#### Languages
– Cantonese (Native or Bilingual)
– Mandarin (Professional Working)
– English (Professional Working)
#### Certifications
– XYZ Certified Solutions Architect – Associate
– XYZ Academy Graduate – XYZ Academy Cloud Foundations
#### Summary
A tech enthusiast with at least 3 years of hands-on experience in developing with Python and Golang, working on several cloud projects. Has a cybersecurity background and led a team to participate in numerous public cybersecurity competitions in Hong Kong during high school studies.
#### Experience
**Amazon Web Services (XYZ)**
– **Role:** Cloud Captain
– **Duration:** March 2024 – Present (3 months)
– **Location:** Hong Kong SAR
– **Responsibilities:**
– Started the first XYZ Cloud Club in Hong Kong.
– Planned events to teach about clouds and prepare people for jobs in cloud technology.
– Helped students join XYZ Cloud Clubs to build a cloud community.
– Led the growth of the Hong Kong Regional Cloud Club.
#### Education
**Hong Kong Institute of Information Technology (HKIIT) at IVE (Lee Wai Lee)**
– **Course:** Higher Diploma in Cloud and Data Centre Administration, Cloud Computing
– **Duration:** September 2023 – September 2025
### Feedback and Warnings
1. **Contact Details:**
– **Email:** <deleted>
– **Mobile Number:** <deleted>
– **LinkedIn Profile URL:** www.linkedin.com/in/<deleted>
2. **URLs Present:**
– www.linkedin.com/in/<deleted>
– github.com/<deleted>
3. **Technologies Mentioned:**
– Amazon Web Services (XYZ)
– Terraform
– Kubernetes
– Python
– Golang
4. **Skills Highlighted:**
– Amazon Web Services (XYZ)
– Terraform
– Kubernetes
5. **Certifications Acquired:**
– XYZ Certified Solutions Architect – Associate
– XYZ Academy Graduate – XYZ Academy Cloud Foundations
6. **Educational Qualifications:**
– Higher Diploma in Cloud and Data Centre Administration, Cloud Computing (September 2023 – September 2025)
7. **Professional Experiences:**
– XXXXXXXX Services (XYZ), Cloud Captain (March 2024 – Present, 3 months)
### Alerts
1. **Profile Section:**
– Missing GitHub URL in the profile section.
2. **Summary Section:**
– Contains the pronoun “I” which should be avoided.
– Spelling mistake: “I have” should be “I have”.
3. **Course Information:**
– Correct course information is present.
### Resume Rating
**Rating: 4/10**
### Justification
The resume contains essential contact details, educational qualifications, and professional experiences. However, it has several issues:
– The summary section contains a pronoun and a spelling mistake.
– The GitHub URL is missing from the profile section.
– The professional experience is relatively short (3 months).
These issues reduce the overall quality and effectiveness of the resume, making it less likely to secure a job.
### Suggested Job Title
– Cloud Engineer
– Data Centre Technician
### Likely Job
Based on the resume content, the student is likely to land a job as a Cloud Engineer or Data Centre Technician.
AI Resume Extractor
It retrieves all the review outcomes and exports them to a Microsoft Excel file, using function calling to guarantee that the data is returned in the correct format and mapped to a structured record.
Get all AI reviews
import os

# Define the path to the "data" folder
data_folder = "data"
chatgpt_results = []

# Traverse every subfolder and collect each student's saved review
for root, dirs, files in os.walk(data_folder):
    for file in files:
        if file == "chatgpt_result.md":
            chatgpt_results.append(os.path.join(root, file))

chatgpt_results.sort()
Set up function calling with LangChain and Pydantic.
from langchain_core.utils.function_calling import convert_to_openai_function
from typing import List, Optional
from langchain.pydantic_v1 import BaseModel, Field

class StudentCvRecord(BaseModel):
    """Call this to save a student CV record in markdown format."""
    name: str = Field(description="Name of the student")
    email: Optional[str] = Field(description="Email address")
    mobile_number: Optional[str] = Field(description="Contact number")
    linkedin_profile_url: str = Field(description="LinkedIn profile URL")
    resume_rating: int = Field(description="Rating of the resume between 1 and 10")
    rationale: str = Field(description="Rationale for the rating")
    warning: str = Field(description="Any warning message")
    feedback: str = Field(description="Feedback message")
    proposed_job_titles: List[str] = Field(description="Proposed job titles")
    certifications: List[str] = Field(description="List of certifications")
    technologies: List[str] = Field(description="List of technologies")
    skills: List[str] = Field(description="List of skills")
    work_experience: List[str] = Field(description="List of work experiences")

student_cv_record_function = convert_to_openai_function(StudentCvRecord)
Extract the result for each student, falling back to GPT-4o when GPT-3.5 Turbo cannot produce valid JSON.
import json
from tqdm import tqdm

student_records = []

for result_path in tqdm(chatgpt_results):
    result_path_json = result_path.replace(".md", ".json")
    # Reuse a previously extracted record if it exists
    if os.path.exists(result_path_json):
        with open(result_path_json, "r") as f:
            result_json = f.read()
        result = StudentCvRecord.parse_raw(result_json)
        student_records.append(result)
        continue
    with open(result_path, "r") as f:
        cv = f.read()
    # The parent folder name is the student's name
    name = result_path.split("/")[-2]
    try:
        result = chain35.invoke({"cv": cv})
    except Exception:
        # Fall back to GPT-4o when GPT-3.5 Turbo fails
        result = chain4o.invoke({"cv": cv})
    result.name = name
    result_json = json.dumps(result.dict())
    with open(result_path_json, "w") as f:
        f.write(result_json)
    student_records.append(result)
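The notebook references chain35 and chain4o without showing how they are built. A minimal sketch of how such chains could be constructed with the function definition above, assuming LangChain's function-call binding; the GPT-3.5 environment variable names and the helper build_chain are assumptions, not from the post.

import os
from langchain_core.prompts import ChatPromptTemplate
from langchain.output_parsers.openai_functions import PydanticOutputFunctionsParser
from langchain_openai import AzureChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract a StudentCvRecord from the resume review below."),
    ("human", "{cv}"),
])
# Parse the forced function call back into a StudentCvRecord instance
parser = PydanticOutputFunctionsParser(pydantic_schema=StudentCvRecord)

def build_chain(deployment_env: str, version_env: str):
    llm = AzureChatOpenAI(
        openai_api_version=os.getenv(version_env),
        azure_deployment=os.getenv(deployment_env),
        temperature=0,
    )
    # Force the model to call StudentCvRecord so the reply is structured JSON
    return prompt | llm.bind(
        functions=[student_cv_record_function],
        function_call={"name": "StudentCvRecord"},
    ) | parser

# The GPT-3.5 env var names below are hypothetical
chain35 = build_chain("AZURE_OPENAI_GPT35_DEPLOYMENT_NAME", "AZURE_OPENAI_GPT35_API_VERSION")
chain4o = build_chain("AZURE_OPENAI_GPT4O_DEPLOYMENT_NAME", "AZURE_OPENAI_GPT4O_API_VERSION")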
Microsoft Excel Report
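The export code itself isn't shown in the post; a minimal sketch using pandas (an assumption, as is the output file name) could flatten the extracted Pydantic records into a spreadsheet:

import pandas as pd

# Flatten each Pydantic record into a row; list fields are joined
# so each cell holds a readable string
rows = [record.dict() for record in student_records]
df = pd.DataFrame(rows)
list_columns = ["proposed_job_titles", "certifications",
                "technologies", "skills", "work_experience"]
for col in list_columns:
    df[col] = df[col].apply("; ".join)
df.to_excel("resume_review_report.xlsx", index=False)  # requires openpyxl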
Now we can mail merge the results to students and let them fix their resumes.
How to use it?
Fork https://github.com/wongcyrus/linkedin-resume-reviewer
Create a GitHub Codespace.
Fill in .env_template and rename it to .env.
Create a data folder and upload the zipped PDF resumes into it.
Modify zip_file_path and run data-preprocessing.ipynb.
Run ai-resume-reviewer.ipynb to review the resume images with Azure OpenAI ChatGPT-4o.
Run ai-resume-extractor.ipynb to extract the review results with Azure OpenAI GPT-3.5 Turbo and GPT-4o.
Conclusion
The integration of Azure OpenAI ChatGPT-4o into our resume review process has significantly improved the quality of student resumes. By automating the initial review and feedback process, we ensure that each resume is meticulously examined for errors, missing information, and overall quality. This approach not only saves time for career mentors but also enhances the employability of our students by providing them with high-quality resumes. As a result, we have observed an increase in employment rates and a positive impact on the reputation of our course. This innovative solution demonstrates the potential of AI in transforming educational and career support services.
Enhancing a LinkedIn-generated resume PDF encourages students to maintain an impressive LinkedIn online presence. It’s crucial to uphold a well-crafted LinkedIn profile throughout one’s career.
Project collaborators include Kelvin Yiu, Karl Chan, and Mandy Lau from the IT114115 Higher Diploma in Cloud and Data Centre Administration, who are Microsoft Learn Student Ambassador candidates.
About the Author
Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology (HKIIT) at IVE (Lee Wai Lee), where he focuses on teaching public cloud technologies. He is a Microsoft Learn for Educators Ambassador and a Microsoft Azure AI MVP from Hong Kong.
Microsoft Tech Community – Latest Blogs – Read More
How to find connected components in an image
I'm doing a project to recognize Kannada text. The first step is to find the connected components in a binary image. I tried doing this using bwconncomp, but I'm not able to display the image. Can you please help me with this? ocr, kannada, image segmentation, connected components MATLAB Answers — New Questions
I am running external mode on arduino mega and the analog inputs are always high. What is wrong in my settings?
I am trying to run external mode and read an analog input on an Arduino Mega. I always see the input set to high (5) no matter what my actual voltage is. I convert the analog input to read between 0-5 V (5/1023), but there is no other response; it's always stuck at high. All analog outputs are working properly, but I cannot fix my input readings. Is there a setting that I have to change? What am I doing wrong? TIA arduino mega, analog inputs, external mode MATLAB Answers — New Questions
LTE Turbo encoder with R = 1/2 – BER performance with LTE Toolbox
Hi there,
I am trying to measure the BER performance of LTE Turbo encoding with rate matching R = 1/2, but when I add symbol modulation and demodulation I can't get BER = 0 (even without any AWGN). This indicates that the problem is in the modulation/demodulation part. It would also be great to use BPSK instead of QPSK, but lteSymbolModulate doesn't offer that modulation scheme. Thank you in advance.
All best,
Mirza
Code: (credits to answer how-to-use-rate-matching-to-alter-turbo-code-rate)
clear;
K = 128;
E = 2*K;
mbits = randi([0,1],K, 1);
crc = lteCRCEncode(mbits,'24A');
cbs = lteCodeBlockSegment(crc); % Segment the CRC-appended bits into code blocks for the Turbo encoder
cd = lteTurboEncode(cbs);
cdrm = lteRateMatchTurbo(cd, E, 0);
%cdrm(cdrm == 0) = -1; % Make them as LLRs
txSymbols = lteSymbolModulate(cdrm,'QPSK');
awgnchan = comm.AWGNChannel('NoiseMethod','Variance','Variance',3);
rxSymbols = txSymbols; % awgnchan(txSymbols);
softBits = lteSymbolDemodulate(rxSymbols,'QPSK','Soft');
cdrx = lteRateRecoverTurbo(softBits, K, 0);
%
mhat = lteTurboDecode(cdrx); % After Turbo decoding, the output still contains the filler bits and CRC
cbshat = lteCodeBlockDesegment(mhat, K+24);
crchat = lteCRCDecode(cbshat,'24A'); % This returns the same length as the message
% Check the decoded bits against the message bits
[~,BER] = biterr(double(crchat),mbits);
BER
lte toolbox, lte rate match turbo MATLAB Answers — New Questions