Month: September 2024
Formula returning dash when I add a new cell
Extremely frustrating. I use this sheet to track my side-job pay, and it glitches every time I try to edit it and returns 0. I am trying to add August to the gross pay total.
Tasks
When I open Tasks I get “The task owner has restricted this action,” and “This list cannot be modified as it no longer exists.” I am horrified as I use it every day. I can’t modify the task in any way. How can I fix this?
A generalisation of the MAP lambda helper function
Discussion topic. Your thoughts are welcome.
On Saturday I finally bit the bullet and completed a MAPλ Lambda function that generalises the in-built MAP Lambda helper function. As examples, I tried problems of generating the Kronecker product of two matrices and then one of generating variants of an amortisation table.
The original amortisation schedule uses SCAN to calculate closing balances step by step from opening balances. Having returned the closing balances as an array, the principal is inserted at the first element to give opening balances. An array calculation based on the same code is used to return other values of interest using HSTACK.
Following that, I created the array of loan terms {10, 15, 20} (yrs) and used the formula
= MAPλ(variousTerms, AmortisationTableλ(principal, rate, startYear))
to generate the results as a single spilled range.
I have posted a copy of MAPλ on GitHub
A version of Excel MAP helper function that will return an array of arrays (github.com)
The intention is that the function can be used without knowing how it works but you are, of course, welcome to try to pick through it.
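As a minimal sketch of the recurrence described above (not the author's Excel code; the function name and the level-payment formula here are illustrative), the same closing-balance scan can be written in Python:
from itertools import accumulate

def amortisation_balances(principal, rate, term_years):
    # Illustrative level annual payment for the loan
    payment = principal * rate / (1 - (1 + rate) ** -term_years)
    def step(opening, _):
        return opening * (1 + rate) - payment          # closing balance for one year
    # Scan closing balances step by step from opening balances
    closing = list(accumulate(range(term_years), step, initial=principal))[1:]
    # Insert the principal at the first element to give opening balances
    opening = [principal] + closing[:-1]
    return opening, closing, payment

opening, closing, payment = amortisation_balances(100_000, 0.05, 10)
print(round(payment, 2), [round(b, 2) for b in closing])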
Update Error for Windows 11 Insider Preview (10.0.26120.1542)
Hi!
When the Windows 11 Insider Preview (10.0.26120.1542) update started, it reached 1% and suddenly stopped.
I tried running the Windows Update troubleshooter in Settings, but it showed error 0x803C010A and did not proceed either.
Has anyone solved this problem?
Thanks
How to sync Outlook Notes with Gmail account
I have Outlook 2021 desktop installed on my PC. I would like to sync the Outlook Notes with my Google Workspace account. Is this possible?
Default SQL Server Connection for SSMS
SQL 2019 – SSMS 19.3.4.0
I was always wrongly under the impression that SSMS required a server connection in Object Explorer to run a script against. We have databases with the same names on two servers as we prepare for migration, and I accidentally ran a script on server B even though there appeared to be no connection open to it; only server A was connected in Object Explorer. I was then shocked to find that any new SQL script I opened was connected to server B, which had been closed out in Object Explorer.
What controls the default server for a script opened via File / Open in SSMS? What is the best way to lock a script to a specific server, or to make it more obvious which server it is being applied to? I may need to get used to looking in the bottom right where the SQL Server instance is displayed, but I’d like to make it more foolproof.
I see that activating SQLCMD Mode on the Query menu is one option, but I wonder what the downside to this might be, such that it is not the default behaviour.
AI Studio End-to-End Baseline Reference Implementation
Azure AI Studio is designed to cater to the growing needs of developers seeking to integrate advanced AI capabilities into their applications with a focus on operational excellence. Addressing key factors such as security, scalability, and regulatory adherence, Azure AI Studio ensures that AI deployments are seamless, sustainable, and strategically aligned with business objectives.
We’re excited to present the end-to-end baseline reference implementation for Azure AI Studio, a definitive guide designed to facilitate the deployment of AI workloads in the cloud. This architecture has been designed to assist organizations in finding structured solutions for deploying AI applications that are production ready in an enterprise environment at scale.
Features of the Baseline Architecture
This architecture incorporates several important features:
Secure Network Perimeter: Creates a secure boundary for AI applications with strict network security and segmentation capabilities.
Identity Management: Implements strong access management to regulate interactions and maintain secure operations within AI services and data.
Scalability: Provides a flexible infrastructure to support the growth of AI applications, ensuring performance is not sacrificed as demand increases.
Compliance and Governance: Maintains a commitment to following enterprise governance policies and meeting compliance standards throughout the life of an AI application.
Supported Scenarios of the Baseline Architecture
The reference architecture supports various important use cases, including:
AI Studio Project Playground: An integrated environment for engaging with Azure OpenAI technologies, where you can chat with your data, test out various AI-powered assistants, and utilize completion features for text. This tool serves as a one-stop shop to assess, refine, and validate your AI-driven projects.
Promptflow Workflows: This feature supports the development of complex AI workflows, integrating elements like custom Python scripts and large language model integrations, enhancing operational excellence.
Resilient, Managed Deployments: Manages the deployment of AI applications to Azure’s managed virtual networks, ensuring solid and dependable access via a client UI hosted in Azure App Service.
Self-Hosting with Azure App Service: This alternative gives enterprises full control to customize and manage the Promptflow deployment using Azure App Service, leveraging advanced options such as availability zones.
You can find the reference implementation at the following link: aistudio-end-to-end-baseline-architecture
AI Season for Developers!
If you are passionate about Artificial Intelligence and application development, don’t miss the opportunity to watch this great Microsoft Reactor series. Over the season, we go from the fundamentals of Azure OpenAI to the latest innovations presented at Microsoft Build 2024, finishing with the powerful Semantic Kernel framework for building intelligent applications. Every session is packed with demos so you can understand each concept and apply it effectively.
Episodes:
Episode 1: Introduction to Azure OpenAI
We explored the Azure OpenAI models, their capabilities, and how to integrate them with the Azure SDK.
Episode 2: Considerations for Deploying Models in Azure OpenAI
We learned how to manage the service quota, balance performance and latency, plan cost management, and apply the RAG pattern to optimize your deployments.
Episode 3: What’s New from Microsoft Build: PHI3, GPT-4o, Azure Content Safety, and More
We covered the latest announcements from Microsoft Build, including PHI 3, GPT-4o with multimodal capabilities, the new Azure AI Studio, and Azure Content Safety.
Episode 4: Getting Started with Semantic Kernel
We introduced Semantic Kernel, an open-source SDK that lets you easily integrate advanced LLMs into your applications to create smarter, more natural experiences.
Episode 5: Build Your Own Copilot with Semantic Kernel
We learned how to use Semantic Kernel Plugins, Planners, and Memories to create copilots that work side by side with users, giving them intelligent suggestions to complete tasks.
Don’t miss it! Replay each episode to discover how you can take your applications to the next level with Microsoft AI.
Learn more and build your AI skills during this series with this collection of Microsoft Learn resources:
Speakers:
Luis Beltran – Microsoft MVP – LinkedIn
Pablo Piovano – Microsoft MVP – LinkedIn
Make High Quality Dataset from WARC for Pre-training
You’re welcome to follow my GitHub repo and give it a star: https://github.com/xinyuwei-david/david-share.git
In the following subsections, we will explain each step involved in generating a high-quality dataset for pre-training.
How to evaluate the quality of training data?
There are four methods to evaluate the quality of training data, including but not limited to the following.
Using a “clean” corpus and perplexity check
Method: Train a model using a high-quality corpus (e.g., Wikipedia) and then use this model to check the perplexity of the new dataset.
Advantages:
Quick: Can quickly assess the quality of the dataset.
Simple: Relatively simple to implement, does not require complex computational resources.
Disadvantages:
Limitations: Low perplexity does not necessarily mean better performance on specific tasks.
Single Metric: Perplexity is just a single metric and cannot fully reflect the quality of the dataset.
Training small models and testing on evaluation tasks
Method: Extract a portion of data from the dataset, train a small model, and test the model’s performance on a set of specific evaluation tasks (e.g., SQuAD, GLUE, etc.).
Advantages:
Specific: Provides specific performance feedback by testing the model on actual tasks.
Diversity: Allows for the selection of various evaluation tasks to comprehensively assess the dataset quality.
Disadvantages:
Resource Demand: Requires a certain amount of computational resources and time.
Task Selection: Needs to select diverse and representative evaluation tasks, which may increase complexity.
Early signal method
Method: Train a small model and conduct preliminary evaluations on some simple and quick benchmark tasks (e.g., text classification, sentiment analysis, etc.).
Advantages:
Rapid Iteration: Quickly obtain initial feedback, facilitating rapid iteration and optimization.
Suitable for Early Stages: Helps quickly screen datasets in the early stages of development.
Disadvantages:
Simple Tasks: These tasks may be relatively simple and may not fully represent the model’s performance on complex tasks.
Preliminary Evaluation: Only provides initial performance feedback, which may require further detailed evaluation.
Using GPT-4 for evaluation
Method: Use the GPT-4 model to evaluate the new dataset, potentially including various tasks (e.g., text generation, question answering, sentiment analysis, etc.).
Advantages:
High-Quality Evaluation: As a powerful language model, GPT-4 can provide high-quality evaluation results, especially on complex tasks.
Multi-Task Capability: Can evaluate on various tasks, providing comprehensive performance feedback.
Real-World Usage: Evaluation results are closer to actual usage, especially if your final application is also based on similar advanced models.
Disadvantages:
Computational Resources: Training and evaluating GPT-4 requires a large amount of computational resources and time, which may increase costs.
Complexity: The complexity of GPT-4 means more potential issues during debugging and optimization.
Overfitting Risk: If not careful, there is a risk of over-optimizing specific tasks, leading to poorer performance on other tasks.
Summary
Using a “clean” corpus and perplexity check: Suitable for quick, preliminary quality assessment but limited to a single metric (a minimal sketch of this check follows the summary).
Training small models and testing on evaluation tasks: Suitable for scenarios requiring specific task performance feedback but requires more resources and task selection.
Early signal method: Suitable for the early stages of development to quickly screen datasets but involves simpler tasks.
Using GPT-4 for evaluation: Suitable for scenarios requiring high-quality and comprehensive evaluation, providing feedback closest to actual usage but with high resource demands.
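As a minimal sketch of the first method, the snippet below scores candidate documents by perplexity under a small causal language model; gpt2 is used here only as a stand-in for a model trained on your own “clean” corpus, and the cutoff you apply is up to you:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a model trained on your clean corpus
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    # Lower perplexity means the text looks more like the model's training corpus
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

for doc in ["The cat sat on the mat.", "asdf qwer zxcv uiop lkjh"]:
    print(round(perplexity(doc), 1), doc[:40])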
Prepare environment
In the following content, I will show how to create a high-quality dataset from a WARC file.
Create conda env
#conda create --name=dataclean python=3.10
#conda activate dataclean
(dataclean) root@david1a100:~# cd dataclean/
(dataclean) root@david1a100:~/dataclean# hostname
david1a100.australiaeast.cloudapp.azure.com
#pip install datatrove xxhash faust-cchardet python-magic warcio fasteners tldextract trafilatura fasttext-wheel nltk awscli fasttext numpy==1.21.0
#pip install datatrove[all]
#pip install datatrove trafilatura awscli
#aws configure
Download WARC
Access the following link to check the WARC file addresses:
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/index.html
Download the file named warc.paths.gz.
The file paths in warc.paths.gz look like the following. There are many warc.gz files; I only take CC-MAIN-20230527223515-20230528013515-00000.warc.gz as an example:
crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz
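If you want to enumerate every segment path rather than a single file, a small helper along the following lines (an illustrative sketch; the host prefix is taken from the download link used below) can expand each line of warc.paths.gz into a full download URL:
import gzip

# Expand each relative path in warc.paths.gz into a full Common Crawl download URL
BASE = "https://data.commoncrawl.org/"

with gzip.open("warc.paths.gz", "rt") as f:
    urls = [BASE + line.strip() for line in f if line.strip()]

print(len(urls))
print(urls[0])  # e.g. the CC-MAIN-...-00000.warc.gz file used in this walkthrough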
Download the file with the following script:
(dataclean) root@david1a100:~/dataclean# cat download_warc_file.py
import os
import subprocess

def download_warc_file(url, output_dir):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    print(f"Downloading {url}...")
    command = f"wget -P {output_dir} {url}"
    subprocess.run(command, shell=True, check=True)

if __name__ == '__main__':
    # URL of the WARC file
    warc_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz"
    # Output directory
    output_dir = "/root/dataclean/data/CC-MAIN-2023-23/segments"
    download_warc_file(warc_url, output_dir)
Basic data processing
After downloading 00000.warc.gz, I use the local executor LocalPipelineExecutor to run the data processing pipeline, which includes the following steps:
reading WARC files
filtering URLs
extracting content using Trafilatura
filtering non-English content
filtering duplicate content
filtering low-quality content
writing the processed data to JSONL files.
(dataclean) root@david1a100:~/dataclean# cat process_common_crawl_dump.py
import nltk
import sys
import os
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import (
    GopherQualityFilter,
    GopherRepetitionFilter,
    LanguageFilter,
    URLFilter,
)
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

def download_punkt():
    nltk.download('punkt')
    nltk.download('punkt_tab')

def set_nltk_data_path():
    nltk.data.path.append('/root/nltk_data')

set_nltk_data_path()
download_punkt()

def main():
    # DUMP should be given as an argument. Example: CC-MAIN-2023-23
    if len(sys.argv) != 2:
        print("Argument required: dump name")
        sys.exit(-1)
    DUMP = sys.argv[1]
    MAIN_OUTPUT_PATH = "./output"  # local output path
    DATA_PATH = f"./data/{DUMP}/segments/"
    print(f"Checking files in {DATA_PATH}")
    for root, dirs, files in os.walk(DATA_PATH):
        print(f"Found directory: {root}")
        for file in files:
            print(f"Found file: {file}")
    if not any(os.scandir(DATA_PATH)):
        print(f"No files found in {DATA_PATH}")
        sys.exit(-1)

    def initializer():
        set_nltk_data_path()
        download_punkt()

    from multiprocessing import Pool
    with Pool(processes=8, initializer=initializer) as pool:
        executor = LocalPipelineExecutor(
            pipeline=[
                WarcReader(
                    DATA_PATH,
                    glob_pattern="*.warc.gz",
                    default_metadata={"dump": DUMP},
                ),
                URLFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/url/{DUMP}")),
                Trafilatura(favour_precision=True),
                LanguageFilter(
                    exclusion_writer=JsonlWriter(
                        f"{MAIN_OUTPUT_PATH}/non_english/",
                        output_filename="${language}/" + DUMP + "/${rank}.jsonl.gz",  # folder structure: language/dump/file
                    )
                ),
                GopherRepetitionFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/repetitive/{DUMP}")),
                GopherQualityFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/quality/{DUMP}")),
                JsonlWriter(f"{MAIN_OUTPUT_PATH}/output/{DUMP}"),
            ],
            tasks=8,  # number of local tasks, adjust to your VM configuration
            logging_dir=f"{MAIN_OUTPUT_PATH}/logs/base_processing/{DUMP}",
        )
        executor.run()

if __name__ == '__main__':
    main()
Run the script as follows:
#python3 process_common_crawl_dump.py CC-MAIN-2023-23
The script runs for about 26 minutes; the final output is as follows:
2024-08-14 05:11:53.451 | INFO | datatrove.utils.logging:add_task_logger:47 – Launching pipeline for rank=0
2024-08-14 05:11:53.452 | INFO | datatrove.utils.logging:log_pipeline:76 –
— 🛠️ PIPELINE 🛠
📖 – READER: 🕷 Warc
🔻 – FILTER: 😈 Url-filter
🛢 – EXTRAC: ⛏ Trafilatura
🔻 – FILTER: 🌍 Language ID
🔻 – FILTER: 👯 Gopher Repetition
🔻 – FILTER: 🥇 Gopher Quality
💽 – WRITER: 🐿 Jsonl
2024-08-14 05:11:53.452 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 – Reading input file CC-MAIN-20230527223515-20230528013515-00000.warc.gz
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data…
[nltk_data] Package punkt_tab is already up-to-date!
2024-08-14 05:11:55.704 | WARNING | datatrove.pipeline.extractors.base:run:60 – ❌ Error “” while cleaning record text. Skipping record.
…
2024-08-14 05:38:47.661 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 8/8 tasks completed.
2024-08-14 05:38:47.686 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 8 tasks 📉📉📉
Total Runtime: 26 minutes and 36 seconds
📖 – READER: 🕷 Warc
Runtime: (2.11%) 33 seconds [0.29 milliseconds±3.12 milliseconds/doc]
Stats: {input_files: 1, doc_len: 4795961005 [min=1, max=1048576, 140974.75±182620/doc], documents: 34019 [34019.00/input_file]}
🔻 – FILTER: 😈 Url-filter
Runtime: (0.35%) 5 seconds [0.16 milliseconds±11.08 milliseconds/doc]
Stats: {total: 34020, forwarded: 33834, doc_len: 4776069530 [min=1, max=1048576, 141161.84±182866/doc], dropped: 186, dropped_domain: 90, dropped_hard_blacklisted: 67, dropped_blacklisted_subword: 21, dropped_soft_blacklisted: 6, dropped_subdomain: 2}
🛢 – EXTRAC: ⛏ Trafilatura
Runtime: (75.94%) 20 minutes and 12 seconds [35.84 milliseconds±29.25 milliseconds/doc]
Stats: {total: 33834, forwarded: 27384, doc_len: 57232496 [min=1, max=551300, 2090.00±6280/doc], dropped: 4168}
🔻 – FILTER: 🌍 Language ID
Runtime: (0.91%) 14 seconds [0.53 milliseconds±2.54 milliseconds/doc]
Stats: {total: 27384, dropped: 16500, forwarded: 10884, doc_len: 24989254 [min=2, max=73080, 2295.96±4166/doc]}
🔻 – FILTER: 👯 Gopher Repetition
Runtime: (13.00%) 3 minutes and 27 seconds [19.07 milliseconds±33.46 milliseconds/doc]
Stats: {total: 10884, forwarded: 8161, doc_len: 21401662 [min=5, max=73080, 2622.43±4274/doc], dropped: 2723, dropped_top_4_gram: 345, dropped_dup_line_frac: 633, dropped_top_2_gram: 796, dropped_duplicated_5_n_grams: 281, dropped_top_3_gram: 399, dropped_duplicated_6_n_grams: 25, dropped_dup_line_char_frac: 173, dropped_duplicated_8_n_grams: 13, dropped_duplicated_10_n_grams: 16, dropped_duplicated_9_n_grams: 23, dropped_duplicated_7_n_grams: 19}
🔻 – FILTER: 🥇 Gopher Quality
Runtime: (7.55%) 2 minutes [14.76 milliseconds±8.44 milliseconds/doc]
Stats: {total: 8161, dropped: 2433, dropped_gopher_too_many_end_ellipsis: 232, dropped_gopher_below_alpha_threshold: 1201, forwarded: 5728, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], dropped_gopher_short_doc: 941, dropped_gopher_too_many_bullets: 49, dropped_gopher_enough_stop_words: 6, dropped_gopher_below_avg_threshold: 1, dropped_gopher_too_many_ellipsis: 1, dropped_gopher_too_many_hashes: 2}
💽 – WRITER: 🐿 Jsonl
Runtime: (0.14%) 2 seconds [0.40 milliseconds±0.60 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5728, total: 5728, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc]}
Check data processing results
root@david1a100:~/dataclean/output/output/CC-MAIN-2023-23# zcat ./00000.jsonl.gz | head -n 2 | jq .
Output:
{
  "text": "Buy Ambien Online Legally (Zolpidem) belongs to the imidazopyridines class of opioids. Ambien complements the exceptional of sleep via way of means of decreasing the time it takes to fall asleep, decreasing the frequency of nocturnal awakenings, and growing the general period of sleep. Lengthens the second one degree of sleep and the deep sleep degree (III and IV). It does now no longer make you sleepy throughout the day. If you're seeking to Buy Ambien Online at an inexpensive cost, come to our on line pharmacy.",
  "id": "<urn:uuid:dd20979b-ada8-4c5b-b53e-4ade7274bc1b>",
  "metadata": {
    "dump": "CC-MAIN-2023-23",
    "url": "http://42627.dynamicboard.de/u101117_ambienusa.html",
    "date": "2023-05-27T23:12:51Z",
    "file_path": "/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz",
    "language": "en",
    "language_score": 0.8990675806999207
  }
}
{
“text”: “My little guy turned two over the summer and we celebrated with an oh-so-cute Golf Birthday Party. He is all boy and loves anything that includes a stick and ball, which made choosing the golf theme fairly easy. We had fun golfing games, snacks & treats and each little caddie even received there very own golf bag. The post was getting fairly large I decided to split it in two parts. Part one covers the favor and dessert table and part two will focus on the food and games. Enjoy!nGolf Pro Shop for the favor tablenEach “Golf Pro” received his/her own set of golf clubs (thank you Target dollar section for saving the day!), a blue or green visor I purchased at Joann’s, practice golf balls and a water bottle to stay hydrated on the course.nI created the backdrop for the dessert table with a tan table cloth I had and pinned it to the window frame with thumb tacks (my husband wasn’t too happy about that one…opps!) I used 12” white tissue paper balls that I purchased from Devra Party and hung them by grosgrain ribbon.nI wanted to use items on the dessert table that went along with the theme so I racked my brain for some golf terms. The sign over the table was “Caddie’s Sweet Spot” (sweet spot refers to the center point of the face of the club).nThere was a “water hazard” ~ blue jell-o jigglers, “wormburners” (which is the term for a ball that skims the grass) ~ chocolate pudding pack topped with crumbled Oreos and gummy worms plus a sand trap of “doughnut hole in one” ~ made with powder sugar doughnuts and crumbled graham crackers for the sand.nI also made cake pops that resembled golf balls ~ some like a lollipop and others with a golf flag and the number two for the birthday boy. The kids had a few candy choices and a small bag to fill so they could bring treats home.n“Wormburners” – Chocolate pudding cups topped with crushed oreos and gummy wormsnGreen Grass Cupcakes, with white gumball and printable golf flags.nThank you so much to everyone who helped make this party amazing, I couldn’t have done it without you.nVendor List:nPhotography: Andary StudionParty Printables: Printable Studio by 505 Design, IncnGolf Club Sets: Target Dollar SectionnFoam Visors: Joann’snGreen & White Tissue Balls: Devra PartynGreen Polka Dot Balloons: Paws Attraction BoutiquenCupcakes – My super talented sisternInterested in hosting your own Golf Themed Party – Check out the Golf Pro Printable set now available in the shop.nMore details coming soon….nThanks for stopping by! Cathy C.”,
“id”: “<urn:uuid:9ad54ec1-b946-4293-8099-abc434ef154c>”,
“metadata”: {
“dump”: “CC-MAIN-2023-23”,
“url”: “http://505-design.com/tag/boys-party/”,
“date”: “2023-05-27T23:24:49Z”,
“file_path”: “/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz”,
“language”: “en”,
“language_score”: 0.9405166506767273
}
}
Minhash deduplication
I use the local executor LocalPipelineExecutor to execute the data deduplication pipeline, which includes the following steps:
Configuring Minhash: Setting up Minhash with 64-bit hashes for better precision and fewer false positives (collisions).
Reading Input Data: Using JsonlReader to read input data from a specified directory.
Stage 1: Calculating Minhash Signatures:
Pipeline: Reads input data and calculates Minhash signatures.
Output: Stores signatures in a specified folder.
Tasks: Configured to run with a specified number of tasks based on the local environment.
Stage 2: Finding Matches Between Signatures in Each Bucket:
Pipeline: Processes the signatures to find matches within each bucket.
Output: Stores bucketed signatures in a specified folder.
Tasks: Runs with a number of tasks equal to the number of buckets.
Dependency: Depends on the completion of Stage 1.
Stage 3: Creating Clusters of Duplicates:
Pipeline: Uses the results from all buckets to create clusters of duplicate items.
Output: Stores IDs of items to be removed in a specified folder.
Tasks: Runs as a single task.
Dependency: Depends on the completion of Stage 2.
Stage 4: Filtering Out Duplicates:
Pipeline: Reads the original input data, counts tokens, filters out duplicates (keeping only one sample per cluster), and writes the deduplicated data to JSONL files.
Output: Stores deduplicated output and removed items in specified folders.
Tasks: Configured to run with a specified number of tasks.
Dependency: Depends on the completion of Stage 3.
root@david1a100:~/dataclean# cat minhash_deduplication.py
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import (
    MinhashConfig,
    MinhashDedupBuckets,
    MinhashDedupCluster,
    MinhashDedupFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import TokensCounter
from datatrove.pipeline.writers.jsonl import JsonlWriter

def main():
    minhash_config = MinhashConfig(use_64bit_hashes=True)
    LOCAL_MINHASH_BASE_PATH = "./minhash"
    LOCAL_LOGS_FOLDER = "./logs"
    TOTAL_TASKS = 8

    # Input data path
    INPUT_READER = JsonlReader("./output/output/CC-MAIN-2023-23/")

    # Stage 1: calculate the MinHash signature for each task
    stage1 = LocalPipelineExecutor(
        pipeline=[
            INPUT_READER,
            MinhashDedupSignature(output_folder=f"{LOCAL_MINHASH_BASE_PATH}/signatures", config=minhash_config),
        ],
        tasks=TOTAL_TASKS,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/signatures",
    )

    # Stage 2: find matches between signatures in each bucket
    stage2 = LocalPipelineExecutor(
        pipeline=[
            MinhashDedupBuckets(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/signatures",
                output_folder=f"{LOCAL_MINHASH_BASE_PATH}/buckets",
                config=minhash_config,
            ),
        ],
        tasks=minhash_config.num_buckets,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/buckets",
        depends=stage1,
    )

    # Stage 3: create clusters of duplicate items using the results of all buckets
    stage3 = LocalPipelineExecutor(
        pipeline=[
            MinhashDedupCluster(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/buckets",
                output_folder=f"{LOCAL_MINHASH_BASE_PATH}/remove_ids",
                config=minhash_config,
            ),
        ],
        tasks=1,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/clusters",
        depends=stage2,
    )

    # Stage 4: read the raw input data and remove all samples from each duplicate cluster (keep only one)
    stage4 = LocalPipelineExecutor(
        pipeline=[
            INPUT_READER,
            TokensCounter(),  # view the number of tokens before and after de-duplication
            MinhashDedupFilter(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/remove_ids",
                exclusion_writer=JsonlWriter(f"{LOCAL_MINHASH_BASE_PATH}/removed"),
            ),
            JsonlWriter(output_folder=f"{LOCAL_MINHASH_BASE_PATH}/deduplicated_output"),
        ],
        tasks=TOTAL_TASKS,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/filter",
        depends=stage3,
    )

    stage4.run()

if __name__ == '__main__':
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Run the code:
(dataclean) root@david1a100:~/dataclean# python minhash_deduplication.py
The results are as follows:
— 🛠️ PIPELINE 🛠
📖 – READER: 🐿 Jsonl
🔢 – TOKENIZER: 📊 Counter
🫂 – DEDUP: 🎯 MinHash stage 4
💽 – WRITER: 🐿 Jsonl
2024-08-14 07:20:58.795 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 – Reading input file 00000.jsonl.gz
2024-08-14 07:20:58.802 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 1/8 tasks completed.
2024-08-14 07:20:58.804 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 2/8 tasks completed.
2024-08-14 07:20:58.805 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 3/8 tasks completed.
2024-08-14 07:20:58.807 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 4/8 tasks completed.
2024-08-14 07:20:58.808 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 5/8 tasks completed.
2024-08-14 07:20:58.810 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 6/8 tasks completed.
2024-08-14 07:20:58.812 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 7/8 tasks completed.
2024-08-14 07:21:08.399 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=0
2024-08-14 07:21:08.401 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 0 📉📉📉
Total Runtime: 9 seconds
📖 – READER: 🐿 Jsonl
Runtime: (1.54%) 0 seconds [0.03 milliseconds±0.01 milliseconds/doc]
Stats: {input_files: 1, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], documents: 5727 [5727.00/input_file]}
🔢 – TOKENIZER: 📊 Counter
Runtime: (79.15%) 7 seconds [1.29 milliseconds±5.90 milliseconds/doc]
Stats: {tokens: 3989039 [min=54, max=18060, 696.41±1020/doc]}
🫂 – DEDUP: 🎯 MinHash stage 4
Runtime: (0.44%) 0 seconds [0.01 milliseconds±0.03 milliseconds/doc]
Stats: {total: 5728, forwarded: 5548, dropped: 180}
💽 – WRITER: 🐿 Jsonl
Runtime: (18.86%) 1 second [0.32 milliseconds±0.44 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5548, total: 5548, doc_len: 17896157 [min=257, max=73080, 3225.70±4665/doc], doc_len_tokens: 3943328 [min=54, max=18060, 710.77±1032/doc]}
2024-08-14 07:21:08.405 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 8/8 tasks completed.
2024-08-14 07:21:08.417 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 8 tasks 📉📉📉
Total Runtime: 1 second ± 2 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (1.54%) 0 seconds±0 seconds/task, min=0 seconds [0.03 milliseconds±0.01 milliseconds/doc]
Stats: {input_files: 1, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], documents: 5727 [5727.00/input_file]}
🔢 – TOKENIZER: 📊 Counter
Runtime: (79.15%) 0 seconds±2 seconds/task, min=0 seconds [1.29 milliseconds±5.90 milliseconds/doc]
Stats: {tokens: 3989039 [min=54, max=18060, 696.41±1020/doc]}
🫂 – DEDUP: 🎯 MinHash stage 4
Runtime: (0.44%) 0 seconds±0 seconds/task, min=0 seconds [0.01 milliseconds±0.03 milliseconds/doc]
Stats: {total: 5728, forwarded: 5548, dropped: 180}
💽 – WRITER: 🐿 Jsonl
Runtime: (18.86%) 0 seconds±0 seconds/task, min=0 seconds [0.32 milliseconds±0.44 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5548, total: 5548, doc_len: 17896157 [min=257, max=73080, 3225.70±4665/doc], doc_len_tokens: 3943328 [min=54, max=18060, 710.77±1032/doc]}
Check the removed items and the final result for this part:
(dataclean) root@david1a100:~/dataclean/minhash# ls -al removed/
total 76
drwx------ 2 root root  4096 Aug 14 07:20 .
drwx------ 7 root root  4096 Aug 14 07:20 ..
-rw------- 1 root root 65584 Aug 14 07:21 00000.jsonl.gz
(dataclean) root@david1a100:~/dataclean/minhash# ls -al deduplicated_output/
total 7372
drwx------ 2 root root    4096 Aug 14 07:20 .
drwx------ 7 root root    4096 Aug 14 07:20 ..
-rw------- 1 root root 7539420 Aug 14 07:21 00000.jsonl.gz
(dataclean) root@david1a100:~/dataclean/minhash#
Check the first item in the final output file:
(dataclean) root@david1a100:~/dataclean/minhash/deduplicated_output# zcat ./00000.jsonl.gz | head -n 1 | jq .
{
  "text": "Buy Ambien Online Legally (Zolpidem) belongs to the imidazopyridines class of opioids. Ambien complements the exceptional of sleep via way of means of decreasing the time it takes to fall asleep, decreasing the frequency of nocturnal awakenings, and growing the general period of sleep. Lengthens the second one degree of sleep and the deep sleep degree (III and IV). It does now no longer make you sleepy throughout the day. If you're seeking to Buy Ambien Online at an inexpensive cost, come to our on line pharmacy.",
  "id": "<urn:uuid:dd20979b-ada8-4c5b-b53e-4ade7274bc1b>",
  "metadata": {
    "dump": "CC-MAIN-2023-23",
    "url": "http://42627.dynamicboard.de/u101117_ambienusa.html",
    "date": "2023-05-27T23:12:51Z",
    "file_path": "/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz",
    "language": "en",
    "language_score": 0.8990675806999207,
    "token_count": 120
  }
}
Sentence deduplication
My code uses the local executor LocalPipelineExecutor to execute the sentence deduplication pipeline, which includes the following steps:
Configuring Sentence Deduplication: Setting up sentence deduplication with specific configurations such as the number of sentences, splitting sentences, and minimum document words.
Preprocessing Data: Using NLTK to download the Punkt tokenizer and preprocess data before starting multiprocessing.
Reading Input Data: Using JsonlReader to read input data from a specified directory.
Stage 1: Extracting and Filtering Content:
Pipeline: Reads input data, extracts content using Trafilatura, filters based on quality and language, and writes intermediate results to JSONL files.
Output: Stores intermediate results in a specified folder.
Tasks: Configured to run with a specified number of tasks.
Stage 2: Calculating Sentence Deduplication Signatures:
Pipeline: Processes the intermediate results to calculate sentence deduplication signatures.
Output: Stores signatures in a specified folder.
Tasks: Runs with a number of tasks equal to the number of finder workers.
Stage 3: Finding and Filtering Duplicates:
Pipeline: Reads the intermediate results, finds duplicates using the calculated signatures, and filters out duplicates (keeping only one sample per cluster).
Output: Stores deduplicated output in a specified folder.
Tasks: Configured to run with a specified number of tasks.
The pipeline is executed by running executor_1.run(), executor_2.run(), and executor_3.run().
(dataclean) root@david1a100:~/dataclean# cat sentence_deduplication.py
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages
from datatrove.io import get_datafolder
from collections import UserDict
import multiprocessing

# Ensure the punkt tokenizer is downloaded before multiprocessing
nltk.download('punkt', force=True)

# Custom function to load PunktSentenceTokenizer
def load_punkt_tokenizer():
    punkt_param = PunktParameters()
    with open(nltk.data.find('tokenizers/punkt/english.pickle'), 'rb') as f:
        tokenizer = PunktSentenceTokenizer(punkt_param)
    return tokenizer

# Load tokenizer in the main process
tokenizer = load_punkt_tokenizer()

# Example configuration for sentence deduplication
sent_dedup_config = SentDedupConfig(
    n_sentences=3,
    split_sentences=True,
    only_dedup_in_index=True,
    min_doc_words=50,
)

FINDER_WORKERS = 10

class TimeStats:
    def __init__(self):
        self.global_mean = 0
        self.global_std_dev = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        pass

    def __repr__(self):
        return f"TimeStats(global_mean={self.global_mean}, global_std_dev={self.global_std_dev})"

    def __add__(self, other):
        result = TimeStats()
        result.global_mean = self.global_mean + other.global_mean
        result.global_std_dev = self.global_std_dev + other.global_std_dev
        return result

class Stat:
    def __init__(self):
        self.value = 0

    def update(self, value, unit=None):
        self.value += value

    def __repr__(self):
        return f"Stat(value={self.value})"

    def __add__(self, other):
        result = Stat()
        result.value = self.value + other.value
        return result

class PipelineStats(UserDict):
    def __init__(self):
        super().__init__()
        self.total_runtime = 0
        self.time_stats = TimeStats()
        self.data['total'] = Stat()
        self.data['removed_sentences'] = Stat()
        self.data['original_sentences'] = Stat()

    def as_dict(self):
        return {
            'total_runtime': self.total_runtime,
            'time_stats': repr(self.time_stats),
            'stats': {key: repr(value) for key, value in self.data.items()}
        }

    def to_dict(self):
        return self.as_dict()

    def to_json(self):
        import json
        return json.dumps(self.to_dict(), indent=4)

    def save_to_disk(self, file):
        file.write(self.to_json())

    def get_repr(self, task_name):
        x = f"\n\nStats: {task_name}\n\nTotal Runtime: {self.total_runtime} seconds\n\n"
        x += "\n".join([repr(stat) for stat in self.data.values()])
        return x

    def __repr__(self, *args, **kwargs):
        return f"PipelineStats(total_runtime={self.total_runtime}, time_stats={self.time_stats})"

    def __add__(self, other):
        result = PipelineStats()
        result.total_runtime = self.total_runtime + other.total_runtime
        result.time_stats = self.time_stats + other.time_stats
        for key in self.data:
            result.data[key] = self.data[key] + other.data[key]
        return result

class CustomSentenceDedupFilter(SentenceDedupFilter):
    def __init__(self, data_folder, config):
        self.data_folder = get_datafolder(data_folder)
        self.config = config
        self._tokenizer = None
        self.exclusion_writer = None
        self.stats = PipelineStats()
        self.language = 'english'

    def set_tokenizer(self, tokenizer):
        self._tokenizer = tokenizer

    def run(self, data, rank, world_size, *args):
        # Implement the logic for the run method here
        # For now, just print the arguments to verify they are passed correctly
        print(f"Running with data: {data}, rank: {rank}, world_size: {world_size}, args: {args}")
        # Add your actual processing logic here
        return data

def preprocess_data():
    # Preprocess data using nltk before starting multiprocessing
    # This is a placeholder function. Implement your preprocessing logic here.
    # For example, read the input files, tokenize the sentences, and save the preprocessed data.
    pass

def run_example():
    preprocess_data()  # Preprocess data before starting multiprocessing

    pipeline_1 = [
        JsonlReader(data_folder="./minhash/deduplicated_output/"),
        Trafilatura(),
        GopherQualityFilter(min_stop_words=0),
        LanguageFilter(language_threshold=0.5, languages=(Languages.english,)),
        JsonlWriter("./intermediate/"),
        SentenceDedupSignature(output_folder="./c4/sigs", config=sent_dedup_config, finder_workers=FINDER_WORKERS),
    ]

    pipeline_2 = [SentenceFindDedups(data_folder="./c4/sigs", output_folder="./c4/dups", config=sent_dedup_config)]

    sentence_dedup_filter = CustomSentenceDedupFilter(data_folder="./c4/dups", config=sent_dedup_config)
    sentence_dedup_filter.set_tokenizer(tokenizer)

    pipeline_3 = [
        JsonlReader(data_folder="./intermediate/"),
        sentence_dedup_filter,
        JsonlWriter(output_folder="./final_deduplicated_output/"),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=4, tasks=4)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=FINDER_WORKERS)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=4, tasks=4)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())

if __name__ == '__main__':
    multiprocessing.freeze_support()
    run_example()
Run the script:
(dataclean) root@david1a100:~/dataclean# python3 sentence_deduplication.py
Some of the output:
2024-08-15 03:46:20.151 | INFO | datatrove.pipeline.dedup.sentence_dedup:run:247 – PQ initialized.
2024-08-15 03:46:20.151 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=9
2024-08-15 03:46:20.152 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 9 📉📉📉
Total Runtime: 0 seconds
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds [1.17 milliseconds±0 milliseconds/doc]
2024-08-15 03:46:20.156 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 10 tasks 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds±0 seconds/task, min=0 seconds, max=0 seconds [1.68 milliseconds±1.21 milliseconds/doc]
📉📉📉 Stats 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds±0 seconds/task, min=0 seconds, max=0 seconds [1.68 milliseconds±1.21 milliseconds/doc]
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
2024-08-15 03:46:20.887 | INFO | datatrove.utils.logging:add_task_logger:47 – Launching pipeline for rank=2
2024-08-15 03:46:20.887 | INFO | datatrove.utils.logging:log_pipeline:76 –
— 🛠️ PIPELINE 🛠
📖 – READER: 🐿 Jsonl
🫂 – DEDUPS: 💥 sentence-deduplication stage 3
💽 – WRITER: 🐿 Jsonl
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 2, world_size: 4, args: ()
2024-08-15 03:46:20.887 | WARNING | datatrove.pipeline.readers.base:run:226 – No files found on /root/dataclean/intermediate for rank=2
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 1, world_size: 4, args: ()
2024-08-15 03:46:20.887 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=2
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 0, world_size: 4, args: ()
2024-08-15 03:46:20.888 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 2 📉📉📉
Total Runtime: 0 seconds
📖 – READER: 🐿 Jsonl
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
2024-08-15 03:46:20.891 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 1/4 tasks completed.
2024-08-15 03:46:20.892 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 2/4 tasks completed.
2024-08-15 03:46:20.897 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 3/4 tasks completed.
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a340>, rank: 3, world_size: 4, args: ()
2024-08-15 03:46:20.911 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 4/4 tasks completed.
2024-08-15 03:46:20.948 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 4 tasks 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (7.77%) 0 seconds±0 seconds/task, min=0 seconds [0.06 milliseconds±0.04 milliseconds/doc]
Stats: {input_files: 1, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc], documents: 3 [3.00/input_file]}
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
Runtime: (92.23%) 0 seconds±0 seconds/task, min=0 seconds [0.66 milliseconds±0.88 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 4, total: 4, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc]}
📉📉📉 Stats 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (7.77%) 0 seconds±0 seconds/task, min=0 seconds [0.06 milliseconds±0.04 milliseconds/doc]
Stats: {input_files: 1, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc], documents: 3 [3.00/input_file]}
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
Runtime: (92.23%) 0 seconds±0 seconds/task, min=0 seconds [0.66 milliseconds±0.88 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 4, total: 4, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc]}
Check the first item of the final output:
(dataclean) root@david1a100:~/dataclean/final_deduplicated_output# zcat ./00000.jsonl.gz | head -n 1 | jq .
Check quality of the corpus
This part of my code is based on https://github.com/Azure/synthetic-qa-generation/tree/main. I modified some of the code; please refer to corpus-suggestions.ipynb in my repo, https://github.com/xinyuwei-david/david-share/tree/master/Deep-Learning/Make-High-Quality-Dataset-From-WARC, which analyzes the quality of the corpus produced in the previous steps and gives many useful suggestions.
Take some results as examples:
Result 1:
Feedback Required: [True, False, True, False, True]
Feedback List:
#Need Feedback#: Yes
#Issue Name#: Lack of new scenarios or contexts
#Reason#: The evolved instruction does not introduce any new scenarios or examples.
#Feedback#: Introduce diverse contexts or examples to enhance the instructional variety.
#Need Feedback#: No
#Need Feedback#: Yes
#Issue Name#: Limited diversity in examples
#Reason#: No new scenarios or varied contexts introduced in the evolved instruction.
#Feedback#: Incorporate diverse examples or contexts to cover a wider range of situations.
#Need Feedback#: No
#Need Feedback#: Yes
#Issue Name#: Limited diversity
#Reason#: No new scenarios, examples, or contexts introduced.
#Feedback#: Include various use cases and contexts for accessing journal content.
Optimized Instruction:
Accessing full-text articles for free on HTML pages can be a convenient way to stay informed, but if you need the article in PDF or Epub format, a subscription to the Journal of Postgraduate Medicine is required. Here are different ways to access the content based on various contexts:
1. **Individual Subscription:** If you frequently need access to articles in PDF or Epub format, consider subscribing online for a year. Subscribing is a straightforward process:
– Visit the Journal of Postgraduate Medicine’s subscription page.
– Choose the subscription plan that suits your needs.
– Complete the payment process to gain access to the content.
2. **Institutional Access:** If you are affiliated with a university or a research institution, you might recommend that your institution’s library subscribe to the journal. This way, everyone at your institution can have unrestricted access to the content.
– Click on the “Recommend the Journal” link typically provided on the journal’s website.
– Fill out the recommendation form with the necessary details.
– Submit the recommendation to your institution’s library acquisition team.
3. **Library Access:** If your local library has a subscription to the journal, you can access the PDF and Epub formats through their facilities. Check with your library to see if they offer remote access options, especially useful during non-operational hours or remote working conditions.
4. **Interlibrary Loan (ILL):** If neither you nor your institution has a subscription and you need a specific article in PDF or Epub format, you can request it through interlibrary loan services:
– Contact your library’s interlibrary loan department.
– Provide the details of the article you need.
– Wait for your library to obtain a copy from another subscribing institution.
5. **Pay-Per-View Purchase:** Some journals offer pay-per-view options for non-subscribers to access specific articles:
– Visit the article page on the journal’s website.
– Look for a purchase or pay-per-view option.
– Complete the payment to download the article in PDF or Epub format.
By understanding these various methods, you can choose the most appropriate way to access the Journal of Postgraduate Medicine articles based on your specific context and needs.
Evolved Instruction Step 1:
Accessing full-text articles for free on HTML pages can be a convenient way to stay informed, but if you need the article in PDF or Epub format or face geographic restrictions, a subscription to the Journal of Postgraduate Medicine is required. Here are different ways to access the content based on various contexts and considerations:
1. **Individual Subscription:** If you frequently need access to articles in PDF or Epub format, consider subscribing online for a year. Consider different subscription tiers based on your usage frequency and preferred payment method (credit card, PayPal, or wire transfer):
– Visit the Journal of Postgraduate Medicine’s subscription page.
– Choose the appropriate subscription plan that suits your reading needs and budget.
– Complete the payment process, selecting your preferred payment method, to gain access to the content.
– Confirm your subscription through the verification email you will receive.
2. **Institutional Access:** If you are affiliated with a university, specialized institute, or research organization, you might recommend that your institution’s library subscribe to the journal, allowing everyone at your institution unrestricted access to the content:
– Click on the “Recommend the Journal” link typically provided on the journal’s website.
– Fill out the recommendation form with the necessary details, specifying your institution type.
– Submit the recommendation to your institution’s library acquisition team.
– Follow up with your acquisition team to verify the status of the subscription request.
3. **Library Access:** If your local library has a subscription to the journal, you can access the PDF and Epub formats through their facilities. Check with your library to see if they offer remote access options or have updated policies for off-hour access due to remote working conditions or geographical restrictions:
– Visit your library’s online resource portal.
– Authenticate your library membership details to access the journal remotely.
– Verify the access duration and loan policies to ensure continuous availability.
4. **Interlibrary Loan (ILL):** If neither you nor your institution has a subscription and you need a specific article in PDF or Epub format, you can request it through Interlibrary Loan services, which might involve multiple steps and waiting periods:
– Contact your library’s interlibrary loan department and inquire about any pre-requisites.
– Provide the exact details of the article you need and verify your contact information.
– Wait for your library to notify you on the progress and estimated delivery time of the article from another subscribing institution.
– Confirm the received article’s access duration to avoid lapses in availability.
5. **Pay-Per-View Purchase:** Some journals offer pay-per-view options for non-subscribers to access specific articles. Be aware of different payment methods and possible return policies if the article does not meet your needs:
– Visit the article page on the journal’s website.
– Look for a purchase or pay-per-view option and compare prices if there are multiple.
– Complete the payment process, choosing a method that’s secure and convenient for you.
– Download the article in PDF or Epub format, and review any return policies if you face access issues.
By understanding these various methods, including conditional scenarios and additional steps, you can choose the most appropriate way to access the Journal of Postgraduate Medicine articles based on your specific context, requirements, and potential contingent situations.
New Feedback Required: [True, True, True, True, True]
New Feedback List:
#Need Feedback#: Yes
#Issue Name#: Preservation of key information
#Reason#: Key information is maintained with added details and considerations.
#Feedback#: Key information preserved well with added context and steps for clarity.
#Need Feedback#: Yes
#Issue Name#: Complexity
#Reason#: More details and steps have been added sufficiently.
#Feedback#: Complexity increased adequately with detailed steps and additional considerations.
#Need Feedback#: Yes
#Issue Name#: Insufficient scenario diversity
#Reason#: Limited expansion on new contexts or examples in evolved instruction.
#Feedback#: Introduce more varied scenarios to enhance diversity and coverage of different situations.
#Need Feedback#: Yes
#Issue Name#: Increased complexity
#Reason#: The Evolved Instruction introduces more detailed steps and additional considerations.
#Feedback#: The complexity has increased adequately with additional steps and detailed guidance.
#Need Feedback#: Yes
#Issue Name#: Limited diversity in access methods
#Reason#: Few new scenarios or examples introduced in the evolved instruction.
#Feedback#: Expand diversity by adding varied contexts, like international access options.
Generate Synthetic Q&A
Referring to generate-QA.ipynb, we can generate high-quality synthetic Q&A pairs with GPT-4o. The prompt template is based on https://github.com/Azure/synthetic-qa-generation/tree/main/seed/prompt_template/en (a minimal sketch of the API call follows the examples below).
Take some results as examples:
1. **What type of access is free in HTML pages?**
Full text access is free in HTML pages.
2. **Who can access PDF and EPub formats of the journal?**
PDF and EPub access is only available to paid subscribers and members.
3. **What must you do to access the article in PDF format?**
To access the article in PDF format, you should be a subscriber to the Journal of Postgraduate Medicine.
4. **How can you subscribe to the Journal of Postgraduate Medicine?**
You can subscribe online for a year.
5. **What can you do if you want your institution to have unrestricted access to the journal?**
You could recommend your institution’s library to subscribe to the journal so that you can have unrestricted access.
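A minimal sketch of how such Q&A pairs might be requested from a GPT-4o deployment is shown below; this is not the notebook's exact code, and the deployment name, API version, and prompt wording are placeholders:
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # placeholder; use the version your resource supports
)

def generate_qa(document: str, n_pairs: int = 5) -> str:
    # Ask the model for Q&A pairs grounded only in the supplied document
    prompt = (
        f"Generate {n_pairs} question-and-answer pairs that are fully grounded "
        f"in the following text. Number each pair.\n\n{document}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: your Azure OpenAI deployment name
        messages=[
            {"role": "system", "content": "You create high-quality synthetic Q&A data."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

sample_doc = (
    "Full text access is free in HTML pages; PDF and EPub access is only "
    "available to paid subscribers and members of the Journal of Postgraduate Medicine."
)
print(generate_qa(sample_doc))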
References
DataTrove: https://github.com/huggingface/datatrove/
Generate Synthetic QnAs from Real-world Data: https://github.com/Azure/synthetic-qa-generation/
Generative AI with Microsoft Fabric
Microsoft Fabric seamlessly integrates with generative AI to enhance data-driven decision-making across your organization. It unifies data management and analysis, allowing for real-time insights and actions.
With Real Time Intelligence, keeping grounding data for large language models (LLMs) up-to-date is simplified. This ensures that generative AI responses are based on the most current information, enhancing the relevance and accuracy of outputs. Microsoft Fabric also infuses generative AI experiences throughout its platform, with tools like Copilot in Fabric and Azure AI Studio enabling easy connection of unified data to sophisticated AI models.
Check out GenAI experiences with Microsoft Fabric.
Classify and protect schematized data with Microsoft Purview.
Connect data from OneLake to Azure AI Studio.
Watch our video here:
QUICK LINKS:
00:00 — Unify data with Microsoft Fabric
00:35 — Unified data storage & real-time analysis
01:08 — Security with Microsoft Purview
01:25 — Real-Time Intelligence
02:05 — Integration with Azure AI Studio
Link References
This is Part 3 of 3 in our series on leveraging generative AI. Watch our playlist at https://aka.ms/GenAIwithAzureDBs
Unfamiliar with Microsoft Mechanics?
As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.
Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries
Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog
Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast
Keep getting this insider knowledge, join us on social:
Follow us on Twitter: https://twitter.com/MSFTMechanics
Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/
Enjoy us on Instagram: https://www.instagram.com/msftmechanics/
Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics
Video Transcript:
-If you want to bring custom Gen AI experiences to your app so that users can interact with them using natural language, the better the quality and recency of the data used to ground responses, the more relevant and accurate the generated outcome.
-The challenge, of course, is that your data may be sitting across multiple clouds, in your own data center and also on the edge. Here’s where the complete analytics platform Microsoft Fabric helps you to unify data wherever it lives at unlimited scale, without you having to move it.
-It incorporates a logical multi-cloud data lake, OneLake, for unified data storage and access and separately provides a real-time hub optimized for event-based streaming data, where change data capture feeds can be streamed from multiple cloud sources for analysis in real time without the need to pull your data. Then with your data unified, data professionals can work together in a collaborative workspace to ingest and transform it, analyze it, and also endorse it as they build quality data sets.
-And when used with Microsoft Purview, this can be achieved with an additional layer of security, where you can classify and protect your schematized data, with protections flowing as everyone from your engineers and data analysts to your business users works with data in the Fabric workspace. Keeping grounding data for your LLMs up to date is also made easier by being able to act on it with Real Time Intelligence.
-For example, you might have a product recommendation engine on an e-commerce site and using Real Time Intelligence, you can create granular conditions to listen for changes in your data, like new stock coming in, and update data pipelines feeding the grounding data for your large language models.
-So now, whereas before the gen AI may not have had the latest inventory data available to it to ground responses, with Real Time Intelligence, generated responses can benefit from the most real-time, up-to-date information so you don’t lose out on sales. And as you work with your data, gen AI experiences are infused throughout Fabric. In fact, Copilot in Fabric experiences are available for all Microsoft Fabric workloads to assist you as you work.
-And once your data set is complete, connecting it from Microsoft Fabric to ground large language models in your gen AI apps is made easy with Azure AI Studio, where you can bring in data from OneLake seamlessly and choose from some of the most sophisticated large language models hosted in Azure to build custom AI experiences on your data, all of which is only made possible when you unify your data and act on it with Microsoft Fabric.
Microsoft Tech Community – Latest Blogs –Read More
Mseries announcements – GA of Mv3 High Memory and details on Mv3 Very High Memory virtual machines
Mv3 High Memory General Availability
Executing on our plan to have our third version of M-series (Mv3) powered by 4th generation Intel® Xeon® processors (Sapphire Rapids) across the board, we’re excited to announce that Mv3 High Memory (HM) virtual machines (VMs) are now generally available. These next-generation M-series High Memory VMs give customers faster insights, more uptime, lower total cost of ownership and improved price-performance for their most demanding workloads. Mv3 HM VMs are supported for RISE with SAP customers as well. With the release of this Mv3 sub-family and the sub-family that offers around 32TB memory, Microsoft is the only public cloud provider that can provide HANA certified VMs from around 1TB memory to around 32TB memory all powered by 4th generation Intel® Xeon® processors (Sapphire Rapids).
Key features on the new Mv3 HM VMs
The Mv3 HM VMs can scale for workloads from 6TB to 16TB.
Mv3 delivers up to 40% more throughput than our Mv2 High Memory (HM), enabling significantly faster SAP HANA data load times for SAP OLAP workloads and significantly higher performance per core for SAP OLTP workloads than the previous-generation Mv2.
Powered by Azure Boost, Mv3 HM provides up to 2x more throughput to Azure premium SSD storage and up to 25% improvement in network throughput over Mv2, with more deterministic performance.
Designed from the ground up for increased resilience against failures in memory, disks, and networking based on intelligence from past generations.
Available in both disk and diskless offerings allowing customers the flexibility to choose the option that best meets their workload needs.
During our private preview, several customers such as SwissRe unlocked gains from the new VM sizes. In their own words:
“Mv3 High Memory VM results are promising – in average we see a 30% increase in the performance without any big adjustment.”
SwissRe
Msv3 High Memory series (NVMe)

| Size | vCPU | Memory in GiB | Max data disks | Max uncached Premium SSD throughput: IOPS/MBps | Max uncached Ultra Disk and Premium SSD V2 disk throughput: IOPS/MBps | Max NICs | Max network bandwidth (Mbps) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_M416s_6_v3 | 416 | 5,696 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M416s_8_v3 | 416 | 7,600 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M624s_12_v3 | 624 | 11,400 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832s_12_v3 | 832 | 11,400 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832s_16_v3 | 832 | 15,200 | 64 | 130,000/8,000 | 260,000/8,000 | 8 | 40,000 |
Msv3 High Memory series (SCSI)

| Size | vCPU | Memory in GiB | Max data disks | Max uncached Premium SSD throughput: IOPS/MBps | Max uncached Ultra Disk and Premium SSD V2 disk throughput: IOPS/MBps | Max NICs | Max network bandwidth (Mbps) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_M416s_6_v3 | 416 | 5,696 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M416s_8_v3 | 416 | 7,600 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M624s_12_v3 | 624 | 11,400 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832s_12_v3 | 832 | 11,400 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832s_16_v3 | 832 | 15,200 | 64 | 130,000/8,000 | 130,000/8,000 | 8 | 40,000 |
Mdsv3 High Memory series (NVMe)

| Size | vCPU | Memory in GiB | Temp storage (SSD) GiB | Max data disks | Max cached* and temp storage throughput: IOPS/MBps | Max uncached Premium SSD throughput: IOPS/MBps | Max uncached Ultra Disk and Premium SSD V2 disk throughput: IOPS/MBps | Max NICs | Max network bandwidth (Mbps) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_M416ds_6_v3 | 416 | 5,696 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M416ds_8_v3 | 416 | 7,600 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M624ds_12_v3 | 624 | 11,400 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832ds_12_v3 | 832 | 11,400 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832ds_16_v3 | 832 | 15,200 | 400 | 64 | 250,000/1,600 | 130,000/8,000 | 260,000/8,000 | 8 | 40,000 |
Mdsv3 High Memory series (SCSI)

| Size | vCPU | Memory in GiB | Temp storage (SSD) GiB | Max data disks | Max cached* and temp storage throughput: IOPS/MBps | Max uncached Premium SSD throughput: IOPS/MBps | Max uncached Ultra Disk and Premium SSD V2 disk throughput: IOPS/MBps | Max NICs | Max network bandwidth (Mbps) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_M416ds_6_v3 | 416 | 5,696 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M416ds_8_v3 | 416 | 7,600 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M624ds_12_v3 | 624 | 11,400 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832ds_12_v3 | 832 | 11,400 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832ds_16_v3 | 832 | 15,200 | 400 | 64 | 250,000/1,600 | 130,000/8,000 | 130,000/8,000 | 8 | 40,000 |
*Read IOPS are optimized for sequential reads
Regional Availability and Pricing
The VMs are now available in West Europe, North Europe, East US, and West US 2. For pricing details, please take a look here for Windows and Linux.
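If you want to try one of these sizes, a minimal Azure CLI sketch is shown below. The resource group, VM name, and image alias are placeholders (not part of the announcement), and your subscription needs M-series quota in the chosen region.

```powershell
# Hypothetical example: deploy an Mv3 HM VM in one of the available regions.
# Resource names and the image alias are placeholders; adjust to your environment.
az group create --name rg-mv3-demo --location westeurope

az vm create `
  --resource-group rg-mv3-demo `
  --name mv3-hm-demo `
  --size Standard_M416s_8_v3 `
  --image Ubuntu2204 `
  --admin-username azureuser `
  --generate-ssh-keys
```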
Additional resources:
SAP Certification for Mv3 on Azure
Details on Mv3 Very High Memory Virtual Machines
We are thrilled to unveil the latest and largest additions to our Mv3-Series, Standard_M896ixds_32_v3 and Standard_M1792ixds_32_v3 VM SKUs. These new VM SKUs are the result of a close collaboration between Microsoft, SAP, experienced hardware partners, and our valued customers.
Key features on the new Mv3 VHM VMs
Unmatched Memory Capacity: With close to 32TB of memory, both the Standard_M896ixds_32_v3 and Standard_M1792ixds_32_v3 VMs are ideal for supporting very large in-memory databases and workloads.
High CPU Power: Featuring 896 cores in the Standard_M896ixds_32_v3 VM and 1792 vCPUs** in the Standard_M1792ixds_32_v3 VM, these VMs are designed to handle high-end S/4HANA workloads, providing more CPU power than other public cloud offerings.
Enhanced Network and Storage Bandwidth: Both VM types provide the highest network and storage bandwidth available in Azure for a full node VM, including up to 200-Gbps network bandwidth with Azure Boost.
Optimal Performance for SAP HANA: Certified for SAP HANA, these VMs adhere to the SAP prescribed socket-to-memory ratio, ensuring optimal performance for in-memory analytics and relational database servers.
| Size | vCPU or cores | Memory in GiB | SAP HANA Workload Type |
| --- | --- | --- | --- |
| Standard_M896ixds_32_v3 | 896 | 30,400 | OLTP (S/4HANA) / OLAP Scaleup |
| Standard_M1792ixds_32_v3 | 1792** | 30,400 | OLAP Scaleup |
**Hyperthreaded vCPUs
Microsoft Tech Community – Latest Blogs –Read More
Azure Extended Zones: Optimizing Performance, Compliance, and Accessibility
Azure Extended Zones are designed to bring the power of Azure closer to end users in specific metropolitan areas or jurisdictions, catering to organizations that require low latency and stringent data residency controls. This innovative solution supports a variety of use cases, including real-time media editing, financial services, healthcare, and any industry where data localization and rapid response times are critical.
Key Benefits and Features:
Low Latency and High Performance:
Reduced Latency: Azure Extended Zones enable applications requiring rapid response times to operate with minimal latency. This is particularly beneficial for sectors such as media, where real-time processing is crucial. By locating resources closer to the end-users, Extended Zones ensure faster data access and lower latency, leading to improved performance and user experience.
Enhanced User Experience: Applications that depend on quick response times, like gaming or real-time analytics, benefit significantly from Azure Extended Zones’ ability to reduce the delay in data transmission.
Data Residency and Compliance:
Geographical Data Control: These zones allow organizations to keep their data within specific geographical boundaries, aligning with local privacy laws, regulatory requirements, and compliance standards. This is particularly crucial for industries such as finance, healthcare, and government, where data sovereignty is a major concern.
Regulatory Compliance: By ensuring that data stays within a defined region, Azure Extended Zones help organizations meet stringent data residency requirements, such as those mandated by GDPR in Europe or other regional data protection laws.
Service Availability and Integration:
Supported Azure Services: Azure Extended Zones offer the following Azure services:
| Service category | Available services |
| --- | --- |
| Compute | Azure virtual machines (general purpose: A, B, D, E, and F series and GPU NVadsA10 v5 series), Virtual Machine Scale Sets, Azure Kubernetes Service |
| Networking | Azure Private Link, Standard public IP, Virtual networks, Virtual network peering, ExpressRoute, Azure Standard Load Balancer, DDoS (Standard protection) |
| Storage | Azure managed disks, Azure Premium Page Blobs, Azure Premium Block Blobs, Azure Premium Files, Azure Data Lake Storage Gen2 Hierarchical Namespace, Azure Data Lake Storage Gen2 Flat Namespace, Change Feed, Blob features (SFTP, NFS) |
| BCDR | Azure Site Recovery, Azure Backup |
These services can be deployed and managed within Extended Zones, providing businesses with the flexibility to run complex workloads close to their customers.
Reference Architecture:
Existing Azure customers can integrate Extended Zones into their current setups with minimal disruption. The service is designed to complement Azure’s global infrastructure, making it easy to expand into new regions or jurisdictions as shown in the following diagram.
Requesting Access and Workload Deployment:
Requesting Access to Azure Extended Zones
To register for an Azure Extended Zone, follow these steps:
Select a Subscription: Choose the Azure subscription you want to register for an Extended Zone.
List Available Zones: Use the Get-AzEdgeZonesExtendedZone cmdlet in Azure PowerShell to list all available Extended Zones.
Register a Zone: Use Register-AzEdgeZonesExtendedZone -Name 'zonename' to register for a specific zone (e.g., Los Angeles).
Check Registration Status: Confirm the registration state with Get-AzEdgeZonesExtendedZone -Name 'zonename'. The zone becomes usable once its state is "Registered." (A consolidated sketch of these commands follows these steps.)
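Put together, the registration steps above might look like the following PowerShell sketch. This is a minimal example, not official guidance: it assumes you are signed in to the correct subscription, that the preview Az module providing the *-AzEdgeZonesExtendedZone cmdlets is installed, and that 'losangeles' is only an example zone name.

```powershell
# Minimal sketch of the Extended Zone registration flow described above.
Connect-AzAccount
Set-AzContext -Subscription "<your-subscription-id>"

# List all Extended Zones available to the subscription
Get-AzEdgeZonesExtendedZone

# Register for a specific zone, then confirm its state shows as 'Registered'
Register-AzEdgeZonesExtendedZone -Name 'losangeles'
Get-AzEdgeZonesExtendedZone -Name 'losangeles'
```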
Workload Deployment: Once access is granted, users can deploy available Azure services within Azure Extended Zones using Azure Portal or CLI.
Use Cases and Industry Applications:
Media and Entertainment: Azure Extended Zones enable low-latency streaming and real-time media processing, making them ideal for content creation and distribution.
Financial Services: With stringent data residency and low-latency requirements, financial institutions can benefit from keeping data within local jurisdictions while ensuring fast transaction processing.
Healthcare: Extended Zones provide healthcare organizations with the ability to store and process patient data locally, ensuring compliance with health data regulations and improving response times for critical applications.
FAQs and Common Queries:
How does Azure Extended Zones differ from traditional Azure regions? Azure Extended Zones are designed to serve specific metropolitan areas or jurisdictions, focusing on low latency and data residency. Unlike traditional Azure regions that cover broader geographical areas, Extended Zones offer a more localized solution.
Can I use existing Azure services within Extended Zones? Yes, many Azure services, including virtual machines, Kubernetes, storage, and networking, are available within Extended Zones. This allows for seamless integration with your existing Azure infrastructure.
What are the limitations of Azure Extended Zones? While Extended Zones offer numerous benefits, they are currently available only in preview and may have limited service availability depending on the region. Additionally, not all Azure services may be supported within Extended Zones, so it’s important to verify compatibility based on your specific needs.
How can I request access to Azure Extended Zones? Access can be requested through the Azure portal by submitting a request form. The process involves providing details about your intended use case and the specific region where you need the service. Microsoft will review the request and grant access based on availability and alignment with the service’s objectives.
For more details and to request access, visit the Azure Extended Zones Overview, FAQ, and Request Access pages.
Please note: Azure Extended Zones are currently in preview. For legal terms applicable to Azure features in beta or preview, refer to Supplemental Terms of Use for Microsoft Azure Previews.
Microsoft Tech Community – Latest Blogs –Read More
Evaluate Fine-tuned Phi-3 / 3.5 Models in Azure AI Studio Focusing on Microsoft’s Responsible AI
Evaluate Fine-tuned Phi-3 / 3.5 Models in Azure AI Studio Focusing on Microsoft’s Responsible AI
This blog series has several versions, each covering different aspects and techniques. Check out the following resources:
Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow: Step-by-Step Guide
Detailed instructions for fine-tuning and integrating custom Phi-3 models with Prompt flow using a code-first approach.
Available on: MS Tech Community, Phi-3 CookBook on GitHub
Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow in Azure AI Studio
Detailed instructions for fine-tuning and integrating custom Phi-3 models with Prompt flow in Azure AI / ML Studio using a low-code approach.
Available on: MS Tech Community, Phi-3 CookBook on GitHub
Evaluate Fine-tuned Phi-3 / Phi-3.5 Models in Azure AI Studio Focusing on Microsoft's Responsible AI
Detailed instructions for evaluating the Phi-3 / Phi-3.5 model in Azure AI Studio using a low-code approach.
Available on: MS Tech Community
How can you evaluate the safety and performance of a fine-tuned Phi-3 / Phi-3.5 model in Azure AI Studio?
Fine-tuning a model can sometimes lead to unintended or undesired responses. To ensure that the model remains safe and effective, it’s important to evaluate it. This evaluation helps to assess the model’s potential to generate harmful content and its ability to produce accurate, relevant, and coherent responses. In this tutorial, you will learn how to evaluate the safety and performance of a fine-tuned Phi-3 / Phi-3.5 model integrated with Prompt flow in Azure AI Studio.
Here is an Azure AI Studio’s evaluation process.
The code-first approach tutorial includes tips on how to use the Phi-3.5 model in its Fine-tune the Phi-3 model section.
The low-code approach tutorial currently supports only the Phi-3 model; it will be updated to include Phi-3.5 fine-tuning as soon as it is supported in Azure AI / ML Studio.
The evaluation process in Azure AI Studio is identical for both Phi-3 and Phi-3.5, so the title of this tutorial includes both models.
For more detailed information and to explore additional resources about Phi-3 and Phi-3.5, please visit the Phi-3CookBook.
Prerequisites
Python
Azure subscription
Visual Studio Code
Fine-tuned Phi-3 / Phi-3.5 model
Table of Contents
Series1: Introduction to Azure AI Studio’s Prompt flow evaluation
Introduction to safety evaluation
Introduction to performance evaluation
Series2: Evaluating the Phi-3 / Phi-3.5 model in Azure AI Studio
Before you begin
Deploy Azure OpenAI to evaluate the Phi-3 / Phi-3.5 model
Evaluate the fine-tuned Phi-3 / Phi-3.5 model using Azure AI Studio’s Prompt flow evaluation
Series1: Introduction to Azure AI Studio’s Prompt flow evaluation
Introduction to safety evaluation
To ensure that your AI model is ethical and safe, it's crucial to evaluate it against Microsoft's Responsible AI Principles. In Azure AI Studio, safety evaluations allow you to evaluate your model's vulnerability to jailbreak attacks and its potential to generate harmful content, which is directly aligned with these principles.
Microsoft’s Responsible AI Principles
Before beginning the technical steps, it's essential to understand Microsoft's Responsible AI Principles, an ethical framework designed to guide the responsible design, development, deployment, and operation of AI systems. These principles help ensure that AI technologies are built in a way that is fair, transparent, and inclusive, and they are the foundation for evaluating the safety of AI models.
Microsoft’s Responsible AI Principles include:
Fairness and Inclusiveness: AI systems should treat everyone fairly and avoid affecting similarly situated groups of people in different ways. For example, when AI systems provide guidance on medical treatment, loan applications, or employment, they should make the same recommendations to everyone who has similar symptoms, financial circumstances, or professional qualifications.
Reliability and Safety: To build trust, it’s critical that AI systems operate reliably, safely, and consistently. These systems should be able to operate as they were originally designed, respond safely to unanticipated conditions, and resist harmful manipulation. How they behave and the variety of conditions they can handle reflect the range of situations and circumstances that developers anticipated during design and testing.
Transparency: When AI systems help inform decisions that have tremendous impacts on people’s lives, it’s critical that people understand how those decisions were made. For example, a bank might use an AI system to decide whether a person is creditworthy. A company might use an AI system to determine the most qualified candidates to hire.
Privacy and Security: As AI becomes more prevalent, protecting privacy and securing personal and business information are becoming more important and complex. With AI, privacy and data security require close attention because access to data is essential for AI systems to make accurate and informed predictions and decisions about people.
Accountability: The people who design and deploy AI systems must be accountable for how their systems operate. Organizations should draw upon industry standards to develop accountability norms. These norms can ensure that AI systems aren’t the final authority on any decision that affects people’s lives. They can also ensure that humans maintain meaningful control over otherwise highly autonomous AI systems.
To learn more about Microsoft's Responsible AI Principles, visit What is Responsible AI?.
Safety metrics
In this tutorial, you will evaluate the safety of the fine-tuned Phi-3 / Phi-3.5 model using Azure AI Studio’s safety metrics. These metrics help you assess the model’s potential to generate harmful content and its vulnerability to jailbreak attacks. The safety metrics include:
Self-harm-related Content: Evaluates whether the model has a tendency to produce self-harm related content.
Hateful and Unfair Content: Evaluates whether the model has a tendency to produce hateful or unfair content.
Violent Content: Evaluates whether the model has a tendency to produce violent content.
Sexual Content: Evaluates whether the model has a tendency to produce inappropriate sexual content.
Evaluating these aspects ensures that the AI model does not produce harmful or offensive content, aligning it with societal values and regulatory standards.
Introduction to performance evaluation
To ensure that your AI model is performing as expected, it’s important to evaluate its performance against performance metrics. In Azure AI Studio, performance evaluations allow you to evaluate your model’s effectiveness in generating accurate, relevant, and coherent responses.
Image Source: Evaluation of generative AI applications
Performance metrics
In this tutorial, you will evaluate the performance of the fine-tuned Phi-3 / Phi-3.5 model using Azure AI Studio’s performance metrics. These metrics help you assess the model’s effectiveness in generating accurate, relevant, and coherent responses. The performance metrics include:
Groundedness: Evaluates how well the generated answers align with the information from the input source.
Relevance: Evaluates the pertinence of generated responses to the given questions.
Coherence: Evaluates how smoothly the generated text flows, reads naturally, and resembles human-like language.
Fluency: Evaluates the language proficiency of the generated text.
GPT Similarity: Compares the generated response with the ground truth for similarity.
F1 Score: Calculates the ratio of shared words between the generated response and the source data.
These metrics help you evaluate the model’s effectiveness in generating accurate, relevant, and coherent responses.
Series2: Evaluating the Phi-3 / Phi-3.5 model in Azure AI Studio
Before you begin
This tutorial is a follow up to the previous blog posts, “Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow: Step-by-Step Guide” and “Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow in Azure AI Studio.” In these posts, we walked through the process of fine-tuning a Phi-3 / Phi-3.5 model in Azure AI Studio and integrating it with Prompt flow.
In this tutorial, you will deploy an Azure OpenAI model as an evaluator in Azure AI Studio and use it to evaluate your fine-tuned Phi-3 / Phi-3.5 model.
Before you begin this tutorial, make sure you have the following prerequisites, as described in the previous tutorials:
A prepared dataset to evaluate the fine-tuned Phi-3 / Phi-3.5 model.
A Phi-3 / Phi-3.5 model that has been fine-tuned and deployed to Azure Machine Learning.
A Prompt flow integrated with your fine-tuned Phi-3 / Phi-3.5 model in Azure AI Studio.
You will use the test_data.jsonl file, located in the data folder from the ULTRACHAT_200k dataset downloaded in the previous blog posts, as the dataset to evaluate the fine-tuned Phi-3 / Phi-3.5 model.
Integrate the custom Phi-3 / Phi-3.5 model with Prompt flow in Azure AI Studio(Code first approach)
If you followed the low-code approach described in “Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow in Azure AI Studio“, you can skip this exercise and proceed to the next one. However, if you followed the code-first approach described in “Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow: Step-by-Step Guide” to fine-tune and deploy your Phi-3 / Phi-3.5 model, the process of connecting your model to Prompt flow is slightly different. You will learn this process in this exercise.
To proceed, you need to integrate your fine-tuned Phi-3 / Phi-3.5 model into Prompt flow in Azure AI Studio.
Create Azure AI Studio Hub
You need to create a Hub before creating the Project. A Hub acts like a Resource Group, allowing you to organize and manage multiple Projects within Azure AI Studio.
Sign in to Azure AI Studio.
Select All hubs from the left side tab.
Select + New hub from the navigation menu.
Perform the following tasks:
Enter Hub name. It must be a unique value.
Select your Azure Subscription.
Select the Resource group to use (create a new one if needed).
Select the Location you’d like to use.
Select the Connect Azure AI Services to use (create a new one if needed).
For Connect Azure AI Search, select Skip connecting.
Select Next.
Create Azure AI Studio Project
In the Hub that you created, select All projects from the left side tab.
Select + New project from the navigation menu.
Enter Project name. It must be a unique value.
Select Create a project.
Add a custom connection for the fine-tuned Phi-3 / Phi-3.5 model
To integrate your custom Phi-3 / Phi-3.5 model with Prompt flow, you need to save the model’s endpoint and key in a custom connection. This setup ensures access to your custom Phi-3 / Phi-3.5 model in Prompt flow.
Set api key and endpoint uri of the fine-tuned Phi-3 / Phi-3.5 model
Visit Azure ML Studio.
Navigate to the Azure Machine learning workspace that you created.
Select Endpoints from the left side tab.
Select endpoint that you created.
Select Consume from the navigation menu.
Copy your REST endpoint and Primary key.
Add the Custom Connection
Visit Azure AI Studio.
Navigate to the Azure AI Studio project that you created.
In the Project that you created, select Settings from the left side tab.
Select + New connection.
Select Custom keys from the navigation menu.
Perform the following tasks:
Select + Add key value pairs.
For the key name, enter endpoint and paste the endpoint you copied from Azure ML Studio into the value field.
Select + Add key value pairs again.
For the key name, enter key and paste the key you copied from Azure ML Studio into the value field.
After adding the keys, select is secret to prevent the key from being exposed.
Select Add connection.
Create Prompt flow
You have added a custom connection in Azure AI Studio. Now, let’s create a Prompt flow using the following steps. Then, you will connect this Prompt flow to the custom connection to use the fine-tuned model within the Prompt flow.
Navigate to the Azure AI Studio project that you created.
Select Prompt flow from the left side tab.
Select + Create from the navigation menu.
Select Chat flow from the navigation menu.
Enter Folder name to use.
Select Create.
Set up Prompt flow to chat with your custom Phi-3 / Phi-3.5 model
You need to integrate the fine-tuned Phi-3 / Phi-3.5 model into a Prompt flow. However, the existing Prompt flow provided is not designed for this purpose. Therefore, you must redesign the Prompt flow to enable the integration of the custom model.
In the Prompt flow, perform the following tasks to rebuild the existing flow:
Select Raw file mode.
Delete all existing code in the flow.dag.yml file.
Add the following code to flow.dag.yml:
inputs:
  input_data:
    type: string
    default: "Who founded Microsoft?"

outputs:
  answer:
    type: string
    reference: ${integrate_with_promptflow.output}

nodes:
- name: integrate_with_promptflow
  type: python
  source:
    type: code
    path: integrate_with_promptflow.py
  inputs:
    input_data: ${inputs.input_data}
Select Save.
Add the following code to integrate_with_promptflow.py to use the custom Phi-3 / Phi-3.5 model in Prompt flow.
import logging

import requests
from promptflow import tool
from promptflow.connections import CustomConnection

# Logging setup
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.DEBUG
)
logger = logging.getLogger(__name__)


def query_phi3_model(input_data: str, connection: CustomConnection) -> str:
    """
    Send a request to the Phi-3 / Phi-3.5 model endpoint with the given input data using the Custom Connection.
    """
    # "connection" is the name of the Custom Connection; "endpoint" and "key" are the keys in the Custom Connection
    endpoint_url = connection.endpoint
    api_key = connection.key
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    data = {
        "input_data": [input_data],
        "params": {
            "temperature": 0.7,
            "max_new_tokens": 128,
            "do_sample": True,
            "return_full_text": True
        }
    }
    try:
        response = requests.post(endpoint_url, json=data, headers=headers)
        response.raise_for_status()

        # Log the full JSON response
        logger.debug(f"Full JSON response: {response.json()}")

        result = response.json()["output"]
        logger.info("Successfully received response from Azure ML Endpoint.")
        return result
    except requests.exceptions.RequestException as e:
        logger.error(f"Error querying Azure ML Endpoint: {e}")
        raise


@tool
def my_python_tool(input_data: str, connection: CustomConnection) -> str:
    """
    Tool function to process input data and query the Phi-3 / Phi-3.5 model.
    """
    return query_phi3_model(input_data, connection)
For more detailed information on using Prompt flow in Azure AI Studio, you can refer to Prompt flow in Azure AI Studio.
Select Chat input and Chat output to enable chat with your model.
Now you are ready to chat with your custom Phi-3 / Phi-3.5 model. In the next exercise, you will learn how to start Prompt flow and use it to chat with your fine-tuned Phi-3 / Phi-3.5 model.
The rebuilt flow should look like the image below:
Start Prompt flow
Select Start compute sessions to start Prompt flow.
Select Validate and parse input to renew parameters.
Set the Value of the connection input to the custom connection you created, for example, connection.
Chat with your custom Phi-3 / Phi-3.5 model
Select Chat.
Here’s an example of the results: Now you can chat with your custom Phi-3 / Phi-3.5 model. It is recommended to ask questions based on the data used for fine-tuning.
Deploy Azure OpenAI to evaluate the Phi-3 / Phi-3.5 model
To evaluate the Phi-3 / Phi-3.5 model in Azure AI Studio, you need to deploy an Azure OpenAI model. This model will be used to evaluate the performance of the Phi-3 / Phi-3.5 model.
Deploy Azure OpenAI
Sign in to Azure AI Studio.
Navigate to the Azure AI Studio project that you created.
In the Project that you created, select Deployments from the left side tab.
Select + Deploy model from the navigation menu.
Select Deploy base model.
Select Azure OpenAI model you’d like to use. For example, gpt-4o.
Select Confirm.
Evaluate the fine-tuned Phi-3 / Phi-3.5 model using Azure AI Studio’s Prompt flow evaluation
Start a new evaluation
Visit Azure AI Studio.
Navigate to the Azure AI Studio project that you created.
In the Project that you created, select Evaluation from the left side tab.
Select + New evaluation from the navigation menu.
Select Prompt flow evaluation.
Perform the following tasks:
Enter the evaluation name. It must be a unique value.
Select Question and answer without context as the task type, because the ULTRACHAT_200k dataset used in this tutorial does not contain context.
Select the prompt flow you’d like to evaluate.
Select Next.
Perform the following tasks:
Select Add your dataset to upload the dataset. For example, you can upload the test dataset file, such as test_data.jsonl, which is included when you download the ULTRACHAT_200k dataset.
Select the appropriate Dataset column that matches your dataset. For example, if you are using the ULTRACHAT_200k dataset, select ${data.prompt} as the dataset column.
Select Next.
Perform the following tasks to configure the performance and quality metrics:
Select the performance and quality metrics you’d like to use.
Select the Azure OpenAI model that you created for evaluation. For example, select gpt-4o.
Perform the following tasks to configure the risk and safety metrics:
Select the risk and safety metrics you’d like to use.
Select the threshold to calculate the defect rate you’d like to use. For example, select Medium.
For question, set the Data source to {$data.prompt}.
For answer, set the Data source to {$run.outputs.answer}.
For ground_truth, set the Data source to {$data.message}.
Select Next.
Select Submit to start the evaluation.
The evaluation will take some time to complete. You can monitor the progress in the Evaluation tab.
Review the Evaluation Results
The results presented below are intended to illustrate the evaluation process. In this tutorial, we have used a model fine-tuned on a relatively small dataset, which may lead to sub-optimal results. Actual results may vary significantly depending on the size, quality, and diversity of the dataset used, as well as the specific configuration of the model.
Once the evaluation is complete, you can review the results for both performance and safety metrics.
Performance and quality metrics:
Evaluate the model's effectiveness in generating coherent, fluent, and relevant responses.
Risk and safety metrics:
Ensure that the model’s outputs are safe and align with Responsible AI Principles, avoiding any harmful or offensive content.
You can scroll down to view Detailed metrics result.
By evaluating your custom Phi-3 / Phi-3.5 model against both performance and safety metrics, you can confirm that the model is not only effective, but also adheres to responsible AI practices, making it ready for real-world deployment.
Congratulations!
You’ve completed this tutorial
You have successfully evaluated the fine-tuned Phi-3 model integrated with Prompt flow in Azure AI Studio. This is an important step in ensuring that your AI models not only perform well, but also adhere to Microsoft’s Responsible AI principles to help you build trustworthy and reliable AI applications.
Clean Up Azure Resources
Clean up your Azure resources to avoid additional charges to your account. Go to the Azure portal and delete the following resources:
The Azure Machine learning resource.
The Azure Machine learning model endpoint.
The Azure AI Studio Project resource.
The Azure AI Studio Prompt flow resource.
Next Steps
Documentation
microsoft/Phi-3CookBook
Assess AI systems by using the Responsible AI dashboard
Evaluation and monitoring metrics for generative AI
Azure AI Studio documentation
Prompt flow documentation
Training Content
Introduction to Microsoft’s Responsible AI Approach
Introduction to Azure AI Studio
Reference
microsoft/Phi-3CookBook
What is Responsible AI?
Announcing new tools in Azure AI to help you build more secure and trustworthy generative AI applications
Evaluation of generative AI applications
Microsoft Tech Community – Latest Blogs –Read More
Bring Your Organizational Data to Azure AI Services with Microsoft Graph
Using AI to connect your business data with the AI applications you rely on isn’t just a nice-to-have—it’s essential in the current landscape.
By linking data from platforms like Microsoft 365 into AI-driven apps, you can simplify tasks, reduce the need to switch between apps, and boost productivity.
This blog will walk you through how to easily connect your business data to Azure (an extension of that could be integrating it with the Azure OpenAI services) using Microsoft Graph, showing you just how powerful and straightforward these tools can be.
Why Integrate Your Data?
Imagine you’re deep in a project and need to find a specific document, email, or chat from Microsoft Teams. Normally, you’d have to jump between Outlook, OneDrive, and Teams, disrupting your workflow and wasting time. This is where integrating your business data into your applications becomes incredibly useful.
By using Microsoft Graph and Azure OpenAI services, you can pull all this information directly into your app, keeping everything in one place. This not only saves time but also helps you stay focused on your work. Whether you need to find files, emails, or chat histories, integrating these tools can simplify your day and keep you on track.
Core Use Cases for Microsoft Graph Enhanced by Generative AI
Microsoft Graph is versatile, and its applications are numerous. Here are some common use cases, now supercharged with generative AI:
Automating Microsoft 365 Workflows with Generative AI
Use Microsoft Graph in combination with generative AI to automate tasks such as:
Email Management: Not only can you automatically sort and respond to emails, but generative AI can draft personalized responses, summarize lengthy email threads, and even predict and prioritize emails that require immediate attention.
File Operations: Beyond managing files in OneDrive and SharePoint, generative AI can assist in creating content, generating summaries of documents, and suggesting relevant files based on the context of your work.
User Management: Automate user provisioning and updates, while generative AI can predict user needs, suggest role changes, and provide insights into user behavior and engagement.
Integrating Microsoft Teams to Enhance Productivity with Generative AI
Microsoft Graph enables deep integrations with Teams, and generative AI takes it a step further by allowing you to:
Create Teams and Channels: Automate the setup of new teams for projects, and use generative AI to suggest optimal team structures, recommend channels based on project requirements, and even draft initial posts to kickstart discussions.
Manage Conversations: Archive or monitor conversations for compliance, while generative AI can analyze conversation trends, detect sentiment, and provide insights into team dynamics and areas for improvement.
Custom Bots: Develop bots that interact with Teams users, enhanced with generative AI to provide more natural and context-aware interactions, answer complex queries, and even assist in decision-making processes.
By leveraging generative AI, Microsoft Graph can not only automate and streamline workflows but also provide intelligent insights and personalized experiences, significantly boosting productivity and efficiency.
Getting Started with Microsoft Graph
Microsoft Graph is a powerful API that lets you connect to various data points in Microsoft 365. With it, you can pull in emails, chats, files, and more into your application. To begin, you’ll need to set up something called an “App Registration” in Microsoft Entra ID (formerly Azure Active Directory). This registration allows your app to access the data securely.
Step 1: Set Up App Registration
Log in to the Azure Portal and navigate to Microsoft Entra ID.
Create a new app registration by giving it a name.
Select the type of accounts that can access this app—whether it’s just within your organization or available to users in other organizations as well.
Configure the Redirect URI if you’re developing a web app. For local development, this might look like http://localhost:3000.
Here’s a basic example of how your app registration might look in code:
{
  "client_id": "YOUR_CLIENT_ID",
  "tenant_id": "YOUR_TENANT_ID",
  "redirect_uri": "http://localhost:3000"
}
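If you prefer to script the registration rather than use the portal, a rough Azure CLI equivalent is sketched below; the display name and redirect URI are placeholder values, and the exact parameters may vary with your CLI version.

```powershell
# Hypothetical CLI alternative to the portal app registration above.
az ad app create `
  --display-name "my-graph-sample-app" `
  --web-redirect-uris "http://localhost:3000"

# The output includes the appId (client_id); the tenant id can be read with:
az account show --query tenantId -o tsv
```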
Now that your app is registered, you can start pulling in data using Microsoft Graph. We’ll be using a library called Microsoft Graph Toolkit (MGT), which makes this process much simpler.
Step 2: Install Microsoft Graph Toolkit
First, install the MGT package:
npm install @microsoft/mgt
In your app, you’ll want to set up a provider that will handle authentication and make it easier to call Microsoft Graph APIs.
Step 3: Set Up Authentication
Create a graphService.js file where you’ll configure the provider:
import { Providers, MsalProvider } from '@microsoft/mgt';

export const initGraph = () => {
  if (!Providers.globalProvider) {
    Providers.globalProvider = new MsalProvider({
      clientId: 'YOUR_CLIENT_ID',
      scopes: ['User.Read', 'Files.Read', 'Mail.Read', 'Chat.Read']
    });
  }
};
This snippet sets up the authentication process using your app registration details.
Once authentication is set up, you can start pulling data like files, emails, and chats into your app. Let’s look at a couple of ways to do this.
Step 4: Fetch and Display Files
You can fetch files related to a specific project or customer. Here’s how you might do that:
import { graph } from '@microsoft/mgt';

const getFiles = async (query) => {
  const response = await graph
    .api(`/me/drive/search(q='${query}')`)
    .get();
  return response.value;
};

// Example usage:
getFiles('ProjectX').then(files => {
  console.log(files);
});
Step 5: Use MGT Components to Simplify
Instead of writing the above code, you can use MGT’s ready-made components to fetch and display data with minimal code.
<mgt-file-list></mgt-file-list>
This single line of code will automatically pull in and display the user’s files. It’s simple, powerful, and easy to implement.
Microsoft Tech Community – Latest Blogs –Read More
DAPR, KEDA on ARO (Azure RedHat OpenShift): step by step
In this article, we focus on the configuration needed to run DAPR and KEDA on ARO (Azure RedHat OpenShift).
To that end, I put together a GitHub repository called "App-Plant-Tree" that covers Cloud-Native Architecture concepts by combining the following technologies:
Go – Producer/Consumer App
Distributed Application Runtime – DAPR
Kubernetes Event Driven Autoscaling – KEDA
Azure RedHat OpenShift (ARO)
Azure Container Registry (ACR)
Go SDK
Azure CLI
OpenShift CLI
DAPR CLI
Kubectl
Helm CLI
GIT bash
Visual Studio Code
Log in to Azure using the CLI:
Set the variable values according to your environment:
- $Location = ''
- $ResourceGroupName = ''
- $ClusterName = ''
- $ContainerRegistryName = ''
- $ServiceBusNamespace = ''
Select your Azure subscription:
Create the resource group:
Create the virtual network
Create the subnet for the control plane
Create the subnet for the workers
Disable network policy settings for the Private Link Service
Create the ARO cluster:
Create the Container Registry (a consolidated sketch of the CLI commands for these steps is shown below):
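The exact commands behind the steps above are not reproduced in this text, so here is a consolidated sketch based on the standard ARO quickstart flow, using the variables defined earlier. The virtual network name, address prefixes, subnet names, and ACR SKU are assumptions; adapt them to your environment.

```powershell
az account set --subscription "<your subscription id>"
az group create --name $ResourceGroupName --location $Location

# Virtual network and subnets for the ARO control plane and workers
az network vnet create --resource-group $ResourceGroupName --name aro-vnet --address-prefixes 10.0.0.0/22
az network vnet subnet create --resource-group $ResourceGroupName --vnet-name aro-vnet `
  --name master-subnet --address-prefixes 10.0.0.0/23
az network vnet subnet create --resource-group $ResourceGroupName --vnet-name aro-vnet `
  --name worker-subnet --address-prefixes 10.0.2.0/23

# Disable Private Link Service network policies on the control-plane subnet
az network vnet subnet update --resource-group $ResourceGroupName --vnet-name aro-vnet `
  --name master-subnet --disable-private-link-service-network-policies true

# Create the ARO cluster and the container registry
az aro create --resource-group $ResourceGroupName --name $ClusterName `
  --vnet aro-vnet --master-subnet master-subnet --worker-subnet worker-subnet
az acr create --resource-group $ResourceGroupName --name $ContainerRegistryName --sku Standard
```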
Connect the Container Registry to ARO:
oc create secret docker-registry --docker-server=$ContainerRegistryName.azurecr.io --docker-username=<user name> --docker-password=<your password> --docker-email=unused acr-secret
oc secrets link default <pull_secret_name> --for=pull
Get the OpenShift console URL
Get the OpenShift credentials:
Validate the connection to the cluster (see the sketch below):
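A minimal sketch of these three steps, assuming the variables defined earlier (the kubeadmin password comes from the list-credentials output):

```powershell
# OpenShift console URL
az aro show --resource-group $ResourceGroupName --name $ClusterName --query "consoleProfile.url" -o tsv

# kubeadmin credentials
az aro list-credentials --resource-group $ResourceGroupName --name $ClusterName

# Log in with the OpenShift CLI and validate the connection
$apiServer = az aro show --resource-group $ResourceGroupName --name $ClusterName --query "apiserverProfile.url" -o tsv
oc login $apiServer -u kubeadmin -p "<kubeadmin password>"
oc get nodes
```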
Add the repository references:
helm repo update
helm upgrade --install dapr dapr/dapr --namespace dapr-system --create-namespace
helm upgrade --install dapr-dashboard dapr/dapr-dashboard --namespace dapr-system --create-namespace
Check that the pods are running (see the sketch below):
Expected response:
DAPR dashboard available at http://localhost:8080
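The verification commands themselves are not shown above; a minimal sketch, assuming the dapr-system namespace used in the Helm install (the dapr dashboard command prints the "DAPR dashboard available" message shown as the expected response):

```powershell
# Check that the Dapr control-plane and dashboard pods are running
kubectl get pods -n dapr-system

# Port-forward the Dapr dashboard installed above
dapr dashboard -k -p 8080
```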
Add the repository references:
helm repo update
helm upgrade --install keda kedacore/keda -n keda-system --create-namespace
helm upgrade --install keda-add-ons-http kedacore/keda-add-ons-http -n keda-system --create-namespace
Verify that the pods are running:
In this project, three different options are demonstrated (choose one):
Azure Service Bus
Redis
RabbitMq
docker build -t "$ContainerRegistryName.azurecr.io/consumer-app:1.0.0" -f cmd/consumer/dockerfile .
docker build -t "$ContainerRegistryName.azurecr.io/producer-app:1.0.0" -f cmd/producer/dockerfile .
docker push "$ContainerRegistryName.azurecr.io/consumer-app:1.0.0"
docker push "$ContainerRegistryName.azurecr.io/producer-app:1.0.0"
Check that the pods are running:
kubectl logs -f -l app=consumer1 --all-containers=true -n tree
# configure the port for local access
kubectl port-forward pod/producer1 8081:8081 -n tree
# send a POST request to the producer application
- POST -> http://localhost:8081/plant
- JSON body: {"numberOfTrees":100}
# check the status of the pods
kubectl get pod -l app=consumer1 -n tree
After finishing your tests, the following commands will help you uninstall all the application components and also delete all the resources in Azure.
helm uninstall keda -n keda-system
helm uninstall dapr -n dapr-system
Delete all Azure resources:
az acr delete --resource-group $ResourceGroupName --name $ContainerRegistryName
az group delete --name $ResourceGroupName
DAPR KEDA GO Project
DAPR – Pros/Cons
KEDA – Pros/Cons
Microsoft Tech Community – Latest Blogs –Read More
Remote Desktop Services enrolling for TLS certificate from an Enterprise CA
Hey! Rob Greene again. Been on a roll with all things crypto as of late, and you are not going to be disappointed with this one either!
Background
Many know that Remote Desktop Services uses a self-signed certificate for its TLS connection from the RDS Client to the RDS Server over the TCP 3389 connection by default. However, Remote Desktop Services can be configured to enroll for a certificate against an Enterprise CA, instead of continuing to use those annoying self-signed certificates everywhere.
I know there are other blogs out there that cover setting up the certificate template, and the group policy, but what if I told you most of the blogs that I have seen on this setup are incomplete, inaccurate, and do not explain what is happening with the enrollment and subsequent renewals of the RDS certificate!? I know… Shocker!!!
How this works
The Remote Desktop Service looks for a certificate in the computer personal store that has a specific Enhanced Key Usage (EKU) with the Object Identifier (OID) 1.3.6.1.4.1.311.54.1.2, typically named Remote Desktop Authentication, or one with Server Authentication. It prefers a certificate with the Remote Desktop Authentication OID. https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-R2-and-2012/dn781533(v=ws.11)
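As a quick way to see whether such a certificate exists, the following PowerShell one-liner (a sketch, not from the original guidance) lists computer certificates that carry the Remote Desktop Authentication EKU:

```powershell
# List computer certificates that include the Remote Desktop Authentication EKU (1.3.6.1.4.1.311.54.1.2)
Get-ChildItem Cert:\LocalMachine\My |
    Where-Object { $_.EnhancedKeyUsageList.ObjectId -contains '1.3.6.1.4.1.311.54.1.2' } |
    Format-List Subject, Thumbprint, NotAfter
```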
Sidebar:
If you are a pretty regular consumer of the AskDS blog content you know how we love to recommend using one certificate on the server for a specific Enhanced Key Usage (EKU), and make sure that you have all the information required on the certificate so that it works with all applications that need to use the certificate.
This certificate is no different. I would recommend that the certificate that is used ONLY has the EKU for Remote Desktop Authentication and DOES NOT have an EKU of Server Authentication at all. The reason for this is that this certificate should not be controlled / maintained via Autoenrollment/renewal behaviors. This needs to be maintained by the Remote Desktop Configuration service, and you do not want certificates being used by other applications being replaced by a service like this as it will cause an issue in the long run.
There is a group policy setting that can be enabled to configure the Remote Desktop Service to enroll for the specified certificate; it also gives the NT Authority\NetworkService account permission to the certificate's private key, which is a requirement for this to work.
The interesting thing about this is that you would think the Remote Desktop Services service would be responsible for enrolling for this certificate; however, it is the Remote Desktop Configuration (SessionEnv) service that is responsible for initial certificate requests as well as certificate renewals.
It is common to see the RDS Authentication Certificate template configured for autoenrollment, however this is one of the worst things you can do and WILL cause issues with Remote Desktop Services once the certificate renewal timeframe comes around. Autoenrollment will archive the existing certificate, causing RDS to no longer be able to find it; then, when you require TLS on the RDS listener, users will fail to connect to the server. At some point, the Remote Desktop Configuration service will replace the newly issued certificate with a new one, because it maintains the thumbprint of the certificate that RDS should be using within WMI. When it tries to locate the original thumbprint and cannot find it, it will attempt to enroll for a new certificate at the next service start. This is generally when the cases start rolling in to the Windows Directory Services team, because it appears to be a certificate issue even though it is a Remote Desktop Services configuration issue.
What we want to do is first make sure that all the steps are taken to properly configure the environment so that the Remote Desktop Configuration service is able to properly issue certificates.
The Steps
Like everything in IT (information technology), there is a list of steps that need to be completed to get this setup properly.
Configure the certificate template and add it to a Certification Authority to issue the template.
Configure the Group Policy setting.
Configuring the Certificate Template
The first step in the process is to create and configure the certificate template that we want to use:
Log on to a computer that has the Active Directory Certificate Services Tools Remote Server Administration Tools (RSAT) installed or a Certification Authority within the environment.
Launch: CertTmpl.msc (Certificate Template MMC)
Find the template named Computer, right click on it and select Duplicate Template.
On the Compatibility tab, select up to Windows Server 2012 R2 for Certification Authority and Certificate recipient. Going above this might cause issues with CEP / CES environments.
On the General tab, we need to give the template a name and validity period.
Type in a good descriptive name in the Template display name field.
If you would like to change the Validity period, you can do that as well.
You should NOT check the box Publish certificate in Active Directory.
NOTE: Make sure to copy the value in the Template name field, as this is the name that you will need to type in the group policy setting. Normally it will be the display name without any spaces in the name, but do not rely on this. Use the value you see during template creation or when looking back at the template later.
6. On the Extensions tab, the Enhanced Key Usage / Application Policies need to be modified.
a. Select Application Policies, and then click on the Edit button.
b. Multi select or select individually Client Authentication and Server Authentication and click the Remove button.
c. Click the Add button, and then click on the New button if you need to create the Application Policy for Remote Desktop Authentication. Otherwise find the Remote Desktop Authentication policy in the list and click the OK button.
d. If you need to create the Remote Desktop Authentication application policy, click the Add button, and then for the Name type in Remote Desktop Authentication, and type in 1.3.6.1.4.1.311.54.1.2 for the Object identifier value, and click the OK button.
e. Verify the newly created Remote Desktop Authentication application policy, and then click the OK button twice.
7. The Remote Desktop Service can use a Key Storage Provider (KSP). So, if you would like to change over from a legacy Cryptographic Service Provider (CSP) to a Key Storage Provider, this can be done on the Cryptography tab.
8. Get the permissions set properly. To do this click on the Security tab.
a. Click the Add button and add any specific computer or computer groups you want to enroll for a certificate.
b. Then make sure to ONLY select Allow Enroll permission. DO NOT select Autoenroll.
NOTE: Please keep in mind that Domain Controllers DO NOT belong to the Domain Computers group, so if you want all workstations, member server and Domain Controllers to enroll for this certificate, you will need Domain Computers and Enterprise Domain Controllers or Domain Controllers groups added with the security permission of Allow – Enroll.
9. When done making other changes to the template as needed, click the OK button to save the template.
Configure the Group Policy
After working through getting the certificate template created and configured to your liking, the next step in the process is to set up the Group Policy Object properly. The group policy setting that needs to be configured is located at: Computer Configuration\Policies\Administrative Templates\Windows Components\Remote Desktop Services\Remote Desktop Session Host\Security
with the policy "Server authentication certificate template".
When adding the template name to this group policy it will accept one of two things:
Certificate template name, again this is NOT the certificate template display name.
Certificate template's Object Identifier value. Using this is not common; however, some engineers recommend it over the template name.
If you use the certificate template display name, the Remote Desktop Configuration service (SessionEnv) will successfully enroll for the certificate; however, the next time the policy applies it will enroll for a new certificate again. This causes repeated enrollments and can make a CA very busy.
Troubleshoot issues of certificate issuance
Troubleshooting problems with certificate issuance is usually easy once you have a good understanding of how Remote Desktop Services goes about doing the enrollment, and there are only a few things to check out.
Investigating which certificate the Remote Desktop Service is configured to use.
The first thing to investigate is which certificate, if any, the Remote Desktop Service is currently configured to use. This is done by running a WMI query, either via PowerShell or good ol' WMIC. (Note: WMIC is deprecated and will be removed at a future date.)
PowerShell: Get-WmiObject -Class "Win32_TSGeneralSetting" -Namespace Root\cimv2\TerminalServices
WMIC: wmic /namespace:\\root\cimv2\TerminalServices PATH Win32_TSGeneralSetting Get SSLCertificateSHA1Hash
We are interested in the SSLCertificateSHA1Hash value that is returned. This will tell us the thumbprint of the certificate it is attempting to load.
Keep in mind that if the Remote Desktop Service is still using the self-signed certificate, it can be found as follows:
Launch the local computer certificate store (CertLM.msc).
Once the computer store is open, look for the store named: Certificates – Local Computer\Remote Desktop\Certificates.
Double-click the certificate, click the Details tab, and find the field named Thumbprint.
Then validate whether this value matches the SSLCertificateSHA1Hash value from the output.
If there is no certificate in the Remote Desktop store, or if the SSLCertificateSHA1Hash value does not match the Thumbprint field of any certificate in the store, then it is best to visit the Certificates – Local Computer\Personal\Certificates store next. Look for a certificate whose Thumbprint field matches the SSLCertificateSHA1Hash value.
Does the Remote Desktop Service have permission to the certificate's private key?
Once the certificate has been tracked down, we must figure out whether it has a private key and, if so, whether the account running the service has permission to use that private key.
If you are using Group Policy to deploy the certificate template information and the computer has permission to enroll for the certificate, then the private key permissions should in theory be configured properly, with NT Authority\NetworkService granted Allow – Read on the private key.
If you are having this problem, then more than likely the environment is NOT configured to deploy the certificate template via the group policy setting and is instead relying on computer certificate autoenrollment and a certificate that is valid for Server Authentication. Relying on certificate autoenrollment will not configure the correct permissions for the private key or add the Network Service account.
To check this, follow these steps:
1. Launch the local computer certificate store (CertLM.msc).
2. Once the Computer store is open, look for the store named: Certificates – Local Computer\Personal\Certificates.
3. Right click on the certificate that you are interested in, then select All Tasks, and click on Manage Private Keys.
4. Verify that the Network Service account has Allow – Read permissions. If not, then add it:
a. Click the Add button.
b. In the Select Users or Groups, click the Locations button, and select the local computer in the list.
c. Type in the name "Network Service".
d. Then click the Check Names button, and then click the OK button.
5. If the certificate does not appear to have a private key associated with it in the Local Computer Certificate store snap-in, then you may want to run the following CertUtil command to see if you can repair the association: CertUtil -RepairStore My [* / CertThumbprint]. (A sketch for inspecting the private key's file permissions follows these steps.)
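To inspect the private key's file permissions without clicking through the UI, something like the sketch below can help. It assumes the certificate uses a legacy CSP (RSA) key, which is what the typical RDS template requests; for CNG keys the PrivateKey property is empty and the key file lives under a different path.
PowerShell:
# Sketch: show the NTFS permissions on the machine key file behind the certificate.
# Assumes a legacy CSP (RSA) key; $hash holds the thumbprint identified earlier.
$cert = Get-ChildItem Cert:\LocalMachine\My | Where-Object { $_.Thumbprint -eq $hash }
$keyFile = $cert.PrivateKey.CspKeyContainerInfo.UniqueKeyContainerName
icacls "$env:ProgramData\Microsoft\Crypto\RSA\MachineKeys\$keyFile"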
How to change the certificate that Remote Desktop Services is using
If you have determined that Remote Desktop Services is using the wrong certificate, there are a couple of things that we can do to resolve this.
1. We can delete the certificate from the Computer Personal store and then cycle the Remote Desktop Configuration (SessionEnv) service. This causes immediate enrollment of a certificate using the certificate template defined in the group policy.
PowerShell:
$RDPSettings = Get-WmiObject -Class "Win32_TSGeneralSetting" -Namespace Root\cimv2\TerminalServices -Filter "TerminalName='RDP-Tcp'"
CertUtil -DelStore My $RDPSettings.SSLCertificateSHA1Hash
Net Stop SessionEnv
Net Start SessionEnv
2. We could update the Thumbprint value in WMI to reference another certificate's thumbprint.
PowerShell:
$PATH = (Get-WmiObject -Class "Win32_TSGeneralSetting" -Namespace Root\cimv2\TerminalServices -Filter "TerminalName='RDP-Tcp'").__PATH
Set-WmiInstance -Path $PATH -Arguments @{SSLCertificateSHA1Hash="CERTIFICATETHUMBPRINT"}
WMIC: wmic /namespace:\\root\cimv2\TerminalServices PATH Win32_TSGeneralSetting Set SSLCertificateSHA1Hash="CERTIFICATETHUMBPRINT"
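If you would rather stay entirely in PowerShell for either option, the following sketches show the same two operations: removing the bound certificate via the Cert: drive and restarting the service so it re-enrolls (reusing $RDPSettings from option 1), and updating the thumbprint with the CIM cmdlets on newer PowerShell versions (CERTIFICATETHUMBPRINT is a placeholder for the real value, with no spaces).
PowerShell:
# Option 1 alternative: delete the bound certificate with the certificate provider, then restart the service.
Get-ChildItem Cert:\LocalMachine\My |
    Where-Object { $_.Thumbprint -eq $RDPSettings.SSLCertificateSHA1Hash } |
    Remove-Item
Restart-Service SessionEnv

# Option 2 alternative: update the configured thumbprint using the CIM cmdlets.
$ts = Get-CimInstance -ClassName Win32_TSGeneralSetting -Namespace root/cimv2/TerminalServices -Filter "TerminalName='RDP-Tcp'"
Set-CimInstance -InputObject $ts -Property @{SSLCertificateSHA1Hash = "CERTIFICATETHUMBPRINT"}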
Conclusion
1. Deploying certificates for Remote Desktop Services is best done with the Group Policy setting; do NOT set up the certificate template for autoenrollment. Setting the template up for autoenrollment will cause certificate issuance problems within the environment from multiple angles.
Unless you modify the certificate template's default Key Permissions setting, found on the Request Handling tab, the account running the Remote Desktop Service will not have permission to the private key if the certificate is acquired via autoenrollment. This is not something that we would recommend.
This creates a scenario where, even if the SSLCertificateSHA1Hash value is correct, the service will not be able to use the certificate because it does not have permission to use the private key. Even if you do have the template configured with custom private key permissions, you could still have issues with the WMI SSLCertificateSHA1Hash value not being correct.
2. Configure the group policy setting properly as well as the certificate template. It is best to manage this configuration via group policy so that you can ensure a consistent experience for all RDS connections.
I know that a lot of you might have deeper questions about how the Remote Desktop Configuration service does this enrollment process; however, please keep in mind that the Remote Desktop Service is really owned by the Windows User Experience team in CSS, so we Windows Directory Services engineers may not have that deeper level of knowledge. We just get called in when the certificates do not work or fail to get issued. This is how we tend to know so much about the most common misconfigurations for this solution.
Rob “Why are RDS Certificates so complicated” Greene
Microsoft Tech Community – Latest Blogs – Read More
Office 365 for IT Pros September 2024 Update
Monthly Update #111 for Office 365 for IT Pros eBook
The Office 365 for IT Pros eBook team is delighted to announce that files are available for download for the September 2024 update of:
Office 365 for IT Pros (2025 edition) in PDF and EPUB formats.
Automating Microsoft 365 with PowerShell in PDF and EPUB formats.
Automating Microsoft 365 with PowerShell is available as part of the Office 365 for IT Pros bundle and as a separate product.
Subscribers can download the updated files using the link in the receipt emailed to them after their original purchase or from the library in their Gumroad.com account. We no longer make a Kindle version of the Office 365 for IT Pros eBook available through Amazon. It proved too difficult to release updates to readers through the convoluted Amazon process. The Automating Microsoft 365 with PowerShell book is available through Amazon in Kindle and paperback versions. The paperback is our first attempt at delivering a printed book and the response has been interesting. I guess some folk still like to have text on paper as a reference.
See our change log for information about the changes in the September 2024 update and our FAQ for details about how to download updates.
Changes in the Ecosystem
To ensure that the book content is updated and remains current, we spend a lot of time tracking change within the Microsoft 365 ecosystem. Three issues that are causing people some concerns are:
Microsoft plans to require accounts that connect to Azure administrative portals, like the Azure portal, Entra admin center, and Intune admin center, or that use the Azure PowerShell module and CLI, to use multifactor authentication. The requirement swings into force on October 15. In many respects, this is an excellent idea because the only accounts that access these sites are by definition administrator accounts, and all administrator accounts should be protected. But people assume that Microsoft will force all accounts to use MFA and that's just not correct. More information is available here.
This month Microsoft plans to update Exchange Online with a revised SMTP AUTH Clients submission report to help organizations understand if apps and devices are using SMTP AUTH with basic authentication to submit messages to Exchange. The plan is to remove basic authentication for SMTP AUTH in September 2025, and the signs are that some organizations will struggle with this deadline as they do not know how to upgrade hardware (devices like multifunction printers) or apps to support OAuth. Follow the discussion online and if you have concerns, voice them there. Ian McDonald from the Exchange development group is responding to queries as they arise.
The new Outlook for Windows is generally available, and Microsoft is renaming the older Win32 version to be Outlook (classic). The rename process for the application is starting around now. Microsoft still plans to support Outlook classic until 2029 at the earliest so there’s no cause for immediate concern. The new Outlook is not ready to take over from Outlook classic yet and won’t be for several years. But it is the case that new functionality will increasingly be only available in the new Outlook (and likely OWA), and that’s something to take into consideration as Microsoft 365 tenants plan their client strategy for the coming years.
Other stuff is happening too – and all the time – but these are three of the big issues I hear discussed on an ongoing basis.
Discounted Subscriptions
We have traditionally allowed subscribers of prior editions to continue their subscriptions to cover a new edition at discounted rates. The cheapest way to upgrade is always within three weeks of the release of a new edition. After that, we start to gradually reduce the discount. Our discount period finished today and there are no longer general discounts available for previous subscribers. Instead, we're reaching out to people who have supported us over several editions to offer targeted discounts. We think this is a fairer approach to reward people who have helped us and to control the misuse of discount codes.
We know of about 70 cases where people who have never subscribed before have taken out subscriptions to the 2025 edition using codes that we made available to previous subscribers. Sometimes this happens because people pass their subscription to co-workers and sometimes it's because people just like to share. In any case, our ability to offer discounted subscriptions is compromised when codes are misused, so we're going to be a little more restrictive about how we issue discounts. I don't think anyone's doing anything particularly horrible here, but we'd like to take care of the folks who support us before anyone else gets the chance to use a discount.
On to Update #112
There's no rest for the wicked and the Office 365 for IT Pros team is already working (or so they tell me) on update #112, which we anticipate releasing on October 1. No doubt lots will happen between now and then to add to the rich tapestry of life and the joys (!!!) of coping with constant change inside the Microsoft 365 ecosystem.