Month: September 2024
Transferring Reusable PowerShell Objects Between Microsoft 365 Tenants
The Graph SDK’s ToJsonString Method Proves Its Worth
One of the frustrations about using the internet is when you find some code that seems useful, copy the code to try it out in your tenant, and discover that some formatting issue prevents the code from running. This can happen for many reasons. Sometimes it’s as simple as an error when copying code into a web editor, and sometimes errors creep in after copying the code, perhaps when formatting it for display. I guess fixing the problems is an opportunity to learn what the code really does.
Answers created by generative AI solutions like ChatGPT, Copilot for Microsoft 365, and GitHub Copilot compound the problem by faithfully reproducing errors in their responses. This is no fault of the technology, which works by creating answers from what’s gone before. If published code includes a formatting error, generative AI is unlikely to find and fix the problem.
Dealing with JSON Payloads
All of which brings me to a variation on the problem. The documentation for Graph APIs used to create or update objects usually includes an example of a JSON-formatted payload containing the parameter values for the request. The Graph API interprets the JSON content in the payload to extract the parameters to run a request. By comparison, Microsoft Graph PowerShell SDK cmdlets use hash tables and arrays to pass parameters. The hash tables and arrays mimic the elements of the JSON structure used by the underlying Graph APIs.
Composing a JSON payload is no challenge if you can write perfect JSON. Like any other rules for programming or formatting, it takes time to become fluent with JSON, and who can afford that time when other work exists to be done? Here’s a way to make things easier.
Every object generated by a Graph SDK cmdlet has a ToJsonString method to create a JSON-formatted version of the object. For example:
$User = Get-MgUser -UserId Kim.Akers@office365itpros.com
$UserJson = $User.ToJsonString()
$UserJson
{
  "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#users/$entity",
  "id": "d36b323a-32c3-4ca5-a4a5-2f7b4fbef31c",
  "businessPhones": [ "+1 713 633-5141" ],
  "displayName": "Kim Akers (She/Her)",
  "givenName": "Kim",
  "jobTitle": "VP Marketing",
  "mail": "Kim.Akers@office365itpros.com",
  "mobilePhone": "+1 761 504-0011",
  "officeLocation": "NYC",
  "preferredLanguage": "en-US",
  "surname": "Akers",
  "userPrincipalName": "Kim.Akers@office365itpros.com"
}
The advantage of using the ToJsonString method instead of PowerShell’s ConvertTo-Json cmdlet is that the method doesn’t output properties with empty values. This makes the resulting output easier to review and manage. For instance, the JSON content shown above is much easier to use as a template for adding new user accounts than the equivalent generated by ConvertTo-Json.
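For contrast, ConvertTo-Json’s handling of empty values can be seen on any local object (a minimal sketch that needs no Graph connection; the object here is invented purely for illustration):

```powershell
# An ordinary object with one deliberately empty property
$Obj = [PSCustomObject]@{
    DisplayName = 'Kim Akers'
    JobTitle    = $null
}

# ConvertTo-Json keeps the empty property in the output ("JobTitle": null),
# whereas ToJsonString on a Graph SDK object would omit it entirely
$Obj | ConvertTo-Json
```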
Transferring a Conditional Access Policy Using ToJsonString
The output generated by ToJsonString becomes very interesting when you want to move objects between tenants. For example, let’s assume that you use a test tenant to create and fine-tune a conditional access policy. The next piece of work is to transfer the conditional access policy from the test tenant to the production environment. Here’s how I make the transfer:
Run the Get-MgIdentityConditionalAccessPolicy cmdlet to find the target policy and export its settings to JSON. Then save the JSON content in a text file.
$Policy = Get-MgIdentityConditionalAccessPolicy -ConditionalAccessPolicyId '1d4063cb-5ebf-4676-bfca-3775d7160b65'
$PolicyJson = $Policy.ToJsonString()
$PolicyJson > PolicyExport.txt
Edit the text file to replace any tenant-specific items with equivalent values for the target tenant. For instance, conditional access policies usually include an exclusion for break glass accounts, which are listed in the policy using the account identifiers. In this case, you need to replace the account identifiers for the source tenant in the exported text file with the account identifiers for the break glass account for the target tenant.
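The edit can also be scripted. A minimal sketch, assuming hypothetical account identifiers (the GUIDs and the sample export content below are placeholders for illustration, not real accounts or a real policy):

```powershell
# Placeholder identifiers -- substitute the real break glass account GUIDs
$SourceBreakGlass = '11111111-1111-1111-1111-111111111111'
$TargetBreakGlass = '22222222-2222-2222-2222-222222222222'

# For illustration only: a tiny stand-in for the exported policy file
'{ "excludeUsers": [ "11111111-1111-1111-1111-111111111111" ] }' |
    Set-Content -Path .\PolicyExport.txt

# Read the export, swap the source identifier for the target one, write it back
$PolicyJson = (Get-Content -Path .\PolicyExport.txt -Raw).Replace($SourceBreakGlass, $TargetBreakGlass)
Set-Content -Path .\PolicyExport.txt -Value $PolicyJson
```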
Disconnect from the source tenant.
Connect to the target tenant with the Policy.ReadWrite.ConditionalAccess scope.
Create a variable ($Body in this example) containing the conditional policy settings.
Run the Invoke-MgGraphRequest cmdlet to import the policy definition into the target tenant.
$Uri = "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies"
Invoke-MgGraphRequest -Uri $Uri -Method Post -Body $Body
The Other Way
Another way to create a conditional access policy with PowerShell is to run the New-MgIdentityConditionalAccessPolicy cmdlet, which takes a hash table as its payload. It’s easy to translate the JSON into the format used for parameter values stored in the hash table, but it’s even easier to run Invoke-MgGraphRequest and pass the edited version of the JSON exported from the source tenant. Why make things hard for yourself?
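For comparison, here is a sketch of the hash table approach with New-MgIdentityConditionalAccessPolicy. The settings shown are illustrative placeholders rather than a recommended policy, and the cmdlet requires an active Connect-MgGraph session with the Policy.ReadWrite.ConditionalAccess scope:

```powershell
# Illustrative settings only -- adjust to match the policy you actually want
$Params = @{
    DisplayName   = 'Example: block legacy authentication'
    State         = 'disabled'
    Conditions    = @{
        ClientAppTypes = @('exchangeActiveSync', 'other')
        Applications   = @{ IncludeApplications = @('All') }
        Users          = @{ IncludeUsers = @('All') }
    }
    GrantControls = @{ Operator = 'OR'; BuiltInControls = @('block') }
}
# Splat the hash table to create the policy
New-MgIdentityConditionalAccessPolicy @Params
```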
This tip is just one of the hundreds included in the Automating Microsoft 365 with PowerShell eBook (available separately, as part of the Office 365 for IT Pros (2025 edition) bundle, or as a paperback from Amazon.com).
undefined symbol xcb_shm_id when trying to start up MATLAB
When trying to start up MATLAB, I get
> ./bin/matlab
MATLAB is selecting SOFTWARE rendering.
/home/pblase/.MathWorks/ServiceHost/clr-df9a0cbb6bd34e079ef626671d1a7b7c/_tmp_MSHI_5363-9225-767d-e56f/mci/_tempinstaller_glnxa64/bin/glnxa64/InstallMathWorksServiceHost: symbol lookup error: /usr/lib64/libcairo.so.2: undefined symbol: xcb_shm_id
/home/pblase/.MathWorks/ServiceHost/clr-df9a0cbb6bd34e079ef626671d1a7b7c/_tmp_MSHI_5363-9225-767d-e56f/mci/_tempinstaller_glnxa64/bin/glnxa64/InstallMathWorksServiceHost: symbol lookup error: /usr/lib64/libcairo.so.2: undefined symbol: xcb_shm_id
Unexpected exception: 'N7mwboost10wrapexceptINS_16exception_detail39current_exception_std_exception_wrapperISt13runtime_errorEEEE: Error loading /home/pblase/matlab/bin/glnxa64/matlab_startup_plugins/matlab_graphics_ui/mwuixloader.so. /usr/lib64/libXt.so.6: undefined symbol: SmcModifyCallbacks: Success: Success' in createMVMAndCallParser phase 'Creating local MVM'
libcairo MATLAB Answers — New Questions
Intro to matlab lab and I have no idea how this works
<</matlabcentral/answers/uploaded_files/1765134/Screenshot%202024-09-02%20at%206.21.33%E2%80%AFPM.png>>
I don’t know what I am supposed to do with the second part of question 3 and also don’t know what to do with #4. This is my first time ever taking a class about coding so I’m super lost.
vector, vectors, variable MATLAB Answers — New Questions
Poor performance of linprog in practice
I have to solve a dynamic programming problem using a linear programming approach. For details, please see this paper. The LP that I want to solve is:
min c’*v
s.t.
A*v>=u,
where c is n*1, v is n*1, A is n^2*n, u is n^2*1.
The min is with respect to v, the value function of the original DP problem. I have a moderate number of variables, n=300, and m=n^2=90000 linear inequalities as constraints. No bound constraints on v.
I use the MATLAB function linprog, which is based on the HiGHS solver (since R2024a). The code is slow for my purposes (i.e. a brute-force value iteration is much faster). Moreover, linprog gives correct results only if I set the option 'Algorithm','dual-simplex-highs'. With other algorithms, it gets stuck.
After profiling the code, it turns out that the bottleneck is line 377 of linprog:
[x, fval, exitflag, output, lambda] = run(algorithm, problem);
I was wondering if there is a way to speed up the code. Any help or suggestion is greatly appreciated! I put below a MWE to illustrate the problem.
clear,clc,close all
%% Set parameters
crra = 2;
alpha = 0.36;
beta = 0.95;
delta = 0.1;
%% Grid for capital
k_ss = ((1-beta*(1-delta))/(alpha*beta))^(1/(alpha-1));
n_k = 300;
k_grid = linspace(0.1*k_ss,1.5*k_ss,n_k)';
%% Build current return matrix, U(k',k)
cons = k_grid'.^alpha+(1-delta)*k_grid'-k_grid;
U_mat = f_util(cons,crra);
U_mat(cons<=0) = -inf;
%% Using LINEAR PROGRAMMING
% min c'*v
% s.t.
% A*v>=u, where c is n*1, v is n*1, A is n^2*n, u is n^2*1
n = length(k_grid);
c_vec = ones(n,1);
u_vec = U_mat(:); % U(k',k), stacked columnwise
%% Build A matrix using cell-based method
tic
A = cell(n,1);
bigI = (-beta)*speye(n);
for i=1:n
    temp = bigI;
    temp(:,i) = temp(:,i)+1;
    A{i} = temp;
end
A = vertcat(A{:});
disp('Time to build A matrix with cell method:')
toc
%% Call linprog
% 'dual-simplex-highs' (default and by far the best)
options = optimoptions('linprog','Algorithm','dual-simplex-highs');
tic
[V_lin,fval,exitflag,output] = linprog(c_vec,-A,-u_vec,[],[],[],[],options);
disp('Time linear programming:')
toc
if exitflag<=0
    warning('linprog did not find a solution')
    fprintf('exitflag = %d\n',exitflag)
end
%% Now that we solved for V, compute policy function
RHS_mat = U_mat+beta*V_lin; % (k',k)
[V1,pol_k_ind] = max(RHS_mat,[],1);
pol_k = k_grid(pol_k_ind);
% Plots
figure
plot(k_grid,V1)
figure
plot(k_grid,k_grid,'--',k_grid,pol_k)
function util = f_util(c,crra)
util = c.^(1-crra)/(1-crra);
end
PROFILE
linprog, performance MATLAB Answers — New Questions
How to import .EEG or text or excel file to EEGlab
Hi all, I have 1-hour EEG data with a sampling frequency of 291 Hz. I’ve installed EEGLAB v14.1.1 and tried to load my data files in ‘.EEG’, text, and Excel formats, but none of them load into EEGLAB. It shows the following error. Please help me to solve this issue since I’m new to this EEGLAB software.
eeg, eeglab, signal processing MATLAB Answers — New Questions
Conditional formatting using formula
Hi,
I’m looking to apply a conditional format to a table (Table1) which highlights the row where a cell matches a cell within another table (Table2).
I’ve had a look online; the only thing I can find is a formula which works if I refer to an array of cells rather than another table in the workbook:
=MATCH(A2,Array1,0)
This only highlights a single cell, even if I try to apply the conditional format to the whole of Table1.
Can anyone help?
Thanks
New Outlook:
Can’t sign in to the New Outlook. In addition, my Hotmail account is blocked and I cannot access my mail.
Migrating to 365 with 2 domains
I have a client that has two different domains (old and new). Example: old email: email address removed for privacy reasons; new email: email address removed for privacy reasons. It looks like their provider created aliases for the new domain. The problem is they still get email sent to the old address that gets forwarded(?) to the new one. I want to migrate them over to 365. I’m pretty sure the migration will transfer their email history using the new email, but I’m not sure how the forwarding will work. Can I create aliases for the old email addresses in 365 to do the same?
Upcoming marketplace webinars available in September
Whether you are brand new to marketplace or have already published multiple offers, our Mastering the Marketplace webinar series has a variety of offerings to help you maximize the marketplace opportunity. Check out these upcoming webinars in September:
▪ Creating your first offer in Partner Center (9/5): Learn how to start with a new SaaS offer in the commercial marketplace; set up the required fields in Partner Center and understand the options and tips to get you started faster!
▪ Creating Plans and Pricing for your offer (9/10): Learn about the payouts process lifecycle for the Microsoft commercial marketplace, how to view and access payout reporting and what payment processes are supported within Partner Center. We will review the payouts process lifecycle for the Azure Marketplace; how to register and the registration requirements; general payout processes from start to finish; and, how to view and access payout reporting.
▪ AI and the Microsoft commercial marketplace (9/12): Through the Microsoft commercial marketplace, get connected to the solutions you need—from innovative AI applications to cloud infra and everything in between. Join this session to learn what’s on our roadmap and see how the marketplace helps you move faster and spend smarter.
▪ Developing your SaaS offer (9/12): In this technical session, learn how to implement the components of a fully functional SaaS solution including how to implement a SaaS landing page and webhook to subscribe to change events, and how to integrate your SaaS product into the marketplace.
Find our complete schedule here: https://aka.ms/MTMwebinars
#ISV #maximizemarketplace #Azure #MSMarketplace #MSPartners
Formula returning dash when I add a new cell
Extremely frustrating: I use this sheet to track my side job pay and it glitches every time I try to edit it and returns 0. I am trying to add August to the gross pay total.
Tasks
When I open Tasks I get “The task owner has restricted this action,” and “This list cannot be modified as it no longer exists.” I am horrified as I use it every day. I can’t modify the task in any way. How can I fix this?
A generalisation of the MAP lambda helper function
Discussion topic. Your thoughts are welcome.
On Saturday I finally bit the bullet and completed a MAPλ Lambda function that generalises the in-built MAP Lambda helper function. As examples, I tried problems of generating the Kronecker product of two matrices and then one of generating variants of an amortisation table.
The original amortisation schedule uses SCAN to calculate closing balances step by step from opening balances. Having returned the closing balances as an array, the principal is inserted at the first element to give opening balances. An array calculation based on the same code is used to return other values of interest using HSTACK.
Following that, I created the array of loan terms {10, 15, 20} (yrs) and used the formula
= MAPλ(variousTerms, AmortisationTableλ(principal, rate, startYear))
to generate the set of amortisation tables as a single spilled range.
I have posted a copy of MAPλ on GitHub
A version of Excel MAP helper function that will return an array of arrays (github.com)
The intention is that the function can be used without knowing how it works but you are, of course, welcome to try to pick through it.
Update Error for Windows 11 Insider Preview (10.0.26120.1542)
Hi!
When the update Windows 11 Insider Preview (10.0.26120.1542) started, it reached 1% and suddenly stopped.
I tried to run the Windows Update troubleshooter in Settings, and it showed error 0x803C010A and didn’t proceed either.
Anyone solved this problem?
Thanks
How to sync Outlook Notes with Gmail account
I have Outlook 2021 desktop installed on my PC. I would like to sync the Outlook Notes with my Google Workspace account. Is this possible?
Default SQL Server Connection for SSMS
SQL 2019 – SSMS 19.3.4.0
I was always wrongly under the impression that SSMS required a server connection in the Object Explorer to run a script against. We have databases with the same names on 2 servers as we’re preparing for migration and I accidentally ran a script on server B, even though there appeared to be no connection open to server B. Only Server A was connected in the object explorer. I was then shocked to find that any new sql script I opened was connected to server B which had been closed out in Object Explorer.
What controls the default server for a script when opened via File / Open in SSMS? What is the best way to lock a script to a specific server, or to make it more obvious which server it is being applied to? I may need to get used to looking at the bottom right where the SQL Server name is displayed, but I’d like to make it more foolproof.
I see activating SQLCMD Mode on the Query Menu is one option, but I wonder what the downside to this might be such that it is not default behaviour.
AI Studio End-to-End Baseline Reference Implementation
Azure AI Studio is designed to cater to the growing needs of developers seeking to integrate advanced AI capabilities into their applications with a focus on operational excellence. Addressing key factors such as security, scalability, and regulatory adherence, Azure AI Studio ensures that AI deployments are seamless, sustainable, and strategically aligned with business objectives.
We’re excited to present the end-to-end baseline reference implementation for Azure AI Studio, a definitive guide designed to facilitate the deployment of AI workloads in the cloud. This architecture has been designed to assist organizations in finding structured solutions for deploying AI applications that are production ready in an enterprise environment at scale.
Features of the Baseline Architecture
This architecture incorporates several important features:
Secure Network Perimeter: Creates a secure boundary for AI applications with strict network security and segmentation capabilities.
Identity Management: Implements strong access management to regulate interactions and maintain secure operations within AI services and data.
Scalability: Provides a flexible infrastructure to support the growth of AI applications, ensuring performance is not sacrificed as demand increases.
Compliance and Governance: Maintains a commitment to following enterprise governance policies and meeting compliance standards throughout the life of an AI application.
Supported Scenarios of the Baseline Architecture
The reference architecture supports various important use cases, including:
AI Studio Project Playground: An integrated environment for engaging with Azure OpenAI technologies, where you can chat with your data, test out various AI-powered assistants, and utilize completion features for text. This tool serves as a one-stop shop to assess, refine, and validate your AI-driven projects.
Promptflow Workflows: This feature supports the development of complex AI workflows, integrating elements like custom Python scripts and large language model integrations, enhancing operational excellence.
Resilient, Managed Deployments: Manages the deployment of AI applications to Azure’s managed virtual networks, ensuring solid and dependable access via client UI hosted in Azure App Service.
Self-Hosting with Azure App Service: This alternative gives enterprises full control to customize and manage Promptflow deployment using Azure App Service leveraging advanced options such as availability zones.
You can find the reference implementation at the following link: aistudio-end-to-end-baseline-architecture
Microsoft Tech Community – Latest Blogs – Read More
AI Season for Developers!
If you are passionate about Artificial Intelligence and application development, don’t miss the chance to watch this great Microsoft Reactor series. Over the season we go from the fundamentals of Azure OpenAI to the latest innovations presented at Microsoft Build 2024, closing with the powerful Semantic Kernel framework for building intelligent applications. Every session is packed with demos so you can understand each concept and apply it effectively.
Episodes:
Episode 1: Introduction to Azure OpenAI
We explore the Azure OpenAI models, their capabilities, and how to integrate them with the Azure SDK.
Episode 2: Considerations for Deploying Models in Azure OpenAI
We learn how to manage service quota, balance performance and latency, plan cost management, and apply the RAG pattern to optimize your deployments.
Episode 3: What’s New from Microsoft Build: PHI3, GPT-4o, Azure Content Safety, and More
We cover the latest announcements from Microsoft Build, including PHI 3, GPT-4o with multimodal capabilities, the new Azure AI Studio, and Azure Content Safety.
Episode 4: Getting Started with Semantic Kernel
We introduce Semantic Kernel, an open-source SDK that makes it easy to integrate advanced LLMs into your applications to create smarter, more natural experiences.
Episode 5: Build Your Own Copilot with Semantic Kernel
We learn how to use Semantic Kernel Plugins, Planners, and Memories to create copilots that work side by side with users, offering intelligent suggestions to complete tasks.
Don’t miss it! Rewatch each episode to discover how you can take your applications to the next level with Microsoft AI.
Learn more and build your AI skills throughout this series with this collection of Microsoft Learn resources:
Speakers:
Luis Beltran – Microsoft MVP – LinkedIn
Pablo Piovano – Microsoft MVP – LinkedIn
Microsoft Tech Community – Latest Blogs – Read More
Make High Quality Dataset from WARC for Pre-training
You’re welcome to follow my GitHub repo and give it a star: https://github.com/xinyuwei-david/david-share.git
In the following subsections, we will explain each step involved in generating a high-quality dataset for pre-training.
How to evaluate the quality of training data?
There are four common methods to evaluate the quality of training data, including but not limited to the following.
Using a “clean” corpus and perplexity check
Method: Train a model using a high-quality corpus (e.g., Wikipedia) and then use this model to check the perplexity of the new dataset.
Advantages:
Quick: Can quickly assess the quality of the dataset.
Simple: Relatively simple to implement, does not require complex computational resources.
Disadvantages:
Limitations: Low perplexity does not necessarily mean better performance on specific tasks.
Single Metric: Perplexity is just a single metric and cannot fully reflect the quality of the dataset.
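As an illustration of the idea (this is not code from the pipeline described later), here is a toy sketch of the perplexity check: a unigram language model with add-one smoothing is “trained” on a tiny clean corpus, and candidate text that resembles the clean corpus scores lower perplexity. A real setup would use a neural language model trained on a Wikipedia-scale corpus; every name and corpus below is made up for the example.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens):
    """Train an add-one-smoothed unigram language model on a 'clean' corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    def prob(tok):
        return (counts.get(tok, 0) + 1) / (total + vocab)
    return prob

def perplexity(prob, tokens):
    """Perplexity = exp(average negative log-likelihood per token)."""
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)

clean = "the cat sat on the mat and the dog lay on the rug".split()
prob = train_unigram(clean)

fluent = "the dog sat on the mat".split()
gibberish = "buy cheap pills online now click".split()

# Text resembling the clean corpus scores lower (better) perplexity
assert perplexity(prob, fluent) < perplexity(prob, gibberish)
```

In practice the threshold for “acceptable” perplexity has to be calibrated per domain, which is exactly the single-metric limitation noted above.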
Training small models and testing on evaluation tasks
Method: Extract a portion of data from the dataset, train a small model, and test the model’s performance on a set of specific evaluation tasks (e.g., SQuAD, GLUE, etc.).
Advantages:
Specific: Provides specific performance feedback by testing the model on actual tasks.
Diversity: Allows for the selection of various evaluation tasks to comprehensively assess the dataset quality.
Disadvantages:
Resource Demand: Requires a certain amount of computational resources and time.
Task Selection: Needs to select diverse and representative evaluation tasks, which may increase complexity.
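To make the “train a small model, test on a task” loop concrete, here is a deliberately tiny sketch: a hand-rolled naive Bayes text classifier stands in for the “small model,” and a two-example held-out set stands in for an evaluation task. The dataset, labels, and sizes are all invented for illustration; a real run would train a small transformer and evaluate on benchmarks like SQuAD or GLUE as the text says.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Tiny naive Bayes classifier: docs is a list of (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {w for c in word_counts.values() for w in c}
    def predict(tokens):
        best, best_lp = None, -math.inf
        for label in label_counts:
            # log prior + add-one-smoothed log likelihoods
            lp = math.log(label_counts[label] / sum(label_counts.values()))
            total = sum(word_counts[label].values())
            for t in tokens:
                lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best
    return predict

# "Small model" trained on a slice of the candidate dataset...
train = [("great movie loved it".split(), "pos"),
         ("wonderful great acting".split(), "pos"),
         ("terrible boring movie".split(), "neg"),
         ("awful hated boring plot".split(), "neg")]
# ...then scored on a held-out evaluation task
test = [("loved the acting".split(), "pos"),
        ("boring and awful".split(), "neg")]

predict = train_nb(train)
accuracy = sum(predict(t) == y for t, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Comparing this accuracy across candidate datasets (rather than in absolute terms) is what makes the signal useful.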
Early signal method
Method: Train a small model and conduct preliminary evaluations on some simple and quick benchmark tasks (e.g., text classification, sentiment analysis, etc.).
Advantages:
Rapid Iteration: Quickly obtain initial feedback, facilitating rapid iteration and optimization.
Suitable for Early Stages: Helps quickly screen datasets in the early stages of development.
Disadvantages:
Simple Tasks: These tasks may be relatively simple and may not fully represent the model’s performance on complex tasks.
Preliminary Evaluation: Only provides initial performance feedback, which may require further detailed evaluation.
Using GPT-4 for evaluation
Method: Use the GPT-4 model to evaluate the new dataset, potentially including various tasks (e.g., text generation, question answering, sentiment analysis, etc.).
Advantages:
High-Quality Evaluation: As a powerful language model, GPT-4 can provide high-quality evaluation results, especially on complex tasks.
Multi-Task Capability: Can evaluate on various tasks, providing comprehensive performance feedback.
Real-World Usage: Evaluation results are closer to actual usage, especially if your final application is also based on similar advanced models.
Disadvantages:
Computational Resources: Running GPT-4 over a large dataset requires substantial API calls and time, which may increase costs.
Complexity: The complexity of GPT-4 means more potential issues during debugging and optimization.
Overfitting Risk: If not careful, there is a risk of over-optimizing specific tasks, leading to poorer performance on other tasks.
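Because the actual judge-model call depends on your deployment (Azure OpenAI, OpenAI API, etc.), the sketch below shows only the model-agnostic pieces of an LLM-as-judge setup: building a rubric prompt and parsing a score from the response. The rubric wording and the `FINAL: <score>` convention are assumptions invented for this example, and the response here is faked rather than fetched from a real model.

```python
import re

def build_judge_prompt(document):
    """Construct a rubric-style prompt asking a strong LLM to grade one sample."""
    return (
        "Rate the following text for use as pre-training data.\n"
        "Score 1-5 for coherence, informativeness, and fluency, then output\n"
        "a final line of the form 'FINAL: <score>'.\n\n"
        f"TEXT:\n{document}\n"
    )

def parse_score(response):
    """Extract the final 1-5 score from the judge's response, or None."""
    match = re.search(r"FINAL:\s*([1-5])", response)
    return int(match.group(1)) if match else None

prompt = build_judge_prompt("The mitochondrion is the powerhouse of the cell.")
# In practice, `prompt` would be sent to the judge model through its API
# (e.g. an Azure OpenAI chat completion call, omitted here).
fake_response = "Coherence: 5\nInformativeness: 4\nFluency: 5\nFINAL: 5"
assert parse_score(fake_response) == 5
```

Averaging such scores over a sample of documents gives a dataset-level quality estimate, at the cost noted above.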
Summary
Using a “clean” corpus and perplexity check: Suitable for quick, preliminary quality assessment but limited to a single metric.
Training small models and testing on evaluation tasks: Suitable for scenarios requiring specific task performance feedback but requires more resources and task selection.
Early signal method: Suitable for the early stages of development to quickly screen datasets but involves simpler tasks.
Using GPT-4 for evaluation: Suitable for scenarios requiring high-quality and comprehensive evaluation, providing feedback closest to actual usage but with high resource demands.
Prepare environment
In the following content, I will show how to create High Quality Dataset from WARC.
Create conda env
#conda create --name=dataclean python=3.10
#conda activate dataclean
(dataclean) root@david1a100:~# cd dataclean/
(dataclean) root@david1a100:~/dataclean# hostname
david1a100.australiaeast.cloudapp.azure.com
#pip install datatrove xxhash faust-cchardet python-magic warcio fasteners tldextract trafilatura fasttext-wheel nltk awscli fasttext numpy==1.21.0
#pip install datatrove[all]
#pip install datatrove trafilatura awscli
#aws configure
Download WARC
Access the following link to check WARC file address:
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/index.html
Download the file named warc.paths.gz.
The file paths listed in warc.paths.gz look like the following. There are many warc.gz files; I only take CC-MAIN-20230527223515-20230528013515-00000.warc.gz as an example.
crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz
Download the file with the following script:
(dataclean) root@david1a100:~/dataclean# cat download_warc_file.py
import os
import subprocess

def download_warc_file(url, output_dir):
    # Create the output directory if it does not exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    print(f"Downloading {url}...")
    command = f"wget -P {output_dir} {url}"
    subprocess.run(command, shell=True, check=True)

if __name__ == '__main__':
    # URL of the WARC file
    warc_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz"
    # Output directory
    output_dir = "/root/dataclean/data/CC-MAIN-2023-23/segments"
    download_warc_file(warc_url, output_dir)
Basic data processing
After downloading 00000.warc.gz, I use the local executor LocalPipelineExecutor to execute the data processing pipeline, which includes the following steps:
reading WARC files
filtering URLs
extracting content using Trafilatura
filtering non-English content
filtering duplicate content
filtering low-quality content
writing the processed data to JSONL files.
(dataclean) root@david1a100:~/dataclean# cat process_common_crawl_dump.py
import nltk
import sys
import os
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import (
    GopherQualityFilter,
    GopherRepetitionFilter,
    LanguageFilter,
    URLFilter,
)
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

def download_punkt():
    nltk.download('punkt')
    nltk.download('punkt_tab')

def set_nltk_data_path():
    nltk.data.path.append('/root/nltk_data')

set_nltk_data_path()
download_punkt()

def main():
    # DUMP should be given as an argument. Example: CC-MAIN-2023-23
    if len(sys.argv) != 2:
        print("Argument required: dump name")
        sys.exit(-1)
    DUMP = sys.argv[1]
    MAIN_OUTPUT_PATH = "./output"  # Local output path
    DATA_PATH = f"./data/{DUMP}/segments/"
    print(f"Checking files in {DATA_PATH}")
    for root, dirs, files in os.walk(DATA_PATH):
        print(f"Found directory: {root}")
        for file in files:
            print(f"Found file: {file}")
    if not any(os.scandir(DATA_PATH)):
        print(f"No files found in {DATA_PATH}")
        sys.exit(-1)

    def initializer():
        set_nltk_data_path()
        download_punkt()

    from multiprocessing import Pool
    with Pool(processes=8, initializer=initializer) as pool:
        executor = LocalPipelineExecutor(
            pipeline=[
                WarcReader(
                    DATA_PATH,
                    glob_pattern="*.warc.gz",
                    default_metadata={"dump": DUMP},
                ),
                URLFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/url/{DUMP}")),
                Trafilatura(favour_precision=True),
                LanguageFilter(
                    exclusion_writer=JsonlWriter(
                        f"{MAIN_OUTPUT_PATH}/non_english/",
                        output_filename="${language}/" + DUMP + "/${rank}.jsonl.gz",  # folder structure: language/dump/file
                    )
                ),
                GopherRepetitionFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/repetitive/{DUMP}")),
                GopherQualityFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/quality/{DUMP}")),
                JsonlWriter(f"{MAIN_OUTPUT_PATH}/output/{DUMP}"),
            ],
            tasks=8,  # Number of local tasks, adjusted to your VM configuration
            logging_dir=f"{MAIN_OUTPUT_PATH}/logs/base_processing/{DUMP}",
        )
        executor.run()

if __name__ == '__main__':
    main()
Run the script as follows:
#python3 process_common_crawl_dump.py CC-MAIN-2023-23
The script runs for about 26 minutes; the final output is as follows:
2024-08-14 05:11:53.451 | INFO | datatrove.utils.logging:add_task_logger:47 – Launching pipeline for rank=0
2024-08-14 05:11:53.452 | INFO | datatrove.utils.logging:log_pipeline:76 –
— 🛠️ PIPELINE 🛠
📖 – READER: 🕷 Warc
🔻 – FILTER: 😈 Url-filter
🛢 – EXTRAC: ⛏ Trafilatura
🔻 – FILTER: 🌍 Language ID
🔻 – FILTER: 👯 Gopher Repetition
🔻 – FILTER: 🥇 Gopher Quality
💽 – WRITER: 🐿 Jsonl
2024-08-14 05:11:53.452 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 – Reading input file CC-MAIN-20230527223515-20230528013515-00000.warc.gz
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data…
[nltk_data] Package punkt_tab is already up-to-date!
2024-08-14 05:11:55.704 | WARNING | datatrove.pipeline.extractors.base:run:60 – ❌ Error “” while cleaning record text. Skipping record.
…
2024-08-14 05:38:47.661 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 8/8 tasks completed.
2024-08-14 05:38:47.686 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 8 tasks 📉📉📉
Total Runtime: 26 minutes and 36 seconds
📖 – READER: 🕷 Warc
Runtime: (2.11%) 33 seconds [0.29 milliseconds±3.12 milliseconds/doc]
Stats: {input_files: 1, doc_len: 4795961005 [min=1, max=1048576, 140974.75±182620/doc], documents: 34019 [34019.00/input_file]}
🔻 – FILTER: 😈 Url-filter
Runtime: (0.35%) 5 seconds [0.16 milliseconds±11.08 milliseconds/doc]
Stats: {total: 34020, forwarded: 33834, doc_len: 4776069530 [min=1, max=1048576, 141161.84±182866/doc], dropped: 186, dropped_domain: 90, dropped_hard_blacklisted: 67, dropped_blacklisted_subword: 21, dropped_soft_blacklisted: 6, dropped_subdomain: 2}
🛢 – EXTRAC: ⛏ Trafilatura
Runtime: (75.94%) 20 minutes and 12 seconds [35.84 milliseconds±29.25 milliseconds/doc]
Stats: {total: 33834, forwarded: 27384, doc_len: 57232496 [min=1, max=551300, 2090.00±6280/doc], dropped: 4168}
🔻 – FILTER: 🌍 Language ID
Runtime: (0.91%) 14 seconds [0.53 milliseconds±2.54 milliseconds/doc]
Stats: {total: 27384, dropped: 16500, forwarded: 10884, doc_len: 24989254 [min=2, max=73080, 2295.96±4166/doc]}
🔻 – FILTER: 👯 Gopher Repetition
Runtime: (13.00%) 3 minutes and 27 seconds [19.07 milliseconds±33.46 milliseconds/doc]
Stats: {total: 10884, forwarded: 8161, doc_len: 21401662 [min=5, max=73080, 2622.43±4274/doc], dropped: 2723, dropped_top_4_gram: 345, dropped_dup_line_frac: 633, dropped_top_2_gram: 796, dropped_duplicated_5_n_grams: 281, dropped_top_3_gram: 399, dropped_duplicated_6_n_grams: 25, dropped_dup_line_char_frac: 173, dropped_duplicated_8_n_grams: 13, dropped_duplicated_10_n_grams: 16, dropped_duplicated_9_n_grams: 23, dropped_duplicated_7_n_grams: 19}
🔻 – FILTER: 🥇 Gopher Quality
Runtime: (7.55%) 2 minutes [14.76 milliseconds±8.44 milliseconds/doc]
Stats: {total: 8161, dropped: 2433, dropped_gopher_too_many_end_ellipsis: 232, dropped_gopher_below_alpha_threshold: 1201, forwarded: 5728, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], dropped_gopher_short_doc: 941, dropped_gopher_too_many_bullets: 49, dropped_gopher_enough_stop_words: 6, dropped_gopher_below_avg_threshold: 1, dropped_gopher_too_many_ellipsis: 1, dropped_gopher_too_many_hashes: 2}
💽 – WRITER: 🐿 Jsonl
Runtime: (0.14%) 2 seconds [0.40 milliseconds±0.60 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5728, total: 5728, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc]}
Check data processing results
root@david1a100:~/dataclean/output/output/CC-MAIN-2023-23# zcat ./00000.jsonl.gz | head -n 2 | jq .
Output:
{
  "text": "Buy Ambien Online Legally (Zolpidem) belongs to the imidazopyridines class of opioids. Ambien complements the exceptional of sleep via way of means of decreasing the time it takes to fall asleep, decreasing the frequency of nocturnal awakenings, and growing the general period of sleep. Lengthens the second one degree of sleep and the deep sleep degree (III and IV). It does now no longer make you sleepy throughout the day. If you're seeking to Buy Ambien Online at an inexpensive cost, come to our on line pharmacy.",
  "id": "<urn:uuid:dd20979b-ada8-4c5b-b53e-4ade7274bc1b>",
  "metadata": {
    "dump": "CC-MAIN-2023-23",
    "url": "http://42627.dynamicboard.de/u101117_ambienusa.html",
    "date": "2023-05-27T23:12:51Z",
    "file_path": "/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz",
    "language": "en",
    "language_score": 0.8990675806999207
  }
}
{
  "text": "My little guy turned two over the summer and we celebrated with an oh-so-cute Golf Birthday Party. He is all boy and loves anything that includes a stick and ball, which made choosing the golf theme fairly easy. We had fun golfing games, snacks & treats and each little caddie even received there very own golf bag. The post was getting fairly large I decided to split it in two parts. Part one covers the favor and dessert table and part two will focus on the food and games. Enjoy!\nGolf Pro Shop for the favor table\nEach “Golf Pro” received his/her own set of golf clubs (thank you Target dollar section for saving the day!), a blue or green visor I purchased at Joann’s, practice golf balls and a water bottle to stay hydrated on the course.\nI created the backdrop for the dessert table with a tan table cloth I had and pinned it to the window frame with thumb tacks (my husband wasn’t too happy about that one…opps!) I used 12” white tissue paper balls that I purchased from Devra Party and hung them by grosgrain ribbon.\nI wanted to use items on the dessert table that went along with the theme so I racked my brain for some golf terms. The sign over the table was “Caddie’s Sweet Spot” (sweet spot refers to the center point of the face of the club).\nThere was a “water hazard” ~ blue jell-o jigglers, “wormburners” (which is the term for a ball that skims the grass) ~ chocolate pudding pack topped with crumbled Oreos and gummy worms plus a sand trap of “doughnut hole in one” ~ made with powder sugar doughnuts and crumbled graham crackers for the sand.\nI also made cake pops that resembled golf balls ~ some like a lollipop and others with a golf flag and the number two for the birthday boy. The kids had a few candy choices and a small bag to fill so they could bring treats home.\n“Wormburners” – Chocolate pudding cups topped with crushed oreos and gummy worms\nGreen Grass Cupcakes, with white gumball and printable golf flags.\nThank you so much to everyone who helped make this party amazing, I couldn’t have done it without you.\nVendor List:\nPhotography: Andary Studio\nParty Printables: Printable Studio by 505 Design, Inc\nGolf Club Sets: Target Dollar Section\nFoam Visors: Joann’s\nGreen & White Tissue Balls: Devra Party\nGreen Polka Dot Balloons: Paws Attraction Boutique\nCupcakes – My super talented sister\nInterested in hosting your own Golf Themed Party – Check out the Golf Pro Printable set now available in the shop.\nMore details coming soon….\nThanks for stopping by! Cathy C.",
  "id": "<urn:uuid:9ad54ec1-b946-4293-8099-abc434ef154c>",
  "metadata": {
    "dump": "CC-MAIN-2023-23",
    "url": "http://505-design.com/tag/boys-party/",
    "date": "2023-05-27T23:24:49Z",
    "file_path": "/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz",
    "language": "en",
    "language_score": 0.9405166506767273
  }
}
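Before moving on to deduplication, it can be useful to sanity-check a processed shard programmatically rather than eyeballing jq output. The helper below is a hypothetical utility (not part of datatrove): it reads a `.jsonl.gz` shard shaped like the records above and reports the document count, mean `language_score`, and how many documents fall below a chosen threshold. The sample records and the 0.9 threshold are invented for the example.

```python
import gzip
import json
import os
import statistics
import tempfile

def summarize_jsonl(path, min_lang_score=0.9):
    """Summarize a datatrove-style .jsonl.gz output shard."""
    docs, scores, flagged = 0, [], 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            docs += 1
            score = rec["metadata"]["language_score"]
            scores.append(score)
            if score < min_lang_score:
                flagged += 1
    return {"documents": docs,
            "mean_language_score": statistics.mean(scores),
            "below_threshold": flagged}

# Build a two-record sample shard shaped like the output shown above
sample = [
    {"text": "a", "metadata": {"language": "en", "language_score": 0.899}},
    {"text": "b", "metadata": {"language": "en", "language_score": 0.941}},
]
path = os.path.join(tempfile.mkdtemp(), "00000.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in sample:
        f.write(json.dumps(rec) + "\n")

stats = summarize_jsonl(path)
print(stats)
```

Running this over every shard gives a quick distribution check before spending compute on the MinHash stages.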
Minhash deduplication
I use the local executor LocalPipelineExecutor to execute the data deduplication pipeline, which includes the following steps:
Configuring Minhash: Setting up Minhash with 64-bit hashes for better precision and fewer false positives (collisions).
Reading Input Data: Using JsonlReader to read input data from a specified directory.
Stage 1: Calculating Minhash Signatures:
Pipeline: Reads input data and calculates Minhash signatures.
Output: Stores signatures in a specified folder.
Tasks: Configured to run with a specified number of tasks based on the local environment.
Stage 2: Finding Matches Between Signatures in Each Bucket:
Pipeline: Processes the signatures to find matches within each bucket.
Output: Stores bucketed signatures in a specified folder.
Tasks: Runs with a number of tasks equal to the number of buckets.
Dependency: Depends on the completion of Stage 1.
Stage 3: Creating Clusters of Duplicates:
Pipeline: Uses the results from all buckets to create clusters of duplicate items.
Output: Stores IDs of items to be removed in a specified folder.
Tasks: Runs as a single task.
Dependency: Depends on the completion of Stage 2.
Stage 4: Filtering Out Duplicates:
Pipeline: Reads the original input data, counts tokens, filters out duplicates (keeping only one sample per cluster), and writes the deduplicated data to JSONL files.
Output: Stores deduplicated output and removed items in specified folders.
Tasks: Configured to run with a specified number of tasks.
Dependency: Depends on the completion of Stage 3.
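Before reading the full script, it may help to see the core MinHash idea behind stages 1-3 in isolation. The pure-Python sketch below is a simplification invented for illustration (datatrove's implementation adds bucketing, clustering, and 64-bit hash configuration): each document becomes a set of word shingles, each shingle set becomes a signature of per-seed minimum hashes, and the fraction of matching signature positions estimates Jaccard similarity between documents.

```python
import hashlib

def shingles(text, n=3):
    """Word n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set, num_hashes=128):
    """One 64-bit minimum hash per seeded hash function."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingle_set))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated text about training large language models at scale"

sa, sb, sc = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
# Near-duplicates score much higher than unrelated documents
assert estimate_jaccard(sa, sb) > estimate_jaccard(sa, sc)
```

Comparing signatures instead of raw shingle sets is what lets the real pipeline scale: signatures are fixed-size and can be bucketed so only likely matches are compared.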
root@david1a100:~/dataclean# cat minhash_deduplication.py
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import (
    MinhashConfig,
    MinhashDedupBuckets,
    MinhashDedupCluster,
    MinhashDedupFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import TokensCounter
from datatrove.pipeline.writers.jsonl import JsonlWriter

def main():
    minhash_config = MinhashConfig(use_64bit_hashes=True)
    LOCAL_MINHASH_BASE_PATH = "./minhash"
    LOCAL_LOGS_FOLDER = "./logs"
    TOTAL_TASKS = 8
    # Input data path
    INPUT_READER = JsonlReader("./output/output/CC-MAIN-2023-23/")
    # Stage 1: Calculate the Minhash signature for each task
    stage1 = LocalPipelineExecutor(
        pipeline=[
            INPUT_READER,
            MinhashDedupSignature(output_folder=f"{LOCAL_MINHASH_BASE_PATH}/signatures", config=minhash_config),
        ],
        tasks=TOTAL_TASKS,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/signatures",
    )
    # Stage 2: Find matches between signatures in each bucket
    stage2 = LocalPipelineExecutor(
        pipeline=[
            MinhashDedupBuckets(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/signatures",
                output_folder=f"{LOCAL_MINHASH_BASE_PATH}/buckets",
                config=minhash_config,
            ),
        ],
        tasks=minhash_config.num_buckets,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/buckets",
        depends=stage1,
    )
    # Stage 3: Create clusters of duplicate items using the results of all buckets
    stage3 = LocalPipelineExecutor(
        pipeline=[
            MinhashDedupCluster(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/buckets",
                output_folder=f"{LOCAL_MINHASH_BASE_PATH}/remove_ids",
                config=minhash_config,
            ),
        ],
        tasks=1,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/clusters",
        depends=stage2,
    )
    # Stage 4: Read raw input data and remove all samples from each duplicate cluster (keep only one)
    stage4 = LocalPipelineExecutor(
        pipeline=[
            INPUT_READER,
            TokensCounter(),  # View the number of tokens before and after deduplication
            MinhashDedupFilter(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/remove_ids",
                exclusion_writer=JsonlWriter(f"{LOCAL_MINHASH_BASE_PATH}/removed"),
            ),
            JsonlWriter(output_folder=f"{LOCAL_MINHASH_BASE_PATH}/deduplicated_output"),
        ],
        tasks=TOTAL_TASKS,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/filter",
        depends=stage3,
    )
    stage4.run()

if __name__ == '__main__':
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Run the code:
(dataclean) root@david1a100:~/dataclean# python minhash_deduplication.py
The results are as follows:
— 🛠️ PIPELINE 🛠
📖 – READER: 🐿 Jsonl
🔢 – TOKENIZER: 📊 Counter
🫂 – DEDUP: 🎯 MinHash stage 4
💽 – WRITER: 🐿 Jsonl
2024-08-14 07:20:58.795 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 – Reading input file 00000.jsonl.gz
2024-08-14 07:20:58.802 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 1/8 tasks completed.
2024-08-14 07:20:58.804 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 2/8 tasks completed.
2024-08-14 07:20:58.805 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 3/8 tasks completed.
2024-08-14 07:20:58.807 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 4/8 tasks completed.
2024-08-14 07:20:58.808 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 5/8 tasks completed.
2024-08-14 07:20:58.810 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 6/8 tasks completed.
2024-08-14 07:20:58.812 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 7/8 tasks completed.
2024-08-14 07:21:08.399 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=0
2024-08-14 07:21:08.401 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 0 📉📉📉
Total Runtime: 9 seconds
📖 – READER: 🐿 Jsonl
Runtime: (1.54%) 0 seconds [0.03 milliseconds±0.01 milliseconds/doc]
Stats: {input_files: 1, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], documents: 5727 [5727.00/input_file]}
🔢 – TOKENIZER: 📊 Counter
Runtime: (79.15%) 7 seconds [1.29 milliseconds±5.90 milliseconds/doc]
Stats: {tokens: 3989039 [min=54, max=18060, 696.41±1020/doc]}
🫂 – DEDUP: 🎯 MinHash stage 4
Runtime: (0.44%) 0 seconds [0.01 milliseconds±0.03 milliseconds/doc]
Stats: {total: 5728, forwarded: 5548, dropped: 180}
💽 – WRITER: 🐿 Jsonl
Runtime: (18.86%) 1 second [0.32 milliseconds±0.44 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5548, total: 5548, doc_len: 17896157 [min=257, max=73080, 3225.70±4665/doc], doc_len_tokens: 3943328 [min=54, max=18060, 710.77±1032/doc]}
2024-08-14 07:21:08.405 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 8/8 tasks completed.
2024-08-14 07:21:08.417 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 8 tasks 📉📉📉
Total Runtime: 1 second ± 2 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (1.54%) 0 seconds±0 seconds/task, min=0 seconds [0.03 milliseconds±0.01 milliseconds/doc]
Stats: {input_files: 1, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], documents: 5727 [5727.00/input_file]}
🔢 – TOKENIZER: 📊 Counter
Runtime: (79.15%) 0 seconds±2 seconds/task, min=0 seconds [1.29 milliseconds±5.90 milliseconds/doc]
Stats: {tokens: 3989039 [min=54, max=18060, 696.41±1020/doc]}
🫂 – DEDUP: 🎯 MinHash stage 4
Runtime: (0.44%) 0 seconds±0 seconds/task, min=0 seconds [0.01 milliseconds±0.03 milliseconds/doc]
Stats: {total: 5728, forwarded: 5548, dropped: 180}
💽 – WRITER: 🐿 Jsonl
Runtime: (18.86%) 0 seconds±0 seconds/task, min=0 seconds [0.32 milliseconds±0.44 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5548, total: 5548, doc_len: 17896157 [min=257, max=73080, 3225.70±4665/doc], doc_len_tokens: 3943328 [min=54, max=18060, 710.77±1032/doc]}
Check the removed items and the final result of this stage:
(dataclean) root@david1a100:~/dataclean/minhash# ls -al removed/
total 76
drwx------ 2 root root 4096 Aug 14 07:20 .
drwx------ 7 root root 4096 Aug 14 07:20 ..
-rw------- 1 root root 65584 Aug 14 07:21 00000.jsonl.gz
(dataclean) root@david1a100:~/dataclean/minhash# ls -al deduplicated_output/
total 7372
drwx------ 2 root root 4096 Aug 14 07:20 .
drwx------ 7 root root 4096 Aug 14 07:20 ..
-rw------- 1 root root 7539420 Aug 14 07:21 00000.jsonl.gz
(dataclean) root@david1a100:~/dataclean/minhash#
Check the first item in the final output file:
(dataclean) root@david1a100:~/dataclean/minhash/deduplicated_output# zcat ./00000.jsonl.gz | head -n 1 | jq .
{
  "text": "Buy Ambien Online Legally (Zolpidem) belongs to the imidazopyridines class of opioids. Ambien complements the exceptional of sleep via way of means of decreasing the time it takes to fall asleep, decreasing the frequency of nocturnal awakenings, and growing the general period of sleep. Lengthens the second one degree of sleep and the deep sleep degree (III and IV). It does now no longer make you sleepy throughout the day. If you're seeking to Buy Ambien Online at an inexpensive cost, come to our on line pharmacy.",
  "id": "<urn:uuid:dd20979b-ada8-4c5b-b53e-4ade7274bc1b>",
  "metadata": {
    "dump": "CC-MAIN-2023-23",
    "url": "http://42627.dynamicboard.de/u101117_ambienusa.html",
    "date": "2023-05-27T23:12:51Z",
    "file_path": "/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz",
    "language": "en",
    "language_score": 0.8990675806999207,
    "token_count": 120
  }
}
Sentence deduplication
My code uses the local executor LocalPipelineExecutor to execute the data deduplication pipeline, which includes the following steps:
Configuring Sentence Deduplication: Setting up sentence deduplication with specific configurations such as the number of sentences, splitting sentences, and minimum document words.
Preprocessing Data: Using NLTK to download the Punkt tokenizer and preprocess data before starting multiprocessing.
Reading Input Data: Using JsonlReader to read input data from a specified directory.
Stage 1: Extracting and Filtering Content:
Pipeline: Reads input data, extracts content using Trafilatura, filters based on quality and language, and writes intermediate results to JSONL files.
Output: Stores intermediate results in a specified folder.
Tasks: Configured to run with a specified number of tasks.
Stage 2: Calculating Sentence Deduplication Signatures:
Pipeline: Processes the intermediate results to calculate sentence deduplication signatures.
Output: Stores signatures in a specified folder.
Tasks: Runs with a number of tasks equal to the number of finder workers.
Stage 3: Finding and Filtering Duplicates:
Pipeline: Reads the intermediate results, finds duplicates using the calculated signatures, and filters out duplicates (keeping only one sample per cluster).
Output: Stores deduplicated output in a specified folder.
Tasks: Configured to run with a specified number of tasks.
The pipeline is executed by running executor_1.run(), executor_2.run(), and executor_3.run().
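The signature idea behind stages 2 and 3 can be sketched in miniature before reading the full script. The simplified pass below (invented for illustration, much cruder than datatrove's SentenceDedupSignature/SentenceFindDedups) hashes every window of three consecutive sentences, mirroring the `n_sentences=3` configuration, and reports any window hash that appears in more than one document.

```python
import hashlib

def sentence_ngram_hashes(sentences, n=3):
    """Hash every window of n consecutive sentences (cf. n_sentences=3)."""
    return {hashlib.sha1(" ".join(sentences[i:i + n]).encode()).hexdigest()
            for i in range(len(sentences) - n + 1)}

def find_duplicate_spans(docs, n=3):
    """Signature pass: flag window hashes seen in more than one document."""
    seen, dup_hashes = {}, set()
    for doc_id, sents in docs.items():
        for h in sentence_ngram_hashes(sents, n):
            if h in seen and seen[h] != doc_id:
                dup_hashes.add(h)
            seen.setdefault(h, doc_id)
    return dup_hashes

docs = {
    "a": ["Sentence one.", "Sentence two.", "Sentence three.", "Unique tail."],
    "b": ["Different head.", "Sentence one.", "Sentence two.", "Sentence three."],
    "c": ["Totally different.", "Nothing shared here.", "At all.", "Really."],
}
dups = find_duplicate_spans(docs)
# Only the three-sentence run shared by docs "a" and "b" is flagged
assert len(dups) == 1
```

The real filter stage then removes or trims the flagged spans while keeping one copy, which is what pipeline_3 does below.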
(dataclean) root@david1a100:~/dataclean# cat sentence_deduplication.py
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages
from datatrove.io import get_datafolder
from collections import UserDict
import multiprocessing
# Ensure punkt tokenizer is downloaded before multiprocessing
nltk.download(‘punkt’, force=True)
# Custom function to load PunktSentenceTokenizer
def load_punkt_tokenizer():
punkt_param = PunktParameters()
with open(nltk.data.find(‘tokenizers/punkt/english.pickle’), ‘rb’) as f:
tokenizer = PunktSentenceTokenizer(punkt_param)
return tokenizer
# Load tokenizer in the main process
tokenizer = load_punkt_tokenizer()
# Example configuration for sentence deduplication
sent_dedup_config = SentDedupConfig(
n_sentences=3,
split_sentences=True,
only_dedup_in_index=True,
min_doc_words=50,
)
FINDER_WORKERS = 10
class TimeStats:
def __init__(self):
self.global_mean = 0
self.global_std_dev = 0
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
pass
def __repr__(self):
return f”TimeStats(global_mean={self.global_mean}, global_std_dev={self.global_std_dev})”
def __add__(self, other):
result = TimeStats()
result.global_mean = self.global_mean + other.global_mean
result.global_std_dev = self.global_std_dev + other.global_std_dev
return result
class Stat:
def __init__(self):
self.value = 0
def update(self, value, unit=None):
self.value += value
def __repr__(self):
return f”Stat(value={self.value})”
def __add__(self, other):
result = Stat()
result.value = self.value + other.value
return result
class PipelineStats(UserDict):
def __init__(self):
super().__init__()
self.total_runtime = 0
self.time_stats = TimeStats()
self.data[‘total’] = Stat()
self.data[‘removed_sentences’] = Stat()
self.data[‘original_sentences’] = Stat()
def as_dict(self):
return {
‘total_runtime’: self.total_runtime,
‘time_stats’: repr(self.time_stats),
‘stats’: {key: repr(value) for key, value in self.data.items()}
}
def to_dict(self):
return self.as_dict()
def to_json(self):
import json
return json.dumps(self.to_dict(), indent=4)
def save_to_disk(self, file):
file.write(self.to_json())
def get_repr(self, task_name):
x = f”nn Stats: {task_name} nnTotal Runtime: {self.total_runtime} secondsnn”
x += “n”.join([repr(stat) for stat in self.data.values()])
return x
def __repr__(self, *args, **kwargs):
return f”PipelineStats(total_runtime={self.total_runtime}, time_stats={self.time_stats})”
def __add__(self, other):
result = PipelineStats()
result.total_runtime = self.total_runtime + other.total_runtime
result.time_stats = self.time_stats + other.time_stats
for key in self.data:
result.data[key] = self.data[key] + other.data[key]
return result
class CustomSentenceDedupFilter(SentenceDedupFilter):
def __init__(self, data_folder, config):
self.data_folder = get_datafolder(data_folder)
self.config = config
self._tokenizer = None
self.exclusion_writer = None
self.stats = PipelineStats()
self.language = ‘english’
def set_tokenizer(self, tokenizer):
self._tokenizer = tokenizer
def run(self, data, rank, world_size, *args):
# Implement the logic for the run method here
# For now, let’s just print the arguments to verify they are passed correctly
print(f”Running with data: {data}, rank: {rank}, world_size: {world_size}, args: {args}”)
# Add your actual processing logic here
return data
def preprocess_data():
    # Preprocess data with nltk before starting multiprocessing.
    # This is a placeholder: for example, read the input files, tokenize
    # the sentences, and save the preprocessed data.
    pass
def run_example():
    preprocess_data()  # Preprocess data before starting multiprocessing

    pipeline_1 = [
        JsonlReader(data_folder="./minhash/deduplicated_output/"),
        Trafilatura(),
        GopherQualityFilter(min_stop_words=0),
        LanguageFilter(language_threshold=0.5, languages=(Languages.english,)),
        JsonlWriter("./intermediate/"),
        SentenceDedupSignature(output_folder="./c4/sigs", config=sent_dedup_config, finder_workers=FINDER_WORKERS),
    ]

    pipeline_2 = [SentenceFindDedups(data_folder="./c4/sigs", output_folder="./c4/dups", config=sent_dedup_config)]

    sentence_dedup_filter = CustomSentenceDedupFilter(data_folder="./c4/dups", config=sent_dedup_config)
    sentence_dedup_filter.set_tokenizer(tokenizer)

    pipeline_3 = [
        JsonlReader(data_folder="./intermediate/"),
        sentence_dedup_filter,
        JsonlWriter(output_folder="./final_deduplicated_output/"),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=4, tasks=4)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=FINDER_WORKERS)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=4, tasks=4)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())


if __name__ == '__main__':
    multiprocessing.freeze_support()
    run_example()
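Each task accumulates its own statistics, and the per-task objects are merged with the `__add__` method shown above. The accumulation pattern can be sanity-checked in isolation; here is a minimal standalone sketch in which `Stat` and `SimplePipelineStats` are simplified stand-ins for the datatrove classes, not the real implementation:

```python
# Minimal stand-in for the datatrove Stat/PipelineStats pair: each task
# accumulates counts, and per-task stats objects are merged with "+".
class Stat:
    def __init__(self, total=0):
        self.total = total

    def __add__(self, other):
        return Stat(self.total + other.total)

    def __repr__(self):
        return f"Stat(total={self.total})"


class SimplePipelineStats:
    def __init__(self):
        self.total_runtime = 0
        self.data = {'total': Stat(), 'removed_sentences': Stat()}

    def __add__(self, other):
        result = SimplePipelineStats()
        result.total_runtime = self.total_runtime + other.total_runtime
        for key in self.data:
            result.data[key] = self.data[key] + other.data[key]
        return result


# Merge stats from two hypothetical tasks.
task_a, task_b = SimplePipelineStats(), SimplePipelineStats()
task_a.total_runtime, task_b.total_runtime = 2, 3
task_a.data['removed_sentences'] = Stat(10)
task_b.data['removed_sentences'] = Stat(5)

merged = task_a + task_b
print(merged.total_runtime)                    # 5
print(merged.data['removed_sentences'].total)  # 15
```

This is the same fold the executor performs when it prints the "Stats: All N tasks" summary after the per-task reports.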
Run the script:
(dataclean) root@david1a100:~/dataclean# python3 sentence_deduplication.py
Some of the output:
2024-08-15 03:46:20.151 | INFO | datatrove.pipeline.dedup.sentence_dedup:run:247 – PQ initialized.
2024-08-15 03:46:20.151 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=9
2024-08-15 03:46:20.152 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 9 📉📉📉
Total Runtime: 0 seconds
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds [1.17 milliseconds±0 milliseconds/doc]
2024-08-15 03:46:20.156 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 10 tasks 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds±0 seconds/task, min=0 seconds, max=0 seconds [1.68 milliseconds±1.21 milliseconds/doc]
📉📉📉 Stats 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds±0 seconds/task, min=0 seconds, max=0 seconds [1.68 milliseconds±1.21 milliseconds/doc]
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
2024-08-15 03:46:20.887 | INFO | datatrove.utils.logging:add_task_logger:47 – Launching pipeline for rank=2
2024-08-15 03:46:20.887 | INFO | datatrove.utils.logging:log_pipeline:76 –
— 🛠️ PIPELINE 🛠
📖 – READER: 🐿 Jsonl
🫂 – DEDUPS: 💥 sentence-deduplication stage 3
💽 – WRITER: 🐿 Jsonl
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 2, world_size: 4, args: ()
2024-08-15 03:46:20.887 | WARNING | datatrove.pipeline.readers.base:run:226 – No files found on /root/dataclean/intermediate for rank=2
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 1, world_size: 4, args: ()
2024-08-15 03:46:20.887 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=2
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 0, world_size: 4, args: ()
2024-08-15 03:46:20.888 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 2 📉📉📉
Total Runtime: 0 seconds
📖 – READER: 🐿 Jsonl
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
2024-08-15 03:46:20.891 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 1/4 tasks completed.
2024-08-15 03:46:20.892 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 2/4 tasks completed.
2024-08-15 03:46:20.897 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 3/4 tasks completed.
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a340>, rank: 3, world_size: 4, args: ()
2024-08-15 03:46:20.911 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 4/4 tasks completed.
2024-08-15 03:46:20.948 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 4 tasks 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (7.77%) 0 seconds±0 seconds/task, min=0 seconds [0.06 milliseconds±0.04 milliseconds/doc]
Stats: {input_files: 1, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc], documents: 3 [3.00/input_file]}
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
Runtime: (92.23%) 0 seconds±0 seconds/task, min=0 seconds [0.66 milliseconds±0.88 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 4, total: 4, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc]}
📉📉📉 Stats 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (7.77%) 0 seconds±0 seconds/task, min=0 seconds [0.06 milliseconds±0.04 milliseconds/doc]
Stats: {input_files: 1, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc], documents: 3 [3.00/input_file]}
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
Runtime: (92.23%) 0 seconds±0 seconds/task, min=0 seconds [0.66 milliseconds±0.88 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 4, total: 4, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc]}
Check the first item of the final output:
(dataclean) root@david1a100:~/dataclean/final_deduplicated_output# zcat ./00000.jsonl.gz | head -n 1 | jq .
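If `jq` is not available, the same spot check can be done in a few lines of Python. The file name and record layout below are illustrative; the snippet builds its own tiny sample file so it runs standalone:

```python
import gzip
import json

# Read and pretty-print the first record of a gzip-compressed JSONL file,
# the Python equivalent of `zcat ./00000.jsonl.gz | head -n 1 | jq .`
def first_record(path):
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return json.loads(f.readline())

# Build a small sample file so the snippet is self-contained.
with gzip.open('sample.jsonl.gz', 'wt', encoding='utf-8') as f:
    f.write(json.dumps({'text': 'hello', 'id': 1}) + '\n')
    f.write(json.dumps({'text': 'world', 'id': 2}) + '\n')

print(json.dumps(first_record('sample.jsonl.gz'), indent=2))
```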
Check the quality of the corpus
This part of the code is based on https://github.com/Azure/synthetic-qa-generation/tree/main. I modified some of the code; please refer to corpus-suggestions.ipynb in my repo (https://github.com/xinyuwei-david/david-share/tree/master/Deep-Learning/Make-High-Quality-Dataset-From-WARC), which analyzes the quality of the corpus produced by the previous steps and offers many useful suggestions.
Take some results as examples:
Result 1:
Feedback Required: [True, False, True, False, True]
Feedback List:
#Need Feedback#: Yes
#Issue Name#: Lack of new scenarios or contexts
#Reason#: The evolved instruction does not introduce any new scenarios or examples.
#Feedback#: Introduce diverse contexts or examples to enhance the instructional variety.
#Need Feedback#: No
#Need Feedback#: Yes
#Issue Name#: Limited diversity in examples
#Reason#: No new scenarios or varied contexts introduced in the evolved instruction.
#Feedback#: Incorporate diverse examples or contexts to cover a wider range of situations.
#Need Feedback#: No
#Need Feedback#: Yes
#Issue Name#: Limited diversity
#Reason#: No new scenarios, examples, or contexts introduced.
#Feedback#: Include various use cases and contexts for accessing journal content.
Optimized Instruction:
Accessing full-text articles for free on HTML pages can be a convenient way to stay informed, but if you need the article in PDF or Epub format, a subscription to the Journal of Postgraduate Medicine is required. Here are different ways to access the content based on various contexts:
1. **Individual Subscription:** If you frequently need access to articles in PDF or Epub format, consider subscribing online for a year. Subscribing is a straightforward process:
– Visit the Journal of Postgraduate Medicine’s subscription page.
– Choose the subscription plan that suits your needs.
– Complete the payment process to gain access to the content.
2. **Institutional Access:** If you are affiliated with a university or a research institution, you might recommend that your institution’s library subscribe to the journal. This way, everyone at your institution can have unrestricted access to the content.
– Click on the “Recommend the Journal” link typically provided on the journal’s website.
– Fill out the recommendation form with the necessary details.
– Submit the recommendation to your institution’s library acquisition team.
3. **Library Access:** If your local library has a subscription to the journal, you can access the PDF and Epub formats through their facilities. Check with your library to see if they offer remote access options, especially useful during non-operational hours or remote working conditions.
4. **Interlibrary Loan (ILL):** If neither you nor your institution has a subscription and you need a specific article in PDF or Epub format, you can request it through interlibrary loan services:
– Contact your library’s interlibrary loan department.
– Provide the details of the article you need.
– Wait for your library to obtain a copy from another subscribing institution.
5. **Pay-Per-View Purchase:** Some journals offer pay-per-view options for non-subscribers to access specific articles:
– Visit the article page on the journal’s website.
– Look for a purchase or pay-per-view option.
– Complete the payment to download the article in PDF or Epub format.
By understanding these various methods, you can choose the most appropriate way to access the Journal of Postgraduate Medicine articles based on your specific context and needs.
Evolved Instruction Step 1:
Accessing full-text articles for free on HTML pages can be a convenient way to stay informed, but if you need the article in PDF or Epub format or face geographic restrictions, a subscription to the Journal of Postgraduate Medicine is required. Here are different ways to access the content based on various contexts and considerations:
1. **Individual Subscription:** If you frequently need access to articles in PDF or Epub format, consider subscribing online for a year. Consider different subscription tiers based on your usage frequency and preferred payment method (credit card, PayPal, or wire transfer):
– Visit the Journal of Postgraduate Medicine’s subscription page.
– Choose the appropriate subscription plan that suits your reading needs and budget.
– Complete the payment process, selecting your preferred payment method, to gain access to the content.
– Confirm your subscription through the verification email you will receive.
2. **Institutional Access:** If you are affiliated with a university, specialized institute, or research organization, you might recommend that your institution’s library subscribe to the journal, allowing everyone at your institution unrestricted access to the content:
– Click on the “Recommend the Journal” link typically provided on the journal’s website.
– Fill out the recommendation form with the necessary details, specifying your institution type.
– Submit the recommendation to your institution’s library acquisition team.
– Follow up with your acquisition team to verify the status of the subscription request.
3. **Library Access:** If your local library has a subscription to the journal, you can access the PDF and Epub formats through their facilities. Check with your library to see if they offer remote access options or have updated policies for off-hour access due to remote working conditions or geographical restrictions:
– Visit your library’s online resource portal.
– Authenticate your library membership details to access the journal remotely.
– Verify the access duration and loan policies to ensure continuous availability.
4. **Interlibrary Loan (ILL):** If neither you nor your institution has a subscription and you need a specific article in PDF or Epub format, you can request it through Interlibrary Loan services, which might involve multiple steps and waiting periods:
– Contact your library’s interlibrary loan department and inquire about any pre-requisites.
– Provide the exact details of the article you need and verify your contact information.
– Wait for your library to notify you on the progress and estimated delivery time of the article from another subscribing institution.
– Confirm the received article’s access duration to avoid lapses in availability.
5. **Pay-Per-View Purchase:** Some journals offer pay-per-view options for non-subscribers to access specific articles. Be aware of different payment methods and possible return policies if the article does not meet your needs:
– Visit the article page on the journal’s website.
– Look for a purchase or pay-per-view option and compare prices if there are multiple.
– Complete the payment process, choosing a method that’s secure and convenient for you.
– Download the article in PDF or Epub format, and review any return policies if you face access issues.
By understanding these various methods, including conditional scenarios and additional steps, you can choose the most appropriate way to access the Journal of Postgraduate Medicine articles based on your specific context, requirements, and potential contingent situations.
New Feedback Required: [True, True, True, True, True]
New Feedback List:
#Need Feedback#: Yes
#Issue Name#: Preservation of key information
#Reason#: Key information is maintained with added details and considerations.
#Feedback#: Key information preserved well with added context and steps for clarity.
#Need Feedback#: Yes
#Issue Name#: Complexity
#Reason#: More details and steps have been added sufficiently.
#Feedback#: Complexity increased adequately with detailed steps and additional considerations.
#Need Feedback#: Yes
#Issue Name#: Insufficient scenario diversity
#Reason#: Limited expansion on new contexts or examples in evolved instruction.
#Feedback#: Introduce more varied scenarios to enhance diversity and coverage of different situations.
#Need Feedback#: Yes
#Issue Name#: Increased complexity
#Reason#: The Evolved Instruction introduces more detailed steps and additional considerations.
#Feedback#: The complexity has increased adequately with additional steps and detailed guidance.
#Need Feedback#: Yes
#Issue Name#: Limited diversity in access methods
#Reason#: Few new scenarios or examples introduced in the evolved instruction.
#Feedback#: Expand diversity by adding varied contexts, like international access options.
Generate Synthetic Q&A
Refer to generate-QA.ipynb; with it we can generate high-quality synthetic Q&A pairs using GPT-4o. The prompt template is based on: https://github.com/Azure/synthetic-qa-generation/tree/main/seed/prompt_template/en
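The notebook drives GPT-4o with a prompt template filled in per document chunk. As a rough sketch of that step, here is a hypothetical prompt builder; the template wording and the `generate_qa_prompt` helper are illustrative, not the actual templates from the Azure repo:

```python
# Illustrative prompt builder for synthetic Q&A generation. The template
# text is a simplified stand-in for the templates in the Azure
# synthetic-qa-generation repo.
QA_PROMPT_TEMPLATE = """You are a teacher preparing a quiz.
Read the passage below and write {num_pairs} question-answer pairs
that can be answered from the passage alone.

Passage:
{passage}

Format each pair as:
**Question** ...
**Answer** ...
"""

def generate_qa_prompt(passage: str, num_pairs: int = 5) -> str:
    return QA_PROMPT_TEMPLATE.format(passage=passage.strip(), num_pairs=num_pairs)

prompt = generate_qa_prompt("Full text access is free in HTML pages.", num_pairs=2)
print(prompt)
```

The resulting string would then be sent to a GPT-4o deployment (for example via the Azure OpenAI chat completions API) and the numbered pairs parsed out of the response.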
Take some results as examples:
1. **What type of access is free in HTML pages?**
Full text access is free in HTML pages.
2. **Who can access PDF and EPub formats of the journal?**
PDF and EPub access is only available to paid subscribers and members.
3. **What must you do to access the article in PDF format?**
To access the article in PDF format, you should be a subscriber to the Journal of Postgraduate Medicine.
4. **How can you subscribe to the Journal of Postgraduate Medicine?**
You can subscribe online for a year.
5. **What can you do if you want your institution to have unrestricted access to the journal?**
You could recommend your institution’s library to subscribe to the journal so that you can have unrestricted access.
References
DataTrove: https://github.com/huggingface/datatrove/
Generate Synthetic QnAs from Real-world Data: https://github.com/Azure/synthetic-qa-generation/
Microsoft Tech Community – Latest Blogs –Read More
Generative AI with Microsoft Fabric
Microsoft Fabric seamlessly integrates with generative AI to enhance data-driven decision-making across your organization. It unifies data management and analysis, allowing for real-time insights and actions.
With Real Time Intelligence, keeping grounding data for large language models (LLMs) up-to-date is simplified. This ensures that generative AI responses are based on the most current information, enhancing the relevance and accuracy of outputs. Microsoft Fabric also infuses generative AI experiences throughout its platform, with tools like Copilot in Fabric and Azure AI Studio enabling easy connection of unified data to sophisticated AI models.
Check out GenAI experiences with Microsoft Fabric.
Classify and protect schematized data with Microsoft Purview.
Connect data from OneLake to Azure AI Studio.
Watch our video here:
QUICK LINKS:
00:00 — Unify data with Microsoft Fabric
00:35 — Unified data storage & real-time analysis
01:08 — Security with Microsoft Purview
01:25 — Real-Time Intelligence
02:05 — Integration with Azure AI Studio
Link References
This is Part 3 of 3 in our series on leveraging generative AI. Watch our playlist at https://aka.ms/GenAIwithAzureDBs
Unfamiliar with Microsoft Mechanics?
As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.
Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries
Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog
Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast
Keep getting this insider knowledge, join us on social:
Follow us on Twitter: https://twitter.com/MSFTMechanics
Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/
Enjoy us on Instagram: https://www.instagram.com/msftmechanics/
Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics
Video Transcript:
-If you want to bring custom Gen AI experiences to your app so that users can interact with them using natural language, the better the quality and recency of the data used to ground responses, the more relevant and accurate the generated outcome.
-The challenge, of course, is that your data may be sitting across multiple clouds, in your own data center and also on the edge. Here’s where the complete analytics platform Microsoft Fabric helps you to unify data wherever it lives at unlimited scale, without you having to move it.
-It incorporates a logical multi-cloud data lake, OneLake, for unified data storage and access and separately provides a real-time hub optimized for event-based streaming data, where change data capture feeds can be streamed from multiple cloud sources for analysis in real time without the need to pull your data. Then with your data unified, data professionals can work together in a collaborative workspace to ingest and transform it, analyze it, and also endorse it as they build quality data sets.
-And when used with Microsoft Purview, this can be achieved with an additional layer of security where you can classify and protect your schematized data, with protections flowing as everyone from your engineers and data analysts to your business users works with data in the Fabric workspace. Keeping grounding data for your LLMs up to date is also made easier by being able to act on it with Real Time Intelligence.
-For example, you might have a product recommendation engine on an e-commerce site and using Real Time Intelligence, you can create granular conditions to listen for changes in your data, like new stock coming in, and update data pipelines feeding the grounding data for your large language models.
-So now, whereas before the gen AI may not have had the latest inventory data available to it to ground responses, with Real Time Intelligence, generated responses can benefit from the most real-time, up-to-date information so you don’t lose out on sales. And as you work with your data, gen AI experiences are infused throughout Fabric. In fact, Copilot in Fabric experiences are available for all Microsoft Fabric workloads to assist you as you work.
-And once your data set is complete, connecting it from Microsoft Fabric to ground large language models in your gen AI apps is made easy with Azure AI Studio, where you can bring in data from OneLake seamlessly and choose from some of the most sophisticated large language models hosted in Azure to build custom AI experiences on your data, all of which is only made possible when you unify your data and act on it with Microsoft Fabric.
Mseries announcements – GA of Mv3 High Memory and details on Mv3 Very High Memory virtual machines
Mv3 High Memory General Availability
Executing on our plan to have our third version of M-series (Mv3) powered by 4th generation Intel® Xeon® processors (Sapphire Rapids) across the board, we’re excited to announce that Mv3 High Memory (HM) virtual machines (VMs) are now generally available. These next-generation M-series High Memory VMs give customers faster insights, more uptime, lower total cost of ownership and improved price-performance for their most demanding workloads. Mv3 HM VMs are supported for RISE with SAP customers as well. With the release of this Mv3 sub-family and the sub-family that offers around 32TB memory, Microsoft is the only public cloud provider that can provide HANA certified VMs from around 1TB memory to around 32TB memory all powered by 4th generation Intel® Xeon® processors (Sapphire Rapids).
Key features on the new Mv3 HM VMs
The Mv3 HM VMs can scale for workloads from 6TB to 16TB.
Mv3 delivers up to 40% more throughput than our Mv2 High Memory (HM), enabling significantly faster SAP HANA data load times for SAP OLAP workloads and significantly higher performance per core for SAP OLTP workloads over the previous-generation Mv2.
Powered by Azure Boost, Mv3 HM provides up to 2x more throughput to Azure premium SSD storage and up to 25% improvement in network throughput over Mv2, with more deterministic performance.
Designed from the ground up for increased resilience against failures in memory, disks, and networking based on intelligence from past generations.
Available in both disk and diskless offerings allowing customers the flexibility to choose the option that best meets their workload needs.
During our private preview, several customers such as SwissRe unlocked gains from the new VM sizes. In their own words:
“Mv3 High Memory VM results are promising – in average we see a 30% increase in the performance without any big adjustment.”
SwissRe
Msv3 High Memory series (NVMe)

| Size | vCPU | Memory in GiB | Max data disks | Max uncached Premium SSD throughput: IOPS/MBps | Max uncached Ultra Disk and Premium SSD V2 disk throughput: IOPS/MBps | Max NICs | Max network bandwidth (Mbps) |
|---|---|---|---|---|---|---|---|
| Standard_M416s_6_v3 | 416 | 5,696 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M416s_8_v3 | 416 | 7,600 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M624s_12_v3 | 624 | 11,400 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832s_12_v3 | 832 | 11,400 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832s_16_v3 | 832 | 15,200 | 64 | 130,000/8,000 | 260,000/8,000 | 8 | 40,000 |
Msv3 High Memory series (SCSI)

| Size | vCPU | Memory in GiB | Max data disks | Max uncached Premium SSD throughput: IOPS/MBps | Max uncached Ultra Disk and Premium SSD V2 disk throughput: IOPS/MBps | Max NICs | Max network bandwidth (Mbps) |
|---|---|---|---|---|---|---|---|
| Standard_M416s_6_v3 | 416 | 5,696 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M416s_8_v3 | 416 | 7,600 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M624s_12_v3 | 624 | 11,400 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832s_12_v3 | 832 | 11,400 | 64 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832s_16_v3 | 832 | 15,200 | 64 | 130,000/8,000 | 130,000/8,000 | 8 | 40,000 |
Mdsv3 High Memory series (NVMe)

| Size | vCPU | Memory in GiB | Temp storage (SSD) GiB | Max data disks | Max cached* and temp storage throughput: IOPS/MBps | Max uncached Premium SSD throughput: IOPS/MBps | Max uncached Ultra Disk and Premium SSD V2 disk throughput: IOPS/MBps | Max NICs | Max network bandwidth (Mbps) |
|---|---|---|---|---|---|---|---|---|---|
| Standard_M416ds_6_v3 | 416 | 5,696 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M416ds_8_v3 | 416 | 7,600 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M624ds_12_v3 | 624 | 11,400 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832ds_12_v3 | 832 | 11,400 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832ds_16_v3 | 832 | 15,200 | 400 | 64 | 250,000/1,600 | 130,000/8,000 | 260,000/8,000 | 8 | 40,000 |
Mdsv3 High Memory series (SCSI)

| Size | vCPU | Memory in GiB | Temp storage (SSD) GiB | Max data disks | Max cached* and temp storage throughput: IOPS/MBps | Max uncached Premium SSD throughput: IOPS/MBps | Max uncached Ultra Disk and Premium SSD V2 disk throughput: IOPS/MBps | Max NICs | Max network bandwidth (Mbps) |
|---|---|---|---|---|---|---|---|---|---|
| Standard_M416ds_6_v3 | 416 | 5,696 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M416ds_8_v3 | 416 | 7,600 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M624ds_12_v3 | 624 | 11,400 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832ds_12_v3 | 832 | 11,400 | 400 | 64 | 250,000/1,600 | 130,000/4,000 | 130,000/4,000 | 8 | 40,000 |
| Standard_M832ds_16_v3 | 832 | 15,200 | 400 | 64 | 250,000/1,600 | 130,000/8,000 | 130,000/8,000 | 8 | 40,000 |
*Read IOPS are optimized for sequential reads.
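When sizing a HANA system against the tables above, the choice usually starts from the memory requirement. The lookup can be sketched in a few lines; the `smallest_fitting_sku` helper is illustrative, and the memory figures are taken from the Msv3 High Memory tables above:

```python
# Memory per SKU in GiB, from the Msv3 High Memory tables above.
MSV3_HM_MEMORY_GIB = {
    'Standard_M416s_6_v3': 5696,
    'Standard_M416s_8_v3': 7600,
    'Standard_M624s_12_v3': 11400,
    'Standard_M832s_12_v3': 11400,
    'Standard_M832s_16_v3': 15200,
}

def smallest_fitting_sku(required_gib):
    """Return the smallest SKU whose memory covers the requirement, or None."""
    candidates = [(mem, sku) for sku, mem in MSV3_HM_MEMORY_GIB.items()
                  if mem >= required_gib]
    return min(candidates)[1] if candidates else None

print(smallest_fitting_sku(6 * 1024))    # a ~6 TiB HANA database
print(smallest_fitting_sku(12 * 1024))   # a ~12 TiB HANA database
```

Workloads beyond the largest entry would fall to the Very High Memory sizes described in the next section of the original announcement.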
Regional Availability and Pricing
The VMs are now available in West Europe, North Europe, East US, and West US 2. For pricing details, please take a look here for Windows and Linux.
Additional resources:
SAP Certification for Mv3 on Azure
Details on Mv3 Very High Memory Virtual Machines
We are thrilled to unveil the latest and largest additions to our Mv3-Series, Standard_M896ixds_32_v3 and Standard_M1792ixds_32_v3 VM SKUs. These new VM SKUs are the result of a close collaboration between Microsoft, SAP, experienced hardware partners, and our valued customers.
Key features on the new Mv3 VHM VMs
Unmatched Memory Capacity: With close to 32TB of memory, both the Standard_M896ixds_32_v3 and Standard_M1792ixds_32_v3 VMs are ideal for supporting very large in-memory databases and workloads.
High CPU Power: Featuring 896 cores in the Standard_M896ixds_32_v3 VM and 1792 vCPUs** in the Standard_M1792ixds_32_v3 VM, these VMs are designed to handle high-end S/4HANA workloads, providing more CPU power than other public cloud offerings.
Enhanced Network and Storage Bandwidth: Both VM types provide the highest network and storage bandwidth available in Azure for a full node VM, including up to 200-Gbps network bandwidth with Azure Boost.
Optimal Performance for SAP HANA: Certified for SAP HANA, these VMs adhere to the SAP prescribed socket-to-memory ratio, ensuring optimal performance for in-memory analytics and relational database servers.
| Size | vCPU or cores | Memory in GiB | SAP HANA Workload Type |
|---|---|---|---|
| Standard_M896ixds_32_v3 | 896 | 30,400 | OLTP (S/4HANA) / OLAP Scaleup |
| Standard_M1792ixds_32_v3 | 1792** | 30,400 | OLAP Scaleup |
**Hyperthreaded vCPUs