Category: News
Teams Improves Text Pasting and Mic Pending
Who Thought that Including Metadata in Teams Pasted Text Was a Good Idea?
In an example of finally listening to user feedback, Microsoft announced in MC878422 (30 August 2024) that Teams no longer includes metadata in messages copied from chats or channel conversations. The change is effective now and means that instead of having Teams insert a timestamp and the name of the person who created the text, only the text is pasted. This is exactly the way the feature should have worked since day zero. Quite why anyone thought it was a good idea to insert additional information into copied text is one of the great mysteries of Teams development.
MC878422 notes: “Many users have voiced frustrations over copying messages in Teams, particularly the inclusion of metadata like names and timestamps. Customer feedback has been clear, signaling that this feature was adding more noise than value to user workflow.”
Copying Metadata Is an Old Lync Feature
It seems likely that inserting the timestamp and author name is an idea that came to Teams from Lync Server 2013 and Skype for Business. A support article from the time describes how to change the default setting of copying the message, name, and time to copying just the message. Nearly eight years after Teams entered preview in November 2016, an equivalent setting to the one in Lync Server 2013 never appeared. The net result is that Teams users had to manually remove the unwanted metadata from copied text after pasting it into another app. Thankfully, the change “helps maintain focus and reduces unnecessary noise.”
I’ve no idea about how many of the 320 million monthly active Teams users found this aspect of the product annoying, but it’s been high up on my list along with in-product advertising and a constant stream of irritating pop-up messages.
Mic Pending is a Feature You Probably Never Knew Existed
On a more positive note, Juan Rivera, Corporate Vice President for Teams Calling, Meetings & Devices Engineering at Microsoft, posted on LinkedIn about a feature called the Mic Pending state, which has apparently now rolled out to all tenants.
I have never thought much about the process required to implement the mute/unmute button in a call, but apparently Microsoft has done the work to make sure that when users hit the mic button (Figure 1), the action occurs immediately. If something gets in the way of muting or unmuting, Teams displays a “pending” icon when it notices that the action has taken more than 100 milliseconds.
Figure 1: The Teams mute mic button now works with 99.99+% reliability
The issue being addressed is making sure that people can trust Teams to mute their microphone the moment they press the button, and to unmute it just as promptly. It seems some folks have been caught out by a delay in muting: the button displayed in a Teams meeting showed that the microphone was off when it was still live. You can see how this could end with something being heard, or captured in a Teams recording, that people would have preferred to keep private. Calling your boss a flaming idiot over an open microphone that you thought was muted is possibly not a good thing to do.
According to the post, Microsoft believes that Teams delivers 99.99+% reliability for the mute/unmute toggle, which should mean that the microphone status shown on screen can be trusted. Of course, the paranoid amongst us will always give a microphone two or three seconds before we consider it to be truly off.
Two Good Changes
One thing about Teams is certain: it’s always changing. People like the Office 365 for IT Pros writing team have no shortage of topics to cover when it comes to Teams. Thankfully, the two topics covered here are both positive, even if mic pending hadn’t come to our attention before.
Insight like this doesn’t come easily. You’ve got to know the technology and understand how to look behind the scenes. Benefit from the knowledge and experience of the Office 365 for IT Pros team by subscribing to the best eBook covering Office 365 and the wider Microsoft 365 ecosystem.
10 more AI terms you need to know
Read the English version here
Jakarta, 4 September 2024 – Since generative artificial intelligence (AI) surged in popularity in late 2022, most of us have gained a basic understanding of the technology and of how it uses everyday language to help us interact with computers more easily. Some of us have even dropped jargon like “prompt” and “machine learning” over a casual coffee with friends. In late 2023, Microsoft summarized 10 AI terms you need to know. But as AI evolves, so does its vocabulary. Do you know the difference between large and small language models? Or what the “GPT” in ChatGPT stands for? Here are ten more advanced AI terms you need to know.
Reasoning/planning
Computers that use AI can now solve problems and complete tasks by using patterns they have learned from historical data to make sense of information. This process is similar to reasoning, or logical thinking. The most advanced AI systems show the ability to go a step further and can tackle increasingly complex problems through planning: devising the sequence of actions that needs to be carried out to reach a particular goal.
For example, imagine asking an AI program to help plan a trip to an amusement park. You write, “I want to visit six different rides at amusement park X, including the water ride during the hottest part of the day on Saturday, 5 October.” Based on that goal, the AI system can break it down into small steps to build a schedule, using reasoning to make sure you don’t visit the same ride twice and that you can ride the water ride between 12 p.m. and 3 p.m.
Training/inference
There are two steps in building and using an AI system: training and inference. Training is the process of educating the AI system: it is given a dataset, and it learns to perform tasks or make predictions based on that data. For example, the system might be given a list of prices for houses recently sold in a neighborhood, complete with the number of bedrooms and bathrooms in each house and many other variables. During training, the AI system adjusts its internal parameters: values that determine how much weight to give each variable and how it affects the sale price of a house. Inference is when the AI system uses those learned patterns and parameters to produce a price prediction for a house that will come on the market in the future.
Small language model (SLM)
Small language models, or SLMs, are miniature versions of large language models (LLMs). Both use machine learning techniques to help them recognize patterns and relationships so they can generate realistic responses in everyday language. While LLMs are enormous and need substantial computing power and memory, SLMs such as Phi-3 are trained on smaller, curated datasets and have fewer parameters, making them more compact and even usable offline, without an internet connection. That makes them a good fit for devices such as laptops or phones, where you might want to ask a simple question about pet care but don’t need detailed information on how to train a guide dog.
Grounding
Generative AI systems can compose stories, poems, and jokes, and answer research questions. But sometimes they struggle to separate fact from fiction, or their training data is out of date, so they can give inaccurate responses—an occurrence known as hallucination. Developers work to help AI interact with the real world accurately through grounding: the process of connecting and anchoring their models to real data and examples to improve accuracy and produce output that is more contextually relevant and personalized.
Retrieval Augmented Generation (RAG)
When developers give an AI system access to a grounding source to help it be more accurate and up to date, they are using a method called Retrieval Augmented Generation, or RAG. The RAG pattern saves time and resources by adding extra knowledge without having to retrain the AI program.
It’s as if you were the detective Sherlock Holmes: you have read every book in the library but still can’t solve the case, so you climb up to the attic, unroll a few ancient scrolls, and voilà — you find the missing piece of the puzzle. As another example, if you run a clothing company and want to build a chatbot that can answer questions specific to your products, you can apply the RAG pattern to your product catalog to help customers find the perfect green sweater from your store.
Orchestration
An AI program has a lot to do while processing a user’s request. To make sure the system performs all of these tasks in the right order and produces the best response, everything is coordinated by an orchestration layer.
For example, if you ask Microsoft Copilot “who was Ada Lovelace” and then ask “when was she born” in the next prompt, the AI orchestrator keeps your chat history so it can see that the “she” in the second prompt refers to Ada Lovelace.
The orchestration layer can also follow the RAG pattern, searching the internet for fresh information to add to the context and help the model produce a better answer. It is like a conductor cueing the violins, then the flutes and oboes, while following the sheet music to produce the sound the composer had in mind.
Memory
Today’s AI models technically have no memory. But AI programs can follow orchestrated instructions that help them “remember” information by taking specific steps with every interaction — such as temporarily storing previous questions and answers in a chat and including that context in the current request to the model, or using grounding data from the RAG pattern to make sure a response draws on the most recent information. Developers are experimenting with orchestration layers to help AI systems work out whether they only need to remember the details of a step temporarily (short-term memory, like jotting something on a sticky note) or whether it would be more useful to remember it for longer by storing it in a more permanent location.
Transformer models and diffusion models
People have been teaching AI systems to understand and generate language for decades, but one of the breakthroughs that accelerated recent progress is the transformer model. Among generative AI models, transformers are the ones that understand context and nuance best and fastest. They are fluent storytellers, paying attention to patterns in data and weighing the importance of different inputs to help them quickly predict what comes next, which is what lets them generate text. The transformer is even the T in ChatGPT — Generative Pre-trained Transformer. Diffusion models, which are typically used for image generation, add a different twist by working more gradually and methodically, diffusing the pixels of an image from random positions until they are distributed in a way that forms the image requested in the prompt. Diffusion models keep making small changes until they create an output that matches what the user needs.
Frontier models
Frontier models are large-scale systems that push the boundaries of AI and can perform a wide variety of tasks with new, broader capabilities. They can be so advanced that we are sometimes surprised by what they can accomplish. Technology companies, including Microsoft, formed the Frontier Model Forum to share knowledge, set safety standards, and help everyone understand these powerful AI programs so that they are developed safely and responsibly.
GPU
A GPU, short for Graphics Processing Unit, is essentially a turbocharged calculator. GPUs were originally designed to smooth out fancy graphics in video games and have since become the muscle of computing. The chips contain many small cores — networks of circuits and transistors — that work on math problems together, which is known as parallel processing. That is essentially what AI does: solve vast numbers of calculations at scale so it can communicate in human language and recognize images or sounds. Because of this, AI platforms depend heavily on GPUs, both for training and for inference. In fact, today’s most advanced AI models are trained on huge arrays of interconnected GPUs — sometimes numbering in the tens of thousands and spread across giant data centers — such as those Microsoft runs in Azure, which rank among the most powerful computers ever built.
Learn more about the latest AI news on Microsoft Source and our news in Indonesia through this page.
-END-
Transferring Reusable PowerShell Objects Between Microsoft 365 Tenants
The Graph SDK’s ToJsonString Method Proves Its Worth
One of the frustrations about using the internet is when you find some code that seems useful, copy the code to try it out in your tenant, and discover that some formatting issue prevents the code from running. Many reasons cause this to happen. Sometimes it’s as simple as an error when copying code into a web editor, and sometimes errors creep in after copying the code, perhaps when formatting it for display. I guess fixing the problems is an opportunity to learn what the code really does.
Answers created by generative AI solutions like ChatGPT, Copilot for Microsoft 365, and GitHub Copilot compound the problem by faithfully reproducing errors in their responses. This is no fault of the technology, which works by creating answers from what’s gone before. If published code includes a formatting error, generative AI is unlikely to find and fix the problem.
Dealing with JSON Payloads
All of which brings me to a variation on the problem. The documentation for the Graph APIs used to create or update objects usually includes an example of a JSON-formatted payload containing the parameter values for the request. The Graph API interprets the JSON content in the payload to extract the parameters needed to run a request. By comparison, Microsoft Graph PowerShell SDK cmdlets use hash tables and arrays to pass parameters. The hash tables and arrays mimic the elements of the JSON structure used by the underlying Graph APIs.
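To make the comparison concrete, here is a minimal sketch of a hypothetical update (the property values are illustrative, not taken from any tenant) showing how the JSON payload a raw Graph request carries maps onto the hash table a Graph SDK cmdlet accepts:
# JSON payload a raw Graph request would carry (shown as a comment for comparison):
# { "jobTitle": "VP Marketing", "officeLocation": "NYC" }
# Equivalent hash table passed to a Graph SDK cmdlet
$Params = @{
    jobTitle       = 'VP Marketing'
    officeLocation = 'NYC'
}
Update-MgUser -UserId 'Kim.Akers@office365itpros.com' -BodyParameter $Params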
Composing a JSON payload is no challenge if you can write perfect JSON. Like any other set of rules for programming or formatting, it takes time to become fluent with JSON, and who can afford that time when other work exists to be done? Here’s a way to make things easier.
Every object generated by a Graph SDK cmdlet has a ToJsonString method to create a JSON-formatted version of the object. For example:
$User = Get-MgUser -UserId Kim.Akers@office365itpros.com
$UserJson = $User.ToJsonString()
$UserJson
{
  "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#users/$entity",
  "id": "d36b323a-32c3-4ca5-a4a5-2f7b4fbef31c",
  "businessPhones": [ "+1 713 633-5141" ],
  "displayName": "Kim Akers (She/Her)",
  "givenName": "Kim",
  "jobTitle": "VP Marketing",
  "mail": "Kim.Akers@office365itpros.com",
  "mobilePhone": "+1 761 504-0011",
  "officeLocation": "NYC",
  "preferredLanguage": "en-US",
  "surname": "Akers",
  "userPrincipalName": "Kim.Akers@office365itpros.com"
}
The advantage of using the ToJsonString method instead of PowerShell’s ConvertTo-JSON cmdlet is that the method doesn’t output properties with empty values. This makes the resulting output easier to review and manage. For instance, the JSON content shown above is a lot easier to use as a template for adding new user accounts than the equivalent generated by ConvertTo-JSON.
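For comparison, here is a minimal sketch of running the same object through the standard cmdlet. ConvertTo-JSON serializes every property of the SDK object, populated or not, which is why its output is so much noisier:
$User | ConvertTo-Json -Depth 2
# The output typically includes a long list of empty properties (aboutMe, ageGroup,
# and so on) alongside the populated ones, which is what makes it harder to reuse.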
Transferring a Conditional Access Policy Using ToJsonString
The output generated by ToJsonString becomes very interesting when you want to move objects between tenants. For example, let’s assume that you use a test tenant to create and fine tune a conditional access policy. The next piece of work is to transfer the conditional access policy from the test tenant to the production environment. Here’s how I make the transfer:
Run the Get-MgIdentityConditionalAccessPolicy cmdlet to find the target policy and export its settings to JSON. Then save the JSON content in a text file.
$Policy = Get-MgIdentityConditionalAccessPolicy -ConditionalAccessPolicyId '1d4063cb-5ebf-4676-bfca-3775d7160b65'
$PolicyJson = $Policy.toJsonString()
$PolicyJson > PolicyExport.txt
Edit the text file to replace any tenant-specific items with equivalent values for the target tenant. For instance, conditional access policies usually include an exclusion for break glass accounts, which are listed in the policy using the account identifiers. In this case, you need to replace the account identifiers for the source tenant in the exported text file with the account identifiers for the break glass account for the target tenant.
Disconnect from the source tenant.
Connect to the target tenant with the Policy.ReadWrite.ConditionalAccess scope.
Create a variable ($Body in this example) containing the conditional access policy settings, as sketched after the code below.
Run the Invoke-MgGraphRequest cmdlet to import the policy definition into the target tenant.
$Uri = "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies"
Invoke-MgGraphRequest -Uri $Uri -Method Post -Body $Body
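The creation of the $Body variable isn’t shown above. A minimal sketch, assuming the edited export from the earlier step was saved as PolicyExport.txt in the current folder, is to read the raw JSON text back into the variable before running the request:
# Read the edited JSON exported from the source tenant into $Body
$Body = Get-Content -Path .\PolicyExport.txt -Raw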
The Other Way
Another way to create a conditional access policy with PowerShell is to run the New-MgIdentityConditionalAccessPolicy cmdlet, which takes a hash table as its payload. It’s easy to translate the JSON into the format used for parameter values stored in the hash table, but it’s even easier to run Invoke-MgGraphRequest and pass the edited version of the JSON exported from the source tenant. Why make things hard for yourself?
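If you do want to take the cmdlet route, here is a minimal sketch (assuming PowerShell 7, where ConvertFrom-Json supports the -AsHashtable switch, and the same PolicyExport.txt file used earlier) that converts the exported JSON into a hash table and passes it as the body parameter:
# Convert the edited JSON export into a hash table (PowerShell 7 or later)
$Params = Get-Content -Path .\PolicyExport.txt -Raw | ConvertFrom-Json -AsHashtable
# Read-only properties (such as id) may need to be stripped during the edit step
New-MgIdentityConditionalAccessPolicy -BodyParameter $Params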
This tip is just one of the hundreds included in the Automating Microsoft 365 with PowerShell eBook (available separately, as part of the Office 365 for IT Pros (2025 edition) bundle, or as a paperback from Amazon.com).
undefined symbol xcb_shm_id when trying to startup MatLab
When trying to start up MatLab, I get
> ./bin/matlab
MATLAB is selecting SOFTWARE rendering.
/home/pblase/.MathWorks/ServiceHost/clr-df9a0cbb6bd34e079ef626671d1a7b7c/_tmp_MSHI_5363-9225-767d-e56f/mci/_tempinstaller_glnxa64/bin/glnxa64/InstallMathWorksServiceHost: symbol lookup error: /usr/lib64/libcairo.so.2: undefined symbol: xcb_shm_id
/home/pblase/.MathWorks/ServiceHost/clr-df9a0cbb6bd34e079ef626671d1a7b7c/_tmp_MSHI_5363-9225-767d-e56f/mci/_tempinstaller_glnxa64/bin/glnxa64/InstallMathWorksServiceHost: symbol lookup error: /usr/lib64/libcairo.so.2: undefined symbol: xcb_shm_id
Unexpected exception: 'N7mwboost10wrapexceptINS_16exception_detail39current_exception_std_exception_wrapperISt13runtime_errorEEEE: Error loading /home/pblase/matlab/bin/glnxa64/matlab_startup_plugins/matlab_graphics_ui/mwuixloader.so. /usr/lib64/libXt.so.6: undefined symbol: SmcModifyCallbacks: Success: Success' in createMVMAndCallParser phase 'Creating local MVM'
Intro to matlab lab and I have no idea how this works
<</matlabcentral/answers/uploaded_files/1765134/Screenshot%202024-09-02%20at%206.21.33%E2%80%AFPM.png>>
I don’t know what I am supposed to do with the second part of question 3 and also don’t know what to do with #4. This is my first time ever taking a class about coding so I’m super lost.
Poor performance of linprog in practice
I have to solve a dynamic programming problem using a linear programming approach. For details, please see this paper. The LP that I want to solve is:
min c'*v
s.t.
A*v >= u,
where c is n*1, v is n*1, A is n^2*n, u is n^2*1.
The min is with respect to v, the value function of the original DP problem. I have a moderate number of variables, n=300, and m=n^2=90000 linear inequalities as constraints. No bound constraints on v.
I use the Matlab function linprog, which in turn is based on the HiGHS solver (since R2024a). The code is slow for my purposes (i.e. a brute-force value iteration is much faster). Moreover, linprog gives correct results only if I set the option 'Algorithm','dual-simplex-highs'. With other algorithms, it gets stuck.
After profiling the code, it turns out that the bottleneck is line 377 of linprog:
[x, fval, exitflag, output, lambda] = run(algorithm, problem);
I was wondering if there is a way to speed up the code. Any help or suggestion is greatly appreciated! I put below a MWE to illustrate the problem.
clear,clc,close all
%% Set parameters
crra = 2;
alpha = 0.36;
beta = 0.95;
delta = 0.1;
%% Grid for capital
k_ss = ((1-beta*(1-delta))/(alpha*beta))^(1/(alpha-1));
n_k = 300;
k_grid = linspace(0.1*k_ss,1.5*k_ss,n_k)';
%% Build current return matrix, U(k',k)
cons = k_grid'.^alpha+(1-delta)*k_grid'-k_grid;
U_mat = f_util(cons,crra);
U_mat(cons<=0) = -inf;
%% Using LINEAR PROGRAMMING
% min c'*v
% s.t.
% A*v>=u, where c is n*1, v is n*1, A is n^2*n, u is n^2*1
n = length(k_grid);
c_vec = ones(n,1);
u_vec = U_mat(:); %% U(k',k), stack columnwise
%% Build A matrix using cell-based method
tic
A = cell(n,1);
bigI = (-beta)*speye(n);
for i=1:n
    temp = bigI;
    temp(:,i) = temp(:,i)+1;
    A{i} = temp;
end
A = vertcat(A{:});
disp('Time to build A matrix with cell method:')
toc
%% Call linprog
% 'dual-simplex-highs' (default and by far the best)
options = optimoptions('linprog','Algorithm','dual-simplex-highs');
tic
[V_lin,fval,exitflag,output] = linprog(c_vec,-A,-u_vec,[],[],[],[],options);
disp('Time linear programming:')
toc
if exitflag<=0
    warning('linprog did not find a solution')
    fprintf('exitflag = %d \n',exitflag)
end
%% Now that we solved for V, compute policy function
RHS_mat = U_mat+beta*V_lin; % (k',k)
[V1,pol_k_ind] = max(RHS_mat,[],1);
pol_k = k_grid(pol_k_ind);
% Plots
figure
plot(k_grid,V1)
figure
plot(k_grid,k_grid,'--',k_grid,pol_k)
function util = f_util(c,crra)
    util = c.^(1-crra)/(1-crra);
end
How to import .EEG or text or excel file to EEGlab
Hi all, I have 1-hour EEG data with a sampling frequency of 291 Hz. I’ve installed EEGLAB v14.1.1 and tried to load my data files in ‘.EEG’, ‘text’, and ‘excel’ formats, but none of them load into EEGLAB. It shows the following error. Please help me solve this issue since I’m new to the EEGLAB software.
Conditional formating using formula
Hi,
I’m looking to apply a conditional format to a table (Table1) which highlights the row where a cell matches a cell within another table (Table2)
I’ve had a look online, the only thing I can find is a formula which works if I refer to an array of cells rather than another table in the workbook:
=MATCH(A2,Array1,0)
This only highlights a single cell, even if I try to apply the conditional format to the Table1
Can anyone help?
Thanks
New Outlook:
Can’t sign in to the New Outlook. Besides, my hotmail is blocked and I cannot access my mails.
Migrating to 365 with 2 domains
I have a client that has two different domains (old and new). Example: Old email: email address removed for privacy reasons; new email: email address removed for privacy reasons. It looks like their provider created aliases for the new domain. The problem is they still get email going to the old address that gets forwarded(?) to the new one. I want to migrate over to 365. I’m pretty sure the migration will work to transfer over their email history using the new email, but I’m not sure how the forwarding will work. Can I create aliases for the old email in 365 to do the same?
Upcoming marketplace webinars available in September
Whether you are brand new to marketplace or have already published multiple offers, our Mastering the Marketplace webinar series has a variety of offerings to help you maximize the marketplace opportunity. Check out these upcoming webinars in September:
▪ Creating your first offer in Partner Center (9/5): Learn how to start with a new SaaS offer in the commercial marketplace; set up the required fields in Partner Center and understand the options and tips to get you started faster!
▪ Creating Plans and Pricing for your offer (9/10): Learn about the payouts process lifecycle for the Microsoft commercial marketplace, how to view and access payout reporting and what payment processes are supported within Partner Center. We will review the payouts process lifecycle for the Azure Marketplace; how to register and the registration requirements; general payout processes from start to finish; and, how to view and access payout reporting.
▪ AI and the Microsoft commercial marketplace (9/12): Through the Microsoft commercial marketplace, get connected to the solutions you need—from innovative AI applications to cloud infra and everything in between. Join this session to learn what’s on our roadmap and see how the marketplace helps you move faster and spend smarter.
▪ Developing your SaaS offer (9/12): In this technical session, learn how to implement the components of a fully functional SaaS solution including how to implement a SaaS landing page and webhook to subscribe to change events, and how to integrate your SaaS product into the marketplace.
Find our complete schedule here: https://aka.ms/MTMwebinars
#ISV #maximizemarketplace #Azure #MSMarketplace #MSPartners
Formula returning dash when I add a new cell
Extremely frustrating. I use this sheet to track my side job pay and it glitches every time I try to edit it and returns 0. I am trying to add August to the gross pay total.
Tasks
When I open Tasks I get “The task owner has restricted this action,” and “This list cannot be modified as it no longer exists.” I am horrified as I use it every day. I can’t modify the task in any way. How can I fix this?
A generalisation of the MAP lambda helper function
Discussion topic. Your thoughts are welcome.
On Saturday I finally bit the bullet and completed a MAPλ Lambda function that generalises the in-built MAP Lambda helper function. As examples, I tried problems of generating the Kronecker product of two matrices and then one of generating variants of an amortisation table.
The original amortisation schedule uses SCAN to calculate closing balances step by step from opening balances. Having returned the closing balances as an array, the principal is inserted at the first element to give opening balances. An array calculation based on the same code is used to return other values of interest using HSTACK.
Following that, I created the array of loan terms {10, 15, 20} (yrs) and used the formula
= MAPλ(variousTerms, AmortisationTableλ(principal, rate, startYear))
to generate
as a single spilled range.
I have posted a copy of MAPλ on GitHub
A version of Excel MAP helper function that will return an array of arrays (github.com)
The intention is that the function can be used without knowing how it works but you are, of course, welcome to try to pick through it.
Update Error for Windows 11 Insider Preview (10.0.26120.1542)
Hi!
When the update Windows 11 Insider Preview (10.0.26120.1542) started, it reached 1% and suddenly stopped.
I tried to run a Troubleshoot for Windows Update inside Configurations and it shows an error 0x803C010A and didn’t proceed as well.
Anyone solved this problem?
Thanks
How to sync Outlook Notes with Gmail account
I have Outlook 2021 desktop installed on my PC. I would like to sync the Outlook Notes:
with my Google Workspace account. Is this possible?
Default SQL Server Connection for SSMS
SQL 2019 – SSMS 19.3.4.0
I was always wrongly under the impression that SSMS required a server connection in the Object Explorer to run a script against. We have databases with the same names on 2 servers as we’re preparing for migration and I accidentally ran a script on server B, even though there appeared to be no connection open to server B. Only Server A was connected in the object explorer. I was then shocked to find that any new sql script I opened was connected to server B which had been closed out in Object Explorer.
What controls the default server for a script when opening via File / Open in SSMS? What is the best way to lock a script to specific server or make it more obvious which server this is being applied to. I may need to get used to looking in the bottom right where it displays the SQL server, but I’d like to make it more fool proof.
I see activating SQLCMD Mode on the Query Menu is one option, but I wonder what the downside to this might be such that it is not default behaviour.
AI Studio End-to-End Baseline Reference Implementation
Azure AI Studio is designed to cater to the growing needs of developers seeking to integrate advanced AI capabilities into their applications with a focus on operational excellence. Addressing key factors such as security, scalability, and regulatory adherence, Azure AI Studio ensures that AI deployments are seamless, sustainable, and strategically aligned with business objectives.
We’re excited to present the end-to-end baseline reference implementation for Azure AI Studio, a definitive guide designed to facilitate the deployment of AI workloads in the cloud. This architecture has been designed to assist organizations in finding structured solutions for deploying AI applications that are production ready in an enterprise environment at scale.
Features of the Baseline Architecture
This architecture incorporates several important features:
Secure Network Perimeter: Creates a secure boundary for AI applications with strict network security and segmentation capabilities.
Identity Management: Implements strong access management to regulate interactions and maintain secure operations within AI services and data.
Scalability: Provides a flexible infrastructure to support the growth of AI applications, ensuring performance is not sacrificed as demand increases.
Compliance and Governance: Maintains a commitment to following enterprise governance policies and meeting compliance standards throughout the life of an AI application.
Supported Scenarios of the Baseline Architecture
The reference architecture supports various important use cases, including:
AI Studio Project Playground: An integrated environment for engaging with Azure OpenAI technologies, where you can chat with your data, test out various AI-powered assistants, and utilize completion features for text. This tool serves as a one-stop shop to assess, refine, and validate your AI-driven projects.
Promptflow Workflows: This feature supports the development of complex AI workflows, integrating elements like custom Python scripts and large language model integrations, enhancing operational excellence.
Resilient, Managed Deployments: Manages the deployment of AI applications to Azure’s managed virtual networks, ensuring solid and dependable access via client UI hosted in Azure App Service.
Self-Hosting with Azure App Service: This alternative gives enterprises full control to customize and manage Promptflow deployment using Azure App Service leveraging advanced options such as availability zones.
You can find the reference implementation in the following link: aistudio-end-to-end-baseline-architecture
AI Season for Developers!
If you’re passionate about artificial intelligence and application development, don’t miss the chance to watch this great Microsoft Reactor series. Over the season we went from the fundamentals of Azure OpenAI to the latest innovations presented at Microsoft Build 2024, closing with the powerful Semantic Kernel framework for building intelligent applications. Every session is packed with demos so you can understand each concept and apply it effectively.
Episodes:
Episode 1: Introduction to Azure OpenAI
We explore Azure OpenAI models, their capabilities, and how to integrate them with the Azure SDK.
Episode 2: Considerations for Deploying Models in Azure OpenAI
We learned how to manage the service quota, balance throughput and latency, plan for cost management, and apply the RAG pattern to optimize your deployments.
Episode 3: What’s New from Microsoft Build: PHI3, GPT-4o, Azure Content Safety, and More
We covered the latest news from Microsoft Build, including PHI 3, GPT-4o with multimodal capabilities, the new Azure AI Studio, and Azure Content Safety.
Episode 4: Getting Started with Semantic Kernel
We got to know Semantic Kernel, an open-source SDK that makes it easy to integrate advanced LLMs into your applications to create smarter, more natural experiences.
Episode 5: Build Your Own Copilot with Semantic Kernel
We learned how to use Semantic Kernel Plugins, Planners, and Memories to build copilots that work side by side with users, giving them intelligent suggestions to complete tasks.
Don’t miss it! Rewatch every episode to discover how you can take your applications to the next level with Microsoft AI.
Learn more and build your AI skills during this series with this collection of Microsoft Learn resources:
Speakers:
Luis Beltran – Microsoft MVP – LinkedIn
Pablo Piovano – Microsoft MVP – LinkedIn
Make a High-Quality Dataset from WARC for Pre-training
You’re welcome to follow my GitHub repo and give it a star: https://github.com/xinyuwei-david/david-share.git
In the following subsections, we will explain each step involved in generating a high-quality dataset for pre-training.
How to evaluate the quality of training data?
There are four methods for evaluating the quality of training data, including but not limited to the following.
Using a “clean” corpus and perplexity check
Method: Train a model using a high-quality corpus (e.g., Wikipedia) and then use this model to check the perplexity of the new dataset.
Advantages:
Quick: Can quickly assess the quality of the dataset.
Simple: Relatively simple to implement, does not require complex computational resources.
Disadvantages:
Limitations: Low perplexity does not necessarily mean better performance on specific tasks.
Single Metric: Perplexity is just a single metric and cannot fully reflect the quality of the dataset.
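As a rough sketch of this method (assuming the Hugging Face transformers and torch packages, with gpt2 standing in for a model trained on a relatively clean corpus), perplexity can be computed per document like this:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text, model_name="gpt2"):
    # Load a reference model trained on a relatively clean corpus
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    # Tokenize and truncate to the model's context window
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing the input ids as labels returns the average cross-entropy loss
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Perplexity is the exponential of the loss; lower values suggest text closer to the reference corpus
    return torch.exp(outputs.loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))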
Training small models and testing on evaluation tasks
Method: Extract a portion of data from the dataset, train a small model, and test the model’s performance on a set of specific evaluation tasks (e.g., SQuAD, GLUE, etc.).
Advantages:
Specific: Provides specific performance feedback by testing the model on actual tasks.
Diversity: Allows for the selection of various evaluation tasks to comprehensively assess the dataset quality.
Disadvantages:
Resource Demand: Requires a certain amount of computational resources and time.
Task Selection: Needs to select diverse and representative evaluation tasks, which may increase complexity.
Early signal method
Method: Train a small model and conduct preliminary evaluations on some simple and quick benchmark tasks (e.g., text classification, sentiment analysis, etc.).
Advantages:
Rapid Iteration: Quickly obtain initial feedback, facilitating rapid iteration and optimization.
Suitable for Early Stages: Helps quickly screen datasets in the early stages of development.
Disadvantages:
Simple Tasks: These tasks may be relatively simple and may not fully represent the model’s performance on complex tasks.
Preliminary Evaluation: Only provides initial performance feedback, which may require further detailed evaluation.
Using GPT-4 for evaluation
Method: Use the GPT-4 model to evaluate the new dataset, potentially including various tasks (e.g., text generation, question answering, sentiment analysis, etc.).
Advantages:
High-Quality Evaluation: As a powerful language model, GPT-4 can provide high-quality evaluation results, especially on complex tasks.
Multi-Task Capability: Can evaluate on various tasks, providing comprehensive performance feedback.
Real-World Usage: Evaluation results are closer to actual usage, especially if your final application is also based on similar advanced models.
Disadvantages:
Computational Resources: Training and evaluating GPT-4 requires a large amount of computational resources and time, which may increase costs.
Complexity: The complexity of GPT-4 means more potential issues during debugging and optimization.
Overfitting Risk: If not careful, there is a risk of over-optimizing specific tasks, leading to poorer performance on other tasks.
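As a minimal sketch of this method (assuming the openai Python package and an API key in the environment; the model name, prompt, and 1-5 scale are illustrative assumptions rather than a prescribed setup):
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_document(text):
    # Ask the model to act as a data-quality judge and return a 1-5 score
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You rate web text for use as LLM pre-training data."},
            {"role": "user", "content": "Rate this document from 1 (spam or garbled) to 5 "
                                        "(clean and informative). Reply with the number only.\n\n" + text[:4000]},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(rate_document("Buy cheap watches online now!!! Best deals, click here."))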
Summary
Using a “clean” corpus and perplexity check: Suitable for quick, preliminary quality assessment but limited to a single metric.
Training small models and testing on evaluation tasks: Suitable for scenarios requiring specific task performance feedback but requires more resources and task selection.
Early signal method: Suitable for the early stages of development to quickly screen datasets but involves simpler tasks.
Using GPT-4 for evaluation: Suitable for scenarios requiring high-quality and comprehensive evaluation, providing feedback closest to actual usage but with high resource demands.
Prepare environment
In the following content, I will show how to create a high-quality dataset from a WARC file.
Create conda env
#conda create --name=dataclean python=3.10
#conda activate dataclean
(dataclean) root@david1a100:~# cd dataclean/
(dataclean) root@david1a100:~/dataclean# hostname
david1a100.australiaeast.cloudapp.azure.com
#pip install datatrove xxhash faust-cchardet python-magic warcio fasteners tldextract trafilatura fasttext-wheel nltk awscli fasttext numpy==1.21.0
#pip install datatrove[all]
#pip install datatrove trafilatura awscli
#aws configure
Download WARC
Access the following link to check WARC file address:
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/index.html
Download the file named warc.paths.gz:
Check the file paths listed in warc.paths.gz, which look like the example below. There are many warc.gz files; I only take CC-MAIN-20230527223515-20230528013515-00000.warc.gz as an example.
crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz
Download the file with the following script:
(dataclean) root@david1a100:~/dataclean# cat download_warc_file.py
import os
import subprocess

def download_warc_file(url, output_dir):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    print(f"Downloading {url}...")
    command = f"wget -P {output_dir} {url}"
    subprocess.run(command, shell=True, check=True)

if __name__ == '__main__':
    # URL of the WARC file
    warc_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00000.warc.gz"
    # output directory
    output_dir = "/root/dataclean/data/CC-MAIN-2023-23/segments"
    download_warc_file(warc_url, output_dir)
Basic data processing
After downloading 00000.warc.gz, the script uses the local executor LocalPipelineExecutor to run the data processing pipeline, which includes the following steps:
reading WARC files
filtering URLs
extracting content using Trafilatura
filtering non-English content
filtering duplicate content
filtering low-quality content
writing the processed data to JSONL files.
(dataclean) root@david1a100:~/dataclean# cat process_common_crawl_dump.py
import nltk
import sys
import os
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import (
    GopherQualityFilter,
    GopherRepetitionFilter,
    LanguageFilter,
    URLFilter,
)
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

def download_punkt():
    nltk.download('punkt')
    nltk.download('punkt_tab')

def set_nltk_data_path():
    nltk.data.path.append('/root/nltk_data')

set_nltk_data_path()
download_punkt()

def main():
    # DUMP should be given as an argument. Example: CC-MAIN-2023-23
    if len(sys.argv) != 2:
        print("Argument required: dump name")
        sys.exit(-1)

    DUMP = sys.argv[1]
    MAIN_OUTPUT_PATH = "./output"  # Local output path
    DATA_PATH = f"./data/{DUMP}/segments/"

    print(f"Checking files in {DATA_PATH}")
    for root, dirs, files in os.walk(DATA_PATH):
        print(f"Found directory: {root}")
        for file in files:
            print(f"Found file: {file}")

    if not any(os.scandir(DATA_PATH)):
        print(f"No files found in {DATA_PATH}")
        sys.exit(-1)

    def initializer():
        set_nltk_data_path()
        download_punkt()

    from multiprocessing import Pool
    with Pool(processes=8, initializer=initializer) as pool:
        executor = LocalPipelineExecutor(
            pipeline=[
                WarcReader(
                    DATA_PATH,
                    glob_pattern="*.warc.gz",
                    default_metadata={"dump": DUMP},
                ),
                URLFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/url/{DUMP}")),
                Trafilatura(favour_precision=True),
                LanguageFilter(
                    exclusion_writer=JsonlWriter(
                        f"{MAIN_OUTPUT_PATH}/non_english/",
                        output_filename="${language}/" + DUMP + "/${rank}.jsonl.gz",  # folder structure: language/dump/file
                    )
                ),
                GopherRepetitionFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/repetitive/{DUMP}")),
                GopherQualityFilter(exclusion_writer=JsonlWriter(f"{MAIN_OUTPUT_PATH}/removed/quality/{DUMP}")),
                JsonlWriter(f"{MAIN_OUTPUT_PATH}/output/{DUMP}"),
            ],
            tasks=8,  # Number of local tasks, adjusted to your VM configuration
            logging_dir=f"{MAIN_OUTPUT_PATH}/logs/base_processing/{DUMP}",
        )
        executor.run()

if __name__ == '__main__':
    main()
Run the script as follows:
#python3 process_common_crawl_dump.py CC-MAIN-2023-23
The script runs for about 26 minutes, and the final output is as follows:
2024-08-14 05:11:53.451 | INFO | datatrove.utils.logging:add_task_logger:47 – Launching pipeline for rank=0
2024-08-14 05:11:53.452 | INFO | datatrove.utils.logging:log_pipeline:76 –
— 🛠️ PIPELINE 🛠
📖 – READER: 🕷 Warc
🔻 – FILTER: 😈 Url-filter
🛢 – EXTRAC: ⛏ Trafilatura
🔻 – FILTER: 🌍 Language ID
🔻 – FILTER: 👯 Gopher Repetition
🔻 – FILTER: 🥇 Gopher Quality
💽 – WRITER: 🐿 Jsonl
2024-08-14 05:11:53.452 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 – Reading input file CC-MAIN-20230527223515-20230528013515-00000.warc.gz
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data…
[nltk_data] Package punkt_tab is already up-to-date!
2024-08-14 05:11:55.704 | WARNING | datatrove.pipeline.extractors.base:run:60 – ❌ Error “” while cleaning record text. Skipping record.
…
2024-08-14 05:38:47.661 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 8/8 tasks completed.
2024-08-14 05:38:47.686 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 8 tasks 📉📉📉
Total Runtime: 26 minutes and 36 seconds
📖 – READER: 🕷 Warc
Runtime: (2.11%) 33 seconds [0.29 milliseconds±3.12 milliseconds/doc]
Stats: {input_files: 1, doc_len: 4795961005 [min=1, max=1048576, 140974.75±182620/doc], documents: 34019 [34019.00/input_file]}
🔻 – FILTER: 😈 Url-filter
Runtime: (0.35%) 5 seconds [0.16 milliseconds±11.08 milliseconds/doc]
Stats: {total: 34020, forwarded: 33834, doc_len: 4776069530 [min=1, max=1048576, 141161.84±182866/doc], dropped: 186, dropped_domain: 90, dropped_hard_blacklisted: 67, dropped_blacklisted_subword: 21, dropped_soft_blacklisted: 6, dropped_subdomain: 2}
🛢 – EXTRAC: ⛏ Trafilatura
Runtime: (75.94%) 20 minutes and 12 seconds [35.84 milliseconds±29.25 milliseconds/doc]
Stats: {total: 33834, forwarded: 27384, doc_len: 57232496 [min=1, max=551300, 2090.00±6280/doc], dropped: 4168}
🔻 – FILTER: 🌍 Language ID
Runtime: (0.91%) 14 seconds [0.53 milliseconds±2.54 milliseconds/doc]
Stats: {total: 27384, dropped: 16500, forwarded: 10884, doc_len: 24989254 [min=2, max=73080, 2295.96±4166/doc]}
🔻 – FILTER: 👯 Gopher Repetition
Runtime: (13.00%) 3 minutes and 27 seconds [19.07 milliseconds±33.46 milliseconds/doc]
Stats: {total: 10884, forwarded: 8161, doc_len: 21401662 [min=5, max=73080, 2622.43±4274/doc], dropped: 2723, dropped_top_4_gram: 345, dropped_dup_line_frac: 633, dropped_top_2_gram: 796, dropped_duplicated_5_n_grams: 281, dropped_top_3_gram: 399, dropped_duplicated_6_n_grams: 25, dropped_dup_line_char_frac: 173, dropped_duplicated_8_n_grams: 13, dropped_duplicated_10_n_grams: 16, dropped_duplicated_9_n_grams: 23, dropped_duplicated_7_n_grams: 19}
🔻 – FILTER: 🥇 Gopher Quality
Runtime: (7.55%) 2 minutes [14.76 milliseconds±8.44 milliseconds/doc]
Stats: {total: 8161, dropped: 2433, dropped_gopher_too_many_end_ellipsis: 232, dropped_gopher_below_alpha_threshold: 1201, forwarded: 5728, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], dropped_gopher_short_doc: 941, dropped_gopher_too_many_bullets: 49, dropped_gopher_enough_stop_words: 6, dropped_gopher_below_avg_threshold: 1, dropped_gopher_too_many_ellipsis: 1, dropped_gopher_too_many_hashes: 2}
💽 – WRITER: 🐿 Jsonl
Runtime: (0.14%) 2 seconds [0.40 milliseconds±0.60 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5728, total: 5728, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc]}
Check data processing results
root@david1a100:~/dataclean/output/output/CC-MAIN-2023-23# zcat ./00000.jsonl.gz | head -n 2 | jq .
Output:
{
“text”: “Buy Ambien Online Legally (Zolpidem) belongs to the imidazopyridines class of opioids. Ambien complements the exceptional of sleep via way of means of decreasing the time it takes to fall asleep, decreasing the frequency of nocturnal awakenings, and growing the general period of sleep. Lengthens the second one degree of sleep and the deep sleep degree (III and IV). It does now no longer make you sleepy throughout the day. If you’re seeking to Buy Ambien Online at an inexpensive cost, come to our on line pharmacy.”,
“id”: “<urn:uuid:dd20979b-ada8-4c5b-b53e-4ade7274bc1b>”,
“metadata”: {
“dump”: “CC-MAIN-2023-23”,
“url”: “http://42627.dynamicboard.de/u101117_ambienusa.html”,
“date”: “2023-05-27T23:12:51Z”,
“file_path”: “/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz”,
“language”: “en”,
“language_score”: 0.8990675806999207
}
}
{
“text”: “My little guy turned two over the summer and we celebrated with an oh-so-cute Golf Birthday Party. He is all boy and loves anything that includes a stick and ball, which made choosing the golf theme fairly easy. We had fun golfing games, snacks & treats and each little caddie even received there very own golf bag. The post was getting fairly large I decided to split it in two parts. Part one covers the favor and dessert table and part two will focus on the food and games. Enjoy!nGolf Pro Shop for the favor tablenEach “Golf Pro” received his/her own set of golf clubs (thank you Target dollar section for saving the day!), a blue or green visor I purchased at Joann’s, practice golf balls and a water bottle to stay hydrated on the course.nI created the backdrop for the dessert table with a tan table cloth I had and pinned it to the window frame with thumb tacks (my husband wasn’t too happy about that one…opps!) I used 12” white tissue paper balls that I purchased from Devra Party and hung them by grosgrain ribbon.nI wanted to use items on the dessert table that went along with the theme so I racked my brain for some golf terms. The sign over the table was “Caddie’s Sweet Spot” (sweet spot refers to the center point of the face of the club).nThere was a “water hazard” ~ blue jell-o jigglers, “wormburners” (which is the term for a ball that skims the grass) ~ chocolate pudding pack topped with crumbled Oreos and gummy worms plus a sand trap of “doughnut hole in one” ~ made with powder sugar doughnuts and crumbled graham crackers for the sand.nI also made cake pops that resembled golf balls ~ some like a lollipop and others with a golf flag and the number two for the birthday boy. The kids had a few candy choices and a small bag to fill so they could bring treats home.n“Wormburners” – Chocolate pudding cups topped with crushed oreos and gummy wormsnGreen Grass Cupcakes, with white gumball and printable golf flags.nThank you so much to everyone who helped make this party amazing, I couldn’t have done it without you.nVendor List:nPhotography: Andary StudionParty Printables: Printable Studio by 505 Design, IncnGolf Club Sets: Target Dollar SectionnFoam Visors: Joann’snGreen & White Tissue Balls: Devra PartynGreen Polka Dot Balloons: Paws Attraction BoutiquenCupcakes – My super talented sisternInterested in hosting your own Golf Themed Party – Check out the Golf Pro Printable set now available in the shop.nMore details coming soon….nThanks for stopping by! Cathy C.”,
“id”: “<urn:uuid:9ad54ec1-b946-4293-8099-abc434ef154c>”,
“metadata”: {
“dump”: “CC-MAIN-2023-23”,
“url”: “http://505-design.com/tag/boys-party/”,
“date”: “2023-05-27T23:24:49Z”,
“file_path”: “/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz”,
“language”: “en”,
“language_score”: 0.9405166506767273
}
}
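Besides eyeballing individual records with jq, a short Python sketch (the path is assumed to match the JsonlWriter output checked above) can summarize what basic processing produced, such as document count, average text length, and mean language score:

import gzip
import json
import statistics

path = "/root/dataclean/output/output/CC-MAIN-2023-23/00000.jsonl.gz"
lengths, scores = [], []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)  # one document per line
        lengths.append(len(doc["text"]))
        scores.append(doc["metadata"]["language_score"])

print(f"documents: {len(lengths)}")
print(f"mean text length: {statistics.mean(lengths):.0f} chars")
print(f"mean language score: {statistics.mean(scores):.3f}")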
Minhash deduplication
I use the local executor LocalPipelineExecutor to execute the data deduplication pipeline, which includes the following steps:
Configuring Minhash: Setting up Minhash with 64-bit hashes for better precision and fewer false positives (collisions).
Reading Input Data: Using JsonlReader to read input data from a specified directory.
Stage 1: Calculating Minhash Signatures:
Pipeline: Reads input data and calculates Minhash signatures.
Output: Stores signatures in a specified folder.
Tasks: Configured to run with a specified number of tasks based on the local environment.
Stage 2: Finding Matches Between Signatures in Each Bucket:
Pipeline: Processes the signatures to find matches within each bucket.
Output: Stores bucketed signatures in a specified folder.
Tasks: Runs with a number of tasks equal to the number of buckets.
Dependency: Depends on the completion of Stage 1.
Stage 3: Creating Clusters of Duplicates:
Pipeline: Uses the results from all buckets to create clusters of duplicate items.
Output: Stores IDs of items to be removed in a specified folder.
Tasks: Runs as a single task.
Dependency: Depends on the completion of Stage 2.
Stage 4: Filtering Out Duplicates:
Pipeline: Reads the original input data, counts tokens, filters out duplicates (keeping only one sample per cluster), and writes the deduplicated data to JSONL files.
Output: Stores deduplicated output and removed items in specified folders.
Tasks: Configured to run with a specified number of tasks.
Dependency: Depends on the completion of Stage 3.
root@david1a100:~/dataclean# cat minhash_deduplication.py
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import (
    MinhashConfig,
    MinhashDedupBuckets,
    MinhashDedupCluster,
    MinhashDedupFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import TokensCounter
from datatrove.pipeline.writers.jsonl import JsonlWriter

def main():
    minhash_config = MinhashConfig(use_64bit_hashes=True)
    LOCAL_MINHASH_BASE_PATH = "./minhash"
    LOCAL_LOGS_FOLDER = "./logs"
    TOTAL_TASKS = 8

    # Input data path
    INPUT_READER = JsonlReader("./output/output/CC-MAIN-2023-23/")

    # Stage 1: Calculate the Minhash signature for each task
    stage1 = LocalPipelineExecutor(
        pipeline=[
            INPUT_READER,
            MinhashDedupSignature(output_folder=f"{LOCAL_MINHASH_BASE_PATH}/signatures", config=minhash_config),
        ],
        tasks=TOTAL_TASKS,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/signatures",
    )

    # Stage 2: Finding matches between signatures in each bucket
    stage2 = LocalPipelineExecutor(
        pipeline=[
            MinhashDedupBuckets(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/signatures",
                output_folder=f"{LOCAL_MINHASH_BASE_PATH}/buckets",
                config=minhash_config,
            ),
        ],
        tasks=minhash_config.num_buckets,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/buckets",
        depends=stage1,
    )

    # Stage 3: Create clusters of duplicate items using the results of all buckets
    stage3 = LocalPipelineExecutor(
        pipeline=[
            MinhashDedupCluster(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/buckets",
                output_folder=f"{LOCAL_MINHASH_BASE_PATH}/remove_ids",
                config=minhash_config,
            ),
        ],
        tasks=1,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/clusters",
        depends=stage2,
    )

    # Stage 4: Read raw input data and remove all samples from each duplicate cluster (keep only one)
    stage4 = LocalPipelineExecutor(
        pipeline=[
            INPUT_READER,
            TokensCounter(),  # View the number of tokens before and after de-duplication
            MinhashDedupFilter(
                input_folder=f"{LOCAL_MINHASH_BASE_PATH}/remove_ids",
                exclusion_writer=JsonlWriter(f"{LOCAL_MINHASH_BASE_PATH}/removed"),
            ),
            JsonlWriter(output_folder=f"{LOCAL_MINHASH_BASE_PATH}/deduplicated_output"),
        ],
        tasks=TOTAL_TASKS,
        logging_dir=f"{LOCAL_LOGS_FOLDER}/filter",
        depends=stage3,
    )

    stage4.run()

if __name__ == '__main__':
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Run the code:
(dataclean) root@david1a100:~/dataclean# python minhash_deduplication.py
The results are as follows:
— 🛠️ PIPELINE 🛠
📖 – READER: 🐿 Jsonl
🔢 – TOKENIZER: 📊 Counter
🫂 – DEDUP: 🎯 MinHash stage 4
💽 – WRITER: 🐿 Jsonl
2024-08-14 07:20:58.795 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 – Reading input file 00000.jsonl.gz
2024-08-14 07:20:58.802 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 1/8 tasks completed.
2024-08-14 07:20:58.804 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 2/8 tasks completed.
2024-08-14 07:20:58.805 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 3/8 tasks completed.
2024-08-14 07:20:58.807 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 4/8 tasks completed.
2024-08-14 07:20:58.808 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 5/8 tasks completed.
2024-08-14 07:20:58.810 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 6/8 tasks completed.
2024-08-14 07:20:58.812 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 7/8 tasks completed.
2024-08-14 07:21:08.399 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=0
2024-08-14 07:21:08.401 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 0 📉📉📉
Total Runtime: 9 seconds
📖 – READER: 🐿 Jsonl
Runtime: (1.54%) 0 seconds [0.03 milliseconds±0.01 milliseconds/doc]
Stats: {input_files: 1, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], documents: 5727 [5727.00/input_file]}
🔢 – TOKENIZER: 📊 Counter
Runtime: (79.15%) 7 seconds [1.29 milliseconds±5.90 milliseconds/doc]
Stats: {tokens: 3989039 [min=54, max=18060, 696.41±1020/doc]}
🫂 – DEDUP: 🎯 MinHash stage 4
Runtime: (0.44%) 0 seconds [0.01 milliseconds±0.03 milliseconds/doc]
Stats: {total: 5728, forwarded: 5548, dropped: 180}
💽 – WRITER: 🐿 Jsonl
Runtime: (18.86%) 1 second [0.32 milliseconds±0.44 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5548, total: 5548, doc_len: 17896157 [min=257, max=73080, 3225.70±4665/doc], doc_len_tokens: 3943328 [min=54, max=18060, 710.77±1032/doc]}
2024-08-14 07:21:08.405 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 8/8 tasks completed.
2024-08-14 07:21:08.417 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 8 tasks 📉📉📉
Total Runtime: 1 second ± 2 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (1.54%) 0 seconds±0 seconds/task, min=0 seconds [0.03 milliseconds±0.01 milliseconds/doc]
Stats: {input_files: 1, doc_len: 18117059 [min=257, max=73080, 3162.89±4611/doc], documents: 5727 [5727.00/input_file]}
🔢 – TOKENIZER: 📊 Counter
Runtime: (79.15%) 0 seconds±2 seconds/task, min=0 seconds [1.29 milliseconds±5.90 milliseconds/doc]
Stats: {tokens: 3989039 [min=54, max=18060, 696.41±1020/doc]}
🫂 – DEDUP: 🎯 MinHash stage 4
Runtime: (0.44%) 0 seconds±0 seconds/task, min=0 seconds [0.01 milliseconds±0.03 milliseconds/doc]
Stats: {total: 5728, forwarded: 5548, dropped: 180}
💽 – WRITER: 🐿 Jsonl
Runtime: (18.86%) 0 seconds±0 seconds/task, min=0 seconds [0.32 milliseconds±0.44 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 5548, total: 5548, doc_len: 17896157 [min=257, max=73080, 3225.70±4665/doc], doc_len_tokens: 3943328 [min=54, max=18060, 710.77±1032/doc]}
Check the removed and final results for this part:
(dataclean) root@david1a100:~/dataclean/minhash# ls -al removed/
total 76
drwx------ 2 root root 4096 Aug 14 07:20 .
drwx------ 7 root root 4096 Aug 14 07:20 ..
-rw------- 1 root root 65584 Aug 14 07:21 00000.jsonl.gz
(dataclean) root@david1a100:~/dataclean/minhash# ls -al deduplicated_output/
total 7372
drwx------ 2 root root 4096 Aug 14 07:20 .
drwx------ 7 root root 4096 Aug 14 07:20 ..
-rw------- 1 root root 7539420 Aug 14 07:21 00000.jsonl.gz
(dataclean) root@david1a100:~/dataclean/minhash#
Check the first item in the final output file:
(dataclean) root@david1a100:~/dataclean/minhash/deduplicated_output# zcat ./00000.jsonl.gz | head -n 1 | jq .
{
“text”: “Buy Ambien Online Legally (Zolpidem) belongs to the imidazopyridines class of opioids. Ambien complements the exceptional of sleep via way of means of decreasing the time it takes to fall asleep, decreasing the frequency of nocturnal awakenings, and growing the general period of sleep. Lengthens the second one degree of sleep and the deep sleep degree (III and IV). It does now no longer make you sleepy throughout the day. If you’re seeking to Buy Ambien Online at an inexpensive cost, come to our on line pharmacy.”,
“id”: “<urn:uuid:dd20979b-ada8-4c5b-b53e-4ade7274bc1b>”,
“metadata”: {
“dump”: “CC-MAIN-2023-23”,
“url”: “http://42627.dynamicboard.de/u101117_ambienusa.html”,
“date”: “2023-05-27T23:12:51Z”,
“file_path”: “/root/dataclean/data/CC-MAIN-2023-23/segments/CC-MAIN-20230527223515-20230528013515-00000.warc.gz”,
“language”: “en”,
“language_score”: 0.8990675806999207,
“token_count”: 120
}
}
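Before moving on to sentence-level deduplication, a small sanity-check sketch (paths taken from the MinHash output folders above) confirms that the kept and removed counts add up to the original 5,728 documents:

import gzip

def count_records(path):
    # Each line of a .jsonl.gz file is one document
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)

kept = count_records("/root/dataclean/minhash/deduplicated_output/00000.jsonl.gz")
removed = count_records("/root/dataclean/minhash/removed/00000.jsonl.gz")
print(f"kept={kept}, removed={removed}, total={kept + removed}")  # expect 5548 + 180 = 5728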
Sentence deduplication
My code uses the local executor LocalPipelineExecutor to execute the sentence-level deduplication pipeline, which includes the following steps:
Configuring Sentence Deduplication: Setting up sentence deduplication with specific configurations such as the number of sentences, splitting sentences, and minimum document words.
Preprocessing Data: Using NLTK to download the Punkt tokenizer and preprocess data before starting multiprocessing.
Reading Input Data: Using JsonlReader to read input data from a specified directory.
Stage 1: Extracting and Filtering Content:
Pipeline: Reads input data, extracts content using Trafilatura, filters based on quality and language, and writes intermediate results to JSONL files.
Output: Stores intermediate results in a specified folder.
Tasks: Configured to run with a specified number of tasks.
Stage 2: Calculating Sentence Deduplication Signatures:
Pipeline: Processes the intermediate results to calculate sentence deduplication signatures.
Output: Stores signatures in a specified folder.
Tasks: Runs with a number of tasks equal to the number of finder workers.
Stage 3: Finding and Filtering Duplicates:
Pipeline: Reads the intermediate results, finds duplicates using the calculated signatures, and filters out duplicates (keeping only one sample per cluster).
Output: Stores deduplicated output in a specified folder.
Tasks: Configured to run with a specified number of tasks.
The pipeline is executed by running executor_1.run(), executor_2.run(), and executor_3.run().
(dataclean) root@david1a100:~/dataclean# cat sentence_deduplication.py
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages
from datatrove.io import get_datafolder
from collections import UserDict
import multiprocessing
# Ensure punkt tokenizer is downloaded before multiprocessing
nltk.download('punkt', force=True)

# Custom function to load PunktSentenceTokenizer
def load_punkt_tokenizer():
    punkt_param = PunktParameters()
    # Opening the pickle only verifies that the punkt data is present;
    # the tokenizer itself is built from the default parameters.
    with open(nltk.data.find('tokenizers/punkt/english.pickle'), 'rb') as f:
        tokenizer = PunktSentenceTokenizer(punkt_param)
    return tokenizer

# Load tokenizer in the main process
tokenizer = load_punkt_tokenizer()

# Example configuration for sentence deduplication
sent_dedup_config = SentDedupConfig(
    n_sentences=3,
    split_sentences=True,
    only_dedup_in_index=True,
    min_doc_words=50,
)

FINDER_WORKERS = 10

class TimeStats:
    def __init__(self):
        self.global_mean = 0
        self.global_std_dev = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        pass

    def __repr__(self):
        return f"TimeStats(global_mean={self.global_mean}, global_std_dev={self.global_std_dev})"

    def __add__(self, other):
        result = TimeStats()
        result.global_mean = self.global_mean + other.global_mean
        result.global_std_dev = self.global_std_dev + other.global_std_dev
        return result

class Stat:
    def __init__(self):
        self.value = 0

    def update(self, value, unit=None):
        self.value += value

    def __repr__(self):
        return f"Stat(value={self.value})"

    def __add__(self, other):
        result = Stat()
        result.value = self.value + other.value
        return result

class PipelineStats(UserDict):
    def __init__(self):
        super().__init__()
        self.total_runtime = 0
        self.time_stats = TimeStats()
        self.data['total'] = Stat()
        self.data['removed_sentences'] = Stat()
        self.data['original_sentences'] = Stat()

    def as_dict(self):
        return {
            'total_runtime': self.total_runtime,
            'time_stats': repr(self.time_stats),
            'stats': {key: repr(value) for key, value in self.data.items()}
        }

    def to_dict(self):
        return self.as_dict()

    def to_json(self):
        import json
        return json.dumps(self.to_dict(), indent=4)

    def save_to_disk(self, file):
        file.write(self.to_json())

    def get_repr(self, task_name):
        x = f"\n\nStats: {task_name}\n\nTotal Runtime: {self.total_runtime} seconds\n\n"
        x += "\n".join([repr(stat) for stat in self.data.values()])
        return x

    def __repr__(self, *args, **kwargs):
        return f"PipelineStats(total_runtime={self.total_runtime}, time_stats={self.time_stats})"

    def __add__(self, other):
        result = PipelineStats()
        result.total_runtime = self.total_runtime + other.total_runtime
        result.time_stats = self.time_stats + other.time_stats
        for key in self.data:
            result.data[key] = self.data[key] + other.data[key]
        return result

class CustomSentenceDedupFilter(SentenceDedupFilter):
    def __init__(self, data_folder, config):
        self.data_folder = get_datafolder(data_folder)
        self.config = config
        self._tokenizer = None
        self.exclusion_writer = None
        self.stats = PipelineStats()
        self.language = 'english'

    def set_tokenizer(self, tokenizer):
        self._tokenizer = tokenizer

    def run(self, data, rank, world_size, *args):
        # Implement the logic for the run method here
        # For now, let's just print the arguments to verify they are passed correctly
        print(f"Running with data: {data}, rank: {rank}, world_size: {world_size}, args: {args}")
        # Add your actual processing logic here
        return data

def preprocess_data():
    # Preprocess data using nltk before starting multiprocessing
    # This is a placeholder function. Implement your preprocessing logic here.
    # For example, you can read the input files, tokenize the sentences, and save the preprocessed data.
    pass

def run_example():
    preprocess_data()  # Preprocess data before starting multiprocessing

    pipeline_1 = [
        JsonlReader(data_folder="./minhash/deduplicated_output/"),
        Trafilatura(),
        GopherQualityFilter(min_stop_words=0),
        LanguageFilter(language_threshold=0.5, languages=(Languages.english,)),
        JsonlWriter("./intermediate/"),
        SentenceDedupSignature(output_folder="./c4/sigs", config=sent_dedup_config, finder_workers=FINDER_WORKERS),
    ]

    pipeline_2 = [SentenceFindDedups(data_folder="./c4/sigs", output_folder="./c4/dups", config=sent_dedup_config)]

    sentence_dedup_filter = CustomSentenceDedupFilter(data_folder="./c4/dups", config=sent_dedup_config)
    sentence_dedup_filter.set_tokenizer(tokenizer)

    pipeline_3 = [
        JsonlReader(data_folder="./intermediate/"),
        sentence_dedup_filter,
        JsonlWriter(output_folder="./final_deduplicated_output/"),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=4, tasks=4)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=FINDER_WORKERS)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=4, tasks=4)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())

if __name__ == '__main__':
    multiprocessing.freeze_support()
    run_example()
Run the script:
(dataclean) root@david1a100:~/dataclean# python3 sentence_deduplication.py
Some of the output:
2024-08-15 03:46:20.151 | INFO | datatrove.pipeline.dedup.sentence_dedup:run:247 – PQ initialized.
2024-08-15 03:46:20.151 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=9
2024-08-15 03:46:20.152 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 9 📉📉📉
Total Runtime: 0 seconds
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds [1.17 milliseconds±0 milliseconds/doc]
2024-08-15 03:46:20.156 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 10 tasks 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds±0 seconds/task, min=0 seconds, max=0 seconds [1.68 milliseconds±1.21 milliseconds/doc]
📉📉📉 Stats 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
🫂 – DEDUPS: 💥 sentence-deduplication stage 2
Runtime: (100.00%) 0 seconds±0 seconds/task, min=0 seconds, max=0 seconds [1.68 milliseconds±1.21 milliseconds/doc]
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
2024-08-15 03:46:20.887 | INFO | datatrove.utils.logging:add_task_logger:47 – Launching pipeline for rank=2
2024-08-15 03:46:20.887 | INFO | datatrove.utils.logging:log_pipeline:76 –
— 🛠️ PIPELINE 🛠
📖 – READER: 🐿 Jsonl
🫂 – DEDUPS: 💥 sentence-deduplication stage 3
💽 – WRITER: 🐿 Jsonl
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 2, world_size: 4, args: ()
2024-08-15 03:46:20.887 | WARNING | datatrove.pipeline.readers.base:run:226 – No files found on /root/dataclean/intermediate for rank=2
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 1, world_size: 4, args: ()
2024-08-15 03:46:20.887 | SUCCESS | datatrove.executor.base:_run_for_rank:85 – Processing done for rank=2
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a030>, rank: 0, world_size: 4, args: ()
2024-08-15 03:46:20.888 | INFO | datatrove.executor.base:_run_for_rank:91 –
📉📉📉 Stats: Task 2 📉📉📉
Total Runtime: 0 seconds
📖 – READER: 🐿 Jsonl
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
2024-08-15 03:46:20.891 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 1/4 tasks completed.
2024-08-15 03:46:20.892 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 2/4 tasks completed.
2024-08-15 03:46:20.897 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 3/4 tasks completed.
Running with data: <generator object BaseDiskReader.run at 0x7fc2ae75a340>, rank: 3, world_size: 4, args: ()
2024-08-15 03:46:20.911 | INFO | datatrove.executor.local:_launch_run_for_rank:79 – 4/4 tasks completed.
2024-08-15 03:46:20.948 | SUCCESS | datatrove.executor.local:run:146 –
📉📉📉 Stats: All 4 tasks 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (7.77%) 0 seconds±0 seconds/task, min=0 seconds [0.06 milliseconds±0.04 milliseconds/doc]
Stats: {input_files: 1, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc], documents: 3 [3.00/input_file]}
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
Runtime: (92.23%) 0 seconds±0 seconds/task, min=0 seconds [0.66 milliseconds±0.88 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 4, total: 4, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc]}
📉📉📉 Stats 📉📉📉
Total Runtime: 0 seconds ± 0 seconds/task
📖 – READER: 🐿 Jsonl
Runtime: (7.77%) 0 seconds±0 seconds/task, min=0 seconds [0.06 milliseconds±0.04 milliseconds/doc]
Stats: {input_files: 1, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc], documents: 3 [3.00/input_file]}
PipelineStats(total_runtime=0, time_stats=TimeStats(global_mean=0, global_std_dev=0))
💽 – WRITER: 🐿 Jsonl
Runtime: (92.23%) 0 seconds±0 seconds/task, min=0 seconds [0.66 milliseconds±0.88 milliseconds/doc]
Stats: {XXXXX.jsonl.gz: 4, total: 4, doc_len: 40103 [min=484, max=30632, 10025.75±14240/doc], doc_len_tokens: 10228 [min=95, max=6656, 2557.00±3132/doc]}
Check the first item of the final output:
(dataclean) root@david1a100:~/dataclean/final_deduplicated_output# zcat ./00000.jsonl.gz | head -n 1 | jq .
Check the quality of the corpus
This part of my code is based on https://github.com/Azure/synthetic-qa-generation/tree/main. I modified some of the code; please refer to corpus-suggestions.ipynb in my repo: https://github.com/xinyuwei-david/david-share/tree/master/Deep-Learning/Make-High-Quality-Dataset-From-WARC, which analyzes the quality of the corpus produced by the previous steps and gives many useful suggestions.
Take some results as examples:
Result 1:
Feedback Required: [True, False, True, False, True]
Feedback List:
#Need Feedback#: Yes
#Issue Name#: Lack of new scenarios or contexts
#Reason#: The evolved instruction does not introduce any new scenarios or examples.
#Feedback#: Introduce diverse contexts or examples to enhance the instructional variety.
#Need Feedback#: No
#Need Feedback#: Yes
#Issue Name#: Limited diversity in examples
#Reason#: No new scenarios or varied contexts introduced in the evolved instruction.
#Feedback#: Incorporate diverse examples or contexts to cover a wider range of situations.
#Need Feedback#: No
#Need Feedback#: Yes
#Issue Name#: Limited diversity
#Reason#: No new scenarios, examples, or contexts introduced.
#Feedback#: Include various use cases and contexts for accessing journal content.
Optimized Instruction:
Accessing full-text articles for free on HTML pages can be a convenient way to stay informed, but if you need the article in PDF or Epub format, a subscription to the Journal of Postgraduate Medicine is required. Here are different ways to access the content based on various contexts:
1. **Individual Subscription:** If you frequently need access to articles in PDF or Epub format, consider subscribing online for a year. Subscribing is a straightforward process:
– Visit the Journal of Postgraduate Medicine’s subscription page.
– Choose the subscription plan that suits your needs.
– Complete the payment process to gain access to the content.
2. **Institutional Access:** If you are affiliated with a university or a research institution, you might recommend that your institution’s library subscribe to the journal. This way, everyone at your institution can have unrestricted access to the content.
– Click on the “Recommend the Journal” link typically provided on the journal’s website.
– Fill out the recommendation form with the necessary details.
– Submit the recommendation to your institution’s library acquisition team.
3. **Library Access:** If your local library has a subscription to the journal, you can access the PDF and Epub formats through their facilities. Check with your library to see if they offer remote access options, especially useful during non-operational hours or remote working conditions.
4. **Interlibrary Loan (ILL):** If neither you nor your institution has a subscription and you need a specific article in PDF or Epub format, you can request it through interlibrary loan services:
– Contact your library’s interlibrary loan department.
– Provide the details of the article you need.
– Wait for your library to obtain a copy from another subscribing institution.
5. **Pay-Per-View Purchase:** Some journals offer pay-per-view options for non-subscribers to access specific articles:
– Visit the article page on the journal’s website.
– Look for a purchase or pay-per-view option.
– Complete the payment to download the article in PDF or Epub format.
By understanding these various methods, you can choose the most appropriate way to access the Journal of Postgraduate Medicine articles based on your specific context and needs.
Evolved Instruction Step 1:
Accessing full-text articles for free on HTML pages can be a convenient way to stay informed, but if you need the article in PDF or Epub format or face geographic restrictions, a subscription to the Journal of Postgraduate Medicine is required. Here are different ways to access the content based on various contexts and considerations:
1. **Individual Subscription:** If you frequently need access to articles in PDF or Epub format, consider subscribing online for a year. Consider different subscription tiers based on your usage frequency and preferred payment method (credit card, PayPal, or wire transfer):
– Visit the Journal of Postgraduate Medicine’s subscription page.
– Choose the appropriate subscription plan that suits your reading needs and budget.
– Complete the payment process, selecting your preferred payment method, to gain access to the content.
– Confirm your subscription through the verification email you will receive.
2. **Institutional Access:** If you are affiliated with a university, specialized institute, or research organization, you might recommend that your institution’s library subscribe to the journal, allowing everyone at your institution unrestricted access to the content:
– Click on the “Recommend the Journal” link typically provided on the journal’s website.
– Fill out the recommendation form with the necessary details, specifying your institution type.
– Submit the recommendation to your institution’s library acquisition team.
– Follow up with your acquisition team to verify the status of the subscription request.
3. **Library Access:** If your local library has a subscription to the journal, you can access the PDF and Epub formats through their facilities. Check with your library to see if they offer remote access options or have updated policies for off-hour access due to remote working conditions or geographical restrictions:
– Visit your library’s online resource portal.
– Authenticate your library membership details to access the journal remotely.
– Verify the access duration and loan policies to ensure continuous availability.
4. **Interlibrary Loan (ILL):** If neither you nor your institution has a subscription and you need a specific article in PDF or Epub format, you can request it through Interlibrary Loan services, which might involve multiple steps and waiting periods:
– Contact your library’s interlibrary loan department and inquire about any pre-requisites.
– Provide the exact details of the article you need and verify your contact information.
– Wait for your library to notify you on the progress and estimated delivery time of the article from another subscribing institution.
– Confirm the received article’s access duration to avoid lapses in availability.
5. **Pay-Per-View Purchase:** Some journals offer pay-per-view options for non-subscribers to access specific articles. Be aware of different payment methods and possible return policies if the article does not meet your needs:
– Visit the article page on the journal’s website.
– Look for a purchase or pay-per-view option and compare prices if there are multiple.
– Complete the payment process, choosing a method that’s secure and convenient for you.
– Download the article in PDF or Epub format, and review any return policies if you face access issues.
By understanding these various methods, including conditional scenarios and additional steps, you can choose the most appropriate way to access the Journal of Postgraduate Medicine articles based on your specific context, requirements, and potential contingent situations.
New Feedback Required: [True, True, True, True, True]
New Feedback List:
#Need Feedback#: Yes
#Issue Name#: Preservation of key information
#Reason#: Key information is maintained with added details and considerations.
#Feedback#: Key information preserved well with added context and steps for clarity.
#Need Feedback#: Yes
#Issue Name#: Complexity
#Reason#: More details and steps have been added sufficiently.
#Feedback#: Complexity increased adequately with detailed steps and additional considerations.
#Need Feedback#: Yes
#Issue Name#: Insufficient scenario diversity
#Reason#: Limited expansion on new contexts or examples in evolved instruction.
#Feedback#: Introduce more varied scenarios to enhance diversity and coverage of different situations.
#Need Feedback#: Yes
#Issue Name#: Increased complexity
#Reason#: The Evolved Instruction introduces more detailed steps and additional considerations.
#Feedback#: The complexity has increased adequately with additional steps and detailed guidance.
#Need Feedback#: Yes
#Issue Name#: Limited diversity in access methods
#Reason#: Few new scenarios or examples introduced in the evolved instruction.
#Feedback#: Expand diversity by adding varied contexts, like international access options.
Generate Synthetic Q&A
Referring to generate-QA.ipynb, we can generate high-quality synthetic Q&A pairs with GPT-4o. The prompt template is based on: https://github.com/Azure/synthetic-qa-generation/tree/main/seed/prompt_template/en
Take some results as examples:
1. **What type of access is free in HTML pages?**
Full text access is free in HTML pages.
2. **Who can access PDF and EPub formats of the journal?**
PDF and EPub access is only available to paid subscribers and members.
3. **What must you do to access the article in PDF format?**
To access the article in PDF format, you should be a subscriber to the Journal of Postgraduate Medicine.
4. **How can you subscribe to the Journal of Postgraduate Medicine?**
You can subscribe online for a year.
5. **What can you do if you want your institution to have unrestricted access to the journal?**
You could recommend your institution’s library to subscribe to the journal so that you can have unrestricted access.
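For orientation, here is a minimal sketch of how such pairs can be generated with the openai Python package against an Azure OpenAI GPT-4o deployment; the endpoint, key, deployment name, and prompt wording below are placeholders, and the real prompt templates are the ones in the repository linked above:

import os
from openai import AzureOpenAI  # assumes the openai>=1.x SDK is installed

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder environment variables
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview",  # adjust to your resource
)

# A passage taken from the deduplicated corpus
context = ("Full text access is free in HTML pages; PDF and EPub access "
           "is only available to paid subscribers and members.")

response = client.chat.completions.create(
    model="gpt-4o",  # the name of your GPT-4o deployment may differ
    messages=[
        {"role": "system", "content": "Generate question-answer pairs strictly grounded in the provided text."},
        {"role": "user", "content": f"Create 3 Q&A pairs from this passage:\n{context}"},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)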
References
DataTrove: https://github.com/huggingface/datatrove/
Generate Synthetic QnAs from Real-world Data: https://github.com/Azure/synthetic-qa-generation/