JP7347147B2

JP7347147B2 - Molecular descriptor generation system, molecular descriptor generation method, and molecular descriptor generation program

Info

Publication number: JP7347147B2
Application number: JP2019208027A
Authority: JP
Inventors: 裕希永井
Original assignee: Hitachi Chemical Co Ltd; Showa Denko Materials Co Ltd; Resonac Corp
Current assignee: Resonac Corp
Priority date: 2019-11-18
Filing date: 2019-11-18
Publication date: 2023-09-20
Anticipated expiration: 2039-11-18
Also published as: JP2021081920A

Description

One aspect of the present disclosure relates to a molecular descriptor generation system, a molecular descriptor generation method, and a molecular descriptor generation program.

Conventionally, the structure of a molecule has been obtained in a predetermined format and converted into vector information, and the converted information has been input into a machine learning algorithm or the like for use in property prediction and similar tasks. For example, a method is known that uses machine learning to predict the binding between the three-dimensional structure of a biopolymer and the three-dimensional structure of a compound (see Patent Document 1 below). In this method, a predicted three-dimensional structure of a complex of the biopolymer and the compound is generated from the two three-dimensional structures, the predicted structure is converted into a predicted three-dimensional structure vector, and a machine learning algorithm uses that vector as data for predicting the binding between the three-dimensional structure of the biopolymer and that of the compound.

Patent Document 1: Japanese Unexamined Patent Publication No. 2019-28879

In recent years, there has been a demand to represent, as data, the structures of compounds synthesized from multiple types of constituent unit molecules in various composition ratios. However, with the conventional technique described in Patent Document 1, it is difficult to generate data that appropriately represents the molecular structure of a compound when little three-dimensional structure data has been accumulated for that compound. A mechanism is therefore desired that can appropriately represent the molecular structure of a compound as data even when the molecular structure of the compound, synthesized from multiple types of constituent units, is not known.

A molecular descriptor generation system according to one aspect of the present disclosure includes at least one processor. The at least one processor receives at least an input of structural data, which represents the molecular structure of each of a plurality of types of constituent unit molecules and the bonding points in that molecular structure with external molecular structures, and composition ratio data, which represents the composition ratio of the plurality of types of constituent unit molecules. The processor repeatedly generates data of a cyclic structure that has a predetermined number of constituent units and contains the constituent unit molecules indicated by the structural data at the composition ratio indicated by the composition ratio data, by randomly arranging the constituent unit molecules in a ring and then bonding adjacent molecules at the bonding points indicated by the structural data. The processor converts the repeatedly generated data of the plurality of cyclic structures into a plurality of vectors by molecular description, and combines the plurality of vectors to generate molecular description data describing the molecule of a synthetic compound that contains the plurality of types of constituent unit molecules at the composition ratio.

Alternatively, a molecular descriptor generation method according to another aspect of the present disclosure is executed by a computer including at least one processor and comprises: a step of receiving at least an input of structural data, which represents the molecular structure of each of a plurality of types of constituent unit molecules and the bonding points in that molecular structure with external molecular structures, and composition ratio data, which represents the composition ratio of the plurality of types of constituent unit molecules; a step of repeatedly generating data of a cyclic structure that has a predetermined number of constituent units and contains the constituent unit molecules indicated by the structural data at the composition ratio indicated by the composition ratio data, by randomly arranging the constituent unit molecules in a ring and then bonding adjacent molecules at the bonding points indicated by the structural data; a step of converting the repeatedly generated data of the plurality of cyclic structures into a plurality of vectors by molecular description; and a step of combining the plurality of vectors to generate molecular description data describing the molecule of a synthetic compound that contains the plurality of types of constituent unit molecules at the composition ratio.

Alternatively, a molecular descriptor generation program according to another aspect of the present disclosure causes a computer to execute: a step of receiving at least an input of structural data, which represents the molecular structure of each of a plurality of types of constituent unit molecules and the bonding points in that molecular structure with external molecular structures, and composition ratio data, which represents the composition ratio of the plurality of types of constituent unit molecules; a step of repeatedly generating data of a cyclic structure that has a predetermined number of constituent units and contains the constituent unit molecules indicated by the structural data at the composition ratio indicated by the composition ratio data, by randomly arranging the constituent unit molecules in a ring and then bonding adjacent molecules at the bonding points indicated by the structural data; a step of converting the repeatedly generated data of the plurality of cyclic structures into a plurality of vectors by molecular description; and a step of combining the plurality of vectors to generate molecular description data describing the molecule of a synthetic compound that contains the plurality of types of constituent unit molecules at the composition ratio.

According to this aspect of the present disclosure, the molecular structure of a synthetic compound synthesized from multiple types of constituent units can be appropriately represented as data.

FIG. 1 is a diagram showing an example of the hardware configuration of a computer that constitutes the molecular descriptor generation system according to the embodiment.
FIG. 2 is a diagram showing an example of the functional configuration of the molecular descriptor generation system according to the embodiment.
FIG. 3 is a diagram showing an image of the molecular structure indicated by the structural data acquired by the acquisition unit 11 of FIG. 2.
FIG. 4 is a diagram showing, as structural formulas, a specific example of the molecular structure indicated by the structural data acquired by the acquisition unit 11 of FIG. 2.
FIG. 5 is a diagram showing, as structural formulas, a specific example of the molecular structure indicated by the structural data acquired by the acquisition unit 11 of FIG. 2.
FIG. 6 is a diagram showing an image of the molecular structure specified by the cyclic structure data SDr generated by the data generation unit 12 of FIG. 2.
FIG. 7 is a diagram showing an image of the molecular structure specified by the linear structure data SDs generated by the data generation unit 12 of FIG. 2.
FIG. 8 is a flowchart showing an example of the operation of the molecular descriptor generation system according to the embodiment.
FIG. 9 is a diagram showing a numerical example of a vector generated by the molecular descriptor generation system according to the embodiment.
FIG. 10 is a diagram showing a numerical example of a vector generated by the molecular descriptor generation system according to the embodiment.
FIG. 11 is a diagram showing a numerical example of a vector generated by the molecular descriptor generation system according to the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description, identical elements or elements having identical functions are denoted by the same reference numerals, and redundant description is omitted.

[System overview]
The molecular descriptor generation system 10 according to the embodiment is a computer system that executes processing for generating a molecular descriptor (molecular description data) describing the molecular structure of a synthetic compound produced by synthesizing multiple types of constituent unit molecules in various composition ratios. A constituent unit molecule is a molecule that constitutes a material used to synthesize the synthetic compound, for example a monomer. A synthetic compound is a chemical substance produced by synthesizing multiple types of constituent unit molecules in a predetermined composition ratio. For example, when the constituent unit molecules are monomers, the synthetic compound is a copolymer formed by linking multiple types of monomers. Preferably, the synthetic compound targeted by the molecular descriptor generation processing of the molecular descriptor generation system 10 of the present embodiment is a compound with a linear structure in which multiple types of constituent unit molecules are connected in a straight chain.

The data generated by the molecular descriptor generation system 10 is used as input data for machine learning in order to predict the properties of the synthetic compound. The properties of the synthetic compound include, for example, glass transition temperature, elastic modulus, dielectric constant, dielectric loss tangent, and thermal expansion coefficient. Machine learning, to which the input data is supplied, is a technique that autonomously finds laws or rules by learning iteratively from given information. The specific machine learning technique is not limited. For example, the machine learning may use a machine learning model, that is, a computational model that includes a neural network. A neural network is an information processing model that mimics the workings of the human nervous system. As more specific examples, the machine learning uses at least one of a graph neural network (GNN), a convolutional neural network (CNN), a recurrent neural network (RNN), an attention RNN, multi-head attention, a random forest, a support vector machine, and multiple regression.

[System configuration]
The molecular descriptor generation system 10 is composed of one or more computers. When multiple computers are used, they are connected via a communication network such as the Internet or an intranet so that a single molecular descriptor generation system 10 is logically constructed.

FIG. 1 is a diagram showing an example of a general hardware configuration of a computer 100 that constitutes the molecular descriptor generation system 10. For example, the computer 100 includes a processor (for example, a CPU) 101 that executes an operating system, application programs, and the like; a main storage unit 102 composed of ROM and RAM; an auxiliary storage unit 103 composed of a hard disk, flash memory, or the like; a communication control unit 104 composed of a network card or a wireless communication module; an input device 105 such as a keyboard, mouse, or touch panel; and an output device 106 such as a monitor or touch panel display.

Each functional element of the molecular descriptor generation system 10 is realized by loading a predetermined program onto the processor 101 or the main storage unit 102 and causing the processor 101 to execute it. In accordance with the program, the processor 101 operates the communication control unit 104, the input device 105, or the output device 106, and reads and writes data in the main storage unit 102 or the auxiliary storage unit 103. Data or databases required for processing are stored in the main storage unit 102 or the auxiliary storage unit 103.

FIG. 2 is a diagram showing an example of the functional configuration of the molecular descriptor generation system 10. As functional elements, the molecular descriptor generation system 10 includes an acquisition unit 11, a data generation unit 12, a vector conversion unit 13, a vector synthesis unit 14, and a weighted addition unit 15.

The acquisition unit 11 is a functional element that accepts input of structural data of a plurality of types of constituent unit molecules and of composition ratio data representing the composition ratio of each of those constituent unit molecules on the assumption that they are synthesized to produce a synthetic compound. The acquisition unit 11 may acquire these data from a database within the molecular descriptor generation system 10 in response to a selection input by a user of the system, or from an external computer or the like in response to a selection by the user.

Specifically, the acquisition unit 11 acquires, in a specific data format, structural data specifying the molecular structures of the plurality of types of constituent unit molecules. The data format is not limited as long as it can express a molecular structure. It may be an image format such as JPEG or GIF that depicts the molecular structure; a data format that represents the molecular structure as a combination of character strings, coordinates, and the like; a molecular graph format that specifies, using numbers, letters, text, vectors, and the like, an undirected graph in which the molecular structure is expressed by nodes and edges; or any combination of two or more of these formats. It may also be a one-dimensional line notation that represents the molecular structure as a sequence of characters, such as SMILES, InChI, or SLN. The individual numerical values that make up the data format may be expressed in decimal or in another notation such as binary or hexadecimal. The structural data acquired by the acquisition unit 11 includes data on the bonding points, that is, the points in the molecular structure of a constituent unit molecule that can chemically bond to an external molecular structure.
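As a rough illustration of structural data that carries bonding points, the sketch below stores a constituent unit molecule as a SMILES-like string in which `*` marks each bonding point. The class name `UnitStructure`, its fields, and the two example strings are assumptions made for illustration (including the choice of the 1-butene repeat unit for "butylene"); the patent does not prescribe this encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitStructure:
    """One constituent unit molecule (hypothetical container, not from the patent)."""

    name: str
    notation: str  # SMILES-like string; each '*' marks one bonding point

    def bonding_point_count(self) -> int:
        # Every '*' is a site that can bond to an external molecular structure.
        return self.notation.count("*")


# Assumed encodings of two repeat units with one bonding point at each end.
propylene = UnitStructure("propylene", "*CC(C)*")
butylene = UnitStructure("butylene", "*CC(CC)*")

if __name__ == "__main__":
    for unit in (propylene, butylene):
        print(unit.name, unit.bonding_point_count())
```

A line notation is used here only because it keeps the example short; an image or molecular graph format would carry the same bonding-point information in a different form.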

FIG. 3 shows an image of the molecular structure indicated by the structural data acquired by the acquisition unit 11, and FIGS. 4 and 5 show specific examples of such molecular structures as structural formulas. As shown in FIG. 3, the acquisition unit 11 acquires, for example, structural data SD_1 that includes data representing the molecular structure "A" and data indicating the positions of the bonding points in the molecular structure "A" (for example, data marking each position with "*"), and acquires similar structural data SD_2, SD_3, and SD_4 for the molecular structures "B", "C", and "D" of the other constituent unit molecules. The molecular structures of the structural data SD_5 and SD_6 shown in FIG. 4 are an example in which the constituent unit molecules are propylene and butylene. With these two types of constituent unit molecules, a polymer compound (copolymer) can be produced by addition polymerization of the constituent unit molecules with one another, so the two carbon atoms at either side are marked with "*" as equivalent bonding points whose bonding partner is not restricted. The molecular structures of the structural data SD_7 and SD_8 shown in FIG. 5 are an example for terephthalic acid and ethylene glycol. In this example, a polymer compound (copolymer) can be produced by condensation polymerization of the constituent unit molecules with one another, so the two ends of each of the two types of molecules are marked with "*_1" and "*_2" as non-equivalent bonding points whose bonding partner is restricted. The labels "*_1" and "*_2" indicate that bonding points with the same identification number do not bond to each other, whereas bonding points with different identification numbers can bond to each other.
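The bonding rule described for "*", "*_1", and "*_2" can be sketched as a small predicate. The function name `can_bond` and the use of `None` for an unlabeled "*" are assumptions for illustration. The text does not say what happens when an unlabeled point meets a labeled one, so this sketch rejects that case as a conservative choice.

```python
from typing import Optional


def can_bond(label_a: Optional[int], label_b: Optional[int]) -> bool:
    """Return True if two bonding points may be joined.

    None stands for an equivalent point "*" (no restriction);
    an int stands for the identification number n of a point "*_n".
    """
    if label_a is None and label_b is None:
        # Equivalent points: the bonding partner is not restricted.
        return True
    if label_a is None or label_b is None:
        # Assumption: the mixed case is not described in the text, so reject it.
        return False
    # Non-equivalent points: same number never bonds, different numbers may.
    return label_a != label_b


if __name__ == "__main__":
    print(can_bond(None, None), can_bond(1, 1), can_bond(1, 2))
```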

The acquisition unit 11 also acquires composition ratio data for the constituent unit molecules whose molecular structures are specified by the structural data described above. As the composition ratio data representing the composition ratio ra of the plurality of types of constituent unit molecules, the acquisition unit 11 may acquire data indicating the composition rate of each constituent unit molecule itself, or data indicating the ratio among the constituent unit molecules. For example, when the structural data of the molecular structures shown in FIG. 3 is acquired, the acquisition unit 11 acquires the composition rate ra_1 = "0.2" of the molecular structure "A", the composition rate ra_2 = "0.3" of the molecular structure "B", the composition rate ra_3 = "0.1" of the molecular structure "C", and the composition rate ra_4 = "0.4" of the molecular structure "D".
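Because the composition ratio data may arrive either as per-unit rates or as a ratio among units, a normalization step is one way to bring both forms to the same representation. The following is a minimal sketch under that assumption; the function name `to_composition_rates` is hypothetical, and exact fractions are used so that the rates sum to exactly 1.

```python
from fractions import Fraction
from typing import Dict, Union

Number = Union[int, float, str]


def to_composition_rates(ratios: Dict[str, Number]) -> Dict[str, Fraction]:
    """Normalize a ratio such as 2:3:1:4 (or rates such as 0.2, 0.3, ...) to rates summing to 1."""
    # Going through str() keeps decimal inputs like 0.2 exact (1/5) instead of a binary float.
    exact = {name: Fraction(str(value)) for name, value in ratios.items()}
    if any(value < 0 for value in exact.values()):
        raise ValueError("composition ratios must not be negative")
    total = sum(exact.values())
    if total <= 0:
        raise ValueError("at least one composition ratio must be positive")
    return {name: value / total for name, value in exact.items()}


if __name__ == "__main__":
    print(to_composition_rates({"A": 2, "B": 3, "C": 1, "D": 4}))
```

Both `{"A": 2, "B": 3, "C": 1, "D": 4}` and `{"A": 0.2, "B": 0.3, "C": 0.1, "D": 0.4}` normalize to the same rates, matching the FIG. 3 example.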

Furthermore, the acquisition unit 11 acquires a plurality of combinations of structural data of a plurality of types of constituent unit molecules and composition ratio data representing their composition ratio, one combination for each synthetic compound synthesized from those constituent unit molecules, and additionally acquires mixing ratio data representing the mixing ratio of the plurality of types of synthetic compounds corresponding to those combinations. With the functions described below, this makes it possible to generate molecular description data describing the molecular structure of a mixture obtained by mixing the plurality of types of synthetic compounds at the mixing ratio indicated by the mixing ratio data. As the mixing ratio data representing the mixing ratio rb of the plurality of types of synthetic compounds, the acquisition unit 11 may acquire data indicating the mixing rate of each synthetic compound itself, or data indicating the ratio among the synthetic compounds.

The data generation unit 12 refers to each combination of structural data and composition ratio data and generates structural data that specifies the molecular structure of the synthetic compound corresponding to that combination. That is, the data generation unit 12 generates cyclic structure data and linear structure data as this structural data.

Specifically, the data generation unit 12 constructs structural data of a cyclic structure that has a predetermined number of constituent units and contains the plurality of types of constituent unit molecules indicated by the structural data at the composition ratio indicated by the composition ratio data. For example, when the composition rates of the constituent unit molecules "A", "B", "C", and "D" shown in FIG. 3 are specified as "0.2", "0.3", "0.1", and "0.4", the four types of constituent unit molecules are randomly selected so that their counts correspond to the specified rates, and are arranged in a ring so that the total equals the predetermined number of constituent units (for example, 200). The number of constituent units must be at least the minimum number that can express the ratio. For example, when the composition rate of the constituent unit molecule "A" is 0.1 and that of the constituent unit molecule "B" is 0.001 (a ratio of 100 to 1), the number of constituent units is set to 101 or more. Structural data is then generated in which the orientation of each constituent unit molecule is adjusted so that, in the cyclic structure, adjacent constituent unit molecules are bonded at the bonding points indicated by the structural data. When the structural data indicates that a constituent unit molecule has non-equivalent bonding points, the data generation unit 12 fine-tunes the molecular arrangement or molecular orientation in the cyclic structure so that each such bonding point is bonded to a bonding point with a different identification number on the adjacent constituent unit molecule.
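The random ring arrangement described above can be sketched as follows. This is a simplified illustration, not the patented implementation: it treats each constituent unit molecule as an opaque label, ignores molecular orientation and non-equivalent bonding points, and assumes the rates sum to 1 and multiply out to whole unit counts. The function names `unit_counts`, `random_ring`, and `ring_bonds` are hypothetical.

```python
import random
from fractions import Fraction
from typing import Dict, List, Tuple


def unit_counts(rates: Dict[str, float], n_units: int) -> Dict[str, int]:
    """Convert composition rates to whole unit counts for a structure of n_units."""
    counts = {}
    for name, rate in rates.items():
        exact = Fraction(str(rate)) * n_units  # exact decimal arithmetic
        if exact.denominator != 1:
            raise ValueError(f"{n_units} units cannot express rate {rate} for {name}")
        counts[name] = int(exact)
    if sum(counts.values()) != n_units:
        raise ValueError("composition rates must sum to 1")
    return counts


def random_ring(rates: Dict[str, float], n_units: int, rng: random.Random) -> List[str]:
    """Return unit labels in random order; position i neighbours position (i + 1) % n."""
    counts = unit_counts(rates, n_units)
    ring = [name for name, count in counts.items() for _ in range(count)]
    rng.shuffle(ring)
    return ring


def ring_bonds(ring: List[str]) -> List[Tuple[int, int]]:
    """Index pairs of adjacent units, including the bond that closes the ring."""
    n = len(ring)
    return [(i, (i + 1) % n) for i in range(n)]


if __name__ == "__main__":
    ring = random_ring({"A": 0.2, "B": 0.3, "C": 0.1, "D": 0.4}, 200, random.Random(0))
    print(len(ring), len(ring_bonds(ring)))
```

The closing bond between the last and first positions is what distinguishes the ring from a chain: every unit has two bonded neighbours, so no free ends appear in the structure.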

The data generation unit 12 also constructs structural data of a linear structure that has a predetermined number of constituent units and contains the plurality of types of constituent unit molecules indicated by the structural data at the composition ratio indicated by the composition ratio data. For example, when the composition rates of the constituent unit molecules "A", "B", "C", and "D" shown in FIG. 3 are specified as "0.2", "0.3", "0.1", and "0.4", the four types of constituent unit molecules are randomly selected so that their counts correspond to the specified rates, and are arranged in series (as a straight chain) so that the total equals the predetermined number of constituent units (for example, 20,000). This predetermined number of constituent units is set in advance so that the concentration of molecular ends in the linear structure (the ratio of unbonded bonding points to the total number of constituent units) is at or below a predetermined value (for example, 0.01%). Structural data is then generated in which the orientation of each constituent unit molecule is adjusted so that, in the linear structure, adjacent constituent unit molecules are bonded at the bonding points indicated by the structural data. When the structural data indicates that a constituent unit molecule has non-equivalent bonding points, the data generation unit 12 fine-tunes the molecular arrangement or molecular orientation in the linear structure so that each such bonding point is bonded to a bonding point with a different identification number on the adjacent constituent unit molecule.
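The relationship between chain length and the concentration of molecular ends can be checked with a short calculation. The sketch below assumes that a linear chain leaves exactly two bonding points unbonded (one at each end), which is consistent with the example values of 20,000 units and 0.01% in the text; the function name `min_chain_length` is hypothetical.

```python
import math
from fractions import Fraction


def min_chain_length(max_end_fraction: Fraction, free_ends: int = 2) -> int:
    """Smallest unit count n such that free_ends / n <= max_end_fraction."""
    if max_end_fraction <= 0:
        raise ValueError("the threshold must be positive")
    # free_ends / n <= threshold  is equivalent to  n >= free_ends / threshold
    return math.ceil(Fraction(free_ends) / max_end_fraction)


if __name__ == "__main__":
    # 0.01 % expressed exactly as 1/10000
    print(min_chain_length(Fraction(1, 10000)))
```

Exact fractions are used instead of floats so that the ceiling is not affected by binary rounding of values such as 0.0001.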

The data generation unit 12 then repeats the generation of cyclic structure data and linear structure data for the synthetic compound corresponding to a combination of structural data and composition ratio data a predetermined number of times (for example, 100 times each), producing a plurality of variants of cyclic structure data and a plurality of variants of linear structure data. FIG. 6 shows an image of the molecular structure specified by cyclic structure data SDr generated when the composition rates of the constituent unit molecules "A", "B", "C", and "D" shown in FIG. 3 are specified as "0.2", "0.3", "0.1", and "0.4", and FIG. 7 shows an image of the molecular structure specified by linear structure data SDs generated in the same case. Furthermore, the data generation unit 12 repeats this generation of multiple variants of cyclic structure data and linear structure data for each synthetic compound corresponding to a combination of structural data and composition ratio data acquired by the acquisition unit 11.

The vector conversion unit 13 converts each of the variants of cyclic structure data and linear structure data generated by the data generation unit 12 for each synthetic compound into one-dimensional vectors Vr and Vs, respectively, by molecular description (FIGS. 6 and 7). Molecular description expresses the features of the molecule indicated by the structural data as a numerical sequence based on its chemical structure. Any method that vectorizes a molecular structure can be used for the molecular description, for example ECFP (Extended Connectivity Fingerprints), MACCS fingerprints, PubChem fingerprints, Substructure fingerprints, EState fingerprints, BCI fingerprints, Molprint2D fingerprints, or Pass base fingerprints.
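To make the idea of turning a structure into a fixed-length numerical vector concrete, the sketch below uses a deliberately simple stand-in: it counts hashed windows of neighbouring unit labels. This is not ECFP or any of the fingerprints named above, and it works on unit labels rather than atoms; the function name `toy_descriptor` and its parameters are assumptions for illustration only.

```python
import zlib
from typing import List, Sequence


def toy_descriptor(units: Sequence[str], cyclic: bool, radius: int = 1, n_bins: int = 64) -> List[int]:
    """Count hashed neighbourhoods of width 2 * radius + 1 into a vector of length n_bins."""
    n = len(units)
    width = 2 * radius + 1
    if cyclic:
        # Windows wrap around the ring, so every unit starts one window.
        windows = [[units[(i + k) % n] for k in range(width)] for i in range(n)]
    else:
        # A linear chain has fewer windows because they cannot run past the ends.
        windows = [list(units[i:i + width]) for i in range(n - width + 1)]
    vector = [0] * n_bins
    for window in windows:
        key = "-".join(window).encode("utf-8")
        vector[zlib.crc32(key) % n_bins] += 1  # deterministic hash into a bin
    return vector


if __name__ == "__main__":
    print(sum(toy_descriptor(list("ABAB"), cyclic=True)))
    print(sum(toy_descriptor(list("ABAB"), cyclic=False)))
```

In practice a cheminformatics toolkit would compute one of the named fingerprints from the generated structure; the stand-in only shows that a ring and a chain built from the same units yield vectors of the same length that can later be summed element by element.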

The vector synthesis unit 14 combines the plurality of vectors Vr converted from the plurality of cyclic structure data for the synthetic compound corresponding to a combination of structural data and composition ratio data, thereby generating molecular description data MDr that describes the molecule of that synthetic compound as a numerical sequence. Specifically, the vector synthesis unit 14 adds the plurality of vectors Vr generated for the synthetic compound element by element, and then divides each summed element by the total molecular weight or the total number of structural unit molecules in the plurality of cyclic structures, thereby generating the molecular description data MDr for the synthetic compound. For example, when two types of structural unit molecules with molecular weights of 42.08 and 56.11 are arranged at a composition ratio of 1:1 to generate data of a cyclic structure with a total of 200 structural units, and this is repeated 100 times, each summed element is divided by the total molecular weight = 420,800 + 561,100 = 981,900, or by the total number of units = 20,000. In the same way, the vector synthesis unit 14 combines the plurality of vectors Vs converted from the plurality of linear structure data for the synthetic compound, thereby generating molecular description data MDs that describes the molecule of that synthetic compound as a numerical sequence. Specifically, the vector synthesis unit 14 adds the plurality of vectors Vs generated for the synthetic compound element by element, and then divides each summed element by the total molecular weight or the total number of structural unit molecules in the plurality of linear structures, thereby generating the molecular description data MDs for the synthetic compound.
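The sum-then-divide step and the arithmetic of the worked example can be checked with a short Python sketch. The function name `combine_vectors` and the variable names are hypothetical; only the numbers (molecular weights 42.08 and 56.11, a 1:1 ratio, 200 units per structure, 100 repetitions) come from the description.

```python
def combine_vectors(vectors, divisor):
    """Add the vectors element by element, then divide every element by the divisor."""
    summed = [sum(column) for column in zip(*vectors)]
    return [value / divisor for value in summed]


n_repeats = 100        # number of generated structures
units_per_ring = 200   # structural units per structure
molecular_weights = [42.08, 56.11]  # two unit types at a 1:1 composition ratio

# 100 units of each type per structure, times 100 structures = 10,000 units per type.
units_per_type = (units_per_ring // 2) * n_repeats
total_weight = sum(weight * units_per_type for weight in molecular_weights)
total_units = units_per_ring * n_repeats

# Tiny made-up vectors, only to show the element-wise operation.
example = combine_vectors([[1, 2], [3, 4]], divisor=2)
```

Here `total_weight` evaluates to approximately 981,900 (up to floating-point rounding) and `total_units` to 20,000, matching the divisors in the example. Either value would be passed as `divisor` together with the 100 vectors Vr or Vs.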

The vector synthesis unit 14 then repeats the creation of the two sets of molecular description data MDr and MDs for each synthetic compound corresponding to a combination of structural data and composition ratio data acquired by the acquisition unit 11.

The weighted addition unit 15 performs a weighted addition of the molecular description data MDr generated by the vector synthesis unit 14 for each of the plurality of types of synthetic compounds, thereby generating final molecular description data MDr' that indicates the chemical structure features of a mixed material in which the plurality of types of synthetic compounds are mixed at the mixing ratios indicated by the mixing ratio data. Specifically, the weighted addition unit 15 generates the molecular description data MDr' by weighting each element of the vectors of the plurality of molecular description data MDr by the mixing ratio of the synthetic compound described by the respective molecular description data MDr and adding the results. In the same way, the weighted addition unit 15 performs a weighted addition of the molecular description data MDs generated by the vector synthesis unit 14 for each of the plurality of types of synthetic compounds, thereby generating final molecular description data MDs' that indicates the chemical structure features of the mixed material. Specifically, the weighted addition unit 15 generates the molecular description data MDs' by weighting each element of the vectors of the plurality of molecular description data MDs by the mixing ratio of the synthetic compound described by the respective molecular description data MDs and adding the results.
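The weighted addition amounts to a mixing-ratio-weighted sum of the per-compound vectors. The Python sketch below illustrates it with made-up numbers; the function name `weighted_sum`, the three-element vectors, and the mixing ratios 0.25 and 0.75 are assumptions chosen only for the example.

```python
def weighted_sum(descriptors, mixing_ratios):
    """Weight each compound's descriptor by its mixing ratio and add element by element."""
    if len(descriptors) != len(mixing_ratios):
        raise ValueError("one mixing ratio is needed per descriptor")
    result = [0.0] * len(descriptors[0])
    for descriptor, ratio in zip(descriptors, mixing_ratios):
        for index, value in enumerate(descriptor):
            result[index] += ratio * value
    return result


# Hypothetical MDr vectors for two synthetic compounds mixed at 25 % and 75 %.
md_r_compound_1 = [1.0, 0.0, 2.0]
md_r_compound_2 = [0.0, 4.0, 2.0]
md_r_prime = weighted_sum([md_r_compound_1, md_r_compound_2], [0.25, 0.75])
```

The same function would be applied a second time to the MDs vectors to obtain MDs'.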

The weighted addition unit 15 also outputs the generated molecular description data MDr' and MDs' to the outside. The output molecular description data MDr' and MDs' are read as input data by a training unit 20 in a computer connected externally to the molecular descriptor generation system 10. In the training unit 20, the input data is input to a machine learning model as explanatory variables together with arbitrary teacher labels, whereby a trained model is generated. A machine learning model in a predictor 30 is then set based on the trained model generated by the training unit 20. When molecular description data MDr' and MDs' generated by the molecular descriptor generation system 10 are input to the machine learning model in the predictor 30, the predictor 30 generates and outputs a prediction result for the properties of the mixed material. The training unit 20 and the predictor 30 may be configured in the same computer as the computer 100 that constitutes the molecular descriptor generation system 10, or in a computer separate from the computer 100.

In one example, the machine learning model generated by the training unit 20 is the trained model expected to have the highest estimation accuracy, and can therefore be called the "best machine learning model". Note, however, that this trained model is not necessarily the best in reality. A trained model is generated by a computer processing teacher data that contains many combinations of input data and output data. The computer calculates output data by inputting the input data into the machine learning model, and obtains the error between the calculated output data and the output data indicated by the teacher data (that is, the difference between the estimation result and the correct answer). The computer then updates given parameters in the machine learning model based on that error. The computer generates the trained model by repeating this learning. The process of generating the trained model can be called the learning phase, and the process of the predictor 30 that uses the trained model can be called the operation phase.
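The learning loop described above (compute an output, measure the error against the teacher data, update the parameters, repeat) can be illustrated with a minimal Python sketch. The description does not name a model type, so the linear model, the per-sample gradient update, the learning rate, and the tiny synthetic data set below are all assumptions; `train_linear` and `predict` are hypothetical names.

```python
def predict(weights, bias, features):
    """Operation phase: compute the model output for one input vector."""
    return bias + sum(w * x for w, x in zip(weights, features))


def train_linear(inputs, targets, learning_rate=0.1, epochs=500):
    """Learning phase: repeatedly update the parameters from the prediction error."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(inputs, targets):
            error = predict(weights, bias, features) - target  # estimate minus correct answer
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Synthetic teacher data following target = 2*x0 + 3*x1 + 1 (made up for the example).
inputs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
targets = [4.0, 3.0, 6.0, 1.0]
weights, bias = train_linear(inputs, targets)
estimate = predict(weights, bias, [1.0, 1.0])
```

In the setting of the description, each input vector would be molecular description data such as MDr' or MDs', and each target a measured property of the mixed material.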

[System operation]
The operation of the molecular descriptor generation system 10 and the molecular descriptor generation method according to this embodiment are described with reference to FIG. 8. FIG. 8 is a flowchart showing an example of the operation of the molecular descriptor generation system 10, and FIGS. 9, 10, and 11 show numerical examples of the vectors generated by the molecular descriptor generation system 10.

First, when the molecular descriptor generation process is started in response to an instruction input by a user of the molecular descriptor generation system 10, the acquisition unit 11 acquires a plurality of combinations of structural data and composition ratio data, one for each of a plurality of types of synthetic compounds, together with mixing ratio data on the mixing ratios of the plurality of types of synthetic compounds (step S1). The data generation unit 12 then generates multiple variants of cyclic structure data and multiple variants of linear structure data for a combination of the structural data and composition ratio data (step S2).

Next, the vector conversion unit 13 converts the multiple variants of cyclic structure data and the multiple variants of linear structure data into vectors Vr and Vs, respectively, by molecular description (step S3). FIGS. 9 and 10 show numerical examples of the vectors Vr and Vs generated for the combination of the structural data specifying the two molecular structures shown in FIG. 4 and composition ratio data indicating a composition ratio of 1:1.

The vector synthesis unit 14 then combines the plurality of vectors Vr converted from the multiple variants of cyclic structure data to generate the molecular description data MDr, and combines the plurality of vectors Vs converted from the multiple variants of linear structure data to generate the molecular description data MDs (step S4). FIG. 11 shows a numerical example of the molecular description data MDr, a one-dimensional vector generated by the vector synthesis unit 14.

Steps S2 to S4 described above are then repeated for each combination of structural data and composition ratio data acquired by the acquisition unit 11, so that a plurality of pairs of molecular description data MDr and molecular description data MDs are generated, one pair for each of the plurality of types of synthetic compounds (step S5). The weighted addition unit 15 then performs a weighted addition of the molecular description data MDr and of the molecular description data MDs for the plurality of types of synthetic compounds, using the mixing ratios indicated by the mixing ratio data, thereby generating the final molecular description data MDr' and MDs' as input data (step S6).

Next, the training unit 20 executes the learning phase and generates a trained model by repeating training with the input data and the teacher data (step S7). The generated trained model is then set in the predictor 30, and the predictor 30 executes the operation phase using input data newly acquired from the molecular descriptor generation system 10, generating and outputting a prediction result for the properties of the mixed material (step S8).

[Program]
A molecular descriptor generation program for causing a computer or a computer system to function as the molecular descriptor generation system 10 includes program code for causing the computer system to function as the acquisition unit 11, the data generation unit 12, the vector conversion unit 13, the vector synthesis unit 14, and the weighted addition unit 15. The molecular descriptor generation program may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. Alternatively, the molecular descriptor generation program may be provided via a communication network as a data signal superimposed on a carrier wave. The provided molecular descriptor generation program is stored, for example, in the auxiliary storage unit 103. Each of the above functional elements is realized when the processor 101 reads the molecular descriptor generation program from the auxiliary storage unit 103 and executes it.

[Effects]
As described above, according to the above embodiment, data of a cyclic structure having a predetermined number of structural units and containing multiple types of structural unit molecules at a prespecified composition ratio is generated as data of a structure in which the structural unit molecules are randomly arranged and adjacent molecules are then bonded at prespecified bonding points. Such data is representative of the structure of a synthetic compound containing the multiple types of structural unit molecules at that composition ratio. The repeatedly generated data of the plurality of cyclic structures is converted into a plurality of vectors, and those vectors are combined to generate molecular description data that appropriately expresses the molecular structure of the synthetic compound. As a result, even when the molecular structure of a compound synthesized from multiple types of structural units is not known, that molecular structure can be appropriately expressed as data. In particular, because data of a molecular structure with a small proportion of reactive termini can be generated efficiently with a small number of structural units, the processing load of generating the molecular description data can also be reduced. Using data of cyclic structures also makes it possible to generate data that more appropriately expresses the molecular structure of a synthetic compound with a linear structure. In addition, when the structural unit molecules are set to be monomers, data that more appropriately expresses the molecular structure of a copolymer, which is the synthetic compound, can be generated.

In the above embodiment, when the molecular description data MDr is generated, the elements of the plurality of vectors Vr are added, and each summed element is then divided by the total molecular weight or the total number of structural unit molecules in the plurality of cyclic structures. In this way, data expressing the molecule of one synthetic compound can be generated from the molecular structures of a plurality of randomly generated cyclic structures without being affected by the sizes of those molecular structures.

Furthermore, according to the above embodiment, data of a linear structure having a predetermined number of structural units and containing multiple types of structural unit molecules at a prespecified composition ratio is generated as data of a structure in which the structural unit molecules are randomly arranged and adjacent molecules are then bonded at prespecified bonding points. Such data is also representative of the structure of a synthetic compound containing the multiple types of structural unit molecules at that composition ratio. The repeatedly generated data of the plurality of linear structures is converted into a plurality of vectors Vs, and those vectors Vs are further combined to generate molecular description data MDs that appropriately expresses the molecular structure of the synthetic compound. As a result, even when the molecular structure of a synthetic compound is not known, that molecular structure can be appropriately expressed as data. In particular, when the synthetic compound often has a linear structure, its molecular structure can be expressed more appropriately as data.

In the above embodiment, when the molecular description data MDs is generated, the elements of the plurality of vectors Vs are added, and each summed element is then divided by the total molecular weight or the total number of structural unit molecules in the plurality of linear structures. With this configuration, data expressing the molecule of one synthetic compound can be generated from the molecular structures of a plurality of randomly generated linear structures without being affected by the sizes of those molecular structures.

In the above embodiment, molecular description data MDr' and MDs' are further generated as vectors in which the molecular description data MDr and MDs generated for the plurality of types of synthetic compounds are weighted by the mixing ratios indicated by the mixing ratio data. With this configuration, a vector expressing the molecules of a mixture of the plurality of types of synthetic compounds can be generated appropriately.

[Modifications]
The present invention has been described above in detail based on its embodiment. However, the present invention is not limited to the above embodiment. The present invention can be modified in various ways without departing from its gist.

In the above embodiment, the molecular description data generated by the molecular descriptor generation system 10 is used as input data for machine learning, but the molecular description data may be used for other purposes. For example, it may be used in search processing for retrieving the structures, physical properties, and the like of synthetic compounds or mixtures thereof.
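As one hedged illustration of such a search use, the Python sketch below ranks stored descriptor vectors by cosine similarity to a query vector. The description does not specify any search method; the similarity measure, the function names `cosine_similarity` and `most_similar`, and the library contents are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, library):
    """Return the name of the stored descriptor closest to the query."""
    return max(library, key=lambda name: cosine_similarity(query, library[name]))


# Hypothetical stored molecular description data for two known materials.
library = {
    "material_1": [1.0, 0.0, 0.0],
    "material_2": [0.0, 1.0, 1.0],
}
best = most_similar([0.1, 0.9, 1.0], library)
```

A lookup of this kind would let known structures or measured properties be retrieved for the stored material whose descriptor is closest to that of a new mixture.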

In the above embodiment, the molecular description data MDr' and MDs', which are vectors obtained by weighting the molecular description data MDr and MDs by the mixing ratios indicated by the mixing ratio data, are generated and output as the final data, but the molecular description data MDr and MDs may instead be the final output. A vector combining the two sets of molecular description data MDr and MDs may also be the final output.

The processing procedure of the molecular descriptor generation method executed by at least one processor is not limited to the example in the above embodiment. For example, some of the steps (processes) described above may be omitted, or the steps may be executed in a different order. Any two or more of the steps described above may be combined, and some of the steps may be modified or deleted. Alternatively, other steps may be executed in addition to the above steps. For example, the processes of steps S7 and S8 may be omitted.

In the present disclosure, the expression "at least one processor executes a first process, executes a second process, ... and executes an n-th process", or an expression corresponding to it, denotes a concept that includes the case where the entity executing the n processes from the first process to the n-th process (that is, the processor) changes partway through. In other words, this expression covers both the case where all n processes are executed by the same processor and the case where the processor changes among the n processes according to an arbitrary policy.

DESCRIPTION OF SYMBOLS: 10 ... molecular descriptor generation system, 100 ... computer, 101 ... processor, 11 ... acquisition unit, 12 ... data generation unit, 13 ... vector conversion unit, 14 ... vector synthesis unit, 15 ... weighted addition unit, 20 ... training unit, 30 ... predictor, SD_1 to SD_8 ... structural data, SDr ... cyclic structure data, SDs ... linear structure data, Vr, Vs ... vectors.

Claims

A molecular descriptor generation system comprising at least one processor,
wherein the at least one processor:
accepts at least input of structural data representing the molecular structures of molecules of multiple types of structural units and the bonding points with external molecular structures in those molecular structures, and composition ratio data representing the composition ratios of the molecules of the multiple types of structural units;
repeatedly generates data of a cyclic structure having a predetermined number of structural units and containing the molecules of the multiple types of structural units indicated by the structural data at the composition ratio indicated by the composition ratio data, by randomly arranging the molecules of the multiple types of structural units in a ring and then bonding adjacent molecules at the bonding points indicated by the structural data;
converts the repeatedly generated data of the plurality of cyclic structures into a plurality of vectors by molecular description; and
generates, by combining the plurality of vectors, molecular description data describing molecules of a synthetic compound containing the molecules of the multiple types of structural units at the composition ratio.

The molecular descriptor generation system according to claim 1,
wherein the synthetic compound is a compound with a linear structure.

The molecular descriptor generation system according to claim 1 or 2,
wherein the molecules of the multiple types of structural units are monomers, and
the synthetic compound is a copolymer.

The molecular descriptor generation system according to any one of claims 1 to 3,
wherein the at least one processor generates the molecular description data by adding the elements of the plurality of vectors and then dividing each summed element by the total molecular weight or the total number of molecules of the structural units in the plurality of cyclic structures.

The molecular descriptor generation system according to any one of claims 1 to 4,
wherein the at least one processor:
repeatedly generates data of a linear structure having a predetermined number of structural units and containing the molecules of the multiple types of structural units indicated by the structural data at the composition ratio indicated by the composition ratio data, by randomly arranging the molecules of the multiple types of structural units in series and then bonding adjacent molecules at the bonding points indicated by the structural data;
converts the repeatedly generated data of the plurality of linear structures into a plurality of additional vectors by molecular description; and
generates the molecular description data by further combining the plurality of additional vectors.

The molecular descriptor generation system according to claim 5,
wherein the at least one processor generates the molecular description data by adding the elements of the plurality of additional vectors and then dividing each summed element by the total molecular weight or the total number of molecules of the structural units in the plurality of linear structures.

The molecular descriptor generation system according to any one of claims 1 to 6,
wherein the at least one processor:
further accepts mixing ratio data representing the mixing ratios of multiple types of the synthetic compounds; and
further generates a vector in which the molecular description data generated for the multiple types of synthetic compounds is weighted by the mixing ratios indicated by the mixing ratio data.

A molecular descriptor generation method executed by a computer comprising at least one processor, the method comprising:
a step of accepting at least input of structural data representing the molecular structures of molecules of multiple types of structural units and the bonding points with external molecular structures in those molecular structures, and composition ratio data representing the composition ratios of the molecules of the multiple types of structural units;
a step of repeatedly generating data of a cyclic structure having a predetermined number of structural units and containing the molecules of the multiple types of structural units indicated by the structural data at the composition ratio indicated by the composition ratio data, by randomly arranging the molecules of the multiple types of structural units in a ring and then bonding adjacent molecules at the bonding points indicated by the structural data;
a step of converting the repeatedly generated data of the plurality of cyclic structures into a plurality of vectors by molecular description; and
a step of generating, by combining the plurality of vectors, molecular description data describing molecules of a synthetic compound containing the molecules of the multiple types of structural units at the composition ratio.

A molecular descriptor generation program that causes a computer to execute:
a step of accepting at least input of structural data representing the molecular structures of molecules of multiple types of structural units and the bonding points with external molecular structures in those molecular structures, and composition ratio data representing the composition ratios of the molecules of the multiple types of structural units;
a step of repeatedly generating data of a cyclic structure having a predetermined number of structural units and containing the molecules of the multiple types of structural units indicated by the structural data at the composition ratio indicated by the composition ratio data, by randomly arranging the molecules of the multiple types of structural units in a ring and then bonding adjacent molecules at the bonding points indicated by the structural data;
a step of converting the repeatedly generated data of the plurality of cyclic structures into a plurality of vectors by molecular description; and
a step of generating, by combining the plurality of vectors, molecular description data describing molecules of a synthetic compound containing the molecules of the multiple types of structural units at the composition ratio.