Introduction: The 2016 International Joint Conference on Artificial Intelligence (IJCAI 2016) was held from July 9 to July 15. This year's conference focused on human-aware artificial intelligence. The following is a Distinguished Student Paper from IJCAI 2016. In addition to a detailed explanation of the paper, we also invited Associate Professor Li Yanjie of Harbin Institute of Technology to comment.

Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning

Joint compilation: Blake, Zhang Min, Chen Chun

Abstract

Knowledge transfer between tasks can improve the performance of learned models, but it requires an accurate assessment of inter-task relationships to identify the relevant knowledge to transfer. These relationships are typically estimated from the training data of each task, which is inefficient for lifelong learning, where the goal is to learn each consecutive task rapidly from as little data as possible. To reduce this burden, we develop a lifelong reinforcement learning method based on coupled dictionary learning that incorporates high-level task descriptors into the modeling of inter-task relationships. Our results show that using task descriptors improves the performance of the learned task policies; we provide both a theoretical justification for this improvement and an empirical demonstration on a series of dynamical control problems. Given only the descriptor of a new task, the lifelong learner can also use the coupled dictionaries to accurately predict the task policy through zero-shot learning, eliminating the need to pause and collect training data before addressing the task.

1 Introduction

By reusing knowledge from other related tasks, transfer and multi-task learning (MTL) approaches reduce the amount of experience needed to train models for individual tasks.

These techniques typically select the relevant knowledge to transfer by modeling the relationships between tasks based on each task's training data. However, this means that before knowledge transfer can succeed, the process requires enough training data from each task to identify those relationships. In contrast, given only a high-level task description, humans can quickly bring past experience to bear and form a plan for a new task before ever executing it. For example, when we see the picture on the box of a new IKEA chair, we can immediately recall previous experience assembling chairs and begin thinking about how to assemble this one. Similarly, an experienced pole-balancing agent might be able to predict the controller for a pole of given mass and length before ever interacting with the physical system.

Inspired by this observation, we explore the use of high-level task descriptions to improve the efficiency of knowledge transfer across multiple machine learning tasks. We focus mainly on the lifelong learning scenario, in which multiple tasks arrive consecutively and the goal is to quickly learn each new task by building upon prior knowledge. Although we focus on RL tasks in this article, our approach easily extends to regression and classification problems.

Our algorithm, Task Descriptors for Lifelong Learning (TaDeLL), encodes task descriptors as feature vectors that identify each task and uses them as auxiliary information in addition to each task's individual training data. Knowledge transfer using task features has been explored by other researchers before; compared with that prior work, our method runs online over consecutive tasks and is more computationally efficient.

We use coupled dictionary learning to model the relationships between the task descriptors and the individual task policies in lifelong learning. The coupled dictionaries enforce the idea that tasks with similar descriptors should have similar policies, while still allowing the dictionary elements enough freedom to accurately represent the different task policies. We connect coupled dictionary learning to the concept of mutual coherence in sparse coding, provide theoretical reasons why task descriptors can improve performance, and empirically verify this improvement.

Beyond improving the learned task policies, we show that task descriptors enable the learner to accurately predict the policies of unseen tasks given only their descriptions. This process of learning without data is known as zero-shot learning. This ability is particularly important in the lifelong learning setting: it allows the system to accurately predict policies for new tasks through transfer, eliminating the need to pause and collect data on each task.

2. Related work

Batch MTL methods often model the relationships between tasks to identify which knowledge to transfer. These techniques include modeling a task distance metric, using correlations to identify appropriate transfer, or modeling shared latent structure across tasks. Recently, MTL has been extended to the lifelong learning setting, in which regression, classification, and reinforcement learning tasks arrive consecutively. However, all of these methods require training data from each task in order to assess task relatedness before identifying the knowledge to transfer.

Rather than relying solely on task training data, several research efforts have explored the use of high-level task descriptors to model the relationships between tasks in MTL and transfer learning settings. Task descriptors have been used in combination with neural networks to define a task-specific prior or to control the gating network between individual task clusters. That work focuses on multi-task classification and regression in batch settings, where the system has access to the data and features of all tasks at once, in contrast to our study of task descriptors for lifelong learning over consecutive RL tasks.

Most similar to our work, Sinapov et al. use task descriptors to estimate the transferability between each pair of tasks in a transfer learning setting. Given the descriptor of a new task, they identify the source task with the highest predicted transferability and use that source task for transfer in RL. Although their approach is effective, it is computationally expensive, because estimating the transferability requires repeated simulations between pairs of tasks. Their evaluation is also limited to the transfer learning setting: it does not consider transfer across consecutive tasks, nor does it update the transfer model over time as we do in the lifelong learning setting.

Our work is also related to the simple zero-shot learning (ZSL) method proposed by Romera-Paredes and Torr, which learns a multi-class linear model, factorizes the linear model parameters, and assumes that the descriptors are the latent underlying parameters that reconstruct the model.

Our approach assumes a more flexible relationship: both the model parameters and the task descriptors are reconstructed from a shared set of latent parameters. Also, in contrast to our lifelong learning method, simple ZSL operates in an offline setting.

3. Background

3.1 Reinforcement learning

A reinforcement learning (RL) agent must select sequential actions in an environment to maximize its expected return. An RL task is typically formulated as a Markov decision process (MDP) ⟨X, A, P, R, γ⟩, where X is the set of states, A is the set of actions the agent may execute, P: X × A × X → [0, 1] is the state transition probability describing the system dynamics, R: X × A × X → ℝ is the reward function, and γ ∈ [0, 1) is the discount factor on future rewards. At time step h, the agent selects an action a_h ∈ A in state x_h ∈ X according to a policy π: X × A → [0, 1], which is defined by a vector of control parameters θ. The goal of RL is to find the optimal policy π* with parameters θ* that maximizes the expected return. However, learning an individual task still requires numerous trajectories, which motivates transfer to reduce the amount of interaction with the environment.

The policy gradient (PG) method is our base learner; PG algorithms are a class of RL algorithms well suited to problems with continuous state and action spaces, such as robotic control. The goal of a PG method is to optimize the expected average return:
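The objective below is the standard policy-gradient form, given as a hedged reconstruction from the surrounding text (the paper's exact notation may differ); here $\mathbb{T}$ denotes the set of all possible trajectories, $p_\theta(\tau)$ the probability of trajectory $\tau$ under the policy with parameters $\theta$, and $\mathcal{R}(\tau)$ its average per-step reward over a horizon of $H$ steps:

```latex
\mathcal{J}(\theta) \;=\; \mathbb{E}\big[\mathcal{R}(\tau)\big]
\;=\; \int_{\mathbb{T}} p_\theta(\tau)\, \mathcal{R}(\tau)\, d\tau ,
\qquad
\mathcal{R}(\tau) \;=\; \frac{1}{H}\sum_{h=0}^{H-1} r_{h+1} .
```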

3.2 Lifelong Machine Learning

In the lifelong learning setting, the learner faces multiple consecutive tasks and must learn each one quickly by building on prior experience. The learner may encounter a previous task again at any time, so it must also maintain its performance on previously seen tasks. The agent does not know the total number of tasks T_max, the task distribution, or the order in which tasks will arrive.

At time t, the lifelong learner encounters task Z(t). In this article, each task Z(t) is defined by an MDP, but the lifelong learning setting and our method apply equally to classification or regression tasks. The agent learns the tasks consecutively, acquiring training data for the current task before moving on to the next. The agent's goal is to learn the optimal policies for all tasks, with their corresponding parameters. Ideally, knowledge learned from previous tasks should accelerate learning and improve performance on each new task Z(t). Likewise, the lifelong learner should scale effectively to a large number of tasks while learning each task quickly from minimal data.

The Efficient Lifelong Learning Algorithm (ELLA) and PG-ELLA were designed for classification/regression tasks and RL tasks, respectively, in the lifelong learning setting.

Both methods assume that each task's model parameters can be factorized over a shared knowledge base L, thereby facilitating transfer between tasks. Specifically, the model parameters for task Z(t) are given by θ(t) = L s(t), where L ∈ ℝ^{d×k} is a shared basis for the model space and s(t) ∈ ℝ^k is a sparse coefficient vector over that basis. This factorization has proven effective for both lifelong and multi-task learning. Under this factorization, the MTL objective for PG is:
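A plausible reconstruction of this multi-task objective (Equation 1 in the paper; the symbols μ and λ for the regularization weights are an assumption) is:

```latex
\min_{L,\,S}\;\; \frac{1}{T}\sum_{t=1}^{T}\Big[-\mathcal{J}\big(\theta^{(t)}\big) + \mu\,\big\lVert s^{(t)}\big\rVert_1\Big] \;+\; \lambda\,\lVert L\rVert_F^2,
\qquad \theta^{(t)} = L\, s^{(t)} .
```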


To make this objective tractable in the lifelong learning setting, Bou Ammar et al. approximate it by first substituting a lower bound for the PG objective, then taking a second-order Taylor expansion around an estimate α(t) ∈ ℝ^d of the single-task policy parameters for each task Z(t), and updating only the coefficients s(t) of the current task at each time step. This reduces the MTL objective to a series of per-task sparse coding problems over the shared basis L, and S and L can be solved efficiently by the online update rules that constitute PG-ELLA.
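A sketch of the resulting online updates, following the general ELLA/PG-ELLA form (the exact equations in the paper may differ in detail; Γ(t) denotes the Hessian of the approximated PG objective around α(t)), is:

```latex
s^{(t)} \leftarrow \arg\min_{s}\; \big\lVert \alpha^{(t)} - L s \big\rVert^2_{\Gamma^{(t)}} + \mu\,\lVert s\rVert_1 ,
\qquad
L \leftarrow \arg\min_{L}\; \frac{1}{T}\sum_{t=1}^{T} \big\lVert \alpha^{(t)} - L s^{(t)} \big\rVert^2_{\Gamma^{(t)}} + \lambda\,\lVert L\rVert_F^2 ,
```

where $\lVert v\rVert^2_{\Gamma} = v^{\top}\Gamma\, v$.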


Although this approach is effective for lifelong learning, it requires a substantial amount of training data on each new task in order to estimate that task's policy before transfer can occur. We eliminate this restriction by incorporating task descriptors into lifelong learning, enabling zero-shot transfer to new tasks.

4. Task descriptor

While most MTL and lifelong learning methods model inter-task relationships using the tasks' training data, high-level descriptions can characterize tasks in a completely different way. For example, in multi-task medical applications, patients are often grouped into tasks by demographic data and disease presentation. In control problems, the parameters of the dynamical system (e.g., the spring, mass, and damping constants in a spring-mass-damper system) can serve as task descriptions. Descriptions can also come from external sources such as Wikipedia. Such task descriptions have been used widely in zero-shot learning.

Formally, we assume that each task Z(t) has an associated descriptor m(t) that is given to the learner when the task is first presented. The learner has no knowledge of future tasks or of the distribution of task descriptors. The descriptor is represented by a feature vector φ(m(t)) ∈ ℝ^{d_m}, where φ(·) performs feature extraction and (possibly nonlinear) basis transformations on the features. Although we assume that different tasks have different descriptors, we make no assumptions about the uniqueness of φ(m(t)). In addition, each task has associated training data X(t) for learning the model; for RL tasks, this data consists of trajectories obtained dynamically through the agent's interaction with the environment.
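As a concrete illustration, a task descriptor for the spring-mass-damper domain could simply be the vector of system parameters, optionally passed through a feature transformation φ(·). The sketch below is purely illustrative; the function name and the choice of features are assumptions (the experiments later report that plain linear features already work well):

```python
import numpy as np

def phi(descriptor: dict, nonlinear: bool = False) -> np.ndarray:
    """Map a raw task descriptor m(t) to a feature vector phi(m(t)).

    Here the descriptor is assumed to be the physical parameters of a
    spring-mass-damper task: spring constant, mass, and damping constant.
    """
    m = np.array([descriptor["spring_k"], descriptor["mass"], descriptor["damping_c"]])
    if not nonlinear:
        return m  # linear features
    # Example nonlinear basis expansion: raw values plus their pairwise products.
    pairwise = np.outer(m, m)[np.triu_indices(3)]
    return np.concatenate([m, pairwise])

# Example: descriptor feature vector for one hypothetical task.
print(phi({"spring_k": 4.0, "mass": 1.5, "damping_c": 0.3}))
```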

5. Lifelong Learning of Task Descriptors

We incorporate task descriptors into lifelong learning via coupled dictionaries, so that the descriptors and the learned policies mutually inform each other. Although we focus on RL tasks, our approach easily adapts to classification or regression, as described in the appendix.

5.1 Coupled Dictionary Optimization

As described above, a successful strategy in many multi-task and lifelong learning methods is to factor the policy parameters θ(t) of each task as a sparse linear combination over a shared basis: θ(t) = L s(t). In effect, each column of the shared basis L serves as a reusable policy component representing a cohesive chunk of knowledge. In lifelong learning, L is refined over time as the system learns more tasks. The coefficient matrix S = [s(1) ... s(T)] encodes the task policies over the shared basis and provides an embedding of the tasks according to how their policies share knowledge.

We make a similar assumption about the task descriptors: the descriptor features φ(m(t)) can be linearly factorized using a latent basis D ∈ ℝ^{d_m × k} over the descriptor space. The coefficients over this basis capture the relationships among the descriptors, embedding the tasks according to the commonality of their descriptions. From a co-view perspective, the policies and the descriptors both provide information about the task, so each can inform the other. The underlying tasks are common to both views, so we seek task embeddings that are consistent with both the policies and the corresponding task descriptors. We achieve this by coupling the two bases L and D, sharing the same coefficient vectors to reconstruct both the policy and the descriptor. So, for task Z(t):
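The coupled factorization described above can be written as (reconstructed from the surrounding text):

```latex
\theta^{(t)} = L\, s^{(t)},
\qquad
\phi\!\big(m^{(t)}\big) = D\, s^{(t)} .
```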


To optimize the coupled bases L and D during lifelong learning, we use coupled dictionary optimization techniques from the sparse coding literature, which optimize dictionaries over multiple feature spaces that share a joint sparse representation. The idea of coupled dictionary learning has led to high-performance algorithms for image super-resolution, in which high-resolution images are reconstructed from low-resolution samples, and has also been used for multi-modal and cross-domain retrieval.

Given the factorization in Equation 6, we can rewrite the multi-task objective (Equation 1) for the coupled dictionaries as:
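A plausible form of this coupled objective (the descriptor-reconstruction weight ρ and the symbols for the regularization terms are assumptions on our part) is:

```latex
\min_{L,\,D,\,S}\;\; \frac{1}{T}\sum_{t=1}^{T}\Big[
-\mathcal{J}\big(\theta^{(t)}\big)
+ \rho\,\big\lVert \phi\big(m^{(t)}\big) - D\, s^{(t)} \big\rVert_2^2
+ \mu\,\big\lVert s^{(t)}\big\rVert_1
\Big]
\;+\; \lambda\big(\lVert L\rVert_F^2 + \lVert D\rVert_F^2\big),
\qquad \theta^{(t)} = L\, s^{(t)} .
```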

This objective can now be solved efficiently online, as a series of per-task updates given in Algorithm 1. Using recursive constructions based on eigendecomposition, L and D are each updated individually via Equations 3-5. Our complete implementation is available online.
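The following sketch illustrates the flavor of such a per-task update, assuming the single-task policy estimate α(t) has already been obtained by a PG learner. It uses a LASSO solve over the stacked dictionary and a simple gradient step on L and D in place of the paper's closed-form recursive updates; all function names and hyperparameters are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import Lasso

def tadell_task_update(L, D, alpha_t, phi_t, mu=0.1, lam=1e-3, lr=0.05):
    """One illustrative lifelong-learning update for a single task.

    L       : (d, k)   shared policy basis
    D       : (d_m, k) shared descriptor basis
    alpha_t : (d,)     single-task policy parameter estimate for task t
    phi_t   : (d_m,)   descriptor features phi(m(t)) for task t
    Returns the updated (L, D) and the task's sparse code s(t).
    """
    # 1) Sparse-code the task jointly over the policy and descriptor views,
    #    using the stacked dictionary K = [L; D] and target [alpha; phi].
    K = np.vstack([L, D])
    target = np.concatenate([alpha_t, phi_t])
    s_t = Lasso(alpha=mu, fit_intercept=False, max_iter=10000).fit(K, target).coef_

    # 2) Refine both dictionaries toward reconstructing their respective views
    #    (a plain gradient step standing in for the closed-form update).
    L = L + lr * (np.outer(alpha_t - L @ s_t, s_t) - lam * L)
    D = D + lr * (np.outer(phi_t - D @ s_t, s_t) - lam * D)
    return L, D, s_t

# Tiny usage example with random data (k = 5 latent components).
rng = np.random.default_rng(0)
L, D = rng.normal(size=(8, 5)), rng.normal(size=(3, 5))
L, D, s = tadell_task_update(L, D, rng.normal(size=8), rng.normal(size=3))
print(s)
```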

5.2 Zero-shot transfer learning

In the lifelong setting, when faced with a new task, the agent's goal is to obtain an effective policy for that task as quickly as possible. At this stage, earlier multi-task and lifelong learners suffer a delay before they can produce a decent policy, because they first need to acquire data from the new task in order to identify the relevant knowledge and train the new policy.

Incorporating task descriptors enables our method to predict a policy for a new task immediately, given only its descriptor. This zero-shot transfer ability is a consequence of using coupled dictionary learning, which allows us to observe a data instance in one feature space (e.g., the task descriptor) and recover its underlying latent signal in the other feature space (e.g., the policy parameters) using the dictionaries and sparse coding.

Given only the descriptor m(t_new) of a new task Z(t_new), we can estimate the task's embedding in the latent descriptor space by solving a LASSO problem over the learned dictionary D:
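Concretely, this LASSO estimate has the form (reconstructed from the surrounding text):

```latex
\tilde{s}^{(t_{new})} = \arg\min_{s}\; \big\lVert \phi\big(m^{(t_{new})}\big) - D\, s \big\rVert_2^2 + \mu\,\lVert s\rVert_1 .
```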

Since this estimate s̃(t_new) also serves as the coefficient vector over the latent policy space L, we can immediately predict a policy for the new task as:
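That is (reconstructed from the surrounding text):

```latex
\tilde{\theta}^{(t_{new})} = L\, \tilde{s}^{(t_{new})} .
```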

The zero-shot transfer learning process is given in Algorithm 2.
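A minimal sketch of this zero-shot prediction step, assuming the dictionaries L and D have already been learned (the function and variable names are illustrative, not the paper's code):

```python
import numpy as np
from sklearn.linear_model import Lasso

def zero_shot_policy(L, D, phi_new, mu=0.1):
    """Predict policy parameters for a new task from its descriptor alone."""
    # Estimate the task embedding from the descriptor view only.
    s_new = Lasso(alpha=mu, fit_intercept=False, max_iter=10000).fit(D, phi_new).coef_
    # Reuse the same embedding to reconstruct the policy view.
    return L @ s_new

# Usage with placeholder dictionaries and a new task descriptor.
rng = np.random.default_rng(1)
L, D = rng.normal(size=(8, 5)), rng.normal(size=(3, 5))
theta_new = zero_shot_policy(L, D, phi_new=np.array([4.0, 1.5, 0.3]))
print(theta_new)
```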

5.3 Theoretical Analysis

This section discusses why incorporating task descriptors through coupled dictionaries can improve the performance of the learned policies and enable zero-shot transfer to new tasks. In the appendix, we provide a convergence analysis of TaDeLL. A full sample-complexity analysis is beyond the scope of this paper; indeed, it remains an open question for zero-shot learning.

To analyze the improvement in the learned policies, recall the factorization of the policy parameters θ(t) = L s(t); we proceed by showing that incorporating the descriptors through coupled dictionaries improves both L and the coefficients s(t). In this analysis we use the concept of mutual coherence, which has been studied extensively in the sparse recovery literature. Mutual coherence measures the similarity between the elements of a dictionary Q as:
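The standard definition of mutual coherence, which we take to be the quantity M(Q) used below, is:

```latex
M(Q) \;=\; \max_{i \neq j} \; \frac{\big| q_i^{\top} q_j \big|}{\lVert q_i\rVert_2\, \lVert q_j\rVert_2},
```

where $q_i$ denotes the $i$-th column of $Q$.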

If M(Q) = 0, the columns of Q are orthogonal and the dictionary is invertible, so sparse recovery can be solved directly by inversion; if M(Q) = 1, Q is not full rank and is a poor dictionary. Intuitively, low mutual coherence means that the columns of the dictionary are quite different from one another, so such a "good" dictionary can represent many different policies and is likely to support more knowledge transfer. This intuition is formalized below:

Therefore, an L with low mutual coherence yields a more stable solution when the single-task policy estimates are inaccurate. Next, we show that our method reduces the mutual coherence of L.

TaDeLL changes the problem from learning L alone to jointly learning L and D, which can be viewed as learning the stacked dictionary K. In sparse recovery theory, s(t) is the solution to Equation 1 for task Z(t) and is shared across both views. Theorem 5.1 implies that if M(K) < M(L), then the joint dictionary learning yields a more accurate solution. To see why this is likely, note that from a Bayesian perspective Equation 7 can also be derived as a MAP estimate that places a Laplacian prior on the distribution of s(t) and assumes that L is a Gaussian matrix with independently distributed elements. Using this as a criterion for comparing M(L) and M(K): because the task descriptors add d_m new rows to the dictionary, in most cases M(K) < M(L), which implies that TaDeLL learns a higher-quality dictionary. Moreover, if M(D) ≤ M(L), the same argument indicates that we can use D to recover task policies through zero-shot transfer.

To show that the task features can also improve sparse recovery itself, we use the following theorem (Theorem 5.2) on LASSO. Let s* be the unique sparse solution of the system θ = Qs.

This theorem indicates that the LASSO reconstruction error is proportional to 1/d. When we include the descriptors via φ(m(t)), the dimension d on the right-hand side grows to (d + d_m), while k and the sparsity of s(t) remain unchanged, yielding a tighter bound. Therefore, the task descriptors improve both the quality of the learned dictionary and the accuracy of the sparse recovery. Because s(t) can be recovered closely from either the policy or the descriptor, Theorem 5.2 also suggests that when d_m ≥ d, zero-shot learning can produce an estimate of s(t) of comparable quality.

6. Experiment

We evaluate our method by using it to learn control policies on three benchmark dynamical systems.

6.1 Benchmark Dynamical Systems

Spring-mass-damper (SM). This system is described by three parameters: the spring constant, the mass, and the damping constant. The state of the system is the position and velocity of the mass. The controller applies a force to the mass to move it to a specified position.
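For intuition, a minimal simulation of one such spring-mass-damper task might look as follows (an illustrative sketch only; the paper does not provide simulation code, and the integration scheme, parameter values, and controller are assumptions):

```python
import numpy as np

def sm_step(state, force, k=4.0, m=1.5, c=0.3, dt=0.01):
    """One Euler step of the spring-mass-damper dynamics m*x'' = -k*x - c*x' + F."""
    x, v = state
    a = (-k * x - c * v + force) / m
    return np.array([x + dt * v, v + dt * a])

# Drive the mass toward a goal position with a simple proportional-derivative controller.
state, goal = np.array([1.0, 0.0]), 0.5
for _ in range(1000):
    state = sm_step(state, force=100.0 * (goal - state[0]) - 15.0 * state[1])
print(state)  # settles near the goal (pure P-D control leaves a small spring offset)
```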

Bicycle (BK). This task focuses on keeping a bicycle balanced upright while it rolls at a fixed speed along a horizontal plane. The system is characterized by the bicycle's mass, the x and z coordinates of its center of mass, and the shape parameters of the bicycle (wheelbase, trail, and head angle). Its state is the tilt of the bicycle and its derivatives.

6.2 Methods

We generate 40 tasks in each domain, each with different dynamics determined by different system parameters. The reward for each task is the distance between the current state and the goal. For lifelong learning, tasks are encountered consecutively with repetition, and learning continues until each task has been encountered at least once. We use the same random task order across methods to ensure a fair comparison. The learner samples trajectories of 100 steps, and the learning process on each task is limited to 30 iterations. In MTL, all tasks are presented simultaneously. We use Natural Actor-Critic (NAC) as the base learner for the benchmark systems and episodic REINFORCE for the quadrotor. We chose k and the regularization parameters to optimize the combined performance of all methods on 20 held-out tasks in each domain and to balance the descriptor and policy terms. We evaluate the final policies and the learning curves on the 40 tasks, averaging results over 7 trials. The system parameters of each task serve as its descriptor features; we also tried several nonlinear transformations, but found that linear features already work well.

6.3 Results on the Benchmark Systems

Figure 1 compares our TaDeLL method for lifelong learning with task descriptors against: 1. PG-ELLA, which does not use task features; 2. GO-MTL, the batch MTL optimization of Equation 1; and 3. single-task learning with PG. For comparison, we also optimized Equation 7 in a batch MTL manner using alternating optimization, denoting the result TaDeMTL. The shaded regions in the figures indicate standard error.

We find that the task descriptors improve lifelong learning on every system, even surpassing GO-MTL, which trains its policies from experience, in the SM and BK domains.


Figure 1: Performance of multi-task (solid lines), lifelong (dashed lines), and single-task learning (dotted lines) on the benchmark dynamical systems. Figure 2: Runtime comparison.


Figure 3: Zero-shot transfer to new tasks. Panel (a) shows the initial "jumpstart" improvement in each domain; panels (b)-(d) show the zero-shot policies used as warm-start initializations for PG learning.

The difference between TaDeMTL and TaDeLL is almost negligible in all domains except CP, whose tasks are more complex, which demonstrates the effectiveness of our online optimization.

Figure 3 shows that the task descriptors are highly effective for zero-shot transfer to new tasks. In each domain, we measure zero-shot performance on 40 additional generated tasks and average the results over those tasks. Figure 3a shows that our approach improves initial performance on new tasks (i.e., "jumpstart"), outperforming both the method of Sinapov et al. and single-task PG, even though those approaches are allowed to train on the task. We attribute the poorer performance of Sinapov et al.'s method on CP to the fact that CP policies differ substantially from one another; since the source and target policies within this domain differ greatly, their algorithm cannot find a source policy that transfers well. In addition, that method is computationally expensive compared to ours: its cost grows quadratically with the number of tasks, whereas ours grows linearly, as shown in Figure 2. Details of the runtime experiments are given in the appendix. Figures 3b-3d show that the zero-shot policies are effective as warm-start initializations for PG learning, which then further improves the policies.

6.4 Quadrotor Application

We also apply our method to the more challenging quadrotor control domain, focusing on zero-shot transfer to new tasks. To ensure realistic dynamics, we use the model of Bouabdallah and Siegwart, which has been validated on physical systems. The quadrotors are characterized by three inertial constants and the arm length, and their state consists of roll, pitch, and yaw, along with their derivatives.


Figure 4: Warm-start learning on quadrotor control.

Figure 4 shows that TaDeLL can predict a controller for a new quadrotor through zero-shot learning with accuracy similar to PG, even though PG must train on the system. Used as an initialization, the TaDeLL zero-shot policy also provides an effective warm start for PG.

7. Conclusion

We proposed a method that uses coupled dictionaries to incorporate task descriptors into lifelong learning. Using descriptors improves the performance of the learned policies and allows us to predict policies for new tasks before observing any training data. Experiments on dynamical control problems show that our method outperforms comparable approaches while requiring less runtime.

Comments:

When humans assemble a new chair, they usually draw on their past assembly experience to complete the job. Similarly, when learning a control policy for a new task, we would like to draw on the experience of learning other tasks, that is, to transfer knowledge between tasks. Transfer between tasks helps improve learning performance, but it usually requires an accurate estimate of the relationships between tasks in order to identify the relevant information to transfer, and these estimates are generally based on the training data of each task. Lifelong learning, however, aims to quickly learn policies for a sequence of tasks using as little data as possible; in this situation, a method that relies on accurately estimating task relationships from data is impractical, because each task simply does not have that much training data. To this end, this paper uses task descriptors to model the relationships between tasks and a coupled dictionary optimization method to improve the learning of the successive task policies. Moreover, even when a new task has no training data at all, the method can still predict that task's policy.

Via IJCAI 2016

PS: This article was compiled by Lei Feng Network (search for the "Lei Feng Network" public account) and may not be reproduced without permission.
