[Solved] Q-learning run time problem

1493916656 · Posted 2023-6-3 21:28:47

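The code below uses a Q-learning model. The model is fairly simple and would normally finish in ten-odd minutes, but it has now been running for three hours without producing a result. What errors could be causing this, and how can I fix them? (Corrected code would be ideal, many thanks.)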
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Define the market price function
def market_price(x, y):
    return 80 - (x + y) - 10

# Define the Q-learning algorithm
def q_learning():
    num_episodes = 1000  # number of training episodes
    rewards = []  # total reward collected in each episode

    for episode in range(num_episodes):
        state = np.random.choice(states)  # pick a random initial state
        done = False
        total_reward = 0

        while not done:
            if np.random.uniform(0, 1) < epsilon:
                # Explore: choose a random action with probability epsilon
                action = np.random.uniform(0, 40)
            else:
                # Exploit: choose the best action according to the model's prediction
                action = model.predict(np.array([state]))[0][0]

            # Update the market price and compute the reward
            x = state
            y = action
            price = market_price(x, y)
            reward = price * x

            # Take the action and observe the new state and reward
            new_state = np.random.choice(states)
            new_reward = market_price(new_state, y) * new_state

            # Compute the target Q-value
            target = reward + gamma * model.predict(np.array([new_state]))[0][0]

            # Train the neural network on the sample (state, target)
            model.fit(np.array([state]), np.array([target]), verbose=0)

            state = new_state
            total_reward += reward

            if new_reward == 0:
                done = True

        rewards.append(total_reward)

    return rewards

# Run Q-learning training
rewards = q_learning()

# Plot the production choice and the profit
x_values = states
y_values = model.predict(x_values)
rewards = np.array(rewards)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x_values, y_values)
plt.xlabel('Production of A')
plt.ylabel('Production of B')
plt.title('Production Selection')

plt.subplot(1, 2, 2)
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Total Rewards')

plt.tight_layout()
plt.show()

Best answer

歌者文明清理员

2023-6-4 09:02:45

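This code has probably fallen into an infinite loop, which is why it never produces a result. In the main loop of the Q-learning algorithm, the only exit is new_reward == 0, which requires either new_state == 0 or a market price of exactly 0 (that is, new_state + y == 70). Because the new state is drawn at random, nothing guarantees that such a state is ever reached, so the program can loop forever.
To fix this, add a limit on the number of steps. If the limit is reached before a state with zero reward is found, force the loop to exit and treat that episode as failed.
The modified code is as follows: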
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Define the market price function
def market_price(x, y):
    return 80 - (x + y) - 10

# NOTE: states, epsilon and gamma are not defined in this listing;
# they must be defined before q_learning() is called.

# Define the Q-learning algorithm
def q_learning():
    num_episodes = 1000  # number of training episodes
    rewards = []  # total reward collected in each episode
    max_steps = 10000  # maximum number of steps per episode

    for episode in range(num_episodes):
        state = np.random.choice(states)  # pick a random initial state
        done = False
        total_reward = 0
        step = 0

        while not done and step < max_steps:
            if np.random.uniform(0, 1) < epsilon:
                # Explore: choose a random action with probability epsilon
                action = np.random.uniform(0, 40)
            else:
                # Exploit: choose the best action according to the model's prediction
                action = model.predict(np.array([state]))[0][0]

            # Update the market price and compute the reward
            x = state
            y = action
            price = market_price(x, y)
            reward = price * x

            # Take the action and observe the new state and reward
            new_state = np.random.choice(states)
            new_reward = market_price(new_state, y) * new_state

            # Compute the target Q-value
            target = reward + gamma * model.predict(np.array([new_state]))[0][0]

            # Train the neural network on the sample (state, target)
            model.fit(np.array([state]), np.array([target]), verbose=0)

            state = new_state
            total_reward += reward
            step += 1

            if new_reward == 0:
                done = True

        rewards.append(total_reward)

    return rewards

# Run Q-learning training
rewards = q_learning()

# Plot the production choice and the profit
x_values = states
y_values = model.predict(x_values)
rewards = np.array(rewards)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x_values, y_values)
plt.xlabel('Production of A')
plt.ylabel('Production of B')
plt.title('Production Selection')

plt.subplot(1, 2, 2)
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Total Rewards')

plt.tight_layout()
plt.show()
This adds the max_steps variable and the step counter; the loop exits once step reaches max_steps. Note that setting max_steps too small may hurt the algorithm's convergence.
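
To make the termination argument concrete, here is a minimal pure-Python sketch (no TensorFlow) of the exit test and of a step-capped loop. The helper names is_terminal and run_episode are hypothetical, and both state grids are assumptions for illustration, since the question never shows how states is defined. The episode uses random actions only and does no learning; it is meant to show when the loop can exit, not to reproduce the training.

```python
import random


def market_price(x, y):
    # Same price function as in the thread.
    return 80 - (x + y) - 10


def is_terminal(new_state, y):
    # The thread's exit test: the loop stops only when this product is exactly 0,
    # i.e. when new_state == 0 or new_state + y == 70.
    return market_price(new_state, y) * new_state == 0


def run_episode(states, max_steps, rng):
    # Hypothetical stand-in for one episode: random actions, no learning.
    # Returns (steps_taken, reached_terminal).
    for step in range(1, max_steps + 1):
        y = rng.uniform(0, 40)          # continuous action, as in the thread
        new_state = rng.choice(states)  # next state drawn at random
        if is_terminal(new_state, y):
            return step, True
    return max_steps, False


rng = random.Random(0)
# Assumed grids, for illustration only.
with_zero = list(range(0, 100, 5))     # contains state 0
without_zero = list(range(5, 100, 5))  # never contains state 0

print(run_episode(with_zero, 10000, rng))
print(run_episode(without_zero, 10000, rng))
```

With a continuous action, hitting new_state + y == 70 exactly is vanishingly unlikely, so in practice the loop ends only when state 0 is drawn. If the grid does not contain 0, the capped loop runs to max_steps, and the uncapped original would effectively never exit.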

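isdkz · Posted 2023-6-3 21:29:50

Possible causes include the following:

1. The model parameters are poorly chosen, which makes training too slow. Try adjusting them, for example by increasing batch_size or reducing the number of network layers.

2. The dataset is too large, which makes training too slow. Try shrinking the dataset or training on a sample of it.

3. Problems such as vanishing or exploding gradients occur during training, which makes it too slow. Try a different optimizer, or use techniques such as gradient clipping.

The modified code is as follows: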

# Define the Q-learning algorithm
def q_learning():
    num_episodes = 1000  # number of training episodes
    rewards = []  # total reward collected in each episode

    for episode in range(num_episodes):
        state = np.random.choice(states)  # pick a random initial state
        done = False
        total_reward = 0

        while not done:
            if np.random.uniform(0, 1) < epsilon:
                # Explore: choose a random action with probability epsilon
                action = np.random.uniform(0, 40)
            else:
                # Exploit: choose the best action according to the model's prediction
                action = model.predict(np.array([state]))[0][0]

            # Update the market price and compute the reward
            x = state
            y = action
            price = market_price(x, y)
            reward = price * x

            # Take the action and observe the new state and reward
            new_state = np.random.choice(states)
            new_reward = market_price(new_state, y) * new_state

            # Compute the target Q-value
            target = reward + gamma * model.predict(np.array([new_state]))[0][0]

            # Train the neural network on the sample (state, target)
            model.train_on_batch(np.array([state]), np.array([target]))

            state = new_state
            total_reward += reward

            if new_reward == 0:
                done = True

        rewards.append(total_reward)

    return rewards

# Run Q-learning training
rewards = q_learning()

# Plot the production choice and the profit
x_values = states
y_values = model.predict(x_values)
rewards = np.array(rewards)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x_values, y_values)
plt.xlabel('Production of A')
plt.ylabel('Production of B')
plt.title('Production Selection')

plt.subplot(1, 2, 2)
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Total Rewards')

plt.tight_layout()
plt.show()
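
The first suggestion above mentions a larger batch_size, yet the listing still trains on a single sample per step. One common way to get real batches is to store transitions and train on random mini-batches every few steps. The sketch below shows only the storage part in pure Python; ReplayBuffer and its methods are hypothetical names that do not appear in the thread, and how a batch is turned into arrays for the model is left open.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, new_state) transitions."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)  # oldest entries are dropped first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, new_state):
        self._items.append((state, action, reward, new_state))

    def __len__(self):
        return len(self._items)

    def sample(self, batch_size):
        # Draw without replacement; never ask for more than is stored.
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)


buffer = ReplayBuffer(capacity=1000, seed=0)
for step in range(50):
    buffer.add(state=step % 20, action=1.0, reward=0.0, new_state=(step + 1) % 20)

batch = buffer.sample(32)
states_batch = [t[0] for t in batch]  # column of states, ready to convert to an array
print(len(buffer), len(batch))
```

Training once per mini-batch instead of once per step cuts the number of framework calls, which is usually where the time goes for a network this small.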

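陶远航 · Posted 2023-6-3 21:32:18

This code takes a very long time to run, probably because there are too many training iterations or too many sampled states, which drives up the computation time. I suggest training with fewer sampled states and fewer episodes, and adding an early-stopping condition, such as stopping once the reward has stayed unchanged several times in a row.

Also, the code does not show how the states array is defined; make sure it is defined correctly.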
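
A rough count of model calls shows how the number of states and episodes drives the run time. The sketch below assumes that an episode ends only when state 0 is drawn uniformly from the grid (the exact price-zero case is negligible for continuous actions), so the episode length is geometric with mean equal to the number of states. The function names are hypothetical, and seconds_per_call is a placeholder to be measured, not a benchmark.

```python
def expected_steps_per_episode(num_states):
    # Assumes the episode ends only when state 0 is drawn, and that 0 is one of
    # num_states equally likely states: geometric distribution with mean num_states.
    return float(num_states)


def estimate_calls(num_states, num_episodes, calls_per_step=3):
    # calls_per_step=3 is the worst case per step in the thread's loop:
    # one predict for the action, one predict for the target, one fit.
    return expected_steps_per_episode(num_states) * num_episodes * calls_per_step


def estimate_seconds(num_states, num_episodes, seconds_per_call):
    # seconds_per_call is an assumption; measure it on your own machine.
    return estimate_calls(num_states, num_episodes) * seconds_per_call


# 1000 episodes as in the question, with an assumed grid of 20 states.
print(estimate_calls(num_states=20, num_episodes=1000))
# Minutes of run time at an assumed 50 ms per call.
print(estimate_seconds(20, 1000, 0.05) / 60)
```

Under these assumptions the cost scales with num_states * num_episodes, so cutting the episodes from 1000 to 100 cuts the estimate tenfold. If the grid does not contain 0 at all, the estimate does not apply because the episode never ends.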

Suggested changes (for reference only):

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Define the neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(1)
])
# Compile the model
model.compile(optimizer='adam', loss='mse')

# Define the market price function
def market_price(x, y):
    return 80 - (x + y) - 10

# Define the Q-learning algorithm
def q_learning():
    states = np.arange(0, 100, 5)
    num_episodes = 100  # number of training episodes
    rewards = []  # total reward collected in each episode
    consecutive_no_gain = 0  # consecutive episodes with an unchanged reward
    for episode in range(num_episodes):
        state = np.random.choice(states)  # pick a random initial state
        done = False
        total_reward = 0
        while not done:
            if np.random.uniform(0, 1) < epsilon:
                # Explore: choose a random action with probability epsilon
                action = np.random.uniform(0, 40)
            else:
                # Exploit: choose the best action according to the model's prediction
                action = model.predict(np.array([state]))[0][0]
            # Update the market price and compute the reward
            x = state
            y = action
            price = market_price(x, y)
            reward = price * x
            # Take the action and observe the new state and reward
            new_state = np.random.choice(states)
            new_reward = market_price(new_state, y) * new_state
            # Compute the target Q-value
            target = reward + gamma * model.predict(np.array([new_state]))[0][0]
            # Train the neural network on the sample (state, target)
            model.fit(np.array([state]), np.array([target]), verbose=0)
            state = new_state
            total_reward += reward
            if new_reward == 0 or consecutive_no_gain > 10:
                done = True
        # rewards is empty during the first episode, so guard the rewards[-1] lookup
        if rewards and total_reward == rewards[-1]:
            consecutive_no_gain += 1
        else:
            consecutive_no_gain = 0
        rewards.append(total_reward)
    return rewards

# Define the hyperparameters
epsilon = 0.1
gamma = 0.9

# Run Q-learning training
rewards = q_learning()

# Plot the production choice and the profit
x_values = np.arange(0, 100, 5)
y_values = model.predict(x_values)
rewards = np.array(rewards)
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x_values, y_values)
plt.xlabel('Production of A')
plt.ylabel('Production of B')
plt.title('Production Selection')
plt.subplot(1, 2, 2)
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Total Rewards')
plt.tight_layout()
plt.show()

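Here the number of sampled states is reduced to 20, the number of episodes is reduced to 100, and a stopping condition is added. These hyperparameters can be adjusted to suit the task.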
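
The stopping condition can also be pulled out into a small helper, which makes it easy to check on its own. The sketch below is one possible shape for it, not the code used above: NoChangeStopper is a hypothetical name, and the tolerance parameter is an added assumption, since floating-point rewards rarely compare exactly equal.

```python
class NoChangeStopper:
    """Signals a stop once the episode reward has not changed for more than `patience` updates."""

    def __init__(self, patience=10, tolerance=1e-9):
        self.patience = patience
        self.tolerance = tolerance  # assumed slack for comparing float rewards
        self.last_reward = None
        self.unchanged = 0

    def update(self, total_reward):
        # Returns True when training should stop.
        same = (
            self.last_reward is not None
            and abs(total_reward - self.last_reward) <= self.tolerance
        )
        self.unchanged = self.unchanged + 1 if same else 0
        self.last_reward = total_reward
        return self.unchanged > self.patience


stopper = NoChangeStopper(patience=3)
history = [5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0]  # made-up episode rewards
stopped_at = None
for episode, reward in enumerate(history):
    if stopper.update(reward):
        stopped_at = episode
        break
print(stopped_at)
```

Calling update once per episode and breaking out of the episode loop when it returns True stops the whole training run, which is what the prose above describes.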

cmu052115 · Posted 2023-6-3 21:55:37


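cmu052115 · Posted 2023-6-8 09:46:39

歌者文明清理员 posted 2023-6-4 09:02
This code has probably fallen into an infinite loop, which is why it never produces a result. In the main loop of the Q-learning algorithm, the loop only exits when the market price reward is 0 ...

I was just livening things up and got points deducted for it, what is going on!!!

歌者文明清理员 · Posted 2023-6-8 16:52:23

cmu052115 posted 2023-6-8 09:46
I was just livening things up and got points deducted for it, what is going on!!!

Filler post.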
